Algorithmia Blog

How to Retrieve Tweets By Keyword and Identify Named Entities

Last week we introduced the named entity recognition algorithm for extracting and categorizing unstructured text.

In this post we’ll show you how to get data from Twitter, clean it with some regex, and run it through named entity recognition. With the algorithm’s output, we can group the data by the category assigned to each named entity and then extract the categories we’re interested in.

Despite its short character count, every tweet contains a wealth of information. It also contains a ton of noise. This noise comes in the form of hashtags, URLs, and other non-alphanumeric characters that obscure important data.

In order to use algorithms like sentiment analysis, LDA, or even named entity recognition, we need to clean up our data and strip away these artifacts.

Since Thanksgiving is upon us, we’ll query for holiday-related tweets, clean them, and run them through the named entity recognition algorithm. We’ll look at the “Location” and “Organization” categories to identify trends about where people are going and which events they’re attending.

Prerequisites:

For this recipe you’ll need a dataset that contains tweets retrieved from the Twitter Search API.

We’ll be using the Retrieve Tweets with Keyword algorithm to grab tweets, strip out the hashtags and other non-alphanumeric entities, and run them through the Named Entity Recognition algorithm to categorize entities as “Organization,” “Person,” and more. You could also do this via the Twitter Search API.

We are getting tweets from all users over the last seven days, without regard to user ID or any other identifying information. Our goal is only to extract and classify entities, and then look for patterns in the data.

Step 1: Install the Algorithmia Client

This tutorial is in Python, but it could be built using any of the supported clients, like JavaScript, Ruby, Java, etc. See the Python client guide for more information on using the Algorithmia API.

Install the Algorithmia client from PyPi:

pip install algorithmia

You’ll also need a free Algorithmia account, which includes 5,000 free credits a month – more than enough to get started with crawling, extracting, and analyzing web data.

Sign up here, and then grab your API key.

Step 2: Get Your Tweets

Using the Retrieve Tweets with Keyword algorithm, we’ll now search Twitter for the keyword “Thanksgiving.” Of course, you can change this to any keyword you want.

import re
from collections import defaultdict, Counter

import Algorithmia

client = Algorithmia.client("your_algorithmia_api_key")
def pull_tweets():
    """Search Twitter for our keyword and return a list of tweet texts."""
    input = {
        "query": "Thanksgiving",
        "numTweets": "700",
        "auth": {
            "app_key": 'your_consumer_key',
            "app_secret": 'your_consumer_secret_key',
            "oauth_token": 'your_access_token',
            "oauth_token_secret": 'your_access_token_secret'
        }
    }
    
    twitter_algo = client.algo("twitter/RetrieveTweetsWithKeyword/0.1.3")
    result = twitter_algo.pipe(input).result
    tweet_list = [tweets['text'] for tweets in result]
    return tweet_list

This algorithm takes an input that contains the query, the number of tweets you want to search for, and your Twitter API tokens. The query is simply the text you wish to search the Twitter API for. You’ll need to get your Twitter authentication keys and tokens here.

As you write the pull_tweets() function, pay attention to the line that sets the client variable to Algorithmia.client("your_algorithmia_api_key"). This is where you pass in your Algorithmia API key (located on the My Profile page in the Credentials section).
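If you’d rather not hard-code the key in the script, one common option is to read it from an environment variable. Here’s a minimal sketch; the ALGORITHMIA_API_KEY variable name is our own choice for this example, not something the client requires:

import os

import Algorithmia

# Read the key from an environment variable we defined ourselves, e.g.
# export ALGORITHMIA_API_KEY="your_algorithmia_api_key"
api_key = os.getenv("ALGORITHMIA_API_KEY")
if not api_key:
    raise RuntimeError("Set the ALGORITHMIA_API_KEY environment variable first.")

client = Algorithmia.client(api_key)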

The twitter_algo variable holds the algorithm we’re using. Each algorithm’s description page gives you the appropriate name and version to use. In our case it is:

"twitter/RetrieveTweetsWithKeyword/0.1.3"

And last, the list comprehension builds tweet_list, which holds the text of each tweet we got back from twitter_algo.pipe(input).result.

While this algorithm returns other data about each tweet, such as the number of retweets and number of followers, we are only interested in the text field from the results.
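If you want to be defensive about the result format, a small variation on the list comprehension above skips any record that happens to lack a text field (we’re assuming such records could show up; in practice the algorithm may always include one):

def extract_texts(result):
    """Return the tweet text from each result record, skipping any without a 'text' field."""
    return [tweet['text'] for tweet in result if 'text' in tweet]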

Step 3: Clean Your Data

def process_text():
    """Remove emoticons, numbers, etc. and return the cleaned tweets joined into one string."""
    data = pull_tweets()
    # Strip @mentions, non-alphanumeric characters, URLs, and leading "RT" markers.
    regex_remove = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^RT|http.+?"
    stripped_text = [
        re.sub(regex_remove, '', tweets).strip() for tweets in data
    ]
    return '. '.join(stripped_text)

In the process_text() function we remove emoticons, numbers, URLs, and more using regex, which gives us clean tweet text. Also, notice that we join the tweets using a period, which turns each tweet into a sentence. We can then pass one long string of sentences into our named entity recognition algorithm once, rather than calling the algorithm for each separate tweet.
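Here’s a quick illustration of what the cleaning does to a single tweet. The sample text is made up purely for demonstration:

import re

regex_remove = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^RT|http.+?"
sample = "RT @foodie: Thanksgiving dinner at O'Hare!! https://t.co/abc123"
print(re.sub(regex_remove, '', sample).strip())
# Prints: Thanksgiving dinner at OHare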

Step 4: Get Your Named Entities and Categories

def get_ner():
    """Get named entities from the NER algorithm using clean tweet data."""
    data = process_text()
    ner_algo = client.algo(
        'StanfordNLP/NamedEntityRecognition/0.1.1').set_options(timeout=600)
    ner_result = ner_algo.pipe(data).result
    return ner_result

This returns something like:

[['Shirley', 'PERSON'], ['Caesar', 'PERSON'], ['Thanksgiving', 'DATE']], [['19', 'NUMBER']]

The function get_ner() calls the process_text() function, which cleans our tweets and passes the data into our named entity recognition algorithm using the .pipe() method.

You will want to set a timeout of about 10 minutes using the set_options method. While this algorithm won’t take that long to run, it will save us from getting an error if it runs over the default five-minute timeout. For more information about set_options and other options, check out the Data API docs.
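If you would also like the script to fail gracefully when the call errors out or still times out, one option is a thin wrapper like the sketch below. The get_ner_safe name is ours, and we catch a broad Exception rather than a specific client error class:

def get_ner_safe():
    """Call get_ner(), returning an empty list instead of crashing if the call fails."""
    try:
        return get_ner()
    except Exception as err:
        print("Named entity recognition call failed: {}".format(err))
        return []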

Step 5: Find the Most Popular Entities

We’re going to look at the “Location” and “Organization” categories and see if there are any popular locations or events happening around the United States.

def group_data():
    data = get_ner()
    default_dict = defaultdict(list)
    # Keep only entities tagged LOCATION or ORGANIZATION, grouped by tag.
    for items in data:
        for k, v in items:
            if 'LOCATION' in v or 'ORGANIZATION' in v:
                default_dict[v].append(k)

    ner_list = [{keys: Counter(values)}
                for (keys, values) in default_dict.items()]
    return ner_list


if __name__ == '__main__':
    group_data()

In the code above, we’ve created the group_data() function, where we call the previously created get_ner() function and assign the result to the data variable. Next, we iterate over all the named entities, extract only the ones in the “Location” and “Organization” categories, and group them by category while counting how often each entity appears.
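To make the grouping concrete, here’s a tiny, self-contained run of the same defaultdict-plus-Counter pattern on a hand-made NER result (the sample entities below are invented for illustration):

from collections import defaultdict, Counter

sample_ner = [
    [['Chicago', 'LOCATION'], ['Thanksgiving', 'DATE']],
    [['Chicago', 'LOCATION'], ['Macys', 'ORGANIZATION']],
]

grouped = defaultdict(list)
for items in sample_ner:
    for k, v in items:
        if 'LOCATION' in v or 'ORGANIZATION' in v:
            grouped[v].append(k)

print([{key: Counter(values)} for key, values in grouped.items()])
# Roughly: [{'LOCATION': Counter({'Chicago': 2})}, {'ORGANIZATION': Counter({'Macys': 1})}]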

Running group_data() on the real tweet data gives you the final list of dictionaries with the “Location” and “Organization” results:

[
   {'LOCATION': Counter({
      'America': 8,
      'Turkey': 6,
      'Chicago': 4,
      'OHare': 3,
      'International': 2,
      'Airport': 2,
      'USA': 2,
      'Rock': 1,
      'Hillsdale': 1
   })},
   {'ORGANIZATION': Counter({
      'Nations': 6,
      'First': 6,
      'Services': 2,
      'TV': 2,
      'Oils': 1,
      'Soup': 1,
      'Zenit': 1,
      'Northwest': 1,
      'Philip': 1
   })}
]

And that’s it! Of course, you can take this further and look at other categories, or create a data visualization with maps to show where the organizations tweeting about Thanksgiving are located, or any other combination you can think of.
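For example, a quick way to surface the top entities in each category is Counter’s most_common() method. A minimal sketch, assuming you have the ner_list returned by group_data():

for group in group_data():
    for category, counts in group.items():
        # Print the five most frequent entities in each category.
        print(category, counts.most_common(5))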

For ease of reference, here is the whole code snippet so you can run the script yourself!

Tools Used:

Get this recipe and more in the Algorithmia sample apps repo on GitHub.

import re
from collections import defaultdict, Counter

import Algorithmia
client = Algorithmia.client("your_algorithmia_api_key")

def pull_tweets():
    """Search Twitter for our keyword and return a list of tweet texts."""
    input = {
        "query": "Thanksgiving",
        "numTweets": "700",
        "auth": {
            "app_key": 'your_consumer_key',
            "app_secret": 'your_consumer_secret_key',
            "oauth_token": 'your_access_token',
            "oauth_token_secret": 'your_access_token_secret'
        }
    }
    
    twitter_algo = client.algo("twitter/RetrieveTweetsWithKeyword/0.1.3")
    result = twitter_algo.pipe(input).result
    tweet_list = [tweets['text'] for tweets in result]
    return tweet_list

def process_text():
    """Remove emoticons, numbers, etc. and return the cleaned tweets joined into one string."""
    data = pull_tweets()
    # Strip @mentions, non-alphanumeric characters, URLs, and leading "RT" markers.
    regex_remove = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^RT|http.+?"
    stripped_text = [
        re.sub(regex_remove, '', tweets).strip() for tweets in data
    ]
    return '. '.join(stripped_text)


def get_ner():
    """Get named entities from the NER algorithm using clean tweet data."""
    data = process_text()
    ner_algo = client.algo(
        'StanfordNLP/NamedEntityRecognition/0.1.1').set_options(timeout=600)
    ner_result = ner_algo.pipe(data).result
    return ner_result


def group_data():
    data = get_ner()
    default_dict = defaultdict(list)
    for items in data:
        for k, v in items:
            if 'LOCATION' in v or 'ORGANIZATION' in v:
                default_dict[v].append(k)

    ner_list = [{keys: Counter(values)}
                for (keys, values) in default_dict.items()]
    return ner_list


if __name__ == '__main__':
    group_data()