Using Named Entity Recognition to Categorize Text Data

Unstructured text content is rich with information, but it’s not always easy to find what’s relevant to you.

With the enormous amount of data that comes from social media, email, blogs, news and academic articles, it becomes increasingly hard to extract, categorize, and learn from that information.

While there are many algorithms that help with the discovery process, such as topic modeling or natural language processing algorithms, sometimes you want to categorize data in a more structured way.

Named Entity Recognition is an algorithm that extracts information from unstructured text data and categorizes it into groups. For example, if there’s a mention of “San Diego” in your data, named entity recognition would classify that as “Location.”

Algorithmia has two named entity recognition algorithms: one is an implementation of Stanford CoreNLP, and the other of Apache OpenNLP.

Both algorithms are accessible as API endpoints for seamless integration with your application or data science pipeline. You can call either one with a few lines of code and take advantage of our scalable serverless architecture.

What is Named Entity Recognition?

Named Entity Recognition is a form of text mining that sifts through unstructured text data and locates noun phrases called named entities. Named entities can then be organized under predefined categories, such as “person,” “organization,” “location,” “number,” or “duration.”

Named entity recognition supports a wide variety of use cases. One is making information easier to locate: named entities are first located and categorized under different labels, and the data is then aggregated under those labels for easy retrieval. For instance, job listing data could have categories like “organization” or “location,” and users could then search by those specific categories.
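
The aggregation step described above can be sketched in a few lines of plain Scala. This is a minimal illustration, not part of either algorithm: the EntityIndex object, the byLabel helper, and the sample job listing entities are all hypothetical, and the sketch assumes the tagged entities have already been collected as (entity, label) pairs.

```scala
object EntityIndex {
  // Build an index from each label to the distinct entities found under it.
  // Input pairs are (entity, label); this shape is an assumption for illustration.
  def byLabel(tagged: Seq[(String, String)]): Map[String, Seq[String]] =
    tagged.groupBy(_._2).map { case (label, pairs) =>
      label -> pairs.map(_._1).distinct
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical entities extracted from job listings.
    val tagged = Seq(
      ("Acme Corp", "ORGANIZATION"),
      ("San Diego", "LOCATION"),
      ("Globex", "ORGANIZATION"),
      ("San Diego", "LOCATION")
    )
    val index = byLabel(tagged)
    // Look up everything filed under one category.
    println(index.getOrElse("LOCATION", Seq.empty))
  }
}
```

Once the entities are indexed this way, a search by category becomes a simple map lookup.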

Another use of named entity recognition is as a first step in the information retrieval process. After named entities are discovered and categorized, patterns between the different entities can be found. For example, the category “Location” with the entity “San Diego” and the category “Organization” with the entity “Democratic” could be used to discover demographic information about people’s association with political parties.
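
One simple way to look for such patterns is to count how often a location and an organization are mentioned in the same document. The sketch below is only an illustration under stated assumptions: the EntityCooccurrence object and locationOrgCounts helper are hypothetical names, each document is assumed to be already reduced to (entity, label) pairs, and co-occurrence counting is just one of many possible pattern-finding techniques.

```scala
object EntityCooccurrence {
  // (entity, label) pairs for a single document; an assumed shape for illustration.
  type Tagged = Seq[(String, String)]

  // Count how many documents mention each (location, organization) pair together.
  def locationOrgCounts(docs: Seq[Tagged]): Map[(String, String), Int] = {
    val pairs = for {
      doc <- docs
      // distinct keeps repeated mentions in one document from inflating the count
      loc <- doc.collect { case (entity, "LOCATION") => entity }.distinct
      org <- doc.collect { case (entity, "ORGANIZATION") => entity }.distinct
    } yield (loc, org)
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }
}
```

Pairs with high counts are candidates for a closer look; the counts alone do not show that the two entities are actually related.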

Why You Need Named Entity Recognition

You shouldn’t have to spend all your time setting up complicated environments or worrying about installing and managing dependencies to get a library to work.

By wrapping these libraries as API endpoints, we give developers access to named entity recognition algorithms that perform well at scale. Using our implementations of Stanford CoreNLP and Apache OpenNLP saves developers time and spares them from worrying about server maintenance or downtime.

Algorithmia has support for Ruby, Rust, Python, JavaScript, Scala, Java, and R, so it’s easy to extract entities in the language of your choice, even in real time.

How to Categorize Data Using Named Entity Recognition

This example shows how to use the Stanford CoreNLP named entity recognition algorithm with Scala, but you could call it using any of our supported clients.

To get started using the algorithm, you’ll need a free API key from Algorithmia.

Sample API Call:

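// Algorithmia client classes; the client library must be on the classpath
import com.algorithmia._
import com.algorithmia.algo._

// The extra quotes inside the triple-quoted literal make the value a JSON string
val input = """"Incredible, Chicago city officials estimate that there's a record 5 million people at Cubs parade""""

// Replace "your_api_key" with the API key from your Algorithmia account
val client = Algorithmia.client("your_api_key")
val algo = client.algo("algo://StanfordNLP/NamedEntityRecognition/0.1.1")

// Send the JSON input to the algorithm and collect its response
val result = algo.pipeJson(input)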


Sample Output:

     ["Chicago", "LOCATION"],
     ["5", "NUMBER"],
     ["million", "NUMBER"],
     ["Cubs", "ORGANIZATION"]

And there you go: your entities are categorized in only a few lines of code, without having to install dependencies or be responsible for the environment the algorithm runs in. There’s no need to provision servers or worry about scalability, since we take care of all that for you.
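
As a possible next step, note that the sample output above tags “5” and “million” as two separate NUMBER entries. The sketch below merges neighbouring entries that share a label into a single entity. It is a hypothetical post-processing helper (EntityMerge and mergeAdjacent are not part of the algorithm or the client), and it assumes that consecutive entries with the same label were adjacent in the original text, which the output alone does not guarantee.

```scala
object EntityMerge {
  // Merge consecutive (text, label) entries that share a label.
  // Assumption: such entries were next to each other in the source sentence.
  def mergeAdjacent(tagged: Seq[(String, String)]): Seq[(String, String)] =
    tagged.foldLeft(Vector.empty[(String, String)]) { (acc, entry) =>
      acc.lastOption match {
        // Same label as the previous entry: extend the previous entity's text
        case Some((prevText, prevLabel)) if prevLabel == entry._2 =>
          acc.init :+ ((s"$prevText ${entry._1}", prevLabel))
        // Different label, or first entry: start a new entity
        case _ => acc :+ entry
      }
    }

  def main(args: Array[String]): Unit = {
    // Pairs copied from the sample output above.
    val tagged = Seq(
      ("Chicago", "LOCATION"),
      ("5", "NUMBER"),
      ("million", "NUMBER"),
      ("Cubs", "ORGANIZATION")
    )
    mergeAdjacent(tagged).foreach(println)
  }
}
```

With the sample pairs, “5” and “million” are combined into one NUMBER entity, while “Chicago” and “Cubs” are left unchanged.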

Let us know what you think @Algorithmia.