
Use LDA to Classify Text Documents

The LDA microservice is a quick and useful implementation of MALLET, a Java-based machine learning toolkit for language processing. This topic modeling package automatically finds the relevant topics in unstructured text data.

The Algorithmia implementation makes LDA available as a REST API, and removes the need to install multiple packages, manage servers, or deal with dependencies. The microservice accepts strings, files, and URLs as input, and lets you include a stop word list as an argument.

Want to understand the topic frequency of a document? Try LDA Mapper, a microservice that provides the topic distribution in a document. Combined, LDA and LDA Mapper help you find hidden topics and trends in unstructured text.

What is LDA?

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative, bag-of-words topic model that automatically discovers topics in text documents. The model treats each document (an observed collection of words) as a mixture of topics, and each word in the document as belonging to one of those topics. The algorithm was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.

For example, when classifying newspaper articles, Story A may contain a topic with the words "economic," "downturn," "Wall Street," and "forecasted." It'd be reasonable to assume that Story A is about Business. Story B, on the other hand, may return a topic with the words "movie," "rated," "enjoyed," and "recommend." Story B is clearly about Entertainment.

LDA works by calculating the probability that a word belongs to a topic. For instance, in Story B, the word “movie” would have a higher probability than the word “rated.” This makes intuitive sense, because “movie” is more closely related to the topic Entertainment than the word “rated.”
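
The Algorithmia microservice is backed by MALLET, but if you want to see these word-topic probabilities in action locally, here's a minimal sketch using scikit-learn's LatentDirichletAllocation on a toy corpus (the corpus, parameters, and library choice are our own illustration, not how the microservice works internally):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two business-flavored and two entertainment-flavored documents
docs = [
    "economic downturn forecasted on Wall Street",
    "markets slide as the economic forecast worsens",
    "enjoyed the movie and would recommend it",
    "the movie was rated highly and widely enjoyed",
]

# Convert the documents into a bag-of-words count matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the five highest-weighted words for each topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {top}")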

For more information about LDA, see its Wikipedia page or Edwin Chen's blog.

Why Use LDA?

LDA is useful when you have a set of documents and want to discover the patterns within them, without knowing anything about the documents in advance. LDA can be used to generate topics that capture a document's general theme, and is often used in recommendation systems, document classification, data exploration, and document summarization. Additionally, the topic distributions LDA produces can be used as features for training predictive models, such as linear regression, as sketched below.
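
To illustrate that last point, here's a minimal sketch (our own example, not part of the Algorithmia API) that treats per-document topic proportions as a feature matrix and fits a scikit-learn linear regression against a hypothetical target, such as a click-through rate:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical topic proportions for four documents (each row sums to 1),
# e.g. as produced by a topic-mapping step after LDA
topic_features = np.array([
    [0.70, 0.10, 0.20],
    [0.05, 0.80, 0.15],
    [0.25, 0.25, 0.50],
    [0.60, 0.30, 0.10],
])

# Hypothetical target values, e.g. click-through rate per document
targets = np.array([0.12, 0.45, 0.30, 0.18])

# Fit a simple linear regression on the topic proportions
model = LinearRegression()
model.fit(topic_features, targets)
print(model.coef_, model.intercept_)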

How To Use LDA

The algorithm takes an object with an array of strings. As part of the API call, you can specify a mode to balance speed versus quality. See the LDA algorithm page for more information about input options.

Sample Input

{
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA API Call

import Algorithmia
client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDA/1.0.0')
# 'input' is the dict shown in the Sample Input section above
result = algo.pipe(input).result

Sample Output

[
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
]

In the above example, we used some tweets from @Algorithmia. A few patterns emerge from the documents; with more documents, the topics would be more clearly defined.

Combine LDA With LDA Mapper

The LDA Mapper algorithm can map the group of topics generated by LDA to the list of documents. This tells you the distribution of topics across the documents (e.g. Story A is made up of 35% Business, 25% Entertainment, and 40% Science topics).

Sample Input

{
  "topics": [
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
],
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA Mapper API Call

import Algorithmia
client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDAMapper/0.1.0')
# 'input' is the dict shown in the Sample Input section above
result = algo.pipe(input).result

Sample Output

The LDA Mapper algorithm outputs each document along with the influence of each topic on that document.

{
  "topic_distribution": [
    {
      "doc": "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
      "freq": {"0": 0.5, "1": 0, "2": 0, "3": 0.5}
    },
    {
      "doc": "Paddle: Baidu's open source deep learning framework",
      "freq": {"0": 0, "1": 0.5, "2": 0.5, "3": 0}
    },
    {
      "doc": "An overview of gradient descent optimization algorithms",
      "freq": {"0": 0, "1": 0, "2": 0.5, "3": 0.5}
    },
    {
      "doc": "Create a Chatbot for Telegram in Python to Summarize Text",
      "freq": {"0": null, "1": null, "2": null, "3": null}
    },
    {
      "doc": "Image super-resolution through deep learning",
      "freq": {"0": 0.25, "1": 0.25, "2": 0.25, "3": 0.25}
    },
    {
      "doc": "World's first self-driving taxis debut in Singapore",
      "freq": {
        "0": 0.3333333333333333,
        "1": 0,
        "2": 0.3333333333333333,
        "3": 0.3333333333333333
      }
    },
    {
      "doc": "Minds and machines: The art of forecasting in the age of artificial intelligence",
      "freq": {
        "0": 0.5714285714285716,
        "1": 0,
        "2": 0.1428571428571429,
        "3": 0.2857142857142858
      }
    }
  ]
}
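
To act on this output programmatically, here's a minimal sketch that picks the dominant topic for each document, assuming mapper_result holds the parsed output shown above (for example, the value of algo.pipe(input).result from the call earlier):

# 'mapper_result' is assumed to be the parsed LDA Mapper output shown above
for entry in mapper_result["topic_distribution"]:
    freq = entry["freq"]
    # Skip documents whose topic weights came back as null
    if any(weight is None for weight in freq.values()):
        print(f"{entry['doc'][:40]}... -> no dominant topic")
        continue
    # Pick the topic index with the highest proportion
    dominant = max(freq, key=freq.get)
    print(f"{entry['doc'][:40]}... -> topic {dominant} ({freq[dominant]:.0%})")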

Here’s a sample recipe for calling LDA, and passing the results to the LDA Mapper algorithm:

import Algorithmia

client = Algorithmia.client('your_algorithmia_api_key')

docslist = [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
]

# The input LDA requires: an object containing a list of documents
lda_input = {
    "docsList": docslist
}

# LDA algorithm: https://algorithmia.com/algorithms/nlp/LDA
lda = client.algo('nlp/LDA/1.0.0')
# Returns a list of dictionaries of trends
result = lda.pipe(lda_input).result

# LDA Mapping algorithm: https://algorithmia.com/algorithms/nlp/LDAMapper
lda_mapper = client.algo('nlp/LDAMapper/0.1.1')

# LDA Mapper input using the LDA algorithm's result as 'topics' value
lda_mapper_input = {
    "topics": result,
    "docsList": docslist
}

# Prints out the result from calling the LDA Mapper algo
print(lda_mapper.pipe(lda_mapper_input).result)

By using these two algorithms together, we're able to go from raw documents to a per-document topic distribution in a single short script. If we wanted to, we could include a pre-processing algorithm in this pipeline, like Porter Stemmer (sketched below), and then use sentiment analysis to discover the sentiment of the documents as they pertain to each topic.
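
As one possible pre-processing step, here's a minimal sketch that stems each document locally with NLTK's PorterStemmer before building the LDA input (we use NLTK here purely as an illustration; it is not the Algorithmia Porter Stemmer algorithm):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stem every word in every document before handing the list to LDA
stemmed_docs = [
    " ".join(stemmer.stem(word) for word in doc.split())
    for doc in docslist
]

lda_input = {"docsList": stemmed_docs}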

Have fun exploring other algorithms on Algorithmia, and happy coding!

Algorithms Used: