Playing with n-grams

Links in this post require access to private beta – you can obtain access by just following this link. 

Inspired by @sampullara’s tweet we started thinking about how we could create an N-gram trainer and text generator directly in Algorithmia. It also happened to be almost Valentine’s day, so we wondered if we could apply the same principles to automatically generating love letters. 

Although it is clear that we probably wouldn’t be fooling our valentines with the fake letters it was still a fun exercise.

The task would be split up in three parts:

  1. generate the corpus of text 
  2. train a model 
  3. query the model for automatically generated text.  

N-gram models are probabilistic models that assign probabilities on the “next” word in a sequence, given the n-1 previous words. The power of n-gram models are limited in natural language processing due to the fact that they cannot model deep dependencies, that is, the nth word only depends on the n-1 previous words; but they perform well on simple statistical demonstrations.

Love as a Service?

A quick internet search on love letters provided us with enough of a corpus to extract some trigrams. Algorithmia already had a convenient way to extract text from given URL’s, and this is our data source for our love letter generator. Now we are ready to feed in our love corpus into the trigram generator.

The extraction of trigrams is done with  an algorithm that generates trigram frequencies. This algorithm takes in an array of Strings (the love letters in our corpus), a beginning and an end token, and a Data collection URL to write the final trigram frequencies. The beginning and end tokens are necessary for us to be able to generate sentences. There are word sequences that are fit to start a sentence and those that are fit to end a sentence. To be able to discern these, we use beginning and end tokens that are unique and they do not show up in the text, so instead of hard-coding them, we take them as inputs. 

N-grams explained visually:


After the small preprocessing step, we can go through the text and generate the frequencies in which they appear. This step can be referred to as the “sliding window” step where we go through each three-word (or word-token) combinations and keep recording the frequencies. The implementation details can be seen in the GenerateTrigramFrequencies  algorithm. We record the output as the three words followed by the total frequency in a file.

Now we can pass the trigram model that we obtained to the RandomTextFromTrigram algorithm. If we think of the trigram groups as possible paths to be taken down a graph, randomly choosing one of the possible nodes limits our choices already. By going from the start of the graph to the end and choosing randomly from our possible “next words”, we generate random text based on the original corpus.

And last as @sampullara tweet originally suggested we applied the same process to collect tweets by a user. We added an option to specify gram-length for the tweet generator. This was done due to the fact that, if there are not enough alternatives with different frequencies, the probabilistic nature of the generation of text is not very visible at the end result. Since tweets are generally short and fairly unique, trigrams do not result in very interesting combinations.

Note:we limit the number of tweets to be retrieved from a user so Twitter API doesn’t rate limit us. 

Want to try this yourself? All algorithms used for this blog post are online, available and open-source on Algorithmia. You can get private beta access by following this link.

  • Zeynep and Diego

Product manager at Algorithmia helping to give developers super powers.

More Posts - Website

Follow Me: