Links in this post require access to Algorithmia.com private beta – you can obtain access by just following this link.
Inspired by @sampullara’s tweet we started thinking about how we could create an N-gram trainer and text generator directly in Algorithmia. It also happened to be almost Valentine’s day, so we wondered if we could apply the same principles to automatically generating love letters.
Although it is clear that we probably wouldn’t be fooling our valentines with the fake letters it was still a fun exercise.
The task would be split up in three parts:
- generate the corpus of text
- train a model
- query the model for automatically generated text.
N-gram models are probabilistic models that assign probabilities on the “next” word in a sequence, given the n-1 previous words. The power of n-gram models are limited in natural language processing due to the fact that they cannot model deep dependencies, that is, the nth word only depends on the n-1 previous words; but they perform well on simple statistical demonstrations.
Love as a Service?
A quick internet search on love letters provided us with enough of a corpus to extract some trigrams. Algorithmia already had a convenient way to extract text from given URL’s, and this is our data source for our love letter generator. Now we are ready to feed in our love corpus into the trigram generator.
The extraction of trigrams is done with an algorithm that generates trigram frequencies. This algorithm takes in an array of Strings (the love letters in our corpus), a beginning and an end token, and a Data collection URL to write the final trigram frequencies. The beginning and end tokens are necessary for us to be able to generate sentences. There are word sequences that are fit to start a sentence and those that are fit to end a sentence. To be able to discern these, we use beginning and end tokens that are unique and they do not show up in the text, so instead of hard-coding them, we take them as inputs.
N-grams explained visually:
After the small preprocessing step, we can go through the text and generate the frequencies in which they appear. This step can be referred to as the “sliding window” step where we go through each three-word (or word-token) combinations and keep recording the frequencies. The implementation details can be seen in the GenerateTrigramFrequencies algorithm. We record the output as the three words followed by the total frequency in a file.
Now we can pass the trigram model that we obtained to the RandomTextFromTrigram algorithm. If we think of the trigram groups as possible paths to be taken down a graph, randomly choosing one of the possible nodes limits our choices already. By going from the start of the graph to the end and choosing randomly from our possible “next words”, we generate random text based on the original corpus.
And last as @sampullara tweet originally suggested we applied the same process to collect tweets by a user. We added an option to specify gram-length for the tweet generator. This was done due to the fact that, if there are not enough alternatives with different frequencies, the probabilistic nature of the generation of text is not very visible at the end result. Since tweets are generally short and fairly unique, trigrams do not result in very interesting combinations.
Note:we limit the number of tweets to be retrieved from a user so Twitter API doesn’t rate limit us.
Want to try this yourself? All algorithms used for this blog post are online, available and open-source on Algorithmia. You can get private beta access by following this link.
- Zeynep and Diego
We here at Algorithmia are firm believers that no one tool can do it all – that’s why we are working hard to put the world’s algorithmic knowledge within everyone’s reach. Needless to say, that’s a work that will be in progress for awhile, but we’re well on the way to getting many of the most popular algorithms out there. Machine learning is one of our highest priorities, so we recently made available two of the most popular machine learning packages: Weka and Mahout.
Both these packages are notable for different reasons. Weka is a venerable, well-developed project that’s been around since the 80’s. It includes a huge suite of well-optimized machine learning and data analysis algorithms as well as various supporting routines that handle formatting, data transformation, and related tasks. Weka’s only real weakness is that its main package is not well optimized for memory intensive tasks on large datasets.
Mahout is almost completely the opposite, being something of a cocky newcomer. It’s a recent project designed specifically for Big Data. Its set of algorithms seems tiny compared to Weka, its documentation is spotty and getting it to work can be a real headache (unless of course you’re using it through Algorithmia). However, as our results below show, if you have much data to crunch, it may be the only game in town. Mahout is designed to scale using MapReduce and while integration of MapReduce into its algorithms is neither complete nor easy to use, even in the single machine case, Mahout shows evidence of being more capable of handling large volumes of data.
So, which is better?
As you probably figured out already, it depends.
There are any number of ways to answer this question, but we opted to compare them by picking a popular algorithm present in both, and comparing performance on a well known and non-trivial machine learning task. Specifically, we applied random forests to the MNIST handwritten digit recognition dataset.
The MNIST dataset is consists of handwritten digits. Note that we did work with a reduced subset of the dataset that has about 42000 images for training and 28000 for testing.
Random forests are an ensemble classification method that enjoys great popularity due to their conceptual simplicity, good performance (both speed and accuracy for many applications) and resistance to overfitting. You can read more about them here or, of course, on Wikipedia. Despite differences in the implementations, all parameters apart from the number of trees were kept constant for the purpose of a fair comparison.
A quick demo to showcase
For accuracy, Weka is the clear winner, achieving its best accuracy at 99.39% using 250 trees, compared with Mahout’s best: 95.89% using 100 trees. It’s also worth noting that the number of trees had little effect on Mahout’s classification accuracy, which stayed at between 95 and 96 percent for all forest sizes tried. Runtimes for both were comparable.
This shouldn’t be terribly surprising, as Weka has been around for longer and we would expect to be at least somewhat better optimized, especially in the relatively small datasets for which it was specifically designed.
However, bear in mind that this MNIST is easily small enough to fit one one machine. Weka’s superior accuracy won’t count for anything once your data outgrows this and tools like Mahout will be your only option.
The Moral of the Story
So, it’s just like we told you. When taking into account the whole scope of enterprise data problems, there is no clear winner or one-size-fits-all solution. If you have a little data and want to get the most out of it, use Weka. If you have a bunch of data, Mahout is your best (or perhaps only) choice, even if performance isn’t quite what you would like.
One of the key advantages of Algorithmia is that you can easily try both head to head against your data set with no setup.
– Austin & Zeynep
As developers one of our biggest “problems” is our voracious appetite for news. From Twitter to HackerNews to the latest funding on TechCrunch, it seems, at times, we cannot avoid the gravitational pull of our favorite news feeds. I know, at least for myself, this is engrained in my routine: wake up, check Twitter, check TechCrunch, check The Verge, etc. I spend at least the first 30 minutes of every day reading feeds based on title and repeat this a couple more times through the day.
I recently discovered SkimFeed, which I love and call my “dashboard into nerd-dom,” basically it is a single view of the major tech sites’ titles. However, I wanted more information on each article before I decided to click on one, so I thought: Why not use text analysis algorithms as a more efficient way of consuming my feeds?
Building my very own text analysis API
There are a number of things that can be done as part of a text analysis API but I decided to concentrate on four elements I believed I could build the fastest:
- Strip documents of unnecessary elements
- Advanced topic extraction
- Automatically summarize stories
- Sentiment analysis
As all of these algorithms are already available in the Algorithmia API, I could piece them together without having to worry about servers, scaling, or even implementing the actual algorithms:
- ScrapeRSS – Retrieves all the necessary elements from an RSS feed
- Html2Text – Strips HTML elements and keeps the important text
- AutoTag – Looks for keywords that represent the topic of any text submitted to it. (Latent Dirichlet Allocation)
- SentimentAnalysis – Analyzes sentences for positive, neutral or negative connotation. Uses the Stanford NLP library.
- Summarizer – Breaks content into sentences, and extract key sentences that represent the contents topic. Uses classifier4j to ingest a URL and summarize its main contents
Now, it was just a question of making them work together (and it took ~200 lines of code).
Check it out here:
The first thing I needed to do was retrieve all the necessary elements from the RSS feeds (ScrapeRSS). Once I had located the main content I could strip all the unnecessary HTML (HTML2Text). Now I had a nice clean body of text to start doing analysis on.
Creating topic tags is an easy way of understanding an article at a very quick glance, so I fed our clean body of text through AutoTag. Now I had the RSS title and some topics, next step was to summarize each link into 3 sentences max to complement the tags. Finally, and mostly for fun, I wanted to see if it’s true that most news is negative, so I added SentimentAnalysis.
You can check out some of the code here (or just view-source on our demo page in the browser).
That’s one way to use the Algorithmia library to build your own text analysis API to analyze almost any data stream in < 30 minutes. .
If you recall from our last blog post we showed you how we used some of the algorithms in Algorithmia to generate topic tags for any URL. Internally, we used the topic generation algorithm to generate tags based on the algorithm’s description and, later, use these tags as part of our recommender algorithm.
With today’s post we want to show you how easy it is to integrate these type of recommender algorithms (an algorithm already available in Algorithmia) into your own workflow by showing it in action with one of our favorite developer tools, GitHub.
As GitHub has grown to almost 50 million repositories since 2011 (per our calculations), discovering new repositories has become less and less straight forward when navigating outside the most popular or featured repositories. With this in mind, we thought of a way to make it easier to tackle some of this complexity using some of the algorithms available in Algorithmia.
So here you go, give us any GitHub repository (as long as it has a README) and we will recommend other repositories based on the information we extract from the README:
How we built this
- The first step was to figure out the URL for every repository in GitHub. This might seem like a daunting task but the folks over at Github have been nice enough to make their entire data set available on Google Big Query. You can head over there and generate a list of every public repository since 2011.
- The second step, was to check every public repository’s readme file and run it through Algorithmia’s topic analysis algorithm.
- Once the topic analysis algorithm returns a set of tags, we save these tags to start generating the data model we will later use for our recommender algorithm as a mapping from URL -> [tags].
- Repeat this a couple million times (not the easiest task, but we had a virtually unlimited sized cluster to parallelize this task thanks to the Algorithmia platform), and voila a tagged data model of the entire GitHub world.
- Finally, it’s time to start working with our recommender algorithm. To start, we need to point it to the data model we built and then send the algorithm the new set of tags that we generated from the URL you provided us with. With two inputs, the recommender algorithm returns a number of relevant repositories based on the tags automatically generated.
If you are interested in using any of these algorithms on your own site or application, we can help – these are all available in the Algorithmia API, just send us an email.
— Your friends at Algorithmia