
Machine Learning Showdown: Apache Mahout vs Weka

We here at Algorithmia are firm believers that no one tool can do it all – that’s why we are working hard to put the world’s algorithmic knowledge within everyone’s reach. Needless to say, that’s work that will be in progress for a while, but we’re well on the way to getting many of the most popular algorithms out there. Machine learning is one of our highest priorities, so we recently made available two of the most popular machine learning packages: Weka and Mahout.

Both of these packages are notable for different reasons. Weka is a venerable, well-developed project that has been around since the early ’90s. It includes a huge suite of well-optimized machine learning and data analysis algorithms, as well as various supporting routines that handle formatting, data transformation, and related tasks. Weka’s only real weakness is that its main package is not well optimized for memory-intensive tasks on large datasets.

Mahout is almost completely the opposite, being something of a cocky newcomer. It’s a recent project designed specifically for Big Data. Its set of algorithms seems tiny compared to Weka’s, its documentation is spotty, and getting it to work can be a real headache (unless, of course, you’re using it through Algorithmia). However, as our results below show, if you have a lot of data to crunch, it may be the only game in town. Mahout is designed to scale using MapReduce, and while the integration of MapReduce into its algorithms is neither complete nor easy to use, even in the single-machine case Mahout shows evidence of being better able to handle large volumes of data.

So, which is better?

As you probably figured out already, it depends.

There are any number of ways to answer this question, but we opted to compare them by picking a popular algorithm present in both, and comparing performance on a well known and non-trivial machine learning task. Specifically, we applied random forests to the MNIST handwritten digit recognition dataset.

The Task

The MNIST dataset consists of images of handwritten digits. Note that we worked with a reduced subset of the dataset, with about 42,000 images for training and 28,000 for testing.

The Algorithm

Random forests are an ensemble classification method that enjoys great popularity thanks to its conceptual simplicity, good performance (both speed and accuracy for many applications), and resistance to overfitting. You can read more about them here or, of course, on Wikipedia. Despite differences in the implementations, all parameters apart from the number of trees were kept constant for the purpose of a fair comparison.
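To make the Weka side concrete, here is a minimal sketch of training and evaluating a random forest with Weka’s Java API. This is not our benchmark harness: the dataset file names are placeholders, 250 trees simply matches the best Weka run reported below, and exact option handling can vary slightly between Weka versions.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MnistForest {
        public static void main(String[] args) throws Exception {
            // Load the training and test splits; the file names are placeholders.
            Instances train = DataSource.read("mnist_train.arff");
            Instances test = DataSource.read("mnist_test.arff");
            // The digit label is the last attribute in this layout.
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            RandomForest forest = new RandomForest();
            forest.setOptions(new String[] {"-I", "250"}); // -I controls the number of trees
            forest.buildClassifier(train);

            // Evaluate on the held-out split and print accuracy and error statistics.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(forest, test);
            System.out.println(eval.toSummaryString());
        }
    }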


The Results

For accuracy, Weka is the clear winner, achieving its best result of 99.39% with 250 trees, compared with Mahout’s best of 95.89% with 100 trees. It’s also worth noting that the number of trees had little effect on Mahout’s classification accuracy, which stayed between 95 and 96 percent for all forest sizes we tried. Runtimes for the two were comparable.


This shouldn’t be terribly surprising: Weka has been around longer, and we would expect it to be at least somewhat better optimized, especially on the relatively small datasets for which it was specifically designed.

However, bear in mind that MNIST is easily small enough to fit on one machine. Weka’s superior accuracy won’t count for anything once your data outgrows a single machine, and at that point tools like Mahout will be your only option.

 The Moral of the Story

So, it’s just like we told you: when taking into account the whole scope of enterprise data problems, there is no clear winner or one-size-fits-all solution. If you have a modest amount of data and want to get the most out of it, use Weka. If you have a lot of data, Mahout is your best (and perhaps only) choice, even if its accuracy isn’t quite what you would like.

One of the key advantages of Algorithmia is that you can easily try both head to head against your data set with no setup.

 – Austin & Zeynep

Create your own machine learning powered RSS reader in under 30 minutes

As developers, one of our biggest “problems” is our voracious appetite for news. From Twitter to HackerNews to the latest funding round on TechCrunch, it seems we cannot, at times, avoid the gravitational pull of our favorite news feeds. I know that, for myself at least, this is ingrained in my routine: wake up, check Twitter, check TechCrunch, check The Verge, and so on. I spend at least the first 30 minutes of every day reading feeds based on titles alone, and I repeat this a couple more times throughout the day.

Get the code sample for this project here.

I recently discovered SkimFeed, which I love and call my “dashboard into nerd-dom”: basically a single view of the major tech sites’ headlines. However, I wanted more information about each article before deciding to click on one, so I thought: why not use text analysis algorithms as a more efficient way of consuming my feeds?

Building my very own text analysis API

There are a number of things that can be done as part of a text analysis API, but I decided to concentrate on four elements that I believed I could build the fastest:

  • Stripping documents of unnecessary elements
  • Advanced topic extraction
  • Automatic story summarization
  • Sentiment analysis

As all of these algorithms are already available in the Algorithmia API, I could piece them together without having to worry about servers, scaling, or even implementing the actual algorithms:

  • ScrapeRSS – Retrieves all the necessary elements from an RSS feed
  • Html2Text – Strips HTML elements and keeps the important text
  • AutoTag – Looks for keywords that represent the topics of any text submitted to it (uses Latent Dirichlet Allocation)
  • SentimentAnalysis – Analyzes sentences for positive, neutral, or negative connotation. Uses the Stanford NLP library
  • Summarizer – Breaks content into sentences and extracts the key sentences that represent the content’s topic. Uses classifier4j to ingest a URL and summarize its main contents

Now, it was just a question of making them work together (and it took ~200 lines of code).

Check it out:

Note: wow, our poor servers have been Reddit’d. If you see an “unable to find worker” error, rest assured we are working to get things back to normal – check back with us in a couple of minutes.

The process:

The first thing I needed to do was retrieve all the necessary elements from the RSS feeds (ScrapeRSS). Once I had located the main content, I could strip out all the unnecessary HTML (Html2Text). Now I had a nice, clean body of text to start doing analysis on.

Creating topic tags is an easy way to understand an article at a quick glance, so I fed our clean body of text through AutoTag. Now I had the RSS title and some topics; the next step was to summarize each article in at most three sentences to complement the tags. Finally, and mostly for fun, I wanted to see whether it’s true that most news is negative, so I added SentimentAnalysis.

 Done.
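Stripped down, the glue code boils down to a chain of calls like the sketch below. This is not the actual demo code: callAlgorithm is a hypothetical stand-in for however you invoke an Algorithmia algorithm (an HTTP request or a client library), and the pipeline is simplified to a single article.

    public class FeedAnalyzer {

        // Hypothetical helper: send `input` to the named algorithm and return its output.
        static String callAlgorithm(String algorithmName, String input) {
            throw new UnsupportedOperationException("wire this up to the Algorithmia API");
        }

        static void analyze(String feedUrl) {
            // 1. Pull the article elements (titles, links, content) out of the RSS feed.
            String article = callAlgorithm("ScrapeRSS", feedUrl);

            // 2. Strip the markup so we have a clean body of text.
            String text = callAlgorithm("Html2Text", article);

            // 3. Topic tags, a ~3 sentence summary, and a sentiment score.
            String tags = callAlgorithm("AutoTag", text);
            String summary = callAlgorithm("Summarizer", text);
            String sentiment = callAlgorithm("SentimentAnalysis", text);

            System.out.printf("tags: %s%nsummary: %s%nsentiment: %s%n", tags, summary, sentiment);
        }
    }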

You can check out some of the code here (or just view-source on our demo page in the browser).

 


That’s one way to use the Algorithmia library to build your own text analysis API and analyze almost any data stream in under 30 minutes.

Cheers, Diego

An algorithmic approach to GitHub exploration

If you recall, in our last blog post we showed you how we used some of the algorithms in Algorithmia to generate topic tags for any URL. Internally, we used the topic generation algorithm to generate tags based on each algorithm’s description and, later, used these tags as part of our recommender algorithm.

With today’s post we want to show you how easy it is to integrate this type of recommender algorithm (an algorithm already available in Algorithmia) into your own workflow, by showing it in action with one of our favorite developer tools: GitHub.

As GitHub has grown to almost 50 million repositories since 2011 (per our calculations), discovering new repositories has become less and less straightforward once you navigate outside the most popular or featured ones. With this in mind, we thought of a way to tackle some of this complexity using some of the algorithms available in Algorithmia.

So here you go, give us any GitHub repository (as long as it has a README) and we will recommend other repositories based on the information we extract from the README:

How we built this

  • The first step was to figure out the URL of every repository on GitHub. This might seem like a daunting task, but the folks over at GitHub have been nice enough to make their entire dataset available on Google BigQuery. You can head over there and generate a list of every public repository since 2011.
  • The second step was to fetch every public repository’s README file and run it through Algorithmia’s topic analysis algorithm.
  • Once the topic analysis algorithm returns a set of tags, we save them to build up the data model we will later use for our recommender algorithm: a mapping from URL -> [tags].
  • Repeat this a couple million times (not the easiest task, but thanks to the Algorithmia platform we had a virtually unlimited cluster to parallelize it on), and voilà: a tagged data model of the entire GitHub world.
  • Finally, it’s time to put our recommender algorithm to work. To start, we point it at the data model we built, and then send it the new set of tags generated from the URL you provided us with. With these two inputs, the recommender algorithm returns a number of relevant repositories based on the automatically generated tags. A rough sketch of this flow appears below.
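For illustration only, the data-model step looks roughly like the sketch below. The helper methods are hypothetical stand-ins for the README fetch and the Algorithmia topic-analysis call; the real pipeline ran as parallel jobs on the Algorithmia cluster rather than a single loop.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RepoTagger {

        // Hypothetical stand-ins for the steps described above.
        static String fetchReadme(String repoUrl) { return ""; }        // fetch the repo's README
        static List<String> autoTag(String text) { return List.of(); }  // topic-analysis call

        // Build the URL -> [tags] mapping that the recommender is later pointed at.
        static Map<String, List<String>> buildModel(List<String> repoUrls) {
            Map<String, List<String>> urlToTags = new HashMap<>();
            for (String url : repoUrls) {
                String readme = fetchReadme(url);
                if (readme != null && !readme.isEmpty()) {
                    urlToTags.put(url, autoTag(readme));
                }
            }
            return urlToTags;
        }
    }

At query time, the same topic-analysis call is applied to the README of the repository you give us, and those tags plus this model are the two inputs handed to the recommender.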

If you are interested in using any of these algorithms on your own site or application, we can help – these are all available in the Algorithmia API, just send us an email.

—     Your friends at Algorithmia

Algorithmic Tagging of HackerNews (or any other site)

Part of making algorithms more discoverable is creating metadata tags to classify them. Often, sites will let users pick their own tags, but what if the content has already been generated? This is the problem we faced when trying to tag all the algorithms in our API. Each algorithm had a description page, and we believed that, using some simple machine learning algorithms already in our API, we could generate tags for each one.

By generating tag data, it becomes easy to classify documents, make recommendations, optimize SEO, etc. Below we show how we approached this task, using HackerNews as an example data source.

Full demo site

So how did we do this? Our secret sauce is that Algorithmia is designed to make it extremely easy to combine algorithms into a pipeline that processes and generates tags for almost any site.

The basics:

  • Given a site, pull the data and iterate over every link
  • Extract the text from each linked page
  • Run the text through a topic analysis algorithm (such as Latent Dirichlet Allocation)
  • Return tagged data
  • Render tags next to links

All this really is, is a pipeline of algorithms (plus some clever front-end JS :P). In most cases this would require some serious code stitching and algorithm development, but most of the components already existed in the Algorithmia API, and the solution was extremely clean.
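As a rough illustration (not the demo’s actual code), that pipeline boils down to something like the following, with hypothetical helpers standing in for the scraping step and the Algorithmia topic-analysis call:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class SiteTagger {

        // Hypothetical stand-ins for the steps listed above.
        static List<String> frontPageLinks(String siteUrl) { return List.of(); } // pull the site and iterate over its links
        static String extractText(String pageUrl) { return ""; }                 // fetch the page and strip the HTML
        static List<String> autoTag(String text) { return List.of(); }           // LDA-based topic analysis

        // Tag every link on the front page; the result is ready to render next to each link.
        static Map<String, List<String>> tagSite(String siteUrl) {
            Map<String, List<String>> tagsByLink = new LinkedHashMap<>();
            for (String link : frontPageLinks(siteUrl)) {
                tagsByLink.put(link, autoTag(extractText(link)));
            }
            return tagsByLink;
        }
    }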


This also works for arbitrary webpages. Enter a URL below to automatically generate tags for it:

Once we had these tags, it became easy to classify algorithms, make recommendations, and relate one algorithm to another. We’ll leave that for the next post…

Get topic tags for your site dynamically – contact us!

Algorithm Development is Broken

Highly complex algorithm development is often at the core of innovative software products. The most famous algorithm of the last decade led to the development of the $396 billion titan Google, which changed our lives forever with a simple search box.

Today, an estimated 80% of equity trading in the U.S. is done by automated algorithms. Even our phones use algorithms to figure out where we are, what we are doing, and what we might want to do next. A few days ago, DeepMind, a company developing the next generation of artificial intelligence algorithms, was bought for a record sum. These examples are just scratching the surface of what algorithms can do.

As the volume of data we collect day to day grows, data architects are in a race for more, better, and faster hardware; yet Moore’s Law is showing signs of a slowdown, indicating that throwing more hardware at a data problem will no longer fulfill the needs of organizations. As organizations grow ever hungrier for understanding of how they operate and how they can improve, they must concentrate more on the efficiency and quality of the underlying algorithms that make up their data analysis suites than on the hardware those suites run on.

The Current State of Algorithm Development

It’s safe to say that on any given day there are thousands of brilliant computer scientists advancing the state of the art and pushing the limits of software, yet there is a serious problem: algorithms are being developed all the time, but they are not getting into the hands of the people and applications that could benefit from them.

  • The vast majority of algorithm development occurs in one of two places: either in private research and development programs (e.g., Microsoft Research, Google) or in one of the 320+ academic institutions with computer science departments around the world. The output of this kind of research and development is almost always private white papers or academic papers.
  • Companies and application developers have a hard time taking advantage of advanced algorithms. Academic papers are hard to find and even harder to implement. Even in the best case, when published code is available, more time is spent setting up infrastructure and fighting with undocumented libraries and dependencies than actually doing any algorithm development.

For example, a while back I was researching Latent Dirichlet Allocation. In layman’s terms, it is an algorithm that extracts topics from documents without any understanding of the language itself. A quick search on Google brought up a dozen research papers and a couple of libraries where the algorithm had been implemented. So where do I start? I guess I could go ahead and try one of the libraries, but which one? Do I even know whether this is going to work? And once I do figure it out, will it work in my current system?

The Birth of Algorithmia

Algorithmia was born out of frustration with these problems and the current state of algorithm development and deployment.  Algorithmia is moving away from developers working in isolation and toward providing a community for algorithm developers to share knowledge, test algorithms, and run them directly in their applications. What’s different is that every algorithm is live, so no more spending time finding the right libraries, compilers, and/or virtual machines—Algorithmia takes care of all of that for you.  Find a useful algorithm in our API? Great, use that in your own applications.

Algorithmia is a live, crowd-sourced algorithm API. Our goal with Algorithmia is to make applications smarter, by building a community around algorithm development, where state-of-the-art algorithms are always live and accessible to anyone.

Intrigued? Come check us out.