We here at Algorithmia are firm believers that no one tool can do it all – that’s why we are working hard to put the world’s algorithmic knowledge within everyone’s reach. Needless to say, that work will be in progress for a while, but we’re well on the way to making many of the most popular algorithms available. Machine learning is one of our highest priorities, so we recently made available two of the most popular machine learning packages: Weka and Mahout.
Both of these packages are notable, for different reasons. Weka is a venerable, well-developed project that’s been around since the 1990s. It includes a huge suite of well-optimized machine learning and data analysis algorithms as well as various supporting routines that handle formatting, data transformation, and related tasks. Weka’s only real weakness is that its main package is not well optimized for memory-intensive tasks on large datasets.
Mahout is almost completely the opposite, being something of a cocky newcomer. It’s a recent project designed specifically for Big Data. Its set of algorithms seems tiny compared to Weka’s, its documentation is spotty, and getting it to work can be a real headache (unless of course you’re using it through Algorithmia). However, as our results below show, if you have a lot of data to crunch, it may be the only game in town. Mahout is designed to scale using MapReduce. The integration of MapReduce into its algorithms is neither complete nor easy to use, but even in the single-machine case Mahout shows evidence of being more capable of handling large volumes of data.
So, which is better?
As you probably figured out already, it depends.
There are any number of ways to answer this question, but we opted to compare them by picking a popular algorithm present in both, and comparing performance on a well known and non-trivial machine learning task. Specifically, we applied random forests to the MNIST handwritten digit recognition dataset.
The MNIST dataset consists of images of handwritten digits. Note that we worked with a reduced subset of the dataset that has about 42,000 images for training and 28,000 for testing.
Random forests are an ensemble classification method that enjoys great popularity due to its conceptual simplicity, good performance (both speed and accuracy for many applications) and resistance to overfitting. You can read more about them here or, of course, on Wikipedia. Despite differences between the two implementations, we kept all parameters apart from the number of trees constant for the sake of a fair comparison.
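To make the idea concrete, here is a toy sketch in Python of the two ingredients that give random forests their character: each tree is trained on a bootstrap sample using a random subset of the features, and the forest predicts by majority vote. This is not how Weka or Mahout implement the algorithm. To stay short it uses one-level trees (decision stumps) instead of full decision trees, and the function names and the tiny dataset are made up purely for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature_ids):
    """Fit a one-level tree: the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send examples to both sides
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split in this sample: always predict the majority class.
        label = majority(labels)
        return (feature_ids[0], float("inf"), label, label)
    return best[1:]


def fit_forest(rows, labels, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    all_features = range(len(rows[0]))
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) examples with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each stump only gets to look at a random subset of the features.
        feats = rng.sample(all_features, n_features)
        forest.append(fit_stump(sample_rows, sample_labels, feats))
    return forest


def predict(forest, row):
    """Classify one example by majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Made-up two-feature data; both features separate the two classes.
rows = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (8.0, 9.0), (9.0, 8.0), (10.0, 10.0)]
labels = ["low", "low", "low", "high", "high", "high"]

forest = fit_forest(rows, labels, n_trees=25, n_features=1)
print(predict(forest, (0.0, 0.0)), predict(forest, (10.0, 10.0)))
```

Individual stumps can be wrong, for instance when a bootstrap sample happens to contain only one class, but the vote across many of them smooths that out. That averaging effect is where the resistance to overfitting mentioned above comes from. A real random forest grows deep trees and draws a fresh feature subset at every split rather than once per tree.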
A quick demo to showcase
For accuracy, Weka is the clear winner, achieving its best accuracy of 99.39% using 250 trees, compared with Mahout’s best of 95.89% using 100 trees. It’s also worth noting that the number of trees had little effect on Mahout’s classification accuracy, which stayed between 95 and 96 percent for all forest sizes tried. Runtimes for both were comparable.
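A gap of a few percentage points can look small, so it may help to restate it as error rates. The short Python calculation below does that with the two best accuracies quoted above. The misclassification counts assume the accuracies were measured over all 28,000 test images, which is our reading of the setup rather than something stated outright, so treat them as rough figures.

```python
TEST_IMAGES = 28_000  # test split size; assumed to be what accuracy was measured on

best_accuracy_pct = {"Weka, 250 trees": 99.39, "Mahout, 100 trees": 95.89}


def error_summary(accuracy_pct, n_images):
    """Return (error rate in percent, approximate number of misclassified images)."""
    error_pct = 100.0 - accuracy_pct
    return error_pct, round(n_images * error_pct / 100.0)


summary = {name: error_summary(acc, TEST_IMAGES)
           for name, acc in best_accuracy_pct.items()}

for name, (error_pct, n_errors) in summary.items():
    print(f"{name}: {error_pct:.2f}% error, roughly {n_errors} misclassified images")

# How many times more errors the weaker result makes.
ratio = summary["Mahout, 100 trees"][0] / summary["Weka, 250 trees"][0]
print(f"Error-rate ratio: about {ratio:.1f}x")
```

Seen this way, the difference is between an error rate of 0.61% and one of 4.11%, which is close to a sevenfold gap.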
This shouldn’t be terribly surprising, as Weka has been around for longer and we would expect it to be at least somewhat better optimized, especially on the relatively small datasets for which it was specifically designed.
However, bear in mind that this MNIST subset is easily small enough to fit on one machine. Weka’s superior accuracy won’t count for anything once your data outgrows a single machine, at which point tools like Mahout will be your only option.
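For a sense of just how small this is, here is a back-of-the-envelope size estimate in Python. It relies on the fact that MNIST digits are 28x28 grayscale images. The two storage formats (one byte per pixel versus an 8-byte float per pixel) are assumptions chosen to bracket the likely range, not a statement about how either package stores data, and the estimate ignores labels and per-object overhead.

```python
PIXELS_PER_IMAGE = 28 * 28  # MNIST digits are 28x28 grayscale images
TRAIN_IMAGES = 42_000
TEST_IMAGES = 28_000


def size_mib(n_images, bytes_per_pixel):
    """Raw pixel storage in MiB, ignoring labels and container overhead."""
    return n_images * PIXELS_PER_IMAGE * bytes_per_pixel / 2**20


total_images = TRAIN_IMAGES + TEST_IMAGES
compact_mib = size_mib(total_images, 1)   # assumed: one byte per pixel
float64_mib = size_mib(total_images, 8)   # assumed: one 8-byte float per pixel

print(f"{total_images} images, 1 byte per pixel:  {compact_mib:.0f} MiB")
print(f"{total_images} images, 8 bytes per pixel: {float64_mib:.0f} MiB")
```

Either way the whole subset comes to well under a gigabyte, comfortably within the memory of a single ordinary machine.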
The Moral of the Story
So, it’s just like we told you. When taking into account the whole scope of enterprise data problems, there is no clear winner or one-size-fits-all solution. If you have a little data and want to get the most out of it, use Weka. If you have a bunch of data, Mahout is your best (or perhaps only) choice, even if performance isn’t quite what you would like.
One of the key advantages of Algorithmia is that you can easily try both head to head against your data set with no setup.
– Austin & Zeynep