All posts by Matt Kiser

Algorithmia Security Bounty Program

For us here at Algorithmia, protecting the privacy and security of our users’ information is a top priority. After some time in development, we are happy to announce that starting today we will be recognizing security researchers for their efforts through a bug bounty program.

A bug bounty program is common practice amongst leading companies to improve the security and experience of their products. This type of program provides an incentive for security researchers to responsibly disclose vulnerabilities and bugs, and allows for internal security teams to respond adequately in the best interest of their users.

All vulnerabilities should be reported via security@algorithmia.com. GPG key available below [1].

Guidelines

We require that all researchers:

 

  • Make every effort to avoid privacy violations, degradation of user experience, disruption to production systems, and destruction of data during security testing;

  • Use the designated communication channels to report vulnerability information to us; and
  • Keep information about any vulnerabilities you’ve discovered confidential between yourself and Algorithmia until we’ve had 90 days to resolve the issue.

If you follow these guidelines when reporting an issue to us, we commit to:

 

  • Not institute a civil legal action against you and not support a criminal investigation;
  • Work with you to understand and resolve the issue quickly (confirming the report within 72 hours of submission);
  • Recognize your contribution on our site, if you are the first to report the issue and we make a code or configuration change based on the issue.

Scope:

 

Any component developed by us under Algorithmia.com is fair game for this bounty program, except individual algorithms created by our users.

Out of Scope:

Any services hosted by third-party providers are excluded from scope.

 

In the interest of the safety of our users, staff, the Internet at large, and you as the security researcher, the following test types are excluded from scope and not eligible for a reward:

  • Findings from physical testing such as office access (e.g. open doors, tailgating)
  • Findings derived primarily from social engineering (e.g. phishing)
  • Findings from applications or systems not listed in the ‘Scope’ section
  • Functional, UI and UX bugs and spelling mistakes
  • Network level Denial of Service (DoS/DDoS) vulnerabilities

Things we do not want to see:

Personally identifiable information of users (PII) that you may have found during your research.

[1]

-----BEGIN PGP PUBLIC KEY BLOCK-----

Version: GnuPG v1

mQENBFU8HlYBCACm2TJ4/kaP10G/dpfeLlzWCsJ1zrbehUg/G3mzzN2vSXKNCK0KY5ZRmSHlMAoF9ZINpSlQCafDHIbVqqQnRZ6/VrG9hLc0avn9o8vgtkMfKFDVYozr1yz7GiGDqANaXS4B2xzJncZi3WFaslYOGFx2ctDEgBFGKilhoVaejA9EjcubMKvOVmtLkBTvTz0WTGasBxP0689httGBn/5P2yz+HT7ei3FiDa842ekgw1Ak699AVMrb7CtHpiyS8kTW/KGfrsmwEqWLKC46X/VjAdtKB994RIFN1BMJbza0i16N3rM/8+kVU8yDRm1S5w9gopQHUETVsTrOtoK40SHpYIRhABEBAAG0L0FsZ29yaXRobWlhIFNlY3VyaXR5IDxzZWN1cml0eUBhbGdvcml0aG1pYS5jb20+iQE+BBMBAgAoBQJVPB5WAhsDBQkFo5qABgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRAxFOOGF4jSHdT6B/4wKb+fPYdoOnBjLnz1NgHlEXoP4462Hn5YgYGTi3l9spimnkATYuIVCK1Q80iFJgV8V8TpC2e4XnsdeG7yFPSfHUg3dMikokz3bIfLsAuMj3dML7cp4nRVHlrr28zlC+HRgr9dp7wn4BOfD2qv3q3lknwscygHOCc4pzD7LmB36pzL0BA8M0p31xp+oVMH6zD2OrLws8bUvouxLTzXqtmQFOEt072CS85rXKKYVrDs4/pXgAEtpCcr6xJRcnDKKsdNFoYJDlaUTzD+mzobNYXb45rzB2JsNssQEH2SB6zV9ZLBfQGUe0iF35hNVe3ZceQLuNRbjSR2M913OLEmAUXvuQENBFU8HlYBCADThh+5DXgb6BOHi2TeO58QemL18mzqffRhlyrND+zYvLxwES2c4OoEWfXCGU7vPvtC5x855r8vx+ERVUxx7o+2DzSJU3AQ56YOqDaF55KcJtlfRIT2Xww+9Dl7bO9vL+piFmEMnvNPMUjy7mZm5Zmj3RN+0oewWVgzDrcRMz87SzXRYA2q0K3SlLIiaHD3KtD0lQILIvS9pvkcoNUcxV8pkbYRlQQPhIyxW+9kte+aru8kAfJwqhvrieo6OKscUJWYfX/ORms4iKVCcebfw0R5UmLBRsKKyQ2eIVU0q1UP0F+28YEFbLQlDo25PSD/N3gKnD1thkxqK/33UjDGVLyxABEBAAGJASUEGAECAA8FAlU8HlYCGwwFCQWjmoAACgkQMRTjhheI0h0XFwf7BkgEjHOE5YV8dqFVOC0K3RG8/Ppg63e+KdKO/9cx0gi/LB5O7jVA5m5XeQn2NPVzfr6NcUHjxP15aBu0zOD3ZN7tMNmsAGYMO8QPZIXjgvlWFTYklbxrSI5QEL914ZELdEfIHy1UYO2AoPUAkQCsNOvDhY6URHktLJtFYBUOzFKlDLDkX5Wv+hahG3OZ6XmPN40yeKfug/uO56eMtDDKmm2caQDs3awRUPDJ/EDgPi4Mtx55lHUcZ3bbzGMhIdVS2U8jCiox3IpMm1IogqOek3EBHRkvyeWZafbkqAX7YQwqS6SvjUtIvU/nHWVPdx9KpWggcgkEVs5GTw4VY6pFtA===EMqo
-----END PGP PUBLIC KEY BLOCK-----

 

How Machines See the Web: Exploring the Web Algorithmically

In 1996, Larry Page and Sergey Brin started BackRub, a research project about a new kind of search engine. The concept was a link analysis algorithm that measured relative importance within a set. Based on this one algorithm, the company Google was created and PageRank became one of the most famous algorithmic concepts in history.

Knowing how important it is to be indexed for the right thing, we here at Algorithmia were inspired by one of our users when he added an implementation of PageRank to the marketplace (note that Google moved beyond its original algorithm more than a decade ago). We realized that by combining various algorithms available in the Algorithmia API with the PageRank algorithm, we would be able to get an idea of how search engines view a site or domain. So we’ve essentially integrated every intermediate step and process that a crawler goes through when examining a site (using the algorithms already available in our marketplace), which allows us to understand how machines see the web.

Understanding how pages are linked to each other on a site gives us a snapshot of the connectedness of a domain, and a glimpse into how important a search engine might consider each individual page. Additionally, a machine generated summary and topic tags give us an even better picture of how a web crawler might classify your domain.

Modern web crawlers use sophisticated algorithms to link across the entire internet, but even limiting our application to pages on just one domain gives us significant insight into how any site might look to one of these crawlers.

Building a site explorer

We broke up the task of exploring a site into three steps:

  • Crawl the site and retrieve as a graph
  • Analyze the connectivity of the graph
  • Analyze each node for its content

We found algorithms for each part already in the Algorithmia API, which allowed us to build out the pipeline quickly:

  • GetLinks (retrieves all the URLs at the given URL)
  • PageRank (simple implementation of the Backrub algorithm)
  • Url2text (converts documents and strips tags to return the text from any URL)
  • Summarizer
  • AutoTag (Uses Latent Dirichlet Allocation from mallet to produce topic tags)

Here is the result (click on a node for more info):

The individual steps

We built this app using three core technologies: AngularJS, D3.js, and the Algorithmia API.

The first thing we needed to do was crawl the domain supplied. We allow the user to determine the number of pages to crawl (limited in our demo to a max of 120 – at this point your browser will really start hurting though).  For each page crawled, we retrieve every single link on that page and plot it on the graph. Then, once we have reached the max pages to crawl, we apply PageRank to the result.

Get the links:

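A minimal Python sketch of this step (the demo itself is written in AngularJS, and the endpoint path below is a placeholder rather than GetLinks’ documented route; we also assume the result comes back as a list of URLs):

import urllib2, json

def get_links(page_url, auth_token):
    # Ask the GetLinks algorithm for every URL found on the given page.
    # '<user>/GetLinks' is a placeholder path, not the algorithm's documented route.
    req = urllib2.Request('https://api.algorithmia.com/api/<user>/GetLinks',
                          json.dumps(page_url),
                          {'Content-Type': 'application/json'})
    req.add_header('Authorization', auth_token)
    return json.loads(urllib2.urlopen(req).read())['result']

links = get_links('http://algorithmia.com', '<insert_auth_token>')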

Iterate over site:

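The crawl itself is a breadth-first walk that records an edge for every link it finds until it hits the page limit. A sketch, reusing get_links from the snippet above:

def crawl(start_url, max_pages, auth_token):
    # Breadth-first walk over the domain, recording a (source, target)
    # edge for every link found, up to max_pages pages.
    to_visit, visited, edges = [start_url], set(), []
    while to_visit and len(visited) < max_pages:
        page = to_visit.pop(0)
        if page in visited:
            continue
        visited.add(page)
        for link in get_links(page, auth_token):
            edges.append((page, link))
            if start_url in link:   # only follow links that stay on this domain
                to_visit.append(link)
    return edges

edges = crawl('http://algorithmia.com', 120, '<insert_auth_token>')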

Apply Page Rank:

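The edge list then goes to PageRank. Again just a sketch, continuing from the snippets above; the path and the input format (a list of [source, target] pairs) are assumptions, not the algorithm’s documented interface:

def page_rank(edges, auth_token):
    # Send the link graph to PageRank and get back a score per URL.
    payload = json.dumps([[src, dst] for src, dst in edges])
    req = urllib2.Request('https://api.algorithmia.com/api/<user>/PageRank',   # placeholder path
                          payload, {'Content-Type': 'application/json'})
    req.add_header('Authorization', auth_token)
    return json.loads(urllib2.urlopen(req).read())['result']

ranks = page_rank(edges, '<insert_auth_token>')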

Once the graph is built, we render it using a D3 force layout graph. Clicking on any individual node retrieves the content from that page, cleans up the HTML so we are left with just the text, and processes the text through both the summarizer and topic tagger algorithms.

Really, the hardest part about building this was figuring out the quirks of D3, since the Algorithmia API just allowed us to stitch together all the algorithms we wanted for the process and start using them, without worrying about a single download or dependency.

Don’t take our word for it: try it yourself. We have made the AngularJS code available here; feel free to fork it, modify it, and use it in your own applications.

It’s really easy to expand the capabilities of the site mapper. Here are some ideas:

  • Sentiment by Term (understand how a term is seen across the site e.g.: Apple Watch on GeekWire)
  • Count Social Shares (understand how popular any link-node is on a number of social media sites)
  • many more that can be found in Algorithmia…

Made it this far?

If you tweet “I am [username] on @algorithmia” we will add another free 5k credits to your account.

– Diego (@doppene)

Accessing Algorithmia through Python

Today’s blog post is brought to you by one of our community members, John Hammink.

Imagine – you are a novice programmer and want to harness the power of algorithms.   Or maybe you are an academic, or an expert in another area that has nothing to do with programming and want to incorporate algorithms into your own work.  Now you can!   

To leverage the power of Algorithmia’s service, you must first understand a thing or two about REST APIs as well as how to do something with the output.  (Incidentally, what you learn about REST APIs here can be applied across the board to many web services!).

In this article, we’ll use Python to create a small program that calls a few REST APIs.

Why Python? Python’s easy, like using a notebook. It’s very demoable code for any skill level. So here’s what we’ll do:

  1. First, we’ll start out using cURL at the command line and walking through various usage scenarios.
  2. Next, we’ll look at a basic Python example for using a REST API.
  3. After that, we’ll get more advanced: there are different ways to use the REST API within Python, and reasons why you might prefer one alternative over another depending on the algorithm and the performance you want to achieve.
  4. Next, we’ll look at some different APIs.
  5. Our goal is to build an extremely simple console application to demonstrate the power of a couple of Algorithmia’s algorithms.
  6. Finally, we’ll end with a question: which problems would you, our users, via comments, want to solve?  

Starting off with cURL

First things first! Head over to Algorithmia.com and sign up if you have not already, so you can get a free authentication token. You can find your API key under your account summary.


Let’s look at the profanity detection algorithm.  You can read the summary docs of the algorithm here:  https://algorithmia.com/algorithms/nlp/ProfanityDetection

Let’s use cURL at the command line to access the API and pass a request.  

curl -X POST -d '["He is acting like a damn jackass, and as far as I am concerned he can frack off.",["frack","cussed"],false]' -H 'Content-Type: application/json' -H 'Authorization: <insert_auth_token>' https://api.algorithmia.com/api/nlp/ProfanityDetection

Before you move ahead, let’s explain what each parameter does.

curl -X POST -d '["He is acting like a damn jackass, and as far as I am concerned he can frack off.",["frack","cussed"],false]'

-X is how you use curl to send a custom request with data, specified with -d.   The data to be passed to the profanity detection algorithm comes in three parts:

  1. The sentence or phrase to be analyzed for profanity.  This can also be a variable, or a string scraped from somewhere else in the internet.
  2. Additional words or phrases to be added to the profanity corpus when analyzing the text in (1).
  3. A boolean value – when set to true, the default list of known profanities is ignored and only the words you supply are checked.

-H allows us to send an extra header to include in the request when sending HTTP to a server. Here, we’re calling it twice: once to add the content type spec, and a second time to add an authorization header. Note that in our documentation and blog, this auth key will automatically be populated with your auth key. Finally, we pass the URL for the algorithm in the Algorithmia API that we want to use.

…enter python!

Python just might be the easiest language with which to make HTTP requests.  

Taking a step back, let’s look at how we might run a cURL-like request using python.  

First, let’s look at how one might use python with  cURL using a subprocess call.  Try this in python interactive mode:

Python 2.7.6 (default, Sep  9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> data = """["I would like to eat some flaming cheesecake.",["cheesecake","cussed"],false]"""
>>> import subprocess
>>> subprocess.call(['curl', '-X', 'POST', '-d', data, '-H', 'Content-Type: application/json', '-H', 'Authorization:51619aab16b44ad0baef1d2c5110fb21', 'https://api.algorithmia.com/api/nlp/ProfanityDetection'])
{"duration":0.0009742940000000001,"result":{"cheesecake":1}}
0
>>>

This works, and passes the data, but there is always going to be a performance hit when calling cURL, or any subprocess for that matter, from within Python. Generally, native function calls simply work best! Python handles the whole data-passing thing much more cleanly. Let’s look at what a simple ‘dataless’ implementation (an urllib2 request where we don’t pass any data to the algorithm) might look like in Python using the urllib2 library.

import urllib2, json

request = urllib2.Request('https://api.algorithmia.com/api/nlp/ProfanityDetection')
request

What is going on here?   We are simply creating a request and storing it to a variable.   Running it returns this:

<urllib2.Request instance at 0x7fcb52d5e3f8>

In this example, the imported json library and input variables are just placeholders for what we’re about to do – pass and return actual data to an urllib2 request.   Let’s try that now:

import urllib2

url = 'https://api.algorithmia.com/api/nlp/ProfanityDetection'
data = """["I would like to eat some damn cheesecake.",["cheesecake","cussed"],false]"""
req = urllib2.Request(url, data, {'Content-Type': 'application/json'})
req.add_header('Authorization', '<insert_auth_token>')
f = urllib2.urlopen(req)
for x in f:
    print(x)

This actually prints out the data returned by Algorithmia’s API:

{"duration”:0.089443493,“result”:{“damn”:1,“cheesecake”:1}}

To close the object, just add:

f.close()

So now, we have the basics.  You can use the above example for calling more or less any REST API in Python.  

In my study of algorithms over REST APIs, I’ve learned that once we’ve mastered the basics of calling the API from whatever language or framework we’re using, it’s time to play around with the parameters we’re passing to the algorithm.

Consider this:

data = """["I would like to eat some damn cheesecake.",["cheesecake","cussed"],false]"""

There are actually several things going on here. There is the main data we’re passing:

"I would like to eat some damn cheesecake.”

Then there are the additional words we add to our cuss-corpus, in this case:

["cheesecake","cussed"]

Last, but not least, there is a boolean value being passed at the end of the string. This value tells us whether or not to ignore known profanities (and only go with the cuss-words you provide!). The input format is documented on the algorithm summary page on Algorithmia. I tried passing several different parameters in the data string; you can have fun with this too:

data = “”“["I wouldd totally go bananas with his recommendations, if he weren not such a pompous ass”, [“cuss”, “banana”] , false]“”“data = ”“”[“Ernst is a conniving little runt.”, [“eat”, “little”] , false]“”“data = ”“”[“Ernst is a conniving little runt.”, [] , false]“”“

You can have a load of fun with the data set on this one.  Did you try these?  Did you invent some others? What did you get?

Other Algorithms
Now, let’s try calling some of the other APIs.

Using the template we provided earlier, let’s try calling a few different algorithms.  We’ll do this simply by switching out our url and data parameters.

Note that we can get some clues by studying the sample input above the API console for each.

Let’s look first at Do Words Rhyme? (algorithm summary available at https://algorithmia.com/algorithms/WebPredict/DoWordsRhyme).

import urllib2

url = 'http://api.algorithmia.com/api/WebPredict/DoWordsRhyme'
data = """["fish", "wish"]"""
req = urllib2.Request(url, data, {'Content-Type': 'application/json'})
req.add_header('Authorization', '<insert_auth_token>')
f = urllib2.urlopen(req)
for x in f:
    print(x)

This returns:

{"duration”:0.23639225300000002,“result”:true}

Hint: you can isolate just the result value you need by using Python’s dictionary manipulations. So, for example, if you are getting your response this way:

import json

response = urllib2.urlopen(req, json.dumps(data))

You can pass it to a python dictionary, and then just take the result:

f = response.read()
rhyminDict = json.loads(f)
rhyminResult = rhyminDict['result']

You can mix it up with the input data and try again. You can also benchmark the duration of the different methods by calling the API directly with cURL, or with cURL called as a subprocess in Python. See the performance difference?

Now let’s try some other algorithms:

Word Frequency Counter

algorithm summary available at https://algorithmia.com/algorithms/diego/WordFrequencyCounter

url = 'http://api.algorithmia.com/api/diego/WordFrequencyCounter'
data = '"Does this thing need a need like this need needs a thing?  Cause the last thing this thing needs is a need."'

Geographic Distance

algorithm summary available at https://algorithmia.com/algorithms/diego/GeographicDistance

url = 'http://api.algorithmia.com/api/diego/GeographicDistance'
data = """{"lat1": 60.1,"long1":24.5,"lat2":37.2,"long2":122.4}"""
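Dropping those two lines into the same urllib2 template from earlier gives a complete call (nothing changes apart from the url and data):

import urllib2

url = 'http://api.algorithmia.com/api/diego/GeographicDistance'
data = """{"lat1": 60.1,"long1":24.5,"lat2":37.2,"long2":122.4}"""
req = urllib2.Request(url, data, {'Content-Type': 'application/json'})
req.add_header('Authorization', '<insert_auth_token>')
f = urllib2.urlopen(req)
for x in f:
    print(x)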

Are you getting the hang of it now?   

Putting it all together

Now let’s put together a simple console application in Python.

NOTE:  This app is not intended to take the place of legitimate psychological or arousal research.  It’s just code to illustrate what you can do with these algorithms.

In this app, we’ll use two algorithms, profanity detection and sentiment analysis, to get a general read on the user’s state of mind from the input they give.

Listing:  howsyourday.py  Full source code available at: https://github.com/algorithmiaio/samples/blob/master/HowIsYourDay.py

#############################################################
# Simple python app that uses 2 algorithms in algorithmia API
#   - Profanity Detection
#   - Sentiment Analysis
#
# Author: John Hammink <john@johnhamm.ink>
#############################################################
import urllib2, json
def get_cuss_index(sentiment):
    url = 'https://api.algorithmia.com/api/nlp/ProfanityDetection'
    filterwords = ["cuss", "blow", "spooge", "waste"]
    ignore = True
    # Build the JSON payload: [text, additional words, ignore-known-profanities flag]
    data = json.dumps([sentiment, filterwords, ignore])
    req = urllib2.Request(url, data, {'Content-Type': 'application/json'})
    req.add_header('Authorization', 'cb7a25d50ddd4deeaed9e4959cfa1ffe')
    response = urllib2.urlopen(req)
    f = response.read()
    curseDict = json.loads(f)
    curseResult = curseDict['result']
    # Sum the per-word counts into a single cuss index
    cuss_index = sum(curseResult.values())
    return cuss_index
def analyze_sentiment(sentiment):
    url2 = 'http://api.algorithmia.com/api/nlp/SentimentAnalysis'
    # SentimentAnalysis takes a plain string as its input
    data2 = json.dumps(str(sentiment))
    req2 = urllib2.Request(url2, data2, {'Content-Type': 'application/json'})
    req2.add_header('Authorization', 'cb7a25d50ddd4deeaed9e4959cfa1ffe')
    response = urllib2.urlopen(req2)
    g = response.read()
    analysis = json.loads(g)
    analysisResult = analysis['result']
    return analysisResult
def gimme_your_verdict(cuss_index, analysisResult):
    highArousal = '"My, we are feisty today! Take five or go for a skydive!"'
    mediumArousal = '"Seems like you are feeling pretty meh"'
    lowArousal = '"Hey dude, are you awake?"'
    if cuss_index >= 2 or analysisResult >= 3:
        print highArousal
    elif cuss_index >= 1 or analysisResult >= 2:
        print mediumArousal
    else:
        print lowArousal
    print "Come back tomorrow!"


if __name__ == '__main__':
    print "Como estas amigo?"
    sentiment = str(raw_input("How was your day? Swearing is allowed, even encouraged: "))
    cuss_index = get_cuss_index(sentiment)
    analysisResult = analyze_sentiment(sentiment)
    gimme_your_verdict(cuss_index, analysisResult)

We’ve put together three functions.   get_cuss_index and analyze_sentiment both query and parse the returns from their respective algorithms;  both functions load their respective json responses into a python dictionary, then parse the dictionary for only the result value; get_cuss_index actually sums the values in the dictionary, since there may be many.

gimme_your_verdict returns a snarky comment based on a conditional derived from the values returned by the first two functions.    

Having a bad day today, oh feisty one?  Well, there’s always tomorrow!

-John

A question for you, the reader….

Which other problems would you like to solve? Do you have a favorite algorithm you’d like to see demonstrated?  In a different programming language or framework?  Maybe you’d like to propose one – an algorithm or a solution with an existing one – that doesn’t exist yet?   

Let us know in the comments below, or follow and tweet us at @algorithmia. We’ll get right on it – and you’ll see more on the topic, either in our blog or in our knowledge base. Thanks for reading!

Playing with n-grams

Links in this post require access to Algorithmia.com private beta – you can obtain access by just following this link. 

Inspired by @sampullara’s tweet, we started thinking about how we could create an n-gram trainer and text generator directly in Algorithmia. It also happened to be almost Valentine’s Day, so we wondered if we could apply the same principles to automatically generating love letters.

Although it is clear that we probably wouldn’t be fooling our valentines with the fake letters, it was still a fun exercise.

The task would be split into three parts:

  1. generate the corpus of text 
  2. train a model 
  3. query the model for automatically generated text.  

N-gram models are probabilistic models that assign a probability to the “next” word in a sequence, given the n-1 previous words. The power of n-gram models in natural language processing is limited by the fact that they cannot model deep dependencies; that is, the nth word only depends on the n-1 previous words. But they perform well in simple statistical demonstrations.


Love as a Service?

A quick internet search on love letters provided us with enough of a corpus to extract some trigrams. Algorithmia already had a convenient way to extract text from a given URL, and this is our data source for our love letter generator. Now we are ready to feed our love corpus into the trigram generator.

The extraction of trigrams is done with an algorithm that generates trigram frequencies. This algorithm takes in an array of strings (the love letters in our corpus), a beginning and an end token, and a data collection URL to which to write the final trigram frequencies. The beginning and end tokens are necessary for us to be able to generate sentences: there are word sequences that are fit to start a sentence and those that are fit to end a sentence. To discern these, we use beginning and end tokens that are unique and do not show up in the text; instead of hard-coding them, we take them as inputs.

[Diagram: n-grams explained visually]

After this small preprocessing step, we can go through the text and record the frequency with which each trigram appears. This can be referred to as the “sliding window” step, where we go through each three-word (or word-and-token) combination and keep recording the frequencies. The implementation details can be seen in the GenerateTrigramFrequencies algorithm. We record the output as the three words followed by the total frequency in a file.
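As a rough illustration of that sliding window (this is not the GenerateTrigramFrequencies source, and it uses a deliberately crude sentence split), counting trigrams in Python might look like this:

from collections import defaultdict

def trigram_frequencies(letters, begin='<BEGIN>', end='<END>'):
    # letters: a list of strings (the love-letter corpus)
    counts = defaultdict(int)
    for letter in letters:
        for sentence in letter.split('.'):   # crude sentence split, for illustration only
            words = [begin] + sentence.split() + [end]
            # slide a three-word window over the sentence and count each combination
            for i in range(len(words) - 2):
                counts[(words[i], words[i + 1], words[i + 2])] += 1
    return counts

freqs = trigram_frequencies(["I love you more than words can say. You are my heart."])
for (w1, w2, w3), count in freqs.items():
    print('%s %s %s %d' % (w1, w2, w3, count))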

Now we can pass the trigram model that we obtained to the RandomTextFromTrigram algorithm. If we think of the trigram groups as possible paths down a graph, randomly choosing one of the possible nodes limits our choices for the next step. By going from the start of the graph to the end and choosing randomly among the possible “next words”, we generate random text based on the original corpus.
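And a sketch of that random walk (again, not the actual RandomTextFromTrigram source): pick each next word in proportion to its trigram frequency, reusing freqs from the snippet above.

import random

def weighted_choice(options):
    # options: list of (value, weight) pairs; pick a value with probability
    # proportional to its weight
    total = float(sum(w for _, w in options))
    r = random.uniform(0, total)
    for value, weight in options:
        r -= weight
        if r <= 0:
            return value
    return options[-1][0]

def generate_sentence(counts, begin='<BEGIN>', end='<END>'):
    # start from a trigram that can begin a sentence
    w1, w2, w3 = weighted_choice([(t, c) for t, c in counts.items() if t[0] == begin])
    words = [w2, w3]
    while words[-1] != end:
        # candidate next words: trigrams whose first two words match our last two
        nexts = [(t[2], c) for t, c in counts.items()
                 if t[0] == words[-2] and t[1] == words[-1]]
        if not nexts:
            break
        words.append(weighted_choice(nexts))
    return ' '.join(w for w in words if w != end)

print(generate_sentence(freqs))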

And last, as @sampullara’s tweet originally suggested, we applied the same process to tweets collected from a user. We added an option to specify the gram length for the tweet generator. This was done because, if there are not enough alternatives with different frequencies, the probabilistic nature of the text generation is not very visible in the end result. Since tweets are generally short and fairly unique, trigrams do not result in very interesting combinations.

Note: we limit the number of tweets retrieved from a user so the Twitter API doesn’t rate limit us.

Want to try this yourself? All algorithms used for this blog post are online, available and open-source on Algorithmia. You can get private beta access by following this link.

– Zeynep and Diego

Machine Learning Showdown: Apache Mahout vs Weka

We here at Algorithmia are firm believers that no one tool can do it all – that’s why we are working hard to put the world’s algorithmic knowledge within everyone’s reach. Needless to say, that’s work that will be in progress for a while, but we’re well on the way to getting many of the most popular algorithms out there. Machine learning is one of our highest priorities, so we recently made available two of the most popular machine learning packages: Weka and Mahout.

Both of these packages are notable for different reasons. Weka is a venerable, well-developed project that’s been around since the ’90s. It includes a huge suite of well-optimized machine learning and data analysis algorithms, as well as various supporting routines that handle formatting, data transformation, and related tasks. Weka’s only real weakness is that its main package is not well optimized for memory-intensive tasks on large datasets.

Mahout is almost completely the opposite, being something of a cocky newcomer. It’s a recent project designed specifically for Big Data. Its set of algorithms seems tiny compared to Weka’s, its documentation is spotty, and getting it to work can be a real headache (unless, of course, you’re using it through Algorithmia). However, as our results below show, if you have a lot of data to crunch, it may be the only game in town. Mahout is designed to scale using MapReduce, and while integration of MapReduce into its algorithms is neither complete nor easy to use, even in the single-machine case Mahout shows evidence of being more capable of handling large volumes of data.

So, which is better?

As you probably figured out already, it depends.

There are any number of ways to answer this question, but we opted to compare them by picking a popular algorithm present in both, and comparing performance on a well known and non-trivial machine learning task. Specifically, we applied random forests to the MNIST handwritten digit recognition dataset.

The Task

The MNIST dataset consists of handwritten digits. Note that we worked with a reduced subset of the dataset, with about 42,000 images for training and 28,000 for testing.

The Algorithm

Random forests are an ensemble classification method that enjoys great popularity due to its conceptual simplicity, good performance (both speed and accuracy for many applications), and resistance to overfitting. You can read more about them here or, of course, on Wikipedia. Despite differences in the implementations, all parameters apart from the number of trees were kept constant for the purpose of a fair comparison.


The Results

For accuracy, Weka is the clear winner, achieving its best accuracy at 99.39% using 250 trees, compared with Mahout’s best: 95.89% using 100 trees. It’s also worth noting that the number of trees had little effect on Mahout’s classification accuracy, which stayed at between 95 and 96 percent for all forest sizes tried. Runtimes for both were comparable.


This shouldn’t be terribly surprising, as Weka has been around longer and we would expect it to be at least somewhat better optimized, especially on the relatively small datasets for which it was specifically designed.

However, bear in mind that MNIST is easily small enough to fit on one machine. Weka’s superior accuracy won’t count for anything once your data outgrows this, and tools like Mahout will be your only option.

 The Moral of the Story

So, it’s just like we told you. When taking into account the whole scope of enterprise data problems, there is no clear winner or one-size-fits-all solution. If you have a little data and want to get the most out of it, use Weka. If you have a bunch of data, Mahout is your best (or perhaps only) choice, even if performance isn’t quite what you would like.

One of the key advantages of Algorithmia is that you can easily try both head to head against your data set with no setup.

 – Austin & Zeynep

Create your own machine learning powered RSS reader in under 30 minutes

As developers, one of our biggest “problems” is our voracious appetite for news. From Twitter to Hacker News to the latest funding on TechCrunch, it seems, at times, we cannot avoid the gravitational pull of our favorite news feeds. I know, at least for myself, this is ingrained in my routine: wake up, check Twitter, check TechCrunch, check The Verge, etc. I spend at least the first 30 minutes of every day reading feeds based on titles, and repeat this a couple more times through the day.

Get the code sample for this project here.

I recently discovered SkimFeed, which I love and call my “dashboard into nerd-dom”: basically, it is a single view of the major tech sites’ titles. However, I wanted more information on each article before deciding to click on one, so I thought: why not use text analysis algorithms as a more efficient way of consuming my feeds?

Building my very own text analysis API

There are a number of things that can be done as part of a text analysis API, but I decided to concentrate on the four elements I believed I could build the fastest:

  • Strip documents of unnecessary elements
  • Advanced topic extraction
  • Automatically summarize stories
  • Sentiment analysis

As all of these algorithms are already available in the Algorithmia API, I could piece them together without having to worry about servers, scaling, or even implementing the actual algorithms:

  • ScrapeRSS – Retrieves all the necessary elements from an RSS feed
  • Html2Text – Strips HTML elements and keeps the important text
  • AutoTag – Looks for keywords that represent the topic of any text submitted to it. (Latent Dirichlet Allocation)
  • SentimentAnalysis – Analyzes sentences for positive, neutral or negative connotation. Uses the Stanford NLP library.
  • Summarizer – Breaks content into sentences and extracts key sentences that represent the content’s topic. Uses classifier4j to ingest a URL and summarize its main contents

Now, it was just a question of making them work together (and it took ~200 lines of code).

Check it out:

Note: wow, our poor servers have been Reddit’d. If you see an “unable to find worker” error, rest assured we are working to get things back to normal – check back with us in a couple of minutes.

The process:

The first thing I needed to do was retrieve all the necessary elements from the RSS feeds (ScrapeRSS). Once I had located the main content I could strip all the unnecessary HTML (HTML2Text). Now I had a nice clean body of text to start doing analysis on.

Creating topic tags is an easy way of understanding an article at a very quick glance, so I fed our clean body of text through AutoTag. Now I had the RSS title and some topics; the next step was to summarize each link into 3 sentences max to complement the tags. Finally, and mostly for fun, I wanted to see if it’s true that most news is negative, so I added SentimentAnalysis.

 Done.
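Here is a rough Python sketch of how those calls chain together. The paths for ScrapeRSS, Html2Text, AutoTag and Summarizer, and the shape of ScrapeRSS’s output, are placeholders and assumptions; only the SentimentAnalysis path is taken from the Python walkthrough earlier on this blog.

import urllib2, json

def call(path, payload, token='<insert_auth_token>'):
    # Generic helper: POST a JSON payload to an Algorithmia algorithm and return its result.
    req = urllib2.Request('https://api.algorithmia.com/api' + path, json.dumps(payload),
                          {'Content-Type': 'application/json'})
    req.add_header('Authorization', token)
    return json.loads(urllib2.urlopen(req).read())['result']

def analyze_feed(feed_url):
    # Assumes ScrapeRSS returns a list of items with 'title' and 'link' fields.
    results = []
    for item in call('/<user>/ScrapeRSS', feed_url):          # placeholder path
        text = call('/<user>/Html2Text', item['link'])        # placeholder path
        results.append({
            'title':     item.get('title'),
            'tags':      call('/<user>/AutoTag', text),       # placeholder path
            'summary':   call('/<user>/Summarizer', text),    # placeholder path
            'sentiment': call('/nlp/SentimentAnalysis', text),
        })
    return results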

You can check out some of the code here (or just view-source on our demo page in the browser).

 


That’s one way to use the Algorithmia library to build your own text analysis API to analyze almost any data stream in under 30 minutes.

Cheers, Diego

An algorithmic approach to GitHub exploration

If you recall, in our last blog post we showed you how we used some of the algorithms in Algorithmia to generate topic tags for any URL. Internally, we used the topic generation algorithm to generate tags based on each algorithm’s description and, later, used these tags as part of our recommender algorithm.

With today’s post, we want to show you how easy it is to integrate this type of recommender algorithm (an algorithm already available in Algorithmia) into your own workflow by showing it in action with one of our favorite developer tools: GitHub.

As GitHub has grown to almost 50 million repositories since 2011 (per our calculations), discovering new repositories has become less and less straightforward when navigating outside the most popular or featured repositories. With this in mind, we thought of a way to make it easier to tackle some of this complexity using some of the algorithms available in Algorithmia.

So here you go, give us any GitHub repository (as long as it has a README) and we will recommend other repositories based on the information we extract from the README:

How we built this

  • The first step was to figure out the URL for every repository on GitHub. This might seem like a daunting task, but the folks over at GitHub have been nice enough to make their entire dataset available on Google BigQuery. You can head over there and generate a list of every public repository since 2011.
  • The second step was to check every public repository’s readme file and run it through Algorithmia’s topic analysis algorithm.
  • Once the topic analysis algorithm returns a set of tags, we save these tags to start generating the data model we will later use for our recommender algorithm as a mapping from URL -> [tags].
  • Repeat this a couple million times (not the easiest task, but we had a virtually unlimited-sized cluster to parallelize the work thanks to the Algorithmia platform), and voila: a tagged data model of the entire GitHub world.
  • Finally, it’s time to start working with our recommender algorithm. To start, we need to point it to the data model we built and then send the algorithm the new set of tags that we generated from the URL you provided us with. With these two inputs, the recommender algorithm returns a number of relevant repositories based on the automatically generated tags (a rough sketch of the tagging and recommendation calls follows this list).
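As referenced above, here is a minimal sketch of the tagging and recommendation steps for a single repository. The endpoint paths and the recommender’s input format are assumptions, the README strings are made up, and the “data model” is reduced to a plain URL -> [tags] dictionary:

import urllib2, json

def call(path, payload, token='<insert_auth_token>'):
    # POST a JSON payload to an Algorithmia algorithm and return its result.
    req = urllib2.Request('https://api.algorithmia.com/api' + path, json.dumps(payload),
                          {'Content-Type': 'application/json'})
    req.add_header('Authorization', token)
    return json.loads(urllib2.urlopen(req).read())['result']

def tags_for(readme_text):
    # Run a repository's README text through the topic analysis algorithm.
    return call('/<user>/AutoTag', readme_text)               # placeholder path

# The data model: a mapping from repository URL to its generated tags.
model = {
    'https://github.com/algorithmiaio/samples': tags_for('Sample apps built on the Algorithmia API ...'),
}

# Hand the recommender the model plus the tags of the repository you gave us;
# it returns a list of relevant repositories.  Path and payload shape are assumptions.
recommendations = call('/<user>/Recommender',
                       [model, tags_for('A fast linear algebra library for machine learning ...')])
print(recommendations)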

If you are interested in using any of these algorithms on your own site or application, we can help – these are all available in the Algorithmia API, just send us an email.

– Your friends at Algorithmia