Algorithmia Blog - Deploying AI at scale

Join Algorithmia and 600+ hackers at DubHacks on October 17th and 18th

image

We’re proud to be sponsoring DubHacks, the largest collegiate hackathon in the Pacific Northwest. Over 600 student developers and designers will gather at the University of Washington’s Seattle campus on October 17th and 18th to form teams, build projects, and create solutions to real-world problems.

Algorithmia will be there with swag, snacks, and free API access. Come say hi! 

In the meantime, check out our collection of algorithms available for use. How might you use our content recommendation, summarization, and sentiment analysis algorithms? Oh, and what might you do with our nudity detection algorithm?

LEARN MORE ABOUT DUBHACKS:

Mining Product Hunt Part II: Building a Recommender

This blog post is Part II of a series exploring the Product Hunt API. In addition to the Product Hunt and Algorithmia APIs, we use Aerobatic to quickly create the front end of our apps.

Our last post discussed how we acquired the Product Hunt dataset, how we designed a vote-ring detection algorithm, and ended with a live demo that analyzes new Product Hunt submissions for vote-rings. In this post, we will briefly explain how collaborative recommendation systems work (through the FP-Growth algorithm) and apply that approach to the Product Hunt dataset in a live demo. As a cherry on top, we made a Chrome extension that will augment your Product Hunt browsing with a list of recommendations using our approach. Read on for the results, demo, and code.

Note About Recommendation Engines

There are two broad approaches to recommendation engines: those based on collaborative filtering and those based on content. A hybrid approach marrying the two is also very common.

Collaborative Filtering recommenders depend on the way that users interact with the data – the number of overlapping decisions between users A and B (such as liking a post or buying an item) is used as an indicator to predict the likelihood that user A will make the same decision as user B on future topics. This is similar to Amazon’s “Other users also bought…”.

Content Based recommenders depend on the inherent attributes of an item – if user A likes item X, then the description/keywords/color of item X is used to predict the next item to recommend. This analysis can be extended to build a profile about what user A likes and dislikes. This is similar to Pandora’s Music Genome Project.

Here we are going to build a Collaborative Filtering recommendation engine that is specific to Product Hunt. There are many approaches to achieve this task and we will use the most straightforward one: Affinity Analysis (a variant of Association Rule Learning).

Understanding Association Rules

Businesses realized the power of Association Rules a long time ago, especially for up-selling and cross-selling. For example, a grocery shop might notice that people who buy diapers are also very likely to buy beer (i.e. a strong association), and therefore the grocery shop manager might decide to place the diapers close to the beer fridge to fuel impulse buying.
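To make "strong association" concrete, here is a minimal sketch of the two numbers usually attached to a rule such as diapers → beer: support (how often the items appear together) and confidence (how often beer shows up given diapers). The baskets below are made up for illustration and the function names are our own, not part of any Algorithmia algorithm.

```python
# Toy market baskets; the items and counts are invented for illustration.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, diapers and beer appear together in two of five baskets, and two of the three diaper baskets also contain beer.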

image

We looked into Algorithmia’s catalogue and found two algorithms providing exactly this function: a Weka-based implementation by Aluxian and a direct implementation by Paranoia. Both implement the Frequent Pattern Growth algorithm (FP-Growth), which is a popular method to build association rules through a divide-and-conquer strategy. We ran them side-by-side and decided to go with Paranoia’s implementation.

The algorithm takes a set of transactions as an input and produces weighted associations as an output. Each transaction is represented as a single line with comma-delimited items. Instead of customers and groceries, our transaction set was the upvotes that Product Hunt users made on all the 16,000+ posts. Each user was represented as a line, and each line contained the posts’ ids that received an upvote from that user.
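As a rough sketch of that input format, the snippet below groups upvotes by user and writes one comma-delimited line of post ids per user. The (user_id, post_id) pairs and the function name are hypothetical; the real ids would come from the downloaded votes.

```python
from collections import defaultdict

# Hypothetical (user_id, post_id) upvote pairs, for illustration only.
votes = [(7, 101), (7, 205), (8, 101), (9, 205), (8, 310), (7, 310)]


def build_transactions(vote_pairs):
    """Group upvoted post ids by user, one comma-delimited line per user."""
    posts_by_user = defaultdict(set)
    for user_id, post_id in vote_pairs:
        posts_by_user[user_id].add(post_id)
    lines = []
    for user_id in sorted(posts_by_user):
        post_ids = sorted(posts_by_user[user_id])
        lines.append(",".join(str(p) for p in post_ids))
    return lines


transaction_lines = build_transactions(votes)
print("\n".join(transaction_lines))
```

Note that the user id itself does not appear in the output: each line is just the set of posts that one user upvoted.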

image

We Have a Demo

We created a Product Hunt-specific wrapper around the FP-Growth algorithm, which we called Product Hunt Recommender. This algorithm has a copy of the Product Hunt dataset and updates it every 24 hours (therefore it works better on posts more than a few days old). It takes a single input (a post id) and returns up to five recommended posts.

If you already have an Algorithmia account, you should be able to experiment directly with the algorithm through the web console (only visible if you’re signed in).

It was extremely satisfying to see how well the algorithm worked – for example, applying it to the post ‘The Hard Thing About Hard Things’ (a book about entrepreneurship) returns four other books, all about entrepreneurship as well.

In a real-world scenario, a developer would run the FP-Growth algorithm on their dataset every so often and save the association rules somewhere permanent in the backend. Whenever an item is pulled out, the app would also look for strong associations as recommendations. Keep in mind that there are other routes towards a recommendation engine, such as content-based, clustering, or even hybrid solutions.
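The offline/online split described above can be sketched as follows. To keep the example short it uses plain pairwise co-upvote counting with a confidence threshold rather than FP-Growth, so treat it as a simplified stand-in; the post ids, threshold, and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical transactions: each inner list is the posts one user upvoted.
transactions = [
    ["hard-things", "zero-to-one", "lean-startup"],
    ["hard-things", "zero-to-one"],
    ["hard-things", "lean-startup", "photo-app"],
    ["photo-app", "camera-app"],
]


def build_rules(baskets, min_confidence=0.5):
    """Offline step: map each post to (other_post, confidence) pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = defaultdict(list)
    for (a, b), together in pair_counts.items():
        for source, target in ((a, b), (b, a)):
            score = together / item_counts[source]
            if score >= min_confidence:
                rules[source].append((target, score))
    for source in rules:
        # Strongest association first; ties broken by name for stable output.
        rules[source].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(rules)


def recommend(rules, post_id, limit=5):
    """Online step: a plain lookup in the saved rules."""
    return [target for target, _ in rules.get(post_id, [])[:limit]]


rules = build_rules(transactions)  # run periodically and persist the result
print(recommend(rules, "hard-things"))
```

The expensive part (`build_rules`) runs on a schedule, while `recommend` is a dictionary lookup that is cheap enough to call on every page view.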

Product Genius, the Chrome Extension

We thought it would be awesome to make a Chrome extension out of this that adds a new section to the sidebar of Product Hunt posts, which we titled “Other People Also Liked”. There’s already a “Similar Products” section built into Product Hunt, and we can’t definitively determine what method they used to implement it. One thing’s for sure: from our limited testing, we found numerous posts where Product Genius returned better results than the built-in version.

image

Install the extension from here and check it out for yourself:

We also found instances where Product Hunt’s internal recommender performed better: BitBound.

What Else Can We/You Do?

You can access the code for the Chrome extension, web demo, and the dataset itself from here. You can easily experiment with other approaches using the hundreds of algorithms within the Algorithmia catalogue, such as using Keyword-Set Similarity instead of FP-Growth or a hybrid approach using the two. Let us know if you have any ideas for other algorithms you want us to demonstrate on the dataset by tweeting us at @algorithmia.

Want to create your own analysis? Sign up and get 100,000 credits on us – a special Product Hunt promotion.

Mining Product Hunt: Detecting Vote-Rings

This blog post is Part I of a series exploring the Product Hunt API.

We like showcasing practical algorithms on interesting datasets. This post is part of a series of posts that explore some algorithms by applying them on the Product Hunt dataset. Since this is the first of the series, we’ll start with how we acquired the data, and then drill down on our approach for the vote-ring problem. Read on for the results and how you might build this for your web app or apply it on other datasets.

Data Acquisition and Pre-Processing

The first challenge is to acquire the data. We used Product Hunt’s API to download the posts and combined them into one big file. The API returns JSON representation of each post, which is not the best format for querying and manipulating, so we ran that file through the JSONtoCSV algorithm and we got 16,000+ records. We did the same for all the associated Comments, Users, and Votes and imported them into an SQLite database. Here’s what the tables look like:
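For readers who want to reproduce this step locally, here is a minimal sketch that flattens nested JSON posts into rows and loads them into SQLite. It skips the intermediate CSV file and goes straight from JSON to the database, and the field names are assumptions for illustration rather than the actual Product Hunt schema.

```python
import json
import sqlite3

# Hypothetical API payload; the field names are assumptions, not the
# actual Product Hunt schema.
raw = json.dumps([
    {"id": 1, "name": "Example App", "votes_count": 42,
     "user": {"id": 7}},
    {"id": 2, "name": "Another App", "votes_count": 3,
     "user": {"id": 9}},
])


def flatten(post):
    """Turn one nested JSON post into a flat row for a table."""
    return (post["id"], post["name"], post["votes_count"],
            post["user"]["id"])


conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, name TEXT, "
    "votes_count INTEGER, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [flatten(post) for post in json.loads(raw)],
)
conn.commit()

popular = conn.execute(
    "SELECT name FROM posts WHERE votes_count > 10"
).fetchall()
print(popular)
```

Once the rows are in a table, questions like "which posts have more than ten votes" become one-line queries instead of loops over JSON.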

image

We also created an IRB (Interactive Ruby Shell) environment that wraps the tables with Ruby ActiveRecord classes. Because Algorithmia already has a Ruby gem, these wrappers make the task of experimenting with different algorithms exceptionally easy. You can download a copy of the database along with the ActiveRecord wrappers from here (25 MB). Run ‘ruby console.rb’.

Identifying Vote-Rings

Sometimes, users in social networks might reach out to their friends (or even create fake accounts) in order to “seed vote” their submissions. This is obviously a form of gaming the system, and possible solutions to this problem depend on your data structure and how your users interact with it. Many platforms rely on IP addresses and browser agents among other things. Given that we do not have that data available, we had to look elsewhere.

One of our engineers came up with a very plausible solution to this problem: assign a collusion ratio to every possible combination of users and posts that will depend exclusively on voting activity. This ratio is used as a proxy for a collusion probability. The closer it is to 1.0, the more likely that the subject group of users are colluding.

image

Another way to visualize the formula above is to describe it in a plot: the horizontal axis represents all the possible posts in the dataset and the vertical axis represents all the possible users. The dots at the intersections represent votes made by users. If we choose a random group of users (horizontal green bar) and posts (vertical green bar), you can easily see how the ratio works by comparing users’ activity across other posts as well.

image

As a quick analysis, we computed the above ratio for posts with fewer than 10 votes. We start by computing the ratio for the group of users who voted for a given post. We then remove the first user and examine the new ratio: if the ratio drops, we bring them back in; if it increases, we leave them out. We repeat this for each remaining user.
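The greedy pass described above can be sketched as follows. The ratio function here is only a placeholder (the share of a group's total votes that went to the post in question) and is not the collusion formula from this post; the vote data and all names are hypothetical. The point of the sketch is the remove-and-compare loop, which works with any ratio function you plug in.

```python
# Hypothetical vote data: user id -> set of post ids that user upvoted.
votes_by_user = {
    "u1": {"p1"},
    "u2": {"p1"},
    "u3": {"p1", "p2", "p3", "p4"},
    "u4": {"p1", "p5", "p6"},
}


def stand_in_ratio(users, post, votes):
    """Placeholder score in [0, 1]; NOT the formula from the post.

    Share of the group's total votes that went to `post`.
    """
    total = sum(len(votes[u]) for u in users)
    on_post = sum(post in votes[u] for u in users)
    return on_post / total if total else 0.0


def greedy_ring(post, votes, ratio=stand_in_ratio):
    """Drop voters one at a time, keeping each removal that raises the ratio."""
    group = [u for u in sorted(votes) if post in votes[u]]
    best = ratio(group, post, votes)
    for user in list(group):
        if len(group) == 1:
            break  # never shrink the group to nothing
        trial = [u for u in group if u != user]
        score = ratio(trial, post, votes)
        if score > best:
            group, best = trial, score  # ratio increased: leave the user out
        # otherwise the user stays in the group
    return group, best


ring, score = greedy_ring("p1", votes_by_user)
print(ring, score)
```

With this toy data the two accounts that only ever voted for "p1" survive the pass, while the users with broader voting histories are dropped.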

We ran the data through this algorithm, which we called SimpleVoteRingDetection. Clicking on any post on the left will show the subset of voters on that post with the highest collusion ratio – which is the most likely vote-ring pertaining to the post.

We removed users’ names to protect the innocent. Instead, we’re showing random fruit names prepended to the real users’ ids. You’ll notice that in many instances the colluding users have consecutive user ids, meaning that they (likely) created the accounts within the same time frame, which increases the likelihood that those are fake accounts. Having said that, we’re happy to report that fewer than 1% of the total posts on Product Hunt were subject to vote-rings (defined as having a collusion ratio higher than 0.5).

In the demonstration above, we limited our dataset to 20 votes per post, which is sufficient to illustrate our approach to the problem without overloading the Product Hunt API. The same algorithm can be easily applied to other contexts, whether internally within Product Hunt or in other applications. In its current form, the tool works great for flagging suspicious activity to prompt further investigation. The data is refreshed every 24 hours.

Until next time…

Our next post will be about building a recommendation engine for the dataset. Until then, feel free to play with the pre-processed data from here, and definitely let us know if you have any ideas for other algorithms you want us to demonstrate on the dataset.

Want to create your own analysis? Sign up and get 100,000 credits on us – a special Product Hunt promotion.

Happy coding!

-Ahmad

Using artificial intelligence to detect nudity

If there’s one thing the internet is good for, it’s racy material. This is a headache for a number of reasons, including a) it tends to show up where you’d rather it wouldn’t, like forums, social media, etc. and b) while humans generally know it when we see it, computers don’t so much. We here at Algorithmia decided to give this one a shot. Try out our demo at isitnude.com.

The Skinny

We based our algorithm on nude.py by Hideo Hattori, and on An Algorithm For Nudity Detection (R. Ap-apid, De La Salle University). The original solution relies on finding which parts of the image are skin patches, and then doing a number of checks on the relative geometry and size of these patches to figure out if the image is racy or not. This is a good place to start but isn’t always that accurate, since there are plenty of images with lots of skin that are perfectly innocent – think a closeup of a hand or face or someone wearing a beige dress. You might say that leaning too much on just color leaves the method, well, tone-deaf. To do better, you need to combine skin detection with other tricks.

image

Facing the problem

Our contribution to this problem is to detect other features in the image and use these to make the previous method more fine-grained. First, we get a better estimate of skin tone using a combination of what we learned from Human Computer Interaction Using Hand Gestures and what we can get from the image using a combination of OpenCV’s nose detection algorithm and face detection algorithm, both of which are available on Algorithmia. Specifically, we find bounding polygons for the face and for the nose, if we can get them, then get our skin tone ranges from a patch inside the former and outside the latter. This helps limit false positives. Face detection also allows us to detect the presence of multiple people in an image, thus customizing skin tones for each individual and re-adjusting thresholds so that the presence of multiple people in the image does not cause false positives. If you want to integrate this into your own application, you can find the magic right here.
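To illustrate the idea of sampling skin tone from the face region while excluding the nose, here is a simplified pure-Python sketch. It uses rectangles instead of polygons, a tiny synthetic image, hard-coded boxes where a detector would normally supply them, and a min/max range with a fixed margin; all of those are our assumptions for illustration, not the algorithm’s actual implementation.

```python
# Tiny synthetic "image": rows of (r, g, b) pixels. Boxes are
# (left, top, right, bottom) with right/bottom exclusive. In practice the
# boxes would come from face and nose detectors; here they are hard-coded.
SKIN, WALL = (200, 150, 120), (40, 60, 90)
image = [[WALL] * 6 for _ in range(6)]
for y in range(1, 5):
    for x in range(1, 5):
        image[y][x] = SKIN
image[2][2] = image[2][3] = (230, 180, 160)  # differently lit "nose" pixels

face_box = (1, 1, 5, 5)
nose_box = (2, 2, 4, 3)


def inside(box, x, y):
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


def skin_range(img, face, nose, margin=10):
    """Per-channel (low, high) bounds from face pixels outside the nose."""
    samples = [
        img[y][x]
        for y in range(len(img))
        for x in range(len(img[0]))
        if inside(face, x, y) and not inside(nose, x, y)
    ]
    return [
        (min(p[c] for p in samples) - margin,
         max(p[c] for p in samples) + margin)
        for c in range(3)
    ]


def skin_fraction(img, bounds):
    """Share of all pixels whose channels all fall inside the bounds."""
    pixels = [p for row in img for p in row]
    hits = sum(
        all(lo <= p[c] <= hi for c, (lo, hi) in enumerate(bounds))
        for p in pixels
    )
    return hits / len(pixels)


bounds = skin_range(image, face_box, nose_box)
print(bounds, skin_fraction(image, bounds))
```

Because the range is learned from the person in the picture rather than from a fixed global threshold, the same code adapts to different skin tones; with several faces you would compute one range per face.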

Next Steps

This is an improvement, but it is hardly a final answer to the problem. There are countless techniques that can be used in place of or combined with the ones we’ve used to make an even better solution. For instance, you could train a convolutional neural network on problematic images, much like we did for Digit Recognition. Algorithmia was designed to enable this kind of high-level tinkering – we hope you’ll visit us soon and give it a try.

Integrating Algorithmia into iOS

Today’s blog post is brought to you by one of our community members, @sprague. Thanks for submitting!

The same simple Algorithmia API that works with Python, Java, Scala, and others, is easy for iOS programming too.

This short lesson assumes you have a basic knowledge of iOS programming: enough to write a simple single-view app using the Storyboard. (If not, start with Apple’s documentation here). To run this example, all you need is a Mac and a copy of Apple’s (free) Xcode development environment.

The Algorithmia API works through simple HTTP POST commands. Fortunately, iOS already provides several powerful networking object classes that make that very easy:

  • NSMutableURLRequest sets up the HTTP request with some straightforward and obvious methods, like setHTTPMethod:@"POST" to tell the server that you want to post some data.
  • NSURLSession is a powerful class that lets you download the content via HTTP, including in the background or even while the application is suspended. Fortunately, the methods to control the download behavior are pretty straightforward.
  • NSURLSessionDataTask is a related class specifically for getting data via HTTP. You create one from an instance of NSURLSession, passing an NSMutableURLRequest and a handler that describes what to do when it receives a response from the server.
  • Finally, NSJSONSerialization is a handy class that will convert between JSON and native iOS dictionary or array types. Read Apple’s class reference documentation to see how much work this will save you!

Example: POST a string to the Algorithmia API

The best way to understand is with a simple example. Here’s a short program to send a string representation of an integer to the Algorithmia isPrime API, to find whether an input number is prime or not.

Here’s the opening screen:

image

Just a normal text input box (UITextField), a button you push to submit the text to the server, and a few labels to show what’s happening. That’s all there is to the UI, and you should be able to whip this up quickly yourself in Storyboard.

As for the guts of the app, we need to specify three things:

  • The algorithm path/URI, in this case ‘diego/isPrime’
  • Your API key, which you can get from the Profile/Credentials page
  • The right HTTP headers, as described in the API Docs – Getting Started.
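As a language-neutral view of how those three pieces fit together, here is a small Python sketch that builds (but does not send) the POST request; each part maps directly onto an NSMutableURLRequest call in the iOS app. The URL prefix and the Authorization header format are assumptions on our part, and the API key is a placeholder, so check the API docs for the exact values.

```python
import json
import urllib.request

# All three values are placeholders or assumptions; take the real ones
# from the API docs and your Profile/Credentials page.
BASE_URL = "https://api.algorithmia.com/v1/algo/"  # assumed URL prefix
ALGO_PATH = "diego/isPrime"                        # path named in the post
API_KEY = "YOUR_API_KEY"                           # placeholder


def build_request(number_text):
    """Build (but do not send) the POST request for one input string."""
    body = json.dumps(number_text).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ALGO_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": API_KEY,  # assumed header scheme
        },
    )


request = build_request("561")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```

The body is the input string encoded as JSON, which is the same job NSJSONSerialization does on the iOS side.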

That’s it! All the networking code you need to access the Algorithmia API is right there. Perhaps the most important thing to point out here is the URI format and the HTTP Headers. Whenever you’re in doubt about these details, you should browse to a random algorithm and inspect the cURL parameters given in the ‘API Usage Examples’ section.

You can download a working copy of the isPrime App from here – see if it works with the Carmichael numbers!
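Why Carmichael numbers? They are composite numbers that nevertheless pass the Fermat primality test for every base coprime to them, so they are a handy stress test for any primality checker. The sketch below shows the effect locally; it says nothing about how the isPrime algorithm itself is implemented, and the function names are our own.

```python
from math import gcd


def passes_fermat(n):
    """True if a**(n-1) % n == 1 for every base a coprime to n."""
    return all(pow(a, n - 1, n) == 1 for a in range(2, n) if gcd(a, n) == 1)


def is_prime(n):
    """Trial division: slow, but not fooled by Carmichael numbers."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


for n in (561, 1105, 1729):
    print(n, passes_fermat(n), is_prime(n))
```

Each of the three numbers passes the Fermat check yet fails trial division, so they are good inputs to try in the app.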

image