This blog post is Part I of a series exploring the Product Hunt API.
We like showcasing practical algorithms on interesting datasets. This series explores some of those algorithms by applying them to the Product Hunt dataset. Since this is the first post in the series, we’ll start with how we acquired the data, and then drill down on our approach to the vote-ring problem. Read on for the results and for how you might build this into your web app or apply it to other datasets.
Data Acquisition and Pre-Processing
The first challenge is to acquire the data. We used Product Hunt’s API to download the posts and combined them into one big file. The API returns a JSON representation of each post, which is not the best format for querying and manipulating, so we ran that file through the JSONtoCSV algorithm and got 16,000+ records. We did the same for all the associated Comments, Users, and Votes, and imported everything into an SQLite database. Here’s what the tables look like:
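As a rough illustration of the JSON-to-CSV step, here is a minimal local sketch in Ruby using only the standard library. It is a stand-in for the idea, not the JSONtoCSV algorithm itself, and the field names in the sample payload are hypothetical rather than the actual Product Hunt API schema. The helper names (`sample_posts_json`, `flatten_record`, `json_to_csv`) are also made up for this example.

```ruby
require "json"
require "csv"

# Hypothetical payload: these field names are illustrative assumptions,
# not the real Product Hunt API schema.
def sample_posts_json
  <<~PAYLOAD
    [
      {"id": 1, "name": "Alpha", "votes_count": 12, "user": {"id": 7}},
      {"id": 2, "name": "Beta, Inc.", "votes_count": 3, "user": {"id": 9}}
    ]
  PAYLOAD
end

# Flatten nested hashes into one level, joining keys with "_"
# (for example {"user" => {"id" => 7}} becomes {"user_id" => 7}).
def flatten_record(record, prefix = nil)
  record.each_with_object({}) do |(key, value), flat|
    name = prefix ? "#{prefix}_#{key}" : key.to_s
    if value.is_a?(Hash)
      flat.merge!(flatten_record(value, name))
    else
      flat[name] = value
    end
  end
end

# Convert a JSON array of objects into CSV text with a header row.
# Columns are the union of all keys, in order of first appearance.
def json_to_csv(json_text)
  rows = JSON.parse(json_text).map { |record| flatten_record(record) }
  headers = rows.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << headers.map { |h| row[h] } }
  end
end

puts json_to_csv(sample_posts_json)
```

Flattening nested objects into prefixed columns keeps each record on one row, which is what makes the later import into a relational table straightforward.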
We also created an IRB (Interactive Ruby Shell) environment that wraps the tables with Ruby ActiveRecord classes. Because Algorithmia already has a Ruby gem, these wrappers make the task of experimenting with different algorithms exceptionally easy. You can download a copy of the database along with the ActiveRecord wrappers from here (25 MB). To start the console, run ‘ruby console.rb’.
Sometimes, users in social networks reach out to their friends (or even create fake accounts) in order to “seed vote” their submissions. This is obviously a form of gaming the system, and possible solutions to this problem depend on your data structure and how your users interact with it. Many platforms rely on IP addresses and browser agents, among other things. Given that we do not have that data available, we had to look elsewhere.
One of our engineers came up with a very plausible solution to this problem: assign a collusion ratio to every possible combination of users and posts, one that depends exclusively on voting activity. This ratio is used as a proxy for a collusion probability. The closer it is to 1.0, the more likely it is that the group of users in question is colluding.
Another way to visualize the formula above is to describe it in a plot: the horizontal axis represents all the possible posts in the dataset and the vertical axis represents all the possible users. The dots at the intersections represent votes made by users. If we choose a random group of users (horizontal green bar) and posts (vertical green bar), you can easily see how the ratio works by comparing the users’ activity on the selected posts with their activity across other posts as well.
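To make the idea of a users-by-posts ratio concrete, here is a minimal Ruby sketch. The exact formula is not reproduced in this text, so the definition below is an assumption made purely for illustration: a Jaccard-style ratio that divides the votes inside the selected block (chosen users on chosen posts) by all votes either cast by those users or received by those posts. The names `collusion_ratio` and `sample_votes` are hypothetical.

```ruby
require "set"

# Toy vote log as [user_id, post_id] pairs (made-up data).
# Users 1-3 vote only on posts "a" and "b"; user 4 votes widely.
def sample_votes
  [
    [1, "a"], [2, "a"], [3, "a"],
    [1, "b"], [2, "b"], [3, "b"],
    [4, "a"], [4, "c"], [4, "d"], [4, "e"]
  ]
end

# Jaccard-style collusion ratio (an illustrative assumption, not the
# original formula). Assumes each user votes at most once per post.
def collusion_ratio(votes, users, posts)
  users = users.to_set
  posts = posts.to_set
  by_group = votes.count { |u, _p| users.include?(u) }   # votes cast by the group
  on_posts = votes.count { |_u, p| posts.include?(p) }   # votes received by the posts
  in_block = votes.count { |u, p| users.include?(u) && posts.include?(p) }
  union = by_group + on_posts - in_block
  union.zero? ? 0.0 : in_block.to_f / union
end

tight = collusion_ratio(sample_votes, [1, 2, 3], ["a", "b"])
loose = collusion_ratio(sample_votes, [4], ["a"])
puts "tight group: #{tight.round(3)}, broad voter: #{loose.round(3)}"
```

Under this assumed definition, a group that votes almost exclusively on the same few posts scores close to 1.0, while a user whose votes are spread across many other posts pulls the ratio down.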
As a quick analysis, we computed the above ratio for posts having fewer than 10 votes. We start by computing the ratio for the group of users who voted for a given post. We then remove the first user and examine the new ratio: if the ratio drops, we bring them back in; if it increases, we leave them out. We repeat this for each user in turn.
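The pruning loop described above can be sketched in a few lines of Ruby. This is a sketch under stated assumptions, not the SimpleVoteRingDetection implementation: the ratio is passed in as a callable, the demo ratio is an assumed Jaccard-style definition for a single post, ties keep the user in the group, and the loop never empties the group. The names `prune_voters`, `ratio_for_post`, and `demo_votes` are hypothetical.

```ruby
# Toy vote log as [user_id, post_id] pairs (made-up data).
# Users 1-3 vote only on post "p"; users 4 and 5 also vote elsewhere.
def demo_votes
  [
    [1, "p"], [2, "p"], [3, "p"], [4, "p"], [5, "p"],
    [4, "q"], [4, "r"], [4, "s"],
    [5, "q"], [5, "t"], [5, "u"]
  ]
end

# Assumed ratio for one post: votes by the group on the post, divided by
# all votes cast by the group or received by the post.
def ratio_for_post(votes, post)
  on_post = votes.count { |_u, p| p == post }
  lambda do |users|
    by_group = votes.count { |u, _p| users.include?(u) }
    in_block = votes.count { |u, p| p == post && users.include?(u) }
    in_block.to_f / (by_group + on_post - in_block)
  end
end

# Greedy pruning: try dropping each voter once, and keep the drop only
# when the ratio strictly increases. Returns [remaining_group, ratio].
def prune_voters(voters, ratio)
  group = voters.dup
  best = ratio.call(group)
  voters.each do |user|
    candidate = group - [user]
    next if candidate.empty? # assumption: never prune down to nobody
    score = ratio.call(candidate)
    if score > best
      group = candidate
      best = score
    end
  end
  [group, best]
end

group, score = prune_voters([1, 2, 3, 4, 5], ratio_for_post(demo_votes, "p"))
puts "ring: #{group.inspect}, ratio: #{score.round(2)}"
# prints: ring: [1, 2, 3], ratio: 0.6
```

Because each user is tested only once and in a fixed order, this is a single greedy pass rather than an exhaustive search over all subsets, so the result can depend on the order of the voters.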
We ran the data through this algorithm, which we called SimpleVoteRingDetection. Clicking on any post on the left will show the subset of voters on that post with the highest collusion ratio – which is the most likely vote-ring pertaining to the post.
We removed users’ names to protect the innocent. Instead, we’re showing random fruit names prepended to the real users’ ids. You’ll notice that in many instances the colluding users have consecutive user ids, meaning that they (likely) created the accounts within the same time frame, which increases the likelihood that those are fake accounts. Having said that, we’re happy to report that fewer than 1% of all posts on Product Hunt were subject to vote-rings (defined as having a collusion ratio higher than 0.5).
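The consecutive-id observation is easy to check mechanically. The short Ruby sketch below groups a list of user ids into runs of consecutive values; the `consecutive_runs` helper, its `min_length` cutoff, and the sample ids are all hypothetical and only meant to illustrate the check.

```ruby
# Group user ids into runs of consecutive values and keep only runs
# with at least `min_length` ids (the cutoff is a hypothetical parameter).
def consecutive_runs(user_ids, min_length = 2)
  user_ids.uniq.sort
          .slice_when { |a, b| b != a + 1 }
          .select { |run| run.length >= min_length }
end

p consecutive_runs([1041, 87, 1042, 1043, 5120, 88])
# prints: [[87, 88], [1041, 1042, 1043]]
```

A run of consecutive ids is only a hint that accounts were created close together, so it works best as a secondary signal alongside the collusion ratio rather than as evidence on its own.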
In the demonstration above, we limited our dataset to 20 votes per post, which is sufficient to illustrate our approach to the problem without overloading the Product Hunt API. The same algorithm can easily be applied in other contexts, whether internally within Product Hunt or in other applications. In its current form, the tool works well for flagging suspicious activity to prompt further investigation. The data is refreshed every 24 hours.
Until next time…
Our next post will be about building a recommendation engine for the dataset. Until then, feel free to play with the pre-processed data from here, and definitely let us know if you have any ideas for other algorithms you want us to demonstrate on the dataset.
Want to create your own analysis? Sign up and get 100,000 credits on us, a special Product Hunt promotion.