All posts by Besir Kurtulmus

Introduction to Image Saliency Detection

An algorithm for saliency detection

Whenever we look at a photo or watch a video, we always notice certain things over others. This could be a person’s face, or a sports car, or even a symbol that is located in the corner of a video.
Read More…

How to Add Sentiment Analysis to Your Project in Five Lines of Code

An API for Sentiment Analysis

Sentiment analysis is the process of identifying the underlying opinion or subjectivity of a given text. It generally categorizes these opinions on a scale from negative to positive. Some sentiment analysis algorithms include the neutral sentiment, too.

These sentiment scores are generally used to identify the level of satisfaction with a given product or service. This helps companies and organizations better understand their users, and make impactful changes to their products.

Read More…

Deep Filter: Getting Started With Style Transfer

Algorithm Spotlight: Deep Filter

There’s been a lot of interest lately in a deep learning technique called style transfer, which is the process of reimagining one image in the style of another.

It’s a fun way to convert photos or images into the style of a masterpiece painting, drawing, etc. For instance, you could apply the artistic style of Van Gogh’s Starry Night to an otherwise boring photo of the Grand Canal in Venice, Italy.
Read More…

A Data Science Approach to Writing a Good GitHub README

Octocat and the GitHub README Analyzer

What makes a good GitHub README? It’s a question almost every developer has asked as they’ve typed:

 git init
 git add README.md
 git commit -m "first commit"

As far as we can tell, nobody has a clear answer for this, yet you typically know a good GitHub README when you see it. The best explanation we’ve found is the README Wikipedia entry, which offers a handful of suggestions.

We set out to flex our data science muscles, and see if we could come up with an objective standard for what makes a good GitHub README using machine learning. The result is the GitHub README Analyzer demo, an experimental tool to algorithmically improve the quality of your GitHub READMEs.

GitHub README Analyzer Demo

The complete code for this project can be found in our IPython Notebook here.


Thinking About the GitHub README Problem

Understanding the contents of a README is a problem for NLP, and it’s a particularly hard one to solve.

To start, we made some assumptions about what makes a GitHub README good:

  1. Popular repositories probably have good READMEs
  2. Popular repositories should have more stars than unpopular ones
  3. Each programming language has a unique pattern, or style

With those assumptions in place, we set out to collect our data, and build our model.


Approaching the Solution

Step One: Collecting Data

First, we needed to create a dataset. We decided to use the 10 most popular programming languages on GitHub by number of repositories: JavaScript, Java, Ruby, Python, PHP, HTML, CSS, C++, C, and C#. We used the GitHub API to retrieve a count of repos by language, and then grabbed the READMEs for the 1,000 most-starred repos per language. We only used READMEs that were written in either Markdown or reStructuredText.

We scraped the first 1,000 repositories for each language. Then we removed any repository that didn’t have a GitHub README, was encoded in an obscure format, or was in some way unusable. After removing the bad repos, we ended up with, on average, 878 READMEs per language.
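
As a rough sketch of this collection step, here’s how you might pull the most-starred repositories per language from GitHub’s public search API using the requests library (pagination, rate limiting, and the follow-up call to fetch each README are simplified or omitted; the function and variable names are ours, not from the original notebook):

    import requests

    GITHUB_SEARCH = "https://api.github.com/search/repositories"
    LANGUAGES = ["JavaScript", "Java", "Ruby", "Python", "PHP",
                 "HTML", "CSS", "C++", "C", "C#"]

    def top_starred_repos(language, pages=10, per_page=100, token=None):
        """Fetch up to pages * per_page most-starred repos for one language."""
        headers = {"Authorization": "token " + token} if token else {}
        repos = []
        for page in range(1, pages + 1):
            resp = requests.get(GITHUB_SEARCH, headers=headers, params={
                "q": "language:" + language,
                "sort": "stars",
                "order": "desc",
                "per_page": per_page,
                "page": page,
            })
            resp.raise_for_status()
            repos.extend(resp.json()["items"])
        return repos

    # Example: full names of the ~1,000 most-starred Python repos
    python_repos = [r["full_name"] for r in top_starred_repos("Python")]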

Step Two: Data Scrubbing

We then needed to convert all the READMEs into a common format for further processing, and we chose to convert everything to HTML. When we did this, we also removed common words from a stop word list, and then stemmed the remaining words to find the common base, or root, of each word. We used the NLTK Snowball stemmer for this. We also kept the original, unstemmed words to use later when making recommendations from our model in the demo (e.g. instead of recommending “instal” we could recommend “installation”).

This preprocessing ensured that all the data was in a common format and ready to be vectorized.
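
In code, that scrubbing step looks roughly like the sketch below, assuming NLTK’s English stop word list and Snowball stemmer, with BeautifulSoup used to pull plain text out of the rendered HTML (the Markdown/reStructuredText-to-HTML conversion itself is omitted):

    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords          # requires nltk.download("stopwords")
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))

    def preprocess(readme_html):
        """Return (stemmed tokens, original tokens) for one README."""
        text = BeautifulSoup(readme_html, "html.parser").get_text()
        tokens = [t.lower() for t in text.split() if t.isalpha()]
        tokens = [t for t in tokens if t not in stop_words]
        stemmed = [stemmer.stem(t) for t in tokens]
        # Keep the unstemmed tokens too, so recommendations can show
        # "installation" instead of "instal"
        return stemmed, tokens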

Step Three: Feature Extraction

Now that we had a clean dataset to work with, we needed to extract features. We used scikit-learn, a Python machine learning library, because it’s easy to learn, simple, and open source.

We extracted five features from the preprocessed dataset:

  1. 1-grams from the headers (<h1>, <h2>, <h3>)
  2. 1-grams from the text of the paragraphs (<p>)
  3. Count of code snippets (<pre>)
  4. Count of images (<img>)
  5. Count of characters, excluding HTML tags

To build feature vectors for our README headers and paragraphs, we tried TF-IDF, Count, and Hashing vectorizers from scikit-learn. We ended up using TF-IDF, because it had the highest accuracy rate. We used 1-grams, because anything more than that became computationally too expensive, and the results weren’t noticeably improved. If you think about it, this makes sense. Section headers tend to be single words, like Installation, or Usage.

Since code snippets, images, and total characters are numerical values by nature, we simply cast each of them into a 1×1 matrix.
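
Roughly, the vectorization looked like the sketch below: scikit-learn’s TfidfVectorizer restricted to 1-grams for the text features, and plain column vectors for the counts (the header_texts, paragraph_texts, and *_counts lists stand in for data produced by the preprocessing step):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Text features: 1-grams only, from headers and paragraph text
    header_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
    paragraph_vectorizer = TfidfVectorizer(ngram_range=(1, 1))

    X_headers = header_vectorizer.fit_transform(header_texts)        # one string per README
    X_paragraphs = paragraph_vectorizer.fit_transform(paragraph_texts)

    # Numerical features: each count becomes a single-element row (a 1x1 matrix per README)
    X_code = np.array(code_counts).reshape(-1, 1)
    X_images = np.array(image_counts).reshape(-1, 1)
    X_chars = np.array(char_counts).reshape(-1, 1)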

Step Four: Benchmarking

We benchmarked 11 different classifiers against each other for all five features, and then repeated this for all 10 programming languages. In reality, there was an 11th “language,” which was the aggregate of all 10 languages. We later used this 11th (i.e. “All”) language to train a general model for our demo, to handle situations where a user analyzes a README for a project written in a language outside the top 10 languages on GitHub.

We used the following classifiers: Logistic Regression, Passive Aggressive, Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Nearest Centroid, Label Propagation, SVC, Linear SVC, Decision Tree, and Extra Tree.

We would have used NuSVC as well, but as Stack Overflow explains, it has no direct interpretation, which prevented us from using it.

In all, we created 605 different models (11 classifiers × 5 features × 11 languages; remember, we created an “All” language to produce an overall model, hence 11). We picked the best model per feature per programming language, which means we ended up using 55 unique models (5 features × 11 languages).

Feature Extraction Model Mean Squared Error

View the interactive chart here. Find the detailed error rates broken out by language here.

Some notes on how we trained: 

  1. Before we started, we divided our dataset into a training set and a testing set: 75% and 25%, respectively.
  2. After training the initial 605 models, we benchmarked their accuracy scores against each other.
  3. We selected the best models for each feature, and then we saved (i.e. pickled) the model. Persisting the model file allowed us to skip the training step in the future.

To determine the error rate of the models, we calculated their mean squared errors. The model with the lowest mean squared error was selected for each corresponding feature and language; 55 models were selected in total.
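
The selection loop amounted to something like the following sketch for one feature and one language, with X as the feature matrix and y as the star counts (only a few of the 11 classifiers are shown; the rest follow the same pattern):

    import pickle
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

    classifiers = {
        "logistic_regression": LogisticRegression(),
        "passive_aggressive": PassiveAggressiveClassifier(),
        "multinomial_nb": MultinomialNB(),
        "bernoulli_nb": BernoulliNB(),
        "decision_tree": DecisionTreeClassifier(),
        "extra_tree": ExtraTreeClassifier(),
        # ...plus the remaining classifiers listed above
    }

    # 75% / 25% train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    best_name, best_model, best_mse = None, None, float("inf")
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        mse = mean_squared_error(y_test, clf.predict(X_test))
        if mse < best_mse:
            best_name, best_model, best_mse = name, clf, mse

    # Persist (pickle) the winning model so training can be skipped later
    with open("best_model.pkl", "wb") as f:
        pickle.dump(best_model, f)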

Step Five: Putting it All Together

With all of this in place, we can now take any GitHub repository, and score the README using our models.

For non-numerical features, like Headers and Paragraphs, the model tries to make recommendations that should improve the quality of your README by adding, removing, or changing every word, and recalculating the score each time.

For the numerical-based features, like the count of <pre>, <img> tags, and the README length, the model incrementally increases and decreases values for each feature. It then re-evaluates the score with each iteration to determine an optimized recommendation. If it can’t produce a higher score with any of the changes, no recommendation is given.
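
To illustrate that search over a numerical feature, here is a hedged sketch; score_readme is a hypothetical helper that scores a candidate value with the trained model for that feature:

    def recommend_count(current_value, score_readme, max_delta=10):
        """Nudge a numeric feature (e.g. image count) up and down, keeping
        the change only if it improves the model's score."""
        best_value = current_value
        best_score = score_readme(current_value)
        for delta in range(1, max_delta + 1):
            for candidate in (current_value + delta, max(0, current_value - delta)):
                candidate_score = score_readme(candidate)
                if candidate_score > best_score:
                    best_value, best_score = candidate, candidate_score
        # No recommendation if no change beats the current score
        return best_value if best_value != current_value else None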

Try out the GitHub README Analyzer now.


Conclusion

Some of our assumptions proved to be true, while some were off. We found that our assumptions about headers and paragraph text correlated with popular repositories. However, this wasn’t true for the length of a README, or for the count of code samples and images.

The reason for this may stem from the fact that n-gram based features, like Headers and Paragraphs, train on hundreds or thousands of variables per document, whereas the numerical features train on only one variable per document. Hence, ~1,000 repositories per programming language were enough to train models for the n-gram based features, but not enough to train decent numerical-based models. Another reason might be that the numerical features had noisy data, or simply didn’t correlate with the number of stars a project had.

When we tested our models, we initially found they kept recommending that we significantly shorten every README (e.g. one README was to be shortened from 408 characters down to 6!) and remove almost all of the images and code samples. This was counterintuitive, so after plotting the data, we learned that a big part of our dataset had zero images, zero code snippets, or fewer than 1,500 characters.

Plot of READMEs with counts of zero

This explains why our model initially behaved so poorly for the length, code sample, and image features. To account for this, we took an additional step in our preprocessing and removed any README that was shorter than 1,500 characters, contained zero images, or contained zero code samples. This dramatically improved the results for the image and code sample recommendations, but the character length recommendation did not improve significantly. In the end, we chose to remove the length recommendation from our demo.


Think you can come up with a better model? Check out our IPython Notebook covering all of the steps from above, and tweet us @Algorithmia with your results.


Below is the numerical feature distribution for images, code snippets, and character length across all READMEs. Find the detailed histograms broken out by language here.

A histogram of the count of images, code snippets, and character length

Analyzing Tweets Using Social Sentiment Analysis

An Introduction to Social Sentiment and Analyzing Tweets

Sentiment Analysis, also known as opinion mining, is a powerful tool you can use to build smarter products. It’s a natural language processing algorithm that gives you a general idea about the positive, neutral, and negative sentiment of texts. Sentiment analysis is often used to understand the opinion or attitude in tweets, status updates, movie/music/television reviews, chats, emails, comments, and more. Social media monitoring apps and companies all rely on sentiment analysis and machine learning to assist them in gaining insights about mentions, brands, and products.

For example, sentiment analysis can be used to better understand how your customers on Twitter perceive your brand online in real time. Instead of going through hundreds, or maybe thousands, of tweets by hand, you can easily group positive, neutral, and negative tweets together, and sort them based on their confidence values. This can save you countless hours and lead to actionable insights.

Additionally, you can use another natural language processing algorithm called Latent Dirichlet Allocation (LDA). This powerful algorithm can extract a mixture of topics for a given set of documents. LDA doesn’t give you the topic name, but it gives you a very good idea about what the topic is. Each topic is represented by a bag of words, which tells you what the topic is about. As an example, if the bag of words is “sun, nasa, earth, moon”, you can tell the topic is probably related to space.
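
If you want to experiment with LDA directly, a minimal call through the Algorithmia Python client looks roughly like this (the docsList input format is our assumption about the nlp/LDA schema; check the algorithm’s page for the exact input and output structure):

    import Algorithmia

    client = Algorithmia.client("YOUR_API_KEY")

    docs = [
        "NASA launched a new probe toward the moon",
        "The rocket will study the sun and the earth from orbit",
    ]

    # Each returned topic is a weighted bag of words, e.g. sun/nasa/earth/moon
    topics = client.algo("nlp/LDA").pipe({"docsList": docs}).result
    print(topics)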

You can use both of these powerful algorithms to build a tool that analyzes tweets in real time. This tool gives you just a glimpse of what can be built with two machine learning algorithms and a scraper. Learn more about sentiment analysis with this handy Algorithmia Guide.

Using NLP from Algorithmia to Build an App For Analyzing Tweets on Demand

Analyze Tweets is a minimal demo from Algorithmia that searches Twitter by keyword and analyzes tweets for sentiment and LDA topics. It currently searches up to 500 tweets that are no older than 7 days, and caches each query for up to 24 hours.

The demo initializes with a random keyword as soon as the web page loads. The possible keywords are: algorithmia, microsoft, amazon, google, facebook, and reddit.

You can check out the demo below:

How we built it

Before you can start, you need to have an Algorithmia account. You can create one on our website.

The first thing we did was click the Add Algorithm button under the user tab, at the top right of the page.

We enabled the Special Permissions, so that our algorithm can call other algorithms and have access to the internet. We also selected Python as our preferred language.

Next, we decided which algorithms we were going to use. We came up with a recipe that uses three different algorithms in the following steps (a condensed code sketch follows the list):

  1. It first checks if the query has been cached before and the cache isn’t older than 24 hours.
    1. If it has the appropriate cache, it returns the cached query.
    2. If it can’t find an appropriate cached copy, it continues to run the analysis.
      1. It retrieves up to 500 tweets by using the user provided keyword and twitter/RetrieveTweetsWithKeyword algorithm.
      2. It runs the nlp/SocialSentimentAnalysis algorithm to get sentiment scores for all tweets.
      3. It creates two new groups of tweets that are respectively the top 20% most positive and top 20% most negative tweets.
        1. We do this by sorting the list of tweets based on their overall sentiment provided by the previously mentioned algorithm.
        2. We then take the top 20% and bottom 20% to get the most positive and negative tweets.
        3. We also remove any tweets in both groups that have a sentiment of 0 (neutral). This is because the Twitter retriever doesn’t guarantee it will return 500 tweets each time. For more information, click here.
      4. These two new groups (positive and negative) of tweets are fed into the nlp/LDA algorithm to extract positive and negative topics.
      5. All relevant information is first cached, then is returned.
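
The heart of that recipe, stripped down to a hedged sketch against the Algorithmia Python client (the cache path, the input and output field names of the three algorithms, and the 20% cutoff logic are simplified assumptions; the real source is linked below):

    import json
    import Algorithmia

    client = Algorithmia.client("YOUR_API_KEY")

    def analyze_tweets(keyword):
        cache_uri = "data://yourUserName/AnalyzeTweets/" + keyword + ".json"
        cache = client.file(cache_uri)
        if cache.exists():                  # the 24-hour freshness check is omitted here
            return json.loads(cache.getString())

        # 1. Retrieve up to 500 recent tweets for the keyword
        tweets = client.algo("twitter/RetrieveTweetsWithKeyword").pipe(keyword).result

        # 2. Score every tweet's sentiment
        scored = client.algo("nlp/SocialSentimentAnalysis").pipe(
            {"sentenceList": [t["text"] for t in tweets]}).result

        # 3. Top and bottom 20% by compound sentiment, dropping exact neutrals
        scored.sort(key=lambda s: s["compound"])
        cutoff = max(1, len(scored) // 5)
        negative = [s for s in scored[:cutoff] if s["compound"] != 0]
        positive = [s for s in scored[-cutoff:] if s["compound"] != 0]

        # 4. Extract topics from each group with LDA
        positive_topics = client.algo("nlp/LDA").pipe(
            {"docsList": [s["sentence"] for s in positive]}).result
        negative_topics = client.algo("nlp/LDA").pipe(
            {"docsList": [s["sentence"] for s in negative]}).result

        # 5. Cache the result, then return it
        result = {"positiveTopics": positive_topics, "negativeTopics": negative_topics}
        cache.put(json.dumps(result))
        return result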

Based on our recipe above, we decided to wrap this app into a micro-service and call it nlp/AnalyzeTweets. You can check out the source code here, and you can copy and paste the code into your own algorithm if you’re feeling lazy. The only part you would need to change is line 28, from:

cache_uri = "data://demo/AnalyzeTweets/" + str(query_hash) + ".json"

to:

cache_uri = "data://yourUserName/AnalyzeTweets/" + str(query_hash) + ".json"

The last step is to compile and publish the new algorithm.

And… Congrats! By using different algorithms as building blocks, you now have a working micro-service on the Algorithmia platform!

Notes

After writing the algorithm, we built a client-side (JS + CSS + HTML) demo that calls our new algorithm with input from the user. In our client-side app, we used d3.js for rendering the pie chart and histogram. We also used DataTables for rendering interactive tabular data. SweetAlert was used for the stylish popup messages. You can find a standalone version of our Analyze Tweets app here.

A real world case study

While we were building the demo app and writing this blog post (on Jan 28th), GitHub went down and was completely inaccessible. Being the dataphiles we are, we immediately ran an analysis on the keyword “github” using our new app.

As you might expect, developers on Twitter were not positive about GitHub being down. Nearly half of all tweets relating to GitHub at the time were negative, but to get a better idea of the types of reactions, we can look at the topics generated with LDA. We can see that Topic 4 probably represents a group of people who are more calm about the situation. The famous raging unicorn makes an appearance in multiple topics, and we can also see some anger in the other topics. And of course, what analysis of developer tweets would be complete without people complaining about other people complaining? We can see this in Topic 2 with words like “whine”. See the full analysis during the downtime below:

Screenshot of the full analysis for the keyword “github” during the downtime

Benchmarking Sentiment Analysis Algorithms

Sentiment Analysis, also known as opinion mining, is a powerful tool you can use to build smarter products. It’s a natural language processing algorithm that gives you a general idea about the positive, neutral, and negative sentiment of texts. Sentiment analysis is often used to understand the opinion or attitude in tweets, status updates, movie/music/television reviews, chats, emails, comments, and more. Social media monitoring apps and companies all rely on sentiment analysis and machine learning to assist them in gaining insights about mentions, brands, and products.

Want to learn more about Sentiment Analysis? Read the Algorithmia Guide to Sentiment Analysis.

For example, sentiment analysis can be used to better understand how your customers perceive your brand or product on Twitter in real-time. Instead of going through hundreds, or maybe thousands, of tweets by hand, you can easily group positive, neutral, and negative tweets together, and sort them based on their confidence values. This will save you countless hours and leads to actionable insights.

Sentiment Analysis Algorithms On Algorithmia

Anyone who has worked with sentiment analysis can tell you it’s a very hard problem. It’s not hard to find trained models for classifying very specific domains, but there currently isn’t an all-purpose sentiment analyzer in the wild. In general, sentiment analysis becomes even harder when the volume of text increases, due to the complex relations between words and phrases.

On the Algorithmia marketplace, the most popular sentiment analysis algorithm is nlp/SentimentAnalysis. It’s an algorithm based on the popular and well-known NLP library Stanford CoreNLP, and it does a good job of analyzing large bodies of text. However, we’ve observed that the algorithm tends to return overly negative sentiment on short bodies of text, and decided that it needed some improvement.

We found that the Stanford CoreNLP library was originally intended for building NLP pipelines in a fast and simple fashion; it doesn’t focus much on individual features, and it lacks documentation about how to retrain the model. Additionally, the library doesn’t provide confidence values, but rather returns an integer between 0 and 4.

A Better Alternative

We decided to test a few alternative open-source libraries that would hopefully outperform the current algorithm when it came to social media sentiment analysis. We came across an interesting and well-performing library called VADER Sentiment Analysis, which is based on a paper published by SocialAI at Georgia Tech. This new algorithm performed exceptionally well, and we’ve added it to the marketplace as nlp/SocialSentimentAnalysis. It’s designed to analyze social media texts.

We ran benchmarks on both nlp/SentimentAnalysis and nlp/SocialSentimentAnalysis to compare them to each other. We used an open dataset from CrowdFlower’s Data for Everyone initiative. The Apple Computers Twitter sentiment dataset was selected because the tweets covered a wide array of topics. Real-life data is almost never homogeneous, since it almost always has noise in it. Using this dataset helped us better understand how the algorithms performed when faced with real data.

We removed tweets that had no sentiment, and then filtered out anything that wasn’t labeled with 100% confidence, so that the remaining tweets were grouped and labeled by consensus. This decreased the size of the dataset from ~4,000 tweets to ~1,800 tweets.
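
That filtering step looks roughly like this with pandas (the file name and the sentiment / sentiment:confidence column names are assumptions about the CrowdFlower CSV export):

    import pandas as pd

    df = pd.read_csv("apple_twitter_sentiment.csv", encoding="latin-1")

    # Drop tweets with no sentiment label, then keep only unanimous (confidence == 1.0) labels
    df = df.dropna(subset=["sentiment"])
    df = df[df["sentiment:confidence"] == 1.0]

    print(len(df), "tweets remain after filtering")   # the post's filtered set was ~1,800 tweets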

Running Time Comparison

The first comparison we made was the running time for each algorithm. The new nlp/SocialSentimentAnalysis algorithm runs up to 12 times faster than the nlp/SentimentAnalysis algorithm.

Chart: running time comparison of nlp/SocialSentimentAnalysis and nlp/SentimentAnalysis

Accuracy Comparison

As you can see in the bar chart below, the new social sentiment analysis algorithm performs 15% better in overall accuracy.

Chart: overall accuracy comparison

We can also see how well it performs in accurately predicting each specific label. We used the One-Versus-All method to calculate the accuracy for every individual label. The new algorithm outperformed the old one in every given label.

Chart: accuracy by label (one-versus-all)
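
Per-label accuracy with one-versus-all simply treats each label as its own binary problem, as in this small sketch:

    from sklearn.metrics import accuracy_score

    def one_vs_all_accuracy(y_true, y_pred, labels=("negative", "neutral", "positive")):
        """Binary accuracy for each label: that label versus everything else."""
        scores = {}
        for label in labels:
            true_bin = [1 if y == label else 0 for y in y_true]
            pred_bin = [1 if y == label else 0 for y in y_pred]
            scores[label] = accuracy_score(true_bin, pred_bin)
        return scores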

When To Use Social Sentiment Analysis

The new algorithm works well with social media texts, as well as texts that have a similar structure and nature (i.e. status updates, chat messages, comments, etc.). The algorithm can still give you sentiments for bigger texts, such as reviews or articles, but it will probably not be as accurate as it is with social media texts.

An example application would be to monitor social media for how people are reacting to a change to your product, such as when Foursquare split their app in two: Swarm and Foursquare. Or, when Apple releases an iOS update. You could monitor the overall sentiment of a certain hashtag or account mentions, and visualize a line chart that demonstrates the change in your customers’ sentiment over time.

As another example, you could monitor your products through social media 24/7, and receive alerts when significant changes in sentiment toward your product or service happen in a short amount of time. This would act as an early alert system to help you take quick, appropriate action before a problem gets even bigger. For instance, a brand like Comcast or Time Warner might want to keep tabs on customer satisfaction through social media, and proactively respond to customers when there is a service interruption.

Understanding Social Sentiment Analysis

The new algorithm returns three individual sentiments: positive, neutral, and negative. Additionally, it returns one general overall (i.e. compound) sentiment. Each individual sentiment is scored between 0 and 1 according to its intensity. The compound sentiment of the text is given between -1 and 1, ranging from absolutely negative to absolutely positive.
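
Calling the algorithm and reading those scores looks roughly like this (the exact input and output field names are our assumption; consult the nlp/SocialSentimentAnalysis page for the authoritative schema):

    import Algorithmia

    client = Algorithmia.client("YOUR_API_KEY")

    result = client.algo("nlp/SocialSentimentAnalysis").pipe(
        {"sentence": "I love the new update, but the app still crashes."}).result

    # Assumed shape: positive, neutral, negative in [0, 1]; compound in [-1, 1]
    print(result)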

Based on your use case and application, you may want to only use a specific individual sentiment (i.e. wanting to see only negative tweets, ordered by intensity), or the compound, overall sentiment (i.e. understanding general consumer feelings and opinions). Having both types of sentiment gives you the freedom to build applications exactly as you need to.

Conclusion

The new and improved nlp/SocialSentimentAnalysis algorithm is definitely faster, and better at classifying the sentiments of social media texts. It allows you to build different kinds of applications thanks to its variety of sentiment types (individual and compound), whereas the old one only provides an overall sentiment with five discrete values, and is better reserved for larger bodies of text.

Did you build an awesome app that uses nlp/SocialSentimentAnalysis? Let us know on Twitter @algorithmia!

Bonus: Check out the source code for running the benchmarks yourself @github!