What makes a good GitHub README? It’s a question almost every developer has asked as they’ve typed:
git init git add README.md git commit -m "first commit"
As far as we can tell, nobody has a clear answer for this, yet you typically know a good GitHub README when you see it. The best explanation we’ve found is the README Wikipedia entry, which offers a handful suggestions.
We set out to flex our data science muscles, and see if we could come up with an objective standard for what makes a good GitHub README using machine learning. The result is the GitHub README Analyzer demo, an experimental tool to algorithmically improve the quality of your GitHub README’s.
The complete code for this project can be found in our iPython Notebook here.
Thinking About the GitHub README Problem
Understanding the contents of a README is a problem for NLP, and it’s a particularly hard one to solve.
To start, we made some assumptions about what makes a GitHub README good:
- Popular repositories probably have good README’s
- Popular repositories should have more stars than unpopular ones
- Each programming language has a unique pattern, or style
With those assumptions in place, we set out to collect our data, and build our model.
Approaching the Solution
Step One: Collecting Data
We scraped the first 1,000 repositories for each language. Then we removed any repository the didn’t have a GitHub README, was encoded in an obscure format, or was in some way unusable. After removing the bad repos, we ended up with — on average — 878 README’s per language.
Step Two: Data Scrubbing
We needed to then convert all the README’s into a common format for further processing. We chose to convert everything to HTML. When we did this, we also removed common words from a stop word list, and then stemmed the remaining words to find the common base — or root — of each word. We used NLTK Snowball stemmer for this. We also kept the original, unstemmed words to be used later for making recommendations from our model for the demo (e.g. instead of recommending “instal” we could recommend “installation”).
This preprocessing ensured that all the data was in a common format and ready to be vectorized.
Step Three: Feature Extraction
Now that we have a clean dataset to work with, we need to extract features. We used scikit-learn, a Python machine learning library, because it’s easy to learn, simple, and open sourced.
We extracted five features from the preprocessed dataset:
- 1-grams from the headers (<h1>, <h2>, <h3>)
- 1-grams from the text of the paragraphs (<p>)
- Count of code snippets (<pre>)
- Count of images (<img>)
- Count of characters, excluding HTML tags
To build feature vectors for our README headers and paragraphs, we tried TF-IDF, Count, and Hashing vectorizers from scikit-learn. We ended up using TF-IDF, because it had the highest accuracy rate. We used 1-grams, because anything more than that became computationally too expensive, and the results weren’t noticeably improved. If you think about it, this makes sense. Section headers tend to be single words, like Installation, or Usage.
Since code snippets, images, and total characters are numerical values by nature, we simply casted them into a 1×1 matrix.
Step Four: Benchmarking
We benchmarked 11 different classifiers against each other for all five features. And, then repeated this for all 10 programming languages. In reality, there was an 11th “language,” which was the aggregate of all 10 languages. We later used this 11th (i.e. “All”) language to train a general model for our demo. We did this to handle situations where a user was trying to analyze a README representing a project in a language that wasn’t one of the top 10 languages on GitHub.
We used the following classifiers: Logistic Regression, Passive Aggressive, Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Nearest Centroid, Label Propagation, SVC, Linear SVC, Decision Tree, and Extra Tree.
We would have used nuSVC, but as Stack Overflow can explain, it has no direct interpretation, which prevented us from using it.
In all, we created 605 different models (11 classifiers x 5 features x 11 languages — remember, we created an “All” language to produce an overall model, hence 11.). We picked the best model per feature per programming language, which means we ended up using 55 unique models (5 features x 11 languages).
Some notes on how we trained:
- Before we started, we divided our dataset into a training set, and a testing set: 75%, and 25%, respectively.
- After training the initial 605 models, we benchmarked their accuracy scores against each other.
- We selected the best models for each feature, and then we saved (i.e. pickled) the model. Persisting the model file allowed us to skip the training step in the future.
To determine the error rate of the models, we calculated their mean squared errors. The model with the least mean squared errors was selected for each corresponding feature, and language – 55 models were selected in total.
Step Five: Putting it All Together
With all of this in place, we can now take any GitHub repository, and score the README using our models.
For non-numerical features, like Headers and Paragraphs, the model tries to make recommendations that should improve the quality of your README by adding, removing, or changing every word, and recalculating the score each time.
For the numerical-based features, like the count of <pre>, <img> tags, and the README length, the model incrementally increases and decreases values for each feature. It then re-evaluates the score with each iteration to determine an optimized recommendation. If it can’t produce a higher score with any of the changes, no recommendation is given.
Try out the GitHub README Analyzer now.
Some of our assumptions proved to be true, while some were off. We found that our assumption about headers and the text from paragraphs correlated with popular repositories. However, this wasn’t true for the length of a repository, or the count of code samples and images.
The reason for this may stem from the fact that n-gram based features, like Headers and Paragraphs, train on hundreds or thousands of variables per document. Whereas numerical features only trains on one variable per document. Hence, ~1,000 repositories per programming language were enough to train models for n-gram based features, but it wasn’t enough to train decent numerical-based models. Another reason might be that numerical-based features had noisy data, or the dataset didn’t correlate to the number of stars a project had.
When we tested our models, we initially found it kept recommending us to significantly shorten every README (e.g. it recommended that one README get shortened from 408 characters down to 6!), remove almost all the images and code samples. This was counterintuitive, so after plotting the data, we learned that a big part of our dataset had zero images, zero code snippets, or was less than 1,500 characters.
This explains why our model behaved so poorly at first for the length, code sample, and image features. To account for this, we took an additional step in our preprocessing and removed any README that was less than 1,500 characters, contained zero images, or zero code samples. This dramatically improved the results for images and code sample recommendations, but the character length recommendation did not improve significantly. In the end, we chose to remove the length recommendation from our demo.
Below is the numerical feature distribution for images, code snippets, and character length all README’s. Find the detailed histograms broken out by language here.