Algorithmia

A Data Science Approach to Writing a Good GitHub README

Octocat and the GitHub README Analyzer

What makes a good GitHub README? It’s a question almost every developer has asked as they’ve typed:

 git init
 git add README.md
 git commit -m "first commit"

As far as we can tell, nobody has a clear answer, yet you typically know a good GitHub README when you see it. The best explanation we’ve found is the README Wikipedia entry, which offers a handful of suggestions.

We set out to flex our data science muscles, and see if we could come up with an objective standard for what makes a good GitHub README using machine learning. The result is the GitHub README Analyzer demo, an experimental tool to algorithmically improve the quality of your GitHub READMEs.

GitHub README Analyzer Demo

The complete code for this project can be found in our IPython Notebook here.


Thinking About the GitHub README Problem

Understanding the contents of a README is a problem for NLP, and it’s a particularly hard one to solve.

To start, we made some assumptions about what makes a GitHub README good:

  1. Popular repositories probably have good READMEs
  2. Popular repositories should have more stars than unpopular ones
  3. Each programming language has a unique pattern, or style

With those assumptions in place, we set out to collect our data, and build our model.


Approaching the Solution

Step One: Collecting Data

First, we needed to create a dataset. We decided to use the 10 most popular programming languages on GitHub by the number of repositories, which were JavaScript, Java, Ruby, Python, PHP, HTML, CSS, C++, C, and C#. We used the GitHub API to retrieve a count of repos by language, and then grabbed the READMEs for the 1,000 most-starred repos per language. We only used READMEs written in either Markdown or reStructuredText format.

We scraped the first 1,000 repositories for each language. Then we removed any repository that didn’t have a GitHub README, was encoded in an obscure format, or was in some way unusable. After removing the bad repos, we ended up with an average of 878 READMEs per language.

Step Two: Data Scrubbing

We then needed to convert all the READMEs into a common format for further processing. We chose to convert everything to HTML. When we did this, we also removed common words from a stop word list, and then stemmed the remaining words to find the common base — or root — of each word. We used the NLTK Snowball stemmer for this. We also kept the original, unstemmed words to be used later for making recommendations from our model for the demo (e.g. instead of recommending “instal” we could recommend “installation”).
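A minimal sketch of this scrubbing step, assuming NLTK is available; the tiny stop list here is a stand-in for a full stop word list:

```python
# Strip stop words, then stem with NLTK's Snowball stemmer, while keeping
# a stem -> original-word map so recommendations can show real words later.
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is"}  # stand-in list
stemmer = SnowballStemmer("english")

def scrub(tokens):
    """Return (stems, stem_to_original) for a list of lowercase tokens."""
    stems, originals = [], {}
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        stem = stemmer.stem(tok)
        stems.append(stem)
        originals.setdefault(stem, tok)  # remember an unstemmed form
    return stems, originals

stems, originals = scrub(["the", "installation", "of", "dependencies"])
print(stems[0])             # 'instal'
print(originals["instal"])  # 'installation'
```

Keeping the `originals` map is what lets the demo recommend “installation” rather than the raw stem “instal”.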

This preprocessing ensured that all the data was in a common format and ready to be vectorized.

Step Three: Feature Extraction

Now that we had a clean dataset to work with, we needed to extract features. We used scikit-learn, a Python machine learning library, because it’s simple, easy to learn, and open source.

We extracted five features from the preprocessed dataset:

  1. 1-grams from the headers (<h1>, <h2>, <h3>)
  2. 1-grams from the text of the paragraphs (<p>)
  3. Count of code snippets (<pre>)
  4. Count of images (<img>)
  5. Count of characters, excluding HTML tags

To build feature vectors for our README headers and paragraphs, we tried the TF-IDF, Count, and Hashing vectorizers from scikit-learn. We ended up using TF-IDF, because it had the highest accuracy rate. We used 1-grams, because anything larger became too computationally expensive, and the results weren’t noticeably better. If you think about it, this makes sense: section headers tend to be single words, like Installation or Usage.

Since code snippets, images, and total characters are numerical values by nature, we simply cast each one into a 1×1 matrix.
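The extraction step above can be sketched with scikit-learn; the toy header strings are invented for illustration:

```python
# TF-IDF over 1-grams for the header/paragraph text, and a 1x1 matrix
# for each numeric count (code snippets, images, character length).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

headers = ["installation usage license",
           "usage examples api",
           "installation api docs"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # 1-grams only
X_text = vectorizer.fit_transform(headers)
print(X_text.shape)  # (3, 6): three documents, six distinct 1-grams

# Numeric features are already numbers, so each one is just cast
# into a 1x1 matrix per document.
num_images = 4
X_images = np.array([[num_images]])
print(X_images.shape)  # (1, 1)
```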

Step Four: Benchmarking

We benchmarked 11 different classifiers against each other for all five features, and then repeated this for all 10 programming languages. In reality, there was an 11th “language”: the aggregate of all 10. We later used this 11th (i.e. “All”) language to train a general model for our demo, to handle situations where a user analyzes a README for a project in a language outside GitHub’s top 10.

We used the following classifiers: Logistic Regression, Passive Aggressive, Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Nearest Centroid, Label Propagation, SVC, Linear SVC, Decision Tree, and Extra Tree.

We would have used NuSVC as well, but, as Stack Overflow explains, it has no direct interpretation for our problem, which prevented us from using it.

In all, we created 605 different models (11 classifiers x 5 features x 11 languages; remember, we created an “All” language to produce an overall model, hence 11). We picked the best model per feature per programming language, which means we ended up using 55 unique models (5 features x 11 languages).

Feature Extraction Model Mean Squared Error

View the interactive chart here. Find the detailed error rates broken out by language here.

Some notes on how we trained: 

  1. Before we started, we divided our dataset into a training set and a testing set: 75% and 25%, respectively.
  2. After training the initial 605 models, we benchmarked their accuracy scores against each other.
  3. We selected the best models for each feature, and then we saved (i.e. pickled) the model. Persisting the model file allowed us to skip the training step in the future.

To determine the error rate of the models, we calculated their mean squared errors. The model with the lowest mean squared error was selected for each feature and language; 55 models were selected in total.
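A hedged sketch of the benchmarking loop described above, with synthetic data and a two-model shortlist standing in for the real dataset and the 11 classifiers:

```python
# Split 75/25, fit each candidate model, score by mean squared error,
# keep the best, and pickle it so training can be skipped next time.
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy "popular vs. not" label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = [LogisticRegression(), DecisionTreeClassifier(random_state=0)]
best_model, best_mse = None, float("inf")
for model in candidates:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_model, best_mse = model, mse

# Persist (pickle) the winner, as in step 3 of the notes above.
restored = pickle.loads(pickle.dumps(best_model))
print(type(restored).__name__, best_mse)
```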

Step Five: Putting it All Together

With all of this in place, we can now take any GitHub repository, and score the README using our models.

For non-numerical features, like Headers and Paragraphs, the model tries to make recommendations that should improve the quality of your README by adding, removing, or changing each word in turn, and recalculating the score after each change.

For the numerical-based features, like the count of <pre>, <img> tags, and the README length, the model incrementally increases and decreases values for each feature. It then re-evaluates the score with each iteration to determine an optimized recommendation. If it can’t produce a higher score with any of the changes, no recommendation is given.
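This search over numeric values can be sketched as a simple hill climb; the score function below is a made-up stand-in for a trained model’s prediction:

```python
# Nudge one count up and down, re-score each candidate, and only
# recommend a change if some value scores higher than the current one.
def score(image_count):
    # Hypothetical model score, peaking at 3 images.
    return -(image_count - 3) ** 2

def recommend(current, lo=0, hi=10):
    best_value, best_score = current, score(current)
    for candidate in range(lo, hi + 1):
        s = score(candidate)
        if s > best_score:
            best_value, best_score = candidate, s
    # None means no change beat the current value, so no recommendation.
    return best_value if best_value != current else None

print(recommend(0))  # 3 -> "add images"
print(recommend(3))  # None -> already optimal, no recommendation
```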

Try out the GitHub README Analyzer now.


Conclusion

Some of our assumptions proved to be true, while others were off. We found that our assumptions about headers and the text from paragraphs correlated with popular repositories. However, this wasn’t true for the length of a README, or for the counts of code samples and images.

The reason for this may stem from the fact that n-gram based features, like Headers and Paragraphs, train on hundreds or thousands of variables per document, whereas numerical features train on only one variable per document. Hence, ~1,000 repositories per programming language were enough to train models for the n-gram based features, but not enough to train decent numerical models. Another reason might be that the numerical features had noisy data, or simply didn’t correlate with the number of stars a project had.

When we tested our models, we initially found that they kept recommending we significantly shorten every README (e.g. one README was to be shortened from 408 characters down to 6!) and remove almost all of the images and code samples. This was counterintuitive, so after plotting the data, we learned that a big part of our dataset had zero images, zero code snippets, or was less than 1,500 characters long.

Plot of READMEs with counts of zero

This explains why our model behaved so poorly at first for the length, code sample, and image features. To account for this, we took an additional step in our preprocessing and removed any README that was less than 1,500 characters long, had no images, or had no code samples. This dramatically improved the results for the image and code sample recommendations, but the character length recommendation did not improve significantly. In the end, we chose to remove the length recommendation from our demo.
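The extra preprocessing filter amounts to a few conditions per README; the dicts below are invented examples:

```python
# Drop any README that is too short, or has no images or code samples.
MIN_CHARS = 1500

readmes = [
    {"chars": 3200, "images": 2, "code_snippets": 4},
    {"chars": 408,  "images": 0, "code_snippets": 0},  # dropped: too short
    {"chars": 2100, "images": 0, "code_snippets": 3},  # dropped: no images
]

kept = [r for r in readmes
        if r["chars"] >= MIN_CHARS
        and r["images"] > 0
        and r["code_snippets"] > 0]

print(len(kept))  # 1
```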


Think you can come up with a better model? Check out our IPython Notebook covering all of the steps above, and tweet us @Algorithmia with your results.


Below is the numerical feature distribution for images, code snippets, and character length for all READMEs. Find the detailed histograms broken out by language here.

A histogram of the count of images, code snippets, and character length

An Introduction to Natural Language Processing


This introduction to Natural Language Processing, or NLP for short, describes one of the central problems of artificial intelligence. NLP is focused on the interactions between human language and computers, and it sits at the intersection of computer science, artificial intelligence, and computational linguistics.

What is Natural Language Processing?

NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like automatic text summarization, sentiment analysis, topic extraction, named entity recognition, parts-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.

NLP is characterized as a hard problem in computer science. Human language is rarely precise, or plainly spoken. To understand human language is to understand not only the words, but the concepts and how they’re linked together to create meaning. Despite language being one of the easiest things for humans to learn, the ambiguity of language is what makes natural language processing a difficult problem for computers to master.

Natural Language Processing Algorithms

NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statistical inference. In general, the more data analyzed, the more accurate the model will be.
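A minimal illustration of this idea, using scikit-learn’s naive Bayes on an invented six-sentence corpus: no hand-coded rules, just labeled examples.

```python
# A naive Bayes classifier picks up word-sentiment associations
# from a handful of labeled sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great library, loved it", "works great", "awful docs, hated it",
         "terrible and broken", "loved the examples", "hated the install"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved the great docs"]))     # ['pos']
print(model.predict(["terrible broken install"]))  # ['neg']
```

With more examples, the same pipeline learns finer distinctions; that is the sense in which more data yields a more accurate model.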

Popular Open Source NLP Libraries:

These libraries provide the algorithmic building blocks of NLP in real-world applications. Algorithmia provides a free API endpoint for many of these algorithms, without you ever having to set up or provision servers and infrastructure.

A Few NLP Examples:

NLP algorithms can be extremely helpful for web developers, providing them with the turnkey tools needed to create advanced applications, and prototypes.

Natural Language Processing Tutorials

Recommended NLP Books

  • Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
    This is a book about Natural Language Processing. By “natural language” we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. At one extreme, it could be as simple as counting word frequencies to compare different writing styles.
  • Speech and Language Processing, 2nd Edition
    An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology – at all levels and with all modern technologies – this text takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corpora. The authors cover areas that traditionally are taught in different courses, to describe a unified vision of speech and language processing.
  • Introduction to Information Retrieval
    As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval systems. However, during the last decade, relentless optimization of information retrieval effectiveness has driven web search engines to new quality levels where most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows, 2004) found that 92% of Internet users say the Internet is a good place to go for getting everyday information. To the surprise of many, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people’s preferred means of information access.

NLP Videos

NLP Courses

  • Stanford Natural Language Processing on Coursera
    This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that is crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive Bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.
  • Stanford Machine Learning on Coursera
    Machine learning is the science of getting computers to act without being explicitly programmed. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.
  • Udemy’s Introduction to Natural Language Processing
    This course introduces Natural Language Processing through the use of Python and the Natural Language Toolkit. Through a practical approach, you’ll get hands-on experience working with and analyzing text. As a student of this course, you’ll get updates for free, which include lecture revisions, new code examples, and new data projects.
  • Certificate in Natural Language Technology
    When you talk to your mobile device or car navigation system – or it talks to you – you’re experiencing the fruits of developments in natural language processing. This field, which focuses on the creation of software that can analyze and understand human languages, has grown rapidly in recent years and now has many technological applications. In this three-course certificate program, we’ll explore the foundations of computational linguistics, the academic discipline that underlies NLP.

Related NLP Topics

Six Natural Language Processing Algorithms for Web Developers

Natural language processing algorithms are a web developer’s best friend. These powerful tools easily integrate into apps, projects, and prototypes to extract insights and help developers understand text-based content.

We’ve put together a collection of the most common NLP algorithms freely available on Algorithmia. Think of these tools as the building blocks needed to quickly create smart, data rich applications.

New to NLP? Start with our guide to natural language processing algorithms.


Summarizer

Quickly get the tl;dr on any article. Summarizer creates a summary of a text document, while retaining the most important parts. Summarizer takes a large block of unstructured text as a string, and extracts the key sentences based on the frequency of topics and terms it finds. Try out Summarizer here.

Sample Input:

"A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution.
Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending.
We propose a solution to the double-spending problem using a peer-to-peer network.
The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.
The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power.
As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers.
The network itself requires minimal structure.
Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone."

Sample Output:

We propose a solution to the double-spending problem using a peer-to-peer network.
The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.

Sentiment Analysis

Want to understand the tone or attitude of a piece of text? Use sentiment analysis to determine its positive or negative feelings, opinions, or emotions. You can quickly identify and extract subjective information from text simply by passing this algorithm a string. The algorithm assigns a sentiment rating from 0 to 4, corresponding to very negative, negative, neutral, positive, and very positive. Get started with sentiment analysis here.

Sample Input:

"Algorithmia loves you!"

Sample Output:

3

Need to analyze the sentiment of social media? The Social Sentiment Analysis algorithm is tuned specifically for bite-sized content, like tweets and status updates. This algorithm also provides you with some additional information, like the confidence levels for the positive-ness, and negative-ness of the text.

Sample Input:

{
  "sentence": "This old product sucks! But after the update it works like a charm!"
}

Sample Output:

{
  "positive": 0.335,
  "negative": 0.145,
  "sentence": "This old product sucks! But after the update it works like a charm!",
  "neutral": 0.521,
  "compound": 0.508
}

Want to learn more about sentiment analysis? Try our guide to sentiment analysis.


LDA (Auto-Keyword Tags)

LDA is a core NLP algorithm used to take a group of text documents, and identify the most relevant topic tags for each. LDA is perfect for anybody needing to categorize the key topics and tags from text. Many web developers use LDA to dynamically generate tags for blog posts and news articles. This LDA algorithm takes an object containing an array of strings, and returns an array of objects with the associated topic tags. Try it here.

We have a second algorithm called Auto-Tag URL, which is dead-simple to use: pass in a URL, and the algorithm returns the tags with the associated frequency. Try both, and see what works for your situation.

Sample Input:

"https://raw.githubusercontent.com/mbernste/machine-learning/master/README.md"

Sample Output:

{
  "algorithm": 4,
  "structure": 2,
  "bayes": 2,
  "classifier": 2,
  "naïve": 2,
  "search": 2,
  "algorithms": 2,
  "bayesian": 2
}

Analyze Tweets

Search Twitter for a keyword, and then analyze the matching tweets for sentiment and LDA topics. Marketing and brand managers can use Analyze Tweets to monitor their brands and products for mentions to determine what customers are saying, and how they feel about a product. This microservice outputs a list of all the matching tweets with corresponding sentiment, and groups the top positive and negative LDA topics. It also provides positive-specific and negative-specific tweets, in case you want to look at just one flavor of sentiment. Before using this algorithm, you’ll need to generate free API keys through Twitter.

Need to analyze a specific Twitter account? We also provide Analyze Twitter User, to help you understand what a specific user tweets about positively, or negatively. This is great for identifying influencers.

Sample Input:

{
  "query": "seattle",
  "numTweets": "1000",
  "auth":
    {
      "app_key": "<YOUR TWITTER API KEY>",
      "app_secret": "<YOUR TWITTER SECRET>",
      "oauth_token": "<YOUR TWITTER OAUTH TOKEN>",
      "oauth_token_secret": "<YOUR TWITTER TOKEN SECRET>"
    }
}

Sample Output:

{
  "negLDA": [
    {
      "seattle's": 2,
      "adults": 2,
      "breaking": 2,
      "oldest": 1,
      "liveonkomo": 1,
      "teens": 2,
      "seattle": 5,
      "charged": 3
    }
  ],
  "allTweets": [
    {
      "text": "RT @Harry_Styles: Thanks for having us Seattle. You were amazing tonight. Hope you enjoyed the show. All the love",
      "created_at": "Thu Feb 04 19:34:18 +0000 2016",
      "negative_sentiment": 0,
      "tweet_url": "https://twitter.com/statuses/695329550536949761",
      "positive_sentiment": 0.55,
      "overall_sentiment": 0.9524,
      "neutral_sentiment": 0.45
    },
    {
      "text": "@kexp is super moody, super fun. Really have dug listening to them while in #Seattle \n\nProbably most lined up with my listening prefs",
      "created_at": "Thu Feb 04 19:31:04 +0000 2016",
      "negative_sentiment": 0.077,
      "tweet_url": "https://twitter.com/statuses/695328736531451904",
      "positive_sentiment": 0.34,
      "overall_sentiment": 0.8625,
      "neutral_sentiment": 0.583
    }
  ],
  "posLDA": [
    {
      "urban": 2,
      "love": 3,
      "lil": 2,
      "gentrified": 2,
      "openingday": 2,
      "cores": 2,
      "seattle": 8,
      "enjoyed": 3
    }
  ],
  "negTweets": [
    {
      "text": "I couldn't be more excited about leading a missions team to Seattle, Washington this summer! Join us!",
      "created_at": "Thu Feb 04 19:52:51 +0000 2016",
      "negative_sentiment": 0.157,
      "tweet_url": "https://twitter.com/statuses/695334220764463104",
      "positive_sentiment": 0.122,
      "overall_sentiment": -0.1622,
      "neutral_sentiment": 0.721
    }
  ],
  "posTweets": [
    {
      "text": "RT @Harry_Styles: Thanks for having us Seattle. You were amazing tonight. Hope you enjoyed the show. All the love",
      "created_at": "Thu Feb 04 19:34:18 +0000 2016",
      "negative_sentiment": 0,
      "tweet_url": "https://twitter.com/statuses/695329550536949761",
      "positive_sentiment": 0.55,
      "overall_sentiment": 0.9524,
      "neutral_sentiment": 0.45
    }
  ]
}

We’ve truncated the above output, because there is wayyy too much to display on the blog (there were 1,000+ tweets analyzed!). Check out the Analyze Tweets page to play with the algorithm, and see the full output.


Analyze URL

Need to quickly scrape a URL, and get the content and metadata back? Analyze URL is perfect for any web developer wanting to quickly turn HTML pages into structured data. Pass the algorithm a URL as a string, and it returns the text of the page, the title, a summary, the thumbnail if one exists, timestamp, and the status code as an object. Get Analyze URL here.

Sample Input:

["http://blog.algorithmia.com/2016/02/algorithm-economy-containers-microservices/"]

Sample Output:

{
  "summary": "Algorithms as Microservices: How the Algorithm Economy and Containers are Changing the Way We Build and Deploy Apps Today",
  "date": "2016-02-22T11:38:11-07:00",
  "thumbnail": "http://blog.algorithmia.com/wp-content/uploads/2016/02/algorithm-economy.png",
  "marker": true,
  "text": "Build tomorrow's smart apps today In the age of Big Data, algorithms give companies a competitive advantage. Today’s most important technology companies all have algorithmic intelligence built into the core of their product: Google Search, Facebook News Feed, Amazon’s and Netflix’s recommendation engines. “Data is inherently dumb,” Peter Sondergaard, senior vice president at Gartner and global head of Research, said in The Internet of Things Will Give Rise To The Algorithm Economy. “It doesn’t actually do anything unless you know how to use it.” Google, Facebook, Amazon, Netflix and others have built both the systems needed to acquire a mountain of data (i.e. search history, engagement metrics, purchase history, etc), as well as the algorithms responsible for extracting actionable insights from that data. As a result, these companies are using algorithms to create value, and impact millions of people a day. “Algorithms are where the real value lies,” Sondergaard said. “Algorithms define action.” For many technology companies, they’ve done a good job of capturing data, but they’ve come up short on doing anything valuable with that data. Thankfully, there are two fundamental shifts happening in technology right now that are leading to the democratization of algorithmic intelligence, and changing the way we build and deploy smart apps today: The confluence of the algorithm economy and containers creates a new value chain, where algorithms as a service can be discovered and made accessible to all developers through a simple REST API. Algorithms as containerized microservices ensure both interoperability and portability, allowing for code to be written in any programming language, and then seamlessly united across a single API. 
By containerizing algorithms, we ensure that code is always “on,” and always available, as well as being able to auto-scale to meet the needs of the application, without ever having to configure, manage, or maintain servers and infrastructure. Containerized algorithms shorten the time for any development team to go from concept, to prototype, to production-ready app. Algorithms running in containers as microservices is a strategy for companies looking to discover actionable insights in their data. This structure makes software development more agile and efficient. It reduces the infrastructure needed, and abstracts an application’s various functions into microservices to make the entire system more resilient. The “algorithm economy” is a term established by Gartner to describe the next wave of innovation, where developers can produce, distribute, and commercialize their code. The algorithm economy is not about buying and selling complete apps, but rather functional, easy to integrate algorithms that enable developers to build smarter apps, quicker and cheaper than before. Algorithms are the building blocks of any application. They provide the business logic needed to turn inputs into useful outputs. Similar to Lego blocks, algorithms can be stacked together in new and novel ways to manipulate data, extract key insights, and solve problems efficiently. The upshot is that these same algorithms are flexible, and easily reused and reconfigured to provide value in a variety of circumstances. For example, we created a microservice at Algorithmia called Analyze Tweets, which searches Twitter for a keyword, determining the sentiment and LDA topics for each tweet that matches the search term. This microservice stacks our Retrieve Tweets With Keywords algorithm with our Social Sentiment Analysis and LDA algorithms to create a simple, plug-and-play utility. The three underlying algorithms could just as easily be restacked to create a new use case. 
For instance, you could create an Analyze Hacker News microservice that uses the Scrape Hacker News and URL2Text algorithms to extract the text for the top HN posts. Then, you’d simply pass the text for each post to the Social Sentiment Analysis, and LDA algorithms to determine the sentiment and topics of all the top posts on HN. The algorithm economy also allows for the commercialization of world class research that historically would have been published, but largely under-utilized. In the algorithm economy, this research is turned into functional, running code, and made available for others to use. The ability to produce, distribute, and discover algorithms fosters a community around algorithm development, where creators can interact with the app developers putting their research to work. Algorithm marketplaces function as the global meeting place for researchers, engineers, and organizations to come together to make tomorrow’s apps today. Containers are changing how developers build and deploy distributed applications. In particular, containers are a form of lightweight virtualization that can hold all the application logic, and run as an isolated process with all the dependencies, libraries, and configuration files bundled into a single package that runs in the cloud. “Instead of making an application or a service the endpoint of a build, you’re building containers that wrap applications, services, and all their dependencies,” Simon Bisson at InfoWorld said in How Containers Change Everything. “Any time you make a change, you build a new container; and you test and deploy that container as a whole, not as an individual element.” Containers create a reliable environment where software can run when moved from one environment to another, allowing developers to write code once, and run it in any environment with predictable results — all without having to provision servers or manage infrastructure. This is a shot across the bow for large, monolithic code bases. 
“[Monoliths are] being replaced by microservices architectures, which decompose large applications – with all the functionality built-in – into smaller, purpose-driven services that communicate with each other through common REST APIs,” Lucas Carlson from InfoWorld said in 4 Ways Docker Fundamentally Changes Application Development. The hallmark of microservice architectures is that the various functions of an app are unbundled into a series of decentralized modules, each organized around a specific business capability. Martin Fowler, the co-author of the Agile Manifesto, describes microservices as “an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.” By decoupling services from a monolith, each microservice becomes independently deployable, and acts as a smart endpoint of the API. “There is a bare minimum of centralized management of these services,” Fowler said in Microservices: A Definition of this New Architectural Term, “which may be written in different programming languages and use different data storage technologies.” Similar to the algorithm economy, containers are like Legos for cloud-based application development. “This changes cloud development practices,” Carlson said, “by putting larger-scale architectures like those used at Facebook and Twitter within the reach of smaller development teams.” Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with extensive background in all things data. Prior to founding Algorithmia he designed , managed and shipped some of Microsoft’s most used data analysis products including Excel, Power Pivot, SQL Server and Power BI. Diego holds a Bachelors degree in Information Systems and a Masters degree in Business Intelligence and Data Analytics from Carnegie Mellon University. More Posts - Website Follow Me:",
  "title": "The Algorithm Economy, Containers, and Microservices",
  "url": "http://blog.algorithmia.com/2016/02/algorithm-economy-containers-microservices/",
  "statusCode": 200
}

Site Map

The site map algorithm starts with a single URL, and crawls all the pages on the domain before returning a graph representing the link structure of the site. You can use the site map algorithm for many purposes, such as building dynamic XML site maps for search engines, doing competitive analysis, or simply auditing your own site to better understand what pages link where, and if there are any orphaned, or dead-end pages. Try it out here.

Sample Input:

["http://algorithmia.com",1]

Sample Output:

{
  "http://algorithmia.com/about":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/api",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
    ],
  "http://algorithmia.com/privacy":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
    ],
  "http://algorithmia.com":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
  ]
}
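Graphs in this adjacency-list shape are easy to audit programmatically. Below is a minimal sketch (our own illustration, using a toy graph in the same shape as the sample output above, not an Algorithmia API call) that flags orphaned pages and dead ends:

```python
# Audit a site-map graph (page -> list of pages it links to) for
# orphaned pages (nothing links to them) and dead ends (no outgoing links).
graph = {
    "http://example.com/": ["http://example.com/about", "http://example.com/privacy"],
    "http://example.com/about": ["http://example.com/", "http://example.com/privacy"],
    "http://example.com/privacy": [],                       # dead end: links nowhere
    "http://example.com/orphan": ["http://example.com/"],   # orphan: nothing links here
}

def audit(graph):
    # Every URL that appears as a link target somewhere on the site.
    linked_to = {target for targets in graph.values() for target in targets}
    orphans = [page for page in graph if page not in linked_to]
    dead_ends = [page for page in graph if not graph[page]]
    return orphans, dead_ends

orphans, dead_ends = audit(graph)
```

On a real crawl, you would feed the site map algorithm's JSON output straight into a dict like this.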

Ready to learn more? Our free eBook, Five Algorithms Every Web Developer Can Use and Understand, is a short primer on when and how web developers can harness the power of natural language processing algorithms and other text analysis tools in their projects and apps.

Three Forces Accelerating Artificial Intelligence Development

Trends in Artificial Intelligence

Artificial intelligence is set to hit the mainstream, thanks to improved machine learning tools, cheaper processing power, and a steep decline in the cost of cloud storage. As a result, firms are piling into the AI market, and pushing the pace of AI development across a range of fields.

“Given sufficiently large datasets, powerful computers, and the interest of subject-area experts, the deep learning tsunami looks set to wash over an ever-larger number of disciplines.”

When it does, Nvidia will be there to capitalize. They announced a new chip design specifically for deep learning, with 15 billion transistors (a 3x increase) and the ability to process data 12x faster than previous chips.

For the first time we designed a [graphics-processing] architecture dedicated to accelerating AI and to accelerating deep learning.

Improved hardware contributes to Facebook’s ability to use artificial intelligence to describe photos to blind users, and it’s why Microsoft can now build a JARVIS-like personal digital assistant for smartphones.

Google, meanwhile, has its sights set on “solving intelligence, and then using that to solve everything else,” thanks to better processors and their cloud platform.

This kind of machine intelligence wouldn’t be possible without improved algorithms.

“I consider machine intelligence to be the entire world of learning algorithms, the class of algorithms that provide more intelligence to a system as more data is added to the system,” Shivon Zilis, creator of the machine intelligence framework, told Fast Forward Labs.  “These are algorithms that create products that seem human and smart.”

If you want to go deeper on the subject, O’Reilly has a free ebook out on The Future of Machine Intelligence, which unpacks the “concepts and innovations that represent the frontiers of ever-smarter machines” through ten interviews, spanning NLP, deep learning, autonomous cars, and more.

So, you might be wondering: Is the singularity near? Not at all, the NY Times says.

Machine learning algorithms ranked

Recapping Elon Musk’s Wild Week

Space X Falcon 9 landing on the drone ship

Last week, SpaceX, Elon Musk’s space company, successfully landed its Falcon 9 rocket on a drone ship for the first time.

The rocket landed vertically on a barge floating in choppy waters out in the Atlantic Ocean.

Bloomberg calls the landing a “moon walk” moment for the New Space Age. Whereas Musk simply mused: it’s just “another step toward the stars.”

But that’s not all: Tesla, Musk’s other company, received more than 325,000 preorders for their Model 3 last week, making it the “biggest one-week launch of any product.”

The preorders represent $14 billion in potential sales for the $35,000 electric car, which is expected to ship in late 2017. The unprecedented demand is a “watershed moment” for electric vehicles, the WSJ says.

As electric vehicles go mainstream, speculation mounts that the petrol car could be dead as early as 2025. We are now witnessing the slow-motion disruption of the global auto industry, Quartz adds. At least one car maker has taken notice: GM has invested $500 million in Lyft, with plans to build self-driving electric cars.

Going Mainstream: Integrating Machine Learning in the Cloud

Adding machine learning to the cloud

You might have heard: Google unveiled their new machine learning platform to “pour ML all over the cloud.” The move makes TensorFlow available to developers to do machine learning in the cloud with their own data.

Machine learning is developing fast, and “what will distinguish the good companies from the rest are things like domain expertise, quality of the dataset, and the ability to find the right problems to solve.” The Economist adds, “the firms that develop an early edge in artificial intelligence may reap the greatest rewards and erect barriers to entry.”

“At its highest level, machine learning is about understanding the world through data,” said Geoffrey Gordon, acting chair of Carnegie Mellon University’s Machine Learning Department. “Anything you can think of — public policy, finance, automobiles and robotics, for example — there’s a role for machine learning.”

While algorithm development is still broken, Coursera’s new Data Science Masters program is a step in the right direction. And, here are the top machine learning books for data scientists and machine learning engineers.

Confused about ML? Here’s how to approach machine learning as a non-technical person.

What AlphaGo Taught Us About The Importance Of Human Intuition

AlphaGo and Intuition

This is an excerpt from the post On AlphaGo, Intuition, and the Master Objective Function, by Jason Toy, founder of Somatic.io. For more, visit his blog. 

It is another great milestone in Artificial Intelligence that computers are now able to beat humans in the game of Go. The DeepMind/Google team came up with a combination of new algorithms that effectively reduced the search space of possible positions (roughly 10^170 of them) to a much smaller number that a computer could work with.

What I’m more excited about is that DeepMind/Google are publicly talking about intuition, and trying to build systems that emulate it. In their blog post, they mention that “The game is played primarily through intuition and feel.” They then go into detail on how they constructed a machine learning system that reduces the number of positions the system needs to evaluate at each turn. Their system essentially estimates expected outcomes instead of calculating them exhaustively, thereby loosely simulating intuition.

What is Intuition?

So what is the definition of intuition? According to Wikipedia, it is “a phenomenon of the mind, describes the ability to acquire knowledge without inference or the use of reason.”

Talking about this subject in relation to artificial intelligence has been a sore spot. Can intuition exist inside of a computer? The truth of the matter is we have no idea; people have been debating this since the beginning of computing, and much earlier still, going back to Descartes and others.

Another core problem is the name and definition of intuition itself. Should we call it intuition, emotion, gut reaction, lizard brain, or feeling? And what exactly would any of those mean? We have been trying to objectively define this for a long time.

The specific definition I have been using for a while is based on one of my favorite books: Thinking, Fast and Slow.

The author, Daniel Kahneman, has studied the mind for decades, and came to the conclusion that the mind runs two different algorithms:

  • System 1: Fast, automatic, frequent, emotional, stereotypic, subconscious
  • System 2: Slow, effortful, infrequent, logical, calculating, conscious

System 1 is a machine for jumping to conclusions. It is a heuristic based system that reacts to incomplete information and stimuli fast enough to navigate in our world in real time given all the uncertainty. It gives answers that are typically good enough, but can make mistakes given the small amount of information it processes. Escaping an animal trying to attack us, swerving away from a car that suddenly stops, looking towards a bomb explosion, adding 1+1. Those are all examples of System 1 processing.

System 1 vs System 2

System 2 on the other hand is a completely different system. It is a slow and deliberate system that does things like complicated math, answering tests, writing a novel. System 2 can override System 1 when it senses that the answer from System 1 is wrong. It is presumed to reside in our neocortex and as far as we know, only humans have System 2.

It seems that humans and ALL living organisms have System 1. If we assume the System 1/System 2 combination is how our brains work, then I would argue that we have already built an excellent System 2 in computers. All the current machine learning algorithms we build are based on logic and calculation, which is what System 2 is all about. But System 2 cannot really exist by itself; it needs guidance.
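As a loose software analogy (ours, not Kahneman’s): System 1 resembles a fast heuristic path that answers the easy cases immediately and defers when it isn’t confident, while System 2 resembles a slow, exact computation that takes over on the hard cases. A sketch in Python, using primality testing as a stand-in problem:

```python
# Loose software analogy of System 1 / System 2:
# a fast heuristic answers when it is confident; otherwise
# a slow, exact computation takes over.

def system1_is_prime(n):
    """Fast, heuristic: handles the easy cases instantly, or defers."""
    if n < 2:
        return False
    if n in (2, 3, 5, 7):
        return True
    if n % 2 == 0 or n % 5 == 0:
        return False
    return None  # not confident: defer to System 2

def system2_is_prime(n):
    """Slow, deliberate: trial division, always correct."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_prime(n):
    answer = system1_is_prime(n)
    return system2_is_prime(n) if answer is None else answer
```

The fast path handles most inputs cheaply, and the correctness of the whole comes from the slow path it falls back on, mirroring how System 2 overrides System 1 when it senses a wrong answer.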

Objective Functions

All of these machine learning systems optimize for a specific objective function. Depending on the problem you are trying to solve, you use different algorithms and objective functions. With classical machine learning applications like regression and classification, the algorithm is trained on training data, and the objective is to have the machine get as many answers correct as possible on a previously unseen validation set.
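As a toy illustration of an objective function (our own example, not a system from the post): fitting the slope of a line by gradient descent, where the objective being minimized is mean squared error on a small dataset.

```python
# Toy objective function: mean squared error for a line y = w * x,
# minimized by gradient descent on a tiny dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the optimum is w = 2

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0
learning_rate = 0.05
for _ in range(200):
    # Gradient of MSE with respect to w: 2 * mean(x * (w*x - y)).
    grad = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad
```

Everything the learner “wants” is encoded in `mse`; change the objective and you change what the system converges to, which is exactly the point the post goes on to make.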

training

In DeepMind’s video game system, the algorithm is hooked directly to the game’s score, so the system optimizes for winning the video game.

AlphaGo’s objective function is to win at Go, and that is it; it can’t do anything else.

Notice how each of these machine learning models has a single, fixed objective function?

And what is the objective function of a living organism? Is it to eat? To reproduce? To live as long as possible? To be “happy”? To achieve homeostasis? How about for a human? Is it the same as for other living organisms? If each living organism has a single objective function, it would seem that they all have different ones.

And what is my objective function as I write this blog post? I have the overarching goal of publishing my thoughts on AlphaGo, yet as I write, my intuition about the direction the post should take keeps changing. How do we put that kind of intuition into a computer? Think about what you do in your daily life: how do you decide what to do next?

If we want to get to general artificial intelligence, I believe that a heuristics-based reaction system, or System 1, should replace the objective function of a machine learning algorithm. In fact, they are the same thing.

For more, check out the full post On AlphaGo, Intuition, and the Master Objective Function.

The 2016 Internet of Things Landscape in Two Infographics

The 2016 Internet of Things Landscape

“An entirely new way of interacting with our world is emerging,” writes Matt Turck from FirstMark Capital. “The Internet of Things is about the transformation of any physical object into a digital data product.” The 2016 Internet of Things landscape spans five areas: the connected home, wearables, healthcare, robotics/drones, and transportation.

Meanwhile, concerns about the security of IoT devices loom, with Linux creator Linus Torvalds reminding us that “security plays second fiddle.” For instance, an Amazon Echo got confused and hijacked a thermostat after NPR was left on.

If you think that’s amusing, remember that with Alexa, Cortana, Siri, Google Now, and others, “you have a personal digital assistant that knows you,” Microsoft’s Satya Nadella says, “knows your preferences, has the ability, in a privacy-protecting way, to go and look at your information and your organization’s information and help you with your tasks.”

Plus: Learn how to build a motion sensing security camera with a Raspberry Pi. That reminds us of our own hack: a front desk AI inspired by Mark Zuckerberg’s New Year’s resolution.

Mobile is having an impact on the 2016 Internet of Things landscape

Infographic courtesy of Mobile Future.

Infographic: What Are Algorithms?

You May Have Heard: Instagram is moving to an algorithmically curated feed soon. This could be a good thing for users, but it’s worrisome for brands, Re/Code writes. While algorithmic feeds may save us from information overload, the New Statesman asks: Are these the curators we want?

Relatedly, both Netflix and Twitter have made significant changes to their algorithms recently: Netflix updated its recommendation engine to better serve its growing global audience, while Twitter rolled out an algorithmic timeline.

Journalists analyzed the 11.5 million Panama Papers using optical character recognition (OCR) to transform more than 2.6 terabytes of messy information into a clean, searchable index.
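The “searchable index” half of that pipeline is, at its core, an inverted index: map each word to the set of documents containing it, then intersect those sets at query time. A minimal sketch (our own toy example, not the journalists’ actual tooling):

```python
from collections import defaultdict

# Minimal inverted index: word -> set of document ids containing it.
documents = {
    "doc1": "offshore shell company registered in panama",
    "doc2": "shell company transfers and wire records",
    "doc3": "passport scan and incorporation records",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return ids of documents containing every word in the query."""
    word_sets = [index[word] for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()
```

In the real project, the OCR output would stand in for the `documents` values; everything downstream of OCR is the same set-intersection idea, at terabyte scale.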

All of this is powered by algorithms. So, what is an algorithm anyway? This infographic explains all:

What are algorithms? Credit futurism.com

Reality Check: Is Augmented and Virtual Reality Ready for Prime Time?

Is virtual reality ready for prime time? Virtual reality and augmented reality are on the cusp of going mainstream. Facebook (Oculus Rift), HTC (Vive), and Sony (PlayStation VR) will all release VR headsets this year.

Not to mention HoloLens, Microsoft’s slick augmented reality headset that “blends computer-generated imagery with the real world.” It’s already shipping. Oculus Rift deliveries, meanwhile, have been slowed by a component shortage.

In an Oculus vs HoloLens showdown, TechCrunch says, “There’s no way of explaining how fast you’ll get used to the world suddenly having a layer of data over it.”

A new type of “console war” is emerging, with companies racing to build market share among early adopters. You can understand why: Bloomberg reports that VR could turn into a $1.5 billion business.

That is, of course, if developers can “craft immersive content that educates, informs or entertains” to meet customer demand.

And, what does VR mean for art institutions anyway?