Algorithmia

Algorithmia Earns Gartner Data Science Cool Vendor Award

Gartner Names Algorithmia Top Cool Vendor Data Science 2016

Algorithmia is excited to announce that it has been selected as a Gartner Cool Data Science vendor for 2016.

Gartner’s annual Cool Vendor report identifies the five most innovative, impactful, and intriguing data science vendors that help analytics leaders solve hard problems, and automate and scale their advanced analytics.

Algorithmia provides an open marketplace for algorithms, where developers can create, share, and remix the building blocks of algorithmic intelligence at scale.

On Algorithmia, algorithms run as serverless microservices, where a function can be written once — in any programming language — and then made available as part of a universal API, hosted on scalable infrastructure in the cloud.

This type of platform enables analytics leaders to quickly extract value from their datasets, by leveraging existing algorithms, and creating and hosting new models with ease.

For instance, a developer could create a simple, composable, and fully automated economic development report using predictive algorithms from Algorithmia.

Gartner is the world’s leading information technology research and advisory company.

Read the Gartner Cool Vendors in Data Science 2016 report here.

How We Turned Our GitHub README Model Into a Microservice

Hosting our GitHub Machine Learning Model

A few posts ago we talked about our data science approach to processing and analyzing a GitHub README.

We published the algorithm on our platform, and released our project as a Jupyter Notebook so you could try it for yourself. If you want to skip ahead, here’s the demo we made that grades your GitHub README.

It was a fun project that definitely motivated us to improve our own READMEs, and we hope it inspired you, too (I personally got a C on one of mine, ouch!).

In this post, we’d like to take a step back to talk about how we used Algorithmia to host our scikit-learn model as an algorithm, and discuss the benefits associated with using our platform to deploy your own algorithm.

Why Host Your Model on Algorithmia?

There are many advantages to using Algorithmia to host your models.

The first is that you can transform your machine learning model into an algorithm that becomes a scalable, serverless microservice in just a few minutes, written in the language of your choice. Well, as long as that choice is: Python, Java, Rust, Scala, JavaScript or Ruby. But don’t worry, we are always adding more!

Second, you can monetize your proprietary models by collecting royalties on the usage. This way your agonizingly late nights slamming coffee and turning ideas into code won’t be for nothing. On the other hand, if altruism and transparency are your thing (it’s ours!) then you can open-source your projects and publish your algorithms for free.

Either way, the runtime (server) costs are being billed to the user who calls your algorithm. Learn more about pricing and permissions here.


A Step-By-Step Guide to Hosting Your Model

Now, on to the process of how we hosted the GitHub README analyzer model, and made an algorithm! This walkthrough is designed as an introduction to model and algorithm hosting, even if you’ve never used Algorithmia before. For the GitHub README project, we developed, trained, and pickled our models locally. The following steps assume you’ve done the same, and will highlight the steps for getting your trained model onto Algorithmia.

Step 1: Add Your Data

The Algorithmia Data API is used when you have large data requirements or need to preserve state between calls. It allows algorithms to access data from within the same session, but ensures that your data is safe.

To use the Data API, log into your Algorithmia account and create a data collection via the Data Collections page. To get there, click the “Manage Data” link from the user profile icon dropdown in the upper right-hand corner.

Once there, click on “Add Collection” under the “My Collections” section on your data collections page. After you create and name your data collection, you’ll want to set the read and write access to make it public, or lock it down as private. For more information about the four different types of data collections and permissions, check out our Data API docs.

Besides setting your algorithm’s permissions, this is also where you upload your data, your models, or both. In our GitHub README example, we needed to upload our pickled model rather than raw data. Uploading your data or model is as easy as clicking the ‘Drop files here to upload’ box, or dragging and dropping the files from your desktop.

You’ll need the path to your files in the next step. For example:
data://username/collection_name/data_model.pkl.zip
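
If you prefer to script this step, you can also upload the file through the Data API with the Python client. A minimal sketch, assuming the collection already exists under your account and you have an API key handy:

import Algorithmia

# Hypothetical key and paths; substitute your own account, collection, and file
client = Algorithmia.client("YOUR_API_KEY")
client.file("data://username/collection_name/data_model.pkl.zip") \
      .putFile("/path/to/local/data_model.pkl.zip")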

Upload your model

Our recommendation is to preload your model before the apply() function (details below). This ensures the model is downloaded, and loaded into memory without causing a timeout when working with large model files. We support up to 4GB of memory per 5-minute session.

When preloading the model like this, only the initial call will load the model into memory, which can take several seconds with large files.

From there, only the apply() function will be called, which will return data much faster.

Step 2: Create an Algorithm

Now that we have our pickled model in a collection, we’ll create our algorithm and set our dependencies.

To add an algorithm, simply click “Add Algorithm” from the user profile icon. There, you will give it a name, pick the language, select permissions, and make the code either open or closed source.

Create your algorithm and set permissions.

Once you finish that step, go to your profile from the user profile icon where your algorithm will be listed by name. Click on the name and you’ll be taken to the algorithm’s page that you’ll eventually publish.

There is a purple tag that says “Edit Source.” Click that and it’ll open the editor where you can add your model and update your dependencies for the language of your choice. If you have questions about adding your dependencies check out our Algorithm Developer Guides.
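
For a Python algorithm, dependencies are listed in a requirements-style file that you edit through the dependencies dialog in the web editor. A minimal sketch of what that might look like for a pickled scikit-learn model (the packages and version constraint here are illustrative, not the exact ones we used):

# requirements-style dependency list for a Python algorithm
algorithmia>=1.0.0
scikit-learn
numpy
scipy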

Your new algorithm!

Step 3: Load Your Model

After you’ve set the dependencies, you can load the pickled model you uploaded in step 1. You’ll want to import the libraries and modules you’re using at the top of the file, and then create a function that loads the model.

Here’s an example in Python:

import Algorithmia
import pickle
import zipfile

client = Algorithmia.client()

def loadModel():
    # Get the compressed model file from your data collection
    file_path = 'data://.my/collection_name/model_file.pkl.zip'
    model_path = client.file(file_path).getFile().name
    # Unzip the compressed model file
    zip_ref = zipfile.ZipFile(model_path, 'r')
    zip_ref.extract('model_file.pkl')
    zip_ref.close()
    # Load the model into memory
    with open('model_file.pkl', 'rb') as model_file:
        model = pickle.load(model_file)
    return model

# Preload the model outside of apply() so it is only loaded once per session
model = loadModel()

def apply(input):
    # Do something with your model and return your output for the user
    return some_data

Step 4: Publish Your Algorithm

Now, all you have to do is save/compile, test, and publish your algorithm! When you publish your algorithm, you also set the permissions for public or private use, and whether to make it royalty-free or charge a per-call royalty. Also note that you can set version-level permissions, such as whether the algorithm requires internet access, or whether it is allowed to call other algorithms.

Publish your algorithm!

If you need more information and detailed steps for creating and publishing your algorithm, check out our detailed guide to publishing your first algorithm.

Now that you’ve hosted your model and published your algorithm, tell people about it! It’s exciting to see your hard work utilized by others. If you need some inspiration, check out the GitHub README demo, and our use cases.

If you have more questions about building your algorithm check out our Algorithmia Developer Center or get in touch with us.

CRN Big Data 100 Report Names Algorithmia a Top Business Analytics Vendor

Algorithmia Wins CRN Big Data 100 Award

For the second year in a row, Algorithmia earns a spot on the CRN Big Data 100 Business Analytics list.

Now in its fourth year, the Big Data 100 highlights vendors that create innovative products and services that help businesses make sense of their data. Algorithmia earned this same award in 2015.

“Working with big data remains one of the biggest IT challenges that businesses, government agencies and other organizations face today,” CRN says. “It’s also one of the biggest opportunities for IT vendors and for solution and strategic service providers.”

The Big Data 100 report is divided into three categories: Business Analytics, Platforms and Tools, and Data Management Technologies.

Announcing Algorithm Development Support for Python 3, JavaScript, Rust, and Ruby

Algorithm development in Python, JavaScript, Rust, and Ruby

Today, we’re excited to announce full algorithm development support for Python 3, JavaScript, Rust, and Ruby!

Now, more developers can contribute to the world’s largest marketplace for algorithms in their favorite programming language. Python 3, JavaScript, Rust, and Ruby join Java, Scala, and Python 2 in our family of supported languages for algorithm development. We’re also pleased to announce support for PyPI, npm, Crates.io, and RubyGems, so developers can easily add third-party dependencies to their algorithms.

Check out our algorithm development guides below to get started creating, sharing, and monetizing your code. We can’t wait to see the algorithms you contribute to the marketplace with these newly supported languages.

The Algorithmia platform runs entirely in the cloud, and offers all the infrastructure and tools needed for algorithm developers to build and deploy quickly. Our serverless architecture ensures developers can focus on the most important thing: creating state-of-the-art algorithms in the language of their choice, and immediately making them available for other developers to use.

If you have any questions or suggestions, please get in touch by email or @Algorithmia.


Algorithm Development Guides

For more, check out our overview of the algorithm developer program, and API Docs.


More About Algorithmia

We’ve created an open marketplace for algorithms and algorithm development, making state-of-the-art algorithms accessible and discoverable by everyone.

On Algorithmia, algorithms run as containerized microservices that act as the building blocks of algorithmic intelligence, which developers can use to create, share, and remix at scale. By making algorithms composable, interoperable, and portable, algorithms can be written in any supported language, and then made available to application developers where the code is always “on,” and available via a simple REST API. Application developers can access the API via our clients for Java, Scala, Python, JavaScript, Node.js, Ruby, and Rust, as well as the CLI and cURL. We also have an AWS Lambda blueprint for those working on IoT-type projects.

Algorithmia is the largest marketplace for algorithms in the world, with more than 18,000 developers leveraging 2,000+ algorithms.

 

IDC Innovators Report Names Algorithmia a Platform as a Service (PaaS) Pioneer

Algorithmia earns IDC Innovators Report 2016 Platform as a Service Recognition

Algorithmia is excited to announce that we’ve been selected as one of the top five worldwide pioneers in the platform as a service market by International Data Corporation (IDC).

The recent IDC Innovators Report evaluates startups under $50 million in revenue based on their inventive technologies, and groundbreaking business models.

At Algorithmia, we’re creating an open marketplace for algorithms and algorithm development, where developers can turn code into the building blocks needed to create, share, and remix algorithmic intelligence at scale.

“Algorithms are being developed all the time, but they’re not getting into the hands of the people and applications that could benefit from them the most,” said Diego Oppenheimer, founder and CEO of Algorithmia.

Algorithms are crucial in the modern technology economy as companies look to create intelligent applications from their data, and Algorithmia solves this by making state-of-the-art algorithms accessible and discoverable by everyone — not just those that work at top-five global technology companies.

“Rapid growth and the availability of low-cost cloud infrastructure is bringing new challenges to outdated business models by startups building innovative solutions challenging legacy businesses,” said Larry Carvalho, research manager for platform as a service at IDC. “A new set of easily consumable cloud services is accelerating the ability to build and deliver new functionality.”

Other companies mentioned in the IDC Innovators Report include Reltio, CloudMunch, Kismatic, and Iron.io.

A Data Science Approach to Writing a Good GitHub README

Octocat and the GitHub README Analyzer

What makes a good GitHub README? It’s a question almost every developer has asked as they’ve typed:

 git init
 git add README.md
 git commit -m "first commit"

As far as we can tell, nobody has a clear answer for this, yet you typically know a good GitHub README when you see it. The best explanation we’ve found is the README Wikipedia entry, which offers a handful of suggestions.

We set out to flex our data science muscles, and see if we could come up with an objective standard for what makes a good GitHub README using machine learning. The result is the GitHub README Analyzer demo, an experimental tool to algorithmically improve the quality of your GitHub READMEs.

GitHub README Analyzer Demo

The complete code for this project can be found in our iPython Notebook here.


Thinking About the GitHub README Problem

Understanding the contents of a README is a problem for NLP, and it’s a particularly hard one to solve.

To start, we made some assumptions about what makes a GitHub README good:

  1. Popular repositories probably have good READMEs
  2. Popular repositories should have more stars than unpopular ones
  3. Each programming language has a unique pattern, or style

With those assumptions in place, we set out to collect our data, and build our model.


Approaching the Solution

Step One: Collecting Data

First, we needed to create a dataset. We decided to use the 10 most popular programming languages on GitHub by the number of repositories, which were JavaScript, Java, Ruby, Python, PHP, HTML, CSS, C++, C, and C#. We used the GitHub API to retrieve a count of repos by language, and then grabbed the READMEs for the 1,000 most-starred repos per language. We only used READMEs that were encoded in either Markdown or reStructuredText formats.

We scraped the first 1,000 repositories for each language. Then we removed any repository that didn’t have a GitHub README, was encoded in an obscure format, or was in some way unusable. After removing the bad repos, we ended up with — on average — 878 READMEs per language.
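
For reference, pulling a README through the GitHub API looks roughly like the sketch below. This is a simplified illustration, assuming a personal access token in the GITHUB_TOKEN environment variable; the real collection step also filtered repos by language and star count.

import base64
import os
import requests

def fetch_readme(owner, repo):
    # The /readme endpoint returns the preferred README as base64-encoded JSON
    url = "https://api.github.com/repos/{}/{}/readme".format(owner, repo)
    headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None  # repo has no README, or it isn't accessible
    payload = response.json()
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")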

Step Two: Data Scrubbing

We then needed to convert all the READMEs into a common format for further processing. We chose to convert everything to HTML. When we did this, we also removed common words from a stop word list, and then stemmed the remaining words to find the common base — or root — of each word. We used the NLTK Snowball stemmer for this. We also kept the original, unstemmed words to be used later for making recommendations from our model for the demo (e.g. instead of recommending “instal” we could recommend “installation”).

This preprocessing ensured that all the data was in a common format and ready to be vectorized.
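
In rough outline, the scrubbing step looks something like the sketch below. It assumes NLTK and its English stop word list are installed (nltk.download('stopwords')), and it keeps a map from each stem back to an original word for the recommendation step.

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, drop stop words, and stem what remains, remembering one
    # original word per stem so the demo can suggest "installation"
    # instead of "instal"
    stems, originals = [], {}
    for word in text.lower().split():
        if word in stop_words:
            continue
        stem = stemmer.stem(word)
        stems.append(stem)
        originals.setdefault(stem, word)
    return stems, originals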

Step Three: Feature Extraction

Now that we have a clean dataset to work with, we need to extract features. We used scikit-learn, a Python machine learning library, because it’s easy to learn, simple, and open sourced.

We extracted five features from the preprocessed dataset:

  1. 1-grams from the headers (<h1>, <h2>, <h3>)
  2. 1-grams from the text of the paragraphs (<p>)
  3. Count of code snippets (<pre>)
  4. Count of images (<img>)
  5. Count of characters, excluding HTML tags

To build feature vectors for our README headers and paragraphs, we tried TF-IDF, Count, and Hashing vectorizers from scikit-learn. We ended up using TF-IDF, because it had the highest accuracy rate. We used 1-grams, because anything more than that became computationally too expensive, and the results weren’t noticeably improved. If you think about it, this makes sense. Section headers tend to be single words, like Installation, or Usage.

Since code snippets, images, and total characters are numerical values by nature, we simply cast them into a 1×1 matrix.
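
Put together, the feature extraction looks roughly like the sketch below, where header_texts, paragraph_texts, and the count lists are hypothetical per-README inputs produced by the scrubbing step.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 1-gram TF-IDF features for headers and paragraph text
header_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_headers = header_vectorizer.fit_transform(header_texts)

paragraph_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_paragraphs = paragraph_vectorizer.fit_transform(paragraph_texts)

# Numerical features: one value per README, reshaped into column vectors
X_code = np.array(code_counts).reshape(-1, 1)
X_images = np.array(image_counts).reshape(-1, 1)
X_length = np.array(char_counts).reshape(-1, 1)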

Step Four: Benchmarking

We benchmarked 11 different classifiers against each other for all five features, and then repeated this for all 10 programming languages. In reality, there was an 11th “language,” which was the aggregate of all 10 languages. We later used this 11th (i.e. “All”) language to train a general model for our demo. We did this to handle situations where a user was trying to analyze a README representing a project in a language that wasn’t one of the top 10 languages on GitHub.

We used the following classifiers: Logistic Regression, Passive Aggressive, Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Nearest Centroid, Label Propagation, SVC, Linear SVC, Decision Tree, and Extra Tree.

We would have used nuSVC, but as Stack Overflow can explain, it has no direct interpretation, which prevented us from using it.

In all, we created 605 different models (11 classifiers x 5 features x 11 languages — remember, we created an “All” language to produce an overall model, hence 11.). We picked the best model per feature per programming language, which means we ended up using 55 unique models (5 features x 11 languages).

Feature Extraction Model Mean Squared Error

View the interactive chart here. Find the detailed error rates broken out by language here.

Some notes on how we trained: 

  1. Before we started, we divided our dataset into a training set, and a testing set: 75%, and 25%, respectively.
  2. After training the initial 605 models, we benchmarked their accuracy scores against each other.
  3. We selected the best models for each feature, and then we saved (i.e. pickled) the model. Persisting the model file allowed us to skip the training step in the future.

To determine the error rate of the models, we calculated their mean squared errors. The model with the lowest mean squared error was selected for each corresponding feature and language – 55 models were selected in total.
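
For a single feature/language pair, the selection loop reduces to something like the sketch below. It is a simplified illustration, assuming X is the feature matrix and y is the popularity target (e.g. star counts), with only a few of the 11 classifiers shown.

import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

candidates = {
    "logistic_regression": LogisticRegression(),
    "passive_aggressive": PassiveAggressiveClassifier(),
    "multinomial_nb": MultinomialNB(),
    "bernoulli_nb": BernoulliNB(),
}

# 75% / 25% train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

best_model, best_mse = None, float("inf")
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_model, best_mse = model, mse

# Persist (pickle) the winning model so training can be skipped later
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)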

Step Five: Putting it All Together

With all of this in place, we can now take any GitHub repository, and score the README using our models.

For non-numerical features, like Headers and Paragraphs, the model tries to make recommendations that should improve the quality of your README by adding, removing, or changing every word, and recalculating the score each time.

For the numerical-based features, like the count of <pre>, <img> tags, and the README length, the model incrementally increases and decreases values for each feature. It then re-evaluates the score with each iteration to determine an optimized recommendation. If it can’t produce a higher score with any of the changes, no recommendation is given.
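
In pseudocode-ish Python, the numeric search is in the spirit of the sketch below; score_readme and the step sizes are hypothetical stand-ins for the real scoring function and tuning.

def recommend_numeric(score_readme, current_value, step=1, max_steps=10):
    # Nudge the feature value up and down, re-score each candidate with the
    # trained model, and keep the best-scoring value found
    best_value, best_score = current_value, score_readme(current_value)
    for delta in range(1, max_steps + 1):
        for candidate in (current_value + delta * step, current_value - delta * step):
            if candidate < 0:
                continue
            score = score_readme(candidate)
            if score > best_score:
                best_value, best_score = candidate, score
    # If nothing beats the current value, no recommendation is made
    return best_value if best_value != current_value else None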

Try out the GitHub README Analyzer now.


Conclusion

Some of our assumptions proved to be true, while some were off. We found that our assumptions about headers and the text from paragraphs correlated with popular repositories. However, this wasn’t true for the length of a README, or the count of code samples and images.

The reason for this may stem from the fact that n-gram based features, like Headers and Paragraphs, train on hundreds or thousands of variables per document, whereas numerical features train on only one variable per document. Hence, ~1,000 repositories per programming language were enough to train models for the n-gram based features, but not enough to train decent numerical-based models. Another reason might be that the numerical-based features had noisy data, or simply didn’t correlate with the number of stars a project had.

When we tested our models, we initially found that they kept recommending we significantly shorten every README (e.g. one README was recommended to be shortened from 408 characters down to 6!), and remove almost all of the images and code samples. This was counterintuitive, so after plotting the data, we learned that a big part of our dataset had zero images, zero code snippets, or was less than 1,500 characters long.

Plot of READMEs with counts of zero

This explains why our model behaved so poorly at first for the length, code sample, and image features. To account for this, we took an additional step in our preprocessing and removed any README that was less than 1,500 characters, contained zero images, or zero code samples. This dramatically improved the results for images and code sample recommendations, but the character length recommendation did not improve significantly. In the end, we chose to remove the length recommendation from our demo.


Think you can come up with a better model? Check out our iPython Notebook covering all of the steps from above, and tweet us @Algorithmia with your results.


Below is the numerical feature distribution for images, code snippets, and character length across all READMEs. Find the detailed histograms broken out by language here.

A histogram of the count of images, code snippets, and character length

An Introduction to Natural Language Processing

An introduction to natural language processing

This introduction to Natural Language Processing, or NLP for short, describes one of the central problems of artificial intelligence. NLP is focused on the interactions between human language and computers, and it sits at the intersection of computer science, artificial intelligence, and computational linguistics.

What is Natural Language Processing?

NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like automatic text summarization, sentiment analysis, topic extraction, named entity recognition, parts-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.

NLP is characterized as a hard problem in computer science. Human language is rarely precise, or plainly spoken. To understand human language is to understand not only the words, but the concepts and how they’re linked together to create meaning. Despite language being one of the easiest things for humans to learn, the ambiguity of language is what makes natural language processing a difficult problem for computers to master.

Natural Language Processing Algorithms

NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statistical inference. In general, the more data analyzed, the more accurate the model will be.

Popular Open Source NLP Libraries:

These libraries provide the algorithmic building blocks of NLP in real-world applications. Algorithmia provides a free API endpoint for many of these algorithms, without ever having to set up or provision servers and infrastructure.

A Few NLP Examples:

NLP algorithms can be extremely helpful for web developers, providing them with the turnkey tools needed to create advanced applications, and prototypes.

Natural Language Processing Tutorials

Recommended NLP Books

  • Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
    This is a book about Natural Language Processing. By “natural language” we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. At one extreme, it could be as simple as counting word frequencies to compare different writing styles.
  • Speech and Language Processing, 2nd Edition 2nd Edition
    An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology – at all levels and with all modern technologies – this text takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corpora. The authors cover areas that traditionally are taught in different courses, to describe a unified vision of speech and language processing.
  • Introduction to Information Retrieval
    As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval systems. However, during the last decade, relentless optimization of information retrieval effectiveness has driven web search engines to new quality levels where most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows, 2004) found that 92% of Internet users say the Internet is a good place to go for getting everyday information. To the surprise of many, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people’s preferred means of information access.

NLP Videos

NLP Courses

  • Stanford Natural Language Processing on Coursera
    This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.
  • Stanford Machine Learning on Coursera
    Machine learning is the science of getting computers to act without being explicitly programmed. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.
  • Udemy’s Introduction to Natural Language Processing
    This course introduces Natural Language Processing through the use of python and the Natural Language Tool Kit. Through a practical approach, you’ll get hands on experience working with and analyzing text. As a student of this course, you’ll get updates for free, which include lecture revisions, new code examples, and new data projects.
  • Certificate in Natural Language Technology
    When you talk to your mobile device or car navigation system – or it talks to you – you’re experiencing the fruits of developments in natural language processing. This field, which focuses on the creation of software that can analyze and understand human languages, has grown rapidly in recent years and now has many technological applications. In this three-course certificate program, we’ll explore the foundations of computational linguistics, the academic discipline that underlies NLP.

Related NLP Topics

Six Natural Language Processing Algorithms for Web Developers

Natural language processing algorithms are a web developer’s best friend. These powerful tools easily integrate into apps, projects, and prototypes to extract insights and help developers understand text-based content.

We’ve put together a collection of the most common NLP algorithms freely available on Algorithmia. Think of these tools as the building blocks needed to quickly create smart, data rich applications.

New to NLP? Start with our guide to natural language processing algorithms.


Summarizer

Quickly get the tl;dr on any article. Summarizer creates a summary of a text document, while retaining the most important parts. Summarizer takes a large block of unstructured text as a string, and extracts the key sentences based on the frequency of topics and terms it finds. Try out Summarizer here.

Sample Input:

"A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution.
Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending.
We propose a solution to the double-spending problem using a peer-to-peer network.
The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.
The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power.
As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers.
The network itself requires minimal structure.
Messages are broadcast on a best effort basis, and nodes can leave and rejoin the network at will, accepting the longest proof-of-work chain as proof of what happened while they were gone."

Sample Output:

We propose a solution to the double-spending problem using a peer-to-peer network.
The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.
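
If you want to call Summarizer programmatically rather than from the web console, the Python client call looks roughly like the sketch below. The algorithm path and response handling are assumptions, so check the Summarizer algorithm page for the exact path, version, and pricing before relying on it.

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")
algo = client.algo("nlp/Summarizer")

text = "A purely peer-to-peer version of electronic cash would allow ..."  # the abstract above
response = algo.pipe(text)
# Depending on the client version, pipe() may return the result directly
# or a response object exposing it as .result
print(response.result if hasattr(response, "result") else response)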

Sentiment Analysis

Want to understand the tone or attitude of a piece of text? Use sentiment analysis to determine its positive or negative feelings, opinions, or emotions. You can quickly identify and extract subjective information from text simply by passing this algorithm a string. The algorithm assigns a sentiment rating from 0 to 4, which corresponds to very negative, negative, neutral, positive, and very positive. Get started with sentiment analysis here.

Sample Input:

"Algorithmia loves you!"

Sample Output:

3

Need to analyze the sentiment of social media? The Social Sentiment Analysis algorithm is tuned specifically for bite-sized content, like tweets and status updates. This algorithm also provides you with some additional information, like the confidence levels for the positive-ness, and negative-ness of the text.

Sample Input:

{
  "sentence": "This old product sucks! But after the update it works like a charm!"
}

Sample Output:

{
  "positive": 0.335,
  "negative": 0.145,
  "sentence": "This old product sucks! But after the update it works like a charm!",
  "neutral": 0.521,
  "compound": 0.508
}

Want to learn more about sentiment analysis? Try our guide to sentiment analysis.


LDA (Auto-Keyword Tags)

LDA is a core NLP algorithm used to take a group of text documents, and identify the most relevant topic tags for each. LDA is perfect for anybody needing to categorize the key topics and tags from text. Many web developers use LDA to dynamically generate tags for blog posts and news articles. This LDA algorithm takes an object containing an array of strings, and returns an array of objects with the associated topic tags. Try it here.

We have a second algorithm called Auto-Tag URL, which is dead-simple to use: pass in a URL, and the algorithm returns the tags with the associated frequency. Try both, and see what works for your situation.

Sample Input:

"https://raw.githubusercontent.com/mbernste/machine-learning/master/README.md"

Sample Output:

{
  "algorithm": 4,
  "structure": 2,
  "bayes": 2,
  "classifier": 2,
  "naïve": 2,
  "search": 2,
  "algorithms": 2,
  "bayesian": 2
}

Analyze Tweets

Search Twitter for a keyword, and then analyze the matching tweets for sentiment and LDA topics. Marketing and brand managers can use Analyze Tweets to monitor their brands and products for mentions to determine what customers are saying, and how they feel about a product. This microservice outputs a list of all the matching tweets with corresponding sentiment, and groups the top positive and negative LDA topics. It also provides positive-specific and negative-specific tweets, in case you want to look for just one flavor of sentiment. Before using this algorithm, you’ll need to generate free API keys through Twitter.

Need to analyze a specific Twitter account? We also provide Analyze Twitter User, to help you understand what a specific user tweets about positively, or negatively. This is great for identifying influencers.

Sample Input:

{
  "query": "seattle",
  "numTweets": "1000",
  "auth":
    {
      "app_key": "<YOUR TWITTER API KEY>",
      "app_secret": "<YOUR TWITTER SECRET>",
      "oauth_token": "<YOUR TWITTER OAUTH TOKEN>",
      "oauth_token_secret": "<YOUR TWITTER TOKEN SECRET>"
    }
}

Sample Output:

{
  "negLDA": [
    {
      "seattle's": 2,
      "adults": 2,
      "breaking": 2,
      "oldest": 1,
      "liveonkomo": 1,
      "teens": 2,
      "seattle": 5,
      "charged": 3
    }
  ],
  "allTweets": [
    {
      "text": "RT @Harry_Styles: Thanks for having us Seattle. You were amazing tonight. Hope you enjoyed the show. All the love",
      "created_at": "Thu Feb 04 19:34:18 +0000 2016",
      "negative_sentiment": 0,
      "tweet_url": "https://twitter.com/statuses/695329550536949761",
      "positive_sentiment": 0.55,
      "overall_sentiment": 0.9524,
      "neutral_sentiment": 0.45
    },
    {
      "text": "@kexp is super moody, super fun. Really have dug listening to them while in #Seattle \n\nProbably most lined up with my listening prefs",
      "created_at": "Thu Feb 04 19:31:04 +0000 2016",
      "negative_sentiment": 0.077,
      "tweet_url": "https://twitter.com/statuses/695328736531451904",
      "positive_sentiment": 0.34,
      "overall_sentiment": 0.8625,
      "neutral_sentiment": 0.583
    }
  ],
  "posLDA": [
    {
      "urban": 2,
      "love": 3,
      "lil": 2,
      "gentrified": 2,
      "openingday": 2,
      "cores": 2,
      "seattle": 8,
      "enjoyed": 3
    }
  ],
  "negTweets": [
    {
      "text": "I couldn't be more excited about leading a missions team to Seattle, Washington this summer! Join us!",
      "created_at": "Thu Feb 04 19:52:51 +0000 2016",
      "negative_sentiment": 0.157,
      "tweet_url": "https://twitter.com/statuses/695334220764463104",
      "positive_sentiment": 0.122,
      "overall_sentiment": -0.1622,
      "neutral_sentiment": 0.721
    }
  ],
  "posTweets": [
    {
      "text": "RT @Harry_Styles: Thanks for having us Seattle. You were amazing tonight. Hope you enjoyed the show. All the love",
      "created_at": "Thu Feb 04 19:34:18 +0000 2016",
      "negative_sentiment": 0,
      "tweet_url": "https://twitter.com/statuses/695329550536949761",
      "positive_sentiment": 0.55,
      "overall_sentiment": 0.9524,
      "neutral_sentiment": 0.45
    }
  ]
}

We’ve truncated the above output, because there is wayyy too much to display on the blog (there were 1,000+ tweets analyzed!). Check out the Analyze Tweets page to play with the algorithm, and see the full output.


Analyze URL

Need to quickly scrape a URL, and get the content and metadata back? Analyze URL is perfect for any web developer wanting to quickly turn HTML pages into structured data. Pass the algorithm a URL as a string, and it returns the text of the page, the title, a summary, the thumbnail if one exists, timestamp, and the status code as an object. Get Analyze URL here.

Sample Input:

["http://blog.algorithmia.com/2016/02/algorithm-economy-containers-microservices/"]

Sample Output:

{
  "summary": "Algorithms as Microservices: How the Algorithm Economy and Containers are Changing the Way We Build and Deploy Apps Today",
  "date": "2016-02-22T11:38:11-07:00",
  "thumbnail": "http://blog.algorithmia.com/wp-content/uploads/2016/02/algorithm-economy.png",
  "marker": true,
  "text": "Build tomorrow's smart apps today In the age of Big Data, algorithms give companies a competitive advantage. Today’s most important technology companies all have algorithmic intelligence built into the core of their product: Google Search, Facebook News Feed, Amazon’s and Netflix’s recommendation engines. “Data is inherently dumb,” Peter Sondergaard, senior vice president at Gartner and global head of Research, said in The Internet of Things Will Give Rise To The Algorithm Economy. “It doesn’t actually do anything unless you know how to use it.” Google, Facebook, Amazon, Netflix and others have built both the systems needed to acquire a mountain of data (i.e. search history, engagement metrics, purchase history, etc), as well as the algorithms responsible for extracting actionable insights from that data. As a result, these companies are using algorithms to create value, and impact millions of people a day. “Algorithms are where the real value lies,” Sondergaard said. “Algorithms define action.” For many technology companies, they’ve done a good job of capturing data, but they’ve come up short on doing anything valuable with that data. Thankfully, there are two fundamental shifts happening in technology right now that are leading to the democratization of algorithmic intelligence, and changing the way we build and deploy smart apps today: The confluence of the algorithm economy and containers creates a new value chain, where algorithms as a service can be discovered and made accessible to all developers through a simple REST API. Algorithms as containerized microservices ensure both interoperability and portability, allowing for code to be written in any programming language, and then seamlessly united across a single API. By containerizing algorithms, we ensure that code is always “on,” and always available, as well as being able to auto-scale to meet the needs of the application, without ever having to configure, manage, or maintain servers and infrastructure. Containerized algorithms shorten the time for any development team to go from concept, to prototype, to production-ready app. Algorithms running in containers as microservices is a strategy for companies looking to discover actionable insights in their data. This structure makes software development more agile and efficient. It reduces the infrastructure needed, and abstracts an application’s various functions into microservices to make the entire system more resilient. The “algorithm economy” is a term established by Gartner to describe the next wave of innovation, where developers can produce, distribute, and commercialize their code. The algorithm economy is not about buying and selling complete apps, but rather functional, easy to integrate algorithms that enable developers to build smarter apps, quicker and cheaper than before. Algorithms are the building blocks of any application. They provide the business logic needed to turn inputs into useful outputs. Similar to Lego blocks, algorithms can be stacked together in new and novel ways to manipulate data, extract key insights, and solve problems efficiently. The upshot is that these same algorithms are flexible, and easily reused and reconfigured to provide value in a variety of circumstances. For example, we created a microservice at Algorithmia called Analyze Tweets, which searches Twitter for a keyword, determining the sentiment and LDA topics for each tweet that matches the search term. 
This microservice stacks our Retrieve Tweets With Keywords algorithm with our Social Sentiment Analysis and LDA algorithms to create a simple, plug-and-play utility. The three underlying algorithms could just as easily be restacked to create a new use case. For instance, you could create an Analyze Hacker News microservice that uses the Scrape Hacker News and URL2Text algorithms to extract the text for the top HN posts. Then, you’d simply pass the text for each post to the Social Sentiment Analysis, and LDA algorithms to determine the sentiment and topics of all the top posts on HN. The algorithm economy also allows for the commercialization of world class research that historically would have been published, but largely under-utilized. In the algorithm economy, this research is turned into functional, running code, and made available for others to use. The ability to produce, distribute, and discover algorithms fosters a community around algorithm development, where creators can interact with the app developers putting their research to work. Algorithm marketplaces function as the global meeting place for researchers, engineers, and organizations to come together to make tomorrow’s apps today. Containers are changing how developers build and deploy distributed applications. In particular, containers are a form of lightweight virtualization that can hold all the application logic, and run as an isolated process with all the dependencies, libraries, and configuration files bundled into a single package that runs in the cloud. “Instead of making an application or a service the endpoint of a build, you’re building containers that wrap applications, services, and all their dependencies,” Simon Bisson at InfoWorld said in How Containers Change Everything. “Any time you make a change, you build a new container; and you test and deploy that container as a whole, not as an individual element.” Containers create a reliable environment where software can run when moved from one environment to another, allowing developers to write code once, and run it in any environment with predictable results — all without having to provision servers or manage infrastructure. This is a shot across the bow for large, monolithic code bases. “[Monoliths are] being replaced by microservices architectures, which decompose large applications – with all the functionality built-in – into smaller, purpose-driven services that communicate with each other through common REST APIs,” Lucas Carlson from InfoWorld said in 4 Ways Docker Fundamentally Changes Application Development. The hallmark of microservice architectures is that the various functions of an app are unbundled into a series of decentralized modules, each organized around a specific business capability. Martin Fowler, the co-author of the Agile Manifesto, describes microservices as “an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.” By decoupling services from a monolith, each microservice becomes independently deployable, and acts as a smart endpoint of the API. “There is a bare minimum of centralized management of these services,” Fowler said in Microservices: A Definition of this New Architectural Term, “which may be written in different programming languages and use different data storage technologies.” Similar to the algorithm economy, containers are like Legos for cloud-based application development. 
“This changes cloud development practices,” Carlson said, “by putting larger-scale architectures like those used at Facebook and Twitter within the reach of smaller development teams.” Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with extensive background in all things data. Prior to founding Algorithmia he designed , managed and shipped some of Microsoft’s most used data analysis products including Excel, Power Pivot, SQL Server and Power BI. Diego holds a Bachelors degree in Information Systems and a Masters degree in Business Intelligence and Data Analytics from Carnegie Mellon University. More Posts - Website Follow Me:",
  "title": "The Algorithm Economy, Containers, and Microservices",
  "url": "http://blog.algorithmia.com/2016/02/algorithm-economy-containers-microservices/",
  "statusCode": 200
}

Site Map

The site map algorithm starts with a single URL, and crawls all the pages on the domain before returning a graph representing the link structure of the site. You can use the site map algorithm for many purposes, such as building dynamic XML site maps for search engines, doing competitive analysis, or simply auditing your own site to better understand what pages link where, and if there are any orphaned, or dead-end pages. Try it out here.

Sample Input:

["http://algorithmia.com",1]

Sample Output:

{
  "http://algorithmia.com/about":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/api",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
    ],
  "http://algorithmia.com/privacy":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
    ],
  "http://algorithmia.com":
    [
      "http://algorithmia.com/reset",
      "http://algorithmia.com/about",
      "http://algorithmia.com/contact",
      "http://algorithmia.com/privacy",
      "http://algorithmia.com/terms",
      "http://algorithmia.com/signin",
      "http://algorithmia.com/"
  ]
}

Ready to learn more? Our free eBook, Five Algorithms Every Web Developer Can Use and Understand, is a short primer on when and how web developers can harness the power of natural language processing algorithms and other text analysis tools in their projects and apps.

Three Forces Accelerating Artificial Intelligence Development

Trends in Artificial Intelligence

Artificial intelligence is set to hit the mainstream, thanks to improved machine learning tools, cheaper processing power, and a steep decline in the cost of cloud storage. As a result, firms are piling into the AI market, and pushing the pace of AI development across a range of fields.

“Given sufficiently large datasets, powerful computers, and the interest of subject-area experts, the deep learning tsunami looks set to wash over an ever-larger number of disciplines.”

When it does, Nvidia will be there to capitalize. They announced a new chip design specifically for deep learning, with 15 billion transistors (a 3x increase) and the ability to process data 12x faster than previous chips.

“For the first time we designed a [graphics-processing] architecture dedicated to accelerating AI and to accelerating deep learning.”

Improved hardware contributes to Facebook’s ability to use artificial intelligence to describe photos to blind users, and it’s why Microsoft can now build a JARVIS-like personal digital assistant for smartphones.

Google, meanwhile, has its sights set on “solving intelligence, and then using that to solve everything else,” thanks to better processors and their cloud platform.

This kind of machine intelligence wouldn’t be possible without improved algorithms.

“I consider machine intelligence to be the entire world of learning algorithms, the class of algorithms that provide more intelligence to a system as more data is added to the system,” Shivon Zilis, creator of the machine intelligence framework, told Fast Forward Labs.  “These are algorithms that create products that seem human and smart.”

If you want to go deeper on the subject, O’Reilly has a free ebook out on The Future of Machine Intelligence, which unpacks the “concepts and innovations that represent the frontiers of ever-smarter machines” through ten interviews, spanning NLP, deep learning, autonomous cars, and more.

So, you might be wondering: Is the singularity near? Not at all, the NY Times says.

Machine learning algorithms ranked

Recapping Elon Musk’s Wild Week

Space X Falcon 9 landing on the drone ship

Last week, SpaceX, the Elon Musk space company, successfully landed its Falcon 9 rocket on a drone ship for the first time.

The rocket landed vertically on a barge floating in choppy waters out in the Atlantic Ocean.

Bloomberg calls the landing a “moon walk” moment for the New Space Age. Whereas Musk simply mused: it’s just “another step toward the stars.”

But that’s not all: Tesla, Musk’s other company, received more than 325,000 preorders for their Model 3 last week, making it the “biggest one-week launch of any product.”

The preorders represent $14 billion in potential sales for the $35,000 electric car, which is expected to ship late-2017. The unprecedented demand is a “watershed moment” for electric vehicles, the WSJ says.

As electric vehicles go mainstream, speculation mounts that the petrol car could be dead as early as 2025. We are now witnessing the slow-motion disruption of the global auto industry, Quartz adds. At least one car maker has taken notice: GM has invested $500 million in Lyft, with plans to build self-driving electric cars.