Wow! This morning we were selected as this year’s Startup Showcase winner at Strata + Hadoop World NY 2015. The panel of judges chose Algorithmia as the winner based on our team, technology, and innovations.
It’s an honor to take part in this competition, where 12 of the top big data startups demonstrated their technologies to a room packed full of investors, entrepreneurs, and researchers.
Three other companies stood apart from the rest, including second-place sense.io, third-place timbr.io, and the audience favorite, Blue Talon.
Build brilliant apps with Algorithmia, the largest marketplace for algorithms in the world. We help application developers solve complex problems with ease and efficiency by making algorithmic intelligence approachable and accessible in fewer than five lines of code.
Algorithmia unlocks the building blocks of human understanding, helping you make every app a smart app. Use Algorithmia to recognize patterns in your data, extract visual knowledge, understand audio, classify unstructured data, and derive meaning from language. Focus on what matters most, and let Algorithmia take care of the rest.
- Infrastructureless Deployment
- REST API
- Production Ready
Natural Language Processing Summary:
Natural Language Processing, or NLP for short, is a field of study focused on the interactions between human language and computers. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia).
“Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it’s ripe with possibilities for newsgathering,” Anthony Pesce said in Natural Language Processing in the kitchen. “You usually hear about it in the context of analyzing large pools of legislation or other document sets, attempting to discover patterns or root out corruption.”
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
“Apart from common word processor operations that treat text like a mere sequence of symbols, NLP considers the hierarchical structure of language: several words make a phrase, several phrases make a sentence and, ultimately, sentences convey ideas,” John Rehling, an NLP expert at Meltwater Group, said in How Natural Language Processing Helps Uncover Social Media Sentiment. “By analyzing language for its meaning, NLP systems have long filled useful roles, such as correcting grammar, converting speech to text and automatically translating between languages.”
What Can I Use Natural Language Processing For?
- Summarize blocks of text using Summarizer to extract the most important and central ideas while ignoring irrelevant information.
- Automatically generate keyword tags from content using AutoTag, which leverages LDA, a technique that discovers topics contained within a body of text.
- Identify the type of an extracted entity, such as a person, place, or organization, using Named Entity Recognition.
- Use Sentiment Analysis to identify the sentiment of a string of text, from very negative to neutral to very positive.
- Reduce words to their root, or stem, using PorterStemmer, or break up text into tokens using Tokenizer.
These are just some of the natural language processing algorithms web developers can use.
What Are Some Real World Examples of Natural Language Processing?
Social media analysis is a great example of NLP use. Brands track conversations online to understand what customers are saying, and glean insight into user behavior.
“One of the most compelling ways NLP offers valuable intelligence is by tracking sentiment — the tone of a written message (tweet, Facebook update, etc.) — and tag that text as positive, negative or neutral,” Rehling said.
Build your own social media monitoring tool
- Start by using the algorithm Retrieve Tweets With Keyword to capture all mentions of your brand name on Twitter. In our case, we search for mentions of Algorithmia.
- Then, pipe the results into the Sentiment Analysis algorithm, which will assign a sentiment rating from 0-4 for each string (Tweet).
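The two steps above amount to a small pipeline: fetch the matching texts, then score each one. Here is a minimal Python sketch of that shape. The names `monitor`, `fake_fetch`, `fake_score`, and `LABELS` are hypothetical, the two `fake_` functions are offline stand-ins for the real algorithm calls, and the label attached to each 0-4 rating is an assumption (only the range and its endpoints are given above).

```python
from collections import Counter
from typing import Callable, Iterable

# Assumed label for each 0-4 rating; only the range itself comes from the post.
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def monitor(fetch: Callable[[str], Iterable[str]],
            score: Callable[[str], int],
            keyword: str) -> Counter:
    """Fetch texts mentioning `keyword`, score each one, and tally the labels."""
    tally = Counter()
    for text in fetch(keyword):
        rating = score(text)
        if not 0 <= rating <= 4:
            raise ValueError(f"rating out of range: {rating}")
        tally[LABELS[rating]] += 1
    return tally


# Stand-ins for the two algorithm calls, so the sketch runs offline.
def fake_fetch(keyword):
    return [f"I love {keyword}", f"{keyword} is okay", f"I hate {keyword}"]


def fake_score(text):
    if "love" in text:
        return 4
    if "hate" in text:
        return 0
    return 2


summary = monitor(fake_fetch, fake_score, "Algorithmia")
```

Passing the fetch and score steps in as functions keeps the tallying logic independent of whichever service actually supplies the tweets and ratings.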
Similarly, Facebook uses NLP to track trending topics and popular hashtags.
“Hashtags and topics are two different ways of grouping and participating in conversations,” Chris Struhar, a software engineer on News Feed, said in How Facebook Built Trending Topics With Natural Language Processing. “So don’t think Facebook won’t recognize a string as a topic without a hashtag in front of it. Rather, it’s all about NLP: natural language processing. Ain’t nothing natural about a hashtag, so Facebook instead parses strings and figures out which strings are referring to nodes — objects in the network. We look at the text, and we try to understand what that was about.”
It’s not just social media that can use NLP to its benefit. Publishers are hoping to use NLP to improve the quality of their online communities by leveraging technology to “auto-filter the offensive comments on news sites to save moderators from what can be an ‘exhausting process’,” Francis Tseng said in Prototype winner using ‘natural language processing’ to solve journalism’s commenting problem.
Use NLP to build your own RSS reader
You can build a machine learning RSS reader in less than 30 minutes using the following algorithms:
- ScrapeRSS to grab the title and content from an RSS feed.
- Html2Text to keep the important text, but strip all the HTML from the document.
- AutoTag uses Latent Dirichlet Allocation to identify relevant keywords from the text.
- Sentiment Analysis is then used to identify if the article is positive, negative, or neutral.
- Summarizer is finally used to identify the key sentences.
Recommended NLP Books for Beginners
- Speech and Language Processing: “The first of its kind to thoroughly cover language technology – at all levels and with all modern technologies – this book takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corporations.”
- An Introduction to Information Retrieval: “Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval, including web search and the related areas of text classification and text clustering from basic concepts.”
- Foundations of Statistical Natural Language Processing: “This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.”
- Handbook of Natural Language Processing: “The Second Edition presents practical tools and techniques for implementing natural language processing in computer systems. Along with removing outdated material, this edition updates every chapter and expands the content to include emerging areas, such as sentiment analysis.”
- Statistical Language Learning (Language, Speech, and Communication): “Eugene Charniak breaks new ground in artificial intelligence research by presenting statistical language processing from an artificial intelligence point of view in a text for researchers and scientists with a traditional computer science background.”
- Natural Language Understanding: “This long-awaited revision offers a comprehensive introduction to natural language understanding with developments and research in the field today. Building on the effective framework of the first edition, the new edition gives the same balanced coverage of syntax, semantics, and discourse, and offers a uniform framework based on feature-based context-free grammars and chart parsers used for syntactic and semantic processing.”
- Natural Language Processing Tutorial: “We will go from tokenization to feature extraction to creating a model using a machine learning algorithm. You can get the source of the post from github.”
- Basic Natural Language Processing: “In this tutorial competition, we dig a little “deeper” into sentiment analysis. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers.”
- An NLP tutorial with Roger Ebert: “Natural Language Processing is the process of extracting information from text and speech. In this post, we walk through different approaches for automatically extracting information from text—keyword-based, statistical, machine learning—to explain why many organizations are now moving towards the more sophisticated machine-learning approaches to managing text data.”
If you’re interested in learning more, this free introductory course from Stanford University will help you learn the fundamentals of natural language processing and how you can use it to solve practical problems.
Once you’ve gotten the fundamentals down, apply what you’ve learned using Python and NLTK, the most popular framework for Python NLP.
- Natural language processing (Wikipedia): “Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. In 1950, Alan Turing published an article titled ‘Computing Machinery and Intelligence’ which proposed what is now called the Turing test as a criterion of intelligence. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing.”
- Outline of natural language processing (Wikipedia): “The following outline is provided as an overview of and topical guide to natural language processing: Natural language processing – computer activity in which computers are entailed to analyze, understand, alter, or generate natural language.”
- Apache OpenNLP: “The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.”
- Natural Language Toolkit: “NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. Natural Language Processing with Python provides a practical introduction to programming for language processing.”
We’re thrilled to be one of the twelve leading big data startups selected to present at Strata + Hadoop World 2015 at 6:30pm on Tuesday, September 29th at the Javits Center in New York City. Diego and Kenny will be showing off our intelligent algorithms to a packed room of developers, entrepreneurs, and researchers as part of Startup Showcase. Winners will be announced during Wednesday’s keynote – fingers crossed!
Check out this video of winners from Strata + Hadoop in San Jose to get a taste of the competition:
Are you in New York? Startup Showcase is free to attend, and open to the public as part of NYC DataWeek. Register here for your free ticket, and be sure to come say hi to Diego and Kenny, who will be handing out Algorithmia credits. Or reach out to us @Algorithmia.
We’re proud to be sponsoring DubHacks, the largest collegiate hackathon in the Pacific Northwest. Over 600 student developers and designers will gather at the University of Washington’s Seattle campus on October 17th and 18th to form teams, build projects, and create solutions to real-world problems.
Algorithmia will be there with swag, snacks, and free API access. Come say hi!
In the meantime, check out our collection of algorithms available for use. How might you use our content recommendation, summarization, and sentiment analysis algorithms? Oh, and what might you do with our nudity detection algorithm?
LEARN MORE ABOUT DUBHACKS:
This blog post is Part II of a series exploring the Product Hunt API. Apart from the Product Hunt and Algorithmia API we use Aerobatic to quickly create the front end of our apps.
Our last post discussed how we acquired the Product Hunt dataset, how we designed a vote-ring detection algorithm, and ended with a live demo that analyzes new Product Hunt submissions for vote-rings. In this post, we will briefly explain how collaborative recommendation systems work (through the FP-Growth algorithm) and will apply that on the Product Hunt dataset using a live demo. As a cherry on top, we made a Chrome extension that will augment your Product Hunt browsing with a list of recommendations using our approach. Read on for the results, demo, and code.
Note About Recommendation Engines
There are two broad approaches to recommendation engines: collaborative filtering and content-based filtering. A hybrid approach marrying the two is also very common.
Collaborative Filtering recommenders depend on the way that users interact with the data – the number of overlapping decisions between users A and B (such as liking a post or buying an item) is used as an indicator to predict the likelihood that user A will make the same decision as user B on future topics. This is similar to Amazon’s “Other users also bought…”.
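To make the idea of "overlapping decisions" concrete, here is a minimal Python sketch. It uses Jaccard overlap as the similarity measure, which is one common choice and an assumption here rather than the method used later in this post. The names `overlap` and `suggest` and the sample data are hypothetical.

```python
def overlap(votes_a: set, votes_b: set) -> float:
    """Jaccard overlap: shared decisions divided by all decisions made by either user."""
    union = votes_a | votes_b
    return len(votes_a & votes_b) / len(union) if union else 0.0


def suggest(target: str, votes: dict) -> list:
    """Rank items the target has not chosen, weighted by how similar their fans are."""
    scores = {}
    for user, items in votes.items():
        if user == target:
            continue
        weight = overlap(votes[target], items)
        if weight == 0:
            continue  # users with nothing in common contribute no signal
        for item in items - votes[target]:
            scores[item] = scores.get(item, 0.0) + weight
    return sorted(scores, key=lambda item: (-scores[item], item))


votes = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3", "p4"},
    "C": {"p9", "p5"},
}
ranked = suggest("A", votes)
```

User B shares three of four decisions with user A, so B's one extra item is suggested to A, while user C's items are ignored because C and A have nothing in common.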
Content Based recommenders depend on the inherent attributes of an item – if user A likes item X, then the description/keywords/color of item X is used to predict the next item to recommend. This analysis can be extended to build a profile about what user A likes and dislikes. This is similar to Pandora’s Music Genome Project.
Here we are going to build a Collaborative Filtering recommendation engine that is specific to Product Hunt. There are many approaches to achieve this task and we will use the most straightforward one: Affinity Analysis (a variant of Association Rule Learning).
Understanding Association Rules
Businesses realized the power of Association Rules a long time ago, especially for up-selling and cross-selling. For example, a grocery shop might notice that people who buy diapers are also very likely to buy beer (i.e. strong association), and therefore the grocery shop manager might decide to place the diapers close to the beer fridge to fuel impulse buying.
We looked into Algorithmia’s catalogue and found two algorithms providing exactly this function: a Weka-based implementation by Aluxian and a direct implementation by Paranoia. Both implement the Frequent Pattern Growth algorithm (FP-Growth), a popular method for building association rules through a divide-and-conquer strategy. We ran them side by side and decided to go with Paranoia’s implementation.
The algorithm takes a set of transactions as an input and produces weighted associations as an output. Each transaction is represented as a single line with comma-delimited items. Instead of customers and groceries, our transaction set was the upvotes that Product Hunt users made on all the 16,000+ posts. Each user was represented as a line, and each line contained the posts’ ids that received an upvote from that user.
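The input format described above can be sketched in a few lines of Python. This is only an illustration of the one-line-per-user layout. The function name `to_transactions`, the `(user_id, post_id)` tuple shape, and the sample ids are all hypothetical.

```python
from collections import defaultdict


def to_transactions(votes):
    """Turn (user_id, post_id) vote pairs into one comma-delimited line per user."""
    by_user = defaultdict(set)
    for user_id, post_id in votes:
        by_user[user_id].add(post_id)
    return [",".join(str(post) for post in sorted(posts))
            for _, posts in sorted(by_user.items())]


# Hypothetical votes: user 1 upvoted posts 101 and 102, and so on.
votes = [(1, 101), (1, 102), (2, 101), (2, 102), (2, 103), (3, 103)]
lines = to_transactions(votes)
```

Each resulting line plays the role of one shopper's basket in the grocery example, with post ids in place of groceries.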
We Have a Demo
We created a Product Hunt-specific wrapper around the FP-Growth algorithm, which we called Product Hunt Recommender. This algorithm has a copy of the Product Hunt dataset and updates it every 24 hours (therefore it works better on posts more than a few days old). It takes a single input (post id) and returns up to five recommended posts.
If you already have an Algorithmia account, you should be able to experiment directly with the algorithm through the web console (only visible if you’re signed in). To simplify things however, we created the below demo of a selected subset of posts. Click around.
It was extremely satisfying to see how well the algorithm worked – for example, notice that applying the algorithm to the post ‘The Hard Thing About Hard Things’ (a book about entrepreneurship) returns four other books, all about entrepreneurship as well.
In a real-world scenario, a developer would run the FP-Growth algorithm on their dataset every so often and save the association rules somewhere permanent in the backend. Whenever an item is pulled out, the app would also look for strong associations as recommendations. Keep in mind that there are other routes towards a recommendation engine, such as content-based, clustering, or even hybrid solutions.
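The save-then-look-up flow might look like the Python sketch below. To stay short it counts item pairs directly and keeps those above a confidence threshold. That is a simplified stand-in for FP-Growth, not FP-Growth itself. The names `build_rules` and `recommend`, the 0.5 threshold, and the sample transactions are assumptions, and an in-memory dict stands in for permanent storage.

```python
from collections import Counter
from itertools import permutations


def build_rules(transactions, min_confidence=0.5):
    """Map each item to (other_item, confidence) pairs, strongest first.

    Pair counting only: a simplified stand-in for FP-Growth.
    """
    item_count = Counter()
    pair_count = Counter()
    for line in transactions:
        items = set(line.split(","))
        item_count.update(items)
        pair_count.update(permutations(sorted(items), 2))
    rules = {}
    for (a, b), n in pair_count.items():
        confidence = n / item_count[a]  # share of a's voters who also chose b
        if confidence >= min_confidence:
            rules.setdefault(a, []).append((b, confidence))
    for a in rules:
        rules[a].sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


def recommend(rules, item, limit=5):
    """Look up the saved rules for one item; returns at most `limit` items."""
    return [other for other, _ in rules.get(item, [])[:limit]]


transactions = ["101,102", "101,102,103", "103,104", "101,102,104"]
rules = build_rules(transactions)
picks = recommend(rules, "101")
```

Building the rules is the expensive step that would run on a schedule, while `recommend` is a cheap dictionary lookup that can run on every page view.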
Product Genius, the Chrome Extension
We thought it would be awesome to make a Chrome extension out of this that adds a new section to the sidebar of Product Hunt posts, which we titled “Other People Also Liked”. There’s already a “Similar Products” section that is built in within Product Hunt and we can’t definitively determine what method they used to implement it. One thing’s for sure: from our limited testing, we found numerous posts where Product Genius returned better results than the built-in version.
Install the extension from here and check it out for yourself:
We also found instances where Product Hunt’s internal recommender performed better: BitBound.
What Else Can We/You Do?
You can access the code for the Chrome extension, web demo, and the dataset itself from here. You can easily experiment with other approaches using the hundreds of algorithms within the Algorithmia catalogue, such as using Keyword-Set Similarity instead of FP-Growth or a hybrid approach using the two. Let us know if you have any ideas for other algorithms you want us to demonstrate on the dataset by tweeting us at @algorithmia.
Want to create your own analysis? Sign up and get 100,000 credits on us, a special Product Hunt promotion.
This blog post is Part I of a series exploring the Product Hunt API.
We like showcasing practical algorithms on interesting datasets. This post is part of a series of posts that explore some algorithms by applying them on the Product Hunt dataset. Since this is the first of the series, we’ll start with how we acquired the data, and then drill down on our approach for the vote-ring problem. Read on for the results and how you might build this for your web app or apply it on other datasets.
Data Acquisition and Pre-Processing
The first challenge is to acquire the data. We used Product Hunt’s API to download the posts and combined them into one big file. The API returns JSON representation of each post, which is not the best format for querying and manipulating, so we ran that file through the JSONtoCSV algorithm and we got 16,000+ records. We did the same for all the associated Comments, Users, and Votes and imported them into an SQLite database. Here’s what the tables look like:
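As a rough illustration of the flattening step, the Python sketch below loads JSON posts straight into an in-memory SQLite table. It skips the intermediate CSV file that the JSONtoCSV algorithm produced. The sample records, the chosen fields, and the `posts` table layout are all hypothetical, and real API responses carry many more fields.

```python
import json
import sqlite3

# Hypothetical sample; real Product Hunt API responses carry many more fields.
raw = """[
  {"id": 1, "name": "Foo", "votes_count": 12, "user": {"id": 7}},
  {"id": 2, "name": "Bar", "votes_count": 3, "user": {"id": 9}}
]"""


def flatten(post):
    """Keep a few scalar fields and pull the nested user id up to the top level."""
    return (post["id"], post["name"], post["votes_count"], post["user"]["id"])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "id INTEGER PRIMARY KEY, name TEXT, votes_count INTEGER, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [flatten(post) for post in json.loads(raw)],
)

# Once the data is tabular, questions become one-line queries.
top = conn.execute(
    "SELECT name FROM posts ORDER BY votes_count DESC LIMIT 1"
).fetchone()[0]
```

Nested JSON objects such as the submitting user do not fit a flat table directly, which is why the sketch lifts the one nested value it needs into its own column.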
We also created an IRB (Interactive Ruby Shell) environment that wraps the tables with Ruby ActiveRecord classes. Because Algorithmia already has a Ruby gem, these wrappers make the task of experimenting with different algorithms exceptionally easy. You can download a copy of the database along with the ActiveRecord wrappers from here (25 MB). Run ‘ruby console.rb’.
Sometimes, users in social networks might reach out to their friends (or even create fake accounts) in order to “seed vote” their submissions. This is obviously a form of gaming the system, and possible solutions for this problem depend on your data structure and how your users interact with it. Many platforms rely on IP addresses and browser agents among other things. Given that we do not have that data available, we had to look elsewhere.
One of our engineers came up with a very plausible solution to this problem: assign a collusion ratio to every possible combination of users and posts that will depend exclusively on voting activity. This ratio is used as a proxy for a collusion probability. The closer it is to 1.0, the more likely that the subject group of users are colluding.
Another way to visualize the formula above is to describe it in a plot: the horizontal line represents all the possible posts in the dataset and the vertical line represents all the possible users. The dots at the intersections of these lines represent votes made by users. If we choose a random group of users (horizontal green bar) and posts (vertical green bar), you can easily see how the ratio works by comparing users’ activity across other posts as well.
As a quick analysis, we computed the above ratio for posts having fewer than 10 votes. We start by computing the ratio for the group of users who voted for a given post. We then remove the first user and examine the new ratio – if the ratio drops, we bring them back in; if it increases, we leave them out. We continue this for each user.
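The greedy pass described above can be sketched in Python as follows. The collusion ratio formula is not reproduced in this text, so the sketch takes the ratio as a caller-supplied function. The `toy_ratio` used in the example (shared posts divided by all posts voted on by the group) is purely an illustrative stand-in and is not the formula from this post. Keeping a user when the ratio is unchanged is also an assumption.

```python
def prune_group(voters, ratio):
    """Greedy pass: drop a user only if that raises the group's ratio."""
    group = list(voters)
    best = ratio(group)
    for user in list(voters):
        trial = [u for u in group if u != user]
        if not trial:
            break  # never shrink to an empty group
        score = ratio(trial)
        if score > best:
            group, best = trial, score  # leave the user out
        # otherwise the user stays in the group
    return group, best


# Hypothetical voting history: user id -> set of post ids they upvoted.
history = {"u1": {1, 2}, "u2": {1, 2}, "u3": {1, 5, 6, 7}}


def toy_ratio(group):
    """Illustrative stand-in only: shared posts / all posts voted on by the group."""
    shared = set.intersection(*(history[u] for u in group))
    union = set.union(*(history[u] for u in group))
    return len(shared) / len(union)


ring, score = prune_group(["u1", "u2", "u3"], toy_ratio)
```

In this toy data, removing `u3` raises the ratio, so the pass ends with `u1` and `u2`, whose voting histories are identical.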
We ran the data through this algorithm, which we called SimpleVoteRingDetection. Clicking on any post on the left will show the subset of voters on that post with the highest collusion ratio – which is the most likely vote-ring pertaining to the post.
We removed users’ names to protect the innocent. Instead, we’re showing random fruit names prepended to the real users’ ids. You’ll notice that in many instances the colluding users have consecutive user ids, meaning that they (likely) created the accounts within the same time frame, which increases the likelihood that those are fake accounts. Having said that, we’re happy to report that less than 1% of the total posts on Product Hunt were subject to vote-rings (defined as having a collusion ratio higher than 0.5).
In the demonstration above, we limited our dataset to 20 votes per post which is sufficient to illustrate our approach to the problem without overloading the Product Hunt API. The same algorithm can be easily applied to other contexts, whether internally within Product Hunt or other applications. In its current form, the tool works great for flagging suspicious activity to prompt further investigation. The data is refreshed every 24 hours.
Until next time…
Our next post will be about building a recommendation engine for the dataset. Until then, feel free to play with the pre-processed data from here, and definitely let us know if you have any ideas for other algorithms you want us to demonstrate on the dataset.
Want to create your own analysis? Sign up and get 100,000 credits on us, a special Product Hunt promotion.
If there’s one thing the internet is good for, it’s racy material. This is a headache for a number of reasons, including a) it tends to show up where you’d rather it wouldn’t, like forums, social media, etc. and b) while humans generally know it when we see it, computers don’t so much. We here at Algorithmia decided to give this one a shot. Try out our demo at isitnude.com.
We based our algorithm on nude.py by Hideo Hattori, and on An Algorithm For Nudity Detection (R. Ap-apid, De La Salle University). The original solution relies on finding which parts of the image are skin patches, and then doing a number of checks on the relative geometry and size of these patches to figure out if the image is racy or not. This is a good place to start but isn’t always that accurate, since there are plenty of images with lots of skin that are perfectly innocent – think a closeup of a hand or face or someone wearing a beige dress. You might say that leaning too much on just color leaves the method, well, tone-deaf. To do better, you need to combine skin detection with other tricks.
Facing the problem
Our contribution to this problem is to detect other features in the image and use these to make the previous method more fine-grained. First, we get a better estimate of skin tone using a combination of what we learned from Human Computer Interaction Using Hand Gestures and what we can get from the image using a combination of OpenCV’s nose detection algorithm and face detection algorithm, both of which are available on Algorithmia. Specifically, we find bounding polygons for the face and for the nose, if we can get them, then get our skin tone ranges from a patch inside the former and outside the latter. This helps limit false positives. Face detection also allows us to detect the presence of multiple people in an image, thus customizing skin tones for each individual and re-adjusting thresholds so that the presence of multiple people in the image does not cause false positives. If you want to integrate this into your own application, you can find the magic right here.
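The sampling step (inside the face, outside the nose) can be illustrated with a small Python sketch. It assumes axis-aligned rectangles in place of bounding polygons, an image stored as rows of `(r, g, b)` tuples, and a simple per-channel minimum and maximum as the "range". The names `inside` and `skin_range` are hypothetical, and none of this is the actual implementation.

```python
def inside(box, x, y):
    """True if pixel (x, y) lies in box = (left, top, right, bottom), far edges exclusive."""
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


def skin_range(image, face_box, nose_box):
    """Per-channel (min, max) over pixels inside the face box but outside the nose box."""
    samples = [image[y][x]
               for y in range(len(image))
               for x in range(len(image[0]))
               if inside(face_box, x, y) and not inside(nose_box, x, y)]
    if not samples:
        raise ValueError("no pixels left to sample")
    return [(min(channel), max(channel)) for channel in zip(*samples)]


# Tiny synthetic image: black background, a 3x3 "face", one bright "nose" pixel.
image = [[(0, 0, 0)] * 4 for _ in range(4)]
for y in range(1, 4):
    for x in range(1, 4):
        image[y][x] = (200, 150, 120)
image[3][3] = (210, 160, 130)
image[2][2] = (255, 255, 255)  # highlight that should not skew the range

ranges = skin_range(image, (1, 1, 4, 4), (2, 2, 3, 3))
```

Excluding the nose region keeps the bright highlight out of the sampled range, so the estimate stays tied to this particular person's skin tone.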
This is an improvement, but it is hardly a final answer to the problem. There are countless techniques that can be used in place of or combined with the ones we’ve used to make an even better solution. For instance, you could train a convolutional neural network on problematic images, much like we did for Digit Recognition. Algorithmia was designed to enable this kind of high-level tinkering – we hope you’ll visit us soon and give it a try.
Today’s blog post is brought to you by one of our community members @sprague. Thanks for submitting!
The same simple Algorithmia API that works with Python, Java, Scala, and others, is easy for iOS programming too.
This short lesson assumes you have a basic knowledge of iOS programming: enough to write a simple single-view app using the Storyboard. (If not, start with Apple’s documentation here.) To run this example, all you need is a Mac and a copy of Apple’s (free) Xcode development environment.
The Algorithmia API works through simple HTTP POST commands. Fortunately, iOS already provides several powerful networking object classes that make that very easy:
- NSMutableURLRequest sets up the HTTP request with some straightforward and obvious methods, like setHTTPMethod:@"POST" to tell the server that you want to post some data.
- NSURLSession is a powerful class that lets you download the content via HTTP, including in the background or even while the application is suspended. Fortunately, the methods to control the download behavior are pretty straightforward.
- NSURLSessionDataTask is a related class specifically for getting data via HTTP. Pass it an instance of NSURLSession, an NSMutableURLRequest, and a handler that describes what to do when it receives a response from the server.
- Finally, NSJSONSerialization is a handy class that will convert between JSON and native iOS dictionary or array types. Read Apple’s class reference documentation to see how much work this will save you!
Example: POST a string to the Algorithmia API
The best way to understand is with a simple example. Here’s a short program to send a string representation of an integer to the Algorithmia isPrime API, to find whether an input number is prime or not.
Here’s the opening screen:
Just a normal text input box (UITextField), a button you push to submit the text to the server, and a few labels to show what’s happening. That’s all there is to the UI, and you should be able to whip this up quickly yourself in Storyboard.
As for the guts of the app, we need to specify three things:
- The Algorithm path/uri, in this case it is ‘diego/isPrime’
- Your API key, which you can get from the Profile/Credentials page
- The right HTTP headers, as described in the API Docs – Getting Started.
That’s it! All the networking code you need to access the Algorithmia API is right there. Perhaps the most important thing to point out here is the URI format and the HTTP Headers. Whenever you’re in doubt about these details, you should browse to a random algorithm and inspect the cURL parameters given in the ‘API Usage Examples’ section.
The world runs on data, but all too often, the effort to acquire fresh data, analyze it and deploy a live analysis or model can be a challenge. We here at Algorithmia spend most of our waking hours thinking about how to help you with the latter problem, and our friends at Socrata spend their time thinking about how to make data more available. So, we combined forces to analyze municipal building permit data (using various time series algorithms), as a reasonable proxy for economic development. Thanks to Socrata, this information is now available (and automatically updated) for many cities, with no need to navigate any bureaucracy. So we pulled in data for a range of cities, including:
- Santa Rosa, CA
- Fort Worth, TX
- Boston, MA
- Edmonton, Alberta, Canada
- Los Angeles, CA
- New York City, NY
We will show that it is fairly easy to relate building permit data to local (seasonal) weather patterns and economic conditions. When evaluating a given area, it is useful to know how much development has been going on, but also how much of this is explainable by other factors, such as seasonal variation, macroeconomic indicators, commodity prices, etc. This can help us answer such questions as: “how is this city likely to be affected by a recession?” or “how will real estate development fare if oil prices drop?”
- Socrata Open Data Query – pulls the permit data from Socrata
- Simple Moving Average – uses local average to smooth data
- Linear Detrend – removes increasing or decreasing trends in time series
- Autocorrelate – used to analyze the seasonality of a time series
- Remove Seasonality – removes known seasonal effects from a time series
- Outlier Detection – flags unusual data points
- Forecast – predict a given time series into the future
We use an algorithm to directly access Socrata’s API that can be found here.
Once we’ve retrieved the data, we aggregate it by total number of permits issued per month, and plot this data. To make the graph clearer, it sometimes helps to smooth or denoise the data. There are a number of ways to do this, with moving averages being the most popular. The most intuitive is the Simple Moving Average, which replaces each point by the average of itself and some number of the preceding points (default is 3).
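A minimal Python sketch of a simple moving average follows. Two details are assumptions, since the text leaves them open: `window` counts the current point plus its predecessors (so the default of 3 means the point itself and the two before it), and early points with too few predecessors are averaged over whatever is available. The function name is hypothetical.

```python
def simple_moving_average(series, window=3):
    """Average each point with up to `window - 1` preceding points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical monthly permit counts with one sharp dip at the end.
smoothed = simple_moving_average([3, 6, 9, 12, 3])
```

Because each output depends only on current and past values, the smoothed curve lags the raw data slightly, and a larger window gives a smoother but more delayed curve.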
A simple first inspection of data reveals trends and large peaks/dips. Ft. Worth, TX, lends itself to this analysis. It shows a steady growth from the beginning of the record up to a peak right before the subprime crisis, with a fairly rapid fall off to a plateau.
As much as we can learn from simple inspection of the plot, there are factors making this difficult, especially when it comes to inferring economic activity. For instance, outdoor construction tends to take place during nicer weather, especially in cold places far from the equator. This often makes the data harder to interpret – a seasonal spike doesn’t say nearly as much about underlying economic activity as a non-seasonal one, so we need a way to take this into account.
Edmonton, Alberta, is a particularly clear example of this.
To address seasonality, we first need to remove the linear trends via Linear Detrend, which fits a line to the data and then subtracts it from the data, resulting in a time series of differentials from the removed trend. Without detrending, the following seasonality analysis will not work.
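The fit-then-subtract step can be sketched in plain Python using ordinary least squares, with time steps numbered 0, 1, 2, and so on. The function name `linear_detrend` is hypothetical, and this is an illustration of the technique rather than the Linear Detrend algorithm's actual code.

```python
def linear_detrend(series):
    """Fit y = slope * t + intercept by least squares and return the residuals."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(series)]


# A perfectly linear series leaves nothing behind once the line is removed.
flat = linear_detrend([1, 3, 5, 7])
residuals = linear_detrend([2, 1, 4, 3, 6])
```

The residuals of a least-squares line always sum to zero (up to rounding), which is why the detrended series oscillates around zero.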
Permit issuance in Edmonton is clearly seasonal, even to the naked eye, but seasonality can be detected even in much noisier data using autocorrelation. Very roughly speaking, you can identify seasonality by the relative strength and definition of peaks in the autocorrelation plot of the time series, which you can calculate using our Autocorrelation algorithm. The autocorrelation of a time series is a rough measure of how similar a time series is to a lagged version of itself. If a series is highly seasonal according to some period of length k, the series will look similar to itself about every k steps between peaks and troughs.
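Here is a minimal Python sketch of the sample autocorrelation at a single lag, normalized so that lag 0 gives 1.0. The function name and the synthetic series are hypothetical. This uses the textbook definition and is not necessarily how the Autocorrelation algorithm is implemented.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at one lag (1.0 at lag 0)."""
    n = len(series)
    mean = sum(series) / n
    centered = [y - mean for y in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom


# Synthetic series that repeats every 4 steps.
series = [0, 1, 0, -1] * 6
at_period = autocorrelation(series, 4)       # peaks line up with peaks
at_half_period = autocorrelation(series, 2)  # peaks line up with troughs
```

For a series with period k, the value is strongly positive at lag k and strongly negative at lag k/2, which is the peak-and-trough pattern to look for in the autocorrelation plot.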
We expect the seasonality to be strongest furthest from the equator. Sure enough, the seasonality is most clear in more northern cities such as NYC and Edmonton and least clear in cities like Los Angeles – you can check this yourself with the interactive panel above.
It can help to suppress this seasonal influence so the effects of other factors will be made more clear. In Algorithmia this can be done using an algorithm called Remove Seasonality, which, by default, detects and removes the strongest seasonal period. When we do this it smooths out seasonal variation, and what it leaves is, in a sense, what is unexpected in the data.
Note that each transformed data point should be interpreted as the actual value at a given point in time minus the expected seasonal component of the data. It is a differential rather than an absolute value and thus can be negative.
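One simple way to picture this transformation is sketched below. It assumes the seasonal period is already known (for example, from the strongest autocorrelation peak) and estimates the expected seasonal component as the average of all points that share the same position in the cycle. Remove Seasonality detects the period itself and may estimate the seasonal component differently. The function name and sample values are hypothetical.

```python
def remove_seasonality(series, period):
    """Subtract the per-phase average (the expected seasonal value) from each point."""
    if period < 1:
        raise ValueError("period must be at least 1")
    phase_sums = [0.0] * period
    phase_counts = [0] * period
    for i, value in enumerate(series):
        phase_sums[i % period] += value
        phase_counts[i % period] += 1
    # Expected seasonal component for each position within the cycle.
    seasonal = [s / c if c else 0.0 for s, c in zip(phase_sums, phase_counts)]
    # Result is actual minus expected, so it can be negative.
    return [value - seasonal[i % period] for i, value in enumerate(series)]


# Detrended, made-up data with a 4-step cycle and one unexpected bump at index 5.
detrended = [5, 0, -5, 0] * 3
detrended[5] += 6
print(remove_seasonality(detrended, period=4))
```

In the output, the regular cycle disappears and the bump at index 5 stands out, while the other points in the same phase dip slightly below zero because the bump raised their expected value.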
Once we’ve accounted for linear trends and seasonality, we’re left with the most mysterious part of the data, where the more subtle signals hide. In many cases visual inspection and some detective work will be useful here, but in other cases it’s useful to automate the process of detecting subtle behavior. For instance, if you have an automated process that ingests, analyzes, and reports on incoming data, you may want to be alerted if something changes drastically, and take action based on that information. Defining what constitutes “abnormal” can be a hard problem, but by removing the larger trends, we are left with a clearer picture of these abnormalities.
The city of Santa Rosa, CA, provides an interesting example. One can see from inspection the sudden spike in permits issued in 2013, but it can also be detected automatically using the Outlier Detection algorithm. This algorithm works by setting all non-outliers to zero and leaving outliers unchanged, where an outlier is defined as any data point that falls more than two standard deviations from the mean of the series. We don’t have an explanation for this particular outlier. It may just be chance or a change in reporting, but it does indicate that the spike may be worth looking into more carefully.
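The rule described above is simple enough to sketch directly. The function name and sample data are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def keep_outliers(series, threshold=2.0):
    """Zero out every point within threshold standard deviations of the mean."""
    mean = statistics.mean(series)
    # Assumption: population standard deviation; a sample estimate would also be reasonable.
    std = statistics.pstdev(series)
    return [v if abs(v - mean) > threshold * std else 0 for v in series]


# Made-up residuals with one obvious spike at the end.
residuals = [10] * 11 + [100]
print(keep_outliers(residuals))  # eleven zeros followed by 100
```

Because the mean and standard deviation are themselves pulled around by extreme values, this rule works best after trend and seasonality have been removed.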
At this point it would be nice to tell you about our shiny crystal ball that can predict the future (and return it via an API call). Unfortunately, we’re not there quite yet, but we CAN help you see what observed trends will look like extrapolated into the future. Specifically, the Forecast algorithm takes your time series and projects observed linear and seasonal trends out a given number of time steps into the future. It does this by calculating, for each future point, its value according to extrapolation of the detected linear trend, then adding to that the differential corresponding to the contribution of each seasonal component to that point. In the demo, Forecast is set to predict using data from the 3 strongest seasonal components.
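A stripped-down version of this idea can be sketched as follows. It is only an illustration: it uses a single seasonal period supplied by the caller, whereas the demo uses the 3 strongest detected components, and the function name, least-squares fit, and per-phase averaging are all assumptions rather than the actual Forecast implementation.

```python
def forecast(series, period, steps):
    """Extrapolate a fitted line and add the average seasonal differential per phase."""
    n = len(series)
    if n < 2 or period < 1:
        raise ValueError("need at least two points and a positive period")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    # Least-squares line through the observed data.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, series)]
    # Seasonal differential: mean residual at each position within the cycle.
    seasonal = []
    for phase in range(period):
        values = residuals[phase::period]
        seasonal.append(sum(values) / len(values) if values else 0.0)
    # Future value = extrapolated trend + seasonal contribution for that phase.
    return [intercept + slope * t + seasonal[t % period] for t in range(n, n + steps)]


# Made-up series: upward trend plus a 4-step cycle.
history = [2 * t + [3, 0, -3, 0][t % 4] for t in range(16)]
print(forecast(history, period=4, steps=8))
```

Like the real thing, this only projects patterns already present in the data. It cannot anticipate a change in the underlying trend.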
Judging from the results of the Forecast algorithm, we expect a steady increase in construction activity for Edmonton, as well as Los Angeles, Boston, and New York. Santa Rosa looks to maintain the status quo, and Fort Worth looks to have a slight downward trend.
All together now!
The takeaway from this example is not the individual pieces – most of these techniques are available in a number of places. The secret here is that it is simple, composable, and fully automated. Once one person writes a data connector in Algorithmia (which is likely to happen for interesting public data, like that provided by Socrata), you don’t have to burn precious time writing and working the kinks out of your own. On the other hand, if an existing piece gets done better, the compositional nature of Algorithmia will allow you to swap out pieces of a pipeline seamlessly. When someone comes up with a better outlier detector, upgrading a pipeline like the one above can be literally just a few keystrokes away.
Finally, the analyses we’ve shown here are based on live data, changing all the time as new data flows in from Socrata’s pipeline to each individual city’s building permit system; Algorithmia offers an easy way to deploy and host that live analysis so the underlying data and interpretation are always up to date.
Today’s blog post is brought to you by one of our customers, DeansList. Thank you, Matt, for sharing your experience with our community.
At DeansList, one of the things that we do really well is build custom report cards that students take home (or get e-mailed) to their parents. We build them from the ground up with every school – taking instructions on everything from design to data points to the placement of charts. They often go home every week and, in addition to keeping parents up to speed, they include them in the messaging, programs and structures that engage their children every day.
Ugly, but functional.
Not surprisingly, with the promise of customization come unique requests. Lots of our schools include fake paychecks linked to a school’s token economy – a tool to teach financial literacy. Additionally, many schools we work with serve large populations of families whose primary language is not English (often Spanish). Whenever possible we translate our reports into a second language on the back, so non-English-speaking parents aren’t missing any important details. The challenge we faced was translating integers into Spanish words for the second line on the check (e.g. $430 into cuatrocientos treinta dólares). We wrote a script to translate numbers to English words, but no one on our team has the language expertise to do the same in Spanish. It admittedly wasn’t a huge priority so we kept the words in English – ugly but functional.
Around the time we were tackling this we came across Algorithmia and its Bounty feature. It seemed like a longshot – but what’s the harm in asking? So we posted a request for someone to write an algorithm to Translate Integers to Spanish Words.
A Bounty Fulfilled
To be honest, after I created the bounty, I forgot about it until March 2nd, when I got the “Your bounty has been completed” e-mail. A user named Javier had uploaded Cardinales. I logged right in, threw some tests into the web console and, within diez minutos, was hitting the API successfully via Postman.
The Algorithmia-driven solution!
Algorithmia’s libraries plug right into any platform, and we considered using their JS on the front-end. However, for a few reasons, wrapping a call to Cardinales in an endpoint of our internal API made the most sense. First, this feature will likely get deployed again in many more schools, so abstracting it gives us more flexibility to change things around as needed. Second, school firewalls can be incredibly restrictive, so having the browser request the data via our server keeps all our calls in the whitelist and eliminates troubleshooting hard-to-pinpoint issues.
Safe and Secure
What’s awesome about Algorithmia is that it’s not inside our system, and we don’t need to provide Cardinales any kind of student data for it to work. We just send it a number, say 354, and it comes back to us with the translation: trescientos cincuenta y cuatro. Everything happens via cURL and there’s no trace of Algorithmia or Cardinales code on our servers.
Considerations for Next Time
Or, How to Write a Better Bounty Spec…
Javier went above and beyond and included things in the solution that we didn’t ask for, like translations in both the masculine and the feminine, and proper translations for negative numbers. Still, now that it’s up and running, I realize there are a bunch of things we could have thought more about when we wrote the spec.
- Multiple items in a single request – Right now, we send one number at a time, and receive one translation per request. This means there’s a lot of overhead if we print 100 students’ reports at once. Next time around, I’ll include the ability to send an array of inputs, and receive key-value pairs as a return.
- Error codes – Cardinales handles bad inputs gracefully. Sending “abcd” returns empty quotes. However, more elaborate algorithms might require more detailed error reporting. It’s definitely something we’ll keep in mind going forward.
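To show what a spec along these lines might ask for, here is a hypothetical sketch of a batch interface that accepts an array of inputs and returns key-value pairs with explicit error codes. Nothing here is part of Cardinales or the Algorithmia API: `translate_batch`, the `NOT_AN_INTEGER` code, and the result shape are invented for illustration, and `demo_translate` is a stand-in that only knows the two translations quoted in this post.

```python
def translate_batch(numbers, translate_one):
    """Translate many inputs in one call, returning a dict keyed by the original input."""
    results = {}
    for raw in numbers:
        try:
            value = int(raw)
        except (TypeError, ValueError):
            # Report bad input explicitly instead of returning an empty string.
            results[raw] = {"ok": False, "error": "NOT_AN_INTEGER", "text": ""}
            continue
        results[raw] = {"ok": True, "error": None, "text": translate_one(value)}
    return results


# Stand-in translator for demonstration only; a real one would call the hosted algorithm.
_DEMO_WORDS = {354: "trescientos cincuenta y cuatro", 430: "cuatrocientos treinta"}


def demo_translate(number):
    return _DEMO_WORDS[number]


batch = translate_batch([354, 430, "abcd"], demo_translate)
print(batch[354]["text"])      # trescientos cincuenta y cuatro
print(batch["abcd"]["error"])  # NOT_AN_INTEGER
```

Keying the results by the original input lets the caller match translations back to students without relying on the order of the response.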
As more schools build savvy data teams, we’re always looking for ways to help them integrate their efforts with our own. We have a public API, but I could also see Algorithmia as an easy, cost-effective way for them to contribute their own highly specific code without having to think about setting up their own infrastructure.
So far, leveraging Algorithmia for a non-core feature like this has been an easy, awesome experience. Our engineers were stoked to plug this in and so far it works perfectly. I’ve never met Javier, but I can read his code (it’s elegant), he’s contributed meaningfully to our platform, and I appreciate his work. Gracias!
Matt Robins is the cofounder of DeansList, a platform that manages non-academic student data. DeansList’s platform puts behavior data to work, driving actionable reporting for students, teachers, administrators and parents. For more information on DeansList, or to ask Matt questions about his Algorithmia implementation, e-mail him at firstname.lastname@example.org.