As developers, one of our biggest “problems” is our voracious appetite for news. From Twitter to HackerNews to the latest funding on TechCrunch, it seems at times that we cannot escape the gravitational pull of our favorite news feeds. For me, at least, this is ingrained in my routine: wake up, check Twitter, check TechCrunch, check The Verge, and so on. I spend at least the first 30 minutes of every day skimming feeds by title, and I repeat this a couple more times throughout the day.
I recently discovered SkimFeed, which I love and call my “dashboard into nerd-dom.” It is basically a single view of the major tech sites’ titles. However, I wanted more information about each article before deciding to click on it, so I thought: why not use text analysis algorithms as a more efficient way of consuming my feeds?
Building my very own text analysis API
There are a number of things a text analysis API can do, but I decided to concentrate on the four elements I believed I could build the fastest:
- Stripping documents of unnecessary elements
- Advanced topic extraction
- Automatic story summarization
- Sentiment analysis
As all of these algorithms are already available in the Algorithmia API, I could piece them together without having to worry about servers, scaling, or even implementing the actual algorithms:
- ScrapeRSS – Retrieves all the necessary elements from an RSS feed
- Html2Text – Strips HTML elements and keeps the important text
- AutoTag – Looks for keywords that represent the topic of any text submitted to it. (Latent Dirichlet Allocation)
- SentimentAnalysis – Analyzes sentences for positive, neutral or negative connotation. Uses the Stanford NLP library.
- Summarizer – Breaks content into sentences and extracts the key sentences that represent the content’s topic. Uses classifier4j to ingest a URL and summarize its main contents.
Now, it was just a question of making them work together (and it took ~200 lines of code).
Check it out here:
The first thing I needed to do was retrieve all the necessary elements from the RSS feeds (ScrapeRSS). Once I had located the main content, I could strip all the unnecessary HTML (Html2Text). Now I had a nice, clean body of text to start analyzing.
Topic tags are an easy way to understand an article at a glance, so I fed the clean body of text through AutoTag. Now I had the RSS title and some topics. The next step was to summarize each link in three sentences at most (Summarizer) to complement the tags. Finally, and mostly for fun, I wanted to see whether it’s true that most news is negative, so I added SentimentAnalysis.
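To make the flow above concrete, here is a minimal, standard-library-only sketch of the same stages. It does not call the Algorithmia API, and every function in it is a hypothetical stand-in: `html_to_text` plays the role of Html2Text, `top_tags` uses plain word frequency rather than Latent Dirichlet Allocation, `summarize` ranks sentences by word frequency rather than using classifier4j, and `sentiment` relies on a tiny made-up word list rather than the Stanford NLP library. Feed retrieval is left out, so the input is assumed to be a title plus the raw HTML of one article.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative word lists (assumptions); real models are far richer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this", "as", "are", "was", "be"}
POSITIVE = {"good", "great", "love", "win", "fast", "growth"}
NEGATIVE = {"bad", "bug", "fail", "loss", "slow", "breach"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_tags(text, n=5):
    """The n most frequent non-stopwords, used here as topic tags."""
    return [w for w, _ in Counter(words(text)).most_common(n)]


def summarize(text, max_sentences=3):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(words(text))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w] for w in words(sentences[i])),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


def sentiment(text):
    """Count positive words minus negative words and label the result."""
    tokens = words(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def analyze(title, html):
    """Run one article through every stage and return the combined result."""
    text = html_to_text(html)
    return {"title": title, "tags": top_tags(text),
            "summary": summarize(text), "sentiment": sentiment(text)}
```

Calling `analyze(title, html)` for each feed entry yields one dictionary per article, which is roughly the shape a dashboard would need to show tags, a short summary, and a sentiment label next to each title.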
You can check out some of the code here (or just view-source on our demo page in the browser).
That’s one way to use the Algorithmia library to build your own text analysis API and analyze almost any data stream in under 30 minutes.
In our last blog post, we showed you how we used some of the algorithms in Algorithmia to generate topic tags for any URL. Internally, we used the topic generation algorithm to generate tags from each algorithm’s description and later used these tags as part of our recommender algorithm.
With today’s post, we want to show you how easy it is to integrate this type of recommender algorithm (already available in Algorithmia) into your own workflow by showing it in action with one of our favorite developer tools, GitHub.
As GitHub has grown to almost 50 million repositories since 2011 (per our calculations), discovering new repositories has become less and less straightforward once you navigate outside the most popular or featured ones. With this in mind, we thought of a way to tackle some of this complexity using algorithms available in Algorithmia.
So here you go: give us any GitHub repository (as long as it has a README) and we will recommend other repositories based on the information we extract from the README:
How we built this
- The first step was to figure out the URL of every repository on GitHub. This might seem like a daunting task, but the folks over at GitHub have been nice enough to make their entire data set available on Google BigQuery. You can head over there and generate a list of every public repository since 2011.
- The second step was to fetch every public repository’s README file and run it through Algorithmia’s topic analysis algorithm.
- Once the topic analysis algorithm returns a set of tags, we save them as a mapping from URL -> [tags]. This mapping is the data model we later use for our recommender algorithm.
- Repeat this a couple million times (not the easiest task, but thanks to the Algorithmia platform we had a virtually unlimited cluster to parallelize it), and voilà: a tagged data model of the entire GitHub world.
- Finally, it’s time to start working with our recommender algorithm. To start, we point it to the data model we built and then send it the new set of tags generated from the URL you provided. With these two inputs, the recommender algorithm returns a number of relevant repositories based on the automatically generated tags.
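The last step above can be sketched as a lookup over the URL -> [tags] mapping. The post does not say how Algorithmia's recommender scores candidates, so the example below swaps in Jaccard similarity (shared tags divided by all tags) as an assumed, simple measure. The function names, the `exclude` parameter, and the repository URLs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(model, query_tags, top_n=3, exclude=None):
    """Return up to top_n URLs from the model, most similar tags first.

    model      -- dict mapping repository URL -> list of tags
    query_tags -- tags generated for the repository the user submitted
    exclude    -- optional URL to leave out (e.g. the submitted repository)
    """
    scored = [(jaccard(tags, query_tags), url)
              for url, tags in model.items() if url != exclude]
    # Drop repositories with no tags in common, then sort by score
    # (descending) and by URL (ascending) so ties are deterministic.
    scored = [(score, url) for score, url in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored[:top_n]]


if __name__ == "__main__":
    # Made-up data model for illustration only.
    model = {
        "github.example/web-framework": ["python", "web", "http"],
        "github.example/ml-toolkit": ["python", "ml"],
        "github.example/cli-tool": ["rust", "cli"],
    }
    print(recommend(model, ["python", "http"]))
```

With millions of repositories, scoring every entry per query would be slow. An inverted index from tag to URLs is one common way to narrow the candidates first.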
If you are interested in using any of these algorithms on your own site or application, we can help. They are all available in the Algorithmia API; just send us an email.
— Your friends at Algorithmia
Part of making algorithms more discoverable is creating metadata tags to classify them. Sites often let users pick their own tags, but what if the content has already been generated? This is the problem we faced when trying to tag all the algorithms in our API. Each algorithm had a description page, and we believed we could generate tags for each one using some simple machine learning algorithms already in our API.
By generating tag data, it becomes easy to classify documents, make recommendations, optimize SEO, etc. Below we show how we approached this task, using HackerNews as an example data source.
So how did we do this? Our secret sauce is that Algorithmia is designed to make it extremely easy to combine algorithms into a pipeline that processes and generates tags for almost any site.
- Given a site, pull the data and iterate over every link
- Extract the text from each linked page
- Run the text through a topic analysis algorithm (such as Latent Dirichlet Allocation)
- Return tagged data
- Render tags next to links
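The first four steps of this list can be sketched with the Python standard library alone. This is an illustrative outline and not the Algorithmia implementation: `extract_links` and `tag_site` are hypothetical names, and the page fetcher and the topic analysis algorithm are passed in as plain callables (`fetch_text` and `tagger`) so the sketch makes no assumptions about how either is provided. Rendering the tags is front-end work and is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(page_html, base_url):
    """Step 1: pull the page data and iterate over every link."""
    collector = LinkCollector(base_url)
    collector.feed(page_html)
    collector.close()
    return collector.links


def tag_site(page_html, base_url, fetch_text, tagger):
    """Steps 2-4: extract text per link, tag it, return URL -> tags.

    fetch_text -- callable(url) -> plain text of the linked page
    tagger     -- callable(text) -> list of topic tags
    """
    return {url: tagger(fetch_text(url))
            for url in extract_links(page_html, base_url)}
```

Passing the two stages in as callables keeps the pipeline easy to test offline: a stub `fetch_text` can return canned text, and any tagging function with the same shape can be dropped in later.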
All of this is really just a pipeline of algorithms (plus some clever front-end js :P). In most cases, this would require some serious code stitching and algorithm development, but most of the components already existed in the Algorithmia API, so the solution was extremely clean. Check it out:
This also works for arbitrary webpages. Enter a URL below to automatically generate tags for it:
Once we had these tags, it became easy to classify algorithms, make recommendations, and relate one algorithm to another. We’ll leave that for the next post…
Get topic tags for your site dynamically – contact us!
Highly complex algorithm development is often at the core of innovative software products. The most famous algorithm of the last decade led to the development of the $396 billion titan Google, which changed our lives forever with a simple search box.
Today, an estimated 80% of equity trading in the U.S. is done using automated algorithms. Even our phones use algorithms to figure out where we are, what we are doing, and what we might want to do next. A few days ago, DeepMind, a company that developed the next generation of artificial intelligence algorithms, was bought for a record sum. These examples only scratch the surface of what algorithms can do.
As the volume of data we collect every day grows, data architects are in a race for more, better, and faster hardware. Yet Moore’s Law is showing signs of slowing down, which suggests that throwing more hardware at a data problem will no longer meet the needs of organizations. As organizations grow ever hungrier to understand how they operate and how they can improve, they must concentrate more on the efficiency and quality of the algorithms underlying their data analysis packages than on the hardware those packages run on.
The Current State of Algorithm Development
It’s safe to say that on any given day there are thousands of brilliant computer scientists developing the state of the art and pushing the limits of software, yet there lies a serious problem: Algorithms are being developed all the time, but are not getting into the hands of people and applications that could benefit from them.
- The vast majority of algorithm development occurs in one of two places: private research and development programs (e.g., Microsoft Research, Google, etc.) or one of the more than 320 academic institutions with computer science departments around the world. The output of this research and development is almost always private white papers or academic papers.
- Companies and application developers have a hard time taking advantage of advanced algorithms. Academic papers are hard to find and even harder to implement. Even in the best case, when published code is available, more time is spent setting up infrastructure and fighting with undocumented libraries and dependencies than actually doing any algorithm development.
For example, a while back I was researching Latent Dirichlet Allocation. In layman’s terms, it is an algorithm that extracts topics from documents without any understanding of the language itself. A quick search on Google brought up a dozen research papers and a couple of libraries that implement the algorithm. So where do I start? I guess I could go ahead and try one of the libraries, but which one? Do I even know if this is going to work? And once I do figure it out, will it work in my current system?
The Birth of Algorithmia
Algorithmia was born out of frustration with these problems and the current state of algorithm development and deployment. Algorithmia is moving away from developers working in isolation and toward providing a community for algorithm developers to share knowledge, test algorithms, and run them directly in their applications. What’s different is that every algorithm is live, so no more spending time finding the right libraries, compilers, and/or virtual machines—Algorithmia takes care of all of that for you. Find a useful algorithm in our API? Great, use that in your own applications.
Algorithmia is a live, crowd-sourced algorithm API. Our goal with Algorithmia is to make applications smarter, by building a community around algorithm development, where state-of-the-art algorithms are always live and accessible to anyone.
Intrigued? Come check us out.