As developers, one of our biggest “problems” is our voracious appetite for news. From Twitter to Hacker News to the latest funding round on TechCrunch, it seems at times that we cannot escape the gravitational pull of our favorite news feeds. For myself at least, this is ingrained in my routine: wake up, check Twitter, check TechCrunch, check The Verge, and so on. I spend at least the first 30 minutes of every day reading feeds based on titles alone, and I repeat this a few more times throughout the day.
I recently discovered SkimFeed, which I love and call my “dashboard into nerd-dom”: it is essentially a single view of the major tech sites’ headlines. However, I wanted more information about each article before deciding to click on one, so I thought: why not use text analysis algorithms as a more efficient way of consuming my feeds?
Building my very own text analysis API
A text analysis API could do a number of things, but I decided to concentrate on the four elements I believed I could build fastest:
- Stripping documents of unnecessary elements
- Extracting topics
- Automatically summarizing stories
- Analyzing sentiment
As all of these algorithms are already available in the Algorithmia API, I could piece them together without having to worry about servers, scaling, or implementing the algorithms themselves:
- ScrapeRSS – Retrieves all the necessary elements from an RSS feed
- Html2Text – Strips HTML elements and keeps the important text
- AutoTag – Finds keywords that represent the topics of any text submitted to it (using Latent Dirichlet Allocation)
- SentimentAnalysis – Rates sentences as having a positive, neutral, or negative connotation, using the Stanford NLP library
- Summarizer – Breaks content into sentences and extracts the key sentences that represent the content’s topic; uses Classifier4J to ingest a URL and summarize its main content
Now, it was just a question of making them work together (and it took ~200 lines of code).
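The wiring itself is just a loop over feed items, passing each stage’s output to the next. A minimal sketch of that shape, with each hosted Algorithmia call replaced by a trivial hypothetical local function (the function names and bodies here are placeholders, not the real API):

```python
# Each stage stands in for the corresponding Algorithmia call;
# the bodies are trivial placeholders to show the data flow only.
def scrape_rss(url):
    return [{"title": "Big launch", "content": "<p>Good news. It shipped.</p>"}]

def html2text(html):
    return html.replace("<p>", "").replace("</p>", "")

def auto_tag(text):
    return ["launch"]

def summarize(text):
    return text.split(". ")[0] + "."

def sentiment(text):
    return "positive"

def analyze(url):
    """Chain the four stages over every item in a feed."""
    results = []
    for item in scrape_rss(url):
        text = html2text(item["content"])
        results.append({
            "title": item["title"],
            "tags": auto_tag(text),
            "summary": summarize(text),
            "sentiment": sentiment(text),
        })
    return results

print(analyze("http://example.com/feed"))
```

The real version replaces each stub with a call to the corresponding hosted algorithm; the loop and the dictionary of results stay the same.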
The first thing I needed to do was retrieve the necessary elements from the RSS feeds (ScrapeRSS). Once I had located the main content, I could strip out the unnecessary HTML (Html2Text). That left me with a nice clean body of text to run analysis on.
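These first two steps are hosted algorithms, but their jobs can be approximated with the Python standard library. A rough sketch, assuming an RSS 2.0 feed: pull titles and links with `xml.etree`, then strip tags with `html.parser` (skipping `script`/`style` blocks):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def scrape_rss(xml_text, limit=10):
    """Pull title/link/description out of an RSS 2.0 feed string."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
        })
        if len(items) == limit:
            break
    return items

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

sample = """<rss version="2.0"><channel><title>Feed</title>
<item><title>Big launch</title><link>http://example.com/1</link>
<description>A story.</description></item>
</channel></rss>"""

print(scrape_rss(sample)[0]["title"])                               # -> Big launch
print(html_to_text("<p>Hello <b>world</b><script>x=1</script></p>"))  # -> Hello world
```

The hosted algorithms handle messier real-world feeds and markup, but the shape of the output is the same: a list of items, each reduced to clean text.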
Topic tags are an easy way to understand an article at a glance, so I fed the clean body of text through AutoTag. Now I had the RSS title and some topics; the next step was to summarize each link in at most three sentences to complement the tags. Finally, and mostly for fun, I wanted to see whether it’s true that most news is negative, so I added SentimentAnalysis.
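To make the three analysis steps concrete, here are deliberately naive local stand-ins: word-frequency counting instead of AutoTag’s LDA, keyword-density sentence scoring instead of Classifier4J, and a tiny word lexicon instead of Stanford NLP. These are sketches of the idea, not the hosted algorithms:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was", "are", "at", "this"}

def auto_tag(text, n=5):
    """Naive topic tags: most frequent non-stopwords (real AutoTag uses LDA)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

def summarize(text, max_sentences=3):
    """Extractive summary: keep the sentences densest in top keywords."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    keywords = set(auto_tag(text, 10))
    scored = sorted(
        enumerate(sentences),
        key=lambda pair: -sum(w in keywords
                              for w in re.findall(r"[a-z']+", pair[1].lower())),
    )
    top = sorted(scored[:max_sentences])  # restore original sentence order
    return " ".join(sentence for _, sentence in top)

# A toy lexicon; Stanford NLP's sentiment model is far more sophisticated.
POSITIVE = {"good", "great", "love", "win", "best"}
NEGATIVE = {"bad", "worst", "fail", "hate", "loss"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by lexicon counts."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(auto_tag("phone phone phone battery battery screen", 2))  # -> ['phone', 'battery']
print(sentiment("I love this great phone"))                     # -> positive
```

Swapping these for the hosted versions changes the quality of the results, not the structure of the pipeline: each function takes clean text in and hands a tag list, a short summary, or a sentiment label back.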
You can check out some of the code here (or just view-source on our demo page in the browser).
That’s one way to use the Algorithmia library to build your own text analysis API and analyze almost any data stream in under 30 minutes.