Algorithmia

Analyzing Tweets

Introduction

Sentiment Analysis, also known as opinion mining, is a powerful tool you can use to build smarter products. It’s a natural language processing algorithm that gives you a general idea about the positive, neutral, and negative sentiment of texts. Sentiment analysis is often used to understand the opinion or attitude in tweets, status updates, movie/music/television reviews, chats, emails, comments, and more. Social media monitoring apps and companies all rely on sentiment analysis and machine learning to assist them in gaining insights about mentions, brands, and products.

For example, sentiment analysis can be used to better understand how your customers on Twitter perceive your brand online in real time. Instead of going through hundreds, or maybe thousands, of tweets by hand, you can easily group positive, neutral, and negative tweets together, and sort them based on their confidence values. This can save you countless hours and lead to actionable insights.

Additionally, you can use another natural language processing algorithm called latent Dirichlet allocation (LDA). This powerful algorithm can extract a mixture of topics from a given set of documents. LDA doesn’t give you the topic name, but it gives you a very good idea of what the topic is. Each topic is represented by a bag of words, which tells you what the topic is about. For example, if the bag of words is “sun, nasa, earth, moon”, you can infer that the topic is probably related to space.
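To make this concrete, here is a minimal sketch of calling an LDA algorithm with the Algorithmia Python client. The algorithm path (nlp/LDA) matches the one used later in this post, but the input field name (docsList) and the output shape are assumptions, so check the algorithm’s documentation before relying on them.

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

# Three tiny "documents" to extract topics from
docs = [
    "NASA launched a new probe to study the moon and the sun.",
    "The rocket will orbit the earth before returning from space.",
    "Astronomers watched the solar eclipse from the space station."
]

# Assumed input format: a list of documents under the "docsList" key
topics = client.algo("nlp/LDA").pipe({"docsList": docs})
print(topics)  # expect bags of words such as "sun, nasa, earth, moon"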

You can use both of these powerful algorithms to build a tool that analyzes tweets in real time. This tool gives you just a glimpse of what can be built using two machine learning algorithms and a scraper.

Using NLP from Algorithmia to build an app that analyzes tweets on demand.

Analyze Tweets is a minimal demo from Algorithmia that searches Twitter by keyword and analyzes tweets for sentiment and LDA topics. It currently searches up to 500 tweets that are no older than 7 days, and caches each query for up to 24 hours.

Currently the demo starts initializing with a random keyword immediately after this web page loads. The randomly selected keywords are: algorithmia, microsoft, amazon, google, facebook, and reddit.

You can check out the demo below:

How we built it

Before you can start, you need an Algorithmia account. You can create one on our website.

The first thing we did was click the Add Algorithm button under the user tab, located at the top right of the page.


We enabled the Special Permissions, so that our algorithm can call other algorithms and have access to the internet. We also selected Python as our preferred language.


Next, we decided which algorithms we were going to use. We came up with a recipe that combines three different algorithms in the following steps (a minimal sketch of the full pipeline follows the list):

  1. It first checks if the query has been cached before and the cache isn’t older than 24 hours.
    1. If it has the appropriate cache, it returns the cached query.
    2. If it can’t find an appropriate cached copy, it continues to run the analysis.
      1. It retrieves up to 500 tweets by using the user provided keyword and twitter/RetrieveTweetsWithKeyword algorithm.
      2. It runs the nlp/SocialSentimentAnalysis algorithm to get sentiment scores for all tweets.
      3. It creates two new groups of tweets that are respectively the top 20% most positive and top 20% most negative tweets.
        1. We do this by sorting the list of tweets based on their overall sentiment provided by the previously mentioned algorithm.
        2. We then take the top 20% and bottom 20% to get the most positive and negative tweets.
        3. We also remove any tweets in both groups that have a sentiment of 0 (neutral). This is because the Twitter retriever doesn’t guarantee returning 500 tweets each time. For more information, click here.
      4. These two new groups (positive and negative) of tweets are fed into the nlp/LDA algorithm to extract positive and negative topics.
      5. All relevant information is first cached, then is returned.
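
To make the recipe above concrete, here is a heavily simplified sketch of the pipeline in Python. The algorithm names are the ones listed in the steps, but the input and output field names (query, numTweets, sentenceList, compound, sentence, docsList), the caching, and the exact 20% cutoff logic are assumptions for illustration only; the published nlp/AnalyzeTweets source is the authoritative version.

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

def analyze_tweets(query):
    # 1. Retrieve up to 500 recent tweets for the keyword (field names assumed)
    tweets = client.algo("twitter/RetrieveTweetsWithKeyword").pipe(
        {"query": query, "numTweets": 500})

    # 2. Score every tweet with social sentiment analysis
    scored = client.algo("nlp/SocialSentimentAnalysis").pipe(
        {"sentenceList": [t["text"] for t in tweets]})

    # 3. Drop neutral (0) tweets, sort by overall sentiment, and take the
    #    most negative and most positive 20%
    scored = [s for s in scored if s["compound"] != 0]
    scored.sort(key=lambda s: s["compound"])
    cutoff = max(1, len(scored) // 5)
    most_negative = [s["sentence"] for s in scored[:cutoff]]
    most_positive = [s["sentence"] for s in scored[-cutoff:]]

    # 4. Extract topics from each group with LDA
    negative_topics = client.algo("nlp/LDA").pipe({"docsList": most_negative})
    positive_topics = client.algo("nlp/LDA").pipe({"docsList": most_positive})

    # 5. The real algorithm also caches this result for 24 hours before returning it
    return {"positive_topics": positive_topics, "negative_topics": negative_topics}

print(analyze_tweets("github"))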

Based on our recipe above, we decided to wrap this app into a micro-service and call it nlp/AnalyzeTweets. You can check out the source code here. You can copy and paste the code into your own algorithm if you’re feeling lazy. The only part you would need to change is line 28, from:

cache_uri = "data://demo/AnalyzeTweets/" + str(query_hash) + ".json"

to:

cache_uri = "data://yourUserName/AnalyzeTweets/" + str(query_hash) + ".json"

The last step is to compile and publish the new algorithm.


And… Congrats! By using different algorithms as building blocks, you now have a working micro-service on the Algorithmia platform!

Notes

After writing the algorithm, we built a client-side (JS + CSS + HTML) demo that calls our new algorithm with input from the user. In our client-side app, we used d3.js for rendering the pie chart and histogram. We also used DataTables for rendering interactive tabular data. SweetAlert was used for the stylish popup messages. You can find a standalone version of our Analyze Tweets app here.

A real world case study


While we were building the demo app and writing this blog post (on Jan 28th), GitHub went down and was completely inaccessible. Being the dataphiles we are, we immediately ran an analysis on the keyword “github” using our new app.

As you might expect, developers on Twitter were not positive about GitHub being down. Nearly half of all tweets relating to GitHub at this time were negative, but to get a better idea of the types of reactions we were seeing, we can look at the topics generated with LDA. We can see that Topic 4 probably represents a group of people who are calmer about the situation. The famous raging unicorn makes an appearance in multiple topics, and we can also see some anger in the other topics. And of course, what analysis of developer tweets would be complete without people complaining about other people complaining? We can see this in Topic 2, with words like “whine”. See the full analysis during the downtime below:


Winners of the Algorithmia Shorties Contest Announced

We’re excited to announce the winners of the first-ever Algorithmia Shorties contest. Inspired by NaNoGenMo, the Algorithmia Shorties were designed to help programmers of all skill levels learn more about natural language processing. We challenged developers to get started with NLP by creating computer-generated works of short story fiction using only algorithms. Entries were judged by their originality, readability, and creative use of the Algorithmia platform.

Grand Prize Winner: Big Dummy’s Leaves of Spam


Our winner of the Algorithmia Shorties is Big Dummy’s Leaves of Spam by Skwak. We loved this entry because it creatively mashed up the classic poem Leaves of Grass by Walt Whitman, and Big Dummy’s Guide to the Internet, a sort of layperson’s guide to the internet, with “select works” from her Gmail Spam folder. The result is a highly readable, and often hilarious poem about the philosophy of life, the Internet, humanity, and scams.

Skwak used Sentence Split, Generate Paragraph From Trigram, and Generate Trigram Frequencies from Algorithmia. In addition, she created an Express app and hosted it on Heroku. Her GitHub repo can be found here.

Read Big Dummy’s Leaves of Spam here. 

Honorable Mention: n+7

Our first honorable mention is n+7 by fabianekc. It was inspired by a class in Mathematics & Literature he took in 2001. He learned about the “n+7 method,” which is where you replace each noun in a text with the noun in a dictionary seven places after the original. For example, “a man meets a woman” is transformed to “a mandrake meets a wonder” using the Langenscheidt dictionary. The n+7 entry features the first four sections of Flatland by E. Abbott translated as a corpus. What we loved about this entry was that fabianekc created an algorithm on Algorithmia to accomplish the n+7 method. The algorithm takes either a link to a text file for processing, a pre-processed text file, or plain text, and replaces each noun with the noun in the dictionary seven places away. You can also change the dictionary, the offset (how many places after the original noun to use), and the start and end of the text to begin the replacement. Check out the Github repo here.

Read n+7 here.

Honorable Mention: Arrested Development Episode

Our second honorable mention goes to a computer-generated Arrested Development episode, created by peap and jennaplusplus. The duo generated trigrams using the scripts from the first three seasons of Arrested Development for all characters that had more than 10 lines overall. This created an eye-popping 71 trigram files! To create the faux episode script, they started the episode off with the Narrator speaking, and then randomly selected which characters would speak based on the size of their trigram file, ensuring no character spoke twice in a row, which resulted in every character having 1-5 lines every time they spoke. Check out their Github repo here.

Read Arrested Development Episode here.

How The Algorithmia Shorties Were Generated

For algorithmically-generated short stories, users start by finding a corpus of text available in the public domain through Feedbooks or Project Gutenberg. Other interesting corpora users could use are TV scripts, software license agreements, personal journals, public speeches, or Wikipedia articles. Users then generate trigrams, which are a collection of all the three-word sequences in the corpus. With trigrams generated, the last step is to reassemble the trigrams into sentences using Random Text From Trigrams, or Generate Paragraph From Trigrams.
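
As a rough sketch of those last two steps with the Algorithmia Python client (the algorithm paths, argument order, and begin/end marker tokens are assumptions modeled on the script shown later in this post, so treat them as illustrative rather than exact):

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

corpus = open("my_corpus.txt").read()  # any public domain text you picked
sentences = [s.strip() for s in corpus.split(".") if s.strip()]  # naive split, just for the sketch

# Build trigram frequencies and store them in a data collection
trigrams_file = "data://yourUserName/shorties/trigrams.txt"
client.algo("ngram/GenerateTrigramFrequencies").pipe(
    [sentences, "xxBeGiN142xx", "xxEnD142xx", trigrams_file])

# Reassemble the trigrams into a few sentences
paragraph = client.algo("lizmrush/GenerateParagraphFromTrigram").pipe(
    [trigrams_file, "xxBeGiN142xx", "xxEnD142xx", 3])
print(paragraph)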

Want to learn more? For a detailed walkthrough, check out our trigram tutorial here.

 

Benchmarking Sentiment Analysis Algorithms

Sentiment Analysis, also known as opinion mining, is a powerful tool you can use to build smarter products. It’s a natural language processing algorithm that gives you a general idea about the positive, neutral, and negative sentiment of texts. Sentiment analysis is often used to understand the opinion or attitude in tweets, status updates, movie/music/television reviews, chats, emails, comments, and more. Social media monitoring apps and companies all rely on sentiment analysis and machine learning to assist them in gaining insights about mentions, brands, and products.

For example, sentiment analysis can be used to better understand how your customers perceive your brand or product on Twitter in real time. Instead of going through hundreds, or maybe thousands, of tweets by hand, you can easily group positive, neutral, and negative tweets together, and sort them based on their confidence values. This will save you countless hours and lead to actionable insights.

Sentiment Analysis Algorithms On Algorithmia

Anyone who has worked with sentiment analysis can tell you it’s a very hard problem. It’s not hard to find trained models for classifying very specific domains, but there currently isn’t an all-purpose sentiment analyzer in the wild. In general, sentiment analysis becomes even harder when the volume of text increases, due to the complex relations between words and phrases.

On the Algorithmia marketplace, the most popular sentiment analysis algorithm is nlp/SentimentAnalysis. It’s an algorithm based on the popular and well known NLP library called Stanford CoreNLP, and it does a good job of analyzing large bodies of text. However, we’ve observed that the algorithm tends to return overly negative sentiment on short bodies of text, and decided that it needed some improvement.

We found that the Stanford CoreNLP library was originally intended for building NLP pipelines in a fast and simple fashion; it didn’t focus much on individual features and lacks documentation on how to retrain the model. Additionally, the library doesn’t provide confidence values, but rather returns an integer between 0 and 4.
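
For reference, calling it looks roughly like this (a sketch; the exact input format may differ, but per the description above the result is a discrete rating rather than a confidence score):

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

rating = client.algo("nlp/SentimentAnalysis").pipe("I love this product, it works great!")
print(rating)  # an integer from 0 (very negative) to 4 (very positive)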

A Better Alternative

We decided to test a few alternative open-source libraries that would hopefully outperform the current algorithm when it came to social media sentiment analysis. We came across an interesting and quite well-performing library called VADER Sentiment Analysis. The library is based on a paper published by SocialAI at Georgia Tech. This new algorithm performed exceptionally well, and has been added to the marketplace. It’s called nlp/SocialSentimentAnalysis, and is designed to analyze social media texts.

We ran benchmarks on both nlp/SentimentAnalysis and nlp/SocialSentimentAnalysis to compare them. We used an open dataset from CrowdFlower’s Data for Everyone initiative. We selected the Apple Computers Twitter sentiment dataset because the tweets cover a wide array of topics. Real-life data is almost never homogeneous, since it almost always has noise in it. Using this dataset helped us better understand how the algorithms performed when faced with real data.

We removed tweets that had no sentiment, and then filtered out anything that wasn’t labeled with 100% confidence, so that the remaining tweets were grouped and labeled by consensus. This decreased the size of the dataset from ~4,000 tweets to ~1,800 tweets.

Running Time Comparison

The first comparison we made was the running time for each algorithm. The new nlp/SocialSentimentAnalysis algorithm runs up to 12 times faster than the nlp/SentimentAnalysis algorithm.


Accuracy Comparison

As you can see in the bar chart below, the new social sentiment analysis algorithm performs 15% better in overall accuracy.


We can also see how well it performs in accurately predicting each specific label. We used the One-Versus-All method to calculate the accuracy for every individual label. The new algorithm outperformed the old one on every label.
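
For reference, here is a generic sketch of how a one-versus-all accuracy can be computed for each label (not the exact benchmark code, which is linked at the end of this post):

def one_vs_all_accuracy(true_labels, predicted_labels, label):
    # Treat `label` as the positive class and everything else as negative,
    # then score ordinary binary accuracy.
    correct = sum(
        1 for t, p in zip(true_labels, predicted_labels)
        if (t == label) == (p == label)
    )
    return float(correct) / len(true_labels)

# Example usage with three sentiment classes
truth = ["positive", "negative", "neutral", "negative"]
preds = ["positive", "neutral", "neutral", "negative"]
for label in ["positive", "neutral", "negative"]:
    print(label + ": " + str(one_vs_all_accuracy(truth, preds, label)))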


When To Use Social Sentiment Analysis

The new algorithm works well with social media texts, as well as texts that have a similar structure and nature (e.g. status updates, chat messages, comments). The algorithm can still give you sentiment for larger texts such as reviews or articles, but it will probably not be as accurate as it is with social media texts.

An example application would be monitoring social media for how people react to a change to your product, such as when Foursquare split their app into two (Swarm and Foursquare), or when Apple releases an iOS update. You could monitor the overall sentiment of a certain hashtag or account mention, and visualize a line chart that shows how your customers’ sentiment changes over time.

As another example, you could monitor your products through social media 24/7, and receive alerts when significant changes in sentiment toward your product or service happen in a short amount of time. This would act as an early warning system to help you take quick, appropriate action before a problem gets even bigger. A brand like Comcast or Time Warner, for example, might want to keep tabs on customer satisfaction through social media and proactively respond to customers when there is a service interruption.

Understanding Social Sentiment Analysis

The new algorithm returns three individual sentiments: positive, neutral, and negative. Additionally, it returns one general overall (i.e. compound) sentiment. Each individual sentiment is scored between 0 and 1 according to its intensity. The compound sentiment of the text is scored between -1 (absolute negative) and 1 (absolute positive).

Based on your use case and application, you may want to use only a specific individual sentiment (e.g. seeing only negative tweets, ordered by intensity), or the compound, overall sentiment (e.g. understanding general consumer feelings and opinions). Having both types of sentiment gives you the freedom to build applications exactly as you need to.
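
As a quick sketch of what a call looks like in practice (the field names positive/neutral/negative/compound follow the description above, but the exact request and response shapes are assumptions, so check the algorithm page):

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key
algo = client.algo("nlp/SocialSentimentAnalysis")

# Input/output shapes assumed for illustration
result = algo.pipe({"sentence": "I absolutely love the new update, great work!"})
print(result)
# Expect per-sentiment scores in [0, 1] plus a compound score in [-1, 1],
# e.g. {"positive": ..., "neutral": ..., "negative": ..., "compound": ...}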

Conclusion

The new and improved nlp/SocialSentimentAnalysis algorithm is definitely faster, and better at classifying the sentiment of social media texts. It allows you to build different kinds of applications thanks to its variety of sentiment types (individual or compound), whereas the old one only provides an overall sentiment with five discrete values, and is better reserved for larger bodies of text.

Did you build an awesome app that uses nlp/SocialSentimentAnalysis? Let us know on Twitter @algorithmia!

Bonus: Check out the source code for running the benchmarks yourself @github!

Data Day 2016: Algorithm Marketplaces and the New “Algorithm Economy”


Join Algorithmia CEO Diego Oppenheimer at Data Day 2016 in Austin, Texas on Saturday, January 16th at 2pm in Conference Room #301.

During this 60-minute session, learn how Big Data is being transformed by artificial intelligence through algorithmic advances in computer vision, natural language processing, and machine learning. This talk will cover how algorithms are a crucial part of the next Big Data revolution, and how the Algorithm Economy is creating new opportunities for startups and large companies.

Data Day Texas 2016 is held at The AT&T Conference Center at The University of Texas.

 

A Year In Review: A Letter From Algorithmia CEO Diego Oppenheimer

It’s been an exciting and incredibly productive year here at Algorithmia. As we kick off 2016, I want to look back at everything we accomplished in 2015.

First, I want to thank our community of more than 15,000 algorithm and application developers for their support – it has been an amazing experience. We’re honored by everybody that signed up, used us in your applications, and added algorithms to the platform. We’re truly thankful for this unique community.

These are the major milestones we crossed in 2015:


MARCH

Algorithmia Launches

Algorithmia leaves private beta and publicly launches, introducing a community built around the algorithm economy, where state-of-the-art algorithms are always live, and accessible by everyone. The algorithm marketplace launches with over 800 algorithms, 3,500 users, and an API for leveraging the building blocks of human intelligence in your apps.


APRIL

How to Build Your Own Google

Algorithms on Algorithmia are like Legos: they can be mixed and matched to explore the web algorithmically. In this demo, we use Algorithmia to implement PageRank, the algorithm Google was originally based on, to crawl a site, retrieve it as a graph, analyze the connectivity of the graph, and then analyze each node for its contents.


JUNE

Using Artificial Intelligence to Detect Nudity

We combine artificial intelligence with University research to teach an algorithm to determine if there is nudity in an image. The result is our Nudity Detection algorithm, which uses a combination of nose, face, and skin color detection algorithms to identify if there are people in an image, and if any of them are nude.

Don’t believe us? Try out the demo at isitnude.com.


JULY

Navigate Product Hunt Like a Pro

We build a Chrome Extension that uses FP-Growth and Keyword Set Similarity algorithms to surface related products on Product Hunt from users who’ve upvoted the product you’re browsing (i.e. collaborative filtering). Get the Chrome Extension here, and start discovering better products on Product Hunt today.


SEPTEMBER

AWS Lambda Partnership

Algorithmia teams up with AWS Lambda, enabling developers to build intelligent, serverless apps in minutes by leveraging our built-in Node.js blueprint. In this demo, we show you how to quickly make a serverless photo app to create digital art in less than 300 lines of code using AWS Lambda and Algorithmia.

The #1 Big Data Startup

A panel of judges from Strata + Hadoop World select Algorithmia as the winner of the Startup Showcase based on the team, technology, and innovation from a pool of the 12 leading big data startups.


OCTOBER

Supporting the Innovators of Tomorrow

Algorithmia joins more than 600 student hackers at DubHacks, the largest collegiate hackathon in the Pacific Northwest, as sponsor and participant to help teams build projects, and create solutions to real-world problems.

Content Recommendations Made Simple

We launch Algorithmia Recommends, a free content recommendation plugin for any WordPress or Drupal blog or website, to increase engagement and retention. Algorithmia Recommends is built on the Breadth First Site Map web crawler algorithm, plus the powerful natural language processing algorithms Keywords For Document Set and Keyword Set Similarity, to find and categorize all the pages on your website and help your users find content that’s most relevant to their interests.


NOVEMBER

Free eBook Published

Algorithmia launches its first eBook, Five Algorithms Every Web Developer Can Use and Understand, which teaches you how to harness the power of algorithms so you can make every app a smart app. In this short primer, we cover PageRank, Language Detection, Nudity Detection, Sentiment Analysis, and TF-IDF, and how you can implement them today.


DECEMBER

GeekWire: Algorithmia Top Seattle Startup

Algorithmia joins GeekWire’s prestigious, annual list of the 10 most promising early-stage startups in the Seattle area. Our business model gets translated onto a giant six-foot by six-foot cocktail napkin, which is unveiled during GeekWire Gala at the Museum of History & Industry (MOHAI).

One of the 10 Coolest Big Data Startups

Algorithmia caps off an incredible year with another award, this time CRN names us one of the coolest big data startups.


On top of those highlights, we also overhauled the Algorithmia UI, introducing a new experience to help users explore algorithms, discover solutions, and get started faster. Additionally, we improved our documentation to make using Algorithmia a snap, and added clients for the CLI, Java, JavaScript, Python, and Scala.

This could not have been achieved without our amazing team: Anthony, Besir, John, Jonathan, Kenny, Liz, Matt, Patrick, and Zeynep, as well as our awesome 2015 interns, Ahmad and Mark.

We’re just getting started, and 2016 is shaping up to be even bigger as we add more algorithms and use cases.

We’re looking forward to another incredible year, and we’re excited to have you along for the journey.

– Diego M. Oppenheimer, CEO

The Algorithmia Shorties Contest

Generating short story fiction with algorithms

The Algorithmia Shorties contest is designed to help programmers of all skill levels get started with Natural Language Processing tools and concepts. We are big fans of NaNoGenMo here at Algorithmia, even producing our own NaNoGenMo entry this year, so we thought we’d replicate the fun by creating this generative short story competition!

The Prizes

We’ll be giving away $300 USD for the top generative short story entry!

Additionally there will be two $100 Honorable Mention prizes for outstanding entries. We’ll also highlight the winners and some of our favorite entries on the Algorithmia blog.

The Rules

We’re pretty fast and loose with what constitutes a short story. You can define what the “story” part of your project is, whether that means your story is completely original, a modified copy of another book, a collection of tweets woven into a story, or just a nonsensical collection of words! The minimum requirements are that your story is primarily in English and no more than 7,500 words.

Each story will be evaluated with the following rubric:

  • Originality
  • Readability
  • Creative use of the Algorithmia API

We’ll read through all the entries and grab the top 20. The top 20 stories will be sent to two Seattle school teachers for some old-school red ink grading before the final winner selection.

The contest runs from December 9th to January 9th. Your submission must be entered before midnight PST on January 9th. Winners will be announced on January 13th.

How to generate a short story

Step One: Find a Corpus

Read More…

NaNoGenMo + Text Analysis with Algorithmia’s Natural Language Processing algorithms

We’ve just wrapped up November, which means aspiring writers all over the world are frantically typing away in an attempt to finish an entire novel in one month as part of National Novel Writing Month, also known as NaNoWriMo. Each November, participants aim to write 50,000 words on a 30 day deadline–a difficult feat for any writer! NaNoWriMo has been around for quite a long time, but for the last couple of years programmers and digital artists have been participating in a cheeky alternative: NaNoGenMo, or National Novel Generation Month.

Internet artist Darius Kazemi started NaNoGenMo after tweeting the idea in 2013:


This November is the third organized installment of NaNoGenMo and the community keeps growing every year as more and more programmers & artists become interested in the strange intersection of code, language processing, and literature. And because the event is primarily driven by developers, submissions are posted on a Github repo as Issues so that participants can comment on one another’s ideas and help each other create some of the most unique and sometimes nonsensical novels written in November.

In the NaNoGenMo world, “novel” is pretty loosely defined. According to the rules,

“The “novel” is defined however you want. It could be 50,000 repetitions of the word “meow”. It could literally grab a random novel from Project Gutenberg. It doesn’t matter, as long as it’s 50k+ words.”

(And of course, someone did make that 50,000 word “meow” book in 2014!)

Novel generation can be much more complicated than it appears from the outside. Some books integrate with social media by pulling text from Twitter to generate dialogue, others go down a recursive rabbit hole, and some even generate graphic novels.

Algorithmia is home to a wide variety of algorithms that are a perfect fit for NaNoGenMo. Because I don’t have any background in natural language processing or computational linguistics, I found it was easy to combine algorithms that not only helped me generate my novel, but gave me insights on the texts I used as a basis.

I chose the texts I wanted to work with based on two things: availability in the public domain and an interesting author demographic. While there are tons of NaNoGenMo books out there that are based on other texts, I wanted to find a really unique set of texts to base my novel on. I also developed an interest in 19th century American literature after reading Uncle Tom’s Cabin when I was 12. Luckily for me, Project Gutenberg is home to many novels and autobiographies that fit this intersection of interests!

First step: compile a corpus of texts. I chose to go with two sets of 7 books to compare. The first set was composed primarily of slave and emancipation narratives from Black female authors. While digging around in these texts, I realized that books as seemingly disparate as Little Women were published at the same time. Somehow I had never really thought about how such drastically different worlds were being exposed in what we now think of as classic American literature, so I decided it would be interesting to compare. The second set of texts are all from white female authors, published around the mid-19th century.

Set one:

  • 1861 – Incidents in the Life of a Slave Girl by Harriet Jacobs
  • 1868 to 1888 (published in serial form) – Trial and Triumph by Frances Ellen Watkins Harper
  • 1868 to 1888 (published in serial form) – Sowing and Reaping: A Temperance Story by Frances Ellen Watkins Harper
  • 1868 to 1888 (published in serial form) – Minnie’s Sacrifice by Frances Ellen Watkins Harper
  • 1868 – Behind the Scenes by Elizabeth Keckley
  • 1891 – From the Darkness Cometh the Light, or, Struggles for Freedom by Lucy Delaney
  • 1892 – Iola Leroy, or Shadows Uplifted by Frances Ellen Watkins Harper

Set two:

  • 1845 – Woman in the Nineteenth Century by Margaret Fuller
  • 1852 – Uncle Tom’s Cabin by Harriet Beecher Stowe
  • 1854 – The Lamplighter by Maria S. Cummins
  • 1854 – Ruth Hall: A Domestic Tale of the Present Time by Fanny Fern (pen name of Sara Payson Willis)
  • 1860 – Rutledge by Miriam Coles Harris
  • 1868 – Little Women by Louisa May Alcott
  • 1869 to 1870 (published in serial form) – An Old Fashioned Girl by Louisa May Alcott
  • 1872 – What Katy Did by Susan Coolidge

Before I started generating my own novel based on these texts, I rolled up my sleeves and got to work on analyzing them. The Algorithmia platform is already full of many text analysis algorithms, so instead of getting lost in learning natural language processing from scratch, it was as simple as choosing an algorithm, passing in my texts, and comparing the results.

Haven’t read any of the books? Don’t worry! The first algorithm I ran on the texts was Summarizer. This algorithm is pretty straightforward to use–input text, get back key sentences and ranked keywords. Read the summaries of Set One and Set Two if you need a literary refresher!
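
Calling it takes only a few lines with the Python client (a sketch; the file path is hypothetical, and the assumption that plain text goes in and a summary comes back should be checked against the algorithm page):

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

with open("clean_books/set_one/incidents_in_the_life_of_a_slave_girl.txt") as f:  # hypothetical path
    text = f.read()

summary = client.algo("nlp/Summarizer").pipe(text)
print(summary)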

Using the AutoTag algorithm, I set out to discover if there would be a difference in the topics we’d find between the two author demographics. The AutoTag algorithm uses a variant of latent Dirichlet allocation and returns a set of keywords that represent the topics in the text. I then took each of the topics returned by the algorithm and classified them into various categories or themes to see if we could find some common threads.


I had suspected that the second set of books would have more domestic-related themes, and I was mostly unsurprised that there were no autotagged keywords about race or slavery in that set. Interestingly, specific names showed up as keywords fairly frequently in both sets, averaging 4.8 out of 8 topics for set one and 5.7 of the topics in set two.

While this algorithm gives us some interesting insights into our texts, it can’t tell us everything and sometimes it can even trick you. For example, I grew suspicious of Sowing & Reaping when the AutoTag algorithm returned that one of the topics was “romaine”. I suspected that this book did not in fact focus on a type of lettuce as a main topic. Since I hadn’t read this specific book, I looked it up–turned out to be the last name of a main character!

After running the AutoTag algorithm on my data sets, I decided to check out Sentiment Analysis. This algorithm uses text analysis, natural language processing, and computational linguistics to identify subjective information in text. It’s also known as opinion mining. The algorithm I used returns a rating of Very Negative, Negative, Neutral, Positive, or Very Positive.

Here’s the breakdown of sentiment by book:

Set One Books:

  • Incidents in the Life of a Slave Girl – Negative
  • Trial and Triumph – Negative
  • From the Darkness Cometh the Light – Negative
  • Sowing and Reaping – Negative
  • Minnie’s Sacrifice – Negative
  • Behind the Scenes – Positive
  • Iola Leroy – Negative

Set Two Books:

  • Woman in the Nineteenth Century – Negative
  • Uncle Tom’s Cabin – Negative
  • The Lamplighter – Negative
  • Ruth Hall – Neutral
  • Rutledge – Very Negative
  • Little Women – Negative
  • What Katy Did – Negative

Unsurprisingly, 12 out of 14 of the books I analyzed were Negative or Very Negative. Rough times in the 19th century!

Next, I decided it might be interesting to see what popped up with Profanity Detection. While getting the data into the algorithm and writing the results back to a file was easy, it turns out that profanity detection requires a lot of double-checking by hand. I knew that some words that came up were not really profane back then; words like “queer”, “pussy”, and “muff” were innocent in the context of these 19th-century texts.
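
The call itself is just as small (a sketch; the algorithm path, input format, and file path here are assumptions, and as noted above the flagged words still need manual review):

import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")  # placeholder API key

with open("clean_books/set_two/uncle_toms_cabin.txt") as f:  # hypothetical path
    text = f.read()

# Assumed to return the flagged words so they can be reviewed for 19th-century usage
flagged = client.algo("nlp/ProfanityDetection").pipe(text)
print(flagged)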

Interestingly, the frequency of racial profanity of the two data sets ended up being relatively similar:


 

Of course, running the algorithm doesn’t give you the full picture, since in our second set of data about 95% of the racial profanity came from one book: Uncle Tom’s Cabin. This is unsurprising, since it’s the only work in our second set of books that was written by an abolitionist. However, we still don’t quite get the full picture about profanity in these books: many of the words used were not considered slurs back then, and within dialogue this kind of language takes on a different dimension. What we can learn from an algorithm such as Profanity Detection is that there is a very stark difference in who these books focused on as main characters and what kind of world they lived in. Four of the seven books written by white authors had zero instances of these words.

Now that you’ve read through all this and seen the results from all these different algorithms, you might be thinking to yourself that you don’t know how to do natural language processing, so maybe this is something to put on a project list and try out later. The most amazing part of this project that I haven’t told you yet is this: every single one of the scripts I wrote to do NLP and text analysis was under 30 lines of code.

Check out the script I wrote for running the AutoTag algorithm:

import Algorithmia
import os
import json

client = Algorithmia.client('my_api_key')
algo = client.algo('nlp/AutoTag/0.1.4')

rootdir = './clean_books/set_one/'
output_file = 'set_one_autotag_results.txt'
results = ''

for subdir, dirs, files in os.walk(rootdir):
    for filename in files:
        with open(rootdir + filename, 'r') as content_file:
            input = content_file.read()
        print "Autotagging " + filename
        results += filename + "\n\n"
        results += json.dumps(algo.pipe(input))
        results += "\n\n"

with open(output_file, 'w') as f:
    f.write(results)

print "Done!"

After analyzing all my books, it was time to generate my novel to complete NaNoGenMo. This was so easy compared to the text analysis! Once again, with just simple API calls, I generated trigram models based on each set of books. I then made book previews based on each trigram model just to see if you could hear a difference in the books generated on these different demographics.

The book preview from Set One:

Dem young uns vil kill you dead than to see you. Well, you would be less unhappy marriages if labor were more women in the midst of her nice pudding, as there are no enemies to good old aunt, and confirm themselves in woods and gloomy clouds hung like graceful draperies. Talk about the streets of the ballot in his land, that those who have fitted their children?

Belle, and I live in such dingy, humble quarters. said Mrs. Underhill, from my own sorrow-darkened home, I did, that he had asked them. Do you remember the incident so well were given to Frederick Douglass contributed $200, besides lecturing for us. The President added: Man is a fair specimen of her negro blood in his friendship, but they may be an old woman entered her home with me? If the vessel had been. Reader, I felt humiliated enough.

The book preview from Set Two:

aw! Yes, said Miss Skinlin she hasn’t the first heir to the female figure. The waves dance bright and happy when I forgot to learn, before which she told me to read and study. My Uncle, with a commanding, What are you better than Kintuck.

It was useless to ask one last word I ran down a corridor as dark and narrow streets or the other.

No Oh, Earth! And no one interfered, and it was. but then strangers came so by letting out all fear and distress and doubts of the damned, as well as bodies. What word Can we not get. I don’t resent the sarcasm, and unsettled most of my observing her to rise. Fortunately the gate swinging in the recesses, chrysanthemums and Christmas roses bloomed as freshly as in her voice, what everybody finds in the streets so, for the best thing was insufferably disgusting and loathsome to me. I said a thing as leisure there.

The most interesting difference I found in the text generated from these different data sets was that the text from Set Two sounded much more formal. The first set of books, the ones written by Black authors, tended to have much more dialogue written in such a way as to let the reader hear the accents and dialects of the time. These words became part of the model to generate text, so as you can see in the first sentence of the Set One preview, the algorithm generated text that still makes a lot of sense even with words that are intended to showcase an accent.

In the end, I decided to create a trigram model based on both sets of text and use that to generate my full length novel. I didn’t have to do any fancy code, I merely made another API call to the Generate Trigram Frequencies algorithm, this time passing in the entirety of my data set. Then, to generate my novel, I wrote a quick script that calls into another algorithm: Generate Paragraph From Trigram. This algorithm uses the trained trigram model to generate paragraphs of text. Since NaNoGenMo requires the book to be at least 50,000 words, I simply wrote a loop that calls the Generate Paragraph algorithm until the total word count of the book reaches the goal:

import Algorithmia
import os
import re
from random import randint

client = Algorithmia.client('my_api_key')
text_from_trigram = client.algo('/lizmrush/GenerateParagraphFromTrigram')
trigrams_file = "data://.algo/ngram/GenerateTrigramFrequencies/temp/all-trigrams.txt"

book_title = 'full_book.txt'
book = ''
book_word_length = 50000

while len(re.findall(r'\w+', book)) < book_word_length:
    print "Generating new paragraph..."
    input = [trigrams_file, "xxBeGiN142xx", "xxEnD142xx", (randint(1,9))]
    new_paragraph = text_from_trigram.pipe(input)
    book += new_paragraph
    book += '\n\n'
    print "Updated word count:"
    print len(re.findall(r'\w+', book))

with open(book_title, 'w') as f:
    f.write(book.encode('utf8'))

print "Done!"
print "Your book is now complete. Give " + book_title + " a read now!"

Even with extra new lines for readability, the code I needed to generate an entire novel with Algorithmia was still under 30 lines! And I ended up generating a really unique, interesting novel without getting lost in the highly technical parts of natural language generation. Now, the text isn’t perfect: sometimes the sentences don’t quite sound right and there isn’t really any sort of story arc, but for such simple code I think it’s pretty good! My favorite part about using 19th century texts as the data set was that sometimes you can’t tell if the generated text is hard to read because it’s generated and doesn’t make much sense or because it sounds so old-timey. My book includes the following gems that just might pass as human-written text:

I have said of human life when I saw the Ohio river, that you shall work.

Still, falsehood may be hearing you. She only ‘spects something. Them curls may make a noise you shall not.

You can read the book, or rather, attempt to read the book, online or you can download it from the repo.

It’s mindblowingly fast and simple to get the power of these algorithms into your hands once they are behind a simple API call. You can see all the other scripts I wrote in the GitHub repo for this project. If you browse around, you’ll see that each script is nearly identical. The only real changes I had to make were replacing the algorithm I was calling and what I named the files to write results to! The Algorithmia platform is an incredibly powerful tool. Instead of spending days, weeks, months learning how to code my own natural language processing and text analysis algorithms, I could just pop my data into a variety of algorithms with simple API calls. No sweat, just results.

A behind-the-scenes look at the making of our GeekWire ‘Seattle 10’ napkin

We mentioned previously that GeekWire and a panel of Seattle’s top startup leaders selected Algorithmia as one of the 10 most promising startups in Seattle – an incredible honor to say the least. As part of this award, we were asked to translate our business onto a giant six-foot by six-foot cocktail napkin, which will be unveiled tonight to the public at the annual GeekWire Gala at the Museum of History & Industry (MOHAI).

We’re pretty excited with our napkin, and wanted to offer a sneak peek to readers. The napkin was designed by our developer evangelist, Liz Rush, who also happens to enjoy cartooning in her spare time. 

The concept started as a sketch with the different types of personas (users) that benefit from Algorithmia, and a mission statement: “Algorithmia: Bringing together organizations, academics, researchers, hackers, and engineers to unlock the power of algorithms in an accessible, open marketplace.”


We wanted to include the Algorithmia binary tree logo, punch up the mission statement, and move the design more toward a printed circuit board look:


Getting better! Something was missing, though… one of us had “a ridiculous idea to write the text the way you’d call an algorithm in Python.”


Thankfully, somebody knows some Python around here. Our CTO Kenny Daniel straightened things out:


There we go. The final napkin turned out great. We’re most excited about how we communicate the idea that Algorithmia enables developers to create tomorrow’s smart applications today in a first-of-its-kind marketplace for algorithms, which unlocks the building blocks of human intelligence, and provides access to world class scientific research and artificial intelligence in five lines of code or less.


Click the video below for a time-lapse version of the cocktail napkin design process!

Oh, and we even created the algorithm Purpose, which takes an array of your users as a string, and returns the purpose for your organization. Try it for yourself here.

Get Started Building Intelligent, Serverless Apps Using AWS Lambda and Algorithmia

In this walkthrough, we’ll show you how to quickly make a serverless photo app that creates digital art pieces in less than 300 lines of code using AWS Lambda and Algorithmia. We’ll be using the Quadtree Art Generator algorithm to create our art, and push the new image to our S3 bucket automatically:


AWS Lambda is great, because you can run code without provisioning or managing servers. Similarly, Algorithmia lets you tap into the power of the algorithm economy with just a single API call. Together, you can quickly build and deploy serverless apps within minutes.

Ready? Okay, let’s get started.

Step 1: Create Accounts

You’ll need a free AWS account, as well as an Algorithmia account. We provide you with 10,000 credits to get started, which will be more than enough for this demo and beyond.

Step 2: Create Your S3 Bucket

Now we need to create your S3 bucket for this project.

Start by selecting S3 from the AWS dashboard.


Then select “Create Bucket” from the Actions menu. Give your bucket a unique name (remember: only lowercase names, and no spaces), and then select a region where you want this bucket hosted.


Once your bucket is created, you want to create two folders: input, and output. The input folder is what Lambda will watch for new images. The images will get processed and then returned to the output folder.

Step 3: IAM Role Configuration

Before we can create a Lambda function, we need to first make an IAM execution role. IAM stands for “Identity and Access Management,” and is an AWS service that helps you control access.

First, go to the AWS IAM Roles page, and select “Create Role.”


Under “Select Role Type,” find and select AWS Lambda. Search for AWSLambdaExecute, and select that.

If you need it, find the complete AWS documentation for this step here.

Step 4: Create the Lambda Function

Create a Lambda function by going to the Services menu in the AWS console, and select Lambda from the list.


Hit the blue “Get Started Now” button. On the “Select Blueprint” page, scroll to the bottom and hit “Skip.”

Give your function a name, description, and set the runtime to Node.js.

Now, copy our SDK from this Gist, into the Lambda function code box below.


Replace ‘YOUR_API_KEY_HERE’ with your Algorithmia API key in the Gist. Your API key can be found on the dashboard of your Algorithmia account.


Let’s walk through the top half of the Gist so we can understand how this works. We first define which Algorithmia algorithm we want to use. In this case we’re using the Quadtree Art Generator:

var algo = "algo://besirkurtulmus/quadtree_art/0.1.x";

Then we grab the new image from our S3 bucket:

var s3 = new AWS.S3();
var bucket = event.Records[0].s3.bucket.name;
var key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
var params = {Bucket: bucket, Key: key};
var signedUrl = s3.getSignedUrl('getObject', params);

We process the image, turning it into quadtree art, and upload the image back to our bucket.

var client = algorithmia(apiKey);
    client.algo(algo).pipe(signedUrl).then(function(output) {
        if(output.error) {
            // The algorithm returned an error
            console.log("Error: " + output.error.message);
            // We call context.succeed to avoid Lambda retries, for more information see: https://forums.aws.amazon.com/thread.jspa?messageID=643046#643046 
            context.succeed(output.error.message);
        } else {
            // Upload the result image to the bucket
            var outputKey = 'output/'+key.substring(key.lastIndexOf('/') + 1);
            var params = {Bucket: bucket, Key: outputKey, Body: output.get()};
            s3.upload(params, function(err, data) {
                if (err) {
                    console.log("Error uploading data: ", err);
                    context.fail(err);
                } else {
                    console.log("Successfully uploaded data to bucket");
                    context.succeed("Finished processing");
               }
            });
        }
    });

Got that? Okay, great.

When you’re ready, select the IAM role you created in Step 3. To be on the safe side, adjust the Timeout to the maximum (5 minutes), and hit “Next.”

Step 5: Configure Event Sources

Once your function is created, we need to set up the event for Lambda to respond to. Start by clicking the “Event Sources” tab on your function’s detail page. Then select “Add event source.”

Select S3 from the event source type drop-down. Select the bucket you created in Step 2. The event type you want to select is “Object Created (All).” We also want to add a prefix to tell Lambda to watch for new images here. In this case, we’ll use the prefix “input/”. Hit submit and you’re done.

Congrats, your AWS Lambda + Algorithmia function is ready to go. Lambda will now listen for new events in your S3 bucket, and automatically pass those images to Algorithmia where they will get processed by the Quadtree Art Generator, and then added back to S3 in the /output folder.

Test this out by logging into your S3 bucket, navigating to the input folder, and uploading an image. Then, navigate to the output folder, where you’ll have your own piece of digital quadtree art!

Here are our founders Diego Oppenheimer and Kenny Daniel before:


…and after quadtree art generation:


What’s Next

You now have a working prototype that uses AWS Lambda and Algorithmia. You could use this same workflow to easily detect and crop photos using Smart Thumbnail, transcribe videos using speech to text, or check images for nudity. Learn more about how to leverage Algorithmia and Lambda here.

In a follow-up guide, we’ll teach you how to create a simple Android photo sharing app for uploading photos to S3, where Lambda will pick them up and turn them into digital art for others to enjoy.