All posts by Stephanie Kim

How We Turned Our GitHub README Model Into a Microservice

A few posts ago we talked about our data science approach to processing and analyzing a GitHub README.

We published the algorithm on our platform, and released our project as a Jupyter Notebook so you could try it for yourself. If you want to skip ahead, here’s the demo we made that grades your GitHub README.

It was a fun project that definitely motivated us to improve our own READMEs, and we hope it inspired you, too (I personally got a C on one of mine, ouch)!

In this post, we’d like to take a step back to talk about how we used Algorithmia to host our scikit-learn model as an algorithm, and discuss the benefits associated with using our platform to deploy your own algorithm.

Why Host Your Model on Algorithmia?

There are many advantages to using Algorithmia to host your models.

The first is that you can transform your machine learning model into an algorithm that becomes a scalable, serverless microservice in just a few minutes, written in the language of your choice. Well, as long as that choice is: Python, Java, Rust, Scala, JavaScript or Ruby. But don’t worry, we are always adding more!

Second, you can monetize your proprietary models by collecting royalties on the usage. This way your agonizingly late nights slamming coffee and turning ideas into code won’t be for nothing. On the other hand, if altruism and transparency are your thing (it’s ours!) then you can open-source your projects and publish your algorithms for free.

Either way, the runtime (server) costs are billed to the user who calls your algorithm. Learn more about pricing and permissions here.


A Step-By-Step Guide to Hosting Your Model

Now, on to the process of how we hosted the GitHub README analyzer model, and made an algorithm! This walkthrough is designed as an introduction to model and algorithm hosting even if you’ve never used Algorithmia before. For the GitHub ReadMe project, we developed, trained, and pickled our models locally. The following steps assume you’ve done the same, and will highlight the activities of getting your trained model onto Algorithmia.

Step 1: Add Your Data

The Algorithmia Data API is used when you have large data requirements or need to preserve state between calls. It allows algorithms to access data from within the same session, but ensures that your data is safe.

To use the Data API, log into your Algorithmia account and create a data collection via the Data Collections page. To get there, click the “Manage Data” link from the user profile icon dropdown in the upper right-hand corner.

Once there, click on “Add Collection” under the “My Collections” section on your data collections page. After you create and name your data collection, you’ll want to set the read and write access to make it public, or lock it down as private. For more information about the four different types of data collections and permissions, check out our Data API docs.

Besides setting your collection’s permissions, this is also where you load your data, your models, or both. In our GitHub README example we needed to load our pickled model rather than raw data. Loading your data or model is as easy as clicking the box labeled ‘Drop files here to upload’ or dragging and dropping files from your desktop.

You’ll need the path to your files in the next step. For example:
data://username/collection_name/data_model.pkl.zip
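
The example path assumes the model was pickled and zipped before upload. Here is a minimal sketch of producing such a file locally. The package_model helper and the file names are hypothetical, and the dictionary is only a stand-in for a trained scikit-learn model.

```python
import os
import pickle
import tempfile
import zipfile


def package_model(model, directory, pkl_name="model_file.pkl"):
    """Pickle `model` and wrap it in a zip archive; return the archive path."""
    pkl_path = os.path.join(directory, pkl_name)
    with open(pkl_path, "wb") as f:
        pickle.dump(model, f)
    zip_path = pkl_path + ".zip"
    # Store the pickle at the archive root so it can be extracted by name later
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(pkl_path, arcname=pkl_name)
    return zip_path


# Stand-in for a trained model (assumption); any picklable object works the same way
demo_model = {"weights": [0.1, 0.2, 0.3]}
workdir = tempfile.mkdtemp()
archive = package_model(demo_model, workdir)
```

The resulting archive is the kind of file you would then drop into your data collection.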

Uploading your trained model

Our recommendation is to preload your model before the apply() function (details below). This ensures the model is downloaded, and loaded into memory without causing a timeout when working with large model files. We support up to 4GB of memory per 5-minute session.

When preloading the model like this, only the initial call will load the model into memory, which can take several seconds with large files.

From there, only the apply() function will be called, which will return data much faster.
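
The effect of preloading can be illustrated without the platform. In this sketch, load_model is a hypothetical stand-in for downloading and unpickling a large file, and load_count exists only to show how often the load runs. It illustrates the pattern and is not Algorithmia's runtime.

```python
import pickle

load_count = 0  # tracks how often the expensive load runs


def load_model():
    """Hypothetical stand-in for downloading and unpickling a large model."""
    global load_count
    load_count += 1
    payload = pickle.dumps({"bias": 1.0})
    return pickle.loads(payload)


# Runs once when the module is first loaded, before any call to apply()
model = load_model()


def apply(input):
    # Reuses the already-loaded model instead of loading it again
    return input + model["bias"]


results = [apply(x) for x in (1.0, 2.0, 3.0)]
```

However many times apply() is called, the load itself happens only once.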

Step 2: Create an Algorithm

Now that we have our pickled model in a collection, we’ll create our algorithm and set our dependencies.

To add an algorithm, simply click “Add Algorithm” from the user profile icon. There, you will give it a name, pick the language, select permissions, and make the code either open or closed source.

Create your algorithm and set permissions.

Once you finish that step, go to your profile from the user profile icon where your algorithm will be listed by name. Click on the name and you’ll be taken to the algorithm’s page that you’ll eventually publish.

There is a purple tag that says “Edit Source.” Click that and it’ll open the editor where you can add your model and update your dependencies for the language of your choice. If you have questions about adding your dependencies check out our Algorithm Developer Guides.

Your new algorithm!

Step 3: Load Your Model

After you’ve set the dependencies, you can load the pickled model you uploaded in step 1. You’ll want to import the libraries and modules you’re using at the top of the file and then create a function that loads the model.

Here’s an example in Python:

import Algorithmia
import pickle
import zipfile

client = Algorithmia.client()

def loadModel():
    # Download the zipped model from the data collection to a local file
    file_path = 'data://.my/collection_name/model_file.pkl.zip'
    model_path = client.file(file_path).getFile().name
    # Unzip the compressed model file
    with zipfile.ZipFile(model_path, 'r') as zip_ref:
        zip_ref.extract('model_file.pkl')
    # Load the model into memory
    with open('model_file.pkl', 'rb') as model_file:
        model = pickle.load(model_file)
    return model

# Preload the model outside apply() so only the initial call pays the load cost
model = loadModel()

def apply(input):
    # Do something with your model and return your output for the user.
    # As a placeholder, this assumes a scikit-learn style predict() method.
    some_data = model.predict(input)
    return some_data

Step 4: Publish Your Algorithm

Now, all you have to do is save/compile, test and publish your algorithm! When you publish your algorithm, you also set the permissions for public or private use, and whether to make it royalty-free or charge a per-call royalty. Also note that you can set permissions for the algorithm version to say whether internet is required or if your algorithm is not allowed to call other algorithms.

Publish your algorithm!

If you need more information and detailed steps for creating and publishing your algorithm, check out our detailed guide to publishing your first algorithm.

Now that you’ve hosted your model and published your algorithm, tell people about it! It’s exciting to see your hard work utilized by others. If you need some inspiration check out the GitHub ReadMe demo, and our use cases.

If you have more questions about building your algorithm check out our Algorithmia Developer Center or get in touch with us.

HackPoly Spotlight: Helping Hand Uses Facial Recognition To Automate Tasks

Students at the HackPoly hackathon

We joined over 500 student hackers at the annual HackPoly hackathon at Cal Poly Pomona last weekend to see what these up-and-coming technologists could develop in just 24 hours. Student developers, designers, and hardware enthusiasts came from all over Southern California to form teams and build innovative products that solve real-world problems using a variety of tools, including Algorithmia.

John Pham, Josh Bither, Kevin Dinh, and Elijah Marchese worked together as team Helping Hand, which focused on creating a platform that lets users remotely automate tasks, such as opening a door using facial recognition software. Check out the great promo video they made about their project:

We chatted with John Pham to get a closer look at what they built:

How did you build this hack, and what technologies did you use?
This hack uses a Raspberry Pi at its core, responsible for most of the computation done. We then used a master/slave setup involving the Pi and an Arduino Uno. We utilized Algorithmia’s, Microsoft’s, and Clarifai’s APIs to create a facial recognition technique for a multitude of potential applications. We repurposed a Logitech webcam to provide a live feed to the Pi. We then programmed the Arduino to activate a servo. We created a SQL database, filled with recognized faces, which the user could personalize. Notifications and logs can be viewed from the Android app, and by extension the Pebble watch.

What’s next for the Helping Hand team?
We all hope to continue refining Helping Hand and propelling the project forward to reach its full potential. As of now, we plan to improve the software aspect of Helping Hand, and then direct our attention to bettering the hardware. We fully intend to release our product to the public. Affordable and user-friendly, Helping Hand will provide the security and peace of mind we all want for our communities and our families.

For John, this was his fourth hackathon working on a completely new platform. Kevin also had prior hackathon experience, but was especially proud of all they were able to accomplish in just 24 hours. According to Kevin, the event was really stressful at the start, because the team hadn’t worked together before, but “it got better as it went along, and we got more acclimated to cooperating.”

HackPoly was Josh’s first hackathon, but like a seasoned hacker, he said “My favorite part of HackPoly were the very early morning hours, from 12 – 3 am, where trying to code coherently becomes nearly impossible. I got about 3-4 hours of sleep for the whole event.”

It was also Elijah’s first hackathon and he described the experience as “a wild rollercoaster ride.” Following the event, Elijah went on to explain that the hackathon was a high pressure, highly competitive endeavor: “There was a lot of stress trying to learn so much in so little time. Despite only having three hours of sleep, Helping Hand turned out great and the satisfaction of completing the project made me feel fulfilled.”

Because the team won the “Best Use of Algorithmia API” prize, we sent them Cloudbit Starter Kits to help them continue on their path of hardware and Internet of Things hacking. We can’t wait to see what else the team builds and what happens next with Helping Hand!

The Algorithmia Shorties Contest

Generating short story fiction with algorithms

The Algorithmia Shorties contest is designed to help programmers of all skill levels get started with Natural Language Processing tools and concepts. We are big fans of NaNoGenMo here at Algorithmia, even producing our own NaNoGenMo entry this year, so we thought we’d replicate the fun by creating this generative short story competition!

The Prizes

We’ll be giving away $300 USD for the top generative short story entry!

Additionally there will be two $100 Honorable Mention prizes for outstanding entries. We’ll also highlight the winners and some of our favorite entries on the Algorithmia blog.

The Rules

We’re pretty fast and loose with what constitutes a short story. You can define what the “story” part of your project is, whether that means your story is completely original, a modified copy of another book, a collection of tweets woven into a story, or just a non-nonsensical collection of words! The minimum requirements are that your story is primarily in English and no more than 7,500 words.

Each story will be evaluated with the following rubric:

  • Originality
  • Readability
  • Creative use of the Algorithmia API

We’ll read through all the entries and grab the top 20. The top 20 stories will be sent to two Seattle school teachers for some old-school red ink grading before the final winner selection.

The contest runs from December 9th to January 9th. Your submission must be entered before midnight PST on January 9th. Winners will be announced on January 13th.

How to generate a short story

Step One: Find a Corpus

NaNoGenMo + Text Analysis with Algorithmia’s Natural Language Processing algorithms

We’ve just wrapped up November, which means aspiring writers all over the world are frantically typing away in an attempt to finish an entire novel in one month as part of National Novel Writing Month, also known as NaNoWriMo. Each November, participants aim to write 50,000 words on a 30 day deadline–a difficult feat for any writer! NaNoWriMo has been around for quite a long time, but for the last couple of years programmers and digital artists have been participating in a cheeky alternative: NaNoGenMo, or National Novel Generation Month.

Internet artist Darius Kazemi started NaNoGenMo after tweeting the idea in 2013.

This November is the third organized installment of NaNoGenMo, and the community keeps growing every year as more and more programmers and artists become interested in the strange intersection of code, language processing, and literature. And because the event is primarily driven by developers, submissions are posted on a GitHub repo as Issues so that participants can comment on one another’s ideas and help each other create some of the most unique and sometimes nonsensical novels written in November.

In the NaNoGenMo world, “novel” is pretty loosely defined. According to the rules,

“The “novel” is defined however you want. It could be 50,000 repetitions of the word “meow”. It could literally grab a random novel from Project Gutenberg. It doesn’t matter, as long as it’s 50k+ words.”

(And of course, someone did make that 50,000 word “meow” book in 2014!)

Novel generation can be much more complicated than it appears from the outside. Some books integrate with social media by pulling text from Twitter to generate dialogue, others go down a recursive rabbit hole, and some even generate graphic novels.

Algorithmia is home to a wide variety of algorithms that are a perfect fit for NaNoGenMo. Even though I don’t have any background in natural language processing or computational linguistics, I found it easy to combine algorithms that not only helped me generate my novel, but also gave me insights into the texts I used as a basis.

I chose the texts I wanted to work with based on two things: availability in the public domain and an interesting author demographic. While there are tons of NaNoGenMo books out there that are based on other texts, I wanted to find a really unique set of texts to base my novel on. I also developed an interest in 19th century American literature after reading Uncle Tom’s Cabin when I was 12. Luckily for me, Project Gutenberg is home to many novels and autobiographies that fit this intersection of interests!

First step: compile a corpus of texts. I chose to go with two sets of 7 books to compare. The first set was composed primarily of slave and emancipation narratives from Black female authors. While digging around in these texts, I realized that books as seemingly disparate as Little Women were published at the same time. Somehow I had never really thought about how such drastically different worlds were becoming exposed in what we now think of as classic American literature, so I decided it would be interesting to compare them. The second set of texts are all from white female authors and were published around the mid-19th century.

Set one:

  • 1861 – Incidents in the Life of a Slave Girl by Harriet Jacobs
  • 1868 to 1888 (published in serial form) – Trial and Triumph by Frances Ellen Watkins Harper
  • 1868 to 1888 (published in serial form) – Sowing and Reaping: A Temperance Story by Frances Ellen Watkins Harper
  • 1868 to 1888 (published in serial form) – Minnie’s Sacrifice by Frances Ellen Watkins Harper
  • 1868 – Behind the Scenes by Elizabeth Keckley
  • 1891 – From the Darkness Cometh the Light, or, Struggles for Freedom by Lucy Delaney
  • 1892 – Iola Leroy, or Shadows Uplifted by Frances Ellen Watkins Harper

Set two:

  • 1845 – Woman in the Nineteenth Century by Margaret Fuller
  • 1852 – Uncle Tom’s Cabin by Harriet Beecher Stowe
  • 1854 – The Lamplighter by Maria S. Cummins
  • 1854 – Ruth Hall: A Domestic Tale of the Present Time by Fanny Fern (pen name of Sara Payson Willis)
  • 1860 – Rutledge by Miriam Coles Harris
  • 1868 – Little Women by Louisa May Alcott
  • 1869 to 1870 (published in serial form) – An Old Fashioned Girl by Louisa May Alcott
  • 1872 – What Katy Did by Susan Coolidge

Before I started generating my own novel based on these texts, I rolled up my sleeves and got to work on analyzing them. The Algorithmia platform is already full of many text analysis algorithms, so instead of getting lost in learning natural language processing from scratch, it was as simple as choosing an algorithm, passing in my texts, and comparing the results.

Haven’t read any of the books? Don’t worry! The first algorithm I ran on the texts was Summarizer. This algorithm is pretty straightforward to use–input text, get back key sentences and ranked keywords. Read the summaries of Set One and Set Two if you need a literary refresher!

Using the AutoTag algorithm, I set out to discover if there would be a difference in the topics we’d find between the two author demographics. The AutoTag algorithm uses a variant of Latent Dirichlet Allocation and returns a set of keywords that represent the topics in the text. I then took each of the topics returned by the algorithm and classified them into various categories or themes to see if we could find some common threads.

I had suspected that the second set of books would have more domestic related themes, but I was mostly unsurprised that there were no autotagged keywords about race or slavery in that set. Interestingly, specific names as keywords were fairly frequent in both sets, averaging 4.8 out of 8 topics for set one and 5.7 of the topics in set two.

While this algorithm gives us some interesting insights into our texts, it can’t tell us everything and sometimes it can even trick you. For example, I grew suspicious of Sowing & Reaping when the AutoTag algorithm returned that one of the topics was “romaine”. I suspected that this book did not in fact focus on a type of lettuce as a main topic. Since I hadn’t read this specific book, I looked it up–turned out to be the last name of a main character!

After running the AutoTag algorithm on my data sets, I decided to check out Sentiment Analysis. This algorithm uses text analysis, natural language processing, and computational linguistics to identify subjective information in text. It’s also known as opinion mining. The algorithm I used returns a rating of Very Negative, Negative, Neutral, Positive or Very Positive.

Here’s the breakdown of sentiment by book:

Set One Books                         | Sentiment | Set Two Books                   | Sentiment
Incidents in the Life of a Slave Girl | Negative  | Woman in the Nineteenth Century | Negative
Trial and Triumph                     | Negative  | Uncle Tom’s Cabin               | Negative
From the Darkness Cometh the Light    | Negative  | The Lamplighter                 | Negative
Sowing and Reaping                    | Negative  | Ruth Hall                       | Neutral
Minnie’s Sacrifice                    | Negative  | Rutledge                        | Very Negative
Behind the Scenes                     | Positive  | Little Women                    | Negative
Iola Leroy                            | Negative  | What Katy Did                   | Negative

Unsurprisingly, 12 out of 14 of the books I analyzed were Negative or Very Negative. Rough times in the 19th century!
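
That tally is easy to double-check with a few lines of Python. The ratings below are transcribed by hand from the breakdown above, and the variable names are just labels for this sketch.

```python
from collections import Counter

# Ratings transcribed from the sentiment breakdown, in row order
set_one = ["Negative", "Negative", "Negative", "Negative",
           "Negative", "Positive", "Negative"]
set_two = ["Negative", "Negative", "Negative", "Neutral",
           "Very Negative", "Negative", "Negative"]

# Count each rating across both sets, then add up the two negative ratings
counts = Counter(set_one + set_two)
negative_total = counts["Negative"] + counts["Very Negative"]
print(negative_total, "of", len(set_one) + len(set_two))
```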

Next, I decided it might be interesting to see what popped up with Profanity Detection. While getting the data into the algorithm and writing the results back to a file was easy, it turns out that profanity detection requires a lot of double checking by hand. I knew that some words that came up were not really profane back then; words like “queer”, “pussy”, and “muff” were innocent in the context of these 19th century texts.

Interestingly, the frequency of racial profanity in the two data sets ended up being relatively similar.

Of course, running the algorithm doesn’t give you the full picture, since in our second set of data about 95% of the racial profanity came from one book: Uncle Tom’s Cabin. This is unsurprising since it’s the only work in our second set of books that was written by an abolitionist. However, we still don’t quite get the full picture about profanity in these books: many of the words used were not considered slurs back then, and additionally, within the use of dialogue this kind of language takes on a different dimension. The thing we can learn from an algorithm such as Profanity Detection is that there is a very stark difference in who these books focused on as main characters and what kind of world they lived in. Four of the seven books written by white authors had zero instances of these words.

Now that you’ve read through all this and seen the results from all these different algorithms, you might be thinking that you don’t know how to do natural language processing, so maybe this is something you’ll put on a project list and try out later. The most amazing part of this project that I haven’t told you yet is this: every single one of the scripts that I wrote to do NLP and text analysis was under 30 lines of code.

Check out the script I wrote for running the AutoTag algorithm:

import Algorithmia
import os
import json

client = Algorithmia.client('my_api_key')
algo = client.algo('nlp/AutoTag/0.1.4')

rootdir = './clean_books/set_one/'
output_file = 'set_one_autotag_results.txt'
results = ''

# Run AutoTag on every book in the directory and collect the results
for subdir, dirs, files in os.walk(rootdir):
    for filename in files:
        with open(os.path.join(subdir, filename), 'r') as content_file:
            input = content_file.read()
        print("Autotagging " + filename)
        results += filename + "\n\n"
        results += json.dumps(algo.pipe(input))
        results += "\n\n"

# Write all results to one file; the with block closes it automatically
with open(output_file, 'w') as f:
    f.write(results)

print("Done!")

After analyzing all my books, it was time to generate my novel to complete NaNoGenMo. This was so easy compared to the text analysis! Once again, with just simple API calls, I generated trigram models based on each set of books. I then made book previews based on each trigram model just to see if you could hear a difference in the books generated on these different demographics.
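
If you’re curious what a trigram model is doing, here is a minimal local sketch. It is not the implementation behind the Algorithmia algorithms; the function names and the tiny sample text are made up for illustration.

```python
import random
from collections import defaultdict


def build_trigram_model(text):
    """Map each pair of consecutive words to the words seen after it."""
    words = text.split()
    model = defaultdict(list)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)].append(third)
    return model


def generate(model, seed_pair, max_words, rng):
    """Extend seed_pair one word at a time until max_words or a dead end."""
    output = list(seed_pair)
    while len(output) < max_words:
        followers = model.get((output[-2], output[-1]))
        if not followers:
            break  # no known trigram starts with the last two words
        output.append(rng.choice(followers))
    return " ".join(output)


# Made-up sample text (assumption); a real corpus would be entire books
sample = "the cat sat on the mat and the cat ran to the door"
model = build_trigram_model(sample)
text = generate(model, ("the", "cat"), 8, random.Random(0))
```

Each new word is picked only from words that followed the previous two somewhere in the source text, which is why generated passages echo the vocabulary and dialect of the books they were trained on.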

The book preview from Set One:

Dem young uns vil kill you dead than to see you. Well, you would be less unhappy marriages if labor were more women in the midst of her nice pudding, as there are no enemies to good old aunt, and confirm themselves in woods and gloomy clouds hung like graceful draperies. Talk about the streets of the ballot in his land, that those who have fitted their children?

Belle, and I live in such dingy, humble quarters. said Mrs. Underhill, from my own sorrow-darkened home, I did, that he had asked them. Do you remember the incident so well were given to Frederick Douglass contributed $200, besides lecturing for us. The President added: Man is a fair specimen of her negro blood in his friendship, but they may be an old woman entered her home with me? If the vessel had been. Reader, I felt humiliated enough.

The book preview from Set Two:

aw! Yes, said Miss Skinlin she hasn’t the first heir to the female figure. The waves dance bright and happy when I forgot to learn, before which she told me to read and study. My Uncle, with a commanding, What are you better than Kintuck.

It was useless to ask one last word I ran down a corridor as dark and narrow streets or the other.

No Oh, Earth! And no one interfered, and it was. but then strangers came so by letting out all fear and distress and doubts of the damned, as well as bodies. What word Can we not get. I don’t resent the sarcasm, and unsettled most of my observing her to rise. Fortunately the gate swinging in the recesses, chrysanthemums and Christmas roses bloomed as freshly as in her voice, what everybody finds in the streets so, for the best thing was insufferably disgusting and loathsome to me. I said a thing as leisure there.

The most interesting difference I found in the text generated from these different data sets was that the text from Set Two sounded much more formal. The first set of books, the ones written by Black authors, tended to have much more dialogue written in such a way as to let the reader hear the accents and dialects of the time. These words became part of the model to generate text, so as you can see in the first sentence of the Set One preview, the algorithm generated text that still makes a lot of sense even with words that are intended to showcase an accent.

In the end, I decided to create a trigram model based on both sets of text and use that to generate my full length novel. I didn’t have to write any fancy code; I merely made another API call to the Generate Trigram Frequencies algorithm, this time passing in the entirety of my data set. Then, to generate my novel, I wrote a quick script that calls into another algorithm: Generate Paragraph From Trigram. This algorithm uses the trained trigram model to generate paragraphs of text. Since NaNoGenMo requires the book to be at least 50,000 words, I simply wrote a loop that calls the Generate Paragraph algorithm until the total word count of the book reaches the goal:

import Algorithmia
import os
import re
from random import randint

client = Algorithmia.client('my_api_key')
text_from_trigram = client.algo('/lizmrush/GenerateParagraphFromTrigram')
trigrams_file = "data://.algo/ngram/GenerateTrigramFrequencies/temp/all-trigrams.txt"

book_title = 'full_book.txt'
book = ''
book_word_length = 50000

# Keep adding paragraphs until the book reaches the target word count
while len(re.findall(r'\w+', book)) < book_word_length:
    print("Generating new paragraph...")
    input = [trigrams_file, "xxBeGiN142xx", "xxEnD142xx", (randint(1,9))]
    new_paragraph = text_from_trigram.pipe(input)
    book += new_paragraph
    book += '\n\n'
    print("Updated word count:")
    print(len(re.findall(r'\w+', book)))

# Write the encoded text in binary mode; the with block closes the file
with open(book_title, 'wb') as f:
    f.write(book.encode('utf8'))

print("Done!")
print("Your book is now complete. Give " + book_title + " a read now!")

Even with extra new lines for readability, the code I needed to generate an entire novel with Algorithmia was still under 30 lines! And I ended up generating a really unique, interesting novel without getting lost in the highly technical parts of natural language generation. Now, the text isn’t perfect: sometimes the sentences don’t quite sound right and there isn’t really any sort of story arc, but for such simple code I think it’s pretty good! My favorite part about using 19th century texts as the data set was that sometimes you can’t tell if the generated text is hard to read because it’s generated and doesn’t make much sense or because it sounds so old-timey. My book includes the following gems that just might pass as human-written text:

I have said of human life when I saw the Ohio river, that you shall work.

Still, falsehood may be hearing you. She only ‘spects something. Them curls may make a noise you shall not.

You can read the book, or rather, attempt to read the book, online or you can download it from the repo.

It’s mindblowingly fast and simple to get the power of these algorithms into your hands once they are behind a simple API call. You can see all the other scripts I wrote in the GitHub repo for this project. If you browse around, you’ll see that each script is nearly identical. The only real changes I had to make were replacing the algorithm I was calling and what I named the files to write results to! The Algorithmia platform is an incredibly powerful tool. Instead of spending days, weeks, months learning how to code my own natural language processing and text analysis algorithms, I could just pop my data into a variety of algorithms with simple API calls. No sweat, just results.