In part 2 of our guest-post series by Nick Rose, he takes things a step further by showing us how to host predictive models in the cloud, so they can be made available for a web app or other service to access.
Nick’s original post can be found on Medium. Enjoy!
My most recent post about analyzing my university’s gym crowdedness over the last year using machine learning generated a lot of great responses — including:
- How neat! I always wanted an excuse to never go to the RSF [our gym] again.
- Can you do this for our other campus locations as well?
- How do I do something like this myself?
First, I’m sorry you feel that way about the gym and I sincerely hope the oncoming wave of New Year’s resolutioners doesn’t completely crush your dreams of working out forever.
Second, yes. I’m in the process of creating more predictive models for the other ten campus locations Packd tracks. Machine learning requires a lot of data, so it may take some time before the models are ready for training.
Lastly, great question! In my last post I hinted at how exactly you can do this yourself:
I’ve deployed my Random Forest Regressor on Algorithmia to generate predictions as an API call.
I’ve since realized this requires a lot more explanation, and it’s the subject of this post. My process may not be the best, but hopefully by illustrating it, I can learn from my mistakes and give you a starting point. The rest of this post assumes you have basic knowledge of machine learning as a concept, you know Python, and you’re familiar with APIs. We’ll use scikit-learn for this tutorial.
Developing a Model Locally
The first step is to create a machine learning model locally (on your computer) to test things out. The data for this is located here for downloading. You’ll need to install:
- Numpy (pip install numpy)
- Pandas (pip install pandas)
- Scipy (pip install scipy)
- Scikit-learn (pip install scikit-learn)
We’ll need to create a prediction model using the data. The following is taken from my kernel over at Kaggle:
    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("data.csv")  # or wherever your data is located

    # Extract the training and test data
    data = df.values
    X = data[:, 1:]  # all rows, no "number_people"
    y = data[:, 0]  # all rows, "number_people" only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Create your prediction model
    model = RandomForestRegressor(n_estimators=140)
    model.fit(X_train, y_train)
Let’s pause here for a second. I’ve breezed over some important details in the code above. First of all, I didn’t have to modify any of my data (for example, converting “yes / no” column values to 1/0) because I already did that part behind the scenes. Second, choosing the correct model is not a trivial thing, and I seem to have pulled the Random Forest Regressor out of thin air. (I didn’t really; check my first post for details.) Last, why did I set n_estimators to 140? Normally we’d want to do a grid search over our hyperparameters to find the optimal n_estimators, but I did the hard work behind the scenes already (it takes a while).
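If you haven’t run a grid search before, here is a minimal sketch of the idea using scikit-learn’s GridSearchCV. The synthetic data and the candidate values for n_estimators are placeholders invented for illustration; they are not the gym data, and they are not the grid behind the 140 above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 200 rows, 5 features
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 5)
y_demo = 50 * X_demo[:, 0] + 10 * X_demo[:, 1] + rng.rand(200)

# Candidate values are illustrative only
param_grid = {"n_estimators": [10, 20, 40]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation
    scoring="r2",  # the same metric model.score() reports for regressors
)
search.fit(X_demo, y_demo)

best_n_estimators = search.best_params_["n_estimators"]
print(best_n_estimators, search.best_score_)
```

On real data you would fit the search on the training split only and keep the test split for a final check.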
The last line is the most important — it actually performs the learning part of the algorithm. It changes the state of the variable model. From here on out, we can throw new data at the model to make predictions:
    test_output = model.predict(X_test)  # predicted values
    score = model.score(X_test, y_test)  # R^2 score (not accuracy), around 0.90
    print(test_output)
Here are the predicted number of people for the test data:
array([ 44.92142857, 54.22142857, 45.35714286, ..., 47.95714286, 63.27142857, 40.22142857])
Great! This isn’t really predicting the future yet — the test data was actually a quarter of the original historical data (more on the test/train data split). If we want to really predict the future, we need to do a lot more work.
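One note on the score: for a regressor, model.score returns the coefficient of determination (R²), not a classification accuracy. The sketch below, on made-up data, shows that it matches r2_score, and adds mean absolute error, which is easier to read as “people off, on average”. Every name and number here is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = 60 * X_demo[:, 0] + 5 * rng.rand(300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_tr, y_tr)

pred = demo_model.predict(X_te)
r2 = r2_score(y_te, pred)              # same value as demo_model.score(X_te, y_te)
mae = mean_absolute_error(y_te, pred)  # average error in the target's own units
print(r2, mae)
```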
In order to predict how crowded the gym will be in the future, we need a new array of column values where everything except the target label, number_people, is filled in. The details of how to fill these in probably will take another post, but the general idea is to create several datetimes in the future, and for each one, add each feature value (the timestamp, day of the week, will it be a weekend, will it be a holiday, will it be the start of the semester, etc). For weather feature values, I used the Dark Sky API to make forecasts.
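As a rough starting point, the calendar-based features can be derived directly from the datetimes with pandas. The helper name make_future_features, the column names, and the reading of “timestamp” as seconds since midnight are assumptions for this sketch; holiday, semester, and weather features are left out because they need outside data.

```python
import pandas as pd


def make_future_features(start, periods, step_minutes=60):
    """Build one row of calendar features per future datetime (hypothetical helper)."""
    times = pd.date_range(
        start=start, periods=periods, freq=pd.Timedelta(minutes=step_minutes)
    )
    features = pd.DataFrame({
        # Assumed meaning of "timestamp": seconds since midnight
        "timestamp": times.hour * 3600 + times.minute * 60 + times.second,
        "day_of_week": times.dayofweek,  # Monday=0 ... Sunday=6
        "is_weekend": (times.dayofweek >= 5).astype(int),
        "month": times.month,
        "hour": times.hour,
    })
    features.index = times  # keep the datetimes alongside the feature rows
    return features


X_future = make_future_features("2017-01-07 08:00", periods=4)
print(X_future)
```

The columns must end up in the same order as the training data before the rows are scaled and passed to the model.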
Let’s say you now have an array of times in the future for which you want to predict how crowded the gym will be. Here’s how you do it:
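    X_predict = generate_prediction_data()  # your special call here
    # Reuse the scaler fitted on the training data. Fitting a new scaler here
    # would scale the new rows differently from what the model was trained on.
    X_predict = scaler.transform(X_predict)
    predictions = model.predict(X_predict)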
Congratulations! The predictions variable now holds the predicted number of people for each of your datetimes in the future. Now, if you wanted, you could generate predictions for the rest of the year, or the century, or whenever, and call that your predicted values. However, this seems like it would take a huge amount of memory and probably wouldn’t be very accurate the further in the future you go. It would be much better to store the model we created and only bring it out when we want to make predictions. Here’s how in sklearn:
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

    # How to save the model:
    joblib.dump(model, "prediction_model.pkl")

    # You can quit your Python process here, even shut off your computer.

    # Later, if you want to load the saved model back into memory:
    model = joblib.load("prediction_model.pkl")
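One thing to watch: the model only makes sense together with the scaler that was fitted on the training data. A possible way to keep the two in sync (not what I did above, just a sketch) is to wrap both in a scikit-learn Pipeline and save that as a single file. The synthetic data and the file name prediction_pipeline.pkl are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
rng = np.random.RandomState(1)
X_demo = rng.rand(100, 3) * 100
y_demo = X_demo[:, 0] + rng.rand(100)

# The pipeline scales and then predicts, so both steps travel together
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=1),
)
pipeline.fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "prediction_pipeline.pkl")
joblib.dump(pipeline, path)

restored = joblib.load(path)
raw_rows = X_demo[:5]  # unscaled rows; the restored pipeline scales them itself
same = np.allclose(pipeline.predict(raw_rows), restored.predict(raw_rows))
print(same)
```

With this layout, callers would send raw feature rows instead of pre-scaled ones.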
It’s easy to store already-trained models in scikit-learn. Of course, your users can’t call you on the phone every time they want predictions so you can generate them on your computer. They’ll need to make an API call of some kind to a service you provide that does the predicting.
Storing Your Model in the Cloud
You could store the prediction_model.pkl file on your server and load it whenever someone makes a request for predictions, but it turns out loading this file uses up a lot of memory. In my case, more than I was allotted by my hosting company. It’s better to host your machine learning model on an entirely different server. You have a few options here from big companies, including Google, Microsoft, and Amazon.
However, I decided to go with a smaller, lesser known company to try it out.
Algorithmia is a relatively new company based in Seattle that specializes in “algorithms as a service”. Users of the platform can create code snippets and host them on Algorithmia and call their code as an API. In addition, these developers can charge other users for access to their code. It’s an interesting business idea and I’m eager to see how it works out, but for now I’m just using the hosting service idea.
After you make an account, you’ll want to upload your prediction_model.pkl to the data section. Here, I have several prediction models, each one ending in the .pkl extension:
Uploading is as easy as drag and drop, and takes a few minutes if you have larger prediction models (> 500 MiB). The next step is to create an algorithm that uses your data to make predictions.
Make Predictions as a Service Call
Under the “More” tab at the top of Algorithmia, you’ll find the Add Algorithm button.
Next you need to fill in the details of how your algorithm operates.
I enabled “full access to the internet” and “can call other algorithms” just in case. This means my algorithm can call external web APIs, like the Dark Sky weather API, and call other algorithms I create on Algorithmia.
Algorithmia provides a code editor online for you to edit your algorithm. Once you create your algorithm, you’ll be brought here:
Every algorithm consists of an apply method that takes one input, conveniently named input. The input variable is whatever you define it to be, since you’ll be the one calling the algorithm. Usually, input will be a 2-dimensional array of your features, where each row is missing the column you want to predict. For me, that’s number_people. Here’s an example input:
    input = [
        [
            1.072308379173353, 0.772801541291309, 0.0, 0.0,
            0.24191249808036538, -0.36311268827583276, 0.0, 0.0,
            0.0, 1.034839680120056
        ],
        ...
    ]
If we throw input at the model I created locally, we’ll get back a list of the predicted number_people for each feature row, just like we want. The only thing left is to mirror this behavior as a service call. Let’s finish the code on Algorithmia.
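To produce an input like the one above from raw feature rows, the rows need to be scaled with the scaler fitted on the training data and then converted to plain nested lists. The sketch below assumes the input travels as JSON, and uses made-up data and names (scaler_demo, raw_future_rows) in place of the real ones.

```python
import json

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins (assumptions) for the training features and the fitted scaler
rng = np.random.RandomState(2)
X_train_demo = rng.rand(50, 4) * 10
scaler_demo = StandardScaler().fit(X_train_demo)

raw_future_rows = rng.rand(3, 4) * 10  # new, unscaled feature rows

# NumPy arrays are not JSON serializable, so convert to nested Python lists
payload = scaler_demo.transform(raw_future_rows).tolist()

encoded = json.dumps(payload)  # a JSON-encoded version of the input
decoded = json.loads(encoded)
print(len(decoded), len(decoded[0]))
```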
Before continuing, make sure you add the correct dependencies to your code. Since we’re using scikit-learn, here’s all we need to add:
Finishing up the source code, we get this:
It’s very short, but this will work. When a new correctly formatted input is passed to this algorithm, the code will find the path of the prediction_model.pkl file we uploaded earlier, load it into memory using joblib, and finally make the prediction and return it. Click “Save”, “Compile”, then “Publish” to finish your algorithm.
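A minimal sketch of an apply function that behaves as described could look like the following. The data path, the module-level cache, the shape check, and the loader parameter (there so the function can be tried without Algorithmia) are all assumptions for illustration. The client.file(...).getFile().name call reflects my understanding of how the Algorithmia Python client exposes a local copy of a hosted file, so check it against the current client documentation.

```python
import joblib
import numpy as np

# Hypothetical path to the uploaded file in the data section
MODEL_URI = "data://.my/models/prediction_model.pkl"

_model = None  # cached so the file is loaded only once per process


def load_model():
    """Fetch the pickled model from hosted data and load it with joblib."""
    import Algorithmia  # lazy import: only needed where the client is installed

    client = Algorithmia.client()
    local_path = client.file(MODEL_URI).getFile().name
    return joblib.load(local_path)


def apply(input, loader=load_model):
    """Entry point: input is a 2-D list of already-scaled feature rows."""
    global _model
    if _model is None:
        _model = loader()
    rows = np.asarray(input, dtype=float)
    if rows.ndim != 2:
        raise ValueError("input must be a 2-D array of feature rows")
    # Convert to a plain list so the result can be serialized
    return _model.predict(rows).tolist()
```

Caching the model at module level means only the first call pays the cost of loading the .pkl file.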
You’ll end up on this page, which is the finished result.
If you scroll down to “Use this Algorithm” and click on the Python tab, you’ll see how easy it is to make a call to your algorithm.
    import Algorithmia

    input = <INPUT>
    client = Algorithmia.client(<Your Algorithmia Key here>)
    algo = client.algo('nsrose/testalgo2/')
    print(algo.pipe(input).result)
This will work anywhere, from any computer that has Python and Algorithmia installed (pip install algorithmia). All you have to do is make sure you format your input correctly!
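Since a badly formatted input is the most likely thing to go wrong, it can help to check it before making the call. Here is a small sketch of such a check; validate_input is a hypothetical helper, not part of the Algorithmia client, and n_features would be 10 for the example input shown earlier.

```python
import math


def validate_input(rows, n_features):
    """Raise ValueError unless rows is a non-empty 2-D list of finite numbers."""
    if not isinstance(rows, list) or not rows:
        raise ValueError("input must be a non-empty list of rows")
    for i, row in enumerate(rows):
        if not isinstance(row, list) or len(row) != n_features:
            raise ValueError(f"row {i} must be a list of {n_features} values")
        for value in row:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"row {i} contains a non-numeric value")
            if not math.isfinite(value):
                raise ValueError(f"row {i} contains NaN or infinity")
    return rows


checked = validate_input([[1.07, 0.77, 0.0], [0.24, -0.36, 1.03]], n_features=3)
print(len(checked))
```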
Why is This a Good Thing?
- Separation of ML concerns from the rest of your app: By calling and making predictions as an API call, your app can be free of the concerns of machine learning mechanics. All it needs to do is create correctly formatted inputs.
- Ease up on memory consumption: Loading *.pkl files into memory is a heavy task and probably not suitable for small applications on limited servers. Outsourcing this job to Algorithmia is a good alternative.
- Source control your ML code: Algorithmia recently added support for Git repos, so you can edit your code offline and push the changes.
Suggestions for Improvement
- Add debugging tools to the online code editor
- Add integrations (Slack notifications when someone uses your algorithm perhaps)