Algorithmia Blog - Deploying AI at scale

How To Deploy a Scikit-Learn Model on the AI Layer

For all the time that it takes to clean your data, train and test your model, tune your model hyperparameters, and then re-train and test your model again, you’re not done until it’s deployed into production. Because what good is a model that’s just sitting on your laptop?

In this tutorial, you’ll learn how to write a simple algorithm using a pre-trained Scikit-learn model we’ve provided. This way, once you’ve gotten your own Scikit-learn model’s accuracy in an acceptable range, you’ll be able to easily deploy it into production.

Background

The tutorial will walk through how to:

  • create an algorithm,
  • host data in Algorithmia’s data collections,
  • deploy a random forest regression model trained on a Boston housing prices dataset in order to predict the price of Boston houses that the model has not seen before.

Note that for any model you deploy on Algorithmia, you will first need to train the model and save it in serialized form. For Scikit-learn models, you can use Python’s pickle library to serialize, and later deserialize, your model file.
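As a minimal sketch of that serialize/deserialize round trip, the snippet below pickles an object to disk and loads it back. The dictionary is only a stand-in for a trained model (a fitted Scikit-learn estimator can be passed to pickle.dump in the same way), and the file name is a hypothetical one chosen for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable object works the same way.
# A fitted scikit-learn estimator would take the place of this dict.
model = {"kind": "stand-in regressor", "coefficients": [0.5, -1.25], "intercept": 3.0}

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "example-model.pkl")

# Serialize: pickle files are binary, so open with "wb".
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Deserialize: open with "rb". Only unpickle files you trust.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Keep in mind that a pickled Scikit-learn model is generally expected to be loaded with a library version compatible with the one that saved it.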

Before you get started, you’ll need these files in the GitHub repository.

Get Started

Next, you’ll need to upload the .csv file and the pre-trained pickled Scikit-learn model to a data collection.

If you’ve never created a data collection before, check out the docs for hosting your data on Algorithmia. Go ahead and create a data collection. Below, you’ll notice that we named our data collection “demo_files,” but you can name yours as you like.

Once you’ve created a data collection, click “Upload Files” and select the scikit-demo-boston-regression.pkl and boston_test_data.csv files from where you stored them on your computer.

Take note of the path created that starts with “data://” and shows your username and the data collection name along with your file name.
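To make the shape of that path concrete, here is a small sketch that assembles it from its three parts, assuming they appear in the order username, collection, file name, separated by slashes. The helper data_uri() is hypothetical (it is not part of any Algorithmia client), and “YOUR_USERNAME” is a placeholder.

```python
def data_uri(username, collection, filename):
    """Build a path of the form data://username/collection/filename."""
    parts = [username, collection, filename]
    if not all(parts) or any("/" in part for part in parts):
        raise ValueError("each part must be non-empty and contain no '/'")
    return "data://" + "/".join(parts)


# "YOUR_USERNAME" is a placeholder; "demo_files" is the collection name
# used in this tutorial.
model_uri = data_uri("YOUR_USERNAME", "demo_files", "scikit-demo-boston-regression.pkl")
csv_uri = data_uri("YOUR_USERNAME", "demo_files", "boston_test_data.csv")
print(model_uri)
print(csv_uri)
```

In practice you would simply copy the path from the data collection page; the sketch only shows what to expect.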

You’ll want to use this path in your algorithm to point to your own data and model path, so we recommend keeping this data collection page open and also opening a new tab where you can create your algorithm. This way you can easily copy and paste the paths from your data collections when you’re ready to add them to the demo code sample.

Next, click the “Plus” icon in the navigation pane and create your Scikit-learn algorithm, naming it as you like. If you’ve never created an algorithm before and want to understand what the permissions mean, check out our docs on Getting Started. Note that for this example, you’ll want to choose “Python 3.x” for your language. The rest of the permissions and the execution environment can stay at their default settings.

Once you create your algorithm, you’ll be able to edit it either through the CLI tools or the Web IDE. It’s your choice how you want to interact with your algorithm, but this tutorial will demonstrate working in the Web IDE:

Before we touch our code, we need to add the required dependencies. Click “Dependencies” found right above your source code. This will show a modal that is basically a requirements.txt file that pulls the stated libraries from PyPI. If you state the package name without a version number, you’ll automatically get the latest version we support; otherwise, state the needed version number or range.

In this example, we have an older Scikit-learn model, so we need a range of versions for our model to work. Go ahead and add these to the libraries already in your dependency file:

numpy

scikit-learn>=0.14,<0.18

So your whole dependency file will look like this:

You’ll want to remove the boilerplate code that exists in your newly created algorithm and copy/paste the code from the “demo.py” file found in the GitHub repository for this project.

Here is what you’ll see when you paste in the code from demo.py:

Notice in the first few lines of our script that we are importing the Python packages required by our algorithm. Then on line 7, we are creating the variable “client” in global scope to use throughout our algorithm. This will enable us to access our data in data collections via the Data API.

On line 12, inside the “load_model()” function, you’ll want to replace that string with the path from your data collections for the pickled model file.

Then, notice on line 13 that we are passing in that data collection path to the data API using:

client.file(file_path).getFile().name

And then we use the Pickle library to open the file.

Notice that line 20 is where we call the function “load_model().” This is important because you’ll always want to load the model outside of the “apply()” function. That way, the model file is loaded into memory only once, during the initial call within that session. The first call to your algorithm might take a bit of time depending on the size of your model, but subsequent calls will be much faster. If you were to load your model inside the “apply()” function, the model would be reloaded on every call of your algorithm.
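The load-once pattern can be sketched as follows. This is not the code from demo.py: a small pickled dictionary in a temporary file stands in for the model file that the Data API call provides, and load_count is a hypothetical counter added only to show how often the file is read.

```python
import os
import pickle
import tempfile

# Setup for the sketch only: write a stand-in "model" to a local file.
# In the tutorial, the local file comes from the Data API instead.
local_path = os.path.join(tempfile.mkdtemp(), "stand-in-model.pkl")
with open(local_path, "wb") as f:
    pickle.dump({"intercept": 22.5}, f)

load_count = 0  # counts how often the model file is actually read


def load_model():
    """Read the pickled model from disk (the slow step)."""
    global load_count
    load_count += 1
    with open(local_path, "rb") as f:
        return pickle.load(f)


# Loaded once, at module level, outside apply().
model = load_model()


def apply(input):
    """Handle one call by reusing the already-loaded model."""
    # Stand-in prediction: a real model would call model.predict(...) here.
    return model["intercept"]


# Three calls, but the model file is read only once.
results = [apply("ignored") for _ in range(3)]
print(results, load_count)
```

Moving the load_model() call inside apply() would make load_count grow with every call, which is exactly the cost the tutorial warns about.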

Also, if you are tempted to add your model file as a Python module, and then import that module into your algorithm file, this will result in a loss of performance, and we don’t recommend it.

The next function, called “process_input(),” simply turns the .csv file into a numpy array, and we call that function within the “apply()” function where we will pass in the user provided “input.”
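The real “process_input()” lives in demo.py and isn’t reproduced here. As a rough, standard-library-only sketch of the same idea, the hypothetical helper below parses CSV text into rows of floats, assuming the file has no header row and contains only numeric columns. The sample rows are made up and are not taken from boston_test_data.csv.

```python
import csv
import io


def parse_csv_rows(csv_text):
    """Parse CSV text into a list of rows, each a list of floats."""
    rows = []
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        try:
            rows.append([float(value) for value in row])
        except ValueError as err:
            raise ValueError(f"non-numeric value on line {line_number}: {err}") from err
    return rows


# Two made-up rows with three features each.
sample_text = "0.1,18.0,2.31\n0.2,0.0,7.07\n"
features = parse_csv_rows(sample_text)
print(features)
# numpy.array(features) would turn these rows into a 2-D array for the model.
```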

The “input” argument is the data or any other input from the user that gets passed into your algorithm. It’s important to handle inputs from multiple data sources, with exception handling: not only data collections like the one shown here, but also data files hosted in S3, Azure Blobs, or other data sources for which we have data connectors. For a great example of handling multiple types of files, or if you want to see a PyTorch algorithm in action, check out the Open Anomaly Detection algorithm in the Algorithmia Marketplace.
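One way to approach that kind of input checking is sketched below. Everything here is an assumption made for illustration: the InputError class, the validate_input() helper, and the tuple of accepted schemes are hypothetical names, and you would adjust the schemes to match the data connectors you actually use.

```python
# Assumption for this sketch: the schemes this algorithm chooses to accept.
SUPPORTED_SCHEMES = ("data", "s3")


class InputError(Exception):
    """Raised when the user-provided input cannot be used."""


def validate_input(user_input):
    """Return (scheme, path) for a supported .csv path, or raise InputError."""
    if not isinstance(user_input, str):
        raise InputError("input must be a string containing a file path")
    scheme, separator, path = user_input.partition("://")
    if not separator or not path:
        raise InputError("input must look like scheme://path/to/file.csv")
    if scheme not in SUPPORTED_SCHEMES:
        raise InputError(f"unsupported data source: {scheme}://")
    if not path.lower().endswith(".csv"):
        raise InputError("input must point to a .csv file")
    return scheme, path


ok = validate_input("data://YOUR_USERNAME/demo_files/boston_test_data.csv")

# A non-string input is rejected with a readable message instead of a crash.
try:
    validate_input({"not": "a path"})
    error_message = None
except InputError as err:
    error_message = str(err)

print(ok, error_message)
```

Failing early with a clear message tells the caller what to fix, rather than surfacing an error from deep inside the CSV parsing or prediction code.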

Notice that we are returning the predicted data as output from our Scikit-learn model in the apply() function.

Now click the “Build” button on the top right of the web IDE. This will commit our code to a git repository. Every algorithm is backed by a git repository and as you are developing your algorithm, whenever you hit “Build,” that will commit your code and return a hash version of your algorithm, which you’ll see in the Algorithmia console.

You can use that hash version to call your algorithm locally using one of the language clients for testing purposes while you work on perfecting your algorithm.

Note that you’ll get a semantic version number once you publish your algorithm.

Now we are ready to test our algorithm. Go back to the data collections page where we got our model path, and copy/paste the path for our .csv file into the Algorithmia console (wrapping it in quotes so it’s a proper JSON formatted string) and hit return/enter:
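If you’re unsure what “a proper JSON formatted string” looks like, this short sketch uses Python’s json module to produce it. The path is a placeholder; substitute the one shown on your own data collection page.

```python
import json

# Placeholder path; substitute the one from your data collection page.
csv_path = "data://YOUR_USERNAME/demo_files/boston_test_data.csv"

# json.dumps adds the surrounding double quotes that make this a JSON string.
console_input = json.dumps(csv_path)
print(console_input)

# Parsing it back yields the original Python string.
round_trip = json.loads(console_input)
```

The printed value, including its double quotes, is what you would type into the console.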

You’ll need to hit the “Publish” button now that you’re happy with the results. The publishing workflow will give you options for adding sample code (like the .csv data we tested on) so users of your algorithm can test out its inputs and outputs. You’ll also be able to choose if you want a backwards compatible version of your algorithm or if you’re making breaking changes to it. For more information on the publishing steps, check out our docs on Getting Started.

Once you publish, you can find the runnable code on your algorithm’s description page:

That’s it for deploying your Scikit-learn model into production!