How To Deploy a Scikit-Learn Model on the AI Layer

For all the time that it takes to clean your data, train and test your model, tune your model hyperparameters, and then re-train and test your model again, you’re not done until it’s deployed into production. After all, what good is a model that’s just sitting on your laptop?

In this tutorial, you’ll learn how to write a simple algorithm using a pre-trained Scikit-learn model we’ve provided. This way, once you’ve gotten your own Scikit-learn model’s accuracy in an acceptable range, you’ll be able to easily deploy it into production.

Background

The tutorial will walk through how to:

  • create an algorithm,
  • host data in Algorithmia’s data collections,
  • deploy a random forest regression model trained on a Boston housing prices dataset in order to predict the price of Boston houses that the model has not seen before.

Note that for any model you deploy on Algorithmia, you will need to train and save the serialized model. For Scikit-learn models, you can use the Python Pickle library to serialize, and later deserialize, your model file.
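As a rough illustration of the serialize/deserialize round trip, here is a minimal sketch using only the standard library. The helper names (save_model, load_model_from_disk) and the dictionary standing in for a trained model are assumptions made for this example; a fitted Scikit-learn estimator would be passed to the same two Pickle calls.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model_from_disk(path):
    """Deserialize a previously pickled model object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted estimator (an assumption for illustration only).
# A fitted scikit-learn estimator would go through the same two calls.
stand_in_model = {"coefficients": [0.5, -1.25], "intercept": 3.0}

model_path = os.path.join(tempfile.mkdtemp(), "stand-in-model.pkl")
save_model(stand_in_model, model_path)
restored_model = load_model_from_disk(model_path)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.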

Before you get started, you’ll need these files in the GitHub repository.

Get Started

Next, you’ll need to upload the .csv file and the pre-trained pickled Scikit-learn model to a data collection.

If you’ve never created a data collection before, check out the docs for hosting your data on Algorithmia. Go ahead and create a data collection. Below, you’ll notice that we named our data collection “demo_files,” but you can name yours as you like.

Once you’ve created a data collection, click “Upload Files” and select the scikit-demo-boston-regression.pkl and boston_test_data.csv files from where you stored them on your computer:

Take note of the path created that starts with “data://” and shows your username and the data collection name along with your file name.
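To make the shape of that path concrete, here is a small sketch that assembles and splits a path of the form data://username/collection/filename. The helper names and the YOUR_USERNAME placeholder are hypothetical; in practice you would simply copy the path from your data collection page.

```python
def build_data_uri(username, collection, filename):
    """Join the three path parts into a data:// path."""
    parts = [username, collection, filename]
    if any(not part or "/" in part for part in parts):
        raise ValueError("each part must be non-empty and contain no '/'")
    return "data://" + "/".join(parts)


def split_data_uri(uri):
    """Return (username, collection, filename) from a data:// path."""
    prefix = "data://"
    if not uri.startswith(prefix):
        raise ValueError("expected a path starting with data://")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected data://<username>/<collection>/<filename>")
    return tuple(parts)


# YOUR_USERNAME is a placeholder; demo_files is the collection name used above.
model_uri = build_data_uri(
    "YOUR_USERNAME", "demo_files", "scikit-demo-boston-regression.pkl"
)
```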

You’ll want to use this path in your algorithm to point to your own data and model path, so we recommend keeping this data collection page open and also opening a new tab where you can create your algorithm. This way you can easily copy and paste the paths from your data collections when you’re ready to add them to the demo code sample.

Next, click the “Plus” icon in the navigation pane and create your Scikit-learn algorithm, naming it as you like. If you’ve never created an algorithm before and want to understand what the permissions mean, check out our docs on Getting Started. For this example, choose “Python 3.x” as your language. The rest of the permissions and the execution environment can stay at their default settings.

Once you create your algorithm, you’ll be able to edit it either through the CLI tools or the Web IDE. It’s your choice how you want to interact with your algorithm, but this tutorial will demonstrate working in the Web IDE:

Before we touch our code, we need to add the required dependencies. Click “Dependencies,” found right above your source code. This opens a modal that works like a requirements.txt file and pulls the listed libraries from PyPI. If you list a package name without a version number, you’ll automatically get the latest version we support; otherwise, state the version number or range you need.

In this example, we have an older Scikit-learn model, so we need a range of versions for our model to work. Go ahead and add these to the libraries already in your dependency file:

numpy

scikit-learn>=0.14,<0.18

So your whole dependency file will look like this:

You’ll want to remove the boilerplate code that exists in your newly created algorithm and copy/paste the code from the “demo.py” file found in the GitHub repository for this project.

Here is what you’ll see when you paste in the code from demo.py:

Notice in the first few lines of our script that we are importing the Python packages required by our algorithm. Then on line 7, we are creating the variable “client” in global scope to use throughout our algorithm. This will enable us to access our data in data collections via the Data API.

On line 12, inside the “load_model()” function, you’ll want to replace that string with the path from your data collections for the pickled model file.

Then, notice on line 13 that we are passing in that data collection path to the data API using:

client.file(file_path).getFile().name

And then we use the Pickle library to open the file.
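The two steps just described might look roughly like the sketch below. It is not the demo.py source: here the client is passed in as a parameter (the demo keeps it in global scope) so the function can be exercised without network access, and the only Data API call used is the chain quoted above.

```python
import pickle


def load_model(client, file_path):
    """Fetch a pickled model through the Data API and deserialize it.

    `client` is assumed to expose the call chain quoted in the tutorial:
    client.file(file_path).getFile().name
    """
    # getFile() yields a local file object; .name is its path on disk.
    local_path = client.file(file_path).getFile().name
    with open(local_path, "rb") as f:
        return pickle.load(f)
```

You would call it with your own path, for example load_model(client, "data://YOUR_USERNAME/demo_files/scikit-demo-boston-regression.pkl"), where YOUR_USERNAME is a placeholder.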

Notice that line 20 is where we call the function “load_model().” This is important because you’ll always want to load the model outside of the “apply()” function. That way, the model file is loaded into memory only during the initial call within that session. The first call to your algorithm might take a bit of time depending on the size of your model, but subsequent calls will be much faster. If you were to load your model inside the “apply()” function, the model would be reloaded on every call to your algorithm.
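The load-once pattern can be sketched without any platform code. In this illustration, load_model() is a stand-in that returns a hypothetical linear rule instead of unpickling a real file, and load_calls exists only to show how many times loading happens.

```python
# Counts how often the expensive load step runs (for illustration only).
load_calls = {"count": 0}


def load_model():
    """Stand-in for fetching and unpickling the real model file."""
    load_calls["count"] += 1
    # Hypothetical "model": weights and an intercept for a linear rule.
    return {"weights": [2.0, 1.0], "intercept": 0.5}


def predict(model, rows):
    """Apply the stand-in linear rule to each row of features."""
    return [
        sum(w * x for w, x in zip(model["weights"], row)) + model["intercept"]
        for row in rows
    ]


# Loaded once in global scope, outside apply().
model = load_model()


def apply(input):
    """Entry point run on every call; it reuses the already-loaded model."""
    return predict(model, input)


first = apply([[1.0, 2.0]])
second = apply([[0.0, 0.0], [3.0, 1.0]])
```

After both calls, load_calls["count"] is still 1, because only the module-level line triggers a load.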

Also, if you are tempted to add your model file as a Python module, and then import that module into your algorithm file, this will result in a loss of performance, and we don’t recommend it.

The next function, called “process_input(),” simply turns the .csv file into a numpy array, and we call that function within the “apply()” function where we will pass in the user provided “input.”
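For a sense of what that conversion involves, here is a standard-library sketch that parses CSV text into rows of floats. It is not the demo’s “process_input()” function: the name rows_from_csv is hypothetical, and it assumes the file has no header row and only numeric columns. The nested list it returns is the kind of structure that numpy.array() would turn into a 2-D array.

```python
import csv
import io


def rows_from_csv(csv_text):
    """Parse header-less, all-numeric CSV text into a list of float rows."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for line_number, record in enumerate(reader, start=1):
        if not record:
            continue  # skip blank lines
        try:
            rows.append([float(value) for value in record])
        except ValueError as exc:
            raise ValueError(f"line {line_number}: non-numeric value") from exc
    return rows


# Made-up sample values, not taken from boston_test_data.csv.
sample = "0.1,18.0,2.31\n0.2,0.0,7.07\n"
features = rows_from_csv(sample)
```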

The “input” argument is the data or any other input from the user that gets passed into your algorithm. It’s important to support inputs from multiple data sources with exception handling: not only data collections as shown here, but also data files hosted in S3, Azure Blobs, or other data sources for which we have data connectors. For a great example of handling multiple types of files, or if you want to see a PyTorch algorithm in action, check out the Open Anomaly Detection algorithm in the Algorithmia Marketplace.
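One way to approach that kind of input checking is sketched below. The function name resolve_input and the list of accepted prefixes are assumptions for illustration (only “data://” appears in this tutorial), so adjust the prefixes to the data connectors you actually use.

```python
# Assumed prefixes; "data://" is from the tutorial, "s3://" is illustrative.
SUPPORTED_SCHEMES = ("data://", "s3://")


def resolve_input(input):
    """Validate the user input and report which data source it points to."""
    if not isinstance(input, str):
        raise TypeError("input must be a string path to a data file")
    for scheme in SUPPORTED_SCHEMES:
        if input.startswith(scheme):
            # Strip the trailing "://" to get a short source name.
            return scheme[:-3], input
    raise ValueError(
        "unsupported input; expected a path starting with one of: "
        + ", ".join(SUPPORTED_SCHEMES)
    )


source, path = resolve_input("data://YOUR_USERNAME/demo_files/boston_test_data.csv")
```

Raising a clear error for unexpected input gives callers a readable message instead of a failure deep inside the prediction code.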

Notice that we are returning the predicted data as output from our Scikit-learn model in the apply() function.

Now click the “Build” button on the top right of the web IDE. This will commit our code to a git repository. Every algorithm is backed by a git repository and as you are developing your algorithm, whenever you hit “Build,” that will commit your code and return a hash version of your algorithm, which you’ll see in the Algorithmia console.

You can use that hash version to call your algorithm locally using one of the language clients for testing purposes while you work on perfecting your algorithm.

Note that you’ll get a semantic version number once you publish your algorithm.

Now we are ready to test our algorithm. Go back to the data collections page where we got our model path, and copy/paste the path for our .csv file into the Algorithmia console (wrapping it in quotes so it’s a proper JSON formatted string) and hit return/enter:
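If you are unsure what “a proper JSON formatted string” looks like, the short sketch below produces one with Python’s json module. The YOUR_USERNAME placeholder is hypothetical; use the path copied from your own data collection.

```python
import json

# Placeholder path; substitute the one from your data collection page.
csv_path = "data://YOUR_USERNAME/demo_files/boston_test_data.csv"

# json.dumps wraps the path in double quotes, giving a valid JSON string.
console_input = json.dumps(csv_path)
```

The value of console_input, including its surrounding double quotes, is what you would paste into the console.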

Once you’re happy with the results, hit the “Publish” button. The publishing workflow will give you options for adding sample input (like the .csv data we tested on) so users of your algorithm can test out its inputs and outputs. You’ll also be able to choose whether you want a backwards-compatible version of your algorithm or whether you’re making breaking changes to it. For more information on the publishing steps, check out our docs on Getting Started.

Once you publish, you can find the runnable code on your algorithm’s description page:

That’s it for deploying your Scikit-learn model into production!

We got a new look!

Algorithmia developers unveiled a new UI on our user platform this week. Now, when you log into Algorithmia, you’ll see a personalized dashboard with your account information and recent activity, enabling easy account management and navigation.

This change comes about as we mark the four-year anniversary of the Algorithmia website, but the redesign of the platform was tied to creating a more user-friendly and intuitive experience for our users.

A Mini Tour

There are two new menus: global and navigation. The global menu (the purple one at the left) is designed to keep primary actions readily available (creating new algorithms, data sources, searching, accessing your notifications, and profile actions).

The navigation menu (the light gray one) provides quick access to the main pages of the site (your user pages, algorithms, data, and the accompanying technical docs).

“We were excited to move to a drawer-based navigation style, and surfacing application functionality to be more consistent. We’re making the app more intuitive and usable for everyone.”
      -Ryan Miller, Algorithmia Frontend Engineering Lead

What’s New?

We designed the new UI with user control in mind. Some of the features include:

  • Lists organization: the new UI provides easy access to saved work, API Keys, and set organizations for model sharing.
  • Dashboard navigation: easy access via two nav panes.
  • Mobile integration: much of the updated drawer navigation features of the site are now available for mobile users as well.

“A good UI should get out of the way, but as the app grew we began to see the opposite—we were inhibiting user effectiveness. So we broke it down and are iteratively building it back up with usability at the foundation so users can focus on their work—not how to use the app.”
      -James Hoover, Algorithmia Product Designer

We welcome any feedback you have as we continue to make the Algorithmia user platform ever more intuitive and usable.

Check out our updated UI today!

Data Scientists and Deploying Models From any Framework

Asking a data scientist to work with only one framework is like asking a carpenter to work with only a hammer. It’s essential that professionals have access to all the right tools for the job.

It’s time to rethink best practices for leveraging and building ML infrastructure and set the precedent that data scientists should be able to use whichever tools they need at any time.

Now, certainly some ML frameworks are better suited to solve specific problems or perform specific tasks, and as projects become more complex, being able to work across multiple frameworks and with various tools will be paramount.  

For now, machine learning is still in its pioneering days, and though tech behemoths have created novel approaches to ML (Google’s TensorFlow, Amazon’s SageMaker, Uber’s Michelangelo), most ML infrastructure is still immature or inflexible at best, which severely limits data scientists and DevOps. This should change.

Flexible frameworks and ML investment

Most companies don’t have dozens of systems engineers who can devote several years to building and maintaining their own custom ML infrastructure or learning to work within new frameworks, and sometimes, open-source models are only available in a specific framework. This could restrict some ML teams from using them if the models don’t work with their pre-existing infrastructure. Companies can, and should, have the freedom to work concurrently across all frameworks. The benefits of doing so are multifold:

Increase interoperability of ML teams

Often machine learning is conducted in different parts of an organization by data scientists seeking to automate their processes. These silos are not collaborating with other teams doing similar work. Being able to blend teams together while still retaining the merits of their individual work is key. It will de-duplicate efforts as ML work becomes more transparent within an organization.

Allow for vendor flexibility and pipelining

You don’t want to end up locked in to only one framework or only one cloud provider. The best framework for a specific task today may be overtaken next month or next year by a better product, and businesses should be able to scale and adapt as they grow. Pipelining different frameworks together creates the environment for using the best tools.

Reduce the time from model training to deployment

Data scientists write models in the framework they know best and hand them over to DevOps, who rewrite them to work within their infrastructure. Not only does this usually decrease the quality of a model, it creates a huge iteration delay.

Enable collaboration and prevent wasted resources

If a data scientist is accustomed to PyTorch but her colleague has only used TensorFlow, a platform that supports both people’s work saves time and money. Forcing work to be done with tools that aren’t optimal for a given project is like showing up with knives to a gunfight.

Leverage off-the-shelf products

There’s no need to constantly reinvent the wheel; if an existing open-source service or dataset can do the job, then that becomes the right tool.

Position your team for future innovations

Because the ML story is far from complete, being flexible now will enable a company to pivot more easily come whatever tech developments arise.

How to deploy models in any framework

Attaining framework flexibility, however, is no small feat. The main steps to enable deploying ML models from multiple frameworks are as follows:

Dependency management

It’s fairly simple to run a variety of frameworks on a laptop, but trying to productionize them requires a way to manage all dependencies for running each model, in addition to interfacing with other tools to manage compute.

Containerization and orchestration

Putting a model in a container is straightforward. When companies have only a handful of models, they often task junior engineers with manually containerizing, putting the models into production, and managing scale-up. This process unravels as usage increases, model numbers grow, and as multiple versions of models run in parallel to serve various applications.

Many companies are using Kubernetes to orchestrate containers—there are a variety of open-source projects on components that will do some of the drudge work of machine learning for container orchestration. Teams who have attempted to do this in-house have found that it requires constant maintenance and becomes a Frankenstein of modular components and spaghetti code that falls over when trying to scale. Worse still, after models are in production, you discover that Kubernetes doesn’t deal well with many machine learning use cases.

API creation and management

Handling many frameworks requires a disciplined API design and seamless management practice. When data scientists begin to work faster, a growing portfolio of models with an ever-increasing number of versions needs to be continuously managed, and that can be difficult.

Languages and DevOps

Machine learning has vastly different requirements than traditional compute, including the freedom to support models written in many languages. A problem arises, however, when data scientists working in R or Python sync with DevOps teams who then need to rewrite or wrap the models to work in the language of their current infrastructure.

Down the road

Eventually, every company that wants to extend its capabilities with ML is going to have to choose between enabling a multi-framework solution like the AI Layer or undertaking a massive, ongoing investment in building and maintaining an in-house DevOps platform for machine learning.

We Run the World’s Machine Learning. Literally.

Algorithmia’s AI Layer Powers the UN Methods Service

Algorithmia is renewing its commitment to global humanitarian efforts by making powerful ML tools available to everyone.

Economic and population data are key elements of planning and decision-making in first-world countries, but access to sophisticated analytic and compute power is limited or non-existent in developing countries.

Meeting the Problem Head On

Working in conjunction with the United Nations Global Platform for Official Statistics, Algorithmia built a repository of algorithms that are readily available to any data scientist of any member state at any time. There are models for predicting economic, environmental, and social trends to enable smarter decision-making for strategies like agricultural planning, flooding probabilities, and curbing deforestation.

The United Nations Global Platform for Official Statistics sought to build the algorithm repository as part of the Sustainable Development Goals (SDGs). The SDGs aim to meet global challenges in healthcare, poverty, environmental degradation, and inequality by 2030. The algorithm repository will serve member states to “establish strategies to reuse and adapt algorithms across topics and to build implementations for large volumes of data.” UN Big Data

Building an Algorithm Marketplace for the Developing World

The UN wanted a way to share models with underdeveloped countries to curate economic, environmental, and social data to save lives and improve health and environmental conditions. Using the UN algorithm repository, for example, a developing country could model farmland satellite imagery to predict droughts, urbanization trends, or migration patterns.

Such statistical information can be used in myriad ways by both humanitarian organizations and policy-making, governmental bodies to make smarter resource-allocation decisions, better understand urban planning needs from population data, and even predict migration crop cycles using geospatial imagery.

The UN’s partnership with Algorithmia demonstrates our dedication to leveraging AI and machine learning to seek solutions to global problems. We are so looking forward to empowering the developing world, one algorithm at a time.

Read the Case Study

Navigating the Machine Learning Roadmap

Machine learning (ML) will drastically alter how many industries operate in the future. Natural language processing will enable seamless and instantaneous language translation, forecasting algorithms will help predict environmental trends, and computer vision will revolutionize the driverless car industry.

Nearly all companies that have initiated ML programs have encountered challenges or roadblocks in their development. Despite efforts to move toward building robust ML programs, most companies are still at nascent stages of building sophisticated infrastructure to productionize ML models.

Roadmap Overview

After surveying hundreds of companies, Algorithmia has developed a roadmap that outlines the main stages of building a robust ML program as well as tips for avoiding common ML pitfalls. We hope this roadmap can be a guide that companies can use to position themselves for ML maturity. Keep in mind, the route to building a sophisticated ML program will vary by company and team and require flexibility.

Using the Roadmap

Every company or team is situated at a different maturity level in each stage. After locating your current position on the roadmap, we suggest the following:

  • Chart your path to maturity
  • Orient and align stakeholders
  • Navigate common pitfalls

The roadmap comprises four stages: Data, Training, Deployment, and Management. The stages build on one another but could also occur concurrently in some instances.

Data: Developing and maintaining secure, clean data
Training: Using structured datasets to train models
Deployment: Feeding applications, pipelining models, or generating reports.
*Models begin to generate value at this stage.*
Management: Continuously tuning models to ensure optimal performance

Pinpointing Your Location on Algorithmia’s Roadmap

At each stage, the roadmap charts three variables to gauge ML maturity: people, tools, and operations. These variables develop further at every stage as an ML program becomes more sophisticated.

For more information about building a sophisticated machine learning program and to use the roadmap, read our whitepaper, The Roadmap to Machine Learning Maturity.

Download Roadmap