“Our machine learning infrastructure is a great big Frankenstein of one-offs,” said one data scientist at our Seattle Roadshow. Heads nodded. Every time his organization needs to integrate with a new system, development teams hardcode dependencies and schedule jobs inside their model deployment system, creating a collection of ad-hoc integrations that runs a very real risk of breaking if either connected system changes.
Machine learning workflows are complex, but connecting all the pieces doesn’t need to be. Data generated by many sources feed many different models. These, in turn, are used by a variety of applications (or even by other models in a pipeline). That’s a lot of moving pieces for any one data scientist to consider.
Loosely Coupled, Event-Driven: The Pub/Sub Pattern
Hardcoding every interaction between a model serving platform and its connected systems is far more work than data scientists should have to handle. Loosely coupled, event-driven architectures, such as the publish/subscribe (pub/sub) pattern, are asynchronous, message-oriented notification patterns commonly found in traditional enterprise software. It’s time they became more common in machine learning.
In a pub/sub model, one system acts as a publisher, sending messages to a message broker, such as Amazon SNS. Through that message broker, subscriber systems explicitly subscribe to a channel, and the broker forwards and verifies delivery of publisher messages, which can then be used by subscribers as event triggers.
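To make the roles concrete, here is a minimal in-memory sketch of the pattern. It is not how Amazon SNS or any production broker works internally; the `InMemoryBroker` class, the `documents.inserted` channel name, and the message shape are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy broker: routes each published message to every subscriber of a channel."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        # Subscribers opt in to a channel explicitly.
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: dict) -> int:
        # Forward the message to each subscriber and count successful deliveries.
        delivered = 0
        for handler in self._subscribers[channel]:
            try:
                handler(message)
                delivered += 1
            except Exception:
                # A failing subscriber must not block delivery to the others.
                continue
        return delivered


broker = InMemoryBroker()
ocr_requests = []  # stands in for a model deployment system
audit_log = []     # stands in for a second, unrelated subscriber

broker.subscribe("documents.inserted", ocr_requests.append)
broker.subscribe("documents.inserted", lambda msg: audit_log.append(msg["key"]))

# The publisher only knows the channel name, not who is listening.
delivered = broker.publish("documents.inserted", {"key": "scans/invoice-001.pdf"})
```

The publisher never references either subscriber, which is the loose coupling the pattern is meant to provide.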
Traditional vs. Pub/Sub Implementation
The following is a simple user story describing a routine process an organization might want to follow as it ingests scanned documents:
As a records manager, when new documents are added to an S3 bucket, I want to run them through an OCR model.
A traditional implementation might look like this:
- Create a document insertion trigger in the database.
- Create a custom, hardcoded function to invoke a model run via API, including delivery verification, failure handling, and retry logic.
- Maintain that function as end-user requirements change.
- Deprecate that function when it is no longer needed.
Over time, as other business apps want to build onto the document insertion event, a fifth step will be added:
- Repeat steps 2 through 4 for every other application of the trigger.
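To give a sense of what the hardcoded function in step 2 involves, here is a rough sketch of the retry and failure handling alone. The `invoke_with_retry` helper and the `flaky_ocr` stand-in are hypothetical; a real implementation would call your model's API and would also need delivery verification.

```python
import time
from typing import Callable


def invoke_with_retry(call_model: Callable[[str], dict], document_key: str,
                      max_attempts: int = 3, base_delay: float = 0.5) -> dict:
    """Call the model, retrying on connection failures with exponential backoff."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(document_key)
        except ConnectionError as err:
            last_error = err
            if attempt < max_attempts:
                # Wait base_delay, then 2x, then 4x, and so on.
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"model call failed after {max_attempts} attempts") from last_error


# Hypothetical stand-in for the OCR model API: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_ocr(document_key: str) -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("model endpoint unavailable")
    return {"document": document_key, "status": "ok"}


# base_delay=0.0 keeps the example instant; use a real delay in practice.
result = invoke_with_retry(flaky_ocr, "scans/invoice-001.pdf", base_delay=0.0)
```

Every downstream application that hooks into the trigger needs its own copy of this kind of logic, which is where the maintenance burden comes from.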
Pros of traditional implementation:
- Near real-time communication
- No scheduling or communication burden on ML deployment system
Cons of traditional implementation:
- Database development team becomes a service organization.
- Database development team must have knowledge of all downstream use cases involving document ingestion.
- Architectural changes on either side of the exchange could break existing integrations.
The pub/sub implementation might look like this:
- Create a document insertion trigger in the database.
- When the trigger fires, send an event notification to a message broker.
At that point, the owners of the model deployment system would do the following:
- Subscribe to the published event feed.
- Build a function to invoke a model run via API.
Any other systems wishing to use that event as a trigger would follow the same process independently.
Pros of pub/sub implementation:
- Near real-time communication
- No scheduling or communication burden on ML deployment systems
- Database development team can focus on database development
- Downstream development teams can build without dependencies
- Architectural changes far less likely to break existing integrations
The pub/sub pattern’s loose coupling is highly scalable, centralizing the burden of communication and removing it from dependent apps. It is also extremely flexible. By abstracting communications between publishers and subscribers, each side operates independently. This reduces the overhead of integrating any number of systems and allows publishers and subscribers to be swapped at any time, with no impact on performance.
- Flexible, durable integrations: Upgrades to component systems won’t break communications workflows. As long as a publishing system continues to send messages to the appropriate queue, the receiving systems shouldn’t care how it generates those messages.
- Developer independence: Decoupling publishers and subscribers means teams can iterate at their own paces and ship updates to components without introducing breaking changes.
- Increased performance: Removing messaging from the application allows Ops to dedicate infrastructure to queueing, removing the processing burden from system components.
- Modularity: Because components integrate through an intermediary message queue, any component can be added or swapped at any time without impacting the system.
- Scalability: Since any number of applications can subscribe to a publisher’s feed, multiple systems can react to the same events.
Using Pub/Sub With Algorithmia’s Event Listeners
The Enterprise AI Layer provides configurable event listeners so users can trigger actions based on input from pub/sub systems. In concert with the AI Layer’s automatic API generation, integrated versioning, and multi-language SDKs, this means your ML infrastructure is able to grow. Systems can trigger any version of any model, written in any language, trained with any framework. That’s critical as an ML program grows in size and scope and touches every part of a business.
The first of our supported event sources is Amazon SQS, with more to come. Read the Event Listeners tutorial in our Developer Center, give it a try, and let us know what you think!
We spend a lot of time focused on giving data scientists the best experience for deploying their machine learning models. We think they should not only use the best tools for the job, they should also be able to integrate their work easily with other tools. Today we’ll highlight one such integration: Jupyter Notebook.
When we work in Jupyter Notebook, an open-source tool for creating and sharing documents that contain live code snippets, visualizations, and markdown text, we are reminded how easy it is to use our API to deploy a machine learning model from within a notebook.
About Jupyter Notebook
These notebooks are popular among data scientists and are used both internally and externally to share information about data exploration, model training, and model deployment. Jupyter Notebook supports running both Python and R, two of the most common languages used by data scientists today.
How We Use It
We don’t think you should be limited to creating algorithms/models solely through our UI. Instead, to give data scientists a better and more comprehensive experience, we built an API endpoint that gives more control over your algorithms. You can create, update, and publish algorithms using our Python client, which lets data scientists deploy their models directly from their existing Jupyter Notebook workflows.
We put together the following walkthrough to help guide users through the process to deploy from within a notebook. The first few steps are shown below:
In this example, after setting up credentials, we download and prep a dataset and build and deploy a text classifier model. You can see the full example notebook here. And for more information about our API, please visit our guide.
More of a Visual Learner?
Watch this short demo that goes through the full workflow.
We are extremely thankful for our customers and their trust in us, and we want to share this news with them first and foremost. This funding enables us to double-down on developing the infrastructure to scale and accelerate their machine learning efforts.
We are also thrilled to welcome Norwest Venture Partners (NVP) to Algorithmia and are honored by the continued support of Madrona Venture Group, Gradient Ventures, Work-Bench, Osage University Partners, and Rakuten Capital. Rama Sekhar, a partner at NVP, will be joining Algorithmia’s Board of Directors.
Their support means two things:
- Our customers have communicated to our investors that they are getting the personalized support they need to manage their machine learning life cycles. They feel confident not only in the product decisions we’ve made to date but also in the delivery of future product features.
- This funding allows us to continue maintaining our market leadership position.
We have proven as a team that we can enable humanitarian efforts of the United Nations, individual developers, and the largest companies in the world to adopt machine learning.
More importantly, our customers have given us the opportunity to truly deliver on our potential—and for that, we are immensely thankful. If you use Algorithmia, we are especially thankful for your support.
We are laser-focused on our customers
As we pioneer the field of machine learning operationalization, we know that our users are also pioneering their fields—we empathize with the challenges they face and are laser-focused on their success.
Our customers choose us because we listen to them, build the features they need to solve real problems, and offer responsive onboarding and support. Our customers are key to our success, so we will now push even harder to help, listen to, and learn from them.
Moving forward, we will accelerate our engineering efforts and maintain our lead as the best platform for managing the machine learning life cycle.
Finally, funding is never a goal—it’s simply fuel to ensure we deliver on our mission to provide everyone the tools they need to use machine learning at any scale. If the purpose of tools is to extend human potential, then machine learning is poised to become one of the most powerful tools ever created. The AI Layer is a tool that makes machine learning available to everyone.
Feeling thankful and energized,
P.S. A small ask—please share the news with your network on LinkedIn to help us spread the word.
Data science in production is still in its infancy. Everything is new and nearly everyone involved is doing this for the first time. Every quarter, we talk with more than a hundred teams working to operationalize their machine learning and we see a clear pathway out of this difficult stage that many companies cannot get past: operationalizing the deployment process.
Massive effort goes into any well-trained model—research shows that less than 25 percent of a data scientist’s time is spent on training models; the rest is spent collecting and engineering data and conducting DIY DevOps.
Collecting and cleaning data is a fairly easy-to-understand problem with a maturing ecosystem of best practices, products, and training. As such, let’s focus on the problem of DIY DevOps.
DIY DevOps massively increases the risk of product failure.
We often hear from data scientists that they tried to deploy a model themselves, got stuck as they attempted to fit it into existing architecture, and found out that building machine learning infrastructure is a much harder endeavor than they imagined.
A common story arc we hear:
- A project starts with a clear problem to solve but maybe not a clear path forward
- Collecting and cleaning the data is difficult but doable
- Everyone realizes that deploying and iterating will yield real data and improve the model
- Data scientists are tasked with deploying models for the project themselves
- Data scientists get spread thin as they work on tasks outside their areas of strength
- Months later it becomes clear that DevOps needs to be called in
- Deploying a model gets tasked to an engineer
- The engineer realizes that the model was created in Python when the application that needs to use it is in Java
- The engineer heads over to Stack Overflow and decides the best path forward is to rewrite the model in Java
- The model gets deployed weeks or months later
- Down the line, a new version of the original Python model is trained
Companies that are serious about machine learning need an abstraction layer in their machine learning operationalization infrastructure that allows data scientists to deploy and iterate their own models in any language, using any framework.
We think about this abstraction layer as being an operating system for machine learning. But, what does it look like to help data scientists deploy and iterate their own models? It looks like using whatever tools are best for a given project.
What do you need in your infrastructure?
Models aren’t static. As data science practices mature, there is a constant need to monitor model drift and retrain models as more data is made available. You’ll need a versioning system that allows you to continuously iterate on models without breaking or disrupting applications that are using older versions.
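As a rough illustration of that requirement, the sketch below keeps every published version of a model callable, so an application pinned to an older version keeps working after a new one ships. The `ModelRegistry` class and its methods are hypothetical and are not a description of any particular platform's versioning system.

```python
from typing import Callable, Dict, Tuple

Version = Tuple[int, int, int]
Model = Callable[[float], float]


class ModelRegistry:
    """Toy registry: published versions are immutable and stay resolvable."""

    def __init__(self) -> None:
        self._versions: Dict[Version, Model] = {}

    @staticmethod
    def _parse(version: str) -> Version:
        # "1.10.0" -> (1, 10, 0), so versions compare numerically, not as text.
        major, minor, patch = (int(part) for part in version.split("."))
        return (major, minor, patch)

    def publish(self, version: str, model: Model) -> None:
        key = self._parse(version)
        if key in self._versions:
            raise ValueError(f"version {version} is already published")
        self._versions[key] = model

    def resolve(self, version: str = "latest") -> Model:
        if version == "latest":
            return self._versions[max(self._versions)]
        return self._versions[self._parse(version)]


registry = ModelRegistry()
registry.publish("1.0.0", lambda x: 2.0 * x)        # original model
registry.publish("1.1.0", lambda x: 2.0 * x + 1.0)  # retrained model

pinned = registry.resolve("1.0.0")(3.0)  # an older application keeps its behavior
newest = registry.resolve()(3.0)         # new callers pick up the retrained model
```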
ML for DevOps
Traditional software development has a well-understood DevOps process: developers commit code to a versioned repository, then a Continuous Integration (CI) process builds, tests, and validates the most recent master branch. If everything meets deployment criteria, a Continuous Delivery or Continuous Deployment (CD) pipeline releases the latest valid version of the product to customers. As a result, developers can focus on writing code, release engineers can build and govern the software delivery process, and end users can receive the most functional code available, all within a system that can be audited and rolled back at any time.
The machine learning life cycle can, and should, borrow heavily from the traditional CI/CD model, but it’s important to account for several key differences, including:
- Enterprise and commercial software development is far more likely to have many developers iterating concurrently on a single piece of code. In ML development, there may be far fewer data scientists iterating on individual models, but those models are likely to be pipelined together in myriad ways.
- Traditional application development often leverages a single framework and one, or possibly two, languages. For example, many enterprise applications are written in C#, using the .NET framework, with small bits of highly performant code written in C++. ML models, on the other hand, are typically written using a wide range of languages and frameworks (R, Scala, Python, Java, etc.), so each model can have a unique set of dependencies.
- Containerizing these models with all necessary dependencies while still optimizing start times and performance is a far more complex task than compiling a monolithic app built in a single framework.
The infrastructure for deploying and managing a machine learning portfolio is crucial for data scientists to do their best work. Though it’s a relatively new field, good practices exist for moving data science forward efficiently and effectively.
For all the time that it takes to clean your data, train and test your model, tune your model hyperparameters, and then re-train and test your model again, you’re not done until it’s deployed into production. After all, what good is a model that’s just sitting on your laptop?
In this tutorial, you’ll learn how to write a simple algorithm using a pre-trained Scikit-learn model we’ve provided. This way, once you’ve gotten your own Scikit-learn model’s accuracy in an acceptable range, you’ll be able to easily deploy it into production.
The tutorial will walk through how to:
- create an algorithm,
- host data in Algorithmia’s data collections,
- deploy a random forest regression model trained on a Boston housing prices dataset in order to predict the price of Boston houses that the model has not seen before.
Note that for any model you deploy on Algorithmia, you will need to train and save the serialized model. For Scikit-learn models, you can use the Python Pickle library to serialize, and later deserialize, your model file.
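If you haven’t used Pickle before, the round trip looks roughly like this. To keep the sketch free of third-party dependencies, a plain dictionary of made-up parameters stands in for the trained model, and the `model.pkl` file name is an assumption; a fitted Scikit-learn estimator is passed to `pickle.dump` and restored with `pickle.load` in the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted estimator (hypothetical values); a real Scikit-learn
# model object would be serialized with the same two calls.
trained_model = {"coefficients": [0.5, -1.25], "intercept": 3.0}

model_path = Path(tempfile.mkdtemp()) / "model.pkl"

# Serialize: write the model object to disk in binary mode.
with open(model_path, "wb") as f:
    pickle.dump(trained_model, f)

# Deserialize: read it back, as your algorithm will do at load time.
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)
```

Keep in mind that unpickling generally requires library versions compatible with the ones used to create the file, and that you should only unpickle files you trust.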
Before you get started, you’ll need these files in the GitHub repository.
Next, you’ll need to upload the .csv file and the pre-trained pickled Scikit-learn model to a data collection.
If you’ve never created a data collection before, check out the docs for hosting your data on Algorithmia. Go ahead and create a data collection. Below, you’ll notice that we named our data collection “demo_files,” but you can name yours as you like.
Take note of the path created that starts with “data://” and shows your username and the data collection name along with your file name.
You’ll want to use this path in your algorithm to point to your own data and model path, so we recommend keeping this data collection page open and also opening a new tab where you can create your algorithm. This way you can easily copy and paste the paths from your data collections when you’re ready to add them to the demo code sample.
Next, click the “Plus” icon in the navigation pane and create your Scikit-learn algorithm, naming it as you like. If you’ve never created an algorithm before and want to understand what the permissions mean, check out our docs on Getting Started. For this example, choose “Python 3.x” as your language so it runs well. The rest of the permissions and the execution environment can stay at their default settings.
Once you create your algorithm, you’ll be able to edit it either through the CLI tools or the Web IDE. It’s your choice how you want to interact with your algorithm, but this tutorial will demonstrate working in the Web IDE:
Before we touch our code, we need to add the required dependencies. Click “Dependencies,” found right above your source code. This opens a modal that is essentially a requirements.txt file, which pulls the stated libraries from PyPI. If you state the package name without a version number, you’ll automatically get the latest version we support; otherwise, state the needed version number or range.
In this example, we have an older Scikit-learn model, so we need a range of versions for our model to work. Go ahead and add these to the libraries already in your dependency file:
So your whole dependency file will look like this:
You’ll want to remove the boilerplate code that exists in your newly created algorithm and copy/paste the code from the “demo.py” file found in the GitHub repository for this project.
Here is what you’ll see when you paste in the code from demo.py:
Notice in the first few lines of our script that we are importing the Python packages required by our algorithm. Then on line 7, we are creating the variable “client” in global scope to use throughout our algorithm. This will enable us to access our data in data collections via the Data API.
On line 12, inside the “load_model()” function, you’ll want to replace that string with the path from your data collections for the pickled model file.
Then, notice on line 13 that we are passing in that data collection path to the data API using:
And then we use the Pickle library to open the file.
Notice that line 20 is where we call the function “load_model().” This is important because you’ll always want to load the model outside of the “apply()” function. This is so the model file only gets loaded into memory during the initial call within that session, so while the first call to your algorithm might take a bit of time depending on the size of your model, subsequent calls will be much faster. If you were to load your model inside the “apply()” function, then the model would get loaded with each call of your algorithm.
Also, if you are tempted to add your model file as a Python module, and then import that module into your algorithm file, this will result in a loss of performance, and we don’t recommend it.
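The load-once pattern described above can be sketched without any platform code. Here `load_model()` is a stand-in that returns a trivial function instead of downloading and unpickling a file, and the `load_calls` counter exists only to show how often loading happens; neither is taken from demo.py.

```python
load_calls = 0


def load_model():
    """Stand-in for fetching and unpickling the model file (the expensive step)."""
    global load_calls
    load_calls += 1
    # Hypothetical "model": predicts the sum of each row's features.
    return lambda rows: [sum(row) for row in rows]


# Module scope: this runs once, when the algorithm is first loaded.
model = load_model()


def apply(input):
    # Every call reuses the model that is already in memory.
    return model(input)


first = apply([[1.0, 2.0]])
second = apply([[3.0, 4.0]])
```

After two calls to `apply()`, `load_calls` is still 1. Moving the `load_model()` call inside `apply()` would increment it on every request.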
The next function, called “process_input(),” simply turns the .csv file into a numpy array. We call that function within the “apply()” function, where we pass in the user-provided “input.”
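As a dependency-free approximation of that step, the sketch below parses CSV text into a list of rows of floats using only the standard library. It is not the function from demo.py, which produces a numpy array; it also assumes the file has no header row, and the sample values are made up.

```python
import csv
import io


def process_input(csv_text: str):
    """Parse CSV text into a list of rows, each a list of floats."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip blank lines; every remaining cell is assumed to be numeric.
    return [[float(value) for value in row] for row in reader if row]


rows = process_input("0.02,18.0,2.31\n0.03,0.0,7.07\n")
```

With numpy available, the resulting nested list could be converted with `numpy.array(rows)` before being handed to the model.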
The “input” argument is the data or any other input from the user that gets passed into your algorithm. It’s important to support inputs from multiple data sources, with exception handling, covering not only data collections as shown here but also data files hosted in S3, Azure Blobs, or other data sources for which we have data connectors. For a great example of handling multiple types of files, or if you want to see a PyTorch algorithm in action, check out the Open Anomaly Detection algorithm in the Algorithmia Marketplace.
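One way to approach that kind of input checking is sketched below. Everything in it is an assumption made for illustration: the `resolve_input` helper, the set of accepted URI schemes, the `"path"` key for object inputs, and the example paths. Adjust them to whatever your algorithm and data connectors actually accept.

```python
from urllib.parse import urlparse

# Hypothetical set of data sources this algorithm agrees to read from.
SUPPORTED_SCHEMES = {"data", "s3"}


def resolve_input(input):
    """Validate the user's input and return (scheme, path), or raise ValueError."""
    if isinstance(input, dict):
        # Hypothetical convention: object inputs carry the location under "path".
        input = input.get("path")
    if not isinstance(input, str):
        raise ValueError("input must be a path string or an object with a 'path' field")
    scheme = urlparse(input).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported data source: {scheme or 'none'}")
    return scheme, input


from_collection = resolve_input("data://YOUR_USERNAME/demo_files/housing.csv")
from_bucket = resolve_input({"path": "s3://example-bucket/housing.csv"})
```

Raising a clear error for anything unexpected gives callers a readable message instead of a stack trace from deep inside the model code.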
Notice that we are returning the predicted data as output from our Scikit-learn model in the apply() function.
Now click the “Build” button on the top right of the web IDE. This will commit our code to a git repository. Every algorithm is backed by a git repository and as you are developing your algorithm, whenever you hit “Build,” that will commit your code and return a hash version of your algorithm, which you’ll see in the Algorithmia console.
You can use that hash version to call your algorithm locally using one of the language clients for testing purposes while you work on perfecting your algorithm.
Note that you’ll get a semantic version number once you publish your algorithm.
Now we are ready to test our algorithm. Go back to the data collections page where we got our model path, and copy/paste the path for our .csv file into the Algorithmia console (wrapping it in quotes so it’s a proper JSON formatted string) and hit return/enter:
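If you are unsure what a properly quoted JSON string looks like, Python’s `json` module can produce it for you. The path below is a placeholder, not a real file; substitute the one from your own data collection.

```python
import json

# Placeholder path: replace with the path copied from your data collection.
csv_path = "data://YOUR_USERNAME/demo_files/housing.csv"

# json.dumps wraps the string in double quotes and escapes it as JSON requires.
console_input = json.dumps(csv_path)
print(console_input)
```

The printed value, including its surrounding double quotes, is what you would paste into the console.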
You’ll need to hit the “Publish” button now that you’re happy with the results. The publishing workflow will give you options for adding sample code (like the .csv data we tested on) so users of your algorithm can test out its inputs and outputs. You’ll also be able to choose if you want a backwards compatible version of your algorithm or if you’re making breaking changes to it. For more information on the publishing steps, check out our docs on Getting Started.
Once you publish, you can find the runnable code on your algorithm’s description page:
That’s it for deploying your Scikit-learn model into production!