Data science in production is still in its infancy. Everything is new, and nearly everyone involved is doing it for the first time. Every quarter we talk with more than a hundred teams working to operationalize their machine learning, and we see a clear path out of the difficult stage that many companies cannot get past: operationalizing the deployment process.
Massive effort goes into any well-trained model: research shows that less than 25 percent of a data scientist’s time is spent actually training models, while the rest goes to collecting and engineering data and to DIY DevOps.
Collecting and cleaning data is a fairly easy-to-understand problem with a maturing ecosystem of best practices, products, and training. As such, let’s focus on the problem of DIY DevOps.
DIY DevOps massively increases the risk of product failure.
We often hear from data scientists that they tried to deploy a model themselves, got stuck as they attempted to fit it into existing architecture, and found out that building machine learning infrastructure is a much harder endeavor than they imagined.
A common story arc we hear:
- A project starts with a clear problem to solve but maybe not a clear path forward
- Collecting and cleaning the data is difficult but doable
- Everyone realizes that deploying and iterating will yield real data and improve the model
- Data scientists are tasked with deploying models for the project themselves
- Data scientists get spread thin as they work on tasks outside their areas of strength
- Months later it becomes clear that DevOps needs to be called in
- Deploying a model gets tasked to an engineer
- The engineer realizes that the model was created in Python when the application that needs to use it is in Java
- The engineer heads over to Stack Overflow and decides the best path forward is to rewrite the model in Java
- The model gets deployed weeks or months later
- Down the line, a new version of the original Python model is trained, and the rewrite process starts all over again
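The rewrite step in the arc above is usually avoidable. Instead of porting the model to Java, the Python model can be wrapped in a language-agnostic interface, such as a small HTTP service that any application can call. Here is a minimal sketch using only Python's standard library; the `predict` function is a hypothetical stand-in for a real trained model loaded from disk:

```python
# Minimal sketch: expose a Python model over HTTP so a Java (or any
# other) application can call it instead of reimplementing it.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in for a real model (e.g. one loaded with pickle/joblib).
    # Here: the mean of the input features, as a dummy "score".
    return {"score": sum(features) / max(len(features), 1)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body: {"features": [...]}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve (blocks the process):
# HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

The Java application then makes an ordinary POST request to the service, and the model stays in the language and framework it was trained in.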
Companies that are serious about machine learning need an abstraction layer in their machine learning operationalization infrastructure that allows data scientists to deploy and iterate their own models in any language, using any framework.
We think of this abstraction layer as an operating system for machine learning. But what does it look like to help data scientists deploy and iterate on their own models? It looks like letting them use whatever tools are best for a given project.
What do you need in your infrastructure?
Models aren’t static. As data science practices mature, there is a constant need to monitor for model drift and to retrain models as more data becomes available. You’ll need a versioning system that allows you to continuously iterate on models without breaking or disrupting applications that are using older versions.
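The idea can be sketched as a simple in-memory model registry (a hypothetical illustration, not a production design): publishing a new version never removes old ones, so applications pinned to a prior version keep working while new callers pick up the latest.

```python
# Minimal sketch of model versioning: old versions remain available
# so existing applications are never broken by a new release.
class ModelRegistry:
    def __init__(self):
        self._versions = {}  # model name -> {version number: model}

    def publish(self, name, model):
        versions = self._versions.setdefault(name, {})
        version = len(versions) + 1  # monotonically increasing versions
        versions[version] = model
        return version

    def get(self, name, version=None):
        versions = self._versions[name]
        if version is None:          # default to the latest version
            version = max(versions)
        return versions[version]

registry = ModelRegistry()
registry.publish("churn", lambda x: x * 2)  # v1 (dummy stand-in model)
registry.publish("churn", lambda x: x * 3)  # v2: retrained model

latest = registry.get("churn")      # new callers get v2
pinned = registry.get("churn", 1)   # existing apps stay on v1
```

A real registry would persist artifacts and metadata rather than hold callables in memory, but the contract is the same: publishing is additive, and consumers choose between "latest" and a pinned version.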
DevOps for ML
Traditional software development has a well-understood DevOps process: developers commit code to a versioned repository, then a Continuous Integration (CI) process builds, tests, and validates the most recent master branch. If everything meets deployment criteria, a Continuous Delivery or Continuous Deployment (CD) pipeline releases the latest valid version of the product to customers. As a result, developers can focus on writing code, release engineers can build and govern the software delivery process, and end users can receive the most functional code available, all within a system that can be audited and rolled back at any time.
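The "meets deployment criteria" step translates naturally to ML as an automated gate in the pipeline. A minimal sketch, assuming hypothetical evaluation metrics: a candidate model is promoted only if it clears an absolute quality bar and does not regress meaningfully against the model currently in production.

```python
# Minimal sketch of a CI-style deployment gate for a model. The metric
# names and thresholds are illustrative assumptions, not a standard.
def meets_deployment_criteria(candidate, production,
                              min_accuracy=0.90, max_regression=0.01):
    if candidate["accuracy"] < min_accuracy:
        return False  # fails the absolute quality bar
    if production["accuracy"] - candidate["accuracy"] > max_regression:
        return False  # regresses too far from what users have today
    return True       # safe to promote to the CD stage

# Candidate clears the bar and is within tolerance of production.
ok = meets_deployment_criteria({"accuracy": 0.93}, {"accuracy": 0.935})
```

Just as CI blocks a broken build from shipping, this gate blocks a degraded model from reaching users, and the decision is recorded and auditable like any other pipeline step.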
The machine learning life cycle can, and should, borrow heavily from the traditional CI/CD model, but it’s important to account for several key differences, including:
Enterprise and commercial software development is far more likely to have many developers iterating concurrently on a single piece of code. In ML development, there may be far fewer data scientists iterating on individual models, but those models are likely to be pipelined together in myriad ways.
Traditional application development often leverages a single framework and one, or possibly two, languages. For example, many enterprise applications are written in C#, using the .NET framework, with small bits of highly performant code written in C++. ML models, on the other hand, are typically written using a wide range of languages and frameworks (R, Scala, Python, Java, etc.), so each model can have a unique set of dependencies.
Containerizing these models with all necessary dependencies while still optimizing start times and performance is a far more complex task than compiling a monolithic app built in a single framework.
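One way to manage this per-model dependency sprawl is to treat each model's container spec as generated output. The sketch below is a hypothetical illustration: each model declares its own runtime and pinned dependencies, and a Dockerfile is produced per model so one model's dependencies can never conflict with another's (the `serve.py` entry point and artifact names are assumptions).

```python
# Minimal sketch: generate a per-model Dockerfile from a dependency
# manifest, so every model ships with exactly its own environment.
def dockerfile_for(model):
    lines = [f"FROM python:{model['python']}-slim"]
    for dep in model["dependencies"]:
        # One RUN per dependency keeps layers cache-friendly when
        # only part of the dependency list changes.
        lines.append(f"RUN pip install {dep}")
    lines.append(f"COPY {model['artifact']} /app/model.pkl")
    lines.append('CMD ["python", "/app/serve.py"]')
    return "\n".join(lines)

churn_model = {
    "python": "3.11",
    "dependencies": ["scikit-learn==1.4.2", "pandas==2.2.2"],
    "artifact": "churn_v2.pkl",
}
print(dockerfile_for(churn_model))
```

Pinning exact versions in the manifest is what makes the resulting image reproducible; the harder production problems (image size, cold-start time, GPU base images) sit on top of this basic pattern.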
The infrastructure for deploying and managing a machine learning portfolio is crucial for data scientists to do their best work. Though it’s a relatively new field, good practices exist for moving data science forward efficiently and effectively.