
Lessons Learned from Deploying Deep Learning at Scale

Deep learning is a machine learning technique used to solve complex problems related to image recognition, natural language processing, and more. It requires both massive amounts of data and substantial computational power.

The compute problem has largely been solved by GPUs, which have become essential for making deep learning performant and have reduced the time required to train models from weeks to days. Most training workloads can be handled by building your own GPU box.

But how does an organization make its trained deep learning model available as part of an app or web service? For instance, say you have a deep learning model that can identify places in images. Your user provides a photo, and your model tells them what’s in it (e.g. bridge, cars, water).

One option is to move your model to the cloud as a scalable web service.

However, deploying deep learning models in the cloud can be challenging due to complex hardware requirements and software dependencies. Plus, GPU cloud computing is not yet mainstream: although Amazon, Google, and Microsoft have all announced cloud GPUs, getting one running is still not as easy as spinning up an EC2 instance or DigitalOcean Droplet.

We have a unique perspective on using not just one, but five different deep learning frameworks in the cloud. Since users depend on Algorithmia to host and scale their algorithms, we’ve worked out most of the idiosyncrasies with the most popular deep learning frameworks out there.

Below are the slides from our presentation at the 2016 O’Reilly Artificial Intelligence conference. They cover the pros and cons of popular frameworks like TensorFlow, Caffe, Torch, and Theano; why you would choose one framework over another; and, more importantly, how to reliably deploy a deep learning framework at scale once you’ve picked one and trained a model to solve your problem.

Here’s what we’ve learned from deploying deep learning at scale, and from fighting these battles so that you don’t have to.

Algorithmia's mission is to make state-of-the-art algorithms accessible and discoverable by everyone

We’ve created an open marketplace for algorithms and algorithm development, making state-of-the-art algorithms accessible and discoverable by everyone.

On Algorithmia, algorithms run as containerized microservices that act as the building blocks of algorithmic intelligence developers can use to create, share, and remix at scale.

Because algorithms on the platform are composable, interoperable, and portable, they can be written in any supported language and then made available to application developers, where the code is always “on” and available via a simple REST API.
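
For example, here is a minimal sketch of what calling a hosted model looks like with Algorithmia’s Python client; the algorithm path and input format are illustrative placeholders, not a specific published algorithm.

```python
# A minimal sketch of calling a hosted algorithm through Algorithmia's
# Python client (pip install algorithmia). The algorithm path and input
# below are illustrative placeholders.
import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")

# Every algorithm on the platform is addressed as user/algoname/version
# and exposed behind the same REST endpoint shape.
algo = client.algo("deeplearning/PlacesClassifier/1.0.0")  # hypothetical path

result = algo.pipe({"image": "https://example.com/photo.jpg"}).result
print(result)  # e.g. [{"class": "bridge", "confidence": 0.93}, ...]
```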


Colorizing black and white photos using deep learning

For context, in the months leading up to this presentation, we had been preparing the Algorithmia platform to support deep learning. In July 2016, we launched our solution for hosting and distributing trained deep learning models using GPUs in the cloud. To demonstrate, we created demos to automatically colorize black and white photos, and classify places and locations in images.


Cloud hosting of machine learning models is challenging

We speak from experience about the challenges faced when going beyond simple demos and trying to use deep learning in a scaled, production environment accessed by hundreds of thousands of people.


What is deep learning?

At its core, deep learning is just trying to replicate how the brain functions using machines. 

Think of deep learning as the technique for learning in neural networks that utilizes multiple layers of abstraction to solve pattern recognition problems. Our brains are able to learn the structure of things with little to no guidance.
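
To make “multiple layers” concrete, here is a toy forward pass through a two-layer network in NumPy; the shapes are arbitrary and the weights are random, whereas a real network would learn them from data.

```python
# A toy illustration of "multiple layers of abstraction": a two-layer
# neural network forward pass in plain NumPy. Weights here are random;
# real networks learn them from data.
import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 784))     # e.g. a flattened 28x28 image

W1 = rng.normal(size=(784, 128))  # first layer: low-level features
W2 = rng.normal(size=(128, 10))   # second layer: class scores

hidden = relu(x @ W1)             # each layer builds on the one below
scores = hidden @ W2
print(scores.shape)               # (1, 10)
```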

Need an introduction to deep learning?


Applications of deep learning on unstructured text

Over the past half-decade, we have generated and stored unstructured data at a whopping pace. Just think of the amount of unstructured metadata your mobile phone creates in a single day: SMS texts, cell towers connected to, GPS, emails, file types and sizes, use and inactivity, wifi connected to, and a lot more. For more, here’s how your phone tracks your every move.


Commercial uses of deep learning include computer vision and natural language processing

To showcase some of these applications we’ve added several open source deep learning models to the platform for you to try.


Why is deep learning popular right now?

Three major shifts have driven the craze behind deep learning: a) advances in research, b) advances in hardware, and c) availability of this hardware at scale.


Hardware needed for deep learning

For the most part, there has been only one game in town. NVIDIA’s efforts with CUDA and cuDNN have been included in every major deep learning framework.


GPUs and ASICs

There are still new advances coming down the line, like new GPUs tailored for deep learning (e.g. sacrificing precision in exchange for faster calculations and reduced memory usage), as well as custom hardware. This is similar to the progression we saw with Bitcoin mining from CPUs, to GPUs, to ASICs.


GPU dependencies

This is the general stack required for doing deep learning. NVIDIA’s success in this space can be attributed to having owned everything from the GPU chip up to cuDNN.

A few notes:

  • NVIDIA drives how the kernel communicates with the GPU.
  • CUDA enables parallel computing over the GPU.
  • cuDNN is a GPU accelerated library of primitives for deep neural nets.
  • “Meta deep learning frameworks” are attempting to make deep learning even more accessible by abstracting even more. The prime example of this is Keras (see the sketch below).
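
Since Keras is the prime meta-framework example, here is a minimal sketch of what that abstraction looks like; the layer sizes are arbitrary, and the same model definition runs on either the TensorFlow or Theano backend.

```python
# A minimal Keras model definition. Keras abstracts the backend
# (TensorFlow or Theano at the time of this talk), so this same code
# runs on either; the layer sizes are arbitrary.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation="relu", input_dim=100))  # hidden layer
model.add(Dense(10, activation="softmax"))              # class scores

model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()
```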

Top deep learning frameworks

There are a number of deep learning frameworks that support different languages. Next, we’ll explain the top frameworks.


Theano


Torch


Caffe deep learning framework


CNTK by Microsoft


TensorFlow Deep Learning by Google


Machine learning networks for training

There is a whole area of research into the design of the depth and shape of neural nets. Practitioners rarely start from scratch; instead, they start with designs that have worked for similar tasks in the past and evolve them from there.


Two phases of deep learning are training and running


How do you scale inference in neural nets to hundreds, thousands, or millions of users as an API?
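
As a concrete illustration, here is a minimal sketch of one way to put inference behind a REST endpoint using Flask. This is not how Algorithmia implements its service; DummyModel is a hypothetical stand-in for real framework-specific loading and prediction code.

```python
# A minimal sketch of exposing model inference as a REST endpoint
# with Flask. DummyModel is a hypothetical stand-in for a real
# framework model (Caffe, TensorFlow, ...).
from flask import Flask, request, jsonify

class DummyModel:
    """Stand-in for real framework-specific inference code."""
    def predict(self, image_url):
        return [{"class": "bridge", "confidence": 0.93}]

app = Flask(__name__)

# Load the model once at startup, not per request: pulling hundreds
# of megabytes of weights into GPU memory is far too slow per call.
model = DummyModel()

@app.route("/classify", methods=["POST"])
def classify():
    image_url = request.json["image"]
    return jsonify({"labels": model.predict(image_url)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```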


Machine learning in the cloud

Machine learning is generally very computationally intensive. Using an elastic compute layer that can be scaled up and down is perfect for ML tasks.


Microservice architecture

Deep learning requires GPUs and is very computationally intensive, so it will typically reside in its own service, separate from user-facing web servers. Dedicated hardware ensures that you don’t starve other parts of your SOA of resources. And if it’s going to be a separate service anyway, why not outsource it?


Deep learning in the cloud

There were a number of challenges we hit when getting deep learning to work at scale. The first was picking the right provider: at the time of this talk there were not many options, though this is starting to change.


Programming bindings for deep learning

For the most part, only three languages are represented in the world of deep learning bindings: Python, Lua, and C++. Yet applications in the real world use many more languages than these. A service architecture allows you to abstract away the language behind REST endpoints.
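
Once the model sits behind a REST endpoint, any language with an HTTP client can use it. Here is a hedged sketch in Python, assuming the hypothetical /classify service outlined earlier; the hostname and payload are placeholders.

```python
# Calling a model behind a REST endpoint. The same request works from
# Java, Ruby, JavaScript, or anything else with an HTTP client; the
# URL and payload match the hypothetical service sketched above.
import requests

resp = requests.post(
    "http://inference.internal:8000/classify",  # placeholder host
    json={"image": "https://example.com/photo.jpg"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["labels"])
```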


Deep learning models are getting larger

As neural nets get deeper, the models have become larger and larger. Every practitioner in the DL space has at some point hit the problem of running out of GPU memory. Memory management tooling for GPUs is virtually non-existent. We can solve this in two ways: more and larger hardware, or smaller models.


Memory per deep learning model

As you can see, the deeper the neural net, the larger the memory footprint. Sadly, larger memory footprints are correlated with lower error rates.
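
As a rough illustration, a model’s weight footprint can be estimated as parameter count times bytes per parameter; the sketch below uses commonly cited parameter counts for AlexNet and VGG-16.

```python
# Back-of-the-envelope model memory: parameter count times bytes per
# parameter. Parameter counts are the commonly cited figures for each
# architecture.
PARAMS = {
    "AlexNet": 61_000_000,
    "VGG-16": 138_000_000,
}

BYTES_PER_PARAM = 4  # float32

for name, n in PARAMS.items():
    mb = n * BYTES_PER_PARAM / 1024 ** 2
    print(f"{name}: ~{mb:.0f} MB of weights")
# AlexNet: ~233 MB, VGG-16: ~526 MB -- and that is before activations,
# which can dominate GPU memory at large batch sizes.
```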


SqueezeNet

There has been some promising research into reducing the size of models. SqueezeNet, for example, reports AlexNet-level accuracy on ImageNet with roughly 50x fewer parameters. There is still an error-rate trade-off.


Machine learning model compression


Sharing GPUs in the cloud

We have been sharing CPUs for decades, ever since the early mainframes. For GPUs, that has not been the case. The tooling and knowledge are still very young, and dealing with memory overflows is extremely common.


Using containers to share GPUs in the cloud

Containers have provided a great way for managing cloud services and applications, but the Docker abstraction breaks down when trying to interface with GPU hardware. If you’re using containers over GPUs for deep learning in production, you are a pioneer on the frontier of this field with little literature or community support to rely on.


Lessons learned from deploying deep learning models as microservices


To recap the presentation: training deep learning models is only half the problem. Deploying them in the cloud is the next step, but the skill set required to put models into production is entirely different. While tools and frameworks are improving every day and making this easier, we still have a long way to go.

Let us know what you think @Algorithmia or @Platypii.

Kenny Daniel is founder and CTO of Algorithmia. He came up with the idea for Algorithmia while working on his PhD and seeing the plethora of algorithms that never saw the light of day. Kenny’s goal with Algorithmia is to accelerate AI development by creating a marketplace where algorithm developers can share their creations and application developers can make their applications smarter by incorporating the latest machine-learning algorithms.
