
Lessons Learned from Deploying Deep Learning at Scale

Deep learning is a machine learning technique used to solve complex problems related to image recognition, natural language processing, and more.

It requires massive amounts of both data and computational power.

The compute problem has been solved largely by GPUs, which have become essential for making deep learning performant. GPUs have reduced the time required to train models from weeks to days. Most training tasks can be accomplished by building your own box.

But how does an organization make its trained deep learning model available as part of an app or web service? For instance, say you have a deep learning model that can identify places in images. Your user provides a photo, and your model tells them what's in it (e.g., bridge, cars, water).

One option is to move your model to the cloud as a scalable web service.

However, deploying deep learning models in the cloud can be challenging due to complex hardware requirements and software dependencies. Plus, GPU cloud computing is not yet mainstream. And, although Amazon, Google, and Microsoft have all announced cloud GPUs, it’s still not as easy as spinning up an EC2 instance or DigitalOcean Droplet.

We have a unique perspective on using not just one, but five different deep learning frameworks in the cloud. Since users depend on Algorithmia to host and scale their algorithms, we’ve worked out most of the idiosyncrasies with the most popular deep learning frameworks out there.

Below are the slides from our presentation at the 2016 O'Reilly Artificial Intelligence conference. They cover the pros and cons of popular frameworks like TensorFlow, Caffe, Torch, and Theano; why you might choose one framework over another; and, more importantly, once you've picked a framework and trained a model to solve your problem, how to reliably deploy it at scale.

Here’s what we’ve learned from deploying deep learning at scale and fighting these battles so that you don’t have to. Let us know what you think @Algorithmia or @Platypii.

We’ve created an open marketplace for algorithms and algorithm development, making state-of-the-art algorithms accessible and discoverable by everyone.

On Algorithmia, algorithms run as containerized microservices that act as the building blocks of algorithmic intelligence developers can use to create, share, and remix at scale.

By making algorithms composable, interoperable, and portable, algorithms can be written in any supported language, and then made available to application developers where the code is always “on,” and available via a simple REST API.
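As a sketch of what that looks like from the application side, here's a call through the Algorithmia Python client; the API key, algorithm path, and input below are hypothetical placeholders:

```python
import Algorithmia

# Hypothetical API key and algorithm path, for illustration only.
client = Algorithmia.client("YOUR_API_KEY")
algo = client.algo("deeplearning/PlacesClassifier/1.0.0")

# The caller neither knows nor cares which framework or language
# the model was built in; it's just an HTTP call under the hood.
response = algo.pipe("https://example.com/photo.jpg")
print(response.result)  # e.g. ["bridge", "cars", "water"]
```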


For context, in the months leading up to this presentation, we had been preparing the Algorithmia platform to support deep learning. In July 2016, we launched our solution for hosting and distributing trained deep learning models using GPUs in the cloud. To demonstrate, we created demos to automatically colorize black and white photos, and classify places and locations in images.


We speak from experience about the challenges faced when going beyond simple demos and trying to use deep learning in a scaled, production environment accessed by hundreds of thousands of people.


At its core, deep learning is just an attempt to use machines to replicate how the brain functions.

Think of deep learning as the technique for learning in neural networks that utilizes multiple layers of abstraction to solve pattern recognition problems. Our brains are able to learn the structure of things with little to no guidance.
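Here's a minimal sketch of that "multiple layers of abstraction" idea: an untrained, numpy-only network where each layer maps its input to a more abstract representation (the layer "meanings" in the comments are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# A tiny two-hidden-layer network with random (untrained) weights.
rng = np.random.RandomState(0)
W1, b1 = rng.randn(784, 128), np.zeros(128)   # raw pixels -> edges/strokes
W2, b2 = rng.randn(128, 64), np.zeros(64)     # strokes -> parts
W3, b3 = rng.randn(64, 10), np.zeros(10)      # parts -> class scores

def forward(x):
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return h2 @ W3 + b3   # unnormalized scores for 10 classes

scores = forward(rng.randn(784))  # e.g., a flattened 28x28 image
print(scores.shape)               # (10,)
```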

Need an introduction to deep learning?


Over the past half-decade, we have generated and stored unstructured data at a whopping pace. Just think of the amount of unstructured metadata your mobile phone creates in a single day: SMS texts, cell towers connected to, GPS, emails, file types and sizes, use and inactivity, wifi connected to, and a lot more. For more, here’s how your phone tracks your every move.


To showcase some of these applications, we've added several open source deep learning models to the platform for you to try.


Three major shifts have driven the craze behind deep learning: a) advances in research, b) advances in hardware, and c) availability of this hardware at scale.


For the most part, there has been only one game in town: NVIDIA. Its efforts with CUDA and cuDNN have been included in every major deep learning framework.


There are still new advances coming down the line, like new GPUs tailored for deep learning (e.g., sacrificing precision in exchange for faster calculations and reduced memory usage), as well as custom hardware. This is similar to the progression we saw with Bitcoin mining from CPUs, to GPUs, to ASICs.


This is the general stack required for doing deep learning. NVIDIA’s success in this space can be attributed to having owned everything from the GPU chip up to cuDNN.

A few notes:


There are a number of deep learning frameworks that support different languages. Next, we’ll explain the top frameworks.







There is a whole area of research in the design of the depth and shape of neural nets. Practitioners rarely start from scratch; instead, they'll start with designs that have worked for similar tasks in the past and evolve them from there.
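For instance, a common pattern is to start from a network that worked on ImageNet and retrain only a new head for your task; a sketch assuming Keras and its pretrained VGG16 weights (our choice of example, not one from the talk):

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

# Start from a design (and weights) proven on ImageNet...
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # keep the features it already learned

# ...then evolve it: attach a new head for a hypothetical 10-class task.
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)
predictions = Dense(10, activation="softmax")(x)

model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```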



How do you scale neural net inference to hundreds, thousands, or millions of users as an API?


Machine learning is generally very computationally intensive. Using an elastic compute layer that can be scaled up and down is perfect for ML tasks.
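As a toy illustration of that elasticity (the throughput number and bounds below are made up), a scaler might size the GPU worker pool to the inference queue:

```python
def desired_workers(queue_depth, per_worker_throughput=4,
                    min_workers=1, max_workers=20):
    """Toy scaling rule: enough workers to drain the queue, within bounds."""
    needed = -(-queue_depth // per_worker_throughput)  # ceiling division
    return max(min_workers, min(max_workers, needed))

print(desired_workers(2))    # 1  -- scale down when idle
print(desired_workers(30))   # 8  -- scale up under load
print(desired_workers(500))  # 20 -- respect the hardware ceiling
```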


Deep learning requires GPUs and is very computationally intensive, so it will typically reside in its own service, separate from user-facing web servers. Dedicated hardware ensures that you don't starve other parts of your SOA of resources. And if it's going to be a separate service anyway, why not outsource it?
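A minimal sketch of such a dedicated inference service, using Flask with a stubbed model (both are hypothetical choices for illustration):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class PlacesModel:
    """Hypothetical wrapper: in practice, load the trained network
    onto the GPU once at startup, not on every request."""
    def classify(self, image_bytes):
        return ["bridge", "water"]  # stubbed prediction

model = PlacesModel()

@app.route("/classify", methods=["POST"])
def classify():
    image = request.files["image"].read()
    return jsonify(labels=model.classify(image))

if __name__ == "__main__":
    # In production: a WSGI server on dedicated GPU hardware,
    # called over HTTP by the user-facing web tier.
    app.run(port=8000)
```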


There are a number of challenges we hit when getting deep learning to work at scale. The first was picking the right provider – at the time of this talk there were not many options, though this is starting to change.


For the most part, only three languages are represented in the world of deep learning bindings: Python, Lua, and C++. Yet real-world applications use many more languages than these. A service architecture lets you abstract away the framework's language behind REST endpoints.
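From the caller's point of view, the binding language disappears entirely. A sketch, with a hypothetical endpoint and payload:

```python
import requests

# The model behind this endpoint could be written in Lua (Torch),
# C++ (Caffe), or Python (TensorFlow/Theano); the HTTP contract
# is all the client ever sees. URL and schema are hypothetical.
resp = requests.post(
    "https://api.example.com/v1/classify",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"image_url": "https://example.com/photo.jpg"},
)
print(resp.json())
```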


As neural nets get deeper, the models have become larger and larger. Every practitioner in the DL space has at some point hit the problem of running out of GPU memory; memory management for GPUs is virtually non-existent. We can solve this in two ways: more (and larger) hardware, or smaller models.
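One mitigation, assuming the TensorFlow 1.x API that was current at the time of this talk, is to cap how much GPU memory a single process can claim:

```python
import tensorflow as tf

# By default, TensorFlow claims nearly all GPU memory for itself.
# When sharing a GPU between services, cap the fraction each process
# may use, or let its allocation grow on demand instead.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4,
                            allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```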


As you can see, the deeper the neural net, the larger the memory footprint. Unfortunately, memory footprint and reduction in error rate are correlated: the best-performing models tend to be the biggest.


There has been some promising research into reducing the size of models, though there is still an error-rate trade-off.
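A crude way to see the trade-off (not the actual research techniques, which involve pruning and quantization) is to store weights in half precision:

```python
import numpy as np

# A 4096x4096 fully connected layer's weights in single precision...
weights = np.random.randn(4096, 4096).astype(np.float32)
print(weights.nbytes / 1e6)  # ~67.1 MB

# ...take half the memory in half precision, at some cost in accuracy.
half = weights.astype(np.float16)
print(half.nbytes / 1e6)     # ~33.6 MB
print(np.abs(weights - half.astype(np.float32)).max())  # rounding error
```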



We have been sharing CPUs for decades, ever since the early mainframes. For GPUs, that has not been the case: the tooling and knowledge are still very young, and dealing with memory overflows is extremely common.


Containers have provided a great way to manage cloud services and applications, but the Docker abstraction breaks down when you try to interface with GPU hardware. If you're using containers with GPUs for deep learning in production, you are a pioneer on the frontier of this field, with little literature or community support to rely on.
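At the time, exposing a GPU to a container meant mounting the NVIDIA device nodes and driver libraries by hand (nvidia-docker later automated this). A sketch, with a hypothetical image name and a driver-library path that varies from host to host:

```python
import subprocess

# Mount the NVIDIA device nodes and driver libraries into the container
# manually, then run the (hypothetical) inference image on the GPU.
subprocess.check_call([
    "docker", "run",
    "--device", "/dev/nvidiactl",
    "--device", "/dev/nvidia-uvm",
    "--device", "/dev/nvidia0",
    "-v", "/usr/local/nvidia:/usr/local/nvidia:ro",  # path varies by host
    "my-deep-learning-image",
    "python", "serve_model.py",
])
```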



To recap the presentation: training deep learning models is only half the problem. Deploying them to the cloud is the next step, but the skill set required to put models into production is entirely different. While tools and frameworks are improving every day and making this easier, we still have a long way to go.

Let us know what you think @Algorithmia or @Platypii.
