
Hardware for Machine Learning

Source: AnandTech

If you’re trying to create value by using Machine Learning, you need to be using the best hardware for the task. With CPUs, GPUs, ASICs, and TPUs, things can get kind of confusing.

While for most of computing history there was only one dominant type of processor, the rapid growth of Deep Learning has brought two new entrants into the field: GPUs and ASICs. This post will walk through the different types of compute chips, where they’re available, and which ones are the best to boost your performance.

Basic Overview and Introduction

Chips are integral to your computer because they’re the brain: processors deal with all of the instructions that other hardware and software throw around. When we’re talking specifically about Machine Learning, the processor has the role of executing the logic in a given algorithm. If we’re performing Gradient Descent to optimize the cost function, the processing unit is directing and executing it. That means running the basic mathematical computations (matrix multiplication) that drive the algorithm.
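To make that concrete, here’s a minimal sketch of a gradient descent loop for linear regression in Python with NumPy (which runs on the CPU). The data, learning rate, and step count are made up for illustration; the point is that the hot path is just repeated matrix multiplication.

```python
import numpy as np

# Toy linear regression: each gradient descent step is dominated by
# matrix multiplications, which the processor has to execute.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))            # 1,000 examples, 50 features
true_w = rng.standard_normal(50)
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(50)                                # parameters to learn
learning_rate = 0.1

for step in range(200):
    predictions = X @ w                         # matrix multiplication
    gradient = X.T @ (predictions - y) / len(y) # another matrix multiplication
    w -= learning_rate * gradient               # parameter update

print("final mean squared error:", np.mean((X @ w - y) ** 2))
```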

The CPU (Central Processing Unit)

The OG processing unit was and is the CPU, first developed by Intel in the early 1970s. The first commercially available microprocessor was Intel’s 4004.

As time went on, CPUs grew and grew in their speed and capabilities. For some context, according to this Computer Hope article on the topic, “the first microprocessor was the Intel 4004 that was released November 15, 1971, and had 2,300 transistors and performed 60,000 operations per second. The Intel Pentium processor has 3,300,000 transistors and performs around 188,000,000 instructions per second.”

Source: Singularity

Most processors were designed with one core (one CPU), which meant that they could only perform one operation at a time. IBM released the first dual-core processor in 2001, which was able to “focus” on two tasks at once. Since then, more and more cores have been crammed into each chip: some modern server processors have more than 40.

Even with recent advances, the fact remains that most computers only have a few cores at most. CPUs are designed for complex computations: they’re very good at rapidly parsing through a detailed and intertwined set of commands. And for most of the tasks a computer needs to do, like gasping for air in a sea of Chrome tabs, that’s exactly what you want. But with Machine Learning, things can get a bit different.

Machine Learning is a New Type of Challenge for Processing

The strength of the CPU is executing a few very complex operations very efficiently, and Machine Learning presents the opposite challenge. Most of the computation in the training process is matrix multiplication, which is a simple but broad task – the calculations are very small and easy, but there are a ton of them. Effectively, the CPU is often overpowered but understaffed.

Advances in data storage are some of the major drivers of the explosion of Machine Learning over the past decade, and they’ve also compounded this problem. Today we’re training our algorithms on more data than ever before; that means more and more small calculations that max out our CPUs.

A far better-optimized chip for Machine Learning is actually the other major processor that’s already mass manufactured – one with only enough complexity to do basic operations, but able to do a huge number of them at the same time. Lucky for us, that chip has been sitting in our computers for years, and it’s called a GPU.

GPUs Have Risen to the Occasion

GPUs, or Graphics Processing Units, have been around in gaming hardware since the early 1970s. The late ’80s saw GPUs being added to consumer computers, and by 2018 they’re absolutely standard. What makes a GPU unique is how it handles commands – it’s the exact opposite of a CPU.

GPUs utilize parallel architecture: while a CPU is excellent at handling one set of very complex instructions, a GPU is very good at handling many sets of very simple instructions.

A few years ago, groups in the Machine Learning community started to realize that this architectural trait – excellence at parallel processing of simple operations – was exactly what training algorithms needed. Over time, GPUs showed massive improvements over CPUs for training models, often in the 10x ballpark for speed. And the stock price of Nvidia, the most well-known manufacturer of these kinds of chips, shows as much (chart from Google Finance):

Source: Nvidia

Nvidia isn’t the only manufacturer of GPUs, but it’s certainly the default one. There are other vendors like AMD, but the software used to integrate with their chips is far behind Nvidia’s. CUDA, Nvidia’s platform, is effectively the only usable one for Machine Learning applications today.
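To get a rough feel for the speedup claim above on your own machine, here’s a hedged sketch that times a large matrix multiplication on the CPU and, if one is available, on an Nvidia GPU via PyTorch. It assumes PyTorch is installed with CUDA support; the exact gap will vary widely by hardware and matrix size.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average time for an n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b                      # warm-up so setup costs don't skew timing
    if device == "cuda":
        torch.cuda.synchronize()   # GPU work is asynchronous; wait for it
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```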

The degree to which GPUs have become popular is hard to overstate. They’re in high demand right now, for the original video game applications as well as Machine Learning (and even cryptocurrency mining), and prices have been skyrocketing. The price for a standard Nvidia GPU manufactured last year is now higher than it was when released. Algorithmia is the only major vendor that supports serverless execution (FaaS) on GPUs.

The Dark Horse – ASICs

ASICs, or Application Specific Integrated Circuits, are the next level of specialization – processors designed for one specific type of task. The chip is architected to be very good at executing that particular function, and not much else.

An awesome and relevant example is cryptocurrency mining, which people seem to attack with endless creativity. If you need an introduction to cryptocurrency, head over to Coindesk’s Blockchain 101 section. But for our purposes, all you need to know is this: mining these currencies on your computer involves brute-force guessing of a specific number. It’s a pretty simple function, but the faster you do it, the better your chance of winning – and there’s no variation. You just keep guessing until you get it right.
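As a toy illustration only (real Bitcoin mining double-hashes a binary block header with SHA-256 against a numeric target, on specialized hardware), here’s what that guess-until-it-fits loop looks like in Python:

```python
import hashlib
from itertools import count

def toy_mine(block_data: str, difficulty: int = 4) -> int:
    """Brute-force a nonce until the hash starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce  # simple check, no variation -- just keep guessing

nonce = toy_mine("example block header")
print("found nonce:", nonce)
```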

When you know there’s only going to be one specific computational requirement, you can design a chip around that specific task with its own unique architecture. For Bitcoin mining, one of these rigs is designed by a company called Bitmain. If you try to run other programs on it, it just won’t work.

Google has also gotten into the ASIC game, but with a focus on Machine Learning. Its chip is called a TPU (Tensor Processing Unit): a Google-designed and manufactured processor made specifically for Machine Learning with TensorFlow, Google’s open source deep learning framework. Google claims that these TPUs can be 15x to 30x faster than the best CPUs and GPUs for running Neural Nets, but there has been some dispute over how accurate that figure really is. A third-party benchmark was recently released and found that TPUs can be significantly more efficient than comparable GPUs.
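For a sense of what targeting a TPU looks like in code, here’s a hedged sketch using TensorFlow’s distribution strategy API. The connection details (the `TPUClusterResolver` arguments in particular) depend on the environment you run in, e.g. Colab or a Cloud TPU VM, so treat this as an assumption-laden sketch rather than a recipe.

```python
import tensorflow as tf

# Connect to the TPU; resolver arguments vary by environment (Colab, Cloud TPU VM).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Variables created inside the strategy scope are placed on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(train_dataset, ...) would then train across the TPU cores.
```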

As Machine Learning becomes more and more integrated into all the applications we use on a daily basis, expect more research to be done on how to create chips tailored for these tasks.

The Practical – How To Access And Use CPUs, GPUs, and TPUs

Moving from the theoretical and architectural to the practical: in 2018 it’s finally possible to access all of these kinds of chips for your Machine Learning applications.

CPUs

Since CPUs have been the standard for so long, they make up the bulk of modern public cloud offerings. If you train and run Machine Learning models on normal AWS (Amazon), GCP (Google), or Azure (Microsoft) instances, they’ll be using CPUs. Unless you’re using specific deep learning frameworks that target GPUs on your machine, running algorithms locally will also use your CPUs.
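If you’re not sure what your own environment is using, a quick check (sketched here with TensorFlow, assuming it’s installed) lists the devices the framework can see:

```python
import tensorflow as tf

# List the compute devices TensorFlow can see. On a typical laptop or a
# general-purpose cloud instance this shows CPUs and an empty GPU list.
print("CPUs:", tf.config.list_physical_devices("CPU"))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```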

GPUs

Because of the skyrocketing demand and relatively modest supply, GPUs weren’t always easy to get access to. Thankfully, the major cloud platforms have bought up enough of them to make it possible today. AWS offers GPU-backed setups (the p3 instances), as do Google and Microsoft. Algorithmia’s Serverless AI Layer integrates both GPUs and CPUs, and makes it easy to deploy on either.
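Once you’re on a GPU-backed instance (or a machine with a CUDA-capable card), frameworks will usually place work on the GPU automatically, but you can also pin it explicitly. A minimal TensorFlow sketch, falling back to the CPU when no GPU is present:

```python
import tensorflow as tf

# Pick the GPU if TensorFlow can see one, otherwise fall back to the CPU.
device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

with tf.device(device):
    a = tf.random.normal((2048, 2048))
    b = tf.random.normal((2048, 2048))
    c = tf.matmul(a, b)      # runs on the chosen device

print("matmul ran on:", c.device)
```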

ASICs

The only widely available ASIC for Machine Learning applications is Google’s TPU program, and TPUs have recently been opened up to the public as a cloud compute offering.

Which Type of Processor is Best For You?

As with anything, the answer is: it depends. Projects and products involving Machine Learning often have varying priorities, ranging from speed to accuracy to reliability. As a general rule, if you can get your hands on a state-of-the-art GPU, it’s your best bet for fast Machine Learning.

GPU compute will usually be about 4x as expensive as CPU compute – so if you’re not getting at least a 4x improvement in speed, or if speed is less of a priority than cost, you might want to stick with CPUs. Additionally, your training needs will differ from your deployment needs. A GPU may be necessary for training a large Neural Net, but it’s possible that a CPU is more than powerful enough to use the model once it’s built.
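That tradeoff is easy to sanity-check with back-of-the-envelope numbers. The figures below are hypothetical – plug in your own instance prices and measured training times:

```python
# Back-of-the-envelope check: is a GPU worth it if it costs ~4x per hour?
cpu_price_per_hour = 0.10    # hypothetical general-purpose CPU instance
gpu_price_per_hour = 0.40    # roughly 4x the CPU price
cpu_training_hours = 20.0    # measured time to train on CPU
speedup_on_gpu = 10.0        # measured speedup, e.g. ~10x

cpu_cost = cpu_price_per_hour * cpu_training_hours
gpu_cost = gpu_price_per_hour * (cpu_training_hours / speedup_on_gpu)

print(f"CPU: {cpu_training_hours:.1f} h, ${cpu_cost:.2f}")
print(f"GPU: {cpu_training_hours / speedup_on_gpu:.1f} h, ${gpu_cost:.2f}")
# If the speedup is below the price ratio (here 4x), the CPU run is cheaper.
```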

TPUs are still experimental, so we’ll need more data before making judgments about the tradeoffs.

Further Reading

What’s The Difference Between The CPU And GPU? – “GPUs can handle graphics better because graphics include thousands of tiny calculations that need to be conducted. Instead of sending those tiny equations to the CPU, which could only handle a few at a time, they’re sent to the GPU, which can handle many of them at once.”

The History of The Modern Graphics Processor – “While 3D graphics turned a fairly dull PC industry into a light and magic show, they owe their existence to generations of innovative endeavour. Over the next few weeks (this is the first installment on a series of four articles) we’ll be taking an extensive look at the history of the GPU, going from the early days of 3D consumer graphics, to the 3Dfx Voodoo game-changer, the industry’s consolidation at the turn of the century, and today’s modern GPGPU.”

The History of Nvidia’s GPUs – “Nvidia formed in 1993 and immediately began work on its first product, the NV1. Taking two years to develop, the NV1 was officially launched in 1995. An innovative chipset for its time, the NV1 was capable of handling both 2D and 3D video, along with included audio processing hardware.”

Will ASIC Chips Become The Next Big Thing in AI? – “When Google announced its second generation of ASICs to accelerate the company’s machine learning processing, my phone started ringing off the hook with questions about the potential impact on the semiconductor industry. Would the other members of the Super 7, the world’s largest data centers, all rush to build their own chips for AI? How might this affect NVIDIA, a leading supplier of AI silicon and platforms, and potentially other companies such as AMD, Intel, and the many startups that hope to enter this lucrative market? Is it game over for GPUs and FPGAs just when they were beginning to seem so promising? To answer these and other questions, let us get inside the heads of these Goliaths of the Internet and see what they may be planning.”

Papers

Deep Machine Learning on GPUs – “With the increasing use of GPUs in science and the resulting computational power, tasks which were too complex a few years back can now be realized and executed in a reasonable time. Deep Machine learning is one of these tasks.”

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs – “Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Due to the substantial compute and memory operations, however, they require significant execution time. The massive parallel computing capability of GPUs make them as one of the ideal platforms to accelerate CNNs and a number of GPU-based CNN libraries have been developed.”

Serving deep learning models in a serverless platform – “In this work we evaluate the suitability of a serverless computing environment for the inferencing of large neural network models. Our experimental evaluations are executed on the AWS Lambda environment using the MxNet deep learning framework. Our experimental results show that while the inferencing latency can be within an acceptable range, longer delays due to cold starts can skew the latency distribution and hence risk violating more stringent SLAs.”

Scaling Deep Learning on GPU and Knights Landing Clusters – “The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs.”

Tutorials

How To Train TensorFlow Models Using GPUs – “GPUs can accelerate the training of machine learning models. In this post, explore the setup of a GPU-enabled AWS instance to train a neural network in TensorFlow.”

Deep Learning CNNs in TensorFlow with GPUs – “In this tutorial, you’ll learn the architecture of a convolutional neural network (CNN), how to create a CNN in Tensorflow, and provide predictions on labels of images. Finally, you’ll learn how to run the model on a GPU so you can spend your time creating better models, not waiting for them to converge.”

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning – “With a good, solid GPU, one can quickly iterate over deep learning networks, and run experiments in days instead of months, hours instead of days, minutes instead of hours. So making the right choice when it comes to buying a GPU is critical. So how do you select the GPU which is right for you? This blog post will delve into that question and will lend you advice which will help you to make choice that is right for you.”

Using GPUs With TensorFlow (From Google) – Documentation.