Anyone who is interested in deep learning has likely gotten their hands dirty at some point playing around with Tensorflow, Google’s open source deep learning framework. Tensorflow has a lot of benefits like wide-scale adoption, deployment on mobile, and support for distributed computing, but it also has a somewhat challenging learning curve, and is difficult to debug. It also doesn’t support variable input lengths and shapes due to its static graph architecture unless you use external packages. PyTorch is a new deep learning framework that solves a lot of those problems.
PyTorch is only in beta, but users are rapidly adopting this modular deep learning framework. PyTorch supports tensor computation and dynamic computation graphs that allow you to change how the network behaves on the fly unlike static graphs that are used in frameworks such as Tensorflow. PyTorch also offers modularity, which enhances the ability to debug or see within the network. For many, PyTorch is more intuitive to learn than Tensorflow.
This talk will objectively look at PyTorch and why it might be the best fit for your deep learning use case. We’ll also look at use cases where you might want to consider using Tensorflow instead.
PyTorch is a Python open source deep learning framework that was primarily developed by Facebook’s artificial intelligence research group and was publicly introduced in January 2017.
While PyTorch is still really new, users are rapidly adopting this modular deep learning framework, especially because PyTorch supports dynamic computation graphs that allow you to change how the network behaves on the fly, unlike static graphs that are used in frameworks such as Tensorflow. PyTorch offers modularity which enhances the ability to debug or see within the network. For many, PyTorch is more intuitive to learn than Tensorflow.
PyTorch shares some C++ backend with the deep learning framework Torch, which was written in Lua. However, PyTorch isn’t simply a Python interface for making it easier to work with Torch. It is deeply integrated with the underlying C++ code and shares the performance that Torch has, but in a language that lends itself to wider adoption, because Python is such a popular and accessible language.
The community around PyTorch has really taken off in the short year and a half that it’s been publicly available, and it has been cited in various research papers and groups. The image shown is from a popular image filter model originally written in Torch and developed by Jun-Yan Zhu and Taesung Park. They rewrote it in PyTorch, citing the same or sometimes better results than the Torch version.
PyTorch is really gaining traction in the community as the go-to deep learning framework for quickly iterating on models and for making tight deadlines due to its intuitive imperative programming architecture.
The framework is extensible using scikit-learn and other common Python libraries, and you can also extend some of PyTorch’s main components such as Torch.nn and Torch.autograd and write custom C extensions.
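As a hedged sketch of what extending Torch.autograd can look like (the class and values here are my own, not from the talk): you subclass torch.autograd.Function and supply your own forward and backward passes.

```python
import torch

class Square(torch.autograd.Function):
    """A custom autograd op: forward computes x*x, backward supplies its gradient."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # stash what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x, scaled by the incoming gradient

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)  # tensor([6.])
```

The same subclassing pattern applies to torch.nn: you derive from nn.Module and define your own forward.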
PyTorch is easy to ramp up and get started with as a beginner, especially when you’re already familiar with Numpy. And get this: you can even run Numpy-like arrays on GPUs.
Another unique aspect of PyTorch is that it relies on dynamic computational graphs. That means graphs are created on the fly (which you might want if the input data has non-uniform length or dimensions) and those dynamic graphs make debugging as easy as debugging in Python.
So deep learning frameworks like PyTorch and Tensorflow (I know, the name alone is a spoiler alert), use tensors as their data structure. Tensors are the main building blocks of deep learning frameworks (besides variables, computational graphs, and such) and are basically objects that describe a linear relationship to other objects. These data containers can be various sizes like you can see on the slide, such as 1d tensors (vectors), 2d tensors (matrices), 3d tensors (cubes), and so on, holding numerical values.
If you’ve worked with Numpy, it’s easy to make the mental switch to tensors. You can think of tensors as N-D arrays or multi-dimensional arrays.
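A minimal illustration of that mental switch (shapes and values chosen for illustration):

```python
import numpy as np
import torch

vec = torch.tensor([1.0, 2.0, 3.0])   # 1-d tensor (vector)
mat = torch.zeros(2, 3)               # 2-d tensor (matrix)
cube = torch.ones(2, 3, 4)            # 3-d tensor

# The jump from Numpy is small: shapes and dtypes carry over directly.
arr = np.arange(6).reshape(2, 3)
t = torch.from_numpy(arr)             # shares memory with the Numpy array
print(t.shape)  # torch.Size([2, 3])
```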
Next we need to talk about computational graphs, because a huge difference between Tensorflow and PyTorch is how they architect graphs.
So, deep learning frameworks all require computational graphs, and computational graphs state the order of computations that are supposed to occur. Computational graphs are just DAGs (directed acyclic graphs).
Computational graphs are directional, and their nodes represent variables (such as tensors) while their edges represent operations (such as multiplication or addition). On the slide above, you can see that the nodes hold variables and each computation produces a new tensor.
Neural nets rely on computational graphs and when the network is being trained, the gradients of the loss function get computed based off of the weights and biases, then the weights are updated using gradient descent. This is done efficiently by the computational graph using the chain rule.
Dynamic computational graphs are the cool kids of computational graphs. But why?
PyTorch relies on dynamic computational graphs, which are built at runtime on the fly, versus Tensorflow, which creates static graphs at compile time. This means that everything has to be defined in TF before it’s run, which means if you want to make any changes to the neural net structure, you have to rebuild it from scratch.
So, what does it really mean to be dynamic?
First let’s define some terms.
One epoch in neural nets means one forward pass and one backward pass over all the training examples in your dataset.
Each iteration is a pass (of both forward and backward passes) of the batch size of inputs.
So, for each iteration in an epoch, a computational graph is created in PyTorch. After each iteration the graph is freed in PyTorch which means more memory available!
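Those definitions can be made concrete with a bare-bones training-loop sketch (the model, data, and hyperparameters are placeholders): a fresh graph is built by each iteration’s forward pass and freed once backward() has run.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(4, 2), torch.randn(4, 1)) for _ in range(3)]  # 3 toy batches

for epoch in range(2):                  # one epoch = one pass over all the data
    for x, y in data:                   # each iteration = one batch, forward + backward
        loss = nn.functional.mse_loss(model(x), y)  # forward pass builds the graph
        opt.zero_grad()
        loss.backward()                 # the graph built this iteration is freed here
        opt.step()
```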
Because PyTorch is a define-by-run framework (defining the graph in the forward pass, versus a define-then-run framework like Tensorflow), your backprop is defined by how your code is run, and every single iteration can be different. PyTorch records the values as they happen in your code to build the dynamic graph as the code is run, as you can see in the slide above from PyTorch’s about page.
PyTorch does this through reverse automatic differentiation called Autograd which is used to calculate the gradients. Reverse automatic differentiation is simply walking backwards through the computational graph to calculate the derivatives.
In the above slide there is a small toy example of using Autograd. Remember, that during the forward pass of the network is where we’ll define a computational graph where the nodes in the graph are tensors and the edges will be computations that create the output. Both the input and the output will be tensors. Then we can backprop through the graph to compute gradients.
On the slide, you’ll see that a is a tensor (do note that Variable is deprecated and you don’t need it to use Autograd anymore), and you need to pass in requires_grad=True so you can later get the gradient of that tensor.
Down at the bottom of the code snippet, you’ll see d.backward() which will use Autograd to compute the backward pass in our little example. That will compute gradients for any tensors with requires_grad=True.
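Since the slide’s code isn’t reproduced here, a toy example along the same lines (the tensor names and values are my own):

```python
import torch

a = torch.tensor([2.0], requires_grad=True)  # leaf tensor we want gradients for
b = a * 3
c = b ** 2
d = c + b
d.backward()   # walk the recorded graph backwards (reverse-mode autodiff)
print(a.grad)  # chain rule: dd/da = (2b + 1) * 3 = 13 * 3 = 39 -> tensor([39.])
```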
So to reiterate how PyTorch does reverse automatic differentiation, the graph is created on the fly each time code is encountered and with backprop, the graph will be dynamically walked backward re-calculating the gradient when you need it.
Now that we’ve gone through what dynamic computational graphs are, let’s go through what makes them awesome.
Because operations in PyTorch actually modify real data, versus in Tensorflow where they are references or containers where the data will be inserted later, it’s possible to debug your code within the graph itself and treat it like regular Python code. Meaning, due to its imperative programming architecture you can set breakpoints, print out values during computation, read the stack trace and know what part of code it’s referring to.
Dynamic computational graphs also allow you to have varying input lengths which is important when you don’t know exactly how much memory to allocate to your inputs. For example, in natural language processing, sentence lengths vary. If you were to create a sentiment analysis model using Recurrent Neural Nets where you also have various input sizes and need native control flow, it’s easy to accomplish with dynamic computational graphs.
With Tensorflow, if you were creating an RNN, you would have to set a maximum sentence length and pad the inputs with zeros to allow for varying input lengths, unless you use Tensorflow Fold, which allows for dynamic batching. This is due to the static graph only allowing the same input size and shape because, remember, the graph is defined before it’s run, so you have to be explicit about your inputs before you create a TF session to run your graph.
Note however, that if you want to use batches in PyTorch you will have to pad each element (such as a sentence) within that batch, even though your batches can have different lengths.
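One common way to do that padding is torch.nn.utils.rnn.pad_sequence; a small sketch with made-up token indices:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" of different lengths, encoded as token-index tensors.
sentences = [torch.tensor([4, 7, 2]),
             torch.tensor([5, 1]),
             torch.tensor([9, 3, 8, 6])]

# pad_sequence zero-pads every element up to the longest in the batch.
batch = pad_sequence(sentences, batch_first=True)
print(batch.shape)  # torch.Size([3, 4])
```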
Another great thing about PyTorch is that you can add custom logic easily within the graph itself using native Python. In writing operations when building the graph in Tensorflow, you have to use their control flow operations.
This is because with native Python you wouldn’t be able to keep that control flow logic when exporting the graph, so instead you need to use Tensorflow’s control flow methods, such as tf.while_loop, so you can build efficient ETL operations within Tensorflow.
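For contrast, here is a hedged PyTorch sketch (the module and its sizes are invented for illustration) where plain Python loops and if statements live directly in the forward pass:

```python
import torch
import torch.nn as nn

class LoopyNet(nn.Module):
    """Toy module using ordinary Python control flow inside forward()."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # A plain Python loop with a data-dependent stopping rule; since the
        # graph is rebuilt on every call, this just works.
        for _ in range(3):
            x = torch.relu(self.linear(x))
            if x.sum() < 1.0:
                break
        return x

out = LoopyNet()(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])
```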
Note, when using Tensorflow, that you can have some of the flexibility you gain from dynamic graphs in PyTorch when you use both control flow operations in TF and Tensorflow Fold, but it is not as smooth of a user experience as it is in PyTorch and takes time to ramp up.
But, static graphs aren’t all bad. Tensorflow actually has a lot of really awesome features.
Remember, static graph architecture can be thought of as: define the graph, run the graph once to initialize weights, then run your data through the graph multiple times within TF Session for training.
This architecture allows for ease of distributed computing, so you can run the graph several times on different workers. Because you define the graph before you run it, and each node is independent of the other nodes, the computations can be run in parallel. This also allows some computations to be done on CPUs while others run on GPUs.
Static graphs also allow for easy optimization because you optimize the graph while defining it before running it. It’s a lot harder for a dynamic graph to be optimized since it’s being created as the code runs.
Now, static graph architecture also allows for the graph to be saved with all its methods and functions for production. I’ll compare this type of serialization with PyTorch’s a bit later on.
Recall that I said that PyTorch’s methods actually hold real data while Tensorflow’s are only references. This is easier to digest with a little code example.
On the slide we are creating two tensors and adding them together. Simple right? Looks like Numpy and the output is very easy to understand: you have a Tensor of type float with size 3.
Notice the variables x_1 and x_2 are symbolic tensor objects created with tf.constant.
Also notice that instead of using native Python to add, like you could in PyTorch, you have to use the Tensorflow API, tf.add in this case. Note that you only get a reference to the tensor; no data has been injected until you run the graph within a tf.Session.
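A rough side-by-side of the idea (my own values; the Tensorflow 1.x half is shown only as comments, since it isn’t meant to run here):

```python
import torch

# PyTorch: the tensors hold real data immediately.
x1 = torch.tensor([1.0, 2.0, 3.0])
x2 = torch.tensor([4.0, 5.0, 6.0])
print(x1 + x2)  # tensor([5., 7., 9.]) -- plain Python `+` works

# Tensorflow 1.x, for contrast:
#   x_1 = tf.constant([1.0, 2.0, 3.0])
#   x_2 = tf.constant([4.0, 5.0, 6.0])
#   y = tf.add(x_1, x_2)       # y is only a symbolic reference...
#   with tf.Session() as sess:
#       print(sess.run(y))     # ...data flows only when the session runs
```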
This comparison leads us to understand a little more about why debugging in PyTorch is awesome.
Due to the dynamic graphs in PyTorch, the tensors actually contain values and the operations can modify the data.
This means that in PyTorch you can use your favorite IDE to debug using regular Python. You can print out weight sizes or whatever you want from the Tensor within the graph itself. You can add breakpoints, or even step into the forward pass, or the loss function. Note however, that for calculating the gradient you have to use the method .register_hook() to inspect gradients because that computation is carried out by PyTorch vs the user.
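A minimal sketch of that gradient-hook pattern (the tensor and loss are invented for illustration):

```python
import torch

w = torch.randn(3, requires_grad=True)
grads = []

# register_hook lets you inspect (or modify) the gradient at the moment
# Autograd computes it, since that step happens outside your own code.
w.register_hook(lambda g: grads.append(g))

loss = (w * 2).sum()
loss.backward()
print(grads[0])  # tensor([2., 2., 2.]) -- d(loss)/dw is 2 for each element
```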
All this is huge when training a neural network: you don’t want to spend all that time and money training only to get error messages after you’ve spent hours or days on it.
In Tensorflow, debugging isn’t as robust. You can inspect the variables from your session, or there is the Tensorflow debugger tool you can use, but again like everything in Tensorflow you have to learn their tools or APIs for doing what you can do in native Python code in PyTorch. However, you do have Tensorboard to help you debug in Tensorflow which is a really nice feature.
Because this is deep learning, let’s talk about GPU support for PyTorch.
In PyTorch, GPU utilization is pretty much in the hands of the developer in the sense that you must define whether you are using CPUs or GPUs, which you can see with a quick example on the slide. You’re saying “hey, if I’ve got GPUs use ‘em, if not, use the CPUs.”
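The slide’s check presumably looks something like this common idiom (variable names are my own):

```python
import torch

# "If I've got GPUs, use 'em; if not, use the CPUs."
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_input = torch.randn(8, 16).to(device)
print(model_input.device)
```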
A note on GPU utilization in Tensorflow – while you can explicitly define whether you want to use CPUs or GPUs, Tensorflow tries to figure it out for you. If GPUs are available, it will default to using them and will also take all available GPU memory. While training, a lot of the time you want all GPUs to be used; however, Tensorflow (unless you use their pre-loading package) will take all available GPU memory for pre-loading and pre-caching data, and I’m sure that’s not what you want to use your GPUs for.
Tensorflow gives you a couple of ways around this. There are configurations shown in the TF docs that show how to allocate memory as you need it, or to straight up limit how much of each device you want to allow TF to use.
Also note that Tensorflow session does not release the GPU even when you close the session so you have to kill the process to release the memory. This is due to Tensorflow allocating memory to the lifetime of the process, not the session.
So, back to GPU support in PyTorch:
On the slide is a quick example of how you can run Numpy-like arrays on a GPU.
You can see how to specify what device you want with torch.device in the first line of code.
Note that torch.ones will automatically be created using CPUs so to change it to using GPUs you simply pass in the specific device you want with the device keyword argument.
Also, if you wanted to convert it from a tensor like it is above to a Numpy array, you can simply apply the method numpy() to your torch.ones tensor. This is cool because, say you want to do some computationally heavy data pre-processing and are in Numpy, you can switch to using GPUs in PyTorch, then switch back to Numpy arrays when you’re done.
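A sketch of that round trip, assuming nothing beyond stock PyTorch and Numpy:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# torch.ones defaults to CPU; pass device= to place it on the GPU instead.
t = torch.ones(3, 3, device=device)

# Round-trip to Numpy for CPU-side pre/post-processing.
# (.numpy() only works on CPU tensors, hence the .cpu() first.)
arr = t.cpu().numpy()
back = torch.from_numpy(arr).to(device)
print(arr.sum())  # 9.0
```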
Another cool part about PyTorch and GPU utilization is that you can move everything in the network – tensors, gradients, weights, basically every component – back and forth between CPU and GPU. You can see on the slide how easy it is to transfer a tensor from CPU to GPU in the forward function…not hard at all!
Benefits to this kind of flexibility:
- For custom, complex architectures, some work can be done on the CPU and some on the GPU, and being able to optimize that split is really useful.
- Being able to clone and save your model state in host memory during training allows for easy checkpointing if your training deviates.
- You define how you want your model to get parallelized.
However, CUDA copies can be costly, so be careful where you take advantage of this feature. Also note that GPU operations are asynchronous, so when moving things around you’ll want to use torch.cuda.synchronize() to wait for all kernels in all streams to finish; otherwise moving things around will be very slow because they’re waiting on the model to finish.
The nice thing about GPU utilization in PyTorch is the easy, fairly automatic way of initializing data parallelism. Using torch.nn.DataParallel() which you can see defined in the 4th line of code within the __init__ method, you can wrap around a module to parallelize over multiple GPUs in the batch dimension. This occurs without a lot of work on your part while in Tensorflow it’s more of a manual process.
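A hedged sketch of that wrapping (the module and its sizes are placeholders, since the slide’s code isn’t reproduced here):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical layer sizes, purely for illustration.
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = Net()
if torch.cuda.device_count() > 1:
    # Splits each batch across the available GPUs and gathers the results.
    model = nn.DataParallel(model)

out = model(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 2])
```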
You also have nice features like pre-loaders in both PyTorch and Tensorflow that allow you to pre-load the data using multiple workers for parallel processing, so loading your data doesn’t swamp your expensive GPUs.
Serialization is a bit tricky in PyTorch, depending on your use case.
In the first and recommended way, as you can see on the slide, the model parameters (weights) are saved in a dictionary and then serialized to disk. That binary can then be read with .load_state_dict(), and the weights you get back in an ordered dict can be loaded into the network. This means that the serialization process is ignorant of model structure.
The second way is more for the use case of when you want to resume training your model later. This way you save your model (without calling eval() on it), and then later get back to training using the pickled model. Note that you’ll need to explicitly save the epoch state and the optimizer of your graph to get back to training, but this doesn’t get you the whole way in terms of saving your model.
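Both styles can be sketched roughly like this (the model, optimizer, and epoch number are placeholders, and an in-memory buffer stands in for a file on disk):

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Way 1 (recommended): serialize only the weights, as an ordered dict.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)
# The structure itself still has to come from your source code.
model.load_state_dict(torch.load(buf))

# Way 2: a checkpoint for resuming training -- the epoch and optimizer
# state have to be saved explicitly alongside the model weights.
ckpt = {"epoch": 5,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict()}
```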
To understand what this means, let’s think about how Tensorflow saves the model. With Tensorflow you can save the entire computational graph as a protocol buffer that saves operations and weights – everything you need to use it for inference without needing to deploy the source code that exported the protobuf. This makes it easier to deploy in other languages such as Java, plus it makes backwards compatibility a possibility.
PyTorch’s serialization methods assume that you have the entire source code when you’re deploying your model to production in order to have access to operations. The problem with this is that you:
- you might not have the original source code
- you might want to deploy your model in a different language
- PyTorch’s serialization and deserialization methods make backwards compatibility an issue in production: you have to be very careful about making changes to your model, because without the original source code you could end up with mismatched model parameters in your new model.
But luckily, there are some solutions being developed.
Try out ONNX which stands for Open Neural Network Exchange, developed as a community project between Facebook and Microsoft. ONNX is a library that focuses on taking research code into production with an open source format to make it easier to work between different frameworks, such as developing your model in PyTorch and deploying it in Caffe2. Facebook has different teams working in PyTorch (for research) while other teams use Caffe2 for deployment, so it makes sense that a format was needed for interoperability between frameworks. It’s the same issue we at Algorithmia ran into and developed a solution to as well, but we needed a more general solution since currently ONNX only supports interoperability between six deep learning frameworks.
I’ve talked a lot about why PyTorch is great, but also highlighted along the way areas that make Tensorflow pretty great. So now I’ll break it down by use case for y’all.
One area that makes PyTorch a more favorable choice when creating deep learning models is its natural support for recurrent neural nets. Because RNNs have variable-length inputs, and due to PyTorch’s use of dynamic graphs, RNNs run faster on PyTorch than on Tensorflow (there are various blog posts and benchmarks out there, here is one), and you don’t have to hack together a solution with Tensorflow Fold to use dynamic graph structures.
Also, again due to the choice of having dynamic graphs versus static graphs, PyTorch is a winner for easy debugging and being able to quickly iterate on your model due to the imperative programming structure. Researchers have been the largest adopters and fangirls and boys of PyTorch because of this.
But, as mentioned, within Facebook PyTorch is used for research while they use Caffe2 for production.
And that’s where Tensorflow currently shines: in production. With Tensorflow Serving, and mobile support with Tensorflow Lite, Tensorflow is often the framework of choice. There are a few reasons for this, but the main one is simply its architecture.
Tensorflow was built for distributed computing (static graphs allow for better optimization) while automatic differentiation has to be performed on every iteration of the graph in PyTorch which can slow it down in certain cases.
But more importantly, Tensorflow Serving was created specifically for productionizing TF models, while PyTorch currently doesn’t have any deployment engine. Tensorflow Serving uses the open source gRPC framework, which runs on HTTP/2 and uses protocol buffers by default for serialization. Using gRPC is great because with it the model can be consumed from Java, C++, etc. – not everyone uses Python in production.
So, if you were paying attention, I mentioned that PyTorch currently doesn’t have much support for production. But, I already mentioned ONNX which helps with compatibility across different frameworks, although not many are supported right now.
Luckily, on the roadmap for PyTorch is to combine PyTorch with Caffe2 for deploying models into production without losing PyTorch’s iterability, ease of use, and awesome debugging. They are adding a ton of optimization features by expanding the API such as adding the ability to configure memory layouts such as NCHW or NHWC. Note that the layout you want depends on your hardware.
So, in the near future when they release PyTorch 1.0 they are releasing torch.jit which is a just-in-time compiler that converts your PyTorch models into production ready code that can also export your model into a C++ runtime. Note, they are also planning on adding better mobile support as well.
I hope this talk helped you understand areas where you might want to use PyTorch and where you might want to use Tensorflow for your projects. A lot of the time you are choosing PyTorch for flexibility and sometimes losing out on performance, but that depends on your hardware and the model you are running, and now that different memory layouts will be available in PyTorch, the future is looking good for PyTorch and performance in production.
Check out how to host your PyTorch model on Algorithmia so you can avoid all the pain points of deploying deep learning models into production.