Algorithmia Blog - Deploying AI at scale

Convolutional Neural Nets in Pytorch


Many of the exciting applications in Machine Learning have to do with images, which means they’re likely built using Convolutional Neural Networks (or CNNs). This type of algorithm has been shown to achieve impressive results in many computer vision tasks and is a must-have part of any developer’s or data scientist’s modern toolkit.

This tutorial will walk through the basics of the architecture of a Convolutional Neural Network (CNN), explain why it works as well as it does, and step through the necessary code piece by piece. You should finish this with a good starting point for developing your own more complex architecture and applying CNNs to problems that intrigue you.

Thanks are due to ParisTech for most of the code, and to Ujjwal Karn for the intuitive explanation of CNNs.

What Is A Convolutional Neural Net Anyway?

CNNs are one of the core tools in the field of computer vision, which is all about applying computational techniques to visual content. For more information about how computer vision works and what kinds of problems businesses are tackling with it, check out our introduction here.

People often refer to a CNN as a type of algorithm, but it’s actually a combination of different algorithms that work well together. What differentiates a CNN from your run-of-the-mill neural net is the preprocessing, or the stuff that you do to your data before passing it into the neural net itself. That means CNNs have two major pieces:

  1. Feature engineering / preprocessing – turn our images into something that the algorithm can more efficiently interpret
  2. Classification – train the algorithm to map our images to the given classes, and understand the underlying relationship

Preprocessing in CNNs is aimed at turning your input images into a set of features that’s more informative to the neural net. It can get complicated, but as long as you remember that there are only two sections and the goals of each, you won’t get lost in the weeds.


Convolution

The CNN gets its name from the process of convolution, which is the first operation applied as part of the feature engineering step. Think of convolution as applying a filter to our image. We pass a mini image, usually called a kernel, over our image and output the resulting filtered version.


Source: Stanford Deep Learning

Since an image is just a bunch of pixel values, in practice this means multiplying small parts of our input images by the filter. There are a few parameters that get adjusted here:

  • Kernel Size – the size of the filter.
  • Kernel Type – the values of the actual filter. Some examples include identity, edge detection, and sharpen.
  • Stride – the rate at which the kernel passes over the input image. A stride of 2 moves the kernel in 2 pixel increments.
  • Padding – we can add layers of 0s to the outside of the image in order to make sure that the kernel properly passes over the edges of the image.
  • Output Layers – how many different kernels are applied to the image.
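To make these parameters concrete, here is a minimal sketch that applies two hand-written 3x3 kernels to a tiny 5x5 single-channel image using torch.nn.functional.conv2d. The image values and variable names are made up for illustration; in a real CNN the kernel values are learned rather than written by hand.

```python
import torch
import torch.nn.functional as F

# A 5x5 single-channel "image", shaped (batch, channels, height, width).
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

# Kernel type: the identity kernel returns each pixel unchanged.
identity = torch.tensor([[0., 0., 0.],
                         [0., 1., 0.],
                         [0., 0., 0.]]).reshape(1, 1, 3, 3)

# Kernel type: a simple edge detector (its values sum to zero).
edge = torch.tensor([[-1., -1., -1.],
                     [-1.,  8., -1.],
                     [-1., -1., -1.]]).reshape(1, 1, 3, 3)

# Padding of 1 lets the 3x3 kernel reach the border pixels, so the size is kept.
same = F.conv2d(image, identity, stride=1, padding=1)      # shape (1, 1, 5, 5)

# Without padding, the output shrinks to 3x3.
edges = F.conv2d(image, edge, stride=1, padding=0)         # shape (1, 1, 3, 3)

# A stride of 2 moves the kernel in 2 pixel increments, skipping every other pixel.
strided = F.conv2d(image, identity, stride=2, padding=1)   # shape (1, 1, 3, 3)
```

Because this toy image is a smooth ramp with no abrupt changes, the edge kernel outputs zeros everywhere, which is what we would expect from an edge detector.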

The output of the convolution process is called the “convolved feature” or “feature map.” Remember: it’s just a filtered version of our original image where we multiplied some pixels by some numbers.

The resulting feature map can be viewed as a representation of the input image that is more informative to the neural network it will eventually be passed through. In practice, convolution combined with the next two steps has been shown to greatly increase the accuracy of neural networks on images.

Code: you’ll see the convolution step through the use of the torch.nn.Conv2d() function in Pytorch.
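As a quick sketch of what that looks like, the snippet below creates a convolution layer and passes a batch of random stand-in images through it. The specific sizes (3 input channels, 18 output feature maps, a 3x3 kernel) are illustrative assumptions, not requirements.

```python
import torch

# 3 input channels (RGB) in, 18 feature maps out; one learned 3x3 kernel per map.
conv = torch.nn.Conv2d(in_channels=3, out_channels=18,
                       kernel_size=3, stride=1, padding=1)

# Four random 32x32 RGB "images" standing in for real data.
batch = torch.randn(4, 3, 32, 32)
feature_maps = conv(batch)

print(feature_maps.shape)  # torch.Size([4, 18, 32, 32])
print(conv.weight.shape)   # torch.Size([18, 3, 3, 3])
```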


ReLU

Since the neural network forward pass is essentially a linear function (just multiplying inputs by weights and adding a bias), CNNs often add in a nonlinear function so that the network can approximate nonlinear relationships in the underlying data.

The function most popular with CNNs is called ReLU and it’s extremely simple. ReLU stands for Rectified Linear Unit, and it just converts all negative pixel values to 0. The function itself is output = Max(0, input).

There are other functions that can be used to add non-linearity, like tanh or sigmoid. But in CNNs, ReLU is the most commonly used.

Code: you’ll see the ReLU step through the use of the torch.nn.functional.relu() function in Pytorch.
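Here is a small sketch of ReLU applied to a handful of made-up values. Pytorch exposes it both as a function (torch.nn.functional.relu) and as a layer (torch.nn.ReLU); the two give the same result.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

# Functional form: negative entries become 0, everything else is unchanged.
activated = F.relu(x)

# Module form: same computation, packaged as a layer object.
activated_layer = torch.nn.ReLU()(x)

print(activated)
```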

Max Pooling

The last part of the feature engineering step in CNNs is pooling, and the name describes it pretty well: we pass over sections of our image and pool them into the highest value in the section. Depending on the size of the pool, this can greatly reduce the size of the feature set that we pass into the neural network. This graphic from Stanford’s course page visualizes it simply:

(Figure: max pooling)

Max pooling also has a few of the same parameters as convolution that can be adjusted, like stride and padding. There are also other types of pooling that can be applied, like sum pooling or average pooling.

Code: you’ll see the Max Pooling step through the use of the torch.nn.MaxPool2d() function in Pytorch.
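To see the pooling in action, the sketch below runs a 2x2 max pool with a stride of 2 over a small 4x4 feature map. The values are made up for illustration.

```python
import torch

# One 4x4 feature map, shaped (batch, channels, height, width).
feature_map = torch.tensor([[1., 1., 2., 4.],
                            [5., 6., 7., 8.],
                            [3., 2., 1., 0.],
                            [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# Each non-overlapping 2x2 section is pooled into its highest value.
pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
pooled = pool(feature_map)

print(pooled.shape)  # torch.Size([1, 1, 2, 2])
# pooled[0, 0] holds [[6, 8], [3, 4]]: the maximum of each 2x2 section.
```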

Fully Connected Layers

After the above preprocessing steps are applied, the resulting image (which may end up looking nothing like the original!) is passed into the traditional neural network architecture. Designing the optimal neural network is beyond the scope of this post, and we’ll be using a simple 2 layer format with one hidden layer and one output layer.

This part of the CNN is almost identical to any other standard neural network. The key to understanding CNNs is this: the driver of better accuracy is the steps we take to engineer better features, not the classifier we end up passing those values through. Convolution, ReLU, and max pooling prepare our data for the neural network in a way that efficiently extracts the useful information the images contain.

Code: you’ll see the forward pass step through the use of the torch.nn.Linear() function in Pytorch.
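As a rough sketch of this last stage, the snippet below flattens a batch of pooled feature maps and passes them through one hidden layer and one output layer. The shapes (18 feature maps of 16x16, 64 hidden units, 10 classes) are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

# Random stand-in for a batch of 4 pooled outputs with 18 feature maps each.
pooled = torch.randn(4, 18, 16, 16)

# Linear layers expect one flat vector per example.
flat = pooled.view(pooled.size(0), -1)          # shape (4, 4608)

hidden_layer = torch.nn.Linear(18 * 16 * 16, 64)
output_layer = torch.nn.Linear(64, 10)

# Hidden layer with ReLU, then one raw score per class.
scores = output_layer(F.relu(hidden_layer(flat)))
print(scores.shape)  # torch.Size([4, 10])
```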

Getting Started in Pytorch

This tutorial is in Pytorch, one of the newer Python-focused frameworks for designing Deep Learning workflows that can be easily productionized. Pytorch is one of many frameworks that have been designed for this purpose and work well with Python, alongside popular ones like TensorFlow and Keras.

There are a few key differences between these popular frameworks that should determine which is right for you and your project, including considerations like:

  • Ease of deployment
  • Level of abstraction
  • Visualization options
  • Debugging flexibility

It’s safe to say that Pytorch sits at a good medium level of abstraction between Keras and TensorFlow, and it seems to be picking up a good amount of buzz in the Data Science community. It also offers strong support for GPUs.

To install Pytorch, head to the homepage and select your machine configuration.

Gathering and Loading Data

As with most Machine Learning projects, a minority of the code you end up writing has to do with actual statistics; most is spent on gathering, cleaning, and readying your data for analysis. CNNs in Pytorch are no exception.

The torchvision package, which is typically installed alongside Pytorch, makes it easy to download and use datasets for CNNs. To stick with convention and benchmark accurately, we’ll use the CIFAR-10 dataset. CIFAR-10 contains images of 10 different classes, and is a standard benchmark of sorts used for CNN building.

We’ll want to start with importing the Pytorch libraries as well as the standard numpy library for numerical computation.

We’ll also want to set a standard random seed for reproducible results.

The first step to get our data is to use Pytorch and download it.

We then designate the 10 possible labels for each image:
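For CIFAR-10, the label indices 0 through 9 correspond to the class names below. The label_name helper is a hypothetical convenience for turning a numeric label back into a readable name.

```python
# CIFAR-10 class names, in label-index order (index 0 is "airplane").
classes = ("airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")


def label_name(index):
    """Return the human-readable class name for a numeric label."""
    return classes[index]
```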

The final step of data preparation is to define samplers for our images. These Pytorch objects will split all of the available training examples into training, test, and cross validation sets when we train our model later on.
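One way to define the samplers is with SubsetRandomSampler, which draws randomly from a fixed list of indices. The split sizes below are assumptions for illustration; CIFAR-10 provides 50,000 training images and 10,000 test images to draw from.

```python
from torch.utils.data.sampler import SubsetRandomSampler

# Illustrative split sizes; adjust them to taste.
n_training_samples = 20000
n_val_samples = 5000
n_test_samples = 5000

# Training and validation examples come from disjoint slices of the training set.
train_sampler = SubsetRandomSampler(list(range(n_training_samples)))
val_sampler = SubsetRandomSampler(
    list(range(n_training_samples, n_training_samples + n_val_samples)))

# Test examples are indexed into the separate test set.
test_sampler = SubsetRandomSampler(list(range(n_test_samples)))
```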

Designing a Neural Net in Pytorch

Pytorch makes it pretty easy to implement all of those feature engineering steps that we described above. We’ll be making use of 4 major functions in our CNN class:

  1. torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding) – applies convolution
  2. torch.nn.functional.relu(x) – applies ReLU
  3. torch.nn.MaxPool2d(kernel_size, stride, padding) – applies Max Pooling
  4. torch.nn.Linear(in_features, out_features) – fully connected layer (multiply inputs by learned weights)

Writing CNN code in Pytorch can get a little complex, since everything is defined inside of one class. We’ll create a SimpleCNN class which inherits from the master torch.nn.Module class.
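A sketch of such a class is below. It follows the shapes used in this tutorial (3-channel 32x32 inputs, 18 feature maps, 10 output classes); the hidden layer size of 64 and the exact layer names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


class SimpleCNN(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Input channels = 3 (RGB), output channels = 18 feature maps.
        self.conv1 = torch.nn.Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        # 18 feature maps of 16x16 = 4608 input features, 64 hidden units (assumed).
        self.fc1 = torch.nn.Linear(18 * 16 * 16, 64)
        # 64 hidden units in, 10 class scores out.
        self.fc2 = torch.nn.Linear(64, 10)

    def forward(self, x):
        # Convolution + ReLU: size changes from (3, 32, 32) to (18, 32, 32).
        x = F.relu(self.conv1(x))
        # Max pooling: size changes from (18, 32, 32) to (18, 16, 16).
        x = self.pool(x)
        # Flatten: size changes from (18, 16, 16) to (4608,) per example.
        x = x.view(x.size(0), -1)
        # Hidden layer + ReLU: size changes from 4608 to 64.
        x = F.relu(self.fc1(x))
        # Output layer: size changes from 64 to 10.
        x = self.fc2(x)
        return x
```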

Let’s explain what’s going on here. We’re creating a SimpleCNN class with one method besides the constructor: forward. The forward() method computes a forward pass of the CNN, which includes the preprocessing steps we outlined above. When an instance of the SimpleCNN class is created, we define internal functions to represent the layers of the net. During the forward pass we call these internal functions.

One of the pesky parts about manually defining neural nets is that we need to specify the sizes of inputs and outputs at each part of the process. The comments should give some direction as to what’s happening with size changes at each step. In general, the output size for any dimension in our input set can be defined as:
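The standard formula for convolution and pooling layers is output_size = floor((in_size - kernel_size + 2 * padding) / stride) + 1. Here it is as a small helper; the function name is our own, not part of Pytorch.

```python
def output_size(in_size, kernel_size, stride, padding):
    """Output length along one dimension of a convolution or pooling layer."""
    return (in_size - kernel_size + 2 * padding) // stride + 1


# 3x3 convolution, stride 1, padding 1: the size is preserved.
print(output_size(32, kernel_size=3, stride=1, padding=1))  # 32

# 2x2 max pooling, stride 2, no padding: the size is halved.
print(output_size(32, kernel_size=2, stride=2, padding=0))  # 16
```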

To use an example from our CNN, look at the max pooling layer. The input dimension is (18, 32, 32). Using our formula applied to each of the final two dimensions (the first dimension, or number of feature maps, remains unchanged during any pooling operation), we get an output size of (18, 16, 16).

Training a Neural Net in Pytorch

Once we’ve defined the class for our CNN, we need to train the net itself. This is where neural network code gets interesting. If you’re working with more basic types of Machine Learning algorithms, you can usually get meaningful output in just a few lines of code. For example, implementing a Support Vector Machine in the sklearn Python package is as easy as:
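For instance, fitting a Support Vector Machine classifier in sklearn looks roughly like this. The toy data and the choice of a linear kernel are made up for illustration.

```python
from sklearn import svm

# Four toy 2D points in two well-separated groups.
X = [[0, 0], [0, 1], [3, 3], [3, 4]]
y = [0, 0, 1, 1]

# Create the classifier and fit it: that's the whole training step.
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

predictions = clf.predict([[-1, -1], [5, 5]])
```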

With neural networks in Pytorch (and TensorFlow) though, it takes a bunch more code than that. Our basic flow is a training loop: each time we pass through the loop (called an “epoch”), we compute a forward pass on the network and implement backpropagation to adjust the weights. We’ll also record some other measurements like loss and time passed, so that we can analyze them as the net trains itself.

To start, we’ll define our data loaders using the samplers we created above.
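A sketch of the loaders is below. So that the snippet runs without downloading anything, a small TensorDataset of random tensors stands in for CIFAR-10; the make_loader helper, the fake data, and the index ranges are all assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler


def make_loader(dataset, sampler, batch_size):
    """Wrap a dataset so it yields batches drawn according to the sampler."""
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


# Random stand-in data with CIFAR-10's shapes: 100 "images" and 100 labels.
fake_set = TensorDataset(torch.randn(100, 3, 32, 32),
                         torch.randint(0, 10, (100,)))

# The first 80 examples are for training, the last 20 for validation.
train_loader = make_loader(fake_set, SubsetRandomSampler(list(range(0, 80))), batch_size=16)
val_loader = make_loader(fake_set, SubsetRandomSampler(list(range(80, 100))), batch_size=16)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([16, 3, 32, 32]) torch.Size([16])
```

With the real data, we would pass the CIFAR-10 training set and our samplers to the loader in place of the stand-ins.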

We’ll also define our loss and optimizer functions that the CNN will use to find the right weights. We’ll be using Cross Entropy Loss (Log Loss) as our loss function, which strongly penalizes high confidence in the wrong answer. The optimizer is the popular Adam algorithm (not a person!).
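Here is a sketch of that setup, followed by a tiny demonstration of how the loss treats confident predictions. The helper name, the default learning rate of 0.001, and the toy scores are assumptions for illustration.

```python
import torch
import torch.optim as optim


def create_loss_and_optimizer(net, learning_rate=0.001):
    """Return a cross entropy loss function and an Adam optimizer for `net`."""
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)
    return loss_fn, optimizer


# A tiny stand-in model, just to have some parameters to optimize.
loss_fn, optimizer = create_loss_and_optimizer(torch.nn.Linear(4, 3))

# The true class is index 1 in both cases.
target = torch.tensor([1])

# High score on the correct class: loss is close to 0.
confident_right = loss_fn(torch.tensor([[0.0, 10.0, 0.0]]), target)

# High score on a wrong class: loss is large (roughly 10 here).
confident_wrong = loss_fn(torch.tensor([[10.0, 0.0, 0.0]]), target)
```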

Finally, we’ll define a function to train our CNN using a simple for loop.
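A minimal sketch of such a training function is below. The function name, its arguments, and the choice to print roughly ten progress updates per epoch are assumptions; it takes already-built data loaders rather than constructing them itself.

```python
import time

import torch


def train_net(net, train_loader, val_loader, n_epochs, learning_rate):
    """Train `net` and return the average validation loss after each epoch."""
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

    n_batches = len(train_loader)
    print_every = max(1, n_batches // 10)   # roughly ten updates per epoch
    start_time = time.time()
    val_history = []

    for epoch in range(n_epochs):
        net.train()
        running_loss = 0.0

        for i, (inputs, labels) in enumerate(train_loader):
            optimizer.zero_grad()             # clear gradients from the last batch
            outputs = net(inputs)             # forward pass
            loss = loss_fn(outputs, labels)
            loss.backward()                   # backpropagation
            optimizer.step()                  # adjust the weights

            running_loss += loss.item()
            if (i + 1) % print_every == 0:
                print("Epoch {}, batch {}/{}: train loss {:.3f}".format(
                    epoch + 1, i + 1, n_batches, running_loss / print_every))
                running_loss = 0.0

        # At the end of each epoch, measure loss on the validation set.
        net.eval()
        total_val_loss = 0.0
        with torch.no_grad():
            for inputs, labels in val_loader:
                total_val_loss += loss_fn(net(inputs), labels).item()
        val_loss = total_val_loss / max(1, len(val_loader))
        val_history.append(val_loss)
        print("Epoch {}: validation loss {:.3f}".format(epoch + 1, val_loss))

    print("Training finished, took {:.2f}s".format(time.time() - start_time))
    return val_history
```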

During each epoch of training, we pass data to the model in batches whose size we define when we call the training loop. Data is feature engineered using the SimpleCNN class we’ve defined, and then basic metrics are printed after a few passes. During each loop, we also calculate the loss on our validation set.

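To actually train the net now only requires two lines of code: one to create an instance of the SimpleCNN class, and one to call the training function on it with settings such as the batch size.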

And that’s it! You’ve successfully trained your (first?) CNN in Pytorch.

Next Steps and Options

Accuracy Metrics

Our training loop prints out two measures of how well the CNN is doing: training loss (after batch multiples of 10) and validation loss (after each epoch). When we defined the loss and optimization functions for our CNN, we used the torch.nn.CrossEntropyLoss() function. Cross Entropy Loss, also referred to as Log Loss, outputs a non-negative value that increases as the predicted probability of the actual label moves away from 1. A perfect prediction scores 0, and there is no upper bound.

For Machine Learning pipelines, other measures of accuracy like precision, recall, and a confusion matrix might be used. Each project has different goals and limitations, so you should tailor your “metric of choice” – the measure of accuracy that you optimize for – towards those goals.
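As a rough sketch, here is one way to compute a confusion matrix and per-class precision and recall from predicted and actual labels. Both helper functions are our own illustrations, not part of Pytorch.

```python
import torch


def confusion_matrix(predicted, actual, n_classes):
    """Count matrix where rows are actual classes and columns are predicted classes."""
    matrix = torch.zeros(n_classes, n_classes, dtype=torch.int64)
    for p, a in zip(predicted.tolist(), actual.tolist()):
        matrix[a, p] += 1
    return matrix


def precision_recall(matrix, cls):
    """Precision and recall for a single class, given a confusion matrix."""
    true_positives = matrix[cls, cls].item()
    predicted_as_cls = matrix[:, cls].sum().item()   # column total
    actually_cls = matrix[cls, :].sum().item()       # row total
    precision = true_positives / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_positives / actually_cls if actually_cls else 0.0
    return precision, recall


# Toy example with 3 classes and 5 predictions.
predicted = torch.tensor([0, 0, 1, 1, 2])
actual = torch.tensor([0, 1, 1, 1, 2])
matrix = confusion_matrix(predicted, actual, n_classes=3)
```

With a trained net, the predicted labels would come from taking the argmax of the class scores for each image.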

Network Architecture

On the CIFAR-10 dataset, the loss we’re getting translates to about 60% accuracy on the training dataset. That’s much better than the base rate (what you’d get by guessing at random, or about 10% with 10 classes), but it’s still very far off from the state of the art.

One of the major differences between our model and those that achieve 80%+ accuracy is the number of layers. Our network has one convolution layer, one pooling layer, and two layers of the neural network itself (4 total). By comparison, the VGG-16 architecture has 16 weight layers and was among the top performers at the ImageNet 2014 Challenge.

To add more layers into our CNN, we can define additional layers during the initialization of our SimpleCNN class instance and call them in forward() (although by then, we might want to change the class name to LessSimpleCNN). For example, we could try:
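The sketch below adds a second convolution and pooling stage. The layer sizes (36 feature maps in the second convolution, 64 hidden units) are untested assumptions meant to show the mechanics, not a tuned architecture.

```python
import torch
import torch.nn.functional as F


class LessSimpleCNN(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
        # New: a second convolution that takes the 18 feature maps as input.
        self.conv2 = torch.nn.Conv2d(18, 36, kernel_size=3, stride=1, padding=1)
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        # Two rounds of pooling shrink 32x32 down to 8x8.
        self.fc1 = torch.nn.Linear(36 * 8 * 8, 64)
        self.fc2 = torch.nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # (3, 32, 32) -> (18, 16, 16)
        x = self.pool(F.relu(self.conv2(x)))   # (18, 16, 16) -> (36, 8, 8)
        x = x.view(x.size(0), -1)              # flatten to 2304 features
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```

Note that the input size of the first fully connected layer has to change along with the new layers, since the extra pooling step shrinks the feature maps further.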

Hyperparameter Adjustment

In addition to varying the sizes of inputs and activation functions we use, the convolution operation and max pooling have more hyperparameters that we can adjust. Fiddling with the kernel_size, stride, and padding can extract more informative features and lead to higher accuracy (if not overfitting).

Getting a CNN in Pytorch working on your laptop is very different from having one working in production. Algorithmia supports Pytorch, which makes it easy to turn this simple CNN into a model that scales in seconds and works blazingly fast. Check out our Pytorch documentation here, and consider publishing your first algorithm on Algorithmia.