Algorithmia Blog - Deploying AI at scale

Introduction to Dataset Augmentation and Expansion

Source: TDS

If your neural nets are getting larger and larger but your training sets aren’t, you’re going to hit an accuracy wall. If you want to train better models with less data, I’ve got good news for you.

Dataset augmentation – the process of applying simple and complex transformations like flipping or style transfer to your data – can help overcome the increasingly large data requirements of Deep Learning models. This post will walk through why dataset augmentation is important, how it works, and how Deep Learning fits into the equation.

It’s hard to have the right dataset from scratch

“What’s wrong with my dataset?!?”

Don’t worry, we didn’t mean to insult you. It’s not your fault: it’s Deep Learning’s fault. Algorithms are getting more and more complex, and neural nets are getting deeper and deeper. More layers in a neural net means more parameters that your model has to learn from your data. Some recent state-of-the-art models learn more than 100 million parameters during training:

Source: NanoNets
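To make the link between depth and parameter count concrete, here is a minimal sketch that counts the weights and biases in a plain fully connected network. The layer widths are made up for illustration and are not taken from any model mentioned in this post.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between two layers
        total += n_out         # one bias per output unit
    return total


# Hypothetical layer widths, chosen only for illustration.
shallow = [784, 128, 10]
deep = [784, 512, 512, 512, 10]

print(dense_param_count(shallow))  # 101770
print(dense_param_count(deep))     # 932362
```

Adding two hidden layers and widening them takes this toy network from roughly 100 thousand to nearly a million parameters, and real architectures grow much faster than that.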

When your model is trying to understand a relationship this deeply, it needs a lot of examples to learn from. That’s why popular datasets for models like these have tens of thousands of training images, and some have more than a million. Data at that scale is not at all easy to come by.

Even if you’re using simpler or smaller types of models, it’s challenging to organize a dataset large enough to train effectively. Especially as Machine Learning gets applied to newer and newer verticals, it’s becoming harder and harder to find reliable training data. If you wanted to create a classifier to distinguish iPhones from Google Pixels, how would you get thousands of different photos?

Finally, even with the right size training set, things can still go awry. Remember that algorithms don’t think like humans: while you classify images based on a natural understanding of what’s in the image, algorithms are learning that on the fly. If you’re creating a cat / dog classifier and most of your training images for dogs have a snowy background, your algorithm might end up learning the wrong rules. Having images from varied perspectives and with different contexts is crucial.

Dataset augmentation can multiply your data’s effectiveness

For all of the reasons outlined above, it’s important to be able to augment your dataset: to make it more effective without acquiring loads more training data. Dataset augmentation applies transformations to your training examples: they can be as simple as flipping an image, or as complicated as applying neural style transfer. The idea is that by changing the makeup of your data, you can improve your model’s performance and increase your training set size.

For an idea of just how much this process can help, check out this benchmark that NanoNets ran in their explainer post. Their results showed an almost 20 percentage point increase in test accuracy with dataset augmentation applied.

It’s safer for us to assume the cause of this accuracy boost was a bit more complicated than just dataset augmentation, but the message is clear: it can really help.

Before we dive into what you might practically do to augment your data, it’s worth noting that there are two broad approaches to when to augment it. In offline dataset augmentation, transforms are applied en masse to your dataset before training. You might, for example, flip each of your images horizontally, resulting in a training set with twice as many examples. In online dataset augmentation, transforms are applied in real time as batches are passed into training. This won’t increase the size of your stored dataset, but it avoids generating and storing every transformed copy up front, which makes it more practical for larger training sets.
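The difference between the two approaches is easier to see in code. This is a minimal NumPy sketch, not a production pipeline: the function names, the toy 4x4 images, and the 0.5 flip probability are all assumptions made for illustration.

```python
import numpy as np


def offline_augment(images):
    """Offline: build the flipped copies once, before training starts."""
    flipped = images[:, :, ::-1]  # horizontal flip of every image
    return np.concatenate([images, flipped], axis=0)


def online_batches(images, batch_size, rng):
    """Online: flip each image with probability 0.5 as its batch is served."""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        # Fancy indexing returns a copy, so the stored images stay untouched.
        batch = images[order[start:start + batch_size]]
        flip = rng.random(len(batch)) < 0.5
        batch[flip] = batch[flip][:, :, ::-1]
        yield batch


rng = np.random.default_rng(0)
images = rng.random((8, 4, 4))  # 8 toy grayscale images, 4x4 pixels each

bigger = offline_augment(images)
print(bigger.shape)  # (16, 4, 4): the stored dataset has doubled

for batch in online_batches(images, batch_size=4, rng=rng):
    print(batch.shape)  # (4, 4, 4): same dataset size, fresh flips each pass
```

The offline version pays the storage cost once, while the online version produces a different random variant of each image every time you loop over the data.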

How basic dataset augmentation works

Basic augmentation is super simple, at least when it comes to images: just try to imagine all the things you could do in Photoshop with a picture! A few of the simple and popular ones include:

  • Flipping (both vertically and horizontally)
  • Rotating
  • Zooming and scaling
  • Cropping
  • Translating (moving along the x or y axis)
  • Adding Gaussian noise (distortion of high frequency features)

Most of these transformations have fairly simple implementations in packages like TensorFlow. And though they might seem simple, combining them in creative ways across your dataset can yield impressive improvements in model accuracy.
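As a rough illustration of how little code these transformations need, here is a sketch of a few of them in plain NumPy. TensorFlow’s tf.image module has its own versions of several of these; the helper names below are ours, and the noise function assumes pixel values are floats between 0 and 1.

```python
import numpy as np


def flip_horizontal(img):
    return img[:, ::-1]  # reverse the column order


def flip_vertical(img):
    return img[::-1, :]  # reverse the row order


def rotate_90(img, times=1):
    # Rotation by multiples of 90 degrees needs no interpolation.
    return np.rot90(img, k=times)


def center_crop(img, height, width):
    top = (img.shape[0] - height) // 2
    left = (img.shape[1] - width) // 2
    return img[top:top + height, left:left + width]


def add_gaussian_noise(img, sigma, rng):
    # Assumes pixel values are floats in [0, 1].
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
img = rng.random((6, 8))  # toy grayscale image: 6 rows, 8 columns

# Chain several transforms to get one new training example.
augmented = add_gaussian_noise(center_crop(flip_horizontal(img), 4, 4), 0.05, rng)
print(augmented.shape)  # (4, 4)
```

Chaining the helpers, as in the last lines, is one simple way to combine transformations across a dataset.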

One issue that often comes up is input size requirements, which are one of the most frustrating parts of neural nets for practitioners. If you shift or rotate an image, you’re going to end up with empty regions or an image with different dimensions, and that needs to be fixed before training. Different approaches advocate filling in empty space with constant values, zooming in until you’ve reached the right size, or reflecting pixel values into your empty space. As with any preprocessing, testing and validating is the best way to find a definitive answer.
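Two of those fill strategies, constant values and reflection, can be sketched with NumPy’s np.pad. The translate helper is a hypothetical name, and it only handles a shift to the right to keep the example short.

```python
import numpy as np


def translate(img, dx, mode="constant"):
    """Shift img right by dx pixels (dx > 0) and keep the original width.

    mode is passed to np.pad: "constant" fills the gap with zeros,
    "reflect" mirrors neighbouring pixels, "edge" repeats the border.
    """
    padded = np.pad(img, ((0, 0), (dx, 0)), mode=mode)
    return padded[:, :img.shape[1]]  # drop the columns pushed off the right


img = np.arange(1, 13).reshape(3, 4)  # first row is [1, 2, 3, 4]

print(translate(img, 2, mode="constant")[0])  # [0 0 1 2]
print(translate(img, 2, mode="reflect")[0])   # [3 2 1 2]
```

Either way the output keeps the input’s shape, so it can go straight into a network with a fixed input size. Which fill works best is something to validate on your own data.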

Deep Learning for dataset augmentation

Moving from the simple to the complex, there are some more interesting things than just flips and rotations that you can do to your dataset to make it more robust.

Neural Style Transfer

Neural networks have proven effective in transferring stylistic elements from one image to another, like “Starry Stanford” here:

Source: Artists and Machine Intelligence

You can utilize pretrained nets that transfer exterior styles onto your training images as part of a dataset augmentation pipeline.

Generative Adversarial Networks

A new type of algorithm called a GAN has been stealing headlines lately for its ability to generate content (of all types) that’s actually pretty good. Using these types of algorithms, researchers were able to apply image-to-image translation and get some interesting results. Here are a few of the images they worked on:

Source: UC Berkeley

Although it’s not entirely computationally feasible right now, it’s clear that this kind of technology can open doors for much more sophisticated dataset augmentation.

Google’s AutoAugment

Google recently released a paper outlining a framework for AutoAugment, or using Machine Learning to augment your dataset. This is Machine Learning to improve Machine Learning: Machine Learning-ception. The idea is that the right augmentations depend on your dataset and can be learned by a model, even though the actual augmentations themselves are pretty simple.
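To give a feel for what a learned augmentation policy looks like, here is a rough sketch of applying one. In the AutoAugment paper a policy is a set of sub-policies, each a short sequence of operations with a probability and a magnitude, and one sub-policy is picked at random per image. The policy below is hand-written and hypothetical, the three operations are simplified stand-ins, and the search that finds good policies is not shown.

```python
import random

import numpy as np


def flip(img, magnitude):
    return img[:, ::-1]  # magnitude is unused for flips


def shift_right(img, magnitude):
    return np.roll(img, magnitude, axis=1)  # simplification: pixels wrap around


def brighten(img, magnitude):
    return np.clip(img + 0.1 * magnitude, 0.0, 1.0)


# A hypothetical hand-written policy, not one found by AutoAugment.
# Each sub-policy is a sequence of (operation, probability, magnitude).
POLICY = [
    [(flip, 0.5, 0), (brighten, 0.8, 2)],
    [(shift_right, 0.6, 1), (flip, 0.3, 0)],
]


def apply_policy(img, policy, rng):
    """Pick one sub-policy at random and apply its operations in order."""
    sub_policy = rng.choice(policy)
    for op, prob, magnitude in sub_policy:
        if rng.random() < prob:  # each operation fires with its own probability
            img = op(img, magnitude)
    return img


rng = random.Random(0)
img = np.random.default_rng(0).random((4, 4))  # toy image with values in [0, 1)

augmented = apply_policy(img, POLICY, rng)
print(augmented.shape)  # (4, 4)
```

The augmentations themselves stay simple. What gets learned is which operations to combine, how often to apply them, and how strongly.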

On Algorithmia, our Image Augmentation algorithm implements the ideas put forth in the paper. It’s as easy as sending a directory of images to our API endpoint and getting the augmented images back. Check it out and consider implementing it as part of your Machine Learning pipeline.


Data augmentation: how to use Deep Learning when you have limited data (NanoNets) – “Naturally, if you have a lot of parameters, you would need to show your machine learning model a proportional amount of examples, to get good performance. Also, the number of parameters you need is proportional to the complexity of the task your model has to perform.”

How to improve your image classifier with Google’s AutoAugment (TDS) – “AutoML, the idea of using Machine Learning to improve Machine Learning design choices like architectures or optimizers, has reached the space of data augmentation. This article explains what Data Augmentation is, how Google’s AutoAugment searches for the best augmentation policies and how you can transfer these policies to your own image classification problem.”

Data augmentation (Kaggle) – “At the end of this lesson, you will be able to use data augmentation. This trick makes it seem like you have far more data than you actually have, resulting in even better models.”

Understanding Data Augmentation for Classification: When to Warp? (Paper) – “In this paper we investigate the benefit of augmenting data with synthetically created samples when training a machine learning classifier. Two approaches for creating additional training samples are data warping, which generates additional samples through transformations applied in the data-space, and synthetic oversampling, which creates additional samples in feature-space.”

Dataset augmentation in feature space (Paper) – “Dataset augmentation, the practice of applying a wide array of domain-specific transformations to synthetically expand a training set, is a standard tool in supervised learning. While effective in tasks such as visual recognition, the set of transformations must be carefully designed, implemented, and tested for every new domain, limiting its re-use and generality. In this paper, we adopt a simpler, domain-agnostic approach to dataset augmentation.”

AutoAugment: Learning Augmentation Policies from Data (Paper) – “In this paper, we take a closer look at data augmentation for images, and describe a simple procedure called AutoAugment to search for improved data augmentation policies. Our key insight is to create a search space of data augmentation policies, evaluating the quality of a particular policy directly on the dataset of interest.”