Algorithmia Blog - Deploying AI at scale

Introduction to Dataset Augmentation and Expansion

Source: TDS

If your neural nets are getting larger and larger but your training sets aren’t, you’re going to hit an accuracy wall. If you want to train better models with less data, I’ve got good news for you.

Dataset augmentation – the process of applying simple and complex transformations like flipping or style transfer to your data – can help overcome the increasingly large data requirements of Deep Learning models. This post will walk through why dataset augmentation is important, how it works, and how Deep Learning fits into the equation.


It’s hard to build the right dataset from scratch

“What’s wrong with my dataset?!?”

Don’t worry, we didn’t mean to insult you. It’s not your fault: it’s Deep Learning’s fault. Algorithms are getting steadily more complex, and neural nets are getting deeper and deeper. More layers in a neural net mean more parameters that your model has to learn from your data. In some recent state-of-the-art models, more than 100 million parameters are learned during training:

Source: NanoNets
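
To get a feel for how quickly parameters pile up as layers are added, here is a minimal sketch that counts the weights and biases in a plain fully connected network. The layer widths are made up for illustration and do not correspond to any model mentioned in this post.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input/output pair
        total += n_out         # one bias per output unit
    return total


# Hypothetical layer widths, chosen only for illustration.
shallow = dense_param_count([784, 128, 10])
deep = dense_param_count([784, 512, 512, 512, 10])
print(shallow, deep)
```

Even this toy comparison shows the deeper, wider network carrying roughly nine times as many parameters as the shallow one.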

When your model is trying to understand a relationship this deeply, it needs a lot of examples to learn from. That’s why popular datasets for models like these might have something like 10,000 images for training. A dataset of that size is not at all easy to come by.

Even if you’re using simpler or smaller types of models, it’s challenging to organize a dataset large enough to train effectively. Especially as Machine Learning gets applied to newer and newer verticals, it’s becoming harder and harder to find reliable training data. If you wanted to create a classifier to distinguish iPhones from Google Pixels, how would you get thousands of different photos?

Finally, even with the right size training set, things can still go awry. Remember that algorithms don’t think like humans: while you classify images based on a natural understanding of what’s in the image, algorithms are learning that on the fly. If you’re creating a cat / dog classifier and most of your training images for dogs have a snowy background, your algorithm might end up learning the wrong rules. Having images from varied perspectives and with different contexts is crucial.

Dataset augmentation can multiply your data’s effectiveness

For all of the reasons outlined above, it’s important to be able to augment your dataset: to make it more effective without acquiring loads more training data. Dataset augmentation applies transformations to your training examples: they can be as simple as flipping an image, or as complicated as applying neural style transfer. The idea is that by changing the makeup of your data, you can improve your model’s performance and increase your training set size.

For an idea of just how much this process can help, check out this benchmark that NanoNets ran in their explainer post. Their results showed an almost 20 percentage point increase in test accuracy with dataset augmentation applied.

It’s safer for us to assume the cause of this accuracy boost was a bit more complicated than just dataset augmentation, but the message is clear: it can really help.

Before we dive into what you might practically do to augment your data, it’s worth noting that there are two broad approaches to when you augment it. In offline dataset augmentation, transforms are applied en masse to your dataset before training. You might, for example, flip each of your images horizontally and vertically; keeping the originals alongside both flipped copies gives you a training set with three times as many examples. In online dataset augmentation, transforms are applied in real time as batches are passed into training. This doesn’t increase the size of the stored dataset, but it is much quicker to set up for larger training sets, since no transformed copies have to be generated and stored up front.
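
The difference between the two approaches can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: the function names, the `(N, H, W)` array layout, and the 50% flip probability are all choices made for the example.

```python
import numpy as np


def augment_offline(images):
    """Return the originals plus horizontally and vertically flipped copies."""
    flipped_lr = images[:, :, ::-1]  # reverse the width axis
    flipped_ud = images[:, ::-1, :]  # reverse the height axis
    return np.concatenate([images, flipped_lr, flipped_ud], axis=0)


def batches_online(images, batch_size, rng):
    """Yield shuffled batches, flipping each image at random on the fly."""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        batch = images[order[start:start + batch_size]].copy()
        flip = rng.random(len(batch)) < 0.5      # assumed 50% flip chance
        batch[flip] = batch[flip][:, :, ::-1]    # horizontal flip
        yield batch


rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))  # 8 hypothetical grayscale images, (N, H, W)

offline_set = augment_offline(images)  # stored set: originals + two flips each
online_batches = list(batches_online(images, batch_size=4, rng=rng))
```

The offline version stores three images for every original, while the online version keeps the stored set at eight images and produces a different random variant each time a batch is drawn.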

How basic dataset augmentation works

Basic augmentation is super simple, at least when it comes to images: just try to imagine all the things you could do in Photoshop with a picture! A few of the simple and popular ones include:

  • Flipping (both vertically and horizontally)
  • Rotating
  • Zooming and scaling
  • Cropping
  • Translating (moving along the x or y axis)
  • Adding Gaussian noise (distortion of high frequency features)

Most of these transformations have fairly simple implementations in packages like TensorFlow. And though they might seem simple, combining them in creative ways across your dataset can yield impressive improvements in model accuracy.
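
To make the list above concrete, here is a rough NumPy sketch of several of these transforms. It assumes a single grayscale image stored as a 2D float array with values in [0, 1]; the function names are our own. Zooming and rotation by arbitrary angles need interpolation, so they are better left to an image library and are not shown.

```python
import numpy as np


def flip_horizontal(img):
    return img[:, ::-1]


def flip_vertical(img):
    return img[::-1, :]


def rotate_90(img, turns=1):
    """Rotate counterclockwise in 90 degree steps (no interpolation needed)."""
    return np.rot90(img, k=turns)


def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]


def translate(img, dx, dy):
    """Shift by (dx, dy) pixels, filling the vacated pixels with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip back into [0, 1]."""
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(42)
image = rng.random((32, 32))  # stand-in for a real training image

# Transforms compose: flip, then shift, then add a little noise.
augmented = add_gaussian_noise(
    translate(flip_horizontal(image), dx=3, dy=-2), sigma=0.05, rng=rng
)
```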

One issue that often comes up is input size requirements, which are one of the most frustrating parts of neural nets for practitioners. If you shift or rotate an image, part of the content moves out of the frame and empty space appears in its place (or, if you keep everything, the image ends up a different size), and that needs to be fixed before training. Different approaches advocate filling in empty space with constant values, zooming in until you’ve reached the right size, or reflecting pixel values into your empty space. As with any preprocessing, testing and validating on your own data is the best way to find out which works for you.
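
Two of these fill strategies, constant values and reflection, map directly onto the `mode` argument of NumPy’s `np.pad`. The sketch below shifts a tiny image to the right and crops it back to its original width; the helper name `shift_right` is ours, and the zoom-based approach is not shown.

```python
import numpy as np


def shift_right(img, pixels, mode):
    """Shift an image right by `pixels` while keeping its original width.

    mode="constant" fills the vacated columns with zeros;
    mode="reflect" mirrors neighbouring pixel values into them.
    """
    padded = np.pad(img, ((0, 0), (pixels, 0)), mode=mode)
    return padded[:, :img.shape[1]]  # crop back to the input width


img = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

constant_fill = shift_right(img, 2, "constant")
# [[0 0 1 2]
#  [0 0 5 6]]

reflect_fill = shift_right(img, 2, "reflect")
# [[3 2 1 2]
#  [7 6 5 6]]
```

Constant fill introduces a hard edge that the model could latch onto, while reflection keeps the local statistics of the image more natural; which one trains better is something to validate per dataset.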

Deep Learning for dataset augmentation

Moving from the simple to the complex, there are some more interesting things than just flips and rotations that you can do to your dataset to make it more robust.

Neural Style Transfer

Neural networks have proven effective in transferring stylistic elements from one image to another, like “Starry Stanford” here:

Source: Artists and Machine Intelligence

You can utilize pretrained nets that transfer exterior styles onto your training images as part of a dataset augmentation pipeline.

Generative Adversarial Networks

A new type of algorithm called a GAN (Generative Adversarial Network) has been stealing headlines lately for its ability to generate content (of all types) that’s actually pretty good. Using these types of algorithms, researchers were able to apply image-to-image translation and get some interesting results. Here are a few of the images they worked on:

Source: UC Berkeley

Although it’s not entirely computationally feasible right now, it’s clear that this kind of technology can open doors for much more sophisticated dataset augmentation.

Google’s AutoAugment

Google recently released a paper outlining a framework called AutoAugment, which uses Machine Learning to augment your dataset. This is Machine Learning to improve Machine Learning: Machine Learning-ception. The idea is that the right augmentations depend on your dataset and can be learned by a model, even though the actual augmentations themselves are pretty simple.
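
In the paper, a learned policy is a set of sub-policies, each a short sequence of simple operations with a probability and a magnitude attached. The sketch below shows only how such a policy might be applied to an image, not how it is searched for. The operations, probabilities, and magnitudes are hypothetical placeholders rather than values from the paper, and the image is assumed to be a 2D float array in [0, 1].

```python
import numpy as np


def flip_lr(img, magnitude):
    return img[:, ::-1]  # magnitude is ignored for flips


def shift_x(img, magnitude):
    # Wrap-around shift, used here only to keep the sketch short.
    return np.roll(img, magnitude, axis=1)


def brighten(img, magnitude):
    return np.clip(img + 0.1 * magnitude, 0.0, 1.0)


OPS = {"flip_lr": flip_lr, "shift_x": shift_x, "brighten": brighten}

# Hypothetical policy: each sub-policy is two (operation, probability,
# magnitude) triples. A real policy would be learned, not hand-written.
POLICY = [
    [("flip_lr", 0.5, 0), ("brighten", 0.8, 2)],
    [("shift_x", 0.7, 3), ("flip_lr", 0.3, 0)],
    [("brighten", 0.6, 1), ("shift_x", 0.4, 2)],
]


def apply_policy(img, policy, rng):
    """Pick one sub-policy at random and apply its operations in order."""
    sub_policy = policy[rng.integers(len(policy))]
    for name, prob, magnitude in sub_policy:
        if rng.random() < prob:  # each operation fires with its own probability
            img = OPS[name](img, magnitude)
    return img


rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real training image
augmented = apply_policy(image, POLICY, rng)
```

Calling `apply_policy` once per image per batch turns a learned policy into an online augmentation step.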

On Algorithmia, our Image Augmentation algorithm implements the ideas put forth in the paper. It’s as easy as sending a directory of images to our API endpoint and getting the augmented images back. Check it out and consider implementing it as part of your Machine Learning pipeline.

Vertical Spotlight: Machine Learning for Healthcare Diagnostics

Source: Case Engineering

Diagnostics is part of the core of healthcare — research suggests a third of all Healthcare AI SaaS companies are tackling just this sector.

Machine Learning can automate parts of the diagnostic stack, aid doctors in deciding how to interpret tests, and greatly reduce errors in communication. This post will walk through popular use cases, the challenges inherent in applying ML models in diagnostics, and some of the tradeoffs to be made in model selection.

Read More…

Document Classifier: use cases for your business

Source: TDS

We recently went into detail about the Document Classifier algorithm in our spotlight. That’s all well and good, but it’s not immediately clear what you can do with it.

In this post, we’ll focus on potential use cases. We’ll start with a quick refresher on what this algorithm does, and then look at concrete examples of real world problems that this algorithm can tackle – and why it makes sense for you to give it a go. Read More…

Vertical Spotlight: Machine Learning for financial fraud

For every dollar of fraud that financial services companies suffer, they incur $2.67 in costs to their business. With more entry points in the digital age and increasingly sophisticated attackers, tackling fraud manually is quickly fading to irrelevance: but Machine Learning offers a promising way to automate the process, as well as surface more nuanced fraud patterns.

This post will walk through the challenges of applying ML models to fraud detection, popular applications, and tradeoffs to think about in model selection.

Read More…

Algorithmia Engineering Team Spotlight: Besir

At Algorithmia we’re lucky enough to be surrounded by a group of wildly intelligent, quirky, and fun engineers. We’d love for you to come by and meet them in person, but until then we’ll post a series of interviews introducing you to some of the talented people who are creating the future of AI.

For today’s installment, we’d like to introduce you to Besir! He’s an Algorithm Engineer who develops unique AI/ML algorithms. You may remember him from our work on the DanKu protocol, the first trustless contract on the Ethereum blockchain. Fun fact: the Ku of DanKu is for Besir’s last name.

Read More…