
Deep Dive into Parallelized Video Processing

Where it all began

At Algorithmia, one of our driving goals is to enable all developers to stand on the shoulders of algorithmic giants. Like Lego bricks, our users can construct amazing devices and tools by combining algorithmic building blocks like FaceDetection or Smart Image Downloader.

As a platform, Algorithmia is unique in that we're able to scale to meet any volume of concurrent algorithm requests: even if your algorithm is making 10,000 API requests to a particular image processing algorithm, it won't affect the quality of service other users experience.

One of the earliest projects I worked on at Algorithmia was to construct a video processing pipeline which would leverage our existing image processing algorithms as building blocks. The project was designed to improve the reach of our image processing algorithms by automatically enabling them to become video processing algorithms.

After the first couple of weeks, the first interface and process flow was starting to come together, and by using ffmpeg we were able to easily split videos into frames and concatenate them back into any video format we wanted. However, it quickly became apparent how fragile this initial process flow was, and how difficult it was for an end user to work with.
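As a rough illustration of that split-and-reassemble step, the whole round trip boils down to two ffmpeg invocations, driven here from Rust. This is a sketch only: the paths and the 30 fps rate are placeholder assumptions, not values from our pipeline.

use std::process::Command;

// Illustrative sketch: explode a video into numbered PNG frames, then stitch
// (processed) frames back into a video. Paths and the 30 fps rate are
// placeholders, not values from the production pipeline.
fn split_into_frames(video: &str, frame_dir: &str) -> std::io::Result<()> {
    let pattern = format!("{}/%07d.png", frame_dir);
    let status = Command::new("ffmpeg")
        .args(["-i", video, "-r", "30", pattern.as_str()])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "ffmpeg failed to split frames"));
    }
    Ok(())
}

fn frames_to_video(frame_dir: &str, output: &str) -> std::io::Result<()> {
    let pattern = format!("{}/%07d.png", frame_dir);
    let status = Command::new("ffmpeg")
        .args(["-framerate", "30", "-i", pattern.as_str(), "-c:v", "libx264", "-pix_fmt", "yuv420p", output])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "ffmpeg failed to assemble video"));
    }
    Ok(())
}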

What we learned

Since state and frame processing were intended to be managed by the user, it led to a situation where a complicated business logic script had to be written for each user application. This logic would be required to parallelize and process the individual video frames using a chosen image processing algorithm, which, depending on the user's experience with concurrency and parallelism, might not have been possible without help.

To improve performance, the intended process was to split a video into smaller sub-videos, which were then split into batches of frames. Unfortunately, splitting into sub-videos required some state that we weren't capturing: there was no way to tell each sub-video when to start sampling frames. Because of this we lost the cadence of a consistent frame rate, as each video chunk started sampling at frame 0.
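To make the missing state concrete, here is a minimal sketch (illustrative, not code from that version) of the bookkeeping each chunk needed: when every n-th frame is sampled globally, a chunk has to know its absolute starting frame so it doesn't restart the sampling grid at zero.

// Illustrative only: given a global sampling interval and the absolute frame
// number at which a sub-video starts, find the offset within that chunk of
// the first frame that lies on the global sampling grid.
fn first_sample_offset(chunk_start_frame: u64, step: u64) -> u64 {
    (step - (chunk_start_frame % step)) % step
}

fn main() {
    // With a step of 5, a chunk starting at absolute frame 1003 should take
    // its first sample at local offset 2 (absolute frame 1005), not offset 0.
    assert_eq!(first_sample_offset(1003, 5), 2);
    assert_eq!(first_sample_offset(1000, 5), 0);
}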

The transition to Rust

A coworker of mine had recently become obsessed with the Rust language. We talked about Rust and its merits for a few months, and it definitely seemed like an interesting choice for large-scale projects.

The fact that the language guarantees thread safety and memory safety through strict compile-time checks, while not restricting your control over low-level optimizations, was pretty impressive. Not only do you get all that, but Rust also delivers novel features like user-defined, composable macros, making it a clear winner.

After doing more research into how easy it is to parse JSON in Rust (a sore spot even today with Scala and Java), we decided to swap out Scala and give the language a shot when developing the next version of the video processing project.
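For a sense of what that looks like in practice, here is a minimal example using the serde and serde_json crates; the VideoRequest struct and its fields are hypothetical, chosen only to resemble the kind of request the project handles.

use serde::Deserialize;

// Hypothetical request shape, used only to show how little ceremony JSON
// parsing takes in Rust: derive Deserialize and call serde_json::from_str.
#[derive(Deserialize, Debug)]
struct VideoRequest {
    input_file: String,
    output_file: String,
    fps: Option<f64>,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"{
        "input_file": "data://demo/in.mp4",
        "output_file": "data://demo/out.mp4",
        "fps": 14.0
    }"#;
    let request: VideoRequest = serde_json::from_str(raw)?;
    println!("parsed request: {:?}", request);
    Ok(())
}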

General purpose abstractions and extendability

When we revisited the video processing algorithm, one of the bigger tasks was to resolve the clunkiness of the algorithm flow. We wanted to improve the usability of the algorithm by moving as much logic as possible into the algorithm itself, rather than forcing users to handle all the parallel frame processing logic.

Nearly every image processor on Algorithmia comes in one of two general flavors: one type transforms an image into another image (DeepFilter, SalNet, CensorFace, etc.), and the other extracts data from an image and returns JSON (InceptionNet, FaceDetection, NudityDetection, etc.).

It turns out that both types have similar input requirements; however, only the image transformation algorithms require an output image URL.

The differences between the two processor types are part of what motivated us to split Video Metadata Extraction and Video Transform into separate algorithms, even though they fundamentally share a significant amount of source code.

[Flow charts detailing how Video Transform and Video Metadata Extraction work]

JSON composability

We realized quickly that if we wanted our two video algorithms to support the vast ecosystem of algorithm I/O styles and formats, we couldn't hardcode a unique implementation for each algorithm.

To solve this problem, we designed an advanced JSON template extractor that digests an image processing algorithm's input interface and then applies it internally in the parallel image processing loop.

This templating engine allows the video processing algorithms to interface with nearly every image processing algorithm on the platform, and by design it enables a huge number of API variations to work automatically with our video algorithms, like this one:

API template

{
    "images": "$BATCH_INPUT",
    "savePaths": "$BATCH_OUTPUT",
    "filterName": "neo_instinct"
}

Input passed to algorithm's API internally

{
    "images": [
        "data://.session/1019661a-c986-4e8c-922b-b01a51a8a2ec-0002264.png",
        "data://.session/1019661a-c986-4e8c-922b-b01a51a8a2ec-0002396.png",
        "data://.session/1019661a-c986-4e8c-922b-b01a51a8a2ec-0000285.png",
        "data://.session/1019661a-c986-4e8c-922b-b01a51a8a2ec-0001981.png"
    ],
    "savePaths": [
        "data://.session/116b6620-be75-4c1f-a622-0112fa902d7a-0002264.png",
        "data://.session/116b6620-be75-4c1f-a622-0112fa902d7a-0002396.png",
        "data://.session/116b6620-be75-4c1f-a622-0112fa902d7a-0000285.png",
        "data://.session/116b6620-be75-4c1f-a622-0112fa902d7a-0001981.png"
    ],
    "filterName": "neo_instinct"
}

By having users provide us with an algorithm's API template, we were able to avoid hardcoding an implementation for every possible image processing algorithm into each video processor, reducing maintenance requirements and improving versatility.

Of course, it's one thing to describe how it works and another to show it. Below is a minimal JSON template extraction and keyword substitution example.
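The sketch below reconstructs the substitution step with serde_json. The $BATCH_INPUT and $BATCH_OUTPUT keywords match the template shown above, but the function itself is illustrative rather than the production engine, which handles many more input and output formats.

use serde_json::{json, Value};

// Illustrative only: walk a user-supplied API template and replace the
// reserved keywords with the frame batch currently being processed.
fn substitute(template: &Value, inputs: &[String], outputs: &[String]) -> Value {
    match template {
        Value::String(s) if s == "$BATCH_INPUT" => json!(inputs),
        Value::String(s) if s == "$BATCH_OUTPUT" => json!(outputs),
        Value::Array(items) => Value::Array(
            items.iter().map(|v| substitute(v, inputs, outputs)).collect(),
        ),
        Value::Object(map) => Value::Object(
            map.iter()
                .map(|(k, v)| (k.clone(), substitute(v, inputs, outputs)))
                .collect(),
        ),
        other => other.clone(),
    }
}

fn main() {
    let template = json!({
        "images": "$BATCH_INPUT",
        "savePaths": "$BATCH_OUTPUT",
        "filterName": "neo_instinct"
    });
    let inputs = vec!["data://.session/frame-0000001.png".to_string()];
    let outputs = vec!["data://.session/output-0000001.png".to_string()];
    println!("{}", substitute(&template, &inputs, &outputs));
}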

The quest for real-time video processing

Generic abstractions and composability are nice, but you also want your video processed in a reasonable amount of time, so spending effort on performance plays an important role.

Concurrency and Rust

The bulk of the computation is done within the image processing algorithms themselves. As each frame is processed, potentially millions of operations are performed to extract information from the image using convolutional neural networks and other ML tools. But computation isn't the only expense; data transfers eventually start adding up as well.

Processing times here are taken from our DeepFilter algorithm, but are comparable to those of most image processing algorithms.

To reduce this bottleneck, we needed a solution that parallelizes our frame processing tasks. Thankfully, Rust has the perfect library for this: Rayon.

Rayon has a powerful abstraction called the parallel iterator, which lets the Rust developer avoid creating and managing their own thread pool and input/output queue buffers. By using this safe and reliable abstraction we were able to simplify our multi-threading process, allowing us to safely implement dynamic threading backoffs and early exits on exceptions from individual image processing algorithms, as the sketch below illustrates.
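In the sketch, process_batch is a hypothetical stand-in for the call that fills in the JSON template and invokes the remote image processing algorithm; collecting a parallel iterator of Results gives us the early-exit behavior almost for free.

use rayon::prelude::*;

// Hypothetical stand-in for the remote call: fill in the JSON template for
// this batch of frames and invoke the image processing algorithm.
fn process_batch(batch: &[String]) -> Result<Vec<String>, String> {
    // ... call the image processing algorithm, return the output frame paths ...
    Ok(batch.to_vec())
}

fn process_frames(batches: Vec<Vec<String>>) -> Result<Vec<String>, String> {
    // Rayon's parallel iterator manages the thread pool for us; collecting
    // into a Result short-circuits when any batch fails, so one bad response
    // from an algorithm stops the whole job instead of wasting work.
    let processed: Result<Vec<Vec<String>>, String> = batches
        .par_iter()
        .map(|batch| process_batch(batch))
        .collect();
    Ok(processed?.into_iter().flatten().collect())
}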

Conclusion

Algorithmia is a platform that allows users to compose different algorithms together to create something larger than the sum of its parts. Why not try your hand at making your own algorithmic Lego masterpiece?

If you’d like to see just what our video processing algorithms can do, take a look at these sweet demos: