Facial recognition has become an increasingly common technology.
Today smartphones use facial recognition for access control, and animated movies use facial recognition software to bring realistic human movement and expression to life. Police surveillance cameras use it to identify people who have warrants out for their arrest, and retail stores use it for targeted marketing campaigns. And of course, celebrity look-alike apps and Facebook’s auto tagger also use facial recognition to tag faces.
Not all facial recognition libraries are equal in accuracy and performance, and most state-of-the-art systems are proprietary black boxes.
OpenFace is an open-source library that rivals the performance and accuracy of proprietary models. This project was created with mobile performance in mind, so let’s look at some of the internals that make this library fast and accurate, and work through some use cases that show why you might want to implement it in your projects.
High-level Architectural Overview
OpenFace is a deep learning facial recognition model developed by Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. It’s based on the paper FaceNet: A Unified Embedding for Face Recognition and Clustering by Florian Schroff, Dmitry Kalenichenko, and James Philbin at Google, and is implemented using Python and Torch so it can be run on CPUs or GPUs.
While OpenFace is only a couple of years old, it’s been widely adopted because it offers levels of accuracy similar to facial recognition models found in private state-of-the-art systems such as Google’s FaceNet or Facebook’s DeepFace.
What’s particularly nice about OpenFace, besides being open source, is that development of the model focused on real-time face recognition on mobile devices, so you can train a model with high accuracy with very little data on the fly.
From a high-level perspective, OpenFace uses Torch, a scientific computing framework, to do training offline, meaning it’s only done once by OpenFace and the user doesn’t have to get their hands dirty training on hundreds of thousands of images themselves. Those images are fed into a neural net for feature extraction using Google’s FaceNet model. FaceNet relies on a triplet loss function to measure how well the neural net separates one person’s face from another’s, and it is able to cluster faces because the resulting embeddings are measured on a unit hypersphere.
This trained neural net is later used in the Python implementation after new images are run through dlib’s face-detection model. Once the faces are normalized by OpenCV’s affine transformation so all faces are oriented in the same direction, they are sent through the trained neural net in a single forward pass. This results in a 128-dimensional facial embedding that can be used for classification and matching, or in a clustering algorithm for similarity detection.
During the training portion of the OpenFace pipeline, 500,000 images are passed through the neural net. These images are from two public datasets: CASIA-WebFace, which comprises 10,575 individuals for a total of 494,414 images, and FaceScrub, which is made up of 530 individuals with a total of 106,863 images.
The point of training the neural net on all these images ahead of time is that it wouldn’t be possible on mobile or in any other real-time scenario to train on 500,000 images to retrieve the needed facial embeddings. Remember, this portion of the pipeline is only done once: OpenFace trains on these images so the network learns to produce a 128-dimensional embedding for any face, and that network is later used in the Python training-on-the-fly part of the pipeline. Then instead of matching an image in high-dimensional space, you’re only using low-dimensional data, which helps make this model fast.
As mentioned before, OpenFace uses Google’s FaceNet architecture for feature extraction and uses a triplet loss function to test how accurately the neural net classifies a face. It does this by training on three different images at a time: a known face image called the anchor, another image of that same person called the positive, and an image of a different person called the negative.
If you want to learn more about triplet loss, check out Andrew Ng’s Convolutional Neural Network Coursera video.
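To make the anchor, positive, and negative roles concrete, here is a minimal sketch of a triplet loss in plain Python. It is an illustration only, not OpenFace’s Torch training code: the 3-dimensional toy vectors, the function names, and the margin of 0.2 are all assumptions chosen for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is, measured in squared distance.
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 3-D stand-ins for face embeddings (hypothetical values).
anchor = l2_normalize([1.0, 0.1, 0.0])
positive = l2_normalize([0.9, 0.2, 0.0])   # same person as the anchor
negative = l2_normalize([0.0, 1.0, 0.0])   # different person

well_separated = triplet_loss(anchor, positive, negative)
# Swapping the roles simulates an embedding that gets the triplet wrong.
badly_separated = triplet_loss(anchor, negative, positive)

print(well_separated, badly_separated)
```

A well-separated triplet contributes no loss, while the swapped triplet produces a large one, which is the signal that pushes the network to pull same-person images together and push different people apart.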
The great thing about using triplet embeddings is that the embeddings are measured on a unit hypersphere, where Euclidean distance is used to determine which images are closer together and which are farther apart. The negative image embedding ends up farther from the positive and anchor embeddings, while those two end up closer to each other. This is important because it allows for clustering algorithms to be used for similarity detection. You might want to use a clustering algorithm if you wanted to detect family members on a genealogy site, for example, or on social media for possible marketing campaigns.
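As a rough illustration of how distance on the hypersphere enables grouping, the sketch below clusters toy embeddings with a simple greedy threshold rule. This is a stand-in technique picked for brevity, not an algorithm that OpenFace prescribes; the vectors, the `greedy_cluster` helper, and the threshold of 0.5 are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_cluster(embeddings, threshold):
    """Group embeddings by distance to each cluster's first member.

    An embedding joins the first cluster whose first member is closer
    than `threshold`; otherwise it starts a new cluster. Returns a list
    of clusters, each a list of indices into `embeddings`.
    """
    clusters = []
    for index, embedding in enumerate(embeddings):
        for cluster in clusters:
            representative = embeddings[cluster[0]]
            if squared_distance(embedding, representative) < threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# Toy 3-D stand-ins for 128-dimensional embeddings (hypothetical values):
# the first two resemble each other, as do the last two.
faces = [l2_normalize(v) for v in [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]]

groups = greedy_cluster(faces, threshold=0.5)
print(groups)
```

In practice you would likely reach for an established clustering algorithm, but the idea is the same: once faces are embedded, similarity becomes a distance comparison.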
Isolate Face from Background
Now that we’ve covered how OpenFace uses Torch to train on hundreds of thousands of images from public datasets to get low-dimensional face embeddings, we can check out its use of the popular face-detection library dlib and see why you’d want to use it instead of OpenCV’s face detection library.
One of the first steps in facial recognition software is to isolate the actual face from the background of the image along with isolating each face from others found in the image. Face detection algorithms also must be able to deal with bad and inconsistent lighting and various facial positions such as tilted or rotated faces. Luckily dlib along with OpenCV handles all these issues. Dlib takes care of finding the fiducial points on the face while OpenCV handles the normalization of the facial position.
It’s important to note that when using OpenFace, you can either implement dlib for face detection, which uses a combination of HOG (Histogram of Oriented Gradients) and a Support Vector Machine (SVM), or OpenCV’s Haar cascade classifier. Both are trained on positive and negative images (meaning there are images that have faces and ones that don’t), but they are very different in implementation, speed, and accuracy.
There are several benefits to using the HOG classifier. First, the training is done using a sliding sub-window on the image, so no subsampling and parameter manipulation is required like it is in the Haar classifier. This makes dlib’s HOG and SVM face detection easier to use and faster to train. It also means that less data is required. HOG also has higher accuracy for face detection than the Haar cascade classifier. It kind of makes using dlib’s HOG + SVM a no-brainer for face detection!
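For a feel of what the “histogram of oriented gradients” part means, here is a simplified sketch that builds the orientation histogram for a single image cell. It is not dlib’s implementation: a real detector adds block normalization, a sliding window, and the trained SVM. The 9-bin layout, the helper name, and the toy 6×6 patch are assumptions.

```python
import math

def orientation_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees).

    `cell` is a 2-D list of grayscale intensities. Each interior pixel
    votes for the bin of its gradient direction, weighted by its
    gradient magnitude. Border pixels are skipped for simplicity.
    """
    height, width = len(cell), len(cell[0])
    histogram = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the image gradient.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            histogram[int(angle // bin_width) % bins] += magnitude
    return histogram

# Toy patch with a vertical edge: dark on the left, bright on the right.
patch = [[0, 0, 0, 255, 255, 255] for _ in range(6)]

hist = orientation_histogram(patch)
print(hist)
```

All of the gradient energy for this patch lands in the first bin because the intensity only changes horizontally. A face produces a characteristic pattern of such histograms across many cells, and that pattern is what the SVM learns to recognize.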
Along with finding each face in an image, part of the process in facial recognition is preprocessing the images to handle problems such as inconsistent and bad lighting, converting images to grayscale for faster training, and normalization of facial position.
While some facial recognition models can handle these issues by training on massive datasets, OpenFace uses OpenCV’s 2D affine transformation, which rotates the face and makes the position of the eyes, nose, and mouth for each face consistent. There are 68 facial landmarks available for the affine transformation, and the distances between those points are measured and compared to the points found in an average face image. Then the image is rotated and transformed based on those points to normalize the face for comparison and cropped to 96×96 pixels for input to the trained neural net.
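The alignment step boils down to finding the 2×3 affine matrix that moves a few detected landmarks onto their positions in a template face. The sketch below solves for that matrix from three point pairs in plain Python, which is the same kind of computation OpenCV’s `cv2.getAffineTransform` performs. The landmark coordinates and helper names are hypothetical, and warping the actual pixels is left out.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:col] + [b[i]] + row[col + 1:]
                    for i, row in enumerate(a)]
        solution.append(det3(replaced) / d)
    return solution

def affine_from_points(src, dst):
    """Return the 2x3 matrix mapping three src points onto three dst points."""
    a = [[x, y, 1.0] for x, y in src]
    row_x = solve3(a, [p[0] for p in dst])
    row_y = solve3(a, [p[1] for p in dst])
    return [row_x, row_y]

def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

# Hypothetical landmarks (left eye, right eye, nose) on a tilted face.
detected = [(30.0, 40.0), (70.0, 50.0), (45.0, 75.0)]
# Hypothetical positions of the same landmarks in a 96x96 template.
template = [(24.0, 32.0), (72.0, 32.0), (48.0, 60.0)]

matrix = affine_from_points(detected, template)
aligned = [apply_affine(matrix, p) for p in detected]
print(aligned)
```

After the transform, the eyes sit level and at fixed coordinates, so every face reaches the neural net in the same pose regardless of how it was tilted in the original photo.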
So, after we isolate the face from the background and preprocess it using dlib and OpenCV, we can pass the image into the trained neural net that was created in the Torch portion of the pipeline. In this step, there is a single forward pass on the neural net to get a 128-dimensional embedding (a compact vector of facial features) that is used in prediction. These low-dimensional facial embeddings are then used in classification or clustering algorithms.
For classification tests, OpenFace uses a linear support-vector machine, which is commonly used in the real world to match image features. An impressive thing about OpenFace is that at this point it takes only a few milliseconds to classify images.
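To show why classifying low-dimensional embeddings is cheap, here is a from-scratch sketch of a linear SVM trained by subgradient descent on the hinge loss. It is not the classifier OpenFace ships; the 2-D toy vectors stand in for 128-dimensional embeddings, and the learning rate, regularization strength, and epoch count are assumptions.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights and bias for a linear SVM; labels must be +1 or -1.

    Uses per-sample subgradient descent on the L2-regularized hinge loss.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # Regularization step: shrink the weights slightly.
            weights = [w * (1 - lr * reg) for w in weights]
            # Hinge step: only samples inside the margin move the boundary.
            if y * score < 1:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias

def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Toy 2-D embeddings for two people (hypothetical values).
samples = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
person_a = predict(weights, bias, [0.85, 0.15])
person_b = predict(weights, bias, [0.15, 0.85])
print(person_a, person_b)
```

Prediction is a single dot product plus a bias, which is why adding a new face and re-classifying can happen quickly once the embeddings exist.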
Now that we’ve gone through OpenFace’s architecture at a high level, we can cover some interesting use case ideas for the open-source library. As previously mentioned, facial recognition is used as a form of access control and identification. One idea that we explored a couple of years ago was to use it for identification and customization of visitor experience when entering our office. Wow, that was a long time ago in startup world! But you could also think about creating a mobile app that identifies VIP folks entering a club or party. Bouncers wouldn’t have to remember everyone’s face or rely on a list of names for letting people enter. It would also be easy to add new faces to the training data on the fly, and the model would be trained by the time the individual went back outside for a breath of fresh air and wanted to enter the club again. Along those same lines, you could use facial identification for a meeting where temporary access to a floor or office is needed. Security or front desk personnel could easily update or remove images from the dataset on their phones.
Where to Find OpenFace for Implementation
Hopefully, we’ve made the case for checking out OpenFace. You can implement the model for facial recognition from OpenFace on GitHub, or check out the hosted OpenFace model on Algorithmia where you can add, train, remove, and predict images using our SVM implementation for classification. If you want to learn how to use our facial recognition algorithm, check out our recipe for making a celebrity classifier.
Stay tuned for slides from my talk next week at PyCascades, a regional Python conference in Vancouver, B.C., where I’ll be speaking on Racial Bias in Facial Recognition Software. I’ll be covering a use case for building a celebrity look-alike app and talking about model failure due to racial bias, along with other use cases.