Exploring the databases used to train facial recognition algorithms
MegaPixels is an ongoing project about machine learning image datasets. This first chapter of the project launched in London in 2017 in collaboration with Tactical Tech for the Glass Room exhibition.
The installation used facial recognition to search for your identity in the largest publicly available facial recognition training dataset in the world, called MegaFace (V2). Out of approximately 15,000 visitors, two people reported finding a positive match in a training database they never knew existed.
Many people are unaware that they might be included in this facial recognition training dataset, because it was built entirely from Flickr. The dataset contains approximately 672,000 identities and 4.2 million photos, all obtained from Flickr without the consent of the people pictured.
These 4.2 million images are currently being passed between researchers in the US, China, Russia, and all over the world to train and evaluate state-of-the-art facial recognition algorithms. If you knew that your image, your friend's image, or your child's image was being used for developing products in the defense industry, would you object? How would you even find out if your photo was included?
MegaPixels is an ongoing project to bring these datasets into public view, provide new tools to explore their contents, and surface their ethical implications.
Walk up to the MegaPixels kiosk and in a few seconds you’ll be compared to ~672,000 identities from the MegaFace dataset. The best matches will be displayed on the screen with their match confidence scores. When a similar face is found, a button illuminates and you can print out a summary of your match score along with information about the database.
The installation runs locally on a desktop computer built for image processing. No data is transmitted to any 3rd party service or software.
The installation was built using OpenCV and uses a neural network face recognition library to compute a 128-D feature vector for each face. These feature vectors (arrays of 128 floating-point numbers) are compared against all precomputed face descriptors in the database, and the top 5 matches are displayed on the screen.
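The matching step described above can be sketched as a nearest-neighbor search over the precomputed descriptors. This is a minimal illustration using NumPy with random vectors standing in for real face descriptors; the actual installation's library and distance metric are not specified here, so Euclidean distance is an assumption.

```python
import numpy as np

def top_matches(query, db, k=5):
    """Return indices and distances of the k nearest face descriptors.

    query: (128,) float array for the captured face
    db:    (N, 128) float array of precomputed descriptors
    """
    # Euclidean distance between the query and every stored descriptor
    dists = np.linalg.norm(db - query, axis=1)
    idx = np.argsort(dists)[:k]  # k smallest distances = best matches
    return idx, dists[idx]

# Toy database: random descriptors standing in for real faces
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))
# A query that is a near-duplicate of entry 42
query = db[42] + rng.normal(scale=0.01, size=128)
idx, dists = top_matches(query, db)
print(idx[0])  # entry 42 ranks first
```

At the installation's scale (~672,000 identities) a brute-force scan like this is still feasible on a single desktop, since comparing one 128-D vector against a few million is a single fast matrix operation.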
There is no absolute accuracy score in facial recognition; accuracy can only be measured relative to the faces in the evaluation dataset. A blog post by the author of the facial recognition library used in this project reported 99.38% accuracy on the Labeled Faces in the Wild (LFW) dataset. However, LFW contains only frontally aligned photos (detected using a frontal haarcascade) and is notoriously biased. As a result of the Euro-centric biometric bias in the LFW dataset, matching performance is lower for faces underrepresented in it. Beyond dataset bias, real-world scenarios introduce more variables that further degrade performance.
These variables include pose, resolution, camera noise, illumination, motion blur, camera angle, face expression, lens distortion, and face obfuscation (makeup, glasses, jewelry, or facial decoration). Combined, these attributes can degrade facial recognition accuracy to below 70%.
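Because accuracy is always relative to an evaluation set, it is typically measured by scoring labeled face pairs at a distance threshold. This sketch (with invented toy numbers, not results from any real benchmark) shows why the reported figure depends entirely on which pairs are in the evaluation set and where the threshold is placed:

```python
import numpy as np

def verification_accuracy(distances, same_person, threshold):
    """Fraction of face pairs classified correctly at a distance threshold.

    distances:   (N,) descriptor distances for N face pairs
    same_person: (N,) booleans, True when both photos show one identity
    threshold:   pairs below this distance are predicted to match
    """
    predicted_match = distances < threshold
    return float(np.mean(predicted_match == same_person))

# Toy evaluation set: genuine pairs cluster at low distances,
# impostor pairs at higher ones
distances = np.array([0.3, 0.4, 0.5, 0.7, 0.8, 0.9])
same_person = np.array([True, True, True, False, False, False])
print(verification_accuracy(distances, same_person, threshold=0.6))  # 1.0
```

On a harder evaluation set, where pose, blur, or occlusion pushes genuine pairs above the threshold and impostor pairs below it, the same model and the same threshold yield a lower score.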
The MegaFace database can be acquired from megaface.cs.washington.edu. You will need to agree to their terms and provide an email. This installation used all 672,000 identities from the MegaFace dataset. However, after filtering for image quality and non-faces, the total number was closer to 600,000.
Absolutely no data is collected. The software only temporarily holds the facial capture and biometric information in RAM (Random Access Memory) during real-time processing while the person is present in front of the screen. No data is ever stored to disk, transmitted, or shared in any way.