MegaPixels: Face Database Query
Exploring the databases used to train facial recognition algorithms.
Currently live at the Glass Room in London until Nov. 12th.
Facial recognition relies on the availability of training data to learn features and to evaluate algorithms. Today, most of these datasets are created by scraping the Internet for profile pictures and people’s public photo albums. Both academic and commercial researchers scrape social media sites, such as Flickr, to enroll people in these training databases without ever asking consent, relying instead on an outdated Creative Commons license or, in some cases, bypassing consent altogether.
This project aims to provide a search interface into the dozens of facial recognition training databases, created without any user consent, while also exploring the capabilities of facial recognition and facial analysis software.
Like most projects on this site, MegaPixels is under active development and will be updated.
In the meantime, email me with any questions.
How it Works
Walk up to MegaPixels and in a few seconds you’ll be compared to hundreds of thousands of identities from the MegaFace and LFW datasets. The best matches will be displayed on the screen with their confidence scores. When a similar face is found, a button will illuminate and you can print out a summary of your match score along with information about the database source.
The MegaPixels project is currently under development and the number of identities in the database is expanding. As of November 2, 2017 there are 485,290 identities used in the Glass Room installation, which includes 85% of the MegaFace dataset and 100% of the LFW dataset. The project will continue to expand and eventually include all publicly available facial recognition training datasets that were created by scraping the Internet.
In total there are over one million people in these databases (possibly many more) and it’s likely that none of these people are aware that their biometric data is being used by companies around the world (from the USA to Russia) to train and evaluate software that, in some cases, is ultimately licensed to government surveillance programs.
The installation runs locally on a desktop computer built for image processing. No data is transmitted to any third-party service or software.
The software is built using OpenCV and uses a neural network face recognition library to compute a 128-D feature vector for each face. This feature vector is compared to all precomputed face descriptors in the database and the top matches are displayed to the screen. Matching against the entire dataset of several hundred thousand images takes approximately 15 seconds. The search could be much quicker (milliseconds) but is intentionally slowed down to display the results to the user as they are computed with increasing accuracy, finally displaying the best match after around 15 seconds.
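The matching step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the installation's actual code: the function names, the use of Euclidean distance, the value of `k`, and the chunk size are all hypothetical. The first function ranks every precomputed 128-D descriptor against a query vector; the second scans the database in chunks and yields the running best matches, which is one way results could be shown while they are still being refined.

```python
import numpy as np


def top_matches(query, descriptors, k=5):
    """Return (indices, distances) of the k descriptors closest to query.

    Assumes Euclidean distance between 128-D vectors (an assumption;
    the source only says the vectors are "compared").
    """
    dists = np.linalg.norm(descriptors - query, axis=1)
    k = min(k, len(dists))
    # argpartition finds the k smallest distances without a full sort
    nearest = np.argpartition(dists, k - 1)[:k]
    order = nearest[np.argsort(dists[nearest])]
    return order, dists[order]


def progressive_matches(query, descriptors, k=5, chunk_size=50_000):
    """Yield the running top-k (indices, distances) after each chunk."""
    best_idx = np.empty(0, dtype=np.int64)
    best_dist = np.empty(0, dtype=np.float64)
    for start in range(0, len(descriptors), chunk_size):
        chunk = descriptors[start:start + chunk_size]
        dists = np.linalg.norm(chunk - query, axis=1)
        idx = np.arange(start, start + len(chunk))
        # merge this chunk's candidates with the best seen so far
        best_idx = np.concatenate([best_idx, idx])
        best_dist = np.concatenate([best_dist, dists])
        keep = np.argsort(best_dist)[:k]
        best_idx, best_dist = best_idx[keep], best_dist[keep]
        yield best_idx, best_dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.normal(size=(1000, 128))  # synthetic stand-in descriptors
    probe = database[42] + 0.01 * rng.normal(size=128)
    for indices, distances in progressive_matches(probe, database, k=3, chunk_size=250):
        print(indices, np.round(distances, 3))
```

A brute-force scan like this is linear in the number of descriptors, which is consistent with the observation that a search over several hundred thousand vectors can finish in well under a second on a desktop machine.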
Once a face is matched, a green button is illuminated and the user can create a thermal receipt printout.
How accurate is the facial recognition software?
There is no absolute accuracy score. Accuracy can only be measured relative to a particular benchmark dataset. A blog post from the author of the facial recognition library used in this project recorded a 99.38% accuracy on the Labeled Faces in the Wild (LFW) dataset. However, this dataset contains only frontally aligned photos (detected using a frontal Haar cascade) and does not reflect real-world scenarios, which would degrade the accuracy. This dataset is also biased towards western faces.
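To make "relative" concrete, a benchmark figure of this kind is usually a pair-verification accuracy: the fraction of labeled image pairs for which a distance threshold correctly decides "same person" or "different person". The sketch below is a generic illustration, not the library author's evaluation script. The function name and the default threshold of 0.6 are assumptions chosen for the example.

```python
import numpy as np


def verification_accuracy(desc_a, desc_b, same_person, threshold=0.6):
    """Fraction of pairs classified correctly by a distance threshold.

    desc_a, desc_b: arrays of shape (n_pairs, dim), one descriptor per image.
    same_person:    boolean array of shape (n_pairs,), the ground-truth labels.
    threshold:      hypothetical cutoff; pairs closer than this are
                    predicted to show the same person.
    """
    dists = np.linalg.norm(np.asarray(desc_a) - np.asarray(desc_b), axis=1)
    predicted_same = dists < threshold
    return float(np.mean(predicted_same == np.asarray(same_person)))
```

The resulting number depends entirely on which pairs are in the benchmark, so a score measured on frontally aligned photos says little about performance on faces captured at an installation.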
The factors that will degrade accuracy are generally pose, resolution, camera noise, illumination, motion blur, camera angle, facial expression, lens distortion, and face obfuscation (makeup, glasses, jewelry, or facial decoration). Combined, these attributes can degrade facial recognition accuracy to below a usable score. This discussion assumes facial recognition is limited to a 2D visible wavelength capture with approximately 40 interocular pixels (about 100x100px).
Another variable that affects accuracy is the dataset that was used for training. Because the training images used for this facial recognition algorithm originated from western photo sharing sites, such as Flickr, or from western celebrity databases such as IMDB, the learned neural network features have a greater aptitude for discerning small differences between faces similar to those in the dataset. Conversely, for faces that were not well represented in the dataset, the facial recognition scoring will be overconfident because it is not able to understand the subtle differences between, for example, Asian faces, which are not well represented in the training set.
A decreased accuracy for identifying Asian faces has been observed at the installation and this is likely due to insufficiently diverse training data. However, this facial recognition library is open source, in development, and the author makes no claims of providing universal accuracy. Facial recognition algorithms are rarely universal and, when used at large scale, would likely be used in conjunction with a racial classifier preprocessor followed by a race-specific clustering algorithm, then an intra-group clustering step (clustering people who look similar), and finally the recognition confidence scoring algorithm.
In summary, the accuracy of the actual installation is not known and probably can’t be determined. There is quite a lot of room for improvement (10-20%?) with pose filtering and identity clustering.
How do you obtain the databases?
The two databases currently used are MegaFace and LFW. The MegaFace database can be acquired from megaface.cs.washington.edu. You will need to agree to their terms and provide an email. The LFW database can be acquired from http://vis-www.cs.umass.edu/lfw/.
Did anyone find a true match in the database?
As of Nov. 2, two people had confirmed a positive match. One person shared their match on Twitter.
Does the MegaPixels project collect any data?
Absolutely no data is collected. The software only temporarily holds the facial capture and biometric information in RAM (Random Access Memory) during real-time processing and while the person is present in front of the screen. No data is ever stored to disk, transmitted, or shared in any way.
MegaPixels has no intentions of flattery and typically provides the opposite: a less glamorous confrontation with the reality of biometric technologies and the methods used to build them.
- Presented by Mozilla at the Glass Room
- Curated by Tactical Tech
- Thanks to Stephanie Hankey, Marek Tuszynski, Sophie Macpherson, Daisy Kidd
- Header photo courtesy Carmen Aguilar y Wedge / Hyphen-Labs
- Design and Development Adam Harvey