Demographic effects on estimates of automatic face recognition performance

https://doi.org/10.1016/j.imavis.2011.12.007

Abstract

The intended applications of automatic face recognition systems include venues that vary widely in demographic diversity. Formal evaluations of algorithms do not commonly consider the effects of population diversity on performance. We document the effects of racial and gender demographics on estimates of the accuracy of algorithms that match identity in pairs of face images. In particular, we focus on the effects of the “background” population distribution of non-matched identities against which identity matches are compared. The algorithm we tested was created by fusing three of the top performers from a recent US Government competition. First, we demonstrate the variability of algorithm performance estimates when the population of non-matched identities was demographically “yoked” by race and/or gender (i.e., “yoking” constrains non-matched pairs to be of the same race or gender). We also report differences in the match threshold required to obtain a false alarm rate of 0.001 when demographic controls on the non-matched identity pairs varied. In a second experiment, we explored the effect on algorithm performance of progressively increasing population diversity. We found systematic, but non-general, effects when the balance between majority and minority populations of non-matched identities shifted. Third, we show that identity match accuracy differs substantially when the non-match identity population varied by race. Finally, we demonstrate the impact on performance when the non-match distribution consists of faces chosen to resemble a target face. The results from all experiments indicate the importance of the demographic composition and modeling of the background population in predicting the accuracy of face recognition algorithms.

Highlights

  • We model accuracy estimates for face recognition algorithms in demographically variable venues.
  • Accuracy measures vary substantially for different race/gender background populations.
  • We model an imposter scenario and demonstrate realistic expectations for performance.
  • Algorithm developers should consider the demographic composition of a population in estimates of performance.

Introduction

The appearance of a face is determined by its gender, race/ethnicity, age, and identity. Any given image of a face depends also on the viewing angle, illumination, and resolution of the sensor. The goal of most face recognition algorithms is to identify someone as a unique individual. Often this test requires the algorithm to match the identity of faces between images that may vary in the quality or nature of the viewing conditions. The diversity of faces in the real world means that face recognition algorithms must operate across a backdrop of appearance variability that is not related to an individual's unique identity. Thus, face recognition algorithms intended for real-world applications must perform predictably over changes in the demographic composition of the intended application populations. The most likely application sites for algorithms include airports, border crossings, and crowded city sites such as train and metro stations. These locations are characterized by ethnically diverse populations that may vary by the time of year (e.g., tourist season) or even by the time of day (e.g., flights from the Far East arrive in the morning and European flights in the afternoon).

The performance of state-of-the-art automatic face recognition algorithms has been tested extensively over the last two decades in a series of U.S. Government-sponsored tests (e.g., [1]). Measures of algorithm performance in these tests provide the best publicly available information for making decisions about the suitability and readiness of algorithms for security and surveillance applications of importance. Traditionally, these tests have emphasized measuring performance over photometric conditions such as illumination and image resolution [2], [3], [4], [5]. They have also concentrated primarily on the quality of the “match” (i.e., similarity) between images of the same individuals, considering this to be the critical factor determining algorithm performance. When the degree of similarity between matched identities is high, as is the case when photometric factors are controlled, the algorithm is expected to perform well.

Much less consideration has been given to the distribution against which matched identities are compared. To digress briefly, all measures of the performance of face recognition algorithms rely both on the distribution of data for the population of identity matches (i.e., pairs of images of the same person) and on the distribution of data for non-matched identities (i.e., pairs of images of different people). By definition, the identity match population contains image pairs of the same race and gender. The non-matched identity population, however, may be structured in several ways. To date, in most evaluations, this distribution consists of pairs of faces that have different identities and which may, or may not, be of different races or genders.

More formally, identity match decisions (same person or different people?) are generally based on a computed similarity score between two face images. If this score exceeds a threshold similarity score, the faces are judged as an “identity match” (sometimes referred to as an “identity verification”). Otherwise, the images are judged as non-matched identities. False alarm errors occur when the computed similarity score between a pair of face images of different individuals exceeds the match threshold. In most applications, the similarity cutoff threshold is set to achieve a low false alarm rate (commonly on the order of 0.001).
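This decision rule can be sketched with simulated similarity scores. The Gaussian distributions and all numeric values below are illustrative assumptions for the sketch, not FRVT 2006 data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical similarity scores: matched pairs score higher on average
# than non-matched pairs (simulated values, not FRVT 2006 data).
match_scores = rng.normal(loc=2.0, scale=1.0, size=100_000)
nonmatch_scores = rng.normal(loc=0.0, scale=1.0, size=100_000)

def threshold_at_far(nonmatch, far=0.001):
    """Threshold at which the fraction of non-match scores above it is ~far."""
    return np.quantile(nonmatch, 1.0 - far)

tau = threshold_at_far(nonmatch_scores, far=0.001)
far = np.mean(nonmatch_scores > tau)   # realized false alarm rate
vr = np.mean(match_scores > tau)       # verification (hit) rate at that threshold
```

Because the threshold is a quantile of the non-match scores, any change to the non-match distribution, demographic or otherwise, moves the threshold and, with it, the verification rate.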

As noted, in many formal evaluations of algorithms, the distribution of similarity scores for the non-matched face images includes face pairs that may differ in gender, race/ethnicity, and age [1]. The inclusion of these categorically mismatched image pairs may lead to an over-estimation of an algorithm's ability to discriminate the identity of face pairs. In other words, when non-match face pairs differ on categorical variables such as race and gender, some part of the performance of the algorithms may be due to the easier problem of discriminating faces based on race or gender, rather than to the more challenging problem of recognizing individual face identities. From a theoretical point of view, any demographic factor that decreases the average similarity of the non-matched face pairs will increase the estimated performance of an algorithm.
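The theoretical point can be illustrated numerically: holding the match distribution fixed, shifting the non-match distribution downward (as when many non-match pairs differ in race or gender and are therefore easy to discriminate) lowers the threshold needed for a 0.001 false alarm rate and inflates the verification rate. The distributions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed match-score distribution (illustrative, not FRVT data).
match = rng.normal(2.0, 1.0, 100_000)

def vr_at_far(match, nonmatch, far=0.001):
    """Verification rate at the threshold giving the requested false alarm rate."""
    tau = np.quantile(nonmatch, 1.0 - far)
    return np.mean(match > tau)

# Non-match pairs drawn from the same demographic category (harder to reject)
# versus a mix containing easy cross-category pairs with lower similarity.
vr_hard = vr_at_far(match, rng.normal(0.0, 1.0, 100_000))
vr_easy = vr_at_far(match, rng.normal(-1.0, 1.0, 100_000))
```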

In this study, we focus on the problem of how the demographic composition of non-match populations affects estimates of algorithm accuracy. Although demographic variables have been shown to affect both human [6], [7], [8], [9] and machine [10], [11] accuracy at recognizing faces, the effects of these variables on the non-matched identity distributions have not been studied previously. The first goal of this study was to document the effects of yoking non-match pairs according to the categorical variables of race and gender, individually, and together. By “yoking” we mean controlling the demographic variables within a non-match pair, such that both faces in the pair are of the same gender, same race, or same gender and race. We also examine the implications of demographic control of the non-match pairs for the choice of a threshold for match/non-match decisions. The second goal was to examine algorithm accuracy with systematic variations in the proportion of a “second” ethnic group in the non-match distribution. Third, we measured algorithm accuracy when the identity matches were of a particular race and identity mismatches were from another race. This was compared to the case when the match and non-match distributions were of the same race. Finally, we look at the challenging but plausible security scenario in which a deliberate attempt is made to impersonate another person. Specifically, we evaluate algorithm performance estimates when the non-match distribution consists of selected imposters, chosen to be similar to a target.
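The yoking constraint itself is straightforward to express. A minimal sketch, using a toy gallery of hypothetical (identity, race, gender) records rather than the actual test data:

```python
import itertools

# Hypothetical gallery records: (identity, race, gender).
# The identities and category labels are illustrative only.
faces = [
    ("id1", "EastAsian", "F"), ("id2", "EastAsian", "M"),
    ("id3", "Caucasian", "F"), ("id4", "Caucasian", "M"),
    ("id5", "EastAsian", "F"), ("id6", "Caucasian", "M"),
]

def nonmatch_pairs(faces, yoke_race=False, yoke_gender=False):
    """All non-matched identity pairs, optionally yoked by race and/or gender."""
    pairs = []
    for a, b in itertools.combinations(faces, 2):
        if a[0] == b[0]:
            continue  # same identity: this is a match pair, not a non-match
        if yoke_race and a[1] != b[1]:
            continue  # yoking by race: both faces must share a race label
        if yoke_gender and a[2] != b[2]:
            continue  # yoking by gender: both faces must share a gender label
        pairs.append((a[0], b[0]))
    return pairs

unyoked = nonmatch_pairs(faces)
race_yoked = nonmatch_pairs(faces, yoke_race=True)
fully_yoked = nonmatch_pairs(faces, yoke_race=True, yoke_gender=True)
```

Yoking shrinks the non-match set to the pairs that cannot be rejected on category information alone, which is exactly what removes the "easy" discriminations from the background.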

Section snippets

Algorithm fusion and test protocol

In this section, we overview the algorithm and test protocol common to all experiments. The source of data for these experiments was the FRVT 2006 [1], a U.S. Government-sponsored test of face recognition algorithms conducted by the National Institute of Standards and Technology (NIST) (details of this test and its results can be found elsewhere [1]). We used face recognition algorithms from the FRVT 2006 international test because they are among the best algorithms available for testing and
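The exact fusion rule is not given in this excerpt; one common score-level fusion scheme, shown here purely as an assumed illustration, is to z-normalize each algorithm's similarity scores and average them so that no single algorithm's score scale dominates:

```python
import numpy as np

def fuse(score_matrix):
    """Score-level fusion sketch (an assumed scheme, not necessarily the
    paper's): z-normalize each algorithm's scores, then average across
    algorithms.  score_matrix: (n_algorithms, n_pairs) raw similarity scores.
    """
    s = np.asarray(score_matrix, dtype=float)
    # Normalize each algorithm's row to zero mean and unit variance.
    z = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)
    return z.mean(axis=0)  # one fused score per image pair

# Three hypothetical algorithms scoring the same three image pairs,
# each on its own arbitrary scale.
fused = fuse([[0.9, 0.1, 0.5],
              [10.0, 2.0, 6.0],
              [0.3, 0.0, 0.2]])
```

Because each row is z-normalized before averaging, the fused score preserves each algorithm's rank information while equalizing their score scales.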

Demographic pairing in non-match identity distributions

The goal of these experiments is to document changes in performance estimates for face recognition algorithms as a function of the demographic characteristics of the non-matched face pairs. In the first section, we show that performance estimates vary substantially when the non-match population consists of face pairs that are yoked by demographic groups (gender, race, gender and race). In the second section, we examine the effects of demographic controls on the appropriate choice of a

Simulations on mixed demographics

In real-world applications, the representativeness of different demographic categories varies in different population contexts, from nearly 100% of a single majority race to various degrees of inclusion of (an)other race(s). In this section, we systematically explore the effects of progressive increases in population diversity on algorithm performance. Again, we focus on diversity in the non-matched identity distribution. We begin by measuring algorithm performance when only one race of faces
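The effect of a growing minority share on the operating threshold can be sketched as a mixture simulation. The score distributions below are illustrative assumptions (cross-race non-match pairs are given lower similarity on average, since race differences alone reduce similarity), not the measured FRVT data:

```python
import numpy as np

rng = np.random.default_rng(1)

def nonmatch_scores(n, frac_minority):
    """Simulated non-match scores for a background with the given minority share."""
    n_minor = int(n * frac_minority)
    same_race = rng.normal(0.0, 1.0, size=n - n_minor)   # majority-vs-majority pairs
    cross_race = rng.normal(-1.0, 1.0, size=n_minor)     # majority-vs-minority pairs
    return np.concatenate([same_race, cross_race])

# Threshold for a false alarm rate of 0.001 as the minority share grows:
# lower-scoring cross-race pairs pull the 0.999 quantile downward.
thresholds = {f: np.quantile(nonmatch_scores(500_000, f), 0.999)
              for f in (0.0, 0.25, 0.5)}
```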

Demographically “reversed” identity match and mismatch distributions

It is easy to imagine an application scenario in which the background distribution of non-match faces contains faces of one race, but the target population contains faces of another race. This might happen when the algorithm is developed in one geographic venue but is deployed in another venue. In that case, estimates of the similarity of match and non-match faces would be based on different races of faces. If the homogeneity of the two populations differs, one would expect variable estimates
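The practical risk in this scenario is that a threshold calibrated on the development venue's background no longer yields the intended false alarm rate at deployment. A sketch with hypothetical score distributions (the deployment race's non-match scores are assumed to run higher, as for a more homogeneous population):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative non-match score distributions for two venues (not real data).
nonmatch_dev = rng.normal(0.0, 1.0, 100_000)      # development venue, race A
nonmatch_deploy = rng.normal(0.5, 1.0, 100_000)   # deployment venue, race B

# Threshold set for a 0.001 false alarm rate on the development background.
tau = np.quantile(nonmatch_dev, 0.999)

far_dev = np.mean(nonmatch_dev > tau)        # ~0.001, as intended
far_deploy = np.mean(nonmatch_deploy > tau)  # realized FAR at deployment
```

Under these assumptions the realized false alarm rate at deployment is several times the intended 0.001, which is the kind of variable estimate the venue mismatch produces.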

Imposter distributions

In the experiments reported up to this point in the paper, we have considered what happens to estimates of algorithm performance as a function of changes to the demographic constraints of the background population of non-matched faces. Consequently, our results are due to the effects of these constraints on the structure of the similarity scores in the non-match distribution. In this final experiment, we modeled the scenario of deliberate attempts to impersonate other people. Analogous to the
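The selected-imposter construction can be sketched as follows. Random feature vectors and cosine similarity stand in for an algorithm's face templates and similarity scores; both are assumptions of the illustration, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical face templates as 64-d feature vectors (illustrative only).
gallery = rng.normal(size=(1000, 64))
target = rng.normal(size=64)

def cosine(a, b):
    """Cosine similarity, standing in for an algorithm's similarity score."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(target, g) for g in gallery])

# Selected-imposter background: the k gallery faces most similar to the
# target, versus the unselected (random) non-match background.
k = 20
selected = np.sort(scores)[-k:]
tau_random = np.quantile(scores, 0.999)      # threshold vs. random imposters
tau_selected = np.quantile(selected, 0.999)  # threshold vs. selected imposters
```

Because every selected imposter already scores near the top of the non-match distribution, the threshold needed to reject them rises, and verification performance measured against this background drops accordingly.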

Conclusions

In summary, all measures of the performance of face recognition algorithms rely both on the distribution of data for identity matches and on the distribution of data for mismatched identities. Traditionally, attempts to improve the performance of face recognition algorithms have emphasized methods that increase the degree of match between images of the same person (e.g., by bridging differences in illumination). Less consideration has been given to the effects of the composition of the

Acknowledgments

The authors wish to thank Technical Support Working Group (TSWG)/ DOD and the Federal Bureau of Investigation (FBI) for their support of this work. The identification of any commercial product or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology or The University of Texas at Dallas. The authors thank Jay Scallan for his assistance in preparing the GBU challenge problem and Allyson Rice for comments on a previous version of this

References (12)

  • N. Furl et al., Face recognition algorithms and the other-race effect: computational mechanisms for a developmental contact hypothesis, Cogn. Sci. (2002)
  • P.J. Phillips et al., FRVT 2006 and ICE 2006 large-scale results, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • R. Gross et al., Face recognition across pose and illumination
  • P.J. Phillips, P.J. Flynn, T. Scruggs, K.W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, W. Worek, Overview of the...
  • P.J. Phillips et al., The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • P.J. Phillips et al., Face Recognition Vendor Test 2002: Evaluation Report


This paper has been recommended for acceptance by special issue Guest Editors Rainer Stiefelhagen, Marian Stewart Bartlett and Kevin Bowyer.
