Evaluation and analysis of a face and voice outdoor multi-biometric system
Introduction
Biometric security systems offer convenience to the user as there are no passwords to remember or physical tokens to carry around. User convenience and acceptable performance in controlled environments have led to the integration of biometrics into devices such as laptops, desktops and PDAs. Various turn-key solutions are also available for physical access, such as authorizing entry into a room or a facility. These solutions perform well when used in controlled indoor environments. However, when used outdoors, the same technologies suffer from lower performance since they are vulnerable to various presentation and channel effects. For example, the performance of a face system in an outdoor scenario degrades due to shadows on the face and squinting of the eyes. Similarly, voice-based systems suffer from background noise, e.g. from passing vehicles. Since most biometric systems perform poorly in uncontrolled environments, their acceptance for outdoor verification purposes has been slow.
In this study, we investigate the effect of environmental variations on a face and voice biometric system. We then study how the performance of such systems can be improved in the indoor–outdoor environment (where enrollment is indoors and verification is outdoors). We look at two possible schemes: multi-modal score fusion, where scores from different biometric modalities are combined, and intramodal fusion, where scores from multiple samples of the same biometric modality are combined.
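To make the distinction between the two schemes concrete, the following is a minimal sketch in Python. It assumes a simple sum rule over min-max normalized scores; the score values, score ranges, and the choice of normalization here are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's exact scheme): fusing matcher
# scores with a sum rule after min-max normalization. All score values
# and ranges below are hypothetical.

def min_max_normalize(score, lo, hi):
    """Map a raw matcher score into [0, 1] given its observed range."""
    return (score - lo) / (hi - lo)

def fuse_sum(scores):
    """Sum-rule fusion: average the normalized scores."""
    return sum(scores) / len(scores)

# Multi-modal fusion: one face score and one voice score per claim.
face = min_max_normalize(72.0, lo=0.0, hi=100.0)   # face matcher score
voice = min_max_normalize(-3.1, lo=-10.0, hi=0.0)  # voice log-likelihood
multimodal = fuse_sum([face, voice])

# Intramodal fusion: several face samples of the same subject.
faces = [min_max_normalize(s, 0.0, 100.0) for s in (72.0, 68.5, 75.2)]
intramodal = fuse_sum(faces)
```

Note that both schemes consume the same number of samples when configured as in Section 4, which is what makes the comparison between them fair.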
For such a study, a database containing face images and speech samples acquired in both indoor and outdoor environments is required. However, most multi-modal databases for face and voice are collected in an indoor environment. The M2VTS and XM2VTSDB (Messer et al., 1999) databases contain voice, 2D and 3D face images, collected in an indoor environment. Similarly, the CMU-PIE database (Sim et al., 2003) captures pose, illumination and expression variations of face images only in an indoor environment. Databases collected outdoors usually do not contain both face and voice samples from the same subject. The FERET database (Phillips et al., 2003) consists of face images collected both indoors and outdoors, but does not contain voice. When voice is recorded outdoors, cellular phones are usually used as the capturing device (Przybocki and Martin, 2002), which would be unsuitable in a physical access scenario. Studies on voice databases where the same subject is recorded both indoors and outdoors have not been published either.
Because of the dearth of suitable databases, most face–speech fusion studies have been conducted on indoor databases. Various approaches have been explored for score-level and decision-level fusion. Brunelli and Falavigna (1995) fused face and voice samples collected in an indoor environment using the weighted product approach. The multi-modal fusion resulted in a recognition rate of 98%, where the recognition rates of the face and voice systems were 91% and 88%, respectively. Bigun et al. (1997) used Bayesian statistics to fuse face and voice data using the M2VTS database (which only contains indoor samples). Jain et al. (1999) performed fusion on face, fingerprint and voice data based on the Neyman–Pearson rule; the database used for this experiment consists of samples from 50 subjects in an indoor environment. Sanderson and Paliwal (2002) compared and fused face and voice data in clean and noisy conditions. Although the data was collected indoors, they simulated a noisy environment by corrupting the data with additive white Gaussian noise at different signal-to-noise ratios.
Thus, from the literature surveyed, we see that most fusion experiments involving face and voice use an indoor dataset or try to simulate outdoor conditions by adding noise. Table 1 summarizes these studies and contrasts them with this work, which focuses on same-day, indoor–outdoor experiments.
This work is unique in three aspects. Firstly, our dataset contains face and voice samples that are collected in both indoor and outdoor environments. Secondly, in many fusion studies, a multi-modal sample is created by randomly pairing independent unimodal samples from different subjects. Researchers (Phillips et al., 2004) express the need for a corpus that reflects realistic conditions. In response, for this study, we have created a truly multi-modal database where the two modalities are collected from the same subjects. Thirdly, most fusion studies report an improvement in performance due to multi-modal fusion. However, it has been suggested (Phillips et al., 2004) that this improvement could partly be credited to the larger number of samples used. To ensure a fair comparison, where the number of samples used is the same, the multi-modal fusion performance is compared to that of intramodal fusion.
The layout of the paper is as follows. Section 2 describes our indoor–outdoor database. Section 3 describes the normalization and fusion procedures. The results are presented and analyzed in Section 4, followed by the conclusion of the paper in Section 5.
Database
For a realistic testing of multi-modal authentication systems, the dataset should mimic the operational environment. Many of the reported tests for biometric fusion are conducted on a multi-modal database that is composed of single biometric databases collected for different individuals. Some databases like M2VTS (Messer et al., 1999), DAVID (Chibelushi et al., 1996), MyIdea (Dumas et al., 2005), Biomet (Garcia-Salicetti et al., 2003) that contain face and voice samples from the same user are
Methods
We study the performance of the individual unimodal systems in the indoor–indoor, indoor–outdoor and outdoor–outdoor scenarios. Next, we evaluate the performance improvement obtained through intramodal fusion on face and intramodal fusion on voice in the indoor–outdoor scenario. The performance improvement obtained by multi-modal fusion is then compared to the intramodal performances.
Information fusion in biometrics is possible at the sensor, feature, score or decision levels (Aleksic and
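Of the fusion levels listed above, the two studied most often are score level and decision level. As a hedged illustration of the difference, the sketch below fuses the same hypothetical normalized scores both ways: combining the continuous scores before a single threshold decision versus thresholding each matcher first and taking a majority vote. The threshold value and scores are assumptions for the example only.

```python
# Hedged illustration of two fusion levels: score-level (combine
# continuous scores, decide once) vs decision-level (decide per
# matcher, fuse binary votes). Threshold and scores are hypothetical.

THRESHOLD = 0.5  # assumed common operating point on normalized scores

def score_level(scores):
    """Fuse first with the sum rule, then decide on the combined score."""
    return sum(scores) / len(scores) >= THRESHOLD

def decision_level(scores):
    """Decide per matcher, then fuse the accept/reject votes by majority."""
    votes = [s >= THRESHOLD for s in scores]
    return sum(votes) > len(votes) / 2

scores = [0.62, 0.48, 0.55]  # e.g. face, voice, and a second face sample
score_level(scores)          # mean 0.55 >= 0.5, so the claim is accepted
decision_level(scores)       # 2 of 3 matchers vote accept, also accepted
```

Score-level fusion retains more information than decision-level fusion, since a narrowly failing matcher still contributes its evidence rather than a hard reject vote; this is one reason score-level fusion is the focus of this study.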
Results and discussion
The performance of individual biometric systems in the indoor–indoor, indoor–outdoor and outdoor–outdoor environments is discussed in Section 4.1. Section 4.2 describes the use of multiple samples (intramodal fusion) to improve the performance of the individual modalities in the indoor–outdoor scenario. Section 4.3 investigates different normalization and fusion schemes for multi-modal fusion in the indoor–outdoor scenario and compares the result to that of intramodal fusion.
Conclusions
In this paper, we describe studies conducted on a novel, truly multi-modal, indoor–outdoor database. This is one of the first studies to deal with both face and voice, and their intramodal and multi-modal fusion, in an indoor–outdoor physical access scenario. The study mimics a realistic setting, as face and voice samples were collected from the same person under conditions typical of an operational scenario.
From this study, we uncover certain interesting observations. Firstly, we
References (33)
- et al. Score normalization in multimodal biometric systems. Pattern Recognition (2005)
- et al. Information fusion in biometrics. Pattern Recognition Lett. (2003)
- et al. Audio-visual biometrics. Proc. IEEE (2006)
- Bailly-Baillière, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariéthoz, J., Matas, J., Messer, K., Popovici, ...
- et al. Fusion of face and speech data for person identity verification. IEEE Trans. Neural Networks (1999)
- Bigun, E., Bigun, J., Duc, B., Fischer, S., 1997. Expert conciliation for multimodal person authentication systems ...
- et al. Person identification using multiple cues. IEEE Trans. Pattern Anal. Machine Intell. (1995)
- et al. An evaluation of multimodal 2D + 3D face biometrics. IEEE Trans. Pattern Anal. Machine Intell. (2005)
- et al. Design issues for a digital audio-visual integrated database. IEE Colloq. Integrated Audio-Visual Process Recognition Synth. Comm. (1996)
- Dumas, B., Pugin, C., Hennebert, J., Petrovska-Delacrétaz, D., Humm, A., Evéquoz, F., Ingold, R., Von-Rotz, D., 2005. ...