Evaluation and analysis of a face and voice outdoor multi-biometric system
Introduction
Biometric security systems offer convenience to the user as there are no passwords to remember or physical tokens to carry around. User convenience and acceptable performance in controlled environments have led to the integration of biometrics into devices such as laptops, desktops and PDAs. Various turn-key solutions are also available for physical access, such as authorizing entry into a room or a facility. These solutions perform well when used in controlled indoor environments. However, when used outdoors, the same technologies suffer from lower performance since they are vulnerable to various presentation and channel effects. For example, the performance of a face system in an outdoor scenario degrades due to shadows on the face and squinting of the eyes. Similarly, voice-based systems suffer from background noise, e.g. from passing vehicles. Since most biometric systems perform poorly in uncontrolled environments, their acceptance for outdoor verification purposes has been slow.
In this study, we investigate the effect of environmental variations on a face and voice biometric system. We then study how the performance of such systems can be improved in the indoor–outdoor environment (where enrollment is indoors and verification is outdoors). We look at two possible schemes: multi-modal score fusion, where scores from different biometric modalities are combined, and intramodal fusion, where scores from multiple samples of the same biometric modality are combined.
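To make the distinction between the two schemes concrete, the following is a minimal sketch in Python. It assumes a simple sum rule over min-max normalized scores; the score values, score ranges, and the choice of normalization here are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's exact scheme): fusing matcher
# scores with a sum rule after min-max normalization. All score values
# and ranges below are hypothetical.

def min_max_normalize(score, lo, hi):
    """Map a raw matcher score into [0, 1] given its observed range."""
    return (score - lo) / (hi - lo)

def fuse_sum(scores):
    """Sum-rule fusion: average the normalized scores."""
    return sum(scores) / len(scores)

# Multi-modal fusion: one face score and one voice score per claim.
face = min_max_normalize(72.0, lo=0.0, hi=100.0)   # face matcher score
voice = min_max_normalize(-3.1, lo=-10.0, hi=0.0)  # voice log-likelihood
multimodal = fuse_sum([face, voice])

# Intramodal fusion: several face samples of the same subject.
faces = [min_max_normalize(s, 0.0, 100.0) for s in (72.0, 68.5, 75.2)]
intramodal = fuse_sum(faces)
```

Note that both schemes consume the same number of samples when configured as in Section 4, which is what makes the comparison between them fair.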
For such a study, a database containing face images and speech samples acquired in both indoor and outdoor environments is required. However, most multi-modal databases for face and voice are collected in an indoor environment. The M2VTS and XM2VTSDB (Messer et al., 1999) databases contain voice, 2D and 3D face images, collected in an indoor environment. Similarly, the CMU-PIE database (Sim et al., 2003) captures pose, illumination and expression variations of face images only in an indoor environment. Databases collected outdoors usually do not contain both face and voice samples from the same subject. The FERET database (Phillips et al., 2003) consists of face images collected both indoors and outdoors, but does not contain voice. When voice is recorded outdoors, cellular phones are usually used as the capturing device (Przybocki and Martin, 2002), which would be unsuitable in a physical access scenario. Studies on voice databases where the same subject is recorded both indoors and outdoors have not been published either.
Because of the dearth of suitable databases, most face–speech fusion studies have been conducted on indoor databases. Various approaches have been explored for score-level and decision-level fusion. Brunelli and Falavigna (1995) fused face and voice samples collected in an indoor environment using the weighted product approach. The multi-modal fusion resulted in a recognition rate of 98%, where the recognition rates of the face and voice systems were 91% and 88%, respectively. Bigun et al. (1997) used Bayesian statistics to fuse face and voice data using the M2VTS database (which only contains indoor samples). Jain et al. (1999) performed fusion on face, fingerprint and voice data based on the Neyman–Pearson rule; the database used for this experiment consists of samples from 50 subjects in an indoor environment. Sanderson and Paliwal (2002) compared and fused face and voice data in clean and noisy conditions. Although the data was collected indoors, they simulated a noisy environment by corrupting the data with additive white Gaussian noise at different signal-to-noise ratios.
Thus, from the literature surveyed, we see that most fusion experiments involving face and voice use an indoor dataset or try to simulate outdoor conditions by adding noise. Table 1 summarizes these studies and contrasts them with this work, which focuses on same-day, indoor–outdoor experiments.
This work is unique in three aspects. Firstly, our dataset contains face and voice samples that are collected in both indoor and outdoor environments. Secondly, in many fusion studies, a multi-modal sample is created by randomly pairing independent unimodal samples from different subjects. Researchers (Phillips et al., 2004) express the need for a corpus that reflects realistic conditions. In response, for this study, we have created a truly multi-modal database where the two modalities are collected from the same subjects. Thirdly, most fusion studies report an improvement in performance due to multi-modal fusion. However, it has been suggested (Phillips et al., 2004) that this improvement could partly be credited to the larger number of samples used. To ensure a fair comparison, where the number of samples used is the same, the multi-modal fusion performance is compared to that of intramodal fusion.
The layout of the paper is as follows. Section 2 describes our indoor–outdoor database. Section 3 describes the normalization and fusion procedures. The results are presented and analyzed in Section 4, followed by the conclusion of the paper in Section 5.
Database
For a realistic testing of multi-modal authentication systems, the dataset should mimic the operational environment. Many of the reported tests for biometric fusion are conducted on a multi-modal database that is composed of single biometric databases collected for different individuals. Some databases like M2VTS (Messer et al., 1999), DAVID (Chibelushi et al., 1996), MyIdea (Dumas et al., 2005), Biomet (Garcia-Salicetti et al., 2003) that contain face and voice samples from the same user are
Methods
We study the performance of the individual unimodal systems in the indoor–indoor, indoor–outdoor and outdoor–outdoor scenarios. Next, we evaluate the performance improvement obtained through intramodal fusion on face and intramodal fusion on voice in the indoor–outdoor scenario. The performance improvement obtained by multi-modal fusion is then compared to the intramodal performances.
Information fusion in biometrics is possible at the sensor, feature, score or decision levels (Aleksic and
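Of the fusion levels listed above, the two studied most often are score level and decision level. As a hedged illustration of the difference, the sketch below fuses the same hypothetical normalized scores both ways: combining the continuous scores before a single threshold decision versus thresholding each matcher first and taking a majority vote. The threshold value and scores are assumptions for the example only.

```python
# Hedged illustration of two fusion levels: score-level (combine
# continuous scores, decide once) vs decision-level (decide per
# matcher, fuse binary votes). Threshold and scores are hypothetical.

THRESHOLD = 0.5  # assumed common operating point on normalized scores

def score_level(scores):
    """Fuse first with the sum rule, then decide on the combined score."""
    return sum(scores) / len(scores) >= THRESHOLD

def decision_level(scores):
    """Decide per matcher, then fuse the accept/reject votes by majority."""
    votes = [s >= THRESHOLD for s in scores]
    return sum(votes) > len(votes) / 2

scores = [0.62, 0.48, 0.55]  # e.g. face, voice, and a second face sample
score_level(scores)          # mean 0.55 >= 0.5, so the claim is accepted
decision_level(scores)       # 2 of 3 matchers vote accept, also accepted
```

Score-level fusion retains more information than decision-level fusion, since a narrowly failing matcher still contributes its evidence rather than a hard reject vote; this is one reason score-level fusion is the focus of this study.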
Results and discussion
The performance of individual biometric systems in the indoor–indoor, indoor–outdoor and outdoor–outdoor environments is discussed in Section 4.1. Section 4.2 describes the use of multiple samples (intramodal fusion) to improve the performance of the individual modalities in the indoor–outdoor scenario. Section 4.3 investigates different normalization and fusion schemes for multi-modal fusion in the indoor–outdoor scenario and compares the result to that of intramodal fusion.
Conclusions
In this paper, we describe studies conducted on a novel, truly multi-modal, indoor–outdoor database. This is one of the first studies to deal with both face and voice, and their intramodal and multi-modal fusion, in an indoor–outdoor physical access scenario. The study mimics a realistic setting, as face and voice samples were collected from the same person under conditions typical of an operational scenario.
From this study, we uncover certain interesting observations. Firstly, we
References (33)
- et al. Score normalization in multimodal biometric systems. Pattern Recognition (2005)
- et al. Information fusion in biometrics. Pattern Recognition Lett. (2003)
- et al. Audio-visual biometrics. Proc. IEEE (2006)
- Bailly-Baillière, E., Bengio, S., Bimbot, F., Hamouz, M., Kittler, J., Mariéthoz, J., Matas, J., Messer, K., Popovici, ...
- et al. Fusion of face and speech data for person identity verification. IEEE Trans. Neural Networks (1999)
- Bigun, E., Bigun, J., Duc, B., Fischer, S., 1997. Expert conciliation for multimodal person authentication systems ...
- et al. Person identification using multiple cues. IEEE Trans. Pattern Anal. Machine Intell. (1995)
- et al. An evaluation of multimodal 2D + 3D face biometrics. IEEE Trans. Pattern Anal. Machine Intell. (2005)
- et al. Design issues for a digital audio-visual integrated database. IEE Colloq. Integrated Audio-Visual Process Recognition Synth. Comm. (1996)
- Dumas, B., Pugin, C., Hennebert, J., Petrovska-Delacrétaz, D., Humm, A., Evéquoz, F., Ingold, R., Von-Rotz, D., 2005. ...