Abstract
An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance by taking into account output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch, using additive white Gaussian noise for the audio signals and JPEG compression for the visual signals. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the individual audio or visual modality accuracies. For high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
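The abstract does not give the fusion rule or the reliability measure, so the following is only a minimal sketch of the general idea of reliability-weighted score fusion, not the authors' method. It assumes a weighted-sum fusion of normalized scores and a hypothetical reliability estimate defined as the margin between the best and second-best score of each modality; the function names `softmax`, `reliability`, and `fuse` are illustrative.

```python
import math


def softmax(scores):
    """Map raw classifier scores (e.g. log-likelihoods) to values that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reliability(posteriors):
    """Assumed reliability estimate: margin between the two highest scores."""
    ranked = sorted(posteriors, reverse=True)
    return ranked[0] - ranked[1]


def fuse(audio_scores, visual_scores):
    """Weighted-sum fusion where the audio weight depends on both reliabilities."""
    pa = softmax(audio_scores)
    pv = softmax(visual_scores)
    ra, rv = reliability(pa), reliability(pv)
    # Fall back to equal weights when neither modality is discriminative.
    alpha = 0.5 if ra + rv == 0 else ra / (ra + rv)
    fused = [alpha * a + (1 - alpha) * v for a, v in zip(pa, pv)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, alpha


# Illustrative scores only: the audio scores are nearly flat (as if degraded),
# while the visual scores clearly favour the speaker at index 2.
speaker, alpha = fuse([-10.0, -10.1, -10.2], [-12.0, -11.0, -5.0])
```

In this sketch the weight `alpha` is recomputed for every test utterance from the scores alone, so no labelled data or noise-level measurement is needed. When one modality produces nearly uniform scores, its weight shrinks and the decision is driven by the other modality.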
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fox, N.A., O’Mullane, B.A., Reilly, R.B. (2005). Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities. In: Kanade, T., Jain, A., Ratha, N.K. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2005. Lecture Notes in Computer Science, vol 3546. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527923_82
DOI: https://doi.org/10.1007/11527923_82
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27887-0
Online ISBN: 978-3-540-31638-1
eBook Packages: Computer Science (R0)