Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities

Fox, Niall A.; O’Mullane, Brian A.; Reilly, Richard B.

doi:10.1007/11527923_82

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3546)

Included in the following conference series:

International Conference on Audio- and Video-Based Biometric Person Authentication

Abstract

An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance by taking into account output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch, using additive white Gaussian noise for the audio signals and JPEG compression for the visual signals. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the individual audio or visual modality accuracies. For high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
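The idea of weighting each modality by a reliability estimate derived from its own output scores can be illustrated with a minimal sketch. The abstract does not give the authors' reliability measure or fusion rule, so everything below is an assumption made for illustration: the min-max score normalisation, the "gap between the best and second-best score" reliability measure, the weighted-sum fusion, and the function names normalise, reliability, and fuse are all hypothetical.

```python
from typing import List, Sequence, Tuple


def normalise(scores: Sequence[float]) -> List[float]:
    """Min-max scale one modality's per-speaker scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All speakers scored identically: no information in this modality.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def reliability(scores: Sequence[float]) -> float:
    """Assumed reliability measure (not the paper's): the gap between the
    best and second-best normalised score. Needs at least two speakers."""
    ranked = sorted(normalise(scores), reverse=True)
    return ranked[0] - ranked[1]


def fuse(audio: Sequence[float], visual: Sequence[float]) -> Tuple[int, float]:
    """Weighted-sum fusion of two score lists, one score per enrolled speaker.

    Returns the index of the winning speaker and the audio weight used.
    """
    r_a, r_v = reliability(audio), reliability(visual)
    total = r_a + r_v
    # Fall back to equal weights when neither modality separates the speakers.
    w_audio = 0.5 if total == 0 else r_a / total
    a, v = normalise(audio), normalise(visual)
    fused = [w_audio * x + (1.0 - w_audio) * y for x, y in zip(a, v)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, w_audio


if __name__ == "__main__":
    # Made-up log-likelihood scores for three speakers.
    audio_scores = [-10.0, -10.2, -10.1]   # best and runner-up are close
    visual_scores = [-30.0, -12.0, -28.0]  # one speaker clearly ahead
    speaker, w = fuse(audio_scores, visual_scores)
    print(speaker, round(w, 2))
```

In this sketch the weight is computed per test utterance from the scores alone, so no knowledge of the noise level or compression level is required; that property is what "unsupervised" is taken to mean here, under the stated assumptions.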

Author information

Authors and Affiliations

Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
Niall A. Fox, Brian A. O’Mullane & Richard B. Reilly

Editor information

Editors and Affiliations

The Robotics Institute, Carnegie Mellon University, Pittsburgh, 15213-3890, Pennsylvania, USA
Takeo Kanade
Withington Hospital, Nightingale Centre, Manchester, UK
Anil Jain
IBM Thomas J. Watson Research Center, 19 Skyline Drive, NY 10598, Hawthorne, USA
Nalini K. Ratha

Cite this paper

Fox, N.A., O’Mullane, B.A., Reilly, R.B. (2005). Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities. In: Kanade, T., Jain, A., Ratha, N.K. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2005. Lecture Notes in Computer Science, vol 3546. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527923_82

DOI: https://doi.org/10.1007/11527923_82
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27887-0
Online ISBN: 978-3-540-31638-1
eBook Packages: Computer Science, Computer Science (R0)
