Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features

Fox, Niall; Reilly, Richard B.

doi:10.1007/3-540-44887-X_86

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2688)

Included in the following conference series:

International Conference on Audio- and Video-Based Biometric Person Authentication

Abstract

This paper presents a speaker identification system based on dynamic features of both the audio and visual modes. Speakers are modeled using a text-dependent HMM methodology. Early and late audio-visual integration are investigated. Experiments are carried out for 252 speakers from the XM2VTS database. The experimental results show that adding dynamic visual information improves speaker identification accuracy under both clean and noisy audio conditions compared to the audio-only case. The best audio, visual, and audio-visual identification accuracies achieved were 86.91%, 57.14%, and 94.05%, respectively.
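
To make the distinction between early and late integration concrete, the sketch below shows one common way each is realized. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion rule, so the weighted sum of per-speaker log-likelihoods, the function names `early_integration` and `late_integration`, and the `audio_weight` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence


def early_integration(
    audio_frames: Sequence[Sequence[float]],
    visual_frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Early (feature-level) integration: concatenate time-aligned audio and
    visual feature vectors so that a single model sees one joint stream.

    Assumes both streams were already resampled to the same frame rate.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


def late_integration(
    audio_scores: Dict[str, float],
    visual_scores: Dict[str, float],
    audio_weight: float,
) -> str:
    """Late (decision-level) integration: each modality is scored by its own
    models, and the per-speaker log-likelihoods are fused afterwards.

    The weighted sum used here is an assumed fusion rule, chosen because it
    is simple; audio_weight = 1.0 reduces to the audio-only decision.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = {
        speaker: audio_weight * audio_scores[speaker]
        + (1.0 - audio_weight) * visual_scores[speaker]
        for speaker in audio_scores
    }
    # Identification picks the enrolled speaker with the highest fused score.
    return max(fused, key=fused.get)
```

In a scheme like this, lowering `audio_weight` as the audio becomes noisier shifts the decision toward the visual stream, which is one plausible route to the kind of robustness the abstract reports.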

Author information

Authors and Affiliations

Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
Niall Fox & Richard B. Reilly

Editor information

Editors and Affiliations

Center for Vision, Speech and Signal Proc., University of Surrey, GU2 7XH, Guildford, Surrey, UK
Josef Kittler
Department of Electronics and Computer Science, University of Southampton, SO17 1BJ, Southampton, UK
Mark S. Nixon

About this paper

Cite this paper

Fox, N., Reilly, R.B. (2003). Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_86

DOI: https://doi.org/10.1007/3-540-44887-X_86
Published: 24 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40302-9
Online ISBN: 978-3-540-44887-7
eBook Packages: Springer Book Archive
