Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3546)

Abstract

An audio-visual speaker identification system is described in which the audio and visual speech modalities are fused by an automatic, unsupervised process that adapts to local classifier performance by taking into account output-score-based reliability estimates of both modalities. Previously reported methods do not consider that both the audio and the visual modalities can be degraded. The visual modality uses the speaker's lip information. To test the robustness of the system, the audio and visual modalities are degraded to emulate various levels of train/test mismatch, employing additive white Gaussian noise for the audio and JPEG compression for the visual signals. Experiments are carried out on a large augmented data set from the XM2VTS database. The results show improved audio-visual accuracies at all tested levels of audio and visual degradation, compared to the individual audio or visual modality accuracies. For high mismatch levels, the audio, visual, and auto-adapted audio-visual accuracies are 37.1%, 48%, and 71.4%, respectively.
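The abstract does not give the fusion rule itself, so the following is only a minimal sketch of the general idea of reliability-weighted score fusion, not the authors' method. It assumes each modality yields one score per enrolled speaker, and it uses a hypothetical reliability estimate (the gap between the best and second-best normalized scores) to set the fusion weights. The function names `normalize`, `reliability`, and `fuse` are illustrative.

```python
import math


def normalize(scores):
    """Map raw per-speaker scores (e.g. log-likelihoods) to values that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reliability(scores):
    """Hypothetical reliability estimate: gap between the two best normalized scores.

    A flat score distribution (small gap) suggests a degraded, unreliable modality.
    Requires at least two speakers.
    """
    top = sorted(normalize(scores), reverse=True)
    return top[0] - top[1]


def fuse(audio_scores, visual_scores):
    """Weighted-sum fusion with weights set from the two reliability estimates.

    Returns (index of the identified speaker, audio weight).
    """
    ra = reliability(audio_scores)
    rv = reliability(visual_scores)
    total = ra + rv
    # Fall back to equal weights if neither modality discriminates at all.
    w_audio = 0.5 if total == 0 else ra / total
    pa = normalize(audio_scores)
    pv = normalize(visual_scores)
    fused = [w_audio * a + (1.0 - w_audio) * v for a, v in zip(pa, pv)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, w_audio


# Assumed toy scores: audio is nearly flat (degraded), visual clearly favors speaker 1.
best, w_audio = fuse([-10.0, -10.2, -10.1], [-12.0, -5.0, -11.0])
print(best, round(w_audio, 3))
```

In this toy case the nearly flat audio scores receive a small weight, so the fused decision follows the visual modality. Because the weights are recomputed for every test utterance from the scores alone, a scheme of this kind needs no supervision and can respond when either modality is degraded.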

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fox, N.A., O’Mullane, B.A., Reilly, R.B. (2005). Audio-Visual Speaker Identification via Adaptive Fusion Using Reliability Estimates of Both Modalities. In: Kanade, T., Jain, A., Ratha, N.K. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2005. Lecture Notes in Computer Science, vol 3546. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527923_82

  • DOI: https://doi.org/10.1007/11527923_82

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-27887-0

  • Online ISBN: 978-3-540-31638-1

  • eBook Packages: Computer Science, Computer Science (R0)
