
Fusion of Audio-Visual Information for Integrated Speech Processing

  • Conference paper
  • In: Audio- and Video-Based Biometric Person Authentication (AVBPA 2001)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2091)

Abstract

This paper describes the integration of audio and visual speech information for robust, adaptive speech processing. Since both the audio speech signal and the visible configuration of the face are produced by the same speech organs, the two types of information are strongly correlated and often complement each other. Two applications that exploit this relationship are presented: bimodal speech recognition that integrates audio-visual information to remain robust against acoustic noise, and speaking face synthesis driven by the correlation between audio and visual speech.
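The bimodal recognition mentioned above is commonly realized with multi-stream HMMs whose audio and visual log-likelihoods are combined through stream weights (sometimes called stream exponents). The sketch below illustrates only that weighting step, assuming per-frame log-likelihoods are already available from separately trained audio and visual models; the function name, weight values, and toy scores are invented for this example and are not taken from the paper.

```python
import numpy as np

def fused_log_likelihood(log_p_audio, log_p_visual,
                         lambda_audio=0.7, lambda_visual=0.3):
    """Weighted late integration of two modality streams.

    Each stream's per-frame log-likelihood is scaled by a stream
    weight; lowering lambda_audio under acoustic noise shifts the
    decision toward the more reliable lip-image stream.
    """
    return (lambda_audio * np.asarray(log_p_audio)
            + lambda_visual * np.asarray(log_p_visual))

# Toy example: three frames scored against one HMM state.
log_p_a = [-4.2, -3.8, -5.1]   # audio stream (degraded by noise)
log_p_v = [-2.9, -3.0, -2.7]   # visual (lip image) stream
print(fused_log_likelihood(log_p_a, log_p_v))
```

In practice the weights are not fixed: they are optimized or adapted according to an estimate of acoustic reliability (for example, the signal-to-noise ratio), so the recognizer relies more on the lip stream as noise increases.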

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nakamura, S. (2001). Fusion of Audio-Visual Information for Integrated Speech Processing. In: Bigun, J., Smeraldi, F. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2001. Lecture Notes in Computer Science, vol 2091. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45344-X_20

  • DOI: https://doi.org/10.1007/3-540-45344-X_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42216-7

  • Online ISBN: 978-3-540-45344-4
