Abstract
This paper describes the integration of audio and visual speech information for robust adaptive speech processing. Since both the acoustic speech signal and the visual configuration of the face are produced by the same speech organs, the two types of information are strongly correlated and often complement each other. Two applications built on this relationship are presented: bimodal speech recognition, which integrates audio-visual information to remain robust to acoustic noise, and speaking face synthesis, which exploits the correlation between audio and visual speech.
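For context, the audio-visual integration mentioned above is commonly realized by weighting the likelihoods contributed by the acoustic and visual streams. The sketch below is a minimal Python illustration of that standard stream-weight combination; the function name, the weight `lam`, and the toy scores are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of multi-stream audio-visual likelihood fusion.
# The weight `lam` and the toy scores below are illustrative
# assumptions, not taken from the paper.

def fused_log_likelihood(log_p_audio: float, log_p_visual: float,
                         lam: float) -> float:
    """Combine per-stream log-likelihoods with an exponential stream weight.

    A lam near 1.0 trusts the audio stream (clean conditions);
    lowering lam shifts weight toward the visual (lip-image)
    stream as acoustic noise increases.
    """
    return lam * log_p_audio + (1.0 - lam) * log_p_visual

# Toy usage: in heavy noise the audio score is unreliable,
# so a smaller lam lets the visual evidence dominate.
print(fused_log_likelihood(log_p_audio=-42.0, log_p_visual=-18.0, lam=0.3))
```

Choosing (or adapting) the stream weight to the acoustic conditions is what makes such a combined recognizer degrade gracefully in noise, which is the robustness property the abstract refers to.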
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Nakamura, S. (2001). Fusion of Audio-Visual Information for Integrated Speech Processing. In: Bigun, J., Smeraldi, F. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2001. Lecture Notes in Computer Science, vol 2091. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45344-X_20
Print ISBN: 978-3-540-42216-7
Online ISBN: 978-3-540-45344-4