Abstract
This paper describes the integration of audio and visual speech information for robust adaptive speech processing. Since both the acoustic speech signal and the visual configuration of the face are produced by the same speech organs, the two types of information are strongly correlated and often complement each other. Two applications built on this relationship are presented: bimodal speech recognition, which integrates audio-visual information to remain robust to acoustic noise, and speaking face synthesis, which exploits the correlation between audio and visual speech.
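For context, the audio-visual integration mentioned above is commonly realized by weighting the likelihoods contributed by the acoustic and visual streams. The sketch below is a minimal Python illustration of that standard stream-weight combination; the function name, the weight `lam`, and the toy scores are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of multi-stream audio-visual likelihood fusion.
# The weight `lam` and the toy scores below are illustrative
# assumptions, not taken from the paper.

def fused_log_likelihood(log_p_audio: float, log_p_visual: float,
                         lam: float) -> float:
    """Combine per-stream log-likelihoods with an exponential stream weight.

    A lam near 1.0 trusts the audio stream (clean conditions);
    lowering lam shifts weight toward the visual (lip-image)
    stream as acoustic noise increases.
    """
    return lam * log_p_audio + (1.0 - lam) * log_p_visual

# Toy usage: in heavy noise the audio score is unreliable,
# so a smaller lam lets the visual evidence dominate.
print(fused_log_likelihood(log_p_audio=-42.0, log_p_visual=-18.0, lam=0.3))
```

Choosing (or adapting) the stream weight to the acoustic conditions is what makes such a combined recognizer degrade gracefully in noise, which is the robustness property the abstract refers to.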
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Nakamura, S. (2001). Fusion of Audio-Visual Information for Integrated Speech Processing. In: Bigun, J., Smeraldi, F. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2001. Lecture Notes in Computer Science, vol 2091. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45344-X_20
Print ISBN: 978-3-540-42216-7
Online ISBN: 978-3-540-45344-4