Abstract
This paper describes audio-visual speech recognition experiments on a multi-speaker, large-vocabulary corpus using the Janus speech recognition toolkit. We describe the complete audio-visual speech recognition system and present experiments on this corpus. Using visual cues as additional input to the speech recognizer yielded improvements on both clean and noisy speech in our experiments.
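The abstract only sketches how visual cues are fed into the recognizer. As a minimal illustration of one common approach, feature-level (early) fusion, the Python sketch below time-aligns a slower visual feature stream to the audio frame rate and concatenates the per-frame vectors into a single observation vector. The function name, feature dimensionalities, frame rates, and nearest-neighbour upsampling are illustrative assumptions, not necessarily the scheme used in this paper.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature-level (early) audio-visual fusion: upsample the visual
    stream to the audio frame rate, then concatenate per-frame vectors.

    audio_feats:  (T_a, D_a) array, e.g. MFCCs at ~100 frames/s
    visual_feats: (T_v, D_v) array, e.g. lip-region features at ~25-30 frames/s
    """
    t_a = audio_feats.shape[0]
    t_v = visual_feats.shape[0]
    # Nearest-neighbour alignment of visual frames to audio frames.
    idx = np.minimum((np.arange(t_a) * t_v / t_a).astype(int), t_v - 1)
    visual_upsampled = visual_feats[idx]
    # One joint observation vector per audio frame.
    return np.hstack([audio_feats, visual_upsampled])

if __name__ == "__main__":
    audio = np.random.randn(100, 13)   # 1 s of 13-dim audio features at 100 Hz
    video = np.random.randn(30, 10)    # 1 s of 10-dim visual features at 30 Hz
    print(fuse_features(audio, video).shape)  # (100, 23)
```

The concatenated vectors would then be used as observations for the acoustic models in place of audio-only features; decision-level or stream-weighted fusion are common alternatives to this early-fusion sketch.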
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kratt, J., Metze, F., Stiefelhagen, R., Waibel, A. (2004). Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds) Pattern Recognition. DAGM 2004. Lecture Notes in Computer Science, vol 3175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28649-3_60
DOI: https://doi.org/10.1007/978-3-540-28649-3_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22945-2
Online ISBN: 978-3-540-28649-3