Abstract
This paper describes audio-visual speech recognition experiments on a multi-speaker, large-vocabulary corpus using the Janus speech recognition toolkit. We describe the complete audio-visual speech recognition system and present experiments on this corpus. Using visual cues as additional input to the speech recognizer yielded improvements on both clean and noisy speech in our experiments.
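The abstract only sketches how visual cues are fed into the recognizer. As a minimal illustration of one common approach, feature-level (early) fusion, the Python sketch below time-aligns a slower visual feature stream to the audio frame rate and concatenates the per-frame vectors into a single observation vector. The function name, feature dimensionalities, frame rates, and nearest-neighbour upsampling are illustrative assumptions, not necessarily the scheme used in this paper.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature-level (early) audio-visual fusion: upsample the visual
    stream to the audio frame rate, then concatenate per-frame vectors.

    audio_feats:  (T_a, D_a) array, e.g. MFCCs at ~100 frames/s
    visual_feats: (T_v, D_v) array, e.g. lip-region features at ~25-30 frames/s
    """
    t_a = audio_feats.shape[0]
    t_v = visual_feats.shape[0]
    # Nearest-neighbour alignment of visual frames to audio frames.
    idx = np.minimum((np.arange(t_a) * t_v / t_a).astype(int), t_v - 1)
    visual_upsampled = visual_feats[idx]
    # One joint observation vector per audio frame.
    return np.hstack([audio_feats, visual_upsampled])

if __name__ == "__main__":
    audio = np.random.randn(100, 13)   # 1 s of 13-dim audio features at 100 Hz
    video = np.random.randn(30, 10)    # 1 s of 10-dim visual features at 30 Hz
    print(fuse_features(audio, video).shape)  # (100, 23)
```

The concatenated vectors would then be used as observations for the acoustic models in place of audio-only features; decision-level or stream-weighted fusion are common alternatives to this early-fusion sketch.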
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kratt, J., Metze, F., Stiefelhagen, R., Waibel, A. (2004). Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds) Pattern Recognition. DAGM 2004. Lecture Notes in Computer Science, vol 3175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28649-3_60
DOI: https://doi.org/10.1007/978-3-540-28649-3_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22945-2
Online ISBN: 978-3-540-28649-3