Multimodal interfaces

Artificial Intelligence Review

Abstract

In this paper, we present an overview of research in our laboratories on Multimodal Human-Computer Interfaces. The goal of such interfaces is to free human-computer interaction from the limitations and acceptance barriers imposed by rigid operating commands and by the keyboard as the only or main I/O device. Instead, we aim to involve all available human communication modalities. These modalities include Speech, Gesture and Pointing, Eye-Gaze, Lip Motion and Facial Expression, Handwriting, Face Recognition, Face Tracking, and Sound Localization.

Cite this article

Waibel, A., Vo, M.T., Duchnowski, P. et al. Multimodal interfaces. Artif Intell Rev 10, 299–319 (1996). https://doi.org/10.1007/BF00127684
