DOI: 10.1145/1027933.1027972

A segment-based audio-visual speech recognizer: data collection, development, and initial experiments

Published: 13 October 2004

ABSTRACT

This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 hours of read speech collected from 223 different speakers. This corpus was used to evaluate our AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments demonstrate improvements in phonetic recognition performance when visual information is incorporated into the speech recognition process.
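The abstract does not spell out the integration scheme, but a common baseline in the AVSR literature is stream-weighted fusion: the audio and visual models each score a segment hypothesis, and their log-likelihoods are combined with a reliability weight before decoding. The sketch below is a hypothetical illustration of that idea, not the paper's segment-constrained HMM method; the phone scores, the weight `lam`, and both function names are invented for the example.

```python
def fuse_scores(audio_logprobs, visual_logprobs, lam=0.7):
    """Combine per-phone audio and visual log-likelihoods for one segment.

    lam weights the audio stream; (1 - lam) weights the visual stream.
    Both inputs are hypothetical dicts mapping phone labels to log-likelihoods.
    """
    return {
        phone: lam * audio_logprobs[phone] + (1 - lam) * visual_logprobs[phone]
        for phone in audio_logprobs
    }

def best_phone(fused):
    # Pick the phone hypothesis with the highest fused score.
    return max(fused, key=fused.get)

# Toy scores: audio alone slightly prefers /p/, but the visual stream
# (e.g. clearly closed lips) pulls the fused decision toward /b/.
audio = {"b": -2.0, "p": -1.5, "m": -3.0}
visual = {"b": -1.0, "p": -2.5, "m": -1.2}
print(best_phone(fuse_scores(audio, visual)))
```

In practice the stream weight is often tuned to the acoustic conditions (lowered audio weight in noise), which is one intuition behind why visual information helps phonetic recognition.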


Published in:
ICMI '04: Proceedings of the 6th International Conference on Multimodal Interfaces
October 2004, 368 pages
ISBN: 1581139950
DOI: 10.1145/1027933

      Copyright © 2004 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 453 of 1,080 submissions (42%)
