
Visual Features Extracting & Selecting for Lipreading

  • Conference paper
Audio- and Video-Based Biometric Person Authentication (AVBPA 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2688)


Abstract

This paper puts forward an effective way to select and extract visual features for lipreading. The features are drawn from both the low level and the high level of the image, and the two groups complement each other; together they form a 41-dimensional feature vector used for recognition. Tested on AVCC, a bimodal database of sentences covering all Chinese pronunciations, lipreading assistance raises automatic speech recognition accuracy from 84.1% to 87.8%. Under noisy conditions, it improves accuracy by 19.5 percentage points (from 31.7% to 51.2%) in the speaker-dependent case and by 27.7 points (from 27.6% to 55.3%) in the speaker-independent case. The paper also shows that visual speech information can effectively compensate for the loss of acoustic information: in our system it improves the recognition rate by 10% to 30%, varying with the amount of noise in the speech signal, a larger improvement than that reported for IBM's audio-visual ASR system, with the advantage growing in noisier environments.
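The paper's exact feature set is not reproduced on this page, but the general recipe it names, concatenating low-level appearance measurements with high-level lip-shape measurements into one fixed-length vector, can be sketched as below. This is a minimal illustration under stated assumptions: the 35/6 split between DCT appearance coefficients and geometric shape measurements, and all function names, are hypothetical choices for the sketch, not the paper's actual design.

```python
import numpy as np
from scipy.fft import dctn


def lowlevel_features(mouth_roi, n_coeffs=35):
    """Low-level (appearance) features: the lowest-frequency coefficients
    of a 2-D DCT of the grey-scale mouth region, in zig-zag order.
    The count of 35 coefficients is an assumption, not the paper's value."""
    coeffs = dctn(mouth_roi.astype(float), norm="ortho")
    h, w = coeffs.shape
    # Zig-zag order: sort indices by i + j so low spatial frequencies come first.
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in order[:n_coeffs]])


def highlevel_features(lip_contour):
    """High-level (shape) features from a fitted lip contour of (x, y) points:
    width, height, area, perimeter, aspect ratio, compactness.
    These six measurements are illustrative, not the paper's exact set."""
    xs, ys = lip_contour[:, 0], lip_contour[:, 1]
    width, height = xs.max() - xs.min(), ys.max() - ys.min()
    # Shoelace formula for the area of the closed contour polygon.
    area = 0.5 * abs(np.dot(xs, np.roll(ys, 1)) - np.dot(ys, np.roll(xs, 1)))
    perimeter = np.hypot(np.diff(xs, append=xs[0]),
                         np.diff(ys, append=ys[0])).sum()
    return np.array([width, height, area, perimeter,
                     height / width, 4 * np.pi * area / perimeter ** 2])


def visual_features(mouth_roi, lip_contour):
    """Concatenate both groups into one 41-dimensional vector (35 + 6)."""
    return np.concatenate([lowlevel_features(mouth_roi),
                           highlevel_features(lip_contour)])
```

In an audio-visual recognizer of the kind evaluated here, a vector like this would be computed per video frame and combined with the acoustic features before decoding; the accuracy gains quoted above come from that kind of combination.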

This work was supported by the Chinese National High Technology Plan, “Multi-model perception techniques” (2001AA114160).

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yao, Hx., Gao, W., Shan, W., Xu, Mh. (2003). Visual Features Extracting & Selecting for Lipreading. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_30

  • DOI: https://doi.org/10.1007/3-540-44887-X_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40302-9

  • Online ISBN: 978-3-540-44887-7
