Abstract
This paper presents a method for effectively selecting and extracting visual features for lipreading. The features are drawn from both low-level and high-level representations, which complement each other, yielding a 41-dimensional feature vector for recognition. Tested on the bimodal database AVCC, which consists of sentences covering all Chinese pronunciations, lipreading assistance raises automatic speech recognition accuracy from 84.1% to 87.8%. Under noisy conditions, it improves accuracy by 19.5 percentage points (from 31.7% to 51.2%) in the speaker-dependent case and by 27.7 points (from 27.6% to 55.3%) in the speaker-independent case. The paper also shows that visual speech information can effectively compensate for the loss of acoustic information: in our system the recognition rate improves by 10% to 30%, varying with the amount of noise in the speech signal, a larger improvement margin than that of IBM's ASR system, and performance is better in noisy environments.
This work was supported by the Chinese National High Technology Plan "Multi-model perception techniques" (2001AA114160).
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Yao, Hx., Gao, W., Shan, W., Xu, Mh. (2003). Visual Features Extracting & Selecting for Lipreading. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40302-9
Online ISBN: 978-3-540-44887-7