Abstract
This paper presents a method for effectively selecting and extracting visual features for lipreading. The features are drawn from both low-level and high-level representations, which complement each other, yielding a 41-dimensional feature vector for recognition. Tested on the bimodal database AVCC, which consists of sentences covering all Chinese pronunciations, lipreading assistance raises automatic speech recognition accuracy from 84.1% to 87.8%. Under noisy conditions, it improves accuracy by 19.5 percentage points (from 31.7% to 51.2%) in the speaker-dependent case and by 27.7 points (from 27.6% to 55.3%) in the speaker-independent case. The paper also shows that visual speech information can effectively compensate for the loss of acoustic information: in our system the recognition rate improves by 10% to 30%, varying with the amount of noise in the speech signal, a larger improvement margin than that of IBM's ASR system, and performance is better in noisy environments.
This work was supported by the Chinese National High Technology Plan "Multi-model perception techniques" (2001AA114160).
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Yao, Hx., Gao, W., Shan, W., Xu, Mh. (2003). Visual Features Extracting & Selecting for Lipreading. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40302-9
Online ISBN: 978-3-540-44887-7