Abstract
We address the problems of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of jointly modelling the temporal behaviour of the acoustic and visual speech signals by applying multi-stream hidden Markov models. This approach admits different temporal topologies and levels of stream integration, and hence enables temporal dependencies to be modelled more accurately. The system has been evaluated on a continuously spoken digit recognition task with 37 subjects.
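The stream-combination idea behind the multi-stream approach can be illustrated with a minimal sketch. The Python fragment below (not the authors' implementation; the state dictionary, feature dimensions, and the stream exponents gamma_a and gamma_v are illustrative assumptions) combines per-stream Gaussian emission log-likelihoods as log b_j(o) = gamma_a * log b_j^a(o_a) + gamma_v * log b_j^v(o_v).

```python
import numpy as np
from scipy.stats import multivariate_normal

def stream_log_likelihood(obs, mean, cov):
    """Log-likelihood of one stream's observation under a Gaussian emission."""
    return multivariate_normal.logpdf(obs, mean=mean, cov=cov)

def combined_log_emission(o_audio, o_video, state, gamma_a=0.7, gamma_v=0.3):
    """Weighted multi-stream emission score for one HMM state:
    log b_j(o) = gamma_a * log b_j^a(o_a) + gamma_v * log b_j^v(o_v).
    The weights gamma_a, gamma_v are illustrative, not values from the paper."""
    ll_a = stream_log_likelihood(o_audio, state["mean_a"], state["cov_a"])
    ll_v = stream_log_likelihood(o_video, state["mean_v"], state["cov_v"])
    return gamma_a * ll_a + gamma_v * ll_v

# Toy example: one state with independent audio and video Gaussians.
rng = np.random.default_rng(0)
state = {
    "mean_a": np.zeros(13), "cov_a": np.eye(13),  # e.g. 13 acoustic features
    "mean_v": np.zeros(6),  "cov_v": np.eye(6),   # e.g. 6 visual shape features
}
print(combined_log_emission(rng.normal(size=13), rng.normal(size=6), state))
```

In a full recogniser this combined score would take the place of the single-stream emission score inside the Viterbi recursion; the stream exponents can be set to reflect the relative reliability of each modality (for instance, down-weighting the acoustic stream in noise), while the chosen topology determines the points at which the two streams are forced to resynchronise.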
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luettin, J., Dupont, S. (1998). Continuous audio-visual speech recognition. In: Burkhardt, H., Neumann, B. (eds) Computer Vision — ECCV’98. ECCV 1998. Lecture Notes in Computer Science, vol 1407. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0054771
DOI: https://doi.org/10.1007/BFb0054771
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64613-6
Online ISBN: 978-3-540-69235-5
eBook Packages: Springer Book Archive