Abstract
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that significantly improves recognition of spoken words and phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region-of-interest containing the speaker's mouth area. The first stage is a typical image compression transform that yields a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis (LDA)-based projection, applied to a concatenation of a small number of consecutive image-transformed feature vectors. The third stage is a data rotation by means of a maximum likelihood linear transform (MLLT) that maximizes the likelihood of the observed data under the assumption of a class-conditional multivariate normal distribution with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour, large-vocabulary, continuous-speech audio-visual database. Each added stage of the proposed algorithm yields significant classification accuracy gains which, when combined, reach up to a 27% improvement. Overall, we achieved a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
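The abstract's three-stage cascade can be illustrated on synthetic data. The sketch below is a minimal, hypothetical rendering in NumPy: it assumes a 2-D DCT as the first-stage image compression transform, a standard scatter-matrix LDA for the second stage, and, since the true MLLT is an iterative likelihood maximization, substitutes a crude stand-in rotation that diagonalizes the average within-class covariance. All dimensions, window lengths, and class counts are toy values, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (the paper uses real mouth-region video frames).
n_frames, H, W = 120, 16, 16
frames = rng.standard_normal((n_frames, H, W))
labels = rng.integers(0, 3, size=n_frames)      # toy "phoneme" classes

# Stage 1: image compression transform (here: 2-D DCT, keep low-freq coeffs).
def dct_matrix(n):
    # Orthonormal DCT-II basis, built from its definition.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= 1 / np.sqrt(n)
    M[1:] *= np.sqrt(2 / n)
    return M

D_h, D_w = dct_matrix(H), dct_matrix(W)

def stage1(frame, kh=4, kw=4):
    coeffs = D_h @ frame @ D_w.T
    return coeffs[:kh, :kw].ravel()             # 16-dim compressed vector

X = np.stack([stage1(f) for f in frames])       # (n_frames, 16)

# Stage 2: concatenate J consecutive transformed frames, then LDA-project.
J = 3
Xc = np.stack([X[i:i + J].ravel() for i in range(n_frames - J + 1)])
yc = labels[J // 2 : J // 2 + len(Xc)]          # centre-frame labels

def lda(X, y, d):
    mu = X.mean(0)
    Sw = np.zeros((X.shape[1],) * 2)            # within-class scatter
    Sb = np.zeros_like(Sw)                      # between-class scatter
    for c in np.unique(y):
        Xk = X[y == c]
        mk = Xk.mean(0)
        Sw += (Xk - mk).T @ (Xk - mk)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)
    evals, evecs = np.linalg.eig(
        np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:d]]

P = lda(Xc, yc, d=2)
Y = Xc @ P                                      # discriminant features

# Stage 3 (crude stand-in for MLLT): rotate so the average within-class
# covariance becomes diagonal; the actual MLLT instead iteratively
# maximizes the diagonal-Gaussian likelihood of the data.
classes = np.unique(yc)
Sw_avg = sum(np.cov(Y[yc == c].T) for c in classes) / len(classes)
_, R = np.linalg.eigh(Sw_avg)
Z = Y @ R                                       # final cascade output
print(Z.shape)
```

The pipeline mirrors the cascade's structure (compress, discriminate over a temporal window, rotate toward diagonal covariance) rather than reproducing the paper's exact configuration.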
Cite this article
Potamianos, G., Neti, C., Iyengar, G. et al. A Cascade Visual Front End for Speaker Independent Automatic Speechreading. International Journal of Speech Technology 4, 193–208 (2001). https://doi.org/10.1023/A:1011352422845