
A Cascade Visual Front End for Speaker Independent Automatic Speechreading

Published in: International Journal of Speech Technology

Abstract

We propose a three-stage, pixel-based visual front end for automatic speechreading (lipreading) that significantly improves recognition performance on spoken words and phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region of interest containing the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear-discriminant-analysis-based projection, applied to a concatenation of a small number of consecutive image-transformed video frames. The third stage is a data rotation by means of a maximum likelihood linear transform that maximizes the likelihood of the observed data under the assumption of class-conditional multivariate normal distributions with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour-long, large-vocabulary, continuous-speech audio-visual database. We demonstrated significant classification accuracy gains from each added stage of the proposed algorithm; combined, the three stages yield up to a 27% improvement. Overall, we achieved a 60% (49%) visual-only, frame-level visemic classification accuracy with (without) the use of test-set viseme boundaries. In addition, we report improved audio-visual phonetic classification over a single-stage image-transform visual front end. Finally, we discuss preliminary speech recognition results.
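To make the cascade concrete, below is a minimal Python sketch of a three-stage front end of the kind the abstract describes. It is illustrative only, not the paper's implementation: the use of a separable 2-D DCT as the stage-one compression transform, the 7-frame context window, the feature dimensionalities, and the plain gradient-ascent optimizer for the stage-three MLLT criterion F(A) = N log|det A| − (1/2) Σ_c N_c log det diag(A Σ_c Aᵀ) are all assumptions, and every function name and the synthetic data are hypothetical.

```python
# Illustrative sketch of a three-stage visual front end cascade
# (2-D DCT -> LDA over stacked frames -> MLLT rotation).
# All concrete choices are assumptions, not the paper's configuration.
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import eigh


def stage1_compress(roi_frames, n_coef=24):
    """Stage 1: image compression transform of each mouth-ROI frame.
    A separable 2-D DCT with low-frequency truncation is assumed."""
    k = int(np.ceil(np.sqrt(n_coef)))
    feats = []
    for frame in roi_frames:  # frame: (H, W) grayscale ROI
        c = dct(dct(frame, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(c[:k, :k].ravel()[:n_coef])  # keep high-energy coefficients
    return np.asarray(feats)  # (T, n_coef)


def stack_frames(feats, window=7):
    """Concatenate each frame with its neighbours (the 'small number of
    consecutive frames' fed to stage 2); borders are zero-padded."""
    T, d = feats.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, d)), feats, np.zeros((half, d))])
    return np.hstack([padded[i:i + T] for i in range(window)])  # (T, window*d)


def stage2_lda(X, y, n_out):
    """Stage 2: LDA projection, solving the generalized eigenproblem
    S_b v = lambda * S_w v for the top discriminant directions."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))  # regularize S_w
    order = np.argsort(evals)[::-1][:n_out]
    return evecs[:, order]  # projection matrix, (d, n_out)


def stage3_mllt(X, y, iters=200, lr=1e-4):
    """Stage 3: MLLT rotation A, fit here by simple gradient ascent on
    F(A) = N log|det A| - 1/2 * sum_c N_c log det diag(A S_c A^T),
    the data likelihood under class-conditional Gaussians constrained
    to diagonal covariance."""
    d = X.shape[1]
    covs, counts = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        covs.append(np.cov(Xc, rowvar=False) + 1e-6 * np.eye(d))
        counts.append(len(Xc))
    N = float(sum(counts))
    A = np.eye(d)
    for _ in range(iters):
        grad = N * np.linalg.inv(A).T
        for S, n in zip(covs, counts):
            D_inv = np.diag(1.0 / np.diag(A @ S @ A.T))
            grad -= n * D_inv @ A @ S
        A += lr * grad
    return A


# Usage on synthetic data: T mouth-ROI frames with per-frame class labels.
rng = np.random.default_rng(0)
frames = rng.random((200, 32, 32))      # 200 frames of a 32x32 mouth ROI
labels = rng.integers(0, 5, size=200)   # 5 classes, stand-in for visemes
X1 = stage1_compress(frames)            # stage 1: (200, 24)
X2 = stack_frames(X1)                   # context window: (200, 168)
P = stage2_lda(X2, labels, n_out=4)     # LDA gives at most C - 1 = 4 directions
X3 = X2 @ P                             # discriminant features: (200, 4)
A = stage3_mllt(X3, labels)             # stage 3 rotation
features = X3 @ A.T                     # final cascade output
```

Note how each stage is estimated on the output of the previous one, mirroring the cascade structure: the compression transform is class-blind, while the LDA projection and the MLLT rotation both exploit the phonetic or visemic class labels.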




Cite this article

Potamianos, G., Neti, C., Iyengar, G. et al. A Cascade Visual Front End for Speaker Independent Automatic Speechreading. International Journal of Speech Technology 4, 193–208 (2001). https://doi.org/10.1023/A:1011352422845
