Abstract
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that significantly improves recognition of spoken words and phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region-of-interest containing the speaker's mouth area. The first stage is a typical image compression transform that yields a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis (LDA)-based projection, applied to a concatenation of a small number of consecutive image-transformed feature vectors. The third stage is a data rotation by means of a maximum likelihood linear transform (MLLT) that maximizes the likelihood of the observed data under the assumption of a class-conditional multivariate normal distribution with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour, large-vocabulary, continuous-speech audio-visual database. Each added stage of the proposed algorithm yields significant classification accuracy gains which, when combined, reach up to a 27% improvement. Overall, we achieved a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
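The abstract's three-stage cascade can be illustrated on synthetic data. The sketch below is a minimal, hypothetical rendering in NumPy: it assumes a 2-D DCT as the first-stage image compression transform, a standard scatter-matrix LDA for the second stage, and, since the true MLLT is an iterative likelihood maximization, substitutes a crude stand-in rotation that diagonalizes the average within-class covariance. All dimensions, window lengths, and class counts are toy values, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (the paper uses real mouth-region video frames).
n_frames, H, W = 120, 16, 16
frames = rng.standard_normal((n_frames, H, W))
labels = rng.integers(0, 3, size=n_frames)      # toy "phoneme" classes

# Stage 1: image compression transform (here: 2-D DCT, keep low-freq coeffs).
def dct_matrix(n):
    # Orthonormal DCT-II basis, built from its definition.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= 1 / np.sqrt(n)
    M[1:] *= np.sqrt(2 / n)
    return M

D_h, D_w = dct_matrix(H), dct_matrix(W)

def stage1(frame, kh=4, kw=4):
    coeffs = D_h @ frame @ D_w.T
    return coeffs[:kh, :kw].ravel()             # 16-dim compressed vector

X = np.stack([stage1(f) for f in frames])       # (n_frames, 16)

# Stage 2: concatenate J consecutive transformed frames, then LDA-project.
J = 3
Xc = np.stack([X[i:i + J].ravel() for i in range(n_frames - J + 1)])
yc = labels[J // 2 : J // 2 + len(Xc)]          # centre-frame labels

def lda(X, y, d):
    mu = X.mean(0)
    Sw = np.zeros((X.shape[1],) * 2)            # within-class scatter
    Sb = np.zeros_like(Sw)                      # between-class scatter
    for c in np.unique(y):
        Xk = X[y == c]
        mk = Xk.mean(0)
        Sw += (Xk - mk).T @ (Xk - mk)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)
    evals, evecs = np.linalg.eig(
        np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:d]]

P = lda(Xc, yc, d=2)
Y = Xc @ P                                      # discriminant features

# Stage 3 (crude stand-in for MLLT): rotate so the average within-class
# covariance becomes diagonal; the actual MLLT instead iteratively
# maximizes the diagonal-Gaussian likelihood of the data.
classes = np.unique(yc)
Sw_avg = sum(np.cov(Y[yc == c].T) for c in classes) / len(classes)
_, R = np.linalg.eigh(Sw_avg)
Z = Y @ R                                       # final cascade output
print(Z.shape)
```

The pipeline mirrors the cascade's structure (compress, discriminate over a temporal window, rotate toward diagonal covariance) rather than reproducing the paper's exact configuration.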
Cite this article
Potamianos, G., Neti, C., Iyengar, G. et al. A Cascade Visual Front End for Speaker Independent Automatic Speechreading. International Journal of Speech Technology 4, 193–208 (2001). https://doi.org/10.1023/A:1011352422845