Abstract

Visual speech recognition refers to the identification of utterances from the movements of the speaker's lips, tongue, teeth, and other facial muscles, without using the acoustic signal. This work shows the relative benefits of both static and dynamic visual speech features for improved visual speech recognition. Two approaches for visual feature extraction are considered: (1) an image-transform-based static feature approach, in which the Discrete Cosine Transform (DCT) is applied to each video frame and the coefficients in a 6×6 triangle region are taken as features; Principal Component Analysis (PCA) is then applied over all 60 features of the frame to reduce redundancy, and the resulting 21 coefficients are taken as the static visual features; (2) a motion-segmentation-based dynamic feature approach, in which the facial movements are segmented from the video using motion history images (MHI); the DCT is applied to the MHI, and the triangle region coefficients are taken as the dynamic visual features. Two types of experiments were performed to identify the utterances: one with concatenated features and another with dimension-reduced features obtained by PCA. Left-right continuous HMMs are used as the visual speech classifier to classify nine MPEG-4 standard viseme consonants. The experimental results show that both the concatenated and the dimension-reduced features improve visual speech recognition, with high accuracies of 92.45% and 92.15%, respectively.
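The feature pipelines described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact triangular DCT mask that yields 60 coefficients per frame is not specified in the abstract, so the triangle side `k` and the MHI parameters `tau` and `delta` below are hypothetical, and the HMM classification stage is omitted.

```python
import numpy as np
from scipy.fft import dctn


def triangle_dct(frame, k=6):
    """Static features: 2-D DCT of a grayscale mouth-region frame,
    keeping the low-frequency triangle of coefficients (i, j) with
    i + j < k. (k is illustrative; the paper uses a 6x6 triangle region.)"""
    C = dctn(frame.astype(float), norm="ortho")
    return np.array([C[i, j] for i in range(k)
                     for j in range(k) if i + j < k])


def pca_reduce(X, n_components=21):
    """Reduce redundancy across per-frame feature vectors.
    X has shape (num_frames, num_features); rows are centered and
    projected onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def motion_history_image(frames, tau=255.0, delta=32):
    """Dynamic features: a simple MHI where recently moving pixels are
    bright and older motion decays linearly. tau/delta are assumed values."""
    mhi = np.zeros(frames[0].shape, dtype=float)
    decay = tau / max(len(frames) - 1, 1)
    for prev, curr in zip(frames, frames[1:]):
        motion = np.abs(curr.astype(float) - prev.astype(float)) > delta
        mhi = np.where(motion, tau, np.maximum(mhi - decay, 0.0))
    return mhi
```

In this sketch the dynamic features would then be obtained by applying `triangle_dct` to the MHI itself, and the static and dynamic vectors either concatenated or passed through `pca_reduce`, mirroring the two experimental conditions in the abstract.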




Copyright information

© 2009 Indian Institute of Information Technology, India

About this paper

Cite this paper

Rajavel, R., Sathidevi, P.S. (2009). Static and Dynamic Features for Improved HMM based Visual Speech Recognition. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_17

  • DOI: https://doi.org/10.1007/978-81-8489-203-1_17

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-8489-404-2

  • Online ISBN: 978-81-8489-203-1
