
Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

  • Regular Paper

Abstract

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, region-of-interest detection and feature extraction remain problematic and can degrade recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and to the 3D lip information, and the resulting planar-image and 3D lip features were fused into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrate that our AVSR system integrating 3D lip information improves the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.
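
To make the state-synchronous two-stream combination concrete, the following minimal Python sketch shows how such a model can score a synchronized audio-visual observation pair: both streams share one HMM state sequence, and the state emission likelihood is the product of per-stream likelihoods raised to exponential stream weights (equivalently, a weighted sum in the log domain). This is an illustration only; the diagonal-Gaussian emission model, the 0.7 audio weight, the feature dimensions, and all function and variable names below are assumptions for the sketch, not the authors' implementation.

    import numpy as np

    def diag_gauss_loglik(x, mean, var):
        # Log-density of x under a diagonal-covariance Gaussian.
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def two_stream_log_b(o_audio, o_visual, state, lam_a=0.7):
        # State-synchronous combination: both streams share the HMM state j,
        # and log b_j(o) = lam_a * log b_j^audio + (1 - lam_a) * log b_j^visual.
        # The `state` dict layout (per-stream mean/variance) is hypothetical.
        ll_a = diag_gauss_loglik(o_audio, state["mean_a"], state["var_a"])
        ll_v = diag_gauss_loglik(o_visual, state["mean_v"], state["var_v"])
        return lam_a * ll_a + (1.0 - lam_a) * ll_v

    # Example: a 39-dim acoustic feature vector (e.g., MFCCs) and a fused
    # visual-3D lip feature vector; the dimensions here are illustrative.
    rng = np.random.default_rng(0)
    state = {"mean_a": np.zeros(39), "var_a": np.ones(39),
             "mean_v": np.zeros(30), "var_v": np.ones(30)}
    print(two_stream_log_b(rng.standard_normal(39),
                           rng.standard_normal(30), state))

In practice the stream weights are typically tuned to the acoustic signal-to-noise ratio, so the visual stream contributes more as the audio degrades; this is the mechanism by which the fused visual-3D lip features improve robustness in noise.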

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China (General Program Grants No. 61175016 and No. 61471259).

Author information

Corresponding author

Correspondence to Jianguo Wei.

Additional information

Communicated by B. Huet.

About this article

Cite this article

Wang, J., Zhang, J., Honda, K. et al. Audio-visual speech recognition integrating 3D lip information obtained from the Kinect. Multimedia Systems 22, 315–323 (2016). https://doi.org/10.1007/s00530-015-0499-9
