Abstract
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, region-of-interest detection and feature extraction remain problematic and can degrade recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and to the 3D lip information, and the resulting features were fused into a joint visual-3D lip feature. For recognition, several fusion methods were investigated, and the audio and visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. Experimental results demonstrate that the proposed AVSR system with 3D lip information outperforms both a traditional ASR system and a conventional AVSR system in acoustically noisy environments.
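In a state-synchronous two-stream HMM, the audio and visual observation streams share one state sequence, and each state's emission score is a weighted combination of per-stream log-likelihoods. A minimal sketch of that combination, assuming diagonal-covariance Gaussian emissions; the function names, model shapes, and weight value are illustrative and not taken from the paper:

```python
import math

def gauss_loglik(obs, mean, var):
    """Log-likelihood of obs under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, mean, var)
    )

def combined_loglik(audio_obs, visual_obs, audio_model, visual_model,
                    audio_weight=0.7):
    """State emission score of a state-synchronous two-stream HMM:
    a weighted sum of the two streams' log-likelihoods, with the
    stream weights constrained to sum to 1."""
    visual_weight = 1.0 - audio_weight
    la = gauss_loglik(audio_obs, *audio_model)   # audio stream score
    lv = gauss_loglik(visual_obs, *visual_model)  # visual stream score
    return audio_weight * la + visual_weight * lv
```

In practice the audio weight is often raised in clean conditions and lowered in noise, which is what lets the visual (here, 3D lip) stream carry the recognition when the acoustic channel degrades.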
Acknowledgments
This research was supported in part by the National Natural Science Foundation of China (General Program Grants No. 61175016 and No. 61471259).
Communicated by B. Huet.
Wang, J., Zhang, J., Honda, K. et al. Audio-visual speech recognition integrating 3D lip information obtained from the Kinect. Multimedia Systems 22, 315–323 (2016). https://doi.org/10.1007/s00530-015-0499-9