Abstract
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, region-of-interest detection and feature extraction remain problematic and can degrade recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and to the 3D lip information, and the resulting features were fused into a joint visual-3D lip feature. For recognition, several fusion methods were investigated, and the audio and visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. Experimental results demonstrate that the proposed AVSR system with 3D lip information outperforms both a traditional ASR system and a conventional AVSR system in acoustically noisy environments.
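In a state-synchronous two-stream HMM, the audio and visual observation streams share one state sequence, and each state's emission score is a weighted combination of per-stream log-likelihoods. A minimal sketch of that combination, assuming diagonal-covariance Gaussian emissions; the function names, model shapes, and weight value are illustrative and not taken from the paper:

```python
import math

def gauss_loglik(obs, mean, var):
    """Log-likelihood of obs under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, mean, var)
    )

def combined_loglik(audio_obs, visual_obs, audio_model, visual_model,
                    audio_weight=0.7):
    """State emission score of a state-synchronous two-stream HMM:
    a weighted sum of the two streams' log-likelihoods, with the
    stream weights constrained to sum to 1."""
    visual_weight = 1.0 - audio_weight
    la = gauss_loglik(audio_obs, *audio_model)   # audio stream score
    lv = gauss_loglik(visual_obs, *visual_model)  # visual stream score
    return audio_weight * la + visual_weight * lv
```

In practice the audio weight is often raised in clean conditions and lowered in noise, which is what lets the visual (here, 3D lip) stream carry the recognition when the acoustic channel degrades.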
Acknowledgments
This research was supported in part by the National Natural Science Foundation of China (General Program Grants No. 61175016 and No. 61471259).
Communicated by B. Huet.
Wang, J., Zhang, J., Honda, K. et al. Audio-visual speech recognition integrating 3D lip information obtained from the Kinect. Multimedia Systems 22, 315–323 (2016). https://doi.org/10.1007/s00530-015-0499-9