Abstract
The vast majority of current research on audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to a few tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and also depth information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderate-sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve the word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been reached using the interpolated depth data.
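The abstract reports gains as *relative* improvements over the acoustic-only baseline. As a minimal sketch of what that means, the snippet below uses the common definition 100 × (fused − baseline) / baseline; the specific accuracy figures are hypothetical and chosen only to illustrate a 22% relative gain, not taken from the paper.

```python
def relative_improvement(baseline_acc: float, fused_acc: float) -> float:
    """Relative word-accuracy improvement of a fused system over a
    baseline, in percent: 100 * (fused - baseline) / baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (fused_acc - baseline_acc) / baseline_acc

# Hypothetical example: an audio-only word accuracy of 60.0% improved
# to 73.2% by audiovisual fusion is a 22% relative gain.
print(round(relative_improvement(60.0, 73.2), 1))  # → 22.0
```

Note that a 22% relative gain is much smaller in absolute terms (here 13.2 percentage points), which is why reporting conventions matter when comparing fusion results across papers.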
References
Assael YM, Shillingford B, Whiteson S, de Freitas N (2016) LipNet: sentence-level lipreading. CoRR abs/1611.01599
Cao X, Wei Y, Wen F, Sun J (2012) Face alignment by explicit shape regression. In: CVPR
Chung JS, Senior AW, Vinyals O, Zisserman A (2016) Lip reading sentences in the wild. CoRR
Císař P (2006) Application of lipreading methods for speech recognition. Ph.D. thesis
Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424
Estellers V, Gurban M, Thiran J (2012) On dynamic stream weighting for audio-visual speech recognition. IEEE Trans Audio Speech Lang Process 20(4):1145–1157
Galatas G, Potamianos G, Makedon F (2012) Audio-visual speech recognition incorporating facial depth information captured by the kinect. In: Proceedings of the 20th European signal processing conference (EUSIPCO), pp 2714–2717
Glotin H, Vergyri D, Neti C, Potamianos G, Luettin J (2001) Weighting schemes for audio-visual fusion in speech recognition. In: 2001 IEEE international conference on acoustics, speech, and signal processing (ICASSP '01), vol 1, pp 173–176
Harte N, Gillen E (2015) TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans Multimed 17(5):603–615
Lan Y, Theobald B, Harvey R, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of the international conference on auditory-visual speech processing, 2010, pp 142–147
Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, Huang TS (2004) AVICAR: audio-visual speech corpus in a car environment. In: INTERSPEECH, pp 2489–2492
Lucey S, Chen T, Sridharan S, Chandran V (2005) Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition. IEEE Trans Multimed 7(3):495–506
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning, ICML 2011, Bellevue, Washington, USA, June 28-July 2, 2011, pp 689–696
Noda K, Yamaguchi Y, Nakadai K, Okuno H, Ogata T (2014) Lipreading using convolutional neural network. In: INTERSPEECH, pp 1149–1153
Nouza J, Psutka J, Uhlíř J (1997) Phonetic alphabet for speech recognition of Czech. Radioengineering 6(4):16–20
Ong E, Bowden R (2011) Learning sequential patterns for lipreading. In: Proceedings of the British machine vision conference, BMVC 2011, Dundee, UK, August 29-September 2, 2011, pp 1–10
Paleček K (2016) Lipreading using spatiotemporal histogram of oriented gradients. In: EUSIPCO 2016, Budapest, Hungary, pp 1882–1885
Paleček K (2017) Spatiotemporal convolutional features for lipreading. Springer, Cham, pp 438–446
Paleček K (2017) Utilizing lipreading in large vocabulary continuous speech recognition. In: Karpov A, Potapova R, Mporas I (eds) Speech and computer. Springer, Cham, pp 767–776
Pei Y, Kim T, Zha H (2013) Unsupervised random forest manifold alignment for lipreading. In: IEEE international conference on computer vision, ICCV 2013, Sydney, Australia, December 1–8, 2013, pp 129–136
Petridis S, Li Z, Pantic M (2017) End-to-end visual speech recognition with LSTMs. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2592–2596
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audio-visual speech. Proc IEEE 91(9):1306–1326
Ramage MD (2013) Disproving visemes as the basic visual unit of speech. Ph.D. thesis
Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T (2005) Visual speech recognition with loosely synchronized feature streams. In: Proceedings of the tenth IEEE international conference on computer vision, ICCV ’05, vol 2. IEEE Computer Society, Washington, DC, USA, pp 1424–1431
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of ICSLP, vol 2. Denver, USA, pp 901–904
Sui C, Bennamoun M, Togneri R (2016) Visual speech feature representations: recent advances. Springer, Cham, pp 377–396
Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd B (ed) Hearing by eye: the psychology of lip-reading. Lawrence Erlbaum Associates, Hillsdale
Wand M, Koutník J, Schmidhuber J (2016) Lipreading with long short-term memory. CoRR
Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265
Zhou Z, Zhao G, Hong X, Pietikäinen M (2014) A review of recent advances in visual speech decoding. Image Vis Comput 32(9):590–605
Zhou Z, Zhao G, Pietikäinen M (2011) Towards a practical lipreading system. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, CVPR '11. IEEE Computer Society, Washington, DC, USA, pp 137–144
This paper is an extended version of [20] that was presented at the SPECOM 2017 conference.
Paleček, K. Experimenting with lipreading for large vocabulary continuous speech recognition. J Multimodal User Interfaces 12, 309–318 (2018). https://doi.org/10.1007/s12193-018-0266-2