Abstract
The vast majority of current research on audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to a few tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and also depth information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderate-sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve the word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been reached using the interpolated depth data.
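The abstract reports gains as *relative* improvements over the acoustic-only baseline. As a minimal sketch of what that means, the snippet below uses the common definition 100 × (fused − baseline) / baseline; the specific accuracy figures are hypothetical and chosen only to illustrate a 22% relative gain, not taken from the paper.

```python
def relative_improvement(baseline_acc: float, fused_acc: float) -> float:
    """Relative word-accuracy improvement of a fused system over a
    baseline, in percent: 100 * (fused - baseline) / baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (fused_acc - baseline_acc) / baseline_acc

# Hypothetical example: an audio-only word accuracy of 60.0% improved
# to 73.2% by audiovisual fusion is a 22% relative gain.
print(round(relative_improvement(60.0, 73.2), 1))  # → 22.0
```

Note that a 22% relative gain is much smaller in absolute terms (here 13.2 percentage points), which is why reporting conventions matter when comparing fusion results across papers.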
References
Assael YM, Shillingford B, Whiteson S, de Freitas N (2016) LipNet: sentence-level lipreading. CoRR abs/1611.01599
Cao X, Wei Y, Wen F, Sun J (2012) Face alignment by explicit shape regression. In: CVPR
Chung JS, Senior AW, Vinyals O, Zisserman A (2016) Lip reading sentences in the wild. CoRR
Císař P (2006) Application of lipreading methods for speech recognition. Ph.D. thesis
Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424
Estellers V, Gurban M, Thiran J (2012) On dynamic stream weighting for audio-visual speech recognition. IEEE Trans Audio Speech Lang Process 20(4):1145–1157
Galatas G, Potamianos G, Makedon F (2012) Audio-visual speech recognition incorporating facial depth information captured by the kinect. In: Proceedings of the 20th European signal processing conference (EUSIPCO), pp 2714–2717
Glotin H, Vergyri D, Neti C, Potamianos G, Luettin J (2001) Weighting schemes for audio-visual fusion in speech recognition. In: 2001 IEEE international conference on acoustics, speech, and signal processing (ICASSP '01), vol 1, pp 173–176
Harte N, Gillen E (2015) TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans Multimed 17(5):603–615
Lan Y, Theobald B, Harvey R, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of the international conference on auditory-visual speech processing, 2010, pp 142–147
Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, Huang TS (2004) AVICAR: audio-visual speech corpus in a car environment. In: INTERSPEECH, pp 2489–2492
Lucey S, Chen T, Sridharan S, Chandran V (2005) Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition. IEEE Trans Multimed 7(3):495–506
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning, ICML 2011, Bellevue, Washington, USA, June 28-July 2, 2011, pp 689–696
Noda K, Yamaguchi Y, Nakadai K, Okuno H, Ogata T (2014) Lipreading using convolutional neural network. In: INTERSPEECH, pp 1149–1153
Nouza J, Psutka J, Uhlíř J (1997) Phonetic alphabet for speech recognition of Czech. Radioengineering 6(4):16–20
Ong E, Bowden R (2011) Learning sequential patterns for lipreading. In: Proceedings of the British machine vision conference, BMVC 2011, Dundee, UK, August 29-September 2, 2011, pp 1–10
Paleček K (2016) Lipreading using spatiotemporal histogram of oriented gradients. In: EUSIPCO 2016, Budapest, Hungary, pp 1882–1885
Paleček K (2017) Spatiotemporal convolutional features for lipreading. Springer, Cham, pp 438–446
Paleček K (2017) Utilizing lipreading in large vocabulary continuous speech recognition. In: Karpov A, Potapova R, Mporas I (eds) Speech and computer. Springer, Cham, pp 767–776
Pei Y, Kim T, Zha H (2013) Unsupervised random forest manifold alignment for lipreading. In: IEEE international conference on computer vision, ICCV 2013, Sydney, Australia, December 1–8, 2013, pp 129–136
Petridis S, Li Z, Pantic M (2017) End-to-end visual speech recognition with LSTMs. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2592–2596
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audio-visual speech. Proc IEEE 91(9):1306–1326
Ramage MD (2013) Disproving visemes as the basic visual unit of speech. Ph.D. thesis
Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T (2005) Visual speech recognition with loosely synchronized feature streams. In: Proceedings of the tenth IEEE international conference on computer vision, ICCV ’05, vol 2. IEEE Computer Society, Washington, DC, USA, pp 1424–1431
Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of ICSLP, vol 2. Denver, USA, pp 901–904
Sui C, Bennamoun M, Togneri R (2016) Visual speech feature representations: recent advances. Springer, Cham, pp 377–396
Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd B (ed) Hearing by eye: the psychology of lip-reading. Lawrence Erlbaum Associates, Hillsdale
Wand M, Koutník J, Schmidhuber J (2016) Lipreading with long short-term memory. CoRR
Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265
Zhou Z, Zhao G, Hong X, Pietikäinen M (2014) A review of recent advances in visual speech decoding. Image Vis Comput 32(9):590–605
Zhou Z, Zhao G, Pietikäinen M (2011) Towards a practical lipreading system. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, CVPR '11. IEEE Computer Society, Washington, DC, USA, pp 137–144
This paper is an extended version of [20] that was presented at the SPECOM 2017 conference.
Paleček, K. Experimenting with lipreading for large vocabulary continuous speech recognition. J Multimodal User Interfaces 12, 309–318 (2018). https://doi.org/10.1007/s12193-018-0266-2