Experimenting with lipreading for large vocabulary continuous speech recognition

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

The vast majority of current research on audiovisual speech recognition via lipreading from frontal face videos focuses on simple cases such as isolated phrase recognition or structured speech, where the vocabulary is limited to several tens of units. In this paper, we diverge from these traditional applications and investigate the effect of incorporating visual and depth information in the task of continuous speech recognition with vocabulary sizes ranging from several hundred to half a million words. To this end, we evaluate various visual speech parametrizations, both existing and novel, that are designed to capture different kinds of information in the video and depth signals. The experiments are conducted on a moderately sized dataset of 54 speakers, each uttering 100 sentences in Czech. Both the video and depth data were captured by the Microsoft Kinect device. We show that even for large vocabularies the visual signal contains enough information to improve word accuracy by up to 22% relative to acoustic-only recognition. Somewhat surprisingly, a relative improvement of up to 16% has also been achieved using the interpolated depth data.
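
To put the relative numbers in context, the improvement reported above can be computed as in the following short Python sketch. The helper function and the accuracies used in the example are hypothetical illustrations of the formula, not results taken from the paper.

    def relative_improvement(acc_baseline, acc_multimodal):
        # Relative word-accuracy improvement (in %) over an acoustic-only baseline.
        return (acc_multimodal - acc_baseline) / acc_baseline * 100.0

    # Hypothetical example: an acoustic-only word accuracy of 60.0% that rises
    # to 73.2% after adding the visual stream is a 22% relative improvement.
    print(round(relative_improvement(60.0, 73.2), 1))  # -> 22.0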

Notes

  1. https://github.com/julius-speech/julius.

References

  1. Assael YM, Shillingford B, Whiteson S, de Freitas N (2016) LipNet: sentence-level lipreading. CoRR abs/1611.01599

  2. Cao X, Wei Y, Wen F, Sun J (2012) Face alignment by explicit shape regression. In: CVPR

  3. Chung JS, Senior AW, Vinyals O, Zisserman A (2016) Lip reading sentences in the wild. CoRR abs/1611.05358

  4. Císař P (2006) Application of lipreading methods for speech recognition. Ph.D. thesis

  5. Cooke M, Barker J, Cunningham S, Shao X (2006) An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am 120(5):2421–2424

  6. Estellers V, Gurban M, Thiran J (2012) On dynamic stream weighting for audio-visual speech recognition. IEEE Trans Audio Speech Lang Process 20(4):1145–1157

  7. Galatas G, Potamianos G, Makedon F (2012) Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In: Proceedings of the 20th European signal processing conference (EUSIPCO), pp 2714–2717

  8. Glotin H, Vergyri D, Neti C, Potamianos G, Luettin J (2001) Weighting schemes for audio-visual fusion in speech recognition. In: 2001 IEEE international conference on acoustics, speech, and signal processing (ICASSP ’01), vol 1, pp 173–176

  9. Harte N, Gillen E (2015) TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans Multimed 17(5):603–615

  10. Lan Y, Theobald B, Harvey R, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of the international conference on auditory-visual speech processing (AVSP 2010), pp 142–147

  11. Lee B, Hasegawa-Johnson M, Goudeseune C, Kamdar S, Borys S, Liu M, Huang TS (2004) AVICAR: audio-visual speech corpus in a car environment. In: INTERSPEECH, pp 2489–2492

  12. Lucey S, Chen T, Sridharan S, Chandran V (2005) Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition. IEEE Trans Multimed 7(3):495–506

  13. McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748

  14. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning, ICML 2011, Bellevue, Washington, USA, June 28-July 2, 2011, pp 689–696

  15. Noda K, Yamaguchi Y, Nakadai K, Okuno H, Ogata T (2014) Lipreading using convolutional neural network. In: INTERSPEECH, pp 1149–1153

  16. Nouza J, Psutka J, Uhlíř J (1997) Phonetic alphabet for speech recognition of Czech. Radioengineering 6(4):16–20

  17. Ong E, Bowden R (2011) Learning sequential patterns for lipreading. In: Proceedings of the British machine vision conference, BMVC 2011, Dundee, UK, August 29-September 2, 2011, pp 1–10

  18. Paleček K (2016) Lipreading using spatiotemporal histogram of oriented gradients. In: EUSIPCO 2016, Budapest, Hungary, pp 1882–1885

  19. Paleček K (2017) Spatiotemporal convolutional features for lipreading. Springer, Cham, pp 438–446

  20. Paleček K (2017) Utilizing lipreading in large vocabulary continuous speech recognition. In: Karpov A, Potapova R, Mporas I (eds) Speech and computer. Springer, Cham, pp 767–776

  21. Pei Y, Kim T, Zha H (2013) Unsupervised random forest manifold alignment for lipreading. In: IEEE international conference on computer vision, ICCV 2013, Sydney, Australia, December 1–8, 2013, pp 129–136

  22. Petridis S, Li Z, Pantic M (2017) End-to-end visual speech recognition with LSTMs. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 2592–2596

  23. Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audio-visual speech. Proc IEEE 91(9):1306–1326

  24. Ramage MD (2013) Disproving visemes as the basic visual unit of speech. Ph.D. thesis

  25. Saenko K, Livescu K, Siracusa M, Wilson K, Glass J, Darrell T (2005) Visual speech recognition with loosely synchronized feature streams. In: Proceedings of the tenth IEEE international conference on computer vision, ICCV ’05, vol 2. IEEE Computer Society, Washington, DC, USA, pp 1424–1431

  26. Stolcke A (2002) SRILM: an extensible language modeling toolkit. In: Proceedings of ICSLP, vol 2. Denver, USA, pp 901–904

  27. Sui C, Bennamoun M, Togneri R (2016) Visual speech feature representations: recent advances. Springer, Cham, pp 377–396

  28. Summerfield Q (1987) Some preliminaries to a comprehensive account of audio-visual speech perception. In: Dodd B (ed) Hearing by eye: the psychology of lip-reading. Lawrence Erlbaum Associates, Hillsdale

  29. Wand M, Koutník J, Schmidhuber J (2016) Lipreading with long short-term memory. CoRR abs/1601.08188

  30. Zhao G, Barnard M, Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Trans Multimed 11(7):1254–1265

  31. Zhou Z, Zhao G, Hong X, Pietikäinen M (2014) A review of recent advances in visual speech decoding. Image Vis Comput 32(9):590–605

  32. Zhou Z, Zhao G, Pietikäinen M (2011) Towards a practical lipreading system. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition, CVPR ’11. IEEE Computer Society, Washington, DC, USA, pp 137–144

Author information

Correspondence to Karel Paleček.

Additional information

This paper is an extended version of [20] that was presented at the SPECOM 2017 conference.

About this article

Cite this article

Paleček, K. Experimenting with lipreading for large vocabulary continuous speech recognition. J Multimodal User Interfaces 12, 309–318 (2018). https://doi.org/10.1007/s12193-018-0266-2
