Abstract
Our goal is to spot words in silent speech videos, where the speaker's lip motion is clearly visible but the audio is absent, without explicitly recognizing the spoken words. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to its limited vocabulary and heavy dependence on the model's recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on a separate set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between the spatiotemporal landmarks of the query and those of the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate an application of the method by spotting words in a popular speech video ("The Great Dictator" by Charlie Chaplin), showing that word retrieval could be used to understand what was spoken in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze how the performance of the underlying lip-reader and the input video quality affect the proposed word spotting pipeline.
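The retrieval components named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimensionality, the `alpha` blending weight for pseudo-relevance feedback, the function names, and the equal-length landmark trajectories assumed by the re-ranker are all illustrative assumptions; only the general techniques (cosine-similarity retrieval, top-k feedback averaging, Pearson-correlation re-ranking, and average precision as the evaluation metric) follow the abstract.

```python
import numpy as np

def retrieve(query_emb, gallery):
    """Rank gallery items by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    return np.argsort(-scores)  # indices, best first

def expand_query(query_emb, gallery, k=5, alpha=0.5):
    """Pseudo-relevance feedback (illustrative): treat the top-k retrieved
    items as relevant, average their embeddings, and blend with the
    original query. alpha is a hypothetical mixing weight."""
    top = retrieve(query_emb, gallery)[:k]
    feedback = gallery[top].mean(axis=0)
    return alpha * query_emb + (1 - alpha) * feedback

def rerank_by_landmarks(query_lms, cand_lms, initial_order, top_n=10):
    """Re-rank the top-n candidates by Pearson correlation between
    flattened spatiotemporal landmark trajectories (assumes all
    trajectories share one shape; the paper's exact scheme may differ)."""
    def corr(a, b):
        return np.corrcoef(a.ravel(), b.ravel())[0, 1]
    head = sorted(initial_order[:top_n],
                  key=lambda i: -corr(query_lms, cand_lms[i]))
    return np.array(head + list(initial_order[top_n:]))

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance is 1/0 per ranked item.
    Mean AP over all queries gives the mAP reported in the abstract."""
    hits, prec_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / max(hits, 1)
```

For example, a ranked list with relevant items at positions 1 and 3 yields AP = (1/1 + 2/3) / 2 ≈ 0.833; averaging such scores over all query words gives the mean average precision used to compare the recognition-free and recognition-based pipelines.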
Acknowledgements
This work is partly supported by Alexa Graduate Fellowship from Amazon.
Cite this article
Jha, A., Namboodiri, V.P. & Jawahar, C.V. Spotting words in silent speech videos: a retrieval-based approach. Machine Vision and Applications 30, 217–229 (2019). https://doi.org/10.1007/s00138-019-01006-y