Spotting words in silent speech videos: a retrieval-based approach

  • Special Issue Paper
  • Published:
Machine Vision and Applications

Abstract

Our goal is to spot words in silent speech videos, in which the lip motion of the speaker is clearly visible but the audio is absent, without explicitly recognizing the spoken words. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model because of the restricted vocabulary and the strong dependence on the model’s recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and compare its performance against recognition-based retrieval on a large-scale dataset and on a separate set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between the spatiotemporal landmarks of the query and those of the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate an application of the method by spotting words in a popular speech video (“The Great Dictator” by Charlie Chaplin), showing that word retrieval could be used to understand what was spoken in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze how the performance of the underlying lip-reader and the quality of the input video affect the proposed word spotting pipeline.
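
To make the retrieval setting concrete, below is a minimal sketch of recognition-free retrieval with pseudo-relevance-feedback query expansion and a per-query average-precision evaluation. It assumes fixed-length embeddings have already been extracted for the query clip and for every database clip; all function and variable names are illustrative and are not taken from the paper’s implementation.

    # Minimal sketch (assumption: pre-computed fixed-length embeddings for
    # the query clip and for each database clip; names are illustrative).
    import numpy as np

    def cosine_similarities(query, database):
        # Cosine similarity between one query vector and every database vector.
        q = query / (np.linalg.norm(query) + 1e-8)
        d = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-8)
        return d @ q

    def retrieve_with_prf(query, database, k=5):
        # Rank clips by similarity, then expand the query with the mean of the
        # top-k retrieved embeddings (pseudo-relevance feedback) and re-rank.
        scores = cosine_similarities(query, database)
        top_k = np.argsort(-scores)[:k]
        expanded = np.mean(np.vstack([query[None, :], database[top_k]]), axis=0)
        return np.argsort(-cosine_similarities(expanded, database))

    def average_precision(ranked_relevance):
        # AP for a single query, given binary relevance labels in ranked order.
        hits, precisions = 0, []
        for rank, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        return float(np.mean(precisions)) if precisions else 0.0

    # Toy usage: 100 database clips with 64-dimensional embeddings, one query.
    rng = np.random.default_rng(0)
    database = rng.normal(size=(100, 64))
    relevance = rng.integers(0, 2, size=100)   # 1 = clip contains the query word
    query = database[relevance == 1][0] + 0.1 * rng.normal(size=64)
    ranking = retrieve_with_prf(query, database)
    print("AP for this query:", average_precision(relevance[ranking]))

Averaging this per-query AP over all query words gives the mean average precision reported above. The additional re-ranking step based on the correlation of spatiotemporal landmarks would slot in after the pseudo-relevance-feedback expansion and is omitted from this sketch.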

Acknowledgements

This work is partly supported by an Alexa Graduate Fellowship from Amazon.

Author information

Corresponding author

Correspondence to Abhishek Jha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jha, A., Namboodiri, V.P. & Jawahar, C.V. Spotting words in silent speech videos: a retrieval-based approach. Machine Vision and Applications 30, 217–229 (2019). https://doi.org/10.1007/s00138-019-01006-y
