Abstract
Our goal is to spot words in silent speech videos, where the speaker's lip motion is clearly visible but the audio is absent, without explicitly recognizing the spoken words. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to its limited vocabulary and heavy dependence on the model's recognition performance. Our contribution is twofold: (1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on a separate set of out-of-vocabulary words; (2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between the spatiotemporal landmarks of the query and those of the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. We also demonstrate an application of the method by spotting words in a popular speech video ("The Great Dictator" by Charlie Chaplin), showing that word retrieval could be used to understand what was spoken in silent movies. Finally, we compare our model against ASR in a noisy environment and analyze how the performance of the underlying lip-reader and the input video quality affect the proposed word spotting pipeline.
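The retrieval components named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimensionality, the `alpha` blending weight for pseudo-relevance feedback, the function names, and the equal-length landmark trajectories assumed by the re-ranker are all illustrative assumptions; only the general techniques (cosine-similarity retrieval, top-k feedback averaging, Pearson-correlation re-ranking, and average precision as the evaluation metric) follow the abstract.

```python
import numpy as np

def retrieve(query_emb, gallery):
    """Rank gallery items by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    return np.argsort(-scores)  # indices, best first

def expand_query(query_emb, gallery, k=5, alpha=0.5):
    """Pseudo-relevance feedback (illustrative): treat the top-k retrieved
    items as relevant, average their embeddings, and blend with the
    original query. alpha is a hypothetical mixing weight."""
    top = retrieve(query_emb, gallery)[:k]
    feedback = gallery[top].mean(axis=0)
    return alpha * query_emb + (1 - alpha) * feedback

def rerank_by_landmarks(query_lms, cand_lms, initial_order, top_n=10):
    """Re-rank the top-n candidates by Pearson correlation between
    flattened spatiotemporal landmark trajectories (assumes all
    trajectories share one shape; the paper's exact scheme may differ)."""
    def corr(a, b):
        return np.corrcoef(a.ravel(), b.ravel())[0, 1]
    head = sorted(initial_order[:top_n],
                  key=lambda i: -corr(query_lms, cand_lms[i]))
    return np.array(head + list(initial_order[top_n:]))

def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance is 1/0 per ranked item.
    Mean AP over all queries gives the mAP reported in the abstract."""
    hits, prec_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / max(hits, 1)
```

For example, a ranked list with relevant items at positions 1 and 3 yields AP = (1/1 + 2/3) / 2 ≈ 0.833; averaging such scores over all query words gives the mean average precision used to compare the recognition-free and recognition-based pipelines.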
Acknowledgements
This work is partly supported by Alexa Graduate Fellowship from Amazon.
Cite this article
Jha, A., Namboodiri, V.P. & Jawahar, C.V. Spotting words in silent speech videos: a retrieval-based approach. Machine Vision and Applications 30, 217–229 (2019). https://doi.org/10.1007/s00138-019-01006-y