DOI: 10.1145/3421558.3421563

A Survey of Lipreading Methods Based on Deep Learning

Published: 25 November 2020

ABSTRACT

Visual speech recognition, also known as lipreading, recognizes speech content by decoding the visual information in a speaker's lip movements, without any acoustic signal. Lipreading based on traditional handcrafted features performs poorly in complex scenes. Following the great success of deep learning in image classification, applying deep learning to lipreading has become a new trend, although challenges and open problems remain. Deep-learning lipreading models are typically divided into a front-end and a back-end: the front-end extracts lip-movement features, and the back-end decodes the long-term temporal information. This paper discusses and analyzes the convolutional neural network structures used in the front-end and the sequence-processing models used in the back-end. It also introduces the current lipreading datasets and compares the methods evaluated on them.
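As a rough illustration of the front-end/back-end split described above, the sketch below pairs a 3D convolutional front-end that extracts lip-movement features with a recurrent back-end that models the long-term temporal information. It is a minimal, assumption-laden example rather than any specific model from the surveyed literature: the layer sizes, the choice of a bidirectional GRU, and the LRW-style input (29 grayscale mouth crops of 88x88 pixels) are all illustrative.

```python
# Minimal sketch of a front-end/back-end lipreading model (illustrative only).
import torch
import torch.nn as nn

class LipreadingModel(nn.Module):
    def __init__(self, num_classes: int = 500):
        super().__init__()
        # Front-end: a 3D convolution over (time, height, width) captures
        # short-range lip motion; pooling reduces the spatial resolution.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Back-end: a bidirectional GRU decodes the long-term temporal
        # structure of the per-frame features.
        self.backend = nn.GRU(input_size=64, hidden_size=256,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale mouth crops
        feats = self.frontend(x)            # (B, C, T, H', W')
        feats = feats.mean(dim=(3, 4))      # spatial average pool -> (B, C, T)
        feats = feats.transpose(1, 2)       # (B, T, C) for the recurrent back-end
        out, _ = self.backend(feats)
        return self.classifier(out[:, -1])  # word-level prediction from last step

# Example usage: a batch of 2 clips, 29 frames of 88x88 mouth crops.
if __name__ == "__main__":
    model = LipreadingModel(num_classes=500)
    clips = torch.randn(2, 1, 29, 88, 88)
    print(model(clips).shape)  # torch.Size([2, 500])
```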


• Published in

  IPMV '20: Proceedings of the 2020 2nd International Conference on Image Processing and Machine Vision
  August 2020, 194 pages
  ISBN: 9781450388412
  DOI: 10.1145/3421558
  Copyright © 2020 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States
  Published: 25 November 2020
