ABSTRACT
Visual speech recognition, also known as lipreading, is the task of recognizing speech content by decoding the visual information in a speaker's lip movements, without access to any acoustic signal. Lipreading methods based on traditional handcrafted features perform poorly in complex scenes. Following the great success of deep learning in image classification, applying deep learning to lipreading has become a new trend, although challenges and open problems remain. Deep-learning-based lipreading models are typically divided into a front-end and a back-end: the front-end extracts lip-movement features, and the back-end then decodes the long-term temporal information. In this paper, we discuss and analyze the convolutional neural network structures used in the front-end and the sequence-processing models used in the back-end. In addition, we introduce the current lipreading datasets and compare the methods evaluated on them.
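To make the front-end/back-end split concrete, the following is a minimal, hypothetical PyTorch sketch (not any specific model from the surveyed literature): a 3D convolutional front-end extracts short-range lip-motion features from a clip of mouth-region frames, and a recurrent back-end decodes the resulting feature sequence. All layer sizes and names here are illustrative assumptions.

```python
# Minimal illustrative sketch of a front-end/back-end lipreading model.
# Layer choices and dimensions are assumptions for exposition only.
import torch
import torch.nn as nn

class LipreadingSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Front-end: a 3D convolution captures short-range lip motion
        # across neighbouring frames; spatial pooling collapses H and W
        # while preserving the time axis.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Back-end: a recurrent layer decodes long-term temporal structure.
        self.backend = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips):
        # clips: (batch, channel=1, time, height, width)
        feats = self.frontend(clips)           # (B, 32, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, 32, T)
        feats = feats.transpose(1, 2)          # (B, T, 32)
        out, _ = self.backend(feats)           # (B, T, 64)
        return self.classifier(out[:, -1])     # word-level prediction

model = LipreadingSketch()
logits = model(torch.randn(2, 1, 16, 48, 48))  # 2 clips of 16 frames, 48x48
print(logits.shape)  # torch.Size([2, 10])
```

For sentence-level lipreading the final classifier would instead emit a prediction per time step and be trained with a sequence loss such as CTC, but the front-end/back-end decomposition is the same.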