ABSTRACT
Visual speech recognition, also known as lipreading, is the task of recognizing speech content by decoding the visual information in a speaker's lip movements, without access to any acoustic signal. Lipreading methods based on traditional handcrafted features perform poorly in complex scenes. Following the great success of deep learning in image classification, applying deep learning to lipreading has become a new trend, although challenges and open problems remain. Deep-learning-based lipreading models are typically divided into a front-end and a back-end: the front-end extracts lip-movement features, and the back-end then decodes the long-term temporal information. In this paper, we discuss and analyze the convolutional neural network structures used in the front-end and the sequence-processing models used in the back-end. In addition, we introduce the current lipreading datasets and compare the methods evaluated on them.
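To make the front-end/back-end split concrete, the following is a minimal, hypothetical PyTorch sketch (not any specific model from the surveyed literature): a 3D convolutional front-end extracts short-range lip-motion features from a clip of mouth-region frames, and a recurrent back-end decodes the resulting feature sequence. All layer sizes and names here are illustrative assumptions.

```python
# Minimal illustrative sketch of a front-end/back-end lipreading model.
# Layer choices and dimensions are assumptions for exposition only.
import torch
import torch.nn as nn

class LipreadingSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Front-end: a 3D convolution captures short-range lip motion
        # across neighbouring frames; spatial pooling collapses H and W
        # while preserving the time axis.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Back-end: a recurrent layer decodes long-term temporal structure.
        self.backend = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips):
        # clips: (batch, channel=1, time, height, width)
        feats = self.frontend(clips)           # (B, 32, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, 32, T)
        feats = feats.transpose(1, 2)          # (B, T, 32)
        out, _ = self.backend(feats)           # (B, T, 64)
        return self.classifier(out[:, -1])     # word-level prediction

model = LipreadingSketch()
logits = model(torch.randn(2, 1, 16, 48, 48))  # 2 clips of 16 frames, 48x48
print(logits.shape)  # torch.Size([2, 10])
```

For sentence-level lipreading the final classifier would instead emit a prediction per time step and be trained with a sequence loss such as CTC, but the front-end/back-end decomposition is the same.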