TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network

Chen, Huanjie; Li, Wenjuan; Cheng, Zhigang; Liang, Xiubo; Zhang, Qifei

doi:10.1007/978-3-031-44201-8_34

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14262)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Lip reading translates an input sequence of lip-movement images into a text sequence. The task requires both temporal and spatial information, which makes feature extraction difficult. This paper proposes a new lip-reading model, TCS-LipNet, built around a temporal-channel-spatial attention module (TCSAM). Compared with a channel-spatial attention mechanism, TCSAM additionally associates channel and spatial features along the temporal dimension, which improves model performance. TCS-LipNet uses a TCSAM-based ResNet18 network as the front-end module to strengthen visual feature extraction, and DC-TCN (Densely Connected Temporal Convolutional Network) as the back-end module to model the temporal correlation of sequences. Experiments show that TCS-LipNet achieves 92.2% accuracy on LRW, which the authors report as the highest accuracy at the time of publication.
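
As a rough illustration of the idea only: the abstract does not describe the internals of TCSAM, so the sketch below is an assumption and not the authors' module. It applies three sigmoid gates in sequence (temporal, channel, spatial) to a feature map of shape (T, C, H, W), in the style of a channel-spatial attention block extended with a per-frame gate. The function name `tcs_attention`, the weight matrices `w_t` and `w_c`, and every pooling choice are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def tcs_attention(x, w_t, w_c):
    """Hypothetical temporal -> channel -> spatial gating (not the paper's TCSAM).

    x   : features of shape (T, C, H, W)
    w_t : (T, T) weights mixing per-frame descriptors (assumed)
    w_c : (C, C) weights mixing per-channel descriptors (assumed)
    """
    # Temporal gate: one scalar per frame from a global average descriptor.
    t_desc = x.mean(axis=(1, 2, 3))                  # shape (T,)
    t_gate = sigmoid(w_t @ t_desc)                   # shape (T,)
    x = x * t_gate[:, None, None, None]

    # Channel gate: one scalar per channel, pooled over time and space.
    c_desc = x.mean(axis=(0, 2, 3))                  # shape (C,)
    c_gate = sigmoid(w_c @ c_desc)                   # shape (C,)
    x = x * c_gate[None, :, None, None]

    # Spatial gate: per-frame map from average and max pooling over channels.
    s_desc = 0.5 * (x.mean(axis=1) + x.max(axis=1))  # shape (T, H, W)
    s_gate = sigmoid(s_desc)
    return x * s_gate[:, None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((5, 8, 4, 4))        # toy sizes, chosen arbitrarily
    out = tcs_attention(
        feats,
        rng.standard_normal((5, 5)),
        rng.standard_normal((8, 8)),
    )
    print(out.shape)
```

Each gate lies strictly between 0 and 1, so the block only rescales features and keeps the input shape. In a real model the gates would be learned layers inside each ResNet18 stage; fixed random matrices are used here only to keep the sketch self-contained.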

Author information

Authors and Affiliations

School of Software, Zhejiang University, Ningbo, China
Huanjie Chen, Zhigang Cheng, Xiubo Liang & Qifei Zhang
School of Engineering, Hangzhou Normal University, Hangzhou, China
Wenjuan Li

Corresponding author

Correspondence to Qifei Zhang.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

About this paper

Cite this paper

Chen, H., Li, W., Cheng, Z., Liang, X., Zhang, Q. (2023). TCS-LipNet: Temporal & Channel & Spatial Attention-Based Lip Reading Network. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14262. Springer, Cham. https://doi.org/10.1007/978-3-031-44201-8_34

DOI: https://doi.org/10.1007/978-3-031-44201-8_34
Published: 23 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44200-1
Online ISBN: 978-3-031-44201-8
eBook Packages: Computer Science (R0)