Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention

  • Conference paper
  • First Online:
Artificial Intelligence Applications and Innovations (AIAI 2024)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 713)

Abstract

Lip recognition is currently a significant research direction in video understanding within computer vision. It aims to recognize spoken content from the dynamic visual changes of a speaker's lips. With advances in deep learning and computing power, lip recognition techniques can now effectively separate the target foreground from the background across different scenes, and models and technical pipelines have become increasingly unified. However, despite the success of lip recognition on English datasets, the semantic specificity of Chinese words and the scarcity of open-source Chinese lip-reading datasets have generally kept recognition accuracy for Chinese low.

To address these challenges, this paper introduces a novel model that combines two-dimensional and three-dimensional networks. Our approach leverages an improved Convolutional 3D (C3D) network to extract spatiotemporal features effectively. Unlike traditional 2D networks, which lack temporal modeling, and conventional 3D networks, which are prone to overfitting as depth grows, our enhanced C3D network provides a robust foundation for feature extraction. To capture temporal features more efficiently, we feed its outputs into a Bi-directional Gated Recurrent Unit (Bi-GRU) combined with a Multi-Head Self-Attention mechanism. This fusion allows better extraction of semantic and syntactic features, overcoming the limitations of standard RNNs on long sequences and the non-parallelizable, sequence-dependent nature of the GRU.
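
As a concrete illustration of this pipeline, the following is a minimal PyTorch sketch of a C3D-style front-end feeding a Bi-GRU with multi-head self-attention. All layer sizes, the class count, and the pooling scheme are illustrative assumptions, not the exact configuration used in the paper.

```python
# Hypothetical sketch of the described pipeline: C3D-style 3D convolutions,
# a Bi-GRU, and multi-head self-attention. Layer widths, the class count,
# and the pooling scheme are illustrative assumptions, not the authors' setup.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, hidden=256, heads=4):
        super().__init__()
        # C3D-style spatiotemporal extractor: 3D convolutions capture lip
        # motion across frames as well as spatial structure within frames.
        self.c3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),           # pool space, keep time steps
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Dropout3d(0.3),                  # guard against overfitting
        )
        # Collapse the remaining spatial grid to one vector per frame.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # Bi-GRU models temporal dependencies in both directions.
        self.gru = nn.GRU(64, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Multi-head self-attention lets every time step attend to all
        # others in parallel, unlike the sequential GRU recurrence.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (B, 1, T, H, W)
        f = self.pool(self.c3d(x))              # (B, 64, T, 1, 1)
        f = f.flatten(2).transpose(1, 2)        # (B, T, 64)
        seq, _ = self.gru(f)                    # (B, T, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq)       # self-attention over time
        return self.fc(ctx.mean(dim=1))         # pool over time, classify

model = LipReadingNet()
logits = model(torch.randn(2, 1, 16, 64, 64))  # 2 clips, 16 frames of 64x64
print(logits.shape)                             # torch.Size([2, 10])
```

Passing a batch of 16-frame grayscale lip clips through the model yields one logit vector per clip; the attention layer weighs all time steps jointly, which is exactly the parallelism the sequence-dependent GRU alone lacks.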

The effectiveness of the proposed model was validated through experiments on our self-compiled Chinese dataset. We compared it with mainstream networks such as ResNet-18 and ResNet-34, and with the original C3D model without the multi-head self-attention mechanism. The loss and accuracy curves show that our model achieves significant improvements, effectively addressing the challenges of Chinese lip recognition and setting a new performance benchmark in this field.

Acknowledgments

This paper was supported by the National Natural Science Foundation of China (61971007 and 61571013).

Author information

Corresponding author

Correspondence to Yuanyao Lu.

Ethics declarations

Disclosure of Interests

The authors declare that they have no conflict of interest.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Ni, R., Jiang, H., Zhou, L., Lu, Y. (2024). Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_8

  • DOI: https://doi.org/10.1007/978-3-031-63219-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63218-1

  • Online ISBN: 978-3-031-63219-8

  • eBook Packages: Computer Science, Computer Science (R0)
