Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention

  • Conference paper
  • First Online:
Artificial Intelligence Applications and Innovations (AIAI 2024)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 713)

Abstract

Lip recognition is currently a significant research direction in video understanding within computer vision. It aims to recognize spoken content from the dynamic visual changes of a speaker's lips. With advances in deep learning and computing power, lip recognition techniques can now effectively separate the target foreground from the background across different scenes, and models and technical pipelines have become increasingly unified. However, despite the success of lip recognition on English datasets, the semantic specificity of Chinese words and the scarcity of open-source Chinese lip-reading datasets have generally kept recognition accuracy for Chinese low.

To address these challenges, this paper introduces a novel model that combines two-dimensional and three-dimensional networks. Our approach leverages an improved Convolutional 3D (C3D) network to extract spatiotemporal features effectively. Unlike traditional 2D networks, which lack temporal modeling, and conventional 3D networks, which are prone to overfitting as depth grows, our enhanced C3D network provides a robust foundation for feature extraction. To capture temporal features more efficiently, we feed its outputs into a Bi-directional Gated Recurrent Unit (Bi-GRU) combined with a Multi-Head Self-Attention mechanism. This fusion allows better extraction of semantic and syntactic features, overcoming the limitations of standard RNNs on long sequences and the non-parallelizable, sequence-dependent nature of the GRU.
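
As a concrete illustration of this pipeline, the following is a minimal PyTorch sketch of a C3D-style front-end feeding a Bi-GRU with multi-head self-attention. All layer sizes, the class count, and the pooling scheme are illustrative assumptions, not the exact configuration used in the paper.

```python
# Hypothetical sketch of the described pipeline: C3D-style 3D convolutions,
# a Bi-GRU, and multi-head self-attention. Layer widths, the class count,
# and the pooling scheme are illustrative assumptions, not the authors' setup.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, hidden=256, heads=4):
        super().__init__()
        # C3D-style spatiotemporal extractor: 3D convolutions capture lip
        # motion across frames as well as spatial structure within frames.
        self.c3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),           # pool space, keep time steps
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Dropout3d(0.3),                  # guard against overfitting
        )
        # Collapse the remaining spatial grid to one vector per frame.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # Bi-GRU models temporal dependencies in both directions.
        self.gru = nn.GRU(64, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Multi-head self-attention lets every time step attend to all
        # others in parallel, unlike the sequential GRU recurrence.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (B, 1, T, H, W)
        f = self.pool(self.c3d(x))              # (B, 64, T, 1, 1)
        f = f.flatten(2).transpose(1, 2)        # (B, T, 64)
        seq, _ = self.gru(f)                    # (B, T, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq)       # self-attention over time
        return self.fc(ctx.mean(dim=1))         # pool over time, classify

model = LipReadingNet()
logits = model(torch.randn(2, 1, 16, 64, 64))  # 2 clips, 16 frames of 64x64
print(logits.shape)                             # torch.Size([2, 10])
```

Passing a batch of 16-frame grayscale lip clips through the model yields one logit vector per clip; the attention layer weighs all time steps jointly, which is exactly the parallelism the sequence-dependent GRU alone lacks.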

The effectiveness of the proposed model was validated through experiments on our self-compiled Chinese dataset. We compared it with mainstream networks such as ResNet-18 and ResNet-34, and with the original C3D model without the multi-head self-attention mechanism. The loss and accuracy curves show that our model achieves significant improvements, effectively addressing the challenges of Chinese lip recognition and setting a new performance benchmark in this field.

Acknowledgments

This paper was supported by the National Natural Science Foundation of China (61971007 and 61571013).

Author information

Corresponding author

Correspondence to Yuanyao Lu.

Ethics declarations

Disclosure of Interests

The authors declare that they have no conflict of interest.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Ni, R., Jiang, H., Zhou, L., Lu, Y. (2024). Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_8

  • DOI: https://doi.org/10.1007/978-3-031-63219-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63218-1

  • Online ISBN: 978-3-031-63219-8

  • eBook Packages: Computer Science, Computer Science (R0)
