Abstract
Lip recognition is currently a significant research direction in the video-understanding field of computer vision. It aims to recognize spoken content from the dynamic visual changes of a speaker's lips. With the development of deep learning and improved computing performance, lip recognition techniques can now effectively separate the target foreground from the background across different scenes, and model architectures and technical pipelines have become increasingly unified. However, despite the success of lip recognition on English datasets, the semantic specificity of Chinese words and the scarcity of open-source Chinese lip recognition datasets have generally kept recognition accuracy low.
To address these challenges, this paper introduces a novel model that combines two-dimensional and three-dimensional networks. Our approach leverages an improved Convolutional 3D (C3D) network to extract spatiotemporal features effectively. Unlike traditional 2D networks, which cannot capture temporal dynamics, and conventional 3D networks, which are prone to overfitting as depth grows, our enhanced C3D network provides a robust foundation for feature extraction. To capture temporal features more efficiently, we feed its outputs into a Bi-directional Gated Recurrent Unit (Bi-GRU) combined with a Multi-Head Self-Attention mechanism. This fusion extracts semantic and syntactic features more effectively, overcoming both the difficulty ordinary RNNs have with long sequences and the limited parallelism of GRUs caused by their sequential dependence.
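To make the architecture concrete, the following is a minimal PyTorch sketch of a C3D-style front end feeding a Bi-GRU with multi-head self-attention. All layer sizes and the shallow two-block front end are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a C3D -> Bi-GRU -> multi-head self-attention pipeline.
# Layer sizes are hypothetical; the paper's exact configuration may differ.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 256, heads: int = 4):
        super().__init__()
        # Shallow 3D-convolutional front end: extracts joint
        # spatiotemporal features while preserving the temporal axis.
        self.c3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space, not time
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),    # keep T, shrink H and W
        )
        # Bidirectional GRU models temporal dependencies in both directions.
        self.gru = nn.GRU(64 * 4 * 4, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Multi-head self-attention re-weights the GRU outputs so the
        # classifier can attend to the most informative frames.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale lip clips
        f = self.c3d(x)                           # (B, 64, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)   # (B, T, 64*4*4)
        seq, _ = self.gru(f)                      # (B, T, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq)         # self-attention over time
        return self.fc(ctx.mean(dim=1))           # pool frames -> logits
```

For a batch of grayscale clips shaped (batch, 1, frames, 88, 88), this model returns (batch, num_classes) logits; the adaptive pooling keeps the design independent of the input frame resolution.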
The effectiveness of the proposed model was validated through experiments on our self-compiled Chinese dataset. We compared it with mainstream networks such as ResNet-18 and ResNet-34, as well as the original C3D model without the multi-head self-attention mechanism. Analysis of the loss and accuracy curves shows that our model achieves significant improvements, effectively addressing the challenges of lip recognition for the Chinese language and setting a new performance benchmark in this field.
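As a hypothetical illustration of how such per-model loss and accuracy curves can be produced, the sketch below records training loss and validation accuracy each epoch; the optimizer, learning rate, and epoch count are assumptions, not values from the paper.

```python
# Hypothetical training loop that records the loss and accuracy curves
# used to compare models; hyperparameters here are assumptions.
import torch
import torch.nn as nn

def train_and_track(model, train_loader, val_loader, epochs: int = 30,
                    device: str = "cuda" if torch.cuda.is_available() else "cpu"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    history = {"loss": [], "acc": []}
    for _ in range(epochs):
        # Training pass: accumulate the epoch's mean loss for the loss curve.
        model.train()
        total_loss = 0.0
        for clips, labels in train_loader:
            clips, labels = clips.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()
            total_loss += loss.item() * clips.size(0)
        history["loss"].append(total_loss / len(train_loader.dataset))
        # Validation pass: accuracy for the accuracy curve.
        model.eval()
        correct = 0
        with torch.no_grad():
            for clips, labels in val_loader:
                preds = model(clips.to(device)).argmax(dim=1)
                correct += (preds == labels.to(device)).sum().item()
        history["acc"].append(correct / len(val_loader.dataset))
    return history
```

Running this for each baseline (e.g., ResNet-18, ResNet-34, plain C3D) and the proposed model yields directly comparable curves.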
Acknowledgments
This paper was supported by the National Natural Science Foundation of China (61971007 and 61571013).
Ethics declarations
Disclosure of Interests
The authors declare that they have no conflict of interest.
Copyright information
© 2024 IFIP International Federation for Information Processing
Cite this paper
Ni, R., Jiang, H., Zhou, L., Lu, Y. (2024). Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 713. Springer, Cham. https://doi.org/10.1007/978-3-031-63219-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63218-1
Online ISBN: 978-3-031-63219-8