Abstract
Speech recognition is a major communication channel for human-machine interaction and has achieved outstanding breakthroughs. However, single-modal speech recognition remains unsatisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to focus excessively on aligning semantic features and constructing fused features across modalities while neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip-region images, and facial electromyography signals are used for unrestricted speech recognition, which effectively resists the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset covering three speech modalities and 100 classes of Chinese phrases is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet. DuAGNet obtains both the highest recognition accuracy of 98.79% and the lowest standard deviation of 0.83 on clean test data, and achieves a maximum accuracy increase of over 80% compared with audio-only speech recognition systems when severe audio noise is introduced.
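To make the fusion idea concrete, the following is a minimal PyTorch sketch of a dual adaptive gating fusion head over audio, lip-video, and facial EMG features. All module names, feature dimensions, and the exact gating formulation are illustrative assumptions for exposition, not the authors' released implementation of DuAGNet.

```python
# Illustrative sketch only: dimensions, layer choices, and gating details are assumed.
import torch
import torch.nn as nn


class DualAdaptiveGatingFusion(nn.Module):
    """Fuses audio, lip-video, and facial EMG features with two gating stages."""

    def __init__(self, dims=(256, 256, 256), fused_dim=256, num_classes=100):
        super().__init__()
        # Project each modality to a common width.
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # Modality-specific gate: one weight per modality, conditioned on the
        # concatenated single-modal features so each modality keeps its own expression.
        self.modality_gate = nn.Sequential(
            nn.Linear(fused_dim * len(dims), len(dims)), nn.Softmax(dim=-1)
        )
        # Feature-specific gate: per-dimension weights on the fused feature vector.
        self.feature_gate = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.Sigmoid()
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, audio, video, emg):
        feats = [p(x) for p, x in zip(self.proj, (audio, video, emg))]  # 3 x (B, D)
        stacked = torch.stack(feats, dim=1)                             # (B, 3, D)
        # Stage 1: adaptively weight whole modalities.
        w = self.modality_gate(torch.cat(feats, dim=-1)).unsqueeze(-1)  # (B, 3, 1)
        fused = (w * stacked).sum(dim=1)                                # (B, D)
        # Stage 2: re-weight individual feature dimensions of the fused vector.
        fused = self.feature_gate(fused) * fused
        return self.classifier(fused)


# Example: a batch of 4 utterance-level feature vectors per modality.
model = DualAdaptiveGatingFusion()
logits = model(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 100])
```

The two stages correspond to the modality-specific and feature-specific gating networks named in the abstract: the first decides how much each modality contributes under the current noise conditions, and the second refines which dimensions of the fused representation are emphasized.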
Data Availability and Access
The data used in this work will be made available upon reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62332019 and 62076250, and in part by the National Key Research and Development Program of China under Grants 2023YFF1203900 and 2023YFF1203903.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yakun Zhang, Liang Xie, Xingwei An and Erwei Yin. Methods and experiments were designed and conducted by Jinghan Wu, Meishan Zhang, Changyan Zheng and Xingyu Zhang. The first draft of the manuscript was written by Jinghan Wu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Ethical and Informed Consent for Data Used
The construction of the self-generated dataset was approved by the Institutional Review Board of Tianjin University (TJUE-2021-138). The informed consent form was signed by each participant before data collection.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, J., Zhang, Y., Zhang, M. et al. DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion. Appl Intell 55, 224 (2025). https://doi.org/10.1007/s10489-024-06119-0