Abstract
Speech recognition is a major communication channel for human-machine interaction and has achieved outstanding breakthroughs. However, single-modal speech recognition remains unsatisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to focus excessively on aligning semantic features and constructing fused features across modalities while neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip-region images, and facial electromyography signals are used for unrestricted speech recognition, which effectively resists the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset covering three speech modalities and 100 classes of Chinese phrases is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet. DuAGNet obtains both the highest recognition accuracy of 98.79% and the lowest standard deviation of 0.83 on clean test data, and achieves a maximum accuracy increase of over 80% compared with audio-only speech recognition systems when severe audio noise is introduced.
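To make the fusion idea concrete, the following is a minimal PyTorch sketch of a dual adaptive gating fusion head over audio, lip-video, and facial EMG features. All module names, feature dimensions, and the exact gating formulation are illustrative assumptions for exposition, not the authors' released implementation of DuAGNet.

```python
# Illustrative sketch only: dimensions, layer choices, and gating details are assumed.
import torch
import torch.nn as nn


class DualAdaptiveGatingFusion(nn.Module):
    """Fuses audio, lip-video, and facial EMG features with two gating stages."""

    def __init__(self, dims=(256, 256, 256), fused_dim=256, num_classes=100):
        super().__init__()
        # Project each modality to a common width.
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # Modality-specific gate: one weight per modality, conditioned on the
        # concatenated single-modal features so each modality keeps its own expression.
        self.modality_gate = nn.Sequential(
            nn.Linear(fused_dim * len(dims), len(dims)), nn.Softmax(dim=-1)
        )
        # Feature-specific gate: per-dimension weights on the fused feature vector.
        self.feature_gate = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.Sigmoid()
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, audio, video, emg):
        feats = [p(x) for p, x in zip(self.proj, (audio, video, emg))]  # 3 x (B, D)
        stacked = torch.stack(feats, dim=1)                             # (B, 3, D)
        # Stage 1: adaptively weight whole modalities.
        w = self.modality_gate(torch.cat(feats, dim=-1)).unsqueeze(-1)  # (B, 3, 1)
        fused = (w * stacked).sum(dim=1)                                # (B, D)
        # Stage 2: re-weight individual feature dimensions of the fused vector.
        fused = self.feature_gate(fused) * fused
        return self.classifier(fused)


# Example: a batch of 4 utterance-level feature vectors per modality.
model = DualAdaptiveGatingFusion()
logits = model(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 100])
```

The two stages correspond to the modality-specific and feature-specific gating networks named in the abstract: the first decides how much each modality contributes under the current noise conditions, and the second refines which dimensions of the fused representation are emphasized.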
Data Availability and Access
The data used in this work will be made available upon reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62332019 and 62076250, and in part by the National Key Research and Development Program of China under Grants 2023YFF1203900 and 2023YFF1203903.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yakun Zhang, Liang Xie, Xingwei An and Erwei Yin. Methods and experiments were designed and conducted by Jinghan Wu, Meishan Zhang, Changyan Zheng and Xingyu Zhang. The first draft of the manuscript was written by Jinghan Wu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Ethical and Informed Consent for Data Used
The construction of the self-generated dataset was approved by the Institutional Review Board of Tianjin University (TJUE-2021-138). The informed consent form was signed by each participant before data collection.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, J., Zhang, Y., Zhang, M. et al. DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion. Appl Intell 55, 224 (2025). https://doi.org/10.1007/s10489-024-06119-0