DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion

Abstract

Speech recognition is a major communication channel for human-machine interaction and has seen outstanding breakthroughs. However, single-modal speech recognition remains unsatisfactory in practice for high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to focus excessively on aligning semantic features and constructing fused features across modalities while neglecting to preserve single-modal characteristics. In this work, audio signals, visual cues from lip-region images, and facial electromyography signals are used for unrestricted speech recognition, which effectively resists the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset covering the three speech modalities and 100 classes of Chinese phrases is constructed from forty subjects to validate the effectiveness of DuAGNet. Both the highest recognition accuracy of 98.79% and the lowest standard deviation of 0.83 are obtained on clean test data, and a maximum accuracy improvement of over 80% is achieved over audio-only speech recognition systems when severe audio noise is introduced.
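
The abstract describes the fusion mechanism only at a high level. As a rough illustration, the sketch below shows one plausible form of dual adaptive gating over three modality encoders: a modality-specific gate that weights each modality's representation and a feature-specific gate that re-weights individual dimensions of the fused vector. All module names, dimensions, and the classification head are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of dual adaptive gating fusion for three speech modalities
# (audio, lip video, facial EMG). Encoder outputs are assumed to already be
# pooled into fixed-length feature vectors; sizes are illustrative.
import torch
import torch.nn as nn

class DualAdaptiveGatingFusion(nn.Module):
    def __init__(self, feat_dim=256, num_modalities=3, num_classes=100):
        super().__init__()
        # Modality-specific gate: one weight per modality,
        # conditioned on the concatenated modality features.
        self.modality_gate = nn.Sequential(
            nn.Linear(feat_dim * num_modalities, num_modalities),
            nn.Softmax(dim=-1),
        )
        # Feature-specific gate: one weight per dimension of the fused vector.
        self.feature_gate = nn.Sequential(
            nn.Linear(feat_dim * num_modalities, feat_dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(feat_dim * num_modalities, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, audio_feat, lip_feat, emg_feat):
        # Each input: (batch, feat_dim) from its modality-specific encoder.
        feats = [audio_feat, lip_feat, emg_feat]
        concat = torch.cat(feats, dim=-1)
        # Weight each modality while keeping its own representation intact.
        m_weights = self.modality_gate(concat)                  # (batch, 3)
        gated = [w.unsqueeze(-1) * f
                 for w, f in zip(m_weights.unbind(-1), feats)]
        fused = self.proj(torch.cat(gated, dim=-1))             # (batch, feat_dim)
        # Re-weight individual dimensions of the fused representation.
        fused = self.feature_gate(concat) * fused
        return self.classifier(fused)                           # (batch, num_classes)

# Usage with random encoder outputs (hypothetical shapes):
model = DualAdaptiveGatingFusion()
a, v, e = (torch.randn(8, 256) for _ in range(3))
logits = model(a, v, e)  # (8, 100) scores over the phrase vocabulary
```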


Data Availability and Access

The data used in this work will be made available upon reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62332019 and 62076250, and in part by the National Key Research and Development Program of China under Grants 2023YFF1203900 and 2023YFF1203903.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yakun Zhang, Liang Xie, Xingwei An and Erwei Yin. Methods and experiments were designed and conducted by Jinghan Wu, Meishan Zhang, Changyan Zheng and Xingyu Zhang. The first draft of the manuscript was written by Jinghan Wu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xingwei An or Erwei Yin.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests relevant to the content of this article.

Ethical and Informed Consent for Data Used

The construction of the self-generated dataset was approved by the Institutional Review Board of Tianjin University (TJUE-2021-138). The informed consent form was signed by each participant before data collection.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, J., Zhang, Y., Zhang, M. et al. DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion. Appl Intell 55, 224 (2025). https://doi.org/10.1007/s10489-024-06119-0

