
Full single-type deep learning models with multihead attention for speech enhancement


Abstract

Artificial neural network (ANN) models with attention mechanisms for removing noise from audio signals, known as speech enhancement models, have proven effective. However, their architectures become complex, deep and computationally demanding as they are pushed toward higher performance. Given this situation, we selected four simple, less resource-demanding models and evaluated them with identical training parameters and performance metrics to ensure a fair comparison. Our purpose was to demonstrate that simple neural network models with multihead attention are efficient when implemented on devices with conventional computational resources, since they provide results competitive with those of hybrid, complex and resource-demanding models. We experimentally evaluated multilayer perceptron (MLP), one-dimensional and two-dimensional convolutional neural network (CNN), and gated recurrent unit (GRU) deep learning models, each with and without multihead attention, and analyzed the generalization capability of each model. The results showed that although each architecture was composed of a single type of ANN, multihead attention increased the efficiency of the speech enhancement process, yielding results competitive with those of complex models. This study therefore serves as a reference for building simple and efficient single-type ANN models with attention.
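To make the evaluated model family concrete, the following is a minimal sketch (not the authors' exact architecture) of a single-type speech enhancement model in PyTorch: a GRU-only trunk followed by one multihead self-attention block that predicts a sigmoid mask over noisy magnitude-spectrogram frames. The layer sizes, the 257-bin input (corresponding to a 512-point STFT), and the class name are illustrative assumptions.

    # Minimal single-type (GRU-only) enhancer with multihead self-attention.
    # Hypothetical sizes; the paper compares MLP, 1D/2D CNN and GRU variants.
    import torch
    import torch.nn as nn

    class GRUAttentionEnhancer(nn.Module):
        def __init__(self, n_bins=257, hidden=256, heads=4):
            super().__init__()
            self.gru = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

        def forward(self, noisy_mag):          # (batch, frames, freq_bins)
            h, _ = self.gru(noisy_mag)         # temporal features from the GRU trunk
            a, _ = self.attn(h, h, h)          # self-attention across time frames
            return noisy_mag * self.mask(a)    # apply the predicted mask

    # Usage: enhance a batch of eight 100-frame noisy magnitude spectrograms.
    model = GRUAttentionEnhancer()
    enhanced = model(torch.randn(8, 100, 257))

Removing the attention block leaves the plain single-type baseline, mirroring the with/without-attention comparison reported in the study.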


Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.


Acknowledgements

We thank Dr. Matias Garcia-Constantino for his useful comments, which improved the quality of this paper. We also thank the Laboratorio Nacional de Supercómputo del Sureste de México (LNS), a member of CONACYT, for the computational resources, support and technical assistance provided through project No. 202103086N.

Author information

Corresponding author

Correspondence to José Adán Hernández-Nolasco.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

José Adán Hernández-Nolasco and Pablo Pancardo contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zacarias-Morales, N., Hernández-Nolasco, J.A. & Pancardo, P. Full single-type deep learning models with multihead attention for speech enhancement. Appl Intell 53, 20561–20576 (2023). https://doi.org/10.1007/s10489-023-04571-y

