
Full single-type deep learning models with multihead attention for speech enhancement


Abstract

Artificial neural network (ANN) models with attention mechanisms for removing noise from audio signals, known as speech enhancement models, have proven effective. However, their architectures become complex, deep and computationally demanding as they are pushed toward higher performance. Given this situation, we selected four simple, less resource-demanding models and evaluated them with identical training parameters and performance metrics to ensure a fair comparison. Our purpose was to demonstrate that simple neural network models with multihead attention are efficient when implemented on devices with conventional computational resources, since they provide results competitive with those of hybrid, complex and resource-demanding models. We experimentally evaluated multilayer perceptron (MLP), one-dimensional and two-dimensional convolutional neural network (CNN), and gated recurrent unit (GRU) deep learning models, each with and without multihead attention, and analyzed the generalization capability of each model. The results showed that although each architecture was composed of a single type of ANN, multihead attention increased the efficiency of the speech enhancement process, yielding results competitive with those of complex models. This study therefore serves as a reference for building simple and efficient single-type ANN models with attention.
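To make the evaluated model family concrete, the following is a minimal sketch (not the authors' exact architecture) of a single-type speech enhancement model in PyTorch: a GRU-only trunk followed by one multihead self-attention block that predicts a sigmoid mask over noisy magnitude-spectrogram frames. The layer sizes, the 257-bin input (corresponding to a 512-point STFT), and the class name are illustrative assumptions.

    # Minimal single-type (GRU-only) enhancer with multihead self-attention.
    # Hypothetical sizes; the paper compares MLP, 1D/2D CNN and GRU variants.
    import torch
    import torch.nn as nn

    class GRUAttentionEnhancer(nn.Module):
        def __init__(self, n_bins=257, hidden=256, heads=4):
            super().__init__()
            self.gru = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

        def forward(self, noisy_mag):          # (batch, frames, freq_bins)
            h, _ = self.gru(noisy_mag)         # temporal features from the GRU trunk
            a, _ = self.attn(h, h, h)          # self-attention across time frames
            return noisy_mag * self.mask(a)    # apply the predicted mask

    # Usage: enhance a batch of eight 100-frame noisy magnitude spectrograms.
    model = GRUAttentionEnhancer()
    enhanced = model(torch.randn(8, 100, 257))

Removing the attention block leaves the plain single-type baseline, mirroring the with/without-attention comparison reported in the study.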


Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.


Acknowledgements

We thank Dr. Matias Garcia-Constantino for his useful comments, which improved the quality of this paper. We also thank the Laboratorio Nacional de Supercómputo del Sureste de México (LNS), a member of CONACYT, for the computational resources, support and technical assistance provided through project No. 202103086N.

Author information

Corresponding author

Correspondence to José Adán Hernández-Nolasco.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

José Adán Hernández-Nolasco and Pablo Pancardo contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zacarias-Morales, N., Hernández-Nolasco, J.A. & Pancardo, P. Full single-type deep learning models with multihead attention for speech enhancement. Appl Intell 53, 20561–20576 (2023). https://doi.org/10.1007/s10489-023-04571-y

