Multi-stage strength estimation network with cross attention for single channel speech enhancement

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Speech enhancement is a fundamental task in acoustic signal processing that remains an open challenge. Recently, with the rapid development of deep learning, data-driven approaches built from a variety of machine-learning modules have made great progress in speech enhancement. Each of these basic modules has unique advantages as well as certain limitations. Inspired by the modules' distinct preferences and the distinguishing features of speech signals, we propose a multi-stage strength estimation network with cross-attention for single-channel speech enhancement. The proposed method consists of a feature-wise fusion block based on the attention mechanism and a strength estimation block using FFT and sequential representations (FTB). We first formulate the speech enhancement problem mathematically, and then compare the proposed method with several well-known speech enhancement methods on the 50-h DNS and LibriFSD50K datasets, showing that it attends fully to both the time and frequency domains and achieves satisfying results. Ablation studies further confirm the effectiveness of each component of the proposed method. These results demonstrate that combining modules with different properties can improve the performance of speech enhancement models, pointing to a promising direction for future development.
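The feature-wise fusion described in the abstract relies on a cross-attention mechanism, in which one feature stream queries another. The paper's exact block design is not reproduced here; the following is a minimal sketch of generic scaled dot-product cross-attention in NumPy, with the time-branch and frequency-branch shapes chosen purely for illustration (one stream of 100 time steps attending to 257 frequency bins, both with hypothetical 64-dimensional features).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats):
    """Scaled dot-product cross-attention between two feature streams.

    query_feats:     (num_queries, dim)  - the stream doing the attending
    key_value_feats: (num_keys, dim)     - the stream being attended to
    Returns a fused representation of shape (num_queries, dim).
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats

rng = np.random.default_rng(0)
time_feats = rng.standard_normal((100, 64))   # hypothetical time-branch features
freq_feats = rng.standard_normal((257, 64))   # hypothetical frequency-branch features
fused = cross_attention(time_feats, freq_feats)
print(fused.shape)  # (100, 64)
```

In a real enhancement network the queries, keys, and values would pass through learned linear projections before the dot product; they are omitted here to keep the mechanism itself visible.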


Data availability statements

All datasets used in our research are publicly available on the internet.

Notes

  1. https://github.com/microsoft/DNS-Challenge/tree/master

  2. https://github.com/etzinis/fedenhance

  3. https://github.com/microsoft/DNS-Challenge/tree/master/DNSMOS


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 52275296), National Key Research and Development Program of China (No. 2022YFC2402700), the China Scholarships Council (No. 202306420024), the Graduate Innovation Program of China University of Mining and Technology (No. 2024WLJCRCZL112), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX24_2729), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Author information

Authors and Affiliations

Authors

Contributions

ZZ: Data Curation, Methodology, Conceptualization, Writing—Original Draft. YD: Software, Writing—Original Draft, Investigation. WC: Visualization, Project administration. YC: Writing—Original Draft, Investigation. WG: Project administration, Supervision. HL: Supervision, Funding acquisition. All authors co-wrote the manuscript and approved the final version for submission.

Corresponding author

Correspondence to Houguang Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Z., Ding, Y., Chen, W. et al. Multi-stage strength estimation network with cross attention for single channel speech enhancement. SIViP 18, 6937–6948 (2024). https://doi.org/10.1007/s11760-024-03364-1
