Abstract
Speech enhancement in 3D reverberant environments is a challenging and important problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models have shown efficacy for 3D speech enhancement, but they often introduce distortions or unnatural artifacts into the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network with a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress noise and reverberation, while the diffusion model leverages its generative capability to restore speech components that are missing from, or distorted in, the beamformed output. To the best of our knowledge, this is the first work to apply a diffusion model as a back-end refiner for 3D speech enhancement. We investigate the effect of training the diffusion model on either enhanced speech or clean speech, and find that training on clean speech better captures the prior distribution of speech components and improves speech recovery. We evaluate the proposed system on different datasets and beamformer architectures and show that it achieves consistent improvements in metrics such as word error rate (WER) and NISQA, indicating that the diffusion model generalizes well and can serve as a back-end refinement module for 3D speech enhancement regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and opens a new direction for applying generative diffusion models, as back ends to various beamforming methods, in 3D speech processing tasks.
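To make the two-stage design concrete, the following is a minimal PyTorch-style sketch of the pipeline the abstract describes: a discriminative beamformer front end followed by a generative diffusion refiner. The class and method names here (`TwoStageRefiner`, `diffusion.reverse`) are illustrative assumptions for exposition, not the interfaces of the authors' released code.

```python
# Minimal sketch of the two-stage refiner pipeline described in the abstract.
# All module and method names are illustrative assumptions, not the authors'
# actual implementation.
import torch


class TwoStageRefiner(torch.nn.Module):
    def __init__(self, beamformer: torch.nn.Module, diffusion: torch.nn.Module):
        super().__init__()
        self.beamformer = beamformer  # stage 1: discriminative spatial filter
        self.diffusion = diffusion    # stage 2: generative refiner

    def forward(self, multichannel_mix: torch.Tensor) -> torch.Tensor:
        # Stage 1: suppress noise and reverberation via neural beamforming,
        # mapping the multichannel mixture to a single-channel estimate.
        beamformed = self.beamformer(multichannel_mix)
        # Stage 2: run the reverse diffusion process conditioned on the
        # beamformed signal to restore missing or distorted speech components.
        # `reverse` is a hypothetical method standing in for the sampler of
        # whatever diffusion model is plugged in.
        refined = self.diffusion.reverse(condition=beamformed)
        return refined
```

Because the refiner only consumes the beamformed signal, any front-end beamforming network can in principle be swapped in without retraining the diffusion stage, which is the generalization property the abstract claims.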
Data Availability
All L3DAS challenge datasets used in this study are publicly available at https://www.l3das.com/editions.html. Code is available at https://github.com/flchenwhu/3D-SE-Diffusion.
Acknowledgements
This work was supported in part by the Jiangxi Province Degree and Postgraduate Education Teaching Reform Project (No. JXYJG-2023-134), the Nanchang Hangkong University PhD Foundation (No. EA201904283), and the Nanchang Hangkong University Graduate Foundation (No. YC2022-044).
About this article
Cite this article
Chen, F., Lin, W., Sun, C. et al. A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement. Circuits Syst Signal Process 43, 4369–4389 (2024). https://doi.org/10.1007/s00034-024-02652-y