Abstract
Speech enhancement in 3D reverberant environments is a challenging and important problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models have shown efficacy for 3D speech enhancement, but they often introduce distortions or unnatural artifacts into the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network with a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress noise and reverberation, while the diffusion model leverages its generative capability to restore speech components that are missing from, or distorted in, the beamformed output. To the best of our knowledge, this is the first work to apply a diffusion model as a back-end refiner for 3D speech enhancement. We investigate the effect of training the diffusion model on either enhanced speech or clean speech, and find that training on clean speech better captures the prior distribution of speech components and improves speech recovery. We evaluate the proposed system on different datasets and beamformer architectures and show that it achieves consistent improvements in metrics such as word error rate (WER) and NISQA, indicating that the diffusion model generalizes well and can serve as a back-end refinement module for 3D speech enhancement regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and opens a new direction for applying generative diffusion models, as back ends to various beamforming methods, in 3D speech processing tasks.
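To make the two-stage design concrete, the following is a minimal PyTorch-style sketch of the pipeline the abstract describes: a discriminative beamformer front end followed by a generative diffusion refiner. The class and method names here (`TwoStageRefiner`, `diffusion.reverse`) are illustrative assumptions for exposition, not the interfaces of the authors' released code.

```python
# Minimal sketch of the two-stage refiner pipeline described in the abstract.
# All module and method names are illustrative assumptions, not the authors'
# actual implementation.
import torch


class TwoStageRefiner(torch.nn.Module):
    def __init__(self, beamformer: torch.nn.Module, diffusion: torch.nn.Module):
        super().__init__()
        self.beamformer = beamformer  # stage 1: discriminative spatial filter
        self.diffusion = diffusion    # stage 2: generative refiner

    def forward(self, multichannel_mix: torch.Tensor) -> torch.Tensor:
        # Stage 1: suppress noise and reverberation via neural beamforming,
        # mapping the multichannel mixture to a single-channel estimate.
        beamformed = self.beamformer(multichannel_mix)
        # Stage 2: run the reverse diffusion process conditioned on the
        # beamformed signal to restore missing or distorted speech components.
        # `reverse` is a hypothetical method standing in for the sampler of
        # whatever diffusion model is plugged in.
        refined = self.diffusion.reverse(condition=beamformed)
        return refined
```

Because the refiner only consumes the beamformed signal, any front-end beamforming network can in principle be swapped in without retraining the diffusion stage, which is the generalization property the abstract claims.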
Data Availability
All L3DAS challenge datasets used in this study are publicly available at https://www.l3das.com/editions.html. Code is available at https://github.com/flchenwhu/3D-SE-Diffusion.
Acknowledgements
This work was supported in part by the Jiangxi Province Degree and Postgraduate Education Teaching Reform Project (No. JXYJG-2023-134), the Nanchang Hangkong University PhD Foundation (No. EA201904283), and the Nanchang Hangkong University Graduate Foundation (No. YC2022-044).
About this article
Cite this article
Chen, F., Lin, W., Sun, C. et al. A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement. Circuits Syst Signal Process 43, 4369–4389 (2024). https://doi.org/10.1007/s00034-024-02652-y