Abstract
Recorded speech is generally corrupted by both room reverberation and background noise, which reduces speech quality and intelligibility. To handle the distortions caused by the joint effect of noise and reverberation, we propose a context-aware deep neural network (DNN) approach for simultaneous speech denoising and dereverberation. The proposed system consists of two stages: a denoising stage and a dereverberation stage. In the denoising stage, the additive noise is suppressed by estimating a phase-sensitive mask with a DNN. The denoised reverberant speech is then passed to the dereverberation stage, where a reverberation-time-aware DNN-based model performs dereverberation. This model adopts two reverberation-time-dependent parameters, the frame shift size and the acoustic context size, to exploit the superposition and frame-wise temporal correlation characteristics of different reverberation conditions. Finally, we integrate the two modules and train them jointly with a multi-objective loss function to further optimize both the denoising and dereverberation stages. Experimental results show that the proposed approach yields significant performance improvements over prevalent benchmark dereverberation algorithms on the IEEE corpus, the REVERB challenge, and the TIMIT corpus datasets under several reverberation conditions.
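As an illustration of the training target used in the denoising stage, the sketch below computes a phase-sensitive mask from a clean and a noisy short-time Fourier transform. It uses the commonly cited definition, |S|/|Y| · cos(θ_S − θ_Y), and is not taken from the article itself. The function names, the `eps` regularizer, and the truncation of the mask to [0, 1] are assumptions made for this example.

```python
import numpy as np


def phase_sensitive_mask(clean_stft, noisy_stft, eps=1e-8, clip=(0.0, 1.0)):
    """Return the phase-sensitive mask for complex STFTs of equal shape.

    Uses the identity |S|/|Y| * cos(theta_S - theta_Y) = Re(S * conj(Y)) / |Y|^2.
    `eps` and the clipping range are illustrative assumptions.
    """
    clean_stft = np.asarray(clean_stft, dtype=np.complex128)
    noisy_stft = np.asarray(noisy_stft, dtype=np.complex128)
    # Real part of S * conj(Y) equals |S| |Y| cos(theta_S - theta_Y).
    numerator = np.real(clean_stft * np.conj(noisy_stft))
    # eps avoids division by zero in silent time-frequency bins.
    mask = numerator / (np.abs(noisy_stft) ** 2 + eps)
    # Truncate so that the mask only attenuates (assumed range).
    return np.clip(mask, clip[0], clip[1])


def apply_mask(mask, noisy_stft):
    """Scale the noisy STFT by a real-valued mask, keeping the noisy phase."""
    return mask * np.asarray(noisy_stft, dtype=np.complex128)
```

In a supervised setup, a DNN would be trained to predict this mask from features of the noisy signal, and `apply_mask` would then be applied to the noisy STFT at inference time before resynthesis.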
Acknowledgements
This work was supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the Jiangsu Planned Projects for Postdoctoral Research Funds under Grant 2019K222, the Qing Lan Talent Program of Jiangsu Province, and the Jiangsu Province Key Research and Development Plan under Grant BE2020036.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Routray, S., Mao, Q. A context aware-based deep neural network approach for simultaneous speech denoising and dereverberation. Neural Comput & Applic 34, 9831–9845 (2022). https://doi.org/10.1007/s00521-022-06968-1
DOI: https://doi.org/10.1007/s00521-022-06968-1