Skip to main content
Log in

A context aware-based deep neural network approach for simultaneous speech denoising and dereverberation

  • Original Article
  • Published:
Neural Computing and Applications Aims and scope Submit manuscript

Abstract

Generally, the recorded speech signal is corrupted by both room reverberation and background noise leading to a reduced speech quality and intelligibility. In order to deal with the distortions caused by the joint effect of noise and reverberation, we propose a context-aware-based deep neural network (DNN) approach for simultaneous speech denoising and dereverberation. The proposed system consists of two stages such as denoising stage and the dereverberation stage. In the denoising stage, the additive noise is suppressed by estimating a phase-sensitive mask using DNN. Then, the noise-free reverberant speech is processed through the dereverberation stage. In the dereverberation stage, a reverberation-time-aware DNN-based model is used to perform dereverberation by adopting two reverberation time-dependent parameters such as frameshift size and acoustic context size to get the benefits of the characteristics of the superposition and frame-wise temporal correlations in different reverberation circumstances. Finally, we integrate both the modules and employ the integrated module for joint training using a multi objective loss function to further optimize both the denoising and dereverberation stages. Experimental results show that the proposed approach has shown significant performance improvements over prevalent benchmark dereverberation algorithms on IEEE corpus, REVERB challenge, and TIMIT corpus datasets under several reverberation circumstances.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

References

  1. Doire CSJ, Brookes M, Naylor PA, Hicks CM, Betts D, Dmour MA, Holdt-Jensen S (2017) Single-channel online enhancement of speech corrupted by reverberation and noise. IEEE/ACM Trans Audio Speech Lang Process 25(3):572–587

    Article  Google Scholar 

  2. Williamson DS, Wang D (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 25(7):1492–1501

    Article  Google Scholar 

  3. Nakatani T, Ikeshita R., Kinoshita K, Sawada H, Araki S (2021) Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation. In: ICASSP 2021 - 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6129–6133, https://doi.org/10.1109/ICASSP39728.2021.9414264

  4. Nakatani T, Boeddeker C, Kinoshita K, Ikeshita R, Delcroix M, Haeb-Umbach R (2020) Jointly optimal denoising, dereverberation, and source separation. IEEE/ACM Trans Audio Speech Lang Process 28:2267–2282. https://doi.org/10.1109/TASLP.2020.3013118

    Article  Google Scholar 

  5. Baby D, Bourlard H (2021) Speech dereverberation using variational autoencoders. In: ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5784–5788, https://doi.org/10.1109/ICASSP39728.2021.9414736

  6. Wu M, Wang D (2006) A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Trans Audio Speech Lang Process 14(3):774–784

    Article  Google Scholar 

  7. Parchami M, Amindavar H, Zhu W (2019) Speech reverberation suppression for time-varying environments using weighted prediction error method with time-varying autoregressive model. Speech Commun 109:1–14. https://doi.org/10.1016/j.specom.2019.03.002

    Article  Google Scholar 

  8. Delcroix M, Yoshioka T, Ogawa A, Kubo Y, Fujimoto M, Ito N, Kinoshita K, Espi M, Hori T, Nakatani T, Nakamura A (2014) Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the reverb challenge, In: Proceedings of the REVERB challenge workshop, vol 1, pp 1–8

  9. Schwartz B, Gannot S, Habets EAP (2015) Online speech dereverberation using Kalman filter and EM algorithm. IEEE/ACM Trans Audio Speech Lang Process 23(2):394–406

    Article  Google Scholar 

  10. Cohen A, Stemmer G, Ingalsuo S, Markovich-Golan S (2017) Combined weighted prediction error and minimum variance distortionless response for dereverberation. In: IEEE international conference on acoustics, speech and signal processing, pp 446–450

  11. Weninger F, Geiger J, Wollmer M, Schuller B, Rigoll G (2014) Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments. Comput Speech Lang 28(4):888–902

    Article  Google Scholar 

  12. Han K, Wang Y, Wang D, Woods WS, Merks I, Zhang T (2015) Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 23(6):982–992

    Article  Google Scholar 

  13. Xiao X, Zhao S, Nguyen DHH, Zhong X, Jones DL, Chng ES, Li H (2016) Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP J Adv Signal Process 2016(1):4

    Article  Google Scholar 

  14. Wu B, Li K, Yang M, Lee C-H (2017) A reverberation-time aware approach to speech dereverberation based on deep neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(1):102–111

    Article  Google Scholar 

  15. Zhao Y, Wang Z-Q, Wang DL (2017) A two-stage algorithm for noisy and reverberant speech enhancement. In: Proceedings of ICASSP, pp 5580–5584

  16. Raikar A, Basu S, Hegde RM (2018) Single channel joint speech dereverberation and denoising using deep priors. In: 2018 IEEE global conference on signal and information processing (GlobalSIP). IEEE, pp 216–220

  17. Wang Z-Q, Wang D (2020) Deep learning based target cancellation for speech dereverberation. IEEE/ACM Trans Audio Speech Lang Process 28:941–950. https://doi.org/10.1109/TASLP.2020.2975902

    Article  Google Scholar 

  18. Hussain T, Siniscalchi SM, Wang H-LS, Tsao Y, Salerno VM, Liao W-H (2020) Ensemble hierarchical extreme learning machine for speech dereverberation. IEEE Trans Cognit Dev Syst 12(4):744–758. https://doi.org/10.1109/TCDS.2019.2953620

    Article  Google Scholar 

  19. Chen H, Zhang P (2021) A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation. Neural Netw 141:238–248. https://doi.org/10.1016/j.neunet.2021.04.023

    Article  Google Scholar 

  20. Albuquerque RQ, Mello CAB (2021) Automatic no-reference speech quality assessment with convolutional neural networks. Neural Comput Appl 33(16):9993–10003

    Article  Google Scholar 

  21. Routray S, Mao Q (2022) Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network. Comput Speech Lang 71:101270. https://doi.org/10.1016/j.csl.2021.101270

    Article  Google Scholar 

  22. Kanda N et al. (2019) Guided source separation meets a strong asr backend: Hitachi/Paderborn university joint investigation for dinner party ASR. In: Proceedings of the Interspeech, pp 1248–1252

  23. Haeb-Umbach R et al (2019) Speech processing for digital home assistants. IEEE Signal Process Mag 36(6):111–124

    Article  Google Scholar 

  24. Togami M (2015) Multichannel online speech dereverberation under noisy environments. In: Proceedings of the 23rd European conference on signal processing, pp 1078–1082

  25. Braun S, Habets EAP (2018) Linear prediction based online dereverberation and noise reduction using alternating Kalman filters. IEEE/ACM Trans Audio Speech Lang Process 26(6):1119–1129

    Article  Google Scholar 

  26. Dietzen T, Doclo S, Moonen M, van Waterschoot T (2018) Joint multi-microphone speech dereverberation and noise reduction using integrated sidelobe cancellation and linear prediction. In: Proceedings of the 6th international workshop on acoustic signal enhancement, pp 221–225

  27. Mohammadiha N, Smaragdis P, Doclo S (2015) Joint acoustic and spectral modeling for speech dereverberation using non-negative representations. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4410–4414. IEEE

  28. Wang Y, Narayanan A, Wang DL (2014) On training targets for supervised speech separation. IEEE/ACM Trans Audio Speech Lang Process 22(12):1849–1858

    Article  Google Scholar 

  29. Wang D, Chen J (2018) Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans Audio Speech Lang Process 26(10):1702–1726

    Article  Google Scholar 

  30. Shao Y, Srinivasan S, Wang DL (2008) Robust speaker identification using auditory features and computational auditory scene analysis. In: Proceedings of ICASSP, pp 1589–1592

  31. Hermansky H, Morgan N (1994) RASTA processing of speech. IEEE Trans Speech Audio Proc 2:578–589

    Article  Google Scholar 

  32. Rothauser EH et al (1969) IEEE recommended practice for speech quality measurements. IEEE Trans Audio Electroacoust 17:225–246

    Article  Google Scholar 

  33. Habets E (2010) Room impulse response generator (http://home.tiscali.nl/ehabets/rir generator.html)

  34. Allen JB, Berkley DA (1979) Image method for efficiently simulating small room acoustics. J Acoust Soc Am 65:943–950

    Article  Google Scholar 

  35. Varga A, Steeneken HJ (1993) Assessment for automatic speech recognition: Ii. noisex-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3):247–251

    Article  Google Scholar 

  36. Kinoshita K, Delcroix M, Gannot S, Habets E, Haeb-Umbach R, Kellermann W, Leutnant V, Maas R, Nakatani T, Raj B, Sehr A, Yoshioka T (2016) A summary of the reverb challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J Adv Signal Process 7:1–19

    Google Scholar 

  37. Robinson T, Fransen J, Pye D, Foote J, Renals S (1995) WSJCAMO: a british english speech corpus for large vocabulary continuous speech recognition. In: International conference on acoustics, speech, and signal processing (ICASSP), pp 81–84

  38. Lincoln M, McCowan I, Vepa J, Maganti HK (2005) The multichannel wall street journal audio visual corpus (MC-WSJ-AV): specification and initial experiments. In: IEEE workshop on automatic speech recognition and understanding, pp 357–362

  39. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS (1993) Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1, NASA STI/Recon technical report n, vol 93

  40. Hu G (2019) 100 nonspeech sounds 2006 [oneline], Technical Report. Available online: http://web.cse.ohiostate.edu/pnl/corpus/HuNonspeech/HuCorpus.html (accessed on 22 February 2019), Tech. Rep

  41. Rix A W, Beerends JG, Hollier MP, Hekstra AP (2001) Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing, vol 2, pp 749–752

  42. Taal CH, Hendriks RC, Heusdens R, Jensen J (2011) An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans Audio Speech Lang Process 19(7):2125–2136

    Article  Google Scholar 

  43. Hu Y, Loizou PC (2008) Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process 16(1):229–238

    Article  Google Scholar 

  44. Nakatani Tomohiro, Yoshioka Takuya, Kinoshita Keisuke, Miyoshi Masato, Juang Biing-Hwang (2010) Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans Audio Speech Lang Process 18(7):1717–1731

    Article  Google Scholar 

  45. Mack Wolfgang, Chakrabarty Soumitro, Stoter Fabian-Robert, Braun Sebastian, Edler Bernd, Habets Emanuel (2018) Single-channel dereverberation using direct mmse optimization and bidirectional lstm networks. Proc Interspeech 2018:1314–1318

    Article  Google Scholar 

  46. Rethage D, Pons J, Serra X (2018) A wavenet for speech denoising. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5069–5073

  47. Han K, Wang Y, Wang DL, Woods WS, Merks I, Zhang T (2015) Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 23(6):982–992

    Article  Google Scholar 

  48. Fan C, Tao J, Liu B, Yi J, Wen Z (2020) Joint Training for simultaneous speech denoising and dereverberation with deep embedding representations, INTERSPEECH

  49. Nakatani T et al (2020) DNN-supported mask-based convolutional beamforming for simultaneous denoising, dereverberation, and source separation. In: ICASSP 2020–2020 ieee international conference on acoustics, speech and signal processing (ICASSP), Barcelona, Spain, pp 6399–6403

  50. Jeub M, Schafer M, Vary P (2009) A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proceedings of the international conference on digital signal processing, pp 1–5

  51. Zhao Y, Wang D, Xu B, Zhang T (2020) Monaural speech dereverberation using temporal convolutional networks with self attention. IEEE/ACM Trans Audio Speech Lang Process 28:1598–1607

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the Jiangsu Planned Projects for Postdoctoral Research Funds under Grant 2019K222, the Qing Lan Talent Program of Jiangsu Province, the Jiangsu Province Key Research and Development Plan under Grant BE2020036.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Routray, S., Mao, Q. A context aware-based deep neural network approach for simultaneous speech denoising and dereverberation. Neural Comput & Applic 34, 9831–9845 (2022). https://doi.org/10.1007/s00521-022-06968-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-022-06968-1

Keywords

Navigation