Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

  • Conference paper
Latent Variable Analysis and Signal Separation (LVA/ICA 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9237)

Abstract

We propose a joint framework combining speech enhancement (SE) and voice activity detection (VAD) to increase speech intelligibility in low signal-to-noise ratio (SNR) environments. Deep neural networks (DNNs) have recently been adopted successfully as regression models for SE. Nonetheless, performance in harsh environments is not always satisfactory, because noise energy often dominates certain speech segments, causing speech distortion. Based on an analysis of frame-level SNR information in the training set, our approach consists of two steps: (1) a DNN-based VAD model is trained to generate frame-level speech/non-speech probabilities; and (2) the final enhanced speech features are obtained as a weighted sum of the estimated clean speech features, with the weights incorporating the VAD information. Experimental results demonstrate that the proposed SE approach improves short-time objective intelligibility (STOI) by 0.161 and perceptual evaluation of speech quality (PESQ) by 0.333 over already strong SE baseline systems at -5 dB SNR with babble noise.
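The second step above — a per-frame weighted combination of features driven by DNN-VAD posteriors — can be sketched as follows. This is a minimal illustration of one plausible reading of the abstract, not the authors' implementation; the function name `combine_with_vad` and the choice of what the non-speech frames fall back to are assumptions.

```python
import numpy as np

def combine_with_vad(noisy_feats, enhanced_feats, speech_probs):
    """Weighted sum of enhanced and noisy features, guided by
    frame-level speech/non-speech probabilities from a VAD model.

    noisy_feats, enhanced_feats: (frames, dims) spectral feature matrices
    speech_probs: (frames,) DNN-VAD posterior probability of speech
    """
    w = speech_probs[:, np.newaxis]  # per-frame weight in [0, 1]
    # Speech-dominated frames trust the DNN enhancement output;
    # non-speech frames lean toward the unprocessed input, limiting
    # the speech distortion the abstract describes.
    return w * enhanced_feats + (1.0 - w) * noisy_feats

# Toy example: 3 frames, 2 feature dimensions
noisy = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
enhanced = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
probs = np.array([0.0, 0.5, 1.0])  # non-speech, uncertain, speech
out = combine_with_vad(noisy, enhanced, probs)
# Frame 0 keeps the noisy features, frame 2 keeps the enhanced ones,
# and frame 1 is their average.
```

The key design point is that the combination is soft: using posterior probabilities rather than hard speech/non-speech decisions avoids abrupt switching artifacts at frame boundaries.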

This work was supported by the National Natural Science Foundation of China under Grant No. 61305002. We would like to thank iFLYTEK Research for providing the training data and the DNN training platform.


Notes

  1. The noise types are vehicle (bus, train, plane and car), exhibition hall, meeting room, office, emporium, family living room, factory, bus station, mess hall, KTV, and musical instruments.

  2. http://home.ustc.edu.cn/~gtian09/demos/LowSNR-SEDNN.html.



Author information

Correspondence to Tian Gao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Gao, T., Du, J., Xu, Y., Liu, C., Dai, L.-R., Lee, C.-H. (2015). Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2015. Lecture Notes in Computer Science, vol 9237. Springer, Cham. https://doi.org/10.1007/978-3-319-22482-4_9


  • DOI: https://doi.org/10.1007/978-3-319-22482-4_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22481-7

  • Online ISBN: 978-3-319-22482-4

  • eBook Packages: Computer Science (R0)
