Abstract
We propose a joint framework combining speech enhancement (SE) and voice activity detection (VAD) to improve speech intelligibility in low signal-to-noise-ratio (SNR) environments. Deep neural networks (DNNs) have recently been adopted successfully as regression models for SE. Nonetheless, their performance in harsh environments is not always satisfactory, because noise energy often dominates certain speech segments and causes speech distortion. Based on an analysis of frame-level SNR information in the training set, our approach consists of two steps: (1) a DNN-based VAD model is trained to generate frame-level speech/non-speech probabilities, and (2) the final enhanced speech features are obtained as a weighted sum of the estimated clean speech features, with the weights incorporating the VAD information. Experimental results demonstrate that the proposed SE approach improves short-time objective intelligibility (STOI) by 0.161 and perceptual evaluation of speech quality (PESQ) by 0.333 over already strong SE baseline systems at \(-5\) dB SNR with babble noise.
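To make the two-step procedure concrete, the sketch below (Python/NumPy) illustrates how frame-level VAD posteriors could weight estimated clean speech features. The model objects `vad_dnn`, `se_dnn_a`, and `se_dnn_b` and their `predict` interface are hypothetical placeholders, not the paper's implementation; the actual weighting rule and number of SE models in the paper follow the frame-level SNR analysis described above.

```python
# Minimal sketch of a VAD-weighted combination of SE outputs.
# All model objects and shapes here are illustrative assumptions.
import numpy as np

def enhance_with_vad(noisy_feats, vad_dnn, se_dnn_a, se_dnn_b):
    """Combine two DNN speech-enhancement outputs using frame-level
    VAD posteriors as weights.

    noisy_feats : (T, D) array of noisy spectral features.
    vad_dnn     : model whose predict() returns (T,) speech probabilities.
    se_dnn_a/b  : SE regression DNNs whose predict() returns (T, D)
                  estimated clean features (e.g. trained with different
                  emphasis on speech- vs. noise-dominant frames).
    """
    p_speech = vad_dnn.predict(noisy_feats)   # (T,) values in [0, 1]
    clean_a = se_dnn_a.predict(noisy_feats)   # (T, D)
    clean_b = se_dnn_b.predict(noisy_feats)   # (T, D)

    # Step (2): frame-wise weighted sum of the estimated clean features,
    # with the VAD posterior controlling each estimate's contribution.
    w = p_speech[:, None]                     # broadcast over feature dim
    enhanced = w * clean_a + (1.0 - w) * clean_b
    return enhanced
```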
This work was supported by the National Natural Science Foundation of China under Grant No. 61305002. We would like to thank iFLYTEK Research for providing the training data and the DNN training platform.
Notes
- 1. The noise types are: vehicle (bus, train, plane, and car), exhibition hall, meeting room, office, emporium, family living room, factory, bus station, mess hall, KTV, and musical instruments.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Gao, T., Du, J., Xu, Y., Liu, C., Dai, L.-R., Lee, C.-H. (2015). Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds.) Latent Variable Analysis and Signal Separation. LVA/ICA 2015. Lecture Notes in Computer Science, vol. 9237. Springer, Cham. https://doi.org/10.1007/978-3-319-22482-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22481-7
Online ISBN: 978-3-319-22482-4