Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

  • Conference paper
Latent Variable Analysis and Signal Separation (LVA/ICA 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9237)

Abstract

We propose a joint framework combining speech enhancement (SE) and voice activity detection (VAD) to increase speech intelligibility in low signal-to-noise ratio (SNR) environments. Deep neural networks (DNNs) have recently been adopted successfully as regression models for SE. Nonetheless, performance in harsh environments is not always satisfactory, because noise energy often dominates certain speech segments, causing speech distortion. Based on an analysis of frame-level SNR information in the training set, our approach consists of two steps: (1) a DNN-based VAD model is trained to generate frame-level speech/non-speech probabilities; and (2) the final enhanced speech features are obtained as a weighted sum of the estimated clean speech features, with the weights incorporating the VAD information. Experimental results demonstrate that the proposed SE approach improves short-time objective intelligibility (STOI) by 0.161 and perceptual evaluation of speech quality (PESQ) by 0.333 over already strong SE baseline systems at -5 dB SNR with babble noise.
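The second step above — a per-frame weighted combination of features driven by DNN-VAD posteriors — can be sketched as follows. This is a minimal illustration of one plausible reading of the abstract, not the authors' implementation; the function name `combine_with_vad` and the choice of what the non-speech frames fall back to are assumptions.

```python
import numpy as np

def combine_with_vad(noisy_feats, enhanced_feats, speech_probs):
    """Weighted sum of enhanced and noisy features, guided by
    frame-level speech/non-speech probabilities from a VAD model.

    noisy_feats, enhanced_feats: (frames, dims) spectral feature matrices
    speech_probs: (frames,) DNN-VAD posterior probability of speech
    """
    w = speech_probs[:, np.newaxis]  # per-frame weight in [0, 1]
    # Speech-dominated frames trust the DNN enhancement output;
    # non-speech frames lean toward the unprocessed input, limiting
    # the speech distortion the abstract describes.
    return w * enhanced_feats + (1.0 - w) * noisy_feats

# Toy example: 3 frames, 2 feature dimensions
noisy = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
enhanced = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
probs = np.array([0.0, 0.5, 1.0])  # non-speech, uncertain, speech
out = combine_with_vad(noisy, enhanced, probs)
# Frame 0 keeps the noisy features, frame 2 keeps the enhanced ones,
# and frame 1 is their average.
```

The key design point is that the combination is soft: using posterior probabilities rather than hard speech/non-speech decisions avoids abrupt switching artifacts at frame boundaries.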

This work was supported by the National Natural Science Foundation of China under Grant No. 61305002. We would like to thank iFLYTEK Research for providing the training data and the DNN training platform.


Notes

  1. The noise types are vehicle (bus, train, plane and car), exhibition hall, meeting room, office, emporium, family living room, factory, bus station, mess hall, KTV, and musical instruments.

  2. http://home.ustc.edu.cn/~gtian09/demos/LowSNR-SEDNN.html.



Author information

Correspondence to Tian Gao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Gao, T., Du, J., Xu, Y., Liu, C., Dai, L.-R., Lee, C.-H. (2015). Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2015. Lecture Notes in Computer Science, vol 9237. Springer, Cham. https://doi.org/10.1007/978-3-319-22482-4_9


  • DOI: https://doi.org/10.1007/978-3-319-22482-4_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22481-7

  • Online ISBN: 978-3-319-22482-4

  • eBook Packages: Computer Science (R0)
