Abstract
This paper describes a speech denoising system based on long short-term memory (LSTM) neural networks. The network performs speech enhancement in the spectrogram magnitude domain; the audio is then resynthesized via the inverse short-time Fourier transform while retaining the original phase. Objective quality is assessed by the root mean square error between clean and denoised audio signals on the CHiME corpus, and by the speaker verification rate on the RSR2015 corpus. The proposed system demonstrates improved results on both metrics.
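The enhance-magnitude-then-reuse-phase pipeline described in the abstract can be sketched as follows. This is a minimal illustration assuming SciPy's STFT/ISTFT; the `enhance_mag` callable is a hypothetical stand-in for the paper's LSTM network, and the RMSE function mirrors the objective metric used for evaluation.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_with_original_phase(noisy, enhance_mag, fs=16000, nperseg=512):
    """Enhance the STFT magnitude, resynthesize with the original phase."""
    # Short-time Fourier transform of the noisy signal
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    # Magnitude enhancement (in the paper, an LSTM network's prediction)
    mag_hat = enhance_mag(mag)
    # Inverse STFT, keeping the original (noisy) phase
    _, x_hat = istft(mag_hat * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return x_hat

def rmse(clean, denoised):
    """Root mean square error between clean and denoised signals."""
    n = min(len(clean), len(denoised))
    return np.sqrt(np.mean((clean[:n] - denoised[:n]) ** 2))
```

With an identity `enhance_mag` (no enhancement), the round trip through STFT and ISTFT reconstructs the input almost exactly, which is a useful sanity check before plugging in a trained model.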
© 2017 Springer International Publishing AG
Tkachenko, M., Yamshinin, A., Lyubimov, N., Kotov, M., Nastasenko, M. (2017). Speech Enhancement for Speaker Recognition Using Deep Recurrent Neural Networks. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_69
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3