
Deep Recurrent Neural Networks with Nonlinear Masking Layers and Two-Level Estimation for Speech Separation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11730)

Abstract

Over the past few decades, monaural speech separation has remained an interesting but challenging problem. The goal of speech separation is to extract a specific target speech signal from background interference, and the task has traditionally been treated as a signal-processing problem. In recent years, rapid advances in deep learning have produced major breakthroughs in speech separation. In this paper, recurrent neural networks (RNNs) that integrate multiple nonlinear masking layers (NMLs) to learn a two-level estimation are proposed for speech separation. Experimental results show that the proposed model, “RNN + SMMs + 3 NMLs”, outperforms the baseline RNN without any mask on all of the SDR, SIR, and SAR indices, and it also obtains much better SDR and SIR than an RNN equipped only with the original deterministic time-frequency masks.
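Since the full text is behind a paywall, the snippet below is only a minimal PyTorch sketch of the general approach the abstract describes: an RNN whose output layer predicts learned, nonlinear time-frequency masks that are multiplied with the mixture spectrogram to estimate each source, in contrast to fixed deterministic masks. The class names, layer sizes, sigmoid nonlinearity, and two-source setup are illustrative assumptions, not the authors' exact “RNN + SMMs + 3 NMLs” architecture.

```python
import torch
import torch.nn as nn


class NonlinearMaskingLayer(nn.Module):
    """Maps RNN features to a bounded time-frequency mask for each source.

    Hypothetical stand-in for the paper's NML; all details are assumptions.
    """

    def __init__(self, hidden_dim: int, n_freq: int, n_sources: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_freq * n_sources)
        self.n_freq, self.n_sources = n_freq, n_sources

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden) -> masks: (batch, time, sources, freq)
        m = torch.sigmoid(self.proj(h))  # bounded in (0, 1), like a soft mask
        return m.view(h.size(0), h.size(1), self.n_sources, self.n_freq)


class MaskingRNN(nn.Module):
    """RNN separator: a learned mask is applied to the mixture spectrogram."""

    def __init__(self, n_freq: int = 129, hidden_dim: int = 256,
                 n_sources: int = 2):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden_dim, num_layers=2, batch_first=True)
        self.mask = NonlinearMaskingLayer(hidden_dim, n_freq, n_sources)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, time, n_freq) mixture magnitude spectrogram
        h, _ = self.rnn(mix_mag)
        masks = self.mask(h)  # learned masks, not deterministic ones
        # Broadcast: (B, T, S, F) * (B, T, 1, F) -> per-source estimates
        return masks * mix_mag.unsqueeze(2)


if __name__ == "__main__":
    # Sanity check: 4 utterances, 100 frames, 129 frequency bins.
    model = MaskingRNN()
    est = model(torch.randn(4, 100, 129).abs())
    print(est.shape)  # torch.Size([4, 100, 2, 129])
```

In practice such a model would be trained with a spectrogram reconstruction loss against the clean sources, and the resulting estimates evaluated with the BSS-Eval SDR, SIR, and SAR metrics mentioned in the abstract (implemented, for example, in the mir_eval package).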



Author information

Correspondence to Jiantao Zhang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, J., Zhang, P. (2019). Deep Recurrent Neural Networks with Nonlinear Masking Layers and Two-Level Estimation for Speech Separation. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series. ICANN 2019. Lecture Notes in Computer Science, vol 11730. Springer, Cham. https://doi.org/10.1007/978-3-030-30490-4_32


  • DOI: https://doi.org/10.1007/978-3-030-30490-4_32


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30489-8

  • Online ISBN: 978-3-030-30490-4

  • eBook Packages: Computer Science (R0)
