Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio

Abstract

We investigate the use of deep neural networks and deep recurrent neural networks for separation and recognition of speech in challenging environments. Mask prediction networks have recently received considerable interest for speech separation and speech enhancement problems in which the background signals are nonstationary and challenging. Initial signal-level enhancement with deep neural networks has also been shown to be useful for noise-robust speech recognition in these environments. We consider various loss functions for training the networks and illustrate the differences among them. We compare the performance of deep computational architectures with conventional statistical techniques as well as variants of nonnegative matrix factorization, and show that deep-learning-based techniques achieve substantially better results on this problem.
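
To make the mask-prediction approach concrete, here is a minimal sketch in PyTorch of the kind of recurrent mask estimator and training losses discussed in this chapter. It is an illustrative assumption rather than the authors' implementation: the class name MaskNet, the loss helpers, and all layer sizes are hypothetical, chosen only to show the structure of the approach.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """BLSTM mask estimator: log-magnitude features of the noisy
    mixture in, time-frequency mask in [0, 1] out."""
    def __init__(self, n_freq=257, hidden=256, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, feats):                 # (batch, time, n_freq)
        h, _ = self.blstm(feats)              # (batch, time, 2 * hidden)
        return torch.sigmoid(self.proj(h))    # sigmoid keeps the mask in [0, 1]

def magnitude_loss(mask, mag_mix, mag_clean):
    # Magnitude spectrum approximation: penalize the distance between
    # the masked mixture magnitude and the clean speech magnitude.
    return torch.mean((mask * mag_mix - mag_clean) ** 2)

def phase_sensitive_loss(mask, mag_mix, mag_clean, cos_theta):
    # Phase-sensitive spectrum approximation: the target is the clean
    # magnitude scaled by cos(theta), the cosine of the phase difference
    # between the clean source and the mixture in each time-frequency bin.
    return torch.mean((mask * mag_mix - mag_clean * cos_theta) ** 2)
```

At synthesis time the estimated mask multiplies the noisy short-time magnitude spectrum, and the result is recombined with the noisy phase. The phase-sensitive loss shrinks the training target in time-frequency bins where the noisy phase deviates strongly from the clean phase, which is what distinguishes it from the plain magnitude loss.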

This work was largely completed when the first author was on sabbatical leave at MERL from his faculty position at Sabanci University, Istanbul.



Author information

Correspondence to Hakan Erdogan.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J. (2017). Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_7

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science, Computer Science (R0)
