
Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition

Chapter in: New Era for Robust Speech Recognition

Abstract

In this chapter we review some promising speech enhancement front-end techniques for handling noise and reverberation. We focus on signal-processing-based multichannel approaches and describe beamforming-based noise reduction and linear-prediction-based dereverberation. We demonstrate the potential of these approaches by introducing two systems that achieved top performance on the recent REVERB and CHiME-3 benchmarks.
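The two front-end components named in the abstract can be illustrated in a few lines of NumPy. The first sketch shows frequency-domain MVDR beamforming, a standard formulation of beamforming-based noise reduction (cf. [20, 35]). It is a minimal illustration, not the chapter's actual system: the function names, array shapes, and toy inputs are all assumptions made here for clarity.

```python
import numpy as np

def mvdr_weights(noise_psd, steering):
    """MVDR weights w = Phi_N^{-1} d / (d^H Phi_N^{-1} d) for one frequency bin.

    noise_psd: (M, M) spatial covariance matrix of the noise.
    steering:  (M,)   steering vector toward the target source.
    """
    num = np.linalg.solve(noise_psd, steering)   # Phi_N^{-1} d
    return num / (steering.conj() @ num)         # scalar normalization

def apply_beamformer(weights, stft_frames):
    """Combine M-channel STFT frames (M, T) into one enhanced channel (T,)."""
    return weights.conj() @ stft_frames

# Toy usage: 4 mics. With an identity noise covariance and a unit steering
# vector, MVDR reduces to a simple average across channels.
M, T = 4, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
w = mvdr_weights(np.eye(M, dtype=complex), np.ones(M, dtype=complex))
y = apply_beamformer(w, X)
```

The second sketch is a deliberately reduced, single-channel variant of linear-prediction-based dereverberation in the spirit of the weighted prediction error (WPE) method [31, 43]: late reverberation in each frequency bin is predicted from delayed past STFT frames and subtracted, with per-frame variance weighting. Real WPE is multichannel, and the tap, delay, and iteration settings below are placeholder values.

```python
def wpe_dereverb(X, taps=10, delay=3, iters=3, eps=1e-10):
    """Single-channel WPE-style dereverberation of one frequency bin
    trajectory X (complex array of length T). Illustrative only."""
    T = X.shape[0]
    Y = X.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(Y) ** 2, eps)    # per-frame variance estimate
        # Row t of A holds the delayed past frames X[t - delay - k], k = 0..taps-1.
        A = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            shift = delay + k
            A[shift:, k] = X[: T - shift]
        # Weighted least squares: minimize sum_t |X_t - A_t g|^2 / lam_t.
        Aw = A / lam[:, None]
        g = np.linalg.solve(A.conj().T @ Aw + eps * np.eye(taps),
                            Aw.conj().T @ X)
        Y = X - A @ g    # prediction residual = dereverberated estimate
    return Y
```

The prediction delay leaves the direct path and early reflections (which help intelligibility [8]) untouched and targets only late reverberation; this is the key design choice that distinguishes delayed linear prediction from ordinary inverse filtering.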


Notes

  1. We should mention the notable exception of neural-network-based speech enhancement, which may be jointly optimized with the ASR back end and has been shown to improve ASR performance [15, 32, 41, 42]. Neural-network-based enhancement is also discussed in Chaps. 4, 5, and 7.

References

  1. Anguera, X.: BeamformIt. http://www.xavieranguera.com/beamformit/ (2014)

  2. Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2023 (2007)


  3. Araki, S., Sawada, H., Makino, S.: Blind speech separation in a meeting situation with maximum SNR beamformers. In: Proceedings of ICASSP’07, vol. 1, pp. I-41–I-44 (2007)


  4. Araki, S., Okada, M., Higuchi, T., Ogawa, A., Nakatani, T.: Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition. In: Proceedings of ICASSP’16, pp. 385–389 (2016)


  5. Barker, J., Marxer, R., Vincent, E., Watanabe, S.: The third “CHiME” speech separation and recognition challenge: dataset, task and baselines. In: Proceedings of ASRU’15, pp. 504–511 (2015)


  6. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York (2006)


  7. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)


  8. Bradley, J.S., Sato, H., Picard, M.: On the importance of early reflections for speech in rooms. J. Acoust. Soc. Am. 113(6), 3233–3244 (2003)


  9. Brutti, A., Omologo, M., Svaizer, P.: Comparison between different sound source localization techniques based on a real data collection. In: Proceedings of HSCMA’08, pp. 69–72 (2008)


  10. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., et al.: The AMI Meeting Corpus: A Pre-announcement. Springer, Berlin (2005)


  11. Chen, J., Benesty, J., Huang, Y.: Time delay estimation in room acoustic environments: an overview. EURASIP J. Adv. Signal Process. 2006, 170–170 (2006). doi:10.1155/ASP/2006/26503


  12. Delcroix, M., Yoshioka, T., Ogawa, A., Kubo, Y., Fujimoto, M., Ito, N., Kinoshita, K., Espi, M., Araki, S., Hori, T., Nakatani, T.: Strategies for distant speech recognition in reverberant environments. EURASIP J. Adv. Signal Process. 2015, 60 (2015). doi:10.1186/s13634-015-0245-7


  13. Dennis, J., Dat, T.H.: Single and multi-channel approaches for distant speech recognition under noisy reverberant conditions: I2R’S system description for the ASpIRE challenge. In: Proceedings of ASRU’15, pp. 518–524 (2015)


  14. Doclo, S., Moonen, M.: GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Trans. Signal Process. 50(9), 2230–2244 (2002)


  15. Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J.: Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of ICASSP’15, pp. 708–712 (2015)


  16. Frost, O.L.: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60(8), 926–935 (1972)


  17. Harper, M.: The automatic speech recognition in reverberant environments (ASpIRE) challenge. In: Proceedings of ASRU’15, pp. 547–554 (2015)


  18. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (1996)


  19. Heymann, J., Drude, L., Chinaev, A., Haeb-Umbach, R.: BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In: Proceedings of ASRU’15, pp. 444–451 (2015)


  20. Higuchi, T., Ito, N., Yoshioka, T., Nakatani, T.: Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In: Proceedings of ICASSP’16, pp. 5210–5214 (2016)


  21. Hori, T., Araki, S., Yoshioka, T., Fujimoto, M., Watanabe, S., Oba, T., Ogawa, A., Otsuka, K., Mikami, D., Kinoshita, K., Nakatani, T., Nakamura, A., Yamato, J.: Low-latency real-time meeting recognition and understanding using distant microphones and omni-directional camera. IEEE Trans. Audio Speech Lang. Process. 20(2), 499–513 (2012)


  22. Hori, T., Chen, Z., Erdogan, H., Hershey, J.R., Le Roux, J., Mitra, V., Watanabe, S.: The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition. In: Proceedings of ASRU’15, pp. 475–481 (2015)


  23. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, 1st edn. Prentice-Hall, Upper Saddle River, NJ (2001)


  24. Jukic, A., Doclo, S.: Speech dereverberation using weighted prediction error with Laplacian model of the desired signal. In: Proceedings of ICASSP’14, pp. 5172–5176 (2014)


  25. Kinoshita, K., Delcroix, M., Nakatani, T., Miyoshi, M.: Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction. IEEE Trans. Audio Speech Lang. Process. 17(4), 534–545 (2009)


  26. Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Habets, E., Sehr, A., Kellermann, W., Gannot, S., Maas, R., Haeb-Umbach, R., Leutnant, V., Raj, B.: The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech. In: Proceedings of WASPAA’13. New Paltz, NY (2013)


  27. Kinoshita, K., Delcroix, M., Gannot, S., Habets, E., Haeb-Umbach, R., Kellermann, W., Leutnant, V., Maas, R., Nakatani, T., Raj, B., Sehr, A., Yoshioka, T.: A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J. Adv. Signal Process. (2016). doi:10.1186/s13634-016-0306-6


  28. Kuttruff, H.: Room Acoustics, 5th edn. Taylor & Francis, London (2009)


  29. Lebart, K., Boucher, J.M., Denbigh, P.N.: A new method based on spectral subtraction for speech dereverberation. Acta Acustica 87(3), 359–366 (2001)


  30. Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H.: Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation. In: Proceedings of ICASSP’08, pp. 85–88 (2008)


  31. Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H.: Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)


  32. Narayanan, A., Wang, D.: Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: Proceedings of ICASSP’13, pp. 7092–7096 (2013)


  33. Renals, S., Swietojanski, P.: Neural networks for distant speech recognition. In: Proceedings of HSCMA’14, pp. 172–176 (2014)


  34. Sivasankaran, S., Nugraha, A.A., Vincent, E., Morales-Cordovilla, J.A., Dalmia, S., Illina, I., Liutkus, A.: Robust ASR using neural network based speech enhancement and feature simulation. In: Proceedings of ASRU’15, pp. 482–489 (2015)


  35. Souden, M., Araki, S., Kinoshita, K., Nakatani, T., Sawada, H.: A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans. Audio Speech Lang. Process. 21(9), 1913–1928 (2013)


  36. Tachioka, Y., Narita, T., Weninger, F., Watanabe, S.: Dual system combination approach for various reverberant environments with dereverberation techniques. In: Proceedings of REVERB’14 (2014)


  37. Van Trees, H.L.: Detection, Estimation, and Modulation Theory. Part IV, Optimum Array Processing. Wiley-Interscience, New York (2002)


  38. Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15(3), 1066–1074 (2007)


  39. Warsitz, E., Haeb-Umbach, R.: Blind acoustic beamforming based on generalized eigenvalue decomposition. IEEE Trans. Audio Speech Lang. Process. 15(5), 1529–1539 (2007)


  40. Weninger, F., Watanabe, S., Le Roux, J., Hershey, J.R., Tachioka, Y., Geiger, J., Schuller, B., Rigoll, G.: The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement. In: Proceedings of REVERB’14 (2014)


  41. Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J.R., Schuller, B.: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: Proceedings of Latent Variable Analysis and Signal Separation, pp. 91–99. Springer, Berlin (2015)


  42. Xu, Y., Du, J., Dai, L.R., Lee, C.H.: A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2015)


  43. Yoshioka, T., Nakatani, T.: Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening. IEEE Trans. Audio Speech Lang. Process. 20(10), 2707–2720 (2012)


  44. Yoshioka, T., Tachibana, H., Nakatani, T., Miyoshi, M.: Adaptive dereverberation of speech signals with speaker-position change detection. In: Proceedings of ICASSP’09, pp. 3733–3736 (2009)


  45. Yoshioka, T., Nakatani, T., Miyoshi, M.: Integrated speech enhancement method using noise suppression and dereverberation. IEEE Trans. Audio Speech Lang. Process. 17(2), 231–246 (2009)


  46. Yoshioka, T., Sehr, A., Delcroix, M., Kinoshita, K., Maas, R., Nakatani, T., Kellermann, W.: Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag. 29(6), 114–126 (2012)


  47. Yoshioka, T., Chen, X., Gales, M.J.F.: Impact of single-microphone dereverberation on DNN-based meeting transcription systems. In: Proceedings of ICASSP’14 (2014)


  48. Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Fujimoto, M., Yu, C., Fabian, W.J., Espi, M., Higuchi, T., Araki, S., Nakatani, T.: The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices. In: Proceedings of ASRU’15, pp. 436–443 (2015)



Author information


Correspondence to Marc Delcroix.



Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Delcroix, M. et al. (2017). Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science (R0)
