
Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition

Chapter in: New Era for Robust Speech Recognition

Abstract

In this chapter we review some promising speech enhancement front-end techniques for handling noise and reverberation. We focus on signal-processing-based multichannel approaches and describe beamforming-based noise reduction and linear-prediction-based dereverberation. We demonstrate the potential of these approaches by introducing two systems that achieved top performance on the recent REVERB and CHiME-3 benchmarks.
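The two front-end components named in the abstract can be illustrated in a few lines of NumPy. The first sketch shows frequency-domain MVDR beamforming, a standard formulation of beamforming-based noise reduction (cf. [20, 35]). It is a minimal illustration, not the chapter's actual system: the function names, array shapes, and toy inputs are all assumptions made here for clarity.

```python
import numpy as np

def mvdr_weights(noise_psd, steering):
    """MVDR weights w = Phi_N^{-1} d / (d^H Phi_N^{-1} d) for one frequency bin.

    noise_psd: (M, M) spatial covariance matrix of the noise.
    steering:  (M,)   steering vector toward the target source.
    """
    num = np.linalg.solve(noise_psd, steering)   # Phi_N^{-1} d
    return num / (steering.conj() @ num)         # scalar normalization

def apply_beamformer(weights, stft_frames):
    """Combine M-channel STFT frames (M, T) into one enhanced channel (T,)."""
    return weights.conj() @ stft_frames

# Toy usage: 4 mics. With an identity noise covariance and a unit steering
# vector, MVDR reduces to a simple average across channels.
M, T = 4, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
w = mvdr_weights(np.eye(M, dtype=complex), np.ones(M, dtype=complex))
y = apply_beamformer(w, X)
```

The second sketch is a deliberately reduced, single-channel variant of linear-prediction-based dereverberation in the spirit of the weighted prediction error (WPE) method [31, 43]: late reverberation in each frequency bin is predicted from delayed past STFT frames and subtracted, with per-frame variance weighting. Real WPE is multichannel, and the tap, delay, and iteration settings below are placeholder values.

```python
def wpe_dereverb(X, taps=10, delay=3, iters=3, eps=1e-10):
    """Single-channel WPE-style dereverberation of one frequency bin
    trajectory X (complex array of length T). Illustrative only."""
    T = X.shape[0]
    Y = X.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(Y) ** 2, eps)    # per-frame variance estimate
        # Row t of A holds the delayed past frames X[t - delay - k], k = 0..taps-1.
        A = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            shift = delay + k
            A[shift:, k] = X[: T - shift]
        # Weighted least squares: minimize sum_t |X_t - A_t g|^2 / lam_t.
        Aw = A / lam[:, None]
        g = np.linalg.solve(A.conj().T @ Aw + eps * np.eye(taps),
                            Aw.conj().T @ X)
        Y = X - A @ g    # prediction residual = dereverberated estimate
    return Y
```

The prediction delay leaves the direct path and early reflections (which help intelligibility [8]) untouched and targets only late reverberation; this is the key design choice that distinguishes delayed linear prediction from ordinary inverse filtering.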


Notes

  1. We should mention the notable exception of neural-network-based speech enhancement, which may be jointly optimized with the ASR back end and has been shown to improve ASR performance [15, 32, 41, 42]. Neural-network-based enhancement is also discussed in Chaps. 4, 5, and 7.

References

  1. Anguera, X.: BeamformIt. http://www.xavieranguera.com/beamformit/ (2014)

  2. Anguera, X., Wooters, C., Hernando, J.: Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2023 (2007)


  3. Araki, S., Sawada, H., Makino, S.: Blind speech separation in a meeting situation with maximum SNR beamformers. In: Proceedings of ICASSP’07, vol. 1, pp. I-41–I-44 (2007)


  4. Araki, S., Okada, M., Higuchi, T., Ogawa, A., Nakatani, T.: Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition. In: Proceedings of ICASSP’16, pp. 385–389 (2016)


  5. Barker, J., Marxer, R., Vincent, E., Watanabe, S.: The third “CHiME” speech separation and recognition challenge: dataset, task and baselines. In: Proceedings of ASRU’15, pp. 504–511 (2015)


  6. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York (2006)


  7. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)


  8. Bradley, J.S., Sato, H., Picard, M.: On the importance of early reflections for speech in rooms. J. Acoust. Soc. Am. 113(6), 3233–3244 (2003)


  9. Brutti, A., Omologo, M., Svaizer, P.: Comparison between different sound source localization techniques based on a real data collection. In: Proceedings of HSCMA’08, pp. 69–72 (2008)


  10. Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., et al.: The AMI Meeting Corpus: A Pre-announcement. Springer, Berlin (2005)


  11. Chen, J., Benesty, J., Huang, Y.: Time delay estimation in room acoustic environments: an overview. EURASIP J. Adv. Signal Process. 2006, 170–170 (2006). doi:10.1155/ASP/2006/26503


  12. Delcroix, M., Yoshioka, T., Ogawa, A., Kubo, Y., Fujimoto, M., Ito, N., Kinoshita, K., Espi, M., Araki, S., Hori, T., Nakatani, T.: Strategies for distant speech recognition in reverberant environments. EURASIP J. Adv. Signal Process. 2015, 60 (2015). doi:10.1186/s13634-015-0245-7


  13. Dennis, J., Dat, T.H.: Single and multi-channel approaches for distant speech recognition under noisy reverberant conditions: I2R’S system description for the ASpIRE challenge. In: Proceedings of ASRU’15, pp. 518–524 (2015)


  14. Doclo, S., Moonen, M.: GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Trans. Signal Process. 50(9), 2230–2244 (2002)


  15. Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J.: Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of ICASSP’15, pp. 708–712 (2015)


  16. Frost, O.L.: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60(8), 926–935 (1972)


  17. Harper, M.: The automatic speech recognition in reverberant environments (ASpIRE) challenge. In: Proceedings of ASRU’15, pp. 547–554 (2015)


  18. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (1996)


  19. Heymann, J., Drude, L., Chinaev, A., Haeb-Umbach, R.: BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In: Proceedings of ASRU’15, pp. 444–451 (2015)


  20. Higuchi, T., Ito, N., Yoshioka, T., Nakatani, T.: Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In: Proceedings of ICASSP’16, pp. 5210–5214 (2016)


  21. Hori, T., Araki, S., Yoshioka, T., Fujimoto, M., Watanabe, S., Oba, T., Ogawa, A., Otsuka, K., Mikami, D., Kinoshita, K., Nakatani, T., Nakamura, A., Yamato, J.: Low-latency real-time meeting recognition and understanding using distant microphones and omni-directional camera. IEEE Trans. Audio Speech Lang. Process. 20(2), 499–513 (2012)


  22. Hori, T., Chen, Z., Erdogan, H., Hershey, J.R., Le Roux, J., Mitra, V., Watanabe, S.: The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition. In: Proceedings of ASRU’15, pp. 475–481 (2015)


  23. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, 1st edn. Prentice-Hall, Upper Saddle River, NJ (2001)


  24. Jukic, A., Doclo, S.: Speech dereverberation using weighted prediction error with Laplacian model of the desired signal. In: Proceedings of ICASSP’14, pp. 5172–5176 (2014)


  25. Kinoshita, K., Delcroix, M., Nakatani, T., Miyoshi, M.: Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction. IEEE Trans. Audio Speech Lang. Process. 17(4), 534–545 (2009)


  26. Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Habets, E., Sehr, A., Kellermann, W., Gannot, S., Maas, R., Haeb-Umbach, R., Leutnant, V., Raj, B.: The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech. In: Proceedings of WASPAA’13. New Paltz, NY (2013)


  27. Kinoshita, K., Delcroix, M., Gannot, S., Habets, E., Haeb-Umbach, R., Kellermann, W., Leutnant, V., Maas, R., Nakatani, T., Raj, B., Sehr, A., Yoshioka, T.: A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J. Adv. Signal Process. (2016). doi:10.1186/s13634-016-0306-6


  28. Kuttruff, H.: Room Acoustics, 5th edn. Taylor & Francis, London (2009)


  29. Lebart, K., Boucher, J.M., Denbigh, P.N.: A new method based on spectral subtraction for speech dereverberation. Acta Acustica 87(3), 359–366 (2001)


  30. Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H.: Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation. In: Proceedings of ICASSP’08, pp. 85–88 (2008)


  31. Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., Juang, B.H.: Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010)


  32. Narayanan, A., Wang, D.: Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: Proceedings of ICASSP’13, pp. 7092–7096 (2013)


  33. Renals, S., Swietojanski, P.: Neural networks for distant speech recognition. In: Proceedings of HSCMA’14, pp. 172–176 (2014)


  34. Sivasankaran, S., Nugraha, A.A., Vincent, E., Morales-Cordovilla, J.A., Dalmia, S., Illina, I., Liutkus, A.: Robust ASR using neural network based speech enhancement and feature simulation. In: Proceedings of ASRU’15, pp. 482–489 (2015)


  35. Souden, M., Araki, S., Kinoshita, K., Nakatani, T., Sawada, H.: A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans. Audio Speech Lang. Process. 21(9), 1913–1928 (2013)


  36. Tachioka, Y., Narita, T., Weninger, F., Watanabe, S.: Dual system combination approach for various reverberant environments with dereverberation techniques. In: Proceedings of REVERB’14 (2014)


  37. Van Trees, H.L.: Detection, Estimation, and Modulation Theory. Part IV, Optimum Array Processing. Wiley-Interscience, New York (2002)


  38. Virtanen, T.: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15(3), 1066–1074 (2007)


  39. Warsitz, E., Haeb-Umbach, R.: Blind acoustic beamforming based on generalized eigenvalue decomposition. IEEE Trans. Audio Speech Lang. Process. 15(5), 1529–1539 (2007)


  40. Weninger, F., Watanabe, S., Le Roux, J., Hershey, J.R., Tachioka, Y., Geiger, J., Schuller, B., Rigoll, G.: The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement. In: Proceedings of REVERB’14 (2014)


  41. Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J.R., Schuller, B.: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: Proceedings of Latent Variable Analysis and Signal Separation, pp. 91–99. Springer, Berlin (2015)


  42. Xu, Y., Du, J., Dai, L.R., Lee, C.H.: A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2015)


  43. Yoshioka, T., Nakatani, T.: Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening. IEEE Trans. Audio Speech Lang. Process. 20(10), 2707–2720 (2012)


  44. Yoshioka, T., Tachibana, H., Nakatani, T., Miyoshi, M.: Adaptive dereverberation of speech signals with speaker-position change detection. In: Proceedings of ICASSP’09, pp. 3733–3736 (2009)


  45. Yoshioka, T., Nakatani, T., Miyoshi, M.: Integrated speech enhancement method using noise suppression and dereverberation. IEEE Trans. Audio Speech Lang. Process. 17(2), 231–246 (2009)


  46. Yoshioka, T., Sehr, A., Delcroix, M., Kinoshita, K., Maas, R., Nakatani, T., Kellermann, W.: Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag. 29(6), 114–126 (2012)


  47. Yoshioka, T., Chen, X., Gales, M.J.F.: Impact of single-microphone dereverberation on DNN-based meeting transcription systems. In: Proceedings of ICASSP’14 (2014)


  48. Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Fujimoto, M., Yu, C., Fabian, W.J., Espi, M., Higuchi, T., Araki, S., Nakatani, T.: The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices. In: Proceedings of ASRU’15, pp. 436–443 (2015)



Author information


Correspondence to Marc Delcroix.



Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Delcroix, M. et al. (2017). Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science (R0)
