Abstract
Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this chapter, we perform multichannel enhancement jointly with acoustic modeling in a deep-neural-network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
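As a rough illustration of the core idea (not the chapter's exact architecture), a first network layer that performs multichannel filtering on raw waveforms can be sketched as a bank of learnable FIR filters, one set of taps per channel, whose per-channel outputs are summed, analogous to filter-and-sum beamforming. The function name and shapes below are ours:

```python
import numpy as np

def multichannel_filter_layer(x, h):
    """Filter-and-sum first layer over raw waveforms (illustrative sketch).

    x: (channels, samples) multichannel time-domain input
    h: (filters, channels, taps) learnable FIR filter bank
    Returns: (filters, samples - taps + 1) array; each row is the sum of
    per-channel filter outputs, i.e. a learned filter-and-sum beamformer.
    """
    P, C, N = h.shape
    out_len = x.shape[1] - N + 1
    y = np.zeros((P, out_len))
    for p in range(P):
        for c in range(C):
            # 'valid' correlation of channel c with filter p's taps for that channel
            y[p] += np.correlate(x[c], h[p, c], mode="valid")
    return y

# Two microphones, 16 samples; 4 spatial filters with 5 taps each
x = np.random.randn(2, 16)
h = np.random.randn(4, 2, 5)
y = multichannel_filter_layer(x, h)
print(y.shape)  # (4, 12)
```

In a trained network the taps `h` would be learned jointly with the acoustic model, letting the layer pick up inter-channel time differences rather than relying on a separate localization and beamforming front end.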
Notes
- 1. We use a small additive offset inside the logarithm to bound the output range and avoid numerical instability with very small inputs: log(⋅ + 0.01).
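A minimal sketch of the stabilized log compression described in the note (the offset value 0.01 is from the note; the function name is ours):

```python
import numpy as np

def stabilized_log(x, offset=0.01):
    """Log compression with a small additive offset.

    For nonnegative inputs, the offset bounds the output from below at
    log(offset), avoiding the log(0) singularity for very small inputs.
    """
    return np.log(x + offset)

print(stabilized_log(np.array([0.0, 1.0])))  # ≈ [log(0.01), log(1.01)] ≈ [-4.605, 0.00995]
```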
© 2017 Springer International Publishing AG
Cite this chapter
Sainath, T.N. et al. (2017). Raw Multichannel Processing Using Deep Neural Networks. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_5
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0