Raw Multichannel Processing Using Deep Neural Networks

Chapter in: New Era for Robust Speech Recognition

Abstract

Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this chapter, we perform multichannel enhancement jointly with acoustic modeling in a deep-neural-network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
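To make the first-layer idea concrete, the following is a minimal NumPy sketch (not the authors' implementation; the function name and tensor shapes are illustrative) of a bank of multichannel FIR filters: each output filter convolves every input channel with its own taps and sums across channels. This is the same filter-and-sum structure as a conventional beamformer, except that here the taps would be learned jointly with the rest of the acoustic model.

```python
import numpy as np

def multichannel_filter_layer(x, w):
    """Filter-and-sum first layer over raw waveforms.

    x: (C, T) array -- C-channel time-domain waveform of length T
    w: (P, C, N) array -- P output filters, each with N taps per channel
    returns: (P, T - N + 1) array of filtered outputs
    """
    C, T = x.shape
    P, _, N = w.shape
    out = np.zeros((P, T - N + 1))
    for p in range(P):
        for c in range(C):
            # Convolve channel c with filter p's taps for that channel,
            # then sum the per-channel results (filter-and-sum).
            out[p] += np.convolve(x[c], w[p, c], mode="valid")
    return out
```

In a trained network these loops would be a single multichannel convolution; the explicit form above just exposes the per-channel filtering and cross-channel summation that let the layer steer spatially.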


Notes

  1. We use a small additive offset to truncate the output range and avoid numerical instability with very small inputs: log(⋅ + 0.01).
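As a concrete illustration of the note above, the stabilized log compression can be sketched as follows (a minimal version; only the 0.01 offset comes from the note, the function name is illustrative):

```python
import numpy as np

def stabilized_log(energy, offset=0.01):
    """Log compression with a small additive offset: near-zero inputs
    map to log(offset) rather than diverging towards -inf."""
    return np.log(energy + offset)
```

The offset bounds the output below at log(0.01) ≈ -4.6, keeping gradients well behaved when filterbank energies are close to zero.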


Author information


Corresponding author

Correspondence to Tara N. Sainath.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Sainath, T.N. et al. (2017). Raw Multichannel Processing Using Deep Neural Networks. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_5

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science, Computer Science (R0)
