Abstract
Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this chapter, we perform multichannel enhancement jointly with acoustic modeling in a deep-neural-network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single-channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single-channel waveform model.
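As a rough illustration of the core idea (not the chapter's exact architecture), a first network layer that performs multichannel filtering on raw waveforms can be sketched as a bank of learnable FIR filters, one set of taps per channel, whose per-channel outputs are summed, analogous to filter-and-sum beamforming. The function name and shapes below are ours:

```python
import numpy as np

def multichannel_filter_layer(x, h):
    """Filter-and-sum first layer over raw waveforms (illustrative sketch).

    x: (channels, samples) multichannel time-domain input
    h: (filters, channels, taps) learnable FIR filter bank
    Returns: (filters, samples - taps + 1) array; each row is the sum of
    per-channel filter outputs, i.e. a learned filter-and-sum beamformer.
    """
    P, C, N = h.shape
    out_len = x.shape[1] - N + 1
    y = np.zeros((P, out_len))
    for p in range(P):
        for c in range(C):
            # 'valid' correlation of channel c with filter p's taps for that channel
            y[p] += np.correlate(x[c], h[p, c], mode="valid")
    return y

# Two microphones, 16 samples; 4 spatial filters with 5 taps each
x = np.random.randn(2, 16)
h = np.random.randn(4, 2, 5)
y = multichannel_filter_layer(x, h)
print(y.shape)  # (4, 12)
```

In a trained network the taps `h` would be learned jointly with the acoustic model, letting the layer pick up inter-channel time differences rather than relying on a separate localization and beamforming front end.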
Notes
- 1. We use a small additive offset inside the logarithm to bound the output range and avoid numerical instability with very small inputs: log(⋅ + 0.01).
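A minimal sketch of the stabilized log compression described in the note (the offset value 0.01 is from the note; the function name is ours):

```python
import numpy as np

def stabilized_log(x, offset=0.01):
    """Log compression with a small additive offset.

    For nonnegative inputs, the offset bounds the output from below at
    log(offset), avoiding the log(0) singularity for very small inputs.
    """
    return np.log(x + offset)

print(stabilized_log(np.array([0.0, 1.0])))  # ≈ [log(0.01), log(1.01)] ≈ [-4.605, 0.00995]
```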
© 2017 Springer International Publishing AG
Cite this chapter
Sainath, T.N. et al. (2017). Raw Multichannel Processing Using Deep Neural Networks. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_5
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0