Abstract
Speech-processing systems such as automatic speech recognition (ASR) usually consist of a large number of steps to accomplish their tasks. Because of this long pipeline, the individual processing steps are typically designed to optimize cost functions that are only indirectly related to the final task, which leads to suboptimal performance. In this chapter, we introduce a beamforming (BF) network that performs spatial filtering optimized for the ASR task. The BF network takes in array signals and predicts the optimal beamforming parameters in the frequency domain, assuming that the array geometry does not change. The network consists of both deterministic processing steps and trainable steps realized by neural networks, and it is trained to minimize the cross-entropy cost function of ASR. In our experiments, the BF network is trained with both artificially generated and real microphone array signals. On the AMI meeting transcription task, we find that the trained BF network produces ASR results competitive with traditional delay-and-sum beamforming on unseen array signals.
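To make the signal flow concrete, the following is a minimal NumPy sketch of frequency-domain filter-and-sum beamforming with per-channel, per-frequency complex weights, alongside a delay-and-sum baseline. The array shapes, function names, and random test data are illustrative assumptions; in the chapter's setting the weights would be predicted by the trained BF network rather than derived from fixed delays.

```python
import numpy as np

def filter_and_sum(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Frequency-domain filter-and-sum beamforming.

    X: multichannel STFT, shape (n_channels, n_frames, n_freq_bins)
    w: complex weights, shape (n_channels, n_freq_bins)
    Returns Y(t, f) = sum_c conj(w[c, f]) * X[c, t, f].
    """
    return np.einsum("cf,ctf->tf", np.conj(w), X)

def delay_and_sum_weights(taus: np.ndarray, n_fft: int, fs: float) -> np.ndarray:
    """Delay-and-sum baseline: pure phase shifts from per-channel delays taus (seconds)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                # (n_freq_bins,)
    return np.exp(-2j * np.pi * np.outer(taus, freqs)) / len(taus)

# Random data standing in for a 4-channel array recording (hypothetical).
rng = np.random.default_rng(0)
n_fft, fs = 512, 16000
n_bins = n_fft // 2 + 1
X = rng.standard_normal((4, 100, n_bins)) + 1j * rng.standard_normal((4, 100, n_bins))

# Delay-and-sum weights for assumed per-channel delays; a BF network would
# instead output one complex weight per channel and frequency bin.
w_das = delay_and_sum_weights(np.array([0.0, 1e-4, 2e-4, 3e-4]), n_fft, fs)
Y = filter_and_sum(X, w_das)                                  # (n_frames, n_freq_bins)
```

In the trained system, the single-channel output Y would then be converted to acoustic features and fed to the ASR acoustic model, so that gradients of the cross-entropy cost can flow back through the beamforming step to the weight-predicting network.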
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Xiao, X. et al. (2017). Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_4
Download citation
DOI: https://doi.org/10.1007/978-3-319-64680-0_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)