Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition

  • Chapter
  • In: New Era for Robust Speech Recognition

Abstract

Speech-processing systems such as automatic speech recognition (ASR) usually chain together many processing steps. Because the pipeline is long, individual steps are typically designed to optimize cost functions that are not directly related to the final task, which leads to suboptimal performance. In this chapter, we introduce a beamforming (BF) network that performs spatial filtering optimized for the ASR task. The BF network takes in array signals and predicts the beamforming parameters in the frequency domain, assuming that the array geometry does not change. The network consists of both deterministic processing steps and trainable steps realized by neural networks, and it is trained to minimize the cross-entropy cost function of ASR. In our experiments, the BF network is trained with both artificially generated and real microphone array signals. On the AMI meeting transcription task, we found that the trained BF network produces ASR results competitive with those of traditional delay-and-sum beamforming on unseen array signals.
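To make the frequency-domain filtering step concrete, the following is a minimal sketch of filter-and-sum beamforming in NumPy. It is not the chapter's implementation: the function names, array shapes, and the assumption that per-channel arrival delays are known are all illustrative. The weights here are fixed delay-and-sum weights, the baseline named in the abstract; in a BF network of the kind described, a trained model would presumably predict the weight array instead.

```python
import numpy as np


def delay_and_sum_weights(delays, n_fft, fs):
    """Delay-and-sum weights for known arrival delays (hypothetical helper).

    delays: per-channel arrival delays in seconds, shape (M,)
    Returns complex weights of shape (M, F), with F = n_fft // 2 + 1.
    """
    delays = np.asarray(delays, dtype=float)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)  # bin centre frequencies, (F,)
    # Steering vector d_m(f) = exp(-j 2 pi f tau_m); weights are d / M.
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return steering / len(delays)


def filter_and_sum(spectra, weights):
    """Apply y(t, f) = sum_m conj(w_m(f)) * x_m(t, f).

    spectra: complex STFT frames per channel, shape (M, T, F)
    weights: complex beamforming weights, shape (M, F)
    Returns the enhanced spectrum, shape (T, F).
    """
    # A learned beamformer would supply `weights` from a network output;
    # here they are whatever the caller passes in.
    return np.einsum("mf,mtf->tf", np.conj(weights), spectra)
```

With ideal (circular) delays and no noise, the delay-and-sum weights realign the channels so that the output equals the undelayed source spectrum. Real recordings add noise and reverberation, which is where a weight predictor trained on the recognition objective could differ from this fixed baseline.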

Corresponding author

Correspondence to Xiong Xiao.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Xiao, X. et al. (2017). Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_4

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

  • eBook Packages: Computer Science, Computer Science (R0)
