Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

Currently, the quality of Distant Speech Recognition (DSR) systems cannot match the quality of speech recognition on clean speech acquired by close-talking microphones. The main problems of DSR stem from the far-field nature of the data; one of them is the unpredictable occurrence of acoustic events and scenes, which distort the speech component of the signal. Applying acoustic event detection and classification (AEC) in conjunction with DSR can benefit speech enhancement and improve DSR accuracy. However, no publicly available corpus for joint AEC and DSR currently exists. This paper proposes a procedure for realistically mixing acoustic events and scenes into the far-field multi-channel recordings of the AMI meeting corpus, accounting for spatial reverberation and the distinct placement of sources of different kinds. We evaluate the derived corpus on both the DSR and AEC tasks and present reproducible results, which can be used as a baseline for the corpus. The code for the proposed mixing procedure is made available online.
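The abstract only summarizes the mixing procedure; the full details are in the paper and in the scripts linked in the Notes below. As a rough, hypothetical illustration of the general idea (an acoustic event placed at its own position in a simulated reverberant room, convolved with the room response, and then added to a multichannel far-field recording at a chosen SNR), the following Python sketch uses the pyroomacoustics package. All signals, room dimensions, source and array positions, and the SNR value are placeholder assumptions, not values taken from the paper.

    # Minimal sketch of reverberant event mixing (assumed setup, not the authors' exact pipeline).
    import numpy as np
    import pyroomacoustics as pra

    fs = 16000

    # Placeholder signals: in practice these would be AMI far-field array channels
    # and an isolated acoustic event recording (e.g. a door slam).
    rng = np.random.default_rng(0)
    ami_channels = 0.05 * rng.standard_normal((8, 10 * fs))   # 8-mic array, 10 s
    event = rng.standard_normal(1 * fs)                        # 1 s dry event

    # Simulate a meeting-room-like shoebox; place the event source away from
    # the array so its reverberation pattern differs from that of the talkers.
    room_dim = [6.5, 4.8, 2.6]                                  # metres (assumed)
    e_abs, max_order = pra.inverse_sabine(rt60=0.4, room_dim=room_dim)
    room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(e_abs),
                       max_order=max_order)
    room.add_source([1.0, 4.0, 1.9], signal=event)              # assumed event position

    # Circular 8-mic table array, similar in spirit to the AMI array geometry.
    mic_xy = pra.circular_2D_array(center=[3.25, 2.4], M=8, phi0=0, radius=0.1)
    mic_xyz = np.vstack([mic_xy, 0.75 * np.ones(8)])            # assumed table height
    room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))

    room.simulate()                                             # convolve event with simulated RIRs
    rev_event = room.mic_array.signals                          # shape: (8, n_samples)

    def mix_at_snr(background, event_sig, snr_db, offset):
        """Add event_sig to background so the background-to-event power ratio is snr_db."""
        seg = background[:, offset:offset + event_sig.shape[1]]
        p_bg, p_ev = np.mean(seg ** 2), np.mean(event_sig ** 2) + 1e-12
        gain = np.sqrt(p_bg / (p_ev * 10 ** (snr_db / 10.0)))
        out = background.copy()
        out[:, offset:offset + event_sig.shape[1]] += gain * event_sig[:, :seg.shape[1]]
        return out

    # Insert the reverberated event three seconds into the recording at 5 dB SNR (assumed).
    mixed = mix_at_snr(ami_channels, rev_event, snr_db=5.0, offset=3 * fs)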

Notes

  1. Corpus mixing scripts are available at https://github.com/sergeiastapov/nAMI.

Acknowledgment

This research was financially supported by the Foundation NTI (Contract 20/18gr, ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).

Author information

Corresponding author

Correspondence to Sergei Astapov.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Astapov, S. et al. (2019). Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_4

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science, Computer Science (R0)
