Abstract
Currently, the quality of Distant Speech Recognition (DSR) systems cannot match that of speech recognition on clean speech acquired by close-talking microphones. The main problems of DSR stem from the far-field nature of the data; one of them is the unpredictable occurrence of acoustic events and scenes, which distort the speech component of the signal. Applying acoustic event detection and classification (AEC) in conjunction with DSR can benefit speech enhancement and improve DSR accuracy. However, no publicly available corpus for joint AEC and DSR currently exists. This paper proposes a procedure for realistically mixing acoustic events and scenes with far-field multi-channel recordings of the AMI meeting corpus, accounting for spatial reverberation and the distinct placement of different kinds of sources. We evaluate the derived corpus on both DSR and AEC tasks and present reproducible results, which can serve as a baseline for the corpus. The code for the proposed mixing procedure is available online.
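To make the idea of mixing an event into multi-channel far-field data concrete, the following is a minimal sketch and not the procedure from the paper. It assumes that reverberation is modelled by convolving the event with one room impulse response per microphone, and that the event level is set by a target speech-to-event power ratio measured on the first channel. The function names, the `snr_db` and `offset` parameters, and the impulse responses are all hypothetical.

```python
import math


def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    """Mean power of a non-empty sample list."""
    return sum(x * x for x in samples) / len(samples)


def mix_event(channels, event, rirs, snr_db, offset=0):
    """Add a reverberated event to every channel of a recording.

    channels: list of per-microphone sample lists (assumed non-silent)
    rirs:     one hypothetical room impulse response per microphone
    snr_db:   assumed target speech-to-event ratio on the first channel
    offset:   sample index at which the event starts
    """
    # Reverberate the dry event separately for each microphone.
    wet = [convolve(event, h) for h in rirs]
    # A single gain is shared by all channels, so the level differences
    # between microphones introduced by the impulse responses are kept.
    ref_speech = channels[0][offset:offset + len(wet[0])]
    gain = math.sqrt(power(ref_speech) / (power(wet[0]) * 10 ** (snr_db / 10)))
    mixed = []
    for ch, w in zip(channels, wet):
        out = list(ch)
        for n, x in enumerate(w):
            if offset + n < len(out):  # drop any tail past the recording end
                out[offset + n] += gain * x
        mixed.append(out)
    return mixed
```

Sharing one gain across channels is a design choice of this sketch: scaling each channel independently would erase the spatial cues that a microphone array relies on.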
Notes
1. Corpus mixing scripts are available at https://github.com/sergeiastapov/nAMI.
Acknowledgment
This research was financially supported by the Foundation NTI (Contract 20/18gr, ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Astapov, S. et al. (2019). Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_4
DOI: https://doi.org/10.1007/978-3-030-26061-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science, Computer Science (R0)