Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

Currently, the quality of Distant Speech Recognition (DSR) systems cannot match the quality of speech recognition on clean speech acquired by close-talking microphones. The main problems of DSR stem from the far-field nature of the data; one of them is the unpredictable occurrence of acoustic events and scenes, which distort the speech component of the signal. Applying acoustic event detection and classification (AEC) in conjunction with DSR can benefit speech enhancement and improve DSR accuracy. However, no publicly available corpus for joint AEC and DSR currently exists. This paper proposes a procedure for realistically mixing acoustic events and scenes into the far-field multi-channel recordings of the AMI meeting corpus, accounting for spatial reverberation and the distinct placement of sources of different kinds. We evaluate the derived corpus on both the DSR and AEC tasks and present reproducible results, which can be used as a baseline for the corpus. The code for the proposed mixing procedure is made available online.
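The abstract only summarizes the mixing procedure; the full details are in the paper and in the scripts linked in the Notes below. As a rough, hypothetical illustration of the general idea (an acoustic event placed at its own position in a simulated reverberant room, convolved with the room response, and then added to a multichannel far-field recording at a chosen SNR), the following Python sketch uses the pyroomacoustics package. All signals, room dimensions, source and array positions, and the SNR value are placeholder assumptions, not values taken from the paper.

    # Minimal sketch of reverberant event mixing (assumed setup, not the authors' exact pipeline).
    import numpy as np
    import pyroomacoustics as pra

    fs = 16000

    # Placeholder signals: in practice these would be AMI far-field array channels
    # and an isolated acoustic event recording (e.g. a door slam).
    rng = np.random.default_rng(0)
    ami_channels = 0.05 * rng.standard_normal((8, 10 * fs))   # 8-mic array, 10 s
    event = rng.standard_normal(1 * fs)                        # 1 s dry event

    # Simulate a meeting-room-like shoebox; place the event source away from
    # the array so its reverberation pattern differs from that of the talkers.
    room_dim = [6.5, 4.8, 2.6]                                  # metres (assumed)
    e_abs, max_order = pra.inverse_sabine(rt60=0.4, room_dim=room_dim)
    room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(e_abs),
                       max_order=max_order)
    room.add_source([1.0, 4.0, 1.9], signal=event)              # assumed event position

    # Circular 8-mic table array, similar in spirit to the AMI array geometry.
    mic_xy = pra.circular_2D_array(center=[3.25, 2.4], M=8, phi0=0, radius=0.1)
    mic_xyz = np.vstack([mic_xy, 0.75 * np.ones(8)])            # assumed table height
    room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))

    room.simulate()                                             # convolve event with simulated RIRs
    rev_event = room.mic_array.signals                          # shape: (8, n_samples)

    def mix_at_snr(background, event_sig, snr_db, offset):
        """Add event_sig to background so the background-to-event power ratio is snr_db."""
        seg = background[:, offset:offset + event_sig.shape[1]]
        p_bg, p_ev = np.mean(seg ** 2), np.mean(event_sig ** 2) + 1e-12
        gain = np.sqrt(p_bg / (p_ev * 10 ** (snr_db / 10.0)))
        out = background.copy()
        out[:, offset:offset + event_sig.shape[1]] += gain * event_sig[:, :seg.shape[1]]
        return out

    # Insert the reverberated event three seconds into the recording at 5 dB SNR (assumed).
    mixed = mix_at_snr(ami_channels, rev_event, snr_db=5.0, offset=3 * fs)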

Notes

  1. Corpus mixing scripts are available at https://github.com/sergeiastapov/nAMI.

Acknowledgment

This research was financially supported by the Foundation NTI (Contract 20/18gr, ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).

Author information

Corresponding author

Correspondence to Sergei Astapov.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Astapov, S. et al. (2019). Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_4

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science, Computer Science (R0)
