Abstract
This paper describes in detail the acoustic modeling part of the keyword search system developed at the Speech Technology Center (STC) for the OpenKWS 2016 evaluation. The key idea was to exploit the diversity of both sound representations and acoustic model architectures in the system. For the former, we extended the speaker-dependent bottleneck (SDBN) approach to the multilingual case, which is the main contribution of this paper. Two types of multilingual SDBN features were applied in addition to conventional spectral and cepstral features. The acoustic model architectures employed in the final system are based on deep feedforward and recurrent neural networks. We also applied speaker adaptation of acoustic models using multilingual i-vectors, speed-perturbation-based data augmentation, and semi-supervised training. The final STC system comprised 9 acoustic models, which allowed it to achieve strong performance and to place among the top three systems in the evaluation.
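The speed-perturbation augmentation mentioned above is commonly realized by resampling each training waveform at a few speed factors (typically 0.9, 1.0, and 1.1), tripling the training data. The sketch below is purely illustrative and is not taken from the paper: it uses simple linear interpolation via NumPy, whereas real systems use a proper resampler; the function name `speed_perturb` and the toy signal are hypothetical.

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so that it plays `factor` times faster.

    factor > 1.0 shortens the signal (faster speech, higher pitch);
    factor < 1.0 lengthens it. Linear interpolation keeps this sketch
    dependency-free; production pipelines use a band-limited resampler.
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal that each output sample maps to.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(waveform)), waveform)

# Typical 3-way augmentation: factors 0.9, 1.0, 1.1 on a toy 1-second signal.
signal = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))
augmented = {f: speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)}
```

Each perturbed copy is then treated as an independent utterance during acoustic model training.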
Alexey Prudnikov, Mail.ru Group, St. Petersburg, Russia
Natalia Tomashenko, LIUM, University of Le Mans, France
Acknowledgements
This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0121 (ID RFMEFI57915X0121).
This effort uses the IARPA Babel Program language collection releases IARPA-babel{101b-v0.4c, 102b-v0.5a, 103b-v0.4b, 201b-v0.2b, 203b-v3.1a, 205b-v1.0a, 206b-v0.1e, 207b-v1.0e, 301b-v2.0b, 302b-v1.0a, 303b-v1.0a, 304b-v1.0b, 305b-v1.0c, 306b-v2.0c, 307b-v1.0b, 401b-v2.0b, 402b-v1.0b, 403b-v1.0b, 404b-v1.0a}, the set of training transcriptions, and the BBN part of the clean web data for the Georgian language.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Medennikov, I. et al. (2017). Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_7
DOI: https://doi.org/10.1007/978-3-319-66429-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3