Abstract
This paper describes in detail the acoustic modeling part of the keyword search system developed at the Speech Technology Center (STC) for the OpenKWS 2016 evaluation. The key idea was to exploit the diversity of both sound representations and acoustic model architectures in the system. For the former, we extended the speaker-dependent bottleneck (SDBN) approach to the multilingual case, which is the main contribution of this paper. Two types of multilingual SDBN features were applied in addition to conventional spectral and cepstral features. The acoustic model architectures employed in the final system are based on deep feedforward and recurrent neural networks. We also applied speaker adaptation of acoustic models using multilingual i-vectors, speed-perturbation-based data augmentation, and semi-supervised training. The final STC system comprised 9 acoustic models, which allowed it to achieve strong performance and to place among the top three systems in the evaluation.
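The speed-perturbation augmentation mentioned above is commonly realized by resampling each training waveform at a few speed factors (typically 0.9, 1.0, and 1.1), tripling the training data. The sketch below is purely illustrative and is not taken from the paper: it uses simple linear interpolation via NumPy, whereas real systems use a proper resampler; the function name `speed_perturb` and the toy signal are hypothetical.

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so that it plays `factor` times faster.

    factor > 1.0 shortens the signal (faster speech, higher pitch);
    factor < 1.0 lengthens it. Linear interpolation keeps this sketch
    dependency-free; production pipelines use a band-limited resampler.
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal that each output sample maps to.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(waveform)), waveform)

# Typical 3-way augmentation: factors 0.9, 1.0, 1.1 on a toy 1-second signal.
signal = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))
augmented = {f: speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)}
```

Each perturbed copy is then treated as an independent utterance during acoustic model training.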
Alexey Prudnikov, Mail.ru Group, St. Petersburg, Russia
Natalia Tomashenko, LIUM, University of Le Mans, France
Acknowledgements
This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0121 (ID RFMEFI57915X0121).
This effort uses the IARPA Babel Program language collection releases IARPA-babel{101b-v0.4c, 102b-v0.5a, 103b-v0.4b, 201b-v0.2b, 203b-v3.1a, 205b-v1.0a, 206b-v0.1e, 207b-v1.0e, 301b-v2.0b, 302b-v1.0a, 303b-v1.0a, 304b-v1.0b, 305b-v1.0c, 306b-v2.0c, 307b-v1.0b, 401b-v2.0b, 402b-v1.0b, 403b-v1.0b, 404b-v1.0a}, the set of training transcriptions, and the BBN part of the clean web data for the Georgian language.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Medennikov, I. et al. (2017). Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_7
DOI: https://doi.org/10.1007/978-3-319-66429-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3