
Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation

  • Conference paper
Speech and Computer (SPECOM 2017)

Abstract

This paper describes in detail the acoustic modeling part of the keyword search system developed at the Speech Technology Center (STC) for the OpenKWS 2016 evaluation. The key idea was to exploit the diversity of both sound representations and acoustic model architectures in the system. For the former, we extended the speaker-dependent bottleneck (SDBN) approach to the multilingual case, which is the main contribution of the paper. Two types of multilingual SDBN features were applied in addition to conventional spectral and cepstral features. The acoustic model architectures employed in the final system are based on deep feedforward and recurrent neural networks. We also applied speaker adaptation of acoustic models using multilingual i-vectors, speed-perturbation-based data augmentation, and semi-supervised training. The final STC system comprised nine acoustic models, which allowed it to achieve strong performance and place among the top three systems in the evaluation.
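The speed-perturbation augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' actual pipeline (which, like Kaldi-based systems, would typically resample with sox); the `speed_perturb` helper and the 0.9/1.0/1.1 factors are assumptions based on the common three-way recipe:

```python
import numpy as np

def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform so it plays `factor` times faster.

    factor > 1.0 shortens the signal (faster speech),
    factor < 1.0 lengthens it (slower speech).
    Linear interpolation keeps the sketch dependency-free; a real
    pipeline would use a proper resampler.
    """
    n_out = int(round(len(samples) / factor))
    # Positions in the original signal from which to interpolate samples.
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(samples)), samples)

# Three-way augmentation: each utterance yields 0.9x, 1.0x, and 1.1x copies,
# tripling the amount of training data.
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
augmented = {f: speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)}
```

Because the perturbed copies have different durations and slightly shifted spectral content, the acoustic model sees a wider variety of speaking rates at training time.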

Alexey Prudnikov – Mail.ru Group, St. Petersburg, Russia.

Natalia Tomashenko – LIUM, University of Le Mans, France.



Acknowledgements

This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0121 (ID RFMEFI57915X0121).

This effort uses the IARPA Babel Program language collection releases IARPA-babel{101b-v0.4c, 102b-v0.5a, 103b-v0.4b, 201b-v0.2b, 203b-v3.1a, 205b-v1.0a, 206b-v0.1e, 207b-v1.0e, 301b-v2.0b, 302b-v1.0a, 303b-v1.0a, 304b-v1.0b, 305b-v1.0c, 306b-v2.0c, 307b-v1.0b, 401b-v2.0b, 402b-v1.0b, 403b-v1.0b, 404b-v1.0a}, the set of training transcriptions, and the BBN part of clean web data for the Georgian language.

Author information

Correspondence to Ivan Medennikov.

Copyright information

Ā© 2017 Springer International Publishing AG

About this paper

Cite this paper

Medennikov, I. et al. (2017). Acoustic Modeling in the STC Keyword Search System for OpenKWS 2016 Evaluation. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3
