Abstract
In this paper, we present a very large vocabulary continuous Russian speech recognition system based on neural networks. We employed neural networks at both the acoustic and language modeling stages. For training hybrid acoustic models, we experimented with several types of neural networks: a feedforward deep neural network, a time-delay neural network, Long Short-Term Memory, and bidirectional Long Short-Term Memory, varying the number of hidden layers and the number of units per hidden layer. Language modeling was performed with a recurrent neural network. First, Russian speech recognition experiments were carried out using the hybrid acoustic models and a 3-gram language model. The 500-best list was then rescored with the recurrent neural network language model. The lowest word error rate, 15.13%, was achieved using the time-delay neural network for acoustic modeling and the recurrent neural network language model interpolated with the 3-gram model for 500-best list rescoring.
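The rescoring step described above can be sketched in a few lines. This is a minimal illustration of N-best rescoring with a linearly interpolated language model, not the authors' implementation: the function names, the data layout (per-word log-probabilities from each model), and the default interpolation weight and LM scale are all hypothetical.

```python
import math

def interpolate_logprob(rnn_word_logps, ngram_word_logps, lam=0.5):
    """Linearly interpolate two LMs per word, in the probability domain:
    p(w) = lam * p_rnn(w) + (1 - lam) * p_ngram(w).
    Inputs and output are natural-log probabilities."""
    total = 0.0
    for lp_r, lp_n in zip(rnn_word_logps, ngram_word_logps):
        total += math.log(lam * math.exp(lp_r) + (1.0 - lam) * math.exp(lp_n))
    return total

def rescore_nbest(nbest, lam=0.5, lm_scale=10.0):
    """Pick the best hypothesis from an N-best list.
    nbest: list of tuples (text, acoustic_logp,
    rnn_word_logps, ngram_word_logps)."""
    best = max(
        nbest,
        key=lambda h: h[1] + lm_scale * interpolate_logprob(h[2], h[3], lam),
    )
    return best[0]
```

In practice the interpolation weight and LM scale would be tuned on a development set; in a real system the N-best list itself would come from a first-pass decoder such as Kaldi.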
Acknowledgements
This research is supported by the Russian Foundation for Basic Research (projects No. 18-07-01216 and 18-07-01407), by the Council for Grants of the President of the Russian Federation (project No. MK-1000.2017.8), and by state research project No. 0073-2018-0002.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Kipyatkova, I. (2018). Improving Russian LVCSR Using Deep Neural Networks for Acoustic and Language Modeling. In: Karpov, A., Jokisch, O., Potapova, R. (eds) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science(), vol 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_31
Print ISBN: 978-3-319-99578-6
Online ISBN: 978-3-319-99579-3