Investigation of Transfer Learning for End-to-End Russian Speech Recognition

  • Conference paper
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)


Abstract

End-to-end speech recognition systems reduce decoding time and memory requirements compared to standard systems. However, they need much more training data, which complicates building such systems for low-resourced languages. One way to improve the performance of an end-to-end low-resourced speech recognition system is to pre-train the model by transfer learning, that is, to train the model on non-target data and then transfer the trained parameters to the target model. The aim of the current research was to investigate the application of transfer learning to the training of an end-to-end Russian speech recognition system in low-resourced conditions. We used several speech corpora of different languages for pre-training. The end-to-end model was then fine-tuned on a small Russian speech corpus of 60 h. We conducted experiments on applying transfer learning in different parts of the model (feature extraction block, encoder, and attention mechanism) as well as on freezing the lower layers. We achieved a 24.53% relative word error rate reduction compared to the baseline system trained without transfer learning.
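The pre-train-then-fine-tune scheme with frozen lower layers described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's model: the encoder architecture, layer sizes, and the choice of which layers to freeze are assumptions made for the example.

```python
import torch
import torch.nn as nn

# A hypothetical 3-layer encoder; names and sizes are illustrative only.
def make_encoder() -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(80, 256), nn.ReLU(),   # modules 0-1: lowest layer
        nn.Linear(256, 256), nn.ReLU(),  # modules 2-3: middle layer
        nn.Linear(256, 256),             # module 4: top layer
    )

pretrained = make_encoder()  # stands in for the model trained on non-target data
target = make_encoder()      # model to be fine-tuned on the small Russian corpus

# Transfer step: copy all pre-trained parameters into the target model.
target.load_state_dict(pretrained.state_dict())

# Freeze the two lower Linear layers (modules 0 and 2) so that fine-tuning
# on the small target corpus only updates the top layer.
for name, p in target.named_parameters():
    if int(name.split(".")[0]) < 4:
        p.requires_grad = False

trainable = [n for n, p in target.named_parameters() if p.requires_grad]
print(trainable)  # only the top layer's parameters remain trainable
```

In a real system the same idea applies per block (feature extractor, encoder, attention): load the pre-trained parameters for the chosen blocks and mark the lower ones non-trainable before fine-tuning.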



Acknowledgements

This research was supported by the state research project No. FFZF-2022-0005.

Author information

Corresponding author

Correspondence to Irina Kipyatkova.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Kipyatkova, I. (2022). Investigation of Transfer Learning for End-to-End Russian Speech Recognition. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_30

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science (R0)
