Abstract
End-to-end speech recognition systems reduce decoding time and memory requirements compared to standard systems. However, they need much more training data, which complicates building such systems for low-resource languages. One way to improve the performance of an end-to-end low-resource speech recognition system is to pre-train the model by transfer learning: the model is first trained on non-target data, and the trained parameters are then transferred to the target model. The aim of the current research was to investigate the application of transfer learning to training an end-to-end Russian speech recognition system under low-resource conditions. We used several speech corpora of different languages for pre-training. The end-to-end model was then fine-tuned on a small Russian speech corpus of 60 h. We conducted experiments on applying transfer learning to different parts of the model (the feature extraction block, the encoder, and the attention mechanism) as well as on freezing the lower layers. We achieved a 24.53% relative word error rate reduction compared to the baseline system trained without transfer learning.
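The transfer-and-freeze procedure described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: `TinyEncoder` is a hypothetical stand-in for the actual feature-extraction-plus-encoder architecture, and `transfer_and_freeze` is an assumed helper name. The sketch shows the two steps the abstract names: copying pre-trained parameters into the target model, then freezing the lower layers so fine-tuning updates only the rest.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical simplified encoder: a convolutional feature block
    followed by a BLSTM, loosely mirroring an end-to-end ASR encoder."""
    def __init__(self, feat_dim=80, hidden=64, layers=3):
        super().__init__()
        self.feature_block = nn.Conv2d(1, 4, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(feat_dim * 4, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)

def transfer_and_freeze(pretrained: nn.Module, target: nn.Module,
                        freeze_prefixes=("feature_block",)):
    """Copy matching parameters from a model pre-trained on non-target
    data into the target model, then freeze the named lower blocks."""
    src = pretrained.state_dict()
    tgt = target.state_dict()
    # Transfer only tensors whose name and shape match in both models.
    transferred = {k: v for k, v in src.items()
                   if k in tgt and v.shape == tgt[k].shape}
    tgt.update(transferred)
    target.load_state_dict(tgt)
    # Freeze the lower layers so fine-tuning leaves them unchanged.
    for name, p in target.named_parameters():
        if name.startswith(freeze_prefixes):
            p.requires_grad = False
    return len(transferred)

pretrained = TinyEncoder()   # stands in for a model trained on non-target data
target = TinyEncoder()       # the low-resource target model to be fine-tuned
n_transferred = transfer_and_freeze(pretrained, target)
```

Which prefixes to freeze corresponds to the experimental variable studied in the paper: transferring and freezing the feature extraction block, the encoder, or the attention mechanism leads to different fine-tuning behavior on the small target corpus.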
Acknowledgements
This research was supported by the state research No. FFZF-2022-0005.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Kipyatkova, I. (2022). Investigation of Transfer Learning for End-to-End Russian Speech Recognition. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science(), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_30
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2