Abstract
End-to-end speech recognition systems reduce decoding time and memory requirements compared to standard systems. However, they need much more training data, which complicates building such systems for low-resource languages. One way to improve the performance of an end-to-end low-resource speech recognition system is to pre-train the model by transfer learning: the model is first trained on non-target data, and the trained parameters are then transferred to the target model. The aim of the current research was to investigate the application of transfer learning to training an end-to-end Russian speech recognition system under low-resource conditions. We used several speech corpora of different languages for pre-training. The end-to-end model was then fine-tuned on a small Russian speech corpus of 60 h. We conducted experiments on applying transfer learning to different parts of the model (the feature extraction block, the encoder, and the attention mechanism) as well as on freezing the lower layers. We achieved a 24.53% relative word error rate reduction compared to the baseline system trained without transfer learning.
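The transfer-and-freeze procedure described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: `TinyEncoder` is a hypothetical stand-in for the actual feature-extraction-plus-encoder architecture, and `transfer_and_freeze` is an assumed helper name. The sketch shows the two steps the abstract names: copying pre-trained parameters into the target model, then freezing the lower layers so fine-tuning updates only the rest.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical simplified encoder: a convolutional feature block
    followed by a BLSTM, loosely mirroring an end-to-end ASR encoder."""
    def __init__(self, feat_dim=80, hidden=64, layers=3):
        super().__init__()
        self.feature_block = nn.Conv2d(1, 4, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(feat_dim * 4, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)

def transfer_and_freeze(pretrained: nn.Module, target: nn.Module,
                        freeze_prefixes=("feature_block",)):
    """Copy matching parameters from a model pre-trained on non-target
    data into the target model, then freeze the named lower blocks."""
    src = pretrained.state_dict()
    tgt = target.state_dict()
    # Transfer only tensors whose name and shape match in both models.
    transferred = {k: v for k, v in src.items()
                   if k in tgt and v.shape == tgt[k].shape}
    tgt.update(transferred)
    target.load_state_dict(tgt)
    # Freeze the lower layers so fine-tuning leaves them unchanged.
    for name, p in target.named_parameters():
        if name.startswith(freeze_prefixes):
            p.requires_grad = False
    return len(transferred)

pretrained = TinyEncoder()   # stands in for a model trained on non-target data
target = TinyEncoder()       # the low-resource target model to be fine-tuned
n_transferred = transfer_and_freeze(pretrained, target)
```

Which prefixes to freeze corresponds to the experimental variable studied in the paper: transferring and freezing the feature extraction block, the encoder, or the attention mechanism leads to different fine-tuning behavior on the small target corpus.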
Acknowledgements
This research was supported by the state research No. FFZF-2022-0005.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Kipyatkova, I. (2022). Investigation of Transfer Learning for End-to-End Russian Speech Recognition. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science(), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_30
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2