Abstract
The aim of this research was to improve the Russian end-to-end speech recognition system developed at SPC RAS by applying multi-head attention. The system joins a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. Models with the following attention types were created and studied: dot-product attention, additive attention, location-based attention, and multi-resolution location-based attention. Experiments with different numbers of attention vectors were performed. The models were trained on a small Russian speech corpus of 60 h using transfer learning, with English as the non-target language. Multi-head attention reduced the word error rate for dot-product and additive attention compared to the results obtained with a single attention vector.
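Of the four attention types studied, the multi-head dot-product variant is the simplest to illustrate. The following is a minimal NumPy sketch of multi-head scaled dot-product attention, assuming the common scheme in which the model dimension is split evenly across heads; all function and variable names are illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_dot_product_attention(queries, keys, values, num_heads):
    """Split the model dimension into num_heads sub-spaces, run scaled
    dot-product attention in each, and concatenate the head outputs.

    queries: (T_q, d_model); keys, values: (T_k, d_model)."""
    d_model = queries.shape[-1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = queries[:, sl], keys[:, sl], values[:, sl]
        scores = q @ k.T / np.sqrt(d_head)   # (T_q, T_k) similarity scores
        weights = softmax(scores, axis=-1)   # attention over encoder frames
        outputs.append(weights @ v)          # (T_q, d_head) context vectors
    return np.concatenate(outputs, axis=-1)  # (T_q, d_model)
```

With `num_heads=1` this reduces to ordinary single-vector dot-product attention; each extra head attends to the encoder frames through its own sub-space of the features.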
Acknowledgements
This research was supported by the Russian Foundation for Basic Research (project No. 19-29-09081) and by the state research No. 0073-2019-0005.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kipyatkova, I. (2021). End-to-End Russian Speech Recognition Models with Multi-head Attention. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_30
DOI: https://doi.org/10.1007/978-3-030-87802-3_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3