Abstract
The aim of this research was to improve the Russian end-to-end speech recognition system developed at SPC RAS by applying multi-head attention. The system joins a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. Models with the following attention types were created and studied: dot-product attention, additive attention, location-based attention, and multi-resolution location-based attention. Experiments with different numbers of attention vectors were performed. The models were trained on a small Russian speech corpus of 60 h using transfer learning, with English as the non-target language. Multi-head attention reduced the word error rate for dot-product and additive attention compared to the results obtained with a single attention vector.
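Of the four attention types studied, the multi-head dot-product variant is the simplest to illustrate. The following is a minimal NumPy sketch of multi-head scaled dot-product attention, assuming the common scheme in which the model dimension is split evenly across heads; all function and variable names are illustrative and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_dot_product_attention(queries, keys, values, num_heads):
    """Split the model dimension into num_heads sub-spaces, run scaled
    dot-product attention in each, and concatenate the head outputs.

    queries: (T_q, d_model); keys, values: (T_k, d_model)."""
    d_model = queries.shape[-1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q, k, v = queries[:, sl], keys[:, sl], values[:, sl]
        scores = q @ k.T / np.sqrt(d_head)   # (T_q, T_k) similarity scores
        weights = softmax(scores, axis=-1)   # attention over encoder frames
        outputs.append(weights @ v)          # (T_q, d_head) context vectors
    return np.concatenate(outputs, axis=-1)  # (T_q, d_model)
```

With `num_heads=1` this reduces to ordinary single-vector dot-product attention; each extra head attends to the encoder frames through its own sub-space of the features.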
Acknowledgements
This research was supported by the Russian Foundation for Basic Research (project No. 19-29-09081) and by the state research No. 0073-2019-0005.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kipyatkova, I. (2021). End-to-End Russian Speech Recognition Models with Multi-head Attention. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science(), vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_30
DOI: https://doi.org/10.1007/978-3-030-87802-3_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3