End-to-End Russian Speech Recognition Models with Multi-head Attention

  • Conference paper
Speech and Computer (SPECOM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

The aim of this research was to improve the Russian end-to-end speech recognition system developed at SPC RAS by applying multi-head attention. The system joins a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. Models with the following attention types were created and studied: dot-product attention, additive attention, location-based attention, and multi-resolution location-based attention. Experiments with different numbers of attention vectors were performed. The models were trained on a small Russian speech corpus of 60 h using transfer learning with English as the non-target language. Multi-head attention reduced the word error rate for dot-product and additive attention compared to the results obtained with a single attention vector.
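
For illustration of the multi-head mechanism discussed above, the following is a minimal sketch of multi-head dot-product attention over encoder states, written in PyTorch. The class name, layer sizes, and head count are assumptions made for the example and do not reproduce the configuration of the models described in the paper.

```python
import torch
import torch.nn as nn


class MultiHeadDotProductAttention(nn.Module):
    """Minimal multi-head (scaled) dot-product attention over encoder states.

    Illustrative sketch only: dimensions and head count are hypothetical and
    do not correspond to the models evaluated in the paper.
    """

    def __init__(self, d_model: int = 320, n_heads: int = 4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)    # projects the decoder state (query)
        self.w_k = nn.Linear(d_model, d_model)    # projects encoder states (keys)
        self.w_v = nn.Linear(d_model, d_model)    # projects encoder states (values)
        self.w_out = nn.Linear(d_model, d_model)  # mixes the concatenated head outputs

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, d_model); enc_states: (batch, time, d_model)
        b, t, _ = enc_states.shape
        q = self.w_q(dec_state).view(b, self.n_heads, 1, self.d_head)
        k = self.w_k(enc_states).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(enc_states).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # One attention distribution over encoder frames per head.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5  # (b, heads, 1, t)
        weights = torch.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)                                  # (b, heads, 1, d_head)
        context = context.transpose(1, 2).reshape(b, 1, self.n_heads * self.d_head)
        return self.w_out(context).squeeze(1)                               # (b, d_model)


# Example with hypothetical sizes: batch of 2, 100 encoder frames, 320-dim features.
attn = MultiHeadDotProductAttention(d_model=320, n_heads=4)
ctx = attn(torch.zeros(2, 320), torch.zeros(2, 100, 320))  # -> shape (2, 320)
```

In a joint CTC-attention decoder, a context vector of this kind is combined with the decoder state at each output step; additive or location-based heads would change only the scoring step, while the per-head split and the concatenation of head outputs stay the same.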

Acknowledgements

This research was supported by the Russian Foundation for Basic Research (project No. 19-29-09081) and by state research No. 0073-2019-0005.

Author information

Corresponding author

Correspondence to Irina Kipyatkova.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kipyatkova, I. (2021). End-to-End Russian Speech Recognition Models with Multi-head Attention. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_30

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
