
Experimenting with Attention Mechanisms in Joint CTC-Attention Models for Russian Speech Recognition

  • Conference paper

Speech and Computer (SPECOM 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

The paper presents an investigation of attention mechanisms in an end-to-end Russian speech recognition system created by joining a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. We trained the models on a small dataset of Russian speech with a total duration of about 60 h, and pretrained the models using transfer learning with English as the non-target language. We experimented with the following types of attention mechanism: coverage-based attention and 2D location-aware attention, as well as their combination. At the decoding stage we used a beam search pruning method and the Gumbel-Softmax function instead of softmax. We achieved a 4% relative word error rate reduction using 2D location-aware attention.
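The Gumbel-Softmax function mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the logits and the temperature value are arbitrary choices for demonstration. Lower temperatures push the output distribution closer to a one-hot sample, which is the property that makes it useful as a stochastic replacement for softmax during decoding.

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Temperature-scaled softmax over logits perturbed by Gumbel(0, 1) noise."""
    if rng is None:
        rng = np.random.default_rng()
    # Sample Gumbel(0, 1) noise via the inverse-CDF trick.
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    y = y - y.max()          # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum()

# Example: three hypothetical decoder scores, sharpened by a low temperature.
probs = gumbel_softmax(np.array([2.0, 1.0, 0.5]), temperature=0.5,
                       rng=np.random.default_rng(0))
```

The output is a valid probability distribution on every call; only the location of the probability mass varies with the sampled noise.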


Notes

  1. http://www.voxforge.org/.

  2. https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/.


Acknowledgements

This research was supported by the Russian Foundation for Basic Research (project No. 18-07-01216).

Author information

Corresponding author

Correspondence to Irina Kipyatkova.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Kipyatkova, I., Markovnikov, N. (2020). Experimenting with Attention Mechanisms in Joint CTC-Attention Models for Russian Speech Recognition. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_22

  • DOI: https://doi.org/10.1007/978-3-030-60276-5_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
