Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Abstract

End-to-end architectures are commonly used in many areas of machine learning, including speech recognition. An end-to-end structure treats the system as a single element, in contrast to the traditional structure, which consists of several independent elements. An end-to-end system maps acoustic signals directly to a sequence of labels, with no intermediate states and no need for post-processing of the output, which makes it easy to implement. Combining several types of end-to-end methods yields better results than applying them separately. Motivated by this observation, in this work we implemented a method that uses CRF and CTC together to recognize a low-resource language, Kazakh. Recurrent neural network and ResNet architectures were used to build the models, together with language models. Experiments showed that the proposed approach, based on the ResNet architecture with an RNN language model, achieved the best character error rate (CER), 9.86%, among the network architectures compared for the Kazakh language.
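The CER metric reported above is conventionally computed as the Levenshtein (edit) distance between the recognized text and the reference transcript, divided by the number of reference characters. The following is a minimal sketch of that conventional definition, not the authors' evaluation code; the function names `edit_distance` and `cer` and the sample strings are assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a five-character word.
    print(f"CER = {cer('сәлем', 'салем'):.2%}")
```

Under this definition, a CER of 9.86% corresponds to roughly one character error per ten reference characters. Note that CER can exceed 100% when the hypothesis contains many insertions.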

Acknowledgement

This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).

Author information

Corresponding author

Correspondence to Dina Oralbekova.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Oralbekova, D., Mamyrbayev, O., Othman, M., Alimhan, K., Zhumazhanov, B., Nuranbayeva, B. (2022). Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_41

  • DOI: https://doi.org/10.1007/978-3-031-21743-2_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21742-5

  • Online ISBN: 978-3-031-21743-2

  • eBook Packages: Computer Science, Computer Science (R0)
