Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System

Oralbekova, Dina; Mamyrbayev, Orken; Othman, Mohamed; Alimhan, Keylan; Zhumazhanov, Bagashar; Nuranbayeva, Bulbul

doi:10.1007/978-3-031-21743-2_41

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13757)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

End-to-end architectures are widely used in many areas of machine learning, including speech recognition. An end-to-end architecture treats the system as a single component, in contrast to the traditional approach, which is built from several independent components. An end-to-end system maps acoustic signals directly to a sequence of labels, with no intermediate states and no need for post-processing of the output, which makes it easy to implement. Combining several types of end-to-end methods gives better results than applying them separately. Motivated by this observation, in this work we implemented a method that uses CRF and CTC together to recognize a low-resource language such as Kazakh. Recurrent neural network and ResNet architectures were used to build the model, together with language models. Experimental results showed that the proposed approach based on the ResNet architecture with an RNN language model achieved the best result for the Kazakh language among the network architectures compared, with a character error rate (CER) of 9.86%.
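The abstract reports its headline result as a CER but does not say how the metric was computed. A minimal sketch, assuming the conventional definition (character-level Levenshtein edit distance between the reference and the recognized text, divided by the reference length), is given below. The function names `edit_distance` and `cer` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, assumed here to be edit distance / len(ref)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong character in a five-character reference.
print(cer("сәлем", "салем"))
```

Under this definition, a CER of 9.86% corresponds to roughly one character-level error for every ten reference characters. Whether the authors normalize text (case, punctuation, whitespace) before scoring is not stated in the abstract.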

Acknowledgement

This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).

Author information

Authors and Affiliations

Satbayev University, Almaty, Kazakhstan
Dina Oralbekova
Institute of Information and Computational Technologies, Almaty, Kazakhstan
Dina Oralbekova, Orken Mamyrbayev & Bagashar Zhumazhanov
Universiti Putra Malaysia, Kuala Lumpur, Malaysia
Mohamed Othman
L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan
Keylan Alimhan
Caspian University, Almaty, Kazakhstan
Bulbul Nuranbayeva

Corresponding author

Correspondence to Dina Oralbekova.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Vietnam National University, Ho Chi Minh City, Ho Chi Minh City, Vietnam
Tien Khoa Tran
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Ualsher Tukayev
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
University of Newcastle, Newcastle, NSW, Australia
Edward Szczerbicki

About this paper

Cite this paper

Oralbekova, D., Mamyrbayev, O., Othman, M., Alimhan, K., Zhumazhanov, B., Nuranbayeva, B. (2022). Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_41

DOI: https://doi.org/10.1007/978-3-031-21743-2_41
Published: 09 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21742-5
Online ISBN: 978-3-031-21743-2
eBook Packages: Computer Science, Computer Science (R0)
