Abstract
End-to-end architectures are widely used in many areas of machine learning, including speech recognition. An end-to-end system is modeled as a single component, in contrast to the traditional pipeline, which consists of several independent components. It maps acoustic signals directly to a sequence of labels, without intermediate states and without post-processing of the output, which makes it easy to implement. Combining several types of end-to-end methods yields better results than applying them separately. Motivated by this observation, in this work we implement a method that uses CRF and CTC jointly to recognize a low-resource language, Kazakh. Recurrent neural network and ResNet architectures were used to build the models, together with language models. The experiments showed that the proposed approach based on the ResNet architecture with an RNN language model achieved the best character error rate (CER), 9.86%, among the network architectures compared for the Kazakh language.
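The headline result is reported as a character error rate (CER). As a minimal sketch, assuming the standard definition (character-level edit distance between the reference and the hypothesis, divided by the reference length), CER could be computed as follows. The function names `edit_distance` and `cer` and the example strings are illustrative assumptions, not taken from the paper, and the authors' exact scoring procedure (for example, text normalization before scoring) is not stated here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character of ref
                curr[j - 1] + 1,     # insert a character of hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate under the assumed standard definition."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted character out of five.
    print(f"CER = {cer('сәлем', 'салем'):.2%}")
```

Under this definition a CER of 9.86% corresponds to roughly one character error per ten reference characters. The value can exceed 100% when the hypothesis contains many insertions.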
Acknowledgement
This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Oralbekova, D., Mamyrbayev, O., Othman, M., Alimhan, K., Zhumazhanov, B., Nuranbayeva, B. (2022). Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13757. Springer, Cham. https://doi.org/10.1007/978-3-031-21743-2_41
DOI: https://doi.org/10.1007/978-3-031-21743-2_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21742-5
Online ISBN: 978-3-031-21743-2
eBook Packages: Computer Science, Computer Science (R0)