Difficulties Developing a Children’s Speech Recognition System for Language with Limited Training Data

Oralbekova, Dina; Mamyrbayev, Orken; Othman, Mohamed; Alimhan, Keylan; Khairova, Nina; Zhunussova, Aliya

doi:10.1007/978-3-031-41774-0_33

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1864)

Included in the following conference series:

International Conference on Computational Collective Intelligence


Abstract

Automatic speech recognition is a rapidly developing area of machine learning and a necessary tool for controlling various devices and automated systems. However, such recognition systems are aimed more at adults than at children. Because a child’s voice is still developing, applications built on adult speech data make more errors when recognizing children’s speech. In addition, many applications do not account for the peculiarities of children’s speech or for the language children use when communicating with other children and with adults. There is therefore high demand for systems that understand both adult and child speech and process them correctly. A further problem is the lack of speech data for agglutinative languages such as the Turkic languages, and for Kazakh in particular. Assembling a large, high-quality corpus remains an unsolved problem. This paper presents a study of children’s speech recognition for the Kazakh language based on modified adult speech data and the effect of that data on recognition quality. Two models were built: a Transformer model and an insertion-based model. The results are satisfactory, but further improvement and an expanded corpus of children’s speech are still required.



Acknowledgement

This research has been funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP19174298).

Author information

Authors and Affiliations

Institute of Information and Computational Technologies, Almaty, Kazakhstan
Dina Oralbekova & Orken Mamyrbayev
Almaty University of Power Engineering and Telecommunications, Almaty, Kazakhstan
Dina Oralbekova
Universiti Putra Malaysia, Kuala Lumpur, Malaysia
Mohamed Othman
L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan
Keylan Alimhan
National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine
Nina Khairova
Narxoz University, Almaty, Kazakhstan
Aliya Zhunussova


Corresponding author

Correspondence to Dina Oralbekova.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Eötvös Loránd University, Budapest, Hungary
János Botzheim
Eötvös Loránd University, Budapest, Hungary
László Gulyás
Universidad Complutense de Madrid, Madrid, Spain
Manuel Nunez
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Jan Treur
University of Münster, Münster, Germany
Gottfried Vossen
Wrocław University of Science and Technology, Wrocław, Poland
Adrianna Kozierkiewicz


About this paper

Cite this paper

Oralbekova, D., Mamyrbayev, O., Othman, M., Alimhan, K., Khairova, N., Zhunussova, A. (2023). Difficulties Developing a Children’s Speech Recognition System for Language with Limited Training Data. In: Nguyen, N.T., et al. Advances in Computational Collective Intelligence. ICCCI 2023. Communications in Computer and Information Science, vol 1864. Springer, Cham. https://doi.org/10.1007/978-3-031-41774-0_33


DOI: https://doi.org/10.1007/978-3-031-41774-0_33
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41773-3
Online ISBN: 978-3-031-41774-0
eBook Packages: Computer Science, Computer Science (R0)
