Abstract
Sequence-to-sequence methods are widely used in end-to-end (E2E) speech processing for recognition, translation, and synthesis. In speech recognition, the Transformer model, which supports parallel computation and has built-in attention, is now frequently used. Its main advantages are efficient training and the absence of sequential operations, unlike Deep Neural Networks (DNN). This study focuses on the Transformer, an emerging sequence model that excels in natural language processing (NLP) and neural machine translation (NMT). To create a framework for the automatic recognition of spoken Hindi utterances, an end-to-end, Transformer-based model was considered to understand the phenomenon classification. Hindi is one of several agglutinative languages, and not much data is available for speech recognition algorithms. According to several studies, the Transformer approach improves system performance for low-resource languages. Our analysis found that the Hindi speech recognition system performed better when the Transformer was combined with Connectionist Temporal Classification (CTC). Further, when a language model was included, the Word Error Rate (WER) on a clean dataset reached its lowest value, 3.2%.
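For readers unfamiliar with the metric quoted above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch under that standard definition. It is not the authors' evaluation code: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (punctuation stripping, case folding) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (hypothetical, not taken from the paper's dataset).
print(word_error_rate("a b c d", "a x c"))
```

In this illustrative call, one word is substituted and one is deleted out of four reference words. Under the same definition, a WER of 3.2% corresponds to roughly three word errors per hundred reference words.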
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bansal, S., Sharan, S., Agrawal, S.S. (2022). Study of Speech Recognition System Based on Transformer and Connectionist Temporal Classification Models for Low Resource Language. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_6
DOI: https://doi.org/10.1007/978-3-031-20980-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)