Study of Speech Recognition System Based on Transformer and Connectionist Temporal Classification Models for Low Resource Language

Bansal, Shweta; Sharan, Shambhu; Agrawal, Shyam S.

doi:10.1007/978-3-031-20980-2_6

Shweta Bansal, Shambhu Sharan & Shyam S. Agrawal

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Sequence-to-sequence methods have been used extensively in end-to-end (E2E) speech processing for recognition, translation, and synthesis. In speech recognition, the Transformer model, which supports parallel computation and has built-in attention, is now frequently used. Its main advantages are fast, efficient learning and the absence of sequential operations, unlike Deep Neural Networks (DNN). This study concentrates on the Transformer, an emerging sequence model that excels in natural language processing (NLP) and neural machine translation (NMT). To build a framework for the automatic recognition of spoken Hindi utterances, an end-to-end, Transformer-based model was considered. Hindi is one of several agglutinative languages, and little data is available for training speech recognition systems. According to several studies, the Transformer approach improves system performance for low-resource languages. Our analysis found that the Hindi speech recognition system performed better when the Transformer was combined with Connectionist Temporal Classification (CTC). Further, when a language model was included, the Word Error Rate (WER) on a clean dataset reached its lowest value, 3.2%.
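For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. A WER of 3.2% therefore corresponds to roughly three word errors per 100 reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the sample sentences are illustrative assumptions and are not taken from the paper, whose scoring tool is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): words are obtained by
    whitespace splitting, with no text normalisation applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of five reference words.
reference = "मैं घर जा रहा हूँ"
hypothesis = "मैं घर जा रहा है"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Because insertions are counted against a fixed reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.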

Author information

Authors and Affiliations

K R Mangalam University, Gurugram, India
Shweta Bansal
Indira Gandhi Delhi Technical University for Women, Delhi, India
Shambhu Sharan
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal

Corresponding author

Correspondence to Shweta Bansal.

Editor information

Editors and Affiliations

Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal

Cite this paper

Bansal, S., Sharan, S., Agrawal, S.S. (2022). Study of Speech Recognition System Based on Transformer and Connectionist Temporal Classification Models for Low Resource Language. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_6

DOI: https://doi.org/10.1007/978-3-031-20980-2_6
Published: 10 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)
