HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language

Sharma, Usha; Om, Hari; Mishra, A. N.

doi:10.1007/s11042-022-14019-z

Published: 24 October 2022

Volume 82, pages 16173–16193 (2023)

Multimedia Tools and Applications

Usha Sharma¹,
Hari Om¹ &
A. N. Mishra²

Abstract

Automatic Speech Recognition (ASR) has become one of the major research areas over the past decade and has attracted considerable interest. Implementing ASR systems, adapting them to different languages, and making their performance robust remain major challenges. Hindi is one of the most widely spoken languages in the world, but it is a complex and resource-constrained language. Speech recognition and classification systems therefore need to be developed for Hindi to spread the technology and to open up more means of communication. However, because Hindi is linguistically more complex than many other languages and lacks standard databases, developing such systems is quite challenging. Deep learning is used extensively across research fields and has proven its value broadly. In this paper, HindiSpeech-Net, a seven-layer one-dimensional convolutional neural network (1D-CNN), is proposed to recognise Hindi speech samples and assign each to its respective category. A dataset of 2400 Hindi speech samples in ten classes was collected under real-world conditions and then processed with signal filtering and augmentation to enlarge it, make the model robust, and avoid overfitting. The dataset was divided into training, validation, and test sets, and the model was evaluated using several performance metrics. The trained HindiSpeech-Net model achieved an accuracy of 92.92% on the test set. The proposed framework has a low computational cost, works in real time, and is suitable for implementation in embedded systems.
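The abstract does not specify HindiSpeech-Net's layer configuration (kernel sizes, number of filters, strides, or pooling). As a minimal, hypothetical sketch of the operation a 1D convolutional layer applies to a raw waveform, the code below implements a single-channel valid (unpadded) 1D convolution, a ReLU activation, and non-overlapping max pooling in plain Python. Like most deep learning frameworks, `conv1d` computes cross-correlation (the kernel is not flipped). The function names, the toy waveform, and the two-tap kernel are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float],
           bias: float = 0.0, stride: int = 1) -> List[float]:
    """Valid (unpadded) 1D cross-correlation of one channel with one kernel."""
    k = len(kernel)
    if k == 0 or k > len(signal):
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(w * x for w, x in zip(kernel, signal[start:start + k])) + bias
        for start in range(0, len(signal) - k + 1, stride)
    ]


def relu(xs: Sequence[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs: Sequence[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


def conv_block(signal: Sequence[float], kernel: Sequence[float],
               bias: float = 0.0, pool: int = 2) -> List[float]:
    """One hypothetical conv -> ReLU -> max-pool block."""
    return max_pool1d(relu(conv1d(signal, kernel, bias)), pool)


if __name__ == "__main__":
    waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
    rising_edge = [-1.0, 1.0]  # output is positive where the signal increases
    print(conv_block(waveform, rising_edge))  # [0.5, 0.0, 0.0]
```

In a trained network each layer holds many learned kernels applied across several input channels, and a deeper model stacks several such blocks, typically followed by fully connected layers; how HindiSpeech-Net arranges its seven layers is not stated in the abstract.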

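The abstract mentions signal filtering, augmentation, and a training/validation/test split but does not say which filters, augmentations, or split ratios were used. The sketch below assumes common choices purely for illustration: a pre-emphasis filter (alpha = 0.97), additive white Gaussian noise at a 20 dB SNR, a random circular time shift of up to 1600 samples (0.1 s at an assumed 16 kHz sampling rate), and a 70/15/15 split stratified by class. All function names and parameter values are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def pre_emphasis(x: Sequence[float], alpha: float = 0.97) -> List[float]:
    """First-order high-pass filter: y[0] = x[0], y[n] = x[n] - alpha * x[n-1]."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def add_noise(x: Sequence[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise so that signal power / noise power = 10**(snr_db / 10)."""
    if not x:
        return []
    power = sum(s * s for s in x) / len(x)
    if power == 0.0:
        return list(x)  # silence has no meaningful SNR; return it unchanged
    noise_std = math.sqrt(power / 10 ** (snr_db / 10))
    return [s + rng.gauss(0.0, noise_std) for s in x]


def time_shift(x: Sequence[float], max_shift: int, rng: random.Random) -> List[float]:
    """Circularly shift by a random offset in [-max_shift, max_shift] samples."""
    if not x:
        return []
    n = len(x)
    k = rng.randint(-max_shift, max_shift) % n  # negative offsets become left shifts
    return list(x[n - k:]) + list(x[:n - k])


def augment(x: Sequence[float], rng: random.Random,
            snr_db: float = 20.0, max_shift: int = 1600) -> List[List[float]]:
    """Filtered original plus one noisy and one time-shifted copy of it."""
    base = pre_emphasis(x)
    return [base, add_noise(base, snr_db, rng), time_shift(base, max_shift, rng)]


def stratified_split(labels: Sequence[Hashable], train_frac: float = 0.7,
                     val_frac: float = 0.15, seed: int = 0
                     ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices within each class and cut them into train/val/test lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(train_frac * len(idxs))
        n_val = round(val_frac * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    labels = [c for c in range(10) for _ in range(240)]  # assumed: 10 balanced classes
    train, val, test = stratified_split(labels)
    print(len(train), len(val), len(test))  # 1680 360 360
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]  # 1 s synthetic tone
    print(len(augment(tone, random.Random(42))))  # 3
```

Splitting by recording before augmenting, and augmenting only the training portion, keeps noisy or shifted copies of one recording from appearing in both the training and test sets, which would otherwise inflate test accuracy.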
Data availability

The data used in the proposed work are available from the corresponding author upon reasonable request.

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Indian Institute of Technology (ISM) Dhanbad, Dhanbad, 826004, India
Usha Sharma & Hari Om
Krishna Engineering College, Ghaziabad, 201001, India
A. N. Mishra

Corresponding author

Correspondence to Usha Sharma.

Ethics declarations

Competing interest

The authors declared no potential conflicts of interest concerning the research, authorship, and/or publication of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, U., Om, H. & Mishra, A.N. HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language. Multimed Tools Appl 82, 16173–16193 (2023). https://doi.org/10.1007/s11042-022-14019-z

Received: 16 August 2021
Revised: 26 April 2022
Accepted: 23 September 2022
Published: 24 October 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s11042-022-14019-z
