HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language

Multimedia Tools and Applications

Abstract

Automatic Speech Recognition (ASR) has become a major research area over the past decade and has attracted considerable interest. System implementation, adaptation to different languages and robust performance remain major challenges. Hindi is one of the most widely spoken languages in the world, but it is a complex and resource-constrained language. Speech recognition and classification systems therefore need to be developed for Hindi to spread the technology and to open up more means of communication. However, because Hindi is more complex than other languages and lacks standard databases, developing such systems is quite challenging. Deep learning is used extensively across research fields and has proven its effectiveness. In this paper, a seven-layer 1D convolutional neural network, HindiSpeech-Net, is proposed to classify Hindi speech samples into their respective categories. A dataset of 2400 Hindi speech samples in ten classes is collected under real-world conditions; signal filtering and augmentation are then applied to enhance the dataset, make the model robust and avoid overfitting. The dataset is divided into training, validation and test sets, and the model is evaluated using several performance metrics. The trained HindiSpeech-Net model achieved an accuracy of 92.92% on the test set. The proposed framework is computationally inexpensive, works in real time and is suitable for implementation in embedded systems.
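The data preparation outlined in the abstract (augmentation, then a split into training, validation and test sets, then an accuracy measurement) can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the abstract does not say which augmentations, split ratios or class balance were used, so the Gaussian-noise and time-shift augmentations, the 80/10/10 split, the assumption of 240 samples per class, and all function names here are hypothetical.

```python
import random


def add_noise(signal, noise_std, rng):
    """Return a copy of the waveform with zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples (positive = delay)."""
    if not signal:
        return []
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


def stratified_split(labels, val_frac, test_frac, rng):
    """Split sample indices into train/val/test, keeping class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test += indices[:n_test]
        val += indices[n_test:n_test + n_val]
        train += indices[n_test + n_val:]
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Assumed layout: 10 classes with 240 samples each (2400 in total).
rng = random.Random(0)
labels = [c for c in range(10) for _ in range(240)]
train_idx, val_idx, test_idx = stratified_split(labels, 0.1, 0.1, rng)
print(len(train_idx), len(val_idx), len(test_idx))  # 1920 240 240

# Augment a toy waveform; real inputs would be recorded speech samples.
toy_wave = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
augmented = add_noise(time_shift(toy_wave, 2), noise_std=0.01, rng=rng)
```

Splitting per class keeps every category equally represented in the held-out sets, so a test accuracy is not skewed by an accidental class imbalance. Augmentation would normally be applied to the training indices only, so that the validation and test sets still reflect unmodified recordings.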

Data availability

The data used in the proposed work are available from the corresponding author upon reasonable request.

Author information

Corresponding author

Correspondence to Usha Sharma.

Ethics declarations

Competing interest

The authors declared no potential conflicts of interest concerning the research, authorship, and/or publication of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, U., Om, H. & Mishra, A.N. HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language. Multimed Tools Appl 82, 16173–16193 (2023). https://doi.org/10.1007/s11042-022-14019-z

  • DOI: https://doi.org/10.1007/s11042-022-14019-z
