Abstract
Automatic Speech Recognition (ASR) has become one of the major research areas over the past decade and gained a lot of interest. Their system implementation, adaptation to different languages and robustness in the performance are still some of the major challenges. Hindi is one of the most widely spoken languages in the world but it is a complex and resource-constraint language. Thus, speech recognition and classification systems need to be developed for Hindi language to spread the technology and to explore more communication means. But due to its language complexity than other languages and lack of standard databases, it is quite challenging to develop such systems. Deep learning is extensively used in different research fields and has proven its prominence to a broader extent. In this paper, a seven-layer 1D-convolutional neural network HindiSpeech-Net has been proposed to recognise different speech samples of the Hindi language in the respective category. A large dataset of 2400 speech samples in the Hindi language is collected in ten different classes in real-world conditions which is further accompanied by signal filtering and augmentation to enhance the dataset for making a robust model and avoid overfitting. The collected dataset is divided into training, validation and test set which were evaluated in different performance parameters. The trained HindiSpeech-Net model achieved an accuracy of 92.92% on the test set. The proposed framework is computationally less expensive, works in real-time and is suitable for implementation in embedded systems.
Similar content being viewed by others
Data availability
The data used in the proposed work are available from the corresponding author upon reasonable request.
References
Adiwijaya, Aulia MN, Mubarok MS, Novia U, Nhita F (2017) A comparative study of MFCC-KNN and LPC-KNN for hijaiyyah letters pronounciation classification system. 2017 5th International Conference on Information and Communication Technology, ICoIC7 2017. https://doi.org/10.1109/ICoICT.2017.8074689
Alweshah M, Khalaileh S, Al, Gupta BB et al (2020) The monarch butterfly optimization algorithm for solving feature selection problems. Neural Comput Appl. https://doi.org/10.1007/s00521-020-05210-0
AlZu’bi S, Shehab M, Al-Ayyoub M et al (2020) Parallel implementation for 3D medical volume fuzzy segmentation. Pattern Recognit Lett. https://doi.org/10.1016/j.patrec.2018.07.026
Benzeghiba M, De Mori R, Deroo O, Dupont S, Erbes T, Jouvet D, Fissore L, Laface P, Mertins A, Ris C, Rose R, Tyagi V, Wellekens C (2007) Automatic speech recognition and speech variability: a review. Speech Commun. https://doi.org/10.1016/j.specom.2007.02.006
Bhatt S, Dev A, Jain A (2018) Hindi speech vowel recognition using hidden Markov model. The 6th intl. workshop on spoken language technologies for under-resourced languages, pp 196–199. https://doi.org/10.21437/SLTU.2018-41
Bhatt S, Jain A, Dev A (2020) Syllable based Hindi speech recognition. J Inform Optim Sci 41(6):1333–1351. https://doi.org/10.1080/02522667.2020.1809091
Dey A, Zhang W, Fung P (2014) Acoustic modeling for hindi speech recognition in low-resource settings. 2014 international conference on audio, language and image processing, pp 891–894. https://doi.org/10.1109/ICALIP.2014.7009923
Dong X, Yin B, Cong Y, Du Z, Huang X (2020) Environment Sound event classification with a two-stream convolutional neural network. IEEE Access 8:125714–125721. https://doi.org/10.1109/ACCESS.2020.3007906
Dua M, Aggarwal RK, Biswas M (2018) Performance evaluation of Hindi speech recognition system using optimized filterbanks. Eng Sci Technol Int J 21(3):389–398. https://doi.org/10.1016/j.jestch.2018.04.005
Dua M, Aggarwal RK, Biswas M (2019) Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Comput Appl. https://doi.org/10.1007/s00521-018-3499-9
Farooq O, Datta S, Shrotriya MC (2010) Wavelet sub-band based temporal features for robust hindi phoneme recognition. Int J Wavelets Multiresolut Inf Process. https://doi.org/10.1142/S0219691310003845
Ganapathiraju A, Hamaker J, Picone J (2004) Applications of support vector machines to speech recognition. IEEE Trans Signal Process 52(8):2348–2355. https://doi.org/10.1109/TSP.2004.831018
Gaudani H, Patel NM (2022) Comparative study of robust feature extraction techniques for ASR for Limited Resource Hindi Language, pp 763–775
Han W, Zhang Z, Zhang Y, Yu J, Chiu C-C, Qin J, Gulati A, Pang R, Wu Y (2020) ContextNet: improving convolutional neural networks for automatic speech recognition with global context. Interspeech 2020, pp 3610–3614. https://doi.org/10.21437/Interspeech.2020-2059
Ishizuka K, Nakatani T (2006) A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition. Speech Commun. https://doi.org/10.1016/j.specom.2006.06.008
Kong Q, Yu C, Xu Y, Iqbal T, Wang W, Plumbley MD (2019) Weakly labelled audioset tagging with attention neural networks. IEEE/ACM Trans Audio Speech Lang Process. https://doi.org/10.1109/TASLP.2019.2930913
Kumar A, Aggarwal RK (2020) Discriminatively trained continuous Hindi speech recognition using integrated acoustic features and recurrent neural network language modeling. J Intell Syst 30(1):165–179. https://doi.org/10.1515/jisys-2018-0417
Kumar A, Aggarwal RK (2020) Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. Int J Speech Technol. https://doi.org/10.1007/s10772-020-09757-0
Kumar A, Mittal V (2021) Hindi speech recognition in noisy environment using hybrid technique. Int J Inform Technol. https://doi.org/10.1007/s41870-020-00586-7
Kumar P, Jayanna HS (2022) Development of speaker-independent automatic speech recognition system for Kannada language. Indian J Sci Technol 15:333–342. https://doi.org/10.17485/IJST/v15i8.2322
Kumar A, Solanki SS, Chandra M (2022) Effect of background Indian music on performance of speech recognition models for Hindi databases. Int J Speech Technol. https://doi.org/10.1007/s10772-021-09948-3
Lee J, Park J, Kim K, Nam J (2018) SampleCNN: end-to-end deep convolutional neural networks using very small filters for music classification. Appl Sci 8(1):150. https://doi.org/10.3390/app8010150
Li F, Liu M, Zhao Y, Kong L, Dong L, Liu X, Hui M (2019) Feature extraction and classification of heart sound using 1D convolutional neural networks. EURASIP J Adv Signal Process 2019(1):59. https://doi.org/10.1186/s13634-019-0651-3
Liu Z, Wang Y, Chen T (1998) Audio feature extraction and analysis for scene segmentation and classification. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology. https://doi.org/10.1023/A:1008066223044
Mustafa MK, Allen T, Appiah K (2019) A comparative review of dynamic neural networks and hidden Markov model methods for mobile on-device speech recognition. Neural Comput Appl. https://doi.org/10.1007/s00521-017-3028-2
Mustaqeem, Kwon S (2020) A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sens (Switzerland). https://doi.org/10.3390/s20010183
Muzammel M, Salam H, Hoffmann Y, Chetouani M, Othmani A (2020) AudVowelConsNet: A phoneme-level based deep CNN architecture for clinical depression diagnosis. Mach Learn Appl. https://doi.org/10.1016/j.mlwa.2020.100005
Nanni L, Costa YMG, Aguiar RL, Mangolin RB, Brahnam S, Silla CN (2020) Ensemble of convolutional neural networks to improve animal audio classification. Eurasip J Audio Speech Music Process. https://doi.org/10.1186/s13636-020-00175-3
Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T (2015) Audio-visual speech recognition using deep learning. Appl Intell 42(4):722–737. https://doi.org/10.1007/s10489-014-0629-7
Oh D, Park J-S, Kim J-H, Jang G-J (2021) Hierarchical Phoneme Classification for Improved Speech Recognition. Appl Sci 11(1):428. https://doi.org/10.3390/app11010428
Oneaţă D, Cucu H (2019) Kite: automatic speech recognition for unmanned aerial vehicles. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. https://doi.org/10.21437/Interspeech.2019-1390
Purwins H, Li B, Virtanen T, Schluter J, Chang S-Y, Sainath T (2019) Deep learning for audio signal processing. IEEE J Selec Topics Signal Process 13(2):206–219. https://doi.org/10.1109/JSTSP.2019.2908700
Samudravijaya K, Murthy HA (2012) Indian language speech sound label set (ILSL12), 2012 developed by Indian Language TTS Consortium & ASR Consortium retrieved from https://www.iitm.ac.in/donlab/tts/downloads/cls/cls_v2.1.6.pdf. Accessed 21 Feb 2021
Sertolli B, Ren Z, Schuller BW, Cummins N (2021) Representation transfer learning from deep end-to-end speech recognition networks for the classification of health states from speech. Comput Speech Lang 101204. https://doi.org/10.1016/j.csl.2021.101204
Sharma A, Shrotriya MC, Farooq O, Abbasi ZA (2008) Hybrid wavelet based LPC features for Hindi speech recognition. Int J Inf Commun Technol 1(3/4):373. https://doi.org/10.1504/IJICT.2008.024008
Sharmila, Mishra AN, Awasthy N, Verma V, Malhotra S (2020) Hindi speech audio visual feature recognition. Int J Adv Sci Technol
Wang H, Li Z, Li Y et al (2020) Visual saliency guided complex image retrieval. Pattern Recognit Lett. https://doi.org/10.1016/j.patrec.2018.08.010
Yu C, Li J, Li X et al (2018) Four-image encryption scheme based on quaternion Fresnel transform, chaos and computer generated hologram. Multimed Tools Appl. https://doi.org/10.1007/s11042-017-4637-6
Zahid S, Hussain F, Rashid M, Yousaf MH, Habib HA (2015) Optimized audio classification and segmentation algorithm by using ensemble methods. Math Probl Eng. https://doi.org/10.1155/2015/209814
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interest
The authors declared no potential conflicts of interest concerning the research, authorship, and/or publication of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sharma, U., Om, H. & Mishra, A.N. HindiSpeech-Net: a deep learning based robust automatic speech recognition system for Hindi language. Multimed Tools Appl 82, 16173–16193 (2023). https://doi.org/10.1007/s11042-022-14019-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-14019-z