Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages

Multimedia Tools and Applications

Abstract

Speech is a natural and significant mode of human communication, which falls into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In human-to-machine communication, by contrast, a machine recognizes human speech and acts accordingly; this is often termed Automatic Speech Recognition (ASR). Recognizing non-Indian languages is challenging because of pitch variations and other factors such as accent and pronunciation. This paper proposes a hybrid model, based on DenseNet201 and EfficientNetB0, for speech recognition. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish and Persian. Once collected, the speech samples are pre-processed by removing noise. Features are then extracted from each speech sample using the Spectrogram, Short-Time Fourier Transform (STFT), Spectral Rolloff-Bandwidth, Mel-frequency Cepstral Coefficients (MFCC), and Chroma features. The proposed approach is then compared with other Deep Learning (DL) models: ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard measures such as Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves are used to evaluate each model on speech samples from all the languages listed above. The experimental results show that the hybrid model outperforms all the other models, achieving the highest recognition accuracy of 99.84% with a loss of 0.004%.
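
The evaluation measures named in the abstract can all be derived from a confusion matrix. The following is a minimal sketch of that derivation, not the authors' evaluation code. It assumes rows are true languages and columns are predicted languages; the function names and the toy three-language matrix are hypothetical, and the counts are made up for illustration rather than taken from the paper.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]]) -> List[Dict[str, float]]:
    """Precision, recall and F1 for each class of a square confusion matrix.

    Assumed convention: confusion[t][p] counts samples whose true class is t
    and whose predicted class is p.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted as another class
        fp = sum(confusion[r][k] for r in range(n)) - tp  # another class, predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total if total else 0.0


# Toy matrix for three languages (illustrative counts only).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 0, 10],
]

if __name__ == "__main__":
    for index, row in enumerate(per_class_metrics(toy)):
        print(index, row)
    print("accuracy:", accuracy(toy))
```

Per-class values are reported separately here because a single overall accuracy can hide weak recognition of an individual language when the classes are unevenly sized.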

Data Availability

The datasets used to train the model are described in Sect. 3.1 and listed in the References section (references 29 to 39).

References

  1. Bhable S, Kayte C (2020) Review: Multilingual Acoustic modeling of Automatic Speech Recognition (ASR) for low resource languages. In IEEE International Conference on Advent Trends in Multidisciplinary Research and Innovation (ICATMRI). https://doi.org/10.1109/ICATMRI51801.2020.9398431

  2. Malik M, Malik K, Mehmood K, Makhdoom I (2021) Automatic speech recognition: a survey. In Multimedia Tools and Applications, 9411–9457. https://doi.org/10.1007/s11042-020-10073-7

  3. Chu X (2021) Speech Recognition Method Based on Deep Learning and Its Application. In IEEE International Conference of Social Computing and Digital Economy (ICSCDE). https://doi.org/10.1109/ICSCDE54196.2021.00075

  4. Kalhor E, Bakhtiari B (2021) Speaker independent feature selection for speech emotion recognition: A multi-task approach. In Multimedia Tools and Applications 80:8127–8146. https://doi.org/10.1007/s11042-020-10119-w

  5. Guntur R, Ramakrishnan K, Mittal V (2021) Automatic Classification of Foreign Language Accent. In IEEE 2nd Global Conference for Advancement in Technology (GCAT). https://doi.org/10.1109/GCAT52182.2021.9587650

  6. Dokuz Y, Tufekci Z (2022) Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. In Multimedia Tools Appl 81:9969–9988. https://doi.org/10.1007/s11042-022-12304-5

  7. Delic V, Peric Z, Secujski M, Jakovljevic N, Nikolic J, Miskovic D, Simic N, Suzic S, Delic T (2019) Speech Technology Progress Based on New Machine Learning Paradigm. Hindawi: Comput Intell Neurosci 2019:1–19. https://doi.org/10.1155/2019/4368036

  8. Abushariah A, Ting H, Mustafa M, Khairuddin A, Abushariah M, Tan T (2022) Bilingual Automatic Speech Recognition: A Review, Taxonomy and Open Challenges. In IEEE Access, 5944–5954. https://doi.org/10.1109/ACCESS.2022.3218684

  9. Thukroo I, Bashir R, Giri K (2022) A review into deep learning techniques for spoken language identification. Multimedia Tools Appl 81:32593–32624. https://doi.org/10.1007/s11042-022-13054-0

  10. Xue Y, Gao S, Sun H, Qin W (2017) A Chinese Sign Language Recognition System Using Leap Motion. In International Conference on Virtual Reality and Visualization, 180–185. https://doi.org/10.1109/ICVRV.2017.00044

  11. Xu X, Li Y, Xu X, Wen Z, Che H, Liu S, Tao J (2014) Survey on discriminative feature selection for speech emotion recognition. In International Symposium on Chinese Spoken Language Processing, 345–349. https://doi.org/10.1109/ISCSLP.2014.6936641

  12. Gong C, Li X, Wu X (2014) Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition. In International Symposium on Chinese Spoken Language Processing, 459–463. https://doi.org/10.1109/ISCSLP.2014.6936636

  13. Shao P (2020) Chinese Speech Recognition System based on Deep Learning. In Journal of Physics: Conference Series, 1–6. https://doi.org/10.1088/1742-6596/1549/2/022012

  14. Ropke W, Radulescu R, Efthymiadis K, Nowe A (2019) Training a Speech-to-Text Model for Dutch on the Corpus Gesproken Nederlands. In Proceedings of the Reference AI & ML Conference for Belgium, Netherlands & Luxemburg, 2491

  15. Singh G, Sharma S, Kumar V, Kaur M, Baz M, Masud M (2021) Spoken Language Identification Using Deep Learning. Hindawi Comput Intell Neurosci 2021:1–12. https://doi.org/10.1155/2021/5123671

  16. Smit P, Virpioja S, Kurimo M (2020) Advances in subword-based HMM-DNN speech recognition across languages. Comput Speech Lang 66:101–158. https://doi.org/10.1016/j.csl.2020.101158

  17. Berjon P, Nag A, Dev S (2021) Analysis of French Phonetic Idiosyncrasies for Accent Recognition. Soft Comput Lett. https://doi.org/10.1016/j.socl.2021.100018

  18. Yang H, Oehlke C, Meinel C (2011) German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings. In Proc. of 10th IEEE/ACIS International Conference on Computer and Information Science. https://doi.org/10.1109/ICIS.2011.38

  19. Xu J, Matta K, Islam S, Nurnberger A (2020) German Speech Recognition System using Deep Speech. In International Conference on Natural Language Processing and Information Retrieval, 102–106. https://doi.org/10.1145/3443279.3443313

  20. Milde B, Kohn M (2018) Open-Source Automatic Speech Recognition for German. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Computation and Language. https://doi.org/10.48550/arXiv.1807.10311

  21. Pantazoglou F, Kladis G, Papadakis N (2019) A Greek voice recognition interface for ROV applications, using machine learning technologies and the CMU Sphinx platform. Wseas Transact Syst Control 13:550–560

  22. Szarvas M, Fegyo T, Mihajlik P, Tatai P (2000) Automatic Recognition of Hungarian: Theory and Practice. Int J Speech Technol 3:237–251. https://doi.org/10.1023/A:1026515132762

  23. Chen J, Nishimura R, Kitaoka N (2020) End-to-end recognition of streaming Japanese speech using CTC and local attention. In SIP 9(25):1–7

  24. Mu D, Zhu T, Xu G, Li H, Guo D, Liu Y (2019) Attention-Based Speech Model for Japanese Recognization. In IEEE International Conference on Smart Internet of Things (SmartIoT), 402–406. https://doi.org/10.1109/SmartIoT.2019.00071

  25. Abdallah A, Hamada M, Nurseitov D (2020) Attention-Based Fully Gated CNN-BGRU for Russian Handwritten Text. J Imaging, 6(141), 1–23. https://doi.org/10.48550/arXiv.2008.05373

  26. Gazeau V, Varol C (2018) Automatic Spoken Language Recognition with Neural Networks. Int J Inf Technol Comput Sci 8:11–17. https://doi.org/10.5815/ijitcs.2018.08.02

  27. Veisi H, Mani A (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905. https://doi.org/10.1007/s10772-020-09768-x

  28. Savargiv M, Bastanfard A (2015) Persian Speech Emotion Recognition. In IKT2015 7th International Conference on Information and Knowledge Technology, 1–5. https://doi.org/10.1109/IKT.2015.7288756

  29. Park K. Dutch: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/dutch-single-speaker-speech-dataset. Accessed 3 Feb 2022

  30. Park K. French: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/french-single-speaker-speech-dataset. Accessed 3 Feb 2022

  31. Park K. German: Single speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/german-single-speaker-speech-dataset. Accessed 3 Feb 2022

  32. Park K. Greek: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/greek-single-speaker-speech-dataset. Accessed 3 Feb 2022

  33. Park K. Hungarian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/hungarian-single-speaker-speech-dataset. Accessed 3 Feb 2022

  34. Park K. Japanese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/japanese-single-speaker-speech-dataset. Accessed 3 Feb 2022

  35. Park K. Russian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset. Accessed 3 Feb 2022

  36. Park K. Spanish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/spanish-single-speaker-speech-dataset. Accessed 3 Feb 2022

  37. Park K. Finnish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/finnish-single-speaker-speech-dataset. Accessed 3 Feb 2022

  38. Park K. Chinese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/chinese-single-speaker-speech-dataset. Accessed 3 Feb 2022

  39. Persian dataset, Persian Speech. Available [Online]: https://github.com/persiandataset/PersianSpeech. Accessed 3 Feb 2022

  40. Antoniadis P, Tsardoulias E, Symeonidis A (2022) A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimedia Tools Appl 81:40635–40652. https://doi.org/10.1007/s11042-022-12953-6

  41. Jain N, Gupta V, Shubham, Madan A, Chaudhary A, Santosh K (2021) Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. https://doi.org/10.1007/s00521-021-06003-9

  42. Kaur G, Sharma A (2023) A deep learning-based model using hybrid feature extraction approach for consumer sentiment analysis. J Big Data. https://doi.org/10.1186/s40537-022-00680-6

  43. Kaur A, Singh A, Sachdeva R, Kukreja V (2023) Automatic speech recognition systems: A survey of discriminative techniques. Multimed Tools Appl 82:13307–13339. https://doi.org/10.1007/s11042-022-13645-x

  44. Al-karawi K, Mohammed D (2021) Improving short utterance speaker verification by combining MFCC and Entrocy in Noisy conditions. Multimed Tools Appl 80:22231–22249. https://doi.org/10.1007/s11042-021-10767-6

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Rakesh Kumar.

Ethics declarations

Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, A., Kumar, R. & Kumar, Y. Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages. Multimed Tools Appl 83, 30145–30166 (2024). https://doi.org/10.1007/s11042-023-16748-1

  • DOI: https://doi.org/10.1007/s11042-023-16748-1
