Abstract
Speech is a natural phenomenon and a principal mode of human communication, divided into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In contrast, human-to-machine communication is a technique in which machines recognize human speech and act accordingly, commonly termed Automatic Speech Recognition (ASR). Recognizing non-Indian languages is challenging due to pitch variation and other factors such as accent and pronunciation. This paper proposes a novel hybrid model based on DenseNet201 and EfficientNetB0 for speech recognition. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish, and Persian. After collection, these speech samples are pre-processed to remove noise. Then, Spectrogram, Short-Term Fourier Transform (STFT), Spectral Rolloff-Bandwidth, Mel-Frequency Cepstral Coefficient (MFCC), and Chroma features are extracted from the speech samples. Further, the proposed approach is compared with other Deep Learning (DL) models, namely ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard metrics, namely Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves, are used to evaluate each model's performance on speech samples from all the languages mentioned above. The experimental results show that the hybrid model outperforms all the other models, achieving the highest recognition accuracy of 99.84% with a loss of 0.004.
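As an illustration of the feature-extraction step named in the abstract, the following is a minimal NumPy sketch of an STFT magnitude spectrogram and spectral rolloff. It is not the paper's implementation: a synthetic 440 Hz tone stands in for a real speech sample, and the frame, hop, and rolloff parameters are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(y, n_fft=1024, hop=256):
    """Short-Term Fourier Transform magnitude: frame the signal,
    apply a Hann window, take the FFT of each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Rows are frequency bins, columns are time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

def spectral_rolloff(mag, sr, roll=0.85):
    """Per-frame frequency below which `roll` of the spectral
    magnitude is concentrated."""
    freqs = np.fft.rfftfreq((mag.shape[0] - 1) * 2, d=1.0 / sr)
    cum = np.cumsum(mag, axis=0)
    idx = np.argmax(cum >= roll * cum[-1], axis=0)
    return freqs[idx]

# One second of a pure 440 Hz tone at 16 kHz stands in for speech
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

mag = stft_magnitude(y)            # spectrogram, shape (513, 59)
rolloff = spectral_rolloff(mag, sr)
print(mag.shape, round(float(rolloff.mean())))
```

In practice a library such as librosa provides these features (plus MFCC and chroma) directly; the sketch only shows the mechanics behind the spectrogram and rolloff features listed above.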
Data Availability
The dataset used to train the model is cited in Sect. 3.1 (Dataset) and listed in the references section (references 29 to 39).
References
Bhable S, Kayte C (2020) Review: Multilingual Acoustic modeling of Automatic Speech Recognition (ASR) for low resource languages. In IEEE International Conference on Advent Trends in Multidisciplinary Research and Innovation (ICATMRI).https://doi.org/10.1109/ICATMRI51801.2020.9398431
Malik M, Malik K, Mehmood K, Makhdoom I (2021) Automatic speech recognition: a survey. In Multimedia Tools and Applications, 9411–9457. https://doi.org/10.1007/s11042-020-10073-7.
Chu X (2021) Speech Recognition Method Based on Deep Learning and Its Application. In IEEE International Conference of Social Computing and Digital Economy (ICSCDE). https://doi.org/10.1109/ICSCDE54196.2021.00075
Kalhor E, Bakhtiari B (2021) Speaker independent feature selection for speech emotion recognition: A multi-task approach. In Multimedia Tools and Applications 80:8127–8146. https://doi.org/10.1007/s11042-020-10119-w
Guntur R, Ramakrishnan K, Mittal V (2021) Automatic Classification of Foreign Language Accent. In IEEE 2nd Global Conference for Advancement in Technology (GCAT). https://doi.org/10.1109/GCAT52182.2021.9587650
Dokuz Y, Tufekci Z (2022) Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. In Multimedia Tools Appl 81:9969–9988. https://doi.org/10.1007/s11042-022-12304-5
Delic V, Peric Z, Secujski M, Jakovljevic N, Nikolic J, Miskovic D, Simic N, Suzic S, Delic T (2019) Speech Technology Progress Based on New Machine Learning Paradigm. Hindawi: Comput Intell Neurosci 2019:1–19. https://doi.org/10.1155/2019/4368036
Abushariah A, Ting H, Mustafa M, Khairuddin A, Abushariah M, Tan T (2022) Bilingual Automatic Speech Recognition: A Review, Taxonomy and Open Challenges. In IEEE Access, 5944–5954. https://doi.org/10.1109/ACCESS.2022.3218684
Thukroo I, Bashir R, Giri K (2022) A review into deep learning techniques for spoken language identification. Multimedia Tools Appl 81:32593–32624. https://doi.org/10.1007/s11042-022-13054-0
Xue Y, Gao S, Sun H, Qin W (2017) A Chinese Sign Language Recognition System Using Leap Motion. In International Conference on Virtual Reality and Visualization, 180–185. https://doi.org/10.1109/ICVRV.2017.00044
Xu X, Li Y, Xu X, Wen Z, Che H, Liu S, Tao J (2014) Survey on discriminative feature selection for speech emotion recognition. In International Symposium on Chinese Spoken Language Processing, 345–349. https://doi.org/10.1109/ISCSLP.2014.6936641
Gong C, Li X, Wu X (2014) Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition. In International Symposium on Chinese Spoken Language Processing, 459–463. https://doi.org/10.1109/ISCSLP.2014.6936636
Shao P (2020) Chinese Speech Recognition System based on Deep Learning. In Journal of Physics: Conference Series, 1–6. https://doi.org/10.1088/1742-6596/1549/2/022012
Ropke W, Radulescu R, Efthymiadis K, Nowe A (2019) Training a Speech-to-Text Model for Dutch on the Corpus Gesproken Nederlands. In Proceedings of the Reference AI & ML Conference for Belgium, Netherlands & Luxemburg, 2491
Singh G, Sharma S, Kumar V, Kaur M, Baz M, Masud M (2021) Spoken Language Identification Using Deep Learning. Hindawi Comput Intell Neurosci 2021:1–12. https://doi.org/10.1155/2021/5123671
Smit P, Virpioja S, Kurimo M (2020) Advances in subword-based HMM-DNN speech recognition across languages. Comput Speech Lang 66:101–158. https://doi.org/10.1016/j.csl.2020.101158
Berjon P, Nag A, Dev S (2021) Analysis of French Phonetic Idiosyncrasies for Accent Recognition. Soft Comput Lett. https://doi.org/10.1016/j.socl.2021.100018
Yang H, Oehlke C, Meinel C (2011) German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings. In Proc. of 10th IEEE/ACIS International Conference on Computer and Information Science. https://doi.org/10.1109/ICIS.2011.38
Xu J, Matta K, Islam S, Nurnberger A (2020) German Speech Recognition System using Deep Speech. In International Conference on Natural Language Processing and Information Retrieval, 102–106. https://doi.org/10.1145/3443279.3443313
Milde B, Kohn M (2018) Open-Source Automatic Speech Recognition for German. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Computation and Language. https://doi.org/10.48550/arXiv.1807.10311
Pantazoglou F, Kladis G, Papadakis N (2019) A Greek voice recognition interface for ROV applications, using machine learning technologies and the CMU Sphinx platform. Wseas Transact Syst Control 13:550–560
Szarvas M, Fegyo T, Mihajlik P, Tatai P (2000) Automatic Recognition of Hungarian: Theory and Practice. Int J Speech Technol 3:237–251. https://doi.org/10.1023/A:1026515132762
Chen J, Nishimura R, Kitaoka N (2020) End-to-end recognition of streaming Japanese speech using CTC and local attention. In SIP 9(25):1–7
Mu D, Zhu T, Xu G, Li H, Guo D, Liu Y (2019) Attention-Based Speech Model for Japanese Recognization. In IEEE International Conference on Smart Internet of Things (SmartIoT), 402–406. https://doi.org/10.1109/SmartIoT.2019.00071
Abdallah A, Hamada M, Nurseitov D (2020) Attention-Based Fully Gated CNN-BGRU for Russian Handwritten Text. J Imaging, 6(141), 1–23. https://doi.org/10.48550/arXiv.2008.05373
Gazeau V, Varol C (2018) Automatic Spoken Language Recognition with Neural Networks. Int J Inf Technol Comput Sci 8:11–17. https://doi.org/10.5815/ijitcs.2018.08.02
Veisi H, Mani A (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905. https://doi.org/10.1007/s10772-020-09768-x
Savargiv M, Bastanfard A (2015) Persian Speech Emotion Recognition. In IKT2015 7th International Conference on Information and Knowledge Technology, 1–5. https://doi.org/10.1109/IKT.2015.7288756
Park K. Dutch: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/dutch-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. French: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/french-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. German: Single speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/german-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Greek: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/greek-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Hungarian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/hungarian-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Japanese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/japanese-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Russian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Spanish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/spanish-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Finnish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/finnish-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Chinese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/chinese-single-speaker-speech-dataset. Accessed 3 Feb 2022
Persian dataset, Persian Speech. Available [Online]: https://github.com/persiandataset/PersianSpeech. Accessed 3 Feb 2022
Antoniadis P, Tsardoulias E, Symeonidis A (2022) A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimedia Tools Appl 81:40635–40652. https://doi.org/10.1007/s11042-022-12953-6
Jain N, Gupta V, Shubham, Madan A, Chaudhary A, Santosh K (2021) Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. https://doi.org/10.1007/s00521-021-06003-9
Kaur G, Sharma A (2023) A deep learning-based model using hybrid feature extraction approach for consumer sentiment analysis. J Big Data. https://doi.org/10.1186/s40537-022-00680-6
Kaur A, Singh A, Sachdeva R, Kukreja V (2023) Automatic speech recognition systems: A survey of discriminative techniques. Multimed Tools Appl 82:13307–13339. https://doi.org/10.1007/s11042-022-13645-x
Al-karawi K, Mohammed D (2021) Improving short utterance speaker verification by combining MFCC and Entrocy in Noisy conditions. Multimed Tools Appl 80:22231–22249. https://doi.org/10.1007/s11042-021-10767-6
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gupta, A., Kumar, R. & Kumar, Y. Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages. Multimed Tools Appl 83, 30145–30166 (2024). https://doi.org/10.1007/s11042-023-16748-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-16748-1