Abstract
Speech is a natural phenomenon and a principal mode of human communication, divided into two categories: human-to-human and human-to-machine. Human-to-human communication depends on the language the speaker uses. In contrast, human-to-machine communication is a technique in which machines recognize human speech and act accordingly, commonly termed Automatic Speech Recognition (ASR). Recognizing non-Indian languages is challenging due to pitch variation and other factors such as accent and pronunciation. This paper proposes a novel hybrid model based on DenseNet201 and EfficientNetB0 for speech recognition. Initially, 76,263 speech samples are taken from 11 non-Indian languages: Chinese, Dutch, Finnish, French, German, Greek, Hungarian, Japanese, Russian, Spanish, and Persian. After collection, these speech samples are pre-processed to remove noise. Then, Spectrogram, Short-Term Fourier Transform (STFT), Spectral Rolloff-Bandwidth, Mel-Frequency Cepstral Coefficient (MFCC), and Chroma features are extracted from the speech samples. Further, the proposed approach is compared with other Deep Learning (DL) models, namely ResNet10, Inception V3, VGG16, DenseNet201, and EfficientNetB0. Standard metrics, namely Precision, Recall, F1-Score, Confusion Matrix, Accuracy, and Loss curves, are used to evaluate each model's performance on speech samples from all the languages mentioned above. The experimental results show that the hybrid model outperforms all the other models, achieving the highest recognition accuracy of 99.84% with a loss of 0.004.
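As an illustration of the feature-extraction step named in the abstract, the following is a minimal NumPy sketch of an STFT magnitude spectrogram and spectral rolloff. It is not the paper's implementation: a synthetic 440 Hz tone stands in for a real speech sample, and the frame, hop, and rolloff parameters are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(y, n_fft=1024, hop=256):
    """Short-Term Fourier Transform magnitude: frame the signal,
    apply a Hann window, take the FFT of each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Rows are frequency bins, columns are time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

def spectral_rolloff(mag, sr, roll=0.85):
    """Per-frame frequency below which `roll` of the spectral
    magnitude is concentrated."""
    freqs = np.fft.rfftfreq((mag.shape[0] - 1) * 2, d=1.0 / sr)
    cum = np.cumsum(mag, axis=0)
    idx = np.argmax(cum >= roll * cum[-1], axis=0)
    return freqs[idx]

# One second of a pure 440 Hz tone at 16 kHz stands in for speech
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

mag = stft_magnitude(y)            # spectrogram, shape (513, 59)
rolloff = spectral_rolloff(mag, sr)
print(mag.shape, round(float(rolloff.mean())))
```

In practice a library such as librosa provides these features (plus MFCC and chroma) directly; the sketch only shows the mechanics behind the spectrogram and rolloff features listed above.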
Data Availability
The dataset used to train the model is cited in Sect. 3.1 (Dataset) and listed in the references section (references 29 to 39).
References
Bhable S, Kayte C (2020) Review: Multilingual Acoustic modeling of Automatic Speech Recognition (ASR) for low resource languages. In IEEE International Conference on Advent Trends in Multidisciplinary Research and Innovation (ICATMRI).https://doi.org/10.1109/ICATMRI51801.2020.9398431
Malik M, Malik K, Mehmood K, Makhdoom I (2021) Automatic speech recognition: a survey. In Multimedia Tools and Applications, 9411–9457. https://doi.org/10.1007/s11042-020-10073-7.
Chu X (2021) Speech Recognition Method Based on Deep Learning and Its Application. In IEEE International Conference of Social Computing and Digital Economy (ICSCDE). https://doi.org/10.1109/ICSCDE54196.2021.00075
Kalhor E, Bakhtiari B (2021) Speaker independent feature selection for speech emotion recognition: A multi-task approach. In Multimedia Tools and Applications 80:8127–8146. https://doi.org/10.1007/s11042-020-10119-w
Guntur R, Ramakrishnan K, Mittal V (2021) Automatic Classification of Foreign Language Accent. In IEEE 2nd Global Conference for Advancement in Technology (GCAT). https://doi.org/10.1109/GCAT52182.2021.9587650
Dokuz Y, Tufekci Z (2022) Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. In Multimedia Tools Appl 81:9969–9988. https://doi.org/10.1007/s11042-022-12304-5
Delic V, Peric Z, Secujski M, Jakovljevic N, Nikolic J, Miskovic D, Simic N, Suzic S, Delic T (2019) Speech Technology Progress Based on New Machine Learning Paradigm. Hindawi: Comput Intell Neurosci 2019:1–19. https://doi.org/10.1155/2019/4368036
Abushariah A, Ting H, Mustafa M, Khairuddin A, Abushariah M, Tan T (2022) Bilingual Automatic Speech Recognition: A Review, Taxonomy and Open Challenges. In IEEE Access, 5944–5954. https://doi.org/10.1109/ACCESS.2022.3218684
Thukroo I, Bashir R, Giri K (2022) A review into deep learning techniques for spoken language identification. Multimedia Tools Appl 81:32593–32624. https://doi.org/10.1007/s11042-022-13054-0
Xue Y, Gao S, Sun H, Qin W (2017) A Chinese Sign Language Recognition System Using Leap Motion. In International Conference on Virtual Reality and Visualization, 180–185. https://doi.org/10.1109/ICVRV.2017.00044
Xu X, Li Y, Xu X, Wen Z, Che H, Liu S, Tao J (2014) Survey on discriminative feature selection for speech emotion recognition. In International Symposium on Chinese Spoken Language Processing, 345–349. https://doi.org/10.1109/ISCSLP.2014.6936641
Gong C, Li X, Wu X (2014) Recurrent Neural Network Language Model with Part-of-speech for Mandarin Speech Recognition. In International Symposium on Chinese Spoken Language Processing, 459–463. https://doi.org/10.1109/ISCSLP.2014.6936636
Shao P (2020) Chinese Speech Recognition System based on Deep Learning. In Journal of Physics: Conference Series, 1–6. https://doi.org/10.1088/1742-6596/1549/2/022012
Ropke W, Radulescu R, Efthymiadis K, Nowe A (2019) Training a Speech-to-Text Model for Dutch on the Corpus Gesproken Nederlands. In Proceedings of the Reference AI & ML Conference for Belgium, Netherlands & Luxemburg, 2491
Singh G, Sharma S, Kumar V, Kaur M, Baz M, Masud M (2021) Spoken Language Identification Using Deep Learning. Hindawi Comput Intell Neurosci 2021:1–12. https://doi.org/10.1155/2021/5123671
Smit P, Virpioja S, Kurimo M (2020) Advances in subword-based HMM-DNN speech recognition across languages. Comput Speech Lang 66:101–158. https://doi.org/10.1016/j.csl.2020.101158
Berjon P, Nag A, Dev S (2021) Analysis of French Phonetic Idiosyncrasies for Accent Recognition. Soft Comput Lett. https://doi.org/10.1016/j.socl.2021.100018
Yang H, Oehlke C, Meinel C (2011) German Speech Recognition: A Solution for the Analysis and Processing of Lecture Recordings. In Proc. of 10th IEEE/ACIS International Conference on Computer and Information Science. https://doi.org/10.1109/ICIS.2011.38
Xu J, Matta K, Islam S, Nurnberger A (2020) German Speech Recognition System using Deep Speech. In International Conference on Natural Language Processing and Information Retrieval, 102–106. https://doi.org/10.1145/3443279.3443313
Milde B, Kohn M (2018) Open-Source Automatic Speech Recognition for German. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Computation and Language. https://doi.org/10.48550/arXiv.1807.10311
Pantazoglou F, Kladis G, Papadakis N (2019) A Greek voice recognition interface for ROV applications, using machine learning technologies and the CMU Sphinx platform. Wseas Transact Syst Control 13:550–560
Szarvas M, Fegyo T, Mihajlik P, Tatai P (2000) Automatic Recognition of Hungarian: Theory and Practice. Int J Speech Technol 3:237–251. https://doi.org/10.1023/A:1026515132762
Chen J, Nishimura R, Kitaoka N (2020) End-to-end recognition of streaming Japanese speech using CTC and local attention. In SIP 9(25):1–7
Mu D, Zhu T, Xu G, Li H, Guo D, Liu Y (2019) Attention-Based Speech Model for Japanese Recognization. In IEEE International Conference on Smart Internet of Things (SmartIoT), 402–406. https://doi.org/10.1109/SmartIoT.2019.00071
Abdallah A, Hamada M, Nurseitov D (2020) Attention-Based Fully Gated CNN-BGRU for Russian Handwritten Text. J Imaging, 6(141), 1–23. https://doi.org/10.48550/arXiv.2008.05373
Gazeau V, Varol C (2018) Automatic Spoken Language Recognition with Neural Networks. Int J Inf Technol Comput Sci 8:11–17. https://doi.org/10.5815/ijitcs.2018.08.02
Veisi H, Mani A (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905. https://doi.org/10.1007/s10772-020-09768-x
Savargiv M, Bastanfard A (2015) Persian Speech Emotion Recognition. In IKT2015 7th International Conference on Information and Knowledge Technology, 1–5. https://doi.org/10.1109/IKT.2015.7288756
Park K. Dutch: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/dutch-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. French: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/french-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. German: Single speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/german-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Greek: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/greek-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Hungarian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/hungarian-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Japanese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/japanese-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Russian: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Spanish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/spanish-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Finnish: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/datasets/bryanpark/finnish-single-speaker-speech-dataset. Accessed 3 Feb 2022
Park K. Chinese: Single Speaker Speech Dataset. Available [Online]: https://www.kaggle.com/bryanpark/chinese-single-speaker-speech-dataset. Accessed 3 Feb 2022
Persian dataset, Persian Speech. Available [Online]: https://github.com/persiandataset/PersianSpeech. Accessed 3 Feb 2022
Antoniadis P, Tsardoulias E, Symeonidis A (2022) A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimedia Tools Appl 81:40635–40652. https://doi.org/10.1007/s11042-022-12953-6
Jain N, Gupta V, Shubham, Madan A, Chaudhary A, Santosh K (2021) Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. https://doi.org/10.1007/s00521-021-06003-9
Kaur G, Sharma A (2023) A deep learning-based model using hybrid feature extraction approach for consumer sentiment analysis. J Big Data. https://doi.org/10.1186/s40537-022-00680-6
Kaur A, Singh A, Sachdeva R, Kukreja V (2023) Automatic speech recognition systems: A survey of discriminative techniques. Multimed Tools Appl 82:13307–13339. https://doi.org/10.1007/s11042-022-13645-x
Al-karawi K, Mohammed D (2021) Improving short utterance speaker verification by combining MFCC and Entrocy in Noisy conditions. Multimed Tools Appl 80:22231–22249. https://doi.org/10.1007/s11042-021-10767-6
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gupta, A., Kumar, R. & Kumar, Y. Hybrid deep learning based automatic speech recognition model for recognizing non-Indian languages. Multimed Tools Appl 83, 30145–30166 (2024). https://doi.org/10.1007/s11042-023-16748-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-16748-1