Abstract
Automatic speech recognition (ASR) plays a crucial role in natural and efficient human–computer interaction. This paper presents a comprehensive review of ASR systems for the Gujarati language. Existing surveys of Gujarati ASR have focused largely on back-end classification methods; this paper fills that gap with a survey spanning feature extraction techniques, back-end models, speech datasets, and evaluation metrics, providing an in-depth analysis of the methodologies employed in developing Gujarati ASR systems. First, the paper discusses feature extraction techniques. Second, it covers classification models and their impact on ASR performance. Third, it examines the speech datasets available for Gujarati, offering insights for researchers and practitioners. Finally, it reviews the evaluation parameters used in ASR systems and gives an overview of online toolkits, resources, and language models pertinent to Gujarati ASR. The systems surveyed are compared on four parameters: front-end feature extraction methods, back-end classification techniques, speech dataset, and evaluation metrics. The paper also discusses the contributions and limitations of current ASR systems, addresses the challenges that still persist, and provides directions for future research in this field.
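As an illustration of the evaluation metrics discussed in the review, the most widely reported one, word error rate (WER), can be sketched as a word-level Levenshtein distance normalized by reference length. This is a minimal illustrative implementation, not drawn from any specific system surveyed here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of a four-word reference: WER = 1/4
print(wer("this is a test", "this is test"))  # -> 0.25
```

Character error rate (CER), also common for Indic-script languages, follows the same computation over characters instead of words.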
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dua, M., Bhagat, B., Dua, S. et al. A review on Gujarati language based automatic speech recognition (ASR) systems. Int J Speech Technol 27, 133–156 (2024). https://doi.org/10.1007/s10772-024-10087-8