Abstract
Automatic speech recognition (ASR) plays a crucial role in natural and efficient human–computer interaction. This paper presents a comprehensive review of ASR systems for the Gujarati language. Existing surveys of Gujarati ASR have focused largely on back-end classification methods; this paper fills that gap with a survey spanning feature extraction techniques, back-end models, speech datasets, and evaluation metrics, providing an in-depth analysis of the methodologies employed in developing Gujarati ASR systems. First, the paper discusses feature extraction techniques. Second, it covers classification models and their impact on ASR performance. Third, it examines the speech datasets available for Gujarati, offering insights for researchers and practitioners. Finally, it reviews the evaluation parameters used in ASR systems and gives an overview of online toolkits, resources, and language models pertinent to Gujarati ASR. The systems surveyed are compared on four parameters: front-end feature extraction methods, back-end classification techniques, speech dataset, and evaluation metrics. The paper also discusses the contributions and limitations of current ASR systems, addresses the challenges that still persist, and provides directions for future research in this field.
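As an illustration of the evaluation metrics discussed in the review, the most widely reported one, word error rate (WER), can be sketched as a word-level Levenshtein distance normalized by reference length. This is a minimal illustrative implementation, not drawn from any specific system surveyed here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One word dropped out of a four-word reference: WER = 1/4
print(wer("this is a test", "this is test"))  # -> 0.25
```

Character error rate (CER), also common for Indic-script languages, follows the same computation over characters instead of words.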
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dua, M., Bhagat, B., Dua, S. et al. A review on Gujarati language based automatic speech recognition (ASR) systems. Int J Speech Technol 27, 133–156 (2024). https://doi.org/10.1007/s10772-024-10087-8