Abstract
Research on speech recognition for African languages is limited by the scarcity of digital resources for training and adaptation, despite its broad usefulness. The Hausa language, spoken by almost fifty million people in West and Central Africa, is an example of a linguistic domain that has not been thoroughly studied. Hausa employs diacritics, symbols placed on alphabetical characters to convey additional information. Removing diacritics increases the number of homographs, making it difficult to distinguish between similar words. This paper presents a study on speech recognition for the Hausa language, focusing specifically on diacritized words. The study utilises the state-of-the-art wav2vec 2.0 and Whisper deep learning architectures to transcribe audio signals into the corresponding Hausa text. Among the models evaluated, Whisper-large performed best, achieving a word error rate of 4.23%, a considerable improvement of 43.9% over the existing state-of-the-art model for Hausa speech recognition. Additionally, the Whisper-large model demonstrated a diacritic coverage of 92%, a precision of 98.87%, and a diacritic error rate of 2.1%.
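The word error rate (WER) reported above is the standard ASR metric: the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal sketch of the computation (illustrative only; the paper's exact scoring pipeline and normalization are not specified here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A reported WER of 4.23% thus means roughly 4 word-level errors per 100 reference words.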





Data availability
The dataset used for this research is the Mozilla Common Voice dataset, Version 15.0, for the Hausa language, which contains 10,106 audio files totalling 13 hours of speech with their corresponding transcripts, of which 4 hours are validated. It is publicly available at https://commonvoice.mozilla.org/en/datasets.
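Common Voice releases ship transcripts as tab-separated files (e.g. validated.tsv) pairing each audio clip with its sentence. A minimal sketch of reading audio-path/transcript pairs, assuming the standard `path` and `sentence` column names used in Common Voice releases (the toy TSV string and Hausa sentence below are illustrative, not taken from the dataset):

```python
import csv
import io

# Toy stand-in for a Common Voice validated.tsv; real files carry more columns
# (up_votes, down_votes, locale, etc.) which DictReader simply ignores here.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "abc123\tcommon_voice_ha_001.mp3\tSannu da zuwa\n"
)

def read_transcripts(tsv_text: str) -> list[tuple[str, str]]:
    """Return (audio_path, sentence) pairs from a Common Voice-style TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["path"], row["sentence"]) for row in reader]
```

In practice one would open the release's validated.tsv on disk and pass its contents to a reader like this before feeding the clips to an ASR training pipeline.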
References
Abdulhamid, T. H., & Tahir, S. M. (2017). Intelligent system speech recognition voice and speech recognition for Hausa words and numerals. International Journal of Advance Technology in Engineering, 5, 107519.
Abdulmumin, S. (2014). A survey of historical prevalence of Hausa language in contemporary literacy. ZAHIRA–Journal of Historical Research, 5(4).
Abubakar, M. K. (2014). Pronunciation problems of Hausa speakers of English.
Akhilesh, A., Brinda, P., Keerthana, S., Gupta, D., & Vekkot, S. (2022). Tamil speech recognition using XLSR wav2vec 2.0 & CTC algorithm. In 13th international conference on computing communication and networking technologies (ICCCNT) (pp. 1–6). IEEE
Al-Dujaili, M. J., & Ebrahimi-Moghadam, A. (2023). Speech emotion recognition: A comprehensive survey. Wireless Personal Communications, 129(4), 2525–2561.
Alhumud, A. M., AL-Qurishi, M., Alomar, Y. O., Alzahrani, A., & Souissi, R. (2024). Improving automated speech recognition using retrieval-based voice conversion. In The second tiny papers track at ICLR 2024. https://openreview.net/forum?id=OMBFB6pU6c
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2019). Common voice: A massively multilingual speech corpus. arXiv:1912.06670
Babatunde, A. N., Ogundokun, R. O., Jimoh, E. R., Misra, S., & Singh, D. (2023). Hausa character recognition using logistic regression. In Machine intelligence techniques for data analysis and signal processing: Proceedings of 4th international conference MISP 2022 (Vol. 1, pp. 801–811). Springer
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
Bashir, M., Owaseye, J. F., & Eze, J. C. (2023). Substitution as a phonological interference in Hausa spoken by IGBO and Yoruba speakers. Advance Journal of Linguistics and Mass Communication, 7(4), 1–14.
Biswas, D., Nadipalli, S., Sneha, B., & Supriya, M. (2022). Speech recognition using weighted finite-state transducers. In 7th international conference for convergence in technology (I2CT) (pp. 1–5). IEEE
Callejo, D. R., & Boets, B. (2023). A systematic review on speech-in-noise perception in autism. Neuroscience & Biobehavioral Reviews. https://doi.org/10.1016/j.neubiorev.2023.105406
Caubrière, A., & Gauthier, E. (2024). Africa-centric self-supervised pre-training for multilingual speech representation in a sub-saharan context. arXiv:2404.02000
Chen, J., Vekkot, S., & Shukla, P. (2024). Music source separation based on a lightweight deep learning framework (DTTNET: Dual-path TFC-TDF UNET). In 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP 2024) (pp. 656–660). IEEE
Diskin, M., Bukhtiyarov, A., Ryabinin, M., Saulnier, L., Sinitsin, A., Popov, D., Pyrkin, D. V., Kashirin, M., Borzunov, A., Moral, A., et al. (2021). Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34, 7879–7897.
Dong, M., Peng, L., Nie, Q., & Li, W. (2023). Speech signal processing of industrial speech recognition. Journal of Physics: Conference Series, 2508, 012039.
Gauthier, E., Besacier, L., & Voisin, S. (2016). Automatic speech recognition for African languages with vowel length contrast. Procedia Computer Science, 81, 136–143.
Gris, L. R. S., Casanova, E., Oliveira, F. S., Soares, A., & Junior, A. C. (2021). Brazilian Portuguese speech recognition using wav2vec 2.0. arXiv:2107.11414
Hancock, A., Northcott, S., Hobson, H., & Clarke, M. (2023). Speech, language and communication needs and mental health: The experiences of speech and language therapists and mental health professionals. International Journal of Language & Communication Disorders, 58(1), 52–66.
Ibrahim, Y. A., Faki, S. A., & Abidemi, T. I. F. (2019). Automatic speech recognition using MFCC in feature extraction based HMM for human-computer interaction in Hausa. Anale Seria Informatica, 18
Ibrahim, U. A., Mahatma, M. B., & Suleiman, M. A. (2022). Framework for Hausa speech recognition. In 2022 5th information technology for education and development (ITED) (pp. 1–4). IEEE
Inuwa-Dutse, I. (2021). The first large-scale collection of diverse Hausa language datasets. arXiv:2102.06991
Klejch, O., Wallington, E., & Bell, P. (2021). Deciphering speech: A zero-resource approach to cross-lingual transfer in ASR. arXiv:2111.06799
Kumar, A., Cambria, E., & Trueman, T. E. (2021). Transformer-based bidirectional encoder representations for emotion detection from text. In IEEE symposium series on computational intelligence (SSCI) (pp 1–6). IEEE
Kumar, M. R., Vekkot, S., Lalitha, S., Gupta, D., Govindraj, V. J., Shaukat, K., Alotaibi, Y. A., & Zakariah, M. (2022). Dementia detection from speech using machine learning and deep learning architectures. Sensors, 22(23), 9311.
Likhomanenko, T., Lugosch, L., & Collobert, R. (2023). Unsupervised ASR via cross-lingual pseudo-labeling. arXiv:2305.13330
Luka, M. K., Ibikunle, F., & Gregory, O. (2012). Neural network based Hausa language speech recognition. International Journal of Advanced Research in Artificial Intelligence, 1(2), 39–44.
Mak, F., Govender, A., & Badenhorst, J. (2024). Exploring ASR fine-tuning on limited domain-specific data for low-resource languages. Journal of the Digital Humanities Association of Southern Africa. https://doi.org/10.55492/dhasa.v5i1.5024
Manasa, C. S., Priya, K. J., & Gupta, D. (2019). Comparison of acoustical models of GMM-HMM-based for speech recognition in Hindi using Pocketsphinx. In 3rd international conference on computing methodologies and communication (ICCMC) (pp. 534–539). IEEE
Mbonu, C. E., Chukwuneke, C. I., Paul, R. U., Ezeani, I., & Onyenwe, I. (2022). Igbosum1500-introducing the IGBO text summarization dataset. In 3rd workshop on African natural language processing
Mekki, S. A., Hassan, E. M., Dayhum, A. F. A., & Galhom, D. H. (2023). Brief insight about speech perception and classification of speech sound in Arabic dialects. Journal of Pharmaceutical Negative Results, 1256–1262
Millet, J., Caucheteux, C., Boubenec, Y., Gramfort, A., Dunbar, E., Pallier, C., King, J., et al. (2022). Toward a realistic model of speech processing in the brain with self-supervised learning. Advances in Neural Information Processing Systems, 35, 33428–33443.
Musa, I. I. (2022). An assessment of the ancient Hausa traditional security system before the imposition of the British colonial administration in Hausa land. Sapientia Global Journal of Arts, Humanities and Development Studies, 5(1)
Owodunni, A. T., Yadavalli, A., Emezue, C. C., Olatunji, T., & Mbataku, C. C. (2024). Accentfold: A journey through African accents for zero-shot ASR adaptation to target accents. arXiv:2402.01152
Palo, P., Moisik, S. R., & Faytak, M. (2023). Analysing speech data with Satkit. In International conference of phonetic sciences (ICPhS 2023), Prague
Pati, P. B., & Shreyas, V. (2022). Speech to equation conversion using a POE tagger. In 7th international conference for convergence in technology (I2CT) (pp. 1–4). IEEE
Payne, J., Au, A., & Dowell, R. C. (2023). An overview of factors affecting bimodal and electric-acoustic stimulation (EAS) speech understanding outcomes. Hearing Research, 431, 108736.
Podila, R. S. A., Kommula, G. S. S., Ruthvik, K., Vekkot, S., & Gupta, D. (2022). Telugu dialect speech dataset creation and recognition using deep learning techniques. In IEEE 19th India council international conference (INDICON) (pp. 1–6). IEEE
Priya, K. J., Sowmya, S., Navya, T., & Gupta, D. (2018). Implementation of phonetic level speech recognition in Kannada using HTK. In Proceedings of international conference on communication and signal processing (ICCSP) (pp. 0082–0085). https://doi.org/10.1109/ICCSP.2018.8524192
Priyamvada, R., Kumar, S. S., Ganesh, H., & Soman, K. (2022). Multilingual speech recognition for Indian languages. In Advanced machine intelligence and signal processing (pp. 545–553)
Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International conference on machine learning (PMLR) (pp. 28492–28518)
Ritchie, S., Cheng, Y.-C., Chen, M., Mathews, R., Esch, D., Li, B., & Sim, K. C. (2022). Large vocabulary speech recognition for languages of Africa: Multilingual modelling and self-supervised learning. arXiv:2208.03067
Schultz, I. T., Djomgang, E. G. K., Schlippe, D. T., & Vu, D. T. (2011). Hausa large vocabulary continuous speech recognition. Karlsruhe Institute of Technology
Seikel, J. A., Drumright, D. G., & Hudock, D. J. (2023). Anatomy & physiology for speech, language, and hearing. Plural Publishing.
Shamma, A. L., Vekkot, S., Gupta, D., Zakariah, M., & Alotaibi, Y. A. (2024). Development of a non-invasive COVID-19 detection framework using explainable AI and data augmentation 1. Journal of Intelligent & Fuzzy Systems. https://doi.org/10.3233/JIFS-219387
Sharma, R. S., Paladugu, S. H., Priya, K. J., & Gupta, D. (2019). Speech recognition in Kannada using HTK and Julius: A comparative study. In 2019 international conference on communication and signal processing (ICCSP) (pp. 0068–0072). https://doi.org/10.1109/ICCSP.2019.8698039
Sharma, S. B. N. (2017). Isolated word speech recognition system using dynamic time warping. Global Journal of Advance Engineering Technology and Science, 5, 107519.
Sneha, V., Hardhika, G., Priya, K. J., & Gupta, D. (2018). Isolated Kannada speech recognition using HTK—A detailed approach. In Progress in advanced computing and intelligent engineering: Proceedings of ICACIE 2016 (Vol. 2, pp. 185–194). Singapore
Tachbelie, M. Y., Abate, S. T., & Schultz, T. (2022). Multilingual speech recognition for globalphone languages. Speech Communication, 140, 71–86.
Unubi, S. A. (2023). Significant linguistic information on the Arabic and Hausa languages.
Vancha, P., Nagarajan, H., Inakollu, V., Gupta, D., & Vekkot, S. (2022). Word-level speech dataset creation for Sourashtra and recognition system using Kaldi. In IEEE 19th India council international conference (INDICON) (pp. 1–6). IEEE
Vekkot, S., & Gupta, D. (2022). Fusion of spectral and prosody modelling for multilingual speech emotion conversion. Knowledge-Based Systems, 242, 108360.
Vekkot, S., Prakash, N. N. V. S., Reddy, T. S. E., Sripathi, S. R., Lalitha, S., Gupta, D., Zakariah, M., & Alotaibi, Y. A. (2023). Dementia speech dataset creation and analysis in Indic languages—A pilot study. IEEE Access, 11, 130697–130718.
Venugopalan, M., & Gupta, D. (2020). An unsupervised hierarchical rule-based model for aspect term extraction augmented with pruning strategies. Procedia Computer Science, 171, 22–31.
Mozilla Common Voice. Mozilla Common Voice for Hausa language, Version 13.0. https://commonvoice.mozilla.org/en/datasets
Wu, P., Wang, R., Lin, H., Zhang, F., Tu, J., & Sun, M. (2023). Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Transactions on Intelligence Technology, 8(3), 701–711.
Xu, S., Yu, J., Guo, H., Tian, S., Long, Y., Yang, J., & Zhang, L. (2023). Force-induced ion generation in zwitterionic hydrogels for a sensitive silent-speech sensor. Nature Communications, 14(1), 219.
Zubairu, B. S., Kadiri, G. C., & Ekwueme, J. (2020). Comparative study of English and Hausa affixation. Academic Journal of Current Research, 7(11), 1–10.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Abubakar, A.M., Gupta, D. & Vekkot, S. Development of a diacritic-aware large vocabulary automatic speech recognition for Hausa language. Int J Speech Technol 27, 687–700 (2024). https://doi.org/10.1007/s10772-024-10111-x