Abstract
Recent improvements in end-to-end Automatic Speech Recognition (ASR) have achieved outstanding results and have enabled state-of-the-art models for well-resourced languages. Most languages, however, including Tigrinya, remain under-resourced, which discourages research efforts in the field. Tigrinya is a Semitic language with over nine million speakers. This paper presents the first hybrid Connectionist Temporal Classification (CTC) and attention-based end-to-end speaker-independent ASR model for Tigrinya. For this work, we constructed new, thoroughly pre-processed text and speech corpora spanning multiple domains, amounting to about 170,000 phrases and sentences of text and 30 hours of speech. Data augmentation was applied to generate synthetic data for better generalization, and a Recurrent Neural Network Language Model (RNN-LM) was used in post-processing to further improve results. Multiple experiments were conducted with different settings and parameters. While keeping the data size and split constant, various combinations of data augmentation techniques and language-model vocabulary sizes improved performance; increasing the vocabulary size from 5k to 20k, however, yielded only a minute decoding improvement. Our best model achieved a Character Error Rate (CER) of 14.28% and a Word Error Rate (WER) of 36.01%, a significant result given that this end-to-end approach is the first of its kind for the under-resourced Tigrinya language.
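The CER and WER figures above are both edit-distance-based metrics: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, computed over characters or words respectively. As an illustrative sketch (not the authors' evaluation code), they can be computed like this:

```python
def edit_distance(ref, hyp):
    # Single-row dynamic-programming Levenshtein distance between two sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp[i-1][j-1]; old dp[j] is dp[i-1][j]; dp[j-1] is dp[i][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (or match)
    return dp[-1]

def error_rate(ref_units, hyp_units):
    # (substitutions + insertions + deletions) / reference length
    return edit_distance(ref_units, hyp_units) / len(ref_units)

ref, hyp = "speech recognition", "speech recogniton"
wer = error_rate(ref.split(), hyp.split())  # word-level  -> 0.5 (1 of 2 words wrong)
cer = error_rate(list(ref), list(hyp))      # char-level  -> 1/18 (one deleted 'i')
```

For Tigrinya, character-level scoring operates on the Ge'ez-script syllabic characters, which is why CER is much lower than WER here: a word counts as fully wrong if any one of its characters is wrong.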
Notes
- 1.
The seven diacritics are used together with the base letters to form unique letters. The diacritics are commonly known as orders.
Acknowledgments
First, we would like to thank the Almighty God. We would also like to thank Dr. Yonas Meressi; the Minister of Transport and Communications, Mr. Tesfaslasie Berhane, and the EriTel Co.; Dr. Yemane Keleta; the Department of Computer Science & Engineering; and the volunteer data donors. Last but not least, our heartfelt gratitude goes to our friends and family for their continuous love and moral support.
Ethics declarations
Disclosure of Interests
The authors declare that the research data supporting the findings of this study are available from the corresponding author upon reasonable request. The authors retain the right to be the sole party able to provide and distribute the data used in this study.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ghebregiorgis, B.D., Tekle, Y.Y., Kidane, M.F., Keleta, M.K., Ghebraeb, R.F., Gebretatios, D.T. (2024). Tigrinya End-to-End Speech Recognition: A Hybrid Connectionist Temporal Classification-Attention Approach. In: Debelee, T.G., Ibenthal, A., Schwenker, F., Megersa Ayano, Y. (eds) Pan-African Conference on Artificial Intelligence. PanAfriConAI 2023. Communications in Computer and Information Science, vol 2068. Springer, Cham. https://doi.org/10.1007/978-3-031-57624-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57623-2
Online ISBN: 978-3-031-57624-9