Abstract
Recent improvements in end-to-end Automatic Speech Recognition (ASR) have achieved outstanding results and have enabled state-of-the-art models for well-resourced languages. Most languages, however, including Tigrinya, remain under-resourced, which discourages research efforts in the field. Tigrinya is a Semitic language with over nine million speakers. This paper presents the first hybrid Connectionist Temporal Classification (CTC) and attention-based end-to-end speaker-independent ASR model for Tigrinya. For this work, we constructed new, thoroughly pre-processed text and speech corpora spanning multiple domains, amounting to about 170,000 phrases and sentences of text and 30 hours of speech. Data augmentation was applied to generate synthetic data for better generalization, and a Recurrent Neural Network Language Model (RNN-LM) was used in post-processing to further improve results. Multiple experiments were conducted with different settings and parameters. While keeping the data size and split constant, various combinations of data augmentation techniques and language-model vocabulary sizes improved performance; increasing the vocabulary size from 5k to 20k, however, yielded only a minute decoding improvement. Our best model achieved a Character Error Rate (CER) of 14.28% and a Word Error Rate (WER) of 36.01%, a significant result given that this end-to-end approach is the first of its kind for the under-resourced Tigrinya language.
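The CER and WER figures above are both edit-distance-based metrics: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, computed over characters or words respectively. As an illustrative sketch (not the authors' evaluation code), they can be computed like this:

```python
def edit_distance(ref, hyp):
    # Single-row dynamic-programming Levenshtein distance between two sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds dp[i-1][j-1]; old dp[j] is dp[i-1][j]; dp[j-1] is dp[i][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (or match)
    return dp[-1]

def error_rate(ref_units, hyp_units):
    # (substitutions + insertions + deletions) / reference length
    return edit_distance(ref_units, hyp_units) / len(ref_units)

ref, hyp = "speech recognition", "speech recogniton"
wer = error_rate(ref.split(), hyp.split())  # word-level  -> 0.5 (1 of 2 words wrong)
cer = error_rate(list(ref), list(hyp))      # char-level  -> 1/18 (one deleted 'i')
```

For Tigrinya, character-level scoring operates on the Ge'ez-script syllabic characters, which is why CER is much lower than WER here: a word counts as fully wrong if any one of its characters is wrong.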
Notes
- 1.
The seven diacritics are used together with the base letters to form unique letters. The diacritics are commonly known as orders.
Acknowledgments
First, we would like to thank the Almighty God. We would also like to thank Dr. Yonas Meressi; the Minister of Transport and Communications, Mr. Tesfaslasie Berhane, and the EriTel Co.; Dr. Yemane Keleta; the Department of Computer Science & Engineering; and the volunteer data donors. Last but not least, our heartfelt gratitude goes to our friends and family for their continuous love and moral support.
Ethics declarations
Disclosure of Interests
The authors declare that the research data supporting the findings of this study are available from the corresponding author upon reasonable request. The authors retain the right to be the sole party able to provide and distribute the data used in this study.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ghebregiorgis, B.D., Tekle, Y.Y., Kidane, M.F., Keleta, M.K., Ghebraeb, R.F., Gebretatios, D.T. (2024). Tigrinya End-to-End Speech Recognition: A Hybrid Connectionist Temporal Classification-Attention Approach. In: Debelee, T.G., Ibenthal, A., Schwenker, F., Megersa Ayano, Y. (eds) Pan-African Conference on Artificial Intelligence. PanAfriConAI 2023. Communications in Computer and Information Science, vol 2068. Springer, Cham. https://doi.org/10.1007/978-3-031-57624-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57623-2
Online ISBN: 978-3-031-57624-9