Tigrinya End-to-End Speech Recognition: A Hybrid Connectionist Temporal Classification-Attention Approach

  • Conference paper
  • Pan-African Conference on Artificial Intelligence (PanAfriConAI 2023)

Abstract

Recent improvements in end-to-end Automatic Speech Recognition (ASR) systems have achieved outstanding results and have enabled state-of-the-art models for well-resourced languages. Most languages, however, remain under-resourced, which discourages research efforts; Tigrinya, a Semitic language with over nine million speakers, is one of them. This paper presents the first hybrid Connectionist Temporal Classification (CTC) and attention-based, end-to-end, speaker-independent ASR model for Tigrinya. For this work, new text and speech corpora covering multiple domains were constructed and thoroughly pre-processed, amounting to about 170,000 phrases and sentences of text and 30 hours of speech. Data augmentation was applied to generate synthetic data for better generalization, and a Recurrent Neural Network Language Model (RNN-LM) was used for post-processing to further improve the results. Multiple experiments were conducted with different settings and parameters. With the data size and split held constant, employing various combinations of data augmentation techniques and varying the language model's vocabulary size improved performance, although increasing the vocabulary size from 5k to 20k yielded only a minute decoding improvement. Our best model achieved a Character Error Rate (CER) of 14.28% and a Word Error Rate (WER) of 36.01%, a significant result given that this end-to-end approach is the first of its kind for the under-resourced Tigrinya language.
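
To make the hybrid objective concrete, below is a minimal PyTorch sketch of a joint CTC/attention training loss of the kind described above. The interpolation weight, tensor shapes, and the hybrid_loss helper are illustrative assumptions, not details taken from the paper.

    # Hybrid CTC/attention multi-task objective:
    # L = w * L_ctc + (1 - w) * L_att
    import torch
    import torch.nn.functional as F

    def hybrid_loss(ctc_log_probs, att_logits, targets,
                    input_lengths, target_lengths, ctc_weight=0.3):
        """Combine CTC and attention-decoder losses for joint training.

        ctc_log_probs: (T, N, C) log-probabilities from the CTC branch
        att_logits:    (N, L, C) per-step logits from the attention decoder
        targets:       (N, L) label indices; 0 is the CTC blank / padding
        """
        ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths,
                         target_lengths, blank=0, zero_infinity=True)
        att = F.cross_entropy(att_logits.transpose(1, 2), targets,
                              ignore_index=0)
        return ctc_weight * ctc + (1.0 - ctc_weight) * att

    # Toy usage: 4 utterances, 50 encoder frames, 30-symbol vocabulary,
    # 10-symbol targets (random tensors, for shape-checking only).
    T, N, C, L = 50, 4, 30, 10
    log_probs = torch.randn(T, N, C).log_softmax(-1)
    logits = torch.randn(N, L, C)
    targets = torch.randint(1, C, (N, L))
    loss = hybrid_loss(log_probs, logits, targets,
                       torch.full((N,), T), torch.full((N,), L))
    print(loss.item())

Decoding in such systems typically interpolates the CTC and attention scores in the same fashion, optionally adding a language-model score, which is the role the RNN-LM plays in the post-processing step described above.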

Notes

  1. The seven diacritics are used together with the base letters to form unique letters. These diacritics are commonly known as orders.
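
For readers unfamiliar with the Ge'ez script, the sketch below shows how the seven orders map to consecutive code points in the Unicode Ethiopic block, where each base consonant occupies a row of eight code points. This layout is a property of the Unicode encoding, and the orders helper is ours, not the paper's.

    # Derive the seven order variants of an Ethiopic letter from the
    # Unicode Ethiopic block layout (one 8-code-point row per consonant).
    BASE = ord("ሀ")  # U+1200, the first letter of the block

    def orders(letter: str) -> list[str]:
        """Return the seven order variants of an Ethiopic letter."""
        row_start = BASE + ((ord(letter) - BASE) // 8) * 8
        return [chr(row_start + i) for i in range(7)]

    print(orders("ሰ"))  # ['ሰ', 'ሱ', 'ሲ', 'ሳ', 'ሴ', 'ስ', 'ሶ']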

Acknowledgments

First, we would like to thank the Almighty God. We would also like to thank Dr. Yonas Meressi; Mr. Tesfaslasie Berhane, Minister of Transport and Communications; the EriTel Co.; Dr. Yemane Keleta; the Department of Computer Science and Engineering; and the volunteer data donors. Last but not least, our heartfelt gratitude goes to our friends and family for their continuous love and moral support.

Author information

Corresponding author

Correspondence to Bereket Desbele Ghebregiorgis.

Ethics declarations

Disclosure of Interests

The authors declare that the research data supporting the findings of this study are available from the corresponding author upon reasonable request. The authors retain the right to be the sole party able to provide and distribute the data used in this study.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ghebregiorgis, B.D., Tekle, Y.Y., Kidane, M.F., Keleta, M.K., Ghebraeb, R.F., Gebretatios, D.T. (2024). Tigrinya End-to-End Speech Recognition: A Hybrid Connectionist Temporal Classification-Attention Approach. In: Debelee, T.G., Ibenthal, A., Schwenker, F., Megersa Ayano, Y. (eds) Pan-African Conference on Artificial Intelligence. PanAfriConAI 2023. Communications in Computer and Information Science, vol 2068. Springer, Cham. https://doi.org/10.1007/978-3-031-57624-9_12

  • DOI: https://doi.org/10.1007/978-3-031-57624-9_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57623-2

  • Online ISBN: 978-3-031-57624-9

  • eBook Packages: Computer Science, Computer Science (R0)
