Abstract
This paper proposes a scheme for classifying news articles in Tigrinya, a language spoken in northern Ethiopia and Eritrea that is known for its lack of extensive, readily available data. We present the first publicly available news article dataset for Tigrinya, containing 2,396 articles. In addition, we propose a data augmentation method for text classification. Furthermore, we explore text classification performance using traditional machine learning methods (support vector machine, logistic regression, random forest, linear discriminant analysis, decision tree, and naive Bayes), a neural network-based model (bidirectional long short-term memory), and a transformer-based model (TigRoBERTa). The experimental results show that the proposed method outperforms the comparative methods by up to nine points in accuracy. The code and the dataset are available to the research community (https://github.com/mehari-eng/Article-News-Categorization).
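To make the classification setting concrete, the sketch below shows how one of the traditional baselines named in the abstract (a TF-IDF plus linear support vector machine pipeline) could be trained on such a dataset with scikit-learn. It is a minimal illustration only: the file name `tigrinya_news.csv` and its column names are assumptions, and it does not reproduce the paper's own preprocessing, augmentation method, BiLSTM, or TigRoBERTa models.

```python
# Minimal sketch of a TF-IDF + linear SVM baseline for news-topic
# classification. The file name and column names ("text", "category")
# are hypothetical; this is one of the comparative baselines, not the
# paper's proposed method.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Load the labelled articles (assumed CSV layout: text, category).
df = pd.read_csv("tigrinya_news.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"],
    test_size=0.2, random_state=42, stratify=df["category"])

# Word-level TF-IDF features feeding a linear SVM classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LinearSVC())
model.fit(X_train, y_train)

# Evaluate on the held-out split.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```

A neural or transformer-based model would replace the TF-IDF features with learned embeddings or pretrained contextual representations, but the train/evaluate structure above stays the same.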
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yohannes, H.M., Amagasa, T. (2022). A Scheme for News Article Classification in a Low-Resource Language. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_47
DOI: https://doi.org/10.1007/978-3-031-21047-1_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21046-4
Online ISBN: 978-3-031-21047-1