Abstract
This paper proposes a scheme for classifying news articles in Tigrinya, a language spoken in northern Ethiopia and Eritrea that is known for its lack of extensive, readily available data. We present the first publicly available news article dataset for Tigrinya, containing 2,396 articles. In addition, we propose a data augmentation method for text classification. Furthermore, we explore text classification performance using traditional machine learning methods (support vector machine, logistic regression, random forest, linear discriminant analysis, decision tree, and naive Bayes), a neural network-based model (bidirectional long short-term memory), and a transformer-based model (TigRoBERTa). The experimental results show that the proposed method outperforms the comparative methods by up to nine points in accuracy. The code and the dataset are available to the research community (https://github.com/mehari-eng/Article-News-Categorization).
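To make the classification setting concrete, the sketch below shows how one of the traditional baselines named in the abstract (a TF-IDF plus linear support vector machine pipeline) could be trained on such a dataset with scikit-learn. It is a minimal illustration only: the file name `tigrinya_news.csv` and its column names are assumptions, and it does not reproduce the paper's own preprocessing, augmentation method, BiLSTM, or TigRoBERTa models.

```python
# Minimal sketch of a TF-IDF + linear SVM baseline for news-topic
# classification. The file name and column names ("text", "category")
# are hypothetical; this is one of the comparative baselines, not the
# paper's proposed method.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Load the labelled articles (assumed CSV layout: text, category).
df = pd.read_csv("tigrinya_news.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"],
    test_size=0.2, random_state=42, stratify=df["category"])

# Word-level TF-IDF features feeding a linear SVM classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LinearSVC())
model.fit(X_train, y_train)

# Evaluate on the held-out split.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```

A neural or transformer-based model would replace the TF-IDF features with learned embeddings or pretrained contextual representations, but the train/evaluate structure above stays the same.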
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yohannes, H.M., Amagasa, T. (2022). A Scheme for News Article Classification in a Low-Resource Language. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_47
DOI: https://doi.org/10.1007/978-3-031-21047-1_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21046-4
Online ISBN: 978-3-031-21047-1