
A Scheme for News Article Classification in a Low-Resource Language

  • Conference paper
Information Integration and Web Intelligence (iiWAS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13635)


Abstract

This paper proposes a scheme for classifying news articles in Tigrinya, a language spoken in northern Ethiopia and Eritrea that is known for its lack of extensive, readily available data. We present the first publicly available news article dataset for Tigrinya, containing 2396 articles. In addition, we propose a data augmentation method for text classification. Furthermore, we explore text classification performance using traditional machine learning methods (support vector machine, logistic regression, random forest, linear discriminant analysis, decision tree, and Naive Bayes), a neural network-based model (bidirectional long short-term memory), and a transformer-based model (TigRoBERTa). The experimental results show that the proposed method outperforms the comparative methods by up to nine points in accuracy. The code and the dataset are open to the research community (https://github.com/mehari-eng/Article-News-Categorization).
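The authors' released pipeline is available at the GitHub link above and is not reproduced here. As a rough illustration of the kind of classical baseline the abstract mentions (TF-IDF features with a linear support vector machine), the following minimal sketch trains and evaluates such a classifier with scikit-learn; the file name, column names, and train/test split are assumptions, not the paper's setup.

```python
# Minimal sketch of a TF-IDF + linear SVM baseline for news categorization.
# The CSV file name and its "text"/"label" columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Hypothetical dataset: one Tigrinya article per row with its category label.
df = pd.read_csv("tigrinya_news.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Word uni/bigram TF-IDF feeding a linear SVM; character n-grams are another
# common choice for morphologically rich, low-resource languages.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LinearSVC(),
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```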


Notes

  1. https://www.infoplease.com/world/social-statistics/how-many-languages-are-there.

  2. https://github.com/mehari-eng/Article-News-Categorization.

  3. http://ethiopiantreasures.co.uk/pages/language.htm.


Author information


Corresponding author

Correspondence to Hailemariam Mehari Yohannes.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yohannes, H.M., Amagasa, T. (2022). A Scheme for News Article Classification in a Low-Resource Language. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_47


  • DOI: https://doi.org/10.1007/978-3-031-21047-1_47


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21046-4

  • Online ISBN: 978-3-031-21047-1

  • eBook Packages: Computer Science, Computer Science (R0)
