Comparative evaluation of text classification techniques using a large diverse Arabic dataset

  • Original Paper

Language Resources and Evaluation

Abstract

A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While considerable effort has been devoted to Western languages, mostly English, little experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results across various feature selection methods, weighting methods, and classification algorithms show that, on average, the support vector machine performs best, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.

Acknowledgments

This project was fully funded by King Abdulaziz City for Science and Technology via grant number 104-27-30. The authors would like to thank the two anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

Corresponding author

Correspondence to Mohammad S. Khorsheed.

About this article

Cite this article

Khorsheed, M.S., Al-Thubaity, A.O. Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Lang Resources & Evaluation 47, 513–538 (2013). https://doi.org/10.1007/s10579-013-9221-8

  • DOI: https://doi.org/10.1007/s10579-013-9221-8
