Comparative evaluation of text classification techniques using a large diverse Arabic dataset

Khorsheed, Mohammad S.; Al-Thubaity, Abdulmohsen O.

doi:10.1007/s10579-013-9221-8

Original Paper
Published: 10 March 2013

Volume 47, pages 513–538, (2013)

Language Resources and Evaluation

Mohammad S. Khorsheed¹ &
Abdulmohsen O. Al-Thubaity¹


Abstract

A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While considerable effort has gone into Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selection methods, weighting methods, and classification algorithms show, on average, the superiority of the support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.
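
As a rough illustration of the kind of pipeline the abstract describes (feature selection followed by a classifier), the sketch below pairs chi-square term selection with a multinomial Naïve Bayes classifier. It is not the authors' implementation: the abstract does not name the feature selection or weighting methods used, so chi-square is an assumption here, and the corpus, labels, and function names (`chi_square`, `select_features`, `train_nb`, `predict`) are all hypothetical. Term weighting is omitted for brevity.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up corpus (NOT the paper's dataset): tokenised documents with labels.
TRAIN = [
    (["match", "goal", "team", "league"], "sports"),
    (["team", "player", "goal", "coach"], "sports"),
    (["market", "bank", "price", "shares"], "economy"),
    (["bank", "loan", "price", "market"], "economy"),
]

def chi_square(docs, term, label):
    """Chi-square statistic for one term/class pair, from document counts."""
    a = sum(1 for toks, y in docs if term in toks and y == label)      # term present, in class
    b = sum(1 for toks, y in docs if term in toks and y != label)      # term present, other class
    c = sum(1 for toks, y in docs if term not in toks and y == label)  # term absent, in class
    d = sum(1 for toks, y in docs if term not in toks and y != label)  # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, k):
    """Keep the k terms with the highest chi-square score over all classes."""
    labels = {y for _, y in docs}
    vocab = {t for toks, _ in docs for t in toks}
    score = {t: max(chi_square(docs, t, y) for y in labels) for t in vocab}
    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda t: (-score[t], t))[:k])

def train_nb(docs, features):
    """Collect class priors and per-class term counts over the selected features."""
    prior = Counter(y for _, y in docs)
    counts = defaultdict(Counter)
    for toks, y in docs:
        counts[y].update(t for t in toks if t in features)
    return prior, counts, features, len(docs)

def predict(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    prior, counts, features, n_docs = model
    best, best_lp = None, -math.inf
    for y in sorted(prior):
        total = sum(counts[y].values())
        lp = math.log(prior[y] / n_docs)
        for t in tokens:
            if t in features:  # terms outside the selected set are ignored
                lp += math.log((counts[y][t] + 1) / (total + len(features)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

features = select_features(TRAIN, k=5)
model = train_nb(TRAIN, features)
print(predict(model, ["goal", "team", "win"]))
print(predict(model, ["bank", "price"]))
```

Swapping in a different selector or classifier (for example an SVM or a decision tree) would only change `select_features` or the `train_nb`/`predict` pair, which is the kind of component-by-component comparison the abstract reports.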



Acknowledgments

This project was fully funded by King Abdulaziz City for Science and Technology via grant number 104-27-30. The authors would like to thank the two anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

Authors and Affiliations

King Abdulaziz City for Science & Technology, P.O. Box 6086, Riyadh, 11442, Saudi Arabia
Mohammad S. Khorsheed & Abdulmohsen O. Al-Thubaity

Corresponding author

Correspondence to Mohammad S. Khorsheed.

About this article

Cite this article

Khorsheed, M.S., Al-Thubaity, A.O. Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Lang Resources & Evaluation 47, 513–538 (2013). https://doi.org/10.1007/s10579-013-9221-8

Issue Date: June 2013
