Using Word N-Grams as Features in Arabic Text Classification

Al-Thubaity, Abdulmohsen; Alhoshan, Muneera; Hazzaa, Itisam

doi:10.1007/978-3-319-10389-1_3

Part of the book series: Studies in Computational Intelligence (SCI, volume 569)

Abstract

The feature type (FT) extracted from the text and presented to the classification algorithm (CAL) is one of the factors affecting text classification (TC) accuracy. Character N-grams, word roots, word stems, and single words have been used as features for Arabic TC (ATC). A survey of the current literature shows that no prior study has examined the effect of word N-grams (N consecutive words) on ATC accuracy. Consequently, we conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and 4-grams), four feature selection methods (document frequency (DF), chi-squared, information gain, and the Galavotti-Sebastiani-Simi (GSS) coefficient) with four thresholds for the number of features (50, 100, 150, and 200), three data representation schemas (Boolean, term frequency-inverse document frequency, and lookup table convolution), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM)). Our results show that using single words as features provides greater classification accuracy (CA) for ATC than using N-grams. Moreover, CA decreases by 17% on average when the number of N-grams increases. The data also show that the SVM CAL provides greater CA than NB and KNN; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved when the NB CAL is used with Boolean representation and the number of features is 200.
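As a rough illustration of the setup described in the abstract, the Python sketch below extracts word N-grams from a token list and enumerates the experimental grid to confirm the count of 576. The function name `word_ngrams`, the variable names, and the placeholder tokens are assumptions made for this example; the abstract does not describe the authors' tokenization or tooling, so this should not be read as their implementation.

```python
from itertools import product


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Placeholder tokens; a real pipeline would tokenize Arabic text first.
tokens = ["w1", "w2", "w3", "w4"]
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

# Experimental factors exactly as listed in the abstract.
feature_types = ["single words", "2-grams", "3-grams", "4-grams"]
selection_methods = ["document frequency", "chi-squared", "information gain", "GSS"]
feature_counts = [50, 100, 150, 200]
representations = ["Boolean", "TF-IDF", "lookup table convolution"]
classifiers = ["NB", "KNN", "SVM"]

# One experiment per combination of factor levels: 4 * 4 * 4 * 3 * 3.
experiment_grid = list(
    product(feature_types, selection_methods, feature_counts,
            representations, classifiers)
)

print(bigrams)
print(len(experiment_grid))
```

A document with fewer than N tokens yields no N-grams under this definition, which is one reason longer N-grams tend to produce sparser feature sets.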

Author information

Authors and Affiliations

Computer Research Institute, King Abdulaziz City for Science and Technology, Riyadh, KSA
Abdulmohsen Al-Thubaity & Muneera Alhoshan
College of Computer and Information Sciences, King Saud University, Riyadh, KSA
Itisam Hazzaa

Corresponding author

Correspondence to Abdulmohsen Al-Thubaity.

Editor information

Editors and Affiliations

Software Engineering & Information Technology Institute, Central Michigan University, Mt. Pleasant, Michigan, USA
Roger Lee

About this chapter

Cite this chapter

Al-Thubaity, A., Alhoshan, M., Hazzaa, I. (2015). Using Word N-Grams as Features in Arabic Text Classification. In: Lee, R. (eds) Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing. Studies in Computational Intelligence, vol 569. Springer, Cham. https://doi.org/10.1007/978-3-319-10389-1_3

DOI: https://doi.org/10.1007/978-3-319-10389-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10388-4
Online ISBN: 978-3-319-10389-1
eBook Packages: Engineering, Engineering (R0)
