Feature selection method using improved CHI Square on Arabic text classifiers: analysis and application

Multimedia Tools and Applications

Abstract

Text classification is the task of assigning texts to predefined categories according to their content. Over the past few years, the volume of information available on the Internet has grown across many fields, making text classification one of the most important, yet challenging, tasks. Text classification is employed in numerous applications and for different objectives. The widespread use of the Internet, particularly in the Arab world, together with the massive number of documents and pages available in Arabic, has created a need for suitable tools that classify these pages and documents by their main categories. The aim of this paper is to study the effect of the improved CHI (ImpCHI) square on the performance of six well-known classifiers: Random Forest, Decision Tree, Naïve Bayes, Naïve Bayes Multinomial, Bayes Net, and Artificial Neural Networks. The proposed techniques are important for improving the classification of Arabic documents and can be regarded as a promising basis for the text classification stage, because they contribute to classifying texts into predefined categories. This combined method takes advantage of more than one technique, which can produce better final results. The dataset employed in this paper includes 9055 Arabic documents collected from various Arabic resources. Based on their content, these documents were divided into twelve categories. Four performance evaluation criteria were used: F-measure, recall, precision, and model build time. The experimental results show that ImpCHI square gives better classification results than the standard CHI square method with all studied classifiers, in terms of all performance criteria used.
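To make the baseline concrete, the following is a minimal sketch of the standard CHI square statistic used for term selection in text classification. It is not the ImpCHI variant, whose definition is not given in this abstract. The function names `chi_square` and `select_top_terms`, the choice to score each term by its maximum value over categories, and the toy English tokens are all assumptions made for illustration.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Standard chi-square score for one term and one category.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs is a list of token lists and labels holds one category per document.
    Each term is scored by its maximum over categories (an assumption; other
    aggregation rules, such as a weighted average, are also common).
    """
    n_docs = len(docs)
    cat_counts = Counter(labels)
    term_df = Counter()       # document frequency per term
    term_cat_df = Counter()   # document frequency per (term, category)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_cat_df[(term, label)] += 1

    scores = {}
    for term, df in term_df.items():
        best = 0.0
        for cat, n_cat in cat_counts.items():
            n11 = term_cat_df[(term, cat)]
            n10 = df - n11
            n01 = n_cat - n11
            n00 = n_docs - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    docs = [["goal", "match"], ["match", "team"],
            ["vote", "party"], ["party", "law"]]
    labels = ["sport", "sport", "politics", "politics"]
    print(select_top_terms(docs, labels, 2))
```

In a pipeline like the one the abstract describes, the selected terms would form the feature set passed to each classifier, and the selection step would be run on Arabic tokens after preprocessing.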

Author information

Corresponding author

Correspondence to Laith Abualigah.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Alshaer, H.N., Otair, M.A., Abualigah, L. et al. Feature selection method using improved CHI Square on Arabic text classifiers: analysis and application. Multimed Tools Appl 80, 10373–10390 (2021). https://doi.org/10.1007/s11042-020-10074-6

  • DOI: https://doi.org/10.1007/s11042-020-10074-6
