Abstract
Text classification could be defined as the way of allocating text into predefined groups according to its contents. Over the past few years, an increase emerged in the volume of information in the varied fields on the Internet, thus making the classification of texts one of the most important, yet challenging. Text classification is commonly employed in numerous applications and for different objectives. The extensive and broad use of the Internet, particularly in the Arab world, as well as the massive number of the documents and pages which are provided in the Arabic language, raised the need for having suitable tools for classification of these pages and documents by their main categories. The aim of this paper to study the effect of the improved CHI (ImpCHI) Square on the performance of six well-known classifiers: Random Forest, Decision Tree, Naïve Bayes, Naïve Bayes Multinomial, Bayes Net, and Artificial Neural Networks. These proposed techniques are quite important for improving classification of Arabic documents and can be regarded as a promising basis for the stage of text classification because it contributes to the classification of the texts into predefined categories. This combination method takes the advantages of more than one technique, which can produce better results in the final outcomes. The dataset employed in this paper includes 9055 Arabic documents that were collected from various Arabic resources. Based on their content, these documents were divided into twelve categories. Four performance evaluation criteria were used: the F-measure, recall, precision, and Time build model. The experimental results show that the use of ImpCHI square gives better classification results than the normal CHI square method with all studied classifiers, in terms of all used performance criteria.







Similar content being viewed by others
References
Abualigah L, Alfar HE, Shehab M, Hussein AMA (2020) Sentiment analysis in healthcare: a brief review. In: Recent advances in NLP:the case of arabic language. Springer, Cham, pp 129–141
Abualigah L, Alsalibi B, Shehab M, Alshinwan M, Khasawneh AM, Alabool H (2020) A parallel hybrid krill herd algorithm for feature selection. Int J Mach Learn Cybern:1–24
Abualigah L, Bashabsheh MQ, Alabool H, Shehab M (2020) Text summarization: a brief review. In: Recent advances in NLP: the case of arabic language. Springer, Cham, pp 1–15
Abualigah L, Diabat A, Geem ZW (2020) A comprehensive survey of the harmony search algorithm in clustering applications. Appl Sci 10(11):3827
Abualigah LMQ (2019) Feature selection and enhanced krill herd algorithm for text document clustering. Springer, Berlin, pp 1–165
Abualigah LMQ, Hanandeh ES (2015) Applying genetic algorithms to information retrieval using vector space model. Int J Comput Sci Eng Appl 5(1):19
Abualigah LM, Khader AT (2017) Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering. J Supercomput 73(11):4773–4795
Abualigah LM, Khader AT, Al-Betar MA, Alomari OA (2017) Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Syst Appl 84:24–36
Abualigah LM, Khader AT, Hanandeh ES (2018) Hybrid clustering analysis using improved krill herd algorithm. Appl Intell 48(11):4047–4071
Abualigah LM, Khader AT, Hanandeh ES (2018) A new feature selection method to improve the document clustering using particle swarm optimization algorithm. J Comput Sci 25:456–466
Abualigah LM, Khader AT, Hanandeh ES (2018) A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Eng Appl Artif Intell 73:111–125
Abualigah L, Shehab M, Diabat A, Abraham A (2020) Selection scheme sensitivity for a hybrid Salp swarm algorithm: analysis and applications. Eng Comput 1–27
Aliwy AH (2012) Tokenization as preprocessing for arabic tagging system. Int J Inform Educ Technol (IJET) 2(4):348
Alshaer H, Alzwahrah B, Otair M (2017) Arabic text classification using Bayes classifiers. Int J Inform Syst Comput Sci
Ayedh A, Tan G, Alwesabi K, Rajeh H (2016) The effect of preprocessing on arabic document categorization. Algorithms 9(2):27
Bahassine S, Madani A, Al-Sarem M, Kissi M (2020) Feature selection using an improved chi-square for Arabic text classification. J King Saud Univ Comp & Info Sci 32(2):225–231
Bahassine S, Madani A, Kissi M (2016) An improved chi-sqaure feature selection for Arabic text classification using decision tree. In 2016 11th international conference on intelligent systems: theories and applications (SITA), IEEE, pp. 1–5
Bawaneh MJ, Alkoffash MS, Al Rabea AI (2008) Arabic text classification using K-NN and naive Bayes. J Comput Sci 4(7):600–605
Chanod JP, Tapanainen P (1996) A non-deterministic tokeniser for finite-state parsing. In: Proceedings of the workshop on extended finite state models of language (ECAI’96)
Chen Y, He F, Li H, Zhang D, Wu Y (2020) A full migration BBO algorithm with enhanced population quality bounds for multimodal biomedical image registration. Appl Soft Comput:106335
Cutler D, Edwards C, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random Forest for classification in ecology. Ecology 88:2783–2792
Gharib TF, Habib MB, Fayed ZT (2009) Arabic text classification using support vector machines. Int J Comput Their Appl 16(4):192–199
Hawashin B, Mansour A, Aljawarneh S (2013) An efficient feature selection method for Arabic text classification. Int J Comput Appl 83(17)
Hmeidi I, Al-Ayyoub M, Abdulla NA, Almodawar AA, Abooraig R, Mahyoub NA (2015) Automatic Arabic text categorization: A comprehensive comparative study. J Inf Sci 41(1):114–124
Jadon E, Sharma R (2017) Data mining: document classification using naive Bayes classifier. Int J Comput Appl 167(6):13–16
Kanan T, Fox EA (2016) Automated arabic text classification with P-S temmer, machine learning, and a tailored news article taxonomy. J Assoc Inf Sci Technol 67(11):2667–2683
McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization 752(1):41–48
Moh'd A, Mesleh A (2007) Chi square feature extraction based SVMs arabic language text categorization system. J Comput Sci 3(6):430–435
Mesleh A (2011) Feature sub-set selection metrics for Arabic text classification. Pattern Recogn Lett 32:1922–1929
Mohana R, Sumathi S (2014) Document classification using multinomial Naïve Bayesian classifier. Int J Sci Eng Technol Res(IJSETR) 3(5):1557–1563
Mohammad AH, Alwada'n T, Al-Momani O (2016) Arabic text categorization using support vector machine, Naïve Bayes and neural network. GSTF Journal on Computing (JoC) 5(1):108
Osisanwo FY, Akinsola JET, Awodele O, Hinmikaiye JO, Olakanmi O, Akinjobi J (2017) Supervised machine learning algorithms: classification and comparison. International Journal of Computer Trends and Technology (IJCTT) 48(3):128–138
Otair MA (2013) Comparative analysis of Arabic stemming algorithms. J Inf Technol Manag 5(2):1–13
Parekh R, Yang J, Honavar V (2000) Constructive neural-network learning algorithms for pattern classification. IEEE Trans Neural Netw 11:436–451
Patra A, Singh D (2013) Neural network approach for text classification using relevance factor as term weighing method. Int J Comput Appl 68(17):37–41
Raho G, Al-Shalabi R, Kanaan G, Nassar A (2015) Different classification algorithms based on Arabic text classification: feature selection comparative study. International Journal of Advanced Computer Science and Applications (IJACSA) 6(2):23–28
Saravanan K, Sasithra S (2014) Review on classification based on artificial neural networks. International Journal of Ambient Systems and Applications (IJASA) 2(4):11–18
Sembok TMT, Ata BA, Bakar ZA (2011) A rule-based Arabic stemming algorithm. Proceedings of the European Computing Conference, pp 392–397
Sharma D, Jain S (2015) Evaluation of stemming and stop word techniques on text classification problem. International Journal of Scientific Research in Computer Science and Engineering (IJSRCSE)) 3(2):1–4
Xu Q, Li M (2019) A new cluster computing technique for social media data analysis. Clust Comput 22(2):2731–2738
Xu Q, Li M, Li M, Liu S (2018) Energy spectrum CT image detection based dimensionality reduction with phase congruency. J Med Syst 42(3):49
Xu Q, Wang Z, Wang F, Li J (2018) Thermal comfort research on human CT data modeling. Multimed Tools Appl 77(5):6311–6326
Xu Q, Li M, Yu M (2019) Learning to rank with relational graph and pointwise constraint for cross-modal retrieval. Soft Comput 23(19):9413–9427
Xu Q, Wang F, Gong Y, Wang Z, Zeng K, Li Q, Luo X (2019) A novel edge-oriented framework for saliency detection enhancement. Image Vis Comput 87:1–12
Zakariah M (2014) Classification of large datasets using random Forest algorithm in various applications: survey. International Journal of Engineering and Innovative Technology (IJJEIT) 4(3))
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Alshaer, H.N., Otair, M.A., Abualigah, L. et al. Feature selection method using improved CHI Square on Arabic text classifiers: analysis and application. Multimed Tools Appl 80, 10373–10390 (2021). https://doi.org/10.1007/s11042-020-10074-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-020-10074-6