Abstract
Feature selection is a technique for selecting the subset of features most relevant to model training. In this paper, a new concept, the term frequency deviation rate (TDR), is first proposed to improve classification accuracy. Then, a TDR-based feature selection algorithm for text classification is presented. Finally, extensive experiments are conducted on seven datasets (K1a, K1b, WAP, R52, R8, 20Newsgroups, and Cade12) with two classifiers, Naive Bayes and Support Vector Machine. The experimental results indicate that the new approach improves classification accuracy by 7.9% on average.
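To make the idea of filter-style feature selection concrete, the following minimal sketch ranks terms and keeps the top k before any classifier is trained. It is an illustration only: the abstract does not define the TDR formula, so the score used here (the largest gap between a term's per-class document rate and its overall document rate) is a stand-in chosen for this example, and the function name `select_top_k_terms` and the toy corpus are hypothetical.

```python
from collections import Counter


def select_top_k_terms(docs, labels, k):
    """Rank terms with a stand-in score and return the k highest-ranked terms.

    NOTE: the score below is an assumption made for illustration.
    It is NOT the TDR measure proposed in the paper.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)                 # class -> number of documents
    overall = Counter()                           # term -> documents containing it
    per_class = {c: Counter() for c in class_sizes}

    for text, label in zip(docs, labels):
        terms = set(text.lower().split())         # whitespace tokenization, presence only
        overall.update(terms)
        per_class[label].update(terms)

    def score(term):
        # Gap between the term's document rate inside a class and its overall rate.
        base = overall[term] / n_docs
        return max(
            abs(per_class[c][term] / class_sizes[c] - base) for c in class_sizes
        )

    # Sort by descending score; break ties alphabetically for a stable result.
    ranked = sorted(overall, key=lambda t: (-score(t), t))
    return ranked[:k]


# Toy corpus (hypothetical data, two classes).
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

selected = select_top_k_terms(docs, labels, k=4)
print(selected)
```

Only the selected terms would then be passed to a classifier such as Naive Bayes or a Support Vector Machine; terms that appear in a single document, such as "pills" in the toy corpus, score lower than terms that consistently mark one class.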
Acknowledgements
The corresponding author gratefully acknowledges support from the National Key Research and Development Plan (Grant 2017YFB1402103), the National Natural Science Foundation of China (Grants 61402363 and 61971347), the Education Department of Shaanxi Province Key Laboratory Project (Grant 15JS079), the Xi’an Science Program Project (Grant 2020KJRC0094), the Ministry of Education of Shaanxi Province Research Project (Grant 17JK0534), and the Beilin District of Xi’an Science and Technology Project (Grant GX1625).
Cite this article
Zhou, H., Ma, Y. & Li, X. Feature selection based on term frequency deviation rate for text classification. Appl Intell 51, 3255–3274 (2021). https://doi.org/10.1007/s10489-020-01937-4
DOI: https://doi.org/10.1007/s10489-020-01937-4