Feature selection based on term frequency deviation rate for text classification

Applied Intelligence

Abstract

Feature selection is a technique for selecting a subset of the most relevant features for model training. In this paper, a new concept, the term frequency deviation rate (TDR), is first proposed to improve classification accuracy. Then, a TDR-based algorithm for text classification is presented. Finally, extensive experiments are conducted on seven datasets (K1a, K1b, WAP, R52, R8, 20Newsgroups, and Cade12) with two classifiers, Naive Bayes and Support Vector Machine. The experimental results indicate that the new approach can improve classification accuracy by an average of 7.9%.

Notes

  1. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.

  2. http://ana.cachopo.org/datasets-for-single-label-text-categorization

Acknowledgements

The corresponding author gratefully acknowledges support from the National Key Research and Development Plan (Grant 2017YFB1402103), the National Natural Science Foundation of China (Grants 61402363 and 61971347), the Education Department of Shaanxi Province Key Laboratory Project (Grant 15JS079), the Xi’an Science Program Project (Grant 2020KJRC0094), the Ministry of Education of Shaanxi Province Research Project (Grant 17JK0534), and the Beilin District of Xi’an Science and Technology Project (Grant GX1625).

Author information

Corresponding author

Correspondence to Hongfang Zhou.

About this article

Cite this article

Zhou, H., Ma, Y. & Li, X. Feature selection based on term frequency deviation rate for text classification. Appl Intell 51, 3255–3274 (2021). https://doi.org/10.1007/s10489-020-01937-4

  • DOI: https://doi.org/10.1007/s10489-020-01937-4
