Feature selection based on term frequency deviation rate for text classification

Zhou, Hongfang; Ma, Yiming; Li, Xiang

doi:10.1007/s10489-020-01937-4

Published: 11 November 2020

Volume 51, pages 3255–3274, (2021)

Applied Intelligence

Abstract

Feature selection is a technique for choosing a subset of the most relevant features for model training. In this paper, a new concept, the term frequency deviation rate (TDR), is first proposed to improve classification accuracy. A TDR-based feature selection algorithm for text classification is then developed. Finally, extensive experiments are conducted on seven datasets (K1a, K1b, WAP, R52, R8, 20Newsgroups, and Cade12) with two classifiers, Naive Bayes and Support Vector Machine. The experimental results indicate that the new approach improves classification accuracy by 7.9% on average.
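
To make the idea of filter-style feature selection concrete, the following is a minimal Python sketch of ranking terms by a score and keeping the top k before training a classifier. The abstract does not give the TDR formula, so the score used here is only a stand-in (the spread between a term's highest and lowest per-class mean frequency) and should not be read as the authors' method. The function name `select_top_k_terms` and the toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def select_top_k_terms(docs, labels, k):
    """Rank terms with a stand-in score and return the k best.

    Stand-in score (NOT the paper's TDR, which the abstract does not define):
    the spread between the highest and lowest per-class mean term frequency.
    """
    # Total term counts per class, and number of documents per class.
    class_term_counts = defaultdict(Counter)
    class_doc_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        class_term_counts[label].update(tokens)

    # Collect every term seen in any class.
    vocabulary = set()
    for counts in class_term_counts.values():
        vocabulary.update(counts)

    scores = {}
    for term in vocabulary:
        # Mean occurrences of the term per document, computed for each class.
        means = [
            class_term_counts[c][term] / class_doc_counts[c]
            for c in class_doc_counts
        ]
        scores[term] = max(means) - min(means)

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus: tokenized documents with class labels.
docs = [
    ["goal", "match", "team", "the"],
    ["team", "goal", "goal", "the"],
    ["stock", "market", "the", "price"],
    ["price", "stock", "stock", "the"],
]
labels = ["sport", "sport", "finance", "finance"]

selected = select_top_k_terms(docs, labels, k=3)
print(selected)
```

In this toy corpus the word "the" occurs equally often in both classes, so its score is zero and it is never selected, while class-specific words such as "goal" and "stock" rank first. A real pipeline would replace the stand-in score with the chosen metric and pass only the selected terms to the classifier.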

Acknowledgements

The corresponding author gratefully acknowledges support from the National Key Research and Development Plan (Grant 2017YFB1402103), the National Natural Science Foundation of China (Grants 61402363 and 61971347), the Education Department of Shaanxi Province Key Laboratory Project (Grant 15JS079), the Xi’an Science Program Project (Grant 2020KJRC0094), the Ministry of Education of Shaanxi Province Research Project (Grant 17JK0534), and the Beilin District of Xi’an Science and Technology Project (Grant GX1625).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Xi’an University of Technology, No. 5 South Jinhua Road, Xi’an, Shaanxi, China
Hongfang Zhou, Yiming Ma & Xiang Li

Corresponding author

Correspondence to Hongfang Zhou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, H., Ma, Y. & Li, X. Feature selection based on term frequency deviation rate for text classification. Appl Intell 51, 3255–3274 (2021). https://doi.org/10.1007/s10489-020-01937-4

Accepted: 11 September 2020
Published: 11 November 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10489-020-01937-4
