A novel feature and class-based globalization technique for text classification

Multimedia Tools and Applications

Abstract

Text classification is an important topic in the current era because of the high volume of textual data that must be handled. Feature selection is one of the most important steps in text classification studies and significantly affects classification performance. In the literature, filter-based global feature selection methods are widely proposed. When these methods are globalized, the globalization generally relies on class information, while feature information is ignored. When calculating the score of each feature, the information of the feature should be taken into account along with the class information. To solve this problem, a new globalization technique called Feature and Class-based Weighted Sum (FCWS), which takes into account both feature and class information, is proposed. The FCWS method is compared with traditional globalization techniques on four datasets, namely Reuters-21578, 20Newsgroup, Enron1 and Polarity, using Support Vector Machines (SVM), Decision Tree (DT) and Multinomial Naive Bayes (MNB) classifiers. Feature dimensions of 50, 100, 300, 500, 1000 and 3000 were employed. Experimental studies on these benchmark datasets show that, in most cases, the proposed method achieves higher Micro-F1 and Macro-F1 scores than the other three methods, namely maximum (MAX), sum (SUM), and weighted-sum (AVG).
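The abstract names the three traditional globalization policies but does not state their formulas. The following is a minimal sketch under the definitions commonly used for them, which are assumed here rather than taken from this article: MAX keeps the largest class-wise (local) score of a feature, SUM adds the local scores, and AVG weights each local score by the prior probability of its class. The function names `globalize` and `select_top_k` and the toy scores are hypothetical. FCWS itself is not sketched because its formula is not given in this section.

```python
def globalize(local_scores, class_priors, policy):
    """Collapse per-class (local) feature scores into one global score per feature.

    local_scores: list of rows, one row per feature, one column per class.
    class_priors: list of class probabilities P(C_j), same length as a row.
    policy: "MAX", "SUM" or "AVG" (assumed definitions, see the text above).
    """
    if policy == "MAX":
        return [max(row) for row in local_scores]
    if policy == "SUM":
        return [sum(row) for row in local_scores]
    if policy == "AVG":
        # Weighted sum: each local score is weighted by its class prior.
        return [sum(p * s for p, s in zip(class_priors, row))
                for row in local_scores]
    raise ValueError(f"unknown policy: {policy}")


def select_top_k(global_scores, k):
    """Return the indices of the k highest-scoring features (ties keep input order)."""
    ranked = sorted(range(len(global_scores)),
                    key=lambda i: global_scores[i],
                    reverse=True)
    return ranked[:k]


# Toy example: 3 features scored against 3 classes (made-up numbers).
local = [
    [0.9, 0.1, 0.0],  # feature 0: strong for class 0 only
    [0.4, 0.4, 0.4],  # feature 1: moderate for every class
    [0.0, 0.2, 0.7],  # feature 2: strong for the rarest class
]
priors = [0.5, 0.3, 0.2]

for policy in ("MAX", "SUM", "AVG"):
    scores = globalize(local, priors, policy)
    print(policy, select_top_k(scores, 2))
```

In this toy case the three policies select different feature pairs, which illustrates why the choice of globalization policy can change the selected feature set at a fixed dimension.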

Data availability

The data that support the findings of this study are openly available. The Reuters-21578 and 20Newsgroup datasets are available in the UCI Machine Learning Repository (Reference 4) at https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection. Enron1 and Polarity are binary-class datasets available in (Reference 20).

References

  1. Agnihotri D, Verma K, Tripathi P (2017) Variable global feature selection scheme for automatic classification of text documents. Expert Syst Appl 81:268–281

  2. Agnihotri D, Verma K, Tripathi P, Singh BK (2019) Soft voting technique to improve the performance of global filter based feature selection in text corpus. Appl Intell 49(4):1597–1619

  3. Ahmed B (2020) Wrapper feature selection approach based on binary firefly algorithm for spam e-mail filtering. J Soft Comput Data Min 1(2):44–52

  4. Asuncion A, Newman D (2007) UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php

  5. Debole F, Sebastiani F (2004) Supervised term weighting for automated text categorization. Text mining and its applications. Springer, Berlin, pp 81–97

  6. Deng X, Li Y, Weng J, Zhang J (2019) Feature selection for text classification: a review. Multimedia Tools Appl 78(3):3797–3816

  7. Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3(Mar):1289–1305

  8. Gupta ST, Sahoo JK, Roul RK (2019) Authorship identification using recurrent neural networks. Proceedings of the 2019 3rd International Conference on Information System and Data Mining, p 133–7

  9. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182

  10. Joachims T (1998) Text categorization with support vector machines: Learning with many relevant features. European conference on machine learning: Springer, Berlin, p 137–42

  11. Khan J, Alam A, Lee Y (2021) Intelligent hybrid feature selection for textual sentiment classification. IEEE Access 9:140590–140608

  12. Khurana A, Verma OP (2020) Novel approach with nature-inspired and ensemble techniques for optimal text classification. Multimedia Tools Appl 79(33):23821–23848

  13. Kou G, Yang P, Peng Y, Xiao F, Chen Y, Alsaadi FE (2020) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836

  14. Kumar A, Bhatia M, Sangwan SR (2022) Rumour detection using deep learning and filter-wrapper feature selection in benchmark twitter dataset. Multimedia Tools Appl 81(24):34615–34632

  15. Madasu A, Elango S (2020) Efficient feature selection techniques for sentiment analysis. Multimedia Tools Appl 79(9):6313–6335

  16. Onan A (2018) An ensemble scheme based on language function analysis and feature engineering for text genre classification. J Inform Sci 44(1):28–47

  17. Özgür A, Özgür L, Güngör T (2005) Text categorization with class-based and corpus-based keyword selection. International Symposium on Computer and Information Sciences: Springer, Berlin, p 606–15

  18. Parlak B (2022) Class‐index corpus‐index measure: A novel feature selection method for imbalanced text data. Concurr Comput Pract Exp 34(21):e7140

  19. Parlak B, Uysal AK (2019) On classification of abstracts obtained from medical journals. J Inf Sci 46(5):648–663

  20. Parlak B, Uysal AK (2020) The effects of globalisation techniques on feature selection for text classification. J Inf Sci 47(6):727–739

  21. Parlak B, Uysal AK (2023) A novel filter feature selection method for text classification: extensive feature selector. J Inf Sci 49(1):59–78

  22. Porter MF (1997). In: Sparck Jones K, Willett P (eds) Readings in information retrieval. Morgan Kaufmann Publishers Inc, San Francisco

  23. Rehman A, Javed K, Babri HA (2017) Feature selection based on a normalized difference measure for text classification. Inf Process Manag 53(2):473–489

  24. Rehman A, Javed K, Babri HA, Asim N (2018) Selection of the most relevant terms based on a max-min ratio metric for text classification. Expert Syst Appl 114:78–96

  25. Rehman A, Javed K, Babri HA, Saeed M (2015) Relative discrimination criterion–A novel feature ranking method for text data. Expert Syst Appl 42(7):3670–3681

  26. Schütze H, Manning CD, Raghavan P (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  27. Shunmugapriya P, Kanmani S (2017) A hybrid algorithm using ant and bee colony optimization for feature selection and classification (AC-ABC hybrid). Swarm Evol Comput 36:27–36

  28. Taşcı Ş, Güngör T (2013) Comparison of text feature selection policies and using an adaptive framework. Expert Syst Appl 40(12):4871–4886

  29. Theodoridis S, Koutroumbas K (2009) Pattern recognition, 4th edn. Academic

  30. Uysal AK (2016) An improved global feature selection scheme for text classification. Expert Syst Appl 43:82–92

  31. Uysal AK (2018) On two-stage feature selection methods for text classification. IEEE Access 6:43233–43251

  32. Uysal AK, Gunal S (2012) A novel probabilistic feature selection method for text classification. Knowl Based Syst 36:226–235

  33. Uysal AK, Gunal S (2014) The impact of preprocessing on text classification. Inf Process Manag 50(1):104–112

  34. Xia T, Chen X (2021) A weighted feature enhanced hidden Markov Model for spam SMS filtering. Neurocomputing 444:48–58

  35. Zhang Z, Hong W-C (2021) Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads. Knowl Based Syst 228:107297

  36. Zong W, Wu F, Chu L-K, Sculli D (2015) A discriminative and semantic feature selection method for text categorization. Int J Prod Econ 165:215–222

Author information

Corresponding author

Correspondence to Bekir Parlak.

Ethics declarations

Conflict of interest

No author associated with this paper has disclosed any potential or pertinent conflict of interest that may be perceived as bearing on this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Parlak, B. A novel feature and class-based globalization technique for text classification. Multimed Tools Appl 82, 37635–37660 (2023). https://doi.org/10.1007/s11042-023-15459-x

  • DOI: https://doi.org/10.1007/s11042-023-15459-x
