
An efficient automatic multiple objectives optimization feature selection strategy for internet text classification

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Research on feature selection in text classification is usually limited to proposing techniques that select the set of features with the highest scores under some metric. The number of selected features is usually determined on a separate validation dataset with a fixed threshold. This may not generalize well to new data, because the best number of selected features varies across datasets. In this paper, we first conduct a detailed analysis and find that simply extracting features by the score a metric assigns is not always the best strategy: it may reduce many documents to zero length, which makes them unsuitable for training. We then model the feature selection process as a multi-objective optimization problem in order to determine the best number of selected features rationally and automatically. In addition, because the optimization process is resource-intensive, we design a parallel algorithm that reduces the running time using dynamic programming. Extensive experiments are performed on several popular datasets, and the results indicate that our proposed approach is effective and feasible.
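The zero-length effect described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy corpus, the use of document frequency as the scoring metric, and the helper names `top_k_features` and `count_empty_docs` are hypothetical and are not taken from the paper, whose actual objectives and algorithm are not given in the abstract.

```python
from collections import Counter

def top_k_features(docs, k):
    """Rank terms by document frequency (a stand-in metric) and keep the top k."""
    df = Counter(term for doc in docs for term in set(doc))
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])

def count_empty_docs(docs, features):
    """Count documents left with no terms once only `features` are kept."""
    return sum(1 for doc in docs if not any(t in features for t in doc))

# Hypothetical toy corpus: three finance documents and two off-topic ones.
docs = [
    ["market", "stock", "price"],
    ["stock", "price", "trade"],
    ["market", "price"],
    ["football", "goal"],
    ["recipe"],
]

vocab_size = len({t for doc in docs for t in doc})

# Number of zero-length documents for every candidate feature count k.
empty_by_k = {
    k: count_empty_docs(docs, top_k_features(docs, k))
    for k in range(1, vocab_size + 1)
}

# Smallest k that leaves no document empty (one possible criterion, assumed here).
smallest_k_without_empties = min(k for k, n in empty_by_k.items() if n == 0)

for k, n in empty_by_k.items():
    print(f"k={k}: {n} empty document(s)")
print("smallest k with no empty documents:", smallest_k_without_empties)
```

In this toy corpus the highest-scoring terms all come from the majority topic, so a small fixed cut-off leaves the minority documents empty. A selection strategy that also tracks how many documents survive, as one objective among several, would avoid that outcome.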


Notes

  1. https://en.wikipedia.org/wiki/Dynamic_programming

  2. www.daviddlewis.com/resources/testcollections/reuters21578/

  3. http://people.csail.mit.edu/jrennie/20Newsgroups/

  4. https://en.wikipedia.org/wiki/Goal_programming

  5. https://en.wikipedia.org/wiki/Multi-objective_optimization

  6. http://en.wikipedia.org/wiki/T-test


Acknowledgements

This work was supported by the National Science Foundation of China (nos. 61772211, 61370229, 61750110516), the Natural Science Foundation of Guangdong Province, China (no. 2015A030310509), the S&T Projects of Guangdong Province, China (nos. 2015A030401087, 2016A030303055, 2016B030305004, 2016B010109008), GDUPS (2015), and the Science and Technology Projects of Guangzhou Municipality, China (nos. 201604010003, 201604016019).

Author information


Corresponding author

Correspondence to Jia Zhu.


About this article


Cite this article

Huang, C., Zhu, J., Liang, Y. et al. An efficient automatic multiple objectives optimization feature selection strategy for internet text classification. Int. J. Mach. Learn. & Cyber. 10, 1151–1163 (2019). https://doi.org/10.1007/s13042-018-0793-x


  • DOI: https://doi.org/10.1007/s13042-018-0793-x
