Abstract
Research on feature selection in text classification is usually limited to proposing techniques that select the set of features with the highest scores under various metrics. The selected features are usually determined using a separate validation dataset with a fixed threshold. Such a threshold may not generalize well to new data, because the best number of selected features varies across datasets. In this paper, we first conduct an in-depth analysis and find that simply extracting features according to the score calculated by a metric is not always the best strategy, as it may reduce many documents to zero length, which makes them unsuitable for training. We then model the feature selection process as a multi-objective optimization problem to obtain the best number of selected features rationally and automatically. In addition, because the optimization process is resource-intensive, we design a parallel algorithm that uses dynamic programming to reduce the running time. Extensive experiments on several popular datasets indicate that the proposed approach is effective and feasible.
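The zero-length document problem described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: document frequency stands in for the scoring metric, and the toy corpus and the function names `top_k_features` and `count_empty_docs` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def top_k_features(docs, k):
    """Rank terms by document frequency (an assumed stand-in metric) and keep the k best."""
    df = Counter(term for doc in docs for term in set(doc))
    # Sort by descending score, then alphabetically so ties break deterministically.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])


def count_empty_docs(docs, selected):
    """Count documents that keep no term once only the selected features remain."""
    return sum(1 for doc in docs if not any(t in selected for t in doc))


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["price", "market", "stock"],
    ["market", "stock", "trade"],
    ["stock", "price"],
    ["goal", "match"],
    ["recipe"],
]

for k in (1, 2, 4, 6):
    selected = top_k_features(docs, k)
    print(k, sorted(selected), count_empty_docs(docs, selected))
```

In this toy setting, a small fixed threshold keeps only the terms of the dominant topic, so documents on rarer topics lose all their features. The number of empty documents falls as `k` grows, which suggests why the feature count and document coverage might be treated as competing objectives.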
Notes
https://en.wikipedia.org/wiki/Dynamic_programming
www.daviddlewis.com/resources/testcollections/reuters21578/
http://people.csail.mit.edu/jrennie/20Newsgroups/
https://en.wikipedia.org/wiki/Goal_programming
https://en.wikipedia.org/wiki/Multi-objective_optimization
http://en.wikipedia.org/wiki/T-test
Acknowledgements
This work was supported by the National Science Foundation of China (nos. 61772211, 61370229, 61750110516), the Natural Science Foundation of Guangdong Province, China (no. 2015A030310509), the S&T Projects of Guangdong Province, China (nos. 2015A030401087, 2016A030303055, 2016B030305004, 2016B010109008), GDUPS (2015), and the Science and Technology Projects of Guangzhou Municipality, China (nos. 201604010003, 201604016019).
About this article
Cite this article
Huang, C., Zhu, J., Liang, Y. et al. An efficient automatic multiple objectives optimization feature selection strategy for internet text classification. Int. J. Mach. Learn. & Cyber. 10, 1151–1163 (2019). https://doi.org/10.1007/s13042-018-0793-x
DOI: https://doi.org/10.1007/s13042-018-0793-x