An efficient automatic multiple objectives optimization feature selection strategy for internet text classification

Huang, Changqin; Zhu, Jia; Liang, Yuzhi; Yang, Min; Fung, Gabriel Pui Cheong; Luo, Junyu

doi:10.1007/s13042-018-0793-x

Original Article
Published: 16 February 2018

Volume 10, pages 1151–1163, (2019)

International Journal of Machine Learning and Cybernetics

Changqin Huang¹,
Jia Zhu¹,²,
Yuzhi Liang³,
Min Yang⁴,
Gabriel Pui Cheong Fung⁵ &
Junyu Luo⁶

Abstract

Research on feature selection in text classification is usually limited to proposing techniques that select the set of features with the highest scores under various metrics. The selected features are usually determined using a separate validation dataset and a fixed threshold. This may not generalize well to new data, because the best number of selected features varies across datasets. In this paper, we first conduct a detailed analysis and find that simply extracting features according to the score calculated by a metric is not always the best strategy, as it may reduce many documents to zero length, which makes them unsuitable for training. We then model the feature selection process as a multi-objective optimization problem to determine the best number of selected features rationally and automatically. In addition, because the optimization process is resource-intensive, we design a parallel algorithm that uses dynamic programming to reduce the running time. Extensive experiments on several popular datasets indicate that the proposed approach is effective and feasible.
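
The zero-length problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a hypothetical toy corpus, uses document frequency as a stand-in for the scoring metric, and combines two assumed objectives (fewer selected features, fewer empty documents) into a weighted sum with an assumed weight `alpha`. The names `score_features`, `empty_docs`, and `choose_k` are illustrative only, and the parallel dynamic programming step mentioned in the abstract is not shown.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["stock", "market", "trade"],
    ["stock", "price", "market"],
    ["market", "stock", "bank"],
    ["football", "goal"],
    ["football", "match"],
]


def score_features(docs):
    """Rank features by document frequency (a stand-in for any scoring metric)."""
    df = Counter(term for doc in docs for term in set(doc))
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(df, key=lambda t: (-df[t], t))


def empty_docs(docs, selected):
    """Count documents left with no tokens once only selected features are kept."""
    return sum(1 for doc in docs if not any(t in selected for t in doc))


def choose_k(docs, alpha=0.5):
    """Pick the number of features k that minimizes an assumed weighted sum of
    (fraction of features kept) and (fraction of documents left empty)."""
    ranked = score_features(docs)
    best_k, best_cost = None, float("inf")
    for k in range(1, len(ranked) + 1):
        selected = set(ranked[:k])
        cost = (alpha * k / len(ranked)
                + (1 - alpha) * empty_docs(docs, selected) / len(docs))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


ranked = score_features(docs)
top2_empty = empty_docs(docs, set(ranked[:2]))  # fixed threshold of two features
best_k = choose_k(docs)                         # trade-off between both objectives
```

On this toy corpus, a fixed threshold of two features leaves two documents empty, whereas the weighted objective selects three features and leaves none empty. A single weighted sum is only one simple way to combine objectives; the paper's own formulation may differ.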

Notes

https://en.wikipedia.org/wiki/Dynamic_programming
www.daviddlewis.com/resources/testcollections/reuters21578/
http://people.csail.mit.edu/jrennie/20Newsgroups/
https://en.wikipedia.org/wiki/Goal_programming
https://en.wikipedia.org/wiki/Multi-objective_optimization
http://en.wikipedia.org/wiki/T-test

Acknowledgements

This work was supported by the National Science Foundation of China (nos. 61772211, 61370229, 61750110516), the Natural Science Foundation of Guangdong Province, China (no. 2015A030310509), the S&T Projects of Guangdong Province, China (nos. 2015A030401087, 2016A030303055, 2016B030305004, 2016B010109008), GDUPS (2015), and the Science and Technology Projects of Guangzhou Municipality, China (nos. 201604010003, 201604016019).

Author information

Authors and Affiliations

Guangdong Engineering Research Center for Smart Learning, South China Normal University, Guangzhou, China
Changqin Huang & Jia Zhu
School of Computer Science, South China Normal University, Guangzhou, China
Jia Zhu
School of Computer Science, The University of Hong Kong, Hong Kong, China
Yuzhi Liang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Min Yang
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
Gabriel Pui Cheong Fung
Department of Computer Science, Sichuan University, Sichuan, China
Junyu Luo

Corresponding author

Correspondence to Jia Zhu.

About this article

Cite this article

Huang, C., Zhu, J., Liang, Y. et al. An efficient automatic multiple objectives optimization feature selection strategy for internet text classification. Int. J. Mach. Learn. & Cyber. 10, 1151–1163 (2019). https://doi.org/10.1007/s13042-018-0793-x

Received: 06 May 2017
Accepted: 10 February 2018
Published: 16 February 2018
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s13042-018-0793-x
