PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection

Zhang, Chun-Xia; Zhang, Jiang-She; Kim, Sang-Woon

doi:10.1007/s00180-016-0652-8

Original Paper
Published: 16 March 2016

Volume 31, pages 1237–1262, (2016)

Computational Statistics

Chun-Xia Zhang¹,
Jiang-She Zhang¹ &
Sang-Woon Kim²

Abstract

Variable selection has consistently been a hot topic in linear regression models, especially when facing high-dimensional data. Variable ranking, an advanced form of selection, is actually more fundamental, since selection can be realized by thresholding once the variables are suitably ranked. In recent years, ensemble learning has gained significant interest in the context of variable selection because of its great potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, a novel ensemble method, PBoostGA, is developed in this paper to implement variable ranking and selection in linear regression models. In PBoostGA, a weight distribution is maintained over the training set and a genetic algorithm is adopted as the base learner. Initially, equal weight is assigned to each instance. Following a weight-updating and ensemble-member-generating mechanism similar to that of AdaBoost.RT, a series of slightly different importance measures is sequentially produced for each variable. Finally, the candidate variables are ordered according to their average importance measure, and the significant variables are then selected by a thresholding rule. Both simulation results and a real data illustration show the effectiveness of PBoostGA in comparison with some existing counterparts. In particular, PBoostGA has a stronger ability to exclude redundant variables.
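The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the fitness criterion, the genetic operators, the importance measure, or the thresholding rule, so the BIC-style score, the uniform crossover, the "fraction of the final population containing a variable" importance, the fixed cutoff, and every function and parameter name below are assumptions made for illustration only. The weight update follows the AdaBoost.RT scheme, in which instances whose absolute relative error does not exceed a threshold `phi` are down-weighted by a power of the weighted error rate.

```python
import numpy as np

def wls_fit(X, y, w, mask):
    """Fitted values of a weighted least-squares fit (with intercept) on masked columns."""
    design = np.column_stack([np.ones(len(y)), X[:, mask]])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return design @ coef

def bic_score(X, y, w, mask):
    """BIC-style fitness, lower is better (assumed criterion, not taken from the paper)."""
    n = len(y)
    wrss = np.sum(w * (y - wls_fit(X, y, w, mask)) ** 2)  # weights sum to 1
    return n * np.log(wrss + 1e-12) + np.log(n) * mask.sum()

def run_ga(X, y, w, rng, pop_size=20, generations=15, mut_rate=0.05):
    """Tiny genetic algorithm over boolean inclusion masks; returns (population, best mask)."""
    p = X.shape[1]
    half = pop_size // 2
    pop = rng.random((pop_size, p)) < 0.5
    for _ in range(generations):
        scores = np.array([bic_score(X, y, w, m) for m in pop])
        survivors = pop[np.argsort(scores)[:half]]            # keep the better half
        pa = survivors[rng.integers(0, half, pop_size - half)]
        pb = survivors[rng.integers(0, half, pop_size - half)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)  # uniform crossover
        children ^= rng.random(children.shape) < mut_rate        # bit-flip mutation
        pop = np.vstack([survivors, children])
    scores = np.array([bic_score(X, y, w, m) for m in pop])
    return pop, pop[np.argmin(scores)]

def pseudo_boost_ranking(X, y, rounds=5, phi=0.05, power=1, seed=0):
    """Average per-variable importance over boosting-style rounds (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.full(n, 1.0 / n)                 # equal initial weight for each instance
    importance = np.zeros(p)
    for _ in range(rounds):
        pop, best = run_ga(X, y, w, rng)
        importance += pop.mean(axis=0)      # assumed importance: inclusion frequency
        # AdaBoost.RT-style update based on absolute relative error
        are = np.abs(wls_fit(X, y, w, best) - y) / np.maximum(np.abs(y), 1e-8)
        eps = w[are > phi].sum()            # weighted error rate
        beta = max(eps, 1e-8) ** power
        w = np.where(are <= phi, w * beta, w)  # shrink weights of well-predicted instances
        w /= w.sum()
    return importance / rounds

def select_by_threshold(importance, cutoff=0.5):
    """Hypothetical thresholding rule: keep variables whose average importance exceeds cutoff."""
    return np.flatnonzero(importance > cutoff)
```

Calling `pseudo_boost_ranking(X, y)` on a design matrix `X` and response `y` returns one average importance per column, and `np.argsort(-importance)` then gives a ranking. Because the relative-error threshold `phi` depends on the scale of `y`, it would need tuning for real data.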

Notes

The authors are grateful to one anonymous referee for providing us with the insight into this.

Acknowledgments

The authors are very grateful to the anonymous referees and the editor for their critical comments, which helped improve the presentation greatly. This research was supported by the National Basic Research Program of China (973 Program, No. 2013CB329406), the National Natural Science Foundation of China (Nos. 11201367, 91230101, 61572393), the National Research Foundation of Korea (NRF-2012R1A1A2041661), and the Basic Research Program of Natural Science of Shaanxi Province of China (No. 2015JQ1002).

Author information

Authors and Affiliations

School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, China
Chun-Xia Zhang & Jiang-She Zhang
Department of Computer Engineering, Myongji University, Yongin, 17058, Republic of Korea
Sang-Woon Kim

Corresponding author

Correspondence to Chun-Xia Zhang.

About this article

Cite this article

Zhang, CX., Zhang, JS. & Kim, SW. PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection. Comput Stat 31, 1237–1262 (2016). https://doi.org/10.1007/s00180-016-0652-8

Received: 13 May 2015
Accepted: 04 March 2016
Published: 16 March 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s00180-016-0652-8

JEL Classification

C15 (Statistical simulation methods: general)
