PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection

  • Original Paper
  • Computational Statistics

Abstract

Variable selection has consistently been a hot topic in linear regression models, especially for high-dimensional data. Variable ranking, an advanced form of selection, is actually more fundamental, since selection can be realized by thresholding once the variables are suitably ranked. In recent years, ensemble learning has attracted significant interest in the context of variable selection because of its great potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, this paper develops a novel ensemble method, PBoostGA, to implement variable ranking and selection in linear regression models. In PBoostGA, a weight distribution is maintained over the training set and a genetic algorithm is adopted as the base learner. Initially, equal weight is assigned to each instance. Following a weight-updating and ensemble-member-generating mechanism similar to that of AdaBoost.RT, a series of slightly different importance measures is sequentially produced for each variable. Finally, the candidate variables are ordered according to their average importance measure, and the significant variables are then selected by a thresholding rule. Both simulation results and a real data illustration show the effectiveness of PBoostGA in comparison with some existing counterparts. In particular, PBoostGA has a stronger ability to exclude redundant variables.
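
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact update formulas, so the reweighting step below follows the standard AdaBoost.RT rule (relative-error threshold `phi`, exponent `power`), the genetic algorithm is replaced by a hypothetical `base_learner` callable that returns one importance value per variable plus predictions, and the "keep variables above the mean importance" threshold is an assumed placeholder for the paper's thresholding rule.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface standing in for the GA base learner:
# (X, y, instance_weights) -> (importance per variable, predictions per instance)
Learner = Callable[
    [Sequence[Sequence[float]], Sequence[float], Sequence[float]],
    Tuple[Sequence[float], Sequence[float]],
]


def adaboost_rt_update(weights: Sequence[float], y_true: Sequence[float],
                       y_pred: Sequence[float], phi: float = 0.1,
                       power: int = 2) -> List[float]:
    """One AdaBoost.RT-style reweighting step (assumed form, see lead-in)."""
    # Absolute relative error per instance; assumes no target is exactly zero.
    are = [abs((p - t) / t) for t, p in zip(y_true, y_pred)]
    # Weighted error rate: total weight of instances whose error exceeds phi.
    eps = sum(w for w, e in zip(weights, are) if e > phi)
    if eps == 0.0:
        # Every instance is well predicted; leave the distribution unchanged.
        return list(weights)
    beta = eps ** power
    # Shrink the weights of well-predicted instances, then renormalise.
    new = [w * beta if e <= phi else w for w, e in zip(weights, are)]
    total = sum(new)
    return [w / total for w in new]


def rank_and_select(importances: Sequence[Sequence[float]]
                    ) -> Tuple[List[float], List[int], List[int]]:
    """Average importance over ensemble members, rank, then threshold."""
    p = len(importances[0])
    m = len(importances)
    avg = [sum(member[j] for member in importances) / m for j in range(p)]
    order = sorted(range(p), key=lambda j: avg[j], reverse=True)
    # Assumed placeholder rule: keep variables above the mean importance.
    threshold = sum(avg) / p
    selected = [j for j in order if avg[j] > threshold]
    return avg, order, selected


def pseudo_boost_rank(X: Sequence[Sequence[float]], y: Sequence[float],
                      base_learner: Learner, n_rounds: int = 10,
                      phi: float = 0.1
                      ) -> Tuple[List[float], List[int], List[int], List[float]]:
    """Sequentially reweight instances and collect importance measures."""
    n = len(y)
    weights = [1.0 / n] * n  # equal initial weight on each instance
    history: List[List[float]] = []
    for _ in range(n_rounds):
        importance, y_pred = base_learner(X, y, weights)
        history.append(list(importance))
        weights = adaboost_rt_update(weights, y, y_pred, phi)
    avg, order, selected = rank_and_select(history)
    return avg, order, selected, weights
```

Passing the learner in as a callable keeps the reweighting and averaging logic separate from whatever search procedure produces the importance values, so a genetic algorithm, or any other weighted selector, could be plugged in without changing the loop.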

Notes

  1. The authors are grateful to one anonymous referee for providing us with the insight into this.

Acknowledgments

The authors are very grateful to the anonymous referees and the editor for their critical comments, which greatly helped improve the presentation. This research was supported by the National Basic Research Program of China (973 Program, No. 2013CB329406), the National Natural Science Foundation of China (Nos. 11201367, 91230101, 61572393), the National Research Foundation of Korea (NRF-2012R1A1A2041661), and the Basic Research Program of Natural Science of Shaanxi Province of China (No. 2015JQ1002).

Author information

Corresponding author

Correspondence to Chun-Xia Zhang.

Cite this article

Zhang, CX., Zhang, JS. & Kim, SW. PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection. Comput Stat 31, 1237–1262 (2016). https://doi.org/10.1007/s00180-016-0652-8
