Variable selection after screening: with or without data splitting?

  • Original Paper
  • Published in: Computational Statistics

Abstract

High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically performed, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size, after which a familiar variable selection technique may be applied. While this approach has become popular, the issue of screening bias has been largely ignored. Screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we examine this screening bias theoretically and numerically, and compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias in variable selection and improve prediction accuracy as well.
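The two-stage scheme discussed in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes marginal-correlation SIS for the screening step, and uses a deliberately simple correlation-threshold rule (our own stand-in for a full selection method such as SCAD or MCP) on the held-out half. The function names, the threshold `thresh`, and the 50/50 split are all illustrative choices.

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure independence screening: rank predictors by absolute marginal
    correlation with the response y and keep the indices of the top d."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

def screen_then_select_split(X, y, d, thresh=0.3, seed=0):
    """Data-splitting scheme: screen on one half of the sample, then select
    on the other half, so the selection step is not driven by the same
    noise that drove the screening.  The selection rule here (marginal
    correlation above `thresh` on the held-out half) is an illustrative
    stand-in for a full variable-selection method."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half1, half2 = idx[: len(y) // 2], idx[len(y) // 2:]
    kept = sis_screen(X[half1], y[half1], d)        # screening half
    X2, y2 = X[half2][:, kept], y[half2]            # selection half
    Xc = X2 - X2.mean(axis=0)
    yc = y2 - y2.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return kept[corr > thresh]
```

Because the screening half never sees the selection half's noise, a spurious predictor that happened to correlate with the response in the screening half must clear the threshold again on fresh data, which is the mechanism by which data splitting reduces the screening bias.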


References

  • Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5:232–253

  • Bühlmann P, Mandozzi J (2014) High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29:407–430

  • Chen L, Yang Y (2010) Combining statistical procedures. Frontiers of Statistics. In: Cai T, Shen X (eds) High-dimensional data analysis, vol 2. World Scientific Publishing, Singapore

  • Clarke B (2003) Comparing Bayes and non-Bayes model averaging when model approximation error cannot be ignored. J Mach Learn Res 4:683–712

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70:849–911

  • Hoeting J, Madigan D, Raftery A, Volinsky C (1999) Bayesian model averaging: a tutorial (with discussion). Stat Sci 14:382–417

  • Huang J, Ma S, Zhang C (2008) Adaptive Lasso for sparse high-dimensional regression models. Stat Sin 18:1603–1618

  • Leng C, Wang H (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911

  • Meinshausen N, Meier L, Bühlmann P (2009) p-values for high-dimensional regression. J Am Stat Assoc 104:1671–1681

  • Scheetz TE, Kim K-YA, Swiderski RE, Philip AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103:14429–14434

  • Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58:267–288

  • Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37:2178–2201

  • Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937–950

  • Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942

  • Zhang W, Xia Y (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911

Author information

Corresponding author

Correspondence to Yuhong Yang.

Additional information

The authors thank Ying Nan for sharing her computer code related to their work. They are also grateful to a referee and the editors for their very helpful comments, which improved the paper. The research was partially supported by NSF Grant DMS-1106576.

About this article

Cite this article

Zhu, X., Yang, Y. Variable selection after screening: with or without data splitting?. Comput Stat 30, 191–203 (2015). https://doi.org/10.1007/s00180-014-0528-8
