Abstract
High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically done, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size, after which one may apply a familiar variable selection technique. While this approach has become popular, the issue of screening bias has been largely ignored. Screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we examine this screening bias both theoretically and numerically, and compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias in variable selection and improve prediction accuracy as well.
References
Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5:232–253
Bühlmann P, Mandozzi J (2014) High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29:407–430
Chen L, Yang Y (2010) Combining statistical procedures. In: Cai T, Shen X (eds) High-dimensional data analysis. Frontiers of statistics, vol 2. World Scientific Publishing, Singapore
Clarke B (2003) Comparing Bayes and non-Bayes model averaging when model approximation error cannot be ignored. J Mach Learn Res 4:683–712
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70:849–911
Hoeting J, Madigan D, Raftery A, Volinsky C (1999) Bayesian model averaging: a tutorial (with discussion). Stat Sci 14:382–417
Huang J, Ma S, Zhang C (2008) Adaptive Lasso for sparse high-dimensional regression models. Stat Sin 18:1603–1618
Leng C, Wang H (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911
Meinshausen N, Meier L, Bühlmann P (2009) \(p\)-values for high-dimensional regression. J Am Stat Assoc 104:1671–1681
Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103:14429–14434
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58:267–288
Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37:2178–2201
Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937–950
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang W, Xia Y (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911
Additional information
The authors thank Ying Nan for sharing her computer codes related to their work. A referee and the editors are appreciated for their very helpful comments on improving the paper. The research was partially supported by the NSF Grant DMS-1106576.
Cite this article
Zhu, X., Yang, Y. Variable selection after screening: with or without data splitting? Comput Stat 30, 191–203 (2015). https://doi.org/10.1007/s00180-014-0528-8