Abstract
High-dimensional data sets are now frequently encountered in many scientific fields. In order to select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically done, often informally. Fan and Lv (J R Stat Soc Ser B 70:849–911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size, after which one may apply a familiar variable selection technique. While this approach has become popular, the issue of screening bias has been largely ignored. Screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we examine this screening bias both theoretically and numerically, and compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias in variable selection and improve prediction accuracy as well.
References
Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5:232–253
Bühlmann P, Mandozzi J (2014) High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29:407–430
Chen L, Yang Y (2010) Combining statistical procedures. In: Cai T, Shen X (eds) High-dimensional data analysis. Frontiers of statistics, vol 2. World Scientific Publishing, Singapore
Clarke B (2003) Comparing Bayes and non-Bayes model averaging when model approximation error cannot be ignored. J Mach Learn Res 4:683–712
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70:849–911
Hoeting J, Madigan D, Raftery A, Volinsky C (1999) Bayesian model averaging: a tutorial (with discussion). Stat Sci 14:382–417
Huang J, Ma S, Zhang C (2008) Adaptive Lasso for sparse high-dimensional regression models. Stat Sin 18:1603–1618
Leng C, Wang H (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911
Meinshausen N, Meier L, Bühlmann P (2009) \(p\)-values for high-dimensional regression. J Am Stat Assoc 104:1671–1681
Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103:14429–14434
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58:267–288
Wasserman L, Roeder K (2009) High-dimensional variable selection. Ann Stat 37:2178–2201
Yang Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92:937–950
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zhang W, Xia Y (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70:849–911
Additional information
The authors thank Ying Nan for sharing her computer codes related to their work. A referee and the editors are appreciated for their very helpful comments on improving the paper. The research was partially supported by the NSF Grant DMS-1106576.
Cite this article
Zhu, X., Yang, Y. Variable selection after screening: with or without data splitting? Comput Stat 30, 191–203 (2015). https://doi.org/10.1007/s00180-014-0528-8