Abstract
Independence screening procedures play a vital role in variable selection when the number of variables is massive. However, the high dimensionality of the data brings many challenges, such as multicollinearity or high (possibly spurious) correlation between the covariates, which makes marginal correlation unreliable as a measure of association between the covariates and the response. We propose a novel and simple screening procedure called Gram–Schmidt screening (GSS) that integrates classical Gram–Schmidt orthogonalization with the sure independence screening technique, accounting for high correlations between the covariates in a data-driven way. GSS can successfully discriminate between relevant and irrelevant variables, achieving a high true positive rate without including many irrelevant or redundant variables, and thus offers a new perspective on screening when the covariates are highly correlated. The practical performance of GSS is demonstrated through comparative simulation studies and the analysis of two real datasets.
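To illustrate the general idea, here is a minimal sketch (not the authors' exact GSS algorithm, whose details are in the paper) of screening via Gram–Schmidt orthogonalization: at each step the covariate whose orthogonalized version has the largest absolute correlation with the response is selected, and the remaining covariates are orthogonalized against it, so that already-explained (correlated) directions cannot be selected again. The function name and interface below are illustrative assumptions.

```python
import numpy as np

def gram_schmidt_screen(X, y, k):
    """Sketch of Gram-Schmidt-based forward screening.

    At each step, pick the covariate whose component orthogonal to the
    already-selected columns is most correlated with the response, then
    project that component out of the remaining covariates.
    """
    n, p = X.shape
    Xr = X - X.mean(axis=0)            # centered working copy of covariates
    yr = y - y.mean()                  # centered response
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(Xr, axis=0)
        norms[norms < 1e-12] = np.inf  # guard columns already projected to zero
        scores = np.abs(Xr.T @ yr) / norms
        if selected:
            scores[selected] = -np.inf # never reselect a chosen covariate
        j = int(np.argmax(scores))
        selected.append(j)
        # Gram-Schmidt step: remove the chosen direction from all columns
        q = Xr[:, j] / np.linalg.norm(Xr[:, j])
        Xr = Xr - np.outer(q, q @ Xr)
    return selected
```

Because each newly selected covariate is orthogonalized against the previous ones, a redundant variable that is highly correlated with an already-selected one receives a near-zero score, which is exactly the behavior plain marginal (SIS-style) screening lacks.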
Acknowledgements
The authors extend grateful thanks to the Editors and reviewers, whose comments have greatly improved the scope and presentation of the paper, and to Prof. Yuhong Yang and Yingying Ma for their valuable suggestions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 71420107025, 11701023).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Wang, H., Liu, R., Wang, S. et al. Ultra-high dimensional variable screening via Gram–Schmidt orthogonalization. Comput Stat 35, 1153–1170 (2020). https://doi.org/10.1007/s00180-020-00963-7