Abstract
In high-dimensional linear regression, selecting an appropriate tuning parameter is essential for penalized linear models. From the perspective of the expected prediction error of the model, cross-validation methods are commonly used in machine learning to select the tuning parameter. In this paper, blocked \(3\times 2\) cross-validation (\(3\times 2\) BCV) is proposed as the tuning parameter selection method because of the small variance of its prediction error estimate. Under conditions weaker than those required by leave-\(n_v\)-out cross-validation, the tuning parameter selection method based on \(3\times 2\) BCV is proved to be consistent for the high-dimensional linear regression model. Furthermore, simulated and real data experiments support the theoretical results and demonstrate that the proposed method performs well on several criteria for selecting the true model.
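To make the procedure concrete, the following is a minimal sketch (not the authors' code) of tuning parameter selection for the lasso by a \(3\times 2\)-type cross-validation. The blocked construction of the three half-splits used in the paper is simplified here to independent random halves, and the function name `bcv_3x2_select` and the lambda grid are illustrative assumptions.

```python
# Minimal sketch: pick the penalty parameter by averaging the validation error
# over three replications of a half/half (2-fold) split.  The blocked splitting
# scheme of the paper is simplified to independent random halves.
import numpy as np
from sklearn.linear_model import Lasso

def bcv_3x2_select(X, y, lambdas, seed=0):
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    # three replications of a half/half split
    splits = []
    for _ in range(3):
        perm = rng.permutation(n)
        splits.append((perm[: n // 2], perm[n // 2:]))
    cv_errors = []
    for lam in lambdas:
        err = 0.0
        for s1, s2 in splits:
            for train, val in ((s1, s2), (s2, s1)):
                fit = Lasso(alpha=lam, max_iter=10000).fit(X[train], y[train])
                err += np.mean((y[val] - fit.predict(X[val])) ** 2)
        cv_errors.append(err / 6.0)  # average over the 6 train/validation pairs
    return lambdas[int(np.argmin(cv_errors))]

# toy usage with a sparse true coefficient vector
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 50))
beta = np.zeros(50)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(200)
print(bcv_3x2_select(X, y, np.logspace(-3, 0, 20)))
```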
References
Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95:759–771
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the LASSO. Ann Stat 34(3):1436–1462
Ng S (2013) Variable selection in predictive regressions. Handb Econ Forecast 2B:753–789
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58:267–288
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67:301–320
Zou H, Zhang HH (2009) On the adaptive elastic-net with a diverging number of parameters. Ann Stat 37(4):1733–1751
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Wang H, Li B, Leng C (2009) Shrinkage tuning parameter selection with a diverging number of parameters. J R Stat Soc Ser B 71:671–683
Alpaydin E (1999) Combined 5 \(\times \) 2 cv F test for comparing supervised classification learning algorithms. Neural Comput 11(8):1885–1892
Yang Y (2007) Consistency of cross validation for comparing regression procedures. Ann Stat 35:2450–2473
Wang Y, Wang R, Jia H, Li J (2014) Blocked \(3\times 2\) cross-validated t-test for comparing supervised classification learning algorithms. Neural Comput 26(1):208–235
Zhang Y, Yang Y (2015) Cross-validation for selecting a model selection procedure. J Econom 187(1):95–112
Dietterich T (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1924
Feng Y, Yu Y (2013) Consistent cross-validation for tuning parameter selection in high-dimensional variable selection. In: World statistics congress
Rao C, Wu Y (1989) A strongly consistent procedure for model selection in a regression problem. Biometrika 76:369–374
Wang T, Zhu L (2011) Consistent tuning parameter selection in high dimensional sparse linear regression. J Multivar Anal 102:1141–1151
Fan J, Guo S, Hao N (2012) Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J R Stat Soc Ser B 74(1):37–65
Shao J (1993) Linear model selection by cross-validation. J Am Stat Assoc 88:486–494
Wang Y, Li J, Li Y (2017) Choosing between two classification learning algorithms based on calibrated balanced 5\(\times \) 2 cross-validated F-test. Neural Process Lett 46(1):1–13
Wang R, Wang Y, Li J, Yang X, Yang J (2017) Block-regularized \(m \times 2\) cross-validated estimator of the generalization error. Neural Comput 29(2):519–544
Yang Y (2006) Comparing learning methods for classification. Stat Sin 16:635–657
Zhang C, Huang J (2008) The sparsity and bias of the LASSO selection in high dimensional linear regression. Ann Stat 36(4):1567–1594
Buza K (2014) Feedback prediction for blogs. In: Spiliopoulou M, Schmidt-Thieme L, Janning R (eds) Data analysis, machine learning and knowledge discovery. Springer International Publishing, New York, pp 145–152
Lalley SP (2013) Concentration inequalities. http://www.stat.uchicago.edu/~lalley/Courses/386/Concentration.pdf
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61806115), the Shanxi Applied Basic Research Program (201801D211002) and the National Statistical Science Research Project (2017LY04).
Appendix
We first present a lemma [18, 28] used in the proof of the theorem.
Lemma 1
(Gaussian concentration) Let P be the standard Gaussian probability measure on \(R^{n}\) (that is, the distribution of an \(N(0,I_n)\) random vector), and let \(F: R^n \rightarrow R\) be a Lipschitz function in each variable separately, relative to the Euclidean metric, with Lipschitz constant \(c\). Then for every \(t>0\),
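For reference, a standard two-sided form of this inequality, for \(F\) Lipschitz with constant \(c\) with respect to the Euclidean norm (the constants used in the paper's statement of the lemma may differ), is

\[
P\bigl\{\,|F-E_P F|\ge t\,\bigr\}\;\le\; 2\exp \Bigl(-\frac{t^{2}}{2c^{2}}\Bigr).
\]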
Proof of Theorem 1
We need to prove that the model \(M_\alpha \) corresponding to the selected \({\hat{\lambda }}_n\) is the optimal model \(M_{\alpha _*}\) as \(n\rightarrow \infty \). Let
where \({\tilde{\beta }}_{t^k_i,\alpha }=(X^T_{t^k_i,\alpha } X_{t^k_i,\alpha })^{-1} X^T_{t^k_i,\alpha } y_{t^k_i}\). Then, we need to prove that the following holds for all \(\epsilon >0\):
where \({\hat{\mu }}_{\alpha _*}\) is the prediction error of the blocked \(3 \times 2\) cross-validation on the optimal \(\alpha _*\).
First, for all \(M_\alpha \in M\), \(\lambda \in \varLambda \),
where \(A=\frac{1}{3}\sum _{i=1}^3\sum _{k=1}^2\frac{1}{n} \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2 = \frac{1}{3}\sum _{i=1}^3\frac{1}{n} \Vert y-X \beta ^0 \Vert ^2=\frac{1}{n} \Vert y-X\beta ^0 \Vert ^2\), \(B_\alpha =\frac{1}{3}\sum _{i=1}^3\frac{1}{n} (\sum _{k=1}^2 \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha \Vert ^2 - \sum _{k=1}^2 \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2)=\frac{1}{3}\sum _{i=1}^3 \frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X \beta ^0 \Vert ^2)=\frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X \beta ^0 \Vert ^2).\)
From Wilks’s theorem, it is known that, if \(\alpha \in M\setminus M_c\), then as \(n \rightarrow \infty \) we have \(-2nB_\alpha \sim \sigma ^2 \chi ^2(k_\alpha )\), where \(k_\alpha =d_0-d_{\alpha 0}\), \(d_{\alpha 0}:=\#\{j: \beta _j \in \alpha \bigcap \alpha _0\}\). This means \(E B_\alpha =- \frac{\sigma ^2 k_\alpha }{2n}\). Otherwise, if \(\alpha \in M_c\), we have \(B_\alpha =\frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X_\alpha \beta _\alpha ^0 \Vert ^2)\), so \(E B_\alpha =O(\frac{1}{n})\). In the following, \(C_\alpha \) is analyzed in detail.
Define \(u_{t^k_i}(\gamma )=X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha }\gamma )\), then \(u_{t^k_i}({\tilde{\beta }}_{t^k_i,\alpha })=0\), \(u_{t^k_i}({\tilde{\beta }}_\alpha ) = X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha }{\tilde{\beta }}_\alpha )\). By Taylor expansion, we have
Then,
By \({\dot{u}}_{t^k_i}(\gamma )=-X^T_{t^k_i,\alpha }X_{t^k_i,\alpha }\), we get \({\dot{u}}_{t^k_i}({\tilde{\beta }}_\alpha )=-X^T_{t^k_i,\alpha }X_{t^k_i,\alpha }\), then,
Let \(b(z)=z^T z\), then by Taylor expansion,
and since \({\dot{b}}(z)=2z\), \(\ddot{b}(z)=2I\) (where \(I\) is the identity matrix), we get \({\dot{b}}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )=2X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha \), \(\ddot{b}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )=2I\). Then, the above formula becomes
Moreover,
From Conditions 3 and 5, we have,
Then,
Moreover, for \({\hat{\mu }}_{\alpha }\),
and if \(\alpha \in M_c\), then \({\hat{\mu }}_\alpha =\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\frac{2}{n} d_\alpha \sigma ^2+o(\frac{1}{n})\). If \(\alpha \in M\backslash M_c\), then \({\hat{\mu }}_\alpha =\frac{1}{n} \Vert y-X\beta ^0 \Vert ^2- \frac{\sigma ^2 k_\alpha }{2n} -\frac{2}{n} d_\alpha \sigma ^2+o(\frac{1}{n})\).
For all \(\alpha \in M\), where \(\alpha _*\) is the optimal model, we have
For Eq. (16), let \(P_\alpha =X_\alpha (X^T_\alpha X_\alpha )^{-1} X^T_\alpha \) and \(P_{\alpha _*} = X_{\alpha _*} (X^T_{\alpha _*} X_{\alpha _*})^{-1} X^T_{\alpha _*}\). Then we have
Since \(\alpha _* \in M_c\), we obtain \(X_{\alpha _*}\beta ^0_{\alpha _*}= X \beta ^0\). Moreover, we have \((\beta ^0)^T X^T P_{\alpha _*} X\beta ^0=(\beta ^0)^T X^T X \beta ^0\) and \((\beta ^0)^T X^T P_{\alpha _*} \varepsilon =(\beta ^0)^T X^T \varepsilon \). Then, the last equation becomes the following:
First, for the third term on the right side of Eq. (17), let \(G(\varepsilon )=\frac{1}{n}\varepsilon ^T P_{\alpha _*}\varepsilon \). By the symmetry and idempotence of \(P_{\alpha _*}\), there is an orthogonal matrix \(Q=(q_1, q_2, \ldots , q_n)\) such that \(P_{\alpha _*}=Q I_{d_{\alpha _*},n} Q^T\), where \(I_{d_{\alpha _*},n}\) is an \(n\)-dimensional diagonal matrix with the first \(d_{\alpha _*}\) diagonal elements equal to 1 and 0 elsewhere, and \(d_{\alpha _*}=tr(P_{\alpha _*})\). Then, we have,
By Taylor expansion of the vector function, for all \(\varepsilon ^*_1, \varepsilon ^*_2\), we get
Moreover, in the last equation \(\Vert (\sum _{i=1}^{d_{\alpha _*}} q_i q_i^T)(\varepsilon ^*_1-\varepsilon ^*_2)\Vert _2 \le \sum _{i=1}^{d_{\alpha _*}} \Vert q_i q_i^T\Vert _2 \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\). For all \(i \in \{1,2,\ldots ,d_{\alpha _*}\}\), \(q_i\) is the unit eigenvector of \(P_{\alpha _*}\) corresponding to the eigenvalue 1, so \(q_i^T q_i=1\) and \(q_i q_i^T\) is a symmetric idempotent matrix. Hence \(\Vert q_i q_i^T\Vert _2=1\), and we get \(\Vert (\sum _{i=1}^{d_{\alpha _*}} q_i q_i^T)(\varepsilon ^*_1-\varepsilon ^*_2)\Vert _2 \le d_{\alpha _*} \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\). Thus, we obtain the following:
where \(\varepsilon _j\) is the \(j\)th element of \(\varepsilon ^*_2\), i.e., \(\varepsilon ^*_2=(\varepsilon _1, \varepsilon _2, \ldots , \varepsilon _n)^T\). Moreover, if \(\sup _{1\le j \le n}|\varepsilon _j|\le n^{\frac{1}{4}}\), then \(\Vert G(\varepsilon ^*_1)-G(\varepsilon ^*_2)\Vert _2 \le \frac{2d_{\alpha _*}}{\root 4 \of {n}} \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\), i.e., \(G\) is a Lipschitz function with Lipschitz constant \(\frac{2d_{\alpha _*}}{\root 4 \of {n}}\). We now prove that \(\lim _{n\rightarrow \infty }P(\sup _{1\le j \le n}|\varepsilon _j|\le n^{\frac{1}{4}})=1\).
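Before doing so, the projection-matrix facts used above (symmetry, idempotence, 0/1 eigenvalues, and trace equal to the model dimension) can be checked numerically; the sketch below uses an arbitrary hypothetical design sub-matrix, not data from the paper.

```python
# Numerical sanity check (illustrative only) of the properties of the projection
# matrix P = X_alpha (X_alpha^T X_alpha)^{-1} X_alpha^T used in the proof.
import numpy as np

rng = np.random.default_rng(1)
X_alpha = rng.standard_normal((30, 4))             # hypothetical design sub-matrix
P = X_alpha @ np.linalg.inv(X_alpha.T @ X_alpha) @ X_alpha.T
eigvals, Q = np.linalg.eigh(P)                     # P = Q diag(eigvals) Q^T, Q orthogonal
print(np.allclose(P, P.T), np.allclose(P, P @ P))  # symmetric and idempotent: True True
print(np.allclose(eigvals[-4:], 1.0),              # d_alpha = 4 eigenvalues equal to 1
      np.allclose(eigvals[:-4], 0.0))              # the remaining n - d_alpha equal 0
print(np.isclose(np.trace(P), 4.0))                # trace(P) equals the model dimension
```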
Since \(\varepsilon _j \sim N(0,\sigma ^{2})\), for all \(t>0\) we have:
Moreover:
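A sketch of the standard argument behind these two steps (a Gaussian tail bound followed by a union bound; the constants may differ from the paper's displays) is

\[
P(|\varepsilon _j|>t)\le 2\exp \Bigl(-\frac{t^{2}}{2\sigma ^{2}}\Bigr),\qquad
P\Bigl(\sup _{1\le j\le n}|\varepsilon _j|>n^{\frac{1}{4}}\Bigr)\le \sum _{j=1}^{n}P\bigl(|\varepsilon _j|>n^{\frac{1}{4}}\bigr)\le 2n\exp \Bigl(-\frac{\sqrt{n}}{2\sigma ^{2}}\Bigr)\rightarrow 0,
\]

so that \(\lim _{n\rightarrow \infty }P(\sup _{1\le j \le n}|\varepsilon _j|\le n^{\frac{1}{4}})=1\),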
i.e., \(G\) is a Lipschitz function with Lipschitz constant \(\frac{2d_{\alpha _*}}{\root 4 \of {n}}\) with probability tending to 1. Thus, by Lemma 1, for all \(t>0\), we have,
This shows that \(G-E(G) \rightarrow 0\) in probability. Since \(E(G)=\frac{\sigma ^2 d_{\alpha _*}}{n}\), we obtain \(G-\frac{\sigma ^2 d_{\alpha _*}}{n} \rightarrow 0\) in probability. Likewise, letting \(F(\varepsilon )=\frac{1}{n}\varepsilon ^T P_\alpha \varepsilon \), we have \(F-\frac{\sigma ^2 d_\alpha }{n} \rightarrow 0\) in probability, where \(d_\alpha =tr(P_\alpha )\). In addition, since the mean of the second term on the right side of Eq. (17) is \(E(\frac{2}{n}(\beta ^0)^T X^T(I_n-P_\alpha )\varepsilon )=0\), Eq. (17) becomes,
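For concreteness, applying the standard form of Lemma 1 sketched above with Lipschitz constant \(c=\frac{2d_{\alpha _*}}{\root 4 \of {n}}\), and assuming \(d_{\alpha _*}=o(n^{\frac{1}{4}})\), would give, for every fixed \(t>0\),

\[
P\bigl(\,|G(\varepsilon )-E G(\varepsilon )|\ge t\,\bigr)\le 2\exp \Bigl(-\frac{t^{2}\sqrt{n}}{8\,d_{\alpha _*}^{2}}\Bigr)\rightarrow 0,
\]

which is what yields the convergence in probability used above.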
According to Eq. (18), Eq. (16) can be rewritten as the following:
By Condition 3 of the theorem and \(p=O(n^\gamma )\), \(\gamma \in (0,1)\), the right side of Eq. (19) converges to 0 as \(n \rightarrow \infty \). \(\square \)
Cite this article
Yang, X., Wang, Y., Wang, R. et al. Tuning Parameter Selection Based on Blocked \(3\times 2\) Cross-Validation for High-Dimensional Linear Regression Model. Neural Process Lett 51, 1007–1029 (2020). https://doi.org/10.1007/s11063-019-10105-w