
Tuning Parameter Selection Based on Blocked \(3\times 2\) Cross-Validation for High-Dimensional Linear Regression Model


Abstract

In high-dimensional linear regression, selecting an appropriate tuning parameter is essential for penalized linear models. From the perspective of the expected prediction error of the model, cross-validation methods are commonly used in machine learning to select the tuning parameter. In this paper, blocked \(3\times 2\) cross-validation (\(3\times 2\) BCV) is proposed as the tuning parameter selection method because of the small variance of its prediction error estimate. Under conditions weaker than those required for leave-\(n_v\)-out cross-validation, the tuning parameter selection method based on \(3\times 2\) BCV is proved to be consistent for the high-dimensional linear regression model. Furthermore, simulated and real-data experiments support the theoretical results and demonstrate that the proposed method performs well on several criteria for selecting the true model.
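As a rough illustration of the procedure described above, the following is a minimal sketch of tuning parameter selection by blocked \(3\times 2\) cross-validation, assuming scikit-learn's Lasso as the penalized estimator and plain random half-splits (the paper's blocked construction additionally balances the overlap between the three splits). All function names are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def three_by_two_splits(n, rng):
    """Three random half/half partitions of {0,...,n-1}; each half serves
    once as the training block and once as the validation block."""
    splits = []
    for _ in range(3):
        idx = rng.permutation(n)
        half1, half2 = idx[:n // 2], idx[n // 2:]
        splits.extend([(half1, half2), (half2, half1)])
    return splits  # 6 (train, validation) pairs

def bcv32_error(X, y, lam, splits):
    """Average squared validation error over the 6 train/validation pairs."""
    errs = []
    for train, val in splits:
        fit = Lasso(alpha=lam).fit(X[train], y[train])
        errs.append(np.mean((y[val] - fit.predict(X[val])) ** 2))
    return np.mean(errs)

def select_lambda(X, y, lambdas, seed=0):
    """Pick the tuning parameter minimizing the 3x2 BCV prediction error."""
    splits = three_by_two_splits(X.shape[0], np.random.default_rng(seed))
    return min(lambdas, key=lambda lam: bcv32_error(X, y, lam, splits))
```

The same six splits are reused for every candidate \(\lambda \), so the estimated prediction-error curve is comparable across the grid.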



References

  1. Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95:759–771

  2. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360

  3. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York

  4. Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the LASSO. Ann Stat 34(3):1436–1462

  5. Ng S (2013) Variable selection in predictive regressions. Handb Econ Forecast 2B:753–789

  6. Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58:267–288

  7. Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38:894–942

  8. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67:301–320

  9. Zou H, Zhang HH (2009) On the adaptive elastic-net with a diverging number of parameters. Ann Stat 37(4):1733–1751

  10. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723

  11. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464

  12. Wang H, Li B, Leng C (2009) Shrinkage tuning parameter selection with a diverging number of parameters. J R Stat Soc Ser B 71:671–683

  13. Alpaydin E (1999) Combined 5\(\times \)2 cv F test for comparing supervised classification learning algorithms. Neural Comput 11(8):1885–1892

  14. Yang Y (2007) Consistency of cross validation for comparing regression procedures. Ann Stat 35:2450–2473

  15. Wang Y, Wang R, Jia H, Li J (2014) Blocked \(3\times 2\) cross-validated t-test for comparing supervised classification learning algorithms. Neural Comput 26(1):208–235

  16. Zhang Y, Yang Y (2015) Cross-validation for selecting a model selection procedure. J Econom 187(1):95–112

  17. Dietterich T (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1924

  18. Feng Y, Yu Y (2013) Consistent cross-validation for tuning parameter selection in high-dimensional variable selection. In: World statistics congress

  19. Rao C, Wu Y (1989) A strongly consistent procedure for model selection in a regression problem. Biometrika 76:369–374

  20. Wang T, Zhu L (2011) Consistent tuning parameter selection in high dimensional sparse linear regression. J Multivar Anal 102:1141–1151

  21. Fan J, Guo S, Hao N (2012) Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J R Stat Soc Ser B 74(1):37–65

  22. Shao J (1993) Linear model selection by cross-validation. J Am Stat Assoc 88:486–494

  23. Wang Y, Li J, Li Y (2017) Choosing between two classification learning algorithms based on calibrated balanced 5\(\times \)2 cross-validated F-test. Neural Process Lett 46(1):1–13

  24. Wang R, Wang Y, Li J, Yang X, Yang J (2017) Block-regularized \(m \times 2\) cross-validated estimator of the generalization error. Neural Comput 29(2):519–544

  25. Yang Y (2006) Comparing learning methods for classification. Stat Sin 16:635–657

  26. Zhang C, Huang J (2008) The sparsity and bias of the LASSO selection in high dimensional linear regression. Ann Stat 36(4):1567–1594

  27. Buza K (2014) Feedback prediction for blogs. In: Spiliopoulou M, Schmidt-Thieme L, Janning R (eds) Data analysis, machine learning and knowledge discovery. Springer International Publishing, New York, pp 145–152

  28. Lalley SP (2013) Concentration inequalities. http://www.stat.uchicago.edu/~lalley/Courses/386/Concentration.pdf


Acknowledgements

This work was supported by the National Natural Science Funds of China (61806115), Shanxi Applied Basic Research Program (201801D211002) and National Statistical Science Research Project (2017LY04).


Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

We first present a lemma [18, 28] used in the proof of the theorem.

Lemma 1

(Gaussian concentration) Let P be the standard Gaussian probability measure on \(R^{n}\) (that is, the distribution of a \(N(0,I_n)\) random vector), and let \(F: R^n \rightarrow R\) be a Lipschitz function in each variable separately relative to the Euclidean metric, with Lipschitz constant c. Then for every \(t>0\),

$$\begin{aligned} P\{|F-E(F)|>t\}\le 2\exp \left( -\frac{t^{2}}{c^{2}\pi ^{2}}\right) . \end{aligned}$$
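As a quick numerical sanity check of Lemma 1 (not taken from the paper), the sketch below simulates \(F(x)=\max _i x_i\), which is 1-Lipschitz with respect to the Euclidean metric, and compares the empirical tail probability with the lemma's bound; the sample sizes and threshold are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of the Gaussian concentration bound with
# F(x) = max_i x_i, which is 1-Lipschitz (c = 1) for the Euclidean metric.
rng = np.random.default_rng(0)
n, reps, t = 50, 20000, 3.0
samples = rng.standard_normal((reps, n))          # rows ~ N(0, I_n)
F = samples.max(axis=1)
empirical = np.mean(np.abs(F - F.mean()) > t)     # empirical tail P(|F - E F| > t)
bound = 2 * np.exp(-t**2 / np.pi**2)              # lemma's bound with c = 1
print(f"empirical tail {empirical:.5f} <= bound {bound:.3f}")
```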

Proof of Theorem 1

We need to prove that the model \(M_\alpha \) corresponding to the selected \({\hat{\lambda }}_n\) is the optimal model \(M_{\alpha _*}\) as \(n\rightarrow \infty \). Let

$$\begin{aligned} {\hat{\mu }}_\alpha = \frac{1}{6} \sum _{i=1}^3 \sum _{k=1}^2 \frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha } \Vert ^2. \end{aligned}$$

where \({\tilde{\beta }}_{t^k_i,\alpha }=(X^T_{t^k_i,\alpha } X_{t^k_i,\alpha })^{-1} X^T_{t^k_i,\alpha } y_{t^k_i}\). Then, we need to prove that, for all \(\epsilon >0\),

$$\begin{aligned} {\lim \nolimits _{n\rightarrow \infty } P\{|{\hat{\mu }}_{\alpha _*}-{\hat{\mu }}_{\alpha }| > \epsilon \}=0.} \end{aligned}$$
(15)

where \({\hat{\mu }}_{\alpha _*}\) is the prediction error of the blocked \(3 \times 2\) cross-validation on the optimal \(\alpha _*\).
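For concreteness, \({\hat{\mu }}_\alpha \) as defined above can be computed by ordinary least-squares refits on each training block. The sketch below assumes numpy's least-squares solver and reuses the hypothetical three_by_two_splits helper sketched after the abstract, with `support` standing for the index set \(\alpha \).

```python
import numpy as np

def mu_hat(X, y, support, splits):
    """3x2 BCV prediction error of the OLS refit on the candidate support alpha:
    (1/6) * sum over the 6 blocks of (2/n) * ||y_v - X_{v,alpha} beta_tilde||^2."""
    n = X.shape[0]
    total = 0.0
    for train, val in splits:  # the 6 (t_i^k, v_i^k) pairs
        Xt = X[np.ix_(train, support)]
        Xv = X[np.ix_(val, support)]
        beta, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)  # OLS refit on the training block
        total += (2.0 / n) * np.sum((y[val] - Xv @ beta) ** 2)
    return total / 6.0
```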

First, for all \(M_\alpha \in M\), \(\lambda \in \varLambda \),

$$\begin{aligned} {\hat{\mu }}_{\alpha }= & {} \frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha } \Vert ^2 \\= & {} \frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2\\&+\frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha \Vert ^2-\frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2\\&+\frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha } \Vert ^2-\frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha \Vert ^2\\= & {} A+B_\alpha +C_\alpha \end{aligned}$$

where \(A=\frac{1}{3}\sum _{i=1}^3\sum _{k=1}^2\frac{1}{n} \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2 = \frac{1}{3}\sum _{i=1}^3\frac{1}{n} \Vert y-X \beta ^0 \Vert ^2=\frac{1}{n} \Vert y-X\beta ^0 \Vert ^2\), \(B_\alpha =\frac{1}{3}\sum _{i=1}^3\frac{1}{n} (\sum _{k=1}^2 \Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha \Vert ^2 - \sum _{k=1}^2 \Vert y_{v^k_i}-X_{v^k_i} \beta ^0 \Vert ^2)=\frac{1}{3}\sum _{i=1}^3 \frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X \beta ^0 \Vert ^2)=\frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X \beta ^0 \Vert ^2).\)

From Wilks’s theorem, it is known that, if \(\alpha \in M\setminus M_c\), as \(n \rightarrow \infty \), we have \(-2nB_\alpha \sim \sigma ^2 \chi ^2(k_\alpha )\), where \(k_\alpha =d_0-d_{\alpha 0}\), \(d_{\alpha 0}:=\#\{j: \beta _j \in \alpha \bigcap \alpha _0\}\). This means \(E B_\alpha =- \frac{\sigma ^2 k_\alpha }{2n}\). Otherwise, if \(\alpha \in M_c\), we have \(B_\alpha =\frac{1}{n} (\Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\Vert y-X_\alpha \beta _\alpha ^0 \Vert ^2)\), and then \(E B_\alpha =O(\frac{1}{n})\). We now analyze \(C_\alpha \) in detail.

Define \(u_{t^k_i}(\gamma )=X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha }\gamma )\), then \(u_{t^k_i}({\tilde{\beta }}_{t^k_i,\alpha })=0\), \(u_{t^k_i}({\tilde{\beta }}_\alpha ) = X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha }{\tilde{\beta }}_\alpha )\). By Taylor expansion, we have

$$\begin{aligned} u_{t^k_i}({\tilde{\beta }}_\alpha )-u_{t^k_i}({\tilde{\beta }}_{t^k_i,\alpha }) ={\dot{u}}_{t^k_i}({\tilde{\beta }}_\alpha ) ({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha }) +o({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha }). \end{aligned}$$

Then,

$$\begin{aligned} {\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha }= ({\dot{u}}_{t^k_i}({\tilde{\beta }}_\alpha ))^{-1} u_{t^k_i}({\tilde{\beta }}_\alpha )(1+o(1)). \end{aligned}$$

Since \({\dot{u}}_{t^k_i}(\gamma )=-X^T_{t^k_i,\alpha }X_{t^k_i,\alpha }\), we get \({\dot{u}}_{t^k_i}({\tilde{\beta }}_\alpha )=-X^T_{t^k_i,\alpha }X_{t^k_i,\alpha }\), and therefore,

$$\begin{aligned} {\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha }= -(X^T_{t^k_i,\alpha }X_{t^k_i,\alpha })^{-1} u_{t^k_i}({\tilde{\beta }}_\alpha )(1+o(1)). \end{aligned}$$

Let \(b(z)=z^T z\), then by Taylor expansion,

$$\begin{aligned} b(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )-b(X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha })= & {} ({\dot{b}}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha ))^T X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })\\&+ \frac{1}{2}({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })^T X^T_{v^k_i,\alpha } \ddot{b}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )X_{v^k_i,\alpha }\\&\quad ({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })+o(1). \end{aligned}$$

Since \({\dot{b}}(z)=2z\) and \(\ddot{b}(z)=2I\) (where \(I\) is the identity matrix), we get \({\dot{b}}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )=2X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha \), \(\ddot{b}(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )=2I\). Then, the above formula becomes

$$\begin{aligned} b(X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha )-b(X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha })= & {} 2{\tilde{\beta }}^T_\alpha X^T_{v^k_i,\alpha }X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -\tilde{\beta }_{t^k_i,\alpha })\\&+ ({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })^T X^T_{v^k_i,\alpha } X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })+o(1). \end{aligned}$$

Moreover,

$$\begin{aligned}&\Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_{t^k_i,\alpha } \Vert ^2-\Vert y_{v^k_i}-X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha \Vert ^2 \\&\quad =2y^T_{v^k_i}X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha }) -({\tilde{\beta }}^T_\alpha X^T_{v^k_i,\alpha }X_{v^k_i,\alpha }{\tilde{\beta }}_\alpha -{\tilde{\beta }}^T_{t^k_i,\alpha } X^T_{v^k_i,\alpha }X_{v^k_i,\alpha }{\tilde{\beta }}_{t^k_i,\alpha }) \\&\quad = 2(y_{v^k_i}- X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha )^T X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })\\&\qquad -({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })^T X^T_{v^k_i,\alpha } X_{v^k_i,\alpha }({\tilde{\beta }}_\alpha -{\tilde{\beta }}_{t^k_i,\alpha })+o(1)\\&\quad = -2(y_{v^k_i}- X_{v^k_i,\alpha } {\tilde{\beta }}_\alpha )^T X_{v^k_i,\alpha }(X^T_{t^k_i,\alpha }X_{t^k_i,\alpha })^{-1} X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha }{\tilde{\beta }}_\alpha )(1+o(1))\\&\qquad -(y_{t^k_i}-X_{t^k_i,\alpha }{\tilde{\beta }}_\alpha )^T X_{t^k_i,\alpha } (X^T_{t^k_i,\alpha }X_{t^k_i,\alpha })^{-1} X^T_{v^k_i,\alpha } X_{v^k_i,\alpha }(X^T_{t^k_i,\alpha }X_{t^k_i,\alpha })^{-1}\\&\qquad X^T_{t^k_i,\alpha }(y_{t^k_i}-X_{t^k_i,\alpha } {\tilde{\beta }}_\alpha )(1+o(1))+o(1)\\&\quad = C_{1\alpha }-C_{2\alpha }. \end{aligned}$$

From Conditions 3 and 5, we have,

$$\begin{aligned} EC_{1\alpha }=0, \quad EC_{2\alpha }=d_\alpha \sigma ^2. \end{aligned}$$

Then,

$$\begin{aligned} EC_\alpha =\frac{1}{6}\sum _{i=1}^3\sum _{k=1}^2\frac{2}{n} (EC_{1\alpha }-EC_{2\alpha }) =-\frac{2}{n} d_\alpha \sigma ^2+o\left( \frac{1}{n}\right) . \end{aligned}$$

Moreover, \({\hat{\mu }}_{\alpha }\) can be written as

$$\begin{aligned} {\hat{\mu }}_{\alpha }=\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2+C_\alpha = \frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2-\frac{2}{n} d_\alpha \sigma ^2+o\left( \frac{1}{n}\right) \end{aligned}$$

In particular, if \(\alpha \in M_c\), then \({\hat{\mu }}_\alpha =\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\frac{2}{n} d_\alpha \sigma ^2+o(\frac{1}{n})\), and if \(\alpha \in M\backslash M_c\), then \({\hat{\mu }}_\alpha =\frac{1}{n} \Vert y-X\beta ^0 \Vert ^2- \frac{\sigma ^2 k_\alpha }{2n} -\frac{2}{n} d_\alpha \sigma ^2+o(\frac{1}{n})\).

For all \(\alpha \in M\), with \(\alpha _*\) the optimal model, we have

$$\begin{aligned} {\hat{\mu }}_\alpha -{\hat{\mu }}_{\alpha _*}= & {} \left( \frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\frac{2}{n} d_\alpha \sigma ^2\right) -\left( \frac{1}{n} \Vert y-X_{\alpha _*} {\tilde{\beta }}_{\alpha _*} \Vert ^2-\frac{2}{n} d_{\alpha _*} \sigma ^2\right) \nonumber \\&+o\left( \frac{1}{n}\right) \end{aligned}$$
(16)

For Eq. (16), let \(P_\alpha =X_\alpha (X^T_\alpha X_\alpha )^{-1} X^T_\alpha \) and \(P_{\alpha _*} = X_{\alpha _*} (X^T_{\alpha _*} X_{\alpha _*})^{-1} X^T_{\alpha _*}\). Then we have

$$\begin{aligned}&\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2-\frac{1}{n} \Vert y-X_{\alpha _*} {\tilde{\beta }}_{\alpha _*} \Vert ^2 = \frac{1}{n} y^T(P_{\alpha _*}-P_\alpha )y\\&\quad = \frac{1}{n}(\beta ^0)^T X^T(P_{\alpha _*}-P_\alpha )X\beta ^0 + \frac{2}{n}(\beta ^0)^T X^T(P_{\alpha _*}-P_\alpha )\varepsilon + \frac{1}{n}\varepsilon ^T (P_{\alpha _*}-P_\alpha )\varepsilon \end{aligned}$$

Since \(\alpha _* \in M_c\), we obtain \(X_{\alpha _*}\beta ^0_{\alpha _*}= X \beta ^0\). Moreover, we have \((\beta ^0)^T X^T P_{\alpha _*} X\beta ^0=(\beta ^0)^T X^T X \beta ^0\) and \((\beta ^0)^T X^T P_{\alpha _*} \varepsilon =(\beta ^0)^T X^T \varepsilon \). Then, the last equation becomes:

$$\begin{aligned}&\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2 -\frac{1}{n} \Vert y-X_{\alpha _*} {\tilde{\beta }}_{\alpha _*} \Vert ^2= \frac{1}{n}(\beta ^0)^T X^T(I_n-P_\alpha )X\beta ^0\nonumber \\&\quad + \frac{2}{n}(\beta ^0)^T X^T(I_n-P_\alpha )\varepsilon + \frac{1}{n}\varepsilon ^T (P_{\alpha _*}-P_\alpha )\varepsilon . \end{aligned}$$
(17)

First, for the third term on the right side of Eq. (17), let \(G(\varepsilon )=\frac{1}{n}\varepsilon ^T P_{\alpha _*}\varepsilon \). Since \(P_{\alpha _*}\) is symmetric and idempotent, there is an orthogonal matrix \(Q=(q_1, q_2, \ldots , q_n)\) such that \(P_{\alpha _*}=Q I_{d_{\alpha _*},n} Q^T\), where \(I_{d_{\alpha _*},n}\) is an \(n\)-dimensional diagonal matrix whose first \(d_{\alpha _*}\) diagonal elements are 1 and whose remaining elements are 0, and \(d_{\alpha _*}=tr(P_{\alpha _*})\). Then, we have

$$\begin{aligned} G(\varepsilon )=\frac{1}{n}\varepsilon ^T\left( \sum _{i=1}^{d_{\alpha _*}} q_i q_i^T\right) \varepsilon . \end{aligned}$$
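As a small numerical illustration (not part of the proof), the quadratic form \(G(\varepsilon )\) can be simulated for a rank-\(d_{\alpha _*}\) projection matrix to see that it clusters tightly around \(E(G)=\frac{\sigma ^2 d_{\alpha _*}}{n}\); the dimensions below are arbitrary, hypothetical choices.

```python
import numpy as np

# G(eps) = eps^T P eps / n for a rank-d projection matrix P concentrates
# near sigma^2 * d / n (hypothetical n, d, sigma chosen for illustration).
rng = np.random.default_rng(1)
n, d, sigma = 500, 5, 1.0
A = rng.standard_normal((n, d))
P = A @ np.linalg.inv(A.T @ A) @ A.T              # projection onto a d-dimensional column space
eps = sigma * rng.standard_normal((2000, n))      # 2000 replicated Gaussian error vectors
G = np.einsum("ij,jk,ik->i", eps, P, eps) / n     # quadratic form for each replicate
print(G.mean(), sigma**2 * d / n)                 # sample mean vs. sigma^2 * d / n = 0.01
print(G.std())                                    # small spread, i.e., concentration
```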

By Taylor expansion of the vector function, for all \(\varepsilon ^*_1, \varepsilon ^*_2\), we get

$$\begin{aligned} G(\varepsilon ^*_1)-G(\varepsilon ^*_2)=\frac{2}{n} (\varepsilon ^*_2)^T\left( \sum _{i=1}^{d_{\alpha _*}} q_i q_i^T\right) (\varepsilon ^*_1-\varepsilon ^*_2) \end{aligned}$$

Moreover, in the last equation, \(\Vert (\sum _{i=1}^{d_{\alpha _*}} q_i q_i^T)(\varepsilon ^*_1-\varepsilon ^*_2)\Vert _2 \le \sum _{i=1}^{d_{\alpha _*}} \Vert q_i q_i^T\Vert _2 \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\). For all \(i \in \{1,2,\ldots ,d_{\alpha _*}\}\), \(q_i\) is a unit eigenvector of \(P_{\alpha _*}\) corresponding to the eigenvalue 1, so \(q_i^T q_i=1\) and \(q_i q_i^T\) is a symmetric idempotent matrix; hence \(\Vert q_i q_i^T\Vert _2=1\). Therefore, \(\Vert (\sum _{i=1}^{d_{\alpha _*}} q_i q_i^T)(\varepsilon ^*_1-\varepsilon ^*_2)\Vert _2 \le d_{\alpha _*} \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\), and we obtain the following:

$$\begin{aligned} \Vert G(\varepsilon ^*_1)-G(\varepsilon ^*_2)\Vert _2\le & {} \frac{2}{n} \Vert \varepsilon ^*_2\Vert _2 \Vert \left( \sum _{i=1}^{d_{\alpha _*}} q_i q_i^T\right) (\varepsilon ^*_1-\varepsilon ^*_2)\Vert _2 \\\le & {} \frac{2d_{\alpha _*}}{\sqrt{n}}\sup _j |\varepsilon _j| \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2 \end{aligned}$$

where \(\varepsilon _j\) is the jth element of \(\varepsilon ^*_2\), i.e., \(\varepsilon ^*_2=(\varepsilon _1, \varepsilon _2, \ldots , \varepsilon _n)^T\). Moreover, if we can prove \(\sup _{1\le j \le n}|\varepsilon _j|\le n^{\frac{1}{4}}\), then \(\Vert G(\varepsilon ^*_1)-G(\varepsilon ^*_2)\Vert _2 \le \frac{2d_{\alpha _*}}{\root 4 \of {n}} \Vert \varepsilon ^*_1-\varepsilon ^*_2\Vert _2\), i.e., G is a Lipschitz function with Lipschitz constant \(\frac{2d_{\alpha _*}}{\root 4 \of {n}}\). We now prove that \(\lim _{n\rightarrow \infty }P(\sup _{1\le j \le n}|\varepsilon _j|\le n^{\frac{1}{4}})=1\).

Since \(\varepsilon _j \sim N(0,\sigma ^{2})\), the Gaussian tail bound gives, for all \(t>0\):

$$\begin{aligned} P(\varepsilon _j \ge t) \le \frac{\sigma }{\sqrt{2\pi }} \frac{\exp \left\{ -\frac{t^2}{2\sigma ^2}\right\} }{t} \end{aligned}$$

Applying a union bound over \(j=1,\ldots ,n\) with \(t=n^{\frac{1}{4}}\) yields:

$$\begin{aligned} P\left( \sup _{1\le j\le n}|\varepsilon _{j}|\le n^{\frac{1}{4}}\right)\ge & {} 1-\frac{\sigma \root 4 \of {n^3}\exp \left\{ -\frac{\sqrt{n}}{2\sigma ^2}\right\} }{\sqrt{2\pi }} \equiv p_1 \rightarrow 1(n\rightarrow \infty ) \end{aligned}$$

i.e., G is a Lipschitz function with Lipschitz constant \(\frac{2d_{\alpha _*}}{\root 4 \of {n}}\) with probability tending to 1. Thus, by Lemma 1, for all \(t>0\), we have

$$\begin{aligned} P(|G-E(G)|>t)\le 2 \exp \left( -\frac{\sqrt{n} t^2}{4\pi ^2 d^2_{\alpha _*}}\right) \rightarrow 0. \end{aligned}$$

This shows that \(G\) concentrates at \(E(G)=\frac{\sigma ^2 d_{\alpha _*}}{n}\). Likewise, letting \(F(\varepsilon )=\frac{1}{n}\varepsilon ^T P_\alpha \varepsilon \), \(F\) concentrates at \(\frac{\sigma ^2 d_\alpha }{n}\), where \(d_\alpha =tr(P_\alpha )\). In addition, since the second term on the right side of Eq. (17) has mean \(E(\frac{2}{n}(\beta ^0)^T X^T(I_n-P_\alpha )\varepsilon )=0\), Eq. (17) becomes

$$\begin{aligned}&\frac{1}{n} \Vert y-X_\alpha {\tilde{\beta }}_\alpha \Vert ^2-\frac{1}{n} \Vert y-X_{\alpha _*} {\tilde{\beta }}_{\alpha _*} \Vert ^2 \nonumber \\&\quad = \frac{1}{n}(\beta ^0)^T X^T(I_n-P_\alpha )X\beta ^0 \nonumber \\&\qquad +\frac{1}{n} \sigma ^2 (d_{\alpha _*}-d_\alpha )+o\left( \frac{1}{n}\right) \end{aligned}$$
(18)

Substituting Eq. (18) into Eq. (16) yields:

$$\begin{aligned} {\hat{\mu }}_\alpha -{\hat{\mu }}_{\alpha _*}= \frac{1}{n}(\beta ^0)^T X^T(I_n-P_\alpha )X\beta ^0+ \frac{3}{n} \sigma ^2 (d_{\alpha _*}-d_\alpha ) +o\left( \frac{1}{n}\right) . \end{aligned}$$
(19)

By Condition 3 of the theorem and \(p=O(n^\gamma )\), \(\gamma \in (0,1)\), the right side of Eq. (19) tends to 0 as \(n \rightarrow \infty \), which establishes (15). \(\square \)


Cite this article

Yang, X., Wang, Y., Wang, R. et al. Tuning Parameter Selection Based on Blocked \(3\times 2\) Cross-Validation for High-Dimensional Linear Regression Model. Neural Process Lett 51, 1007–1029 (2020). https://doi.org/10.1007/s11063-019-10105-w
