Abstract
Strong correlation among predictors and heavy-tailed noise pose great challenges in the analysis of ultra-high dimensional data: they increase the computation time needed to discover the active variables and reduce selection accuracy. To address these issues, we propose an innovative two-stage screen-then-select approach, together with a derived procedure, based on robust quantile regression under a sparsity assumption. The approach first screens important features by ranking quantile ridge estimates and then employs a likelihood-based post-screening selection strategy to refine the set of selected variables. In addition, we incorporate an internal competition mechanism along the greedy search path to enhance the robustness of the algorithm against dependence in the design. Our methods are simple to implement and possess numerous desirable properties from theoretical and computational standpoints. Theoretically, we establish the strong consistency of feature selection for the proposed methods under some regularity conditions. In empirical studies, we assess the finite-sample performance of our methods by comparing them with utility screening approaches and existing penalized quantile regression methods. Furthermore, we apply our methods to identify genes associated with anticancer drug sensitivities for practical guidance.
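To make the workflow concrete, the following minimal Python sketch illustrates the generic screen-then-select idea described above. It is only a schematic: the quantile ridge fit is computed with the generic convex solver cvxpy, the refit-based drop in check loss stands in for the likelihood-based statistic \(L(\cdot )\) defined in Sect. 3 (not reproduced here), the internal competition step is omitted, and the tuning values \(\lambda \), d, and C are illustrative choices rather than the paper's recommendations.

```python
import numpy as np
import cvxpy as cp

def check_loss(residuals, tau):
    # Average quantile check loss (1/n) * sum rho_tau(r).
    return float(np.mean(np.maximum(tau * residuals, (tau - 1.0) * residuals)))

def quantile_ridge(X, y, tau, lam):
    # Quantile ridge regression: argmin (1/n) sum rho_tau(y - Xb) + lam * ||b||_2^2.
    n, p = X.shape
    b = cp.Variable(p)
    r = y - X @ b
    obj = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n + lam * cp.sum_squares(b)
    cp.Problem(cp.Minimize(obj)).solve()
    return b.value

def quantile_fit(X_sub, y, tau):
    # Unpenalized quantile regression on a small set of screened columns.
    n, k = X_sub.shape
    b = cp.Variable(k)
    r = y - X_sub @ b
    cp.Problem(cp.Minimize(cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n)).solve()
    return b.value

def screen_then_select(X, y, tau=0.5, lam=1.0, d=20, C=0.5):
    n, p = X.shape
    # Stage 1 (screening): rank predictors by |quantile ridge coefficient|.
    ranked = np.argsort(-np.abs(quantile_ridge(X, y, tau, lam)))[:d]
    # Stage 2 (selection): walk the ranked list; admit a feature only if the
    # scaled drop in check loss exceeds the threshold C * k * log(n) * log(p).
    selected, loss_prev = [], check_loss(y, tau)
    for j in ranked:
        trial = selected + [int(j)]
        residuals = y - X[:, trial] @ quantile_fit(X[:, trial], y, tau)
        loss_new = check_loss(residuals, tau)
        if n * (loss_prev - loss_new) > C * len(trial) * np.log(n) * np.log(p):
            selected, loss_prev = trial, loss_new
    return selected
```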
Data availability
The publicly available Cancer Cell Line Encyclopedia (CCLE) dataset is obtained from https://sites.broadinstitute.org/ccle.
References
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A.A., Kim, S., Wilson, C.J., Lehár, J., Kryukov, G.V., Sonkin, D., et al.: The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012)
Bermingham, M.L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Wright, A.F., Wilson, J.F., Agakov, F., Navarro, P., Haley, C.S.: Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci. Rep. (2015). https://doi.org/10.1038/srep10312
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011). https://doi.org/10.1561/2200000016
Buccini, A., Dell'Acqua, P., Donatelli, M.: A general framework for ADMM acceleration. Numer. Algorithms 85, 829–848 (2020). https://doi.org/10.1007/s11075-019-00839-y
Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001). https://doi.org/10.1198/016214501753382273
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–883 (2008). https://doi.org/10.1111/j.1467-9868.2008.00674.x
Fan, J., Fan, Y., Barut, E.: Adaptive robust variable selection. Ann. Stat. 42, 324–351 (2014). https://doi.org/10.1214/13-AOS1191
Fang, E.X., He, B., Liu, H., Yuan, X.: Generalized alternating direction method of multipliers: new theoretical insights and applications. Math. Program. Comput. 7, 149–187 (2015)
Hastie, T.: Ridge regularization: an essential concept in data science. Technometrics 62, 426–433 (2020). https://doi.org/10.1080/00401706.2020.1791959
He, J., Kang, J.: Prior knowledge guided ultra-high dimensional variable screening with application to neuroimaging data. Stat. Sin. 32, 2095–2117 (2022). https://doi.org/10.5705/ss.202020.0427
He, X., Wang, L., Hong, H.G.: Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Stat. 41, 342–369 (2013). https://doi.org/10.1214/13-AOS1087
Hoerl, A., Kennard, R.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55 (1970). https://doi.org/10.1080/00401706.1970.10488634
Honda, T., Lin, C.T.: Forward variable selection for ultra-high dimensional quantile regression models. Ann. Inst. Stat. Math. 75(3), 393–424 (2023). https://doi.org/10.1007/s10463-022-00849-z
Koenker, R., Bassett, G.: Regression quantiles. Econometrica 46, 33–50 (1978). https://doi.org/10.2307/1913643
Koenker, R., Machado, J.A.: Goodness of fit and related inference processes for quantile regression. J. Am. Stat. Assoc. 94, 1296–1310 (1999)
Kong, Y., Li, Y., Zerom, D.: Screening and selection for quantile regression using an alternative measure of variable importance. J. Multivar. Anal. 173, 435–455 (2019). https://doi.org/10.1016/j.jmva.2019.04.007
Lee, E.R., Noh, H., Park, B.U.: Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229 (2014). https://doi.org/10.1080/01621459.2013.836975
Li, R., Zhong, W., Zhu, L.: Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139 (2012). https://doi.org/10.1080/01621459.2012.695654
Liu, W., Ke, Y., Liu, J., Li, R.: Model-free feature screening and FDR control with knockoff features. J. Am. Stat. Assoc. 117, 428–443 (2022). https://doi.org/10.1080/01621459.2020.1783274
Lorbert, A., Eis, D., Kostina, V., Blei, D., Ramadge, P.: Exploiting covariate similarity in sparse regression via the pairwise elastic net. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, pp. 477–484 (2010)
Ma, X., Zhang, J.: Robust model-free feature screening via quantile correlation. J. Multivar. Anal. 143, 472–480 (2016). https://doi.org/10.1016/j.jmva.2015.10.010
Ma, S., Li, R., Tsai, C.L.: Variable screening via quantile partial correlation. J. Am. Stat. Assoc. 112, 650–663 (2017). https://doi.org/10.1080/01621459.2016.1156545
Meinshausen, N., Rocha, G., Yu, B.: Discussion: a tale of three cousins: Lasso, L2Boosting and Dantzig. Ann. Stat. 35, 2373–2384 (2007)
Mkhadri, A., Ouhourane, M.: An extended variable inclusion and shrinkage algorithm for correlated variables. Comput. Stat. Data Anal. 57, 631–644 (2013). https://doi.org/10.1016/j.csda.2012.07.023
Scheetz, T.E., Kim, K.Y.A., Swiderski, R.E., Philp, A.R., Braun, T.A., Knudtson, K.L., Dorrance, A.M., DiBona, G.F., Huang, J., Casavant, T.L., Sheffield, V.C., Stone, E.M.: Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA 103, 14429–14434 (2006). https://doi.org/10.1073/pnas.0602562103
Sherwood, B., Li, S.: Quantile regression feature selection and estimation with grouped variables using Huber approximation. Stat. Comput. 32, 4 (2022). https://doi.org/10.1007/s11222-022-10135-w
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996). https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Wang, H.: Forward regression for ultra-high dimensional variable screening. J. Am. Stat. Assoc. 104, 1512–1524 (2009). https://doi.org/10.1198/jasa.2008.tm08516
Wang, H., Jin, H., Jiang, X.: Feature selection for high-dimensional varying coefficient models via ordinary least squares projection. Commun. Math. Stat. (2023). https://doi.org/10.1007/s40304-022-00326-2
Wang, X., Leng, C.: High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 78, 589–611 (2016). https://doi.org/10.1111/rssb.12127
Wu, Y., Yin, G.: Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102, 65–76 (2015). https://doi.org/10.1093/biomet/asu068
Wu, Y., Zen, M.: A strongly consistent information criterion for linear model selection based on m-estimation. Probab. Theory Relat. Fields 113, 599–625 (1999). https://doi.org/10.1007/s004400050219
Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010). https://doi.org/10.1214/09-AOS729
Zhao, Y., Zhang, J., Tian, Y., Xue, C., Hu, Z., Zhang, L.: MET tyrosine kinase inhibitor, PF-2341066, suppresses growth and invasion of nasopharyngeal carcinoma. Drug Des. Dev. Ther. 9, 4897 (2015)
Zhou, T., Zhu, L., Xu, C., Li, R.: Model-free forward screening via cumulative divergence. J. Am. Stat. Assoc. 115, 1393–1405 (2020). https://doi.org/10.1080/01621459.2019.1632078
Zhu, L.P., Li, L., Li, R., Zhu, L.X.: Model-free feature screening for ultrahigh-dimensional data. J. Am. Stat. Assoc. 106, 1464–1475 (2011). https://doi.org/10.1198/jasa.2011.tm10563
Zoppoli, G., Regairaz, M., Leo, E., Reinhold, W.C., Varma, S., Ballestrero, A., Doroshow, J.H., Pommier, Y.: Putative DNA/RNA helicase Schlafen-11 (SLFN11) sensitizes cancer cells to DNA-damaging agents. Proc. Natl. Acad. Sci. USA 109, 15030–15035 (2012). https://doi.org/10.1073/pnas.1205943109
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005). https://doi.org/10.1111/j.1467-9868.2005.00503.x
Acknowledgements
This research was supported by NSFC grant 12271238, Guangdong NSF Fund 2023A1515010025, and Shenzhen Sci-Tech Fund JCYJ20210324104803010, awarded to Xuejun Jiang.
Author information
Contributions
Conceptualization, XJ and HF; methodology, YK, HF and XJ; software, YK; resources, XJ; data curation, YK; writing-original draft preparation, YK and HF; supervision, XJ; funding acquisition, XJ.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Xuejun Jiang, Yakun Liang and Haofeng Wang have contributed equally to this work.
Appendices
Appendix A. Useful lemmas
We first introduce the notation \(\varvec{\beta }^*\) for the true regression coefficient vector, which is assumed to be sparse with only a small proportion of nonzero entries. Lemma 1 is used to prove Proposition 1 and Theorem 1. Lemmas 2 and 3 are used to prove Theorems 2 and 3.
Lemma 1
Suppose assumptions A1 and A2 hold. If the dimension \(p_n\) satisfies \(\log (p_n) = o({n^{1-5\omega -2\kappa -v}}/{\log (n)})\), then there exist some constants c, \(\tilde{c}\), \(c'_{1}\), and \(c'_{2}\) such that
(a) For any fixed vector \(\varvec{t}\) with \(\Vert \varvec{t}\Vert _{2}=1\),
$$\begin{aligned}&P\left( \varvec{t}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{t}<c_{1}'n^{1-\omega }p_n^{-1} \ \text {or} \ \ \varvec{t}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{t}>c_{2}'n^{1+\omega }p_n^{-1}\right) \\&\quad \le 4\exp (-C_1n); \end{aligned}$$
(b) \(P\left( \Vert (\mathbb {X}\mathbb {X}^\top )^{-1}\mathbb {X}\varvec{e}_{i}\Vert _{2}^2>c_{1}c_{2}'n^{1+2\omega }p_n^{-2}\right) <3\exp (-C_{1}n);\)
(c) \(P\left( \min _{i\in \mathcal {S}^*}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| <\dfrac{cn^{1-\omega -\kappa }}{p_{n}}\right) = O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} ;\)
(d) \(P\left( \max _{i\notin \mathcal {S}^{*}}\left| \varvec{e}_{i}^\top \mathbb {P}_{\mathbb {X}^\top }\varvec{\beta }^*\right| >\dfrac{\tilde{c} n^{1-\omega -\kappa }}{p_{n}\sqrt{\log n}} \right) = O\left\{ \exp \left( \dfrac{-C_{1}n^{1-5\omega -2\kappa -v}}{2\log n}\right) \right\} ;\)
(e) \(P\Bigg (\lambda _{\max }(\mathbb {X}\mathbb {X}^\top )\ge c_{1}c_{4}c_{5} p_{n}n^{\omega }, \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\le c_{1}^{-1}c_{4}^{-1}c_{5}p_nn^{-\omega } \Bigg )\le 2\exp (-C_{1}n);\)
where \(\varvec{e}_i=(0,\ldots ,1,0,\ldots ,0)^\top \) denotes the i-th natural basis vector in the \(p_n\)-dimensional Euclidean space, \(\omega ,\kappa ,v\) are parameters defined in assumption A2, \(C_1\) is defined in assumption A1, and \(\mathbb {P}_{\mathbb {X}^{\top }} = \mathbb {X}^{\top }(\mathbb {X}\mathbb {X}^{\top })^{-1}\mathbb {X}\) represents the projection matrix.
Proof of Lemma 1
Parts (a) and (b) follow from Lemma 4 and formula (22) in the Supplementary Material of Wang and Leng (2016), respectively.
For (c) and (d), by Lemma 5 in the Supplementary Material of Wang and Leng (2016), there exist some \(c,\tilde{c}>0\) such that for \(i\in \mathcal {S}^*\),
and for \(i\notin \mathcal {S}^*\),
Applying assumption A2 with Bonferroni’s inequality, we have
and
For (e), by assumption A2, we have that
and
Combined with the eigenvalue assumption in A1, we have
This proves the lemma. \(\square \)
Lemma 2
For \(\mathcal {S}^*\not \subset \mathcal {S}\), denote \(\hat{\varvec{\beta }}_{\mathcal {S}} = \arg \min _{\varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}}n^{-1}\sum _{i=1}^{n}\rho _{\tau }(Y_i-\varvec{X}^{\top }_{i,\mathcal {S}}\varvec{\beta }_{\mathcal {S}})\), and the pseudo true coefficient \(\tilde{\varvec{\beta }}_{\mathcal {S}}= \arg \min _{\varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}}E\left[ n^{-1}\sum _{i=1}^{n}\rho _{\tau }(Y_i-\varvec{X}^{\top }_{i,\mathcal {S}}\varvec{\beta }_{\mathcal {S}})\right] \) on the support of the model \(\mathcal {S}\). If assumptions A2-A4 hold, then
uniformly in \(\mathcal {S}\) as \(n\rightarrow \infty \) for \(|\mathcal {S}|\le d\) and \(d=O(n^{1/2})\).
Proof of Lemma 2
For a given deterministic \(\gamma >0\), we first define the set \(\mathcal {B}_{\gamma } = \left\{ \varvec{\beta }_{\mathcal {S}}\in \mathbb {R}^{|\mathcal {S}|}: \Vert {\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}}\Vert _2 \le \gamma \right\} \), and the function
Using Knight's identity,
$$\begin{aligned} \rho _{\tau }(u-v)-\rho _{\tau }(u) = -v\psi _{\tau }(u) + \int _{0}^{v}\left\{ I(u\le s)-I(u\le 0)\right\} \textrm{d}s, \end{aligned}$$
where \(\psi _{\tau }(h) = \tau - I(h<0)\). Let \(u_i = Y_i - \varvec{X}^{\top }_{i,\mathcal {S}}\tilde{\varvec{\beta }}_{\mathcal {S}}\) and \(v_i=\varvec{X}^{\top }_{i,\mathcal {S}}({\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_{\mathcal {S}})\); then
For \(I_{a1}\), \(E(I_{a1}|\varvec{X})=0\) due to the first order condition. For \(I_{a2}\), by Fubini's theorem, the mean value theorem, and assumptions A2 and A4,
for some positive constant \(C_f\). It means that
If we define a convex combination \({\varvec{\theta }}_{\mathcal {S}} = a\hat{\varvec{\beta }}_{\mathcal {S}} + (1-a)\tilde{\varvec{\beta }}_\mathcal {S}\) with \(a=\gamma /(\gamma + \Vert \hat{\varvec{\beta }}_{\mathcal {S}}-\tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2})\), then by the definition of \({\varvec{\theta }}_{\mathcal {S}}\), \(\Vert {\varvec{\theta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2} = a\Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma \), so \({\varvec{\theta }}_{\mathcal {S}}\) falls in the set \(\mathcal {B}_{\gamma }\). Then by convexity and the definition of \(\hat{\varvec{\beta }}_{\mathcal {S}}\),
Using this and the triangle inequality, we have
Note that \(\Vert {\varvec{\theta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma /2\) implies \(\Vert \hat{\varvec{\beta }}_{\mathcal {S}} - \tilde{\varvec{\beta }}_\mathcal {S}\Vert _{2}\le \gamma \). Denote \(\gamma _n = \sqrt{|\mathcal {S}|\log (n)\log (p_n)/n}\) and let \(C_\gamma \) be some positive constant. Combining (10) with (11), we have
Similar to the argument of Lemma 1 of Fan et al. (2014), we have \(E(D_{\gamma })\le 4\gamma \sqrt{|\mathcal {S}|/n}\) after employing the standard symmetrization and contraction theorems; see Section 14.7 of Bühlmann and Van De Geer (2011). Applying Massart's concentration theorem (see Section 14.6 of Bühlmann and Van De Geer 2011) yields that for any \(t>0\), \(P([D_\gamma - E(D_\gamma )]/V_n \ge t)\le \exp (-nt^2/8)\), where \(V_n=2C_x\sqrt{|\mathcal {S}|}\gamma \) and \(C_x\) is a constant greater than \(\max _{i,j}|X_{ij}|\). This is equivalent to
Letting \(\gamma = 16C_f^{-1}n^{-1/2}(1+t)\) and \(1+t = C_\gamma C_f \sqrt{|\mathcal {S}|\log (n)\log (p_n)}/16\), it follows from (12) that
with some positive constant \(C_{\gamma }'\). Moreover, using Boole’s inequality, we have
Thus, the proof is complete. \(\square \)
Lemma 3
Suppose that assumptions A2-A4 hold. For \(\mathcal {S}^*\subset \mathcal {S}\), denote by \(\varvec{\beta }^*_{\mathcal {S}}\) the true coefficient vector \(\varvec{\beta }^*\) restricted to the support \(\mathcal {S}\). Then we have
uniformly in \(\mathcal {S}\) as \(n\rightarrow \infty \) for \(|\mathcal {S}|\le d\) and \(d=O(n^{1/2})\).
Proof of Lemma 3
Define \(h_{\mathcal {S}}(\Delta \varvec{\beta }_{\mathcal {S}})=\sum _{i=1}^{n}\{\rho _{\tau }(\epsilon _{i}-\varvec{X}_{i,\mathcal {S}}^{\top }\Delta \varvec{\beta }_{\mathcal {S}}) - \rho _{\tau }(\epsilon _{i})\}\), where \(\Delta \varvec{\beta }_{\mathcal {S}}=\varvec{\beta }_{\mathcal {S}}-\varvec{\beta }_{\mathcal {S}}^*\). By its convexity, it is sufficient to show that for any given D, there exists a large constant \(L_D >0\) such that
where \(\gamma _n = \sqrt{|\mathcal {S}|\log (n)\log (p_n)/n}\).
Referring to Lemma A.1 in the Supplementary Material of Lee et al. (2014), for any sequence \(\{D_n\}\) that satisfies \(1\le D_n \le d^{\delta _0/10}\) for some \(\delta _0>0\) with \(d^{2+\delta _0}=o(n)\), we have
in which
Here, we take \(D_n=\sqrt{\log (n)\log (p_n)}\). Thus, \(h_{\mathcal {S}}\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \) can be decomposed as
for any \(\mathcal {S}^*\subset \mathcal {S}\), \(|\mathcal {S}|\le d\) and \(\Vert \Delta \varvec{\beta }_{\mathcal {S}}\Vert = L_D\gamma _{n}\).
For \(A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \),
By applying Example 14.3 of Bühlmann and Van De Geer (2011), we have
For \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \), using Knight's identity (see (9)), Taylor's theorem, and assumptions A2 and A4, we have
Combining (15) and (16), we have \(\left| A_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \right| \le M_1 L_D n\gamma _{n}^2\) and \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \ge M_2 L_D^2n\gamma _{n}^2\) for some constants \(M_1\) and \(M_2\). Therefore, formula (13) holds since \(B_n\left( \Delta \varvec{\beta }_{\mathcal {S}}\right) \) dominates all the other terms in (14) for sufficiently large \(L_D>M_1/M_2\), which does not depend on the choice of \(\mathcal {S}\). This completes the proof. \(\square \)
Appendix B. Proof of Proposition 1
We divide the proof into three parts:
(i) to show that the QRR estimate satisfies \(\hat{\varvec{\beta }}\in \mathcal {C}\left( \mathbb {X}^\top \right) \), where \(\mathcal {C}\left( \mathbb {X}^\top \right) \) represents the column space of \(\mathbb {X}^\top \), i.e., the space spanned by \(\varvec{X}_{1},\ldots ,\varvec{X}_{n}\);
(ii) to show that, as \(n\rightarrow \infty \),
where \(\varvec{\xi }=\mathbb {P}_{\mathbb {X}^\top }{\varvec{\beta }}^*\), \(\mathbb {P}_{\mathbb {X}^{\top }} = \mathbb {X}^{\top }(\mathbb {X}\mathbb {X}^{\top })^{-1}\mathbb {X}\);
(iii) to show that \(\hat{\varvec{\beta }}=\mathbb {X}^{\top } (\mathbb {X}\mathbb {X}^{\top }+\lambda \mathbb {I}_n)^{-1} \mathbb {X}\varvec{\beta }^*+\varvec{R}_n\).
Part (i). Since \(\hat{\varvec{\beta }}=\arg \min _{\varvec{\beta }}\{Q_{n}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _{2}^2\}\), where \(Q_{n}(\varvec{\beta })=n^{-1}\sum _{i=1}^n\rho _\tau (Y_i-\varvec{X}_i^\top \varvec{\beta })\), the first-order condition satisfies that
This is equivalent to \(\hat{\varvec{\beta }}=(2\lambda n)^{-1}\mathbb {X}^\top \varvec{v}\), where \(\varvec{v}=\left( v_1,\ldots ,v_n\right) ^{\top }\) with \(v_{i}=\tau -I(Y_i-\varvec{X}_{i}^\top \hat{\varvec{\beta }}<0)\) for \(i=1,\ldots ,n\), and \(I(\cdot )\) is the indicator function. Hence \(\hat{\varvec{\beta }}\in \mathcal {C}\left( \mathbb {X}^\top \right) \).
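The conclusion of Part (i) can be checked numerically on a toy example. The sketch below (assuming cvxpy is available; it is not part of the original proof, and the dimensions, seed, and signal are illustrative) fits the quantile ridge estimator for \(n<p_n\) and verifies that projecting \(\hat{\varvec{\beta }}\) onto \(\mathcal {C}\left( \mathbb {X}^\top \right) \) leaves it essentially unchanged.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, tau, lam = 30, 100, 0.5, 0.1
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.5, -2.0, 1.0]) + rng.standard_t(df=3, size=n)

# Quantile ridge regression: argmin (1/n) sum rho_tau(y - Xb) + lam * ||b||_2^2.
b = cp.Variable(p)
r = y - X @ b
obj = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n + lam * cp.sum_squares(b)
cp.Problem(cp.Minimize(obj)).solve()
beta_hat = b.value

# Projection onto C(X^T); for the minimizer, P beta_hat should equal beta_hat
# up to solver tolerance, confirming beta_hat lies in the row space of X.
P = X.T @ np.linalg.solve(X @ X.T, X)
print(np.linalg.norm(P @ beta_hat - beta_hat))   # close to 0
```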
Part (ii). Define the set
for some sufficiently large constant \(\alpha \), where \(M_{1}= n^{-2\omega -\kappa }/\sqrt{\log (n)}\), \(M_{2}=n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\), and \(\varvec{\xi }=\mathbb {P}_{\mathbb {X}^\top }{\varvec{\beta }^*}\). Let \(\mathcal {L}(\varvec{\beta }) = Q_{n}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _{2}^2\). Notice the following decomposition:
where \(I_{n,1}= E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] \), \(I_{n,2} = Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi }) - E\left[ Q_{n}(\varvec{\beta })-Q_{n}(\varvec{\xi })\right] \), and \(I_{n,3}=2\lambda \varvec{\xi }^\top (\varvec{\beta } -\varvec{\xi })\).
Note that
where \(\varvec{1}_{i}=(0,\ldots ,1,0,\ldots ,0)^\top \) denotes the i-th natural basis vector in the n-dimensional Euclidean space. We set \(a_{i}=\varvec{X}_{i}^\top (\varvec{\beta }-\varvec{\xi })\) for \(i=1,\ldots ,n\). Then it is easy to derive that
By the definition of \(\tau \), i.e., \(\tau = E\left( I(\epsilon _{i} \le 0)\right) =F_{i}(0)\), the mean value theorem, and integration by parts, \(\int _{0}^{a_{i}}sf_{i}(s)\textrm{d}s=a_{i}F_{i}(a_{i})-\int _{0}^{a_{i}}F_{i}(s)\textrm{d}s\), combined with assumption A3, we have, if \(a_i>0\),
where o(1) is uniformly over all \(i=1,\ldots ,n\). The same result can be obtained when \(a_i <0\). Then
Since \(\varvec{\beta }-\varvec{\xi } \in \mathcal {C}(\mathbb {X}^\top )\), we may write \(\varvec{\beta }-\varvec{\xi } = \mathbb {X}^{\top }\varvec{\zeta }\) for some vector \(\varvec{\zeta }\). Let the singular value decomposition of \(\mathbb {X}\mathbb {X}^{\top }\) be \(\varvec{U}\varvec{D}\varvec{U}^{\top }\), with the diagonal entries of \(\varvec{D}\) arranged in decreasing order and \(\varvec{U}\) orthogonal. Thus \((\varvec{\beta }-\varvec{\xi })^{\top } \mathbb {X}^{\top }\mathbb {X}(\varvec{\beta }-\varvec{\xi })= \varvec{\zeta }^{\top }\mathbb {X}\mathbb {X}^{\top }\mathbb {X}\mathbb {X}^{\top }\varvec{\zeta }= \varvec{\zeta }^{\top }\varvec{U}\varvec{D}^2\varvec{U}^{\top }\varvec{\zeta } \ge \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\varvec{\zeta }^{\top }\varvec{U}\varvec{D}\varvec{U}^{\top }\varvec{\zeta }= \lambda _{\min }(\mathbb {X}\mathbb {X}^\top )\Vert \varvec{\beta }-\varvec{\xi }\Vert _{2}^2\). Combined with the definition of \(\mathcal {B}_{\alpha }\) and Lemma 1(e), this establishes that
with probability going to 1, where c is the lower bound of \(f_{i}(\cdot )\) in a neighborhood of 0.
Define \(\rho (Y,s)=\rho _{\tau }(Y-s)\) and omit the subscript \(\tau \) for simplicity. Note that the following Lipschitz condition holds for \(\rho (Y_i,\cdot )\),
By definition, \(n^{-1}\sum _{i=1}^n\left| \varvec{X}_{i}^\top (\varvec{\beta }-\varvec{\xi })\right| ^2\le \alpha ^2M_{1}^2\) holds for any \(\varvec{\beta }\in \mathcal {B}_{\alpha }\). Using (18) and (20), we have that
which entails that
By assumption A2, we have
Moreover, using the Cauchy-Schwarz inequality, the term \(|I_{n,3}|\) is bounded by
with probability approaching one.
Combining (19), (21), and (22), we have \(P\left( \inf _{\mathcal {B}_{\alpha }}\left\{ \mathcal {L}(\varvec{\beta })-\mathcal {L}(\varvec{\xi })\right\} >0\right) \rightarrow 1\) as \(n\rightarrow \infty \), where \(M_{1}= n^{-2\omega -\kappa }/\sqrt{\log (n)}\), \(M_{2}=n^{1/2-3\omega /2-\kappa }/\sqrt{p_n\log (n)}\), and \(n^{2\omega +\kappa }\sqrt{\log (n)}=o(n^{1/2})\) by the assumption on \(p_n\); this ensures that \(I_{n,3}\) and \(I_{n,2}\) are dominated by \(I_{n,1}\) for sufficiently large \(\alpha \). By the convexity of \(\mathcal {L}(\varvec{\beta })-\mathcal {L}(\varvec{\xi })\) and the fact that \(\mathcal {L}(\hat{\varvec{\beta }})\le \mathcal {L}(\varvec{\xi })\), we have
Part (iii). By a simple algebraic calculation, we have that
which, combined with Hölder's inequality, yields that
This and conditions A1–A2 guarantee that
which, together with (17), establishes result (iii). \(\square \)
Appendix C. Proof of Theorem 1
Applying \(\mathbb {P}_{\mathbb {X}^\top }\left( \hat{\varvec{\beta }}-\varvec{\xi }\right) =\hat{\varvec{\beta }}-\varvec{\xi }\) and the Cauchy-Schwarz inequality, we obtain that
Using Lemma 1(a), (b) and the Bonferroni inequality, we obtain that
and
This, together with the assumption on \(p_n\), i.e., \(\log (p_n) = o({n^{1-5\omega -2\kappa -v}}/{\log (n)})\), and result (ii), yields that
Then by Lemma 1(c) and (d),
This completes the proof. \(\square \)
Appendix D. Proof of Theorem 2
Recall the QRR screening index set \(\mathcal {F}_d=\{i_1,i_2,\cdots ,i_d\}\) and denote \(\mathcal {S}_k=\{i_1,\cdots ,i_k\}\) for \(k=1,\ldots ,d\). By Theorem 1, we have \(P(\mathcal {S}_{s_n} = \mathcal {S}^*)=1\), where \(s_n\) is defined in Sect. 3. Given \(\mathcal {S}_{k-1}\), we now consider the likelihood-based statistic
where \(k=2,\ldots ,d\).
When \(|\mathcal {S}_{k-1}|<s_n\), we have \(\mathcal {S}^* \not \subset \mathcal {S}_{k-1}\). We next prove that the k-th shortlisted index \(i_k\) is selected with probability approaching one because the statistic satisfies \(L(\mathcal {S}_k) > C k\log (n)\log (p_n)\). Recall that \(Q_{n}(\varvec{\beta })=n^{-1}\sum _{i=1}^n\rho _\tau (Y_i-\varvec{X}_i^\top \varvec{\beta })\) and that \(\tilde{\varvec{\beta }}_{\mathcal {S}}\) denotes the pseudo true coefficient on the support of the model \(\mathcal {S}\), as defined in Lemma 2. For any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\), we decompose
By the triangle inequality, we note that
For \(I_{11}\), by combining Lemma 2, the Lipschitz condition in (20), and assumption A4,
Similarly, we have
For \(I_{13}\), by the Lipschitz condition in (20) and assumption A4,
Note that \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) lies in \(\mathbb {R}^{|\mathcal {S}_{k-1}|}\) while \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) belongs to \(\mathbb {R}^{|\mathcal {M}_{1}|}\). When we consider \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1}\), since \(\mathcal {S}_{k-1}\subset \mathcal {M}_{1}\), we append a coefficient \(\tilde{\beta }_j = 0\) to \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) so that the two vectors are aligned.
Then by Hoeffding’s inequality, we have for any \(t>0\),
Taking \(t=|\mathcal {M}_{1}|\sqrt{{\log (n)\log (p_n)}/{n}}\), we have
Moreover, using Boole's inequality, for any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\), we have
Denote \(\gamma _n = \sqrt{k\log (n)\log (p_n)/n}\). Then combining (24), (25), and (26) yields \(|I_1|=o_p(\gamma _n^2)\) uniformly over all \(\mathcal {M}_1\).
For \(I_2\), we use the difference between \(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}\) and \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) to derive a lower bound; this keeps the notation compact. Employing Knight's identity with \(u_i=Y_i - \varvec{X}^{\top }_{i,\mathcal {M}_1}\tilde{\varvec{\beta }}_{\mathcal {M}_1}\) and \(v_i=\varvec{X}^{\top }_{i,\mathcal {M}_1}(\tilde{\varvec{\beta }}_{\mathcal {S}_{k-1}}-\tilde{\varvec{\beta }}_{\mathcal {M}_1})\), we have
where \(\tilde{b}_j\) is the pseudo true coefficient for variable j in \(\tilde{\varvec{\beta }}_{\mathcal {M}_1}\). Then we consider a coefficient vector \(\breve{{\varvec{\beta }}}_{\mathcal {M}_1}\) whose coefficient for variable j is 0 and whose remaining coefficients coincide with those of \(\tilde{{\varvec{\beta }}}_{\mathcal {M}_1}\). Now by assumption A4,
Thus \(I_2\ge \gamma _{l}^2/(2\underline{f})\) since \(|E[X_{j}\psi _{\tau }(Y-\varvec{X}^{\top }_{\mathcal {M}_1}\tilde{{\varvec{\beta }}}_{\mathcal {M}_1})]|>\gamma _{l}\) by assumption A6, where \(\gamma _n\le \gamma _{l}\). Therefore, for any \(\mathcal {M}_1=\mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {S}_{k-1}^{c}\cap \mathcal {S}^*\) and some constant \(C_{M_1}\),
When \(|\mathcal {S}_{k-1}|\ge s_n\), Theorem 1 implies \(P(\mathcal {S}_{k-1}\supset \mathcal {S}^*)\rightarrow 1\). Here we prove that the k-th shortlisted index \(i_k\) is discarded because the statistic satisfies \(L(\mathcal {S}_k) < C k\log (n)\log (p_n)\) with probability approaching one. Consider any \(\mathcal {M}_2\) satisfying \(|\mathcal {M}_2|=k\) and \(\mathcal {S}^*\subset \mathcal {M}_2\); then \(Q_n({\varvec{\beta }}_{\mathcal {S}_{k-1}}^*) = Q_n({\varvec{\beta }}_{\mathcal {M}_{2}}^*)\). Thus we can decompose
The inequality holds by the triangle inequality. Similar to the argument for (24), by Lemma 3, we have
Similarly, we have
Therefore, for any \(\mathcal {M}_2 = \mathcal {S}_{k-1} \cup \{j\}\) with \(j \in \mathcal {F}_p{\setminus } \mathcal {S}^*\) and some constant \(C_{M_2}\),
One can take a common threshold constant C with \(C_{M_2}<C<C_{M_1}\). Combining (27) with (28) leads to \(P(\hat{\mathcal {S}}_{V}=\mathcal {S}^*)=1\), which completes the proof. In practice, we suggest choosing a conservative \(C\in (0,1)\) since \(C_{M_2}\) tends to zero. \(\square \)
Appendix E. Proof of Theorem 3
The proposed sequential LIPS is built on the framework of forward update and backward deletion, with an internal competition step connecting the two phases. Correspondingly, the proof is divided into two parts: (i) showing that \(P(\hat{\mathcal {S}}_{V}=\mathcal {S}^*)=1\); (ii) showing that \(P(\hat{\mathcal {S}}_{S}=\hat{\mathcal {S}}_{V})=1\).
Part (i). This follows from the results in Theorem 2.
Part (ii). Denote the currently selected index set by \(\hat{\mathcal {S}}_{T}\). In the initial step, we set \(\hat{\mathcal {S}}_{T} = \hat{\mathcal {S}}_{V}\).
If \(\mathcal {S}^*\not \subset \hat{\mathcal {S}}_{T}\), without loss of generality, we let \(k_1\) be an index satisfying \(k_1 \in \hat{\mathcal {S}}_{T}^{c} \cap \mathcal {S}^*\). For \(\mathcal {M}_1=\hat{\mathcal {S}}_{T} \cup \{k_1\}\), we consider
This means that omitted active features can be reselected.
If \(\mathcal {S}^*\subset \hat{\mathcal {S}}_{T}\), without loss of generality, we assume \(\hat{\mathcal {S}}_{T} \cap \mathcal {S}^{*c} \ne \varnothing \). Now we investigate \(k_2\in \hat{\mathcal {S}}_{T} \cap \mathcal {S}^{*c}\). Let \(\mathcal {M}_2=\hat{\mathcal {S}}_{T} {\setminus } \{k_2\}\), then \(\mathcal {S}^{*} \subset \mathcal {M}_2\). As for
by (28) in the proof of Theorem 2, we have that \(k_2\) will not be retained due to
This means that inactive features can be removed.
An error situation arises when some spurious variables, which are highly correlated with the error term, take precedence over the truly active ones and prevent them from being selected. When this occurs, \(k_1\) will not be reselected since \(L(\mathcal {M}_{1})\ge C (|\hat{\mathcal {S}}_{T}|+1)\log (n)\log (p_n)\). However, the internal competition lets \(k_1\) join \(\hat{\mathcal {S}}_{T}\) temporarily in order to clear out spurious and redundant variables. Therefore, we have \(P(\hat{\mathcal {S}}_{S}=\hat{\mathcal {S}}_{V})=1\).
Combining Parts (i) and (ii) leads to \(P(\hat{\mathcal {S}}_{S}=\mathcal {S}^*)=1\). This completes the proof of Theorem 3. \(\square \)
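The forward-update and backward-deletion cycle with internal competition described in this proof can be summarized schematically as follows. This is only a reading of the proof's description, not the paper's exact LIPS algorithm: `loss(S)` is assumed to return the fitted quantile check loss on the index set S (for instance via the `quantile_fit` helper from the sketch after the abstract), and the check-loss drop again stands in for the statistic \(L(\cdot )\).

```python
import numpy as np

def lips_competition(loss, candidates, n, p, C=0.5):
    """Schematic forward update + backward deletion with internal competition.

    `loss(S)` is assumed to return the fitted quantile check loss on the index
    set S; the loss drop is used in place of the paper's statistic L(.)."""
    thr = lambda k: C * k * np.log(n) * np.log(p)
    S = []
    for j in candidates:
        if j in S:
            continue
        # Internal competition: let candidate j join the working set temporarily ...
        S = S + [j]
        # ... then backward deletion removes any member whose contribution
        # falls below the threshold (spurious or redundant variables).
        for i in sorted(S):
            reduced = [m for m in S if m != i]
            if n * (loss(reduced) - loss(S)) < thr(len(S)):
                S = reduced
    return S
```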
Appendix F. Additional simulations
1.1 Example 1: Sure screening rate for \(n=200\), \(p_n=5000\) when model size \(d_n=n-1\)
See Table 9.
1.2 Example 2: Selection performance when \(n=300\) at \(\tau =0.5\)
1.3 Example 3: Selection performance when \(n=100\) at \(\tau =0.8\)
1.4 Example 4: Selection performance when \(n=300\) at \(\tau =0.8\)
1.5 Example 5: Weak correlations
In this example, we revisit the two scenarios from Section 4.2, but the correlation among predictors in the data-generating process is notably weak. We examine the quantile level \(\tau =0.5\) for each scenario. The comparative performance of the different methods is evaluated through the average quantile prediction error (QPE), false negatives (FN), false positives (FP), and the average running time of the algorithm over 500 replications. The results are shown in Tables 19 and 20. A data-generation sketch for the two modes is given after the list below.
\(\bullet \) Block Diagonal Auto-Regressive correlation (BDAR):
- Mode 3 (denoted by BDAR-\(3_{0.5}\)): In this mode, the covariance matrix \(\varvec{\Sigma } = (\sigma _{ij})\), where \(\sigma _{ij}=0.2^{|i-j|}\), \(1\le i,j\le p_n\). Non-zero coefficients of \(\varvec{\beta }^*\) are set to be \(\beta _1^*=\sqrt{8}\), \( \beta _3^*=\sqrt{2}\), \(\beta _6^*=\sqrt{3}\), and \(\beta _{10}^*=\sqrt{5}\).
\(\bullet \) Block Diagonal Compound Symmetry (BDCS):
- Mode 3 (denoted by BDCS-\(3_{0.5}\)): In this mode, the covariance matrix \(\varvec{\Sigma }\) has diagonal element 1 and off-diagonal element 0.2. Non-zero coefficients of \(\varvec{\beta }^*\) are set to be \(\beta _{1}^*=\beta _{2}^*=\beta _{3}^*=\sqrt{6}\).
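The two weak-correlation designs can be reproduced with a few lines of code. The sketch below generates Gaussian predictors with the stated covariance structures and the sparse coefficient vectors given above; the sample sizes, the random seed, and the heavy-tailed t(3) errors are illustrative assumptions, since Example 5 inherits its remaining settings from Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 1000   # illustrative sizes; Example 5 inherits its settings from Sect. 4.2
idx = np.arange(p)

# BDAR-3_{0.5}: sigma_ij = 0.2^{|i-j|}.
Sigma_ar = 0.2 ** np.abs(np.subtract.outer(idx, idx))
X_ar = rng.multivariate_normal(np.zeros(p), Sigma_ar, size=n)
beta_ar = np.zeros(p)
beta_ar[[0, 2, 5, 9]] = np.sqrt([8.0, 2.0, 3.0, 5.0])    # beta_1, beta_3, beta_6, beta_10

# BDCS-3_{0.5}: unit variances with all off-diagonal correlations equal to 0.2.
Sigma_cs = np.full((p, p), 0.2) + 0.8 * np.eye(p)
X_cs = rng.multivariate_normal(np.zeros(p), Sigma_cs, size=n)
beta_cs = np.zeros(p)
beta_cs[[0, 1, 2]] = np.sqrt(6.0)                         # beta_1 = beta_2 = beta_3

# Responses from the linear model with heavy-tailed noise (t(3) is only an
# illustrative choice; the actual error distributions follow Sect. 4.2).
y_ar = X_ar @ beta_ar + rng.standard_t(df=3, size=n)
y_cs = X_cs @ beta_cs + rng.standard_t(df=3, size=n)
```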
1.6 Example 6: Time series
In this example, we use model (7) with \(n=100\) and \(p_n=1000\) to generate data. We assume that the predictors are generated from the process \(\varvec{X}_i = A_1\varvec{X}_{i-1} + A_2\varvec{X}_{i-2} + \varvec{\eta }_i\) for \(i = 1,\ldots ,n\), in which \(A_1\), \(A_2\), \(\varvec{\eta }_i\), and \(\epsilon _{i}\) are specified in each case. Non-zero coefficients of \(\varvec{\beta }^*\) are set to be \(\beta _{1}^*=\beta _{2}^*=\beta _{3}^*=\beta _{4}^*=\sqrt{2}\). The following three cases are considered (a data-generation sketch for Case 1 is given at the end of this example):
- Case 1: \(A_1=(a_{ij})\), where \(a_{ij}=0.4^{|i-j|+1}\), \(1\le i,j\le p_n\), \(A_2 = 0\). For each i, \(\varvec{\eta }_i\sim N(\varvec{0}, \mathbb {I}_{p_n})\). The error term follows \(\epsilon _{i}=0.5\epsilon _{i-1}+e_i\), \(e_i\sim t(5)\).
- Case 2: \(A_1=1.2\mathbb {I}_{p_n}, A_2 = -0.5\mathbb {I}_{p_n}\). For each i, \(\varvec{\eta }_i\sim N(\varvec{0}, \varvec{\Sigma }_{\varvec{\eta }})\), \(\varvec{\Sigma }_{\varvec{\eta }} = (\sigma _{ij})\), where \(\sigma _{ij}=0.5^{|i-j|}\), \(1\le i,j\le p_n\). The error term follows \(\epsilon _{i}=-0.5\epsilon _{i-1}+0.3\epsilon _{i-2}+e_i\), \(e_i\sim t(5)\).
- Case 3: \(A_1=0.8\mathbb {I}_{p_n}, A_2 = 0\). For each i, \(\varvec{\eta }_i = \varvec{u}_i + B_1\varvec{u}_{i-1}+B_2\varvec{u}_{i-2}\), \(\varvec{u}_i\sim N(\varvec{0}, \mathbb {I}_{p_n})\), \(B_1=0.6\mathbb {I}_{p_n}, B_2 = -0.4\mathbb {I}_{p_n}\). The error term follows \(\epsilon _{i}=0.5\epsilon _{i-1}+e_i\), \(e_i\sim t(5)\).
The quantile level \(\tau =0.5\) is tested in each setting (Table 21).
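For concreteness, the following sketch generates data for Case 1. The burn-in length and the random seed are assumed implementation details, and the response equation assumes that model (7) is the linear quantile model \(Y_i = \varvec{X}_i^\top \varvec{\beta }^* + \epsilon _i\) used throughout.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, burn = 100, 1000, 50          # burn-in length is an assumed choice
idx = np.arange(p)

# Case 1 predictors: X_t = A1 X_{t-1} + eta_t with a_ij = 0.4^{|i-j|+1}, A2 = 0.
A1 = 0.4 ** (np.abs(np.subtract.outer(idx, idx)) + 1)
X = np.zeros((n + burn, p))
for t in range(1, n + burn):
    X[t] = A1 @ X[t - 1] + rng.standard_normal(p)
X = X[burn:]

# Case 1 errors: eps_t = 0.5 eps_{t-1} + e_t with e_t ~ t(5).
e = rng.standard_t(df=5, size=n + burn)
eps = np.zeros(n + burn)
for t in range(1, n + burn):
    eps[t] = 0.5 * eps[t - 1] + e[t]
eps = eps[burn:]

beta = np.zeros(p)
beta[:4] = np.sqrt(2.0)             # beta_1 = ... = beta_4 = sqrt(2)
y = X @ beta + eps                  # assumes model (7) is Y_i = X_i' beta* + eps_i
```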
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, X., Liang, Y. & Wang, H. Screen then select: a strategy for correlated predictors in high-dimensional quantile regression. Stat Comput 34, 112 (2024). https://doi.org/10.1007/s11222-024-10424-6
DOI: https://doi.org/10.1007/s11222-024-10424-6