Abstract
The Cox model is an exceedingly popular semiparametric hazard regression model for analyzing time-to-event data with explanatory variables. In the ultrahigh-dimensional setting, unlike marginal screening strategies, there is a joint feature screening method based on the partial likelihood of the Cox model, but its computational feasibility remains unresolved. In this paper, we develop an enhanced iterative hard-thresholding algorithm by adapting the non-monotone proximal gradient method to the Cox model. The proposed algorithm is both computationally effective and fast. Moreover, it starts from a LASSO initial estimator rather than the naive zero initial value, which further improves computational efficiency in practice while still enjoying the sure screening property in theory, for which we provide a rigorous proof. The advantages of the proposed method are demonstrated by numerical studies and illustrated with the diffuse large B-cell lymphoma data example.
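To make the procedure concrete, here is a minimal NumPy sketch of iterative hard thresholding on the Cox partial likelihood. This is an illustration, not the authors' implementation: it uses the Breslow-form partial likelihood assuming no tied event times, a fixed gradient-ascent step instead of the paper's non-monotone step-size search, a zero initial value instead of the LASSO initializer, and all function names and the toy setup are our own.

```python
import numpy as np

def partial_loglik(beta, Z, time, status):
    """Cox log partial likelihood (Breslow form, assuming no tied event times)."""
    order = np.argsort(time)
    eta, s = (Z @ beta)[order], status[order]
    # log of the risk-set sums over subjects j with time_j >= time_i
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return np.sum(s * (eta - log_risk))

def score(beta, Z, time, status):
    """Gradient of the log partial likelihood."""
    order = np.argsort(time)
    Zo, s = Z[order], status[order]
    w = np.exp(Zo @ beta)
    S0 = np.cumsum(w[::-1])[::-1]                        # risk-set sums of weights
    S1 = np.cumsum((w[:, None] * Zo)[::-1], axis=0)[::-1]
    return (s[:, None] * (Zo - S1 / S0[:, None])).sum(axis=0)

def hard_threshold(v, k):
    """Keep the k entries of v largest in absolute value; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def iht_cox(Z, time, status, k, step=0.2, iters=100):
    """Iterative hard thresholding on the (sample-normalized) partial likelihood."""
    n = len(time)
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        beta = hard_threshold(beta + step * score(beta, Z, time, status) / n, k)
    return beta
```

In the actual algorithm the fixed `step` would be replaced by the non-monotone proximal gradient step-size rule, and `beta` would be warm-started at the LASSO estimator.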
References
Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. Ann Stat 10(4):1100–1120
Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bradic J, Fan J, Jiang J (2011) Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Stat 39(6):3092–3120
Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30(1):89–99
Chen X, Lu Z, Pong TK (2016) Penalty methods for a class of non-Lipschitz optimization problems. SIAM J Optim 26(3):1465–1492
Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138
Cox DR (1975) Partial likelihood. Biometrika 62(2):269–276
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70(5):849–911
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38(6):3567–3604
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional variable selection: beyond the linear model. J Mach Learn Res 10:2013–2038
Fan J, Feng Y, Wu Y (2010) High-dimensional variable selection for Cox’s proportional hazards model. In: Borrowing strength: theory powering applications—a festschrift for Lawrence D. Brown, Institute of Mathematical Statistics, pp 70–86
Fleming TR, Harrington DP (2011) Counting processes and survival analysis, vol 169. Wiley, New York
Gong P, Zhang C, Lu Z, Huang J, Ye J (2013) A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In: International conference on machine learning, pp 37–45
Gorst-Rasmussen A, Scheike T (2013) Independent screening for single-index hazard rate models with ultrahigh dimensional features. J R Stat Soc Ser B 75(2):217–245
He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41(1):342–369
Hong HG, Chen X, Christiani DC, Li Y (2018) Integrated powered density: screening ultrahigh dimensional covariates with survival outcomes. Biometrics 74(2):421–429
Huang J, Sun T, Ying Z, Yu Y, Zhang CH (2013) Oracle inequalities for the Lasso in the Cox model. Ann Stat 41(3):1142–1165
Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105:201–216
Liu Y, Chen X (2018) Quantile screening for ultra-high-dimensional heterogeneous data conditional on some variables. J Stat Comput Simul 88(2):329–342
Liu Y, Zhang J, Zhao X (2018) A new nonparametric screening method for ultrahigh-dimensional survival data. Comput Stat Data Anal 119:74–85
Ma S, Li R, Tsai CL (2017) Variable screening via quantile partial correlation. J Am Stat Assoc 112(518):650–663
Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann Stat 34:1436–1462
Metzeler KH, Hummel M, Bloomfield CD, Spiekermann K, Braess J, Sauerland MC, Heinecke A, Radmacher M, Marcucci G, Whitman SP et al (2008) An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood 112(10):4193–4201
Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, Konig R, Haas S, Eils R et al (2006) Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J Clin Oncol 24(31):5070–5078
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346(25):1937–1947
Rosenwald A, Wright G, Wiestner A, Chan WC, Connors JM, Campo E, Gascoyne RD, Grogan TM, Muller-Hermelink HK, Smeland EB et al (2003) The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell 3(2):185–197
She Y (2009) Thresholding-based iterative selection procedures for model selection and shrinkage. Electron J Stat 3:384–415
Song R, Lu W, Ma S, Jessie Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101(4):799–814
Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395
van Wieringen WN, Kun D, Hampel R, Boulesteix AL (2009) Survival prediction using gene expression data: a review and comparison. Comput Stat Data Anal 53(5):1590–1603
Wright SJ, Nowak RD, Figueiredo MA (2009) Sparse reconstruction by separable approximation. IEEE Trans Signal Process 57(7):2479–2493
Wu Y, Yin G (2015) Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102(1):65–76
Xu C, Chen J (2014) The sparse MLE for ultrahigh-dimensional feature screening. J Am Stat Assoc 109(507):1257–1269
Yang L (2017) Proximal gradient method with extrapolation and line search for a class of nonconvex and nonsmooth problems. ArXiv preprint arXiv:1711.06831
Yang G, Yu Y, Li R, Buu A (2016) Feature screening in ultrahigh dimensional Cox’s model. Stat Sin 26:881
Zhang T (2009) Multi-stage convex relaxation for learning with sparse regularization. In: Advances in neural information processing systems, pp 1929–1936
Zhang J, Liu Y, Wu Y (2017) Correlation rank screening for ultrahigh-dimensional survival data. Comput Stat Data Anal 108:121–132
Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105(1):397–411
Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27(4):947–961
Acknowledgements
The authors are grateful to the editor, anonymous associate editor and two anonymous reviewers for their helpful and insightful comments, which led to significant improvements of this paper. Chen’s research is partially supported by the National Natural Science Foundation of China (11501573 and 11771250). Liu’s research is partially supported by the General Research Fund (15327216) of the Research Grants Council (RGC), Hong Kong. Xu’s research is supported by The Hong Kong Polytechnic University Ph.D. Studentship. The authors thank Dr. Lei Yang for his discussions on optimization problems.
Appendix
Without loss of generality, in the Cox model (1), we assume that the first q components of \(\varvec{\beta }\) are truly important. Then we partition \(\varvec{\beta }\) into \(\varvec{\beta }=(\varvec{\beta }_{1}^T,\varvec{\beta }_{2}^T)^T\), where \(\varvec{\beta }_{1}=(\beta _{1},\ldots ,\beta _{q})^{T}\) and \(\varvec{\beta }_{2}=(\beta _{q+1},\ldots ,\beta _{p})^{T}=(0,\ldots ,0)^{T}\). In addition, define \( {\mathbf {M}}_{+}^{k}=\{M:M_{0}\subset M, \Vert M\Vert _{0}\le k\} \) and \( {\mathbf {M}}_{-}^{k}=\{M:M_{0} \not \subset M, \Vert M\Vert _{0}\le k\}. \)
Assumption 1
There exist scalar, vector and matrix functions \(s^{(l)}(\beta ,t)\) defined on \({\mathcal {B}}\times [0,\tau ]\), \(l=0,1,2\), that satisfy the following conditions: (i) \(\mathrm {sup}_{t \in [0,\tau ],\varvec{\beta }_{1}\in {\mathcal {B}}_{1} }\Vert S^{(l)}(\varvec{\beta }_{1},t)-s^{(l)}(\varvec{\beta }_{1},t)\Vert _{2}\rightarrow 0\) in probability as \(n\rightarrow \infty \) for \({\mathcal {B}}_{1}\subset R^{q}\), \({\mathcal {B}}_{1}\subset {\mathcal {B}}\); (ii) The functions \(s^{(l)}(\beta ,t)\) are bounded and \(s^{(0)}(\beta ,t)\) is bounded away from 0 on \({\mathcal {B}}\times [0,\tau ]\); and the family of functions \(s^{(l)}(\cdot ,t),0\le t\le \tau \), is an equicontinuous family at \(\beta ^{*}\).
Assumption 2
Let \(\bar{{\mathbf {z}}}(\varvec{\beta },t)=s^{(1)}(\varvec{\beta },t)/s^{(0)}(\varvec{\beta },t)\). Denote \(d_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert \bar{{\mathbf {Z}}}(\varvec{\beta }^{*},t)-\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\Vert _{\infty }\) and \(e_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert S^{(0)}(\varvec{\beta }^{*},t)\) \(-s^{(0)}(\varvec{\beta }^{*},t)\Vert _{\infty }\). The random sequences \(d_{n}\) and \(e_{n}\) are bounded almost surely.
Assumption 3
Define \(\varepsilon _{ij}=\int _{0}^{\tau }\{Z_{ij}-{\bar{z}}_{j}(\varvec{\beta }^{*},t)\}dM_{i}(t)\), where \({\bar{z}}_{j}(\varvec{\beta }^{*},t)\) is the jth component of \(\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\). Suppose that the Cramér condition holds for \(\varepsilon _{ij}\), i.e., \(E|\varepsilon _{ij}|^{l}\le 2^{-1}l!c_{1}^{l-2}\sigma _{j}^{2}\) for all j, where \(c_{1}\) is a positive constant, \(l\ge 2\), and \(\sigma _{j}^{2}=\mathrm {var}(\varepsilon _{ij})<\infty \).
Assumption 4
There exist positive constants \(c_{2},c_{3},\tau _{1}\) and \(\tau _{2}\) such that \(\mathrm {min}_{j\in M_{0}}|\beta _{j}^{*}|\ge c_{2}n^{-\tau _{1}}\) and \(q\le k\le c_{3}n^{\tau _{2}}\).
Assumption 5
When n is sufficiently large, for \(\varvec{\beta }_{M}\in \{\varvec{\beta }_{M}:\Vert \varvec{\beta }_{M}-\varvec{\beta }_{M}^{*}\Vert _{2}\le \delta \}\), \(M \in {\mathbf {M}}_{+}^{2k}\), it holds that \(\lambda _{\mathrm {min}}\{n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\}\ge c_{4}\), where \(c_{4}\) and \(\delta \) are positive constants depending on k but not M, \(\lambda _{\mathrm {min}}(A)\) is the minimum eigenvalue of the matrix A, and \(V(\varvec{\beta }_{M},t)\) is a version of \(V(\varvec{\beta }, t)\) based on the model M.
Assumption 6
There exists a positive constant \(c_{5}\) such that, for sufficiently large n, \(\varvec{\eta }^{T}\{n^{-1}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\}\varvec{\eta }\ge c_{5}\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{2}^{2}\) for any \(\varvec{\eta }\ne 0\) with \(\Vert \varvec{\eta }_{_{M_{0}^{c}}}\Vert _{1}\le 3\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{1}\), and any \(\varvec{\beta } \in {\mathcal {B}}\), where \(M_{0}^{c}\) is the complement of \(M_{0}\) in \(\{1,2,\ldots ,p\}\).
Assumption 7
\(\mathrm {sup}_{(\varvec{\beta },t)\in {\mathcal {B}}(\varvec{\beta }^{*},o(w))\times [0,\tau ]} \Vert V(\varvec{\beta },t)\Vert _{\infty }=O_{p}(n^{\tau _{3}})\), where \(\tau _{3}\) is a positive constant, \(w=\mathrm {min}_{j\in M_{0}}\Vert \beta _{j}^{*}\Vert \), and \({\mathcal {B}}(\varvec{\beta }^{*},o(w))\) is a p-dimensional ball centered at \(\varvec{\beta }^{*}\) with radius o(w).
Assumptions 1 to 3 are mild and are needed to obtain a large deviation result; see Bradic et al. (2011) for more discussion of these assumptions. The first part of Assumption 4 states that the marginal signals are strong enough to be detected; similar assumptions are widely made in ultrahigh-dimensional data analysis. The second part allows the number of important features to diverge to infinity at a polynomial rate. Assumption 5 is a very weak assumption, since \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is usually positive definite when k is not too large, noting that the dimension of \(\varvec{\beta }_{M}\) is k. For example, if \(k=5\), then \(\varvec{\beta }_{M}\) is a 5-dimensional parameter vector and \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is a \(5\times 5\) matrix, which is positive definite with high probability for a moderate sample size. Assumption 6 is needed to derive an error bound for the LASSO estimator; many similar assumptions appear in the literature, such as Bickel et al. (2009) for the linear model, Xu and Chen (2014) for the generalized linear model, and Huang et al. (2013) for the Cox model. As discussed in Fleming and Harrington (2011), \(V(\varvec{\beta },t)\) is an empirical covariance matrix of \({\mathbf {Z}}_{i}\) with weights proportional to \(Y_{i}(t)\mathrm {exp}\{{\mathbf {Z}}_{i}^{T}\varvec{\beta }\}\). Hence, Assumption 7 requires that the association among features not be too strong; in particular, if all features are mutually independent, this assumption is easily met. Similar assumptions can be found in Bradic et al. (2011).
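To give a feel for Assumption 5, the snippet below computes the minimum eigenvalue of a weighted empirical covariance of the kind \(V(\varvec{\beta },t)\) represents, with weights proportional to \(\mathrm {exp}\{{\mathbf {z}}_{i}^{T}\varvec{\beta }\}\). This is a simplified single-time-point check on simulated standard normal features, not the paper's integrated quantity \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n, kdim = 200, 5                         # moderate sample size, k = 5 as in the text
Z = rng.standard_normal((n, kdim))
beta = rng.normal(0.0, 0.3, kdim)

w = np.exp(Z @ beta)
w /= w.sum()                             # weights proportional to exp(z_i^T beta)
zbar = w @ Z                             # weighted mean of the features
V = (Z - zbar).T @ ((Z - zbar) * w[:, None])   # weighted empirical covariance

lam_min = np.linalg.eigvalsh(V).min()
print(lam_min > 0)                       # positive definite for moderate n
```

With 200 observations and 5 retained features the weighted covariance is full rank almost surely, matching the discussion above that Assumption 5 holds easily when k is small relative to n.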
The proofs of the theorems can be obtained along the lines of Xu and Chen (2014), by combining the large deviation result for martingales of Bradic et al. (2011) with our specific assumptions for the Cox model.
Proof of Theorem 1
Firstly, we prove the monotonicity. It is easy to see that
By Taylor expansion of \(\sum _{i=1}^{n}\delta _{i}\mathrm {log}(nS^{(0)}(\varvec{\beta }^{(t+1)},X_{i}))\) at \(\varvec{\beta }^{(t)}\) and some algebraic manipulation, we have
Then, under the assumptions in Theorem 1, we finally arrive at
Secondly, we show that \(\{\varvec{\beta }^{(t)}\}\) converges to a local maximum of \(l_{n}(\varvec{\beta })\). Note that \(l_{n}(\cdot )\) is bounded when \(\varvec{\beta }\) is confined to \({\mathcal {B}}\). From the proof of the first part, we can see
By the monotonicity and boundedness of \(l_{n}(\cdot )\), \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\) as \(t\rightarrow \infty \).
Note that k and p are constants in the proof of the convergence of \(\{\varvec{\beta }^{(t)}\}\). Since \(\{\varvec{\beta }^{(t)}\}\) has finitely many sparse patterns, we can always find a subsequence of \(\{\varvec{\beta }^{(t)}\}\), say \(\{\varvec{\beta }^{(t_{m})}\}\), with a common sparse pattern M. Because \(\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is positive definite for any \(\varvec{\beta }_{M}\), \(l_{n}(\cdot )\) is strictly concave in \(\varvec{\beta }_{M}\) with sparse pattern M. Then \(\{\varvec{\beta }^{(t_{m})}\}\) is bounded and has at least one limiting point, denoted by \(\varvec{\beta }^{\dag }=(\beta _{1}^{\dag },\beta _{2}^{\dag },\ldots ,\beta _{p}^{\dag })^{T}\). Based on the facts that \(\varvec{\beta }^{(t_{m}+1)}=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{(t_{m})})\}\) and \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\), we have \(\varvec{\beta }^{\dag }=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\}\). This indicates that \(\varvec{\beta }^{\dag }\) maximizes \(Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\) with respect to \(\varvec{\beta }\) over the sparse pattern M.
If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}<k\), i.e., \(\varvec{\beta }^{\dag }\) has fewer than k non-zero entries, it is easy to see that \({\dot{l}}_{n}(\varvec{\beta }^{\dag })=0\). So \(\varvec{\beta }^{\dag }\) is the unconstrained maximizer of \(l_{n}(\varvec{\beta })\) and satisfies \(\Vert \varvec{\beta }^{\dag }\Vert _{0}\le k\).
If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}=k\), we will prove that \(\varvec{\beta }^{\dag }\) is the unique maximizer for the sparse pattern M. Note that
where
So it is easy to see that \(\frac{\partial R(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \varvec{\beta }}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\). In addition, \(\frac{\partial Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \varvec{\beta }_{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\), because \(\varvec{\beta }^{\dag }\) is the local maximizer for the sparse pattern M. Then we have \(\frac{\partial l_{n}(\varvec{\beta })}{\partial \varvec{\beta }_{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\), which implies that \(\varvec{\beta }^{\dag }\) is the unique maximizer of \(l_{n}(\cdot )\) by strict concavity. \(\square \)
In summary, any limiting point of \(\{\varvec{\beta }^{(t)}\}\) is a local maximizer satisfying \(\Vert \varvec{\beta }\Vert _{0}\le k\). By the finiteness of sparse patterns, there are at most finitely many limiting points. Furthermore, similarly to Xu and Chen (2014), standard arguments from mathematical analysis show that \(\{\varvec{\beta }^{(t)}\}\) has only one limiting point and thus converges.
Lemma 1
Define \(\varvec{\beta }^{(0)}=\mathrm {argmax}_{\varvec{\beta }}\{l_{n}(\varvec{\beta })-n\lambda \Vert \varvec{\beta }\Vert _{1}\}\), where \(\lambda \) satisfies \(\lambda n^{\frac{1}{2}-m}\rightarrow \infty \) and \(\lambda n^{\tau _{1}+\tau _{2}}\rightarrow 0\). Under Assumptions 1 to 3 and 6, if \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), then with probability tending to one, \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\le 16c_{5}^{-1}\lambda q\), where \(c_{5}\) is defined in Assumption 6.
Proof
By the definition of \(\varvec{\beta }^{(0)}\), it is easy to see that \(l_{n}(\varvec{\beta }^{(0)})-n\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}\ge l_{n}(\varvec{\beta }^{*})-n\lambda \Vert \varvec{\beta }^{*}\Vert _{1}\), or equivalently \(l_{n}(\varvec{\beta }^{(0)})-l_{n}(\varvec{\beta }^{*})\ge n\lambda (\Vert \varvec{\beta }^{(0)}\Vert _{1}-\Vert \varvec{\beta }^{*}\Vert _{1})\).
Define \(\varvec{\delta }=\varvec{\beta }^{(0)}-\varvec{\beta }^{*}=(\delta _{1},\ldots ,\delta _{p})^{T}\). By Taylor expansion of \(l_{n}(\varvec{\beta }^{(0)})\) at \(\varvec{\beta }^{*}\), we have
where \(\varvec{\beta }\) is between \(\varvec{\beta }^{(0)}\) and \(\varvec{\beta }^{*}\). So
Denote \({\mathcal {A}}=\{\mathrm {max}_{1\le j \le p}|{\dot{l}}_{nj}(\varvec{\beta }^{*})|\le \frac{n\lambda }{2}\}\).
Because \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), together with Assumptions 1–3, the tail probability bound in Theorem 3.1 of Bradic et al. (2011) can be applied here by substituting \({\dot{l}}_{nj}(\varvec{\beta }^{*})\) for their \(\xi _j\); that is, there exist positive constants \(c_{7}\) and \(c_{8}\) such that \(\mathrm {pr}(|{\dot{l}}_{nj}(\varvec{\beta }^{*})|>\sqrt{n}u_{n}) \le c_{7}\mathrm {exp}(-c_{8}u_{n})\). Then we have
where \(c_{9}\) is a positive constant. So we can conclude that \(\mathrm {pr}({\mathcal {A}})\rightarrow 1\) and \(\Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }=O_{p}(n\lambda )\). Under the event \({\mathcal {A}}\), it is easy to see that
So
It is easy to see that \(\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\) is positive semidefinite. Thus \(\Vert \varvec{\delta }\Vert _{1}\le 4\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\), or equivalently \(\Vert \varvec{\delta }_{M_{0}^{c}}\Vert _{1}\le 3\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\). By the Cauchy–Schwarz inequality and Assumption 6,
So \(\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\le 4 c_{5}^{-1}\lambda q\). Then we finally arrive at \(\Vert \varvec{\delta }\Vert _{1}\le 4\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\le 16 c_{5}^{-1}\lambda q\).
This finishes the proof.
Proof of Theorem 2
Recall that \(w=\mathrm {min}_{j\in M_{0}}\Vert \beta _{j}^{*}\Vert \). We only need to show that \(\mathrm {pr}(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }<\frac{w}{2})\rightarrow 1\), for which it suffices to prove \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). As in Xu and Chen (2014), we use mathematical induction to obtain this result.
When \(t=0\), by Lemma 1, we have
Because \(\lambda =o(n^{-(\tau _{1}+\tau _{2})})\), \(q=O(n^{\tau _{2}})\) and \(w^{-1}=O(n^{\tau _{1}})\), we have \(\lambda qw^{-1}=o(n^{-(\tau _{1}+\tau _{2})})O(n^{\tau _{2}})O(n^{\tau _{1}})=o(1)\), that is, \(\lambda q=o(w)\). So \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}=o_{p}(w)\). Noting that \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\), the desired result follows for \(t=0\).
Suppose that \(\Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). In the following, we show that \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) also holds. Note that \(\varvec{\beta }^{(t)}={\mathbf {H}}(\tilde{\varvec{\beta }};k)\), where \(\tilde{\varvec{\beta }}=\varvec{\beta }^{(t-1)}+u^{-1}{\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\). If \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) holds, then the elements of \(\tilde{\varvec{\beta }}_{M_{0}}\) are among those with the top k largest absolute values in probability, and thus \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). So it remains to prove \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). Note that \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }+ \frac{1}{u}\Vert {\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\Vert _{\infty }\). By Taylor expansion of \({\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\) at \(\varvec{\beta }^{*}\), we have
where \(\bar{\varvec{\beta }}\) is between \(\varvec{\beta }^{(t-1)}\) and \(\varvec{\beta }^{*}\). So
This completes the proof. \(\square \)
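The key step of the induction above — that hard thresholding preserves the true support once the sup-norm error falls below half the minimum signal strength \(w\) — can be checked with a small numeric experiment. This is a toy illustration under assumed magnitudes, not part of the proof; all names and constants are our own:

```python
import numpy as np

def hard_threshold(v, k):
    """H(v; k): keep the k entries of v largest in absolute value."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

rng = np.random.default_rng(1)
p, q, k, w = 1000, 3, 10, 1.0
beta_star = np.zeros(p)
beta_star[:q] = [w, -1.5 * w, 2.0 * w]        # true signals, all at least w in magnitude
noise = rng.uniform(-0.49 * w, 0.49 * w, p)   # sup-norm error strictly below w/2
beta_tilde = beta_star + noise
support = set(np.flatnonzero(hard_threshold(beta_tilde, k)))
print(set(range(q)) <= support)               # True: the true support survives thresholding
```

Every perturbed signal entry exceeds \(w/2\) in magnitude while every noise-only entry stays below \(w/2\), so the top-k selection with \(k\ge q\) necessarily retains \(M_{0}\), exactly as the induction step requires.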
Verification of the assumptions for our NPG algorithm
It is easy to see that our algorithm could be formulated as
where \(P(\varvec{\beta })\) takes the value 0 if \(\varvec{\beta }\in {\mathcal {B}}(k)\) and \(\infty \) otherwise. In the following, we verify that \(-l_{n}(\varvec{\beta })\) and \(P(\varvec{\beta })\) satisfy the conditions in Assumption A.1 of Chen et al. (2016, pages 1485–1486). It is easy to check that \(\text {Dom}(P)={\mathcal {B}}(k)\), where \(\text {Dom}(P)\) is defined in Chen et al. (2016). Note that \({\mathcal {B}}(k) \subseteq {\mathcal {B}}\), where \({\mathcal {B}}\) is a bounded set.
- (i) This condition is easily satisfied, since \(-l_{n}(\varvec{\beta })\) is continuously differentiable and its derivative is bounded on a bounded closed set.
- (ii) For any \(\varvec{\beta }^\dagger \in {\mathbb {R}}^p\), if \( \varvec{\beta }^\dagger \in {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger ) = 0\); since \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\ge 0\), it follows that \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). If \( \varvec{\beta }^\dagger \notin {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger )=\infty \); since the complement of \(\text {Dom}(P)\) is an open set, \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })=\infty \), so again \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). Hence P is a lower semicontinuous function on \({\mathbb {R}}^p\).
- (iii) For \(\varvec{\beta }^\dagger \in \text {Dom}(P)\), \(\Omega (\varvec{\beta }^\dagger )=\{\varvec{\beta } \in {\mathbb {R}}^p: F(\varvec{\beta })\le F(\varvec{\beta }^\dagger )\}\subseteq \text {Dom}(P)\). Then \(F(\varvec{\beta })\) is bounded below on \(\Omega (\varvec{\beta }^\dagger )\).
- (iv) Let \(A=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}\Vert \nabla (-l_{n}(\varvec{\beta }))\Vert \), \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\) and \(C=\inf _{\varvec{\beta } \in {\mathbb {R}}^{p}}P(\varvec{\beta })\). Because \(\Omega (\varvec{\beta }^\dagger )\subseteq \text {Dom}(P)\) and \(\nabla (-l_n(\varvec{\beta }))\) is continuous, \(A<\infty \). Furthermore, \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\le \sup _{\varvec{\beta } \in \text {Dom}(P)}P(\varvec{\beta })=0<\infty \). In addition, \(C=0>-\infty \).
Hence the assumptions in Chen et al. (2016) are verified.
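For intuition, with this indicator penalty the NPG scheme reduces to a proximal gradient method whose proximal map is exactly hard thresholding (projection onto the k-sparse set), combined with a Barzilai–Borwein initial step size and a non-monotone line search. The sketch below applies this template to a simple least-squares objective rather than \(-l_{n}\); the window length, constants, and acceptance rule loosely follow the general GIST/NPG framework (Gong et al. 2013; Chen et al. 2016) but the specific choices and names here are our own:

```python
import numpy as np

def hard_threshold(v, k):
    """Proximal map of the indicator of the k-sparse set."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def npg_sparse(f, grad, x0, k, iters=100, window=4, shrink=0.5, sigma=1e-4):
    """Non-monotone proximal gradient for min f(x) subject to ||x||_0 <= k."""
    x = hard_threshold(x0, k)
    hist = [f(x)]
    L = 1.0
    x_prev = g_prev = None
    for _ in range(iters):
        g = grad(x)
        if x_prev is not None:                 # Barzilai-Borwein initial step size
            s, y = x - x_prev, g - g_prev
            if s @ y > 1e-12:
                L = (y @ y) / (s @ y)
        step = 1.0 / max(L, 1e-8)
        for _ in range(50):                    # non-monotone backtracking:
            x_new = hard_threshold(x - step * g, k)
            # accept if below the max of the last few objective values
            if f(x_new) <= max(hist[-window:]) - (sigma / (2 * step)) * np.sum((x_new - x) ** 2):
                break
            step *= shrink
        x_prev, g_prev = x, g
        x = x_new
        hist.append(f(x))
    return x

# toy sparse least-squares instance
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 60))
x_true = np.zeros(60)
x_true[[5, 17, 33]] = [2.0, -1.5, 1.0]
b = A @ x_true
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
x_hat = npg_sparse(f, grad, np.zeros(60), k=3)
```

The non-monotone acceptance rule (comparing against the maximum over a short window of past objective values rather than the last one) is what lets the method take the aggressive BB steps that make the algorithm fast in practice.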
Chen, X., Liu, C.C. & Xu, S. An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model. Comput Stat 36, 885–910 (2021). https://doi.org/10.1007/s00180-020-01032-9