
An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model


Abstract

The Cox model is an exceedingly popular semiparametric hazard regression model for the analysis of time-to-event data with explanatory variables. In the ultrahigh-dimensional setting, unlike marginal screening strategies, there is a joint feature screening method based on the partial likelihood of the Cox model, but its computational feasibility has remained unresolved. In this paper, we develop an enhanced iterative hard-thresholding algorithm by adapting the non-monotone proximal gradient method to the Cox model. The proposed algorithm is efficient in that it is both statistically effective and computationally fast. Moreover, it starts from a LASSO initial estimator rather than the naive zero initial value, still enjoys the sure screening property in theory, and further improves computational efficiency in practice. We also provide a rigorous theoretical justification. The advantages of the proposed method are demonstrated by numerical studies and illustrated with the diffuse large B-cell lymphoma data example.


References

  • Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. Ann Stat 10(4):1100–1120

  • Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148

  • Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  • Bradic J, Fan J, Jiang J (2011) Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Stat 39(6):3092–3120

  • Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30(1):89–99

  • Chen X, Lu Z, Pong TK (2016) Penalty methods for a class of non-Lipschitz optimization problems. SIAM J Optim 26(3):1465–1492

  • Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138

  • Cox DR (1975) Partial likelihood. Biometrika 62(2):269–276

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70(5):849–911

  • Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38(6):3567–3604

  • Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional variable selection: beyond the linear model. J Mach Learn Res 70:2013–2038

  • Fan J, Feng Y, Wu Y (2010) High-dimensional variable selection for Cox’s proportional hazards model. In: Borrowing strength: theory powering applications—a festschrift for Lawrence D. Brown, Institute of Mathematical Statistics, pp 70–86

  • Fleming TR, Harrington DP (2011) Counting processes and survival analysis, vol 169. Wiley, New York

  • Gong P, Zhang C, Lu Z, Huang J, Ye J (2013) A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In: International conference on machine learning, pp 37–45

  • Gorst-Rasmussen A, Scheike T (2013) Independent screening for single-index hazard rate models with ultrahigh dimensional features. J R Stat Soc Ser B 75(2):217–245

  • He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41(1):342–369

  • Hong HG, Chen X, Christiani DC, Li Y (2018) Integrated powered density: screening ultrahigh dimensional covariates with survival outcomes. Biometrics 74(2):421–429

  • Huang J, Sun T, Ying Z, Yu Y, Zhang CH (2013) Oracle inequalities for the Lasso in the Cox model. Ann Stat 41(3):1142–1165

  • Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105:201–216

  • Liu Y, Chen X (2018) Quantile screening for ultra-high-dimensional heterogeneous data conditional on some variables. J Stat Comput Simul 88(2):329–342

  • Liu Y, Zhang J, Zhao X (2018) A new nonparametric screening method for ultrahigh-dimensional survival data. Comput Stat Data Anal 119:74–85

  • Ma S, Li R, Tsai CL (2017) Variable screening via quantile partial correlation. J Am Stat Assoc 112(518):650–663

  • Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann Stat 34:1436–1462

  • Metzeler KH, Hummel M, Bloomfield CD, Spiekermann K, Braess J, Sauerland MC, Heinecke A, Radmacher M, Marcucci G, Whitman SP et al (2008) An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood 112(10):4193–4201

  • Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, Konig R, Haas S, Eils R et al (2006) Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J Clin Oncol 24(31):5070–5078

  • Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346(25):1937–1947

  • Rosenwald A, Wright G, Wiestner A, Chan WC, Connors JM, Campo E, Gascoyne RD, Grogan TM, Muller-Hermelink HK, Smeland EB et al (2003) The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell 3(2):185–197

  • She Y (2009) Thresholding-based iterative selection procedures for model selection and shrinkage. Electron J Stat 3:384–415

  • Song R, Lu W, Ma S, Jessie Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101(4):799–814

  • Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395

  • van Wieringen WN, Kun D, Hampel R, Boulesteix AL (2009) Survival prediction using gene expression data: a review and comparison. Comput Stat Data Anal 53(5):1590–1603

  • Wright SJ, Nowak RD, Figueiredo MA (2009) Sparse reconstruction by separable approximation. IEEE Trans Signal Process 57(7):2479–2493

  • Wu Y, Yin G (2015) Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102(1):65–76

  • Xu C, Chen J (2014) The sparse MLE for ultrahigh-dimensional feature screening. J Am Stat Assoc 109(507):1257–1269

  • Yang L (2017) Proximal gradient method with extrapolation and line search for a class of nonconvex and nonsmooth problems. arXiv preprint arXiv:1711.06831

  • Yang G, Yu Y, Li R, Buu A (2016) Feature screening in ultrahigh dimensional Cox’s model. Stat Sin 26:881

  • Zhang T (2009) Multi-stage convex relaxation for learning with sparse regularization. In: Advances in neural information processing systems, pp 1929–1936

  • Zhang J, Liu Y, Wu Y (2017) Correlation rank screening for ultrahigh-dimensional survival data. Comput Stat Data Anal 108:121–132

  • Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105(1):397–411

  • Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563

  • Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27(4):947–961

Acknowledgements

The authors are grateful to the editor, the anonymous associate editor and two anonymous reviewers for their helpful and insightful comments, which led to significant improvements of this paper. Chen’s research is partially supported by the National Natural Science Foundation of China (11501573 and 11771250). Liu’s research is partially supported by the General Research Fund (15327216), Research Grants Council (RGC), Hong Kong. Xu’s research is supported by The Hong Kong Polytechnic University Ph.D. Studentship. The authors thank Dr. Lei Yang for helpful discussions on optimization problems.

Author information

Corresponding author

Correspondence to Sheng Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Without loss of generality, in the Cox model (1), we assume that the first q components of \(\varvec{\beta }\) are truly important. Then we partition \(\varvec{\beta }\) into \(\varvec{\beta }=(\varvec{\beta }_{1}^T,\varvec{\beta }_{2}^T)^T\), where \(\varvec{\beta }_{1}=(\beta _{1},\ldots ,\beta _{q})^{T}\) and \(\varvec{\beta }_{2}=(\beta _{q+1},\ldots ,\beta _{p})^{T}=(0,\ldots ,0)^{T}\). In addition, define \( {\mathbf {M}}_{+}^{k}=\{M:M_{0}\subset M, \Vert M\Vert _{0}\le k\} \) and \( {\mathbf {M}}_{-}^{k}=\{M:M_{0} \not \subset M, \Vert M\Vert _{0}\le k\}. \)

Assumption 1

There exist scalar, vector and matrix functions \(s^{(l)}(\beta ,t)\) defined on \({\mathcal {B}}\times [0,\tau ]\), \(l=0,1,2\), that satisfy the following conditions: (i) \(\mathrm {sup}_{t \in [0,\tau ],\varvec{\beta }_{1}\in {\mathcal {B}}_{1} }\Vert S^{(l)}(\varvec{\beta }_{1},t)-s^{(l)}(\varvec{\beta }_{1},t)\Vert _{2}\rightarrow 0\) in probability as \(n\rightarrow \infty \) for \({\mathcal {B}}_{1}\subset R^{q}\), \({\mathcal {B}}_{1}\subset {\mathcal {B}}\); (ii) The functions \(s^{(l)}(\beta ,t)\) are bounded and \(s^{(0)}(\beta ,t)\) is bounded away from 0 on \({\mathcal {B}}\times [0,\tau ]\); and the family of functions \(s^{(l)}(\cdot ,t),0\le t\le \tau \), is an equicontinuous family at \(\beta ^{*}\).

Assumption 2

Let \(\bar{{\mathbf {z}}}(\varvec{\beta },t)=s^{(1)}(\varvec{\beta },t)/s^{(0)}(\varvec{\beta },t)\). Denote \(d_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert \bar{{\mathbf {Z}}}(\varvec{\beta }^{*},t)-\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\Vert _{\infty }\) and \(e_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert S^{(0)}(\varvec{\beta }^{*},t)\) \(-s^{(0)}(\varvec{\beta }^{*},t)\Vert _{\infty }\). The random sequences \(d_{n}\) and \(e_{n}\) are bounded almost surely.

Assumption 3

Define \(\varepsilon _{ij}=\int _{0}^{\tau }\{Z_{ij}-{\bar{z}}_{j}(\varvec{\beta }^{*},t)\}dM_{i}(t)\), where \({\bar{z}}_{j}(\varvec{\beta }^{*},t)\) is the jth component of \(\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\). Suppose that the Cramér condition holds for \(\varepsilon _{ij}\), i.e., \(E|\varepsilon _{ij}|^{l}\le 2^{-1}l!c_{1}^{l-2}\sigma _{j}^{2}\) for all j, where \(c_{1}\) is a positive constant, \(l\ge 2\), and \(\sigma _{j}^{2}=\mathrm {var}(\varepsilon _{ij})<\infty \).

Assumption 4

There exist positive constants \(c_{2},c_{3},\tau _{1}\) and \(\tau _{2}\) such that \(\mathrm {min}_{j\in M_{0}}|\beta _{j}^{*}|\ge c_{2}n^{-\tau _{1}}\) and \(q\le k\le c_{3}n^{\tau _{2}}\).

Assumption 5

When n is sufficiently large, for \(\varvec{\beta }_{M}\in \{\varvec{\beta }_{M}:\Vert \varvec{\beta }_{M}-\varvec{\beta }_{M}^{*}\Vert _{2}\le \delta \}\), \(M \in {\mathbf {M}}_{+}^{2k}\), it holds that \(\lambda _{\mathrm {min}}\{n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\}\ge c_{4}\), where \(c_{4}\) and \(\delta \) are positive constants depending on k but not M, \(\lambda _{\mathrm {min}}(A)\) is the minimum eigenvalue of the matrix A, and \(V(\varvec{\beta }_{M},t)\) is a version of \(V(\varvec{\beta }, t)\) based on the model M.

Assumption 6

There exists a positive constant \(c_{5}\) such that, for sufficiently large n,

$$\begin{aligned} \varvec{\eta }^{T}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\varvec{\eta } \ge c_{5} n\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{2}^{2}, \end{aligned}$$

for any \(\varvec{\eta }\ne 0\), \(\Vert \varvec{\eta }_{_{M_{0}^{c}}}\Vert _{1}\le 3\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{1}\), and \(\varvec{\beta } \in {\mathcal {B}}\), where \(M_{0}^{c}\) is the complement of \(M_{0}\) in \(\{1,2,\ldots ,p\}\).

Assumption 7

\(\mathrm {sup}_{(\varvec{\beta },t)\in {\mathcal {B}}(\varvec{\beta }^{*},o(w))\times [0,\tau ]} \Vert V(\varvec{\beta },t)\Vert _{\infty }=O_{p}(n^{\tau _{3}})\), where \(\tau _{3}\) is a positive constant, \(w=\mathrm {min}_{j\in M_{0}}|\beta _{j}^{*}|\), and \({\mathcal {B}}(\varvec{\beta }^{*},o(w))\) is a p-dimensional ball centered at \(\varvec{\beta }^{*}\) with radius o(w).

Assumptions 1 to 3 are mild and are needed to obtain a large deviation result; see Bradic et al. (2011) for more discussion of these assumptions. The first part of Assumption 4 states that the signals of the important features are strong enough to be detected; similar assumptions are widely made in ultrahigh-dimensional data analysis. The second part allows the number of important features to diverge to infinity at a polynomial rate. Assumption 5 is very weak, since \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is usually positive definite when k is not too large, noting that the dimension of \(\varvec{\beta }_{M}\) is k. For example, if \(k=5\), \(\varvec{\beta }_{M}\) is a 5-dimensional parameter vector, and \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is a \(5\times 5\) matrix, which is positive definite with high probability for a moderate sample size. Assumption 6 is needed to derive an error bound for the LASSO estimator; similar assumptions appear throughout the literature, for example in Bickel et al. (2009) for the linear model, Xu and Chen (2014) for the generalized linear model, and Huang et al. (2013) for the Cox model. As discussed in Fleming and Harrington (2011), \(V(\varvec{\beta },t)\) is an empirical covariance matrix of the \({\mathbf {Z}}_{i}\) with weights proportional to \(Y_{i}(t)\mathrm {exp}\{{\mathbf {Z}}_{i}^{T}\varvec{\beta }\}\). Hence Assumption 7 requires that the association among features not be too strong; in particular, if all features are mutually independent, this assumption is easily met. Similar assumptions can be found in Bradic et al. (2011).
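
To make these quantities concrete, the following is a minimal NumPy sketch, written by us for illustration rather than taken from the paper, of the gradient \({\dot{l}}_{n}(\varvec{\beta })=\sum _{i}\delta _{i}\{{\mathbf {Z}}_{i}-\bar{{\mathbf {Z}}}(\varvec{\beta },X_{i})\}\) and of \(V(\varvec{\beta },t)\) as the weighted covariance just described; the data layout (arrays `Z`, `time`, `delta`) is our assumption.

```python
import numpy as np

def cox_gradient(beta, Z, time, delta):
    """Gradient of the log partial likelihood:
    sum over events i of delta_i * (Z_i - Zbar(beta, X_i)), Zbar = S^(1)/S^(0)."""
    w = np.exp(Z @ beta)                    # exp(Z_j^T beta) for every subject j
    grad = np.zeros_like(beta)
    for i in np.flatnonzero(delta):
        at_risk = time >= time[i]           # Y_j(X_i) = 1{X_j >= X_i}
        s0 = w[at_risk].sum()               # n * S^(0)(beta, X_i)
        s1 = w[at_risk] @ Z[at_risk]        # n * S^(1)(beta, X_i)
        grad += Z[i] - s1 / s0
    return grad

def cox_V(beta, Z, time, t):
    """V(beta, t): covariance of Z over the risk set at time t,
    with weights proportional to Y_j(t) * exp(Z_j^T beta)."""
    risk = time >= t
    w = np.exp(Z[risk] @ beta)
    zbar = (w @ Z[risk]) / w.sum()
    D = Z[risk] - zbar
    return (D * w[:, None]).T @ D / w.sum()
```

With this notation, the quantity \(\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\) appearing in the proofs is simply \(\sum _{i}\delta _{i}V(\varvec{\beta },X_{i})\).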

The proofs of the theorems follow the lines of Xu and Chen (2014), combined with the large deviation result for martingales in Bradic et al. (2011), under our specific assumptions for the Cox model.

Proof of Theorem 1

First, we prove monotonicity. It is easy to see that

$$\begin{aligned} l_{n}(\varvec{\beta }^{(t)})= & {} Q_{n}(\varvec{\beta }^{(t)}|\varvec{\beta }^{(t)}) \\\le & {} Q_{n}(\varvec{\beta }^{(t+1)}|\varvec{\beta }^{(t)}) \\= & {} l_{n}(\varvec{\beta }^{(t)})+(\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})^{T}{\dot{l}}_{n}(\varvec{\beta }^{(t)}) -\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2} \\= & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2} +(\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})^{T}{\dot{l}}_{n}(\varvec{\beta }^{(t)}) \\&+\,l_{n}(\varvec{\beta }^{(t)})-l_{n}(\varvec{\beta }^{(t+1)}) \\= & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2}+ \sum _{i=1}^{n}\delta _{i} \Big [ \mathrm {log}(nS^{(0)}(\varvec{\beta }^{(t+1)},X_{i})) \\&-\, \mathrm {log}(nS^{(0)}(\varvec{\beta }^{(t)},X_{i})) -(\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})^{\mathrm {T}}\bar{{\mathbf {Z}}}(\varvec{\beta }^{(t)},X_{i}) \Big ] \end{aligned}$$

By a Taylor expansion of \(\sum _{i=1}^{n}\delta _{i}\mathrm {log}(nS^{(0)}(\varvec{\beta }^{(t+1)},X_{i}))\) at \(\varvec{\beta }^{(t)}\) and some algebra, we have

$$\begin{aligned} l_{n}(\varvec{\beta }^{(t)})\le & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2}\nonumber \\&+\,\frac{1}{2}(\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})^{T} \sum _{i=1}^{n}\delta _{i} \Big (\frac{S^{(2)}(\varvec{\beta },X_{i})}{S^{(0)}(\varvec{\beta },X_{i})}- \big [\frac{S^{(1)}(\varvec{\beta },X_{i})}{S^{(0)}(\varvec{\beta },X_{i})}\big ]^{\otimes 2}\Big ) \Big |_{\varvec{\beta }=\bar{\varvec{\beta }}} (\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})\nonumber \\= & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2} +\frac{1}{2}(\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})^{T} \int _{0}^{\tau }V(\bar{\varvec{\beta }},t)d{\bar{N}}(t) (\varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)})\nonumber \\\le & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{u}{2}\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2} +\frac{1}{2}\lambda _{\mathrm {max}}\Big (\int _{0}^{\tau }V(\bar{\varvec{\beta }},t)d{\bar{N}}(t)\Big ) \Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2}. \end{aligned}$$
(A.1)

where \(\bar{\varvec{\beta }}\) lies between \(\varvec{\beta }^{(t)}\) and \(\varvec{\beta }^{(t+1)}\). Under the assumptions of Theorem 1, we finally arrive at

$$\begin{aligned} l_{n}(\varvec{\beta }^{(t)})\le & {} l_{n}(\varvec{\beta }^{(t+1)})-\frac{1}{2}(u-\rho ^{(t)}) \Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2} \\\le & {} l_{n}(\varvec{\beta }^{(t+1)}). \end{aligned}$$

Second, we show that \(\{\varvec{\beta }^{(t)}\}\) converges to a local maximum of \(l_{n}(\varvec{\beta })\). Note that \(l_{n}(\cdot )\) is bounded when \(\varvec{\beta }\) is confined to \({\mathcal {B}}\). From the proof of the first part, we can see

$$\begin{aligned} l_{n}(\varvec{\beta }^{(t+1)})-l_{n}(\varvec{\beta }^{(t)}) \ge \frac{1}{2}(u-\rho ) \Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}^{2}. \end{aligned}$$

By the monotonicity and boundedness of \(l_{n}(\cdot )\), \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\) as \(t\rightarrow \infty \).

Note that k and p are fixed in this convergence proof. Since \(\{\varvec{\beta }^{(t)}\}\) admits only finitely many sparse patterns, we can always find a subsequence of \(\{\varvec{\beta }^{(t)}\}\), say \(\{\varvec{\beta }^{(t_{m})}\}\), with a common sparse pattern M. Because \(\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is positive definite for any \(\varvec{\beta }_{M}\), \(l_{n}(\cdot )\) is strictly concave in \(\varvec{\beta }_{M}\) with sparse pattern M. Then \(\{\varvec{\beta }^{(t_{m})}\}\) is bounded and has at least one limiting point, denoted by \(\varvec{\beta }^{\dag }=(\beta _{1}^{\dag },\beta _{2}^{\dag },\ldots ,\beta _{p}^{\dag })^{T}\). Based on the facts that \(\varvec{\beta }^{(t_{m}+1)}=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{(t_{m})})\}\) and \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\), we have \(\varvec{\beta }^{\dag }=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\}\). This indicates that \(\varvec{\beta }^{\dag }\) maximizes \(Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\) over \(\varvec{\beta }\) with sparse pattern M.

If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}<k\), i.e., \(\varvec{\beta }^{\dag }\) has fewer than k non-zero entries, it is easy to see that \({\dot{l}}_{n}(\varvec{\beta }^{\dag })=0\). So \(\varvec{\beta }^{\dag }\) is the unconstrained maximizer of \(l_{n}(\varvec{\beta })\) and satisfies \(\Vert \varvec{\beta }^{\dag }\Vert _{0}\le k\).

If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}=k\), we will prove that \(\varvec{\beta }^{\dag }\) is the unique maximizer for the sparse pattern M. Note that

$$\begin{aligned} l_{n}(\varvec{\beta })=Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })-R(\varvec{\beta }|\varvec{\beta }^{\dag }), \end{aligned}$$

where

$$\begin{aligned} R(\varvec{\beta }|\varvec{\beta }^{\dag })=-\frac{u}{2}\Vert \varvec{\beta }-\varvec{\beta }^{\dag }\Vert _{2}^{2} +(\varvec{\beta }-\varvec{\beta }^{\dag })^{T}{\dot{l}}_{n}(\varvec{\beta }^{\dag })+l_{n}(\varvec{\beta }^{\dag })-l_{n}(\varvec{\beta }). \end{aligned}$$

So it is easy to see that \(\frac{\partial R(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \varvec{\beta }}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\). In addition, \(\frac{\partial Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \beta _{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\) for \(j\in M\), because \(\varvec{\beta }^{\dag }\) is the local maximizer for the sparse pattern M. Then we have \(\frac{\partial l_{n}(\varvec{\beta })}{\partial \beta _{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\) for \(j\in M\), which implies that \(\varvec{\beta }^{\dag }\) is the unique maximizer of \(l_{n}(\cdot )\) for the sparse pattern M by strict concavity.

In summary, any limiting point of \(\{\varvec{\beta }^{(t)}\}\) is a local maximizer satisfying \(\Vert \varvec{\beta }\Vert _{0}\le k\). By the finiteness of sparse patterns, there are at most finitely many limiting points. Furthermore, similarly to Xu and Chen (2014), standard techniques of mathematical analysis show that \(\{\varvec{\beta }^{(t)}\}\) has only one limiting point and thus converges. \(\square \)
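
In code, the update analyzed above is a gradient step followed by hard thresholding. The sketch below is a minimal illustration of ours rather than the authors’ implementation; `grad_fn` stands for any gradient oracle, such as `cox_gradient` from the previous sketch.

```python
import numpy as np

def hard_threshold(b, k):
    """H(b; k): keep the k largest entries of b in absolute value, zero the rest."""
    out = np.zeros_like(b)
    keep = np.argsort(np.abs(b))[-k:]       # indices of the k largest |b_j|
    out[keep] = b[keep]
    return out

def iht_step(beta_t, grad_fn, u, k):
    """One iterate beta_{t+1} = H(beta_t + u^{-1} grad l_n(beta_t); k), which
    maximizes Q_n(. | beta_t) over B(k); ascent holds when u dominates
    lambda_max of the integrated V matrix, as shown in the proof above."""
    return hard_threshold(beta_t + grad_fn(beta_t) / u, k)
```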

Lemma 1

Define \(\varvec{\beta }^{(0)}=\mathrm {argmax}_{\varvec{\beta }}\{l_{n}(\varvec{\beta })-n\lambda \Vert \varvec{\beta }\Vert _{1}\}\), where \(\lambda \) satisfies \(\lambda n^{\frac{1}{2}-m}\rightarrow \infty \) and \(\lambda n^{\tau _{1}+\tau _{2}}\rightarrow 0\), with m such that \(\mathrm {log}\,p=O(n^{m})\). Under Assumptions 1 to 3 and 6, if \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), we have

$$\begin{aligned} \mathrm {pr}(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\le 16c_{5}^{-1}\lambda q)\rightarrow 1, \end{aligned}$$

where \(c_{5}\) is defined in Assumption 6.

Proof

It is easy to see that

$$\begin{aligned} l_{n}(\varvec{\beta }^{(0)})-n\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}- (l_{n}(\varvec{\beta }^{*})-n\lambda \Vert \varvec{\beta }^{*}\Vert _{1})\ge 0, \end{aligned}$$

or equivalently

$$\begin{aligned} l_{n}(\varvec{\beta }^{*})-l_{n}(\varvec{\beta }^{(0)})\le n\lambda \Vert \varvec{\beta }^{*}\Vert _{1}-n\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}. \end{aligned}$$

Define \(\varvec{\delta }=\varvec{\beta }^{(0)}-\varvec{\beta }^{*}=(\delta _{1},\ldots ,\delta _{p})^{T}\). By a Taylor expansion of \(l_{n}(\varvec{\beta }^{(0)})\) at \(\varvec{\beta }^{*}\), we have

$$\begin{aligned} l_{n}(\varvec{\beta }^{(0)})-l_{n}(\varvec{\beta }^{*})=\varvec{\delta }^{T}{\dot{l}}_{n}(\varvec{\beta }^{*})- \frac{1}{2}\varvec{\delta }^{T}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\varvec{\delta }, \end{aligned}$$

where \(\varvec{\beta }\) lies between \(\varvec{\beta }^{(0)}\) and \(\varvec{\beta }^{*}\). So

$$\begin{aligned} \frac{1}{n}\varvec{\delta }^{T}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\varvec{\delta }= & {} \frac{1}{n} \big [2\varvec{\delta }^{T}{\dot{l}}_{n}(\varvec{\beta }^{*})+2l_{n}(\varvec{\beta }^{*})-2l_{n}(\varvec{\beta }^{(0)})\big ] \\\le & {} \frac{2}{n}\varvec{\delta }^{T}{\dot{l}}_{n}(\varvec{\beta }^{*})+ 2\lambda \Vert \varvec{\beta }^{*}\Vert _{1}-2\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1} \\\le & {} \frac{2}{n}|\varvec{\delta }|^{T}|{\dot{l}}_{n}(\varvec{\beta }^{*})|+ 2\lambda \Vert \varvec{\beta }^{*}\Vert _{1}-2\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}. \end{aligned}$$

Denote \({\mathcal {A}}=\{\mathrm {max}_{1\le j \le p}|{\dot{l}}_{nj}(\varvec{\beta }^{*})|\le \frac{n\lambda }{2}\}\).

Because \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), together with Assumptions 1–3, the tail probability bound of Theorem 3.1 in Bradic et al. (2011) applies here after substituting \({\dot{l}}_{nj}(\varvec{\beta }^{*})\) for their \(\xi _j\); that is, there exist positive constants \(c_{7}\) and \(c_{8}\) such that \(\mathrm {pr}(|{\dot{l}}_{nj}(\varvec{\beta }^{*})|>\sqrt{n}u_{n}) \le c_{7}\mathrm {exp}(-c_{8}u_{n})\). Then we have

$$\begin{aligned} \mathrm {pr}({\mathcal {A}}^{c})\le & {} \sum _{j=1}^{p}\mathrm {pr}(|{\dot{l}}_{nj}(\varvec{\beta }^{*})|>\frac{n\lambda }{2}) =\sum _{j=1}^{p} \mathrm {pr}(|{\dot{l}}_{nj}(\varvec{\beta }^{*})|>\sqrt{n}\frac{\sqrt{n}\lambda }{2}) \\\le & {} p c_{7}\mathrm {exp}(-c_{8}\frac{\sqrt{n}\lambda }{2})\le c_{7}\mathrm {exp}(c_{9}n^{m}-c_{8}\frac{\sqrt{n}\lambda }{2}) \rightarrow 0, \end{aligned}$$

where \(c_{9}\) is a positive constant such that \(p\le \mathrm {exp}(c_{9}n^{m})\). So we conclude that \(\mathrm {pr}({\mathcal {A}})\rightarrow 1\) and \(\Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }=O_{p}(n\lambda )\). Under the event \({\mathcal {A}}\), it is easy to see that

$$\begin{aligned} \frac{1}{n}\varvec{\delta }^{T}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\varvec{\delta } \le \lambda \Vert \varvec{\delta }\Vert _{1}+2\lambda \Vert \varvec{\beta }^{*}\Vert _{1}-2\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}. \end{aligned}$$

So

$$\begin{aligned} \frac{1}{n}\varvec{\delta }^{T}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\varvec{\delta }+\lambda \Vert \varvec{\delta }\Vert _{1}\le & {} 2\lambda \Vert \varvec{\delta }\Vert _{1}+2\lambda \Vert \varvec{\beta }^{*}\Vert _{1}-2\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}\\\le & {} 2\lambda \sum _{j=1}^{p} \big (|\beta _{j}^{(0)}-\beta _{j}^{*}|+|\beta _{j}^{*}|-|\beta _{j}^{(0)}|\big )\\= & {} 2\lambda \sum _{j\in M_{0}} \big (|\beta _{j}^{(0)}-\beta _{j}^{*}|+|\beta _{j}^{*}|-|\beta _{j}^{(0)}|\big ) \\\le & {} 4\lambda \sum _{j\in M_{0}}|\delta _{j}| \\\le & {} 4\lambda \Vert \varvec{\delta }_{M_{0}}\Vert _{1}. \end{aligned}$$

It is easy to see that \(\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\) is positive semidefinite. Thus \(\Vert \varvec{\delta }\Vert _{1}\le 4\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\), and furthermore \(\Vert \varvec{\delta }_{M_{0}^{c}}\Vert _{1}\le 3\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\). By the Cauchy–Schwarz inequality and Assumption 6,

$$\begin{aligned} \Vert \varvec{\delta }_{M_{0}}\Vert _{1}^{2}\le q \Vert \varvec{\delta }_{M_{0}}\Vert _{2}^{2} \le q c_{5}^{-1}n^{-1}\varvec{\delta }^{T}\Big [\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\Big ]\varvec{\delta } \le 4 c_{5}^{-1}\lambda q \Vert \varvec{\delta }_{M_{0}}\Vert _{1}. \end{aligned}$$

So \(\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\le 4 c_{5}^{-1}\lambda q\). Then finally we arrive at

$$\begin{aligned} \Vert \varvec{\delta }\Vert _{1}=\Vert \varvec{\delta }_{M_{0}^{c}}\Vert _{1}+\Vert \varvec{\delta }_{M_{0}}\Vert _{1} \le 4 \Vert \varvec{\delta }_{M_{0}}\Vert _{1} \le 16 c_{5}^{-1}\lambda q. \end{aligned}$$

This completes the proof. \(\square \)
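
The estimator \(\varvec{\beta }^{(0)}\) in Lemma 1 is the \(\ell _{1}\)-penalized partial-likelihood maximizer. As a minimal sketch, assuming a proximal gradient ascent solver with soft thresholding and a fixed step \(1/u\) (the paper does not prescribe a particular solver, so these choices are ours), it can be computed as follows; `grad_fn` is again a gradient oracle such as `cox_gradient`.

```python
import numpy as np

def soft_threshold(b, a):
    """Proximal map of a * ||.||_1: componentwise shrinkage toward zero."""
    return np.sign(b) * np.maximum(np.abs(b) - a, 0.0)

def lasso_init(grad_fn, p, n, lam, u, max_iter=500, tol=1e-8):
    """Approximate beta^(0) = argmax { l_n(beta) - n * lam * ||beta||_1 } by
    iterating beta <- soft_threshold(beta + grad_fn(beta)/u, n*lam/u)."""
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_new = soft_threshold(beta + grad_fn(beta) / u, n * lam / u)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta
```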

Proof of Theorem 2

Recall that \(w=\mathrm {min}_{j\in M_{0}}|\beta _{j}^{*}|\). We only need to show \(\mathrm {pr}(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }<\frac{w}{2})\rightarrow 1\), since then every \(j\in M_{0}\) satisfies \(|\beta _{j}^{(t)}|>w/2>0\) and hence \(M_{0}\) is retained. It suffices to prove \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). As in Xu and Chen (2014), we use mathematical induction to establish this result.

When \(t=0\), by Lemma 1, we have

$$\begin{aligned} \mathrm {pr}(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\le 16c_{5}^{-1}\lambda q)\rightarrow 1. \end{aligned}$$

Because \(\lambda =o(n^{-(\tau _{1}+\tau _{2})})\), \(q=O(n^{\tau _{2}})\) and \(w^{-1}=O(n^{\tau _{1}})\), we have \(\lambda qw^{-1}=o(n^{-(\tau _{1}+\tau _{2})})O(n^{\tau _{2}})O(n^{\tau _{1}})=o(1)\), and thus \(\lambda q=o(w)\). So \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}=o_{p}(w)\). Noting that \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\), the desired result follows for \(t=0\).

Suppose that \(\Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). In the following, we show that \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) also holds. Note that \(\varvec{\beta }^{(t)}={\mathbf {H}}(\tilde{\varvec{\beta }};k)\), where \(\tilde{\varvec{\beta }}=\varvec{\beta }^{(t-1)}+u^{-1}{\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\). If \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) holds, the elements of \(\tilde{\varvec{\beta }}_{M_{0}}\) are among those with the top k largest absolute values in probability, and thus \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). So it remains to prove \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). Note that \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }+ \frac{1}{u}\Vert {\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\Vert _{\infty }\). By a Taylor expansion of \({\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\) at \(\varvec{\beta }^{*}\), we have

$$\begin{aligned} \Vert {\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\Vert _{\infty }= & {} \Vert {\dot{l}}_{n}(\varvec{\beta }^{*})- \int _{0}^{\tau }V(\bar{\varvec{\beta }},t)d{\bar{N}}(t)(\varvec{\beta }^{(t-1)}-\varvec{\beta }^{*})\Vert _{\infty } \\\le & {} \Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }+ \Vert \int _{0}^{\tau }V(\bar{\varvec{\beta }},t)d{\bar{N}}(t)(\varvec{\beta }^{(t-1)}-\varvec{\beta }^{*})\Vert _{\infty } \\\le & {} \Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }+ n\Vert \frac{1}{n}\int _{0}^{\tau }V(\bar{\varvec{\beta }},t)d{\bar{N}}(t)\Vert _{\infty } \Vert (\varvec{\beta }^{(t-1)}-\varvec{\beta }^{*})\Vert _{\infty } \\\le & {} \Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }+ n\Vert (\varvec{\beta }^{(t-1)}-\varvec{\beta }^{*})\Vert _{\infty } \mathrm {sup}_{(\varvec{\beta },t)\in {\mathcal {B}}(\varvec{\beta }^{*},o(w))\times [0,\tau ]} \Vert V(\varvec{\beta },t)\Vert _{\infty }, \end{aligned}$$

where \(\bar{\varvec{\beta }}\) is between \(\varvec{\beta }^{(t-1)}\) and \(\varvec{\beta }^{*}\). So

$$\begin{aligned} \frac{1}{u}\Vert {\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\Vert _{\infty }= & {} \frac{1}{u}O_{p}(n\lambda )+\frac{n}{u}o_{p}(w)O_{p}(n^{\tau _{3}}) \\\le & {} (c_{6}rn)^{-1}n\lambda O_{p}(1)+(c_{6}rn)^{-1}n^{1+\tau _{3}}o_{p}(w)O_{p}(1) \\= & {} c_{6}^{-1}O(n^{-\tau _{3}})o(n^{-(\tau _{1}+\tau _{2})}) O_{p}(1) +c_{6}^{-1}O_{p}(n^{-\tau _{3}})n^{\tau _{3}}o_{p}(w) \\= & {} o_{p}(w). \end{aligned}$$

This completes the proof. \(\square \)
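
Combining Lemma 1 and Theorem 2, the overall procedure starts from the LASSO estimator and iterates the hard-thresholding update. The sketch below reuses `lasso_init`, `hard_threshold` and `iht_step` from the earlier sketches; the fixed iteration budget is an illustrative assumption. Theorem 2 says that the returned support contains \(M_{0}\) with probability tending to one.

```python
import numpy as np

def joint_screen(grad_fn, p, n, lam, u, k, iters=100):
    """LASSO-initialized iterative hard thresholding for joint screening."""
    beta = lasso_init(grad_fn, p, n, lam, u)    # beta^(0) of Lemma 1
    beta = hard_threshold(beta, k)              # enter the feasible set B(k)
    for _ in range(iters):
        beta = iht_step(beta, grad_fn, u, k)    # beta^(t) = H(beta_tilde; k)
    return np.flatnonzero(beta)                 # screened model, M-hat
```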

Verification of assumptions for the NPG algorithm

It is easy to see that our algorithm can be formulated as

$$\begin{aligned} \min F(\varvec{\beta }) := -l_{n}(\varvec{\beta })+P(\varvec{\beta }), \end{aligned}$$

where \(P(\varvec{\beta })\) takes the value 0 if \(\varvec{\beta }\in {\mathcal {B}}(k)\) and \(\infty \) otherwise. In the following, we verify that \(-l_{n}(\varvec{\beta })\) and \(P(\varvec{\beta })\) satisfy the conditions in Assumption A.1 of Chen et al. (2016, pp. 1485–1486); a schematic sketch of the resulting NPG iteration is given after the verification below. It is easy to check that \(\text {Dom}(P)={\mathcal {B}}(k)\), where \(\text {Dom}(P)\) is defined in Chen et al. (2016). Note that \({\mathcal {B}}(k) \subseteq {\mathcal {B}}\), where \({\mathcal {B}}\) is a bounded set.

  1. (i) This condition is satisfied since \(-l_{n}(\varvec{\beta })\) is continuously differentiable and its derivative is bounded on a bounded closed set.

  2. (ii) For any \(\varvec{\beta }^\dagger \in {\mathbb {R}}^p\), if \( \varvec{\beta }^\dagger \in {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger ) = 0\). Since \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\ge 0\), we have \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). If \( \varvec{\beta }^\dagger \notin {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger )=\infty \); the complement of \(\text {Dom}(P)\) is an open set, so \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })=\infty \) and again \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). Hence P is lower semicontinuous on \({\mathbb {R}}^p\).

  3. (iii) For \(\varvec{\beta }^\dagger \in \text {Dom}(P)\), \(\Omega (\varvec{\beta }^\dagger )=\{\varvec{\beta } \in {\mathbb {R}}^p: F(\varvec{\beta })\le F(\varvec{\beta }^\dagger )\}\subseteq \text {Dom}(P)\). Then \(F(\varvec{\beta })\) is bounded below on \(\Omega (\varvec{\beta }^\dagger )\).

  4. (iv) Let \(A=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}\Vert \nabla (-l_{n}(\varvec{\beta }))\Vert \), \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\) and \(C=\inf _{\varvec{\beta } \in {\mathbb {R}}^{p}}P(\varvec{\beta })\). Because \(\Omega (\varvec{\beta }^\dagger )\subseteq \text {Dom}(P)\) and \(\nabla (-l_n(\varvec{\beta }))\) is continuous, \(A<\infty \). Furthermore, \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\le \sup _{\varvec{\beta } \in \text {Dom}(P)}P(\varvec{\beta })=0<\infty \). In addition, \(C=0\), which is finite.

Hence the assumptions in Chen et al. (2016) are verified.
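
To illustrate the formulation just verified, here is a schematic sketch of a non-monotone proximal gradient iteration for \(\min F(\varvec{\beta })\) in the spirit of the generic template of Chen et al. (2016), with a Barzilai–Borwein initial step size (Barzilai and Borwein 1988); the window size `M`, the constant `c`, and the backtracking rule are illustrative assumptions of ours, not the paper's tuning.

```python
import numpy as np

def npg(loss, grad, p, k, max_iter=200, M=4, c=1e-4):
    """Minimize loss(beta) = -l_n(beta) over B(k); the proximal map of the
    indicator P of B(k) is hard thresholding (keep the k largest entries)."""
    beta = np.zeros(p)
    recent = [loss(beta)]                  # window for the non-monotone test
    b_old, g_old, L = beta.copy(), grad(beta), 1.0
    for _ in range(max_iter):
        g = grad(beta)
        s, y = beta - b_old, g - g_old
        if s @ y > 0:                      # Barzilai-Borwein inverse step size
            L = max((y @ y) / (s @ y), 1e-8)
        while True:                        # backtrack until the candidate passes
            step = beta - g / L
            cand = np.zeros(p)
            keep = np.argsort(np.abs(step))[-k:]
            cand[keep] = step[keep]        # prox of the indicator of B(k)
            if loss(cand) <= max(recent) - 0.5 * c * L * np.sum((cand - beta) ** 2):
                break
            L *= 2.0
        b_old, g_old = beta, g
        beta = cand
        recent = (recent + [loss(beta)])[-M:]
    return beta
```

The acceptance test compares against the maximum of the last M objective values, which is what makes the scheme non-monotone while the sufficient-decrease term still controls the step.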

About this article

Cite this article

Chen, X., Liu, C.C. & Xu, S. An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model. Comput Stat 36, 885–910 (2021). https://doi.org/10.1007/s00180-020-01032-9
