Abstract
The Cox model is an exceedingly popular semiparametric hazard regression model for analyzing time-to-event data with explanatory variables. In the ultrahigh-dimensional setting, unlike marginal screening strategies, there is a joint feature screening method based on the partial likelihood of the Cox model, but its computational feasibility remains unresolved. In this paper, we develop an enhanced iterative hard-thresholding algorithm by adapting the non-monotone proximal gradient method to the Cox model. The proposed algorithm is both computationally effective and fast. Moreover, it starts from a LASSO initial estimator rather than the naive zero initial value, which further improves computational efficiency in practice while still enjoying the sure screening property in theory, for which we provide a rigorous proof. The advantages of the proposed method are demonstrated by numerical studies and illustrated with the diffuse large B-cell lymphoma data example.
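To make the procedure concrete, here is a minimal NumPy sketch of iterative hard thresholding on the Cox partial likelihood. This is an illustration, not the authors' implementation: it uses the Breslow-form partial likelihood assuming no tied event times, a fixed gradient-ascent step instead of the paper's non-monotone step-size search, a zero initial value instead of the LASSO initializer, and all function names and the toy setup are our own.

```python
import numpy as np

def partial_loglik(beta, Z, time, status):
    """Cox log partial likelihood (Breslow form, assuming no tied event times)."""
    order = np.argsort(time)
    eta, s = (Z @ beta)[order], status[order]
    # log of the risk-set sums over subjects j with time_j >= time_i
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return np.sum(s * (eta - log_risk))

def score(beta, Z, time, status):
    """Gradient of the log partial likelihood."""
    order = np.argsort(time)
    Zo, s = Z[order], status[order]
    w = np.exp(Zo @ beta)
    S0 = np.cumsum(w[::-1])[::-1]                        # risk-set sums of weights
    S1 = np.cumsum((w[:, None] * Zo)[::-1], axis=0)[::-1]
    return (s[:, None] * (Zo - S1 / S0[:, None])).sum(axis=0)

def hard_threshold(v, k):
    """Keep the k entries of v largest in absolute value; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def iht_cox(Z, time, status, k, step=0.2, iters=100):
    """Iterative hard thresholding on the (sample-normalized) partial likelihood."""
    n = len(time)
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        beta = hard_threshold(beta + step * score(beta, Z, time, status) / n, k)
    return beta
```

In the actual algorithm the fixed `step` would be replaced by the non-monotone proximal gradient step-size rule, and `beta` would be warm-started at the LASSO estimator.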
References
Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. Ann Stat 10(4):1100–1120
Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bradic J, Fan J, Jiang J (2011) Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Stat 39(6):3092–3120
Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30(1):89–99
Chen X, Lu Z, Pong TK (2016) Penalty methods for a class of non-Lipschitz optimization problems. SIAM J Optim 26(3):1465–1492
Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138
Cox DR (1975) Partial likelihood. Biometrika 62(2):269–276
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70(5):849–911
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38(6):3567–3604
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional variable selection: beyond the linear model. J Mach Learn Res 10:2013–2038
Fan J, Feng Y, Wu Y (2010) High-dimensional variable selection for Cox’s proportional hazards model. In: Borrowing strength: theory powering applications—a festschrift for Lawrence D. Brown, Institute of Mathematical Statistics, pp 70–86
Fleming TR, Harrington DP (2011) Counting processes and survival analysis, vol 169. Wiley, New York
Gong P, Zhang C, Lu Z, Huang J, Ye J (2013) A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In: International conference on machine learning, pp 37–45
Gorst-Rasmussen A, Scheike T (2013) Independent screening for single-index hazard rate models with ultrahigh dimensional features. J R Stat Soc Ser B 75(2):217–245
He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Stat 41(1):342–369
Hong HG, Chen X, Christiani DC, Li Y (2018) Integrated powered density: screening ultrahigh dimensional covariates with survival outcomes. Biometrics 74(2):421–429
Huang J, Sun T, Ying Z, Yu Y, Zhang CH (2013) Oracle inequalities for the Lasso in the Cox model. Ann Stat 41(3):1142–1165
Lai P, Liu Y, Liu Z, Wan Y (2017) Model free feature screening for ultrahigh dimensional data with responses missing at random. Comput Stat Data Anal 105:201–216
Liu Y, Chen X (2018) Quantile screening for ultra-high-dimensional heterogeneous data conditional on some variables. J Stat Comput Simul 88(2):329–342
Liu Y, Zhang J, Zhao X (2018) A new nonparametric screening method for ultrahigh-dimensional survival data. Comput Stat Data Anal 119:74–85
Ma S, Li R, Tsai CL (2017) Variable screening via quantile partial correlation. J Am Stat Assoc 112(518):650–663
Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann Stat 34:1436–1462
Metzeler KH, Hummel M, Bloomfield CD, Spiekermann K, Braess J, Sauerland MC, Heinecke A, Radmacher M, Marcucci G, Whitman SP et al (2008) An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood 112(10):4193–4201
Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, Konig R, Haas S, Eils R et al (2006) Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J Clin Oncol 24(31):5070–5078
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346(25):1937–1947
Rosenwald A, Wright G, Wiestner A, Chan WC, Connors JM, Campo E, Gascoyne RD, Grogan TM, Muller-Hermelink HK, Smeland EB et al (2003) The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell 3(2):185–197
She Y (2009) Thresholding-based iterative selection procedures for model selection and shrinkage. Electron J Stat 3:384–415
Song R, Lu W, Ma S, Jessie Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101(4):799–814
Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395
van Wieringen WN, Kun D, Hampel R, Boulesteix AL (2009) Survival prediction using gene expression data: a review and comparison. Comput Stat Data Anal 53(5):1590–1603
Wright SJ, Nowak RD, Figueiredo MA (2009) Sparse reconstruction by separable approximation. IEEE Trans Signal Process 57(7):2479–2493
Wu Y, Yin G (2015) Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102(1):65–76
Xu C, Chen J (2014) The sparse MLE for ultrahigh-dimensional feature screening. J Am Stat Assoc 109(507):1257–1269
Yang L (2017) Proximal gradient method with extrapolation and line search for a class of nonconvex and nonsmooth problems. ArXiv preprint arXiv:1711.06831
Yang G, Yu Y, Li R, Buu A (2016) Feature screening in ultrahigh dimensional Cox’s model. Stat Sin 26:881
Zhang T (2009) Multi-stage convex relaxation for learning with sparse regularization. In: Advances in neural information processing systems, pp 1929–1936
Zhang J, Liu Y, Wu Y (2017) Correlation rank screening for ultrahigh-dimensional survival data. Comput Stat Data Anal 108:121–132
Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105(1):397–411
Zhao P, Yu B (2006) On model selection consistency of Lasso. J Mach Learn Res 7:2541–2563
Zhou T, Zhu L (2017) Model-free feature screening for ultrahigh dimensional censored regression. Stat Comput 27(4):947–961
Acknowledgements
The authors are grateful to the editor, anonymous associate editor and two anonymous reviewers for their helpful and insightful comments, which led to significant improvements of this paper. Chen’s research is partially supported by the National Natural Science Foundation of China (11501573 and 11771250). Liu’s research is partially supported by the General Research Fund (15327216) of the Research Grants Council (RGC), Hong Kong. Xu’s research is supported by The Hong Kong Polytechnic University Ph.D. Studentship. The authors thank Dr. Lei Yang for his discussions on optimization problems.
Appendix
Without loss of generality, in the Cox model (1), we assume that the first q components of \(\varvec{\beta }\) are truly important. Then we partition \(\varvec{\beta }\) into \(\varvec{\beta }=(\varvec{\beta }_{1}^T,\varvec{\beta }_{2}^T)^T\), where \(\varvec{\beta }_{1}=(\beta _{1},\ldots ,\beta _{q})^{T}\) and \(\varvec{\beta }_{2}=(\beta _{q+1},\ldots ,\beta _{p})^{T}=(0,\ldots ,0)^{T}\). In addition, define \( {\mathbf {M}}_{+}^{k}=\{M:M_{0}\subset M, \Vert M\Vert _{0}\le k\} \) and \( {\mathbf {M}}_{-}^{k}=\{M:M_{0} \not \subset M, \Vert M\Vert _{0}\le k\}. \)
Assumption 1
There exist scalar, vector and matrix functions \(s^{(l)}(\beta ,t)\) defined on \({\mathcal {B}}\times [0,\tau ]\), \(l=0,1,2\), that satisfy the following conditions: (i) \(\mathrm {sup}_{t \in [0,\tau ],\varvec{\beta }_{1}\in {\mathcal {B}}_{1} }\Vert S^{(l)}(\varvec{\beta }_{1},t)-s^{(l)}(\varvec{\beta }_{1},t)\Vert _{2}\rightarrow 0\) in probability as \(n\rightarrow \infty \) for \({\mathcal {B}}_{1}\subset R^{q}\), \({\mathcal {B}}_{1}\subset {\mathcal {B}}\); (ii) The functions \(s^{(l)}(\beta ,t)\) are bounded and \(s^{(0)}(\beta ,t)\) is bounded away from 0 on \({\mathcal {B}}\times [0,\tau ]\); and the family of functions \(s^{(l)}(\cdot ,t),0\le t\le \tau \), is an equicontinuous family at \(\beta ^{*}\).
Assumption 2
Let \(\bar{{\mathbf {z}}}(\varvec{\beta },t)=s^{(1)}(\varvec{\beta },t)/s^{(0)}(\varvec{\beta },t)\). Denote \(d_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert \bar{{\mathbf {Z}}}(\varvec{\beta }^{*},t)-\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\Vert _{\infty }\) and \(e_{n}=\mathrm {sup}_{t \in [0,\tau ]}\Vert S^{(0)}(\varvec{\beta }^{*},t)\) \(-s^{(0)}(\varvec{\beta }^{*},t)\Vert _{\infty }\). The random sequences \(d_{n}\) and \(e_{n}\) are bounded almost surely.
Assumption 3
Define \(\varepsilon _{ij}=\int _{0}^{\tau }\{Z_{ij}-{\bar{z}}_{j}(\varvec{\beta }^{*},t)\}dM_{i}(t)\), where \({\bar{z}}_{j}(\varvec{\beta }^{*},t)\) is the jth component of \(\bar{{\mathbf {z}}}(\varvec{\beta }^{*},t)\). Suppose that the Cramér condition holds for \(\varepsilon _{ij}\), i.e., \(E|\varepsilon _{ij}|^{l}\le 2^{-1}l!c_{1}^{l-2}\sigma _{j}^{2}\) for all j, where \(c_{1}\) is a positive constant, \(l\ge 2\), and \(\sigma _{j}^{2}=\mathrm {var}(\varepsilon _{ij})<\infty \).
Assumption 4
There exist positive constants \(c_{2},c_{3},\tau _{1}\) and \(\tau _{2}\) such that \(\mathrm {min}_{j\in M_{0}}|\beta _{j}^{*}|\ge c_{2}n^{-\tau _{1}}\) and \(q\le k\le c_{3}n^{\tau _{2}}\).
Assumption 5
When n is sufficiently large, for \(\varvec{\beta }_{M}\in \{\varvec{\beta }_{M}:\Vert \varvec{\beta }_{M}-\varvec{\beta }_{M}^{*}\Vert _{2}\le \delta \}\), \(M \in {\mathbf {M}}_{+}^{2k}\), it holds that \(\lambda _{\mathrm {min}}\{n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\}\ge c_{4}\), where \(c_{4}\) and \(\delta \) are positive constants depending on k but not M, \(\lambda _{\mathrm {min}}(A)\) is the minimum eigenvalue of the matrix A, and \(V(\varvec{\beta }_{M},t)\) is a version of \(V(\varvec{\beta }, t)\) based on the model M.
Assumption 6
There exists a positive constant \(c_{5}\) such that, for sufficiently large n, \(\varvec{\eta }^{T}\{n^{-1}\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\}\varvec{\eta }\ge c_{5}\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{2}^{2}\) for any \(\varvec{\eta }\ne 0\) with \(\Vert \varvec{\eta }_{_{M_{0}^{c}}}\Vert _{1}\le 3\Vert \varvec{\eta }_{_{M_{0}}}\Vert _{1}\), and any \(\varvec{\beta } \in {\mathcal {B}}\), where \(M_{0}^{c}\) is the complement of \(M_{0}\) in \(\{1,2,\ldots ,p\}\).
Assumption 7
\(\mathrm {sup}_{(\varvec{\beta },t)\in {\mathcal {B}}(\varvec{\beta }^{*},o(w))\times [0,\tau ]} \Vert V(\varvec{\beta },t)\Vert _{\infty }=O_{p}(n^{\tau _{3}})\), where \(\tau _{3}\) is a positive constant, \(w=\mathrm {min}_{j\in M_{0}}\Vert \beta _{j}^{*}\Vert \), and \({\mathcal {B}}(\varvec{\beta }^{*},o(w))\) is a p-dimensional ball centered at \(\varvec{\beta }^{*}\) with radius o(w).
Assumptions 1 to 3 are mild and are needed to obtain a large deviation result; see Bradic et al. (2011) for more discussion of these assumptions. The first part of Assumption 4 states that the marginal signals are strong enough to be detected; similar assumptions are widely made in ultrahigh-dimensional data analysis. The second part allows the number of important features to diverge to infinity at a polynomial rate. Assumption 5 is a very weak assumption, since \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is usually positive definite when k is not too large, noting that the dimension of \(\varvec{\beta }_{M}\) is k. For example, if \(k=5\), then \(\varvec{\beta }_{M}\) is a 5-dimensional parameter vector and \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is a \(5\times 5\) matrix, which is positive definite with high probability for a moderate sample size. Assumption 6 is needed to derive an error bound for the LASSO estimator; many similar assumptions appear in the literature, such as Bickel et al. (2009) for the linear model, Xu and Chen (2014) for the generalized linear model, and Huang et al. (2013) for the Cox model. As discussed in Fleming and Harrington (2011), \(V(\varvec{\beta },t)\) is an empirical covariance matrix of \({\mathbf {Z}}_{i}\) with weights proportional to \(Y_{i}(t)\mathrm {exp}\{{\mathbf {Z}}_{i}^{T}\varvec{\beta }\}\). Hence, Assumption 7 requires that the association among features not be too strong; in particular, if all features are mutually independent, this assumption is easily met. Similar assumptions can be found in Bradic et al. (2011).
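To give a feel for Assumption 5, the snippet below computes the minimum eigenvalue of a weighted empirical covariance of the kind \(V(\varvec{\beta },t)\) represents, with weights proportional to \(\mathrm {exp}\{{\mathbf {z}}_{i}^{T}\varvec{\beta }\}\). This is a simplified single-time-point check on simulated standard normal features, not the paper's integrated quantity \(n^{-1}\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n, kdim = 200, 5                         # moderate sample size, k = 5 as in the text
Z = rng.standard_normal((n, kdim))
beta = rng.normal(0.0, 0.3, kdim)

w = np.exp(Z @ beta)
w /= w.sum()                             # weights proportional to exp(z_i^T beta)
zbar = w @ Z                             # weighted mean of the features
V = (Z - zbar).T @ ((Z - zbar) * w[:, None])   # weighted empirical covariance

lam_min = np.linalg.eigvalsh(V).min()
print(lam_min > 0)                       # positive definite for moderate n
```

With 200 observations and 5 retained features the weighted covariance is full rank almost surely, matching the discussion above that Assumption 5 holds easily when k is small relative to n.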
The proofs of the theorems can be obtained along the lines of Xu and Chen (2014), by combining the large deviation result for martingales of Bradic et al. (2011) with our specific assumptions for the Cox model.
Proof of Theorem 1
Firstly, we prove the monotonicity. It is easy to see that
By Taylor expansion of \(\sum _{i=1}^{n}\delta _{i}\mathrm {log}(nS^{(0)}(\varvec{\beta }^{(t+1)},X_{i}))\) at \(\varvec{\beta }^{(t)}\) and some algebraic manipulation, we have
Then, under the assumptions in Theorem 1, we finally arrive at
Secondly, we show that \(\{\varvec{\beta }^{(t)}\}\) converges to a local maximum of \(l_{n}(\varvec{\beta })\). Note that \(l_{n}(\cdot )\) is bounded when \(\varvec{\beta }\) is confined to \({\mathcal {B}}\). From the proof of the first part, we can see
By the monotonicity and boundedness of \(l_{n}(\cdot )\), \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\) as \(t\rightarrow \infty \).
Note that k and p are constants in the proof of the convergence of \(\{\varvec{\beta }^{(t)}\}\). Since \(\{\varvec{\beta }^{(t)}\}\) has finitely many sparse patterns, we can always find a subsequence of \(\{\varvec{\beta }^{(t)}\}\), say \(\{\varvec{\beta }^{(t_{m})}\}\), with a common sparse pattern M. Because \(\int _{0}^{\tau }V(\varvec{\beta }_{M},t)d{\bar{N}}(t)\) is positive definite for any \(\varvec{\beta }_{M}\), \(l_{n}(\cdot )\) is strictly concave in \(\varvec{\beta }_{M}\) with sparse pattern M. Then \(\{\varvec{\beta }^{(t_{m})}\}\) is bounded and has at least one limiting point, denoted by \(\varvec{\beta }^{\dag }=(\beta _{1}^{\dag },\beta _{2}^{\dag },\ldots ,\beta _{p}^{\dag })^{T}\). Based on the facts that \(\varvec{\beta }^{(t_{m}+1)}=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{(t_{m})})\}\) and \(\Vert \varvec{\beta }^{(t+1)}-\varvec{\beta }^{(t)}\Vert _{2}\rightarrow 0\), we have \(\varvec{\beta }^{\dag }=\mathop {\mathrm {argmax}}_{\varvec{\beta }\in {\mathcal {B}}(k)} \{Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\}\). This indicates that \(\varvec{\beta }^{\dag }\) maximizes \(Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })\) with respect to \(\varvec{\beta }\) over the sparse pattern M.
If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}<k\), i.e., \(\varvec{\beta }^{\dag }\) has fewer than k non-zero entries, it is easy to see that \({\dot{l}}_{n}(\varvec{\beta }^{\dag })=0\). So \(\varvec{\beta }^{\dag }\) is the unconstrained maximizer of \(l_{n}(\varvec{\beta })\) and satisfies \(\Vert \varvec{\beta }^{\dag }\Vert _{0}\le k\).
If \(\Vert \varvec{\beta }^{\dag }\Vert _{0}=k\), we will prove that \(\varvec{\beta }^{\dag }\) is the unique maximizer for the sparse pattern M. Note that
where
So it is easy to see that \(\frac{\partial R(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \varvec{\beta }}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\). In addition, \(\frac{\partial Q_{n}(\varvec{\beta }|\varvec{\beta }^{\dag })}{\partial \varvec{\beta }_{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\), because \(\varvec{\beta }^{\dag }\) is the local maximizer for the sparse pattern M. Then we have \(\frac{\partial l_{n}(\varvec{\beta })}{\partial \varvec{\beta }_{j}}|_{\varvec{\beta }=\varvec{\beta }^{\dag }}=0\), which implies that \(\varvec{\beta }^{\dag }\) is the unique maximizer of \(l_{n}(\cdot )\) by strict concavity. \(\square \)
In summary, any limiting point of \(\{\varvec{\beta }^{(t)}\}\) is a local maximizer satisfying \(\Vert \varvec{\beta }\Vert _{0}\le k\). By the finiteness of sparse patterns, there are at most finitely many limiting points. Furthermore, similarly to Xu and Chen (2014), standard arguments from mathematical analysis show that \(\{\varvec{\beta }^{(t)}\}\) has only one limiting point and thus converges.
Lemma 1
Define \(\varvec{\beta }^{(0)}=\mathrm {argmax}_{\varvec{\beta }}\{l_{n}(\varvec{\beta })-n\lambda \Vert \varvec{\beta }\Vert _{1}\}\), where \(\lambda \) satisfies \(\lambda n^{\frac{1}{2}-m}\rightarrow \infty \) and \(\lambda n^{\tau _{1}+\tau _{2}}\rightarrow 0\). Under Assumptions 1 to 3 and 6, if \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), then with probability tending to one, \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\le 16c_{5}^{-1}\lambda q\), where \(c_{5}\) is defined in Assumption 6.
Proof
By the definition of \(\varvec{\beta }^{(0)}\), it is easy to see that \(l_{n}(\varvec{\beta }^{(0)})-n\lambda \Vert \varvec{\beta }^{(0)}\Vert _{1}\ge l_{n}(\varvec{\beta }^{*})-n\lambda \Vert \varvec{\beta }^{*}\Vert _{1}\), or equivalently \(l_{n}(\varvec{\beta }^{(0)})-l_{n}(\varvec{\beta }^{*})\ge n\lambda (\Vert \varvec{\beta }^{(0)}\Vert _{1}-\Vert \varvec{\beta }^{*}\Vert _{1})\).
Define \(\varvec{\delta }=\varvec{\beta }^{(0)}-\varvec{\beta }^{*}=(\delta _{1},\ldots ,\delta _{p})^{T}\). By Taylor expansion of \(l_{n}(\varvec{\beta }^{(0)})\) at \(\varvec{\beta }^{*}\), we have
where \(\varvec{\beta }\) is between \(\varvec{\beta }^{(0)}\) and \(\varvec{\beta }^{*}\). So
Denote \({\mathcal {A}}=\{\mathrm {max}_{1\le j \le p}|{\dot{l}}_{nj}(\varvec{\beta }^{*})|\le \frac{n\lambda }{2}\}\).
Because \(\mathrm {max}_{j}\sigma _{j}^{2}=O(\lambda n^{\frac{1}{2}})\), together with Assumptions 1–3, the tail probability bound in Theorem 3.1 of Bradic et al. (2011) can be applied here by substituting \({\dot{l}}_{nj}(\varvec{\beta }^{*})\) for their \(\xi _j\); that is, there exist positive constants \(c_{7}\) and \(c_{8}\) such that \(\mathrm {pr}(|{\dot{l}}_{nj}(\varvec{\beta }^{*})|>\sqrt{n}u_{n}) \le c_{7}\mathrm {exp}(-c_{8}u_{n})\). Then we have
where \(c_{9}\) is a positive constant. So we can conclude that \(\mathrm {pr}({\mathcal {A}})\rightarrow 1\) and \(\Vert {\dot{l}}_{n}(\varvec{\beta }^{*})\Vert _{\infty }=O_{p}(n\lambda )\). Under the event \({\mathcal {A}}\), it is easy to see that
So
It is easy to see that \(\int _{0}^{\tau }V(\varvec{\beta },t)d{\bar{N}}(t)\) is positive semidefinite. Thus \(\Vert \varvec{\delta }\Vert _{1}\le 4\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\), or equivalently \(\Vert \varvec{\delta }_{M_{0}^{c}}\Vert _{1}\le 3\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\). By the Cauchy–Schwarz inequality and Assumption 6,
So \(\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\le 4 c_{5}^{-1}\lambda q\). Then we finally arrive at \(\Vert \varvec{\delta }\Vert _{1}\le 4\Vert \varvec{\delta }_{M_{0}}\Vert _{1}\le 16 c_{5}^{-1}\lambda q\).
This finishes the proof.
Proof of Theorem 2
Recall that \(w=\mathrm {min}_{j\in M_{0}}\Vert \beta _{j}^{*}\Vert \). We only need to show that \(\mathrm {pr}(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }<\frac{w}{2})\rightarrow 1\), for which it suffices to prove \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). As in Xu and Chen (2014), we use mathematical induction to obtain this result.
When \(t=0\), by Lemma 1, we have
Because \(\lambda =o(n^{-(\tau _{1}+\tau _{2})})\), \(q=O(n^{\tau _{2}})\) and \(w^{-1}=O(n^{\tau _{1}})\), we have \(\lambda qw^{-1}=o(n^{-(\tau _{1}+\tau _{2})})O(n^{\tau _{2}})O(n^{\tau _{1}})=o(1)\), that is, \(\lambda q=o(w)\). So \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}=o_{p}(w)\). Noting that \(\Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(0)}-\varvec{\beta }^{*}\Vert _{1}\), the desired result follows for \(t=0\).
Suppose that \(\Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). In the following, we show that \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) also holds. Note that \(\varvec{\beta }^{(t)}={\mathbf {H}}(\tilde{\varvec{\beta }};k)\), where \(\tilde{\varvec{\beta }}=\varvec{\beta }^{(t-1)}+u^{-1}{\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\). If \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\) holds, then the elements of \(\tilde{\varvec{\beta }}_{M_{0}}\) are among those with the top k largest absolute values in probability, and thus \(\Vert \varvec{\beta }^{(t)}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). So it remains to prove \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }=o_{p}(w)\). Note that \(\Vert \tilde{\varvec{\beta }}-\varvec{\beta }^{*}\Vert _{\infty }\le \Vert \varvec{\beta }^{(t-1)}-\varvec{\beta }^{*}\Vert _{\infty }+ \frac{1}{u}\Vert {\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\Vert _{\infty }\). By Taylor expansion of \({\dot{l}}_{n}(\varvec{\beta }^{(t-1)})\) at \(\varvec{\beta }^{*}\), we have
where \(\bar{\varvec{\beta }}\) is between \(\varvec{\beta }^{(t-1)}\) and \(\varvec{\beta }^{*}\). So
This completes the proof. \(\square \)
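The key step of the induction above — that hard thresholding preserves the true support once the sup-norm error falls below half the minimum signal strength \(w\) — can be checked with a small numeric experiment. This is a toy illustration under assumed magnitudes, not part of the proof; all names and constants are our own:

```python
import numpy as np

def hard_threshold(v, k):
    """H(v; k): keep the k entries of v largest in absolute value."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

rng = np.random.default_rng(1)
p, q, k, w = 1000, 3, 10, 1.0
beta_star = np.zeros(p)
beta_star[:q] = [w, -1.5 * w, 2.0 * w]        # true signals, all at least w in magnitude
noise = rng.uniform(-0.49 * w, 0.49 * w, p)   # sup-norm error strictly below w/2
beta_tilde = beta_star + noise
support = set(np.flatnonzero(hard_threshold(beta_tilde, k)))
print(set(range(q)) <= support)               # True: the true support survives thresholding
```

Every perturbed signal entry exceeds \(w/2\) in magnitude while every noise-only entry stays below \(w/2\), so the top-k selection with \(k\ge q\) necessarily retains \(M_{0}\), exactly as the induction step requires.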
Verification of the assumptions for our NPG algorithm
It is easy to see that our algorithm could be formulated as
where \(P(\varvec{\beta })\) takes the value 0 if \(\varvec{\beta }\in {\mathcal {B}}(k)\) and \(\infty \) otherwise. In the following, we verify that \(-l_{n}(\varvec{\beta })\) and \(P(\varvec{\beta })\) satisfy the conditions in Assumption A.1 of Chen et al. (2016, pages 1485–1486). It is easy to check that \(\text {Dom}(P)={\mathcal {B}}(k)\), where \(\text {Dom}(P)\) is defined in Chen et al. (2016). Note that \({\mathcal {B}}(k) \subseteq {\mathcal {B}}\), where \({\mathcal {B}}\) is a bounded set.
- (i) This condition is easily satisfied, since \(-l_{n}(\varvec{\beta })\) is continuously differentiable and its derivative is bounded on a bounded closed set.
- (ii) For any \(\varvec{\beta }^\dagger \in {\mathbb {R}}^p\), if \( \varvec{\beta }^\dagger \in {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger ) = 0\); since \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\ge 0\), it follows that \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). If \( \varvec{\beta }^\dagger \notin {\mathcal {B}}(k)\), then \( P(\varvec{\beta }^\dagger )=\infty \); since the complement of \(\text {Dom}(P)\) is an open set, \(\liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })=\infty \), so again \(P(\varvec{\beta }^\dagger ) \le \liminf \limits _{\varvec{\beta } \rightarrow \varvec{\beta }^\dagger }P(\varvec{\beta })\). Hence P is a lower semicontinuous function on \({\mathbb {R}}^p\).
- (iii) For \(\varvec{\beta }^\dagger \in \text {Dom}(P)\), \(\Omega (\varvec{\beta }^\dagger )=\{\varvec{\beta } \in {\mathbb {R}}^p: F(\varvec{\beta })\le F(\varvec{\beta }^\dagger )\}\subseteq \text {Dom}(P)\). Then \(F(\varvec{\beta })\) is bounded below on \(\Omega (\varvec{\beta }^\dagger )\).
- (iv) Let \(A=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}\Vert \nabla (-l_{n}(\varvec{\beta }))\Vert \), \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\) and \(C=\inf _{\varvec{\beta } \in {\mathbb {R}}^{p}}P(\varvec{\beta })\). Because \(\Omega (\varvec{\beta }^\dagger )\subseteq \text {Dom}(P)\) and \(\nabla (-l_n(\varvec{\beta }))\) is continuous, \(A<\infty \). Furthermore, \(B=\sup _{\varvec{\beta } \in \Omega (\varvec{\beta }^\dagger )}P(\varvec{\beta })\le \sup _{\varvec{\beta } \in \text {Dom}(P)}P(\varvec{\beta })=0<\infty \). In addition, \(C=0>-\infty \).
Hence the assumptions in Chen et al. (2016) are verified.
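For intuition, with this indicator penalty the NPG scheme reduces to a proximal gradient method whose proximal map is exactly hard thresholding (projection onto the k-sparse set), combined with a Barzilai–Borwein initial step size and a non-monotone line search. The sketch below applies this template to a simple least-squares objective rather than \(-l_{n}\); the window length, constants, and acceptance rule loosely follow the general GIST/NPG framework (Gong et al. 2013; Chen et al. 2016) but the specific choices and names here are our own:

```python
import numpy as np

def hard_threshold(v, k):
    """Proximal map of the indicator of the k-sparse set."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def npg_sparse(f, grad, x0, k, iters=100, window=4, shrink=0.5, sigma=1e-4):
    """Non-monotone proximal gradient for min f(x) subject to ||x||_0 <= k."""
    x = hard_threshold(x0, k)
    hist = [f(x)]
    L = 1.0
    x_prev = g_prev = None
    for _ in range(iters):
        g = grad(x)
        if x_prev is not None:                 # Barzilai-Borwein initial step size
            s, y = x - x_prev, g - g_prev
            if s @ y > 1e-12:
                L = (y @ y) / (s @ y)
        step = 1.0 / max(L, 1e-8)
        for _ in range(50):                    # non-monotone backtracking:
            x_new = hard_threshold(x - step * g, k)
            # accept if below the max of the last few objective values
            if f(x_new) <= max(hist[-window:]) - (sigma / (2 * step)) * np.sum((x_new - x) ** 2):
                break
            step *= shrink
        x_prev, g_prev = x, g
        x = x_new
        hist.append(f(x))
    return x

# toy sparse least-squares instance
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 60))
x_true = np.zeros(60)
x_true[[5, 17, 33]] = [2.0, -1.5, 1.0]
b = A @ x_true
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
x_hat = npg_sparse(f, grad, np.zeros(60), k=3)
```

The non-monotone acceptance rule (comparing against the maximum over a short window of past objective values rather than the last one) is what lets the method take the aggressive BB steps that make the algorithm fast in practice.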
Chen, X., Liu, C.C. & Xu, S. An efficient algorithm for joint feature screening in ultrahigh-dimensional Cox’s model. Comput Stat 36, 885–910 (2021). https://doi.org/10.1007/s00180-020-01032-9