Abstract
Broken adaptive ridge (BAR) is a penalized regression method that performs variable selection through a computationally scalable surrogate to \(L_0\) regularization. BAR has many appealing features: obtained as the limit of iteratively reweighted \(L_2\) penalties, it approximates \(L_0\)-penalized selection and satisfies the oracle property together with a grouping effect for highly correlated covariates. In this paper, we investigate the BAR procedure for variable selection in a semiparametric accelerated failure time model with complex high-dimensional censored data. Coupled with Buckley-James-type responses, the BAR-based variable selection procedure accommodates event times that are censored in complex ways, including right censoring, left censoring, and double censoring. Our approach employs a two-stage cyclic coordinate descent algorithm that minimizes the objective function by iteratively updating the pseudo survival responses and the regression coefficients along the coordinate directions. Under weak regularity conditions, we establish both the oracle property and the grouping effect of the proposed BAR estimator. Numerical studies examine the finite-sample performance of the proposed algorithm, and an application to real data illustrates its practical use.
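To fix ideas, the following is a minimal numerical sketch of the two-stage scheme described above for the right-censored case: an outer Buckley-James step imputes censored responses using a Kaplan-Meier estimate of the residual distribution, and an inner broken adaptive ridge step iteratively reweights an \(L_2\) penalty. All names (km_jumps, bj_impute, bar, and the toy data) are illustrative assumptions rather than the authors' implementation, and the ridge updates are written as direct linear solves instead of the cyclic coordinate descent updates used in the paper.

```python
import numpy as np

def km_jumps(times, events):
    """Kaplan-Meier jump sizes at the sorted points (a crude sketch:
    ties and censoring beyond the last event are not treated carefully)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    n = len(t)
    at_risk = n - np.arange(n)
    surv = np.cumprod(1.0 - d / at_risk)              # S(t) just after each point
    jumps = -np.diff(np.concatenate(([1.0], surv)))   # probability mass at each point
    return t, jumps

def bj_impute(X, y, delta, beta):
    """One Buckley-James step for right-censored data: replace each censored
    response by its estimated conditional expectation given it exceeds y_i."""
    resid = y - X @ beta
    t, jumps = km_jumps(resid, delta)
    y_star = y.astype(float).copy()
    for i in np.where(delta == 0)[0]:
        mask = t > resid[i]
        w = jumps[mask]
        if w.sum() > 0:
            y_star[i] = X[i] @ beta + np.sum(w * t[mask]) / w.sum()
    return y_star

def bar(X, y_star, lam, n_iter=100, eps=1e-12):
    """Broken adaptive ridge: iteratively reweighted ridge regressions
    whose limit mimics L0-penalized least squares."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + np.eye(p), X.T @ y_star)   # ridge start
    for _ in range(n_iter):
        D = np.diag(lam / np.maximum(beta**2, eps))             # D(beta) = diag(beta^{-2})
        beta = np.linalg.solve(X.T @ X + D, X.T @ y_star)
    return beta

# Toy example with right-censored log event times.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -1.0, 0.8] + [0.0] * (p - 3))
T = X @ beta_true + rng.standard_normal(n)          # log event times
C = X @ beta_true + rng.standard_normal(n) + 0.5    # log censoring times
y, delta = np.minimum(T, C), (T <= C).astype(float)

beta = np.zeros(p)
for _ in range(20):                   # outer Buckley-James loop
    y_star = bj_impute(X, y, delta, beta)
    beta = bar(X, y_star, lam=2.0)    # inner BAR loop
print(np.round(beta, 3))              # noise coordinates shrink to zero
```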


References
Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24(6):2350–2383
Buckley J, James I (1979) Linear regression with censored data. Biometrika 66(3):429–436
Choi S, Cho H (2019) Accelerated failure time models for the analysis of competing risks. J Korean Stat Soc 48:315–326
Choi T, Choi S (2021) A fast algorithm for the accelerated failure time model with high-dimensional time-to-event data. J Stat Comput Simul 91(16):3385–3403
Choi S, Choi T, Cho H, Bandyopadhyay D (2022) Weighted least-squares regression with competing risks data. Stat Med 41(2):227–241
Choi T, Kim AK, Choi S (2021) Semiparametric least-squares regression with doubly-censored data. Comput Stat Data Anal 164:107306
Dai L, Chen K, Li G (2020) The broken adaptive ridge procedure and its applications. Statistica Sinica 30(2):1069–1094
Dai L, Chen K, Sun Z, Liu Z, Li G (2018) Broken adaptive ridge regression and its asymptotic properties. J Multivar Anal 168:334–351
Daubechies I, DeVore R, Fornasier M, Güntürk CS (2010) Iteratively reweighted least squares minimization for sparse recovery. Commun Pure Appl Math 63(1):1–38
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Frommlet F, Nuel G (2016) An adaptive ridge procedure for \(l_0\) regularization. PLoS ONE 11(2):e0148620
Gao F, Zeng D, Lin DY (2017) Semiparametric estimation of the accelerated failure time model with partly interval-censored data. Biometrics 73(4):1161–1168
Huang J (1999) Asymptotic properties of nonparametric estimation based on partly interval-censored data. Statistica Sinica 9(2):501–519
Jin Z, Lin D, Wei L, Ying Z (2003) Rank-based inference for the accelerated failure time model. Biometrika 90(2):341–353
Jin Z, Lin D, Ying Z (2006) On least-squares regression with censored data. Biometrika 93(1):147–161
Johnson BA (2009) On lasso for censored data. Electron J Stat 3:485–506
Johnson BA, Lin DY, Zeng D (2008) Penalized estimating functions and variable selection in semiparametric regression models. J Am Stat Assoc 103(482):672–680
Kawaguchi ES, Shen JI, Suchard MA, Li G (2021) Scalable algorithms for large competing risks data. J Comput Graph Stat 30(3):685–693
Kawaguchi ES, Suchard MA, Liu Z, Li G (2020) A surrogate \({L}_0\) sparse Cox’s regression with applications to sparse high-dimensional massive sample size time-to-event data. Stat Med 39(6):675–686
Leurgans S (1987) Linear models, random censoring and synthetic data. Biometrika 74(2):301–309
Li Y, Dicker L, Zhao S (2014) The Dantzig selector for censored linear regression models. Statistica Sinica 24(1):251–268
Liu Y, Chen X, Li G (2019) A new joint screening method for right-censored time-to-event data with ultra-high dimensional covariates. Stat Methods Med Res 29(6):1499–1513
Meir A, Keeler E (1969) A theorem on contraction mappings. J Math Anal Appl 28(2):326–329
Rippe RC, Meulman JJ, Eilers PH (2012) Visualization of genomic changes by segmented smoothing using an \(l_0\) penalty. PLoS ONE 7(6):e38230
Ritov Y (1990) Estimation in a linear regression model with censored data. Ann Stat 18(1):303–328
Shao J (1993) Linear model selection by cross-validation. J Am Stat Assoc 88(422):486–494
Son M, Choi T, Shin SJ, Jung Y, Choi S (2021) Regularized linear censored quantile regression. J Korean Stat Soc 51:1–19
Sun Z, Liu Y, Chen K, Li G (2022) Broken adaptive ridge regression for right-censored survival data. Ann Inst Stat Math 74(1):69–91
Sun Z, Yu C, Li G, Chen K, Liu Y (2020) CenBAR: broken adaptive ridge AFT model with censored data. https://cran.r-project.org/web/packages/CenBAR/index.html, R package version 0.1.1
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Series B (Methodological) 58(1):267–288
Turnbull BW (1976) The empirical distribution function with arbitrarily grouped, censored and truncated data. J Royal Stat Soc Ser B 38(3):290–295
Wang S, Nan B, Zhu J, Beer DG (2008) Doubly penalized Buckley-James method for survival data with high-dimensional covariates. Biometrics 64(1):132–140
Xu J, Leng C, Ying Z (2010) Rank-based variable selection with censored data. Stat Comput 20(2):165–176
Zeng D, Lin D (2007) Efficient estimation for the accelerated failure time model. J Am Stat Assoc 102(480):1387–1396
Zhao H, Sun D, Li G, Sun J (2018) Variable selection for recurrent event data with broken adaptive ridge regression. Can J Stat 46(3):416–428
Zhao H, Wu Q, Li G, Sun J (2020) Simultaneous estimation and variable selection for interval-censored data with broken adaptive ridge regression. J Am Stat Assoc 115(529):204–216
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Funding
The research of T. Choi was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (RS-2023-00237435). The research of S. Choi was supported by grants from the National Research Foundation of Korea (NRF) (2022M3J6A1063595, 2022R1A2C1008514) and by Korea University research grants (K2018721, K2008341).
Ethics declarations
Conflict of interest
We declare that we have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
In this appendix, we prove the asymptotic properties of the proposed BJ-BAR estimator \(\hat{\beta }\). For the proof, define
\[ g({\beta }) = \big (X^T X + \lambda _n {D}({\beta })\big )^{-1} X^T \hat{Y} \equiv \big ({\alpha }^*({\beta })^T, {\gamma }^*({\beta })^T\big )^T, \qquad (10) \]
where \({D}({\beta }) = \text{diag}(\beta _1^{-2},\dots ,\beta _p^{-2})\), and the partition of matrix \({\Sigma }_n^{-1}\) as
\[ {\Sigma }_n^{-1} = \begin{pmatrix} A & B \\ B^T & G \end{pmatrix}, \]
where \(A\) is a \(q\times q\) matrix and \(\hat{Y}=\hat{Y}(\tilde{\beta })\) is the "imputed" failure time. We write \({\alpha }^*({\beta })\) and \({\gamma }^*({\beta })\) as \({\alpha }^*\) and \({\gamma }^*\), where \({\alpha }^*\) is a \(q\times 1\) vector and \({\gamma }^*\) is a \((p-q)\times 1\) vector. Note that since \({\Sigma }_n=n^{-1} X^T X\) is nonsingular, by multiplying \((X^T X)^{-1} (X^T X + \lambda _n {D}(\beta ))\) and subtracting \(\beta _0=(\beta _{10}^T,0^T)^T\) on both sides of Eq. (10), we have
\[ \begin{pmatrix} {\alpha }^* - {\beta }_{10} \\ {\gamma }^* \end{pmatrix} = (\tilde{\beta } - {\beta }_0) - \frac{\lambda _n}{n} \begin{pmatrix} A & B \\ B^T & G \end{pmatrix} \begin{pmatrix} {D}_1({\beta }_1)\, {\alpha }^* \\ {D}_2({\beta }_2)\, {\gamma }^* \end{pmatrix}, \qquad (11) \]
since \(\tilde{\beta }=(X^TX)^{-1}X^T \hat{Y}(\tilde{\beta })\), where \({D}_1({\beta }_1) = \text{diag}(\beta _1^{-2})\) and \({D}_2({\beta }_2) = \text{diag}(\beta _2^{-2})\). We also define \({\gamma }^*=0\) if \({\beta _2}=0\) in Eq. (11). According to Ritov (1990), we have \(\Vert {\tilde{\beta }} -{\beta }_0\Vert = O_p(n^{-1/2}).\)
The proof of Theorem 1 rests on the following two lemmas, which are adapted from Dai et al. (2018) and Zhao et al. (2018, 2020).
Lemma 1
Let \(\{\, \delta _n \, \}\) be a sequence of positive real numbers satisfying \(\delta _n\rightarrow \infty\) and \(\delta _n^2/\lambda _n \rightarrow 0\). Define \(H= \{{\beta } = ({\beta }_1^T, {\beta }_2^T)^T: {\beta }_1 \in [1/K_0, K_0]^{q}, \Vert {\beta }_2\Vert \le \delta _n/\sqrt{n}\}\), where \(K_0>1\) is a constant such that \({\beta }_{10}\in [1/K_0, K_0]^{q}\). Then under the regularity conditions (C1)–(C6) and with probability tending to 1, we have
(i) \(\displaystyle \sup _{{\beta }\in H}\dfrac{\Vert {\gamma }^*({\beta })\Vert }{\Vert {\beta }_2\Vert } < \dfrac{1}{c_0}\) for some constant \(c_0>1\);

(ii) \(g(\cdot )\) is a mapping from \(H\) to itself.
Proof
By Eq. (11) and Ritov (1990), we have
\[ {\gamma }^* = \tilde{\beta }_2 - \frac{\lambda _n}{n} \big (B^T {D}_1({\beta }_1)\, {\alpha }^* + G\, {D}_2({\beta }_2)\, {\gamma }^*\big ), \qquad \text{with } \Vert \tilde{\beta }_2\Vert = O_p(n^{-1/2}). \]
Note that \(\Vert \beta _2\Vert \le \delta _n / \sqrt{n}\) and \(\lambda _n/n=o_p(n^{-1/2})\). Based on conditions (C5) and (C6) and the fact that \({\beta }_1 \in [1/K_0, K_0]^{q}\) and \(\Vert {\alpha }^*\Vert \le \Vert g({\beta })\Vert < K\) for some constant \(K>0\), we have
where \(a_0 = \min _{1 \le j \le q} |\beta _{10_j} |\), \(a_1 = \max _{1 \le j \le q} |\beta _{10_j} |\), and \(\Vert B^T\Vert \le \sqrt{2}c\), which follows from the inequality \(\Vert BB^T\Vert - \Vert A^2 \Vert \le \Vert BB^T + A^2 \Vert \le \Vert {\Sigma }_n ^{-2}\Vert < c^2.\) Then, it follows from Eq. (11) that, with probability tending to 1,
because \(\lambda _{\min }({ G})> c^{-1}\). Let \(m_{{\gamma }^*/{\beta }_2}=(\gamma _1^*/\beta _{{q}+1}, \gamma _2^*/\beta _{{q}+2},\dots , \gamma _{p-q}^*/\beta _{p})^T\). It follows from the Cauchy-Schwarz inequality, the assumption \(\Vert {\beta }_2\Vert \le \delta _n/\sqrt{n}\), and the definition \({ D}_2({\beta }_2) = \text{diag}(\beta _2^{-2})\) that
where \(\odot\) denotes the component-wise product and
for all large n. Thus, Eq. (12), together with (13) and (14), implies that
Since \(\delta _n^2/\lambda _n\rightarrow 0\), we have
for some constant \(c_0>1\), with probability tending to 1. Hence, it follows from inequalities (14) and (15) that
which implies that conclusion (i) holds.
To prove (ii), we need to verify that \({\alpha }^*\in [1/K_0, K_0]^{q}\) with probability tending to 1. By Eq. (11) and the results of Ritov (1990), we have
\[ {\alpha }^* - {\beta }_{10} = (\tilde{\beta }_1 - {\beta }_{10}) - \frac{\lambda _n}{n} \big (A\, {D}_1({\beta }_1)\, {\alpha }^* + B\, {D}_2({\beta }_2)\, {\gamma }^*\big ), \qquad \text{with } \Vert \tilde{\beta }_1 - {\beta }_{10}\Vert = O_p(n^{-1/2}). \]
Similarly, given condition (C5), \(\beta _1 \in [1/K_0, K_0]^{q}\), and \(\Vert {\alpha }^*\Vert < K\),
Then, from Eq. (11), we have
Also, according to inequalities in (12) and condition (C5), we know that as \(n\rightarrow \infty\) and with probability tending to 1,
Therefore, from (17) and (18), we obtain
\[ \Vert {\alpha }^* - {\beta }_{10}\Vert \rightarrow 0 \]
with probability tending to 1, which implies that for any \(\varepsilon >0\), \(P(\Vert {\alpha }^*-{\beta }_{10}\Vert \le \varepsilon )\rightarrow 1\). Since \({\beta }_{10}\in [1/K_0, K_0]^{q}\), it follows that \({\alpha }^* \in [1/K_0, K_0]^{q}\) for large \(n\). Together with the fact that \(\Vert {\gamma }^*\Vert \le \delta _n/\sqrt{n}\), this proves (ii). \(\square\)
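As a numerical aside (not a substitute for the argument above), conclusion (i) can be probed directly: when \(\lambda _n\) is large relative to \(\Vert {\beta }_2\Vert ^2\), the tail block \({\gamma }^*({\beta })\) of \(g({\beta })\) is strictly contracted relative to \({\beta }_2\). A minimal sketch, assuming a Gaussian toy design and treating \(\hat{Y}\) as already imputed:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, p, lam = 400, 3, 8, 50.0
X = rng.standard_normal((n, p))
Y_hat = X[:, :q] @ np.array([2.0, 1.0, 0.5]) + rng.standard_normal(n)

def g(beta):
    # g(beta) = (X'X + lam * D(beta))^{-1} X'Y_hat with D(beta) = diag(beta^{-2})
    D = np.diag(lam / beta**2)
    return np.linalg.solve(X.T @ X + D, X.T @ Y_hat)

beta = np.concatenate([np.array([2.0, 1.0, 0.5]),          # beta_1 in [1/K0, K0]^q
                       np.full(p - q, 0.5 / np.sqrt(n))])  # small tail beta_2
gamma_star = g(beta)[q:]
print(np.linalg.norm(gamma_star) / np.linalg.norm(beta[q:]))  # ratio well below 1
```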
Lemma 2
Under the regularity conditions (C1)–(C6) and with probability tending to 1, the equation \({\alpha } = (X_1 ^T X_1 + \lambda _n {D}_1({\alpha }))^{-1} X_1^T \hat{Y}\) has a unique fixed point \(\hat{\alpha }^*\) in the domain \([1/K_0, K_0]^{q}\).
Proof
Since \({\beta }_{20}=0\), we define
where \({\alpha }=(\alpha _1,\dots ,\alpha _{q})^T\). Note that \((f({\alpha })^T,0^T)^T = g(({\alpha }^T, 0^T)^T)\) and \(f({\alpha })\) is a map from \([1/K_0, K_0]^q\) to itself. Multiplying \((X_1 ^T X_1 + \lambda _n {D}_1({\alpha }))\) and taking derivative with respect to \({\alpha }\) on both sides of Eq. (19), we have
where \({f}^{\prime }({\alpha })=\partial f({\alpha })/\partial {\alpha }^T\). Then
According to condition (C5) and the fact that \({\alpha }\in [1/K_0, K_0]^{q}\), we can derive
Thus, \(\sup _{{\alpha }\in [1/K_0, K_0]^{q}}\Vert {f}^{\prime }({\alpha })\Vert \rightarrow 0\), which implies that \(f(\cdot )\) is a contraction mapping from \([1/K_0, K_0]^{q}\) to itself with probability tending to 1 (Meir and Keeler 1969). Hence, by the contraction mapping theorem, there exists a unique fixed point \(\hat{\alpha }^* \in [1/K_0, K_0]^{q}\) such that \(\hat{\alpha }^* = f(\hat{\alpha }^*)\).
\(\square\)
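The contraction behavior underlying Lemma 2 is also easy to observe numerically: iterating the map \(f({\alpha }) = (X_1^T X_1 + \lambda _n {D}_1({\alpha }))^{-1} X_1^T \hat{Y}\) from different starting values in \([1/K_0, K_0]^{q}\) yields the same fixed point. A minimal sketch under a similar assumed toy setting (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, lam = 500, 3, 2.0
X1 = rng.standard_normal((n, q))
Y_hat = X1 @ np.array([2.0, 1.0, 0.5]) + rng.standard_normal(n)  # all signals nonzero

def f(alpha):
    # f(alpha) = (X1'X1 + lam * D1(alpha))^{-1} X1'Y_hat with D1(alpha) = diag(alpha^{-2})
    D1 = np.diag(lam / alpha**2)
    return np.linalg.solve(X1.T @ X1 + D1, X1.T @ Y_hat)

for start in (np.full(q, 0.2), np.full(q, 5.0)):  # two starts in [1/K0, K0]^q with K0 = 5
    alpha = start
    for _ in range(100):
        alpha = f(alpha)
    print(np.round(alpha, 6))  # both runs print the same fixed point
```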
Proof of Theorem 1
First, we prove the sparsity of the BJ-BAR estimator. According to the definitions of \(\hat{\beta }_2\) and \(\hat{\beta }_2^{(k)}\), it follows from inequality (15) that
\[ \hat{\beta }_2 = \lim _{k\rightarrow \infty } \hat{\beta }_2^{(k)} = 0 \]
holds with probability tending to 1.
Next, we show that \(P( \hat{\beta }_1 = \hat{\alpha }^*)\rightarrow 1\). Recall from Eq. (11) that we define \({\gamma }^*=0\) if \({\beta _2}=0\); indeed, for any fixed large \(n\), Eq. (11) gives \(\lim _{\beta _2\rightarrow 0} \gamma ^*(\beta )=0\). By multiplying \((X^T X +\lambda _n {D}(\beta ))\) on both sides of Eq. (10), we have
Combining Eqs. (20) and (22), we have
Since \(f(\cdot )\) is a contraction mapping, it follows from Eq. (20) that
Let \(h_k=\Vert \hat{\beta }_1^{(k)}-\hat{\alpha }^*\Vert\). Then, from (23) and (24), we get
From (23), for any \(\varepsilon > 0\), there exists \(N>0\) such that \(|\eta _k|<\varepsilon\) for all \(k>N\). Following a recursive calculation as in Dai et al. (2018), we can show that, with probability tending to 1, \(h_k\rightarrow 0\) as \(k\rightarrow \infty\). Since \(\hat{\beta }_1 = \lim _{k\rightarrow \infty }\hat{\beta }_1^{(k)}\) and the fixed point is unique, we have \(P(\hat{\beta }_1 = \hat{\alpha }^* )\rightarrow 1\), completing the proof of Theorem 1(i).
Finally, based on Eq. (11), condition (C5), and the fact that \(\lambda _n/n=o_p(n^{-1/2})\), we get \(\sqrt{n}(\hat{\beta }_1 - {\beta }_{10})\approx \sqrt{n}(\tilde{\beta }_1 - {\beta }_{10})\), where \(\tilde{\beta }_1\) denotes the first \(q\) elements of \(\tilde{\beta }\). Theorem 1(ii) then follows from the asymptotic normality of \(\tilde{\beta }\) (Ritov 1990). \(\square\)
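As a numerical illustration of the sparsity and oracle behavior just established (again under an assumed Gaussian toy design, with \(\hat{Y}\) treated as given), the BAR fixed-point iteration zeroes out the noise coordinates while the surviving estimates stay close to the oracle least-squares fit on the true support, up to a small penalty bias:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 400, 10, 4.0
X = rng.standard_normal((n, p))
beta0 = np.array([1.5, -1.0, 0.8] + [0.0] * 7)   # q = 3 signals, 7 noise covariates
Y_hat = X @ beta0 + rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X + np.eye(p), X.T @ Y_hat)  # ridge start
for _ in range(200):                                      # BAR fixed-point iteration
    D = np.diag(lam / np.maximum(beta**2, 1e-12))
    beta = np.linalg.solve(X.T @ X + D, X.T @ Y_hat)

oracle = np.linalg.solve(X[:, :3].T @ X[:, :3], X[:, :3].T @ Y_hat)
print(np.round(beta[3:], 4))                        # noise coordinates: numerically zero
print(np.round(beta[:3], 3), np.round(oracle, 3))   # signal estimates track the oracle fit
```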
Proof of Theorem 2
Recall that \(\hat{\beta }= \lim _{k\rightarrow \infty } \hat{\beta }^{(k+1)}\) and \(\hat{\beta }^{(k+1)}=\arg \min _{\beta } Q\big (\beta \mid \hat{\beta }^{(k)}\big )\), where
\[ Q\big (\beta \mid \hat{\beta }^{(k)}\big ) = \big \Vert \hat{Y} - X\beta \big \Vert ^2 + \lambda _n \sum _{j=1}^{p} \frac{\beta _j^2}{\big (\hat{\beta }_j^{(k)}\big )^2}. \]
If \(\beta _{\ell } \ne 0\) for \(\ell \in \{i, j\}\), then \(\hat{\beta }^{(k+1)}\) must satisfy the following normal equations for \(\ell \in \{i, j\}\):
\[ X_{\ell }^{T} \big (\hat{Y} - X \hat{\beta }^{(k+1)}\big ) = \lambda _n\, \hat{\beta }_{\ell }^{(k+1)} \big / \big (\hat{\beta }_{\ell }^{(k)}\big )^2. \]
Thus, for \(\ell \in \{i, j\}\),
\[ \hat{\beta }_{\ell }^{(k+1)} = \big (\hat{\beta }_{\ell }^{(k)}\big )^2\, X_{\ell }^{T} \hat{\varepsilon }^{*(k+1)} \big / \lambda _n, \qquad (25) \]
where \(\hat{\varepsilon }^{*(k+1)}=\hat{Y}-X \hat{\beta }^{(k+1)}\). Since
\[ \big \Vert \hat{\varepsilon }^{*(k+1)}\big \Vert ^2 \le Q\big (\hat{\beta }^{(k+1)} \mid \hat{\beta }^{(k)}\big ) \le Q\big (0 \mid \hat{\beta }^{(k)}\big ) = \big \Vert \hat{Y}\big \Vert ^2, \]
we have
\[ \big \Vert \hat{\varepsilon }^{*(k+1)}\big \Vert \le \big \Vert \hat{Y}\big \Vert . \qquad (26) \]
Letting \(k \rightarrow \infty\) in (25) and (26), we obtain \(\big \Vert \hat{\varepsilon }^{*}\big \Vert \le \big \Vert \hat{Y}\big \Vert\) and, for \(\ell \in \{i, j\}\), \(\hat{\beta }_{\ell }^{-1}=X_{\ell }^{T} \hat{\varepsilon }^{*} / \lambda _n\), where \(\hat{\varepsilon }^{*}=\hat{Y}-X \hat{\beta }\). Therefore,
\[ \big |\hat{\beta }_i^{-1} - \hat{\beta }_j^{-1}\big | = \frac{\big |(X_i - X_j)^{T} \hat{\varepsilon }^{*}\big |}{\lambda _n} \le \frac{\Vert X_i - X_j\Vert \, \big \Vert \hat{Y}\big \Vert }{\lambda _n}, \]
which completes the proof.
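The final display also suggests a direct numerical check of the grouping effect: for two highly correlated columns with nonzero estimates, the reciprocals \(\hat{\beta }_i^{-1}\) and \(\hat{\beta }_j^{-1}\) can differ by at most \(\Vert X_i - X_j\Vert \Vert \hat{Y}\Vert /\lambda _n\). A minimal sketch under an assumed correlated Gaussian design (treating \(\hat{Y}\) as given):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 300, 1.0
z = rng.standard_normal(n)
X = np.column_stack([z, z + 0.2 * rng.standard_normal(n),  # columns 0, 1 highly correlated
                     rng.standard_normal(n)])
Y_hat = X @ np.array([1.0, 1.0, -0.5]) + 0.1 * rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X + np.eye(3), X.T @ Y_hat)  # ridge start
for _ in range(200):                                      # BAR fixed-point iteration
    D = np.diag(lam / np.maximum(beta**2, 1e-12))
    beta = np.linalg.solve(X.T @ X + D, X.T @ Y_hat)

lhs = abs(1 / beta[0] - 1 / beta[1])
rhs = np.linalg.norm(X[:, 0] - X[:, 1]) * np.linalg.norm(Y_hat) / lam
print(lhs <= rhs, round(lhs, 4), round(rhs, 2))  # the Theorem 2 bound holds
```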
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, J., Choi, T. & Choi, S. Censored broken adaptive ridge regression in high-dimension. Comput Stat 39, 3457–3482 (2024). https://doi.org/10.1007/s00180-023-01446-1