
Censored broken adaptive ridge regression in high-dimension

  • Original Paper, published in Computational Statistics

Abstract

Broken adaptive ridge (BAR) is a penalized regression method that performs variable selection through a computationally scalable surrogate to \(L_0\) regularization. BAR regression has many appealing features: obtained as the limit of iteratively reweighted \(L_2\) penalties, it mimics \(L_0\)-penalized selection and satisfies the oracle property together with a grouping effect for highly correlated covariates. In this paper, we investigate the BAR procedure for variable selection in a semiparametric accelerated failure time model with complex high-dimensional censored data. Coupled with Buckley-James-type imputed responses, the BAR-based variable selection procedure can accommodate event times that are censored in complex ways, such as right-censored, left-censored, or doubly censored observations. Our approach uses a two-stage cyclic coordinate descent algorithm that minimizes the objective function by iteratively updating the pseudo survival response and the regression coefficients along coordinate directions. Under weak regularity conditions, we establish both the oracle property and the grouping effect of the proposed BAR estimator. Numerical studies investigate the finite-sample performance of the proposed algorithm, and an application to real data is provided.
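To fix ideas, the reweighted-ridge recursion underlying BAR can be sketched on uncensored data, where the Buckley-James imputation step is vacuous and the pseudo response is simply the observed response. The following illustration (Python/NumPy; all names, data, and tuning values are ours, not the paper's) iterates \(\beta ^{(k+1)}=(X^TX+\lambda _n D(\beta ^{(k)}))^{-1}X^T Y\) with \(D(\beta )=\text{ diag }(\beta _j^{-2})\):

```python
import numpy as np

# Synthetic uncensored data (hypothetical example): 3 signal variables, 7 noise variables
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(p - 3)])
Y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 2.0                                     # lambda_n, the L0-surrogate tuning parameter
beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # unpenalized initial estimate
for _ in range(100):
    # Reweighted L2 step: beta <- (X'X + lam * D(beta))^{-1} X'Y,
    # D(beta) = diag(beta_j^{-2}); a tiny floor guards coefficients already driven to zero.
    D = np.diag(1.0 / np.maximum(beta ** 2, 1e-12))
    beta_new = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

support = np.flatnonzero(np.abs(beta) > 1e-6)  # indices of selected variables
```

Coefficients below the threshold induced by \(\lambda _n\) collapse to zero across iterations, while large coefficients incur vanishing shrinkage, which is how the recursion mimics \(L_0\) selection.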


Algorithm 1
Fig. 1


References

  • Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24(6):2350–2383

  • Buckley J, James I (1979) Linear regression with censored data. Biometrika 66(3):429–436

  • Choi S, Cho H (2019) Accelerated failure time models for the analysis of competing risks. J Korean Stat Soc 48:315–326

  • Choi T, Choi S (2021) A fast algorithm for the accelerated failure time model with high-dimensional time-to-event data. J Stat Comput Simul 91(16):3385–3403

  • Choi S, Choi T, Cho H, Bandyopadhyay D (2022) Weighted least-squares regression with competing risks data. Stat Med 41(2):227–241

  • Choi T, Kim AK, Choi S (2021) Semiparametric least-squares regression with doubly-censored data. Comput Stat Data Anal 164:107306

  • Dai L, Chen K, Li G (2020) The broken adaptive ridge procedure and its applications. Statistica Sinica 30(2):1069–1094

  • Dai L, Chen K, Sun Z, Liu Z, Li G (2018) Broken adaptive ridge regression and its asymptotic properties. J Multivar Anal 168:334–351

  • Daubechies I, DeVore R, Fornasier M, Güntürk CS (2010) Iteratively reweighted least squares minimization for sparse recovery. Commun Pure Appl Math 63(1):1–38

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  • Frommlet F, Nuel G (2016) An adaptive ridge procedure for \(l_0\) regularization. PLoS ONE 11(2):e0148620

  • Gao F, Zeng D, Lin DY (2017) Semiparametric estimation of the accelerated failure time model with partly interval-censored data. Biometrics 73(4):1161–1168

  • Huang J (1999) Asymptotic properties of nonparametric estimation based on partly interval-censored data. Statistica Sinica 9(2):501–519

  • Jin Z, Lin D, Wei L, Ying Z (2003) Rank-based inference for the accelerated failure time model. Biometrika 90(2):341–353

  • Jin Z, Lin D, Ying Z (2006) On least-squares regression with censored data. Biometrika 93(1):147–161

  • Johnson BA (2009) On lasso for censored data. Electron J Stat 3:485–506

  • Johnson BA, Lin DY, Zeng D (2008) Penalized estimating functions and variable selection in semiparametric regression models. J Am Stat Assoc 103(482):672–680

  • Kawaguchi ES, Shen JI, Suchard MA, Li G (2021) Scalable algorithms for large competing risks data. J Comput Graph Stat 30(3):685–693

  • Kawaguchi ES, Suchard MA, Liu Z, Li G (2020) A surrogate \({L}_0\) sparse Cox’s regression with applications to sparse high-dimensional massive sample size time-to-event data. Stat Med 39(6):675–686

  • Leurgans S (1987) Linear models, random censoring and synthetic data. Biometrika 74(2):301–309

  • Li Y, Dicker L, Zhao S (2014) The Dantzig selector for censored linear regression models. Statistica Sinica 24(1):251–268

  • Liu Y, Chen X, Li G (2019) A new joint screening method for right-censored time-to-event data with ultra-high dimensional covariates. Stat Methods Med Res 29(6):1499–1513

  • Meir A, Keeler E (1969) A theorem on contraction mappings. J Math Anal Appl 28(2):326–329

  • Rippe RC, Meulman JJ, Eilers PH (2012) Visualization of genomic changes by segmented smoothing using an \(l_0\) penalty. PLoS ONE 7(6):e38230

  • Ritov Y (1990) Estimation in a linear regression model with censored data. Ann Stat 18(1):303–328

  • Shao J (1993) Linear model selection by cross-validation. J Am Stat Assoc 88(422):486–494

  • Son M, Choi T, Shin SJ, Jung Y, Choi S (2021) Regularized linear censored quantile regression. J Korean Stat Soc 51:1–19

  • Sun Z, Liu Y, Chen K, Li G (2022) Broken adaptive ridge regression for right-censored survival data. Ann Inst Stat Math 74(1):69–91

  • Sun Z, Yu C, Li G, Chen K, Liu Y (2020) CenBAR: broken adaptive ridge AFT model with censored data. https://cran.r-project.org/web/packages/CenBAR/index.html, R package version 0.1.1

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  • Turnbull BW (1976) The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc Ser B 38(3):290–295

  • Wang S, Nan B, Zhu J, Beer DG (2008) Doubly penalized Buckley-James method for survival data with high-dimensional covariates. Biometrics 64(1):132–140

  • Xu J, Leng C, Ying Z (2010) Rank-based variable selection with censored data. Stat Comput 20(2):165–176

  • Zeng D, Lin D (2007) Efficient estimation for the accelerated failure time model. J Am Stat Assoc 69(4):507–564

  • Zhao H, Sun D, Li G, Sun J (2018) Variable selection for recurrent event data with broken adaptive ridge regression. Can J Stat 46(3):416–428

  • Zhao H, Wu Q, Li G, Sun J (2020) Simultaneous estimation and variable selection for interval-censored data with broken adaptive ridge regression. J Am Stat Assoc 115(529):204–216

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429


Funding

The research of T. Choi was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (RS-2023-00237435). The research of S. Choi was supported by grants from the National Research Foundation (NRF) of Korea (2022M3J6A1063595, 2022R1A2C1008514) and by Korea University research grants (K2018721, K2008341).

Author information


Corresponding author

Correspondence to Sangbum Choi.

Ethics declarations

Conflict of interest

We declare that we have no conflict of interest.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix, we prove the asymptotic properties of the proposed BJ-BAR estimator \(\hat{\beta }\). For the proof, define

$$\begin{aligned} \left( \begin{array}{c} {\alpha }^*(\beta ) \\ {\gamma }^*(\beta ) \\ \end{array} \right) = g(\beta )=(X^T X +\lambda _n { D}(\beta ))^{-1} \hat{Y} \,, \end{aligned}$$
(10)

and the partition of matrix \({\Sigma }_n^{-1}\) as

$$\begin{aligned} {\Sigma }_n^{-1} = \left( \begin{array}{cc} A &{} B\\ B^T &{} G \\ \end{array} \right) \,, \end{aligned}$$

where A is a \(q\times q\) matrix and \(\hat{Y}=\hat{Y}(\tilde{\beta })\) is the “imputed” failure time. We write \({\alpha }^*(\beta )\) and \({\gamma }^*(\beta )\) as \(\alpha ^*\) and \(\gamma ^*\), where \(\alpha ^*\) is a \(q\times 1\) vector and \(\gamma ^*\) is a \((p-q)\times 1\) vector. Note that since \({\Sigma }_n=n^{-1} X^T X\) is nonsingular, multiplying both sides of Eq. (10) by \((X^T X)^{-1} (X^T X + \lambda _n { D}(\beta ))\) and subtracting \(\beta _0=(\beta _{10}^T,0^T)^T\), we have

$$\begin{aligned} \left( \begin{array}{c} {\alpha }^*-{\beta }_{10} \\ {\gamma }^* \\ \end{array} \right) +\frac{\lambda _n}{n}\left( \begin{array}{c} A { D}_1({\beta }_1){\alpha }^* + B { D}_2({\beta }_2){\gamma }^* \\ B^T { D}_1({\beta }_1){\alpha }^* + G { D}_2({\beta }_2){\gamma }^* \\ \end{array} \right) ={\tilde{\beta }} -{\beta }_0 \,, \end{aligned}$$
(11)

since \(\tilde{\beta }=(X^TX)^{-1}X^T \hat{Y}(\tilde{\beta })\), where \({ D}_1({\beta }_1) = \text{ diag }(\beta _1^{-2})\) and \({ D}_2({\beta }_2) = \text{ diag }(\beta _2^{-2})\). We also define \({\gamma }^*=0\) if \({\beta _2}=0\) in Eq. (11). According to Ritov (1990), we have \(\Vert {\tilde{\beta }} -{\beta }_0\Vert = O_p(n^{-1/2}).\)
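Because the imputed response \(\hat{Y}(\tilde{\beta })\) drives the whole procedure, a minimal sketch of Buckley-James-type imputation for the right-censored case may be helpful. This is a generic textbook version (Python/NumPy; the helper names and the simulated data are ours, ties and the tail of the Kaplan-Meier estimate are handled by the simplest convention), not the paper's implementation: each censored response is replaced by its conditional expectation under a Kaplan-Meier estimate of the residual distribution.

```python
import numpy as np

def km_jumps(e, delta):
    """Kaplan-Meier estimate of the residual survival function.
    Returns sorted residuals, KM survival values, and jump sizes
    (nonzero only at uncensored residuals)."""
    order = np.argsort(e)
    e_s, d_s = e[order], delta[order]
    n = len(e)
    at_risk = n - np.arange(n)                  # risk-set size at each sorted residual
    surv = np.cumprod(1.0 - d_s / at_risk)
    surv_prev = np.concatenate(([1.0], surv[:-1]))
    jumps = np.where(d_s == 1, surv_prev - surv, 0.0)
    return e_s, surv, jumps

def buckley_james_impute(Y, delta, X, beta):
    """One imputation step: keep uncensored responses; replace a censored
    response by X_i beta + E[e | e > e_i] under the KM estimate,
    renormalized over the observed range (a common practical convention)."""
    e = Y - X @ beta
    e_s, _, jumps = km_jumps(e, delta)
    Yhat = Y.astype(float).copy()
    for i in np.flatnonzero(delta == 0):
        mask = e_s > e[i]
        w = jumps[mask]
        if w.sum() > 0:
            Yhat[i] = X[i] @ beta + np.sum(w * e_s[mask]) / w.sum()
        # else: censored beyond the last event time; keep the observed value
    return Yhat

# Small right-censored example (hypothetical data)
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -1.0, 0.5])
T = X @ beta0 + rng.standard_normal(n)          # latent log event times
C = rng.normal(loc=1.0, scale=1.0, size=n)      # independent censoring times
Y = np.minimum(T, C)
delta = (T <= C).astype(int)                    # 1 = event observed

Yhat = buckley_james_impute(Y, delta, X, beta0)
```

Imputed values never fall below the censored observations, and uncensored observations are left untouched, as the iterative algorithm requires.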

The proof of Theorem 1 rests on the following two lemmas, which are adapted from Dai et al. (2018) and Zhao et al. (2018, 2020).

Lemma 1

Let \(\{\, \delta _n \, \}\) be a sequence of positive real numbers satisfying \(\delta _n\rightarrow \infty\) and \(\delta _n^2/\lambda _n \rightarrow 0\). Define \(H= \{{\beta } = ({\beta }_1^T, {\beta }_2^T)^T: {\beta }_1 \in [1/K_0, K_0]^{q}, \Vert {\beta }_2\Vert \le \delta _n/\sqrt{n}\}\), where \(K_0>1\) is a constant such that \({\beta }_{10}\in [1/K_0, K_0]^{q}\). Then under the regularity conditions (C1)–(C6) and with probability tending to 1, we have

(i) \(\displaystyle \sup _{{\beta }\in H}\dfrac{\Vert {\gamma }^*({\beta })\Vert }{\Vert {\beta }_2\Vert } < \dfrac{1}{c_0}\) for some constant \(c_0>1\).

(ii) \(g(\cdot )\) is a mapping from H to itself.

Proof

By Eq. (11) and Ritov (1990), we have

$$\begin{aligned} \displaystyle \sup _{{\beta }\in H}\Big \Vert \ \gamma ^*+ \frac{\lambda _n}{n}B^T {D}_1({\beta }_1){\alpha }^* + \frac{\lambda _n}{n}G {D}_2({\beta }_2){\gamma }^* \Big \Vert =O_p(n^{-1/2}). \end{aligned}$$

Note that \(\Vert \beta _2\Vert \le \delta _n / \sqrt{n}\) and \(\lambda _n/n=o_p(n^{-1/2})\). Based on conditions (C5) and (C6) and the fact that \({\beta }_1 \in [1/K_0, K_0]^{q}\) and \(\Vert {\alpha }^*\Vert \le \Vert g({\beta })\Vert < K\) for some constant \(K>0\), we have

$$\begin{aligned} \displaystyle \sup _{{\beta }\in H}\Big \Vert \frac{\lambda _n}{n}B^T {D}_1({\beta }_1){\alpha }^* \Big \Vert \le \frac{\lambda _n}{n} \Vert B^T \Vert \displaystyle \sup _{{\beta }\in H} \Vert {D}_1({\beta }_1){\alpha }^*\Vert \le \sqrt{2}c \frac{\lambda _n}{n} \frac{a_1}{a_0 ^2} \displaystyle \sup _{{\beta }\in H} \Vert {\alpha }^*\Vert =o_p(n^{-1/2}), \end{aligned}$$

where \(a_0 = \min _{1 \le j \le q} |\beta _{10_j} |\), \(a_1 = \max _{1 \le j \le q} |\beta _{10_j} |\), and \(\Vert B^T\Vert \le \sqrt{2}c\), which follows from the inequality \(\Vert BB^T\Vert - \Vert A^2 \Vert \le \Vert BB^T + A^2 \Vert \le \Vert {\Sigma }_n ^{-2}\Vert < c^2.\) Then, it follows from Eq. (11) that, with probability tending to 1,

$$\begin{aligned} c^{-1}\Big \Vert \dfrac{\lambda _n}{n}{D}_2({\beta }_2){\gamma }^*\Big \Vert -\Vert {\gamma }^*\Vert \le \displaystyle \sup _{{\beta }\in H}\Big \Vert {\gamma }^* +\frac{\lambda _n}{n}G {D}_2({\beta }_2){\gamma }^*\Big \Vert = O_p(n^{-1/2}) \le \dfrac{\delta _n}{\sqrt{n}}, \end{aligned}$$
(12)

because \(\lambda _{\min }({ G})> c^{-1}\). Let \(m_{{\gamma }^*/{\beta }_2}=(\gamma _1^*/\beta _{{q}+1}, \gamma _2^*/\beta _{{q}+2},\dots , \gamma _{p-q}^*/\beta _{p})^T\). It follows from the Cauchy-Schwarz inequality and the assumption \(\Vert {\beta }_2\Vert \le \delta _n/\sqrt{n}\) and \({ D}_2({\beta }_2) = \text{ diag }(\beta _2^{-2})\) that

$$\begin{aligned} \Vert m_{{\gamma }^*/{\beta }_2}\Vert = \Vert D_2(\beta _2) \gamma ^* \odot \beta _2 \Vert \le \Vert D_2(\beta _2) \gamma ^* \Vert \cdot \Vert \beta _2 \Vert \le \Vert {D}_2({\beta }_2){\gamma }^*\Vert \frac{\delta _n}{\sqrt{n}}, \end{aligned}$$
(13)

where \(\odot\) denotes the component-wise product and

$$\begin{aligned} \Vert {\gamma }^*\Vert =\Vert {D}_2({\beta }_2)^{-1/2}m_{{\gamma }^*/{\beta }_2}\Vert \le \Vert m_{{\gamma }^*/{\beta }_2}\Vert \cdot \Vert {\beta }_2\Vert \le \Vert m_{{\gamma }^*/{\beta }_2}\Vert \frac{\delta _n}{\sqrt{n}}, \end{aligned}$$
(14)

for all large n. Thus, Eq. (12), together with (13) and (14), implies that

$$\begin{aligned} \frac{\lambda _n}{nc}\frac{\sqrt{n}}{\delta _n}\Vert m_{{\gamma }^*/{\beta }_2}\Vert -\Vert m_{{\gamma }^*/{\beta }_2}\Vert \frac{\delta _n}{\sqrt{n}}\le \frac{\delta _n}{\sqrt{n}}. \end{aligned}$$

Immediately from \(\delta _n^2/\lambda _n\rightarrow 0\), we have

$$\begin{aligned} \Vert m_{{\gamma }^*/{\beta }_2}\Vert \le \dfrac{1}{\frac{\lambda _n}{\delta ^2_n c}-1} < \frac{1}{c_0} \end{aligned}$$
(15)

for some constant \(c_0>1\), with probability tending to 1. Hence, it follows from inequality (14) and (15) that

$$\begin{aligned} \Vert {\gamma }^*\Vert < \Vert {\beta }_2\Vert \le \frac{\delta _n}{\sqrt{n}}\rightarrow 0\;\;\text{ as }\;\; n\rightarrow \infty , \end{aligned}$$
(16)

which implies that conclusion (i) holds.

To prove (ii), we need to verify that \({\alpha }^*\in [1/K_0, K_0]^{q}\) with probability tending to 1. By Eq. (11) and the results from Ritov (1990), we have

$$\begin{aligned} \displaystyle \sup _{{\beta }\in H}\bigg \Vert \alpha ^*-\beta _{10}+ \frac{\lambda _n}{n}A {D}_1({\beta }_1){\alpha }^* + \frac{\lambda _n}{n}B {D}_2({\beta }_2){\gamma }^* \bigg \Vert =O_p(n^{-1/2}). \end{aligned}$$

Similarly, given conditions (C5), \(\beta _1 \in [1/K_0, K_0]^{q}\) and \(\Vert {\alpha }^*\Vert < K\),

$$\begin{aligned} \sup _{\beta \in H}\bigg \Vert \frac{\lambda _n}{n}A {D}_1({\beta }_1){\alpha }^* \bigg \Vert =o_p(n^{-1/2}). \end{aligned}$$

Then, from Eq. (11), we have

$$\begin{aligned} \sup _{{\beta }\in H}\bigg \Vert {\alpha }^*-{\beta }_{10}+\frac{\lambda _n}{n}B {D}_2({\beta }_2){\gamma }^*\bigg \Vert =O_p(n^{-1/2})\le \frac{\delta _n}{\sqrt{n}}. \end{aligned}$$
(17)

Also, according to inequalities in (12) and condition (C5), we know that as \(n\rightarrow \infty\) and with probability tending to 1,

$$\begin{aligned} \sup _{\beta \in H}\Big \Vert \frac{\lambda _n}{n} B {D}_2({\beta }_2) {\gamma }^*\Big \Vert \le \frac{\lambda _n}{n} \Vert {B}\Vert \sup _{{\beta }\in H}\Vert {D}_2({\beta }_2) {\gamma }^*\Vert \le \frac{2c^2\delta _n}{\sqrt{n}}. \end{aligned}$$
(18)

Therefore, from (17) and (18), we can get

$$\begin{aligned} \sup _{{\beta }\in H}\Vert {\alpha }^*-{\beta }_{10}\Vert \le \frac{(2c^2+1)\delta _n}{\sqrt{n}}\rightarrow 0 \end{aligned}$$

with probability tending to 1, which implies that for any \(\varepsilon >0\), \(P(\Vert {\alpha }^*-{\beta }_{10}\Vert \le \varepsilon )\rightarrow 1\). Since \({\beta }_{10}\in [1/K_0, K_0]^{q}\), it follows that \({\alpha }^* \in [1/K_0, K_0]^{q}\) with probability tending to 1 for large n. Combined with the fact that \(\Vert {\gamma }^*\Vert \le \delta _n/\sqrt{n}\), this proves (ii). \(\square\)

Lemma 2

Under the regularity conditions (C1)–(C6) and with probability tending to 1, the equation \({\alpha } = (X_1 ^T X_1 + \lambda _n {D}_1({\alpha }))^{-1}\hat{Y}_1\) has a unique fixed point \(\hat{\alpha }^*\) in the domain \([1/K_0, K_0]^{q}\).

Proof

Since \({\beta }_{20}=0\), we define

$$\begin{aligned} f(\alpha )=(f_1(\alpha ), f_2(\alpha ),\dots , f_{q}(\alpha ))^T = (X_1 ^T X_1 + \lambda _n {D}_1({\alpha }))^{-1}\hat{Y}_1, \end{aligned}$$
(19)

where \({\alpha }=(\alpha _1,\dots ,\alpha _{q})^T\). Note that \((f({\alpha })^T,0^T)^T = g(({\alpha }^T, 0^T)^T)\) and \(f({\alpha })\) is a map from \([1/K_0, K_0]^q\) to itself. Multiplying both sides of Eq. (19) by \((X_1 ^T X_1 + \lambda _n {D}_1({\alpha }))\) and taking the derivative with respect to \({\alpha }\), we have

$$\begin{aligned} \Big ({\Sigma }_{n1} + \frac{\lambda _n}{n}{D}_1({\alpha })\Big ){f}^{\prime }({\alpha }) + \frac{\lambda _n}{n}\text{ diag }\bigg (\frac{-2 f_1(\alpha )}{\alpha _1^3},\dots , \frac{-2 f_q(\alpha )}{\alpha _{q}^3}\bigg )=0, \end{aligned}$$

where \({f}^{\prime }({\alpha })=\partial f({\alpha })/\partial {\alpha }^T\). Then

$$\begin{aligned} \sup _{\alpha \in [1/K_0, K_0]^{q}}\bigg \Vert \Big (\Sigma _{n1} + \frac{\lambda _n}{n}{D}_1({\alpha })\Big ){f}^{\prime }({\alpha })\bigg \Vert = \sup _{\alpha \in [1/K_0, K_0]^{q}}\frac{2\lambda _n}{n}\bigg \Vert \text{ diag }\Big (\frac{f_1({\alpha })}{\alpha _1^3},\dots , \frac{f_q({\alpha })}{\alpha _{q}^3}\Big )\bigg \Vert =o_p(1). \end{aligned}$$

According to condition (C5) and the fact that \({\alpha }\in [1/K_0, K_0]^{q}\), we can derive

$$\begin{aligned} \bigg \Vert \Big (\Sigma _{n1} + \frac{\lambda _n}{n}{D}_1({\alpha })\Big ){f}^{\prime }({\alpha })\bigg \Vert \ge \bigg \Vert \Sigma _{n1}{f}^{\prime }({\alpha })\bigg \Vert -\bigg \Vert \frac{\lambda _n}{n}{D}_1({\alpha }){f}^{\prime }({\alpha })\bigg \Vert \ge \Big (\frac{1}{c}-\frac{\lambda _n}{n}K_0^2\Big )\Vert {f}^{\prime }({\alpha })\Vert . \end{aligned}$$

Thus, \(\sup _{{\alpha }\in [1/K_0, K_0]^{q}}\Vert {f}^{\prime }({\alpha })\Vert \rightarrow 0\), which implies that \(f(\cdot )\) is a contraction mapping from \([1/K_0, K_0]^{q}\) to itself with probability tending to 1 (Meir and Keeler 1969). Hence, by the contraction mapping theorem, there exists a unique fixed point \(\hat{\alpha }^* \in [1/K_0, K_0]^{q}\) such that

$$\begin{aligned} \hat{\alpha }^* =( X_1^T X_1+ \lambda _n {D}_1(\hat{\alpha }^*))^{-1} \hat{Y}_1. \end{aligned}$$
(20)

\(\square\)
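Lemma 2 can also be seen numerically: iterating the map \(f\) on the oracle design from two very different starting values lands both runs on the same fixed point, consistent with the contraction property. In the sketch below (Python/NumPy; an uncensored response stands in for the imputed \(\hat{Y}_1\), and all names, data, and tuning values are ours):

```python
import numpy as np

# Oracle (true-support) design with q = 3 signal variables (hypothetical data)
rng = np.random.default_rng(2)
n, q = 150, 3
X1 = rng.standard_normal((n, q))
alpha_true = np.array([1.5, -1.0, 2.0])
Y1 = X1 @ alpha_true + 0.3 * rng.standard_normal(n)
lam = 1.0                                       # lambda_n with lambda_n / n small

def f(alpha):
    # f(alpha) = (X1'X1 + lam * D1(alpha))^{-1} X1'Y1 with D1(alpha) = diag(alpha^{-2})
    D1 = np.diag(1.0 / alpha ** 2)
    return np.linalg.solve(X1.T @ X1 + lam * D1, X1.T @ Y1)

def iterate(start, iters=200):
    a = np.asarray(start, dtype=float)
    for _ in range(iters):
        a = f(a)
    return a

a1 = iterate([0.5, 0.5, 0.5])                   # start near the lower edge of the domain
a2 = iterate([5.0, 5.0, 5.0])                   # start near the upper edge
```

Both runs converge to the same point, which also solves the fixed-point equation \(\alpha = f(\alpha )\) up to numerical precision.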

Proof of Theorem 1

First, we prove the sparsity of the BJ-BAR estimator. According to the definitions of \(\hat{\beta }_2\) and \(\hat{\beta }_2^{(k)}\), it follows from inequality (15) that

$$\begin{aligned} \hat{\beta }_2=\lim _{k\rightarrow \infty }\hat{\beta }_2^{(k)}=0 \end{aligned}$$
(21)

holds with the probability tending to 1.

Next, we show that \(P( \hat{\beta }_1 = \hat{\alpha }^*)\rightarrow 1\). In Eq. (11), recall that \({\gamma }^*=0\) if \({\beta _2}=0\). Note that for any fixed large n, Eq. (11) gives \(\lim _{\beta _2\rightarrow 0} \gamma ^*(\beta )=0\). Multiplying both sides of Eq. (10) by \((X^T X +\lambda _n {D}(\beta ))\), we obtain

$$\begin{aligned} \lim _{\beta _2\rightarrow 0}{\alpha }^*(\beta )=(X_1 ^T X_1 +\lambda _n {D}_1(\beta _1))^{-1}\hat{Y}_1=f(\beta _1). \end{aligned}$$
(22)

Combining Eqs. (20) and (22), we have

$$\begin{aligned} \sup _{\beta _1\in [1/K_0, K_0]^{q}}\Vert f(\beta _1)-{\alpha }^*(\beta _1, \hat{\beta }_2^{(k)})\Vert \rightarrow 0,\; \text{ as } \; k\rightarrow \infty . \end{aligned}$$
(23)

Since \(f(\cdot )\) is a contraction mapping, it follows from Eq. (20) that

$$\begin{aligned} \Vert f(\hat{\beta }_1^{(k)})-\hat{\alpha }^*\Vert =\Vert f(\hat{\beta }_1^{(k)})- f(\hat{\alpha }^*)\Vert \le \frac{1}{c} \Vert \hat{\beta }_1^{(k)}-\hat{\alpha }^*\Vert \;\;\;(c>1). \end{aligned}$$
(24)

Let \(h_k=\Vert \hat{\beta }_1^{(k)}-\hat{\alpha }^*\Vert\), then, from (23) and (24), we get

$$\begin{aligned} \begin{array}{rl} h_{k+1}=\Vert {\alpha }^*(\hat{\beta }^{(k)})-\hat{\alpha }^*\Vert &{}\le \Vert \alpha ^*(\hat{\beta }^{(k)})-f(\hat{\beta }_1^{(k)})\Vert +\Vert f(\hat{\beta }_1^{(k)})-\hat{\alpha }^*\Vert \\ &{}\le \eta _k + \dfrac{1}{c} ~ h_k,\;\text{ for } \text{ some } \text{ small }\; \eta _k>0. \end{array} \end{aligned}$$

From (23), for any \(\varepsilon > 0\), there exists \(N>0\) such that \(|\eta _k|<\varepsilon\) for all \(k>N\). Following a recursive calculation as in Dai et al. (2018), we can show that \(h_k\rightarrow 0\) as \(k\rightarrow \infty\), with probability tending to 1. Since \(\hat{\beta }_1 = \lim _{k\rightarrow \infty }\hat{\beta }_1^{(k)}\), the uniqueness of the fixed point gives \(P(\hat{\beta }_1 = \hat{\alpha }^* )\rightarrow 1\), completing the proof of Theorem 1(i).

Finally, based on Eq. (11), condition (C5), and the fact that \(\lambda _n/n=o_p(n^{-1/2})\), we get \(\sqrt{n}(\hat{\beta }_1 - {\beta }_{10})\approx \sqrt{n}(\tilde{\beta }_1 - {\beta }_{10})\), where \(\tilde{\beta }_1\) denotes the first q elements of \(\tilde{\beta }\). Theorem 1(ii) then follows from the asymptotic normality of \(\tilde{\beta }\) (Ritov 1990).

Proof of Theorem 2

Recall that \(\hat{\beta }= \lim _{k\rightarrow \infty } \hat{\beta }^{(k+1)}\) and \(\hat{\beta }^{(k+1)}=\arg \min _{\beta } Q\big (\beta \mid \hat{\beta }^{(k)}\big )\), where

$$\begin{aligned} Q\big (\beta \mid \hat{\beta }^{(k)}\big )=\big \Vert \hat{Y}-X \beta \big \Vert ^{2}+\lambda _n \sum _{i=1}^{p} \frac{\beta _{i}^{2}}{\big \{\hat{\beta }_{i}^{(k)}\big \}^{2}}. \end{aligned}$$

If \(\hat{\beta }_{\ell } \ne 0\) for \(\ell \in \{i, j\}\), then \(\hat{\beta }^{(k+1)}\) must satisfy the following normal equations for \(\ell \in \{i, j\}\):

$$\begin{aligned} -2 X_{\ell }^T\big (\hat{Y}-X \hat{\beta }^{(k+1)}\big )+2 \lambda _n \frac{\hat{\beta }_{\ell }^{(k+1)}}{\big \{\hat{\beta }_{\ell }^{(k)}\big \}^{2}}=0. \end{aligned}$$

Thus, for \(\ell \in \{i, j\}\),

$$\begin{aligned} \frac{\hat{\beta }_{\ell }^{(k+1)}}{\big \{\hat{\beta }_{\ell }^{(k)}\big \}^{2}}= \frac{X_{\ell }^{T} \hat{\varepsilon }^{*(k+1)}}{\lambda _n}, \end{aligned}$$
(25)

where \(\hat{\varepsilon }^{*(k+1)}=\hat{Y}-X \hat{\beta }^{(k+1)}.\) Since

$$\begin{aligned} \big \Vert \hat{\varepsilon }^{*(k+1)}\big \Vert ^{2}+\lambda _n \sum _{i=1}^{p} \frac{\hat{\beta }_{i}^{2}}{\big \{\hat{\beta }_{i}^{(k)}\big \}^2}= Q\big (\hat{\beta }^{(k+1)} \mid \hat{\beta }^{(k)}\big ) \le Q\big (0 \mid \hat{\beta }^{(k)}\big )=\big \Vert \hat{Y}\big \Vert ^{2}, \end{aligned}$$

we have

$$\begin{aligned} \big \Vert \hat{\varepsilon }^{*(k+1)}\big \Vert \le \big \Vert \hat{Y}\big \Vert . \end{aligned}$$
(26)

Letting \(k \rightarrow \infty\) in (25) and (26), we obtain \(\big \Vert \hat{\varepsilon }^{*}\big \Vert \le \big \Vert \hat{Y}\big \Vert\) and, for \(\ell \in \{i, j\}\), \(\hat{\beta }_{\ell }^{-1}=X_{\ell }^{T} \hat{\varepsilon }^{*} / \lambda _n\), where \(\hat{\varepsilon }^{*}=\hat{Y}-X \hat{\beta }\). Therefore,

$$\begin{aligned} \big |\hat{\beta }_{i}^{-1}-\hat{\beta }_{j}^{-1}\big | \le \frac{1}{\lambda _n}\big \Vert \hat{\varepsilon }^{*} \big \Vert \big \Vert X_{i}-X_{j}\big \Vert \le \frac{1}{\lambda _n}\big \Vert \hat{Y}\big \Vert \sqrt{2\big (1-r_{i j}\big )}, \end{aligned}$$

which completes the proof.
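The grouping bound of Theorem 2 is easy to check numerically. The sketch below (Python/NumPy; centered unit-norm columns so that \(\Vert X_i - X_j\Vert = \sqrt{2(1-r_{ij})}\), an uncensored response in place of \(\hat{Y}\), and our own names, data, and tuning values) runs the BAR recursion on a design whose first two columns are highly correlated and evaluates both sides of \(|\hat{\beta }_i^{-1}-\hat{\beta }_j^{-1}| \le \lambda _n^{-1}\Vert \hat{Y}\Vert \sqrt{2(1-r_{ij})}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 6
Z = rng.standard_normal((n, p))
Z[:, 1] = Z[:, 0] + 0.5 * rng.standard_normal(n)    # columns 0 and 1 highly correlated
Zc = Z - Z.mean(axis=0)
X = Zc / np.linalg.norm(Zc, axis=0)                 # centered, unit-norm columns
beta_true = np.array([1.0, 1.0, 0.8, 0.0, 0.0, 0.0])
Y = X @ beta_true + 0.01 * rng.standard_normal(n)

lam = 1e-3
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
for _ in range(300):
    # BAR update; the floor guards coefficients already driven to zero
    D = np.diag(1.0 / np.maximum(beta ** 2, 1e-12))
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ Y)

r01 = X[:, 0] @ X[:, 1]                             # sample correlation of the pair
lhs = abs(1.0 / beta[0] - 1.0 / beta[1])            # |beta_i^{-1} - beta_j^{-1}|
rhs = np.linalg.norm(Y) * np.sqrt(2.0 * (1.0 - r01)) / lam
```

In this configuration both correlated coefficients survive selection, the noise coefficients collapse to zero, and the bound holds with ample slack.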


About this article


Cite this article

Lee, J., Choi, T. & Choi, S. Censored broken adaptive ridge regression in high-dimension. Comput Stat 39, 3457–3482 (2024). https://doi.org/10.1007/s00180-023-01446-1

