ℓ0-Regularized high-dimensional accelerated failure time model

https://doi.org/10.1016/j.csda.2022.107430

Abstract

We develop a constructive approach for ℓ0-penalized estimation in the sparse accelerated failure time (AFT) model with high-dimensional covariates. The proposed approach is based on Stute's weighted least squares criterion combined with ℓ0-penalization. The method is a computational algorithm that generates a sequence of solutions iteratively, based on active sets derived from primal and dual information and root finding according to the Karush-Kuhn-Tucker (KKT) conditions. We refer to the proposed method as AFT-SDAR (for support detection and root finding). An important aspect of our theoretical results is that they directly concern the sequence of solutions generated by the AFT-SDAR algorithm. We prove that the estimation errors of the solution sequence decay exponentially to the optimal error bound with high probability, as long as the covariate matrix satisfies a mild regularity condition which is necessary and sufficient for model identification even in the setting of high-dimensional linear regression. An adaptive version of AFT-SDAR, called AFT-ASDAR, is also proposed; it determines the support size of the estimated coefficient in a data-driven fashion. Simulation studies demonstrate the superior performance of the proposed method over the lasso and MCP in terms of accuracy and speed. The application of the proposed method is also illustrated by analyzing a real data set.

Introduction

In survival analysis, an attractive alternative to the widely used proportional hazards model (Cox, 1972) is the accelerated failure time (AFT) model (Koul et al., 1981; Wei, 1992; Kalbfleisch and Prentice, 2011). The AFT model is a linear regression model in which the response variable is usually the logarithm or a known monotone transformation of the failure time. Let $T_i$ be the failure time and $x_i$ be a $p$-dimensional covariate vector for the $i$th subject in a random sample of size $n$. The AFT model assumes $T_i = x_i^\top \beta + \epsilon_i$, $i = 1, \ldots, n$, where $\beta \in \mathbb{R}^p$ is the underlying regression coefficient vector and the $\epsilon_i$'s are random error terms. Often, $T_i$ is taken to be the logarithm of the failure time. When $T_i$ is subject to right censoring, we only observe $(Y_i, \delta_i, x_i)$, where $Y_i = \min\{T_i, C_i\}$, $C_i$ is the censoring time, and $\delta_i = 1\{T_i \le C_i\}$ is the censoring indicator. Assume that a random sample of i.i.d. observations $(Y_i, \delta_i, x_i)$, $i = 1, \ldots, n$, is available. To estimate $\beta$ when the distribution of the error terms is unspecified, several approaches have been proposed in the literature. One approach is the Buckley-James estimator (Buckley and James, 1979), which adjusts for censored observations using the Kaplan-Meier estimator. A second approach is the rank-based estimator (Ying, 1993), which is motivated by the score function of the partial likelihood. A third alternative is the weighted least squares approach (Stute et al., 1993; Stute, 1996), which involves the minimization of a weighted least squares objective function.
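To make the observation scheme concrete, the following minimal sketch builds right-censored pairs from failure and censoring times on the log scale. The function name `observe_right_censored`, the toy coefficient vector, and the Gaussian error and censoring distributions are illustrative assumptions and are not taken from the paper.

```python
import random

def observe_right_censored(failure_times, censoring_times):
    """Return pairs (Y_i, delta_i) with Y_i = min(T_i, C_i), delta_i = 1{T_i <= C_i}."""
    observed = []
    for t, c in zip(failure_times, censoring_times):
        observed.append((min(t, c), 1 if t <= c else 0))
    return observed

# Toy AFT data on the log scale: T_i = x_i' beta + eps_i (all values hypothetical).
rng = random.Random(0)
beta = [1.0, 0.0, -0.5]
X = [[rng.gauss(0, 1) for _ in beta] for _ in range(5)]
T = [sum(xj * bj for xj, bj in zip(x, beta)) + rng.gauss(0, 0.5) for x in X]
C = [rng.gauss(1.0, 1.0) for _ in X]  # censoring times drawn independently of T
data = observe_right_censored(T, C)
```

Each element of `data` is a pair $(Y_i, \delta_i)$; together with the rows of `X` these are the only quantities available to an estimator.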

In this paper, we focus on the high-dimensional AFT model, where the dimension of the covariate vector can exceed the sample size. Many methods have been proposed for parameter estimation and variable selection under this model. For example, Huang et al. (2006) considered the LASSO (Tibshirani, 1996) in the AFT model, based on the weighted least squares criterion; Johnson (2008) and Johnson et al. (2008) applied the SCAD (Fan and Li, 2001) penalty to the rank-based estimator and the Buckley-James estimator; Cai et al. (2009) proposed the rank-based adaptive LASSO (Zou, 2006) method; Huang and Ma (2010) used bridge penalization for regularized estimation and variable selection; Hu and Chai (2013) extended the MCP (Zhang et al., 2010) penalty to weighted least squares estimation; Khan and Shaw (2016) used the adaptive and weighted elastic net methods (Zou and Zhang, 2009; Hong and Zhang, 2010) based on the weighted least squares criterion.

We propose the ℓ0-penalized method for estimation and variable selection under the high-dimensional AFT model. We extend the support detection and root finding (SDAR) algorithm (Huang et al., 2018) for the linear regression model to the AFT model. For convenience, we refer to the proposed method as AFT-SDAR. In the same spirit as the SDAR method, AFT-SDAR is a constructive approach to estimating the sparse and high-dimensional AFT model. This approach is a computational algorithm motivated by the KKT conditions for the ℓ0-penalized weighted least squares solution, and it generates a sequence of solutions iteratively, based on support detection using primal and dual information and root finding. Theoretically, we show that the ℓ∞-norm of the estimation errors of the solution sequence decays exponentially to the optimal order $O(\sqrt{\log p / n})$ with high probability, as long as the covariate matrix satisfies the weakest regularity condition that is necessary and sufficient for model identification. Moreover, the estimated support coincides with the true support of the underlying regression coefficient vector if the minimum absolute value of the nonzero entries of the target is above the detectable order.
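The alternation between support detection and root finding can be illustrated with a minimal sketch in the style of the SDAR iteration of Huang et al. (2018), applied to a weighted least squares loss by absorbing the weights into the data. It is not the paper's Algorithm 1. The function name `sdar_weighted_ls`, the default step size `tau=1.0`, the stopping rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sdar_weighted_ls(X, y, w, T, tau=1.0, max_iter=20):
    """SDAR-style iteration for (1/2n) * sum_i w_i (y_i - x_i' beta)^2 with T >= 1 nonzeros."""
    n, p = X.shape
    sw = np.sqrt(w)
    Xw, yw = X * sw[:, None], y * sw           # absorb the weights into the data
    beta = np.zeros(p)
    d = Xw.T @ (yw - Xw @ beta) / n            # dual variable: negative gradient at beta
    active_prev = None
    for _ in range(max_iter):
        # Support detection: indices of the T largest entries of |beta + tau * d|.
        active = np.sort(np.argsort(-np.abs(beta + tau * d))[:T])
        if active_prev is not None and np.array_equal(active, active_prev):
            break                               # active set has stabilised
        # Root finding: least squares on the active set, zero elsewhere.
        beta = np.zeros(p)
        beta[active] = np.linalg.lstsq(Xw[:, active], yw, rcond=None)[0]
        d = Xw.T @ (yw - Xw @ beta) / n
        d[active] = 0.0                         # dual variable vanishes on the active set
        active_prev = active
    return beta, active_prev

# Toy check on noiseless data (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [3.0, -3.0, 3.0]
y = X @ beta_true
w = rng.uniform(0.5, 1.5, size=300)            # stand-in for Kaplan-Meier weights
beta_hat, support = sdar_weighted_ls(X, y, w, T=3)
```

In this sketch the primal information is the current iterate `beta` and the dual information is `d`; the active set is chosen from both, and the restricted least squares solve plays the role of root finding for the KKT system on that set.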

The rest of this paper is organized as follows. In Section 2, we describe the ℓ0-penalized criterion for the AFT model. In Section 3, we give the KKT conditions for the ℓ0-penalized weighted least squares solutions and describe the proposed AFT-SDAR algorithm. In Section 4, we first establish the finite-step and deterministic error bounds for the solution sequence generated by the AFT-SDAR algorithm. As a consequence of these deterministic error bounds, we provide nonasymptotic error bounds for the solution sequence. We also show that the proposed method recovers the support of the underlying regression coefficient vector in finitely many iterations with high probability. In Section 5, we describe AFT-ASDAR, the adaptive version of AFT-SDAR that selects the tuning parameter in a data-driven fashion. In Section 6, we assess the finite sample performance of the proposed method with different simulation studies and a real case study on a breast cancer gene expression data set. Concluding remarks are given in Section 7. Proofs for all the lemmas and theorems are deferred to the Appendix. An R package implementing the proposed method is available at https://github.com/Shuang-Zhang/ASDAR/.

Section snippets

AFT regression with ℓ0-penalization

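Let $Y_{(1)}, \ldots, Y_{(n)}$ be the order statistics of the $Y_i$'s. Let $\delta_{(1)}, \ldots, \delta_{(n)}$ be the associated censoring indicators and let $x_{(1)}, \ldots, x_{(n)}$ be the associated covariates. In the weighted least squares method, the weights $w_{(i)}$ are the jumps in the Kaplan-Meier estimator based on $(Y_{(i)}, \delta_{(i)})$, $i = 1, \ldots, n$, which can be expressed as $w_{(1)} = \frac{\delta_{(1)}}{n}$, $w_{(i)} = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \left(\frac{n-j}{n-j+1}\right)^{\delta_{(j)}}$, $i = 2, \ldots, n$. Let $w_{ni} = n w_{(i)}$, $i = 1, \ldots, n$. The weighted least squares criterion is given by $L_1(\beta) = \frac{1}{2n} \sum_{i=1}^{n} w_{ni} \left(Y_{(i)} - x_{(i)}^\top \beta\right)^2$. In the high-dimensional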
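The Kaplan-Meier weights and the criterion $L_1(\beta)$ can be computed directly from their definitions. The sketch below is a plain-Python illustration. The function names `stute_weights` and `weighted_ls` are hypothetical, and ties among the $Y_i$ are broken by input order, which is an assumption the text does not address.

```python
def stute_weights(y, delta):
    """Kaplan-Meier jumps w_(i), listed in the order of the sorted responses."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])   # stable sort: ties keep input order
    weights = []
    running = 1.0  # product over j < i of ((n - j) / (n - j + 1)) ** delta_(j)
    for rank, idx in enumerate(order, start=1):
        weights.append(delta[idx] / (n - rank + 1) * running)
        running *= ((n - rank) / (n - rank + 1)) ** delta[idx]
    return order, weights

def weighted_ls(beta, x, y, delta):
    """L_1(beta) = (1 / 2n) * sum_i w_ni * (Y_(i) - x_(i)' beta)^2 with w_ni = n * w_(i)."""
    n = len(y)
    order, weights = stute_weights(y, delta)
    total = 0.0
    for idx, w_i in zip(order, weights):
        fitted = sum(xj * bj for xj, bj in zip(x[idx], beta))
        total += n * w_i * (y[idx] - fitted) ** 2
    return total / (2 * n)
```

With no censoring every weight equals $1/n$, so $w_{ni} = 1$ and the criterion reduces to ordinary least squares. A censored observation receives weight zero and passes its mass on to the larger observations.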

AFT-SDAR algorithm

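We first introduce some notation used throughout the paper. Let $\|\eta\|_q = \left(\sum_{i=1}^{p} |\eta_i|^q\right)^{1/q}$ be the usual $\ell_q$ ($q \in [1, \infty]$) norm of the vector $\eta = (\eta_1, \ldots, \eta_p)^\top \in \mathbb{R}^p$. Let $|A|$ denote the cardinality of the set $A$. Denote $\eta_A = (\eta_i, i \in A) \in \mathbb{R}^{|A|}$, and $\eta|_A \in \mathbb{R}^p$ with its $i$th element $(\eta|_A)_i = \eta_i 1(i \in A)$, where $1(\cdot)$ is the indicator function. Let $\|\eta\|_{T,\infty}$ and $|\eta|_{\min}$ be the $T$th largest element (in absolute value) and the minimum absolute value of $\eta$, respectively. Let $\|M\|_{\infty}$ denote the maximum value (in absolute value) of the matrix $M$. Let L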

Theoretical properties

In this section, we consider the finite-step error bound for the solution sequence computed based on Algorithm 1. We also study the probabilistic and nonasymptotic error bound for the solution sequence.

We first consider the deterministic error bounds for the solution sequence generated by AFT-SDAR. We choose a step size $\tau$ satisfying $0 < \tau < \frac{1}{TU}$ with $U \ge \|\bar{X}\|_2^2 / n$, and let $L$ be a constant satisfying $0 < L \le \frac{\sigma(\min, 2T)}{n\sqrt{2T}}$.

Theorem 1

Suppose $T \ge K$ and set $\beta^0 = 0$ in Algorithm 1. If (11) and (12) hold, for the

Adaptive AFT-SDAR

In practice, the sparsity levels of the true parameters η or β are unknown. Therefore, we regard $T$ as a tuning parameter. Let $T$ increase from 0 to $Q$, where $Q$ is a sufficiently large integer. In general, we set $Q = \alpha n / \log(n)$ as suggested by Fan and Lv (2008), where $\alpha$ is a positive constant. We then obtain a solution path $\{\hat{\eta}(T): T = 0, 1, \ldots, Q\}$, where $\hat{\eta}(0) = 0$. Finally, we use the cross-validation method or the HBIC criterion (Wang et al., 2013) to determine $\hat{T}$ as the estimated
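The scan over $T = 0, 1, \ldots, Q$ can be sketched as below. The scoring rule is an HBIC-type criterion of the form $\log(\mathrm{RSS}_T / n) + C_n \log(p)\, T / n$ with $C_n = \log\log n$. This form and this constant are assumptions for illustration and may differ from what the paper uses. The callback `fit_for_size`, which returns the weighted residual sum of squares of the fit with support size $T$, is hypothetical and must return a strictly positive value.

```python
import math

def select_support_size(fit_for_size, n, p, alpha=1.0, c_n=None):
    """Scan T = 0..Q with Q = floor(alpha * n / log n); return the T with the smallest score."""
    Q = int(alpha * n / math.log(n))
    if c_n is None:
        c_n = math.log(math.log(n))        # assumed default for the HBIC-type constant
    best_T, best_score = 0, math.inf
    for T in range(Q + 1):
        rss = fit_for_size(T)              # hypothetical callback: RSS of the size-T fit
        score = math.log(rss / n) + c_n * math.log(p) * T / n
        if score < best_score:
            best_T, best_score = T, score
    return best_T

# Toy residual curve: the fit stops improving once T reaches 3 (values are made up).
def toy_rss(T):
    return 10.0 + 50.0 * max(0, 3 - T)

chosen = select_support_size(toy_rss, n=100, p=1000)
```

In a real run `fit_for_size` would call the fixed-$T$ solver and warm-start from the previous solution, so the whole path costs little more than a single fit at the largest $T$.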

Numerical studies

In this section, we conduct simulation studies and real data analysis to illustrate the effectiveness of the proposed method. We compare the simulation results of AFT-SDAR/AFT-ASDAR with those of Lasso and MCP in terms of accuracy and efficiency. We also evaluate the performance under different study designs by considering the sample size n, the variable dimension p, the correlation measure ρ among covariates and the censoring rate c.r. Moreover, we examine the average number of iterations for

Conclusion

In this paper, we consider the ℓ0-penalized method for estimation and variable selection under the high-dimensional AFT models. We extend the SDAR algorithm for the linear regression to the AFT model with censored survival data based on a weighted least squares criterion. The proposed AFT-SDAR algorithm is a constructive approach for approximating ℓ0-penalized weighted least squares solutions. In theoretical analysis, we establish the nonasymptotic error bounds for the solution sequence

Acknowledgement

The authors thank two anonymous reviewers and the Associate Editor for many valuable comments and suggestions, which have helped to improve the quality of the article. Dr. Feng's work was partially supported by National Natural Science Foundation of China (No. 11971292). Dr. Jiao's work was partially supported by National Natural Science Foundation of China (No. 11871474).

References (32)

  • J. Hu et al. Adjusted regularized estimation in the accelerated failure time model with high dimensional covariates. J. Multivar. Anal. (2013)
  • P. Breheny et al. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. (2011)
  • J. Buckley et al. Linear regression with censored data. Biometrika (1979)
  • T. Cai et al. Regularized estimation for the accelerated failure time model. Biometrics (2009)
  • D.R. Cox. Regression models and life-tables. J. R. Stat. Soc., Ser. B, Stat. Methodol. (1972)
  • J. Fan et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. (2001)
  • J. Fan et al. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc., Ser. B, Stat. Methodol. (2008)
  • D. Hong et al. Weighted elastic net model for mass spectrometry imaging processing. Math. Model. Nat. Phenom. (2010)
  • J. Huang et al. Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Anal. (2010)
  • J. Huang et al. Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics (2006)
  • J. Huang et al. A constructive approach to ℓ0 penalized regression. J. Mach. Learn. Res. (2018)
  • B.A. Johnson. Variable selection in semiparametric linear regression with censored data. J. R. Stat. Soc., Ser. B, Stat. Methodol. (2008)
  • B.A. Johnson et al. Penalized estimating functions and variable selection in semiparametric regression models. J. Am. Stat. Assoc. (2008)
  • J.D. Kalbfleisch et al. The Statistical Analysis of Failure Time Data, vol. 360 (2011)
  • M.H.R. Khan et al. Variable selection for survival data with a class of adaptive elastic net techniques. Stat. Comput. (2016)
  • H. Koul et al. Regression analysis with randomly right-censored data. Ann. Stat. (1981)