Variable selection for generalized odds rate mixture cure models with interval-censored failure time data
Introduction
Variable selection arises in many fields, and a large literature on it has developed. One general approach that has recently attracted considerable attention is the penalized approach, which involves a penalty function. It is especially useful when there are a large number of risk factors and one needs to identify the relevant or prognostic predictors among them. One of the most commonly used penalty functions is the least absolute shrinkage and selection operator (LASSO), proposed by Tibshirani (1996). It is a continuous shrinkage procedure that can shrink some coefficients exactly to zero while retaining the desired features. Following Tibshirani (1996), Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty and Zou (2006) developed the adaptive LASSO (ALASSO) penalty. The SCAD penalty is a nonconcave function on $(0, \infty)$ and has a singularity at the origin, which yields sparse estimators. The ALASSO penalty is a weighted version of LASSO and can yield estimators with the oracle property. However, most of the existing methods are intended for complete data or data with relatively simple structures. In this paper, we consider data with a much more complex structure, namely interval-censored failure time data with a cured fraction or subgroup.
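To make the three penalties concrete, the following is a minimal sketch of their standard textbook forms. The function names, the default SCAD constant a = 3.7 (the value suggested by Fan and Li, 2001), and the use of an initial estimate `beta_init` to build the ALASSO weights are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np


def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta|, applied elementwise."""
    return lam * np.abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied elementwise.

    Linear near the origin, quadratic in the middle, constant for
    large |beta|, so large coefficients are not over-shrunk.
    """
    b = np.abs(beta)
    linear = lam * b
    quadratic = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    constant = lam ** 2 * (a + 1) / 2
    return np.where(b <= lam, linear, np.where(b <= a * lam, quadratic, constant))


def alasso_penalty(beta, lam, beta_init, gamma=1.0):
    """Adaptive LASSO penalty: lam * w * |beta| with w = 1 / |beta_init|^gamma.

    beta_init is a hypothetical initial (e.g. unpenalized) estimate and is
    assumed to have no exactly-zero entries.
    """
    weights = 1.0 / np.abs(beta_init) ** gamma
    return lam * weights * np.abs(beta)


if __name__ == "__main__":
    grid = np.array([0.0, 0.5, 2.0, 10.0])
    print("LASSO :", lasso_penalty(grid, lam=1.0))
    print("SCAD  :", scad_penalty(grid, lam=1.0))
    print("ALASSO:", alasso_penalty(grid, lam=1.0, beta_init=np.array([0.1, 0.5, 2.0, 10.0])))
```

For large coefficients the LASSO penalty keeps growing while the SCAD penalty is flat, which is why SCAD reduces the bias of large estimates.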
Interval-censored failure time data arise when the failure time of interest is not observed exactly but is known only to lie within some interval. Such data commonly occur in many areas, including clinical studies, epidemiological research, and sociological surveys. Interval-censored data include right-censored data as a special case. An example of interval-censored data arose from the 2003 Nigeria Demographic and Health Survey concerning childhood mortality (Kneib, 2006). In the study, information on the health status of women of reproductive age and their children was collected. In particular, the death time of a child was observed exactly if the death occurred within the first two months after birth; after that, the information on mortality was collected by interviewing the mothers of the children. Thus, only interval-censored data are available for the death time in general (Sun, 2006).
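As an illustration of this data structure, the sketch below stores each observation as an interval (left, right] and shows how periodic examinations turn an unobserved failure time into such an interval. The class name `IntervalObs`, the helper `observe`, and the visit schedule are hypothetical and are not taken from the survey data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalObs:
    """One observation: the failure time is known to lie in (left, right]."""
    left: float
    right: float  # math.inf means the event was not seen by the last visit

    def kind(self) -> str:
        if self.left == self.right:
            return "exact"
        if math.isinf(self.right):
            return "right-censored"
        if self.left == 0:
            return "left-censored"
        return "interval-censored"


def observe(true_time: float, visit_times) -> IntervalObs:
    """Bracket an unobserved failure time between consecutive visit times."""
    left = 0.0
    for visit in sorted(visit_times):
        if true_time <= visit:
            return IntervalObs(left, visit)
        left = visit
    # The event had not occurred by the last visit: right-censored.
    return IntervalObs(left, math.inf)


if __name__ == "__main__":
    visits = [1.0, 2.0, 3.0]
    for t in (0.5, 2.5, 5.0):
        obs = observe(t, visits)
        print(t, obs, obs.kind())
```

Right-censoring appears here simply as an interval whose right endpoint is infinite, which is the sense in which interval-censored data include right-censored data as a special case.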
In failure time studies, there may sometimes exist a so-called cured subgroup, meaning that a portion of the study subjects are not susceptible to the failure event of interest. For example, for the childhood mortality data discussed above, about 88% of the observations are right-censored. To see the meaning of this, Fig. 1 presents the nonparametric maximum likelihood estimates of the underlying survival functions given by Turnbull’s self-consistency algorithm (Turnbull, 1976) for male and female subjects, respectively. The estimated curves level off towards the end, indicating that there might exist a cured subgroup. In general, such subjects are considered to be cured of or immune to the failure event of interest and are often referred to as long-term survivors or cured individuals. Standard survival methods or models are not suitable for such situations because they assume that all subjects will eventually experience the event of interest.
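The self-consistency idea behind such estimates can be sketched as follows. This is a simplified illustration rather than Turnbull's full algorithm: it places candidate probability mass only at the distinct finite right endpoints plus one point at infinity, instead of constructing the innermost intervals, and it assumes every observation satisfies left < right (no exactly observed times). The function names and the toy data are hypothetical.

```python
import numpy as np


def turnbull_npmle(left, right, tol=1e-8, max_iter=5000):
    """Self-consistency (EM) iteration for interval-censored data (left, right]."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    if np.any(left >= right):
        raise ValueError("each observation must satisfy left < right")

    # Candidate mass points: finite right endpoints, plus infinity for
    # mass that is never observed to fail.
    support = np.append(np.unique(right[np.isfinite(right)]), np.inf)

    # alpha[i, j] = 1 if support[j] lies in (left[i], right[i]].
    alpha = ((support[None, :] > left[:, None])
             & (support[None, :] <= right[:, None])).astype(float)

    p = np.full(support.size, 1.0 / support.size)
    for _ in range(max_iter):
        denom = alpha @ p  # probability of each observed interval
        p_new = (alpha * p / denom[:, None]).mean(axis=0)
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return support, p


def survival_at(t, support, p):
    """Estimated S(t): total mass placed strictly after t."""
    return float(p[support > t].sum())


if __name__ == "__main__":
    # Toy data: two events in (0, 1], two subjects still event-free after time 1.
    support, p = turnbull_npmle([0, 0, 1, 1], [1, 1, np.inf, np.inf])
    print(support, p, survival_at(1.0, support, p))
```

The mass assigned to the point at infinity is what produces a plateau in the estimated survival curve; a large plateau, as in Fig. 1, is consistent with either heavy right-censoring or a cured subgroup.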
To deal with the failure time data with a cured subgroup, one common type of method is the mixture model approach, assuming that the underlying population is a mixture of cured and uncured subpopulations. In the method, the cure rate and latency survival function of the uncured subjects are commonly modeled separately. Among other models, one commonly used model for the situation is the proportional hazards mixture cure (PHMC) model and several methods have been proposed for inference about it based on interval-censored data. For example, Ma (2009) developed a penalized maximum likelihood method with the use of the weighted bootstrap procedure for variance estimation. Kim and Jhun (2008) considered general interval-censored data and proposed an expectation-maximization (EM) algorithm with the use of the multiple imputation approach for variance estimation. A more general class of models, which includes the aforementioned model as a special case, is the generalized odds rate mixture cure (GORMC) model. Zhou et al. (2018) discussed the fitting of the GORMC model to interval-censored data.
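The structure of the two-component model can be sketched numerically. The code below uses the commonly cited form of the generalized odds rate survival function, S_u(t | x) = {1 + r Λ(t) exp(x'β)}^(-1/r) for r > 0 and exp{-Λ(t) exp(x'β)} for r = 0, combined with a logistic model for the probability of being uncured. The exact parameterization and notation of this paper are not visible in this excerpt, so the function and argument names (`cum_hazard`, `x_beta`, `z_alpha`, `r`) are assumptions made for illustration.

```python
import numpy as np


def gor_survival(cum_hazard, x_beta, r):
    """Survival function of an uncured subject under the GOR model.

    r = 0 gives the proportional hazards model and r = 1 gives the
    proportional odds model.
    """
    scaled = cum_hazard * np.exp(x_beta)
    if r == 0:
        return np.exp(-scaled)
    return (1.0 + r * scaled) ** (-1.0 / r)


def mixture_cure_survival(cum_hazard, x_beta, z_alpha, r):
    """Population survival: cured fraction plus uncured fraction times S_u."""
    p_uncured = 1.0 / (1.0 + np.exp(-z_alpha))  # logistic model
    return 1.0 - p_uncured + p_uncured * gor_survival(cum_hazard, x_beta, r)


if __name__ == "__main__":
    cum_hazard = np.array([0.0, 0.5, 2.0, 50.0, 1e6])
    # With z_alpha = 0 half of the population is uncured, so the
    # population survival curve levels off at 0.5 instead of dropping to 0.
    print(mixture_cure_survival(cum_hazard, x_beta=0.3, z_alpha=0.0, r=1.0))
```

As the cumulative baseline hazard grows, the population survival function tends to the cure probability rather than to zero, which matches the plateau seen in the estimated curves for the childhood mortality data.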
Several authors have discussed variable selection for interval-censored failure time data. For example, Wu and Cook (2015) considered the problem within the framework of the proportional hazards (PH) model with a piecewise constant baseline hazard function and developed an EM algorithm to determine the proposed estimators. Zhao et al. (2020) considered the same problem and proposed a broken adaptive ridge penalty function. In particular, unlike the former, the latter established the asymptotic properties of the proposed estimators. In addition, Li et al. (2020a) proposed a variable selection procedure for the proportional hazards model with interval-censored and possibly left-truncated data, based on penalized nonparametric maximum likelihood estimation with an adaptive LASSO penalty. Li et al. (2020b) discussed the situation wherein the failure time of interest follows a general class of semiparametric transformation models.
A few mixture model-based methods are available for variable selection when one faces failure time data with a cured subgroup. For example, Liu et al. (2012) and Masud et al. (2016) presented some semiparametric methods for the case of right-censored data, and Scolas et al. (2016) considered the case of interval-censored data arising from a parametric accelerated mixture cure model. In the following, we will investigate the variable selection problem when one faces interval-censored data under the two-component GORMC model and propose a semiparametric penalized approach. In the model, the first component is a logistic regression model, which is used to describe the cure rate, and the second component is the generalized odds rate (GOR) model, which will be employed to describe the survival probability for the uncured subjects. For the implementation of the proposed method, a novel penalized EM algorithm is developed by employing Gamma–Poisson data augmentation, and this approach is shown to be stable and faster than the direct maximization method.
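The reason a Gamma–Poisson augmentation fits the GOR model can be checked numerically. If ξ follows a Gamma distribution with mean 1 and variance r, and N given ξ is Poisson with mean ξx, then P(N = 0) = E{exp(-ξx)} = (1 + rx)^(-1/r), which is the GOR survival form with x standing for Λ(t) exp(x'β). The Monte Carlo sketch below verifies this identity; it is an illustration of the general idea under that assumed parameterization, not the paper's actual augmentation scheme, and the function names are hypothetical.

```python
import numpy as np


def gor_survival_closed_form(x, r):
    """Closed-form GOR survival (1 + r * x)^(-1/r) for r > 0."""
    return (1.0 + r * x) ** (-1.0 / r)


def gamma_poisson_survival(x, r, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(N = 0) under the Gamma-Poisson hierarchy.

    xi ~ Gamma(shape=1/r, scale=r), so E(xi) = 1 and Var(xi) = r;
    N | xi ~ Poisson(xi * x).
    """
    rng = np.random.default_rng(seed)
    xi = rng.gamma(shape=1.0 / r, scale=r, size=n_draws)
    counts = rng.poisson(xi * x)
    return float(np.mean(counts == 0))


if __name__ == "__main__":
    for x, r in [(0.8, 2.0), (1.5, 0.5)]:
        print(x, r, gor_survival_closed_form(x, r), gamma_poisson_survival(x, r))
```

Conditional on ξ the model is of proportional hazards type, so treating ξ and the Poisson counts as missing data is what typically makes an EM algorithm tractable in this setting.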
The remainder of the paper is organized as follows. In Section 2, after introducing some notation and the GORMC model along with some commonly used penalty functions, we will present the proposed penalized variable selection approach. Section 3 discusses the implementation of the proposed method and the development of a novel penalized EM algorithm. In Section 4, the asymptotic properties of the proposed estimators, including the consistency, sparsity and the oracle properties, are established. Section 5 presents some results obtained from a simulation study conducted to assess the finite sample performance of the proposed approach, which indicate that it works well in practical situations. Section 6 applies the proposed methodology to the children’s mortality data discussed above, and some discussion and concluding remarks are given in Section 7.
Variable selection and estimation
Consider a failure time study in which there might exist a cured subgroup. Let $T$ denote the failure time of interest and $U$ be a binary indicator variable, where $U = 1$ if a patient is uncured. Suppose that there exist two vectors of covariates $X$ and $Z$ that may affect $T$ and $U$, respectively. Let $S(t \mid X, Z)$ denote the survival function of $T$ given $X$ and $Z$, with the first component of $Z$ being 1. The two-component mixture cure model is defined as $S(t \mid X, Z) = 1 - p(Z) + p(Z) S_u(t \mid X)$, where , the
Computation and penalized EM algorithm
For the development of the penalized EM algorithm, we will first discuss the augmentation of the observed data. For this, note that if ’s were known, then conditional on , the likelihood function can be written as Furthermore, let be a random variable following the Gamma distribution with . Then the survival function of an uncured patient can be written as
Asymptotic properties
In this section, we discuss the asymptotic properties of the proposed estimator. Denote the true value of by , where includes all non-zero components and other zero components. Similarly, denote the true value of by . Let and , where and .
Given the tuning parameters and , define as the dimension vector of some functions of tuning parameters as follows. By letting , define
A simulation study
An extensive simulation study was conducted to assess the performance of the variable selection and estimation approach proposed in the previous sections. In the study, we generated the covariate vectors and from the multivariate normal distributions with mean zero, variance one and the correlation with . For the number of covariates, we considered two situations with or . For , we set and
An application
In this section, we apply the methodology proposed in the previous sections to the childhood mortality data arising from the 2003 Nigeria Demographic and Health Survey discussed above. Childhood mortality is an important indicator of the health and socioeconomic development of a country, especially the level of maternal and children’s health care. A better understanding of it can help the government to take preventive measures to reduce the mortality rate. As discussed above, Fig. 1 shows a
Discussion and concluding remarks
This paper discussed variable selection and estimation for interval-censored failure time data in the presence of a cured subgroup. The problem considered is challenging both computationally and theoretically due to the existence of a cured subgroup and the complex data structures. In particular, it is difficult to directly maximize the complicated likelihood function arising from the two-component mixture cure model. For the problem, we proposed a mixture model-based penalized likelihood
Acknowledgments
The authors wish to thank the Co-Editor, the Associate Editor and two reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper. Tao Hu’s work was partly supported by the National Natural Science Foundation of China (Grant Nos. 11971064 and 11671274), Beijing Talent Foundation Outstanding Young Individual Project, the Support Project of High-level Teachers in Beijing Municipal Universities in the Period of 13th Five-year Plan grant CIT & TCD
References (37)
- et al. Efficient estimation for semiparametric cure models with interval-censored data. J. Multivariate Anal. (2013)
- Mixed model-based inference in geoadditive hazard regression for interval-censored survival times. Comput. Statist. Data Anal. (2006)
- et al. Regression analysis of bivariate current status data under the gamma-frailty proportional hazards model using the EM algorithm. Comput. Statist. Data Anal. (2015)
- et al. Parametric spatial cure rate models for interval-censored time-to-relapse data. Biometrics (2004)
- et al. Variable selection for multivariate failure time data. Biometrika (2015)
- et al. A new Bayesian model for survival data with a surviving fraction. J. Amer. Statist. Assoc. (1999)
- et al. Semiparametric transformation models for interval-censored data in the presence of a cure fraction. Biom. J. (2019)
- et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
- et al. An improved variable selection procedure for adaptive lasso in high-dimensional survival analysis. Lifetime Data Anal. (2018)
- et al. Cure rate model with interval censored data. Stat. Med. (2008)
- A semiparametric cure model for interval-censored data. Stat. Med.
- Theory of Point Estimation
- Adaptive lasso for the Cox regression with interval censored and possibly left truncated data. Stat. Methods Med. Res.
- Penalized estimation of semiparametric transformation models with interval-censored data and application to Alzheimer’s disease. Stat. Methods Med. Res.
- Variable selection in semiparametric cure models based on penalized likelihood, with application to breast cancer clinical trials. Stat. Med.
- A semiparametric regression cure model for interval censored data. J. Amer. Statist. Assoc.
- Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika
- Cure model with current status data. Statist. Sinica