Variable selection for generalized odds rate mixture cure models with interval-censored failure time data

https://doi.org/10.1016/j.csda.2020.107115

Abstract

Variable selection for failure time data with a cured fraction has been discussed by many authors, but most existing methods apply only to right-censored failure time data. In this paper, we consider variable selection when one faces interval-censored failure time data arising from a general class of generalized odds rate mixture cure models, and we propose a penalized variable selection method that maximizes a derived penalized likelihood function. In the method, the sieve approach is employed to approximate the unknown function, and it is implemented using a novel penalized expectation–maximization (EM) algorithm. In addition, the asymptotic properties of the proposed estimators of regression parameters, including the oracle property, are obtained. Furthermore, a simulation study is conducted to assess the finite sample performance of the proposed method, and the results indicate that it works well in practice. Finally, the approach is applied to a set of real data on childhood mortality taken from the Nigeria Demographic and Health Survey.

Introduction

Variable selection is needed in many fields, and a large literature on it has developed. Among other methods, a general approach that has recently garnered significant attention is the penalized approach, which involves a penalty function. It is especially useful when there are a large number of risk factors and one needs to identify the relevant or prognostic predictors among them. One of the most commonly used penalty functions is the least absolute shrinkage and selection operator (LASSO), proposed by Tibshirani (1996). It is a continuous shrinkage process that can shrink some coefficients exactly to zero while retaining some desired features. Following Tibshirani (1996), Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty and Zou (2006) developed the adaptive LASSO (ALASSO) penalty. The SCAD penalty is a nonconcave function on [0, ∞) and has singularities at the origin to yield sparse estimators. The ALASSO penalty is a weighted version of LASSO and can yield estimators with the oracle property. However, most of the existing methods are intended for complete data or data with relatively simple structures. In this paper, we consider a type of data with a much more complex structure, namely interval-censored failure time data with a cured fraction or subgroup.
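As a minimal illustration of the three penalties named above, the following Python sketch evaluates each one for a scalar or a vector of coefficients. The function names, the default a = 3.7 for SCAD (the value suggested by Fan and Li, 2001) and the weight exponent gamma are illustrative choices and are not taken from this paper.

```python
import numpy as np


def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta|."""
    return lam * np.abs(np.asarray(beta, dtype=float))


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), defined piecewise in |beta| (a > 2)."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                             # |beta| <= lam
    quadratic = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))  # lam < |beta| <= a*lam
    constant = lam**2 * (a + 1) / 2                              # |beta| > a*lam
    return np.where(b <= lam, linear, np.where(b <= a * lam, quadratic, constant))


def adaptive_lasso_penalty(beta, lam, beta_init, gamma=1.0):
    """Adaptive LASSO: LASSO with weights 1 / |beta_init|**gamma,
    where beta_init is an initial consistent estimate (assumed non-zero)."""
    weights = 1.0 / np.abs(np.asarray(beta_init, dtype=float)) ** gamma
    return lam * weights * np.abs(np.asarray(beta, dtype=float))
```

Unlike LASSO, the SCAD penalty is constant beyond a·lam, so large coefficients are not shrunk further; the adaptive LASSO achieves a similar effect by giving small weights to coefficients whose initial estimates are large.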

Interval-censored failure time data arise when the failure time of interest is known or observed only to belong to some interval instead of being observed exactly. Such data commonly occur in many areas, including clinical studies, epidemiological research, and sociological surveys. It is apparent that interval-censored data include right-censored data as a special case. An example of interval-censored data arose from the 2003 Nigeria Demographic and Health Survey concerning childhood mortality (Kneib, 2006). In the study, data on the health status of women of reproductive age and their children were collected. In particular, the death time of a child was observed exactly if the death occurred within the first two months after birth; after that, information on mortality was collected by interviewing the mothers of the children. Thus, only interval-censored data are available for the death time in general (Sun, 2006).
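To make the data structure concrete, the sketch below shows how an unobserved failure time becomes an interval (L, R] when subjects are examined only at a set of inspection times. It is a generic illustration, not the survey's actual data collection scheme; the function name and the convention R = ∞ for right-censored observations are assumptions.

```python
import numpy as np


def interval_censor(t, exam_times):
    """Return the interval (L, R] that brackets the failure time t.

    L is the last examination strictly before t (0 if none), and R is the
    first examination at or after t (infinity if none, i.e. right-censored).
    """
    exam = np.sort(np.asarray(exam_times, dtype=float))
    k = int(np.searchsorted(exam, t, side="left"))  # number of exams strictly before t
    left = float(exam[k - 1]) if k > 0 else 0.0
    right = float(exam[k]) if k < len(exam) else float("inf")
    return left, right
```

For example, with examinations at times 1, 2, 3 and 4, a failure at 2.5 is recorded as (2, 3], a failure at 0.5 as the left-censored interval (0, 1], and a failure at 5 as the right-censored interval (4, ∞).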

In failure time studies, there may sometimes exist a so-called cured subgroup, meaning that a portion of the study subjects are not susceptible to the failure event of interest. For example, for the childhood mortality data discussed above, about 88% of the observations are right-censored. To illustrate this, Fig. 1 presents the nonparametric maximum likelihood estimates of the underlying survival functions given by Turnbull’s self-consistency algorithm (Turnbull, 1976) for male and female subjects, respectively. The estimated curves become flat or level off towards the end, indicating that there might exist a potentially cured subgroup. In general, these subjects are considered to be cured or immune to the failure event of interest and are often referred to as long-term survivors or cured individuals. Standard survival methods or models are not suitable for such situations because they assume that all subjects will eventually experience the event of interest.

To deal with the failure time data with a cured subgroup, one common type of method is the mixture model approach, assuming that the underlying population is a mixture of cured and uncured subpopulations. In the method, the cure rate and latency survival function of the uncured subjects are commonly modeled separately. Among other models, one commonly used model for the situation is the proportional hazards mixture cure (PHMC) model and several methods have been proposed for inference about it based on interval-censored data. For example, Ma (2009) developed a penalized maximum likelihood method with the use of the weighted bootstrap procedure for variance estimation. Kim and Jhun (2008) considered general interval-censored data and proposed an expectation-maximization (EM) algorithm with the use of the multiple imputation approach for variance estimation. A more general class of models, which includes the aforementioned model as a special case, is the generalized odds rate mixture cure (GORMC) model. Zhou et al. (2018) discussed the fitting of the GORMC model to interval-censored data.

Several authors have discussed variable selection for interval-censored failure time data. For example, Wu and Cook (2015) considered the problem within the framework of the proportional hazards (PH) model with a piecewise constant baseline hazard function and developed an EM algorithm to compute the proposed estimators. Zhao et al. (2020) considered the same problem and proposed a broken adaptive ridge penalty function. In particular, unlike the former, the latter established the asymptotic properties of the proposed estimators. In addition, Li et al. (2020a) proposed a penalized variable selection tool for the proportional hazards model with interval-censored and possibly left-truncated data via penalized nonparametric maximum likelihood estimation with an adaptive LASSO penalty. Li et al. (2020b) discussed the situation wherein the failure time of interest follows a general class of semiparametric transformation models.

A few mixture model-based methods are available for variable selection when one faces failure time data with a cured subgroup. For example, Liu et al. (2012) and Masud et al. (2016) presented some semiparametric methods for the case of right-censored data, and Scolas et al. (2016) considered the case of interval-censored data arising from a parametric accelerated mixture cure model. In the following, we will investigate the variable selection problem when one faces interval-censored data under the two-component GORMC model and propose a semiparametric penalized approach. In the model, the first component is a logistic regression model, which is used to describe the cure rate, and the second component is the generalized odds rate (GOR) model, which will be employed to describe the survival probability for the uncured subjects. For the implementation of the proposed method, a novel penalized EM algorithm is developed by employing Gamma–Poisson data augmentation, and this approach is shown to be stable and faster than the direct maximization method.
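The two-component structure described above can be sketched in a few lines of Python. The sketch assumes a logistic model for the uncured probability π(z), a GOR latency survival function of the form {1 + r H(t) exp(β'x)}^(-1/r), and the population survival function 1 − π(z) + π(z) S_u(t|x). The function name and the placeholder baseline H(t) = t are hypothetical; in the paper H is an unknown function approximated by a sieve.

```python
import numpy as np


def mixture_cure_survival(t, x, z, eta, beta, r, baseline=lambda t: t):
    """Population survival S(t | x, z) under a two-component GORMC model.

    baseline is a placeholder for the unknown increasing function H
    (H(t) = t here is purely illustrative).
    """
    # Logistic component: probability of being uncured, P(u = 1 | z).
    pi = 1.0 / (1.0 + np.exp(-np.dot(z, eta)))
    # GOR component: survival function of the uncured subjects.
    a = baseline(t) * np.exp(np.dot(x, beta))
    if r == 0:
        s_uncured = np.exp(-a)                   # proportional hazards limit as r -> 0
    else:
        s_uncured = (1.0 + r * a) ** (-1.0 / r)  # r = 1 gives proportional odds
    # Cured subjects never fail, so they contribute the constant 1 - pi.
    return 1.0 - pi + pi * s_uncured
```

As t grows, the latency term vanishes and the curve levels off at the cure rate 1 − π(z), which is the plateau behaviour that motivates cure models.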

The remainder of the paper is organized as follows. In Section 2, after introducing some notation and the GORMC model along with some commonly used penalty functions, we will present the proposed penalized variable selection approach. Section 3 discusses the implementation of the proposed method and the development of a novel penalized EM algorithm. In Section 4, the asymptotic properties of the proposed estimators, including the consistency, sparsity and the oracle properties, are established. Section 5 presents some results obtained from a simulation study conducted to assess the finite sample performance of the proposed approach, which indicate that it works well in practical situations. Section 6 applies the proposed methodology to the children’s mortality data discussed above, and some discussion and concluding remarks are given in Section 7.

Section snippets

Variable selection and estimation

Consider a failure time study in which there might exist a cured subgroup. Let $T$ denote the failure time of interest and $u$ be a binary indicator variable, where $u=1$ if a patient is uncured. Suppose that there exist two vectors of covariates $x$ and $z$ that may affect $T$ and $u$, respectively. Let $S(t|x,z)$ denote the survival function of $T$ given $x$ and $z$, with the first component of $z$ being 1. The two-component mixture cure model is defined as $S(t|x,z) = 1 - \pi(z) + \pi(z) S_u(t|x)$, where $\pi(z) = P(u=1|z)$, the

Computation and penalized EM algorithm

For the development of the penalized EM algorithm, we will first discuss the augmentation of the observed data. For this, note that if the $u_i$’s were known, then conditional on $u = (u_1, \ldots, u_n)$, the likelihood function can be written as $L(\eta,\beta,H|O,u) = \prod_{i=1}^{n} \pi(z_i)^{u_i} \{1-\pi(z_i)\}^{1-u_i} \big[ \{1 - S_u(R_i|x_i)\}^{\delta_{L,i}} \times \{S_u(L_i|x_i) - S_u(R_i|x_i)\}^{\delta_{I,i}} S_u(L_i|x_i)^{\delta_{R,i}} \big]^{u_i}$. Furthermore, let $\phi$ be a random variable following the Gamma distribution $\Gamma(1/r, r)$ with $r>0$. Then the survival function of an uncured patient can be written as $S_u(t|x) = \{1 + rH(t)e^{\beta^T x}\}^{-1/r} = \int_0^\infty \exp\{-\phi H(t$
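The Gamma mixture representation of the GOR survival function can be checked numerically. The sketch below assumes that Γ(1/r, r) denotes a Gamma distribution with shape 1/r and scale r (mean 1, variance r), and it writes A for the quantity H(t) exp(β'x); both the function names and this parameterization are assumptions. Under them, the Laplace transform of the Gamma distribution gives E[exp(−φA)] = (1 + rA)^(−1/r).

```python
import numpy as np


def gor_survival(a, r):
    """Closed form {1 + r * a}^(-1/r), with a standing for H(t) * exp(beta'x)."""
    return (1.0 + r * a) ** (-1.0 / r)


def gamma_mixture_survival(a, r, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[exp(-phi * a)] with phi ~ Gamma(shape=1/r, scale=r)."""
    rng = np.random.default_rng(seed)
    phi = rng.gamma(shape=1.0 / r, scale=r, size=n_draws)
    return float(np.mean(np.exp(-phi * a)))
```

The two functions should agree up to Monte Carlo error for any a ≥ 0 and r > 0. Conditional on φ, the latency model has a proportional hazards form, which is what makes a data augmentation step tractable.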

Asymptotic properties

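In this section, we discuss the asymptotic properties of the proposed estimator. Denote the true value of $\eta$ by $\eta_0 = (\eta_{10}^T, \eta_{20}^T)^T$, where $\eta_{10}$ includes all non-zero components and $\eta_{20}$ the remaining zero components. Similarly, denote the true value of $\beta$ by $\beta_0 = (\beta_{10}^T, \beta_{20}^T)^T$. Let $\zeta = (\eta^T, \beta^T)^T$ and $\zeta_0 = (\zeta_{10}^T, \zeta_{20}^T)^T$, where $\zeta_{10} = (\eta_{10}^T, \beta_{10}^T)^T$ and $\zeta_{20} = (\eta_{20}^T, \beta_{20}^T)^T$.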

Given the tuning parameters $\xi_n$ and $\lambda_n$, define $\varphi_n$ as the $(p+q)$-dimensional vector of some functions of the tuning parameters as follows. By letting $\tilde{\zeta}_j = \eta$, define $\varphi_{nj} = \xi_n$

A simulation study

An extensive simulation study was conducted to assess the performance of the variable selection and estimation approach proposed in the previous sections. In the study, we generated the covariate vectors $z$ and $x$ from the multivariate normal distributions with mean zero, variance one and the correlation $\mathrm{corr}(z_j,z_k) = \mathrm{corr}(x_j,x_k) = \rho^{|j-k|}$ with $\rho=0.5$. For the number of covariates, we considered two situations with $p=10$ or $30$. For $p=10$, we set $\eta=(0.6,0.8,0.8,0,0,0,0,0,0,0,0)$ and $\beta=(0.8,0.8,0.8,0,0,0,0,0$
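The covariate generation step can be sketched as follows, assuming the correlation between the j-th and k-th covariates is ρ raised to the power |j − k| (an AR(1)-type structure) with unit variances. The function names, the sample size and the seed are illustrative and are not taken from the paper.

```python
import numpy as np


def ar1_correlation(p, rho):
    """p x p matrix with (j, k) entry rho**|j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def generate_covariates(n, p, rho=0.5, seed=0):
    """Draw n covariate vectors from N(0, Sigma) with Sigma = ar1_correlation(p, rho)."""
    rng = np.random.default_rng(seed)
    sigma = ar1_correlation(p, rho)
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)
```

Because the variances equal one, the covariance matrix coincides with the correlation matrix, so adjacent covariates have correlation 0.5 and the dependence decays geometrically with the distance between indices.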

An application

In this section, we apply the methodology proposed in the previous sections to the childhood mortality data arising from the 2003 Nigeria Demographic and Health Survey discussed above. Childhood mortality is an important indicator of the health and socioeconomic development of a country, especially the level of maternal and children’s health care. A better understanding of it can help the government to take preventive measures to reduce the mortality rate. As discussed above, Fig. 1 shows a

Discussion and concluding remarks

This paper discussed variable selection and estimation for interval-censored failure time data in the presence of a cured subgroup. The problem considered is challenging both computationally and theoretically due to the existence of a cured subgroup and the complex data structures. In particular, it is difficult to directly maximize the complicated likelihood function arising from the two-component mixture cure model. For the problem, we proposed a mixture model-based penalized likelihood

Acknowledgments

The authors wish to thank the Co-Editor, the Associate Editor and two reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper. Tao Hu’s work was partly supported by the National Natural Science Foundation of China (Grant Nos. 11971064 and 11671274), Beijing Talent Foundation Outstanding Young Individual Project, the Support Project of High-level Teachers in Beijing Municipal Universities in the Period of 13th Five-year Plan grant CIT & TCD

References (37)

  • Lam K.F. et al. A semiparametric cure model for interval-censored data. Stat. Med. (2013)
  • Lehmann E.L. et al. Theory of Point Estimation. (1998)
  • Li C. et al. Adaptive lasso for the Cox regression with interval censored and possibly left truncated data. Stat. Methods Med. Res. (2020)
  • Li S. et al. Penalized estimation of semiparametric transformation models with interval-censored data and application to Alzheimer’s disease. Stat. Methods Med. Res. (2020)
  • Liu X. et al. Variable selection in semiparametric cure models based on penalized likelihood, with application to breast cancer clinical trials. Stat. Med. (2012)
  • Liu H. et al. A semiparametric regression cure model for interval censored data. J. Amer. Statist. Assoc. (2009)
  • Lu M. et al. Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika (2007)
  • Ma S. Cure model with current status data. Statist. Sinica (2009)