Abstract
Feature screening is an important tool for reducing dimension and identifying informative variables in ultrahigh-dimensional data analysis. Its key idea is to select informative variables using correlations between the response and the covariates. Many feature screening methods have been developed, but they are challenged by complex features arising from the data collection process and from the nature of the data themselves. In particular, survival data often involve both an incomplete response caused by right-censoring and covariates subject to measurement error. Although many methods have been proposed for censored data, little work is available for the setting in which incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response because of correlations among covariates. In this paper, we address this problem and propose a model-free feature screening method for censored responses and error-prone covariates. We also develop an iterative procedure to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to a real dataset.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F (eds) 2nd international symposium on information theory. Akadémiai Kiadó, Budapest, pp 267–281
Buckley J, James I (1979) Linear regression with censored data. Biometrika 66:429–436
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann Stat 35:2313–2404
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models. CRC Press, New York
Chen L-P (2018) Semiparametric estimation for the accelerated failure time model with length-biased sampling and covariate measurement error. Stat 7:e209. https://doi.org/10.1002/sta4.209
Chen L-P (2019a) Pseudo likelihood estimation for the additive hazards model with data subject to left-truncation and right-censoring. Stat Its Interface 12:135–148
Chen L-P (2019b) Semiparametric estimation for cure survival model with left-truncated and right-censored data and covariate measurement error. Stat Probab Lett 154:108547. https://doi.org/10.1016/j.spl.2019.06.023
Chen L-P (2019c) Statistical analysis with measurement error or misclassification: strategy, method and application by Grace Y. Yi. Biometrics 75:1045–1046. https://doi.org/10.1111/biom.13130
Chen L-P (2020) Semiparametric estimation for the transformation model with length-biased data and covariate measurement error. J Stat Comput Simul 90:420–442. https://doi.org/10.1080/00949655.2019.1687700
Chen L-P, Yi GY (2020) Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Ann Inst Stat Math. https://doi.org/10.1007/s10463-020-00755-2 (To appear)
Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138
Chen X, Zhang Y, Chen X, Liu Y (2019) A simple model-free survival conditional feature screening. Stat Probab Lett 146:156–160
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc Ser B 70:849–911
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
Fan J, Feng Y, Wu Y (2010) Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collect 6:70–86
Hall P, Miller H (2009) Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat 18:533–550
Lawless JF (2003) Statistical models and methods for lifetime data. Wiley, New York
Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Miller RG (1981) Survival analysis. Wiley, New York
Rocke DM, Durbin B (2001) A model for measurement error for gene expression arrays. J Comput Biol 8:557–569
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347:1999–2009
Yan X, Tang N, Zhao X (2017) The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1
Zhong W, Zhu L (2015) An iterative approach to distance correlation-based sure independence screening. J Stat Comput Simul 85:2331–2345
Zhu L, Li L, Li R, Zhu L (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc 106:1464–1475
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67:301–320
Appendices
Appendix A Proof of Theorem 3.1
We first consider \(\text {dcov} \left( Y^*, X_k \right) \) and \(\text {dcov}^*\left( Y^*, X_k^*\right) \) for the kth component of X with \(k=1,\ldots ,p\). Note that the former formulation is based on the true covariates \(X_k\), while the latter formulation is based on the surrogate covariates \(X_k^*\).
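For orientation, the sample distance correlation of Székely et al. (2007), on which \(\text {dcov}\) is built, can be sketched as below. This is only the plain V-statistic version for complete, error-free scalar data; it does not include the censoring adjustment behind \(Y^*\) or the correction that defines \(\text {dcov}^*\), and the function names are illustrative rather than taken from the paper.

```python
import math

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]          # row means (equal to column means)
    grand = sum(row) / n                   # grand mean
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_covariance_sq(x, y):
    """Squared sample distance covariance (V-statistic form)."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2

def distance_correlation(x, y):
    """Sample distance correlation; returns 0 if either sample is constant."""
    denom = math.sqrt(distance_covariance_sq(x, x) * distance_covariance_sq(y, y))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(distance_covariance_sq(x, y), 0.0) / denom)

# Toy data: y is a deterministic but nonlinear function of x, so the Pearson
# correlation is zero by symmetry while the distance correlation is positive.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
dcorr_xy = distance_correlation(x, y)
dcorr_xx = distance_correlation(x, x)
```

The toy example illustrates why a distance-correlation-based utility suits model-free screening: it can pick up dependence that a linear correlation misses.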
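Since the error term \(\epsilon _k\), the kth component of \(\epsilon \), follows the normal distribution \(N(0,\sigma _{\epsilon ,kk})\), its characteristic function is given by
\[ \phi _{\epsilon _k}(t) = E\left\{ \exp (\mathrm {i} t \epsilon _k) \right\} = \exp \left( -\frac{1}{2} \sigma _{\epsilon ,kk} t^2 \right) . \tag{A.1} \]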
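Under the additive measurement error model \(X_k^* = X_k + \epsilon _k\), direct derivation gives
\[ \phi _{X_k^*}(t) = E\left\{ \exp (\mathrm {i} t X_k^*) \right\} = E\left\{ \exp (\mathrm {i} t X_k) \right\} E\left\{ \exp (\mathrm {i} t \epsilon _k) \right\} = \phi _{X_k}(t) \exp \left( -\frac{1}{2} \sigma _{\epsilon ,kk} t^2 \right) , \tag{A.2} \]
where the second equality is due to the independence of \(X_k\) and \(\epsilon _k\), and the last equality is due to (A.1).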
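Similarly, for the joint characteristic function of \((X_k^*, Y^*)\) we can derive
\[ \phi _{X_k^*, Y^*}(t,s) = E\left\{ \exp (\mathrm {i} t X_k^* + \mathrm {i} s Y^*) \right\} = E\left\{ \exp (\mathrm {i} t X_k + \mathrm {i} s Y^*) \right\} E\left\{ \exp (\mathrm {i} t \epsilon _k) \right\} = \phi _{X_k, Y^*}(t,s) \exp \left( -\frac{1}{2} \sigma _{\epsilon ,kk} t^2 \right) , \tag{A.3} \]
where the second equality is due to the independence of \(\epsilon _k\) and \((X_k, Y^*)\), and the last equality again comes from (A.1). As a result, substituting (A.2) and (A.3) into \(\text {dcov}^*\left( Y^*, X_k^*\right) \) gives the same expression as \(\text {dcov} \left( Y^*, X_k \right) \).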
The equivalence of \(\text {dcov}^*\left( X_k^*, X_k^*\right) \) and \(\text {dcov} \left( X_k, X_k \right) \) follows from similar derivations. Therefore, we conclude that \(\text {dcorr} \left( Y^*, X_k \right) \) and \(\text {dcorr}^*\left( Y^*, X_k^*\right) \) are equivalent in the sense that \(\text {dcorr} \left( Y^*, X_k \right) > 0\) if and only if \(\text {dcorr}^*\left( Y^*, X_k^*\right) > 0\). Consequently, the same active features are identified from \(X^*\) as from X. \(\square \)
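The role of the normal characteristic function in this proof can be checked numerically. The sketch below assumes the classical additive model \(X_k^* = X_k + \epsilon _k\) with a known error variance (called `sigma2` here, a hypothetical name) and an arbitrary, non-normal choice for the true covariate. It compares the empirical characteristic function of the surrogate, before and after multiplying by \(\exp (\sigma _{\epsilon ,kk} t^2/2)\), with that of the true covariate.

```python
import cmath
import math
import random

def ecf(sample, t):
    """Empirical characteristic function of a 1-D sample at the point t."""
    return sum(cmath.exp(1j * t * v) for v in sample) / len(sample)

rng = random.Random(0)
n, sigma2 = 50_000, 0.25                      # sigma2: assumed known error variance

# Hypothetical true covariate (exponential) and its error-prone surrogate.
x = [rng.expovariate(1.0) for _ in range(n)]
x_star = [v + rng.gauss(0.0, math.sqrt(sigma2)) for v in x]

grid = [-2.0, -1.0, 0.5, 1.0, 2.0]

# Without correction, the surrogate's characteristic function is damped.
err_naive = max(abs(ecf(x_star, t) - ecf(x, t)) for t in grid)

# Multiplying by exp(sigma2 * t^2 / 2) undoes the normal damping factor.
err_corrected = max(
    abs(ecf(x_star, t) * math.exp(0.5 * sigma2 * t * t) - ecf(x, t))
    for t in grid
)
```

Up to Monte Carlo noise, `err_corrected` should be close to zero while `err_naive` is not, which is the mechanism that lets the corrected distance covariance based on \(X_k^*\) match the one based on \(X_k\).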
Appendix B Overview of error correction in the Cox model
In this appendix, we outline a strategy for correcting covariate measurement error when fitting the Cox model. The idea comes from Chen and Yi (2020), and the method can be used to fit the Cox model with covariate measurement error in Sect. 4.3.
For \(i=1,\ldots ,n\), let \(Y_i\) and \(\delta _i\) denote the survival time and the censoring indicator defined in Sect. 2.1. Let \(X_i\) denote the q-dimensional vector of unobserved covariates retained after the feature screening procedure, with \(q<n\), and let \(X_i^*\) be the surrogate version of \(X_i\). Based on the Cox model (17) and the unobserved covariates \(X_i\), the likelihood function is given by (e.g., Lawless 2003)
\[ L(\gamma ) = \prod _{i=1}^{n} \left\{ \lambda _0(Y_i) \exp (X_i^\top \gamma ) \right\} ^{\delta _i} \exp \left\{ -\Lambda _0(Y_i) \exp (X_i^\top \gamma ) \right\} , \tag{B.1} \]
where \(\Lambda _0(t) = \int _0^t \lambda _0(u)du\) is the cumulative baseline hazard function.
Let \(\ell (\gamma ) = \log L(\gamma )\). Since \(\ell (\gamma )\) involves the \(X_i\), whose measurements are unavailable, we replace \(\ell (\gamma )\) by a new function, say \(\ell ^*(\gamma )\), of the observed measurements and the model parameters whose conditional expectation equals \(\ell (\gamma )\):
\[ E\left\{ \ell ^*(\gamma ) \mid \mathbb {X}, \mathbb {C}, \mathbb {T} \right\} = \ell (\gamma ), \tag{B.2} \]
where the expectation is taken with respect to the conditional distribution of \(\mathbb {X}^*\) given \(\left\{ \mathbb {X}, \mathbb {C}, \mathbb {T} \right\} \), with \(\mathbb {X}^*= \{X_1^*,\ldots ,X_n^*\}\), \(\mathbb {X} = \{X_1,\ldots ,X_n\}\), \(\mathbb {C} = \{C_1,\ldots ,C_n\}\), and \(\mathbb {T} = \{T_1,\ldots ,T_n\}\). This strategy yields an unbiased estimating function and is sometimes called the “corrected” likelihood method or the insertion correction approach (e.g., Carroll et al. 2006, Section 7.4).
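Noticing that the \(X_i\) appear in \(\ell (\gamma )\) in linear and exponential forms, we define
\[ \ell ^*(\gamma ) = \sum _{i=1}^{n} \left[ \delta _i \left\{ \log \lambda _0(Y_i) + X_i^{*\top } \gamma \right\} - \Lambda _0(Y_i) \left\{ m(\gamma ) \right\} ^{-1} \exp (X_i^{*\top } \gamma ) \right] , \tag{B.3} \]
where \(m(z) = \exp (\frac{1}{2} z^\top \Sigma _{\epsilon } z)\) and \(\Sigma _{\epsilon }\) is defined in Sect. 2.2. Because \(E(X_i^* \mid X_i) = X_i\) and \(E\{ \exp (X_i^{*\top } \gamma ) \mid X_i \} = m(\gamma ) \exp (X_i^\top \gamma )\) for normally distributed error, \(\ell ^*(\gamma )\) satisfies (B.2).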
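The exponential part of the correction can be illustrated with a small Monte Carlo check. The sketch below treats a single covariate (so \(\Sigma _{\epsilon }\) reduces to a scalar variance, called `sigma2` here), assumes additive normal error, and uses arbitrary illustrative values for the covariate and the coefficient. It verifies that dividing \(\exp (X^*\gamma )\) by \(m(\gamma )\) removes the bias caused by plugging in the surrogate.

```python
import math
import random

def m(gamma, sigma2):
    """Scalar version of m(z) = exp(z' Sigma z / 2)."""
    return math.exp(0.5 * sigma2 * gamma ** 2)

rng = random.Random(0)
x, gamma, sigma2 = 0.7, 1.2, 0.25          # illustrative values only

# Repeated surrogate measurements of the same true covariate value x.
draws = [x + rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(200_000)]

# Plugging the surrogate into exp(. * gamma) overshoots on average ...
naive = sum(math.exp(gamma * xs) for xs in draws) / len(draws)

# ... and dividing by m(gamma) restores the target exp(x * gamma).
corrected = naive / m(gamma, sigma2)
target = math.exp(gamma * x)
```

Here `corrected` should agree with `target` up to simulation error, whereas `naive` is inflated by the factor \(m(\gamma )\).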
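To use (B.3) to derive an estimator of \(\gamma \), we need to handle the baseline hazard function \(\lambda _0 \left( \cdot \right) \) and its cumulative function \(\Lambda _0 \left( \cdot \right) \). First, we discretize \(\Lambda _0 \left( \cdot \right) \) so that \(\lambda _0 (t)\) is nonzero only at \(t = Y_i\) for \(i = 1,\ldots ,n\), and \(\lambda _0 (t) =0\) otherwise. Let \(\lambda _i \) denote \(\lambda _0 (Y_i)\) for \(i = 1,\ldots ,n\). Then \(\Lambda _0 (t)\) is taken as \(\sum \nolimits _{i=1}^{n} \mathbb {I}(Y_i \leqslant t) \lambda _i\). Next, given \(\gamma \), we solve \(\frac{\partial \ell ^*(\gamma )}{\partial \lambda _i} = 0\) for \(i = 1,\ldots ,n\), which leads to an estimator of \(\lambda _i\), given by
\[ \widehat{\lambda }_i(\gamma ) = \frac{\delta _i \, m(\gamma )}{\sum \nolimits _{j=1}^{n} \mathbb {I}(Y_j \geqslant Y_i) \exp (X_j^{*\top } \gamma )}, \tag{B.4} \]
and the corresponding estimate of the cumulative baseline hazard function:
\[ \widehat{\Lambda }_0(t; \gamma ) = \sum _{i=1}^{n} \mathbb {I}(Y_i \leqslant t) \, \widehat{\lambda }_i(\gamma ). \tag{B.5} \]
Finally, plugging (B.4) and (B.5) into (B.3) gives the function
\[ \widehat{\ell }^*(\gamma ) = \sum _{i=1}^{n} \delta _i \left[ X_i^{*\top } \gamma + \frac{1}{2} \gamma ^\top \Sigma _{\epsilon } \gamma - \log \left\{ \sum _{j=1}^{n} \mathbb {I}(Y_j \geqslant Y_i) \exp (X_j^{*\top } \gamma ) \right\} - 1 \right] . \]
An estimator of \(\gamma \) is then obtained by maximizing \(\widehat{\ell }^*(\gamma )\):
\[ \widehat{\gamma } = \underset{\gamma }{\mathrm {argmax}} \; \widehat{\ell }^*(\gamma ). \]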
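The profiling steps above can be sketched numerically for a single covariate. The code below is a minimal illustration, not the paper's implementation. It assumes additive normal error with a known variance `sigma2`, no tied event times, simulated data with a unit baseline hazard, and a bounded golden-section search in place of a general optimizer. The objective is the profiled corrected log-likelihood, which in the scalar case is \(\sum _i \delta _i [X_i^*\gamma + \frac{1}{2}\sigma ^2\gamma ^2 - \log \sum _{j: Y_j \geqslant Y_i} \exp (X_j^*\gamma )]\) up to a constant. All function and variable names are hypothetical.

```python
import math
import random

def corrected_log_partial_likelihood(gamma, times, events, x_star, sigma2):
    """Profiled corrected log-likelihood for one error-prone covariate.

    With sigma2 = 0 this is the usual (naive) Cox log partial likelihood.
    """
    # Visit subjects from the largest observed time to the smallest so the
    # running sum always covers the risk set {j : Y_j >= Y_i}.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    risk_sum, value = 0.0, 0.0
    for i in order:
        risk_sum += math.exp(gamma * x_star[i])
        if events[i]:
            value += (gamma * x_star[i] + 0.5 * sigma2 * gamma ** 2
                      - math.log(risk_sum))
    return value

def maximize(f, lo=-2.0, hi=2.0, tol=1e-6):
    """Golden-section search; assumes f is unimodal on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            a = c          # maximum lies in [c, b]
        else:
            b = d          # maximum lies in [a, d]
    return 0.5 * (a + b)

# Simulated data: Cox model with unit baseline hazard and true gamma = 1.
rng = random.Random(1)
n, gamma_true, sigma2 = 400, 1.0, 0.25
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_star = [v + rng.gauss(0.0, math.sqrt(sigma2)) for v in x]
t = [rng.expovariate(math.exp(gamma_true * v)) for v in x]
c = [rng.expovariate(0.3) for _ in range(n)]
times = [min(a, b) for a, b in zip(t, c)]
events = [a <= b for a, b in zip(t, c)]

gamma_naive = maximize(
    lambda g: corrected_log_partial_likelihood(g, times, events, x_star, 0.0))
gamma_corrected = maximize(
    lambda g: corrected_log_partial_likelihood(g, times, events, x_star, sigma2))
```

In this setup the naive fit, which ignores the measurement error, is expected to be attenuated toward zero, and the corrected fit moves the estimate away from zero. The search interval and the scalar covariate are simplifications; with q covariates the same objective would be maximized over a vector \(\gamma \) with \(\frac{1}{2}\gamma ^\top \Sigma _{\epsilon }\gamma \) in place of \(\frac{1}{2}\sigma ^2\gamma ^2\).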
Cite this article
Chen, LP. Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error. Comput Stat 36, 857–884 (2021). https://doi.org/10.1007/s00180-020-01039-2
DOI: https://doi.org/10.1007/s00180-020-01039-2