Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error

Chen, Li-Pang

doi:10.1007/s00180-020-01039-2

Original paper
Published: 12 October 2020

Volume 36, pages 857–884, (2021)

Computational Statistics

Li-Pang Chen ORCID: orcid.org/0000-0001-5440-5036¹

Abstract

Feature screening is an important method for reducing dimension and capturing informative variables in ultrahigh-dimensional data analysis. Its key idea is to select informative variables using correlations between the response and the covariates. Many methods have been developed for feature screening. These methods, however, are challenged by complex features of the data collection process as well as by the nature of the data themselves. In survival analysis, an incomplete response caused by right-censoring is often accompanied by covariate measurement error. Even though many methods have been proposed for censored data, little work is available for the case where incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response variable because of correlations among covariates. In this paper, we explore this important problem and propose a model-free feature screening method for a censored response and error-prone covariates. We also develop an iterative method to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to a real dataset.

References

Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F (eds) 2nd international symposium on information theory. Akadémiai Kiadó, Budapest, pp 267–281
Buckley J, James I (1979) Linear regression with censored data. Biometrika 66:429–436
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Ann Stat 35:2313–2404
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models. CRC Press, New York
Chen L-P (2018) Semiparametric estimation for the accelerated failure time model with length-biased sampling and covariate measurement error. Stat 7:e209. https://doi.org/10.1002/sta4.209
Chen L-P (2019a) Pseudo likelihood estimation for the additive hazards model with data subject to left-truncation and right-censoring. Stat Its Interface 12:135–148
Chen L-P (2019b) Semiparametric estimation for cure survival model with left-truncated and right-censored data and covariate measurement error. Stat Probab Lett 154:108547. https://doi.org/10.1016/j.spl.2019.06.023
Chen L-P (2019c) Statistical analysis with measurement error or misclassification: strategy, method and application by Grace Y. Yi. Biometrics 75:1045–1046. https://doi.org/10.1111/biom.13130
Chen L-P (2020) Semiparametric estimation for the transformation model with length-biased data and covariate measurement error. J Stat Comput Simul 90:420–442. https://doi.org/10.1080/00949655.2019.1687700
Chen L-P, Yi GY (2020) Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Ann Inst Stat Math. https://doi.org/10.1007/s10463-020-00755-2 (To appear)
Chen X, Chen X, Wang H (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 119:118–138
Chen X, Zhang Y, Chen X, Liu Y (2019) A simple model-free survival conditional feature screening. Stat Probab Lett 146:156–160
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:409–499
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc Ser B 70:849–911
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38:3567–3604
Fan J, Feng Y, Wu Y (2010) Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collect 6:70–86
Hall P, Miller H (2009) Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat 18:533–550
Lawless JF (2003) Statistical models and methods for lifetime data. Wiley, New York
Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Stat Assoc 107:1129–1139
Miller RG (1981) Survival analysis. Wiley, New York
Rocke DM, Durbin B (2001) A model for measurement error for gene expression arrays. J Comput Biol 8:557–569
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Song R, Lu W, Ma S, Jeng X (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101:799–814
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35:2769–2794
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347:1999–2009
Yan X, Tang N, Zhao X (2017) The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1
Zhong W, Zhu L (2015) An iterative approach to distance correlation-based sure independence screening. J Stat Comput Simul 85:2331–2345
Zhu L, Li L, Li R, Zhu L (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc 106:1464–1475
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67:301–320

Author information

Authors and Affiliations

Department of Statistical and Actuarial Sciences, University of Western Ontario, 1151 Richmond St, London, ON, N6A 3K7, Canada
Li-Pang Chen

Corresponding author

Correspondence to Li-Pang Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Proof of Theorem 3.1

We first consider $\text {dcov} \left( Y^*, X_k \right) $ and $\text {dcov}^*\left( Y^*, X_k^*\right) $ for the kth component of X with $k=1,\ldots ,p$. Note that the former formulation is based on the true covariates $X_k$, while the latter formulation is based on the surrogate covariates $X_k^*$.

Since the error term $\epsilon _k$, the kth component of $\epsilon $, follows the normal distribution $N(0,\sigma _{\epsilon ,kk})$, its characteristic function is given by

$$\begin{aligned} E\left\{ \exp \left( \mathbf {i} s \epsilon _k \right) \right\} = \exp \left( -\frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) . \end{aligned}$$

(A.1)

By direct derivation, we have

$$\begin{aligned} \phi ^*_{X_k^*}(s)= & {} E\left\{ \exp \left( \mathbf {i}s X_k^*\right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}s X_k \right) \right\} E\left\{ \exp \left( \mathbf {i}s \epsilon _k \right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}s X_k \right) \right\} , \end{aligned}$$

(A.2)

where the second equality is due to the independence of $X_k$ and $\epsilon _k$, and the last equality is due to (A.1).

In addition, we can also derive

$$\begin{aligned} \phi ^*_{Y^*,X_k^*}(r,s)= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k^*\right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k \right) \right\} E\left\{ \exp \left( \mathbf {i}s \epsilon _k \right) \right\} \exp \left( \frac{1}{2} s^2 \sigma _{\epsilon ,kk} \right) \nonumber \\= & {} E\left\{ \exp \left( \mathbf {i}rY^*+ \mathbf {i}s X_k \right) \right\} , \end{aligned}$$

(A.3)

where the second equality is due to the independence of $\epsilon _k$ and $(X_k, Y^*)$, and the last equality again comes from (A.1). As a result, combining (A.2) and (A.3) with $\text {dcov}^*\left( Y^*, X_k^*\right) $ gives the same expression as $\text {dcov} \left( Y^*, X_k \right) $.

The equivalence of $\text {dcov}^*\left( X_k^*, X_k^*\right) $ and $\text {dcov} \left( X_k, X_k \right) $ follows by similar derivations. Therefore, we conclude that $\text {dcorr} \left( Y^*, X_k \right) $ and $\text {dcorr}^*\left( Y^*, X_k^*\right) $ are equivalent in the sense that $\text {dcorr} \left( Y^*, X_k \right) > 0$ if and only if $\text {dcorr}^*\left( Y^*, X_k^*\right) > 0$. Consequently, the same active features are identified from $X^*$ as from X. $\square $
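The cancellation in (A.2) can be checked numerically. The sketch below is only an illustration under assumed settings that are not taken from the paper: an exponentially distributed true covariate, the additive error structure $X_k^*= X_k + \epsilon _k$ implied by the second equality in (A.2), error variance 0.25, and hypothetical helper names `ecf` and `corrected_ecf`. It compares the empirical characteristic function of $X_k$ with the uncorrected and corrected versions computed from the surrogate.

```python
import cmath
import math
import random


def ecf(sample, s):
    """Empirical characteristic function: the average of exp(i * s * x)."""
    return sum(cmath.exp(1j * s * x) for x in sample) / len(sample)


def corrected_ecf(surrogate, s, sigma2):
    """Empirical analogue of phi*_{X*}(s) in (A.2).

    The characteristic function of the surrogate is multiplied by
    exp(s^2 * sigma2 / 2), which cancels the normal error term in (A.1).
    """
    return ecf(surrogate, s) * math.exp(0.5 * s * s * sigma2)


# Assumed simulation settings (illustrative only).
rng = random.Random(1)
n, sigma2 = 50_000, 0.25
x = [rng.expovariate(1.0) for _ in range(n)]  # true covariate, non-normal
x_star = [xi + rng.gauss(0.0, math.sqrt(sigma2)) for xi in x]  # surrogate

for s in (0.5, 1.0, 2.0):
    naive_gap = abs(ecf(x, s) - ecf(x_star, s))
    corrected_gap = abs(ecf(x, s) - corrected_ecf(x_star, s, sigma2))
    print(f"s={s}: naive gap={naive_gap:.4f}, corrected gap={corrected_gap:.4f}")
```

In this setup the corrected gap should shrink toward zero as n grows, while the naive gap does not. The correction factor grows with $s^2$, so sampling noise is amplified at large s.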

Appendix B Overview of error correction in the Cox model

In this appendix, we outline the strategy of correcting covariate measurement error when fitting the Cox model. This idea comes from Chen and Yi (2020), and this method can be used to fit the Cox model with covariate measurement error in Sect. 4.3.

For $i=1,\ldots ,n$, let $Y_i$ and $\delta _i$ denote the survival time and the censoring indicator defined in Sect. 2.1. Let $X_i$ denote the q-dimensional vector of unobserved covariates retained by the feature screening procedure, with $q<n$, and let $X_i^*$ be the surrogate version of $X_i$. Based on the Cox model (17) and the unobserved covariates $X_i$, the likelihood function is given by (e.g., Lawless 2003)

$$\begin{aligned} L(\gamma ) = \prod \limits _{i=1}^{n} \left\{ \lambda _0(Y_i) \exp \left( X_i^\top \gamma \right) \right\} ^{\delta _i} \exp \left\{ - \Lambda _0(Y_i) \exp \left( X_i^\top \gamma \right) \right\} , \end{aligned}$$

(B.1)

where $\Lambda _0(t) = \int _0^t \lambda _0(u)du$ is the cumulative baseline hazard function.

Let $\ell (\gamma ) = \log L(\gamma )$. Since $\ell (\gamma )$ contains the $X_i$, whose measurements are unavailable, we want to replace $\ell (\gamma )$ with a new function, say $\ell ^*(\gamma )$, of the observed measurements and the model parameters, chosen so that its conditional expectation equals $\ell (\gamma )$:

$$\begin{aligned} E\left\{ \ell ^*(\gamma )|\mathbb {X},\mathbb {C},\mathbb {T}\right\} = \ell (\gamma ), \end{aligned}$$

(B.2)

where the expectation is taken with respect to the conditional distribution of $\mathbb {X}^*$ given $\left\{ \mathbb {X}, \mathbb {C}, \mathbb {T} \right\} $, with $\mathbb {X}^*= \{X_1^*,\ldots ,X_n^*\}$, $\mathbb {X} = \{X_1,\ldots ,X_n\}$, $\mathbb {C} = \{C_1,\ldots ,C_n\}$, and $\mathbb {T} = \{T_1,\ldots ,T_n\}$. Such a strategy is useful in yielding an unbiased estimating function and is sometimes called the “corrected” likelihood method or the insertion correction approach (e.g., Carroll et al. 2006, Section 7.4).

Noticing that the $X_i$ appear in $\ell (\gamma )$ in linear and exponential forms, we define

$$\begin{aligned} \ell ^*(\gamma )= & {} \sum _{i=1}^{n} \bigg [ \delta _i \log \lambda _0 (Y_i) + \delta _i (X_i^{*\top } \gamma ) - \Lambda _0 (Y_i) \exp (X_i^{*\top } \gamma ) \left\{ m(\gamma )\right\} ^{-1} \bigg ],\qquad \end{aligned}$$

(B.3)

where $m(z) = \exp (\frac{1}{2} z^\top \Sigma _{\epsilon } z)$ and $\Sigma _{\epsilon }$ is defined in Sect. 2.2. It is easily seen that $\ell ^*(\gamma )$ satisfies (B.2).
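The role of $m(\gamma )$ in (B.3) can be illustrated with a small Monte Carlo check. The sketch assumes a scalar covariate and the additive normal error $X_i^*= X_i + \epsilon _i$ with known variance, as used in Appendix A. Under that assumption the moment generating function of the normal error gives $E\{\exp (\gamma X_i^*) | X_i\} = \exp (\gamma X_i) m(\gamma )$, so dividing by $m(\gamma )$ removes the bias of the exponential term, while the linear term is already unbiased. The numerical values and function names below are hypothetical.

```python
import math
import random


def correction_factor(gamma, sigma2):
    """m(gamma) = exp(gamma^2 * sigma2 / 2) for a scalar covariate."""
    return math.exp(0.5 * gamma * gamma * sigma2)


def corrected_exp_mean(x, gamma, sigma2, n_draws, rng):
    """Monte Carlo estimate of E{exp(gamma * X*) / m(gamma) | X = x}."""
    m = correction_factor(gamma, sigma2)
    sd = math.sqrt(sigma2)
    total = 0.0
    for _ in range(n_draws):
        x_star = x + rng.gauss(0.0, sd)  # surrogate measurement of x
        total += math.exp(gamma * x_star) / m
    return total / n_draws


# Assumed values, chosen only for illustration.
rng = random.Random(2020)
x, gamma, sigma2 = 0.8, 0.5, 0.25
approx = corrected_exp_mean(x, gamma, sigma2, 200_000, rng)
target = math.exp(gamma * x)
print(f"corrected mean={approx:.4f}, exp(gamma * x)={target:.4f}")
```

Without the division by $m(\gamma )$, the average of $\exp (\gamma X_i^*)$ would overshoot $\exp (\gamma X_i)$ by the factor $m(\gamma ) > 1$ whenever $\gamma \ne 0$.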

To use (B.3) to derive an estimator of $\gamma $, we need to deal with the baseline hazard function $\lambda _0 \left( \cdot \right) $ and its cumulative function $\Lambda _0 \left( \cdot \right) $. First, we discretize $\Lambda _0 \left( \cdot \right) $ so that $\lambda _0 \left( \cdot \right) $ has a nonzero value if $t = Y_i$ for $i = 1,\ldots ,n$; otherwise, $\lambda _0 (t) =0$. Let $\lambda _i $ denote $\lambda _0 (Y_i)$ for $i = 1,\ldots ,n$. Then $\Lambda _0 (t)$ is taken as $\sum \nolimits _{i=1}^{n} \mathbb {I}(Y_i \leqslant t) \lambda _i$. Next, given $\gamma $, we solve $\frac{\partial \ell ^*(\gamma )}{\partial \lambda _i} = 0$ for $i = 1,\ldots ,n$, which leads to an estimator of $\lambda _i$, given by

$$\begin{aligned} \widehat{\lambda }_i = \frac{\delta _i}{\sum \limits _{k=1}^{n} \mathbb {I}(Y_i \le Y_k) \exp { \left( X_k^{*\top } \gamma \right) } \left\{ m(\gamma ) \right\} ^{-1}}\ \text{ for } i=1,...,n; \end{aligned}$$

(B.4)

and the corresponding estimate of the cumulative baseline hazard function is

$$\begin{aligned} \widehat{\Lambda }_0 (t) = \sum _{i=1}^{n} \mathbb {I}(Y_i \le t) \widehat{\lambda }_i \ . \end{aligned}$$

(B.5)

Finally, plugging (B.4) and (B.5) into (B.3) gives the function

$$\begin{aligned} \widehat{\ell }^*(\gamma )= & {} \sum \limits _{i=1}^{n} \left[ \delta _i \log \widehat{\lambda }_i + \delta _i (X_i^{*\top } \gamma ) - \widehat{\Lambda }_0 (Y_i) \exp (X_i^{*\top } \gamma ) \left\{ m(\gamma )\right\} ^{-1} \right] . \end{aligned}$$

An estimator of $\gamma $ is then obtained by maximizing $\widehat{\ell }^*(\gamma )$:

$$\begin{aligned} \widehat{\gamma } = {\mathop {\mathrm{argmax}}\limits _{\gamma }} \widehat{\ell }^*(\gamma ). \end{aligned}$$
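The construction in (B.3) to (B.5) can be sketched in a few lines of code. This is a minimal illustration rather than the implementation used in the paper: the function name `corrected_profile_loglik`, the toy data-generating settings, and the assumption that $\Sigma _{\epsilon }$ is known are all introduced here for the example. Pairwise comparisons are written out directly, so the cost is quadratic in n.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def corrected_profile_loglik(gamma, y, delta, x_star, sigma_eps):
    """Evaluate the profiled corrected log-likelihood at gamma.

    gamma: list of q coefficients; y: observed times; delta: 0/1 event
    indicators; x_star: n rows of q surrogate covariates; sigma_eps: q x q
    error covariance matrix (assumed known here).
    """
    n, q = len(y), len(gamma)
    quad = sum(gamma[a] * sigma_eps[a][b] * gamma[b]
               for a in range(q) for b in range(q))
    m = math.exp(0.5 * quad)  # m(gamma)
    lin = [dot(row, gamma) for row in x_star]
    w = [math.exp(v) / m for v in lin]  # corrected risk scores
    # (B.4): jump of the baseline hazard at each observed time.
    lam = [delta[i] / sum(w[k] for k in range(n) if y[i] <= y[k])
           for i in range(n)]
    total = 0.0
    for i in range(n):
        # (B.5) evaluated at Y_i.
        cum_haz = sum(lam[j] for j in range(n) if y[j] <= y[i])
        if delta[i] == 1:  # censored terms contribute 0 to the log part
            total += math.log(lam[i]) + lin[i]
        total -= cum_haz * w[i]
    return total


# Toy data under assumed settings (illustrative only).
rng = random.Random(36)
n, gamma_true, sigma2 = 60, 1.0, 0.25
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
t = [rng.expovariate(math.exp(gamma_true * xi)) for xi in x]  # event times
c = [rng.expovariate(0.5) for _ in range(n)]  # censoring times
y = [min(ti, ci) for ti, ci in zip(t, c)]
delta = [1 if ti <= ci else 0 for ti, ci in zip(t, c)]
x_star = [[xi + rng.gauss(0.0, math.sqrt(sigma2))] for xi in x]
sigma_eps = [[sigma2]]

for g in (0.0, 0.5, 1.0):
    print(g, corrected_profile_loglik([g], y, delta, x_star, sigma_eps))
```

A short calculation suggests that, once (B.4) and (B.5) are plugged in, the last term sums to the number of observed events, so the criterion equals the usual Cox log partial likelihood evaluated at the surrogates plus $\frac{1}{2}\gamma ^\top \Sigma _{\epsilon }\gamma $ times the number of events, minus a constant. Because that quadratic term grows with $\gamma $, a practical maximization would probably be a local search started from a sensible initial value, such as the naive Cox estimate; this choice is an assumption of the sketch and is not prescribed by the text.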

About this article

Cite this article

Chen, LP. Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error. Comput Stat 36, 857–884 (2021). https://doi.org/10.1007/s00180-020-01039-2

Received: 22 May 2019
Accepted: 03 October 2020
Published: 12 October 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s00180-020-01039-2
