Feature screening under missing indicator imputation with non-ignorable missing response

https://doi.org/10.1016/j.csda.2020.106975

Abstract

This article develops a model-free variable screening technique for ultrahigh-dimensional data analysis with a non-ignorable missing response. Under the common assumption of a logistic model for the propensity function, a novel screening procedure is proposed that borrows hidden information from the missingness indicator, so that any variable screening method designed for ultrahigh-dimensional covariates with full data can be applied to the non-ignorable missing response case. It is shown that the sure screening property is retained as long as the corresponding screening method for full data has the sure screening property. The finite-sample performance of the proposed method is demonstrated via simulations and an analysis of functional neuroimaging data.

Introduction

With the rapid development of modern technology, high-dimensional data are now frequently collected at relatively low cost in a wide variety of areas, such as genomics, proteomics, biomedical imaging, tumor classification and finance. Because of the large number of predictors, the analysis of high-dimensional data poses many challenges, such as computational expediency, statistical accuracy and algorithmic stability (Fan et al., 2009). To address these challenges, it is common to assume the sparsity principle, under which only a small number of predictors contribute to the response. Based on this general principle, several novel statistical methods have been developed for ultrahigh-dimensional data. Fan and Lv (2008) proposed sure independence screening (SIS) for linear regression. The approach was further developed by Fan et al. (2009) and Fan and Song (2010) in the context of generalized linear models. Other marginal screening methods include tilting methods (Hall et al., 2009), generalized correlation screening (Hall and Miller, 2009), nonparametric screening (Fan et al., 2011), robust rank correlation screening (Li et al., 2012a), distance correlation screening (Li et al., 2012b), quantile-adaptive screening (He et al., 2013), the fused Kolmogorov filter (Mai and Zou, 2015), the mean–variance filter (Cui et al., 2015) and its fused form (Yan et al., 2018), as well as Ball correlation sure independence screening (Pan et al., 2018), among others. Unfortunately, these variable screening approaches cannot be applied directly in the presence of nonresponse or missing data, which arise frequently in statistical applications such as clinical trials and social science.
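As a point of reference for the marginal screening methods listed above, the following is a minimal sketch of correlation-based screening in the spirit of SIS for a linear model with fully observed data. The function name `sis_screen`, the simulated data, and the submodel size d = [n / log n] (a commonly used choice) are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Keep the d predictors with the largest absolute marginal
    Pearson correlation with y (illustrative helper, full data only)."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))  # assumed default submodel size
    Xc = X - X.mean(axis=0)     # center each predictor
    yc = y - y.mean()           # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(-corr)[:d]  # indices, most correlated first

# Toy example: only the first two of 2000 predictors are active.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)
selected = sis_screen(X, y)
```

In this toy setting the retained set is far smaller than p while still expected to contain the two active predictors, which is the behavior the sure screening property formalizes.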

In the presence of a missing response, inverse probability weighting and regression imputation are commonly used in statistical analysis. However, it is hardly practical to estimate the propensity function or the regression predictor before the high-dimensionality problem has been addressed, which makes both approaches infeasible. In addition, estimation based on complete cases alone may result in serious bias under some missingness mechanisms. Consequently, these standard approaches cannot be applied directly to develop variable screening techniques for the missing-response case.

When the response is missing at random (MAR), that is, when missingness depends only on the observed variables, some variable screening methods have recently been proposed by Lai et al. (2017) and Wang and Li (2018). Lai et al. (2017) developed a two-step screening method. The first step screens for the variables in the propensity function under MAR; the screening statistics are then calculated using the variables retained in the first step, based on the inverse probability weighting technique. However, this method performs poorly when the selected variables are not significant for the response or when more than three variables are selected in the first step (Wang and Li, 2018).

Wang and Li (2018) suggested the missing indicator imputation (MI-I) method, which exploits the information in the missingness indicator. The method rests on a proof that the set of active predictors for the response is a subset of the set of active predictors for the product of the response and the missingness indicator, which makes any variable screening approach for ultrahigh-dimensional predictors with full data applicable to the case of a missing response. In many practical problems, however, missingness may be non-ignorable (e.g., Ibrahim et al., 1999); that is, missingness depends on variables whose observations may themselves be missing. The existence of non-ignorable missingness has been demonstrated in many applications in the literature (e.g., Zhao and Shao, 2015; Tang et al., 2003). For instance, in the resting-state functional magnetic resonance imaging data from the Autism Brain Imaging Data Exchange study, verbal intelligence quotient (VIQ) is an outcome variable subject to missingness. Because some individuals have poor language-based reasoning ability, they may refuse to provide their VIQ values. That is, the missingness depends on the response variable that is itself subject to nonresponse. Clearly, MAR is a special case of non-ignorable missingness.
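The MI-I idea described in this paragraph, screening against the product of the response and the missingness indicator, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`dcor_sq`, `mi_i_screen`), the use of squared sample distance correlation as the ranking statistic, the simulated model, and the logistic missingness mechanism with coefficients 0.5 and 1 are all hypothetical choices.

```python
import numpy as np

def _centered_dist(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def dcor_sq(u, v):
    """Squared sample distance correlation between two 1-D samples."""
    A, B = _centered_dist(u), _centered_dist(v)
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return (A * B).mean() / denom if denom > 0 else 0.0

def mi_i_screen(X, y, delta, d):
    """Rank predictors by dependence with delta * Y and keep the top d."""
    y_star = np.where(delta == 1, y, 0.0)  # product of response and indicator
    scores = np.array([dcor_sq(X[:, j], y_star) for j in range(X.shape[1])])
    return np.argsort(-scores)[:d]

# Toy data: two active predictors, missingness driven by Y itself (assumed).
rng = np.random.default_rng(2)
n, p = 150, 300
X = rng.standard_normal((n, p))
y_full = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(n)
delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + y_full))))
y = np.where(delta == 1, y_full, np.nan)  # unobserved responses are NaN
d = int(n / np.log(n))
selected = mi_i_screen(X, y, delta, d)
```

Because the screening statistic only ever sees X and the product of δ and Y, both of which are fully observed, any full-data ranking statistic could be substituted for `dcor_sq` in this sketch.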

To the best of our knowledge, the variable screening problem under non-ignorable missingness has not yet been considered. This may be because the commonly used methods for handling a missing response are hard to apply to variable screening under the most general non-ignorable missingness mechanism. In this paper, we generalize the MI-I screening method to the case of non-ignorable missingness by proving that the sure screening property still holds under this more general missingness mechanism.

The rest of this article is organized as follows. In Section 2, we develop techniques such that any variable screening approach for full data can be applied to the non-ignorable missing case, and present the primary theoretical results. In Section 3, we first illustrate how to implement the proposed procedures using the DC-SIS. Then some simulation studies are conducted to examine the finite sample performances. In Section 4, a real dataset in the resting-state functional magnetic resonance imaging study is analyzed to illustrate our methodology. Section 5 gives a brief discussion. The proofs of main results are relegated to Appendix A. Some additional numerical studies are gathered in the Supplementary Material.

Methodology

We assume that Y, a response variable with support Ψy, is subject to missingness and that X=(X1,…,Xp)^T, a p-dimensional covariate vector, is fully observed for the entire sample. Let δ be the response status indicator for Y, with δ=1 if Y is observed and δ=0 if Y is missing. The conditional probability π(Y,X)=P(δ=1|Y,X) is called the propensity function. Here, we study variable screening for non-ignorable missing data, that is, the case where π(Y,X) depends on Y, which may itself be missing.
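To make the notation concrete, the sketch below simulates a response indicator δ from a logistic propensity function that depends on Y, so the missingness is non-ignorable. The specific form π(Y,X) = 1/(1+exp(-(α+βY))), the values α = 0.5 and β = 1, and the standard normal response are assumptions chosen for illustration only; they are not taken from the paper.

```python
import numpy as np

def logistic_propensity(y, alpha=0.5, beta=1.0):
    """P(delta = 1 | Y = y) under an assumed logistic model in y."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta * y)))

rng = np.random.default_rng(1)
n = 5000
y = rng.standard_normal(n)               # full response (unknown in practice)
pi = logistic_propensity(y)              # propensity depends on Y itself
delta = rng.binomial(1, pi)              # delta = 1 if Y is observed
y_obs = np.where(delta == 1, y, np.nan)  # missing responses recorded as NaN

full_mean = y.mean()
observed_mean = np.nanmean(y_obs)        # complete-case mean of Y
```

Since larger values of Y are more likely to be observed under this assumed mechanism, the complete-case mean overstates the full-sample mean, which illustrates why an analysis restricted to complete cases can be biased.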

Denote by F(y|X) the

Simulation studies

In this section, we conduct simulations to evaluate the finite-sample performance of the proposed method for variable screening under non-ignorable missingness. Because different variable screening methods have been compared extensively in the literature (e.g., Mai and Zou, 2015; Yan et al., 2018; Pan et al., 2018), we chose one of them to avoid discrepancies caused by different rank statistics, whereas our approach is independent of a particular rank statistic from the derivation of

Application

Verbal intelligence quotient (VIQ) measures an individual’s ability to use language-based reasoning to analyze information and solve problems. Language-based reasoning may involve reading or listening to words, conversing, writing, or even thinking. Our modern world is built around listening to or reading words for meaning and expressing knowledge through spoken language. Therefore, it is critical to study which brain regions can influence an individual’s VIQ; the findings could potentially

Discussion

In this article, we give a feasible technique for variable screening under non-ignorable missingness. Although the proposed method is simple and performs well, some improvements are still desirable.

According to Theorem 1 and the definition of Â(Y|X), Â(Y|X) may contain some variables that are important for δ but not for Y. An improvement would be to define a refined estimator of A(Y|X) that removes some superfluous variables. A possible method is the Venn diagram

Acknowledgments

Wang’s research was supported by the National Natural Science Foundation of China (General program 11871460, Key program 11331011 and program for Creative Research Group in China 61621003) and a grant from the Key Lab of Random Complex Structure and Data Science, CAS, China.

References (32)

  • Fan, J. et al. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)
  • Fan, J. et al. Ultrahigh dimensional feature selection: Beyond the linear model. J. Mach. Learn. Res. (2009)
  • Fan, J. et al. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. (2010)
  • Fang, F. et al. Model selection with nonignorable nonresponse. Biometrika (2016)
  • Hall, P. et al. Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist. (2009)
  • Hall, P. et al. Tilting methods for assessing the influence of components in a classifier. J. R. Stat. Soc. Ser. B Stat. Methodol. (2009)