Adaptive conditional feature screening

https://doi.org/10.1016/j.csda.2015.09.002

Abstract

When the correlation among the predictors is relatively strong and/or the model structure cannot be specified, constructing an adaptive feature screening procedure remains a challenging issue. A general technique for conditional feature screening is proposed by combining a model-free feature screening with a predetermined set of predictors. The proposed centralization technique removes the irrelevant part from the criterion of the model-free feature screening. Consequently, the new criterion measures the marginal utility of each predictor conditional on the predetermined set of predictors. The conditional information about these predetermined predictors helps reduce the correlation among covariates, and as a result the resulting method lowers both the false positive and the false negative rates in the variable selection procedure. Thus, our method is adaptive to both the correlation among the covariates and model misspecification. The new procedures are computationally simple and efficient, and can be extended to other related methods.

Introduction

In some contemporary applications, such as biomedical imaging, functional magnetic resonance imaging, tomography, tumor classification and finance, researchers are frequently confronted with high-dimensional variables and models whose structure cannot be completely specified. In such situations, the number p of variables or parameters in the model can be much larger than the sample size n, and only little information about the actual model structure is known in advance. When the correlation among the covariates is relatively strong and the model structure cannot be correctly specified, it is difficult to establish dimension reduction methodologies that are adaptive to both the correlation among covariates and model misspecification. In this paper, we address this issue.

When the dimension of the predictor vector is much larger than the sample size, ranking and screening have proved useful for dimension reduction in situations where the model is correctly specified and the true structure is relatively simple, such as a linear or generalized linear structure. Such approaches are known as feature screening or marginal utility screening. Fan and Lv (2008) first introduced sure independence screening (SIS) and iterated sure independence screening (ISIS) in the context of linear regression models; Fan et al. (2009) and Fan and Song (2010) extended SIS and ISIS to generalized linear models; Fan et al. (2011) developed nonparametric independence screening (NIS) for nonparametric models with additive structure. For more related methodologies see Xue and Zou (2011), Zhu et al. (2011), Li et al. (2012), Wang (2012), Zhao and Li (2012), Lin et al. (2013) and Chang et al. (2013), among others.
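To make the marginal-utility idea concrete, here is a minimal Python sketch of SIS-style screening by absolute marginal correlation. The function name and the toy data are our own illustration, not from the paper.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by absolute marginal correlation with the
    response and keep the top d (the SIS idea of Fan and Lv, 2008)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    yc = (y - y.mean()) / y.std()
    omega = np.abs(Xc.T @ yc) / len(y)          # |sample correlation| per predictor
    return np.argsort(omega)[::-1][:d]          # indices of the top-d predictors

# Toy example: p = 500 predictors, n = 100, only X_0 and X_1 are active.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.standard_normal(n)
selected = sis_screen(X, y, d=10)
```

With an independent design and strong signals, both active predictors land in the retained set; the discussion below concerns exactly when this marginal ranking breaks down.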

All the aforementioned feature screening methods rest on a common condition: the true model structure is specified accurately. Their performance therefore depends critically on the supposed model being equal to, or at least close to, the underlying one; when the supposed structure is far from the truth, they may behave poorly. To develop feature screening that is robust against model misspecification, Zhu et al. (2011) proposed a sure independent ranking and screening (SIRS). Their proposal applies to a wide range of commonly used parametric and semiparametric models and can thus be regarded as model-free. Lin et al. (2013) proposed a nonparametric ranking feature screening (NRS) based on local information flows of the predictors, which captures the function-correlation between response and predictors without any model structure assumption. Li et al. (2012) proposed a distance correlation-based sure independence screening (DC-SIS), which is also a model-free approach. Recently, He et al. (2013) introduced a quantile-adaptive model-free variable screening for high-dimensional heterogeneous data; this approach allows the set of active variables to vary across quantiles and thus makes variable selection flexible enough to accommodate heterogeneity.

Moreover, as noted in the existing literature, for example Fan and Lv (2008), Zhu et al. (2011) and Barut et al. (2012), the correlation among predictors heavily influences the marginal utility. When the correlation among the predictors is relatively high, simple feature screening may produce false positives (selected predictors that are actually inactive) and false negatives (truly active predictors that are regarded as inactive and removed from the model). Most existing feature screening methods therefore impose conditions restricting the correlation among predictors. However, Hall and Li (1993) and Fan and Lv (2008) proved that with growing dimensionality p, spurious correlations among predictors always arise; the correlation among predictors is thus an unavoidable problem in statistical inference for all high-dimensional models. To adapt to circumstances in which predictors may be relatively highly correlated, Cho and Fryzlewicz (2012) proposed, for linear models, a new criterion to measure the contribution of each predictor to the response. Their method accounts for the correlations among predictors by projecting correlated predictors onto orthogonal spaces and thereby eliminating the correlations between the transformed variables. However, the projection method is difficult to apply, or cannot be extended, to other models such as nonlinear and nonparametric models.
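The spurious-correlation phenomenon is easy to reproduce numerically. The following sketch is our own illustration, not from the paper: it shows that the largest absolute sample correlation among mutually independent predictors grows with the dimension p.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # small sample, as in the high-dimensional regime

def max_spurious_corr(p):
    """Largest |sample correlation| among p mutually independent
    N(0,1) predictors; the population correlations are all zero."""
    X = rng.standard_normal((n, p))
    R = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(R, 0.0)        # ignore the trivial self-correlations
    return np.abs(R).max()

small, large = max_spurious_corr(10), max_spurious_corr(2000)
```

Even though every pair of predictors is independent, with p = 2000 some pair typically exhibits a large sample correlation, which is why correlation restrictions on the design cannot be taken for granted.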

In many applications, researchers know from previous investigations and experience that certain predictors are responsible for the response. As Barut et al. (2012) observed, conditioning on such known active predictors can help reduce the correlation among the predictors. This is particularly the case when predictors share common factors, as in many biological studies (e.g., treatment effects) and financial studies (e.g., market risk factors). It can therefore be expected that conditioning helps improve the measure of marginal utility. However, the conditional sure independence screening of Barut et al. (2012) depends strongly on a model structure assumption, the generalized linear model, and requires estimating the corresponding parameters. It is difficult to extend that method to models with complex or unspecified structure.
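As a numerical caricature of why conditioning helps (our own construction, not the procedure of Barut et al. (2012)): when an active predictor's marginal correlation with the response is cancelled by a strongly correlated companion predictor, marginal screening misses it, while conditioning on the known companion recovers it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2000
X = rng.standard_normal((n, p))
# Make X_1 strongly correlated with X_0 and choose coefficients so that
# the population correlation between X_0 and Y is exactly zero.
X[:, 1] = 0.9 * X[:, 0] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)
y = X[:, 0] - (1 / 0.9) * X[:, 1] + 0.3 * rng.standard_normal(n)

# Marginal screening: X_0 looks like pure noise.
marginal = np.abs([np.corrcoef(X[:, k], y)[0, 1] for k in range(p)])
rank_of_X0 = int(np.argsort(-marginal).tolist().index(0))

# Conditional screening: residualize X_0 and y on the known predictor X_1,
# then measure the correlation of the residuals (a partial correlation).
b_y = np.dot(X[:, 1], y) / np.dot(X[:, 1], X[:, 1])
b_0 = np.dot(X[:, 1], X[:, 0]) / np.dot(X[:, 1], X[:, 1])
partial = abs(np.corrcoef(X[:, 0] - b_0 * X[:, 1], y - b_y * X[:, 1])[0, 1])
```

In this construction `partial` is large (around 0.8) while X_0's marginal rank is buried among the noise predictors, the false-negative pattern described above.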

As stated above, strong correlation among the predictors seriously damages the quality of existing feature screening methods. Such correlation, however, can easily be predetermined: for example, the marginal correlation between any two predictors can be efficiently estimated by the sample correlation coefficient. It is thus an interesting issue to use this predetermined correlation to reduce the correlation among the predictors and thereby enhance the adaptability of feature screening methods.

In this paper, a general technique is proposed for reducing the correlation among the predictors and formulating a conditional feature ranking. The key idea is to centralize the criterion of an existing model-free screening, so that the irrelevant term, related only to the predetermined set of predictors, is removed. Consequently, the correlation between the centralized variable and the preselected variables is reduced significantly or eliminated completely, and the new criterion measures the marginal utility of a predictor conditional on the known set of predictors. As stated above, the conditional information about the predetermined predictors helps reduce the correlation among predictors, which implies that the new method can lower the false positive and the false negative rates in the variable selection process. This, together with the model-free property, ensures that our method is adaptive to both the correlation among the predictors and model misspecification, especially when the number of predetermined predictors is large. It is proved that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses ranking consistency, which is useful in its own right and also leads to selection consistency. Moreover, unlike the conditional feature screening of Barut et al. (2012), the new criteria require no estimation of model parameters; the new procedures are computationally simple and efficient, and can be extended to other related methods.
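The centralization idea can be caricatured by a simple linear residualization: remove from the response and from each candidate predictor the part explained by the predetermined set C, then rank candidates by the absolute correlation of the residuals. This is only a schematic stand-in for the paper's actual criterion, and every name and the toy data below are ours.

```python
import numpy as np

def conditional_screen(X, y, cond_idx, d):
    """Schematic conditional screening: project y and each candidate
    predictor off the conditioning columns, then rank candidates by
    the absolute correlation of the residuals and keep the top d."""
    Xc = X[:, cond_idx]
    coef_y, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    ry = y - Xc @ coef_y                        # response residual
    cand = [k for k in range(X.shape[1]) if k not in set(cond_idx)]
    def utility(k):
        coef_k, *_ = np.linalg.lstsq(Xc, X[:, k], rcond=None)
        rk = X[:, k] - Xc @ coef_k              # predictor residual
        return abs(np.corrcoef(rk, ry)[0, 1])
    return sorted(cand, key=utility, reverse=True)[:d]

# Toy example: a common factor correlates all predictors; X_0 is known
# active, X_1 is the weaker active predictor to be recovered.
rng = np.random.default_rng(2)
n, p = 200, 300
f = rng.standard_normal(n)
X = 0.8 * f[:, None] + rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n)
picked = conditional_screen(X, y, cond_idx=[0], d=10)
```

Because the conditioning step strips out the part of each candidate tied to the preselected predictors, the residual ranking behaves like a marginal utility computed conditionally on C, which is the effect the centralization technique formalizes.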

The remainder of the paper is organized as follows. In Section 2, the SIRS of Zhu et al. (2011) is first reviewed to motivate the methodological development. The SIRS is then centralized so that the irrelevant term is removed from the original criterion, which naturally defines the new conditional model-free feature screening; consistent estimators for the new criterion are also proposed. In Section 3, the theoretical properties of our method, including correlation reduction and ranking consistency, are investigated. Simulation studies, together with a two-stage procedure, are presented in Section 4, and the technical proofs are deferred to the Appendix.

Section snippets

Problems and motivations

Let x = (X_1, …, X_p)^τ be a p-dimensional vector of predictors and Y be the response variable. Denote by 𝒳_k and 𝒴 the supports of X_k and Y, respectively. Here the dimension p is large and may be much larger than the sample size n. Denote by A the index set of the active predictors, namely, A = {k : F(y|x) functionally depends on X_k for some y ∈ 𝒴}, where F(y|x) is the distribution function of Y conditional on x. If k ∈ A, X_k and F(y|x) are indeed functionally correlated for some y ∈ 𝒴. Denote by Ā the

Theoretical properties

We now investigate the theoretical properties of the marginal correlation function Ω_k(y, C, β_C) in (2.4), which is a foundation for our method.

Theorem 3.1

The marginal correlation function Ω_k(y, C, β_C) in (2.4) has the following properties:

  • (1)

    If X_k (k ∈ D) and Y are independent conditional on β_C^τ x_C for any β_C ∈ 𝒞, then Ω_k(y, C, β_C) = 0 uniformly for y ∈ 𝒴 and β_C ∈ 𝒞.

  • (2)

    Particularly, under model (2.1), suppose X_k (k ∈ D) and X_j (j ∈ D ∪ A, j ≠ k) are independent. If X_k (k ∈ D) and Y are functionally uncorrelated conditional on β_C^τ x_C for

Simulation studies

In this section we present several simulation examples, together with a two-stage procedure, to compare the finite-sample performance of the newly proposed CSIRS, in both Case 1 and Case 2, with existing competitors such as the unconditional SIRS (Zhu et al., 2011) and the CSIS (Barut et al., 2012). To obtain comprehensive comparisons, we investigate these feature screening methods in a variety of settings with p = 2000 predictors and sample size n = 200. Throughout, the number of

Acknowledgments

The authors thank the Associate Editor and the referees for their very constructive and thoughtful comments and suggestions. Lu Lin was supported by NNSF projects (11171188, 11571204 and 11231005) of China. Jing Sun was supported by NNSF project (11426126) of China, and NSF project (ZR2014AP007) of Shandong Province, China.

References (19)

  • L. Lin et al., Nonparametric feature screening, Comput. Statist. Data Anal. (2013)
  • E. Barut, J. Fan, A. Verhasselt, Conditional sure independence screening, Manuscript (2012)
  • J. Chang et al., Marginal empirical likelihood and sure independence feature screening, Ann. Statist. (2013)
  • H. Cho et al., High dimensional variable selection via tilting, J. R. Stat. Soc. Ser. B Stat. Methodol. (2012)
  • J. Fan et al., Nonparametric independence screening in sparse ultra-high-dimensional additive models, J. Amer. Statist. Assoc. (2011)
  • J. Fan et al., Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B Stat. Methodol. (2008)
  • J. Fan et al., Ultrahigh dimensional feature selection: beyond the linear model, J. Mach. Learn. Res. (2009)
  • J. Fan et al., Sure independence screening in generalized linear models with NP-dimensionality, Ann. Statist. (2010)
  • P. Hall et al., On almost linearity of low dimensional projection from high dimensional data, Ann. Statist. (1993)
