A sequential test for variable selection in high dimensional complex data

https://doi.org/10.1016/j.csda.2014.07.016

Abstract

Given a high dimensional p-vector of continuous predictors X and a univariate response Y, principal fitted components (PFC) provide a sufficient reduction of X that retains all regression information about Y in X while reducing the dimensionality. The reduction is a set of linear combinations of all p predictors in which, through a flexible set of basis functions, predictors related to Y via complex, nonlinear relationships can be detected. In the presence of a possibly large number of irrelevant predictors, the accuracy of the sufficient reduction is hindered. The proposed method adapts a sequential test to PFC to obtain a "pruned" sufficient reduction that sheds the irrelevant predictors. The sequential test is based on a likelihood ratio whose expression is derived under different covariance structures of X|Y. The resulting reduction has improved accuracy and also allows the identification of the relevant variables.

Introduction

Consider the Big-Mac dataset (Enz, 1991), a simple dataset that gives 1991 average values of several economic indicators for 45 world cities. It has nine continuous predictors and a continuous outcome variable, the minimum labor needed to buy a Big Mac and fries, in US dollars. A regression fit to the raw data, without any transformation of the response or predictors, yields a multiple R², the square of the correlation between the observed and fitted responses, of 0.46. After a graphical exploration and appropriate transformations of the variables, we obtained R² = 0.87. The reason for this drastic improvement is that the relationships between the response and the predictors, initially nonlinear (Fig. 1), became linear through a model fitting procedure guided by diagnostics (see Cook and Weisberg, 1982, Section 1.2). With nine predictors, this procedure is easily doable. However, when p is large, say 50 or more, regression modeling through this iterative procedure becomes a daunting, tedious task. A forward linear model is then almost ubiquitously adopted, the relationships between individual predictors and the response are often left unexplored because of the high dimensionality of the predictors, and diagnostic methods are seldom used for model checking. Using an ill-fitting model to solve a variable selection problem can result in reduced performance.
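To make this workflow concrete, the following is a minimal R sketch of the two fits described above, assuming the 1991 data are loaded into a data frame bigmac whose first column BigMac holds the response; the column name and the blanket log transformation are illustrative assumptions, not necessarily the transformations selected by the diagnostics.

    ## Raw-scale fit: multiple R^2 of about 0.46, as reported above
    fit_raw <- lm(BigMac ~ ., data = bigmac)
    summary(fit_raw)$r.squared

    ## Graphical exploration guiding the choice of transformations
    pairs(bigmac)

    ## Refit after transforming the response and the predictors; a blanket log
    ## transformation is used here purely for illustration
    bigmac_log <- as.data.frame(lapply(bigmac, log))
    fit_log <- lm(BigMac ~ ., data = bigmac_log)
    summary(fit_log)$r.squared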

Most variable selection methods are constructed around forward linear regression models. Because ordinary least squares estimation does not yield satisfactory results when p is large, it is often assumed that a large portion of the p predictors is irrelevant in explaining the response Y. The corresponding coefficients of these predictors in a linear regression model are shrunk or even set to zero. This brings the concept of sparsity into regression modeling, with two induced consequences: parsimony of the model and accuracy in prediction. A flurry of research on algorithms and theory for variable selection involving sparsity constraints has appeared in recent years. These methods include soft thresholding (Donoho, 1995), the nonnegative garrote (Breiman, 1995), the lasso (Tibshirani, 1996), the smoothly clipped absolute deviation penalty (SCAD; Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and the Dantzig selector (Candès and Tao, 2007), among many others. These methods work exceptionally well when the model is accurate. However, they do not perform adequately when the predictors and the response have an arbitrary nonlinear relationship.

A recent methodology proposed by Cook (2007) opens significant avenues for addressing the shortcomings of linear models in capturing information about high dimensional predictors nonlinearly related to the response. Cook (2007) articulated the concept of a sufficient reduction in regression and set up a new paradigm of dimension reduction through a likelihood-based approach called principal fitted components (PFC). A reduction $R: \mathbb{R}^p \to \mathbb{R}^d$, $d \le p$, was defined to be sufficient if it satisfies one of the following three statements: (i) $Y \mid X \sim Y \mid R(X)$, (ii) $X \mid (Y, R(X)) \sim X \mid R(X)$, and (iii) $X \perp\!\!\!\perp Y \mid R(X)$. The symbol $\perp\!\!\!\perp$ stands for statistical independence, and $U \sim V$ stands for $U$ and $V$ having identical distributions. Statement (i) holds in a forward regression setup while statement (ii) holds in an inverse regression setup. Under a joint distribution of $(Y, X)$, the three statements are equivalent.

Principal fitted components are a class of inverse regression models that yield a sufficient reduction of the predictors. Let $X_y$ denote the random vector $X \mid (Y=y)$ and assume that there is a vector-valued function $\nu(Y) \in \mathbb{R}^d$, with $d \le \min(n,p)$ and $E[\nu(Y)] = 0$, so that $X_y$ can be represented by the model $X_y \sim N(\mu + \Gamma\nu(y), \Delta)$. The term $\Gamma \in \mathbb{R}^{p \times d}$ is a semi-orthogonal matrix, and $\mu = E(X)$. The covariance $\Delta$ is assumed to be independent of $Y$. Under this model the translated conditional means $E(X_y) - \mu$ fall in the $d$-dimensional subspace $\mathrm{span}(\Gamma)$, and thus $\Gamma$ captures the dependency between $X$ and $Y$. Once the response is observed, the unobserved term $\nu(y)$ can be approximated using a flexible set of basis functions as $\nu(y) \approx \beta f(y)$. The resulting model $X_y = \mu + \Gamma\beta f(y) + \Delta^{1/2}\varepsilon$ is called a PFC model, where $\varepsilon$ is assumed to be normally distributed with mean $0$ and variance $I_p$. Under this model, Cook (2007) showed that $\Gamma^T\Delta^{-1}X$ is a sufficient reduction of $X$. The choice of basis functions makes it possible to capture predictors that are linearly or nonlinearly related to the response. The maximum likelihood estimators of the model parameters have been obtained (Cook, 2007, Cook and Forzani, 2008).
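To make the estimation concrete, the following R sketch simulates from a PFC model and estimates the reduction under the simplest, isotropic structure $\Delta = \sigma^2 I_p$, for which $\mathrm{span}(\Gamma)$ is estimated by the leading eigenvectors of the sample covariance of the fitted values from the inverse regression of $X$ on $f(y)$ (Cook, 2007). The cubic polynomial basis, the sample sizes and the active-predictor pattern are choices made for illustration; in practice one would rely on the ldr package (Adragni and Raim, 2014).

    ## Minimal sketch of PFC estimation with an isotropic error structure
    ## (Delta = sigma^2 I); illustrative only, not the authors' implementation.
    set.seed(1)
    n <- 200; p <- 10; d <- 1
    y <- runif(n, -2, 2)
    Gamma <- c(1, 1, rep(0, p - 2)) / sqrt(2)    # only the first two predictors are active
    X <- outer(exp(y), Gamma) + matrix(rnorm(n * p, sd = 0.3), n, p)

    Fy <- scale(cbind(y, y^2, y^3), center = TRUE, scale = FALSE)   # basis f(y)
    Xc <- scale(X, center = TRUE, scale = FALSE)

    ## Fitted values of the inverse regression of X on f(y), and their covariance
    P    <- Fy %*% solve(crossprod(Fy), t(Fy))
    Sfit <- crossprod(P %*% Xc) / n

    ## Under the isotropic PFC model, span(Gamma) is estimated by the leading d
    ## eigenvectors of Sfit, and the sufficient reduction is Gamma^T X
    Gamma_hat <- eigen(Sfit, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
    reduction <- Xc %*% Gamma_hat
    plot(y, reduction)    # the nonlinear trend is recovered by the reduction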

In high dimensional settings, irrelevant predictors, which often abound, can hinder the accuracy of the estimated sufficient reduction. Our goal is to obtain a "pruned" estimator of the sufficient reduction, which not only helps achieve accuracy but also allows the identification of the relevant variables. By "pruning", we mean removing inactive predictors that do not contain any regression information about the response. The resulting estimator is often called a sparse estimator.

An estimation of the sparse reduction kernel $\Delta^{-1}\Gamma$ has been proposed by Li (2007), who established a framework to obtain the sparse sufficient reduction using a regression-type formulation with the lasso (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005) penalties. Chen et al. (2010) proposed coordinate-independent sparse sufficient dimension reduction, which shrinks row elements of $\Delta^{-1}\Gamma$ while preserving the orthogonality constraint on $\Gamma$. Both methodologies are apt when $n \ge p$. We herein construct a sequential likelihood ratio test that is reminiscent of the idea of testing predictor contributions in sufficient dimension reduction of Cook (2004). It helps obtain the sparse reduction under structures of $\Delta$ that allow $p > n$. We show the performance of the procedure through simulations.
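As a loose illustration of the idea only (not the authors' statistic), consider the special case of a diagonal $\Delta$: the contribution of a single predictor $X_j$ can then be screened by a likelihood ratio test that the inverse regression of $X_j$ on $f(y)$ has zero coefficients. The basis, the significance level and the function name pfc_screen in the R sketch below are assumptions made for this illustration; the paper derives the exact likelihood ratio and its sequential use under several covariance structures of $X \mid Y$.

    ## Illustrative per-predictor likelihood ratio screen under a diagonal Delta;
    ## the paper's sequential test uses the exact likelihood ratio expressions.
    pfc_screen <- function(X, y, fy = cbind(y, y^2, y^3), alpha = 0.05) {
      n  <- nrow(X)
      Fy <- scale(fy, center = TRUE, scale = FALSE)
      r  <- ncol(Fy)
      stat <- apply(scale(X, center = TRUE, scale = FALSE), 2, function(xj) {
        rss1 <- sum(residuals(lm(xj ~ Fy))^2)   # inverse regression of X_j on f(y)
        rss0 <- sum(xj^2)                       # null model: X_j carries no information about Y
        n * log(rss0 / rss1)                    # LRT statistic, approximately chi-square with r df
      })
      list(statistic = stat, active = which(stat > qchisq(1 - alpha, df = r)))
    }

Applied to the simulated data of the previous sketch, pfc_screen(X, y)$active would be expected to flag the two active predictors.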

A sequential test for sparse PFC

We assume that the $p$-vector predictor $X$ can be partitioned as $(X_1^T, X_2^T)^T$, with $X_2 \in \mathbb{R}^{p_2}$, and let $(\Gamma_1^T, \Gamma_2^T)^T$, $\Delta = (\Delta_{ij})_{i,j=1,2}$ and $\Delta^{-1} = (\Delta^{ij})_{i,j=1,2}$ be the corresponding partitions of $\Gamma$, $\Delta$ and $\Delta^{-1}$ following the partition of $X$. Under model (1), the sufficient reduction can be written as
$$\Gamma^T\Delta^{-1}X = (\Gamma_1^T\Delta^{11} + \Gamma_2^T\Delta^{21})X_1 + (\Gamma_1^T\Delta^{12} + \Gamma_2^T\Delta^{22})X_2.$$
Let us suppose that $X_2$ represents the set of predictors with no regression information about $Y$, in the sense that $X_2 \mid Y$ has the same distribution as $X_2$. Consequently, we have $\Gamma_2 = 0$.
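In generic terms (the explicit expression of the likelihood ratio and its degrees of freedom depend on the covariance structure assumed for $\Delta$ and are derived in the paper), the hypothesis this partition sets up and the associated test statistic take the form
$$ H_0:\ \Gamma_2 = 0 \quad \text{versus} \quad H_1:\ \Gamma_2 \neq 0, \qquad \Lambda = 2\bigl\{\hat{\ell}_{H_1} - \hat{\ell}_{H_0}\bigr\} \ \stackrel{d}{\longrightarrow}\ \chi^2_q, $$
where $\hat{\ell}_{H_0}$ and $\hat{\ell}_{H_1}$ are the maximized PFC log-likelihoods under the constrained and unconstrained models and $q$ is the difference in their numbers of free parameters.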

Numerical studies

We illustrate the performance of the sequential likelihood ratio test for sparse sufficient reduction estimation and variable selection with PFC on two datasets and also through a simulation study. With the first dataset, the performance of the method is evaluated when the assumption of conditional independence is violated. The second dataset is a case where the sufficient reduction methodology leads to fitting a linear regression model and its related shrinkage methodologies for variable

Discussions

We have presented a sequential likelihood ratio test to obtain a sparse estimate of the sufficient reduction of the data with PFC in a high dimensional setup where the relationship between the active predictors and the response is nonlinear. The sparse sufficient reduction also identifies the active predictors, those relevant in explaining the response.

The sparse sufficient reduction can be readily carried into a forward model for prediction or classification. With the reduction of the

References

  • Wang, T., et al., 2013. Sparse sufficient dimension reduction using optimal scoring. Comput. Statist. Data Anal.
  • Adragni, K.P., et al., 2008. Discussion on the sure independence screening for ultrahigh dimensional feature space of Jianqing Fan and Jinchi Lv (2007). J. R. Stat. Soc. Ser. B.
  • Adragni, K.P., et al., 2009. Sufficient dimension reduction and prediction in regression. Phil. Trans. R. Soc. A.
  • Adragni, K.P., Raim, A., 2014. ldr: Methods for likelihood-based dimension reduction in regression. R package version...
  • Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics.
  • Candès, E., Tao, T., 2007. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist.
  • Chen, X., et al., 2010. Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Statist.
  • Chun, H., et al., 2010. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B Stat. Methodol.
  • Chung, D., Chun, H., Keles, S., 2012. spls: Sparse Partial Least Squares (SPLS) Regression and Classification. R...
  • Cook, R.D., 1998. Regression Graphics: Ideas for Studying Regression Through Graphics.
  • Cook, R.D., 2004. Testing predictor contributions in sufficient dimension reduction. Ann. Statist.
  • Cook, R.D., 2007. Fisher lecture: Dimension reduction in regression. Statist. Sci.
  • Cook, R.D., Forzani, L., 2008. Principal fitted components for dimension reduction in regression. Statist. Sci.
  • Cook, R.D., et al., 2005. Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. J. Amer. Statist. Assoc.
  • Cook, R.D., Weisberg, S., 1982. Residuals and Influence in Regression.