Influential data cases when the Cp criterion is used for variable selection in multiple linear regression

https://doi.org/10.1016/j.csda.2005.02.003

Abstract

The influence of individual data cases on variable selection with the Cp criterion in multiple linear regression analysis is studied in terms of both the predictive power and the predictor variables included in the selected model. In particular, the focus is on the importance of identifying and dealing with such so-called selection influential data cases before model selection and fitting are performed. A new selection influence measure based on the Cp criterion is developed to identify selection influential data cases, and its success in doing so is evaluated on two example data sets.

Introduction

Multiple linear regression analysis is a widely used and well documented statistical procedure. Two aspects of regression analysis which have been particularly well investigated are identifying and dealing with influential data cases, and selecting a subset of the explanatory variables for use in the regression function. Standard references on the first issue include Cook (1977), Belsley et al. (1980) and Atkinson and Riani (2000), while Burnham and Anderson (2002) and Miller (2002) provide recent overviews of the second issue.

Although influential data cases and variable selection have separately been dealt with extensively in the literature, relatively little has been published on the combination of these two problems. We briefly refer to some of the relevant references. Chatterjee and Hadi (1988) propose measuring the effect of simultaneous omission of a variable and an observation from the data set in terms of changes in the values of the least squares regression coefficients, the residual sum of squares, the fitted values, and the predicted value of the omitted observation. Peixoto and Lamotte (1989) investigate a procedure which adds a dummy variable for each observation to the explanatory variables. Variable selection is then performed, and observations corresponding to selected dummy variables are pronounced to be influential. Léger and Altman (1993) identify conditional and unconditional approaches to the problem of identifying influential data cases in a variable selection context. In the conditional approach the full data set is used to select a set of explanatory variables, and case diagnostics are then calculated conditional on this model, i.e., the set of selected variables remains fixed. In the unconditional approach we apply variable selection to the full data set and calculate a vector of fitted values; we then omit the data case under consideration from the data set and repeat the variable selection as well as the calculation of the vector of fitted values; finally, a standardised distance between the two vectors of fitted values is calculated to measure the influence of the omitted case. Léger and Altman (1993) argue that the unconditional approach is preferable since it explicitly takes the variable selection into account when trying to quantify the influence of a given data case. Arguing along similar lines, Hoeting et al. (1996) point out that the model which is selected can depend upon the order in which variable selection and outlier identification are carried out. They therefore propose a Bayesian method which can be used to simultaneously select variables and identify outliers.
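The unconditional approach described above can be sketched in code. The following is a simplified illustration, not the standardised measure of Léger and Altman (1993): it reselects the Cp-minimising submodel with each case deleted and compares fitted values on the retained cases using a plain scaled squared distance; the function names and the distance are assumptions for illustration only.

```python
import numpy as np
from itertools import combinations

def cp(rss, sigma2, n, p):
    """Mallows' Cp for a submodel with p fitted parameters."""
    return rss / sigma2 - n + 2 * p

def best_subset_fit(X, y, sigma2):
    """Fit every non-empty predictor subset (plus intercept) and
    return the fitted values of the Cp-minimising submodel."""
    n, m = X.shape
    best_score, best_fit = np.inf, None
    for k in range(1, m + 1):
        for cols in combinations(range(m), k):
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            fitted = Xs @ beta
            rss = np.sum((y - fitted) ** 2)
            score = cp(rss, sigma2, n, Xs.shape[1])
            if score < best_score:
                best_score, best_fit = score, fitted
    return best_fit

def unconditional_influence(X, y, sigma2):
    """Leave-one-out selection influence: re-run variable selection
    with each case deleted and compare the two fitted-value vectors
    on the retained cases (simple unstandardised distance)."""
    n = len(y)
    full_fit = best_subset_fit(X, y, sigma2)
    dists = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        fit_i = best_subset_fit(X[keep], y[keep], sigma2)
        dists[i] = np.sum((full_fit[keep] - fit_i) ** 2) / sigma2
    return dists
```

A large value of `dists[i]` flags case `i` as one whose deletion changes the selected model or its fit appreciably.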

In this paper we restrict attention to variable selection using the Cp statistic proposed by Mallows (1973). Our contribution is the introduction of a new p-value based procedure for identifying influential data cases in this context. Weisberg (1981) shows how the Cp statistic can be written as a sum of n terms (where n is the number of data cases), with each term in the sum corresponding to one of the n cases. In Section 2 of this paper we provide a brief exposition of the coordinate free approach to linear model selection, and in Section 3 we show that the decomposition of the Cp statistic described by Weisberg (1981) can also be formulated within the coordinate free approach. Section 4 of the paper is devoted to a discussion of the p-value based procedure for identification of influential data cases in a variable selection context, and Section 5 contains two examples illustrating application of the procedure. We close in Section 6 with conclusions and open questions.
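For reference, Mallows' Cp for a submodel with $p$ fitted coefficients is conventionally computed as $C_p = \mathrm{RSS}_p/\hat\sigma^2 - n + 2p$, with $\hat\sigma^2$ estimated from the full model; a submodel with little bias should have Cp close to $p$. A minimal sketch:

```python
import numpy as np

def mallows_cp(rss_sub, sigma2_full, n, p):
    """Mallows' Cp for a submodel: RSS_p / sigma^2 - n + 2p.
    A roughly unbiased submodel should have Cp close to p."""
    return rss_sub / sigma2_full - n + 2 * p

# Toy check: with n = 20 cases, 3 fitted parameters, and a residual
# sum of squares of 17 against a full-model variance estimate of 1,
# Cp = 17 - 20 + 6 = 3, equal to p.
print(mallows_cp(17.0, 1.0, 20, 3))
```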

Section snippets

A coordinate free approach to linear model selection

The coordinate free approach to variable selection in multiple linear regression analysis offers the advantage that the results which are obtained can also be applied in a wider linear model context. In this section we briefly indicate how the Cp statistic can be derived within this framework. The interested reader is referred to Chapter 4 of Arnold (1981) for a more detailed discussion.

Consider the standard normal linear model
$$Y = \mu + \varepsilon \sim N_n\left(\mu, \sigma^2 I_n\right),$$
where $Y$ is the $n$-component response vector, $\mu = \mu_1, \mu_2, \ldots$

Expanding the Cp statistic

We start this section by showing how the ESEE in (3) can be expressed as a sum of $n$ terms. Consider in this regard the random vector $Z$ defined by $\sigma Z = Y - \mu$. Then $Z \sim N_n(0, I_n)$, and we find the following expression for the ESEE corresponding to a given subspace $L$:
$$E\|P_L Y - \mu\|^2 = E\|\sigma P_L Z - P_{M|L}\mu\|^2 = \sigma^2 E\|P_L Z\|^2 + \|P_{M|L}\mu\|^2.$$
Let $u_1, u_2, \ldots, u_n$ be the standard orthonormal basis for $\mathbb{R}^n$. The first term in (7) may be written as
$$\sigma^2 E\|P_L Z\|^2 = \sigma^2 \sum_{i=1}^n E\langle u_i, P_L Z\rangle^2 = \sigma^2 \sum_{i=1}^n E\langle P_L u_i, Z\rangle^2 = \sigma^2 \sum_{i=1}^n \|P_L u_i\|^2 + \sigma^2 \sum_{i=1}^n \langle P_L u_i, E(Z)\rangle^2 = \sigma^2 \sum_{i=1}^n \|P_L u_i\|^2,$$
where $\langle\cdot,\cdot\rangle$ denotes
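The final identity above is easy to check numerically: for an orthogonal projection $P_L$, the quantity $\|P_L u_i\|^2$ equals the $i$th diagonal element of the projection matrix (the $i$th leverage), and the $n$ terms sum to $\dim(L) = \operatorname{trace}(P_L)$. A minimal sketch, using an arbitrary illustrative design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 4
X = rng.normal(size=(n, p))
P = X @ np.linalg.pinv(X)  # orthogonal projection onto L = col(X)

# ||P_L u_i||^2 is the squared norm of the i-th column of P
norms = np.array([np.sum(P[:, i] ** 2) for i in range(n)])

# Since P is symmetric and idempotent, ||P u_i||^2 = P_ii (the leverage)
assert np.allclose(norms, np.diag(P))

# The n terms sum to trace(P) = dim(L) = p for a full-rank design
assert np.isclose(norms.sum(), np.trace(P))
```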

Identifying selection influential data cases

Consider the statistic (14) written in the form
$$\sum_{i=1}^n C_p(i,j),$$
where the subscript $j$ indexes the different subspaces. Let $t = 2^m - 1$ denote the total number of possible subspaces (models) that can be considered for selection, with $m$ denoting the total number of predictors in a multiple linear regression setup. If we use the $C_p$ statistic for variable selection, we effectively identify the subspace $L$ minimizing (15). We can therefore think of the $nt$ values $C_p(i,j)$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, t$, as the basic data
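The $n \times t$ layout of per-case values can be sketched as follows. This is an illustrative assumption, not the paper's exact formula: since $C_p = \mathrm{RSS}/\hat\sigma^2 - n + 2p$, with $\mathrm{RSS} = \sum_i \hat e_i^2$, $n = \sum_i 1$, and $p = \sum_i h_{ii}$ (the trace of the hat matrix), one natural per-case split in the spirit of Weisberg (1981) is $C_p(i,j) = \hat e_{ij}^2/\hat\sigma^2 - 1 + 2h_{ii,j}$, which sums over $i$ to the usual $C_p$ of submodel $j$.

```python
import numpy as np
from itertools import combinations

def per_case_cp_matrix(X, y, sigma2):
    """Per-case Cp contributions Cp(i, j) for every non-empty
    predictor subset j (intercept always included).
    Returns the n x t matrix and the list of t = 2^m - 1 subsets."""
    n, m = X.shape
    subsets = [c for k in range(1, m + 1)
               for c in combinations(range(m), k)]
    C = np.empty((n, len(subsets)))
    for j, cols in enumerate(subsets):
        Xs = np.column_stack([np.ones(n), X[:, cols]])
        H = Xs @ np.linalg.pinv(Xs)  # hat (projection) matrix
        e = y - H @ y                # residuals under subset j
        h = np.diag(H)               # leverages h_ii under subset j
        # per-case split: summing over i recovers RSS/sigma2 - n + 2p
        C[:, j] = e ** 2 / sigma2 - 1.0 + 2.0 * h
    return C, subsets
```

Column sums of `C` reproduce the ordinary Cp values, so the Cp-minimising subspace is `subsets[np.argmin(C.sum(axis=0))]`.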

The fuel data

Consider the fuel data (Weisberg, 1985, pp. 35–36, 126). There are $n = 50$ cases (one for each of the 50 states in the USA). The response is the 1972 fuel consumption in gallons per person, while the $m = 4$ predictor variables refer to various characteristics of the 50 states. We calculated the values of $p_{ij}(C)$ in (17) and $p_{ij}(D)$ in (20) (using $\hat\sigma^2 = 7{,}452$) for this data set, and the entries in Table 1 as well as the first two graphs in Fig. 1 summarize the results. The first four columns in Table 1

Conclusions and open questions

In this paper we indicated how Mallows’ Cp statistic for a given subset of predictor variables can be expressed as a sum of n terms, each term corresponding to one of the data cases. A basic problem arising from this representation of the Cp statistic is how to decide whether a specific term in such a representation is significantly small or large, which would serve as an indication that the data case concerned is selection influential with respect to the subset concerned. Our proposal for

