Influential data cases when the Cp criterion is used for variable selection in multiple linear regression
Introduction
Multiple linear regression analysis is a widely used and well documented statistical procedure. Two aspects of regression analysis which have been particularly well investigated are identifying and dealing with influential data cases, and selecting a subset of the explanatory variables for use in the regression function. Standard references on the first issue include Cook (1977), Belsley et al. (1980) and Atkinson and Riani (2000), while Burnham and Anderson (2002) and Miller (2002) provide recent overviews of the second issue.
Although influential data cases and variable selection have separately been extensively dealt with in the literature, relatively little has been published on investigations into a combination of these two problems. We briefly refer to some of the relevant references. Chatterjee and Hadi (1988) propose measuring the effect of simultaneous omission of a variable and an observation from the data set in terms of changes in the values of the least squares regression coefficients, the residual sum of squares, the fitted values, and the predicted value of the omitted observation. Peixoto and Lamotte (1989) investigate a procedure which adds a dummy variable for each observation to the explanatory variables. Variable selection is then performed, and observations corresponding to selected dummy variables are pronounced to be influential. Léger and Altman (1993) identify conditional and unconditional approaches to the problem of identifying influential data cases in a variable selection context. In the conditional approach the full data set is used to select a set of explanatory variables, and case diagnostics are then calculated conditional on this model, i.e., the set of selected variables remains fixed. In the unconditional approach we apply variable selection to the full data set and calculate a vector of fitted values; we then omit the data case under consideration from the data set and repeat the variable selection as well as the calculation of the vector of fitted values; finally, a standardised distance between the two vectors of fitted values is calculated to measure the influence of the omitted case. Léger and Altman (1993) argue that the unconditional approach is preferable since it explicitly takes the variable selection into account when trying to quantify the influence of a given data case. Arguing along similar lines, Hoeting et al. 
(1996) point out that the model which is selected can depend upon the order in which variable selection and outlier identification are carried out. They therefore propose a Bayesian method which can be used to simultaneously select variables and identify outliers.
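The unconditional leave-one-out procedure described above can be sketched in code. This is a minimal illustration, not the authors' implementation: it uses Mallows' Cp as the selection criterion, always keeps the intercept, and standardises the distance between fitted-value vectors in a Cook-type way by p·σ̂² (that standardisation choice, and all function names, are our assumptions).

```python
import itertools
import numpy as np

def select_cp(X, y):
    """Return the column set (intercept kept) minimizing Mallows' Cp."""
    n, k = X.shape
    # Residual variance estimate from the full model.
    s2 = float(y @ (y - X @ np.linalg.pinv(X) @ y)) / (n - k)
    best_cols, best_cp = None, np.inf
    for r in range(k):  # all subsets of the k-1 non-intercept predictors
        for s in itertools.combinations(range(1, k), r):
            cols = (0,) + s
            b = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            resid = y - X[:, cols] @ b
            cp = float(resid @ resid) / s2 - n + 2 * len(cols)
            if cp < best_cp:
                best_cp, best_cols = cp, cols
    return best_cols

def unconditional_influence(X, y, i, p_std, s2):
    """Leger-Altman-style unconditional influence of case i: repeat the
    variable selection without case i and compare the two fitted vectors."""
    cols_a = select_cp(X, y)
    b_a = np.linalg.lstsq(X[:, cols_a], y, rcond=None)[0]
    keep = np.arange(len(y)) != i
    cols_b = select_cp(X[keep], y[keep])
    b_b = np.linalg.lstsq(X[keep][:, cols_b], y[keep], rcond=None)[0]
    d = X[:, cols_a] @ b_a - X[:, cols_b] @ b_b   # fits for all n cases
    return float(d @ d) / (p_std * s2)            # Cook-type standardization

# Toy data: n = 30 cases, intercept plus three predictors (one active).
rng = np.random.default_rng(2)
n, m = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, m))])
y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + rng.standard_normal(n)
s2 = float(y @ (y - X @ np.linalg.pinv(X) @ y)) / (n - X.shape[1])
infl = [unconditional_influence(X, y, i, m + 1, s2) for i in range(n)]
```

Large values of `infl[i]` flag cases whose omission changes either the selected subset or the fitted values appreciably.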
In this paper we restrict attention to variable selection using the Cp statistic proposed by Mallows (1973). Our contribution is the introduction of a new p-value based procedure for identifying influential data cases in this context. Weisberg (1981) shows how the Cp statistic can be written as a sum of n terms (where n is the number of data cases), with each term in the sum corresponding to one of the n cases. In Section 2 of this paper we provide a brief exposition of the coordinate free approach to linear model selection, and in Section 3 we show that the breakup of the Cp statistic described by Weisberg (1981) can also be formulated within the coordinate free approach. Section 4 of the paper is devoted to a discussion of the p-value based procedure for identification of influential data cases in a variable selection context, and Section 5 contains two examples illustrating application of the procedure. We close in Section 6 with conclusions and open questions.
A coordinate free approach to linear model selection
The coordinate free approach to variable selection in multiple linear regression analysis offers the advantage that the results which are obtained can also be applied in a wider linear model context. In this section we briefly indicate how the Cp statistic can be derived within this framework. The interested reader is referred to Chapter 4 of Arnold (1981) for a more detailed discussion.
Consider the standard normal linear model Y = μ + ε, where Y is the n-component response vector, μ is the unknown mean vector assumed to lie in a known subspace V of R^n, and ε ~ N(0, σ²I)
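The displays stripped from this snippet can be reconstructed along standard coordinate free lines. As a hedged sketch (notation ours), the model and the Cp criterion for a candidate subspace L of V with orthogonal projection P_L are:

```latex
Y = \mu + \varepsilon, \qquad \mu \in V \subseteq \mathbb{R}^n, \qquad
\varepsilon \sim N_n\!\left(0, \sigma^2 I_n\right),
\qquad\text{and}\qquad
C_p(L) = \frac{\lVert Y - P_L Y \rVert^2}{\hat{\sigma}^2} - n + 2\dim(L),
```

where \(\hat{\sigma}^2\) denotes the usual residual variance estimate under the full subspace V. Minimizing \(C_p(L)\) over candidate subspaces recovers all-subsets selection as a special case.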
Expanding the Cp statistic
We start this section by showing how the expected squared estimation error (ESEE) in (3) can be expressed as a sum of n terms. Consider in this regard a suitably defined random vector; we then find the expression (7) for the ESEE corresponding to a given subspace L. Let e1, …, en be the standard orthonormal basis for R^n. The first term in (7) may be written as a sum over these basis vectors, with each summand attached to one of the n cases
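The case-wise breakup referred to here can be illustrated numerically. One common form of Weisberg's (1981) allocation writes the i-th term as e_i²/σ̂² + 2h_ii − 1, with e_i and h_ii the residual and leverage under the candidate submodel and σ̂² the full-model variance estimate; the sketch below (our notation, a hedged reconstruction rather than the paper's own derivation) checks that these n terms sum to the usual Cp.

```python
import numpy as np

def cp_case_terms(X_sub, X_full, y):
    """Case-wise terms of Mallows' Cp for the submodel with design X_sub.

    One form of the breakup: the i-th term is e_i^2 / s2 + 2*h_ii - 1,
    with e_i and h_ii the residual and leverage under the submodel and
    s2 the residual variance estimate from the full model.
    """
    n = len(y)
    H = X_sub @ np.linalg.pinv(X_sub)            # hat (projection) matrix
    e = y - H @ y                                # submodel residuals
    H_full = X_full @ np.linalg.pinv(X_full)
    s2 = float(y @ (y - H_full @ y)) / (n - X_full.shape[1])
    return e ** 2 / s2 + 2.0 * np.diag(H) - 1.0

# Toy data: n = 40 cases, intercept plus three predictors.
rng = np.random.default_rng(0)
X_full = np.column_stack([np.ones(40), rng.standard_normal((40, 3))])
y = X_full @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.standard_normal(40)
X_sub = X_full[:, :3]                            # candidate: drop last predictor

terms = cp_case_terms(X_sub, X_full, y)

# The n terms sum to the usual Cp = RSS_p / s2 - n + 2p.
n, p = X_sub.shape
s2 = float(y @ (y - X_full @ np.linalg.pinv(X_full) @ y)) / (n - X_full.shape[1])
rss_p = float(np.sum((y - X_sub @ np.linalg.lstsq(X_sub, y, rcond=None)[0]) ** 2))
cp = rss_p / s2 - n + 2 * p
```

The decomposition works because the residual sum of squares splits over cases, the leverages h_ii sum to p, and the constant −1 contributes −n in total.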
Identifying selection influential data cases
Consider the Cp statistic (14) written in the form (15), where the subscript indexes the different subspaces. Let t = 2^m denote the total number of possible subspaces (models) that can be considered for selection, with m denoting the total number of predictors in a multiple linear regression setup. If we use the Cp statistic for variable selection, we effectively identify the subspace L minimizing (15). We can therefore think of the nt values, one for each combination of case and subspace, as the basic data
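The scoring of all t = 2^m candidate models can be sketched as follows. This is an illustrative all-subsets enumeration under our own conventions (intercept always kept, our function names), not the paper's code; the n×t array of case-wise terms discussed in the text would be obtained by allocating each of these t scores across the n cases.

```python
import itertools
import numpy as np

def mallows_cp(X_sub, X_full, y):
    """Mallows' Cp = RSS_p / s2 - n + 2p for the submodel X_sub."""
    n = len(y)
    # Residual variance estimate from the full model.
    s2 = float(y @ (y - X_full @ np.linalg.pinv(X_full) @ y)) / (n - X_full.shape[1])
    resid = y - X_sub @ np.linalg.lstsq(X_sub, y, rcond=None)[0]
    return float(resid @ resid) / s2 - n + 2 * X_sub.shape[1]

# Toy data: n = 50 cases, m = 3 predictors, only predictor 2 active.
rng = np.random.default_rng(1)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, m))])
y = X @ np.array([1.0, 0.0, 2.0, 0.0]) + rng.standard_normal(n)

# t = 2^m candidate subspaces (intercept always kept in this sketch).
subsets = [s for r in range(m + 1)
           for s in itertools.combinations(range(1, m + 1), r)]
scores = {s: mallows_cp(X[:, (0,) + s], X, y) for s in subsets}
best = min(scores, key=scores.get)               # subspace minimizing Cp
```

For realistic m the 2^m enumeration grows quickly, which is why branch-and-bound or stepwise shortcuts are often used in practice.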
The fuel data
Consider the fuel data (Weisberg, 1985, pp. 35–36, 126). There are n = 50 cases (one for each of the 50 states in the USA). The response is the 1972 fuel consumption in gallons per person, while the predictor variables refer to various characteristics of the 50 states. We calculated the values in (17) and in (20) for this data set, and the entries in Table 1 as well as the first two graphs in Fig. 1 summarize the results. The first four columns in Table 1
Conclusions and open questions
In this paper we indicated how Mallows’ Cp statistic for a given subset of predictor variables can be expressed as a sum of n terms, each term corresponding to one of the data cases. A basic problem arising from this representation of the statistic is how to decide whether a specific term in such a representation is significantly small or large, which would serve as an indication that the data case concerned is selection influential with respect to the subset concerned. Our proposal for
References (15)
- Chatterjee, S., Hadi, A.S. (1988). Impact of simultaneous omission of a variable and an observation on a linear regression equation. Comput. Statist. & Data Anal.
- Hoeting, J., Raftery, A.E., Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. Comput. Statist. & Data Anal.
- Peixoto, J.L., LaMotte, L.R. (1989). Simultaneous identification of outliers and predictors using variable selection techniques. J. Statist. Planning and Inference.
- Arnold, S.F. (1981). The Theory of Linear Models and Multivariate Analysis.
- Atkinson, A.C., Riani, M. (2000). Robust Diagnostic Regression Analysis.
- Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics.
- Burnham, K.P., Anderson, D.R. (2002). Model Selection and Inference: A Practical Information-theoretic Approach.