Principal component regression analysis with SPSS

https://doi.org/10.1016/S0169-2607(02)00058-5

Abstract

The paper introduces the indices used to diagnose multicollinearity, the basic principle of principal component regression, and the method for determining the 'best' equation. A worked example shows how to carry out principal component regression analysis with SPSS 10.0, covering all calculation steps of the principal component regression and all operations of the Linear Regression, Factor Analysis, Descriptives, Compute Variable and Bivariate Correlations procedures in SPSS 10.0. Principal component regression analysis can be used to overcome the disturbance of multicollinearity. Performing the analysis with SPSS makes it simpler, faster and more accurate.

Introduction

In multivariate analysis, the least-squares method is generally adopted to fit a multiple linear regression model, but the least-squares estimates are sometimes far from perfect. One important cause is that the column vectors of the matrix X are close to being linearly dependent. An approximate linear relationship among the independent variables is called multicollinearity. When multicollinearity exists among the independent variables, the sign and value of an estimated regression coefficient tend to be inconsistent with the expected ones.

The index most often used to judge collinearity is the simple correlation coefficient: when the simple correlation coefficient between two independent variables is large, collinearity is suspected. Apart from the simple correlation coefficient, SPSS provides two collinearity statistics ([1], pp. 221): tolerance and the variance inflation factor (VIF). Tolerance = 1 − R_i^2, where R_i^2 is the squared multiple correlation of the ith variable with the other independent variables. When the tolerance is small (close to 0), the variable is almost a linear combination of the other independent variables. VIF is the reciprocal of tolerance, so variables with low tolerance have large VIF; low tolerance and large VIF suggest collinearity. Eigenvalues, condition indices and variance proportions are also indices for collinearity diagnosis ([1], pp. 229–230). Eigenvalues indicate how many distinct dimensions there are among the independent variables; when several eigenvalues are close to 0, the variables are highly intercorrelated and the matrix is said to be ill-conditioned. Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity. Variance proportions are the proportions of the variance of each coefficient estimate accounted for by the principal component associated with each eigenvalue. A component associated with a high condition index that contributes substantially to the variance of two or more variables marks those variables as highly intercorrelated.

Principal component regression combines linear regression with principal component analysis ([2], pp. 327–332). Principal component analysis gathers highly correlated independent variables into principal components, and all principal components are independent of each other; in effect it transforms a set of correlated variables into a set of uncorrelated principal components. Regression equations are then built on the uncorrelated principal components, and the 'best' equation is chosen according to the principle of maximum adjusted R^2 and minimum standard error of the estimate. Finally the 'best' equation is transformed back into an ordinary linear regression equation in the original variables. The present paper demonstrates how the multicollinearity problem is solved by carrying out principal component regression with SPSS 10.0 [3].
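
To make these diagnostics concrete, the following is a minimal sketch of how tolerance, VIF, eigenvalues and condition indices could be computed outside SPSS. It uses NumPy; the function name and interface are illustrative assumptions, not part of the paper's procedure.

```python
# Collinearity diagnostics sketch (illustrative; not the paper's SPSS steps).
import numpy as np

def collinearity_diagnostics(X):
    """X: (n, p) array of independent variables."""
    n, p = X.shape
    # Standardize columns so the diagnostics are scale-free.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = (Z.T @ Z) / (n - 1)            # correlation matrix of the predictors

    # Tolerance_i = 1 - R_i^2 and VIF_i = 1 / tolerance_i; both follow
    # from the diagonal of the inverse correlation matrix.
    vif = np.diag(np.linalg.inv(R))
    tolerance = 1.0 / vif

    # Eigenvalues of R, descending; condition indices are
    # sqrt(lambda_max / lambda_j), flagged at 15 (possible) and 30 (serious).
    eigvals = np.linalg.eigvalsh(R)[::-1]
    cond_index = np.sqrt(eigvals[0] / eigvals)
    return tolerance, vif, eigvals, cond_index
```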

Basic principle and formulas

(1) Carry out a stepwise regression of the dependent variable Y on all independent variables X to obtain the p independent variables with statistical significance (P<0.05) and to reveal whether these p independent variables exhibit multicollinearity.

(2) Carry out a principal component analysis with the p independent variables to transform the set of correlated variables into a set of uncorrelated principal components and to indicate the information quantity carried by different sets of principal components… (these steps are sketched in code below).
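
A compact sketch of the principal component regression core in NumPy, covering step (2) and the back-transformation to the original variables (the stepwise pre-selection of step (1) is omitted). The paper performs these steps through SPSS procedures, so the function below is an illustrative reconstruction, not the paper's code.

```python
# Principal component regression sketch (illustrative reconstruction).
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of standardized X,
    then map the coefficients back to the original variables."""
    n, p = X.shape
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_std                      # standardized predictors

    # Principal components from the correlation matrix R = Z'Z / (n - 1).
    R = (Z.T @ Z) / (n - 1)
    eigvals, V = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]             # descending eigenvalues
    V_k = V[:, order[:k]]                         # loadings of first k PCs
    C = Z @ V_k                                   # mutually uncorrelated scores

    # Least-squares fit of y on the k components (plus an intercept).
    g, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), C]), y, rcond=None)

    # Back-transform: coefficients on Z, then on the raw X scale.
    beta_std = V_k @ g[1:]
    b = beta_std / x_std
    b0 = y.mean() - b @ x_mean
    return b0, b
```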

Example

Between 1951 and 1998 (the data for 1969 and 1986 are not indexed), the yearly mortality (per 100 000) due to traffic accidents in mainland China, the quantity (10 000 vehicles) of motor vehicles, the quantity (10 000 tons) of freight transport, the quantity (10 000 persons) of passenger transport, the mileage (10 000 km) driven by motor vehicles on formal highways, and the mileage (10 000 km) driven by motor vehicles on informal highways are expressed, respectively, as the dependent variable Y and
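
A hypothetical run on data organized like this example could look as follows, reusing the sketches above. The CSV file name and column names are invented for illustration, and the published yearly values are not reproduced here.

```python
# Hypothetical usage on data shaped like the paper's example
# (file and column names are assumptions).
import pandas as pd

df = pd.read_csv("china_traffic_1951_1998.csv")    # assumed local data file
y = df["mortality"].to_numpy()                     # deaths per 100 000
X = df[["motors", "freight", "passengers",
        "formal_mileage", "informal_mileage"]].to_numpy()

tolerance, vif, eigvals, cond_index = collinearity_diagnostics(X)
print("tolerance:", tolerance, "VIF:", vif)
print("condition indices:", cond_index)

b0, b = pcr_fit(X, y, k=3)    # three components retained, as in the example
print("intercept and coefficients:", b0, b)
```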

Discussion

Not only can principal component regression analysis overcome the disturbance of collinearity and expose the real face of the facts (e.g. b1 = −7.52×10^−4 being corrected to b1 = 0.00149 through principal component regression analysis indicates a positive correlation between the mortality of traffic accidents and the quantity of motor vehicles, which is in accordance with the facts), but the original information is not lost either (Table 4 shows that the cumulative variance proportion with three principal
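
The 'best' equation referred to earlier is the one with maximum adjusted R^2 and minimum standard error of the estimate; the two criteria agree, since the adjusted R^2 is a decreasing function of the residual variance. A sketch of that selection over the number of retained components, reusing the pcr_fit helper above (again an illustration, not the paper's SPSS procedure):

```python
# 'Best equation' selection sketch: scan k = 1..p and keep the fit with
# the largest adjusted R^2 (equivalently, the smallest SE of the estimate).
import numpy as np

def choose_best_k(X, y):
    """Return (k, adjusted R^2, SE of estimate) for the best k,
    assuming pcr_fit from the sketch above."""
    n, p = X.shape
    sst = ((y - y.mean()) ** 2).sum()
    best = None
    for k in range(1, p + 1):
        b0, b = pcr_fit(X, y, k)
        resid = y - (b0 + X @ b)       # fitted values equal the PC-score fit
        sse = resid @ resid
        df_resid = n - k - 1           # k components plus an intercept
        adj_r2 = 1.0 - (sse / df_resid) / (sst / (n - 1))
        if best is None or adj_r2 > best[1]:
            best = (k, adj_r2, np.sqrt(sse / df_resid))
    return best
```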

References (3)

  • SPSS Inc., SPSS Base 10.0 Applications Guide, SPSS Inc., USA, ...
