Selection and validation of parameters in multiple linear and principal component regressions

https://doi.org/10.1016/j.envsoft.2007.04.012

Abstract

This paper aims to select statistically valid regression parameters using multiple linear and principal component regression models. The selection methods were: (i) backward elimination based on the confidence interval limits; (ii) backward elimination based on the correlation coefficient; (iii) forward selection based on the correlation coefficient; (iv) forward selection based on the sum of square errors; and (v) combinations of all variables. For the purpose of the work, a case study was considered. The case study focused on the determination of the parameters that influence the concentration of tropospheric ozone. The explanatory variables were meteorological data (temperature, relative humidity, wind speed, wind direction and solar radiation), and environmental data (nitrogen oxides and ozone concentrations of the previous day). The results showed that each selection method led to different multiple linear regression models, as a consequence of the collinearities between explanatory variables. Such collinearities can be removed by pre-processing the explanatory data set, through the application of principal component analysis. The application of this procedure allowed the achievement of the same regression model using all selection methods.

Introduction

Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data. The dependent variable ($y$) is given by:

$$y = \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i x_i + \varepsilon$$

where $x_i$ ($i = 1, \ldots, k$) are the explanatory (independent) variables, $\hat{\beta}_i$ ($i = 0, \ldots, k$) are the regression coefficients, and $\varepsilon$ is the error associated with the regression, assumed to be normally distributed with zero expectation and constant variance (Agirre-Basurko et al., 2006).

The predicted value given by the regression model ($\hat{y}$) is calculated by:

$$\hat{y} = \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i x_i$$

The most common method to estimate the regression parameters $\hat{\beta}_i$ is the minimization of the sum of square errors (SSE). The equation is as follows:

$$\hat{\beta}_i = \arg\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
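The least-squares estimate can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `fit_mlr` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_mlr(X, y):
    """Return (beta_hat, sse) for y ~ beta_0 + sum_i beta_i * x_i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_hat[0] is the intercept.
    A = np.column_stack([np.ones(len(y)), X])
    # lstsq minimises the sum of square errors ||y - A @ beta||^2.
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta_hat
    return beta_hat, float(residuals @ residuals)


# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2 (illustration only).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

beta_hat, sse = fit_mlr(X, y)
print("coefficients:", np.round(beta_hat, 6))
print("SSE:", round(sse, 6))
```

Because the data contain no noise, the fitted coefficients should recover the generating values and the SSE should be numerically zero.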

MLR is one of the most widely used methods for forecasting. It is used to fit observed data and to build predictive models in many research fields, such as biology (Khan et al., 2006, Mercaldo-Allen et al., 2006, Smith and Wachob, 2006), medicine (Sanchez-Ortuno et al., 2006, Andersen et al., 2005, Dorbala et al., 2006), psychology (Ansiau et al., 2005), economics (Mohamed and Bodger, 2005, Singh, 2006) and environmental science (Sousa et al., 2006, Turalıoğlu et al., 2006, Lu and Chang, 2005, Friis et al., 2005, Goyal et al., 2006).

Environmental science is a research field in which MLR has many applications. Sousa et al. (2006) used the MLR approach to predict hourly mean ozone concentrations, comparing it with feedforward artificial neural networks (FANN) and time series (TS) modelling. In that study, MLR models performed well in the development step, but in the validation step FANN presented lower residual errors. Turalıoğlu et al. (2006) used MLR to determine the relationship between daily average total suspended particulate (TSP) and sulphur dioxide (SO2) concentrations and meteorological parameters (temperature, wind speed, relative humidity, pressure and precipitation) in Erzurum, Turkey. Pollutant concentrations were strongly related to temperature and significantly correlated with wind speed and pressure, whereas precipitation and humidity were only weakly correlated with SO2 and TSP. MLR was also used by Goyal et al. (2006) to forecast the daily averaged concentration of respirable suspended particulate matter (RSPM) in Delhi and Hong Kong based on some meteorological factors. The results were compared with an autoregressive integrated moving average (ARIMA) time series model and with the combination of the two models. The combination of MLR and ARIMA performed better than either MLR or ARIMA alone.

The studies described above did not, however, consider the dependence between variables before applying MLR. When explanatory variables are correlated with each other, MLR presents some drawbacks, because high correlations between predictor variables can hinder a correct analysis. The dependence between the explanatory variables can be removed through the application of principal component analysis (PCA).

PCA creates new variables, the principal components (PC), that are orthogonal and uncorrelated. These variables are linear combinations of the original variables. The PC are ordered in such a way that the first component has the largest fraction of the original data variability (Abdul-Wahab et al., 2005, Wang and Xiao, 2004, Sousa et al., 2007). To evaluate the influence of each variable in the PC, varimax rotation is generally used to obtain the rotated factor loadings that represent the contribution of each variable in a specific principal component.
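The construction of the principal components can be sketched as follows. This is a minimal illustration, assuming NumPy and an eigendecomposition of the correlation matrix of standardised data; the function name `pca` and the synthetic data are hypothetical, and the varimax rotation mentioned above is not included.

```python
import numpy as np


def pca(X):
    """Return (eigenvalues, loadings, scores) for the standardised columns of X."""
    X = np.asarray(X, dtype=float)
    # Standardise each variable to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigendecomposition of the (symmetric) correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    # eigh returns ascending eigenvalues; reorder so the first PC explains most variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each PC is a linear combination of the standardised original variables.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores


# Synthetic data: the first two columns are strongly collinear (illustration only).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t + 0.1 * rng.normal(size=200),
    2.0 * t + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

eigvals, loadings, scores = pca(X)
explained = eigvals / eigvals.sum()
print("fraction of variance per PC:", np.round(explained, 3))
print("covariance of the scores:\n", np.round(np.cov(scores, rowvar=False), 6))
```

The covariance matrix of the scores should be diagonal, which is the sense in which the principal components are uncorrelated even though the original columns are not.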

Principal component regression (PCR) is a method that combines linear regression and PCA. PCR establishes a relationship between the output variable (y) and the selected PC of the input variables (xi).
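A minimal sketch of PCR, assuming NumPy, is given below. The function name `fit_pcr`, the choice of standardising the inputs and the synthetic data are assumptions made for illustration; the paper's own implementation is not shown in the text.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    # Keep the components with the largest eigenvalues.
    keep = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, keep]
    # Ordinary least squares of y on the selected PC scores plus an intercept.
    A = np.column_stack([np.ones(len(y)), Z @ V])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        Z_new = (np.asarray(X_new, dtype=float) - mean) / std
        return gamma[0] + (Z_new @ V) @ gamma[1:]

    return gamma, predict


# Synthetic data with two nearly collinear inputs (illustration only).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
y = 2.0 + x1 + x2 - x3

gamma_all, predict_all = fit_pcr(X, y, n_components=3)
gamma_two, predict_two = fit_pcr(X, y, n_components=2)
sse_all = float(np.sum((y - predict_all(X)) ** 2))
sse_two = float(np.sum((y - predict_two(X)) ** 2))
print("SSE with 3 PCs:", sse_all)
print("SSE with 2 PCs:", sse_two)
```

With all components retained, PCR spans the same space as MLR on the original variables, so the noise-free response is reproduced exactly; dropping a component can only keep or increase the SSE.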

To develop models whose variables correspond to significant regression parameters, it is necessary to use one of the statistical procedures described in Section 2. Several published papers ignored these procedures (Boughton and Chiew, 2007, Liu et al., in press, Zhu et al., 2007). Applying them avoids including in the models input variables that are less correlated with the output variable.

In this paper, regressions were performed with and without the application of PCA to the original data. The aim was to apply a methodology, based on a statistical procedure, to select the explanatory variables to be used in the development of multiple linear and principal component regression models. The following five methods were compared: (i) backward elimination based on the confidence interval limits; (ii) backward elimination based on the correlation coefficient; (iii) forward selection based on the correlation coefficient; (iv) forward selection based on the sum of square errors; and (v) combinations of all variables. A case study was considered, regarding the determination of the parameters that influence the concentration of tropospheric ozone. The explanatory variables were meteorological data (temperature, relative humidity, wind speed, wind direction and solar radiation), and environmental data (nitrogen oxides and ozone concentrations of the previous day).

The remainder of this paper is organised as follows: Section 2 presents different methods to validate the regression coefficients of multiple linear and principal component regression models; Section 3 describes the case study; Section 4 discusses the results of the parameter validation methods; and, finally, Section 5 presents some conclusions.

Selection methods

The significance of the regression parameters in the MLR and PCR models was evaluated through the calculation of their confidence interval. The parameter $\hat{\beta}_i$ is valid if (Hayter et al., 2005):

$$|\hat{\beta}_i| > t_{n-k-1}^{\alpha/2} \frac{\hat{\sigma}}{\sqrt{S_{xx_i}}}$$

where $t$ is the Student t distribution, $n$ is the number of points, $k$ is the number of parameters, $\alpha$ is the significance level, $\hat{\sigma}$ is the standard deviation given by $\sqrt{SSE/(n-k-1)}$ and $S_{xx_i}$ is the sum of squares related to $x_i$ given by $\sum_{j=1}^{n}(x_{ij}-\bar{x}_i)^2$. The description of the methods is given
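The validity check can be sketched in code under stated assumptions. The sketch reads the criterion as a comparison of |β̂i| with the critical t value multiplied by σ̂ divided by the square root of Sxxi, with σ̂ taken as the square root of SSE/(n − k − 1). It assumes NumPy; the function name `valid_parameters` and the synthetic data are hypothetical. The critical value is passed in by the caller (for example from statistical tables, or from `scipy.stats.t.ppf(1 - alpha / 2, n - k - 1)` if SciPy is available).

```python
import numpy as np


def valid_parameters(X, y, t_crit):
    """Flag slope i as valid when |beta_i| > t_crit * sigma_hat / sqrt(Sxx_i)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta_hat
    # sigma_hat = sqrt(SSE / (n - k - 1)), as assumed in the lead-in.
    sigma_hat = np.sqrt(residuals @ residuals / (n - k - 1))
    # Sxx_i = sum_j (x_ij - mean(x_i))^2 for each explanatory variable.
    sxx = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    half_width = t_crit * sigma_hat / np.sqrt(sxx)
    return beta_hat, np.abs(beta_hat[1:]) > half_width


# Synthetic data: y depends on x1 only; x2 is unrelated noise (illustration only).
rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * rng.normal(size=n)

T_CRIT = 2.012  # approximate two-sided 5% critical value for n - k - 1 = 47 degrees of freedom
beta_hat, valid = valid_parameters(np.column_stack([x1, x2]), y, T_CRIT)
print("coefficients:", np.round(beta_hat, 3))
print("valid slopes:", valid)
```

This follows the form of the criterion stated in the text, which uses Sxxi for each variable separately, rather than the general expression based on the full covariance matrix of the estimates; the two can differ when the explanatory variables are strongly correlated.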

Case study

This study aims to determine the parameters that most influence the concentration of tropospheric ozone.

Increased tropospheric ozone levels have been affecting human health, climate, vegetation, materials and atmospheric composition. Tropospheric ozone is formed by reactions involving solar radiation and anthropogenic pollutants (methane, non-methane volatile organic compounds, carbon monoxide) in the presence of nitrogen oxides. Consequently, a typical daily profile of ozone concentrations

Results and discussion

The variables selected by each method were validated with the t-test and the partial F test (method 4b) using a significance level of 0.05.

Table 1 shows the MLR coefficients and the corresponding values of τi obtained using different variable selection methods, for the analysed period. The value of τi is equal to γi or (Fi − fc) for the methods that apply, respectively, the t-test or the partial F test for the validation of the regression parameters (in both cases, a parameter is valid for positive values of τi

Conclusions

Five methods were compared with the aim of selecting statistically valid regression parameters using multiple linear and principal component regression models. These methods were used to evaluate the variables that influenced tropospheric ozone concentration during the night period.

When multiple linear regression was used, the results showed that each selection method led to different models, as a consequence of the collinearities between explanatory variables. On the contrary, when principal

Acknowledgements

The authors are grateful to Comissão de Coordenação da Direcção Regional-Norte and to Instituto Geofísico da Universidade do Porto for kindly providing the air quality and meteorological data. This work was supported by Fundação para a Ciência e Tecnologia (FCT). J.C.M. Pires also thanks the FCT for the fellowship SFRD/BD/23302/2005.
