Selection and validation of parameters in multiple linear and principal component regressions
Introduction
Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data. The dependent variable (y) is given by:
$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$$
where $x_i$ (i = 1, … , k) are the explanatory independent variables, $\beta_i$ (i = 0, … , k) are the regression coefficients, and ε is the error associated with the regression, assumed to be normally distributed with expectation value zero and constant variance (Agirre-Basurko et al., 2006).
The predicted value given by the regression model ($\hat{y}$) is calculated by:
$$\hat{y} = \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i x_i$$
where $\hat{\beta}_i$ are the estimated regression coefficients.
The most common method to estimate the regression parameters is the minimization of the sum of square errors (SSE). For n observed points, the equation is as follows:
$$SSE = \sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2$$
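As an illustration of the SSE minimization described above, the following minimal sketch fits an MLR model by ordinary least squares with NumPy. The function names (`fit_mlr`, `predict`, `sse`) and the synthetic data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def fit_mlr(X, y):
    """Return [b0, b1, ..., bk], the coefficients that minimize the SSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 acts as the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predicted values: b0 + sum_i b_i * x_i for each row of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

def sse(y, y_hat):
    """Sum of square errors between observed and predicted values."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 0.5*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

coef = fit_mlr(X, y)
residual_sse = sse(y, predict(coef, X))
```

Because the synthetic response is an exact linear function of the inputs, the fitted coefficients should recover the generating values and the SSE should be numerically zero.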
MLR is one of the most widely used forecasting methods. It is used to fit observed data and to create predictive models in many research fields, such as biology (Khan et al., 2006, Mercaldo-Allen et al., 2006, Smith and Wachob, 2006), medicine (Sanchez-Ortuno et al., 2006, Andersen et al., 2005, Dorbala et al., 2006), psychology (Ansiau et al., 2005), economics (Mohamed and Bodger, 2005, Singh, 2006) and environment (Sousa et al., 2006, Turalıoğlu et al., 2006, Lu and Chang, 2005, Friis et al., 2005, Goyal et al., 2006).
The environment is a research field where MLR is particularly widely applied. Sousa et al. (2006) used the MLR approach to predict hourly mean ozone concentrations, comparing it with feedforward artificial neural networks (FANN) and time series (TS) modelling. In that study, MLR models showed good performance in the development step, but in the validation step FANN presented lower residual errors. Turalıoğlu et al. (2006) used MLR to determine the relationship between daily average total suspended particulate (TSP) and sulphur dioxide (SO2) concentrations and meteorological parameters (temperature, wind speed, relative humidity, pressure and precipitation) in Erzurum, Turkey. Pollutant concentrations had a strong relation with temperature and a significant correlation with wind speed and pressure, whereas precipitation and humidity were weakly correlated with SO2 and TSP. Goyal et al. (2006) also used this method to forecast the daily averaged concentration of respirable suspended particulate matter (RSPM) in Delhi and Hong Kong based on some meteorological factors. The results were compared with a time series auto regressive integrated moving average (ARIMA) model and with the combination of the two models. The combination of MLR and ARIMA performed better than either MLR or ARIMA alone.
The studies described above did not, however, consider the dependence between explanatory variables before applying MLR. When explanatory variables are correlated with each other, MLR has drawbacks, because high correlations between predictor variables can hinder a correct analysis. The dependence between the explanatory variables can be removed through the application of principal component analysis (PCA).
PCA creates new variables, the principal components (PC), that are orthogonal and uncorrelated. These variables are linear combinations of the original variables. The PC are ordered in such a way that the first component has the largest fraction of the original data variability (Abdul-Wahab et al., 2005, Wang and Xiao, 2004, Sousa et al., 2007). To evaluate the influence of each variable in the PC, varimax rotation is generally used to obtain the rotated factor loadings that represent the contribution of each variable in a specific principal component.
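The construction of the principal components can be sketched as follows. This is a minimal example that assumes PCA is performed on standardized variables through the eigendecomposition of their correlation matrix. The function name `pca` and the synthetic data are hypothetical, and the varimax rotation mentioned above is not included.

```python
import numpy as np

def pca(X):
    """PCA on standardized data via the correlation matrix.

    Returns the PC scores, the component vectors (one per column) and the
    fraction of the total variance carried by each PC, largest first.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                   # PCs as linear combinations of Z
    explained = eigvals / eigvals.sum()
    return scores, eigvecs, explained

# Hypothetical data: x2 is strongly correlated with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

scores, components, explained = pca(X)
score_cov = np.cov(scores, rowvar=False)   # off-diagonal terms should vanish
```

The covariance matrix of the scores should be diagonal up to rounding error, which is the sense in which the PCs are uncorrelated even though the original variables are not.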
Principal component regression (PCR) is a method that combines linear regression and PCA. PCR establishes a relationship between the output variable (y) and the selected PC of the input variables (xi).
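A minimal sketch of PCR under simplifying assumptions is given below: the inputs are standardized, and the PCs retained are simply the leading ones by explained variance. That retention rule stands in for the statistical selection procedures discussed in this paper. The names `fit_pcr` and `predict_pcr` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Regress y on the leading principal components of standardized X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, order]                  # retained component vectors
    T = Z @ V                              # PC scores used as regressors
    A = np.column_stack([np.ones(len(T)), T])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mean": mean, "std": std, "components": V, "coef": coef}

def predict_pcr(model, X):
    """Project new inputs onto the retained PCs and apply the regression."""
    Z = (np.asarray(X, dtype=float) - model["mean"]) / model["std"]
    T = Z @ model["components"]
    return model["coef"][0] + T @ model["coef"][1:]

# Hypothetical data with two nearly collinear inputs.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
y = 2.0 + 1.5 * x1 - 0.7 * x3 + 0.2 * rng.normal(size=100)

model = fit_pcr(X, y, n_components=2)
y_hat = predict_pcr(model, X)
```

When all PCs are retained, this PCR gives the same fitted values as MLR on the original variables. The two approaches differ only once some components are discarded.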
To develop these models using only variables that correspond to significant regression parameters, it is necessary to use one of the statistical procedures described in Section 2. There are several published papers where these procedures were ignored (Boughton and Chiew, 2007, Liu et al., in press, Zhu et al., 2007). Applying these procedures avoids including in the models input variables that are only weakly correlated with the output variable.
In this paper, regressions were performed with and without the application of PCA to the original data. The aim was to apply a methodology, based on a statistical procedure, to select the explanatory variables to be used in the development of multiple linear and principal component regression models. The following five methods were compared: (i) backward elimination based on the confidence interval limits; (ii) backward elimination based on the correlation coefficient; (iii) forward selection based on the correlation coefficient; (iv) forward selection based on the sum of square errors; and (v) combinations of all variables. A case study was considered, regarding the determination of the parameters that influence the concentration of tropospheric ozone. The explanatory variables were meteorological data (temperature, relative humidity, wind speed, wind direction and solar radiation), and environmental data (nitrogen oxides and ozone concentrations of the previous day).
The remainder of this paper is outlined as follows: in Section 2, different methods to validate regression coefficients of multiple linear and principal component regression models are presented; in Section 3, the case study is described; in Section 4, the results of the parameter validation methods are discussed; and finally, in Section 5, some conclusions are presented.
Selection methods
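The significance of the regression parameters in the MLR and PCR models was evaluated through the calculation of their confidence interval. The parameter is valid if (Hayter et al., 2005):
$$\left| \hat{\beta}_i \right| - t_{n-k-1}^{\alpha/2} \, \frac{\hat{\sigma}}{\sqrt{S_{xx_i}}} > 0$$
where $\hat{\beta}_i$ is the estimated regression parameter, t is the Student t distribution, n is the number of points, k is the number of parameters, α is the significance level, $\hat{\sigma}$ is the standard deviation given by $\hat{\sigma} = \sqrt{SSE/(n-k-1)}$ and $S_{xx_i}$ is the sum of squares related to $x_i$ given by $S_{xx_i} = \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$. The description of the methods is given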
Case study
This study aims to determine the parameters that most influence the concentration of tropospheric ozone.
Increased tropospheric ozone levels have been affecting human health, climate, vegetation, materials and atmospheric composition. Tropospheric ozone is formed by reactions involving solar radiation and anthropogenic pollutants (methane, non-methane volatile organic compounds, carbon monoxide) in the presence of nitrogen oxides. Consequently, a typical daily profile of ozone concentrations
Results and discussion
The variables selected by each method were validated with the t-test and the partial F test (method 4b) using a significance level of 0.05.
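The partial F test can be illustrated with the sketch below, which uses the standard textbook form of the statistic. It compares the SSE of a reduced model with that of a full model containing the additional variable(s), and the variable is retained when F exceeds the critical value. The function names, the synthetic data and the tabulated critical value `f_c` (approximately 4.05 for 1 and 47 degrees of freedom at the 0.05 level) are assumptions of this example, not values taken from the paper.

```python
import numpy as np

def sse_of_fit(X, y):
    """SSE of the least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """Partial F statistic for the variables in X_full missing from X_reduced."""
    n = len(y)
    p_full = X_full.shape[1] + 1                 # parameters incl. intercept
    q = X_full.shape[1] - X_reduced.shape[1]     # number of variables tested
    sse_r = sse_of_fit(X_reduced, y)
    sse_f = sse_of_fit(X_full, y)
    return ((sse_r - sse_f) / q) / (sse_f / (n - p_full))

# Hypothetical data: y depends on x1 only; x2 is an irrelevant input.
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + 0.5 * rng.normal(size=n)
X_full = np.column_stack([x1, x2])

F_x1 = partial_f(X_full[:, [1]], X_full, y)      # contribution of x1 given x2
F_x2 = partial_f(X_full[:, [0]], X_full, y)      # contribution of x2 given x1

f_c = 4.05                    # assumed approximate F(0.05; 1, 47) from tables
tau_x1 = F_x1 - f_c           # a positive value means the parameter is kept
tau_x2 = F_x2 - f_c
```

Passing the critical value in explicitly keeps the example free of a dependency on a statistics library. In practice, `f_c` would be read from the F distribution for the chosen significance level and degrees of freedom.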
Table 1 shows the MLR coefficients and the corresponding values of τi obtained using different variable selection methods, for the analysed period. The value of τi is equal to γi or (Fi − fc) for the methods that apply, respectively, the t-test or the partial F test for the validation of the regression parameters (in both cases, a parameter is valid for positive values of τi
Conclusions
Five methods were compared for the selection of statistically valid regression parameters in multiple linear and principal component regression models. These methods were used to evaluate the variables that influenced tropospheric ozone concentration during the night period.
When multiple linear regression was used, the results showed that each selection method led to different models, as a consequence of the collinearities between explanatory variables. On the contrary, when principal
Acknowledgements
The authors are grateful to Comissão de Coordenação da Direcção Regional-Norte and to Instituto Geofísico da Universidade do Porto for kindly providing the air quality and meteorological data. This work was supported by Fundação para a Ciência e Tecnologia (FCT). J.C.M. Pires also thanks the FCT for the fellowship SFRD/BD/23302/2005.
References (32)
- et al. Principal component and multiple regression analysis in modelling of ground-level ozone and factors affecting its concentrations. Environmental Modelling & Software (2005)
- et al. Regression and multilayer perceptron-based models to forecast hourly O3 and NO2 levels in the Bilbao area. Environmental Modelling & Software (2006)
- et al. Contribution of anthropogenic pollutants to the increase of tropospheric ozone levels in Oporto Metropolitan Area, Portugal since the 19th century. Environmental Pollution (2006)
- et al. Relationships between cognitive characteristics of the job, age, and cognitive efficiency. International Congress Series (2005)
- et al. Experimental determination of the effect of mountain-valley breeze circulation on air pollution in the vicinity of Freiburg. Atmospheric Environment (1999)
- et al. Modelling the effects of meteorological variables on ozone concentration – a quantile regression approach. Atmospheric Environment (2004)
- et al. Estimating runoff in ungauged catchments from rainfall, PET and the AWBM model. Environmental Modelling & Software (2007)
- et al. Ozone concentration jump in the stable nocturnal boundary layer during a LLJ-event. Atmospheric Environment (1997)
- et al. Effect of body mass index on left ventricular cavity size and ejection fraction. American Journal of Cardiology (2006)
- et al. Summer nocturnal ozone maxima in Göteborg, Sweden. Atmospheric Environment (2003)
- On the temporal increase of anthropogenic CO2 in the subpolar North Atlantic. Deep-Sea Research I
- Statistical models for the prediction of respirable suspended particulate matter in urban cities. Atmospheric Environment
- A quick and accurate estimation of heat losses from a cow. Biosystems Engineering
- Meteorologically adjusted trends of daily maximum ozone concentrations in Taipei, Taiwan. Atmospheric Environment
- A model to estimate growth in young-of-the-year tautog, Tautoga onitis, based on RNA/DNA ratio and seawater temperature. Journal of Experimental Marine Biology and Ecology
- Forecasting electricity consumption in New Zealand using economic and demographic variables. Energy