Abstract
The widely used statistic \(R^2\) has a fundamental weakness in model building: it favors adding more predictors to the model, because \(R^2\) can never decrease when a predictor is added. In effect, the additional predictors start fitting noise in the data. Other measures used in selecting a regression model, such as \(R^2_{adj}\), AIC, SBC, and Mallows' \(C_p\), do not guarantee that the selected model will also predict future values better. To guard against this, data scientists withhold a percentage of the data for validation purposes. The PRESS statistic does something similar: each observation is withheld in turn when its own predicted value is calculated. In this paper, we investigated the behavior of \(R^2_{PRESS}\) and how it performs compared to other criteria in model selection in the presence of unnecessary predictors. Using simulated data, we found that \(R^2_{PRESS}\) generally performed best among the model selection measures considered in selecting the true model as the best model for prediction.
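As an illustration of the leave-one-out idea described above, the following is a minimal sketch rather than the authors' code. It assumes ordinary least squares and the common definition \(R^2_{PRESS} = 1 - PRESS/SST\), which the abstract itself does not spell out. The function names, the simulated data, and the number of noise predictors are all hypothetical. The sketch computes PRESS through the leverage shortcut \(e_i/(1-h_{ii})\) and also by explicitly refitting without each observation, so the two can be compared.

```python
import numpy as np


def r_squared(X, y):
    """Ordinary R^2 = 1 - SSE/SST for an OLS fit; X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - sse / sst)


def press_statistic(X, y):
    """PRESS via the leverage shortcut: sum of (e_i / (1 - h_ii))^2.

    Assumes X has full column rank, so the leverages come from a thin QR.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    Q, _ = np.linalg.qr(X)               # thin QR: hat matrix H = Q Q^T
    leverage = np.sum(Q ** 2, axis=1)    # h_ii, the diagonal of H
    return float(np.sum((residuals / (1.0 - leverage)) ** 2))


def press_brute_force(X, y):
    """PRESS by refitting the model n times, withholding one observation each time."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)


def r2_press(X, y):
    """Assumed definition: R^2_PRESS = 1 - PRESS / SST."""
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press_statistic(X, y) / sst)


# Hypothetical simulation: one real predictor plus three pure-noise predictors.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
noise_predictors = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_true = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([X_true, noise_predictors])

for name, X in [("true model", X_true), ("with noise predictors", X_full)]:
    print(f"{name}: R2 = {r_squared(X, y):.4f}, R2_PRESS = {r2_press(X, y):.4f}")
```

Because the two models are nested, \(R^2\) for the larger model is never smaller than for the true model, whereas \(R^2_{PRESS}\) carries no such guarantee and can fall when unnecessary predictors are added. Which model \(R^2_{PRESS}\) prefers in any single simulated sample depends on the draw.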


Cite this article
Alcantara, I.M., Naranjo, J. & Lang, Y. Model selection using PRESS statistic. Comput Stat 38, 285–298 (2023). https://doi.org/10.1007/s00180-022-01228-1
DOI: https://doi.org/10.1007/s00180-022-01228-1