Abstract
This simulation study explores the impact of several undesirable scenarios (e.g., collinearity, Simpson's paradox, variable interaction, Freedman's paradox) on feature selection and coefficient estimation with traditional methodologies: automatic selection (e.g., stepwise selection using the Akaike information criterion and the Bayesian information criterion) and penalized regression (e.g., the least absolute shrinkage and selection operator (LASSO), elastic net, relaxed LASSO, adaptive LASSO, the minimax concave penalty, the smoothly clipped absolute deviation penalty, and penalized regression with second-generation p-values). Specifically, we compare wrapper and embedded methods with respect to feature selection, coefficient estimation, and model performance. Our results show that the choice of methodology can affect the number and type of selected features, as well as the accuracy and precision of the coefficient estimates. Furthermore, we find that performance can also depend on the characteristics of the data.
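To make the collinearity scenario concrete: when two predictors are nearly identical, an embedded method such as the LASSO tends to select one of them essentially arbitrarily and shrink the coefficients of both. The sketch below is not from the paper (which used R packages such as glmnet); it is a minimal Python illustration using scikit-learn's `LassoCV` on simulated data with one near-duplicate predictor, where only the first predictor actually drives the response.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10

# Simulate two strongly collinear predictors plus independent noise features.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)           # near-duplicate of x1
X = np.column_stack([x1, x2, rng.normal(size=(n, p - 2))])
y = 2.0 * x1 + rng.normal(size=n)             # only x1 drives the response

# Cross-validated LASSO: the penalty is chosen by 5-fold CV.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
print("selected feature indices:", selected)
```

Under collinearity the split of the signal between columns 0 and 1 is unstable across simulated datasets, which is precisely the kind of effect on the number and type of selected features that the study quantifies.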
Funding
This research was partially funded by Portuguese funds through CIDMA, the Center for Research and Development in Mathematics and Applications of the University of Aveiro, and the Portuguese Foundation for Science and Technology (FCT – Fundação para a Ciência e a Tecnologia), within projects UIDB/04106/2020 and UIDP/04106/2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Afreixo, V., Cabral, J., Macedo, P. (2023). Comparison of Feature Selection Methods in Regression Modeling: A Simulation Study. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14112. Springer, Cham. https://doi.org/10.1007/978-3-031-37129-5_13
DOI: https://doi.org/10.1007/978-3-031-37129-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37128-8
Online ISBN: 978-3-031-37129-5