Abstract
This simulation study explores the impact of several undesirable scenarios (e.g., collinearity, Simpson's paradox, variable interaction, Freedman's paradox) on feature selection and coefficient estimation with traditional methodologies: automatic selection (e.g., stepwise selection using the Akaike information criterion and the Bayesian information criterion) and penalized regression (e.g., the least absolute shrinkage and selection operator (LASSO), elastic net, relaxed LASSO, adaptive LASSO, the minimax concave penalty, the smoothly clipped absolute deviation penalty, and penalized regression with second-generation p-values). Specifically, we compare wrapper and embedded methods with respect to feature selection, coefficient estimation, and model performance. Our results show that the choice of methodology can affect the number and type of selected features, as well as the accuracy and precision of the coefficient estimates. Furthermore, we find that performance can also depend on the characteristics of the data.
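To make the collinearity scenario concrete: when two predictors are nearly identical, an embedded method such as the LASSO tends to select one of them essentially arbitrarily and shrink the coefficients of both. The sketch below is not from the paper (which used R packages such as glmnet); it is a minimal Python illustration using scikit-learn's `LassoCV` on simulated data with one near-duplicate predictor, where only the first predictor actually drives the response.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10

# Simulate two strongly collinear predictors plus independent noise features.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)           # near-duplicate of x1
X = np.column_stack([x1, x2, rng.normal(size=(n, p - 2))])
y = 2.0 * x1 + rng.normal(size=n)             # only x1 drives the response

# Cross-validated LASSO: the penalty is chosen by 5-fold CV.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
print("selected feature indices:", selected)
```

Under collinearity the split of the signal between columns 0 and 1 is unstable across simulated datasets, which is precisely the kind of effect on the number and type of selected features that the study quantifies.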
Funding
This research was partially funded by Portuguese funds through CIDMA, the Center for Research and Development in Mathematics and Applications of the University of Aveiro, and the Portuguese Foundation for Science and Technology (FCT – Fundação para a Ciência e a Tecnologia), within projects UIDB/04106/2020 and UIDP/04106/2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Afreixo, V., Cabral, J., Macedo, P. (2023). Comparison of Feature Selection Methods in Regression Modeling: A Simulation Study. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14112. Springer, Cham. https://doi.org/10.1007/978-3-031-37129-5_13
DOI: https://doi.org/10.1007/978-3-031-37129-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37128-8
Online ISBN: 978-3-031-37129-5