
Comparison of Feature Selection Methods in Regression Modeling: A Simulation Study

  • Conference paper
In: Computational Science and Its Applications – ICCSA 2023 Workshops (ICCSA 2023)

Abstract

This simulation study explores the impact of several adverse scenarios (e.g., collinearity, Simpson’s paradox, variable interaction, and Freedman’s paradox) on feature selection and coefficient estimation with traditional methodologies, namely automatic selection (e.g., stepwise selection under the Akaike information criterion or the Bayesian information criterion) and penalized regression (e.g., the least absolute shrinkage and selection operator (LASSO), the elastic net, the relaxed LASSO, the adaptive LASSO, the minimax concave penalty, the smoothly clipped absolute deviation penalty, and penalized regression with second-generation p-values). Specifically, we compare wrapper and embedded methods with respect to feature selection, coefficient estimation, and model performance. Our results show that the choice of methodology can affect the number and type of selected features, as well as the accuracy and precision of the coefficient estimates. Furthermore, we find that performance can also depend on the characteristics of the data.
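As a concrete illustration of the comparison described in the abstract, the sketch below simulates a single collinear scenario and contrasts a wrapper method (stepwise selection under AIC and BIC, via MASS::stepAIC) with an embedded method (the LASSO, via glmnet::cv.glmnet). This is a minimal sketch under assumed settings, not the authors' simulation code: the sample size, the AR(1)-type correlation structure, and the true coefficient vector are illustrative choices.

```r
## Minimal sketch (not the authors' code): one collinear scenario,
## comparing stepwise selection (AIC/BIC) with the LASSO.
library(MASS)    # mvrnorm() for correlated predictors, stepAIC()
library(glmnet)  # cv.glmnet() for the cross-validated LASSO

set.seed(2023)
n <- 200; p <- 10
rho <- 0.9                                # strong collinearity (assumed)
Sigma <- rho^abs(outer(1:p, 1:p, "-"))    # AR(1)-type correlation matrix
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
colnames(X) <- paste0("x", 1:p)
beta <- c(2, 0, -1.5, rep(0, p - 3))      # only x1 and x3 are truly active
y <- as.numeric(X %*% beta + rnorm(n))
dat <- data.frame(y = y, X)

## Wrapper methods: stepwise selection under AIC (k = 2) and BIC (k = log n)
full     <- lm(y ~ ., data = dat)
step_aic <- stepAIC(full, direction = "both", k = 2,      trace = FALSE)
step_bic <- stepAIC(full, direction = "both", k = log(n), trace = FALSE)

## Embedded method: LASSO path with lambda chosen by 10-fold CV
cv   <- cv.glmnet(X, y, alpha = 1)
b_cv <- coef(cv, s = "lambda.1se")

## Features retained by each method
sel_aic   <- setdiff(names(coef(step_aic)), "(Intercept)")
sel_bic   <- setdiff(names(coef(step_bic)), "(Intercept)")
sel_lasso <- setdiff(rownames(b_cv)[as.vector(b_cv != 0)], "(Intercept)")
list(AIC = sel_aic, BIC = sel_bic, LASSO = sel_lasso)
```

Running many replicates of this design and tabulating how often each method recovers exactly {x1, x3} reproduces, in miniature, the kind of comparison the study reports.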

Funding

This research was partially funded by Portuguese funds through CIDMA, the Center for Research and Development in Mathematics and Applications of the University of Aveiro, and the Portuguese Foundation for Science and Technology (FCT – Fundação para a Ciência e a Tecnologia), within projects UIDB/04106/2020 and UIDP/04106/2020.

Author information

Correspondence to Vera Afreixo.

Appendix

See Figs. A1 and A2 for two illustrations.

Fig. A1. Shiny app, view of the number of features.

Fig. A2. Shiny app, view of the features’ importance.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Afreixo, V., Cabral, J., Macedo, P. (2023). Comparison of Feature Selection Methods in Regression Modeling: A Simulation Study. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14112. Springer, Cham. https://doi.org/10.1007/978-3-031-37129-5_13

  • DOI: https://doi.org/10.1007/978-3-031-37129-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37128-8

  • Online ISBN: 978-3-031-37129-5

  • eBook Packages: Computer Science, Computer Science (R0)
