DOI: 10.1145/3632410.3632468
short-paper
Open Access

CGFLasso: Combating Multicollinearity using Domain Knowledge

Published: 04 January 2024

ABSTRACT

Multicollinearity, or near-linear dependence among features, is a statistical phenomenon in which multiple predictor variables in a multiple regression model are highly correlated. It leads to inaccurate parameter estimates, decreased power, low confidence in regression coefficients, and incorrect feature selection in domains such as the biological sciences and finance. This paper tackles the issue by constructing a contextual graph from available domain knowledge. We propose a novel regularization approach, Contextual Graph-guided Fused Lasso (CGFLasso), based on GFLasso regularization, and show how prior domain knowledge can improve on the current state of the art. We show that our method significantly reduces standard error across multiple datasets.
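
To make the idea of graph-guided regularization concrete, here is a minimal sketch of a fused-lasso-style penalty evaluated over a feature graph. The abstract does not state the exact CGFLasso objective, so everything below is an assumption for illustration: the function name `gflasso_penalty`, the `(j, k, r)` edge format, the use of `|r|` as the edge weight, and the parameters `lam` and `gamma` are hypothetical. In the paper the edges would come from the contextual graph built from domain knowledge.

```python
import numpy as np

def gflasso_penalty(beta, edges, lam, gamma):
    """Lasso term plus a graph-guided fusion term (illustrative form only).

    beta  : coefficient vector, one entry per feature
    edges : iterable of (j, k, r); r is a signed weight linking
            features j and k (hypothetical format, e.g. a correlation)
    lam   : weight of the plain L1 (sparsity) term
    gamma : weight of the fusion term
    """
    beta = np.asarray(beta, dtype=float)
    l1 = lam * np.abs(beta).sum()
    # Linked features are pushed toward equal coefficients (r > 0)
    # or opposite coefficients (r < 0), scaled by |r|.
    fusion = sum(abs(r) * abs(beta[j] - np.sign(r) * beta[k])
                 for j, k, r in edges)
    return l1 + gamma * fusion

# Features 0 and 1 are linked with a strong positive weight.
edges = [(0, 1, 0.9)]
p_fused = gflasso_penalty([1.0, 1.0, 0.0], edges, lam=0.1, gamma=1.0)
p_split = gflasso_penalty([2.0, 0.0, 0.0], edges, lam=0.1, gamma=1.0)
print(p_fused, p_split)
```

Both coefficient vectors have the same L1 norm, but the vector that spreads the effect evenly across the two linked features incurs no fusion cost, so it is penalized less than the one that loads everything onto a single feature. This is the kind of behaviour a graph-guided penalty uses to stabilize estimates for correlated predictors.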

    • Published in

      CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD)
      January 2024
      627 pages

      Copyright © 2024 ACM

      Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • short-paper
      • Research
      • Refereed limited