ABSTRACT
Multicollinearity, or near-linear dependence among features, is a statistical phenomenon in which multiple predictor variables in a multiple regression model are highly correlated. It leads to inaccurate parameter estimates, reduced statistical power, low confidence in regression coefficients, and incorrect feature selection in domains such as the biological sciences and finance. This paper tackles the issue by constructing a contextual graph from available domain knowledge. We propose a novel regularization approach, Contextual Graph-guided Fused Lasso (CGFLasso), which builds on GFLasso regularization, and show how prior domain knowledge can improve on the current state of the art. We show that our method significantly reduces standard error across multiple datasets.
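The abstract does not state the CGFLasso objective, so the following is only a minimal sketch of the general graph-guided fused lasso idea it builds on: a squared-error loss, an L1 sparsity penalty, and a fusion penalty that pulls together the coefficients of features joined by a graph edge. Applying the graph to features, the edge format `(j, k, weight, sign)`, the function name, and the parameters `lam` and `gamma` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def graph_fused_lasso_objective(X, y, beta, edges, lam, gamma):
    """Illustrative objective: loss + lasso penalty + graph fusion penalty.

    edges is a list of (j, k, weight, sign) tuples (a hypothetical format):
    features j and k are linked in the graph, weight >= 0 is the edge
    strength, and sign is +1 or -1 for the expected direction of association.
    """
    residual = y - X @ beta
    loss = 0.5 * float(residual @ residual)
    sparsity = lam * float(np.abs(beta).sum())
    fusion = gamma * float(
        sum(w * abs(beta[j] - s * beta[k]) for j, k, w, s in edges)
    )
    return loss + sparsity + fusion


# Two perfectly collinear features: both coefficient vectors fit y exactly,
# so the loss alone cannot distinguish between them.
X = np.array([[1.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 2.0])
edges = [(0, 1, 1.0, 1)]  # assumed domain knowledge: features 0 and 1 are related

fused = graph_fused_lasso_objective(X, y, np.array([1.0, 1.0]), edges, lam=0.1, gamma=0.5)
unfused = graph_fused_lasso_objective(X, y, np.array([2.0, 0.0]), edges, lam=0.1, gamma=0.5)
print(fused, unfused)
```

In this toy case the loss and the L1 penalty are identical for both coefficient vectors, so only the fusion term separates them; it favours the solution that spreads the effect evenly across the two linked features.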
Index Terms
- CGFLasso: Combating Multicollinearity using Domain Knowledge