short-paper

Open Access

CGFLasso: Combating Multicollinearity using Domain Knowledge

Authors:
Soumya Sarkar (ORCID 0009-0001-8410-3124), Computer Science and Engineering, Indian Institute of Technology Ropar, India
Nitin Singhal (ORCID 0009-0000-4238-1292), Computer Science and Engineering, Indian Institute of Technology Ropar, India
Shweta Jain (ORCID 0000-0002-2666-9058), Computer Science and Engineering, Indian Institute of Technology Ropar, India
Shashi Shekhar Jha (ORCID 0000-0002-1375-2266), Computer Science and Engineering, Indian Institute of Technology Ropar, India

CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), January 2024, Pages 252–256. https://doi.org/10.1145/3632410.3632468

Published: 04 January 2024

ABSTRACT

Multicollinearity, or near-linear dependence among features, is a statistical phenomenon in which multiple predictor variables in a multiple regression model are highly correlated. It leads to inaccurate parameter estimates, decreased statistical power, low confidence in regression coefficients, and incorrect feature selection in domains such as the biological sciences and finance. This paper tackles the issue by constructing a contextual graph from available domain knowledge. We propose a novel regularization approach, Contextual Graph-guided Fused Lasso (CGFLasso), built on GFLasso regularization, to show how prior domain knowledge can improve on the current state of the art. We show that our method significantly reduces standard error across multiple datasets.
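
The abstract does not state the penalty itself. As a rough illustration only, the sketch below evaluates an objective in the commonly used graph-guided fused lasso form: a squared-error loss, an L1 sparsity term, and a fusion term that pulls together the coefficients of features joined by a graph edge. The function name `gflasso_objective`, the edge format `(i, j, w)`, and the parameters `lam` and `gamma` are assumptions made for this example, and the way CGFLasso builds its contextual graph from domain knowledge is not reproduced here.

```python
import numpy as np


def gflasso_objective(X, y, beta, edges, lam, gamma):
    """Evaluate a graph-guided fused lasso style objective (illustrative).

    X     : (n, p) design matrix
    y     : (n,) response vector
    beta  : (p,) coefficient vector
    edges : iterable of (i, j, w), where w is a signed edge weight between
            features i and j (hypothetical format; a positive w encourages
            beta[i] ~ beta[j], a negative w encourages beta[i] ~ -beta[j])
    lam   : weight of the L1 sparsity penalty
    gamma : weight of the graph fusion penalty
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # Squared-error loss: 0.5 * ||y - X beta||^2
    residual = y - X @ beta
    loss = 0.5 * float(residual @ residual)

    # Lasso term: lam * sum_k |beta_k|
    sparsity = lam * float(np.sum(np.abs(beta)))

    # Fusion term: gamma * sum over edges |w| * |beta_i - sign(w) * beta_j|
    fusion = gamma * sum(
        abs(w) * abs(beta[i] - np.sign(w) * beta[j]) for i, j, w in edges
    )

    return loss + sparsity + float(fusion)
```

With a positively weighted edge between two features, coefficient vectors that assign them different values pay a fusion cost, which is the mechanism by which a graph over correlated predictors can stabilise their estimates.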

Index Terms

CGFLasso: Combating Multicollinearity using Domain Knowledge
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Regularization

Published in
CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD)
January 2024
627 pages
ISBN:9798400716348
DOI:10.1145/3632410
Editors:
Sriraam Natarajan,
Indrajit Bhattacharya,
Richa Singh,
Arun Kumar,
Sayan Ranu,
Kalika Bali,
Abinaya K
Copyright © 2024 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Linear Regression
Multicollinearity
Regularization
Qualifiers
- short-paper
- Research
- Refereed limited