ABSTRACT
Overparameterized regression models are often harder to interpret and, because of ill-conditioning, can be harder to fit. Genetic programming (GP) is prone to producing overparameterized models because it evolves the structure of the model without taking the location of parameters into account. One way to alleviate this is to rewrite the expression and merge redundant fitting parameters. In this paper we propose the use of equality saturation for this purpose. We first observe that all the tested GP implementations suffer from overparameterization to different extents, and then show that equality saturation together with a small set of rewriting rules is capable of reducing the number of fitting parameters to the minimum with high probability. Compared to SymPy, one of the few available alternatives, it produces much better and more consistent results. These results suggest several directions for future investigation, such as simplifying expressions during the evolutionary process and improving the interpretability of symbolic models.
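To make the idea of merging redundant fitting parameters concrete, the following is a minimal sketch and not the method of the paper: it applies two rewriting rules greedily and bottom-up, whereas equality saturation keeps all equivalent forms in an e-graph. The tuple encoding and the names `PARAM`, `count_params`, and `merge_params` are assumptions made for this illustration.

```python
# Hypothetical encoding chosen for this sketch (not taken from the paper):
#   ("param",)         an anonymous fitting parameter; each occurrence is distinct
#   ("var", name)      an input variable
#   ("add", lhs, rhs)  lhs + rhs
#   ("mul", lhs, rhs)  lhs * rhs

PARAM = ("param",)


def count_params(expr):
    """Count the fitting parameters that occur in an expression."""
    if expr == PARAM:
        return 1
    if expr[0] == "var":
        return 0
    return count_params(expr[1]) + count_params(expr[2])


def merge_params(expr):
    """Greedily merge redundant parameters, rewriting children first."""
    if expr == PARAM or expr[0] == "var":
        return expr
    op, lhs, rhs = expr[0], merge_params(expr[1]), merge_params(expr[2])
    # Rule 1: p1 op p2 -> p, since two free parameters act as a single one.
    if lhs == PARAM and rhs == PARAM:
        return PARAM
    # Rule 2: p1 op (p2 op e) -> p op e for the same operator, which relies on
    # the associativity and commutativity of + and *.
    for a, b in ((lhs, rhs), (rhs, lhs)):
        if a == PARAM and b[0] == op and PARAM in (b[1], b[2]):
            return b
    return (op, lhs, rhs)


# p1 * (p2 * x) + (p3 + p4) has four parameters but needs only two.
x = ("var", "x")
expr = ("add", ("mul", PARAM, ("mul", PARAM, x)), ("add", PARAM, PARAM))
simplified = merge_params(expr)
print(count_params(expr), "->", count_params(simplified))
```

A greedy pass like this commits to one rewrite at a time, so whether it finds every possible merge depends on the shape of the input and the order in which rules fire. Equality saturation avoids that dependence by representing many equivalent expressions at once and extracting the best one afterwards.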
Index Terms
- Reducing Overparameterization of Symbolic Regression Models with Equality Saturation