ABSTRACT
When performing symbolic regression using genetic programming, overfitting and bloat can negatively impact generalizability and interpretability of the resulting equations as well as increase computation times. A Bayesian fitness metric is introduced and its impact on bloat and overfitting during population evolution is studied and compared to common alternatives in the literature. The proposed approach was found to be more robust to noise and data sparsity in numerical experiments, guiding evolution to a level of complexity appropriate to the dataset. Further evolution of the population resulted not in overfitting or bloat, but rather in slight simplifications in model form. The ability to identify an equation of complexity appropriate to the scale of noise in the training data was also demonstrated. In general, the Bayesian model selection algorithm was shown to be an effective means of regularization which resulted in less bloat and overfitting when any amount of noise was present in the training data.
The efficacy of a Genetic Programming (GP) [1] solution is often characterized by its (1) fitness, i.e., ability to perform a training task, (2) complexity, and (3) generalizability, i.e., ability to perform its task in an unseen scenario. Bloat is a common phenomenon in GP in which continued training results in significant increases in complexity with minimal improvements in fitness. Several theories for the prevalence of bloat in GP postulate possible evolutionary benefits of bloat [2]; however, for most practical purposes bloat is a hindrance rather than a benefit. For example, bloated solutions are less interpretable and more computationally expensive. Overfitting is another common phenomenon in GP and in the broader machine learning field. Overfitting occurs when continued training results in better fitness but reduced generalizability.
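The difference between rewarding training fitness alone and rewarding an appropriate level of complexity can be illustrated on a toy problem. The following is a minimal sketch, not the fitness metric proposed in the paper: it assumes polynomial models that are linear in their coefficients, a zero-mean Gaussian prior on those coefficients, and a known Gaussian noise level, so that the model evidence (marginal likelihood) has a closed form. The function names and the `noise_std` and `prior_std` values are hypothetical choices made for illustration.

```python
import numpy as np


def log_evidence(x, y, degree, noise_std=0.1, prior_std=10.0):
    """Closed-form log marginal likelihood of a polynomial model.

    Assumes y = Phi @ w + noise, with w ~ N(0, prior_std^2 I) and
    noise ~ N(0, noise_std^2 I), so that y ~ N(0, cov) with
    cov = noise_std^2 I + prior_std^2 Phi Phi^T.
    """
    phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    cov = noise_std**2 * np.eye(len(x)) + prior_std**2 * (phi @ phi.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)


def train_mse(x, y, degree):
    """Training mean squared error of the least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


# Synthetic data: a quadratic ground truth with additive Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)

degrees = list(range(9))
mse = [train_mse(x, y, d) for d in degrees]
evidence = [log_evidence(x, y, d) for d in degrees]

for d in degrees:
    print(f"degree={d}  train MSE={mse[d]:.5f}  log evidence={evidence[d]:.2f}")
```

Because the polynomial models are nested, the training error can only decrease as the degree grows, so a fitness based on training error alone never discourages added complexity. The evidence, by contrast, integrates over the coefficients and penalizes terms the data do not support, so in this setup it is expected to favor a degree near that of the generating quadratic.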
- John R Koza. Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT Press, 1992.
- Vipul K Dabhi and Sanjay Chaudhary. A survey on techniques of improving generalization ability of genetic programming solutions. arXiv preprint arXiv:1211.1119, 2012.
- Tejashvi R Naik and Vipul K Dabhi. Improving generalization ability of genetic programming: comparative study. Journal of Bioinformatics and Intelligent Control, 2(4):243--252, 2013.
- Jeannie Fitzgerald, R Muhammad Atif Azad, and Conor Ryan. Bootstrapping to reduce bloat and improve generalisation in genetic programming. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pages 141--142, 2013.
- Ekaterina J Vladislavleva, Guido F Smits, and Dick Den Hertog. Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Transactions on Evolutionary Computation, 13(2):333--349, 2008.
- Sara Silva and Leonardo Vanneschi. Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 1115--1122, 2009.
- Iain Murray and Zoubin Ghahramani. A note on the evidence and Bayesian Occam's razor. 2005.
- Anthony O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99--118, 1995.
- G.F. Bomarito. Bingo. https://github.com/nasa/bingo, 2022.
- Michael Schmidt and Hod Lipson. Comparison of tree and graph encodings as function of problem complexity. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 1674--1679, 2007.
- Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pages 1121--1128, 2013.
- Z Emigdio, Leonardo Trujillo, Oliver Schütze, Pierrick Legrand, et al. Evaluating the effects of local search in genetic programming. In EVOLVE-A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V, pages 213--228. Springer, 2014.
- Vinicius Veloso De Melo, Benjamin Fowler, and Wolfgang Banzhaf. Evaluating methods for constant optimization of symbolic regression benchmark problems. In 2015 Brazilian conference on intelligent systems (BRACIS), pages 25--30. IEEE, 2015.
- William La Cava, Patryk Orzechowski, Bogdan Burlacu, Fabrício Olivetti de França, Marco Virgolin, Ying Jin, Michael Kommenda, and Jason H Moore. Contemporary symbolic regression methods and their relative performance. arXiv preprint arXiv:2107.14351, 2021.
- Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411--436, 2006.
- P.E. Leser. Smcpy - sequential Monte Carlo with Python. https://github.com/nasa/smcpy, 2022.
- Samir W Mahfoud. Niching methods for genetic algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
- Ole J Mengshoel and David E Goldberg. Probabilistic crowding: Deterministic crowding with probabilistic replacement. 1999.
Index Terms
- Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression