poster

Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression

Authors:
G. F. Bomarito, NASA Langley Research Center
P. E. Leser, NASA Langley Research Center
N. C. M. Strauss, University of Utah
K. M. Garbrecht, University of Utah
J. D. Hochhalter, University of Utah

GECCO '22: Proceedings of the Genetic and Evolutionary Computation Conference Companion, July 2022, Pages 526–529
https://doi.org/10.1145/3520304.3528899

Published: 19 July 2022

ABSTRACT

When performing symbolic regression using genetic programming, overfitting and bloat can negatively impact generalizability and interpretability of the resulting equations as well as increase computation times. A Bayesian fitness metric is introduced and its impact on bloat and overfitting during population evolution is studied and compared to common alternatives in the literature. The proposed approach was found to be more robust to noise and data sparsity in numerical experiments, guiding evolution to a level of complexity appropriate to the dataset. Further evolution of the population resulted not in overfitting or bloat, but rather in slight simplifications in model form. The ability to identify an equation of complexity appropriate to the scale of noise in the training data was also demonstrated. In general, the Bayesian model selection algorithm was shown to be an effective means of regularization which resulted in less bloat and overfitting when any amount of noise was present in the training data.

The efficacy of a Genetic Programming (GP) [1] solution is often characterized by its (1) fitness, i.e., its ability to perform a training task, (2) complexity, and (3) generalizability, i.e., its ability to perform its task in an unseen scenario. Bloat is a common phenomenon in GP in which continued training results in significant increases in complexity with minimal improvements in fitness. Several theories for the prevalence of bloat in GP postulate possible evolutionary benefits of bloat [2]; however, for most practical purposes bloat is a hindrance rather than a benefit. For example, bloated solutions are less interpretable and more computationally expensive. Overfitting is another common phenomenon in GP and in the broader machine learning field. Overfitting occurs when continued training results in better fitness but reduced generalizability.
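The contrast between a pure goodness-of-fit metric and an evidence-style metric can be illustrated with a small sketch. This is not the authors' implementation: polynomial degree stands in for equation complexity, and a BIC-style approximation to the negative log model evidence stands in for the Bayesian fitness metric. The function names, the synthetic quadratic dataset, and the noise level are all assumptions made for illustration.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns coefficients and residual sum of squares."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.sum(residuals**2))


def negative_log_evidence_bic(rss, n_points, n_params):
    """BIC/2: a rough stand-in for the negative log model evidence.

    Assumes Gaussian noise with unknown variance. The first term rewards
    fit quality, the second penalizes each additional parameter.
    """
    return 0.5 * (n_points * np.log(rss / n_points) + n_params * np.log(n_points))


# Hypothetical noisy dataset generated from a quadratic "true" model.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

degrees = list(range(0, 9))  # candidate complexities
rss_by_degree = []
nle_by_degree = []
for d in degrees:
    _, rss = fit_polynomial(x, y, d)
    rss_by_degree.append(rss)
    nle_by_degree.append(negative_log_evidence_bic(rss, x.size, d + 1))

# Selection by training error alone versus by the penalized score.
best_by_error = degrees[int(np.argmin(rss_by_degree))]
best_by_evidence = degrees[int(np.argmin(nle_by_degree))]

print("degree chosen by training error:", best_by_error)
print("degree chosen by BIC-style evidence:", best_by_evidence)
```

Because the candidate models are nested, training error cannot increase as complexity grows, so a metric based on training error alone has no reason to stop adding terms. The penalized score trades fit against parameter count and so tends to prefer a complexity closer to that of the generating model, which is the regularizing behavior the abstract attributes to Bayesian model selection.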

References

[1] John R Koza. Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT Press, 1992.
[2] Vipul K Dabhi and Sanjay Chaudhary. A survey on techniques of improving generalization ability of genetic programming solutions. arXiv preprint arXiv:1211.1119, 2012.
[3] Tejashvi R Naik and Vipul K Dabhi. Improving generalization ability of genetic programming: comparative study. Journal of Bioinformatics and Intelligent Control, 2(4):243--252, 2013.
[4] Jeannie Fitzgerald, R Muhammad Atif Azad, and Conor Ryan. Bootstrapping to reduce bloat and improve generalisation in genetic programming. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pages 141--142, 2013.
[5] Ekaterina J Vladislavleva, Guido F Smits, and Dick Den Hertog. Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Transactions on Evolutionary Computation, 13(2):333--349, 2008.
[6] Sara Silva and Leonardo Vanneschi. Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 1115--1122, 2009.
[7] Iain Murray and Zoubin Ghahramani. A note on the evidence and Bayesian Occam's razor. 2005.
[8] Anthony O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99--118, 1995.
[9] G.F. Bomarito. Bingo. https://github.com/nasa/bingo, 2022.
[10] Michael Schmidt and Hod Lipson. Comparison of tree and graph encodings as function of problem complexity. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 1674--1679, 2007.
[11] Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In Proceedings of the 15th annual conference companion on Genetic and evolutionary computation, pages 1121--1128, 2013.
[12] Z Emigdio, Leonardo Trujillo, Oliver Schütze, Pierrick Legrand, et al. Evaluating the effects of local search in genetic programming. In EVOLVE-A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V, pages 213--228. Springer, 2014.
[13] Vinicius Veloso De Melo, Benjamin Fowler, and Wolfgang Banzhaf. Evaluating methods for constant optimization of symbolic regression benchmark problems. In 2015 Brazilian conference on intelligent systems (BRACIS), pages 25--30. IEEE, 2015.
[14] William La Cava, Patryk Orzechowski, Bogdan Burlacu, Fabrício Olivetti de França, Marco Virgolin, Ying Jin, Michael Kommenda, and Jason H Moore. Contemporary symbolic regression methods and their relative performance. arXiv preprint arXiv:2107.14351, 2021.
[15] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411--436, 2006.
[16] P.E. Leser. Smcpy - sequential Monte Carlo with Python. https://github.com/nasa/smcpy, 2022.
[17] Samir W Mahfoud. Niching methods for genetic algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[18] Ole J Mengshoel and David E Goldberg. Probabilistic crowding: Deterministic crowding with probabilistic replacement. 1999.

Index Terms

Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression
1. Computing methodologies
  1. Artificial intelligence
    1. Search methodologies
  2. Machine learning
    1. Machine learning approaches

Index terms have been assigned to the content through auto-classification.

Published in
GECCO '22: Proceedings of the Genetic and Evolutionary Computation Conference Companion
July 2022
2395 pages
ISBN:9781450392686
DOI:10.1145/3520304
Editor:
Jonathan E. Fieldsend
University of Exeter
,
General Chair:
Markus Wagner
The University of Adelaide
Copyright © 2022 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%