DOI: 10.1145/3520304.3528899
Poster

Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression

Published: 19 July 2022

ABSTRACT

When performing symbolic regression using genetic programming, overfitting and bloat can negatively impact generalizability and interpretability of the resulting equations as well as increase computation times. A Bayesian fitness metric is introduced and its impact on bloat and overfitting during population evolution is studied and compared to common alternatives in the literature. The proposed approach was found to be more robust to noise and data sparsity in numerical experiments, guiding evolution to a level of complexity appropriate to the dataset. Further evolution of the population resulted not in overfitting or bloat, but rather in slight simplifications in model form. The ability to identify an equation of complexity appropriate to the scale of noise in the training data was also demonstrated. In general, the Bayesian model selection algorithm was shown to be an effective means of regularization which resulted in less bloat and overfitting when any amount of noise was present in the training data.
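
The Bayesian fitness metric scores candidate equations by their model evidence, which automatically penalizes unnecessary complexity (the Bayesian Occam's razor [7]). The full method relies on fractional Bayes factors [8] estimated with sequential Monte Carlo [15, 16]; the sketch below is a much simpler stand-in that illustrates the same fit-versus-complexity trade-off using the BIC approximation to the log evidence. The function, data, and parameter counts are hypothetical, not taken from the paper.

    import numpy as np

    def log_evidence_bic(residuals, n_params):
        """Approximate log model evidence with the Bayesian Information
        Criterion: log p(D|M) ~ max log-likelihood - (k/2) log(n)."""
        n = residuals.size
        sigma2 = np.mean(residuals ** 2)  # MLE of Gaussian noise variance
        log_lik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
        return log_lik - 0.5 * n_params * np.log(n)

    # Hypothetical comparison: a simple equation vs. a bloated one that fits
    # the training data marginally better but carries many more constants.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

    simple_resid = y - 2.0 * x            # 1 fitted constant
    bloated_resid = 0.98 * simple_resid   # slightly smaller error, 8 constants

    print(log_evidence_bic(simple_resid, n_params=1))   # higher evidence wins
    print(log_evidence_bic(bloated_resid, n_params=8))  # complexity penalized

Under such a score, the marginal fit improvement of the bloated model cannot pay for its extra parameters, which is the regularizing behavior the abstract describes.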

The efficacy of a Genetic Programming (GP) [1] solution is often characterized by its (1) fitness, i.e., its ability to perform a training task, (2) complexity, and (3) generalizability, i.e., its ability to perform that task in an unseen scenario. Bloat is a common phenomenon in GP in which continued training results in significant increases in complexity with minimal improvement in fitness. Several theories for the prevalence of bloat in GP postulate possible evolutionary benefits [2]; for most practical purposes, however, bloat is a hindrance rather than a benefit: bloated solutions are less interpretable and more computationally expensive. Overfitting is another common phenomenon in GP and the broader machine learning field. It occurs when continued training results in better fitness but reduced generalizability.
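
Bloat and overfitting are distinguishable in exactly these terms: bloat is complexity growth without a fitness gain, while overfitting is a fitness gain without generalizability. As a concrete illustration, the sketch below flags both conditions from a per-generation record of a best individual's training error, validation error, and tree size; the record format and thresholds are assumptions made for illustration, not the paper's.

    def diagnose(history):
        """Flag bloat and overfitting from a per-generation record of the
        best individual: dicts with 'train_err', 'val_err', and 'size'
        (hypothetical format)."""
        prev, last = history[-2], history[-1]
        fit_gain = prev["train_err"] - last["train_err"]   # (1) fitness change
        growth = last["size"] - prev["size"]               # (2) complexity change
        # Bloat: complexity grew while training fitness barely improved.
        bloat = growth > 0 and fit_gain / growth < 1e-3
        # Overfitting: training error kept falling while unseen-data error rose. (3)
        best_val = min(h["val_err"] for h in history[:-1])
        overfit = (last["train_err"] <= prev["train_err"]
                   and last["val_err"] > best_val)
        return {"bloat": bloat, "overfit": overfit}

    history = [
        {"train_err": 0.50, "val_err": 0.55, "size": 7},
        {"train_err": 0.10, "val_err": 0.12, "size": 15},
        {"train_err": 0.09, "val_err": 0.20, "size": 60},  # bigger, barely fitter
    ]
    print(diagnose(history))  # {'bloat': True, 'overfit': True}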

References

  1. John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection, volume 1. MIT Press, 1992.
  2. Vipul K. Dabhi and Sanjay Chaudhary. A survey on techniques of improving generalization ability of genetic programming solutions. arXiv preprint arXiv:1211.1119, 2012.
  3. Tejashvi R. Naik and Vipul K. Dabhi. Improving generalization ability of genetic programming: comparative study. Journal of Bioinformatics and Intelligent Control, 2(4):243--252, 2013.
  4. Jeannie Fitzgerald, R. Muhammad Atif Azad, and Conor Ryan. Bootstrapping to reduce bloat and improve generalisation in genetic programming. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pages 141--142, 2013.
  5. Ekaterina J. Vladislavleva, Guido F. Smits, and Dick Den Hertog. Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Transactions on Evolutionary Computation, 13(2):333--349, 2008.
  6. Sara Silva and Leonardo Vanneschi. Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pages 1115--1122, 2009.
  7. Iain Murray and Zoubin Ghahramani. A note on the evidence and Bayesian Occam's razor. 2005.
  8. Anthony O'Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):99--118, 1995.
  9. G. F. Bomarito. Bingo. https://github.com/nasa/bingo, 2022.
  10. Michael Schmidt and Hod Lipson. Comparison of tree and graph encodings as function of problem complexity. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pages 1674--1679, 2007.
  11. Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pages 1121--1128, 2013.
  12. Emigdio Z-Flores, Leonardo Trujillo, Oliver Schütze, Pierrick Legrand, et al. Evaluating the effects of local search in genetic programming. In EVOLVE: A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V, pages 213--228. Springer, 2014.
  13. Vinicius Veloso De Melo, Benjamin Fowler, and Wolfgang Banzhaf. Evaluating methods for constant optimization of symbolic regression benchmark problems. In 2015 Brazilian Conference on Intelligent Systems (BRACIS), pages 25--30. IEEE, 2015.
  14. William La Cava, Patryk Orzechowski, Bogdan Burlacu, Fabrício Olivetti de França, Marco Virgolin, Ying Jin, Michael Kommenda, and Jason H. Moore. Contemporary symbolic regression methods and their relative performance. arXiv preprint arXiv:2107.14351, 2021.
  15. Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411--436, 2006.
  16. P. E. Leser. SMCPy: Sequential Monte Carlo with Python. https://github.com/nasa/smcpy, 2022.
  17. Samir W. Mahfoud. Niching Methods for Genetic Algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
  18. Ole J. Mengshoel and David E. Goldberg. Probabilistic crowding: deterministic crowding with probabilistic replacement. 1999.

Published in

GECCO '22: Proceedings of the Genetic and Evolutionary Computation Conference Companion, July 2022, 2395 pages. ISBN: 9781450392686. DOI: 10.1145/3520304.

Copyright © 2022 Owner/Author. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States

