Abstract
We examine the genetic evolution-based algorithm for symbolic regression from a probabilistic dynamical perspective. This approach permits us to follow the candidate functions of the search space from generation to generation as they improve their fitness and finally converge to the function that best matches a given data set. In particular, we use this statistical framework to explore the optimal external parameters that govern a special mutation operator, which can systematically improve the numerical values of the constants contained in each candidate formula of the search space. We then apply symbolic regression to the chaotic logistic map and the Lorenz system.




References
Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.
Sarachik ES, Cane MA. The El Niño-Southern Oscillation phenomenon. Cambridge: Cambridge University Press; 2010.
Vladislavleva E, Friedrich T, Neumann F, Wagner M. Predicting the energy output of wind farms based on weather data: important variables and their correlation. Renew Energy. 2013;50:236.
Fitzsimmons J, Moscato P. Symbolic regression modelling of drug responses. In: First IEEE Conference on Artificial Intelligence for Industries; 2018.
Graham MJ, Djorgovski SG, Mahabal A, Donalek C, Drake A, Longo G. Data challenges of time domain astronomy. Distr Parallel Databases. 2012;30(5):371.
Schmidt M, Lipson H. Distilling free-form natural laws from experimental data. Science. 2009;324(5923):81.
Udrescu SM, Tegmark M. The Feynman database for symbolic regression. https://space.mit.edu/home/tegmark/aifeynman.html; 2020
Udrescu SM, Tegmark M. AI Feynman: a physics-inspired method for symbolic regression. Sci Adv. 2020;6(16):eaay2631.
Durasevic M, Jakobovic D, Scoczynski Ribeiro Martins M, Picek S, Wagner M. Fitness landscape analysis of dimensionally-aware genetic programming featuring Feynman equations. arXiv:2004.12762v1 [cs.NE]; 2020.
Quade M, Abel M, Shafi K, Niven RK, Noack BR. Prediction of dynamical systems by symbolic regression. Phys Rev E. 2016;94:012214.
Gautier N, Aider JL, Duriez T, Noack B, Segond M, Abel M. Closed-loop separation control using machine learning. J Fluid Mech. 2015;770:442.
Qin H. Machine learning and serving of discrete field theories - when artificial intelligence meets the discrete universe. arXiv:1910.10147; 2019.
Gong C, Su Q, Grobe R. Machine learning techniques in the examination of the electron-positron pair creation process. J Opt Soc Am B. 2021;38:3582–91.
Zimmermann RS, Parlitz U. Observing spatio-temporal dynamics of excitable media using reservoir computing. Chaos. 2018;28:043118.
Tanaka G, Yamane T, Héroux JB, Nakane R, Kanazawa N, Takeda S, Numata H, Nakano D, Hirose A. Recent advances in physical reservoir computing: a review. Neural Netw. 2019;115:100.
Lu Z, Hunt BR, Ott E. Attractor reconstruction by machine learning. Chaos. 2018;28:061104.
Symbolic regression is a relatively young research field and there are no extensive reviews for direct applications in physics yet. Two interesting early articles are [17,18].
Vladislavleva K. Model-based problem solving through symbolic regression via Pareto genetic programming. PhD thesis, Tilburg University; 2008.
Minnebo W, Stijven S. Empowering knowledge computing with variable selection. MSc thesis, University of Antwerp; 2011.
Bruneton JP, Cazenille L, Douin A, Reverdy V. Exploration and exploitation in symbolic regression using quality-diversity and evolutionary strategies algorithms. arXiv:1906.03959v1 [cs.NE]; 2019.
Koza JR. Genetic programming: on the programming of computers by means of natural selection. Cambridge: MIT Press; 1992.
Koza JR. Genetic programming. Cambridge: MIT Press; 1998.
Lambora A, Gupta K, Chopra K. Genetic algorithm—a literature review. In: International conference on machine learning, big data, cloud and parallel computing (COMITCon); 2019, p 380.
Miller B, Goldberg D. Genetic algorithms, tournament selection and the effects of noise. Complex Syst. 1995;9:193.
Blickle T, Thiele L. A comparison of selection schemes used in evolutionary algorithms. Evol Comput. 1996;4:361.
Goldberg D, Deb K. A comparative analysis of selection schemes used in genetic algorithms. Found Genet Algor. 1991;1:69.
Holland JH. Adaptation in natural and artificial systems. Cambridge: MIT Press; 1975.
Gavrilets S. Fitness landscapes and the origin of species. Princeton: Princeton University Press; 2004.
McCandlish DM. Visualizing fitness landscapes. Evolution. 2011;65:1544.
Wright S. The roles of mutation, inbreeding, crossbreeding, and selection in evolution. Proc Six Int Congr Genet. 1932;1:355.
Richter H, Engelbrecht A. Recent advances in the theory and application of fitness landscapes. Heidelberg: Springer; 2014.
May R. Simple mathematical models with very complicated dynamics. Nature. 1976;261:459.
Tan JPL. Simulated extrapolated dynamics with parametrization networks. arXiv:1902.03440v1 [nlin.CD]; 2019.
Lorenz EN. Deterministic nonperiodic flow. J Atmos Sci. 1963;20(2):130.
Acknowledgements
We would like to thank Profs. B.K. Clark, X. Fang, Z.L. Li, Y.J. Li and G.H. Rutherford, and G. Jacob, Z. Smozhanyk, and T. Walsh for many helpful discussions and suggestions. This work has been supported by the NSF. C.G. would like to thank ILP for the nice hospitality during his visit to Illinois State University and acknowledges the China Scholarship Council program for his PhD research. We also acknowledge access to the HPC cluster provided by Illinois State University.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
(A) \(P_{m}(t)\) for a model system of M classes
To get some first insight into the time scales of the temporal evolution of the proportions \(P_{m}(t)\) of each class under the tournament selection, we present in this appendix an oversimplified system of M classes, where the corresponding initial fitness densities \(\rho _{m}(f,t=0)\) are so narrow that they do not overlap with each other. This permits us to derive a universal iteration scheme, where the proportions \(P_{m}(t)\) can be computed directly from the set of \(P_{m}(t=0)\) without specifying the shape of \(\rho _{m}(f,t=0)\).
In general, the new set of proportions \(P_{m}(t+1)\) after the application of all \(N_\mathrm{pop}\) tournaments can be obtained via the expression
\[ P_{m}(t+1)=P_{m}(t)\int _{0}^{\infty }\text {d}f\,\rho _{m}(f,t)\,S(f,t). \qquad (14) \]
The assumption of non-overlapping densities means that we can assign each class a unique mean fitness value, defined as \(\int _{0}^{\infty }\text {d}f'\ f'\rho _{m}(f',t)\equiv f_{m}\). This permits us to order the class labels such that their associated mean fitness increases with increasing label m, i.e., \(f_{m}<f_{m+1}\). If we further assume that each \(\rho _{m}(f,t)\) is essentially non-zero only in the interval \([f_{m}-\varDelta f/2, f_{m}+\varDelta f/2]\), then the integration range of the first integral \(\int _{0}^{\infty }\text {d}f\) of Eq. (14) can be approximated by \(\int _{f_{m}-\varDelta f/2}^{f_{m}+\varDelta f/2}\text {d}f\). This means that the largest upper integration value f of the second integral \(\int _{0}^{f}\text {d}f'\) is at most \(f=f_{m}+\varDelta f/2\). As a result, some of the integrals in \(S(f,t)=n_{T}[1-\sum _{m'=1}^{M}P_{m'}(t)\int _{0}^{f}\text {d}f'\rho _{m'} (f',t)]^{n_{T}-1}\) can be partially evaluated and therefore simplify significantly. The densities \(\rho _{m'}(f',t)\) with a mean fitness lower than \(f_{m}\) are integrated over their entire extent, so that we can use \(\int _{f_{m'}-\varDelta f/2}^{f_{m'}+\varDelta f/2}\text {d}f'\,\rho _{m'}(f',t)=1\). As a result, we obtain \(\sum _{m'=1}^{M}P_{m'}(t)\int _{0}^{f}\text {d}f'\,\rho _{m'}(f',t) =\sum _{m'=1}^{m-1}P_{m'}(t)+P_{m}(t)\int _{0}^{f}\text {d}f'\,\rho _{m}(f',t)\). This permits us to represent the entire integrand of Eq. (14) as a total derivative with respect to f and we obtain
\[ P_{m}(t+1)=\Big [1-\sum _{m'=1}^{m-1}P_{m'}(t)\Big ]^{n_{T}}-\Big [1-\sum _{m'=1}^{m}P_{m'}(t)\Big ]^{n_{T}}. \qquad (15) \]
If we expand this set of equations, we obtain the sequence of mutually coupled iterations
\[ \begin{aligned} P_{1}(t+1)&=1-[1-P_{1}(t)]^{n_{T}},\\ P_{2}(t+1)&=[1-P_{1}(t)]^{n_{T}}-[1-P_{1}(t)-P_{2}(t)]^{n_{T}},\\ &\;\;\vdots \\ P_{M}(t+1)&=[P_{M}(t)]^{n_{T}}. \end{aligned} \qquad (16) \]
This means that we have obtained an iteration scheme that computes the class proportions \(P_{m}(t+1)\) of the next generation solely from the proportions \(P_{m'}(t)\) of the classes with a lower (or equal) fitness, i.e., with \(m'\le m\). One can easily convince oneself that the norm is preserved by this set of maps, i.e., \(\sum _{m=1}^{M}P_{m}(t+1)=\sum _{m=1}^{M} P_{m}(t)=1\): the sum over Eq. (16) telescopes to \(1-[1-\sum _{m=1}^{M}P_{m}(t)]^{n_{T}}=1\).
Fig. 5 Evolution of the proportions \(P_{m}(t)\) for \(M=10\) classes during the first five generations according to the model given by Eq. (16). The initial proportions were chosen as \(P_{m}(t=0)=1/M\). Only the proportions with the three lowest fitness values are labeled
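The iteration in Eq. (16) is straightforward to evaluate numerically. The following minimal Python sketch (not part of this work's code base; the helper name tournament_step is ours) advances the proportions for the parameters used in Fig. 5, i.e., \(M=10\) classes, uniform initial proportions and tournament size \(n_{T}=2\), and checks the norm preservation discussed above.

import numpy as np

def tournament_step(P, n_T):
    # One generation of Eq. (16):
    # P_m(t+1) = [1 - sum_{m'<m} P_{m'}(t)]^{n_T} - [1 - sum_{m'<=m} P_{m'}(t)]^{n_T}
    cum = np.cumsum(P)                           # partial sums over m' <= m
    below = np.concatenate(([0.0], cum[:-1]))    # partial sums over m' < m
    return (1.0 - below) ** n_T - (1.0 - cum) ** n_T

M, n_T = 10, 2                      # setup of Fig. 5
P = np.full(M, 1.0 / M)             # uniform initial proportions P_m(0) = 1/M
for t in range(5):                  # first five generations
    P = tournament_step(P, n_T)
    assert abs(P.sum() - 1.0) < 1e-12   # the telescoping sum preserves the norm
    print(t + 1, np.round(P, 4))

Already after the first generation the output shows the redistribution toward the low-fitness classes, e.g., \(P_{1}(1)=1-0.9^{2}=0.19\) while \(P_{10}(1)=0.1^{2}=0.01\), in line with the behavior displayed in Fig. 5.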
As an interesting side note, we remark that despite the nonlinear character of these iterative maps, for the class with the lowest fitness, \({m}=1\), we have the simpler iterative scheme \(P_{1}(t+1)=1-[1-P_{1}(t)]^{n_{T}}\), which converges monotonically to \(P_{1}(t\rightarrow \infty )\rightarrow 1\). If we introduce the complementary proportion \(Q_{1}(t)\equiv 1-P_{1}(t)\), we have \(1-P_{1}(t+1)=[1-P_{1}(t)]^{n_{T}}\) such that \(Q_{1}(t+1)=Q_{1}(t)^{n_{T}}\). This map has the closed-form solution \(Q_{1}(t)=Q_{1}(0)^{n_{T}^{t}}\), such that \(P_{1}(t)=1-[1-P_{1}(0)]^{n_{T}^{t}}\), so \(P_{1}(t)\) grows monotonically toward 1, the faster the larger the tournament size \(n_{T}\), and independently of the proportions \(P_{m}\) of the other classes, as one might expect. While the growth is monotonic, its time scale depends not only on \(n_{T}\), but also very sensitively on the initial value \(P_{1}(0)\). If \(P_{1}(0)\ll 1\), then for short times \(P_{1}(t)\approx n_{T}^{t}P_{1}(0)\), i.e., \(P_{1}(t)\) initially grows geometrically with a factor \(n_{T}\) per generation.
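For completeness, the closed form quoted above follows by unrolling the complementary map one generation at a time:
\[ Q_{1}(t)=\big [Q_{1}(t-1)\big ]^{n_{T}}=\big [Q_{1}(t-2)\big ]^{n_{T}^{2}}=\cdots =\big [Q_{1}(0)\big ]^{n_{T}^{t}}, \]
and the same unrolling applies to the map for the highest-fitness class discussed below, yielding \(P_{M}(t)=[P_{M}(0)]^{n_{T}^{t}}\).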
At the opposite end, if m matches the total number of classes, i.e., \(m=M\), then the iteration scheme for the class with the largest fitness \(f_{M}\) simplifies to
\[ P_{M}(t+1)=[P_{M}(t)]^{n_{T}}. \qquad (17) \]
This permits us to find the complete time evolution for \(t=1,2,\ldots \) as \(P_{M}(t)=[P_{M}(0)]^{n_{T}^{t}}\), i.e., a universal monotonic decay toward zero that becomes faster with increasing tournament size \(n_{T}\).
The time evolution of all the other proportions \(P_{m}(t)\) with \(m\ne 1\) and \(m\ne M\) can be non-monotonic. As an example, in Fig. 5 we show the evolution of \(M=10\) classes with \(P_{m}(t=0)=1/M\) for the first five generations with a tournament size \(n_{T}=2\). We see that the low-fitness proportions \(P_{m}(t)\) (for \({m}=1,\ldots ,5\)) increase at first and then decay, except for \(P_{1}(t)\), which approaches 1 monotonically.
Cite this article
Gong, C., Bryan, J., Furcoiu, A. et al. Evolutionary Symbolic Regression from a Probabilistic Perspective. SN COMPUT. SCI. 3, 209 (2022). https://doi.org/10.1007/s42979-022-01094-0