Abstract
Learning symbolic expressions directly from experimental data is a vital step in AI-driven scientific discovery. Nevertheless, state-of-the-art approaches are limited to learning simple expressions; regressing expressions involving many independent variables remains out of reach. Motivated by the control variable experiments widely used in science, we propose Control Variable Genetic Programming (CVGP) for symbolic regression over many independent variables. CVGP expedites symbolic expression discovery via customized experiment design, rather than learning from a fixed dataset collected a priori. CVGP starts by fitting simple expressions involving a small set of independent variables using genetic programming, under controlled experiments in which the other variables are held constant. It then extends the expressions learned in previous generations by adding new independent variables, using new control variable experiments in which these variables are allowed to vary. Theoretically, we show that CVGP, as an incremental building approach, can yield an exponential reduction in the search space when learning a class of expressions. Experimentally, CVGP outperforms several baselines in learning symbolic expressions involving multiple independent variables.
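The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the genetic programming search is replaced by a least-squares fit of a fixed template, and the ground-truth function `oracle`, the helpers `fit_slope` and `controlled_trial`, and all numeric ranges are hypothetical choices made only for this illustration.

```python
import random

def oracle(x1, x2):
    """Hypothetical noise-free experiment with ground truth y = 3 * x1 * x2."""
    return 3.0 * x1 * x2

def fit_slope(xs, ys):
    """Least-squares slope for the template y = c * x (no intercept).

    Stands in for the genetic programming search, which is NOT implemented here.
    """
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def controlled_trial(x2_fixed, rng, n=20):
    """One control variable experiment: x2 is held constant while x1 varies."""
    xs = [rng.uniform(0.5, 2.0) for _ in range(n)]
    ys = [oracle(x, x2_fixed) for x in xs]
    return fit_slope(xs, ys)

rng = random.Random(0)

# Stage 1: fit y = c * x1 in several trials, each with a different fixed x2.
x2_values = [0.5, 1.0, 1.5, 2.0]
slopes = [controlled_trial(x2, rng) for x2 in x2_values]

# The fitted constant c changes from trial to trial, which signals that it
# depends on the controlled variable x2.
# Stage 2: free x2 and fit c = k * x2, extending the expression to
# y = k * x1 * x2.
k = fit_slope(x2_values, slopes)

print("per-trial constants:", slopes)
print("extended coefficient k:", k)
```

In this toy setting each stage searches over a single variable, which is the intuition behind fitting a reduced-form expression first and then extending it; the actual method searches over expression trees rather than a fixed template.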
Notes
1. The code is available at https://github.com/jiangnanhugo/cvgp/. Please refer to the extended version (https://arxiv.org/abs/2306.08057) for the Appendix.
Acknowledgments
We thank all the reviewers for their constructive comments. This research was supported by NSF grant CCF-1918327.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jiang, N., Xue, Y. (2023). Symbolic Regression via Control Variable Genetic Programming. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_11
DOI: https://doi.org/10.1007/978-3-031-43421-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1
eBook Packages: Computer Science (R0)