Review of PySR: high-performance symbolic regression in Python and Julia

Tonda, Alberto

doi:10.1007/s10710-024-09503-4

Book Review
Published: 23 December 2024

Volume 26, article number 7, (2025)

Genetic Programming and Evolvable Machines

PySR 0.19.4 is an open-source, freely available Python module for Symbolic Regression (SR) written and maintained by Dr. Miles Cranmer of the University of Cambridge [1]. The module relies upon an internal engine written in Julia (SymbolicRegression.jl), optimized for computational efficiency, which can also be employed separately in Julia projects. PySR has excellent documentation and is under active development.

The main features that PySR offers to data scientists and practitioners are fully scikit-learn [2] compatible classes for a symbolic-regression-based regressor (PySRRegressor) and a classifier (PySRClassifier), with scikit-learn being the de facto standard for machine learning algorithms in Python. PySRRegressor and PySRClassifier can thus be seamlessly integrated into any Python machine learning pipeline that tests different algorithms of the same kind. For example, a PySRRegressor object could replace a scikit-learn RandomForestRegressor or Ridge, as they share the same API. Both the regressor and classifier classes offer similar options, allowing users to specify population size, number of generations, building blocks for the symbolic regression trees, and so on. PySR is also well integrated with the most commonly used packages in data science: the final equations can be obtained as a pandas^{Footnote 1} DataFrame, as sympy symbolic expressions,^{Footnote 2} or even as LaTeX code (Fig. 1).
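As a rough illustration of the drop-in use described above, the sketch below configures a PySRRegressor and exports its results. The option values and the helper names make_regressor and fit_and_report are assumptions made for this example, not settings taken from the review, and running the helpers requires PySR and its Julia backend to be installed.

```python
# Illustrative settings only; the values are assumptions, not recommendations.
PYSR_OPTIONS = {
    "niterations": 40,                         # length of the search
    "populations": 8,                          # number of island populations
    "population_size": 33,                     # individuals per population
    "binary_operators": ["+", "-", "*", "/"],  # building blocks with two inputs
    "unary_operators": ["sin", "exp"],         # building blocks with one input
}


def make_regressor(**overrides):
    """Return a PySRRegressor built from PYSR_OPTIONS plus any overrides."""
    from pysr import PySRRegressor  # imported lazily: needs PySR installed

    return PySRRegressor(**{**PYSR_OPTIONS, **overrides})


def fit_and_report(X, y):
    """Fit like any scikit-learn regressor, then export the equations."""
    model = make_regressor()
    model.fit(X, y)                 # same fit/predict calls as scikit-learn
    predictions = model.predict(X)
    table = model.equations_        # pandas DataFrame of candidate equations
    expression = model.sympy()      # selected equation as a sympy expression
    latex = model.latex()           # selected equation as a LaTeX string
    return predictions, table, expression, latex
```

Because only fit and predict are used on the model, the object returned by make_regressor could be swapped for another scikit-learn regressor in the same pipeline.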

To Genetic Programming researchers, PySR offers a computationally efficient symbolic regression implementation with recent and sophisticated features, and a code base that can be modified relatively easily to test new ideas.

PySR parallelizes function evaluations (using multiprocessing by default, but multithreading is also available), and can easily exploit more computing power by deploying on multi-node clusters. In fact, by default PySR uses a multi-population scheme with separate islands, each one associated with a different core. The number of populations is a parameter of each PySR class. In the documentation, Cranmer offers practical advice for tuning the algorithm and improving speed.^{Footnote 3}
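The execution modes mentioned above map onto a few PySRRegressor keyword arguments. The helper below is only a sketch: the function name, the mode labels and the default worker count are hypothetical, and the exact interplay between these arguments should be checked against the documentation of the installed PySR version.

```python
def parallel_options(mode, workers=4):
    """Return PySRRegressor keyword arguments for one execution mode.

    `mode` and `workers` are hypothetical names used only in this sketch.
    """
    if mode == "serial":
        # procs=0 together with multithreading=False turns parallelism off
        return {"procs": 0, "multithreading": False}
    if mode == "threads":
        return {"multithreading": True}
    if mode == "processes":
        return {"procs": workers, "multithreading": False}
    if mode == "slurm":
        # cluster_manager hands worker creation to a cluster scheduler
        return {
            "procs": workers,
            "multithreading": False,
            "cluster_manager": "slurm",
        }
    raise ValueError(f"unknown mode: {mode!r}")
```

A call such as PySRRegressor(populations=16, **parallel_options("processes", workers=8)) would then request sixteen islands spread over eight worker processes.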

At the end of each evolutionary run, PySR does not return a single solution, but rather a set of candidate equations, each one a compromise between complexity and goodness of fit. Complexity can be tuned by defining weights for each element appearing in a tree, via the complexity_of_operators argument. PySR also features a variety of useful options for practical applications, such as denoising the input data, performing feature selection or Principal Component Analysis before starting the symbolic regression process, and inserting expert knowledge into the expressions by seeding or by adding constraints, which may limit the nesting of expressions (e.g. to avoid poorly interpretable sequences such as sin(cos(sin(sin(…))))) or even the overall frequency of appearance of specific building blocks. PySR also makes it possible to add custom building blocks to symbolic regression or to replace the fitness function, by defining custom functions in Julia. Despite all these customization options, the default settings for PySRRegressor deliver good results for most applications, making it usable even by non-specialists.
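The trade-off between complexity and fit can be made concrete with a small, library-free sketch. It is not PySR's implementation: the tuple encoding of expressions, the weight values and both function names are assumptions chosen to show how per-operator weights change a complexity score and how a front of compromises is extracted from a list of candidates.

```python
# Per-operator weights, in the spirit of PySR's complexity_of_operators;
# the numbers are made up for this example.
OPERATOR_WEIGHTS = {"+": 1, "*": 1, "sin": 3, "exp": 3}


def complexity(expr, weights=OPERATOR_WEIGHTS, leaf_weight=1):
    """Sum node weights over an expression given as nested tuples.

    ("+", "x0", ("sin", "x1")) stands for x0 + sin(x1); strings and numbers
    are leaves. Operators without an explicit weight count as 1.
    """
    if not isinstance(expr, tuple):
        return leaf_weight
    operator, *children = expr
    return weights.get(operator, 1) + sum(
        complexity(child, weights, leaf_weight) for child in children
    )


def pareto_front(candidates):
    """Keep the (complexity, loss, label) entries that no simpler entry beats.

    Walking the candidates by increasing complexity, an entry survives only
    if its loss is strictly lower than that of every entry kept so far.
    """
    front = []
    best_loss = float("inf")
    for entry in sorted(candidates, key=lambda c: (c[0], c[1])):
        if entry[1] < best_loss:
            front.append(entry)
            best_loss = entry[1]
    return front
```

With these weights, x0 + sin(x1) scores 6 rather than 4, so expressions that lean on sin are pushed further along the complexity axis. In PySR itself, nesting limits are expressed through the nested_constraints argument, for instance {"sin": {"cos": 0}} to forbid cos inside sin.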

While I consider PySR a remarkable resource for ease of use, customizability, and speed, it has its drawbacks. For example, its computational efficiency comes at the price of reproducibility, as fixing the random seed for the pseudo-random number generators has no effect when multiprocessing is active. Keeping pseudo-random number generation coherent across parallel processes is a complex software engineering issue, and currently the only way of ensuring reproducibility is to force PySR to run on a single process, which of course impairs performance. Furthermore, while introducing new operators and fitness functions is not difficult, it still requires defining them using Julia syntax, which from the Python interface means writing Julia code inside Python strings; this can make debugging difficult. Finally, a minor issue for practitioners is that the heuristic used to automatically select a candidate equation on the final front of compromises can be flawed, and sometimes ignores better candidates; nevertheless, all equations on the front are saved and can be accessed individually.
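The three caveats above can be sketched in a few lines. The two dictionaries hold keyword arguments for PySRRegressor, while pick_equation is a hypothetical helper that replaces the automatic choice with an explicit rule; it is an assumption made for illustration, not a reimplementation of PySR's own heuristic.

```python
# Settings for a repeatable (but slower) single-process run.
REPRODUCIBLE_OPTIONS = {
    "random_state": 0,        # fixed seed
    "deterministic": True,    # request a deterministic search
    "procs": 0,               # no worker processes
    "multithreading": False,  # no threads either
}

# A custom operator is Julia source held in a Python string; sympy needs a
# matching Python definition in order to export the resulting equations.
CUSTOM_OPERATOR_OPTIONS = {
    "unary_operators": ["inv(x) = 1 / x"],
    "extra_sympy_mappings": {"inv": lambda x: 1 / x},
}


def pick_equation(rows, max_complexity):
    """Pick the lowest-loss row whose complexity fits a user-chosen budget.

    `rows` is a list of dicts with "complexity" and "loss" keys, e.g. from
    model.equations_.to_dict("records"). This rule is a hypothetical
    alternative to the built-in selection.
    """
    eligible = [row for row in rows if row["complexity"] <= max_complexity]
    if not eligible:
        raise ValueError("no equation within the complexity budget")
    return min(eligible, key=lambda row: row["loss"])
```

Both dictionaries can be unpacked into the constructor, as in PySRRegressor(**REPRODUCIBLE_OPTIONS). Once a row has been chosen, its position in model.equations_ can be passed as the index argument of model.predict or model.sympy to work with that equation instead of the automatically selected one.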

Despite some small drawbacks, PySR remains one of the few examples of modern, usable, computationally efficient symbolic regression I could find. The only comparable implementation of symbolic regression I am aware of is Operon [3], developed in C++, which also features scikit-learn compatible bindings for Python.^{Footnote 4} Being written in C++, Operon is much faster (about 0.04 s versus PySR’s 7 s to evaluate 100,000 candidates on my Linux laptop). The ease of use is similar, as both projects have a Python interface. However, modifying PySR for different purposes (e.g. testing research ideas) seems easier, as at worst PySR requires tweaking Julia code, which is relatively high-level when compared to Operon’s C++. PySR’s documentation is also more extensive and informative, and the current version of Operon, 0.4.0, still has a few issues. For example, when attempting to install Operon with Anaconda under Microsoft Windows, I faced several C++ compilation errors, while a Linux installation had no problems. It is worth mentioning that both Operon and PySR are among the algorithms currently being tested in the SRBench benchmarking effort.^{Footnote 5}

PySR can be easily installed through pip (pip install pysr) or cloned from its GitHub repository;^{Footnote 6} I would recommend pip, as installing from GitHub currently also requires installing Julia manually. On Microsoft Windows, PySR can be installed through Anaconda or Anaconda Cloud distributions. For more information, consult the documentation.^{Footnote 7}

Notes

1. pandas is arguably the most used Python library for manipulating tables and CSV files, https://pandas.pydata.org/docs/user_guide/index.html
2. sympy is a popular library for symbolic mathematics in Python, https://www.sympy.org/en/index.html
3. Advice for improving PySR speed in the documentation, https://astroautomata.com/PySR/tuning/
4. PyOperon GitHub repository, https://github.com/heal-research/pyoperon
5. SRBench GitHub repository, https://github.com/cavalab/srbench
6. PySR GitHub repository, https://github.com/MilesCranmer/PySR
7. PySR documentation, https://astroautomata.com/PySR/

References

1. M. Cranmer, Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl (2023). arXiv: https://arxiv.org/abs/2305.01582
2. F. Pedregosa et al., Scikit-learn: machine learning in Python. JMLR 12, pp. 2825–2830 (2011). https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
3. B. Burlacu et al., Operon C++: an efficient genetic programming framework for symbolic regression. In: Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 1562–1570 (2020). https://doi.org/10.1145/3377929.3398099

Author information

Authors and Affiliations

Applied Mathematics and Computer Science, Paris Saclay Joint Research Unit (UMR 518 MIA-PS), INRAE, Université Paris-Saclay, Palaiseau, France
Alberto Tonda

Corresponding author

Correspondence to Alberto Tonda.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tonda, A. Review of PySR: high-performance symbolic regression in Python and Julia. Genet Program Evolvable Mach 26, 7 (2025). https://doi.org/10.1007/s10710-024-09503-4
