milescranmer / pysr

High-Performance Symbolic Regression in Python and Julia

Home Page: https://astroautomata.com/PySR

License: Apache License 2.0

Python 96.50% Shell 0.68% Dockerfile 1.46% Jupyter Notebook 1.36%

symbolic-regression machine-learning python julia genetic-algorithm automl numpy interpretable-ml data-science explainable-ai

pysr's Introduction

PySR searches for symbolic expressions which optimize a particular objective.

PySR: High-Performance Symbolic Regression in Python and Julia

If you find PySR useful, please cite the paper arXiv:2305.01582. If you've finished a project with PySR, please submit a PR to showcase your work on the research showcase page!

Contents:

Why PySR?
Installation
Quickstart
→ Documentation
Contributors

Why PySR?

PySR is an open-source tool for Symbolic Regression: a machine learning task where the goal is to find an interpretable symbolic expression that optimizes some objective.

Over a period of several years, PySR has been engineered from the ground up to be (1) as high-performance as possible, (2) as configurable as possible, and (3) easy to use. PySR is developed alongside the Julia library SymbolicRegression.jl, which forms the powerful search engine of PySR. The details of these algorithms are described in the PySR paper.

Symbolic regression works best on low-dimensional datasets, but one can also extend these approaches to higher-dimensional spaces by using "Symbolic Distillation" of Neural Networks, as explained in 2006.11287, where we apply it to N-body problems. Here, one essentially uses symbolic regression to convert a neural net to an analytic equation. Thus, these tools simultaneously present an explicit and powerful way to interpret deep neural networks.

Installation

Pip

You can install PySR with pip:

pip install pysr

Julia dependencies will be installed at first import.

Conda

Similarly, with conda:

conda install -c conda-forge pysr

Dockerfile

You can also use the Dockerfile to install PySR in a Docker container:

Clone this repo.
Within the repo's directory, build the docker container:

docker build -t pysr .

You can then start the container with an IPython execution with:

docker run -it --rm pysr ipython

For more details, see the docker section.

Troubleshooting

One issue you might run into can result in a hard crash at import with a message like "GLIBCXX_... not found". This is due to another one of the Python dependencies loading an incorrect libstdc++ library. To fix this, you should modify your LD_LIBRARY_PATH variable to reference the Julia libraries. For example, if the Julia version of libstdc++.so is located in $HOME/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/ (which likely differs on your system!), you could add:

export LD_LIBRARY_PATH=$HOME/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/:$LD_LIBRARY_PATH

to your .bashrc or .zshrc file.

Quickstart

You might wish to try the interactive tutorial here, which uses the notebook in examples/pysr_demo.ipynb.

In practice, I highly recommend using IPython rather than Jupyter, as the printing is much nicer. Below is a quick demo which you can paste into a Python runtime. First, let's import numpy to generate some test data:

import numpy as np

X = 2 * np.random.randn(100, 5)
y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

We have created a dataset with 100 datapoints, with 5 features each. The relation we wish to model is $2.5382 \cos(x_3) + x_0^2 - 0.5$.

Now, let's create a PySR model and train it. PySR's main interface is in the style of scikit-learn:

from pysr import PySRRegressor

model = PySRRegressor(
    niterations=40,  # < Increase me for better results
    binary_operators=["+", "*"],
    unary_operators=[
        "cos",
        "exp",
        "sin",
        "inv(x) = 1/x",
        # ^ Custom operator (julia syntax)
    ],
    extra_sympy_mappings={"inv": lambda x: 1 / x},
    # ^ Define operator for SymPy as well
    elementwise_loss="loss(prediction, target) = (prediction - target)^2",
    # ^ Custom loss function (julia syntax)
)

This will set up the model for 40 iterations of the search code, which contains hundreds of thousands of mutations and equation evaluations.

Let's train this model on our dataset:

model.fit(X, y)

Internally, this launches a Julia process which will do a multithreaded search for equations to fit the dataset.

Equations will be printed during training, and once you are satisfied, you may quit early by hitting 'q' and then <enter>.

After the model has been fit, you can run model.predict(X) to see the predictions on a given dataset using the automatically-selected expression, or, for example, model.predict(X, 3) to see the predictions of the 3rd equation.

You may run:

print(model)

to print the learned equations:

PySRRegressor.equations_ = [
	   pick     score                                           equation       loss  complexity
	0        0.000000                                          4.4324794  42.354317           1
	1        1.255691                                          (x0 * x0)   3.437307           3
	2        0.011629                          ((x0 * x0) + -0.28087974)   3.358285           5
	3        0.897855                              ((x0 * x0) + cos(x3))   1.368308           6
	4        0.857018                ((x0 * x0) + (cos(x3) * 2.4566472))   0.246483           8
	5  >>>>       inf  (((cos(x3) + -0.19699033) * 2.5382123) + (x0 *...   0.000000          10
]

This arrow in the pick column indicates which equation is currently selected by your model_selection strategy for prediction. (You may change model_selection after .fit(X, y) as well.)
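As a worked check of the table above, the score column appears consistent with the negative change in log-loss divided by the change in complexity between consecutive rows. The sketch below recomputes it from the printed values; the helper name `scores_from_table` is hypothetical and not part of PySR. The last row is left out because its printed loss of 0.000000 has no finite logarithm, which matches the `inf` shown in the table.

```python
import math

# (complexity, loss) pairs copied from the printed table above.
rows = [
    (1, 42.354317),
    (3, 3.437307),
    (5, 3.358285),
    (6, 1.368308),
    (8, 0.246483),
]


def scores_from_table(rows):
    """Recompute scores as -delta(log(loss)) / delta(complexity).

    Hypothetical helper for illustration only. The first row has no
    predecessor, so its score is set to 0.0.
    """
    scores = [0.0]
    for (c_prev, l_prev), (c_cur, l_cur) in zip(rows, rows[1:]):
        drop_in_log_loss = math.log(l_prev) - math.log(l_cur)
        scores.append(drop_in_log_loss / (c_cur - c_prev))
    return scores


scores = scores_from_table(rows)
for (complexity, loss), score in zip(rows, scores):
    print(f"complexity={complexity:2d}  loss={loss:10.6f}  score={score:.6f}")
```

A high score marks a row where a small increase in complexity bought a large drop in loss, which is why the jump from `4.4324794` to `(x0 * x0)` stands out.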

model.equations_ is a pandas DataFrame containing all equations, including callable format (lambda_format), SymPy format (sympy_format - which you can also get with model.sympy()), and even JAX and PyTorch format (both of which are differentiable - which you can get with model.jax() and model.pytorch()).

Note that PySRRegressor stores the state of the last search, and will restart from where you left off the next time you call .fit(), assuming you have set warm_start=True. This will cause problems if significant changes are made to the search parameters (like changing the operators). You can run model.reset() to reset the state.

You will notice that PySR will save two files: hall_of_fame...csv and hall_of_fame...pkl. The csv file is a list of equations and their losses, and the pkl file is a saved state of the model. You may load the model from the pkl file with:

model = PySRRegressor.from_file("hall_of_fame.2022-08-10_100832.281.pkl")

There are several other useful features, such as denoising (e.g., denoise=True) and feature selection (e.g., select_k_features=3). For examples of these and other features, see the examples page. For a detailed look at more options, see the options page. You can also see the full API at this page. There are also tips for tuning PySR on this page.

Detailed Example

The following code makes use of as many PySR features as possible. Note that this is just a demonstration of features and you should not use this example as-is. For details on what each parameter does, check out the API page.

import sympy
from pysr import PySRRegressor

model = PySRRegressor(
    procs=4,
    populations=8,
    # ^ 2 populations per core, so one is always running.
    population_size=50,
    # ^ Slightly larger populations, for greater diversity.
    ncycles_per_iteration=500,
    # ^ Generations between migrations.
    niterations=10000000,  # Run forever
    early_stop_condition=(
        "stop_if(loss, complexity) = loss < 1e-6 && complexity < 10"
        # Stop early if we find a good and simple equation
    ),
    timeout_in_seconds=60 * 60 * 24,
    # ^ Alternatively, stop after 24 hours have passed.
    maxsize=50,
    # ^ Allow greater complexity.
    maxdepth=10,
    # ^ But, avoid deep nesting.
    binary_operators=["*", "+", "-", "/"],
    unary_operators=["square", "cube", "exp", "cos2(x)=cos(x)^2"],
    constraints={
        "/": (-1, 9),
        "square": 9,
        "cube": 9,
        "exp": 9,
    },
    # ^ Limit the complexity within each argument.
    # "/": (-1, 9) states that the numerator has no constraint,
    # but the denominator has a max complexity of 9.
    # "exp": 9 simply states that `exp` can only have
    # an expression of complexity at most 9 as input.
    nested_constraints={
        "square": {"square": 1, "cube": 1, "exp": 0},
        "cube": {"square": 1, "cube": 1, "exp": 0},
        "exp": {"square": 1, "cube": 1, "exp": 0},
    },
    # ^ Nesting constraints on operators. For example,
    # "square(exp(x))" is not allowed, since "square": {"exp": 0}.
    complexity_of_operators={"/": 2, "exp": 3},
    # ^ Custom complexity of particular operators.
    complexity_of_constants=2,
    # ^ Punish constants more than variables
    select_k_features=4,
    # ^ Train on only the 4 most important features
    progress=True,
    # ^ Can set to false if printing to a file.
    weight_randomize=0.1,
    # ^ Randomize the tree much more frequently
    cluster_manager=None,
    # ^ Can be set to, e.g., "slurm", to run a slurm
    # cluster. Just launch one script from the head node.
    precision=64,
    # ^ Higher precision calculations.
    warm_start=True,
    # ^ Start from where left off.
    bumper=True,
    # ^ Faster evaluation (experimental)
    extra_sympy_mappings={"cos2": lambda x: sympy.cos(x)**2},
    # extra_torch_mappings={sympy.cos: torch.cos},
    # ^ Not needed as cos already defined, but this
    # is how you define custom torch operators.
    # extra_jax_mappings={sympy.cos: "jnp.cos"},
    # ^ For JAX, one passes a string.
)
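To make the `nested_constraints` comments above concrete, here is a minimal sketch of how such a check could work on a toy expression tree. It assumes that a limit of `n` means the inner operator may be stacked at most `n` times anywhere inside the outer operator's arguments. The tuple-based tree format and the names `max_nesting` and `satisfies_nested` are hypothetical; this is not PySR's implementation.

```python
# Toy expression trees: a leaf is a string or number, an operator node
# is a tuple ("op", arg1, arg2, ...). This format is an assumption made
# for illustration only.

def max_nesting(tree, inner):
    """Largest number of `inner` operators stacked along any path in `tree`."""
    if not isinstance(tree, tuple):
        return 0
    op, *args = tree
    below = max((max_nesting(arg, inner) for arg in args), default=0)
    return below + (1 if op == inner else 0)


def satisfies_nested(tree, nested_constraints):
    """Return True if no outer/inner pair exceeds its nesting limit."""
    if not isinstance(tree, tuple):
        return True
    op, *args = tree
    for inner, limit in nested_constraints.get(op, {}).items():
        if any(max_nesting(arg, inner) > limit for arg in args):
            return False
    return all(satisfies_nested(arg, nested_constraints) for arg in args)


nested = {
    "square": {"square": 1, "cube": 1, "exp": 0},
    "cube": {"square": 1, "cube": 1, "exp": 0},
    "exp": {"square": 1, "cube": 1, "exp": 0},
}

square_of_exp = ("square", ("exp", "x0"))
square_of_square = ("square", ("square", "x0"))
triple_square = ("square", ("+", ("square", ("square", "x0")), "x1"))

print(satisfies_nested(square_of_exp, nested))
print(satisfies_nested(square_of_square, nested))
print(satisfies_nested(triple_square, nested))
```

Under this reading, `square(exp(x0))` is rejected, one level of `square` inside `square` is accepted, and two levels are rejected even when a `+` sits in between.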

Docker

You can also test out PySR in Docker, without installing it locally, by running the following command in the root directory of this repo:

docker build -t pysr .

This builds an image called pysr for your system's architecture, which also contains IPython. You can select a specific version of Python and Julia with:

docker build -t pysr --build-arg JLVERSION=1.10.0 --build-arg PYVERSION=3.11.6 .

You can then run this image using:

docker run -it --rm -v "$PWD:/data" pysr ipython

which will link the current directory to the container's /data directory and then launch ipython.

If you have issues building for your system's architecture, you can emulate another architecture by including --platform linux/amd64 in both the build and run commands.

Contributors ✨

We are eager to welcome new contributors! Check out our contributors guide for tips 🚀. If you have an idea for a new feature, don't hesitate to share it on the issues or discussions page.

Mark Kittisopikul 💻 💡 🚇 📦 📣 👀 🔧 ⚠️	T Coxon 🐛 💻 🔌 💡 🚇 🚧 👀 🔧 ⚠️ 📓	Dhananjay Ashok 💻 🌍 💡 🚧 ⚠️	Johan Blåbäck 🐛 💻 💡 🚧 📣 👀 ⚠️ 📓	JuliusMartensen 🐛 💻 📖 🔌 💡 🚇 🚧 📦 📣 👀 🔧 📓	ngam 💻 🚇 📦 👀 🔧 ⚠️	Christopher Rowley 💻 💡 🚇 📦 👀	Kaze Wong 🐛 💻 💡 🚇 🚧 📣 👀 🔬 📓
Christopher Rackauckas 🐛 💻 🔌 💡 🚇 📣 👀 🔬 🔧 ⚠️ 📓	Patrick Kidger 🐛 💻 📖 🔌 💡 🚧 📣 👀 🔬 🔧 ⚠️ 📓	Okon Samuel 🐛 💻 📖 🚧 💡 🚇 👀 ⚠️ 📓	William Booth-Clibborn 💻 🌍 📖 📓 🚧 👀 🔧 ⚠️	Pablo Lemos 🐛 💡 📣 👀 🔬 📓	Jerry Ling 🐛 💻 📖 🌍 💡 📣 👀 📓	Charles Fox 🐛 💻 💡 🚧 📣 👀 🔬 📓	Johann Brehmer 💻 📖 💡 📣 👀 🔬 ⚠️ 📓
Marius Millea 💻 💡 📣 👀 📓	Coba 🐛 💻 💡 👀 📓	foxtran 💻 💡 🚧 🔧 📓	Shah Mahdi Hasan 🐛 💻 👀 📓	Pietro Monticone 🐛 📖 💡	Mateusz Kubica 📖 💡	Jay Wadekar 🐛 💡 📣 🔬	Anthony Blaom, PhD 🚇 💡 👀
Jgmedina95 🐛 💡 👀	Michael Abbott 💻 💡 👀 🔧	Oscar Smith 💻 💡	Eric Hanson 💡 📣 📓	Henrique Becker 💻 💡 👀	qwertyjl 🐛 📖 💡 📓	Rik Huijzer 💡 🚇	Hongyu Wang 💡 📣 🔬
Zehao Jin 🔬 📣	Tanner Mengel 🔬 📣	Arthur Grundner 🔬 📣	sjwetzel 🔬 📣 📓	Saurav Maheshkar 🔧

pysr's Issues

Recursive feature selection

I think something that may give a large algorithmic performance boost is recursive feature selection. Here is the idea:

Hold "next-mutation" feature importances for every equation separately.
Start these at uniform importances.
Randomly choose to update the feature importances for an equation each iteration, with low probability.
To calculate the next-mutation feature importances, get the residual between the current equation's prediction and the target dataset. Use XGBoost.jl to predict the residual using the features. Use the trained XGBoost model to determine feature importance.
Every time one performs a mutation that involves selecting a new feature, the random selection would be weighted by the equation's current recorded feature importances. I think this will help the model when there are a large number of features.
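The last step of the idea above, importance-weighted feature selection during mutation, can be sketched with the standard library alone. The importance values below are made up for illustration; in the proposal they would come from a model fitted to the residual, which is not shown here. The function name `pick_feature` is hypothetical.

```python
import random


def pick_feature(importances, rng):
    """Sample a feature index with probability proportional to its importance."""
    indices = range(len(importances))
    return rng.choices(indices, weights=importances, k=1)[0]


rng = random.Random(0)

# Starting point in the proposal: uniform importances.
uniform = [1.0] * 5
# Hypothetical importances after one residual fit (illustrative values only).
updated = [0.05, 0.05, 0.1, 0.7, 0.1]

counts = [0] * len(updated)
for _ in range(10_000):
    counts[pick_feature(updated, rng)] += 1

print(counts)
```

With these weights, mutations would pick feature 3 roughly 70% of the time while still occasionally exploring the others.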

On a more general note, I think it could be interesting to look at the gradients of an equation with respect to each subtree. Perhaps that could provide information on which subtree to mutate next, and with what operation.

Mutable default arguments

pysr.pysr has some mutable default arguments, which is a bug.

Partial derivative loss

Hi Miles,
I learned that regression losses in PySR work on the distance between targets and predictions. However, when regressing a complex formula, the resulting expressions may be highly complex or may not conform to the underlying physical laws. In the paper "Distilling Free-Form Natural Laws from Experimental Data", the principle for identifying nontriviality, proposed by Michael, is interesting. I wonder if this error principle can be added to PySR. If so, PySR would become more powerful, and better, more fundamental results could be obtained for some regression problems.
Thanks.

Custom loss function

Hi,

Is it possible to have the example.py show how to build a custom loss function? I'm still unable to figure out how to do it from the code.

Thanks!

[Question] Pure Julia package

Hi, is there a plan to have a pure Julia API and expose it as a Julia package?

Add variable as operator

Is it possible to add a variable as an operator to be used in the symbolic regression? Let's say I'm fairly sure my function has a pi in it (or something else, like a physical constant).

Bug: "ERROR: LoadError: InexactError: trunc(Int32, 2.172351299e9)"

I saw this bug this morning and it is now solved in 0.3.18.

This bug occurs when PySR tries to calculate the age of an equation. The offset I am using for the "birthdate" of an equation, equal to 1e3*(time() - 1.6e9) and stored in an Integer variable (so there is millisecond precision between equations), has now exceeded the maximum value of a 32-bit integer. I have changed the precision to 64-bit, which fixes the issue.
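The arithmetic behind this error can be checked directly. The sketch below is written in Python rather than Julia and only mirrors the formula quoted above; the helper names `fits` and `birthdate_ms` are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647
INT64_MAX = 2**63 - 1


def fits(value, bits):
    """Return True if `value` is representable as a signed integer of `bits` bits."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


def birthdate_ms(unix_time_s):
    """Millisecond offset following the formula in the post: 1e3 * (time() - 1.6e9)."""
    return int(1e3 * (unix_time_s - 1.6e9))


# Value taken from the error message: trunc(Int32, 2.172351299e9).
reported = 2_172_351_299

print(fits(reported, 32))
print(fits(reported, 64))

# Seconds after the 1.6e9 epoch offset at which a 32-bit counter runs out.
seconds_until_overflow = (INT32_MAX + 1) / 1e3
print(seconds_until_overflow / 86_400, "days")
```

A millisecond counter stored in 32 bits lasts a little under 25 days past the chosen offset, whereas 64 bits leaves ample headroom.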

I'm sorry for this bug; please update your binaries when you use PySR next, with:

pip install --upgrade pysr

Potential speed regression

I think the speed might have taken a hit, for small populations, on my laptop ever since I switched from Threads to Distributed. Using Distributed is necessary to use this package across multiple nodes, but the fact that the speed seems noticeably different is a sign that I am doing something wrong.

Hall of fame files are cluttering up my directories

Can there be an option to not save the results to disk? Generally saving results to disk is a thing I prefer to be in charge of myself, e.g. using Sacred or similar.

API for adding equation constraints

Curious to hear if anybody has an idea for an API that implements more detailed equation constraints than just maxdepth and maxsize.

Today I added a constraint on equations with the argument limitPowComplexity, which says that the exponent in a power law should never have a complexity greater than one. This results in much more interpretable equations. E.g.,

x0^(5.0) is allowed
x0^(5.0 + x3^(x0+x2)) is not allowed, since the complexity of 5.0 + x3^(x0+x2) is greater than 1.

Now, I think I can set this up more generally but I am curious how I should set up the API. Please let me know below if you have any thoughts.

This was my initial idea:

pysr(...
     constraints={'pow': (-1, 1), 'mult': (5, 5)}
     )

This says that anytime a pow operator appears, its left argument can have arbitrary complexity (-1), but its right argument must have complexity equal to or less than 1. It also says that the mult operator should only multiply two factors together if each factor's complexity is less than or equal to 5.

The nice thing about setting a max complexity on a per-operator basis is that I can allow things like very long chains of sums, but only simple multiplications. I think this will open up ways of making equations easier to interpret for complex problems.
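The per-operator limits proposed above can be sketched on a toy expression tree as follows. This assumes that complexity is the number of nodes in a subtree and that -1 means "no limit", as described in the post; the tuple-based tree format and the function names are hypothetical, not the actual implementation.

```python
# Toy expression trees: a leaf is a string or number, an operator node
# is a tuple ("op", left, right). Illustrative format only.

def complexity(tree):
    """Count the nodes in a tree (assumed definition of complexity)."""
    if not isinstance(tree, tuple):
        return 1
    _, *args = tree
    return 1 + sum(complexity(arg) for arg in args)


def satisfies(tree, constraints):
    """Check every operator's arguments against its per-argument limits."""
    if not isinstance(tree, tuple):
        return True
    op, *args = tree
    limits = constraints.get(op)
    if limits is not None:
        for arg, limit in zip(args, limits):
            if limit != -1 and complexity(arg) > limit:
                return False
    return all(satisfies(arg, constraints) for arg in args)


constraints = {"pow": (-1, 1), "mult": (5, 5)}

# x0^(5.0)
simple_power = ("pow", "x0", 5.0)
# x0^(5.0 + x3^(x0 + x2))
nested_power = ("pow", "x0", ("plus", 5.0, ("pow", "x3", ("plus", "x0", "x2"))))

print(satisfies(simple_power, constraints))
print(satisfies(nested_power, constraints))
```

The second expression fails because its exponent has seven nodes, well above the limit of 1 set for the right argument of `pow`.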

Allow `y` to be vector-valued?

It would be nice to allow y to be vector valued, essentially as a (parallelised?) shortcut for something like the following:

expressions = []
for index in range(y.shape[-1]):
    yi = y[:, index]
    expression_i = pysr.pysr(x, yi)
    expressions.append(expression_i)

This links to (and is actually motivated by) my suggestion in #32: as_pytorch could then return a single SymPyModule wrapping a list of all expressions, rather than returning several SymPyModules for each one, which would then have to be composed together in the above for loop.

How to force the algorithm to make an equation with all inputs?

Hi,
I used the algorithm; it's unique and works well, but I have a problem.
I have some data with 1000 rows and 4 columns: 3 inputs (x1, x2, x3) and one output (y). When I run the algorithm, it produces an equation like y = 10x1 + 2x3, which includes only x1 and x3.
The problem is that the resulting equations, including the best equation and all other generated equations, don't use x2 as an input. In other cases the equation contains only x1 and x2 and x3 is missing, or something similar.
However, I need an equation that contains all inputs (x1, x2, x3).
How can I force the algorithm to make an equation that uses all of x1, x2, and x3?

Thanks in advance

PySR compared to HeuristicLab

Hey,
I've been using your library for my research and it's been very insightful. I switched from gplearn and could clearly tell the difference.
Recently I came across HeuristicLab (open-source software), which also supports symbolic regression.
If you know the software, could you please give your take on how PySR compares to it,
similar to the benchmark comparison I came across in the thread on gplearn and PySR?
I appreciate your work; it's been very helpful!

Multithreading

Hello,

Based on a short experiment and reading the documentation, it seems PySR uses multiple processes, and not multithreading. Is this correct?

If so, I believe it does not have much advantage over a numpy implementation (such as gplearn). I will explain why.

Creating SymReg led me to believe that it is important to have few individuals in a generation, and many generations, instead of many-individuals-few-generations. I suspect this is because more CPU time is used on selected individuals, instead of random ones.

Trying to parallelize using multiprocessing led to a large overhead when forking - or distributing individuals to processes. This is why I am looking for alternative solutions, like multithreading, and Julia is both easy to understand and allows multithreading - where memory is shared instead of copied, which is much faster.

I guess my issue is a bit vague, but I have the following questions:

What do you think of this?
Would you be open to a pull request using threads instead of processes (based on reproducible benchmarks, of course)?
Is there any PySR architecture choice making said PR difficult? I am willing to (slowly) learn Julia to create it. But if you think it's too difficult to implement multithreading, I'll go my own way in a new project (I also have a NSGA-II wish for PySR eventually; check out Pareto elitism and NSGA-II in SymReg).
Have you got any tips on where to start? I have not looked at PySR code yet.

In any case, thank you for creating this project, it addresses a need of mine as well. I would love to learn from it :)

Symbolic deep learning

I'm trying to recreate the examples from this paper.
PySR always predicts scalars as the low-complexity solution, which doesn't make much sense. Can you please elaborate on that?
And what is going wrong, such that I'm unable to get the right expression?

Cycles per second: 3.050e+03
Progress: 19 / 20 total iterations (95.000%)
Hall of Fame:
-----------------------------------------
Complexity  Loss       Score     Equation
1           1.278e-01  -9.446e-02  -0.08741549
2           1.165e-01  9.256e-02  square(-0.18644808)
3           2.592e-02  1.503e+00  (x0 * -0.2923665)
5           1.682e-02  2.163e-01  ((-0.10430038 * x0) * x2)
8           1.576e-02  2.176e-02  (1.6735333 * sin((-0.067048885 * x0) * x2))

The code used to generate this is:

import numpy as np
from pysr import pysr, best

# Dataset
X = np.array(messages_over_time[-1][['dx', 'dy', 'r', 'm1', 'm2']]) # Taken from this notebook https://github.com/MilesCranmer/symbolic_deep_learning/blob/master/GN_Demo_Colab.ipynb
y = np.array(messages_over_time[-1]['e64'])

# Learn equations
equations = pysr(X, y, niterations=5,
    binary_operators=["plus", "mult" , 'sub', 'pow', 'div'],
    unary_operators=[
      "cos", "exp", "sin", 'neg', 'square', 'cube', 'exp', 
      "inv(x) = 1/x"], batching=True, batchSize=1000) 


print(best(equations))

Performance speed-up options?

Hello Miles! Thank you for open-sourcing this powerful tool! I am working on including PySR in my own research, and running into some performance bottlenecks.

I found that regressing a simple equation (e.g. the quick-start example) takes roughly 2 minutes. Ideally, I am aiming to reduce that time to ~30 seconds. Would you give me some pointers on this? Meanwhile, I will try to break down the challenge into several pieces:

Activating a new environment at each API call: I noticed that a new Julia (?) environment is created each time I call the pysr() API (see terminal output below). Could we keep the environment up so we can skip this process for subsequent calls?

Running on julia -O3 /tmp/tmpe5qmgemh/runfile.jl
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
    Updating registry at `~/.julia/registries/General`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  No Changes to `~/anaconda3/envs/rw/lib/python3.7/site-packages/Manifest.toml`
Activating environment on workers.
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating  Activating  environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
  Activating  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml` 
environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
Importing installed module on workers...Finished!
Started!

If the above wouldn't work, then allowing y to be vector-valued (as mentioned in #35) would be a second-best option! Even better, if we could create a "batched" version of pysr(X, y) api pysr_batched(X, y), such that X and y are python lists, and we return the results in a list as well, so that we only generate one Julia script, and call os.system() once to keep the Julia environment up.
Multi-threading: I noticed that increasing procs from 4 to 8 resulted in a slightly longer running time. I am running on an 8-core, 16-thread CPU. Did I do something dumb?
I went into pysr/sr.py and added the runtests=false flag in lines 438 and 440. That saved ~20 seconds.

Windows support

Hi Miles,

first of all, this is awesome. Thanks so much for making this.

A student I'm working with is trying to run PySR under Windows. Is that in principle supported?

PySR's dependencies don't seem to have any issues with Windows, but pysr.pysr throws a FileNotFoundError when accessing /tmp/.hyperparams_{rand_string}.hl. This seems to be because of the different file system structure under Windows. If this is the only issue, how would you feel about using something like tempfile to generate temporary files in a more OS-independent way?
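A minimal sketch of the tempfile idea, using only the Python standard library: the directory is created wherever the operating system keeps temporary files, so no `/tmp` path is hard-coded. The file name `hyperparams.jl`, its contents, and the function name are placeholders for illustration, not what PySR actually writes.

```python
import shutil
import tempfile
from pathlib import Path


def make_run_directory():
    """Create a fresh temporary directory in the OS-specific temp location."""
    return Path(tempfile.mkdtemp(prefix="pysr_"))


run_dir = make_run_directory()
hyperparam_file = run_dir / "hyperparams.jl"  # placeholder file name

contents = "# placeholder hyperparameters\n"
hyperparam_file.write_text(contents)
roundtrip = hyperparam_file.read_text()
print(hyperparam_file)

# Clean up once the run is finished.
shutil.rmtree(run_dir)
```

`tempfile.mkdtemp` leaves cleanup to the caller, which suits a workflow where an external process needs to read the files before they are removed.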

I am happy to try this and open a PR once it works.

Cheers,
Johann

[Windows] : Couldn't find equation file!

Hi Miles,

I installed PySR alongside Julia under Windows 10. It runs, until the moment it crashes with the following message:

File "C:\Users\Matthieu\anaconda3\lib\site-packages\pysr\sr.py", line 774, in get_hof
raise RuntimeError("Couldn't find equation file! The equation search likely exited before a single iteration completed.")

RuntimeError: Couldn't find equation file! The equation search likely exited before a single iteration completed.

In the most recent run, it reached 38% progress.

I have to say that sometimes (not often) the process does complete.

What is the reason for this?

Also, is there a forum, or did I post in the right place?

Thank you for your help.

Regards

Magaud

Running in float32 or float64

Hi Miles,

Is there a way for this to run in float64 or float32 at run time, rather than modding the code? I am writing some code that uses PySR, and the changes we made to make it work previously will likely not apply cleanly if we pull the changes to PySR up, in its current state. If not, do you have any advice on making this work?

[Feature] Units in equations

Each of my features has units [kg, m, N, ...], but the output equations don't take units into account. Most output equations fail a unit check even when the units of constants are set as required. This feature would allow defining the units of each X and y feature, probably in SI, and only allow equations which pass a unit check. Any constants could still have arbitrary units.
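One way to picture the unit check requested here is to track each quantity's dimensions as a tuple of exponents. The sketch below is a hypothetical illustration over (kg, m, s) only; it does not model constants with arbitrary units and is not an existing PySR feature.

```python
# Dimensions as exponent tuples over (kg, m, s); purely illustrative.
KG = (1, 0, 0)
M = (0, 1, 0)
S = (0, 0, 1)
NEWTON = (1, 1, -2)  # kg * m / s^2


def mul(a, b):
    """Multiplying quantities adds their dimension exponents."""
    return tuple(x + y for x, y in zip(a, b))


def div(a, b):
    """Dividing quantities subtracts their dimension exponents."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Adding quantities is only valid when the dimensions match."""
    if a != b:
        raise ValueError(f"unit mismatch: {a} vs {b}")
    return a


acceleration = div(M, mul(S, S))
force = mul(KG, acceleration)
print(force == NEWTON)

try:
    add(KG, M)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
print(mismatch_caught)
```

An equation search could discard any candidate whose tree raises such a mismatch, or whose final dimensions differ from those of y.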

Solver convergence

Hi Miles,

Thanks for sharing your code. I was trying the example case with some different solver configurations. The calculation terminates after just 2 iterations. Is this because the calculation has converged? Or am I misunderstanding some of the solver's arguments?

equations = pysr(
    X,
    y,
    procs = 7,
    niterations=1000000,
    populations = 100,
    binary_operators=["+", "*", "/","^"],
    unary_operators=[
        "cos",
        "exp",
        "sin",  # Pre-defined library of operators (see docs)
        "inv(x) = 1/x",  # Define your own operator! (Julia syntax)
    ],
)

print(best(equations))

Running on julia -O3 /tmp/tmp1fl_c9ut/runfile.jl
Activating environment on workers.
Importing installed module on workers...Finished!
Copying definition of inv to workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
Started!

Cycles per second: 3.560e+02
Head worker occupation: 3.7%
Progress: 1 / 100000000 total iterations (0.000%)

Hall of Fame:

Complexity Loss Score Equation
1 2.476e+01 5.960e-08 1.5915705
3 5.484e+00 7.536e-01 pow(x0, 2.002539)
5 1.902e+00 5.296e-01 (-2.0208373 + pow(x0, 2.0489082))
11 1.791e+00 1.001e-02 ((1.2358671 + pow(x0, 2.002539)) + (-1.6340058 * pow(x3, 0.9758236)))
12 2.172e-02 4.412e+00 ((-1.8855495 + pow(x0, 2.002539)) + (-2.0073733 * sin(-1.6345752 + x3)))
19 8.701e-03 1.307e-01 (((pow(x0 / 1.0015011, 2.002539) + -0.91348684) + ((-0.53229153 + inv(-0.67792857)) * sin(-1.6345752 + x3))) + -1.0524741)

==============================

Cycles per second: 1.130e+03
Head worker occupation: 3.2%
Progress: 2 / 100000000 total iterations (0.000%)

Hall of Fame:

Complexity Loss Score Equation
1 2.476e+01 5.960e-08 1.5917339
2 2.399e+01 3.134e-02 cos(x3)
3 1.265e+01 6.405e-01 pow(x0, 1.2735958)
4 5.174e+00 8.937e-01 pow(inv(x0), -1.9202919)
5 1.902e+00 1.001e+00 (-2.021261 + pow(x0, 2.0489275))
11 1.878e+00 2.049e-03 (-2.2773557 + (pow(x0, 1.2649873) * (0.22078495 + pow(x0, 0.7385057))))
12 2.172e-02 4.460e+00 ((-1.8855495 + pow(x0, 2.002539)) + (-2.0073733 * sin(-1.6345752 + x3)))
19 8.701e-03 1.307e-01 (((pow(x0 / 1.0015011, 2.002539) + -0.91348684) + ((-0.53229153 + inv(-0.67792857)) * sin(-1.6345752 + x3))) + -1.0524741)

==============================
-2.0073733*sin(x3 - 1.6345752) + Abs(x0)**2.002539 - 1.8855495

Segmentation fault over larger arrays

Hi,

I've been having issues with running PySR over larger input arrays. For instance, by modifying the input size in the example:

import numpy as np
from pysr import pysr, best, get_hof

# Dataset
X = 2*np.random.randn(300000, 8) #changed from (100,5) to (300000, 8)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
        binary_operators=["plus", "mult"],
        unary_operators=["cos", "exp", "sin"])

Results in the following trace output:

      From worker 3:	
      From worker 3:	signal (11): Segmentation fault
      From worker 3:	in expression starting at none:0
      From worker 3:	unknown function (ip: (nil))
      From worker 3:	Allocations: 14006734 (Pool: 14002705; Big: 4029); GC: 14
      From worker 5:	
      From worker 5:	signal (11): Segmentation fault
      From worker 5:	in expression starting at none:0
      From worker 5:	unknown function (ip: (nil))
      From worker 5:	Allocations: 14006758 (Pool: 14002728; Big: 4030); GC: 14
      From worker 4:	
      From worker 4:	signal (11): Segmentation fault
      From worker 4:	in expression starting at none:0
      From worker 4:	unknown function (ip: (nil))
      From worker 4:	Allocations: 14006745 (Pool: 14002716; Big: 4029); GC: 14
Couldn't find equation file!

I have replicated this behaviour across different machines, and for each I am fairly certain I have sufficient resources to hold the data in memory, so it's not clear to me what the issue is.

ERROR: The following package names could not be resolved/Julia servers out of date

tl;dr, delete ~/.julia/registries/General and then run the following commands in Julia:

ENV["JULIA_PKG_SERVER"] = ""
import Pkg
Pkg.update()

(original post)
FYI I pushed v0.4.2 of SymbolicRegression.jl to the registry 15 hours ago, which is required for the latest PySR. However, the registry server is still not updating - which seems like an issue that sometimes happens: JuliaRegistries/General#16777.

To get the Julia registry to stay up to date even if the registry server fails, you can use the git version instead. This can be done as follows:

Delete your registry folder ~/.julia/registries/General
Launch Julia.
Run the following commands:

ENV["JULIA_PKG_SERVER"] = ""
import Pkg
Pkg.update()

This will install the git version of the registry, which is always up-to-date.

The use of the package: PySR

Hi, I have just begun to use the package. I use Python, but I don't know whether it needs only Python 3. Do I need to install Julia in order to use it?

[BUG] Worker not responding after 1 min

Dear PySR Experts,

I've been using PySR for a few weeks now and I'm very happy with its capabilities.

For a reason I don't know, it has been crashing for the past 2 days.

It seems to be linked to workers not responding after waiting 1 min.

I tried the base example script from PySR, but it generates the same error.

Version (please complete the following information):

OS: Linux Mint 19.3
PySR: 0.6.12.post1
Does the bug appear with the latest version of PySR? I don't know...

Configuration

What are your PySR settings? No specific settings
What dataset are you running on? base example script from PySR

Error message
attached

Additional context
Possibly caused by a conda-forge update, but I completely reinstalled Anaconda, Julia and PySR afterwards without any change. Only once did the base example go through properly, when started from a standard terminal (outside Anaconda + Spyder).
Output_Error.txt

How to save PySR results?

Hi, thank you for providing us with a free symbolic regression library.

I am new to this and haven't fully understood everything yet. I would like to save all my results (after 12 hours of computation) from the output of pysr(). I've tried using pickle, shelve, dill and scipy.io, but I can't get them to work. I keep getting different types of errors, including the ones below:

AttributeError: module '__main__' has no attribute 'inv'
AttributeError: Can't get attribute 'inv' on <module '__main__'>
PicklingError: Can't pickle inv: attribute lookup inv on __main__ failed

Would you please advise me on what I should do to save the results so that I can come back and explore them later?

Thank you.

For reference this is my script that was based on the examples:

# Learn equations
equation = pysr(
    X,
    y[:, ii],
    binary_operators=["plus", "mult"],
    unary_operators=[
        "cos",
        "sin",  # Pre-defined library of operators (see https://pysr.readthedocs.io/en/latest/docs/operators/)
        "inv(x) = 1/x",  # Define your own operator! (Julia syntax)
    ],
    loss="loss(x, y) = abs(x - y)",  # Custom loss function
    julia_project="../SymbolicRegression.jl",
)
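
The errors above are typical of pickle, which stores functions by name and so cannot find a custom operator such as inv when the object is loaded again. One possible workaround, sketched here under the assumption that the results can be reduced to plain numbers and equation strings, is to write only those columns to a CSV file. The rows, column names and helper functions below are hypothetical and not part of PySR.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the results: (complexity, loss, equation string).
rows = [
    (1, 4.492, "x0"),
    (3, 0.0126, "plus(x0, inv(x1))"),
]

def save_equations(path, rows):
    """Write only plain numbers and strings, so nothing needs to be pickled."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["complexity", "loss", "equation"])
        writer.writerows(rows)

def load_equations(path):
    """Read the rows back, restoring the numeric types."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(int(r["complexity"]), float(r["loss"]), r["equation"]) for r in reader]

path = os.path.join(tempfile.mkdtemp(), "equations.csv")
save_equations(path, rows)
restored = load_equations(path)
print(restored)
```

The equation strings can then be parsed again later, once the custom operators have been defined in that session.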

[Errno 2] No such file or directory

I have installed pysr-0.6.12.post1 and have been trying to run example.py, but after working through some issues covered by previously closed bug reports, a FileNotFoundError occurs. I'm using Windows 10 and Python 3.7; the Julia version is 1.6.2. The error message is the following:

FileNotFoundError: [Errno 2] No such file or directory: 'hall_of_fame_2021-08-04_230410.180.csv.bkup'

Sympy export

Dear @MilesCranmer,

I just tried out this package. It works! Fantastic.

I see that the result is a pandas DataFrame. The equation is stored as a string. It seems difficult to parse this back into something usable automatically.

I wonder if it would be easy to add an export to a sympy formula.

SymPy can then export formulas to Python functions and also to efficient NumPy functions, which would allow automatic use of PySR's results. SymPy can also print formulas with pprint in an easily readable form.

Cheers,
Johannes

Planned breaking changes in v0.6.0

Will continue to update this issue until v0.6.0 released.

New defaults:
- annealing=False (no annealing works better with the new code. This is equivalent to alpha=infinity)
- useFrequency=True (deals with complexity in a smarter way)
- npopulations = 20 ~~procs*4~~
- progress=True (show a progress bar)
- optimizer_algorithm="BFGS"
- optimizer_iterations=10
- optimize_probability=1
- binary_operators default = ["+", "-", "/", "*"]
- unary_operators default = []
Warnings:
- Using maxsize > 40 will trigger a warning mentioning how it will be slow and use a lot of memory. Will mention to turn off useFrequency, and perhaps also use warmupMaxsizeBy.
Deprecated nrestarts -> optimizer_nrestarts
Exports to JAX, PyTorch, NumPy unified into same function call: pass X[nrows, features] as single argument, rather than the current numpy one which is different arguments for each feature. JAX export will optionally take parameters as arguments (for training). The PyTorch output is a trainable module (thanks @patrick-kidger!)
Custom constant optimizer: can choose between NelderMead and BFGS (requires differentiable operators).
Test progress bar in jupyter
Decide if should change format away from pandas dataframe to something like Patrick suggested, with a custom class for equations, containing .to_sympy, .to_numpy, etc. Downside is one can't do the .query(), .sort_values(), .plot(), etc., that come with Pandas.
- (Not doing for now)

RuntimeError: $PATH error, Julia not found.

Hi Miles!

I spent a few minutes trying to find out why the package kept raising RuntimeError: $PATH error, Julia not found even though Julia was callable from the shell and was present in $PATH. Modifying all the calls to subprocess.Popen to include shell=True makes this issue go away (Ubuntu 18.04, called from a JupyterLab instance). For example, changing

process = subprocess.Popen(["julia", "-v"], stdout=subprocess.PIPE, bufsize=-1)

to

process = subprocess.Popen(["julia", "-v"], shell=True, stdout=subprocess.PIPE, bufsize=-1)

Not sure if there's a good reason for keeping shell=False for the initial call -- perhaps adding a secondary check with shell enabled after the first one fails might help make the first experience with PySR smoother for new users.
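
A secondary check like the one suggested above could take several forms. The following is a minimal sketch, not PySR's actual code, that resolves the executable with shutil.which before launching it, as an alternative to shell=True. The helper names find_julia and julia_version are hypothetical.

```python
import shutil
import subprocess

def find_julia(executable="julia"):
    """Return the absolute path of the Julia binary, or None if it is not on PATH."""
    return shutil.which(executable)

def julia_version(executable="julia"):
    """Return the output of `julia -v`, or None if Julia cannot be started."""
    path = find_julia(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-v"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None

print("PATH lookup:", find_julia())
print("Version:", julia_version())
```

If the PATH lookup prints None inside Jupyter but works in a terminal, the two environments most likely see different values of $PATH.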

EDIT: some additional documentation

PySR could not start julia. Make sure julia is installed and on your $PATH.

Hello Miles! I am trying to apply PySR to a biological dataset and hope to find some interesting results (something like a compact ODE/SDE). But I am kind of new to Julia, and when I try to run the example, this error appears: "PySR could not start Julia. Make sure Julia is installed and on your $PATH". I looked for a solution (adding the Julia path to the current workspace?) but I still can't solve this problem. Would you mind suggesting a solution? Thanks in advance.

I am trying to run on Mac, and the Julia version is 1.6 ('/Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia').

I am also curious whether PySR will be robust when the data is noisy.

Thank you very much!

Not able to run bigger runs on 0.3.10

Hi,

After upgrading to 0.3.10, a simple example I run stopped working. Julia runs for a while, updates the results 1-3 times, and then the worker Julia processes die (the main one survives, but CPU usage goes from 100% to a couple of percent).
The code I use: https://pastebin.com/4DsQ3mN4
Could anyone check if they experience similar behaviour on their machine? I'm not sure if it is something with my environment or with pysr itself.

Thanks

Benchmark / GPlearn

Can you do a basic comparison between this and gplearn with regards to speed and flexibility?

[Feature] Denoising pre-processing

PySR (and perhaps SR in general) is not that good at working with noise and seems to overfit. An easy solution to this is preprocessing where we denoise the data. The easiest way to do this is to fit a generic Gaussian process, and generate samples of the mean function over the same domain as the input data.
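
As a rough illustration of this idea, the sketch below computes the posterior mean of a zero-mean Gaussian process with an RBF kernel at the original inputs, using only NumPy. The kernel, the fixed length scale and noise variance, and the toy dataset are all assumptions made for the example; in practice the hyperparameters would be fitted to the data.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior_mean(x, y, noise_var=0.09, length_scale=1.0):
    """Posterior mean of a zero-mean GP, evaluated at the training inputs."""
    K = rbf_kernel(x, x, length_scale)
    alpha = np.linalg.solve(K + noise_var * np.eye(len(x)), y)
    return K @ alpha

# Toy data: a smooth function plus Gaussian noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=100))
y_true = np.cos(x)
y_noisy = y_true + rng.normal(scale=0.3, size=x.shape)

# Denoised targets over the same domain as the input data.
y_denoised = gp_posterior_mean(x, y_noisy)

mse_noisy = np.mean((y_noisy - y_true) ** 2)
mse_denoised = np.mean((y_denoised - y_true) ** 2)
print(mse_noisy, mse_denoised)
```

The denoised values would then be passed to the symbolic regression in place of the raw targets.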

Single feature

Hi,

Seems like the package does not work when X has a single feature. E.g. the following

X = 2*np.random.randn(100, 1)
y = 2*np.cos(X[:, 0]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
            binary_operators=["plus", "mult"],
            unary_operators=["cos", "exp", "sin"])

fails with

Running on julia -O3 --threads auto -e 'include("/tmp/.hyperparams_15285411899448926359.jl"); include("/tmp/.dataset_15285411899448926359.jl"); include("/home/siwy/.local/lib/python3.8/site-packages/julia/sr.jl"); fullRun(5, npop=1000, ncyclesperiteration=300, fractionReplaced=0.100000f0, verbosity=round(Int32, 1000000000.000000), topn=10)'
ERROR: LoadError: MethodError: no method matching Array{Float32,2}(::Array{Float64,1})
Closest candidates are:
  Array{Float32,2}(::AbstractArray{S,N}) where {T, N, S} at array.jl:562
  Array{Float32,2}(::UndefInitializer, ::Int64, ::Int64) where T at boot.jl:408
  Array{Float32,2}(::UndefInitializer, ::Int64...) where {T, N} at boot.jl:412
  ...
Stacktrace:
 [1] convert(::Type{Array{Float32,2}}, ::Array{Float64,1}) at ./array.jl:554
 [2] top-level scope at /tmp/.dataset_15285411899448926359.jl:1
 [3] include(::String) at ./client.jl:457
 [4] top-level scope at none:1
in expression starting at /tmp/.dataset_15285411899448926359.jl:1

Error when selecting some operators

Hi!

I've been trying to use a variety of operators and for some of them I get an error. For example if I run:

equations = pysr(X, y, niterations=100,
    binary_operators=["plus", "mult", "div", "sub", "pow"],
    unary_operators=[
      "cos", "sin", "asinh", "gamma"
      ])

I get

Running on julia -O3 /tmp/tmpsva4rkch/runfile.jl
Couldn't find equation file!

If I remove the "gamma" from the operator list then this works fine. I get the same error if I use the operator "acosh". I believe these functions. I'm not sure why this is happening since acosh is just a core Julia function whereas gamma is in SpecialFunctions.jl. I also confirmed that a colleague got the same error on a different system. Any thoughts?

[Windows] Always returning the same equation?

I don't know if this is a Windows issue or what (I work on a Linux partition, but I just wanted to play around with this; I haven't done serious work on Windows for 7 years or so, so I'm at a loss), but after fitting one equation, it always returns that equation, even with different data, in a different notebook.

I've looked to see if I could find the julia file it creates - nope. And they're different files every time.

Any ideas?

loglog linear regression

Hi,

I am trying to use pysr on a small two-parameter data set (copied below):

d N y yerr
2 100 0.19249134909744509 0.04077703830057178
2 400 0.08870348931017269 0.01895750658969715
2 1000 0.05821684947429505 0.012249467489932532
2 2000 0.04032440345198765 0.008654963536388066
4 100 0.23360598541676597 0.04362116436976535
4 400 0.10665517789216333 0.019831288300079426
4 1000 0.07050647598930485 0.012405873462707294
4 2000 0.048513209353925904 0.0075021625933293
8 100 0.2816349422941111 0.04964869605998312
8 400 0.130359187678827 0.021744784316440224
8 1000 0.07822205145626468 0.01152860301070966
8 2000 0.0550522314381573 0.007979675849630921
16 100 0.39596679080632924 0.05729090287732478
16 400 0.1491729565911473 0.023822175408619892
16 1000 0.0888737514210574 0.014375093411499607
16 2000 0.06228894224329333 0.01037101974545278
32 100 0.681255101544107 0.07566014207326285
32 400 0.18976634920594035 0.027296960851153632
32 1000 0.10745312825997176 0.014599730019972569
32 2000 0.07072727755630769 0.009862489071337315
#64 40 17.283403958973132 0.5238893803671024
#64 100 16.87138083788046 0.1
64 400 0.25662700332371974 0.027683868929782498
64 1000 0.12962634673162354 0.016299677455956965

I later realised that when plotting the logs of all variables, it seems to be a linear regression:

Here is my code trying to discover a relation. I already knew there should be some logs of the parameters involved, so I added log and exp as unary operators:

import numpy as np
from pysr import pysr

data = np.loadtxt('/tmp/b.txt', skiprows=1)
X = data[:, :2]  # columns d and N
y = data[:, 2]
yerr = data[:, 3]

# Learn equations
equations = pysr(
    X, y,  # weights=1/yerr**2,
    niterations=5,
    binary_operators=["plus", "mult", "pow"],
    unary_operators=["log", "exp"],
)

print(equations)

Output:

Hall of Fame:
-----------------------------------------
Complexity  MSE        Score     Equation
0           3.616e+01  0.000e+00  0.003592
3           4.654e+00  6.835e-01  pow(x1, -0.36609524)
9           3.686e+00  3.884e-02  plus(pow(plus(mult(x1, 0.9665345), -1.3750538), -0.24685867), -0.10109954)
10          2.127e+00  5.500e-01  pow(plus(plus(x1, log(pow(0.048317507, x0))), -0.3589894), -0.38382408)
13          1.372e+00  1.461e-01  pow(plus(plus(plus(mult(x1, 0.29219225), -2.124579), mult(-0.75662184, x0)), -0.08665135), -0.4488404)
14          9.712e-01  3.454e-01  mult(pow(x1, -0.52433586), log(plus(plus(plus(log(exp(x0)), 0.32484704), 4.402428), x0)))
17          8.664e-01  3.808e-02  mult(pow(plus(x1, 2.4464433), -0.9689093), plus(mult(log(x1), log(x1)), mult(plus(x0, -1.6077209), 1.0596043)))

Am I doing something wrong, or is there a better way? Or do I just need to run it longer?
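
Since the logs of all variables look linear, one way to cross-check this outside PySR is an ordinary least-squares fit in log space. The sketch below assumes the model log(y) = a + b*log(d) + c*log(N) and uses the rows from the table above, with the commented-out rows and the yerr column omitted. It is a sanity check, not a replacement for the symbolic search.

```python
import numpy as np

# Rows copied from the table above (commented-out rows omitted): d, N, y
data = np.array([
    [2, 100, 0.19249134909744509],
    [2, 400, 0.08870348931017269],
    [2, 1000, 0.05821684947429505],
    [2, 2000, 0.04032440345198765],
    [4, 100, 0.23360598541676597],
    [4, 400, 0.10665517789216333],
    [4, 1000, 0.07050647598930485],
    [4, 2000, 0.048513209353925904],
    [8, 100, 0.2816349422941111],
    [8, 400, 0.130359187678827],
    [8, 1000, 0.07822205145626468],
    [8, 2000, 0.0550522314381573],
    [16, 100, 0.39596679080632924],
    [16, 400, 0.1491729565911473],
    [16, 1000, 0.0888737514210574],
    [16, 2000, 0.06228894224329333],
    [32, 100, 0.681255101544107],
    [32, 400, 0.18976634920594035],
    [32, 1000, 0.10745312825997176],
    [32, 2000, 0.07072727755630769],
    [64, 400, 0.25662700332371974],
    [64, 1000, 0.12962634673162354],
])

d, N, y = data.T

# Design matrix for log(y) = a + b*log(d) + c*log(N).
A = np.column_stack([np.ones_like(d), np.log(d), np.log(N)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
intercept, slope_d, slope_N = coef

print(f"log(y) ~ {intercept:.3f} + {slope_d:.3f}*log(d) + {slope_N:.3f}*log(N)")
```

If the fit is good, the corresponding power law y = exp(a) * d**b * N**c is a natural candidate to compare against the hall of fame.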

[BUG] "sub" operator causing BoundsError

Description:

When using the "sub" operator and the "plus" and "mult" operators together, an out of bounds error occurs. Not sure why or how these are related but the smallest error causing code I have found is below:

import numpy as np
from pysr import pysr, best, get_hof

# Dataset
X = 2*np.random.randn(100, 5)

y = X[:, 0] - X[:, 1]

# Learn equations
equations = pysr(X, y, niterations=5, 
        binary_operators=["sub", "plus", "mult"],
        unary_operators=[])

...# (you can use ctl-c to exit early)

print(best())

# Log of tests
# plus, sub, mult = BoundsError (consistent)
# sub, mult = No Error
# plus, sub = No Error
# plus, mult, mySub = No Error (mySub(x, y) = x - y)
# plus, mult, negative = No Error (negative(x) = 0 - x)
# plus, mult = No Error (and success from (x0 + (-1.0 * x1)))

I am running Windows 10 and using VS Code and the PowerShell terminal in the IDE.
Python Version 3.9.0
Julia Version 1.6.2

The Error:

Running on julia -O3 C:\Users\ipunc\AppData\Local\Temp\tmp0ffjfong\runfile.jl
  Activating environment at `C:\Python39\lib\site-packages\Project.toml`
    Updating registry at `C:\Users\ipunc\.julia\registries\General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
  No Changes to `C:\Python39\Lib\site-packages\Project.toml`
  No Changes to `C:\Python39\Lib\site-packages\Manifest.toml`
Activating environment on workers.
      From worker 3:      Activating environment at `C:\Python39\Lib\site-packages\Project.toml`
      From worker 4:      Activating environment at `C:\Python39\Lib\site-packages\Project.toml`
      From worker 5:      Activating environment at `C:\Python39\Lib\site-packages\Project.toml`  
      From worker 2:      Activating environment at `C:\Python39\Lib\site-packages\Project.toml`
Importing installed module on workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
Started!
1.0%┣▋                                                           ┫ 1/100 [00:02<Inf:Inf, 0.0 it/s]Head worker occupation: 4.8%
Hall of Fame:
-----------------------------------------
Complexity  Loss       Score     Equation
1           4.492e+00  7.282e-01  x0
3           1.263e-14  1.675e+01  (x0 - x1)

ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ .\task.jl:322 [inlined]
 [2] fetch
   @ .\task.jl:337 [inlined]
 [3] _EquationSearch(::SymbolicRegression...\ProgramConstants.jl.SRDistributed, datasets::Vector{SymbolicRegression...\Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(-), typeof(+), typeof(*)}, Tuple{}, L2DistLoss}, numprocs::Int64, procs::Nothing, runtests::Bool)
   @ SymbolicRegression C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SymbolicRegression.jl:387
 [4] EquationSearch(datasets::Vector{SymbolicRegression...\Dataset.jl.Dataset{Float32}}; niterations::Int64, options::Options{Tuple{typeof(-), typeof(+), typeof(*)}, Tuple{}, L2DistLoss}, numprocs::Int64, procs::Nothing, multithreading::Bool, runtests::Bool)
   @ SymbolicRegression C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SymbolicRegression.jl:181
 [5] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, 
varMap::Vector{String}, options::Options{Tuple{typeof(-), typeof(+), typeof(*)}, Tuple{}, L2DistLoss}, numprocs::Int64, procs::Nothing, multithreading::Bool, runtests::Bool)
   @ SymbolicRegression C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SymbolicRegression.jl:145
 [6] #EquationSearch#24
   @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SymbolicRegression.jl:157 [inlined]
 [7] top-level scope
   @ C:\Users\ipunc\AppData\Local\Temp\tmp0ffjfong\runfile.jl:7

    nested task error: On worker 2:
    BoundsError: attempt to access 5×100 Matrix{Float32} at index [-1, 1:100]
    Stacktrace:
      [1] throw_boundserror
        @ .\abstractarray.jl:651
      [2] checkbounds
        @ .\abstractarray.jl:616 [inlined]
      [3] _getindex
        @ .\multidimensional.jl:831 [inlined]
      [4] getindex
        @ .\abstractarray.jl:1170 [inlined]
      [5] deg0_eval
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\EvaluateEquation.jl:90      
      [6] evalTreeArray
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\EvaluateEquation.jl:22      
      [7] #EvalLoss#1
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\LossFunctions.jl:28
      [8] #scoreFunc#2
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\LossFunctions.jl:47
      [9] scoreFunc
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\LossFunctions.jl:47 [inlined]
     [10] #nextGeneration#1
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\Mutate.jl:139
     [11] regEvolCycle
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\RegularizedEvolution.jl:57  
     [12] #SRCycle#1
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SingleIteration.jl:34       
     [13] macro expansion
        @ C:\Users\ipunc\.julia\packages\SymbolicRegression\1URtS\src\SymbolicRegression.jl:476 [inlined]
     [14] #55
        @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\macros.jl:87
     [15] #103
        @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:274
     [16] run_work_thunk
        @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:63
     [17] run_work_thunk
        @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\process_messages.jl:72
     [18] #96
        @ .\task.jl:411
    Stacktrace:
     [1] #remotecall_fetch#143
       @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:394 [inlined]
     [2] remotecall_fetch(f::Function, w::Distributed.Worker, args::Distributed.RRID)
       @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:386
     [3] #remotecall_fetch#146
       @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:421 [inlined]
     [4] remotecall_fetch
       @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:421 [inlined]
     [5] call_on_owner
       @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:494 [inlined]
     [6] fetch(r::Distributed.Future)
       @ Distributed C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Distributed\src\remotecall.jl:533
     [7] (::SymbolicRegression.var"#57#89"{Vector{Vector{Distributed.Future}}, Int64, Int64})()   
       @ SymbolicRegression .\task.jl:411
in expression starting at C:\Users\ipunc\AppData\Local\Temp\tmp0ffjfong\runfile.jl:7
┌ Warning: Forcibly interrupting busy workers

(Continues stopping other processes below)

As you can see from the code above, I have tested both adding my own subtraction operator and a negative operator and both options do work as expected. The order of operators does not matter and there is not any variation in the error (always worker 2). I have also tested with some other equations (where I originally found the error) and they also consistently run into the same issue. I'm really not sure what could be causing this and I think I can work around it but it should probably be looked into.

Thanks for your time :)

No such file or directory: 'julia': 'julia'

FileNotFoundError Traceback (most recent call last)
in
----> 1 equations = pysr(hash_characters, target.values, niterations=5)

/opt/conda/lib/python3.7/site-packages/pysr/sr.py in pysr(X, y, weights, binary_operators, unary_operators, procs, loss, populations, niterations, ncyclesperiteration, alpha, annealing, fractionReplaced, fractionReplacedHof, npop, parsimony, migration, hofMigration, shouldOptimizeConstants, topn, weightAddNode, weightInsertNode, weightDeleteNode, weightDoNothing, weightMutateConstant, weightMutateOperator, weightRandomize, weightSimplify, perturbationFactor, timeout, extra_sympy_mappings, equation_file, test, verbosity, progress, maxsize, fast_cycle, maxdepth, variable_names, batching, batchSize, select_k_features, warmupMaxsizeBy, constraints, useFrequency, tempdir, delete_tempfiles, julia_optimization, julia_project, user_input, update, temp_equation_file, output_jax_format, warmupMaxsize, nrestarts, optimizer_algorithm, optimizer_nrestarts, optimize_probability, optimizer_iterations)
344
345 _create_julia_files(**kwargs)
--> 346 _final_pysr_process(**kwargs)
347 _set_globals(**kwargs)
348

/opt/conda/lib/python3.7/site-packages/pysr/sr.py in _final_pysr_process(julia_optimization, runfile_filename, timeout, **kwargs)
374 if timeout is not None:
375 command = [f'timeout', f'{timeout}'] + command
--> 376 _cmd_runner(command, **kwargs)
377
378 def _cmd_runner(command, **kwargs):

/opt/conda/lib/python3.7/site-packages/pysr/sr.py in _cmd_runner(command, **kwargs)
379 if kwargs['verbosity'] > 0:
380 print("Running on", ' '.join(command))
--> 381 process = subprocess.Popen(command, stdout=subprocess.PIPE, bufsize=-1)
382 try:
383 while True:

/opt/conda/lib/python3.7/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
798 c2pread, c2pwrite,
799 errread, errwrite,
--> 800 restore_signals, start_new_session)
801 except:
802 # Cleanup if the child failed starting.

/opt/conda/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1549 if errno_num == errno.ENOENT:
1550 err_msg += ': ' + repr(err_filename)
-> 1551 raise child_exception_type(errno_num, err_msg, err_filename)
1552 raise child_exception_type(err_msg)
1553

FileNotFoundError: [Errno 2] No such file or directory: 'julia': 'julia'

Hall of fame output cannot be turned off?

Hi,

I haven't found any switch to reduce or turn off the printing of the current hall of fame.

Would it be possible to expose print_every_n_seconds, or give the user some other option to modify this?

Thanks,
Johann

Can we control the number of digits after the decimal point of a constant ?

Progress bar in Jupyter Notebooks

Not sure if this is possible in the current setup, but would be nice to have a progress bar display inside the jupyter notebooks.

To implement the multi-line progress bar, the SymbolicRegression.jl backend basically will print several ANSI escape sequences to erase lines. The multi-line escape sequences don't work in Jupyter.

To get this working, I think I need to:

Use tqdm on the Python side (and manually loop through it), and
Change the mechanism for reading in the output of SymbolicRegression.jl to actually parse some structured log files (and perhaps parse the equations file into sympy and print that). So basically the loop printing in PySR will be pure-Python.
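
The pure-Python loop described in these two steps might look roughly like the sketch below. The "progress done/total" line format is hypothetical and is not an actual SymbolicRegression.jl log format; render_bar is a stand-in for tqdm so the example needs only the standard library.

```python
import re

# Hypothetical structured log line, e.g. "progress 3/10".
PROGRESS_RE = re.compile(r"^progress (\d+)/(\d+)$")

def parse_progress(lines):
    """Yield (done, total) for every line that matches the hypothetical format."""
    for line in lines:
        match = PROGRESS_RE.match(line.strip())
        if match:
            yield int(match.group(1)), int(match.group(2))

def render_bar(done, total, width=20):
    """Return a one-line text bar, which needs no ANSI line-erasing sequences."""
    filled = width * done // total
    return "[" + "#" * filled + "." * (width - filled) + f"] {done}/{total}"

# Example log: progress lines interleaved with other output.
log = ["starting", "progress 1/4", "some equation output", "progress 2/4", "progress 4/4"]
for done, total in parse_progress(log):
    print(render_bar(done, total))
```

With tqdm, the loop body would instead update a bar object by the difference between successive done values.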

Bug: hall of fame not saving?

For some reason, at least on master, the hall of fame seems to refresh between iterations. Good equations found earlier during the search are removed from the hall of fame.

Here are some ideas to investigate:

Reproducible on earlier commits? (no)
Is the hall of fame not being copied during migration? I.e., during migration of the hall of fame, (not issue)
Is this issue dependent on OS? (no)
Is this issue dependent on procs used?
Is the hall of fame not being saved between cycles?
Are there potential race conditions anywhere?
Try removing all @inbounds? Perhaps it's overwriting a different array?

Julia install error (new in 0.6.5)

I'm getting this error with pysr>=0.6.5, during the first run of pysr after pip install pysr.

Running on julia -O0 /tmp/tmpy5eg_cxg/runfile.jl
  Activating environment at `~/anaconda3/envs/rw/lib/python3.7/site-packages/Project.toml`
    Updating registry at `~/.julia/registries/General`
ERROR: LoadError: Unsatisfiable requirements detected for package SymbolicRegression [8254be44]:
 SymbolicRegression [8254be44] log:
 ├─possible versions are: 0.1.0-0.6.4 or uninstalled
 └─restricted to versions 0.6.5-0.6 by an explicit requirement — no versions left

Predefined function form

Hi Miles,

In the regression process, can we pre-define a functional form and let the regression start from it?
For example, if our objective function is x0**2 + 2.0*cos(x3) - 2.0, as in example.py, that is simple and pysr gets the result quickly.
However, in some research problems the objective function may be more complicated, such as x0**2 + (x1*x0)*exp(sin(x2)*x1). In that case, pysr may take longer to optimize and may get stuck in a local optimum.
So I wonder if it is possible to predefine a functional form in the pysr function, for example x0*exp(sin(x2)), and let the regression start from it to speed up the optimization. It is like applying a boundary condition when solving an equation.

thanks.

issue with calling best_callable when best returns a constant function.

When calling best_callable on a function pysr returns as a constant, I get the following error.

TypeError: _lambdifygenerated() missing 1 required positional argument: 'x1'

Could best_callable be adapted for constant functions as well?

Torch export key errors

Key errors:

Traceback (most recent call last):
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 124, in __init__
    arg_ = _memodict[arg]
KeyError: sqrt(Abs(x1) + 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 124, in __init__
    arg_ = _memodict[arg]
KeyError: 1/2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 151, in <module>
    batching=True,
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 453, in pysr
    equations = get_hof(**kwargs)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/sr.py", line 1002, in get_hof
    module = sympy2torch(eqn, sympy_symbols, selection=selection)
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 190, in sympy2torch
    expression, symbols_in, selection=selection, extra_funcs=extra_torch_mappings
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 161, in __init__
    expr=expression, _memodict=_memodict, _func_lookup=_func_lookup
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 130, in __init__
    **kwargs,
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 130, in __init__
    **kwargs,
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/site-packages/pysr/export_torch.py", line 120, in __init__
    self._torch_func = _func_lookup[expr.func]
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/collections/__init__.py", line 916, in __getitem__
    return self.__missing__(key)            # support subclasses that define __missing__
  File "/home/yxie20/anaconda3/envs/rw/lib/python3.7/collections/__init__.py", line 908, in __missing__
    raise KeyError(key)
KeyError: <class 'sympy.core.numbers.Half'>

MWE:

import numpy as np
from pysr import pysr, best
import time

from pysr.sr import best_callable
import torch

# Dataset (alternative)
X = 2*np.random.randn(1152, 32)
y = 2*np.cos(X[:, 3:11]) + X[:, 0:8]**2 - 2 + X[:,2:10]*X[:,1:9]

equations = pysr(X, y, 
    binary_operators=["plus", "sub", "mult", "div", "pow"],
    unary_operators=["exp", "log_abs", "log10_abs", "log2_abs", 
        "cos", "sin", "tan", "sinh", "cosh", "tanh", 
        "atan", "asinh", "acosh_abs", "atanh_clip"],
    # verbosity=0,
    # procs=6,
    temp_equation_file=True,
    progress=False,
    julia_optimization=0,       # Faster startup time. Turn off optimizing compiler for Julia code
    output_torch_format=True,
    #
    # niterations=2,              # Iterations per population of the entire algorithm. Best equations are printed and migrated between populations.  populations * niterations = progress bar. This doesnt really matter
    # maxsize=20,
    # populations=2,              # Number of populations running. (must > 1)
    # npop=2000,                  # Number of individuals per population. More population, slower, more chance of hitting the correct.
    # ncyclesperiteration=200,    # Number of total mutations per 10 samples of population each iteration. Also like npop.
    # Quick debug
    niterations=5,              # Iterations per population of the entire algorithm. Best equations are printed and migrated between populations.  populations * niterations = progress bar. This doesnt really matter
    maxsize=10,
    populations=5,              # Number of populations running. (must > 1)
    npop=200,                  # Number of individuals per population. More population, slower, more chance of hitting the correct.
    ncyclesperiteration=20,    # Number of total mutations per 10 samples of population each iteration. Also like npop.
    annealing=True,            # With False, simple equations take longer but more complex equations are achievable
    batching=True,
)

Additional export options

@patrick-kidger has written an amazing SymPy->PyTorch export library, https://github.com/patrick-kidger/sympytorch. It would be really nice to use this in PySR, so that equations can be directly exported to PyTorch!

Also might be nice to have a JAX export too - e.g., could export a function that takes (x, parameters) as arguments (in traditional JAX functional style...).

[Question] Empty list for unary_operators

Hey,

Is there a technical reason why unary_operators argument cannot be an empty list? When I tried that it failed with:

assert len(unary_operators) > 0
