erp12 / pyshgp

Push Genetic Programming in Python.

Home Page: http://erp12.github.io/pyshgp

License: MIT License

Python 99.01% Makefile 0.20% Batchfile 0.25% Shell 0.54%

genetic-programming python software-synthesis artificial-intelligence machine-learning evolutionary-algorithms evolutionary-computation programming-by-example

pyshgp's Introduction

PyshGP

Push Genetic Programming in Python

WARNING: The public API of this package may see breaking changes until the 1.0 version.

Motivation

What is PushGP?

Push is a programming language that plays nicely with evolutionary computing / genetic programming. It is a stack-based language that features one stack per data type, including code. Programs are represented by lists of instructions, which modify the values on the stacks. Instructions are executed in order.
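
To make the stack-per-type idea concrete, here is a minimal sketch of a Push-like interpreter in plain Python. It is not pyshgp's interpreter: the `run_push` helper and the instruction names are hypothetical, the code and exec stacks are left out, and the rule that an instruction with too few arguments does nothing follows the usual Push convention.

```python
from collections import defaultdict


def integer_add(stacks):
    # Needs two integers; otherwise the instruction is a no-op.
    if len(stacks["integer"]) >= 2:
        b = stacks["integer"].pop()
        a = stacks["integer"].pop()
        stacks["integer"].append(a + b)


def integer_dup(stacks):
    # Duplicate the top integer, if there is one.
    if stacks["integer"]:
        stacks["integer"].append(stacks["integer"][-1])


def boolean_not(stacks):
    # Negate the top boolean, if there is one.
    if stacks["boolean"]:
        stacks["boolean"].append(not stacks["boolean"].pop())


# Hypothetical instruction table for this sketch only.
INSTRUCTIONS = {
    "integer_add": integer_add,
    "integer_dup": integer_dup,
    "boolean_not": boolean_not,
}


def run_push(program):
    """Execute atoms in order; literals are pushed to the stack of their type."""
    stacks = defaultdict(list)
    for atom in program:
        if isinstance(atom, str) and atom in INSTRUCTIONS:
            INSTRUCTIONS[atom](stacks)
        elif isinstance(atom, bool):  # check bool first: bool is a subclass of int
            stacks["boolean"].append(atom)
        elif isinstance(atom, int):
            stacks["integer"].append(atom)
    return dict(stacks)


result = run_push([2, 3, "integer_add", "integer_dup", True, "boolean_not"])
print(result)
```

The output of a program is the final state of every stack, which is why each instruction only touches the stacks for the types it cares about.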

More information about PushGP can be found on the Push Redux, the Push Homepage, and the Push Language Discourse.

Why use PushGP?

PushGP is a leading software synthesis (sometimes called "programming by example") system. It uses stochastic (typically evolutionary) search methods to produce programs that are capable of manipulating all the common data types, control structures, and data structures. It is easily extensible to specific use cases and has seen impressive human-competitive coding results. PushGP has discovered novel quantum computer programs previously unknown to human programmers, and has achieved human-competitive results in finding algebraic terms in the study of finite algebras.

In contrast to the majority of other ML/AI methods, PushGP does not require the transformation of data into numeric structures. PushGP does not optimize a set of numeric parameters using a gradient, but rather attempts to intelligently search the space of programs. The result is a system where the primary output is a program written in the Turing complete Push language.

PushGP has proven itself to be one of the most powerful "general program synthesis" frameworks. Like most evolutionary search frameworks, it usually requires an extremely high runtime; however, it can solve problems that few other programming-by-example systems can solve.

Additional references on the successes of PushGP:

Goals of PyshGP

Previous PushGP frameworks have focused on supporting genetic programming and software synthesis research. One of the leading PushGP projects is Clojush, which is written in Clojure and heavily focused on the experimentation needed to further the research field.

Pyshgp aims to bring PushGP to a wider range of users and use cases. Many popular ML/AI frameworks are written in Python, and with pyshgp it is much easier to compare PushGP with other methods or build ML pipelines that contain PushGP and other models together.

Although PushGP is constantly changing through research and publication, pyshgp is meant to be a slowly changing, more stable PushGP framework. It is still possible to use pyshgp for research and development; however, accepted contributions to the main repository will be extensively benchmarked, tested, and documented.

Installing pyshgp

pyshgp is compatible with Python 3.7.x and 3.8.x.

Install from pip

pip install pyshgp

That's it! Read through the docs and examples to learn more.

Build From source

Clone the repo
cd into the pyshgp repo directory
run pip install . --upgrade
That's it! Read through the docs and examples to learn more.

Running Tests

Run the following command from project root directory. Make sure all the packages from requirements-with-dev.txt are installed in the instance of python you are using.

python -m pytest

Or run tests continuously (on save) during development using pytest-watch.

ptw

Documentation

Example usages of pyshgp can be found:

In the examples/ folder of the pyshgp Github repository.
In the minimal demo repository.

The full pyshgp API can be found on the official website.

Pysh Roadmap / Contributing

PyshGP is nearly ready for its 1.0 release. The main outstanding items are:

Extensive benchmarking to make sure pyshgp has the program-finding capabilities we expect from a contemporary PushGP system.
More feedback on the API must be gathered before we commit to not making any breaking changes.

For information about contributing, see the Contributing Guide.

pyshgp's Issues

suggestion: Name predicate instructions to indicate boolean return value

I am noticing instructions like _exec_empty cropping up quite a lot. I realize that in Python variable names can only include underscores and Alnum characters, but I wonder if it might make Push code a bit more readable to name these in the Mathematica style: ending in Q (indicating "question", I guess?).

So for example _exec_empty_Q.

Just a minor suggestion. In Clojure implementations I use exec-empty?, and it helps readability quite a bit.

Re-run and document examples in examples/README.md

Pysh has changed a lot during development, and not all runs documented in the examples/README.md are accurate anymore.

Re-write ReadTheDocs using Push-Redux as background knowledge.

Now that the Push-Redux has most of its content, we can simplify the Pysh ReadTheDocs by referencing the redux.

End-to-end benchmarks to guide development.

It is hard to judge the impact of many of our changes during CI because full evolutionary runs can take weeks of CPU time. The best we can hope for is a benchmarking tool that can manually start a significant number of benchmarks and create a report.

These benchmarks should be tracking runtime and pyshgp's ability to find solutions.

Things that need to be done to complete this work:

Implement more of the software synthesis benchmark problems.
Write a script to start n number of runs on x non-trivial problems, and track the runtime and solution of each. Ideally runs would happen in parallel. On fly? On digital ocean?
Determine some way of storing runtimes and solution rates long term.

Remove Twilio

It was a fun idea, but difficult to make usable for anyone other than me. 😢

Re-document Examples

Pysh has changed enough over the past few months that the documentation about the examples is getting fairly out of date.

This could wait until #44 is done.

Make interface with Sklearn much easier.

Create wrapper classes for Regression and Classification problems that implement the base class of scikit-learn.

Rewrite push spawner and translate functions to remove epigenetic markers.

The concepts of plush genes and epigenetic markers need to be merged. Silent and close markers should just be attributes of plush genes. There is no need for users to control the epigenetic markers.

This refactor will mainly impact Spawner and Translate code.

Error metrics on problems are re-written for each problem. Should create metrics module.
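
As a rough sketch of what a shared metrics module could contain, the functions below cover numeric and string outputs. The function names are hypothetical and are not part of pyshgp; they only show error functions defined once and reused across problems.

```python
def absolute_errors(expected, actual):
    """Per-case absolute error for numeric outputs."""
    return [abs(e - a) for e, a in zip(expected, actual)]


def levenshtein_distance(a, b):
    """Edit distance between two strings, a common error for string problems."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


numeric_errors = absolute_errors([1, 2, 3], [1, 4, 0])
string_error = levenshtein_distance("kitten", "sitting")
print(numeric_errors, string_error)
```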

Rename modules so they don't start with "pysh_"

Possible overhaul of instructions.

Note: Maybe this whole issue can be ignored if we want to migrate the push interpreter to CPushPush.

Instructions

We should easily be able to rely more on inheritance and add more functionality to the constructor of Instruction in order to remove a lot of code duplication.

See the way instructions are made in Propel and CPushPush for examples of patterns that are much better than what is currently in pyshgp.

Instruction Set

Each python module which defines Push Instructions should define __all__ and then we can safely use import * in the instructions __init__.py. This should make the whole instructions sub-package much easier to understand.
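
The effect of `__all__` on `import *` can be shown in a self-contained way. The sketch below builds a throwaway module in memory; the module name `toy_instructions` and its functions are hypothetical and stand in for a real instruction module.

```python
import sys
import types

# Build a throwaway module in memory so the example needs no files on disk.
source = '''
__all__ = ["integer_add", "integer_sub"]

def integer_add(a, b):
    return a + b

def integer_sub(a, b):
    return a - b

def check_args(*args):  # helper, deliberately left out of __all__
    return all(isinstance(arg, int) for arg in args)
'''
module = types.ModuleType("toy_instructions")
exec(source, module.__dict__)
sys.modules["toy_instructions"] = module

# `import *` copies only the names listed in __all__.
namespace = {}
exec("from toy_instructions import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Only the two listed functions are exported, so a package `__init__.py` that star-imports such modules does not leak their helpers.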

Tests

This time we should put a lot more thought into removing the code duplication. Here is a sketch of what I would currently suggest, although more thought should maybe be put into this:

common_tests = [
    [
        {'_integer': [1, 2, 3]},
        {'_integer': [1, 2]},
        '_integer_pop'
    ],
    [
        {'_boolean': [True]},
        {'_boolean': []},
        '_boolean_pop'
    ],
    [
        {'_string': ['A']},
        {'_string': ['A', 'A']},
        '_string_dup'
    ],
    [
        {'_float': [1.5, 2.3]},
        {'_float': [2.3, 1.5]},
        '_float_swap'
    ],
    [
        {'_integer': [1, 2, 3]},
        {'_integer': [2, 3, 1]},
        '_integer_rot'
    ]
]

for test in common_tests:
    assert run_test(*test)
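
The sketch above leaves `run_test` undefined. One possible shape for it, assuming each case is a (before, after, instruction name) triple and using a toy table of stack operations rather than pyshgp's real instruction set, is:

```python
import copy


def _pop(stack):
    if stack:
        stack.pop()


def _dup(stack):
    if stack:
        stack.append(stack[-1])


def _swap(stack):
    if len(stack) >= 2:
        stack[-1], stack[-2] = stack[-2], stack[-1]


def _rot(stack):
    # Move the third item from the top to the top of the stack.
    if len(stack) >= 3:
        stack.append(stack.pop(-3))


# Hypothetical operation table; the top of each stack is the end of the list.
OPERATIONS = {"pop": _pop, "dup": _dup, "swap": _swap, "rot": _rot}


def run_test(before, after, instruction):
    """Apply one instruction to a copy of `before` and compare with `after`."""
    # "_integer_pop" -> stack "_integer", operation "pop"
    stack_name, operation = instruction.rsplit("_", 1)
    state = copy.deepcopy(before)
    OPERATIONS[operation](state.setdefault(stack_name, []))
    return state == after


passing = run_test({"_integer": [1, 2, 3]}, {"_integer": [2, 3, 1]}, "_integer_rot")
failing = run_test({"_string": ["A"]}, {"_string": ["A", "A"]}, "_boolean_pop")
print(passing, failing)
```

Because the whole state is compared, a case whose instruction does not match its stacks fails instead of passing silently.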

Instructions are run in a strange way.

_handle_?_instruction() implementation should be moved out of the interpreter and into the class definition of the corresponding instruction type.
PushInterpreter.execute_instruction() should be broken into PushInterpreter.eval_atom() and an execute() method found in each instruction class definition.
Documentation for all of this should be much better.

Need consistent usage of '_' in instruction names.

Currently I am leaning towards always using the _ when referencing instruction names because it indicates that the string is probably an instruction name, and will make expressing programs as lists of strings more reliable... although I am not sure it is good to encourage (or support) the latter.

Add generational pre-process function

This will significantly clean up the evolution() function in the gp/gp.py file.

string demo error

I tried tweaking the string demo, where the target function takes a string (s) and returns s[:-2]+s[:-2].
I made it concatenate the reverse of the string with itself by changing the return to s[::-1]+s[::-1], and got the error

AttributeError Traceback (most recent call last)
in
40 )
41
---> 42 est.fit(X=X, y=y, verbose=True)
43 print(est._result.program)
44 print(est.predict(X))

~/anaconda3/lib/python3.7/site-packages/pyshgp/gp/estimators.py in fit(self, X, y, verbose)
211 else:
212 y_types = [type(y[0])]
--> 213 output_types = [push_type_for_type(t).name for t in y_types]
214
215 self.evaluator = DatasetEvaluator(X, y, interpreter=self.interpreter)

~/anaconda3/lib/python3.7/site-packages/pyshgp/gp/estimators.py in (.0)
211 else:
212 y_types = [type(y[0])]
--> 213 output_types = [push_type_for_type(t).name for t in y_types]
214
215 self.evaluator = DatasetEvaluator(X, y, interpreter=self.interpreter)

AttributeError: 'NoneType' object has no attribute 'name'

not sure why

Separate Push interpreter tests from instruction set tests.

Related to #35 but more general. This is currently pretty well covered by the current tests for the instruction set but those tests will ideally be removed at some point.

Command line args could be made much more robust with argparse

https://docs.python.org/3/library/argparse.html
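
A minimal sketch of what argparse-based parameters might look like. The parameter names are taken from elsewhere on this page; the default values and the list of selection methods are assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run a PushGP example (illustrative only).")
    # Defaults below are placeholders, not pyshgp's real defaults.
    parser.add_argument("--population_size", type=int, default=500)
    parser.add_argument("--max_generations", type=int, default=100)
    parser.add_argument("--selection_method",
                        choices=["lexicase", "tournament"],  # assumed options
                        default="lexicase")
    return parser


# Passing a list makes the example reproducible; omit it to read sys.argv.
args = build_parser().parse_args(["--population_size=200"])
evolutionary_params = vars(args)
print(evolutionary_params)
```

argparse accepts both `--population_size=200` and `--population_size 200`, converts types, and exits with a usage message on unknown or malformed flags instead of raising an IndexError.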

Genetic Operator Pipelines

pyshgp's current genetic operators are large and relatively complex.

Replacing them with many smaller operations that each have their own probabilities could be beneficial. In addition, making pyshgp's current system for combining operations more robust and easier to use (which I am calling Operator Pipelines) might be good as well.

An Operator Pipeline which includes all of the above would be equivalent of Uniform Mutation.
An Operator Pipeline could also include recombination.
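
A rough sketch of the idea, assuming genomes are flat Python lists and each small operator is a function of a genome and a random number generator. The class and operator names are hypothetical and are not pyshgp's API.

```python
import random


def perturb_integers(genome, rng):
    # Nudge every integer gene by +/-1 (bool is excluded: it subclasses int).
    out = []
    for gene in genome:
        if isinstance(gene, int) and not isinstance(gene, bool):
            gene += rng.choice([-1, 1])
        out.append(gene)
    return out


def flip_booleans(genome, rng):
    return [not gene if isinstance(gene, bool) else gene for gene in genome]


def random_deletion(genome, rng):
    if not genome:
        return genome
    index = rng.randrange(len(genome))
    return genome[:index] + genome[index + 1:]


class OperatorPipeline:
    """Apply each (operator, probability) step in order."""

    def __init__(self, steps):
        self.steps = steps

    def __call__(self, genome, rng):
        for operator, probability in self.steps:
            if rng.random() < probability:
                genome = operator(genome, rng)
        return genome


# Probabilities of 1.0 and 0.0 keep this demonstration deterministic in shape;
# a real pipeline would use fractional probabilities.
pipeline = OperatorPipeline([
    (perturb_integers, 1.0),
    (flip_booleans, 1.0),
    (random_deletion, 0.0),
])
child = pipeline([1, True, "integer_add"], random.Random(0))
print(child)
```

Because every step has the same signature, recombination or any other operator can be slotted into the same list.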

PushGPClassifier is missing

Blocked by #71

Remove `data` folder, use sklearn datasets.

Then we can avoid having csv, json, etc. files in the repo.

Bug in gp.py

I've just cloned a fresh repo (and upgraded all my Python binaries), and when I run an example (any of the three) I get the following:

[all the standard setup works OK]

Creating Initial Population
Traceback (most recent call last):
  File "examples/integer_regression.py", line 60, in <module>
    gp.evolution(error_func, problem_params)
  File "/usr/local/lib/python2.7/site-packages/pyshgp/gp/gp.py", line 166, in evolution
    population = generate_random_population(evolutionary_params)
  File "/usr/local/lib/python2.7/site-packages/pyshgp/gp/gp.py", line 97, in generate_random_population
    rand_genome = r.random_plush_genome(evolutionary_params)
TypeError: random_plush_genome() takes exactly 2 arguments (1 given)

Push Instruction Set Tests

Current tests should be removed because they are too difficult to maintain and are hand written, so they probably don't cover enough cases. Tests should ideally be generated.

The difficulty of this is that the output of a program is the state of all the stacks. It is difficult (not possible?) to know what the expected output of a generated push program is without running it, unless it isn't generated completely randomly. How can you generate a program in such a way that you know what its output should be?

Also, it is just as important to know what the output of a program should not be. In other words, if we are testing a random program we will have to determine what values should be expected on the stacks after execution. We will also have to check that no other values are on the stacks after execution. This is difficult to check for with programs that were generated with any degree of randomness in them.

Separate Push Interpreter from GP

It should be easier to use (and modify) the Push interpreter without having to worry about adversely changing evolution.

GP Tests

Okay, so I didn't exactly use Test Driven Development... Unit tests should still be added for GP related operations.

Exact implementation TBD because of the random nature of all the operators.

Custom special stack-element objects

Stack out of bounds (or should this be an exception)
Empty stack object

These will clean up code surrounding checking values on stack.
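
One way such objects could look is a pair of unique sentinels returned by a stack accessor. The names `EMPTY_STACK`, `OUT_OF_BOUNDS`, and `stack_ref` are hypothetical.

```python
class _Sentinel:
    """A unique marker object with a readable repr."""

    def __init__(self, name):
        self._name = name

    def __repr__(self):
        return self._name


EMPTY_STACK = _Sentinel("EMPTY_STACK")
OUT_OF_BOUNDS = _Sentinel("OUT_OF_BOUNDS")


def stack_ref(stack, position):
    """Return the item `position` places below the top (0 is the top)."""
    if not stack:
        return EMPTY_STACK
    if position < 0 or position >= len(stack):
        return OUT_OF_BOUNDS
    return stack[-1 - position]


top = stack_ref([1, 2], 0)
missing = stack_ref([1, 2], 5)
empty = stack_ref([], 0)
print(top, missing, empty)
```

Callers compare with `is` (for example `value is EMPTY_STACK`), so legitimate stack values such as `0`, `False`, or `None` are never mistaken for a missing item.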

Vote instructions need improving.

Currently class vote instructions require a numeric argument, which adds burden to evolution. It would be beneficial to add vote instructions that increment and decrement vote levels for each class by a constant number baked into the instruction.

Add vote inc and vote dec instructions
Add vote inc and dec instructions that vote powers of 2.

Add generalization_function to SimplePushGPEvolver attributes

After automatic program simplification, run this function on the program to determine if it generalizes. generalization_function will look very similar to error function.

Also, consider refactoring examples to include 1 function which produces an error_function and a generalization_function.

How do I generate predictions with the best model from a run?

I'm starting with the iris example:

from sklearn import datasets, model_selection
import numpy as np

import pyshgp.gp.base as gp


iris = datasets.load_iris()
X_train, X_test, y_train, y_test = model_selection.train_test_split(iris.data,
                                                                    iris.target,
                                                                    test_size=0.5)

model = gp.PushGPClassifier(population_size=100, max_generations=50)
model.fit(X_train, y_train)

I tried model.score(X_test, y_test), but it complained that there is no predict function. Is there an easy way to create a predict function?

Properly implement program_growth_cap

Dumping and loading models

Is there a way to dump a model and load it later for prediction? Specifically, how do I use the final program printed out at the end of the GP run to perform a prediction task at later times?

Keyboard interrupt ^C does not halt processing

I was running one of the examples on my laptop, and got super boooooored, but ctrl-C just raises some kind of caught exception and leaves a pile of Python processes running.

Are you by any chance catching all exceptions, including KeyboardInterrupt? Because that's not really the way I would like ctrl-C to work, as it turns out. 😃
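
For reference, the sketch below shows the difference the question is pointing at. It is generic Python, not pyshgp code, and it does not establish what pyshgp actually does: `KeyboardInterrupt` derives from `BaseException`, so `except Exception` lets it propagate, while a bare `except` or `except BaseException` swallows it.

```python
def run_catching_exception(task):
    try:
        task()
    except Exception as err:  # does not match KeyboardInterrupt
        return f"handled {type(err).__name__}"
    return "finished"


def run_catching_everything(task):
    try:
        task()
    except BaseException as err:  # also swallows Ctrl-C
        return f"swallowed {type(err).__name__}"
    return "finished"


def interrupted():
    # Stand-in for the user pressing Ctrl-C during a long run.
    raise KeyboardInterrupt


outcomes = [run_catching_everything(interrupted)]
try:
    run_catching_exception(interrupted)
except KeyboardInterrupt:
    outcomes.append("propagated")
print(outcomes)
```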

Use a dev branch

Great to see all this progress on the pyshgp package. I strongly recommend using a separate development branch for developing the module, and dedicating the master branch to the latest release on pip. I got thrown off for a bit because I was installing pyshgp via pip but referring to examples on the latest dev version on GitHub.

Update README.md

Due to lots of recent changes, the readme needs to be completely re-written.

Contributing Guide

Utilize numpy when possible

Break Up Uniform Mutation

Uniform mutation is a rather large variation operator. Now that pyshgp supports GeneticOperatorPipelines it would give the user more control to break UM into:

PerturbClosesMutation
PerturbIntegerMutation
PerturbFloatMutation
TweakStringMutation
FlipBooleanMutation
RandomDeletionMutation
RandomAdditionMutation
RandomReplaceMutation
Genesis
Reproduction

Reformat Docs. Add autodoc API documentation.

Docs should be overhauled and largely gutted. Most of the PushGP descriptions should live with the nearly-complete Push-Redux which currently lives at https://erp12.github.io/push-redux/

I also just found out about the power of pairing ReadTheDocs and Autodoc. I am in the process of adding auto-generated API pages to the ReadTheDocs documentation site. Unfortunately, I don't think there is a good way to use this to document the instruction set, so we will continue to rely on the hack-y comment scraper for now.

odd number tutorial error

I tried to run the odd number tutorial and got an error message reading: __init__() missing 1 required positional argument: "spawner". I'm not sure why this is happening.

Issue with parallelization in regression.py

Hi, when I try running the regression.py with n_jobs = -1 I get the following error

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Add doc attr to Instruction. Make app to generate md based off of all_instructions.

What needs to be done to add a new example?

I've got a couple of example problems I use in my GP classes and workshops, and they seem to be working already in Pysh. Aside from in-file docs, what needs to be added, and where, before I submit a PR?

Pysh Exceptions Module

Probably bad form to use the generic Exception object for everything...

/pysh/gp/operators.py: raise Exception("Tried to perform unknown genetic operator " + str(op))
/pysh/gp/selection.py: raise Exception("Unknown selection method: " + str(evolutionary_params["selection_method"]))
/pysh/push/translation.py: raise Exception('Something bad found on paren_stack!')
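
A small exception hierarchy along these lines could replace the generic `Exception` calls quoted above. The class names are hypothetical; the messages mirror the ones in the three `raise` statements.

```python
class PyshError(Exception):
    """Base class so callers can catch every pysh-specific error at once."""


class UnknownGeneticOperatorError(PyshError, ValueError):
    def __init__(self, op):
        super().__init__(f"Tried to perform unknown genetic operator {op}")
        self.op = op


class UnknownSelectionMethodError(PyshError, ValueError):
    def __init__(self, method):
        super().__init__(f"Unknown selection method: {method}")
        self.method = method


class TranslationError(PyshError):
    """Raised when genome-to-program translation reaches an invalid state."""


try:
    raise UnknownSelectionMethodError("roulette")
except PyshError as err:
    message = str(err)
print(message)
```

Deriving the two "unknown value" errors from `ValueError` as well keeps them catchable by code that already handles bad parameter values.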

Registered instructions should be a Set, not Dictionary.

Because

There shouldn't be duplicates
Every instruction obj has a name attribute that makes the dictionary key redundant
It is rare to look up a single instruction, so a filter() is already used more often
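
A sketch of how a set-based registry might behave, assuming instructions are hashable and compared by name only. The `Instruction` dataclass here is a hypothetical stand-in, not pyshgp's class.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass(frozen=True)
class Instruction:
    # Only `name` takes part in equality and hashing.
    name: str
    stack_types: Tuple[str, ...] = field(compare=False)
    func: Callable = field(compare=False)


registered = set()
registered.add(Instruction("_integer_add", ("_integer",), lambda a, b: a + b))
registered.add(Instruction("_integer_add", ("_integer",), lambda a, b: a + b))  # duplicate name
registered.add(Instruction("_boolean_not", ("_boolean",), lambda a: not a))

# Selecting by stack type is a filter over the set, with no key lookup needed.
integer_names = sorted(
    instruction.name
    for instruction in registered
    if "_integer" in instruction.stack_types
)
print(len(registered), integer_names)
```

The second `_integer_add` is dropped because it hashes and compares equal to the first, which is the no-duplicates guarantee the set provides without a redundant key.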

You spelled 'evolutionary' wrong in the readme

Drop python 2 support. Add type hints.

Python 2 is more work than it is worth. Type hints are nice.

This will involve

Removing import from future
Refactoring the is_?_type() functions in util.
Updating CI

Expand interpreter constructor to lessen code re-use in problem files.

Pass input stack values

Develop a very simple push storage system (aka tags)

To learn more about tags, see this.

In order to enable research on the best use of tag (or tag-like) systems, we need a basic framework for push program general information storage.

Some simple ideas include:

3 variables per datatype

Randomness instructions are missing

Without the ability to generate random numbers, strings, vectors, etc., it is impossible to evolve probabilistic programs.

Command line args don't work as described

I tried python examples/idea_of_numbers.py --population_size=200 and it errors out immediately with

Traceback (most recent call last):
  File "examples/idea_of_numbers.py", line 59, in <module>
    gp.evolution(error_func, problem_params)
  File "/usr/local/lib/python2.7/site-packages/pyshgp/gp/gp.py", line 148, in evolution
    params.grab_command_line_params(evolutionary_params)
  File "/usr/local/lib/python2.7/site-packages/pyshgp/gp/params.py", line 143, in grab_command_line_params
    while sys.argv[i+j].startswith('-'):
IndexError: list index out of range

(the idea_of_numbers.py file is my own)

Genome Simplification isn't great...

Often fails to make programs more reasonable as is.

At least should add replacement with no-ops instructions.
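
A hill-climbing sketch of simplification that mixes deletion with no-op replacement. It assumes a flat program and a caller-supplied `error_fn`, and is not pyshgp's simplifier. The `NOOP` marker, the 50/50 split between the two moves, and the toy error function are all assumptions.

```python
import random

NOOP = "noop"  # hypothetical no-op instruction name


def simplify(program, error_fn, steps=200, rng=None):
    """Keep random deletions or no-op replacements that do not raise the error."""
    rng = rng or random.Random()
    best = list(program)
    best_error = error_fn(best)
    for _ in range(steps):
        if not best:
            break
        candidate = list(best)
        index = rng.randrange(len(candidate))
        if rng.random() < 0.5:
            del candidate[index]      # shrink the program
        else:
            candidate[index] = NOOP   # keep the length, neutralize one atom
        candidate_error = error_fn(candidate)
        if candidate_error <= best_error:
            best, best_error = candidate, candidate_error
    return best


def toy_error(program):
    # Toy problem: the integers in the program should sum to 5.
    total = sum(atom for atom in program if isinstance(atom, int))
    return abs(total - 5)


original = [5, 0, 0, NOOP, 0]
simplified = simplify(original, toy_error, rng=random.Random(1))
print(simplified)
```

Replacing an atom with a no-op preserves the positions of the remaining atoms, which matters for programs where deleting an item would shift block boundaries.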