lspector / clojush

The Push programming language and the PushGP genetic programming system implemented in Clojure.

Home Page: http://hampshire.edu/lspector/push.html

License: Eclipse Public License 1.0

Python 0.19% Clojure 99.29% Shell 0.50% Dockerfile 0.01%

clojure genetic-programming pushgp programming-language stack-based interpreter

clojush's Introduction

Clojush

Lee Spector ([email protected]), started 20100227. See version history; older version history is in old-version-history.txt.

This is the README file accompanying Clojush, an implementation of the Push programming language and the PushGP genetic programming system in the Clojure programming language. Among other features this implementation takes advantage of Clojure's facilities for multi-core concurrency.

Availability

https://github.com/lspector/Clojush/

Requirements

To use this code you must have a Clojure programming environment; see http://clojure.org/. The current version of Clojush requires Clojure 1.7.0.

Clojure is available for most OS platforms; the Clojure site referenced above is a reasonable starting point for obtaining and using it.

Quickstart

Using Leiningen you can run an example from the OS command line (in the Clojush directory) with a call like:

lein run clojush.problems.demos.simple-regression

If you would like to change a parameter, you may do so at the command line. For example, to change the default population size from 1000 to 50, call:

lein run clojush.problems.demos.simple-regression :population-size 50

Additional parameters may also be specified. All valid parameters with their descriptions can be found in args.clj.

The above calls will load everything and run PushGP on a simple symbolic regression problem (symbolic regression of y=x^3-2x^2-x). Although the details will vary from run to run, and it's possible that it will fail, this usually succeeds in a few generations.

Another option is to evaluate in the Leiningen REPL (Read Eval Print Loop):

sh> lein repl
...
clojush.core=> (use 'clojush.problems.demos.simple-regression)
...
clojush.core=> (pushgp argmap)

Arguments to pushgp are specified in the argmap variable in the problem's namespace.

To run the examples in an IDE (Integrated Development Environment) for Clojure such as Clooj or Eclipse/Counterclockwise, load one of the files in src/clojush/problems into the IDE's REPL, type (pushgp argmap) into the REPL's input area, and hit the enter key.

You can also use Docker to run examples, if you don't want to install Clojure on your machine directly.

# first build the image. This needs to be re-done if any of the code changes
docker build -t lspector/clojush .
# then run it on a specific problem
docker run --rm lspector/clojush lein run clojush.problems.demos.simple-regression

For large-scale runs you may want to provide additional arguments to Java in order to allow access to more memory and/or to take maximal advantage of Clojure's concurrency support in the context of Clojush's reliance on garbage collection. For example, you might want to provide arguments such as -Xmx2000m and -XX:+UseParallelGC. Details will depend on the method that you use to launch your code.

An additional tutorial is available in src/clojush/problems/demos/tutorial.clj.

Description

Clojush is a version of the Push programming language for evolutionary computation, and the PushGP genetic programming system, implemented in Clojure. More information about Push and PushGP can be found at http://hampshire.edu/lspector/push.html.

Clojush derives mainly from Push3 (for more information see http://hampshire.edu/lspector/push3-description.html, http://hampshire.edu/lspector/pubs/push3-gecco2005.pdf) but it is not intended to be fully compliant with the Push3 standard and there are a few intentional differences. It was derived most directly from the Scheme implementation of Push/PushGP (called Schush). There are several differences between Clojush and other versions of Push3 -- for example, almost all of the instruction names are different because the . character has special significance in Clojure -- and these are listed below.

If you want to understand the motivations for the development of Push, and the variety of things that it can be used for, you should read a selection of the documents listed at http://hampshire.edu/lspector/push.html, probably starting with the 2002 "Genetic Programming and Evolvable Machines" article that can be found at http://hampshire.edu/lspector/pubs/push-gpem-final.pdf. Bear in mind that Push has changed over the years, and that Clojush is closest to Push3 (references above).

Push can be used as the foundation of many evolutionary algorithms, not only PushGP (which is more or less a standard GP system except that it evolves Push programs rather than Lisp-style function trees -- which can make a big difference!). It was developed primarily for "meta-genetic-programming" or "autoconstructive evolution" experiments, in which programs and genetic operators co-evolve or in which programs produce their own offspring while also solving problems. But it turns out that Push has a variety of uniquely nice features even within a more traditional genetic programming context; for example it makes it unusually easy to evolve programs that use multiple data types, it provides novel and automatic forms of program modularization and control structure co-evolution, and it allows for a particularly simple form of automatic program simplification. Clojush can serve as the foundation for other evolutionary algorithms, but only the core Push interpreter and a version of PushGP are provided here.

Starting with version 2.0.0, the genomes of evolving individuals in Clojush are based on Plush (linear Push) genomes, which are translated into normal Push programs before execution. Plush genomes are composed of instruction maps, each of which contains an instruction and potentially other metadata describing whether that instruction should be silenced, whether closing parentheses should follow it, etc.
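As a rough illustration of the genome-to-program step, the sketch below drops silenced instruction maps from a Plush-style genome and keeps the rest in order. The key names (:instruction, :silent, :close) and the function name are assumptions made for this example, and the handling of closing parentheses is deliberately left out, so this is much simpler than whatever translation Clojush actually performs.

```clojure
;; Hypothetical, simplified sketch: not Clojush's actual translation function.
;; Each gene is a map; the keys used here are assumed for illustration.
(def example-genome
  [{:instruction 1             :silent false :close 0}
   {:instruction 2             :silent false :close 0}
   {:instruction 'integer_mult :silent true  :close 0}
   {:instruction 'integer_add  :silent false :close 0}])

(defn genome->flat-program
  "Keeps the instructions of all non-silenced genes, in order.
  Ignores :close, so the result is always a flat list."
  [genome]
  (apply list (map :instruction (remove :silent genome))))

(genome->flat-program example-genome)
```

Here the silenced integer_mult gene is skipped, so only the two constants and integer_add survive.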

Usage

Example calls to PushGP are provided in other accompanying files.

Push programs are run by calling run-push, which takes as arguments a Push program and a Push interpreter state that can be made with make-push-state. If you are planning to use PushGP then you will want to use this in the error function (a.k.a. fitness function) that you pass to the pushgp function. Here is a simple example of a call to run-push, adding 1 and 2 and returning the top of the integer stack in the resulting interpreter state:

(top-item :integer (run-push '(1 2 integer_add) (make-push-state)))

If you want to see every step of execution you can pass an optional third argument of true to run-push. This will cause a representation of the interpreter state to be printed at the start of execution and after each step. Here is the same example as above but with each step printed:

(top-item :integer (run-push '(1 2 integer_add) (make-push-state) true))

See the "parameters" section of the code for some parameters that will affect execution, e.g. whether code is pushed onto and/or popped off of the code stack prior to/after execution, along with the evaluation limits (which can be necessary for halting otherwise-infinite loops, etc.).

The run-push function returns the Push state that results from the program execution; this is a Clojure map mapping type names to data stacks. In addition, the map returned from run-push will map :termination to :normal if termination was normal, or :abnormal otherwise (which generally means that execution was aborted because the evaluation limit was reached).

Random code can be generated with random-push-code, which takes a size limit and a list of "atom generators." Size is simply the length of the linear Plush genome. Each atom-generator should be a constant or the name of a Push instruction (in which case it will be used literally), or a Clojure function that will be called with no arguments to produce a constant or a Push instruction. This is how "ephemeral random constants" can be incorporated into evolutionary systems that use Clojush -- that is, it is how you can cause random constants to appear in randomly-generated programs without including all possible constants in the list of elements out of which programs can be constructed. Here is an example in which a random program is generated, printed, and run. It prints a message indicating whether or not the program terminated normally (which it may not, since it may be a large and/or looping program, and since the default evaluation limit is pretty low) and it returns the internal representation of the resulting interpreter state:

(let [s (make-push-state)
      c (random-push-code
          100                                  ;; size limit of 100 points
          (concat @registered-instructions     ;; all registered instrs
                  (list (fn [] (rand-int 100)) ;; random integers from 0-99
                        (fn [] (rand)))))]     ;; random floats from 0.0-1.0
  (printf "\n\nCode: %s\n\n" (apply list c))
  (run-push c s))
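The atom-generator convention described above (literal values versus zero-argument functions) can be sketched in plain Clojure, independent of Clojush. The names pick-atom and random-flat-code are hypothetical, and the result is a flat list rather than a real Plush genome or nested Push program.

```clojure
;; Sketch only: `pick-atom` and `random-flat-code` are hypothetical names.
(defn pick-atom
  "Chooses one atom generator at random. Functions are called with no
  arguments to produce a fresh value; anything else is used literally."
  [atom-generators]
  (let [g (rand-nth atom-generators)]
    (if (fn? g) (g) g)))

(defn random-flat-code
  "Builds a flat list of `n` atoms drawn from `atom-generators`."
  [n atom-generators]
  (apply list (repeatedly n #(pick-atom atom-generators))))

;; Two literal instruction names plus one ephemeral random constant generator.
(random-flat-code 5 ['integer_add 'integer_sub (fn [] (rand-int 100))])
```

Because the generator function is called each time it is chosen, different positions in the same program can receive different constants.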

If you look at the resulting interpreter state you will see an "auxiliary" stack that is not mentioned in any of the Push publications. This exists to allow for auxiliary information to be passed to programs without using global variables; in particular, it is used for the "input instructions" in some PushGP examples. One often passes data to a Push program by pushing it onto the appropriate stacks before running the program, but in many cases it can also be helpful to have an instruction that re-pushes the input whenever it is needed. The auxiliary stack is just a convenient place to store the values so that they can be grabbed by input instructions and pushed onto the appropriate stacks when needed. Perhaps you will find other uses for it as well, but no instructions are provided for the auxiliary stack in Clojush (aside from the problem-specific input functions in the examples).

The pushgp function is used to run PushGP. It takes all of its parameters as keyword arguments, and provides default values for any parameters that are not provided. See the pushgp definition in pushgp/pushgp.clj for details. The single argument that must be provided is :error-function, which should be a function that takes a Push program and returns a list of errors. Note that this assumes that you will be doing single-objective evolution with the objective being thought of as an error to be minimized. This assumption is not intrinsic to Push or PushGP; it's just the simplest and most standard thing to do, so it's what I've done here. One could easily hack around that. In the most generic applications you'll want to have your error function run through a list of inputs, set up the interpreter and call run-push for each, calculate an error for each (potentially with penalties for abnormal termination, etc.), and return a list of the errors.
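The shape of such an error function can be sketched without Clojush at all. In the sketch below, run-program is a hypothetical stand-in for "set up an interpreter state, call run-push, and read back a result"; the target is the y=x^3-2x^2-x regression from the Quickstart, and the penalty value of 1000 is an arbitrary choice for illustration.

```clojure
;; Sketch only. `run-program` stands in for setting up an interpreter state,
;; running a program on one input, and reading back a result.
(defn target
  "The regression target y = x^3 - 2x^2 - x."
  [x]
  (- (* x x x) (* 2 x x) x))

(defn make-error-function
  "Returns a function that maps a program to a list of absolute errors,
  one per input. Non-numeric results get a fixed penalty."
  [run-program inputs penalty]
  (fn [program]
    (doall
      (for [x inputs]
        (let [result (run-program program x)]
          (if (number? result)
            (Math/abs (double (- result (target x))))
            penalty))))))

;; Stand-in "interpreter": here a program is just a Clojure function.
(def error-fn
  (make-error-function (fn [program x] (program x)) (range 5) 1000))

(error-fn target)           ; a perfect program: every error is zero
(error-fn (constantly nil)) ; no usable result: every case gets the penalty
```

In a real problem file the stand-in would be replaced by code that pushes the input onto the appropriate stacks, calls run-push, and reads the top of the relevant stack.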

Not all of the default arguments to pushgp will be reasonable for all problems. In particular, the default list of atom-generators -- which is ALL registered instructions, a random integer generator (in the range from 0-99) and a random float generator (in the range from 0.0 to 1.0) -- will be overkill for many problems and is so large that it may make otherwise simple problems quite difficult because the chances of getting the few needed instructions together into the same program will be quite low. But on the other hand one sometimes discovers that interesting solutions can be formed using unexpected instructions (see the Push publications for some examples of this). So the set of atom generators is something you'll probably want to play with. The registered-for-type function can make it simpler to include or exclude groups of instructions. This is demonstrated in some of the examples.

As of Clojush 2.0.0, genetic operator arguments are provided as a map to the :genetic-operator-probabilities argument. Here, each key may be a single operator or an "operator pipeline" vector, which allows the application of multiple operators sequentially, using one operator's output as the input to the next operator. An example argument could be:

{:reproduction 0.1
 :alternation 0.2
 :uniform-mutation 0.2
 [:alternation :uniform-mutation] 0.2
 :uniform-close-mutation 0.1
 :uniform-silence-mutation 0.1
 [:make-next-operator-revertable :uniform-silence-mutation] 0.1}

Here, two different pipelines would be used. In the second pipeline, the meta-operator :make-next-operator-revertable makes the :uniform-silence-mutation operator revertable, which means that the child will be compared to the parent, and the parent kept if it is better than the child.
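To make the pipeline idea concrete, here is a small sketch of how a probability map with keyword and vector keys could be interpreted. The operators are toy stand-ins acting on vectors, and select-key, apply-operator-key and make-revertable are hypothetical names; this is an assumption about the general mechanism, not a description of Clojush's breeding code.

```clojure
;; Sketch only: operator names map to hypothetical stand-in functions on a
;; genome (a vector), not to Clojush's genetic operators.
(def operators
  {:drop-last (fn [genome] (vec (butlast genome)))
   :reverse   (fn [genome] (vec (reverse genome)))})

(defn select-key
  "Roulette selection over a non-empty probability map: accumulates
  weights and returns the first key whose cumulative weight exceeds `r`,
  where `r` is in [0,1). Falls back to the last key."
  [probabilities r]
  (loop [entries (seq probabilities)
         acc 0.0]
    (let [[k p] (first entries)
          acc (+ acc p)]
      (if (or (< r acc) (nil? (next entries)))
        k
        (recur (next entries) acc)))))

(defn apply-operator-key
  "A single keyword names one operator; a vector names a pipeline whose
  operators are applied left to right, each consuming the previous output."
  [k genome]
  (reduce (fn [g op-name] ((operators op-name) g))
          genome
          (if (vector? k) k [k])))

(defn make-revertable
  "Wraps `op` so that the parent is kept when it has lower error than
  the child that `op` produces."
  [op error]
  (fn [parent]
    (let [child (op parent)]
      (if (< (error parent) (error child)) parent child))))

(apply-operator-key [:drop-last :reverse] [1 2 3 4])
```

A caller would typically pass (rand) as r; taking it as an argument keeps the selection step easy to test.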

The use of simplification is also novel here. Push programs can be automatically simplified -- to some extent -- in a very straightforward way: because there are almost no syntax constraints you can remove anything (one or more atoms or sub-lists, or a pair of parentheses) and still have a valid program. So the automatic simplification procedure just iteratively removes something, checks to see what that does to the error, and keeps the simpler program if the error is the same (or lower!).

Automatic simplification is used in this implementation of PushGP in two places:

A specified number of simplification iterations is performed on the best program in each generation. This is produced only for the sake of the report, and the result is not added to the population. It is possible that the simplified program that is displayed will actually be better than the best program in the population. Note also that the other data in the report concerning the "best" program refers to the unsimplified program.
Simplification is also performed on solutions at the ends of runs.

Note that the automatic simplification procedure will not always find all possible simplifications even if you run it for a large number of iterations, but in practice it does often seem to eliminate a lot of useless code (and to make it easier to perform further simplification by hand).
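The remove-and-check loop described above can be sketched on a toy problem. This version only handles flat vectors and only removes single elements, whereas real Push programs are nested and allow sub-lists or pairs of parentheses to be removed; the names simplify, remove-nth and toy-error are hypothetical.

```clojure
;; Sketch only: works on flat vectors and removes one random element per
;; step; real Push programs are nested and allow more kinds of removal.
(defn remove-nth
  "Returns vector `v` without the element at index `n`."
  [v n]
  (vec (concat (subvec v 0 n) (subvec v (inc n)))))

(defn simplify
  "Tries `steps` random single-element removals, keeping each candidate
  whose error is no worse than the current program's error."
  [program error-fn steps]
  (loop [program (vec program)
         step 0]
    (if (or (>= step steps) (empty? program))
      program
      (let [candidate (remove-nth program (rand-int (count program)))]
        (recur (if (<= (error-fn candidate) (error-fn program))
                 candidate
                 program)
               (inc step))))))

;; Toy error: distance of the sum of the numbers from 10; :noop is useless.
(defn toy-error
  [program]
  (Math/abs (long (- 10 (reduce + (filter number? program))))))

(simplify [:noop 4 :noop 6 :noop] toy-error 50)
```

Since removals are random, a given run may leave some useless elements behind, which matches the caveat that not all possible simplifications are always found.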

If you've read this far then the best way to go further is probably to read and run the example problem files in src/clojush/problems/demos.

Implementation Notes

A Push interpreter state is represented here as a Clojure map that maps type names (keywords) to stacks (lists, with the top items listed first).
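A minimal sketch of that representation, using plain Clojure maps and lists, might look like the following. The helper names push-value and top-value are hypothetical and chosen to avoid suggesting they are Clojush's own functions; the :no-stack-item marker for an empty stack is taken from the issue discussion further down this page.

```clojure
;; Sketch only: these helpers mirror the description above but are
;; hypothetical, not Clojush's own definitions.
(def empty-state {:integer '() :boolean '() :exec '()})

(defn push-value
  "Returns a state with `value` on top of the stack for `type`."
  [state type value]
  (update state type conj value))

(defn top-value
  "Returns the top of the `type` stack, or :no-stack-item if it is empty."
  [state type]
  (if (empty? (get state type))
    :no-stack-item
    (first (get state type))))

(-> empty-state
    (push-value :integer 1)
    (push-value :integer 2)
    (top-value :integer))
```

Because conj adds to the front of a list, the most recently pushed item is always the first element, which is what "top items listed first" means in practice.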

Push instructions are names of Clojure functions that take a Push interpreter state as an argument and return it modified appropriately. The define-registered macro is used to establish the definitions and also to record the instructions in the global list registered-instructions. Most instructions that work the same way for more than one type are implemented using a higher-order function that takes a type and returns a function that takes an interpreter state and modifies it appropriately. For example: there's a function called popper that takes a type and returns a function -- that function takes a state and pops the right stack in the state. This allows us to define integer_pop with a simple form:

(define-registered integer_pop (popper :integer))
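The higher-order pattern itself can be sketched without the define-registered macro. The function below is a simplified stand-in named popper-sketch (a hypothetical name) that works on a plain map of stacks; it shows only the idea of closing over a type, not Clojush's actual popper.

```clojure
;; Sketch only: a stand-in for the higher-order pattern described above,
;; operating on a plain map rather than a Clojush interpreter state.
(defn popper-sketch
  "Takes a stack type and returns a function that pops that stack."
  [type]
  (fn [state]
    (update state type rest)))

;; One definition per type, all sharing the same implementation.
(def integer-pop-sketch (popper-sketch :integer))
(def boolean-pop-sketch (popper-sketch :boolean))

(integer-pop-sketch {:integer '(3 2 1) :boolean '(true)})
```

Popping an empty stack leaves it empty here, since rest of an empty list is an empty sequence.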

In many versions of Push RUNPUSH takes initialization code or initial stack contents, along with a variety of other parameters. The implementation of run-push here takes only the code to be run and the state to modify. Other parameters are set globally in the parameters section below. At some point some of these may be turned into arguments to run-push so that they aren't global.

Miscellaneous differences between clojush and Push3 as described in the Push3 specification:

Clojush instruction names use _ instead of . since the latter has special meaning when used in Clojure symbols.
Equality instructions use eq rather than = since the latter is not explicitly allowed in Clojure symbols.
For similar reasons +, -, *, /, %, <, and > are replaced with add, sub, mult, div, mod, lt, and gt.
Boolean literals are true and false (instead of TRUE and FALSE in the Push3 spec). The original design decision was based on the fact that Common Lisp's native Boolean literals couldn't be used without conflating false and the empty list (both NIL in Common Lisp).
Clojush adds exec_noop (same as code_noop).
Clojush includes an execution time limit (via the parameter evalpush-time-limit) that may save you from exponential code growth or other hazards. But don't forget to increase it if you expect legitimate programs to take a long time.

Push3 stuff not (yet) implemented:

NAME type/stack/instructions
Other missing instructions: *.DEFINE, CODE.DEFINITION, CODE.INSTRUCTIONS
The configuration code and configuration files described in the Push3 spec have not been implemented here. The approach here is quite different, so this may never be implemented.

How to Contribute

To Do (sometime, maybe)

Implement remaining instructions in the Push3 specification.
Add support for seeding the random number generator.
Improve the automatic simplification algorithm and make it work on Plush genomes.
Possibly rename the auxiliary stack the "input" stack if no other uses are developed for it.
Write a sufficient-args fn/macro to clean up Push instruction definitions.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1017817. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.

clojush's Issues

Document Plush genomes and Push programs in the repo

I think you should have either (a) a Markdown document in the repo or (b) a repo wiki page that documents Plush genomes and Push programs. My inclination would be towards the former since then the docs are right in the code, but I'm open to other thoughts on the matter.

This might be something I'd be willing to pick up as a semi-outsider.

Can't use registered-for-stacks (and therefore autoconstruction) for problems that add instructions without type metadata

For example, mux-6.

clojush.util/top-item does not check for nil argument

(Noted while Tom pointed out several instructions do not return push-state objects.)

Failing Midje test:

;; sanity check

(fact "the top-item utility function should not return :no-stack-item when passed a nil argument"
  (top-item :integer nil) =not=> :no-stack-item
  (top-item :foo (make-push-state)) =not=> :no-stack-item
  )

Expected behaviors:

Raise an error when the indicated stack is not present in the keys of the map argument
Raise an error when the push state has no keys

Interpreter-crucial settings can't be set except through `pushgp`

I have a suspicion this is deeply related to the problem I was complaining about in #125:

I'm trying to do some exploratory work with some scripts I've come across (and also some testing), and I realize they're not running very long. As in the number of steps they need to "work" is more than the global setting of @global-evalpush-limit, which is 150 steps (?!).

Now as I understand this, it's a global variable set in pushgp/globals, yes? So to affect the number of steps a program running in an interpreter takes, I need to change the global value beforehand and (assuming I don't want it permanently changed for subsequent runs) set it back manually afterwards?
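The set-then-restore workaround being described would look roughly like the sketch below. The atom step-limit is a hypothetical stand-in rather than the Clojush global, and the helper name is made up; note that mutating a shared atom this way is not safe if other threads are running programs at the same time.

```clojure
;; Sketch only: `step-limit` is a hypothetical stand-in atom, not the
;; Clojush global itself.
(def step-limit (atom 150))

(defn with-temporary-limit
  "Sets `limit-atom` to `new-limit`, calls the zero-argument function `f`,
  and restores the previous value even if `f` throws."
  [limit-atom new-limit f]
  (let [old @limit-atom]
    (reset! limit-atom new-limit)
    (try
      (f)
      (finally (reset! limit-atom old)))))

;; Inside the call the limit is raised; afterwards it is back to 150.
(with-temporary-limit step-limit 5000 (fn [] @step-limit))
```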

I'm looking at the list of globals in pushgp/globals.clj (right under the admonition "These definitions are used by Push instructions and therefore must be global") and have done some experiments:

global-atom-generators is used by code_rand only
global-max-points says it controls the size of things pushed to stacks, but it is only referenced by :code instructions that return newly constructed values (e.g., code_cons), so it doesn't affect the initial script at all (just checked by hand)
global-tag-limit this seems to be about the interpreter, not evolutionary search
global-top-level-push-code does actually affect the interpreter
global-top-level-pop-code ditto
global-evalpush-limit cannot be set or changed, except as a pushgp global
global-evalpush-time-limit ditto
global-pop-when-tagging ??? I think this is about the interpreter, not the search?
global-parent-selection actually does refer to genetic programming
global-print-behavioral-diversity ditto

The problem

As far as I can tell, the Push interpreter is only slightly intermingled with concerns about GP search, except for the global arguments above. This makes it more difficult to explore different kinds of search without dragging ontological junk over from earlier algorithms. It also makes testing much more difficult, since a lot of dependencies (pushgp as such) come along for the ride when one should only be concerned about instructions, stacks, input/output, tagging and so forth.

More importantly, the ability to parallelize the evaluation of Push programs (the slowest part of genetic programming, after all) is hindered by this leaky barrier between running and searching.

A refactoring path

I'd say this actually trumps the difficulties I noticed with reporting in pushgp. I'd like to use the interpreter on its own for several research-related tasks, and I can't imagine teaching people about GP without better separating the concerns here.

I'd like to try making a push interpreter that is a more self-contained construct, with its own persistent attributes that can be written and read by a mindful creator. I'd want to refactor this apart, not break anything. So it would involve keeping the same arguments there are now, but inserting into the interpreter new functionality that lets one speak directly to it as needed.

In other words, pushgp should see the same interface when getting and setting variables the interpreter should use, but there should be (at least for the interim) a parallel way of talking to a "bare" interpreter directly, without the intermediation of the clojush/globals path.

Does that make sense?

JSON structure could be a lot more compact

Disregarding key strings, could we please restructure the JSON output so that it doesn't say the generation for every individual? Something like this comes to mind:

[
  {
    "generation" : 1,
    "individuals" :
      [
        {«individual 1 here»},
        {«individual 2 here»},
        ...
      ]
  },
  {
    "generation" : 2,
    ...
  },
  ...
]

Why does defining a "duplicate instruction" throw an error?

Specifically this line throws an exception when an instruction is registered with the same symbol as a previously registered one.

Unfortunately, this seems to be the breaking point whenever the entire Clojush system is subject to test: the instructions (identical, quite literally) a0 and a2 are defined in both clojush.problems.boolean.mux-11 and clojush.problems.boolean.mux-6, and others will also probably crop up if I get past this problem.

I would much rather be able to test the entire system all at once, rather than selectively turning on and off conflicting badly-separated modules and hoping they don't hide other interactions and conflicts. Should I

change the instruction names in one or both of these conflicting problems
change the "duplicate" detector so that it actually fires only when an instruction is redefined with different value, not (as in this case) when the same key is assigned the same exact function value
adopt the behavior of most other Clojure libraries I've seen, in which the last-defined map item just wins, without raising a fuss at all (in other words, disable this error checking)

What is not immediately clear is why this throws an error at all. Any memories bubbling up?
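For reference, the second option (only complain when the value actually differs) could be sketched as below, with a hypothetical registry atom and register! function. One caveat worth stating: Clojure compares functions by identity, so two textually identical functions defined in separate namespaces are different objects and would still trigger the error under this check.

```clojure
;; Sketch only: `registry` and `register!` are hypothetical.
(def registry (atom {}))

(defn register!
  "Adds `f` under `name`. Re-registering the identical value is a no-op;
  registering a different value under an existing name throws."
  [name f]
  (swap! registry
         (fn [m]
           (if (and (contains? m name) (not (identical? (m name) f)))
             (throw (ex-info "Instruction redefined with a different value"
                             {:name name}))
             (assoc m name f)))))

(def a0 (fn [state] state))
(register! 'a0 a0)
(register! 'a0 a0) ; same function object: accepted silently
```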

Auto Documentation

It would be nice to have documentation that is built from docstrings. It would both serve as a reference point and encourage keeping docstrings helpful in the codebase.

add codox as dep and document how to run it
generate docs in travis and push to gh-pages branch for hosting

Linting Code

Since this code has been touched by many different authors, it has a variety of styles in it. In order to make it easier to dive into, it might be useful to try to standardize coding styles and transition the codebase to using "best practices" in the Clojure community.

I have been looking into using Eastwood as a linter. It would be run on every push in Travis CI or CircleCI.

Add requirement for Eastwood and document its usage in README
Add travis CI support and run Eastwood in it
Fix all current errors in linting
- Automatically upgrade ns :require forms
look into this other tool to see if it would be helpful: https://github.com/jonase/kibit/
also look at this tool: https://github.com/weavejester/cljfmt

New instructions?

Any advice for a file structure or convention for adding instructions to the "standard suite" vs problem-specific ones?

I have a mind to bring over some of the stack-focused instructions from PushSwift and Nudge, especially the [stack_name]_switch and [stack_name]_filter families of instructions, and I'm not sure how to structure the use appropriately.

Add value for `:is-random-replacement` for initial population

When we added :is-random-replacement to the logged output (a00bde2) I didn't notice that that field doesn't have a value for the initial population, which means we have an empty value ("...,,...") in the CSV outputs for the initial population. That's not fatal, but it does mean we'll need to "special case" that when processing those CSV files.

If we were to put a value there, though, it's not obvious to me what would be the "right" value. Should it be true (it is randomly generated) or false (it's not a random replacement)? Anyone have any thoughts on this?

resolving incorrect keys/args to pushgp

If the keys/args are not recognized, print them to an error file. This should help us identify typos.
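A check of this kind is a small set difference. In the sketch below, known-keys is a tiny illustrative subset rather than the real list of pushgp arguments, the function names are hypothetical, and the output goes to standard error instead of an error file.

```clojure
(require '[clojure.set :as set])

;; Sketch only: `known-keys` is a hypothetical subset for illustration.
(def known-keys #{:error-function :population-size :max-generations})

(defn unrecognized-keys
  "Returns the set of keys in `argmap` that are not in `known`."
  [argmap known]
  (set/difference (set (keys argmap)) known))

(defn warn-unrecognized!
  "Prints any unrecognized keys to *err* and returns them."
  [argmap known]
  (let [unknown (unrecognized-keys argmap known)]
    (when (seq unknown)
      (binding [*out* *err*]
        (println "Unrecognized pushgp arguments:" unknown)))
    unknown))

;; The misspelled :populaton-size is reported; :population-size is not.
(warn-unrecognized! {:population-size 50 :populaton-size 100} known-keys)
```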

There is one Clojure struct defined

I was trying to build the convenience function I mentioned in #148 and have to ask if there's a particular reason that push-state is a struct instead of a record.

Undelete Number IO problem

Lee somehow deleted the Number IO problem in commit b7144fa when updating the version number.

I'll work on undeleting it.

I guess Lee needs that shell script after all :P

Files not actually moved out of clojush/pushgp

It appears that when Kyle moved some files out of clojush/pushgp/, he only copied the files out without actually deleting them from the repository. I assume this was a mistake - there's no reason to have these files in clojush/ and clojush/pushgp/, correct? If not, they should be deleted from clojush/pushgp/, which I can do.

argmap_regression : Clash of the Namespaces

pushgp-map already refers to: #'clojush.pushgp.pushgp/pushgp-map in namespace: clojush.experimental.pushgp-map

How do I record multiple "errors" for one I/O pair?

I'm building a few of my nasty test suite of problems, mainly to get a better hands-on feel for actually writing them from scratch and running them.

In at least one case, the desired result is an integer, and I want to minimize both the number of digits in the integer and the number of non-01 digits. How would I go about writing :error-function to do that?
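Since the README describes the error function as returning a list of errors, one possible approach is to contribute two entries per I/O pair. The sketch below assumes the program's integer output for each case has already been extracted; digit-errors and errors-for-outputs are hypothetical names.

```clojure
;; Sketch only: assumes the program's output for each case is already
;; available as an integer.
(defn digit-errors
  "Returns two errors for one integer: its number of digits and the
  number of digits other than 0 and 1."
  [n]
  (let [digits (str (Math/abs (long n)))]
    [(count digits)
     (count (remove #{\0 \1} digits))]))

(defn errors-for-outputs
  "Concatenates the per-case error pairs into one flat error list."
  [outputs]
  (mapcat digit-errors outputs))

;; 1021 has four digits, one of which (2) is not 0 or 1.
(errors-for-outputs [1021 7])
```

Whether the selection method treats the two entries sensibly depends on how errors are aggregated, so this is a starting point rather than a recommendation.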

Where is the convenience function for `make-pushstate-with-specified-stacks`?

I can't believe there isn't one tucked away somewhere. I don't want to create a redundant one.

Add Additional CSV Functionality

Add functionality to CSV printing that makes it possible to optionally print the following info for each individual:

:parent-indices :push-program :plush-genome :push-program-size :plush-genome-size :total-error :test-case-errors

Clarify shell and REPL commands' treatment of underscores vs hyphens

I notice (after getting enigmatic errors) that the name of a clojure file that has underscores is (at least sometimes) "supposed" to have hyphens instead when you invoke it from lein run or inside a REPL with (use 'thingie-with-hyphens).

That should be made explicit.
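The underlying rule is Clojure's own: namespace names use hyphens, while the corresponding file names on disk use underscores. A small sketch of the mapping follows; the function name ns->relative-path is made up for illustration.

```clojure
(require '[clojure.string :as str])

(defn ns->relative-path
  "Maps a namespace symbol to the relative source file path Clojure loads
  it from: hyphens become underscores and dots become directory separators."
  [ns-sym]
  (-> (str ns-sym)
      (str/replace "-" "_")
      (str/replace "." "/")
      (str ".clj")))

(ns->relative-path 'clojush.problems.demos.simple-regression)
```

So the hyphenated form is what goes after lein run or inside (use '...), and the underscored form is what appears in the file system.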

`clojush.problems.synthetic.order` (maybe others) shouldn't be simplified by default

To be frank, this is more a facet of "what are the individual problems for exactly?"

I'm just running all of them in turn by hand and watching, and I find several in synthetic that are interesting number-ordering tasks which get solved... and then the default "simplify" system comes along at the end and makes the "simpler" answers nonsensical. For instance, the order task is described as "put the positive numbers in front of their negative complements", and that works, and then automatic simplification comes along and deletes all the negative numbers. Not violating the terms of the task at all, of course.

In hindsight, I guess this is "they were written before simplification was a thing", but...?

Practically speaking maybe this one should have simplification turned off?

pushgp report breaks backward compatibility

Hi All,

The current version of clojush.pushgp.report/report breaks backwards compatibility. Although report has multiple definitions for different arguments, the order of arguments has become inconsistent with old code. My local fix is just to revert to the pre-JSON/CSV code, but I suggest either reorganizing arguments so backwards compatibility is maintained or doing a major version bump (especially because they seem like useful features).

Cheers,
Kyle

Add Higher-Order Function Instructions

It would be cool to have higher-order function instructions like map, reduce, and filter in Clojush. Maybe they'd only be defined on the array-like types (string, vector_integer, vector_boolean, etc.), or maybe they'd somehow be defined on other stacks as well.

The details are vague, and there are some non-trivial things to figure out, but if anyone wants to take this on it would be cool!

JSON datafiles are malformed?

For some reason, the JSON logs I've been working with are comprised of multiple Array objects, which I'm pretty sure is a violation of the spec. I'm seeing:

[
  {individual},
  {individual},
  ...
  {individual}
],
[
  {individual},
  {individual},
  ...
  {individual}
],
... and so on

In other words, it has the structure [1,2,3],[4,5,6]..., which can't be parsed by any of the readers I've come across. These expect a single root object.

Can this please be wrapped in square brackets so it's a valid object? That said, see the subsequent issue, which is a request for a more rational structure.

What stacks need to be explicitly passed into `registered-for-stacks`?

Continuing a hands-on exploration, and confounded trying to understand (even from examples) which stacks can and should be passed into pushstate/registered-for-stacks in setting up a problem-specific file.

So I suspect that :code and :vector_boolean and :vector_integer types and instructions may be handy for this task, but I notice no examples where anybody has included those. And the docstrings don't seem clear whether the set of instructions generated by registered-for-stacks is the strict subset of those that use the specified types as inputs, or those which in any way include the specified types.

So boiling this down:

do I need to include :code?
what if anything do I need to do in order to get :vector_integer instructions?
do I need to specify boolean ERCs? I don't see that, either

`code_rand` fails with a warning during execution

Over at the ongoing experiment I've got some random code generation and the interpreter nominally working, still as a preliminary for anything interesting. In the course of making and running some random Push programs, I've noticed that the code_rand instruction barfs when executed.

It apparently wants @global-atom-generators not to be an empty list if it's going to work. I can see why, since it seems to want it to contain a collection of random-code generators instead of an empty (atom {}) definition.

That said... nobody in the codebase ever puts anything into @global-atom-generators.

Does having the default for revert-too-big-child being :parent create issues?

The current default behavior for revert-too-big-child is :parent, so if an offspring exceeds the size limit it is replaced by its (first) parent. This smells dangerous to me, but perhaps because I'm used to working with "standard" tree-based GP. In that kind of system if you have overly large offspring replaced by their parent, you create a bias that favors large fit-ish trees that are near the size limit. Their offspring are then likely to be too large, which means that a copy of that parent will be made with reasonable probability, depending on the fitness of the parent and the odds of the child being too big. Since the copy has the same fitness and size, it also has this property and will propagate, as will its copies, etc., etc.

This may be less of an issue with Push, since code size/growth seem at the very least different in Push, and I don't have great intuition about how those things play out in this space, so feel free to tell me I'm barking up a totally silly tree here. Thanks!

When writing reports, an example should explicitly say where they are being written

BTW, where are they being written? I'm currently running one that supposedly saves JSON and CSV, and have no idea what $PATH it's using. Root? Local directory? Some conventional directory I don't know about?

Optimize using type hints

Optimize using type hints and other approaches frequently discussed on [email protected].

This was in the development.md file, which I am about to delete, since the repo isn't the right place for this list.

:parent should be removed from individuals

I did some digging, and the :parent field was added in 2012 when Lee first introduced parent reversion in Clojush. Back then, the reversion didn't happen until well after breeding. Since then, parent reversion has been moved to the breeding phase and no longer relies on the :parent field. I didn't find anything else that used it, besides some code that discarded the parents after reversion because they took up unnecessary space.

Anyway, I'm going to go ahead and get rid of the parent field everywhere. As discussed in lab, the history and ancestry fields will stick around in case we need them in-run.

How do I invoke `lein run my.example` so it uses more of my copious CPU?

I just moved my little learning-by-doing example over to the Big Server machine in our house, and invoked it (as I had on my laptop) with lein run clojush.problems.tozier.winkler01. But it's only using a single core on the big multicore CPU. Is this a JVM issue, a lein issue (and if so, please point out how to fix), or is there an argument somewhere in the codebase I should be toggling?

Move `make-individual` from genetic operators to breed

I noticed that the end of every single genetic operator looks the same -- it calls make-individual, setting some of the fields. But, every one is the same! This seems like something that should be moved to breed, and have the genetic operators just return the Plush genome, not an individual. I will have to check and make sure that none of the operators are called from other places outside of breed, but otherwise this should be a clean refactoring that will simplify some things nicely.
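A rough sketch of what that refactoring could look like, using simplified stand-ins rather than the actual Clojush definitions. The names `uniform-deletion-genome` and `breed-one`, and the fields set on the child, are assumptions made for illustration only.

```clojure
;; Hypothetical, simplified stand-ins; not the Clojush definitions.

(defn uniform-deletion-genome
  "A genome-level operator: takes a genome, returns a genome.
  It knows nothing about individuals."
  [parent-genome keep?]
  (vec (filter keep? parent-genome)))

(defn breed-one
  "Applies a genome-level operator to the parent's genome and wraps the
  result in an individual map exactly once, in one place."
  [operator parent & args]
  {:genome       (apply operator (:genome parent) args)
   :parent-uuids [(:uuid parent)]
   :uuid         (java.util.UUID/randomUUID)})

(def example-child
  (breed-one uniform-deletion-genome
             {:genome [1 2 3 4] :uuid "parent-1"}
             even?))
```

The design point is that each operator becomes a pure genome-to-genome function, so the bookkeeping fields are set in a single place.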

Eliminate Side Effects When Loading Files

When we try to run lein check or lein doc it raises a bunch of exceptions, because the instruction files modify global state and the problem files actually try to run the problems, when they are loaded.

To fix this, it might make sense to eliminate global state and remove side effects from loading files.

on `atom-generators` and biases

One point of "manglish resistance" I've just identified in the "hard" problem I've been setting up is the unexpected way atom-generators is defined in many problems I've examined.

Intuitively—that is, from the stance of a new user examining the examples to emulate their "best practices"—setting up the atom-generators list should in some sense preserve the visible probabilities they imply. So for example in many cases I've looked at, the resulting probability that an input or ERC is created is no higher than the probability that any given instruction is used.

I understand that evolutionary search can "fix" any shortfall, and that one doesn't want to "bias" the search towards obvious needs, but I can't imagine it's a good thing for random mutation (for example) to tend towards the elimination of ERCs in all cases, which is the case in several problem setups I've looked at.

Here's an example from the digits problem

(def digits-atom-generators
  (concat (list
            \newline
            ;;; end constants
            (fn [] (- (lrand-int 21) 10))
            ;;; end ERCs
            (tag-instruction-erc [:integer :boolean :string :char :exec] 1000)
            (tagged-instruction-erc 1000)
            ;;; end tag ERCs
            'in1
            ;;; end input instructions
            )
          (registered-for-stacks [:integer :boolean :string :char :exec :print])))

In other words: if there's only one copy of each input or ERC function (as is true here), and hundreds of instructions (about 127 in this case), then the actual ERCs and inputs will be scarce to begin with and will tend to be eliminated by mutation.

alternative:

I'd suggest an approach that takes a list of collections, picks one of the root items with equal probability, and then samples from within it. So a list like the one above would produce 1/6 newlines, 1/6 inputs, 1/6 instructions, and so forth.
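To make the proposal concrete, here is a small sketch of two-level sampling next to the per-item probabilities under the flat and grouped schemes. All names (`sample-by-group`, `example-groups`, and the placeholder items) are hypothetical, and the 127 instructions are represented by plain integers.

```clojure
(defn sample-by-group
  "Picks a group uniformly at random, then an item uniformly within it."
  [groups]
  (rand-nth (vec (rand-nth (vec groups)))))

(defn flat-probability
  "Chance of drawing any one item when all groups are concatenated."
  [groups]
  (/ 1 (count (apply concat groups))))

(defn grouped-probability
  "Chance of drawing one particular item of `group` under two-level sampling."
  [groups group]
  (/ 1 (* (count groups) (count group))))

;; Six groups mirroring the digits example: one constant, one ERC,
;; two tag ERCs, one input, and 127 placeholder instructions.
(def instructions (vec (range 127)))
(def example-groups
  [[\newline] [:integer-erc] [:tag-erc] [:tagged-erc] ['in1] instructions])

(def in1-flat    (flat-probability example-groups))
(def in1-grouped (grouped-probability example-groups ['in1]))
(def one-instruction-grouped
  (grouped-probability example-groups instructions))
```

Under these assumptions the input goes from a 1-in-132 chance to a 1-in-6 chance, while each individual instruction drops to 1 in 762.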

Should the JVM settings in project.clj be in a separate file?

I might be completely up a tree here, but I find the section in the default project.clj with all the commented out JVM options at best awkward. Having commented out options like that encourages people to include even more commented out lines, and makes managing things like commits difficult. If I think my project needs the 12gb option and I make that change, I presumably want to commit it to my fork. But then I probably don't want to try to push that change up to this "master" repo, which potentially requires some cherry picking if I make some change in that file that I do want/need to pass along.

Being fairly new to the codebase, I don't have a great sense of how often project.clj changes. I'm guessing parts, like the description and license, change quite rarely. Other parts, like the JVM options and dependencies, seem like things one might fiddle with quite a lot while getting a project up and running. If I'm right about that, maybe those bits that change a lot could/should be pulled out into separate files so they can be modified and managed separately. These could then be pulled in by project.clj in one place that then doesn't change.

Alternatively could/should those settings go in your project specific file, like simple_regression.clj?

Does that make any sense?

Catalog of instructions?

Documentation question: Is there a centralized catalog of instructions, and which libraries ("bundles"?) they're associated with?

max-points and max-points-in-initial-program

I just noticed one artifact from moving to Plush genomes that isn't really cleaned up properly: the push-gp args :max-points and :max-points-in-initial-program use the language of "points", which refers to sizes of Push programs, not Plush genomes. Yet they are used to determine the max sizes of Plush genomes during initialization and genetic operators.

We could simply make it clearer by calling them :max-genome-size and :max-genome-size-in-initial-program. But that isn't a perfect solution, since :max-points is used in other places to limit sizes of Push code. In particular, it is used in code stack instructions to ensure that code there doesn't grow exponentially.

I guess we could split these into two separate arguments, but that seems inelegant. Any other ideas? Should we just leave it the way it is, or just rename it?

Dependencies in project.clj seem out of date?

I note this sets it up to use only clojure 1.5.1, and later on the midje dependency is out of date as well. I've been using 1.7 in all my recent work, and I'm concerned some functionality may be missing as a result. I really don't know which if any of the other packages might also be running behind.

Is there a "best practice" for keeping these up to date when managing a clojure project like this?

Inconsistent approach to empty vector_* types

I'm working through the tests for the enumerator type, and noticing that the recognizers for vector instances are a bit concerning. Mainly I am wondering about edge cases, for instance one in which vector_integer_rest is applied to an argument with exactly one element. The result is not pushed to :exec, but rather directly to :vector_integer, and I just checked this with this passing test:

(def vi-state-with-1-element (push-state-from-stacks :vector_integer '([77])))

(facts "vector_integer_rest can produce an empty vi result"
  (top-item :vector_integer (execute-instruction 'vector_integer_rest vi-state-with-1-element)) => []
  )

I assume there are some other places where a vector_* could produce an empty vector as well.

This is not in itself a problem, but now look again at the interpreter's recognizers. These do not push empty vectors to any stack.

I ask because I'm about to write code for moving the pointer through an Enumerator's stored seq, and I need to have a better sense of whether you want empty vectors to be present on the stacks before I write tests for Enumerators that behave right when they contain empty vectors.

I've already written the code which "unwraps" an Enumerator object and pushes the seq it contains onto :exec, but I did not think before now to check whether an empty seq would be recognized by the interpreter. I imagine it will raise an exception.

Broken examples

lein run clojush.problems.classification.intertwined-spiral fails with error stemming from obsoleted instr-paren-requirements.

grep revealed the same problem in dsoar, but trying that failed for another reason first:

lein run clojush.problems.control.dsoar fails with a name clash on the symbol recognize-literal

Add genome diversity to Clojush logs

Also, change Number of Unique Programs into a [0,1] syntactic measure of diversity.

Broken example: calc

lein run clojush.problems.software.calc produces

Number of tests: 10
Behaviors: [{:old-tests [[[:zero] 0.0 false] [[:one] 1.0 false] [[:two] 2.0 false] [[:three] 3.0 false] [[:four] 4.0 false] [[:five] 5.0 false] [[:six] 6.0 false] [[:seven] 7.0 false] [[:eight] 8.0 false] [[:nine] 9.0 false]], :new-tests [], :augment-fn #<calc$fn__3822 clojush.problems.software.calc$fn__3822@47ca32f7>} {:old-tests [], :new-tests [], :augment-fn #<calc$fn__3824 clojush.problems.software.calc$fn__3824@78d15e01>} {:old-tests [], :new-tests [], :augment-fn #<calc$fn__3831 clojush.problems.software.calc$fn__3831@7af32927>}]
Exception in thread "main" java.lang.AssertionError: Assert failed: Argument key :ultra-mutates-to-parentheses-frequently is not a recognized argument to pushgp.
(contains? (clojure.core/deref push-argmap) argkey)
    at clojush.pushgp.pushgp$load_push_argmap.invoke(pushgp.clj:134)
    at clojush.pushgp.pushgp$pushgp.invoke(pushgp.clj:257)
    at clojush.core$_main.doInvoke(core.clj:38)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.lang.Var.invoke(Var.java:415)
    at user$eval5.invoke(form-init886434695797924965.clj:1)
    at clojure.lang.Compiler.eval(Compiler.java:6619)
    at clojure.lang.Compiler.eval(Compiler.java:6609)
    at clojure.lang.Compiler.load(Compiler.java:7064)
    at clojure.lang.Compiler.loadFile(Compiler.java:7020)
    at clojure.main$load_script.invoke(main.clj:294)
    at clojure.main$init_opt.invoke(main.clj:299)
    at clojure.main$initialize.invoke(main.clj:327)
    at clojure.main$null_opt.invoke(main.clj:362)
    at clojure.main$main.doInvoke(main.clj:440)
    at clojure.lang.RestFn.invoke(RestFn.java:421)
    at clojure.lang.Var.invoke(Var.java:419)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.Var.applyTo(Var.java:532)
    at clojure.main.main(main.java:37)

Delete fast-lexicase-selection

I'll do this soon.

Remove limit on fields printed in CSV reports

In csv-print there is a filter that significantly limits the set of output fields that can be printed:

(let [columns (concat [:uuid]
                      (filter #(some #{%} csv-columns)
                              [:generation :location :parent-uuids :genetic-operators :push-program-size :plush-genome-size :push-program :plush-genome :total-error]))]

@thelmuth thinks this may have been done to ensure that the columns were printed in the same order. It's quite limiting, though, and it would be nice if we could add other fields on the command line without having to alter this code.

One option is to simply remove the filter, and always use whatever order is provided in the runtime configuration.

Another option would be to keep the filter to maintain the order of the "standard" fields, and then concat on the remaining fields, perhaps sorted by alphabetical order on the keyword.

The first would be simpler and less confusing to people coming to the code, but the second wouldn't be hard if people would prefer that approach.
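As an illustration of the second option, the sketch below keeps the "standard" fields in their fixed order and appends any other requested fields sorted by keyword. The names `standard-columns` and `ordered-columns` are hypothetical, and this is not the existing csv-print code.

```clojure
;; The fixed order used for the "standard" fields, taken from the snippet above.
(def standard-columns
  [:generation :location :parent-uuids :genetic-operators
   :push-program-size :plush-genome-size :push-program
   :plush-genome :total-error])

(defn ordered-columns
  "Returns :uuid, then the requested standard columns in their fixed order,
  then every other requested column sorted alphabetically by keyword."
  [csv-columns]
  (let [requested (set csv-columns)
        standard  (filter requested standard-columns)
        known     (conj (set standard-columns) :uuid)
        extras    (sort (remove known requested))]
    (vec (concat [:uuid] standard extras))))

(def example-columns
  (ordered-columns [:total-error :zeta :generation :age]))
```

Here `:age` and `:zeta` stand in for arbitrary extra fields supplied on the command line.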

Associate doc strings with all Push instructions

At the moment almost all instructions are defined using a macro called define-registered which does a bunch of stuff to register the instructions in a global catalog and so forth.

Also at the moment, none of the actual instructions are documented in a useful way; rather, some of them have comments.

It makes a kind of sense to (a) change the define-registered macro so it takes a string argument and socks it away in the metadata, (b) go through the entire codebase and fix the strings.

This probably touches on things mentioned in #122 and #147.
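A possible shape for step (a), sketched with hypothetical names (`define-documented`, `documented-instructions`) rather than by changing the real define-registered macro. It only shows how a doc string could be stored both in a catalog and in the var's metadata.

```clojure
;; Hypothetical catalog mapping instruction symbols to their doc strings.
(def documented-instructions (atom {}))

(defmacro define-documented
  "Defines `instruction`, records `docstring` in the catalog, and attaches
  it as :doc metadata so (doc ...) and other tooling can find it."
  [instruction docstring definition]
  `(do
     (swap! documented-instructions assoc '~instruction ~docstring)
     (def ~(with-meta instruction {:doc docstring}) ~definition)))

;; Example use with a placeholder instruction that leaves the state alone.
(define-documented example_noop
  "Does nothing; returns the Push state unchanged."
  (fn [state] state))
```

Keeping the string in both places is a design choice for the sketch: the catalog makes it easy to generate a list of all instructions, while the metadata works with ordinary Clojure tooling.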

Remove (make-individual) from Evaluate

@lspector Do you know why make-individual is called in evaluate-individual, instead of just assoc'ing the error values onto the individual passed to the evaluate-individual function? This is inconvenient in that it loses other data associated with that individual unless they are explicitly carried through.

If not, I'd suggest changing it to assoc.
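The difference can be sketched with plain maps. Everything here is a simplified assumption: the function name `evaluate-with-assoc`, the `:program`, `:errors`, and `:total-error` keys, and an error function that takes a program and returns a vector of numbers.

```clojure
(defn evaluate-with-assoc
  "Adds error information to the individual that was passed in, so any
  other keys it already carries survive evaluation."
  [individual error-function]
  (let [errors (error-function (:program individual))]
    (assoc individual
           :errors errors
           :total-error (reduce + errors))))

(def evaluated
  (evaluate-with-assoc
   {:program '(1 2 integer_add) :uuid "abc" :location 7}
   (fn [_program] [0 2 1])))   ; placeholder error function
```

Because the result is built from the original map, fields such as `:uuid` and `:location` do not need to be carried through explicitly.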

Track Individual's Locations

In order to efficiently gather ancestry info, I suggest adding location and parent-locations to individuals. The location will just be the individual's index in the population, where parent-locations will be a vector containing the individual's parents' indices in the population.
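A small sketch of how these two fields could be filled in, assuming individuals are plain maps held in a vector. The helper names `assign-locations` and `child-of` are hypothetical.

```clojure
(defn assign-locations
  "Stores each individual's index in the population under :location."
  [population]
  (vec (map-indexed (fn [i individual] (assoc individual :location i))
                    population)))

(defn child-of
  "Builds a child that records the locations of its parents."
  [genome parents]
  {:genome genome
   :parent-locations (mapv :location parents)})

(def located
  (assign-locations [{:genome [:a]} {:genome [:b]} {:genome [:c]}]))

(def example-child
  (child-of [:a :c] [(nth located 0) (nth located 2)]))
```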

Please formally specify numeric type handling rules

So I'm trying to write tests for the Interpreter "routing" system, which is the cascade of recognizers that send items on the :exec stack to the appropriate other stacks.

At the moment this is the code that handles that, but the behavior of particular programs and the results of calculations are also filtered through this kind of ad hoc thingie, and as a result there are some non-obvious consequences for numeric values calculated in the course of running a Push program.

So as far as I can see, more or less every function that "returns an integer" explicitly calls keep-number-reasonable, so the de facto type we call a Push integer is a kind of truncated thing. Clojure will happily treat arithmetic (and other) numeric results as either boxed (fixed precision) or unboxed (arbitrary precision) values, and it looks also as if the basics of Clojush arithmetic uses arbitrary precision for results, but then inevitably applies keep-number-reasonable to that. Similarly, float values are protected by keep-number-reasonable from both the same overflow and also underflow—anything within some small $\epsilon$ of 0.0 is rounded down to 0.0.

But it is also true that if a float result is larger than max-number-magnitude, then keep-number-reasonable will convert it to an integer. You don't have to look at the tests I'm writing that surfaced this problem, but can see this in the code itself. Whenever the value of max-number-magnitude is the default value of 1000000000000, then the result of (float_mult 10000000.0 10000000.0) (or any larger float result) will be dropped down to the integer value 1000000000000.

And because the result of keep-number-reasonable is used inline within numeric instructions of several types, the integer will be present on the float stack as a result of float_mult. I suspect at least a few more of these are lurking in there for transcendental and exponential functions as well.

So what is keep-number-reasonable really supposed to do? Can we be a bit more rigorous about its goal? For instance, can we pick a maximum precision for any Push integer value? I expect the origin of the truncation kludge is something about the way Clojure handles numeric data type arithmetic, and that you maybe were getting exceptions when BigDecimal results had infinite representations, but pure Java float primitives were overflowing and raising exceptions too?

But what do you want it to do, really? Besides "not raise exceptions"? At the moment it's injecting subtle and confusing ad hoc changes to running code. And there are no such checks and truncations on the "routers", so a program that contains 2000000000000000000000 as a value would be perfectly runnable.
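One candidate rule, sketched only to make the question concrete: clamp each value within its own type, so a float stays a float and an integer stays an integer. The function name `keep-number-reasonable-typed` and the `min-number-magnitude` value are assumptions; `max-number-magnitude` uses the default quoted above.

```clojure
(def max-number-magnitude 1000000000000)  ; default value quoted above
(def min-number-magnitude 1.0E-10)        ; assumed threshold for this sketch

(defn keep-number-reasonable-typed
  "Clamps n to +/- max-number-magnitude without changing its numeric kind.
  Floats very close to zero are replaced by 0.0."
  [n]
  (if (integer? n)
    (cond
      (> n max-number-magnitude)     max-number-magnitude
      (< n (- max-number-magnitude)) (- max-number-magnitude)
      :else                          n)
    (let [limit (double max-number-magnitude)]
      (cond
        (> n limit)                                          limit
        (< n (- limit))                                      (- limit)
        (< (- min-number-magnitude) n min-number-magnitude)  0.0
        :else                                                n))))

;; The float_mult case from above now stays on the float side.
(def clamped-float (keep-number-reasonable-typed (* 10000000.0 10000000.0)))

;; An oversized integer literal is clamped to an integer.
(def clamped-int (keep-number-reasonable-typed 2000000000000000000000))
```

Applying the same function in the "routers" would also cover literals that appear directly in a program, though whether that is desirable is exactly the open question.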

Report writing as a single concern

I see some activity on particular aspects of "reporting", for instance in #102. But inside the pushgp loop is a lot of intermingled and overlapping reportage.

I'd like to sketch a (big) refactoring, which aims to extract all of reporting from the search function. Completely.

As I watch the various reports pile up in a file or in STDOUT, most look like fossilized one-off experiments: stuff that was written for a kind of visible feedback while certain parts of the search functionality were being written and reconceived, but which have never been removed or shifted to a self-contained function.

My refactoring roadmap feels something like this, depending on your feedback about Clojure idioms and stuff:

1. Review the current "reports" and classify them into "feedback", "annotations" and "data":
   - feedback will be "reports" aimed at a watching user, including warnings, faults, running counts, and general progress indicators
   - annotations are things like "last Github hash" and "problem name" (I'm hoping that's already in there!), which should technically be assumed to be metadata about a given run; these will probably appear repeatedly as headers or part of a template in most other reports
   - data is the stuff you're used to puking to screen or saving to files, which might reasonably be expected to persist and explain: fitnesses, answer scripts, times, that sort of thing
2. Create a push-report map, much like argmap or make-push-state (as I understand them): it will scour and merge the command line arguments, config file settings, and defaults to generate a standardized suite of reports.
3. [actual functionality change] Tease apart the notion of a "report" so that it becomes a function which is applied to a dataset. Some of the current "reports" print data and summary information together, some use a lot of extra verbiage where a word would do, some try to include the kitchen sink. Instead of that, I would rather see push-report create both the content and the view for every report: that is, if data collection is the goal, that data will be saved in a machine-readable format that can be shared between reports, but also a "view document" will be created when the "report" is made which can be used to visualize or summarize the data in the desired way. The model in my mind is CouchDB at the moment, but that's not important: think of every invocation of push-report as going through this sort of cascade:

for each active report definition (based on config state, at this moment):
  write the view document if it doesn't exist
  create the datastore if it doesn't exist
  write the cached data
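The cascade above could be sketched in memory like this, with a plain map standing in for the datastore and view documents. The names `run-report` and `run-all-reports`, and the `:id`, `:view`, and `:collect` keys of a report definition, are all hypothetical.

```clojure
(defn run-report
  "Applies one report definition to the store:
  keeps an existing view or writes the given one, creates the data
  vector if it is missing, then appends the newly collected data."
  [store {:keys [id view collect]} run-state]
  (-> store
      (update-in [id :view] #(or % view))
      (update-in [id :data] #(conj (or % []) (collect run-state)))))

(defn run-all-reports
  "Runs every active report definition against the current run state."
  [store report-definitions run-state]
  (reduce #(run-report %1 %2 run-state) store report-definitions))

;; One report that records the best error of each generation.
(def definitions
  [{:id :best-error :view "line-chart" :collect :best-error}])

(def store-after-two-generations
  (-> {}
      (run-all-reports definitions {:generation 0 :best-error 5})
      (run-all-reports definitions {:generation 1 :best-error 3})))
```

A real version would write to files or a database instead of returning a map, but the separation between collected data and the view that presents it is the same.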

The point of separating the concerns of reporting from search should be obvious. The point of separating the concerns of data collection from visualization is to better foster flexibility and reuse, and to permit "live" viewing by an engaged researcher. A generic "view document" could be as simple as a web page that exactly duplicates the current STDOUT puke, to a dashboard that shows how many evaluations have happened and how well they're doing, to a detailed exploratory tool that does calculations on the current data store when opened to display desired visualizations.

And yes, you should think of a "view document" as a "web page" running a trivial d3 widget or something like that.

consequences

The only negative consequence I can see is that when producing a "report" of the sort currently extant, you'd have to actually use the "view document" framework instead of println. In return, you'd have the ability to define generic reports, re-use reports between projects, build new functionality (like new search operators or selection schemes) with a mind to actually seeing what is changed instead of just sort of letting the muse tell you, and so on.

Also: you'd be able to monitor a run remotely, in realtime, without a bunch of scrolly words: with decent, humane communicative charts.

Also also: this is the camel's nose under the tent for a web-based project setup.

Deploy tags to clojars

We want to help Lee with doing releases. This is how releases work:

1. Generate a new version number
2. Update some files with the new version
3. Add a git commit for this version
4. Add a git tag to this version
5. Push these to GitHub
6. Deploy the code to Clojars

We could easily tackle step 6 through Travis.

The other steps are a bit more complicated. It is certainly possible to do all of that in Travis; it is just a matter of figuring out when and how we want to do it all.

For example, we could bump the version after every commit to master automatically. The problems I see with this are:

How does it know whether to do a minor or major bump? Maybe just have a minor bump by default, then some way to enable a major bump.
We have to be wary of race conditions here. For example, let's say Travis pulls a commit and all tests pass, and then it goes to add a new version commit and push that. But what if someone else pushes a new commit halfway through that process? Then Travis won't be able to cleanly push the bump version commit. Maybe that is fine, since it won't happen very often, but it is something to think about.
I am not sure if you want to bump a new version after every single commit.
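The "minor by default, major on request" idea can be sketched as a small pure function. It assumes a three-part `major.minor.patch` version string, which may not match the project's actual scheme, and the name `bump-version` is hypothetical.

```clojure
(require '[clojure.string :as str])

(defn bump-version
  "Returns the next version string. `level` is :major, :minor, or :patch
  and defaults to :minor. Assumes a major.minor.patch version string."
  ([version] (bump-version version :minor))
  ([version level]
   (let [[major minor patch] (map #(Long/parseLong %)
                                  (str/split version #"\."))]
     (case level
       :major (str (inc major) ".0.0")
       :minor (str major "." (inc minor) ".0")
       :patch (str major "." minor "." (inc patch))))))

(def default-bump (bump-version "1.2.3"))
(def major-bump   (bump-version "1.2.3" :major))
```

How the build would learn that a major bump is wanted (a commit message marker, a manual trigger, or something else) is left open, as in the discussion above.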

We would store credentials securely in the travis CI file and pick them up like this: https://github.com/technomancy/leiningen/blob/master/doc/DEPLOY.md#credentials-in-the-environment.

Need help writing some midje tests for an error-function

I'd like to write the requisite tests for my error function in problems/tozier/winkler01.clj but I have only very sketchy ideas how to gather together the arguments and exercise functions elsewhere in the code.

So the basic approach for any given fact seems like:

load my file with the error-function defined in it
create a program with a plannable outcome; for example, one that produces a 0 error score, one that produces a positive error score, one that doesn't give an answer, one that's empty
apply the error-function to the program and compare the result to the expected result

Problem is, I can't suss out:

What do I do to create the argument for an error-function?
What is the structure returned by a typical error-function?
What do you "normally" do when a calculation might throw an exception? For example, in winkler01 I would expect the function proportion-not-01 to throw a Divide by 0 error if it were ever handed a non-numeric argument. So what's the idiom for handling that in the code, and what's your codebase's standard? (not necessarily the same thing)
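For the last question, one common Clojure idiom is to catch the exception and substitute a penalty value. The sketch below is offered only as an illustration, not as the codebase's standard; `safe-error` and `inverse` are hypothetical, and `inverse` merely stands in for a calculation that can throw.

```clojure
(defn safe-error
  "Returns (f x), or `penalty` if evaluating it throws an exception."
  [f x penalty]
  (try
    (f x)
    (catch Exception _
      penalty)))

(defn inverse
  "Placeholder calculation: throws on 0 and on non-numeric arguments."
  [x]
  (/ 1 x))

(def ok-result        (safe-error inverse 4 1000))
(def zero-result      (safe-error inverse 0 1000))
(def non-numeric-result (safe-error inverse "a" 1000))
```

A test for an error function could then check the three kinds of outcome listed above (a normal score, a penalty for a failed calculation, and a penalty for a missing answer) with plain equality checks.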