hardmaru / estool

Evolution Strategies Tool

License: Other

Python 34.95% Jupyter Notebook 65.05%

estool's Issues

Hi Hardmaru,
We would be interested in running ES-Tool (thanks for creating it!) on a cluster of machines on AWS. Would you be interested in a pull request if we implement this, and would it be OK with you if we use Star Cluster (http://star.mit.edu/cluster/index.html) as the cluster management tool?

Thanks & kind regards
Ernst

raise NotImplementedError

Hello!

I'm running Python 3.6 via Miniconda on macOS 10.14.

versions:

gym: 0.10.4
cma: 2.6.0
pybullet: 2.2.2

I'm able to run the BipedalWalker-v2 env on its own with a mini test script:

import gym
env = gym.make('BipedalWalker-v2')
env.reset()
for t in range(1000):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample()) # take a random action
    if done:
        break

So given the above, I know gym and the env are working fine.

Also I'm able to train with the estool just fine using the bullet racecar env. I'm also able to run the trained model from bullet racecar.

What does not work is training with the BipedalWalker-v2 env.

Here is the command I run: python train.py biped -n 1 -t 2
(same results for when n=8 and t=4)

And here are the results:

(py36) Keiths-MacBook-Pro:estool keithgould$ p train.py biped -n 1 -t 2
pybullet build time: Sep 26 2018 10:51:37
current_dir=/Users/keithgould/miniconda3/envs/py36/lib/python3.6/site-packages/pybullet_envs/bullet
['mpirun', '-np', '2', '/Users/keithgould/miniconda3/envs/py36/bin/python3', 'train.py', 'biped', '-n', '1', '-t', '2']
pybullet build time: Sep 26 2018 10:51:37
pybullet build time: Sep 26 2018 10:51:37
current_dir=/Users/keithgould/miniconda3/envs/py36/lib/python3.6/site-packages/pybullet_envs/bullet
current_dir=/Users/keithgould/miniconda3/envs/py36/lib/python3.6/site-packages/pybullet_envs/bullet
assigning the rank and nworkers 2 0
assigning the rank and nworkers 2 1
making model from game: Game(env_name='BipedalWalker-v2', time_factor=0, input_size=24, output_size=4, layers=[40, 40], activation='tanh', noise_bias=0.0, output_noise=[False, False, False])
making model from game: Game(env_name='BipedalWalker-v2', time_factor=0, input_size=24, output_size=4, layers=[40, 40], activation='tanh', noise_bias=0.0, output_noise=[False, False, False])
size of model 2804
size of model 2804
(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 2804 (seed=364359, Tue Nov 27 12:07:55 2018)
('process', 0, 'out of total ', 2, 'started')
('training', 'biped')
('population', 2)
('num_worker', 1)
('num_worker_trial', 2)


(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 2804 (seed=385130, Tue Nov 27 12:07:55 2018)
('process', 1, 'out of total ', 2, 'started')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
sweet, loaded the env.
<BipedalWalker instance>
cool right??
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
sweet, loaded the env.
<BipedalWalker instance>
cool right??
Traceback (most recent call last):
  File "train.py", line 448, in <module>
    main(args)
  File "train.py", line 404, in main
    slave()
  File "train.py", line 229, in slave
    fitness, timesteps = worker(weights, seed, train_mode, max_len)
  File "train.py", line 205, in worker
    train_mode=train_mode, render_mode=False, num_episode=num_episode, seed=seed, max_len=max_len)
  File "/Users/keithgould/Robotics/estool/model.py", line 258, in simulate
    obs, reward, done, info = model.env.step(action)
  File "/Users/keithgould/miniconda3/envs/py36/lib/python3.6/site-packages/gym/core.py", line 63, in step
    raise NotImplementedError
NotImplementedError

There are some print statements in there where I'm trying to figure out why the step function is not defined (causing the NotImplementedError).

As far as I can tell the environment is loaded, and I know it has a step function...

Any thoughts/help appreciated.

Keith
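
One possible cause, offered as an assumption rather than a confirmed diagnosis: as far as I know, older gym releases let environments implement underscore-prefixed methods (`_step`, `_reset`, `_render`), while newer releases expect subclasses to override `step`, `reset` and `render` directly. The sketch below does not import gym. `BaseEnv`, `OldStyleEnv` and `NewStyleEnv` are hypothetical stand-ins that only show how an environment written for the old convention ends up in the base class's `raise NotImplementedError`.

```python
class BaseEnv:
    """Stand-in for a base class that expects subclasses to override step()."""

    def step(self, action):
        raise NotImplementedError


class OldStyleEnv(BaseEnv):
    """Written for the underscore convention: only _step() is defined."""

    def _step(self, action):
        return "obs", 0.0, False, {}


class NewStyleEnv(BaseEnv):
    """Overrides step() directly, so the base-class stub is never reached."""

    def step(self, action):
        return "obs", 0.0, False, {}


def try_step(env):
    """Report whether env.step() works or falls through to the base stub."""
    try:
        env.step(None)
        return "ok"
    except NotImplementedError:
        return "NotImplementedError"


results = {type(env).__name__: try_step(env) for env in (OldStyleEnv(), NewStyleEnv())}
print(results)
```

If this is what is happening, checking which method names the custom environment class defines against the installed gym version would be a reasonable first step.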

PEPG questions

Hey Hardmaru,
Thanks for creating your ES blog posts, they've been really interesting.

I have a couple of quick and probably silly questions about your PEPG implementation.

I've read the original paper and attempted to implement the PEPG algorithm with symmetric sampling. I noticed a couple of differences from your implementation, and I was wondering if you could enlighten me.

In es.py where you're updating the mean, I notice that the calculated gradient does not get normalized by the batch size.

  rT = (reward[:self.batch_size] - reward[self.batch_size:])
  change_mu = np.dot(rT, epsilon)
  self.optimizer.stepsize = self.learning_rate
  update_ratio = self.optimizer.update(-change_mu) # adam, rmsprop, momentum, etc.

I guess that if you tune the learning rate according to batch size this is not an issue, but I was just wondering why you took this approach?

Also, where you're making the symmetric samples:

self.epsilon = np.random.randn(self.batch_size, self.num_params) * self.sigma.reshape(1, self.num_params)

You're sampling from a uniform distribution. Is there a reason that you take this approach rather than sampling from a normal distribution? Also, and I may be completely wrong here, as the uniform distribution is taken from [0,1), doesn't this mean that your parameters will always be larger than the mean (for the + symmetric case) and always smaller than the mean (for the - symmetric case)? You won't end up with the case where there's a mix of some parameters above the mean and some below.

I'm a completely self taught beginner to this stuff, so apologies for the naive questions.
cheers
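
For reference, `np.random.randn` draws from a standard normal distribution, while `np.random.rand` is the uniform sampler on [0, 1). The quick check below (seed and sample size are arbitrary choices for illustration) shows that `randn` produces values of both signs, so perturbations built from it land on both sides of the mean.

```python
import numpy as np

rng = np.random.RandomState(0)      # fixed seed, only so the check is repeatable

normal_draws = rng.randn(10000)     # standard normal: centred on 0, both signs
uniform_draws = rng.rand(10000)     # uniform on [0, 1): never negative

frac_negative_normal = float(np.mean(normal_draws < 0))
frac_negative_uniform = float(np.mean(uniform_draws < 0))

print("fraction of negative randn draws:", frac_negative_normal)
print("fraction of negative rand draws: ", frac_negative_uniform)
```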

Question for CMA-ES test part

Thanks for your contributions, they help a lot. But I wonder why I get different results each time I run the 'cmaes' part of 'simple_es_example.py'?
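
One likely explanation, stated as an assumption about how the example is run: evolution strategies sample their populations randomly, so two runs only match when the random seed is fixed. The sketch below uses a hypothetical toy solver (`toy_es` is not part of estool or of the `cma` package) to show that the same seed reproduces a run while a different seed does not.

```python
import numpy as np


def toy_es(seed, num_params=5, popsize=20, generations=30, sigma=0.5):
    """Hypothetical minimal ES maximizing f(x) = -sum(x^2)."""
    rng = np.random.RandomState(seed)
    mu = np.ones(num_params)
    for _ in range(generations):
        population = mu + sigma * rng.randn(popsize, num_params)
        fitness = -np.sum(population ** 2, axis=1)
        elite = population[np.argsort(fitness)[-5:]]   # keep the 5 best samples
        mu = elite.mean(axis=0)
    return mu


run_a = toy_es(seed=1)
run_b = toy_es(seed=1)   # same seed as run_a
run_c = toy_es(seed=2)   # different seed

print("same seed, same result:     ", np.array_equal(run_a, run_b))
print("different seed, same result:", np.array_equal(run_a, run_c))
```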

How to train "KukaBulletEnv-v0" ?

I use the following command:
python train.py bullet_kuka_grasping -n 8 -t 4
I have trained Kuka for a whole night, but the reward still does not change: "-1869.074175"
('improvement', 8150, -149.94043125000007, 'curr', -1869.074175, 'prev', -1719.13374375, 'best', -1719.13374375)

Then I use the following command to do the evaluation:
python model.py bullet_kuka_grasping log/bullet_kuka_grasping.cma.1.32.json
The Kuka still cannot grasp the object.

Could you please give some advice?

The pre-trained Kuka model did not work

I tried the pre-trained Kuka model in ./Zoo, but it cannot grasp an object. Does anyone know the reason?

Bug in train.py

Hi,
I ran into the following error, which is similar to [https://github.com//issues/8]

File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '/home/baheri/.virtualenvs/worldmodels/bin/python', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.

The suggestion in that post did not help me. Are there any other suggestions for solving this issue?

No colour

Hi, thanks for this. I installed everything correctly, but I can only get the OpenGL window to visualise the robot by pressing "w". There are no coloured simulations.

I'm on macOS High Sierra, pyenv with Python 3.5.2, gym 0.9.1, and the latest bullet3 and pybullet. Has anyone seen this before? (It could be something simple.)

Can't run pre-trained biped models

Hi, first of all, thanks for making this great tool!

I've tried a couple of pre-trained models, e.g., bullet_ant, bullet_kuka_grasping, and they all work fine. But I couldn't run the zoo biped models (same for bipedhard, bipedhard_stoc) with this command:
python model.py biped zoo/biped.cma.1.96.best.json.
It failed with the NotImplementedError, full trace below:

Traceback (most recent call last):
  File "model.py", line 340, in <module>
    main()
  File "model.py", line 335, in main
    train_mode=False, render_mode=render_mode, num_episode=1)
  File "model.py", line 240, in simulate
    model.env.render("human")
  File "/Users/duongnt/tensorflow/lib/python3.6/site-packages/gym/core.py", line 110, in render
    raise NotImplementedError
NotImplementedError

I have already installed the required dependencies (gym, box2d, pybullet and mpi4py). Am I missing something?

Bad score when training "KukaBulletEnv-v0"

I have been training bullet_kuka_grasping for a long time,
using 'python train.py bullet_kuka_grasping -e 16 -n 16 -t 16' for nearly a week,

but the best score seems to converge to about -1776.
How can I solve this problem? Thanks in advance!

negative function_values in CMAES?

Hi, I was confused by this line in es.py:
https://github.com/hardmaru/estool/blob/master/es.py#L115

reward_table = -np.array(reward_table_result)

The reward_table will be passed to the tell() method as function_values. But why does it negate the raw rewards collected from rollouts?
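
A probable reason is that the `cma` package is a minimizer: the values passed to `tell()` are treated as costs, so rewards that should be maximized are negated first. The sketch below does not call `cma`; it only uses made-up rewards to show that the candidate with the highest reward is the one with the lowest negated value.

```python
import numpy as np

rewards = np.array([1.0, 5.0, -2.0, 3.0])   # made-up rollout rewards (higher is better)
costs = -rewards                             # what a minimizer would be given

best_by_reward = int(np.argmax(rewards))     # index a maximizer would prefer
best_by_cost = int(np.argmin(costs))         # index a minimizer would prefer

print(best_by_reward, best_by_cost)
```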

Class OpenES, function ask. self.mu never gets updated.

First of all, thanks for all your contributions! :)
I looked at the original algorithm from the paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" for the OpenES implementation.
They update the policy parameters theta after every iteration of rollouts.
But the implementation of the OpenES class in es.py has this line commented out in the ask function:
#self.mu += self.learning_rate * change_mu
https://github.com/hardmaru/estool/blob/master/es.py#L328C1-L328C47

Even the Adam optimizer that is initialized doesn't seem to change the self.mu array.
I just wanted to know whether this is a mistake or whether I am missing something here.
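
One way the mean could still move with that line commented out is if the optimizer keeps a reference to the strategy object and writes to its `mu` inside `update()`. Whether es.py does exactly this is an assumption to check against the source. `ToyStrategy` and `ToySGD` below are hypothetical names used only to illustrate the pattern.

```python
import numpy as np


class ToyStrategy:
    """Hypothetical strategy object that owns the mean vector mu."""

    def __init__(self, num_params):
        self.num_params = num_params
        self.mu = np.zeros(num_params)


class ToySGD:
    """Hypothetical optimizer that holds a reference to the strategy."""

    def __init__(self, pi, stepsize):
        self.pi = pi
        self.stepsize = stepsize

    def update(self, globalg):
        step = -self.stepsize * globalg
        self.pi.mu = self.pi.mu + step    # the strategy's mean is changed here
        return float(np.linalg.norm(step))


es = ToyStrategy(num_params=3)
optimizer = ToySGD(es, stepsize=0.1)

change_mu = np.array([1.0, 2.0, 3.0])     # made-up ascent direction
optimizer.update(-change_mu)              # negated, so the step moves along change_mu

print(es.mu)
```

In this pattern the strategy never assigns to `self.mu` itself, which would make an explicit `self.mu += ...` line redundant.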

matplotlib isn't in requirements.txt

I needed to pip install matplotlib for the simple_es_example.ipynb demo to work; should it be in requirements.txt?

The meaning of the "rms_stdev" function?

Hi !

The function rms_stdev appears many times in es.py.

I guess the purpose of this function is to calculate the root mean square (RMS) of the standard deviations (sigma).

Therefore, this function should first square sigma, then take the mean, and finally take the square root, i.e. np.sqrt(np.mean(sigma*sigma)).

But the implementation in es.py applies these operations in the opposite order.
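
To make the difference concrete, the sketch below compares the two orderings on made-up values. The second expression, `np.mean(np.sqrt(sigma*sigma))`, is an assumption about what "the opposite order" refers to, since the post does not quote the code. It reduces to the mean of the absolute values rather than the RMS.

```python
import numpy as np

sigma = np.array([0.1, 0.2, 0.7])   # made-up standard deviations

# Square, then mean, then square root: the usual root mean square.
rms = float(np.sqrt(np.mean(sigma * sigma)))

# Square, then square root, then mean: equals the mean of |sigma|.
mean_abs = float(np.mean(np.sqrt(sigma * sigma)))

print("rms:     ", rms)
print("mean_abs:", mean_abs)
```

The two agree only when all entries of sigma are equal; otherwise the RMS is the larger of the two.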

Natural gradients for deep layers

In NES algorithms, do we use backpropagation from the last layer's gradients (computed by the objective function)? I am curious as to how to optimize the hidden layers, since they do not directly affect the objective function.
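
As a general point about evolution strategies (not a statement about any specific estool class): the network's weights are usually treated as one flat vector, every entry is perturbed, and the fitness of each perturbed network gives a search-gradient estimate for all layers at once, with no backpropagation. The sketch below is a minimal illustration under assumed sizes and a made-up fitness function. All names are hypothetical.

```python
import numpy as np

shapes = [(3, 4), (4, 2)]                    # hidden-layer and output-layer weights
sizes = [rows * cols for rows, cols in shapes]
num_params = sum(sizes)


def forward(theta, x):
    """Tiny two-layer tanh network driven by a flat parameter vector."""
    w1 = theta[:sizes[0]].reshape(shapes[0])
    w2 = theta[sizes[0]:].reshape(shapes[1])
    return np.tanh(np.tanh(x @ w1) @ w2)


def fitness(theta):
    """Made-up objective: negative squared error against a fixed target."""
    x = np.ones(3)
    target = np.array([0.5, -0.5])
    return -float(np.sum((forward(theta, x) - target) ** 2))


rng = np.random.RandomState(0)
theta = 0.1 * rng.randn(num_params)
sigma, popsize = 0.1, 200

noise = rng.randn(popsize, num_params)       # one perturbation per population member
scores = np.array([fitness(theta + sigma * n) for n in noise])

# Score-weighted average of the noise: one estimate per parameter, hidden layers included.
grad_estimate = (scores - scores.mean()) @ noise / (popsize * sigma)
hidden_part = grad_estimate[:sizes[0]]

print(grad_estimate.shape, hidden_part.shape)
```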

PEPG and NES

Looking through your implementation of PEPG, is it accurate to call it a NES?

bipedhard doesn't converge

I tried both bipedhard and bipedhard_stoc. They just don't converge. After about 150 generations they are still not showing any progress.
I am training with the following command:
python train.py bipedhard -n 8 -t 4
The best individuals are around -100 cumulative reward. Something must be wrong.

In the PEPG module, change_mu = np.dot(rT, epsilon), where the left side has dimension pop_size and the right side has dimension batch_size (half of pop_size)

      rT = (reward[:self.batch_size] - reward[self.batch_size:])
      change_mu = np.dot(rT, epsilon)
      self.optimizer.stepsize = self.learning_rate
      update_ratio = self.optimizer.update(-change_mu) # adam, rmsprop, momentum, etc.
      #self.mu += (change_mu * self.learning_rate) # normal SGD method

so change_mu will be half as long as is needed for pop_size
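
A shape check may help here. It assumes `reward` has one entry per population member (2 * batch_size in total) and `epsilon` has one row per mirrored pair, i.e. shape (batch_size, num_params). The sizes are illustrative. Under those assumptions the dot product sums over the batch axis and yields one value per parameter, which is the shape the mean needs.

```python
import numpy as np

batch_size, num_params = 4, 3                  # illustrative sizes
rng = np.random.RandomState(0)

reward = rng.randn(2 * batch_size)             # one reward per population member
epsilon = rng.randn(batch_size, num_params)    # one noise row per mirrored pair

rT = reward[:batch_size] - reward[batch_size:] # shape (batch_size,)
change_mu = np.dot(rT, epsilon)                # sums over the batch axis

print(rT.shape, epsilon.shape, change_mu.shape)
```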

Do the parameters need to be in [a,b]

Hello, I don't see anything in the codebase that enforces such a restriction, but should there be bounds on the values the parameters can have?

A related question: should each of the parameters have the same bounds (e.g. is normalization a requirement for strategies like OpenES to work properly)?

What is the best way to step through the code ?

Hi, I want to step through the code and look at it in detail.

However, MPI gives me a segmentation fault:

[gantosaxe:07615] *** Process received signal ***
[gantosaxe:07615] Signal: Segmentation fault (11)
[gantosaxe:07615] Signal code:  (128)
[gantosaxe:07615] Failing at address: (nil)

Is there a way to run the code without MPI or a reasonable way to debug it ?

Thanks !

class OpenES, function ask: mu comes back from the Adam optimizer's update/_compute_step with shape (popsize, num_params), which causes an error at mu.reshape(1, self.num_params)

self.solutions = self.mu.reshape(1, self.num_params) + self.epsilon * self.sigma

  def _compute_step(self, globalg):
    a = self.stepsize * np.sqrt(1 - self.beta2 ** self.t) / (1 - self.beta1 ** self.t)
    self.m = self.beta1 * self.m + (1 - self.beta1) * globalg
    self.v = self.beta2 * self.v + (1 - self.beta2) * (globalg * globalg)
    step = -a * self.m / (np.sqrt(self.v) + self.epsilon)
    return step
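
The step returned by this rule takes the shape of the gradient passed in, through NumPy broadcasting. The sketch below wraps the quoted formula in a hypothetical standalone class (`ToyAdam`; the default hyperparameters and the place where `t` is incremented are assumptions). It shows that a gradient of shape (num_params,) keeps `mu` reshapeable, while a gradient of shape (popsize, num_params) does not.

```python
import numpy as np


class ToyAdam:
    """Hypothetical standalone wrapper around the quoted update rule."""

    def __init__(self, dim, stepsize=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.stepsize, self.beta1, self.beta2, self.epsilon = stepsize, beta1, beta2, epsilon
        self.m = np.zeros(dim)
        self.v = np.zeros(dim)
        self.t = 0

    def compute_step(self, globalg):
        self.t += 1   # assumed to happen before the step is computed
        a = self.stepsize * np.sqrt(1 - self.beta2 ** self.t) / (1 - self.beta1 ** self.t)
        self.m = self.beta1 * self.m + (1 - self.beta1) * globalg
        self.v = self.beta2 * self.v + (1 - self.beta2) * (globalg * globalg)
        return -a * self.m / (np.sqrt(self.v) + self.epsilon)


popsize, num_params = 8, 5
mu = np.zeros(num_params)

good_step = ToyAdam(num_params).compute_step(np.ones(num_params))
bad_step = ToyAdam(num_params).compute_step(np.ones((popsize, num_params)))

good_mu = (mu + good_step).reshape(1, num_params)   # works: 5 values into (1, 5)
try:
    (mu + bad_step).reshape(1, num_params)          # 40 values cannot fit (1, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(good_step.shape, bad_step.shape, reshape_failed)
```

If the reported error matches the second case, the shape of the gradient handed to the optimizer would be the first thing to inspect.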

Training time for BipedalWalkerHardcore-v2

First off, terrific work on the repo and blog post, very detailed and clear.

I was able to solve BipedalWalkerHardcore-v2 (average 300+ over 100 episodes) with RL, using an A3C implementation I made, but it took quite a while to train: unlike BipedalWalker-v2, which took 5-10 minutes, it took nearly two full days of training. I think this was mainly due to how fast it wanted to run to the finish, but it eventually learned some impressive moves. Did you experience a long training time using ES as well?

I look forward to experimenting with your estool, as final performance and robustness are most important in my use cases. Great work and thanks for creating it!

CMA-ES with O(N)

Hello Hardmaru,
thank you very much for your great blog and for publishing the code! I love it!
Just a question: you mention that there is a CMA-ES variant that scales with O(N). We would love to use CMA-ES on larger problems. Do you happen to know a paper on that?

Thanks and kind regards,
Ernst

[Weight decay] Should weight decay be applied to model parameters instead?

It seems the current implementation applies weight decay to the fitness value. I am wondering whether it should be applied to the model parameters instead,
i.e.

if self.weight_decay > 0:
    l2_decay = compute_weight_decay(self.weight_decay, self.solutions)
    reward_table += l2_decay

with the last line changed to

self.solutions += l2_decay

And if weight decay should indeed be applied to the model parameters, should it happen after the results for the current iteration are computed? (E.g. in tell() of OpenES, the order is currently weight decay -> get current best param; should it be the opposite order, get current best param -> weight decay?)
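
One way to read the quoted snippet, under the assumption that `compute_weight_decay` returns a single non-positive scalar per candidate solution (its definition is not quoted here): adding that value to `reward_table` acts as an L2 regularizer on the parameters, because candidates with large weights get a lower fitness. `l2_penalty` below is a hypothetical stand-in, and the numbers are made up.

```python
import numpy as np


def l2_penalty(weight_decay, solutions):
    """Hypothetical helper: one non-positive penalty per candidate solution."""
    solutions = np.asarray(solutions)
    return -weight_decay * np.mean(solutions * solutions, axis=1)


solutions = np.array([[0.0, 0.0],
                      [1.0, 1.0],
                      [3.0, 4.0]])           # three candidates, two parameters each
reward_table = np.array([1.0, 1.0, 1.0])     # identical raw rewards

penalty = l2_penalty(0.01, solutions)        # shape (3,): matches reward_table
penalized = reward_table + penalty

print(penalty.shape, penalized)
```

Under this assumption the penalty has one entry per candidate, so it lines up with `reward_table` rather than with the (popsize, num_params) array of solutions.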

License

Thanks for the code. I couldn't find a license file. Am I free to adapt and reuse the code?

Solution range?

Is there currently a way to set the solution range that the population can search over? For example, for some problems I may wish to bound solutions for each parameter in the (0, 1] range, or (-1, 1).
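
Whether estool offers a built-in option is not stated here. As a generic workaround, candidate solutions can be mapped into the desired range before they are evaluated. The sketch below shows two common choices, hard clipping and a logistic squashing. Both function names are hypothetical, and neither is taken from the estool codebase.

```python
import numpy as np


def clip_to_bounds(solutions, low, high):
    """Hard-limit every parameter to the closed interval [low, high]."""
    return np.clip(solutions, low, high)


def squash_to_unit(solutions):
    """Map unbounded parameters smoothly into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-solutions))


raw = np.array([[-3.0, 0.0, 3.0],
                [0.5, -0.5, 10.0]])          # made-up unbounded solutions

clipped = clip_to_bounds(raw, -1.0, 1.0)
squashed = squash_to_unit(raw)

print(clipped)
print(squashed)
```

Clipping keeps values inside the range unchanged but flattens the search landscape at the boundary, while squashing keeps every direction informative at the cost of a nonlinear reparameterization.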