rail-berkeley / rlkit
Collection of reinforcement learning algorithms
License: MIT License
I would like to resume training from a given epoch. I guess I could add a line to:
def get_epoch_snapshot(self, epoch):
    data_to_save = dict(
        epoch=epoch,
        exploration_policy=self.exploration_policy,
        eval_policy=self.eval_policy,
        algorithm=self,  # Here
    )
    if self.save_environment:
        data_to_save['env'] = self.training_env
    return data_to_save
and just pickle load the entire RLAlgorithm subclass instance to resume?
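A minimal sketch of that idea, assuming the whole algorithm object is picklable and that it tracks the next epoch in a `_start_epoch` attribute. `ToyAlgorithm`, `save_snapshot`, and `load_algorithm` are hypothetical names used only for illustration; they are not rlkit APIs.

```python
import os
import pickle
import tempfile


class ToyAlgorithm:
    """Hypothetical stand-in for an RLAlgorithm subclass."""

    def __init__(self, num_epochs):
        self.num_epochs = num_epochs
        self._start_epoch = 0  # assumed attribute name
        self.history = []

    def get_epoch_snapshot(self, epoch):
        # Storing `self` pickles the whole algorithm with the snapshot.
        return dict(epoch=epoch, algorithm=self)

    def train(self, snapshot_path):
        for epoch in range(self._start_epoch, self.num_epochs):
            self.history.append(epoch)  # placeholder for real training
            # Advance before saving so a reloaded copy continues with the next epoch.
            self._start_epoch = epoch + 1
            save_snapshot(self.get_epoch_snapshot(epoch), snapshot_path)


def save_snapshot(snapshot, path):
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)


def load_algorithm(path):
    with open(path, "rb") as f:
        return pickle.load(f)["algorithm"]
```

Resuming would then amount to `algo = load_algorithm(path)` followed by `algo.train(path)`. Anything that cannot be pickled (open file handles, some simulator objects) would need separate handling.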
How do I play back trained weights and render the result?
This example dataset-based trainer also does expert signal recollection, which is why I didn't open a PR; I will leave it to you to decide which parts make sense for rlkit.
import abc

import gtimer as gt
import torch
from torch.utils.data import Dataset
from tqdm import tqdm

import rlkit.torch.pytorch_util as ptu
from rlkit.core.rl_algorithm import BaseRLAlgorithm
from rlkit.data_management.replay_buffer import ReplayBuffer
from rlkit.samplers.data_collector import PathCollector

# printout and AtariPathCollectorWithEmbedder come from the poster's own code
# and are not shown here.


class OptimizedBatchRLAlgorithm(BaseRLAlgorithm, metaclass=abc.ABCMeta):
    def __init__(
            self,
            trainer,
            exploration_env,
            evaluation_env,
            exploration_data_collector: PathCollector,
            evaluation_data_collector: PathCollector,
            replay_buffer: ReplayBuffer,
            batch_size,
            max_path_length,
            num_epochs,
            num_eval_steps_per_epoch,
            num_expl_steps_per_train_loop,
            num_trains_per_train_loop,
            num_train_loops_per_epoch=1,
            min_num_steps_before_training=0,
            max_num_steps_before_training=1e5,
            expert_data_collector: PathCollector = None,
    ):
        super().__init__(
            trainer,
            exploration_env,
            evaluation_env,
            exploration_data_collector,
            evaluation_data_collector,
            replay_buffer,
        )
        assert isinstance(replay_buffer, Dataset), "The replay buffers must be compatible with Pytorch Dataset to use this version."
        self.batch_size = batch_size
        self.max_path_length = max_path_length
        self.num_epochs = num_epochs
        self.num_eval_steps_per_epoch = num_eval_steps_per_epoch
        self.num_trains_per_train_loop = num_trains_per_train_loop
        self.num_train_loops_per_epoch = num_train_loops_per_epoch
        self.num_expl_steps_per_train_loop = num_expl_steps_per_train_loop
        self.min_num_steps_before_training = min_num_steps_before_training
        self.max_num_steps_before_training = max_num_steps_before_training
        self.expert_data_collector = expert_data_collector

    def _train(self):
        # Optionally fill the buffer with exploration data before training.
        if self.min_num_steps_before_training > 0:
            init_expl_paths = self.expl_data_collector.collect_new_paths(
                self.max_path_length,
                self.min_num_steps_before_training,
                discard_incomplete_paths=False,
            )
            self.replay_buffer.add_paths(init_expl_paths)
            # The expert collector is optional, so guard against None.
            if self.expert_data_collector is not None:
                self.expert_data_collector.end_epoch(-1)
            self.expl_data_collector.end_epoch(-1)

        # Seed the buffer with expert data, up to half of its capacity.
        if self.expert_data_collector is not None:
            new_expl_paths = self.expert_data_collector.collect_new_paths(
                self.max_path_length,
                min(int(self.replay_buffer.max_buffer_size * 0.5), self.max_num_steps_before_training),
                discard_incomplete_paths=False,
            )
            self.replay_buffer.add_paths(new_expl_paths)

        dataset_loader = torch.utils.data.DataLoader(self.replay_buffer, pin_memory=True, batch_size=self.batch_size, num_workers=0)

        for epoch in gt.timed_for(
                range(self._start_epoch, self.num_epochs),
                save_itrs=True,
        ):
            printout('Evaluation sampling')
            self.eval_data_collector.collect_new_paths(
                self.max_path_length,
                self.num_eval_steps_per_epoch,
                discard_incomplete_paths=True,
            )
            gt.stamp('evaluation sampling')

            for _ in range(self.num_train_loops_per_epoch):
                printout('Exploration sampling')
                new_expl_paths = self.expl_data_collector.collect_new_paths(
                    self.max_path_length,
                    self.num_expl_steps_per_train_loop,
                    discard_incomplete_paths=False,
                )
                gt.stamp('exploration sampling', unique=False)

                self.replay_buffer.add_paths(new_expl_paths)
                gt.stamp('data storing', unique=False)

                self.training_mode(True)
                i = 0
                with tqdm(total=self.num_trains_per_train_loop) as pbar:
                    # Keep cycling through the loader until enough batches are trained.
                    while True:
                        for _, data in enumerate(dataset_loader, 0):
                            # >= (not >) so exactly num_trains_per_train_loop steps run.
                            if i >= self.num_trains_per_train_loop:
                                break  # We are done
                            observations = data[0].to(ptu.device)
                            actions = data[1].to(ptu.device)
                            rewards = data[2].to(ptu.device)
                            terminals = data[3].to(ptu.device).float()
                            next_observations = data[4].to(ptu.device)
                            env_infos = data[5]
                            train_data = dict(
                                observations=observations,
                                actions=actions,
                                rewards=rewards,
                                terminals=terminals,
                                next_observations=next_observations,
                            )
                            for key in env_infos.keys():
                                train_data[key] = env_infos[key]
                            self.trainer.train(train_data)
                            pbar.update(1)
                            i += 1
                        if i >= self.num_trains_per_train_loop:
                            break
                gt.stamp('training', unique=False)
                self.training_mode(False)

            if isinstance(self.expl_data_collector, AtariPathCollectorWithEmbedder):
                eval_policy = self.eval_data_collector.get_snapshot()['policy']
                self.expl_data_collector.evaluate(eval_policy)

            self._end_epoch(epoch)

    def to(self, device):
        for net in self.trainer.networks:
            net.to(device)

    def training_mode(self, mode):
        for net in self.trainer.networks:
            net.train(mode)
Do I need to write a render function?
Are there any examples of this in the code?
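For illustration, a rollout loop that renders every step might look like the sketch below. It assumes a Gym-style environment (`reset`, `step` returning a 4-tuple, `render`) and a policy whose `get_action` returns an `(action, info)` pair. `ToyEnv`, `ToyPolicy`, and `play` are hypothetical names, not rlkit code.

```python
class ToyEnv:
    """Hypothetical Gym-style environment used only for illustration."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0
        self.frames_rendered = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = -abs(action)
        done = self.t >= self.horizon
        return obs, reward, done, {}

    def render(self):
        # A real environment would draw a frame here.
        self.frames_rendered += 1


class ToyPolicy:
    """Hypothetical policy; assumes get_action returns (action, info)."""

    def get_action(self, obs):
        return 0.5, {}


def play(env, policy, max_path_length, render=True):
    """Run one episode, rendering every step, and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_path_length):
        action, _ = policy.get_action(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if render:
            env.render()
        if done:
            break
    return total_reward
```

With a trained policy loaded from a saved snapshot and a real environment in place of the toy classes, the same loop would show the learned behaviour.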
Hi, I'm trying to evaluate TDM as an algorithm to add to our software and the first step was to replicate the results presented in "TEMPORAL DIFFERENCE MODELS: MODEL-FREE DEEP RL FOR MODEL-BASED CONTROL" (Very impressive paper, by the way). Of the examples in this repository, Half Cheetah looks like it trains the fastest in your paper, so I ran that first. When I run it I don't seem to get the same results you present, but I may just be interpreting things incorrectly.
I created a conda env as per the instructions in the readme, with the exception of gtimer where I removed the version requirement gtimer==1.0.1b5 because that version wasn't available.
When I run examples/tdm/cheetah.py, the initial returns are as follows:
Multitask L1 distance to goal Mean 2.2572
Multitask L1 distance to goal Std 1.43821
Multitask L1 distance to goal Max 5.63364
Multitask L1 distance to goal Min 0.0958731
Multitask Final L1 distance to goal Mean 1.34996
Multitask Final L1 distance to goal Std 1.69566
Multitask Final L1 distance to goal Max 5.46802
Multitask Final L1 distance to goal Min 0.315033
Multitask L2 distance to goal Mean 2.2572
Multitask L2 distance to goal Std 1.43821
Multitask L2 distance to goal Max 5.63364
Multitask L2 distance to goal Min 0.0958731
Multitask Final L2 distance to goal Mean 1.34996
Multitask Final L2 distance to goal Std 1.69566
Multitask Final L2 distance to goal Max 5.46802
Multitask Final L2 distance to goal Min 0.315033
Multitask Env Rewards Mean -2.24934
Multitask Env Rewards Std 1.44294
Multitask Env Rewards Max -0.0958731
Multitask Env Rewards Min -5.60402
Multitask L1 distance to goal Mean 2.2572
Multitask L1 distance to goal Std 1.43821
Multitask L1 distance to goal Max 5.63364
Multitask L1 distance to goal Min 0.0958731
Multitask Final L1 distance to goal Mean 1.34996
Multitask Final L1 distance to goal Std 1.69566
Multitask Final L1 distance to goal Max 5.46802
Multitask Final L1 distance to goal Min 0.315033
Multitask L2 distance to goal Mean 2.2572
Multitask L2 distance to goal Std 1.43821
Multitask L2 distance to goal Max 5.63364
Multitask L2 distance to goal Min 0.0958731
Multitask Final L2 distance to goal Mean 1.34996
Multitask Final L2 distance to goal Std 1.69566
Multitask Final L2 distance to goal Max 5.46802
Multitask Final L2 distance to goal Min 0.315033
Multitask Env Rewards Mean -2.24934
Multitask Env Rewards Std 1.44294
Multitask Env Rewards Max -0.0958731
Multitask Env Rewards Min -5.60402
xvels Mean 0.192786
xvels Std 0.55423
xvels Max 1.7305
xvels Min -2.56708
Final xvels Mean -0.873447
Final xvels Std 0.629683
Final xvels Max -0.00983972
Final xvels Min -1.92269
desired xvels Mean -0.242309
desired xvels Std 2.53803
desired xvels Max 5.45818
desired xvels Min -2.77633
Final desired xvels Mean -0.242309
Final desired xvels Std 2.53803
Final desired xvels Max 5.45818
Final desired xvels Min -2.77633
xvel errors Mean 2.24934
xvel errors Std 1.44294
xvel errors Max 5.60402
xvel errors Min 0.0958731
Final xvel errors Mean 1.34996
Final xvel errors Std 1.69566
Final xvel errors Max 5.46802
Final xvel errors Min 0.315033
QF Loss 1.94271
Policy Loss 0.141472
Raw Policy Loss 0.141472
Preactivation Policy Loss 0
Q Predictions Mean -0.141398
Q Predictions Std 0.188076
Q Predictions Max -0.000109524
Q Predictions Min -0.671575
Q Targets Mean -2.11527
Q Targets Std 8.76126
Q Targets Max -0
Q Targets Min -73.5125
Bellman Errors Mean 80.6921
Bellman Errors Std 504.045
Bellman Errors Max 5365.62
Bellman Errors Min 8.74315e-13
Policy Action Mean -0.00243206
Policy Action Std 0.00429308
Policy Action Max 0.00416228
Policy Action Min -0.011675
Test Rewards Mean -2.24934
Test Rewards Std 1.44294
Test Rewards Max -0.0958731
Test Rewards Min -5.60402
Test Returns Mean -222.684
Test Returns Std 133.611
Test Returns Max -52.0534
Test Returns Min -534.119
Test Actions Mean 0.181862
Test Actions Std 0.903677
Test Actions Max 1
Test Actions Min -1
Num Paths 10
Exploration Rewards Mean -327.502
Exploration Rewards Std 150.894
Exploration Rewards Max -0.0122693
Exploration Rewards Min -684.543
Exploration Returns Mean -32422.7
Exploration Returns Std 14259.2
Exploration Returns Max -9411.74
Exploration Returns Min -56577.3
Exploration Actions Mean 0.0733828
Exploration Actions Std 0.691484
Exploration Actions Max 1
Exploration Actions Min -1
AverageReturn -222.684
Number of train steps total 20075
Number of env steps total 1000
Number of rollouts total 10
Train Time (s) 185.599
(Previous) Eval Time (s) 0
Sample Time (s) 0.594464
Epoch Time (s) 186.193
Total Train Time (s) 187.161
Epoch 0
and the final returns after 27 000 samples were:
Multitask L1 distance to goal Mean 2.84896
Multitask L1 distance to goal Std 1.86101
Multitask L1 distance to goal Max 8.48643
Multitask L1 distance to goal Min 0.00163637
Multitask Final L1 distance to goal Mean 1.12561
Multitask Final L1 distance to goal Std 1.09752
Multitask Final L1 distance to goal Max 3.6141
Multitask Final L1 distance to goal Min 0.010751
Multitask L2 distance to goal Mean 2.84896
Multitask L2 distance to goal Std 1.86101
Multitask L2 distance to goal Max 8.48643
Multitask L2 distance to goal Min 0.00163637
Multitask Final L2 distance to goal Mean 1.12561
Multitask Final L2 distance to goal Std 1.09752
Multitask Final L2 distance to goal Max 3.6141
Multitask Final L2 distance to goal Min 0.010751
Multitask Env Rewards Mean -2.83008
Multitask Env Rewards Std 1.86414
Multitask Env Rewards Max -0.00163637
Multitask Env Rewards Min -8.48643
Multitask L1 distance to goal Mean 2.84896
Multitask L1 distance to goal Std 1.86101
Multitask L1 distance to goal Max 8.48643
Multitask L1 distance to goal Min 0.00163637
Multitask Final L1 distance to goal Mean 1.12561
Multitask Final L1 distance to goal Std 1.09752
Multitask Final L1 distance to goal Max 3.6141
Multitask Final L1 distance to goal Min 0.010751
Multitask L2 distance to goal Mean 2.84896
Multitask L2 distance to goal Std 1.86101
Multitask L2 distance to goal Max 8.48643
Multitask L2 distance to goal Min 0.00163637
Multitask Final L2 distance to goal Mean 1.12561
Multitask Final L2 distance to goal Std 1.09752
Multitask Final L2 distance to goal Max 3.6141
Multitask Final L2 distance to goal Min 0.010751
Multitask Env Rewards Mean -2.83008
Multitask Env Rewards Std 1.86414
Multitask Env Rewards Max -0.00163637
Multitask Env Rewards Min -8.48643
xvels Mean 0.0265717
xvels Std 1.19521
xvels Max 3.78229
xvels Min -5.17487
Final xvels Mean -0.953468
Final xvels Std 2.0584
Final xvels Max 3.35745
Final xvels Min -4.45172
desired xvels Mean -1.27358
desired xvels Std 3.19157
desired xvels Max 4.8695
desired xvels Min -5.75411
Final desired xvels Mean -1.27358
Final desired xvels Std 3.19157
Final desired xvels Max 4.8695
Final desired xvels Min -5.75411
xvel errors Mean 2.83008
xvel errors Std 1.86414
xvel errors Max 8.48643
xvel errors Min 0.00163637
Final xvel errors Mean 1.12561
Final xvel errors Std 1.09752
Final xvel errors Max 3.6141
Final xvel errors Min 0.010751
QF Loss 5.10427
Policy Loss 21.2602
Raw Policy Loss 21.2602
Preactivation Policy Loss 0
Q Predictions Mean -35.1532
Q Predictions Std 79.3634
Q Predictions Max -0.176604
Q Predictions Min -579.39
Q Targets Mean -34.9638
Q Targets Std 80.9855
Q Targets Max -0.0394544
Q Targets Min -585.102
Bellman Errors Mean 134.548
Bellman Errors Std 629.529
Bellman Errors Max 6193.98
Bellman Errors Min 1.83182e-05
Policy Action Mean 0.043988
Policy Action Std 0.897599
Policy Action Max 1
Policy Action Min -1
Test Rewards Mean -2.83008
Test Rewards Std 1.86414
Test Rewards Max -0.00163637
Test Rewards Min -8.48643
Test Returns Mean -280.178
Test Returns Std 150.009
Test Returns Max -63.9594
Test Returns Min -552.089
Test Actions Mean -0.262005
Test Actions Std 0.870809
Test Actions Max 1
Test Actions Min -1
Num Paths 10
Exploration Rewards Mean -189.165
Exploration Rewards Std 142.202
Exploration Rewards Max -0.217348
Exploration Rewards Min -668.823
Exploration Returns Mean -18727.3
Exploration Returns Std 11106.4
Exploration Returns Max -5627.08
Exploration Returns Min -44754
Exploration Actions Mean -0.0129454
Exploration Actions Std 0.841991
Exploration Actions Max 1
Exploration Actions Min -1
AverageReturn -280.178
Number of train steps total 670075
Number of env steps total 27000
Number of rollouts total 272
Train Time (s) 339.624
(Previous) Eval Time (s) 0.961533
Sample Time (s) 0.663905
Epoch Time (s) 341.249
Total Train Time (s) 8136.35
Epoch 26
Comparing the final xvel error mean (which I think is the metric you graph in the paper) it seems that after 27 000 env steps, rather than dropping from ~2.5 to < 1, the value has remained at or above 2.5. Is this just a matter of running it for longer, or does this suggest there's something fundamentally wrong with my environment or perhaps that the code has been changed since generating the graphs in the paper?
I've seen similar results in three runs, so it doesn't seem to be a bad initialization.
Thanks for your help.
Hi. I tried running SAC but it fails with the following error. I am currently working on your repo to add an environment farming feature (multiple instances of an environment feeding the reinforcement learning algorithms). The problem may therefore be related to my environment or to the parts I changed, but that seems unlikely since DDPG works. Do you have any idea why SAC fails? Thanks.
File "/root/code/rlkit/core/rl_algorithm.py", line 134, in train
self.train_online(start_epoch=start_epoch)
File "/root/code/rlkit/core/rl_algorithm.py", line 210, in train_online
observation = self.play_one_step(observation)
File "/root/code/rlkit/core/rl_algorithm.py", line 179, in play_one_step
self._try_to_train()
File "/root/code/rlkit/core/rl_algorithm.py", line 222, in _try_to_train
self._do_training()
File "/root/code/rlkit/torch/sac/sac.py", line 131, in _do_training
policy_loss.backward()
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/variable.py", line 156, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/__init__.py", line 98, in backward
variables, grad_variables, retain_graph)
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/stochastic_function.py", line 15, in _do_backward
raise RuntimeError("differentiating stochastic functions requires "
RuntimeError: differentiating stochastic functions requires providing a reward
I made a small change to oracle.py, like this,
and modified state_based_goal_experiments.py:
env -> image_env
Then I get:
Traceback (most recent call last):
File "pusher_img.py", line 76, in <module>
use_gpu=True, # Turn on if you have a GPU
File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 594, in run_experiment
**run_experiment_kwargs
File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 175, in run_experiment_here
return experiment_function(variant)
File "/home/cww97/rlkit/rlkit/launchers/img_goal_experiments.py", line 120, in her_td3_experiment
algorithm.train()
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 146, in train
self.train_online(start_epoch=start_epoch)
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 178, in train_online
self._end_epoch(epoch)
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 329, in _end_epoch
post_epoch_func(self, epoch)
File "/home/cww97/rlkit/rlkit/launchers/rig_experiments.py", line 525, in save_video
**dump_video_kwargs)
File "/home/cww97/rlkit/rlkit/util/video.py", line 87, in dump_video
skvideo.io.vwrite(filename, outputdata)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/io.py", line 64, in vwrite
writer.writeFrame(videodata[t])
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/ffmpeg.py", line 448, in writeFrame
self._warmStart(M, N, C)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/ffmpeg.py", line 419, in _warmStart
stdout=self.DEVNULL, stderr=sp.STDOUT)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/subprocess.py", line 1490, in _execute_child
restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
I searched a little with Google and I know this is a memory problem, so I tried to see whether there is any difference.
I compared the oracle and img_pusher runs:
the video outputdata is 115450416 in both cases.
Is this because I missed some other code changes?
I have attempted to run the TD3 example script from the rlkit-gpu Docker image with no success. I had to modify the TD3 example script slightly because I don't have a MuJoCo license, so it runs MountainCarContinuous-v0 instead. It runs just fine on my local machine from the RLKit source, but when I try to run it from within the Docker container I get the following error:
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
Traceback (most recent call last):
File "examples/td3.py", line 111, in <module>
experiment(variant)
File "examples/td3.py", line 84, in experiment
algorithm.cuda()
File "/rlkit/rlkit/torch/torch_rl_algorithm.py", line 37, in cuda
net.cuda()
File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 216, in cuda
return self._apply(lambda t: t.cuda(device))
File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 146, in _apply
module._apply(fn)
File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 152, in _apply
param.data = fn(param.data)
File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 216, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/env/lib/python3.5/site-packages/torch/_utils.py", line 69, in _cuda
return new_type(self.size()).copy_(self, async)
File "/env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 358, in _lazy_new
_lazy_init()
File "/env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 121, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70
Can you confirm that the td3 example runs on the docker container without issue for you?
Hi,
I am trying to implement TDM. The link for multitask env is broken.
https://github.com/vitchyr/rlkit/blob/master/docs/envs/multitask_env.py
Crashes with:
Traceback (most recent call last):
File "/mounts/target/examples/tsac.py", line 75, in <module>
experiment(variant)
File "/mounts/target/examples/tsac.py", line 52, in experiment
algorithm.to(ptu.device)
File "/mounts/rlkit-vitchyr/rlkit/torch/torch_rl_algorithm.py", line 30, in to
net.to(device)
File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 366, in __getattr__
type(self).__name__, name))
AttributeError: 'TanhGaussianPolicy' object has no attribute 'to'
Thanks for sharing!
I have a question:
When I run examples/tdm/ant_position.py, it returns:
2018-05-14 08:29:57.628702 CST | Variant:
2018-05-14 08:29:57.628908 CST | {
  "algorithm": "TDM",
  "tdm_kwargs": {
    "reward_scale": 10,
    "discount": 1,
    "max_path_length": 50,
    "batch_size": 128,
    "num_pretrain_paths": 0,
    "num_epochs": 500,
    "num_steps_per_epoch": 1000,
    "num_steps_per_eval": 1000,
    "tau": 0.001,
    "num_updates_per_env_step": 5,
    "max_tau": 49
  }
}
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
File "/home/jiamj/rlkit/examples/tdm/ant_position.py", line 87, in <module>
experiment(variant)
File "/home/jiamj/rlkit/examples/tdm/ant_position.py", line 15, in experiment
env = NormalizedBoxEnv(GoalXYPosAnt(max_distance=6))
File "/home/jiamj/rlkit/rlkit/torch/tdm/envs/ant_env.py", line 21, in __init__
self.set_goal(np.array([self.max_distance, self.max_distance]))
File "/home/jiamj/rlkit/rlkit/torch/tdm/envs/ant_env.py", line 41, in set_goal
self.model.site_pos = site_pos
AttributeError: attribute 'site_pos' of 'mujoco_py.cymj.PyMjModel' objects is not writable
How can I solve it?
Thanks again!
I have been using these two routines to figure out the best learning rate to apply with awesome results on SAC. However, the changes in the temperature alter those values along the way. Probably would be a good idea to extend it further to do some sort of 'automatic' discovery of LR after x amount of epochs. This version will also mess up the gradients, so you cannot use the policy after you run this.
def find_policy_lr_step(self, loss):
    self.find_lr_batch_num += 1
    if self.find_lr_batch_num == 1:
        self.find_lr_avg_loss = 0.0
        self.find_lr_worst_loss = loss.item()
        self.find_lr_best_loss = loss.item()
        self.find_lr_best_lr = self.policy_optimizer.param_groups[0]['lr']
        self.find_lr_worst_lr = self.policy_optimizer.param_groups[0]['lr']

    self.find_lr_avg_loss = self.find_lr_beta * self.find_lr_avg_loss + (1-self.find_lr_beta) * loss.item()
    smoothed_loss = self.find_lr_avg_loss / (1 - self.find_lr_beta ** self.find_lr_batch_num)

    # Record the best and worst loss
    if self.find_lr_batch_num > self.find_lr_batches // 10 and smoothed_loss < self.find_lr_best_loss:
        self.find_lr_best_lr = self.find_lr_current_lr
        self.find_lr_best_loss = smoothed_loss

    # We only record at the start (we dont care about the divergent part)
    if self.find_lr_batch_num < self.find_lr_batches // 5:
        self.find_lr_worst_loss = max(smoothed_loss, self.find_lr_worst_loss)

    # Stop if the loss is exploding
    if self.find_lr_batch_num > self.find_lr_batches:
        import matplotlib.pyplot as plt
        plt.plot(self.find_lr_log_lrs,self.find_lr_losses)
        plt.show()
        # TODO: This is a simplistic heuristic until we do it properly doing gradient analysis.
        printout(f'The best learning rate for network could be around: {self.find_lr_best_lr / 10}')
        printout(f'Process will exit because finding the learning rate will make your gradients to degenerate')
        exit(0)

    # Store the values unless we are already diverging
    if smoothed_loss <= self.find_lr_worst_loss:
        self.find_lr_losses.append(smoothed_loss)
        self.find_lr_log_lrs.append(math.log10(self.policy_optimizer.param_groups[0]['lr']))

    # Update with the new learning rate.
    self.find_lr_current_lr *= self.find_lr_multiplier
    self.policy_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
def find_qfunc_lr_step(self, qf1_loss, qf2_loss):
    self.find_lr_batch_num += 1
    if self.find_lr_batch_num == 1:
        self.find_lr_avg_loss = 0.0
        self.find_lr_worst_loss = min( qf1_loss.item(), qf2_loss.item() )
        self.find_lr_best_loss = min( qf1_loss.item(), qf2_loss.item() )
        self.find_lr_best_lr = self.qf1_optimizer.param_groups[0]['lr']
        self.find_lr_worst_lr = self.qf1_optimizer.param_groups[0]['lr']

    self.find_lr_avg_loss = self.find_lr_beta * self.find_lr_avg_loss + (1-self.find_lr_beta) * min( qf1_loss.item(), qf2_loss.item() )
    smoothed_loss = self.find_lr_avg_loss / (1 - self.find_lr_beta ** self.find_lr_batch_num)

    # Record the best and worst loss
    if self.find_lr_batch_num > self.find_lr_batches // 10 and smoothed_loss < self.find_lr_best_loss:
        self.find_lr_best_lr = self.find_lr_current_lr
        self.find_lr_best_loss = smoothed_loss

    # We only record at the start (we dont care about the divergent part)
    if self.find_lr_batch_num < self.find_lr_batches // 5:
        self.find_lr_worst_loss = max(smoothed_loss, self.find_lr_worst_loss)

    # Stop if the loss is exploding
    if self.find_lr_batch_num > self.find_lr_batches:
        import matplotlib.pyplot as plt
        plt.plot(self.find_lr_log_lrs,self.find_lr_losses)
        plt.show()
        # TODO: This is a simplistic heuristic until we do it properly doing gradient analysis.
        printout(f'The best learning rate for q function approximator could be around: {self.find_lr_best_lr / 10}')
        printout(f'Process will exit because finding the learning rate will make your gradients to degenerate')
        exit(0)

    # Store the values unless we are already diverging
    if smoothed_loss <= self.find_lr_worst_loss:
        self.find_lr_losses.append(smoothed_loss)
        self.find_lr_log_lrs.append(math.log10(self.qf1_optimizer.param_groups[0]['lr']))

    # Update with the new learning rate.
    self.find_lr_current_lr *= self.find_lr_multiplier
    self.qf1_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
    self.qf2_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
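The core of both routines is a learning-rate range test: grow the learning rate geometrically each batch and track a bias-corrected exponential moving average of the loss. The sketch below isolates that idea on a toy quadratic using only the standard library. `lr_range_test` and `quadratic` are hypothetical names, and the stopping factor of 4 is an arbitrary choice for illustration, not something taken from the routines above.

```python
import math


def lr_range_test(loss_and_grad, w0, start_lr=1e-5, end_lr=10.0,
                  num_batches=100, beta=0.98):
    """Sweep the learning rate geometrically and track the smoothed loss.

    Returns (best_lr, log10_lrs, smoothed_losses).
    """
    multiplier = (end_lr / start_lr) ** (1.0 / num_batches)
    lr, w = start_lr, w0
    avg_loss, best_loss, best_lr = 0.0, float("inf"), start_lr
    log_lrs, losses = [], []
    for batch_num in range(1, num_batches + 1):
        loss, grad = loss_and_grad(w)
        # Exponential moving average of the loss, with bias correction.
        avg_loss = beta * avg_loss + (1 - beta) * loss
        smoothed = avg_loss / (1 - beta ** batch_num)
        if smoothed < best_loss:
            best_loss, best_lr = smoothed, lr
        # Stop once the smoothed loss is clearly diverging (assumed factor).
        if batch_num > 1 and smoothed > 4 * best_loss:
            break
        log_lrs.append(math.log10(lr))
        losses.append(smoothed)
        w -= lr * grad  # plain gradient step
        lr *= multiplier
    return best_lr, log_lrs, losses


def quadratic(w):
    # Toy objective: loss = w^2, gradient = 2w.
    return w * w, 2.0 * w
```

Calling `lr_range_test(quadratic, w0=1.0)` returns the learning rate at which the smoothed loss was lowest, plus the curve that the routines above plot. As in those routines, the parameters are left in a degraded state afterwards, so the sweep is best run on a throwaway copy.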
Hi @vitchyr
Thanks for the great code base. I was recently benchmarking some results here in search of DDPG/TD3 implementations after failing to get baselines working. I thought I'd share some results in case they are useful to you or others.
For installation, I actually didn't entirely follow the installation instructions, but here's what I did:
I took the master branch from 5565dd5 and then adjusted examples/td3.py and examples/ddpg.py so that they also imported other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in the "algorithm_kwargs" so that they matched DDPG in the main method. To be clear, DDPG uses this:
And TD3 uses this:
I simply modified the td3.py script so that all hyperparameters above match DDPG, so in particular I changed: number of epochs to 1000, eval steps to 1000, min steps before training to 10k, and batch size to 128.
If I am not mistaken, this should mean that both the exploration and evaluation policies will experience 1 million total steps over the course of training. Though, because evaluation by default will discard incomplete trajectories, sometimes the actual number of steps reported by the debugger will be less than 1 million.
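The arithmetic can be checked with a small model of step collection. This is a simplified sketch of the behaviour described above, not rlkit's collector: `collect_steps` is a hypothetical helper, and `episode_lengths` stands for how long each successive episode would run before terminating on its own.

```python
def collect_steps(episode_lengths, num_steps, max_path_length,
                  discard_incomplete_paths=True):
    """Count the steps kept in one collection round.

    A path is "incomplete" when it was cut short by the remaining step
    budget rather than by termination or by reaching max_path_length.
    """
    kept = 0
    for natural_len in episode_lengths:
        if kept >= num_steps:
            break
        budget = min(max_path_length, num_steps - kept)
        path_len = min(natural_len, budget)
        terminated = natural_len <= budget
        incomplete = (not terminated) and path_len != max_path_length
        if incomplete and discard_incomplete_paths:
            break  # drop the truncated path and stop collecting
        kept += path_len
    return kept


# 1000 epochs of 1000 steps each give one million steps in total.
TOTAL_STEPS = 1000 * 1000
```

For example, if the first evaluation episode ends after 200 steps and the next would run the full 1000, the second path is cut off at the remaining 800-step budget and discarded, so only 200 steps are reported for that epoch.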
I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:
$ ls -lh data/
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Walker2d-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Walker2d-v2
$ ls -lh data/rlkit-ddpg-Ant-v2/
drwxrwxr-x 2 daniel daniel 4.0K Jun 20 20:49 rlkit-ddpg-Ant-v2_2019_06_20_20_49_44_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_20_53_49_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_44_22_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_49_37_0000--s-0
$
// other env results presented in a similar manner
For this I used the following plotting script, which I call like python [script].py Ant-v2 and similarly for the other environments:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import pandas as pd
import os
import numpy as np
from os.path import join

# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
error_region_alpha = 0.25


def smoothed(x, w):
    """Smooth x by averaging over sliding windows of w, assuming sufficient length.
    """
    if len(x) <= w:
        return x
    smooth = []
    for i in range(1, w):
        smooth.append( np.mean(x[0:i]) )
    for i in range(w, len(x)+1):
        smooth.append( np.mean(x[i-w:i]) )
    assert len(x) == len(smooth), "lengths: {}, {}".format(len(x), len(smooth))
    return np.array(smooth)


def plot(args):
    """Load the progress csv file, and plot.

    Plot:
      'exploration/Returns Mean',
      'exploration/num steps total',
      'evaluation/Returns Mean',
      'evaluation/num steps total',
    """
    nrows, ncols = 1, 2
    fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
                           figsize=(11*ncols,6*nrows))

    algorithms = sorted([x for x in os.listdir('data/') if args.env in x])
    assert len(algorithms) == 2
    colors = ['blue', 'red']

    for idx,alg in enumerate(algorithms):
        print('Currently on algorithm: ', alg)
        alg_dir = join('data', alg)
        progfiles = sorted([
            join(alg_dir, x, 'progress.csv') for x in os.listdir(alg_dir)
        ])
        expl_returns = []
        eval_returns = []
        expl_steps = []
        eval_steps = []

        # One progress.csv per random seed.
        for prog in progfiles:
            df = pd.read_csv(prog, delimiter = ',')

            expl_ret = df['exploration/Returns Mean'].tolist()
            expl_returns.append(expl_ret)
            eval_ret = df['evaluation/Returns Mean'].tolist()
            eval_returns.append(eval_ret)

            expl_sp = df['exploration/num steps total'].tolist()
            expl_steps.append(expl_sp)
            eval_sp = df['evaluation/num steps total'].tolist()
            eval_steps.append(eval_sp)

        expl_returns = np.array(expl_returns)
        eval_returns = np.array(eval_returns)
        xs = expl_returns.shape[1]
        expl_ret_mean = np.mean(expl_returns, axis=0)
        eval_ret_mean = np.mean(eval_returns, axis=0)
        # Standard deviation across seeds (the original called np.mean here).
        expl_ret_std = np.std(expl_returns, axis=0)
        eval_ret_std = np.std(eval_returns, axis=0)

        w = 10
        label0 = '{} (w={}), lastavg {:.1f}'.format(
            (alg).replace('rlkit-',''), w, np.mean(expl_ret_mean[-w:]))
        label1 = '{} (w={}), lastavg {:.1f}'.format(
            (alg).replace('rlkit-',''), w, np.mean(eval_ret_mean[-w:]))
        ax[0,0].plot(np.arange(xs), smoothed(expl_ret_mean, w=w),
                     color=colors[idx], label=label0)
        ax[0,1].plot(np.arange(xs), smoothed(eval_ret_mean, w=w),
                     color=colors[idx], label=label1)

        # This can be noisy.
        if False:
            ax[0,0].fill_between(np.arange(xs),
                                 expl_ret_mean-expl_ret_std,
                                 expl_ret_mean+expl_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])
            ax[0,1].fill_between(np.arange(xs),
                                 eval_ret_mean-eval_ret_std,
                                 eval_ret_mean+eval_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])

    for i in range(2):
        ax[0,i].tick_params(axis='x', labelsize=ticksize)
        ax[0,i].tick_params(axis='y', labelsize=ticksize)
        leg = ax[0,i].legend(loc="best", ncol=1, prop={'size':legendsize})
        for legobj in leg.legendHandles:
            legobj.set_linewidth(5.0)
    ax[0,0].set_title('{} (Exploration)'.format(args.env), fontsize=ysize)
    ax[0,1].set_title('{} (Evaluation)'.format(args.env), fontsize=ysize)
    plt.tight_layout()
    figname = 'fig-{}.png'.format(args.env)
    plt.savefig(figname)
    print("\nJust saved: {}".format(figname))


if __name__ == "__main__":
    pp = argparse.ArgumentParser()
    pp.add_argument('env', type=str)
    args = pp.parse_args()
    plot(args)
Here are the curves. Left is the exploration policy, and right is the evaluation policy.
The TL;DR is that TD3 wins on four of the environments, and DDPG wins on the other two. One of the ones TD3 doesn't win is InvertedPendulum but that should be easy to get to 1000 if the hyperparameters are tuned. Also to reiterate the code comments, I do not have standard deviation reported since that would make the plots quite hard to read.
I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!
Hello!
It seems that in the current version of the code future policy entropy is missing from target value.
https://github.com/vitchyr/rlkit/blob/76be8716881d9674082991bf33a65243003144d1/rlkit/torch/sac/sac.py#L105
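For context, one common formulation of the SAC critic target (the variant without a separate value network) includes the entropy term through the next action's log-probability. The sketch below works on plain scalars; the function name, argument names, and default values are illustrative assumptions and not rlkit's API.

```python
def soft_q_target(reward, done, q1_next, q2_next, log_pi_next,
                  alpha=0.2, gamma=0.99):
    """Entropy-regularized TD target for a single transition (scalars only)."""
    # Clipped double-Q estimate of the next state-action value.
    min_q_next = min(q1_next, q2_next)
    # Subtracting alpha * log pi(a'|s') is what adds the future entropy bonus.
    soft_value_next = min_q_next - alpha * log_pi_next
    # No bootstrapping past a terminal transition.
    return reward + (1.0 - float(done)) * gamma * soft_value_next
```

Dropping the `alpha * log_pi_next` term reduces this to an ordinary clipped double-Q target, which is the kind of omission the report describes.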
Currently, the code doesn't work because it does not convert the action into a one-hot vector.
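As an illustration of the missing conversion, a discrete action index can be turned into a one-hot vector as follows. This is a generic sketch with a hypothetical helper name; where such a conversion would belong in rlkit is not stated in the report.

```python
def one_hot(action, num_actions):
    """Return a list of floats with a single 1.0 at position `action`."""
    if not 0 <= action < num_actions:
        raise ValueError("action index out of range")
    vector = [0.0] * num_actions
    vector[action] = 1.0
    return vector
```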
her_sac_gym_fetch_reach.py throws the following error when I run it.
Traceback (most recent call last):
File "examples/her/her_sac_gym_fetch_reach.py", line 130, in <module>
experiment(variant)
File "examples/her/her_sac_gym_fetch_reach.py", line 90, in experiment
algorithm.train()
File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 46, in train
self._train()
File "/home/misha/downloads/rlkit/rlkit/core/batch_rl_algorithm.py", line 84, in _train
self._end_epoch(epoch)
File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 58, in _end_epoch
self._log_stats(epoch)
File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 110, in _log_stats
eval_util.get_generic_path_information(expl_paths),
File "/home/misha/downloads/rlkit/rlkit/core/eval_util.py", line 40, in get_generic_path_information
for p in paths
File "/home/misha/downloads/rlkit/rlkit/core/eval_util.py", line 40, in <listcomp>
for p in paths
File "/home/misha/downloads/rlkit/rlkit/pythonplusplus.py", line 167, in list_of_dicts__to__dict_of_lists
assert set(d.keys()) == set(keys)
AssertionError
This can be fixed by adding the following inside the loop of the list_of_dicts__to__dict_of_lists function:
if 'TimeLimit.truncated' in d:
    del d['TimeLimit.truncated']
However, this is definitely a hack. Probably better to refactor that function in a more principled way.
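A more general alternative, sketched here under the assumption that filling missing entries with a default value is acceptable for logging, is to transpose the dicts without requiring identical key sets. The function name and the `default` parameter are hypothetical and not part of rlkit.

```python
def dicts_to_dict_of_lists(dicts, default=None):
    """Transpose a list of dicts, tolerating keys missing from some entries."""
    # Collect every key that appears anywhere, in first-seen order.
    keys = []
    for d in dicts:
        for key in d:
            if key not in keys:
                keys.append(key)
    # Entries that lack a key contribute `default` instead of failing.
    return {key: [d.get(key, default) for d in dicts] for key in keys}
```

With this approach an env_info that only gains 'TimeLimit.truncated' on the final step no longer triggers an assertion; the earlier steps simply report the default for that key.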
Hi Vitchyr,
I'm following your rlkit-gpu install instructions (i.e., conda env create -f docker/rlkit_gpu/rlkit-env.yml), and I get the following error:
Could not find a version that satisfies the requirement gtimer==1.0.1b5 (from -r /data/repos/rlkit/docker/rlkit_gpu/condaenv.27ccrza6.requirements.txt (line 4)) (from versions: 1.0.0b0, 1.0.0b1, 1.0.0b2, 1.0.0b3, 1.0.0b4, 1.0.0b5)
No matching distribution found for gtimer==1.0.1b5 (from -r /data/repos/rlkit/docker/rlkit_gpu/condaenv.27ccrza6.requirements.txt (line 4))
Please help :-/
Originally posted under another issue, but re-posting for visibility.
Sorry to open this up again, but I am unable to obtain results comparable to the TensorFlow implementation using the master branch. I post the training graphs for the PyTorch and TensorFlow implementations below for comparison. Both results were averaged over 5 seeds.
The TF implementation reaches higher final performance and also learns faster. The shape of its curve also closely matches the graph in the paper, i.e. a quick increase followed by a plateau at around 400 epochs.
Does the PyTorch graph look similar to what you obtained too?
I just want to mention that your repo is awesome. Answering pestering questions from me is not your responsibility : ) and I really appreciate any help here.
When I first run the script with:
python -m examples.her.her_td3_gym_fetch_reach
It raised an error:
Traceback (most recent call last):
File "/home/fcloud/anaconda3/envs/rlkit/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/home/fcloud/anaconda3/envs/rlkit/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/fcloud/workspace/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 88, in <module>
experiment(variant)
File "/home/fcloud/workspace/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 65, in experiment
**variant['algo_kwargs']
TypeError: __init__() missing 2 required keyword-only arguments: 'her_kwargs' and 'td3_kwargs'
So I modified the source code as follows:
her_kwargs = dict(observation_key='observation', desired_goal_key='desired_goal')
td3_kwargs = dict(env=env,
                  qf1=qf1,
                  qf2=qf2,
                  policy=policy,
                  exploration_policy=exploration_policy)
algorithm = HerTd3(
    her_kwargs=her_kwargs,
    td3_kwargs=td3_kwargs,
    replay_buffer=replay_buffer,
    **variant['algo_kwargs']
)
Then the experiment launches successfully, but the results do not seem correct:
Replace the second term from https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/distributions.py#L43-L44 with the equation from
https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/bijectors/tanh.py#L73 .
That is, replace torch.log(1 - value * value + self.epsilon) with 2. * (log(2.) - pre_tanh_value - softplus(-2. * pre_tanh_value)).
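The two expressions agree mathematically because log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)), but the second form never computes 1 - tanh(x)^2 directly and so does not saturate for large |x|. The sketch below compares them using only the standard library; the function names and the epsilon default are illustrative assumptions, not the code from either repository.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def stable_log_one_minus_tanh_sq(pre_tanh_value):
    """log(1 - tanh(x)^2) without ever forming tanh(x)."""
    return 2.0 * (math.log(2.0) - pre_tanh_value
                  - softplus(-2.0 * pre_tanh_value))


def naive_log_one_minus_tanh_sq(pre_tanh_value, epsilon=1e-6):
    """Epsilon-padded form; saturates near log(epsilon) for large |x|."""
    value = math.tanh(pre_tanh_value)
    return math.log(1.0 - value * value + epsilon)
```

For moderate inputs both functions return nearly the same number. For large inputs the naive version flattens out near log(epsilon), while the stable version keeps decreasing roughly linearly in |x|.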
Is there support for multiple workers (threads each feeding in independent samples)?
I'm trying to reproduce the pick and place here, and they mention needing 19 workers to do so: https://github.com/openai/baselines/tree/master/baselines/her
On the other hand, since HER is off-policy, it seems like multiple workers shouldn't be necessary.
When trying to plot the results with viskit, the following error occurs.
Done! View http://localhost:5000 in your browser
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Plot_keys: ['AverageReturn']
split_keys: []
group_keys: []
filters: {}
exclusions: []
[2018-12-29 15:56:30,229] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
response = self.full_dispatch_request()
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
raise value
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/home/vitchyr/git/viskit/viskit/frontend.py", line 762, in index
plot_div = get_plot_instruction(plot_keys=plot_keys)
File "/home/vitchyr/git/viskit/viskit/frontend.py", line 502, in get_plot_instruction
plot_width=plot_width, plot_height=plot_height
File "/home/vitchyr/git/viskit/viskit/frontend.py", line 79, in make_plot
title=title,
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 613, in update
self[key] = val
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 430, in __setitem__
value = self._value_to_graph_object(key, value, _raise=_raise)
File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 535, in _value_to_graph_object
raise exceptions.PlotlyDictValueError(self, path)
plotly.exceptions.PlotlyDictValueError: 'title' has invalid value inside 'layout'
Path To Error: ['layout']['title']
Current path: ['layout']
Current parent object_names: ['figure']
With the current parents, 'title' can be used as follows:
Under ('figure', 'layout'):
editType: layoutstyle
role: object
127.0.0.1 - - [29/Dec/2018 15:56:30] "GET / HTTP/1.1" 500 -
The DDPG class imports the following, but this class is missing. Is it a typo?
from rlkit.torch.torch_rl_algorithm import TorchRLAlgorithm
Also, the initialization function of DDPG doesn't look complete.
Thanks,
Narasimha
The readme says:
README last updated on: 02/19/2018
This date is incorrect in the literal sense of the word "update". It doesn't make sense to have this displayed in the readme. I recommend either removing it, or turning it into a hidden comment.
For example:
<!-- This is how to write a hidden comment. -->
I am using a custom environment, yet it seems that HER depends on multiworld anyway. The documentation says this shouldn't be necessary. Here is the stacktrace:
from rlkit.samplers.data_collector import GoalConditionedPathCollector
File "/root/src/sandbox/rlkit/rlkit/samplers/data_collector/__init__.py", line 6, in <module>
from rlkit.samplers.data_collector.path_collector import (
File "/root/src/sandbox/rlkit/rlkit/samplers/data_collector/path_collector.py", line 3, in <module>
from rlkit.envs.vae_wrapper import VAEWrappedEnv
File "/root/src/sandbox/rlkit/rlkit/envs/vae_wrapper.py", line 11, in <module>
from multiworld.core.multitask_env import MultitaskEnv
ModuleNotFoundError: No module named 'multiworld.core'
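One general way to keep a dependency like this optional is to check for it before importing, so that code paths that do not need it still load. The helper below is a hypothetical sketch of that pattern and is not how rlkit handles the import; it is only intended for top-level package names.

```python
import importlib
import importlib.util


def optional_import(module_name):
    """Return the imported top-level module, or None if it is not installed."""
    # For a top-level name, find_spec returns None instead of raising
    # when the package cannot be found.
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)
```

A wrapper module could then call `optional_import("multiworld")` and raise a descriptive error only when a feature that actually needs the package is used.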
I left it running for a few epochs, several times, to make sure it was not a fluke.
Each time, SAC collapses to always choosing the same action.
replay_buffer/size 210000
trainer/QF1 Loss 1.35779e+19
trainer/QF2 Loss 1.34288e+19
trainer/Policy Loss -2.48799e+10
trainer/Q1 Predictions Mean 2.33888e+10
trainer/Q1 Predictions Std 3.70217e+09
trainer/Q1 Predictions Max 3.68046e+10
trainer/Q1 Predictions Min 1.31057e+10
trainer/Q2 Predictions Mean 2.34333e+10
trainer/Q2 Predictions Std 3.65296e+09
trainer/Q2 Predictions Max 3.66932e+10
trainer/Q2 Predictions Min 1.33272e+10
trainer/Q Targets Mean 2.36857e+10
trainer/Q Targets Std 4.52467e+09
trainer/Q Targets Max 3.54759e+10
trainer/Q Targets Min 0.224346
trainer/Log Pis Mean 0.987727
trainer/Log Pis Std 1.12239
trainer/Log Pis Max 2.15324
trainer/Log Pis Min -4.0056
trainer/Policy mu Mean 1.52476
trainer/Policy mu Std 0.0895151
trainer/Policy mu Max 1.62818
trainer/Policy mu Min 1.37598
trainer/Policy log std Mean -0.582497
trainer/Policy log std Std 0.0243203
trainer/Policy log std Max -0.492316
trainer/Policy log std Min -0.640244
trainer/Alpha 5.56742e+08
trainer/Alpha Loss 0.247146
exploration/num steps total 2.491e+06
exploration/num paths total 23586
exploration/path length Mean 131.579
exploration/path length Std 57.1612
exploration/path length Max 200
exploration/path length Min 8
exploration/Rewards Mean 0.264324
exploration/Rewards Std 0.149922
exploration/Rewards Max 0.590382
exploration/Rewards Min 0.0141083
exploration/Returns Mean 34.7795
exploration/Returns Std 23.3818
exploration/Returns Max 83.2558
exploration/Returns Min 2.15501
exploration/Actions Mean 0.4906
exploration/Actions Std 0.0686414
exploration/Actions Max 0.5
exploration/Actions Min -0.5
exploration/Num Paths 38
exploration/Average Returns 34.7795
exploration/env_infos/final/time Mean 0.342105
exploration/env_infos/final/time Std 0.285806
exploration/env_infos/final/time Max 0.96
exploration/env_infos/final/time Min 0
exploration/env_infos/initial/time Mean 0.995
exploration/env_infos/initial/time Std 3.33067e-16
exploration/env_infos/initial/time Max 0.995
exploration/env_infos/initial/time Min 0.995
exploration/env_infos/time Mean 0.606472
exploration/env_infos/time Std 0.263458
exploration/env_infos/time Max 0.995
exploration/env_infos/time Min 0
evaluation/num steps total 2.45463e+06
evaluation/num paths total 21675
evaluation/path length Mean 115.452
evaluation/path length Std 52.2554
evaluation/path length Max 200
evaluation/path length Min 9
evaluation/Rewards Mean 0.248655
evaluation/Rewards Std 0.0242211
evaluation/Rewards Max 0.294154
evaluation/Rewards Min 0.193703
evaluation/Returns Mean 28.7078
evaluation/Returns Std 12.9204
evaluation/Returns Max 52.5658
evaluation/Returns Min 2.53809
evaluation/Actions Mean 0.5
evaluation/Actions Std 0
evaluation/Actions Max 0.5
evaluation/Actions Min 0.5
evaluation/Num Paths 42
evaluation/Average Returns 28.7078
evaluation/env_infos/final/time Mean 0.422738
evaluation/env_infos/final/time Std 0.261277
evaluation/env_infos/final/time Max 0.955
evaluation/env_infos/final/time Min 0
evaluation/env_infos/initial/time Mean 0.995
evaluation/env_infos/initial/time Std 2.22045e-16
evaluation/env_infos/initial/time Max 0.995
evaluation/env_infos/initial/time Min 0.995
evaluation/env_infos/time Mean 0.64974
evaluation/env_infos/time Std 0.245087
evaluation/env_infos/time Max 0.995
evaluation/env_infos/time Min 0
time/data storing (s) 0.0476881
time/evaluation sampling (s) 13.4834
time/exploration sampling (s) 15.2477
time/logging (s) 0.0254512
time/saving (s) 0.0218989
time/training (s) 111.327
time/epoch (s) 140.153
time/total (s) 68869.7
Epoch 497
I am running it from master. Could it be related to the action space being somewhat discrete? The environment discretizes the actions into 'x' states based on the input data.
I see
save_video=True,
save_video_period=100,
I can only see the png files in the data folder. How or where can I find the video files?
Fixed by openai/mujoco-py#188.
Use gym 0.7.
Other:
conda install pytorch=0.3.0 torchvision -c pytorch
pip install gym==0.7.3
export PYTHONPATH='.'
pip install mujoco-py==0.5.7
pip install gym==0.7.3
conda install pytorch=0.3.0 torchvision -c pytorch
Is there any easy way to get the S3 path? I'd like to have a script that downloads all the recent S3 results instead of copying and pasting each time, and that would require having the S3 path returned each time an experiment is uploaded.
Hi,
I skimmed over the author's implementation and it seems that they don't use the value network. Instead they only use the Q-networks. Seems they removed it in this commit
Thanks,
Lukas
Failing to build mujoco-py. Initially failed due to missing 'swig' and 'patchelf', which was manually resolved. Now failing on 'glfw'. Calling 'import glfw' within (rlkit) conda environment works, though.
File "/tmp/pip-install-0n9doxzn/mujoco-py/setup.py", line 28, in run
import mujoco_py # noqa: force build
File "/tmp/pip-install-0n9doxzn/mujoco-py/mujoco_py/__init__.py", line 6, in <module>
from mujoco_py.mjviewer import MjViewer, MjViewerBasic
File "/tmp/pip-install-0n9doxzn/mujoco-py/mujoco_py/mjviewer.py", line 2, in <module>
import glfw
ImportError: No module named 'glfw'
We observed this issue on 2 machines. Thanks!
I had a question about the way evaluation statistics are computed for SAC - from taking a look at the code, it seems as though the statistics will only be computed over one particular training batch every epoch (https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/sac.py#L161), is this true? I'd imagine that this measurement would be pretty high variance, as opposed to averaging the statistics over all batches in the epoch. Could you clarify if this is the case and if so, why you've implemented logging in this way?
Hi,
Any plan on updating the classes to be compatible with pytorch 1.0?
For soft actor-critic, how do we specify the reward scale?
How do we specify it in examples/sac.py?
Many algorithms import MuJoCo environments because the two are not separated. In my case I don't care about MuJoCo; in fact, I had to get a trial license just to avoid having to remove code from my fork.
Hi Vitchyr! Thanks for your inspiring code. Do you mind putting a license file in the repo so that we can use your code for other projects?
Hi,
I'm seeing this error appear consistently when I run the "DQN_and_double_DQN" example file out of the box, somewhere around epoch 15 or so. From what I can tell, this is because once the policy is good enough for the episode to last until timeout (200 steps), the gym environment (I'm on 0.12.5) returns '{'TimeLimit.truncated': True}' in the env_info dictionary, which is inconsistent with the prior observations. Sometime during logging, this inconsistency causes an issue. It seems related to #12 on the surface, but I'm not sure why this only seems to be a problem for me.
Please find the stacktrace below, happy to provide more details or my installed package version numbers if that would help.
File "rlkit/rlkit/pythonplusplus.py", line 165, in list_of_dicts__to__dict_of_lists
assert set(d.keys()) == set(keys)
File "rlkit/rlkit/core/eval_util.py", line 40, in <listcomp>
for p in paths
File "rlkit/rlkit/core/eval_util.py", line 40, in get_generic_path_information
for p in paths
File "rlkit/rlkit/core/rl_algorithm.py", line 127, in _log_stats
eval_util.get_generic_path_information(eval_paths),
File "rlkit/rlkit/core/rl_algorithm.py", line 58, in _end_epoch
self._log_stats(epoch)
File "rlkit/rlkit/core/batch_rl_algorithm.py", line 84, in _train
self._end_epoch(epoch)
File "rlkit/rlkit/core/rl_algorithm.py", line 46, in train
self._train()
File "rlkit/examples/dqn_and_double_dqn.py", line 71, in experiment
algorithm.train()
File "rlkit/examples/dqn_and_double_dqn.py", line 99, in <module>
experiment(variant)
I ran the experiment with RIG + pusher with the original settings.
Contrary to the paper, I cannot observe any improvement of the average return or success rate.
How can I reproduce the original paper results?
Output after 100 epochs (388004 iterations)
hand_distance Mean 0.0369142
hand_distance Std 0.0125764
hand_distance Max 0.153462
hand_distance Min 0.0280608
Final hand_distance Mean 0.0406867
Final hand_distance Std 0.00171668
Final hand_distance Max 0.0432108
Final hand_distance Min 0.0369941
puck_distance Mean 0.14895
puck_distance Std 0.0542304
puck_distance Max 0.234878
puck_distance Min 0.0381613
Final puck_distance Mean 0.150567
Final puck_distance Std 0.0544439
Final puck_distance Max 0.234878
Final puck_distance Min 0.0528324
touch_distance Mean 0.0677295
touch_distance Std 0.00900345
touch_distance Max 0.105978
touch_distance Min 0.0565952
Final touch_distance Mean 0.0690568
Final touch_distance Std 0.00635284
Final touch_distance Max 0.0836148
Final touch_distance Min 0.0630353
success Mean 0
success Std 0
success Max 0
success Min 0
Final success Mean 0
Final success Std 0
Final success Max 0
Final success Min 0
QF1 Loss 0.171023
QF2 Loss 0.18778
Policy Loss 104.349
Q1 Predictions Mean -104.665
Q1 Predictions Std 88.4167
Q1 Predictions Max -4.22522
Q1 Predictions Min -336.314
Q2 Predictions Mean -104.558
Q2 Predictions Std 88.3845
Q2 Predictions Max -4.09563
Q2 Predictions Min -335.829
Q Targets Mean -104.699
Q Targets Std 88.4681
Q Targets Max -4.3236
Q Targets Min -336.498
Bellman Errors 1 Mean 0.171023
Bellman Errors 1 Std 0.416491
Bellman Errors 1 Max 3.98678
Bellman Errors 1 Min 1.80304e-06
Bellman Errors 2 Mean 0.18778
Bellman Errors 2 Std 0.355425
Bellman Errors 2 Max 2.20607
Bellman Errors 2 Min 3.32488e-06
Policy Action Mean 0.194368
Policy Action Std 0.731224
Policy Action Max 1
Policy Action Min -1
Test Rewards Mean -0.360232
Test Rewards Std 0.594497
Test Rewards Max -0.00478052
Test Rewards Min -5.77281
Test Returns Mean -36.0232
Test Returns Std 22.2522
Test Returns Max -13.6433
Test Returns Min -92.884
Test Actions Mean 0.0276301
Test Actions Std 0.247718
Test Actions Max 1
Test Actions Min -0.838328
Num Paths 10
Exploration Rewards Mean -1.33633
Exploration Rewards Std 0.762848
Exploration Rewards Max -0.225774
Exploration Rewards Min -5.57678
Exploration Returns Mean -133.633
Exploration Returns Std 60.5544
Exploration Returns Max -42.4458
Exploration Returns Min -214.126
Exploration Actions Mean 0.262311
Exploration Actions Std 0.44695
Exploration Actions Max 1
Exploration Actions Min -1
image_dist Mean 10.9785
image_dist Std 1.05965
image_dist Max 14.3582
image_dist Min 9.29165
Final image_dist Mean 10.8799
Final image_dist Std 1.03786
Final image_dist Max 12.6932
Final image_dist Min 9.39808
image_success Mean -0.727273
image_success Std 0.445362
image_success Max 0
image_success Min -1
Final image_success Mean -0.727273
Final image_success Std 0.445362
Final image_success Max 0
Final image_success Min -1
vae_dist Mean 0.360232
vae_dist Std 0.594497
vae_dist Max 5.77281
vae_dist Min 0.00478052
Final vae_dist Mean 0.216473
Final vae_dist Std 0.096578
Final vae_dist Max 0.361867
Final vae_dist Min 0.0465295
AverageReturn -36.0232
Number of train steps total 388004
Number of env steps total 101000
Number of rollouts total 1010
Train Time (s) 323.224
(Previous) Eval Time (s) 123.294
Sample Time (s) 115.072
Epoch Time (s) 561.59
Total Train Time (s) 29454.1
Epoch 100
Evolution of the average return:
The Docker image installs mujoco_py successfully. I can run a new Docker container, type "import mujoco_py", and it builds the Cython code.
However, when I launch a container programmatically with doodad, for some reason it keeps erroring and trying to rebuild the Cython code:
Running in docker
Import error. Trying to rebuild mujoco_py.
running build_ext
building 'mujoco_py.cymj' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/root/.mujoco/mjpro150/include -I/env/lib/python3.6/site-packages/numpy/core/include -I/usr/include/python3.6m -I/env/include/python3.6m -c /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.c -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o -fopenmp -w
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/root/.mujoco/mjpro150/include -I/env/lib/python3.6/site-packages/numpy/core/include -I/usr/include/python3.6m -I/env/include/python3.6m -c /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.c -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -fopenmp -w
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -L/root/.mujoco/mjpro150/bin -Wl,--enable-new-dtags,-R/root/.mujoco/mjpro150/bin -lmujoco150 -lglewosmesa -lOSMesa -lGL -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6/mujoco_py/cymj.cpython-36m-x86_64-linux-gnu.so -fopenmp
/usr/bin/ld: cannot find -lmujoco150
/usr/bin/ld: cannot find -lglewosmesa
I'm sure osmesa is installed, and mujoco is downloaded and installed. The LD path looks correct:
echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib64:/root/.mujoco/mjpro150/bin:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@e6131e04a34c:~/.mujoco# ls
mjkey.txt mjpro150
root@e6131e04a34c:~/.mujoco# ls mjpro150/
bin doc include model sample
I cloned the repo, set up the environment, and ran the following (with no changes):
python her_sac_gym_fetch_reach.py
The results don't seem to match with this. Did something break in the latest commit?
However, when I try the TD3 example, it works fine:
python her_td3_multiworld_sawyer_reach.py
Whenever I run the example scripts (and my own), they fail with file-not-found errors because run_experiment expects "scripts" to be in the rlkit/rlkit directory, not at the root level.
Fatal Python error: Segmentation fault
Current thread 0x00007fc656ead700 (most recent call first):
File "/Users/richard/existing_codebases/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 26 in experiment
File "/Users/richard/existing_codebases/rlkit/rlkit/launchers/launcher_util.py", line 172 in run_experiment_here
File "/mounts/target/scripts/run_experiment_from_doodad.py", line 46 in <module>
/bin/bash: line 1: 9 Segmentation fault DOODAD_ARGS_DATA=...= DOODAD_USE_CLOUDPICKLE=1 DOODAD_CLOUDPICKLE_VERSION=0.5.2 python /mounts/target/scripts/run_experiment_from_doodad.py
I am not issuing a PR for this one because it uses the modified Dataset-based version of the trainer, but I am making it available through this issue in case you are interested in it. The idea is that if you have a big dataset, you want to avoid filling all your memory. Memory-mapped files allow replay buffers of arbitrary size:
import os
from collections import OrderedDict

import numpy as np
import torch
from torch.utils.data import Dataset

from rlkit.data_management.replay_buffer import ReplayBuffer


class OptimizedDiskSequentialReplayBuffer(ReplayBuffer, Dataset):
    def __init__(
            self,
            max_replay_buffer_size,
            observation_shape,
            observation_dtype,
            action_dim,
            env_info_sizes,
            location='.',
            name='replay_buffer'
    ):
        self._observation_shape = observation_shape
        self._observation_dtype = observation_dtype
        self._action_dim = action_dim
        self.max_buffer_size = max_replay_buffer_size
        name = os.path.join(location, name)
        self._observations = np.memmap(f'{name}.obs', dtype=observation_dtype, mode='w+', shape=(max_replay_buffer_size,) + observation_shape)
        # It's a bit memory inefficient to save the observations twice, but it makes the code *much* easier since you no longer have to
        # worry about termination conditions.
        self._next_obs = np.memmap(f'{name}.nextobs', dtype=observation_dtype, mode='w+', shape=(max_replay_buffer_size,) + observation_shape)
        self._actions = np.memmap(f'{name}.actions', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, action_dim))
        # Make everything a 2D np array to make it easier for other code to reason about the shape of the data
        self._rewards = np.memmap(f'{name}.rewards', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, 1))
        # self._terminals[i] = a terminal was received at time i
        self._terminals = np.memmap(f'{name}.terminals', dtype=np.uint8, mode='w+', shape=(max_replay_buffer_size, 1))
        # Define self._env_infos[key][i] to be the return value of env_info[key]
        # at time i
        self._env_infos = {}
        for key, size in env_info_sizes.items():
            self._env_infos[key] = np.memmap(f'{name}.{key}', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, size))
        self._env_info_keys = env_info_sizes.keys()
        self._top = 0
        self._size = 0

    def __getitem__(self, index):
        observations = torch.from_numpy(self._observations[index])
        actions = torch.from_numpy(self._actions[index])
        rewards = torch.from_numpy(self._rewards[index])
        terminals = torch.from_numpy(self._terminals[index])
        next_observations = torch.from_numpy(self._next_obs[index])
        env_infos = dict()
        for key in self._env_info_keys:
            # Index into the array so only the requested sample is returned.
            env_infos[key] = torch.from_numpy(self._env_infos[key][index])
        return observations, actions, rewards, terminals, next_observations, env_infos

    def __len__(self):
        return self._size

    def add_sample(self, observation, action, reward, next_observation,
                   terminal, env_info, **kwargs):
        # Pick a random occupied slot, move its contents to the top slot,
        # and store the new sample in the freed slot.
        if self._size == 0:
            location = 0
        else:
            location = np.random.randint(0, self._size)
        self._observations[self._top] = self._observations[location]
        self._observations[location] = observation
        self._actions[self._top] = self._actions[location]
        self._actions[location] = action
        self._rewards[self._top] = self._rewards[location]
        self._rewards[location] = reward
        self._terminals[self._top] = self._terminals[location]
        # np.float was removed from recent NumPy releases; the storage is uint8 anyway.
        self._terminals[location] = np.asarray(terminal, dtype=np.uint8)
        self._next_obs[self._top] = self._next_obs[location]
        self._next_obs[location] = next_observation
        for key in self._env_info_keys:
            array = self._env_infos[key]
            array[self._top] = array[location]
            array[location] = env_info[key]
        self._advance()

    def terminate_episode(self):
        pass

    def _advance(self):
        self._top = (self._top + 1) % self.max_buffer_size
        if self._size < self.max_buffer_size:
            self._size += 1

    def random_batch(self, batch_size):
        raise NotImplementedError("You shouldn't use this one on this buffer")

    def rebuild_env_info_dict(self, idx):
        return {
            key: self._env_infos[key][idx]
            for key in self._env_info_keys
        }

    def batch_env_info_dict(self, indices):
        return {
            key: self._env_infos[key][indices]
            for key in self._env_info_keys
        }

    def num_steps_can_sample(self):
        return self._size

    def get_diagnostics(self):
        return OrderedDict([
            ('size', self._size)
        ])
I read this buffer in a sequential fashion, so I randomize at insertion time instead.
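The insertion-time randomization can be illustrated on a plain list. The sketch below is the "inside-out" shuffle: each new item swaps places with a uniformly chosen slot, so a later sequential read sees a shuffled order. It is a simplified stand-in with a hypothetical function name. Unlike the buffer above, it lets the new item stay in the newest slot and does not wrap around when full.

```python
import random


def insert_randomized(buffer, item, rng=random):
    """Append `item`, then swap it with a uniformly chosen slot."""
    buffer.append(item)
    # Any slot is eligible, including the one just appended.
    j = rng.randrange(len(buffer))
    buffer[-1], buffer[j] = buffer[j], buffer[-1]
```

Usage: call `insert_randomized(buf, sample)` for every incoming sample, then iterate over `buf` from start to end during training.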
This seems to happen whenever len(self._exploration_paths) == 1 because then create_stats_ordered_dict doesn't return std even though it might in another iteration.
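One way to avoid this kind of mismatch is for the statistics helper to always emit the same set of keys, regardless of how many values it receives. The sketch below is a hypothetical stand-in, not the actual create_stats_ordered_dict; it reports a standard deviation of 0.0 for a single value.

```python
import statistics


def summary_stats(name, values):
    """Return Mean/Std/Max/Min under fixed keys for one or more values."""
    values = list(values)
    return {
        name + ' Mean': statistics.mean(values),
        # Population std is defined (and zero) for a single value.
        name + ' Std': statistics.pstdev(values),
        name + ' Max': max(values),
        name + ' Min': min(values),
    }
```

Because the key set no longer depends on the number of paths, a logger that expects identical columns every epoch would not see a change when only one path is collected.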
What's the purpose of making local mode run through doodad?
It's making it very hard to debug because stuff like PyCharm doesn't work well trying to hook into subprocesses...
I can submit a PR to make local mode run within rlkit
The conda install finally succeeded.. T_T
(rlkit) cww97@MAIL-ThinkPad:~/rlkit$ python examples/ddpg.py
Traceback (most recent call last):
File "examples/ddpg.py", line 6, in <module>
from rlkit.envs.wrappers import NormalizedBoxEnv
ImportError: No module named 'rlkit'
I know this is a Python question, but it seems I have not been able to fix it for years.
How can I import it correctly... T_T
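This error typically means the repository root (the directory containing the rlkit package) is not on Python's module search path when the script is started from inside examples/. Running the script as a module from the repository root (python -m examples.ddpg) or setting PYTHONPATH to the root are common remedies. As a sketch of the same idea in code, a script could add the root itself; the helper name and the `levels_up` parameter are hypothetical.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file, levels_up=1):
    """Put the directory `levels_up` levels above the script's folder on sys.path."""
    # parents[0] is the script's own folder; parents[1] is one level above it.
    root = Path(script_file).resolve().parents[levels_up]
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

In a script such as examples/ddpg.py this would be called as `add_repo_root_to_path(__file__)` before the first `from rlkit...` import.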
Hi Vitchyr!
There appears to be an issue in the exploration statistics. In online training, observation = self._start_new_rollout() is called at each new epoch. However, the PathBuilder is not reset at the end of an epoch, so the first trajectory of each new epoch is appended to the last trajectory of the previous epoch. This results in wrong exploration return statistics. I don't believe the rest of the training is affected.
One solution could be to call self._handle_rollout_ending at the end of each epoch. This would, however, also yield returns of episodes that are cut short, but I'd say that's preferable. Shall I file a PR?
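The effect can be reproduced with a toy model of the bookkeeping. The sketch below is not rlkit code; the function and its `reset_at_epoch_end` flag are hypothetical and only mimic a path accumulator that is, or is not, flushed when an epoch ends and the environment is restarted.

```python
def returns_per_path(steps_by_epoch, reset_at_epoch_end):
    """Sum rewards per recorded path; each epoch is a list of (reward, done)."""
    returns, current = [], []
    for steps in steps_by_epoch:
        for reward, done in steps:
            current.append(reward)
            if done:
                returns.append(sum(current))
                current = []
        if reset_at_epoch_end and current:
            # Report the truncated path instead of carrying it over.
            returns.append(sum(current))
            current = []
    return returns
```

With two epochs of two steps each, where the episode only terminates at the end of the second epoch, the non-resetting variant reports a single return of 4 even though the environment was restarted in between, while the resetting variant reports two returns of 2.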
Nice work on this repo!
Best,
Pim
Traceback (most recent call last):
File "oracle.py", line 70, in <module>
use_gpu=True, # Turn on if you have a GPU
File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 594, in run_experiment
**run_experiment_kwargs
File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 175, in run_experiment_here
return experiment_function(variant)
File "/home/cww97/rlkit/rlkit/launchers/state_based_goal_experiments.py", line 108, in her_td3_experiment
algorithm.train()
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 146, in train
self.train_online(start_epoch=start_epoch)
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 167, in train_online
observation = self._take_step_in_env(observation)
File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 206, in _take_step_in_env
self.training_env.render()
File "/home/cww97/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 124, in render
self._get_viewer().render()
File "/home/cww97/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 133, in _get_viewer
self.viewer = mujoco_py.MjViewer(self.sim)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 133, in __init__
super().__init__(sim)
File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 26, in __init__
super().__init__(sim)
File "mujoco_py/mjrendercontext.pyx", line 278, in mujoco_py.cymj.MjRenderContextWindow.__init__
File "mujoco_py/mjrendercontext.pyx", line 66, in mujoco_py.cymj.MjRenderContext.__init__
File "mujoco_py/mjrendercontext.pyx", line 87, in mujoco_py.cymj.MjRenderContext._set_mujoco_buffers
RuntimeError: Window rendering not supported
ummmm