
rlkit's Introduction

RLkit

Reinforcement learning framework and algorithms implemented in PyTorch.

Implemented algorithms (see References below for the corresponding papers):

  • Skew-Fit
  • Reinforcement Learning with Imagined Goals (RIG)
  • Temporal Difference Models (TDMs) (legacy, see v0.1.2)
  • Hindsight Experience Replay (HER)
  • (Double) DQN
  • Soft Actor-Critic (SAC)
  • Twin Delayed DDPG (TD3)

To get started, check out the example scripts in the examples/ directory.

What's New

Version 0.2

04/25/2019

  • Use new multiworld code that requires explicit environment registration.
  • Make installation easier by adding setup.py and using default conf.py.

04/16/2019

  • Log how many train steps were taken.
  • Log env_info and agent_info.

04/05/2019-04/15/2019

  • Add rendering
  • Fix SAC bug to account for future entropy (#41, #43)
  • Add online algorithm mode (#42)

04/05/2019

The initial release for 0.2 has the following major changes:

  • Remove Serializable class and use default pickle scheme.
  • Remove PyTorchModule class and use native torch.nn.Module directly.
  • Switch to batch-style training rather than online training.
    • Makes code more amenable to parallelization.
    • Implementing the online-version is straightforward.
  • Refactor training code to be its own object, rather than being integrated inside of RLAlgorithm.
  • Refactor sampling code to be its own object, rather than being integrated inside of RLAlgorithm.
  • Implement Skew-Fit: State-Covering Self-Supervised Reinforcement Learning, a method for performing goal-directed exploration to maximize the entropy of visited states.
  • Update soft actor-critic to more closely match TensorFlow implementation:
    • Rename TwinSAC to just SAC.
    • Only have Q networks.
    • Remove unnecessary policy regularization terms.
    • Use numerically stable Jacobian computation (see the sketch below).
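
For reference, here is a minimal sketch of the numerically stable tanh-Jacobian correction that such implementations typically use. It shows the standard identity log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)); it is not necessarily the exact code in rlkit:

import math

import torch
import torch.nn.functional as F

def tanh_log_det_jacobian(pre_tanh_value: torch.Tensor) -> torch.Tensor:
    """Per-dimension log|d tanh(u)/du| = log(1 - tanh(u)^2), computed without
    forming 1 - tanh(u)^2, which underflows for large |u|."""
    u = pre_tanh_value
    return 2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))

# Usage: subtract this (summed over action dimensions) from the Gaussian
# log-density of the pre-tanh action to get the log-density of the squashed action.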

Overall, the refactors are intended to make the code more modular and readable than the previous versions.

Version 0.1

12/04/2018

  • Add RIG implementation

12/03/2018

  • Add HER implementation
  • Add doodad support

10/16/2018

  • Upgraded to PyTorch v0.4
  • Added Twin Soft Actor Critic Implementation
  • Various small refactors (e.g., logger, evaluation code)

Installation

  1. Install and use the included Anaconda environment:
$ conda env create -f environment/[linux-cpu|linux-gpu|mac]-env.yml
$ source activate rlkit
(rlkit) $ python examples/ddpg.py

Choose the appropriate .yml file for your system. These Anaconda environments use MuJoCo 1.5 and gym 0.10.5. You'll need to get your own MuJoCo key if you want to use MuJoCo.

  2. Add this repo directory to your PYTHONPATH environment variable or simply run:
pip install -e .
  3. (Optional) Copy conf.py to conf_private.py and edit to override defaults:
cp rlkit/launchers/conf.py rlkit/launchers/conf_private.py
  4. (Optional) If you plan on running the Skew-Fit experiments or the HER example with the Sawyer environment, then you need to install multiworld.

DISCLAIMER: the mac environment has only been tested without a GPU.

For an even more portable solution, try using the docker image provided in environment/docker. The Anaconda env should be enough, but this docker image addresses some of the rendering issues that may arise when using MuJoCo 1.5 and GPUs. The docker image supports GPU, but it should work without a GPU. To use a GPU with the image, you need to have nvidia-docker installed.

Using a GPU

You can use a GPU by calling

import rlkit.torch.pytorch_util as ptu
ptu.set_gpu_mode(True)

before launching the scripts.

If you are using doodad (see below), simply use the use_gpu flag:

run_experiment(..., use_gpu=True)

Visualizing a policy and seeing results

During training, the results will be saved to a directory of the form

LOCAL_LOG_DIR/<exp_prefix>/<foldername>
  • LOCAL_LOG_DIR is the directory set by rlkit.launchers.config.LOCAL_LOG_DIR. Default name is 'output'.
  • <exp_prefix> is the experiment prefix passed to setup_logger (see the sketch below).
  • <foldername> is auto-generated and based off of exp_prefix.
  • Inside this folder, you should see a file called params.pkl. To visualize a policy, run
(rlkit) $ python scripts/run_policy.py LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

or

(rlkit) $ python scripts/run_goal_conditioned_policy.py LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

depending on whether or not the policy is goal-conditioned.
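
For reference, a minimal sketch of how an example script typically passes the experiment prefix to setup_logger (the experiment name and variant below are placeholders):

from rlkit.launchers.launcher_util import setup_logger

variant = dict(learning_rate=1e-3)  # placeholder variant
# <exp_prefix> is "my-experiment-name"; logs then go to
# LOCAL_LOG_DIR/my-experiment-name/<auto-generated foldername>/
setup_logger('my-experiment-name', variant=variant)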

If you have rllab installed, you can also visualize the results using rllab's viskit, described at the bottom of this page.

tl;dr run

python rllab/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/

to visualize all experiments with a prefix of exp_prefix. To only visualize a single run, you can do

python rllab/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/<folder name>

Alternatively, if you don't want to clone all of rllab, a repository containing only viskit can be found here. You can similarly visualize results with:

python viskit/viskit/frontend.py LOCAL_LOG_DIR/<exp_prefix>/

This viskit repo also has a few extra nice features, like plotting multiple Y-axis values at once, figure-splitting on multiple keys, and being able to filter hyperparameters out.

Visualizing a goal-conditioned policy

To visualize a goal-conditioned policy, run

(rlkit) $ python scripts/run_goal_conditioned_policy.py
LOCAL_LOG_DIR/<exp_prefix>/<foldername>/params.pkl

Launching jobs with doodad

The run_experiment function makes it easy to run Python code on Amazon Web Services (AWS) or Google Cloud Platform (GCP) by using this fork of doodad.

It's as easy as:

from rlkit.launchers.launcher_util import run_experiment

def function_to_run(variant):
    learning_rate = variant['learning_rate']
    ...

run_experiment(
    function_to_run,
    exp_prefix="my-experiment-name",
    mode='ec2',  # or 'gcp'
    variant={'learning_rate': 1e-3},
)

You will need to set up parameters in conf_private.py (see the Installation section above). This requires some knowledge of AWS and/or GCP, which is beyond the scope of this README. To learn more about doodad, go to the repository, which is based on this original repository.

Requests for pull-requests

  • Implement policy-gradient algorithms.
  • Implement model-based algorithms.

Legacy Code (v0.1.2)

For Temporal Difference Models (TDMs) and the original implementation of Reinforcement Learning with Imagined Goals (RIG), run git checkout tags/v0.1.2.

References

The algorithms are based on the following papers:

Offline Meta-Reinforcement Learning with Online Self-Supervision. Vitchyr H. Pong, Ashvin Nair, Laura Smith, Catherine Huang, Sergey Levine. arXiv preprint, 2021.

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. Vitchyr H. Pong*, Murtaza Dalal*, Steven Lin*, Ashvin Nair, Shikhar Bahl, Sergey Levine. ICML, 2020.

Visual Reinforcement Learning with Imagined Goals. Ashvin Nair*, Vitchyr Pong*, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine. NeurIPS 2018.

Temporal Difference Models: Model-Free Deep RL for Model-Based Control. Vitchyr Pong*, Shixiang Gu*, Murtaza Dalal, Sergey Levine. ICLR 2018.

Hindsight Experience Replay. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba. NeurIPS 2017.

Deep Reinforcement Learning with Double Q-learning. Hado van Hasselt, Arthur Guez, David Silver. AAAI 2016.

Human-level control through deep reinforcement learning. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis. Nature 2015.

Soft Actor-Critic Algorithms and Applications. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. arXiv preprint, 2018.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. ICML, 2018.

Addressing Function Approximation Error in Actor-Critic Methods. Scott Fujimoto, Herke van Hoof, David Meger. ICML, 2018.

Credits

This repository was initially developed primarily by Vitchyr Pong until July 2021, at which point it was transferred to the RAIL Berkeley organization and is now primarily maintained by Ashvin Nair. Other major collaborators and contributors are listed under Contributors below.

A lot of the coding infrastructure is based on rllab. The serialization and logger code are basically a carbon copy of the rllab versions.

The Dockerfile is based on the OpenAI mujoco-py Dockerfile.

The SMAC code builds off of the PEARL code, which built off of an older RLKit version.

rlkit's People

Contributors

anair13, cdevin, dwiel, harshakokel, hartikainen, hexiang-hu, ianchar, ksluck, mihaic, redknightlois, richardrl, seungjaeryanlee, shuternay, snasiriany, st2yang, vitchyr, vuoristo, xingyu-lin, yangrui2015


rlkit's Issues

ImportError: No module named 'rlkit'

Finally, I got the conda environment installed successfully.

(rlkit) cww97@MAIL-ThinkPad:~/rlkit$ python examples/ddpg.py 
Traceback (most recent call last):
  File "examples/ddpg.py", line 6, in <module>
    from rlkit.envs.wrappers import NormalizedBoxEnv
ImportError: No module named 'rlkit'

I know this is a basic Python question, but I haven't been able to fix it.

How can I import rlkit correctly?

HER example scripts segfault when running "local_docker" or "ec2" mode

Fatal Python error: Segmentation fault

Current thread 0x00007fc656ead700 (most recent call first):
  File "/Users/richard/existing_codebases/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 26 in experiment
  File "/Users/richard/existing_codebases/rlkit/rlkit/launchers/launcher_util.py", line 172 in run_experiment_here
  File "/mounts/target/scripts/run_experiment_from_doodad.py", line 46 in <module>
/bin/bash: line 1:     9 Segmentation fault      DOODAD_ARGS_DATA=...= DOODAD_USE_CLOUDPICKLE=1 DOODAD_CLOUDPICKLE_VERSION=0.5.2 python /mounts/target/scripts/run_experiment_from_doodad.py

FetchReach her example fails - easy to fix though

The her_sac_gym_fetch_reach.py example throws the following error when I run it.

Traceback (most recent call last):
  File "examples/her/her_sac_gym_fetch_reach.py", line 130, in <module>
    experiment(variant)
  File "examples/her/her_sac_gym_fetch_reach.py", line 90, in experiment
    algorithm.train()
  File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 46, in train
    self._train()
  File "/home/misha/downloads/rlkit/rlkit/core/batch_rl_algorithm.py", line 84, in _train
    self._end_epoch(epoch)
  File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 58, in _end_epoch
    self._log_stats(epoch)
  File "/home/misha/downloads/rlkit/rlkit/core/rl_algorithm.py", line 110, in _log_stats
    eval_util.get_generic_path_information(expl_paths),
  File "/home/misha/downloads/rlkit/rlkit/core/eval_util.py", line 40, in get_generic_path_information
    for p in paths
  File "/home/misha/downloads/rlkit/rlkit/core/eval_util.py", line 40, in <listcomp>
    for p in paths
  File "/home/misha/downloads/rlkit/rlkit/pythonplusplus.py", line 167, in list_of_dicts__to__dict_of_lists
    assert set(d.keys()) == set(keys)
AssertionError

This can be fixed by modifying the iterator in the list_of_dicts__to__dict_of_lists function with:

if 'TimeLimit.truncated' in d:
    del d['TimeLimit.truncated']

However, this is definitely a hack. Probably better to refactor that function in a more principled way.
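
For what it's worth, here is a minimal sketch of one more principled variant, assuming the same input format as rlkit.pythonplusplus.list_of_dicts__to__dict_of_lists (a list of dicts): instead of asserting that every dict has identical keys, it keeps only the keys common to all dicts, which sidesteps extras such as 'TimeLimit.truncated'.

def list_of_dicts__to__dict_of_lists(lst):
    if len(lst) == 0:
        return {}
    # Intersect the key sets so that optional keys added by gym wrappers
    # (e.g. 'TimeLimit.truncated') do not trigger an assertion error.
    common_keys = set(lst[0].keys())
    for d in lst[1:]:
        common_keys &= set(d.keys())
    return {key: [d[key] for d in lst] for key in common_keys}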

Dataset-based Trainer

This example dataset-based trainer also does expert signal recollection, which is why I didn't open a PR; I'll let you decide which parts make sense for rlkit.

import abc

import torch
import gtimer as gt
from torch.utils.data import Dataset
from tqdm import tqdm

import rlkit.torch.pytorch_util as ptu
from rlkit.core.rl_algorithm import BaseRLAlgorithm
from rlkit.data_management.replay_buffer import ReplayBuffer
from rlkit.samplers.data_collector import PathCollector
# Note: imports added for readability; module paths follow the rlkit layout at
# the time. 'printout' and 'AtariPathCollectorWithEmbedder' used below come
# from the poster's own codebase and are not part of rlkit.

class OptimizedBatchRLAlgorithm(BaseRLAlgorithm, metaclass=abc.ABCMeta):
    def __init__(
            self,
            trainer,
            exploration_env,
            evaluation_env,
            exploration_data_collector: PathCollector,
            evaluation_data_collector: PathCollector,
            replay_buffer: ReplayBuffer,
            batch_size,
            max_path_length,
            num_epochs,
            num_eval_steps_per_epoch,
            num_expl_steps_per_train_loop,
            num_trains_per_train_loop,
            num_train_loops_per_epoch=1,
            min_num_steps_before_training=0,
            max_num_steps_before_training=1e5,
            expert_data_collector: PathCollector = None,
    ):
        super().__init__(
            trainer,
            exploration_env,
            evaluation_env,
            exploration_data_collector,
            evaluation_data_collector,
            replay_buffer,
        )

        assert isinstance(replay_buffer, Dataset), "The replay buffers must be compatible with Pytorch Dataset to use this version."

        self.batch_size = batch_size
        self.max_path_length = max_path_length
        self.num_epochs = num_epochs
        self.num_eval_steps_per_epoch = num_eval_steps_per_epoch
        self.num_trains_per_train_loop = num_trains_per_train_loop
        self.num_train_loops_per_epoch = num_train_loops_per_epoch
        self.num_expl_steps_per_train_loop = num_expl_steps_per_train_loop
        self.min_num_steps_before_training = min_num_steps_before_training
        self.max_num_steps_before_training = max_num_steps_before_training
        self.expert_data_collector = expert_data_collector

    def _train(self):
        if self.min_num_steps_before_training > 0:
            init_expl_paths = self.expl_data_collector.collect_new_paths(
                self.max_path_length,
                self.min_num_steps_before_training,
                discard_incomplete_paths=False,
            )

            self.replay_buffer.add_paths(init_expl_paths)

            self.expert_data_collector.end_epoch(-1)
            self.expl_data_collector.end_epoch(-1)

        if self.expert_data_collector is not None:
            new_expl_paths = self.expert_data_collector.collect_new_paths(
                self.max_path_length,
                min(int(self.replay_buffer.max_buffer_size * 0.5), self.max_num_steps_before_training),
                discard_incomplete_paths=False,
            )
            self.replay_buffer.add_paths(new_expl_paths)

        dataset_loader = torch.utils.data.DataLoader(self.replay_buffer, pin_memory=True, batch_size=self.batch_size, num_workers=0)

        for epoch in gt.timed_for(
                range(self._start_epoch, self.num_epochs),
                save_itrs=True,
        ):
            printout('Evaluation sampling')
            self.eval_data_collector.collect_new_paths(
                self.max_path_length,
                self.num_eval_steps_per_epoch,
                discard_incomplete_paths=True,
            )
            gt.stamp('evaluation sampling')

            for _ in range(self.num_train_loops_per_epoch):

                printout('Exploration sampling')
                new_expl_paths = self.expl_data_collector.collect_new_paths(
                    self.max_path_length,
                    self.num_expl_steps_per_train_loop,
                    discard_incomplete_paths=False,
                )
                gt.stamp('exploration sampling', unique=False)

                self.replay_buffer.add_paths(new_expl_paths)
                gt.stamp('data storing', unique=False)

                self.training_mode(True)

                i = 0
                with tqdm(total=self.num_trains_per_train_loop) as pbar:
                    while True:

                        for _, data in enumerate(dataset_loader, 0):
                            if i > self.num_trains_per_train_loop:
                                break  # We are done

                            observations = data[0].to(ptu.device)
                            actions = data[1].to(ptu.device)
                            rewards = data[2].to(ptu.device)
                            terminals = data[3].to(ptu.device).float()
                            next_observations = data[4].to(ptu.device)
                            env_infos = data[5]

                            train_data = dict(
                                observations=observations,
                                actions=actions,
                                rewards=rewards,
                                terminals=terminals,
                                next_observations=next_observations,
                            )

                            for key in env_infos.keys():
                                train_data[key] = env_infos[key]

                            self.trainer.train(train_data)
                            pbar.update(1)
                            i += 1

                        if i > self.num_trains_per_train_loop:
                            break

                gt.stamp('training', unique=False)
                self.training_mode(False)

                if isinstance(self.expl_data_collector, AtariPathCollectorWithEmbedder):
                    eval_policy = self.eval_data_collector.get_snapshot()['policy']
                    self.expl_data_collector.evaluate(eval_policy)

            self._end_epoch(epoch)

    def to(self, device):
        for net in self.trainer.networks:
            net.to(device)

    def training_mode(self, mode):
        for net in self.trainer.networks:
            net.train(mode)

Passing 'render=True' gives 'Window rendering not supported'

Traceback (most recent call last):
  File "oracle.py", line 70, in <module>
    use_gpu=True,  # Turn on if you have a GPU
  File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 594, in run_experiment
    **run_experiment_kwargs
  File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 175, in run_experiment_here
    return experiment_function(variant)
  File "/home/cww97/rlkit/rlkit/launchers/state_based_goal_experiments.py", line 108, in her_td3_experiment
    algorithm.train()
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 146, in train
    self.train_online(start_epoch=start_epoch)
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 167, in train_online
    observation = self._take_step_in_env(observation)
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 206, in _take_step_in_env
    self.training_env.render()
  File "/home/cww97/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 124, in render
    self._get_viewer().render()
  File "/home/cww97/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 133, in _get_viewer
    self.viewer = mujoco_py.MjViewer(self.sim)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 133, in __init__
    super().__init__(sim)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 26, in __init__
    super().__init__(sim)
  File "mujoco_py/mjrendercontext.pyx", line 278, in mujoco_py.cymj.MjRenderContextWindow.__init__
  File "mujoco_py/mjrendercontext.pyx", line 66, in mujoco_py.cymj.MjRenderContext.__init__
  File "mujoco_py/mjrendercontext.pyx", line 87, in mujoco_py.cymj.MjRenderContext._set_mujoco_buffers
RuntimeError: Window rendering not supported

ummmm

[Question] Any idea why SAC loss would diverge?

I left it running for a few epochs, several times to ensure that it was not a fluke.

And SAC is collapsing to always choose the same action.

replay_buffer/size                       210000
trainer/QF1 Loss                              1.35779e+19
trainer/QF2 Loss                              1.34288e+19
trainer/Policy Loss                          -2.48799e+10
trainer/Q1 Predictions Mean                   2.33888e+10
trainer/Q1 Predictions Std                    3.70217e+09
trainer/Q1 Predictions Max                    3.68046e+10
trainer/Q1 Predictions Min                    1.31057e+10
trainer/Q2 Predictions Mean                   2.34333e+10
trainer/Q2 Predictions Std                    3.65296e+09
trainer/Q2 Predictions Max                    3.66932e+10
trainer/Q2 Predictions Min                    1.33272e+10
trainer/Q Targets Mean                        2.36857e+10
trainer/Q Targets Std                         4.52467e+09
trainer/Q Targets Max                         3.54759e+10
trainer/Q Targets Min                         0.224346
trainer/Log Pis Mean                          0.987727
trainer/Log Pis Std                           1.12239
trainer/Log Pis Max                           2.15324
trainer/Log Pis Min                          -4.0056
trainer/Policy mu Mean                        1.52476
trainer/Policy mu Std                         0.0895151
trainer/Policy mu Max                         1.62818
trainer/Policy mu Min                         1.37598
trainer/Policy log std Mean                  -0.582497
trainer/Policy log std Std                    0.0243203
trainer/Policy log std Max                   -0.492316
trainer/Policy log std Min                   -0.640244
trainer/Alpha                                 5.56742e+08
trainer/Alpha Loss                            0.247146
exploration/num steps total                   2.491e+06
exploration/num paths total               23586
exploration/path length Mean                131.579
exploration/path length Std                  57.1612
exploration/path length Max                 200
exploration/path length Min                   8
exploration/Rewards Mean                      0.264324
exploration/Rewards Std                       0.149922
exploration/Rewards Max                       0.590382
exploration/Rewards Min                       0.0141083
exploration/Returns Mean                     34.7795
exploration/Returns Std                      23.3818
exploration/Returns Max                      83.2558
exploration/Returns Min                       2.15501
exploration/Actions Mean                      0.4906
exploration/Actions Std                       0.0686414
exploration/Actions Max                       0.5
exploration/Actions Min                      -0.5
exploration/Num Paths                        38
exploration/Average Returns                  34.7795
exploration/env_infos/final/time Mean         0.342105
exploration/env_infos/final/time Std          0.285806
exploration/env_infos/final/time Max          0.96
exploration/env_infos/final/time Min          0
exploration/env_infos/initial/time Mean       0.995
exploration/env_infos/initial/time Std        3.33067e-16
exploration/env_infos/initial/time Max        0.995
exploration/env_infos/initial/time Min        0.995
exploration/env_infos/time Mean               0.606472
exploration/env_infos/time Std                0.263458
exploration/env_infos/time Max                0.995
exploration/env_infos/time Min                0
evaluation/num steps total                    2.45463e+06
evaluation/num paths total                21675
evaluation/path length Mean                 115.452
evaluation/path length Std                   52.2554
evaluation/path length Max                  200
evaluation/path length Min                    9
evaluation/Rewards Mean                       0.248655
evaluation/Rewards Std                        0.0242211
evaluation/Rewards Max                        0.294154
evaluation/Rewards Min                        0.193703
evaluation/Returns Mean                      28.7078
evaluation/Returns Std                       12.9204
evaluation/Returns Max                       52.5658
evaluation/Returns Min                        2.53809
evaluation/Actions Mean                       0.5
evaluation/Actions Std                        0
evaluation/Actions Max                        0.5
evaluation/Actions Min                        0.5
evaluation/Num Paths                         42
evaluation/Average Returns                   28.7078
evaluation/env_infos/final/time Mean          0.422738
evaluation/env_infos/final/time Std           0.261277
evaluation/env_infos/final/time Max           0.955
evaluation/env_infos/final/time Min           0
evaluation/env_infos/initial/time Mean        0.995
evaluation/env_infos/initial/time Std         2.22045e-16
evaluation/env_infos/initial/time Max         0.995
evaluation/env_infos/initial/time Min         0.995
evaluation/env_infos/time Mean                0.64974
evaluation/env_infos/time Std                 0.245087
evaluation/env_infos/time Max                 0.995
evaluation/env_infos/time Min                 0
time/data storing (s)                         0.0476881
time/evaluation sampling (s)                 13.4834
time/exploration sampling (s)                15.2477
time/logging (s)                              0.0254512
time/saving (s)                               0.0218989
time/training (s)                           111.327
time/epoch (s)                              140.153
time/total (s)                            68869.7
Epoch                                       497

Running it from master. Could it be related to the action space being somewhat discrete? The environment discretizes the actions into 'x' states based on the input data.

Cannot learn (pusher experiment)

I ran the experiment with RIG + pusher with the original settings.
Contrary to the paper, I cannot observe any improvement of the average return or success rate.
How can I reproduce the original paper results?

Output after 100 epochs (388004 iterations)

hand_distance Mean 0.0369142
hand_distance Std 0.0125764
hand_distance Max 0.153462
hand_distance Min 0.0280608
Final hand_distance Mean 0.0406867
Final hand_distance Std 0.00171668
Final hand_distance Max 0.0432108
Final hand_distance Min 0.0369941
puck_distance Mean 0.14895
puck_distance Std 0.0542304
puck_distance Max 0.234878
puck_distance Min 0.0381613
Final puck_distance Mean 0.150567
Final puck_distance Std 0.0544439
Final puck_distance Max 0.234878
Final puck_distance Min 0.0528324
touch_distance Mean 0.0677295
touch_distance Std 0.00900345
touch_distance Max 0.105978
touch_distance Min 0.0565952
Final touch_distance Mean 0.0690568
Final touch_distance Std 0.00635284
Final touch_distance Max 0.0836148
Final touch_distance Min 0.0630353
success Mean 0
success Std 0
success Max 0
success Min 0
Final success Mean 0
Final success Std 0
Final success Max 0
Final success Min 0
QF1 Loss 0.171023
QF2 Loss 0.18778
Policy Loss 104.349
Q1 Predictions Mean -104.665
Q1 Predictions Std 88.4167
Q1 Predictions Max -4.22522
Q1 Predictions Min -336.314
Q2 Predictions Mean -104.558
Q2 Predictions Std 88.3845
Q2 Predictions Max -4.09563
Q2 Predictions Min -335.829
Q Targets Mean -104.699
Q Targets Std 88.4681
Q Targets Max -4.3236
Q Targets Min -336.498
Bellman Errors 1 Mean 0.171023
Bellman Errors 1 Std 0.416491
Bellman Errors 1 Max 3.98678
Bellman Errors 1 Min 1.80304e-06
Bellman Errors 2 Mean 0.18778
Bellman Errors 2 Std 0.355425
Bellman Errors 2 Max 2.20607
Bellman Errors 2 Min 3.32488e-06
Policy Action Mean 0.194368
Policy Action Std 0.731224
Policy Action Max 1
Policy Action Min -1
Test Rewards Mean -0.360232
Test Rewards Std 0.594497
Test Rewards Max -0.00478052
Test Rewards Min -5.77281
Test Returns Mean -36.0232
Test Returns Std 22.2522
Test Returns Max -13.6433
Test Returns Min -92.884
Test Actions Mean 0.0276301
Test Actions Std 0.247718
Test Actions Max 1
Test Actions Min -0.838328
Num Paths 10
Exploration Rewards Mean -1.33633
Exploration Rewards Std 0.762848
Exploration Rewards Max -0.225774
Exploration Rewards Min -5.57678
Exploration Returns Mean -133.633
Exploration Returns Std 60.5544
Exploration Returns Max -42.4458
Exploration Returns Min -214.126
Exploration Actions Mean 0.262311
Exploration Actions Std 0.44695
Exploration Actions Max 1
Exploration Actions Min -1
image_dist Mean 10.9785
image_dist Std 1.05965
image_dist Max 14.3582
image_dist Min 9.29165
Final image_dist Mean 10.8799
Final image_dist Std 1.03786
Final image_dist Max 12.6932
Final image_dist Min 9.39808
image_success Mean -0.727273
image_success Std 0.445362
image_success Max 0
image_success Min -1
Final image_success Mean -0.727273
Final image_success Std 0.445362
Final image_success Max 0
Final image_success Min -1
vae_dist Mean 0.360232
vae_dist Std 0.594497
vae_dist Max 5.77281
vae_dist Min 0.00478052
Final vae_dist Mean 0.216473
Final vae_dist Std 0.096578
Final vae_dist Max 0.361867
Final vae_dist Min 0.0465295
AverageReturn -36.0232
Number of train steps total 388004
Number of env steps total 101000
Number of rollouts total 1010
Train Time (s) 323.224
(Previous) Eval Time (s) 123.294
Sample Time (s) 115.072
Epoch Time (s) 561.59
Total Train Time (s) 29454.1
Epoch 100

Evolution of the average return:

[screenshot from 2019-01-24 showing the average-return curve]

Making local mode not run through doodad

What's the purpose of making local mode run through doodad?

It's making it very hard to debug because stuff like PyCharm doesn't work well trying to hook into subprocesses...

I can submit a PR to make local mode run within rlkit

SAC HER example results not matching

I cloned the repo, set up the environment, and ran (with no changes):

python her_sac_gym_fetch_reach.py

The results don't seem to match with this. Did something break in the latest commit?

[results plot]

However, when I try the TD3 example, it works fine:

python her_td3_multiworld_sawyer_reach.py

[results plot]

Remove incorrect update date from readme

The readme says:

README last updated on: 02/19/2018

This date is incorrect in the literal sense of the word "update". It doesn't make sense to have this displayed in the readme. I recommend either removing it, or turning it into a hidden comment.

For example:
<!-- This is how to write a hidden comment. -->

Some benchmarks on six MuJoCo-v2 environments for DDPG and TD3

Hi @vitchyr

Thanks for the great code base. I was recently benchmarking some results here in search for some DDPG/TD3 implementations after my failure to get baselines working. I thought I'd share some results in case it would be useful to you or others.

For installation, I actually didn't entirely follow the installation instructions, but here's what I did:

  • I used a Python 3.6.7 pip virtualenv, and just manually installed the packages I saw in your installation yml file. I used torch 0.4.1 as recommended.
  • I actually used MuJoCo 2.0, so I was using the -v2 instances of the environments.
  • I used gym 0.12.5 and mujoco-py 2.0.2.2

I took the master branch from 5565dd5 and then adjusted the examples/td3.py and examples/ddpg.py so that they also imported other MuJoCo environments. In addition, for TD3 only, I adjusted the hyperparameters in the "algorithm_kwargs" so that they matched DDPG in the main method. To be clear, DDPG uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/ddpg.py#L71-L79

And TD3 uses this:

https://github.com/vitchyr/rlkit/blob/5565dd589c54f3ee5add28183dd28f0e9663130f/examples/td3.py#L104-L111

I simply modified the td3.py script so that all hyperparameters above match DDPG, so in particular I changed: number of epochs to 1000, eval steps to 1000, min steps before training to 10k, and batch size to 128.

If I am not mistaken, this should mean that both the exploration and evaluation policies will experience 1 million total steps over the course of training. Though, because evaluation by default will discard incomplete trajectories, sometimes the actual number of steps reported by the debugger will be less than 1 million.
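
As a quick sanity check of that arithmetic (step counts assumed from the hyperparameters listed above):

# 1000 epochs, each collecting 1000 exploration steps and 1000 evaluation steps.
num_epochs = 1000
steps_per_epoch = 1000
print(num_epochs * steps_per_epoch)  # 1000000 steps for each policy over training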

I ran DDPG and TD3 on six MuJoCo-v2 environments, for four random seeds each. I adjusted the code so my directory structure looks like this:

$ ls -lh data/
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Walker2d-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Ant-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-HalfCheetah-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Hopper-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-InvertedPendulum-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Reacher-v2
drwxrwxr-x 6 daniel daniel 4.0K Jun 21 19:59 rlkit-td3-Walker2d-v2
$ ls -lh data/rlkit-ddpg-Ant-v2/
drwxrwxr-x 2 daniel daniel 4.0K Jun 20 20:49 rlkit-ddpg-Ant-v2_2019_06_20_20_49_44_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_20_53_49_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_44_22_0000--s-0
drwxrwxr-x 2 daniel daniel 4.0K Jun 21 19:59 rlkit-ddpg-Ant-v2_2019_06_20_21_49_37_0000--s-0
$ 

// other env results presented in a similar manner

For this I used the following plotting script where I just call it like python [script].py Ant-v2 and similarly for the other environments:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
import argparse
import csv
import pandas as pd
import os
import numpy as np
from os.path import join

# matplotlib
titlesize = 33
xsize = 30
ysize = 30
ticksize = 25
legendsize = 25
error_region_alpha = 0.25


def smoothed(x, w):
    """Smooth x by averaging over sliding windows of w, assuming sufficient length.
    """
    if len(x) <= w:
        return x
    smooth = []
    for i in range(1, w):
        smooth.append( np.mean(x[0:i]) )
    for i in range(w, len(x)+1):
        smooth.append( np.mean(x[i-w:i]) )
    assert len(x) == len(smooth), "lengths: {}, {}".format(len(x), len(smooth))
    return np.array(smooth)


def plot(args):
    """Load the progress csv file, and plot.

    Plot:
      'exploration/Returns Mean',
      'exploration/num steps total',
      'evaluation/Returns Mean',
      'evaluation/num steps total',
    """
    nrows, ncols = 1, 2
    fig, ax = plt.subplots(nrows, ncols, squeeze=False, sharey='row',
                           figsize=(11*ncols,6*nrows))

    algorithms = sorted([x for x in os.listdir('data/') if args.env in x])
    assert len(algorithms) == 2
    colors = ['blue', 'red']

    for idx,alg in enumerate(algorithms):
        print('Currently on algorithm: ', alg)
        alg_dir = join('data', alg)
        progfiles = sorted([
                join(alg_dir, x, 'progress.csv') for x in os.listdir(alg_dir)
        ])
        expl_returns = []
        eval_returns = []
        expl_steps = []
        eval_steps = []

        for prog in progfiles:
            df = pd.read_csv(prog, delimiter = ',')

            expl_ret = df['exploration/Returns Mean'].tolist()
            expl_returns.append(expl_ret)
            eval_ret = df['evaluation/Returns Mean'].tolist()
            eval_returns.append(eval_ret)

            expl_sp = df['exploration/num steps total'].tolist()
            expl_steps.append(expl_sp)
            eval_sp = df['evaluation/num steps total'].tolist()
            eval_steps.append(eval_sp)

        expl_returns = np.array(expl_returns)
        eval_returns = np.array(eval_returns)
        xs = expl_returns.shape[1]
        expl_ret_mean = np.mean(expl_returns, axis=0)
        eval_ret_mean = np.mean(eval_returns, axis=0)
        # Use np.std (not np.mean) so the optional shaded regions show spread.
        expl_ret_std = np.std(expl_returns, axis=0)
        eval_ret_std = np.std(eval_returns, axis=0)

        w = 10
        label0 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(expl_ret_mean[-w:]))
        label1 = '{} (w={}), lastavg {:.1f}'.format(
                    (alg).replace('rlkit-',''), w, np.mean(eval_ret_mean[-w:]))
        ax[0,0].plot(np.arange(xs), smoothed(expl_ret_mean, w=w),
                     color=colors[idx], label=label0)
        ax[0,1].plot(np.arange(xs), smoothed(eval_ret_mean, w=w),
                     color=colors[idx], label=label1)

        # This can be noisy.
        if False:
            ax[0,0].fill_between(np.arange(xs),
                                 expl_ret_mean-expl_ret_std,
                                 expl_ret_mean+expl_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])
            ax[0,1].fill_between(np.arange(xs),
                                 eval_ret_mean-eval_ret_std,
                                 eval_ret_mean+eval_ret_std,
                                 alpha=0.3,
                                 facecolor=colors[idx])

    for i in range(2):
        ax[0,i].tick_params(axis='x', labelsize=ticksize)
        ax[0,i].tick_params(axis='y', labelsize=ticksize)
        leg = ax[0,i].legend(loc="best", ncol=1, prop={'size':legendsize})
        for legobj in leg.legendHandles:
            legobj.set_linewidth(5.0)
    ax[0,0].set_title('{} (Exploration)'.format(args.env), fontsize=ysize)
    ax[0,1].set_title('{} (Evaluation)'.format(args.env), fontsize=ysize)

    plt.tight_layout()
    figname = 'fig-{}.png'.format(args.env)
    plt.savefig(figname)
    print("\nJust saved: {}".format(figname))


if __name__ == "__main__":
    pp = argparse.ArgumentParser()
    pp.add_argument('env', type=str)
    args = pp.parse_args()
    plot(args)

Here are the curves. Left is the exploration policy, and right is the evaluation policy.

[Figures: fig-Ant-v2, fig-HalfCheetah-v2, fig-Hopper-v2, fig-InvertedPendulum-v2, fig-Reacher-v2, fig-Walker2d-v2]

The TL;DR is that TD3 wins on four of the environments, and DDPG wins on the other two. One of the ones TD3 doesn't win is InvertedPendulum, but that should be easy to get to 1000 if the hyperparameters are tuned. Also, to reiterate the code comments, I do not report standard deviations since that would make the plots quite hard to read.

I thought this might be useful, if you want to point people towards some baselines. (I didn't see any upon a quick glance, but maybe you have them somewhere else?) Anyway, I hope this is useful or at least remotely interesting!

ImportError: No module named 'glfw'

Failing to build mujoco-py. It initially failed due to missing 'swig' and 'patchelf', which were manually resolved. Now it is failing on 'glfw'. Calling 'import glfw' within the (rlkit) conda environment works, though.

  File "/tmp/pip-install-0n9doxzn/mujoco-py/setup.py", line 28, in run
    import mujoco_py  # noqa: force build
  File "/tmp/pip-install-0n9doxzn/mujoco-py/mujoco_py/__init__.py", line 6, in <module>
    from mujoco_py.mjviewer import MjViewer, MjViewerBasic
  File "/tmp/pip-install-0n9doxzn/mujoco-py/mujoco_py/mjviewer.py", line 2, in <module>
    import glfw
ImportError: No module named 'glfw'

We observed this issue on 2 machines. Thanks!

best way to resume training from PKL

I would like to resume training from a given epoch. I guess I could add a line to:

def get_epoch_snapshot(self, epoch):
    data_to_save = dict(
        epoch=epoch,
        exploration_policy=self.exploration_policy,
        eval_policy=self.eval_policy,
        algorithm=self,  # <-- here
    )
    if self.save_environment:
        data_to_save['env'] = self.training_env
    return data_to_save

and just pickle load the entire RLAlgorithm subclass instance to resume?
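
For concreteness, a rough sketch of the idea in the question, assuming the snapshot is saved with the algorithm object included under an 'algorithm' key as above (the path below is a placeholder, and the exact serialization used for params.pkl may differ):

import pickle

with open('LOCAL_LOG_DIR/my-exp/run-folder/params.pkl', 'rb') as f:
    snapshot = pickle.load(f)

algorithm = snapshot['algorithm']  # the pickled RLAlgorithm subclass instance
algorithm.train()                  # resume training from the saved state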

Getting a gtimer error while installing requirements

Hi Vitchyr,
I'm following your rlkit-gpu install instructions (i.e., conda env create -f docker/rlkit_gpu/rlkit-env.yml), and I get the following error:

Could not find a version that satisfies the requirement gtimer==1.0.1b5 (from -r /data/repos/rlkit/docker/rlkit_gpu/condaenv.27ccrza6.requirements.txt (line 4)) (from versions: 1.0.0b0, 1.0.0b1, 1.0.0b2, 1.0.0b3, 1.0.0b4, 1.0.0b5)
No matching distribution found for gtimer==1.0.1b5 (from -r /data/repos/rlkit/docker/rlkit_gpu/condaenv.27ccrza6.requirements.txt (line 4))

Please help :-/

AttributeError: attribute 'site_pos' of 'mujoco_py.cymj.PyMjModel' objects is not writable

Thanks for sharing your work! I have a question: when I run examples/tdm/ant_position.py, it returns:

2018-05-14 08:29:57.628702 CST | Variant:
2018-05-14 08:29:57.628908 CST | {
  "algorithm": "TDM",
  "tdm_kwargs": {
    "reward_scale": 10,
    "discount": 1,
    "max_path_length": 50,
    "batch_size": 128,
    "num_pretrain_paths": 0,
    "num_epochs": 500,
    "num_steps_per_epoch": 1000,
    "num_steps_per_eval": 1000,
    "tau": 0.001,
    "num_updates_per_env_step": 5,
    "max_tau": 49
  }
}
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
  File "/home/jiamj/rlkit/examples/tdm/ant_position.py", line 87, in <module>
    experiment(variant)
  File "/home/jiamj/rlkit/examples/tdm/ant_position.py", line 15, in experiment
    env = NormalizedBoxEnv(GoalXYPosAnt(max_distance=6))
  File "/home/jiamj/rlkit/rlkit/torch/tdm/envs/ant_env.py", line 21, in __init__
    self.set_goal(np.array([self.max_distance, self.max_distance]))
  File "/home/jiamj/rlkit/rlkit/torch/tdm/envs/ant_env.py", line 41, in set_goal
    self.model.site_pos = site_pos
AttributeError: attribute 'site_pos' of 'mujoco_py.cymj.PyMjModel' objects is not writable

How can I solve it? Thanks again!

Docker Image Does Not Work

I have attempted to run the TD3 example script from the rlkit-gpu Docker image with no success. I had to modify the TD3 example script slightly because I don't have a MuJoCo license, so it instead runs MountainCarContinuous-v0. It runs just fine on my local machine from the rlkit source, but when I try to run it from within the Docker container I get the following error:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
Traceback (most recent call last):
  File "examples/td3.py", line 111, in <module>
    experiment(variant)
  File "examples/td3.py", line 84, in experiment
    algorithm.cuda()
  File "/rlkit/rlkit/torch/torch_rl_algorithm.py", line 37, in cuda
    net.cuda()
  File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 152, in _apply
    param.data = fn(param.data)
  File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 216, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/env/lib/python3.5/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 358, in _lazy_new
    _lazy_init()
  File "/env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 121, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/lib/THC/THCGeneral.c:70

Can you confirm that the td3 example runs on the docker container without issue for you?

Value network in TwinSAC

Hi,

I skimmed over the author's implementation and it seems that they don't use the value network. Instead they only use the Q-networks. Seems they removed it in this commit

Thanks,

Lukas

Training env is reset at start of epoch, but PathBuilder not reset. Resulting in erroneous Exploration statistics.

Hi Vitchyr!

There appears to be an issue in the Exploration statistics. In online training, at each new epoch, observation = self._start_new_rollout() is called. However, the PathBuilder is not reset at the end of an epoch, so the first trajectory of each new epoch is appended to the last trajectory of the previous epoch. This results in the wrong exploration returns statistics. I don't believe the rest of the training is affected.

One solution could be to call self._handle_rollout_ending at the end of each epoch (a rough sketch is below). This would, however, also yield returns for episodes that are cut short, but I'd say that's preferable. Shall I file a PR?
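
A rough sketch of that idea (the method names _handle_rollout_ending and _start_new_rollout come from the description above; the _current_path_builder attribute and the exact hook point in the epoch loop are assumptions, so treat this as pseudocode rather than a drop-in patch):

def _end_epoch_rollout_bookkeeping(self):
    # Close out the partially-built path so its samples are not appended to
    # the first trajectory of the next epoch; this keeps the exploration
    # return statistics per-epoch.
    if len(self._current_path_builder) > 0:
        self._handle_rollout_ending()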

Nice work on this repo!

Best,
Pim

Unable to reproduce the results of HER-TD3

When I first ran the script with:

python -m examples.her.her_td3_gym_fetch_reach

It raised an error:

Traceback (most recent call last):
  File "/home/fcloud/anaconda3/envs/rlkit/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/fcloud/anaconda3/envs/rlkit/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/fcloud/workspace/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 88, in <module>
    experiment(variant)
  File "/home/fcloud/workspace/rlkit/examples/her/her_td3_gym_fetch_reach.py", line 65, in experiment
    **variant['algo_kwargs']
TypeError: __init__() missing 2 required keyword-only arguments: 'her_kwargs' and 'td3_kwargs'

So I modified the source code as follows:

    her_kwargs = dict(observation_key='observation', desired_goal_key='desired_goal')
    td3_kwargs = dict(env=env,
                      qf1=qf1,
                      qf2=qf2,
                      policy=policy,
                      exploration_policy=exploration_policy)
    algorithm = HerTd3(
        her_kwargs=her_kwargs,
        td3_kwargs=td3_kwargs,
        replay_buffer=replay_buffer,
        **variant['algo_kwargs']
    )

Then the experiment can be launched successfully, but the results do not seem correct:
[learning-curve plot]

Retrieve s3

Is there an easy way to get the S3 path? I'd like to have a script that downloads all the recent S3 results instead of copying and pasting each time, and that would require having the S3 path returned each time an experiment is uploaded.
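
A rough sketch (not rlkit API) of what such a download script could look like once the bucket and prefix are known; the bucket name and prefix below are hypothetical, and boto3 must be configured with AWS credentials:

import os

import boto3

def download_results(bucket_name, prefix, local_dir):
    # Download every object under the given prefix, mirroring the key layout locally.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith('/'):
            continue  # skip folder placeholder objects
        target = os.path.join(local_dir, obj.key)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        bucket.download_file(obj.key, target)

download_results('my-rlkit-bucket', 'doodad/logs/my-experiment', 'data/s3')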

How can I see the video in RIG?

I see

            save_video=True,
            save_video_period=100,

I can only see .png files in the data folder; I want to know how or where I can find the video files.

Unable to Recreate TDM Performance for Half Cheetah

Hi, I'm trying to evaluate TDM as an algorithm to add to our software and the first step was to replicate the results presented in "TEMPORAL DIFFERENCE MODELS: MODEL-FREE DEEP RL FOR MODEL-BASED CONTROL" (Very impressive paper, by the way). Of the examples in this repository, Half Cheetah looks like it trains the fastest in your paper, so I ran that first. When I run it I don't seem to get the same results you present, but I may just be interpreting things incorrectly.

I created a conda env as per the instructions in the readme, with the exception of gtimer where I removed the version requirement gtimer==1.0.1b5 because that version wasn't available.

When I run examples/tdm/cheetah.py, the initial returns are as follows:

Initial return data:

Multitask L1 distance to goal Mean 2.2572
Multitask L1 distance to goal Std 1.43821
Multitask L1 distance to goal Max 5.63364
Multitask L1 distance to goal Min 0.0958731
Multitask Final L1 distance to goal Mean 1.34996
Multitask Final L1 distance to goal Std 1.69566
Multitask Final L1 distance to goal Max 5.46802
Multitask Final L1 distance to goal Min 0.315033
Multitask L2 distance to goal Mean 2.2572
Multitask L2 distance to goal Std 1.43821
Multitask L2 distance to goal Max 5.63364
Multitask L2 distance to goal Min 0.0958731
Multitask Final L2 distance to goal Mean 1.34996
Multitask Final L2 distance to goal Std 1.69566
Multitask Final L2 distance to goal Max 5.46802
Multitask Final L2 distance to goal Min 0.315033
Multitask Env Rewards Mean -2.24934
Multitask Env Rewards Std 1.44294
Multitask Env Rewards Max -0.0958731
Multitask Env Rewards Min -5.60402
Multitask L1 distance to goal Mean 2.2572
Multitask L1 distance to goal Std 1.43821
Multitask L1 distance to goal Max 5.63364
Multitask L1 distance to goal Min 0.0958731
Multitask Final L1 distance to goal Mean 1.34996
Multitask Final L1 distance to goal Std 1.69566
Multitask Final L1 distance to goal Max 5.46802
Multitask Final L1 distance to goal Min 0.315033
Multitask L2 distance to goal Mean 2.2572
Multitask L2 distance to goal Std 1.43821
Multitask L2 distance to goal Max 5.63364
Multitask L2 distance to goal Min 0.0958731
Multitask Final L2 distance to goal Mean 1.34996
Multitask Final L2 distance to goal Std 1.69566
Multitask Final L2 distance to goal Max 5.46802
Multitask Final L2 distance to goal Min 0.315033
Multitask Env Rewards Mean -2.24934
Multitask Env Rewards Std 1.44294
Multitask Env Rewards Max -0.0958731
Multitask Env Rewards Min -5.60402
xvels Mean 0.192786
xvels Std 0.55423
xvels Max 1.7305
xvels Min -2.56708
Final xvels Mean -0.873447
Final xvels Std 0.629683
Final xvels Max -0.00983972
Final xvels Min -1.92269
desired xvels Mean -0.242309
desired xvels Std 2.53803
desired xvels Max 5.45818
desired xvels Min -2.77633
Final desired xvels Mean -0.242309
Final desired xvels Std 2.53803
Final desired xvels Max 5.45818
Final desired xvels Min -2.77633
xvel errors Mean 2.24934
xvel errors Std 1.44294
xvel errors Max 5.60402
xvel errors Min 0.0958731
Final xvel errors Mean 1.34996
Final xvel errors Std 1.69566
Final xvel errors Max 5.46802
Final xvel errors Min 0.315033
QF Loss 1.94271
Policy Loss 0.141472
Raw Policy Loss 0.141472
Preactivation Policy Loss 0
Q Predictions Mean -0.141398
Q Predictions Std 0.188076
Q Predictions Max -0.000109524
Q Predictions Min -0.671575
Q Targets Mean -2.11527
Q Targets Std 8.76126
Q Targets Max -0
Q Targets Min -73.5125
Bellman Errors Mean 80.6921
Bellman Errors Std 504.045
Bellman Errors Max 5365.62
Bellman Errors Min 8.74315e-13
Policy Action Mean -0.00243206
Policy Action Std 0.00429308
Policy Action Max 0.00416228
Policy Action Min -0.011675
Test Rewards Mean -2.24934
Test Rewards Std 1.44294
Test Rewards Max -0.0958731
Test Rewards Min -5.60402
Test Returns Mean -222.684
Test Returns Std 133.611
Test Returns Max -52.0534
Test Returns Min -534.119
Test Actions Mean 0.181862
Test Actions Std 0.903677
Test Actions Max 1
Test Actions Min -1
Num Paths 10
Exploration Rewards Mean -327.502
Exploration Rewards Std 150.894
Exploration Rewards Max -0.0122693
Exploration Rewards Min -684.543
Exploration Returns Mean -32422.7
Exploration Returns Std 14259.2
Exploration Returns Max -9411.74
Exploration Returns Min -56577.3
Exploration Actions Mean 0.0733828
Exploration Actions Std 0.691484
Exploration Actions Max 1
Exploration Actions Min -1
AverageReturn -222.684
Number of train steps total 20075
Number of env steps total 1000
Number of rollouts total 10
Train Time (s) 185.599
(Previous) Eval Time (s) 0
Sample Time (s) 0.594464
Epoch Time (s) 186.193
Total Train Time (s) 187.161
Epoch 0

and the final returns after 27 000 samples were:

Final return data:

Multitask L1 distance to goal Mean 2.84896
Multitask L1 distance to goal Std 1.86101
Multitask L1 distance to goal Max 8.48643
Multitask L1 distance to goal Min 0.00163637
Multitask Final L1 distance to goal Mean 1.12561
Multitask Final L1 distance to goal Std 1.09752
Multitask Final L1 distance to goal Max 3.6141
Multitask Final L1 distance to goal Min 0.010751
Multitask L2 distance to goal Mean 2.84896
Multitask L2 distance to goal Std 1.86101
Multitask L2 distance to goal Max 8.48643
Multitask L2 distance to goal Min 0.00163637
Multitask Final L2 distance to goal Mean 1.12561
Multitask Final L2 distance to goal Std 1.09752
Multitask Final L2 distance to goal Max 3.6141
Multitask Final L2 distance to goal Min 0.010751
Multitask Env Rewards Mean -2.83008
Multitask Env Rewards Std 1.86414
Multitask Env Rewards Max -0.00163637
Multitask Env Rewards Min -8.48643
Multitask L1 distance to goal Mean 2.84896
Multitask L1 distance to goal Std 1.86101
Multitask L1 distance to goal Max 8.48643
Multitask L1 distance to goal Min 0.00163637
Multitask Final L1 distance to goal Mean 1.12561
Multitask Final L1 distance to goal Std 1.09752
Multitask Final L1 distance to goal Max 3.6141
Multitask Final L1 distance to goal Min 0.010751
Multitask L2 distance to goal Mean 2.84896
Multitask L2 distance to goal Std 1.86101
Multitask L2 distance to goal Max 8.48643
Multitask L2 distance to goal Min 0.00163637
Multitask Final L2 distance to goal Mean 1.12561
Multitask Final L2 distance to goal Std 1.09752
Multitask Final L2 distance to goal Max 3.6141
Multitask Final L2 distance to goal Min 0.010751
Multitask Env Rewards Mean -2.83008
Multitask Env Rewards Std 1.86414
Multitask Env Rewards Max -0.00163637
Multitask Env Rewards Min -8.48643
xvels Mean 0.0265717
xvels Std 1.19521
xvels Max 3.78229
xvels Min -5.17487
Final xvels Mean -0.953468
Final xvels Std 2.0584
Final xvels Max 3.35745
Final xvels Min -4.45172
desired xvels Mean -1.27358
desired xvels Std 3.19157
desired xvels Max 4.8695
desired xvels Min -5.75411
Final desired xvels Mean -1.27358
Final desired xvels Std 3.19157
Final desired xvels Max 4.8695
Final desired xvels Min -5.75411
xvel errors Mean 2.83008
xvel errors Std 1.86414
xvel errors Max 8.48643
xvel errors Min 0.00163637
Final xvel errors Mean 1.12561
Final xvel errors Std 1.09752
Final xvel errors Max 3.6141
Final xvel errors Min 0.010751
QF Loss 5.10427
Policy Loss 21.2602
Raw Policy Loss 21.2602
Preactivation Policy Loss 0
Q Predictions Mean -35.1532
Q Predictions Std 79.3634
Q Predictions Max -0.176604
Q Predictions Min -579.39
Q Targets Mean -34.9638
Q Targets Std 80.9855
Q Targets Max -0.0394544
Q Targets Min -585.102
Bellman Errors Mean 134.548
Bellman Errors Std 629.529
Bellman Errors Max 6193.98
Bellman Errors Min 1.83182e-05
Policy Action Mean 0.043988
Policy Action Std 0.897599
Policy Action Max 1
Policy Action Min -1
Test Rewards Mean -2.83008
Test Rewards Std 1.86414
Test Rewards Max -0.00163637
Test Rewards Min -8.48643
Test Returns Mean -280.178
Test Returns Std 150.009
Test Returns Max -63.9594
Test Returns Min -552.089
Test Actions Mean -0.262005
Test Actions Std 0.870809
Test Actions Max 1
Test Actions Min -1
Num Paths 10
Exploration Rewards Mean -189.165
Exploration Rewards Std 142.202
Exploration Rewards Max -0.217348
Exploration Rewards Min -668.823
Exploration Returns Mean -18727.3
Exploration Returns Std 11106.4
Exploration Returns Max -5627.08
Exploration Returns Min -44754
Exploration Actions Mean -0.0129454
Exploration Actions Std 0.841991
Exploration Actions Max 1
Exploration Actions Min -1
AverageReturn -280.178
Number of train steps total 670075
Number of env steps total 27000
Number of rollouts total 272
Train Time (s) 339.624
(Previous) Eval Time (s) 0.961533
Sample Time (s) 0.663905
Epoch Time (s) 341.249
Total Train Time (s) 8136.35
Epoch 26

Comparing the final xvel error mean (which I think is the metric you graph in the paper) it seems that after 27 000 env steps, rather than dropping from ~2.5 to < 1, the value has remained at or above 2.5. Is this just a matter of running it for longer, or does this suggest there's something fundamentally wrong with my environment or perhaps that the code has been changed since generating the graphs in the paper?

I've seen similar results in three runs, so it doesn't seem to be a bad initialization.

Thanks for your help.

Investigate super-convergence on RL algorithms

I have been using these two routines to figure out the best learning rate to apply, with awesome results on SAC. However, the changes in the temperature alter those values along the way, so it would probably be a good idea to extend this further to do some sort of 'automatic' discovery of the learning rate after a given number of epochs. Note that this version will also mess up the gradients, so you cannot use the policy after you run it.

    def find_policy_lr_step(self, loss):

        self.find_lr_batch_num += 1

        if self.find_lr_batch_num == 1:
            self.find_lr_avg_loss = 0.0
            self.find_lr_worst_loss = loss.item()
            self.find_lr_best_loss = loss.item()
            self.find_lr_best_lr = self.policy_optimizer.param_groups[0]['lr']
            self.find_lr_worst_lr = self.policy_optimizer.param_groups[0]['lr']

        self.find_lr_avg_loss = self.find_lr_beta * self.find_lr_avg_loss + (1-self.find_lr_beta) * loss.item()
        smoothed_loss = self.find_lr_avg_loss / (1 - self.find_lr_beta ** self.find_lr_batch_num)

        # Record the best and worst loss
        if self.find_lr_batch_num > self.find_lr_batches // 10 and smoothed_loss < self.find_lr_best_loss:
            self.find_lr_best_lr = self.find_lr_current_lr
            self.find_lr_best_loss = smoothed_loss

        # We only record at the start (we don't care about the divergent part)
        if self.find_lr_batch_num < self.find_lr_batches // 5:
            self.find_lr_worst_loss = max(smoothed_loss, self.find_lr_worst_loss)

        # Stop once the learning-rate sweep has used up its budget of batches
        if self.find_lr_batch_num > self.find_lr_batches:

            import matplotlib.pyplot as plt
            plt.plot(self.find_lr_log_lrs,self.find_lr_losses)
            plt.show()

            # TODO: This is a simplistic heuristic until we do it properly with gradient analysis.
            printout(f'The best learning rate for the network could be around: {self.find_lr_best_lr / 10}')
            printout(f'Process will exit because running the learning-rate finder makes your gradients degenerate')
            exit(0)

        # Store the values unless we are already diverging
        if smoothed_loss <= self.find_lr_worst_loss:
            self.find_lr_losses.append(smoothed_loss)
            self.find_lr_log_lrs.append(math.log10(self.policy_optimizer.param_groups[0]['lr']))

        # Update with the new learning rate.
        self.find_lr_current_lr *= self.find_lr_multiplier
        self.policy_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr

    def find_qfunc_lr_step(self, qf1_loss, qf2_loss):
        self.find_lr_batch_num += 1

        if self.find_lr_batch_num == 1:
            self.find_lr_avg_loss = 0.0
            self.find_lr_worst_loss = min( qf1_loss.item(), qf2_loss.item() )
            self.find_lr_best_loss = min( qf1_loss.item(), qf2_loss.item() )
            self.find_lr_best_lr = self.qf1_optimizer.param_groups[0]['lr']
            self.find_lr_worst_lr = self.qf1_optimizer.param_groups[0]['lr']

        self.find_lr_avg_loss = self.find_lr_beta * self.find_lr_avg_loss + (1-self.find_lr_beta) * min( qf1_loss.item(), qf2_loss.item() )
        smoothed_loss = self.find_lr_avg_loss / (1 - self.find_lr_beta ** self.find_lr_batch_num)

        # Record the best and worst loss
        if self.find_lr_batch_num > self.find_lr_batches // 10 and smoothed_loss < self.find_lr_best_loss:
            self.find_lr_best_lr = self.find_lr_current_lr
            self.find_lr_best_loss = smoothed_loss

        # We only record at the start (we don't care about the divergent part)
        if self.find_lr_batch_num < self.find_lr_batches // 5:
            self.find_lr_worst_loss = max(smoothed_loss, self.find_lr_worst_loss)

        # Stop once the learning-rate sweep has used up its budget of batches
        if self.find_lr_batch_num > self.find_lr_batches:

            import matplotlib.pyplot as plt
            plt.plot(self.find_lr_log_lrs,self.find_lr_losses)
            plt.show()

            # TODO: This is a simplistic heuristic until we do it properly with gradient analysis.
            printout(f'The best learning rate for the Q-function approximator could be around: {self.find_lr_best_lr / 10}')
            printout(f'Process will exit because running the learning-rate finder makes your gradients degenerate')
            exit(0)

        # Store the values unless we are already diverging
        if smoothed_loss <= self.find_lr_worst_loss:
            self.find_lr_losses.append(smoothed_loss)
            self.find_lr_log_lrs.append(math.log10(self.qf1_optimizer.param_groups[0]['lr']))

        # Update with the new learning rate.
        self.find_lr_current_lr *= self.find_lr_multiplier
        self.qf1_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
        self.qf2_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
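
For reference, the routines above also assume import math at module level (for math.log10) and a handful of find_lr_* attributes. Here is a minimal sketch of how that state could be initialized and where the routines could be called from; the default values and the hook shown here are my own assumptions, not rlkit code:

    def init_lr_finder(self):
        # State consumed by find_policy_lr_step / find_qfunc_lr_step above.
        # The numbers are illustrative defaults, not tuned values.
        self.find_lr_batch_num = 0
        self.find_lr_batches = 1000        # length of the sweep, in batches
        self.find_lr_beta = 0.98           # exponential smoothing of the loss
        self.find_lr_multiplier = 1.02     # geometric LR growth per batch
        self.find_lr_current_lr = 1e-6     # starting learning rate
        self.find_lr_losses = []
        self.find_lr_log_lrs = []
        self.policy_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
        self.qf1_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr
        self.qf2_optimizer.param_groups[0]['lr'] = self.find_lr_current_lr

    # In the training step, call one of the finders right after computing the
    # losses and before optimizer.step(), e.g.:
    #     self.find_policy_lr_step(policy_loss)
    # or
    #     self.find_qfunc_lr_step(qf1_loss, qf2_loss)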

Dependency on multiworld even when not using it

I am using a custom environment, yet it seems that HER depends on multiworld anyway. The documentation says this shouldn't be necessary. Here is the stacktrace:

    from rlkit.samplers.data_collector import GoalConditionedPathCollector
  File "/root/src/sandbox/rlkit/rlkit/samplers/data_collector/__init__.py", line 6, in <module>
    from rlkit.samplers.data_collector.path_collector import (
  File "/root/src/sandbox/rlkit/rlkit/samplers/data_collector/path_collector.py", line 3, in <module>
    from rlkit.envs.vae_wrapper import VAEWrappedEnv
  File "/root/src/sandbox/rlkit/rlkit/envs/vae_wrapper.py", line 11, in <module>
    from multiworld.core.multitask_env import MultitaskEnv
ModuleNotFoundError: No module named 'multiworld.core'
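
One possible local workaround (just a sketch of a patch to rlkit/envs/vae_wrapper.py, not the documented behaviour) is to make the multiworld import optional so the data_collector package can be imported without multiworld installed:

    # Sketch of a local patch to rlkit/envs/vae_wrapper.py (an assumption, not upstream code):
    try:
        from multiworld.core.multitask_env import MultitaskEnv
    except ImportError:
        class MultitaskEnv(object):
            """Stub base class used only when multiworld is absent and
            VAEWrappedEnv is never instantiated."""
            pass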

The policy loss in SAC?

Thanks for your code, it helps a lot. I wonder why you use a different policy loss in SAC.
(screenshot of the SAC policy-loss code omitted)
In the original paper, the policy loss is simpler. What do mean_reg_loss and the other regularization losses mean? Hoping for your reply.
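
For context, the extra terms are small L2 penalties on the policy's pre-tanh outputs. A runnable sketch of their shape (the names and weights below are illustrative, not copied from rlkit):

    import torch

    batch_size, action_dim = 4, 2
    # Dummy tensors standing in for the policy outputs and Q-values.
    policy_mean = torch.randn(batch_size, action_dim, requires_grad=True)
    policy_log_std = torch.randn(batch_size, action_dim, requires_grad=True)
    pre_tanh_value = torch.randn(batch_size, action_dim, requires_grad=True)
    log_pi = torch.randn(batch_size, 1)
    q_new_actions = torch.randn(batch_size, 1)

    # The "simple" loss from the paper ...
    raw_policy_loss = (log_pi - q_new_actions).mean()
    # ... plus small L2 regularizers that keep the pre-tanh Gaussian well-behaved.
    mean_reg_loss = 1e-3 * (policy_mean ** 2).mean()
    std_reg_loss = 1e-3 * (policy_log_std ** 2).mean()
    pre_activation_loss = 0.0 * (pre_tanh_value ** 2).sum(dim=1).mean()
    policy_loss = raw_policy_loss + mean_reg_loss + std_reg_loss + pre_activation_loss

These correspond to the "Raw Policy Loss" and "Preactivation Policy Loss" entries that show up in the training logs above.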

viskit incompatibility: plotly.exceptions.PlotlyDictValueError: 'title' has invalid value inside 'layout'

When trying to plot the results with viskit, the follow error occurs.

Done! View http://localhost:5000 in your browser
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Plot_keys: ['AverageReturn']
split_keys: []
group_keys: []
filters: {}
exclusions: []
[2018-12-29 15:56:30,229] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/vitchyr/git/viskit/viskit/frontend.py", line 762, in index
    plot_div = get_plot_instruction(plot_keys=plot_keys)
  File "/home/vitchyr/git/viskit/viskit/frontend.py", line 502, in get_plot_instruction
    plot_width=plot_width, plot_height=plot_height
  File "/home/vitchyr/git/viskit/viskit/frontend.py", line 79, in make_plot
    title=title,
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 613, in update
    self[key] = val
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 430, in __setitem__
    value = self._value_to_graph_object(key, value, _raise=_raise)
  File "/home/vitchyr/anaconda2/envs/railrl-env-4/lib/python3.5/site-packages/plotly/graph_objs/graph_objs.py", line 535, in _value_to_graph_object
    raise exceptions.PlotlyDictValueError(self, path)
plotly.exceptions.PlotlyDictValueError: 'title' has invalid value inside 'layout'

Path To Error: ['layout']['title']

Current path: ['layout']
Current parent object_names: ['figure']

With the current parents, 'title' can be used as follows:

Under ('figure', 'layout'):

    editType: layoutstyle
    role: object



127.0.0.1 - - [29/Dec/2018 15:56:30] "GET / HTTP/1.1" 500 -

Mujoco_py doesn't build

The Docker image installs mujoco_py successfully. I can run a new Docker container, type "import mujoco_py", and it builds the Cython code.

However, when I launch a container programmatically with doodad, for some reason it keeps erroring and trying to rebuild the Cython extension:

Running in docker

Import error. Trying to rebuild mujoco_py.
running build_ext
building 'mujoco_py.cymj' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/root/.mujoco/mjpro150/include -I/env/lib/python3.6/site-packages/numpy/core/include -I/usr/include/python3.6m -I/env/include/python3.6m -c /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.c -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o -fopenmp -w
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py -I/root/.mujoco/mjpro150/include -I/env/lib/python3.6/site-packages/numpy/core/include -I/usr/include/python3.6m -I/env/include/python3.6m -c /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.c -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -fopenmp -w
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/cymj.o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/gl/osmesashim.o -L/root/.mujoco/mjpro150/bin -Wl,--enable-new-dtags,-R/root/.mujoco/mjpro150/bin -lmujoco150 -lglewosmesa -lOSMesa -lGL -o /env/lib/python3.6/site-packages/mujoco_py-1.50.1.68-py3.6.egg/mujoco_py/generated/_pyxbld_1.50.1.68_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6/mujoco_py/cymj.cpython-36m-x86_64-linux-gnu.so -fopenmp
/usr/bin/ld: cannot find -lmujoco150
/usr/bin/ld: cannot find -lglewosmesa

I'm sure osmesa is installed, and mujoco is downloaded and installed. The LD path looks correct:
echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib64:/root/.mujoco/mjpro150/bin:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

root@e6131e04a34c:~/.mujoco# ls
mjkey.txt  mjpro150
root@e6131e04a34c:~/.mujoco# ls mjpro150/
bin  doc  include  model  sample
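
A quick sanity check to run inside the doodad-launched container (a debugging sketch, not rlkit code): confirm that the shared objects the linker is asking for actually exist where -L points.

    import os

    mj_bin = os.path.expanduser('~/.mujoco/mjpro150/bin')
    print('LD_LIBRARY_PATH =', os.environ.get('LD_LIBRARY_PATH'))
    # -lmujoco150 and -lglewosmesa correspond to these shared objects:
    for lib in ('libmujoco150.so', 'libglewosmesa.so'):
        path = os.path.join(mj_bin, lib)
        print(path, 'exists' if os.path.exists(path) else 'MISSING')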

examples/tsac.py does not run at master

Crashes with:

Traceback (most recent call last):
  File "/mounts/target/examples/tsac.py", line 75, in <module>
    experiment(variant)
  File "/mounts/target/examples/tsac.py", line 52, in experiment
    algorithm.to(ptu.device)
  File "/mounts/rlkit-vitchyr/rlkit/torch/torch_rl_algorithm.py", line 30, in to
    net.to(device)
  File "/env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 366, in __getattr__
    type(self).__name__, name))
AttributeError: 'TanhGaussianPolicy' object has no attribute 'to'
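
For context: nn.Module.to() only exists from PyTorch 0.4 onward, so this AttributeError usually points to an older torch in the environment. A quick check (sketch):

    import torch

    print(torch.__version__)
    # nn.Module.to() was added in PyTorch 0.4; without it the call in
    # torch_rl_algorithm.py falls through to __getattr__ and raises.
    print('Module.to available:', hasattr(torch.nn.Module, 'to'))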

Performance on Hopper-v2

originally posted under another issue but re-posting for visibility

Sorry to open this up again, but I am unable to obtain comparable results to the TensorFlow implementation using the master branch. I have posted the training graphs for the PyTorch and TensorFlow implementations below for comparison. Both results were averaged over 5 seeds.

(plot: TSAC Hopper-v2, PyTorch implementation)

(plot: TSAC Hopper-v2, TF implementation)

The TF implementation's final performance is higher and it also learns faster. The shape of the TF curve also closely matches the graph in the paper, i.e., a quick increase followed by a plateau at around 400 epochs.

Does the pytorch graph look similar to what you obtained too?

I just want to mention that your repo is awesome. Answering pestering questions from me is not your responsibility :) and I really appreciate any help here.

SAC code fails?

Hi. I tried running SAC but it fails with the following error. I am currently working on your repo to add an environment-farming feature (multiple instances of an environment feeding the reinforcement learning algorithm). The problem may therefore be related to my environment or my changes, but it seems like it is not, since DDPG works. Do you have any idea why SAC fails? Thanks.

File "/root/code/rlkit/core/rl_algorithm.py", line 134, in train
self.train_online(start_epoch=start_epoch)
File "/root/code/rlkit/core/rl_algorithm.py", line 210, in train_online
observation = self.play_one_step(observation)
File "/root/code/rlkit/core/rl_algorithm.py", line 179, in play_one_step
self._try_to_train()
File "/root/code/rlkit/core/rl_algorithm.py", line 222, in _try_to_train
self._do_training()
File "/root/code/rlkit/torch/sac/sac.py", line 131, in _do_training
policy_loss.backward()
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/variable.py", line 156, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/init.py", line 98, in backward
variables, grad_variables, retain_graph)
File "/opt/conda/envs/rlkit-env/lib/python3.5/site-packages/torch/autograd/stochastic_function.py", line 15, in _do_backward
raise RuntimeError("differentiating stochastic functions requires "
RuntimeError: differentiating stochastic functions requires providing a reward
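
For context, the traceback goes through torch/autograd/stochastic_function.py, which is the old (pre-0.4) PyTorch machinery where backpropagating through a sampled action needs a REINFORCE-style reward. As an illustration only (not rlkit's code), the reparameterized sampling available in newer PyTorch avoids this error:

    import torch
    from torch.distributions import Normal

    mean = torch.zeros(3, requires_grad=True)
    std = torch.ones(3)
    # rsample() draws a reparameterized sample, so gradients flow back to `mean`;
    # a plain sample() would be a non-differentiable draw.
    action = Normal(mean, std).rsample()
    loss = (action ** 2).sum()
    loss.backward()
    print(mean.grad)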

OSError: [Errno 12] Cannot allocate memory

I made a little change to oracle.py, like this,

and modified state_based_goal_experiments.py:

env -> image_env

then I get:

Traceback (most recent call last):
  File "pusher_img.py", line 76, in <module>
    use_gpu=True,  # Turn on if you have a GPU
  File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 594, in run_experiment
    **run_experiment_kwargs
  File "/home/cww97/rlkit/rlkit/launchers/launcher_util.py", line 175, in run_experiment_here
    return experiment_function(variant)
  File "/home/cww97/rlkit/rlkit/launchers/img_goal_experiments.py", line 120, in her_td3_experiment
    algorithm.train()
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 146, in train
    self.train_online(start_epoch=start_epoch)
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 178, in train_online
    self._end_epoch(epoch)
  File "/home/cww97/rlkit/rlkit/core/rl_algorithm.py", line 329, in _end_epoch
    post_epoch_func(self, epoch)
  File "/home/cww97/rlkit/rlkit/launchers/rig_experiments.py", line 525, in save_video
    **dump_video_kwargs)
  File "/home/cww97/rlkit/rlkit/util/video.py", line 87, in dump_video
    skvideo.io.vwrite(filename, outputdata)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/io.py", line 64, in vwrite
    writer.writeFrame(videodata[t])
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/ffmpeg.py", line 448, in writeFrame
    self._warmStart(M, N, C)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/site-packages/skvideo/io/ffmpeg.py", line 419, in _warmStart
    stdout=self.DEVNULL, stderr=sp.STDOUT)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/home/cww97/.conda/envs/rlkit/lib/python3.5/subprocess.py", line 1490, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I searched a little with Google and know this is a memory problem, so I tried to see if there is any difference.

I compared the oracle and img_pusher runs;

the video outputdata arrays are both of size 115450416.

Is this because I missed some other code changes?

Exception has occurred: AssertionError with Gym v0.12

Hi,
I'm seeing this error appear consistently when I run the "DQN_and_double_DQN" example file out of the box, somewhere around epoch 15 or so. From what I can tell, this is because once the policy gets good enough that the episode lasts until timeout (200 steps), the gym environment (I'm on 0.12.5) returns {'TimeLimit.truncated': True} in the env_info dictionary, which is inconsistent with the prior observations. Sometime during logging, this inconsistency causes an issue. Seems related to #12 on the surface, but I'm not sure why this only seems to be a problem for me.

Please find the stacktrace below, happy to provide more details or my installed package version numbers if that would help.

File "rlkit/rlkit/pythonplusplus.py", line 165, in list_of_dicts__to__dict_of_lists
    assert set(d.keys()) == set(keys)
  File "rlkit/rlkit/core/eval_util.py", line 40, in <listcomp>
    for p in paths
  File "rlkit/rlkit/core/eval_util.py", line 40, in get_generic_path_information
    for p in paths
  File "rlkit/rlkit/core/rl_algorithm.py", line 127, in _log_stats
    eval_util.get_generic_path_information(eval_paths),
  File "rlkit/rlkit/core/rl_algorithm.py", line 58, in _end_epoch
    self._log_stats(epoch)
  File "rlkit/rlkit/core/batch_rl_algorithm.py", line 84, in _train
    self._end_epoch(epoch)
  File "rlkit/rlkit/core/rl_algorithm.py", line 46, in train
    self._train()
  File "rlkit/examples/dqn_and_double_dqn.py", line 71, in experiment
    algorithm.train()
  File "rlkit/examples/dqn_and_double_dqn.py", line 99, in <module>
    experiment(variant)
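
One possible local workaround (a sketch only, not an rlkit fix) is to wrap the env so that every step's env_info contains the 'TimeLimit.truncated' key, keeping the key set consistent for the logger:

    import gym

    class ConsistentInfoWrapper(gym.Wrapper):
        """Sketch: ensure 'TimeLimit.truncated' is present in every step's info
        dict so downstream logging always sees the same set of keys."""

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            info.setdefault('TimeLimit.truncated', False)
            return obs, reward, done, info

    # usage (hypothetical): env = ConsistentInfoWrapper(gym.make('CartPole-v0'))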

License

Hi Vitchyr! Thanks for your inspiring code. Do you mind putting a license file in the repo so that we can use your code for other projects?

Eval statistics for SAC

I had a question about the way evaluation statistics are computed for SAC. From taking a look at the code, it seems as though the statistics are only computed over one particular training batch every epoch (https://github.com/vitchyr/rlkit/blob/master/rlkit/torch/sac/sac.py#L161); is this true? I'd imagine this measurement would have pretty high variance compared to averaging the statistics over all batches in the epoch. Could you clarify whether this is the case and, if so, why logging is implemented this way?
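
To make the suggestion concrete, here is a rough sketch (not rlkit code) of accumulating a diagnostic over every training batch in an epoch and logging the epoch mean instead of a single batch's value:

    import numpy as np

    class EpochDiagnostics:
        """Collect per-batch statistics and report their epoch-level means."""

        def __init__(self):
            self._values = {}

        def add(self, key, value):
            self._values.setdefault(key, []).append(float(value))

        def pop_means(self):
            means = {k: float(np.mean(v)) for k, v in self._values.items()}
            self._values = {}
            return means

    # usage: diagnostics.add('QF Loss', qf_loss.item()) on every batch,
    # then log diagnostics.pop_means() at the end of the epoch.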

Memory-mapped replay buffer

I am not issuing a PR for this one because it uses my modified Dataset-based version of the trainer, but I'm making it available through this issue in case you are interested in it. The idea behind it is that if you have a big dataset, you want to avoid filling all your memory; memory-mapped files allow arbitrarily large replay buffers:

# Imports assumed by this snippet (the ReplayBuffer base class lives in
# rlkit.data_management.replay_buffer).
import os
from collections import OrderedDict

import numpy as np
import torch
from torch.utils.data import Dataset

from rlkit.data_management.replay_buffer import ReplayBuffer


class OptimizedDiskSequentialReplayBuffer(ReplayBuffer, Dataset):

    def __init__(
            self,
            max_replay_buffer_size,
            observation_shape,
            observation_dtype,
            action_dim,
            env_info_sizes,
            location='.',
            name='replay_buffer'
    ):
        self._observation_shape = observation_shape
        self._observation_dtype = observation_dtype
        self._action_dim = action_dim
        self.max_buffer_size = max_replay_buffer_size

        name = os.path.join(location, name)

        self._observations = np.memmap(f'{name}.obs', dtype=observation_dtype, mode='w+', shape=(max_replay_buffer_size,) + observation_shape)
        # It's a bit memory inefficient to save the observations twice, but it makes the code *much* easier since you no longer have to
        # worry about termination conditions.
        self._next_obs = np.memmap(f'{name}.nextobs', dtype=observation_dtype, mode='w+', shape=(max_replay_buffer_size,) + observation_shape)
        self._actions = np.memmap(f'{name}.actions', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, action_dim))
        # Make everything a 2D np array to make it easier for other code to reason about the shape of the data
        self._rewards = np.memmap(f'{name}.rewards', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, 1))
        # self._terminals[i] = a terminal was received at time i
        self._terminals = np.memmap(f'{name}.terminals', dtype=np.uint8, mode='w+', shape=(max_replay_buffer_size, 1))

        # Define self._env_infos[key][i] to be the return value of env_info[key]
        # at time i
        self._env_infos = {}
        for key, size in env_info_sizes.items():
            self._env_infos[key] = np.memmap(f'{name}.{key}', dtype=np.float32, mode='w+', shape=(max_replay_buffer_size, size))
        self._env_info_keys = env_info_sizes.keys()

        self._top = 0
        self._size = 0

    def __getitem__(self, index):

        observations = torch.from_numpy(self._observations[index])
        actions = torch.from_numpy(self._actions[index])
        rewards = torch.from_numpy(self._rewards[index])
        terminals = torch.from_numpy(self._terminals[index])
        next_observations = torch.from_numpy(self._next_obs[index])

        env_infos = dict()
        for key in self._env_info_keys:
            # Index into the memmap so only this transition's info is returned.
            env_infos[key] = torch.from_numpy(self._env_infos[key][index])

        return observations, actions, rewards, terminals, next_observations, env_infos

    def __len__(self):
        return self._size

    def add_sample(self, observation, action, reward, next_observation,
                   terminal, env_info, **kwargs):

        if self._size == 0:
            location = 0
        else:
            location = np.random.randint(0, self._size)

        self._observations[self._top] = self._observations[location]
        self._observations[location] = observation

        self._actions[self._top] = self._actions[location]
        self._actions[location] = action

        self._rewards[self._top] = self._rewards[location]
        self._rewards[location] = reward

        self._terminals[self._top] = self._terminals[location]
        self._terminals[location] = terminal.astype(dtype=np.float)

        self._next_obs[self._top] = self._next_obs[location]
        self._next_obs[location] = next_observation

        for key in self._env_info_keys:
            array = self._env_infos[key]

            array[self._top] = array[location]
            array[location] =  env_info[key]

        self._advance()

    def terminate_episode(self):
        pass

    def _advance(self):
        self._top = (self._top + 1) % self.max_buffer_size
        if self._size < self.max_buffer_size:
            self._size += 1

    def random_batch(self, batch_size):
        raise NotImplementedError("You shouldnt use this one on this buffer")

    def rebuild_env_info_dict(self, idx):
        return {
            key: self._env_infos[key][idx]
            for key in self._env_info_keys
        }

    def batch_env_info_dict(self, indices):
        return {
            key: self._env_infos[key][indices]
            for key in self._env_info_keys
        }

    def num_steps_can_sample(self):
        return self._size

    def get_diagnostics(self):
        return OrderedDict([
            ('size', self._size)
        ])

I am using this buffer in a sequential fashion (iterating over it as a Dataset), so instead of sampling random batches I insert new samples at randomized locations.
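
For completeness, a minimal sketch of how the buffer could be consumed through a DataLoader; the environment sizes and settings below are made up for illustration and are not part of my trainer:

    import numpy as np
    from torch.utils.data import DataLoader

    buffer = OptimizedDiskSequentialReplayBuffer(
        max_replay_buffer_size=1000000,
        observation_shape=(17,),          # made-up environment sizes
        observation_dtype=np.float32,
        action_dim=6,
        env_info_sizes={},
        location='/tmp',
        name='replay_buffer',
    )
    # ... fill it with add_sample(...) during rollouts, then iterate sequentially:
    loader = DataLoader(buffer, batch_size=256, shuffle=False, num_workers=0)
    for observations, actions, rewards, terminals, next_observations, env_infos in loader:
        pass  # hand the batch to the Dataset-based trainer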
