
stable-baselines3's Introduction


Stable Baselines3

Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.

You can read a detailed presentation of Stable Baselines3 in the v1.0 blog post or our JMLR paper.

These algorithms will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

Note: Despite its simplicity of use, Stable Baselines3 (SB3) assumes you have some knowledge about Reinforcement Learning (RL). You should not use this library without some practice. To that end, we provide good resources in the documentation to get started with RL.

Main Features

The performance of each algorithm was tested (see the Results section on its respective documentation page); take a look at issues #48 and #49 for more details.

| Features | Stable-Baselines3 |
| --- | --- |
| State of the art RL methods | ✔️ |
| Documentation | ✔️ |
| Custom environments | ✔️ |
| Custom policies | ✔️ |
| Common interface | ✔️ |
| Dict observation space support | ✔️ |
| Ipython / Notebook friendly | ✔️ |
| Tensorboard support | ✔️ |
| PEP8 code style | ✔️ |
| Custom callback | ✔️ |
| High code coverage | ✔️ |
| Type hints | ✔️ |

Planned features

Please take a look at the Roadmap and Milestones.

Migration guide: from Stable-Baselines (SB2) to Stable-Baselines3 (SB3)

A migration guide from SB2 to SB3 can be found in the documentation.

Documentation

Documentation is available online: https://stable-baselines3.readthedocs.io/

Integrations

Stable-Baselines3 has some integration with other libraries/services like Weights & Biases for experiment tracking or Hugging Face for storing/sharing trained models. You can find out more in the dedicated section of the documentation.

RL Baselines3 Zoo: A Training Framework for Stable Baselines3 Reinforcement Learning Agents

RL Baselines3 Zoo is a training framework for Reinforcement Learning (RL).

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

In addition, it includes a collection of tuned hyperparameters for common environments and RL algorithms, and agents trained with those settings.

Goals of this repository:

  1. Provide a simple interface to train and enjoy RL agents
  2. Benchmark the different Reinforcement Learning algorithms
  3. Provide tuned hyperparameters for each environment and RL algorithm
  4. Have fun with the trained agents!

Github repo: https://github.com/DLR-RM/rl-baselines3-zoo

Documentation: https://rl-baselines3-zoo.readthedocs.io/en/master/

SB3-Contrib: Experimental RL Features

We implement experimental features in a separate contrib repository: SB3-Contrib

This allows SB3 to maintain a stable and compact core, while still providing the latest features, like Recurrent PPO (PPO LSTM), Truncated Quantile Critics (TQC), Quantile Regression DQN (QR-DQN) or PPO with invalid action masking (Maskable PPO).

Documentation is available online: https://sb3-contrib.readthedocs.io/

Stable-Baselines Jax (SBX)

Stable Baselines Jax (SBX) is a proof of concept version of Stable-Baselines3 in Jax, with recent algorithms like DroQ or CrossQ.

It provides a minimal number of features compared to SB3 but can be much faster (up to 20x faster!): https://twitter.com/araffin2/status/1590714558628253698

Installation

Note: Stable-Baselines3 supports PyTorch >= 1.13

Prerequisites

Stable Baselines3 requires Python 3.8+.

Windows 10

To install Stable-Baselines3 on Windows, please look at the documentation.

Install using pip

Install the Stable Baselines3 package:

pip install stable-baselines3[extra]

Note: Some shells such as Zsh require quotation marks around brackets, i.e. pip install 'stable-baselines3[extra]' (More Info).

This includes optional dependencies like Tensorboard, OpenCV and ale-py to train on Atari games. If you do not need those, you can use:

pip install stable-baselines3

Please read the documentation for more details and alternatives (from source, using docker).

Example

Most of the code in the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.

Here is a quick example of how to train and run PPO on a cartpole environment:

import gymnasium as gym

from stable_baselines3 import PPO

env = gym.make("CartPole-v1", render_mode="human")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

vec_env = model.get_env()
obs = vec_env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = vec_env.step(action)
    vec_env.render()
    # VecEnv resets automatically
    # if done:
    #   obs = env.reset()

env.close()

Or just train a model with a one-liner if the environment is registered in Gymnasium and the policy is registered:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1").learn(10_000)

Please read the documentation for more examples.

Try it online with Colab Notebooks!

All the following examples can be executed online using Google Colab notebooks.

Implemented Algorithms

| Name | Recurrent | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing |
| --- | --- | --- | --- | --- | --- | --- |
| ARS¹ | | ✔️ | ✔️ | | | ✔️ |
| A2C | | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| DDPG | | ✔️ | | | | ✔️ |
| DQN | | | ✔️ | | | ✔️ |
| HER | | ✔️ | ✔️ | | | ✔️ |
| PPO | | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| QR-DQN¹ | | | ✔️ | | | ✔️ |
| RecurrentPPO¹ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| SAC | | ✔️ | | | | ✔️ |
| TD3 | | ✔️ | | | | ✔️ |
| TQC¹ | | ✔️ | | | | ✔️ |
| TRPO¹ | | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Maskable PPO¹ | | | ✔️ | ✔️ | ✔️ | ✔️ |

¹ Implemented in the SB3 Contrib GitHub repository.

Actions gym.spaces (see the short example below):

  • Box: An N-dimensional box that contains every point in the action space.
  • Discrete: A list of possible actions, where at each timestep only one of the actions can be used.
  • MultiDiscrete: A list of possible actions, where at each timestep only one action of each discrete set can be used.
  • MultiBinary: A list of possible actions, where at each timestep any of the actions can be used in any combination.
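
A quick illustration of how these four spaces can be constructed with Gymnasium (the sizes below are arbitrary examples):

import gymnasium as gym

box = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,))   # e.g. 3 continuous torques
discrete = gym.spaces.Discrete(4)                       # one of 4 actions per step
multi_discrete = gym.spaces.MultiDiscrete([3, 2])       # one action from each discrete set per step
multi_binary = gym.spaces.MultiBinary(4)                # any combination of 4 on/off actions

print(box.sample(), discrete.sample(), multi_discrete.sample(), multi_binary.sample())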

Testing the installation

Install dependencies

pip install -e .[docs,tests,extra]

Run tests

All unit tests in Stable-Baselines3 can be run using the pytest runner:

make pytest

To run a single test file:

python3 -m pytest -v tests/test_env_checker.py

To run a single test:

python3 -m pytest -v -k 'test_check_env_dict_action'

You can also do a static type check using pytype and mypy:

pip install pytype mypy
make type

Codestyle check with ruff:

pip install ruff
make lint

Projects Using Stable-Baselines3

We try to maintain a list of projects using stable-baselines3 in the documentation. Please tell us if you want your project to appear on this page ;)

Citing the Project

To cite this repository in publications:

@article{stable-baselines3,
  author  = {Antonin Raffin and Ashley Hill and Adam Gleave and Anssi Kanervisto and Maximilian Ernestus and Noah Dormann},
  title   = {Stable-Baselines3: Reliable Reinforcement Learning Implementations},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {268},
  pages   = {1-8},
  url     = {http://jmlr.org/papers/v22/20-1364.html}
}

Maintainers

Stable-Baselines3 is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian Ernestus (aka @ernestum), Adam Gleave (@AdamGleave), Anssi Kanervisto (@Miffyli) and Quentin Gallouédec (@qgallouedec).

Important Note: We do not provide technical support or consulting, and we do not answer personal questions via email. Please post your question on the RL Discord, Reddit, or Stack Overflow instead.

How To Contribute

To anyone interested in making the baselines better: there is still some documentation that needs to be done. If you want to contribute, please read the CONTRIBUTING.md guide first.

Acknowledgments

The initial work to develop Stable Baselines3 was partially funded by the project Reduced Complexity Models from the Helmholtz-Gemeinschaft Deutscher Forschungszentren, and by the EU's Horizon 2020 Research and Innovation Programme under grant number 951992 (VeriDream).

The original version, Stable Baselines, was created in the robotics lab U2IS (INRIA Flowers team) at ENSTA ParisTech.

Logo credits: L.M. Tenkes

stable-baselines3's People

Contributors

09tangriro, adamgleave, alexpasqua, araffin, corentinlger, dominicgkerr, ernestum, gregwar, kachayev, m-rph, manifoldfr, megan-klaiber, melanol, miffyli, ndormann, patrickhelm, qgallouedec, qxcv, rocamonde, rolandgvc, scheiklp, shwang, sidney-tio, simoninithomas, swamydev, thedebugger811, tobirohrer, tom-doerr, vwxyzjn, younik


stable-baselines3's Issues

[feature-request] Use torch's jit to reduce walltime

Motivation

RL, unlike DL, is reliant on both CPU and GPU performance, simply because we interact with the environment. DL is not reliant on the CPU because the processing is done on the GPU, so Python is a suitable choice. In contrast, Python is horrible for RL simply because function calls, property accesses, element accesses, partial application, currying and so on take a long time.

ps. related to #56.

Suggestion

See whether frequently used functions can be traced/jitted to reduce time in python's interpreter. This feature request is for >= 1.2.
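As a rough illustration of the suggestion (not SB3 code; the network and shapes below are made up), a hot forward pass can be traced into TorchScript so it runs outside the Python interpreter:

import torch as th
import torch.nn as nn

# Minimal sketch: trace a small policy-like MLP with an example input.
net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
traced_net = th.jit.trace(net, th.zeros(1, 4))

# The traced module can then be called like the original one.
with th.no_grad():
    print(traced_net(th.randn(1, 4)).shape)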

Alternatives

Numba

See what can be compiled with numba or equivalent and is worth the effort. Issue: adds additional dependencies...

Cython

See what can be compiled with cython and decide whether the effort is worth it. Depends on Cython's compat with torch though and adds additional complexity.

[SAC] Potential bug in temperature learning

ent_coef_loss = -(self.log_ent_coef * (log_prob + self.target_entropy).detach()).mean()

In the paper they compute the loss as -alpha * (logp + entropy_target), but here you implemented -log(alpha) * (logp + entropy_target), if I am not mistaken.

I guess it should look like:

            ent_coef_loss = None
            if self.ent_coef_optimizer is not None:
                # Important: detach the variable from the graph
                # so we don't change it with other losses
                # see https://github.com/rail-berkeley/softlearning/issues/60
                ent_coef = th.exp(self.log_ent_coef) # No need to detach here anyways
                ent_coef_loss = -(ent_coef * (log_prob.detach() + self.target_entropy)).mean()
                ent_coef_losses.append(ent_coef_loss.item())
            else:
                ent_coef = self.ent_coef_tensor

Of course you could detach ent_coef afterwards in both cases, to maybe save time when computing actor_loss and doing the respective .backward().

Sphinx doc tests support

As we don't mock any module anymore, we may enable doc tests with sphinx.

However, we should disable it for read the doc to avoid overloading their server.
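A minimal sketch of what enabling this could look like in docs/conf.py (assuming the docs already build with autodoc); the doctests would then run locally with the doctest builder rather than on Read the Docs:

# docs/conf.py (sketch)
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.doctest",  # collects and runs ``>>>`` examples
]

# Run locally with: sphinx-build -b doctest docs docs/_build/doctest  (or `make doctest`)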

[question] How to run a model on multiple GPUs

Problem Description
If I want to train a model on more than one GPU, should I just state it like this:

import gym
from stable_baselines3.a2c import MlpPolicy
from stable_baselines3 import PPO
import os

if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
    env = gym.make('CartPole-v1')
    model = PPO(MlpPolicy, env, verbose=1,
                policy_kwargs={'net_arch': [dict(pi=[4096, 4096, 4096, 4096, 4096], vf=[4096, 4096, 4096, 4096, 4096])]},
                n_steps=8192*8, batch_size=8192*4, n_epochs=2)
    model.learn(total_timesteps=1e6)

    obs = env.reset()
    for i in range(1000):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            obs = env.reset()

To check the GPU utilization, I used a very big MLP network and batch size, but it does not seem to work.
The nvidia-smi output is shown below:

Every 1.0s: nvidia-smi                                                                                                                Sat Jun 27 21:05:03 2020

Sat Jun 27 21:05:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 29%   61C    P2   148W / 250W |   6211MiB /  7952MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:03:00.0 Off |                  N/A |
| 20%   32C    P8     5W / 250W |     10MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:83:00.0 Off |                  N/A |
| 20%   29C    P8    16W / 250W |     10MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:84:00.0 Off |                  N/A |
| 20%   29C    P8     1W / 250W |     10MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2880      C   python                                      6201MiB |
+-----------------------------------------------------------------------------+

What should I do if I want to use more than one GPU?

Comparison with tensorflow1/2 implementation

I wanted to ask if you expect better performance using PyTorch in stable-baselines3 than if it were implemented with TensorFlow 2 (or compared to the current TensorFlow 1 implementation). If so, where do you expect the advantages to come from in the end?

Memory allocation for buffers

With the current implementation of buffers.py, one can request a buffer size which doesn't fit in the available memory, but because of NumPy's implementation of np.zeros() the memory is not allocated before it is actually used. Since the buffer is meant to be filled completely (otherwise one could just use a smaller buffer), the computer will eventually run out of memory and start to swap heavily. Because only small parts of the buffer are accessed at once (minibatches), the system will just swap the necessary pages in and out of memory. At that point the progress of the run is most likely lost and one has to start a new run with a smaller buffer.

I would recommend using np.ones instead, as it will allocate the buffer at the beginning and fail if there is not enough memory provided by the system. The only issue is that there is no clear error description when the system memory is exceeded: Python simply gets killed by the OS with a SIGKILL. Maybe one could catch that?
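A small sketch of the failure mode described above (the buffer shape is a made-up example); on systems with memory overcommit, np.zeros appears to succeed even when the buffer cannot fit in RAM, while np.ones would fail up front:

import numpy as np

buffer_shape = (1_000_000, 84, 84, 4)  # hypothetical replay buffer, ~113 GB as float32

lazy_buffer = np.zeros(buffer_shape, dtype=np.float32)    # returns immediately; pages are only allocated on first write
# eager_buffer = np.ones(buffer_shape, dtype=np.float32)  # touches every page, so it fails fast if memory is insufficient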

[question] Should collect_rollouts have env.reset() at the end of each episode?

I was looking at the collect_rollouts function in base_class.py, and it doesn't seem to call env.reset() to reset the environment between each episode.

if done:
    total_episodes += 1
    self._episode_num += 1
    episode_rewards.append(episode_reward)
    total_timesteps.append(episode_timesteps)
    if action_noise is not None:
        action_noise.reset()

Compare that to SB, which does seem to call env.reset():
https://github.com/hill-a/stable-baselines/blob/c20df905363cb6a3649ea275a0be0bf82d6a575f/stable_baselines/sac/sac.py#L474-L478

Is there some place that env.reset() is being called that I am not seeing, or could it be that I am not quite understanding how collect_rollouts should work?

Implement Vanilla DQN

Almost ready; performance tests are ongoing.

  • doc example need to be updated afterward

@Artemis-Skade is working on it

Pytest fails randomly on VecNormalize tests

See https://github.com/DLR-RM/stable-baselines3/pull/53/checks?check_run_id=758580860
and https://gitlab.com/araffin/stable-baselines3/-/jobs/589877156

Since #50 was merged.

Traceback:

=================================== FAILURES ===================================
___________________________ test_sync_vec_normalize ____________________________

    def test_sync_vec_normalize():
        env = DummyVecEnv([make_env])

        assert unwrap_vec_normalize(env) is None

        env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10., clip_reward=10.)

        assert isinstance(unwrap_vec_normalize(env), VecNormalize)

        env = VecFrameStack(env, 1)

        assert isinstance(unwrap_vec_normalize(env), VecNormalize)

        eval_env = DummyVecEnv([make_env])
        eval_env = VecNormalize(eval_env, training=False, norm_obs=True, norm_reward=True, clip_obs=10., clip_reward=10.)
        eval_env = VecFrameStack(eval_env, 1)

        env.reset()
        # Initialize running mean
        latest_reward = None
        for _ in range(100):
            _, latest_reward, _, _ = env.step([env.action_space.sample()])

        # Check that unnormalized reward is same as original reward
        original_latest_reward = env.get_original_reward()
        assert np.allclose(original_latest_reward, env.unnormalize_reward(latest_reward))

        obs = env.reset()
        dummy_rewards = np.random.rand(10)
        original_obs = env.get_original_obs()
        # Check that unnormalization works
>       assert np.allclose(original_obs, env.unnormalize_obs(obs))
E       assert False
E        +  where False = <function allclose at 0x7fc839db05f0>(array([[ 0.24142723,  0.97041893, -0.96644473]], dtype=float32), array([[-0.48370542,  0.9704189 , -0.96644476]]))
E        +    where <function allclose at 0x7fc839db05f0> = np.allclose
E        +    and   array([[-0.48370542,  0.9704189 , -0.96644476]]) = <bound method VecNormalize.unnormalize_obs of <stable_baselines3.common.vec_env.vec_normalize.VecNormalize object at 0x7fc814804d50>>(array([[10.        ,  3.0962322 , -0.74936515]], dtype=float32))
E        +      where <bound method VecNormalize.unnormalize_obs of <stable_baselines3.common.vec_env.vec_normalize.VecNormalize object at 0x7fc814804d50>> = <stable_baselines3.common.vec_env.vec_frame_stack.VecFrameStack object at 0x7fc814804450>.unnormalize_obs

tests/test_vec_normalize.py:167: AssertionError

Implement HER

EDIT: we might want to support Dict obs for VecNormalize (maybe another issue)

[Feature-Request] make_vec_env for bsuite environments

Motivation

Bsuite environments are short/simple to run and are often good validation environments. The idea is to implement the equivalent of make_atari/make_vec_env but for the bsuite environments.

Details

More or less the same implementation as make_vec_env, except the bsuite environments will also be wrapped in an adapter to make them compatible with the gym interface. This will also be optional, and available iff bsuite is installed.
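A rough sketch of what such a helper could look like (make_vec_bsuite_env is hypothetical, and it assumes bsuite ships a gym adapter such as bsuite.utils.gym_wrapper.GymFromDMEnv):

from stable_baselines3.common.vec_env import DummyVecEnv


def make_vec_bsuite_env(bsuite_id: str, n_envs: int = 1) -> DummyVecEnv:
    # Imported lazily so the helper is only available if bsuite is installed.
    import bsuite
    from bsuite.utils import gym_wrapper

    def make_env():
        return gym_wrapper.GymFromDMEnv(bsuite.load_from_id(bsuite_id))

    return DummyVecEnv([make_env for _ in range(n_envs)])

# vec_env = make_vec_bsuite_env("catch/0", n_envs=4)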

[Bug] Wrong target q in SAC

In SAC, when updating the Q-values:

target_q = th.min(target_q1, target_q2)
target_q = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q
# td error + entropy term
q_backup = target_q - ent_coef * next_log_prob.reshape(-1, 1)

should be (entropy term also conditioned by termination):

target_q = th.min(target_q1, target_q2) - ent_coef * next_log_prob.reshape(-1, 1)
q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q

See original implementation: https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L26

The bug is not present in SB2 (needs to be double-checked), as SB2 uses the SAC variant with a value function whereas we are using 2 Q-values.

I will push a fix soon.

Roadmap to Stable-Baselines3 V1.0

This issue is meant to be updated as the list of changes is not exhaustive

Dear all,

Stable-Baselines3 beta is now out 🎉 ! This issue is meant to reference what is implemented and what is missing before a first major version.

As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).

I will try to review the features mentioned in hill-a/stable-baselines#576 (and hill-a/stable-baselines#733)
and I will create issues soon to reference what is missing.

What is implemented?

  • basic features (training/saving/loading/predict)
  • basic set of algorithms (A2C/PPO/SAC/TD3)
  • basic pre-processing (Box and Discrete observation/action spaces are handled)
  • callback support
  • complete benchmark for the continuous action case
  • basic rl zoo for training/evaluating plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
  • consistent api
  • basic tests and most type hints
  • continuous integration (I'm in discussion with the organization admins for that)
  • handle more observation/action spaces #4 and #5 (thanks @rolandgvc)
  • tensorboard integration #9 (thanks @rolandgvc)
  • basic documentation and notebooks
  • automatic build of the documentation
  • Vanilla DQN #6 (thanks @Artemis-Skade)
  • Refactor off-policy critics to reduce code duplication #3 (see #78 )
  • DDPG #3
  • do a complete benchmark for the discrete case #49 (thanks @Miffyli !)
  • performance check for continuous actions #48 (even better than gSDE paper)
  • get/set parameters for the base class (#138 )
  • clean up type-hints in docs #10 (cumbersome to read)
  • documenting the migration between SB and SB3 #11
  • finish typing some methods #175
  • HER #8 (thanks @megan-klaiber)
  • finishing to update and clean the doc #166 (help is wanted)
  • finishing to update the notebooks and the tutorial #7 (I will do that, only HER notebook missing)

What are the new features?

  • much cleaner base code (and no more warnings =D )
  • independent saving/loading/predict for policies
  • State-Dependent Exploration (SDE) for using RL directly on real robots (this is a unique feature, it was the starting point of SB3, I published a paper on that: https://arxiv.org/abs/2005.05719)
  • proper evaluation (using separate env) is included in the base class (using EvalCallback)
  • all environments are VecEnv
  • better saving/loading (now can include the replay buffer and the optimizers)
  • any number of critics are allowed for SAC/TD3
  • custom actor/critic net arch for off-policy algos (#113 )
  • QR-DQN in SB3-Contrib
  • Truncated Quantile Critics (TQC) (see #83 ) in SB3-Contrib
  • @Miffyli suggested a "contrib" repo for experimental features (it is here)

What is missing?

  • syncing some files with Stable-Baselines to remain consistent (we may be good now, but it needs to be checked)
  • finish code review of existing code #17

Checklist for v1.0 release

  • Update Readme
  • Prepare blog post
  • Update doc: add links to the stable-baselines3 contrib
  • Update docker image to use newer Ubuntu version
  • Populate RL zoo

What is next? (for V1.1+)

side note: should we change the default start_method to fork? (now that we don't have tf anymore)

transferrable models

Hi, is there a way to save a model only with the internal layers of the NN? Or is it possible to modify a trained model (the zip file) to let it retrain with a different action dimension?

I am thinking of the following example: say we have env_1 with action dimension 10, and env_2, which is a more complicated version of env_1, with action dimension 20. One model is trained with env_1. Can we modify the model (zip file) and use it as an initial value to train on env_2 afterwards?

Thank you very much!
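There is no built-in SB3 feature for this, but as a rough sketch (env_1.zip and env_2 are placeholders), one could copy only the policy parameters whose shapes match and fine-tune the rest:

from stable_baselines3 import PPO

old_model = PPO.load("env_1.zip")      # trained on env_1 (10 actions); placeholder path
new_model = PPO("MlpPolicy", env_2)    # env_2: the new environment with 20 actions (placeholder, not defined here)

old_params = old_model.policy.state_dict()
new_params = new_model.policy.state_dict()

# Keep only tensors that exist in both policies with identical shapes
# (typically the internal layers; the final action layer will differ).
matching = {k: v for k, v in old_params.items()
            if k in new_params and v.shape == new_params[k].shape}
new_params.update(matching)
new_model.policy.load_state_dict(new_params)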

[Feature-Request] Automatically create path to save location

Motivation

It's very unfortunate to lose a few hours of work simply because the destination folder doesn't exist ( or the path was mistyped ^^')

Suggestion(s)

Either of:

  • Automatically create the path to the destination folder and raise a warning if the path didn't exist.
  • Save to /tmp/ if the path doesn't exist and raise a warning.
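A minimal sketch of the first option (ensure_save_dir is a hypothetical helper, not SB3 API):

import os
import warnings


def ensure_save_dir(path: str) -> None:
    # Create missing parent directories and warn instead of failing after hours of training.
    save_dir = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(save_dir):
        warnings.warn(f"Path '{save_dir}' did not exist and was created.")
        os.makedirs(save_dir, exist_ok=True)

# ensure_save_dir("logs/experiment_42/model.zip") would create logs/experiment_42/ before saving.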

[feature-request] Vectorized/Stacked Action Noise

I am opening an issue to discuss the implementation and leave the PR for the code review.

The motivation behind this addition is that for OU or other noise processes we need the state to exist until the end of the trajectory. We can't use a single process for multiple parallel episodes as they have different lengths. Thus, stacked/vectorized processes allow us to keep the processes for as long as the particular episode goes.

My internal code (from the old issue in SB2).

import copy  # used by the alternative variants below
from functools import partial
from typing import Iterable, List, Type

import numpy as np

from stable_baselines3.common.noise import ActionNoise


class VecNoise(ActionNoise):

    def __init__(self, base_class: Type[ActionNoise], num_noise: int, **noise_params):
        super().__init__()
        constr = partial(base_class, **noise_params)
        self._noise: List[ActionNoise] = [constr() for _ in range(num_noise)]

    def reset(self) -> None:
        """
        Call reset in all noise objects
        """
        for noise in self._noise:
            noise.reset()

    def reset_by_index(self, indices: Iterable[int]) -> None:
        """
        Args:
            indices: the indices to reset
        """
        for index in indices:
            self._noise[index].reset()

    def __repr__(self) -> str:
        return "VecNoise({})".format(repr(self._noise[0]))

    def __call__(self) -> np.ndarray:
        return np.stack([noise() for noise in self._noise])

An alternative is to provide the noise directly and copy/deepcopy it

class VecNoise(ActionNoise):

    def __init__(self, base_noise:ActionNoise, num_noise: int):
        super().__init__()
        self._noise: List[ActionNoise] = []
        for _ in range(num_noise):
            noise = copy.deepcopy(base_noise)
            noise.reset() 
            self._noise.append(noise)
   ...

Or

class VecNoise(ActionNoise):

    def __init__(self, base_noise:ActionNoise, num_noise: int):
        super().__init__()
        self._noise: List[ActionNoise] = tuple(map(copy.deepcopy, [base_noise]*num_noise))
        tuple(map(ActionNoise.reset, self._noise))
   ...
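Hypothetical usage of the copy-based variant sketched above (VecNoise as defined here is not an SB3 class, and this assumes the __call__ from the first variant; OrnsteinUhlenbeckActionNoise is the existing SB3 noise):

import numpy as np

from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

base_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(2), sigma=0.1 * np.ones(2))
vec_noise = VecNoise(base_noise, num_noise=4)   # one independent process per parallel env
noise_batch = vec_noise()                       # stacked samples, shape (4, 2)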

Bug in PPO - Performance does not match gSDE paper

The issue may come from this commit (distribution refactoring): fdecd51

The difference between working and not working code: ba18258...fdecd51

Currently inspecting the commit, but help is welcome ;)

Related to #49 #48

v0.3.0 is working, v0.4.0 has the bug.

Perf PPO on HalfCheetah using the rl zoo:
python train.py --algo ppo --env HalfCheetahBulletEnv-v0 --eval-freq 50000 --seed 2682960776

Pybullet: 2.6.5 (should work with 2.7.1 too)
Gym: 0.17.1
PyTorch: 1.5.0

Seed: 2682960776 - cpu

SB3 24 february (working version) - gSDE Paper version

SB3: f1a4fa2
RL zoo: abf8fcd

Eval num_timesteps=49985, episode_reward=-1162.74 +/- 39.43
Eval num_timesteps=99985, episode_reward=-1206.28 +/- 48.66
Eval num_timesteps=149985, episode_reward=-1167.46 +/- 29.00
Eval num_timesteps=199985, episode_reward=871.04 +/- 8.67
Eval num_timesteps=249985, episode_reward=552.29 +/- 918.83
Eval num_timesteps=299985, episode_reward=1302.70 +/- 29.44
Eval num_timesteps=349985, episode_reward=1459.13 +/- 96.28
Eval num_timesteps=399985, episode_reward=1225.08 +/- 593.00
Eval num_timesteps=449985, episode_reward=1966.47 +/- 55.37
| time_elapsed | 745 |

v0.3.0 (working)

Eval num_timesteps=49985, episode_reward=-1251.36 +/- 73.55
Eval num_timesteps=99985, episode_reward=-1335.98 +/- 8.61
Eval num_timesteps=149985, episode_reward=722.82 +/- 32.92
Eval num_timesteps=199985, episode_reward=789.23 +/- 41.66
Eval num_timesteps=249985, episode_reward=884.63 +/- 12.12
Eval num_timesteps=299985, episode_reward=1128.64 +/- 27.48
Eval num_timesteps=349985, episode_reward=1326.70 +/- 80.14
Eval num_timesteps=399985, episode_reward=1528.11 +/- 52.68
| time_elapsed | 662 |
Eval num_timesteps=449985, episode_reward=1626.98 +/- 75.79

23 March - Remove CEMRL (working)

SB3: 4b2092f
RL zoo: 4c685669169b212a
Eval num_timesteps=49985, episode_reward=-1114.12 +/- 278.50
Eval num_timesteps=99985, episode_reward=-1159.72 +/- 28.42
Eval num_timesteps=149985, episode_reward=-1076.83 +/- 194.04
Eval num_timesteps=199985, episode_reward=-395.34 +/- 711.05
Eval num_timesteps=249985, episode_reward=-46.66 +/- 384.98
Eval num_timesteps=299985, episode_reward=993.80 +/- 403.25
Eval num_timesteps=349985, episode_reward=1534.85 +/- 18.85

23 March - Change pre-processing (working)

SB3: ba18258
RL zoo: 8b71eddc7561b26

Eval num_timesteps=49985, episode_reward=-1294.17 +/- 71.03
Eval num_timesteps=99985, episode_reward=-1047.79 +/- 107.26
Eval num_timesteps=149985, episode_reward=-509.42 +/- 736.42
Eval num_timesteps=199985, episode_reward=491.34 +/- 37.18
Eval num_timesteps=249985, episode_reward=929.41 +/- 55.70
Eval num_timesteps=299985, episode_reward=922.89 +/- 52.07
Eval num_timesteps=349985, episode_reward=1161.23 +/- 71.37

31 March - Refactor Action Distribution v0.4.0a3 (not working)

SB3: fdecd51
RL zoo: 8b71eddc7561b26

Eval num_timesteps=49985, episode_reward=-1254.91 +/- 96.88
Eval num_timesteps=99985, episode_reward=-1139.13 +/- 175.29
Eval num_timesteps=149985, episode_reward=-608.69 +/- 658.00
Eval num_timesteps=199985, episode_reward=334.35 +/- 363.71
Eval num_timesteps=249985, episode_reward=-283.72 +/- 485.11
Eval num_timesteps=299985, episode_reward=-44.18 +/- 84.45
Eval num_timesteps=349985, episode_reward=192.63 +/- 19.71
Eval num_timesteps=399985, episode_reward=292.71 +/- 177.96
| time_elapsed | 683 |

v0.4.0 (not working)

Eval num_timesteps=49985, episode_reward=-1335.59 +/- 38.26
Eval num_timesteps=99985, episode_reward=-717.95 +/- 415.52
Eval num_timesteps=149985, episode_reward=-555.61 +/- 99.95
Eval num_timesteps=199985, episode_reward=-1091.25 +/- 37.42
Eval num_timesteps=249985, episode_reward=-741.49 +/- 92.98
Eval num_timesteps=299985, episode_reward=-139.83 +/- 60.00
Eval num_timesteps=349985, episode_reward=10.24 +/- 306.05
Eval num_timesteps=399985, episode_reward=554.69 +/- 13.50
Eval num_timesteps=449985, episode_reward=634.15 +/- 12.41
Eval num_timesteps=499985, episode_reward=721.88 +/- 13.24
| time_elapsed | 696 |

v0.5.0 (not working)

Eval num_timesteps=49985, episode_reward=-1234.57 +/- 76.73
Eval num_timesteps=99985, episode_reward=-1102.34 +/- 103.49
Eval num_timesteps=149985, episode_reward=-948.12 +/- 138.89
Eval num_timesteps=199985, episode_reward=483.17 +/- 90.65
Eval num_timesteps=249985, episode_reward=609.21 +/- 14.94
Eval num_timesteps=299985, episode_reward=651.08 +/- 24.13
Eval num_timesteps=349985, episode_reward=497.35 +/- 335.62
| time_elapsed | 591 |
Eval num_timesteps=399985, episode_reward=524.58 +/- 313.33
| time_elapsed | 677 |

Hyperparameters:

HalfCheetahBulletEnv-v0:
  env_wrapper: utils.wrappers.TimeFeatureWrapper
  normalize: true
  n_envs: 16
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  batch_size: 128
  n_steps: 512
  gamma: 0.99
  gae_lambda: 0.9
  n_epochs: 20
  ent_coef: 0.0
  sde_sample_freq: 4
  max_grad_norm: 0.5
  vf_coef: 0.5
  learning_rate: !!float 3e-5
  use_sde: True
  clip_range: 0.4
  policy_kwargs: "dict(log_std_init=-2,
                       ortho_init=False,
                       activation_fn=nn.ReLU,
                       net_arch=[dict(pi=[256, 256], vf=[256, 256])]
                       )"

Custom parser for type hints

Type hints are nice for checking the code, but they make the documentation hard to read.
We should write a custom parser to display the hints in a cleaner way.

Related: https://michaelgoerz.net/notes/extending-sphinx-napoleon-docstring-sections.html

and https://github.com/agronholm/sphinx-autodoc-typehints

Update after #167: Above added. Remaining TODOs:

  • Clean up parentheses in the type hinting (they are formatted as standard text, not in code-style).

Regex used:

(:((param)|(return))\s*[a-zA-Z_0-9]*:) (\([a-zA-Z_\[\]\s,\.0-9]+\))

and replace with $1.
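For illustration, this is how the regex could be applied with Python's re module (the docstring line is a made-up example; in re.sub the backreference is \1 rather than $1):

import re

pattern = re.compile(r"(:((param)|(return))\s*[a-zA-Z_0-9]*:) (\([a-zA-Z_\[\]\s,\.0-9]+\))")
line = ":param learning_rate: (float) The learning rate"
print(pattern.sub(r"\1", line))  # -> ":param learning_rate: The learning rate"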

[Discussion] More issue templates

It would be really useful to have more issue templates that provide better guidelines on how to ask/discuss different things and differentiate between questions, bugs, feature requests, and other forms of discussion.

Migration guide

Document:

  • breaking changes (dropped features and renaming)
  • new features
  • how to migrate code
  • action proba removed for now

[Feature-request] Allow auxiliary tasks in policies through a unified interface

Motivation

There is a large corpus of work that shows that auxiliary tasks (i.e. anything besides learning a policy and a value function) can help agents learn better state representations, e.g. autoencoders, forward predictive models, CURL, etc.

Proposal

The current implementation of algorithms enables an almost trivial way of incorporating additional tasks through slight modifications of the source code. The proposal is to make it so that changes need to happen in a small segment of the code instead of every function that calls the policy.

This request is for > 1.2.

Specifics

Consider for example the ActorCriticPolicy class
The suggestion is to change:

def _get_latent(self, obs: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:

to

def _get_latent(self, obs: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor, Optional[Tuple[th.Tensor,...]]]:

and propagate the results where necessary, for example:

    def forward(self, obs: th.Tensor,
                deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        
        latent_pi, latent_vf, latent_sde = self._get_latent(obs)

        values = self.value_net(latent_vf)
        distribution = self._get_action_dist_from_latent(latent_pi, latent_sde=latent_sde)
        actions = distribution.get_actions(deterministic=deterministic)
        log_prob = distribution.log_prob(actions)
        return actions, values, log_prob

becomes

    def forward(self, obs: th.Tensor,
                deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor, Optional[Tuple[th.Tensor,...]]]:
        
        latent_pi, latent_vf, latent_sde, other_latent = self._get_latent(obs)

        values = self.value_net(latent_vf)
        distribution = self._get_action_dist_from_latent(latent_pi, latent_sde=latent_sde)
        actions = distribution.get_actions(deterministic=deterministic)
        log_prob = distribution.log_prob(actions)
        return actions, values, log_prob, other_latent

and account for this when a rollout is collected.
The consequence here is that when deriving the algorithm, e.g. to add an autoencoder, we will only need to override the _get_latent and (iirc) train functions: the former to return the latent of the encoder, the latter to compute the loss of the decoder.

Environment is reset twice per episode when evaluating policy on DummyVecEnv

The evaluate_policy helper function resets the environment at the start of each episode:

for _ in range(n_eval_episodes):
    obs = env.reset()

But DummyVecEnv automatically resets the environment when step returns done = True:

if self.buf_dones[env_idx]:
    # save final observation where user can get it, then reset
    self.buf_infos[env_idx]['terminal_observation'] = obs
    obs = self.envs[env_idx].reset()

This causes the environment to reset twice per episode when evaluating the policy.
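One possible sketch of avoiding the double reset (not the actual fix adopted in SB3; model, env, n_eval_episodes, and deterministic are assumed to be defined as in evaluate_policy, with a single-environment VecEnv): reset once before the loop and rely on the VecEnv auto-reset between episodes:

episode_rewards = []
obs = env.reset()
for _ in range(n_eval_episodes):
    done, episode_reward = False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=deterministic)
        obs, rewards, dones, _infos = env.step(action)
        episode_reward += rewards[0]
        done = dones[0]  # single environment assumed
    episode_rewards.append(episode_reward)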

observations Buffer float32 initialization

I have taken a look at SB3 and it looks good so far.

But one thing I don't like is that the replay/rollout observations buffer is always initialized with float32. This uses a lot of memory with image observation spaces, which is not necessary because the image observations are in uint8. My proposal is to initialize the observations buffer with whatever dtype the observations have. This is quite easy to do because spaces.Space already has the dtype information stored in it.

NOTE:
This issue is a bit related to #37
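A small sketch of the proposal (buffer_size and the image shape are arbitrary examples): take the dtype from the observation space instead of hard-coding float32:

import numpy as np
from gymnasium import spaces

obs_space = spaces.Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
buffer_size = 10_000

# Allocate the buffer with the dtype reported by the observation space.
observations = np.zeros((buffer_size,) + obs_space.shape, dtype=obs_space.dtype)
print(observations.nbytes / 1e6)  # ~282 MB as uint8, vs ~1129 MB if stored as float32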

PPO + gSDE trained on GPU throws an error at test time with `deterministic=True`

Describe the bug

See issue DLR-RM/rl-baselines3-zoo#18

Code example

python train.py --algo ppo --env AntBulletEnv-v0 -n 1000 -params use_sde:True
python enjoy.py --algo ppo --env AntBulletEnv-v0 -f logs --no-render
Traceback (most recent call last):
  File "enjoy.py", line 204, in <module>
    main()
  File "enjoy.py", line 131, in main
    action, state = model.predict(obs, state=state, deterministic=deterministic)
  File "/home/dell/stable-baselines3/stable_baselines3/common/base_class.py", line 321, in predict
    return self.policy.predict(observation, state, mask, deterministic)
  File "/home/dell/stable-baselines3/stable_baselines3/common/policies.py", line 225, in predict
    actions = self._predict(observation, deterministic=deterministic)
  File "/home/dell/stable-baselines3/stable_baselines3/ppo/policies.py", line 278, in _predict
    return distribution.get_actions(deterministic=deterministic)
  File "/home/dell/stable-baselines3/stable_baselines3/common/distributions.py", line 59, in get_actions
    return self.sample()
  File "/home/dell/stable-baselines3/stable_baselines3/common/distributions.py", line 551, in sample
    noise = self.get_noise(self._latent_sde)
  File "/home/dell/stable-baselines3/stable_baselines3/common/distributions.py", line 542, in get_noise
    return th.mm(latent_sde, self.exploration_mat)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'mat2' in call to _th_mm

System Info

  • latest version (with fix for VecEnv)
  • CUDA enabled

[question] retrain model after loading checkpoint

Problem description
For some reasons, our model is trained in steps. After each checkpoint, we decide whether to change some settings or not, and then load the model for further training. Sometimes, even when we did not change anything in the model, the training performance (not the prediction performance) after loading had a gap compared to the previous training performance. The awkward thing is that the gap seems random; we are not sure where the problem comes from and it is hard to reproduce. We have checked the params in the policy and are sure that they are correctly loaded every time, and the learning rate/clip_range etc. are default constant values, so we guess that the problem might come from the optimizer, because the optimizer is not saved in the save method.
So if you have any opinion about the retraining procedure, please give us some guidance, and tell us whether we should save the optimizer in the checkpoint.
Best wishes.

Performance check (Continuous Actions)

Check that the algorithms reach expected performance.
This was already done prior to v0.5 for the gSDE paper but as we made big changes, it is good to check that again.

SB2 vs SB3 (Tensorflow Stable-Baselines vs Pytorch Stable-Baselines3)

  • A2C (6 seeds)

a2c.pdf
a2c_ant.pdf
a2c_half.pdf
a2c_hopper.pdf
a2c_walker.pdf

  • PPO (6 seeds)

ppo.pdf
ant_ppo.pdf
half_ppo.pdf
hopper_ppo.pdf
ppo_walker.pdf

  • SAC (3 seeds)

sac.pdf
sac_ant.pdf
sac_half.pdf
sac_hopper.pdf
sac_walker.pdf

  • TD3 (3 seeds)

td3.pdf
td3_ant.pdf
td3_half.pdf
td3_hopper.pdf
td3_walker.pdf

See https://paperswithcode.com/paper/generalized-state-dependent-exploration-for for the score that should be reached in 1M (off-policy) or 2M steps (on-policy).

Test envs: PyBullet Envs

Tested with version 0.8.0 (feat/perf-check branch in the two zoos)

SB3 commit hash: cceffd5
rl-zoo commit hash: 99f7dd0321c5beea1a0d775ad6bc043d41f3e2db

| Environments | A2C (SB2) | A2C (SB3) | PPO (SB2) | PPO (SB3) | SAC (SB2) | SAC (SB3) | TD3 (SB2) | TD3 (SB3) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HalfCheetah | 1859 +/- 161 | 2003 +/- 54 | 2186 +/- 260 | 1976 +/- 479 | 2833 +/- 21 | 2757 +/- 53 | 2530 +/- 141 | 2774 +/- 35 |
| Ant | 2155 +/- 237 | 2286 +/- 72 | 2383 +/- 284 | 2364 +/- 120 | 3349 +/- 60 | 3146 +/- 35 | 3368 +/- 125 | 3305 +/- 43 |
| Hopper | 1457 +/- 75 | 1627 +/- 158 | 1166 +/- 287 | 1567 +/- 339 | 2391 +/- 238 | 2422 +/- 168 | 2542 +/- 79 | 2429 +/- 126 |
| Walker2D | 689 +/- 59 | 577 +/- 65 | 1117 +/- 121 | 1230 +/- 147 | 2202 +/- 45 | 2184 +/- 54 | 1686 +/- 584 | 2063 +/- 185 |

Generalized State-Dependent Exploration (gSDE)

See https://paperswithcode.com/paper/generalized-state-dependent-exploration-for for the score that should be reached in 1M (off-policy) or 2M steps (on-policy).

  • on policy (2M steps, 6 seeds):

gsde_onpolicy.pdf
gsde_onpolicy_ant.pdf
gsde_onpolicy_half.pdf
gsde_onpolicy_hopper.pdf
gsde_onpolicy_walker.pdf

  • off-policy (1M steps, 3 seeds):

gsde_off_policy.pdf
gsde_offpolicy_ant.pdf
gsde_offpolicy_half.pdf
gsde_offpolicy_hopper.pdf
gsde_offpolicy_walker.pdf

SB3 commit hash: b948b7f
rl zoo commit hash: b56c1470c9a958c196f60e814de893050e2469f0

| Environments | A2C (Gaussian) | A2C (gSDE) | PPO (Gaussian) | PPO (gSDE) | SAC (Gaussian) | SAC (gSDE) | TD3 (Gaussian) | TD3 (gSDE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HalfCheetah | 2003 +/- 54 | 2032 +/- 122 | 1976 +/- 479 | 2826 +/- 45 | 2757 +/- 53 | 2984 +/- 202 | 2774 +/- 35 | 2592 +/- 84 |
| Ant | 2286 +/- 72 | 2443 +/- 89 | 2364 +/- 120 | 2782 +/- 76 | 3146 +/- 35 | 3102 +/- 37 | 3305 +/- 43 | 3345 +/- 39 |
| Hopper | 1627 +/- 158 | 1561 +/- 220 | 1567 +/- 339 | 2512 +/- 21 | 2422 +/- 168 | 2262 +/- 1 | 2429 +/- 126 | 2515 +/- 67 |
| Walker2D | 577 +/- 65 | 839 +/- 56 | 1230 +/- 147 | 2019 +/- 64 | 2184 +/- 54 | 2136 +/- 67 | 2063 +/- 185 | 1814 +/- 395 |

DDPG

Using TD3 hyperparameters as a base, with some minor adjustments (lr, batch_size) for stability.

6 seeds, 1M steps.

| Environments | DDPG (Gaussian) |
| --- | --- |
| HalfCheetah | 2272 +/- 69 |
| Ant | 1651 +/- 407 |
| Hopper | 1201 +/- 211 |
| Walker2D | 882 +/- 186 |

[Feature Request] Multi-Agent (MA) Support / Distributed algorithms (IMPALA/APEX)

Here is an issue to discuss about multi-agent and distributed agent support.

My personal view is that this should be done outside SB3 (even though it could use SB3 as a base), and in any case not before 1.2+.

Related issues:

Related projects: "Slime Volley Ball" (self-play) and "Adversarial Policies" in https://stable-baselines.readthedocs.io/en/master/misc/projects.html

This may interest: @eugenevinitsky @justinkterry @Ujwal2910 @AlessandroZavoli
anyone else?

Episode mean reward is not properly logged on tensorboard when using SAC

I'm training an agent on a custom environment using SAC. The environment is wrapped in a Monitor, which is wrapped in a DummyVecEnv, which is wrapped in a VecNormalize, with norm_reward = True.

This is the tensorboard graph for the episode mean reward:

[Two graphs: no smoothing and 0.9 smoothing]

As you can see, the graph has some weird loops. For example, at around 170k steps or 450k steps.


Edit: Training is conducted in epochs of 50k steps.

Program starts by calling ./start.sh.

start.sh

#!/bin/bash

while [ "$?" -eq 0 ]; do
	python3 main.py
done

main.py

import os.path

from my_custom_env import MyCustomEnv

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

class SaveCheckpoint(BaseCallback):
	def __init__(self, save_freq, verbose = 0):
		super(SaveCheckpoint, self).__init__(verbose)
		self.save_freq = save_freq

	def _on_step(self):
		if self.num_timesteps % self.save_freq == 0:
			self.model.save("model.zip")
			self.training_env.save("stats.pkl")

		return True


if __name__ == '__main__':

	# inits
	env = DummyVecEnv([lambda: Monitor(MyCustomEnv())])
	model = None

	# load recent checkpoint
	if os.path.isfile("model.zip") and os.path.isfile("stats.pkl"):
		env = VecNormalize.load("stats.pkl", env)
		env.reset()
		model = SAC.load("model.zip", env)
	else:
		env = VecNormalize(env)
		model = SAC('MlpPolicy', env, verbose = 1, tensorboard_log = ".")

	# replay buffer
	if os.path.isfile("replay_buffer.pkl"):
		model.load_replay_buffer("replay_buffer.pkl")

	# train
	model.learn(50000,
		callback = SaveCheckpoint(10000),
		log_interval = 1,
		reset_num_timesteps = False
	)

	# save replay buffer
	model.save_replay_buffer(".")

	env.close()

> pip3 freeze | grep 'stable-baselines3'
stable-baselines3==0.7.0a1

[feature request] Low-level API

A lower-level API can be helpful for users who want more control over the training loop. The low-level API is particularly useful in hierarchical RL or multi-agent RL problems. One idea is to provide an API like this repo from @eleurent:

for episode in episodes:
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        agent.record(state, action, next_state, reward, done, info)
        state = next_state

Hyper-param optimization

Do you still use Optuna for hyperparameter optimization like Stable-Baselines? And do you perhaps know if there might be good alternatives to it out there?

PPO rollouts not terminating with `done == True`

I am using a custom environment, and I've already checked the following:

from stable_baselines3.common.env_checker import check_env

env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)

But the PPO algo is continuing to step(action) when done == True (when the state is no longer in the bounds).

This is how I am interfacing with the algorithm:

class Agent:

    def __init__(self, environment, name, net_arch=[100, 100], n_env=1, n_steps=10000):

        # vectorise environment
        self.environment = environment
        check_env(self.environment)
        venv = DummyVecEnv([lambda: environment]*n_env)

        # load or create model
        assert(isinstance(name, str))
        self.name = name
        try:
            self.model = PPO.load(self.name, venv)
        except:
            self.model = PPO(
                'MlpPolicy', 
                venv,
                use_sde=True,
                sde_sample_freq=5,
                gae_lambda=0.9,
                learning_rate=1e-2,
                verbose=1,
                policy_kwargs=dict(net_arch=net_arch),
                n_steps=n_steps
            )

    def train(self, time_steps):

        # learn and save
        self.model.learn(total_timesteps=time_steps)
        self.model.save(self.name)

    def evaluate(self):

        # simulate
        obs = self.environment.reset()
        while True:
            action, _ = self.model.predict(obs, deterministic=True)
            obs, rew, done, _ = self.environment.step(action)
            if done: break

        # plot
        self.environment.system.plot(fname='{}.pdf'.format(self.name))

Highlights over existing PyTorch RL repos

Greetings! I'm a PyTorch RL fan but previously used baselines and stable baselines for research. I noticed stable-baselines3 through the original stable-baselines issue.
Recently many PyTorch RL platforms have emerged, including rlpyt, tianshou, etc. I went through their code and compared them with stable-baselines3.

| Features | Stable-Baselines3 | rlpyt | tianshou |
| --- | --- | --- | --- |
| State of the art RL methods | ✔️ | ✔️ | ✔️ |
| Documentation | ✔️ | ✔️ | ✔️ |
| Custom environments | ✔️ | Just so-so | ✔️ |
| Custom policies | ✔️ | ✔️ | ✔️ |
| Common interface | ✔️ | ✔️ | ✔️ |
| Ipython / Notebook friendly | ✔️ | ✔️ | ✔️ |
| PEP8 code style | ✔️ | ✔️ | ✔️ |
| Custom callback | ✔️ | | |
| High code coverage | ✔️ | | ✔️ |
| Type hints | ✔️ | | ✔️ |

And for the planned features of stable-baselines3:

| Features | Stable-Baselines3 | rlpyt | tianshou |
| --- | --- | --- | --- |
| Tensorboard support | ✔️ | ✔️ | ✔️ |
| DQN extensions | ➖ QR-DQN in SB3 contrib | ✔️ | ✔️ |
| Support for Dict observation spaces | ✔️ | ✔️ | ✔️ |
| Recurrent Policies | ✔️ in contrib | ✔️ | ✔️ |
| TRPO | ✔️ in contrib | | ✔️ |

Also, regarding the most important feature, "modularization": from my perspective, tianshou is the best of all and rlpyt is second. I hate OpenAI Baselines at this point, but stable-baselines is much better than OpenAI's.

Just some of my concerns.

Implement DDPG

need to refactor off-policy critic first to avoid code duplication

I will do it.

[feature-request] N-step returns for TD methods

Originally posted by @partiallytyped in hill-a/stable-baselines#821
"
N-step returns allow for much better stability, and improve performance when training DQN, DDPG etc, so it will be quite useful to have this feature.

A simple implementation of this would be as a wrapper around ReplayBuffer, so it would work with both prioritized and uniform sampling. The wrapper keeps a queue of observed experiences, computes the returns, and adds the experience to the buffer.
"

Roadmap: v1.1+ (see #1 )
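A rough sketch of the wrapper idea under the assumptions above (class and method names are hypothetical, not SB3 API; for brevity, the trailing transitions left in the queue at episode end are dropped):

from collections import deque


class NStepReplayWrapper:
    """Collapse the last n transitions into one n-step transition before storing it."""

    def __init__(self, replay_buffer, n_steps: int = 3, gamma: float = 0.99):
        self.buffer = replay_buffer          # wrapped (prioritized or uniform) replay buffer
        self.n_steps = n_steps
        self.gamma = gamma
        self.queue = deque(maxlen=n_steps)

    def add(self, obs, next_obs, action, reward, done):
        self.queue.append((obs, next_obs, action, reward, done))
        if len(self.queue) == self.n_steps or done:
            first_obs, _, first_action, _, _ = self.queue[0]
            # Discounted sum of the queued rewards, truncated at a terminal step.
            n_step_return, discount = 0.0, 1.0
            for _, _, _, r, d in self.queue:
                n_step_return += discount * r
                discount *= self.gamma
                if d:
                    break
            _, last_next_obs, _, _, last_done = self.queue[-1]
            self.buffer.add(first_obs, last_next_obs, first_action, n_step_return, last_done)
        if done:
            self.queue.clear()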

Inconsistency with the API on saving stuff

BaseAlgorithm.save takes as an argument the path to the output file. Same thing with VecNormalize.save.

:param path: path to the file where the rl agent should be saved

:param save_path: (str) The path to save to

However, OffPolicyAlgorithm.save_replay_buffer takes as an argument the path to the output directory.

:param path: (str) Path to a log folder

In order to keep things consistent, I believe save_replay_buffer should be changed to match the API of the other two methods. It would also allow for more control over the resulting file name.

Review of Existing Code

@araffin has done a great job creating this version of the library but has done it mostly solo, so much of the code has never been reviewed.

I suggest the other maintainers each take responsibility for reviewing a portion of the code. Rather than doing a traditional code review, since it's already committed I suggest we just make a PR with any changes we think should take place, or raise an issue for non-trivial proposals.

I think the priority is code that is used in multiple algorithms and/or defines the public API and which is new. This includes:

  • common/base_class.py
  • common/distributions.py
  • common/policies.py
  • common/type_aliases.py

Next would be the individual algorithms:

  • A2C
  • PPO
  • SAC
  • TD3

Also new parts of the documentation could use re-reading/confirming:

  • Documentation

If this sounds like a good idea to others, then perhaps we can each claim a few entries on the above list and then start a PR for it? Also feel free to edit the post to break it up differently or to add files I've missed.

Running PPO with default HP fails on Pendulum [question]

Hello,

I am running a PPO agent on the usual Gym environments (Pendulum and BipedalWalker-v3) and I am surprised to observe that the reward doesn't exceed -1000 even after 5e5 timesteps. I am running the agent with default hyperparameters, copy-pasted from the readthedocs examples.
Here's said code:


from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.cmd_util import make_vec_env


if __name__ == '__main__': 
    # Parallel environments
    env = make_vec_env('Pendulum-v0', n_envs=6)

    model = PPO(MlpPolicy, env, verbose=1)
    model.learn(total_timesteps=500000)
    model.save("ppo_cartpole")

    del model # remove to demonstrate saving and loading

    model = PPO.load("ppo_cartpole")

    obs = env.reset()
    while True:
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    env.close()

Final results:

---------------------------------------
| approx_kl            | 0.0011709424 |
| clip_fraction        | 0            |
| clip_range           | 0.2          |
| entropy_loss         | -1.07        |
| ep_len_mean          | 200          |
| ep_rew_mean          | -1.06e+03    |
| explained_variance   | -1.57        |
| fps                  | 1270         |
| iterations           | 41           |
| learning_rate        | 0.0003       |
| n_updates            | 400          |
| policy_gradient_loss | 0.000337     |
| std                  | 0.709        |
| time_elapsed         | 396          |
| total timesteps      | 503808       |
| value_loss           | 4.19e+03     |
---------------------------------------

Am I doing something wrong?
Thanks! ((:
