vwxyzjn / ppo-implementation-details

The source code for the blog post The 37 Implementation Details of Proximal Policy Optimization

Home Page: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

License: Other

Python 81.26% Shell 18.74%

ppo-implementation-details's Introduction

The 37 Implementation Details of Proximal Policy Optimization

This repo contains the source code for the blog post The 37 Implementation Details of Proximal Policy Optimization

Blog post url: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Tracked Weights and Biases experiments: https://wandb.ai/vwxyzjn/ppo-details

If you like this repo, consider checking out CleanRL (https://github.com/vwxyzjn/cleanrl), the RL library that we used to build this repo.

Get started

Prerequisites:

Python 3.8+
Poetry

Install dependencies:

poetry install

Train agents:

poetry run python ppo.py

Train agents with experiment tracking:

poetry run python ppo.py --track --capture-video

Atari

Install dependencies:

poetry install -E atari

Train agents:

poetry run python ppo_atari.py

Train agents with experiment tracking:

poetry run python ppo_atari.py --track --capture-video

Pybullet

Install dependencies:

poetry install -E pybullet

Train agents:

poetry run python ppo_continuous_action.py

Train agents with experiment tracking:

poetry run python ppo_continuous_action.py --track --capture-video

Gym-microrts (MultiDiscrete)

Install dependencies:

poetry install -E gym-microrts

Train agents:

poetry run python ppo_multidiscrete.py

Train agents with experiment tracking:

poetry run python ppo_multidiscrete.py --track --capture-video

Train agents with invalid action masking:

poetry run python ppo_multidiscrete_mask.py

Train agents with invalid action masking and experiment tracking:

poetry run python ppo_multidiscrete_mask.py --track --capture-video

Atari with Envpool

Install dependencies:

poetry install -E envpool

Train agents:

poetry run python ppo_atari_envpool.py

Train agents with experiment tracking:

poetry run python ppo_atari_envpool.py --track

Solve Pong-v5 in 5 mins:

poetry run python ppo_atari_envpool.py --clip-coef=0.2 --num-envs=16 --num-minibatches=8 --num-steps=128 --update-epochs=3

Reach a game score of about 400 in Breakout-v5 with PPO in ~1 hour (a side-effect-free 3-4x speedup compared to ppo_atari.py with SyncVectorEnv):

poetry run python ppo_atari_envpool.py --gym-id Breakout-v5
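
For reference, the flags in the Pong command above determine how much data each update consumes. The sketch below assumes the CleanRL-style convention that the batch size is num_envs * num_steps and that each epoch splits the batch into num_minibatches equal parts. The function name derived_sizes is hypothetical and not part of the repo.

```python
def derived_sizes(num_envs: int, num_steps: int, num_minibatches: int, update_epochs: int):
    """Return (batch_size, minibatch_size, gradient_steps_per_update).

    Assumption: one rollout collects num_steps transitions from each of
    num_envs environments, and every epoch visits each minibatch once.
    """
    batch_size = num_envs * num_steps
    minibatch_size = batch_size // num_minibatches
    gradient_steps = update_epochs * num_minibatches
    return batch_size, minibatch_size, gradient_steps


# Values taken from the Pong command: --num-envs=16 --num-steps=128
# --num-minibatches=8 --update-epochs=3
print(derived_sizes(16, 128, 8, 3))  # (2048, 256, 24)
```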

Procgen

Install dependencies:

poetry install -E procgen

Train agents:

poetry run python ppo_procgen.py

Train agents with experiment tracking:

poetry run python ppo_procgen.py --track

Reproduction of all of our results

To reproduce the results obtained with openai/baselines, install our fork at https://github.com/vwxyzjn/baselines, then follow the scripts in scripts/baselines. To reproduce our results, follow the scripts in scripts/ours.

Citation

@inproceedings{shengyi2022the37implementation,
  author = {Huang, Shengyi and Dossa, Rousslan Fernand Julien and Raffin, Antonin and Kanervisto, Anssi and Wang, Weixun},
  title = {The 37 Implementation Details of Proximal Policy Optimization},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/},
  url  = {https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/}
}

ppo-implementation-details's Issues

Poetry env does not seem to work well with gym (0.21.0).

The problem occurred when running 'poetry install'.

 - Installing gym (0.21.0): Failed

ChefBuildError

Backend subprocess exited when trying to invoke get_requires_for_build_wheel

error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.

at ~/.local/pipx/venvs/poetry/lib/python3.9/site-packages/poetry/installation/chef.py:164 in _prepare
160│
161│ error = ChefBuildError("\n\n".join(message_parts))
162│
163│ if error is not None:
→ 164│ raise error from None
165│
166│ return path
167│
168│ def _prepare_sdist(self, archive: Path, destination: Path | None = None) -> Path:

Note: This error originates from the build backend, and is likely not a problem with poetry but with gym (0.21.0) not supporting PEP 517 builds. You can verify this by running 'pip wheel --no-cache-dir --use-pep517 "gym (==0.21.0)"'.

I searched for this error and it seems to be a problem with gym (0.21.0). I have no idea how to work around it using the given poetry.lock.

Is there another way to avoid this problem? Also, would it be possible to upgrade to gym (>=0.22)?

Many thanks.

Using the PPO implementation in a custom environment

Hi, thank you for writing this code. I found it extremely helpful as a beginner.
I have been using this implementation in a custom environment and I have a general question.

One of the hyperparameters is n_steps, the number of steps to run in each environment per update. I was wondering whether there is an inherent issue if my custom environment has a maximum of 250 steps per episode and loses reward as time passes.

Can this create a conflict and will it not learn as well? I hope my question makes sense. Please do let me know.
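
One way to picture how the rollout length and the episode limit interact, assuming (as with the vectorized environments used in this repo's scripts) that an environment resets automatically when an episode ends: the rollout buffer keeps filling across episode boundaries, and each boundary is recorded as a done flag. The sketch below uses no real environment. The function name is hypothetical, and it assumes for illustration that every episode lasts exactly the maximum length.

```python
def episode_ends_per_rollout(num_steps, max_episode_steps, num_rollouts):
    """For each rollout, list the local step indices at which an episode ends.

    Assumption: a single environment whose episodes always run for exactly
    max_episode_steps steps and which resets immediately afterwards.
    """
    rollouts = []
    for k in range(num_rollouts):
        start = k * num_steps  # global index of the rollout's first step
        ends = [
            t - start
            for t in range(start, start + num_steps)
            if (t + 1) % max_episode_steps == 0  # last step of an episode
        ]
        rollouts.append(ends)
    return rollouts


# 128 steps per rollout, episodes capped at 250 steps, four rollouts
print(episode_ends_per_rollout(128, 250, 4))  # [[], [121], [], [115]]
```

Under these assumptions some rollouts contain no episode end and others contain one, so the rollout length does not need to match the episode length; the done flags mark where one episode stops and the next begins.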

Should advantages be computed every PPO epoch?

Thanks for the great implementation.
I found (in ppo_continuous) that the advantage is computed only once, right after the rollout. Shouldn't it be computed inside the PPO epoch loop?
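
For context, here is a minimal sketch of Generalized Advantage Estimation computed once over a finished rollout, written in plain Python rather than PyTorch. It assumes that dones[t] marks that the observation at step t starts a new episode, and that next_value and next_done describe the state reached after the last step. The function and argument names are illustrative and not copied from the repo.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one environment's rollout.

    Assumption: dones[t] == 1.0 means step t-1 ended an episode, so the
    value of step t must not be bootstrapped into step t-1.
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            next_non_terminal = 1.0 - next_done
            next_val = next_value
        else:
            next_non_terminal = 1.0 - dones[t + 1]
            next_val = values[t + 1]
        # One-step TD error, with bootstrapping cut at episode ends
        delta = rewards[t] + gamma * next_val * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                       next_value=0.0, next_done=0.0, gamma=1.0, gae_lambda=1.0)
print(adv)  # [2.0, 1.0, 1.0]
```

The estimate depends on the value predictions stored during the rollout, so computing it once fixes the targets for all epochs of that update; recomputing it inside the epoch loop would use refreshed value predictions instead, which is the design choice this question raises.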

Issue when using PPO with CartPole

Hi, sorry to bother you,

I was trying to run the code from your video with Weights & Biases, but I get errors when I run it. First of all, there is no seed attribute on the env. To replace env.seed(seed) I used:

env.reset(seed=seed)

Now I have an issue with:

next_obs = torch.Tensor(envs.reset()).to(device)
# Returns the error:
ValueError: expected sequence of length 4 at dim 1 (got 0)

To try to resolve it, I printed the tuple and converted it into an array before converting it into a Tensor:

env_reset_ar = np.asarray(envs.reset(), dtype=tuple)
next_obs = torch.Tensor(env_reset_ar).to(device) # store the initial obs 
#With env.reset() : 

#[[-0.01702683  0.02884287 -0.01968052 -0.00465021]
# [-0.00673692  0.01692973 -0.00772153  0.01331844]
# [-0.0069372   0.00867986  0.02378378  0.04562673]
# [-0.00695037  0.02889467  0.0484153  -0.01302742]]
#{} 

# But now I get this error:
next_obs = torch.Tensor(env_reset_ar).to(device) # store the initial obs 
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Do you know what I need to change to make it work? I am still a beginner at RL, so I don't know what I have done wrong.
Thank you in advance.

Kind regards,
Eliott.
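
The printed output above is a pair: an array of observations followed by an empty dict. That matches newer Gym releases, where reset() returns (observation, info) instead of the observation alone, so converting the whole return value to a tensor fails. Below is a minimal sketch of a helper that accepts either form, assuming that is the cause here. split_reset_result is a hypothetical name, and the nested list stands in for the real observation array.

```python
def split_reset_result(result):
    """Return (obs, info) whether reset() gave obs alone or (obs, info).

    Assumption: an (obs, info) result is a 2-tuple whose second item is a dict.
    """
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        obs, info = result
        return obs, info
    return result, {}


# Stand-in for what envs.reset() printed in the report above
fake_reset = ([[-0.017, 0.029, -0.020, -0.005]], {})
obs, info = split_reset_result(fake_reset)
print(obs)   # [[-0.017, 0.029, -0.02, -0.005]]
print(info)  # {}
# In the training script, only obs would then be converted, for example:
# next_obs = torch.Tensor(obs).to(device)
```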

Are these arguments correct? Or is it very hard to train MountainCar-v0 with PPO?

I used the code to train MountainCar-v0, but training did not succeed after 10 million timesteps. The command was as follows.

!python ppo.py --gym-id MountainCar-v0 --total-timesteps 10000000

Are these arguments correct? Or is it very hard to train MountainCar-v0 with PPO?

Using PPO with RNN

I want to use an RNN for RL on a custom environment. My custom environment has variable-size observations. For example, consider the following environment:

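# Imports assumed for this snippet (the original post did not show them);
# the reset/step signatures below match the Gymnasium API.
import gymnasium as gym
from gymnasium.spaces import Discrete, MultiDiscrete, Sequence

class ModuloComputationEnv(gym.Env):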
    """Environment in which an agent must learn to output mod 2,3,4 of the sum of
       seen observations.

    Observations are sequences of integer numbers,
    e.g. (1,3,4,5)

    The action has 3 values: the sum of the inputs seen so far %2, %3,
    and %4, respectively.

    Rewards are r=-abs(self.ac1-action[0]) - abs(self.ac2-action[1]) - abs(self.ac3-action[2]),
    for all steps.
    """

    def __init__(self, config):
        
        #each element of the input sequence is an integer from 0 to 99
        self.observation_space = Sequence(Discrete(100), seed=2)

        #the action is a vector of 3, [%2, %3, %4], of the sum of the input sequence
        self.action_space = MultiDiscrete([2,3,4])

        self.cur_obs = None
        
        #this variable maintains the episode_length
        self.episode_len = 0

        #this variable maintains %2
        self.ac1 = 0
        
        #this variable maintains %3
        self.ac2 = 0

        #this variable maintains %4
        self.ac3 = 0

    def reset(self, *, seed=None, options=None):
        """Resets the episode and returns the initial observation of the new one.
        """

        # Reset the episode len.
        self.episode_len = 0
        
        # Sample a random sequence from our observation space.
        self.cur_obs = self.observation_space.sample()

        #take the sum of the initial observation
        sum_obs = sum(self.cur_obs)

        #consider the %2, %3, and %4 of the initial observation
        self.ac1 = sum_obs%2
        self.ac2 = sum_obs%3
        self.ac3 = sum_obs%4

        # Return initial observation.
        return self.cur_obs, {}

    def step(self, action):
        """Takes a single step in the episode given `action`

        Returns:
            New observation, reward, terminated flag, truncated flag, info-dict (empty).
        """
        # Set the `terminated` flag after 10 steps; `truncated` is never set.
        self.episode_len += 1
        truncated = False
        terminated = self.episode_len >= 10

        #the reward is the negative total distance between the actions and the target values
        reward = abs(self.ac1-action[0]) + abs(self.ac2-action[1]) + abs(self.ac3-action[2])
        reward = -reward


        # Set a new observation (random sample).
        self.cur_obs = self.observation_space.sample()

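        #recompute the %2, %3 and %4 values of the running sum
        #(cur_obs is a sequence, so add its sum rather than the sequence itself)
        obs_sum = sum(self.cur_obs)
        self.ac1 = (obs_sum+self.ac1)%2
        self.ac2 = (obs_sum+self.ac2)%3
        self.ac3 = (obs_sum+self.ac3)%4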
        
        return self.cur_obs, reward, terminated, truncated, {}
