pytorch-dppo's Introduction

Pytorch-DPPO

PyTorch implementation of Distributed Proximal Policy Optimization (DPPO): https://arxiv.org/abs/1707.02286, using PPO with the clipped surrogate loss (from https://arxiv.org/pdf/1707.06347.pdf).

I finally fixed what was wrong with the gradient descent step: it now uses the previous log-probabilities stored from the rollout batches. At least ppo.py is fixed; the rest will be corrected very soon as well.
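
Concretely, the idea is to store the log-probability of each action at rollout time and reuse it later as the "old" term of the PPO ratio. A minimal sketch of that pattern, assuming the policy returns a torch.distributions distribution (names are placeholders, not the repository's exact code):

import torch

# During the rollout: keep the log-prob of the sampled action.
def collect_step(policy, state, memory):
    with torch.no_grad():
        dist = policy(state)               # assumed to return a Normal, etc.
        action = dist.sample()
        memory.append((state, action, dist.log_prob(action).sum(-1)))
    return action

# During the update, the stored log-probs are the "old" ones in the ratio
#   ratio = exp(new_log_prob - old_log_prob),
# so a frozen copy of the policy does not need to be re-run on the batch.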

In the following examples I was not patient enough to wait for a million iterations; I just wanted to check that the model was learning properly:

Progress of single PPO:

[Training-curve plot: InvertedPendulum]

[Training-curve plot: InvertedDoublePendulum]

[Training-curve plot: HalfCheetah]

[Training-curve plot: hopper (PyBullet)]

[Training-curve plot: halfcheetah (PyBullet)]

Progress of DPPO (4 agents) [TODO]

Acknowledgments

The structure of this code is based on https://github.com/ikostrikov/pytorch-a3c.

Hyperparameters and loss computation have been taken from https://github.com/openai/baselines.

pytorch-dppo's People

Contributors

alexis-jacq, twsthomas


pytorch-dppo's Issues

Loss questions

I just went through your code and the PPO paper and have a few questions; perhaps if you have time you could comment.

  • First off, nice work. The code is easy to read, with lots of comments and full variable names, which made it easy for me to read through and (partially) understand.
  • Should you subtract loss_value or add it? We want each individual loss term to make the overall loss larger as it grows. I can see you copied baselines exactly, but maybe they have it wrong; in the PPO paper there is a minus on one of the terms (Eq. 9). If you stop the training and inspect loss_clip and loss_value, the first is negative and the second is positive, so it seems like we need loss = loss_value - loss_clip (see also the sign-convention sketch after the example model below). Thoughts?
  • What is log_std? Is that an exploration parameter learned by the model?
  • Do we need loss_value? In the PPO paper they say that if the policy and value function don't share parameters, it isn't needed (first paragraph of Section 5), and your example model doesn't share parameters. An example of a model that does share them is in baselines, and it could halve your model's parameters, e.g.:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(Model, self).__init__()
        h_size_1 = 100
        h_size_2 = 100
        # shared body
        self.fc1 = nn.Linear(num_inputs, h_size_1)
        self.fc2 = nn.Linear(h_size_1, h_size_2)
        # actor head
        self.mu = nn.Linear(h_size_2, num_outputs)
        self.log_std = nn.Parameter(torch.zeros(num_outputs))
        # critic head
        self.v = nn.Linear(h_size_2, 1)
        # init biases to zero
        for name, p in self.named_parameters():
            if 'bias' in name:
                p.data.fill_(0)
        # mode
        self.train()

    def forward(self, inputs):
        # shared body
        x = F.tanh(self.fc1(inputs))
        h = F.tanh(self.fc2(x))
        # actor: mean and (expanded) standard deviation of the Gaussian policy
        mu = self.mu(h)
        std = torch.exp(self.log_std).unsqueeze(0).expand_as(mu)
        # critic
        v = self.v(h)
        return mu, std, v
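
For context on the sign question in the second bullet above, here is a minimal sketch of the clipped surrogate written for a minimizer; this is an illustration under assumed names, not the repository's actual loss code:

import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             clip_eps=0.2, value_coef=0.5):
    # Probability ratio r_t = pi(a|s) / pi_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Eq. (9) of the PPO paper is maximized; when minimizing, the clipped
    # surrogate is negated and the value loss is added with a positive sign.
    loss_clip = torch.min(surr1, surr2).mean()
    loss_value = (values - returns).pow(2).mean()
    return -loss_clip + value_coef * loss_value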

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead.

env:
torch 1.8.1+cu111

Error:
UserWarning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
File "", line 1, in
File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "E:\A\envs\gym\lib\multiprocessing\spawn.py", line 118, in _main
return self._bootstrap()
File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "Pytorch-RL\Pytorch-DPPO-master\train.py", line 155, in train
mu_old, sigma_sq_old, v_pred_old = model_old(batch_states)
File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "Pytorch-DPPO-master\model.py", line 53, in forward
v1 = self.v(x3)
File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\A\envs\gym\lib\site-packages\torch\nn\modules\linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "E:\A\envs\gym\lib\site-packages\torch\nn\functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
(Triggered internally at ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
Process Process-4:
Traceback (most recent call last):
File "E:\A\envs\gym\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "E:\A\envs\gym\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "Pytorch-DPPO-master\train.py", line 197, in train
total_loss.backward(retain_graph=True)
File "E:\A\envs\gym\lib\site-packages\torch\tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "E:\A\envs\gym\lib\site-packages\torch\autograd_init
.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 1]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I googled it and some say it's caused by an in-place op, but I can't seem to find one. I haven't tried downgrading the torch version; is there any solution that doesn't require a downgrade?
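
For what it's worth, this class of error typically appears when backward() is run a second time through a graph whose parameters were already updated in place by optimizer.step(). A minimal, self-contained reproduction of the pattern (not the repository's actual training loop) and the usual fix, which is to rebuild the graph after the parameter update instead of reusing it via retain_graph=True:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)
loss = model(x).pow(2).mean()
loss.backward(retain_graph=True)
optimizer.step()        # updates the weights in place (version counter bumps)

# loss.backward()       # reusing the old graph here would raise the
                        # "modified by an inplace operation" RuntimeError

# Fix: recompute the forward pass (and hence the graph) after the update.
loss = model(x).pow(2).mean()
loss.backward()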

Failed in more complex environment

Thanks for sharing the code.
I tested that the code works in InvertedPendulum-v1.

But when I changed the environment to Ant-v1 without changing any other parameters,
the agent seems to fail to learn, as shown below. Do I need to change some parameters?

Time 00h 01m 01s, episode reward -3032.25671304008, episode length 1000
Time 00h 02m 01s, episode reward -99.15254692012928, episode length 25
Time 00h 03m 01s, episode reward -41.27665909454931, episode length 14
Time 00h 04m 01s, episode reward -39.077425184658665, episode length 17
Time 00h 05m 02s, episode reward -136.60746428384076, episode length 45
Time 00h 06m 02s, episode reward -111.40062667574634, episode length 40
Time 00h 07m 02s, episode reward -516.1070385678166, episode length 169
Time 00h 08m 02s, episode reward -129.64627338344073, episode length 42
Time 00h 09m 02s, episode reward -146.55425861577797, episode length 45
Time 00h 10m 03s, episode reward -253.41361049200614, episode length 86
Time 00h 11m 03s, episode reward -108.6953450777496, episode length 38
Time 00h 12m 03s, episode reward -64.66194807902957, episode length 16
Time 00h 13m 03s, episode reward -33.51695185844647, episode length 11
Time 00h 14m 03s, episode reward -86.88904449639067, episode length 35
Time 00h 15m 03s, episode reward -78.48049851223362, episode length 23
Time 00h 16m 03s, episode reward -165.73681903021165, episode length 61
Time 00h 17m 04s, episode reward -155.3555664457943, episode length 60
Time 00h 18m 04s, episode reward -57.65249942070945, episode length 20
Time 00h 19m 04s, episode reward -392.10161323743887, episode length 109
Time 00h 20m 04s, episode reward -55.63287075930159, episode length 12
Time 00h 21m 04s, episode reward -81.0448173961397, episode length 29
Time 00h 22m 04s, episode reward -149.84827826419726, episode length 52
Time 00h 23m 04s, episode reward -398.0365800924663, episode length 22
Time 00h 24m 05s, episode reward -1948.6136580594682, episode length 17
Time 00h 25m 05s, episode reward -18719.08471382285, episode length 51
Time 00h 26m 06s, episode reward -805145.8854457787, episode length 1000
Time 00h 27m 06s, episode reward -17008.04843510176, episode length 17
Time 00h 28m 07s, episode reward -168769.34038655, episode length 129
Time 00h 29m 07s, episode reward -104933.08883886453, episode length 79
Time 00h 30m 07s, episode reward -22809.687035617088, episode length 17
Time 00h 31m 07s, episode reward -46398.71530676861, episode length 37
Time 00h 32m 07s, episode reward -18513.064083079746, episode length 15
Time 00h 33m 07s, episode reward -21329.411481710402, episode length 15
Time 00h 34m 09s, episode reward -1393903.341478124, episode length 1000
Time 00h 35m 10s, episode reward -1374988.6133415946, episode length 1000
Time 00h 36m 10s, episode reward -33792.40522011441, episode length 28
Time 00h 37m 10s, episode reward -20629.94697013807, episode length 16
Time 00h 38m 10s, episode reward -39780.93399623488, episode length 29
Time 00h 39m 10s, episode reward -61722.81635309537, episode length 47
Time 00h 40m 10s, episode reward -46780.12455378964, episode length 36
Time 00h 41m 10s, episode reward -91640.36757206521, episode length 73
Time 00h 42m 11s, episode reward -77137.71004513587, episode length 63
Time 00h 43m 11s, episode reward -15184.611248485926, episode length 10
Time 00h 44m 11s, episode reward -26995.023495691694, episode length 20
Time 00h 45m 11s, episode reward -110371.66228435331, episode length 81
Time 00h 46m 11s, episode reward -55639.738879114084, episode length 41
Time 00h 47m 11s, episode reward -53735.2616539847, episode length 39
Time 00h 48m 11s, episode reward -60755.49631228513, episode length 43
Time 00h 49m 11s, episode reward -29466.664499076247, episode length 23
Time 00h 50m 12s, episode reward -48580.31395829051, episode length 37
Time 00h 51m 12s, episode reward -128957.8903571858, episode length 99
Time 00h 52m 12s, episode reward -70144.76359014906, episode length 51
Time 00h 53m 12s, episode reward -29271.097255889938, episode length 21
Time 00h 54m 12s, episode reward -21737.6644599086, episode length 17
Time 00h 55m 12s, episode reward -27549.40889570978, episode length 20
Time 00h 56m 12s, episode reward -97097.66966694668, episode length 77
Time 00h 57m 13s, episode reward -18384.51761876518, episode length 14
Time 00h 58m 13s, episode reward -28424.585660954337, episode length 22
Time 00h 59m 13s, episode reward -96267.24448946006, episode length 72
Time 01h 00m 13s, episode reward -79794.54738721657, episode length 60
Time 01h 01m 13s, episode reward -88486.88046448736, episode length 64
Time 01h 02m 13s, episode reward -31071.50782185118, episode length 24
Time 01h 03m 13s, episode reward -53608.97197643964, episode length 38
Time 01h 04m 14s, episode reward -38451.031800392186, episode length 27
Time 01h 05m 14s, episode reward -27645.787896926682, episode length 20

Question on algorithm itself

Usually PPO is used for continuous actions, but for OpenAI Five, shouldn't the actions be discrete? What's the technique for making PPO applicable to Dota 2 actions?
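
For reference, PPO itself is agnostic to the action space: with discrete actions the Gaussian policy head is simply replaced by a categorical one, and the ratio/clipping machinery is unchanged. A minimal sketch with hypothetical names (not part of this repository):

import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    def __init__(self, num_inputs, num_actions, h_size=100):
        super(DiscretePolicy, self).__init__()
        self.body = nn.Sequential(
            nn.Linear(num_inputs, h_size), nn.Tanh(),
            nn.Linear(h_size, h_size), nn.Tanh())
        self.logits = nn.Linear(h_size, num_actions)

    def forward(self, states):
        # Returns a categorical distribution over the discrete actions;
        # its log_prob(actions) then feeds the usual PPO ratio.
        return Categorical(logits=self.logits(self.body(states)))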

Old policy?

Great work! I'm also working on a PPO implementation, but I completely miss where π and π_old come from. Here you store the policy output when actually acting on the environment; if you stored this and retrieved it from the memory, wouldn't it be the same as calculating it again in a batch like you do here?

You then construct a new policy, and calculate the new policy output here. I see that it is different because you load weights that have been updated by other processes, but in a synchronous setting, the weights wouldn't have been updated, and hence the policy outputs wouldn't be any different?

average gradients to update global theta?

Thanks for the nice implementation in PyTorch, which made it easier for me to learn.

Regarding the chief.py implementation, I have a question about the updates to the global weights. From the algorithm pseudocode in the paper, it seems the averaged gradients from the workers should be used to update the global weights, but chief.py looks like it uses the sum of the workers' gradients (see the sketch below)? Thanks.

Cheng
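
For illustration, a minimal sketch of the difference between summing and averaging worker gradients before the global update (hypothetical helper, not chief.py's actual API). Note that with plain SGD, summing is equivalent to averaging with a learning rate that is K times larger, where K is the number of workers:

import torch

def apply_worker_grads(global_params, worker_grads, average=True):
    # worker_grads: one list of gradient tensors per worker, aligned with
    # global_params. Hypothetical names, for illustration only.
    num_workers = len(worker_grads)
    for i, p in enumerate(global_params):
        total = torch.zeros_like(p)
        for grads in worker_grads:
            total += grads[i]          # sum of the workers' gradients
        if average:
            total /= num_workers       # averaging = sum scaled by 1/K
        p.grad = total                 # a subsequent optimizer.step() applies it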

on advantages

After testing your PPO and comparing it with another implementation, I think your advantages need to be normalized:
(advantages - advantages.mean()) / advantages.std()
For your reference.
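
A minimal sketch of the suggested normalization; the small epsilon is an added assumption for numerical stability and was not part of the original suggestion:

def normalize_advantages(advantages, eps=1e-8):
    # Standardize the advantage estimates per batch before computing the
    # PPO surrogate loss.
    return (advantages - advantages.mean()) / (advantages.std() + eps)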
