ikostrikov / pytorch-a3c Goto Github PK

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".

License: MIT License

Python 100.00%

python reinforcement-learning pytorch deep-learning actor-critic a3c pytorch-a3c asynchronous-methods deep-reinforcement-learning asynch asynchronous-advantage-actor-critic

pytorch-a3c's Introduction

pytorch-a3c

This is a PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".

This implementation is inspired by Universe Starter Agent. In contrast to the starter agent, it uses an optimizer with shared statistics as in the original paper.

Please use this bibtex if you want to cite this repository in your publications:

@misc{pytorchaaac,
  author = {Kostrikov, Ilya},
  title = {PyTorch Implementations of Asynchronous Advantage Actor Critic},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ikostrikov/pytorch-a3c}},
}

A2C

I highly recommend to check a sychronous version and other algorithms: pytorch-a2c-ppo-acktr.

In my experience, A2C works better than A3C and ACKTR is better than both of them. Moreover, PPO is a great algorithm for continuous control. Thus, I recommend to try A2C/PPO/ACKTR first and use A3C only if you need it specifically for some reasons.

Also read OpenAI blog for more information.

Contributions

Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request.

Usage

# Works only wih Python 3.
python3 main.py --env-name "PongDeterministic-v4" --num-processes 16

This code runs evaluation in a separate thread in addition to 16 processes.

Results

With 16 processes it converges for PongDeterministic-v4 in 15 minutes.

For BreakoutDeterministic-v4 it takes more than several hours.

pytorch-a3c's People

Contributors

Stargazers

Watchers

Forkers

lucasb-eyer peratham g-wang dylanthomas codeaudit jacksparal ethancaballero apaszke devendrachaplot happywu scientist1642 lkhcnn shawground kastnerkyle andrewliao11 sherjilozair rbunn80110 ajaytalati collector-m tonyf hgajiayou alexandrugavril rex8312 zbgzbg2007 alexnowakvila tensorpro gwding vikingmew ryannnxu slowbull jackharmer srikanth-sfu geek7colson mhauskn vicdu leechikara ctallec feryal schroederdewitt hongzimao davidtranno1 pedronahum raileanu tord-zhang mabirck kismuz yuhangsong turinglife ilovescienceandpython b-rich ddrise nadavbh12 ssarcandy picopoco jackhaha363 binderwang snowwalkerj meelement ktfeigelis yuechuanli jotase amoliu keerthanpg bemoregt wangye777 shubhampachori12110095 runngezhang japan4415 thelaughingboy pfulop xueyaohuang joydosun atgambardella edbeeching nke001 bu-cs542-2018 gaoyz0625 convexsetgithub ellerylin agiantwhale crnsmile bhargav104 l3robot afcarl tommygoahead lovemercy acechuse sungjinlees mohamad-hasan-sohan-ajini jsuit 591x351-13n parthjaggi jhandei zhefan lisaanne zacheberhart jfsantos wh-forker nukinuki555 iloveopenworld

pytorch-a3c's Issues

Mixture of model prediction and update

Hello Ilya,

first, thanks for you amazing work. I would have one question about a way how you have designed an A3C training.

What you basically do is that you play N steps and you store all the values, log_probs and entropies during this phase and you store all these variables in lists. After that, you perform an update, where you use these variables.

Another approach would be to split this into two methods:

predict(state), which would return action values
update(states, actions, rewards), which would update the model

In this setup, you can use volatile=True in predict, but on the other hand, in update method you have to compute all these values, log_probs and entropies by forward pass on all the states.

My question is, did you consider this second approach? Why did you chose the first one? Is it more efficient or its just simpler to implement? I am trying to make my own A3C implementation and definitely I will try out myself which is more efficient, but I am curious about your opinion.

Thanks

Works better with 80x80 images

After I ran some experiments I noticed that I got better results when I cancelled the 42x42 resizing.
Hope it helps.

I cannot train with your recent pytorch-a3c

I am very interested in you pytorch-a3c because of its compactness and very simple structure.
I tried to follow your excellent work, but I cannot run successfully after struggling more than a month.
I will be very happy if you could give me any helpful information.

Following your readme, I installed most recent PyTorch, and clone your pytorch-a3c. (3 times cloned)

(1) "PongDeterministic-v4" was refused with next message.
(anaconda-2.4.0) user-no-MacBook-Pro:iko3 user$ OMP_NUM_THREADS=1 python main.py --env-name "PongDeterministic-v4" --num-processes 16
[2017-10-07 19:57:18,923] Making new env: PongDeterministic-v4
Traceback (most recent call last):
File "main.py", line 53, in
env = create_atari_env(args.env_name)
File "/Users/user/iko3/envs.py", line 9, in create_atari_env
env = gym.make(env_id)
File "/Users/user/gym/gym/envs/registration.py", line 126, in make
return registry.make(id)
File "/Users/user/gym/gym/envs/registration.py", line 90, in make
spec = self.spec(id)
File "/Users/user/gym/gym/envs/registration.py", line 110, in spec
raise error.DeprecatedEnv('Env {} not found (valid versions include {})'.format(id, matching_envs))
gym.error.DeprecatedEnv: Env PongDeterministic-v4 not found (valid versions include ['PongDeterministic-v3', 'PongDeterministic-v0'])

(2)so, I replaced "PongDeterministic-v4" to "PongDeterministic-v3"
Program started but score keep -21 more than 1day.

(3)I printed out prob and action in test.py.
prob is changing only last digit, and action changed 0 to 5.
By the way, I cannot understand the length of prob. I think pong has 3 actions(left,stay,right).
Outputs are as follows. 　
Last line repeated more than one day with same score -21
My machine environment is Mac 10.12.6(Sierra) core i7

0.1666 0.1667 0.1666 0.1666 0.1667 0.1667
[torch.FloatTensor of size 1x6]
, array([[4]]))
(Variable containing:
0.1666 0.1667 0.1666 0.1666 0.1667 0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
0.1666 0.1667 0.1666 0.1666 0.1667 0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
0.1667 0.1667 0.1667 0.1666 0.1667 0.1667
[torch.FloatTensor of size 1x6]
, array([[1]]))
Time 00h 01m 08s, episode reward -21.0, episode length 764

Question about the policy loss calculation?

On line 103 of train.py you calculate the policy loss as

policy_loss = policy_loss - log_probs[i] * Variable(gae) - 0.01 * entropies[i]

~~My question is where does this -0.01 * entropies[i] term come from and what does it do?~~

EDIT: I went through the paper again and now see this was a form of regularization they used in the experimental setup section that wasn't in the pseudocode. Thanks for the code!

Error when rendering

Adding env.render() causes the error:
Traceback (most recent call last):
...
File "/home/shani/anaconda2/lib/python2.7/site-packages/pyglet/gl/glx_info.py", line 89, in have_version client = [int(i) for i in client_version.split('.')]
ValueError: invalid literal for int() with base 10: 'None'

Tried to run it in a different python file and it worked. Is there supposed to be a problem rendering while using pytorch?

License of this repository?

Thank you for this great repository.
I am implementing
Curiosity-driven Exploration by Self-supervised Prediction
with pytorch, and this needs A3C.
I wish I could start it with your code, but I couldn't find any license of this repository. May I use codes in this repo for my implementation?

is ensure_shared_grads still required?

It seems the PyTorch Mnist hogwild example has been updated now, as gradients are now allocated lazily.
I think this means that this part of you code is no longer required?

Atari Environment Decision Choice

I'm curious for training reasons, why choose

[atariEnv]Deterministic-v4

versus

[atariEnv]NoFrameskip-v0

versus

[atariEnv]-v0

Why is the convergence on Pong so fast?

It takes about 15 minutes to converge on Pong with this repo. The A3C Paper managed to get convergence with 16 workers in about 4 hours. That's a speedup of 16x!

I'm aware that there are a few improvements such as GAE and normalising the observations from Gym, but those shouldn't account for a 16x speedup.

I've been trying to replicate those results in my own project, but can't seem to get better than 3-4 hours for convergence. Is there some kind of trick I'm missing?

AttributeError and CPU usage

Dear

After I installed everything

torch
gym
universe
cv2
....

and then:
python main.py --env-name "PongDeterministic-v3" --num-processes 8

I got this message:

File "/home/ubuntu/Notebooks/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'

However the code seems to run:

Time 00h 00m 03s, episode reward -21.0, episode length 764
Time 00h 01m 07s, episode reward -21.0, episode length 764
Time 00h 02m 11s, episode reward -21.0, episode length 764
Time 00h 03m 14s, episode reward -21.0, episode length 764
Time 00h 04m 18s, episode reward -21.0, episode length 764
Time 00h 05m 22s, episode reward -21.0, episode length 764
Time 00h 06m 25s, episode reward -21.0, episode length 764
Time 00h 07m 29s, episode reward -21.0, episode length 764
......

But the CPU is always 0% and some times peaks at 400~500%

Is this normal? (I think only the evaluation thread is running!)

Thank you so much for your efforts

action_space.n and actions sampling

Hey,

I have a question. Maybe I am in wrong and sorry to disturb you.
I understood all code but there is one thing that is not clear for me.
The A3C model get out Value and probability, that are 6 for pong right?

But when you compute:

action = prob.multinomial(num_samples=1).data

the number of sample are just 1 and clearly you have one action. But however the number of sample for pong are 6.

env.step(action.numpy())

is correct because receive only one action.

That is repeated within log_prob that you save in the list (you save 6 prob or only one?), with pong you have 6 not 1 and that could be created problem in the policy calculation.

Sorry I am bit confuse becouse

num_outputs = action_space.n

in the model.py are 6 not 1.

memory overflow

I tried to run your code on my ubuntu/Titan-X/Intel i7, but I kept getting memory problems, after about one day. Can you guess why ?
I tried both
OMP_NUM_THREADS=1 python main.py --env-name "Breakout-v0" --num-processes 8
and
OMP_NUM_THREADS=1 python main.py --env-name "Breakout-v0" --num-processes 6
, to see if bringing down num-processes might help. But that did not work out either.

ps.
(from free -m >> mem_usage.txt)
total used free shared buff/cache available
Mem: 64388 63937 325 14 125 271
Swap: 65451 65451 0
Fri Mar 31 17:55:18 KST 2017

Cant' work on pytorch 0.4.0

macOS 10.13.4
Python 3.6.4
pytorch 0.4.0

I encountered an error

~/py-garage/pytorch-a3c(master*) » python3 main.py --env-name "PongDeterministic-v4" --num-processes 1                                                        anya@turing-machine
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
/Users/anya/py-garage/pytorch-a3c/test.py:37: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:38: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:44: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  state.unsqueeze(0), volatile=True), (hx, cx)))
/Users/anya/py-garage/pytorch-a3c/train.py:55: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:45: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/train.py:56: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_prob = F.log_softmax(logit)
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/anya/py-garage/pytorch-a3c/train.py", line 60, in train
    action = prob.multinomial().data
TypeError: multinomial() missing 1 required positional arguments: "num_samples"
/Users/anya/py-garage/pytorch-a3c/test.py:40: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(cx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:41: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(hx.data, volatile=True)
Time 00h 00m 01s, num steps 0, FPS 0, episode reward -21.0, episode length 812

I try to fix it
TypeError: multinomial() missing 1 required positional arguments: "num_samples"

diff --git a/train.py b/train.py
index 1b9c139..e3f0143 100644
--- a/train.py
+++ b/train.py
@@ -57,7 +57,7 @@ def train(rank, args, shared_model, counter, lock, optimizer=None):
             entropy = -(log_prob * prob).sum(1, keepdim=True)
             entropies.append(entropy)

-            action = prob.multinomial().data
+            action = prob.multinomial(num_samples=1).data
             log_prob = log_prob.gather(1, Variable(action))

             state, reward, done, _ = env.step(action.numpy())

I encountered a new error

~/py-garage/pytorch-a3c(master*) » python3 main.py --env-name "PongDeterministic-v4" --num-processes 1                                                        anya@turing-machine
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
/Users/anya/py-garage/pytorch-a3c/test.py:37: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:38: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:44: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  state.unsqueeze(0), volatile=True), (hx, cx)))
/Users/anya/py-garage/pytorch-a3c/train.py:55: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/train.py:56: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_prob = F.log_softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:45: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:40: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(cx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:41: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(hx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/train.py:108: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  torch.nn.utils.clip_grad_norm(model.parameters(), args.max_grad_norm)
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/anya/py-garage/pytorch-a3c/train.py", line 111, in train
    optimizer.step()
  File "/Users/anya/py-garage/pytorch-a3c/my_optim.py", line 70, in step
    p.data.addcdiv_(-step_size, exp_avg, denom)
TypeError: addcdiv_() takes 2 positional arguments but 3 were given
Time 00h 00m 01s, num steps 20, FPS 13, episode reward -21.0, episode length 812

I am very confused and cannot fix it.
TypeError: addcdiv_() takes 2 positional arguments but 3 were given

GPU version of a3c algorithm?

Hi,

Thanks for the code. I was wondering if there is a GPU version of the A3C algorithm in PyTorch. I notice that the hogwild! example in PyTorch is the CPU version.

For your reference, I've found a GPU version of A3C in tensorflow: https://github.com/caomw/icra2017-visual-navigation

About ensure_shared_grads

It is really a great work! I have some questions about copying local gradients. In train.py, what is the purpose of adding condition:

if shared_param.grad is not None

If I understand correctly, local gradients should always be copied to the global network. Is that right? Also, I'm confused about ._grad and .grad in train.py:18. Are there any differences?

No framestack?

Hello, is it correct that the environment does not stack frames i.e. returns 4x42x42 but only 1x42x42? I've heard that stacking frames gives the model details about velocity/direction etc. The model converges fine on most occasions regardless, but i was just wondering if you had any thoughts about this.

Image preprocessing

I tried to understand the preprocessing in universe-starter-agent and couldn't understand if the image was converted to grayscale like they did in dqn. Have you noticed that? Shouldn't it make the process faster?

question about the hyper-parameters

In train.py
Line 95 and Line 107 both have a 0.5 decay parameter for value loss.
That's why? The original paper does not describe that.

Removing Universe dependency

Universe has quite many dependencies including Go and here basically only gym is used.
Vectorized api is useful in some cases but wrappers here just seem to replace and normalize observations. Or is it useful for some reason?
I replaced it with gym wrappers and everything looks fine.
diff

What's the difference between environment 'Pong-v4' and 'PongDeterministic-v4'

I'm sorry to ask a simple question.
I don't know the difference between the 'Pong-v4' and 'PongDeterministic-v4'. And why you use the latter environment to test your algorithm instead of 'Pong-v4'.

question about using GAE

I found that in original GAE paper
eq.16
A_{t}^{GAE} = \sum_{l=0 }^{\infty} (\gamma \tau )^l \delta_{t+l}^{V}

However, in the code the advantage is look like
https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L97

gae = gae * args.gamma * args.tau + delta_t

Shouldn't it modified into:

gae += args.gamma * args.tau * delta_t

I haven't implemented code with GAE before, so I'm just curious about this

ensure_shared_grads only works once?

In ensure_shared_grads, wouldn't shared_param.grad NOT be None after the first time the parameters are updated? In my testing it seems that the function returns without syncing the gradient in every step after the first. Related discussion here.

def ensure_shared_grads(model, shared_model):
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad

What is the purpose of `os.environ['OMP_NUM_THREADS'] = '1'`?

I wonder why os.environ['OMP_NUM_THREADS'] = '1' is used in the main method: https://github.com/ikostrikov/pytorch-a3c/blob/master/main.py#L43.

I ran a demo about CartPole-v0 using openai gym with LSTM. I found that if I removed it, then the agent couldn't learn a good policy. So why?

This is the logging info without os.environ['OMP_NUM_THREADS'] = '1'.

Time 00h 00m 00s, episode reward 10.0, episode length 10
Time 00h 01m 00s, episode reward 10.0, episode length 10
Time 00h 02m 00s, episode reward 12.0, episode length 12
Time 00h 03m 01s, episode reward 12.0, episode length 12
Time 00h 04m 01s, episode reward 14.0, episode length 14
Time 00h 05m 01s, episode reward 14.0, episode length 14
Time 00h 06m 01s, episode reward 12.0, episode length 12
Time 00h 07m 03s, episode reward 13.0, episode length 13
Time 00h 08m 03s, episode reward 10.0, episode length 10

And this is the info after adding it:

Time 00h 00m 00s, episode reward 10.0, episode length 10
Time 00h 01m 00s, episode reward 18.0, episode length 18
Time 00h 02m 00s, episode reward 16.0, episode length 16
Time 00h 03m 00s, episode reward 29.0, episode length 29
Time 00h 04m 00s, episode reward 48.0, episode length 48
Time 00h 05m 00s, episode reward 12.0, episode length 12
Time 00h 06m 00s, episode reward 107.0, episode length 107
Time 00h 07m 01s, episode reward 79.0, episode length 79
Time 00h 08m 01s, episode reward 153.0, episode length 153
Time 00h 09m 01s, episode reward 77.0, episode length 77
Time 00h 10m 01s, episode reward 91.0, episode length 91
Time 00h 11m 01s, episode reward 129.0, episode length 129
Time 00h 12m 02s, episode reward 137.0, episode length 137
Time 00h 13m 02s, episode reward 117.0, episode length 117
Time 00h 14m 03s, episode reward 155.0, episode length 155
Time 00h 15m 03s, episode reward 200.0, episode length 200
Time 00h 16m 03s, episode reward 200.0, episode length 200

loss backward

(policy_loss + 0.5 * value_loss).backward()
I am not familiar with pytorch. I am very confused about how the loss backward in two sub-net:actor and critic.because apparently that there are two output in the network(action and value),but(policy_loss + 0.5 * value_loss) seems to be loss for one sub-net.
Hope someone can help me.
Thank you so much.

AttributeError: 'NoneType' object has no attribute 'data'

when I ran your code, it spitted out the following:

..
[2017-03-17 20:15:21,067] Making new env: PongDeterministic-v3
[2017-03-17 20:15:21,075] Making new env: PongDeterministic-v3
Process Process-3:
Traceback (most recent call last):
File "/home/john/anaconda3/envs/th/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/john/anaconda3/envs/th/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/john/dev/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
Process Process-8:
...
...
File "/home/john/dev/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
Time 00h 00m 01s, episode reward -21.0, episode length 764
...

In the end it looks like it's running. Am I right?

Then, what is the cause the error messages ?

Many thanks !

(I am running this in conda env with python 3.5)

Why policy loss is negative?

In train.py, the code for updating the policy loss is

 policy_loss = policy_loss - \
                log_probs[i] * Variable(gae) - 0.01 * entropies[i]

According to the original paper, I think it should be

 policy_loss = policy_loss + \
                log_probs[i] * Variable(gae) + 0.01 * entropies[i]

Performance with Breakout

Have you trained Breakout with your a3c by any chance? I wonder that kind of scores you have gotten.

John

Did lstm cell really make more sense in A3C?

Very sorry to ask you a simple question, thanks a lot.

Problem with multiprocessing?

Hi, thanks a lot for releasing this project.

I wonder if you could give me some advice? I'm trying to adapt your code to a different environment. It's working great for a single thread, when I simple execute train.py,, without using torch multiprocessing. When I try to use it with multiprocessing, it still runs, but I get a screen full of error messages, like these,

*** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 *** *** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 *** *** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 ***

I'm not sure whether it's learning either? I have a TensorFlow implementationn, so I know both my environment and A3C algorithm work.

I just wondered if you've seen this before? It might be something to do with the the action variable, because when I set it to a constant, the error goes away, but of course it doesn't learn then?

Thanks a lot for your help :)

Possible memory leak?

Training Breakout goes ok but, memory usage exceeds 25gb after 4 hours of training on 16 cpu cores.
I wonder if it's related to sharing memory between processes.

I run Python 3.5 on scientific linux.

big bug

I encountered an error
[INFO ] [Logger ] Record log in C:\Users\ArunD.kivy\logs\kivy_18-09-24_3.txt
[INFO ] [Kivy ] v1.10.1.dev0, git-484b2f7, 20170513
[INFO ] [Python ] v3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
[INFO ] [Factory ] 194 symbols loaded
[INFO ] [Image ] Providers: img_tex, img_dds, img_sdl2, img_pil, img_gif (img_ffpyplayer ignored)
[INFO ] [Text ] Provider: sdl2
[INFO ] [OSC ] using for socket
[INFO ] [Window ] Provider: sdl2
[INFO ] [GL ] Using the "OpenGL" graphics system
[INFO ] [GL ] GLEW initialization succeeded
[INFO ] [GL ] Backend used
[INFO ] [GL ] OpenGL version <b'4.4.0 - Build 20.19.15.4835'>
[INFO ] [GL ] OpenGL vendor <b'Intel'>
[INFO ] [GL ] OpenGL renderer <b'Intel(R) HD Graphics'>
[INFO ] [GL ] OpenGL parsed version: 4, 4
[INFO ] [GL ] Shading version <b'4.40 - Build 20.19.15.4835'>
[INFO ] [GL ] Texture max size <16384>
[INFO ] [GL ] Texture max units <32>
[INFO ] [Shader ] fragment shader: <b"WARNING: 0:7: '' : #version directive missing">
[INFO ] [Shader ] vertex shader: <b"WARNING: 0:7: '' : #version directive missing">
[INFO ] [Window ] auto add sdl2 input provider
[INFO ] [Window ] virtual keyboard not allowed, single mode, not docked
[INFO ] [Base ] Start application main loop
C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py:63: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
probs = F.softmax(self.model(Variable(state, volatile = True))*100) # T=100
C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py:63: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
probs = F.softmax(self.model(Variable(state, volatile = True))*100) # T=100
[INFO ] [Base ] Leaving application in progress...
Traceback (most recent call last):
File "map.py", line 235, in
CarApp().run()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\app.py", line 828, in run
runTouchApp()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\base.py", line 504, in runTouchApp
EventLoop.window.mainloop()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\core\window\window_sdl2.py", line 663, in mainloop
self._mainloop()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\core\window\window_sdl2.py", line 405, in _mainloop
EventLoop.idle()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\base.py", line 339, in idle
Clock.tick()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\clock.py", line 581, in tick
self._process_events()
File "kivy_clock.pyx", line 367, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7700)
File "kivy_clock.pyx", line 397, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7577)
File "kivy_clock.pyx", line 395, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7498)
File "kivy_clock.pyx", line 167, in kivy._clock.ClockEvent.tick (kivy_clock.c:3490)
File "map.py", line 131, in update
action = brain.update(last_reward, last_signal)
File "C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py", line 79, in update
action = self.select_action(new_state)
File "C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py", line 64, in select_action
action = probs.multinomial()
TypeError: multinomial() missing 1 required positional arguments: "num_samples"

LSTM vs FF

Great job!
What I don't understand is what the LSTM used for here. What is the difference between using LSTM and using FF? Does it replace the batch of 4 observations?

Does this implementation solve the inconsistent parameter issue from the original TF universe-starter-agent implementation?

I'm referencing this issue:
tensorflow/tensorflow#6360 (comment)

hidden states for backwards

In the training part,
The output value and loss are saved at first.
At the end of the forward stage, they are calculated.
The problem is, at that time. the hidden states of the model are all from the last input state.
In the backward procedure, the gradients of weights are also depending on the hidden states.
But the outputs of former saved states are obviously based their own hidden states but not the hidden states right now.
So can the variable.backward() function wok well for this situation?

when run this code in macbook pro, python exit unnormally

How to modify code for continuous actions?

Hello :)

I was wondering how to modify the code for continuous actions? So for example it could be compared with your naf implementation on openAI gym pendulum,

env = gym.envs.make("Pendulum-v0")

Here's how far I got,

lstm_in = 3
lstm_out = 256

class ActorCritic(nn.Module):

    def __init__(self , lstm_in ):
        super(ActorCritic, self).__init__( )
               
        self.lstm = nn.LSTMCell(lstm_in, lstm_out)
        
        self.actor_mu = nn.Linear(lstm_out, 1)
        self.actor_sigma = nn.Linear(lstm_out, 1)

        self.critic_linear = nn.Linear(lstm_out, 1)
        
        self.train()

    def forward(self, inputs):
        
        x, (hx, cx) = inputs
        
        #might need some RELU's here ??
        
        hx, cx = self.lstm(x, (hx, cx))
        x = hx

        return self.critic_linear(x), self.actor_mu(x), self.actor_sigma(x), (hx, cx)

and the code in main now looks like,

env = gym.envs.make("Pendulum-v0")
lstm_in = 3    
global_model = ActorCritic( lstm_in )
global_model.share_memory()
local_model = ActorCritic( lstm_in )

It breaks with the following changes in train.py though,

env = gym.envs.make("Pendulum-v0")
s0 = env.reset()
done = True
state = torch.from_numpy(s0).float().unsqueeze(0)

value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))

#mu = mu.clamp(-1, 1) # constain to sensible values 
Softplus=nn.Softplus()     
sigma = Softplus(sigma) #+ 1e-5 # constrain to sensible values
normal_dist = torch.normal(mu, sigma) 

prob = normal_dist
# TODO - what goes here?
#nnlog = nn.Log() 
#log_prob = nnlog(prob)

#log_prob = F.log_softmax(prob)
#prob = F.softmax(logit)
#log_prob = F.log_softmax(logit)
entropy = -(log_prob * prob).sum(1)

action = prob.data
action = Variable( action )
log_prob = log_prob.gather(1, action)

#action=[0,]
state, reward, done, _ = env.step(action.data)

Any idea how to get it working?

Thanks a lot for your help,

Best regards,

Ajay

Reference - Deepmind A3C's paper, https://arxiv.org/pdf/1602.01783.pdf
Section 9 - Continuous Action Control Using the MuJoCo Physics Simulator

Picture from https://github.com/deeplearninc/relaax#distributed-a3c-architecture-with-continuous-actions

Question about normalized_columns_initializer(weights, std=1.0) method.

Hi, I'm new to reinforcement learning. While I'm learning the A3C code, I find one thing confused.
In model.py, the normalized_columns_initializer:
def normalized_columns_initializer(weights, std=1.0):
out = torch.randn(weights.size())
out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
return out
I tried std=1.0, out = torch.Tensor(np.random.randn(3,4)), then out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True)), out.std(dim=1) the standard deviation of out is not 1.0.
Am I understand the function of this method right?
If we want to set the std of out, why not use out *= std / out.std(dim=1, keepdim=True)?

SharedAdam bias correction wrong

The timestep parameters of the SharedAdam optimizer are not shared. This should lead to bias overcorrection, leading to incorrect unbiased estimates. Does the current implementation work?

File "main.py", line 55,TypeError: sum received an invalid combination of arguments

File "main.py", line 55, in
env.observation_space.shape[0], env.action_space)
File "/home/bruce/Downloads/pytorch-a3c-master/pytorch-a3c-master/model.py", line 47, in init
self.actor_linear.weight.data, 0.01)
File "/home/bruce/Downloads/pytorch-a3c-master/pytorch-a3c-master/model.py", line 9, in normalized_columns_initializer
out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
TypeError: sum received an invalid combination of arguments - got (int, keepdim=bool), but expected one of:

no arguments
(int dim)

my environment is Ubuntu 14.04
I just run your code without change and the error comes out, please help,thanks

Have you tried out Shared YellowFin?

Did it have significant effect compared to Shared Adam/RMSprop?

Running with pytorch 0.2.0

Have you tried running it with pytorch 0.2.0? seem like the multiprocessing doesn't work.

environment observation normalization

hello,

I am very puzzled why we can normalize observation like this. Could you explain to me, or give me a tutorial? Can not understand why alpha is 0.9999, why we can calculate std like this, although it truly improve the performance.

Thanks!

Where does the initializer come from?

Hi,

I found some initializers in model.py. The values are not very clearly to me. Could you please explain where does they come from and is that suitable for other atari games? Or do I need to change it for other games? Also the original references are very welcome, thank you very much.

Thanks.

GAE parameter name should be lambda not tau. And why is default 1.0?

From the end of section 3 in the GAE paper: High-Dimensional Continuous Control Using Generalized Advantage Estimation
https://arxiv.org/pdf/1506.02438.pdf

Taking γ < 1 introduces bias into the policy gradient estimate, regardless of the value function’s accuracy. 
On the other hand, λ < 1 introduces bias only when the value function is inaccurate.
Empirically, we find that the best value of λ is much lower than the best value of γ, 
likely because λ introduces far less bias than γ for a reasonably accurate value function

So my two questions are:

Why is lambda called "tau" in this repo?
If they empirically find that lambda is better if it is "much lower" than gamma discount, why is the default for lambda (tau) and gamma set to 1.0 and 0.99 respectively?

how to under ensure ensure_shared_grads?

I am kind of confused of the ensure_shared_grads here https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L13. Here, the grad is synced only when it is None. I think we need to set shared_param._grad = param.grad all the time because I don't see we sync the grad anywhere except here. Would anyone give me some hints about it?

gradient share problem

in pytorch 1.0 the following line:

https://github.com/ikostrikov/pytorch-a3c/blob/master/train.py#L14

returns the gradient assignment error:
RuntimeError: assigned grad has data of a different type.

any fixes or suggestions?

Does this implementation work with real-time environments that lag such as VNC environments from universe?

Have you tested this implementation on any Universe environments that lag due to the VNC?
The original one had a separate thread used for collecting experience that runs in parallel to optimization (I guess to make up for the lag?):
openai/universe-starter-agent#36 (comment)

Does this one have a separate thread used for collecting experience that runs in parallel to optimization?