
pytorch-a3c's Introduction

pytorch-a3c

This is a PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".

This implementation is inspired by Universe Starter Agent. In contrast to the starter agent, it uses an optimizer with shared statistics as in the original paper.

Please use this bibtex if you want to cite this repository in your publications:

@misc{pytorchaaac,
  author = {Kostrikov, Ilya},
  title = {PyTorch Implementations of Asynchronous Advantage Actor Critic},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ikostrikov/pytorch-a3c}},
}

A2C

I highly recommend checking out the synchronous version and other algorithms: pytorch-a2c-ppo-acktr.

In my experience, A2C works better than A3C, and ACKTR is better than both of them. Moreover, PPO is a great algorithm for continuous control. Thus, I recommend trying A2C/PPO/ACKTR first and using A3C only if you need it for some specific reason.

Also read the OpenAI blog for more information.

Contributions

Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request.

Usage

# Works only with Python 3.
python3 main.py --env-name "PongDeterministic-v4" --num-processes 16

This code runs evaluation in a separate thread in addition to 16 processes.

Results

With 16 processes it converges for PongDeterministic-v4 in 15 minutes. (Reward-curve plot for PongDeterministic-v4.)

For BreakoutDeterministic-v4 it takes several hours or more.


pytorch-a3c's Issues

Mixture of model prediction and update

Hello Ilya,

First, thanks for your amazing work. I have one question about the way you designed the A3C training.

What you basically do is play N steps, storing all the values, log_probs, and entropies from this phase in lists. After that, you perform an update using these variables.

Another approach would be to split this into two methods:

  • predict(state), which would return action values
  • update(states, actions, rewards), which would update the model

In this setup, you can use volatile=True in predict, but on the other hand, in the update method you have to recompute all these values, log_probs, and entropies with a forward pass over all the states.
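To make the idea concrete, here is a rough sketch of the two-method design I have in mind (hypothetical names, reusing this repo's model interface; torch.no_grad() is the modern replacement for volatile=True):

import torch
import torch.nn.functional as F

class Agent:
    def __init__(self, model):
        self.model = model

    def predict(self, state, hx, cx):
        # inference only: no graph is built, so rollouts stay cheap
        with torch.no_grad():
            value, logit, (hx, cx) = self.model((state.unsqueeze(0), (hx, cx)))
        prob = F.softmax(logit, dim=-1)
        return prob.multinomial(num_samples=1), (hx, cx)

    def update(self, states, actions, rewards):
        # a second forward pass over all stored states rebuilds the graph,
        # then values/log_probs/entropies are recomputed before backward()
        ...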

My question is: did you consider this second approach? Why did you choose the first one? Is it more efficient, or is it just simpler to implement? I am making my own A3C implementation and will definitely try out which is more efficient myself, but I am curious about your opinion.

Thanks

I cannot train with your recent pytorch-a3c

I am very interested in your pytorch-a3c because of its compactness and very simple structure.
I tried to follow your excellent work, but I have not been able to run it successfully after struggling for more than a month.
I would be very happy if you could give me any helpful information.

Following your readme, I installed most recent PyTorch, and clone your pytorch-a3c. (3 times cloned)

(1) "PongDeterministic-v4" was refused with next message.
(anaconda-2.4.0) user-no-MacBook-Pro:iko3 user$ OMP_NUM_THREADS=1 python main.py --env-name "PongDeterministic-v4" --num-processes 16
[2017-10-07 19:57:18,923] Making new env: PongDeterministic-v4
Traceback (most recent call last):
  File "main.py", line 53, in
    env = create_atari_env(args.env_name)
  File "/Users/user/iko3/envs.py", line 9, in create_atari_env
    env = gym.make(env_id)
  File "/Users/user/gym/gym/envs/registration.py", line 126, in make
    return registry.make(id)
  File "/Users/user/gym/gym/envs/registration.py", line 90, in make
    spec = self.spec(id)
  File "/Users/user/gym/gym/envs/registration.py", line 110, in spec
    raise error.DeprecatedEnv('Env {} not found (valid versions include {})'.format(id, matching_envs))
gym.error.DeprecatedEnv: Env PongDeterministic-v4 not found (valid versions include ['PongDeterministic-v3', 'PongDeterministic-v0'])

(2) So I replaced "PongDeterministic-v4" with "PongDeterministic-v3".
The program started, but the score stayed at -21 for more than a day.

(3) I printed out prob and action in test.py.
prob changes only in its last digit, and action ranges from 0 to 5.
By the way, I cannot understand the length of prob. I think Pong has 3 actions (left, stay, right).
The outputs are as follows.
The last line repeated for more than one day with the same score of -21.
My machine is a Mac 10.12.6 (Sierra), Core i7.

 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[4]]))
(Variable containing:
 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
 0.1666  0.1667  0.1666  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[5]]))
(Variable containing:
 0.1667  0.1667  0.1667  0.1666  0.1667  0.1667
[torch.FloatTensor of size 1x6]
, array([[1]]))
Time 00h 01m 08s, episode reward -21.0, episode length 764

Question about the policy loss calculation?

On line 103 of train.py you calculate the policy loss as

policy_loss = policy_loss - log_probs[i] * Variable(gae) - 0.01 * entropies[i]

My question is: where does this -0.01 * entropies[i] term come from, and what does it do?

EDIT: I went through the paper again and now see this was a form of regularization they used in the experimental setup section that wasn't in the pseudocode. Thanks for the code!
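For anyone else landing here, this is how I now read the update loop (a condensed sketch in my own words, not a verbatim copy of train.py; 0.01 is the entropy coefficient from the paper and tau is the GAE parameter):

import torch

def a3c_losses(rewards, values, log_probs, entropies, gamma=0.99, tau=1.00, beta=0.01):
    """Sketch of the n-step update; `values` holds one more entry than `rewards`
    (the bootstrap value for the state reached after the rollout)."""
    policy_loss, value_loss = 0, 0
    gae = torch.zeros(1, 1)
    R = values[-1].detach()                      # bootstrap from the last value estimate
    for i in reversed(range(len(rewards))):
        R = gamma * R + rewards[i]
        advantage = R - values[i]
        value_loss = value_loss + 0.5 * advantage.pow(2)

        # Generalized Advantage Estimation
        delta_t = rewards[i] + gamma * values[i + 1].detach() - values[i].detach()
        gae = gae * gamma * tau + delta_t

        # maximize E[log pi * A] plus an entropy bonus  <=>  minimize the negative
        policy_loss = policy_loss - log_probs[i] * gae - beta * entropies[i]
    return policy_loss, value_loss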

Error when rendering

Adding env.render() causes the error:
Traceback (most recent call last):
...
File "/home/shani/anaconda2/lib/python2.7/site-packages/pyglet/gl/glx_info.py", line 89, in have_version client = [int(i) for i in client_version.split('.')]
ValueError: invalid literal for int() with base 10: 'None'

I tried running it in a different Python file and it worked. Is there a known problem with rendering while using PyTorch?

is ensure_shared_grads still required?

It seems the PyTorch MNIST hogwild example has been updated, as gradients are now allocated lazily.
I think this means that this part of your code is no longer required?

Why is the convergence on Pong so fast?

It takes about 15 minutes to converge on Pong with this repo. The A3C paper managed to get convergence with 16 workers in about 4 hours. That's a speedup of 16x!

I'm aware that there are a few improvements such as GAE and normalising the observations from Gym, but those shouldn't account for a 16x speedup.

I've been trying to replicate those results in my own project, but can't seem to get better than 3-4 hours for convergence. Is there some kind of trick I'm missing?

AttributeError and CPU usage

Dear

After I installed everything

  • torch
  • gym
  • universe
  • cv2
    ....

and then:
python main.py --env-name "PongDeterministic-v3" --num-processes 8

I got this message:

File "/home/ubuntu/Notebooks/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'

However the code seems to run:

Time 00h 00m 03s, episode reward -21.0, episode length 764
Time 00h 01m 07s, episode reward -21.0, episode length 764
Time 00h 02m 11s, episode reward -21.0, episode length 764
Time 00h 03m 14s, episode reward -21.0, episode length 764
Time 00h 04m 18s, episode reward -21.0, episode length 764
Time 00h 05m 22s, episode reward -21.0, episode length 764
Time 00h 06m 25s, episode reward -21.0, episode length 764
Time 00h 07m 29s, episode reward -21.0, episode length 764
......

But CPU usage is always at 0% and sometimes peaks at 400–500%.

Is this normal? (I think only the evaluation thread is running!)

Thank you so much for your efforts

action_space.n and actions sampling

Hey,

I have a question. Maybe I am wrong, and sorry to disturb you.
I understood all the code, but there is one thing that is not clear to me.
The A3C model outputs a value and action probabilities, and there are 6 of the latter for Pong, right?

But when you compute:

action = prob.multinomial(num_samples=1).data

num_samples is just 1, so clearly you get one action. However, the number of possible actions for Pong is 6.

env.step(action.numpy())

is correct because it receives only one action.

The same applies to the log_prob that you save in the list (do you save 6 probabilities or only one?): with Pong you have 6, not 1, and that could create a problem in the policy calculation.

Sorry, I am a bit confused because

num_outputs = action_space.n

in model.py is 6, not 1.
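For what it's worth, here is how I ended up reading the sampling step (a toy sketch with made-up numbers, not the repo's exact code): the network outputs 6 probabilities, multinomial draws a single index from that distribution, and gather keeps only the log-probability of the chosen action, so exactly one scalar per step enters the policy loss.

import torch
import torch.nn.functional as F

logit = torch.randn(1, 6)                 # 6 logits for Pong's 6 discrete actions
prob = F.softmax(logit, dim=-1)           # shape (1, 6): one probability per action
log_prob = F.log_softmax(logit, dim=-1)   # shape (1, 6)

action = prob.multinomial(num_samples=1)  # shape (1, 1): index of the single sampled action
log_prob_a = log_prob.gather(1, action)   # shape (1, 1): log-probability of that action only

# env.step() then receives just that single action, e.g. action.item()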

memory overflow

I tried to run your code on my Ubuntu / Titan X / Intel i7 machine, but I kept getting memory problems after about one day. Can you guess why?
I tried both
OMP_NUM_THREADS=1 python main.py --env-name "Breakout-v0" --num-processes 8
and
OMP_NUM_THREADS=1 python main.py --env-name "Breakout-v0" --num-processes 6
to see if bringing down num-processes might help, but that did not work either.

P.S. (from free -m >> mem_usage.txt):
total used free shared buff/cache available
Mem: 64388 63937 325 14 125 271
Swap: 65451 65451 0
Fri Mar 31 17:55:18 KST 2017

Can't work on PyTorch 0.4.0

macOS 10.13.4
Python 3.6.4
pytorch 0.4.0

I encountered an error

~/py-garage/pytorch-a3c(master*) » python3 main.py --env-name "PongDeterministic-v4" --num-processes 1                                                        anya@turing-machine
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
/Users/anya/py-garage/pytorch-a3c/test.py:37: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:38: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:44: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  state.unsqueeze(0), volatile=True), (hx, cx)))
/Users/anya/py-garage/pytorch-a3c/train.py:55: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:45: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/train.py:56: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_prob = F.log_softmax(logit)
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/anya/py-garage/pytorch-a3c/train.py", line 60, in train
    action = prob.multinomial().data
TypeError: multinomial() missing 1 required positional arguments: "num_samples"
/Users/anya/py-garage/pytorch-a3c/test.py:40: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(cx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:41: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(hx.data, volatile=True)
Time 00h 00m 01s, num steps 0, FPS 0, episode reward -21.0, episode length 812

I tried to fix this error:
TypeError: multinomial() missing 1 required positional arguments: "num_samples"

diff --git a/train.py b/train.py
index 1b9c139..e3f0143 100644
--- a/train.py
+++ b/train.py
@@ -57,7 +57,7 @@ def train(rank, args, shared_model, counter, lock, optimizer=None):
             entropy = -(log_prob * prob).sum(1, keepdim=True)
             entropies.append(entropy)

-            action = prob.multinomial().data
+            action = prob.multinomial(num_samples=1).data
             log_prob = log_prob.gather(1, Variable(action))

             state, reward, done, _ = env.step(action.numpy())

I encountered a new error

~/py-garage/pytorch-a3c(master*) » python3 main.py --env-name "PongDeterministic-v4" --num-processes 1                                                        anya@turing-machine
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
WARN: <class 'envs.AtariRescale42x42'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
/Users/anya/py-garage/pytorch-a3c/test.py:37: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:38: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(torch.zeros(1, 256), volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:44: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  state.unsqueeze(0), volatile=True), (hx, cx)))
/Users/anya/py-garage/pytorch-a3c/train.py:55: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/train.py:56: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_prob = F.log_softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:45: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  prob = F.softmax(logit)
/Users/anya/py-garage/pytorch-a3c/test.py:40: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  cx = Variable(cx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/test.py:41: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  hx = Variable(hx.data, volatile=True)
/Users/anya/py-garage/pytorch-a3c/train.py:108: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  torch.nn.utils.clip_grad_norm(model.parameters(), args.max_grad_norm)
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/anya/py-garage/pytorch-a3c/train.py", line 111, in train
    optimizer.step()
  File "/Users/anya/py-garage/pytorch-a3c/my_optim.py", line 70, in step
    p.data.addcdiv_(-step_size, exp_avg, denom)
TypeError: addcdiv_() takes 2 positional arguments but 3 were given
Time 00h 00m 01s, num steps 20, FPS 13, episode reward -21.0, episode length 812

I am very confused and cannot fix it.
TypeError: addcdiv_() takes 2 positional arguments but 3 were given
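The best workaround I have found so far (not an authoritative fix, and untested beyond getting past the error): newer PyTorch releases dropped the value-first positional form of addcdiv_, so the scale factor has to be passed as a keyword in my_optim.py.

# my_optim.py, inside SharedAdam.step() -- hypothetical patch
# old: p.data.addcdiv_(-step_size, exp_avg, denom)
p.data.addcdiv_(exp_avg, denom, value=-step_size)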

About ensure_shared_grads

It is really great work! I have some questions about copying local gradients. In train.py, what is the purpose of adding the condition:

if shared_param.grad is not None

If I understand correctly, local gradients should always be copied to the global network. Is that right? Also, I'm confused about ._grad and .grad in train.py:18. Is there any difference?

No framestack?

Hello, is it correct that the environment does not stack frames, i.e. it returns 1x42x42 rather than 4x42x42? I've heard that stacking frames gives the model information about velocity/direction etc. The model converges fine on most occasions regardless, but I was just wondering if you had any thoughts on this.

Image preprocessing

I tried to understand the preprocessing in universe-starter-agent and couldn't tell whether the image is converted to grayscale as in DQN. Have you looked into that? Shouldn't it make training faster?

Removing Universe dependency

Universe has quite a few dependencies, including Go, and here basically only gym is used.
The vectorized API is useful in some cases, but the wrappers here just seem to rescale and normalize observations. Or is it useful for some other reason?
I replaced it with gym wrappers and everything looks fine.
(diff attached)

ensure_shared_grads only works once?

In ensure_shared_grads, wouldn't shared_param.grad NOT be None after the first time the parameters are updated? In my testing it seems that the function returns without syncing the gradient in every step after the first. Related discussion here.

def ensure_shared_grads(model, shared_model):
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad
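For what it's worth, here is the kind of alternative I have been testing (my own sketch, not the repo's approach): copy the local gradients into the shared model on every update instead of aliasing them once and returning early.

def push_grads(model, shared_model):
    # hypothetical helper: copies worker gradients into the shared model each step
    for param, shared_param in zip(model.parameters(), shared_model.parameters()):
        if param.grad is None:
            continue
        if shared_param.grad is None:
            shared_param._grad = param.grad.clone()
        else:
            shared_param.grad.copy_(param.grad)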

What is the purpose of `os.environ['OMP_NUM_THREADS'] = '1'`?

I wonder why os.environ['OMP_NUM_THREADS'] = '1' is used in the main method: https://github.com/ikostrikov/pytorch-a3c/blob/master/main.py#L43.

I ran a demo on CartPole-v0 using OpenAI Gym with an LSTM. I found that if I removed it, the agent couldn't learn a good policy. Why is that? (A minimal snippet of the setting is included after the logs below.)

This is the logging info without os.environ['OMP_NUM_THREADS'] = '1'.

Time 00h 00m 00s, episode reward 10.0, episode length 10
Time 00h 01m 00s, episode reward 10.0, episode length 10
Time 00h 02m 00s, episode reward 12.0, episode length 12
Time 00h 03m 01s, episode reward 12.0, episode length 12
Time 00h 04m 01s, episode reward 14.0, episode length 14
Time 00h 05m 01s, episode reward 14.0, episode length 14
Time 00h 06m 01s, episode reward 12.0, episode length 12
Time 00h 07m 03s, episode reward 13.0, episode length 13
Time 00h 08m 03s, episode reward 10.0, episode length 10

And this is the info after adding it:

Time 00h 00m 00s, episode reward 10.0, episode length 10
Time 00h 01m 00s, episode reward 18.0, episode length 18
Time 00h 02m 00s, episode reward 16.0, episode length 16
Time 00h 03m 00s, episode reward 29.0, episode length 29
Time 00h 04m 00s, episode reward 48.0, episode length 48
Time 00h 05m 00s, episode reward 12.0, episode length 12
Time 00h 06m 00s, episode reward 107.0, episode length 107
Time 00h 07m 01s, episode reward 79.0, episode length 79
Time 00h 08m 01s, episode reward 153.0, episode length 153
Time 00h 09m 01s, episode reward 77.0, episode length 77
Time 00h 10m 01s, episode reward 91.0, episode length 91
Time 00h 11m 01s, episode reward 129.0, episode length 129
Time 00h 12m 02s, episode reward 137.0, episode length 137
Time 00h 13m 02s, episode reward 117.0, episode length 117
Time 00h 14m 03s, episode reward 155.0, episode length 155
Time 00h 15m 03s, episode reward 200.0, episode length 200
Time 00h 16m 03s, episode reward 200.0, episode length 200
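My working hypothesis (happy to be corrected): without the cap, every worker process spawns its own pool of OpenMP/MKL intra-op threads, the CPU gets oversubscribed, and the asynchronous updates slow to a crawl. A minimal way to reproduce the setting in a standalone script:

import os
os.environ['OMP_NUM_THREADS'] = '1'   # must be set before torch is imported

import torch
torch.set_num_threads(1)              # equivalent runtime knob, as far as I know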

loss backward

(policy_loss + 0.5 * value_loss).backward()
I am not familiar with PyTorch, and I am very confused about how this loss backpropagates into the two sub-nets, actor and critic. Apparently there are two outputs in the network (action and value), but (policy_loss + 0.5 * value_loss) looks like a loss for a single sub-net.
Hope someone can help me.
Thank you so much.
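In case a sketch helps anyone with the same question, here is a toy model I wrote to convince myself (not the repo's code): both heads share a trunk, and calling backward() on the summed scalar sends the value-loss gradient through the critic head, the policy-loss gradient through the actor head, and both accumulate into the shared trunk.

import torch
import torch.nn as nn

class TinyActorCritic(nn.Module):
    def __init__(self, n_inputs=4, n_actions=2):
        super().__init__()
        self.trunk = nn.Linear(n_inputs, 16)       # shared by both heads
        self.actor = nn.Linear(16, n_actions)      # policy head
        self.critic = nn.Linear(16, 1)             # value head

    def forward(self, x):
        h = torch.relu(self.trunk(x))
        return self.actor(h), self.critic(h)

model = TinyActorCritic()
logit, value = model(torch.randn(1, 4))

policy_loss = -torch.log_softmax(logit, dim=-1)[0, 0]   # stand-in policy loss
value_loss = (value - 1.0).pow(2).mean()                # stand-in value loss

(policy_loss + 0.5 * value_loss).backward()

print(model.actor.weight.grad is not None)   # True: receives the policy-loss gradient only
print(model.critic.weight.grad is not None)  # True: receives the value-loss gradient only
print(model.trunk.weight.grad is not None)   # True: receives both, summed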

AttributeError: 'NoneType' object has no attribute 'data'

When I ran your code, it spat out the following:

..
[2017-03-17 20:15:21,067] Making new env: PongDeterministic-v3
[2017-03-17 20:15:21,075] Making new env: PongDeterministic-v3
Process Process-3:
Traceback (most recent call last):
File "/home/john/anaconda3/envs/th/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/john/anaconda3/envs/th/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/john/dev/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
Process Process-8:
...
...
File "/home/john/dev/pytorch-a3c/train.py", line 24, in train
shared_param.grad.data = param.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
Time 00h 00m 01s, episode reward -21.0, episode length 764
...

In the end it looks like it's running. Am I right?

Then, what is the cause of the error messages?

Many thanks !

(I am running this in conda env with python 3.5)

Why is the policy loss negative?

In train.py, the code for updating the policy loss is

 policy_loss = policy_loss - \
                log_probs[i] * Variable(gae) - 0.01 * entropies[i]

According to the original paper, I think it should be

 policy_loss = policy_loss + \
                log_probs[i] * Variable(gae) + 0.01 * entropies[i]
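For what it's worth, my reading is that the signs are consistent with the paper once you remember that the optimizer minimizes: ascending on the objective is the same as descending on its negative, i.e.

\max_\theta \; \mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t + \beta\, H(\pi_\theta(\cdot \mid s_t))\big]
\quad\Longleftrightarrow\quad
\min_\theta \; \mathbb{E}\big[-\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t - \beta\, H(\pi_\theta(\cdot \mid s_t))\big]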

Performance with Breakout

Have you trained Breakout with your A3C by any chance? I wonder what kind of scores you have gotten.

John

Problem with multiprocessing?

Hi, thanks a lot for releasing this project.

I wonder if you could give me some advice? I'm trying to adapt your code to a different environment. It works great on a single thread, when I simply execute train.py without using torch multiprocessing. When I try to use it with multiprocessing, it still runs, but I get a screen full of error messages like these:

*** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 *** *** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 *** *** Error in `*** Error in `python': double free or corruption (out): 0x00007f6598000c80 ***

I'm not sure whether it's learning either. I have a TensorFlow implementation, so I know both my environment and the A3C algorithm work.

I just wondered if you've seen this before? It might be something to do with the action variable, because when I set it to a constant the error goes away, but of course it doesn't learn then.

Thanks a lot for your help :)

Possible memory leak?

Training Breakout goes OK, but memory usage exceeds 25 GB after 4 hours of training on 16 CPU cores.
I wonder if it's related to sharing memory between processes.

I run Python 3.5 on Scientific Linux.

big bug

I encountered an error
[INFO ] [Logger ] Record log in C:\Users\ArunD.kivy\logs\kivy_18-09-24_3.txt
[INFO ] [Kivy ] v1.10.1.dev0, git-484b2f7, 20170513
[INFO ] [Python ] v3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
[INFO ] [Factory ] 194 symbols loaded
[INFO ] [Image ] Providers: img_tex, img_dds, img_sdl2, img_pil, img_gif (img_ffpyplayer ignored)
[INFO ] [Text ] Provider: sdl2
[INFO ] [OSC ] using for socket
[INFO ] [Window ] Provider: sdl2
[INFO ] [GL ] Using the "OpenGL" graphics system
[INFO ] [GL ] GLEW initialization succeeded
[INFO ] [GL ] Backend used
[INFO ] [GL ] OpenGL version <b'4.4.0 - Build 20.19.15.4835'>
[INFO ] [GL ] OpenGL vendor <b'Intel'>
[INFO ] [GL ] OpenGL renderer <b'Intel(R) HD Graphics'>
[INFO ] [GL ] OpenGL parsed version: 4, 4
[INFO ] [GL ] Shading version <b'4.40 - Build 20.19.15.4835'>
[INFO ] [GL ] Texture max size <16384>
[INFO ] [GL ] Texture max units <32>
[INFO ] [Shader ] fragment shader: <b"WARNING: 0:7: '' : #version directive missing">
[INFO ] [Shader ] vertex shader: <b"WARNING: 0:7: '' : #version directive missing">
[INFO ] [Window ] auto add sdl2 input provider
[INFO ] [Window ] virtual keyboard not allowed, single mode, not docked
[INFO ] [Base ] Start application main loop
C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py:63: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
probs = F.softmax(self.model(Variable(state, volatile = True))*100) # T=100
C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py:63: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
probs = F.softmax(self.model(Variable(state, volatile = True))*100) # T=100
[INFO ] [Base ] Leaving application in progress...
Traceback (most recent call last):
File "map.py", line 235, in
CarApp().run()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\app.py", line 828, in run
runTouchApp()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\base.py", line 504, in runTouchApp
EventLoop.window.mainloop()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\core\window\window_sdl2.py", line 663, in mainloop
self._mainloop()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\core\window\window_sdl2.py", line 405, in _mainloop
EventLoop.idle()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\base.py", line 339, in idle
Clock.tick()
File "C:\Users\ArunD\Anaconda3\lib\site-packages\kivy\clock.py", line 581, in tick
self._process_events()
File "kivy_clock.pyx", line 367, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7700)
File "kivy_clock.pyx", line 397, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7577)
File "kivy_clock.pyx", line 395, in kivy._clock.CyClockBase._process_events (kivy_clock.c:7498)
File "kivy_clock.pyx", line 167, in kivy._clock.ClockEvent.tick (kivy_clock.c:3490)
File "map.py", line 131, in update
action = brain.update(last_reward, last_signal)
File "C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py", line 79, in update
action = self.select_action(new_state)
File "C:\Users\ArunD\Documents\self-driving car\Self_Driving_Car\ai.py", line 64, in select_action
action = probs.multinomial()
TypeError: multinomial() missing 1 required positional arguments: "num_samples"

LSTM vs FF

Great job!
What I don't understand is what the LSTM is used for here. What is the difference between using an LSTM and using a feed-forward network? Does it replace the stack of 4 observations?

hidden states for backwards

In the training part, the output values and losses are saved first, and they are only combined into the loss at the end of the forward stage.
The problem is that, at that point, the hidden states of the model all come from the last input state.
In the backward pass, the gradients of the weights also depend on the hidden states.
But the previously saved outputs are obviously based on their own hidden states at the time, not the current ones.
So can variable.backward() work correctly in this situation?

How to modify code for continuous actions?

Hello :)

I was wondering how to modify the code for continuous actions, so that, for example, it could be compared with your NAF implementation on the OpenAI Gym pendulum:

env = gym.envs.make("Pendulum-v0")

Here's how far I got,

lstm_in = 3
lstm_out = 256

class ActorCritic(nn.Module):

    def __init__(self , lstm_in ):
        super(ActorCritic, self).__init__( )
               
        self.lstm = nn.LSTMCell(lstm_in, lstm_out)
        
        self.actor_mu = nn.Linear(lstm_out, 1)
        self.actor_sigma = nn.Linear(lstm_out, 1)

        self.critic_linear = nn.Linear(lstm_out, 1)
        
        self.train()

    def forward(self, inputs):
        
        x, (hx, cx) = inputs
        
        #might need some RELU's here ??
        
        hx, cx = self.lstm(x, (hx, cx))
        x = hx

        return self.critic_linear(x), self.actor_mu(x), self.actor_sigma(x), (hx, cx)

and the code in main now looks like,

env = gym.envs.make("Pendulum-v0")
lstm_in = 3    
global_model = ActorCritic( lstm_in )
global_model.share_memory()
local_model = ActorCritic( lstm_in )

It breaks with the following changes in train.py though,

env = gym.envs.make("Pendulum-v0")
s0 = env.reset()
done = True
state = torch.from_numpy(s0).float().unsqueeze(0) 
value, mu, sigma, (hx, cx) = local_model((Variable(state), (hx, cx)))

#mu = mu.clamp(-1, 1) # constain to sensible values 
Softplus=nn.Softplus()     
sigma = Softplus(sigma) #+ 1e-5 # constrain to sensible values
normal_dist = torch.normal(mu, sigma) 

prob = normal_dist
# TODO - what goes here?
#nnlog = nn.Log() 
#log_prob = nnlog(prob)

#log_prob = F.log_softmax(prob)
#prob = F.softmax(logit)
#log_prob = F.log_softmax(logit)
entropy = -(log_prob * prob).sum(1)

action = prob.data
action = Variable( action )
log_prob = log_prob.gather(1, action)

#action=[0,]
state, reward, done, _ = env.step(action.data)

Any idea how to get it working?

Thanks a lot for your help,

Best regards,

Ajay

Reference - Deepmind A3C's paper, https://arxiv.org/pdf/1602.01783.pdf
Section 9 - Continuous Action Control Using the MuJoCo Physics Simulator

Picture from https://github.com/deeplearninc/relaax#distributed-a3c-architecture-with-continuous-actions
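In case it is useful to anyone else who lands on this issue, here is the Gaussian-policy piece I eventually sketched for the TODO block above (a hypothetical helper of my own, not from this repo; mu and sigma_raw are the two actor heads):

import math
import torch
import torch.nn.functional as F

def gaussian_sample(mu, sigma_raw):
    # hypothetical helper for a continuous-action head
    sigma = F.softplus(sigma_raw) + 1e-5                    # keep sigma strictly positive
    action = (mu + sigma * torch.randn_like(mu)).detach()   # sampled action, treated as data

    # log N(action | mu, sigma) and the Gaussian entropy, both differentiable in mu and sigma
    log_prob = -0.5 * ((action - mu) / sigma).pow(2) - torch.log(sigma) - 0.5 * math.log(2 * math.pi)
    entropy = 0.5 * (torch.log(2 * math.pi * sigma.pow(2)) + 1)
    return action, log_prob, entropy

# env.step() would then receive the (possibly clamped/rescaled) action.numpy()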

Question about normalized_columns_initializer(weights, std=1.0) method.

Hi, I'm new to reinforcement learning. While learning from the A3C code, I found one thing confusing.
In model.py, the normalized_columns_initializer is:

def normalized_columns_initializer(weights, std=1.0):
    out = torch.randn(weights.size())
    out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
    return out
I tried std=1.0 and out = torch.Tensor(np.random.randn(3, 4)), then out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True)); out.std(dim=1) shows that the standard deviation of out is not 1.0.
Am I understanding the purpose of this method correctly?
If we want to set the std of out, why not use out *= std / out.std(dim=1, keepdim=True)?
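For what it's worth, my current reading is that the function fixes the L2 norm of each row to std rather than the sample standard deviation, which would explain why out.std(dim=1) is not 1.0. A tiny check (just to illustrate, not from the repo):

import torch

out = torch.randn(3, 4)
out *= 1.0 / torch.sqrt(out.pow(2).sum(1, keepdim=True))

print(out.norm(2, dim=1))   # ~[1., 1., 1.]: the per-row L2 norm is what gets fixed
print(out.std(dim=1))       # generally not 1.0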

SharedAdam bias correction wrong

The timestep parameters of the SharedAdam optimizer are not shared. This should lead to incorrect bias correction, and therefore incorrect moment estimates. Does the current implementation work correctly?
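A sketch of what I had in mind when asking, in case it clarifies the concern: keep the step counter (and the moments) in shared-memory tensors so every process sees the same bias-correction denominator. This is my own illustration, not necessarily how my_optim.py is written.

import torch

class SharedAdamSketch(torch.optim.Adam):
    def __init__(self, params, lr=1e-4):
        super(SharedAdamSketch, self).__init__(params, lr=lr)
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = torch.zeros(1)              # a tensor instead of a Python int
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)

    def share_memory(self):
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'].share_memory_()               # visible to all worker processes
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()
        # step() would then read state['step'].item() for the bias-correction terms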

File "main.py", line 55,TypeError: sum received an invalid combination of arguments

File "main.py", line 55, in
env.observation_space.shape[0], env.action_space)
File "/home/bruce/Downloads/pytorch-a3c-master/pytorch-a3c-master/model.py", line 47, in init
self.actor_linear.weight.data, 0.01)
File "/home/bruce/Downloads/pytorch-a3c-master/pytorch-a3c-master/model.py", line 9, in normalized_columns_initializer
out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True))
TypeError: sum received an invalid combination of arguments - got (int, keepdim=bool), but expected one of:

  • no arguments
  • (int dim)

My environment is Ubuntu 14.04.
I just ran your code without any changes and this error came out. Please help, thanks!
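My best guess (not verified on your exact setup): this signature error comes from a PyTorch build that predates the keepdim keyword. Upgrading PyTorch is probably the real fix; a hypothetical version-agnostic workaround for that one line in model.py could be:

# model.py, normalized_columns_initializer -- workaround that avoids keepdim
out *= std / torch.sqrt(out.pow(2).sum(1).view(-1, 1))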

environment observation normalization

hello,

I am very puzzled about why we can normalize observations like this. Could you explain it to me, or point me to a tutorial? I cannot understand why alpha is 0.9999 or why we can calculate the std like this, even though it truly improves performance.

Thanks!
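To make sure I am reading it right, here is roughly what I believe the wrapper does (a simplified sketch, not a verbatim copy of envs.py): it keeps exponential moving averages of the per-frame mean and std with decay alpha = 0.9999, divides by (1 - alpha^t) to correct the zero-initialization bias (the same trick Adam uses for its moments), and then standardizes each observation.

import gym

class RunningNormalize(gym.ObservationWrapper):
    # hypothetical sketch of an EMA-based observation normalizer
    def __init__(self, env, alpha=0.9999):
        super(RunningNormalize, self).__init__(env)
        self.alpha = alpha
        self.state_mean = 0.0
        self.state_std = 0.0
        self.num_steps = 0

    def observation(self, obs):
        self.num_steps += 1
        self.state_mean = self.alpha * self.state_mean + (1 - self.alpha) * obs.mean()
        self.state_std = self.alpha * self.state_std + (1 - self.alpha) * obs.std()

        # bias correction: the EMAs start at 0, so divide by (1 - alpha^t)
        unbiased_mean = self.state_mean / (1 - self.alpha ** self.num_steps)
        unbiased_std = self.state_std / (1 - self.alpha ** self.num_steps)
        return (obs - unbiased_mean) / (unbiased_std + 1e-8)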

Where does the initializer come from?

Hi,

I found some initializers in model.py. The values are not very clear to me. Could you please explain where they come from and whether they are suitable for other Atari games, or whether I need to change them for other games? The original references would also be very welcome, thank you very much.

Thanks.

GAE parameter name should be lambda, not tau. And why is the default 1.0?

From the end of section 3 in the GAE paper: High-Dimensional Continuous Control Using Generalized Advantage Estimation
https://arxiv.org/pdf/1506.02438.pdf

Taking γ < 1 introduces bias into the policy gradient estimate, regardless of the value function’s accuracy. 
On the other hand, λ < 1 introduces bias only when the value function is inaccurate.
Empirically, we find that the best value of λ is much lower than the best value of γ, 
likely because λ introduces far less bias than γ for a reasonably accurate value function

So my two questions are:

  • Why is lambda called "tau" in this repo?
  • If they empirically find that lambda works best when it is "much lower" than the gamma discount, why are the defaults for lambda (tau) and gamma set to 1.0 and 0.99, respectively?
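For reference, the estimator in question (this is just the paper's definition restated):

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}

So with lambda (tau) = 1 the estimator reduces to the plain discounted-return advantage, and lambda < 1 trades variance for bias.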

Can't work on Ubuntu 16.04

After value, logit, (hx, cx) = model((Variable(state.unsqueeze(0)), (hx, cx))) in train.py, the program does not proceed. Do you have any idea why?
