
rl_a3c_pytorch's Introduction

*Update: Minor updates to the code. Added distributed step size training functionality. Added TensorBoard integration so you can log and graph training, view the model graph, and visualize your weight and bias distributions as they update during training.

A3G: A GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING

RL A3C Pytorch

(GIFs: A3C LSTM playing Breakout-v0, SpaceInvadersDeterministic-v3, MsPacman-v0, BeamRider-v0, and Seaquest-v0)

A3G

An implementation of A3C that utilizes the GPU to speed up training, which we call A3G. Unlike other attempts to pair A3C with a GPU, in A3G each agent maintains its own network on the GPU while the shared model stays on the CPU; agent gradients are quickly moved to the CPU to update the shared model, which, thanks to Hogwild-style training, keeps updates frequent, fast, asynchronous, and lock-free. This greatly increases training speed: models that used to take days can now be trained in as little as 10 minutes for some Atari games. Breakout starts scoring over 400 in 10-15 minutes, and Pong is solved in about 10 minutes!
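
A minimal sketch of the gradient hand-off described above (the function mirrors the repo's ensure_shared_grads, but the names and the gpu flag here are illustrative, not the exact source):

    def ensure_shared_grads(local_model, shared_model, gpu=False):
        # Copy each worker's gradients to the CPU-resident shared model; the
        # shared optimizer then applies the update Hogwild-style, without locks.
        for param, shared_param in zip(local_model.parameters(),
                                       shared_model.parameters()):
            if shared_param.grad is not None and not gpu:
                return
            shared_param._grad = param.grad if not gpu else param.grad.cpu()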

This repository contains my PyTorch implementation of Asynchronous Advantage Actor-Critic (A3C), the algorithm from Google DeepMind's paper "Asynchronous Methods for Deep Reinforcement Learning."

See a3c_continuous, a newly added repo with my A3C LSTM implementation for continuous action spaces, which was able to solve the BipedalWalkerHardcore-v3 environment (average 300+ over 100 consecutive episodes).

A3C LSTM

I implemented an A3C LSTM model and trained it on the Atari 2600 environments provided in the OpenAI Gym. So far the model has shown the best performance I have seen for Atari game environments. Included in the repo were trained models for SpaceInvaders-v0, MsPacman-v0, Breakout-v0, BeamRider-v0, Pong-v0, Seaquest-v0 and Asteroids-v0, which performed very well and currently hold the best scores on the OpenAI Gym leaderboard for each of those games (no plans to train models for any more Atari games right now...). Saved models go in the trained_models folder. *Trained models have since been removed to reduce the size of the repo.

Optimizers with shared statistics are available for RMSProp and Adam, along with the option to use a non-shared optimizer.
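
The shared-statistics idea looks roughly like the sketch below: the optimizer's moment estimates are allocated up front and moved into shared memory so every worker process reads and writes the same statistics (condensed from the kind of pattern used in shared_optim.py; the full class also overrides step() to operate on these shared tensors):

    import torch
    import torch.optim as optim

    class SharedAdam(optim.Adam):
        # Adam whose statistics live in shared memory so all worker
        # processes update the same moment estimates (initialization sketch).
        def __init__(self, params, lr=1e-4, betas=(0.9, 0.999), eps=1e-3,
                     weight_decay=0, amsgrad=True):
            super(SharedAdam, self).__init__(params, lr=lr, betas=betas,
                                             eps=eps, weight_decay=weight_decay,
                                             amsgrad=amsgrad)
            for group in self.param_groups:
                for p in group['params']:
                    state = self.state[p]
                    state['step'] = torch.zeros(1)
                    state['exp_avg'] = torch.zeros_like(p.data)
                    state['exp_avg_sq'] = torch.zeros_like(p.data)
                    state['max_exp_avg_sq'] = torch.zeros_like(p.data)

        def share_memory(self):
            for group in self.param_groups:
                for p in group['params']:
                    state = self.state[p]
                    state['step'].share_memory_()
                    state['exp_avg'].share_memory_()
                    state['exp_avg_sq'].share_memory_()
                    state['max_exp_avg_sq'].share_memory_()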

Gym Atari settings are more difficult to train than traditional ALE settings, as Gym uses stochastic frame skipping and a larger set of discrete actions. For example, Breakout-v0 has 6 discrete actions in Gym, while ALE is set to only 4. Gym Atari also randomly repeats the previous action with probability 0.25, and there is a time/step limit that limits performance.
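
A quick way to see these settings for yourself (this assumes an older Gym/atari-py install where the v0 ids are still registered; newer Gym releases default to the minimal 4-action set):

    import gym

    env = gym.make("Breakout-v0")
    print(env.action_space)                     # Discrete(6) on the Gym builds used here
    print(env.unwrapped.get_action_meanings())  # includes the redundant *FIRE actions
    # v0 ids also skip a random 2-4 frames per step and repeat the previous action
    # with probability 0.25, which is what makes them harder than the ALE defaults.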

Results of the Gym environment evaluations are listed below.

Environment                        Best 100-Episode Average    Best Score
SpaceInvaders-v0                   5808.45 ± 337.28            13380.0
SpaceInvaders-v3                   6944.85 ± 409.60            20440.0
SpaceInvadersDeterministic-v3      79060.10 ± 5826.59          167330.0
Breakout-v0                        739.30 ± 18.43              864.0
Breakout-v3                        859.57 ± 1.97               864.0
Pong-v0                            20.96 ± 0.02                21.0
PongDeterministic-v3               21.00 ± 0.00                21.0
BeamRider-v0                       8441.22 ± 221.24            13130.0
MsPacman-v0                        6323.01 ± 116.91            10181.0
Seaquest-v0                        54203.50 ± 1509.85          88840.0

The 167,330 Space Invaders score is a world-record Space Invaders score, and that game ended only due to the Gym timestep limit, not from loss of life. When I increased the Gym timestep limit to a million, it reached a Space Invaders score of approximately 2,300,000 and still ended due to the timestep limit, most likely because the game gets fairly redundant after a while.

Due to the Gym version's timestep limit, the agent scores lower on Seaquest-v0, but on Seaquest-v4, which has a higher timestep limit, the agent beats the game (see gif above) with the maximum possible score of 999,999!!

Requirements

  • Python 2.7+
  • OpenAI Gym and Universe
  • PyTorch (PyTorch 2.0 has a bug where it incorrectly occupies GPU memory on all GPUs in use when backward() is called in the training processes. This does not slow down training, but it does unnecessarily take up a lot of GPU memory. If this is a problem for you and you are running out of GPU memory, downgrade PyTorch.)

Training

When training a model, it is important to limit the number of worker processes to the number of CPU cores available, as too many processes (e.g., more than one per CPU core) will actually be detrimental to training speed and effectiveness.

To train an agent in the PongNoFrameskip-v4 environment with 32 different worker processes:

python main.py --env PongNoFrameskip-v4 --workers 32

# A3G training: on a machine with 4 V100 GPUs and a 20-core CPU, PongNoFrameskip-v4 took 10 minutes to converge

To train an agent in the PongNoFrameskip-v4 environment with 32 different worker processes on 4 GPUs with the new A3G:

python main.py --env PongNoFrameskip-v4 --workers 32 --gpu-ids 0 1 2 3 --amsgrad

Hit Ctrl+C to end the training session properly.

(GIF: A3C LSTM playing Pong-v0)

Evaluation

To run a 100-episode Gym evaluation with a trained model:

python gym_eval.py --env PongNoFrameskip-v4 --num-episodes 100 --new-gym-eval

Distributed Step Size training

Example of training an agent using different step sizes across training processes, drawn from a provided list of step sizes:

python main.py --env PongNoFrameskip-v4 --workers 18 --gpu-ids 0 1 2 --amsgrad --distributed-step-size 16 32 64 --tau 0.92 --tensorboard-logger

Below is a graph from running the distributed step size training command above. (Chart: PongNoFrameskip DSS Training)

Notice that BeamRiderNoFrameskip-v4 reaches scores over 50,000 in less than 2 hours of training. Compared with the Gym v0 version, this shows the difficulty of those versions, but also that the time limit is a major factor in the score level.

These training charts were produced on a DGX Station using 4 GPUs and a 20-core CPU. I used 36 worker agents and a tau of 0.92, which is the lambda in the Generalized Advantage Estimation equation, to introduce more variance given the more deterministic nature of a plain 4-frame-skip environment with a 0-30 NoOp start. (Charts: BeamRider, Boxing, Pong, SpaceInvaders, and Qbert training.)
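
For context, tau is the lambda in the GAE recursion, so larger values keep more of the longer-horizon, higher-variance return. A minimal sketch of how it enters the advantage estimate (the function name and signature are illustrative, not the repo's exact training code):

    import torch

    def generalized_advantage(rewards, values, gamma=0.99, tau=0.92):
        # values must hold one more entry than rewards (the bootstrap value).
        gae = torch.zeros(1)
        advantages = []
        for t in reversed(range(len(rewards))):
            delta_t = rewards[t] + gamma * values[t + 1] - values[t]
            gae = gae * gamma * tau + delta_t   # tau == lambda in the GAE paper
            advantages.insert(0, gae)
        return advantages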

Project Reference

rl_a3c_pytorch's Issues

Clarification needed regarding num_workers

@dgriff777 thanks again for providing this amazing repo. I was wondering whether num_workers should be equal to the number of threads instead of the number of cores, as suggested in README.md. I'm new to A3C, so please bear with me if this is a naive question :P
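
One way to check the two numbers on your own machine (psutil is an extra dependency here, not something the repo requires):

    import multiprocessing
    import psutil  # assumption: installed separately, e.g. pip install psutil

    print(multiprocessing.cpu_count())        # logical CPUs, i.e. hardware threads
    print(psutil.cpu_count(logical=False))    # physical cores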

Performance of Breakout

Could I ask how long it takes to train Breakout from scratch to reach the desired score (859.57 for Breakout-v3)?

Have you tried BreakoutNoFrameskip? It is a version without action repetition and randomness.

Thanks!

Question about Test function

Hi, I have a question about the following block:

    if player.done and not player.info:
        state = player.env.reset()
        player.eps_len += 2
        player.state = torch.from_numpy(state).float()
        if gpu_id >= 0:
            with torch.cuda.device(gpu_id):
                player.state = player.state.cuda()
    elif player.info:

I don't quite understand when info equals True or False. What is the meaning of having info=True versus info=False?

I can't seem to find any documentation about this info flag on the Gym website :(

Thanks
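
For anyone else wondering: info here is not a Gym flag but the fourth return value of env.step(), a dict that for Atari carries the remaining lives; the EpisodicLifeEnv-style wrapper in environment.py uses it to tell a lost life apart from a real game over. A hedged illustration with the classic Gym step API (the env id is chosen only for the example):

    import gym

    env = gym.make("MsPacmanNoFrameskip-v4")
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print(done, info)   # e.g. False {'ale.lives': 3}
    # done=True while lives remain means a life was lost rather than a true game over.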

Stuck when training in MsPacman-v0

Hi @dgriff777. Thank you for your repo. It's great that it can achieve such a high score, but I ran into a problem when I tried to apply it to MsPacman-v0.

I simply used the command python main.py --env MsPacman-v0 --workers 7
Then I get test scores like this:

2018-10-27 15:59:44,767 : lr: 0.0001
2018-10-27 15:59:44,767 : gamma: 0.99
2018-10-27 15:59:44,767 : tau: 1.0
2018-10-27 15:59:44,767 : seed: 1
2018-10-27 15:59:44,767 : workers: 7
2018-10-27 15:59:44,767 : num_steps: 20
2018-10-27 15:59:44,767 : max_episode_length: 10000
2018-10-27 15:59:44,767 : env: MsPacman-v0
2018-10-27 15:59:44,767 : env_config: config.json
2018-10-27 15:59:44,767 : shared_optimizer: True
2018-10-27 15:59:44,767 : load: False
2018-10-27 15:59:44,767 : save_max: True
2018-10-27 15:59:44,767 : optimizer: Adam
2018-10-27 15:59:44,767 : load_model_dir: trained_models/
2018-10-27 15:59:44,767 : save_model_dir: trained_models/
2018-10-27 15:59:44,767 : log_dir: logs/
2018-10-27 15:59:44,767 : gpu_ids: [-1]
2018-10-27 15:59:44,767 : amsgrad: True
2018-10-27 15:59:44,767 : skip_rate: 4
2018-10-27 15:59:52,746 : Time 00h 00m 07s, episode reward 60.0, episode length 429, reward mean 60.0000
2018-10-27 16:00:17,886 : Time 00h 00m 32s, episode reward 70.0, episode length 619, reward mean 65.0000
2018-10-27 16:00:43,513 : Time 00h 00m 58s, episode reward 70.0, episode length 628, reward mean 66.6667
2018-10-27 16:01:09,034 : Time 00h 01m 24s, episode reward 70.0, episode length 633, reward mean 67.5000
2018-10-27 16:01:34,687 : Time 00h 01m 49s, episode reward 70.0, episode length 615, reward mean 68.0000
2018-10-27 16:02:00,366 : Time 00h 02m 15s, episode reward 70.0, episode length 641, reward mean 68.3333
2018-10-27 16:02:25,238 : Time 00h 02m 40s, episode reward 70.0, episode length 624, reward mean 68.5714
2018-10-27 16:02:50,496 : Time 00h 03m 05s, episode reward 70.0, episode length 622, reward mean 68.7500
2018-10-27 16:03:15,714 : Time 00h 03m 30s, episode reward 70.0, episode length 631, reward mean 68.8889
2018-10-27 16:03:40,280 : Time 00h 03m 55s, episode reward 70.0, episode length 626, reward mean 69.0000
2018-10-27 16:04:05,072 : Time 00h 04m 20s, episode reward 70.0, episode length 628, reward mean 69.0909

The test score is always 70, and it seems that the agent chooses the same path every time and stops in a corner.

Could you tell me how you trained the model to get 6323.01 ± 116.91 in MsPacman-v0? Are there any other parameters that I should set?

Quick question on batch processing

Thanks for the implementation! Great codes.

I read/ran your code and realized it processes training examples with just batch_size=1 (instead of a large batch size; am I correct on this?). I am wondering if this was designed on purpose because of your A3G. With a larger batch size things run faster on the GPU, so why batch_size=1?

Is there anything we can do to run it on large batches?

Thank you.

UserWarning: This overload of add_ is deprecated

Is it normal to get this trace back in the console? It spams for a few dozen times and then stops abruptly. Then, it starts logging the training session as intended. Sorry if this question is incredibly ignorant lol. I'm new to python and the world of ai. Figured I'd post my question here before searching Google. Thanks in advance.
C:\Users\joshu\Documents\0AIFolder\00A3CA3Gatari\rl_a3c_pytorch-master\shared_optim.py:167: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at ..\torch\csrc\utils\python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)

Need for trained models

Hi,

Due to a lack of resources, I can't train the models myself, so I need the pre-trained models for the various games. Is it possible for you to share the models with me? It would be very helpful.

Thanks

watching game visualization during training

I looked through all the arguments and I don't see one that lets you watch the game being played while training. I guess I must be missing something really obvious. Great work, by the way!

How long have you trained the model

Cool project! But how long did you train each Atari game? I have trained SpaceInvaders-v0 for 13 hours with 16 CPUs, but the reward is still around 790. However, according to the original paper, at 14 hours the performance had reached 1400. How can I train faster?
Another question: the network architecture is different from the one the original paper proposed. Does that affect performance? Thank you!

Reward Smoothing

Hi, what do you think about reward smoothing?
The collected rewards have high variance. In order to show the tendency of the reward curve, should we apply a smoothing operation like TensorBoard's smoothing?
If so, which smoothing method should I choose, exponential smoothing or moving-average smoothing?
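
Not a repo feature, but for what it's worth: TensorBoard's smoothing is just an exponential moving average, which you can reproduce on a list of logged rewards like this (a sketch):

    def exponential_smooth(rewards, weight=0.9):
        # TensorBoard-style exponential moving average over a reward curve.
        smoothed, last = [], rewards[0]
        for r in rewards:
            last = last * weight + (1 - weight) * r
            smoothed.append(last)
        return smoothed

    print(exponential_smooth([60.0, 70.0, 70.0, 200.0, 180.0]))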

Hyperparameters for training

Hi,
Thanks for your work. I am wondering how you trained Seaquest to achieve such high performance; I always get stuck around 4000 points. Could you please share the hyperparameters?

Move trained_models into release

Model files are rather large (understandably), so it's quite a pain to clone the repo for local use.
One reasonable way around this is to archive them and make them available as a release file instead.

question about trained models

Just want to clarify that there is only one saved model per environment and it is overwritten each training epoch, right? For example, MsPacman will only have one saved model, trained_models/MsPacman-v0.dat.

cnn layers

In your model you have 4 CNN layers and max pooling.

  1. DQN (2015) used only 3 CNN layers without pooling
  2. A3C (2016) used only 2 CNN layers without pooling

questions:

  1. Don't you think pooling actually loses spatial information of the RL scene, which IMO is important? Why did you decide to use pooling instead of increasing the stride to 2? (See the comparison sketch after this list.)
  2. Why did you decide to use 4 CNN layers? Possibly the Gym v0 environments are harder?
  3. Any particular reason for such a specific weight init (final actor/critic linear weights)?

thanks.
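
On question 1, both choices halve the spatial resolution; a tiny comparison sketch with dummy shapes (illustrative only, not the repo's model.py):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 32, 80, 80)  # dummy feature map

    pooled = nn.MaxPool2d(2, 2)(nn.Conv2d(32, 32, 3, stride=1, padding=1)(x))
    strided = nn.Conv2d(32, 32, 3, stride=2, padding=1)(x)

    print(pooled.shape, strided.shape)  # both torch.Size([1, 32, 40, 40])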

Why one process run on 2 gpus?

First, thank you for your great A3C implementation.

I ran the code with python main.py --workers 1 --gpu-ids 5 and found that the single process runs on 2 GPUs. The same thing happens when I run with --workers 50: all the processes should run on GPU 5, but I find that these processes (same PIDs) also appear on GPU 0 with type C and a smaller GPU memory usage than on GPU 5. How can I keep all the processes on GPU 5? Thank you very much!
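
A common workaround (not specific to this repo, and assuming the extra allocation on GPU 0 is just a CUDA context): hide the other GPUs before CUDA is initialised, so the only visible device is the one you want.

    import os
    # Must run before the first .cuda() call in any process; with only GPU 5
    # visible it is addressed as device 0, so you would pass --gpu-ids 0.
    # Equivalent shell form: CUDA_VISIBLE_DEVICES=5 python main.py --gpu-ids 0 ...
    os.environ["CUDA_VISIBLE_DEVICES"] = "5"

    import torch
    print(torch.cuda.device_count())  # 1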

Reward is always 0 when training Breakout-v0

I trained the model overnight on Breakout-v0, but the reward is always 0. What could cause this? Or could you tell me what parameters you used when training it to play Breakout-v0? Thank you. Here is the log file.
log.txt

I just want to say your trained model has no effect

I tried to evaluate your trained model, but it seems to have no effect:

2017-08-01 21:08:13,757 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,757] reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:13,787] Starting new video recorder writing to /Volumes/xs/CodeSpace/AISpace/rl_space/rl_a3c_pytorch/Pong-v0_monitor/openaigym.video.0.33472.video000001.mp4
2017-08-01 21:08:24,947 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:24,947] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:35,054 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:35,054] reward sum: -21.0, reward mean: -21.0000
2017-08-01 21:08:44,732 : reward sum: -21.0, reward mean: -21.0000
[2017-08-01 21:08:44,732] reward sum: -21.0, reward mean: -21.0000

And the recording is a black-and-white video, and it cannot simply be shown on screen.

Seaquest-v0 not training as well as announced

I have tried twice to train the agent on Seaquest-v0 with 32 workers on a server, but after 13 hours of training the score seems to be stuck at a maximum of 2700/2800.

Here's the log file :
log.txt

I'm using gym 0.8.1 and atari-py 0.0.21 and left all the hyperparameters at their default values.
Any idea why the score obtained is much lower than the one you obtained (>50,000)?
Would you have the trained model for Seaquest-v0?
Thanks !

NotImplementedError

Slightly baffled by this:

Traceback (most recent call last):
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bob/PycharmProjects/rl_a3c_pytorch-master/train.py", line 31, in train
    player.state = player.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 375, in _reset
    observation = self.env.reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 376, in _reset
    return self._observation(observation)
  File "/home/bob/anaconda3/envs/rl_env/lib/python3.5/site-packages/gym/core.py", line 386, in _observation
    raise NotImplementedError
NotImplementedError

Train a new game

If I want to train on a new game, how should I choose the initial parameters to start tuning? Any suggestions?

Normalize

In the NormalizedEnv, I am puzzled why you chose alpha equal to 0.9999. If I want an unbiased mean, should alpha be (num_step - 1)/num_step?

I am also puzzled why normalizing the environment observations has such a huge influence on performance. Can you explain this to me? Thanks!
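
For reference, the kind of running normalization being discussed looks roughly like this (a sketch of the idea, with alpha as the EMA decay and 1 - alpha**n as the usual bias correction; not the repo's exact NormalizedEnv):

    import numpy as np

    class RunningNormalize(object):
        def __init__(self, alpha=0.9999):
            self.alpha = alpha
            self.mean = 0.0
            self.std = 0.0
            self.n = 0

        def __call__(self, obs):
            # Exponential moving estimates of mean/std, bias-corrected so
            # early estimates are not pulled toward zero.
            self.n += 1
            self.mean = self.mean * self.alpha + obs.mean() * (1 - self.alpha)
            self.std = self.std * self.alpha + obs.std() * (1 - self.alpha)
            unbiased_mean = self.mean / (1 - self.alpha ** self.n)
            unbiased_std = self.std / (1 - self.alpha ** self.n)
            return (obs - unbiased_mean) / (unbiased_std + 1e-8)

    print(RunningNormalize()(np.random.rand(80, 80)).shape)  # (80, 80)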

Running A3C on 8 CPUs is still slow

When I run A3C on 8 CPUs, it is still slow.
My CPU is a Xeon(R) Platinum 8255C. Is the reason my CPU's poor performance or a problem with torch multiprocessing?

Question on training function

I noticed that in your player_util.py action_train function:

if self.done:
    if self.gpu_id >= 0:
        with torch.cuda.device(self.gpu_id):
            self.cx = Variable(torch.zeros(1, 512).cuda())
            self.hx = Variable(torch.zeros(1, 512).cuda())
    else:
        self.cx = Variable(torch.zeros(1, 512))
        self.hx = Variable(torch.zeros(1, 512))
else:
    self.cx = Variable(self.cx.data)
    self.hx = Variable(self.hx.data)

But how can you backpropagate gradients through time, to the past 20 steps, if you set:

self.cx = Variable(self.cx.data)
self.hx = Variable(self.hx.data)
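
Detaching the hidden state only cuts the graph at the point where it happens; if that block runs once per rollout, gradients still flow through every step inside the rollout (plain truncated backprop-through-time). A tiny standalone illustration, not repo code:

    import torch
    import torch.nn as nn

    lstm = nn.LSTMCell(4, 8)
    hx, cx = torch.zeros(1, 8), torch.zeros(1, 8)
    hx, cx = hx.detach(), cx.detach()   # rollout boundary: graph cut here
    loss = 0
    for t in range(20):                 # 20-step rollout
        hx, cx = lstm(torch.randn(1, 4), (hx, cx))
        loss = loss + hx.sum()
    loss.backward()                     # gradients reach the LSTM at every step
    print(lstm.weight_hh.grad.shape)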

Solving time

Thank you for the nice implementation. I'm curious about the running time on your machine. In https://github.com/ikostrikov/pytorch-a3c it is reported that PongDeterministic-v3 is solved in around 15 minutes; did you reproduce similar results with any version of Pong?

Thank you

Visualization not appearing

I ran python main.py --env Pong-v0 --workers 32 and
python gym_eval.py --env Pong-v0 --num-episodes 100, but I don't see any visualization of the game. Can I turn it on somehow?

multi gpu support

When I run the program on multiple GPUs, that is, with gpu_id set to [0, 1, 2, 3], it reports the error "Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one".

Pretrained models

Hello, is it possible to get access to some of the pre-trained models? (Specifically looking for Seaquest, Pong and Space Invaders, but any or all would be brilliant.)

Why ensure_shared_grads

In your ensure_shared_grads function:

    if shared_param.grad is not None and not gpu:
        return

Can you explain what this means?

plot rewards as a function of number of timesteps

Hi, thanks so much for the excellent codebase. Just wondering, is there any way to plot the training curve as a function of timesteps (as opposed to plotting the training curve as a function of time passed)?

Thanks!

Question on max-length

Hi,

I think there is a problem with max_length. max_length is 20000 in your default setting, but Gym's internal max episode length is 10000.

In testing, when the number of steps == 10000, player.done = True and not player.max_length. It's possible that player.info['ale.lives'] > 0 at this time, so the condition
if player.done and player.info['ale.lives'] > 0 and not player.max_length:
is satisfied.

In that case, you reset the environment. Now in EpisodicLifeEnv, self.was_real_done in environment.py is True and you actually reset the Gym environment (which is correct). But your code doesn't treat this 10000-step episode as a terminated episode; instead, it assumes the episode doesn't terminate because player.info['ale.lives'] > 0.

synchronization among workers

Browsing the code, I can't help noticing that there is no synchronization among workers, i.e., no Lock mechanism to coordinate the updating of shared_model by different workers. Is this how the "Hogwild" algorithm works? I have browsed several PyTorch implementations of A3C, and all seem to share the same model-updating mechanism. Here are my specific questions; I wonder if you can enlighten me or confirm my understanding:

  1. In setting the grad for shared_model (see the function below), why do you check "if shared_param.grad is not None"? Is this a way to prevent one worker from overstepping another? After the first assignment of shared_param._grad = param.grad, shared_param.grad will never be None again, because neither optimizer.zero_grad() nor optimizer.step() resets shared_param.grad to None. Wouldn't this prevent shared_param._grad from ever being set to param.grad again within the same worker?

    def ensure_shared_grads(model, shared_model):
        for param, shared_param in zip(model.parameters(),
                                       shared_model.parameters()):
            if shared_param.grad is not None:
                return
            shared_param._grad = param.grad

  2. Assuming my understanding of the above code is wrong, shared_model keeps getting its new grad from worker A. However, it is possible that before optim.step() is executed in worker A, worker B will have updated the grad for shared_model too (partially or completely). So by the time optim.step() finishes, the grad used to update the parameters could come from worker A, from worker B, or a mix of both. Is that true?

  3. If my statement above is true, then this way of updating the model parameters seems very inefficient. The original A3C paper seems to mention periodic synchronization, not synchronization at all times. It might help stability and convergence speed. Just a thought.

Thank you.

eps for Adam

Is there a reason why the default eps for the Adam optimizer is so high? Currently it is 1e-3 [line 104 in shared_optim.py]; usually it's around 1e-8. Just wanted to see whether this was done intentionally (e.g., it works better than a lower value) or not.
