Distributed Distributional Deep Deterministic Policy Gradients (D4PG)

A TensorFlow implementation of a Distributed Distributional Deep Deterministic Policy Gradients (D4PG) network for continuous control.

D4PG builds on the Deep Deterministic Policy Gradients (DDPG) approach with several improvements: a distributional critic, distributed agents running on multiple threads to collect experience, prioritised experience replay (PER), and N-step returns.
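To make the N-step return idea concrete, here is a minimal sketch of how such a target can be computed. The function name `n_step_return` and its arguments are hypothetical and are not taken from this repository. The sketch also uses a scalar bootstrap value for simplicity, whereas in D4PG the target is built from the critic's value distribution.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted sum of N rewards plus a discounted bootstrap value.

    rewards: the N rewards collected after the starting state, in order.
    gamma: discount factor.
    bootstrap_value: value estimate for the state reached after N steps
        (a scalar here; an assumption made to keep the sketch short).
    """
    total = 0.0
    discount = 1.0
    for reward in rewards:
        total += discount * reward
        discount *= gamma
    # After the loop, discount == gamma ** len(rewards).
    return total + discount * bootstrap_value


if __name__ == "__main__":
    # 1 + 0.5 + 0.25 + 0.125 * 8 = 2.75
    print(n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0))
```

With N = 1 this reduces to the usual one-step target, reward + gamma * bootstrap_value.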

Trained on OpenAI Gym environments.

This implementation has been successfully trained and tested on the Pendulum-v0, BipedalWalker-v2 and LunarLanderContinuous-v2 environments. This code can however be run on any environment with a low-dimensional (non-image) state space and continuous action space.

This implementation currently holds the high score for the Pendulum-v0 environment on the OpenAI leaderboard.

Requirements

Note: The versions listed are the ones I used; other versions will likely work too.

Ubuntu 16.04 (Most (non-Atari) envs will also work on Windows)
python 3.5
OpenAI Gym 0.10.8 (See link for installation instructions + dependencies)
tensorflow-gpu 1.5.0
numpy 1.15.2
scipy 1.1.0
opencv-python 3.4.0
imageio 2.4.1 (requires pillow)
inotify-tools 3.14

Usage

The default environment is 'Pendulum-v0'. To use a different environment simply change the ENV parameter in params.py before running the following files.

To train the D4PG network, run

  $ python train.py

This will train the network on the specified environment and periodically save checkpoints to the /ckpts folder.

To test the saved checkpoints during training, run

  $ python test_every_new_ckpt.py

Run this alongside the training script to periodically test the latest checkpoints as the network trains. The script invokes the run_every_new_ckpt.sh shell script, which monitors the given checkpoint directory and runs test.py on the latest checkpoint every time a new checkpoint is saved. Test results can optionally be saved to a text file in the /test_results folder.
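The monitoring idea can be illustrated with a small polling loop in Python. This is only a sketch and not the repository's script, which is a shell script (inotify-tools is among the requirements). The names `latest_checkpoint_step` and `watch`, and the assumption that checkpoint file names contain `ckpt-<step>`, are hypothetical choices for illustration.

```python
import os
import re
import time

# Assumed naming scheme: file names contain "ckpt-<training step>".
CKPT_PATTERN = re.compile(r"ckpt-(\d+)")


def latest_checkpoint_step(filenames):
    """Return the highest training step found in the file names, or None."""
    steps = []
    for name in filenames:
        match = CKPT_PATTERN.search(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def watch(ckpt_dir, on_new_checkpoint, poll_seconds=10.0, max_polls=None):
    """Poll ckpt_dir and call on_new_checkpoint(step) for each newer checkpoint."""
    last_step = None
    polls = 0
    while max_polls is None or polls < max_polls:
        step = latest_checkpoint_step(os.listdir(ckpt_dir))
        if step is not None and (last_step is None or step > last_step):
            last_step = step
            on_new_checkpoint(step)
        polls += 1
        time.sleep(poll_seconds)
```

In this sketch, `on_new_checkpoint` would be the place to launch the test run for the given step. Polling is simpler than event-based watching but reacts only once per `poll_seconds`.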

Once we have a trained network, we can visualise its performance in the environment by running

  $ python play.py

This will play the environment on screen using the trained network and optionally save a GIF.

Note: To reproduce the best 100-episode performance of -123.11 +/- 6.86 that achieved the top score on the 'Pendulum-v0' OpenAI leaderboard, run

  $ python test.py

specifying the train_params.ENV and test_params.CKPT_FILE parameters in params.py as Pendulum-v0 and Pendulum-v0.ckpt-660000 respectively.
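As a rough illustration, the two settings named above might look like this in params.py. The class-based layout is an assumption inferred from the dotted names train_params.ENV and test_params.CKPT_FILE; the real file may be organised differently and will contain other parameters not shown here.

```python
# Hypothetical excerpt of params.py; only the two settings named in this
# README are shown, and the class layout is an assumption.
class train_params:
    ENV = 'Pendulum-v0'                      # environment to train and test on


class test_params:
    CKPT_FILE = 'Pendulum-v0.ckpt-660000'    # checkpoint to load for testing
```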

Results

Result of training the D4PG on the 'Pendulum-v0' environment:

Result of training the D4PG on the 'LunarLanderContinuous-v2' environment:

Result of training the D4PG on the 'BipedalWalker-v2' environment:

Result of training the D4PG on the 'BipedalWalkerHardcore-v2' environment:

Environment	Best 100-episode performance	Ckpt file
Pendulum-v0	-123.11 +/- 6.86	ckpt-660000
LunarLanderContinuous-v2	290.87 +/- 2.00	ckpt-320000
BipedalWalker-v2	304.62 +/- 0.13	ckpt-940000
BipedalWalkerHardcore-v2	256.29 +/- 7.08	ckpt-8130000

All checkpoints for the above results are saved in the ckpts folder. To reproduce a result, set the train_params.ENV and test_params.CKPT_FILE parameters in params.py to the desired environment and checkpoint file, then run python test.py.
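For readers who want to summarise their own test runs in the same "mean +/- error" form, the following sketch computes the mean of a list of episode returns together with two common error measures. This README does not state whether the +/- figure is a standard deviation or a standard error, so the sketch returns both. The function name `summarise_returns` is hypothetical.

```python
import math
import statistics


def summarise_returns(episode_returns):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(episode_returns)
    mean = statistics.mean(episode_returns)
    std = statistics.stdev(episode_returns)   # sample std (n - 1 denominator)
    sem = std / math.sqrt(n)                  # standard error of the mean
    return mean, std, sem


if __name__ == "__main__":
    mean, std, sem = summarise_returns([1.0, 2.0, 3.0])
    print("{:.2f} +/- {:.2f} (std), +/- {:.2f} (sem)".format(mean, std, sem))
```

At least two episode returns are needed, since the sample standard deviation is undefined for a single value.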
