
ddpg's Issues

bias annealing weight updates

I could be wrong, but it does not seem that you are annealing the bias with importance sampling as suggested in the paper (section 3.4).

w_i = (1/N * 1/P(i))^beta

I think you would have to multiply your gradients by this w_i term.
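
For illustration, a minimal NumPy sketch of those weights, following Schaul et al. (the names priorities and beta are placeholders, not identifiers from this repo):

    import numpy as np

    def is_weights(priorities, beta):
        # P(i): normalized sampling probabilities
        p = priorities / priorities.sum()
        n = len(priorities)
        # w_i = (1/N * 1/P(i))^beta, normalized by max(w) as in the paper
        w = (1.0 / (n * p)) ** beta
        return w / w.max()

    # each sampled transition's TD error (and hence its gradient
    # contribution) would then be scaled by its weight before the update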

Error: No such file or directory

Hi,

I'm running the code as-is for the InvertedPendulum-v1 environment. The output log looks like:

I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
File "run.py", line 120, in
experiment.run(main)
File "/home/cuiyi/ddpg/experiment.py", line 41, in run
create(scriptdir, f, n)
File "/home/cuiyi/ddpg/experiment.py", line 77, in create
mkdir(path)
File "/home/cuiyi/ddpg/experiment.py", line 255, in mkdir
os.mkdir(path)
OSError: [Errno 2] No such file or directory: '../ddpg-results/experiment1'

What should I do to fix this?

Thanks.
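
For reference, os.mkdir raises OSError [Errno 2] when the parent directory does not exist, and ../ddpg-results appears to be missing here. A sketch of a likely workaround (not a confirmed fix for this repo) is to create the intermediate directories first:

    import os

    path = '../ddpg-results/experiment1'
    # os.makedirs creates missing intermediate directories; os.mkdir does not
    if not os.path.exists(path):
        os.makedirs(path)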

DDPG actor output saturates

Hello~ I have a question about DDPG.
When my action dimension is 1, the result is good, but when my action dimension is 2 (with tanh and sigmoid activation functions), the actor's output saturates.
Here is the result I mentioned: https://github.com/m5823779/DDPG
By the way, I use batch normalization only in my actor network.
Do you know where the problem is?
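
One possible cause, offered as a hedged sketch rather than a diagnosis: the DDPG paper initializes the final actor layer from a uniform distribution in [-3e-3, 3e-3] so that tanh/sigmoid outputs start near zero rather than in their saturated regions (names below are illustrative, not from either repo):

    import tensorflow as tf

    action_dim = 2                                 # e.g. the 2-D action above
    h = tf.placeholder(tf.float32, [None, 400])    # last hidden-layer activations

    # final-layer init from U[-3e-3, 3e-3], as in the DDPG paper, keeps the
    # pre-activations small so tanh/sigmoid do not start out saturated
    init = tf.random_uniform_initializer(-3e-3, 3e-3)
    w_out = tf.get_variable('w_out', [400, action_dim], initializer=init)
    b_out = tf.get_variable('b_out', [action_dim], initializer=init)
    action = tf.tanh(tf.matmul(h, w_out) + b_out)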

Need your help to understand a step

Could you pinpoint the code where the actor's parameters (weights) are updated?

I am particularly looking for the step where the gradient of the critic is calculated with respect to the action variables and the gradient of the actor with respect to theta. The product of these gradients, summed over the batch, is used to update the actor (as in the algorithm given in the paper).
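
In general terms, that step looks like the following TensorFlow 1.x sketch (a toy stand-in graph with hypothetical names, not this repo's identifiers):

    import tensorflow as tf

    # toy stand-in graph
    s = tf.placeholder(tf.float32, [None, 3])
    p_theta = [tf.get_variable('w_p', [3, 1])]     # actor parameters theta
    actions = tf.matmul(s, p_theta[0])             # mu(s; theta)
    w_q = tf.get_variable('w_q', [1, 1])           # critic parameters
    q = tf.matmul(actions, w_q)                    # Q(s, mu(s))

    dq_da = tf.gradients(q, actions)[0]            # gradient of critic wrt actions
    # chain rule: dQ/dtheta = (da/dtheta)^T dQ/da; negated so a minimizer ascends Q
    actor_grads = tf.gradients(actions, p_theta, grad_ys=-dq_da)
    train_p = tf.train.AdamOptimizer(1e-4).apply_gradients(list(zip(actor_grads, p_theta)))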

Why should terminal states be removed from the minibatch?

In replay_memory.py:

    indices = np.zeros(size,dtype=np.int)
    for k in range(size):
      # find random index 
      invalid = True
      while invalid:
        # sample index ignore wrapping over buffer
        i = random.randint(0, self.n-2)
        # if i-th sample is current one or is terminal: get new index
        if i != self.i and not self.terminals[i]:
          invalid = False

      indices[k] = i

This part excludes some candidate indices from sampling.

Would indices = np.random.randint(0, self.n, size) be more appropriate?
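
For context, a hedged reading of the layout (assuming the buffer stores consecutive frames, so transition i is built from entries i and i+1; names are illustrative, not the repo's):

    import numpy as np

    def minibatch(observations, actions, rewards, terminals, indices):
        s = observations[indices]
        a = actions[indices]
        r = rewards[indices]
        # if terminals[i] were True, observations[i + 1] would be the first
        # frame of the NEXT episode, so terminal (and wrap-around) indices
        # must be skipped or masked; a plain randint over [0, n) breaks this.
        s_next = observations[indices + 1]
        return s, a, r, s_next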

Well done! But is it working?

Hi,

I was looking for such a repo to understand how to implement ddpg. Thanks for sharing.

I tried Reacher-v1; however, it does not seem to converge. Is this repo currently working, or is it still under construction?

Also, have you considered using Keras to make things cleaner?

Cheers!

Reacher-v1 not training

Hi, I have just tried running Reacher-v1 for 1,000,000 timesteps with the default settings and it didn't learn anything (it just gets stuck at a test reward of -12). It looks like you got it running with some settings, though; what were they?

Unrealistic rewards for InvertedDoublePendulum

Hi,

I'm running the code as-is for the InvertedDoublePendulum-v1 environment. The output log looks like:

[2016-09-29 02:55:12,968] Making new env: InvertedDoublePendulum-v1
[2016-09-29 02:55:13,029] OpenGL_accelerate module loaded
[2016-09-29 02:55:13,076] Using accelerated ArrayDatatype
outdir: ddpg-results/IP/
True action space: [-1.], [ 1.] 
True state space: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [ inf  inf  inf  inf  inf  inf  inf  inf  inf  inf  inf] 
Filtered action space: [-1.], [ 1.]
Filtered state space: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [ inf  inf  inf  inf  inf  inf  inf  inf  inf   inf  inf]
((11,), (1,))
{'_entry_point': 'gym.envs.mujoco:InvertedDoublePendulumEnv',
 '_env_name': 'InvertedDoublePendulum',
 '_kwargs': {},
 '_local_only': False,
 'id': 'InvertedDoublePendulum-v1',
 'nondeterministic': False,
 'reward_threshold': 9100.0,
 'tags': [], 
 'timestep_limit': 1000,
 'trials': 100}
Average test return 94.6561916032 after 0 timesteps of training
Average training return 64.8650261368 after 10004 timesteps of training
Average test return 94.4441357631 after 10004 timesteps of training
Average training return 62.8453825653 after 20006 timesteps of training
Average test return 94.4936849296 after 20006 timesteps of training
Average training return 63.6538282778 after 30008 timesteps of training
Average test return 94.9548271625 after 30008 timesteps of training
Average training return 63.9039428219 after 40011 timesteps of training
Average test return 94.2871854837 after 40011 timesteps of training
Average training return 63.2686654373 after 50014 timesteps of training
Average test return 98.8836603337 after 50014 timesteps of training
Average training return 145.89652752 after 60042 timesteps of training
Average test return 295.657725759 after 60042 timesteps of training
Average training return 192.307169483 after 70066 timesteps of training
Average test return 257.732447567 after 70066 timesteps of training
Average training return 226.691339415 after 80067 timesteps of training
Average test return 473.731095604 after 80067 timesteps of training
Average training return 255.541847852 after 90069 timesteps of training
Average test return 435.084465257 after 90069 timesteps of training
Average training return 254.536465181 after 100089 timesteps of training
Average test return 630.270166648 after 100089 timesteps of training
Average training return 250.049665622 after 110105 timesteps of training
Average test return 2436.58758156 after 110105 timesteps of training
Average training return 244.717938695 after 120121 timesteps of training
Average test return 93368.0844892 after 120121 timesteps of training

And then the code just exits (although I'd asked it to train for 1 million timesteps). Do you experience this sort of behavior as well? I'm guessing there is a subtle bug in the code that allows it to score episodic returns as high as 94k.


Concurrent Read Write on tf Variable?

I've seen several possible concurrent reads and writes in your code.

  • train_p and train_q have no control dependency on each other; train_q updates q_theta, which is also used in train_p, so the two may conflict with each other.
  • train_p updates p_theta_target, which is used in train_q, and the order of these updates is likewise undefined.
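
One way to serialize the two updates, as a sketch over a toy graph (hypothetical losses and variables, not the repo's code):

    import tensorflow as tf

    # toy stand-in graph
    x = tf.placeholder(tf.float32, [None, 4])
    q_theta = [tf.get_variable('q_w', [4, 1])]
    p_theta = [tf.get_variable('p_w', [4, 1])]
    q_loss = tf.reduce_mean(tf.square(tf.matmul(x, q_theta[0])))
    p_loss = -tf.reduce_mean(tf.matmul(x, p_theta[0]))

    train_q = tf.train.AdamOptimizer(1e-3).minimize(q_loss, var_list=q_theta)
    # force the critic update to finish before the actor op runs, so the two
    # cannot read and write the same variables concurrently
    with tf.control_dependencies([train_q]):
        train_p = tf.train.AdamOptimizer(1e-4).minimize(p_loss, var_list=p_theta)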
