td3's Issues

question about random seeds

Hi, I tried several experiments and found that even after setting the same random seed, the results differ each time. This confuses me. After debugging, I found that in main.py

action = env.action_space.sample()

This line samples different actions every time, so I want to ask: should env.action_space also be seeded with the same random seed?
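A minimal sketch of seeding the action space alongside the environment so that env.action_space.sample() becomes reproducible; the exact seeding API depends on the Gym version, so treat this as an assumption rather than the repository's code:

import gym

env = gym.make("HalfCheetah-v2")
seed = 0
env.seed(seed)               # older Gym API; newer versions use env.reset(seed=seed)
env.action_space.seed(seed)  # makes action_space.sample() deterministic

action = env.action_space.sample()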

Minimum and Array-like Actions

The current algorithm assumes env.action_space.low = -env.action_space.high and env.action_space.high[0] = env.action_space.high[i] for all i, which is not the case for all environments.
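A minimal sketch (not part of the repository) of rescaling a policy output in [-1, 1] to an arbitrary, possibly asymmetric Box action space, which would lift that assumption:

import numpy as np

def rescale_action(raw_action, action_space):
    # Map an action in [-1, 1]^d to [action_space.low, action_space.high].
    low, high = action_space.low, action_space.high
    return low + (np.asarray(raw_action) + 1.0) * 0.5 * (high - low)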

What should target_policy_noise and target_noise_clip be?

Hello, and thank you for sharing such a great project.
I'm trying to solve Humanoid-v3 and I am having trouble setting target_policy_noise and target_noise_clip.
The paper defaults of 0.2 and (-0.5, 0.5) don't work for Gym Humanoid-v3 because the action space is -0.4 to 0.4.
Best regards
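One hedged way to handle this is to express the paper defaults as fractions of the action bound, in the spirit of how main.py multiplies the noise settings by max_action; a sketch under that assumption:

max_action = 0.4                  # env.action_space.high[0] for Humanoid-v3
policy_noise = 0.2 * max_action   # std of the target policy smoothing noise
noise_clip = 0.5 * max_action     # clipping range for that noise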

Models not being saved?

This parameter isn't doing anything at the moment, right?

TD3/main.py, line 43 at commit 1bccd36:

parser.add_argument("--save_models", action="store_true")  # Whether or not models are saved

Unable to generate results

/data_2t/zhangruijiang/python/envs/td3/lib/python3.7/site-packages/gym/envs/registration.py:14: PkgResourcesDeprecationWarning: Parameters to load are deprecated. Call .resolve and .require separately.
result = entry_point.load(False)

Worry about the convergence proof

Thank you very much for your impressive work. Unfortunately, I have some concerns about my current work, and I would appreciate your guidance.

In my work, I try to improve your clipped double Q-learning and want to establish its convergence properties.

In our algorithm description, for the tabular version (discrete actions), we randomly update one Q-function; for the continuous-action version, I follow your update design.

However, I have only proved that if our improved clipped double Q-learning uses the same update mechanism as in your paper, it converges to the optimal action value. For the case of randomly updating one Q-function, I have not managed a proof so far.

Can I say in the paper that we prove convergence of our improved clipped double Q-learning? In the proof (in the Appendix), we would describe the difference between the two update methods.

Or could you advise me on what kind of description would be better?

Thank you again!

training a loaded model seems to reset model parameters

When loading a pretrained model for further training, the performance during training seems to have reset relative to the pretrained model. Bypassing the train function

policy.train(replay_buffer, args.batch_size, mean_action, scale)

results in performance matching the trained model.
policy.train seems to reset the parameters somehow, but I can't find how or where.
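A small diagnostic sketch (hypothetical, assuming the upstream train signature) that snapshots the actor parameters before and after one training call, to check whether loading plus training really resets the weights:

import copy

before = copy.deepcopy(policy.actor.state_dict())   # policy is an already-loaded TD3/DDPG object
policy.train(replay_buffer, batch_size=256)
after = policy.actor.state_dict()

max_change = max((before[k] - after[k]).abs().max().item() for k in before)
print(f"max parameter change after one training step: {max_change:.6f}")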

Huge overestimation of the Q values and lack of convergence

So I'm working with this environment

https://gist.github.com/iandanforth/e3ffb67cf3623153e968f2afdfb01dc8

which is basically the mujoco cart pole.

What I notice is that the value estimates provided by the target critics steadily increase as training goes on, going far beyond the maximum cumulative reward (even without accounting for the discount).
The environment is solved after quite a few iterations, when the estimates are much lower than the real Q values, but continuing training makes the critic diverge by several orders of magnitude.

I tried reducing tau and increasing policy_freq, but this unstable behaviour still emerges.
Did you observe anything similar with the MuJoCo tasks?

Plotting

What code did you use to get the plots for the paper?

When I plot the numbers from npy files I'm getting slightly less smooth curves.
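A hedged sketch (not the authors' plotting script) of loading a saved evaluation file and applying a simple moving average; the file name and the 5k-step evaluation interval are assumptions based on the repository defaults:

import numpy as np
import matplotlib.pyplot as plt

data = np.load("./results/TD3_HalfCheetah-v3_0.npy")   # hypothetical file name
window = 10
smoothed = np.convolve(data, np.ones(window) / window, mode="valid")

steps = np.arange(len(data)) * 5000                    # one evaluation every 5k timesteps
plt.plot(steps, data, alpha=0.3, label="raw")
plt.plot(steps[window - 1:], smoothed, label=f"moving average ({window})")
plt.xlabel("timesteps")
plt.ylabel("average evaluation return")
plt.legend()
plt.show()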

state should be next state in algorithm description in the paper

If I'm reading the paper and the code right, I think the line that sets ã in the algorithm description of the paper (v2) should be ã = π_φ'(s') + ε instead of ã = π_φ'(s) + ε. Right?

(Apologies if this isn't the best way to report bugs in the paper. Feel free to delete. Also, many many thanks for publishing the code)

Again about observation normalization.

In the Supplementary Material, you compared hyperparameters with DDPG from baselines.

In this part, DDPG uses normalized observations. So my question is:

In the paper, which codebase does the reported performance of DDPG (not OurDDPG) come from: OpenAI Baselines or your implementation? I am confused because when I run OpenAI Baselines with observation normalization, DDPG does not perform well, while your code does.

My guess is that DDPG's huge replay buffer stores training data generated under different normalization parameters.
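For reference, a minimal sketch of the kind of running observation normalization used in the Baselines DDPG (an illustration, not this repository's code):

import numpy as np

class RunningObsNormalizer:
    def __init__(self, obs_dim, eps=1e-8):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = eps

    def update(self, obs):
        # Incremental (Welford-style) update of the running mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)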

Having trouble running DDPG implementation

I'm using Python 3.4 and PyTorch 0.4 on Mac OS X; running

python main.py --env HalfCheetah-v1 --policy_name DDPG

results in:

(my_root) sbhupatiraju@instance-11:~/ddpg/TD3$ python main.py --env HalfCheetah-v1 --policy_name DDPG
---------------------------------------
Settings: DDPG_HalfCheetah-v1_0
---------------------------------------
[2018-05-01 17:37:35,924] Making new env: HalfCheetah-v1
/home/sbhupatiraju/ddpg/TD3/DDPG.py:19: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  return Variable(tensor, volatile=volatile)
---------------------------------------
Evaluation over 10 episodes: -1.481050
---------------------------------------
Total T: 1000 Episode Num: 1 Episode T: 1000 Reward: -409.138651
/home/sbhupatiraju/ddpg/TD3/DDPG.py:97: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  target_Q.volatile = False
Traceback (most recent call last):
  File "main.py", line 94, in <module>
    policy.train(replay_buffer, episode_timesteps, args.batch_size, args.discount, args.tau)
  File "/home/sbhupatiraju/ddpg/TD3/DDPG.py", line 104, in train
    critic_loss = self.criterion(current_Q, target_Q)
  File "/home/sbhupatiraju/.conda/envs/my_root/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sbhupatiraju/.conda/envs/my_root/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 371, in forward
    _assert_no_grad(target)
  File "/home/sbhupatiraju/.conda/envs/my_root/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 12, in _assert_no_grad
    "nn criterions don't compute the gradient w.r.t. targets - please " \
AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these tensors as not requiring gradients
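For context, this error comes from building the Bellman target with the deprecated volatile flag on PyTorch 0.4+. A hedged sketch of the usual fix, with the surrounding DDPG variables assumed: build the target inside torch.no_grad() (or detach it) so the loss target carries no gradient.

import torch
import torch.nn.functional as F

# state, action, next_state, reward, done, discount and the networks are assumed to exist.
with torch.no_grad():
    target_Q = reward + (1.0 - done) * discount * critic_target(next_state, actor_target(next_state))

current_Q = critic(state, action)
critic_loss = F.mse_loss(current_Q, target_Q)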

Critic 1 loss very different than Critic 2 loss

Thank you for the repository. I just implemented TD3 and observed that the loss of Critic 1 is consistently much higher than the loss of Critic 2. Both networks have the same architecture and learning rate in my code, so why might this be? Shouldn't they have similar losses and updates?

What is the reason behind modifying the DDPG implementation?

I notice that Appendix C of your paper (DDPG Network and Hyper-parameter Comparison) mentions that the DDPG architecture is different from the OpenAI Baselines version. Is there any specific reason or intuition behind this? Thank you.


Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2587–2601.

Performance on Humanoid-v2?

Hi,

Thanks for your elegant code. I'm asking to confirm how TD3 performs on MuJoCo Humanoid-v2.

The SAC paper and other papers report that TD3 makes almost no progress on Humanoid-v2 (the learning curve is essentially flat at zero). However, we tried to reproduce this using your code and found that TD3 could reach test returns of 6000+ within 10M steps (just one seed, though).

Since you didn't report Humanoid results in your paper, and you might have changed hyperparameters to improve the results, I'm wondering which result is correct. We used the latest hyperparameters and set start_timesteps=10000, following https://github.com/sfujim/TD3/blob/master/run_experiments.sh#L11

Thanks!

Great work. Just a quick question.

Great work! Thanks for the codes.

Just a quick question about your three contributions (DP, TPN, CDQ — delayed policy updates, target policy smoothing noise, clipped double Q-learning): have you ever tried them on discrete-action games (typical ones like Space Invaders, Kung-Fu Master, or Seaquest)? Do the three tricks provide an improvement there as well? The last paragraph of your paper mentions that the three contributions can also be applied to discrete-action games.

Thanks again.

TD3 plotting details

Hi,
I was wondering if you could share the script you used to plot the data in the td3 paper? If the x-axis were timesteps, what was on the y-axis?

Calculation method of value estimate

Thank you for your outstanding work. Your paper mentions the estimated value and the true value. I would like to ask about the specific calculation method. Thank you.
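A hedged sketch (not the authors' script, with the repository's buffer and critic interfaces assumed) of one way to compute the two quantities the paper compares: the critic's estimate averaged over replay-buffer states, and a Monte Carlo estimate of the true value from discounted returns under the current policy:

import numpy as np
import torch

def estimated_value(policy, replay_buffer, n_states=10000):
    # Average critic estimate Q(s, pi(s)) over states drawn from the replay buffer.
    state, _, _, _, _ = replay_buffer.sample(n_states)
    with torch.no_grad():
        q1, _ = policy.critic(state, policy.actor(state))
    return q1.mean().item()

def true_value(policy, env, n_episodes=1000, discount=0.99):
    # Monte Carlo estimate: average discounted return of rollouts under the current
    # policy. (The paper starts these rollouts from replay-buffer states; plain
    # env.reset() is used here for simplicity.)
    returns = []
    for _ in range(n_episodes):
        state, done, ret, t = env.reset(), False, 0.0, 0
        while not done:
            state, reward, done, _ = env.step(policy.select_action(np.array(state)))
            ret += (discount ** t) * reward
            t += 1
        returns.append(ret)
    return float(np.mean(returns))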

About observation normalization

Hello.
I have one question.
You claimed that DDPG should use normalized observations as input. However, in your code there is no normalization. Is that a problem?

Figure explanation

Could you please explain in detail how you obtain the Max Average Return in Fig. 5 and Table 1 of your paper?
From what I understand, if I want to evaluate one algorithm (like TD3) on one game and gather statistics:

test_reward[10][1M/5k = 200]
for seed in range(10):
    set the seed
    for every 5k steps collected:
        test the policy 10 times, record the average game return in test_reward[seed][epoch]

Now I have 10 × 200 = 2000 values in total, each representing the average return over 10 test trials.
How exactly do you calculate the Max Average Return and its standard deviation?
For example, is it
max average return = max(mean(test_reward, axis=0))
or
max average return = mean(max(test_reward, axis=0))?

Same question for the standard deviation. Thanks a lot.
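For illustration, a hedged sketch of the first interpretation, assuming test_reward is an array of shape (n_seeds, n_evaluations):

import numpy as np

mean_over_seeds = test_reward.mean(axis=0)      # average across seeds at each evaluation point
best_eval = int(mean_over_seeds.argmax())
max_average_return = mean_over_seeds[best_eval]
std_at_best = test_reward[:, best_eval].std()   # spread across seeds at that evaluation point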

Plots or graphs

Hello dear,
I hope this message finds you well. My name is Syed Hasnat, and I am currently working on a project that involves implementing the TD3 algorithm. I have found your research paper on TD3 extremely insightful, and I am particularly interested in reproducing the graphs from the paper for my work.

I have been exploring the open-source code of TD3, but I'm facing some challenges in extracting the specific parameters used to generate the graphs mentioned in your paper. I would greatly appreciate it if you could share the code snippets or guide me on which parameters from the open-source code were used to create those graphs.

Furthermore, the paper mentions that

" In
Figure 1, we graph the average value estimate over 10000
states and compare it to an estimate of the true value. The true value is estimated using the average discounted return over 1000 episodes following the current policy, starting from states sampled from the replay buffer. "

So how are you relating the 10k states and the 1k episodes?
After that, you state that

"A very clear overestimation bias occurs from the learning procedure, which contrasts with the novel method that we describe in the following section, Clipped Double Q-learning, which greatly reduces overestimation by the critic."

So, where is the novel method in the graph?

Sorry to say, but overall I do not understand how these graphs were plotted and which data should be used to plot them properly.

Your assistance will be invaluable for my project, and I believe it will enhance the overall quality of my work. I understand your time is valuable, and I appreciate any support you can provide.

Thank you for your consideration. Looking forward to your guidance.

Best regards,

Syed Hasnat
USPCAS-E (US Pakistan Center For Advanced Studies in Energy)
[email protected]

SyntaxError

When I tried running:
python3 main.py --env HalfCheetah-v2
such error happened:

root@ubuntu:/test/TD3-master# python3 main.py --env HalfCheetah-v2
File "main.py", line 30
print(f"Evaluation over {eval_episodes} episodes: {avg_reward:.3f}")
^
SyntaxError: invalid syntax

so I guess there is a problem with my environment. Could you tell me the specific environment (Python version and dependencies) you used?
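A hedged side note: the f-string in that line requires Python 3.6 or newer; on older interpreters an equivalent line can be written with str.format, for example:

eval_episodes, avg_reward = 10, -1.481050
print("Evaluation over {} episodes: {:.3f}".format(eval_episodes, avg_reward))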

a little suggestion

I think that in TD3.py the second line of the following code may be redundant, because it applies clamp twice:

next_action = (self.actor_target(next_state) + noise).clamp(-self.max_action, self.max_action)
next_action = next_action.clamp(-self.max_action, self.max_action)

And thanks for your codes a lot !

some issues about gaussian noise in td3.py

I've noticed that in Td3.py, line 112:

noise = ( torch.randn_like(action) * self.policy_noise )

When I read the paper, I found that the Gaussian noise added to the target action should follow N(0, 0.2), where 0.2 (self.policy_noise) is the standard deviation.

In your code it is used as a multiplier outside the distribution, which turns the distribution into N(0, 0.04). Maybe that is just your implementation? Does this distribution affect the results much?
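A brief illustration (not from the repository) of why the two forms coincide when 0.2 is read as a standard deviation: torch.randn_like draws from a standard normal, so scaling by policy_noise yields noise with standard deviation policy_noise, i.e. variance policy_noise**2.

import torch

action = torch.zeros(100000, 4)
policy_noise = 0.2

noise_a = torch.randn_like(action) * policy_noise
noise_b = torch.normal(mean=torch.zeros_like(action), std=policy_noise)

print(noise_a.std().item(), noise_b.std().item())  # both are approximately 0.2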

ourddpg performance not good

I ran the TD3 source code for the OurDDPG algorithm and found that the performance in most games (apart from HalfCheetah-v3) is inconsistent with the original paper. Could you explain what the problem might be, please?

policy update freq in DDPG & OurDDPG

Hi @sfujim ,

Thanks a lot for your great work! It really solves some practical problems.

One thing I notice in main.py is the policy update frequency for DDPG and OurDDPG. Do they both update at every time step once enough data has been collected in the buffer? Would that be too frequent? I also checked the implementation in OpenAI Spinning Up, which updates every few steps.
So I wonder whether there is any special reason for updating both the actor and the critic at every step?

Thanks a lot!

Best,
Xubo
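For comparison, a hedged sketch of the burst-update schedule mentioned above (names and defaults are assumptions, not the repository's code): collect update_every environment steps, then run that many gradient updates.

def train_with_burst_updates(policy, replay_buffer, env, total_timesteps,
                             start_timesteps=25_000, update_every=50, batch_size=256):
    for t in range(total_timesteps):
        # ... select an action, step env, and store the transition in replay_buffer ...
        if t >= start_timesteps and (t + 1) % update_every == 0:
            # One burst of update_every gradient updates, instead of one update per step.
            for _ in range(update_every):
                policy.train(replay_buffer, batch_size)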

Replay Buffer Sampling

In the utils.py file, the sample method in the ReplayBuffer class generates a list of random indices to pull out a batch of transitions from the current replay buffer with the following line of code:

ind = np.random.randint(0, self.size, size=batch_size)

This line of code samples a list of random indices with replacement. Would this cause any issues in training if the batch occasionally has multiple transitions that in reality came from one experience step? In a recent DQN implementation I wrote, I used the following line to generate a random list of indices without replacement:

rand_idxs = torch.from_numpy(np.random.choice(np.arange(self.size), size=batch_size, replace=False))

argparse default type

python main.py --expl_noise=0.1 would result in

Traceback (most recent call last):
  File "main.py", line 122, in <module>
    + np.random.normal(0, max_action * args.expl_noise, size=action_dim)
TypeError: can't multiply sequence by non-int of type 'float'

argparse parses 0.1 as a str, not a float, so the type needs to be specified for the float arguments in main.py.
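A minimal sketch of the fix described above (the flag name matches main.py; the rest is illustrative):

import argparse

parser = argparse.ArgumentParser()
# Explicit type=float so "--expl_noise=0.1" is parsed as a float rather than a string.
parser.add_argument("--expl_noise", default=0.1, type=float)
args = parser.parse_args()
print(type(args.expl_noise))  # <class 'float'>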

Why is there no noise added in DDPG?

Thank you very much for your code, which is very clean and effective. But I have a question: DDPG mentions the use of noise, so why didn't you add noise to DDPG?

Done state is set to 0 if episode timesteps are greater than or equal to max episode timesteps

Hello,

First of all, thanks for publishing the code.

I have a question regarding your implementation:
In the main loop

TD3/main.py, line 127 at commit 385b33a:

done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

you set the done flag to false if the episode timesteps are greater than or equal to the max episode timesteps.
From my understanding, this results in the done flag always being False for environments without terminal states.
Is this behavior intended? Shouldn't the done flag be True once the max time steps are exceeded?
When I examine the performance of OurDDPG in the HalfCheetah environment with the done state set to 1, I get a much lower reward than when it is set to 0.
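For context, a hedged illustration of why this flag matters in the Bellman target (variable names are assumptions): the bootstrap term is masked by (1 - done), so marking a pure time-limit cutoff as terminal would wrongly zero out the future value.

# reward, done_bool, discount and target_Q_next are assumed to be defined.
not_done = 1.0 - done_bool
target_Q = reward + not_done * discount * target_Q_next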

New TD3 hyperparameters really improve the performance?

Could you confirm that the new hyperparameters for TD3 (i.e., network size from [400, 300] to [256, 256], batch size from 100 to 256, learning rate from 1e-3 to 3e-4) really improve performance?

In my experiments, they do not demonstrate a consistent improvement.

Curve smoothing!

Hi, thank you for sharing clean code!
I want to know how you use a filter to smooth the curves; could you share the relevant code? Thank you very much. If it's not convenient for you, that's fine.

Dockerfile

Thank you for sharing clean code!
Is there a Dockerfile that sets up the dependencies?

Would it be possible to use a single network?

Hello, thank you for this great work. I have a few questions, as I am a newbie in reinforcement learning.

Would it be possible to use a single network with multiple heads rather than two? I am actually training such a network successfully; I am just not yet giving the actor a reward based on the critic.

What is the advantage of using the critic's estimate rather than the real reward for the actor? When I do multi-tasking by using the critic as a regularization loss, it already improves the actor.

However, I understand it is advisable to make one policy update for every two Q-value updates via delays. What if I do a backward pass only with the relevant loss at each epoch, according to this schedule?

Note: while writing this, I have actually just started using the minimum of the real reward and the reward estimated by the critic in the actor loss, and will see whether that improves the results in some way.

Although I am concurrently optimizing the critic loss (MSE), I am afraid that it might be overestimating to take the real reward in the worst case. Is that why having two separate networks is necessary?

The reason it would be desirable to do this with a single network is resources such as memory and, more importantly, that learning two tasks together might help both converge (as it does in my case). So even with two separate networks, maybe one could share layers between the two?
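For what it's worth, a hedged sketch of a critic with two Q-heads inside a single nn.Module, similar in spirit to the repository's Critic class (layer sizes are assumptions):

import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        # Two estimates from one module; TD3 bootstraps from the minimum of the two.
        return self.q1(sa), self.q2(sa)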

Are you interested in image input with convolutional networks architecture?

Thanks for your awsome code!

I have tried DDPG in gym environments and got good results with fully connected netwroks

but when I use the RGB array (several frames stacked in grayscale) offered by gym.render() to train a convolutional actor/critic networks, it seems to be failed even if much longger training than fully connected networks...

I wonder if it is impossible for DDPG to train convolutional networks? Or are you intersted in implementing a variant with convolutional network?

Thanks again !
