Hello, thanks for the resource. It would be nice to implement [...]

[REQUEST] Implement CRR/AWAC about d3rlpy HOT 9 CLOSED

takuseno commented on May 18, 2024 1

[REQUEST] Implement CRR/AWAC

from d3rlpy.

Comments (9)

takuseno commented on May 18, 2024 2

@araffin Thank you for opening this issue! I'm considering more algorithms based on the Advantage-Weighted Critic, which is already implemented in d3rlpy (though not thoroughly tested yet). I'll update this issue when they are ready!

takuseno commented on May 18, 2024 2

@araffin I'm implementing the AWAC algorithm on the awac branch:
https://github.com/takuseno/d3rlpy/blob/awac/d3rlpy/algos/awac.py

When performance reports are available, I'll share them here.

takuseno commented on May 18, 2024 2

I've now merged the awac branch into master. I've confirmed that it basically works; however, I have not done a thorough evaluation because I'm a bit busy at the moment.

takuseno commented on May 18, 2024

I was implementing AWAC based on the paper, which says the authors built AWAC on top of TD3. However, I realized that they actually built it on SAC. For now, the TD3-based AWAC performs well enough in offline training, but not well enough in fine-tuning. I suspect the cause is an exploration issue.

araffin commented on May 18, 2024

Thanks for the update =)

> However, I realized that they actually built it on SAC. For now, the TD3-based AWAC performs well enough in offline training, but not well enough in fine-tuning. I suspect the cause is an exploration issue.

You mean that what is written in the paper is different from what is implemented?

I also noticed some differences in your implementation. You use randomly sampled actions to estimate the value, whereas they use the deterministic output of the policy for it.

EDIT: it seems that you updated that part, but you are still sampling instead of using the mean/deterministic output.
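To make the difference concrete, here is a minimal sketch in plain Python of the two ways to estimate V(s) from a critic and a Gaussian policy. The toy `q_value`, `policy_mean`, and `policy_std` functions are hypothetical stand-ins for illustration only; they are not taken from d3rlpy or from the AWAC authors' code.

```python
import random

# Hypothetical stand-ins for a learned critic and a 1-D Gaussian policy.
def q_value(state: float, action: float) -> float:
    """Toy critic: highest when the action equals the state."""
    return -(action - state) ** 2

def policy_mean(state: float) -> float:
    """Toy policy mean, i.e. the deterministic output."""
    return 0.5 * state

def policy_std(state: float) -> float:
    """Toy policy standard deviation (constant here)."""
    return 1.0

def value_from_mean(state: float) -> float:
    """V(s) ~= Q(s, mu(s)): evaluate the critic at the deterministic action."""
    return q_value(state, policy_mean(state))

def value_from_samples(state: float, n_samples: int, rng: random.Random) -> float:
    """V(s) ~= average of Q(s, a) over actions a sampled from the policy."""
    mu, sigma = policy_mean(state), policy_std(state)
    total = 0.0
    for _ in range(n_samples):
        total += q_value(state, rng.gauss(mu, sigma))
    return total / n_samples

rng = random.Random(0)
v_mean = value_from_mean(2.0)                     # exactly -1.0 for this toy setup
v_sampled = value_from_samples(2.0, 10_000, rng)  # close to -2.0, varies with the seed
```

In this toy setup the two estimates differ by roughly the policy variance, so the advantage Q(s, a) - V(s) that feeds the policy update would shift depending on which estimator is used.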

I also contacted the author recently and he told me that "the only notable thing is that we do not output the standard deviation (actually the log variance) of the policy with neural network layers, they are just learned parameters themselves (1 per action dimension). This prevents the variances from overfitting." (so as in PPO, rather than as in SAC)
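As an illustration of that remark, here is a minimal sketch in plain Python contrasting the two parameterizations. The class names and the bias-free linear layers are assumptions made for this example, and the sketch stores a log standard deviation rather than the log variance the author mentions; it is not the d3rlpy or AWAC implementation.

```python
import random

def linear(weights, state):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def init_weights(rows, cols, rng):
    """Small random weights, purely for illustration."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class NetworkStdPolicy:
    """Log std is produced by a layer, so it varies with the state (SAC-style)."""
    def __init__(self, state_dim, action_dim, rng):
        self.w_mean = init_weights(action_dim, state_dim, rng)
        self.w_log_std = init_weights(action_dim, state_dim, rng)

    def forward(self, state):
        return linear(self.w_mean, state), linear(self.w_log_std, state)

class ParameterStdPolicy:
    """Log std is a free learned vector, one entry per action dimension (PPO-style)."""
    def __init__(self, state_dim, action_dim, rng):
        self.w_mean = init_weights(action_dim, state_dim, rng)
        # Would be updated by the optimizer; it never depends on the state.
        self.log_std = [0.0] * action_dim

    def forward(self, state):
        return linear(self.w_mean, state), list(self.log_std)

rng = random.Random(0)
s1, s2 = [1.0, 0.0, -1.0], [0.0, 2.0, 1.0]
net_policy = NetworkStdPolicy(state_dim=3, action_dim=2, rng=rng)
param_policy = ParameterStdPolicy(state_dim=3, action_dim=2, rng=rng)
```

Calling `forward` on `s1` and `s2` gives the same log std for `param_policy` but different ones for `net_policy`, which is the distinction the author describes.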

takuseno commented on May 18, 2024

@araffin Thank you for checking the update! I noticed the logstd parameter a few days ago. With the latest version, the performance is much better. I'll merge it this week.

araffin commented on May 18, 2024

> @araffin Thank you for checking the update! I noticed the logstd parameter a few days ago. With the latest version, the performance is much better. I'll merge it this week.

Good to hear =) (I've been trying it too and it looks good)

Minor remark: you may want to consider Python raw strings for the docstrings:

r""":math:`\alpha/(\sqrt{v} + \epsilon)`"""

this avoids the use of too many backslashes ;)
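A small sketch of what the `r` prefix changes, using the same formula (the variable names are just for illustration):

```python
# Raw string: backslashes are kept literally, so LaTeX can be written as-is.
raw_doc = r""":math:`\alpha/(\sqrt{v} + \epsilon)`"""

# Regular string: every backslash has to be doubled to get the same text.
escaped_doc = """:math:`\\alpha/(\\sqrt{v} + \\epsilon)`"""

# Both spellings produce the same string.
assert raw_doc == escaped_doc

# Without the r prefix or doubling, "\a" is read as the ASCII bell character,
# so the text silently loses its backslash and its leading "a".
broken = "\alpha"
assert broken == chr(7) + "lpha"
```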

takuseno commented on May 18, 2024

Oh, I did not know that! Thank you! I'll use this to remove the doubled backslashes.

takuseno commented on May 18, 2024

@araffin Thanks for your advice!
