I left it running for a few epochs, several times, to ensure that it was not a fluke.

[Question] Any idea why SAC loss would diverge? (rlkit issue, closed)

rail-berkeley commented on June 20, 2024

[Question] Any idea why SAC loss would diverge?

from rlkit.

Comments (8)

vitchyr commented on June 20, 2024

A bit hard to figure out based on just this. What did you set as the target entropy? An alpha of 5.56742e+08 seems rather large.
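As a rough illustration of how alpha can reach values like that, here is a minimal sketch of the temperature update in isolation. It assumes the commonly used SAC temperature loss, `-(log_alpha * (log_pi + target_entropy)).mean()`, a policy whose mean log-probability stays fixed, and plain gradient descent rather than whatever optimizer the actual run used. `simulate_log_alpha` and its parameters are hypothetical names, not rlkit API.

```python
import math


def simulate_log_alpha(mean_log_pi, target_entropy, lr=3e-4, steps=1000):
    """Plain gradient descent on the temperature loss with a frozen policy.

    Loss:     L(log_alpha) = -log_alpha * (mean_log_pi + target_entropy)
    Gradient: dL/dlog_alpha = -(mean_log_pi + target_entropy)
    """
    log_alpha = 0.0
    for _ in range(steps):
        grad = -(mean_log_pi + target_entropy)
        log_alpha -= lr * grad
    return log_alpha


# The policy entropy is -mean_log_pi.
# Entropy (-3.0) below the target (-1.0): log_alpha keeps increasing.
below = simulate_log_alpha(mean_log_pi=3.0, target_entropy=-1.0)
# Entropy (-0.5) above the target (-1.0): log_alpha keeps decreasing.
above = simulate_log_alpha(mean_log_pi=0.5, target_entropy=-1.0)

print(math.exp(below), math.exp(above))
```

In real training the policy reacts to alpha, which normally pulls the entropy back toward the target. If the policy cannot reach the target entropy, the first case never stops and alpha grows without bound.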


redknightlois commented on June 20, 2024

You are right, that is not an underflow... that is divergence (it's a +)...
I didn't set it up explicitly, so that would be: self.target_entropy = -np.prod((1,)).item()
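For reference, a small sketch of what that default evaluates to. It uses `math.prod` from the standard library in place of `np.prod` so it runs without NumPy, and `heuristic_target_entropy` is a hypothetical helper name used only for illustration.

```python
import math


def heuristic_target_entropy(action_shape):
    """Negative number of action dimensions, mirroring -np.prod(shape).item()."""
    return -float(math.prod(action_shape))


one_dim = heuristic_target_entropy((1,))  # the shape quoted above
six_dim = heuristic_target_entropy((6,))  # e.g. a 6-dimensional action space

print(one_dim, six_dim)
```

So a one-dimensional action space gives a target entropy of -1, a value that only makes sense for the differential entropy of a continuous distribution.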


vitchyr commented on June 20, 2024

Are you using discrete actions? That heuristic wouldn't work in that case.


redknightlois commented on June 20, 2024

It behaves like discrete, yes. I didn't change the policy to optimize for a softmax because I was not able to figure out how to derive the temperature-based sampling/exploration from the equations. Everybody says it is easy, but no one shows how to do so :D (e.g. openai/spinningup#22)

So I hacked it instead, making the environment interpret the continuous actions as discrete signals. It was bound to have some side effect, and now that you noticed it is actually diverging, it makes sense. For a typical 3 outputs (softmax(3)), what target entropy would you suggest trying?


vitchyr commented on June 20, 2024

For discrete actions, you should choose a positive number that's less than log(# of actions).

To compute the entropy, you should look up the definition of entropy. For discrete actions it's the negative sum of p log(p), i.e. -sum(p * log(p)).
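A short sketch of that definition for a 3-action case, using the natural log (nats). The probability vectors are made-up examples, and `entropy` is a hypothetical helper, not part of rlkit.

```python
import math


def entropy(probs):
    """Shannon entropy in nats: H(p) = -sum_i p_i * log(p_i), with 0 * log(0) taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


n_actions = 3
max_entropy = math.log(n_actions)         # upper bound, reached by the uniform policy

uniform = entropy([1 / 3, 1 / 3, 1 / 3])  # equals max_entropy
peaked = entropy([0.9, 0.05, 0.05])       # strictly between 0 and max_entropy
deterministic = entropy([1.0, 0.0, 0.0])  # zero: no randomness left

print(max_entropy, uniform, peaked, deterministic)
```

Any target entropy should therefore sit strictly between 0 and `math.log(n_actions)`. A target above that bound can never be met, and a target near 0 asks for an almost deterministic policy.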


redknightlois commented on June 20, 2024

First time with an entropy-based algorithm, so I'm clueless on that... Any accessible writeup that you know of would be great to read.

Would you go with close to log(3) (~1.099 with the natural log; ~0.477 is the base-10 value) or closer to zero instead?


redknightlois commented on June 20, 2024

OK, now that I changed the target entropy to 0.35 I don't see the divergent behavior, but what I do see is a collapse in the deterministic policy's results. The strange thing is that if I restart the process and load the last policy, I get behavior similar to what I saw in the exploration phase of the epoch. Sounds like a bug in the evaluation part, could that be?


vitchyr commented on June 20, 2024

The evaluation code and exploration code are the same. They just use the DataCollector. Note that if you're loading up the policy, you might be loading the evaluation policy, which is deterministic. It sounds like the SAC loss issue has been resolved, so I'm closing this issue.
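To illustrate how a deterministic policy can behave differently from its stochastic counterpart, here is a minimal sketch under stated assumptions: a one-dimensional tanh-squashed Gaussian policy, and an environment that bins the continuous action into 3 discrete choices. Both `select_action` and `to_discrete` are hypothetical stand-ins, not rlkit code or the environment from this thread.

```python
import math
import random


def select_action(mean, log_std, deterministic, rng=random):
    """Tanh-squashed Gaussian action for a 1-D action space."""
    if deterministic:
        pre_tanh = mean  # evaluation: use the mean, no sampling
    else:
        pre_tanh = rng.gauss(mean, math.exp(log_std))  # exploration: sample
    return math.tanh(pre_tanh)


def to_discrete(action, n_actions=3):
    """Hypothetical env-side binning of an action in [-1, 1] into n_actions bins."""
    idx = int((action + 1.0) / 2.0 * n_actions)
    return min(idx, n_actions - 1)  # guard the action == 1.0 edge case


rng = random.Random(0)
eval_choices = {to_discrete(select_action(0.1, 0.0, True)) for _ in range(200)}
explore_choices = {to_discrete(select_action(0.1, 0.0, False, rng)) for _ in range(200)}

print(eval_choices, explore_choices)
```

With the same parameters, the deterministic policy always lands in a single bin while the sampled one visits all three. A loaded policy can therefore look collapsed or exploratory depending only on which variant was saved and loaded.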

