Comments (8)

vitchyr commented on June 20, 2024

A bit hard to figure out based on just this. What did you set as the target entropy? An alpha of 5.56742e+08 seems rather large.

from rlkit.

redknightlois commented on June 20, 2024

You're right, that is not an underflow... it's divergence (the exponent is positive, e+08)...
I didn't set the target entropy explicitly, so it uses the default heuristic, which for my action space is: self.target_entropy = -np.prod((1,)).item(), i.e. -1.
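For context on how alpha can run away like this: assuming the temperature loss has the form used by common SAC implementations, -log_alpha * (log_pi + target_entropy) averaged over a batch, the gradient with respect to log_alpha is simply (policy entropy - target entropy). The sketch below is a hypothetical, dependency-free simulation of that update with plain gradient descent and a fixed policy entropy (real trainers use an optimizer such as Adam and sampled log-probs); `simulate_alpha` and its parameters are illustrative names, not rlkit API. It shows that if the policy's entropy stays below the target, log_alpha grows linearly and alpha grows exponentially.

```python
import math


def simulate_alpha(policy_entropy, target_entropy, lr=3e-4, steps=50_000, log_alpha=0.0):
    """Hypothetical simulation of SAC's temperature (alpha) update.

    Assumes the loss  L(log_alpha) = -log_alpha * (mean_log_pi + target_entropy),
    with policy_entropy = -mean_log_pi held fixed. Its gradient is
    dL/dlog_alpha = -(mean_log_pi + target_entropy) = policy_entropy - target_entropy.
    """
    mean_log_pi = -policy_entropy
    for _ in range(steps):
        grad = -(mean_log_pi + target_entropy)
        log_alpha -= lr * grad  # plain gradient descent step (real trainers typically use Adam)
    return math.exp(log_alpha)


# Entropy stuck 1 nat below the target: log_alpha climbs by lr every step.
print(simulate_alpha(policy_entropy=-2.0, target_entropy=-1.0))
```

For scale, an alpha of 5.56742e+08 corresponds to log_alpha of about 20.1, which fits a sustained entropy shortfall better than a numerical underflow.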

vitchyr commented on June 20, 2024

Are you using discrete actions? That heuristic wouldn't work in that case.

redknightlois commented on June 20, 2024

Yes, it behaves like a discrete action space. I didn't change the policy to optimize over a softmax because I couldn't figure out how to derive the temperature-based sampling/exploration from the equations. Everybody says it is easy, but no one shows how to do it :D (e.g. openai/spinningup#22)

So I hacked it instead, making the environment interpret the continuous actions as discrete signals. It was bound to have some side effects, and now that you've noticed it is actually diverging, it makes sense. For a typical three-action case (a softmax over 3 outputs), what target entropy would you suggest trying?
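On the temperature-based sampling question: in maximum-entropy RL, for a fixed soft Q-function, the entropy-regularized optimal policy over a discrete action set is the Boltzmann (softmax) distribution pi(a|s) ∝ exp(Q(s,a)/alpha), so alpha plays the role of a softmax temperature. The following is a minimal sketch under that assumption, sampling directly from given Q-values; the function names are hypothetical, and discrete SAC variants usually train a separate policy network that outputs softmax logits rather than sampling from Q directly.

```python
import math
import random


def boltzmann_probs(q_values, alpha):
    """Softmax of Q/alpha, i.e. pi(a|s) proportional to exp(Q(s,a) / alpha)."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(q_values, alpha, rng=random):
    """Exploration: draw one action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q = [1.0, 2.0, 0.5]  # made-up Q-values for three actions
print(boltzmann_probs(q, alpha=10.0))  # high temperature: close to uniform
print(boltzmann_probs(q, alpha=0.1))   # low temperature: nearly all mass on action 1
print(sample_action(q, alpha=1.0, rng=random.Random(0)))
```

A large alpha spreads probability across all actions (more exploration), while alpha near zero approaches a greedy argmax over Q.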

vitchyr commented on June 20, 2024

For discrete actions, you should choose a positive number that's less than log(# of actions).

To compute the entropy, look up the definition of entropy. For discrete actions it's the negative sum of p log(p): H = -Σ_a p(a) log p(a).
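To make the definition concrete, the sketch below computes H(p) = -Σ_a p(a) ln p(a) in nats (natural log, matching the log-probabilities SAC works with) for a three-action policy. The maximum, reached by the uniform distribution, is ln(3) ≈ 1.099 nats, and a deterministic policy has entropy 0, so a discrete target entropy lies between those two. The `fraction` value at the end is an illustrative assumption, not a recommendation from this thread.

```python
import math


def discrete_entropy(probs):
    """Shannon entropy in nats: H(p) = -sum_a p(a) * ln p(a), with 0 * ln 0 treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0)


n_actions = 3
max_entropy = math.log(n_actions)  # entropy of the uniform policy, about 1.0986 nats

print(discrete_entropy([1 / 3, 1 / 3, 1 / 3]))  # approximately max_entropy
print(discrete_entropy([0.8, 0.1, 0.1]))        # somewhere in between
print(discrete_entropy([1.0, 0.0, 0.0]))        # deterministic policy

# Hypothetical choice: aim for a fixed fraction of the maximum entropy.
fraction = 0.3
target_entropy = fraction * max_entropy
assert 0.0 < target_entropy < max_entropy
```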

redknightlois commented on June 20, 2024

This is my first time with an entropy-based algorithm, so I'm clueless on that... Any accessible write-up you know of would be great to read.

Would you go close to ln(3) (~1.099) or closer to zero instead?

redknightlois commented on June 20, 2024

OK, now that I've changed the target entropy to 0.35 I don't see the divergent behavior, but I do see a collapse in the deterministic policy's results. The strange thing is that if I restart the process and load the last policy, I get behaviors similar to those seen in the exploration phase of the epoch. Could that be a bug in the evaluation code?

vitchyr commented on June 20, 2024

The evaluation code and exploration code are the same. They just use the DataCollector. Note that if you're loading up the policy, you might be loading the evaluation policy, which is deterministic. It sounds like the SAC loss issue has been resolved, so I'm closing this issue.
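To illustrate the point about the deterministic evaluation policy: for a tanh-Gaussian SAC policy, exploration samples tanh(mu + sigma * eps), while a deterministic version typically returns tanh(mu). Combined with an environment that bins a continuous action into discrete choices, as in the hack described above, the deterministic policy could keep landing in the same bin, which might look like a collapse. This is a hypothetical sketch with made-up function names and a made-up binning scheme, not rlkit code.

```python
import math
import random


def stochastic_action(mu, sigma, rng=random):
    """Exploration: sample from a tanh-squashed Gaussian."""
    return math.tanh(rng.gauss(mu, sigma))


def deterministic_action(mu):
    """Evaluation: drop the noise and squash the mean."""
    return math.tanh(mu)


def to_discrete(action, n_actions=3):
    """Hypothetical env-side mapping of an action in [-1, 1] to one of n equal-width bins."""
    idx = int((action + 1.0) / 2.0 * n_actions)
    return min(idx, n_actions - 1)  # action == 1.0 would otherwise map to n_actions


rng = random.Random(0)
mu, sigma = 0.0, 1.0  # made-up policy outputs for a single state
explore_bins = {to_discrete(stochastic_action(mu, sigma, rng)) for _ in range(1000)}
eval_bins = {to_discrete(deterministic_action(mu)) for _ in range(1000)}
print(explore_bins, eval_bins)
```

The stochastic policy visits every bin, while the deterministic one always picks the bin containing tanh(mu).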
