Comments (3)

lowrollr commented on June 19, 2024

I was able to reproduce, then correct the behavior you are describing.

First, in 16d2f3f, you moved the legal actions assignment to before env.step in mcts.py. As a result, legal actions are calculated from the previous state of the environment rather than the current state, which results in invalid actions being taken. You'd need to revert this change.
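To illustrate the ordering problem, here is a minimal sketch using a hypothetical toy environment (`CountdownEnv` and its methods are made up for this example and are not part of turbozero). It only shows why a legal-action list captured before a step can contain actions that are illegal afterwards:

```python
class CountdownEnv:
    """Hypothetical toy environment (not turbozero code).

    Action `a` subtracts `a` from the counter and is legal only if the
    counter does not drop below zero.
    """

    def __init__(self, start):
        self.counter = start

    def legal_actions(self):
        return [a for a in (1, 2, 3) if a <= self.counter]

    def step(self, action):
        self.counter -= action


env = CountdownEnv(start=3)

stale_legal = env.legal_actions()  # computed BEFORE the step
env.step(2)                        # counter is now 1
fresh_legal = env.legal_actions()  # computed AFTER the step

# Actions the stale list would still allow but the current state forbids.
invalid_if_stale = [a for a in stale_legal if a not in fresh_legal]
print(stale_legal, fresh_legal, invalid_if_stale)
```

Any search step that samples from `stale_legal` here could pick an action the current state no longer permits.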

This is not the only source of invalid actions, however. I also realized that my previously implemented environments assume that all rewards/evaluations are positive. Your custom environment can assign negative rewards, which breaks some of the action choice logic that relies on that assumption. It should be straightforward to allow for negative rewards/evaluations; I'll link a commit to this issue that resolves this problem.
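One plausible way such an assumption can surface is sketched below. This is an assumption for illustration, not a description of the actual turbozero logic. If illegal actions are masked by zeroing their values, an argmax is only safe while every legal value is positive. Masking with negative infinity does not depend on the sign of the values. Both function names are hypothetical:

```python
def pick_by_zero_mask(values, legal):
    # Illegal entries become 0.0; only safe if all legal values are > 0.
    masked = [v if ok else 0.0 for v, ok in zip(values, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


def pick_by_neg_inf_mask(values, legal):
    # Illegal entries become -inf, so they can never win the argmax.
    masked = [v if ok else float("-inf") for v, ok in zip(values, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


legal = [True, True, False]

positive_values = [0.5, 0.2, 0.9]
negative_values = [-0.5, -0.2, -0.9]

# With positive values both strategies pick a legal action.
ok_zero = pick_by_zero_mask(positive_values, legal)
ok_inf = pick_by_neg_inf_mask(positive_values, legal)

# With negative values the zero mask makes the illegal action look best.
bad_zero = pick_by_zero_mask(negative_values, legal)
good_inf = pick_by_neg_inf_mask(negative_values, legal)
print(ok_zero, ok_inf, bad_zero, good_inf)
```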

Finally, I believe the inclusion of label smoothing in the cross entropy loss causes policy loss values to become extremely large. The policy logits corresponding to invalid actions are set to very large negative numbers rather than zero, so once smoothing gives those actions a nonzero target probability, each of them contributes a large loss term. I will explore ways to allow for label smoothing as well, but the current implementation does not allow for this.
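The effect can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: the mask value `-1e9`, the example logits and targets, and the standard smoothing formula `(1 - eps) * target + eps / n` are chosen for illustration and are not taken from the turbozero code.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target, label_smoothing=0.0):
    n = len(logits)
    log_p = log_softmax(logits)
    # Uniform label smoothing: every class receives at least eps / n mass.
    smoothed = [(1.0 - label_smoothing) * t + label_smoothing / n for t in target]
    return -sum(q * lp for q, lp in zip(smoothed, log_p) if q > 0.0)


MASK = -1e9  # assumed stand-in for "very large negative" logits on invalid actions

logits = [2.0, 1.0, MASK, MASK]   # last two actions are invalid
target = [0.7, 0.3, 0.0, 0.0]     # search policy puts no mass on invalid actions

plain_loss = cross_entropy(logits, target)
smoothed_loss = cross_entropy(logits, target, label_smoothing=0.1)
print(plain_loss, smoothed_loss)
```

Without smoothing, the invalid actions have zero target mass and contribute nothing. With smoothing, each invalid action receives `eps / n` target mass multiplied by a log-probability of roughly `-1e9`, which dominates the loss.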

When I removed label smoothing and addressed the other two issues above (and also lowered the learning rate in your provided config file), I saw reasonable loss values and did not detect any invalid actions.

On the topic of debugging, I use the VSCode debugger within a Jupyter notebook and set breakpoints. For this issue, I set up some breakpoints to detect when an illegal action was chosen.

I think ideally there should be some built-in assertions to detect this exact situation, since this is how issues like these usually manifest. Will look into that more as well.

Thank you for your patience and for pointing out this issue! Will have these problems resolved within the next day or two.

from turbozero.

lowrollr commented on June 19, 2024

87fd4d8 allows for negative rewards/evaluations.

I'll keep this open until I address label smoothing as well, and perhaps debug asserts for detecting invalid actions in MCTS. Let me know if you run into any other issues in the meantime!


bubble-07 commented on June 19, 2024

Thanks for the attention to this issue - cherry-picking my environment on top of the most recent change-sets completely resolves the issue with negative reward values resulting in invalid actions! I'm also seeing the training dynamics stabilize with a lower learning rate, so I guess I'm off to the races.

My apologies about the bit where I swapped around logic with the legal actions assignment - I only made that change out of a "throw-spaghetti-and-see-if-it-sticks" approach to debugging as a last resort, and I'm sorry if it complicated the investigation at all.

Adding built-in assertions to check the integrity of invariants about MCTS could be useful - maybe having them on only for debug=true configs to ensure that perf doesn't take a hit?
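A minimal sketch of what such an opt-in check might look like, assuming a hypothetical helper `check_action_is_legal` and a `debug` flag passed in from the config (neither exists in turbozero as far as this thread shows):

```python
def check_action_is_legal(action, legal_mask, debug=False):
    """Hypothetical invariant check; does nothing unless debug is enabled."""
    if not debug:
        return
    assert 0 <= action < len(legal_mask), f"action {action} out of range"
    assert legal_mask[action], f"illegal action {action} selected"


legal_mask = [True, True, False]

# With debug disabled, the check is skipped entirely.
check_action_is_legal(2, legal_mask, debug=False)

# With debug enabled, an illegal action fails loudly.
try:
    check_action_is_legal(2, legal_mask, debug=True)
    caught = False
except AssertionError:
    caught = True
print(caught)
```

Note that a plain Python `assert` cannot inspect traced values inside jitted JAX code, so a real implementation would likely need a mechanism such as `jax.experimental.checkify` instead.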

I'm content with the resolution here, but I won't close the issue now, given that you have some other things that you want to tackle before declaring this one closed.

Thanks again!

