Hey there - I've drafted an implementation of a custom environment for the (approximat


AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero about turbozero HOT 3 OPEN

lowrollr commented on July 2, 2024

AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero

from turbozero.

Comments (3)

lowrollr commented on July 2, 2024

I was able to reproduce, then correct the behavior you are describing.

First, in 16d2f3f, you move the legal actions assignment to before env.step in mcts.py, meaning that legal actions are calculated using the previous state of the environment rather than the current state, which results in invalid actions being taken. You'd need to revert this change.
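To illustrate why the ordering matters, here is a minimal sketch using a hypothetical toy environment (`CountdownEnv`, `rollout`, and `pick_last_legal` are made-up names, not turbozero code). When the mask is read before the step, it describes the state we just left, so the next action is chosen against stale information.

```python
class CountdownEnv:
    """Hypothetical toy environment: subtract 1 or 2 from a counter, never below 0."""

    def __init__(self, start):
        self.counter = start

    def legal_actions(self):
        # Action 0 subtracts 1, action 1 subtracts 2.
        return [self.counter >= 1, self.counter >= 2]

    def step(self, action):
        self.counter -= action + 1


def pick_last_legal(mask):
    # Deterministic stand-in for a policy: highest-indexed action the mask allows.
    return max(i for i, ok in enumerate(mask) if ok)


def rollout(env, choose, mask_before_step):
    """Return True if every chosen action was legal in the state it was applied to."""
    mask = env.legal_actions()
    all_legal = True
    while any(mask):
        action = choose(mask)
        all_legal = all_legal and env.legal_actions()[action]
        if mask_before_step:
            mask = env.legal_actions()  # stale: describes the pre-step state
            env.step(action)
        else:
            env.step(action)
            mask = env.legal_actions()  # fresh: describes the state we act in next
    return all_legal


correct_order = rollout(CountdownEnv(3), pick_last_legal, mask_before_step=False)
swapped_order = rollout(CountdownEnv(3), pick_last_legal, mask_before_step=True)
```

With the mask computed after the step, every action is legal; with the order swapped, the second action is chosen from a mask that no longer matches the environment.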

This is not the only source of invalid actions, however. I also realized that my previously implemented environments assume that all rewards/evaluations are positive. Your custom environment can assign negative rewards, which breaks some of the action choice logic that relies on that assumption. It should be straightforward to allow for negative rewards/evaluations; I'll link a commit to this issue that resolves this problem.
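As an illustration only (this is not the actual turbozero selection code), one common way a positivity assumption breaks masking is multiplying scores by a 0/1 legality mask before taking the argmax. The function names and numbers below are hypothetical.

```python
import math


def select_by_multiplying(scores, legal):
    # Zeroes out illegal actions. Only safe if every legal score is > 0.
    masked = [s * float(ok) for s, ok in zip(scores, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


def select_with_neg_inf(scores, legal):
    # Replaces illegal scores with -inf, so the sign of legal scores is irrelevant.
    masked = [s if ok else -math.inf for s, ok in zip(scores, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


legal = [True, True, False]
positive_scores = [0.2, 0.7, 0.9]
negative_scores = [-0.8, -0.3, -0.1]

ok_when_positive = select_by_multiplying(positive_scores, legal)
broken_when_negative = select_by_multiplying(negative_scores, legal)
fixed_when_negative = select_with_neg_inf(negative_scores, legal)
```

With all-negative scores, the zeroed-out illegal action becomes the maximum, so the multiplicative mask picks action 2 even though it is illegal, while the `-inf` mask still picks action 1.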

Finally, I believe the inclusion of label smoothing in the cross-entropy loss causes policy loss values to become extremely large. The policy logits corresponding to invalid actions are set to very large negative numbers rather than zero, and label smoothing gives those actions a nonzero target probability, so each of these logits contributes a large loss term. I will explore ways to allow for label smoothing as well, but the current implementation does not support it.
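A small numeric sketch of the effect, in plain Python rather than the training code. The mask value of `-1e9`, the helper names, and the example numbers are all assumptions for illustration. `smooth_over_legal` is one possible workaround (spreading the smoothing mass over legal actions only), not something implemented in turbozero.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    # Skip zero-probability targets so masked logits contribute nothing there.
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)) if t > 0)


def smooth_uniform(target, eps):
    # Standard label smoothing: mix the target with a uniform distribution.
    n = len(target)
    return [(1 - eps) * t + eps / n for t in target]


def smooth_over_legal(target, legal, eps):
    # Assumed alternative: spread the smoothing mass over legal actions only.
    n_legal = sum(legal)
    return [(1 - eps) * t + (eps / n_legal if ok else 0.0)
            for t, ok in zip(target, legal)]


MASK_VALUE = -1e9  # assumed stand-in for "very large negative number"
legal = [True, True, False, False]
raw_logits = [1.0, 0.5, 0.3, -0.2]
logits = [x if ok else MASK_VALUE for x, ok in zip(raw_logits, legal)]
target = [0.75, 0.25, 0.0, 0.0]  # e.g. normalized MCTS visit counts

loss_plain = cross_entropy(logits, target)
loss_uniform = cross_entropy(logits, smooth_uniform(target, 0.1))
loss_legal_only = cross_entropy(logits, smooth_over_legal(target, legal, 0.1))
```

Here `loss_plain` and `loss_legal_only` stay below 1, while `loss_uniform` is on the order of tens of millions, because each masked action pairs a small target probability with a log-probability near `-1e9`.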

When I removed label smoothing and addressed the other two issues above (and also lowered the learning rate in your provided config file), I saw reasonable loss values and did not detect any invalid actions.

On the topic of debugging, I use the VSCode debugger within a Jupyter notebook and set breakpoints. For this issue, I set up some breakpoints to detect when an illegal action was chosen.

I think ideally there should be some built-in assertions to detect this exact situation, as this is how any issue usually manifests itself. Will look into that more as well.

Thank you for your patience and for pointing out this issue! Will have these problems resolved within the next day or two.


lowrollr commented on July 2, 2024

87fd4d8 allows for negative rewards/evaluations.

I'll keep this open until I address label smoothing as well, and perhaps debug asserts for detecting invalid actions in MCTS. Let me know if you run into any other issues in the meantime!


bubble-07 commented on July 2, 2024

Thanks for the attention to this issue - cherry-picking my environment on top of the most recent change-sets completely resolves the issue with negative reward values resulting in invalid actions! I'm also seeing the training dynamics stabilize with a lower learning rate, so I guess I'm off to the races.

My apologies about the bit where I swapped around logic with the legal actions assignment - I only made that change out of a "throw-spaghetti-and-see-if-it-sticks" approach to debugging as a last resort, and I'm sorry if it complicated the investigation at all.

Adding built-in assertions to check the integrity of invariants about MCTS could be useful - maybe having them on only for debug=true configs to ensure that perf doesn't take a hit?
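A rough sketch of what such a gated check could look like. `visit_probabilities` and its `debug` parameter are hypothetical names for illustration, not part of turbozero's API.

```python
def visit_probabilities(visit_counts, legal, debug=False):
    """Normalize MCTS visit counts; optionally verify that no illegal action was visited."""
    total = sum(visit_counts)
    if total == 0:
        raise ValueError("no visits recorded")
    probs = [count / total for count in visit_counts]
    if debug:
        # Invariant: an illegal action must never receive visit probability.
        offenders = [i for i, (p, ok) in enumerate(zip(probs, legal))
                     if p > 0 and not ok]
        if offenders:
            # Raised explicitly so the check still runs under `python -O`.
            raise AssertionError(f"illegal actions with non-zero visits: {offenders}")
    return probs


clean = visit_probabilities([3, 1, 0], [True, True, False], debug=True)

try:
    visit_probabilities([3, 0, 1], [True, True, False], debug=True)
    caught = False
except AssertionError:
    caught = True

# With debug off, the same bad input passes through unchecked.
unchecked = visit_probabilities([3, 0, 1], [True, True, False], debug=False)
```

The check only costs anything when `debug` is set, so a default training config would not pay for it.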

I'm content with the resolution here, but I won't close the issue now, given that you have some other things that you want to tackle before declaring this one closed.

Thanks again!

