Comments (3)
I was able to reproduce, then correct the behavior you are describing.
First, in 16d2f3f, you moved the legal actions assignment to before env.step in mcts.py, meaning that legal actions are calculated from the previous state of the environment rather than the current state, which results in invalid actions being taken. You'd need to revert this change.
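A minimal sketch of the ordering problem, using a hypothetical `ToyEnv` (an illustration only, not turbozero's actual environment API): a mask computed before the step still describes the old state, so it can mark an already-taken action as legal.

```python
class ToyEnv:
    """Hypothetical environment: each of 3 cells can be played only once."""

    def __init__(self):
        self.board = [0, 0, 0]

    def legal_actions(self):
        # An action is legal only while its cell is still empty.
        return [cell == 0 for cell in self.board]

    def step(self, action):
        self.board[action] = 1


env = ToyEnv()
stale_mask = env.legal_actions()  # computed BEFORE the step: describes the old state
env.step(1)
fresh_mask = env.legal_actions()  # computed AFTER the step: describes the current state

print(stale_mask)  # [True, True, True]  -> action 1 still looks legal
print(fresh_mask)  # [True, False, True] -> action 1 is correctly excluded
```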
This is not the only source of invalid actions, however. I also realized that my previously implemented environments assume that all rewards/evaluations are positive. Your custom environment can assign negative rewards, which breaks some of the action choice logic that relies on that assumption. It should be straightforward to allow for negative rewards/evaluations; I'll link a commit to this issue that resolves the problem.
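As an illustration of how a positivity assumption can fail, here is a sketch of one possible failure mode. This is an assumption for the sake of example, not necessarily the logic in turbozero: if illegal actions are masked by multiplying their scores by zero, the zero beats every legal score once scores go negative. Both function names are hypothetical.

```python
def pick_by_multiplying(scores, legal):
    # Illegal actions are zeroed out; only safe if every legal score is > 0.
    masked = [s * float(ok) for s, ok in zip(scores, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


def pick_with_neg_inf(scores, legal):
    # Illegal actions get -inf, so they lose regardless of the sign of the scores.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, legal)]
    return max(range(len(masked)), key=masked.__getitem__)


legal = [True, False, True]

print(pick_by_multiplying([0.5, 0.2, 0.9], legal))     # 2 (legal)
print(pick_by_multiplying([-0.5, -0.2, -0.9], legal))  # 1 (illegal: zero beats negatives)
print(pick_with_neg_inf([-0.5, -0.2, -0.9], legal))    # 0 (legal)
```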
Finally, I believe the inclusion of label smoothing in cross entropy loss causes policy loss values to become extremely large. The policy logits corresponding to invalid actions are set to very large negative numbers rather than zero, and label smoothing gives those actions a nonzero target weight, so each of these logits contributes a large loss term. I will explore ways to allow for label smoothing as well, but the current implementation does not allow for it.
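The effect can be reproduced with a small pure-Python sketch. It assumes the common uniform-smoothing formulation, where the target becomes `(1 - eps) * one_hot + eps / num_classes`, and a hypothetical mask value of `-1e9` for the invalid action; the actual values used in the project may differ.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target_index, label_smoothing=0.0):
    n = len(logits)
    log_probs = log_softmax(logits)
    # Uniform label smoothing: every class, including masked ones, gets eps / n.
    targets = [label_smoothing / n] * n
    targets[target_index] += 1.0 - label_smoothing
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# The third action is invalid, so its logit was masked to a large negative value.
logits = [2.0, 1.0, -1e9]

plain = cross_entropy(logits, target_index=0)
smoothed = cross_entropy(logits, target_index=0, label_smoothing=0.1)

print(plain)     # about 0.31
print(smoothed)  # about 3.3e7, dominated by the masked logit
```

Without smoothing, the masked logit has zero target weight and contributes nothing; with smoothing, its small target weight is multiplied by a log-probability near `-1e9`.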
When I removed label smoothing and addressed the other two issues above (and also lowered the learning rate in your provided config file), I saw reasonable loss values and did not detect any invalid actions.
On the topic of debugging, I use the VSCode debugger within a Jupyter notebook. For this issue, I set breakpoints to detect when an illegal action was chosen.
I think ideally there should be some built-in assertions to detect this exact situation, as this is how any issue usually manifests itself. Will look into that more as well.
Thank you for your patience and for pointing out this issue! Will have these problems resolved within the next day or two.
from turbozero.
87fd4d8 allows for negative rewards/evaluations.
I'll keep this open until I address label smoothing as well, and perhaps debug asserts for detecting invalid actions in MCTS. Let me know if you run into any other issues in the meantime!
Thanks for the attention to this issue - cherry-picking my environment on top of the most recent change-sets completely resolves the issue with negative reward values resulting in invalid actions! I'm also seeing training dynamics stabilize with a lower learning rate, so I guess I'm off to the races.
My apologies about the bit where I swapped around logic with the legal actions assignment - I only made that change out of a "throw-spaghetti-and-see-if-it-sticks" approach to debugging as a last resort, and I'm sorry if it complicated the investigation at all.
Adding built-in assertions to check the integrity of invariants about MCTS could be useful - maybe having them on only for debug=true configs to ensure that perf doesn't take a hit?
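A rough sketch of what such a gated check might look like. The `check_actions_legal` helper and its `debug` parameter are hypothetical names chosen for illustration, not part of turbozero:

```python
def check_actions_legal(actions, legal_masks, debug=False):
    """Hypothetical helper: assert each chosen action is legal when debug is on."""
    if not debug:
        return  # skip the check entirely so non-debug runs pay no cost
    for env_index, (action, mask) in enumerate(zip(actions, legal_masks)):
        assert mask[action], f"env {env_index}: illegal action {action} selected"


# One legal-action mask per parallel environment.
legal_masks = [[True, False, True], [False, True, True]]

check_actions_legal([0, 2], legal_masks, debug=True)   # all legal: passes silently
check_actions_legal([1, 0], legal_masks, debug=False)  # illegal, but check is disabled
```

With `debug=True`, the second call would raise an `AssertionError` naming the first offending environment and action, which is the same signal a manually placed breakpoint would catch.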
I'm content with the resolution here, but I won't close the issue now, given that you have some other things that you want to tackle before declaring this one closed.
Thanks again!