Comments (7)

lowrollr commented on July 19, 2024

This issue should be resolved now, apologies for the inconvenience. 7ff797d.

If you could describe your use-case I'd be happy to provide additional assistance! 😀

from turbozero.

lowrollr commented on July 19, 2024

Following up here, I've been able to reproduce the issue again even after 7ff797d, I will investigate this more.

bubble-07 commented on July 19, 2024

Thanks for the prompt response - I'm glad that you were able to reproduce the issue.

I don't have a niche use-case for the LazyZero algorithm implemented here, TBH (I was just trying to run it as sample code to get a better intuition for how the training dynamics work on my machine). More generally, though, I'm going to try to apply this code to a custom environment involving approximating solutions to the matrix semigroup reachability problem.

Previously, I had thought that I could take a short-cut to generating training data for that problem (essentially by constructing a bunch of toy instances with exact solutions), but the performance using the synthetic training data was abysmal, probably due to extreme discontinuity in the loss surface. I spent a lot of time trying to make that approach work without really looking at the larger picture, which brings me to today, since my original intent was to use AlphaZero to fine-tune the model after that initial "toy data" training step. I want to see whether, given enough time, any model can deliver a good approximation to the matrix semigroup reachability problem, which would let me decide whether to pull the plug on this particular avenue of research or to press on.

lowrollr commented on July 19, 2024

Hey there - I'm very grateful for this project, since it should speed up experimentation on one of my side-projects!

I'm trying to run the LazyZero training examples, but this seems to fail due to the probability distribution over actions [at some level of some game tree] being degenerate upon initialization.

Command (and output):

(turbozero-py3.9) bubble-07@LambdaY-Desktop:~/Programmingstuff/turbozero$ python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:59: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask0 = torch.tensor([[[[-1e5, 1]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:60: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask1 = torch.tensor([[[[1], [-1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:61: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask2 = torch.tensor([[[[1, -1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask3 = torch.tensor([[[[-1e5], [1]]]], dtype=dtype, device=device, requires_grad=False)
Populating Replay Memory...:   0%|                              | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/bubble-07/Programmingstuff/turbozero/turbozero.py", line 226, in <module>
    trainer.training_loop()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 191, in training_loop
    self.fill_replay_memory()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 173, in fill_replay_memory
    finished_episodes, _ = self.collector.collect()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 24, in collect
    _, terminated = self.collect_step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 40, in collect_step
    initial_states, probs, _, actions, terminated = self.evaluator.step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/evaluator.py", line 43, in step
    probs, values = self.evaluate(*args)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazyzero.py", line 33, in evaluate
    return super().evaluate(evaluation_fn)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 57, in evaluate
    return self.explore_for_iters(evaluation_fn, self.policy_rollouts, self.rollout_depth)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 100, in explore_for_iters
    actions = self.choose_action_with_puct(
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 71, in choose_action_with_puct
    return rand_argmax_2d(legal_puct_scores).flatten()
  File "/home/bubble-07/Programmingstuff/turbozero/core/utils/utils.py", line 6, in rand_argmax_2d
    return torch.multinomial(inds.float(), 1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Populating Replay Memory...:   0%|

For reference, it does look like I am able to train using AlphaZero on Othello using the provided configs.
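The `RuntimeError` at the bottom of the traceback can be reproduced in isolation: `torch.multinomial` raises it whenever the sampling weights sum to zero or less. One plausible way a max-mask helper like `rand_argmax_2d` could hit this (an assumption about the failure mode, not a confirmed diagnosis) is NaN scores, since NaN compares unequal to everything and so leaves the "equals the row max" mask all-zero:

```python
import torch

# torch.multinomial raises the error seen in the traceback whenever the
# sampling weights sum to <= 0:
try:
    torch.multinomial(torch.zeros(4), 1)
except RuntimeError as err:
    print(err)  # invalid multinomial distribution (sum of probabilities <= 0)

# Hypothetical path to an all-zero weight vector: NaN scores. NaN compares
# False against everything, so a "positions equal to the row max" mask
# (as computed inside a rand_argmax_2d-style helper) comes out all False.
scores = torch.tensor([[float("nan"), float("nan")]])
mask = scores == scores.max(dim=1, keepdim=True).values
print(mask.float().sum())  # tensor(0.)
```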

This issue has been fixed with 3cd371a.

lowrollr commented on July 19, 2024

This is an interesting use-case! I don't have any domain knowledge in that space, but from some brief research it does seem like you could pose that problem as a Markov Decision Process, so AlphaZero could definitely be something you try.

As I understand things, in this problem we are choosing matrix operations to apply to an initial matrix in order to try to produce a target matrix. (Is this the right idea?)
Does this set of matrix operations stay constant throughout the problem? Can you pose the problem in such a way that the quantity of matrix operations applicable in a given step has a fixed upper bound?

I can provide assistance as best I can if you'd like to try implementing an environment for this problem. In this project, AlphaZero has only been rigorously tested on a multi-player environment (Othello), but I believe all the pieces should be there to support a 'single-player' problem such as this -- I'm happy to fill in any gaps that might exist, though.

bubble-07 commented on July 19, 2024

Yep, posing it as a 1-player game is the rough approach - from your repository, I'm mostly cribbing from the 2048 example to get started.

You have roughly the right idea of the problem - the set of matrix operations is always "multiply two matrices that are in my current set of matrices", and the score is a kind of modified distance between a given target matrix and the closest matrix in the active set, discounted by the number of matrix multiplications that were taken. A move consists of either multiplying two matrices in the current set, or declaring that the solution is "finished" and claiming a reward discounted by the number of steps taken.

Since I'm thinking about training only on matrices of a fixed (square) shape and a fixed range of starting-set sizes, and with a small upper bound on the number of moves, it should be somewhat straightforward to fit it into the environment definitions of this project.
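The game described above could be sketched roughly as follows. Every name here is hypothetical (this is not turbozero's actual environment API), and the distance/discount details are placeholders for the "modified distance" mentioned above:

```python
import itertools
import numpy as np

class MatrixReachabilityEnv:
    """Hypothetical sketch of the single-player game described above.

    State: a working set of square matrices plus a fixed target matrix.
    Actions: multiply any ordered pair from the set, or declare 'finish'.
    Reward: negative distance from the target to the closest matrix in the
    set, discounted by the number of multiplications taken.
    """

    def __init__(self, start_set, target, max_moves=8, discount=0.95):
        self.matrices = list(start_set)
        self.target = target
        self.max_moves = max_moves
        self.discount = discount
        self.moves = 0

    def legal_actions(self):
        # Every ordered pair of indices, plus the terminal 'finish' action.
        pairs = list(itertools.product(range(len(self.matrices)), repeat=2))
        return pairs + ["finish"]

    def step(self, action):
        """Returns (reward, done)."""
        if action == "finish" or self.moves >= self.max_moves:
            best = min(np.linalg.norm(m - self.target) for m in self.matrices)
            return -best * (self.discount ** self.moves), True
        i, j = action
        self.matrices.append(self.matrices[i] @ self.matrices[j])
        self.moves += 1
        return 0.0, False
```

Note that as written the action space grows with the working set; capping the set size (e.g. replacing a matrix instead of appending) would give the fixed upper bound on actions per step asked about above.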

lowrollr commented on July 19, 2024

This sounds straightforward indeed! Are you typically trying to solve for 2-dimensional matrices, or does the problem extend to any number of dimensions?

I'm planning on writing a doc page on the environment spec at some point soon, since it's a little different from the typical gymnasium spec. That should be helpful to you.
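For readers unfamiliar with the "typical gymnasium spec" mentioned above, it is a per-environment reset/step contract, shown below as a dependency-free sketch (the method signatures follow the `gymnasium.Env` API; turbozero's spec differs, presumably because it steps batches of environments as tensors):

```python
import numpy as np

class DummyEnv:
    """Plain-Python sketch of the gymnasium-style environment contract."""

    def reset(self, seed=None, options=None):
        # Returns (observation, info).
        return np.zeros(4, dtype=np.float32), {}

    def step(self, action):
        # Returns (observation, reward, terminated, truncated, info).
        return np.zeros(4, dtype=np.float32), 0.0, True, False, {}
```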
