turbozero's People

Contributors

lowrollr


turbozero's Issues

LazyZero-based training sample commands fail with "invalid multinomial distribution"

Hey there - I'm very grateful for this project, since it should speed up experimentation on one of my side-projects!

I'm trying to run the LazyZero training examples, but they seem to fail because the probability distribution over actions [at some level of the game tree] is degenerate upon initialization.

Command (and output):

(turbozero-py3.9) bubble-07@LambdaY-Desktop:~/Programmingstuff/turbozero$ python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:59: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask0 = torch.tensor([[[[-1e5, 1]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:60: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask1 = torch.tensor([[[[1], [-1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:61: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask2 = torch.tensor([[[[1, -1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask3 = torch.tensor([[[[-1e5], [1]]]], dtype=dtype, device=device, requires_grad=False)
Populating Replay Memory...:   0%|                              | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/bubble-07/Programmingstuff/turbozero/turbozero.py", line 226, in <module>
    trainer.training_loop()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 191, in training_loop
    self.fill_replay_memory()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 173, in fill_replay_memory
    finished_episodes, _ = self.collector.collect()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 24, in collect
    _, terminated = self.collect_step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 40, in collect_step
    initial_states, probs, _, actions, terminated = self.evaluator.step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/evaluator.py", line 43, in step
    probs, values = self.evaluate(*args)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazyzero.py", line 33, in evaluate
    return super().evaluate(evaluation_fn)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 57, in evaluate
    return self.explore_for_iters(evaluation_fn, self.policy_rollouts, self.rollout_depth)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 100, in explore_for_iters
    actions = self.choose_action_with_puct(
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 71, in choose_action_with_puct
    return rand_argmax_2d(legal_puct_scores).flatten()
  File "/home/bubble-07/Programmingstuff/turbozero/core/utils/utils.py", line 6, in rand_argmax_2d
    return torch.multinomial(inds.float(), 1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Populating Replay Memory...:   0%|

For reference, it does look like I am able to train using AlphaZero on Othello using the provided configs.
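
For additional context, the multinomial failure itself is easy to reproduce in isolation, and my (unverified) guess is that the inds mask built inside rand_argmax_2d ends up all zeros for some row, e.g. when every PUCT score in that row is NaN. A minimal sketch follows; how inds is constructed here is my assumption, not the actual utils.py code.

import torch

# Guess at the failure mode: if a whole row of PUCT scores is NaN, an indicator
# of "entries equal to the row max" is False everywhere (NaN != NaN), so the
# weights handed to torch.multinomial sum to zero.
scores = torch.full((1, 4), float('nan'))
inds = (scores == scores.max(dim=1, keepdim=True).values)
print(inds.float().sum())        # tensor(0.)

# torch.multinomial requires each row to sum to something positive, hence:
try:
    torch.multinomial(inds.float(), 1)
except RuntimeError as e:
    print(e)                     # invalid multinomial distribution (sum of probabilities <= 0)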

speed issue

Hello - mctx-az doesn't have an Issues tab, so I'm asking here.
In my tests, alphazero_policy is much slower than gumbel_muzero_policy; do you know why? muzero_policy is also much slower than gumbel_muzero_policy in my tests. What might cause this?

bug

It seems root_metadata is not used in update_root, and the parameters are mismatched in mcts.py.

line 81
eval_state = self.update_root(eval_state, env_state, root_metadata, params)

line 97

def update_root(self, tree: MCTSTree, root_embedding: chex.ArrayTree,
                params: chex.ArrayTree, **kwargs) -> MCTSTree:
    key, tree = get_rng(tree)
    root_policy_logits, root_value = self.eval_fn(root_embedding, params, key)
    root_policy = jax.nn.softmax(root_policy_logits)
    root_node = tree.at(tree.ROOT_INDEX)
    root_node = self.update_root_node(root_node, root_policy, root_value, root_embedding)
    return set_root(tree, root_node)
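
To spell out the mismatch as I read the quoted lines: the call passes four positional arguments after self, but the signature declares only three positional parameters plus **kwargs, and **kwargs does not absorb extra positional arguments. A small self-contained sketch of the binding problem (the string arguments are placeholders, not the real types), plus one possible fix:

class Demo:
    # Mirrors the quoted signature (line 97): three positional parameters plus **kwargs.
    def update_root(self, tree, root_embedding, params, **kwargs):
        return tree

demo = Demo()

# Mirrors the quoted call (line 81): four positional arguments after self.
# The fourth one has no parameter to bind to; even with only three positionals,
# root_metadata would land where params is expected.
try:
    demo.update_root("eval_state", "env_state", "root_metadata", "params")
except TypeError as e:
    print(e)  # takes 4 positional arguments but 5 were given

# If root_metadata really is unused, one possible fix (my suggestion, not the
# current code) is to drop it at the call site:
demo.update_root("eval_state", "env_state", "params")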

AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero

Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me: the environment consistently ends up assigning non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and a sys.exit() to core/train/trainer.py to track this down in my fork, and I'm able to reproduce the issue with:

python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug

Output

torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00,  8.11it/s]
Collecting self-play episodes...:  25%|██████▎                  | 1/4 [00:00<00:01,  2.03it/s]tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
        [-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
       device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
        [0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)

The first tensor printed is what the neural net yields for the policy logits (after masking out invalid entries), the second tensor is the empirical visit probabilities, and the third is the value of the policy cross-entropy loss for that iteration. While the neural net's policy is correctly masked, the empirical visit probabilities aren't, judging from the second column. I've tried to pick apart where this goes wrong in the MCTS routine, but unfortunately I wasn't able to get far - do you have any handy ways to debug this?
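
For context on the tensor(inf): a plain cross-entropy between the visit probabilities and the masked policy (the exact loss expression used in trainer.py is my assumption) already blows up once a masked action carries nonzero visit mass, because each such term multiplies a log-probability of roughly -3.4e38 and the batch sum overflows float32:

import torch
import torch.nn.functional as F

# First two rows of the masked policy logits printed above: invalid actions are
# filled with roughly float32 min (-3.4028e38).
logits = torch.tensor([[-0.89384, -3.4028e38, -3.4028e38, -3.4028e38, -0.60272],
                       [-0.89384, -3.4028e38, -3.4028e38, -3.4028e38, -0.60272]])

# Matching rows of the empirical visit probabilities: note the nonzero mass
# (0.5455) on the second, masked action.
visits = torch.tensor([[0.2525, 0.5455, 0.0, 0.0, 0.2020],
                       [0.2525, 0.5455, 0.0, 0.0, 0.2020]])

# Each masked log-probability stays a huge finite negative (~ -3.4e38);
# multiplied by 0.5455 and summed over the batch, the total overflows float32.
log_probs = F.log_softmax(logits, dim=-1)
loss = -(visits * log_probs).sum()
print(loss)   # tensor(inf), matching the third tensor above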

Dirilecht instead of dirichlet in mcts

There is a typo on lines 95, 157 and 162: self.dirichlet is spelled self.dirilecht.
Fortunately the typo is repeated consistently, so it does not affect functionality, but I suggest fixing it to avoid possible confusion.

Batch MCTS is needed !!!

The current implementation only batches across multiple games; a single search is not batched. For example, if one search uses 400 simulations, those 400 simulations run one by one rather than as a batch.
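
To illustrate the request with hypothetical helper names (select_leaf, evaluate and backpropagate are placeholders, not the actual turbozero API): today the batching happens across parallel games, while the ask is to also batch leaf evaluations within one search, roughly:

def search_per_game(trees, num_simulations, select_leaf, evaluate, backpropagate):
    # Current behaviour (as I understand it): the network call is batched across
    # parallel games, but each game's simulations run one after another, so a
    # 400-simulation search issues 400 sequential evaluation rounds.
    for _ in range(num_simulations):
        leaves = [select_leaf(tree) for tree in trees]   # one leaf per game
        results = evaluate(leaves)                       # one NN forward pass
        for tree, leaf, result in zip(trees, leaves, results):
            backpropagate(tree, leaf, result)

def search_leaf_batched(tree, num_simulations, leaf_batch_size, select_leaf, evaluate, backpropagate):
    # Requested behaviour: batch leaves *within* a single search (this usually
    # needs something like virtual loss so the selected leaves differ), turning
    # 400 simulations into 400 / leaf_batch_size evaluation rounds.
    for _ in range(num_simulations // leaf_batch_size):
        leaves = [select_leaf(tree) for _ in range(leaf_batch_size)]
        results = evaluate(leaves)
        for leaf, result in zip(leaves, results):
            backpropagate(tree, leaf, result)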
