turbozero's People

Contributors

lowrollr


turbozero's Issues

LazyZero-based training sample commands fail with "invalid multinomial distribution"

Hey there - I'm very grateful for this project, since it should speed up experimentation on one of my side-projects!

I'm trying to run the LazyZero training examples, but they seem to fail because the probability distribution over actions [at some level of the game tree] is degenerate upon initialization.

Command (and output):

(turbozero-py3.9) bubble-07@LambdaY-Desktop:~/Programmingstuff/turbozero$ python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:59: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask0 = torch.tensor([[[[-1e5, 1]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:60: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask1 = torch.tensor([[[[1], [-1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:61: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask2 = torch.tensor([[[[1, -1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask3 = torch.tensor([[[[-1e5], [1]]]], dtype=dtype, device=device, requires_grad=False)
Populating Replay Memory...:   0%|                              | 0/10 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/bubble-07/Programmingstuff/turbozero/turbozero.py", line 226, in <module>
    trainer.training_loop()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 191, in training_loop
    self.fill_replay_memory()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 173, in fill_replay_memory
    finished_episodes, _ = self.collector.collect()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 24, in collect
    _, terminated = self.collect_step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 40, in collect_step
    initial_states, probs, _, actions, terminated = self.evaluator.step()
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/evaluator.py", line 43, in step
    probs, values = self.evaluate(*args)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazyzero.py", line 33, in evaluate
    return super().evaluate(evaluation_fn)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 57, in evaluate
    return self.explore_for_iters(evaluation_fn, self.policy_rollouts, self.rollout_depth)
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 100, in explore_for_iters
    actions = self.choose_action_with_puct(
  File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 71, in choose_action_with_puct
    return rand_argmax_2d(legal_puct_scores).flatten()
  File "/home/bubble-07/Programmingstuff/turbozero/core/utils/utils.py", line 6, in rand_argmax_2d
    return torch.multinomial(inds.float(), 1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
Populating Replay Memory...:   0%|

For reference, it does look like I am able to train using AlphaZero on Othello using the provided configs.
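
For additional context, the multinomial failure itself is easy to reproduce in isolation, and my (unverified) guess is that the inds mask built inside rand_argmax_2d ends up all zeros for some row, e.g. when every PUCT score in that row is NaN. A minimal sketch follows; how inds is constructed here is my assumption, not the actual utils.py code.

import torch

# Guess at the failure mode: if a whole row of PUCT scores is NaN, an indicator
# of "entries equal to the row max" is False everywhere (NaN != NaN), so the
# weights handed to torch.multinomial sum to zero.
scores = torch.full((1, 4), float('nan'))
inds = (scores == scores.max(dim=1, keepdim=True).values)
print(inds.float().sum())        # tensor(0.)

# torch.multinomial requires each row to sum to something positive, hence:
try:
    torch.multinomial(inds.float(), 1)
except RuntimeError as e:
    print(e)                     # invalid multinomial distribution (sum of probabilities <= 0)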

speed issue

Hello - mctx-az doesn't have an Issues tab, so I'm asking here.
In my tests, alphazero_policy is much slower than gumbel_muzero_policy; do you know why? muzero_policy is also much slower than gumbel_muzero_policy in my tests. What might cause this?

bug

It seems root_metadata is not used in update_root, and the parameters are mismatched in mcts.py.

line 81
eval_state = self.update_root(eval_state, env_state, root_metadata, params)

line 97

def update_root(self, tree: MCTSTree, root_embedding: chex.ArrayTree,
                params: chex.ArrayTree, **kwargs) -> MCTSTree:
    key, tree = get_rng(tree)
    root_policy_logits, root_value = self.eval_fn(root_embedding, params, key)
    root_policy = jax.nn.softmax(root_policy_logits)
    root_node = tree.at(tree.ROOT_INDEX)
    root_node = self.update_root_node(root_node, root_policy, root_value, root_embedding)
    return set_root(tree, root_node)
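
To spell out the mismatch as I read the quoted lines: the call passes four positional arguments after self, but the signature declares only three positional parameters plus **kwargs, and **kwargs does not absorb extra positional arguments. A small self-contained sketch of the binding problem (the string arguments are placeholders, not the real types), plus one possible fix:

class Demo:
    # Mirrors the quoted signature (line 97): three positional parameters plus **kwargs.
    def update_root(self, tree, root_embedding, params, **kwargs):
        return tree

demo = Demo()

# Mirrors the quoted call (line 81): four positional arguments after self.
# The fourth one has no parameter to bind to; even with only three positionals,
# root_metadata would land where params is expected.
try:
    demo.update_root("eval_state", "env_state", "root_metadata", "params")
except TypeError as e:
    print(e)  # takes 4 positional arguments but 5 were given

# If root_metadata really is unused, one possible fix (my suggestion, not the
# current code) is to drop it at the call site:
demo.update_root("eval_state", "env_state", "params")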

AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero

Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me: the environment consistently ends up assigning non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and a sys.exit() to core/train/trainer.py to track this down in my fork, and I'm able to reproduce the issue with:

python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug

Output

torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00,  8.11it/s]
Collecting self-play episodes...:  25%|██████▎                  | 1/4 [00:00<00:01,  2.03it/s]tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
        [-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
       device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
        [0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)

The first tensor printed is what the neural net yields for the policy logits (after masking out invalid entries), the second tensor is the empirical visit probabilities, and the third is the value of the policy cross-entropy loss for that iteration. While the neural net's policy is correctly masked, the empirical visit probabilities aren't, judging from the second column. I've tried to pick apart where this goes wrong in the MCTS routine, but unfortunately I wasn't able to get far - do you have any handy ways to debug this?
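
For context on the tensor(inf): a plain cross-entropy between the visit probabilities and the masked policy (the exact loss expression used in trainer.py is my assumption) already blows up once a masked action carries nonzero visit mass, because each such term multiplies a log-probability of roughly -3.4e38 and the batch sum overflows float32:

import torch
import torch.nn.functional as F

# First two rows of the masked policy logits printed above: invalid actions are
# filled with roughly float32 min (-3.4028e38).
logits = torch.tensor([[-0.89384, -3.4028e38, -3.4028e38, -3.4028e38, -0.60272],
                       [-0.89384, -3.4028e38, -3.4028e38, -3.4028e38, -0.60272]])

# Matching rows of the empirical visit probabilities: note the nonzero mass
# (0.5455) on the second, masked action.
visits = torch.tensor([[0.2525, 0.5455, 0.0, 0.0, 0.2020],
                       [0.2525, 0.5455, 0.0, 0.0, 0.2020]])

# Each masked log-probability stays a huge finite negative (~ -3.4e38);
# multiplied by 0.5455 and summed over the batch, the total overflows float32.
log_probs = F.log_softmax(logits, dim=-1)
loss = -(visits * log_probs).sum()
print(loss)   # tensor(inf), matching the third tensor above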

Dirilecht instead of dirichlet in mcts

There is a typo on lines 95, 157 and 162: self.dirichlet is spelled self.dirilecht.
Fortunately the typo is repeated consistently, so it does not affect functionality, but I suggest fixing it to avoid possible confusion.

Batch MCTS is needed !!!

The current implementation only batches across multiple games; a single search is not batched. For example, if one search uses 400 simulations, those 400 simulations run one by one rather than as a batch.
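
To illustrate the request with hypothetical helper names (select_leaf, evaluate and backpropagate are placeholders, not the actual turbozero API): today the batching happens across parallel games, while the ask is to also batch leaf evaluations within one search, roughly:

def search_per_game(trees, num_simulations, select_leaf, evaluate, backpropagate):
    # Current behaviour (as I understand it): the network call is batched across
    # parallel games, but each game's simulations run one after another, so a
    # 400-simulation search issues 400 sequential evaluation rounds.
    for _ in range(num_simulations):
        leaves = [select_leaf(tree) for tree in trees]   # one leaf per game
        results = evaluate(leaves)                       # one NN forward pass
        for tree, leaf, result in zip(trees, leaves, results):
            backpropagate(tree, leaf, result)

def search_leaf_batched(tree, num_simulations, leaf_batch_size, select_leaf, evaluate, backpropagate):
    # Requested behaviour: batch leaves *within* a single search (this usually
    # needs something like virtual loss so the selected leaves differ), turning
    # 400 simulations into 400 / leaf_batch_size evaluation rounds.
    for _ in range(num_simulations // leaf_batch_size):
        leaves = [select_leaf(tree) for _ in range(leaf_batch_size)]
        results = evaluate(leaves)
        for leaf, result in zip(leaves, results):
            backpropagate(tree, leaf, result)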
