lowrollr / turbozero
fast + parallel AlphaZero in JAX
License: Apache License 2.0
There is a typo on lines 95, 157, and 162: self.dirichlet is spelled self.dirilecht. Fortunately the typo is repeated consistently, so it does not affect functionality, but I suggest fixing it to avoid possible confusion.
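(For context, self.dirichlet presumably holds the Dirichlet distribution AlphaZero uses for root exploration noise. A minimal sketch of that standard technique in PyTorch - illustrative names only, not turbozero's actual code:)

import torch

def add_root_dirichlet_noise(priors, alpha=0.3, eps=0.25):
    # Standard AlphaZero root exploration: mix a Dirichlet(alpha) sample
    # into the prior over actions. priors: (num_envs, num_actions).
    dirichlet = torch.distributions.Dirichlet(torch.full_like(priors, alpha))
    return (1 - eps) * priors + eps * dirichlet.sample()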
Hey there - I'm very grateful for this project, since it should speed up experimentation on one of my side-projects!
I'm trying to run the LazyZero training examples, but this seems to fail because the probability distribution over actions (at some level of the game tree) is degenerate upon initialization.
Command (and output):
(turbozero-py3.9) bubble-07@LambdaY-Desktop:~/Programmingstuff/turbozero$ python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:59: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask0 = torch.tensor([[[[-1e5, 1]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:60: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask1 = torch.tensor([[[[1], [-1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:61: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask2 = torch.tensor([[[[1, -1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask3 = torch.tensor([[[[-1e5], [1]]]], dtype=dtype, device=device, requires_grad=False)
Populating Replay Memory...: 0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/bubble-07/Programmingstuff/turbozero/turbozero.py", line 226, in <module>
trainer.training_loop()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 191, in training_loop
self.fill_replay_memory()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 173, in fill_replay_memory
finished_episodes, _ = self.collector.collect()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 24, in collect
_, terminated = self.collect_step()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 40, in collect_step
initial_states, probs, _, actions, terminated = self.evaluator.step()
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/evaluator.py", line 43, in step
probs, values = self.evaluate(*args)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazyzero.py", line 33, in evaluate
return super().evaluate(evaluation_fn)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 57, in evaluate
return self.explore_for_iters(evaluation_fn, self.policy_rollouts, self.rollout_depth)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 100, in explore_for_iters
actions = self.choose_action_with_puct(
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 71, in choose_action_with_puct
return rand_argmax_2d(legal_puct_scores).flatten()
File "/home/bubble-07/Programmingstuff/turbozero/core/utils/utils.py", line 6, in rand_argmax_2d
return torch.multinomial(inds.float(), 1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
For reference, it does look like I am able to train using AlphaZero on Othello using the provided configs.
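Digging a little further: the traceback ends in rand_argmax_2d feeding a 0/1 mask into torch.multinomial, and that error fires exactly when a mask row sums to zero. One way that can happen is a NaN among the PUCT scores (e.g. from an inf - inf during masking), since NaN compares unequal to everything, including the row maximum. A minimal standalone repro of that failure mode - the mask construction here is my guess at the helper, not turbozero's exact code:

import torch

def rand_argmax_2d(x):
    # Random tie-breaking argmax over rows: sample uniformly among maxima.
    inds = (x == x.max(dim=1, keepdim=True).values)
    return torch.multinomial(inds.float(), 1)

ok = torch.tensor([[0.0, 2.0, 2.0]])
print(rand_argmax_2d(ok))   # index 1 or 2, chosen at random

bad = torch.tensor([[float('nan'), 1.0, 2.0]])
print(rand_argmax_2d(bad))  # RuntimeError: invalid multinomial distribution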
Hello, mctx-az doesn't have an Issues tab, so I'm asking here.
In my tests, alphazero_policy is much slower than gumbel_muzero_policy, and muzero_policy is also much slower than gumbel_muzero_policy. Do you know why? What might cause this?
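One thing worth ruling out first: with JAX, the first call to each policy includes JIT compilation, which can dwarf the actual search time. A minimal benchmarking sketch against stock mctx (alphazero_policy lives in the mctx-az fork, whose internals may differ) that warms up each policy before timing it:

import functools, time
import jax
import jax.numpy as jnp
import mctx

batch_size, num_actions, num_sims = 16, 32, 64

def recurrent_fn(params, rng_key, action, embedding):
    # Toy model: uniform priors, zero reward/value, identity dynamics.
    del params, rng_key, action
    output = mctx.RecurrentFnOutput(
        reward=jnp.zeros(batch_size),
        discount=jnp.ones(batch_size),
        prior_logits=jnp.zeros((batch_size, num_actions)),
        value=jnp.zeros(batch_size),
    )
    return output, embedding

root = mctx.RootFnOutput(
    prior_logits=jnp.zeros((batch_size, num_actions)),
    value=jnp.zeros(batch_size),
    embedding=jnp.zeros(batch_size),
)

@functools.partial(jax.jit, static_argnums=0)
def run(policy_fn, rng_key):
    return policy_fn(params=(), rng_key=rng_key, root=root,
                     recurrent_fn=recurrent_fn, num_simulations=num_sims)

for policy_fn in (mctx.muzero_policy, mctx.gumbel_muzero_policy):
    run(policy_fn, jax.random.PRNGKey(0)).action.block_until_ready()  # compile
    t0 = time.perf_counter()
    run(policy_fn, jax.random.PRNGKey(1)).action.block_until_ready()
    print(policy_fn.__name__, f'{time.perf_counter() - t0:.4f}s')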
Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me: the environment consistently assigns non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and sys.exit() calls to core/train/trainer.py to track this down in my fork, and I'm able to reproduce the issue with:
python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug
Output
torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00, 8.11it/s]
Collecting self-play episodes...: 25%|██████▎ | 1/4 [00:00<00:01, 2.03it/s]
tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
[-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
[-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
[0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
[0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)
The first tensor printed is what the neural net yields for the policy logits (after masking out invalid entries), the second is the empirical visit probabilities, and the third is the value of the policy cross-entropy loss for the iteration. While the neural net's policy is correctly masked, the empirical visit probabilities apparently aren't, judging from the second column. I've tried to pick apart where this goes wrong in the MCTS routine, but unfortunately I wasn't able to get far - do you have any handy ways to debug this?
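(For what it's worth, my working hypothesis - an assumption, not a confirmed diagnosis - is that the selection phase masks invalid actions but the visit-count extraction does not, so illegal actions reached via exploration noise keep non-zero empirical probability. A sketch of the guard I'd expect somewhere in the pipeline, with illustrative names:)

import torch

def visit_probs(visit_counts, legal_mask):
    # visit_counts: (num_envs, num_actions) raw MCTS visit counts
    # legal_mask:   (num_envs, num_actions) bool, True where an action is legal
    masked = visit_counts * legal_mask  # zero out visits to illegal actions
    return masked / masked.sum(dim=1, keepdim=True)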
Great work! Can you tell me the key differences between this project and the AlphaZero implementation in PGX (https://github.com/sotetsuk/pgx/tree/main/examples/alphazero), and their specific advantages and disadvantages? I'd like to choose the more efficient implementation as a code foundation, thank you!
The current implementation only batches across multiple games, not within a single search. For example, if one search uses 400 simulations, those 400 simulations run one by one rather than batched.
It might be a good idea to allow users to specify any number of transforms to apply to augment experiences before they are stored in replay memory -- this is a fairly common use case, so it ideally should not require a custom class.
Originally posted by @lowrollr in #8 (comment)
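A minimal sketch of what such a hook could look like - the transform and storage names here are illustrative, not turbozero's actual API. For board games the natural transforms are the board symmetries, applied jointly to the state and the (spatially arranged) policy target:

import torch

def rotate90(state, policy_2d):
    # Hypothetical symmetry transform: rotate the board and the policy
    # target together so the labels stay consistent with the state.
    return (torch.rot90(state, 1, dims=(-2, -1)),
            torch.rot90(policy_2d, 1, dims=(-2, -1)))

def store_with_transforms(memory, state, policy_2d, transforms=()):
    # Store the original experience plus one augmented copy per transform.
    memory.append((state, policy_2d))
    for transform in transforms:
        memory.append(transform(state, policy_2d))

memory = []
store_with_transforms(memory, torch.zeros(1, 8, 8), torch.rand(8, 8),
                      transforms=[rotate90])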
Hey Jacob,
Thanks for sharing this project, great job!
Any chance you could share some good wandb runs you have on the pgx games?
I'd really like to see the hyperparams and loss curves. It would really help to get things dialed in.
Or I'm happy to help out by running some sweeps if we're not there yet.
Duane
It seems root_metadata is not used in update_root; the parameters are mismatched in mcts.py.
line 81:

eval_state = self.update_root(eval_state, env_state, root_metadata, params)

line 97:

def update_root(self, tree: MCTSTree, root_embedding: chex.ArrayTree,
                params: chex.ArrayTree, **kwargs) -> MCTSTree:
    key, tree = get_rng(tree)
    root_policy_logits, root_value = self.eval_fn(root_embedding, params, key)
    root_policy = jax.nn.softmax(root_policy_logits)
    root_node = tree.at(tree.ROOT_INDEX)
    root_node = self.update_root_node(root_node, root_policy, root_value, root_embedding)
    return set_root(tree, root_node)
Also, I need to use multiple GPUs, but I don't see anything like pmap in the code. An example of how to train with multiple GPUs would be helpful.
Originally posted by @Nightbringers in #8 (comment)
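For the multi-GPU question: the usual JAX pattern is data-parallel training with jax.pmap - replicate the parameters across devices, shard the batch along a leading device axis, and average gradients with jax.lax.pmean. A minimal sketch of that general pattern (a toy loss, not turbozero's actual training loop):

import functools

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy regression loss; a real trainer would run the network here.
    pred = batch['x'] @ params['w']
    return jnp.mean((pred - batch['y']) ** 2)

@functools.partial(jax.pmap, axis_name='devices')
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients across devices so every replica stays in sync.
    grads = jax.lax.pmean(grads, axis_name='devices')
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return params, loss

n_dev = jax.local_device_count()
# Replicate params onto every device; shard the batch along a new
# leading axis of size n_dev.
params = jax.device_put_replicated({'w': jnp.zeros((4, 1))}, jax.local_devices())
batch = {'x': jnp.ones((n_dev, 8, 4)), 'y': jnp.ones((n_dev, 8, 1))}
params, loss = train_step(params, batch)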