lowrollr / turbozero
fast + parallel AlphaZero in JAX
License: Apache License 2.0
There is a typo on lines 95, 157, and 162: self.dirichlet is spelled self.dirilecht. Fortunately the typo is repeated consistently, so it does not affect functionality, but I suggest fixing it to avoid possible confusion.
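(For context, self.dirichlet presumably holds the Dirichlet distribution AlphaZero uses for root exploration noise. A minimal sketch of that standard technique in PyTorch - illustrative names only, not turbozero's actual code:)

import torch

def add_root_dirichlet_noise(priors, alpha=0.3, eps=0.25):
    # Standard AlphaZero root exploration: mix a Dirichlet(alpha) sample
    # into the prior over actions. priors: (num_envs, num_actions).
    dirichlet = torch.distributions.Dirichlet(torch.full_like(priors, alpha))
    return (1 - eps) * priors + eps * dirichlet.sample()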
Hey there - I'm very grateful for this project, since it should speed up experimentation on one of my side-projects!
I'm trying to run the LazyZero training examples, but this seems to fail because the probability distribution over actions (at some level of the game tree) is degenerate upon initialization.
Command (and output):
(turbozero-py3.9) bubble-07@LambdaY-Desktop:~/Programmingstuff/turbozero$ python turbozero.py --verbose --mode=train --config=./example_configs/2048_tiny.yaml --logfile=./2048_tiny.log
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:59: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask0 = torch.tensor([[[[-1e5, 1]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:60: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask1 = torch.tensor([[[[1], [-1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:61: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask2 = torch.tensor([[[[1, -1e5]]]], dtype=dtype, device=device, requires_grad=False)
/home/bubble-07/Programmingstuff/turbozero/envs/_2048/torchscripts.py:62: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
mask3 = torch.tensor([[[[-1e5], [1]]]], dtype=dtype, device=device, requires_grad=False)
Populating Replay Memory...: 0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/bubble-07/Programmingstuff/turbozero/turbozero.py", line 226, in <module>
trainer.training_loop()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 191, in training_loop
self.fill_replay_memory()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/trainer.py", line 173, in fill_replay_memory
finished_episodes, _ = self.collector.collect()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 24, in collect
_, terminated = self.collect_step()
File "/home/bubble-07/Programmingstuff/turbozero/core/train/collector.py", line 40, in collect_step
initial_states, probs, _, actions, terminated = self.evaluator.step()
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/evaluator.py", line 43, in step
probs, values = self.evaluate(*args)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazyzero.py", line 33, in evaluate
return super().evaluate(evaluation_fn)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 57, in evaluate
return self.explore_for_iters(evaluation_fn, self.policy_rollouts, self.rollout_depth)
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 100, in explore_for_iters
actions = self.choose_action_with_puct(
File "/home/bubble-07/Programmingstuff/turbozero/core/algorithms/lazy_mcts.py", line 71, in choose_action_with_puct
return rand_argmax_2d(legal_puct_scores).flatten()
File "/home/bubble-07/Programmingstuff/turbozero/core/utils/utils.py", line 6, in rand_argmax_2d
return torch.multinomial(inds.float(), 1)
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
For reference, it does look like I am able to train using AlphaZero on Othello using the provided configs.
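Digging a little further: the traceback ends in rand_argmax_2d feeding a 0/1 mask into torch.multinomial, and that error fires exactly when a mask row sums to zero. One way that can happen is a NaN among the PUCT scores (e.g. from an inf - inf during masking), since NaN compares unequal to everything, including the row maximum. A minimal standalone repro of that failure mode - the mask construction here is my guess at the helper, not turbozero's exact code:

import torch

def rand_argmax_2d(x):
    # Random tie-breaking argmax over rows: sample uniformly among maxima.
    inds = (x == x.max(dim=1, keepdim=True).values)
    return torch.multinomial(inds.float(), 1)

ok = torch.tensor([[0.0, 2.0, 2.0]])
print(rand_argmax_2d(ok))   # index 1 or 2, chosen at random

bad = torch.tensor([[float('nan'), 1.0, 2.0]])
print(rand_argmax_2d(bad))  # RuntimeError: invalid multinomial distribution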
Hello, mctx-az doesn't have an Issues tab, so I'm asking here.
In my tests, alphazero_policy is much slower than gumbel_muzero_policy, and muzero_policy is also much slower than gumbel_muzero_policy. Do you know why? What might cause this?
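One thing worth ruling out first: with JAX, the first call to each policy includes JIT compilation, which can dwarf the actual search time. A minimal benchmarking sketch against stock mctx (alphazero_policy lives in the mctx-az fork, whose internals may differ) that warms up each policy before timing it:

import functools, time
import jax
import jax.numpy as jnp
import mctx

batch_size, num_actions, num_sims = 16, 32, 64

def recurrent_fn(params, rng_key, action, embedding):
    # Toy model: uniform priors, zero reward/value, identity dynamics.
    del params, rng_key, action
    output = mctx.RecurrentFnOutput(
        reward=jnp.zeros(batch_size),
        discount=jnp.ones(batch_size),
        prior_logits=jnp.zeros((batch_size, num_actions)),
        value=jnp.zeros(batch_size),
    )
    return output, embedding

root = mctx.RootFnOutput(
    prior_logits=jnp.zeros((batch_size, num_actions)),
    value=jnp.zeros(batch_size),
    embedding=jnp.zeros(batch_size),
)

@functools.partial(jax.jit, static_argnums=0)
def run(policy_fn, rng_key):
    return policy_fn(params=(), rng_key=rng_key, root=root,
                     recurrent_fn=recurrent_fn, num_simulations=num_sims)

for policy_fn in (mctx.muzero_policy, mctx.gumbel_muzero_policy):
    run(policy_fn, jax.random.PRNGKey(0)).action.block_until_ready()  # compile
    t0 = time.perf_counter()
    run(policy_fn, jax.random.PRNGKey(1)).action.block_until_ready()
    print(policy_fn.__name__, f'{time.perf_counter() - t0:.4f}s')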
Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me: the environment consistently assigns non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and sys.exit() calls to core/train/trainer.py to track this down in my fork, and I'm able to reproduce the issue with:
python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug
Output
torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00, 8.11it/s]
Collecting self-play episodes...: 25%|██████▎ | 1/4 [00:00<00:01, 2.03it/s]
tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
[-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
[-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
[0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
[0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)
The first tensor printed is what the neural net yields for the policy logits (after masking out invalid entries), the second is the empirical visit probabilities, and the third is the value of the policy cross-entropy loss for the iteration. While the neural net's policy is correctly masked, the empirical visit probabilities apparently aren't, judging from the second column. I've tried to pick apart where this goes wrong in the MCTS routine, but unfortunately I wasn't able to get far - do you have any handy ways to debug this?
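(For what it's worth, my working hypothesis - an assumption, not a confirmed diagnosis - is that the selection phase masks invalid actions but the visit-count extraction does not, so illegal actions reached via exploration noise keep non-zero empirical probability. A sketch of the guard I'd expect somewhere in the pipeline, with illustrative names:)

import torch

def visit_probs(visit_counts, legal_mask):
    # visit_counts: (num_envs, num_actions) raw MCTS visit counts
    # legal_mask:   (num_envs, num_actions) bool, True where an action is legal
    masked = visit_counts * legal_mask  # zero out visits to illegal actions
    return masked / masked.sum(dim=1, keepdim=True)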
Great work! Can you tell me the key differences between this project and the AlphaZero implementation in PGX (https://github.com/sotetsuk/pgx/tree/main/examples/alphazero), and their specific advantages and disadvantages? I'd like to choose the more efficient implementation as a code foundation, thank you!
The current implementation only batches across multiple games, not within a single search. For example, if one search uses 400 simulations, those 400 simulations run one by one rather than batched.
It might be a good idea to allow users to specify any number of transforms to apply to augment experiences before they are stored in replay memory -- this is a fairly common use case, so it ideally should not require a custom class.
Originally posted by @lowrollr in #8 (comment)
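A minimal sketch of what such a hook could look like - the transform and storage names here are illustrative, not turbozero's actual API. For board games the natural transforms are the board symmetries, applied jointly to the state and the (spatially arranged) policy target:

import torch

def rotate90(state, policy_2d):
    # Hypothetical symmetry transform: rotate the board and the policy
    # target together so the labels stay consistent with the state.
    return (torch.rot90(state, 1, dims=(-2, -1)),
            torch.rot90(policy_2d, 1, dims=(-2, -1)))

def store_with_transforms(memory, state, policy_2d, transforms=()):
    # Store the original experience plus one augmented copy per transform.
    memory.append((state, policy_2d))
    for transform in transforms:
        memory.append(transform(state, policy_2d))

memory = []
store_with_transforms(memory, torch.zeros(1, 8, 8), torch.rand(8, 8),
                      transforms=[rotate90])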
Hey Jacob,
Thanks for sharing this project, great job!
Any chance you could share some good wandb runs you have on the pgx games?
I'd really like to see the hyperparams and loss curves. It would really help to get things dialed in.
Or I'm happy to help out by running some sweeps if we're not there yet.
Duane
It seems root_metadata is not used in update_root; the parameters are mismatched in mcts.py.
line 81:

eval_state = self.update_root(eval_state, env_state, root_metadata, params)

line 97:

def update_root(self, tree: MCTSTree, root_embedding: chex.ArrayTree,
                params: chex.ArrayTree, **kwargs) -> MCTSTree:
    key, tree = get_rng(tree)
    root_policy_logits, root_value = self.eval_fn(root_embedding, params, key)
    root_policy = jax.nn.softmax(root_policy_logits)
    root_node = tree.at(tree.ROOT_INDEX)
    root_node = self.update_root_node(root_node, root_policy, root_value, root_embedding)
    return set_root(tree, root_node)
Also, I need to use multiple GPUs, but I don't see anything like pmap in the code. An example of how to train with multiple GPUs would be helpful.
Originally posted by @Nightbringers in #8 (comment)
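For the multi-GPU question: the usual JAX pattern is data-parallel training with jax.pmap - replicate the parameters across devices, shard the batch along a leading device axis, and average gradients with jax.lax.pmean. A minimal sketch of that general pattern (a toy loss, not turbozero's actual training loop):

import functools

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy regression loss; a real trainer would run the network here.
    pred = batch['x'] @ params['w']
    return jnp.mean((pred - batch['y']) ** 2)

@functools.partial(jax.pmap, axis_name='devices')
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients across devices so every replica stays in sync.
    grads = jax.lax.pmean(grads, axis_name='devices')
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return params, loss

n_dev = jax.local_device_count()
# Replicate params onto every device; shard the batch along a new
# leading axis of size n_dev.
params = jax.device_put_replicated({'w': jnp.zeros((4, 1))}, jax.local_devices())
batch = {'x': jnp.ones((n_dev, 8, 4)), 'y': jnp.ones((n_dev, 8, 1))}
params, loss = train_step(params, batch)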