FYI: https://arxiv.org/abs/1


Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (reversi-alpha-zero, open issue)

mokemokechicken commented on July 23, 2024 2

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

from reversi-alpha-zero.

Comments (7)

Zeta36 commented on July 23, 2024 3

In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps)

Wow!!


mokemokechicken commented on July 23, 2024 3

Hi @apollo-time

I think the main differences are as follows.

P3~4

AlphaZero:

AlphaZero does not augment the training data and does not transform the board position during MCTS (for generality).
The evaluation step is omitted; self-play always uses the newest model parameters. (!)
Hyper-parameters were not tuned by Bayesian optimization (past parameters are reused, except for the policy noise).

So, MCTS is also used without transforming the board position.


mokemokechicken commented on July 23, 2024 2

The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo
and AlphaGo Zero in two ways. First, training data was augmented by generating 8 symmetries
for each position. Second, during MCTS, board positions were transformed using a randomly
selected rotation or reflection before being evaluated by the neural network, so that the Monte-Carlo
evaluation is averaged over different biases.

Oh..., I didn't generate 8 symmetries for each position...
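
For reference, here is a minimal sketch of the 8-symmetry augmentation described in the quote above. It is not code from reversi-alpha-zero: the function names (`rotate90`, `flip_lr`, `symmetries`, `augment`) are hypothetical, and it assumes the board and the policy target are both stored as square lists of rows (a separate "pass" entry, if any, would be left unchanged).

```python
def rotate90(grid):
    """Rotate a square grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_lr(grid):
    """Mirror a grid left-to-right."""
    return [list(row[::-1]) for row in grid]


def symmetries(grid):
    """Return the 8 symmetries: 4 rotations, each with and without a mirror."""
    result = []
    current = [list(row) for row in grid]
    for _ in range(4):
        result.append(current)
        result.append(flip_lr(current))
        current = rotate90(current)
    return result


def augment(board, policy):
    """Apply the same symmetry to board and policy, giving 8 training pairs.

    Assumption: `policy` is a grid of move probabilities with the same
    shape as `board`, so both can be transformed identically.
    """
    return list(zip(symmetries(board), symmetries(policy)))


if __name__ == "__main__":
    demo = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for sym in symmetries(demo):
        print(sym)
```

The important detail is that the policy target has to be rotated or mirrored together with the board; otherwise the move probabilities point at the wrong squares.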


gooooloo commented on July 23, 2024 2

In Reversi, would α = 0.3 ~ 0.5 be better?

Agreed. Say there are about 180 legal actions on average in 19x19 Go, and maybe around 10 in Reversi? Then, going by the new paper, 10 times 0.03 seems more reasonable.
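
As a rough check of that estimate, the sketch below applies a strict inverse-proportion rule anchored at the paper's Go value (α = 0.03). The move counts (180 for Go, 10 for Reversi) are only the guesses from this comment, and `scaled_alpha` is a hypothetical helper, not something from the paper or the repository.

```python
GO_ALPHA = 0.03        # value quoted from the paper for Go
GO_LEGAL_MOVES = 180   # rough guess from this thread, not a measured number


def scaled_alpha(typical_legal_moves,
                 ref_alpha=GO_ALPHA,
                 ref_moves=GO_LEGAL_MOVES):
    """Scale alpha in inverse proportion to the typical number of legal moves."""
    return ref_alpha * ref_moves / typical_legal_moves


if __name__ == "__main__":
    # Reversi with about 10 legal moves: roughly 0.54
    print(round(scaled_alpha(10), 2))
```

With these guesses a strict inverse proportion gives about 0.54, while "10 times 0.03" gives 0.3, so both land near the 0.3 ~ 0.5 range being discussed. The paper only says the scaling is approximate, so this is a ballpark rather than a rule.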


mokemokechicken commented on July 23, 2024

Dirichlet noise Dir(α) was added to the prior probabilities in the
root node; this was scaled in inverse proportion to the approximate number of legal moves in a
typical position, to a value of α = {0.3, 0.15, 0.03} for chess, shogi and Go respectively.

In Reversi, would α = 0.3 ~ 0.5 be better?
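
To make the question concrete, here is a minimal sketch of adding Dirichlet noise to the root priors using only the Python standard library. The mixing form (1 - ε)·p + ε·η with ε = 0.25 is taken from the AlphaGo Zero paper and is an assumption here, as are the function names and the default α = 0.5; none of this is the repository's actual code.

```python
import random


def sample_dirichlet(alpha, n, rng=random):
    """Sample a symmetric Dirichlet(alpha) over n outcomes via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small alpha.
        return [1.0 / n] * n
    return [d / total for d in draws]


def add_root_noise(priors, alpha=0.5, epsilon=0.25, rng=random):
    """Mix Dirichlet noise into the prior probabilities of the root node.

    priors:  probabilities of the legal moves at the root (sum to 1).
    alpha:   Dirichlet concentration; 0.5 is just the value under discussion.
    epsilon: noise weight; 0.25 is assumed from the AlphaGo Zero paper.
    """
    noise = sample_dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * eta
            for p, eta in zip(priors, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    priors = [0.1] * 10  # e.g. 10 legal moves with a flat prior
    print(add_root_noise(priors, alpha=0.5, rng=rng))
```

A smaller α concentrates the noise on a few moves, while a larger α spreads it more evenly, which is why the value is tied to the typical number of legal moves.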


mokemokechicken commented on July 23, 2024