Comments (21)

AranKomat avatar AranKomat commented on July 23, 2024 2

@gooooloo Well, that makes sense. But when I said 150 stones in an average Go game, I didn't take the symmetries into account, so for a fair comparison I didn't consider the symmetries of Reversi either, which has the same set of symmetries as Go. Sorry for not being explicit. Since what we're concerned with is the ratio between our training/self-play ratio (5.3 after symmetries) and AZ's training/self-play ratio (about 0.44, or 0.44/8 = 0.055 after symmetries), there's still roughly a 100x difference, which is reasonable given the number of GPUs we're using.

AranKomat avatar AranKomat commented on July 23, 2024 1

Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) were trained and 21 million self-play games were generated. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048 / (21M x 150) ≈ 0.44 [trained position]/[self-play-generated position], which is much less than 68. So, I guess you can improve your performance with more self-plays per update. Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration. Since having more games generated means more diverse data than having more sims/move, spending more time on self-play may be more beneficial than more sims/move. But in practice, since your algorithm doesn't allow multi-processing (of multiple games) as done by Akababa, my suggestion may not be useful. But this may be useful for @gooooloo.
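
For reference, the arithmetic behind that estimate, laid out as a minimal sketch (the 700k / 2048 / 21M counts are those quoted above; 150 positions per game is the same assumption):

```python
# Rough trained/generated position ratio for AlphaZero, from the figures above.
minibatches = 700_000           # gradient updates
batch_size = 2048               # positions per minibatch
selfplay_games = 21_000_000     # self-play games
positions_per_game = 150        # assumed average Go game length

ratio = (minibatches * batch_size) / (selfplay_games * positions_per_game)
print(f"{ratio:.2f}")           # 0.46 -- quoted as "about 0.44" in this thread
```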

mokemokechicken avatar mokemokechicken commented on July 23, 2024 1

I am testing on feature/multiprocess_selfplay.

With 16 parallel self-play workers:

  • 580 seconds per 1 self-play game (16 in parallel) -> 36 seconds per self-play game
  • 400 positions per 1 self-play game
  • 225 seconds per 200 steps (bs=256) -> 225 seconds per 200*256 positions

so

  • Training: 228 positions / second (=200*256/225)
  • SelfPlay: 11 positions / second (=400/36)
  • Training/SelfPlay Ratio: 21 (=228/11)
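
The same arithmetic can be wrapped in a small helper; a sketch for checking these numbers (the function name and signature are illustrative, not from the repo):

```python
def training_selfplay_ratio(train_steps, batch_size, train_seconds,
                            positions_per_game, seconds_per_game):
    """Positions consumed by training vs. positions generated by self-play,
    both per wall-clock second."""
    training_pos_per_sec = train_steps * batch_size / train_seconds
    selfplay_pos_per_sec = positions_per_game / seconds_per_game
    return training_pos_per_sec / selfplay_pos_per_sec

# Measurement above: 200 steps of batch 256 in 225 s, and one 400-position
# game every 36 s with 16 parallel self-play workers.
print(training_selfplay_ratio(200, 256, 225, 400, 36))  # ~20.5 -> "about 21"
```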

mokemokechicken avatar mokemokechicken commented on July 23, 2024 1

I also added a wait to the optimizer to change the ratio.

Now,

  • 164 self-play games per hour -> 22 (=3600/164) seconds per self-play game
  • 400 positions per 1 self-play game
  • 225 * 2 seconds per 200 steps (bs=256) -> 450 seconds per 200*256 positions

so

  • Training: 113 positions / second (=200*256/450)
  • SelfPlay: 18 positions / second (=400/22)
  • Training/SelfPlay Ratio: 6.2 (=113/18)
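
The "wait" added to the optimizer can be pictured as a sleep between training steps; a minimal sketch under that assumption (the sleep length and the train_one_step callable are hypothetical, not the repo's actual code):

```python
import time

SLEEP_PER_STEP = 1.0  # seconds; tune to reach the desired training/self-play ratio

def throttled_optimizer_loop(train_one_step, total_steps):
    """Run training with an artificial wait so that self-play generates data
    faster relative to the optimizer."""
    for _ in range(total_steps):
        train_one_step()            # one minibatch update (hypothetical callable)
        time.sleep(SLEEP_PER_STEP)  # slow training down
```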

gooooloo avatar gooooloo commented on July 23, 2024 1

@AranKomat

Mine is:

  • 30 processes for self-play, about 150 seconds per game per process, which gives 5 seconds per game on average.
  • about 12 minutes per 100 training steps with batch size = 3072, which gives 426 positions per second (=3072*100/(12*60))

I actually don't understand the number below that @mokemokechicken mentioned:

400 positions per 1 self-play game

But if I just use this number, then my self-play speed is 80 positions per second (=400/5).
Then the Training/SelfPlay Ratio is 5.3 (=426/80).
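
Plugging these figures into the same arithmetic as a quick consistency check (not code from the repo):

```python
# gooooloo's figures from above.
seconds_per_game = 150 / 30                      # 30 processes -> 5 s per game
train_pos_per_sec = 3072 * 100 / (12 * 60)       # batch 3072, 100 steps in 12 min
selfplay_pos_per_sec = 400 / seconds_per_game    # assuming 400 positions per game
print(train_pos_per_sec / selfplay_pos_per_sec)  # ~5.3
```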

gooooloo avatar gooooloo commented on July 23, 2024 1

I used (nb_game_in_file, max_file_num)=(5, 300), so the number of total games in training data was 1500 (games).
My training dataset size was about 600k (positions).
So, 600k / 1500 = 400 (position/game).

But a Reversi game has at most 60 moves, doesn't it? Even with up to 5 "PASS" moves, that is 65. Then even with game-state flips and rotations (x4), it is at most 260.

UPDATE:
Oh, my fault: "flip and rotation" gives a x8 multiplication, not x4. Then it makes sense: 400/8 = 50, so you are playing 50 moves per game, given that you have a resignation mechanism.
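
For concreteness, the x8 "flip and rotation" augmentation can be generated as below; a minimal numpy sketch assuming the board and policy are plain 8x8 arrays (not the repo's actual augmentation code):

```python
import numpy as np

def eight_symmetries(board, policy):
    """Return the 8 rotations/reflections of an 8x8 board and its policy plane."""
    boards, policies = [], []
    for k in range(4):                      # four 90-degree rotations
        b, p = np.rot90(board, k), np.rot90(policy, k)
        boards += [b, np.fliplr(b)]         # each rotation plus its mirror
        policies += [p, np.fliplr(p)]
    return boards, policies
```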

AranKomat avatar AranKomat commented on July 23, 2024 1

The ratio of 0.44 was obtained from AlphaZero, where symmetry wasn't exploited. Also, Shogi and Chess cannot exploit symmetries, so they set AlphaZero's self-play vs. training ratio based on the assumption that self-play data isn't necessarily as plentiful as in symmetric games. Without symmetry the ratio is 0.44, which is closer to 1, and the ratio for Shogi and Chess may be even closer to 1. Also, in symmetric games without symmetric data augmentation, the NN quickly learns the symmetry, which was demonstrated by AZ being superior to AGZ in Go. Considering that symmetric data augmentation is eventually meaningless, @gooooloo's net ratio becomes 5.3*8 = 42.4. So he needs at least 42 times more GPUs for self-play to bring it down to 1.

gooooloo avatar gooooloo commented on July 23, 2024 1

@AranKomat @mokemokechicken I double-checked my pipeline's performance; it should be 25 processes with about 180 seconds per game per process, which gives about 7 seconds per game on average. Then my ratio should be about 7.5 (=426/(400/7)), not 5.3.

mokemokechicken avatar mokemokechicken commented on July 23, 2024

@apollo-time

I think that possibility exists,
and if we want to improve the model more and more, we need a larger sim_per_move and a larger self-play dataset.

I have a simple hypothesis:

  • The upper bound of the model's performance (=strength) is decided by sim_per_move.
  • The speed of change (≈ improvement) is decided by the speed of generating self-play data and the size of the self-play dataset (a smaller dataset changes faster).
  • The generalization performance of the model is decided by the size of the self-play dataset (a larger dataset is more general).

So I feel that gradually increasing sim_per_move and the dataset size is effective.
(I think humans also do that to become professionals.)

apollo-time avatar apollo-time commented on July 23, 2024

I think a larger sim_per_move and self-play dataset can't resolve the problem of positions that are no longer visited, because unusual positions can't be selected by self-play MCTS.
So I try selecting a fully random action sometimes in self-play, and I ignore the previous history of the random action.
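
A rough sketch of that idea (the mcts/state interfaces and the probability are assumptions, not the actual implementation):

```python
import random

RANDOM_MOVE_PROB = 0.02  # illustrative exploration probability

def select_move(mcts, state):
    """Occasionally play a uniformly random legal move instead of the MCTS
    choice, and reset the tree so the random move's history is ignored."""
    if random.random() < RANDOM_MOVE_PROB:
        mcts.reset()                               # drop previous search history
        return random.choice(state.legal_moves())  # fully random legal action
    return mcts.search(state)
```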

AranKomat avatar AranKomat commented on July 23, 2024

@mokemokechicken I asked @gooooloo a similar question in another thread, but what is the default ratio of the number of games per gradient update in your algorithm? I guess the ratio is important for performance, since it behaves like sims/move, which is undoubtedly important.

mokemokechicken avatar mokemokechicken commented on July 23, 2024

@AranKomat

what is the default ratio of the number of games per gradient update in your algorithm?

I do not know exactly which number to answer with, but the resulting speeds are as follows.

setting

  • batch size: 256
  • sim per move: 400
  • (nb_game_in_file, max_file_num): (5, 300)

speed

  • 80 seconds per 1 self-play game
  • 400 positions per 1 self-play game
  • 150 seconds per 200 steps (bs=256) -> 150 seconds per 200*256 positions

so

  • Training: 341 positions / second (=200*256/150)
  • SelfPlay: 5 positions / second (=400/80)
  • Training/SelfPlay Ratio: 68 (=341/5)

Maybe it means that 1 position is learned 68 times on average, regardless of (nb_game_in_file, max_file_num).

mokemokechicken avatar mokemokechicken commented on July 23, 2024

@AranKomat

I guess you can improve your performance with more self-plays per update.

I think so too.
In my environment, although GPU usage is already at 100% (from self-play and training),
implementing multiprocess self-play will increase the number of self-play games per training step.

So I am planning to implement multiprocess self-play;
however, I am still considering whether it will really work with the present method.
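
A minimal sketch of what multiprocess self-play could look like using the standard library (the play_one_game function is a placeholder, not the repo's implementation):

```python
from multiprocessing import Pool

def play_one_game(worker_id):
    """Placeholder: play one full self-play game and return its
    (state, policy, value) records."""
    return []

if __name__ == "__main__":
    # Each process plays games independently; the collected records are then
    # fed to the optimizer, raising self-play throughput per training step.
    with Pool(processes=16) as pool:
        records = pool.map(play_one_game, range(16))
```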

AranKomat avatar AranKomat commented on July 23, 2024

Cool. So, multi-processing successfully decreased the ratio and achieved 36 s per game at 400 sims/move. Now it remains to elucidate the trade-off between the training/self-play ratio and sims/move. I'm excited for your subsequent announcements!

gooooloo avatar gooooloo commented on July 23, 2024

Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) were trained and 21 million self-play games were generated. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048 / (21M x 150) ≈ 0.44 [trained position]/[self-play-generated position], which is much less than 68. So, I guess you can improve your performance with more self-plays per update. Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration. Since having more games generated means more diverse data than having more sims/move, spending more time on self-play may be more beneficial than more sims/move. But in practice, since your algorithm doesn't allow multi-processing (of multiple games) as done by Akababa, my suggestion may not be useful. But this may be useful for @gooooloo.

Thanks @AranKomat. I didn't see this post until just now...

I guess you can improve your performance with more self-plays per update

Yes, I also think so. DeepMind uses 2000+ or 4000+ TPUs for self-play (as Aja Huang said in a post; I just can't remember the link). We can see that self-play performance is important.

Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration.

Actually, I was getting a smaller self-play/training ratio when increasing sims/move from 100 to 800. Although I also introduced a multi-process implementation at that time, the overall self-play game speed was a little slower than before. Yet I observed an improvement in the AI's strength.

AranKomat avatar AranKomat commented on July 23, 2024

@gooooloo In AlphaZero, a staggering 5000 TPUs were used, so I totally agree. It's weird but nice that increased sims/move resulted in a smaller ratio. Hopefully, @mokemokechicken and others will observe a similar phenomenon.

mokemokechicken avatar mokemokechicken commented on July 23, 2024

400 positions per 1 self-play game

Note:
I used (nb_game_in_file, max_file_num)=(5, 300), so the number of total games in training data was 1500 (games).
My training dataset size was about 600k (positions).
So, 600k / 1500 = 400 (position/game).

gooooloo avatar gooooloo commented on July 23, 2024

... had a small self-play/training ratio

It's weird ... that increased sims/move resulted in a smaller ratio

The ratio is (# of self-play moves) / (# of trained moves). I increased the # of sims per move, then self-play got slower, so the # of self-play moves got smaller. But the training module didn't change. So the total ratio got smaller, didn't it?

AranKomat avatar AranKomat commented on July 23, 2024

@gooooloo Sorry, I thought you were talking about the training/self-play ratio, but it was the opposite. My mistake. I also agree with you about the number of positions per game.

gooooloo avatar gooooloo commented on July 23, 2024

@AranKomat I made a mistake in the calculation. Please see that post again; I have modified it.

mokemokechicken avatar mokemokechicken commented on July 23, 2024

It is strange for the training/self-play ratio to be under 1; it would mean there are positions not used in training.
So I think the ratio was actually almost 1.
