mokemokechicken / reversi-alpha-zero
Reversi reinforcement learning by AlphaGo Zero methods.
License: MIT License
Hello, @mokemokechicken and @Yuhang.
As I promised, I've done (in just one day; I had no more time) an adaptation of @mokemokechicken's reversi-zero project into a chess version: https://github.com/Zeta36/chess-alpha-zero
The project is already functional (in the sense that it doesn't fail and the three workers do their jobs), but unfortunately I have no GPU (just an Intel i5 CPU) and no money to spend on an AWS server or similar.
So I could only check the self-play with the toy config "--type mini". Moreover, I had to lower self.simulation_num_per_move = 2 and self.parallel_search_num = 2.
In this way I was able to generate the 100 games needed for the optimization worker to start. The optimization process seemed to work perfectly, and the model was able to reach a loss of ~0.6 after 1000 steps. I guess the model was able to overfit the 100 games from the earlier self-play.
Then I executed the evaluation process and it worked fine. The overfitted model was able to defeat the random original model from the beginning 100% of the time (coincidence??).
Finally, I checked the ASCII way to play against the best model. It worked as expected. To indicate our moves we have to use UCI notation: a1a2, b3b8, etc. More info here: https://chessprogramming.wikispaces.com/Algebraic+Chess+Notation
By the way, the model output is now of size 8128 (instead of the 64 of reversi and the 362 of Go), and it corresponds to all possible legal UCI moves in a chess game. I generate these new labels in the config.py file.
I should also note that the board state (and the player turn) is tracked with FEN chess notation: https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation
Here, for example, is the FEN for the starting position: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 (w means white to move).
I also changed the resign function a little. Chess is not like Go or Reversi, where you always finish the game in more or less the same number of moves. In chess the game can end in many ways (checkmate, stalemate, etc.), and a self-play game could run for more than 200 moves before reaching an ending position (normally a draw). So I decided to cut off play once one player has more than 13 points of advantage (this score is computed as usual, taking into account the value of the pieces: the queen is worth 10, rooks 5.5, etc.).
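The cut-off rule above can be sketched like this (a hypothetical illustration: the queen=10 and rook=5.5 values come from this comment, while the other piece values and the function names are my own assumptions):

```python
# Hypothetical sketch of the 13-point material cut-off described above.
# Queen=10 and rook=5.5 are from the comment; the rest are assumed values.
PIECE_VALUES = {'q': 10.0, 'r': 5.5, 'b': 3.5, 'n': 3.0, 'p': 1.0, 'k': 0.0}

def material_advantage(fen: str) -> float:
    """White's material minus black's, read from the board field of a FEN."""
    score = 0.0
    for ch in fen.split()[0]:          # first FEN field is the piece placement
        if ch.isalpha():
            value = PIECE_VALUES.get(ch.lower(), 0.0)
            score += value if ch.isupper() else -value
    return score

def should_cut_off(fen: str, threshold: float = 13.0) -> bool:
    """End self-play once either side is more than `threshold` points ahead."""
    return abs(material_advantage(fen)) > threshold
```

For the starting FEN above, material_advantage returns 0, so play continues.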
As you can imagine, with my poor machine I could not test the project beyond these tiny functionality checks. So I'd really appreciate it if you could take some free time on your GPUs to test this implementation more seriously. Both of you can of course be collaborators on the project if you wish.
Also, I don't know whether I introduced any theoretical bugs in this adaptation to chess, and I'd appreciate any comments from you in that sense as well.
Best regards!!
I know the DeepMind paper uses a replace_rate of 0.55. In Go under that ruleset there is no "draw" result, so 0.55 is reasonable. However, reversi has draws, so is 0.55 too high for the replace rate?
At 0.55, the next generation has to beat the best model in most games; even drawing is not enough. That seems difficult, so the best model evolves more slowly, which leaves the self-play policy less improved, and in turn the training data less improved.
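To make the concern concrete, here is a small sketch (the names are mine, not the repo's) of how the evaluation score changes if draws count as half a win instead of nothing:

```python
def eval_score(wins: int, draws: int, losses: int, draw_value: float = 0.5) -> float:
    """Challenger's score against the best model. With draw_value=0.5 a draw
    is half a win; with draw_value=0.0 (a strict win rate) draws count as losses."""
    games = wins + draws + losses
    return (wins + draw_value * draws) / games

# 12 wins, 3 draws, 9 losses out of 24 games:
# strict win rate = 12/24   = 0.5    (below a 0.55 threshold)
# draw-as-half    = 13.5/24 = 0.5625 (above the threshold)
```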
Another way to ask: in your practice, how often do draws happen when evaluating? In my local runs it happens at a rate of about 1/8 during evaluation. I am still at an early stage of training, and I also rewrote the self-play part, so I don't know whether this 1/8 rate is reasonable. Just curious what draw rate you got.
Thanks.
In challenge 2, the AlphaZero method, self-play always uses the newest next_generation model. When running both the self and opt workers, the self worker will always load the newest next_generation model saved by the opt worker when starting a new game.
Over a long period of time (say, 1-2 days), the self worker will load a new model so many times that it causes ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,256,256].
I found that this is because Keras doesn't free the GPU memory occupied by the old model weights when loading new ones. The following simple modification of src/reversi_zero/worker/self_play.py quickly reproduces the error when running python src/reversi_zero/run.py self:
...
import keras.backend as K
...
def start(config: Config):
    tf_util.set_session_config(per_process_gpu_memory_fraction=0.05)  # make the error occur faster
    return SelfPlayWorker(config, env=ReversiEnv()).start()
...
class SelfPlayWorker:
    ...
    def start(self):
        if self.model is None:
            self.model = self.load_model()
        self.buffer = []
        idx = 1
        while True:
            start_time = time()
            # env = self.start_game(idx)
            end_time = time()
            logger.debug(f"play game {idx} time={end_time - start_time} sec, ")
            # f"turn={env.turn}:{env.board.number_of_black_and_white}")
            if True or (idx % self.config.play_data.nb_game_in_file) == 0:
                # K.clear_session()
                load_best_model_weights(self.model)  # repeatedly loading even the same weights will trigger the error
            idx += 1
            continue
I've run with different per_process_gpu_memory_fraction values and found that the error occurs after exactly the corresponding number of model loads. For example, per_process_gpu_memory_fraction=0.05 on my GTX 960 with 4037MB of GPU memory crashes after exactly floor(4037 x 0.05 / 46) = 4 model loads (since the model weights h5 file is 46MB).
There's a simple way to fix this in Keras: just run keras.backend.clear_session() before loading new weights. In the example above, uncommenting K.clear_session() resolves the error.
I've opened a pull request that fixes this in lib/model_helper.py.
Wonderful job, friend.
Can you tell us what performance you got with this approach? Do you have any statistics?
Regards!
I see the ReversiPlayer.action function tries to select action_by_value when turn > change_tau_turn.

    action = int(np.random.choice(range(64), p=policy))
    action_by_value = int(np.argmax(self.var_q[key] + (self.var_n[key] > 0) * 100))
    if action == action_by_value or env.turn < self.play_config.change_tau_turn or env.turn <= 1:
        break

I think action_by_value is the first index with a nonzero n.
Why did you select the action like this?
How can I create the .env file?
I see that the policy softmax is calculated over all moves, including illegal ones.
How can I calculate the softmax over only the legal moves, e.g. by setting a placeholder for the legal moves?
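One common way to do this (a sketch of the general technique, not the repo's actual code) is to mask the illegal logits with a large negative value before the softmax:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, legal_mask: np.ndarray) -> np.ndarray:
    """Softmax over legal moves only; illegal moves get probability 0.
    legal_mask holds 1 for legal moves and 0 for illegal ones."""
    masked = np.where(legal_mask > 0, logits, -1e9)  # suppress illegal logits
    masked = masked - masked.max()                   # numerical stability
    exp = np.exp(masked) * (legal_mask > 0)          # zero out illegal entries exactly
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 0])  # only moves 0 and 2 are legal
p = masked_softmax(logits, mask)
```

In TensorFlow the same idea works with a placeholder (or input tensor) for the mask, adding -1e9 * (1 - mask) to the logits before tf.nn.softmax.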
Big problem: virtual_loss_for_w must be decided before env.step(action).
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 180 to 184 in a09d90b
This is a critical bug that makes MCTS search the same moves over and over.
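To illustrate the ordering issue (a hypothetical sketch; var_n/var_w mirror the repo's naming, but this is not its actual code):

```python
# Hypothetical sketch: the virtual loss's sign depends on whose turn it is
# at the node being selected, so it must be computed BEFORE env.step(action)
# flips the side to move.
class Node:
    def __init__(self, n_actions=64):
        self.var_n = [0.0] * n_actions  # visit counts
        self.var_w = [0.0] * n_actions  # total action values

def apply_virtual_loss(node, action, player_sign, virtual_loss=3.0):
    loss_for_w = -virtual_loss * player_sign  # decided before stepping the env
    node.var_n[action] += virtual_loss        # discourages other threads
    node.var_w[action] += loss_for_w
    return loss_for_w                         # kept so backup can revert it
```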
In "challenge_history.md", the year should be 2018 in Challenge 3/4, not 2017.
Installing wxPython is a nightmare on many platforms. Users usually cannot get an out-of-the-box installation without a lot of searching. A web-based UI would be a friendly replacement.
Is there a baseline for comparing the learned model, e.g. benchmark software to evaluate against? It would be useful to know how effective the learning algorithm actually is.
For example, what do you mean by "Won the App LV x"? Does it mean that if the model beats the app even once, it counts as a win even if it loses the other times?
I downloaded your "best model" and "newest model" and played both networks against the grhino AI (level 2). Sadly, both networks got destroyed by grhino on multiple tries. If you have a benchmark of levels to beat before grhino, that would be really helpful.
My model doesn't seem to improve anymore.
Moreover, I observed the effect ThomasWAnthony described, "It may forget pertinent information about positions that it no longer visits", when the selected actions become unusual.
@mokemokechicken, @gooooloo, what do you think?
Hi,
Great work with your repository, impressive stuff. I'm just interested to know: when you run the software with self-play and optimization at the same time, how many self-play games do you aim to complete between each new model the optimizer releases? I ask because I would have thought that if not enough games are completed, the model would over-fit.
Thanks, Jack
In the paper, there is an L2 weight regularisation term in the loss equation. It seems your implementation doesn't have it? (Please correct me if that regularisation is implied somewhere else.)
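For reference, the full loss from the paper is l = (z - v)^2 - pi^T log p + c * ||theta||^2 with c = 1e-4. Below is a minimal numeric sketch of it (the function name is mine; in Keras the same term is usually attached per layer via kernel_regularizer=regularizers.l2(1e-4)):

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """(z - v)^2 - pi . log(p) + c * ||theta||^2, as in the AlphaGo Zero paper.
    theta is a flat vector of all network weights."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # epsilon guards log(0)
    l2_penalty = c * np.sum(theta ** 2)            # the weight regularisation term
    return value_loss + policy_loss + l2_penalty
```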
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 165 to 168 in f1cfa6c
I see that N and W are updated with a virtual loss when a node is selected, in order to discourage other threads from simultaneously exploring the identical variation (as in the paper).
@mokemokechicken It seems that I can only finish one game in about 108s using the default hyper-parameters. My CPU is an 8-core i7-7700K @ 4.20GHz as well, and my GPU is a GeForce GTX 1080 Ti. Were any hyper-parameters changed without being declared in the README? Thanks!
It would be better to implement resignation during self-play, because it is not important to learn moves from a one-sided game. It may be a waste of capacity.
I tried to run the new download_newest_model_as_best_model.sh script, but both of the URLs it tries to fetch from return 404: Not Found.
As I mentioned before, I'm working on applying AlphaZero to text generation using a decoder-only Transformer instead of a CNN. My implementation is nearly finished, but I haven't tested its performance on text generation yet. Besides, a Transformer can be used for board games like reversi, since you can represent each move as a symbol (for example, any reversi move as a number from 0 to 63). Obviously, this doesn't contain any geometric information, but it's interesting to see whether that information is really so important compared with the speed advantage: layer-wise per-move FLOPS become roughly bs x 4 x hidden_dim^2 instead of bs x 8^2 x hidden_dim^2 x 3^2, which is 144x fewer. Any questions? If you're interested, I'll notify you as soon as my implementation works, so that we can extract the necessary components to apply to your reversi project.
@mokemokechicken @gooooloo I added one more GPU. However, the added GPU has a 0% usage rate, and doubling prediction_queue_size, parallel_search_num and multi_process_num doesn't make a difference. Also, has asynchronous training (one GPU doing gradient updates while the other GPUs do self-play, with weights synced after each update) been implemented in @gooooloo's algorithm?
https://github.com/tensorflow/minigo
Maybe this will help.
I made an AlphaZero implementation of Gobang based on TensorFlow.
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 79 to 80 in e79d92f
Do you think it is reasonable to resign if min(varq, (varn==0)*10) > 0.8 (or 0.9)?
As you said, the < -0.8 resignation helps a lot because it skips one-sided game states. I also verified the effect in my local run. But a value > 0.8 is also a one-sided game, and I guess MCTS is strong enough to guide the AI to a win in that situation? Then the NN would become more accurate thanks to the smaller sample space. I know this is not mentioned in the DeepMind paper; maybe that is because of the large state space of Go. But maybe it's worth trying on reversi? What do you think?
I installed everything.
Is the failure happening after the TF warnings?
If I want to get rid of the TF warnings (they are just warnings, right?), what should I do?
Please help, thanks.
> python src/reversi_zero/run.py play_gui
2017-12-10 16:02:48,794@reversi_zero.manager INFO # config type: normal
Using TensorFlow backend.
2017-12-10 16:03:02,034@reversi_zero.agent.model DEBUG # loading model from /Users/john/dev/igo/reversi0/reversi-az/data/model/model_best_config.json
2017-12-10 16:03:04.151972: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04.152015: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04,714@reversi_zero.agent.model DEBUG # loaded model digest = ae1dd819bdaf71fcc6e95e8b64bb53db6ca5fa63398fdff2582ab14ed9c87109
This program needs access to the screen. Please run with a
Framework build of python, and only when you are logged in
on the main display of your Mac.
I see you do a random flip and rotation in the Player's expand_and_evaluate function.
I think this is needed when adding data for training, but it isn't necessary when evaluating to select an action.
What do you think?
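For context, the kind of random flip/rotation being discussed looks roughly like this (a sketch with my own function names; reversi's board has the full 8-fold dihedral symmetry):

```python
import numpy as np

def random_symmetry(planes, rng=np.random):
    """Apply a random element of the board's 8-fold dihedral symmetry group
    to an (H, W) or (C, H, W) array, returning the result together with the
    (rotations, flipped) pair needed to map the policy back afterwards."""
    k = rng.randint(4)              # number of 90-degree rotations
    flipped = rng.randint(2) == 1   # optional mirror
    out = np.rot90(planes, k, axes=(-2, -1))
    if flipped:
        out = np.flip(out, axis=-1)
    return out, (k, flipped)

def undo_symmetry(planes, transform):
    """Invert random_symmetry on a board-shaped array (e.g. the policy plane)."""
    k, flipped = transform
    out = np.flip(planes, axis=-1) if flipped else planes
    return np.rot90(out, -k, axes=(-2, -1))
```

When evaluating to pick a move, the inverse transform has to be applied to the returned policy, which is exactly the overhead the question suggests skipping.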
Please share your reversi model achievements!
Models that are not from reversi-alpha-zero are also welcome.
Battle records, configurations, repository URLs, comments and so on.
So I have a question related to another similar project, for Go: https://github.com/gcp/leela-zero
In that project, self-play games are generated by one player playing against itself, so black and white share the same random seed and a shared search tree through tree reuse.
If I'm reading the code right, in reversi-alpha-zero two independent players are used to generate self-play games, each with its own search tree and a different random seed.
I am very curious about the effects of these two approaches. What have your results been?
This is my finding from a toy version of my customization of Akababa's implementation, but I'm certain this is relevant to your implementation and at least worth asking here, as your seeding part isn't essentially different from his. I've noticed that, since each process/thread shares the same seed, each process/thread generates the same result (e.g. state transition during simulation) regardless of the underlying probability distribution. Some processes/threads are faster than others, so eventually you may not find this deterministic behavior, as some processes/threads yield different outputs at the same instant despite the identical pseudo-random sequence generation. Did you find this problematic? I don't think this has been mentioned yet.
Relevant sources:
best-seed-for-parallel-process
Random seed is replicated across child processes #9650
seeding-random-number-generators-in-parallel-programs
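A common fix for this (a sketch under my own naming, using NumPy's SeedSequence rather than the repo's code) is to give each worker its own independent stream instead of letting every fork inherit the parent's RNG state:

```python
import numpy as np

def make_worker_rngs(base_seed, n_workers):
    """Spawn one statistically independent RNG per worker process/thread.
    Forked children that merely copy the parent's global RNG state would
    all produce the identical pseudo-random sequence."""
    seq = np.random.SeedSequence(base_seed)
    return [np.random.default_rng(child) for child in seq.spawn(n_workers)]

rngs = make_worker_rngs(base_seed=42, n_workers=4)
draws = [int(rng.integers(1_000_000)) for rng in rngs]  # differs per worker
```

In a multiprocessing setup, the same idea can be applied by passing one spawned child seed to each worker's initializer.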
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 406 to 414 in 4ac6e06
Maybe there is a bug here: p_ is NOT a probability distribution over legal moves until you do the normalization later in the code. But in the dirichlet_noise_only_for_legal_moves == True case, the Dirichlet noise is already a probability distribution over legal moves. That is, you are adding Dirichlet noise to a non-probability-distribution, which I believe is not consistent with the AlphaGo Zero paper.
I happened to find that my implementation had this bug too, and after I fixed it, my AI's strength improved significantly.
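For reference, the consistent version (a sketch of the fix being described, with my own names and an illustrative alpha) normalizes the prior over legal moves first and only then mixes in the noise, so both terms are probability distributions:

```python
import numpy as np

def add_dirichlet_noise(p, legal_mask, alpha=0.5, eps=0.25, rng=np.random):
    """P(s, a) = (1 - eps) * p_a + eps * eta_a, as in the AlphaGo Zero paper.
    p is normalized over legal moves BEFORE mixing, so the result is a
    proper distribution over legal moves."""
    legal = legal_mask > 0
    p = np.where(legal, p, 0.0)
    p = p / p.sum()                         # now a distribution over legal moves
    noise = np.zeros_like(p)
    noise[legal] = rng.dirichlet([alpha] * int(legal.sum()))
    return (1 - eps) * p + eps * noise
```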
@mokemokechicken now you play to a draw with NTest:13. Good job!
I noticed the design of share_mtcs_info_in_self_play: it shares MCTS info among different games played by the same model. This differs from the AlphaGoZero/AlphaZero papers, but I imagine it could improve self-play quality a lot. How does it work out in real practice?
And how much memory does it use?
@mokemokechicken in the README you said:
NBoard cannot play with two different engines (maybe).
I think it can, in a roundabout way. From the source code of NBoard, it seems it just runs an ntest executable and communicates with it via the NBoard protocol. Since you understand that protocol well, you could also launch ntest and communicate with it from your own code. Then you could play your model against ntest without manual intervention. If you want to check game details, just save the moves and replay them.
@mokemokechicken I see you started Challenge 3 (AlphaZero method) in the README. What is the difference between this one and Challenge 2?
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 115 to 121 in 5ee2f33
At first I thought this code searched in simulation_num_per_move threads at the same time.
But I see the async functions are not called from multiple threads.
What do you think, and how can I search the tree in multiple threads?
When reloading, some self-play games may still be in progress. Will this reloading be OK? Some games will use the old model in the first half and the new model in the second half. Some more discussion can be seen here. In another fork there is a fix for this, but it causes some CPU idle time.
Or are you aware of this issue and think it is OK? As we can see, your model is making good progress as well...