mokemokechicken / reversi-alpha-zero
Reversi reinforcement learning by AlphaGo Zero methods.
License: MIT License
Hello, @mokemokechicken and @Yuhang.
As I promised, I've done (in just one day; I had no more time) an adaptation of @mokemokechicken's reversi-zero project into a chess version: https://github.com/Zeta36/chess-alpha-zero
The project is already functional (in the sense that it doesn't fail and the three workers do their jobs), but unfortunately I have no GPU (just an Intel i5 CPU) and no money to spend on an AWS server or similar.
So I could only check the self-play with the toy config "--type mini". Moreover, I had to lower self.simulation_num_per_move = 2 and self.parallel_search_num = 2.
In this way I was able to generate the 100 games needed for the optimization worker to start. The optimization process seemed to work perfectly, and the model was able to reach a loss of ~0.6 after 1000 steps. I guess the model was able to overfit the 100 games from the earlier self-play.
Then I executed the evaluation process and it worked fine. The overfitted model was able to defeat the random original model from the beginning 100% of the time (coincidence??).
Finally, I checked the ASCII way to play against the best model. It worked as expected. To indicate our moves we have to use UCI notation: a1a2, b3b8, etc. More info here: https://chessprogramming.wikispaces.com/Algebraic+Chess+Notation
By the way, the model output is now of size 8128 (instead of the 64 of reversi and the 362 of Go), and it corresponds to all possible legal UCI moves in a chess game. I generate these new labels in the config.py file.
I should also note that the board state (and the player turn) is tracked with FEN chess notation: https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation
Here, for example, is the FEN for the starting position: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 (w means white to move).
I also changed the resign function a little. Chess is not like Go or Reversi, where you always finish the game in more or less the same number of moves. In chess the game can end in many ways (checkmate, stalemate, etc.), and a self-play game could run for more than 200 moves before reaching an ending position (normally a draw). So I decided to cut off play once one player has more than 13 points of advantage (this score is computed as usual, taking into account the value of the pieces: the queen is worth 10, rooks 5.5, etc.).
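The cut-off rule above can be sketched like this (a hypothetical illustration: the queen=10 and rook=5.5 values come from this comment, while the other piece values and the function names are my own assumptions):

```python
# Hypothetical sketch of the 13-point material cut-off described above.
# Queen=10 and rook=5.5 are from the comment; the rest are assumed values.
PIECE_VALUES = {'q': 10.0, 'r': 5.5, 'b': 3.5, 'n': 3.0, 'p': 1.0, 'k': 0.0}

def material_advantage(fen: str) -> float:
    """White's material minus black's, read from the board field of a FEN."""
    score = 0.0
    for ch in fen.split()[0]:          # first FEN field is the piece placement
        if ch.isalpha():
            value = PIECE_VALUES.get(ch.lower(), 0.0)
            score += value if ch.isupper() else -value
    return score

def should_cut_off(fen: str, threshold: float = 13.0) -> bool:
    """End self-play once either side is more than `threshold` points ahead."""
    return abs(material_advantage(fen)) > threshold
```

For the starting FEN above, material_advantage returns 0, so play continues.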
As you can imagine, with my poor machine I could not test the project beyond these tiny functionality checks. So I'd really appreciate it if you could take some free time on your GPUs to test this implementation more seriously. Both of you can of course be collaborators on the project if you wish.
Also, I don't know whether I introduced any theoretical bugs in this adaptation to chess, and I'd appreciate any comments from you in that sense as well.
Best regards!!
I know the DeepMind paper uses a replace_rate of 0.55. In Go under that ruleset there is no "draw" result, so 0.55 is reasonable. However, reversi has draws, so is 0.55 too high for the replace rate?
At 0.55, the next generation has to beat the best model in most games; even drawing is not enough. That seems difficult, so the best model evolves more slowly, which leaves the self-play policy less improved, and in turn the training data less improved.
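To make the concern concrete, here is a small sketch (the names are mine, not the repo's) of how the evaluation score changes if draws count as half a win instead of nothing:

```python
def eval_score(wins: int, draws: int, losses: int, draw_value: float = 0.5) -> float:
    """Challenger's score against the best model. With draw_value=0.5 a draw
    is half a win; with draw_value=0.0 (a strict win rate) draws count as losses."""
    games = wins + draws + losses
    return (wins + draw_value * draws) / games

# 12 wins, 3 draws, 9 losses out of 24 games:
# strict win rate = 12/24   = 0.5    (below a 0.55 threshold)
# draw-as-half    = 13.5/24 = 0.5625 (above the threshold)
```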
Another way to ask: in your practice, how often do draws happen when evaluating? In my local runs it happens at a rate of about 1/8 during evaluation. I am still at an early stage of training, and I also rewrote the self-play part, so I don't know whether this 1/8 rate is reasonable. Just curious what draw rate you got.
Thanks.
In challenge 2, the AlphaZero method, self-play always uses the newest next_generation model. When running both the self and opt workers, the self worker will always load the newest next_generation model saved by the opt worker when starting a new game.
Over a long period of time (say, 1-2 days), the self worker will load a new model so many times that it causes ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,256,256].
I found that this is because Keras doesn't free the GPU memory occupied by the old model weights when loading new ones. The following simple modification of src/reversi_zero/worker/self_play.py quickly reproduces the error when running python src/reversi_zero/run.py self:
...
import keras.backend as K
...
def start(config: Config):
    tf_util.set_session_config(per_process_gpu_memory_fraction=0.05)  # make the error occur faster
    return SelfPlayWorker(config, env=ReversiEnv()).start()
...
class SelfPlayWorker:
    ...
    def start(self):
        if self.model is None:
            self.model = self.load_model()
        self.buffer = []
        idx = 1
        while True:
            start_time = time()
            # env = self.start_game(idx)
            end_time = time()
            logger.debug(f"play game {idx} time={end_time - start_time} sec, ")
            # f"turn={env.turn}:{env.board.number_of_black_and_white}")
            if True or (idx % self.config.play_data.nb_game_in_file) == 0:
                # K.clear_session()
                load_best_model_weights(self.model)  # repeatedly loading even the same weights will trigger the error
            idx += 1
            continue
I've run with different per_process_gpu_memory_fraction values and found that the error occurs after exactly the corresponding number of model loads. For example, per_process_gpu_memory_fraction=0.05 on my GTX 960 with 4037MB of GPU memory crashes after exactly floor(4037 x 0.05 / 46) = 4 model loads (since the model weights h5 file is 46MB).
There's a simple way to fix this in Keras: just run keras.backend.clear_session() before loading new weights. In the example above, uncommenting K.clear_session() resolves the error.
I've opened a pull request that fixes this in lib/model_helper.py.
Wonderful job, friend.
Can you tell us what performance you got with this approach? Do you have any statistics?
Regards!
I see the ReversiPlayer.action function tries to select action_by_value when turn > change_tau_turn.

    action = int(np.random.choice(range(64), p=policy))
    action_by_value = int(np.argmax(self.var_q[key] + (self.var_n[key] > 0) * 100))
    if action == action_by_value or env.turn < self.play_config.change_tau_turn or env.turn <= 1:
        break

I think action_by_value is the first index with a nonzero n.
Why did you select the action like this?
How can I create the .env file?
I see that the policy softmax is calculated over all moves, including illegal ones.
How can I calculate the softmax over only the legal moves, e.g. by setting a placeholder for the legal moves?
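One common way to do this (a sketch of the general technique, not the repo's actual code) is to mask the illegal logits with a large negative value before the softmax:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, legal_mask: np.ndarray) -> np.ndarray:
    """Softmax over legal moves only; illegal moves get probability 0.
    legal_mask holds 1 for legal moves and 0 for illegal ones."""
    masked = np.where(legal_mask > 0, logits, -1e9)  # suppress illegal logits
    masked = masked - masked.max()                   # numerical stability
    exp = np.exp(masked) * (legal_mask > 0)          # zero out illegal entries exactly
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 0])  # only moves 0 and 2 are legal
p = masked_softmax(logits, mask)
```

In TensorFlow the same idea works with a placeholder (or input tensor) for the mask, adding -1e9 * (1 - mask) to the logits before tf.nn.softmax.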
Big problem: virtual_loss_for_w must be decided before env.step(action).
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 180 to 184 in a09d90b
This is a critical bug that makes MCTS search the same moves over and over.
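To illustrate the ordering issue (a hypothetical sketch; var_n/var_w mirror the repo's naming, but this is not its actual code):

```python
# Hypothetical sketch: the virtual loss's sign depends on whose turn it is
# at the node being selected, so it must be computed BEFORE env.step(action)
# flips the side to move.
class Node:
    def __init__(self, n_actions=64):
        self.var_n = [0.0] * n_actions  # visit counts
        self.var_w = [0.0] * n_actions  # total action values

def apply_virtual_loss(node, action, player_sign, virtual_loss=3.0):
    loss_for_w = -virtual_loss * player_sign  # decided before stepping the env
    node.var_n[action] += virtual_loss        # discourages other threads
    node.var_w[action] += loss_for_w
    return loss_for_w                         # kept so backup can revert it
```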
In "challenge_history.md", the year should be 2018 in Challenge 3/4, not 2017.
Installing wxPython is a nightmare on many platforms. Users usually cannot get an out-of-the-box installation without a lot of searching. A web-based UI would be a friendly replacement.
Is there a baseline for comparing the learned model, e.g. benchmark software to evaluate against? It would be useful to know how effective the learning algorithm actually is.
For example, what do you mean by "Won the App LV x"? Does it mean that if the model beats the app even once, it counts as a win even if it loses the other times?
I downloaded your "best model" and "newest model" and played both networks against the grhino AI (level 2). Sadly, both networks got destroyed by grhino on multiple tries. If you have a benchmark of levels to beat before grhino, that would be really helpful.
My model doesn't seem to improve anymore.
Moreover, I observed the effect ThomasWAnthony described, "It may forget pertinent information about positions that it no longer visits", when the selected actions become unusual.
@mokemokechicken, @gooooloo, what do you think?
Hi,
Great work with your repository, impressive stuff. I'm just interested to know: when you run the software with self-play and optimization at the same time, how many self-play games do you aim to complete between each new model the optimizer releases? I ask because I would have thought that if not enough games are completed, the model would over-fit.
Thanks, Jack
In the paper, there is an L2 weight regularisation term in the loss equation. It seems your implementation doesn't have it? (Please correct me if that regularisation is implied somewhere else.)
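For reference, the full loss from the paper is l = (z - v)^2 - pi^T log p + c * ||theta||^2 with c = 1e-4. Below is a minimal numeric sketch of it (the function name is mine; in Keras the same term is usually attached per layer via kernel_regularizer=regularizers.l2(1e-4)):

```python
import numpy as np

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """(z - v)^2 - pi . log(p) + c * ||theta||^2, as in the AlphaGo Zero paper.
    theta is a flat vector of all network weights."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # epsilon guards log(0)
    l2_penalty = c * np.sum(theta ** 2)            # the weight regularisation term
    return value_loss + policy_loss + l2_penalty
```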
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 165 to 168 in f1cfa6c
I see that N and W are updated with a virtual loss when a node is selected, in order to discourage other threads from simultaneously exploring the identical variation (as in the paper).
@mokemokechicken It seems that I can only finish one game in about 108s using the default hyper-parameters. My CPU is an 8-core i7-7700K @ 4.20GHz as well, and my GPU is a GeForce GTX 1080 Ti. Were any hyper-parameters changed without being declared in the README? Thanks!
It would be better to implement resignation during self-play, because it is not important to learn moves from a one-sided game. It may be a waste of capacity.
I tried to run the new download_newest_model_as_best_model.sh script, but both of the URLs it tries to fetch from return 404: Not Found.
As I mentioned before, I'm working on applying AlphaZero to text generation using a decoder-only Transformer instead of a CNN. My implementation is nearly finished, but I haven't tested its performance on text generation yet. Besides, a Transformer can be used for board games like reversi, since you can represent each move as a symbol (for example, any reversi move as a number from 0 to 63). Obviously, this doesn't contain any geometric information, but it's interesting to see whether that information is really so important compared with the speed advantage: layer-wise per-move FLOPS become roughly bs x 4 x hidden_dim^2 instead of bs x 8^2 x hidden_dim^2 x 3^2, which is 144x fewer. Any questions? If you're interested, I'll notify you as soon as my implementation works, so that we can extract the necessary components to apply to your reversi project.
@mokemokechicken @gooooloo I added one more GPU. However, the added GPU has a 0% usage rate, and doubling prediction_queue_size, parallel_search_num and multi_process_num doesn't make a difference. Also, has asynchronous training (one GPU doing gradient updates while the other GPUs do self-play, with weights synced after each update) been implemented in @gooooloo's algorithm?
https://github.com/tensorflow/minigo
Maybe this will help.
I made an AlphaZero implementation of Gobang based on TensorFlow.
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 79 to 80 in e79d92f
Do you think it is reasonable to resign if min(varq, (varn==0)*10) > 0.8 (or 0.9)?
As you said, the < -0.8 resignation helps a lot because it skips one-sided game states. I also verified the effect in my local run. But a value > 0.8 is also a one-sided game, and I guess MCTS is strong enough to guide the AI to a win in that situation? Then the NN would become more accurate thanks to the smaller sample space. I know this is not mentioned in the DeepMind paper; maybe that is because of the large state space of Go. But maybe it's worth trying on reversi? What do you think?
I installed everything.
Is the failure happening after the TF warnings?
If I want to get rid of the TF warnings (they are just warnings, right?), what should I do?
Please help, thanks.
> python src/reversi_zero/run.py play_gui
2017-12-10 16:02:48,794@reversi_zero.manager INFO # config type: normal
Using TensorFlow backend.
2017-12-10 16:03:02,034@reversi_zero.agent.model DEBUG # loading model from /Users/john/dev/igo/reversi0/reversi-az/data/model/model_best_config.json
2017-12-10 16:03:04.151972: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04.152015: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04,714@reversi_zero.agent.model DEBUG # loaded model digest = ae1dd819bdaf71fcc6e95e8b64bb53db6ca5fa63398fdff2582ab14ed9c87109
This program needs access to the screen. Please run with a
Framework build of python, and only when you are logged in
on the main display of your Mac.
I see you do a random flip and rotation in the Player's expand_and_evaluate function.
I think this is needed when adding data for training, but it isn't necessary when evaluating to select an action.
What do you think?
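For context, the kind of random flip/rotation being discussed looks roughly like this (a sketch with my own function names; reversi's board has the full 8-fold dihedral symmetry):

```python
import numpy as np

def random_symmetry(planes, rng=np.random):
    """Apply a random element of the board's 8-fold dihedral symmetry group
    to an (H, W) or (C, H, W) array, returning the result together with the
    (rotations, flipped) pair needed to map the policy back afterwards."""
    k = rng.randint(4)              # number of 90-degree rotations
    flipped = rng.randint(2) == 1   # optional mirror
    out = np.rot90(planes, k, axes=(-2, -1))
    if flipped:
        out = np.flip(out, axis=-1)
    return out, (k, flipped)

def undo_symmetry(planes, transform):
    """Invert random_symmetry on a board-shaped array (e.g. the policy plane)."""
    k, flipped = transform
    out = np.flip(planes, axis=-1) if flipped else planes
    return np.rot90(out, -k, axes=(-2, -1))
```

When evaluating to pick a move, the inverse transform has to be applied to the returned policy, which is exactly the overhead the question suggests skipping.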
Please share your reversi model achievements!
Models that are not from reversi-alpha-zero are also welcome.
Battle records, configurations, repository URLs, comments and so on.
So I have a question related to another similar project, for Go: https://github.com/gcp/leela-zero
In that project, self-play games are generated by one player playing against itself, so black and white share the same random seed and a shared search tree through tree reuse.
If I'm reading the code right, in reversi-alpha-zero two independent players are used to generate self-play games, each with its own search tree and a different random seed.
I am very curious about the effects of these two approaches. What have your results been?
This is my finding from a toy version of my customization of Akababa's implementation, but I'm certain this is relevant to your implementation and at least worth asking here, as your seeding part isn't essentially different from his. I've noticed that, since each process/thread shares the same seed, each process/thread generates the same result (e.g. state transition during simulation) regardless of the underlying probability distribution. Some processes/threads are faster than others, so eventually you may not find this deterministic behavior, as some processes/threads yield different outputs at the same instant despite the identical pseudo-random sequence generation. Did you find this problematic? I don't think this has been mentioned yet.
Relevant sources:
best-seed-for-parallel-process
Random seed is replicated across child processes #9650
seeding-random-number-generators-in-parallel-programs
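A common fix for this (a sketch under my own naming, using NumPy's SeedSequence rather than the repo's code) is to give each worker its own independent stream instead of letting every fork inherit the parent's RNG state:

```python
import numpy as np

def make_worker_rngs(base_seed, n_workers):
    """Spawn one statistically independent RNG per worker process/thread.
    Forked children that merely copy the parent's global RNG state would
    all produce the identical pseudo-random sequence."""
    seq = np.random.SeedSequence(base_seed)
    return [np.random.default_rng(child) for child in seq.spawn(n_workers)]

rngs = make_worker_rngs(base_seed=42, n_workers=4)
draws = [int(rng.integers(1_000_000)) for rng in rngs]  # differs per worker
```

In a multiprocessing setup, the same idea can be applied by passing one spawned child seed to each worker's initializer.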
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 406 to 414 in 4ac6e06
Maybe there is a bug here: p_ is NOT a probability distribution over legal moves until you do the normalization later in the code. But in the dirichlet_noise_only_for_legal_moves == True case, the Dirichlet noise is already a probability distribution over legal moves. That is, you are adding Dirichlet noise to a non-probability-distribution, which I believe is not consistent with the AlphaGo Zero paper.
I happened to find that my implementation had this bug too, and after I fixed it, my AI's strength improved significantly.
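For reference, the consistent version (a sketch of the fix being described, with my own names and an illustrative alpha) normalizes the prior over legal moves first and only then mixes in the noise, so both terms are probability distributions:

```python
import numpy as np

def add_dirichlet_noise(p, legal_mask, alpha=0.5, eps=0.25, rng=np.random):
    """P(s, a) = (1 - eps) * p_a + eps * eta_a, as in the AlphaGo Zero paper.
    p is normalized over legal moves BEFORE mixing, so the result is a
    proper distribution over legal moves."""
    legal = legal_mask > 0
    p = np.where(legal, p, 0.0)
    p = p / p.sum()                         # now a distribution over legal moves
    noise = np.zeros_like(p)
    noise[legal] = rng.dirichlet([alpha] * int(legal.sum()))
    return (1 - eps) * p + eps * noise
```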
@mokemokechicken now you play to a draw with NTest:13. Good job!
I noticed the design of share_mtcs_info_in_self_play: it shares MCTS info among different games played by the same model. This differs from the AlphaGoZero/AlphaZero papers, but I imagine it could improve self-play quality a lot. How does it work out in real practice?
And how much memory does it use?
@mokemokechicken in the README you said:
NBoard cannot play with two different engines (maybe).
I think it can, in a roundabout way. From the source code of NBoard, it seems it just runs an ntest executable and communicates with it via the NBoard protocol. Since you understand that protocol well, you could also launch ntest and communicate with it from your own code. Then you could play your model against ntest without manual intervention. If you want to check game details, just save the moves and replay them.
@mokemokechicken I see you started Challenge 3 (AlphaZero method) in the README. What is the difference between this one and Challenge 2?
reversi-alpha-zero/src/reversi_zero/agent/player.py
Lines 115 to 121 in 5ee2f33
At first I thought this code searched in simulation_num_per_move threads at the same time.
But I see the async functions are not called from multiple threads.
What do you think, and how can I search the tree in multiple threads?
When reloading, some self-play games may still be in progress. Will this reloading be OK? Some games will use the old model in the first half and the new model in the second half. Some more discussion can be seen here. In another fork there is a fix for this, but it causes some CPU idle time.
Or are you aware of this issue and think it is OK? As we can see, your model is making good progress as well...