
reversi-alpha-zero's Introduction

About

Reversi reinforcement learning by AlphaGo Zero methods.

@mokemokechicken's training history is Challenge History.

If you have achievements to share, I would be grateful if you posted them to Performance Reports.

Environment

  • Python 3.6.3
  • tensorflow-gpu: 1.3.0 (+)
    • tensorflow==1.3.0 (CPU) also works, but it is very slow. For play_gui, the CPU version of tensorflow is fast enough.
  • Keras: 2.0.8 (+)

Modules

Reinforcement Learning

This AlphaGo Zero implementation consists of three workers: self, opt and eval.

  • self is Self-Play: it generates training data through self-play using BestModel.
  • opt is Trainer: it trains the model and produces next-generation models.
  • eval is Evaluator: it evaluates whether the next-generation model is better than BestModel and, if so, replaces BestModel.
    • If config.play.use_newest_next_generation_model = True, this worker is not needed (this is the AlphaZero method).

Evaluation

For evaluation, you can play reversi against the BestModel.

  • play_gui is Play Game vs BestModel using wxPython.

Data

  • data/model/model_best_*: BestModel.
  • data/model/next_generation/*: next-generation models.
  • data/play_data/play_*.json: generated training data.
  • logs/main.log: log file.

If you want to train the model from the beginning, delete the above directories.

How to use

Setup

install libraries

pip install -r requirements.txt

install libraries with Anaconda

cp requirements.txt conda-requirements.txt
  • In conda-requirements.txt, comment out the lines for the jedi, Keras, parso, python-dotenv, tensorflow-tensorboard and wxPython libraries
  • Replace '-' with '_' in the names of the ipython-genutils, jupyter-* and prompt-toolkit libraries
conda env create -f environment.yml
source activate reversi-a0
conda install --yes --file conda-requirements.txt

If you want to use a GPU,

pip install tensorflow-gpu

set environment variables

Create a .env file and write the following line in it.

KERAS_BACKEND=tensorflow

Windows Setup

These instructions were written by @GCRhoads. Thanks!

Required: 64-bit Windows

Procedure verified for Windows 8.1. Not yet tested for other versions.

Note: Windows uses backslashes, not forward slashes, in path names.

  1. Change the first line (if necessary) of "src\reversi_zero\agent\player.py" to from asyncio.futures import Future

  2. Install the 64-bit version of Python 3.5 (the 32-bit version is not sufficient). You have two options: the direct download from python.org or Anaconda.

Note: For some strange reason, both Python 3.5 and Anaconda get installed in a hidden folder. To access them, you first have to go to the Control Panel, select Folder Options, and on the View tab, click on the circle next to "Show hidden files, folders, or drives" in the Advanced settings section. Anaconda gets installed in C:\ProgramData\Anaconda3. The direct download option installs Python in (I believe) C:\Users\\AppData\Local\Program\Python.

  3. Install the Visual C++ 2015 build tools. You could install the entire 2015 version of Visual Studio (not the 2017 version that Microsoft tries to push on you), but this is a large download and install, most of which you don't need. Download the Visual C++ build tools and double-click the downloaded file to run the installer.

  4. Rewrite all uses of f-strings. The Python source code for this project uses numerous f-strings, a feature new to Python 3.6. Since we need Python 3.5 (required by the Windows version of tensorflow), use your editor's search feature to find every occurrence of an f-string and rewrite it using str.format(), as in the example below.
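For example (a minimal sketch; the variable names are placeholders, and the f-string line is commented out so the snippet stays valid on Python 3.5):

idx, elapsed = 3, 12.5

# Python 3.6 f-string (a syntax error on Python 3.5):
# print(f"play game {idx} time={elapsed} sec")

# Python 3.5-compatible rewrite using str.format():
print("play game {} time={} sec".format(idx, elapsed))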

  5. Install the libraries. From either the Anaconda prompt or a command window in the top-level folder where you put this distribution, enter the following.

pip install -r requirements.txt
  6. Install tensorflow

If you have a GPU compatible with tensorflow (see the list on the tensorflow web site), the code will run much faster if you install the GPU version. To install it, enter the following in either the Anaconda prompt or the command window.

pip3 install --upgrade tensorflow-gpu

If you do not have a compatible GPU, you will have to settle for the slower CPU-only version. To install it, enter the following in either the Anaconda prompt or the command window.

pip3 install --upgrade tensorflow
  7. Set environment variables. Create a .env file and write the following line in this file.
KERAS_BACKEND=tensorflow

Now you should be good to go.

Strongest Model

Now, "challenge 5 model" and "ch5 config" are strongest in my models. If you want to play with it,

rm -rf data/model/next_generation/
sh ./download_model.sh 5
# run as wxPython GUI
python src/reversi_zero/run.py play_gui -c config/ch5.yml

If you want to use it as an NBoard engine (see "Run as NBoard2.0 Engine" below), use nboard_engine -c config/ch5.yml as the Command.

Past Models

Please remove (or rename) the data/model/next_generation/ directory if you want to use the "BestModel" at data/model/model_best_*.

Download Trained BestModel

For example, download the trained BestModel (trained in Challenge 1 below).

sh ./download_best_model.sh

Download the Newest Trained Model

Download the newest trained model (from Challenge 2, 3, 4 or 5) as BestModel.

sh ./download_model.sh <version>

ex)

sh ./download_model.sh 5

Configuration

'AlphaGo Zero' method and 'AlphaZero' method

I think the main difference between 'AlphaGo Zero' and 'AlphaZero' is whether eval is used or not. You can switch between these methods via configuration.

AlphaGo Zero method

  • PlayConfig#use_newest_next_generation_model = False
  • PlayWithHumanConfig#use_newest_next_generation_model = False
  • Execute Evaluator to select the best model.

AlphaZero method

  • PlayConfig#use_newest_next_generation_model = True
  • PlayWithHumanConfig#use_newest_next_generation_model = True
  • Do not use Evaluator (the newest model is always used for self-play).
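As a minimal sketch of switching between the two methods (the attribute paths follow the names above; the exact import path and the play_with_human attribute name are assumptions, not verified against the code):

from reversi_zero.config import Config  # assumed import path

config = Config()

# AlphaZero method: self-play and GUI play always use the newest next-generation model,
# and the eval worker is not needed.
config.play.use_newest_next_generation_model = True
config.play_with_human.use_newest_next_generation_model = True

# AlphaGo Zero method: leave both flags False and run the eval worker
# so that only next-generation models that beat BestModel are promoted.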

policy distribution of self-play

In DeepMind's paper, the policy (π) data saved by self-play appear to be a distribution proportional to pow(N, 1/tau). After the middle of the game, tau becomes 0, so the distribution becomes one-hot.

PlayDataConfig#save_policy_of_tau_1 = True means that the saved policy's tau is always 1.
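As a small standalone sketch of the distribution described above (NumPy only; the function and variable names are illustrative, not the project's):

import numpy as np

def policy_from_visits(visit_counts, tau):
    # Convert MCTS visit counts N(s,a) into a policy proportional to N^(1/tau).
    n = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:                     # late game: play greedily, i.e. a one-hot distribution
        pi = np.zeros_like(n)
        pi[np.argmax(n)] = 1.0
        return pi
    weighted = n ** (1.0 / tau)
    return weighted / weighted.sum()

# tau = 1 keeps the full visit-count distribution, which is what
# save_policy_of_tau_1 = True stores as training data.
print(policy_from_visits([10, 30, 60], tau=1))   # [0.1, 0.3, 0.6]
print(policy_from_visits([10, 30, 60], tau=0))   # [0.0, 0.0, 1.0]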

other important hyper-parameters (I think)

If you find a good parameter set, please share it in the GitHub issues!

PlayDataConfig

  • nb_game_in_file, max_file_num: the maximum number of games kept as training data is nb_game_in_file * max_file_num.
  • multi_process_num: number of processes used to generate self-play data.

PlayConfig, PlayWithHumanConfig

  • simulation_num_per_move: number of MCTS simulations per move.
  • c_puct: balance parameter between the value network and the policy network in MCTS.
  • resign_threshold: resign threshold.
  • parallel_search_num: balance parameter(?) between speed and accuracy in MCTS.
    • prediction_queue_size should be equal to or greater than parallel_search_num.
  • dirichlet_alpha: randomness (Dirichlet noise) parameter in self-play.
  • share_mtcs_info_in_self_play: extra option. If true, MCTS tree node information is shared among games during self-play.
    • reset_mtcs_info_per_game: reset timing of the shared MCTS information.
  • use_solver_turn, use_solver_turn_in_simulation: use the endgame solver from this turn onward; not used if None.
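Purely as an illustration of how these parameters fit together (the attribute paths mirror the config class and field names listed above, the play_data/play attribute names are assumptions, and the numbers are placeholders rather than recommended values):

config = Config()  # the project's Config object, as in the sketch further above

# Self-play data volume: at most nb_game_in_file * max_file_num games are kept.
config.play_data.nb_game_in_file = 100
config.play_data.max_file_num = 200          # 100 * 200 = 20000 games at most
config.play_data.multi_process_num = 8       # self-play worker processes

# Search behaviour per move.
config.play.simulation_num_per_move = 400
config.play.c_puct = 1.5
config.play.parallel_search_num = 8
config.play.prediction_queue_size = 8        # should be >= parallel_search_num
config.play.dirichlet_alpha = 0.5
config.play.share_mtcs_info_in_self_play = True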

TrainerConfig

  • wait_after_save_model_ratio: if greater than 0, then after saving a model the optimizer waits for this ratio of the time span between model saves. It might be useful if you run self-play and optimization on a single GPU.

Basic Usages

To train the model, execute Self-Play, Trainer and Evaluator.

Self-Play

python src/reversi_zero/run.py self

When executed, Self-Play starts using BestModel. If BestModel does not exist, a new random model is created and becomes BestModel.

options

  • --new: create a new BestModel.
  • -c config_yaml: specify a config YAML path to override the default settings of config.py.

Trainer

python src/reversi_zero/run.py opt

When executed, training will start. The base model is loaded from the latest saved next-generation model; if none exists, BestModel is used. The trained model is saved every 2000 steps (mini-batches), i.e. after each epoch.

options

  • -c config_yaml: specify a config YAML path to override the default settings of config.py.
  • --total-step: specify the total number of steps (mini-batches). The total step count affects the learning rate of training.

Evaluator

python src/reversi_zero/run.py eval

When executed, evaluation will start. It evaluates BestModel against the latest next-generation model by playing about 200 games. If the next-generation model wins, it replaces BestModel.

options

  • -c config_yaml: specify a config YAML path to override the default settings of config.py.

Play Game

python src/reversi_zero/run.py play_gui

Note: Mac pyenv environment

play_gui uses wxPython. It cannot run if your Python was built without the Framework option. Try the following pyenv install option.

env PYTHON_CONFIGURE_OPTS="--enable-framework" pyenv install 3.6.3

For Anaconda users:

conda install python.app
pythonw src/reversi_zero/run.py play_gui

When executed, an ordinary reversi board is displayed and you can play against BestModel. After BestModel moves, numbers are displayed on the board.

  • Top-left numbers (1) show the 'Visit Count (=N(s,a))' of the last search.
  • Bottom-left numbers (2) show the 'Q Value (=Q(s,a)) from the AI's side' of the last state and move. The Q values are multiplied by 100.
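As a rough standalone sketch of how these displayed numbers relate to the raw search statistics (function and variable names are illustrative, not the project's):

def board_annotations(visit_counts, q_values):
    # Top-left number (1): raw visit count N(s,a); bottom-left number (2): Q(s,a) scaled by 100.
    return [int(n) for n in visit_counts], [round(q * 100) for q in q_values]

print(board_annotations([120, 35], [0.42, -0.10]))  # ([120, 35], [42, -10])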

Run as NBoard2.0 Engine

NBoard is a very good reversi GUI that includes strong reversi engines; it runs on Windows, Mac, and Linux (JRE required).

You can add external engines that implement the NBoard Protocol.

How to add this model as an external engine to NBoard

  • (0) Launch NBoard from the command line (so that environment variables like PATH are available)

    • ex) java -jar /Applications/NBoard/nboard-2.0.jar
  • (1) Select the menu Engine -> Select Opponent...

  • (2) Click the Add Engine button

  • (3) Set the parameters:

    • Name = RAZ (for example)
    • Working Directory = PATH TO THIS PROJECT
    • Command = nboard_engine or bash nboard_engine. If you want to specify a config, use nboard_engine -c config/ch5.yml.
  • (4) Engine Level N is mapped to simulation_num_per_move = N*20

convenient way to evaluate your model

NBoard cannot play two different engines against each other (probably). However, it can use different engines as the play engine and the analysis engine.

So a convenient way to evaluate your model is, for example:

  • Select this engine as the play engine (or analysis engine) and another engine as the analysis engine (or play engine).
  • Check the menu View -> Highlight Best Move.
  • Start a game with User plays Black (or White).
  • Simply play the best move suggested by the analysis engine.

I am not fully confident about the hint protocol when used as an analysis engine (there is some odd behavior), but it works in my environment.

Auto Evaluation with other reversi AIs

reversi-arena is a system for evaluating reversi AIs that implement the NBoard Protocol. It is useful for playing many games against strong AIs like NTest.

View Training Log in TensorBoard

1. install tensorboard

pip install tensorboard

2. launch tensorboard and access by web browser

tensorboard --logdir logs/tensorboard/

Then access http://<The Machine IP>:6006/.

Trouble Shooting

If tensorboard fails to launch with an error, try creating another plain project that includes only tensorflow and tensorboard.

Then run

tensorboard --logdir <PATH TO REVERSI DIR>/logs/tensorboard/

Tips and Memo

GPU Memory

In my environment (GeForce GTX 1080, about 8GB of memory), running out of memory sometimes happens. Usually a lack of memory causes warnings, not errors. If an error occurs, try changing per_process_gpu_memory_fraction in src/worker/{evaluate.py,optimize.py,self_play.py}:

tf_util.set_session_config(per_process_gpu_memory_fraction=0.2)

A smaller batch_size will reduce the memory usage of opt. Try changing TrainerConfig#batch_size in NormalConfig.
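For reference, with TensorFlow 1.x and Keras such a memory cap is typically set as sketched below; this is an assumption about what tf_util.set_session_config does, not the project's exact code:

import tensorflow as tf
import keras.backend as K

def set_session_config(per_process_gpu_memory_fraction=0.2):
    # Limit how much of the GPU this process may allocate (TF1-style session config).
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=per_process_gpu_memory_fraction)
    K.set_session(tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)))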

Training Speed

  • CPU: 8 core i7-7700K CPU @ 4.20GHz
  • GPU: GeForce GTX 1080
  • 1 game in Self-Play: about 10~20 sec (simulation_num_per_move = 100, thinking_loop = 1).
  • 1 step(mini-batch, batch size=512) in Training: about 1.8 sec.


reversi-alpha-zero's Issues

Another resign condition?

@mokemokechicken

if self.play_config.resign_threshold is not None and \
        np.max(self.var_q[key] - (self.var_n[key] == 0)*10) <= self.play_config.resign_threshold:

Do you think it is reasonable to resign if min(varq, (varn==0)*10) > 0.8 (or 0.9)?

As you said, resigning below -0.8 helps a lot because it skips one-sided game states. I also verified the effect in my local run. A position above 0.8 is also a one-sided game, and I guess MCTS is strong enough to guide the AI to a win in that situation? Then the NN would be more accurate thanks to the smaller sample space. I know this is not mentioned in the DeepMind paper; maybe that is because of the large state space of Go. But maybe it's worth trying on Reversi? What do you think?
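One reading of the proposed condition, as a sketch only (var_q and var_n are NumPy arrays of per-move Q values and visit counts, the +10 mask mirrors the existing snippet, and 0.8 is the example threshold from this issue):

import numpy as np

def clearly_winning(var_q, var_n, threshold=0.8):
    # True when every *visited* move already has Q above the threshold;
    # unvisited moves are pushed up by +10 so they cannot drag the minimum down.
    return np.min(var_q + (var_n == 0) * 10) >= threshold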

Child seeds being identical to the parent seed may nullify the effect of multi-processing/threading

This is my finding from a toy version of my customization of Akababa's implementation, but I'm certain this is relevant to your implementation and at least worth asking here, as your seeding part isn't essentially different from his. I've noticed that, since each process/thread shares the same seed, each process/thread generates the same result (e.g. state transition during simulation) regardless of the underlying probability distribution. Some processes/threads are faster than others, so eventually you may not find this deterministic behavior, as some processes/threads yield different outputs at the same instant despite the identical pseudo-random sequence generation. Did you find this problematic? I don't think this has been mentioned yet.

Relevant sources:
best-seed-for-parallel-process
Random seed is replication across child processes #9650
seeding-random-number-generators-in-parallel-programs
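A common workaround, shown here as a standalone sketch rather than this repository's code, is to give every child process its own seed in a pool initializer:

import os
import random
import numpy as np
from multiprocessing import Pool

def init_worker():
    # Derive a fresh seed per child process; forked workers otherwise inherit the
    # parent's RNG state and can replay identical "random" choices.
    seed = int.from_bytes(os.urandom(4), "little")
    random.seed(seed)
    np.random.seed(seed)

def draw(_):
    return np.random.rand()

if __name__ == "__main__":
    with Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(draw, range(4)))  # values now differ across workers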

Gobang version

I made an AlphaZero implementation of Gobang based on TensorFlow.

Random flip and rotation when evaluate

I see you did random flip and rotation in the Player's expand_and_evaluate function.
I think this is needed when adding data for training, but it isn't necessary when evaluating to select an action.
What do you think?
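For reference, the eight symmetries of an 8x8 board can be enumerated as in the generic sketch below (not the project's expand_and_evaluate code); using only the identity transform during evaluation is essentially what this issue suggests:

import numpy as np

def dihedral_transforms(board):
    # Yield the 8 rotations/reflections of a square board plane.
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)

board = np.arange(64).reshape(8, 8)
print(len(list(dihedral_transforms(board))))  # 8 symmetric views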

ChessAlpha Zero development

Hello, @mokemokechicken and @Yuhang.

As I promised, I've made (in just one day, I had no more time) an adaptation of @mokemokechicken's reversi-zero project into a chess version: https://github.com/Zeta36/chess-alpha-zero

The project is already functional (in the sense that it doesn't fail and the three workers do their job), but unfortunately I have no GPU (just an Intel i5 CPU) and no money to spend on an AWS server or similar.
So I could only check self-play with the toy config "--type mini". Moreover, I had to lower self.simulation_num_per_move and self.parallel_search_num to 2.

In this way I was able to generate the 100 games needed for the optimization worker to start. The optimization process seemed to work perfectly, and the model reached a loss of ~0.6 after 1000 steps. I guess the model was able to overfit the 100 games from the earlier self-play.

Then I executed the evaluation process and it worked fine. The overfitted model was able to defeat the original random model 100% of the time (coincidence??).
Finally I checked the ASCII way of playing against the best model. It worked as expected. To enter our moves we have to use UCI notation: a1a2, b3b8, etc. More info here: https://chessprogramming.wikispaces.com/Algebraic+Chess+Notation

By the way, the model output is now of size 8128 (instead of 64 for reversi and 362 for Go), and it corresponds to all possible legal UCI moves in a chess game. I generate these new labels in the config.py file.
I should also note that the board state (and the player turn) is tracked with FEN chess notation: https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation

Here is for example the FEN for the starting position: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 (w is for white to move).

I also changed the resign function a little. Chess is not like Go or Reversi, where you always finish the game in more or less the same number of moves. In chess the game can end in many ways (checkmate, stalemate, etc.) and a self-play game could take more than 200 moves before reaching an ending position (normally a draw). So I decided to cut off play once a player has more than 13 points of advantage (this score is computed as usual from the value of the pieces: the queen is worth 10, rooks 5.5, etc.).

As you can imagine, with my poor machine I could not test the project beyond these tiny functionality checks. So I'd really appreciate it if you could spare some of your GPUs' free time to test this implementation more seriously. Both of you can of course become collaborators on the project if you wish.

Also, I don't know whether I introduced some theoretical bugs in this adaptation to chess, and I'd also appreciate any comments from your side on that.

Best regards!!

Implement resign

It would be better to implement resignation during self-play,
because it is not important to learn moves in one-sided games.
It may be a waste of capacity.

Is 0.55 too high for replace_rate given Reversi can have draw result?

I know the DeepMind paper says a replace_rate of 0.55. Considering that Go, under that rule, has no "draw" result, 0.55 is reasonable there. However, reversi does have draws, so is 0.55 too high as the replace rate?

With 0.55, the next generation has to beat the best model in most games; even drawing is not enough. That seems difficult. The best model is thus replaced less often, which means the self-play policy improves less, and then the training data improves less too.

Another question could be: in your practice, how often do draws happen when evaluating? In my local runs they happen at a rate of about 1/8 during evaluation. I am still at an early stage of training, and I also rewrote the self-play part, so I don't know whether this 1/8 rate is reasonable or not. Just curious what draw rate you got.

Thanks.

Performance Reports

Please share your reversi model achievement!
Models that are not reversi-alpha-zero are also welcome.
Battle record, configuration, repository url, comments and so on.

Is it multiple searching at the same time?

coroutine_list = []
for it in range(self.play_config.simulation_num_per_move):
    cor = self.start_search_my_move(own, enemy)
    coroutine_list.append(cor)
coroutine_list.append(self.prediction_worker())
loop.run_until_complete(asyncio.gather(*coroutine_list))

At first I thought this code was searching in simulation_num_per_move threads at the same time.
But I see the async functions are not called in multiple threads.
Is that right, and how could I search the tree in multiple threads?
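For context, asyncio coroutines like the ones above are interleaved on a single thread rather than run in parallel threads; a minimal standalone sketch (not the project's code):

import asyncio
import threading

async def simulate(i):
    # Coroutines only yield control at await points; they all run on one thread.
    await asyncio.sleep(0)
    return threading.get_ident()

loop = asyncio.get_event_loop()
thread_ids = loop.run_until_complete(asyncio.gather(*[simulate(i) for i in range(8)]))
print(len(set(thread_ids)))  # 1: a single thread served all 8 "parallel" searches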

maybe a bug here

if is_root_node and self.play_config.noise_eps > 0:  # Is it correct?? -> (1-e)p + e*Dir(alpha)
    if self.play_config.dirichlet_noise_only_for_legal_moves:
        noise = dirichlet_noise_of_mask(legal_moves, self.play_config.dirichlet_alpha)
    else:
        noise = np.random.dirichlet([self.play_config.dirichlet_alpha] * 64)
    p_ = (1 - self.play_config.noise_eps) * p_ + self.play_config.noise_eps * noise

# re-normalize in legal moves
p_ = p_ * bit_to_array(legal_moves, 64)

p_ = self.normalize(p_, temperature)

Maybe a bug here: p_ is NOT yet a probability distribution over legal moves at this point; it is only normalized in the code afterwards. But in the dirichlet_noise_only_for_legal_moves == True case, the Dirichlet noise already is a probability distribution over legal moves. In other words, you are adding Dirichlet noise to a non-probability-distribution, which I believe is not consistent with the AlphaGo Zero paper.

I happened to find that my implementation had this bug too, and after I fixed it, my AI's strength improved significantly.
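One way to express the fix described above, as a sketch rather than the actual patch: restrict and renormalize the prior over the legal moves first, and only then mix in the Dirichlet noise, so both terms are probability distributions over the legal moves (legal_mask is assumed to be a 0/1 NumPy array of length 64):

import numpy as np

def mix_dirichlet_noise(p, legal_mask, alpha=0.5, eps=0.25):
    # 1) Restrict the network prior to legal moves and re-normalize it.
    p = p * legal_mask
    p = p / p.sum()
    # 2) Draw Dirichlet noise over the legal moves only.
    noise = np.zeros_like(p)
    noise[legal_mask > 0] = np.random.dirichlet([alpha] * int(legal_mask.sum()))
    # 3) Mix two proper distributions: (1 - eps) * p + eps * Dir(alpha).
    return (1 - eps) * p + eps * noise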

a question about reloading model

self.try_reload_model()

When reloading, some self-play games may still be in progress. Will this reloading be OK? Some games will use the old model in the first half of the game and the new model in the second half. Some more discussion can be seen here. Another fork has a fix for this, but it causes some CPU idle time.

Or are you aware of this issue and think it is OK? As we can see, your model is making good progress as well...

About the time of self-play

@mokemokechicken It seems that one game takes me about 108s with the default hyper-parameters. My CPU is also an 8-core i7-7700K @ 4.20GHz, and my GPU is a GeForce GTX 1080 Ti. Were any hyper-parameters changed without being mentioned in the README? Thanks!

Baseline Comparison?

Is there a baseline for comparing the learned model, e.g. benchmark software to evaluate against? It would be useful to know how effective the learning algorithm actually is.

For example, what do you mean by "Won the App LV x?" Does it mean that if the model beat the app even once, it counts as a win even if it loses the other times?

I downloaded your "best model" and "newest model", and played both networks against grhino AI (level 2). Sadly, both networks got destroyed by grhino on multiple tries. If you have a benchmark of levels to beat before grhino, that would be really helpful

404 in the new download script

I tried to run the new download_newest_model_as_best_model.sh script, but both of the URLs it tries to fetch from return 404: Not Found.

About using different players for training game generation

So I have a question which is related to another similar project for Go: https://github.com/gcp/leela-zero
In that project, self-play games are generated by the same player playing against itself, so Black and White share the same random seed and share a search tree through tree reuse.
If I'm reading the code right, in reversi-alpha-zero two independent players are used to generate self-play games, each with its own separate search tree and a different random seed.
I am very curious about the effects of these two different approaches. What have your results been?

Drop wxPython?

Installing wxPython is a nightmare on many platforms. Users usually cannot get an out-of-the-box installation without quite a lot of searching. A web-based UI could be a friendlier replacement.

Failed running GUI

Installed everything.

Is the failure happening after the TF warnings?

If I want to get rid of the TF warnings (they are just warnings, right?), what should I do?

Please help,
thanks.

> python src/reversi_zero/run.py play_gui
2017-12-10 16:02:48,794@reversi_zero.manager INFO # config type: normal
Using TensorFlow backend.
2017-12-10 16:03:02,034@reversi_zero.agent.model DEBUG # loading model from /Users/john/dev/igo/reversi0/reversi-az/data/model/model_best_config.json
2017-12-10 16:03:04.151972: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04.152015: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-10 16:03:04,714@reversi_zero.agent.model DEBUG # loaded model digest = ae1dd819bdaf71fcc6e95e8b64bb53db6ca5fa63398fdff2582ab14ed9c87109
This program needs access to the screen. Please run with a
Framework build of python, and only when you are logged in
on the main display of your Mac.

About MCTS

virtual_loss = self.config.play.virtual_loss
self.var_n[key][action_t] += virtual_loss
self.var_w[key][action_t] -= virtual_loss
leaf_v = await self.search_my_move(env) # next move

I see that N and W are updated with the virtual loss when a node is selected, in order to discourage other threads from simultaneously exploring the identical variation (as in the paper).

  1. Why isn't Q updated from W and N at this time?
  2. Shouldn't it be W = W + virtual_loss when the player is White?
  3. Why isn't the tree shared between the two players?
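For context, the usual virtual-loss pattern in parallel MCTS applies a penalty before descending an edge and reverts it when the real leaf value is backed up; a generic sketch, not this repository's exact update rule:

VIRTUAL_LOSS = 3  # illustrative value

def apply_virtual_loss(var_n, var_w, key, action):
    # Pretend this edge just lost a playout so concurrent searches avoid it.
    var_n[key][action] += VIRTUAL_LOSS
    var_w[key][action] -= VIRTUAL_LOSS

def backup(var_n, var_w, key, action, leaf_v):
    # Replace the fake loss with the real visit and backed-up value.
    var_n[key][action] += 1 - VIRTUAL_LOSS
    var_w[key][action] += leaf_v + VIRTUAL_LOSS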

What is action_by_value?

I see that the ReversiPlayer.action function tries to select action_by_value when turn > change_tau_turn.

action = int(np.random.choice(range(64), p=policy))
action_by_value = int(np.argmax(self.var_q[key] + (self.var_n[key] > 0)*100))
if action == action_by_value or env.turn < self.play_config.change_tau_turn or env.turn <= 1:
    break

I think action_by_value is just the first index with a non-zero N.
Why did you select the action like this?

how much does share_mtcs_info_in_self_play contribute to strength?

@mokemokechicken Now your model plays NTest:13 to a draw. Good job!

I noticed the design of share_mtcs_info_in_self_play. It shares MCTS info among different games of the same model. This differs from the AlphaGo Zero/AlphaZero papers, but I imagine it improves self-play quality a lot. How does it work out in practice?

And how much memory usage does it add?

Cannot use multiple GPUs in self-play

@mokemokechicken @gooooloo I added one more GPU. However, the added GPU has a 0% usage rate, and doubling prediction_queue_size, parallel_search_num and multi_process_num doesn't make a difference. Also, has asynchronous training (one GPU doing gradient updates while other GPUs do self-play, with weights synced after each gradient update) been implemented in @gooooloo's algorithm?

Great job!!

Wonderful job, friend.

Can you please tell us what performance you got with this approach? Do you have some statistics or anything like that?

Regards!

GPU ResourceExhaustedError after many times of Keras model.load() during self-play

In challenge 2, the AlphaZero method, self-play always uses the newest next_generation model. When running both self and opt workers, the self worker will always load the newest next_generation model saved by the opt worker when starting a new game.

Over a long period of time (say, 1-2 days), the self worker will load a new model so many times that it will cause ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,256,256].

I found that this is because Keras doesn't free the GPU memory occupied by the old model weights when loading new model weights. The following simple modification of src/reversi-zero/worker/self-play.py will quickly reproduce this error by running python src/reversi_zero/run.py self:

...
import keras.backend as K
...
def start(config: Config):
    tf_util.set_session_config(per_process_gpu_memory_fraction=0.05)  # make the error occur faster
    return SelfPlayWorker(config, env=ReversiEnv()).start()
...
class SelfPlayWorker:
    ...
    def start(self):
        if self.model is None:
            self.model = self.load_model()

        self.buffer = []
        idx = 1

        while True:
            start_time = time()
            # env = self.start_game(idx)
            end_time = time()
            logger.debug(f"play game {idx} time={end_time - start_time} sec, ")
                         # f"turn={env.turn}:{env.board.number_of_black_and_white}")
            if True or (idx % self.config.play_data.nb_game_in_file) == 0:
                # K.clear_session()
                load_best_model_weights(self.model)  # repeatedly loading even the same weight will produce this error
                idx += 1
                continue

I've run with different per_process_gpu_memory_fraction values and found that the error occurs after exactly the corresponding number of model loads. For example, per_process_gpu_memory_fraction=0.05 on my GTX 960 with 4037MB of GPU memory crashes after exactly floor(4037 x 0.05 / 46) = 4 model loads (since the model weight h5 file is 46MB).

There's a simple way to fix this in Keras. Just run keras.backend.clear_session() before loading new weights. In the above example, uncommenting K.clear_session() solves this error.
I've opened a pull request that fixes this in lib/model_helper.py.
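The fix boils down to something like the sketch below before each reload; it assumes the model is stored as a config JSON plus a weight file, as the data/model/model_best_config.json path elsewhere on this page suggests, and the real change lives in lib/model_helper.py:

import keras.backend as K
from keras.models import model_from_json

def reload_model(config_path, weight_path):
    # Drop the old graph/session so the GPU memory held by the previous weights is
    # actually released, then rebuild the model in a fresh graph and load the new weights.
    K.clear_session()
    with open(config_path) as f:
        model = model_from_json(f.read())
    model.load_weights(weight_path)
    return model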

automatically ntest

@mokemokechicken in ReadMe you said:

NBoard cannot play with two different engines (maybe).

I feel it can, in another way. From the NBoard source code, it seems it just runs an ntest executable and then communicates with it via the NBoard protocol. Since you understand that protocol well, you could also launch ntest and communicate with it from your own code. Then you could play your model against ntest without manual intervention. If you want to check game details, just save the moves and replay them.

Replacing CNN with decoder-only Transformer for possible acceleration?

As I mentioned before, I'm working on applying AlphaZero to text generation using a decoder-only Transformer instead of a CNN. My implementation is nearly finished, but I haven't yet tested its performance on text generation. A Transformer can also be used for board games like reversi, since you can represent each move as a symbol (for example, any move in reversi can be represented by a number from 0 to 63). Obviously, this doesn't carry any geometric information, but it's interesting to see whether that information is really so important compared with the speed advantage: the layer-wise per-move FLOPS is now roughly bs x 4 x hidden_dim^2 instead of bs x 8^2 x hidden_dim^2 x 3^2, which is 144x faster. Any questions? If you're interested, I'll notify you as soon as my implementation works, so that you and I can extract the necessary components to apply to your reversi project.
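For reference, the claimed speed-up follows directly from the two per-move FLOPS estimates quoted above (the batch size cancels out; hidden_dim is arbitrary):

hidden_dim = 256
cnn_flops = 8**2 * hidden_dim**2 * 3**2   # 8x8 board positions, 3x3 convolution
transformer_flops = 4 * hidden_dim**2     # decoder-only per-move estimate from the post
print(cnn_flops // transformer_flops)     # 144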

About the optimizer?

  1. I found that the optimizer only loads data at the beginning. Will it reload new play data during training?
  2. I hope more logging can be made available, such as loss per step.

AlphaZero Approach

Hi,

Great work with your repository, impressive stuff. I'm just interested to know: when you run the software with self-play and optimisation at the same time, how many self-play games do you aim to complete between the optimiser releasing new models? I ask because I would have thought that if not enough games are completed, the model would over-fit.

Thanks, Jack
