kblomdahl / dream-go Goto Github PK

Artificial go player based on reinforcement and supervised learning

License: Apache License 2.0

Rust 74.54% Python 22.06% Makefile 0.30% Shell 0.16% Dockerfile 0.11% C++ 2.05% Hack 0.78%

go ai reinforcement-learning supervised-learning cuda cudnn rust python tensorflow weiqi

dream-go's Introduction

Dream Go - All day, every day

Dream Go is an independent implementation of the algorithms and concepts presented by DeepMind in their Master the Game of Go without Human Knowledge paper with a few modifications to (maybe) make it feasible to develop a strong player without access to a supercomputer on the scale of Sunway TaihuLight.

Human games are used to bootstrap the network weights.
Additional (synthetic) features inspired by AlphaGo and DeepForest are used during training and inference.
A self learning procedure inspired by Thinking Fast and Slow with Deep Learning and Tree Search is used.

Dependencies

CUDAv11 and cuDNNv8 (or higher)
NVIDIA GPU (Compute Capability 6.1 or higher)

Dev Dependencies

If you want to run the supervised or reinforcement learning programs to improve the quality of the weights or help development of the agent then you will need the following:

Python 3.6 with Tensorflow
Rust (nightly)

Training

To bootstrap the network from pre-generated data you will need an SGF file where each line contains a full game-tree, henceforth called big SGF files. If you do not have access to such a file you can use the tools/sgf2big.py tool to merge all SGF files contained within a directory to a single big SGF file. You may also want to do some data cleaning and balancing (to avoid bias in the value network) by removing duplicate games and ensuring we have the same amount of wins for both black and white.

./tools/sgf2big.py data/kgs/ > kgs_big.sgf

cat kgs_big.sgf | sort | uniq | shuf | ./tools/sgf2balance.py > kgs_bal.sgf

This binary file can then be feed into the bootstrap script which will tune the network weights to more accurately predict the moves played in the original SGF files. This script will automatically terminate on convergence. You can monitor the accuracy (and a bunch of other stuff) using Tensorboard, whose logs are stored in the logs/ directory. The final output will also be stored in the models/ directory.

cd contrib/trainer
python -m dream_tf --start kgs_big.sgf

tensorboard --logdir models/

When you are done training your network you need to transcode the weights from Tensorflow protobufs into a format that can be read by Dream Go, this can be accomplished using the --dump command of the bootstrap script:

python -m dream_tf --dump > dream-go.json

Reinforcement Learning

Two reinforcement learning algorithms are supported by Dream Go. They differ only marginally in implementation but have vastly different hardware requirements. Which of the two algorithms is the best is currently unknown, but I would recommend Expect Iteration because you most likely do not have the hardware requirements to run the AlphaZero algorithm:

AlphaZero

If you want to use the AlphaZero algorithm then you need to start by generating self-play games. The self-play games generated by Dream Go are different from normal games played using the GTP interface in several ways, most notably they are more random (to encourage exploration, and avoid duplicate games), and a summary of the monte-carlo search tree is stored for each position. This monte-carlo summary is then used during training to expose a richer structure to the neural network.

This can be accomplished using the --self-play command-line option. I also recommend that you increase the --num-threads and --batch-size arguments for this since the defaults are tuned for the GTP interface which has different (real time) requirements. This program will generate 25,000 games (should take around 14 days on modern hardware):

./dream_go --num-threads 32 --batch-size 32 --self-play 25000 > self_play.sgf

The network should now be re-trained using this self-play, this is done in the same way as during the supervised training by first performing some basic data cleaning to avoid bias, converting the games to a binary representation and then training the network using TensorFlow. You should have at least 150,000 games in total to acquire a good result:

sort < self_play.sgf | uniq | shuf | ./tools/sgf2balance.py > self_play_bal.sgf

cd contrib/trainer/ && python3 -m dream_tf --start self_play_bal.sgf

Expert Iteration

The training procedure for Expert Iteration is almost the same as for AlphaZero with two exceptions:

We generate games with --num-rollout 1 and --ex-it. These are self-play games without any search, so they are about 800 to 1,600 times faster to generate, but of lower quality.
We generate the monte-carlo search tree during data extraction using the --ex-it switch only for examples that actually end-up as examples for the neural network.

./dream_go --num-games 32 --num-threads 32 --batch-size 32 --num-rollout 1 --ex-it --self-play 200000 > policy_play.sgf

sort < policy_play.sgf | uniq | shuf | ./tools/sgf2balance.py > policy_play_bal.sgf

cd contrib/trainer/ && python3 -m dream_tf --start policy_play_bal.sgf

For the values provided in this example, which generate 200,000 examples for the neural network it should take about 1 days to generate the required data (from 200,000 distinct games).

Roadmap

1.0.0 - Public Release
0.7.0 - Acceptance
- First version with a network trained from self-play games
0.6.3 - Unravel
- The engines plays more enjoyable with kgs-genmove_cleanup
- Bug fixes
0.6.2 - Unfolded
- Improved training procedure.
- Change the input features to include more liberties.
- Decrease memory use by 80%, and runtime performance by 25%.
- Improved performance with Tensor Cores.
0.6.1 - Emerged
- Improved neural network architecture
- Improved reinforcement training environment
0.6.0 - Emergent
- Time and tournament commands for the GTP interface
- Improved neural network training
- Improved performance with DP4A
- Multi GPU support
0.5.0 - Assessment
- Optimize the monte carlo tree search parameters against other engines
- Optimize neural network size for best performance vs speed ratio
0.4.0 - Awakening
- GTP interface
0.3.0 - Slow-wave sleep
- Monte carlo tree search for self-play
0.2.0 - Light Sleep
- Self-play agent without monte carlo tree search
- Reinforcement learning using self-play games
0.1.0 - Napping
- Supervised learning using a pre-existing dataset

License

Apache License 2.0

dream-go's People

Contributors

Stargazers

Watchers

Forkers

amitdo icrown alreadydone boomerchi gaybro8777 wqzsscc cloudcell iq-scm

dream-go's Issues

Use cuTLASS instead of cuDNN for convolutions

Replace all of our usages of cuDNN with cuTLASS. This would have several advantages:

cuTLASS has a BSD-3 license, this would allow us to distribute statically linked binaries without legal trouble.
cuTLASS exposes a lower-level interface, this would allow us to customise the code more, releasing many of the modelling restrictions we have inherited from cuDNN.
cuTLASS is a CUDA library, this would allow us to combine many of our kernels into larger blocks, reducing the overhead associated with launching a new kernel.

Drawbacks

cuTLASS is a C++ template library, so we need to write the CUDA kernels in C++, and then link them to the Rust code base. Introducing one more language to the code base.
cuTLASS is only cited as having ~95% of the performance of cuBLAS for integer kernels.
We need to implement the convolution kernels surrounding the GEMM ourselves.

Usages of cuDNN

Beyond a lot of boilerplate code necessary to initialise tensor descriptors, etc, we only use the following functions from cuDNN or cuBLAS:

cudnnConvolutionBiasActivationForward (for 1x1 and 3x3 kernels)
cudnnActivationForward
cudnnAddTensor
cudnnScaleTensor
cudnnSoftmaxForward
cublasGemmEx

About MCTSnet

Hello, have you successfully replaced MCTS with MCTSnet?I have applied MCTSnet in the Sokoban, which is a GYM based environment, but the effect is not very good. The accuracy rate of each step is only 70%, and the victory rate is only 3%. Could you tell me how your effect is，thx.

Scoring and `kgs-genmove_cleanup` improvements

I suggest we make two changes to the greedy scoring, and the kgs-genmove_cleanup code, such that:

If there is a search tree available, use that instead of playing according to the policy when scoring.
When generating moves for scoring or clean-up do not play inside of eyes for pass alive groups [1] [2] [3].

The first should improve the accuracy of the scoring when there are dead groups involved, since the AI usually reads that a group is dead and then resigns. But since the scoring does not do any search, just greedy policy play it judge that the group is alive, scoring the game incorrectly.

The second point contains mostly quality of life improvements, as it avoid annoying "end game" moves where the engine would be busy filling eye space. But could also influence scoring, since it prevents the engine from filling its own eyes.

[1] "Life in the Game of Go", http://webdocs.cs.ualberta.ca/~games/go/seminar/2002/020717/benson.pdf
[2] "When one eye is sufficient: a static classification", https://link.springer.com/content/pdf/10.1007/978-0-387-35706-5_8.pdf
[3] "Judgment of Static Life and Death in Computer Go Using String Graph", https://link.springer.com/chapter/10.1007/11539117_78

Add some integration tests to detect bad network behaviour

Add an integration test that checks what policy and value is predicted for some board positions with the following features:

Big dead dragon (with one eye), check that it scores the game correctly.
Broken ladders, check that the policy does not suggest playing it out.

Some suggested SGF files containing either broken ladders (both from leela-zero, but we should check them anyway) or big dragons that are dead.

big_dead_dragon.sgf.gz
ladder_1.sgf.gz
ladder_2.sgf.gz

Compilation error "no method names chars" E0599

Trying to compile the source code from the zip of release 0.5.0, on Ubuntu 18. Full log of errors copied below. Any suggestions? I'm completely new to the rust language: figuring out the difference between the "stable" and "nightly" compilers was already a big effort for me...

DreamGo> tar zxf v0.5.0.tar.gz
DreamGo> cd dream-go-0.5.0/
dream-go-0.5.0> rustup override add nightly
info: using existing install for 'nightly-x86_64-unknown-linux-gnu'
info: override toolchain for '/opt/DreamGo/dream-go-0.5.0' set to 'nightly-x86_64-unknown-linux-gnu'

nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.30.0-nightly (2ab3eba30 2018-09-14)

dream-go-0.5.0> cargo build
Compiling void v1.0.2
Compiling libc v0.2.34
Compiling bitflags v0.4.0
Compiling lazy_static v1.0.0
Compiling encode_unicode v0.1.3
Compiling unicode-width v0.1.4
Compiling utf8-ranges v1.0.0
Compiling regex-syntax v0.4.1
Compiling num-traits v0.1.41
Compiling unreachable v1.0.0
Compiling unreachable v0.1.1
Compiling thread_local v0.3.5
Compiling memchr v2.0.1
Compiling num_cpus v1.7.0
Compiling nix v0.5.1
Compiling time v0.1.38
Compiling rand v0.4.1
Compiling ordered-float v0.5.0
Compiling aho-corasick v0.6.4
Compiling threadpool v1.7.1
Compiling rustyline v1.0.0
Compiling regex v0.2.3
Compiling dream_go v0.5.0 (/opt/DreamGo/dream-go-0.5.0)
warning: unused import: Read
--> src/nn/loader.rs:18:26
|
18 | use std::io::{BufReader, Read};
| ^^^^
|
= note: #[warn(unused_imports)] on by default

error[E0599]: no method named chars found for type std::io::BufReader<std::fs::File> in the current scope
--> src/nn/loader.rs:76:45
|
76 | let mut iter = BufReader::new(file).chars().map(|ch| ch.unwrap());
| ^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0599.
error: Could not compile dream_go.

To learn more, run the command again with --verbose.

Adjust neural network hyper-parameters

The regularization coefficient 1e-4 that DeepMind recommended was for their 40 layers architecture. We use the 20 layers architecture and therefore should have a larger regularization coefficient.

While we are at it, change the learning rate schedule and optimizer to Adam so that we can reach a local minimum faster as it takes ages on my non-supercomputer to train today.

Add support for training on multiple GPU's.
Change the learning rate schedule to something more reasonable.
Change the optimizer to Adam.
Adjust regularization constant.

./dream_go --extract kgs_bal.sgf > kgs_big.bin?

how to make this command?
there is no dream-go exe file..
./dream_go --extract kgs_bal.sgf > kgs_big.bin

Move the `Param` object out of the `mcts` package and make it read from the command line

Currently all input to dream_go is controlled from environment variables, this is not terrible but to behave more as people would expect we should read configuration from the command line instead. To accomplish this I suggest we move the Param structure that is currently in the mcts package to util and then generalise some of the parameters to read from the command line instead of the environment variables.

It is also unclear whether we need to keep the Param object as a trait since if we are reading from the command-line then we can branch directly on whether we are in tournament or self-play mode (which currently decides on how much Dirichlet noise we as well as some other factors).

MCTSnet instead of monte carlo tree search

I suggest that we implement MCTSnet instead of traditional tree search. This should provide an improvement because (i) it allows for richer statistics to be extracted from each terminal game¹, and (ii) it allows us to compute better policy values, by keeping two distinct readout and policy networks².

¹ Traditional tree search only provide a binary win / lose statistic.
² This is a bit complicated to explain, but in traditional tree search use estimated win rate as an indication of which candidate moves should be explored further, but this is not necessarily a good indication of what moves should be explored. See, for example, first play urgency [1] for further discussions on why this is a good thing.

MCTSnet architecture

Tensorflow training script

To implement MCTSnet in tensorflow there are two complications:

We need to update the board state during the tree search.
This is a problem because we cannot calculate a gradient through the update step. This can be solved by adding a likelihood REINFORCE gradient to the gradient of the loss function:
```
[ Σ ∂ / (∂ θ) log π(h) ] · loss(x)
```
We need to implement a tree memory in tensorflow.
This can probably be done using sparse lookup's, and while loops. But it might be much easier to use eager execution to support the dynamic control flow. Using eager execution would provide lower performance however as it would force us to use a batch size of one.

This provides the following mock-up code for the training script, where step 1 and 2 can be batched by using some background threads:

Create a dataset that yields the triplet (state, logits, winner).
Map each entry in the dataset to 25 entries, one for each iteration of the MCTSnet:
- Probe into the search tree, keeping track of each ∂ / (∂ θ) log π(h) encountered.
- Calculate the loss of the readout network, and its gradient.
- Calculate the REINFORCE gradient, where R is the difference between the final loss and the loss of the current step:
  [ Σ ∂ / (∂ θ) ρ(...) ] + [ Σ ∂ / (∂ θ) log π(h) ] · R
Batch, and apply the average of the gradients to the parameters.

Network architectures

Backup (β)

The backup network is responsible for combining statistics from a node in the search tree to its parent. The architecture is a gated recurrent unit (GRU):

Learned simulation policy (π)

The readout network is responsible for determining what moves to actually play given some statistics. The architecture is the same as the policy head in the AlphaZero article.

Embedding (ε)

This network is responsible for summarizing a board state into a set of statistics with shape [2, 361]. The architecture is the same as the residual tower in the AlphaZero article.

Readout (ρ)

The readout network is responsible for determining what moves to actually play, and determining the winrate, given some statistics. The architecture is equivalent to the policy head, and the value head in the AlphaZero article.

TODO

Re-write the training script to implement MCTSnet.
Restructure the nn module to support multiple different graphs in a flexible manner.
Add the additional networks to the nn module.
Re-write the mcts module to implement MCTSnet.

Citations

[1] "Modifications of UCT and sequence-like simulations for Monte-Carlo Go", Yizao Wang and Sylvain Gelly, http://dept.stat.lsa.umich.edu/~yizwang/publications/wang07modifications.pdf
[2] "Gradient Estimation Using Stochastic Computation Graphs", John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel, https://arxiv.org/pdf/1506.05254.pdf

Add Sabaki Integration

Add commands that integrate with the JSON Sabaki interface. This is desirable to make it easier to debug, and find, issues:

heatmap - output the neural network activation for each vertex on the board.
sabaki-genmovelog - output all considered variations during the MCTS

Reference implementation of the protocol (for leelaz) can be found here.

Investigate SWISH as activation function in cuDNN

In cuDNN 8.2 the swish activation function was introduced, this is an activation function that has been very successfully applied in networks such as MobileNetV3 and EfficientNet. It is worth investigating locally as well to see if it yields better performance than rectified linear units.

Performance

When plugging the swish activation function into the cudnn_types.rs benchmark (replacing relu), we some unexpected results that probably indicates that something is broken :)

Further investigations using nvprof indicates that no kernels were launched in the swish examples, so it is not quite a viable candidate yet unless we want to use the backend API.

Relu

test f16_nchw_compute_type_f16 ... bench:      81,782 ns/iter (+/- 17,856)
test f16_nchw_compute_type_f32 ... bench:     153,531 ns/iter (+/- 47,599)
test f16_nhwc_compute_type_f16 ... bench:      97,342 ns/iter (+/- 1,885)
test f16_nhwc_compute_type_f32 ... bench:     120,626 ns/iter (+/- 665)
test i8x32_nhwcvectc           ... bench:      84,497 ns/iter (+/- 1,287)
test i8x32_nhwcvectc_noreorder ... bench:      81,282 ns/iter (+/- 502)

Swish

running 2 tests
test f16_nchw_compute_type_f16 ... bench:      20,640 ns/iter (+/- 711)
test f16_nchw_compute_type_f32 ... bench:      20,627 ns/iter (+/- 181)
test f16_nhwc_compute_type_f16 ... bench:      20,473 ns/iter (+/- 319)
test f16_nhwc_compute_type_f32 ... bench:      20,703 ns/iter (+/- 1,207)
test i8x32_nhwcvectc           ... bench:      20,280 ns/iter (+/- 77)
test i8x32_nhwcvectc_noreorder ... bench:      20,274 ns/iter (+/- 124)

Monte-Carlo tree search as regularized policy optimization

The combination of Monte-Carlo tree search
(MCTS) with deep reinforcement learning has
led to significant advances in artificial intelligence. However, AlphaZero, the current stateof-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero’s
search heuristics, along with other common ones
such as UCT, are an approximation to the solution of a specific regularized policy optimization
problem. With this insight, we propose a variant
of AlphaZero which uses the exact solution to
this policy optimization problem, and show experimentally that it reliably outperforms the original
algorithm in multiple domains.

http://proceedings.mlr.press/v119/grill20a/grill20a.pdf
http://proceedings.mlr.press/v119/grill20a/grill20a-supp.pdf

Prune nodes from the search tree that are obviously bad

A lot of game engines in the 12th Computer Go UEC Cup [1] used a heuristic to prune "obviously bad" nodes from the search tree. This idea might be worth pursuing in Dream Go too.

[1] http://entcog.c.ooco.jp/entcog/new_uec/en/

Reduce the number of OS threads used during search

There are roughly three places where threads are spawned in Dream Go:

We extract features from multiple games in parallel in DataSet or self_play. Each game can spawn multiple search threads.
We evaluate several possible moves from the search tree in parallel.
We evaluate several batches of possible moves in parallel on the GPU.

There are several issues with this approach, the first is that we usually want to play a huge number of games at the same time to ensure good GPU saturation. This means running at least 512 search threads in total, which needs to be saturated by playing multiple games since we can only reliably saturate 4 search threads per game (due to bottlenecks during the opening and end-game). So if we want to run 512 search threads we need to run 640 threads in total (not counting the batch threads, which is fairly small anyway).

This is a fairly large number of context switches per second, and can cause a fairly large amount of synchronisation overhead.

The second issue is that there is currently only one knob to tune all of this, the NUM_THREADS environment variables, which currently gets scaled by different factors in different parts of the program to determine the number of threads to actually use.

To fix this, while keeping the code somewhat clean, we would need to change the architecture to rely on one of the following two popular approaches:

Future based event loop
Cooperative Threading using fibers / coroutines

In either case we would spawn NUM_THREADS number of actual working threads that schedule the event loop of fibers using a workstealing algorithm. This would allow us to spawn infinity number of threads in each of the mentioned components without having to worry about the overhead (and locks can be replaced with spinlocks since we do not need to worry about a badly timed context switch).

An alternative would be to just introduce additional environment variable that we read the number of threads to use in each component from. This would allow us to control everything perfectly without the complexity of the previous approach, but we would still suffer from the overhead associated with OS threads.

These variables could of course be introduced in the previous approach to to reap the benefit of both.

Extraction of internal features from the tower representation

We should investigate the internal features of the tower representation. Doing this should provide several benefits:

It should help us identify good features that we can compute directly, and feed into the neural network.
It should help us improve the neural network architecture.

Store examples in a more compact format

The features today are stored in a fairly verbose format, which takes up a lot of disk space for any non-trivial dataset. Today we use the following format, which uses 2 bytes per floating point number:

struct Example {
    features: [f16; 12996],
    value: f16,
    policy: [f16; 362]
}

The total size of such an example is 26718 bytes. A few observations can be made about this representation:

features contains only 0 or 1 so it could be compacted to a bitset.
value is always -1 or +1 so it could be compacted to a single bit.
policy contains true floating point number between 0 and 1 but could be quantized to an u8.

Such a structure would have a total size of 1987 bytes, a reduction of 13.45x. This is pretty non-trivial, but would come at the cost of a higher runtime cost to decode such an example in tensorflow, however since tensorflow is GPU bound at the moment this should not be a problem.

Any features we have already computed can be automatically converted to the new format with almost no loss of information. The only information lost is in the policy due to the quantization, but that should be minimal.

Compress the MCTS tree

Today a single instance of a Node is 10,688 bytes. For an average search tree, that contains 1,600 nodes, that is about 16 Mb. This is a bit excessive, and also causes some performance concerns during deallocation as:

When freeing a node we iterate over each of its child, of which each node has 362. A lot of these checks are unnecessary.
free() is pretty expensive, and we bulk free entire sub-trees at a time.

To avoid all of these issues I suggest that we change Node to a sparse representation, such that:

We store a lot of the f32 as f16.
We store some of the i32 as i16.
We store a smaller sparse version of many of the Node fields on the stack, or a full version on the heap.

trait NodeInner { ... }

struct SmallNodeImpl {
    count: [i32; 8],
    vcount: [i16; 8],
    value: [f16; 8],
    expanding: [bool; 8],
    children: [*mut Node; 8],
    indices: [i32; 8]
}

impl NodeInner for SmallNodeImpl { ... }

struct BigNodeImpl {
    count: [i32; 368],
    vcount: [i16; 368],
    value: [f16; 368],
    expanding: [bool; 362],
    children: [*mut Node; 362]
}

impl NodeInner for BigNodeImpl { ... }

enum NodeImpl {
    Small(SmallNodeImpl),
    Big(Box<BigNodeImpl>)
}

pub struct Node {
    lock: Mutex,
    pub color: Color,
    pub pass_count: i32,
    pub total_count: i32,
    pub vtotal_count: i32,
    pub prior: [f16; 368],

    pub inner: NodeImpl
}

This should reduce the average size to 928 bytes (a 10x reduction) and the run time of freeing a single Node by 46x. We will still need to bulk free a lot of these nodes at the same time, to solve this problem I suggest we keep a global object pool.

Reinforcement Learning - Round 2

See this wiki page for the updated ELO ratings and status of the training procedure.

Changes from the last reinforcement learning procedure:

All games will be correctly scored from the beginning
In the previous reinforcement learning iteration the games were inconsistently scored, the human games we bootstrapped from were played using Chinese rules, but we were playing with TT-rules, we later switch to Chinese rules but due to implementation issues the scoring was inconsistent for a while.
Increased temperature during policy play
We now use a higher initial temperature during policy play that will diminish over the course of the game. This should mean that we get more diversity between the game while still avoiding it tenuking during important fights during the middle game.
First Play Urgency
We are using a different FPU implementation this time, which means we should get much better exploration, however to ensure we still reading deeply we had to increase the number of playouts from 1,600 to 3,200.
Warm-start of each neural network optimization
In the previous reinforcement learning iteration each neural network was trained from scratch on the 100,000 most recent games. Now we instead perform a warm-start using the previous networks weights and only fine tune the next network using the most recent games. This will increase the training performance as we do not have to put effort into learning the same basic concepts each time, and should also help the network avoid forgetting important concepts.

We hope that these combined will ensure we have both high quality and diverse learning data, which should allow us to go further than before. The neural network architecture is still the same as before (9x128), with only minimal differences such as the fact that all rectified linear units are now clipped at three (to help with quantization).

Play multiple self-play games at the same time

In addition to performing parallel probes into each search tree, also run several parallel self-play games.

Doing this would have the advantage of a better saturation of the neural network, since the network throttle during start-up and end-game because there are not enough valid moves to consider in parallel. Playing multiple games at the same time would help with this situation.

I suggest the following given a batch_size and thread_count by the user:

Play up to batch_size number of games in parallel.
Run up to thread_count / batch_size parallel probes into each search tree.
Run up to thread_count / batch_size neural network workers to compute multiple batches in parallel.

This would also work with the GTP interface since part two and three will essentially allow for larger batch sizes on-demand while still being more responsive that a fixed batch size. Another goal of this issue would be to expand the parallel framework to make ponder (thinking during the opponents turn) possible without too much trouble.

Poor GPU utilization observed during play

With modern drivers we are observing very poor GPU utilization during self-play and normal play. nvidia-smi shows up about 40% utilization when running on dual-GPU, on a single GPU it gives about 60% utilization (since it doesn't have to wait for the slower of the two):

Thu Apr 30 19:20:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:65:00.0  On |                  N/A |
|  0%   34C    P2    63W / 185W |   1484MiB /  7979MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:B3:00.0 Off |                  N/A |
|  0%   34C    P2    97W / 260W |    640MiB / 11019MiB |     37%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1253      G   /usr/lib/xorg/Xorg                           416MiB |
|    0      2058      G   /usr/bin/kwin_x11                            112MiB |
|    0      2062      G   /usr/bin/krunner                              20MiB |
|    0      2064      G   /usr/bin/plasmashell                          99MiB |
|    0      2892      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   114MiB |
|    0      3478      G   ...quest-channel-token=3743066830833839882    40MiB |
|    0      4017      C   ./target/release/dream_go                    556MiB |
|    0     18141      G   ...quest-channel-token=6788427025990345763   112MiB |
|    1      4017      C   ./target/release/dream_go                    628MiB |
+-----------------------------------------------------------------------------+

Generate self-play games

We currently have around 5,000 self-play games with full policies generated. But I would prefer to have around 20,000 before trying to train the network again as previous attempts has resulting in overfit weights.

The latest benchmarks suggests that it takes about 104.1 seconds to generate a single game of self-play. This could be reduced slightly by running with a larger batch size but beyond 64 it only scales linearly. Assuming running on 2x GPU's each game effectively takes 52.05 seconds. This means we need 52.05 * 15000 / 3600 = 216.875 hours (aka 9 days) to create these games.

If we are to use rented hardware to decrease this time then with one p3.2xlarge instance we could cut the time per game to about 23.34 seconds resulting in a generation time of about 4.5 days (assuming we are also running it on 2x GPU's at home). This would cost about $330.48, which approaches the point where we should just buy a V100 if we are planning on repeating this.

Adversarial approach to data-set balancing and training

Today we clean the following attributes prior to feeding the data to the neural network. However pre-cleaning the data is problematic during self play since periodically the neural network should favour black or white depending on the current tactics and strategies:

Throw away games with the wrong komi (dream go use 7.5).
Throw away handicap games.
Throw away excessive black or white wins (i.e. make sure the ratio between the number of examples games where black and white wins is 1:1).

An alternative approach to the data cleaning would be to employ an adversarial network as suggested by Gilles Louppe, Michael Kagan, and Kyle Cranmer [1] [2]. Their approach is to include additional adversarial networks that we train (in a separate tower) to predict these attributes based on the policy and value output, and then in the main tower minimise their accuracy.

This approach would allow us to maintain more of the training data-set.

[1] Gilles Louppe, Michael Kagan, Kyle Cranmer, "Learning to Pivot with Adversarial Networks", https://papers.nips.cc/paper/6699-learning-to-pivot-with-adversarial-networks
[2] https://blog.godatadriven.com/fairness-in-ml

Komi

TL;DR I don't think we can solve the komi problem with this approach.

Komi has a direct influence on the score, so we cannot tell the network to not make the value a function of the komi. We can however make the policy not play differently depending on the komi, so given an adversarial network K(p) with one output class for every komi that occurs in the training set. We can then train K(p) to predict the komi based on the policy and include the negative cross entropy of K(p) in the main tower loss.

This approach may or may not work, since it is unclear if the policy contains enough information to determine the komi, but if we include the board state in the input of K(p) then the adversary may just ignore the policy and learn to predict the komi based on the board state.

Handicap

We do not include the handicap in the examples at the moment, so we would need to complement the input data. If the input data is complemented then we can follow a similar approach as for balancing black and white wins.

Excessive Black or White wins

This is pretty straight forward, we can create an adversarial network that is a function of the value head that tries to predict whether the winner is black or white.

GPU vs CPU matrix multiplication

At the end of the neural network we use for prediction we perform a number of small matrix multiplication to calculate the final policy and value values. These might be better to move to the CPU over the GPU does to kernel launch overhead:

GPU

test gpu_08_2888x362 ... bench:      99,911 ns/iter (+/- 7,922)
test gpu_08_722x1    ... bench:      78,342 ns/iter (+/- 4,629)
test gpu_16_2888x362 ... bench:      96,692 ns/iter (+/- 1,565)
test gpu_16_722x1    ... bench:      77,873 ns/iter (+/- 1,108)

Seems like the kernel launch overhead is dominating the runtime.

CPU

>>> x = np.zeros([16, 2888], 'f4')
>>> W = np.zeros([2888, 362], 'f4')
>>> y = np.zeros([16, 362], 'f4')
>>> timeit.timeit(lambda: np.matmul(x, W, out=y), number=10000) / 10000
0.0001820326805114746
= 181974 ns

>>> x = np.zeros([8, 2888], 'f4')
>>> W = np.zeros([2888, 362], 'f4')
>>> y = np.zeros([8, 362], 'f4')
>>> timeit.timeit(lambda: np.matmul(x, W, out=y), number=10000) / 10000
0.00014174079895019532
= 141741 ns

>>> y = np.zeros([16, 1], 'f4')
>>> x = np.zeros([16, 722], 'f4')
>>> W = np.zeros([722, 1], 'f4')
>>> timeit.timeit(lambda: np.matmul(x, W, out=y), number=10000) / 10000
5.264592170715332e-06
= 5264 ns

>>> y = np.zeros([8, 1], 'f4')
>>> x = np.zeros([8, 722], 'f4')
>>> W = np.zeros([8, 1], 'f4')
>>> timeit.timeit(lambda: np.matmul(x, W, out=y), number=10000) / 10000
3.273797035217285e-06
= 3274 ns

These are mock-ups from python using numpy, so not the perfect example. I might do a custom rust implementation of these "later" for a better comparison.

Conclusion

It might be worth looking into calculating the value head on the CPU at least, it might not be worth doing it for the policy head unless we can shrink it slightly beforehand. Doing the heads on the CPU would also allow us to do before custom behaviours in an easier manner.

Implement leela-zero `analyze` commands

With the deprecation of the Sabaki specific analyze commands (which were very useful for debugging MCTS), we should implement the leela-zero analyze commands instead as they are widely supported by different Go GUI's. There is no official specification for these commands, but based on the leela-zero code:

lz-genmove_analyze [color] [interval]
lz-analyze [color?] [interval]

Code for analyze output can be found in UCTSearch.cpp. It should every interval centisecond output the following to standard output for every move being considered in the root node:

info move [vertex] visits [count] winrate [10000 * value] prior [10000 * prior] order [order] pv [pv]

vertex is the GTP vertex being considered, for example d4.
count is the number of probes into this vertex, for example 800.
value is the winrate for the vertex between 0 and 1.
prior is the prior value for the vertex between 0 and 1.
order is the current rank of the vertex, for example 1.
pv is the expected path through the search tree after this vertex, for example d4 q16 ...

Every line after the first should have a one space prefix.

Limit the tree size instead of the number of additional rollouts

We might want to consider limiting the tree size to NUM_ROLLOUT instead of limiting the number of additional rollouts to perform at each step. This is better because:

It avoids the issue Leela Zero observed where the Dirichlet noise is completely overwhelmed during forced sequences (such as ladders). This kills any chance of exploration during self-play.
It would provide a more smooth strength curve over the game, avoiding blunders when an unexpected move is encountered because the search tree is a lot smaller.
During self play we can increase the number of rollouts as most of the time it plays moves the opponent expects meaning we effectively need to do less rollouts.

Overall I do not see any significant drawback of doing this.

Remove all locks from the monte carlo tree search

Today the monte carlo tree search (MCTS) spend at least 30% of its time in lock contention, mainly due to the root node, and time control. There are some prior art on lockless algorithms for MCTS [1] [2], but none of the can be directly applied to our implementation due to how we compress the tree #36.

A few thoughts on how we can accomplish different parts of this:

RCU would allow us to get ride of the locks necessary to block on upgrade from small to big node structure, which should be a very rare event.
- If we do not want to implement an RCU when a read-write lock would suffice, but is less desirable.
Changing the node structure from an object of arrays to an array of objects would allow us to update statistics using 128-bit atomic ops. If we are willing to accept a non-deterministic search then we do not need to do this, and just normal atomic ops are sufficient when calculating the UCT.

[1] S. Ali Mirsoleimani, Jaap van den Herik, Aske Plaat, and Jos Vermaseren, A Lock-free Algorithm for Parallel MCTS, http://liacs.leidenuniv.nl/~plaata1/papers/paper_ICAART18.pdf
[2] Markus Enzenberger, and Martin Muller, A Lock-free Multithreaded Monte-Carlo Tree Search
Algorithm, https://webdocs.cs.ualberta.ca/~mmueller/ps/enzenberger-mueller-acg12.pdf

Fix deadlock (?) during GTP move generation

Fix freeze during GTP move generation that seems to occur randomly. It is fairly easy to re-produce:

Setup Dream Go to play against itself. Wait until it happens...
CPU and GPU utilization drops to 0%

This does not seem to happen during self or policy play which limits the scope somewhat.

Normalize the value function Q(s, a) from the neural network to [0,1)

The value function Q(s, a) in PUCT or RAVE formula is the probability of winning for a given state and action.

In the current implementation we use the output from the neural network (which is between -1 and +1) as the value function, but this might give the wrong result. We should try normalizing the result of the value head v to see if following the formula gives better results.

We might need to re-tune the hyper-parameters of the monte-carlo search after this change.

Compiling

Hi, this looks like an interesting project. I am new in Rust and I tried to compile it. After installing the Rust stable release channel I got this error message:

  Installing dream_go v0.5.0 (file:///home/alex/Downloads/dream-go-0.5.0)
   Compiling dream_go v0.5.0 (file:///home/alex/Downloads/dream-go-0.5.0)
error[E0554]: #![feature] may not be used on the stable release channel
  --> src/lib.rs:14:1
   |
14 | #![feature(asm)]
   | ^^^^^^^^^^^^^^^^

error[E0554]: #![feature] may not be used on the stable release channel
  --> src/lib.rs:15:1
   |
15 | #![feature(io)]
   | ^^^^^^^^^^^^^^^

error[E0554]: #![feature] may not be used on the stable release channel
  --> src/lib.rs:16:1
   |
16 | #![feature(link_llvm_intrinsics)]
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0554]: #![feature] may not be used on the stable release channel
  --> src/lib.rs:17:1
   |

After updating to the nightly release channel and even after completely removing Rust and installing the nightly release channel I get still the same error. Any ideas?

Improve scoring algorithm

Our scoring algorithm is pretty bad at detecting large dead dragons, seki, and life and death in general. I suggest we solve this as it affects the value head greatly (since we train it to recognize the wrong result)

Based on 1,000 policy play games the current score is wrong 3% of the time. This could be improved, I suggest we aim for <1%, any lower number is likely unrealistic as it is going to involve very unfinished situations:

kalle@coffee:[dream-go-copy]$ cargo run --release -- --policy-play 1000 > policy_play.sgf
kalle@coffee:[dream-go-copy]$ python3 ../dream-go/tools/sgf2score.py < policy_play.sgf > policy_play_wrong.sgf
Arbitrated 0 games without a winner
Arbitrated 30 games with the wrong winner

The sgf2score.py script uses GNU Go with Chinese scoring to determine the winner, and compare it against what is recorded in the SGF file.

Add support for `q8` during neural network inference

Based on some recent benchmarks (on GTX 1080 Ti), using 8-bit quantized integers during inference could give us significant performance improvement. No experiments has been performed yet, but other sources suggest that the drop in accuracy from using q8 instead of f16 (or f32) is insignificant.

This is inline with what has been cited by Google and NVIDIA, who are pushing q8 during inference in their TPU's and TensorRT library.

Updated with new benchmarks using a fused conv-add-act kernel

`BATCH_SIZE=1`

test f32_128_winograd             ... bench:      23,768 ns/iter (+/- 26)
test f32_256_winograd             ... bench:      75,374 ns/iter (+/- 64)
test i8_128_implicitprecompgemm   ... bench:      32,795 ns/iter (+/- 60)
test i8_256_implicitprecompgemm   ... bench:      54,868 ns/iter (+/- 57)
test i8x4_128_implicitprecompgemm ... bench:      34,073 ns/iter (+/- 58)
test i8x4_256_implicitprecompgemm ... bench:      56,966 ns/iter (+/- 67)

Speed-up is 0.70x

`BATCH_SIZE=16`

test f32_128_winograd             ... bench:     163,904 ns/iter (+/- 411)
test f32_256_winograd             ... bench:     557,587 ns/iter (+/- 3,755)
test i8_128_implicitprecompgemm   ... bench:      53,362 ns/iter (+/- 155)
test i8_256_implicitprecompgemm   ... bench:     187,492 ns/iter (+/- 179)
test i8x4_128_implicitprecompgemm ... bench:      54,799 ns/iter (+/- 97)
test i8x4_256_implicitprecompgemm ... bench:     190,671 ns/iter (+/- 137)

Speed-up is 3.07x

`BATCH_SIZE=64`

test f32_128_winograd             ... bench:     600,183 ns/iter (+/- 686)
test f32_256_winograd             ... bench:   2,153,691 ns/iter (+/- 2,246)
test i8_128_implicitprecompgemm   ... bench:     175,793 ns/iter (+/- 94)
test i8_256_implicitprecompgemm   ... bench:     609,676 ns/iter (+/- 6,896)
test i8x4_128_implicitprecompgemm ... bench:     178,419 ns/iter (+/- 117)
test i8x4_256_implicitprecompgemm ... bench:     616,550 ns/iter (+/- 12,348)

Speed-up is 3.41x

`BATCH_SIZE=128`

test f32_128_winograd             ... bench:   1,180,420 ns/iter (+/- 2,734)
test f32_256_winograd             ... bench:   4,310,177 ns/iter (+/- 4,477)
test i8_128_implicitprecompgemm   ... bench:     326,474 ns/iter (+/- 5,199)
test i8_256_implicitprecompgemm   ... bench:   1,164,039 ns/iter (+/- 3,448)
test i8x4_128_implicitprecompgemm ... bench:     342,554 ns/iter (+/- 1,579)
test i8x4_256_implicitprecompgemm ... bench:   1,184,267 ns/iter (+/- 360)

Speed-up is 3.62x

Re-factor MCTS code to use asynchronous framework

In preparation for implementing the mu-zero framework re-factor the monte-carlo tree search to use the rust asynchronous framework.

Convert predict_worker to async to avoid the use of channels.
Convert PredictService to async to avoid the use of channels.
Break the board out to an async trait.

Change optimizer from AdamOptimizer to MomentumOptimizer

As a follow-up of #5, based on some limited testing the reason we did not reach the desired accuracy with the higher regularization coefficient does not seem to be directly related to the regularization but due to the change in optimizer to Adam which was never properly investigated.

Sparse Quantized Model

Some impressive results that might be useful when transitioning to quantized models again.

https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-deepsparse/

Create distributable packages for debian / redhat / windows

Create downloadable packages in the following format to make installation easier:

tar.gz - binary for x86-64
deb - for most modern linux distributions
rpm - can be generated from the deb file
zip - for windows

For the most part this is fairly straight forward, we just need to pre-build a binary and package it with the weights. The main issue is in the licensing of CUDA and cuDNN.

Re-implement `INT8x32_CONFIG` support during inference

We previously implemented quantized convolution in #8, but due being temporarily unsupported by tensor cores we switched to TRUE_HALF_CONFIG instead. Based on benchmarks it seems like we could acquire an 1.58x speed-up re-implementing the quantized convolution:

test f16_nchw_compute_type_f16 ... bench:      78,210 ns/iter (+/- 831)
test f16_nchw_compute_type_f32 ... bench:     155,737 ns/iter (+/- 370)
test f16_nhwc_compute_type_f16 ... bench:      71,667 ns/iter (+/- 73)
test f16_nhwc_compute_type_f32 ... bench:     122,514 ns/iter (+/- 29)
test f32_compute_type_f32      ... bench:     186,643 ns/iter (+/- 168,786)
test i8_nhwc                   ... bench:     123,767 ns/iter (+/- 24)
test i8x32_nhwcvectc           ... bench:      45,120 ns/iter (+/- 39)
test i8x4_nhwcvectc            ... bench:     108,830 ns/iter (+/- 1,072)

Simulated strength improvements based on being able to perform 1.58x more playouts using the exact same network (trained using semi-supervised learning from a LZ network on real games from foxwq):

9x128-cd3c90c-lz-3200 v 9x128-cd3c90c-lz-5056 (27/200 games)
board size: 19   komi: 7.5
                        wins              black         white       avg cpu
9x128-cd3c90c-lz-3200      4 14.81%       3  21.43%     1   7.69%    756.97
9x128-cd3c90c-lz-5056     23 85.19%       12 92.31%     11 78.57%    923.67
                                          15 55.56%     12 44.44%

I could have kept going, but I think you see the point from this simulation. This seems worth pursuing.

Design approach

Scaling of weights in tensorflow

We need to change our activation functions during training from tf.nn.relu to relu3 in order to better simulate quantized inference. Previous attempts to further simulate quantization during training using the fake quantization methods in tensorflow has not proven to be successful so we will skip this step.

We also need to dump the weights in a quantized format. This can be achieved using the tf.quantization.quantize method, with the SCALED mode and HALF_AWAY_FROM_ZERO rounding mode. The scaling factor to use for each weights will for simplify be a symmetric scaling factor (with 0 represented as quantized 0) as max(abs(W)) for some weights W.

It will probably be a good idea to leave the policy and value head as floating point, because they have historically been unstable when quantized.

Action items

Change to tf.nn.relu3
Quantize weights in the tower using tf.quantize.
Add a type parameters to the output weights.

Quantized inference using cuDNN

We are going to use the universal scale 3 to store the intermediate results from each convolution layers. In reality due to quantization effects we end up with an actual scaling factor of 3.09023 (see #34). Using an universal scale s allows us to derive pretty straight forward values for the α and β parameters passed to cudnnConvolutionForward and the scaling factors for the weights sW and bias sb.

α = sW / 127
β = 1
sb = s

Action items

Implement the q8 data type.
Change the loader to be able to load weights in different formats (instead of all f16).
Add support for the NHCW_VECT_C data format.
Re-order weights from NCHW to NHWC_VECT_C.
Scale offsets to 127.0 / THREE.
Implement quantized convolutions during inference.

Re-balance search tree size vs neural network size

When playing on Go servers such as CGOS (with a time limit of 15 minutes per side) we typically run up about 150,000 rollouts per move. This is not bad per se, but the benefits from searching tends to encounter diminishing returns past 32,000 nodes†. So we are wasting a lot of FLOPS on rollouts that has little benefit to the final result.

To fix this we can increase the neural network size, which will reduce the search tree size in favour of a more accurate search. But a larger neural network will require more data to train.

† Based on how many rollouts are necessary to visit all root candidates on an empty board at least once.

Available sizes

Today we use a 9 deep, and 128 wide neural network. We can alter the depth and width almost arbitrary, but there are some constraints :

The number of channels should always be a multiple of 32 since that is the usual GPU warp sizes.
The number of channels should be at least 9 in order to provide for a full board evaluation.

The number of parameters of a model can be calculated from:

depth · (18 · width² + 2 · width + 1) + 38 · width + 354,655

And the FLOPS necessary for inference can be calculated from:

361 · depth · (18 · width² + 2 · width + 1) + 361 · 38 · width + 354,655

There are two parameters that influence these values width and depth, which of the two we want to increase is problem specific and so far there is no sound theoretical background we can use to guide us. Further evaluating each point requires huge amounts of time, so I am going to cherry pick some candidates based on personal intuition, and ignore the rest of the candidates:

Depth	Width	# Parameters	FLOP
9	128	3,016,040	961,114,640
9	256	10,985,832	3,838,209,552
16	192	10,984,943	3,837,888,623
23	160	10,966,518	3,831,237,198

Add ladder features to the neural network

Add two additional features to the neural network:

Whether a move at this point is a successful ladder capture.
Whether a move at this point is a successful ladder escape.

The motivation for adding these two features is that the neural network sometimes fails to read out ladders successfully and reading out a ladder can require the network to look a huge distance (which is not only hard, but would also take up a large amount of features in the neural network).

To help with these problems we can just tell the network directly whether a ladder would work, this should help the network tell whether playing out a ladder is a good idea. There is still an issue with the training data being heavily biased toward avoiding broken ladders (since no human would play them out), which may cause these features to have limited effect. Not sure how to fix that issue except just playing more games.

Get ride of batch normalization during inference

During inference the batch normalization is a fixed linear transformation and can therefore be absorbed into the surrounding convolution operators.

The definition of batch normalization is, where x is the input:

scale * (x - mean) / sqrt(variance + eps) + offset

We force scale = 1 and offset = 0 during training so this equation simplifies to:

(x - mean) / sqrt(variance)
= x / sqrt(variance + eps) + mean / -sqrt(variance + eps)

We can rename the term mean / -sqrt(variance) to bias and if we substitute the input tensor for its convolution operator C(W, x) and W is in the format [out, in, h, w] then this equation can be re-written as:

C(W, x) / sqrt(variance + eps) + bias
= C(W / sqrt(reshape(variance + eps, [out, 1, 1, 1])), x) + bias

Doing this would have three advantages:

We could replace our blocks of calls to cudnnConvolutionForward, cudnnBatchNormalizationForwardInference, and cudnnActivationForward with one call to cudnnConvolutionBiasActivationForward. This should mean less API overhead and probably some performance improvements.
The network will then consist of only convolutional, add tensor, activations, and matrix multiplications. Such a network should be easier to quantize.
We can get ride of the weird special case where the mean and variance are always stored as f32 because of oddities with the batch normalization API.

Disable TT-scoring of terminal game states during tournament play

We currently score terminal game states using the Tromp-Taylor Rules to ensure the policy play doesn't keep playing forever, but this has adverse effects during end-game since it doesn't see a large dragon is dead unless it is fully captured (which may lead it down the wrong path).

In addition to these flaws there should also be no need for the terminal game state evaluation when we are searching since scoring the game is the neural networks job.

An alternative would be to implement Chinese rules, but that is a lot harder since we then need to determine the life-and-death of groups.

Add distributed environment for training

Add script and tools that enable running the full training procedure in a distributed environment. We do not have a distributed environment, but are facing organizational issues with the self-play games and this should be with organizing them:

upload2rest.py <server> - script that reads from standard input and upload to the server in the given argument.

The server itself can be a glorified key-store database with some limited time series functionality. I suggest the following schema in the single database table using a REST-ful API:

key VARCHAR(16)
timestamp BIGINT
data BLOB

This will allow us to store different types of keys in the database, such as the self-play games themselves as well as the binary features. Depending on how fancy we are feeling we can use either something like Google Cloud Storage or a local database like SQLite. There should also be little trouble migrating from one database to another so long as we have them in something.

The server itself needs to support two operations:

POST - for adding a record to a collection with the a key that is equivalent to the root directory. Timestamp can either be read from the X-Date HTTP Header or set to the curren time.
GET - for retrieving records from the collection with the key of the root directory in the request. Also allows for adding additional filters to the request, such as recent/n where n is the number of requests to retrieve.

The server itself does not require any knowledge of what it is storing and this interface should be enough for my current needs. Future extensions might be to add some interface that allows one to browse the played games.

Docker image for running the self-play agents

Using expert iteration would only require some minor adjustments:

while true; do
    curl upload.dg.io/weights/recent/1 > dream_go.json
    ./dream_go --self-play 1000 | tee self_play.sgf | python upload2rest.py --sgf upload.dg.io/self_play
    ./dream_go --extract self_play.sgf | python upload2rest.py --bytes 23830 upload.dg.io/features
done

Docker image for running the training server

Since we do not have a dedicated host for running the training server, we may want to remove the infinite loop from this container:

while true; do
    curl upload.dg.io/features/recent/500000 > features.bin
    python tools/bootstrap.py features.bin
    python tools/bootstrap.py --dump | python upload2rest.py --bytes -1 upload.dg.io/weights
done

Docker image for running the server

The server will always listen to port 80, you can change it using the docker bridged networking interface:

python server.py /mnt/dream_go/db.sqlite3

Introduce a new self-play mode

Introduce a new self-play mode that acts as a hybrid between Expert Iteration and Self Play, using Tromp-Taylor rules at their core in order to generate higher quality games, at a faster speed.

During play, maintain a moving average of the win rate for each player. This should be based on the average win rate of played moves in the search tree.
During search, probe for n iterations where n = f(moving average win rate).
Play until both players wants to pass, and the game is Tromp-Taylor scoreable.

Maintain the moving average win rate

Each player starts with a 50% win rate, after each move we have a new win rate rᵢ ∈ [0, 1].

wᵢ₊₁ = wᵢ - α (wᵢ - rᵢ)

Where α is the coefficient between zero and one, we want it to move fast, but not too fast in order to avoid either player giving up due to a blunder. Say we want it to have decayed 90% after ten moves, which means α = 1.0 - 0.9**(1 / 10) ≈ 0.2.

Determine `f(moving average win rate)`

There is not much point putting a lot of effort into an already lost game, but we need to continue playing in order to be able to accurately score the game. We also want to avoid flipping the game conclusion due to mistakes, since that will pollute the value head.

Do full playouts when the win rate is close to 50%.
Decrease the number of playouts as the win rate goes approach 100% or 0%.
Never go below 100 playouts, since that may cause catastrophic mistakes.

This would suggest a formula such as:

f(x) = max(100, -4.0 · x · (x - 1) · num_rollouts)

Recorded policies

If we only perform full playouts, and record their policy, when close to 50% win rate then we would only train on examples close to the beginning of the game. But we may not want to train on examples with a low number of playouts. As a compromise we should emit policies for all search trees with more than 0.5 · num_rollouts, and randomly perform full playouts with a probability of 2%.

Any remaining imbalance can be fixed using undersampling and oversampling, based on the move number, in the input pipeline in the training script.

Tromp-Taylor scoreable

A game is Tromp-Taylor scoreable when all groups are unconditionally alive [1], and there are no vertices that does not belong to either player.

[1] https://senseis.xmp.net/?BensonsAlgorithm

Re-analyze

A faster way to produce high quality data is to take existing data, and augment it with new vectors. We can do this in two ways:

Policy We can re-analyze a single state, which will provide us a new policy. Each state can be analyzed independently since the stored policy does not have to correspond to the played move.
Value Given a single state, we can re-play from that given board state for a new value. Since we need to re-play the entire game from the chosen state this is quite expensive.

We should produce both policy and value metrics to avoid one of the two losses from stalling, hence we need to do partial re-plays of games. We want to avoid picking an already decided game to re-play from, so I suggest that when choosing what move to begin replaying from we pick the middle of value integral.

Create a _zero_ version of dream-go

Add a branch with only the following input features:

A constant plane filled with ones if we are black
A constant plane filled with ones if we are white
Our vertices
Opponent vertices

The purpose of this branch is for analysing what the heck the neural network actually learns and all of the additional features in the main branch makes that hard to do.

Add Time Control

Implement the following GTP commands to allow for better time control and tournament play:

Decrease neural network size to 7 intermediate layers

To increase inference and training performance decrease the number of intermediate layers to 7 instead of the current 19. This should give about a 2.71x speed-up, but such a small network might not be able to learn anything useful.

Similaries with Leela Zero

Just observed some similaries with Leela Zero. Is Dream Go some kind of rewritten Leela? What is taken from Leela and where are the changes?

Investigate @lightvector neural network enhancements

Check out some of ideas mentioned here to enhance the neural network:

@lightvector https://github.com/lightvector/GoNN

Initial thoughts on the concepts without any further research to back it up:

Global Pooled Properties This should be easy to implement and makes a lot of sense. Unclear if we want to add them early, before each residual block, or only in the policy and value head. Need to experiment a bit with this to tell.
Parametric ReLUs Unfortunately there is no support for PReLU in cuDNN so we cannot do this. Otherwise I've had similar experiences a long time ago when I was playing around with neural network architectures for Go where PReLU is just better. I used to experiment with other activation functions, like selu, but they were not as good.
Chain Pooling This probably gives worse results since the pooling destroy the local shape information. Might be interesting to do a dense block approach where each residual blocks becomes a DenseNet [1]. This way each residual block would benefit from both the pooling and the local shape:
1. Given input x compute the the chain pooling of each channel and store the result in yₛ.
2. Concatenate x and yₛ channel-wise into yₓₛ (so yₓₛ has shape [256, 19, 19]).
3. y₁ <- relu(C(W₁, yₓₛ) + b₁) (y₁ has shape [128, 19, 19])
4. Concatenate y₁ and yₛ channel-wise into y₁ₛ (so y₁ₛ has shape [256, 19, 19]).
5. y₂ <- relu(C(W₂, y₁) + x + b₂)
6. return y₂
I suspect this is too expensive to do in practice (no good support for it in cuDNN), but it is a very interesting idea.

Investigate MCTS parallelism degradation

We're experiencing poor GPU utilization in regard to MCTS, because of the degree of parallelism that we are experiencing while searching. The performance can be observed in the below test runs where forward represents an upper roof on the expected performance from the MCTS:

$ VLOSS_CNT=2 ./target/release/dream_go --bench data/foxwq_tiny.sgf
data/foxwq_tiny.sgf:
  sgf:               3128444.5732 per second
  feature:           12077.6764 per second
  batch_size 16
    forward:         14261.2164 per second
    predict_service: 7284.5108 per second
  mcts:              4371.8057 per second

$ VLOSS_CNT=64 ./target/release/dream_go --bench data/foxwq_tiny.sgf 
data/foxwq_tiny.sgf:
  sgf:               3312395.9695 per second
  feature:           13048.4149 per second
  batch_size 16
    forward:         14265.5720 per second
    predict_service: 7149.6469 per second
  mcts:              9651.8708 per second

$ VLOSS_CNT=256 ./target/release/dream_go --bench data/foxwq_tiny.sgf
data/foxwq_tiny.sgf:
  sgf:               3390898.1354 per second
  feature:           13219.3126 per second
  batch_size 16
    forward:         14373.3085 per second
    predict_service: 7202.9912 per second
  mcts:              10494.0725 per second

Additional performance benchmarks with an absolute time limit might be interesting in order to determine if there are significant strength degradations in not following the UCT closely.

Fix synchronization issue

Fix the following crash experienced twice when extracting features from 250,000 games (so pretty rare):

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', libcore/option.rs:335:21
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:68
             at libstd/sys_common/backtrace.rs:57
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::panicking::panic
             at libcore/panicking.rs:51
  10: dream_go::mcts::predict::PredictState::predict
  11: <dream_go::mcts::predict::PredictState as dream_go::parallel::service::ServiceImpl>::process
  12: dream_go::parallel::service::worker_thread
  13: std::panicking::try::do_call
  14: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  15: <F as alloc::boxed::FnBox<A>>::call_box
  16: std::sys_common::thread::start_thread
             at /checkout/src/liballoc/boxed.rs:827
             at libstd/sys_common/thread.rs:24
  17: std::sys::unix::thread::Thread::new::thread_start
             at libstd/sys/unix/thread.rs:90
  18: start_thread
  19: clone

Shape of the convolution in the policy head

To determine the optimal width and depth of the downsampling convolution in the policy head we decided to train a single network with eight different policy head on the same labels. Note that these results cannot be directly compared to other values posted elsewhere because having eight policy heads in the same network will bias the internal representation towards one that flavor policy over value results.

Width	Depth	Top 1	Top 3	Top 5	Performance (ns)
1x1	1				16,372,159 +/- 1,319,321
1x1	8				15,145,902 +/- 316,185
3x3	1				16,863,288 +/- 1,055,362
3x3	2	53.89%	77.81%	85.70%	17,227,438 +/- 373,451
3x3	4	54.02%	78.20%	86.27%	20,285,890 +/- 445,266
3x3	6	54.10%	78.25%	86.42%	23,422,236 +/- 646,105
3x3	8	54.05%	78.27%	86.47%	15,106,154 +/- 777,990
5x5	1				19,281,082 +/- 132,104
5x5	2	53.93%	77.90%	85.76%	21,374,388 +/- 327,139
5x5	4	54.07%	78.22%	86.24%	29,913,518 +/- 932,358
5x5	6	54.12%	78.23%	86.39%	38,218,057 +/- 335,314
5x5	8	54.11%	78.26%	86.45%	15,088,077 +/- 691,612

kblomdahl / dream-go Goto Github PK

dream-go's Introduction

Dream Go - All day, every day

Dependencies

Dev Dependencies

Training

Reinforcement Learning

AlphaZero

Expert Iteration

Roadmap

License

dream-go's People

Contributors

Stargazers

Watchers

Forkers

dream-go's Issues

Drawbacks

Usages of cuDNN

MCTSnet architecture

Tensorflow training script

Network architectures

Backup (β)

Learned simulation policy (π)

Embedding (ε)

Readout (ρ)

TODO

Citations

Performance

Relu

Swish

Komi

Handicap

Excessive Black or White wins

GPU

CPU

Conclusion

BATCH_SIZE=1

BATCH_SIZE=16

BATCH_SIZE=64

BATCH_SIZE=128

Design approach

Scaling of weights in tensorflow

Action items

Quantized inference using cuDNN

Action items

Available sizes

Docker image for running the self-play agents

Docker image for running the training server

Docker image for running the server

Maintain the moving average win rate

Determine f(moving average win rate)

Recorded policies

Tromp-Taylor scoreable

Re-analyze

Recommend Projects

Recommend Topics

Recommend Org

`BATCH_SIZE=1`

`BATCH_SIZE=16`

`BATCH_SIZE=64`

`BATCH_SIZE=128`

Determine `f(moving average win rate)`