You can play the game on my website. Give it a try!
https://crema.evalieben.cn/game/
The game is currently only available in Chinese.
The model architecture is based on DeepMind's work on AlphaGo Zero:
Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270
The best model, resnet3-64, consists of:
- 1 input convolutional layer
- 3 residual blocks (2 convolutional layers + 1 skip connection for each block)
- 1 policy head (1 convolutional layer + 1 fully connected layer)
- 1 value head (1 convolutional layer + 2 fully connected layers)
Each convolutional layer has 64 features (the input layer has 2 features). The network structure resembles AlphaGo Zero's but has far fewer features and residual blocks (my game is much simpler, after all).
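As a quick sanity check of the layer list above, here is a minimal sketch that records the structure as a config object and tallies the layers. The class and field names (`NetConfig`, `count_layers`, and so on) are made up for illustration and are not taken from the actual training code. Kernel sizes, board size, and head widths are not stated here, so they are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetConfig:
    """Layer counts copied from the list above; names are illustrative only."""
    input_convs: int = 1        # 1 input convolutional layer
    residual_blocks: int = 3    # 3 residual blocks
    convs_per_block: int = 2    # 2 convolutional layers per block
    policy_head_convs: int = 1  # policy head: 1 conv + 1 fully connected
    policy_head_fcs: int = 1
    value_head_convs: int = 1   # value head: 1 conv + 2 fully connected
    value_head_fcs: int = 2
    conv_features: int = 64     # features per convolutional layer
    input_features: int = 2     # features of the input layer


def count_layers(cfg: NetConfig) -> dict:
    """Return the total number of conv layers, FC layers and skip connections."""
    conv = (cfg.input_convs
            + cfg.residual_blocks * cfg.convs_per_block
            + cfg.policy_head_convs
            + cfg.value_head_convs)
    fc = cfg.policy_head_fcs + cfg.value_head_fcs
    return {"conv": conv, "fc": fc, "skip": cfg.residual_blocks}


RESNET3_64 = NetConfig()
print(count_layers(RESNET3_64))  # {'conv': 9, 'fc': 3, 'skip': 3}
```

So resnet3-64 has 9 convolutional layers and 3 fully connected layers in total.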
The neural network is trained with Proximal Policy Optimization (PPO). I also tweaked the original PPO implementation following this article:
The 37 Implementation Details of Proximal Policy Optimization https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Several of these implementation details work very well for my model: entropy maximization, normalization of advantages, global gradient clipping, etc. Each training cycle begins with 128 vectorized environments sampling game states, followed by backpropagation with a minibatch size of 2048. The reward is simple: if the agent wins a game, every action in that game gets a reward of 1; otherwise, every action gets a reward of -1.
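The reward scheme and two of the PPO pieces mentioned above can be sketched in plain Python. This is only an illustration under assumptions, not the actual training code: the function names are hypothetical, the clip range of 0.2 is a common default rather than a value stated here, and the population standard deviation is used for normalization (implementations differ on this).

```python
import math
from statistics import mean, pstdev


def outcome_rewards(num_actions, agent_won):
    """Every action of one game gets +1 if the agent won that game, else -1."""
    return [1.0 if agent_won else -1.0] * num_actions


def normalize_advantages(advantages, eps=1e-8):
    """Shift a minibatch of advantages to zero mean and scale to unit std."""
    mu = mean(advantages)
    sigma = pstdev(advantages)  # population std: an assumption of this sketch
    return [(a - mu) / (sigma + eps) for a in advantages]


def clipped_surrogate(new_logp, old_logp, advantage, clip=0.2):
    """PPO clipped objective for a single action (to be maximized).

    clip=0.2 is an assumed default, not a value taken from this project.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# One won game with 3 actions, one lost game with 2 actions.
rewards = outcome_rewards(3, agent_won=True) + outcome_rewards(2, agent_won=False)
print(rewards)  # [1.0, 1.0, 1.0, -1.0, -1.0]
```

The clipping keeps a single update from moving the policy too far: once the probability ratio leaves the range [1 - clip, 1 + clip] in the direction favoured by the advantage, the objective stops rewarding further change.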
The horizontal axis shows the number of epochs and the vertical axis shows the win rate (%). Each curve shows the win rate against one opponent, which plays with an older version of the model. Once the win rate reaches 90%, the old model is dropped and the latest model is saved for the opponent to use. Training gets much harder after 1000 epochs, and the model finally converges after 15000 epochs.
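The opponent replacement rule described above could look roughly like the sketch below. `OpponentManager` and its methods are hypothetical names, the model is represented by a plain dict as a stand-in, and the win rate is passed as a fraction (0.90) rather than a percentage.

```python
import copy


class OpponentManager:
    """Holds a frozen snapshot of the model for the opponent to play with."""

    def __init__(self, initial_model, threshold=0.90):
        self.opponent_model = copy.deepcopy(initial_model)
        self.threshold = threshold  # win rate that triggers a replacement
        self.generation = 0         # how many times the opponent was replaced

    def update(self, current_model, win_rate):
        """Replace the opponent snapshot once win_rate reaches the threshold.

        Returns True if the opponent was replaced, False otherwise.
        """
        if win_rate >= self.threshold:
            self.opponent_model = copy.deepcopy(current_model)
            self.generation += 1
            return True
        return False


# Stand-in "model": a dict of weights instead of a real network.
model = {"weights": [0.0]}
manager = OpponentManager(model)
model["weights"][0] = 1.0           # pretend training changed the weights
print(manager.update(model, 0.85))  # False: the opponent keeps the old weights
print(manager.update(model, 0.90))  # True: the opponent now uses the latest weights
```

Each replacement would start a new curve in the plot, since the win rate is then measured against a stronger opponent.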