rl-popit's Introduction

rl-popit

Welcome to my website to play the game. Give it a try!

https://crema.evalieben.cn/game/

The game is currently only available in Chinese.

Neural Network Architecture

The model structure references DeepMind's work on AlphaGo Zero:

Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270

The best model, resnet3-64, is comprised of:

1 input convolutional layer
3 residual blocks (2 convolutional layers + 1 skip connection for each block)
1 policy head (1 convolutional layer + 1 fully connected layers)
1 value head (1 convolutional layer + 2 fully connected layers)

Each of the convolutional layer has 64 features (the input layer has 2 features). The network structure resembles AlphaGo Zero's but has much less features and residual blocks (My game is too simple after all).

Method

The neural network is trained by Proximal Policy Optimization, PPO. I also tweaked the original PPO implementation according to this webpage:

The 37 Implementation Details of Proximal Policy Optimization https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

Some of the implementations work very well for my model: entropy maximization, normalization of advantages, global gradient clipping, etc. Each training cycle begins with 128 vectorized environments sampling game states, and then do backpropagation with a minibatch size of 2048. I simply set the reward of each action to 1 if agent wins in one game, elsewise the reward for all actions will be -1.

Training Curve

The horizontal axis shows the number of epochs and the vertial axis shows the win rate (%). Each curve represents an opponent using the old model. Once the win rate reaches 90%, drop the old model and then save the latest for opponent to use. Training is much tougher after 1000 epochs, and the model finally converges in 15000 epochs.

Win Rate Change during Epoch 0 - 1.4k

Win Rate Change during Epoch 1k - 15k

Recommend Projects

crema-lida / rl-popit Goto Github PK

rl-popit's Introduction

rl-popit

Neural Network Architecture

Method

Training Curve

Win Rate Change during Epoch 0 - 1.4k

Win Rate Change during Epoch 1k - 15k

rl-popit's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent