
Comments (4)

kblomdahl commented on June 14, 2024

The 40 layer tower that DeepMind used had 11,895,775 weights and a regularization coefficient of 1e-4, while our 20 layer tower has 5,997,535 weights so our regularization coefficient should be approximately 2e-4.

0.0001 * (11,895,775 / 5,997,535) = 0.000198344

40-layer tower weights:

3 * 3 * 34 * 128
+ 39 * (2 * 3 * 3 * 128 * 128)
+ 128 * 1 + 361 * 256 + 256 + 256 + 1
+ 128 * 2 + 722 * 362 + 362
= 11895775

20-layer tower weights:

3 * 3 * 34 * 128
+ 19 * (2 * 3 * 3 * 128 * 128)
+ 128 * 1 + 361 * 256 + 256 + 256 + 1
+ 128 * 2 + 722 * 362 + 362
= 5997535
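The two sums above can be checked with a short script. This is a sketch that reproduces the arithmetic in this comment (34 input planes, 128 filters, AlphaGo Zero-style value and policy heads, one residual block fewer than the nominal layer count); the function name is illustrative.

```python
def tower_weights(blocks):
    """Approximate weight count for a residual tower with `blocks` residual blocks."""
    input_conv = 3 * 3 * 34 * 128                # 3x3 conv, 34 planes -> 128 filters
    residual = blocks * (2 * 3 * 3 * 128 * 128)  # two 3x3 convs per block
    value_head = 128 * 1 + 361 * 256 + 256 + 256 + 1
    policy_head = 128 * 2 + 722 * 362 + 362
    return input_conv + residual + value_head + policy_head

w40 = tower_weights(39)  # the "40 layer" tower has 39 residual blocks
w20 = tower_weights(19)  # the "20 layer" tower has 19 residual blocks
print(w40, w20)              # 11895775 5997535
print(1e-4 * w40 / w20)      # ~1.98e-4, the scaled regularization coefficient
```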

from dream-go.

kblomdahl commented on June 14, 2024

Instead of treating DeepMind as an oracle, we might also wish to play with the regularization coefficient a bit ourselves since we can set it higher to avoid overfitting. This is especially useful because we are severely lacking in data compared to other similar projects.

I suggest training a 20 layer tower with the following coefficients on human games and observing their tournament and testing performance.

  • 1e-4 This is what the current network is trained with.
  • 2e-4 As suggested by the previous post.
  • 1e-3 To represent an extreme coefficient.

The results of the training procedure for these networks can be found in Table 1, and they suggest that 2e-4 is the "sweet spot": the coefficient doubles compared to 1e-4, while the accuracy changes only slightly. 1e-3 does not seem to be a viable option, since the weights are too constrained to properly exploit the structure of the data.

Table 1: Accuracy of each model on a separate set of professional games.
Coefficient   Steps     Value   Policy Top 1   Policy Top 3   Policy Top 5
1e-4          355,717   73.1%   49.2%          73.8%          83.4%
2e-4          95,295    68.4%   48.2%          73.6%          81.8%
1e-3          24,257    58.1%   38.4%          61.8%          71.3%
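For context, a coefficient like 2e-4 enters training as an L2 penalty added to the loss, i.e. loss = data_loss + c * sum(w²). A minimal sketch of that term, with illustrative names rather than anything from the dream-go code base:

```python
def l2_penalty(weights, coefficient):
    """L2 regularization term: coefficient * sum of squared weights."""
    return coefficient * sum(w * w for w in weights)

# A larger coefficient penalizes the same weights more heavily,
# pulling them toward zero and constraining the model.
weights = [0.5, -0.25, 1.0]
print(l2_penalty(weights, 1e-4))  # 1.3125e-4
print(l2_penalty(weights, 2e-4))  # 2.625e-4
```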

Tournament

Each cell indicates ROW vs COL.

         1e-4    2e-4    1e-3
1e-4     -       8 - 2   9 - 1
2e-4     -       -       6 - 3
1e-3     -       -       -


kblomdahl commented on June 14, 2024

I suggest the learning rate schedule in Table 1, based on empirical data gathered from training the 8 layer tower and previous 20 layer towers. This schedule is much steeper than the one suggested by DeepMind; the reason is that we are now using the Adam optimizer, which typically converges much faster than a pure momentum optimizer, so we need fewer steps.

This schedule takes about 22 hours to fully train on my computer with two GTX 1080 Ti's.

Table 1: Learning rate schedule

Step     Learning Rate
0        3e-3
12,000   1e-3
27,000   3e-4
45,000   1e-4
66,000   3e-5
90,000   1e-5
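The schedule above is piecewise constant in the global step. A minimal sketch, assuming the rate switches exactly at the listed step boundaries:

```python
def learning_rate(step):
    """Return the learning rate for a given global step, per the table above."""
    schedule = [(90_000, 1e-5), (66_000, 3e-5), (45_000, 1e-4),
                (27_000, 3e-4), (12_000, 1e-3), (0, 3e-3)]
    # Walk the boundaries from highest to lowest and return the first match.
    for boundary, rate in schedule:
        if step >= boundary:
            return rate

print(learning_rate(0))        # 3e-3
print(learning_rate(30_000))   # 3e-4
print(learning_rate(100_000))  # 1e-5
```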


kblomdahl commented on June 14, 2024

The tournament results and testing accuracy both suggest a hierarchy where 1e-4 > 2e-4 > 1e-3. It is unclear whether, given an equal number of training steps, the same result would occur.

In the meantime I suggest we keep the currently trained network as the state of the art, but keep the larger regularization coefficient (2e-4) for any future networks trained on self-play. The reasoning is that self-play training tends to overfit the weights, and the loss in playing strength from the larger coefficient is not too large (even if it is non-trivial).

I would, however, recommend that when training on human data we decrease the regularization coefficient to 1e-4.

