lczero-colab-files's Introduction

lczero-colab-files's People

Contributors

debneil


lczero-colab-files's Issues

Some observations

This will be most helpful to people looking to train nets.
Thank you for taking the time to write it up.

Some feedback:

The CCRL Standard Dataset (CSD) network is 10b, but your yaml specifies 20b.
The 20b net will take longer both to train and to run the match, so it is not a fair comparison.

I suggest using a yaml with the learning rate, number of steps, batch size, etc. as close as possible to the CSD one. Also, in your yaml, doesn't the total of 100K steps mean the last LR drop starting at 130K is never reached?
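To make the mismatch concrete, here is a rough sketch of the relevant training-section fields (field names follow the lczero-training yaml format as I understand it; the values are illustrative, not the actual CSD settings):

```yaml
training:
  total_steps: 100000                      # training stops here...
  lr_values: [0.1, 0.01, 0.001, 0.0001]
  lr_boundaries: [60000, 110000, 130000]   # ...so the drops at 110K/130K never fire
```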

Overall suggestion is to try to replicate the CSD results first to make sure your own methodology is correct. The match results should be very close to 50/50. Then, try training another net with different input.

It is very important to point out the change of value_loss_weight to 1.0, as in your yaml. I think the old CSD yaml has not been updated; people should not use the old yaml unchanged.
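For anyone editing their yaml by hand, the flag in question looks like this (placement within the training section may differ slightly depending on the yaml version):

```yaml
training:
  value_loss_weight: 1.0   # older CSD yamls ship a different value; set this to 1.0
```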

Also, the CSD training data has both PGN and chunk files, so there is no need to preprocess. Note that chunk files produced by the training tool from regular PGN input will only have policy info for the one move actually played. It turns out that this reduces the net strength by about 150 Elo. For working with the CSD as a baseline this is fine, but it is something to keep in mind. Dkappe has shared some input files created from PGNs that were augmented with quick Stockfish searches to add policy values for additional moves (the BadGyal data).
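The policy difference can be illustrated with a small sketch (the move indices and probabilities below are made up for illustration; 1858 is the size of lc0's policy head):

```python
import numpy as np

NUM_MOVES = 1858  # size of lc0's move-policy output

# Chunks converted from a plain PGN: all policy mass on the one move played.
pgn_policy = np.zeros(NUM_MOVES, dtype=np.float32)
pgn_policy[42] = 1.0  # index of the played move (illustrative)

# Self-play (or Stockfish-augmented) data: mass spread over candidate moves.
search_policy = np.zeros(NUM_MOVES, dtype=np.float32)
search_policy[[42, 97, 305]] = [0.7, 0.2, 0.1]  # visit fractions (illustrative)

# Both are valid probability distributions, but the first carries far less
# information for the policy head to learn from.
assert np.isclose(pgn_policy.sum(), 1.0)
assert np.isclose(search_policy.sum(), 1.0)
```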

The CSD ran for 200K steps I think. This can take quite a long time, as you point out.

For test matches, again it is helpful to verify the methodology first. From time to time I do a "sanity check" by running two identical copies of the same engine/net against each other to make sure the results are almost exactly 50/50. I have found that restart=on seems to be very important, but I haven't looked at it in a while.
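A sanity-check match of this kind can be sketched with cutechess-cli (the engine command and weights file below are placeholders; adjust paths and options to your setup):

```shell
# Two identical copies of lc0 with the same net; results should be ~50/50.
# restart=on makes cutechess-cli restart the engine process between games.
cutechess-cli \
  -engine name=lc0-a cmd=lc0 arg=--weights=net.pb.gz restart=on \
  -engine name=lc0-b cmd=lc0 arg=--weights=net.pb.gz restart=on \
  -each proto=uci tc=inf nodes=800 \
  -games 200 -concurrency 2 -repeat \
  -pgnout sanity-check.pgn
```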

If the nets are the same size (10b v 10b), then a fixed-nodes-per-move test (which your example uses) is fine and very fast (you can use a concurrency of 2 or 3 depending on h/w). For nets of different sizes, time-per-move matches are more appropriate. Fixed nodes can also be used if the node counts are set in the ratio of the speed difference, but I prefer actual times per move.
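The node-scaling idea above can be sketched as follows (the nps figures and baseline node count are made up for illustration):

```python
# Scale the fixed node count by measured speed (nodes per second) so that a
# bigger, slower net gets proportionally fewer nodes for the same notional time.
def scaled_nodes(base_nodes: int, nps_base: float, nps_other: float) -> int:
    return round(base_nodes * nps_other / nps_base)

# e.g. a 10b net at 20k nps vs a 20b net at 8k nps, baseline 800 nodes/move:
print(scaled_nodes(800, 20_000, 8_000))  # 320 nodes/move for the slower net
```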

Informal tests I ran put the CSD net at about 2,900 Elo. I was testing vs Crafty (see the CCRL ratings for Crafty on 1-4 CPUs). Performance will of course vary depending on GPU.
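For converting match scores into an Elo difference, the standard logistic Elo model is enough (this is the usual formula, not anything specific to the tests above):

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a match score (fraction of points won),
    using the standard logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))

print(round(elo_diff(0.5), 1))   # 0.0   -> the 50/50 sanity-check case
print(round(elo_diff(0.75), 1))  # 190.8 -> winning 75% of the points
```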

Again, this is a long overdue guide and you made a great start.
I hope the guide can be improved even more and eventually include training with a local GPU.

Finally, probably should mention that this is all Supervised Learning (SL) from regular PGN games that have already been played. The main Leela chess project uses Reinforcement Learning (RL) where the input games are from Leela nets playing each other (which incidentally includes all of the move policy info). Many Leela nets are trained with a smaller number of input games (tens of thousands instead of millions) and the nets gradually improve from random play. Hundreds of incrementally better nets are created with the same hyper-parameters and this is called a Run. Currently there are three separate Runs being done, and there are many older Runs.

Generating the self-play games takes a lot of GPU power, so RL is not practical for most individual efforts. Of course, the already created Leela games can be used with SL to train new nets. Some even start with SL and then "finish off" with some RL. This might be getting too far off-topic, but some context would be helpful, I think.

init.sh

I'm trying to follow your Colab instructions, but it seems that you need to run init.sh to compile the protobuf files before you can run training. Did you do this? I'm having trouble getting the paths right for it to run.
