Comments (25)

gcp commented on September 23, 2024

Did you generate training data first, and are you pointing it to that data?

godmoves commented on September 23, 2024

How do I generate training data from several SGFs rather than a single SGF file?

wonderkid27 commented on September 23, 2024

Yes, I generated training data and pointed it to that data, and it says "found 0 chunks".

gcp commented on September 23, 2024

How do I generate training data from several SGFs rather than a single SGF file?

Just concatenate them together.
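
For example, from a shell (the sgfs/ directory and output name here are illustrative):

cat sgfs/*.sgf > bigsgf.sgf

dump_supervised will then read every game out of the combined file.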

gcp commented on September 23, 2024

Yes, I generated training data and pointed it to that data, and it says "found 0 chunks".

What exactly are you doing?

The code looks at the argument you gave and tries to find any file that matches argument*.gz. If you get "0 chunks found", you are not pointing it at your data. Check that you really have the training files in the path or prefix you're giving it, i.e. you can do "ls argument*.gz" and you should see the training data.
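
Concretely (matching the transcripts later in this thread): dumping with dump_supervised bigsgf.sgf train.out produces chunks named train.out.0.gz, train.out.1.gz, and so on, so the matching invocation is ./parse.py train.out with the prefix, not a single file name.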

godmoves commented on September 23, 2024

When training with TensorFlow, the GPU usage is quite low. Is this reasonable?

gcp commented on September 23, 2024

When training with TensorFlow, the GPU usage is quite low. Is this reasonable?

I think it's partly because Python is rather slow at preparing the input. I'm about to push an improved implementation to the next branch that uses TensorFlow 1.4 Dataset and multiprocessing to speed this up greatly.

Even so, I don't see TensorFlow being able to max out the GPU, something that NVIDIA-Caffe was easily able to do.
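
For reference, the general shape of such an input pipeline in TensorFlow 1.4 is roughly the following. This is a minimal sketch over dummy data, not the actual parse.py code; the tensor shapes are illustrative, and the 65536 shuffle buffer matches the training logs quoted later in this thread.

import numpy as np
import tensorflow as tf

# Dummy stand-ins for parsed training positions: input planes,
# a move-probability target, and a game result in {-1, +1}.
planes = np.random.rand(4096, 18, 19, 19).astype(np.float32)
policy = np.random.rand(4096, 362).astype(np.float32)
winner = np.random.choice([-1.0, 1.0], size=(4096, 1)).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices((planes, policy, winner))
dataset = dataset.shuffle(65536)  # the buffer being "filled up" in the logs below
dataset = dataset.repeat().batch(512)
dataset = dataset.prefetch(4)     # overlap input preparation with GPU compute
next_batch = dataset.make_one_shot_iterator().get_next()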

gcp commented on September 23, 2024

I merged the Dataset based branch. It seems to be about 3-4 times faster on my hardware and is close to maxing out my GPU. For smaller networks it will still bottleneck on CPU. Note the default network size in the TensorFlow code is already much smaller than the real AlphaGo Zero network (6x128 instead of 40x256), so it's not really going to be an issue when we scale up.
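
For context, "6x128" means 6 residual blocks of 128 filters each. A rough TensorFlow 1.x sketch of one such block (illustrative only, not the actual tfprocess.py code; assumes the input x already has FILTERS channels):

import tensorflow as tf

FILTERS = 128  # 256 in the AlphaGo Zero paper

def residual_block(x, training):
    # Two 3x3 conv + batch-norm stages; the block input is added back
    # in before the final ReLU (the standard residual pattern).
    y = tf.layers.conv2d(x, FILTERS, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y, training=training)
    y = tf.nn.relu(y)
    y = tf.layers.conv2d(y, FILTERS, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y, training=training)
    return tf.nn.relu(x + y)

The residual tower is this block repeated 6 times here, versus 40 in the paper.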

wonderkid27 commented on September 23, 2024

Can I train with CPU only?

gcp commented on September 23, 2024

Yes, the TensorFlow code is generic and shouldn't care. It will be much slower, though, probably 10x or more.

godmoves commented on September 23, 2024

Supervised learning runs at about 600 pos/s on a GTX 1080 Ti, but the network doesn't seem to make any progress (step 75000, training accuracy = 49.2188%, MSE = 0.901087).
GPU usage is still not stable, but better than before.

gcp commented on September 23, 2024

Accuracy of 49% is good. The one I put up ran for a week or so and was at 52%. It's normal for the accuracy to rise very quickly in the beginning and then start to flatten off. Also, it's calculated over a single batch only, so the number will fluctuate a lot.

I am not sure if the MSE calculation is comparable to the Google paper, it is possible the number must be divided by 2 or by 4 to be compared. (Edit: just checked and the paper says "scaled by a factor 1/4" so yes it must be divided by 4!)
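
Concretely: the game result z and the value-head output v are both in [-1, 1], so the squared error (z - v)^2 can range up to 4, and dividing by 4 rescales the MSE into [0, 1]. The 0.901087 above would then correspond to roughly 0.225 on the paper's scale.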

I'm getting about 650 pos/s on a GTX 1060 and 900 pos/s on a GTX 1070.

gcp commented on September 23, 2024

I pushed some more updates, including an important fix in symmetry generation.

godmoves commented on September 23, 2024

So it is much slower on a 1080 Ti?
By the way, tf.summary.merge() and tf.summary.FileWriter() could be added to save the summaries. Should I open a pull request?

gcp commented on September 23, 2024

So it is much slower on a 1080 Ti?

There is probably a non-GPU reason here. A different TensorFlow build? The latest cuDNN version? Slow CPUs? Several factors can influence performance.

By the way, tf.summary.merge() and tf.summary.FileWriter() could be added to save the summaries. Should I open a pull request?

Sure, I added a few tf.summary commands but didn't finish support for it, as I wanted to get back to finishing the client/server setup.
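
For anyone picking this up, the usual TF 1.x pattern looks like the sketch below; the scalar names, log directory, and fed values are illustrative, not what tfprocess.py actually computes.

import tensorflow as tf

# Stand-ins for the real training tensors.
accuracy = tf.placeholder(tf.float32, name='accuracy')
mse = tf.placeholder(tf.float32, name='mse')

tf.summary.scalar('policy_accuracy', accuracy)
tf.summary.scalar('mse', mse)
merged = tf.summary.merge_all()

with tf.Session() as session:
    writer = tf.summary.FileWriter('leelalogs', session.graph)
    # In a real training loop these values come out of the graph itself.
    summary = session.run(merged, feed_dict={accuracy: 0.49, mse: 0.90})
    writer.add_summary(summary, global_step=75000)
    writer.close()

TensorBoard can then be pointed at the directory with tensorboard --logdir leelalogs.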

roy7 commented on September 23, 2024

I don't know if @wonderkid27 got it working, but this is what I tried:

../../src/leelaz -w ../../src/d645af975ed5b9d08530d092d484482d9aee014f9498c7afcde8570743f85751
dump_supervised e1c5be12e1d64907a7e7e52c1fb5b3ee.sgf train.out
Total games in file: 1
Shuffling...done.
Game 0, 0 positions
Game 0, 35 positions
Game 0, 63 positions
Game 0, 96 positions
Game 0, 128 positions
Game 0, 163 positions
Game 0, 206 positions
Game 0, 238 positions
Game 0, 264 positions
Game 0, 293 positions
Game 0, 343 positions
Game 0, 377 positions
Game 0, 401 positions
Game 0, 427 positions
Game 0, 456 positions
Game 0, 485 positions
Dumped 518 training positions.
Writing chunk 0

gunzip train.out.0.gz
./parse.py train.out.0
/usr/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
Found 0 chunks

I'm not sure if the RuntimeWarning is something I can safely ignore or if it's actually preventing it from working correctly.

roy7 commented on September 23, 2024

Ok I figured out my problem. I was thinking the parameter to parse.py was a filename, but it's a prefix. So if I have train.out.0.gz in the directory, then

./parse.py train.out

finds my test chunk. This does lead to

2017-11-12 11:36:21.810404: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-12 11:36:24.281390: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.

though. I'll see what I can google.

Edit: I'll try compiling TensorFlow myself, to be sure I use all available CPU instructions plus GPU support.

roy7 commented on September 23, 2024

OK, I think I'm golden? It's actually doing... something. ;) I built TensorFlow from source.

[roy@ryzen-one tf]$ ./parse.py train.out
Found 1 chunks
Using 15 worker processes.
2017-11-12 12:32:09.195530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-12 12:32:09.196073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 10.91GiB freeMemory: 10.17GiB
2017-11-12 12:32:09.196094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1151] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/roy/code/leela-zero/training/tf/tfprocess.py:65: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

2017-11-12 12:32:23.554214: I tensorflow/core/kernels/shuffle_dataset_op.cc:112] Filling up shuffle buffer (this may take a while): 4816 of 65536
2017-11-12 12:32:33.550107: I tensorflow/core/kernels/shuffle_dataset_op.cc:112] Filling up shuffle buffer (this may take a while): 11680 of 65536

etc.

zdevwu commented on September 23, 2024

@roy7 I'm also getting

Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC

How did you get rid of it? I'm not supposed to have to change any code, right?

roy7 commented on September 23, 2024

I believe it's an incompatibility between the TensorFlow binaries and the version of Python, since newer TF binaries aren't available yet. I'm not certain, but the problem went away when I uninstalled the TF package and compiled it myself.
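
The error message itself points at the data layout: Conv2DCustomBackpropFilterOp is the CPU fallback kernel, and it only implements the NHWC layout, while the training graph evidently requests NCHW, which the GPU (cuDNN) kernels handle. So the error appears whenever the convolutions get placed on the CPU, e.g. with a CPU-only TensorFlow package, and goes away with a GPU-enabled build like the self-compiled one above. A purely hypothetical CPU-only workaround would be to build the network in the layout the CPU kernels support:

import tensorflow as tf

# Hypothetical, not the project's code: NHWC ('channels_last') is the
# layout the CPU convolution kernels implement.
x = tf.placeholder(tf.float32, [None, 19, 19, 18])  # batch, height, width, channels
y = tf.layers.conv2d(x, 128, 3, padding='same', data_format='channels_last')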

zdevwu commented on September 23, 2024

@roy7 I did; I compiled the latest 1.4.1. What version are you using?

roy7 commented on September 23, 2024

1.4.0. It seemed to be the latest on GitHub when I cloned the repo to compile it.

zdevwu commented on September 23, 2024

I switched to 1.4.0 but ended up with the same error complaining: Conv2DCustomBackpropFilterOp only supports NHWC

I notice there are TODO list items that might be related to my issue:

CPU support for Xeon Phi and for people without a GPU
AMD-specific version using MIOpen

@gcp, do I need an NVIDIA GPU to run the training? If the training only supports GPUs, do you have any thoughts on how to make AMD cards work? I have an AMD RX 480; I can try to make it work if you point me in the right direction.

nukee86 commented on September 23, 2024

I've struggled all day trying to do supervised learning. I downloaded a couple of SGFs and did:

src/leelaz -w weights.txt
dump_supervised bigsgf.sgf train.out
exit

The output has no win % (as far as I can see, the raw training data from http://leela.online-go.com/training/ has win % inside), which I think is OK, right?
Now, after I run:

training/tf/parse.py train.out

the script gets stuck every time at:

python ..\parse.py train.out
Training with 64 chunks, validating on 7 chunks
Using 4 worker processes.
Using 4 worker processes.
2018-07-12 20:25:51.489286: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2018-07-12 20:25:51.756302: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.253
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.80GiB
2018-07-12 20:25:51.774303: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)

GPU RAM usage is at 1.8 GB, so it seems to have done something, but nothing happened even when I left it for an hour. Am I doing something wrong? I reduced BATCH_SIZE to 256 because I had a MemoryError earlier, and now there's no error, but nothing happens. The same thing happens if I use the raw training data.

EDIT: I set up Ubuntu on my PC and it works now. Well, kinda: it sits there like the above for 15-20 minutes, then starts to actually do something. What might cause the freeze? Or maybe it's normal?
My hardware setup is: AMD FX-6300, 8 GB RAM, NVIDIA GTX 960 with 2 GB RAM
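
The log from earlier in this thread hints at where that time goes: parse.py fills a 65536-position shuffle buffer before the first batch comes out ("Filling up shuffle buffer (this may take a while)"), which on slower hardware can look like a freeze. A hypothetical way to trade shuffle quality for startup time, assuming a tf.data pipeline like the sketch earlier in the thread (the dataset name is illustrative):

dataset = dataset.shuffle(buffer_size=16384)  # smaller buffer fills up sooner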

chunjiongzhang commented on September 23, 2024
Ok I figured out my problem. I was thinking the parameter to parse.py was a filename, but it's a prefix. So if I have train.out.0.gz in the directory, then

./parse.py train.out

finds my test chunk. This does lead to

2017-11-12 11:36:21.810404: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-12 11:36:24.281390: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.

though. I'll see what I can google. Edit: I'll try compiling TensorFlow myself, to be sure I use all available CPU instructions plus GPU support.

I have the same problem. Did you find a solution? Can you teach me how to solve it?
