Comments (25)
Did you generate training data first, and are you pointing it to that data?
from leela-zero.
How to generate training data based on several sgfs rather than a single sgf file?
from leela-zero.
Yes, I generated training data and pointed it to that data, and it says "Found 0 chunks".
from leela-zero.
How to generate training data based on several sgfs rather than a single sgf file?
Just concatenate them together.
from leela-zero.
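The concatenation can be done with standard shell tools; a minimal sketch (the file names and the tiny stand-in SGF contents are hypothetical, created here only for illustration):

```shell
# Create two tiny stand-in SGF files for illustration.
printf '(;GM[1]SZ[19];B[pd])\n' > game1.sgf
printf '(;GM[1]SZ[19];B[dd])\n' > game2.sgf

# Combine several SGF files into one file that dump_supervised can read.
cat game1.sgf game2.sgf > combined.sgf
```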
Yes, I generated training data and pointed it to that data, and it says "Found 0 chunks".
What exactly are you doing?
The code looks at the argument you gave, and tries to find any file that matches argument*.gz. If you get "0 chunks found", you are not pointing it to your data. Check that you really have the training files in the path or prefix you're giving it. i.e. you can do "ls argument*.gz" and you should see the training data.
from leela-zero.
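The matching described above can be reproduced with Python's glob module; a sketch of the idea (get_chunks is a hypothetical name, not necessarily what parse.py calls it):

```python
import glob
import gzip

def get_chunks(prefix):
    # Mirrors the behaviour described above: treat the argument as a
    # prefix and collect every file matching prefix*.gz.
    return glob.glob(prefix + "*.gz")

# Create one tiny chunk so the lookup has something to find.
with gzip.open("train.out.0.gz", "wt") as f:
    f.write("dummy chunk\n")

chunks = get_chunks("train.out")  # finds train.out.0.gz
```

If this returns an empty list, you will get the "0 chunks" message: the prefix does not match any .gz file in the path you gave.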
When training with TensorFlow, the GPU usage is quite low. Is this reasonable?
from leela-zero.
When training with TensorFlow, the GPU usage is quite low. Is this reasonable?
I think it's partly because Python is rather slow at preparing the input. I'm about to push an improved implementation to the next branch that uses TensorFlow 1.4 Dataset and multiprocessing to speed this up greatly.
Even so, I don't see TensorFlow being able to max out the GPU, something that NVIDIA-Caffe was easily able to do.
from leela-zero.
I merged the Dataset based branch. It seems to be about 3-4 times faster on my hardware and is close to maxing out my GPU. For smaller networks it will still bottleneck on CPU. Note the default network size in the TensorFlow code is already much smaller than the real AlphaGo Zero network (6x128 instead of 40x256), so it's not really going to be an issue when we scale up.
from leela-zero.
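The speedup from worker processes can be illustrated in isolation; a hedged sketch of the idea only (prepare is a hypothetical stand-in for the real decoding/augmentation step, not leela-zero's actual pipeline code):

```python
from multiprocessing import Pool

def prepare(raw_position):
    # Stand-in for the CPU-heavy work of decoding one training position;
    # the real pipeline parses and augments board planes here.
    return raw_position * 2

if __name__ == "__main__":
    # Several workers prepare inputs in parallel so the GPU is not
    # starved by a single slow Python process.
    with Pool(4) as pool:
        prepared = pool.map(prepare, range(8))
```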
Can I train with CPU only?
from leela-zero.
Yes, the TensorFlow code is generic and shouldn't care. It will be much slower, though, probably 10x or more.
from leela-zero.
The supervised learning runs at about 600 pos/s on a GTX 1080 Ti, but the network seems not to make any progress (step 75000, training accuracy=49.2188%, mse=0.901087).
The GPU usage is still not stable, but better than before.
from leela-zero.
Accuracy of 49% is good. The one I put up ran for a week or so and was at 52%. It's normal for the accuracy to rise very quickly in the beginning and then start to flatten off. Also it's calculated over a single batch only so the number will fluctuate a lot.
I am not sure if the MSE calculation is comparable to the Google paper; it is possible the number must be divided by 2 or by 4 to be compared. (Edit: just checked, and the paper says "scaled by a factor 1/4", so yes, it must be divided by 4!)
I'm getting about 650 pos/s on a GTX 1060 and 900 pos/s on a GTX 1070.
from leela-zero.
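The scaling mentioned above is easy to apply after the fact; a small sketch (0.901087 is the raw MSE reported earlier in this thread, and paper_mse is a hypothetical helper name):

```python
def paper_mse(raw_mse):
    # The AlphaGo Zero paper reports MSE "scaled by a factor 1/4",
    # so divide the raw value by 4 before comparing against it.
    return raw_mse / 4

scaled = paper_mse(0.901087)  # about 0.225
```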
I pushed some more updates, including an important fix in symmetry generation.
from leela-zero.
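For context, Go training data is usually augmented with the eight board symmetries (four rotations, each optionally mirrored). A hedged sketch of that kind of generation, not the fixed leela-zero code itself:

```python
import numpy as np

def board_symmetries(plane):
    # Generate the 8 symmetries of a square board plane:
    # rotations by 0/90/180/270 degrees, each plus a mirrored copy.
    out = []
    for k in range(4):
        rotated = np.rot90(plane, k)
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out
```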
So it is much slower on a 1080 ti?
By the way, tf.summary.merge() and tf.summary.FileWriter() could be added to save the summary. Should I open a pull request?
from leela-zero.
So it is much slower on a 1080 ti?
There is probably a non-GPU reason here. A different TensorFlow build? The latest cuDNN version? A slow CPU? Several factors can influence performance.
By the way, tf.summary.merge() and tf.summary.FileWriter() could be added to save the summary. Should I open a pull request?
Sure, I added a few tf.summary commands but didn't finish support for it, as I wanted to get back to finishing the client/server setup.
from leela-zero.
I don't know if @wonderkid27 got it working, but this is what I tried:
../../src/leelaz -w ../../src/d645af975ed5b9d08530d092d484482d9aee014f9498c7afcde8570743f85751
dump_supervised e1c5be12e1d64907a7e7e52c1fb5b3ee.sgf train.out
Total games in file: 1
Shuffling...done.
Game 0, 0 positions
Game 0, 35 positions
Game 0, 63 positions
Game 0, 96 positions
Game 0, 128 positions
Game 0, 163 positions
Game 0, 206 positions
Game 0, 238 positions
Game 0, 264 positions
Game 0, 293 positions
Game 0, 343 positions
Game 0, 377 positions
Game 0, 401 positions
Game 0, 427 positions
Game 0, 456 positions
Game 0, 485 positions
Dumped 518 training positions.
Writing chunk 0
gunzip train.out.0.gz
./parse.py train.out.0
/usr/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
Found 0 chunks
I'm not sure if the RuntimeWarning is something I can safely ignore or if it's actually preventing it from working correctly.
from leela-zero.
Ok I figured out my problem. I was thinking the parameter to parse.py was a filename, but it's a prefix. So if I have train.out.0.gz in the directory, then
./parse.py train.out
finds my test chunk. This does lead to
2017-11-12 11:36:21.810404: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-12 11:36:24.281390: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.
though. I'll see what I can google.
Edit: I'll try compiling tensorflow myself, to be sure I use all available CPU instructions plus GPU support.
from leela-zero.
Ok, I think I'm golden? It's actually doing... something. ;) I built tensorflow from source.
[roy@ryzen-one tf]$ ./parse.py train.out
Found 1 chunks
Using 15 worker processes.
2017-11-12 12:32:09.195530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-12 12:32:09.196073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 10.91GiB freeMemory: 10.17GiB
2017-11-12 12:32:09.196094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1151] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/roy/code/leela-zero/training/tf/tfprocess.py:65: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See tf.nn.softmax_cross_entropy_with_logits_v2.
2017-11-12 12:32:23.554214: I tensorflow/core/kernels/shuffle_dataset_op.cc:112] Filling up shuffle buffer (this may take a while): 4816 of 65536
2017-11-12 12:32:33.550107: I tensorflow/core/kernels/shuffle_dataset_op.cc:112] Filling up shuffle buffer (this may take a while): 11680 of 65536
etc.
from leela-zero.
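The "Filling up shuffle buffer" messages come from tf.data's buffered shuffling: it fills a fixed-size buffer with positions before yielding anything, which is why there is a startup delay. A minimal sketch of the same idea in plain Python (not TensorFlow's actual implementation):

```python
import random

def shuffle_buffer(items, buffer_size, rng=random):
    # Fill a buffer of buffer_size items first (the startup delay seen
    # in the logs above), then yield a random buffered item as each new
    # item arrives, replacing it in the buffer.
    buf = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        i = rng.randrange(buffer_size)
        yield buf[i]
        buf[i] = item
    # Input exhausted: flush the remaining buffer in random order.
    rng.shuffle(buf)
    yield from buf
```

A larger buffer gives a better shuffle but a longer wait before the first batch appears.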
@roy7 I'm also getting
Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC
How did you get rid of it? I don't suppose I need to change any code, right?
from leela-zero.
I believe it's an incompatibility between the TensorFlow binaries and the version of Python, since newer TF binaries aren't available yet. I'm not certain. But the problem went away when I uninstalled the TF package and compiled it myself.
from leela-zero.
@roy7 I did, I compiled the latest 1.4.1. What version are you using?
from leela-zero.
1.4.0. It seemed to be the latest on GitHub when I cloned the repo to compile it.
from leela-zero.
I switched to 1.4.0 but ended up with the same error: Conv2DCustomBackpropFilterOp only supports NHWC
I notice there are todo-list items that might be related to my issue:
CPU support for Xeon Phi and for people without a GPU
AMD-specific version using MIOpen
@gcp, do I need an NVIDIA GPU to run the training? If the training only supports GPUs, do you have any thoughts about how to make AMD cards work? I have an AMD RX 480, and I can try to make it work if you point me in the right direction.
from leela-zero.
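For what it's worth, the error above is about tensor layout: TensorFlow's CPU convolution kernels generally only implement NHWC (batch, height, width, channels), while an NCHW layout needs GPU kernels. A small layout sketch, assuming leela-zero's 18 input planes on a 19x19 board:

```python
import numpy as np

# One batch of input planes in NCHW layout: (batch, channels, height, width).
nchw = np.zeros((1, 18, 19, 19))

# The NHWC layout the CPU conv kernels expect: (batch, height, width, channels).
nhwc = nchw.transpose(0, 2, 3, 1)
```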
I struggled all day trying to do supervised learning. I downloaded a couple of SGFs and did:
src/leelaz -w weights.txt
dump_supervised bigsgf.sgf train.out
exit
The output has no win % (the raw training data from http://leela.online-go.com/training/ has win % inside), which I think is ok, right?
Now, after I run:
training/tf/parse.py train.out
the script gets stuck every time at:
python ..\parse.py train.out
Training with 64 chunks, validating on 7 chunks
Using 4 worker processes.
Using 4 worker processes.
2018-07-12 20:25:51.489286: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports inst
ructions that this TensorFlow binary was not compiled to use: AVX
2018-07-12 20:25:51.756302: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 wit
h properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.253
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.80GiB
2018-07-12 20:25:51.774303: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu
\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlo
w device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:
01:00.0, compute capability: 5.2)
GPU RAM usage is at 1.8GB, so it seems it did something, but nothing happened even when I left it for an hour. Am I doing something wrong? I reduced BATCH_SIZE to 256 because I had a MemoryError earlier, and now there's no error, but nothing happens. Same thing if I use raw training data.
EDIT: I set up Ubuntu on my PC and it works now. Well, kinda. It sits like the above for 15-20 minutes, then starts to actually do something. What might cause the freeze? Or maybe it's normal?
My hardware setup is: AMD FX-6300, 8GB RAM, Nvidia GTX 960 2GB RAM
from leela-zero.
Ok I figured out my problem. I was thinking the parameter to parse.py was a filename, but it's a prefix. So if I have train.out.0.gz in the directory, then
./parse.py train.out
finds my test chunk. This does lead to
2017-11-12 11:36:21.810404: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-12 11:36:24.281390: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.
though. I'll see what I can google.
Edit: I'll try compiling tensorflow myself, to be sure I use all available CPU instructions plus GPU support.
Me too. Did you find a solution? Can you teach me how to solve it?
from leela-zero.