Comments (11)

mrastegari commented on September 28, 2024

OK, try to fix the precision issue by adding
gradParameters:mul(1e+5)
after line 184 in train.lua.

from xnor-net.

mrastegari commented on September 28, 2024

Let's double-check a few things first:
1- Could you get the same accuracy with the pretrained models?
2- Could you train the BWN?
3- I have noticed that in some versions of cuDNN the precision of division causes convergence issues. If you are using Adam, you can multiply all the gradients by a large number to prevent the precision error that leads to NaN.
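
As a rough illustration of point 3, here is a minimal plain-Python sketch of a single Adam step on one scalar gradient. It does not reproduce the cuDNN precision issue; it only shows why multiplying gradients by a large constant is mostly harmless under Adam: the step is a ratio of gradient statistics, so the scale cancels except for the small eps term. The function name and the default hyperparameters (lr, beta1, beta2, eps) are assumptions for illustration, not values taken from train.lua.

```python
import math


def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the size of the first Adam update for a scalar gradient.

    Hypothetical helper; the defaults are assumed, not read from the repo.
    """
    # Moment estimates after one step, starting from zero.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    # Bias correction for step t = 1.
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    # The update is a ratio, so a common scale factor on grad cancels
    # except for the eps term in the denominator.
    return lr * m_hat / (math.sqrt(v_hat) + eps)


tiny_grad = 1e-9
step_unscaled = adam_first_step(tiny_grad)
step_scaled = adam_first_step(tiny_grad * 1e5)
```

With a very small gradient, eps dominates the denominator and the unscaled step shrinks well below lr; after scaling by 1e+5 the step is close to lr again.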

zhaoweicai commented on September 28, 2024

Hi @mrastegari
The accuracies I get for the two pretrained models are 56.67 and 42.37, respectively. I can train BWN, but I stopped at epoch #30, where top-1 accuracy was 25.57. That training run was several weeks ago, before you fixed some bugs. For XNOR-Net, however, I cannot get training to converge at all. I don't know whether others have encountered the same issue.

zhaoweicai commented on September 28, 2024

Just to make sure: I should add gradParameters:mul(1e+5) after updateBinaryGradWeight(convNodes), right? It still doesn't work for me. Has anyone experienced the same issue?

mrastegari commented on September 28, 2024

After how many iterations do you see the divergence? Also, try following the paper by replacing the updateBinaryGradWeight function with:

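function updateBinaryGradWeight(convNodes)
   -- Skip the first and last convolution layers.
   for i = 2, #convNodes-1 do
      local n = convNodes[i].weight[1]:nElement()  -- elements per filter
      local s = convNodes[i].weight:size()
      -- Zero the gradient wherever the weight lies outside (-1, 1).
      convNodes[i].gradWeight[convNodes[i].weight:le(-1)] = 0
      convNodes[i].gradWeight[convNodes[i].weight:ge(1)] = 0
      -- Add 1/n to every entry, then scale by (1 - 1/s[2]).
      convNodes[i].gradWeight:add(1/n):mul(1 - 1/s[2])
   end
   -- Keep the GPU replicas in sync when training on more than one GPU.
   if opt.nGPU > 1 then
      model:syncParameters()
   end
end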
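
For readers less familiar with Torch's masked tensor indexing, the per-layer arithmetic of the function above can be sketched in plain Python on flat lists. This is only an illustration of the three tensor operations under assumed inputs: the function name, the flat-list layout, and the in_channels argument (standing in for s[2]) are hypothetical, and the multi-GPU sync is left out.

```python
def update_binary_grad_weight(weights, grads, n, in_channels):
    """Apply the same three steps as the Lua code to one layer.

    weights, grads: flat lists of floats (hypothetical layout).
    n: number of elements in one filter.
    in_channels: stands in for s[2] in the Lua code.
    """
    scale = 1.0 - 1.0 / in_channels
    updated = []
    for w, g in zip(weights, grads):
        # Zero the gradient where the weight is <= -1 or >= 1.
        if w <= -1.0 or w >= 1.0:
            g = 0.0
        # Add 1/n, then multiply by (1 - 1/in_channels).
        updated.append((g + 1.0 / n) * scale)
    return updated


example = update_binary_grad_weight(
    weights=[-1.5, 0.5, 1.0],
    grads=[2.0, 2.0, 2.0],
    n=4,
    in_channels=2,
)
```

Only the middle entry keeps its incoming gradient, because its weight lies strictly inside (-1, 1); the other two receive just the 1/n term before scaling.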

zhaoweicai commented on September 28, 2024

Hi @mrastegari
Thanks for your help, but it still doesn't work for me. The training starts to diverge at the very beginning with err=nan. I have started retraining BinaryNet, and it seems to work very well so far. XNOR-Net never works for me.

mrastegari commented on September 28, 2024

I just pushed a modification. Can you check it?

zhaoweicai commented on September 28, 2024

Thanks for your help. First, I changed '-cache' to './cache/', but it still doesn't work. The error becomes 'nan' right at the beginning every time, even though I have run the experiments dozens of times with different random seeds. Has anyone successfully reproduced the XNOR experiments yet? I am confused. BTW, I re-ran the Binary-Net experiment and got an accuracy of 51.65% in the end. Does the XNOR code work well for you? What do you think the problem is?

mrastegari commented on September 28, 2024

There is definitely something wrong with your setup. I asked a friend to try it on his machine, and he could reproduce the same result (~43%). Which version of Binary-Net are you using? 51.65% top-1 is too good for binary inputs and binary weights. Do you have code for that?

zhaoweicai commented on September 28, 2024

Hi @mrastegari
I found the problem: running on multiple GPUs. When I switched to 1 GPU, the model started to converge. For the multi-GPU version, maybe I used different CUDA and cuDNN versions. Could you share which versions you use? Thanks!

mrastegari commented on September 28, 2024

I use CUDA 7.5 and cuDNN 5. I also had this problem on some machines whose mainboards were incompatible with the GPUs.
