soumith / cudnn.torch
Torch-7 FFI bindings for NVIDIA CuDNN
License: BSD 2-Clause "Simplified" License
Let's track them here:
- nn.SpatialLogSoftMax, should be addressed by torch/nn#560
- nn.SpatialCrossEntropyCriterion
- nn.TemporalConvolution does not have padH support and the current implementation of cudnn.TemporalConvolution needs modifications to support cudnn.convert in R4
- nn.SpatialBatchNormalization does not support 5D inputs in R4
- nn.SpatialConvolution and cudnn.SpatialConvolution in R3 do not support noBias() (will cause error on conversion)
- nn.SpatialConvolution does not support groups (will cause error on cudnn.convert cudnn -> nn)
It does not work and there is no test. I had been thinking of fixing it while it was R3; now it's master and has to be fixed.
Running on Yosemite. All NVIDIA components current. Recent update of torch installation. Upon load of cudnn I get:
/Users/seth/Dev/torch/install/bin/luajit: /Users/seth/Dev/torch/install/share/lua/5.1/trepl/init.lua:363: /Users/seth/Dev/torch/install/share/lua/5.1/trepl/init.lua:363: /Users/seth/Dev/torch/install/share/lua/5.1/cudnn/init.lua:26: Error in CuDNN: CUDNN_STATUS_INTERNAL_ERROR
This is not stopping me from anything - but it is new, thought you'd like to know.
th> require 'nn'; require 'cudnn';
[0.8705s]
th> net = nn.Sequential():add(nn.Linear(10, 5)):add(cudnn.ReLU(true)):add(nn.LogSoftMax())
[0.0001s]
th> net:cuda()
nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Linear(10 -> 5)
(2): cudnn.ReLU
(3): nn.LogSoftMax
}
[0.0039s]
th> net:forward(torch.randn(10):cuda())
/home/atcold/torch/install/share/lua/5.1/nn/LogSoftMax.lua:4: attempt to index field 'THNN' (a nil value)
stack traceback:
/home/atcold/torch/install/share/lua/5.1/nn/LogSoftMax.lua:4: in function 'updateOutput'
/home/atcold/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
[string "_RESULT={net:forward(torch.randn(10):cuda())}"]:1: in main chunk
[C]: in function 'xpcall'
/home/atcold/torch/install/share/lua/5.1/trepl/init.lua:651: in function 'repl'
...cold/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670
[0.2120s]
Oh, I guess the classifier has to be cunn rather than cudnn... I think I got confused.
What is the current status of the half precision stuff? With the Tegra X1 out, this is becoming more interesting. I have seen that @soumith has committed some first stuff with 27e969b. Is anyone working on this? I would generally be interested in contributing to this. Are there any thoughts on how this should best be done, with CudaTensor currently supporting single precision exclusively?
I am using torch.manualSeed() and cutorch.manualSeed() in my code. However, I get a different result every time when using the cudnn modules. The nn modules work normally.
Reproducing:
require 'cutorch'
require 'cunn'
require 'cudnn'
require 'optim'
local SpatialConvolution = cudnn.SpatialConvolution
local SpatialMaxPooling = cudnn.SpatialMaxPooling
--local SpatialConvolution = nn.SpatialConvolution
--local SpatialMaxPooling = nn.SpatialMaxPooling
cudnn.benchmark = false
torch.setdefaulttensortype("torch.FloatTensor")
torch.manualSeed(71)
cutorch.manualSeed(71)
local model = nn.Sequential()
model:add(nn.View(30, 48, 48))
model:add(SpatialConvolution(30, 64, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialMaxPooling(2, 2, 2, 2))
model:add(SpatialConvolution(64, 128, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialConvolution(128, 128, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialMaxPooling(2, 2, 2, 2))
model:add(nn.View(128 * 12 * 12))
model:add(nn.Linear(128 * 12 * 12, 512))
model:add(nn.Dropout(0.5))
model:add(nn.Linear(512, 1))
local inputs = torch.Tensor(32, 1, 30, 48, 48):uniform():cuda()
local targets = torch.Tensor(32, 1):uniform():cuda()
local criterion = nn.MSECriterion():cuda()
local config = { learningRate = 0.00001 }
model = model:cuda()
model:training()
local parameters, gradParameters = model:getParameters()
for i = 1, 100 do
   local feval = function(x)
      if x ~= parameters then
         parameters:copy(x)
      end
      gradParameters:zero()
      inputs = inputs:clone()
      local output = model:forward(inputs)
      local f = criterion:forward(output, targets)
      model:backward(inputs, criterion:backward(output, targets))
      return f, gradParameters
   end
   optim.adam(feval, parameters, config)
   if i % 10 == 0 then
      collectgarbage()
   end
end
model:evaluate()
local y = model:forward(inputs)
print(y:sum())
--[[
# on GTX760, CUDA 7.5, cuDNN v3
# cunn - same result
% th seed.lua
18.265419006348
% th seed.lua
18.265419006348
% th seed.lua
18.265419006348
% th seed.lua
18.265419006348
# cudnn - different result
% th seed.lua
18.265224456787
% th seed.lua
18.26549911499
% th seed.lua
18.265830993652
% th seed.lua
18.266021728516
--]]
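One possible source of run-to-run differences this small, offered as an assumption and not as a diagnosis of this report, is that some GPU kernels accumulate partial sums in an order that is not fixed between runs, and floating-point addition is not associative. The sketch below is plain Python, not Torch, and only shows the effect of summation order:

```python
# Floating-point addition is not associative, so the order in which
# partial sums are accumulated can change the last digits of a result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # the two orders disagree
print(abs(left_to_right - right_to_left))  # by a very small amount
```

Differences like this would start in the last bits and could grow over 100 optimizer steps, which would be consistent with sums that agree to three or four digits, though the report itself does not identify the cause.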
First of all, thanks for the very fast porting to R3. I noticed that fastest(), setMode() or resetMode() are ineffective once a forward pass has been done because createIODescriptors() is not fully executed. Suggested fix: add self.iSize:fill(0) to each function.
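The caching behaviour described here can be sketched in a few lines. This is a hypothetical Python stand-in, not the cudnn.torch code: the class, its fields and the mode strings are invented for illustration, with `cached_size` playing the role of `self.iSize`.

```python
class CachedLayer:
    """Hypothetical layer that rebuilds its setup only when the input size changes."""

    def __init__(self):
        self.mode = "default"
        self.cached_size = None   # plays the role of self.iSize
        self.active_mode = None   # mode the cached setup was built with

    def _create_descriptors(self, size):
        # Early exit when the input size is unchanged: the old setup is reused.
        if self.cached_size == size:
            return
        self.cached_size = size
        self.active_mode = self.mode

    def forward(self, size):
        self._create_descriptors(size)
        return self.active_mode

    def set_mode(self, mode, invalidate):
        self.mode = mode
        if invalidate:
            self.cached_size = None  # analogous to self.iSize:fill(0)


layer = CachedLayer()
layer.forward((32, 3, 27, 27))
layer.set_mode("fastest", invalidate=False)
stale = layer.forward((32, 3, 27, 27))   # still uses the old setup
layer.set_mode("fastest", invalidate=True)
fresh = layer.forward((32, 3, 27, 27))   # setup rebuilt with the new mode
print(stale, fresh)
```

Resetting the cached size forces the next forward pass through the full setup path, which is the effect the suggested fix is after.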
Running test on device: 1
Running 33 tests
________________________*________ ==> Done Completed 253 asserts in 33 tests with 1 errors
--------------------------------------------------------------------------------
TemporalConvolution_padding_batch
Function call failed
test.lua:285: bad argument #4 to 'narrow' (out of range at /home/sgross/local/cutorch/lib/THC/generic/THCTensor.c:367)
stack traceback:
[C]: in function 'narrow'
test.lua:285: in function <test.lua:268>
[C]: in function 'xpcall'
...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
test.lua:1324: in main chunk
[C]: in function 'dofile'
...rch-luajit/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406444
I have a GPU model gpu_net that I'd like to convert back to be a CPU model. I'm doing this with two calls: cpu_net = gpu_net:clone():float() followed by cudnn.convert(cpu_net, nn).
Unfortunately, this doesn't seem to be working. If I look at the output of cpu_net:listModules(), I still see instances of ReLU modules having self.mode = CUDNN_ACTIVATION_RELU.
If I save my CPU model to an ascii file and search through it for instances of 'cudnn', I find thousands.
I tried manually setting the self.mode instances to nil, but it didn't work.
For what it's worth, I'm using Element Research's rnn package extensively. Perhaps cudnn.convert() isn't cooperating with that?
Thanks.
nn.SpatialConvolution works fine in this case, but cudnn v2rc3 breaks with a C++ exception. Maybe we should add an assert, or investigate why it breaks?
When kH ~= kW, the weight & gradWeight have the size nOutputPlane x nInputPlane x kW x kH.
However, the cudnn kernels interpret the memory as having the size nOutputPlane x nInputPlane x kH x kW, which is consistent with nn.SpatialConvolution and nn.SpatialConvolutionMM. See the example below, which uses a convolution to mask an input tensor.
I could just submit a PR with a correct __init() function, or should we care about the fact that old cudnn models, once loaded, will still have the wrong tensor view? I suspect we shouldn't worry, since even a weight tensor loaded with the wrong view will still be interpreted correctly.
require 'cudnn'
kH, kW = 3,2
inp = torch.range(1, 6):view(1, 1, kH, kW)
net = cudnn.SpatialConvolution(1,1,kW,kH):cuda()
net.bias:zero()
net.weight:zero()
net.weight[{1,1,2,1}]=1
print("input")
print(inp)
print("cudnn: should be 1x1x3x2 instead of 1x1x2x3")
print(net.weight)
print("as demonstrated by the corresponding position in the input tensor")
print(net:forward(inp:cuda()):squeeze())
net2 = nn.SpatialConvolution(1,1,kW,kH)
net2.weight:copy(net.weight) -- copy despite different sizes
net2.bias:zero()
print("nn version has the right layout:")
print(net2.weight)
print(net2:forward(inp):squeeze())
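The layout mismatch can be checked by hand with flat offsets. The sketch below is plain Python, not Torch, and only follows the index arithmetic of the example above: a 1 written at position (2, 1) of the kW x kH view is read back through a kH x kW view of the same buffer.

```python
k_h, k_w = 3, 2
flat = [0.0] * (k_h * k_w)

# Write a 1 at 1-based position (2, 1) of the kW x kH (2 x 3) view,
# as net.weight[{1,1,2,1}] = 1 does in the report.
row, col = 2, 1
offset = (row - 1) * k_h + (col - 1)
flat[offset] = 1.0

# Read the same buffer as kH x kW (3 x 2), the layout the report says
# the kernels assume.
row_hw, col_hw = divmod(offset, k_w)
print((row_hw + 1, col_hw + 1))   # 1-based position in the 3 x 2 view

inp = [[1, 2], [3, 4], [5, 6]]    # torch.range(1, 6) viewed as kH x kW
print(inp[row_hw][col_hw])        # input value that this weight lines up with
```

Under the kH x kW reading the weight sits at (2, 2) and lines up with input value 4, so a forward pass that interprets the memory this way would be expected to pick out that element.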
Hi,
When I use cudnn.SpatialConvolution with kernels of size 1x1, it works correctly for the forward pass but I don't get the expected result for the backprop: only the first plane is correct, the other ones are all zeros. Here is a piece of code that fails:
require 'cudnn'
n = cudnn.SpatialConvolution(2, 2, 1, 1):cuda()
a = torch.randn(1, 2, 3, 3):cuda()
y = n:forward(a)
x = n:backward(a, y)
print(x)
If it is a bug of cudnn, it would be nice to at least have an assert to check for the kernel size.
Thanks!
Michael
There seems to be a problem with groups in SpatialConvolution and cudnn.benchmark or cudnn.fastest options. They seem to produce different results w.r.t. the standard mode. Let's have the following code
require 'torch'
require 'cudnn'
local g = 1 --2
local M = cudnn.SpatialConvolution(384,384, 3,3, 1,1, 1,1, g):cuda()
local M2 = M:clone()
for i=1,10 do
   local p = torch.CudaTensor(1,384,16,16):normal(0,1)
   cudnn.benchmark = false
   M:forward(p)
   cudnn.benchmark = true
   M2:forward(p)
   local a = M.output
   local b = M2.output
   print(i,(a-b):norm(),(a-b):abs():max())
end
On my Titan Black and cudnn v3, g=1 outputs zeros but g=2 outputs nonzeros. On Sergey's Titan Black and v4, g=2 outputs zeros sometimes and on Titan X it "hangs". Any explanations?
Doesn't convert nn.BatchNormalization to cudnn.BatchNormalization. The same for Spatial.
cudnn.Tanh is actually a lot slower than nn.Tanh
What's the point in it then?
torch.setdefaulttensortype('torch.CudaTensor')
testCase = torch.rand(10*1000)
wikinet=nn.Sequential()
collectgarbage()
wikinet:add( nn.Linear( 10*1000, 10*1000 ) )
wikinet:add( cudnn.Tanh() )
local timer = torch.Timer() -- the Timer starts to count now
print( #wikinet:forward( testCase ) )
print(timer:time().real)
local timer = torch.Timer() -- the Timer starts to count now
print( #wikinet:forward( testCase ) )
print( #wikinet:forward( testCase ) )
print(timer:time().real)
wikinet=nn.Sequential()
collectgarbage()
wikinet:add( nn.Linear( 10*1000, 10*1000 ) )
wikinet:add( nn.Tanh() )
local timer = torch.Timer() -- the Timer starts to count now
print( #wikinet:forward( testCase ) )
print(timer:time().real)
local timer = torch.Timer() -- the Timer starts to count now
print( #wikinet:forward( testCase ) )
print( #wikinet:forward( testCase ) )
print(timer:time().real)
torch.setdefaulttensortype('torch.FloatTensor')
collectgarbage()
As @eladhoffer pointed out, 1x1 convolutions do not give proper backward results. Setting
local ki, kj = 1,1
local si, sj = 1,1
breaks the test.
This is only in Linux, OS X is fine.
cudnn did their final release of R3.
Now cudnn.torch's master branch points to the R3 bindings. If you want R2 bindings specifically, check out the R2 branch.
I noticed that cudnn.h includes this rather cryptic comment for the cudnnBatchNormalizationForwardTraining() function:
/* MUST use factor=1 in the very first call of a complete training cycle.
Use a factor=1/(1+n) at N-th call to the function to get
Cumulative Moving Average (CMA) behavior
CMA[n] = (x[1]+...+x[n])/n
Since CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) =
((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1) =
CMA[n]*(1-1/(n+1)) + x[n+1]*1/(n+1) */
double exponentialAverageFactor,
In the current version of the code, this is just the momentum parameter which defaults to 0.1. Is it worth implementing the proper averaging behavior? The advantage is that this would allow us to capture the mean and variance across the entire training set.
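To make the difference concrete, here is a small sketch in plain Python (not Torch or cuDNN). It applies the update form from the header comment, new = old*(1-factor) + x*factor, once with factor = 1/(1+n) and once with a fixed factor of 0.1. The sample values and the zero starting point are assumptions chosen for illustration.

```python
def running_update(running, x, factor):
    # Same form as the cudnn.h comment: new = old*(1-factor) + x*factor
    return running * (1.0 - factor) + x * factor

samples = [2.0, 4.0, 9.0, 1.0]   # hypothetical per-batch statistics

# Cumulative moving average: factor = 1/(1+n), so factor = 1 on the first call.
cma = 0.0
for n, x in enumerate(samples):
    cma = running_update(cma, x, 1.0 / (1 + n))

# Fixed factor, as with a constant momentum of 0.1.
ema = 0.0
for x in samples:
    ema = running_update(ema, x, 0.1)

print(cma, sum(samples) / len(samples))  # cumulative average matches the plain mean
print(ema)                               # fixed factor weights recent samples differently
```

With the 1/(1+n) schedule every batch contributes equally to the running value, while a fixed factor gives an exponentially weighted average that also depends on the starting value.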
"I have to time to support these, so please dont expect a quick response to filed github issues."
Do you mean that you DON'T have time to support these?
The install instructions for CUDNN have you add its directory to DYLD_LIBRARY_PATH. Unfortunately, El Capitan changed how that environment variable works: certain executables (maybe everything in /bin?) don't inherit DYLD_* anymore. It has to do with System Integrity Protection, apparently. Unfortunately, since the th command is a #!/bin/sh script, th doesn't inherit the library path and you get
libcudnn (R4) not found in library path.
Please install CuDNN from https://developer.nvidia.com/cuDNN
Then make sure files named as libcudnn.so.4 or libcudnn.4.dylib are placed in your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)
I hacked around this temporarily by setting the DYLD_LIBRARY_PATH inside ~/torch/install/bin/th, but of course that's fragile. It would be great if somehow cudnn.torch could link directly against a CUDNN library so we wouldn't have to patch the linker to find it.
A workaround I haven't tried might be to put cudnn in ~/lib or /usr/local/lib, since those are on the library path by default.
Edit: some additional info from other projects: oracle/node-oracledb#231
I recently updated (Feb. 17) to the latest versions, i.e. cuDNN v4 (Feb. 10), nn, cutorch, cunn, and cudnn.torch. The problem I am seeing is that my networks are no longer able to learn... I rechecked my code, installation, and everything seems okay. I systematically started swapping out cudnn backend with nn equivalent and it seems that SpatialBatchNormalization is not working as expected. On CIFAR-10, I can get 93% when I use everything with cudnn backend but with nn.SpatialBatchNormalization. When I use cudnn.SpatialBatchNormalization I can't get past 10% (always guessing same label).
Any thoughts? I can investigate more but looking at the commit logs I see a number of substantial changes to SpatialBatchNormalization.
Thanks.
Running test on device: 1
Running 27 tests
______________*____________ ==> Done Completed 85 asserts in 27 tests with 1 errors
--------------------------------------------------------------------------------
SpatialCrossEntropyCriterion
error in difference between central difference and :backward
LT(<) violation val=0.16240306198597, condition=0.01
/opt/rocks/distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:795: in function <test/test.lua:767>
--------------------------------------------------------------------------------
Dear all,
I tried running a rather simple ConvNet using cuDNN (the supervised demo [1] where I replaced the convolutions, relu, and pooling by the cuDNN equivalents) and it sometimes trains fine, but sometimes just seems to be doing nothing at all (train and test error remain at the initial 19% global acc. forever). There are no errors reported. Using the cunn modules instead of the cuDNN ones always works fine, so this seems to be a problem specific to cuDNN.
Another interesting thing: when I print the biases for some layers (same for the weights) using e.g.
print(model:get(1).bias:float())
they are all reported as nan's (in a run where cuDNN "hangs". Otherwise they are small numbers). Doing the same with the cunn-convolutions always prints nice small numbers.
Anyone else seen this problem before? Is there a way to at least detect when cuDNN "hangs" (besides monitoring the biases for nan's)? It's kind of annoying when your net is not learning at all and you do not know whether this is due to bad hyperparameters, or just because cuDNN hangs.
I'm on OSX 10.8 with Cuda 6.5 and cuDNN R2. I also use Caffe and Mocha with cuDNN backends and have never seen this problem occurring with these libs before.
Best
Michael
[1] https://github.com/torch/tutorials/blob/master/2_supervised/2_model.lua
Currently we can't use SpatialFullConvolution, because init.lua doesn't initialize it.
If affine is set to false, the weight and bias are nil, leading to an error in createIODescriptors. Maybe it would be better to assert in the constructor itself that affine==false is not permitted at the moment?
In order to use nn.SpatialBatchNormalization during test I need to simulate a batch (4D) from a single sample image (3D), i.e. make a 1xNxM --> 1x1xNxM tensor. I use nn.Replicate(1) for this but cudnn.SpatialConvolution complains about its input. Here is the code to reproduce the error:
m = nn.Sequential()
m:add(nn.Replicate(1))
m:add(cudnn.SpatialConvolution(1,3,5,5))
m:cuda()
m:forward(torch.rand(1,7,7):cuda())
I got this:
/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_BAD_PARAM
stack traceback:
[C]: in function 'error'
/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
/torch/install/share/lua/5.1/cudnn/init.lua:76: in function 'toDescriptor'
/torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:108: in function 'createIODescriptors'
/torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:339: in function 'updateOutput'
/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
[string "_RESULT={m:forward(torch.rand(1,7,7):cuda())}"]:1: in main chunk
[C]: in function 'xpcall'
/torch/install/share/lua/5.1/trepl/init.lua:650: in function 'repl'
/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670
But this works perfectly:
m.modules[2]:forward(torch.rand(1,1,7,7):cuda()):size()
By the way the nn version also works
m.modules[2] = nn.SpatialConvolution(1,3,5,5)
m:forward(torch.rand(1,7,7):cuda()):size()
Any workaround for the batch simulation is also welcome.
Hi,
I am getting the following error while using cudnn v4 & nvidia 7.5 on a K40 GPU.
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5709/cutorch/lib/THC/THCGeneral.c line=591 error=8 : invalid device function
/home/sk1846/torch/install/bin/luajit: /home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:57: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-5709/cutorch/lib/THC/THCGeneral.c:591
stack traceback:
[C]: in function 'resizeAs'
/home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:57: in function 'createIODescriptors'
/home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:88: in function 'updateOutput'
/home/sk1846/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
/home/sk1846/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
I've got tons of annoying warnings when saving cudnn networks:
$ Warning: cannot write object field <weightDesc>
$ Warning: cannot write object field <biasDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <convDesc>
$ Warning: cannot write object field <oDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <oDesc>
$ Warning: cannot write object field <weightDesc>
$ Warning: cannot write object field <biasDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <convDesc>
$ Warning: cannot write object field <oDesc>
Is it possible to suppress them?
Hi,
When I use cudnn.SoftMax, I get an assertion failure on line 21 of cudnn.SpatialSoftMax. My input is batchSize x nClasses, so it doesn't have 4 dimensions, which makes the assertion fail.
Shouldn't SoftMax handle input of dimension 2 ? Otherwise, I don't see the difference between SoftMax and SpatialSoftMax (or there is something I missed).
Michael
Looks like cudnn R2 is a release candidate for a good reason.
For convolution kernels, there are lots of illegal memory accesses. I encountered this while debugging:
torch/cutorch#87 reported by both @russelfei and @szagoruyko
running a script with cuda-memcheck returns tons of illegal memory access issues, which in turn are failing the later operations in the pipeline, which is the error that shows up in cutorch.
Do you guys want me to revert this repo to R1 and put R2-rc1 in a separate branch, and only make the repo R2-ready when R2 final candidate is released?
| inn | cudnn R3 |
|---|---|
| inn.SpatialCrossResponseNormalization | cudnn.SpatialCrossLRN |
| inn.SpatialSameResponseNormalization | cudnn.SpatialDivisiveNormalization |
Also, cudnn.SpatialCrossLRN has different default parameters. Should we maybe use inn naming and defaults?
cc @fmassa
Have made CuDNN R2-rc2 the default bindings in master now.
If you need R1 bindings, checkout the branch R1 instead of master.
Very nice speedups with R2-rc2 and all bugs seem to be gone.
When I save and then load a cudnn.SpatialConvolution module, the fields iDesc and convDesc are nil.
code to reproduce:
require 'cudnn'
input = torch.CudaTensor(32,3,27,27)
--[[
net = cudnn.SpatialConvolution(3,96,5,5)
net = net:cuda()
o = net:forward(input)
torch.save('model.net', net)
--]]
net = torch.load('model.net')
o = net:forward(input)
As pointed out by @colesbury in torch/cunn#185
Benchmarks:
GeForce GTX TITAN X
nn
nn 64 x 16 x 112 x 112 4.82 ms [ 4.82 ms]
nn 64 x 64 x 56 x 56 3.69 ms [ 22.15 ms]
nn 64 x 128 x 28 x 28 1.23 ms [ 9.81 ms]
nn 64 x 256 x 14 x 14 0.64 ms [ 7.63 ms]
nn 64 x 512 x 7 x 7 0.33 ms [ 1.98 ms]
Weighted Total: 46.39 ms
GeForce GTX TITAN X
cudnn
cudnn 64 x 16 x 112 x 112 34.48 ms [ 34.48 ms]
cudnn 64 x 64 x 56 x 56 8.94 ms [ 53.62 ms]
cudnn 64 x 128 x 28 x 28 2.41 ms [ 19.31 ms]
cudnn 64 x 256 x 14 x 14 0.81 ms [ 9.76 ms]
cudnn 64 x 512 x 7 x 7 0.47 ms [ 2.84 ms]
Weighted Total: 120.00 ms
GeForce GTX TITAN Black
nn
nn 64 x 16 x 112 x 112 6.76 ms [ 6.76 ms]
nn 64 x 64 x 56 x 56 4.96 ms [ 29.75 ms]
nn 64 x 128 x 28 x 28 2.01 ms [ 16.07 ms]
nn 64 x 256 x 14 x 14 0.95 ms [ 11.44 ms]
nn 64 x 512 x 7 x 7 0.71 ms [ 4.29 ms]
Weighted Total: 68.30 ms
GeForce GTX TITAN Black
cudnn
cudnn 64 x 16 x 112 x 112 31.96 ms [ 31.96 ms]
cudnn 64 x 64 x 56 x 56 8.26 ms [ 49.56 ms]
cudnn 64 x 128 x 28 x 28 3.51 ms [ 28.05 ms]
cudnn 64 x 256 x 14 x 14 1.70 ms [ 20.35 ms]
cudnn 64 x 512 x 7 x 7 0.97 ms [ 5.85 ms]
Weighted Total: 135.77 ms
Will close when NVIDIA fixes it.
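For a quick read of the numbers above, the weighted totals can be turned into ratios. The sketch below is plain Python; the totals are copied from the benchmark output and nothing else is assumed.

```python
# Weighted totals (ms) copied from the benchmark output above.
totals_ms = {
    "GeForce GTX TITAN X": {"nn": 46.39, "cudnn": 120.00},
    "GeForce GTX TITAN Black": {"nn": 68.30, "cudnn": 135.77},
}

# Ratio of the cudnn weighted total to the nn weighted total per GPU.
slowdown = {gpu: t["cudnn"] / t["nn"] for gpu, t in totals_ms.items()}

for gpu, ratio in slowdown.items():
    print(f"{gpu}: cudnn takes {ratio:.2f}x the nn time")
```

On these figures the cudnn path takes roughly 2.6 times as long as nn on the TITAN X and roughly 2 times as long on the TITAN Black, with the largest gap on the 64 x 16 x 112 x 112 input.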
The requirements for contiguous tensors seem to be too strict with R3. Actually, cudnn supports operations on non-contiguous data well; there are just some constraints, typically that input and gradInput, and output and gradOutput, need to have the same strides. Thus, I was able to extend Pointwise (ReLU) to work with non-contiguous tensors. Basically one just needs to create personalized descriptors for every tensor and not share them.
I'm using DepthConcat over convolutions and it's crashing for me with cudnn, but not with cunn, with "gradOutput has to be contiguous" on the backward pass.
Example: https://gist.github.com/etrulls/1444933b86eeda31afd6
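What "contiguous" means in terms of sizes and strides can be sketched without Torch. The Python below is a simplified check (it does not special-case dimensions of size 1), and the tensor shapes are made up for illustration. It shows that a slice along the channel dimension keeps its parent's strides and therefore is no longer densely packed, which is the kind of tensor a concat-style container may hand to its children on the backward pass.

```python
def contiguous_strides(sizes):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = [1] * len(sizes)
    for d in range(len(sizes) - 2, -1, -1):
        strides[d] = strides[d + 1] * sizes[d + 1]
    return strides


def is_contiguous(sizes, strides):
    """Simplified check: strides must equal the densely packed strides."""
    return list(strides) == contiguous_strides(sizes)


full_sizes = [8, 6, 5, 5]                       # hypothetical batch x channels x h x w
full_strides = contiguous_strides(full_sizes)   # strides of the packed tensor

# Narrowing the channel dimension from 6 to 4 keeps the parent's strides.
slice_sizes = [8, 4, 5, 5]

print(is_contiguous(full_sizes, full_strides))
print(is_contiguous(slice_sizes, full_strides))
```

A descriptor built from the slice's actual strides would describe such a tensor exactly, which is the idea behind per-tensor descriptors mentioned above.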
Is there any particular reason for not having nn to cudnn conversion function in R4?
Someone from nvidia pointed out that the buffer size returned by cudnnGetConvolutionForwardWorkspaceSize is the number of bytes.
So I am allocating excessive memory here because I thought the size was the number of floats needed. Fix it.
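The arithmetic behind the over-allocation is small enough to sketch. This is plain Python with a made-up byte count, not the binding code: treating a byte count as a float count allocates four times the needed memory, and the correction is a rounded-up division by the float size.

```python
FLOAT_BYTES = 4  # size of a single-precision float


def floats_needed(workspace_bytes):
    # Round up so the float buffer covers the full byte count.
    return (workspace_bytes + FLOAT_BYTES - 1) // FLOAT_BYTES


workspace_bytes = 1_000_002             # hypothetical value returned by the size query
wrong = workspace_bytes                 # bytes mistakenly used as a float count
right = floats_needed(workspace_bytes)  # float count that covers the bytes

print(wrong * FLOAT_BYTES, right * FLOAT_BYTES)
```

Rounding up matters when the byte count is not a multiple of four, as in the value chosen here.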
There's an issue when changing input sizes: some ffi allocation isn't getting freed and the stack gets corrupted.
@clementfarabet if you have the chance, have a glance at SpatialConvolution.lua :createIODescriptors and see if there's anything obvious that stands out.
https://github.com/soumith/cudnn.torch/blob/R3/SpatialAveragePooling.lua#L6 mode is different when loaded from older versions
Have just tested in Ubuntu, all tests pass. But on OS X they do not:
____*__*______ ==> Done Completed 50 asserts in 14 tests with 3 errors
--------------------------------------------------------------------------------
Tanh_single
error on state (forward)
LT(<) violation val=nan, condition=0.0001
/usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:329: in function <test/test.lua:303>
--------------------------------------------------------------------------------
Tanh_single
error on state (backward)
LT(<) violation val=nan, condition=0.01
/usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:332: in function <test/test.lua:303>
--------------------------------------------------------------------------------
SoftMax_single
error on state (backward)
LT(<) violation val=nan, condition=0.01
/usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:467: in function <test/test.lua:437>
--------------------------------------------------------------------------------
weird
I uncovered this problem when I moved to a local install. I cannot load models with cudnn saved before December. The issue is with Sequential models with cudnn inside, which is pretty generic.
So as I've got some global outdated installs that can load these models I tried to debug and search the module that causes troubles. So let's say this is my model:
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
(1): nn.Reshape
(2): cudnn.SpatialConvolution
(3): cudnn.ReLU
(4): cudnn.SpatialMaxPooling
(5): cudnn.SpatialConvolution
}
I save it with old install and open with new install and got errors like this:
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:254: unknown object
stack traceback:
[C]: in function 'error'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:254: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:211: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:234: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:234: in function 'readObject'
/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:271: in function 'load'
When I debug it recursively goes with readObject and finds the last SpatialConvolution with typeidx=1332609024 and doesn't know what to do with it.
If I split Sequential and save it module by module - it loads.
cudnnR4 doesn't choose the optimal algorithm in the fully-connected mode, even with cudnn.benchmark = true, which results in ~20x slower backward pass compared to MatConvNet.
Torch:
--TORCH OUTPUT (in seconds)
--forward 0.24779605865479
--backward 4.5414280891418
--forward 0.051395893096924
--backward 4.5211651325226
--forward 0.054457902908325
--backward 4.5210771560669
-- with cudnn.benchmark = true
--forward 14.457499027252
--backward 0.98335909843445
--forward 0.045572996139526
--backward 0.98773503303528
--forward 0.045454025268555
--backward 0.98268413543701
require 'cudnn'
require 'hdf5'
function gpuTicToc(f)
   cutorch.synchronize()
   local tic = torch.tic()
   f()
   cutorch.synchronize()
   return torch.toc(tic)
end
model = cudnn.SpatialConvolution(256, 4096, 6, 6, 1, 1):cuda(); model.weight:fill(1); model.bias:fill(1)
input = torch.CudaTensor(1600, 256, 6, 6):fill(1):cuda()
for i = 1, 3 do
   model:zeroGradParameters()
   print('forward', gpuTicToc(function()
      model:forward(input)
   end))
   one = torch.CudaTensor():resize(model.output:size()):fill(1)
   print('backward', gpuTicToc(function()
      model:backward(input, one)
   end))
end
model:float()
h = hdf5.open('test.h5', 'w')
h:write('/output', model.output)
h:write('/gradInput', model.gradInput)
h:write('/gradWeight', model.gradWeight)
h:write('/gradBias', model.gradBias)
h:close()
MatConvNet:
%MATLAB OUTPUT (in seconds)
%
%forward 0.224209
%backward 0.046167
%forward 0.045812
%backward 0.044633
%forward 0.043401
%backward 0.044506
%
%output diff: 0.000000
%gradInput diff: 0.000000
%gradWeight diff: 0.000000
%gradBias diff: 0.000000
%addpath('matconvnet-1.0-beta18/matlab'); vl_compilenn('EnableGpu', true, 'EnableCudnn', true, 'CudnnRoot', '/home/kantorov/cudnnR4');
run('matconvnet-1.0-beta18/matlab/vl_setupnn.m');
weight = gpuArray(ones(6, 6, 256, 4096, 'single'));
bias = gpuArray(ones(1, 4096, 'single'));
input = gpuArray(ones(6, 6, 256, 1600, 'single'));
one = gpuArray(single(ones(1, 1, 4096, 1600)));
for i = 1:3
    wait(gpuDevice); tic;
    output = vl_nnconv(input, weight, bias);
    wait(gpuDevice); fprintf('forward %f\n', toc);
    wait(gpuDevice); tic;
    [dzdx, dzdf, dzdb] = vl_nnconv(input, weight, bias, one);
    wait(gpuDevice); fprintf('backward %f\n', toc);
end
torch_output = h5read('test.h5', '/output');
torch_gradInput = h5read('test.h5', '/gradInput');
torch_gradWeight = h5read('test.h5', '/gradWeight');
torch_gradBias = h5read('test.h5', '/gradBias');
fprintf('output diff: %f\n', sum(abs(reshape(torch_output, [], 1) - reshape(output, [], 1))));
fprintf('gradInput diff: %f\n', sum(abs(reshape(torch_gradInput, [], 1) - reshape(dzdx, [], 1))));
fprintf('gradWeight diff: %f\n', sum(abs(reshape(torch_gradWeight, [], 1) - reshape(dzdf, [], 1))));
fprintf('gradBias diff: %f\n', sum(abs(reshape(torch_gradBias, [], 1) - reshape(dzdb, [], 1))));
Replacing cudnn.SpatialConvolution with nn.Linear makes Torch and MatConvNet even:
--TORCH OUTPUT (in seconds) with nn.Linear
--forward 0.046329975128174
--backward 0.048556089401245
--forward 0.045660018920898
--backward 0.046145915985107
--forward 0.045567989349365
--backward 0.043753862380981
For the same input {0.5, 0.5}
cudnn.SoftMax
0.5000
0.5000
[torch.CudaTensor of size 2]
nn.SoftMax
0.5000
0.5000
[torch.CudaTensor of size 2]
cudnn.LogSoftMax
0.5000
0.5000
[torch.CudaTensor of size 2]
nn.LogSoftMax
-0.6931
-0.6931
[torch.CudaTensor of size 2]
Here is my code
require 'torch'
require 'nn'
require 'cutorch'
require 'cunn'
require 'cudnn'
input = torch.Tensor{0.5,0.5}
input = input:cuda()
model = cudnn.SoftMax()
model:cuda()
output_cudnnSM = model:forward(input)
print('cudnn.SoftMax')
print(output_cudnnSM)
model = nn.SoftMax()
model:cuda()
output_nnSM = model:forward(input)
print('nn.SoftMax')
print(output_nnSM)
model = cudnn.LogSoftMax()
model:cuda()
output_cudnnLSM = model:forward(input)
print('cudnn.LogSoftMax')
print(output_cudnnLSM)
model = nn.LogSoftMax()
model:cuda()
output_nnLSM = model:forward(input)
print('nn.LogSoftMax')
print(output_nnLSM)
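As a reference point for the outputs listed above, the expected values for the input {0.5, 0.5} can be computed directly from the definitions. The sketch below is plain Python using only the math module, not Torch:

```python
import math


def softmax(xs):
    # Subtract the maximum first for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    # log_softmax(x)_i = x_i - log(sum_j exp(x_j))
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


x = [0.5, 0.5]
print(softmax(x))
print(log_softmax(x))
```

For two equal inputs the softmax is 0.5 each and the log-softmax is log(0.5), about -0.6931, each. That matches the nn.LogSoftMax output above, whereas the cudnn.LogSoftMax output shown equals the plain softmax values.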
I've been looking around to see if this is intended or I'm just doing something stupid, but it's really hard to find official docs on what cudnn supports and what it doesn't. The following code:
require 'cudnn'
x = cudnn.SpatialConvolution(3,16,3,3,2,2,1,1)
x:cuda()
x:forward(torch.rand(3,32,32):cuda())
results in "Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED"
So I assume it has something to do with the stride not matching the kernel size. I haven't gone through and seen what fails and what doesn't. If you know where to find a list of supported convolution arguments from cudnn, I would appreciate being pointed in the right direction. Also, I'd be interested to see that list for cudnn v4 as well.
Anyways, if it's just cudnn's fault, that's fine, but I think it might make sense to list this on the readme, since nn's SpatialConvolution does support these convolution arguments, so cudnn is not really fully compatible with nn.
After the update, only one instance of cudnn can be run on K80. When I try to launch another one, it gave the following error.
/usr/local/torch/install/share/lua/5.1/cudnn/init.lua:26: Error in CuDNN: CUDNN_STATUS_INTERNAL_ERROR
I reverted back to the old version and it works again.