soumith / cudnn.torch

Torch-7 FFI bindings for NVIDIA CuDNN

License: BSD 2-Clause "Simplified" License

CMake 2.15% Lua 97.85%

cudnn.torch's Issues

Inconsistencies with nn

Let's track them here:

There is no nn.SpatialLogSoftMax, should be addressed by torch/nn#560
There is no nn.SpatialCrossEntropyCriterion ~~(and cudnn test is broken)~~
nn.TemporalConvolution does not have padH support and the current implementation of cudnn.TemporalConvolution needs modifications to support cudnn.convert in R4
~~nn.SpatialBatchNormalization does not support 5D inputs in R4~~
~~nn.SpatialConvolution and cudnn.SpatialConvolution in R3 does not support noBias() (will cause error on conversion)~~
nn.SpatialConvolution does not support groups (will cause error on cudnn.convert cudnn -> nn)

DivisiveNormalization LRN doesn't work

It does not work and there is no test. I had been meaning to fix it while it was on R3; now that it is on master, it has to be fixed.

OSX cudnn won't load after update to latest torch

Running on Yosemite. All NVIDIA components current. Recent update of torch installation. Upon load of cudnn I get:
/Users/seth/Dev/torch/install/bin/luajit: /Users/seth/Dev/torch/install/share/lua/5.1/trepl/init.lua:363: /Users/seth/Dev/torch/install/share/lua/5.1/trepl/init.lua:363: /Users/seth/Dev/torch/install/share/lua/5.1/cudnn/init.lua:26: Error in CuDNN: CUDNN_STATUS_INTERNAL_ERROR

This is not stopping me from anything - but it is new, thought you'd like to know.

THNN a nil value

th> require 'nn'; require 'cudnn';                                                                    
                                                                      [0.8705s]
th> net = nn.Sequential():add(nn.Linear(10, 5)):add(cudnn.ReLU(true)):add(nn.LogSoftMax())
                                                                      [0.0001s]
th> net:cuda()
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> output]
  (1): nn.Linear(10 -> 5)
  (2): cudnn.ReLU
  (3): nn.LogSoftMax
}
                                                                      [0.0039s]
th> net:forward(torch.randn(10):cuda())                                                               
/home/atcold/torch/install/share/lua/5.1/nn/LogSoftMax.lua:4: attempt to index field 'THNN' (a nil value)
stack traceback:
        /home/atcold/torch/install/share/lua/5.1/nn/LogSoftMax.lua:4: in function 'updateOutput'
        /home/atcold/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        [string "_RESULT={net:forward(torch.randn(10):cuda())}"]:1: in main chunk
        [C]: in function 'xpcall'
        /home/atcold/torch/install/share/lua/5.1/trepl/init.lua:651: in function 'repl'
        ...cold/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
        [C]: at 0x00406670
                                                                      [0.2120s]

Oh, I guess the classifier has to be cunn rather than cudnn... I think I got confused.

fp16/half precision support

What is the current status of half-precision support? With the Tegra X1 out, this is becoming more interesting. I have seen that @soumith committed some initial work in 27e969b. Is anyone working on this? I would be interested in contributing. Are there any thoughts on how best to do this, given that CudaTensor currently supports single precision exclusively?

The same seed gives different results

I am using torch.manualSeed() and cutorch.manualSeed() in my code. However, I get a different result on every run when using the cudnn modules. The nn modules work normally.

Reproducing:

require 'cutorch'
require 'cunn'
require 'cudnn'
require 'optim'

local SpatialConvolution = cudnn.SpatialConvolution
local SpatialMaxPooling = cudnn.SpatialMaxPooling
--local SpatialConvolution = nn.SpatialConvolution
--local SpatialMaxPooling = nn.SpatialMaxPooling

cudnn.benchmark = false
torch.setdefaulttensortype("torch.FloatTensor")
torch.manualSeed(71)
cutorch.manualSeed(71)

local model = nn.Sequential()
model:add(nn.View(30, 48, 48))
model:add(SpatialConvolution(30, 64, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialConvolution(64, 64, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialMaxPooling(2, 2, 2, 2))
model:add(SpatialConvolution(64, 128, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialConvolution(128, 128, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:add(SpatialMaxPooling(2, 2, 2, 2))
model:add(nn.View(128 * 12 * 12))
model:add(nn.Linear(128 * 12 * 12, 512))
model:add(nn.Dropout(0.5))
model:add(nn.Linear(512, 1))

local inputs = torch.Tensor(32, 1, 30, 48, 48):uniform():cuda()
local targets = torch.Tensor(32, 1):uniform():cuda()
local criterion = nn.MSECriterion():cuda()
local config = { learningRate = 0.00001 }

model = model:cuda()
model:training()
local parameters, gradParameters = model:getParameters()
for i = 1, 100 do
   local feval = function(x)
      if x ~= parameters then
         parameters:copy(x)
      end
      gradParameters:zero()
      inputs = inputs:clone()
      local output = model:forward(inputs)
      local f = criterion:forward(output, targets)
      model:backward(inputs, criterion:backward(output, targets))
      return f, gradParameters
   end
   optim.adam(feval, parameters, config)
   if i % 10 == 0 then
      collectgarbage()
   end
end
model:evaluate()
local y = model:forward(inputs)
print(y:sum())

--[[
# on GTX760, CUDA 7.5, cuDNN v3

# cunn - same result

% th seed.lua
   18.265419006348
% th seed.lua
   18.265419006348
% th seed.lua
   18.265419006348
% th seed.lua
   18.265419006348

# cudnn - different result

% th seed.lua
   18.265224456787
% th seed.lua
   18.26549911499
% th seed.lua
   18.265830993652
% th seed.lua
   18.266021728516

--]]

R3: Modes may not be applied

First of all, thanks for the very fast porting to R3. I noticed that fastest(), setMode() or resetMode() are ineffective once a forward pass has been done because createIODescriptors() is not fully executed. Suggested fix: add self.iSize:fill(0) to each function.
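
The caching behavior described above can be mimicked with a small framework-independent sketch in Python. The class and method names (`CachedLayer`, `set_mode_buggy`, `set_mode_fixed`) are hypothetical and only illustrate the pattern; this is not the actual cudnn.torch code.

```python
class CachedLayer:
    """Hypothetical layer that rebuilds descriptors only when the input size changes."""

    def __init__(self):
        self.mode = "default"
        self.cached_size = None   # plays the role of self.iSize
        self.active_mode = None   # mode baked into the current descriptors

    def _build_descriptors(self, size):
        # Descriptors capture whatever mode is set at build time.
        self.cached_size = tuple(size)
        self.active_mode = self.mode

    def forward(self, size):
        # Rebuild only if the input size differs from the cached one.
        if tuple(size) != self.cached_size:
            self._build_descriptors(size)
        return self.active_mode

    def set_mode_buggy(self, mode):
        # Changes the setting but leaves the cached descriptors in place.
        self.mode = mode

    def set_mode_fixed(self, mode):
        # Also invalidates the cached size, analogous to self.iSize:fill(0).
        self.mode = mode
        self.cached_size = None
```

With the buggy setter, a second forward pass on the same input size keeps returning the old mode; with the fixed setter, the next forward pass rebuilds and picks up the new mode.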

TemporalConvolution_padding_batch test is flaky

Running test on device: 1
Running 33 tests
________________________*________  ==> Done Completed 253 asserts in 33 tests with 1 errors
--------------------------------------------------------------------------------
TemporalConvolution_padding_batch
 Function call failed
test.lua:285: bad argument #4 to 'narrow' (out of range at /home/sgross/local/cutorch/lib/THC/generic/THCTensor.c:367)
stack traceback:
        [C]: in function 'narrow'
        test.lua:285: in function <test.lua:268>
        [C]: in function 'xpcall'
        ...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:115: in function 'pcall'
        ...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:186: in function '_run'
        ...ocal/torch-luajit/install/share/lua/5.1/torch/Tester.lua:161: in function 'run'
        test.lua:1324: in main chunk
        [C]: in function 'dofile'
        ...rch-luajit/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406444

cudnn.convert() isn't converting ReLU() modules back to nn

I have a GPU model gpu_net that I'd like to convert back to be a CPU model. I'm doing this with two calls cpu_net = gpu_net:clone():float() followed by cudnn.convert(cpu_net, nn).

Unfortunately, this doesn't seem to be working. If I look at the output of cpu_net:listModules(), I still see instances of ReLU modules having self.mode = CUDNN_ACTIVATION_RELU.

If I save my CPU model to an ascii file and search through it for instances of 'cudnn', I find thousands.

I tried manually setting the self.mode instances to nil, but it didn't work.

For what it's worth, I'm using Element Research's rnn package extensively. Perhaps cudnn.convert() isn't cooperating with that?

Thanks.

cudnn.SpatialConvolution breaks on 2x2 input with 3x3 kernel and pad=1

nn.SpatialConvolution works fine in this case, but cudnn v2rc3 fails with a C++ exception. Maybe we should add an assert, or investigate why it breaks?

New FFT Convolutions are a bit hard on multi-GPU

Because we launch our kernels depth-first rather than breadth-first, we see huge latencies for Googlenet due to kernel-launch overhead.

SpatialConvolution weights are kW x kH, should be kH x kW

When kH ~= kW, weight and gradWeight have the size nOutputPlane x nInputPlane x kW x kH.
However, the cudnn kernels interpret the memory as having the size nOutputPlane x nInputPlane x kH x kW, which is consistent with nn.SpatialConvolution and nn.SpatialConvolutionMM. See example below which uses a convolution to mask an input tensor.
I could just submit a PR with a correct __init() function, or should we care about the fact that loaded old cudnn models will still have the wrong tensor view? I suspect we shouldn't worry, since even loading a weight tensor with the wrong view will still be interpreted correctly.

require 'cudnn'
kH, kW = 3,2
inp = torch.range(1, 6):view(1, 1, kH, kW)
net = cudnn.SpatialConvolution(1,1,kW,kH):cuda()
net.bias:zero()
net.weight:zero()
net.weight[{1,1,2,1}]=1
print("input")
print(inp)
print("cudnn: should be 1x1x3x2 instead of 1x1x2x3")
print(net.weight)
print("as demonstrated by the corresponding position in the input tensor")
print(net:forward(inp:cuda()):squeeze())
net2 = nn.SpatialConvolution(1,1,kW,kH)
net2.weight:copy(net.weight) -- copy despite different sizes
net2.bias:zero()
print("nn version has the right layout:")
print(net2.weight)
print(net2:forward(inp):squeeze())
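
The index arithmetic behind the example above can be checked without a GPU. The following Python sketch assumes row-major contiguous storage (the default layout for Torch tensors); the helper names `flat_offset` and `unravel` are made up for illustration.

```python
kH, kW = 3, 2

# Input as in the example: values 1..6 viewed as kH x kW (row-major).
inp = [[1, 2], [3, 4], [5, 6]]


def flat_offset(row, col, n_cols):
    """0-based row-major offset of a 1-based (row, col) index."""
    return (row - 1) * n_cols + (col - 1)


def unravel(offset, n_cols):
    """1-based (row, col) of a 0-based row-major offset."""
    return offset // n_cols + 1, offset % n_cols + 1


# weight[{1,1,2,1}] = 1 is written through the kW x kH (2 x 3) view ...
offset = flat_offset(2, 1, n_cols=kH)
# ... but the same memory is read back as kH x kW (3 x 2).
row, col = unravel(offset, n_cols=kW)
selected = inp[row - 1][col - 1]
print(offset, (row, col), selected)
```

Under these assumptions the single nonzero weight sits at flat offset 3, which is position (2, 2) in the kH x kW reading and therefore lines up with the input value 4, not with position (2, 1).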

cudnn.SpatialConvolution with kernel of size 1

Hi,

When I use cudnn.SpatialConvolution with kernels of size 1x1, the forward pass works correctly, but backprop does not give the expected result: only the first plane is correct and the other ones are all zeros. Here is a piece of code that fails:

require 'cudnn'
n = cudnn.SpatialConvolution(2, 2, 1, 1):cuda()
a = torch.randn(1, 2, 3, 3):cuda()
y = n:forward(a)
x = n:backward(a, y)
print(x)

If it is a bug of cudnn, it would be nice to at least have an assert to check for the kernel size.

Thanks!
Michael

Groups problem with benchmark/fastest

There seems to be a problem with groups in SpatialConvolution when the cudnn.benchmark or cudnn.fastest options are used. They seem to produce results that differ from the standard mode. Consider the following code:

require 'torch'
require 'cudnn'

local g = 1 --2
local M = cudnn.SpatialConvolution(384,384, 3,3, 1,1, 1,1, g):cuda()
local M2 = M:clone()

for i=1,10 do
    local p = torch.CudaTensor(1,384,16,16):normal(0,1)
    cudnn.benchmark = false
    M:forward(p)
    cudnn.benchmark = true
    M2:forward(p)

    local a = M.output
    local b = M2.output
    print(i,(a-b):norm(),(a-b):abs():max())
end

On my Titan Black and cudnn v3, g=1 outputs zeros but g=2 outputs nonzeros. On Sergey's Titan Black and v4, g=2 outputs zeros sometimes and on Titan X it "hangs". Any explanations?

BatchNormalization and SpatialBatchNormalization conversion missing

Doesn't convert nn.BatchNormalization to cudnn.BatchNormalization. The same for Spatial.

cudnn.Tanh

cudnn.Tanh is actually a lot slower than nn.Tanh
What's the point in it then?

torch.setdefaulttensortype('torch.CudaTensor')

testCase = torch.rand(10*1000)

wikinet = nn.Sequential()
collectgarbage()

wikinet:add(nn.Linear(10*1000, 10*1000))
wikinet:add(cudnn.Tanh())

local timer = torch.Timer() -- the Timer starts to count now
print(#wikinet:forward(testCase))
print(timer:time().real)

local timer = torch.Timer() -- the Timer starts to count now
print(#wikinet:forward(testCase))
print(#wikinet:forward(testCase))
print(timer:time().real)

wikinet = nn.Sequential()
collectgarbage()

wikinet:add(nn.Linear(10*1000, 10*1000))
wikinet:add(nn.Tanh())

local timer = torch.Timer() -- the Timer starts to count now
print(#wikinet:forward(testCase))
print(timer:time().real)

local timer = torch.Timer() -- the Timer starts to count now
print(#wikinet:forward(testCase))
print(#wikinet:forward(testCase))
print(timer:time().real)

torch.setdefaulttensortype('torch.FloatTensor')
collectgarbage()

R2RC2 doesn't pass tests with 1x1 convolution

As @eladhoffer pointed out, 1x1 convolutions do not give proper backward results. Setting

local ki, kj = 1,1
local si, sj = 1,1

breaks the test.
This happens only on Linux; OS X is fine.

cudnn final R3 release, master is R3

cudnn did their final release of R3.

Now cudnn.torch's master branch points to the R3 bindings. If you want R2 bindings specifically, check out the R2 branch.

Question: CUDNNv4 SpatialBatchNorm momentum

I noticed that cudnn.h includes this rather cryptic comment for the cudnnBatchNormalizationForwardTraining() function:

/* MUST use factor=1 in the very first call of a complete training cycle.
    Use a factor=1/(1+n) at N-th call to the function to get
    Cumulative Moving Average (CMA) behavior
    CMA[n] = (x[1]+...+x[n])/n
    Since CMA[n+1] = (n*CMA[n]+x[n+1])/(n+1) =
    ((n+1)*CMA[n]-CMA[n])/(n+1) + x[n+1]/(n+1) =
    CMA[n]*(1-1/(n+1)) + x[n+1]*1/(n+1) */
double                              exponentialAverageFactor,

In the current version of the code, this is just the momentum parameter which defaults to 0.1. Is it worth implementing the proper averaging behavior? The advantage is that this would allow us to capture the mean and variance across the entire training set.
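
To make the difference concrete, here is a small Python sketch of the two averaging schemes. It assumes the running statistic is updated as `running*(1-factor) + x*factor`, which matches the derivation in the header comment, and that the running value starts at zero; the function names are illustrative only.

```python
def update_running(running, x, factor):
    """One running-statistic update: running*(1-factor) + x*factor."""
    return running * (1.0 - factor) + x * factor


def cumulative_average(values):
    """factor = 1/(1+n), with n = 0 on the first call so that factor = 1."""
    running = 0.0
    for n, x in enumerate(values):
        running = update_running(running, x, 1.0 / (1 + n))
    return running


def exponential_average(values, momentum=0.1):
    """Fixed factor, as with the default momentum of 0.1."""
    running = 0.0
    for x in values:
        running = update_running(running, x, momentum)
    return running


batch_means = [2.0, 4.0, 6.0, 8.0]  # made-up per-batch statistics
print(cumulative_average(batch_means))
print(exponential_average(batch_means))
```

With the decaying factor the result is the plain mean of all batch statistics seen so far, whereas the fixed momentum weights recent batches more heavily and, after only a few updates, is still dominated by the initial value.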

accurately finding correct .so version when multiple versions are installed

cloning a model breaks things as weightDesc cant be copied over

is this what you meant :-)

"I have to time to support these, so please dont expect a quick response to filed github issues."

Do you mean that you DON'T have time to support these?

DYLD_LIBRARY_PATH no longer easily inherited on OS X El Capitan

The install instructions for CUDNN have you add its directory to DYLD_LIBRARY_PATH. Unfortunately, El Capitan changed how that environment variable works: certain executables (maybe everything in /bin?) don't inherit DYLD_* anymore. It has to do with System Integrity Protection, apparently. Unfortunately, since the th command is a #!/bin/sh script, th doesn't inherit the library path and you get

libcudnn (R4) not found in library path.
Please install CuDNN from https://developer.nvidia.com/cuDNN
Then make sure files named as libcudnn.so.4 or libcudnn.4.dylib are placed in your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)

I hacked around this temporarily by setting the DYLD_LIBRARY_PATH inside ~/torch/install/bin/th, but of course that's fragile. It would be great if somehow cudnn.torch could link directly against a CUDNN library so we wouldn't have to patch the linker to find it.

A workaround I haven't tried might be to put cudnn in ~/lib or /usr/local/lib, since those are on the library path by default.

Edit: some additional info from other projects: oracle/node-oracledb#231

SpatialBatchNormalization Issue

I recently updated (Feb. 17) to the latest versions, i.e. cuDNN v4 (Feb. 10), nn, cutorch, cunn, and cudnn.torch. The problem I am seeing is that my networks are no longer able to learn... I rechecked my code, installation, and everything seems okay. I systematically started swapping out cudnn backend with nn equivalent and it seems that SpatialBatchNormalization is not working as expected. On CIFAR-10, I can get 93% when I use everything with cudnn backend but with nn.SpatialBatchNormalization. When I use cudnn.SpatialBatchNormalization I can't get past 10% (always guessing same label).

Any thoughts? I can investigate more but looking at the commit logs I see a number of substantial changes to SpatialBatchNormalization.

Thanks.

SpatialCrossEntropyCriterion test broken

Running test on device: 1
Running 27 tests
______________*____________  ==> Done Completed 85 asserts in 27 tests with 1 errors
--------------------------------------------------------------------------------
SpatialCrossEntropyCriterion
error in difference between central difference and :backward
 LT(<) violation   val=0.16240306198597, condition=0.01
    /opt/rocks/distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:795: in function <test/test.lua:767>

--------------------------------------------------------------------------------

cuDNN sometimes does not work at all (but sometimes does)...

Dear all,

I tried running a rather simple ConvNet using cuDNN (the supervised demo [1] where I replaced the convolutions, relu, and pooling by the cuDNN equivalents) and it sometimes trains fine, but sometimes just seems to be doing nothing at all (train and test error remain at the initial 19% global acc. forever). There are no errors reported. Using the cunn modules instead of the cuDNN ones always works fine, so this seems to be a problem specific to cuDNN.

Another interesting thing: when I print the biases for some layers (same for the weights) using e.g.

print(model:get(1).bias:float())

they are all reported as nan's (in a run where cuDNN "hangs". Otherwise they are small numbers). Doing the same with the cunn-convolutions always prints nice small numbers.

Anyone else seen this problem before? Is there a way to at least detect when cuDNN "hangs" (besides monitoring the biases for nan's)? It's kind of annoying when your net is not learning at all and you do not know whether this is due to bad hyperparameters, or just because cuDNN hangs.
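
As a framework-independent illustration of the NaN monitoring mentioned above, a check over a flat list of parameter values could look like the Python sketch below. The function name `has_nan` is made up, and the sample values are not taken from any real run.

```python
import math


def has_nan(values):
    """Return True if any value in the iterable is NaN."""
    return any(math.isnan(v) for v in values)


biases = [0.01, -0.02, float("nan")]  # made-up values
if has_nan(biases):
    print("parameters contain NaN; training has diverged or the backend misbehaved")
```

The same idea carries over to a flattened parameter tensor, since NaN is the only floating-point value that compares unequal to itself.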

I'm on OSX 10.8 with Cuda 6.5 and cuDNN R2. I also use Caffe and Mocha with cuDNN backends and have never seen this problem occurring with these libs before.

Best
Michael

[1] https://github.com/torch/tutorials/blob/master/2_supervised/2_model.lua

Missing " require('cudnn.SpatialFullConvolution') " in init.lua

Currently we can't use SpatialFullConvolution, because init.lua doesn't initialize it.

SpatialBatchNormalization doesn't support affine=false

If affine is set to false, the weight and bias are nil, leading to an error in createIODescriptors. Maybe it would be better to assert in the constructor itself that affine==false is not permitted at the moment ?

Problem with nn.Replicate(1)

In order to use nn.SpatialBatchNormalization during testing, I need to simulate a batch (4D) from a single sample image (3D), i.e. turn a 1xNxM tensor into a 1x1xNxM tensor. I use nn.Replicate(1) for this, but cudnn.SpatialConvolution complains about its input. Here is the code to reproduce the error:

m = nn.Sequential()
m:add(nn.Replicate(1))
m:add(cudnn.SpatialConvolution(1,3,5,5))
m:cuda()
m:forward(torch.rand(1,7,7):cuda())

I got this:
/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_BAD_PARAM
stack traceback:
[C]: in function 'error'
/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
/torch/install/share/lua/5.1/cudnn/init.lua:76: in function 'toDescriptor'
/torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:108: in function 'createIODescriptors'
/torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:339: in function 'updateOutput'
/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
[string "_RESULT={m:forward(torch.rand(1,7,7):cuda())}"]:1: in main chunk
[C]: in function 'xpcall'
/torch/install/share/lua/5.1/trepl/init.lua:650: in function 'repl'
/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
[C]: at 0x00406670

But this works perfectly:
m.modules[2]:forward(torch.rand(1,1,7,7):cuda()):size()

By the way, the nn version also works:
m.modules[2] = nn.SpatialConvolution(1,3,5,5)
m:forward(torch.rand(1,7,7):cuda()):size()

Any workaround for the batch simulation is also welcome.

Cudnn/Pooling.lua, gradinput=nill

Hi,
I am getting the following error while using cudnn v4 & NVIDIA 7.5 on a K40 GPU.

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5709/cutorch/lib/THC/THCGeneral.c line=591 error=8 : invalid device function
/home/sk1846/torch/install/bin/luajit: /home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:57: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-5709/cutorch/lib/THC/THCGeneral.c:591
stack traceback:
[C]: in function 'resizeAs'
/home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:57: in function 'createIODescriptors'
/home/sk1846/torch/install/share/lua/5.1/cudnn/Pooling.lua:88: in function 'updateOutput'
/home/sk1846/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
/home/sk1846/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'

"Warning: cannot write object field <weightDesc>"

I've got tons of annoying warnings when saving cudnn networks:

$ Warning: cannot write object field <weightDesc>
$ Warning: cannot write object field <biasDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <convDesc>
$ Warning: cannot write object field <oDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <oDesc>
$ Warning: cannot write object field <weightDesc>
$ Warning: cannot write object field <biasDesc>
$ Warning: cannot write object field <iDesc>
$ Warning: cannot write object field <convDesc>
$ Warning: cannot write object field <oDesc>

Is it possible to suppress them?

cudnn.SoftMax fails?

Hi,

When I use cudnn.SoftMax, I get an assertion failure on line 21 of cudnn.SpatialSoftMax. My input is batchSize x nClasses, so it doesn't have 4 dimensions, which makes the assertion fail.

Shouldn't SoftMax handle input of dimension 2 ? Otherwise, I don't see the difference between SoftMax and SpatialSoftMax (or there is something I missed).

Michael

cudnn R2 memory errors

Looks like cudnn R2 is a release candidate for a good reason.
For convolution kernels, there are lots of illegal memory accesses. I encountered this while debugging:
torch/cutorch#87 reported by both @russelfei and @szagoruyko

running a script with cuda-memcheck returns tons of illegal memory access issues, which in turn are failing the later operations in the pipeline, which is the error that shows up in cutorch.

Do you guys want me to revert this repo to R1 and put R2-rc1 in a separate branch, and only make the repo R2-ready when R2 final candidate is released?

LRNs have different names in inn and R3

| inn | cudnn R3 |
| --- | --- |
| inn.SpatialCrossResponseNormalization | cudnn.SpatialCrossLRN |
| inn.SpatialSameResponseNormalization | cudnn.SpatialDivisiveNormalization |

Also, cudnn.SpatialCrossLRN has different default parameters. Should we maybe use inn naming and defaults?

cc @fmassa

R2-rc2 is the default now.

Have made CuDNN R2-rc2 the default bindings in master now.
If you need R1 bindings, check out the R1 branch instead of master.

Very nice speedups with R2-rc2 and all bugs seem to be gone.

Save and load

When I save and then load a cudnn.SpatialConvolution module, the fields iDesc and convDesc are nil.
Code to reproduce:

require 'cudnn'

input = torch.CudaTensor(32,3,27,27)
--[[
net = cudnn.SpatialConvolution(3,96,5,5)
net = net:cuda()
o = net:forward(input)
torch.save('model.net', net)
--]]
net = torch.load('model.net')
o = net:forward(input)

SpatialBatchNormalization is 2-3 times slower than nn

As pointed out by @colesbury in torch/cunn#185
Benchmarks:

GeForce GTX TITAN X
nn
   nn 64 x  16 x 112 x 112   4.82 ms [  4.82 ms]
   nn 64 x  64 x  56 x  56   3.69 ms [ 22.15 ms]
   nn 64 x 128 x  28 x  28   1.23 ms [  9.81 ms]
   nn 64 x 256 x  14 x  14   0.64 ms [  7.63 ms]
   nn 64 x 512 x   7 x   7   0.33 ms [  1.98 ms]
Weighted Total:   46.39 ms

GeForce GTX TITAN X
cudnn
cudnn 64 x  16 x 112 x 112  34.48 ms [ 34.48 ms]
cudnn 64 x  64 x  56 x  56   8.94 ms [ 53.62 ms]
cudnn 64 x 128 x  28 x  28   2.41 ms [ 19.31 ms]
cudnn 64 x 256 x  14 x  14   0.81 ms [  9.76 ms]
cudnn 64 x 512 x   7 x   7   0.47 ms [  2.84 ms]
Weighted Total:  120.00 ms

GeForce GTX TITAN Black
nn
   nn 64 x  16 x 112 x 112   6.76 ms [  6.76 ms]
   nn 64 x  64 x  56 x  56   4.96 ms [ 29.75 ms]
   nn 64 x 128 x  28 x  28   2.01 ms [ 16.07 ms]
   nn 64 x 256 x  14 x  14   0.95 ms [ 11.44 ms]
   nn 64 x 512 x   7 x   7   0.71 ms [  4.29 ms]
Weighted Total:   68.30 ms

GeForce GTX TITAN Black
cudnn
cudnn 64 x  16 x 112 x 112  31.96 ms [ 31.96 ms]
cudnn 64 x  64 x  56 x  56   8.26 ms [ 49.56 ms]
cudnn 64 x 128 x  28 x  28   3.51 ms [ 28.05 ms]
cudnn 64 x 256 x  14 x  14   1.70 ms [ 20.35 ms]
cudnn 64 x 512 x   7 x   7   0.97 ms [  5.85 ms]
Weighted Total:  135.77 ms

Will close when NVIDIA fixes it.
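
For reference, the slowdown implied by the weighted totals above can be computed directly. This Python sketch only divides the numbers reported in the benchmark; the dictionary layout and the `slowdown` helper are illustrative.

```python
# Weighted totals in milliseconds, copied from the benchmark output above.
weighted_totals_ms = {
    "GeForce GTX TITAN X": {"nn": 46.39, "cudnn": 120.00},
    "GeForce GTX TITAN Black": {"nn": 68.30, "cudnn": 135.77},
}


def slowdown(totals):
    """Ratio of the cudnn weighted total to the nn weighted total."""
    return totals["cudnn"] / totals["nn"]


for gpu, totals in weighted_totals_ms.items():
    print(f"{gpu}: cudnn is {slowdown(totals):.2f}x slower than nn")
```

This gives roughly 2.59x on the TITAN X and 1.99x on the TITAN Black for the weighted totals, with the largest per-layer gap on the 64 x 16 x 112 x 112 input.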

R3: Non-contiguous tensors

The contiguity requirements for tensors seem to be too strict in R3. cudnn actually supports operations on non-contiguous data well; there are just some constraints, typically that input and gradInput, and likewise output and gradOutput, need to have the same strides. With that in mind, I was able to extend Pointwise (ReLU) to work with non-contiguous tensors. Basically, one just needs to create a separate descriptor for every tensor rather than sharing them.
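
The stride constraint can be expressed with a few lines of Python. This is a simplified sketch assuming row-major layout; the helper names are hypothetical, and real frameworks treat size-1 dimensions more leniently than this check does.

```python
def contiguous_strides(sizes):
    """Strides (in elements) of a contiguous row-major tensor with the given sizes."""
    strides = []
    step = 1
    for size in reversed(sizes):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(sizes, strides):
    """True if the strides are exactly those of a contiguous tensor."""
    return tuple(strides) == contiguous_strides(sizes)


def same_layout(a_strides, b_strides):
    """The weaker condition from the text: paired tensors share the same strides."""
    return tuple(a_strides) == tuple(b_strides)


sizes = (2, 3, 4, 5)
narrowed = (120, 20, 5, 1)  # e.g. a slice taken out of a larger buffer
print(is_contiguous(sizes, narrowed), same_layout(narrowed, narrowed))
```

A tensor pair can fail the contiguity check and still satisfy the same-strides condition, which is the case the issue argues should be allowed.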

Problems with DepthConcat

I'm using DepthConcat over convolutions and it's crashing for me with cudnn, but not with cunn, with "gradOutput has to be contiguous" on the backward pass.
Example: https://gist.github.com/etrulls/1444933b86eeda31afd6

convert.lua not present in release 4

Is there any particular reason for not having an nn-to-cudnn conversion function in R4?

buffer size is in bytes, fix it in bindings

Someone from nvidia pointed out that the buffersize returned by cudnnGetConvolutionForwardWorkspaceSize is the number of bytes.

So I am allocating excessive memory here because I thought the size was the number of floats needed. This needs to be fixed.
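
The size of the over-allocation follows from the element width. The Python sketch below assumes 4-byte single-precision floats; the workspace size used is a made-up number, and `elements_for_bytes` is an illustrative helper, not part of the bindings.

```python
FLOAT_BYTES = 4  # size of a single-precision float


def elements_for_bytes(n_bytes, element_bytes=FLOAT_BYTES):
    """Number of elements needed to cover n_bytes (ceiling division)."""
    return -(-n_bytes // element_bytes)


workspace_bytes = 1_000_000  # hypothetical value returned by the size query
needed = elements_for_bytes(workspace_bytes)
allocated_by_mistake = workspace_bytes  # byte count misread as a float count
print(needed, allocated_by_mistake, allocated_by_mistake / needed)
```

Treating the byte count as an element count therefore allocates four times the memory that is actually required.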

there's an issue when changing input sizes, some ffi allocation isnt getting freed

There's an issue when changing input sizes: some ffi allocation isn't getting freed and the stack gets corrupted.

@clementfarabet if you have the chance, have a glance at SpatialConvolution.lua :createIODescriptors and see if there's anything obvious that stands out.

Backward compatibility is broken in R3 average pooling

https://github.com/soumith/cudnn.torch/blob/R3/SpatialAveragePooling.lua#L6 mode is different when loaded from older versions

OS X R2 Tanh and SoftMax tests fail

I have just tested on Ubuntu and all tests pass. On OS X they do not:

____*__*______  ==> Done Completed 50 asserts in 14 tests with 3 errors
--------------------------------------------------------------------------------
Tanh_single
error on state (forward)
 LT(<) violation   val=nan, condition=0.0001
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:329: in function <test/test.lua:303>

--------------------------------------------------------------------------------
Tanh_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:332: in function <test/test.lua:303>

--------------------------------------------------------------------------------
SoftMax_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:467: in function <test/test.lua:437>

--------------------------------------------------------------------------------

weird

Updated the branch R4 to be master

Loading old models with cudnn is broken

I uncovered this problem when I moved to a local install. I cannot load models with cudnn saved before December. The issue is with Sequential models with cudnn inside, which is pretty generic.
Since I have some outdated global installs that can still load these models, I tried to debug and find the module that causes trouble. Let's say this is my model:

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
  (1): nn.Reshape
  (2): cudnn.SpatialConvolution
  (3): cudnn.ReLU
  (4): cudnn.SpatialMaxPooling
  (5): cudnn.SpatialConvolution
}

I save it with the old install, open it with the new install, and get errors like this:

/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:254: unknown object
stack traceback:
    [C]: in function 'error'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:254: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:211: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:234: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:248: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:234: in function 'readObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:271: in function 'load'

When I debug it, readObject recurses until it reaches the last SpatialConvolution with typeidx=1332609024 and doesn't know what to do with it.
If I split the Sequential and save it module by module, it loads.

SpatialConvolution running in fully-connected mode is very slow, even with R4

cudnnR4 doesn't choose the optimal algorithm in the fully-connected mode, even with cudnn.benchmark = true, which results in ~20x slower backward pass compared to MatConvNet.

Torch:

--TORCH OUTPUT (in seconds)
--forward   0.24779605865479    
--backward  4.5414280891418 
--forward   0.051395893096924   
--backward  4.5211651325226 
--forward   0.054457902908325   
--backward  4.5210771560669

-- with cudnn.benchmark = true
--forward   14.457499027252 
--backward  0.98335909843445    
--forward   0.045572996139526   
--backward  0.98773503303528    
--forward   0.045454025268555   
--backward  0.98268413543701


require 'cudnn'
require 'hdf5'

function gpuTicToc(f)
    cutorch.synchronize()
    local tic = torch.tic()
    f()
    cutorch.synchronize()
    return torch.toc(tic)
end

model = cudnn.SpatialConvolution(256, 4096, 6, 6, 1, 1):cuda(); model.weight:fill(1); model.bias:fill(1)
input = torch.CudaTensor(1600, 256, 6, 6):fill(1):cuda()

for i = 1, 3 do
  model:zeroGradParameters()
  print('forward', gpuTicToc(function()
      model:forward(input)
  end))

  one = torch.CudaTensor():resize(model.output:size()):fill(1)
  print('backward', gpuTicToc(function()
      model:backward(input, one)
  end))
end

model:float()

h = hdf5.open('test.h5', 'w')
h:write('/output', model.output)
h:write('/gradInput', model.gradInput)
h:write('/gradWeight', model.gradWeight)
h:write('/gradBias', model.gradBias)
h:close()

MatConvNet:

%MATLAB OUTPUT (in seconds)
%
%forward 0.224209
%backward 0.046167
%forward 0.045812
%backward 0.044633
%forward 0.043401
%backward 0.044506
%
%output diff: 0.000000
%gradInput diff: 0.000000
%gradWeight diff: 0.000000
%gradBias diff: 0.000000


%addpath('matconvnet-1.0-beta18/matlab'); vl_compilenn('EnableGpu', true, 'EnableCudnn', true, 'CudnnRoot', '/home/kantorov/cudnnR4');
run('matconvnet-1.0-beta18/matlab/vl_setupnn.m');

weight = gpuArray(ones(6, 6, 256, 4096, 'single'));
bias = gpuArray(ones(1, 4096, 'single'));

input = gpuArray(ones(6, 6, 256, 1600, 'single'));
one = gpuArray(single(ones(1, 1, 4096, 1600)));

for i = 1:3
    wait(gpuDevice); tic;
    output = vl_nnconv(input, weight, bias);
    wait(gpuDevice); fprintf('forward %f\n', toc);

    wait(gpuDevice); tic;
    [dzdx, dzdf dzdb] = vl_nnconv(input, weight, bias, one);
    wait(gpuDevice); fprintf('backward %f\n', toc);
end

torch_output = h5read('test.h5', '/output');
torch_gradInput = h5read('test.h5', '/gradInput');
torch_gradWeight = h5read('test.h5', '/gradWeight');
torch_gradBias = h5read('test.h5', '/gradBias');

fprintf('output diff: %f\n', sum(abs(reshape(torch_output, [], 1) - reshape(output, [], 1))));
fprintf('gradInput diff: %f\n', sum(abs(reshape(torch_gradInput, [], 1) - reshape(dzdx, [], 1))));
fprintf('gradWeight diff: %f\n', sum(abs(reshape(torch_gradWeight, [], 1) - reshape(dzdf, [], 1))));
fprintf('gradBias diff: %f\n', sum(abs(reshape(torch_gradBias, [], 1) - reshape(dzdb, [], 1))));

Replacing cudnn.SpatialConvolution with nn.Linear makes Torch and MatConvNet even:

--TORCH OUTPUT (in seconds) with nn.Linear
--forward   0.046329975128174   
--backward  0.048556089401245   
--forward   0.045660018920898   
--backward  0.046145915985107   
--forward   0.045567989349365   
--backward  0.043753862380981

cudnn.LogSoftMax seems to be broken

For the same input {0.5, 0.5}

cudnn.SoftMax   
 0.5000
 0.5000
[torch.CudaTensor of size 2]

nn.SoftMax  
 0.5000
 0.5000
[torch.CudaTensor of size 2]

cudnn.LogSoftMax    
 0.5000
 0.5000
[torch.CudaTensor of size 2]

nn.LogSoftMax   
-0.6931
-0.6931
[torch.CudaTensor of size 2]

Here is my code

require 'torch'
require 'nn'
require 'cutorch'
require 'cunn'
require 'cudnn'

input = torch.Tensor{0.5,0.5}
input = input:cuda()

model = cudnn.SoftMax()
model:cuda()
output_cudnnSM = model:forward(input)
print('cudnn.SoftMax')
print(output_cudnnSM)

model = nn.SoftMax()
model:cuda()
output_nnSM = model:forward(input)
print('nn.SoftMax')
print(output_nnSM)

model = cudnn.LogSoftMax()
model:cuda()
output_cudnnLSM = model:forward(input)
print('cudnn.LogSoftMax')
print(output_cudnnLSM)

model = nn.LogSoftMax()
model:cuda()
output_nnLSM = model:forward(input)
print('nn.LogSoftMax')
print(output_nnLSM)
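
The expected values can be reproduced independently of Torch. This Python sketch uses the standard numerically stable definitions of softmax and log-softmax; the function names are illustrative.

```python
import math


def softmax(xs):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """Log-softmax computed as x - logsumexp(x)."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


inputs = [0.5, 0.5]
print(softmax(inputs))
print(log_softmax(inputs))
```

For two equal inputs, softmax gives 0.5 for each entry and log-softmax gives log(0.5), about -0.6931, which agrees with the nn.LogSoftMax output above and not with the cudnn.LogSoftMax output.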

cudnn doesn't support kernel-3 stride-2 pad-1

I've been looking around to see whether this is intended or I'm just doing something stupid, but it's really hard to find official docs on what cudnn does and doesn't support. The following code:

require 'cudnn'
x = cudnn.SpatialConvolution(3,16,3,3,2,2,1,1)

x:cuda()
x:forward(torch.rand(3,32,32):cuda())

results in "Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED"

So I assume it has something to do with the stride not matching the kernel size. I haven't gone through and checked what fails and what doesn't. If you know where to find a list of convolution arguments that cudnn supports, I would appreciate being pointed in the right direction. I'd also be interested to see that list for cudnn v4.

Anyways, if it's just cudnn's fault, that's fine, but I think it might make sense to list this on the readme, since nn's SpatialConvolution does support these convolution arguments, so cudnn is not really fully compatible with nn.
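
As a sanity check on the geometry, the output size for these arguments can be computed with the floor-based formula documented for nn.SpatialConvolution. The Python sketch below only shows that the configuration is geometrically valid; it says nothing about which configurations cudnn itself accepts, and `conv_output_size` is an illustrative helper.

```python
def conv_output_size(input_size, kernel, stride, pad):
    """floor((input + 2*pad - kernel) / stride) + 1 along one spatial dimension."""
    return (input_size + 2 * pad - kernel) // stride + 1


# Kernel 3, stride 2, pad 1 on a 32-pixel side, as in the code above.
print(conv_output_size(32, kernel=3, stride=2, pad=1))

# Kernel 3, stride 1, pad 1 on a 2-pixel side, another case reported on this page.
print(conv_output_size(2, kernel=3, stride=1, pad=1))
```

Both cases yield a positive output size (16 and 2 respectively), so nn accepting them is consistent with the formula.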

Cannot launch more than one instance on K80

After the update, only one instance of cudnn can be run on the K80. When I try to launch another one, it gives the following error.

/usr/local/torch/install/share/lua/5.1/cudnn/init.lua:26: Error in CuDNN: CUDNN_STATUS_INTERNAL_ERROR

I reverted back to the old version and it works again.