Hi all,
During backpropagation through the conv layer, the code reports the error "unexpected error during CUDA execution: CUDA_ERROR_LAUNCH_FAILED".
The matrix sizes involved in the backward pass are as follows:
vl_nnconv: mode gpu; backward
vl_nnconv: stride: [1 1], pad: [1 2 3 3], numGroups: 1, has bias: 1, has filters: 1, fully connected: 0
vl_nnconv: data: 63 x 84 x 512 x 1 [10.3 MB]
vl_nnconv: filters: 4 x 7 x 512 x 2 [0.1 MB]
vl_nnconv: biases: 1 x 2 x 1 x 1 [0.0 MB]
vl_nnconv: derOutput: 63 x 84 x 2 x 1 [0.0 MB]
vl_nnconv: derData: 63 x 84 x 512 x 1 [10.3 MB]
vl_nnconv: derFilters: 4 x 7 x 512 x 2 [0.1 MB]
vl_nnconv: derBiases: 1 x 2 x 1 x 1 [0.0 MB]
vl_nnconv: temp: 63 x 84 x 14336 x 1 [289.4 MB]
vl_nnconv: temp (cached): 63 x 84 x 14336 x 1 [289.4 MB]
vl_nnconv: allOnes: 63 x 84 x 1 x 1 [0.0 MB]
vl_nnconv: allOnes (cached): 375 x 500 x 1 x 1 [0.7 MB]
By attaching cuda-gdb with its memcheck mode enabled (`set cuda memcheck on`), we can detect the memory error:
Illegal access to address (@global)0xb063d7900 detected.
Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 2, grid 167, block (72,0,0), thread (96,0,0), device 0, sm 7, warp 1, lane 0]
0x00007fff73e39b40 in ??<<<(74,20,1),(256,1,1)>>> ()
By stepping, I located the error at line 130 of matlab/src/vl_nnconv.cu.
To reproduce the problem, run:
data = gpuArray(rand(63, 84, 512, 1, 'single'));
filters = gpuArray(rand(4, 7, 512, 2, 'single'));
biases = gpuArray(rand(1, 2, 1, 1, 'single'));
derOutput = gpuArray(rand(63, 84, 2, 1, 'single'));
[derData, derFilters, derBiases] = vl_nnconv(data, filters, biases, derOutput, 'pad', [1 2 3 3], 'stride', [1 1]);
I checked the memory usage and it does not appear to exceed any limit. Here is my GPU info (from gpuDevice):
CUDADevice with properties:
Name: 'Tesla K40c'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.2079e+10
FreeMemory: 1.1946e+10
MultiprocessorCount: 15
ClockRateKHz: 745000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
I'm not very familiar with cuBLAS; my hunch is that some memory limit is being violated during the cublasSgemm call.
May I have some suggestions from you guys? Thanks!
Peiyun