
cuda-winograd

Introduction

This code implements fast CUDA kernels for DNN inference, especially for the convolution layers and residual blocks in ResNet. Specifically, each kernel fuses three operations into a single launch:

  • Convolution
  • Batch Normalization (BN + Scale)
  • Activation (ReLU)

For implementation details, please refer to the technical report included in this repo. The Winograd algorithm is used for the 3×3 convolution kernels.
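To illustrate the fusion, here is a minimal sketch (an illustrative helper with hypothetical names; the repo's kernels inline this arithmetic directly). At inference time, batch normalization folds into a per-channel scale and bias, so the whole BN + ReLU epilogue costs one multiply-add and one clamp per output element:

    /* Minimal sketch of the fused epilogue (illustration, not the repo's
     * actual code): at inference, BN reduces to a per-channel scale and
     * bias, so BN + ReLU are one multiply-add and one clamp. */
    __device__ float fused_epilogue(float conv_result, float scale, float bias) {
        float o = scale * conv_result + bias;  /* batch norm: scale + shift */
        return o > 0.f ? o : 0.f;              /* ReLU */
    }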

Usage

    mkdir data
    python data_generator.py
    make
    ./Test 0
  • Set the parameters in data_generator.py.
  • Run all 6 test cases by passing a number from 0 to 5 to ./Test.

Results

3×3 Kernels

Kernels      Operations             128 / 128   256 / 256
cuDNN        GEMM + BN + ReLU       214 us      384 us
cuDNN        Winograd + BN + ReLU   95 us       155 us
Our kernel   Winograd + BN + ReLU   59 us       117 us

1×1 Kernels [BUGGY NUMBERS — see the Kernel128_one.cu issue below]

Kernels      512 / 128          128 / 512   1024 / 256         256 / 1024
Operations   GEMM + BN + ReLU   GEMM + BN   GEMM + BN + ReLU   GEMM + BN + ReLU
cuDNN        119 us             115 us      219 us             214 us
Our kernel   58 us              55 us       186 us             181 us
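For context on the GEMM rows: a 1×1 convolution is exactly a matrix multiplication. A naive sketch (my illustration; row-major layouts assumed) for the 14×14×512 input and 512×128 weights used in one of the test cases:

    /* Naive sketch (illustration) of why a 1x1 convolution is a GEMM: a
     * 14x14x512 input viewed as a (196 x 512) matrix of pixels times a
     * (512 x 128) weight matrix gives the (196 x 128) output. */
    void conv1x1_as_gemm(const float *input,   /* 196 x 512, row-major */
                         const float *weight,  /* 512 x 128, row-major */
                         float *output) {      /* 196 x 128, row-major */
        for (int p = 0; p < 14 * 14; p++)
            for (int co = 0; co < 128; co++) {
                float sum = 0.f;
                for (int ci = 0; ci < 512; ci++)
                    sum += input[p * 512 + ci] * weight[ci * 128 + co];
                output[p * 128 + co] = sum;
            }
    }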

Contributors

xuqiantong


Issues

Questions about the pOutputs indexing in Kernel128_winograd.cu

Hi @xuqiantong. I'm reading your kernel function kernel_128_winograd_AtIA in Kernel128_winograd.cu.

I'm confused by a few lines; my questions are inline as comments below, labeled Q1–Q4.

    // Q1: what is the (Tilex == 3 && Inx > 1) condition responsible for?
    // I understand Inx > 3, because the result tile is 4 x 4.
    if (Inx > 3 || (Tilex == 3 && Inx > 1)) return;

    int x;
    float o;
    switch (Iny) {
    case 0:
        x = Inx * 6;
        o = scale*(input[x] + input[x+1] + input[x+2] + input[x+3] + input[x+4]) + bias;
        // Q2: even with Tilex = 0, Tiley = 0, Inx = 0, Iny = 0, kz = 0, the result
        // is stored at index 2176, not 0 -- why? What is stored at indices [0..2175]?
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 1) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 1:
        x = Inx * 6;
        o = scale*(input[x+1] - input[x+2] + 2*input[x+3] - 2*input[x+4]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 2) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 2:
        // Q3: what is special about Tiley = 3?
        if (Tiley == 3) break;
        x = Inx * 6;
        o = scale*(input[x+1] + input[x+2] + 4*input[x+3] + 4*input[x+4]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 3) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 3:
        // Q4: same as Q3.
        if (Tiley == 3) break;
        x = Inx * 6;
        o = scale*(input[x+1] - input[x+2] + 8*input[x+3] - 8*input[x+4] + input[x+5]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 4) * 128 + kz] = o > 0 ? o : 0;
        break;
    }

Could you explain those questions?
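For reference, a sketch of the index arithmetic these questions concern, under the assumption (mine, not confirmed in the thread) that pOutputs is a 16 × 16 × 128 buffer whose valid 14 × 14 region occupies rows and columns 1 through 14, with a one-pixel border around it:

    #include <stdio.h>

    /* Reconstruction of the AtIA output indexing (an assumption, not the
     * author's explanation): 4 x 4 output tiles are written into a
     * 16 x 16 x 128 buffer whose payload sits at rows/cols 1..14. */
    int output_index(int Tilex, int Inx, int Tiley, int col_in_tile, int kz) {
        int row = (Tilex << 2) + 1 + Inx;         /* +1 skips the border row */
        int col = (Tiley << 2) + 1 + col_in_tile; /* +1 skips the border col */
        return (row * 16 + col) * 128 + kz;
    }

    int main(void) {
        /* Q2: the first element lands at (row 1, col 1): 17 * 128 = 2176;
         * indices 0..2175 cover the border, presumably zero padding. */
        printf("%d\n", output_index(0, 0, 0, 0, 0));
        /* Q1: for Tilex == 3, Inx = 2 or 3 would give rows 15 or 16, past
         * the 14 valid rows (4 tiles x 4 rows = 16 > 14), hence the early
         * return. Q3/Q4: Tiley == 3 clips columns 15 and 16 the same way. */
        printf("row %d\n", output_index(3, 2, 0, 0, 0) / (16 * 128));
        return 0;
    }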

Kernel128_winograd.cu seems to use an inner product, not an outer product

First, the weight data is pre-generated as GWGT and loaded into a variable named "weights", and the feature map is preprocessed by kernel_128_winograd_BtdB(). Neither matrix is transposed afterwards.

Within kernel_128_OuterProduct_128(), the matrix multiplication is carried out by the following statements:

    for (int j = 0; j < 32; j++) {
        sum += input[y_tmp + j] * kernel[tX + B_stride[j]];
    }
    out[tY*128 + tX] += sum;
    __syncthreads();

The iterator k splits the weights matrix into (32, 128) chunks, and at each iteration the input matrix has shape (8, 128).
B_stride[j] is built as 128*j, pointing to the first element of row j.
y_tmp is defined as tY*128 + k*32, which selects row tY and then the k-th 32-element segment within it, since k splits the weights into 4 chunks of 32 × 128.

Hence this is an inner product: the thread with index (tY, tX) produces the single result element at index tY*128 + tX by accumulating the products of corresponding elements from row tY of the input and column tX of the weights.
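To restate the access pattern as scalar code (my reconstruction under the shapes described above, not code from the repo):

    /* Scalar restatement of the loop above (a reconstruction, not the repo's
     * code): thread (tY, tX) dots row tY of the input with column tX of the
     * weights -- one inner product per output element. */
    float out_element(const float *input, const float *weights, int tY, int tX) {
        float sum = 0.f;
        for (int k = 0; k < 4; k++) {         /* weights split as 4 chunks of 32 x 128 */
            int y_tmp = tY * 128 + k * 32;    /* 32-wide segment of input row tY */
            for (int j = 0; j < 32; j++) {
                /* B_stride[j] = 128 * j walks down column tX within the chunk */
                sum += input[y_tmp + j] * weights[tX + 128 * (k * 32 + j)];
            }
        }
        return sum;                           /* accumulated into out[tY*128 + tX] */
    }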

Thanks,

Trying to fix the bug in Kernel128_one.cu.

 65   float *input = get_parameter(inputName128one, 14*14*512);
 66   float *weight = get_parameter(weightName128one, 128*512);

 99   cudaMemcpy(input_, input, nInput<<2, cudaMemcpyHostToDevice);
100   cudaMemcpy(weight_, weight_, nWeights<<2, cudaMemcpyHostToDevice);
101   cudaMemcpy(bnBias_, bnBias_myKernel, 128<<2, cudaMemcpyHostToDevice);
102   cudaMemcpy(bnScale_, bnScale_myKernel, 128<<2, cudaMemcpyHostToDevice);

The generated weight data is loaded at line 66 into the host pointer weight, but at line 100 the copy reads from the device pointer weight_ itself, so the device buffer is never filled from the host data and stays uninitialized (effectively all zero).

The cuDNN path also computes with weight_, so I changed line 100 to:
100 cudaMemcpy(weight_, weight, nWeights<<2, cudaMemcpyHostToDevice);
Then I got this result:

---- Iter: 9 ----
TotalTime = 58 us
cudaSuccess
cuDNN TotalTime = 113 us
cudaSuccess
[max_error: 65141.292969][error_cnt: 18947]
Average Total Time: [Mine: 59 us], [cuDNN: 111 us]

It seems your kernel is not doing the correct calculation.

Please help address the issue, thanks.

license

Can you please add a license to your repository?

Thanks
