
cuda-winograd

Introduction

This code implements fast CUDA kernels for DNN inference, especially for the convolution layers and residual blocks in ResNet. Specifically, each kernel fuses three operations into a single launch:

  • Convolution
  • Batch Normalization (BN + Scale)
  • Activation (ReLU)

For implementation details, please refer to the technical report included in this repo. The Winograd algorithm is used for the 3×3 convolution kernels.
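To illustrate the fusion, here is a minimal sketch (an illustrative helper with hypothetical names; the repo's kernels inline this arithmetic directly). At inference time, batch normalization folds into a per-channel scale and bias, so the whole BN + ReLU epilogue costs one multiply-add and one clamp per output element:

    /* Minimal sketch of the fused epilogue (illustration, not the repo's
     * actual code): at inference, BN reduces to a per-channel scale and
     * bias, so BN + ReLU are one multiply-add and one clamp. */
    __device__ float fused_epilogue(float conv_result, float scale, float bias) {
        float o = scale * conv_result + bias;  /* batch norm: scale + shift */
        return o > 0.f ? o : 0.f;              /* ReLU */
    }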

Usage

    mkdir data
    python data_generator.py
    make
    ./Test 0
  • Set the parameters in data_generator.py.
  • Run all 6 test cases by passing a number from 0 to 5 to ./Test.

Results

3×3 Kernels

Kernels      Operations             128 / 128   256 / 256
cuDNN        GEMM + BN + ReLU       214 us      384 us
cuDNN        Winograd + BN + ReLU   95 us       155 us
Our kernel   Winograd + BN + ReLU   59 us       117 us

1×1 Kernels [BUGGY NUMBERS — see the Kernel128_one.cu issue below]

Kernels      512 / 128          128 / 512   1024 / 256         256 / 1024
Operations   GEMM + BN + ReLU   GEMM + BN   GEMM + BN + ReLU   GEMM + BN + ReLU
cuDNN        119 us             115 us      219 us             214 us
Our kernel   58 us              55 us       186 us             181 us
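For context on the GEMM rows: a 1×1 convolution is exactly a matrix multiplication. A naive sketch (my illustration; row-major layouts assumed) for the 14×14×512 input and 512×128 weights used in one of the test cases:

    /* Naive sketch (illustration) of why a 1x1 convolution is a GEMM: a
     * 14x14x512 input viewed as a (196 x 512) matrix of pixels times a
     * (512 x 128) weight matrix gives the (196 x 128) output. */
    void conv1x1_as_gemm(const float *input,   /* 196 x 512, row-major */
                         const float *weight,  /* 512 x 128, row-major */
                         float *output) {      /* 196 x 128, row-major */
        for (int p = 0; p < 14 * 14; p++)
            for (int co = 0; co < 128; co++) {
                float sum = 0.f;
                for (int ci = 0; ci < 512; ci++)
                    sum += input[p * 512 + ci] * weight[ci * 128 + co];
                output[p * 128 + co] = sum;
            }
    }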

Contributors

xuqiantong


Issues

Questions about the pOutputs indexing in Kernel128_winograd.cu

Hi @xuqiantong. I'm reading your kernel function kernel_128_winograd_AtIA in Kernel128_winograd.cu.

I'm confused by a few lines; my questions are inline as comments below, labeled Q1–Q4.

    // Q1: what is the (Tilex == 3 && Inx > 1) condition responsible for?
    // I understand Inx > 3, because the result tile is 4 x 4.
    if (Inx > 3 || (Tilex == 3 && Inx > 1)) return;

    int x;
    float o;
    switch (Iny) {
    case 0:
        x = Inx * 6;
        o = scale*(input[x] + input[x+1] + input[x+2] + input[x+3] + input[x+4]) + bias;
        // Q2: even with Tilex = 0, Tiley = 0, Inx = 0, Iny = 0, kz = 0, the result
        // is stored at index 2176, not 0 -- why? What is stored at indices [0..2175]?
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 1) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 1:
        x = Inx * 6;
        o = scale*(input[x+1] - input[x+2] + 2*input[x+3] - 2*input[x+4]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 2) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 2:
        // Q3: what is special about Tiley = 3?
        if (Tiley == 3) break;
        x = Inx * 6;
        o = scale*(input[x+1] + input[x+2] + 4*input[x+3] + 4*input[x+4]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 3) * 128 + kz] = o > 0 ? o : 0;
        break;
    case 3:
        // Q4: same as Q3.
        if (Tiley == 3) break;
        x = Inx * 6;
        o = scale*(input[x+1] - input[x+2] + 8*input[x+3] - 8*input[x+4] + input[x+5]) + bias;
        pOutputs[(((Tilex<<2) + 1 + Inx) * 16 + (Tiley<<2) + 4) * 128 + kz] = o > 0 ? o : 0;
        break;
    }

Could you explain those questions?
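For reference, a sketch of the index arithmetic these questions concern, under the assumption (mine, not confirmed in the thread) that pOutputs is a 16 × 16 × 128 buffer whose valid 14 × 14 region occupies rows and columns 1 through 14, with a one-pixel border around it:

    #include <stdio.h>

    /* Reconstruction of the AtIA output indexing (an assumption, not the
     * author's explanation): 4 x 4 output tiles are written into a
     * 16 x 16 x 128 buffer whose payload sits at rows/cols 1..14. */
    int output_index(int Tilex, int Inx, int Tiley, int col_in_tile, int kz) {
        int row = (Tilex << 2) + 1 + Inx;         /* +1 skips the border row */
        int col = (Tiley << 2) + 1 + col_in_tile; /* +1 skips the border col */
        return (row * 16 + col) * 128 + kz;
    }

    int main(void) {
        /* Q2: the first element lands at (row 1, col 1): 17 * 128 = 2176;
         * indices 0..2175 cover the border, presumably zero padding. */
        printf("%d\n", output_index(0, 0, 0, 0, 0));
        /* Q1: for Tilex == 3, Inx = 2 or 3 would give rows 15 or 16, past
         * the 14 valid rows (4 tiles x 4 rows = 16 > 14), hence the early
         * return. Q3/Q4: Tiley == 3 clips columns 15 and 16 the same way. */
        printf("row %d\n", output_index(3, 2, 0, 0, 0) / (16 * 128));
        return 0;
    }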

Kernel128_winograd.cu seems to use an inner product, not an outer product

First, the weight data is pre-generated as GWGT and loaded into a variable named "weights", and the feature map is preprocessed by kernel_128_winograd_BtdB(). Neither matrix is transposed afterwards.

Within kernel_128_OuterProduct_128(), the matrix multiplication is carried out by the following statements:

    for (int j = 0; j < 32; j++) {
        sum += input[y_tmp + j] * kernel[tX + B_stride[j]];
    }
    out[tY*128 + tX] += sum;
    __syncthreads();

The iterator k splits the weights matrix into (32, 128) chunks, and at each iteration the input matrix has shape (8, 128).
B_stride[j] is built as 128*j, pointing to the first element of row j.
y_tmp is defined as tY*128 + k*32, which selects row tY and then the k-th 32-element segment within it, since k splits the weights into 4 chunks of 32 × 128.

Hence this is an inner product: the thread with index (tY, tX) produces the single result element at index tY*128 + tX by accumulating the products of corresponding elements from row tY of the input and column tX of the weights.
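To restate the access pattern as scalar code (my reconstruction under the shapes described above, not code from the repo):

    /* Scalar restatement of the loop above (a reconstruction, not the repo's
     * code): thread (tY, tX) dots row tY of the input with column tX of the
     * weights -- one inner product per output element. */
    float out_element(const float *input, const float *weights, int tY, int tX) {
        float sum = 0.f;
        for (int k = 0; k < 4; k++) {         /* weights split as 4 chunks of 32 x 128 */
            int y_tmp = tY * 128 + k * 32;    /* 32-wide segment of input row tY */
            for (int j = 0; j < 32; j++) {
                /* B_stride[j] = 128 * j walks down column tX within the chunk */
                sum += input[y_tmp + j] * weights[tX + 128 * (k * 32 + j)];
            }
        }
        return sum;                           /* accumulated into out[tY*128 + tX] */
    }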

Thanks,

Trying to fix the bug in Kernel128_one.cu.

 65   float *input = get_parameter(inputName128one, 14*14*512);
 66   float *weight = get_parameter(weightName128one, 128*512);

 99   cudaMemcpy(input_, input, nInput<<2, cudaMemcpyHostToDevice);
100   cudaMemcpy(weight_, weight_, nWeights<<2, cudaMemcpyHostToDevice);
101   cudaMemcpy(bnBias_, bnBias_myKernel, 128<<2, cudaMemcpyHostToDevice);
102   cudaMemcpy(bnScale_, bnScale_myKernel, 128<<2, cudaMemcpyHostToDevice);

The generated weight data is loaded at line 66 into the host pointer weight, but at line 100 the copy reads from the device pointer weight_ itself, so the device buffer is never filled from the host data and stays uninitialized (effectively all zero).

The cuDNN path also computes with weight_, so I changed line 100 to:
100 cudaMemcpy(weight_, weight, nWeights<<2, cudaMemcpyHostToDevice);
Then I got this result:

---- Iter: 9 ----
TotalTime = 58 us
cudaSuccess
cuDNN TotalTime = 113 us
cudaSuccess
[max_error: 65141.292969][error_cnt: 18947]
Average Total Time: [Mine: 59 us], [cuDNN: 111 us]

It seems your kernel is not doing the correct calculation.

Please help address the issue, thanks.

license

Can you please add a license to your repository?

Thanks
