Hello,
this is more of a thread discussion than a real issue, but I've been working on CUDA kernel readability, and PyTorch actually provides a very nice way of presenting tensor data to kernels as if it were still a multidimensional array.
See here for a working prototype: https://github.com/ClementPinard/extension-cpp/blob/deviceTensorExperiments/cuda/lltm_cuda_kernel.cu
Essentially, I designed a simple converter from at::Tensor to THCDeviceTensor<scalar_t, 2, size_t, RestrictPtrTraits>.
The conversion is not very pretty, but it lets us write more readable memory accesses in kernels while ultimately doing the exact same thing (even the __restrict__ keyword is kept).
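To give an idea of the glue code involved, here is a minimal sketch of such a converter. The helper name atToDeviceTensor is mine and the actual implementation in the linked branch differs; it only assumes the (data, sizes, strides) constructor of THCDeviceTensor and the public at::Tensor accessors:

#include <ATen/ATen.h>
#include <THC/THCDeviceTensor.cuh>

// Sketch only: build a THCDeviceTensor view over an at::Tensor's data pointer,
// copying its sizes and strides (the real converter in the prototype may differ).
template <typename scalar_t, int Dim>
THCDeviceTensor<scalar_t, Dim, size_t, RestrictPtrTraits>
atToDeviceTensor(at::Tensor t) {
  size_t sizes[Dim];
  size_t strides[Dim];
  for (int i = 0; i < Dim; ++i) {
    sizes[i] = t.size(i);      // extent of dimension i
    strides[i] = t.stride(i);  // stride of dimension i, in elements
  }
  return THCDeviceTensor<scalar_t, Dim, size_t, RestrictPtrTraits>(
      t.data<scalar_t>(), sizes, strides);
}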
Let's look at the current code for the forward pass:
template <typename scalar_t>
__global__ void lltm_cuda_forward_kernel(
    const scalar_t* __restrict__ gates,
    const scalar_t* __restrict__ old_cell,
    scalar_t* __restrict__ new_h,
    scalar_t* __restrict__ new_cell,
    scalar_t* __restrict__ input_gate,
    scalar_t* __restrict__ output_gate,
    scalar_t* __restrict__ candidate_cell,
    size_t state_size) {
  const int column = blockIdx.x * blockDim.x + threadIdx.x;
  const int index = blockIdx.y * state_size + column;
  const int gates_row = blockIdx.y * (state_size * 3);
  if (column < state_size) {
    input_gate[index] = sigmoid(gates[gates_row + column]);
    output_gate[index] = sigmoid(gates[gates_row + state_size + column]);
    candidate_cell[index] = elu(gates[gates_row + 2 * state_size + column]);
    new_cell[index] =
        old_cell[index] + candidate_cell[index] * input_gate[index];
    new_h[index] = tanh(new_cell[index]) * output_gate[index];
  }
}
The column and index variables are kinda hard to figure out. The kernel actually relies on the fact that gridDim.y is the batch size, and thus blockIdx.y is the batch index. column is then the index within the state, and index is batch_idx * batch_stride + column, while gates_row is the first index of the gates for that particular element of the batch, because its batch stride is three times as large.
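For concreteness, here is that arithmetic spelled out for one hypothetical configuration:

// Hypothetical sizes, for illustration only:
//   state_size = 4, blockIdx.y = 1, column = 2
// index     = 1 * 4 + 2   = 6   -> element (1, 2) of a [batch, state_size] tensor
// gates_row = 1 * (4 * 3) = 12  -> start of row 1 of the [batch, 3 * state_size] gates tensor
// gates[gates_row + column]                  -> (1, 2)  input gate pre-activation
// gates[gates_row + state_size + column]     -> (1, 6)  output gate pre-activation
// gates[gates_row + 2 * state_size + column] -> (1, 10) candidate cell pre-activation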
Now here is my proposed code:
template <typename scalar_t>
__global__ void lltm_cuda_forward_kernel(
    const dTensor2R gates,
    const dTensor2R old_cell,
    dTensor2R new_h,
    dTensor2R new_cell,
    dTensor2R input_gate,
    dTensor2R output_gate,
    dTensor2R candidate_cell,
    size_t state_size) {
  const int n = blockIdx.y;  // batch index
  CUDA_KERNEL_LOOP(c, state_size) {
    input_gate[n][c] = sigmoid((scalar_t) gates[n][c]);
    output_gate[n][c] = sigmoid((scalar_t) gates[n][c + state_size]);
    candidate_cell[n][c] = elu((scalar_t) gates[n][c + 2 * state_size]);
    new_cell[n][c] =
        old_cell[n][c] + candidate_cell[n][c] * input_gate[n][c];
    new_h[n][c] = tanh((scalar_t) new_cell[n][c]) * output_gate[n][c];
  }
}
I use dTensor2R, which is defined as THCDeviceTensor<scalar_t, 2, size_t, RestrictPtrTraits> by a macro above.
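For reference, the macro is essentially a shorthand along these lines (the exact definition lives in the linked prototype and may differ):

// Shorthand for the restricted 2D device tensor type (sketch only):
#define dTensor2R THCDeviceTensor<scalar_t, 2, size_t, RestrictPtrTraits>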
Besides using the strided loop CUDA_KERNEL_LOOP (just for the sake of good practice), we now only need to compute n, which is explicitly the batch index, and c, which is the column from above.
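For readers who haven't seen it, CUDA_KERNEL_LOOP is the usual grid-stride loop macro found in Caffe-style CUDA code:

#define CUDA_KERNEL_LOOP(i, n)                        \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
       i < (n);                                       \
       i += blockDim.x * gridDim.x)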
Every relevant value can now be accessed with tensor[n][c + shift], making it very similar to an actual 2D array.
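In other words, for a contiguous [batch, 3 * state_size] gates tensor the two spellings address the same element:

// Proposed access:              gates[n][c + state_size]
// Equivalent raw-pointer index: gates[n * (3 * state_size) + state_size + c]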
I tested my code against master (from a few days ago) and it works for both check.py and grad_check.py. It does not need the PyTorch source code, only the compiled binaries and the headers.
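For completeness, the host side can stay very close to the original tutorial; here is a minimal sketch, assuming the atToDeviceTensor helper sketched above (the actual glue code in the branch may look different), with blocks, threads, and state_size taken from the tutorial's existing launch setup:

AT_DISPATCH_FLOATING_TYPES(gates.type(), "lltm_forward_cuda", ([&] {
  lltm_cuda_forward_kernel<scalar_t><<<blocks, threads>>>(
      atToDeviceTensor<scalar_t, 2>(gates),
      atToDeviceTensor<scalar_t, 2>(old_cell),
      atToDeviceTensor<scalar_t, 2>(new_h),
      atToDeviceTensor<scalar_t, 2>(new_cell),
      atToDeviceTensor<scalar_t, 2>(input_gate),
      atToDeviceTensor<scalar_t, 2>(output_gate),
      atToDeviceTensor<scalar_t, 2>(candidate_cell),
      state_size);
}));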
Is this proposition legit? I feel like it could be a good way of letting people write CUDA kernels with more complicated N-D tensors (like 4D tensors for regular feature maps) without all the complex indexing stuff. And if so, that could be a good reason to write a more user-friendly method for the at::Tensor to THCDeviceTensor conversion.