
Comments (3)

yunjiangster commented on August 24, 2024

Maybe you can achieve something similar by loading a 2D tensor with overlapping elements (via unsqueeze/`[:, None]` indexing and broadcasting). Here is an example:

import triton
import triton.language as tl
import torch

@triton.jit
def test_stencil(x_ptr, o_ptr):
    pid = tl.program_id(axis=0)
    rng = tl.arange(0, 4)
    x = tl.load(x_ptr + rng[:, None] + rng[None, :])
    tl.store(o_ptr + rng, tl.sum(x, axis=1))

x = torch.arange(8).cuda()
y = torch.zeros_like(x)
test_stencil[(1,)](x, y)
x, y

The output looks like this:

(tensor([0, 1, 2, 3, 4, 5, 6, 7], device='cuda:0'),
 tensor([ 6, 10, 14, 18,  0,  0,  0,  0], device='cuda:0'))
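For reference, the same broadcasting trick can be reproduced on the host with NumPy: adding a column vector of offsets to a row vector of offsets yields a 4×4 index matrix whose row i covers elements i..i+3, so each output element is the sum of an overlapping window.

```python
import numpy as np

# rng[:, None] + rng[None, :] broadcasts to a 4x4 matrix of offsets;
# row i indexes the overlapping window x[i], x[i+1], x[i+2], x[i+3].
rng = np.arange(4)
offsets = rng[:, None] + rng[None, :]

x = np.arange(8)
row_sums = x[offsets].sum(axis=1)  # one windowed sum per row
# row_sums -> [6, 10, 14, 18], matching the kernel output above
```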

from triton.

thumbe3 commented on August 24, 2024

Hi @yunjiangster, this is definitely a good idea for making the code concise. However, it has the same problem of multiple loads of the same elements. Here is the implementation I wrote by borrowing your idea. NUM_OFFSETS below is the smallest power of 2 greater than or equal to (2 * RADIUS + 1), since tl.arange requires a power-of-2 length.
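As a side note, that power-of-2 rounding can be computed with a small host-side helper (the name `num_offsets` here is hypothetical; Triton also ships `triton.next_power_of_2`, which does the same):

```python
def num_offsets(radius: int) -> int:
    """Smallest power of 2 that can hold the 2*radius + 1 stencil taps."""
    width = 2 * radius + 1
    # bit_length of (width - 1) gives the exponent of the next power of 2
    return 1 << (width - 1).bit_length()

# num_offsets(1) == 4, num_offsets(3) == 8, num_offsets(4) == 16
```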

@triton.jit
def stencil_kernel_v2(
    inputs: tl.tensor,
    outputs: tl.tensor,
    shape: tl.int32,
    BLOCK_SIZE: tl.constexpr,
    NUM_OFFSETS: tl.constexpr,
    RADIUS: tl.constexpr):

    pid = tl.program_id(0)
    # Starting Offsets
    base_offs = tl.arange(0, BLOCK_SIZE) 
    inc_offs = tl.arange(0, NUM_OFFSETS) - RADIUS
    output_offsets = pid * BLOCK_SIZE + base_offs
    input_offsets = pid * BLOCK_SIZE + inc_offs[None, :] + base_offs[:, None]

    # Python's `and` does not broadcast over tensors; combine the masks
    # elementwise with `&` and parenthesized comparisons. The last term
    # disables the padding lanes added by rounding NUM_OFFSETS up to a
    # power of 2.
    input_tensor = tl.load(inputs + input_offsets,
        mask=(input_offsets >= 0) & (input_offsets < shape)
            & (inc_offs[None, :] <= RADIUS),
        other=0.0)

    tl.store(outputs + output_offsets,
        tl.sum(input_tensor, axis=1),
        mask = output_offsets < shape)

Even this kernel performs about the same as the previous Triton kernel, and it starts to lag behind the hand-tuned CUDA implementation at high values of RADIUS.
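For checking correctness on the host, a NumPy reference of what the kernel computes could look like the sketch below (assuming out-of-range neighbors contribute zero, which matches the masked load with other=0.0):

```python
import numpy as np

def stencil_reference(x: np.ndarray, radius: int) -> np.ndarray:
    """Sum each element with its `radius` neighbors on either side,
    treating out-of-range positions as 0 (zero padding)."""
    padded = np.pad(x, radius)
    width = 2 * radius + 1
    return np.array([padded[i:i + width].sum() for i in range(len(x))])

# e.g. stencil_reference(np.arange(8), 1)
# -> [1, 3, 6, 9, 12, 15, 18, 13]
```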


yunjiangster commented on August 24, 2024

@thumbe3 ah, that makes sense. Loading the same repeated element probably still costs multiple load operations. I am curious how Triton handles convolution kernels then, since they do something similar.

