
Comments (19)

wang-y-z commented on August 24, 2024

Hi @ptillet, I was wondering whether Triton will support CSR/COO-format SpMM (or other sparse CUDA kernels) in the future, without Tensor Cores. Thanks.

from triton.

ptillet commented on August 24, 2024

Yep. triton.ops.blocksparse.matmul supports the S = D*D, D = S*D, and D = D*S modes, all with Tensor Cores.

YukeWang96 commented on August 24, 2024

Is there a code example I can follow?
I checked the official documentation, but could not find anything that describes this.

Thanks!

YukeWang96 commented on August 24, 2024

By the way, what are the type and format of S and D?
Are they scipy COO/CSR matrices?
Also, does this API support large sparse matrices in .mtx format?

ptillet commented on August 24, 2024

Oh, I think I may have misunderstood you. Triton only supports block sparsity at the moment. An example of how it can be used for block-sparse attention can be seen here: https://github.com/ptillet/triton/blob/master/python/test/test_blocksparse.py#L150-L159

The format consists of (1) a layout tensor of ones and zeros, marking which blocks are non-zero, and (2) the raw data, stored as the concatenation of all the flattened blocks.
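To make the described format concrete, here is a minimal numpy sketch of packing a dense matrix into a (layout, data) pair of this shape. The function name and the use of numpy are illustrative assumptions; this is not Triton's actual packing code.

```python
import numpy as np

def dense_to_blocksparse(x, block):
    """Illustrative sketch: pack a dense 2-D matrix into a layout of
    ones/zeros plus the concatenation of all non-zero blocks, row-major.
    Not Triton's actual implementation."""
    h, w = x.shape
    assert h % block == 0 and w % block == 0
    # View the matrix as a grid of (block x block) tiles:
    # axes become (row_block, col_block, row_in_block, col_in_block)
    grid = x.reshape(h // block, block, w // block, block).transpose(0, 2, 1, 3)
    # A block is "present" if it contains any non-zero element
    layout = (np.abs(grid).sum(axis=(2, 3)) != 0).astype(np.int64)
    # Keep only the present blocks, in row-major block order
    data = grid[layout.astype(bool)]   # shape: (nnz_blocks, block, block)
    return layout, data

x = np.zeros((4, 4), dtype=np.float32)
x[0:2, 2:4] = 1.0                      # exactly one non-zero 2x2 block
layout, data = dense_to_blocksparse(x, block=2)
# layout marks only the top-right block; data holds that one block
```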

YukeWang96 commented on August 24, 2024

Hi,

For block-sparse computation, is it similar to what is described in this post from NVIDIA?

Thanks!

ptillet commented on August 24, 2024

Yep

YukeWang96 commented on August 24, 2024

Hi,

Is the underlying implementation of Triton's block-sparse API for Tensor Core acceleration based on cuSPARSE Block-SpMM?

Thanks!

ptillet commented on August 24, 2024

Nope, it's based on custom kernels written in the Triton programming language, which are much more concise. I didn't get a chance to compare the performance, since there's no Python API for cuSPARSE Block-SpMM, but I would expect ours to be on par.

YukeWang96 commented on August 24, 2024

I have several follow-up questions.

  1. In order to use Tensor Cores, should I set dtype=float16?
  2. What is the difference between the different modes, such as sdd, dsd, and dds?
  3. To compute a GEMM of size MxNxK, is this setting for test_matmul correct?
     test_matmul('sdd', True, False, 64, 'float16', Z=1, H=1, M=512, N=384, K=256)
  4. From a performance point of view, how should we set the BLOCK value?

Thanks a lot!

from triton.

ptillet avatar ptillet commented on August 24, 2024

1 - Yes. The FP32 version uses CUDA cores.
2 - There are three modes:
SDD: sparse = dense x dense, a.k.a. sampled dense-dense matrix multiplication
DSD: dense = sparse x dense; the LHS is sparse
DDS: dense = dense x sparse; the RHS is sparse
3 - I believe so.
4 - Supported block sizes are [16, 32, 64, 128]. At an equal number of FLOPs, bigger blocks will perform better. 128 should have GPU efficiency roughly on the order of a dense matmul; anything below may be quite a bit worse, especially 16.
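The semantics of the three modes can be written down as a dense numpy reference, masking whole blocks with the layout. This is only a sketch of what each mode computes; Triton's kernels never materialize the dense zeros, and the helper name here is made up.

```python
import numpy as np

BLOCK = 2
layout = np.array([[1, 0],
                   [0, 1]])            # which 2x2 blocks are kept

def expand(layout, block):
    # Blow each layout entry up into a (block x block) tile of 0s or 1s
    return np.kron(layout, np.ones((block, block)))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
mask = expand(layout, BLOCK)

# Dense reference semantics of the three modes:
sdd = (A @ B) * mask        # SDD: output sampled at the layout's blocks
dsd = (A * mask) @ B        # DSD: the left operand is sparse
dds = A @ (B * mask)        # DDS: the right operand is sparse
```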

YukeWang96 commented on August 24, 2024

Is there any limit on the sparse matrix size?
For example, for an n x n sparse matrix, should I estimate its memory size as n x n x sizeof(float)?

ptillet commented on August 24, 2024

The memory footprint of our block-sparse matrices is equal to nnz * sizeof(dtype), where nnz is the number of non-zero elements.
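As a quick worked example of that nnz * sizeof(dtype) estimate, here is the footprint of a hypothetical 4096 x 4096 fp16 matrix that keeps one block in sixteen (the sizes are made up for illustration):

```python
import numpy as np

n = 4096                                # hypothetical n x n matrix
block = 64
layout = np.zeros((n // block, n // block), dtype=np.int64)
layout[::4, ::4] = 1                    # keep 1 block out of every 16

# nnz is the number of stored *elements*: blocks x block^2
nnz = int(layout.sum()) * block * block
bytes_fp16 = nnz * 2                    # sizeof(float16) == 2

dense_bytes = n * n * 2                 # what a dense fp16 matrix would cost
```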

YukeWang96 commented on August 24, 2024

Is there any example of loading a sparse matrix from a .mtx file?
I found that the example only shows the initialization of a sparse matrix, without loading an external dataset.

ptillet commented on August 24, 2024

I have never tried loading .mtx data. I can give more information on the internal format used by Triton, though. It is essentially COO, and consists of (1) a block-sparse layout tensor, and (2) a data tensor that is just the concatenation of all flattened blocks, row-major.

YukeWang96 commented on August 24, 2024

OK, essentially .mtx is similar to COO. Do you have an example of how I can initialize a sparse tensor from COO (an edge list)?
For example, the edge list is

COO = [
[1, 2, 3],
[2, 3, 1]
]

which represents non-zero elements at [1, 2], [2, 3], and [3, 1]. How should I initialize a sparse tensor from this COO?
Thanks!

ptillet commented on August 24, 2024

I don't have such an example, but looking at triton.testing.sparsify_tensor could be pretty useful: https://github.com/ptillet/triton/blob/master/python/triton/testing.py#L12-L16 . It essentially converts a dense tensor + layout into a sparse tensor, and you can see the format there :)
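Putting the two answers together, one possible sketch of the missing step looks like this: mark the block containing each COO edge in a layout tensor, then gather the non-zero blocks in the spirit of sparsify_tensor. The matrix size, block size, and helper name are assumptions for illustration; numpy stands in for torch to keep the sketch self-contained.

```python
import numpy as np

# The COO edge list from the question above
rows = [1, 2, 3]
cols = [2, 3, 1]
BLOCK = 2
n = 8                                   # hypothetical matrix size

# Step 1: mark the block that holds each (row, col) non-zero
layout = np.zeros((n // BLOCK, n // BLOCK), dtype=np.int64)
for r, c in zip(rows, cols):
    layout[r // BLOCK, c // BLOCK] = 1

# Step 2: gather each non-zero block, row-major, into one data tensor
# (illustrative numpy analogue of sparsify_tensor's dense + layout input)
def sparsify(x, layout, block):
    blocks = [x[i * block:(i + 1) * block, j * block:(j + 1) * block]
              for i, j in np.argwhere(layout)]
    return np.stack(blocks)             # (nnz_blocks, block, block)

x = np.arange(n * n, dtype=np.float32).reshape(n, n)
data = sparsify(x, layout, BLOCK)
```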

lhl2017 commented on August 24, 2024

Is there a way for CUDA programmers to see how the generated kernel works? Is checking the C-like code the best way? @ptillet

ptillet commented on August 24, 2024

Unfortunately, not at the moment. This would be particularly hard to do, since the Triton front-end directly generates LLVM-like SSA code with explicit branches. The best Triton will be able to do for the foreseeable future is to display the PTX.
