bruce-lee-ly / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
License: MIT License
CHUNK_K: I don't quite understand what CHUNK_K means. Where does the 32 come from?
Hi, I'd like to ask about solutions for bank conflicts when loading from shared memory to registers, especially with the permute method.
For a KxN matrix B with K=16, N=16, the shared-memory layout is as follows (each data_x is 2 bytes):
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
...
------------------------------------------------------------ row 8
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
------------------------------------------------------------ row 16
I load the data with
ldmatrix.sync.aligned.x2.trans.m8n8.shared.b16 {%0, %1}, [%2];
with shape (16, 8), and I get a 2-way bank conflict.
How can this bank conflict be resolved?
I would be grateful for any discussion.
I'm on Windows 11 + CUDA 11.8 + sm_86. I tested the kernel in mma_naive.cu and the results are incorrect. I'm still reviewing locally to confirm; when you have time, could you check whether there is indeed a problem?
First of all, I would like to express my deep gratitude for providing the excellent educational materials. The articles posted on Medium have also been very helpful in my studies.
I have a question. I am using an A100 GPU. I have noticed that on the A6000 you demonstrate, certain kernels reach throughput almost matching cuBLAS.
When I tested on my A100, it showed the following performance (I fixed M=512 instead of using M=N=K).
After some thought, I suspect I need to change the tile sizes, since I did not keep the dimensions equal. Do you think I am heading in the right direction?
If so, could you explain how you decided on the current tile sizes and the other execution-config settings? (Or did you simply search over all of them?)
Thank you once again.
Hello. While reading the mma_permuted.cu source, I noticed that the permute scheme you use seems to differ from the XOR scheme described in the talk "DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100". You appear to use a cyclic right shift per stage to avoid bank conflicts during ldmatrix. Is that understanding correct? I'd appreciate it if you could find time to answer. Thanks.
Why is the for loop on this line split into two groups by CHUNK_K?
cuda_hgemm/src/wmma/wmma_async_stage2.cu
Line 208 in 10a8a84
#pragma unroll
for (size_t i = 0; i < A_smem_iters / CHUNK_K; ++i) { <- here
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
}
...
smem_load_idx = (smem_load_idx + 1) % K_STAGE;
smem_load_off = smem_load_idx * smem_stage_off;
#pragma unroll
for (size_t i = (CHUNK_K - 1) * A_smem_iters / CHUNK_K; i < A_smem_iters; ++i) { <- here
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
}
There doesn't seem to be any special logic between these two loops? Thanks 🤣
m_A = new Matrix(m_M, m_K, "Matrix A");
m_B = new Matrix(m_K, m_N, "Matrix B");
Although in src/common/tester.h A is defined with shape (M, K) and B with (K, N), the actual matrix multiply treats them as A(M, K) and B(N, K). If A is (M, K) and B is (K, N), can mma still be used for acceleration? cp.async and ldmatrix can only handle contiguous 128 bits, and with B(K, N) it is hard to place the K-direction elements into contiguous threads. This is something that has puzzled me while reading mma tutorials recently; I hope you can help answer it.
I noticed that B here uses the T (transposed) layout. Why is that? I tried the N layout instead, as follows:
with padding of 16, and the B registers in row-major order. Testing this approach against wmma_async_stage3.cu on an A100 shows a 10% performance loss. Why is that? Is there a reason behind it 🤣
Thanks for this wonderful repo.
I have a question about the async copies:
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
Does this mean that every lane (thread) has a different pointer to the shared memory and a different pointer to the global memory?
The way I understand the async copies, the src and dst pointers must be the same for every thread in the thread block. See the docs.
Thanks for sharing the code! If I change block_rows and block_cols in wmma_async_pg2s.cu to 256 and 128, I get an error. I can't see what the problem is...
./hgemm -M=4096 -N=4096 -K=1024 -profiling_iterations=1 -warmup_iterations=1 -enable_check=true
[HGEMM 2023-09-25 19:00:25 1022624:1022624 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Pg2s -----------------
[HGEMM 2023-09-25 19:00:26 1022624:1022624 wmma_async_pg2s.cu:274 initWmmaAsyncPg2s] shmem_max_size: 66 KBytes (67584 Bytes)
[HGEMM 2023-09-25 19:00:30 1022624:1022624 cuda_timer.h:39 end] CUDA Runtime API error = 0700 "cudaErrorIllegalAddress", runtime version: 12000, driver version: 12020
#define BLOCK_ROWS 128
#define BLOCK_COLS 256
#define WARP_ROWS 64
#define WARP_COLS 64
#define BLOCK_ROW_WARPS 4 // BLOCK_COLS / WARP_COLS
#define BLOCK_COL_WARPS 2 // BLOCK_ROWS / WARP_ROWS
#define BLOCK_ROW_TILES 16 // BLOCK_COLS / WMMA_N
#define BLOCK_COL_TILES 8 // BLOCK_ROWS / WMMA_M
#define WARP_ROW_TILES 4 // WARP_COLS / WMMA_N
#define WARP_COL_TILES 4 // WARP_ROWS / WMMA_M
#define WARP_SIZE 32
#define WARPS_PER_BLOCK 8 // BLOCK_ROW_WARPS * BLOCK_COL_WARPS
#define THREADS_PER_BLOCK 256 // WARP_SIZE * WARPS_PER_BLOCK
#define CHUNK_K 2 // 32 / WMMA_K
#define THREAD_COPY_BYTES 16
#define CHUNK_LINE_BYTES 64 // CHUNK_K * WMMA_K * sizeof(half)
#define CHUNK_COPY_LINES_PER_WARP 8 // WARP_SIZE * THREAD_COPY_BYTES / CHUNK_LINE_BYTES
#define CHUNK_COPY_LINE_LANES 4 // WARP_SIZE / CHUNK_COPY_LINES_PER_WARP
#define SHMEM_PADDING 8
#define AB_SHMEM_STRIDE 40 // CHUNK_K * WMMA_K + SHMEM_PADDING
#define C_SHMEM_STRIDE 264 // BLOCK_COLS + SHMEM_PADDING
#define C_SHMEM_OFFSET 64