bruce-lee-ly / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
License: MIT License
CHUNK_K: I don't quite understand what CHUNK_K means. Where does the 32 come from?
Hi, I'd like to ask about solutions for bank conflicts when loading from shared memory to registers, especially with the permute method.
For a KxN matrix B with K=16, N=16, the shared-memory layout is as follows (each data_x is 2 bytes):
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
...
------------------------------------------------------------ row 8
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
data0, data1, ..., data7 | data8, ..., data15
------------------------------------------------------------ row 16
I load the data with
ldmatrix.sync.aligned.x2.trans.m8n8.shared.b16 {%0, %1}, [%2];
with shape (16, 8), and I get a 2-way bank conflict.
How can this bank conflict be resolved?
I would be grateful for any discussion.
I'm on Windows 11 + CUDA 11.8 + sm_86. I tested the kernel in mma_naive.cu and the results are incorrect. I'm still reviewing locally to confirm; when you have time, could you check whether there is indeed a problem?
First of all, I would like to express my deep gratitude for providing the excellent educational materials. The articles posted on Medium have also been very helpful in my studies.
I have a question. I am using an A100 GPU. I have noticed that on the A6000 you demonstrate, certain kernels reach throughput almost matching cuBLAS.
When I tested on my A100, it showed the following performance (I fixed M=512 instead of using M=N=K).
After some thought, I suspect I need to change the tile sizes, since I did not keep the dimensions equal. Do you think I am heading in the right direction?
If so, could you explain how you decided on the current tile sizes and the other execution-config settings? (Or did you simply search over all of them?)
Thank you once again.
Hello. While reading the mma_permuted.cu source, I noticed that the permute scheme you use seems to differ from the XOR scheme described in the talk "DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100". You appear to use a cyclic right shift per stage to avoid bank conflicts during ldmatrix. Is that understanding correct? I'd appreciate it if you could find time to answer. Thanks.
Why is the for loop on this line split into two groups by CHUNK_K?
cuda_hgemm/src/wmma/wmma_async_stage2.cu
Line 208 in 10a8a84
#pragma unroll
for (size_t i = 0; i < A_smem_iters / CHUNK_K; ++i) { <- here
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
}
...
smem_load_idx = (smem_load_idx + 1) % K_STAGE;
smem_load_off = smem_load_idx * smem_stage_off;
#pragma unroll
for (size_t i = (CHUNK_K - 1) * A_smem_iters / CHUNK_K; i < A_smem_iters; ++i) { <- here
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
}
There doesn't seem to be any special logic between these two loops? Thanks 🤣
m_A = new Matrix(m_M, m_K, "Matrix A");
m_B = new Matrix(m_K, m_N, "Matrix B");
Although in src/common/tester.h A is defined with shape (M, K) and B with (K, N), the actual matrix multiply treats them as A(M, K) and B(N, K). If A is (M, K) and B is (K, N), can mma still be used for acceleration? cp.async and ldmatrix can only handle contiguous 128 bits, and with B(K, N) it is hard to place the K-direction elements into contiguous threads. This is something that has puzzled me while reading mma tutorials recently; I hope you can help answer it.
I noticed that B here uses the T (transposed) layout. Why is that? I tried the N layout instead, as follows:
with padding of 16, and the B registers in row-major order. Testing this approach against wmma_async_stage3.cu on an A100 shows a 10% performance loss. Why is that? Is there a reason behind it 🤣
Thanks for this wonderful repo.
I have a question about the async copies:
uint32_t A_smem_lane_addr =
__cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;
CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);
Does this mean that every lane (thread) has a different pointer to the shared memory and a different pointer to the global memory?
The way I understand the async copies, the src and dst pointers must be the same for every thread in the thread block. See the docs.
Thanks for sharing the code! If I change block_rows and block_cols in wmma_async_pg2s.cu to 256 and 128, I get an error. I can't see what the problem is...
./hgemm -M=4096 -N=4096 -K=1024 -profiling_iterations=1 -warmup_iterations=1 -enable_check=true
[HGEMM 2023-09-25 19:00:25 1022624:1022624 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Pg2s -----------------
[HGEMM 2023-09-25 19:00:26 1022624:1022624 wmma_async_pg2s.cu:274 initWmmaAsyncPg2s] shmem_max_size: 66 KBytes (67584 Bytes)
[HGEMM 2023-09-25 19:00:30 1022624:1022624 cuda_timer.h:39 end] CUDA Runtime API error = 0700 "cudaErrorIllegalAddress", runtime version: 12000, driver version: 12020
#define BLOCK_ROWS 128
#define BLOCK_COLS 256
#define WARP_ROWS 64
#define WARP_COLS 64
#define BLOCK_ROW_WARPS 4 // BLOCK_COLS / WARP_COLS
#define BLOCK_COL_WARPS 2 // BLOCK_ROWS / WARP_ROWS
#define BLOCK_ROW_TILES 16 // BLOCK_COLS / WMMA_N
#define BLOCK_COL_TILES 8 // BLOCK_ROWS / WMMA_M
#define WARP_ROW_TILES 4 // WARP_COLS / WMMA_N
#define WARP_COL_TILES 4 // WARP_ROWS / WMMA_M
#define WARP_SIZE 32
#define WARPS_PER_BLOCK 8 // BLOCK_ROW_WARPS * BLOCK_COL_WARPS
#define THREADS_PER_BLOCK 256 // WARP_SIZE * WARPS_PER_BLOCK
#define CHUNK_K 2 // 32 / WMMA_K
#define THREAD_COPY_BYTES 16
#define CHUNK_LINE_BYTES 64 // CHUNK_K * WMMA_K * sizeof(half)
#define CHUNK_COPY_LINES_PER_WARP 8 // WARP_SIZE * THREAD_COPY_BYTES / CHUNK_LINE_BYTES
#define CHUNK_COPY_LINE_LANES 4 // WARP_SIZE / CHUNK_COPY_LINES_PER_WARP
#define SHMEM_PADDING 8
#define AB_SHMEM_STRIDE 40 // CHUNK_K * WMMA_K + SHMEM_PADDING
#define C_SHMEM_STRIDE 264 // BLOCK_COLS + SHMEM_PADDING
#define C_SHMEM_OFFSET 64