

cuda_hgemm's Issues

Question: shared-memory bank conflicts

Hi, I'd like to ask how to resolve bank conflicts when loading from shared memory to registers, especially with the permute method.
For a KxN matrix B with K=16, N=16,
the shared-memory layout is as follows
(each data_x is 2 bytes):

data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
...
------------------------------------------------------------ 8
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
data0,data1,data2,...,data8,| ... data15
------------------------------------------------------------ 16

I load the data with ldmatrix.sync.aligned.x2.trans.m8n8.shared.b16 {%0, %1}, [%2];
with shape (16, 8), and I get 2-way bank conflicts.
How can these bank conflicts be resolved?

Any discussion would be much appreciated.

mma_naive produces incorrect results

On Windows 11 + CUDA 11.8 + sm_86, I tested the kernel in mma_naive.cu and the results are incorrect. I am still reviewing it locally to confirm; when you have time, could you check whether there is indeed a problem?

Question about the tiling size

First of all, I would like to express my deep gratitude for providing the excellent educational materials. The articles posted on Medium have also been very helpful in my studies.

I have a question. I am using an A100 GPU. I noticed that on the A6000 you demonstrate, certain kernels reach throughput nearly matching cuBLAS.
When I tested on my A100, it showed the following performance (I fixed M=512 instead of using M=N=K).
On reflection, I think I need to change the tiling size because I did not keep the dimensions the same. Do you think I am heading in the right direction?

If so, could you explain how you decided on the current tiling size and other "execution config" settings? (Or did you just search for all these settings?)

Thank you once again.

(screenshot: A100 benchmark results, 2024-01-03)

About the permute implementation

Hello, while reading the mma_permuted.cu source, I noticed that the permute scheme you use seems to differ from the XOR approach described in the document DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100. You appear to avoid bank conflicts during ldmatrix by cyclically right-shifting across different stages. Is my understanding correct? I would appreciate it if you could find time to answer. Thanks.

A code detail in `wmma_async_stage2.cu`

Why is this for loop split into two groups by CHUNK_K?

for (size_t i = 0; i < A_smem_iters / CHUNK_K; ++i) {

#pragma unroll
        for (size_t i = 0; i < A_smem_iters / CHUNK_K; ++i) { // <- here (first group)
            uint32_t A_smem_lane_addr =
                __cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;

            CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);

            A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
            A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
        }

        ...

        smem_load_idx = (smem_load_idx + 1) % K_STAGE;
        smem_load_off = smem_load_idx * smem_stage_off;

#pragma unroll
        for (size_t i = (CHUNK_K - 1) * A_smem_iters / CHUNK_K; i < A_smem_iters; ++i) { // <- here (second group)
            uint32_t A_smem_lane_addr =
                __cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;

            CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);

            A_lane_ptr = (int4 *)((half *)A_lane_ptr + CHUNK_COPY_LINES_PER_WARP * K);
            A_smem_idx += CHUNK_COPY_LINES_PER_WARP;
        }

There doesn't seem to be any special logic between the two. Am I missing something? Thanks 🤣

About the layout of matrices A and B

 m_A = new Matrix(m_M, m_K, "Matrix A");
 m_B = new Matrix(m_K, m_N, "Matrix B");

Although in src/common/tester.h A is declared with shape (M, K) and B with (K, N), the actual matrix multiplication is implemented as if A were (M, K) and B were (N, K). If A is (M, K) and B is (K, N), can mma still be used for acceleration? cp.async and ldmatrix can only handle contiguous 128-bit chunks, and with B stored as (K, N) it is hard to get elements along the K dimension into consecutive threads. This has puzzled me while reading mma tutorials recently; I hope you can help clarify.

Why is matrix B transposed?

I noticed that B here uses a T (transposed) layout. Why is that? I used an N (non-transposed) layout instead, as follows:

(image)

with a padding of 16, and B's registers in row-major order. Testing this approach against the wmma_async_stage3.cu code on an A100 shows about a 10% performance loss. Why is that? Is there a reason behind it? 🤣

Cooperative Async Copies

Thanks for this wonderful repo.

I have a question about the async copies:

uint32_t A_smem_lane_addr =
   __cvta_generic_to_shared(&smem[A_smem_idx][0]) + (lane_id % CHUNK_COPY_LINE_LANES) * THREAD_COPY_BYTES;

CP_ASYNC_CG(A_smem_lane_addr, A_lane_ptr, THREAD_COPY_BYTES);

Does this mean that every lane (thread) has a different pointer to the shared memory and a different pointer to the global memory?

The way I understand the async copies, the src and dst pointers must be the same for every thread in the thread block. See the docs.

Change to block of 128 by 256

Thanks for sharing the code! If I change block_rows and block_cols in wmma_async_pg2s.cu to 256 and 128, I get an error. I can't see what the problem is...

./hgemm -M=4096 -N=4096 -K=1024 -profiling_iterations=1 -warmup_iterations=1 -enable_check=true
[HGEMM 2023-09-25 19:00:25 1022624:1022624 tester.h:72 evaluate] ----------------- Evaluating Wmma-Async-Pg2s -----------------
[HGEMM 2023-09-25 19:00:26 1022624:1022624 wmma_async_pg2s.cu:274 initWmmaAsyncPg2s] shmem_max_size: 66 KBytes (67584 Bytes)
[HGEMM 2023-09-25 19:00:30 1022624:1022624 cuda_timer.h:39 end] CUDA Runtime API error = 0700 "cudaErrorIllegalAddress", runtime version: 12000, driver version: 12020
#define BLOCK_ROWS 128
#define BLOCK_COLS 256

#define WARP_ROWS 64
#define WARP_COLS 64

#define BLOCK_ROW_WARPS 4  // BLOCK_COLS / WARP_COLS
#define BLOCK_COL_WARPS 2  // BLOCK_ROWS / WARP_ROWS

#define BLOCK_ROW_TILES 16   // BLOCK_COLS / WMMA_N
#define BLOCK_COL_TILES 8  // BLOCK_ROWS / WMMA_M

#define WARP_ROW_TILES 4  // WARP_COLS / WMMA_N
#define WARP_COL_TILES 4  // WARP_ROWS / WMMA_M

#define WARP_SIZE 32
#define WARPS_PER_BLOCK 8      // BLOCK_ROW_WARPS * BLOCK_COL_WARPS
#define THREADS_PER_BLOCK 256  // WARP_SIZE * WARPS_PER_BLOCK

#define CHUNK_K 2  // 32 / WMMA_K

#define THREAD_COPY_BYTES 16

#define CHUNK_LINE_BYTES 64          // CHUNK_K * WMMA_K * sizeof(half)
#define CHUNK_COPY_LINES_PER_WARP 8  // WARP_SIZE * THREAD_COPY_BYTES / CHUNK_LINE_BYTES
#define CHUNK_COPY_LINE_LANES 4      // WARP_SIZE / CHUNK_COPY_LINES_PER_WARP

#define SHMEM_PADDING 8

#define AB_SHMEM_STRIDE 40  // CHUNK_K * WMMA_K + SHMEM_PADDING

#define C_SHMEM_STRIDE 264  // BLOCK_COLS + SHMEM_PADDING
#define C_SHMEM_OFFSET 64
