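Is it because, with two back-to-back GEMMs, the registers have to be converted from the C layout to the A layout, which is not easy to do?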

Why is only N=16 supported? (cuda_back2back_hgemm, CLOSED)

abangdd commented on July 3, 2024

Why is only N=16 supported?

from cuda_back2back_hgemm.

Comments (2)

Bruce-Lee-LY commented on July 3, 2024

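The main reason is that computing each tile of the D matrix requires computing the corresponding row of tiles of the P matrix, so both the memory traffic and the amount of computation grow linearly with N. In other words, when N is large, B2B HGEMM does not necessarily have an advantage.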
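To make the scaling argument concrete, here is a rough cost model. It is only a sketch: the shape convention (P = A·B with A of size M×K and B of size K×N, then D = P·C with C of size N×L) and the tile sizes `tile_m` and `tile_l` are assumptions made for illustration and are not taken from the repository.

```python
def unfused_flops(M, N, K, L):
    """Two separate GEMMs: P = A @ B (M x N), then D = P @ C (M x L)."""
    return 2 * M * N * K + 2 * M * N * L


def fused_tile_flops(N, K, tile_m, tile_l):
    """FLOPs to produce one tile_m x tile_l tile of D in a fused kernel.

    Assumption: the block rebuilds the whole tile_m x N strip of P itself and
    then multiplies it by an N x tile_l slice of C. Both terms are linear in N.
    """
    strip_of_p = 2 * tile_m * N * K
    tile_of_d = 2 * tile_m * N * tile_l
    return strip_of_p + tile_of_d


def fused_tile_loads(N, K, tile_m, tile_l):
    """Elements read per D tile: an A strip, all of B, and a slice of C.

    Only the B and C terms depend on N; they grow linearly with it.
    """
    return tile_m * K + K * N + N * tile_l


def fused_flops(M, N, K, L, tile_m, tile_l):
    """Total FLOPs when every D tile recomputes its own strip of P."""
    num_tiles = (M // tile_m) * (L // tile_l)
    return num_tiles * fused_tile_flops(N, K, tile_m, tile_l)


if __name__ == "__main__":
    # Hypothetical problem size, chosen only to show the trend.
    M, K, L, tile_m, tile_l = 1024, 128, 1024, 128, 128
    for N in (16, 64, 256):
        ratio = fused_flops(M, N, K, L, tile_m, tile_l) / unfused_flops(M, N, K, L)
        print(N, fused_tile_flops(N, K, tile_m, tile_l), ratio)
```

Under these assumptions the per-tile work is proportional to N, and the first GEMM is recomputed once per column of D tiles (a factor of `L // tile_l`), which is the overhead that fusion has to win back by keeping P out of global memory.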
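Of course, converting the registers directly from the C layout to the A layout requires some computational tricks; with a shared memory (smem) based approach it is much simpler.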
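As an illustration of what the register conversion involves, the sketch below simulates in plain Python the per-lane element coordinates that the PTX ISA documents for `mma.sync.aligned.m16n8k16` with `.f16` operands. It checks that the accumulator (C) fragments of two adjacent 16×8 tiles land on the same coordinates as one 16×16 A fragment. This is not the repository's code; the function names are hypothetical, and it assumes the accumulators are already half precision (or are converted and packed into half pairs first).

```python
def c_fragment_coords(lane):
    """(row, col) of the 4 accumulator elements c0..c3 held by a lane (m16n8 tile)."""
    group_id, tid_in_group = lane // 4, lane % 4
    return [(group_id + 8 * (i // 2), tid_in_group * 2 + (i & 1)) for i in range(4)]


def a_fragment_coords(lane):
    """(row, col) of the 8 half elements a0..a7 held by a lane (m16 x k16 tile)."""
    group_id, tid_in_group = lane // 4, lane % 4
    coords = []
    for i in range(8):
        row = group_id + (8 if (i // 2) % 2 == 1 else 0)   # a2, a3, a6, a7 sit 8 rows down
        col = tid_in_group * 2 + (i & 1) + (8 if i >= 4 else 0)  # a4..a7 cover columns 8..15
        coords.append((row, col))
    return coords


def two_c_tiles_as_a(lane):
    """Concatenate the C fragments of two adjacent 16x8 tiles (columns 0-7, then 8-15)."""
    left = c_fragment_coords(lane)
    right = [(row, col + 8) for row, col in c_fragment_coords(lane)]
    return left + right


# True when every lane already owns exactly the elements the A operand expects.
layouts_match = all(two_c_tiles_as_a(lane) == a_fragment_coords(lane) for lane in range(32))
print(layouts_match)
```

For this particular instruction shape the coordinates line up, so no cross-lane shuffle is needed and the remaining work is type conversion and packing. Other shapes or data types may not line up this way; in those cases writing P to shared memory and reloading it in the A layout is the simpler route.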


abangdd commented on July 3, 2024

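I tested your flash attention inference; with head=32 and dim=128 the efficiency is indeed not ideal, and it is slower than two separate GEMMs.
Of course, flash attention also performs a softmax, so if the softmax IO were absorbed between the two GEMMs, the efficiency might end up close to that of flash attention v2.
The advantage of flash attention v2 may lie only in saving GPU memory.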
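For a sense of scale on the memory point, here is a back-of-the-envelope sketch of the intermediate score matrix that an unfused two-GEMM attention has to materialize. Only head=32 comes from the comment above; the batch size and sequence lengths are hypothetical, and the count of four passes over the matrix assumes a naive GEMM, softmax, GEMM pipeline with separate kernels.

```python
def score_matrix_bytes(batch, heads, seq_q, seq_k, bytes_per_elem=2):
    """Size of the materialized Q @ K^T matrix in half precision."""
    return batch * heads * seq_q * seq_k * bytes_per_elem


def unfused_score_io_bytes(batch, heads, seq_q, seq_k, passes=4):
    """Global-memory traffic spent on the score matrix alone.

    Assumption: it is written by GEMM 1, read and rewritten by softmax, and
    read again by GEMM 2, i.e. four passes over the matrix.
    """
    return passes * score_matrix_bytes(batch, heads, seq_q, seq_k)


if __name__ == "__main__":
    # Hypothetical prefill shape: batch 1, 32 heads, 4096 x 4096 scores.
    size = score_matrix_bytes(1, 32, 4096, 4096)
    print(size / 2**30, "GiB materialized")
    print(unfused_score_io_bytes(1, 32, 4096, 4096) / 2**30, "GiB of score-matrix traffic")
```

A fused kernel avoids allocating this buffer at all, and both the buffer and the traffic grow with the product of the two sequence lengths.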
