
Comments (9)

definelicht commented on September 13, 2024

The ideal solution is increasing your overall tile size, but you could also try to shift some of your tile sizes from N to M. This is not ideal for communication volume, but it can keep the same number of inner tiles, while decreasing the number of columns that must be read from A per tile. For example, you could try -DMM_MEMORY_TILE_SIZE_N=256 -DMM_MEMORY_TILE_SIZE_M=1024?
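A rough way to see this trade-off is to count the off-chip traffic for both operands. The sketch below is illustrative arithmetic, not code from gemm_hls; the names and the 4096-sized matrices are assumptions.

```cpp
#include <cstdint>

// Illustrative traffic model (not from gemm_hls): for C = A (NxK) * B (KxM)
// with output-stationary memory tiles, A is re-read once per tile in M,
// and B is re-read once per tile in N.
struct TileConfig {
  int64_t tile_n, tile_m;  // memory tile sizes in N and M
};

// Total elements of A loaded from memory.
int64_t ATraffic(int64_t n, int64_t k, int64_t m, TileConfig c) {
  return (m / c.tile_m) * n * k;
}

// Total elements of B loaded from memory.
int64_t BTraffic(int64_t n, int64_t k, int64_t m, TileConfig c) {
  return (n / c.tile_n) * k * m;
}
```

For a hypothetical 4096x4096x4096 multiplication, shifting from 512x512 tiles to 256x1024 tiles halves the A traffic and halves the columns of A read per tile (256 instead of 512), but doubles the B traffic, which is the communication-volume cost mentioned above.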

from gemm_hls.

charliechou1001 commented on September 13, 2024

Thanks. The problem for me is that there are three GEMM computation tasks, the conventional GEMM accelerator size is 512, and these three GEMM computations have different transpose formats.

My understanding of the transposeA implementation is that it uses parallel aSplit buffers to read columns of A, and then ConvertWidthA sequentially feeds the elements of the outer tile of A. The feed time of an outer tile of A is MM_MEMORY_TILE_SIZE_N cycles, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN*kInnerTileM). So the feed time should be no more than the computation time, is that understanding right?

If so, will it work if I modify the transpose function to increase the parallelism of the data feeding for tile A? If that cuts the feed time in half, can I then double the number of PEs or the parallelism factor of B?
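If that reading is right, the constraint can be written down directly. The following is a hypothetical sketch of the condition under discussion, with a feed_parallelism knob standing in for the proposed faster feeder; the names are illustrative, not from gemm_hls.

```cpp
// Hypothetical check of the constraint discussed above: feeding one outer
// tile of A takes memory_tile_n / feed_parallelism cycles (one element per
// cycle at parallelism 1), and this must not exceed the
// inner_tile_n * inner_tile_m cycles the PEs spend per inner tile.
bool FeedKeepsUp(int inner_tile_n, int inner_tile_m, int memory_tile_n,
                 int feed_parallelism) {
  const int feed_cycles = memory_tile_n / feed_parallelism;
  const int compute_cycles = inner_tile_n * inner_tile_m;
  return feed_cycles <= compute_cycles;
}
```

For example, with memory_tile_n = 512 and 16x32 inner tiles, feeding exactly keeps up; halving the inner tile product (by doubling the PE count or the B parallelism) breaks the constraint unless the feeder is doubled as well.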


definelicht commented on September 13, 2024

What sizes do your matrices have? Even if you chain multiple multiplications of different sizes, it should still be possible for them to have different tile sizes. The problem, of course, is that if your matrix is only 512x512, you cannot make the tile bigger without losing computational efficiency.

The feed time of an outer tile of A is MM_MEMORY_TILE_SIZE_N cycles, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN*kInnerTileM). So the feed time should be no more than the computation time, is that understanding right?

Yes, correct

If so, will it work if I modify the transpose function to increase the parallelism of the data feeding for tile A? If that cuts the feed time in half, can I then double the number of PEs or the parallelism factor of B?

Yes, this could work. It's not super simple, since you would need to vectorize the buffering of A between and inside all the PEs, but it's definitely doable (the initial version actually had this feature, but I removed it for simplicity since I never used it).

You could give it a try :-)


charliechou1001 commented on September 13, 2024

...since you would need to vectorize the buffering of A between and inside all the PEs....

Would you mind explaining a bit more why we would also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

Or is there a simpler way, for example, where I modify only the data feeder of A to double the feeding speed of the outer tile of A? In that case, no extra modification of the PEs would be needed, and the constraint would become kInnerTileN*kInnerTileM >= MM_MEMORY_TILE_SIZE_N/2.


definelicht commented on September 13, 2024

Would you mind explaining a bit more why we would also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

If you want to speed up buffering A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs have the width of a single element, but would need to be vectors instead.
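As a rough illustration of what "these FIFOs would need to be vectors" means, here is a minimal sketch, assuming a plain std::queue in place of the hardware FIFOs; kFeedWidth, AVec, and ForwardA are all hypothetical names, not the gemm_hls API, and the per-PE distribution logic is deliberately simplified.

```cpp
#include <array>
#include <queue>

constexpr int kFeedWidth = 2;  // hypothetical: elements of A per cycle
using AVec = std::array<float, kFeedWidth>;  // the widened FIFO payload

// Each PE pops one vector per cycle, stores its elements into its local
// buffer of A, and forwards the vector to the next PE, so the systolic
// chain buffers A kFeedWidth times faster than with scalar FIFOs.
void ForwardA(std::queue<AVec> &from_prev, std::queue<AVec> &to_next,
              float *local_buffer, int &write_index) {
  const AVec v = from_prev.front();
  from_prev.pop();
  for (int i = 0; i < kFeedWidth; ++i) {
    local_buffer[write_index++] = v[i];
  }
  to_next.push(v);  // pass the same vector on to the next PE
}
```

In the real design each PE keeps only the values destined for it rather than everything, which is part of what makes the change non-trivial.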

Or is there a simpler way, for example, where I modify only the data feeder of A to double the feeding speed of the outer tile of A?

I'm not sure how you could do this without widening the data path that transports the elements of A?


charliechou1001 commented on September 13, 2024

If you want to speed up buffering A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs have the width of a single element, but would need to be vectors instead.

I got it. That sounds like setting the granularity of the PEs to more than 1, so the performance is doubled. Would you mind sharing that code?

I'm not sure how you could do this without widening the data path that transports the elements of A?

Since I want to keep that constraint satisfied: if I double the number of PEs or double the parallelism of B, then kInnerTileN*kInnerTileM is halved. To still satisfy the constraint, I also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism in B, but it uses more resources than the method you proposed.


definelicht commented on September 13, 2024

The code does not exist anymore. The idea was that this should be supported when I wrote the transpose code, but in the end I decided that I didn't need it :-)


charliechou1001 commented on September 13, 2024

Thanks. And what do you think of the other method? Does it work?

Since I want to keep that constraint satisfied: if I double the number of PEs or double the parallelism of B, then kInnerTileN*kInnerTileM is halved. To still satisfy the constraint, I also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism in B, but it uses more resources than the method you proposed.


definelicht commented on September 13, 2024

It doesn't matter whether your parallelism is in A or B; it only matters how many columns of A must be loaded for every outer tile. You either need to increase the tile size or allow multiple elements of A to be buffered per cycle.
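As a worked example of this last point: the sketch below models the constraint with assumed sizes, not the gemm_hls defaults, and all names are hypothetical. It shows that where the parallelism sits does not matter, only the total.

```cpp
// Cycles the PEs spend per inner tile, given memory tile sizes and the
// total compute parallelism in N (PE count) and M (vector width on B).
// Doubling either parallelism halves the compute time, regardless of which.
int ComputeCyclesPerInnerTile(int tile_n, int tile_m, int par_n, int par_m) {
  return (tile_n / par_n) * (tile_m / par_m);
}

// Cycles to feed one outer tile of A: one column of A per cycle,
// independent of where the compute parallelism sits.
int AFeedCyclesPerTile(int tile_n) { return tile_n; }
```

With 512x512 tiles and 16x16 parallelism, the compute time is 1024 cycles against 512 feed cycles, so feeding keeps up; doubling the parallelism in either N or M leaves exactly 512 cycles; doubling both drops it to 256, at which point you must either grow the tile in M (restoring 512 compute cycles without adding feed cycles) or buffer two elements of A per cycle.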

