
Comments (9)

definelicht commented on September 13, 2024

The ideal solution is increasing your overall tile size, but you could also try to shift some of your tile sizes from N to M. This is not ideal for communication volume, but it can keep the same number of inner tiles, while decreasing the number of columns that must be read from A per tile. For example, you could try -DMM_MEMORY_TILE_SIZE_N=256 -DMM_MEMORY_TILE_SIZE_M=1024?
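A rough way to see this trade-off is to count the off-chip traffic for both operands. The sketch below is illustrative arithmetic, not code from gemm_hls; the names and the 4096-sized matrices are assumptions.

```cpp
#include <cstdint>

// Illustrative traffic model (not from gemm_hls): for C = A (NxK) * B (KxM)
// with output-stationary memory tiles, A is re-read once per tile in M,
// and B is re-read once per tile in N.
struct TileConfig {
  int64_t tile_n, tile_m;  // memory tile sizes in N and M
};

// Total elements of A loaded from memory.
int64_t ATraffic(int64_t n, int64_t k, int64_t m, TileConfig c) {
  return (m / c.tile_m) * n * k;
}

// Total elements of B loaded from memory.
int64_t BTraffic(int64_t n, int64_t k, int64_t m, TileConfig c) {
  return (n / c.tile_n) * k * m;
}
```

For a hypothetical 4096x4096x4096 multiplication, shifting from 512x512 tiles to 256x1024 tiles halves the A traffic and halves the columns of A read per tile (256 instead of 512), but doubles the B traffic, which is the communication-volume cost mentioned above.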

from gemm_hls.

charliechou1001 commented on September 13, 2024

Thanks. The problem for me is that there are three GEMM computation tasks, the conventional GEMM accelerator size is 512, and these three GEMM computations have different transpose formats.

My understanding of the transposeA implementation is that it uses parallel aSplit buffers to read columns of A, and then ConvertWidthA sequentially feeds the elements of the outer tile of A. The feed time of an outer tile of A is MM_MEMORY_TILE_SIZE_N cycles, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN*kInnerTileM). So the feed time should be no more than the computation time, is that understanding right?

If so, will it work if I modify the transpose function to increase the parallelism of the data feeding for tile A? If that cuts the feed time in half, can I then double the number of PEs or the parallelism factor of B?
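If that reading is right, the constraint can be written down directly. The following is a hypothetical sketch of the condition under discussion, with a feed_parallelism knob standing in for the proposed faster feeder; the names are illustrative, not from gemm_hls.

```cpp
// Hypothetical check of the constraint discussed above: feeding one outer
// tile of A takes memory_tile_n / feed_parallelism cycles (one element per
// cycle at parallelism 1), and this must not exceed the
// inner_tile_n * inner_tile_m cycles the PEs spend per inner tile.
bool FeedKeepsUp(int inner_tile_n, int inner_tile_m, int memory_tile_n,
                 int feed_parallelism) {
  const int feed_cycles = memory_tile_n / feed_parallelism;
  const int compute_cycles = inner_tile_n * inner_tile_m;
  return feed_cycles <= compute_cycles;
}
```

For example, with memory_tile_n = 512 and 16x32 inner tiles, feeding exactly keeps up; halving the inner tile product (by doubling the PE count or the B parallelism) breaks the constraint unless the feeder is doubled as well.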


definelicht commented on September 13, 2024

What sizes do your matrices have? Even if you chain multiple multiplications of different sizes, it should still be possible for them to have different tile sizes. The problem, of course, is that if your matrix is only 512x512, you cannot make the tile bigger without losing computational efficiency.

The feed time of an outer tile of A is MM_MEMORY_TILE_SIZE_N cycles, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN*kInnerTileM). So the feed time should be no more than the computation time, is that understanding right?

Yes, correct

If so, will it work if I modify the transpose function to increase the parallelism of the data feeding for tile A? If that cuts the feed time in half, can I then double the number of PEs or the parallelism factor of B?

Yes, this could work. It's not super simple, since you would need to vectorize the buffering of A between and inside all the PEs, but it's definitely doable (the initial version actually had this feature, but I removed it for simplicity since I never used it).

You could give it a try :-)


charliechou1001 commented on September 13, 2024

...since you would need to vectorize the buffering of A between and inside all the PEs....

Would you mind explaining a bit more why we would also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

Or is there a simpler way, for example, where I modify only the data feeder of A to double the feeding speed of the outer tile of A? In that case, no extra modification of the PEs would be needed, and the constraint would become kInnerTileN*kInnerTileM >= MM_MEMORY_TILE_SIZE_N/2.


definelicht commented on September 13, 2024

Would you mind explaining a bit more why we would also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

If you want to speed up buffering A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs have the width of a single element, but would need to be vectors instead.
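As a rough illustration of what "these FIFOs would need to be vectors" means, here is a minimal sketch, assuming a plain std::queue in place of the hardware FIFOs; kFeedWidth, AVec, and ForwardA are all hypothetical names, not the gemm_hls API, and the per-PE distribution logic is deliberately simplified.

```cpp
#include <array>
#include <queue>

constexpr int kFeedWidth = 2;  // hypothetical: elements of A per cycle
using AVec = std::array<float, kFeedWidth>;  // the widened FIFO payload

// Each PE pops one vector per cycle, stores its elements into its local
// buffer of A, and forwards the vector to the next PE, so the systolic
// chain buffers A kFeedWidth times faster than with scalar FIFOs.
void ForwardA(std::queue<AVec> &from_prev, std::queue<AVec> &to_next,
              float *local_buffer, int &write_index) {
  const AVec v = from_prev.front();
  from_prev.pop();
  for (int i = 0; i < kFeedWidth; ++i) {
    local_buffer[write_index++] = v[i];
  }
  to_next.push(v);  // pass the same vector on to the next PE
}
```

In the real design each PE keeps only the values destined for it rather than everything, which is part of what makes the change non-trivial.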

Or is there a simpler way, for example, where I modify only the data feeder of A to double the feeding speed of the outer tile of A?

I'm not sure how you could do this without widening the data path that transports the elements of A?


charliechou1001 commented on September 13, 2024

If you want to speed up buffering A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs have the width of a single element, but would need to be vectors instead.

I got it. That sounds like setting the granularity of the PEs to more than 1, so the performance is doubled. Would you mind sharing that code?

I'm not sure how you could do this without widening the data path that transports the elements of A?

Since I want to keep that constraint satisfied: if I double the number of PEs or double the parallelism of B, then kInnerTileN*kInnerTileM is halved. To still satisfy the constraint, I also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism in B, but it uses more resources than the method you proposed.


definelicht commented on September 13, 2024

The code does not exist anymore. The idea was that this should be supported when I wrote the transpose code, but in the end I decided that I didn't need it :-)


charliechou1001 commented on September 13, 2024

Thanks. And what do you think of the other method? Does it work?

Since I want to keep that constraint satisfied: if I double the number of PEs or double the parallelism of B, then kInnerTileN*kInnerTileM is halved. To still satisfy the constraint, I also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism in B, but it uses more resources than the method you proposed.


definelicht commented on September 13, 2024

It doesn't matter whether your parallelism is in A or B; it only matters how many columns of A must be loaded for every outer tile. You either need to increase the tile size or allow multiple elements of A to be buffered per cycle.
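As a worked example of this last point: the sketch below models the constraint with assumed sizes, not the gemm_hls defaults, and all names are hypothetical. It shows that where the parallelism sits does not matter, only the total.

```cpp
// Cycles the PEs spend per inner tile, given memory tile sizes and the
// total compute parallelism in N (PE count) and M (vector width on B).
// Doubling either parallelism halves the compute time, regardless of which.
int ComputeCyclesPerInnerTile(int tile_n, int tile_m, int par_n, int par_m) {
  return (tile_n / par_n) * (tile_m / par_m);
}

// Cycles to feed one outer tile of A: one column of A per cycle,
// independent of where the compute parallelism sits.
int AFeedCyclesPerTile(int tile_n) { return tile_n; }
```

With 512x512 tiles and 16x16 parallelism, the compute time is 1024 cycles against 512 feed cycles, so feeding keeps up; doubling the parallelism in either N or M leaves exactly 512 cycles; doubling both drops it to 256, at which point you must either grow the tile in M (restoring 512 compute cycles without adding feed cycles) or buffer two elements of A per cycle.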

