
Comments (4)

te42kyfo commented on September 28, 2024

I have written cuda-L2-cache specifically to benchmark the L2 cache bandwidth only. It simulates a scenario where data is read repeatedly by thread blocks on SMs all over the chip. The variable blockRun is set to the total number of simultaneously running thread blocks. Each thread reads N pieces of data in a grid-stride loop, and each piece of data is read by 10000 different thread blocks (see line 57, int blockCount = blockRun * 10000;). By adjusting N, the total data volume (N * blockSize * blockRun * 8 bytes) can be adjusted, which decides whether the data fits in cache.

Think of it as a grid-stride loop over some data volume by as many threads as can run simultaneously (a 'wave'), except that afterwards, 10000 more waves do the same thing, each time with a different distribution of thread blocks. This benchmark was written for this paper (see Figure 2). Data is deliberately read repeatedly by different thread blocks because of the A100's segmented L2 cache: if the same data were repeatedly read by the same thread block, the measurement would show a higher L2 cache capacity, because the data would not have to be duplicated. With this scheme, the data does need to be duplicated, because the reads come from SMs attached to different L2 cache segments.
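
For illustration, here is a minimal sketch of that access pattern. This is not the actual cuda-L2-cache source: the wave remapping formula, kernel name, and parameter values are made up for the example, and timing/verification is omitted.

```cuda
// Sketch: many "waves" of thread blocks re-read the same data volume,
// with a different block-to-data mapping in each wave.
#include <cuda_runtime.h>

__global__ void l2ReadKernel(const double* __restrict__ data,
                             double* __restrict__ sink,
                             int N, int blockRun) {
  int wave = blockIdx.x / blockRun;  // which of the 10000 waves this block is in
  int slot = blockIdx.x % blockRun;  // position inside the wave
  // Shift the block-to-data mapping every wave, so the same data is read
  // from (potentially) different SMs / L2 segments each time. Hypothetical
  // remapping; the real benchmark's scheme may differ.
  int dataBlk = (slot + wave) % blockRun;

  double sum = 0.0;
  for (int i = 0; i < N; i++)  // each thread reads N values
    sum += data[(size_t)(dataBlk * N + i) * blockDim.x + threadIdx.x];

  if (sum == 123.456)  // never true for zeroed data; keeps the loads alive
    sink[threadIdx.x] = sum;
}

int main() {
  const int blockSize = 256, blockRun = 200, N = 64;      // example values
  const int blockCount = blockRun * 10000;                // cf. line 57 in the benchmark
  const size_t elems = (size_t)N * blockSize * blockRun;  // volume = elems * 8 bytes

  double *data, *sink;
  cudaMalloc(&data, elems * sizeof(double));
  cudaMalloc(&sink, blockSize * sizeof(double));
  cudaMemset(data, 0, elems * sizeof(double));

  l2ReadKernel<<<blockCount, blockSize>>>(data, sink, N, blockRun);
  cudaDeviceSynchronize();

  cudaFree(data);
  cudaFree(sink);
  return 0;
}
```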

The gpu-cache benchmark is a general cache benchmark for both the L1 and L2 cache. Because each thread block reads the same data as all the others, the data never falls out of the L2 cache. Even if the data volume exceeds the L2 cache capacity, there would still be reuse in the L2 cache by different thread blocks.
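
For contrast, a similarly made-up sketch of the gpu-cache style access, where every block walks the same buffer:

```cuda
// Sketch of the gpu-cache style pattern: every thread block reads the
// *same* buffer, so L2 sees maximal reuse across blocks. Illustrative
// only, not the actual gpu-cache kernel.
__global__ void sharedBufferRead(const double* __restrict__ data,
                                 double* __restrict__ sink, int n) {
  double sum = 0.0;
  for (int i = threadIdx.x; i < n; i += blockDim.x)  // identical indices in every block
    sum += data[i];
  if (sum == 123.456)  // never true for zeroed data; keeps the loads alive
    sink[blockIdx.x] = sum;
}
```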


guohaoqiang commented on September 28, 2024


When I run gpu-l2-cache on an H100 PCIe, it produces a weird bandwidth column (the L2 bandwidth should be around 7500 GB/s). Do I need to change the code?

[Screenshot, 2023-11-26: gpu-l2-cache output on H100 PCIe showing the bandwidth column in question]


te42kyfo commented on September 28, 2024

Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers at the beginning:
Initially, for the first few dataset sizes, there is still some coverage by the 256 kB L1 cache. For example, the 2048 kB data point consists of 8 blocks of 256 kB, so there is a 1-in-8 chance that a thread block runs on an SM where the previous, just-exited thread block had worked on the same block of data, which then still resides in the L1 cache. The bandwidth eventually settles at around 6700 GB/s, which is the pure L2 bandwidth.
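
Spelled out (assuming thread blocks are scheduled onto SMs roughly uniformly at random): for a total dataset of size V split into 256 kB blocks, the chance that a freshly scheduled thread block finds its data already in the L1 of its SM is roughly

$$
P(\text{L1 reuse}) \approx \frac{256\ \text{kB}}{V}, \qquad \text{e.g.} \quad \frac{256\ \text{kB}}{2048\ \text{kB}} = \frac{1}{8},
$$

which is why the measured bandwidth is inflated for small V and converges to the pure L2 number as V grows.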

For the data in the plot and the included data, I changed the per-thread-block dataset from 256 kB to 512 kB precisely for this reason. This reduces the effect but doesn't eliminate it, so you still should not use the first few values. Instead, use the values right before the dataset drops out of the L2 cache into memory. With 512 kB per thread block, I get 7 TB/s.


te42kyfo commented on September 28, 2024

The parameters used (256 kB) had been fine before, but don't work as well with the increased L1 cache of the H100. The cache-line (CL) replacement strategy might also have changed.

