Comments (4)
I have written cuda-L2-cache specifically to benchmark the L2 cache bandwidth only. It simulates a scenario where data is read repeatedly by thread blocks on SMs all over the chip. The variable blockRun is set to the total number of simultaneously running thread blocks. Each thread reads N pieces of data in a grid-stride loop, and each piece of data is read by 10000 different thread blocks (see line 57: int blockCount = blockRun * 10000;). By adjusting N, the total data volume (N * blockSize * blockRun * 8 bytes) can be adjusted, which decides whether the data fits in cache.
Think of it as a grid-stride loop over some data volume by as many threads as can run simultaneously (a 'wave'), after which 10000 more waves do the same thing, each time with a different distribution of thread blocks. This benchmark was written for this paper (see Figure 2). Data is deliberately read repeatedly by different thread blocks because of A100's segmented L2 cache: if the same thread block read its data repeatedly, the benchmark would show a higher apparent L2 cache capacity, because the data would not need to be duplicated. With this scheme, the data must be duplicated, because the reads come from SMs attached to different L2 cache segments.
The gpu-cache benchmark is a general cache benchmark for both the L1 and L2 caches. Because each thread block reads the same data as all the others, the data never falls out of the L2 cache. Even if the data volume exceeds the L2 cache capacity, there is reuse in the L2 cache by different thread blocks.
from gpu-benches.
When I run gpu-l2-cache on an H100 PCIe, it reports a weird bw column (the L2 bandwidth should be around 7500 GB/s). Do I need to change the code?
Your results look absolutely in line with what I had measured myself before. Regarding the very high numbers at the beginning:
Initially, for the first few dataset sizes, there is still some coverage by the 256 kB L1 cache. For example, the 2048 kB data point consists of 8 blocks of 256 kB, so there is a 1-in-8 chance that a thread block runs on an SM where the previous, just-exited thread block had worked on the same block of data, which then still resides in the L1 cache. The curve eventually settles to around 6700 GB/s, which is the pure L2 bandwidth.
For the data in the plot and the included data, I have changed the per-thread-block dataset from 256 kB to 512 kB for exactly this reason. This reduces the effect but doesn't eliminate it, so you still should not use the first few values. Instead, use the values right before the dataset drops out of the L2 cache into memory. With 512 kB per thread block, I get 7 TB/s.
from gpu-benches.
The parameters used before (256 kB per thread block) had been fine, but don't work as well with the increased L1 cache in the H100. The cache-line replacement strategy might also have changed.
from gpu-benches.