te42kyfo / gpu-benches

collection of benchmarks to measure basic GPU capabilities

License: GNU General Public License v3.0

GPU benchmarks

This is a collection of GPU micro benchmarks. Each test is designed to test a particular scenario or hardware mechanism. Some of the benchmarks have been used to produce data for these papers:

"Analytical performance estimation during code generation on modern GPUs"

"Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs"

Benchmarks that are called gpu-<benchmarkname> are hipifyable! Whereas the default Makefile target builds the CUDA executable cuda-<benchmarkname>, the target make hip-<benchmarkname> uses the hipify-perl tool to create a file main.hip from the main.cu file, and builds it using the hip compiler. The CUDA main files are written so that the hipify tool works without further intervention.

Also have a look at the gpu-metrics functions, which provide a concise way of measuring hardware performance counter metrics of a kernel launch inside the running program.

If any of this is useful, stars and citations are welcome!

gpu-stream

Measures the bandwidth of streaming kernels for varying occupancy. A shared memory allocation serves as a spoiler, so that only two thread blocks can run per SM. Scanning the thread block size from 32 to 1024 scans the occupancy from 3% to 100%.
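
The relation between block size and occupancy can be sketched as follows. This is a minimal illustration, assuming a limit of 2048 resident threads per SM and exactly two resident blocks per SM; the names `MAX_THREADS_PER_SM`, `BLOCKS_PER_SM` and `occupancy_percent` are hypothetical and not taken from the benchmark source.

```python
# Sketch: occupancy as a function of thread block size.
# Assumptions (not from the benchmark source): an SM holds at most
# 2048 resident threads, and the shared memory "spoiler" limits each
# SM to exactly two resident thread blocks.
MAX_THREADS_PER_SM = 2048  # assumed hardware limit
BLOCKS_PER_SM = 2          # stated in the text above


def occupancy_percent(block_size: int) -> float:
    """Resident threads per SM as a percentage of the assumed maximum."""
    return 100.0 * block_size * BLOCKS_PER_SM / MAX_THREADS_PER_SM


for block_size in (32, 256, 512, 1024):
    print(f"{block_size:5d} threads/block -> {occupancy_percent(block_size):5.1f} %")
```

Under these assumptions, the smallest block size of 32 gives about 3% and the largest of 1024 gives 100%, which matches the range quoted above.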

Kernel	Formula	Streams
init	A[i] = c	1 store stream
read	sum = A[i]	1 load stream
scale	A[i] = B[i] * c	1 load stream, 1 store stream
triad	A[i] = B[i] + c * C[i]	2 load streams, 1 store stream
3pt	A[i] = B[i-1] + B[i] + B[i+1]	1 load stream, 1 store stream
5pt	A[i] = B[i-2] + B[i-1] + B[i] + B[i+1] + B[i+2]	1 load stream, 1 store stream
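
One plausible way to turn the stream counts in the table above into a bandwidth figure is sketched below. It assumes 8-byte (double precision) elements and that each stream moves every element exactly once, so the neighbor loads of the stencil kernels are counted as a single load stream. The names `STREAMS`, `bytes_moved` and `bandwidth_gbs` are hypothetical; the benchmark's actual accounting may differ.

```python
# Sketch: data volume and bandwidth from the stream counts in the table.
# (loads, stores) per element, copied from the table above.
STREAMS = {
    "init": (0, 1),
    "read": (1, 0),
    "scale": (1, 1),
    "triad": (2, 1),
    "3pt": (1, 1),
    "5pt": (1, 1),
}
BYTES_PER_ELEMENT = 8  # assumption: double precision data


def bytes_moved(kernel: str, n_elements: int) -> int:
    """Bytes transferred if every stream touches each element once."""
    loads, stores = STREAMS[kernel]
    return (loads + stores) * BYTES_PER_ELEMENT * n_elements


def bandwidth_gbs(kernel: str, n_elements: int, seconds: float) -> float:
    """Bandwidth in GB/s (1 GB = 1e9 B) for one kernel run."""
    return bytes_moved(kernel, n_elements) / seconds / 1e9


print(bytes_moved("triad", 1000), "bytes for 1000 triad elements")
```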

Results from an NVIDIA H100 PCIe / CUDA 11.7:

blockSize   threads       %occ  |                init       read       scale     triad       3pt        5pt
       32        3648      3 %  |  GB/s:         228         96        183        254        168        164
       64        7296    6.2 %  |  GB/s:         452        189        341        459        316        310
       96       10944    9.4 %  |  GB/s:         676        277        472        635        443        436
      128       14592   12.5 %  |  GB/s:         888        368        607        821        567        558
      160       18240   15.6 %  |  GB/s:        1093        449        704        966        680        670
      192       21888   18.8 %  |  GB/s:        1301        533        817       1121        794        781
      224       25536   21.9 %  |  GB/s:        1495        612        925       1264        903        889
      256       29184   25.0 %  |  GB/s:        1686        702       1037       1399       1005        989
      288       32832   28.1 %  |  GB/s:        1832        764       1124       1487       1100       1082
      320       36480   31.2 %  |  GB/s:        2015        841       1213       1564       1188       1169
      352       40128   34.4 %  |  GB/s:        2016        908       1295       1615       1269       1250
      384       43776   37.5 %  |  GB/s:        2016        985       1378       1644       1348       1326
      416       47424   40.6 %  |  GB/s:        2016       1045       1439       1641       1415       1395
      448       51072   43.8 %  |  GB/s:        2016       1116       1497       1649       1472       1453
      480       54720   46.9 %  |  GB/s:        2016       1179       1544       1655       1521       1505
      512       58368   50.0 %  |  GB/s:        2017       1261       1583       1675       1556       1545
      544       62016   53.1 %  |  GB/s:        2016       1300       1591       1669       1572       1563
      576       65664   56.2 %  |  GB/s:        2016       1362       1607       1678       1587       1579
      608       69312   59.4 %  |  GB/s:        2018       1416       1619       1689       1598       1592
      640       72960   62.5 %  |  GB/s:        2016       1473       1639       1712       1613       1607
      672       76608   65.6 %  |  GB/s:        2016       1527       1638       1714       1618       1613
      704       80256   68.8 %  |  GB/s:        2015       1578       1644       1725       1625       1619
      736       83904   71.9 %  |  GB/s:        2016       1624       1651       1738       1632       1628
      768       87552   75.0 %  |  GB/s:        2016       1680       1666       1755       1642       1638
      800       91200   78.1 %  |  GB/s:        2015       1714       1663       1758       1645       1642
      832       94848   81.2 %  |  GB/s:        2016       1759       1668       1770       1649       1647
      864       98496   84.4 %  |  GB/s:        2016       1795       1673       1779       1654       1651
      896      102144   87.5 %  |  GB/s:        2016       1837       1686       1796       1663       1662
      928      105792   90.6 %  |  GB/s:        2018       1871       1684       1800       1666       1664
      960      109440   93.8 %  |  GB/s:        2016       1897       1688       1808       1672       1670
      992      113088   96.9 %  |  GB/s:        2016       1919       1693       1818       1678       1675
     1024      116736  100.0 %  |  GB/s:        2016       1942       1704       1832       1686       1683

The results for the SCALE kernel and a selection of GPUs:

Note that the H100 results are for the PCIe version, which has lower DRAM bandwidth than the SXM version!

gpu-latency

Pointer chasing benchmark for latency measurement. A single warp fully traverses a buffer in random order. A partitioning scheme is used to ensure that all cache lines are hit exactly once before they are accessed again. Latency in clock cycles is computed with the current clock rate.
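
The idea of a random traversal that touches every cache line once before repeating can be sketched on the host side. The sketch below uses Sattolo's algorithm to build a single random cycle; this is a stand-in for illustration and does not reproduce the benchmark's own partitioning scheme. Each slot stands for one cache line, and `random_cycle`, `chase` and `latency_cycles` are hypothetical names.

```python
import random


def random_cycle(n: int, seed: int = 0) -> list:
    """Build next-pointers forming one random cycle over n slots.

    Uses Sattolo's algorithm (a stand-in, not the benchmark's scheme):
    following the pointers from any start visits all n slots exactly
    once before returning to the start.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list, start: int = 0):
    """Follow the pointers len(nxt) times; return visited slots and end slot."""
    visited = []
    idx = start
    for _ in range(len(nxt)):
        visited.append(idx)
        idx = nxt[idx]  # each load depends on the previous one
    return visited, idx


def latency_cycles(total_seconds: float, accesses: int, clock_hz: float) -> float:
    """Convert a measured traversal time into clock cycles per access."""
    return total_seconds / accesses * clock_hz


visited, end = chase(random_cycle(16))
print(visited, "-> back at", end)
```

Because every load address depends on the previous load, the accesses cannot overlap, so the time per access approximates the latency of the cache level that holds the buffer.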

Sharp L1 cache transitions at 128/192/256 kB for NVIDIA's V100/A100/H100 and at 16kB for AMD's MI210. V100 and MI210 both have a 6MB L2 cache. The A100 and H100 have a segmented L2 cache of 2x20MB and 2x25MB, respectively, which manifests as a small intermediate plateau when data is fetched from the far L2 section.

The RDNA2 GPU, the RX6900XT, has the most interesting cache hierarchy, with all 4 of its cache levels clearly visible: the 16kB L0 cache, the 128kB semi-shared L1 cache, the 4MB L2 cache, and the 128MB Infinity Cache. It is also the highest-clocking GPU, so its absolute access times would be lower than those of the other GPUs. Measuring its DRAM latency is difficult, because the DRAM interface does not clock up for a single wavefront, resulting in DRAM latencies > 2000 cycles.

gpu-cache

Measures bandwidths of the first and second cache level. Launches one thread block per SM. Each thread block repeatedly reads the contents of the same buffer. Varying buffer sizes changes the targeted cache level.

The 16kB (MI100/MI210), 128kB (V100), 192kB (A100) and 256 kB (H100) L1 cache capacities are very pronounced and sharp. All three NVIDIA architectures transfer close to 128B/cycle/SM. The maximum measured value on AMD's MI100 and MI210 depends on the data type: for double precision, the maximum is 32B/cycle/CU; for single precision and 16B data types (either float4 or double2), the bandwidth is up to 64B/cycle/CU.

This benchmark does not target the memory hierarchy levels past the second cache level (i.e. DRAM for most GPUs), because the data sets do not clearly drop out of a shared cache. Because all thread blocks read the same data, there is a lot of reuse potential inside the shared cache before the data is evicted. The RX6900XT values are bonkers past its 128kB shared L1 cache. A100 and H100 drop slightly at 20/25MB, when the capacity of a single cache section is exceeded. Beyond this point, data cannot be replicated in both L2 cache sections and the maximum bandwidth drops, as data also has to be fetched from the other section.

gpu-l2-cache

Measures bandwidths of shared cache levels. This benchmark explicitly does not target the L1 caches.

All three GPUs have similar L2 cache bandwidths of about 5.x TB/s, though with different capacities.

A remarkable observation is the RX6900XT, which has a second shared cache level, the 128MB Infinity Cache. At almost 1.92 TB/s, it is as fast as the A100's DRAM. At the very beginning, the RX6900XT's semi-shared L1 cache can be seen, where for some block placements the 4 L1 caches have a small effect. The same applies to the H100, which has a larger L1 cache and therefore an increased chance that a thread block finds the data it wants to work on already in the L1 cache, loaded by the previous thread block. This only works for small data sets, where there are only a few different data blocks and this chance is still significant. It is not attributable to the Distributed Shared Memory Network, which allows loading from other SMs' shared memory, because that only works for explicit shared memory loads and not for global loads. Otherwise, every load would require tag checking every L1 cache in the GPC.

gpu-strides

Read only, L1 cache benchmark that accesses memory with strides 1 to 128. The bandwidth is converted to Bytes per cycle and SM. The strides from 1 to 128 are formatted in a 16x8 tableau, because that highlights the recurring patterns of multiples of 2/4/8/16.
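
The conversion to bytes per cycle and SM is plain unit arithmetic, sketched below. The function name and the example numbers are hypothetical; in particular, the clock rate has to be the one actually running during the measurement.

```python
def bytes_per_cycle_per_sm(bandwidth_gbs: float, clock_mhz: float, sm_count: int) -> float:
    """Convert an aggregate bandwidth into bytes per clock cycle per SM.

    bandwidth_gbs: measured bandwidth in GB/s (1 GB = 1e9 B)
    clock_mhz:     SM clock during the measurement in MHz
    sm_count:      number of SMs (or CUs) on the device
    """
    bytes_per_second = bandwidth_gbs * 1e9
    cycles_per_second = clock_mhz * 1e6
    return bytes_per_second / cycles_per_second / sm_count


# Hypothetical example values, chosen only to illustrate the conversion.
print(f"{bytes_per_cycle_per_sm(12800, 1000, 100):.1f} B/cycle/SM")
```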

These multiples are important for NVIDIA's architectures, which clearly have their L1 cache structured in 16 banks of 8B. For strides that are a multiple of 16, every single thread accesses data from the same cache bank. The rate of address translation is reduced when addresses do not fall into the same 128B cache line anymore.
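
A simplified model of the bank structure described above shows why multiples of 2/4/8/16 stand out. It assumes 8-byte elements, a warp of 32 threads, and that the bank is selected by the address divided by the bank width, modulo the number of banks. These are illustrative assumptions, not a description of the actual hardware mapping, and `banks_touched` is a hypothetical name.

```python
N_BANKS = 16     # from the text above
BANK_WIDTH = 8   # bytes per bank, from the text above
WARP_SIZE = 32   # assumed number of threads issuing one load together


def banks_touched(stride: int, element_bytes: int = 8) -> int:
    """Number of distinct banks hit when thread t loads element t * stride.

    Assumed mapping: bank = (byte_address // BANK_WIDTH) % N_BANKS.
    """
    banks = set()
    for t in range(WARP_SIZE):
        byte_address = t * stride * element_bytes
        banks.add((byte_address // BANK_WIDTH) % N_BANKS)
    return len(banks)


for stride in (1, 2, 3, 4, 8, 16):
    print(f"stride {stride:2d}: {banks_touched(stride):2d} banks in use")
```

In this model, odd strides spread the accesses over all 16 banks, while each additional factor of 2 in the stride halves the number of banks in use, down to a single bank for multiples of 16.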

AMD's MI210 appears to have even more banks, with especially stark slowdowns to less than 4B/cycle for multiples of 32.

Testing the stencil-like, 2D structured grid access with different thread block shapes reveals differences in the L1 cache throughput:

(see the generated machine code of MI210 and A100 here: https://godbolt.org/z/1PvWqs9Kf)

AMD's MI210 is fine (at its much lower level), as long as contiguous blocks of at least 4 threads are accessed. NVIDIA's GPUs only reach their maximum throughput for 16 wide thread blocks.

Along with the L1 cache size increases, both Ampere and Hopper also slightly improve the rate of L1 cache address lookups.

gpu-small-kernels

This benchmark explores the potential for cache blocking, where kernels work on a small data set that fits into caches. Because the data set is small and the L2 cache is fast, the kernel executes so quickly that the startup overhead of a kernel launch becomes dominant. The benchmark queues 10000 calls of a streaming SCALE kernel of varying size. Use the command line option "-graph" to use the cudaGraph/hipGraph API.

Each device gets a fit of $a,b$ for the function

$$T = \frac{V}{a + V/b}$$

which models the performance with a startup overhead $a$ and a bandwidth $b$ depending on the data volume $V$.
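
The model can be evaluated directly to see its two limits. The sketch below assumes $V$ in bytes, $a$ in seconds and $b$ in bytes per second; the parameter values are made up for illustration and are not fit results from any device.

```python
def effective_bandwidth(volume: float, a: float, b: float) -> float:
    """Evaluate T = V / (a + V / b).

    volume: data volume V per kernel call in bytes
    a:      startup overhead in seconds
    b:      asymptotic bandwidth in bytes per second
    """
    return volume / (a + volume / b)


# Hypothetical parameters, not measured values.
a = 5e-6   # 5 microseconds startup overhead
b = 2e12   # 2 TB/s asymptotic bandwidth

for volume in (1e4, 1e6, a * b, 1e9):
    gbs = effective_bandwidth(volume, a, b) / 1e9
    print(f"V = {volume:10.3g} B -> {gbs:8.1f} GB/s")
```

For small volumes the overhead dominates and $T \approx V/a$. For large volumes $T$ approaches $b$. At $V = a \cdot b$ the model yields exactly half of the asymptotic bandwidth, which is a convenient measure of how large a kernel has to be before the launch overhead stops dominating.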

cuda-roofline

This program scans a range of computational intensities by varying the number of inner loop trips. It is suitable for studying both the transition from memory-bound to compute-bound codes and the power consumption, clock frequencies and temperatures when using multiple GPUs. The shell script series.sh builds an executable for each value and, once all builds have finished, executes them one after another.

The code runs simultaneously on all available devices. Example output on four Tesla V100 PCIe 16GB:

1 640 blocks     0 its      0.125 Fl/B        869 GB/s       109 GF/s   1380 Mhz   138 W   60°C
2 640 blocks     0 its      0.125 Fl/B        869 GB/s       109 GF/s   1380 Mhz   137 W   59°C
3 640 blocks     0 its      0.125 Fl/B        869 GB/s       109 GF/s   1380 Mhz   124 W   56°C
0 640 blocks     0 its      0.125 Fl/B        869 GB/s       109 GF/s   1380 Mhz   124 W   54°C

1 640 blocks     8 its      1.125 Fl/B        861 GB/s       968 GF/s   1380 Mhz   159 W   63°C
0 640 blocks     8 its      1.125 Fl/B        861 GB/s       968 GF/s   1380 Mhz   142 W   56°C
2 640 blocks     8 its      1.125 Fl/B        861 GB/s       968 GF/s   1380 Mhz   157 W   62°C
3 640 blocks     8 its      1.125 Fl/B        861 GB/s       968 GF/s   1380 Mhz   144 W   59°C
[...]
0 640 blocks    64 its      8.125 Fl/B        811 GB/s      6587 GF/s   1380 Mhz   223 W   63°C
3 640 blocks    64 its      8.125 Fl/B        813 GB/s      6604 GF/s   1380 Mhz   230 W   66°C
1 640 blocks    64 its      8.125 Fl/B        812 GB/s      6595 GF/s   1380 Mhz   241 W   71°C
2 640 blocks    64 its      8.125 Fl/B        813 GB/s      6603 GF/s   1380 Mhz   243 W   69°C
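
In the example output above, 0, 8 and 64 inner loop trips are reported as 0.125, 1.125 and 8.125 Fl/B, which is consistent with an intensity of (its + 1) / 8. The sketch below uses that relation, which is inferred from the printed numbers and not taken from the source code, to relate the bandwidth and floating point columns. The function names are hypothetical.

```python
def intensity_flop_per_byte(its: int) -> float:
    """Computational intensity inferred from the example output:
    0 its -> 0.125, 8 its -> 1.125, 64 its -> 8.125 Fl/B."""
    return (its + 1) / 8


def gflops(its: int, bandwidth_gbs: float) -> float:
    """Floating point rate implied by a bandwidth at a given intensity."""
    return intensity_flop_per_byte(its) * bandwidth_gbs


# Bandwidth values copied from the example output above.
for its, bw in ((0, 869), (8, 861), (64, 811)):
    print(f"{its:3d} its: {intensity_flop_per_byte(its):6.3f} Fl/B, "
          f"{bw} GB/s -> about {gflops(its, bw):.0f} GF/s")
```

The results agree with the printed GF/s column to within rounding of the displayed bandwidth.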

cuda-memcpy

Measures the host-to-device transfer rate of the cudaMemcpy function over a range of transfer sizes.

Example output for a Tesla V100 PCIe 16GB:

         1kB     0.03ms    0.03GB/s   0.68%
         2kB     0.03ms    0.06GB/s   5.69%
         4kB     0.03ms    0.12GB/s   8.97%
         8kB     0.03ms    0.24GB/s   6.25%
        16kB     0.04ms    0.44GB/s   5.16%
        32kB     0.04ms    0.93GB/s   2.70%
        64kB     0.04ms    1.77GB/s   5.16%
       128kB     0.04ms    3.46GB/s   7.55%
       256kB     0.05ms    5.27GB/s   1.92%
       512kB     0.07ms    7.53GB/s   1.03%
      1024kB     0.11ms    9.25GB/s   2.52%
      2048kB     0.20ms   10.50GB/s   1.07%
      4096kB     0.37ms   11.41GB/s   0.58%
      8192kB     0.71ms   11.86GB/s   0.44%
     16384kB     1.38ms   12.11GB/s   0.14%
     32768kB     2.74ms   12.23GB/s   0.03%
     65536kB     5.46ms   12.29GB/s   0.08%
    131072kB    10.89ms   12.32GB/s   0.02%
    262144kB    21.75ms   12.34GB/s   0.00%
    524288kB    43.47ms   12.35GB/s   0.00%
   1048576kB    86.91ms   12.35GB/s   0.00%
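
The transfer rate column can be reproduced from the size and time columns. The sketch below assumes that kB in the table means 1024 bytes and GB/s means 1e9 bytes per second, which matches the large transfers in the output above; the function name is hypothetical.

```python
def transfer_rate_gbs(size_kb: float, time_ms: float) -> float:
    """Transfer rate in GB/s from a size in kB and a time in ms.

    Assumptions: 1 kB = 1024 B and 1 GB = 1e9 B.
    """
    size_bytes = size_kb * 1024
    seconds = time_ms * 1e-3
    return size_bytes / seconds / 1e9


# Size and time values copied from the last rows of the output above.
for size_kb, time_ms in ((131072, 10.89), (1048576, 86.91)):
    print(f"{size_kb:8d} kB in {time_ms:6.2f} ms -> "
          f"{transfer_rate_gbs(size_kb, time_ms):.2f} GB/s")
```

For the small transfers, the printed time has only two decimal places, so recomputing the rate from the table is much less precise there.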

um-stream

Measures CUDA Unified Memory transfer rate using a STREAM triad kernel. A range of data set sizes is used, both smaller and larger than the device memory. Example output on a Tesla V100 PCIe 16GB:

 buffer size      time   spread   bandwidth
       24 MB     0.1ms     3.2%   426.2GB/s
       48 MB     0.1ms    24.2%   511.6GB/s
       96 MB     0.1ms     0.8%   688.0GB/s
      192 MB     0.3ms     1.8%   700.0GB/s
      384 MB     0.5ms     0.5%   764.6GB/s
      768 MB     1.0ms     0.2%   801.8GB/s
     1536 MB     2.0ms     0.0%   816.9GB/s
     3072 MB     3.9ms     0.1%   822.9GB/s
     6144 MB     7.8ms     0.2%   823.8GB/s
    12288 MB    15.7ms     0.1%   822.1GB/s
    24576 MB  5108.3ms     0.5%     5.0GB/s
    49152 MB 10284.7ms     0.8%     5.0GB/s

cuda-incore

Measures the latency and throughput of FMA, DIV and SQRT operations. It scans combinations of ILP=1..8, by generating 1..8 independent dependency chains, and TLP, by varying the warp count on an SM from 1 to 32. The final output is an ILP/TLP table with the reciprocal throughputs (cycles per operation).

Example output on a Tesla V100 PCIe 16GB:

DFMA
  8.67   4.63   4.57   4.66   4.63   4.72   4.79   4.97
  4.29   2.32   2.29   2.33   2.32   2.36   2.39   2.48
  2.14   1.16   1.14   1.17   1.16   1.18   1.20   1.24
  1.08   1.05   1.05   1.08   1.08   1.10   1.12   1.14
  1.03   1.04   1.04   1.08   1.07   1.10   1.11   1.14
  1.03   1.04   1.04   1.08   1.07   1.10   1.10   1.14

DDIV
111.55 111.53 111.53 111.53 111.53 668.46 779.75 891.05
 55.76  55.77  55.76  55.76  55.76 334.26 389.86 445.51
 27.88  27.88  27.88  27.88  27.88 167.12 194.96 222.82
 14.11  14.11  14.11  14.11  14.11  84.77  98.89 113.00
  8.48   8.48   8.48   8.48   8.48  50.89  59.36  67.84
  7.51   7.51   7.51   7.51   7.51  44.98  52.48  59.97

DSQRT
101.26 101.26 101.26 101.26 101.26 612.76 714.79 816.83
 50.63  50.62  50.63  50.63  50.62 306.36 357.38 408.40
 25.31  25.31  25.31  25.31  25.31 153.18 178.68 204.19
 13.56  13.56  13.56  13.56  13.56  82.75  96.83 110.29
  9.80   9.80   9.80   9.80   9.80  60.47  70.54  80.62
  9.61   9.61   9.61   9.61   9.61  58.91  68.72  78.53

Some features can be extracted from the plot.

Latencies:

DFMA: 8 cycles
DDIV: 112 cycles
DSQRT: 101 cycles

Throughput of one warp (runs on one SM quadrant), no dependencies:

DFMA: 1/4 per cycle (ILP 2, two ops overlap)
DDIV: 1/112 per cycle (no ILP/overlap)
DSQRT: 1/101 per cycle (no ILP/overlap)

Throughput of multiple warps (all SM quadrants), dependencies irrelevant:

DFMA: 1 per cycle
DDIV: 1/7.5 per cycle
DSQRT: 1/9.6 per cycle
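
A reciprocal throughput from the table can be turned into a device-wide floating point rate. The sketch below assumes that one "operation" in the table is one warp-wide instruction (32 lanes) per SM, that an FMA counts as 2 flops, and that the device has 80 SMs running at 1380 MHz (the clock shown in the cuda-roofline output above). These are assumptions for illustration and the function name is hypothetical.

```python
WARP_SIZE = 32      # lanes per warp-wide instruction (assumed unit of the table)
FLOPS_PER_FMA = 2   # one multiply and one add


def peak_gflops(cycles_per_op: float, sm_count: int, clock_mhz: float) -> float:
    """Device-wide GFlop/s implied by a reciprocal FMA throughput per SM."""
    flops_per_cycle_per_sm = WARP_SIZE * FLOPS_PER_FMA / cycles_per_op
    return flops_per_cycle_per_sm * sm_count * clock_mhz * 1e6 / 1e9


# 1.03 cycles per DFMA is the best value in the DFMA table above;
# 80 SMs and 1380 MHz are assumed values for a Tesla V100.
print(f"{peak_gflops(1.03, 80, 1380):.0f} GFlop/s")
```

Under these assumptions the DFMA rate comes out slightly above the roughly 6600 GF/s that cuda-roofline reports at 8.125 Fl/B.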

gpu-benches's Issues

Cannot run on gfx1100 (RX 7900)

gpu-cache, gpu-metrics, gpu-stream and gpu-strides run successfully.
gpu-l2-cache and gpu-latency fail.

/opt/rocm/bin/hipcc -std=c++20 -I/opt/rocm/include/rocprofiler/ -I/opt/rocm/hsa/include/hsa -L/opt/rocm/rocprofiler/lib -lrocprofiler64 -lrocprofiler64v2 -lhsa-runtime64 -lrocm_smi64 -ldl main.hip -o demo
./demo
gpu_count 1
Agent 0
data set exec time spread Eff. bw
gpu_count 1
Agent 0
measureMetricStop: no kernel kaunch was intercepted
make: *** [Makefile:25: test] Segmentation fault (core dumped)

Why is the block size 256 in the gpu-cache test?

Hey, I found that the block size in the gpu-cache test is 256. Why is it not 1024?

When I changed the block size from 256 to 1024, the measured L1 cache bandwidth improved somewhat but fluctuated more.

Results for blocksize = 256:

         1 kB        50ms       0.7%    8648.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         2 kB        37ms       0.1%   11608.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         3 kB        33ms       0.0%   12947.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         4 kB        31ms       5.4%   14061.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB        30ms       3.3%   14402.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        30ms       6.6%   14989.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        30ms       3.0%   14555.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        30ms      27.9%   15976.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        30ms       5.3%   14430.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        30ms       2.2%   14588.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        33ms       2.0%   13113.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        30ms      17.5%   15206.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        29ms       7.9%   15610.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        28ms      11.8%   15916.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        32ms      11.1%   13737.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        30ms       5.0%   14240.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        31ms       0.6%   14172.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        30ms       4.1%   14733.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        29ms       2.2%   14845.4 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        29ms       3.3%   15113.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        38 kB        29ms       5.4%   14967.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        40 kB        29ms       5.4%   15129.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        42 kB        29ms       8.7%   15437.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        44 kB        29ms       7.0%   15451.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        46 kB        29ms       8.4%   15633.8 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        48 kB        28ms      12.3%   15940.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        50 kB        28ms      16.4%   16288.1 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        52 kB        28ms      14.6%   16230.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        54 kB        28ms      12.6%   16195.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        56 kB        27ms      10.0%   16434.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        58 kB        28ms      11.0%   16433.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

Results for blocksize = 1024:

     data set   exec time     spread        Eff. bw       DRAM read      DRAM write         L2 read       L2 store
         4 kB        37ms       0.1%   11645.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         6 kB       111ms       0.0%    3902.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
         8 kB        29ms      46.0%   17593.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        10 kB        66ms       6.0%    6564.7 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        12 kB        29ms      24.8%   16609.0 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        14 kB        52ms       1.4%    8303.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        16 kB        28ms      27.1%   17275.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        18 kB        44ms       6.6%    9894.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        20 kB        28ms      27.0%   17521.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        22 kB        39ms       7.5%   11307.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        24 kB        27ms      16.9%   17184.6 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        26 kB        37ms      18.0%   12475.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        28 kB        27ms      40.3%   18542.5 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        30 kB        34ms      11.9%   13365.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        32 kB        26ms      20.7%   18043.9 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        34 kB        34ms      23.1%   14124.3 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s
        36 kB        27ms      26.9%   17707.2 GB/s         0 GB/s          0 GB/s          0 GB/s          0 GB/s

My device is A800 80GB PCIe.

measure-metrics folder missing

Could you kindly share the measure-metrics folder? Without it, I am unable to build the cuda-l2-cache benchmark.

Problem with the L2 cache test

Hey, I was running the L2 cache test on my A800 80GB GPU. When I modified the parameter N, I got some strange results.

With the default N=64, the results are as follows:

       256 kB       768 kB         9ms       3.6%    5663.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      1024 kB         9ms       0.1%    5679.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      1280 kB         9ms       0.5%    5614.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      1536 kB        10ms       0.2%    5421.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      1792 kB        10ms       0.2%    5485.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      2048 kB         9ms       0.3%    5608.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      2304 kB         9ms       0.2%    5532.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      2560 kB         9ms       0.2%    5531.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      2816 kB        10ms       0.4%    5473.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      3072 kB        10ms       0.1%    5468.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      3328 kB        10ms       0.1%    5423.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      3584 kB        10ms       0.2%    5407.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      3840 kB        10ms       0.1%    5433.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      4096 kB        10ms       0.2%    5492.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      4352 kB        10ms       0.1%    5411.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      4608 kB        10ms       0.1%    5416.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      4864 kB        10ms       0.2%    5382.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      5120 kB        10ms       0.5%    5297.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      5632 kB        10ms       0.4%    5441.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      6144 kB        10ms       0.1%    5461.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      6656 kB        10ms       0.1%    5435.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      7168 kB        10ms       0.3%    5412.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      7680 kB        10ms       0.1%    5414.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      8448 kB        10ms       0.1%    5469.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      9216 kB        10ms       0.2%    5442.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB      9984 kB        10ms       0.3%    5313.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     10752 kB        10ms       0.1%    5376.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     11776 kB        10ms       0.3%    5459.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     12800 kB        10ms       0.1%    5447.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     14080 kB        10ms       0.2%    5424.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     15360 kB        10ms       0.3%    5287.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     16896 kB        10ms       0.1%    5394.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     18432 kB        10ms       0.3%    5416.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     20224 kB        10ms       0.4%    5307.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     22016 kB        10ms       0.1%    5416.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     24064 kB        10ms       0.1%    5374.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     26368 kB        10ms       0.2%    5354.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     28928 kB        10ms       0.8%    5161.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     31744 kB        13ms       2.3%    4095.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     34816 kB        20ms       2.2%    2602.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     38144 kB        24ms       1.3%    2197.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     41728 kB        25ms       2.1%    2123.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     45824 kB        27ms       1.7%    1948.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     50176 kB        28ms       0.2%    1897.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     55040 kB        28ms       0.2%    1893.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     60416 kB        28ms       0.3%    1891.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     66304 kB        28ms       0.4%    1891.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     72704 kB        28ms       0.6%    1890.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     79872 kB        28ms       0.5%    1888.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     87808 kB        28ms       0.4%    1885.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB     96512 kB        28ms       0.8%    1889.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    105984 kB        28ms       0.5%    1880.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    116480 kB        28ms       0.5%    1877.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    128000 kB        28ms       0.8%    1881.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    140800 kB        28ms       0.6%    1871.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    154880 kB        28ms       0.6%    1867.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    170240 kB        28ms       0.9%    1870.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    187136 kB        28ms       0.6%    1858.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    205824 kB        28ms       0.6%    1854.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    226304 kB        28ms       1.0%    1853.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    248832 kB        28ms       0.7%    1841.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    273664 kB        29ms       0.8%    1836.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    300800 kB        28ms       1.4%    1840.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    330752 kB        29ms       0.9%    1821.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    363776 kB        29ms       1.5%    1824.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    400128 kB        29ms       0.9%    1805.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    440064 kB        29ms       1.0%    1797.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    483840 kB        29ms       1.6%    1795.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    532224 kB        30ms       1.1%    1770.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    585216 kB        30ms       1.2%    1759.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    643584 kB        30ms       1.5%    1755.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    707840 kB        30ms       0.1%    1730.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    778496 kB        30ms       0.1%    1730.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    856320 kB        30ms       0.1%    1730.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB    941824 kB        30ms       0.1%    1730.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1035776 kB        30ms       0.1%    1730.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1139200 kB        30ms       0.2%    1734.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1253120 kB        30ms       0.1%    1730.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1378304 kB        30ms       0.1%    1730.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1516032 kB        30ms       0.2%    1733.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1667584 kB        30ms       0.3%    1734.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   1834240 kB        30ms       0.2%    1732.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   2017536 kB        30ms       0.1%    1731.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   2219264 kB        30ms       0.1%    1730.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
       256 kB   2440960 kB        30ms       0.2%    1732.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s

When I change N from 64 to 512, the results are as follows:

      4096 kB     12288 kB       154ms       0.1%    5447.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     16384 kB       153ms       0.1%    5471.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     20480 kB       155ms       0.2%    5418.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     24576 kB       154ms       0.4%    5435.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     28672 kB       161ms       2.2%    5207.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     32768 kB       164ms       2.1%    5117.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     36864 kB       164ms       2.1%    5103.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     40960 kB       166ms       2.3%    5051.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     45056 kB       166ms       1.5%    5045.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     49152 kB       167ms       1.0%    5012.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     53248 kB       167ms       2.6%    5023.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     57344 kB       168ms       1.9%    5000.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     61440 kB       168ms       1.9%    5003.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     65536 kB       167ms       1.9%    5008.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     69632 kB       167ms       2.0%    5032.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     73728 kB       168ms       2.5%    4996.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     77824 kB       171ms       1.3%    4903.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     81920 kB       200ms       4.2%    4201.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     90112 kB       199ms       1.2%    4220.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB     98304 kB       210ms      51.8%    3991.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    106496 kB       213ms      58.6%    3936.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    114688 kB       425ms       3.1%    1975.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    122880 kB       439ms       0.5%    1909.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    135168 kB       440ms       0.7%    1905.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    147456 kB       446ms       0.3%    1879.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    159744 kB       442ms       1.0%    1896.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    172032 kB       448ms       0.4%    1870.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    188416 kB       449ms       0.2%    1867.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    204800 kB       450ms       0.1%    1864.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    225280 kB       452ms       0.2%    1855.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    245760 kB       451ms       0.1%    1860.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    270336 kB       451ms       0.1%    1859.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    294912 kB       452ms       0.1%    1856.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    323584 kB       451ms       0.1%    1859.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    352256 kB       451ms       0.1%    1858.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    385024 kB       452ms       0.1%    1857.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    421888 kB       453ms       0.2%    1853.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    462848 kB       453ms       0.1%    1851.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    507904 kB       455ms       0.1%    1845.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    557056 kB       452ms       0.2%    1854.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    610304 kB       453ms       0.2%    1851.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    667648 kB       453ms       0.2%    1852.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    733184 kB       454ms       0.2%    1846.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    802816 kB       453ms       0.2%    1850.6 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    880640 kB       454ms       0.1%    1847.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB    966656 kB       455ms       0.2%    1842.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1060864 kB       455ms       0.2%    1845.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1163264 kB       455ms       0.2%    1842.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1277952 kB       455ms       0.2%    1842.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1404928 kB       456ms       0.2%    1839.1 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1544192 kB       457ms       0.3%    1834.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1695744 kB       458ms       0.3%    1830.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   1863680 kB       459ms       0.3%    1828.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   2048000 kB       459ms       0.3%    1827.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   2252800 kB       460ms       0.3%    1822.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   2478080 kB       462ms       0.3%    1817.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   2723840 kB       462ms       0.3%    1814.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   2994176 kB       464ms       0.4%    1809.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   3293184 kB       464ms       0.4%    1809.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   3620864 kB       466ms       0.4%    1798.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   3981312 kB       467ms       0.4%    1795.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   4378624 kB       469ms       0.5%    1786.9 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   4812800 kB       471ms       0.6%    1779.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   5292032 kB       473ms       0.6%    1775.3 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   5820416 kB       476ms       0.6%    1763.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   6402048 kB       477ms       0.7%    1758.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   7041024 kB       480ms       0.7%    1747.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   7741440 kB       483ms       0.8%    1736.8 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   8515584 kB       487ms       0.8%    1724.2 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB   9363456 kB       489ms       0.9%    1713.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB  10297344 kB       494ms       0.4%    1698.5 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB  11325440 kB       495ms       0.1%    1693.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB  12455936 kB       496ms       0.0%    1690.4 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB  13701120 kB       481ms       3.0%    1744.7 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s
      4096 kB  15069184 kB       484ms       2.4%    1734.0 GB/s        0 GB/s      0 GB/s      0 GB/s      0 GB/s

With N = 64, the results suggest that the L2 cache size is between 26 MB and 45 MB, but with N = 512 the bandwidth drop falls outside this range.

How does N affect the test result? When the value in the second column is 60416 kB (~60 MB), the bandwidth with N = 64 is 1891.3 GB/s, whereas with N = 512 it is ~5003.1 GB/s.

How would you like this work cited?

I am publishing some work and would like to include a reference to this. Please provide me with any information you would like to appear, or if there is a related publication that would be wonderful too. Otherwise, I will cite it similar to this:

te42kyfo, “cuda-benches,” https://github.com/te42kyfo/cuda-benches, 2019

Thank you!

cuda-strides compiler error

nvcc -std=c++17 -O3 -arch=sm_70 --compiler-options="-O2 -pipe -Wall -fopenmp -g" -Xcompiler -rdynamic --generate-line-info -Xcompiler "-Wl,-rpath,/usr/local/cuda/extras/CUPTI/lib64/" -Xcompiler "-Wall" -I/usr/local/cuda/extras/CUPTI/include -o cuda-strides main.cu -L/usr/local/cuda/lib64 -L/usr/local/cuda/extras/CUPTI/lib64 -lcupti -lcuda -lnvidia-ml
../gpu-metrics/cuda_metrics/Eval.hpp(72): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(72): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(75): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(78): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(78): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(83): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(84): error: identifier "NVPW_MetricsEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(84): error: identifier "metricEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(96): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(97): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(97): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(101): error: identifier "NVPW_MetricEvalRequest_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(102): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(127): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(127): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(131): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(134): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(134): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(139): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(145): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(151): error: identifier "NVPW_MetricsEvaluator_Destroy_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(151): error: identifier "NVPW_MetricsEvaluator_Destroy_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(153): error: identifier "NVPW_MetricsEvaluator_Destroy" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(168): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(168): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(171): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(174): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(174): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(181): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(182): error: identifier "NVPW_MetricsEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(182): error: identifier "metricEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(199): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(200): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(200): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(204): error: identifier "NVPW_MetricEvalRequest_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(205): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(227): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(227): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(231): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(234): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(234): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(239): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(245): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(252): error: identifier "NVPW_MetricsEvaluator_Destroy_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(252): error: identifier "NVPW_MetricsEvaluator_Destroy_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(254): error: identifier "NVPW_MetricsEvaluator_Destroy" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(270): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(270): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(273): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(276): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(276): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(283): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(284): error: identifier "NVPW_MetricsEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(284): error: identifier "metricEvaluator" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(301): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(302): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(302): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(306): error: identifier "NVPW_MetricEvalRequest_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(307): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(329): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(329): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(333): error: identifier "NVPW_MetricsEvaluator_SetDeviceAttributes" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(336): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(336): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(341): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(347): error: identifier "NVPW_MetricsEvaluator_EvaluateToGpuValues" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(355): error: identifier "NVPW_MetricsEvaluator_Destroy_Params" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(355): error: identifier "NVPW_MetricsEvaluator_Destroy_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Eval.hpp(357): error: identifier "NVPW_MetricsEvaluator_Destroy" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(17): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(17): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(20): error: identifier "NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(23): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(23): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(28): error: identifier "NVPW_CUDA_MetricsEvaluator_Initialize" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(29): error: identifier "NVPW_MetricsEvaluator" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(29): error: identifier "metricEvaluator" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(39): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(40): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(40): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(44): error: identifier "NVPW_MetricEvalRequest_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(45): error: identifier "NVPW_MetricsEvaluator_ConvertMetricNameToMetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(48): error: identifier "NVPW_MetricsEvaluator_GetMetricRawDependencies_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(48): error: identifier "NVPW_MetricsEvaluator_GetMetricRawDependencies_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(53): error: identifier "NVPW_MetricEvalRequest" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(54): error: identifier "NVPW_MetricsEvaluator_GetMetricRawDependencies" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(57): error: identifier "NVPW_MetricsEvaluator_GetMetricRawDependencies" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(74): error: identifier "NVPW_MetricsEvaluator_Destroy_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(74): error: identifier "NVPW_MetricsEvaluator_Destroy_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(76): error: identifier "NVPW_MetricsEvaluator_Destroy" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(85): error: identifier "NVPW_CUDA_RawMetricsConfig_Create_V2_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(85): error: identifier "NVPW_CUDA_RawMetricsConfig_Create_V2_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(89): error: identifier "NVPW_CUDA_RawMetricsConfig_Create_V2" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(141): error: identifier "NVPW_CUDA_CounterDataBuilder_Create_Params" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(141): error: identifier "NVPW_CUDA_CounterDataBuilder_Create_Params_STRUCT_SIZE" is undefined

../gpu-metrics/cuda_metrics/Metric.hpp(144): error: identifier "NVPW_CUDA_CounterDataBuilder_Create" is undefined

main.cu(167): error: identifier "measureMetricStop" is undefined

What's the difference between cuda-l2-cache and gpu-cache benchmarks?

Problem about gpu-latency benchmark on A100

Description:

I ran the gpu-latency test on an A100 80GB PCIe GPU and noticed that the results were similar to those provided in the repository. However, I observed a slight discrepancy in the capacity corresponding to the first inflection point in the curve where L1/shared memory is exhausted. The capacity in my test was approximately 146KB, which differs slightly from the value of 128KB provided by Nvidia. I would like to inquire whether there might be an issue with my testing configuration. Thank you!

Configuration:

A100 80GB PCIe version, in an Intel x86 server. The CUDA version is 12.4. Apart from changing -arch to sm_80 in the Makefile, no other configuration changes were made.

Problem about bandwidth test

I tried to measure bandwidth with the cuda-stream benchmark. My device is a 4060 Ti; its memory bandwidth is 288 GB/s.

I changed the parameter max_buffer_size from 128l * 1024 * 1024 + 2 to 1024 * 1024, as follows:

#include "../MeasurementSeries.hpp"
#include "../dtime.hpp"
#include "../gpu-error.h"
#include <iomanip>
#include <iostream>

using namespace std;

const int64_t max_buffer_size =  1024 * 1024;
double *dA, *dB, *dC, *dD;

The measured bandwidth is much larger than 288 GB/s:

blockSize   threads       %occ  |                init       read       scale     triad       3pt        5pt
       32        1088      4 %  |  GB/s:         108         67        130        191        119        100
       64        2176    8.3 %  |  GB/s:         209        128        250        362        225        184
       96        3264   12.5 %  |  GB/s:         294        184        357        514        320        266
      128        4352   16.7 %  |  GB/s:         380        233        453        654        400        313
      160        5440   20.8 %  |  GB/s:         443        279        541        773        466        275
      192        6528   25.0 %  |  GB/s:         506        323        620        880        533        308
      224        7616   29.2 %  |  GB/s:         577        357        697        968        589        313
      256        8704   33.3 %  |  GB/s:         646        391        765       1072        634        419
      288        9792   37.5 %  |  GB/s:         697        419        818       1141        634        329
      320       10880   41.7 %  |  GB/s:         733        454        885       1227        673        356
      352       11968   45.8 %  |  GB/s:         800        479        932       1287        745        415
      384       13056   50.0 %  |  GB/s:         800        506        984       1362        782        430
      416       14144   54.2 %  |  GB/s:         800        525       1020       1436        800        430
      448       15232   58.3 %  |  GB/s:         838        558       1050       1476        818        430
      480       16320   62.5 %  |  GB/s:         800        563       1117       1530        863        430
      512       17408   66.7 %  |  GB/s:         800        596       1154       1575        885        425
      544       18496   70.8 %  |  GB/s:         800        601       1193       1624        908        430
      576       19584   75.0 %  |  GB/s:         800        623       1203       1675        932        430
      608       20672   79.2 %  |  GB/s:         800        646       1235       1675        957        430
      640       21760   83.3 %  |  GB/s:         800        646       1245       1730        957        425
      672       22848   87.5 %  |  GB/s:         800        670       1291       1730        908        430
      704       23936   91.7 %  |  GB/s:         800        670       1291       1730        957        430
      736       25024   95.8 %  |  GB/s:         800        697       1340       1804        957        430
      768       26112  100.0 %  |  GB/s:         800        697       1340       1745        957        425

When I leave the parameter unchanged, the test results are normal. The same behavior occurs on my other machine, which has an A800.

Have you ever seen this happen?