Comments (9)
Could you please elaborate in what case we will see a fractional number?
In all cases. I would suggest trying out the dcgmproftester12 tool that's installed with DCGM to understand how workloads relate to profiling metrics.
See docs here: CUDA Test Generator (dcgmproftester)
And also how to interpret it?
You can think of it as a percentage, but instead of ranging from 0 to 100 it ranges from 0 to 1.0. You can multiply it by 100 if you want a percentage.
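As a concrete sketch of that interpretation (the sample values below are hypothetical, not queried from DCGM):

```python
# Hypothetical sample values, as DCGM reports them (ratios from 0.0 to 1.0).
samples = {
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE": 0.522922,
    "DCGM_FI_PROF_SM_ACTIVE": 0.87,
}

# Multiply by 100 to express the same ratios as percentages.
percentages = {name: value * 100.0 for name, value in samples.items()}
```

So a reported 0.522922 reads as roughly 52.3% activity over the sampling window.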
from dcgm.
DCGM_FI_DEV_GPU_UTIL is roughly equal to DCGM_FI_PROF_GR_ENGINE_ACTIVE. DCGM_FI_PROF_GR_ENGINE_ACTIVE is higher precision and works on MIG.
DCGM_FI_DEV_MEM_COPY_UTIL is utilization of the copy engine of the GPU. I would shy away from it as it may not capture all memory bandwidth. Sometimes cuda mem copies use cuda kernels rather than the copy engine, which would not be picked up by this metric. Also, this metric does not work on MIG.
DCGM_FI_PROF_DRAM_ACTIVE is dram bandwidth vs theoretical maximum. This metric is accurate and captures all transfers to and from the GPU's DRAM.
"memory utilization" is ambiguous. Do you mean bandwidth or allocation? For bandwidth, use DCGM_FI_PROF_DRAM_ACTIVE. For allocation, use DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE, and DCGM_FI_DEV_FB_TOTAL.
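A sketch of the bandwidth-vs-allocation distinction; the peak bandwidth and sample values here are hypothetical stand-ins, not live DCGM queries:

```python
# Bandwidth: DCGM_FI_PROF_DRAM_ACTIVE is a 0.0-1.0 ratio of theoretical peak.
dram_active = 0.42            # hypothetical sample
peak_bw_gbs = 2039.0          # e.g. A100 80GB SXM theoretical DRAM bandwidth (GB/s)
achieved_bw_gbs = dram_active * peak_bw_gbs

# Allocation: the FB (framebuffer) fields are reported in MiB.
fb_used_mib = 40960.0         # hypothetical DCGM_FI_DEV_FB_USED sample
fb_total_mib = 81920.0        # hypothetical DCGM_FI_DEV_FB_TOTAL sample
fb_used_fraction = fb_used_mib / fb_total_mib
```

The two answers can diverge arbitrarily: a GPU can have nearly all of its memory allocated while moving almost no data, and vice versa.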
Also, the DCGM documentation says not all fields can be queried in parallel.
This is no longer true. We will remove this from our documentation.
More specifically I wanted to know if the DCGM_FI_DEV_GPU_UTIL can be allotted to any group or does it need to be part of its own group?
DCGM_FI_DEV_GPU_UTIL can be in any group. The same is true for any other fieldIds.
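A minimal sketch of mixing device-level and profiling fields in one watch list; the numeric IDs below are the published values from dcgm_fields.h, and the grouping itself is illustrative rather than a live DCGM call:

```python
# Published DCGM field IDs (from dcgm_fields.h).
DCGM_FI_DEV_GPU_UTIL = 203
DCGM_FI_DEV_FB_USED = 252
DCGM_FI_PROF_GR_ENGINE_ACTIVE = 1001
DCGM_FI_PROF_SM_OCCUPANCY = 1003

# One field group can freely mix DEV and PROF fields; no field needs
# a group of its own.
mixed_field_group = [
    DCGM_FI_DEV_GPU_UTIL,
    DCGM_FI_DEV_FB_USED,
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,
    DCGM_FI_PROF_SM_OCCUPANCY,
]
```

The equivalent one-shot query on the command line would be along the lines of `dcgmi dmon -e 203,252,1001,1003`, which watches all four fields in a single request.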
from dcgm.
Thanks @bstollenvidia! Could you please help me with the following questions as well?
- Is PROF_SM_OCCUPANCY the right field to understand effective SM usage? i.e., are all the cores in an SM being utilized? (I do understand this could be limited by SM shared/L1 memory and registers.)
- Is PROF_SM_ACTIVE * PROF_SM_OCCUPANCY * Total_cores indicative of the number of cores used (rough estimate)?
- Is there a way I can get DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE, and DCGM_FI_DEV_FB_TOTAL for MIG instances as well? Is this something being planned for future releases? The documentation says only the *_PROF_* fields are available for MIG.
- w.r.t. PROF_SM_OCCUPANCY: say a GPU has 128 cores/SM, does that mean its maximum number of concurrent warps is 128/32 = 4?
from dcgm.
Think of them as 3 dimensions of utilization:
PROF_GR_ENGINE_ACTIVE - Is any kernel running on any SM?
PROF_SM_ACTIVE - What ratio of SMs are active?
PROF_SM_OCCUPANCY - How many warps are resident vs the theoretical maximum (2048 resident threads, i.e. 64 warps, per SM).
Rough estimate of SMs used would be PROF_SM_ACTIVE * numSMs.
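Putting those dimensions together numerically (the SM count matches an A100, and the metric samples are hypothetical):

```python
num_sms = 108                 # e.g. an A100 has 108 SMs
max_warps_per_sm = 64         # 2048 resident threads / 32 threads per warp

sm_active = 0.50              # hypothetical PROF_SM_ACTIVE sample
sm_occupancy = 0.25           # hypothetical PROF_SM_OCCUPANCY sample

# Rough estimate of SMs with at least one warp in flight.
est_sms_used = sm_active * num_sms

# Rough estimate of resident warps across the whole GPU.
est_warps = sm_occupancy * num_sms * max_warps_per_sm
```

With these samples, roughly 54 SMs are active and about 1728 warps are resident; both are coarse averages over the sampling window, not instantaneous counts.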
w.r.t PROF_SM_OCCUPANCY Say a GPU has 128 cores/SM, does that mean its maximum number of concurrent warps is 128/32 = 4?
There aren't cores per SM. I would read up on the cuda programming model here:
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
or here:
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)
The FB metrics are available at the MIG level. See DCGM/dcgmlib/src/DcgmCacheManager.cpp, line 10585 (commit 661d939).
Make sure you're using our latest release (DCGM 3.1.3) from here https://developer.nvidia.com/dcgm#Downloads
from dcgm.
@bstollenvidia The image in https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/ does show an SM with multiple cores. Am I misunderstanding it?
Another reference conveys the same meaning.
from dcgm.
PROF_GR_ENGINE_ACTIVE - Is any kernel running on any SM?
@bstollenvidia Does this mean this can only have a discrete set of values, either 0 or 1? 0 if nothing is running, 1 if the GPU is being used in any way (even if only one core or device memory is being used).
Also, is the value inclusive of memory utilization as well? i.e., would it be 1 even if the compute cores are not being used but data is being transferred to device memory?
from dcgm.
@bstollenvidia Could you please look into the questions posted in the above 2 comments?
from dcgm.
SMs don't have independent cores. They can do cuda threads in parallel but every thread is doing exactly the same instruction per cycle. They re-run kernels for any branches taken.
Does this mean this can only have a discrete set of values, either 0 or 1? 0 if nothing is running, 1 if the GPU is being used in any way (even if only one core or device memory is being used).
No. The values are from 0 - 1. Could be 0.522922 or something.
Also, is the value inclusive of memory utilization as well, i.e, it would be 1 even if the compute core is not being used but data is being transferred to the device memory?
Yes. Waiting on memory transfers = busy. Technically, the SM is busy but blocked on a dram transfer.
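So a fractional value is simply a duty cycle over the sampling window; a sketch with hypothetical cycle counts:

```python
# GR_ENGINE_ACTIVE averages over the sampling window: the fraction of
# cycles during which the graphics engine had any work in flight.
active_cycles = 522_922        # hypothetical cycles with work in flight
total_cycles = 1_000_000       # hypothetical cycles in the sampling window
gr_engine_active = active_cycles / total_cycles
```

A value like 0.522922 therefore means the engine was busy (including busy-but-blocked on DRAM) for about 52% of the window, not that half the cores were used.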
from dcgm.
No. The values are from 0 - 1. Could be 0.522922 or something.
@bstollenvidia Could you please elaborate in what case we will see a fractional number? And also how to interpret it?
from dcgm.