Comments (12)
Thank you for the report!
Could you try a small experiment: run dcgm-exporter and generate some GPU load for a minute or two. After that, keep the GPU idle and watch the DCGM_FI_PROF* metrics for another couple of minutes.
I'm wondering whether dcgm-exporter keeps reporting the same values it reported in the first minute or two, over and over, while the GPU is idle.
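A rough way to script that watch (a sketch only; the exporter address is an assumption taken from later in this thread, and the 30-second interval is arbitrary):

```shell
#!/bin/sh
# Sketch: poll dcgm-exporter and log one DCGM_FI_PROF value over time.
# EXPORTER is an assumed address; substitute your dcgm-exporter pod IP.
EXPORTER=${EXPORTER:-10.10.204.252:9400}

# Print the value(s) of one metric family from Prometheus text on stdin.
extract_metric() {
  grep "^$1{" | awk '{ print $NF }'
}

# One timestamped sample of GR_ENGINE_ACTIVE.
sample() {
  printf '%s ' "$(date -u +%H:%M:%S)"
  curl -s "http://$EXPORTER/metrics" | extract_metric DCGM_FI_PROF_GR_ENGINE_ACTIVE
}

# To watch the idle period, run e.g.:
#   while sleep 30; do sample; done | tee prof.log
```

If the logged value stays byte-identical long after the GPU goes idle, that points to a stuck sample rather than a slowly decaying average.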
from dcgm-exporter.
@nikkon-dev sure thing.
Just a few more comments on this issue.
We tested these values with dcgmi instead of the exporter about a year ago, and the same issue was present back then.
When we monitored the DCGM_FI_PROF* metrics for a long time, the values changed very slowly and meaninglessly, just as they do now with dcgm-exporter, and we had to restart the dcgmi command to get an instantaneous value. So I'm hypothesizing that the issue lies in the method DCGM uses to gather these metrics.
I'm still running a few more tests as you requested.
I ran another test (not the one you requested; I will do that after posting this comment) where I left both dcgm-exporter and the job running for about 5 hours. (The initial state is the same as in my initial post in this issue.)
a. The tfjob from the last experiment in my initial post was running before I restarted dcgm-exporter.
b. After the restart, dcgm-exporter showed:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.959228
As mentioned, I left both pods running for 5 hours. After 5 hours I checked the same metric again and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.955586
which seems about right.
Next, I stopped the job, read the values again, and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.955019
which shouldn't be right... Since responses could be delayed, I waited about 10-15 minutes, read the values again, and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.814772
Since no jobs had been running for the past 10-15 minutes, that cannot be right.
At the same point in time, all of the DCGM_FI_PROF metrics for that instance were wrong, as shown in the following:
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.133374
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.580126
But metrics other than DCGM_FI_PROF_* were fine, as shown in the following:
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 13
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 11954
And these are the values immediately after restarting the dcgm pods:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
Now I will try the test you requested.
@nikkon-dev
This is the test you requested:
- kubectl rollout restart ds dcgm-exporter
- curl 10.10.204.252:9400/metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
- started a tf-benchmark job on the same instance.
- curl 10.10.204.252:9400/metrics (immediately; notice that dcgm-exporter didn't pick up the tf-benchmark pod labels)
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.381124
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.022137
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.098690
- curl 10.10.204.252:9400/metrics (after a few minutes, but it still didn't pick up the k8s pod for some reason...)
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.966660
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.159025
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615
- stop the tf-benchmark job
- curl 10.10.204.252:9400/metrics (values immediately after killing the job)
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.938835
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.156037
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615
- nvidia-smi (immediately after quitting the tf-benchmark job)
|==================+======================+===========+=======================|
| 0 1 0 0 | 13MiB / 11968MiB | 28 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
- curl 10.10.204.252:9400/metrics (after letting it sit idle for a few minutes)
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615
It seemed like some values were flushed out and some weren't.
The values in step 9 remained unchanged for more than 5 minutes (the exact same values).
Thanks for the tests. Unfortunately, it's challenging to spot a pattern if there is just a single point.
Do you have graphs that could help see the pattern in time?
Another option would be to run dcgmi dmon -d 500 -e 1001,1004,1005 directly to collect values every half second (you may change to -d 1000 to collect every second). Continuous value output while the GPU is idle will help reveal the pattern, because then it's not just one stuck, unflushed value, but a whole series ± some delta.
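To tell a single stuck value from a series varying within some delta, the idle-period readings can be summarised; a small sketch assuming the GRACT values have already been cut into one number per line:

```shell
#!/bin/sh
# Sketch: summarise a series of dcgmi dmon readings to see whether the
# output is one frozen number (spread ~ 0) or a series with real variation.
# Input: one numeric value per line on stdin (e.g. the GRACT column).
spread() {
  awk 'NR==1 { min=$1; max=$1 }
       { if ($1 < min) min=$1; if ($1 > max) max=$1; sum+=$1 }
       END { printf "n=%d mean=%.6f min=%.6f max=%.6f spread=%.6f\n",
             NR, sum/NR, min, max, max-min }'
}
# Example (column index 2 is an assumption about the dmon layout):
#   dcgmi dmon -d 500 -e 1001 | awk 'NR>2 { print $2 }' | spread
```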
I almost forgot an important question: what are your driver version and GPU name? nvidia-smi output should be enough.
@nikkon-dev
I'm trying to run dcgmi on the base OS, and the profiling module is failing to load...
libdcgmmoduleprofiling.so
libdcgmmoduleprofiling.so.2
libdcgmmoduleprofiling.so.2.3.6
are all under /usr/lib/x86_64-linux-gnu
Any ideas?
@nikkon-dev
And I'm using
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
for the NVIDIA driver, on an ASUS server with two A30s.
Regarding the initial issue: please check the just-released DCGM 2.4.5; it should include a fix for "ghost" MIG metrics.
As for the base OS: the profiling module requires nv-hostengine to run as root.
If it fails to load even with root permissions, please collect the debug logs from nv-hostengine:
- Stop nvidia-dcgm service (if running)
- As root, run
nv-hostengine -f /tmp/nv-hostengine.debug.log --log-level debug
- Run
dcgmi dmon -e 1001
That should be enough to trigger the profiling module loading.
We will need to see the /tmp/nv-hostengine.debug.log after that.
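The steps above could be scripted roughly as follows (a sketch only; the `-c 5` sample count on dcgmi dmon is my addition, so verify the flags against your DCGM version and run as root):

```shell
#!/bin/sh
# Sketch of the debug-log collection steps above; run as root.
LOG=/tmp/nv-hostengine.debug.log

collect_debug_log() {
  systemctl stop nvidia-dcgm 2>/dev/null      # stop the service if running
  nv-hostengine -f "$LOG" --log-level debug   # start hostengine with debug logging
  dcgmi dmon -e 1001 -c 5                     # trigger profiling-module load
  echo "debug log written to $LOG"
}

if command -v nv-hostengine >/dev/null 2>&1; then
  collect_debug_log
else
  echo "nv-hostengine not found; install DCGM first"
fi
```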
@nikkon-dev Thanks, I will try to get some graphs using Grafana and see if I get the same values with DCGM on the base OS.
Also, we had the same issue on the base OS when the dcgmi dmon command ran for a long time. I will check whether that still happens as well.
@nikkon-dev
Hi, I'm finally back with the test results.
There were quite a few changes:
- We used an updated version of dcgm-exporter in this test:
a. previously we used 2.3.6-2.6.6;
b. in this test we used 2.4.5-2.6.7.
- Instead of running DCGM with the hostengine inside the container, we installed DCGM on the base OS and used the -r option inside the container to connect the exporter to nv-hostengine on the base OS.
As you can see in the first picture, GRACT (1001) seems to work fine now, but DRAMA (1005) still seems to give wrong values.
The second image is the result of running dcgmi dmon and curl against the exporter consecutively.
I know the profile is wrong in the dcgmi dmon output, but it should still show some values for GPU 0 if dcgmi dmon picked up values for GPU 0 instance 1; instead it shows 0.
I have the same problem. I have a MIG partition of 1g/1g/2g/3g on one A100 GPU and used gpu_burn to put all of the instances under full load. Here is the result of DCGM_FI_PROF_GR_ENGINE_ACTIVE:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.999980
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.404050
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.333328
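One way to make the zeroed slice stand out in output like this is to reduce each line to its GPU_I_ID and value; a sketch assuming Prometheus text format on stdin:

```shell
#!/bin/sh
# Sketch: list per-MIG-instance GR_ENGINE_ACTIVE values so a zeroed slice
# is easy to spot under supposedly uniform load.
per_instance() {
  grep '^DCGM_FI_PROF_GR_ENGINE_ACTIVE{' |
  sed -n 's/.*GPU_I_ID="\([0-9]*\)".* \([0-9.]*\)$/instance \1: \2/p'
}
# Example (address is an assumption; substitute your exporter endpoint):
#   curl -s http://10.10.204.252:9400/metrics | per_instance
```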