Comments (12)

nikkon-dev commented on July 23, 2024

@Omoong,

Thank you for the report!

Could you try a small experiment: run dcgm-exporter and generate some GPU load for a minute or two. After that, keep the GPU idle and watch the DCGM_FI_PROF* metrics for another couple of minutes.

I'm wondering whether dcgm-exporter keeps reporting the same values it reported during the first minute or two, over and over, while the GPU is idle.
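
For reference, here is a rough way to run that experiment from a shell, assuming the exporter is reachable at localhost:9400 (adjust the address to match your pod):

  # generate GPU load for a minute or two with any CUDA workload, stop it,
  # then poll the profiling metrics every 10 seconds while the GPU stays idle
  while true; do
    date
    curl -s localhost:9400/metrics | grep '^DCGM_FI_PROF'
    sleep 10
  done

If the numbers stay frozen at the busy-period values, that matches the stuck-value behavior discussed below.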

Omoong commented on July 23, 2024

@nikkon-dev Sure thing.
Just a few more notes on this issue:
We tested these values with dcgmi instead of the exporter about a year ago, and the same issue was present back then.
When we monitored the DCGM_FI_PROF* metrics over a long period, the values changed very slowly and meaninglessly, just as they do now with dcgm-exporter,
and we had to restart the dcgmi command to get an instantaneous value. So I'm hypothesizing that the issue lies in the method DCGM uses to gather the values for these metrics.
I'm still running the additional tests you requested.

I ran another test (not the one you requested; I will do that after I post this comment), where I left both the dcgm-exporter and the job running for about 5 hours. (The initial state is the same as in my initial post in this issue.)
a. The tfjob from the last experiment in the initial post was already running before I restarted the dcgm-exporter.
b. After the restart, dcgm-exporter showed:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.959228

So, as I mentioned, I left both pods running for 5 hours. After 5 hours I checked the same metric again and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="container-testest-1",namespace="gi-user1-giops-ai",pod="testest-9sxmd"} 0.955586

which seems to be about right.

Next, I stopped the job, read the values again, and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.955019

which shouldn't be right... Since there could be a delayed response, I waited about 10~15 minutes, read the values again, and received the following:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.814772

Since no jobs had been running for the past 10~15 minutes, that cannot be right.
At that same point in time, all of the DCGM_FI_PROF metrics are wrong for that instance, as shown in the following:
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.133374
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 0.580126

But the other, non-DCGM_FI_PROF_ metrics are just fine, as shown in the following:
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 13
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-4flp2",container="",namespace="",pod=""} 11954

And these are the values as soon as I restarted the dcgm pods:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000

Now I will try the test you requested.

Omoong commented on July 23, 2024

@nikkon-dev
This is the test you requested.

  1. kubectl rollout restart ds dcgm-exporter
  2. curl 10.10.204.252:9400/metrics

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000

  3. Started a tf-benchmark job on the same instance.
  4. curl 10.10.204.252:9400/metrics (immediately; notice how dcgm-exporter didn't pick up the tf-benchmark pod)

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.381124
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.022137
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.098690

  5. curl 10.10.204.252:9400/metrics (this is after a few minutes, but it still didn't pick up the k8s pod for some reason...)

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.966660
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.159025
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615

  6. Stopped the tf-benchmark job.
  7. curl 10.10.204.252:9400/metrics (values immediately after killing the job)

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.938835
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.156037
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615

  8. nvidia-smi (immediately after quitting the tf-benchmark job)

(nvidia-smi MIG devices table: GPU 0, GI 1, CI 0 shows 13MiB / 11968MiB framebuffer and 0MiB / 16383MiB BAR1 in use)

  9. curl 10.10.204.252:9400/metrics (after letting it sit idle for a few minutes)

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-603a8873-0edd-c157-ce48-2078e8b1a935",device="nvidia0",modelName="NVIDIA A30",GPU_I_PROFILE="2g.12gb",GPU_I_ID="1",Hostname="dcgm-exporter-sn5rp",container="",namespace="",pod=""} 0.549615

It seems like some of the values were flushed out and some weren't.
The values in step 9 remained (the exact same values) for more than 5 minutes.
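
Incidentally, a quick way to confirm the values are literally frozen rather than just similar is to diff two scrapes taken a few minutes apart (a sketch; the exporter address is the one used above):

  curl -s 10.10.204.252:9400/metrics | grep '^DCGM_FI_PROF' > /tmp/prof_a.txt
  sleep 300   # leave the GPU idle for 5 minutes
  curl -s 10.10.204.252:9400/metrics | grep '^DCGM_FI_PROF' > /tmp/prof_b.txt
  diff /tmp/prof_a.txt /tmp/prof_b.txt && echo "scrapes are identical"

An empty diff means the exporter is re-reporting the exact same samples.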

nikkon-dev commented on July 23, 2024

@Omoong,

Thanks for the tests. Unfortunately, it's challenging to spot a pattern from a single data point.
Do you have graphs that could help show the pattern over time?
Another option would be to run dcgmi dmon -d 500 -e 1001,1004,1005 directly to collect values every half second (you can change to -d 1000 to collect every second). A continuous stream of values while the GPU is idle will help reveal the pattern, because then it's not just one stuck, unflushed value but a whole series +- some delta.
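
A sketch of how that output could be captured for later comparison, assuming dcgmi is on the PATH and nv-hostengine is running:

  # fields: 1001 = GR_ENGINE_ACTIVE, 1004 = PIPE_TENSOR_ACTIVE, 1005 = DRAM_ACTIVE
  dcgmi dmon -d 500 -e 1001,1004,1005 | tee /tmp/dcgmi_dmon_idle.log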

nikkon-dev commented on July 23, 2024

I almost forgot an important question: what are your driver version and GPU name? nvidia-smi output should be enough.
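
If the full nvidia-smi dump is inconvenient, a compact query for just those two fields would be (a sketch):

  nvidia-smi --query-gpu=name,driver_version --format=csv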

Omoong commented on July 23, 2024

@nikkon-dev
I'm trying to run dcgmi on the base OS, and the profiling module is failing to load...

libdcgmmoduleprofiling.so
libdcgmmoduleprofiling.so.2
libdcgmmoduleprofiling.so.2.3.6

are all under /usr/lib/x86_64-linux-gnu

Any ideas?
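
In case it helps with the diagnosis, this is roughly how the module's visibility to the dynamic loader can be checked (a sketch; paths as listed above):

  # confirm the profiling module is in the dynamic linker cache
  ldconfig -p | grep dcgmmoduleprofiling
  # confirm the symlinks resolve
  ls -l /usr/lib/x86_64-linux-gnu/libdcgmmoduleprofiling*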

Omoong commented on July 23, 2024

@nikkon-dev
And I'm using
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
for the NVIDIA driver, on an ASUS server with two A30s.

nikkon-dev commented on July 23, 2024

@Omoong,

Regarding the initial issue - please check the just-released DCGM 2.4.5 - it should have a fix for "ghost" MIG metrics.

nikkon-dev commented on July 23, 2024

As for the BaseOS: the profiling module needs nv-hostengine to run as root.
If it fails to load even with root permissions, please collect the debug logs from nv-hostengine (a shell sketch of these steps follows the list):

  1. Stop the nvidia-dcgm service (if running)
  2. As root, run nv-hostengine -f /tmp/nv-hostengine.debug.log --log-level debug
  3. Run dcgmi dmon -e 1001
    That should be enough to trigger loading of the profiling module.
    We will need to see /tmp/nv-hostengine.debug.log after that.
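
A minimal sketch of those steps as shell commands, assuming the nvidia-dcgm service is managed by systemd:

  sudo systemctl stop nvidia-dcgm                                         # step 1: stop the service if running
  sudo nv-hostengine -f /tmp/nv-hostengine.debug.log --log-level debug    # step 2: start the hostengine with debug logging
  dcgmi dmon -e 1001                                                      # step 3: triggers loading of the profiling module
  # afterwards, attach /tmp/nv-hostengine.debug.log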

Omoong commented on July 23, 2024

@nikkon-dev Thanks, I will try to get some graphs using Grafana and see if I get the same values with DCGM on the base OS.
Also, we had the same issue on the base OS when the dcgmi dmon command had been running for a long time; I will check whether that also still happens.

Omoong commented on July 23, 2024

@nikkon-dev
Hi, I'm finally back with the test results.
[image: graphs of GRACT (1001) and DRAMA (1005) over time]
OK, so there were quite a few changes:

  1. We used an updated version of dcgm-exporter in this test:
    a. before, we used 2.3.6-2.6.6
    b. in this test, we used 2.4.5-2.6.7
  2. Instead of running DCGM with the hostengine inside the container, we installed DCGM on the base OS and used the -r option inside the container to connect the exporter to the nv-hostengine on the base OS (see the sketch below).
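
A sketch of what that connection looks like, assuming nv-hostengine listens on its default port 5555 and the exporter pod can reach it (for example via host networking):

  # on the base OS, as root
  nv-hostengine
  # in the exporter container / DaemonSet args
  dcgm-exporter -r localhost:5555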

As you can see in the picture, GRACT (1001) seems to work fine now,
but DRAMA (1005) still seems to give wrong values.

[image: dcgmi dmon output and the exporter's curl output, captured back to back]

The second image shows the result of running dcgmi dmon and curl against the exporter consecutively.
I know the profile shown by dcgmi dmon is wrong, but it should still show some values for GPU 0 if dcgmi dmon picked up values for GPU 0 instance 1; instead it shows 0.

aisensiy commented on July 23, 2024

I have the same problem. I have a MIG partition of 1g+1g+2g+3g on one A100 GPU and am using gpu_burn to put all of them under full load. Here is the result of DCGM_FI_PROF_GR_ENGINE_ACTIVE:

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.999980
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="2g.20gb",GPU_I_ID="3",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.404050
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",device="nvidia3",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="3g.40gb",GPU_I_ID="2",DCGM_FI_DRIVER_VERSION="515.65.01"} 0.333328
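
For anyone reproducing this, one way to put each MIG slice under load is to pin a gpu_burn process to each slice via its UUID (a sketch; the MIG UUID placeholders are hypothetical and come from nvidia-smi -L, and the duration is arbitrary):

  # list the MIG device UUIDs
  nvidia-smi -L
  # start one burn per slice, e.g.
  CUDA_VISIBLE_DEVICES=MIG-<uuid-of-1g.10gb-slice> ./gpu_burn 600 &
  CUDA_VISIBLE_DEVICES=MIG-<uuid-of-2g.20gb-slice> ./gpu_burn 600 &
  CUDA_VISIBLE_DEVICES=MIG-<uuid-of-3g.40gb-slice> ./gpu_burn 600 &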
