We use dcgm-exporter as described in <a href="https://docs.nvidia.com/datacenter/cloud

Errors in nv-hostengine log about dcgm HOT 6 OPEN

itzsimpl commented on August 16, 2024

Errors in nv-hostengine log

from dcgm.

Comments (6)

nikkon-dev commented on August 16, 2024

@itzsimpl,

Can you please check the dmesg messages and confirm if you are using the GSP driver?

itzsimpl commented on August 16, 2024

@nikkon-dev

The installed driver is 535.129.03. Going by https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html, GSP firmware appears to be in use:

# nvidia-smi -q | grep -i gsp
    GSP Firmware Version                  : 535.129.03

but

# cat /proc/driver/nvidia/gpus/0000\:1b\:00.0/information 
Model:           NVIDIA H100 80GB HBM3
IRQ:             18
GPU UUID:        GPU-875f3ca0-9de4-e78c-9cea-5140b030b627
Video BIOS:      96.00.89.00.01
Bus Type:        PCIe
DMA Size:        52 bits
DMA Mask:        0xfffffffffffff
Bus Location:    0000:1b:00.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No

I don't see GSP mentioned in dmesg.

Could you provide more details on what to look for in dmesg?

jfolz commented on August 16, 2024

We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.

ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
ERROR [597577:597597] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]
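For anyone triaging similar lines, a small sketch that pulls the numeric status codes out of nv-hostengine log lines and attaches a name to them. The two name mappings are assumptions recalled from the `nvmlReturn_t` enum in `nvml.h` and the `dcgmReturn_t` enum in `dcgm_structs.h`; please check them against the headers shipped with your installation. The function name `decode` is hypothetical.

```python
import re

# Assumed mappings (verify against nvml.h and dcgm_structs.h on your system).
NVML_CODES = {10: "NVML_ERROR_TIMEOUT"}
DCGM_CODES = {-8: "DCGM_ST_NVML_ERROR"}

# Patterns for the two message shapes seen in the log above.
NVML_RE = re.compile(r"Got nvml st (\d+) from (\w+)\(\)")
DCGM_RE = re.compile(r"Got unexpected return (-?\d+) from ([\w.]+)")


def decode(line):
    """Return (kind, code, name, source) for a recognised line, else None."""
    m = NVML_RE.search(line)
    if m:
        code = int(m.group(1))
        return ("nvml", code, NVML_CODES.get(code, "unknown"), m.group(2))
    m = DCGM_RE.search(line)
    if m:
        code = int(m.group(1))
        return ("dcgm", code, DCGM_CODES.get(code, "unknown"), m.group(2))
    return None


if __name__ == "__main__":
    import sys

    # Usage: python decode_log.py < nv-hostengine.log
    for log_line in sys.stdin:
        result = decode(log_line)
        if result:
            print(result)
```

If the mappings hold, the first line would be an NVML timeout while fetching a GPM sample, and the second is DCGM passing that failure up as a generic NVML error.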

Not sure if this is relevant, but here are the metrics collected by dcgm-exporter:

DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
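Since the failing call in the log is `nvmlGpmSampleGet`, one way to narrow this down might be to separate the `DCGM_FI_PROF_*` fields from the rest and test with them temporarily removed. This rests on the assumption that, on Hopper GPUs, the profiling fields are the ones sampled through GPM; that is not confirmed anywhere in this thread. A minimal sketch for splitting a counters list in the format above (the function names are hypothetical):

```python
def parse_counters(text):
    """Parse 'FIELD, type, help text' lines into (field, type, help) tuples."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two commas only; the help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            continue  # not a counter definition
        rows.append(tuple(parts))
    return rows


def profiling_fields(rows):
    """Return the field names that belong to the DCGM_FI_PROF_* family."""
    return [field for field, _, _ in rows if field.startswith("DCGM_FI_PROF_")]


if __name__ == "__main__":
    import sys

    # Usage: python split_counters.py < counters.csv
    for name in profiling_fields(parse_counters(sys.stdin.read())):
        print(name)
```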

@nikkon-dev please let us know if you need additional information, and if this is a DCGM or dcgm-exporter issue.

itzsimpl commented on August 16, 2024

@nikkon-dev Any news on this?

Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0

# dcgmi -v
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Hostengine build info:
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Still seeing the errors, and also found:

2024-02-01 13:34:36.214 ERROR [8878:8879] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@֟�O^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 13:34:36.215 ERROR [8878:8879] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 13:34:36.215 ERROR [8878:8879] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]

The system has the latest DGX OS 6.1, the latest firmware 1.1.3, and all Ubuntu package upgrades applied; the driver is 535.154.05.

pintohutch commented on August 16, 2024

FWIW I'm seeing these same messages using libraries from the nvcr.io/nvidia/cloud-native/dcgm:3.3.3-1-ubuntu22.04 container.

time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR

For the GPU firmware:

cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model: 		 NVIDIA L4
IRQ:   		 11
GPU UUID: 	 GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 	 95.04.29.00.07
Bus Type: 	 PCI
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:00:03.0
Device Minor: 	 0
GPU Firmware: 	 535.129.03
GPU Excluded:	 No
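To compare this across GPUs or nodes, the `information` files can be read into dictionaries. A minimal sketch, assuming every file follows the `Key: value` layout shown above; the function names are hypothetical.

```python
from pathlib import Path


def parse_gpu_information(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            info[key.strip()] = value.strip()
    return info


def read_all_gpus(root="/proc/driver/nvidia/gpus"):
    """Map each PCI address directory under root to its parsed information file."""
    result = {}
    for path in sorted(Path(root).glob("*/information")):
        result[path.parent.name] = parse_gpu_information(path.read_text())
    return result


if __name__ == "__main__":
    for address, info in read_all_gpus().items():
        print(address, info.get("Model"), info.get("GPU Firmware"))
```

Splitting on the first colon only keeps values such as the `Bus Location` PCI address intact.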

pintohutch commented on August 16, 2024

Though I want to point out, I'm deploying dcgm-exporter along with the DCGM libraries in "embedded mode" and I can see it exposing 0-value metrics for Field 230 (DCGM_FI_DEV_XID_ERRORS):

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0

So I dunno if dcgm-exporter is elegantly handling/defaulting here in the case of errors, or if it's reporting an incorrect metric because of real errors from DCGM.
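To line the exported values up with the "No data is available" log entries, the samples for one metric can be pulled out of the scraped `/metrics` text. A minimal sketch, assuming the standard Prometheus text exposition format as shown above and using only the standard library; the function names are hypothetical. Note that a 0 here may simply mean no XID has been recorded, so this check alone cannot tell a default apart from a real reading.

```python
import re

# name{label="value",...} value
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric):
    """Return a list of (labels, value) pairs for the given metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        m = SAMPLE_RE.match(line)
        if not m or m.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group(2)))
        samples.append((labels, float(m.group(3))))
    return samples


def nonzero_gpus(samples):
    """Return the sorted 'gpu' label values whose sample value is not zero."""
    return sorted({labels.get("gpu", "?") for labels, value in samples if value != 0})


if __name__ == "__main__":
    import sys

    # Usage: curl -s <exporter>/metrics | python xid_check.py
    found = parse_samples(sys.stdin.read(), "DCGM_FI_DEV_XID_ERRORS")
    print(len(found), "samples; GPUs with non-zero XID:", nonzero_gpus(found))
```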
