Comments (6)

nikkon-dev commented on August 16, 2024

@itzsimpl,

Can you please check the dmesg messages and confirm if you are using the GSP driver?

from dcgm.

itzsimpl commented on August 16, 2024

@nikkon-dev

The installed driver is 535.129.03. Following https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html, I ran:

# nvidia-smi -q | grep -i gsp
    GSP Firmware Version                  : 535.129.03

but

# cat /proc/driver/nvidia/gpus/0000\:1b\:00.0/information 
Model:           NVIDIA H100 80GB HBM3
IRQ:             18
GPU UUID:        GPU-875f3ca0-9de4-e78c-9cea-5140b030b627
Video BIOS:      96.00.89.00.01
Bus Type:        PCIe
DMA Size:        52 bits
DMA Mask:        0xfffffffffffff
Bus Location:    0000:1b:00.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No

I don't see GSP mentioned in dmesg.

Could you provide more details on what to look for in dmesg?
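
For anyone scripting the same check across several nodes, here is a minimal sketch that pulls the GSP firmware version out of `nvidia-smi -q` output. It assumes, per the README linked above, that the field reads `N/A` when GSP is not in use. The function name `gsp_firmware_version` is made up for this example.

```python
import re
from typing import Optional


def gsp_firmware_version(nvidia_smi_q_output: str) -> Optional[str]:
    """Return the GSP firmware version found in `nvidia-smi -q` text, or None.

    Assumption: the field is reported as "N/A" when GSP is not in use.
    """
    match = re.search(
        r"^\s*GSP Firmware Version\s*:\s*(\S+)\s*$",
        nvidia_smi_q_output,
        flags=re.MULTILINE,
    )
    if match is None or match.group(1).upper() == "N/A":
        return None
    return match.group(1)


# Line copied from the nvidia-smi output quoted above.
sample = "    GSP Firmware Version                  : 535.129.03\n"
print(gsp_firmware_version(sample))
```

On a real host the text would come from running `nvidia-smi -q` and passing its stdout to the function.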

jfolz commented on August 16, 2024

We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.

ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
ERROR [597577:597597] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]
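
When comparing these errors across hosts, it can help to split each line into its parts. The following is a rough sketch of a parser for the line layout seen in this thread (level, `[pid:tid]`, message, `[file:line]`, `[function]`, with an optional leading timestamp). The layout is inferred from the quoted logs only, and `DcgmLogEntry` and `parse_dcgm_log_line` are names invented for the example.

```python
import re
from typing import NamedTuple, Optional


class DcgmLogEntry(NamedTuple):
    level: str
    pid: int
    tid: int
    message: str
    source_file: str
    source_line: int
    function: str


# Layout inferred from the log lines quoted in this thread, not from DCGM docs.
LOG_RE = re.compile(
    r"^(?:\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ )?"  # optional timestamp
    r"(?P<level>[A-Z]+) "
    r"\[(?P<pid>\d+):(?P<tid>\d+)\] "
    r"(?P<message>.*) "
    r"\[(?P<file>[^\[\]]+):(?P<line>\d+)\] "
    r"\[(?P<function>[^\[\]]+)\]$"
)


def parse_dcgm_log_line(line: str) -> Optional[DcgmLogEntry]:
    """Parse one nv-hostengine log line; return None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    return DcgmLogEntry(
        level=m.group("level"),
        pid=int(m.group("pid")),
        tid=int(m.group("tid")),
        message=m.group("message"),
        source_file=m.group("file"),
        source_line=int(m.group("line")),
        function=m.group("function"),
    )


# First error line from the comment above, copied verbatim.
line = (
    "ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). "
    "[/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] "
    "[DcgmGpmManagerEntity::MaybeFetchNewSample]"
)
entry = parse_dcgm_log_line(line)
print(entry.function, "|", entry.message)
```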

Not sure if this is relevant, but here are the metrics collected by dcgm-exporter.

DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
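
One way to test whether the metric list matters would be to bisect it, for example by running once with only the `DCGM_FI_DEV_*` fields. The sketch below splits a collectors file into the two groups. It rests on an unconfirmed assumption that the `DCGM_FI_PROF_*` fields are the ones sampled through the `nvmlGpmSampleGet()` path from the error above. It also assumes that `#` starts a comment line in the CSV. `split_collectors` is a made-up helper name.

```python
from typing import Dict, List


def split_collectors(csv_text: str) -> Dict[str, List[str]]:
    """Group field names from a dcgm-exporter collectors CSV by prefix."""
    groups: Dict[str, List[str]] = {"prof": [], "dev": []}
    for raw in csv_text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no field.
        if not line or line.startswith("#"):
            continue
        field = line.split(",", 1)[0].strip()
        key = "prof" if field.startswith("DCGM_FI_PROF_") else "dev"
        groups[key].append(field)
    return groups


# A few entries copied from the list above.
sample_csv = """\
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
"""
groups = split_collectors(sample_csv)
print(groups["dev"])
print(groups["prof"])
```

If the errors stop when the `prof` group is left out, that would point at the profiling path rather than at the exporter as a whole.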

@nikkon-dev please let us know if you need additional information, and if this is a DCGM or dcgm-exporter issue.

itzsimpl commented on August 16, 2024

@nikkon-dev Any news on this?

Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0

# dcgmi -v
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Hostengine build info:
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Still seeing the errors, and also found:

2024-02-01 13:34:36.214 ERROR [8878:8879] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@֟�O^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 13:34:36.215 ERROR [8878:8879] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 13:34:36.215 ERROR [8878:8879] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]

The system has the latest DGX OS 6.1, the latest firmware 1.1.3, and all Ubuntu package upgrades applied; the driver is 535.154.05.

pintohutch commented on August 16, 2024

FWIW I'm seeing these same messages using libraries from the nvcr.io/nvidia/cloud-native/dcgm:3.3.3-1-ubuntu22.04 container.

time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR

For the GPU firmware:

cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model: 		 NVIDIA L4
IRQ:   		 11
GPU UUID: 	 GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 	 95.04.29.00.07
Bus Type: 	 PCI
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:00:03.0
Device Minor: 	 0
GPU Firmware: 	 535.129.03
GPU Excluded:	 No

pintohutch commented on August 16, 2024

Though I want to point out, I'm deploying dcgm-exporter along with the DCGM libraries in "embedded mode" and I can see it exposing 0-value metrics for Field 230 (DCGM_FI_DEV_XID_ERRORS):

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0

So I dunno if dcgm-exporter is gracefully handling/defaulting here in the case of errors, or if it's reporting an incorrect metric because of real errors from DCGM.
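
To line the log messages up with the exported series, a small tally of the "No data is available" lines by entity and field can be set next to the scraped metrics. This is only a sketch: the regular expression is derived from the log lines quoted in this thread, the one-entry `FIELD_NAMES` table covers just the field 230 mapping mentioned above, and `tally_no_data` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the exporter log lines quoted in this thread.
NO_DATA_RE = re.compile(
    r"GetLatestSample returned No data is available "
    r"for entityId (?P<entity>\d+) groupId (?P<group>\d+) fieldId (?P<field>\d+)"
)

# Only the mapping stated in this thread; other IDs are printed as numbers.
FIELD_NAMES = {230: "DCGM_FI_DEV_XID_ERRORS"}


def tally_no_data(log_text: str) -> Counter:
    """Count 'No data is available' messages per (entityId, field name)."""
    counts: Counter = Counter()
    for m in NO_DATA_RE.finditer(log_text):
        field_id = int(m.group("field"))
        name = FIELD_NAMES.get(field_id, str(field_id))
        counts[(int(m.group("entity")), name)] += 1
    return counts


# Message part of the four log lines quoted earlier (source location trimmed).
sample_log = """\
GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230
GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230
"""
tally = tally_no_data(sample_log)
for (entity, field), count in sorted(tally.items()):
    print(f"entity {entity} {field}: {count}")
```

A GPU that shows up here and also exports a 0 for the same field at the same scrape time would suggest the 0 is a default rather than a measured value, though that would still need confirmation from the dcgm-exporter source.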
