Comments (6)
Can you please check the dmesg messages and confirm if you are using the GSP driver?
from dcgm.
The installed driver is 535.129.03. Checking as described in https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html:
# nvidia-smi -q | grep -i gsp
GSP Firmware Version : 535.129.03
but here is the per-GPU information file:
# cat /proc/driver/nvidia/gpus/0000\:1b\:00.0/information
Model: NVIDIA H100 80GB HBM3
IRQ: 18
GPU UUID: GPU-875f3ca0-9de4-e78c-9cea-5140b030b627
Video BIOS: 96.00.89.00.01
Bus Type: PCIe
DMA Size: 52 bits
DMA Mask: 0xfffffffffffff
Bus Location: 0000:1b:00.0
Device Minor: 0
GPU Firmware: 535.129.03
GPU Excluded: No
I don't see GSP mentioned in dmesg.
Could you provide more details on what to look for in dmesg?
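For reference, the `nvidia-smi -q` check above can be scripted. This is only a minimal sketch: it assumes the field reads `N/A` when GSP firmware is not in use (my reading of the linked README), and the helper name `gsp_firmware_version` is made up for illustration.

```python
import re

def gsp_firmware_version(nvidia_smi_q_output):
    """Return the GSP firmware version found in `nvidia-smi -q` text, or None.

    None means the field is missing or reads "N/A" (assumed to mean GSP is off).
    """
    match = re.search(
        r"^\s*GSP Firmware Version\s*:\s*(\S+)",
        nvidia_smi_q_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    version = match.group(1)
    return None if version.upper() == "N/A" else version

# Line copied from the output quoted above.
print(gsp_firmware_version("GSP Firmware Version : 535.129.03\n"))
```

In practice the input would be the captured output of `nvidia-smi -q`, for example via `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`.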
We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.
ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
ERROR [597577:597597] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]
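A rough sketch for decoding the two numeric codes in these log lines. The code-to-name mappings are assumptions recalled from `nvml.h` and `dcgm_structs.h`, not something confirmed in this thread, so please check them against your installed headers. The `decode` helper is hypothetical.

```python
import re

# Assumed code-to-name mappings, recalled from nvml.h and dcgm_structs.h.
# They are not stated in this thread; verify against your installed headers.
NVML_STATUS = {10: "NVML_ERROR_TIMEOUT"}
DCGM_STATUS = {-8: "DCGM_ST_NVML_ERROR"}

NVML_RE = re.compile(r"Got nvml st (-?\d+) from (\w+)\(\)")
DCGM_RE = re.compile(r"Got unexpected return (-?\d+) from ([\w.]+)")

def decode(line):
    """Return (library, call, code, name) for a recognised log line, else None."""
    for library, pattern, names in (
        ("nvml", NVML_RE, NVML_STATUS),
        ("dcgm", DCGM_RE, DCGM_STATUS),
    ):
        match = pattern.search(line)
        if match:
            code = int(match.group(1))
            return library, match.group(2), code, names.get(code, "unknown")
    return None

print(decode("ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet()."))
```

If the mappings hold, the second error would simply be DCGM reporting the NVML failure from the first line rather than a separate problem.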
Not sure if this is relevant, but here are the metrics collected by dcgm-exporter.
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
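One way to narrow this down would be to split the list into `DCGM_FI_DEV_*` and `DCGM_FI_PROF_*` fields and run the exporter with each half separately. The errors above come from `DcgmGpmManager.cpp`, which as far as I understand serves the profiling (`PROF`) fields on this hardware; that link is an assumption, not something confirmed in this thread. A minimal sketch with a hypothetical `split_counters` helper:

```python
import csv
import io

def split_counters(csv_text):
    """Split a dcgm-exporter counters list into (device_fields, profiling_fields)."""
    dev, prof = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # blank line
        name = row[0].strip()  # tolerate stray spaces such as "DCGM_FI_DEV_DEC_UTIL ,"
        if not name or name.startswith("#"):
            continue  # whitespace-only or comment line
        if name.startswith("DCGM_FI_PROF_"):
            prof.append(name)
        else:
            dev.append(name)
    return dev, prof

sample = (
    "DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).\n"
    "DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).\n"
)
dev_fields, prof_fields = split_counters(sample)
print(dev_fields, prof_fields)
```

If the `nvmlGpmSampleGet()` errors disappear with only the `DEV` half enabled, that would point at the profiling fields.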
@nikkon-dev please let us know if you need additional information, and whether this is a DCGM or a dcgm-exporter issue.
@nikkon-dev Any news on this?
Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0
# dcgmi -v
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e
Hostengine build info:
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e
Still seeing the errors, and also found these:
2024-02-01 13:34:36.214 ERROR [8878:8879] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@֟�O^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 13:34:36.215 ERROR [8878:8879] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 13:34:36.215 ERROR [8878:8879] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
The system has the latest DGX OS 6.1, the latest fw 1.1.3, and all Ubuntu package upgrades applied; the driver is 535.154.05.
FWIW I'm seeing these same messages using libraries from the nvcr.io/nvidia/cloud-native/dcgm:3.3.3-1-ubuntu22.04 container.
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
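To see which entities and fields are affected without reading every line, the "No data is available" messages can be tallied. A small sketch that relies only on the message format shown in the lines above; `count_no_data` is a hypothetical helper name.

```python
import re
from collections import Counter

NO_DATA_RE = re.compile(
    r"No data is available for entityId (\d+) groupId (\d+) fieldId (\d+)"
)

def count_no_data(log_lines):
    """Count "No data is available" messages per (entityId, fieldId) pair."""
    counts = Counter()
    for line in log_lines:
        match = NO_DATA_RE.search(line)
        if match:
            entity_id, _group_id, field_id = map(int, match.groups())
            counts[(entity_id, field_id)] += 1
    return counts

lines = [
    'msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [...]"',
    'msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [...]"',
]
print(count_no_data(lines))
```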
For the GPU firmware:
cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model: NVIDIA L4
IRQ: 11
GPU UUID: GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 95.04.29.00.07
Bus Type: PCI
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:00:03.0
Device Minor: 0
GPU Firmware: 535.129.03
GPU Excluded: No
Though I want to point out, I'm deploying dcgm-exporter along with the DCGM libraries in "embedded mode" and I can see it exposing 0-value metrics for Field 230 (DCGM_FI_DEV_XID_ERRORS):
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
So I dunno if dcgm-exporter is elegantly handling/defaulting here in the case of errors, or if it's reporting an incorrect metric because of real errors from DCGM.
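To line the exported values up against the entity IDs in the "No data" log messages, the scrape output can be reduced to the distinct XID values reported per GPU. This is a sketch only: the label parsing is a simplified take on the Prometheus text format, and `parse_sample` and `xid_values_by_gpu` are hypothetical helpers.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_sample(line):
    """Parse one labelled sample line into (name, labels, value), or None."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None  # comment lines (# HELP / # TYPE) and unlabelled lines land here
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))

def xid_values_by_gpu(lines):
    """Map each gpu label to the sorted distinct DCGM_FI_DEV_XID_ERRORS values seen."""
    seen = {}
    for line in lines:
        parsed = parse_sample(line)
        if parsed and parsed[0] == "DCGM_FI_DEV_XID_ERRORS":
            _name, labels, value = parsed
            seen.setdefault(labels.get("gpu"), set()).add(value)
    return {gpu: sorted(values) for gpu, values in seen.items()}

scrape = [
    "# TYPE DCGM_FI_DEV_XID_ERRORS gauge",
    'DCGM_FI_DEV_XID_ERRORS{gpu="0",device="nvidia0",modelName="NVIDIA L4"} 0',
    'DCGM_FI_DEV_XID_ERRORS{gpu="1",device="nvidia1",modelName="NVIDIA L4"} 0',
]
print(xid_values_by_gpu(scrape))
```

This does not settle whether the 0 is a default or a real reading; it only makes it easier to check whether every GPU that logs "No data" is also the one exporting 0.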