Comments (5)

nvvfedorov avatar nvvfedorov commented on August 27, 2024

@jacksonyi0, more information is needed to find the cause. Based on the shared log messages, I suspect that the container you're running doesn't have sufficient permissions. Try running the container in privileged mode: it needs to access GPU data, which requires root-level privileges.

from dcgm-exporter.

PrakChandra avatar PrakChandra commented on August 27, 2024

@nvvfedorov I am not getting the desired result here. I have installed DCGM and nv-hostengine on my GPU machine.

```
time="2024-05-24T04:53:22Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T04:53:22Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T04:53:22Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-24T04:53:22Z" level=info msg="Pipeline starting"
time="2024-05-24T04:53:22Z" level=info msg="Starting webserver"
```

```
[root@ip-10-20-61-45 dcgm]# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
```


```
[root@ip-10-20-61-45 etc]# nv-hostengine -f host.log --log-level debug
Host engine already running with pid 1240135
```


```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    27W /  70W |  11070MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     33357      C   nvidia-cuda-mps-server             23MiB |
|    0   N/A  N/A    282750    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    282754    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    886169    M+C   /usr/bin/python3                 1235MiB |
|    0   N/A  N/A    886170    M+C   /usr/bin/python3                 1237MiB |
|    0   N/A  N/A   1000687    M+C   /usr/bin/python3                 1351MiB |
|    0   N/A  N/A   1000688    M+C   /usr/bin/python3                 1349MiB |
|    0   N/A  N/A   1232400    M+C   /usr/bin/python3                 2613MiB |
```


```
[root@ip-10-20-61-45 etc]# dcgmi dmon -e 1004
#Entity   TENSO
ID
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
```

Could you please suggest what is wrong here and why I am not able to get the profiling metrics?
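
For anyone comparing several hosts, the module table printed by `dcgmi modules -l` can be checked mechanically. The following is only a minimal sketch: `module_states` is a made-up helper name, the row layout is assumed to match the output pasted above, and it reads text only (it does not call any DCGM API). As far as I understand, DCGM loads modules on demand, so "Not loaded" on its own may not indicate a fault.

```python
def module_states(table_text):
    """Map module name -> state from `dcgmi modules -l` style output."""
    states = {}
    for line in table_text.splitlines():
        # Data rows look like: | 8 | Profiling | Not loaded |
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only rows with three cells whose first cell is a numeric ID;
        # this skips borders, the title rows, and the column header.
        if len(cells) == 3 and cells[0].isdigit():
            states[cells[1]] = cells[2]
    return states


# A few rows copied from the table in this comment.
SAMPLE = """\
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Not loaded                                       |
"""

states = module_states(SAMPLE)
print("Profiling module state:", states.get("Profiling", "missing"))
```
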



nvvfedorov avatar nvvfedorov commented on August 27, 2024

@PrakChandra, please share the DCGM logs, which you can find here: /var/log/nvidia-dcgm/*.log.


PrakChandra avatar PrakChandra commented on August 27, 2024

@nvvfedorov I do not see the nvidia-dcgm folder in /var/log in my container.
An update: when I change the dcgm-exporter image tag to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the following logs:

```
time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T08:48:21Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T08:48:21Z" level=info msg="Collecting DCP Metrics"
time="2024-05-24T08:48:21Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-05-24T08:48:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-24T08:48:21Z" level=info msg="Pipeline starting"
time="2024-05-24T08:48:21Z" level=info msg="Starting webserver"
```

However, I am still not getting the metrics in Grafana, and the nv-hostengine logs do not look good:

```
2024-05-27 06:51:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
```


Am I missing something?
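
The same few errors repeat roughly every 30 seconds, so they are easier to read once grouped by field ID and reason. Below is a minimal sketch, assuming the `GetLatestSample returned ... fieldId N` line format shown in the nv-hostengine log above; `summarize` and `PATTERN` are hypothetical names, and the sample lines have their trailing source-path suffix trimmed to `[...]` for brevity.

```python
import re
from collections import Counter

# Matches: "GetLatestSample returned <reason> for entityId N groupId N fieldId N"
PATTERN = re.compile(
    r"GetLatestSample returned (?P<reason>.+?) for entityId \d+ "
    r"groupId \d+ fieldId (?P<field>\d+)"
)


def summarize(log_text):
    """Count (fieldId, reason) pairs found in nv-hostengine error lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(int(match.group("field")), match.group("reason"))] += 1
    return counts


# Lines taken from the log above, with the file/function suffix shortened.
SAMPLE = """\
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [...]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [...]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [...]
"""

for (field_id, reason), count in sorted(summarize(SAMPLE).items()):
    print(f"fieldId {field_id}: {reason} (x{count})")
```

Grouped this way, the full log above reduces to two field IDs (230 and 449) plus the recurring `nvmlDeviceGetFieldValues` line.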


PrakChandra avatar PrakChandra commented on August 27, 2024

Also, the default-counters.csv file looks like this:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
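
A quick way to see which fields this file actually enables is to drop the comment lines and split the rest. This is a minimal sketch, assuming the `DCGM FIELD, Prometheus metric type, help message` layout described in the file header; `enabled_fields` is a hypothetical helper, not part of dcgm-exporter. Note that none of the uncommented lines in the file pasted above starts with `DCGM_FI_PROF_`, the prefix DCGM uses for profiling fields.

```python
def enabled_fields(csv_text):
    """Return (field, metric_type, help) tuples for active (uncommented) lines."""
    fields = []
    for line in csv_text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are not active entries.
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) == 3:
            fields.append(tuple(parts))
    return fields


# A few lines copied from the file above.
SAMPLE = """\
# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
"""

fields = enabled_fields(SAMPLE)
profiling = [name for name, _, _ in fields if name.startswith("DCGM_FI_PROF_")]
print(len(fields), "enabled fields,", len(profiling), "of them profiling fields")
```
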
