
NVIDIA Data Center GPU Manager


Data Center GPU Manager (DCGM) is a daemon that allows users to monitor NVIDIA data-center GPUs. You can find out more about DCGM by visiting DCGM's official page.


Introduction

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners.

DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, NVIDIA Validation Suite (NVVS) and source examples for using the API (C, Python and Go).

DCGM integrates into the Kubernetes ecosystem by allowing users to gather GPU telemetry using dcgm-exporter.

More information is available on DCGM's official page.
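As a quick, unofficial illustration of the Python API, the sketch below reads a couple of device-level fields once DCGM is installed and the nv-hostengine service is running (see the Quickstart below). The bindings path matches the installer packages referenced later in this document, but the DcgmReader constructor arguments and method names shown here are assumptions that may vary between DCGM versions, so check the bindings shipped with your install.

import sys
sys.path.append('/usr/local/dcgm/bindings/python3')  # bindings location used by the installer packages

import dcgm_fields                 # field ID constants, e.g. DCGM_FI_DEV_GPU_TEMP
from DcgmReader import DcgmReader  # high-level helper shipped with the bindings

# Watch GPU temperature and power draw; updateFrequency is in microseconds (here, once per second).
reader = DcgmReader(fieldIds=[dcgm_fields.DCGM_FI_DEV_GPU_TEMP,
                              dcgm_fields.DCGM_FI_DEV_POWER_USAGE],
                    updateFrequency=1000000)

# Most recent value per GPU ID and field ID (method name taken from the bindings' sample scripts).
for gpuId, fields in reader.GetLatestGpuValuesAsFieldIdDict().items():
    print("GPU %d: %s" % (gpuId, fields))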

Quickstart

DCGM installer packages are available on the CUDA network repository and DCGM can be easily installed using Linux package managers.

Ubuntu LTS

Set up the CUDA network repository meta-data and GPG key:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

Install DCGM

$ sudo apt-get update \
    && sudo apt-get install -y datacenter-gpu-manager

Red Hat

Set up the CUDA network repository meta-data and GPG key:

$ sudo dnf config-manager \
    --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Install DCGM

$ sudo dnf clean expire-cache \
    && sudo dnf install -y datacenter-gpu-manager

Start the DCGM service

$ sudo systemctl --now enable nvidia-dcgm
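Once the service is running, a quick way to sanity-check it from the Python bindings is sketched below. The class and method names (pydcgm.DcgmHandle, GetSystem, discovery.GetAllGpuIds) are assumptions based on the bindings installed under /usr/local/dcgm/bindings and may differ between DCGM versions; dcgmi discovery -l gives similar information from the command line.

import sys
sys.path.append('/usr/local/dcgm/bindings/python3')  # bindings installed by the DCGM packages

import pydcgm
import dcgm_structs

# Connect to the nv-hostengine started by the nvidia-dcgm service on this machine.
handle = pydcgm.DcgmHandle(ipAddress='localhost',
                           opMode=dcgm_structs.DCGM_OPERATION_MODE_AUTO)
system = handle.GetSystem()

# List the GPU IDs DCGM can see; an exception above or an empty list usually means
# the service is not running or no supported GPUs are present.
print("GPU IDs visible to DCGM:", system.discovery.GetAllGpuIds())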

Product Documentation

For information on platform support, getting started and using DCGM APIs, visit the official documentation repository.

Building DCGM

Once this repo is cloned, DCGM can be built by:

  • creating the build image
  • using the build image to generate a DCGM build

Why a Docker build image

The Docker build image provides two benefits

  • easier builds as the environment is controlled
  • a repeatable environment which creates reproducible builds

New dependencies can be added by adding a script in the “scripts” directory similar to the existing scripts.

As DCGM needs to support some older Linux distributions on various CPU architectures, the image provides custom builds of GCC compilers that produce binaries which depend on older versions of the glibc libraries. The DCGM build image also contains all third-party libraries, precompiled using those custom GCC builds.

Prerequisites

In order to create the build image and to then generate a DCGM build, you will need to have the following installed and configured:

  • git and git-lfs (can be skipped if downloading a snapshot archive)
  • realpath, basename, nproc (coreutils deb package)
  • gawk (via the awk binary; may not work on macOS without installing a gawk port)
  • getopt (util-linux deb package)
  • recent version of Docker

The build.sh script has been tested on Linux, Windows (WSL2) and macOS, though macOS may need some minor changes to the script (such as s/awk/gawk/), as macOS is not an officially supported development environment.

Creating the build image

The build image is stored in ./dcgmbuild.

The image can be built by:

  • ensuring Docker is installed and running
  • navigating to ./dcgmbuild
  • running ./build.sh

Note that if your user does not have permission to access the Docker socket, you will need to run sudo ./build.sh.

The build process may take several hours, as the image builds three versions of the GCC toolset for all supported platforms. Once the image has been built, it can be reused to build DCGM.

Generating a DCGM build

Once the build image is created, you can run build.sh to produce builds. A simple Debian package build of release (non-debug) code for an x86_64 system can be made with:

./build.sh -r --deb

The package will be placed in _out/Linux-amd64-release/datacenter-gpu-manager_2.1.4_amd64.deb; it can then be installed as needed. The script also includes options for building just the binaries (the default), tarballs (--packages), or an RPM (--rpm). A complete list of options can be seen using ./build.sh -h.

Running the Test Framework

DCGM includes an extensive test suite that can be run on any system with one or more supported GPUs. After successfully building DCGM, a datacenter-gpu-manager-tests package is created alongside the normal DCGM package. There are multiple ways to run the tests, but the most straightforward steps are the following:

  1. Install or extract the datacenter-gpu-manager-tests package
  2. Navigate to usr/share/dcgm_tests
  3. Execute run_tests.sh

Notes:

  • The location of the tests depends on the type of DCGM package. If the installed package was a .deb or .rpm file then the location is /usr/share/dcgm_tests. If the package was a .tar.gz then the location is relative to where it was uncompressed.
  • The test suite utilizes DCGM's python bindings. Python version 2 is the only supported version at this time.
  • Running the tests as root is not required. However, some tests that require root permissions will not execute. For maximum test coverage, running the tests as root is recommended.
  • The entire test suite can take anywhere from 10 to >60 minutes depending on the speed of the machine, the number of GPUs and the presence of additional NVIDIA hardware such as NVSwitches and NVLinks. On most systems the average time is ~30 minutes.
  • Please do not file bug reports based on test failures. While great effort is made to ensure the tests are stable and resilient, transient failures will occur from time to time.
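Since the test suite drives DCGM through the Python bindings, the same bindings can also be imported from your own scripts. A minimal sketch, assuming a .deb/.rpm install so the bindings live under /usr/local/dcgm/bindings (the python3 directory name is taken from paths seen in the issue reports further below and depends on the DCGM version):

import sys
# For .deb/.rpm installs the bindings live here; for a .tar.gz, point this at the extracted bindings directory.
sys.path.append('/usr/local/dcgm/bindings/python3')

import dcgm_fields   # field ID constants used throughout the tests and samples
import dcgm_structs  # structures, enums and error codes

print("GPU temperature field ID:", dcgm_fields.DCGM_FI_DEV_GPU_TEMP)
print("Auto operation mode constant:", dcgm_structs.DCGM_OPERATION_MODE_AUTO)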

Reporting An Issue

Issues in DCGM can be reported by opening an issue on GitHub. When reporting an issue, please include:

  • A description of the problem.
  • Steps to reproduce the issue.
  • If the issue cannot be reproduced, then please provide as much information as possible about conditions on the system that bring it about, how often it happens, and any other discernible patterns around when the bug occurs.
  • Relevant configuration information, potentially including things like: whether or not this is in a container, VM, or bare metal environment, GPU SKU, number of GPUs on the system, operating system, driver version, GPU power settings, etc.
  • DCGM logging for the error.
    • For nv-hostengine, you can start it with -f /tmp/hostengine.log --log-level ERROR to generate a log file with all error messages in /tmp/hostengine.log.
    • For the diagnostic, you can add --debugLogFile /tmp/diag.log -d ERROR to your command line in order to generate /tmp/diag.log with all error messages.
    • For any additional information about the command line, please see the documentation.
  • The output of nvidia-smi and nvidia-smi -q.
  • Full output of dcgmi -v.

The following template may be helpful:

GPU SKU(s):
OS:
DRIVER:
GPU power settings (nvidia-smi -q -d POWER):
CPU(s):
RAM:
Topology (nvidia-smi topo -m):

Reporting Security Issues

We ask that all community members and users of DCGM follow the standard NVIDIA process for reporting security vulnerabilities. This process is documented at the NVIDIA Product Security website. Following the process will result in any needed CVE being created, as well as appropriate notifications being communicated to the entire DCGM community.

Please refer to the policies listed there to answer questions related to reporting security issues.

Tagging DCGM Releases

DCGM releases will be tagged once the release is finalized. The last commit will be the one that sets the release version, and we will then tag the release. Release tags will be the release version prefixed with a v. For example, v2.0.13.

License

The source code for DCGM in this repository is licensed under Apache 2.0. Binary installer packages for DCGM are available for download from the product page and are licensed under the NVIDIA DCGM SLA.



dcgm's Issues

What is the sampling period of DCGM

I am using the dcgm library to collect some GPU metrics, but I don't know how to set the three parameters updateFreq, maxKeepAge, and maxKeepSamples.
Can you describe the specific function and usage of these parameters and their relationship with the underlying sampling period of DCGM?
var updateFreq int64 = 100000 // 100ms
var maxKeepAge float64 = 0
var maxKeepSamples int32 = 1
err = dcgm.WatchFieldsWithGroupEx(fieldsId, groupId, updateFreq, maxKeepAge, maxKeepSamples)
if err != nil {
    _ = dcgm.FieldGroupDestroy(fieldsId)
    return
}

build error from ct-ng build

I just followed the GitHub guide (sudo ./build.sh) to build DCGM; however, I keep getting an error when it comes to the ct-ng build command in the Dockerfile.

(screenshot of the ct-ng build error attached)

I then tried to skip the Dockerfile and build on my own on my local machine. When I run CT_PREFIX=/opt/cross ct-ng build -j12 on my command line, it gives the same error. I checked the build.log file; it says ct-ng tries to create a directory named '', which should not be possible, although I don't know why ct-ng wants to do that there. Here is the build.log error message. Is this the cause of the error?

(screenshot of the build.log error message attached)

I tried finding a solution but have not found one. Please let me know if you can reproduce this error and have a solution.

How to enable the DCGM service when building a DCGM debug version from source code

Hi, I am building the DCGM debug version from source code, and it has been built successfully on a server. After building, it generates an executable file ./DCGM/_out/Linux-amd64-debug/bin/dcgmi. The executable dcgmi can be run for some simple commands, such as ./dcgmi --help and ./dcgmi -v. But for some more complex commands, such as ./dcgmi group --list, it generates:
"Error: unable to establish a connection to the specified host: localhost"
"Error: Unable to connect to host engine. Host engine connection invalid/disconnected."

Previously, I also installed the prebuilt version of DCGM on another server, and this problem also happened. But when I entered
"sudo systemctl --now enable nvidia-dcgm", the DCGM service started and the problem was solved.

But now, when I compile DCGM from source code and run "sudo systemctl --now enable nvidia-dcgm", it generates:
"Failed to enable unit: Unit file nvidia-dcgm.service does not exist." OR
"Failed to execute operation: No such file or directory".

So please help me: how can I start the DCGM service when I build the DCGM debug version from source code?

The GPU environment on the server is shown in the attached screenshot.

libdcgmmoduleprofiling.so missing from RPM build in 2.3.4

Hello,

After building the build image, compiling for the amd64 arch, and running through a series of tests, the profiling module fails to load:

[root@casper27 ~]# dcgmi profile -l
Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.
[root@casper27 ~]# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+ 

/usr/lib64/libdcgmmoduleprofiling.so and friends are missing from the 2.3.4 series that this GitHub repo has (at least when building the RPM). Downloading the RHEL 8 RPM directly and inspecting the contents shows the files are in there.

[root@casper27 tmp]# rpm -ql datacenter-gpu-manager | grep profiling
[root@casper27 tmp]# rpm -qlp ./datacenter-gpu-manager-2.3.6-1-x86_64.rpm | grep profiling
warning: ./datacenter-gpu-manager-2.3.6-1-x86_64.rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
/usr/lib64/libdcgmmoduleprofiling.so
/usr/lib64/libdcgmmoduleprofiling.so.2
/usr/lib64/libdcgmmoduleprofiling.so.2.3.6

Any chance that this github repo is going to be synced up in the near future with what's being distributed as binary blobs?

Questions about field identifiers

I have 3 questions about DCGM. I noticed that there are field identifiers like memory utilization and GPU utilization.

  1. How are these metrics calculated?
  2. What if I want to monitor the running status of each core of the GPU? Is there an API for that?
  3. Is there any way to monitor the highest GPU memory use during a time period? I only have a toy plan: collect multiple records during a time period and return the maximum among them (a rough sketch of this plan is included below).
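A rough sketch of that toy plan, using the Python bindings' DcgmReader (the constructor arguments and method names here are my assumptions based on the sample scripts and may not be exact):

import sys, time
sys.path.append('/usr/local/dcgm/bindings/python3')

import dcgm_fields
from DcgmReader import DcgmReader

# Poll framebuffer usage once per second for 60 seconds and keep the highest value seen per GPU.
reader = DcgmReader(fieldIds=[dcgm_fields.DCGM_FI_DEV_FB_USED], updateFrequency=1000000)
peak = {}
for _ in range(60):
    for gpuId, fields in reader.GetLatestGpuValuesAsFieldIdDict().items():
        value = fields.get(dcgm_fields.DCGM_FI_DEV_FB_USED)
        if value is not None:
            peak[gpuId] = max(peak.get(gpuId, 0), value)
    time.sleep(1)
print(peak)  # highest framebuffer usage observed per GPU during the window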

query profiling metrics raises an exception

Hello, I am new to DCGM. I would like to collect the profiling metrics, but there seem to be several problems.

I am using dcgm exporter to collect the profiling metrics, but unfortunately, there are some problems.
NVIDIA/gpu-monitoring-tools#189

To check that DCGM is working properly, I tested using the following command and got some error messages:

root@cl-platform-gpu01:/# dcgmi dmon -e 1004
# Entity  TENSO
      Id
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

I also ran the profiler tester, and it seems okay:

root@cl-platform-gpu01:/# /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 10
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping WatchFields() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (76460.9 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (76553.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83514.9 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83709.5 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78960.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78831.1 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (81628.5 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79509.4 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83713.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (84599.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79126.2 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79630.3 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (80746.4 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (80022.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83400.0 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83205.9 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78649.8 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78736.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (84357.4 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83102.6 gflops)
Worker 1:0[1004]: Message: Bus ID 00000000:1B:00.0 mapped to cuda device ID 1
DCGM CudaContext Init completed successfully.

CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CUDA_VISIBLE_DEVICES:
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 80
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 98304
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 4096
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 877
Max Memory bandwidth: 898048000000 bytes (898.0 GiB)
CU_DEVICE_ATTRIBUTE_ECC_SUPPORT: true

Worker 0:0[1004]: Message: Bus ID 00000000:04:00.0 mapped to cuda device ID 0
DCGM CudaContext Init completed successfully.

CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CUDA_VISIBLE_DEVICES:
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 80
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 98304
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 4096
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 877
Max Memory bandwidth: 898048000000 bytes (898.0 GiB)
CU_DEVICE_ATTRIBUTE_ECC_SUPPORT: true

Skipping UnwatchFields() since DCGM validation is disabled

The version I used is:

root@cl-platform-gpu01:/# dcgmi --version

dcgmi  version: 2.1.4

The OS is:

CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I am using the following Docker image:

nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04

The NVIDIA driver and GPUs are:

450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

What should I check to resolve this issue?

For profiling metrics, dcgmi reports an error message: The third-party Profiling module returned an unrecoverable error

The problem is that dcgmi cannot query profiling metrics.

  • Tesla V100 GPU. DCGM version: 3.1.3

  • nvidia-smi :

Tue Jan 10 16:00:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |

  • dcgmi modules -l shows that the Profiling module is loaded:
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Loaded                                           |
+-----------+--------------------+--------------------------------------------------+

  • dcgmi query profiling metrics: SM occupancy

dcgmi dmon -e 1002,1003
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

  • nv-hostengine debug log:

2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] AddMetricToConfig was successful for deviceIndex 0. metricName sm__warps_active.avg.pct_of_peak_sustained_elapsed [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:905] [DcgmLopConfig::AddMetricToConfig]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] CreateConfigAndPrefixImages was successful for deviceIndex 0. imageSize 172 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:348] [DcgmLopConfig::CreateConfigAndPrefixImages]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] counterDataImageSize 15316 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:237] [DcgmLopConfig::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] InitializeCounterData was successful for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:96] [DcgmLopConfig::InitializeCounterData]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] InitializeWithMetrics was successful for deviceIndex 0. 2 metrics [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:249] [DcgmLopConfig::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] Successfully added 1 metric groups. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:166] [DcgmLopGpu::InitializeWithMetrics]
2023-01-09 17:34:34.533 DEBUG [933520:933752] [[Profiling]] Enabling metrics for gpuId 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2588] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReconfigureLopGpu]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] [PerfWorks] Got status 1 from NVPW_DCGM_PeriodicSampler_BeginSession() on deviceIndex 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmLopGpu.cpp:351] [DcgmLopGpu::BeginSession]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] EnableMetrics returned -37 The third-party Profiling module returned an unrecoverable error [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2591] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReconfigureLopGpu]
2023-01-09 17:34:34.535 ERROR [933520:933752] [[Profiling]] Unable to reconfigure LOP metric watches for GpuId {0} [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2680] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ChangeWatchState]

zlib download gives 404 since zlib.net updated version.

The zlib download is broken now that the upstream release is 1.2.12 (pushed out late last month). The previous 1.2.11 download location gives an HTTP 404. I'd be happy to do a PR, but I need some guidance on whether to use https://zlib.net or to pick a mirror that can be pinned to specific versions, as it appears zlib.net doesn't host older versions.

Alternatively, something as simple as upgrading to the 1.2.12 release should be fine too; I've done that in my local copy of this repo, but it's still building.

Avoid building architectures that are not locally used during the image building stage

Is it possible in the future to have an option to avoid building architectures in the build image that are not going to be used? I'm specifically talking about the cross compilers being built and such. For example, we don't have ARM or PowerPC64-LE systems where we'd use DCGM.

I've been working on getting DCGM built, but when I hit an issue such as #27, the build has to start all the way from the beginning. This is not a fast build process and therefore seemingly takes many hours to iterate on.

1:2.3.4 version dcgm_prometheus.py error AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'

We followed the doc here:
https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/integrating-with-dcgm.html#starting-prometheus-client

and it looks like the new version of datacenter-gpu-manager has an issue with this script:

python3 dcgm_prometheus.py -e
Traceback (most recent call last):
  File "dcgm_prometheus.py", line 264, in <module>
    main()
  File "dcgm_prometheus.py", line 257, in main
    prometheus_obj.LogBasicInformation()
  File "dcgm_prometheus.py", line 142, in LogBasicInformation
    for fieldId in self.m_publishFieldIds:
AttributeError: 'DcgmPrometheus' object has no attribute 'm_publishFieldIds'

datacenter-gpu-manager is already installed:
Version table:
1:2.3.4 600

Docker build fails at "Retrieving 'isl-0.24'"

The source code was downloaded from GitHub and built under Ubuntu 20.04 on an Intel platform. The build failed at the line "Retrieving 'isl-0.24'". After doing a search, the line is found in the file 'x86_64.config' under the "dcgmbuild" directory. Apparently, when the build script tried to fetch the "isl-0.24" package from "isl.gforge.inria.fr", it failed. Changing the "CT_ISL_MIRRORS" line to point to "http://libisl.sourceforge.io" seems to solve the build issue.

failed to trigger dcgm service

To whom it may concern,

We observe the following error a lot with our DCGM script:
[root@gpu06 ~]# less /opt/slurm/tmp/gpu-stats-871006-gpu06.cluster.out
Error: Unable to retrieve job statistics. Return: No data is available.
It works for some jobs but fails for other jobs. No clue as to what is happening.

When I check the service, it is running:
[root@gpu06 ~]# systemctl status dcgm.service
● dcgm.service - DCGM service
Loaded: loaded (/etc/systemd/system/dcgm.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-01-07 17:03:34 CST; 1 weeks 5 days ago
Main PID: 1712 (nv-hostengine)
Tasks: 7 (limit: 2464686)
Memory: 32.6M
CGroup: /system.slice/dcgm.service
└─1712 /usr/bin/nv-hostengine -n

Jan 07 17:03:34 gpu06.cluster systemd[1]: Started DCGM service.
Jan 07 17:03:39 gpu06.cluster nv-hostengine[1712]: DCGM initialized
Jan 07 17:03:39 gpu06.cluster nv-hostengine[1712]: Host Engine Listener Started

I wonder if something in my prologue or epilogue script is not set correctly.
Could you look at them? See files attached. Much appreciated!
prologue.txt
epilogue.txt

Question about DCGM fields

I was going through the different dcgm fields and had the following questions:

Question-1: What is the difference between the below fields
DCGM_FI_DEV_GPU_UTIL vs DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_MEM_COPY_UTIL vs DCGM_FI_PROF_DRAM_ACTIVE

Question-2:
I am looking to track the following metrics for my AI workloads:

  1. GPU utilization over time - Idea here is to understand if the task is utilizing the GPU efficiently or is it just wasting GPU resources
  2. Memory utilization over time - Idea here is to know if there is scope to increase the batchsize of the AI workload
  3. How effectively is my task utilizing the GPU parallelization capabilities? Some stat to understand if there is further scope to parallelize the computation, something like % core utilization and total cores. (Perhaps DCGM_FI_PROF_SM_ACTIVE?)
  4. Can I increase my computation batch size? This I believe should come from some memory utilization stat.

Could you please advise which fields I should look into for the above stats? Note that I require the same set of metrics on both Tesla T4 and A100 (MIG) cards. (Asking as issue #58 seems to mention that the above-mentioned *_DEV_* fields do not work for MIGs.)

Also, the DCGM documentation says not all fields can be queried in parallel. Does this apply only to the *_PROF_* fields or also to the *_DEV_* fields? More specifically, I wanted to know whether DCGM_FI_DEV_GPU_UTIL can be allotted to any group or whether it needs to be part of its own group.

Error: Cannot find the DCGM Python bindings

(screenshot attached)
The documentation told us that we should find the Python bindings in /usr/src/dcgm/bindings, but the Python bindings are actually in /usr/local/dcgm/bindings. Please update the documentation.

Build dcgm container images

Hi, I'm currently a bit confused - is this the repository to build these container images?
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/dcgm

I want to build ppc64le container images, because on NGC there are only x86 images; however, it seems the corresponding tags are missing in this git repository (e.g. the latest 2.3.2). On the other hand, I can see that those packages are being built for Power if I search for datacenter-gpu-manager in the repository
https://developer.download.nvidia.com/compute/cuda/repos/rhel8/ppc64le/

I'd appreciate instructions on how to build the images :)

Thanks a lot!

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

Hi,

I have tried to monitor some fields of my GPUs (GTX 3090). The configuration is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57 Driver Version: 515.57 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:18:00.0 Off | N/A |
| 30% 42C P8 26W / 350W | 17MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:3B:00.0 Off | N/A |
| 31% 42C P8 23W / 350W | 7MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:86:00.0 Off | N/A |
| 31% 42C P8 23W / 350W | 7MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:AF:00.0 Off | N/A |
| 31% 43C P8 38W / 350W | 7MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

When I tried the following command

dcgmi dmon -d 100 -e 1011 --host 127.0.0.1:39999

The system reported: "Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded".

Any suggestion?
Thank you very much and looking forward to your reply!

Best regards,
Qiang Wang

DCGM Does Not Refresh Updates For Watched Profiling Fields As Expected

Our team is working on some tools for profiling the performance of deep learning programs with the help of the DCGM Python bindings. Lately a weird thing we noticed has bugged us a lot, and we hope to get some insights here.

Testing Env:

  • driver version: 450.80.02
  • cuda version: 11.2
  • dcgm version 2.3.2

Details

(screenshot of the monitoring tool output attached)

As shown in the attached screenshot, we launched the same workloads on 4 GPUs on one node and used a dstat-like tool we developed to monitor their performance at the same time. At the beginning, the SM activity column showed meaningful statistics as expected, but after a certain period of time, all data (of the profiling fields) on GPU 0 became 0, whereas the others still showed reasonable output. Furthermore, we also noticed the same behavior using dcgmi; the command is as follows:

dcgmi dmon -e 203,1002 -i 0,1,2,3

(screenshot of the dcgmi dmon output attached)

Any advice or help is appreciated, thanks.

A question about each sampling record

By using the API GetAllSinceLastCall_v2, we can obtain multiple sampling records. I have a question about the records. For example, if we monitor DCGM_FI_PROF_PIPE_FP32_ACTIVE, each record has a timestamp and a value. Is the value the maximum FP32 active ratio over the duration from the previous timestamp to the current timestamp, or the average FP32 active ratio?

Error adding GPU 1 to group: This GPU is not supported by DCGM

dcgmi version: 3.1.3
dcgm-exporter version: 3.1.3-3.1.2
NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2
The GPU is a V100.
My question:
dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Collecting DCP Metrics
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv
INFO[0000] Initializing system entities of type: GPU
FATA[0000] Error adding GPU 1 to group: This GPU is not supported by DCGM

Add timestamp to metrics through dmon

Is there a way to add a timestamp to the metrics collected by dmon? I want to align the results with metrics from other sources, such as the throughput of NICs.

Mirror down

A mirror was down for the Docker build. #11 fixes it.

Step 16/39 : RUN CT_PREFIX=/opt/cross ct-ng build -j12
 ---> Running in 92b985b682ed
[INFO ]  Performing some trivial sanity checks
[INFO ]  Build started 20211212.135153
[INFO ]  Building environment variables
[EXTRA]  Preparing working directories
[EXTRA]  Installing user-supplied crosstool-NG configuration
[EXTRA]  =================================================================
[EXTRA]  Dumping internal crosstool-NG configuration
[EXTRA]    Building a toolchain for:
[EXTRA]      build  = x86_64-pc-linux-gnu
[EXTRA]      host   = x86_64-pc-linux-gnu
[EXTRA]      target = x86_64-linux-gnu
[EXTRA]  Dumping internal crosstool-NG configuration: done in 0.05s (at 00:01)
[INFO ]  =================================================================
[INFO ]  Retrieving needed toolchain components' tarballs
[EXTRA]    Retrieving 'linux-3.16.85'
[EXTRA]    Verifying SHA512 checksum for 'linux-3.16.85.tar.xz'
[EXTRA]    Saving 'linux-3.16.85.tar.xz' to local storage
[EXTRA]    Retrieving 'zlib-1.2.11'
[EXTRA]    Verifying SHA512 checksum for 'zlib-1.2.11.tar.xz'
[EXTRA]    Saving 'zlib-1.2.11.tar.xz' to local storage
[EXTRA]    Retrieving 'gmp-6.2.1'
[EXTRA]    Verifying SHA512 checksum for 'gmp-6.2.1.tar.xz'
[EXTRA]    Saving 'gmp-6.2.1.tar.xz' to local storage
[EXTRA]    Retrieving 'mpfr-4.1.0'
[EXTRA]    Verifying SHA512 checksum for 'mpfr-4.1.0.tar.xz'
[EXTRA]    Saving 'mpfr-4.1.0.tar.xz' to local storage
[EXTRA]    Retrieving 'isl-0.24'
[ERROR]    isl: download failed
[ERROR]
[ERROR]  >>
[ERROR]  >>  Build failed in step 'Retrieving needed toolchain components' tarballs'
[ERROR]  >>        called in step '(top-level)'
[ERROR]  >>
[ERROR]  >>  Error happened in: CT_Abort[scripts/functions@487]
[ERROR]  >>        called from: CT_DoFetch[scripts/functions@2131]
[ERROR]  >>        called from: CT_PackageRun[scripts/functions@2091]
[ERROR]  >>        called from: CT_Fetch[scripts/functions@2202]
[ERROR]  >>        called from: do_isl_get[scripts/build/companion_libs/121-isl.sh@16]
[ERROR]  >>        called from: do_companion_libs_get[scripts/build/companion_libs.sh@15]
[ERROR]  >>        called from: main[scripts/crosstool-NG.sh@647]
[ERROR]  >>
[ERROR]  >>  For more info on this error, look at the file: 'build.log'
[ERROR]  >>  There is a list of known issues, some with workarounds, in:
[ERROR]  >>      https://crosstool-ng.github.io/docs/known-issues/
[ERROR]  >>
[ERROR]  >> NOTE: You configuration uses non-default patch sets. Please
[ERROR]  >> select 'bundled' as the set of patches applied and attempt
[ERROR]  >> to reproduce this issue. Issues reported with other patch
[ERROR]  >> set selections (none, local, bundled+local) are going to be
[ERROR]  >> closed without explanation.
[ERROR]  >>
[ERROR]  >>  If you feel this is a bug in crosstool-NG, report it at:
[ERROR]  >>      https://github.com/crosstool-ng/crosstool-ng/issues/
[ERROR]  >>
[ERROR]  >>  Make sure your report includes all the information pertinent to this issue.
[ERROR]  >>  Read the bug reporting guidelines here:
[ERROR]  >>      http://crosstool-ng.github.io/support/
[ERROR]
[ERROR]  (elapsed: 2:04.37)
make: *** [/opt/crosstool-ng/bin/ct-ng:261: build] Error 1
The command '/bin/sh -c CT_PREFIX=/opt/cross ct-ng build -j12' returned a non-zero code: 2

infoROM is corrupted; however, DCGM diagnostics all pass

Description of the problem

  • nvidia-smi reports that the infoROM is corrupted (screenshot attached)

  • However, DCGM diagnostics all pass (screenshot attached)

Environment information

  • Bare Metal Server : QuantaGrid D52G-4U
  • GPU SKU(s) : Tesla V100-SXM2-32GB
  • OS : CentOS 7.8
  • DRIVER : 450.80.02
  • GPU power settings (nvidia-smi -q -d POWER) : nv_power.log
  • CPU(s) : Intel(R) Xeon(R) Gold 6154 CPU
  • RAM : 768 GB
  • Topology (nvidia-smi topo -m) : nv_topo.log
  • The output of nvidia-smi -q : nv_q.log
  • Full output of dcgmi -v : dcgm_v.log

I cannot use NVVS to view FP16 tensor core peak FLOPS vs. theoretical FLOPS

Hi, as we know, the A100 FP16 tensor core peak is 312 TFLOPS, but I want to be sure whether I can reach the theoretical value. When I use cuBLAS or CUTLASS, from the kernel I can get about 80% of the theoretical value.
So I want to know how to use NVVS to view the FP16 tensor core peak FLOPS vs. the theoretical FLOPS.

Thank you

Diagnostics fail expecting NvLink on non-NvLink systems

I am running datacenter-gpu-manager-2.4.6 from https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/

My O/S is AlmaLinux 8.6, CUDA 11.7.1. GPUs are 2x A100.

When I start the nvidia-dcgm service and run "dcgmi diag -r 2", the PCIe test always fails with a bunch of warnings like:

+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Fail - All                                     |
| Warning                   | GPU 0GPU 0's NvLink link 0 is currently down   |
|                           | Run a field diagnostic on the GPU., GPU 0GPU   |
|                           | 0's NvLink link 1 is currently down Run a fie  |
|                           | ld diagnostic on the GPU., GPU 0GPU 0's NvLin  |
|                           | k link 2 is currently down Run a field diagno  |
|                           | stic on the GPU., GPU 0GPU 0's NvLink link 3   |
|                           | is currently down Run a field diagnostic on t  |
|                           | he GPU., GPU 0GPU 0's NvLink link 4 is curren  |
|                           | tly down Run a field diagnostic on the GPU.,   
...

This is on any GPU system without NvLink. I have tried several.

How can I get DCGM's diagnostics not to try to use NvLink when it is not present?

Error retrieving profiling metrics

The dcgmi profiling command fails

dcgmi dmon -e 1001
# Entity GRACT
Id
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error

I have enabled the DEBUG logging level of nv-hostengine and ran it as root.
I didn't see any information about this error. I checked the code and didn't find any clues.
Does anyone have clues about this issue?

2022-10-20 22:34:16.548 ERROR [658373:658384] EnableMetrics returned -37 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:828] [DcgmModuleProfiling::ProcessWatchFieldsGpu]
2022-10-20 22:34:16.548 ERROR [658373:658376] DCGM_PROFILING_SR_WATCH_FIELDS failed with -37 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5799] [DcgmHostEngineHandler::WatchFieldGroup]
2022-10-20 22:34:16.549 ERROR [658373:658378] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2022-10-20 22:34:16.549 ERROR [658373:658384] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2022-10-20 22:34:16.549 ERROR [658373:658378] Unknown subcommand: 1 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]

Get N/W when profiling metrics

(screenshot attached)

dcgmi version 1.7.2

I want to watch the SMACT, SMOCC, FP16, FP32, FP64, and TensorCore metrics, but the command "dcgmi dmon -e 54,1002,1003,1004,1006,1007,1008,1005,1011,1012,1009,1010 -c 2 -d 10000 --no-watch" shows me only SMACT and SMOCC; all the remaining data is "N/W".

Conflicts with other profiling tools

Hi, I'm using DCGM on a server with multiple A100 GPUs. When other profiling tools are running, such as CUPTI-based profilers, dcgmi will raise the error "the third-party profiling module returned an unrecoverable error". That makes sense. But if the other profilers only profile one GPU, and I specify the other GPUs, such as with dcgmi -i 1, the error doesn't make sense.

Is it possible to implement a feature such that DCGM still works if the specified GPUs are not being used by other processes and will not be used by them in the future?

DCGM Telegraf Issues v2.3.4-1

Hello,
We're seeing an issue with dcgmd-telegraf.service with this release of DCGM and RHEL 8.4 EUS. The service will start but will crash right around the one-minute mark with the following:

systemd[1]: Started DCGM Telegraf service.
python3[116631]: Traceback (most recent call last):
python3[116631]:   File "/usr/local/dcgm/bindings/python3/dcgm_telegraf.py", line 63, in <module>
python3[116631]:     main(DcgmTelegraf, TELEGRAF_NAME, DEFAULT_TELEGRAF_PORT, add_target_host=True)
python3[116631]:   File "/usr/local/dcgm/bindings/python3/common/dcgm_client_main.py", line 81, in main
python3[116631]:     dr.Process()
python3[116631]:   File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 443, in Process
python3[116631]:     self.dfvc = self.m_dcgmGroup.samples.GetAllSinceLastCall(self.dfvc, self.m_fieldGroup)
python3[116631]:   File "/usr/local/dcgm/bindings/python3/DcgmGroup.py", line 162, in GetAllSinceLastCall
python3[116631]:     if dfvc.values.len() == 0:
python3[116631]: AttributeError: 'dict' object has no attribute 'len'
systemd[1]: dcgmd-telegraf.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: dcgmd-telegraf.service: Failed with result 'exit-code'.

Using Prometheus isn't an option for us due to some operational security requirements, and all other monitoring of our facility uses Telegraf/InfluxDB/Grafana (we also aren't running K8s on these machines; they are in an HPC cluster). Any guidance on how to resolve this would be appreciated.

Incomplete metrics when using `dcgmi stats`

I'm using A100 80GB cards with 2x 2g.20gb MIG instances on a box with DCGM 3.0.4. I tried running a demo workload just to see whether stats are collected correctly, but most of the fields come up as "N/A" or "Not Found".

+------------------------------------------------------------------------------+ 
| GPU ID: 0                                                                    | 
+====================================+=========================================+ 
|-----  Execution Stats  ------------+-----------------------------------------| 
| Start Time                         | Thu Dec  1 13:40:26 2022                | 
| End Time                           | Thu Dec  1 13:52:30 2022                | 
| Total Execution Time (sec)         | 724.15                                  | 
| No. of Processes                   | 1                                       | 
+-----  Performance Stats  ----------+-----------------------------------------+ 
| Energy Consumed (Joules)           | 32112                                   | 
| Power Usage (Watts)                | Avg: 58.889, Max: 63.761, Min: 42.902   | 
| Max GPU Memory Used (bytes)        | 0                                       | 
| SM Clock (MHz)                     | Avg: 1160, Max: 1410, Min: 210          | 
| Memory Clock (MHz)                 | Avg: 1512, Max: 1512, Min: 1512         | 
| SM Utilization (%)                 | Avg: N/A, Max: N/A, Min: N/A            | 
| Memory Utilization (%)             | Avg: N/A, Max: N/A, Min: N/A            | 
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            | 
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            | 
+-----  Event Stats  ----------------+-----------------------------------------+ 
| Single Bit ECC Errors              | 0                                       | 
| Double Bit ECC Errors              | 0                                       | 
| PCIe Replay Warnings               | 0                                       | 
| Critical XID Errors                | 0                                       | 
+-----  Slowdown Stats  -------------+-----------------------------------------+ 
| Due to - Power (%)                 | 0                                       | 
|        - Thermal (%)               | 0                                       | 
|        - Reliability (%)           | Not Supported                           | 
|        - Board Limit (%)           | Not Supported                           | 
|        - Low Utilization (%)       | Not Supported                           | 
|        - Sync Boost (%)            | 0                                       | 
+--  Compute Process Utilization  ---+-----------------------------------------+ 
| PID                                | 3714799                                 | 
|     Avg SM Utilization (%)         | Not Found                               | 
|     Avg Memory Utilization (%)     | Not Found                               | 
+-----  Overall Health  -------------+-----------------------------------------+ 
| Overall Health                     | Healthy                                 | 
+------------------------------------+-----------------------------------------+ 

Is there a reason for this?

Steps to reproduce

  1. Create a dcgm group with all the GPUs and MIG instances (same with specific GPU and MIG instances)
  2. Enable stat collection with dcgmi stats -e
  3. Start recording stats for a test job with dcgmi stats -s foo
  4. Run any test workload that uses a GPU/MIG instance
  5. Check the stats with dcgmi stats -j foo. Output would be as given above
  6. Stop stat collection with dcgmi stats -x foo

All the above commands that interact with DCGM were run as root (just to rule out the possibility of insufficient perms).
Am I missing a step here?

Error: unable to establish a connection to the specified host: localhost

I'm using aws and my environment is:

  • AMI : Deep Learning AMI GPU CUDA 11.4.1 (Ubuntu 18.04) 20211204
  • g4dn.xlarge instance (T4)

and I installed DCGM by referring to this link: https://developer.nvidia.com/dcgm (the version was modified from ubuntu20 to 18 and then installed)

After installing DCGM, when I enter the following command, I get this error:

dcgmi discovery -l

Error: unable to establish a connection to the specified host: localhost
Error: Unable to connect to host engine. Host engine connection invalid/disconnected.

dcgmi discovery -v

Version : 2.4.5
Build ID : 9
Build Date : 2022-06-03
Build Type : Release
Commit ID : 82470ec91c4a20565182d65d2b8f0ea756c70285
Branch Name : rel_dcgm_2_4
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 54832e64be3a6a8ad586bcae022ca6cb

Diagnostics skipped unless is_allowed is set

After installing 3.0.4 from the Ubuntu 20.04 repo, starting the nvidia-dcgm service and running dcgmi diag -r 3, it appears most of the tests (beyond the level 1 tests) are simply skipped:

$ dcgmi diag -r 3
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Skip - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Skip - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| SM Stress                 | Skip - All                                     |
| Targeted Stress           | Skip - All                                     |
| Targeted Power            | Skip - All                                     |
| Memory Bandwidth          | Skip - All                                     |
+---------------------------+------------------------------------------------+

By digging around the source code, it seems that in addition to specifying -r, one still needs to pass options to allow the tests to run:

$ dcgmi diag -r 3 -p "targeted stress.is_allowed=True;sm stress.is_allowed=true;targeted power.is_allowed=true;pcie.is_allowed=true"
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Skip - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| SM Stress                 | Pass - All                                     |
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Skip - All                                     |
+---------------------------+------------------------------------------------+

That's counter-intuitive and not documented in https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html.

E: Unable to locate package datacenter-gpu-manager

I followed the install instructions here, but I get an error saying that it is unable to locate the package.

My commands were as follows:

  1. apt-key del 7fa2af80
  2. distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
  3. wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
  4. dpkg -i cuda-keyring_1.0-1_all.deb
  5. apt-get update
  6. apt-get install -y datacenter-gpu-manager

On step 6 I got the error "E: Unable to locate package datacenter-gpu-manager".

How can I fix this problem?

dcgm pod is failing after GPU operator is upgraded to 1.9.0

oc get pods
NAME READY STATUS RESTARTS AGE
cuda-vectoradd 0/1 Completed 0 8d
gpu-feature-discovery-bchqv 1/1 Running 6 8d
gpu-operator-755c469bcd-w5sxp 1/1 Running 4 8d
nvidia-container-toolkit-daemonset-ddwqg 1/1 Running 0 8d
nvidia-cuda-validator-vtck8 0/1 Completed 0 8d
nvidia-dcgm-989pl 0/1 CrashLoopBackOff 1426 5d1h
nvidia-dcgm-exporter-f5n6r 1/1 Running 0 5d1h
nvidia-device-plugin-daemonset-2wl6x 1/1 Running 6 8d
nvidia-device-plugin-validator-v5gbk 0/1 Completed 0 8d
nvidia-driver-daemonset-48.84.202112162303-0-j2stm 2/2 Running 0 8d
nvidia-node-status-exporter-vbpp8 1/1 Running 0 8d
nvidia-operator-validator-4zq7j 1/1 Running 0 8d

===============================================================================

oc describe pod nvidia-dcgm-989pl
Name: nvidia-dcgm-989pl
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: AWS Node
Start Time: Wed, 16 Feb 2022 09:24:49 +0000
Labels: app=nvidia-dcgm
controller-revision-hash=78c4ff8b97
pod-template-generation=2
Annotations: openshift.io/scc: nvidia-dcgm
Status: Running
Controlled By: DaemonSet/nvidia-dcgm
Init Containers:
toolkit-validation:
Container ID: cri-o://ad640e4ef341acc64f4285395d3bf0b126bba59d147a9f34d32828a2f91db9c0
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:d7e0397249cd5099046506f32841535ea4f329f7b7583a6ddd9f75ff0f53385e
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:d7e0397249cd5099046506f32841535ea4f329f7b7583a6ddd9f75ff0f53385e
Port:
Host Port:
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 16 Feb 2022 09:24:50 +0000
Finished: Wed, 16 Feb 2022 09:24:50 +0000
Ready: True
Restart Count: 0
Environment:
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xxxx (ro)
Containers:
nvidia-dcgm-ctr:
Container ID: cri-o://f8f25d1d9aab241a2cea4dc5315f832782b30f780a7ead509157c6c20c0f2a05
Image: nvcr.io/nvidia/cloud-native/dcgm@sha256:f4c4de8d66b2fef8cebaee6fec2fb2d15d01e835de2654df6dfd4a0ce0baec6b
Image ID: nvcr.io/nvidia/cloud-native/dcgm@sha256:f4c4de8d66b2fef8cebaee6fec2fb2d15d01e835de2654df6dfd4a0ce0baec6b
Port: 5555/TCP
Host Port: 5555/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 21 Feb 2022 11:05:54 +0000
Finished: Mon, 21 Feb 2022 11:05:55 +0000
Ready: False
Restart Count: 1426
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xxxx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: root-ca.crt
ConfigMapOptional:
DownwardAPI: true
ConfigMapName: service-ca.crt
ConfigMapOptional:
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.dcgm=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message


Normal Pulled 29m (x1422 over 5d1h) kubelet Container image "nvcr.io/nvidia/cloud-native/dcgm@sha256:f4c4de8d66b2fef8cebaee6fec2fb2d15d01e835de2654df6dfd4a0ce0baec6b" already present on machine
Warning BackOff 4m50s (x33491 over 5d1h) kubelet Back-off restarting failed container

Require binaries for ARM64 architecture

Hi Team,

I am working on building an nvidia/dcgm-exporter image for arm64 but found that it installs the datacenter-gpu-manager binaries, which are not available for ARM64 in the official releases.

I have followed the steps provided in this link and tried to install the binary for arm64, but it does not install for arm64. However, I am able to install successfully on amd64 by following the same steps.

Do you have any plans to release datacenter-gpu-manager binaries for linux-arm64?

It would be very helpful if arm64 binaries were released. If required, I am happy to contribute.

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

Hello, I am using a P100 GPU, and there is a problem: the fields above 1000 (1002, 1003, 1004, 1005, ...) do not work, failing with this error:

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

docker pull: requested access to the resource is denied

log:

[root@VM-4-12-centos DCGM]# ./build.sh --release --debug --arch amd64 --clean --packages

Unable to find image 'dcgmbuild:latest' locally
docker: Error response from daemon: pull access denied for dcgmbuild, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

Question

It seems that I need to log in to Docker Hub? But I did not find any image named dcgmbuild on Docker Hub.

Questions on documentation for DCGM python tests and bindings

In the documentation it is written: "The test suite utilizes DCGM's python bindings. Python version 2 is the only supported version at this time." However, under testing/ there is a directory named python3. Similarly, in /usr/local/dcgm/bindings, python3 is also listed.

Is this a documentation mistake, or should the words be understood in a different manner? It seems a little confusing to me. I would like to use the DCGM library for profiling from Python 3 code. Thanks!

Errors in DcgmReaderExample.py

Hi, there is an error when I try to run /usr/local/dcgm/sdk_samples/scripts/DcgmReaderExample.py. Could you help me figure out what's wrong with it?

Processing in field order by overriding the CustomerDataHandler() method
type of findBynameid <class 'ctypes.c_void_p'>
c_void_p(12)
Traceback (most recent call last):
  File "DcgmReaderExample.py", line 95, in <module>
    main()
  File "DcgmReaderExample.py", line 88, in main
    cdr.Process()
  File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 440, in Process
    self.Reconnect()
  File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 342, in Reconnect
    self.InitializeFromHandle()
  File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 312, in InitializeFromHandle
    self.GetFieldMetadata()
  File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 400, in GetFieldMetadata
    self.LogInfo("fieldGroupId: " + findByNameId  + "\n")
TypeError: can only concatenate str (not "c_void_p") to str

Err: Failed to start DCGM Server: -7

root@test-install-nvidia-driver-with-monitoring:/home/linpei.yang# systemctl --now enable nvidia-dcgm
root@test-install-nvidia-driver-with-monitoring:/home/linpei.yang# nv-hostengine

ubuntu2004
