Comments (7)
The DCP family of metrics (1001-1015) is not supported on RTX GPUs. The profiling module is not loaded if no supported GPU is detected.
WBR,
Nik
from dcgm.
Dear Nik,
Thanks for the prompt reply. Is it possible, or is it planned, to add support for RTX GPUs? We are trying to use DCGM for performance modeling research and would appreciate your help.
Best regards,
Qiang Wang
from dcgm.
By the way, I have also tried DCGM profiling on a GTX 1650 SUPER and observed the same error:
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
Do GTX cards support profiling with DCGM?
Best regards,
Qiang Wang
from dcgm.
The DCP metrics are only supported on datacenter-grade and Quadro GPUs. Neither RTX nor GTX GPUs are supported.
There are no plans to support those GPUs: a hardware limitation prevents us from providing low-latency profiling on RTX and GTX GPUs.
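For scripts that probe whether the profiling metrics are available, the error line quoted earlier in this thread can be detected in captured `dcgmi dmon` output. The following is only a minimal sketch: the function names are hypothetical, and the regular expression assumes the message format shown in this thread, which may differ across DCGM versions.

```python
import re

# Matches the error line format quoted in this thread, e.g.
# "Error setting watches. Result: -33: This request is serviced by ..."
# Assumption: other DCGM versions may word this line differently.
_ERROR_RE = re.compile(r"Error setting watches\. Result: (-?\d+): (.*)")

# -33 is the code shown in this thread next to "module ... not currently loaded".
MODULE_NOT_LOADED = -33


def parse_watch_error(output):
    """Return (code, message) if the text contains a watch error, else None."""
    for line in output.splitlines():
        match = _ERROR_RE.search(line)
        if match:
            return int(match.group(1)), match.group(2).strip()
    return None


def profiling_module_missing(output):
    """True when the text reports the module-not-loaded error (-33)."""
    error = parse_watch_error(output)
    return error is not None and error[0] == MODULE_NOT_LOADED
```

The text passed in would typically be the captured output of a command such as `dcgmi dmon -e 1002`. The check only reports that the module is not loaded; it does not say why.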
from dcgm.
@nikkon-dev How about the NVIDIA RTX A4000? The NVIDIA RTX series is the new generation of Quadro GPUs, yet DCGM does not seem to be compatible with it. For further information, please see the description on NVIDIA's web page: https://www.nvidia.com/en-us/design-visualization/quadro/
from dcgm.
Could you share the nvidia-smi -q output?
from dcgm.
@nikkon-dev FYR. Thanks!
lilo@bokeh:~$ dcgmi dmon -e 1002
#Entity SMACT
ID
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
lilo@bokeh:~$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Sat Jul 29 16:56:53 2023
Driver Version : 535.54.03
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:07:00.0
Product Name : NVIDIA RTX A4000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1563221025362
GPU UUID : GPU-622d54a3-f5d7-cb2d-6d95-51f9ba06809e
Minor Number : 0
VBIOS Version : 94.04.57.00.0A
MultiGPU Board : No
Board ID : 0x700
Board Part Number : 900-5G190-2700-003
GPU Part Number : 24B0-875-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G190.0510.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x07
Device : 0x00
Domain : 0x0000
Device Id : 0x24B010DE
Bus Id : 00000000:07:00.0
Sub System Id : 0x14AD17AA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 2
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 41 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15352 MiB
Reserved : 261 MiB
Used : 1 MiB
Free : 15089 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 1 MiB
Free : 255 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 128 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 35 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 103 C
GPU Slowdown Temp : 100 C
GPU Max Operating Temp : 98 C
GPU Target Temperature : 90 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 5.07 W
Current Power Limit : 140.00 W
Requested Power Limit : 140.00 W
Default Power Limit : 140.00 W
Min Power Limit : 100.00 W
Max Power Limit : 140.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1560 MHz
Memory : 7001 MHz
Default Applications Clocks
Graphics : 1560 MHz
Memory : 7001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 681.250 mV
Fabric
State : N/A
Status : N/A
Processes : None
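When comparing several machines, the identifying fields of a report like the one above can be pulled out programmatically. This is a rough sketch that assumes the "Key : Value" layout shown in this output; the function name is hypothetical, and it only extracts the fields without deciding whether a GPU supports DCGM profiling.

```python
def parse_gpu_identity(smi_text):
    """Extract a few identifying fields from `nvidia-smi -q` text.

    Assumes one "Key : Value" pair per line, as in the output above.
    Only the first occurrence of each field is kept, so on a multi-GPU
    host this describes the first GPU listed.
    """
    wanted = ("Product Name", "Product Brand", "Product Architecture")
    found = {}
    for line in smi_text.splitlines():
        # Split at the first colon; lines without one are section headers.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found
```

For the output above this yields the product name "NVIDIA RTX A4000", the brand "NVIDIA RTX", and the architecture "Ampere".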
from dcgm.