Comments (5)
@nikkon-dev, whenever you have time, could you please take a look at this issue?
from dcgm.
Could you provide the output of nvidia-smi and nvidia-smi -q?
nvidia-smi
Wed Feb 14 08:49:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000: Off | On |
| N/A 33C P0 85W / 400W | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 3 1 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 2 0 1 | 49MiB / 40192MiB | 56 0 | 4 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
nvidia-smi -q -i 3
==============NVSMI LOG==============
Timestamp : Wed Feb 14 08:47:31 2024
Driver Version : 535.104.05
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:
Product Name : NVIDIA A100-SXM4-80GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Enabled
Pending : Enabled
MIG Device
Index : 0
GPU Instance ID : 1
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 42
Copy Engine count : 3
Encoder count : 0
Decoder count : 2
OFA count : 0
JPG count : 0
ECC Errors
Volatile
SRAM Uncorrectable : 0
FB Memory Usage
Total : 40192 MiB
Reserved : 0 MiB
Used : 37 MiB
Free : 40154 MiB
BAR1 Memory
Total : 65535 MiB
Used : 0 MiB
Free : 65535 MiB
MIG Device
Index : 1
GPU Instance ID : 2
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 56
Copy Engine count : 4
Encoder count : 0
Decoder count : 2
OFA count : 0
JPG count : 0
ECC Errors
Volatile
SRAM Uncorrectable : 0
FB Memory Usage
Total : 40192 MiB
Reserved : 0 MiB
Used : 49 MiB
Free : 40142 MiB
BAR1 Memory
Total : 65535 MiB
Used : 0 MiB
Free : 65535 MiB
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : XXXXXX
GPU UUID : GPU-072b19bd-4ccf-63cf-6435-cf462aff833d
Minor Number : 5
VBIOS Version : 92.00.45.00.05
MultiGPU Board : No
Board ID : 0xa04
Board Part Number : XXXX
GPU Part Number : XXXX
FRU Part Number : N/A
Module ID : 8
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.104.05
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x0A
Device : 0x04
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:0A:04.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : Insufficient Permissions
Reserved : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
BAR1 Memory Usage
Total : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
Conf Compute Protected Memory Usage
Total : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 33 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 47 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 85.09 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 893.750 mV
Fabric
State : N/A
Status : N/A
Processes : None
@nikkon-dev ^^
Tensor Core utilization is also higher than 1.0 (see TENSO for GPU-I 21 below; SMACT is above 1.0 as well):
dcgmi dmon -e 1002,1003,1004,155 -g 18
#Entity   SMACT  SMOCC  TENSO  POWER
ID                             W
GPU-I 21  1.309  0.164  1.255  400.336
GPU-I 22  0.977  0.122  0.933  400.336
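For anyone trying to spot the same symptom in their own logs, here is a minimal sketch that scans `dcgmi dmon` text output and flags ratio fields that exceed 1.0. It assumes the layout shown above (a `#Entity` header row, a units row, then one row per entity) and that SMACT, SMOCC and TENSO are meant to be ratios in the range 0 to 1. The function name `find_out_of_range` and the `RATIO_FIELDS` set are hypothetical helpers for illustration, not part of DCGM.

```python
# Hypothetical helper, not part of DCGM: flags ratio metrics above 1.0
# in text captured from `dcgmi dmon`.

# Assumption: these short column names are ratios expected to stay within [0, 1].
RATIO_FIELDS = {"SMACT", "SMOCC", "TENSO"}

SAMPLE = """\
#Entity SMACT SMOCC TENSO POWER
ID W
GPU-I 21 1.309 0.164 1.255 400.336
GPU-I 22 0.977 0.122 0.933 400.336
"""


def find_out_of_range(dmon_text, limit=1.0):
    """Return (entity, field, value) tuples for ratio fields above `limit`."""
    columns = []
    findings = []
    for line in dmon_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("#"):
            # Header row: the field names follow "#Entity".
            columns = tokens[1:]
            continue
        if not columns or len(tokens) < 2 + len(columns):
            # Units row (e.g. "ID W") or any line that is not a full data row.
            continue
        # A data row starts with the entity type and its ID, e.g. "GPU-I 21".
        entity = f"{tokens[0]} {tokens[1]}"
        for name, raw in zip(columns, tokens[2:]):
            if name not in RATIO_FIELDS:
                continue
            try:
                value = float(raw)
            except ValueError:
                continue  # e.g. "N/A"
            if value > limit:
                findings.append((entity, name, value))
    return findings
```

Calling `find_out_of_range(SAMPLE)` on the output above reports SMACT and TENSO for GPU-I 21 and nothing for GPU-I 22.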
@nikkon-dev any progress on that?