Hello!
Capacity: Total number of GPUs
Allocatable: Number of GPUs in a "Healthy" state
The number of GPUs "Available" is not advertised :)
Do you mind pasting the logs of the device plugin here?
Thanks!
from k8s-device-plugin.
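Since the number of "available" GPUs is not advertised, one way to estimate it is to subtract the GPUs requested by pods on the node from the node's allocatable count. Below is a minimal Python sketch of that arithmetic, assuming node and pod objects shaped like the output of `kubectl get ... -o json`. The helper name `gpu_summary` and the sample numbers are hypothetical and are not taken from the cluster in this thread.

```python
GPU = "nvidia.com/gpu"  # extended resource name used by the NVIDIA device plugin


def gpu_summary(node, pods):
    """Summarize GPU counts for one node.

    `pods` is assumed to contain only the non-terminated pods
    scheduled on that node; filtering is left to the caller.
    """
    capacity = int(node["status"]["capacity"].get(GPU, 0))
    allocatable = int(node["status"]["allocatable"].get(GPU, 0))
    # Sum the GPU limits of every container in every pod.
    requested = sum(
        int(c.get("resources", {}).get("limits", {}).get(GPU, 0))
        for p in pods
        for c in p["spec"]["containers"]
    )
    return {
        "capacity": capacity,        # total GPUs on the node
        "allocatable": allocatable,  # GPUs reported healthy
        "requested": requested,      # GPUs claimed by pods
        "available": allocatable - requested,  # estimate, not advertised
    }


# Hypothetical sample data for illustration only.
node = {"status": {"capacity": {GPU: "8"}, "allocatable": {GPU: "6"}}}
pods = [
    {"spec": {"containers": [{"resources": {"limits": {GPU: "1"}}}]}},
    {"spec": {"containers": [{"resources": {"limits": {GPU: "1"}}}, {}]}},
]
print(gpu_summary(node, pods))
```

If allocatable is lower than capacity, as in the sample, some GPUs are being reported as unhealthy or are not registered.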
Hmm, I don't think the device plugin outputs enough logs to properly diagnose your issue.
Possible quick fix though:
If you restart the device plugin (ssh + docker stop) or delete it (kubectl delete pod nvidia-device-plugin-daemonset-5nggt) on that node, the daemonset will start a new pod that re-registers the GPUs in a healthy state.
Do you mind putting the kubelet log of that node here?
@RenaudWasTaken - of course not! thanks for the help :)
on the kube node:
$ nvidia-smi
Tue Jun 26 23:09:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P8    29W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   26C    P8    30W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:10:00.0 Off |                    0 |
| N/A   32C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...
$ kubectl -n kube-system get ds nvidia-device-plugin-daemonset -o wide
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE   CONTAINERS                 IMAGES                          SELECTOR
nvidia-device-plugin-daemonset   2         2         1       2            1           accelerator=nvidia   71d   nvidia-device-plugin-ctr   nvidia/k8s-device-plugin:1.10   name=nvidia-device-plugin-ds
$ kubectl get node -o wide --show-labels
...
ocio-gpu01 Ready <none> 50d v1.10.2 <none> CentOS Linux 7 (Core) 3.10.0-693.21.1.el7.x86_64 docker://1.13.1 accelerator=nvidia,accelerator_model=tesla-k80,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,group=ocio,kubernetes.io/hostname=ocio-gpu01,storage=gpfs
...
$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5nggt
2018/06/26 17:13:24 Loading NVML
2018/06/26 17:13:24 Fetching devices.
2018/06/26 17:13:24 Starting FS watcher.
2018/06/26 17:13:24 Starting OS watcher.
2018/06/26 17:13:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/26 17:13:24 Registered device plugin with Kubelet
on the kube node:
$ sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
{"PodDeviceEntries":[{"PodUID":"c1c45620-6882-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-4ba65f49-8229-fa6e-46d5-0a016ec81a35"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00YmE2NWY0OS04MjI5LWZhNmUtNDZkNS0wYTAxNmVjODFhMzU="},{"PodUID":"bbb71db7-68f7-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-44da6aa2-2bb6-f3fc-605f-4c716876b417"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00NGRhNmFhMi0yYmI2LWYzZmMtNjA1Zi00YzcxNjg3NmI0MTc="}],"RegisteredDevices":{"nvidia.com/gpu":[]}}
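The `AllocResp` values in the checkpoint above are base64 strings. As a rough aid for reading them, here is a minimal Python sketch that decodes one. It assumes the payload is a protobuf message whose field 1 holds string-to-string map entries (key in field 1, value in field 2), which is consistent with the bytes in this checkpoint but is an assumption, not a statement about the kubelet's schema. The helper names are hypothetical, and only length-delimited fields are handled.

```python
import base64


def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def length_delimited_fields(buf):
    """Yield (field_number, payload) for each length-delimited field in buf."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if tag & 0x07 != 2:  # wire type 2 = length-delimited
            raise ValueError("only length-delimited fields are handled")
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


def decode_alloc_resp(b64):
    """Return the env-style map found in field 1 (assumed layout)."""
    envs = {}
    for field, entry in length_delimited_fields(base64.b64decode(b64)):
        if field != 1:  # assumption: field 1 carries the map entries
            continue
        kv = dict(length_delimited_fields(entry))
        envs[kv[1].decode()] = kv[2].decode()
    return envs


# First AllocResp value from the checkpoint in this thread.
resp = ("CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00YmE2NWY0OS04"
        "MjI5LWZhNmUtNDZkNS0wYTAxNmVjODFhMzU=")
print(decode_alloc_resp(resp))
```

For the first entry this yields `NVIDIA_VISIBLE_DEVICES` mapped to the same GPU UUID listed under `DeviceIDs`. Note also that `RegisteredDevices` for `nvidia.com/gpu` is an empty list in this checkpoint, even though two pods hold a GPU each.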
Furthermore, I can't delete my nvidia-plugin because the delete times out every time. Can you also tell me how to delete the daemonset without it timing out, so that I can retry?
@RenaudWasTaken After deleting the plugin and redeploying it, all is well so far. One question, though: can this plugin degrade performance in other ways? I have been checking on it.
Hello @ray1888
Device plugins should not decrease performance. If there is a performance issue, it is likely a core Kubernetes issue.
If your issue has been solved feel free to close it :)
Closing as no further answers or questions were provided.