I have a Kubernetes node on 1.10.2 with …

allocatable stuck at zero (k8s-device-plugin issue, closed)

nvidia commented on September 8, 2024

allocatable stuck at zero

from k8s-device-plugin.

Comments (7)

RenaudWasTaken commented on September 8, 2024 1

Hello!

Capacity: Total number of GPUs
Allocatable: Number of GPUs in a "Healthy" state

The number of GPUs "Available" is not advertised :)
Do you mind pasting the logs of the device plugin here?

Thanks!
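
To make the Capacity / Allocatable distinction above concrete, here is a minimal sketch that reads both counts from a node object shaped like the output of kubectl get node -o json. The sample node and its numbers are made up for illustration (they are not taken from this thread), and gpu_counts is a hypothetical helper name.

```python
# Minimal sketch: compare advertised GPU capacity with allocatable GPUs.
# SAMPLE_NODE is a made-up fragment shaped like `kubectl get node <name> -o json`;
# the counts are illustrative, not values reported in this thread.
SAMPLE_NODE = {
    "status": {
        "capacity": {"cpu": "32", "nvidia.com/gpu": "4"},
        "allocatable": {"cpu": "32", "nvidia.com/gpu": "0"},
    }
}


def gpu_counts(node, resource="nvidia.com/gpu"):
    """Return (capacity, allocatable) for an extended resource, defaulting to 0."""
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(resource, "0"))
    allocatable = int(status.get("allocatable", {}).get(resource, "0"))
    return capacity, allocatable


capacity, allocatable = gpu_counts(SAMPLE_NODE)
print("capacity (total GPUs):", capacity)
print("allocatable (healthy GPUs):", allocatable)
if allocatable < capacity:
    # Per the explanation above, the gap is the number of GPUs not reported healthy.
    print("GPUs not reported healthy:", capacity - allocatable)
```

Neither number says how many GPUs are currently free; as noted above, that figure is not advertised.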

RenaudWasTaken commented on September 8, 2024 1

Hmm, I don't think the device plugin outputs enough logs to properly diagnose your issue.

Possible quick fix though:
I suggest restarting the device plugin on that node (ssh in, then docker stop its container) or deleting its pod (kubectl -n kube-system delete pod nvidia-device-plugin-daemonset-5nggt). The daemonset will then start a new pod, which re-registers the GPUs in a healthy state.

Do you mind putting the kubelet log of that node here?

yee379 commented on September 8, 2024

@RenaudWasTaken - of course not! thanks for the help :)

on the kube node:

$ nvidia-smi
Tue Jun 26 23:09:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P8    29W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   26C    P8    30W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:10:00.0 Off |                    0 |
| N/A   32C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...

$ kubectl -n kube-system get ds nvidia-device-plugin-daemonset -o wide
NAME                             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE       CONTAINERS                 IMAGES                          SELECTOR
nvidia-device-plugin-daemonset   2         2         1         2            1           accelerator=nvidia   71d       nvidia-device-plugin-ctr   nvidia/k8s-device-plugin:1.10   name=nvidia-device-plugin-ds

$ kubectl get node   -o wide --show-labels
...
ocio-gpu01              Ready                         <none>    50d       v1.10.2   <none>        CentOS Linux 7 (Core)                         3.10.0-693.21.1.el7.x86_64   docker://1.13.1     accelerator=nvidia,accelerator_model=tesla-k80,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,group=ocio,kubernetes.io/hostname=ocio-gpu01,storage=gpfs
...

$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5nggt
2018/06/26 17:13:24 Loading NVML
2018/06/26 17:13:24 Fetching devices.
2018/06/26 17:13:24 Starting FS watcher.
2018/06/26 17:13:24 Starting OS watcher.
2018/06/26 17:13:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/26 17:13:24 Registered device plugin with Kubelet

on the kube node:

$ sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
{"PodDeviceEntries":[{"PodUID":"c1c45620-6882-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-4ba65f49-8229-fa6e-46d5-0a016ec81a35"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00YmE2NWY0OS04MjI5LWZhNmUtNDZkNS0wYTAxNmVjODFhMzU="},{"PodUID":"bbb71db7-68f7-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-44da6aa2-2bb6-f3fc-605f-4c716876b417"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00NGRhNmFhMi0yYmI2LWYzZmMtNjA1Zi00YzcxNjg3NmI0MTc="}],"RegisteredDevices":{"nvidia.com/gpu":[]}}
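
For anyone reading the checkpoint above, a small sketch that summarizes it may help. It assumes the JSON layout shown in this thread (kubelet 1.10); the file is an internal kubelet detail and its layout may differ in other versions. The long AllocResp fields are omitted here for brevity, and summarize_checkpoint is a hypothetical helper name.

```python
import json

# Checkpoint content quoted in this thread, with the AllocResp fields omitted.
CHECKPOINT = """
{"PodDeviceEntries":[
  {"PodUID":"c1c45620-6882-11e8-be68-fa163e21c438","ContainerName":"notebook",
   "ResourceName":"nvidia.com/gpu",
   "DeviceIDs":["GPU-4ba65f49-8229-fa6e-46d5-0a016ec81a35"]},
  {"PodUID":"bbb71db7-68f7-11e8-be68-fa163e21c438","ContainerName":"notebook",
   "ResourceName":"nvidia.com/gpu",
   "DeviceIDs":["GPU-44da6aa2-2bb6-f3fc-605f-4c716876b417"]}
],
"RegisteredDevices":{"nvidia.com/gpu":[]}}
"""


def summarize_checkpoint(text, resource="nvidia.com/gpu"):
    """List device IDs recorded as allocated to pods and as registered."""
    data = json.loads(text)
    allocated = [
        device_id
        for entry in data.get("PodDeviceEntries") or []
        if entry.get("ResourceName") == resource
        for device_id in entry.get("DeviceIDs") or []
    ]
    registered = (data.get("RegisteredDevices") or {}).get(resource) or []
    return {"allocated": allocated, "registered": registered}


summary = summarize_checkpoint(CHECKPOINT)
print(len(summary["allocated"]), "device(s) recorded as allocated to pods")
print(len(summary["registered"]), "device(s) recorded as registered")
```

In this file two GPUs are recorded against pods while the registered list for nvidia.com/gpu is empty, which looks consistent with the node showing zero allocatable GPUs. That reading is an interpretation of the file, not something confirmed in this thread.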

ray1888 commented on September 8, 2024

Furthermore, I can't delete my nvidia-plugin because the delete times out every time. Can you also tell me how to delete the daemonset without a timeout, so that I can retry it?

ray1888 commented on September 8, 2024

@RenaudWasTaken After I deleted the plugin and redeployed it, all is well so far. But I have a question: could this plugin decrease performance in other ways? I have been looking into it.

RenaudWasTaken commented on September 8, 2024

Hello @ray1888

Device plugins should not decrease performance. If there is a performance issue, it is likely a core Kubernetes issue.

If your issue has been solved feel free to close it :)

RenaudWasTaken commented on September 8, 2024

Closing as no further answers or questions were provided.
