
Comments (7)

RenaudWasTaken commented on September 8, 2024

Hello!

Capacity: Total number of GPUs
Allocatable: Number of GPUs in a "Healthy" state

The number of GPUs "Available" is not advertised :)
Do you mind pasting the logs of the device plugin here?

Thanks!
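Since an "Available" count is not advertised, it has to be derived. A minimal sketch, assuming node and pod objects shaped like the output of `kubectl get node -o json` and `kubectl get pods -o json`, and assuming GPUs are requested through container limits: available = allocatable minus the GPUs requested by unfinished pods on that node. The node name and all numbers below are hypothetical sample data, not values from this thread.

```python
GPU = "nvidia.com/gpu"  # extended resource name used by the NVIDIA device plugin


def gpu_summary(node, pods):
    """Return capacity, allocatable, requested and a derived 'available' count."""
    name = node["metadata"]["name"]
    capacity = int(node["status"]["capacity"].get(GPU, 0))
    allocatable = int(node["status"]["allocatable"].get(GPU, 0))

    requested = 0
    for pod in pods:
        if pod["spec"].get("nodeName") != name:
            continue  # pod is on another node (or not scheduled yet)
        if pod["status"].get("phase") in ("Succeeded", "Failed"):
            continue  # finished pods no longer count against the node
        for container in pod["spec"]["containers"]:
            limits = container.get("resources", {}).get("limits", {})
            requested += int(limits.get(GPU, 0))

    return {
        "capacity": capacity,
        "allocatable": allocatable,
        "requested": requested,
        "available": allocatable - requested,
    }


def make_pod(node_name, phase, gpus):
    """Build a pod dict trimmed to the fields gpu_summary reads (hypothetical)."""
    return {
        "spec": {
            "nodeName": node_name,
            "containers": [{"resources": {"limits": {GPU: str(gpus)}}}],
        },
        "status": {"phase": phase},
    }


# Hypothetical sample data: 4 GPUs total, 3 of them healthy.
node = {
    "metadata": {"name": "example-node"},
    "status": {"capacity": {GPU: "4"}, "allocatable": {GPU: "3"}},
}
pods = [
    make_pod("example-node", "Running", 1),
    make_pod("example-node", "Running", 1),
    make_pod("example-node", "Succeeded", 1),  # ignored: already finished
    make_pod("other-node", "Running", 1),      # ignored: different node
]

summary = gpu_summary(node, pods)
print(summary)
```

A gap between capacity and allocatable in such a summary points at GPUs the plugin reports as unhealthy, which is a different problem from GPUs that are merely in use.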

from k8s-device-plugin.

RenaudWasTaken commented on September 8, 2024

Hmm, I don't think the device plugin outputs enough logs to properly diagnose your issue.

Possible quick fix though:
If you restart the device plugin (ssh + docker stop) or delete its pod (kubectl -n kube-system delete pod nvidia-device-plugin-daemonset-5nggt) on that node, the daemonset will start a new pod that re-registers the GPUs in a healthy state.

Do you mind putting the kubelet log of that node here?
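The pod-deletion variant of the quick fix can be scripted. This is only a sketch: `delete_pod_command` is a hypothetical helper, and the `kube-system` default is an assumption taken from the namespace used in the `kubectl logs` command elsewhere in this thread. The block only builds and prints the command; nothing is executed against a cluster.

```python
import shlex


def delete_pod_command(pod, namespace="kube-system"):
    """Build the kubectl argv that deletes one pod (hypothetical helper)."""
    return ["kubectl", "-n", namespace, "delete", "pod", pod]


cmd = delete_pod_command("nvidia-device-plugin-daemonset-5nggt")
print(shlex.join(cmd))

# To actually run it against a cluster:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Once the pod is gone, the daemonset controller should schedule a replacement on the same node, and the new plugin instance registers with the kubelet again.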


yee379 commented on September 8, 2024

@RenaudWasTaken - of course not! thanks for the help :)

on the kube node:

$ nvidia-smi
Tue Jun 26 23:09:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   26C    P8    29W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   26C    P8    30W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:10:00.0 Off |                    0 |
| N/A   32C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...
$ kubectl -n kube-system get ds nvidia-device-plugin-daemonset -o wide
NAME                             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE       CONTAINERS                 IMAGES                          SELECTOR
nvidia-device-plugin-daemonset   2         2         1         2            1           accelerator=nvidia   71d       nvidia-device-plugin-ctr   nvidia/k8s-device-plugin:1.10   name=nvidia-device-plugin-ds
$ kubectl get node   -o wide --show-labels
...
ocio-gpu01              Ready                         <none>    50d       v1.10.2   <none>        CentOS Linux 7 (Core)                         3.10.0-693.21.1.el7.x86_64   docker://1.13.1     accelerator=nvidia,accelerator_model=tesla-k80,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,group=ocio,kubernetes.io/hostname=ocio-gpu01,storage=gpfs
...
$ kubectl -n kube-system logs nvidia-device-plugin-daemonset-5nggt
2018/06/26 17:13:24 Loading NVML
2018/06/26 17:13:24 Fetching devices.
2018/06/26 17:13:24 Starting FS watcher.
2018/06/26 17:13:24 Starting OS watcher.
2018/06/26 17:13:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/26 17:13:24 Registered device plugin with Kubelet

on the kube node:

$ sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
{"PodDeviceEntries":[{"PodUID":"c1c45620-6882-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-4ba65f49-8229-fa6e-46d5-0a016ec81a35"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00YmE2NWY0OS04MjI5LWZhNmUtNDZkNS0wYTAxNmVjODFhMzU="},{"PodUID":"bbb71db7-68f7-11e8-be68-fa163e21c438","ContainerName":"notebook","ResourceName":"nvidia.com/gpu","DeviceIDs":["GPU-44da6aa2-2bb6-f3fc-605f-4c716876b417"],"AllocResp":"CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00NGRhNmFhMi0yYmI2LWYzZmMtNjA1Zi00YzcxNjg3NmI0MTc="}],"RegisteredDevices":{"nvidia.com/gpu":[]}}
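The checkpoint above is easier to read once parsed. The sketch below, which assumes the JSON shape pasted here (the file is internal to the kubelet and its layout may differ between versions), lists device IDs that are recorded as allocated to a pod but missing from `RegisteredDevices`, and base64-decodes `AllocResp` to show what it carries. Only the first `PodDeviceEntries` record is reproduced; `unregistered_allocations` is a hypothetical helper name.

```python
import base64
import json

RESOURCE = "nvidia.com/gpu"

# First PodDeviceEntries record from the checkpoint above (second one omitted).
CHECKPOINT_JSON = """
{"PodDeviceEntries": [
   {"PodUID": "c1c45620-6882-11e8-be68-fa163e21c438",
    "ContainerName": "notebook",
    "ResourceName": "nvidia.com/gpu",
    "DeviceIDs": ["GPU-4ba65f49-8229-fa6e-46d5-0a016ec81a35"],
    "AllocResp": "CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS00YmE2NWY0OS04MjI5LWZhNmUtNDZkNS0wYTAxNmVjODFhMzU="}
 ],
 "RegisteredDevices": {"nvidia.com/gpu": []}}
"""


def unregistered_allocations(checkpoint):
    """Return (pod_uid, device_id) pairs whose device is not currently registered."""
    registered = set(checkpoint["RegisteredDevices"].get(RESOURCE, []))
    missing = []
    for entry in checkpoint["PodDeviceEntries"]:
        if entry["ResourceName"] != RESOURCE:
            continue
        for device_id in entry["DeviceIDs"]:
            if device_id not in registered:
                missing.append((entry["PodUID"], device_id))
    return missing


checkpoint = json.loads(CHECKPOINT_JSON)
stale = unregistered_allocations(checkpoint)
print(stale)

# AllocResp is base64; the decoded bytes are a binary message that contains
# the environment variable name and the GPU UUID as plain ASCII.
raw = base64.b64decode(checkpoint["PodDeviceEntries"][0]["AllocResp"])
print(b"NVIDIA_VISIBLE_DEVICES" in raw)
```

With this data the helper reports the pod's GPU as unregistered, because `RegisteredDevices` is an empty list while the pod entry still references a device ID.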


ray1888 commented on September 8, 2024

Furthermore, I can't delete my NVIDIA device plugin because the delete times out every time. Can you also tell me how to delete the daemonset without a timeout, so that I can retry?


ray1888 commented on September 8, 2024

@RenaudWasTaken After I deleted the plugin and redeployed it, all is well so far. But I have one question: can this plugin decrease performance in other ways? I have been looking into it.


RenaudWasTaken commented on September 8, 2024

Hello @ray1888

Device plugins should not decrease performance. If there is a performance issue, it is likely a core Kubernetes issue.

If your issue has been solved feel free to close it :)


RenaudWasTaken commented on September 8, 2024

Closing as no further answers or questions were provided.
