Comments (12)
Hi @hassanbabaie,
I am not familiar with Kubernetes. Let me find out if someone might know the answer.
Thanks @pakmarkthub, I'm hoping there is a documented way, as I would expect this to be a growing scenario.
Hi @hassanbabaie, mounting the device node with a hostPath volume will only work if you run your pod as privileged. Otherwise, the container process won't have write permissions on the device node.
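For illustration, a minimal sketch of that privileged hostPath approach is below. The pod name and image are placeholders, and the nvidia.com/gpu request assumes the NVIDIA device plugin is deployed.

apiVersion: v1
kind: Pod
metadata:
  name: gdrcopy-app                 # hypothetical name
spec:
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.3.2-devel-ubuntu22.04   # placeholder; any CUDA-capable image
    command: ["sleep", "infinity"]
    securityContext:
      privileged: true              # needed so the process can open /dev/gdrdrv read-write
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: gdrdrv
      mountPath: /dev/gdrdrv        # expose the host device node inside the container
  volumes:
  - name: gdrdrv
    hostPath:
      path: /dev/gdrdrv
      type: CharDevice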
We are working on adding gdrcopy support to NVIDIA Container Runtime (see https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/530), and it should make it into the next release. With this feature, you can inject /dev/gdrdrv into your container by setting the NVIDIA_GDRCOPY=enabled environment variable in your container spec. It will not be required to run your pod as privileged.
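Once that release is available, a pod spec along these lines should be enough. The pod name and image are placeholders, and runtimeClassName: nvidia is an assumption that the NVIDIA Container Runtime is registered as a RuntimeClass (it can be omitted if nvidia is already the default runtime).

apiVersion: v1
kind: Pod
metadata:
  name: gdrcopy-envvar-demo         # hypothetical name
spec:
  runtimeClassName: nvidia          # assumption: NVIDIA Container Runtime is selected via this RuntimeClass
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.3.2-devel-ubuntu22.04   # placeholder; any CUDA-capable image
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_GDRCOPY
      value: "enabled"              # asks the runtime to inject /dev/gdrdrv
    resources:
      limits:
        nvidia.com/gpu: 1

No privileged securityContext and no hostPath volume should be needed with this approach.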
cc @elezar
Thanks, this is great news @cdesiniotis; this will be much better, as leveraging privileged is not desired.
If possible, can you post here when it's released? We can then look to try it out.
Hi @cdesiniotis I can't seem to access https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/530
Do you happen to have any update on this?
It looks like this is now covered in v1.15.0-rc.2 and has worked its way through to v1.15.0-rc.4.
Do we happen to know the estimated release timeline?
Hi @hassanbabaie,
FYI, we have released a gdrdrv container image on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/gdrdrv. Running that image will automatically compile and install the gdrdrv driver on your system. It will also expose /dev/gdrdrv. However, you still need to attach that path to your application containers manually.
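For anyone who wants to try it, a rough sketch of running that installer image as a pod follows. The image tag is a placeholder (check the NGC page for the current one), and the privileged securityContext is an assumption based on the image needing to build and insmod a kernel module.

apiVersion: v1
kind: Pod
metadata:
  name: gdrdrv-installer            # hypothetical name
spec:
  containers:
  - name: gdrdrv
    image: nvcr.io/nvidia/cloud-native/gdrdrv:v2.4.1       # tag is a placeholder
    securityContext:
      privileged: true              # assumed: compiling and loading gdrdrv needs host privileges

Application containers would then mount /dev/gdrdrv via a hostPath volume, as in the earlier sketch.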
> It looks like this is now covered in v1.15.0-rc.2 and has worked its way through to v1.15.0-rc.4. Do we happen to know the estimated release timeline?
@cdesiniotis Do you have anything that you can share regarding gdrdrv support in NVIDIA Container Runtime?
@hassanbabaie apologies for the delayed response. NVIDIA Container Toolkit 1.15.0 has been released. You can set the NVIDIA_GDRCOPY=enabled environment variable in your container spec, and the /dev/gdrdrv device node should be made available to your container.
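A quick way to verify the injection is a throwaway pod that just lists the device node. Names and image are placeholders, and runtimeClassName assumes the nvidia RuntimeClass is configured.

apiVersion: v1
kind: Pod
metadata:
  name: gdrdrv-check                # hypothetical name
spec:
  restartPolicy: Never
  runtimeClassName: nvidia          # assumption: NVIDIA Container Runtime is selected this way
  containers:
  - name: check
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04     # placeholder image
    command: ["sh", "-c", "ls -l /dev/gdrdrv"]
    env:
    - name: NVIDIA_GDRCOPY
      value: "enabled"
    resources:
      limits:
        nvidia.com/gpu: 1

If the pod logs show a character device at /dev/gdrdrv, the toolkit injected it successfully.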
I tried to deploy this on OpenShift, and at first /dev/gdrdrv was not available in the (privileged) container. I found the logs of the gdrcopy installer:
# oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-414.92.202401121330-0-cmhxv -c nvidia-gdrcopy-ctr
Running gdrcopy-ctr-run-with-dtk
+ [[ ! -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]]
+ cp -r /usr/local/gdrcopy /mnt/shared-nvidia-driver-toolkit/
+ set +x
Mon May 13 20:45:15 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to start ...
Mon May 13 20:45:30 UTC 2024 openshift-driver-toolkit-ctr started.
Mon May 13 20:45:30 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:45:45 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:46:00 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:46:15 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:46:30 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:46:45 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:47:00 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:47:15 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
Mon May 13 20:47:30 UTC 2024 Waiting for openshift-driver-toolkit-ctr container to build the gdrdrv.ko kernel object ...
+ SRC_SHARED=/mnt/shared-nvidia-driver-toolkit/gdrcopy/src/gdrdrv
+ '[' -d /run/nvidia/driver/usr/src ']'
Waiting for nvidia driver to be loaded and rootfs to be mounted ...
+ echo 'Waiting for nvidia driver to be loaded and rootfs to be mounted ...'
+ sleep 5
+ '[' -d /run/nvidia/driver/usr/src ']'
+ lsmod
+ grep nvidia
nvidia_peermem 20480 0
nvidia_modeset 1499136 0
nvidia_uvm 6455296 4
nvidia 8626176 17 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core 491520 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm 581632 5 drm_kms_helper,ast,drm_shmem_helper,nvidia
+ cp -rv /mnt/shared-nvidia-driver-toolkit/gdrcopy/src/gdrdrv/gdrdrv.ko /usr/local/gdrcopy/src/gdrdrv/gdrdrv.ko
'/mnt/shared-nvidia-driver-toolkit/gdrcopy/src/gdrdrv/gdrdrv.ko' -> '/usr/local/gdrcopy/src/gdrdrv/gdrdrv.ko'
Loading gdrdrv kernel module
+ echo 'Loading gdrdrv kernel module'
+ _load_module
+ cd /usr/local/gdrcopy
+ insmod src/gdrdrv/gdrdrv.ko dbg_enabled=0 info_enabled=0 use_persistent_mapping=0
Creating gdrdrv device node
+ echo 'Creating gdrdrv device node'
+ _create_device_node
++ fgrep gdrdrv /proc/devices
++ cut -b 1-4
+ major='505 '
INFO: driver major is 505
+ echo 'INFO: driver major is 505 '
+ mknod /run/nvidia/driver/dev/gdrdrv c 505 0
+ chmod a+w+r /run/nvidia/driver/dev/gdrdrv
Done, now waiting for signal
+ echo 'Done, now waiting for signal'
+ trap 'echo '\''Caught signal'\''; _shutdown && { kill 696326; exit 0; }' HUP INT QUIT PIPE TERM
+ trap - EXIT
+ true
+ wait 696326
+ sleep infinity
and noticed that gdrcopy was not installed at /dev/gdrdrv but at /run/nvidia/driver/dev/gdrdrv. Therefore, I had to modify the volume mounts in the pod as follows:
volumeMounts:
- mountPath: /dev/gdrdrv
  name: gdrdrv
volumes:
- name: gdrdrv
  hostPath:
    path: /run/nvidia/driver/dev/gdrdrv
I don't know if this is the right way to do it but it works.
@stefanomaxenti yes, if you are leveraging GPU Operator to install the GDRCopy driver, the device node will be present at /run/nvidia/driver/dev/gdrdrv on the host.
While everything works fine in a privileged container, I am unable to use the NVIDIA_GDRCOPY=enabled environment variable inside a non-privileged pod with the NVIDIA GPU Operator. Without hostPath and with the variable set, gdrdrv is not visible. But with hostPath, it is not usable, since it requires read/write permission and the pod is not privileged.
I think it is related to issue NVIDIA/gpu-operator#713 on the operator side, which was closed some days after the 24.3.0 release. I will try again when that is available to deploy.
Do you have any other ideas as to why GDRCopy is not working as expected in this setup? Thank you.
@stefanomaxenti ah, I see you are on OpenShift. Unfortunately the NVIDIA_GDRCOPY=enabled envvar will have no effect on OpenShift today because we are not using NVIDIA Container Runtime there, and instead are using an OCI hook -- so the component which parses this environment variable and adds the gdrdrv character device to the container never gets invoked. Like you observed, only a privileged container which has the /dev/gdrdrv hostPath mounted will work on OpenShift. We are looking to remove this limitation in a future release.