
Comments (27)

thundergolfer commented on September 23, 2024

Hey @ayushr2 thanks for the fast fix!

We have persistent hosts with this mismatch so will be able to deploy and validate if we take one out of service for testing. 👍

ayushr2 commented on September 23, 2024

Interesting... Could you try running something like docker run --runtime=runsc --rm --gpus="device=GPU-074eca25-dfaf-84aa-333d-8760f532be4b" nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8 on this machine?

Could you also share the container spec? (it should be printed when runsc -debug is specified)

ayushr2 commented on September 23, 2024

Note: rootfs path being absolute is actually part of OCI specification, see here.

Hmm, now that's confusing that it repros with absolute paths as well. Let me prepare a patch with more debug logs that you can test with.

ayushr2 commented on September 23, 2024

Could you try with a3a29bf? The relevant logs are prefixed with modal:. Also they may be in different log files (boot and create).

ayushr2 commented on September 23, 2024

IIUC we use the GPU index number (from NVIDIA_VISIBLE_DEVICES) to bind mount /dev/nvidia{index} into the container's filesystem. See here. So even if the device at that index has a different minor number on the host, the right device is exposed to the container. All other devices are not exposed, since their host device is not bind mounted into the container.

And within the container, this GPU device is advertised with minor number == index number (which I guess is what you are referring to). See here. Note that gVisor's device numbers are virtualized. To the application, the GPU device at /dev/nvidia{index} actually has minor number index. When the application opens that device, gVisor internally opens the corresponding host device at that index (which might have a different host device minor number). See here.
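
To make the mapping described above concrete, here is a simplified, hypothetical Go sketch of that index-based interpretation (it is not gVisor's actual code): each index listed in NVIDIA_VISIBLE_DEVICES is assumed to name the host node /dev/nvidia{index}, which is bind-mounted into the container and advertised there with a virtualized minor number equal to the index.

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical sketch only: interpret NVIDIA_VISIBLE_DEVICES ("5" or "1,3")
	// as a list of indices and map each index straight to /dev/nvidia{index}.
	visible := os.Getenv("NVIDIA_VISIBLE_DEVICES")
	for _, idx := range strings.Split(visible, ",") {
		idx = strings.TrimSpace(idx)
		hostDev := fmt.Sprintf("/dev/nvidia%s", idx)
		// The host node would be bind-mounted into the container; inside the
		// sandbox the device is advertised with minor == index, regardless of
		// the host node's real minor number.
		fmt.Printf("expose host %s in the container as /dev/nvidia%s (virtual minor %s)\n",
			hostDev, idx, idx)
	}
}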

How are you observing this bug?

thundergolfer commented on September 23, 2024

@ayushr2 thanks for the reply. I can give more details.

I first observed this bug as an occasional 'out of memory' error when running gVisor GPU tasks on OCI A100 hosts. After investigating the bug for a bit, I ran nvidia-smi --loop=1 on one of the hosts and started running tasks on that host. I saw that a runsc-sandbox task was able to use GPU 3 at the same time as another task, the latter having already used 38.5GiB of GPU memory:

# clipped output from nvidia-smi --loop=1
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    3   N/A  N/A     54736      C   python                          38588MiB |
|    3   N/A  N/A   3047441      C   runsc-sandbox                    1768MiB |
+-----------------------------------------------------------------------------+

This is what was causing the out of memory error. Tasks should have exclusive access to a GPU in our system, so this overlap should not be possible. I checked our worker logs to confirm that the worker was assigning and reclaiming GPU 1 when managing the relevant gVisor task, because GPU 1 was free. For some reason gVisor was using GPU 3, however.

Some further printing within the gvisor task and querying the host shows gvisor to be confused about which GPU it should use:

(venv) ubuntu@ip-172-31-70-51:~/modal$ MODAL_WORKER_ID="wo-1234" MODAL_FUNCTION_RUNTIME_DEBUG=1 MODAL_FUNCTION_RUNTIME=gvisor MODAL_PROFILE=modal_labs python3 gpu_a100.py
✓ Initialized. View app at https://modal.com/apps/ap-1234
✓ Created objects.
├── 🔨 Created f_gpu_a100.
└── 🔨 Created mount /home/ubuntu/gpu_a100.py
pid=2
os.environ['NVIDIA_VISIBLE_DEVICES']='5'
19 crw-rw-rw- 1 root root 254,   0 Sep 17 01:17 /dev/nvidia-uvm
20 crw-rw-rw- 1 root root 195,   5 Sep 17 01:17 /dev/nvidia5
18 crw-rw-rw- 1 root root 195, 255 Sep 17 01:17 /dev/nvidiactl
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-d096069d-5388-aca8-6391-58292e5228b6)
time inside container: 8.858397245407104
Runner terminated, in-progress inputs will be re-scheduled
✓ App completed.

The above shows, from inside the container, the value of the NVIDIA_VISIBLE_DEVICES variable and the output of ls -lai /dev/ and nvidia-smi -L.

If I run nvidia-smi -L on the host:

[opc@modal-worker-oci-a100-40gb-8g ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-89e3ce69-ee37-233d-b893-7acb7c132849)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-c4334207-d952-1963-5e0f-ff1f9c8f54a9)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-a3271686-45c9-7757-0eae-de2c2aa2b4e7)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-602162d9-2041-bf59-c231-054bc04fa934)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-82fe7f68-f52c-ef81-eef4-ce53767e3f80)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-2bc27049-6745-6155-ef28-cc3231b3811c)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-1e611294-3f0d-b865-862f-84cd756d0e87)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-d096069d-5388-aca8-6391-58292e5228b6)

The UUID shows that the 'visible device' is actually GPU 7, not GPU 5. This 'off by two' offset is consistent for gVisor tasks on this worker, and is consistent with the mapping being /dev/nvidia{device minor} rather than /dev/nvidia{index}.

[opc@modal-worker-oci-a100-40gb-8g ~]$ ls -la /dev/char | grep nvidia  | grep -v cap
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:0 -> /dev/nvidia0
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:1 -> /dev/nvidia1
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:2 -> /dev/nvidia2
lrwxrwxrwx  1 root root    19 Sep 17 06:44 195:254 -> /dev/nvidia-modeset
lrwxrwxrwx  1 root root    14 Sep 17 06:44 195:255 -> /dev/nvidiactl
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:3 -> /dev/nvidia3
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:4 -> /dev/nvidia4
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:5 -> /dev/nvidia5
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:6 -> /dev/nvidia6
lrwxrwxrwx  1 root root    12 Sep 17 06:44 195:7 -> /dev/nvidia7
lrwxrwxrwx  1 root root    15 Sep 17 06:44 508:0 -> /dev/nvidia-uvm
lrwxrwxrwx  1 root root    21 Sep 17 06:44 508:1 -> /dev/nvidia-uvm-tools
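
For completeness, here is a small illustrative Go sketch (not something from the issue) that checks the same mapping programmatically by resolving the /dev/char symlinks shown above:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/dev/char")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		// Entry names are "major:minor", e.g. "195:5" -> ../nvidia5.
		target, err := os.Readlink(filepath.Join("/dev/char", e.Name()))
		if err != nil || !strings.Contains(target, "nvidia") {
			continue
		}
		fmt.Printf("%s -> %s\n", e.Name(), target)
	}
}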

The most closely related GitHub issue I can find is NVIDIA/nvidia-docker#376.

ayushr2 commented on September 23, 2024

Thanks for the details. A few followup questions:

  • Are you passing the --nvproxy-docker flag to runsc? The NVIDIA_VISIBLE_DEVICES env var is only interpreted in gVisor when this flag is set. If this flag is not set, then runsc relies on the container spec's device list (spec.Linux.Devices) to create device files within the sandbox (a sketch of such entries follows this list). Could you provide the container spec if you are not passing --nvproxy-docker? I think I can see how this bug can manifest in this case.
  • In your output of ls -la /dev/char | grep nvidia | grep -v cap from the host, the device minors and device indices happen to align... i.e. /dev/nvidia7 on the host is actually 195:7. But it seems we are still experiencing this bug even when these numbers are aligned?
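
For reference, here is a hedged sketch of what such spec.Linux.Devices entries might look like when --nvproxy-docker is not used, built with the OCI runtime-spec Go types; the major/minor numbers are placeholders rather than values from the hosts in this issue.

package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// Placeholder device entries; a real spec would carry the host's actual
	// major/minor numbers for each node.
	spec := specs.Spec{
		Linux: &specs.Linux{
			Devices: []specs.LinuxDevice{
				{Path: "/dev/nvidiactl", Type: "c", Major: 195, Minor: 255},
				{Path: "/dev/nvidia-uvm", Type: "c", Major: 508, Minor: 0},
				{Path: "/dev/nvidia0", Type: "c", Major: 195, Minor: 0},
			},
		},
	}
	out, _ := json.MarshalIndent(spec.Linux.Devices, "", "  ")
	fmt.Println(string(out))
}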

thundergolfer commented on September 23, 2024
  • Yes we always pass --nvproxy-docker for GPU tasks
  • I was perhaps a bit misleading with that /dev/char output. I meant to show it as an indicator that /dev/nvidia{device minor} is correct and that /dev/nvidia{index} is incorrect. The additional bit of context is the output of nvidia-container-cli info:
NVRM version:   525.60.13
CUDA version:   12.0

Device Index:   0
Device Minor:   2
Model:          NVIDIA A100-SXM4-40GB
Brand:          Nvidia
GPU UUID:       GPU-89e3ce69-ee37-233d-b893-7acb7c132849
Bus Location:   00000000:0f:00.0
Architecture:   8.0

Device Index:   1
Device Minor:   3
Model:          NVIDIA A100-SXM4-40GB
Brand:          Nvidia
GPU UUID:       GPU-c4334207-d952-1963-5e0f-ff1f9c8f54a9
Bus Location:   00000000:15:00.0
Architecture:   8.0

Device Index:   2
Device Minor:   0
Model:          NVIDIA A100-SXM4-40GB
Brand:          Nvidia
GPU UUID:       GPU-a3271686-45c9-7757-0eae-de2c2aa2b4e7
Bus Location:   00000000:51:00.0
Architecture:   8.0
...

On the hosts affected with index<>minor mismatch, NVIDIA_VISIBLE_DEVICES=0 provides access to GPU-89e3ce69-ee37-233d-b893-7acb7c132849 when using runc; this is the GPU with index 0. But when using gvisor we get GPU-a3271686-45c9-7757-0eae-de2c2aa2b4e7 which is the GPU with index 2 (device minor 0).

ayushr2 commented on September 23, 2024

Ah thanks for clarifying. So IIUC,

  • GPU device minor does indicate the location at which the GPU device is mounted; i.e. /dev/nvidia{minor}.
  • Our interpretation that GPU index indicates where the GPU device is mounted is wrong. GPU index is unrelated to host mount path.

Here are some ways in which we can fix this:

  1. libnvidia-container queries the GPU index -> GPU minor mapping information from NVML. NVML is a C library... However, Nvidia does provide Golang bindings for NVML: https://github.com/NVIDIA/go-nvml. So maybe we should use that to convert index -> device minor (see the sketch after this list).
  2. Alternatively, we can also just invoke nvidia-container-cli info and parse the output. However, @ekzhang had reported in #9215 that calls to nvidia-container-cli info were expensive (costing 2-3 seconds).
  3. I think the ideal long-term strategy is letting nvidia-container-cli configure handle this complexity. We already pass the --device flag through to it with all the GPU indices. That command will configure the container filesystem. Then we query /container/rootfs/dev/nvidia* and configure all those devices.
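
Below is a minimal sketch of option (1), assuming the go-nvml bindings expose the usual NVML device queries; it is only an illustration of converting a GPU index into its device minor, not the change that was actually made.

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml.Init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("DeviceGetCount failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		minor, _ := dev.GetMinorNumber()
		uuid, _ := dev.GetUUID()
		// The device node to expose would be /dev/nvidia{minor}, not /dev/nvidia{i}.
		fmt.Printf("index=%d minor=%d uuid=%s -> /dev/nvidia%d\n", i, minor, uuid, minor)
	}
}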

ayushr2 commented on September 23, 2024

I have implemented option (3) from the above comment.

@thundergolfer could you confirm if #9391 fixes this issue for you? I don't happen to have setups that have this index/minor mismatch and don't know how to replicate it.

ayushr2 commented on September 23, 2024

@thundergolfer FYI I have updated #9391. (Same functionality, just simplified that change.) Could you test with the most recent version and confirm?

ayushr2 commented on September 23, 2024

I tested this on a multi-GPU machine and used docker run --runtime=runsc --gpus="device=GPU-b85f3e6c-53cd-b949-fc54-a243a0fa397e" ... and verified that the right GPU was exposed to the sandbox. Although I found another bug (#9401) which needed to be fixed as well. Will submit this patch for now and fix that separately. But this patch should fix your issue, since you are not using docker.

thundergolfer commented on September 23, 2024

Thanks @ayushr2 will test the latest today and confirm.

thundergolfer commented on September 23, 2024

@ayushr2 I'm testing 57ef86a4 now and getting an assertion error on assert torch.cuda.is_available()

It appears that no device is mapped into the container:

    import os
    import subprocess

    print(f"{os.environ['NVIDIA_VISIBLE_DEVICES']=}")
    subprocess.run("ls /dev/nvidia*", shell=True)
os.environ['NVIDIA_VISIBLE_DEVICES']='GPU-074eca25-dfaf-84aa-333d-8760f532be4b'
/dev/nvidia-uvm
/dev/nvidiactl
[modal@ip-10-101-6-143 ~]$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-074eca25-dfaf-84aa-333d-8760f532be4b)
GPU 1: Tesla T4 (UUID: GPU-263401a3-a881-91cb-d56e-206ac7d1fe62)
GPU 2: Tesla T4 (UUID: GPU-cb6f19c5-a1c3-3407-7482-7aca62d8e4a0)
GPU 3: Tesla T4 (UUID: GPU-4967f774-3529-ce79-e978-1b38a9d097d1)

Will investigate more with debug logs.

Edit:

Mount logs I see in the gVisor debug logs:

I0920 20:21:51.475426   15152 chroot.go:92] Setting up sandbox chroot in "/tmp"
I0920 20:21:51.475532   15152 chroot.go:37] Mounting "/proc" at "/tmp/proc"
I0920 20:21:51.475566   15152 chroot.go:37] Mounting "/dev/nvidiactl" at "/tmp/dev/nvidiactl"
I0920 20:21:51.475594   15152 chroot.go:37] Mounting "/dev/nvidia-uvm" at "/tmp/dev/nvidia-uvm"

Edit 2: Here are the configure command details:

D0920 20:21:51.225467   15091 container.go:1833] Executing ["/usr/bin/nvidia-container-cli" "--load-kmods" "configure" "--ldconfig=@/sbin/ldconfig" "--no-cgroups" "--utility" "--compute" "--pid=15127" "--device=GPU-074eca25-dfaf-84aa-333d-8760f532be4b" "/tmp/task-data/ta-XwfiDQXo7rhm6iU2v1TwZM/.tmpAyNZQt/rootfs"]

ayushr2 commented on September 23, 2024

Also in your debug logs, there must be a line like:

D0920 19:13:03.510146   33361 container.go:1833] Executing ["/usr/bin/nvidia-container-cli" "--load-kmods" "configure" ...

Can you provide that as well? I am interested in the --device flag passed here. Basically my fix relies on this command to understand NVIDIA_VISIBLE_DEVICES and set up the devices in the container's rootfs. Then runsc infers the exposed devices by scanning container rootfs.
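
As an illustration of that flow (a rough sketch with a hypothetical rootfs path, not the code in the actual change): after nvidia-container-cli configure has populated the container's /dev, the exposed GPUs can be inferred by globbing for nvidia device nodes under the rootfs.

package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// nvidiaDevRE matches GPU device nodes such as .../dev/nvidia0, but not
// nvidiactl or nvidia-uvm.
var nvidiaDevRE = regexp.MustCompile(`/dev/nvidia(\d+)$`)

func findGPUDevices(rootfs string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(rootfs, "dev", "nvidia*"))
	if err != nil {
		return nil, err
	}
	var nums []string
	for _, m := range matches {
		if sub := nvidiaDevRE.FindStringSubmatch(m); sub != nil {
			nums = append(nums, sub[1]) // e.g. "0" for /dev/nvidia0
		}
	}
	return nums, nil
}

func main() {
	// Hypothetical rootfs path, for illustration only.
	nums, err := findGPUDevices("/path/to/container/rootfs")
	fmt.Println(nums, err)
}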

thundergolfer commented on September 23, 2024

The docker command you provided succeeds on a host where our own runtime does not. Here's a spec from a run that reproduces the error:

{
  "ociVersion": "1.0.2-dev",
  "hooks": {
    "createRuntime": [],
    "postStop": []
  },
  "process": {
    "terminal": false,
    "user": {
      "uid": 0,
      "gid": 0
    },
    "args": [
      "sh",
      "-c",
      "nvidia-smi"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "SSL_CERT_DIR=/etc/ssl/certs",
      "SOURCE_DATE_EPOCH=1641013200",
      "PIP_NO_CACHE_DIR=off",
      "PYTHONHASHSEED=0",
      "PIP_ROOT_USER_ACTION=ignore",
      "CFLAGS=-g0",
      "PIP_DEFAULT_TIMEOUT=30",
      "NVIDIA_VISIBLE_DEVICES=GPU-09504027-18f8-4494-a6b0-118b9f5781df",
      "NVIDIA_DRIVER_CAPABILITIES=all"
    ],
    "cwd": "/root",
    "capabilities": {
      "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "ambient": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ]
    },
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 65536,
        "soft": 65536
      }
    ],
    "noNewPrivileges": true
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "hostname": "modal",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    }
  ],
  "linux": {
    "cgroupsPath": "user.slice:oci:testbed",
    "resources": {},
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ],
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ],
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      },
      {
        "type": "cgroup"
      },
      {
        "type": "network",
        "path": "/var/run/netns/650b5fdac440beb7e33a08d4"
      }
    ],
    "maskedPaths": [
      "/proc/acpi",
      "/proc/asound",
      "/proc/kcore",
      "/proc/keys",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "readonlyPaths": [
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ]
  }
}

thundergolfer commented on September 23, 2024

Also in your debug logs, there must be a line like:

@ayushr2 yep, already copied that in an edit to a comment above.

The device does look to be present in the rootfs. If I run a find while the container is executing:

[modal@ip-172-31-21-169 ~]$ sudo find /tmp/task-data/ -name "*nvidia0*"
/tmp/task-data/ta-VW6LW2AZyVPB3m7wmvUvjw/.tmpkGua24/upper/dev/nvidia0
/tmp/task-data/ta-VW6LW2AZyVPB3m7wmvUvjw/.tmpkGua24/rootfs/dev/nvidia0

/tmp/task-data/ta-VW6LW2AZyVPB3m7wmvUvjw/.tmpkGua24/rootfs/ is Spec.root.path for the relevant container.

ayushr2 commented on September 23, 2024

Your container spec has Spec.root.path = rootfs (it is a relative path). Could you change the spec to use an absolute path?

thundergolfer commented on September 23, 2024

We also have a repro with an absolute path:

  },
  "root": {
    "path": "/tmp/task-data/ta-VW6LW2AZyVPB3m7wmvUvjw/.tmpkGua24/rootfs"
  },

thundergolfer commented on September 23, 2024

Note: rootfs path being absolute is actually part of OCI specification, see here.

True. That spec came from a testbed environment that doesn't have as much Modal-specific stuff in it. We will update it to use an absolute path. The other root.path is from our actual Modal runtime environment.

thundergolfer commented on September 23, 2024
[modal@ip-172-31-86-106 ~]$ cat /tmp/runsc/* | grep modal:
I0920 21:57:31.776594    6770 container.go:1833] modal: Executing ["/usr/bin/nvidia-container-cli" "--load-kmods" "configure" "--ldconfig=@/sbin/ldconfig" "--no-cgroups" "--utility" "--compute" "--pid=6787" "--device=GPU-de7f8ab2-e00f-7d0a-3ba2-c03698b1b22f" "/tmp/task-data/ta-KxrvpxEuJ9pdpxV81NNZRF/.tmpcynYmO/rootfs"]
I0920 21:57:32.371053    7059 chroot.go:203] modal: got [] device minors after scanning "/tmp/task-data/ta-KxrvpxEuJ9pdpxV81NNZRF/.tmpcynYmO/rootfs"
I0920 21:57:32.394348    7059 vfs.go:1202] modal: after pivot_root(2), got [] device minors after scanning /

Edit:

Showing another instance, alongside a find executed after starting the process:

[modal@ip-172-31-86-106 ~]$ sudo find /tmp/task-data/ -name "*nvidia0" -exec ls -lah --time-style=full-iso {} \;
-rw-r--r-- 1 root root 0 2023-09-20 22:09:39.129323146 +0000 /tmp/task-data/ta-wcu0p5gUwycTEVY6JwmcSV/.tmpSzemgV/rootfs/dev/nvidia0
-rw-r--r-- 1 root root 0 2023-09-20 22:09:39.129323146 +0000 /tmp/task-data/ta-wcu0p5gUwycTEVY6JwmcSV/.tmpSzemgV/upper/dev/nvidia0
[modal@ip-172-31-86-106 ~]$ cat /tmp/runsc/* | grep modal:
I0920 22:09:39.080838   14474 container.go:1833] modal: Executing ["/usr/bin/nvidia-container-cli" "--load-kmods" "configure" "--ldconfig=@/sbin/ldconfig" "--no-cgroups" "--utility" "--compute" "--pid=14542" "--device=GPU-de7f8ab2-e00f-7d0a-3ba2-c03698b1b22f" "/tmp/task-data/ta-wcu0p5gUwycTEVY6JwmcSV/.tmpSzemgV/rootfs"]
I0920 22:09:39.777030   14616 chroot.go:203] modal: got [] device minors after scanning "/tmp/task-data/ta-wcu0p5gUwycTEVY6JwmcSV/.tmpSzemgV/rootfs"
I0920 22:09:39.801066   14616 vfs.go:1202] modal: after pivot_root(2), got [] device minors after scanning /

thundergolfer commented on September 23, 2024

I stuck this inside FindAllGpuDevices and each time it reports the file as not existing:

expectedPath := path.Join(rootfs, "dev/nvidia0")
if _, err := os.Stat(expectedPath); os.IsNotExist(err) {
    fmt.Printf("File does not exist at path: %s\n", expectedPath)
}
File does not exist at path: /tmp/task-data/ta-xb4D4msDkHZandOYBp6SMZ/.tmpWMUwj1/rootfs/dev/nvidia0
File does not exist at path: /dev/nvidia0

ayushr2 commented on September 23, 2024

Thanks for working with me on this. Could you test one more time with fdb6f11?

So the GPU device node is not observable from the boot process, even though it can be observed from outside the sandbox.

thundergolfer commented on September 23, 2024

@ayushr2 I've got to go to dinner now, but will try it first thing tomorrow.

ayushr2 commented on September 23, 2024

I just had this thought (which woke me up 😅). Maybe the simplification I did earlier broke this. Could you try the older (bulkier) commit: e890018? I think this should work.

I think it has to do with mount propagation and mount namespaces. We can access rootfs/dev/nvidia* from outside the sandbox process because we are in the root mount namespace. The rootfs/dev/nvidia* mounts are created before the gofer can mark all shared mounts as slave mounts. So these mounts are propagated to the parent mount namespace, and we are able to see them from outside.

The older commit did the "search" outside the sandbox process and passed the devices found to the sandbox process via a flag. The newer version moved this "search" to inside the sandbox process (which simplified a lot of code). The sandbox process is running in a new mount namespace. But somehow, with your container spec, the sandbox process is started in a different peer group, so mounts from the gofer mount namespace are not propagated there. I don't know what it is that makes the difference between your spec and a spec from Docker.
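
For illustration, here is a minimal Go sketch (not gVisor's code) of the peer-group mechanics being described: a process that recursively marks its mounts as slave still receives propagation from the parent peer group but no longer propagates its own mounts back, whereas a process whose mounts end up private (or in a different peer group) will never see mounts like rootfs/dev/nvidia* that are created in another namespace.

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Must run as root, typically right after entering a new mount namespace.
	// Recursively mark everything under / as a slave mount: mounts made in the
	// parent peer group still propagate in, but nothing propagates back out.
	// If the process instead ends up with private mounts (or in a different
	// peer group), device mounts created later in another namespace will not
	// appear here at all.
	if err := unix.Mount("", "/", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
		log.Fatalf("marking mounts as slave: %v", err)
	}
}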

Just a gut feeling. Let me know if using the older commit fixes the issue.

thundergolfer commented on September 23, 2024

Ahh nice. Mount namespaces are tricky. I'll try e890018

thundergolfer commented on September 23, 2024

@ayushr2 can confirm that commit works in our testbed. 👍 👍

(I couldn't find a straightforward way to grab e890018 because it's detached and I'm not a repo member, but as the changeset is a single commit I was able to download https://github.com/google/gvisor/commit/e8900182a0f5347349415612ad0e2d8a5fb4f902.diff and do git apply against master)
