Giter Club home page Giter Club logo

Comments (6)

chulkilee avatar chulkilee commented on June 20, 2024 1
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

Those were set. I'm using nvidia/cuda:11.8.0-base-ubuntu22.04 image - but still failing.

Update

declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"

Even I unset NVIDIA_REQUIRE_CUDA - it still fails with the same error.

I also tested the same image with 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases - both failed.

from bottlerocket.

chulkilee avatar chulkilee commented on June 20, 2024

If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues then please let me know.

from bottlerocket.

yeazelm avatar yeazelm commented on June 20, 2024

Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem.

The difference in the output between Bottlerocket and Amazon Linux for the module config is:

Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0

Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0

EnableGpuFirmware is the GSP change and ModifyDeviceFiles will disable dynamic device file management when set to 0.

What is strange is that pytorch is reporting that CUDA is not available when it really should be since the other things you called out are there.

Can you also confirm what your podspec looks like just to make sure all the right settings are being passed from that perspective?

from bottlerocket.

yeazelm avatar yeazelm commented on June 20, 2024

Hello @chulkilee, I just tried using an image from NVIDIA to confirm that pytorch can see the devices on a g4dn.xlarge node with latest bottlerocket and I don't get the same issue:

# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4

Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.

from bottlerocket.

bryantbiggs avatar bryantbiggs commented on June 20, 2024

@chulkilee do your container images contain the following environment variables?

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

If not, I would suggest adding them

from bottlerocket.

arnaldo2792 avatar arnaldo2792 commented on June 20, 2024

@chulkilee , are you requesting GPUs in your pod specs? Or, do you need to oversubscribe your GPUs and thus you use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?

from bottlerocket.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.