Comments (6)
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
Those were set. I'm using the nvidia/cuda:11.8.0-base-ubuntu22.04 image, but it is still failing.
Update
declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"
Even after I unset NVIDIA_REQUIRE_CUDA, it still fails with the same error.
I also tested the same image with the 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases; both failed.
from bottlerocket.
If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues then please let me know.
Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem.
The difference in the output between Bottlerocket and Amazon Linux for the module config is:
Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0
Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0
EnableGpuFirmware is the GSP-related setting, and ModifyDeviceFiles disables dynamic device file management when set to 0.
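For anyone who wants to repeat this comparison on their own nodes, here is a minimal sketch that parses `Key: value` module parameter output and reports the differences. It assumes the text was captured from each host beforehand (for example from `/proc/driver/nvidia/params`, which is an assumption, since the thread does not say where the values came from). The function names are hypothetical, and only the two parameters quoted above are used as sample data.

```python
def parse_nvidia_params(text):
    """Parse 'Key: value' lines into a dict of strings; other lines are skipped."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def diff_params(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Sample values copied from the comparison above; all other parameters omitted.
bottlerocket = parse_nvidia_params("ModifyDeviceFiles: 1\nEnableGpuFirmware: 18\n")
amazon_linux = parse_nvidia_params("ModifyDeviceFiles: 0\nEnableGpuFirmware: 0\n")

for name, (br, al) in diff_params(bottlerocket, amazon_linux).items():
    print(f"{name}: Bottlerocket={br} Amazon Linux={al}")
```

Values are kept as strings on purpose, so parameters with non-numeric values are compared without any conversion.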
What is strange is that PyTorch reports that CUDA is not available when it really should be, since the other things you called out are there.
Can you also confirm what your pod spec looks like, just to make sure all the right settings are being passed from that perspective?
Hello @chulkilee, I just tried using an image from NVIDIA to confirm that PyTorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't get the same issue:
# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4
Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.
@chulkilee do your container images contain the following environment variables?
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
If not, I would suggest adding them.
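A quick way to confirm this from inside the running container, rather than from the Dockerfile, is to inspect the process environment. The sketch below is only an illustration: the helper name `check_nvidia_env` is hypothetical, and a value that differs from the suggestion is reported for review, not treated as an error.

```python
import os

# The two values suggested above.
SUGGESTED = {
    "NVIDIA_VISIBLE_DEVICES": "all",
    "NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
}


def check_nvidia_env(env):
    """Return a list of notes for variables that are unset or differ from SUGGESTED."""
    notes = []
    for name, suggested in SUGGESTED.items():
        actual = env.get(name)
        if actual is None:
            notes.append(f"{name} is not set (suggested {suggested!r})")
        elif actual != suggested:
            notes.append(f"{name}={actual!r} (suggested {suggested!r})")
    return notes


# Run against the current process environment.
for note in check_nvidia_env(os.environ) or ["both variables match the suggested values"]:
    print(note)
```

Run it with the same interpreter and user as the failing workload, since variables set in an interactive shell may not match what the workload process sees.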
@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and so use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?