Comments (8)
Hi @Discipe, thanks for the issue. I'll start taking a look into this.
from bottlerocket.
Hi @Discipe, my apologies for the confusion. The first response was using Bottlerocket 1.19.5, the latest release (I've updated the comment to reflect that). My follow-up was testing with Bottlerocket 1.19.0.
Am I understanding it correctly, that breaking change could've been introduced by #3718 ?
Yes, this is potentially causing the issue. We're working to expose settings that will enable users to avoid this issue.
from bottlerocket.
I just tried adding the following environment variables to our Dockerfile:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
and it didn't help.
from bottlerocket.
Confirmed the issue on my end on 1.19.5, will keep this separate from #3916 while we await a response from the author of that issue.
Using the provided test_main.py and including the NVIDIA environment variables in the Dockerfile:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
RUN DEBIAN_FRONTEND=noninteractive apt-get update --fix-missing && apt-get upgrade -y
RUN DEBIAN_FRONTEND=noninteractive apt-get install --fix-missing -y \
    software-properties-common \
    wget \
    curl
RUN DEBIAN_FRONTEND=noninteractive apt-get install -yy \
    python3 \
    python3-dev
RUN wget https://bootstrap.pypa.io/pip/get-pip.py && \
    python3 ./get-pip.py && rm ./get-pip.py
RUN pip3 install --upgrade pip
# CUDA 11.7, torch 1.13.1
RUN pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu117
COPY test_main.py .
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENTRYPOINT ["python3", "test_main.py"]
I launched a simple cluster with a g4dn.xlarge node:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: issue-3937
  region: us-west-2
  version: '1.29'
nodeGroups:
  - name: ng-bottlerocket-3937
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    disableIMDSv1: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: <redacted>
and a basic pod:
---
apiVersion: v1
kind: Pod
metadata:
  name: 3937-test-pod
spec:
  containers:
    - name: test-container
      image: public.ecr.aws/o7o0w6s5/3937:latest
      resources:
        limits:
          memory: 1000Mi
        requests:
          cpu: 100m
          memory: 1000Mi
Observed:
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
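When this error appears, one quick check from inside the container is whether the NVIDIA environment variables are set and whether any GPU device nodes were actually exposed. The following is a minimal diagnostic sketch, assuming the usual /dev/nvidia* device node naming on Linux. The helper name gpu_visibility_report is hypothetical and is not part of test_main.py:

```python
import glob
import os


def gpu_visibility_report(environ=None, dev_glob="/dev/nvidia*"):
    """Summarize what the container can see of the GPU stack.

    Hypothetical helper for illustration: it only inspects environment
    variables and device nodes, it does not load the driver.
    """
    environ = os.environ if environ is None else environ
    return {
        "NVIDIA_VISIBLE_DEVICES": environ.get("NVIDIA_VISIBLE_DEVICES"),
        "NVIDIA_DRIVER_CAPABILITIES": environ.get("NVIDIA_DRIVER_CAPABILITIES"),
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl are expected
        # when the runtime has exposed a GPU to the container.
        "device_nodes": sorted(glob.glob(dev_glob)),
    }


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If both variables are set but device_nodes is empty, the environment variables alone were not enough for the runtime to expose a GPU, which would be consistent with the error above.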
from bottlerocket.
For completeness, I checked on Bottlerocket 1.19.0 (ami-00b3c0d9c67029782) and have not seen this issue:
[2024-05-06 18:08:44,778][INFO] Init
[2024-05-06 18:08:44,778][INFO] Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0]
[2024-05-06 18:08:44,778][INFO] Pytorch version: 1.13.1+cu117
[2024-05-06 18:08:44,952][INFO] Cuda available: True
This might be related to #3718
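For readers without access to test_main.py, a check that produces log lines of the shape shown above could look roughly like this. It is a hypothetical reconstruction based only on the log output, not the script attached to the issue; the function name collect_report and the lazy import of torch are assumptions made so the sketch also runs where PyTorch is not installed:

```python
import logging
import sys

logging.basicConfig(
    format="[%(asctime)s][%(levelname)s] %(message)s", level=logging.INFO
)
log = logging.getLogger("test_main")


def collect_report():
    """Return the messages to log (layout assumed from the output above)."""
    lines = ["Init", f"Python version: {sys.version}"]
    try:
        import torch  # imported lazily so the sketch runs without PyTorch
    except ImportError:
        lines.append("Pytorch not installed")
        return lines
    lines.append(f"Pytorch version: {torch.__version__}")
    # Reports whether PyTorch can use a CUDA device in this environment.
    lines.append(f"Cuda available: {torch.cuda.is_available()}")
    return lines


if __name__ == "__main__":
    for line in collect_report():
        log.info(line)
```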
from bottlerocket.
I would like to clarify something in the response.
Am I understanding it correctly, that the breaking change could've been introduced by #3718? Which version of Bottlerocket was used to reproduce it in the first response?
from bottlerocket.
@Discipe, have you heard of TimeSlicing? We don't support it yet, but we are planning to. I'm asking because I want to know whether this feature could fit your need to oversubscribe a GPU while allowing the orchestrator (k8s) to manage the resources. This feature will allow you to further control access to the GPUs instead of granting blanket access to all the GPUs in all the pods that have NVIDIA_VISIBLE_DEVICES=all.
One caveat of NVIDIA_DEVICE_PLUGINS=all is that the orchestrator wouldn't track the usage of the GPUs. With TimeSlicing, you can oversubscribe and let the orchestrator track GPU usage. But again, I wonder whether you had already heard of it and decided not to use it because it doesn't fit your needs.
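To make the tracking distinction concrete: the scheduler only accounts for GPUs that a container requests as an extended resource, which the NVIDIA device plugin advertises as nvidia.com/gpu. The following is a minimal sketch, assuming a pod manifest that has already been loaded into a Python dict. The helper name requests_gpu and the second example pod are hypothetical:

```python
def requests_gpu(pod, resource="nvidia.com/gpu"):
    """Return True if any container in the pod sets a limit for `resource`."""
    containers = pod.get("spec", {}).get("containers", [])
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        if int(str(limits.get(resource, 0))) > 0:
            return True
    return False


# Shaped like the test pod in this thread: only memory and CPU are declared,
# so the orchestrator has no GPU usage to track for it.
untracked_pod = {
    "spec": {
        "containers": [
            {
                "name": "test-container",
                "resources": {
                    "limits": {"memory": "1000Mi"},
                    "requests": {"cpu": "100m", "memory": "1000Mi"},
                },
            }
        ]
    }
}

# Hypothetical variant that asks for one GPU through the device plugin.
tracked_pod = {
    "spec": {
        "containers": [
            {
                "name": "test-container",
                "resources": {"limits": {"memory": "1000Mi", "nvidia.com/gpu": 1}},
            }
        ]
    }
}

if __name__ == "__main__":
    print(requests_gpu(untracked_pod))  # False
    print(requests_gpu(tracked_pod))    # True
```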
from bottlerocket.
Hello @arnaldo2792, we are aware of TimeSlicing, and I agree with your description of its benefits. We are currently testing time slicing outside of a K8s cluster.
It's not the ideal solution, but it is the only existing option now to specify GPU resources for pods. The best alternative for us would be pod allocation based on GPU memory. Something like
resources:
  limits:
    gpu/memory: 2048MiB
but I don't think it is supported in any way by GPU drivers or containers.
Let me provide some additional context :)
We are trying to solve two problems:
- share a single GPU across many small apps for cost-efficiency
- have an autoscaling based on GPU capacity and utilization
The GPU utilization and resource allocation by slices could still be tricky, and we'll need to figure out how to balance slices for a cluster with different GPU sizes, but it should be possible.
I don't think I understand all the implications of NVIDIA_DEVICE_PLUGINS=all, but I doubt we need it. And if it prevents us from tracking actual GPU usage, we want to avoid using it. We don't need instances with more than one GPU, so we can always specify device=1.
There are two issues with GPU slicing support: we need it supported both by Bottlerocket and by AWS EKS (more specifically in Karpenter) to enable node and pod scaling based on GPU slices. As far as I'm aware, support from AWS is not guaranteed by the end of the year. But if it is done, that should solve both of our problems. So yes, we would like to see TimeSlicing support in Bottlerocket! :)
I just noticed that the NVIDIA device plugin supports resource allocation using MPS (which is what we are using implicitly right now, if I understand correctly): https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#with-cuda-mps
I haven't looked into the details yet; maybe it is a viable alternative to slicing...
from bottlerocket.