Comments (3)
Thanks for bringing this up @chiragjn.
As discussed on the linked issue, the GSP firmware can be disabled through the module option NVreg_EnableGpuFirmware (also detailed in the driver documentation at https://download.nvidia.com/XFree86/Linux-x86_64/535.161.07/README/gsp.html).
Module options for Bottlerocket can be set through the user-data settings for the kernel command line, together with the reboot-to-reconcile option, as documented at https://bottlerocket.dev/en/os/1.19.x/api/settings/boot/ .
To disable GSP, set the following options in your user data:
[settings.boot]
reboot-to-reconcile = true

[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware" = ["0"]
I have done a test with the following eksctl config to check in an A/B scenario. One nodegroup that boots the image "vanilla" (ng-bottlerocket-g4), and one nodegroup with the appropriate settings to disable GSP firmware (ng-bottlerocket-g4-nogsp):
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: bottlerocket-nvidia
  region: us-west-2
  version: '1.28'

nodeGroups:
  - name: ng-bottlerocket-g4
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
  - name: ng-bottlerocket-g4-nogsp
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    ami: ami-0afc36986e4122bb4
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    ssh:
      allow: true
      publicKeyName: ec2_rsa
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"
        boot:
          reboot-to-reconcile: true
          kernel-parameters:
            nvidia.NVreg_EnableGpuFirmware:
              - "0"
Instance g4dn.xlarge with GSP disabled:
bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
GSP Firmware Version : N/A
bash-5.1# cat /proc/cmdline
nvidia.NVreg_EnableGpuFirmware="0" [...]
Instance g4dn.xlarge with GSP left enabled (the "vanilla" nodegroup):
bash-5.1# /usr/libexec/nvidia/tesla/bin/nvidia-smi -q | grep "GSP Firmware Version"
GSP Firmware Version : 535.161.07
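When comparing several nodes, the check above can be scripted. The following is only a minimal sketch, assuming the `nvidia-smi -q` output keeps the "GSP Firmware Version : ..." line format shown above; the function name `gsp_state` is hypothetical and not part of Bottlerocket or the NVIDIA tooling:

```shell
#!/bin/sh
# gsp_state (hypothetical helper): reads `nvidia-smi -q` output on stdin and
# prints "disabled" when the GSP firmware version is N/A, "enabled" when a
# version string is reported, and "unknown" when the line is absent.
gsp_state() {
    # Keep only the value after the colon on the first matching line,
    # with surrounding whitespace stripped.
    version=$(sed -n '/GSP Firmware Version/{s/^[^:]*:[[:space:]]*//;s/[[:space:]]*$//;p;}' | head -n 1)
    case "$version" in
        "")    echo "unknown" ;;
        "N/A") echo "disabled" ;;
        *)     echo "enabled" ;;
    esac
}
```

On a node it could be used as `/usr/libexec/nvidia/tesla/bin/nvidia-smi -q | gsp_state`, reusing the binary path from the output above.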
Would this fix your issue or is there anything extra that you would need from Bottlerocket?
from bottlerocket.
Oh amazing, I didn't know about this.
I'll try this out with Karpenter user data and report back by tomorrow.
This works as expected! Thanks again :)