squat / modulus
Automatically compile kernel modules for Flatcar Linux / CoreOS Container Linux
License: MIT License
With Nvidia not supporting the later Flatcar kernels (#18), I wonder if it'd be useful to create a best-effort compatibility matrix (I only know one entry, though):
| | 2605.12.0 |
|---|---|
| 440.64 | ✓ |
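One way such a matrix could be kept machine-readable is a small mapping from driver version to the releases it is known to build on. The sketch below is only an illustration: `KNOWN_GOOD` and `render_matrix` are hypothetical names, and the single entry is the one reported above.

```python
# Hypothetical data structure: driver version -> releases it is known to build on.
KNOWN_GOOD = {
    "440.64": {"2605.12.0"},
}


def render_matrix(known_good):
    """Render the mapping as a Markdown table (drivers as rows, releases as columns).

    Versions are sorted as plain strings, which is enough for a sketch.
    """
    releases = sorted({r for rs in known_good.values() for r in rs})
    lines = ["| Driver | " + " | ".join(releases) + " |"]
    lines.append("|---|" + "---|" * len(releases))
    for driver in sorted(known_good):
        cells = ["✓" if r in known_good[driver] else "" for r in releases]
        lines.append("| " + driver + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(render_matrix(KNOWN_GOOD))
```

New reports would then only need one more entry in the mapping, and the table can be regenerated from it.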
After upgrading to the most recent Container Linux version, compiling modules fails due to NVIDIA errors:
var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nvidia_cpu_callback’:
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:213:14: error: ‘CPU_DOWN_FAILED’ undeclared (first use in this function)
case CPU_DOWN_FAILED:
^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:213:14: note: each undeclared identifier is reported only once for each function it appears in
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:220:14: error: ‘CPU_DOWN_PREPARE’ undeclared (first use in this function)
case CPU_DOWN_PREPARE:
^
In file included from /var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:15:0:
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nv_init_pat_support’:
/var/lib/dkms/nvidia/378.13/build/common/inc/nv-linux.h:391:34: error: implicit declaration of function ‘register_cpu_notifier’ [-Werror=implicit-function-declaration]
#define register_hotcpu_notifier register_cpu_notifier
^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:258:17: note: in expansion of macro ‘register_hotcpu_notifier’
if (register_hotcpu_notifier(&nv_hotcpu_nfb) != 0)
^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nv_teardown_pat_support’:
/var/lib/dkms/nvidia/378.13/build/common/inc/nv-linux.h:388:36: error: implicit declaration of function ‘unregister_cpu_notifier’ [-Werror=implicit-function-declaration]
#define unregister_hotcpu_notifier unregister_cpu_notifier
^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:283:9: note: in expansion of macro ‘unregister_hotcpu_notifier’
unregister_hotcpu_notifier(&nv_hotcpu_nfb);
^
I have been testing Modulus with Nvidia drivers so I can use GPU scheduling on Tectonic. It was working just fine until CLUO upgraded my cluster to 1520.6.0. Now Modulus fails to compile the Nvidia drivers.
David Michael said it's because the build-1520 branch has a kernel that is not in 1520.6.0, since it will go into 1520.7.0. The Git branch has changes that have not been built into a release yet. https://github.com/squat/modulus/blob/master/nvidia/compile#L8
Euan Kemp believes you are doing the wrong thing here and suggested I submit an issue. Given it's already in the dev container, I think it should be pretty easy to get the right version. The repo should be handy and everything.
David Michael suggested I make the script find the release commit in the manifest and check out that commit instead of the branch: https://github.com/coreos/manifest/blob/v1520.6.0/release.xml#L23
I'm not a developer, so I'm struggling to figure out how to work around this issue. I'd greatly appreciate it if you could fix your code so it works with the latest CL version.
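The suggestion above (look up the pinned commit in the release manifest and check that out instead of the branch) could be sketched roughly as follows. This assumes the manifest lists each repository as a `project` element with `name` and `revision` attributes; the sample XML and its revision are placeholders, not the contents of the real release.xml, and `pinned_revision` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of a repo manifest; the revision below is a
# placeholder, not the commit pinned by any real release.
SAMPLE_MANIFEST = """\
<manifest>
  <project name="coreos/coreos-overlay"
           path="src/third_party/coreos-overlay"
           revision="0123456789abcdef0123456789abcdef01234567"
           upstream="refs/heads/build-1520"/>
</manifest>
"""


def pinned_revision(manifest_xml, project_name):
    """Return the revision attribute pinned for project_name, or None if absent."""
    root = ET.fromstring(manifest_xml)
    for project in root.iter("project"):
        if project.get("name") == project_name:
            return project.get("revision")
    return None


print(pinned_revision(SAMPLE_MANIFEST, "coreos/coreos-overlay"))
```

The returned commit could then be passed to `git checkout` in place of the branch name, so the sources match the release that is actually running.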
I've been trying to compile the Nvidia driver from source on my Flatcar development container.
Recently it began failing consistently with errors like:
CC [M] /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv-frontend.o
CC [M] /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv.o
In file included from ./include/linux/kernel.h:5,
from ./include/linux/list.h:9,
from ./include/linux/preempt.h:11,
from ./include/linux/spinlock.h:55,
from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/common/inc/nv-lock.h:16,
from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/common/inc/nv-linux.h:20,
from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv-frontend.c:13:
./include/linux/stdarg.h:6: warning: "va_start" redefined
6 | #define va_start(v, l) __builtin_va_start(v, l)
I was able to fix the compilation by using an older driver and a patch released by if-not-true-then-false.com: https://www.if-not-true-then-false.com/2020/inttf-nvidia-patcher/
https://github.com/mediadepot/docker-flatcar-nvidia-driver/blob/master/compile-patched.sh#L11-L16
The problem is that, while compilation finishes successfully, I cannot depmod or modprobe the compiled kernel modules:
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.dep.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.dep.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.alias.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.alias.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.softdep.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.symbols.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.symbols.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.builtin.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.builtin.alias.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.devname.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570000]: ++ modprobe -d / ipmi_devintf
Apr 13 06:20:44 localhost bash[570000]: ++ depmod -b /opt/drivers/nvidia
Apr 13 06:20:44 localhost bash[570480]: depmod: WARNING: could not open modules.order at /opt/drivers/nvidia/lib/modules/5.15.32-flatcar: No such file or directory
Apr 13 06:20:44 localhost bash[570480]: depmod: WARNING: could not open modules.builtin at /opt/drivers/nvidia/lib/modules/5.15.32-flatcar: No such file or directory
Apr 13 06:20:44 localhost bash[570000]: ++ modprobe -d /opt/drivers/nvidia nvidia
Apr 13 06:20:45 localhost bash[570481]: modprobe: ERROR: could not insert 'nvidia': Exec format error
Apr 13 06:20:45 localhost systemd[1]: Finished Compile Kernel Modules for Flatcar.
Is this an issue you've run into?
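An "Exec format error" from modprobe can indicate that the module was built for a different kernel than the one that is running, although other causes are possible. A minimal sketch of that check follows, assuming the module's vermagic string has been obtained separately (for example with `modinfo -F vermagic`) and the running release with `uname -r`. The helper name and the sample strings are hypothetical.

```python
def built_for_kernel(vermagic, running_release):
    """Return True if the first vermagic field equals the running kernel release."""
    fields = vermagic.split()
    return bool(fields) and fields[0] == running_release


# Sample values for illustration only; they are not taken from the logs above.
print(built_for_kernel("5.15.32-flatcar SMP mod_unload", "5.15.32-flatcar"))
print(built_for_kernel("5.10.107-flatcar SMP mod_unload", "5.15.32-flatcar"))
```

If the two values differ, the module was most likely compiled against kernel sources from another release.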
Hey,
The systemd instructions now fail in the compilation stage (systemctl start ..) after your most recent change.
I ended up just using a previous commit to get everything working, and unfortunately I don't have the failing log files available to include in this issue.
Hi,
I am facing a problem here with the DevicePlugin. It starts normally and the logs look fine. However, after a few seconds or minutes the Pod gets terminated without any error.
I am using your modulus daemonset and the latest GKE device plugin.
Any idea what is going on there? Is this the normal behaviour?
Thanks,
Andreas
I am having trouble building the Nvidia module on Flatcar Linux version 2765.2.0.
I have attached the logs from modulus for two Nvidia versions that I tried (I think 472.32 is not the right one, but 440.64 is what I was using before I upgraded Flatcar):
First off, just wanted to say thanks for this awesome repo and your KubeConEU presentation Lucas!
Quick side-question: at the 12:49 timestamp in the video of your talk, you briefly have a slide up showing the gdisk -l coreos_developer_container.bin command and mention that you'd come back to it if there was time. Looks like there wasn't time, or was the topic covered elsewhere?
Main question: (perhaps relatedly) it seemed like modulus might be a great way to automatically compile other (non-nvidia) modules for those of us running Kubernetes on CoreOS Container Linux, but the 2.9G size of the chroot appears to end up being a bottleneck. I've been bumping against Fatal error: [...] No space left on device errors while playing around trying to adapt modulus to compile some in-tree kernel modules. It's unclear to me what the specific purpose of the truncate_bin() function is, and/or whether the dev container image base size could easily be made bigger in order to have more free space for compiling other modules? Would an approach along those lines make sense for folks interested in general-purpose automatic kernel module compilation, or is that barking up the wrong tree?
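On the question of free space: a raw image file can be extended sparsely without copying it, though the partition table and filesystem inside still have to be grown separately before the extra space becomes usable. The sketch below shows only the file-extension step. `grow_image` is a hypothetical helper and is not a description of what truncate_bin() in modulus does.

```python
import os
import tempfile


def grow_image(path, extra_bytes):
    """Extend the file at path by extra_bytes and return the new size in bytes.

    On filesystems that support sparse files the added tail takes no disk space
    until it is written to.
    """
    new_size = os.path.getsize(path) + extra_bytes
    os.truncate(path, new_size)
    return new_size


# Demonstrate on a throwaway file rather than a real dev container image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 1024)
demo_path = tmp.name
print(grow_image(demo_path, 4096))
os.remove(demo_path)
```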
Thanks a lot for the great project and the great talk at KubeCon Europe! :)
Am I correct, that the 2 missing pieces now are 1. Nvidia Docker runtime and 2. Kubernetes Device Plugin?
Regarding 1: Do you have some hints on how to get this up and running on CoreOS? There are a couple of projects on GitHub covering how to get the Nvidia Docker runtime up and running, and I am not sure which one to use.
Regarding 2: Can I use the official off-the-shelf Nvidia Device Plugin[1]?
Regards,
Andreas
I am trying to configure OpenGL 3D hardware acceleration - this project has got me very close, however there appears to be a missing sub module from my X Server log output.
Full details of this issue are on SO here - I can copy them here though if that'd be better.
Hey
any idea why this is happening with systemd?
● [email protected] - compile kernel modules
Loaded: loaded (/etc/systemd/system/[email protected]; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2019-03-03 06:02:40 UTC; 5min ago
Process: 14367 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus install $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=1/FAILURE)
Process: 3377 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus compile $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=0/SUCCESS)
Process: 3366 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus install $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=1/FAILURE)
Process: 3297 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus download $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=255)
Main PID: 14367 (code=exited, status=1/FAILURE)
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.symbols.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.symbols.bin.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.builtin.bin.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.devname.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: WARNING: could not open /opt/drivers/nvidia/lib/modules/4.19.23-coreos-r1/modules.order: No such file or directory
Mar 03 06:02:40 localhost bash[14367]: depmod: WARNING: could not open /opt/drivers/nvidia/lib/modules/4.19.23-coreos-r1/modules.builtin: No such file or directory
Mar 03 06:02:40 localhost bash[14367]: modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
Mar 03 06:02:40 localhost systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Mar 03 06:02:40 localhost systemd[1]: [email protected]: Failed with result 'exit-code'.
Mar 03 06:02:40 localhost systemd[1]: Failed to start compile kernel modules.
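For readers puzzling over the ExecStart lines in the status output above: the `$(echo nvidia-390.48 | tr '-' ' ')` substitution turns the instance name into separate words that are passed to modulus as arguments. The snippet below mimics that splitting as an illustration only. `split_instance` is a hypothetical helper, and the second name is made up.

```python
def split_instance(instance):
    """Mimic `echo INSTANCE | tr '-' ' '` followed by shell word splitting."""
    return instance.replace("-", " ").split()


print(split_instance("nvidia-390.48"))   # ['nvidia', '390.48']
print(split_instance("my-module-1.0"))   # ['my', 'module', '1.0']
```

Because every hyphen is replaced, a module name that itself contains a hyphen would be split into more than two arguments.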
I'm seeing the following:
$ kubectl logs -n kube-system nvidia-driver-installer-9zdtf -c modulus -f
...
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 48820723 Merge pull request #3879 from kinvolk/kai/oem-gce-rkt-dns-nsswitch
Performing Global Updates
(Could take a couple of minutes if you have a lot of binary packages.)
>>> Cloning repository 'portage-stable' from 'https://github.com/coreos/portage-stable.git'...
>>> Starting git clone in /var/lib/portage/portage-stable
>>> Git clone in /var/lib/portage/portage-stable successful
>>> Release checkout b46831017581540a0105466c9cd03d342041e577 in /var/lib/portage/portage-stable successful
>>> Cloning repository 'coreos' from 'https://github.com/coreos/coreos-overlay.git'...
>>> Starting git clone in /var/lib/portage/coreos-overlay
>>> Git clone in /var/lib/portage/coreos-overlay successful
>>> Release checkout 488207230f745be2f1daea126cd5034c2ff59557 in /var/lib/portage/coreos-overlay successful
error: pathspec 'src/third_party/coreos-overlay' did not match any file(s) known to git
I'm not too sure what the issue here is, potentially related to CoreOS going out of support?
EDIT: this seems to fix it: #16
Hey, I've started seeing the following error when starting the modulus service w/nvidia.
Apr 30 05:57:50 localhost bash[13665]: >>> Emerging binary (7 of 7) sys-kernel/coreos-sources-4.19.34::coreos
Apr 30 05:57:54 localhost bash[13665]: Traceback (most recent call last):
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib/python-exec/python2.7/emerge", line 53, in <module>
Apr 30 05:57:54 localhost bash[13665]: retval = emerge_main()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/main.py", line 1289, in emerge_main
Apr 30 05:57:54 localhost bash[13665]: return run_action(emerge_config)
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 3332, in run_action
Apr 30 05:57:54 localhost bash[13665]: retval = action_build(emerge_config, spinner=spinner)
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 541, in action_build
Apr 30 05:57:54 localhost bash[13665]: retval = mergetask.merge()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1019, in merge
Apr 30 05:57:54 localhost bash[13665]: rval = self._merge()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1413, in _merge
Apr 30 05:57:54 localhost bash[13665]: self._main_loop()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1389, in _main_loop
Apr 30 05:57:54 localhost bash[13665]: self._event_loop.run_until_complete(self._main_exit)
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 831, in run_until_complete
Apr 30 05:57:54 localhost bash[13665]: self.iteration()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 285, in iteration
Apr 30 05:57:54 localhost bash[13665]: return self._iteration(*args)
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 379, in _iteration
Apr 30 05:57:54 localhost bash[13665]: if not x.callback(f, event, *x.args):
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 117, in __call__
Apr 30 05:57:54 localhost bash[13665]: callback()
Apr 30 05:57:54 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/_async/PipeLogger.py", line 124, in _output_handler
Apr 30 05:57:54 localhost bash[13665]: log_file.flush()
Apr 30 05:57:54 localhost bash[13665]: IOError: [Errno 28] No space left on device
Apr 30 05:57:55 localhost bash[13665]: Traceback (most recent call last):
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib/python-exec/python2.7/emerge", line 53, in <module>
Apr 30 05:57:55 localhost bash[13665]: retval = emerge_main()
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/main.py", line 1289, in emerge_main
Apr 30 05:57:55 localhost bash[13665]: return run_action(emerge_config)
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 3332, in run_action
Apr 30 05:57:55 localhost bash[13665]: retval = action_build(emerge_config, spinner=spinner)
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 541, in action_build
Apr 30 05:57:55 localhost bash[13665]: retval = mergetask.merge()
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 942, in merge
Apr 30 05:57:55 localhost bash[13665]: self._save_resume_list()
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1819, in _save_resume_list
Apr 30 05:57:55 localhost bash[13665]: mtimedb.commit()
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/mtimedb.py", line 125, in commit
Apr 30 05:57:55 localhost bash[13665]: f.close()
Apr 30 05:57:55 localhost bash[13665]: File "/usr/lib64/python2.7/site-packages/portage/util/__init__.py", line 1350, in close
Apr 30 05:57:55 localhost bash[13665]: f.close()
Apr 30 05:57:55 localhost bash[13665]: IOError: [Errno 28] No space left on device
Apr 30 05:57:56 localhost bash[13665]: Container coreos_developer_container.bin failed with error code 1.
Apr 30 05:57:56 localhost systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Apr 30 05:57:56 localhost systemd[1]: [email protected]: Failed with result 'exit-code'.
Apr 30 05:57:56 localhost systemd[1]: Failed to start compile kernel modules.
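Since the traceback ends in "No space left on device", a free-space check before starting the build can surface the problem earlier and with a clearer message. The sketch below is one possible pre-flight check. `has_free_space` is a hypothetical helper, and the threshold is an arbitrary placeholder because the logs do not say how much space emerge needs.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


# Hypothetical threshold; the space the build actually needs is not stated in the logs.
REQUIRED = 3 * 1024**3
print(has_free_space(".", REQUIRED))
```

Running the check against the directory that holds the dev container image would show whether the host is short on space, as opposed to the image itself being full.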