
modulus's People

Contributors

analogj, dippynark, paulcapestany, squat

modulus's Issues

compiling modules fails in 4.10 kernels

After upgrading to the most recent Container Linux version, compiling modules fails due to NVIDIA errors:

/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nvidia_cpu_callback’:
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:213:14: error: ‘CPU_DOWN_FAILED’ undeclared (first use in this function)
         case CPU_DOWN_FAILED:
              ^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:213:14: note: each undeclared identifier is reported only once for each function it appears in
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:220:14: error: ‘CPU_DOWN_PREPARE’ undeclared (first use in this function)
         case CPU_DOWN_PREPARE:
              ^
In file included from /var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:15:0:
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nv_init_pat_support’:
/var/lib/dkms/nvidia/378.13/build/common/inc/nv-linux.h:391:34: error: implicit declaration of function ‘register_cpu_notifier’ [-Werror=implicit-function-declaration]
 #define register_hotcpu_notifier register_cpu_notifier
                                  ^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:258:17: note: in expansion of macro ‘register_hotcpu_notifier’
             if (register_hotcpu_notifier(&nv_hotcpu_nfb) != 0)
                 ^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c: In function ‘nv_teardown_pat_support’:
/var/lib/dkms/nvidia/378.13/build/common/inc/nv-linux.h:388:36: error: implicit declaration of function ‘unregister_cpu_notifier’ [-Werror=implicit-function-declaration]
 #define unregister_hotcpu_notifier unregister_cpu_notifier
                                    ^
/var/lib/dkms/nvidia/378.13/build/nvidia/nv-pat.c:283:9: note: in expansion of macro ‘unregister_hotcpu_notifier’
         unregister_hotcpu_notifier(&nv_hotcpu_nfb);
         ^

Nvidia drivers failing to compile with CL 1520.6.0

I have been testing Modulus with Nvidia drivers so I can use GPU scheduling on Tectonic. It was working just fine until CLUO upgraded my cluster to 1520.6.0. Now Modulus fails to compile the Nvidia drivers.

David Michael said it's because the build-1520 branch has a kernel that is not in 1520.6.0, since it will go into 1520.7.0. The Git branch has changes that have not been built into a release yet. https://github.com/squat/modulus/blob/master/nvidia/compile#L8

Euan Kemp believes you are doing the wrong thing here and suggested I submit an issue. Given that it's already in the dev container, it should be pretty easy to get the right version. The repo should be handy and everything.

David Michael suggested I make the script find the release commit in the manifest and check out that commit instead of the branch: https://github.com/coreos/manifest/blob/v1520.6.0/release.xml#L23
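
The manifest lookup suggested above could be sketched roughly as follows. This is only an illustration: the sample XML is a hypothetical excerpt in the style of a repo-tool manifest (the real release.xml may use different attributes or project names), and `release_revision` is a made-up helper name, not something from modulus.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of a repo-tool manifest; the real
# release.xml may differ in attributes and project names.
SAMPLE_MANIFEST = """\
<manifest>
  <project name="coreos/coreos-overlay"
           path="src/third_party/coreos-overlay"
           revision="0123456789abcdef0123456789abcdef01234567"
           upstream="refs/heads/build-1520"/>
  <project name="coreos/portage-stable"
           path="src/third_party/portage-stable"
           revision="89abcdef0123456789abcdef0123456789abcdef"
           upstream="refs/heads/build-1520"/>
</manifest>
"""


def release_revision(manifest_xml, project_name):
    """Return the pinned revision for project_name, or None if absent."""
    root = ET.fromstring(manifest_xml)
    for project in root.iter("project"):
        if project.get("name") == project_name:
            return project.get("revision")
    return None


if __name__ == "__main__":
    rev = release_revision(SAMPLE_MANIFEST, "coreos/coreos-overlay")
    # A compile script could then check out this commit instead of the branch.
    print(rev)
```

Pinning to the commit recorded for the release, rather than the tip of the build branch, would keep the sources in step with the kernel that actually shipped.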

I'm not a developer, so I'm struggling to figure out how to work around this issue. I would greatly appreciate it if you could fix your code so it works with the latest CL version.

question: does this nvidia driver work with the latest stable flatcar kernel (5.15)?

I've been trying to compile the Nvidia driver from source on my Flatcar development container.
Recently it began failing consistently with errors like:

     CC [M]  /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv-frontend.o
     CC [M]  /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv.o
   In file included from ./include/linux/kernel.h:5,
                    from ./include/linux/list.h:9,
                    from ./include/linux/preempt.h:11,
                    from ./include/linux/spinlock.h:55,
                    from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/common/inc/nv-lock.h:16,
                    from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/common/inc/nv-linux.h:20,
                    from /tmp/nvidia/NVIDIA-Linux-x86_64-460.32.03/kernel/nvidia/nv-frontend.c:13:
   ./include/linux/stdarg.h:6: warning: "va_start" redefined
       6 | #define va_start(v, l) __builtin_va_start(v, l)

I was able to fix the compilation by using an older driver and a patch released by if-not-true-then-false.com: https://www.if-not-true-then-false.com/2020/inttf-nvidia-patcher/

https://github.com/mediadepot/docker-flatcar-nvidia-driver/blob/master/compile-patched.sh#L11-L16

The problem is that, while compilation finishes successfully, I cannot depmod or modprobe the compiled kernel modules:

Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.dep.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.dep.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.alias.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.alias.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.softdep.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.symbols.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.symbols.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.builtin.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.builtin.alias.bin.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570434]: depmod: ERROR: openat(//lib/modules/5.15.32-flatcar, modules.devname.570434.941891.1649830844, 301, 644): Read-only file system
Apr 13 06:20:44 localhost bash[570000]: ++ modprobe -d / ipmi_devintf
Apr 13 06:20:44 localhost bash[570000]: ++ depmod -b /opt/drivers/nvidia
Apr 13 06:20:44 localhost bash[570480]: depmod: WARNING: could not open modules.order at /opt/drivers/nvidia/lib/modules/5.15.32-flatcar: No such file or directory
Apr 13 06:20:44 localhost bash[570480]: depmod: WARNING: could not open modules.builtin at /opt/drivers/nvidia/lib/modules/5.15.32-flatcar: No such file or directory
Apr 13 06:20:44 localhost bash[570000]: ++ modprobe -d /opt/drivers/nvidia nvidia
Apr 13 06:20:45 localhost bash[570481]: modprobe: ERROR: could not insert 'nvidia': Exec format error
Apr 13 06:20:45 localhost systemd[1]: Finished Compile Kernel Modules for Flatcar.

Is this an issue you've run into?
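
One possible cause of an "Exec format error" from modprobe is a module built against a different kernel than the one that is running, although nothing in the log above confirms that this is the case here. A minimal sketch for checking it, assuming `modinfo` from kmod is available; the function names are hypothetical:

```python
import subprocess


def kernel_release_from_vermagic(vermagic):
    """The first whitespace-separated token of vermagic is the kernel release."""
    parts = vermagic.split()
    return parts[0] if parts else ""


def vermagic_matches(vermagic, running_release):
    """True when the module was built for the given kernel release."""
    return kernel_release_from_vermagic(vermagic) == running_release


def read_vermagic(module_path):
    """Ask modinfo for the module's vermagic field (needs kmod installed)."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

On the affected host, the result of `read_vermagic()` for the compiled nvidia module could be compared against `os.uname().release`, which should read `5.15.32-flatcar` according to the log.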

Systemd instructions now fail when compiling Nvidia

Hey,
The systemd instructions now fail in the compilation stage (systemctl start ..) after your most recent change.

I ended up just using a previous commit to get everything working, and unfortunately I don't have the failing log files available to include in this issue.

DevicePlugin keeps Terminating

Hi,

I am facing a problem here with the DevicePlugin. It starts normally and the logs look fine. However, after a couple of secs/mins the Pod gets terminated without any error.

I am using your modulus daemonset and the latest GKE device plugin.

Any idea what is going on there? Is this the normal behaviour?

Thanks,
Andreas

Adapting modulus for general-purpose kernel module compilation

First off, just wanted to say thanks for this awesome repo and your KubeConEU presentation Lucas!

Quick side-question: at the 12:49 timestamp in the video of your talk, you briefly have a slide up showing the gdisk -l coreos_developer_container.bin command and mention that you'd come back to it if there was time. Looks like there wasn't time, or was the topic covered elsewhere?

Main question: (perhaps relatedly) it seemed like modulus might be a great way to automatically compile other (non-nvidia) modules for those of us running Kubernetes on CoreOS Container Linux, but the 2.9G size of the chroot appears to end up being a bottleneck. I've been bumping against Fatal error: [...] No space left on device errors while playing around trying to adapt modulus to compile some in-tree kernel modules. It's unclear to me what the specific purpose of the truncate_bin() function is, and whether the dev container image base size could easily be made bigger in order to have more free space for compiling other modules. Would an approach along those lines make sense for folks interested in general-purpose automatic kernel module compilation, or is that barking up the wrong tree?
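
As a rough illustration of the "make the image bigger" idea, a raw image file can be extended in place. This sketch assumes the dev container image is a plain raw file and says nothing about what truncate_bin() in modulus actually does; `grow_image` and `demo` are hypothetical names. Extending the file alone is not enough, since the partition table and the filesystem inside would still need to be resized separately.

```python
import os
import tempfile


def grow_image(path, new_size):
    """Extend a raw image file to new_size bytes; never shrink it."""
    current = os.path.getsize(path)
    if new_size <= current:
        return current
    os.truncate(path, new_size)  # the added range reads back as zeros
    return os.path.getsize(path)


def demo():
    """Create a small temporary file, grow it, and report both sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "image.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * 10)
        before = os.path.getsize(path)
        after = grow_image(path, 4096)
        return before, after


if __name__ == "__main__":
    print(demo())
```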

Nvidia Docker runtime and Kubernetes Device Plugin

Thanks a lot for the great project and the great talk at KubeCon Europe! :)

Am I correct, that the 2 missing pieces now are 1. Nvidia Docker runtime and 2. Kubernetes Device Plugin?

Regarding 1: Do you have some hints on how to get this up and running on CoreOS? There are a couple of projects on GitHub covering how to get the Nvidia Docker runtime up and running - I am not sure which one to use.

Regarding 2: Can I use the official off-the-shelf Nvidia Device Plugin[1]?

Regards,
Andreas

[1] https://github.com/NVIDIA/k8s-device-plugin

Missing glxserver_nvidia submodule?

I am trying to configure OpenGL 3D hardware acceleration - this project has got me very close, however my X Server log output suggests that a submodule is missing.

Full details of this issue are on SO - I can copy them here though if that'd be better.

Systemd failures

Hey
any idea why this is happening with systemd?

[email protected] - compile kernel modules
   Loaded: loaded (/etc/systemd/system/[email protected]; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2019-03-03 06:02:40 UTC; 5min ago
  Process: 14367 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus install $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=1/FAILURE)
  Process: 3377 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus compile $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=0/SUCCESS)
  Process: 3366 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus install $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=1/FAILURE)
  Process: 3297 ExecStart=/bin/bash -c ${MODULUS_BIN_DIR}/modulus download $(echo nvidia-390.48 | tr '-' ' ') (code=exited, status=255)
 Main PID: 14367 (code=exited, status=1/FAILURE)

Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.symbols.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.symbols.bin.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.builtin.bin.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: ERROR: openat(//lib/modules/4.19.23-coreos-r1, modules.devname.tmp, 1101, 644): Read-only file system
Mar 03 06:02:40 localhost bash[14367]: depmod: WARNING: could not open /opt/drivers/nvidia/lib/modules/4.19.23-coreos-r1/modules.order: No such file or directory
Mar 03 06:02:40 localhost bash[14367]: depmod: WARNING: could not open /opt/drivers/nvidia/lib/modules/4.19.23-coreos-r1/modules.builtin: No such file or directory
Mar 03 06:02:40 localhost bash[14367]: modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
Mar 03 06:02:40 localhost systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Mar 03 06:02:40 localhost systemd[1]: [email protected]: Failed with result 'exit-code'.
Mar 03 06:02:40 localhost systemd[1]: Failed to start compile kernel modules.
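
For readers puzzling over the ExecStart lines in the status output above: the `$(echo nvidia-390.48 | tr '-' ' ')` substitution turns the unit's instance name into separate arguments for modulus. A small sketch of the same splitting, using a hypothetical helper name:

```python
def split_instance(instance):
    """Mimic `echo "$instance" | tr '-' ' '` followed by shell word splitting."""
    return instance.replace("-", " ").split()


if __name__ == "__main__":
    print(split_instance("nvidia-390.48"))   # ['nvidia', '390.48']
    # Every dash is replaced, so a name containing extra dashes
    # yields more than two arguments.
    print(split_instance("my-module-1.0"))   # ['my', 'module', '1.0']
```

So the failing step in the output is effectively `modulus install nvidia 390.48`.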

coreos-overlay did not match any file(s) known to git

I'm seeing the following:

$ kubectl logs -n kube-system nvidia-driver-installer-9zdtf -c modulus -f
...
Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 48820723 Merge pull request #3879 from kinvolk/kai/oem-gce-rkt-dns-nsswitch

Performing Global Updates
(Could take a couple of minutes if you have a lot of binary packages.)


>>> Cloning repository 'portage-stable' from 'https://github.com/coreos/portage-stable.git'...
>>> Starting git clone in /var/lib/portage/portage-stable
>>> Git clone in /var/lib/portage/portage-stable successful
>>> Release checkout b46831017581540a0105466c9cd03d342041e577 in /var/lib/portage/portage-stable successful
>>> Cloning repository 'coreos' from 'https://github.com/coreos/coreos-overlay.git'...
>>> Starting git clone in /var/lib/portage/coreos-overlay
>>> Git clone in /var/lib/portage/coreos-overlay successful
>>> Release checkout 488207230f745be2f1daea126cd5034c2ff59557 in /var/lib/portage/coreos-overlay successful
error: pathspec 'src/third_party/coreos-overlay' did not match any file(s) known to git

I'm not too sure what the issue here is, potentially related to CoreOS going out of support?

EDIT: this seems to fix it: #16

Seeing `No space left on device` errors when using systemd

Hey, I've started seeing the following error when starting the modulus service w/nvidia.

Apr 30 05:57:50 localhost bash[13665]: >>> Emerging binary (7 of 7) sys-kernel/coreos-sources-4.19.34::coreos
Apr 30 05:57:54 localhost bash[13665]: Traceback (most recent call last):
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib/python-exec/python2.7/emerge", line 53, in <module>
Apr 30 05:57:54 localhost bash[13665]:     retval = emerge_main()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/main.py", line 1289, in emerge_main
Apr 30 05:57:54 localhost bash[13665]:     return run_action(emerge_config)
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 3332, in run_action
Apr 30 05:57:54 localhost bash[13665]:     retval = action_build(emerge_config, spinner=spinner)
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 541, in action_build
Apr 30 05:57:54 localhost bash[13665]:     retval = mergetask.merge()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1019, in merge
Apr 30 05:57:54 localhost bash[13665]:     rval = self._merge()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1413, in _merge
Apr 30 05:57:54 localhost bash[13665]:     self._main_loop()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1389, in _main_loop
Apr 30 05:57:54 localhost bash[13665]:     self._event_loop.run_until_complete(self._main_exit)
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 831, in run_until_complete
Apr 30 05:57:54 localhost bash[13665]:     self.iteration()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 285, in iteration
Apr 30 05:57:54 localhost bash[13665]:     return self._iteration(*args)
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 379, in _iteration
Apr 30 05:57:54 localhost bash[13665]:     if not x.callback(f, event, *x.args):
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/_eventloop/EventLoop.py", line 117, in __call__
Apr 30 05:57:54 localhost bash[13665]:     callback()
Apr 30 05:57:54 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/_async/PipeLogger.py", line 124, in _output_handler
Apr 30 05:57:54 localhost bash[13665]:     log_file.flush()
Apr 30 05:57:54 localhost bash[13665]: IOError: [Errno 28] No space left on device
Apr 30 05:57:55 localhost bash[13665]: Traceback (most recent call last):
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib/python-exec/python2.7/emerge", line 53, in <module>
Apr 30 05:57:55 localhost bash[13665]:     retval = emerge_main()
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/main.py", line 1289, in emerge_main
Apr 30 05:57:55 localhost bash[13665]:     return run_action(emerge_config)
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 3332, in run_action
Apr 30 05:57:55 localhost bash[13665]:     retval = action_build(emerge_config, spinner=spinner)
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/actions.py", line 541, in action_build
Apr 30 05:57:55 localhost bash[13665]:     retval = mergetask.merge()
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 942, in merge
Apr 30 05:57:55 localhost bash[13665]:     self._save_resume_list()
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/_emerge/Scheduler.py", line 1819, in _save_resume_list
Apr 30 05:57:55 localhost bash[13665]:     mtimedb.commit()
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/mtimedb.py", line 125, in commit
Apr 30 05:57:55 localhost bash[13665]:     f.close()
Apr 30 05:57:55 localhost bash[13665]:   File "/usr/lib64/python2.7/site-packages/portage/util/__init__.py", line 1350, in close
Apr 30 05:57:55 localhost bash[13665]:     f.close()
Apr 30 05:57:55 localhost bash[13665]: IOError: [Errno 28] No space left on device
Apr 30 05:57:56 localhost bash[13665]: Container coreosdevelopercontainer.bin failed with error code 1.
Apr 30 05:57:56 localhost systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Apr 30 05:57:56 localhost systemd[1]: [email protected]: Failed with result 'exit-code'.
Apr 30 05:57:56 localhost systemd[1]: Failed to start compile kernel modules.
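
The traceback above shows emerge hitting Errno 28 while installing coreos-sources. A pre-flight check along the following lines could make such a run fail early with a clearer message. It is only a sketch: the helper names are hypothetical, the required size is something the caller would have to estimate, and the check has to be run against a path on the filesystem that actually fills up (here, the one inside the dev container).

```python
import shutil


def free_bytes(path):
    """Free space, in bytes, on the filesystem that holds path."""
    return shutil.disk_usage(path).free


def has_room(path, required_bytes):
    """True when at least required_bytes are free on path's filesystem."""
    return free_bytes(path) >= required_bytes


if __name__ == "__main__":
    # The 1 GiB threshold is an arbitrary illustration, not a measured need.
    if not has_room(".", 1024 ** 3):
        raise SystemExit("not enough free space to build kernel modules")
```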
