Comments (13)

drossetti commented on May 23, 2024

Ehsan, insmod.sh requires that the user issuing the command have sudo privileges.

from gdrcopy.

moravveji commented on May 23, 2024

I definitely have root permission. Let me copy-paste what I get when running "make":
gdrcopy$> sudo ./build.sh &> log.txt

And the tail of the log.txt reads:

`make PREFIX=/easybuild/work/gdrcopy/install CUDA=/software/CUDA/9.1.85 all install
echo "GDRAPI_ARCH=X86"
GDRAPI_ARCH=X86

cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -o gdrapi.o gdrapi.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -mavx -o memcpy_avx.o memcpy_avx.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -msse -o memcpy_sse.o memcpy_sse.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -msse4.1 -o memcpy_sse41.o memcpy_sse41.c
cc -shared -Wl,-soname,libgdrapi.so.1 -o libgdrapi.so.1.2 gdrapi.o memcpy_avx.o memcpy_sse.o memcpy_sse41.o
ldconfig -n /easybuild/work/gdrcopy/gdrcopy
ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv;
make
find: ‘/usr/src/nvidia-’: No such file or directory
dirname: missing operand
Try 'dirname --help' for more information.
make[1]: Entering directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-693.17.1.el7.x86_64'
find: ‘/usr/src/nvidia-’: No such file or directory
dirname: missing operand
Try 'dirname --help' for more information.
CC [M] /easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.o
/easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.c:48:20: fatal error: nv-p2p.h: No such file or directory
#include "nv-p2p.h"
^
compilation terminated.
make[3]: *** [/easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.o] Error 1
make[2]: *** [_module_/easybuild/work/gdrcopy/gdrcopy/gdrdrv] Error 2
make[2]: Leaving directory `/usr/src/kernels/3.10.0-693.17.1.el7.x86_64'
make[1]: *** [module] Error 2
make[1]: Leaving directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
make: *** [driver] Error 2
`

I am building against CUDA/9.1.85.

moravveji commented on May 23, 2024

I made some progress with the previous errors, and now, I get a new error:
insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Unknown symbol in module

drossetti commented on May 23, 2024

Hard to tell.
Are you building and installing on the same machine?
There should be a detailed error in the kernel log. You could use 'dmesg' to dump that log and copy the relevant lines here.
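
The kernel log usually names the symbol that could not be resolved. A minimal sketch of extracting those names, assuming the kernel's usual `<module>: Unknown symbol <name> (err N)` message format; the helper name `unknown_symbols` is made up for illustration and is not part of gdrcopy:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# unresolved symbol names reported for a given module (default: gdrdrv).
unknown_symbols() {
    mod="${1:-gdrdrv}"
    # Assumed message shape: "gdrdrv: Unknown symbol foo (err 0)"
    sed -n "s/.*${mod}: Unknown symbol \([^ ]*\).*/\1/p" | sort -u
}

# Typical use right after a failed insmod (reading the kernel log may
# require root on some systems):
#   dmesg | unknown_symbols gdrdrv
```

If the reported names belong to the NVIDIA driver, that points at the driver not being loaded or at a build against different driver sources.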

moravveji commented on May 23, 2024

Alright ... I'm coming back to this ticket, because I need gdrcopy for a CUDA-aware OpenMPI. I am attaching the redirected stderr/stdout from building gdrcopy here, together with the very simple build script I am using.
In brief, I have two complaints now, one about NVIDIA_SRC_DIR, and the other about CONFIG_RETPOLINE during the "make" step. In fact, I am not sure how to set these so that they propagate properly to make.

Furthermore, what is expected to be inside NVIDIA_SRC_DIR?
What do you see on your platform?

gdrcopy.zip

moravveji commented on May 23, 2024

I would like to draw your attention to this ticket. My installation of CUDA-aware MPI is blocked on compiling gdrcopy. Could you please take a look at my error logs, and also at the questions I raised above?
Thanks a lot.
E.

drossetti commented on May 23, 2024

Ehsan,
thank you for trying gdrcopy.
The excerpt from your build log, copied below, is clear enough:

  1. NVIDIA_SRC_DIR is set automatically, based on the local install directory of your GPU driver.
  2. CONFIG_RETPOLINE is apparently not supported by your host compiler. I am not an expert, but I don't believe you are supposed to tweak the compiler command line for a kernel module. Either your Linux kernel automatically detects and enables retpoline or it does not.

make[1]: Entering directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.40.04/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.10.1.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.10.1.el7.x86_64'
make[1]: *** [module] Error 2
make[1]: Leaving directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
make: *** [driver] Error 2
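
On what NVIDIA_SRC_DIR should contain: the first build log in this thread fails on `nv-p2p.h`, and its `find: ‘/usr/src/nvidia-’` lines suggest the build searches under `/usr/src/nvidia-*` for that header. A rough way to see which directory would qualify, assuming that layout (the function name `find_nv_p2p_dir` is hypothetical, not part of gdrcopy):

```shell
#!/bin/sh
# Hypothetical helper: print the directory holding nv-p2p.h under
# <root>/nvidia-*; return non-zero if no such header is found.
find_nv_p2p_dir() {
    root="${1:-/usr/src}"
    hdr=$(find "$root"/nvidia-* -name nv-p2p.h 2>/dev/null | head -n 1)
    [ -n "$hdr" ] || return 1
    dirname "$hdr"
}

# Example:
#   find_nv_p2p_dir /usr/src || echo "no NVIDIA driver sources found"
```

An empty result here would be consistent with the `dirname: missing operand` and `NVIDIA_DRIVER_MISSING` messages seen in the earlier log.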

moravveji commented on May 23, 2024

Thanks Davide for your message; it brought some activity back to this ticket.
My problem is that, whether or not I set the two env vars NVIDIA_SRC_DIR and/or CONFIG_RETPOLINE, my build always crashes at the same location and throws the same error message. That makes me wonder whether I am doing it right.
Do you have any idea why my build crashes? And how to resolve this?

drossetti commented on May 23, 2024

That kernel module build error is discussed on the net, e.g. on RH/CentOS forums/bugzilla.
For example see https://bugzilla.redhat.com/show_bug.cgi?id=1566297#c12
I think you might have updated the kernel but not the gcc RPM.

moravveji commented on May 23, 2024

Thanks Davide for the hint. For some reason, when I use the GCC/6.4.0 module on our compute nodes (where rpm -q gcc gives gcc-4.8.5-36.el7_6.1.x86_64), the installation keeps failing! However, when I purge the GCC module and stick to the system gcc, it builds flawlessly.
I still cannot comprehend why gdrcopy builds with an older GCC but not with a newer one!

drossetti commented on May 23, 2024

BTW gdrdrv is a kernel module, which takes advantage of the Linux kernel build system, i.e. it does not have its own build system.
It looks like retpoline support is in gcc 7.3 or 8.x, but not in 6.x.
Most probably RH backported retpoline support onto their gcc 4.8.5 branch.
Closing, as this is a local customer server issue.
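
A quick way to compare compilers on this point is to ask each one whether it accepts retpoline-style flags. This is only a sketch: it assumes the flags upstream x86 kernels pass to gcc (`-mindirect-branch=thunk-extern -mindirect-branch-register`); the RHEL 3.10 kernel Makefile may probe differently, and the function name `cc_supports_retpoline` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given compiler accepts the
# retpoline-related flags on a trivial C translation unit.
cc_supports_retpoline() {
    cc="${1:-cc}"
    echo 'int main(void) { return 0; }' |
        "$cc" -mindirect-branch=thunk-extern -mindirect-branch-register \
              -x c -c - -o /dev/null >/dev/null 2>&1
}

# Example: compare the system compiler with whatever is first on PATH.
#   cc_supports_retpoline /usr/bin/gcc && echo "system gcc accepts the flags"
#   cc_supports_retpoline gcc || echo "gcc on PATH rejects the flags"
```

Running it once with the environment module loaded and once after purging it would show which compiler the kernel build system is actually picking up.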

zhuanwancaishi commented on May 23, 2024

Hello, how did you fix the problem "insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Unknown symbol in module"?

pakmarkthub commented on May 23, 2024

Hi @zhuanwancaishi ,

There are multiple possibilities:

  1. Was the NVIDIA driver (nvidia.ko) loaded before you ran insmod.sh?
  2. When you compiled gdrdrv, a message should have been printed. Did it pick the correct NVIDIA driver and the Linux kernel version you are running?
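
Both checks above can be scripted. The sketch below assumes the standard `/proc/modules` listing (module name in the first column) and the `vermagic` string printed by `modinfo -F vermagic`, whose first word is the kernel release the module was built for. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the two checks above.

# Succeed if a module named $1 appears in a /proc/modules-style listing.
# The optional second argument overrides the file (useful for testing).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Succeed if the first word of a vermagic string equals the given kernel
# release (default: the running kernel, from uname -r).
vermagic_matches() {
    built="${1%% *}"
    [ -n "$built" ] && [ "$built" = "${2:-$(uname -r)}" ]
}

# Example:
#   module_loaded nvidia || echo "nvidia.ko is not loaded"
#   vermagic_matches "$(modinfo -F vermagic gdrdrv/gdrdrv.ko)" ||
#       echo "gdrdrv.ko was built for a different kernel"
```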
