nvidia / gdrcopy
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
License: MIT License
Hi,
I use these same APIs to create perfectly valid CPU mappings of two GPU memory buffers, like this:
//------ dev_b pin buff ------------------------------------------------------
unsigned int flag_b = 1;  // enable synchronous memory operations on dev_b
cuPointerSetAttribute(&flag_b, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_b);
gdr_mh_t mh_b;
gdr_t g_b = gdr_open();
ASSERT_NEQ(g_b, (void*)0);
ASSERT_EQ(gdr_pin_buffer(g_b, dev_b, sizeof(int), 0, 0, &mh_b), 0);
void *bar_ptr_b = NULL;
ASSERT_EQ(gdr_map(g_b, mh_b, &bar_ptr_b, sizeof(int)), 0);
gdr_info_t info_b;
ASSERT_EQ(gdr_get_info(g_b, mh_b, &info_b), 0);
int off_b = dev_b - info_b.va;  // offset of dev_b within the mapped range
cout << "off_b:" << off_b << endl;
uint32_t *buf_ptr_b = (uint32_t *)((char *)bar_ptr_b + off_b);
cout << "buf_ptr_b:" << buf_ptr_b << endl;
//----------------------------------------------------------------------------
//------ dev_a pin buff ------------------------------------------------------
unsigned int flag = 1;  // enable synchronous memory operations on dev_a
cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_a);
gdr_mh_t mh;
gdr_t g = gdr_open();
ASSERT_NEQ(g, (void*)0);
ASSERT_EQ(gdr_pin_buffer(g, dev_a, N*sizeof(int), 0, 0, &mh), 0);
void *bar_ptr = NULL;
ASSERT_EQ(gdr_map(g, mh, &bar_ptr, N*sizeof(int)), 0);
gdr_info_t info;
ASSERT_EQ(gdr_get_info(g, mh, &info), 0);
int off = dev_a - info.va;  // offset of dev_a within the mapped range
cout << "off_a:" << off << endl;
uint32_t *buf_ptr = (uint32_t *)((char *)bar_ptr + off);
cout << "buf_ptr:" << buf_ptr << endl;
But it failed. And I found that it is related to the order of a and b: the buffer mapped first succeeds, the subsequent one fails.
Do you have any idea what could be happening here?
Thanks,
[ 2260.994632] gdrdrv:minor=0
[ 2260.994639] gdrdrv:ioctl called (cmd 0xc020da01)
[ 2260.994641] gdrdrv:invoking nvidia_p2p_get_pages(va=0x10916200000 len=4096 p2p_tok=0 va_tok=0)
[ 2260.995112] gdrdrv:page table entries: 1
[ 2260.995113] gdrdrv:page[0]=0x0000383800200000
[ 2260.995116] gdrdrv:ioctl called (cmd 0xc008da04)
[ 2260.995120] gdrdrv:mmap start=0x7f20ae059000 size=4096 off=0x455f5790
[ 2260.995121] gdrdrv:offset=0 len=65536 vaddr+offset=7f20ae059000 paddr+offset=383800200000
[ 2260.995122] gdrdrv:mmaping phys mem addr=0x383800200000 size=65536 at user virt addr=0x7f20ae059000
[ 2260.995123] gdrdrv:pfn=0x383800200
[ 2260.995124] gdrdrv:calling io_remap_pfn_range() vma=ffff883f28c33a90 vaddr=7f20ae059000 pfn=383800200 size=65536
[ 2260.995163] ------------[ cut here ]------------
[ 2260.995182] kernel BUG at /build/linux-lts-xenial-80t3lB/linux-lts-xenial-4.4.0/mm/memory.c:1674!
[ 2260.995204] invalid opcode: 0000 [#1] SMP
...
[ 2260.995861] [] gdrdrv_mmap_phys_mem_wcomb+0x71/0x130 [gdrdrv]
[ 2260.995879] [] gdrdrv_mmap+0x156/0x2e0 [gdrdrv]
[ 2260.995896] [] ? kmem_cache_alloc+0x1e2/0x200
[ 2260.995911] [] mmap_region+0x3f4/0x610
[ 2260.995926] [] do_mmap+0x2fc/0x3d0
[ 2260.995940] [] vm_mmap_pgoff+0x91/0xc0
[ 2260.995954] [] SyS_mmap_pgoff+0x197/0x260
[ 2260.995970] [] SyS_mmap+0x22/0x30
-bash-4.2$ ./validate
buffer size: 327680
device ptr: 7fffa0600000
gdr open: 0xc9abf0
before ioctl GDRDRV IOC PIN BUFFER c020da01
After ioctl retcode -1
-bash-4.2$
-bash-4.2$ ./copybw
GPU id:0 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 26 device: 0
GPU id:1 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 28 device: 0
GPU id:2 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 136 device: 0
GPU id:3 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 138 device: 0
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 7fffa0600000
before ioctl GDRDRV IOC PIN BUFFER c020da01
After ioctl size -1
closing gdrdrv
-bash-4.2$
Since the struct file in gdrdrv keeps the internal data for the process that opened the file, being able to share the fd can lead to undesired behavior. Create a unit test to make sure that the fd is not sharable:
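A minimal sketch of such a test, assuming the usual fork() fd-inheritance semantics. Here /dev/null and a dummy ioctl request stand in for /dev/gdrdrv and the pin-buffer ioctl; in the real unit test the parent would gdr_open() before fork() and the child would expect gdr_pin_buffer() through the inherited handle to fail:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Skeleton of the "fd is not sharable" test. Returns 0 when the child's
 * request through the inherited fd was rejected, as desired. */
static int run_fd_share_check(void)
{
    int fd = open("/dev/null", O_RDWR);  /* stand-in for /dev/gdrdrv */
    if (fd < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The fd is inherited across fork(); since gdrdrv keeps per-process
         * state in struct file, any request through it must be rejected. */
        int rc = ioctl(fd, 0);      /* hypothetical request number */
        _exit(rc == -1 ? 0 : 1);    /* exit 0 iff the call failed */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```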
Code from the master branch (bf4848f).
$ dpkg -l gdrdrv-dkms
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============================================-============================-============================-==================================================================================================
ii gdrdrv-dkms:amd64 2.0 amd64 gdrdrv driver in DKMS format.
$ modinfo gdrdrv
filename: /lib/modules/4.15.0-58-generic/updates/dkms/gdrdrv.ko
version: 1.1
description: GDRCopy kernel-mode driver
license: MIT
author: [email protected]
srcversion: D5FB5F3108420043522DCAC
depends: nv-p2p-dummy
retpoline: Y
name: gdrdrv
vermagic: 4.15.0-58-generic SMP mod_unload
parm: dbg_enabled:enable debug tracing (int)
parm: info_enabled:enable info tracing (int)
devendar@ibm-p9-013 gdrcopy (git::devel)$ make
echo "GDRAPI_ARCH=POWER"
GDRAPI_ARCH=POWER
make: Warning: File `libgdrapi.so.1.2' has modification time 22 s in the future
cc -O2 -fPIC -I /usr/local/cuda/include -I gdrdrv/ -I /usr/local/cuda/include -D GDRAPI_ARCH=POWER -c -o gdrapi.o gdrapi.c
cc -shared -Wl,-soname,libgdrapi.so.1 -o libgdrapi.so.1.2 gdrapi.o
ldconfig -n /labhome/devendar/gdrcopy
ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv; \
make
make[1]: Entering directory `/labhome/devendar/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-387.26/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/4.11.0-44.el7a.ppc64le'
make[3]: Warning: File `/labhome/devendar/gdrcopy/gdrdrv/modules.order' has modification time 22 s in the future
make[3]: warning: Clock skew detected. Your build may be incomplete.
Building modules, stage 2.
MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/4.11.0-44.el7a.ppc64le'
make[1]: Leaving directory `/labhome/devendar/gdrcopy/gdrdrv'
g++ -O2 -I /usr/local/cuda/include -I gdrdrv/ -I /usr/local/cuda/include -D GDRAPI_ARCH=POWER -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64 -o basic basic.o libgdrapi.so.1.2 -lcudart -lcuda -lpthread -ldl
libgdrapi.so.1.2: undefined reference to `_mm_sfence'
collect2: error: ld returned 1 exit status
make: *** [basic] Error 1
GDRDRV needs 64kB-aligned addresses.
gdrdrv_pin_buffer() {
...
page_virt_start = params.addr & GPU_PAGE_MASK;
page_virt_end = params.addr + params.size - 1;
rounded_size = page_virt_end - page_virt_start + 1;
mr->offset = params.addr & GPU_PAGE_OFFSET;
...
}
and
gdrdrv_mmap() {
...
if (mr->offset) {
gdr_dbg("offset != 0 is not supported\n");
ret = -EINVAL;
goto out;
}
...
}
This is no longer guaranteed by cudaMalloc in recent CUDA drivers (since 410). A temporary WAR, at the application level, is to allocate with cudaMalloc a memory area of size + GPU_PAGE_SIZE
and then search for the first 64kB-aligned address. Something like:
alloc_size = buffer_size + GPU_PAGE_SIZE;
cuMemAlloc(&dev_addr, alloc_size);
if (dev_addr % GPU_PAGE_SIZE) {
    dev_addr += GPU_PAGE_SIZE - (dev_addr % GPU_PAGE_SIZE);
}
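The pointer arithmetic of this WAR can be checked in isolation. GPU_PAGE_SIZE matches gdrdrv's constant, but the helper name is illustrative; over-allocating by one GPU page guarantees the rounded-up start still leaves buffer_size usable bytes:

```c
#include <stdint.h>

#define GPU_PAGE_SIZE 0x10000u  /* 64 KiB, as in gdrdrv */

/* Round an address up to the next 64 KiB boundary, as in the WAR above. */
static uintptr_t align_up_gpu_page(uintptr_t addr)
{
    if (addr % GPU_PAGE_SIZE)
        addr += GPU_PAGE_SIZE - (addr % GPU_PAGE_SIZE);
    return addr;
}
```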
Hello,
I'm trying to build gdrcopy correctly in order to build UCX. Following the website instructions, the installation seems to work fine:
sudo make PREFIX=/usr/local/gdrcopy CUDA=/usr/local/cuda-10.1
echo "GDRAPI_ARCH=X86"
GDRAPI_ARCH=X86
cd gdrdrv;
make
make[1]: Entering directory `/home/centos/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.5.1.el7.x86_64'
Building modules, stage 2.
MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.5.1.el7.x86_64'
make[1]: Leaving directory `/home/centos/gdrcopy/gdrdrv'
sudo ./insmod.sh
INFO: driver major is 240
INFO: creating /dev/gdrdrv inode
The validation codes yield:
./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size -325939184 is not dword aligned, ignoring trailing bytes
unampping
unpinning
./copybw
GPU id:0 name:Tesla M60 PCI domain: 0 bus: 0 device: 30
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: b04720000
bar_ptr: 0x7f670a353000
info.va: b04720000
info.mapped_size: 131072
info.page_size: 65536
page offset: 0
user-space pointer:0x7f670a353000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9585.88MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 529.436MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
However, I don't see any file in the subdirectory /usr/local/gdrcopy, and when I try to configure and build UCX (1.5.2), I get the error message: configure: error: gdrcopy support is requested but gdrcopy packages can't found
Thank you.
Hello,
I'm running into some issues while trying to use gdrcopy in a MPI environment. I have CUDA 10.1 (418.67) and the error reads:
GDRCOPY library "libgdrapi.so" unable to open GDR driver, is gdrdrv.ko loaded?
I'm new to gdrcopy and don't really know what this means. After installing gdrcopy, I performed the suggested validations that read OK to me:
./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 1763323920 is not dword aligned, ignoring trailing bytes
unampping
unpinning
./copybw
GPU id:0 name:Tesla K80 PCI domain: 0 bus: 0 device: 4
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 403960000
bar_ptr: 0x7f0395d5c000
info.va: 403960000
info.mapped_size: 131072
info.page_size: 65536
page offset: 0
user-space pointer:0x7f0395d5c000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9437.68MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 356.296MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
Any suggestions on how to proceed or what I am missing? Thanks.
I get this exception:
ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv;
/usr/bin/make64
make64[1]: Entering directory `/home/users/tangwei12/gdrcopy-master/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-linux-x86_64-390.12/kernel/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make64[2]: Entering directory `/home/users/tangwei12/linux-4-14'
WARNING: Symbol version dump ./Module.symvers is missing; modules will have no dependencies and modversions.
CC [M] /home/users/tangwei12/gdrcopy-master/gdrdrv/nv-p2p-dummy.o
CC [M] /home/users/tangwei12/gdrcopy-master/gdrdrv/gdrdrv.o
Building modules, stage 2.
MODPOST 2 modules
FATAL: /home/users/tangwei12/gdrcopy-master/gdrdrv/gdrdrv.o is truncated. sechdrs[i].sh_offset=7089075323386670592 > sizeof(*hrd)=64
make64[3]: *** [__modpost] Error 1
make64[2]: *** [modules] Error 2
make64[2]: Leaving directory `/home/users/tangwei12/linux-4-14'
make64[1]: *** [module] Error 2
make64[1]: Leaving directory `/home/users/tangwei12/gdrcopy-master/gdrdrv'
make64: *** [driver] Error 2
OS: centOS 6.3 (4.14.18)
CUDA: 9
Driver Version: 390.12
I'm seeing the following error when building the gdrcopy deb package:
> ./build-deb-packages.sh
...
> dpkg-shlibdeps: error: no dependency information found for /usr/lib/x86_64-linux-gnu/libcuda.so.1 (used by debian/gdrcopy/usr/bin/sanity)
This is, I suppose, due to installing the driver from a downloaded *.run package.
The error can be suppressed by adding the rule:
override_dh_shlibdeps:
	dh_shlibdeps --dpkg-shlibdeps-params=--ignore-missing-info
to packages/debian/rules, but I'm not sure whether this is the right way to maintain it.
I have built the gdrcopy RPMS from the build_packages.sh script in the source tree and I've found that I'm unable to install the gdrcopy-devel package because it is missing a required dependency.
Error: Package: gdrcopy-devel-1.3-2.x86_64 (/gdrcopy-devel-1.3-2.x86_64)
Requires: libgdrapi.so.1()(64bit)
The file listed was installed by the gdrcopy RPM, but that library didn't get listed as being provided by the RPM.
If this works for other users then I may have messed something up in my environment, but if not it is probably a bug in the spec file that should get fixed.
Either way I'm willing to put some work into figuring it out, but I wanted to know which side of the problem to focus on.
This is a code error in "check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + %d bytes offset". The size passed to gdr_copy_to_mapping and gdr_copy_from_mapping does not account for extra_off.
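A sketch of the corrected size computation; offset_copy is a hypothetical stand-in for the gdr_copy_to_mapping/gdr_copy_from_mapping calls in check 5, showing that the size must shrink by extra_off or the copy runs past the end of the pinned region:

```c
#include <stddef.h>
#include <string.h>

/* When the copy starts extra_off bytes into the region, only
 * buf_size - extra_off bytes may be copied. */
static size_t offset_copy(char *dst, const char *src,
                          size_t buf_size, size_t extra_off)
{
    size_t n = buf_size - extra_off;   /* account for the extra offset */
    memcpy(dst + extra_off, src + extra_off, n);
    return n;
}
```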
systemctl is used in installation and uninstallation scripts for gdrcopy-kmod.rpm, but RHEL6 does not have systemctl.
This has been reported by Mark Silberstein [email protected]

We finally pinpointed the setup, and it's easily reproducible.
As long as blocks are less than 64K, we get ~1GB/s. For blocks >= 64K we get around 13MB/s
Hi, we found that the tests break the machine with CUDA 9, NVIDIA driver 390.25, the latest Mellanox driver (v4_3-1_0_1_0), and Linux kernel 4.14.11. The installation of gdrcopy finishes successfully, but the machine freezes and then reboots when running the tests.
Any suggestions? Thanks
strawman design:
root@ibm-p9-012 gdrcopy]# ./copybw -s 4294967296 -c 4294967296 -d 0
GPU id:0 name:Tesla V100-SXM2-16GB PCI domain: 4 bus: 4 device: 0
GPU id:1 name:Tesla V100-SXM2-16GB PCI domain: 4 bus: 5 device: 0
GPU id:2 name:Tesla V100-SXM2-16GB PCI domain: 53 bus: 3 device: 0
GPU id:3 name:Tesla V100-SXM2-16GB PCI domain: 53 bus: 4 device: 0
selecting device 0
testing size: 4294967296
rounded size: 4294967296
device ptr: 7ffe40000000
bar_ptr: 0x7ffc3fff0000
info.va: 7ffe40000000
info.mapped_size: 4294967296
info.page_size: 65536
page offset: 0
user-space pointer:0x7ffc3fff0000
BAR writing test, size=4294967296 offset=0 num_iters=10000
Bus error (core dumped)
[root@ibm-p9-012 gdrcopy]#
$ make
...
...
/usr/bin/ld: copybw.o: undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'
/usr/bin/ld: note: 'clock_gettime@@GLIBC_2.2.5' is defined in DSO /lib64/librt.so.1 so try adding it to the linker command line
/lib64/librt.so.1: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [copybw] Error 1
Had to add "-lrt" to LIBS in Makefile:15
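For reference, the change amounts to something like the following Makefile fragment (the exact line differs across versions; -lrt is only needed with glibc older than 2.17, where clock_gettime lives in librt):

```make
# librt provides clock_gettime on pre-2.17 glibc
LIBS += -lrt
```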
Dear,
We have several GPU nodes (Skylake processors with 4x P100 cards per each node), and I would like to test if the RDMA is available on these nodes or not.
When I try to build the gdrcopy, I get the following error message:
mknod: ‘/dev/gdrdrv’: Operation not permitted
Here is the specification of the host:
$> uname -a Linux r23g34 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
In fact, there is not such a file at /dev/gdrdrv
on our current system. Do you have an idea what is wrong here?
Thanks
Ehsan
Today, forking can lead to spurious prints (
Line 236 in f54766b
Appears the tag/GitHub release for 1.2 is missing. Could this please be added?
Hi,
I have seen a segfault when copying buffers (gdr_copy_from_bar
) with sizes ranging from 64 to 127 bytes. The following are reproducers on our machines.
$ /opt/gdrcopy8.0/copybw -s 64
GPU id:0 name:Tesla K40c PCI domain: 0 bus: 2 device: 0
selecting device 0
testing size: 64
rounded size: 65536
device ptr: b05a40000
bar_ptr: 0x7f43d9223000
info.va: b05a40000
info.mapped_size: 65536
info.page_size: 65536
page offset: 0
user-space pointer:0x7f43d9223000
BAR writing test, size=64 offset=0 num_iters=10000
BAR1 write BW: 457.923MB/s
BAR reading test, size=64 offset=0 num_iters=100
Segmentation fault
$dmesg
...
[2689239.364734] copybw[5308]: segfault at 2846000 ip 00007f43d8ecd06c sp 00007fff23b939e0 error 6 in libgdrapi.so.1.2[7f43d8ecb000+3000]
$ /opt/gdrcopy8.0/copybw -s 64
GPU id:0 name:Tesla K80 PCI domain: 0 bus: 5 device: 0
GPU id:1 name:Tesla K80 PCI domain: 0 bus: 6 device: 0
selecting device 0
testing size: 64
rounded size: 65536
device ptr: 2304fc0000
bar_ptr: 0x2acc78311000
info.va: 2304fc0000
info.mapped_size: 65536
info.page_size: 65536
page offset: 0
user-space pointer:0x2acc78311000
BAR writing test, size=64 offset=0 num_iters=10000
BAR1 write BW: 722.593MB/s
BAR reading test, size=64 offset=0 num_iters=100
Segmentation fault (core dumped)
$dmesg
...
[2614698.728292] copybw[32532]: segfault at 2acc78321000 ip 00002acc78459018 sp 00007ffd51c16b10 error 4 in libgdrapi.so.1.2[2acc78457000+3000]
Do you have any idea what could be happening here?
Thanks,
dkms is needed anyway for deb kernel module packages
I just installed gdrcopy on my machine (Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)) using CUDA 7.5, V7.5.17 (NVIDIA driver version 367.27), with an NVIDIA Tesla K20m GPU. After trying to run $ ./validate, the following error was printed in dmesg:
gdrdrv:nvidia_p2p_get_pages(va=704fe0000 len=327680 p2p_token=0 va_space=0) failed [ret = -22]
-22 = -EINVAL, and according to the GPUDirect CUDA Toolkit page, that function returns -EINVAL if an invalid argument was supplied.
Does anyone have any bright ideas on why I can't do GPUDirect RDMA? Thanks.
Please help me! How can I directly access NVIDIA physical memory from a kernel module?
I observed a 1 msec latency from when gdr_copy_to_bar is issued to when the update is observed on the GPU.
When the target buffer is not aligned or the copy size is too small, gdr_copy_to_bar translates to an sfence followed by a memcpy.
Issuing the sfence after the memcpy instead seems to prevent some buffering and helps reduce the latency significantly.
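The reordering described above might look like the sketch below; wc_copy_to_bar is a hypothetical helper, and _mm_sfence assumes an x86 CPU with SSE:

```c
#include <immintrin.h>   /* x86 intrinsics; assumes SSE support */
#include <stddef.h>
#include <string.h>

/* For small or unaligned copies into a write-combining BAR mapping, issue
 * sfence AFTER the memcpy so the WC buffers are drained promptly instead of
 * the update lingering for ~1 ms. */
static void wc_copy_to_bar(void *bar_dst, const void *src, size_t n)
{
    memcpy(bar_dst, src, n);
    _mm_sfence();        /* flush write-combining buffers after the stores */
}
```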
reported by Ching Chu:
$ ./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
[1] 316576 segmentation fault ./validate
optimized memcpy implementations should be chosen at run-time during a tuning phase, possibly in gdr_open()
The issue comes from:
Hi,
I have installed gdrcopy, but gdr_open() returns NULL and the test cases are failing.
As of today, the nvidia_p2p_get_pages callback does not tear down the CPU mappings created via gdr_map().
This is a potential security threat, as those BAR1 pages could be reused later to expose some other GPU device memory, possibly belonging to a different OS process colocated on the same GPU.
Currently there is no run-time ABI compatibility check between libgdrapi and gdrdrv.
That can generate obscure errors, say in a container where libgdrapi version A tries to work with bare-metal gdrdrv version B.
A possible plan would be:
sudo /sbin/insmod gdrdrv/gdrdrv.ko dbg_enabled=0 info_enabled=0
insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Invalid parameters
so I tried:
insmod gdrdrv.ko
insmod: ERROR: could not insert module gdrdrv.ko: Invalid parameters
Could you take a look and do a quick fix? It is not working right now.
[61024.799569] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.799746] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.799920] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.800127] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.800151] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.800327] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.800502] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.800704] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.800726] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.800901] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.801083] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.801285] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.801307] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.801484] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.801659] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.801861] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.801883] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.802064] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
At the moment there are three places where the library major and minor version are specified:
[1218757.588122] gdrdrv:mmap start=0x7f45e83c3000 size=196608 off=0xc31d2952
[1218757.588123] gdrdrv:range start with p=0 vaddr=7f45e83c3000 page_paddr=3838082a0000
[1218757.588125] gdrdrv:non-contig p=1 prev_page_paddr=3838082a0000 cur_page_paddr=3838084b0000
[1218757.588127] gdrdrv:mapping p=1 entries=1 offset=0 len=65536 vaddr=7f45e83c3000 paddr=3838082a0000
[1218757.588128] gdrdrv:mmaping phys mem addr=0x3838082a0000 size=65536 at user virt addr=0x7f45e83c3000
[1218757.588129] gdrdrv:is_cow_mapping is FALSE
[1218757.588138] gdrdrv:range start with p=1 vaddr=7f45e83d3000 page_paddr=3838084b0000
[1218757.588139] gdrdrv:mapping p=3 entries=2 offset=0 len=131072 vaddr=7f45e83d3000 paddr=3838084b0000
[1218757.588141] gdrdrv:mmaping phys mem addr=0x3838084b0000 size=131072 at user virt addr=0x7f45e83d3000
[1218757.588141] gdrdrv:is_cow_mapping is FALSE
[1218757.588146] gdrdrv:track_pfn_remap failed :-22
[1218757.588150] gdrdrv:error in remap_pfn_range() ret:-22
[1218757.588151] gdrdrv:error -11 in gdrdrv_mmap_phys_mem_wcomb
We might consider a run-time query mechanism, like gdr_query_version(int *major, int *minor) or the more generic gdr_get_attribute(int attr, int *value), which would complement the dynamic link-time mechanism offered by ld.so.
That would be especially useful, say in MPI libraries, when dynamically loading the library with dlopen("libgdrapi.so") and resolving symbols with dlsym(), to enforce a run-time compatibility check.
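A hypothetical compatibility policy built on top of such a query; the major/minor rule shown here is only one plausible choice, not the library's actual behavior:

```c
/* Require an exact major match (an ABI break across majors) and a library
 * minor at least as new as the driver's. Illustrative policy only. */
static int gdr_abi_compatible(int lib_major, int lib_minor,
                              int drv_major, int drv_minor)
{
    if (lib_major != drv_major)
        return 0;
    return lib_minor >= drv_minor;
}
```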
Hello,
I've run into the following error message while building gdrcopy-v1.3 (it doesn't happen with the master branch):
sudo make CUDA=/usr/local/cuda-10.1 all install
make: execvp: ./config_arch: Permission denied
echo "GDRAPI_ARCH="
GDRAPI_ARCH=
cd gdrdrv; \
make
make[1]: Entering directory `/home/ody/gdrcopy-1.3/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.27.2.el7.x86_64'
Building modules, stage 2.
MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.27.2.el7.x86_64'
make[1]: Leaving directory `/home/ody/gdrcopy-1.3/gdrdrv'
g++ -O2 -I /usr/local/cuda-10.1/include -I gdrdrv/ -I /usr/local/cuda-10.1/include -D GDRAPI_ARCH= -L /usr/local/cuda-10.1/lib64 -L /usr/local/cuda-10.1/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda-10.1/lib64 -o basic basic.o libgdrapi.so.1.2 -lcudart -lcuda -lpthread -ldl
libgdrapi.so.1.2: undefined reference to `memcpy_cached_store_sse'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_store_avx'
libgdrapi.so.1.2: undefined reference to `memcpy_cached_store_avx'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_store_sse'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_load_sse41'
collect2: error: ld returned 1 exit status
make: *** [basic] Error 1
The hardware is a virtualized environment with an Intel(R) Xeon(R) CPU @ 2.30GHz and a 00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) GPU. Thanks.