
gdrcopy's Introduction

GDRCopy

A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.

Introduction

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.

The advantage of a CPU-driven copy is its very small overhead, which can be useful when low latencies are required.

What is inside

GDRCopy offers the infrastructure to create user-space mappings of GPU memory, which can then be manipulated as if it were plain host memory (caveats apply here).
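The typical flow is: open the gdrdrv device, pin the GPU buffer, map it into the process address space, compute the user-space pointer from the page offset reported by gdr_get_info, copy, and then tear everything down. The sketch below is illustrative only; error handling is omitted, the exact prototypes should be checked against gdrapi.h for your GDRCopy version, and linking against libgdrapi and libcuda (e.g. -lgdrapi -lcuda) is assumed.

#include <cuda.h>
#include <gdrapi.h>
#include <string.h>

int main(void)
{
    const size_t size = 65536;              /* one 64kB GPU page, conveniently aligned */
    CUdevice dev; CUcontext ctx; CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, size);               /* see Restrictions: alignment is not guaranteed */

    gdr_t g = gdr_open();                   /* talk to the gdrdrv kernel-mode driver */
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, size, 0, 0, &mh);   /* potentially expensive pinning step, do it once */

    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, size);         /* CPU (write-combined) mapping of the GPU memory */

    gdr_info_t info;
    gdr_get_info(g, mh, &info);
    char *buf = (char *)map_ptr + (d_ptr - info.va);  /* adjust for the page offset */

    char data[128] = "hello";
    gdr_copy_to_mapping(mh, buf, data, sizeof(data));    /* fast H-D copy */
    gdr_copy_from_mapping(mh, data, buf, sizeof(data));  /* slower D-H copy (uncached reads) */

    gdr_unmap(g, mh, map_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    return 0;
}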

A simple by-product of it is a copy library with the following characteristics:

  • Very low overhead, as the copy is driven by the CPU. For reference, a cudaMemcpy can currently incur a 6-7us overhead.

  • An initial memory pinning phase is required, which is potentially expensive, 10us-1ms depending on the buffer size.

  • Fast H-D (host-to-device) copies, thanks to write-combining. H-D bandwidth is 6-8GB/s on an Ivy Bridge Xeon, but it is subject to NUMA effects.

  • Slow D-H (device-to-host) copies, because the GPU BAR, which backs the mappings, cannot be prefetched, so burst read transactions are not generated on PCIe.

The library comes with a few tests, such as:

  • gdrcopy_sanity, which contains unit tests for the library and the driver.
  • gdrcopy_copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.
  • gdrcopy_copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.
  • gdrcopy_apiperf, an application for benchmarking the latency of each GDRCopy API call.
  • gdrcopy_pplat, a benchmark application which calculates the round-trip ping-pong latency between GPU and CPU.

Requirements

GPUDirect RDMA requires an NVIDIA Data Center GPU or NVIDIA RTX GPU (formerly Tesla and Quadro) based on the Kepler or a newer generation; see GPUDirect RDMA. For more general information, please refer to the official GPUDirect RDMA design document.

The device driver requires GPU display driver >= 418.40 on ppc64le and >= 331.14 on other platforms. The library and tests require CUDA >= 6.0.

DKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL or SLE, however, users have the option to build a kmod package and install it instead of the DKMS package. See the Build and installation section for more details.

# On RHEL
# dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
$ sudo yum install dkms

# On Debian - No additional dependency

# On SLE / Leap
# On SLE dkms can be installed from PackageHub.
$ sudo zypper install dkms rpmbuild

CUDA and the GPU display driver must be installed before building and/or installing GDRCopy. Installation instructions can be found at https://developer.nvidia.com/cuda-downloads.

GPU display driver header files are also required. They are installed as part of the driver (or CUDA) installation when using the runfile installer. If you install the driver via a package manager, we suggest:

  • On RHEL, sudo dnf module install nvidia-driver:latest-dkms.
  • On Debian, sudo apt install nvidia-dkms-<your-nvidia-driver-version>.
  • On SLE, sudo zypper install nvidia-gfx<your-nvidia-driver-version>-kmp.

The supported architectures are Linux x86_64, ppc64le, and arm64. The supported platforms are RHEL8, RHEL9, Ubuntu20_04, Ubuntu22_04, SLE-15 (any SP) and Leap 15.x.

Root privileges are necessary to load/install the kernel-mode device driver.

Build and installation

We provide three ways to build and install GDRCopy.

rpm package

# For RHEL:
$ sudo yum groupinstall 'Development Tools'
$ sudo yum install dkms rpm-build make

# For SLE:
$ sudo zypper in dkms rpmbuild

$ cd packages
$ CUDA=<cuda-install-top-dir> ./build-rpm-packages.sh
$ sudo rpm -Uvh gdrcopy-kmod-<version>dkms.noarch.<platform>.rpm
$ sudo rpm -Uvh gdrcopy-<version>.<arch>.<platform>.rpm
$ sudo rpm -Uvh gdrcopy-devel-<version>.noarch.<platform>.rpm

The DKMS package is the default kernel module package that build-rpm-packages.sh generates. To create the kmod package instead, the -m option must be passed to the script, as shown below. Unlike the DKMS package, the kmod package contains a prebuilt GDRCopy kernel module that is specific to the NVIDIA driver version and the Linux kernel version used to build it.
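For example, reusing the CUDA variable from the commands above:

$ CUDA=<cuda-install-top-dir> ./build-rpm-packages.sh -m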

deb package

$ sudo apt install build-essential devscripts debhelper fakeroot pkg-config dkms
$ cd packages
$ CUDA=<cuda-install-top-dir> ./build-deb-packages.sh
$ sudo dpkg -i gdrdrv-dkms_<version>_<arch>.<platform>.deb
$ sudo dpkg -i libgdrapi_<version>_<arch>.<platform>.deb
$ sudo dpkg -i gdrcopy-tests_<version>_<arch>.<platform>.deb
$ sudo dpkg -i gdrcopy_<version>_<arch>.<platform>.deb

from source

$ make prefix=<install-to-this-location> CUDA=<cuda-install-top-dir> all install
$ sudo ./insmod.sh

Notes

Compiling the gdrdrv driver requires the NVIDIA driver source code, which is typically installed at /usr/src/nvidia-<version>. Our Makefile automatically detects and picks up that source code. If multiple versions are installed, you can point the build at the correct one by defining the NVIDIA_SRC_DIR variable, e.g. export NVIDIA_SRC_DIR=/usr/src/nvidia-520.61.05/nvidia, before building the gdrdrv module, as shown below.
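For example, building from source with an explicitly selected driver source tree (the driver version shown is just the one from the example above; substitute the version installed on your system):

$ export NVIDIA_SRC_DIR=/usr/src/nvidia-520.61.05/nvidia
$ make prefix=<install-to-this-location> CUDA=<cuda-install-top-dir> all install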

There are two major flavors of the NVIDIA driver: 1) proprietary and 2) opensource. The flavor is detected when compiling gdrdrv, based on the NVIDIA driver source code. The two flavors come with different features and restrictions:

  • gdrdrv compiled with the opensource flavor will provide functionality and high performance on all platforms. However, you will not be able to load this gdrdrv driver when the proprietary NVIDIA driver is loaded.
  • gdrdrv compiled with the proprietary flavor can always be loaded regardless of the flavor of the NVIDIA driver you have loaded. However, it may have suboptimal performance on coherent platforms such as Grace-Hopper. Functionally, it will not work correctly on Intel CPUs when the Linux kernel is built with confidential computing (CC) support, i.e. CONFIG_ARCH_HAS_CC_PLATFORM=y, and CC is enabled at runtime.

Tests

Execute provided tests:

$ gdrcopy_sanity 
Total: 28, Passed: 28, Failed: 0, Waived: 0

List of passed tests:
    basic_child_thread_pins_buffer_cumemalloc
    basic_child_thread_pins_buffer_vmmalloc
    basic_cumemalloc
    basic_small_buffers_mapping
    basic_unaligned_mapping
    basic_vmmalloc
    basic_with_tokens
    data_validation_cumemalloc
    data_validation_vmmalloc
    invalidation_access_after_free_cumemalloc
    invalidation_access_after_free_vmmalloc
    invalidation_access_after_gdr_close_cumemalloc
    invalidation_access_after_gdr_close_vmmalloc
    invalidation_fork_access_after_free_cumemalloc
    invalidation_fork_access_after_free_vmmalloc
    invalidation_fork_after_gdr_map_cumemalloc
    invalidation_fork_after_gdr_map_vmmalloc
    invalidation_fork_child_gdr_map_parent_cumemalloc
    invalidation_fork_child_gdr_map_parent_vmmalloc
    invalidation_fork_child_gdr_pin_parent_with_tokens
    invalidation_fork_map_and_free_cumemalloc
    invalidation_fork_map_and_free_vmmalloc
    invalidation_two_mappings_cumemalloc
    invalidation_two_mappings_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
    invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc


$ gdrcopy_copybw
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7f1153a00000
map_d_ptr: 0x7f1172257000
info.va: 7f1153a00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f1172257000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9638.54MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 530.135MB/s
unmapping buffer
unpinning buffer
closing gdrdrv


$ gdrcopy_copylat
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
device ptr: 0x7fa2c6000000
allocated size: 16777216
gpu alloc fn: cuMemAlloc

map_d_ptr: 0x7fa2f9af9000
info.va: 7fa2c6000000
info.mapped_size: 16777216
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7fa2f9af9000

gdr_copy_to_mapping num iters for each size: 10000
WARNING: Measuring the API invocation overhead as observed by the CPU. Data
might not be ordered all the way to the GPU internal visibility.
Test             Size(B)     Avg.Time(us)
gdr_copy_to_mapping             1         0.0889
gdr_copy_to_mapping             2         0.0884
gdr_copy_to_mapping             4         0.0884
gdr_copy_to_mapping             8         0.0884
gdr_copy_to_mapping            16         0.0905
gdr_copy_to_mapping            32         0.0902
gdr_copy_to_mapping            64         0.0902
gdr_copy_to_mapping           128         0.0952
gdr_copy_to_mapping           256         0.0983
gdr_copy_to_mapping           512         0.1176
gdr_copy_to_mapping          1024         0.1825
gdr_copy_to_mapping          2048         0.2549
gdr_copy_to_mapping          4096         0.4366
gdr_copy_to_mapping          8192         0.8141
gdr_copy_to_mapping         16384         1.6155
gdr_copy_to_mapping         32768         3.2284
gdr_copy_to_mapping         65536         6.4906
gdr_copy_to_mapping        131072        12.9761
gdr_copy_to_mapping        262144        25.9459
gdr_copy_to_mapping        524288        51.9100
gdr_copy_to_mapping       1048576       103.8028
gdr_copy_to_mapping       2097152       207.5990
gdr_copy_to_mapping       4194304       415.2856
gdr_copy_to_mapping       8388608       830.6355
gdr_copy_to_mapping      16777216      1661.3285

gdr_copy_from_mapping num iters for each size: 100
Test             Size(B)     Avg.Time(us)
gdr_copy_from_mapping           1         0.9069
gdr_copy_from_mapping           2         1.7170
gdr_copy_from_mapping           4         1.7169
gdr_copy_from_mapping           8         1.7164
gdr_copy_from_mapping          16         0.8601
gdr_copy_from_mapping          32         1.7024
gdr_copy_from_mapping          64         3.1016
gdr_copy_from_mapping         128         3.4944
gdr_copy_from_mapping         256         3.6400
gdr_copy_from_mapping         512         2.4394
gdr_copy_from_mapping        1024         2.8022
gdr_copy_from_mapping        2048         4.6615
gdr_copy_from_mapping        4096         7.9783
gdr_copy_from_mapping        8192        14.9209
gdr_copy_from_mapping       16384        28.9571
gdr_copy_from_mapping       32768        56.9373
gdr_copy_from_mapping       65536       114.1008
gdr_copy_from_mapping      131072       234.9382
gdr_copy_from_mapping      262144       496.4011
gdr_copy_from_mapping      524288       985.5196
gdr_copy_from_mapping     1048576      1970.7057
gdr_copy_from_mapping     2097152      3942.5611
gdr_copy_from_mapping     4194304      7888.9468
gdr_copy_from_mapping     8388608     18361.5673
gdr_copy_from_mapping    16777216     36758.8342
unmapping buffer
unpinning buffer
closing gdrdrv


$ gdrcopy_apiperf -s 8
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
device ptr: 0x7f1563a00000
allocated size: 65536
Size(B) pin.Time(us)    map.Time(us)    get_info.Time(us)   unmap.Time(us)   unpin.Time(us)
65536   1346.034060     3.603800        0.340270            4.700930         676.612800
Histogram of gdr_pin_buffer latency for 65536 bytes
[1303.852000    -   2607.704000]    93
[2607.704000    -   3911.556000]    0
[3911.556000    -   5215.408000]    0
[5215.408000    -   6519.260000]    0
[6519.260000    -   7823.112000]    0
[7823.112000    -   9126.964000]    0
[9126.964000    -   10430.816000]   0
[10430.816000   -   11734.668000]   0
[11734.668000   -   13038.520000]   0
[13038.520000   -   14342.372000]   2

closing gdrdrv



$ numactl -N 1 -l gdrcopy_pplat
GPU id:0; name: NVIDIA A40; Bus id: 0000:09:00
selecting device 0
device ptr: 0x7f99d2600000
gpu alloc fn: cuMemAlloc
map_d_ptr: 0x7f9a054fb000
info.va: 7f99d2600000
info.mapped_size: 4
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f9a054fb000
CPU does gdr_copy_to_mapping and GPU writes back via cuMemHostAlloc'd buffer.
Running 1000 iterations with data size 4 bytes.
Round-trip latency per iteration is 1.08762 us
unmapping buffer
unpinning buffer
closing gdrdrv

NUMA effects

Depending on the platform architecture, such as where the GPUs sit in the PCIe topology, performance may suffer if the processor driving the copy is not the one hosting the GPU, for example in a multi-socket server.

In the example below, GPU ID 0 is hosted by CPU socket 0. By explicitly setting the process and memory affinity, it is possible to run the test on the optimal processor:

$ numactl -N 0 -l gdrcopy_copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 65536
rounded size: 65536
gpu alloc fn: cuMemAlloc
device ptr: 7f5817a00000
map_d_ptr: 0x7f583b186000
info.va: 7f5817a00000
info.mapped_size: 65536
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7f583b186000
writing test, size=65536 offset=0 num_iters=1000
write BW: 9768.3MB/s
reading test, size=65536 offset=0 num_iters=1000
read BW: 548.423MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

or on the other socket:

$ numactl -N 1 -l gdrcopy_copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))
GPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00
GPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00
GPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00
GPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00
GPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00
GPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00
GPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00
GPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00
selecting device 0
testing size: 65536
rounded size: 65536
gpu alloc fn: cuMemAlloc
device ptr: 7fbb63a00000
map_d_ptr: 0x7fbb82ab0000
info.va: 7fbb63a00000
info.mapped_size: 65536
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fbb82ab0000
writing test, size=65536 offset=0 num_iters=1000
write BW: 9224.36MB/s
reading test, size=65536 offset=0 num_iters=1000
read BW: 521.262MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

Restrictions and known issues

GDRCopy works with regular CUDA device memory only, as returned by cudaMalloc. In particular, it does not work with CUDA managed memory.

gdr_pin_buffer() accepts any address returned by cudaMalloc and its family. In contrast, gdr_map() requires that the pinned address be aligned to the GPU page size. Neither the CUDA Runtime nor the Driver API guarantees that GPU memory allocation functions return aligned addresses, so users are responsible for properly aligning the addresses they pass to the library, as sketched below.
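A minimal sketch of one way to satisfy this requirement, along the lines of the workaround discussed in the issues further down: over-allocate by one GPU page and round the returned address up to the next 64kB boundary. The helper name is hypothetical, and GPU_PAGE_SIZE / GPU_PAGE_MASK are assumed to be the 64kB page constants from gdrapi.h.

#include <cuda.h>
#include <gdrapi.h>   /* assumed to provide GPU_PAGE_SIZE / GPU_PAGE_MASK (64kB pages) */

/* Hypothetical helper: allocate device memory whose start is GPU-page aligned.
 * Returns the aligned address; *raw receives the pointer to pass to cuMemFree(). */
static CUdeviceptr alloc_gpu_page_aligned(size_t size, CUdeviceptr *raw)
{
    cuMemAlloc(raw, size + GPU_PAGE_SIZE);              /* one extra page of slack; error checks omitted */
    return (*raw + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;  /* round up to the next 64kB boundary */
}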

Two cudaMalloc'd memory regions may be contiguous, and users may call gdr_pin_buffer and gdr_map with an address and size that span both regions. This use case is not well supported in GDRCopy. On rare occasions, users may experience 1) an error in gdr_map, or 2) low copy performance because gdr_map cannot provide a write-combined mapping.

In some GPU driver versions, pinning the same GPU address multiple times consumes additional BAR1 space because the space is not properly reused. If you encounter this issue, we suggest trying the latest NVIDIA GPU driver.

On POWER9, where the CPU and GPU are connected via NVLink, CUDA 9.2 and GPU driver v396.37 are the minimum requirements to achieve full performance. GDRCopy works with earlier CUDA and GPU driver versions, but the achievable bandwidth is substantially lower.

If gdrdrv is compiled with the proprietary flavor of the NVIDIA driver, GDRCopy does not fully support Linux confidential computing (CC) configurations on Intel CPUs. In particular, it does not function if CONFIG_ARCH_HAS_CC_PLATFORM=y and CC is enabled at runtime. It works, however, if CC is disabled or CONFIG_ARCH_HAS_CC_PLATFORM=n. This issue does not apply to AMD CPUs. To avoid it, please compile and load gdrdrv with the opensource flavor of the NVIDIA driver.

To allow the loading of unsupported third-party modules in SLE, set allow_unsupported_modules 1 in /etc/modprobe.d/unsupported-modules. After making this change, modules missing the "supported" flag will be allowed to load.
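For example, using the file name given above (adjust it if your SLE release uses a different modprobe configuration file):

$ echo "allow_unsupported_modules 1" | sudo tee -a /etc/modprobe.d/unsupported-modules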

Bug filing

To report issues with, or suspected bugs in, NVIDIA software, we recommend using the bug filing system, which is available to registered NVIDIA developers on the developer site.

If you are not a member, you can sign up.

Once you are a member, you can submit issues using this form. Be sure to select GPUDirect in the "Relevant Area" field.

You can later track their progress using the My Bugs link on the left of this view.

Acknowledgment

If you find this software useful in your work, please cite: R. Shi et al., "Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters," 2014 21st International Conference on High Performance Computing (HiPC), Dona Paula, 2014, pp. 1-10, doi: 10.1109/HiPC.2014.7116873.


gdrcopy's Issues

ioremap sometimes too slow

  1. use nvidia_p2p_get_pages to get the physical pages
  2. use ioremap to map the NVIDIA physical pages to a kernel virtual address
  3. use memcpy_toio to copy data from the kernel to the GPU
     Sometimes memcpy_toio is too slow: it takes 80ms to transfer 600KB of data to the GPU, whereas it normally takes 0.18ms.
  4. after the machine boots, it stays either normal or slow until the machine reboots

Please help me! How can I directly access NVIDIA physical memory from a kernel module?

Support for fork

Today, forking could lead to spurious prints (from retcode = nvidia_p2p_put_pages(mr->p2p_token, mr->va_space, mr->va, mr->page_table);) and possibly a crash.

Tracking here: further investigations (a new unit test for this case) and possible mitigations (e.g. CLOEXEC when opening the driver fd).

power9: bus error with 4GB copy

[root@ibm-p9-012 gdrcopy]# ./copybw -s 4294967296 -c 4294967296 -d 0
GPU id:0 name:Tesla V100-SXM2-16GB PCI domain: 4 bus: 4 device: 0
GPU id:1 name:Tesla V100-SXM2-16GB PCI domain: 4 bus: 5 device: 0
GPU id:2 name:Tesla V100-SXM2-16GB PCI domain: 53 bus: 3 device: 0
GPU id:3 name:Tesla V100-SXM2-16GB PCI domain: 53 bus: 4 device: 0
selecting device 0
testing size: 4294967296
rounded size: 4294967296
device ptr: 7ffe40000000
bar_ptr: 0x7ffc3fff0000
info.va: 7ffe40000000
info.mapped_size: 4294967296
info.page_size: 65536
page offset: 0
user-space pointer:0x7ffc3fff0000
BAR writing test, size=4294967296 offset=0 num_iters=10000
Bus error (core dumped)
[root@ibm-p9-012 gdrcopy]#

consolidate API versioning

At the moment there are three places where the library major and minor version are specified:

  • gdrapi.h
  • Makefile
  • gdrcopy.spec

There should be a single place where those version numbers are maintained.

Error when building gdrcopy deb package

I'm seeing following error when building gdrcopy deb package:

> ./build-deb-packages.sh
...
> dpkg-shlibdeps: error: no dependency information found for /usr/lib/x86_64-linux-gnu/libcuda.so.1 (used by debian/gdrcopy/usr/bin/sanity)

This, I suppose, is due to installing the driver from a downloaded *.run package.

The error can be suppressed by adding the rule:

override_dh_shlibdeps:
        dh_shlibdeps --dpkg-shlibdeps-params=--ignore-missing-info

to packages/debian/rules, but I'm not sure whether this is the right way to maintain it.

1.2 release tag

It appears the tag/GitHub release for 1.2 is missing. Could this please be added?

Version mismatch between modinfo gdrdrv and dpkg -l gdrdrv-dkms

Code from the master branch (bf4848f).

$ dpkg -l gdrdrv-dkms
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                           Version                      Architecture                 Description
+++-==============================================-============================-============================-==================================================================================================
ii  gdrdrv-dkms:amd64                              2.0                          amd64                        gdrdrv driver in DKMS format.
$ modinfo gdrdrv
filename:       /lib/modules/4.15.0-58-generic/updates/dkms/gdrdrv.ko
version:        1.1
description:    GDRCopy kernel-mode driver
license:        MIT
author:         [email protected]
srcversion:     D5FB5F3108420043522DCAC
depends:        nv-p2p-dummy
retpoline:      Y
name:           gdrdrv
vermagic:       4.15.0-58-generic SMP mod_unload 
parm:           dbg_enabled:enable debug tracing (int)
parm:           info_enabled:enable info tracing (int)

gdrcopy configuration for use with UCX

Hello,
I'm trying to build gdrcopy correctly in order to build UCX. (Following the website instructions) the installation seems to work fine:

sudo make PREFIX=/usr/local/gdrcopy CUDA=/usr/local/cuda-10.1
echo "GDRAPI_ARCH=X86"
GDRAPI_ARCH=X86
cd gdrdrv;
make
make[1]: Entering directory `/home/centos/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.5.1.el7.x86_64'
  Building modules, stage 2.
  MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.5.1.el7.x86_64'
make[1]: Leaving directory `/home/centos/gdrcopy/gdrdrv'

sudo ./insmod.sh
INFO: driver major is 240
INFO: creating /dev/gdrdrv inode

The validation codes yield:
./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size -325939184 is not dword aligned, ignoring trailing bytes
unampping
unpinning

./copybw
GPU id:0 name:Tesla M60 PCI domain: 0 bus: 0 device: 30
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: b04720000
bar_ptr: 0x7f670a353000
info.va: b04720000
info.mapped_size: 131072
info.page_size: 65536
page offset: 0
user-space pointer:0x7f670a353000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9585.88MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 529.436MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

However, I don't see any file in the subdirectory /usr/local/gdrcopy and, when I try to configure and build UCX(1.5.2), I get the error message: configure: error: gdrcopy support is requested but gdrcopy packages can't found

Thank you.

Issues with gdr driver

Hello,
I'm running into some issues while trying to use gdrcopy in a MPI environment. I have CUDA 10.1 (418.67) and the error reads:
GDRCOPY library "libgdrapi.so" unable to open GDR driver, is gdrdrv.ko loaded?
I'm new to gdrcopy and don't really know what this means. After installing gdrcopy, I performed the suggested validations that read OK to me:

 ./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 1763323920 is not dword aligned, ignoring trailing bytes
unampping
unpinning
 ./copybw
GPU id:0 name:Tesla K80 PCI domain: 0 bus: 0 device: 4
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 403960000
bar_ptr: 0x7f0395d5c000
info.va: 403960000
info.mapped_size: 131072
info.page_size: 65536
page offset: 0
user-space pointer:0x7f0395d5c000
writing test, size=131072 offset=0 num_iters=10000
write BW: 9437.68MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 356.296MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

Any suggestions on how to proceed or what am I missing? Thanks.

enforce ABI compatibility between user and kernel space components

Currently there is no run-time ABI compatibility check between libgdrapi and gdrdrv.

That can generate obscure errors, say in a container when libgdrapi version A tries to work with bare-metal gdrdrv version B.

A possible plan would be:

  • to introduce the concept of ABI version in gdrdrv
  • to add a new IOCTL to return that version to user-space
  • in gdr_open(), check ABI compatibility

some questions about using this repo

  1. Are there any documents to help me understand how to use the code in this repo in my own project?
  2. Is it possible to build all the functions into a DLL or lib file for convenience?
  3. Can I use this to take screenshots or do video streaming for games running on an NVIDIA GeForce GPU?

nvidia_p2p_get_pages() failed

I just installed gdrcopy on my machine (Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)) using CUDA 7.5, V7.5.17 (NVIDIA driver version 367.27), with an NVIDIA Tesla K20m GPU. After trying to run $ ./validate, the following error was printed in dmesg:
gdrdrv:nvidia_p2p_get_pages(va=704fe0000 len=327680 p2p_token=0 va_space=0) failed [ret = -22]

-22 = -EINVAL, and according to the GPUDirect CUDA Toolkit page, that function returns -EINVAL if an invalid argument was supplied.
Does anyone have any bright ideas on why I can't do GPUDirect RDMA? Thanks.

gdr_open is returning NULL

Hi,

I have installed gdrcopy, but I am getting NULL for the call gdr_open and the test cases are failing.

Need -lrt in Makefile

$ make
...
...
/usr/bin/ld: copybw.o: undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'
/usr/bin/ld: note: 'clock_gettime@@GLIBC_2.2.5' is defined in DSO /lib64/librt.so.1 so try adding it to the linker command line
/lib64/librt.so.1: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [copybw] Error 1

Had to add "-lrt" to LIBS in Makefile:15

./insmod.sh fails

Dear,

We have several GPU nodes (Skylake processors with 4x P100 cards per node), and I would like to test whether RDMA is available on these nodes.
When I try to build gdrcopy, I get the following error message:
mknod: ‘/dev/gdrdrv’: Operation not permitted
Here is the specification of the host:

$> uname -a Linux r23g34 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

In fact, there is no such file at /dev/gdrdrv on our current system. Do you have an idea of what is wrong here?

Thanks
Ehsan

Failed to make install

I get this exception:

ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv;
/usr/bin/make64
make64[1]: Entering directory `/home/users/tangwei12/gdrcopy-master/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-linux-x86_64-390.12/kernel/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make64[2]: Entering directory `/home/users/tangwei12/linux-4-14'

WARNING: Symbol version dump ./Module.symvers
is missing; modules will have no dependencies and modversions.

CC [M] /home/users/tangwei12/gdrcopy-master/gdrdrv/nv-p2p-dummy.o
CC [M] /home/users/tangwei12/gdrcopy-master/gdrdrv/gdrdrv.o
Building modules, stage 2.
MODPOST 2 modules
FATAL: /home/users/tangwei12/gdrcopy-master/gdrdrv/gdrdrv.o is truncated. sechdrs[i].sh_offset=7089075323386670592 > sizeof(*hrd)=64
make64[3]: *** [__modpost] Error 1
make64[2]: *** [modules] Error 2
make64[2]: Leaving directory `/home/users/tangwei12/linux-4-14'
make64[1]: *** [module] Error 2
make64[1]: Leaving directory `/home/users/tangwei12/gdrcopy-master/gdrdrv'
make64: *** [driver] Error 2

OS: centOS 6.3 (4.14.18)
CUDA: 9
Driver Version: 390.12

slow write BW observed beyond 64KB size

This has been reported by Mark Silberstein [email protected]

We finally pinpointed the setup, and it's easily reproducible.

  1. Get the CPU ptr for the buffer in the mapped BAR
  2. Sequentially pread from file into that buffer in blocks >=64K.

As long as blocks are less than 64K, we get ~1GB/s. For blocks >= 64K we get around 13MB/s

Fails in ioctl call in gdr_pin_buffer. Perhaps the GDRDRV_IOC_PIN_BUFFER flags are incorrect.

-bash-4.2$ ./validate
buffer size: 327680
device ptr: 7fffa0600000
gdr open: 0xc9abf0
before ioctl GDRDRV IOC PIN BUFFER c020da01
After ioctl retcode -1
-bash-4.2$

-bash-4.2$ ./copybw
GPU id:0 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 26 device: 0
GPU id:1 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 28 device: 0
GPU id:2 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 136 device: 0
GPU id:3 name:Tesla V100-SXM2-32GB PCI domain: 0 bus: 138 device: 0
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 7fffa0600000
before ioctl GDRDRV IOC PIN BUFFER c020da01
After ioctl size -1
closing gdrdrv
-bash-4.2$

cudaMalloc can no longer guarantee to return 64kB aligned address

GDRDRV needs 64kB aligned addresses.

gdrdrv_pin_buffer() {
...
    page_virt_start  = params.addr & GPU_PAGE_MASK;
    page_virt_end    = params.addr + params.size - 1;
    rounded_size     = page_virt_end - page_virt_start + 1;
    mr->offset       = params.addr & GPU_PAGE_OFFSET;
...
}

and

gdrdrv_mmap() {
...
    if (mr->offset) {
        gdr_dbg("offset != 0 is not supported\n");
        ret = -EINVAL;
        goto out;
    }
...
}

This is no longer guaranteed by cudaMalloc in recent CUDA drivers (since 410). A temporary workaround could be (at the application level) to allocate with cudaMalloc a memory area of size + GPU_PAGE_SIZE and then search for the first 64kB-aligned address. Something like:

// allocate one extra GPU page so that an aligned range of buffer_size bytes always fits
alloc_size = buffer_size + GPU_PAGE_SIZE;
cuMemAlloc(&dev_addr, alloc_size);
// round dev_addr up to the next 64kB boundary if needed
if (dev_addr % GPU_PAGE_SIZE) {
    dev_addr += (GPU_PAGE_SIZE - (dev_addr % GPU_PAGE_SIZE));
}

kernel crash in gdrdrv_mmap for small size

[ 2260.994632] gdrdrv:minor=0
[ 2260.994639] gdrdrv:ioctl called (cmd 0xc020da01)
[ 2260.994641] gdrdrv:invoking nvidia_p2p_get_pages(va=0x10916200000 len=4096 p2p_tok=0 va_tok=0)
[ 2260.995112] gdrdrv:page table entries: 1
[ 2260.995113] gdrdrv:page[0]=0x0000383800200000
[ 2260.995116] gdrdrv:ioctl called (cmd 0xc008da04)
[ 2260.995120] gdrdrv:mmap start=0x7f20ae059000 size=4096 off=0x455f5790
[ 2260.995121] gdrdrv:offset=0 len=65536 vaddr+offset=7f20ae059000 paddr+offset=383800200000
[ 2260.995122] gdrdrv:mmaping phys mem addr=0x383800200000 size=65536 at user virt addr=0x7f20ae059000
[ 2260.995123] gdrdrv:pfn=0x383800200
[ 2260.995124] gdrdrv:calling io_remap_pfn_range() vma=ffff883f28c33a90 vaddr=7f20ae059000 pfn=383800200 size=65536
[ 2260.995163] ------------[ cut here ]------------
[ 2260.995182] kernel BUG at /build/linux-lts-xenial-80t3lB/linux-lts-xenial-4.4.0/mm/memory.c:1674!
[ 2260.995204] invalid opcode: 0000 [#1] SMP
...
[ 2260.995861] [] gdrdrv_mmap_phys_mem_wcomb+0x71/0x130 [gdrdrv]
[ 2260.995879] [] gdrdrv_mmap+0x156/0x2e0 [gdrdrv]
[ 2260.995896] [] ? kmem_cache_alloc+0x1e2/0x200
[ 2260.995911] [] mmap_region+0x3f4/0x610
[ 2260.995926] [] do_mmap+0x2fc/0x3d0
[ 2260.995940] [] vm_mmap_pgoff+0x91/0xc0
[ 2260.995954] [] SyS_mmap_pgoff+0x197/0x260
[ 2260.995970] [] SyS_mmap+0x22/0x30

provide a run-time version query mechanism

We might consider a run-time query mechanism, like gdr_query_version(int *major, int *minor) or the more generic gdr_get_attribute(int attr, int *value), which would complement the dynamic link time mechanism offered by ld.so.

That would be especially useful, say in MPI libraries, when dynamically loading the library with dlopen("libgdrapi.so") and resolving symbols with dlsym(), to enforce a run-time compatibility check.
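A hedged sketch of that dlopen()/dlsym() pattern, resolving only gdr_open here; a dedicated version-query entry point, as proposed above, would be resolved and checked the same way before any other symbol is used.

#include <dlfcn.h>
#include <stdio.h>

typedef void *(*gdr_open_fn)(void);   /* gdr_t is an opaque pointer type */

int main(void)
{
    void *lib = dlopen("libgdrapi.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    gdr_open_fn open_fn = (gdr_open_fn)dlsym(lib, "gdr_open");
    if (!open_fn) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    /* ... a run-time ABI/version check would go here, then the rest of the API ... */
    dlclose(lib);
    return 0;
}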

cannot load the driver: Invalid parameters while running ./insmod.sh

sudo /sbin/insmod gdrdrv/gdrdrv.ko dbg_enabled=0 info_enabled=0
insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Invalid parameters

so I tried:

insmod gdrdrv.ko
insmod: ERROR: could not insert module gdrdrv.ko: Invalid parameters

Could you take a look and do a quick fix? It is not working right now.

Errors with libgdrapi.so.1.2 while building gdrcopy

Hello,
I've run into the following error message while building gdrcopy-v1.3 (it doesn't happen with the master branch):

sudo make CUDA=/usr/local/cuda-10.1 all install
make: execvp: ./config_arch: Permission denied
echo "GDRAPI_ARCH="
GDRAPI_ARCH=
cd gdrdrv; \
make
make[1]: Entering directory `/home/ody/gdrcopy-1.3/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.27.2.el7.x86_64'
  Building modules, stage 2.
  MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.27.2.el7.x86_64'
make[1]: Leaving directory `/home/ody/gdrcopy-1.3/gdrdrv'
g++ -O2 -I /usr/local/cuda-10.1/include -I gdrdrv/ -I /usr/local/cuda-10.1/include -D GDRAPI_ARCH= -L /usr/local/cuda-10.1/lib64 -L /usr/local/cuda-10.1/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda-10.1/lib64   -o basic basic.o libgdrapi.so.1.2 -lcudart -lcuda -lpthread -ldl
libgdrapi.so.1.2: undefined reference to `memcpy_cached_store_sse'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_store_avx'
libgdrapi.so.1.2: undefined reference to `memcpy_cached_store_avx'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_store_sse'
libgdrapi.so.1.2: undefined reference to `memcpy_uncached_load_sse41'
collect2: error: ld returned 1 exit status
make: *** [basic] Error 1

The hardware is a virtualized environment with an Intel(R) Xeon(R) CPU @ 2.30GHz and a 00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) GPU. Thanks.

add autotuning support

Optimized memcpy implementations should be chosen at run time during a tuning phase, possibly in gdr_open().

buffer overrun in validate test

reported by Ching Chu:

$ ./validate
buffer size: 327680
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
[1] 316576 segmentation fault ./validate

segfault for copying data buffer of 64-127 byte

Hi,

I have seen segfaults when copying buffers (gdr_copy_from_bar) with sizes ranging from 64 bytes to 127 bytes. The following are reproducers on our machines.

$ /opt/gdrcopy8.0/copybw -s 64
GPU id:0 name:Tesla K40c PCI domain: 0 bus: 2 device: 0
selecting device 0
testing size: 64
rounded size: 65536
device ptr: b05a40000
bar_ptr: 0x7f43d9223000
info.va: b05a40000
info.mapped_size: 65536
info.page_size: 65536
page offset: 0
user-space pointer:0x7f43d9223000
BAR writing test, size=64 offset=0 num_iters=10000
BAR1 write BW: 457.923MB/s
BAR reading test, size=64 offset=0 num_iters=100
Segmentation fault

$dmesg
...
[2689239.364734] copybw[5308]: segfault at 2846000 ip 00007f43d8ecd06c sp 00007fff23b939e0 error 6 in libgdrapi.so.1.2[7f43d8ecb000+3000]
$ /opt/gdrcopy8.0/copybw -s 64
GPU id:0 name:Tesla K80 PCI domain: 0 bus: 5 device: 0
GPU id:1 name:Tesla K80 PCI domain: 0 bus: 6 device: 0
selecting device 0
testing size: 64
rounded size: 65536
device ptr: 2304fc0000
bar_ptr: 0x2acc78311000
info.va: 2304fc0000
info.mapped_size: 65536
info.page_size: 65536
page offset: 0
user-space pointer:0x2acc78311000
BAR writing test, size=64 offset=0 num_iters=10000
BAR1 write BW: 722.593MB/s
BAR reading test, size=64 offset=0 num_iters=100
Segmentation fault (core dumped)

$dmesg
...
[2614698.728292] copybw[32532]: segfault at 2acc78321000 ip 00002acc78459018 sp 00007ffd51c16b10 error 4 in libgdrapi.so.1.2[2acc78457000+3000]

Do you have any idea what could be happening here?

Thanks,

a question about CPU mappings

Hi,
I use these same APIs to create CPU mappings of two GPU memory buffers, like this:

//------dev_b pin buff----------------------------------------------------------
unsigned int flag_b;
cuPointerSetAttribute(&flag_b, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_b);
gdr_mh_t mh_b;
gdr_t g_b = gdr_open();
ASSERT_NEQ(g_b, (void*)0);
gdr_pin_buffer(g_b, dev_b, sizeof(int), 0, 0, &mh_b);
void *bar_ptr_b = NULL;
ASSERT_EQ(gdr_map(g_b, mh_b, &bar_ptr_b, sizeof(int)), 0);
gdr_info_t info_b;
gdr_get_info(g_b, mh_b, &info_b);
int off_b = dev_b - info_b.va;
cout << "off_b:" << off_b << endl;
uint32_t *buf_ptr_b = (uint32_t *)((char *)bar_ptr_b + off_b);
cout << "buf_ptr_b:" << buf_ptr_b << endl;
//---------------------------------------------------------------------------------------
//-------dev_a pin buff------------------------------------------------------
unsigned int flag;
cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_a);
gdr_mh_t mh;
gdr_t g = gdr_open();
ASSERT_NEQ(g, (void*)0);
gdr_pin_buffer(g, dev_a, N*sizeof(int), 0, 0, &mh);
void *bar_ptr = NULL;
ASSERT_EQ(gdr_map(g, mh, &bar_ptr, N*sizeof(int)), 0);
gdr_info_t info;
gdr_get_info(g, mh, &info);
int off = dev_a - info.va;
cout << "off_a:" << off << endl;
uint32_t *buf_ptr = (uint32_t *)((char *)bar_ptr + off);
cout << "buf_ptr:" << buf_ptr << endl;

But it failed. I find that it is related to the order of a and b: the first one succeeds and the subsequent one fails.
Do you have any idea what could be happening here?

Thanks,

build failure on power PC

devendar@ibm-p9-013 gdrcopy (git::devel)$ make
echo "GDRAPI_ARCH=POWER"
GDRAPI_ARCH=POWER
make: Warning: File `libgdrapi.so.1.2' has modification time 22 s in the future
cc -O2 -fPIC -I /usr/local/cuda/include -I gdrdrv/ -I /usr/local/cuda/include -D GDRAPI_ARCH=POWER  -c -o gdrapi.o gdrapi.c
cc -shared -Wl,-soname,libgdrapi.so.1 -o libgdrapi.so.1.2 gdrapi.o
ldconfig -n /labhome/devendar/gdrcopy
ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv; \
make
make[1]: Entering directory `/labhome/devendar/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-387.26/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/4.11.0-44.el7a.ppc64le'
make[3]: Warning: File `/labhome/devendar/gdrcopy/gdrdrv/modules.order' has modification time 22 s in the future
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
  Building modules, stage 2.
  MODPOST 2 modules
make[2]: Leaving directory `/usr/src/kernels/4.11.0-44.el7a.ppc64le'
make[1]: Leaving directory `/labhome/devendar/gdrcopy/gdrdrv'
g++ -O2 -I /usr/local/cuda/include -I gdrdrv/ -I /usr/local/cuda/include -D GDRAPI_ARCH=POWER -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64   -o basic basic.o libgdrapi.so.1.2 -lcudart -lcuda -lpthread -ldl
libgdrapi.so.1.2: undefined reference to `_mm_sfence'
collect2: error: ld returned 1 exit status
make: *** [basic] Error 1

copy_to_bar: sfence not issued in the right order in some scenarios

I observed a 1msec latency from when gdr_copy_to_bar is issued to when the update is observed on the GPU.

When the target buffer is not aligned or when the copy size is too small, gdr_copy_to_bar translates to an sfence followed by a memcpy.

Issuing sfence after the memcpy seems to prevent some buffering and helps reduce the latency significantly.

gdr_map returns -EAGAIN


 
[1218757.588122] gdrdrv:mmap start=0x7f45e83c3000 size=196608 off=0xc31d2952
[1218757.588123] gdrdrv:range start with p=0 vaddr=7f45e83c3000 page_paddr=3838082a0000
[1218757.588125] gdrdrv:non-contig p=1 prev_page_paddr=3838082a0000 cur_page_paddr=3838084b0000
[1218757.588127] gdrdrv:mapping p=1 entries=1 offset=0 len=65536 vaddr=7f45e83c3000 paddr=3838082a0000
[1218757.588128] gdrdrv:mmaping phys mem addr=0x3838082a0000 size=65536 at user virt addr=0x7f45e83c3000
[1218757.588129] gdrdrv:is_cow_mapping is FALSE
[1218757.588138] gdrdrv:range start with p=1 vaddr=7f45e83d3000 page_paddr=3838084b0000
[1218757.588139] gdrdrv:mapping p=3 entries=2 offset=0 len=131072 vaddr=7f45e83d3000 paddr=3838084b0000
[1218757.588141] gdrdrv:mmaping phys mem addr=0x3838084b0000 size=131072 at user virt addr=0x7f45e83d3000
[1218757.588141] gdrdrv:is_cow_mapping is FALSE
[1218757.588146] gdrdrv:track_pfn_remap failed :-22
[1218757.588150] gdrdrv:error in remap_pfn_range() ret:-22
[1218757.588151] gdrdrv:error -11 in gdrdrv_mmap_phys_mem_wcomb

rate-limit printk to avoid flooding the kernel log

[61024.799569] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.799746] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.799920] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.800127] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.800151] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.800327] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.800502] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.800704] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.800726] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.800901] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.801083] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.801285] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.801307] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.801484] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]
[61024.801659] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23266): umem get failed (-14)
[61024.801861] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 23265): umem get failed (-14)
[61024.801883] gdrdrv:invoking nvidia_p2p_get_pages(va=0x2305ba0000 len=4194304 p2p_tok=0 va_tok=0)
[61024.802064] gdrdrv:nvidia_p2p_get_pages(va=2305ba0000 len=4194304 p2p_token=0 va_space=0) failed [ret = -22]

add a producer-consumer benchmark

strawman design (a hedged device-side sketch follows the list):

  • allocate device memory buffer B
  • launch CUDA kernel:
    • polling on B[0]
    • writing a zero-copy flag
  • CPU:
    • wait for the kernel to really be polling
    • read tsc in t_start
    • write B[0]
    • wait for flag
    • read tsc in t_end
    • d_t = t_end - t_start should be lower than 1-2 msecs
  • repeat until result is stable
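A hedged sketch of the device-side half of this design; names are illustrative only. B is the device-memory buffer written by the CPU through its gdrcopy mapping, and flag is zero-copy (mapped host) memory that the CPU polls.

// Spin on B[0] until the CPU writes it through the BAR mapping,
// then acknowledge through the zero-copy flag in mapped host memory.
__global__ void pong_kernel(volatile unsigned int *B, volatile unsigned int *flag)
{
    while (B[0] == 0)
        ;                       // GPU polls device memory
    __threadfence_system();     // order the observed write before the acknowledgement
    *flag = 1;                  // visible to the CPU waiting on the zero-copy buffer
}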

gdrcopy-devel RPM won't install

I have built the gdrcopy RPMS from the build_packages.sh script in the source tree and I've found that I'm unable to install the gdrcopy-devel package because it is missing a required dependency.

Error: Package: gdrcopy-devel-1.3-2.x86_64 (/gdrcopy-devel-1.3-2.x86_64)
Requires: libgdrapi.so.1()(64bit)
The file listed was installed by the gdrcopy RPM, but that library didn't get listed as being provided by the RPM.
If this works for other users then I may have messed something up in my environment, but if not it is probably a bug in the spec file that should get fixed.
Either way I'm willing to put some work into figuring it out, but I wanted to know which side of the problem to focus on.
