Comments (4)
Thank you for taking a look.
Which CPU, GPU and PCIe topology did you test?
Can you report copy_to_mapping perf?
from gdrcopy.
Using AVX-512 based memcpy is a bad idea, in general.
Here is how gdr_copy_from_mapping performs with AVX512 (in fact, its SSE4.1 version is faster than its AVX version, and the source code prefers it over the AVX version):
gdr_copy_from_mapping num iters for each size: 100
Test Size(B) Avg.Time(us)
DBG: using AVX512 implementation of gdr_copy_from_bar
gdr_copy_from_mapping 1 0.9811
gdr_copy_from_mapping 2 1.2646
gdr_copy_from_mapping 4 1.2648
gdr_copy_from_mapping 8 1.2640
gdr_copy_from_mapping 16 1.8958
gdr_copy_from_mapping 32 3.1540
gdr_copy_from_mapping 64 0.6476
gdr_copy_from_mapping 128 1.2858
gdr_copy_from_mapping 256 2.5581
gdr_copy_from_mapping 512 5.0851
gdr_copy_from_mapping 1024 10.2162
gdr_copy_from_mapping 2048 24.0402
gdr_copy_from_mapping 4096 44.5810
gdr_copy_from_mapping 8192 81.9428
gdr_copy_from_mapping 16384 170.7200
gdr_copy_from_mapping 32768 341.2040
gdr_copy_from_mapping 65536 675.1082
gdr_copy_from_mapping 131072 1357.5815
gdr_copy_from_mapping 262144 2706.2129
gdr_copy_from_mapping 524288 5425.6831
gdr_copy_from_mapping 1048576 10837.6549
gdr_copy_from_mapping 2097152 21672.5916
gdr_copy_from_mapping 4194304 55437.2406
gdr_copy_from_mapping 8388608 110991.1427
gdr_copy_from_mapping 16777216 222043.6687
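To put these read numbers in perspective, the average times can be converted to throughput. This is a minimal sketch using a few rows copied from the table above; the helper name `throughput_mb_s` is made up for illustration and is not part of gdrcopy.

```python
# (size in bytes, average time in microseconds), copied from the table above
FROM_MAPPING_AVX512 = [
    (64, 0.6476),
    (1024, 10.2162),
    (1048576, 10837.6549),
    (16777216, 222043.6687),
]


def throughput_mb_s(size_bytes: int, avg_time_us: float) -> float:
    """Bytes per microsecond is numerically equal to MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / avg_time_us


for size, t_us in FROM_MAPPING_AVX512:
    print(f"{size:>9} B  {throughput_mb_s(size, t_us):7.1f} MB/s")
```

For these rows the throughput stays at roughly 100 MB/s or below, and it drops to about 76 MB/s at 16 MiB.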
Thanks for your response!
CPU - Intel Xeon Silver 4114 (Skylake)
GPU - Tesla P100-PCIE-12GB
CUDA version - 11.4
Here are the gdr_copy_to_mapping numbers for AVX512:
gdr_copy_to_mapping num iters for each size: 10000
Test | Size(B) | Avg.Time(us) |
---|---|---|
gdr_copy_to_mapping | 1 | 0.1250 |
gdr_copy_to_mapping | 2 | 0.1245 |
gdr_copy_to_mapping | 4 | 0.1245 |
gdr_copy_to_mapping | 8 | 0.1222 |
gdr_copy_to_mapping | 16 | 0.1263 |
gdr_copy_to_mapping | 32 | 0.1252 |
gdr_copy_to_mapping | 64 | 0.1280 |
gdr_copy_to_mapping | 128 | 0.1376 |
gdr_copy_to_mapping | 256 | 0.1439 |
gdr_copy_to_mapping | 512 | 0.1550 |
gdr_copy_to_mapping | 1024 | 0.1927 |
gdr_copy_to_mapping | 2048 | 0.2631 |
gdr_copy_to_mapping | 4096 | 0.4262 |
gdr_copy_to_mapping | 8192 | 0.8239 |
gdr_copy_to_mapping | 16384 | 1.6179 |
gdr_copy_to_mapping | 32768 | 3.2132 |
gdr_copy_to_mapping | 65536 | 6.4094 |
gdr_copy_to_mapping | 131072 | 12.7935 |
gdr_copy_to_mapping | 262144 | 25.5790 |
gdr_copy_to_mapping | 524288 | 51.1738 |
gdr_copy_to_mapping | 1048576 | 102.2248 |
gdr_copy_to_mapping | 2097152 | 204.4293 |
gdr_copy_to_mapping | 4194304 | 409.7942 |
gdr_copy_to_mapping | 8388608 | 822.7885 |
gdr_copy_to_mapping | 16777216 | 1683.7191 |
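For a rough comparison with the gdr_copy_from_mapping results posted earlier in this thread, the same size/time conversion can be applied to both directions. This is a minimal sketch using two rows from each table. The two tables were measured with different iteration counts (100 and 10000), so treat the ratio as indicative only. The names `READ_US`, `WRITE_US` and `gb_per_s` are illustrative.

```python
# (size in bytes -> average time in microseconds) from the two tables in this thread
READ_US = {4096: 44.5810, 16777216: 222043.6687}   # gdr_copy_from_mapping
WRITE_US = {4096: 0.4262, 16777216: 1683.7191}     # gdr_copy_to_mapping


def gb_per_s(size_bytes: int, avg_time_us: float) -> float:
    """Convert a size/time pair to GB/s (1 GB = 10**9 bytes)."""
    # bytes per microsecond equals MB/s; divide by 1000 for GB/s
    return size_bytes / avg_time_us / 1000.0


for size in sorted(READ_US):
    read = gb_per_s(size, READ_US[size])
    write = gb_per_s(size, WRITE_US[size])
    print(f"{size:>9} B  read {read:6.3f} GB/s  write {write:6.3f} GB/s  "
          f"write/read {write / read:5.1f}x")
```

With these rows, writes to the mapping come out at close to 10 GB/s, which is more than two orders of magnitude faster than reads from it.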
As for the PCIe topology, I'm not sure, but I ran lspci -tv:
-+-[0000:d7]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-0e.0 Intel Corporation Device 2058
| +-0e.1 Intel Corporation Device 2059
| +-0f.0 Intel Corporation Device 2058
| +-0f.1 Intel Corporation Device 2059
| +-12.0 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.1 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.2 Intel Corporation Sky Lake-E M3KTI Registers
| +-15.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.0 Intel Corporation Sky Lake-E M2PCI Registers
| \-16.4 Intel Corporation Sky Lake-E M2PCI Registers
+-[0000:ae]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-08.0 Intel Corporation Device 2066
| +-09.0 Intel Corporation Device 2066
| +-0a.0 Intel Corporation Device 2040
| +-0a.1 Intel Corporation Device 2041
| +-0a.2 Intel Corporation Device 2042
| +-0a.3 Intel Corporation Device 2043
| +-0a.4 Intel Corporation Device 2044
| +-0a.5 Intel Corporation Device 2045
| +-0a.6 Intel Corporation Device 2046
| +-0a.7 Intel Corporation Device 2047
| +-0b.0 Intel Corporation Device 2048
| +-0b.1 Intel Corporation Device 2049
| +-0b.2 Intel Corporation Device 204a
| +-0b.3 Intel Corporation Device 204b
| +-0c.0 Intel Corporation Device 2040
| +-0c.1 Intel Corporation Device 2041
| +-0c.2 Intel Corporation Device 2042
| +-0c.3 Intel Corporation Device 2043
| +-0c.4 Intel Corporation Device 2044
| +-0c.5 Intel Corporation Device 2045
| +-0c.6 Intel Corporation Device 2046
| +-0c.7 Intel Corporation Device 2047
| +-0d.0 Intel Corporation Device 2048
| +-0d.1 Intel Corporation Device 2049
| +-0d.2 Intel Corporation Device 204a
| \-0d.3 Intel Corporation Device 204b
+-[0000:85]-+-00.0-[86]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]
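The last line of the tree is the relevant one for the topology question. This is a small sketch, assuming the usual lspci -tv layout in which `[domain:bus]` names a root bus and `-[NN]-` after a device names the bus behind a bridge. The regular expression and the function name `gpu_path` are illustrative only.

```python
import re

# Last line of the lspci -tv output above
GPU_LINE = "+-[0000:85]-+-00.0-[86]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]"

# root bus -> bridge dev.fn -> secondary bus -> endpoint dev.fn
PATTERN = re.compile(
    r"\[(?P<domain>[0-9a-f]{4}):(?P<root_bus>[0-9a-f]{2})\]"
    r"-\+-(?P<bridge>[0-9a-f]{2}\.[0-7])"
    r"-\[(?P<bus>[0-9a-f]{2})\]"
    r"-+(?P<endpoint>[0-9a-f]{2}\.[0-7])\s"
)


def gpu_path(line: str) -> dict:
    """Extract the bridge and endpoint addresses from one lspci -tv line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a single-bridge lspci -tv entry")
    g = match.groupdict()
    return {
        "bridge": f"{g['domain']}:{g['root_bus']}:{g['bridge']}",
        "gpu": f"{g['domain']}:{g['bus']}:{g['endpoint']}",
    }


print(gpu_path(GPU_LINE))
```

Read this way, the GPU sits at 0000:86:00.0 behind a single bridge at 0000:85:00.0, which suggests it is attached to a CPU root port without an intermediate PCIe switch.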
One caveat: I probably could have used the -mavx512vl compilation flag to make up to 32 ymm registers available for both AVX and AVX2, but I didn't. I wonder whether the loop unrolling in the source code should be tweaked if 32 registers are to be leveraged instead of the default 16.