Comments (10)
@jirikraus can you take a look at this issue?
Thanks for making me aware, Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.
I found that the cause is the local domain size. With the same hardware configuration, i.e. 4 nodes with 1 A100 GPU each, the reported bandwidth is around 800 GB/s when the local domain size is 4096, but around 2.4 TB/s when the local domain size is 20480. Is there a problem with the bandwidth calculation?
Hi Mountain-ql, sorry for following up late. I have not had time to take a deep dive into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details on your system? Which exact MPI are you using (exact version and how it was built), and what is the output of "nvidia-smi topo -m" on the system you are running on?
Sorry for the late reply. The MPI I used was OpenMPI/4.0.5; it is the module provided on the HPC system, so I don't know how it was built. The output of "nvidia-smi topo -m" is:
         GPU0    GPU1    mlx5_0  mlx5_1  CPU Affinity  NUMA Affinity
GPU0     X       NV12    SYS     SYS     0             0-7
GPU1     NV12    X       SYS     SYS     0             0-7
mlx5_0   SYS     SYS     X       SYS
mlx5_1   SYS     SYS     SYS     X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Thanks. Can you attach the output of "ompi_info -c" and "ucx_info -b"? That will provide the missing information about the MPI you are using.
Sorry for the late reply! Here is the output of "ompi_info -c":
Configured by: hpcglrun
Configured on: Wed Feb 17 12:42:06 CET 2021
Configure host: taurusi6395.taurus.hrsk.tu-dresden.de
Configure command line: '--prefix=/sw/installed/OpenMPI/4.0.5-gcccuda-2020b'
'--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu' '--with-slurm'
'--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--enable-mpirun-prefix-by-default'
'--enable-shared'
'--with-cuda=/sw/installed/CUDAcore/11.1.1'
'--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0'
'--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0'
'--with-ofi=/sw/installed/libfabric/1.11.0-GCCcore-10.2.0'
'--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0'
'--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1'
'--without-verbs'
Built by: hpcglrun
Built on: Wed Feb 17 12:50:42 CET 2021
Built host: taurusi6395.taurus.hrsk.tu-dresden.de
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /sw/installed/GCCcore/10.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 10.2.0
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: skipped
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /sw/installed/GCCcore/10.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /sw/installed/GCCcore/10.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
Fort integer size: 4
Fort logical size: 4
Fort logical value true: 1
Fort have integer1: yes
Fort have integer2: yes
Fort have integer4: yes
Fort have integer8: yes
Fort have integer16: no
Fort have real4: yes
Fort have real8: yes
Fort have real16: yes
Fort have complex8: yes
Fort have complex16: yes
Fort have complex32: yes
Fort integer1 size: 1
Fort integer2 size: 2
Fort integer4 size: 4
Fort integer8 size: 8
Fort integer16 size: -1
Fort real size: 4
Fort real4 size: 4
Fort real8 size: 8
Fort real16 size: 16
Fort dbl prec size: 8
Fort cplx size: 8
Fort dbl cplx size: 16
Fort cplx8 size: 8
Fort cplx16 size: 16
Fort cplx32 size: 32
Fort integer align: 4
Fort integer1 align: 1
Fort integer2 align: 2
Fort integer4 align: 4
Fort integer8 align: 8
Fort integer16 align: -1
Fort real align: 4
Fort real4 align: 4
Fort real8 align: 8
Fort real16 align: 16
Fort dbl prec align: 8
Fort cplx align: 4
Fort dbl cplx align: 8
Fort cplx8 align: 4
Fort cplx16 align: 8
Fort cplx32 align: 16
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Build CFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions -fno-strict-aliasing
Build CXXFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions
Build FCFLAGS: -O3 -march=native -fno-math-errno
Build LDFLAGS: -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib64
-L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib64
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib64
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib64
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib
-L/sw/installed/GCCcore/10.2.0/lib64
-L/sw/installed/GCCcore/10.2.0/lib
-L/sw/installed/CUDAcore/11.1.1/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
Build LIBS: -lutil -lm -lrt -lcudart -lpthread -lz -lhwloc
-levent_core -levent_pthreads
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FCFLAGS: -I${libdir}
Wrapper extra LDFLAGS: -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath
-Wl,/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-Wl,-rpath
-Wl,/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
Wrapper extra LIBS: -lhwloc -ldl -levent_core -levent_pthreads -lutil
-lm -lrt -lcudart -lpthread -lz
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
Here is the output of "ucx_info -b":
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define ENABLE_SYMBOL_OVERRIDE 1
#define HAVE_1_ARG_BFD_SECTION_SIZE 1
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPLUS_DEMANGLE 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_EXP 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 1
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR 0
#define HAVE_DECL_IBV_ALLOC_DM 0
#define HAVE_DECL_IBV_ALLOC_TD 0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR 1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP 1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_MADV_FREE 0
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 0
#define HAVE_DECL_MLX5DV_CREATE_QP 0
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 0
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 0
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 0
#define HAVE_DECL_MLX5DV_OBJ_AH 0
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 0
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_RDMA_ESTABLISH 1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE 1
#define HAVE_DLFCN_H 1
#define HAVE_EXP_UMR 1
#define HAVE_EXP_UMR_KSM 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IBV_EXP_DM 1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN 1
#define HAVE_IB_EXT_ATOMICS 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_GET_STATE 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_SET_STATE 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_RDMACM_QP_LESS 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE_VERBS_EXP_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.9"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.9"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_INFO
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc --with-cuda=/sw/installed/CUDAcore/11.1.1 --with-gdrcopy=/sw/installed/GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.9"
#define restrict __restrict
#define test_MODULES ":module"
#define ucm_MODULES ":cuda"
#define uct_MODULES ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ":cm"
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"
Thanks. I can't spot anything wrong in your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect an issue with the GPU affinity handling, i.e. ENV_LOCAL_RANK being defined the wrong way (but you seem to have that right), or CUDA_VISIBLE_DEVICES being set in a funky way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)?
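For reference, the device selection in the example is driven by a local-rank environment variable. Below is a minimal sketch of that pattern in C, assuming an Open MPI launcher that exports OMPI_COMM_WORLD_LOCAL_RANK; the variable name and the exact handling in the sample may differ.

#include <stdlib.h>
#include <cuda_runtime.h>

/* Pick a GPU from the launcher's local rank before any other CUDA call.
 * If this mapping is wrong, or CUDA_VISIBLE_DEVICES hides devices,
 * all ranks on a node can end up sharing a single GPU. */
static void set_device_from_local_rank(void)
{
    int local_rank = 0;
    const char *env = getenv("OMPI_COMM_WORLD_LOCAL_RANK"); /* assumed launcher variable */
    if (env != NULL)
        local_rank = atoi(env);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0)
        cudaSetDevice(local_rank % num_devices);
}

If the environment variable is missing or empty, every rank silently falls back to device 0, which would match the roughly 2x single-node slowdown described above.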
I also checked the math for the bandwidth: the formula used does not account for caches (see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291), which explains why you are seeing implausibly high memory bandwidths.
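To illustrate the cache effect, here is a sketch with illustrative names and made-up timings, not the actual code in Host.c: a 5-point Jacobi stencil issues about 5 loads and 1 store per grid point, but with L2 reuse of neighboring rows only about 1 load and 1 store per point reach DRAM, so a formula that charges every access to memory overstates the bandwidth by up to ~3x.

#include <stdio.h>

/* Bandwidth if every stencil access is charged to DRAM
 * (5 loads + 1 store per point, no cache reuse assumed). */
static double counted_gbs(long nx, long ny, int iters, double seconds)
{
    double bytes = 6.0 * sizeof(double) * (double)nx * (double)ny * iters;
    return bytes / seconds / 1e9;
}

/* Bandwidth from the traffic that actually reaches DRAM once caches
 * capture the row reuse (roughly 1 load + 1 store per point). */
static double dram_gbs(long nx, long ny, int iters, double seconds)
{
    double bytes = 2.0 * sizeof(double) * (double)nx * (double)ny * iters;
    return bytes / seconds / 1e9;
}

int main(void)
{
    /* Illustrative run: 20480x20480 local domain, 1000 iterations, 8.4 s. */
    printf("counted: %.0f GB/s\n", counted_gbs(20480, 20480, 1000, 8.4));
    printf("DRAM:    %.0f GB/s\n", dram_gbs(20480, 20480, 1000, 8.4));
    return 0;
}

With these made-up inputs the counted figure comes out near 2.4 TB/s while the DRAM figure stays near 0.8 TB/s, i.e. the over-counted number can exceed the A100's ~1.6 TB/s peak HBM bandwidth (40 GB part) even when the real traffic is well below it.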
Thanks a lot!!
Thanks for the feedback. Closing this as it does not seem to be an issue with the code.