
Comments (10)

harrism commented on June 22, 2024

@jirikraus can you take a look at this issue?

from code-samples.

jirikraus commented on June 22, 2024

Thanks for making me aware Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.

Mountain-ql commented on June 22, 2024

I found that the cause is the local domain size. With the same hardware configuration (4 nodes, 1 A100 GPU per node), a local domain size of 4096 gives a reported bandwidth of around 800 GB/s, but a local domain size of 20480 gives around 2.4 TB/s. Is there a problem with the bandwidth calculation?

jirikraus commented on June 22, 2024

Hi Mountain-ql, sorry for following up late. I have not had the time to dive deep into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details on your system? Which exact MPI are you using (exact version and how it was built), and what is the output of "nvidia-smi topo -m" on the system you are running on?

Mountain-ql commented on June 22, 2024

Sorry for the late reply.
The MPI I used was OpenMPI/4.0.5. It is a preinstalled module on the HPC system, so I don't know how it was built.
The output of "nvidia-smi topo -m" is:

        GPU0   GPU1   mlx5_0 mlx5_1 CPU Affinity NUMA Affinity
GPU0     X     NV12   SYS    SYS    0            0-7
GPU1    NV12    X     SYS    SYS    0            0-7
mlx5_0  SYS    SYS     X     SYS
mlx5_1  SYS    SYS    SYS     X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

jirikraus commented on June 22, 2024

Thanks. Can you attach the output of "ompi_info -c" and "ucx_info -b"? That will provide the missing information about the MPI you are using.

Mountain-ql commented on June 22, 2024

Sorry for the late reply!
Here is the output of "ompi_info -c":
Configured by: hpcglrun
Configured on: Wed Feb 17 12:42:06 CET 2021
Configure host: taurusi6395.taurus.hrsk.tu-dresden.de
Configure command line: '--prefix=/sw/installed/OpenMPI/4.0.5-gcccuda-2020b'
'--build=x86_64-pc-linux-gnu'
'--host=x86_64-pc-linux-gnu' '--with-slurm'
'--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--enable-mpirun-prefix-by-default'
'--enable-shared'
'--with-cuda=/sw/installed/CUDAcore/11.1.1'
'--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0'
'--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0'
'--with-ofi=/sw/installed/libfabric/1.11.0-GCCcore-10.2.0'
'--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0'
'--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1'
'--without-verbs'
Built by: hpcglrun
Built on: Wed Feb 17 12:50:42 CET 2021
Built host: taurusi6395.taurus.hrsk.tu-dresden.de
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /sw/installed/GCCcore/10.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 10.2.0
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: skipped
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /sw/installed/GCCcore/10.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /sw/installed/GCCcore/10.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
Fort integer size: 4
Fort logical size: 4
Fort logical value true: 1
Fort have integer1: yes
Fort have integer2: yes
Fort have integer4: yes
Fort have integer8: yes
Fort have integer16: no
Fort have real4: yes
Fort have real8: yes
Fort have real16: yes
Fort have complex8: yes
Fort have complex16: yes
Fort have complex32: yes
Fort integer1 size: 1
Fort integer2 size: 2
Fort integer4 size: 4
Fort integer8 size: 8
Fort integer16 size: -1
Fort real size: 4
Fort real4 size: 4
Fort real8 size: 8
Fort real16 size: 16
Fort dbl prec size: 8
Fort cplx size: 8
Fort dbl cplx size: 16
Fort cplx8 size: 8
Fort cplx16 size: 16
Fort cplx32 size: 32
Fort integer align: 4
Fort integer1 align: 1
Fort integer2 align: 2
Fort integer4 align: 4
Fort integer8 align: 8
Fort integer16 align: -1
Fort real align: 4
Fort real4 align: 4
Fort real8 align: 8
Fort real16 align: 16
Fort dbl prec align: 8
Fort cplx align: 4
Fort dbl cplx align: 8
Fort cplx8 align: 4
Fort cplx16 align: 8
Fort cplx32 align: 16
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Build CFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions -fno-strict-aliasing
Build CXXFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno
-finline-functions
Build FCFLAGS: -O3 -march=native -fno-math-errno
Build LDFLAGS: -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib64
-L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib64
-L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib64
-L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib64
-L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib
-L/sw/installed/GCCcore/10.2.0/lib64
-L/sw/installed/GCCcore/10.2.0/lib
-L/sw/installed/CUDAcore/11.1.1/lib64
-L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
Build LIBS: -lutil -lm -lrt -lcudart -lpthread -lz -lhwloc
-levent_core -levent_pthreads
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FCFLAGS: -I${libdir}
Wrapper extra LDFLAGS: -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath
-Wl,/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib
-Wl,-rpath
-Wl,/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
-Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
Wrapper extra LIBS: -lhwloc -ldl -levent_core -levent_pthreads -lutil
-lm -lrt -lcudart -lpthread -lz
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128

Here is the output of "ucx_info -b":
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define ENABLE_SYMBOL_OVERRIDE 1
#define HAVE_1_ARG_BFD_SECTION_SIZE 1
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPLUS_DEMANGLE 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_EXP 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 1
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR 0
#define HAVE_DECL_IBV_ALLOC_DM 0
#define HAVE_DECL_IBV_ALLOC_TD 0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR 1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP 1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_MADV_FREE 0
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 0
#define HAVE_DECL_MLX5DV_CREATE_QP 0
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 0
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 0
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 0
#define HAVE_DECL_MLX5DV_OBJ_AH 0
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 0
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_RDMA_ESTABLISH 1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE 1
#define HAVE_DLFCN_H 1
#define HAVE_EXP_UMR 1
#define HAVE_EXP_UMR_KSM 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IBV_EXP_DM 1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN 1
#define HAVE_IB_EXT_ATOMICS 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_GET_STATE 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_SET_STATE 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_RDMACM_QP_LESS 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE_VERBS_EXP_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.9"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.9"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_INFO
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc --with-cuda=/sw/installed/CUDAcore/11.1.1 --with-gdrcopy=/sw/installed/GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.9"
#define restrict __restrict
#define test_MODULES ":module"
#define ucm_MODULES ":cuda"
#define uct_MODULES ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ":cm"
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"

jirikraus commented on June 22, 2024

Thanks. I can't spot anything wrong with your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect there is an issue with the GPU affinity handling: i.e. ENV_LOCAL_RANK is defined the wrong way (but you seem to have that right), or CUDA_VISIBLE_DEVICES is set in an unusual way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)?
I also checked the math for the bandwidth: the formula used does not consider caches (see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291), which explains why you are seeing implausibly high memory bandwidths.

Mountain-ql commented on June 22, 2024

Thanks a lot!!

jirikraus commented on June 22, 2024

Thanks for the feedback. Closing this as it does not seem to be an issue with the code.
