Comments (22)
Hi @tylerjereddy,
I suspect that the segfault comes from somewhere in https://github.com/ofiwg/libfabric/blob/main/src/hmem_cuda_gdrcopy.c#L346-L380 or https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrapi.c#L387-L411. Can you use gdb to find the exact line where this segfault is triggered?
For GDRCopy, you may want to change https://github.com/NVIDIA/gdrcopy/blob/master/src/Makefile#L29 to -O0 -g so that it is friendlier to gdb. I guess that libfabric has similar compile options somewhere. Alternatively, you can manually instrument the code by adding printf calls and narrow down where the segfault comes from.
from gdrcopy.
I'm working on it. The exact nature of the failure is not deterministic, even between trials with the same builds, it seems. I'll keep working on narrowing down at least one of the failure scenarios. I'll also paste a few more example outputs that looked a little different (they didn't actually segfault, they just reported errors).
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: error Cannot allocate memory(12) while mapping handle 92f4280, rounded_size=65536 offset=1fe380000
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:nvshmemt_libfabric_rma:517: Received an error when trying to post an RMA operation
ERR: mh is not mapped yet
ERR: mh is not mapped yet
nid001225:11099:11099 [1] NVSHMEM INFO [5] freeing buf: 0x1477d8b80600
nid001225:11099:11099 [1] NVSHMEM INFO [5] allocated 720 bytes from mspace: 0x1478d46c5400 ptr: 0x1477d8b80600
nid001225:11101:11101 [3] NVSHMEM INFO [7] freeing buf: 0x148fb8b80600
nid001225:11101:11101 [3] NVSHMEM INFO [7] allocated 720 bytes from mspace: 0x1490c7b18400 ptr: 0x148fb8b80600
nid001225:11100:11100 [2] NVSHMEM INFO [6] freeing buf: 0x151cd8b80600
nid001225:11100:11100 [2] NVSHMEM INFO [6] allocated 720 bytes from mspace: 0x151de6ca2400 ptr: 0x151cd8b80600
nid001225:11098:11098 [0] NVSHMEM INFO [4] freeing buf: 0x148a58b80600
nid001225:11098:11098 [0] NVSHMEM INFO [4] allocated 720 bytes from mspace: 0x148b60104400 ptr: 0x148a58b80600
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: error Cannot allocate memory(12) while mapping handle 92f8050, rounded_size=65536 offset=1fe390000
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:nvshmemt_libfabric_rma:517: Received an error when trying to post an RMA operation.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/include/internal/common/nvshmem_internal.h:nvshmemi_process_multisend_rma:302: aborting due to error in process_channel_dma
I see 65536 (2^16, one more than the maximum value of an unsigned 16-bit integer) as the rounded_size in the mapping errors in there. Anyway, I'll try to dig deeper. My prints aren't showing up yet, so there is obviously something I'm not understanding.
from gdrcopy.
Ah, completely purged out my custom install of gdrcopy and the segfault/backtrace persisted, so it looks like some component in the dependency chain is ignoring the gdrcopy that I ask NVSHMEM to use via GDRCOPY_HOME when I build NVSHMEM from source.
I did confirm that I can see prints from my custom gdrcopy install from fi_info commands, for my custom libfabric build, but something in this backtrace isn't respecting the gdrcopy version that I want to use for debugging, since it doesn't care if I build NVSHMEM with a different gdrcopy:
ERR: mh is not mapped yet
ERR: mh is not mapped yet
[nid001500:113320:0:113395] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
ERR: mh is not mapped yet
ERR: mh is not mapped yet
==== backtrace (tid: 113395) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x0000000000001aa7 gdr_unmap() ???:0
2 0x0000000000032d92 cuda_gdrcopy_dev_unregister() :0
3 0x00000000000a488f cxip_unmap() :0
4 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
5 0x00000000000adfe5 cxip_evtq_progress() :0
6 0x0000000000081695 cxip_ep_progress() :0
7 0x0000000000089f99 cxip_util_cq_progress() :0
8 0x000000000004a020 ofi_cq_readfrom() :0
9 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395
10 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132
11 0x00000000000e4bad progress_transports() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:963
12 0x00000000000e51b9 progress() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/proxy/proxy.cpp:992
13 0x000000000000a6ea start_thread() ???:0
14 0x0000000000117a6f __GI___clone() ???:0
from gdrcopy.
Hi @tylerjereddy,
I reviewed the NVSHMEM libfabric transport code. It does not use GDRCopy with Slingshot -- at least in NVSHMEM 2.10.1. However, libfabric itself (not the NVSHMEM libfabric transport) uses GDRCopy. Based on the backtrace logs you posted, I think NVSHMEM calls into libfabric, which in turn triggers this issue. I think we can ignore NVSHMEM for now.
Guessing from your first comment, you originally ran with GDRCopy v2.3 and then moved to the master branch, right? Do you have root access on your system? Have you reloaded the gdrdrv driver from the master branch? If you have root access, can you enable debugging in the gdrdrv driver? After compiling GDRCopy, you can simply modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to set dbg_enabled=1 info_enabled=1 and call sudo ./insmod.sh. Please run sudo dmesg -w in a separate shell. When you run your application and hit a GDRCopy error, you should see more lines in dmesg. Please show me those lines.
ERR: error Cannot allocate memory(12) while mapping handle 92f4280, rounded_size=65536 offset=1fe380000
This line does not make sense to me. In most cases, the error code should be propagated from the gdrdrv driver. However, the driver never returns -ENOMEM (12) in the mmap path, and a line with that phrase can only be printed from the mmap call inside libgdrapi. One possibility is that ENOMEM is a stale error number from some other code path. Before this line, can you add printf("ERRNO before calling mmap %d\n", errno);? You can also reset errno = 0 before calling mmap.
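The stale-errno check suggested above can be sketched as a small wrapper. This is only an illustration under stated assumptions: checked_mmap and demo_stale_errno are hypothetical names, not part of libgdrapi, and the anonymous mapping stands in for the real gdrdrv file descriptor.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper, not part of libgdrapi: wraps mmap so that a stale
 * errno from an earlier, unrelated call cannot be mistaken for an mmap error.
 * Returns 0 on success, or the errno value set by mmap itself on failure. */
static int checked_mmap(void **out, size_t len, int prot, int flags,
                        int fd, off_t offset)
{
    assert(out != NULL);

    /* Log whatever errno was left behind by previous code paths. */
    fprintf(stderr, "errno before calling mmap: %d\n", errno);

    errno = 0; /* clear it so any value read below was set by mmap */
    void *p = mmap(NULL, len, prot, flags, fd, offset);
    if (p == MAP_FAILED) {
        int err = errno;
        fprintf(stderr, "mmap failed: %s (%d)\n", strerror(err), err);
        return err;
    }
    *out = p;
    return 0;
}

/* Demonstration: a leftover ENOMEM does not leak into a successful mmap. */
static int demo_stale_errno(void)
{
    void *p = NULL;

    errno = ENOMEM; /* simulate an error left over from another code path */
    int rc = checked_mmap(&p, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (rc == 0)
        munmap(p, 4096);
    return rc;
}
```

If the "before" print shows 12 while the wrapped call succeeds, the ENOMEM in the log was most likely left over from an earlier call rather than produced by mmap.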
from gdrcopy.
I don't have root access; it is a supercomputer at LANL. I could perhaps try passing a link to your suggestions along to HPC support to see if there's anything they can check.
from gdrcopy.
Have you reloaded the gdrdrv driver from the master branch?
I think the HPC admins are looking into your comment a bit, but I wanted to check on a few things:
- any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine?
- any risk that CXI proper (closed source HPE thing I think?) is somehow associated with an older gdrcopy version and/or driver? Or should LD_LIBRARY_PATH allow me to easily swap gdrcopy versions at runtime irrespective of how CXI was installed and the specific gdrcopy driver version available?
from gdrcopy.
any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine?
libgdrapi.so and gdrdrv (driver) are forward and backward compatible. Still, there might be some bugs we have fixed in a newer version of gdrdrv. It would be good to use the latest release version.
Your application talks to libgdrapi.so (not directly to gdrdrv). For this one, it is backward compatible only. For example, if you compile with GDRCopy v2.4, we cannot guarantee that your application will work with libgdrapi.so v2.3.
any risk that CXI proper (closed source HPE thing I think?) is somehow associated with an older gdrcopy version and/or driver? Or should LD_LIBRARY_PATH allow me to easily swap gdrcopy versions at runtime irrespective of how CXI was installed and the specific gdrcopy driver version available?
I don't know the answer. Is this a user-space library or a driver? If it is a user-space library, you can probably run ldd <lib.so> and see if it links with libgdrapi.so. It is possible that they use dlopen. That will be more challenging to detect. If it is a driver, the answer is no. gdrdrv does not export any symbols. No other drivers can call into gdrdrv.
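Because ldd only shows link-time dependencies, a library pulled in through dlopen is easier to spot from inside the running process. A minimal sketch, assuming glibc's dl_iterate_phdr is available; count_loaded_objects and report_match are hypothetical helper names introduced here for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdio.h>
#include <string.h>

/* State passed to the callback: the substring to look for and a hit count. */
struct match_state {
    const char *needle;
    int hits;
};

/* Called by dl_iterate_phdr once per loaded object: the executable and every
 * shared library currently mapped, including those brought in by dlopen. */
static int report_match(struct dl_phdr_info *info, size_t size, void *data)
{
    struct match_state *st = data;

    (void)size;
    if (info->dlpi_name && strstr(info->dlpi_name, st->needle)) {
        fprintf(stderr, "loaded: %s\n", info->dlpi_name);
        st->hits++;
    }
    return 0; /* returning 0 keeps the iteration going */
}

/* Returns how many currently loaded objects have a path containing needle. */
static int count_loaded_objects(const char *needle)
{
    struct match_state st = { needle, 0 };

    assert(needle != NULL);
    dl_iterate_phdr(report_match, &st);
    return st.hits;
}
```

Calling count_loaded_objects("libgdrapi") from instrumented code would print the full path of every matching library in the process, which shows whether more than one copy is loaded and where each came from. Reading /proc/<pid>/maps gives the same information from outside the process.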
By the way, you may want to try setting use_persistent_mapping=1 on some systems. This is a gdrdrv module parameter. You set it when you load gdrdrv. I did not suggest this because the issues you encountered were during gdr_map. Without use_persistent_mapping=1, you may run out of GPU BAR1. But an error should show up during gdr_pin_buffer or when you call ibv_reg_mr (from the IB stack). So, this parameter might be irrelevant, but you can try to set it if you plan to reload gdrdrv to enable the debug mode.
from gdrcopy.
So, my debug prints were not showing up because prepending my custom gdrcopy builds to LD_LIBRARY_PATH was insufficient to override the gdrcopy that was linked to ucx, which in turn was linked to OpenMPI. That's pretty confusing, but for now swapping to a ucx at runtime that does not have gdrcopy linked to it allows me to see my prints from another gdrcopy loaded in LD_LIBRARY_PATH.
Anyway, now I should be able to report some better debug prints.
from gdrcopy.
More detailed debug analysis below, now that I can use a custom gdrcopy build with a lower optimization level and interwoven prints. Keep in mind that the errors are not fully deterministic, so that does still make it a little tricky to drill down, but these analyses should be deeper than before at least.
- The prints in this first error case run right past the final err block of gdr_unmap and the backtrace has more info now, which ultimately seems to lead to gdr_unpin_buffer() at Line 270 in bb13928
Details for backtrace scenario 1
<snip>
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
[nid001453:60481:0:60481] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001492:121042:0:121042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001453:60479:0:60479] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001492:121044:0:121044] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
==== backtrace (tid: 60479) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000170b gdr_unpin_buffer() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:271
2 0x0000000000032da6 cuda_gdrcopy_dev_unregister() :0
3 0x00000000000a488f cxip_unmap() :0
4 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
5 0x00000000000adfe5 cxip_evtq_progress() :0
6 0x0000000000081695 cxip_ep_progress() :0
7 0x000000000008b6c9 cxip_cntr_read() cxip_cntr.c:0
8 0x000000000000e7d3 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
9 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
10 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
11 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
12 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
13 0x0000000000500a66 cufftMpDestroyReshape() ???:0
14 0x00000000004ff627 cufftMpDestroyReshape() ???:0
15 0x000000000015893a cufftMpAttachComm() ???:0
16 0x00000000004e058f cufftMpDestroyReshape() ???:0
17 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
18 0x000000000014cb6e cufftMpAttachComm() ???:0
19 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
20 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
21 0x0000000000148105 cufftXtMakePlanMany() ???:0
22 0x0000000000145d7d cufftMakePlanMany64() ???:0
23 0x00000000001461bf cufftMakePlanMany() ???:0
24 0x0000000000146386 cufftMakePlan3d() ???:0
25 0x0000000000406619 run_r2c_c2r_slabs() ???:0
26 0x00000000004079c7 main() ???:0
27 0x000000000003529d __libc_start_main() ???:0
28 0x000000000040573a _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
- The second error scenario also runs right through the final err block of gdr_unmap, but this time it seems to hit a line of code in the closed-source HPE CXI library code:
Details of error scenario 2
<snip>
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
==== backtrace (tid: 75821) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000c59e cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
2 0x00000000000a47cb cxip_unmap() :0
3 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
4 0x00000000000adfe5 cxip_evtq_progress() :0
5 0x0000000000081695 cxip_ep_progress() :0
6 0x000000000008b6c9 cxip_cntr_read() cxip_cntr.c:0
7 0x000000000000e7d3 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
8 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
9 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
10 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
11 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
12 0x0000000000500a66 cufftMpDestroyReshape() ???:0
13 0x00000000004ff598 cufftMpDestroyReshape() ???:0
14 0x000000000015893a cufftMpAttachComm() ???:0
15 0x00000000004e058f cufftMpDestroyReshape() ???:0
16 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
17 0x000000000014cb6e cufftMpAttachComm() ???:0
18 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
19 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
20 0x0000000000148105 cufftXtMakePlanMany() ???:0
21 0x0000000000145d7d cufftMakePlanMany64() ???:0
22 0x00000000001461bf cufftMakePlanMany() ???:0
23 0x0000000000146386 cufftMakePlan3d() ???:0
24 0x0000000000406619 run_r2c_c2r_slabs() ???:0
25 0x00000000004079c7 main() ???:0
26 0x000000000003529d __libc_start_main() ???:0
27 0x000000000040573a _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
- The third error scenario (in three tries!) also goes through the err block of gdr_unmap, and this time the final line reported in the backtrace is at Line 396 in bb13928
Details of error scenario 3
<snip>
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 1
gdr_unmap checkpoint 2
gdr_unmap checkpoint 3
gdr_unmap checkpoint 4
gdr_unmap checkpoint 5
gdr_unmap checkpoint 6
gdr_unmap checkpoint 7
==== backtrace (tid: 76631) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x0000000000001caf gdr_unmap() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:403
2 0x0000000000032d92 cuda_gdrcopy_dev_unregister() :0
3 0x00000000000a488f cxip_unmap() :0
4 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
5 0x00000000000adfe5 cxip_evtq_progress() :0
6 0x0000000000081695 cxip_ep_progress() :0
7 0x000000000008b6c9 cxip_cntr_read() cxip_cntr.c:0
8 0x000000000000e7d3 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
9 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
10 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
11 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
12 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
13 0x0000000000500a66 cufftMpDestroyReshape() ???:0
14 0x00000000004ff598 cufftMpDestroyReshape() ???:0
15 0x000000000015893a cufftMpAttachComm() ???:0
16 0x00000000004e058f cufftMpDestroyReshape() ???:0
17 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
18 0x000000000014cb6e cufftMpAttachComm() ???:0
19 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
20 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
21 0x0000000000148105 cufftXtMakePlanMany() ???:0
22 0x0000000000145d7d cufftMakePlanMany64() ???:0
23 0x00000000001461bf cufftMakePlanMany() ???:0
24 0x0000000000146386 cufftMakePlan3d() ???:0
25 0x0000000000406619 run_r2c_c2r_slabs() ???:0
26 0x00000000004079c7 main() ???:0
27 0x000000000003529d __libc_start_main() ???:0
28 0x000000000040573a _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
Does this give you any more traction to diagnose the problem? While I wait to hear back about the debug driver stuff, anything else you want me to try here? It also seems to me like there's a misunderstanding somewhere with UCX + gdrcopy + OpenMPI if my provider is actually CXI? I was originally asked to build OpenMPI linked to UCX + gdrcopy + CUDA.
from gdrcopy.
Thank you @tylerjereddy. I suspect that you may be running into a race condition from multithreading. GDRCopy, especially libgdrapi.so, is not thread safe. Anyway, I added a global lock to some functions in this branch https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Please check whether it helps. You just need to recompile libgdrapi.so and use that. There is no need to install a new gdrdrv driver.
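For readers unfamiliar with the approach, serializing a non-thread-safe API behind one global lock can be sketched as below. This is not the code from the dev-issue-296-exp branch; every toy_* name is hypothetical, and the handle list only stands in for whatever state the library keeps internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Toy stand-in for a library that tracks handles in an unsynchronized list. */
struct toy_handle {
    int id;
    LIST_ENTRY(toy_handle) entries;
};
static LIST_HEAD(toy_list, toy_handle) toy_handles =
    LIST_HEAD_INITIALIZER(toy_handles);
static int toy_count;

/* One process-wide lock serializes every entry point that touches the list. */
static pthread_mutex_t toy_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unlocked internals: callers must already hold toy_global_lock. */
static struct toy_handle *toy_pin_unlocked(int id)
{
    struct toy_handle *h = calloc(1, sizeof(*h));

    if (!h)
        return NULL;
    h->id = id;
    LIST_INSERT_HEAD(&toy_handles, h, entries);
    toy_count++;
    return h;
}

static void toy_unpin_unlocked(struct toy_handle *h)
{
    LIST_REMOVE(h, entries);
    toy_count--;
    free(h);
}

/* Public wrappers: take the lock, call the internal, release the lock. */
struct toy_handle *toy_pin(int id)
{
    pthread_mutex_lock(&toy_global_lock);
    struct toy_handle *h = toy_pin_unlocked(id);
    pthread_mutex_unlock(&toy_global_lock);
    return h;
}

void toy_unpin(struct toy_handle *h)
{
    assert(h != NULL);
    pthread_mutex_lock(&toy_global_lock);
    toy_unpin_unlocked(h);
    pthread_mutex_unlock(&toy_global_lock);
}

int toy_live_handles(void)
{
    pthread_mutex_lock(&toy_global_lock);
    int n = toy_count;
    pthread_mutex_unlock(&toy_global_lock);
    return n;
}
```

A lock like this protects the list against concurrent insert and remove, but it cannot help when a caller passes a handle that has already been freed.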
from gdrcopy.
I still see errors that are not deterministic on that branch (I reduced the optimization level again as well).
<snip>
Hello from rank 5/8 using GPU 1
Hello from rank 4/8 using GPU 0
Hello from rank 7/8 using GPU 3
Hello from rank 6/8 using GPU 2
Hello from rank 2/8 using GPU 2
Hello from rank 3/8 using GPU 3
Hello from rank 1/8 using GPU 1
Hello from rank 0/8 using GPU 0
ERR: Error in pthread_mutex_init with errno=5246
ERR: Error in pthread_mutex_init with errno=5291
ERR: Error in pthread_mutex_init with errno=5336
ERR: Error in pthread_mutex_init with errno=5364
ERR: Error in pthread_mutex_init with errno=5364
<snip>
Note that ERR: Error in pthread_mutex_init with errno=5296 occurs even on some simple fi_info commands now, like fi_info -p cxi.
Hanging seems more common on this branch now as well, and fi_info commands seem slower. Perhaps not surprising if something isn't quite right with lock acquisition I suppose?
from gdrcopy.
Sorry, there was a left-over code block. I just removed it. Please try again.
Note that this is not our final solution. It is just an ad hoc implementation to see if it helps. It might not work if the caller calls a GDRCopy API with a stale memory handle. For example, if they call gdr_close and then gdr_unmap or gdr_unpin_buffer, libgdrapi.so will access the memory handle object that has already been freed.
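To illustrate the stale-handle hazard, here is a sketch of one possible defensive design, which is an assumption and not how libgdrapi works: handles are opaque integer ids that get looked up in a list of live entries, so a stale id is rejected with EINVAL instead of dereferencing freed memory. All safe_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

struct safe_mh {
    uint64_t id;                 /* opaque value handed to the caller */
    LIST_ENTRY(safe_mh) entries;
};

struct safe_ctx {
    LIST_HEAD(safe_mh_list, safe_mh) live; /* handles that are still pinned */
    uint64_t next_id;
};

static void safe_open(struct safe_ctx *c)
{
    assert(c != NULL);
    LIST_INIT(&c->live);
    c->next_id = 1; /* 0 is reserved to mean "invalid" */
}

/* Returns a fresh id, or 0 if allocation fails. */
static uint64_t safe_pin(struct safe_ctx *c)
{
    struct safe_mh *mh = calloc(1, sizeof(*mh));

    if (!mh)
        return 0;
    mh->id = c->next_id++;
    LIST_INSERT_HEAD(&c->live, mh, entries);
    return mh->id;
}

/* Looks the id up before touching any handle memory; unknown or already
 * released ids yield EINVAL. */
static int safe_unpin(struct safe_ctx *c, uint64_t id)
{
    struct safe_mh *mh;

    LIST_FOREACH(mh, &c->live, entries) {
        if (mh->id == id) {
            LIST_REMOVE(mh, entries);
            free(mh);
            return 0; /* return immediately: mh must not be used again */
        }
    }
    return EINVAL;
}

/* Releases every handle that is still pinned, as a close call would. */
static void safe_close(struct safe_ctx *c)
{
    struct safe_mh *mh;

    while ((mh = LIST_FIRST(&c->live)) != NULL) {
        LIST_REMOVE(mh, entries);
        free(mh);
    }
}
```

The lookup costs a list walk per call, which is the price of never trusting a caller-supplied pointer.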
from gdrcopy.
Here's the backtrace for the 2-node cuFFTMp reproducer with your updated branch (with optimization level reduced):
<snip>
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
ERR: mh is not mapped yet
[nid001237:69241:0:69241] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
[nid001240:34042:0:34042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41)
==== backtrace (tid: 69241) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000176c _gdr_unpin_buffer() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:281
2 0x00000000000017ca gdr_unpin_buffer() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:291
3 0x0000000000032da6 cuda_gdrcopy_dev_unregister() :0
4 0x00000000000a488f cxip_unmap() :0
5 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
6 0x00000000000adfe5 cxip_evtq_progress() :0
7 0x0000000000081695 cxip_ep_progress() :0
8 0x0000000000089f99 cxip_util_cq_progress() :0
9 0x000000000004a020 ofi_cq_readfrom() :0
10 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395
11 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132
12 0x000000000000e7c8 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:390
13 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
14 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
15 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
16 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
17 0x0000000000500a66 cufftMpDestroyReshape() ???:0
18 0x00000000004ff627 cufftMpDestroyReshape() ???:0
19 0x000000000015893a cufftMpAttachComm() ???:0
20 0x00000000004e058f cufftMpDestroyReshape() ???:0
21 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
22 0x000000000014cb6e cufftMpAttachComm() ???:0
23 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
24 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
25 0x0000000000148105 cufftXtMakePlanMany() ???:0
26 0x0000000000145d7d cufftMakePlanMany64() ???:0
27 0x00000000001461bf cufftMakePlanMany() ???:0
28 0x0000000000146386 cufftMakePlan3d() ???:0
29 0x0000000000406619 run_r2c_c2r_slabs() ???:0
30 0x00000000004079c7 main() ???:0
31 0x000000000003529d __libc_start_main() ???:0
32 0x000000000040573a _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
<snip>
So, the crash seems to be near here in your new branch, in _gdr_unpin_buffer, while attempting to remove an element from a list: Line 281 in d229925
Now, if we look at the special branch of libfabric I'm using, in function cuda_gdrcopy_dev_unregister(), which is called just before control flow returns to gdrcopy proper, I see two calls that may be worth asking you about. First, there is a call to gdr_unmap(), then right after that, there is a call to gdr_unpin_buffer(). Both operate on the same handle/structure member, it seems. Here is the permalink to that particular libfabric branch/code block, which I think I needed for CXI support: https://github.com/thomasgillis/libfabric/blob/10caf878ccacedd2ce907e8e714a9d90d74d63ca/src/hmem_cuda_gdrcopy.c#L359-L368
The situation looks the same in the main branch of libfabric, for that particular block of code: https://github.com/ofiwg/libfabric/blob/f41cea52738da193fd312ce9cf0a1adf23acaa8f/src/hmem_cuda_gdrcopy.c#L359-L368
All of this code is in a libfabric code block with a #if ENABLE_GDRCOPY_DLOPEN preprocessor guard (or just after it..). I decided to mess around a little with that code block on the cxi-enabled branch of libfabric using the diff below the fold.
--- a/src/hmem_cuda_gdrcopy.c
+++ b/src/hmem_cuda_gdrcopy.c
@@ -33,6 +33,7 @@
#if HAVE_CONFIG_H
#include <config.h>
+#include <stdio.h>
#endif
#include "ofi_hmem.h"
@@ -356,26 +357,27 @@ int cuda_gdrcopy_dev_unregister(uint64_t handle)
assert(gdrcopy);
pthread_spin_lock(&global_gdr_lock);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 1\n");
err = global_gdrcopy_ops.gdr_unmap(global_gdr, gdrcopy->mh,
gdrcopy->user_ptr, gdrcopy->length);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 2\n");
if (err) {
+ printf("cuda_gdrcopy_dev_unregister checkpoint 2b\n");
FI_WARN(&core_prov, FI_LOG_CORE,
"gdr_unmap failed! error: %s\n",
strerror(err));
goto exit;
}
+ printf("cuda_gdrcopy_dev_unregister checkpoint 3\n");
+ pthread_spin_unlock(&global_gdr_lock);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 4\n");
- err = global_gdrcopy_ops.gdr_unpin_buffer(global_gdr, gdrcopy->mh);
- if (err) {
- FI_WARN(&core_prov, FI_LOG_MR,
- "gdr_unmap failed! error: %s\n",
- strerror(err));
- goto exit;
- }
exit:
+ printf("cuda_gdrcopy_dev_unregister checkpoint 5\n");
pthread_spin_unlock(&global_gdr_lock);
free(gdrcopy);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 6\n");
return err;
}
Although deleting gdr_unpin_buffer doesn't really protect me from backtraces/problems, print checkpoints 5 and 6 are hit regularly, suggesting non-zero exit codes returned regularly from gdr_unmap on the special gdrcopy branch. Is this more helpful? Is there anything in that libfabric code block that could be safer/better?
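Two details of the experimental diff are worth keeping in mind when reading its output. The exit: label is reached on the success path as well as the failure path, so checkpoints 5 and 6 print either way, and only checkpoint 2b distinguishes a gdr_unmap failure. The success path also calls pthread_spin_unlock at checkpoint 3 and again after exit:, and POSIX leaves unlocking a spin lock the thread does not hold undefined. A minimal sketch of the single-exit shape follows; the fake_* names are hypothetical stand-ins for the GDRCopy calls, and a mutex replaces the spinlock purely for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-ins for the two GDRCopy calls; each returns 0 on
 * success or a nonzero error code chosen by the caller. */
struct fake_reg {
    int unmap_rc;   /* value the fake unmap step should return */
    int unpin_rc;   /* value the fake unpin step should return */
    int unpinned;   /* set to 1 once the unpin step has run */
};

static int fake_unmap(struct fake_reg *r) { return r->unmap_rc; }
static int fake_unpin(struct fake_reg *r) { r->unpinned = 1; return r->unpin_rc; }

/* A mutex is used here for simplicity; the libfabric code uses a spinlock. */
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static int unlock_calls; /* counts releases so the lock balance can be checked */

/* Unmap, then unpin, with exactly one lock release on every path. */
static int fake_dev_unregister(struct fake_reg *r)
{
    int err;

    assert(r != NULL);
    pthread_mutex_lock(&reg_lock);

    err = fake_unmap(r);
    if (err) {
        fprintf(stderr, "unmap failed: %d\n", err); /* failure-only message */
        goto exit;
    }

    err = fake_unpin(r);
    if (err)
        fprintf(stderr, "unpin failed: %d\n", err);

exit:
    /* Reached on success and on failure, so a print placed here says nothing
     * about err; only the messages above indicate an error. */
    unlock_calls++;
    pthread_mutex_unlock(&reg_lock);
    return err;
}
```

As in the original function, a failed unmap skips the unpin step; whether that is the right recovery is a separate question for the library authors.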
from gdrcopy.
@tylerjereddy Thank you for the additional info. We also call gdr_unpin_buffer inside gdr_close. But I don't expect to see a segfault in LIST_REMOVE if it comes from there. A few requests:
- I added code instrumentation in https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Can you try again? Please show me the whole output. If possible, please separate the output from Node 1 and 2.
- Have you tested a standalone GDRCopy application? Do gdrcopy_copybw and gdrcopy_sanity work?
from gdrcopy.
Starting with your second point about the GDRCopy test applications, I used the latest master branch of GDRCopy without modification and the dev-cxi branch of libfabric without modification (I'm guessing it didn't use libfabric here, but just to be safe...).
The output is below; the sanity check seems to "pass" but spits out errors?
<snip>
---------- Running gdrcopy_copybw ----------
GPU id:0; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:03:00
GPU id:1; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:41:00
GPU id:2; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:81:00
GPU id:3; name: NVIDIA A100-SXM4-40GB; Bus id: 0000:c1:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 15150b200000
map_d_ptr: 0x15152c97b000
info.va: 15150b200000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x15152c97b000
writing test, size=131072 offset=0 num_iters=10000
write BW: 18087.3MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 18.9831MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
---------- End of gdrcopy_copybw ----------
---------- Running gdrcopy_sanity ----------
ERR: error Invalid argument(22) while mapping handle b37fd0, rounded_size=4096 offset=10000
ERR: error Invalid argument(22) while mapping handle b38550, rounded_size=69632 offset=0
ERR: error Permission denied(13) while mapping handle b38640, rounded_size=65536 offset=0
ERR: error Permission denied(13) while mapping handle b38f80, rounded_size=65536 offset=0
ERR: ioctl error (errno=22)
ERR: error Permission denied(13) while mapping handle 3066cbf0, rounded_size=91 offset=0
ERR: error Permission denied(13) while mapping handle 3066cbf0, rounded_size=91 offset=0
ERR: ioctl error (errno=13)
ERR: ioctl error (errno=13)
Total: 28, Passed: 28, Failed: 0, Waived: 0
---------- End of gdrcopy_sanity ----------
<snip>
The modified interactive script for the 2-node test:
#!/bin/bash -l
#
# setup the runtime environment
#export FI_LOG_LEVEL=debug
#export NVSHMEM_DEBUG=TRACE
export FI_HMEM=cuda
export GDRCOPY_ENABLE_LOGGING=1
# we need special CXI- and CUDA-enabled version of libfabric
# per: https://github.com/ofiwg/libfabric/issues/10001#issuecomment-2078604043
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:$LD_LIBRARY_PATH"
export PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin:$PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/lib:$LD_LIBRARY_PATH"
which fi_info
echo "fi_info -l:"
fi_info -l
echo "fi_info -p cxi:"
fi_info -p cxi
#cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
#make clean
#make build
#make run
echo "---------- Running gdrcopy_copybw ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_copybw
echo "---------- End of gdrcopy_copybw ---------- "
echo "---------- Running gdrcopy_sanity ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_sanity
echo "---------- End of gdrcopy_sanity ---------- "
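A side note on the script above: two of the export LD_LIBRARY_PATH lines (the nvhpc .../cuda/12.3/targets/x86_64-linux/lib entries) start with lustre/ rather than /lustre/. As far as I know, the dynamic loader resolves such relative entries against the current working directory, so they probably do not point where intended. Below is a minimal sketch, assuming bash, for listing non-absolute entries in a search path. The function name list_relative_entries is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print every entry of a colon-separated search path
# that does not start with "/" (i.e. is not an absolute path).
list_relative_entries() {
    local entries entry
    # Split the argument on ":" into an array without glob expansion.
    IFS=':' read -r -a entries <<< "$1"
    for entry in "${entries[@]}"; do
        # Skip empty entries; they are a separate case not reported here.
        [ -z "$entry" ] && continue
        case "$entry" in
            /*) ;;                          # absolute path: nothing to report
            *)  printf '%s\n' "$entry" ;;   # relative path: report it
        esac
    done
}

# Check the current environment; prints nothing if every entry is absolute.
list_relative_entries "${LD_LIBRARY_PATH:-}"
```

Running this just before the mpirun line would show whether any relative entries made it into the final LD_LIBRARY_PATH.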
For the first point, using the latest version of the dev-issue-296-exp gdrcopy branch with the original cuFFTMp 2-node reproducer, this is the raw output:
/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin/fi_info
fi_info -l:
===> [12292, 12292] GDRCopy Checkpoint gdr_open: 1
===> [12292, 12292] GDRCopy Checkpoint gdr_open: 1
verbs:
version: 120.0
cxi:
version: 0.1
psm3:
version: 305.1010
ofi_rxd:
version: 120.0
shm:
version: 120.0
udp:
version: 120.0
tcp:
version: 120.0
ofi_hook_perf:
version: 120.0
ofi_hook_trace:
version: 120.0
ofi_hook_debug:
version: 120.0
ofi_hook_noop:
version: 120.0
ofi_hook_hmem:
version: 120.0
ofi_hook_dmabuf_peer_mem:
version: 120.0
off_coll:
version: 120.0
sm2:
version: 120.0
ofi_mrail:
version: 120.0
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 1
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 4
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 5
===> [12292, 12292] GDRCopy Checkpoint gdr_close: 6
fi_info -p cxi:
===> [12312, 12312] GDRCopy Checkpoint gdr_open: 1
===> [12312, 12312] GDRCopy Checkpoint gdr_open: 1
provider: cxi
fabric: cxi
domain: cxi0
version: 0.1
type: FI_EP_RDM
protocol: FI_PROTO_CXI
provider: cxi
fabric: cxi
domain: cxi1
version: 0.1
type: FI_EP_RDM
protocol: FI_PROTO_CXI
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 1
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 4
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 5
===> [12312, 12312] GDRCopy Checkpoint gdr_close: 6
rm -rf cufftmp_r2c_c2r_slabs_GROMACS
/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin/../bin/nvcc cufftmp_r2c_c2r_slabs_GROMACS.cu -o cufftmp_r2c_c2r_slabs_GROMACS -std=c++17 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_90,code=sm_90 -I/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp -I/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include -I/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/include -lcuda -L/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib -L/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib -lcufftMp -lnvshmem_device -lnvshmem_host -L/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib -lmpi
LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/lib:/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib/ucx:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:/opt/cray/pe/papi/7.0.0.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib" /lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin/mpirun -oversubscribe -n 8 -N 4 cufftmp_r2c_c2r_slabs_GROMACS
Hello from rank 2/8 using GPU 2
Hello from rank 1/8 using GPU 1
Hello from rank 3/8 using GPU 3
Hello from rank 0/8 using GPU 0
Hello from rank 6/8 using GPU 2
Hello from rank 7/8 using GPU 3
Hello from rank 5/8 using GPU 1
Hello from rank 4/8 using GPU 0
===> [12410, 12410] GDRCopy Checkpoint gdr_open: 1
===> [12412, 12412] GDRCopy Checkpoint gdr_open: 1
===> [12413, 12413] GDRCopy Checkpoint gdr_open: 1
===> [12411, 12411] GDRCopy Checkpoint gdr_open: 1
===> [12411, 12411] GDRCopy Checkpoint gdr_open: 1
===> [12413, 12413] GDRCopy Checkpoint gdr_open: 1
===> [12412, 12412] GDRCopy Checkpoint gdr_open: 1
===> [12410, 12410] GDRCopy Checkpoint gdr_open: 1
===> [119878, 119878] GDRCopy Checkpoint gdr_open: 1
===> [119876, 119876] GDRCopy Checkpoint gdr_open: 1
===> [119877, 119877] GDRCopy Checkpoint gdr_open: 1
===> [119879, 119879] GDRCopy Checkpoint gdr_open: 1
===> [119878, 119878] GDRCopy Checkpoint gdr_open: 1
===> [119877, 119877] GDRCopy Checkpoint gdr_open: 1
===> [119876, 119876] GDRCopy Checkpoint gdr_open: 1
===> [119879, 119879] GDRCopy Checkpoint gdr_open: 1
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
/lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/init/init.cu:register_state_ptr:110: IBGDA not enabled by host lib, but passed in by device. Host ignoring IBGDA state.
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5080
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5080, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5c20
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5c20, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5940
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5940, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd7200
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd7200, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5080
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5080, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5cd0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5cd0, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd54e0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd54e0, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5840
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5840, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5080
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5080, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5c20
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5c20, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5940
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5940, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd7200
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd7200, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5080
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5080, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5cd0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5cd0, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd54e0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd54e0, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5840
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5840, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5300
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5300, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5ea0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5ea0, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5bc0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5bc0, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5300
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5300, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd7480
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd7480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5760
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5760, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5f50
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5f50, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5300
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5300, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x6cd5ac0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x6cd5ac0, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5ea0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5ea0, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5300
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5300, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5bc0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5bc0, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd7480
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd7480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5760
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5760, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5f50
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5f50, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x6cd5ac0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x6cd5ac0, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03340
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03340, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03340
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03340, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03340, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03340
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03340
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03310
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03310, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03310
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03310, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03310, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03310
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03310
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e07540
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e07540, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e07540
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e07540, ret=0
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e07540, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03940
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03940, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03940
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03940, ret=0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03940, ret=0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03940
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03940
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03480
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03480, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03480
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03480, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03720
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03720, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03480, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e03110
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e03110, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03110
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03110, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03110
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e07540
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e07540
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e03720
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e03720, ret=0
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03720, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e03110, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03480
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03480
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e04690
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e04690, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e04690
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e04690, ret=0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e04690, ret=0
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03720
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03720
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e03110
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e03110
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ab90
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ab90, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ab90
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ab90, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ab40
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ab40, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ab40
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ab40, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ab40
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ab90, ret=0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e04690
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e04690
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ab40, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ab90
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ab90
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1acd0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1acd0, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1acd0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1acd0, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1acd0, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1af00
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1af00, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1af00
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1af00, ret=0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1af00
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1af00, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1af50
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1af50, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ab40
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ab40
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1af50
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1af50, ret=0
ERR: mh is not mapped yet
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1af50, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1a960
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1a960, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1a960
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1a960, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1a960, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1acd0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1acd0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1af00
ERR: mh is not mapped yet
[nid001468:119877:0:119877] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd8)
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1af00
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ed70
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ed70, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ed70
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ed70, ret=0
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ed70, ret=0
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1af50
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1af50
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac50
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac50, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac50
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac50, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac50
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ed70
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ed70
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bec0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bec0, ret=0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac50, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bec0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bec0, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1a960
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1a960
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bec0, ret=0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac50
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac50
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bec0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bec0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac00
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac00, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac00
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac00, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac00, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ad90, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ad90, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1afc0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1afc0, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1afc0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1afc0, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ad90
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1afc0, ret=0
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac00
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac00
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b010
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b010, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b010
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b010, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ad90, ret=0
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b010, ret=0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ad90
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ad90
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1afc0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1afc0
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa20, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ee30
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ee30, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ee30
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ee30, ret=0
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ee30
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ee30
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ee30
===> [12410, 12467] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ee30, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1acb0
===> [12413, 12413] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1acb0, ret=0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1acb0
===> [12413, 12413] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1acb0, ret=0
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b010
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b010
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1acb0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1acb0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1acb0
===> [12413, 12465] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1acb0, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bf80
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bf80, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bf80
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bf80, ret=0
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ee30
===> [12410, 12467] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ee30
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa20
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bf80
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bf80
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bf80
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bf80, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ac60
===> [12412, 12412] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ac60, ret=0
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ac60
===> [12412, 12412] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ac60, ret=0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ac60
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ac60
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1acb0
===> [12413, 12465] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1acb0
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ac60
===> [12412, 12464] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ac60, ret=0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bf80
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bf80
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ac60
===> [12412, 12464] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ac60
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1adf0
===> [119879, 119879] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1adf0, ret=0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1adf0
===> [119879, 119879] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1adf0, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1adf0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1adf0
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b020
===> [12411, 12411] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b020, ret=0
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b020
===> [12411, 12411] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b020, ret=0
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b020
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b020
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1b020
===> [12411, 12466] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b020, ret=0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1adf0
===> [119879, 119945] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1adf0, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1ee90, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1ee90, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1b070
===> [119878, 119878] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1b070, ret=0
===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ee90, ret=0
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1b070
===> [119878, 119878] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1b070, ret=0
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b020
===> [12411, 12466] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b020
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1adf0
===> [119879, 119945] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1adf0
===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1ad90
===> [119879, 119879] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1ad90, ret=22
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1b070
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1b070
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1b070
===> [119878, 119944] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1b070, ret=0
===> [12410, 12410] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1ee90
===> [12410, 12410] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1ee90
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1b070
===> [119878, 119944] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1b070
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa80
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa80, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa80
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa80, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa80
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa80
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa80
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa80, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1bfe0
===> [119876, 119876] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1bfe0, ret=0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1bfe0
===> [119876, 119876] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1bfe0, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa80
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa80
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=22
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bfe0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bfe0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bfe0
===> [119876, 119942] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bfe0, ret=0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1bfe0
===> [119876, 119942] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1bfe0
==== backtrace (tid: 119877) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000c59e cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
2 0x00000000000a47cb cxip_unmap() :0
3 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
4 0x00000000000adfe5 cxip_evtq_progress() :0
5 0x0000000000081695 cxip_ep_progress() :0
6 0x000000000008b6c9 cxip_cntr_read() cxip_cntr.c:0
7 0x000000000000e7d3 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:441
8 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
9 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
10 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
11 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
12 0x0000000000500a66 cufftMpDestroyReshape() ???:0
13 0x00000000004ff598 cufftMpDestroyReshape() ???:0
14 0x000000000015893a cufftMpAttachComm() ???:0
15 0x00000000004e058f cufftMpDestroyReshape() ???:0
16 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
17 0x000000000014cb6e cufftMpAttachComm() ???:0
18 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
19 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
20 0x0000000000148105 cufftXtMakePlanMany() ???:0
21 0x0000000000145d7d cufftMakePlanMany64() ???:0
22 0x00000000001461bf cufftMakePlanMany() ???:0
23 0x0000000000146386 cufftMakePlan3d() ???:0
24 0x00000000004066b2 run_r2c_c2r_slabs() ???:0
25 0x0000000000407d4b main() ???:0
26 0x000000000003529d __libc_start_main() ???:0
27 0x00000000004057ea _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
make: *** [Makefile:18: run] Error 139
After that, I tried to do a bit more work. First, I added another print in _gdr_unpin_buffer after the free operation, because your print after the LIST_REMOVE did show up in the failure scenario I pasted above.
Sample Diff
--- a/src/gdrapi.c
+++ b/src/gdrapi.c
@@ -302,6 +302,7 @@ static int _gdr_unpin_buffer(gdr_t g, gdr_mh_t handle)
LIST_REMOVE(mh, entries);
printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=%p\n", getpid(), gettid(), mh);
free(mh);
+ printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 3\n", getpid(), gettid());
return ret;
}
On top of that, per the request to separate the output by node, I made a few more changes to the source to prefix the hostname in each of the prints. These changes are available on my fork of gdrcopy (https://github.com/tylerjereddy/gdrcopy) on feature branch treddy_gh_296 (it just builds a few commits on top of your branch).
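For readers who want to reproduce this style of instrumentation, here is a minimal sketch of a checkpoint print with the hostname prefix. The names format_checkpoint and checkpoint are hypothetical and are not part of gdrcopy; the real changes are the ones on the fork branch mentioned above. The sketch assumes Linux and writes to stderr, which is unbuffered, so a line printed just before an abort is less likely to be lost or reordered than with printf to a redirected stdout.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of gdrcopy): formats one checkpoint line as
 * "===> [host, pid, tid] GDRCopy Checkpoint <func>: <step>: mh=<ptr>".
 * Returns the value of snprintf(). */
static int format_checkpoint(char *buf, size_t len, const char *func,
                             int step, const void *mh)
{
    char host[256];

    if (gethostname(host, sizeof(host)) != 0)
        strcpy(host, "unknown");
    host[sizeof(host) - 1] = '\0'; /* gethostname may not terminate on truncation */

    /* syscall(SYS_gettid) also works on glibc versions that lack gettid(). */
    return snprintf(buf, len,
                    "===> [%s, %d, %d] GDRCopy Checkpoint %s: %d: mh=%p",
                    host, (int)getpid(), (int)syscall(SYS_gettid),
                    func, step, (void *)mh);
}

/* Emit the checkpoint on stderr so it is written immediately. */
static void checkpoint(const char *func, int step, const void *mh)
{
    char line[512];

    format_checkpoint(line, sizeof(line), func, step, mh);
    fprintf(stderr, "%s\n", line);
}
```

A call such as checkpoint("_gdr_unpin_buffer", 3, NULL) placed right after free(mh) would then play the role of the extra print in the diff above.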
Now, when I run the 2-node cuFFTMp reproducer with that version of gdrcopy, I see a double free or corruption (out) error, apparently at the free() call in _gdr_unpin_buffer(). Full log:
out_improved_prints.txt
That would be consistent with your original instrumented code as well, with the list removal "succeeding," but the free failing inside of _gdr_unpin_buffer. I think you were already worried about a double free somewhere above.
I ran the reproducer two more times, and this was not always the case, however. Sometimes we get the printf after the free operation in _gdr_unpin_buffer(), and the backtrace only happens after that:
===> [nid001196, 36860, 36913] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e21190
===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e21530
===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e21530
===> [nid001468, 5905, 5958] GDRCopy Checkpoint _gdr_unpin_buffer: 3
===> [nid001196, 36860, 36913] GDRCopy Checkpoint _gdr_unpin_buffer: 3
===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e21190
===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e21190
===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1d8f0
===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1d8f0
===> [nid001196, 36860, 36860] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e21190, ret=22
===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1d8f0
===> [nid001468, 5906, 5956] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1d8f0, ret=0
===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1d8f0
===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1d8f0
===> [nid001468, 5906, 5956] GDRCopy Checkpoint _gdr_unpin_buffer: 3
==== backtrace (tid: 36860) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000c59b cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
2 0x00000000000a47cb cxip_unmap() :0
3 0x000000000008c165 cxip_rma_cb() cxip_rma.c:0
4 0x00000000000adfe5 cxip_evtq_progress() :0
5 0x0000000000081695 cxip_ep_progress() :0
6 0x0000000000089f99 cxip_util_cq_progress() :0
7 0x000000000004a020 ofi_cq_readfrom() :0
8 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/include/rdma/fi_eq.h:395
9 0x000000000000dff4 nvshmemt_libfabric_progress() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:132
10 0x000000000000e7c8 nvshmemt_libfabric_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/modules/transport/libfabric/libfabric.cpp:390
11 0x00000000000d653a nvshmem_quiet() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/comm/quiet.cpp:51
12 0x000000000004525d nvshmemi_barrier() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:19
13 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:39
14 0x00000000000456b3 nvshmem_barrier_all() /lustre/scratch5/treddy/march_april_2024_testing/nvshmem_custom_dl/nvshmem_src_2.10.1-3/src/host/coll/barrier/barrier.cpp:41
15 0x0000000000500a66 cufftMpDestroyReshape() ???:0
16 0x00000000004ff598 cufftMpDestroyReshape() ???:0
17 0x000000000015893a cufftMpAttachComm() ???:0
18 0x00000000004e058f cufftMpDestroyReshape() ???:0
19 0x00000000004e0a85 cufftMpDestroyReshape() ???:0
20 0x000000000014cb6e cufftMpAttachComm() ???:0
21 0x000000000011bf4f cufftXtSetCallbackSharedSize() ???:0
22 0x0000000000147511 cufftXtMakePlanGuru64() ???:0
23 0x0000000000148105 cufftXtMakePlanMany() ???:0
24 0x0000000000145d7d cufftMakePlanMany64() ???:0
25 0x00000000001461bf cufftMakePlanMany() ???:0
26 0x0000000000146386 cufftMakePlan3d() ???:0
27 0x00000000004066b2 run_r2c_c2r_slabs() ???:0
28 0x0000000000407d4b main() ???:0
29 0x000000000003529d __libc_start_main() ???:0
30 0x00000000004057ea _start() /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
=================================
Of course, things are not fully deterministic, and I saw the double free error happening in what appears to be other parts of the control flow as well:
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e34350, ret=0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 1: mh=0x7e34350
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 2: mh=0x7e34350, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3a2c0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3a2c0, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3a2c0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3a2c0, ret=0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e312a0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e312a0, ret=0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e312a0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e312a0, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3b450
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3b450, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3b450
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3b450, ret=0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e354e0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e354e0, ret=0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 1: mh=0x7e354e0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_map: 2: mh=0x7e354e0, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3c5e0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3c5e0, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3c5e0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3c5e0, ret=0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e32430
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e32430, ret=0
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e32430
===> [nid001196, 37927, 37927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e32430, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e3d770
double free or corruption (out)
[nid001468:06925] *** Process received signal ***
[nid001468:06925] Signal: Aborted (6)
[nid001468:06925] Signal code: (-6)
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e3d770, ret=0
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 1: mh=0x7e3d770
===> [nid001468, 6927, 6927] GDRCopy Checkpoint gdr_map: 2: mh=0x7e3d770, ret=0
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e36670
===> [nid001196, 37924, 37924] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e36670, ret=0
I'm guessing your team has already flushed the code through an address sanitizer at some point though? This is confusing! What can I do next to help get to the bottom of it?
from gdrcopy.
There are multiple things that went wrong here. Let's start with the raw output from my instrumented code without your patch.
- The output from the instrumented code is in [pid, tid] format. I think the caller uses multiple threads here. I didn't see any locking when I reviewed the libfabric code, so we are probably seeing some racing. The experimental branch you are using should not have this problem inside GDRCopy, though, because I added a global lock. So I will leave racing inside GDRCopy out of the discussion for now.
- Let's put some lines from the process that reported the segfault together. They should be in chronological order.
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 3: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_pin_buffer: 4: mh=0x7e1aa20, ret=0
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_map: 2: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=0
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x7e1aa20
===> [119877, 119943] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 2: mh=0x7e1aa20
===> [119877, 119877] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1aa20, ret=22
==== backtrace (tid: 119877) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x000000000000c59e cxil_unmap() /workspace/src/github.hpe.com/hpe/hpc-shs-libcxi/WORKSPACE/BUILD/libcxi-0.9/src/libcxi.c:945
2 0x00000000000a47cb cxip_unmap() :0
As shown, all of these calls were dealing with the same mh object, based on the address. The caller called gdr_unmap two times on the same object! What makes it worse is that the second gdr_unmap came after _gdr_unpin_buffer, so the mh object had already been destroyed before the last gdr_unmap was called. Note that mh is directly translated from the handle that the caller passes to the GDRCopy API. Basically, this is a use-after-free problem inside the caller.
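To make the ordering problem concrete, here is a minimal sketch of a caller-side wrapper that enforces pin, map, unmap, unpin in that order and rejects a second unmap. Everything in it (struct gdr_reg, the reg_* functions, the counters) is hypothetical and only models the lifecycle; the real GDRCopy calls are indicated by comments and are not made. It is not how libfabric or GDRCopy track state.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical lifecycle states for one registration. */
enum reg_state { REG_UNPINNED = 0, REG_PINNED, REG_MAPPED };

struct gdr_reg {
    enum reg_state state;
    int unmap_calls; /* times the real unmap would have been issued */
    int unpin_calls; /* times the real unpin would have been issued */
};

static int reg_pin(struct gdr_reg *r)
{
    if (r->state != REG_UNPINNED)
        return EINVAL;
    /* real code: gdr_pin_buffer() would be called here */
    r->state = REG_PINNED;
    return 0;
}

static int reg_map(struct gdr_reg *r)
{
    if (r->state != REG_PINNED)
        return EINVAL;
    /* real code: gdr_map() would be called here */
    r->state = REG_MAPPED;
    return 0;
}

static int reg_unmap(struct gdr_reg *r)
{
    /* Rejects both a second unmap and an unmap after unpin, without
     * ever touching a handle that may already have been freed. */
    if (r->state != REG_MAPPED)
        return EINVAL;
    /* real code: gdr_unmap() would be called here */
    r->unmap_calls++;
    r->state = REG_PINNED;
    return 0;
}

static int reg_unpin(struct gdr_reg *r)
{
    if (r->state != REG_PINNED)
        return EINVAL;
    /* real code: gdr_unpin_buffer() would be called here */
    r->unpin_calls++;
    r->state = REG_UNPINNED;
    return 0;
}
```

Replaying the sequence from the log (pin, map, unmap, unpin, then unmap again) through this wrapper makes the last call return EINVAL before any handle is dereferenced. The sketch is not thread safe on its own; with two threads, as in the log, the state would also need to be protected by a lock.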
- Now, let's look at your https://github.com/NVIDIA/gdrcopy/files/15312969/out_improved_prints.txt.
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 1: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 3: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 5: mh=0x7e1bf60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint gdr_unmap: 6: mh=0x7e1bf60, ret=0
...
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 1: mh=0x14c509661a60
===> [nid001196, 34247, 34247] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=0x14c509661a60
double free or corruption (out)
[nid001196:34247] *** Process received signal ***
[nid001196:34247] Signal: Aborted (6)
[nid001196:34247] Signal code: (-6)
The mh address that the caller passed to _gdr_unpin_buffer was 0x14c509661a60. That is in a completely different memory region from the other mh objects. In fact, I cannot find a single line printed from gdr_pin_buffer showing that this mh=0x14c509661a60 was created by GDRCopy. The caller probably passed in an unrelated object here.
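One way to check this kind of claim at run time, rather than by searching the log, is to record every handle at pin time and look it up at unpin time. The sketch below is a hypothetical debugging aid (track_add, track_remove and the fixed-size table are not part of GDRCopy or libfabric), and it is deliberately simple: a linear scan over a static array with no locking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024

/* Hypothetical debug table of handle addresses seen at pin time. */
static const void *tracked[MAX_TRACKED];

/* Record a handle; returns false if it is NULL or the table is full. */
static bool track_add(const void *mh)
{
    if (mh == NULL)
        return false;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == NULL) {
            tracked[i] = mh;
            return true;
        }
    }
    return false;
}

/* Forget a handle; returns false if it was never recorded (a foreign
 * pointer) or has already been removed (a second release). */
static bool track_remove(const void *mh)
{
    if (mh == NULL)
        return false;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == mh) {
            tracked[i] = NULL;
            return true;
        }
    }
    return false;
}
```

With track_add called where gdr_pin_buffer succeeds and track_remove called before the unpin, a pointer like 0x14c509661a60 that was never recorded would be reported instead of reaching free(). Under the multithreaded caller seen in this thread, the table would need a lock around both functions.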
from gdrcopy.
So, suspicion would be on the libfabric side, but not as far up the control flow as NVSHMEM?
from gdrcopy.
IIUC, NVSHMEM does not use GDRCopy directly in that environment. I don't know the libfabric programming model. Is it thread safe? Does it require special handling from the libfabric caller (NVSHMEM in this case)? My suggestion is to move up one step at a time. Items 2 and 3 are clearly a mistake from GDRCopy's caller. Even if we make GDRCopy thread safe, you will still run into this segfault issue.
from gdrcopy.
I think I've found evidence of a spin lock in libfabric not having the right guards before unmapping and unpinning, per the issue cross-listed above. In my sample patch there I had to paste in a bunch of internal gdrcopy stuff to check whether the unmapping had already happened, and I still got crashes, but that particular pathology did seem to disappear when I hacked that in...
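A minimal sketch of the kind of guard being described: take the lock, check whether the handle is still live, release it once, and clear it so a second caller sees NULL. This is an illustration under assumptions, not libfabric's code: struct dev_reg and its functions are hypothetical, a pthread mutex stands in for libfabric's spin lock, and the real gdr_unmap and gdr_unpin_buffer calls are only marked by comments.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-registration state kept by the caller. */
struct dev_reg {
    pthread_mutex_t lock;
    void *mh;      /* stand-in for the GDRCopy handle; NULL once released */
    int mapped;    /* nonzero while the buffer is mapped */
    int releases;  /* how many times the real teardown would have run */
};

static int dev_reg_init(struct dev_reg *reg, void *mh)
{
    reg->mh = mh;
    reg->mapped = (mh != NULL);
    reg->releases = 0;
    return pthread_mutex_init(&reg->lock, NULL);
}

/* Idempotent teardown: the first caller releases, later callers get
 * EALREADY and never pass a NULL or stale handle further down. */
static int dev_reg_release(struct dev_reg *reg)
{
    int ret = 0;

    pthread_mutex_lock(&reg->lock);
    if (reg->mh == NULL) {
        ret = EALREADY;
    } else {
        if (reg->mapped) {
            /* real code: gdr_unmap() would be called here */
            reg->mapped = 0;
        }
        /* real code: gdr_unpin_buffer() would be called here */
        reg->mh = NULL; /* cleared while still holding the lock */
        reg->releases++;
    }
    pthread_mutex_unlock(&reg->lock);
    return ret;
}
```

The important detail is that the check and the clearing of mh happen under the same lock as the release itself. As noted above, a guard like this only removes the double-release symptom; other crashes were still observed with it in place.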
from gdrcopy.
Looking at the log you posted in the libfabric issue 10041, you have
[112670, 112670] cuda_gdrcopy_dev_unregister() checkpoint 2 after spin lock and before unmap gdrcopy->mh=(nil)
===> [nid001252, 112670, 112670] GDRCopy Checkpoint gdr_unmap: 1: mh=(nil)
...
==== backtrace (tid: 112670) ====
0 0x00000000000168c0 __funlockfile() ???:0
1 0x00000000000023fb gdr_unmap() /lustre/scratch5/treddy/march_april_2024_testing/github_projects/gdrcopy/src/gdrapi.c:459
2 0x0000000000032e33 cuda_gdrcopy_dev_unregister() :0
3 0x00000000000a4bed cxip_unmap() :0
...
So, libfabric passes NULL to gdr_unmap. That is likely the source of your segfault.
from gdrcopy.
I think we agree on that, though I wasn't convinced that guarding against that was sufficient to fix all the problems, since I saw other backtraces even with that guard in place. I'm hoping to make another push at getting that working soon.
from gdrcopy.