Comments (9)
It's querying a key that is only present when using the Hydra process manager. I'm surprised PMIx doesn't just return "not found". We do test with PMIx/PRRTE and haven't encountered this issue. You can manually apply this patch to MPICH as a workaround for the time being while we investigate a proper fix:
diff --git a/src/util/mpir_hwtopo.c b/src/util/mpir_hwtopo.c
index 33e88bc..ee3641c 100644
--- a/src/util/mpir_hwtopo.c
+++ b/src/util/mpir_hwtopo.c
@@ -200,18 +200,6 @@ int MPII_hwtopo_init(void)
#ifdef HAVE_HWLOC
bindset = hwloc_bitmap_alloc();
hwloc_topology_init(&hwloc_topology);
- char *xmlfile = MPIR_pmi_get_jobattr("PMI_hwloc_xmlfile");
- if (xmlfile != NULL) {
- int rc;
- rc = hwloc_topology_set_xml(hwloc_topology, xmlfile);
- if (rc == 0) {
- /* To have hwloc still actually call OS-specific hooks, the
- * HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM has to be set to assert that the loaded
- * file is really the underlying system. */
- hwloc_topology_set_flags(hwloc_topology, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
- }
- MPL_free(xmlfile);
- }
hwloc_topology_set_io_types_filter(hwloc_topology, HWLOC_TYPE_FILTER_KEEP_ALL);
if (!hwloc_topology_load(hwloc_topology))
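For reference, here is a minimal standalone sketch (not from MPICH's sources) of the behavior we expect from the server: it queries the same Hydra-only key directly through the PMIx client API, and a conforming server should answer promptly with PMIX_ERR_NOT_FOUND rather than block.

/* Minimal reproducer sketch; build against libpmix and launch under the
 * process manager being tested. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    if (PMIx_Init(&myproc, NULL, 0) != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    /* Job-level attributes are resolved against the wildcard rank. */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* This is the lookup that hangs under Slurm's PMIx plugin; under Hydra
     * it either succeeds or returns not-found immediately. */
    rc = PMIx_Get(&wildcard, "PMI_hwloc_xmlfile", NULL, 0, &val);
    printf("PMIx_Get(PMI_hwloc_xmlfile) = %s\n", PMIx_Error_string(rc));

    if (rc == PMIX_SUCCESS) {
        PMIX_VALUE_RELEASE(val);
    }
    PMIx_Finalize(NULL, 0);
    return 0;
}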
Thanks for the quick reply. I applied the patch, but now it is stuck here:
(gdb) bt
#0 0x00007fb74d494ebe in __futex_abstimed_wait_common () from target:/nix/store/ksk3rnb0ljx8gngzk19jlmbjyvac4hw6-glibc-2.38-44/lib/libc.so.6
#1 0x00007fb74d497720 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/nix/store/ksk3rnb0ljx8gngzk19jlmbjyvac4hw6-glibc-2.38-44/lib/libc.so.6
#2 0x00007fb74d1dad4b in PMIx_Get () from target:/nix/store/xqpyk6kvwpr9hlxzdygfa4zfl8sr2nwg-pmix-all/lib/libpmix.so.2
#3 0x00007fb74dd88c01 in MPIR_pmi_get_jobattr () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#4 0x00007fb74dcae94c in MPII_init_local_proc_attrs () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#5 0x00007fb74dcac0d4 in MPII_Init_thread () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#6 0x00007fb74dcac9f5 in MPIR_Init_impl () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#7 0x00007fb74dae5768 in PMPI_Init () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#8 0x0000000000402568 in main ()
I can build it with debug symbols if you need more detail on what is going on.
That's another Hydra-only key. Here is a second patch that removes that query as well:
diff --git a/src/mpi/init/local_proc_attrs.c b/src/mpi/init/local_proc_attrs.c
index 09cbdb0f..29d3e682 100644
--- a/src/mpi/init/local_proc_attrs.c
+++ b/src/mpi/init/local_proc_attrs.c
@@ -79,10 +79,6 @@ int MPII_init_local_proc_attrs(int *p_thread_required)
/* Set the number of tag bits. The device may override this value. */
MPIR_Process.tag_bits = MPIR_TAG_BITS_DEFAULT;
- char *requested_kinds = MPIR_pmi_get_jobattr("PMI_mpi_memory_alloc_kinds");
- MPIR_get_supported_memory_kinds(requested_kinds, &MPIR_Process.memory_alloc_kinds);
- MPL_free(requested_kinds);
-
return mpi_errno;
}
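As an aside, a more defensive lookup is also conceivable. The following is a hypothetical sketch (not current MPICH code, and the helper name get_jobattr_bounded is made up for illustration): it bounds the query with the standard PMIX_TIMEOUT attribute, so a server that never answers returns an error instead of hanging MPI_Init. Whether the server honors PMIX_TIMEOUT is itself implementation-dependent.

#include <pmix.h>

/* Hypothetical sketch: bound a job-attribute lookup with PMIX_TIMEOUT so an
 * unresponsive server cannot hang initialization indefinitely. */
static pmix_status_t get_jobattr_bounded(const pmix_proc_t *wildcard,
                                         const char *key, pmix_value_t **val)
{
    pmix_info_t info;
    int timeout = 5;    /* seconds; an arbitrary bound for illustration */
    pmix_status_t rc;

    PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout, PMIX_INT);
    rc = PMIx_Get(wildcard, key, &info, 1, val);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}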
Now it works, thanks!
hut% srun -N2 osu_bw
# OSU MPI Bandwidth Test v7.1
# Size Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1 1.22
2 2.44
4 4.91
8 9.76
16 19.10
32 38.17
64 76.83
128 143.71
256 289.09
512 560.92
1024 915.90
2048 1464.13
4096 2098.66
8192 2976.49
16384 3456.71
32768 3765.65
65536 5475.29
131072 8047.09
262144 8667.83
524288 8872.39
1048576 8808.91
2097152 8593.88
4194304 8496.65
hut% ldd $(which osu_bw) | grep mpi
libmpicxx.so.12 => /nix/store/ilk9pgjy30fsncc349gcbc7l1sgfpsna-mpich-4.2.0/lib/libmpicxx.so.12 (0x00007f142e34c000)
libmpi.so.12 => /nix/store/ilk9pgjy30fsncc349gcbc7l1sgfpsna-mpich-4.2.0/lib/libmpi.so.12 (0x00007f142b400000)
Great. I'm trying to find a new enough Slurm cluster where I can figure out what's going wrong, so this issue will remain open for now.
At least in a single node dev environment, I do not observe a hang with MPICH main or 4.2.0 with PMIx 5.0.1 and Slurm 23.11.4. Can you confirm on your end whether single node vs. multi node behaves differently?
> Can you confirm on your end whether single node vs. multi node behaves differently?

I can, but the osu_bw benchmark requires two ranks. Maybe I can run it with two ranks on the same node.

Also, I needed to patch SLURM 23.11.4.1 as commented here: https://bugs.schedmd.com/show_bug.cgi?id=19324#c3
> > Can you confirm on your end whether single node vs. multi node behaves differently?
>
> I can, but the osu_bw benchmark requires two ranks. Maybe I can run it with two ranks on the same node.

That's what I meant: 2 ranks on a single node. I can set up a multi-node dev environment, but it will take me some more time to get it working.

> Also, I needed to patch SLURM 23.11.4.1 as commented here: https://bugs.schedmd.com/show_bug.cgi?id=19324#c3

I don't think the bug is in MPICH; we are only working around the bad behavior with the patches I've suggested. PMIx should return "key not found" in both cases, which MPICH understands and can handle. Whatever is causing the PMIx_Get call to hang still needs to be resolved on the PMIx/Slurm side.
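To illustrate that contract, here is a rough sketch (paraphrasing rather than quoting MPICH's internals; the helper name get_jobattr_or_null is made up) of a lookup that treats an absent key as a normal, non-fatal answer. This pattern only works if the server actually returns PMIX_ERR_NOT_FOUND instead of blocking.

#include <string.h>
#include <pmix.h>

/* Sketch of tolerant job-attribute lookup: an absent or unusable key yields
 * NULL, and the caller falls back to its built-in default. */
static char *get_jobattr_or_null(const pmix_proc_t *wildcard, const char *key)
{
    pmix_value_t *val = NULL;
    pmix_status_t rc = PMIx_Get(wildcard, key, NULL, 0, &val);

    if (rc != PMIX_SUCCESS || val == NULL || val->type != PMIX_STRING) {
        return NULL;    /* includes PMIX_ERR_NOT_FOUND: not an error here */
    }
    char *copy = strdup(val->data.string);
    PMIX_VALUE_RELEASE(val);
    return copy;
}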