Comments (9)

raffenet commented on June 16, 2024

It's querying for a key that is only present when using the Hydra process manager. I'm surprised it doesn't just return "not found". We do test with PMIx/PRRTE and haven't encountered this issue. You can manually apply this patch to MPICH as a workaround for the time being while we investigate a proper fix.

diff --git a/src/util/mpir_hwtopo.c b/src/util/mpir_hwtopo.c
index 33e88bc..ee3641c 100644
--- a/src/util/mpir_hwtopo.c
+++ b/src/util/mpir_hwtopo.c
@@ -200,18 +200,6 @@ int MPII_hwtopo_init(void)
 #ifdef HAVE_HWLOC
     bindset = hwloc_bitmap_alloc();
     hwloc_topology_init(&hwloc_topology);
-    char *xmlfile = MPIR_pmi_get_jobattr("PMI_hwloc_xmlfile");
-    if (xmlfile != NULL) {
-        int rc;
-        rc = hwloc_topology_set_xml(hwloc_topology, xmlfile);
-        if (rc == 0) {
-            /* To have hwloc still actually call OS-specific hooks, the
-             * HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM has to be set to assert that the loaded
-             * file is really the underlying system. */
-            hwloc_topology_set_flags(hwloc_topology, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
-        }
-        MPL_free(xmlfile);
-    }

     hwloc_topology_set_io_types_filter(hwloc_topology, HWLOC_TYPE_FILTER_KEEP_ALL);
     if (!hwloc_topology_load(hwloc_topology))
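
For context, the removed lines are just a fast path: Hydra exports the node topology to an XML file and advertises its path under the PMI_hwloc_xmlfile key, presumably so each rank can load the file instead of re-discovering the hardware itself. A minimal standalone sketch of that hwloc pattern, assuming hwloc 2.x and a hypothetical export topo.xml (e.g. written by lstopo topo.xml):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);

    /* Load the topology from the XML export instead of discovering it. */
    if (hwloc_topology_set_xml(topo, "topo.xml") == 0) {
        /* Assert the XML really describes this machine, so hwloc keeps
         * its OS-specific hooks (e.g. for binding) enabled. */
        hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
    }

    hwloc_topology_load(topo);
    printf("topology depth: %d\n", hwloc_topology_get_depth(topo));
    hwloc_topology_destroy(topo);
    return 0;
}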

rodarima commented on June 16, 2024

Thanks for the quick reply. I applied the patch, but now it gets stuck here:

(gdb) bt
#0  0x00007fb74d494ebe in __futex_abstimed_wait_common () from target:/nix/store/ksk3rnb0ljx8gngzk19jlmbjyvac4hw6-glibc-2.38-44/lib/libc.so.6
#1  0x00007fb74d497720 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/nix/store/ksk3rnb0ljx8gngzk19jlmbjyvac4hw6-glibc-2.38-44/lib/libc.so.6
#2  0x00007fb74d1dad4b in PMIx_Get () from target:/nix/store/xqpyk6kvwpr9hlxzdygfa4zfl8sr2nwg-pmix-all/lib/libpmix.so.2
#3  0x00007fb74dd88c01 in MPIR_pmi_get_jobattr () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#4  0x00007fb74dcae94c in MPII_init_local_proc_attrs () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#5  0x00007fb74dcac0d4 in MPII_Init_thread () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#6  0x00007fb74dcac9f5 in MPIR_Init_impl () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#7  0x00007fb74dae5768 in PMPI_Init () from target:/nix/store/6l8v79jg2yxhanzrjg8kn7z3k5q117y3-mpich-4.2.0/lib/libmpi.so.12
#8  0x0000000000402568 in main ()

I can rebuild it with debug symbols if you need more detail on what is going on.

raffenet commented on June 16, 2024

Another Hydra key ☚ī¸ that it's not finding. Here's another workaround:

diff --git a/src/mpi/init/local_proc_attrs.c b/src/mpi/init/local_proc_attrs.c
index 09cbdb0f..29d3e682 100644
--- a/src/mpi/init/local_proc_attrs.c
+++ b/src/mpi/init/local_proc_attrs.c
@@ -79,10 +79,6 @@ int MPII_init_local_proc_attrs(int *p_thread_required)
     /* Set the number of tag bits. The device may override this value. */
     MPIR_Process.tag_bits = MPIR_TAG_BITS_DEFAULT;

-    char *requested_kinds = MPIR_pmi_get_jobattr("PMI_mpi_memory_alloc_kinds");
-    MPIR_get_supported_memory_kinds(requested_kinds, &MPIR_Process.memory_alloc_kinds);
-    MPL_free(requested_kinds);
-
     return mpi_errno;
 }
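
For what it's worth, the lines this second patch removes feed the memory allocation kinds feature from MPI 4.1; dropping them presumably just leaves MPIR_Process.memory_alloc_kinds at its default. As a hedged sketch of what the feature exposes to applications, assuming an MPI 4.1 library where the standard mpi_memory_alloc_kinds info key is set on MPI_INFO_ENV (the example itself is ours, not from MPICH):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Query the memory allocation kinds supported by the library and
     * launcher (MPI 4.1). */
    char kinds[256];
    int buflen = sizeof(kinds);
    int flag = 0;

    MPI_Init(&argc, &argv);
    MPI_Info_get_string(MPI_INFO_ENV, "mpi_memory_alloc_kinds",
                        &buflen, kinds, &flag);
    printf("mpi_memory_alloc_kinds: %s\n", flag ? kinds : "(unset)");
    MPI_Finalize();
    return 0;
}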

rodarima commented on June 16, 2024

Now it works, thanks!

hut% srun -N2 osu_bw
# OSU MPI Bandwidth Test v7.1
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       1.22
2                       2.44
4                       4.91
8                       9.76
16                     19.10
32                     38.17
64                     76.83
128                   143.71
256                   289.09
512                   560.92
1024                  915.90
2048                 1464.13
4096                 2098.66
8192                 2976.49
16384                3456.71
32768                3765.65
65536                5475.29
131072               8047.09
262144               8667.83
524288               8872.39
1048576              8808.91
2097152              8593.88
4194304              8496.65

hut% ldd $(which osu_bw) | grep mpi
        libmpicxx.so.12 => /nix/store/ilk9pgjy30fsncc349gcbc7l1sgfpsna-mpich-4.2.0/lib/libmpicxx.so.12 (0x00007f142e34c000)
        libmpi.so.12 => /nix/store/ilk9pgjy30fsncc349gcbc7l1sgfpsna-mpich-4.2.0/lib/libmpi.so.12 (0x00007f142b400000)

raffenet commented on June 16, 2024

Great. I'm trying to find a new enough Slurm cluster where I can figure out what's going wrong, so this issue will remain open for now.

raffenet commented on June 16, 2024

At least in a single-node dev environment, I do not observe a hang with MPICH main or 4.2.0 with PMIx 5.0.1 and Slurm 23.11.4. Can you confirm on your end whether single-node vs. multi-node behaves differently?

rodarima commented on June 16, 2024

> At least in a single-node dev environment, I do not observe a hang with MPICH main or 4.2.0 with PMIx 5.0.1 and Slurm 23.11.4. Can you confirm on your end whether single-node vs. multi-node behaves differently?

I can, but the osu_bw benchmark requires two ranks. Maybe I can run it with two ranks on the same node.

Also, I needed to patch Slurm 23.11.4.1 as described here: https://bugs.schedmd.com/show_bug.cgi?id=19324#c3

raffenet commented on June 16, 2024

> > At least in a single-node dev environment, I do not observe a hang with MPICH main or 4.2.0 with PMIx 5.0.1 and Slurm 23.11.4. Can you confirm on your end whether single-node vs. multi-node behaves differently?

> I can, but the osu_bw benchmark requires two ranks. Maybe I can run it with two ranks on the same node.

That's what I meant: 2 ranks on a single node. I can set up a multi-node dev environment, but it will take me some more time to get it working.
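
For example, srun -N1 -n2 osu_bw should place both ranks on one node, assuming the same Slurm setup as in your earlier run.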

> Also, I needed to patch Slurm 23.11.4.1 as described here: https://bugs.schedmd.com/show_bug.cgi?id=19324#c3

I don't think the bug is in MPICH. We are only working around the bad behavior with the patches I've suggested. PMIx should return key not found in both cases, which MPICH understands and can handle. Whatever is causing the PMIx_Get call to hang still needs to be resolved on the PMIx/Slurm side.
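
If it helps with that report, the sketch below is a minimal standalone probe of the blocking lookup; it is not MPICH's actual code path. It assumes PMIx 4.x or newer under a PMIx-enabled launcher (e.g. srun --mpi=pmix), and "nonexistent-key" is a hypothetical stand-in for a Hydra-only key such as PMI_hwloc_xmlfile:

#include <pmix.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_info_t info;
    bool optional = true;
    pmix_status_t rc;

    PMIx_Init(&myproc, NULL, 0);

    /* Job-level attributes are queried against the wildcard rank. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* PMIX_OPTIONAL: answer from locally cached data only, instead of
     * waiting on the server when the key is unknown. Passing NULL, 0
     * in place of &info, 1 reproduces the plain lookup that hangs in
     * the backtrace above. */
    PMIX_INFO_LOAD(&info, PMIX_OPTIONAL, &optional, PMIX_BOOL);

    rc = PMIx_Get(&wildcard, "nonexistent-key", &info, 1, &val);
    printf("PMIx_Get: %s\n", PMIx_Error_string(rc)); /* expect NOT-FOUND */

    PMIX_INFO_DESTRUCT(&info);
    PMIx_Finalize(NULL, 0);
    return 0;
}

If the plain (NULL, 0) variant of this probe blocks under srun while the PMIX_OPTIONAL variant returns PMIX_ERR_NOT_FOUND, that would point at the server-side lookup in the Slurm PMIx plugin.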
