mpich's Introduction

			MPICH Release %VERSION%

MPICH is a high-performance and widely portable implementation of the
MPI-4.1 standard from Argonne National Laboratory.  This release
implements all MPI 4.1 functions and features required by the standard,
with the exception of support for user-defined data representations for I/O.

This README file should contain enough information to get you started
with MPICH. More extensive installation and user guides can be found
in the doc/installguide/install.pdf and doc/userguide/user.pdf files
respectively. Additional information regarding the contents of the
release can be found in the CHANGES file in the top-level directory,
and in the RELEASE_NOTES file, where certain restrictions are
detailed. Finally, the MPICH web site, http://www.mpich.org, contains
information on bug fixes and new releases.


1.  Getting Started
2.  Reporting Installation or Usage Problems
3.  Compiler Flags
4.  Alternate Channels and Devices
5.  Alternate Process Managers
6.  Alternate Configure Options
7.  Testing the MPICH installation
8.  Fault Tolerance
9.  Developer Builds
10. Multiple Fortran compiler support
11. ABI Compatibility
12. Capability Sets
13. Threads


-------------------------------------------------------------------------

1. Getting Started
==================

Note: this guide assumes you are building MPICH from one of the MPICH
release tarballs. If you are starting from a git checkout, you will need
a few additional steps. Please refer to the wiki page --
https://github.com/pmodels/mpich/blob/main/doc/wiki/Index.md.

The following instructions take you through a sequence of steps to get
the default configuration (ch3 device, nemesis channel (with TCP and
shared memory), Hydra process management) of MPICH up and running.

(a) You will need the following prerequisites.

    - REQUIRED: This tar file mpich-%VERSION%.tar.gz

    - REQUIRED: Perl

    - REQUIRED: A C compiler (C99 support is required. See
      https://github.com/pmodels/mpich/blob/main/doc/wiki/source_code/Shifting_Toward_C99.md)

    - OPTIONAL: A C++ compiler, if C++ applications are to be used
      (g++, etc.). If you do not require support for C++ applications,
      you can disable this support using the configure option
      --disable-cxx (configuring MPICH is described in step 1(d)
      below).

    - OPTIONAL: A Fortran compiler, if Fortran applications are to be
      used (gfortran, ifort, etc.). If you do not require support for
      Fortran applications, you can disable this support using
      --disable-fortran (configuring MPICH is described in step 1(d)
      below).

    - OPTIONAL: Python 3. Python 3 is needed to generate Fortran bindings.

    Also, you need to know which shell you are using, since different
    shells have different command syntax. The command "echo $SHELL"
    prints the shell currently used by your terminal program.

(b) Unpack the tar file and go to the top level directory:

      tar xzf mpich-%VERSION%.tar.gz
      cd mpich-%VERSION%

    If your tar doesn't accept the z option, use

      gunzip mpich-%VERSION%.tar.gz
      tar xf mpich-%VERSION%.tar
      cd mpich-%VERSION%

(c) Choose an installation directory, say
    /home/<USERNAME>/mpich-install, which is assumed to be non-existent
    or empty. It will be most convenient if this directory is shared
    by all of the machines where you intend to run processes. If not,
    you will have to duplicate it on the other machines after
    installation.

(d) Configure MPICH specifying the installation directory and device:

    for csh and tcsh:

      ./configure --prefix=/home/<USERNAME>/mpich-install |& tee c.txt

    for bash and sh:

      ./configure --prefix=/home/<USERNAME>/mpich-install 2>&1 | tee c.txt

    The configure script will try to determine the best device (the
    internal network modules) based on the system environment. You may
    also supply a device configuration explicitly, e.g.

      ./configure --prefix=... --with-device=ch4:ofi |...

    or:

      ./configure --prefix=... --with-device=ch4:ucx |...

    Refer to the section below -- Alternate Channels and Devices --
    for more details.

    Bourne-like shells (sh and bash) accept "2>&1 |"; csh-like shells
    (csh and tcsh) accept "|&". If a failure occurs, the configure
    command will display the error. Most errors are straightforward
    to follow. For example, if the configure command fails with:

       "No Fortran compiler found. If you don't need to build any
        Fortran programs, you can disable Fortran support using
        --disable-fortran. If you do want to build Fortran programs,
        you need to install a Fortran compiler such as gfortran or
        ifort before you can proceed."

    ... it means that you don't have a Fortran compiler :-). You will
    need to either install one, or disable Fortran support in MPICH.

    If you are unable to understand what went wrong, please go to step
    (2) below, for reporting the issue to the MPICH developers and
    other users.

(e) Build MPICH:

    for csh and tcsh:

      make |& tee m.txt

    for bash and sh:

      make 2>&1 | tee m.txt

    This step should succeed if there were no problems with the
    preceding step. Check file m.txt. If there were problems, do a
    "make clean" and then run make again with V=1.

      make V=1 |& tee m.txt       (for csh and tcsh)

      OR

      make V=1 2>&1 | tee m.txt   (for bash and sh)

    Then go to step (2) below, for reporting the issue to the MPICH
    developers and other users.

(f) Install the MPICH commands:

    for csh and tcsh:

      make install |& tee mi.txt

    for bash and sh:

      make install 2>&1 | tee mi.txt

    This step collects all required executables and scripts in the bin
    subdirectory of the directory specified by the prefix argument to
    configure.

(g) Add the bin subdirectory of the installation directory to your
    path in your startup script (.bashrc for bash, .cshrc for csh,
    etc.):

    for csh and tcsh:

      setenv PATH /home/<USERNAME>/mpich-install/bin:$PATH

    for bash and sh:
  
      PATH=/home/<USERNAME>/mpich-install/bin:$PATH ; export PATH

    Check that everything is in order at this point by doing:

      which mpicc
      which mpiexec

    These commands should display the path to your bin subdirectory of
    your install directory.

    IMPORTANT NOTE: The install directory has to be visible at exactly
    the same path on all machines you want to run your applications
    on. This is typically achieved by installing MPICH on a shared
    NFS file-system. If you do not have a shared NFS directory, you
    will need to manually copy the install directory to all machines
    at exactly the same location.
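
    For example, one way to copy the install tree to another machine
    (an illustrative sketch; the use of rsync and the host name shown
    are not mandated by MPICH) is:

      rsync -a /home/<USERNAME>/mpich-install/ \
          otherhost:/home/<USERNAME>/mpich-install/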

(h) MPICH uses a process manager for starting MPI applications. The
    process manager provides the "mpiexec" executable, together with
    other utility executables. MPICH comes packaged with multiple
    process managers; the default is called Hydra.

    Now we will run an MPI job, using the mpiexec command as specified
    in the MPI standard. There are some examples in the install
    directory, which you have already put in your path, as well as in
    the directory mpich-%VERSION%/examples. One of them is the classic
    CPI example, which computes the value of pi by numerical
    integration in parallel.

    To run the CPI example with 'n' processes on your local machine,
    you can use:

      mpiexec -n <number> ./examples/cpi

    Test that you can run an 'n' process CPI job on multiple nodes:

      mpiexec -f machinefile -n <number> ./examples/cpi

    The 'machinefile' is of the form:

      host1
      host2:2
      host3:4   # Random comments
      host4:1

    'host1', 'host2', 'host3' and 'host4' are the hostnames of the
    machines you want to run the job on. The ':2', ':4', ':1' suffixes
    specify the number of processes you want to run on each node. If
    nothing is specified, ':1' is assumed.
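
    For example, with the machinefile above, the following command runs
    CPI across the four hosts (the process count of 8 simply matches
    the per-host counts shown and is only illustrative):

      mpiexec -f machinefile -n 8 ./examples/cpi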

    More details on interacting with Hydra can be found at
    https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md

If you have completed all of the above steps, you have successfully
installed MPICH and run an MPI example.

-------------------------------------------------------------------------

2. Reporting Installation or Usage Problems
===========================================

[VERY IMPORTANT: PLEASE COMPRESS ALL FILES BEFORE SENDING THEM TO
US. DO NOT SPAM THE MAILING LIST WITH LARGE ATTACHMENTS.]

The distribution has been tested by us on a variety of machines in our
environments as well as our partner institutes. If you have problems
with the installation or usage of MPICH, please follow these steps:

1. First see the Frequently Asked Questions (FAQ) page at
https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md
to see if the problem you are facing has a simple solution. Many common
problems and their solutions are listed here.

2. If you cannot find an answer on the FAQ page, look through previous
email threads on the [email protected] mailing list archive
(https://lists.mpich.org/mailman/listinfo/discuss). It is likely
someone else had a similar problem, which has already been resolved
before.

3. If neither of the above steps work, please send an email to
[email protected]. You need to subscribe to this list
(https://lists.mpich.org/mailman/listinfo/discuss) before sending an
email.

Your email should contain the following files.  ONCE AGAIN, PLEASE
COMPRESS BEFORE SENDING, AS THE FILES CAN BE LARGE.  Note that,
depending on which step the build failed, some of the files might not
exist.

    mpich-%VERSION%/c.txt (generated in step 1(d) above)
    mpich-%VERSION%/m.txt (generated in step 1(e) above)
    mpich-%VERSION%/mi.txt (generated in step 1(f) above)
    mpich-%VERSION%/config.log (generated in step 1(d) above)
    mpich-%VERSION%/src/mpl/config.log (generated in step 1(d) above)
    mpich-%VERSION%/src/pm/hydra/config.log (generated in step 1(d) above)

    DID WE MENTION? DO NOT FORGET TO COMPRESS THESE FILES!
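
For example, from the top-level mpich-%VERSION% directory, one way to
bundle and compress these files before sending (the archive name is
arbitrary) is:

    tar czf mpich-logs.tar.gz c.txt m.txt mi.txt config.log \
        src/mpl/config.log src/pm/hydra/config.log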

If you have compiled MPICH and are having trouble running an
application, please provide the output of the following command in
your email.

    mpiexec -info

Finally, please include the actual error you are seeing when running
the application, including the mpiexec command used, and the host
file. If possible, please try to reproduce the error with a smaller
application or benchmark and send that along in your bug report.

4. If you have found a bug in MPICH, you can report it on our Github
page (https://github.com/pmodels/mpich/issues).


-------------------------------------------------------------------------

3. Compiler Flags
=================

MPICH allows several sets of compiler flags to be used. All of them
are set at configure time; they differ in whether they affect the
build of the MPICH library itself, the mpicc (and friends) wrapper
scripts used to compile applications, or both.

(a) CFLAGS, CPPFLAGS, CXXFLAGS, FFLAGS, FCFLAGS, LDFLAGS and LIBS
(abbreviated as xFLAGS): Setting these flags would result in the
MPICH library being compiled/linked with these flags and the flags
internally being used in mpicc and friends.

(b) MPICHLIB_CFLAGS, MPICHLIB_CPPFLAGS, MPICHLIB_CXXFLAGS,
MPICHLIB_FFLAGS, MPICHLIB_FCFLAGS, MPICHLIB_LDFLAGS and
MPICHLIB_LIBS (abbreviated as MPICHLIB_xFLAGS): Setting these flags
would result in the MPICH library being compiled/linked with these
flags. However, these flags will *not* be used by mpicc and friends.

(c) MPICH_MPICC_CFLAGS, MPICH_MPICC_CPPFLAGS, MPICH_MPICC_LDFLAGS,
MPICH_MPICC_LIBS, and so on for MPICXX, MPIF77 and MPIFORT
(abbreviated as MPICH_MPIX_FLAGS): These flags do *not* affect the
compilation of the MPICH library itself, but will be internally used
by mpicc and friends.


  +--------------------------------------------------------------------+
  |                    |                      |                        |
  |                    |    MPICH library     |    mpicc and friends   |
  |                    |                      |                        |
  +--------------------+----------------------+------------------------+
  |                    |                      |                        |
  |     xFLAGS         |         Yes          |           Yes          |
  |                    |                      |                        |
  +--------------------+----------------------+------------------------+
  |                    |                      |                        |
  |  MPICHLIB_xFLAGS   |         Yes          |           No           |
  |                    |                      |                        |
  +--------------------+----------------------+------------------------+
  |                    |                      |                        |
  | MPICH_MPIX_FLAGS   |         No           |           Yes          |
  |                    |                      |                        |
  +--------------------+----------------------+------------------------+


All of these flags can be set as part of the configure command or
through environment variables.
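
As an illustrative sketch (the flag values are arbitrary), the three
categories can be combined on a single configure line; per the table
above, CFLAGS affects both the library and the wrappers,
MPICHLIB_CFLAGS only the library, and MPICH_MPICC_CFLAGS only the
wrappers:

  ./configure --prefix=/home/<USERNAME>/mpich-install \
      CFLAGS=-g MPICHLIB_CFLAGS=-O2 MPICH_MPICC_CFLAGS=-Wall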


Default flags
--------------
By default, MPICH automatically adds certain compiler optimizations
to MPICHLIB_CFLAGS. The currently used optimization level is -O2.

** IMPORTANT NOTE: Remember that this only affects the compilation of
the MPICH library and is not used in the wrappers (mpicc and friends)
that are used to compile your applications or other libraries.

This optimization level can be changed with the --enable-fast option
passed to configure. For example, to build an MPICH environment with
-O3 for all language bindings, one can simply do:

  ./configure --enable-fast=O3

Or to disable all compiler optimizations, one can do:

  ./configure --disable-fast

For more details of --enable-fast, see the output of "configure
--help".

For performance testing, we recommend the following flags:

  ./configure --enable-fast=O3,ndebug --disable-error-checking --without-timing \
              --without-mpit-pvars


Examples
--------

Example 1:

  ./configure --disable-fast MPICHLIB_CFLAGS=-O3 MPICHLIB_FFLAGS=-O3 \
        MPICHLIB_CXXFLAGS=-O3 MPICHLIB_FCFLAGS=-O3

This will cause the MPICH libraries to be built with -O3, and -O3
will *not* be included in mpicc and the other MPI wrapper scripts.

Example 2:

  ./configure --disable-fast CFLAGS=-O3 FFLAGS=-O3 CXXFLAGS=-O3 FCFLAGS=-O3

This will cause the MPICH libraries to be built with -O3, and -O3
will be included in mpicc and the other MPI wrapper scripts.

-------------------------------------------------------------------------

4. Alternate Channels and Devices
=================================

The communication mechanisms in MPICH are called "devices". MPICH
supports ch3 and ch4 (default), as well as many
third-party devices that are released and maintained by other
institutes.

                   *************************************

ch3 device
**********
The ch3 device contains different internal communication options
called "channels". We currently support nemesis (default) and sock
channels.

nemesis channel
---------------
Nemesis provides communication using different networks (tcp, mx) as
well as various shared-memory optimizations. To configure MPICH with
nemesis, you can use the following configure option:

  --with-device=ch3:nemesis

Shared-memory optimizations are enabled by default to improve
performance for multi-processor/multi-core platforms. They can be
disabled (at the cost of performance) either by setting the
environment variable MPICH_NO_LOCAL to 1, or using the following
configure option:

  --enable-nemesis-dbg-nolocal

The --with-shared-memory= configure option allows you to choose how
Nemesis allocates shared memory.  The options are "auto", "sysv", and
"mmap".  Using "sysv" will allocate shared memory using the System V
shmget(), shmat(), etc. functions.  Using "mmap" will allocate shared
memory by creating a file (in /dev/shm if it exists, otherwise /tmp),
then mmap() the file.  The default is "auto". Note that System V
shared memory has limits on the size of shared memory segments so
using this for Nemesis may limit the number of processes that can be
started on a single node.

ofi network module
```````````````````
The ofi netmod provides support for the OFI network programming interface.
To enable, configure with the following option:

  --with-device=ch3:nemesis:ofi

If the OFI include files and libraries are not in the normal search paths,
you can specify them with the following options:

  --with-ofi-include= and --with-ofi-lib=

... or, if the lib/ and include/ directories share a common parent
directory, you can use the following option:

  --with-ofi=

If the OFI libraries are shared libraries, they need to be in the
shared library search path. This can be done by adding the path to
/etc/ld.so.conf, or by setting the LD_LIBRARY_PATH variable in your
environment. It's also possible to set the shared library search path
in the binary. If you're using gcc, you can do this by adding

  LD_LIBRARY_PATH=/path/to/lib

  (and)

  LDFLAGS="-Wl,-rpath -Wl,/path/to/lib"

... as arguments to configure.
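
For example (the libfabric install path shown is hypothetical):

  ./configure --with-device=ch3:nemesis:ofi --with-ofi=/opt/libfabric \
      LDFLAGS="-Wl,-rpath -Wl,/opt/libfabric/lib"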


sock channel
------------
sock is the traditional TCP sockets based communication channel. It
uses TCP/IP sockets for all communication, including intra-node
communication. As a result, although the performance of this channel
is worse than that of nemesis, it should work on almost every
platform. This channel can be configured using the following option:

  --with-device=ch3:sock


ch4 device
**********
The ch4 device contains different network and shared memory modules
for communication. We currently support the ofi and ucx network
modules, and the posix shared memory module.

ofi network module
```````````````````
The ofi netmod provides support for the OFI network programming interface.
To enable, configure with the following option:

  --with-device=ch4:ofi[:provider]

If the OFI include files and libraries are not in the normal search paths,
you can specify them with the following options:

  --with-libfabric-include= and --with-libfabric-lib=

... or, if the lib/ and include/ directories share a common parent
directory, you can use the following option:

  --with-libfabric=

If a provider is specified, the MPICH library will be optimized
specifically for that provider by removing the runtime branches that
would otherwise determine provider capabilities. Note that using this
feature with a version of the libfabric library older than the one
recommended for this version of MPICH is unsupported and may result in
unexpected behavior. This is also true when selecting the provider with
the environment variable FI_PROVIDER.
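
For example, to build against a libfabric installed in a non-default
location and commit to the sockets provider at configure time (the
install path shown is hypothetical):

  ./configure --with-device=ch4:ofi:sockets --with-libfabric=/opt/libfabric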

The currently expected version of libfabric is: %LIBFABRIC_VERSION%.

ucx network module
``````````````````
The ucx netmod provides support for the Unified Communication X
library. It can be built with the following configure option:

  --with-device=ch4:ucx

If the UCX include files and libraries are not in the normal search paths,
you can specify them with the following options:

  --with-ucx-include= and --with-ucx-lib=

... or, if the lib/ and include/ directories share a common parent
directory, you can use the following option:

  --with-ucx=
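
For example (the UCX install path shown is hypothetical):

  ./configure --with-device=ch4:ucx --with-ucx=/opt/ucx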

By default, the UCX library emits warnings when the system does not
enable certain features, which may hurt performance.  These warnings
point to real issues that can degrade performance on your system, but
you might need root privileges to fix some of them.  If you would like
to disable such warnings, you can set the UCX log level to "error"
instead of the default "warn" by using:

  UCX_LOG_LEVEL=error
  export UCX_LOG_LEVEL

GPU support
***********

GPU support is automatically enabled if a CUDA, ZE, or HIP runtime is
detected during configure. To specify where your GPU runtime is
installed, use:

  --with-cuda=<path>  or --with-ze=<path> or --with-hip=<path>

If the lib/ and include/ directories are not under the same path, they
can be specified separately, for example:

  --with-cuda-include= and --with-cuda-lib=

In addition, GPU support can be explicitly disabled by using:

  --without-cuda  or  --without-ze or --without-hip

If desired, GPU support can be disabled at runtime by setting the
environment variable MPIR_CVAR_ENABLE_GPU=0. This may help avoid GPU
initialization and detection overhead for non-GPU applications.
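
For example, a non-GPU run can skip GPU initialization entirely (the
example program and process count are only illustrative):

  MPIR_CVAR_ENABLE_GPU=0 mpiexec -n 4 ./examples/cpi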

-------------------------------------------------------------------------

5. Alternate Process Managers
=============================

hydra
-----
Hydra is the default process management framework that uses existing
daemons on nodes (e.g., ssh, pbs, slurm, sge) to start MPI
processes. More information on Hydra can be found at
https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md

gforker
-------
gforker is a process manager that creates processes on a single
machine by having mpiexec directly fork and exec them. gforker is
mostly meant as a research platform and for debugging purposes, since
it only works on single-node systems.

slurm
-----
Slurm is an external process manager not distributed with
MPICH. MPICH's default process manager, hydra, has native support
for Slurm and you can directly use it in Slurm environments (it will
automatically detect Slurm and use Slurm capabilities). However, if
you want to use the Slurm-provided "srun" process manager, you can use
the "--with-pmi=slurm --with-pm=no" option with configure. Note that
the "srun" process manager that comes with Slurm uses an older PMI
standard which does not have some of the performance enhancements that
hydra provides in Slurm environments.
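
For example, to build MPICH for use with Slurm's srun and then launch
a job with it (a sketch; the srun options shown are standard Slurm
usage rather than anything MPICH-specific):

  ./configure --prefix=/home/<USERNAME>/mpich-install --with-pmi=slurm --with-pm=no
  srun -n 4 ./examples/cpi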

-------------------------------------------------------------------------

6. Alternate Configure Options
==============================

MPICH has a number of other features. If you are exploring MPICH as
part of a development project, you might want to tweak the MPICH
build with additional configure options. A complete list of
configuration options can be found using:

   ./configure --help

-------------------------------------------------------------------------

7. Testing the MPICH installation
==================================

To test MPICH, we package the MPICH test suite in the MPICH
distribution. You can run the test suite after "make install" using:

     make testing

The results summary will be placed in test/summary.xml.

The test suite can be used independently to test any installed MPI
implementations:

     cd test/mpi
     ./configure --with-mpi=/path/to/mpi
     make testing

-------------------------------------------------------------------------

8. Fault Tolerance
==================

MPICH has some tolerance to process failures, and supports
checkpointing and restart. 

Tolerance to Process Failures
-----------------------------

The features described in this section should be considered
experimental: they have not been fully tested, and their behavior may
change in future releases. The notes below give some guidelines on
what can be expected from this feature:

 - ERROR RETURNS: Communication failures in MPICH are not fatal
   errors.  This means that if the user sets the error handler to
   MPI_ERRORS_RETURN, MPICH will return an appropriate error code in
   the event of a communication failure.  When a process detects a
   failure when communicating with another process, it will consider
   the other process as having failed and will no longer attempt to
   communicate with that process.  The user can, however, continue
   making communication calls to other processes.  Any outstanding
   send or receive operations to a failed process, or wildcard
   receives (i.e., with MPI_ANY_SOURCE) posted to communicators with a
   failed process, will be immediately completed with an appropriate
   error code.

 - COLLECTIVES: For collective operations performed on communicators
   with a failed process, the collective would return an error on
   some, but not necessarily all processes. A collective call
   returning MPI_SUCCESS on a given process means that the part of the
   collective performed by that process has been successful.

 - PROCESS MANAGER: If used with the hydra process manager, hydra will
   detect failed processes and notify the MPICH library.  Users can
   query the list of failed processes using MPIX_Comm_group_failed().
   This function returns a group consisting of the failed processes
   in the communicator.  The function MPIX_Comm_remote_group_failed()
   is provided for querying failed processes in the remote processes
   of an intercommunicator.

   Note that hydra by default will abort the entire application when
   any process terminates before calling MPI_Finalize.  In order to
   allow an application to continue running despite failed processes,
   you will need to pass the -disable-auto-cleanup option to mpiexec
   (see the example following this list).

 - FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
   ALMOST CERTAINLY CHANGE IN THE FUTURE!

   In the current release, hydra notifies the MPICH library of failed
   processes by sending a SIGUSR1 signal.  The application can catch
   this signal to be notified of failed processes.  If the application
   replaces the library's signal handler with its own, the application
   must be sure to call the library's handler from its own
   handler.  Note that you cannot call any MPI function from inside a
   signal handler.
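
As noted in the PROCESS MANAGER item above, to keep an application
running despite failed processes, pass -disable-auto-cleanup to
mpiexec; for example (the application name, host file, and process
count are only illustrative):

  mpiexec -disable-auto-cleanup -f machinefile -n 4 ./app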

Checkpoint and Restart
----------------------

MPICH supports checkpointing and restart fault-tolerance using BLCR.

CONFIGURATION

First, you need to have BLCR version 0.8.2 or later installed on your
machine.  If it's installed in the default system location, you don't
need to do anything.

If BLCR is not installed in the default system location, you'll need
to tell MPICH's configure where to find it. You might also need to
set the LD_LIBRARY_PATH environment variable so that BLCR's shared
libraries can be found.  In this case add the following options to
your configure command:

  --with-blcr=<BLCR_INSTALL_DIR> 
  LD_LIBRARY_PATH=<BLCR_INSTALL_DIR>/lib

where <BLCR_INSTALL_DIR> is the directory where BLCR has been
installed (whatever was specified in --prefix when BLCR was
configured).
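
For example, if BLCR is installed under /opt/blcr (a hypothetical
path), the configure invocation might look like:

  ./configure --prefix=/home/<USERNAME>/mpich-install \
      --with-blcr=/opt/blcr LD_LIBRARY_PATH=/opt/blcr/lib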

After it is configured, compile and install as usual (e.g., make; make install).

Note, checkpointing is only supported with the Hydra process manager.


VERIFYING CHECKPOINTING SUPPORT

Make sure MPICH is correctly configured with BLCR. You can do this
using:

  mpiexec -info

This should display 'BLCR' under 'Checkpointing libraries available'.


CHECKPOINTING THE APPLICATION

There are two ways to cause the application to checkpoint. You can ask
mpiexec to periodically checkpoint the application using the mpiexec
option -ckpoint-interval (seconds):

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
      -ckpoint-interval 3600 -f hosts -n 4 ./app

Alternatively, you can manually force a checkpoint by sending a
SIGUSR1 signal to mpiexec.

The checkpoint/restart parameters can also be controlled with the
environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and
HYDRA_CKPOINT_INTERVAL.
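
For example, the equivalent of the command line above can be set up
through the environment (the values shown mirror that example):

  export HYDRA_CKPOINTLIB=blcr
  export HYDRA_CKPOINT_PREFIX=/tmp/app.ckpoint
  export HYDRA_CKPOINT_INTERVAL=3600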

To restart a process:

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint -f hosts -n 4 -ckpoint-num <N>

where <N> is the checkpoint number you want to restart from.

These instructions can also be found on the MPICH wiki:

  https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Checkpointing.md

-------------------------------------------------------------------------

9. Developer Builds
===================
For MPICH developers who want to directly work on the primary version
control system, there are a few additional steps involved (people
using the release tarballs do not have to follow these steps). Details
about these steps can be found here:
https://github.com/pmodels/mpich/blob/main/doc/wiki/source_code/Github.md

-------------------------------------------------------------------------

10. Multiple Fortran compiler support
=====================================

If the C compiler that is used to build the MPICH libraries supports
both multiple weak symbols and multiple aliases of common symbols, the
Fortran binding can support multiple Fortran compilers. Support for
multiple weak symbols allows MPICH to provide the different name
mangling schemes (of subroutine names) required by different Fortran
compilers. Support for multiple aliases of common symbols enables
MPICH to equate the different common block symbols used for the MPI
Fortran constants, e.g. MPI_IN_PLACE and MPI_STATUS_IGNORE, so that
they are understood by different Fortran compilers.

Since support for multiple aliases of common symbols is
new/experimental, users can disable the feature with the configure
option --disable-multi-aliases if it causes any undesirable effects,
e.g. linker warnings about different sizes of the common symbols
MPIFCMB* (these warnings should be harmless).

We have only tested this support on a limited set of
platforms/compilers.  On Linux, if the C compiler that builds MPICH is
either gcc or icc, the above support will be enabled by configure.  At
the time of this writing, pgcc does not seem to support multiple
aliases of common symbols, so configure will detect the deficiency and
disable the feature automatically.  The tested Fortran compilers
include the GNU Fortran compiler (gfortran), the Intel Fortran
compiler (ifort), the Portland Group Fortran compiler (pgfortran), the
Absoft Fortran compilers (af90), and the IBM XL Fortran compiler
(xlf).  This means that if MPICH is built with gcc/gfortran, the
resulting MPICH library can be used to link a Fortran program
compiled/linked by another Fortran compiler, say pgf90, e.g. through
mpifort -fc=pgf90.  As long as the Fortran program is linked without
any errors by one of these compilers, the program should run fine.

-------------------------------------------------------------------------

11. ABI Compatibility
=====================

The MPICH ABI compatibility initiative was announced at SC 2014
(http://www.mpich.org/abi).  As a part of this initiative, Argonne,
Intel, IBM and Cray have committed to maintaining ABI compatibility
with each other.

As a first step in this initiative, starting with version 3.1, MPICH
is binary (ABI) compatible with Intel MPI 5.0.  This means you can
build your program with one MPI implementation and run with the other.
Specifically, binary-only applications that were built and distributed
with one of these MPI implementations can now be executed with the
other MPI implementation.

Some setup is required to achieve this.  Suppose you have MPICH
installed in /path/to/mpich and Intel MPI installed in /path/to/impi.

You can run your application with mpich using:

   % export LD_LIBRARY_PATH=/path/to/mpich/lib:$LD_LIBRARY_PATH
   % mpiexec -np 100 ./foo

or with Intel MPI using:

   % export LD_LIBRARY_PATH=/path/to/impi/lib:$LD_LIBRARY_PATH
   % mpiexec -np 100 ./foo

This works irrespective of which MPI implementation your application
was compiled with, as long as you use one of the MPI implementations
in the ABI compatibility initiative.

-------------------------------------------------------------------------

12. Capability Sets
=====================

The CH4 device contains a feature called "capability sets" to simplify
configuration of MPICH on systems using the OFI netmod. This feature
configures MPICH to use a predetermined set of OFI features based on the
provider being used. Capability sets can be configured at compile time or
runtime. Compile time configuration provides better performance by
reducing unnecessary code branches, but at the cost of flexibility.

To configure at compile time, the device string should be amended to include
the OFI provider with the following option:

    --with-device=ch4:ofi:sockets

This will set up the OFI netmod to use the optimal configuration for the
sockets provider, and will set various compile time constants. These settings
cannot be changed at runtime.

If runtime configuration is needed, use:

    --with-device=ch4:ofi

i.e. without the OFI provider extension, and set various environment variables to
achieve a similar configuration. To select the desired provider:

    % export FI_PROVIDER=sockets

This will select the OFI provider and the associated MPICH capability set. To
change the preset configuration, there exists an extended set of environment
variables. As an example, native provider RMA atomics can be disabled by using
the environment variable:

    % export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0

For some configuration options (in particular, MPIR_CVAR_CH4_OFI_ENABLE_TAGGED and
MPIR_CVAR_CH4_OFI_ENABLE_RMA), if disabled, some functionality may fall back to
generic implementations.

A full list of capability set configuration variables can be found in the
environment variables README.envvar.

-------------------------------------------------------------------------

13. Threads
===========

The supported thread level is configured with the option:

    --enable-threads={single,funneled,serialized,multiple}

The default depends on the configured device. With ch4, "multiple" is the
default. Setting the thread level to "single" provides the best performance
when the application does not use multiple threads. Use "multiple" to allow
the application to access MPI from multiple threads concurrently.

With "multiple" thread level, there are a few choices for the internal critical
section models. This is controlled by configure option:

    --enable-thread-cs={global,per-vci}

Current default is to use "global" cs. Applications that do heavy concurrent
MPI communications may experience slow down due to this global cs. The
"per-vci" cs internally will use multiple VCI (virtual communication
interface) critical sections, thus can provide much better performance. To
achieve the best performance, applications should try to expose as much
parallel information to MPI as possible. For example, if each threads use
separate communicators, MPICH may be able to assign separate VCI for each
thread, thus achieving the maximum performance.

Multiple-VCI support may increase resource allocation and overhead during
initialization. By default, only a single VCI is used. Set

    MPIR_CVAR_CH4_NUM_VCIS=<N>

to enable multiple VCIs at runtime. For best performance, match the number
of VCIs to the number of threads the application is using.
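
For example, a build intended for multithreaded applications might use
per-VCI critical sections and enable several VCIs at runtime (the
application name, VCI count, and process count below are only
illustrative):

    ./configure --enable-threads=multiple --enable-thread-cs=per-vci ...
    MPIR_CVAR_CH4_NUM_VCIS=4 mpiexec -n 2 ./my_threaded_app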

MPICH supports multiple threading packages.  The default is posix
threads (pthreads), but solaris threads, windows threads, argobots and
qthreads are also supported.

To configure mpich to work with argobots or qthreads, use the
following configure options:

    --with-thread-package=argobots \
        CFLAGS="-I<path_to_argobots/include>" \
        LDFLAGS="-L<path_to_argobots/lib>"

    --with-thread-package=qthreads \
        CFLAGS="-I<path_to_qthreads/include>" \
        LDFLAGS="-L<path_to_qthreads/lib>"

mpich's People

Contributors

abrooks98, alexeymalkhanov, danghvu, gcongiu, goodell, hajimefu, hzhou,
jainsura-intel, jayeshkrishna, jczhang07, jdinan, jeffhammond,
masamichitakagi, minsii, pavanbalaji, raffenet, rkalidas, roblatham00,
sagarth, shawnccx, shintaro-iwasaki, sonjahapp, sssharka, suhuang99,
tarudoodi, wesbland, wgropp, wkliao, yfguo, zhenggb72


mpich's Issues

PATCH: Fixes for MPI_Comm_dup and MPI_Comm_split (intercommunicator case)

Originally by "Lisandro Dalcin" [email protected] on 2008-08-01 14:39:19 -0500


Hi all,

Some intercommunicator collectives make use of 'is_low_group' field in
MPID_Comm structure. This field is not being correctly filled when
MPI_Comm_dup() and MPI_Comm_split() is called on an intercommunicator,
and then MPI_Barrier(), MPI_Allgather(), MPI_Allgatherv() (and
probably MPI_Reduce_scatter(), I've not tried) deadlock.

You will find attached a tentative patch (against SVN trunk) for fixing this issue.

I've tested them for MPI_Comm_dup() case, but not for the
MPI_Comm_split() case (but it seems that the low group flag just needs
to be inherited from the parent intercommunicator, but perhaps I'm
missing something, so please review this case with care).

BTW, Could you anticipate in what version (1.1.0 or perhaps 1.0.7p1)
could this issue get fixed?

Regards,

Lisandro Dalcín

Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

adio/common/system_hints.c

Originally by William Gropp [email protected] on 2008-08-04 14:12:41 -0500


I'm getting a failure in this file:

gcc -I/Users/gropp/tmp/mpich2-sock/src/mpid/ch3/include -I/Users/
gropp/projects/software/mpich2/src/mpid/ch3/include -I/Users/gropp/
tmp/mpich2-sock/src/mpid/common/datatype -I/Users/gropp/projects/
software/mpich2/src/mpid/common/datatype -I/Users/gropp/tmp/mpich2-
sock/src/mpid/common/locks -I/Users/gropp/projects/software/mpich2/
src/mpid/common/locks -I/Users/gropp/tmp/mpich2-sock/src/mpid/ch3/
channels/sock/include -I/Users/gropp/projects/software/mpich2/src/
mpid/ch3/channels/sock/include -I/Users/gropp/tmp/mpich2-sock/src/
mpid/common/sock -I/Users/gropp/projects/software/mpich2/src/mpid/
common/sock -I/Users/gropp/tmp/mpich2-sock/src/mpid/common/sock/poll -
I/Users/gropp/projects/software/mpich2/src/mpid/common/sock/poll -g -
Wall -O2 -Wstrict-prototypes -Wmissing-prototypes -Wundef -Wpointer-
arith -Wbad-function-cast -ansi -DGCC_WALL -D_POSIX_C_SOURCE=199506L -
std=c89 -DFORTRANUNDERSCORE -DHAVE_ROMIOCONF_H -I. -I/Users/gropp/
projects/software/mpich2/src/mpi/romio/adio/common/../include -I../
include -I../../include -I/Users/gropp/projects/software/mpich2/src/
mpi/romio/adio/common/../../../../../src/include -I../../../../../src/
include -c /Users/gropp/projects/software/mpich2/src/mpi/romio/adio/
common/system_hints.c
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/
system_hints.c:82:63: warning: character constant too long for its type
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/
system_hints.c: In function 'file_to_info':
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/
system_hints.c:82: error: parse error before ':' token
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/
system_hints.c:111:16: warning: character constant too long for its type
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/
system_hints.c:111: error: parse error before ':' token
make[5]: *** [system_hints.o] Error 1
Make failed in directory adio/common
make[4]: *** [mpiolib] Error 1
make[3]: *** [mpio] Error 2
make[2]: *** [all-redirect] Error 1
make[1]: *** [all-redirect] Error 2
make: *** [all-redirect] Error 2
groppmac:~/tmp/mpich2-sock gropp$

It looks like it is using calloc and free instead of the memory
routines (which will introduce a compile-time error when
--enable-dbg=mem is selected, which I always do). I'll fix this, but
this is a reminder to (a) use the memory routines and (b) configure
with --enable-dbg=mem .

Bill

William Gropp
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign

Using 32 bit as rank

Originally by wei huang [email protected] on 2008-08-05 18:47:22 -0500


Hi list,

We here are trying to run mvapich2, which is based on mpich2-1.0.7, on
more than 32k processes. However, we find that MPIDI_Message_match
structure uses only int16_t as the rank. This is not enough for job larger
than 32k. It looks like the follow change that uses int32_t for rank is
needed to scale. Would you consider integrate this change in future mpich2
releases? Thanks.

Index: src/mpid/ch3/include/mpidpre.h
===================================================================
--- src/mpid/ch3/include/mpidpre.h      (revision 2891)
+++ src/mpid/ch3/include/mpidpre.h      (revision 2892)
@@ -65,7 +65,7 @@
 typedef struct MPIDI_Message_match
 {
     int32_t tag;
-    int16_t rank;
+    int32_t rank;
     int16_t context_id;
 }
 MPIDI_Message_match;

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

Re: [MPICH2 Req #3768] Problem with MPICH2

Originally by Anthony Chan [email protected] on 2008-08-06 16:43:39 -0500


I would think that the default binary on a 64bit machine is 64bit, i.e.
you don't need to set any *FLAGS when building mpich2. Assuming that is
not the case and you do need to modify the binary format, you need to
set CFLAGS, CXXFLAGS, FFLAGS and F90FLAGS ("./configure --help" will
show all the relevant *FLAGS; be sure not to set CPPFLAGS before
configuring mpich2) to -m64.

A.Chan
----- "Vijay Mann" [email protected] wrote:

Hi,

I think we are hitting into the same problem with fortran mpich
libraries. We are using gfortran (which accepts -m64 flag for 64 bit
compilation).

We tried the following set of flags:
export FCFLAGS="-m64"
export FCFLAGS_f90="-m64"
export FFLAGS="-m64"

and they didn't seem to work.

Can you please help?

Thanks,

Vijay Mann
Technical Staff Member,
IBM India Research Laboratory, New Delhi, India.
Phone: 91-11- 41292168
http://www.research.ibm.com/people/v/vijamann/

Anthony Chan [email protected]

11/16/2007 04:52 AM
To Pradipta De/India/IBM@IBMIN

cc [email protected], Vijay Mann/India/IBM@IBMIN

Subject Re: [MPICH2 Req #3768] Problem with MPICH2

Did you set CFLAGS and CXXFLAGS to the same 64bit flag used by your
C/C++ compiler ?

On Thu, 15 Nov 2007, Pradipta De wrote:

Hi,

We downloaded and compiled MPICH2 on a PowerPC box running FC6.
We are trying to use mpicxx to compile our mpi code in 64-bit mode.
We get the following error: (our mpich2 install directory is
/hpcfs/downloaded_software/mpich2-install/)

/usr/bin/ld: skipping incompatible
/hpcfs/downloaded_software/mpich2-install//lib/libmpichcxx.a when
searching for -lmpichcxx
/usr/bin/ld: cannot find -lmpichcxx
collect2: ld returned 1 exit status

Is there some flag that needs to be specified during configuration
to
allow for 64-bit version ?

thanks, and regards,
-- pradipta

datatype bug

Originally by Darius Buntinas [email protected] on 2008-08-01 15:55:57 -0500


Forwarding from David Gingold.
-d

-------- Original Message --------

Darius --

I got the changes integrated into our library this week, and all
seemed well until I tried the failing Intel MPI test again. Now it
falls over with a similar assertion, using a different datatype.

The test case below (which is what I sent before, but this time with
different block lengths and displacements) reproduces the problem in
our library. Can you try this easily with yours?

-dg

....

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include "mpid_dataloop.h"

int MPID_Segment_init(const DLOOP_Buffer buf,
                      DLOOP_Count count,
                      DLOOP_Handle handle,
                      struct DLOOP_Segment *segp,
                      int hetero);

void MPID_Segment_pack(struct DLOOP_Segment *segp,
                       DLOOP_Offset first,
                       DLOOP_Offset *lastp,
                       void *pack_buffer);

int main(int argc, char *argv[])
{
    int ierr;
    MPID_Segment segment;
    MPI_Aint last;
    int dis[2], blklens[2];
    MPI_Datatype type;
    int send_buffer[60];
    int recv_buffer[60];

    ierr = MPI_Init(&argc, &argv);
    assert(ierr == MPI_SUCCESS);

    dis[0] = 0;
    dis[1] = 15;

    blklens[0] = 0;
    blklens[1] = 10;

    last = 192;

    ierr = MPI_Type_indexed(2, blklens, dis, MPI_INT, &type);
    assert(ierr == MPI_SUCCESS);

    ierr = MPI_Type_commit(&type);
    assert(ierr == MPI_SUCCESS);

    ierr = MPID_Segment_init(send_buffer, 1, type, &segment, 0);
    assert(ierr == MPI_SUCCESS);

    MPID_Segment_pack(&segment, 88, &last, recv_buffer);

    MPI_Finalize();
    return 0;
}

MPICH2 fpi.exe hanging on Windows XP

Originally by "Ayer, Timothy C." [email protected] on 2008-08-04 09:32:36 -0500


I am testing MPICH2 MPICH2-1.0.7 Windows XP (sp2). I have installed it on 2
hosts (hostA, hostB) and trying to run the fpi.exe built with fmpich2.lib.
The code is hanging in a MPI_Bcast call. The fpi.exe source is attached.

The following tests work fine from hostA, both prompt for a number of
intervals, accept input, and produce and estimate of PI

mpiexec.exe -hosts 2 hostA hostA \\hostA\temp\fpi.exe

mpiexec.exe -hosts 2 hostB hostB \\hostA\temp\fpi.exe

The following test hangs when submitted from hostA (in MPI_Bcast). It does
prompt for input (number of intervals) but once entered it hangs. I have
launched the smpd process using smpd -d but see no output from the smpd
after I enter an interval value

mpiexec.exe -hosts 2 hostA hostB \\hostA\temp\fpi.exe

Any suggestions would be appreciated. Also let me know if you want me to
send debug output.

Thanks,
Tim


Timothy C. Ayer
High Performance Technical Computing
United Technologies - Pratt & Whitney
[email protected]
(860) 565 - 5268 v
(860) 565 - 2668 f

<<fpi.f>>

Re: MPE logging API with threadsafety

Originally by "P. Klein" [email protected] on 2008-08-04 08:31:40 -0500


Hi Anthony,

thanks once again for the beta release. In February, I planned to send
you a response more or less immediately on how things work, but I have
been pretty busy since then. Therefore, I decided to wait until I could
tell you something more exciting than just "it runs".

Please find attached a paper which uses the thread-safe MPE and, for
your amusement, the .slog file mentioned in this paper. I am looking
forward to hearing your opinion about our work.

Kind regards
Peter


Anthony Chan wrote:

Hi Peter,

I have put together a threadsafe version of MPE logging API in
the latest RC tarball which can be downloaded at

ftp://ftp.mcs.anl.gov/pub/mpi/mpe/beta/mpe2-1.0.7rc1.tar.gz

A sample C program that uses the updated API at
/share/examples_logging/pthread_sendrecv_user.c
which can be compiled just like pthread_sendrecv as documented
in the Makefile. Documentation of the updated API's manpages can
be found in /man and /www.
Let me know if the updated API has any problem used in
your multithreaded program.

A.Chan

On Wed, 9 Jan 2008, Anthony Chan wrote:

Dr. rer. nat. Peter Klein
Fraunhofer ITWM
Abteilung: OPT
Fraunhoferplatz 1
D-67663 Kaiserslautern
phone: +49 (0)631 31600 4591
fax:   +49 (0)631 31600 1099
e-Mail: [email protected]

autoconf warnings

Originally by Dave Goodell [email protected] on 2008-07-31 11:26:44 -0500


Whenever I do a ./maint/updatefiles I get these warning messages over
and over. Presumably they're harmless, since everything still builds
and runs just fine, but they'd be nice to get rid of.

-Dave

configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_static_works, ...): suspicious cache-id, must
contain cv to be cached
/sandbox/chan/autoconf/autoconf-2.62/lib/autoconf/general.m4:1973:
AC_CACHE_VAL is expanded from...
/sandbox/chan/autoconf/autoconf-2.62/lib/autoconf/general.m4:1993:
AC_CACHE_CHECK is expanded from...
./libtool.m4:640: AC_LIBTOOL_LINKER_OPTION is expanded from...
./libtool.m4:2551: _LT_AC_LANG_C_CONFIG is expanded from...
./libtool.m4:2550: AC_LIBTOOL_LANG_C_CONFIG is expanded from...
./libtool.m4:80: AC_LIBTOOL_SETUP is expanded from...
./libtool.m4:60: _AC_PROG_LIBTOOL is expanded from...
./libtool.m4:25: AC_PROG_LIBTOOL is expanded from...
configure.in:202: the top level
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works, ...): suspicious cache-id, must contain
cv to be cached
./libtool.m4:595: AC_LIBTOOL_COMPILER_OPTION is expanded from...
./libtool.m4:4666: AC_LIBTOOL_PROG_COMPILER_PIC is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_CXX, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:2663: _LT_AC_LANG_CXX_CONFIG is expanded from...
./libtool.m4:2662: AC_LIBTOOL_LANG_CXX_CONFIG is expanded from...
./libtool.m4:1701: _LT_AC_TAGCONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_F77, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:3756: _LT_AC_LANG_F77_CONFIG is expanded from...
./libtool.m4:3755: AC_LIBTOOL_LANG_F77_CONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_GCJ, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:3862: _LT_AC_LANG_GCJ_CONFIG is expanded from...
./libtool.m4:3861: AC_LIBTOOL_LANG_GCJ_CONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_static_works, ...): suspicious cache-id, must
contain cv to be cached

MPICH2 Compile Error

Originally by "McDonald, Sean M CTR USAF AFMC AFRL/RXQO" [email protected] on 2008-07-31 17:30:59 -0500


Hello,

I am trying to compile MPICH2 on a Scientific Linux 5.0 box
(www.scientificlinux.org). The configuration seems to be fine, but I
get compile errors. I looked around and found that in Makefile.in on
Line 50 and 52 it has a hardcoded path to "${srcdir} &&
/sandbox/balaji/trunk/maint/mpich2-1.0.7/maint/simplemake". I would
think it should be something like "${srcdir} && /maint/simplemake/".
This is in version 1.0.7. I modified the Makefile, but this error seems
to be in other places.

I downloaded version 1.0.6p1 and it compiled and ran fine. I would like
to use the newest version for a system I am setting up, so could you
look into this problem and let me know a solution. Thanks

Sean McDonald
AFRL/RXQ Network Administrator
139 Barnes Dr., Suite 2
Tyndall AFB, FL. 32403
Duty Phone: (850) 283-6407 - DSN: 523-6407
Email: [email protected]

nemesis ext_procs optimization

Originally by goodell on 2008-08-01 08:42:34 -0500


In [de6e5ee] I committed a rough cut of dynamic processes for nemesis
newtcp. In mpid_nem_inline.h I commented out an optimization that
uses MPID_nem_mem_region.ext_procs because it prevents the proper
operation of dynamic processes. Unfortunately, removing it adds
~100ns to our zero-byte message latencies. So there is a FIXME in
the code that reads like this:

 /* FIXME the ext_procs bit is an optimization for the all-local-procs case.
    This has been commented out for now because it breaks dynamic processes.
    Some other solution should be implemented eventually, possibly using a
    flag that is set whenever a port is opened. [goodell@ 2008-06-18] */

In general, this won't affect real users who run any inter-node jobs,
since they were already polling every time anyway. However, it does
hurt those wonderful microbenchmarks. A hack fix is to leave this in
but also check to see if a port has been opened. A possibly better
fix is to only poll the network every X iterations of "poll
everything", where X is some tunable parameter.

This req is a reminder for this FIXME.

-Dave

mpdboot error

Originally by "Osentoski, Sarah" [email protected] on 2008-08-04 16:11:20 -0500


Hi,

I have a question. I tried to set up mpi on a set of 5 computers.
mpdboot works on each machine individually.

However if I run:

-bash-3.00$ mpdboot -n 5 -d --verbose --ncpus=3
debug: starting
running mpdallexit on erl01
LAUNCHED mpd on erl01 via
debug: launch cmd= /home/sosentoski/mpich2-install/bin/mpd.py
--ncpus=3 -e -d
debug: mpd on erl01 on port 35273
RUNNING: mpd on erl01
debug: info for running mpd: {'ncpus': 3, 'list_port': 35273,
'entry_port': *, 'host': 'erl01', 'entry_host': *, 'ifhn': ''}
LAUNCHED mpd on erl04 via erl01
debug: launch cmd= ssh -x -n -q erl04
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl07 via erl01
debug: launch cmd= ssh -x -n -q erl07
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl06 via erl01
debug: launch cmd= ssh -x -n -q erl06
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl05 via erl01
debug: launch cmd= ssh -x -n -q erl05
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
debug: mpd on erl07 on port no_port
mpdboot_erl01 (handle_mpd_output 406): from mpd on erl07, invalid port
info:
no_port

Do you have any helpful hints about what might be wrong with my set up?

Thanks

Sarah Osentoski

[MPICH2 Req #4214] cleanup MPIR_Get_contextid (and callers)

Originally by goodell on 2008-08-01 07:24:49 -0500


(this is a re-send of req#4214 so that trac learns about it)

The MPIR_Get_contextid function needs to be overhauled a bit. It
doesn't use the standard MPICH2 error handling approach yet it's a
non-trivial function. Specifically, I've run into issues lately
where the comm subsystem is hosed in such a way it makes the
NMPI_Allreduce call that MPIR_Get_contextid makes fail.
Unfortunately, MPIR_Get_contextid simply returns 0 if there was a
problem, so the stack trace is simply thrown away and all errors show
up like this:

Fatal error in MPI_Comm_accept: Other MPI error, error stack:
MPI_Comm_accept(117)..: MPI_Comm_accept(port="tag#1$description#intel-
loane[1]$port#46959$ifname#140.221.37.57$", MPI_INFO_NULL, root=0,
MPI_COMM_WORLD, newcomm=0x7ff0004dc) failed
MPID_Comm_accept(149).:
MPIDI_Comm_accept(915): Too many communicators

In reality, the original error was caused deep down in the nemesis
layer, but you can't see it here.

I'm filing this instead of just fixing it because there are two
versions of this function that need to be fixed and tested on all
platforms. Also, all the call sites need to be updated to check the
mpi_errno and handle it accordingly. This isn't critical for the
release, so it can probably wait a little while.

-Dave

ssm build

Originally by "Rajeev Thakur" [email protected] on 2008-07-30 11:33:25 -0500


All the ssm builds in last night's tests failed to compile. Might be a
simple fix.

Rajeev

Beginning make
Using variables CC='gcc' CFLAGS=' -O2' LDFLAGS='' AR='ar' FC='g77' F90='f95'
FFLAGS=' -O2' F90FLAGS=' -O2' CXX='g++'
In directory: /sandbox/buntinas/cb/mpich2/src/mpid/ch3/util/shm
CC
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:
In function 'MPIDI_CH3U_Finalize_sshm':
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:7
4: error: too few arguments to function 'MPIDI_PG_Get_next'
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:7
7: error: too few arguments to function 'MPIDI_PG_Get_next'
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:8
0: error: too few arguments to function 'MPIDI_PG_Get_next'

[MPICH2 Req #3942] ch3:nemesis:newtcp and gforker valgrind errors

Originally by goodell on 2008-08-01 08:42:26 -0500


I see some valgrind errors when I build with ch3:nemesis:newtcp and
gforker. I couldn't figure out the problem in a few minutes of
investigation, so I'm filing this bug report so that we don't lose
track of this. There is likely a simpler configuration and test case
that will elicit these warnings, I just haven't spent any time paring
things down and playing with configure args.

-Dave

Configuration line:
./configure --prefix=/home/goodell/testing/nemesis_gforker/test_1/mpich2-installed \
    --with-pm=gforker --with-device=ch3:nemesis:newtcp \
    --enable-g=dbg,log,meminit --disable-fast --enable-nemesis-dbg-nolocal

Test program:
bblogin% cat test.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[])
{
    int rank,np;
    int i;
    char buf[100];
    MPI_Status status;

    MPI_Init (&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&np);

    if (rank == 0) {
        for (i = 1; i < np; i++) {
            MPI_Send(buf, 0, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for (i = 1; i < np; i++) {
            MPI_Send(buf, 0, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
    }
    else {
        MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}

Valgrind output:
bblogin% valgrind ./a.out
==28198== Memcheck, a memory error detector.
==28198== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==28198== Using LibVEX rev 1658, a library for dynamic binary translation.
==28198== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==28198== Using valgrind-3.2.1-Debian, a dynamic binary instrumentation framework.
==28198== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==28198== For more details, rerun with: -v
==28198==
==28198== Invalid read of size 8
==28198==    at 0x40152A4: (within /lib/ld-2.5.so)
==28198==    by 0x400A7CD: (within /lib/ld-2.5.so)
==28198==    by 0x4006164: (within /lib/ld-2.5.so)
==28198==    by 0x40084AB: (within /lib/ld-2.5.so)
==28198==    by 0x40116EC: (within /lib/ld-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x401114A: (within /lib/ld-2.5.so)
==28198==    by 0x534BB7F: (within /lib/libc-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x534BCE6: __libc_dlopen_mode (in /lib/libc-2.5.so)
==28198==    by 0x5327516: __nss_lookup_function (in /lib/libc-2.5.so)
==28198==    by 0x53275C4: (within /lib/libc-2.5.so)
==28198==  Address 0x4032CE0 is 16 bytes inside a block of size 23 alloc'd
==28198==    at 0x4C20A69: malloc (vg_replace_malloc.c:149)
==28198==    by 0x4008999: (within /lib/ld-2.5.so)
==28198==    by 0x40116EC: (within /lib/ld-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x401114A: (within /lib/ld-2.5.so)
==28198==    by 0x534BB7F: (within /lib/libc-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x534BCE6: __libc_dlopen_mode (in /lib/libc-2.5.so)
==28198==    by 0x5327516: __nss_lookup_function (in /lib/libc-2.5.so)
==28198==    by 0x53275C4: (within /lib/libc-2.5.so)
==28198==    by 0x532DC0A: gethostbyname_r (in /lib/libc-2.5.so)
==28198==    by 0x532D402: gethostbyname (in /lib/libc-2.5.so)
==28198==
==28198== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 16 from 1)
==28198== malloc/free: in use at exit: 2,826 bytes in 15 blocks.
==28198== malloc/free: 106 allocs, 91 frees, 8,405,715 bytes allocated.
==28198== For counts of detected errors, rerun with: -v
==28198== searching for pointers to 15 not-freed blocks.
==28198== checked 264,992 bytes.
==28198==
==28198== LEAK SUMMARY:
==28198==    definitely lost: 0 bytes in 0 blocks.
==28198==      possibly lost: 0 bytes in 0 blocks.
==28198==    still reachable: 2,826 bytes in 15 blocks.
==28198==         suppressed: 0 bytes in 0 blocks.
==28198== Reachable blocks (those to which a pointer was found) are not shown.
==28198== To see them, rerun with: --show-reachable=yes

[MPICH2 Req #4176] Fwd: [mpich2-dev] Apparent bypass of correct macros in collective operation code

Originally by goodell on 2008-08-01 08:39:54 -0500


so that we don't forget...

Begin forwarded message:
> From: Dave Goodell <[email protected]>
> Date: July 11, 2008 8:46:04 AM CDT
> To: [email protected]
> Cc: Dave Goodell <[email protected]>
> Subject: Re: [mpich2-dev] Apparent bypass of correct macros in  
> collective operation code
>
> Good catch, Joe.  These ought to be changed to use the  
> MPIU_THREADPRIV_* macros.  I'll forward this to mpich2-maint@ to  
> make sure we don't forget to fix it.
>
> -Dave
>
> On Jul 10, 2008, at 8:47 PM, Joe Ratterman wrote:
>
>> I was recently doing some thread hacking, and I found that some of
>> my changes were causing a problem in the collective operation C
>> files: mpich2/src/mpi/coll/op*.c
>>
>> Specifically, this sort of code was a problem since I got rid of  
>> the op_errno field in the MPICH_PerThread object.
>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/coll/opbor.c
>>    165      default: {
>>    166          MPICH_PerThread_t *p;
>>    167          MPIR_GetPerThread(&p);
>>    168          p->op_errno = MPIR_Err_create_code( MPI_SUCCESS,  
>> MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OP,  
>> "**opundefined","**opundefined %s", "MPI_BOR" );
>>    169          break;
>>    170      }
>>
>> I think that there are macros to do this, as seen in the  
>> allreduce.c file (extra lines deleted):
>>    117      MPIU_THREADPRIV_DECL;
>>    126      MPIU_THREADPRIV_GET;
>>    158          MPIU_THREADPRIV_FIELD(op_errno) = 0;
>>    473          if (MPIU_THREADPRIV_FIELD(op_errno))
>>    474              mpi_errno = MPIU_THREADPRIV_FIELD(op_errno);
>>
>> With the default macros, that basically does the same thing, but I  
>> didn't have to change the .c file--only the header files.  The  
>> same thing happens in errutil.c
>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/errhan/ 
>> errutil.c
>>    156  /* These routines export the nest increment and decrement  
>> for use in ROMIO */
>>    157  void MPIR_Nest_incr_export( void )
>>    158  {
>>    159      MPICH_PerThread_t *p;
>>    160      MPIR_GetPerThread(&p);
>>    161      p->nest_count++;
>>    162  }
>>    163  void MPIR_Nest_decr_export( void )
>>    164  {
>>    165      MPICH_PerThread_t *p;
>>    166      MPIR_GetPerThread(&p);
>>    167      p->nest_count--;
>>    168  }
>>
>>
>>
>> I really think that these places should be using the existing  
>> macros to handle the work.
>>
>>
>> Comments?
>> Joe Ratterman
>> [email protected]
>
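
A hedged sketch of what the suggested change might look like for the
errutil.c exports, using only the MPIU_THREADPRIV_* macros quoted above
from allreduce.c; it assumes the MPICH2-internal headers that define
those macros, so it is illustrative rather than a drop-in patch:

/* Sketch only: MPIU_THREADPRIV_DECL / _GET / _FIELD are the macros cited
 * from allreduce.c above; their definitions live in the MPICH2 headers. */
void MPIR_Nest_incr_export( void )
{
    MPIU_THREADPRIV_DECL;
    MPIU_THREADPRIV_GET;
    MPIU_THREADPRIV_FIELD(nest_count)++;
}

void MPIR_Nest_decr_export( void )
{
    MPIU_THREADPRIV_DECL;
    MPIU_THREADPRIV_GET;
    MPIU_THREADPRIV_FIELD(nest_count)--;
}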

Build fails

Originally by "Rajeev Thakur" [email protected] on 2008-08-01 14:56:17 -0500


If I do a fresh maint/updatefiles, configure, and make, I get the following
error:

rm -f mpich2-mpdroot.o
copying python files/links into /sandbox/thakur/tmp/bin
rm -f mpich2-mpdroot
make[4]: Leaving directory `/sandbox/thakur/tmp/src/pm/mpd'
make[3]: Leaving directory `/sandbox/thakur/tmp/src/pm'
make[2]: Leaving directory `/sandbox/thakur/tmp/src/pm'
make[1]: Leaving directory `/sandbox/thakur/tmp/src'
make[1]: Entering directory `/sandbox/thakur/tmp/examples'
CC /homes/thakur/cvs/mpich2/examples/cpi.c
../bin/mpicc -o cpi cpi.o -lm
/sandbox/thakur/tmp/lib/libmpich.a(socksm.o)(.text+0x66): In function `dbg_print_sc_tbl':
: undefined reference to `CONN_STATE_TO_STRING'
collect2: ld returned 1 exit status
make[1]: *** [cpi] Error 1
make[1]: Leaving directory `/sandbox/thakur/tmp/examples'
make: *** [all-redirect] Error 2

test/mpi/io/resized failing on all platforms in old nightly tests

Originally by goodell on 2008-08-01 07:51:00 -0500


http://www.mcs.anl.gov/research/projects/mpich2/todo/runs/IA32-Linux-GNU-mpd-ch3:nemesis-2008-07-31-22-00-testsumm-mpich2-fail.xml

resized
1 processes
./io
fail

Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks,
combiner = 18
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
rank 0 in job 278 schwinn.mcs.anl.gov_37714 caused collective abort of all ranks
exit status of rank 0: return code 1
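
For reference, combiner 18 is MPI_COMBINER_RESIZED in MPICH2's mpi.h,
which matches the test name: ROMIO's ADIOI_Count_contiguous_blocks is
apparently being handed a file view built with MPI_Type_create_resized.
A minimal sketch of that usage class follows (not the actual test
source; the output file name is made up):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Datatype resized;
    MPI_File fh;
    int val = 42;

    MPI_Init(&argc, &argv);

    /* an int whose extent is padded to 2*sizeof(int) */
    MPI_Type_create_resized(MPI_INT, 0, 2 * (MPI_Aint)sizeof(int), &resized);
    MPI_Type_commit(&resized);

    MPI_File_open(MPI_COMM_WORLD, "resized_test.out",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    /* ROMIO must decode the resized filetype here when it counts
       contiguous blocks, which is where the reported error comes from */
    MPI_File_set_view(fh, 0, MPI_INT, resized, "native", MPI_INFO_NULL);
    MPI_File_write(fh, &val, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&resized);
    MPI_Finalize();
    return 0;
}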

Re: [MPICH2 Req #3768] Problem with MPICH2

Originally by Vijay Mann [email protected] on 2008-08-06 15:43:46 -0500


Hi,

I think we are hitting the same problem with the Fortran MPICH libraries.
We are using gfortran (which accepts the -m64 flag for 64-bit compilation).

We tried the following set of flags:
export FCFLAGS="-m64"
export FCFLAGS_f90="-m64"
export FFLAGS="-m64"

and they didn't seem to work.

Can you please help?

Thanks,

Vijay Mann
Technical Staff Member,
IBM India Research Laboratory, New Delhi, India.
Phone: 91-11- 41292168
http://www.research.ibm.com/people/v/vijamann/

Anthony Chan [email protected]
11/16/2007 04:52 AM

To
Pradipta De/India/IBM@IBMIN
cc
[email protected], Vijay Mann/India/IBM@IBMIN
Subject
Re: [MPICH2 Req #3768] Problem with MPICH2

Did you set CFLAGS and CXXFLAGS to the same 64bit flag used by your
C/C++ compiler ?

On Thu, 15 Nov 2007, Pradipta De wrote:

Hi,

We downloaded and compiled MPICH2 on a PowerPC box running FC6.
We are trying to use mpicxx to compile our mpi code in 64-bit mode.
We get the following error: (our mpich2 install directory is
/hpcfs/downloaded_software/mpich2-install/)

/usr/bin/ld: skipping incompatible
/hpcfs/downloaded_software/mpich2-install//lib/libmpichcxx.a when
searching for -lmpichcxx
/usr/bin/ld: cannot find -lmpichcxx
collect2: ld returned 1 exit status

Is there some flag that needs to be specified during configuration to
allow for 64-bit version ?

thanks, and regards,
-- pradipta
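
For reference, the usual approach is to pass the same width flag to every
compiler that configure will use, along the lines below. This is a hedged
sketch assuming gcc/g++/gfortran; whether a given MPICH2 release honours
each of these variables is exactly what this thread is probing, so check
the install guide for your version.

# hedged example: flag handling varies across MPICH2 releases
export CFLAGS="-m64"
export CXXFLAGS="-m64"
export FFLAGS="-m64"
export FCFLAGS="-m64"
./configure --prefix=/home/<USERNAME>/mpich2-install
make && make install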

Errors in maint/updatefiles

Originally by "Rajeev Thakur" [email protected] on 2008-08-04 14:23:09 -0500


Maint/updatefiles gives the following errors now.

Shortname **mpi_accumulate %p %d %D %d %d %d %D %O %W for specific messages
has no expansion (first seen in file :src/mpi/rma/accumulate.c)
Shortname **mpi_alloc_mem %d %I %p for specific messages has no expansion
(first seen in file :src/mpi/rma/alloc_mem.c)
Shortname **mpi_get %p %d %D %d %d %d %D %W for specific messages has no
expansion (first seen in file :src/mpi/rma/get.c)
Shortname **mpi_pack_external %s %p %d %D %p %d %p for specific messages has
no expansion (first seen in file :src/mpi/datatype/pack_external.c)
Shortname **mpi_put %p %d %D %d %d %d %D %W for specific messages has no
expansion (first seen in file :src/mpi/rma/put.c)
Shortname **mpi_type_create_hvector %d %d %d %D %p for specific messages has
no expansion (first seen in file :src/mpi/datatype/type_create_hvector.c)
Shortname **mpi_type_hvector %d %d %d %D %p for specific messages has no
expansion (first seen in file :src/mpi/datatype/type_hvector.c)
Shortname **mpi_unpack_external %s %p %d %p %p %d %D for specific messages
has no expansion (first seen in file :src/mpi/datatype/unpack_external.c)
Shortname **mpi_win_create %p %d %d %I %C %p for specific messages has no
expansion (first seen in file :src/mpi/rma/win_create.c)
Because of errors in extracting error messages, the file
src/mpi/errhan/defmsg.h was not updated.
There are unused error message texts in src/mpi/errhan/errnames.txt
See the file unusederr.txt for the complete list

.....
Creating src/pm/smpd/smpd_version.h
Updating README's version ID.

Problems encountered while running updatefiles.
These may cause problems when configuring or building MPICH2.
 Error message files in src/mpi/errhan were not updated.

RE: [MPICH2 Req #4127] smpd and singleton init

Originally by "Jayesh Krishna" [email protected] on 2008-07-31 16:49:05 -0500


Hi,
As you mentioned, the current code (smpd) in trunk requires PM support
for MPI_Comm_connect()/MPI_Comm_accept() (hence the requirement that
mpiexec should be in the PATH).
We should be able to remove this dependency but I need to run some more
tests before I can confirm that. I am on vacation till Monday so I will
run the tests on Monday and get back to you.
Have a nice weekend,

Regards,
Jayesh


From: Edric Ellis [mailto:[email protected]]
Sent: Monday, July 28, 2008 7:38 AM
To: Jayesh Krishna
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init

Hi Jayesh,

Actually, I now seem to be able to get things communicating once the
processes have mpiexec on their $PATH, and smpd is running. I'm not sure
quite what I fixed though.

In any case, it would be much better for us to be able to restore the
behaviour as per 1.0.3 where the smpd process wasn't needed for
connect/accept.

Cheers,

Edric.


From: Edric Ellis
Sent: Monday, July 28, 2008 11:42 AM
To: 'Jayesh Krishna'
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init

Hi Jayesh,

I've just been looking at the latest MPICH2 from SVN, and what I find is
that I can now get past the call to MPI_Init() without running smpd, but
as soon as I try to perform the MPI_Comm_connect / MPI_Comm_accept phase,
the "connect" process reports an error because it can't execv mpiexec. Is
that expected?

I've tried adding the path to mpiexec to $PATH, and that doesn't help -
even running "smpd -d" shows that the processes are talking to smpd, but
they then both get stuck in a poll().

Cheers,

Edric.


From: Jayesh Krishna [mailto:[email protected]]
Sent: Friday, June 06, 2008 3:14 PM
To: Edric Ellis
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init

Hi,

Just to add to my prev email, currently the singleton init client fails
when smpd is not running on the system (We will be fixing the code so that
the singleton init client fails only when the PM is not running and the
client requires the PM (external PM) services to proceed. We expect this
to be fixed in our next release.)

Regards,

Jayesh


From: Jayesh Krishna [mailto:[email protected]]
Sent: Friday, June 06, 2008 9:08 AM
To: 'Edric Ellis'
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init

Hi,
Yes, we have changed the way the singleton init client works with SMPD. Now,
when a singleton init client is run, it tries to connect to the process
manager (without this change the process has no support from the PM, and
calls like MPI_Comm_spawn() won't work).
Is there any reason why you don't want smpd to be running on these
machines ?

(PS: We didn't have time to fix the connect/accept problem in smpd for
1.0.7 release. We will be looking into fixing the bug in the next
release.)
Regards,
Jayesh
-----Original Message-----
From: Edric Ellis [mailto:[email protected]]
Sent: Friday, June 06, 2008 8:20 AM
To: [email protected]
Cc: [email protected]
Subject: [MPICH2 Req #4127] smpd and singleton init

Hi mpich2-maint,

I'm looking at upgrading MATLAB from using MPICH2-1.0.3 to MPICH2-1.0.7,
and I notice a change in the way in which singleton init works for the
smpd process manager (we use the smpd build on UNIX and Windows).

In 1.0.3, what we do is the following:

  1. Launch our application under some other services of our own
  2. Inside our application, call "MPI_Init( 0, 0 )"
  3. Use MPI_Comm_connect/accept to glue things together

When I substitute 1.0.7, MPI_Init fails because smpd isn't running.

Is there a way to get 1.0.7 to behave as per 1.0.3 - i.e. without any
reliance on the smpd process?

As a separate question: what is the status of the connect/accept stall
problems under smpd? I believe that this was fixed in 1.0.6 for mpd.

Cheers,

Edric.
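
For background, the connect/accept phase this thread keeps referring to
is the standard MPI dynamic-process handshake, roughly as in the sketch
below (MPI calls only; none of the smpd/PM plumbing the ticket is
actually about, and the argv handling is made up):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);      /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port);
    } else if (argc > 2) {               /* client: argv[2] is the port name */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        MPI_Finalize();
        return 0;
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

Under smpd, the open question in the exchange above is which of these
calls actually needs the process manager to be running.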
