pmodels / mpich
Official MPICH Repository
Home Page: http://www.mpich.org
License: Other
MPICH Release %VERSION%

MPICH is a high-performance and widely portable implementation of the
MPI-4.1 standard from Argonne National Laboratory. This release has all
MPI 4.1 functions and features required by the standard, with the
exception of support for user-defined data representations for I/O.

This README file should contain enough information to get you started
with MPICH. More extensive installation and user guides can be found in
the doc/installguide/install.pdf and doc/userguide/user.pdf files,
respectively. Additional information regarding the contents of the
release can be found in the CHANGES file in the top-level directory, and
in the RELEASE_NOTES file, where certain restrictions are detailed.
Finally, the MPICH web site, http://www.mpich.org, contains information
on bug fixes and new releases.

1.  Getting Started
2.  Reporting Installation or Usage Problems
3.  Compiler Flags
4.  Alternate Channels and Devices
5.  Alternate Process Managers
6.  Alternate Configure Options
7.  Testing the MPICH installation
8.  Fault Tolerance
9.  Developer Builds
10. Multiple Fortran compiler support
11. ABI Compatibility
12. Capability Sets
13. Threads

-------------------------------------------------------------------------

1. Getting Started
==================

Note: this guide assumes you are building MPICH from one of the MPICH
release tarballs. If you are starting from a git checkout, you will
need a few additional steps. Please refer to the wiki page:
https://github.com/pmodels/mpich/blob/main/doc/wiki/Index.md

The following instructions take you through a sequence of steps to get
the default configuration (ch3 device, nemesis channel (with TCP and
shared memory), Hydra process management) of MPICH up and running.

(a) You will need the following prerequisites.

    - REQUIRED: This tar file mpich-%VERSION%.tar.gz

    - REQUIRED: Perl

    - REQUIRED: A C compiler (C99 support is required.
      See https://github.com/pmodels/mpich/blob/main/doc/wiki/source_code/Shifting_Toward_C99.md)

    - OPTIONAL: A C++ compiler, if C++ applications are to be used
      (g++, etc.). If you do not require support for C++ applications,
      you can disable this support using the configure option
      --disable-cxx (configuring MPICH is described in step 1(d)
      below).

    - OPTIONAL: A Fortran compiler, if Fortran applications are to be
      used (gfortran, ifort, etc.). If you do not require support for
      Fortran applications, you can disable this support using
      --disable-fortran (configuring MPICH is described in step 1(d)
      below).

    - OPTIONAL: Python 3. Python 3 is needed to generate the Fortran
      bindings.

    Also, you need to know which shell you are using, since different
    shells use different command syntax. The command "echo $SHELL"
    prints out the current shell used by your terminal program.

(b) Unpack the tar file and go to the top level directory:

      tar xzf mpich-%VERSION%.tar.gz
      cd mpich-%VERSION%

    If your tar doesn't accept the z option, use:

      gunzip mpich-%VERSION%.tar.gz
      tar xf mpich-%VERSION%.tar
      cd mpich-%VERSION%

(c) Choose an installation directory, say
    /home/<USERNAME>/mpich-install, which is assumed to be non-existent
    or empty. It will be most convenient if this directory is shared by
    all of the machines where you intend to run processes. If not, you
    will have to duplicate it on the other machines after installation.

(d) Configure MPICH, specifying the installation directory and device:

    for csh and tcsh:

      ./configure --prefix=/home/<USERNAME>/mpich-install |& tee c.txt

    for bash and sh:

      ./configure --prefix=/home/<USERNAME>/mpich-install 2>&1 | tee c.txt

    configure will try to determine the best device (the internal
    network modules) based on the system environment. You may also
    supply a device configuration explicitly, e.g.

      ./configure --prefix=... --with-device=ch4:ofi |...

    or:

      ./configure --prefix=... --with-device=ch4:ucx |...

    Refer to the section below -- Alternate Channels and Devices --
    for more details.
    Bourne-like shells (sh and bash) accept "2>&1 |"; csh-like shells
    (csh and tcsh) accept "|&".

    If a failure occurs, the configure command will display the error.
    Most errors are straightforward to follow. For example, if the
    configure command fails with:

      "No Fortran compiler found. If you don't need to build any
       Fortran programs, you can disable Fortran support using
       --disable-fortran. If you do want to build Fortran programs,
       you need to install a Fortran compiler such as gfortran or
       ifort before you can proceed."

    ... it means that you don't have a Fortran compiler :-). You will
    need to either install one, or disable Fortran support in MPICH.

    If you are unable to understand what went wrong, please go to step
    (2) below, for reporting the issue to the MPICH developers and
    other users.

(e) Build MPICH:

    for csh and tcsh:

      make |& tee m.txt

    for bash and sh:

      make 2>&1 | tee m.txt

    This step should succeed if there were no problems with the
    preceding step. Check file m.txt. If there were problems, do a
    "make clean" and then run make again with V=1:

      make V=1 |& tee m.txt        (for csh and tcsh)

    OR

      make V=1 2>&1 | tee m.txt    (for bash and sh)

    Then go to step (2) below, for reporting the issue to the MPICH
    developers and other users.

(f) Install the MPICH commands:

    for csh and tcsh:

      make install |& tee mi.txt

    for bash and sh:

      make install 2>&1 | tee mi.txt

    This step collects all required executables and scripts in the bin
    subdirectory of the directory specified by the prefix argument to
    configure.

(g) Add the bin subdirectory of the installation directory to your
    path in your startup script (.bashrc for bash, .cshrc for csh,
    etc.):

    for csh and tcsh:

      setenv PATH /home/<USERNAME>/mpich-install/bin:$PATH

    for bash and sh:

      PATH=/home/<USERNAME>/mpich-install/bin:$PATH ; export PATH

    Check that everything is in order at this point by doing:

      which mpicc
      which mpiexec

    These commands should display the path to the bin subdirectory of
    your install directory.
    IMPORTANT NOTE: The install directory has to be visible at exactly
    the same path on all machines you want to run your applications
    on. This is typically achieved by installing MPICH on a shared NFS
    file system. If you do not have a shared NFS directory, you will
    need to manually copy the install directory to all machines at
    exactly the same location.

(h) MPICH uses a process manager for starting MPI applications. The
    process manager provides the "mpiexec" executable, together with
    other utility executables. MPICH comes packaged with multiple
    process managers; the default is called Hydra.

    Now we will run an MPI job, using the mpiexec command as specified
    in the MPI standard. There are some examples in the install
    directory, which you have already put in your path, as well as in
    the directory mpich-%VERSION%/examples. One of them is the classic
    CPI example, which computes the value of pi by numerical
    integration in parallel.

    To run the CPI example with 'n' processes on your local machine,
    you can use:

      mpiexec -n <number> ./examples/cpi

    Test that you can run an 'n' process CPI job on multiple nodes:

      mpiexec -f machinefile -n <number> ./examples/cpi

    The 'machinefile' is of the form:

      host1
      host2:2
      host3:4   # Random comments
      host4:1

    'host1', 'host2', 'host3' and 'host4' are the hostnames of the
    machines you want to run the job on. The ':2', ':4', ':1' segments
    depict the number of processes you want to run on each node. If
    nothing is specified, ':1' is assumed.

    More details on interacting with Hydra can be found at
    https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md

    If you have completed all of the above steps, you have
    successfully installed MPICH and run an MPI example.

-------------------------------------------------------------------------

2. Reporting Installation or Usage Problems
===========================================

[VERY IMPORTANT: PLEASE COMPRESS ALL FILES BEFORE SENDING THEM TO US.
DO NOT SPAM THE MAILING LIST WITH LARGE ATTACHMENTS.]

The distribution has been tested by us on a variety of machines in our
environments as well as our partner institutes. If you have problems
with the installation or usage of MPICH, please follow these steps:

1. First see the Frequently Asked Questions (FAQ) page at
   https://github.com/pmodels/mpich/blob/main/doc/wiki/faq/Frequently_Asked_Questions.md
   to see if the problem you are facing has a simple solution. Many
   common problems and their solutions are listed here.

2. If you cannot find an answer on the FAQ page, look through previous
   email threads on the [email protected] mailing list archive
   (https://lists.mpich.org/mailman/listinfo/discuss). It is likely
   someone else had a similar problem, which has already been resolved
   before.

3. If neither of the above steps work, please send an email to
   [email protected]. You need to subscribe to this list
   (https://lists.mpich.org/mailman/listinfo/discuss) before sending
   an email.

   Your email should contain the following files. ONCE AGAIN, PLEASE
   COMPRESS BEFORE SENDING, AS THE FILES CAN BE LARGE. Note that,
   depending on which step the build failed, some of the files might
   not exist.

     mpich-%VERSION%/c.txt (generated in step 1(d) above)
     mpich-%VERSION%/m.txt (generated in step 1(e) above)
     mpich-%VERSION%/mi.txt (generated in step 1(f) above)
     mpich-%VERSION%/config.log (generated in step 1(d) above)
     mpich-%VERSION%/src/mpl/config.log (generated in step 1(d) above)
     mpich-%VERSION%/src/pm/hydra/config.log (generated in step 1(d) above)

   DID WE MENTION? DO NOT FORGET TO COMPRESS THESE FILES!

   If you have compiled MPICH and are having trouble running an
   application, please provide the output of the following command in
   your email:

     mpiexec -info

   Finally, please include the actual error you are seeing when
   running the application, including the mpiexec command used, and
   the host file.
   If possible, please try to reproduce the error with a smaller
   application or benchmark and send that along in your bug report.

4. If you have found a bug in MPICH, you can report it on our Github
   page (https://github.com/pmodels/mpich/issues).

-------------------------------------------------------------------------

3. Compiler Flags
=================

MPICH allows several sets of compiler flags to be used. The first two
sets are configure-time options for MPICH, while the third is only
relevant when compiling applications with mpicc and friends.

(a) CFLAGS, CPPFLAGS, CXXFLAGS, FFLAGS, FCFLAGS, LDFLAGS and LIBS
    (abbreviated as xFLAGS): Setting these flags results in the MPICH
    library being compiled/linked with them, and in the flags being
    used internally by mpicc and friends.

(b) MPICHLIB_CFLAGS, MPICHLIB_CPPFLAGS, MPICHLIB_CXXFLAGS,
    MPICHLIB_FFLAGS, MPICHLIB_FCFLAGS, MPICHLIB_LDFLAGS and
    MPICHLIB_LIBS (abbreviated as MPICHLIB_xFLAGS): Setting these
    flags results in the MPICH library being compiled/linked with
    them. However, these flags will *not* be used by mpicc and
    friends.

(c) MPICH_MPICC_CFLAGS, MPICH_MPICC_CPPFLAGS, MPICH_MPICC_LDFLAGS,
    MPICH_MPICC_LIBS, and so on for MPICXX, MPIF77 and MPIFORT
    (abbreviated as MPICH_MPIX_FLAGS): These flags do *not* affect the
    compilation of the MPICH library itself, but will be used
    internally by mpicc and friends.
  +--------------------+----------------------+------------------------+
  |                    |    MPICH library     |   mpicc and friends    |
  +--------------------+----------------------+------------------------+
  |      xFLAGS        |         Yes          |          Yes           |
  +--------------------+----------------------+------------------------+
  |  MPICHLIB_xFLAGS   |         Yes          |          No            |
  +--------------------+----------------------+------------------------+
  |  MPICH_MPIX_FLAGS  |         No           |          Yes           |
  +--------------------+----------------------+------------------------+

All these flags can be set as part of the configure command or through
environment variables.

Default flags
-------------

By default, MPICH automatically adds certain compiler optimizations to
MPICHLIB_CFLAGS. The currently used optimization level is -O2.

** IMPORTANT NOTE: Remember that this only affects the compilation of
the MPICH library and is not used in the wrappers (mpicc and friends)
that are used to compile your applications or other libraries.

This optimization level can be changed with the --enable-fast option
passed to configure. For example, to build an MPICH environment with
-O3 for all language bindings, one can simply do:

  ./configure --enable-fast=O3

Or to disable all compiler optimizations, one can do:

  ./configure --disable-fast

For more details on --enable-fast, see the output of "configure
--help". For performance testing, we recommend the following flags:

  ./configure --enable-fast=O3,ndebug --disable-error-checking \
              --without-timing --without-mpit-pvars

Examples
--------

Example 1:

  ./configure --disable-fast MPICHLIB_CFLAGS=-O3 MPICHLIB_FFLAGS=-O3 \
              MPICHLIB_CXXFLAGS=-O3 MPICHLIB_FCFLAGS=-O3

This will cause the MPICH libraries to be built with -O3, and -O3 will
*not* be included in mpicc and the other MPI wrapper scripts.
Example 2:

  ./configure --disable-fast CFLAGS=-O3 FFLAGS=-O3 CXXFLAGS=-O3 \
              FCFLAGS=-O3

This will cause the MPICH libraries to be built with -O3, and -O3 will
be included in mpicc and the other MPI wrapper scripts.

-------------------------------------------------------------------------

4. Alternate Channels and Devices
=================================

The communication mechanisms in MPICH are called "devices". MPICH
supports ch3 and ch4 (default), as well as many third-party devices
that are released and maintained by other institutes.

ch3 device
**********

The ch3 device contains different internal communication options
called "channels". We currently support the nemesis (default) and sock
channels.

nemesis channel
---------------

Nemesis provides communication using different networks (tcp, mx) as
well as various shared-memory optimizations. To configure MPICH with
nemesis, you can use the following configure option:

  --with-device=ch3:nemesis

Shared-memory optimizations are enabled by default to improve
performance for multi-processor/multi-core platforms. They can be
disabled (at the cost of performance) either by setting the
environment variable MPICH_NO_LOCAL to 1, or by using the following
configure option:

  --enable-nemesis-dbg-nolocal

The --with-shared-memory= configure option allows you to choose how
Nemesis allocates shared memory. The options are "auto", "sysv", and
"mmap". Using "sysv" will allocate shared memory using the System V
shmget(), shmat(), etc. functions. Using "mmap" will allocate shared
memory by creating a file (in /dev/shm if it exists, otherwise /tmp),
then mmap() the file. The default is "auto". Note that System V shared
memory has limits on the size of shared memory segments, so using this
for Nemesis may limit the number of processes that can be started on a
single node.

ofi network module
``````````````````

The ofi netmod provides support for the OFI network programming
interface.
To enable, configure with the following option:

  --with-device=ch3:nemesis:ofi

If the OFI include files and libraries are not in the normal search
paths, you can specify them with the following options:

  --with-ofi-include=  and  --with-ofi-lib=

... or, if lib/ and include/ are in the same directory, you can use
the following option:

  --with-ofi=

If the OFI libraries are shared libraries, they need to be in the
shared library search path. This can be done by adding the path to
/etc/ld.so.conf, or by setting the LD_LIBRARY_PATH variable in your
environment. It's also possible to set the shared library search path
in the binary. If you're using gcc, you can do this by adding

  LD_LIBRARY_PATH=/path/to/lib

(and)

  LDFLAGS="-Wl,-rpath -Wl,/path/to/lib"

... as arguments to configure.

sock channel
------------

sock is the traditional TCP-sockets-based communication channel. It
uses TCP/IP sockets for all communication, including intra-node
communication. So, though the performance of this channel is worse
than that of nemesis, it should work on almost every platform. This
channel can be configured using the following option:

  --with-device=ch3:sock

ch4 device
**********

The ch4 device contains different network and shared memory modules
for communication. We currently support the ofi and ucx network
modules, and the posix shared memory module.

ofi network module
``````````````````

The ofi netmod provides support for the OFI network programming
interface. To enable, configure with the following option:

  --with-device=ch4:ofi[:provider]

If the OFI include files and libraries are not in the normal search
paths, you can specify them with the following options:

  --with-libfabric-include=  and  --with-libfabric-lib=

... or, if lib/ and include/ are in the same directory, you can use
the following option:

  --with-libfabric=

If specifying the provider, the MPICH library will be optimized
specifically for the requested provider by removing the runtime
branches used to determine provider capabilities.
Note that using this feature with a version of the libfabric library
older than the one recommended for this version of MPICH is
unsupported and may result in unexpected behavior. This is also true
when using the environment variable FI_PROVIDER. The currently
expected version of libfabric is: %LIBFABRIC_VERSION%.

ucx network module
``````````````````

The ucx netmod provides support for the Unified Communication X
library. It can be built with the following configure option:

  --with-device=ch4:ucx

If the UCX include files and libraries are not in the normal search
paths, you can specify them with the following options:

  --with-ucx-include=  and  --with-ucx-lib=

... or, if lib/ and include/ are in the same directory, you can use
the following option:

  --with-ucx=

By default, the UCX library emits warnings when the system does not
enable certain features that might hurt performance. These are
important warnings that might indicate performance degradation on your
system, but you might need root privileges to fix some of them. If you
would like to disable such warnings, you can set the UCX log level to
"error" instead of the default "warn" by using:

  UCX_LOG_LEVEL=error
  export UCX_LOG_LEVEL

GPU support
***********

GPU support is automatically enabled if a CUDA, ZE, or HIP runtime is
detected during configure. To specify where your GPU runtime is
installed, use:

  --with-cuda=<path>  or  --with-ze=<path>  or  --with-hip=<path>

If the lib/ and include/ are not in the same path, both can be
specified separately, for example:

  --with-cuda-include=  and  --with-cuda-lib=

In addition, GPU support can be explicitly disabled by using:

  --without-cuda  or  --without-ze  or  --without-hip

If desired, GPU support can also be disabled at runtime by setting the
environment variable MPIR_CVAR_ENABLE_GPU=0. This may help avoid the
GPU initialization and detection overhead for non-GPU applications.

-------------------------------------------------------------------------

5. Alternate Process Managers
=============================

hydra
-----

Hydra is the default process management framework. It uses existing
daemons on nodes (e.g., ssh, pbs, slurm, sge) to start MPI processes.
More information on Hydra can be found at
https://github.com/pmodels/mpich/blob/main/doc/wiki/how_to/Using_the_Hydra_Process_Manager.md

gforker
-------

gforker is a process manager that creates processes on a single
machine, by having mpiexec directly fork and exec them. gforker is
mostly meant as a research platform and for debugging purposes, as it
only supports single-node systems.

slurm
-----

Slurm is an external process manager not distributed with MPICH.
MPICH's default process manager, hydra, has native support for Slurm,
and you can use it directly in Slurm environments (it will
automatically detect Slurm and use Slurm capabilities). However, if
you want to use the Slurm-provided "srun" process manager, you can use
the "--with-pmi=slurm --with-pm=no" option with configure. Note that
the "srun" process manager that comes with Slurm uses an older PMI
standard which does not have some of the performance enhancements that
hydra provides in Slurm environments.

-------------------------------------------------------------------------

6. Alternate Configure Options
==============================

MPICH has a number of other features. If you are exploring MPICH as
part of a development project, you might want to tweak the MPICH build
with the following configure options. A complete list of configuration
options can be found using:

  ./configure --help

-------------------------------------------------------------------------

7. Testing the MPICH installation
=================================

To test MPICH, we package the MPICH test suite in the MPICH
distribution. You can run the test suite after "make install" using:

  make testing

The results summary will be placed in test/summary.xml.
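A quick way to look for failures is to count fail entries in the summary. This is only a sketch: the exact schema of summary.xml is an assumption here (a custom per-test XML with a STATUS element), and the sample file below is fabricated for illustration, not real suite output.

```shell
# Hypothetical sample of the summary format, created only so the
# command below has something to run against.
cat > summary-sample.xml <<'EOF'
<MPITESTRESULTS>
<MPITEST><NAME>sendrecv1</NAME><STATUS>pass</STATUS></MPITEST>
<MPITEST><NAME>bcasttest</NAME><STATUS>fail</STATUS></MPITEST>
</MPITESTRESULTS>
EOF

# Count failing tests (assumes one <STATUS>fail</STATUS> per failure).
grep -c '<STATUS>fail</STATUS>' summary-sample.xml    # -> 1
```

Against a real installation you would point grep at test/summary.xml instead of the sample file.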
The test suite can also be used independently to test any installed
MPI implementation:

  cd test/mpi
  ./configure --with-mpi=/path/to/mpi
  make testing

-------------------------------------------------------------------------

8. Fault Tolerance
==================

MPICH has some tolerance to process failures, and supports
checkpointing and restart.

Tolerance to Process Failures
-----------------------------

The features described in this section should be considered
experimental; they have not been fully tested, and the behavior may
change in future releases. The notes below are some guidelines on what
can be expected from this feature:

- ERROR RETURNS: Communication failures in MPICH are not fatal errors.
  This means that if the user sets the error handler to
  MPI_ERRORS_RETURN, MPICH will return an appropriate error code in
  the event of a communication failure. When a process detects a
  failure when communicating with another process, it will consider
  the other process as having failed and will no longer attempt to
  communicate with that process. The user can, however, continue
  making communication calls to other processes. Any outstanding send
  or receive operations to a failed process, or wildcard receives
  (i.e., with MPI_ANY_SOURCE) posted to communicators with a failed
  process, will be immediately completed with an appropriate error
  code.

- COLLECTIVES: For collective operations performed on communicators
  with a failed process, the collective may return an error on some,
  but not necessarily all, processes. A collective call returning
  MPI_SUCCESS on a given process means that the part of the collective
  performed by that process has been successful.

- PROCESS MANAGER: If used with the hydra process manager, hydra will
  detect failed processes and notify the MPICH library. Users can
  query the list of failed processes using MPIX_Comm_group_failed().
  This function returns a group consisting of the failed processes in
  the communicator.
  The function MPIX_Comm_remote_group_failed() is provided for
  querying the failed processes among the remote processes of an
  intercommunicator.

  Note that hydra by default will abort the entire application when
  any process terminates before calling MPI_Finalize. In order to
  allow an application to continue running despite failed processes,
  you will need to pass the -disable-auto-cleanup option to mpiexec.

- FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL ALMOST
  CERTAINLY CHANGE IN THE FUTURE!

  In the current release, hydra notifies the MPICH library of failed
  processes by sending a SIGUSR1 signal. The application can catch
  this signal to be notified of failed processes. If the application
  replaces the library's signal handler with its own, the application
  must be sure to call the library's handler from its own handler.
  Note that you cannot call any MPI function from inside a signal
  handler.

Checkpoint and Restart
----------------------

MPICH supports checkpoint/restart fault tolerance using BLCR.

CONFIGURATION

First, you need to have BLCR version 0.8.2 or later installed on your
machine. If it's installed in the default system location, you don't
need to do anything.

If BLCR is not installed in the default system location, you'll need
to tell MPICH's configure where to find it. You might also need to set
the LD_LIBRARY_PATH environment variable so that BLCR's shared
libraries can be found. In this case, add the following options to
your configure command:

  --with-blcr=<BLCR_INSTALL_DIR>
  LD_LIBRARY_PATH=<BLCR_INSTALL_DIR>/lib

where <BLCR_INSTALL_DIR> is the directory where BLCR has been
installed (whatever was specified in --prefix when BLCR was
configured).

After it's configured, compile as usual (e.g., make; make install).
Note that checkpointing is only supported with the Hydra process
manager.

VERIFYING CHECKPOINTING SUPPORT

Make sure MPICH is correctly configured with BLCR.
You can do this using:

  mpiexec -info

This should display 'BLCR' under 'Checkpointing libraries available'.

CHECKPOINTING THE APPLICATION

There are two ways to cause the application to checkpoint. You can ask
mpiexec to periodically checkpoint the application using the mpiexec
option -ckpoint-interval (seconds):

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
          -ckpoint-interval 3600 -f hosts -n 4 ./app

Alternatively, you can manually force checkpointing by sending a
SIGUSR1 signal to mpiexec.

The checkpoint/restart parameters can also be controlled with the
environment variables HYDRA_CKPOINTLIB, HYDRA_CKPOINT_PREFIX and
HYDRA_CKPOINT_INTERVAL.

To restart a process:

  mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/app.ckpoint \
          -f hosts -n 4 -ckpoint-num <N>

where <N> is the checkpoint number you want to restart from.

These instructions can also be found on the MPICH wiki:
https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Checkpointing.md

-------------------------------------------------------------------------

9. Developer Builds
===================

For MPICH developers who want to work directly on the primary version
control system, there are a few additional steps involved (people
using the release tarballs do not have to follow these steps). Details
about these steps can be found here:
https://github.com/pmodels/mpich/blob/main/doc/wiki/source_code/Github.md

-------------------------------------------------------------------------

10. Multiple Fortran compiler support
=====================================

If the C compiler that is used to build the MPICH libraries supports
both multiple weak symbols and multiple aliases of common symbols, the
Fortran binding can support multiple Fortran compilers. The multiple
weak symbols support allows MPICH to provide the different name
mangling schemes (of subroutine names) required by different Fortran
compilers.
The multiple aliases of common symbols support enables MPICH to equate
the different common block symbols of the MPI Fortran constants, e.g.
MPI_IN_PLACE and MPI_STATUS_IGNORE, so that they are understood by
different Fortran compilers.

Since the support for multiple aliases of common symbols is
new/experimental, users can disable the feature with the configure
option --disable-multi-aliases if it causes any undesirable effect,
e.g. linker warnings about different sizes of the common symbols
MPIFCMB* (the warning should be harmless). We have only tested this
support on a limited set of platforms/compilers. On Linux, if the C
compiler that builds MPICH is either gcc or icc, the above support
will be enabled by configure. At the time of this writing, pgcc does
not seem to support multiple aliases of common symbols, so configure
will detect the deficiency and disable the feature automatically.

The tested Fortran compilers include the GNU Fortran compiler
(gfortran), the Intel Fortran compiler (ifort), the Portland Group
Fortran compiler (pgfortran), the Absoft Fortran compiler (af90), and
the IBM XL Fortran compiler (xlf).

What this means is that if MPICH is built with gcc/gfortran, the
resulting MPICH library can be used to link a Fortran program
compiled/linked by another Fortran compiler, say pgf90, e.g. through
"mpifort -fc=pgf90". As long as the Fortran program is linked without
any errors by one of these compilers, the program should run fine.

-------------------------------------------------------------------------

11. ABI Compatibility
=====================

The MPICH ABI compatibility initiative was announced at SC 2014
(http://www.mpich.org/abi). As a part of this initiative, Argonne,
Intel, IBM and Cray have committed to maintaining ABI compatibility
with each other. As a first step in this initiative, starting with
version 3.1, MPICH is binary (ABI) compatible with Intel MPI 5.0. This
means you can build your program with one MPI implementation and run
it with the other.
Specifically, binary-only applications that were built and distributed
with one of these MPI implementations can now be executed with the
other MPI implementation.

Some setup is required to achieve this. Suppose you have MPICH
installed in /path/to/mpich and Intel MPI installed in /path/to/impi.

You can run your application with MPICH using:

  % export LD_LIBRARY_PATH=/path/to/mpich/lib:$LD_LIBRARY_PATH
  % mpiexec -np 100 ./foo

or with Intel MPI using:

  % export LD_LIBRARY_PATH=/path/to/impi/lib:$LD_LIBRARY_PATH
  % mpiexec -np 100 ./foo

This works irrespective of which MPI implementation your application
was compiled with, as long as you use one of the MPI implementations
in the ABI compatibility initiative.

-------------------------------------------------------------------------

12. Capability Sets
===================

The ch4 device contains a feature called "capability sets" to simplify
configuration of MPICH on systems using the OFI netmod. This feature
configures MPICH to use a predetermined set of OFI features based on
the provider being used. Capability sets can be configured at compile
time or at runtime. Compile-time configuration provides better
performance by reducing unnecessary code branches, but at the cost of
flexibility.

To configure at compile time, the device string should be amended to
include the OFI provider with the following option:

  --with-device=ch4:ofi:sockets

This will set up the OFI netmod to use the optimal configuration for
the sockets provider, and will set various compile-time constants.
These settings cannot be changed at runtime.

If runtime configuration is needed, use:

  --with-device=ch4:ofi

i.e. without the OFI provider extension, and set various environment
variables to achieve a similar configuration. To select the desired
provider:

  % export FI_PROVIDER=sockets

This will select the OFI provider and the associated MPICH capability
set. To change the preset configuration, there exists an extended set
of environment variables.
As an example, native provider RMA atomics can be disabled by using
the environment variable:

  % export MPIR_CVAR_CH4_OFI_ENABLE_ATOMICS=0

For some configuration options (in particular,
MPIR_CVAR_CH4_OFI_ENABLE_TAGGED and MPIR_CVAR_CH4_OFI_ENABLE_RMA),
disabling them may cause some functionality to fall back to generic
implementations. A full list of capability set configuration variables
can be found in the environment variables README, README.envvar.

-------------------------------------------------------------------------

13. Threads
===========

The supported thread levels are configured with the option:

  --enable-threads={single,funneled,serialized,multiple}

The default depends on the configured device. With ch4, "multiple" is
the default. Setting the thread level to "single" provides the best
performance when the application does not use multiple threads. Use
"multiple" to allow the application to access MPI from multiple
threads concurrently.

With the "multiple" thread level, there are a few choices for the
internal critical section model. This is controlled by the configure
option:

  --enable-thread-cs={global,per-vci}

The current default is to use a "global" critical section.
Applications that perform heavy concurrent MPI communication may
experience slowdowns due to this global critical section. The
"per-vci" model internally uses multiple VCI (virtual communication
interface) critical sections, and thus can provide much better
performance.

To achieve the best performance, applications should try to expose as
much parallelism information to MPI as possible. For example, if each
thread uses a separate communicator, MPICH may be able to assign a
separate VCI to each thread, thus achieving maximum performance.

The multiple-VCI support may increase resource allocation and overhead
during initialization. By default, only a single VCI is used. Set
MPIR_CVAR_CH4_NUM_VCIS=<N> to enable multiple VCIs at runtime. For
best performance, match the number of VCIs to the number of threads
the application is using.

MPICH supports multiple threading packages.
The default is POSIX threads (pthreads), but Solaris threads, Windows
threads, Argobots, and Qthreads are also supported. To configure MPICH to
work with Argobots or Qthreads, use the following configure options:

  --with-thread-package=argobots \
  CFLAGS="-I<path_to_argobots/include>" \
  LDFLAGS="-L<path_to_argobots/lib>"

  --with-thread-package=qthreads \
  CFLAGS="-I<path_to_qthreads/include>" \
  LDFLAGS="-L<path_to_qthreads/lib>"
Originally by "Lisandro Dalcin" [email protected] on 2008-08-01 14:39:19 -0500
Hi all,
Some intercommunicator collectives make use of the 'is_low_group' field in
the MPID_Comm structure. This field is not correctly filled in when
MPI_Comm_dup() or MPI_Comm_split() is called on an intercommunicator,
after which MPI_Barrier(), MPI_Allgather(), and MPI_Allgatherv() (and
probably MPI_Reduce_scatter(); I've not tried) deadlock.
I have attached a tentative patch (against SVN trunk) to fix this issue.
I've tested it for the MPI_Comm_dup() case, but not for the
MPI_Comm_split() case (it seems that the low-group flag just needs
to be inherited from the parent intercommunicator, but perhaps I'm
missing something, so please review this case with care).
BTW, could you anticipate in which version (1.1.0, or perhaps 1.0.7p1)
this issue might get fixed?
Regards,
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
Originally by William Gropp [email protected] on 2008-08-04 14:12:41 -0500
I'm getting a failure in this file:
gcc -I/Users/gropp/tmp/mpich2-sock/src/mpid/ch3/include \
    -I/Users/gropp/projects/software/mpich2/src/mpid/ch3/include \
    -I/Users/gropp/tmp/mpich2-sock/src/mpid/common/datatype \
    -I/Users/gropp/projects/software/mpich2/src/mpid/common/datatype \
    -I/Users/gropp/tmp/mpich2-sock/src/mpid/common/locks \
    -I/Users/gropp/projects/software/mpich2/src/mpid/common/locks \
    -I/Users/gropp/tmp/mpich2-sock/src/mpid/ch3/channels/sock/include \
    -I/Users/gropp/projects/software/mpich2/src/mpid/ch3/channels/sock/include \
    -I/Users/gropp/tmp/mpich2-sock/src/mpid/common/sock \
    -I/Users/gropp/projects/software/mpich2/src/mpid/common/sock \
    -I/Users/gropp/tmp/mpich2-sock/src/mpid/common/sock/poll \
    -I/Users/gropp/projects/software/mpich2/src/mpid/common/sock/poll \
    -g -Wall -O2 -Wstrict-prototypes -Wmissing-prototypes -Wundef \
    -Wpointer-arith -Wbad-function-cast -ansi -DGCC_WALL \
    -D_POSIX_C_SOURCE=199506L -std=c89 -DFORTRANUNDERSCORE \
    -DHAVE_ROMIOCONF_H -I. \
    -I/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/../include \
    -I../include -I../../include \
    -I/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/../../../../../src/include \
    -I../../../../../src/include \
    -c /Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c:82:63: warning: character constant too long for its type
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c: In function 'file_to_info':
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c:82: error: parse error before ':' token
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c:111:16: warning: character constant too long for its type
/Users/gropp/projects/software/mpich2/src/mpi/romio/adio/common/system_hints.c:111: error: parse error before ':' token
make[5]: *** [system_hints.o] Error 1
Make failed in directory adio/common
make[4]: *** [mpiolib] Error 1
make[3]: *** [mpio] Error 2
make[2]: *** [all-redirect] Error 1
make[1]: *** [all-redirect] Error 2
make: *** [all-redirect] Error 2
groppmac:~/tmp/mpich2-sock gropp$
It looks like it is using calloc and free instead of the memory
routines (which will introduce a compile-time error when --enable-dbg=mem
is selected, which I always do). I'll fix this, but this is
a reminder to (a) use the memory routines and (b) configure with
--enable-dbg=mem .
Bill
William Gropp
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign
Originally by wei huang [email protected] on 2008-08-05 18:47:22 -0500
Hi list,
We are trying to run mvapich2, which is based on mpich2-1.0.7, on
more than 32k processes. However, we find that the MPIDI_Message_match
structure uses only int16_t for the rank. This is not enough for jobs larger
than 32k. It looks like the following change, which uses int32_t for the
rank, is needed to scale. Would you consider integrating this change into
future mpich2 releases? Thanks.
Index: src/mpid/ch3/include/mpidpre.h
===================================================================
--- src/mpid/ch3/include/mpidpre.h (revision 2891)
+++ src/mpid/ch3/include/mpidpre.h (revision 2892)
@@ -65,7 +65,7 @@
 typedef struct MPIDI_Message_match
 {
     int32_t tag;
-    int16_t rank;
+    int32_t rank;
     int16_t context_id;
 }
 MPIDI_Message_match;
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
Originally by Anthony Chan [email protected] on 2008-08-06 16:43:39 -0500
I would think that the default binary on a 64-bit machine is 64-bit, i.e.
you don't need to set any *FLAGS when building mpich2. Assuming that is
not the case and you do need to modify the binary format, you need to
set CFLAGS, CXXFLAGS, FFLAGS and F90FLAGS ("./configure --help" will
show all the relevant *FLAGS; be sure not to set CPPFLAGS before configuring
mpich2) to -m64.
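A hedged sketch of what such a 64-bit build might look like (the install prefix is an invented example; the flags are the ones named in the advice above):

```shell
# Assumed example: force a 64-bit mpich2 build with GNU compilers.
# Note: do not set CPPFLAGS before configuring.
export CFLAGS="-m64"
export CXXFLAGS="-m64"
export FFLAGS="-m64"
export F90FLAGS="-m64"
./configure --prefix=/opt/mpich2-64
make && make install
```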
A.Chan
----- "Vijay Mann" [email protected] wrote:
Hi,
I think we are hitting into the same problem with fortran mpich
libraries. We are using gfortran (which accepts -m64 flag for 64 bit
compilation).We tried the following set of flags:
export FCFLAGS="-m64"
export FCFLAGS_f90="-m64"
export FFLAGS="-m64"
and they didn't seem to work.
Can you please help?
Thanks,
Vijay Mann
Technical Staff Member,
IBM India Research Laboratory, New Delhi, India.
Phone: 91-11- 41292168
http://www.research.ibm.com/people/v/vijamann/
Anthony Chan [email protected]
11/16/2007 04:52 AM
To Pradipta De/India/IBM@IBMIN
cc [email protected], Vijay Mann/India/IBM@IBMIN
Subject Re: [MPICH2 Req #3768] Problem with MPICH2
Did you set CFLAGS and CXXFLAGS to the same 64-bit flag used by your
C/C++ compiler?
On Thu, 15 Nov 2007, Pradipta De wrote:
Hi,
We downloaded and compiled MPICH2 on a PowerPC box running FC6.
We are trying to use mpicxx to compile our mpi code in 64-bit mode.
We get the following error (our mpich2 install directory is
/hpcfs/downloaded_software/mpich2-install/):
/usr/bin/ld: skipping incompatible /hpcfs/downloaded_software/mpich2-install//lib/libmpichcxx.a when searching for -lmpichcxx
/usr/bin/ld: cannot find -lmpichcxx
collect2: ld returned 1 exit status
Is there some flag that needs to be specified during configuration to
allow for a 64-bit version?
thanks, and regards,
-- pradipta
Originally by Brian Curtis [email protected] on 2008-08-05 12:34:59 -0500
We've come across an Intel test
(fortran/datatype/functional/MPI_Type_lb_2MPI_LB) failure that is
reproducible with MPICH2 1.0.7 by configuring with
--enable-fast=nochkmsg. Have you guys encountered this failure and have
a fix?
Brian
Originally by Darius Buntinas [email protected] on 2008-08-01 15:55:57 -0500
Forwarding from David Gingold.
-d
-------- Original Message --------
Darius --
I got the changes integrated into our library this week, and all
seemed well until I tried the failing Intel MPI test again. Now it
falls over with a similar assertion, using a different datatype.
The test case below (which is what I sent before, but this time with
different block lengths and displacements) reproduces the problem in
our library. Can you try this easily with yours?
-dg
....
int MPID_Segment_init(const DLOOP_Buffer buf,
                      DLOOP_Count count,
                      DLOOP_Handle handle,
                      struct DLOOP_Segment *segp,
                      int hetero);

void MPID_Segment_pack(struct DLOOP_Segment *segp,
                       DLOOP_Offset first,
                       DLOOP_Offset *lastp,
                       void *pack_buffer);

int main(int argc, char *argv[])
{
    int ierr;
    MPID_Segment segment;
    MPI_Aint last;
    int dis[2], blklens[2];
    MPI_Datatype type;
    int send_buffer[60];
    int recv_buffer[60];

    ierr = MPI_Init(&argc, &argv);
    assert(ierr == MPI_SUCCESS);

    dis[0] = 0;
    dis[1] = 15;
    blklens[0] = 0;
    blklens[1] = 10;
    last = 192;

    ierr = MPI_Type_indexed(2, blklens, dis, MPI_INT, &type);
    assert(ierr == MPI_SUCCESS);
    ierr = MPI_Type_commit(&type);
    assert(ierr == MPI_SUCCESS);

    ierr = MPID_Segment_init(send_buffer, 1, type, &segment, 0);
    assert(ierr == MPI_SUCCESS);
    MPID_Segment_pack(&segment, 88, &last, recv_buffer);

    MPI_Finalize();
    return 0;
}
Originally by "Ayer, Timothy C." [email protected] on 2008-08-04 09:32:36 -0500
I am testing MPICH2 1.0.7 on Windows XP (SP2). I have installed it on 2
hosts (hostA, hostB) and am trying to run the fpi.exe built with fmpich2.lib.
The code is hanging in an MPI_Bcast call. The fpi.exe source is attached.
The following tests work fine from hostA; both prompt for a number of
intervals, accept input, and produce an estimate of PI:
mpiexec.exe -hosts 2 hostA hostA \\hostA\temp\fpi.exe
mpiexec.exe -hosts 2 hostB hostB \\hostA\temp\fpi.exe
The following test hangs when submitted from hostA (in MPI_Bcast). It does
prompt for input (number of intervals) but once entered it hangs. I have
launched the smpd process using smpd -d but see no output from the smpd
after I enter an interval value
mpiexec.exe -hosts 2 hostA hostB \\hostA\temp\fpi.exe
Any suggestions would be appreciated. Also let me know if you want me to
send debug output.
Thanks,
Tim
Timothy C. Ayer
High Performance Technical Computing
United Technologies - Pratt & Whitney
[email protected]
(860) 565 - 5268 v
(860) 565 - 2668 f
<<fpi.f>>
Originally by "Rajeev Thakur" [email protected] on 2008-08-05 11:15:11 -0500
Need test for MPI_Type_create_indexed_block in ROMIO flatten code.
Rajeev
Originally by "P. Klein" [email protected] on 2008-08-04 08:31:40 -0500
Hi Anthony,
thanks once again for the beta release. In February, I planned to send you
a response more or less immediately on how things worked, but I have
been pretty busy in those days. Therefore, I decided to wait until I could
tell you something more exciting than just "it runs".
Please find attached a paper which uses the thread-safe MPE and, for your
amusement, the .slog file mentioned in this paper. I am looking forward
to hearing your opinion about our work.
Kind regards
Peter
Anthony Chan wrote:
Hi Peter,
I have put together a thread-safe version of the MPE logging API in
the latest RC tarball, which can be downloaded at
ftp://ftp.mcs.anl.gov/pub/mpi/mpe/beta/mpe2-1.0.7rc1.tar.gz
A sample C program that uses the updated API is at
/share/examples_logging/pthread_sendrecv_user.c
which can be compiled just like pthread_sendrecv as documented
in the Makefile. Documentation of the updated API's manpages can
be found in /man and /www.
Let me know if the updated API has any problems when used in
your multithreaded program.
A.Chan
On Wed, 9 Jan 2008, Anthony Chan wrote:
Dr. rer. nat. Peter Klein
| | ||||| Fraunhofer ITWM
|_|||||| Abteilung: OPT
| | ||||| Fraunhoferplatz 1
|**|**||||| D-67663 Kaiserslautern
| ___ |
|| | | | |/|| phone (+49-)|(0)631 31600 4591
|| | |/| | || fax (+49-)|(0)631 31600 1099
|______________| e-Mail: [email protected]
Originally by Dave Goodell [email protected] on 2008-07-31 11:26:44 -0500
Whenever I do a ./maint/updatefiles I get these warning messages over
and over. Presumably they're harmless, since everything still builds
and runs just fine, but they'd be nice to get rid of.
-Dave
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_static_works, ...): suspicious cache-id, must
contain cv to be cached
/sandbox/chan/autoconf/autoconf-2.62/lib/autoconf/general.m4:1973:
AC_CACHE_VAL is expanded from...
/sandbox/chan/autoconf/autoconf-2.62/lib/autoconf/general.m4:1993:
AC_CACHE_CHECK is expanded from...
./libtool.m4:640: AC_LIBTOOL_LINKER_OPTION is expanded from...
./libtool.m4:2551: _LT_AC_LANG_C_CONFIG is expanded from...
./libtool.m4:2550: AC_LIBTOOL_LANG_C_CONFIG is expanded from...
./libtool.m4:80: AC_LIBTOOL_SETUP is expanded from...
./libtool.m4:60: _AC_PROG_LIBTOOL is expanded from...
./libtool.m4:25: AC_PROG_LIBTOOL is expanded from...
configure.in:202: the top level
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works, ...): suspicious cache-id, must contain
cv to be cached
./libtool.m4:595: AC_LIBTOOL_COMPILER_OPTION is expanded from...
./libtool.m4:4666: AC_LIBTOOL_PROG_COMPILER_PIC is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_CXX, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:2663: _LT_AC_LANG_CXX_CONFIG is expanded from...
./libtool.m4:2662: AC_LIBTOOL_LANG_CXX_CONFIG is expanded from...
./libtool.m4:1701: _LT_AC_TAGCONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_F77, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:3756: _LT_AC_LANG_F77_CONFIG is expanded from...
./libtool.m4:3755: AC_LIBTOOL_LANG_F77_CONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_pic_works_GCJ, ...): suspicious cache-id, must
contain cv to be cached
./libtool.m4:3862: _LT_AC_LANG_GCJ_CONFIG is expanded from...
./libtool.m4:3861: AC_LIBTOOL_LANG_GCJ_CONFIG is expanded from...
configure.in:202: warning: AC_CACHE_VAL
(lt_prog_compiler_static_works, ...): suspicious cache-id, must
contain cv to be cached
Originally by "McDonald, Sean M CTR USAF AFMC AFRL/RXQO" [email protected] on 2008-07-31 17:30:59 -0500
Hello,
I am trying to compile MPICH2 on a Scientific Linux 5.0 box
(www.scientificlinux.org). The configuration seems to be fine, but I
get compile errors. I looked around and found that in Makefile.in on
Line 50 and 52 it has a hardcoded path to "${srcdir} &&
/sandbox/balaji/trunk/maint/mpich2-1.0.7/maint/simplemake". I would
think it should be something like "${srcdir} && /maint/simplemake/".
This is in version 1.0.7. I modified the Makefile, but this error seems
to be in other places.
I downloaded version 1.0.6p1 and it compiled and ran fine. I would like
to use the newest version for a system I am setting up, so could you
look into this problem and let me know a solution. Thanks
Sean McDonald
AFRL/RXQ Network Administrator
139 Barnes Dr., Suite 2
Tyndall AFB, FL. 32403
Duty Phone: (850) 283-6407 - DSN: 523-6407
Email: [email protected]
Originally by "Rajeev Thakur" [email protected] on 2008-07-31 09:00:06 -0500
There were some "assertion failed" errors in last night's Nemesis runs.
http://www.mcs.anl.gov/research/projects/mpich2/todo/runs/IA32-Linux-GNU-mpd-ch3:nemesis-2008-07-30-22-00-testsumm-mpich2-fail.xml
Rajeev
Originally by goodell on 2008-08-01 08:42:34 -0500
In [de6e5ee] I committed a rough cut of dynamic processes for nemesis
newtcp. In mpid_nem_inline.h I commented out an optimization that
uses MPID_nem_mem_region.ext_procs because it prevents the proper
operation of dynamic processes. Unfortunately, removing it adds
~100ns to our zero-byte message latencies. So there is a FIXME in
the code that reads like this:
/* FIXME the ext_procs bit is an optimization for the all-local-procs case.
This has been commented out for now because it breaks dynamic processes.
Some other solution should be implemented eventually, possibly using a
flag that is set whenever a port is opened. [goodell@ 2008-06-18] */
In general, this won't affect real users who run any inter-node jobs,
since they were already polling every time anyway. However, it does
hurt those wonderful microbenchmarks. A hack fix is to leave this in
but also check to see if a port has been opened. A possibly better
fix is to only poll the network every X iterations of "poll
everything", where X is some tunable parameter.
This req is a reminder for this FIXME.
-Dave
Originally by Anthony Chan [email protected] on 2008-07-31 16:35:45 -0500
Can you tell me where config.status is?
----- "Rajeev Thakur" [email protected] wrote:
Can you take a look at this one if it is something easy?
Rajeev
Originally by "Osentoski, Sarah" [email protected] on 2008-08-04 16:11:20 -0500
Hi,
I have a question. I tried to set up mpi on a set of 5 computers.
mpdboot works on each machine individually.
However if I run:
-bash-3.00$ mpdboot -n 5 -d --verbose --ncpus=3
debug: starting
running mpdallexit on erl01
LAUNCHED mpd on erl01 via
debug: launch cmd= /home/sosentoski/mpich2-install/bin/mpd.py
--ncpus=3 -e -d
debug: mpd on erl01 on port 35273
RUNNING: mpd on erl01
debug: info for running mpd: {'ncpus': 3, 'list_port': 35273,
'entry_port': *, 'host': 'erl01', 'entry_host': *, 'ifhn': ''}
LAUNCHED mpd on erl04 via erl01
debug: launch cmd= ssh -x -n -q erl04
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl07 via erl01
debug: launch cmd= ssh -x -n -q erl07
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl06 via erl01
debug: launch cmd= ssh -x -n -q erl06
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
LAUNCHED mpd on erl05 via erl01
debug: launch cmd= ssh -x -n -q erl05
'/home/sosentoski/mpich2-install/bin/mpd.py -h erl01 -p 35273
--ncpus=1 -e -d'
debug: mpd on erl07 on port no_port
mpdboot_erl01 (handle_mpd_output 406): from mpd on erl07, invalid port
info:
no_port
Do you have any helpful hints about what might be wrong with my set up?
Thanks
Sarah Osentoski
Originally by goodell on 2008-08-01 07:24:49 -0500
(this is a re-send of req#4214 so that trac learns about it)
The MPIR_Get_contextid function needs to be overhauled a bit. It
doesn't use the standard MPICH2 error handling approach, yet it's a
non-trivial function. Specifically, I've run into issues lately
where the comm subsystem is hosed in such a way that the
NMPI_Allreduce call that MPIR_Get_contextid makes fails.
Unfortunately, MPIR_Get_contextid simply returns 0 if there was a
problem, so the stack trace is simply thrown away and all errors show
up like this:
Fatal error in MPI_Comm_accept: Other MPI error, error stack:
MPI_Comm_accept(117)..: MPI_Comm_accept(port="tag#1$description#intel-loane[1]$port#46959$ifname#140.221.37.57$", MPI_INFO_NULL, root=0, MPI_COMM_WORLD, newcomm=0x7ff0004dc) failed
MPID_Comm_accept(149).:
MPIDI_Comm_accept(915): Too many communicators
In reality, the original error was caused deep down in the nemesis
layer, but you can't see it here.
I'm filing this instead of just fixing it because there are two
versions of this function that need to be fixed and tested on all
platforms. Also, all the call sites need to be updated to check the
mpi_errno and handle it accordingly. This isn't critical for the
release, so it can probably wait a little while.
-Dave
Originally by "Rajeev Thakur" [email protected] on 2008-07-30 11:33:25 -0500
All the ssm builds in last night's tests failed to compile. Might be a
simple fix.
Rajeev
Beginning make
Using variables CC='gcc' CFLAGS=' -O2' LDFLAGS='' AR='ar' FC='g77' F90='f95'
FFLAGS=' -O2' F90FLAGS=' -O2' CXX='g++'
In directory: /sandbox/buntinas/cb/mpich2/src/mpid/ch3/util/shm
CC
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c: In function 'MPIDI_CH3U_Finalize_sshm':
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:74: error: too few arguments to function 'MPIDI_PG_Get_next'
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:77: error: too few arguments to function 'MPIDI_PG_Get_next'
/home/MPI/testing/mpich2/mpich2/src/mpid/ch3/util/shm/ch3u_finalize_sshm.c:80: error: too few arguments to function 'MPIDI_PG_Get_next'
Originally by goodell on 2008-08-01 08:42:26 -0500
I see some valgrind errors when I build with ch3:nemesis:newtcp and
gforker. I couldn't figure out the problem in a few minutes of
investigation, so I'm filing this bug report so that we don't lose
track of this. There is likely a simpler configuration and test case
that will elicit these warnings, I just haven't spent any time paring
things down and playing with configure args.
-Dave
Configuration line:
./configure --prefix=/home/goodell/testing/nemesis_gforker/test_1/mpich2-installed \
    --with-pm=gforker --with-device=ch3:nemesis:newtcp \
    --enable-g=dbg,log,meminit --disable-fast --enable-nemesis-dbg-nolocal
Test program:
bblogin% cat test.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, np;
    int i;
    char buf[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0) {
        for (i = 1; i < np; i++) {
            MPI_Send(buf, 0, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
        for (i = 1; i < np; i++) {
            MPI_Send(buf, 0, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        }
    }
    else {
        MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}
Valgrind output:
bblogin% valgrind ./a.out
==28198== Memcheck, a memory error detector.
==28198== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==28198== Using LibVEX rev 1658, a library for dynamic binary translation.
==28198== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==28198== Using valgrind-3.2.1-Debian, a dynamic binary instrumentation framework.
==28198== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==28198== For more details, rerun with: -v
==28198==
==28198== Invalid read of size 8
==28198==    at 0x40152A4: (within /lib/ld-2.5.so)
==28198==    by 0x400A7CD: (within /lib/ld-2.5.so)
==28198==    by 0x4006164: (within /lib/ld-2.5.so)
==28198==    by 0x40084AB: (within /lib/ld-2.5.so)
==28198==    by 0x40116EC: (within /lib/ld-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x401114A: (within /lib/ld-2.5.so)
==28198==    by 0x534BB7F: (within /lib/libc-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x534BCE6: __libc_dlopen_mode (in /lib/libc-2.5.so)
==28198==    by 0x5327516: __nss_lookup_function (in /lib/libc-2.5.so)
==28198==    by 0x53275C4: (within /lib/libc-2.5.so)
==28198== Address 0x4032CE0 is 16 bytes inside a block of size 23 alloc'd
==28198==    at 0x4C20A69: malloc (vg_replace_malloc.c:149)
==28198==    by 0x4008999: (within /lib/ld-2.5.so)
==28198==    by 0x40116EC: (within /lib/ld-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x401114A: (within /lib/ld-2.5.so)
==28198==    by 0x534BB7F: (within /lib/libc-2.5.so)
==28198==    by 0x400D725: (within /lib/ld-2.5.so)
==28198==    by 0x534BCE6: __libc_dlopen_mode (in /lib/libc-2.5.so)
==28198==    by 0x5327516: __nss_lookup_function (in /lib/libc-2.5.so)
==28198==    by 0x53275C4: (within /lib/libc-2.5.so)
==28198==    by 0x532DC0A: gethostbyname_r (in /lib/libc-2.5.so)
==28198==    by 0x532D402: gethostbyname (in /lib/libc-2.5.so)
==28198==
==28198== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 16 from 1)
==28198== malloc/free: in use at exit: 2,826 bytes in 15 blocks.
==28198== malloc/free: 106 allocs, 91 frees, 8,405,715 bytes allocated.
==28198== For counts of detected errors, rerun with: -v
==28198== searching for pointers to 15 not-freed blocks.
==28198== checked 264,992 bytes.
==28198==
==28198== LEAK SUMMARY:
==28198==    definitely lost: 0 bytes in 0 blocks.
==28198==    possibly lost: 0 bytes in 0 blocks.
==28198==    still reachable: 2,826 bytes in 15 blocks.
==28198==    suppressed: 0 bytes in 0 blocks.
==28198== Reachable blocks (those to which a pointer was found) are not shown.
==28198== To see them, rerun with: --show-reachable=yes
Originally by goodell on 2008-08-01 08:39:54 -0500
so that we don't forget...
Begin forwarded message:
> From: Dave Goodell <[email protected]>
> Date: July 11, 2008 Jul 11 8:46:04 AM CDT
> To: [email protected]
> Cc: Dave Goodell <[email protected]>
> Subject: Re: [mpich2-dev] Apparent bypass of correct macros in
> collective operation code
>
> Good catch, Joe. These ought to be changed to use the
> MPIU_THREADPRIV_* macros. I'll forward this to mpich2-maint@ to
> make sure we don't forget to fix it.
>
> -Dave
>
> On Jul 10, 2008, at 8:47 PM, Joe Ratterman wrote:
>
>> I was recently doing some thread hacking, and I found that some of
>> my changes where causing a problem in the collective operation C
>> files: mpich2/src/mpi/coll/op*.c
>>
>> Specifically, this sort of code was a problem since I got rid of
>> the op_errno field in the MPICH_PerThread object.
>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/coll/opbor.c
>> 165 default: {
>> 166 MPICH_PerThread_t *p;
>> 167 MPIR_GetPerThread(&p);
>> 168 p->op_errno = MPIR_Err_create_code( MPI_SUCCESS,
>> MPIR_ERR_RECOVERABLE, FCNAME, __LINE__, MPI_ERR_OP,
>> "**opundefined","**opundefined %s", "MPI_BOR" );
>> 169 break;
>> 170 }
>>
>> I think that there are macros to do this, as seen in the
>> allreduce.c file (extra lines deleted):
>> 117 MPIU_THREADPRIV_DECL;
>> 126 MPIU_THREADPRIV_GET;
>> 158 MPIU_THREADPRIV_FIELD(op_errno) = 0;
>> 473 if (MPIU_THREADPRIV_FIELD(op_errno))
>> 474 mpi_errno = MPIU_THREADPRIV_FIELD(op_errno);
>>
>> With the default macros, that basically does the same thing, but I
>> didn't have to change the .c file--only the header files. The
>> same thing happens in errutil.c
>> https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/errhan/
>> errutil.c
>> 156 /* These routines export the nest increment and decrement
>> for use in ROMIO */
>> 157 void MPIR_Nest_incr_export( void )
>> 158 {
>> 159 MPICH_PerThread_t *p;
>> 160 MPIR_GetPerThread(&p);
>> 161 p->nest_count++;
>> 162 }
>> 163 void MPIR_Nest_decr_export( void )
>> 164 {
>> 165 MPICH_PerThread_t *p;
>> 166 MPIR_GetPerThread(&p);
>> 167 p->nest_count--;
>> 168 }
>>
>>
>>
>> I really think that these places should be using the existing
>> macros to handle the work.
>>
>>
>> Comments?
>> Joe Ratterman
>> [email protected]
>
Originally by "Rajeev Thakur" [email protected] on 2008-08-01 14:56:17 -0500
If I do a fresh maint/updatefiles, configure, and make, I get the following
error:
rm -f mpich2-mpdroot.o
copying python files/links into /sandbox/thakur/tmp/bin
rm -f mpich2-mpdroot
make[4]: Leaving directory `/sandbox/thakur/tmp/src/pm/mpd'
make[3]: Leaving directory `/sandbox/thakur/tmp/src/pm'
make[2]: Leaving directory `/sandbox/thakur/tmp/src/pm'
make[1]: Leaving directory `/sandbox/thakur/tmp/src'
make[1]: Entering directory `/sandbox/thakur/tmp/examples'
CC /homes/thakur/cvs/mpich2/examples/cpi.c
../bin/mpicc -o cpi cpi.o -lm
/sandbox/thakur/tmp/lib/libmpich.a(socksm.o)(.text+0x66): In function `dbg_print_sc_tbl':
: undefined reference to `CONN_STATE_TO_STRING'
collect2: ld returned 1 exit status
make[1]: *** [cpi] Error 1
make[1]: Leaving directory `/sandbox/thakur/tmp/examples'
make: *** [all-redirect] Error 2
Originally by goodell on 2008-08-01 07:51:00 -0500
http://www.mcs.anl.gov/research/projects/mpich2/todo/runs/IA32-Linux-GNU-mpd-ch3:nemesis-2008-07-31-22-00-testsumm-mpich2-fail.xml
resized
1 processes
./io
fail
Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks, combiner = 18
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
rank 0 in job 278 schwinn.mcs.anl.gov_37714 caused collective abort of all ranks
exit status of rank 0: return code 1
Originally by Vijay Mann [email protected] on 2008-08-06 15:43:46 -0500
Hi,
I think we are hitting into the same problem with fortran mpich libraries.
We are using gfortran (which accepts -m64 flag for 64 bit compilation).
We tried the following set of flags:
export FCFLAGS="-m64"
export FCFLAGS_f90="-m64"
export FFLAGS="-m64"
and they didn't seem to work.
Can you please help?
Thanks,
Vijay Mann
Technical Staff Member,
IBM India Research Laboratory, New Delhi, India.
Phone: 91-11- 41292168
http://www.research.ibm.com/people/v/vijamann/
Anthony Chan [email protected]
11/16/2007 04:52 AM
To
Pradipta De/India/IBM@IBMIN
cc
[email protected], Vijay Mann/India/IBM@IBMIN
Subject
Re: [MPICH2 Req #3768] Problem with MPICH2
Did you set CFLAGS and CXXFLAGS to the same 64bit flag used by your
C/C++ compiler ?
On Thu, 15 Nov 2007, Pradipta De wrote:
Hi,
We downloaded and compiled MPICH2 on a PowerPC box running FC6.
We are trying to use mpicxx to compile our mpi code in 64-bit mode.
We get the following error (our mpich2 install directory is
/hpcfs/downloaded_software/mpich2-install/):
/usr/bin/ld: skipping incompatible /hpcfs/downloaded_software/mpich2-install//lib/libmpichcxx.a when searching for -lmpichcxx
/usr/bin/ld: cannot find -lmpichcxx
collect2: ld returned 1 exit status
Is there some flag that needs to be specified during configuration to
allow for a 64-bit version?
thanks, and regards,
-- pradipta
Originally by "Rajeev Thakur" [email protected] on 2008-08-04 14:23:09 -0500
Maint/updatefiles gives the following errors now.
Shortname **mpi_accumulate %p %d %D %d %d %d %D %O %W for specific messages
has no expansion (first seen in file :src/mpi/rma/accumulate.c)
Shortname **mpi_alloc_mem %d %I %p for specific messages has no expansion
(first seen in file :src/mpi/rma/alloc_mem.c)
Shortname **mpi_get %p %d %D %d %d %d %D %W for specific messages has no
expansion (first seen in file :src/mpi/rma/get.c)
Shortname **mpi_pack_external %s %p %d %D %p %d %p for specific messages has
no expansion (first seen in file :src/mpi/datatype/pack_external.c)
Shortname **mpi_put %p %d %D %d %d %d %D %W for specific messages has no
expansion (first seen in file :src/mpi/rma/put.c)
Shortname **mpi_type_create_hvector %d %d %d %D %p for specific messages has
no expansion (first seen in file :src/mpi/datatype/type_create_hvector.c)
Shortname **mpi_type_hvector %d %d %d %D %p for specific messages has no
expansion (first seen in file :src/mpi/datatype/type_hvector.c)
Shortname **mpi_unpack_external %s %p %d %p %p %d %D for specific messages
has no expansion (first seen in file :src/mpi/datatype/unpack_external.c)
Shortname **mpi_win_create %p %d %d %I %C %p for specific messages has no
expansion (first seen in file :src/mpi/rma/win_create.c)
Because of errors in extracting error messages, the file
src/mpi/errhan/defmsg.h was not updated.
There are unused error message texts in src/mpi/errhan/errnames.txt
See the file unusederr.txt for the complete list
.....
Creating src/pm/smpd/smpd_version.h
Updating README's version ID.
Problems encountered while running updatefiles.
These may cause problems when configuring or building MPICH2.
Error message files in src/mpi/errhan were not updated.
Originally by "Jayesh Krishna" [email protected] on 2008-07-31 16:49:05 -0500
Hi,
As you mentioned, the current code (smpd) in the trunk requires PM support
for MPI_Comm_connect()/MPI_Comm_accept() (hence the requirement that
mpiexec be in the PATH).
We should be able to remove this dependency, but I need to run some more
tests before I can confirm that. I am on vacation till Monday, so I will
run the tests on Monday and get back to you.
Have a nice weekend,
Regards,
Jayesh
From: Edric Ellis [mailto:[email protected]]
Sent: Monday, July 28, 2008 7:38 AM
To: Jayesh Krishna
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init
Hi Jayesh,
Actually, I now seem to be able to get things communicating once the
processes have mpiexec on their $PATH, and smpd is running. I'm not sure
quite what I fixed though.
In any case, it would be much better for us to be able to restore the
behaviour as per 1.0.3 where the smpd process wasn't needed for
connect/accept.
Cheers,
Edric.
From: Edric Ellis
Sent: Monday, July 28, 2008 11:42 AM
To: 'Jayesh Krishna'
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init
Hi Jayesh,
I've just been looking at the latest MPICH2 from SVN, and what I find is
that I can now get past the call to MPI_Init() without running smpd, but
as soon as I try to perform the MPI_Comm_connect / MPI_Comm_accept phase,
the "connect" process reports an error because it can't execv mpiexec. Is
that expected?
I've tried adding the path to mpiexec to $PATH, and that doesn't help -
even running "smpd -d" shows that the processes are talking to smpd, but
they then both get stuck in a poll().
Cheers,
Edric.
From: Jayesh Krishna [mailto:[email protected]]
Sent: Friday, June 06, 2008 3:14 PM
To: Edric Ellis
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init
Hi,
Just to add to my prev email, currently the singleton init client fails
when smpd is not running on the system (We will be fixing the code so that
the singleton init client fails only when the PM is not running and the
client requires the PM (external PM) services to proceed. We expect this
to be fixed in our next release.)
Regards,
Jayesh
From: Jayesh Krishna [mailto:[email protected]]
Sent: Friday, June 06, 2008 9:08 AM
To: 'Edric Ellis'
Cc: [email protected]
Subject: RE: [MPICH2 Req #4127] smpd and singleton init
Hi,
Yes we have changed the way singleton init client works with SMPD. Now
when a singleton init client is run it tries to connect to the process
manager (without this change the process has no support from the PM and
calls like MPI_Comm_spawn() won't work).
Is there any reason why you don't want smpd to be running on these
machines?
(PS: We didn't have time to fix the connect/accept problem in smpd for
1.0.7 release. We will be looking into fixing the bug in the next
release.)
Regards,
Jayesh
-----Original Message-----
From: Edric Ellis [mailto:[email protected]]
Sent: Friday, June 06, 2008 8:20 AM
To: [email protected]
Cc: [email protected]
Subject: [MPICH2 Req #4127] smpd and singleton init
Hi mpich2-maint,
I'm looking at upgrading MATLAB from using MPICH2-1.0.3 to MPICH2-1.0.7,
and I notice a change in the way in which singleton init works for the
smpd process manager (we use the smpd build on UNIX and Windows).
In 1.0.3, what we do is the following:
When I substitute 1.0.7, MPI_Init fails because smpd isn't running.
Is there a way to get 1.0.7 to behave as per 1.0.3 - i.e. without any
reliance on the smpd process?
As a separate question: what is the status of the connect/accept stall
problems under smpd? I believe that this was fixed in 1.0.6 for mpd.
Cheers,
Edric.
Originally by Dave Goodell [email protected] on 2008-07-31 08:45:32 -0500
The new nightly tests* haven't run since [1177], which was committed
on Jul-27: https://trac.mcs.anl.gov/projects/mpich2/changeset/1177
-Dave