IBM Spectrum MPI version 10.3

Table of Contents

Welcome
What's new
PDFs
Overview
Release Notes

Planning
What's new
Prerequisites
Nodes
User authorization
Tuning your Linux system
SMT and HT mode

Installing
Root installation
Non-root installation
Migrating
Enabling enhanced GPU support
Troubleshooting

Administering
Migrating from the Parallel Environment Runtime Edition
Configuring key-based login with OpenSSH
Code structure and library support
Collective library (libcollectives)
Applications

Compiling applications
Debugging applications

TotalView debugger
Allinea DDT debugger
Redirecting debugging output
Using the -disable_gpu_hooks option with a debugger

Running applications
Establishing a path to the executables and libraries
Running containerized applications
Running programs with the mpirun command

Specifying the hosts on which your application runs
Specifying hosts individually
Specifying hosts using a host list file

Starting a SPMD (Single Program, Multiple Data) application
Starting an MPMD (multiple program, multiple data) application
mpirun command options

mpirun options for on-host communication method
mpirun options for display interconnect
mpirun options for standard I/O


mpirun options for IP network selection
mpirun options for affinity
mpirun options for PMPI layering

Running applications with IBM Platform LSF
Running jobs with ssh or rsh
Managing IBM Spectrum MPI jobs

OSHMEM applications
Interconnect selection

IBM Spectrum MPI supports Mellanox tag matching
Mellanox Multi-host feature
Specifying use of the FCA (hcoll) library
Managing on-host communication
Specifying an IP network
Displaying communication methods between hosts

Dynamic MPI profiling interface with layering
Defining consistent layering
Layered profiling implementation
MPE performance visualization tool
Using the MPE Jumpshot viewer

Managing process placement and affinity
IBM Spectrum MPI affinity shortcuts
IBM PE Runtime Edition affinity equivalents

MP_TASK_AFFINITY=core
MP_TASK_AFFINITY=core:n
MP_TASK_AFFINITY=cpu
MP_TASK_AFFINITY=cpu:n
MP_TASK_AFFINITY=mcm
MP_CPU_BIND_LIST=list_of_hyper-threads

Mapping options and modifiers
--map-by unit option
--map-by slot option
--map-by unit:PE=n and --map-by slot:PE=n options
--map-by ppr:n:unit and --map-by ppr:n:unit:pe=n options
--map-by dist:span option (adapter affinity)

Helper options
Oversubscription
Overload
OpenMP (and similar APIs)

Tuning the runtime environment
Frameworks, components, and MCA parameters
Displaying a list of MCA parameters
Optimizing non-contiguous data transfers
Controlling the level of MCA parameters that are displayed
Setting MCA parameters
Tuning multithread controls
Tunnel atomics


IBM Spectrum MPI version 10.3

Welcome to the IBM® High Performance Computing Clustering Service Packs documentation.

Getting started

What's new
Release Notes
PDFs

Common tasks

Installing
Migrating
Compiling applications
Running applications

Troubleshooting and support

Debugging applications
IBM Support
Fix Central

More information

Platform MPI documentation
Platform LSF documentation
Passport Advantage

What's new in IBM Spectrum MPI version 10.3

Read about new or significantly changed information for IBM Spectrum™ MPI version 10.3.

February 2021
The following two important fixes were delivered in multiple 10.3 Fix Packs (10.3.0.2, 10.3.1.4, and 10.3.2.1):

Fixed undetected data corruption when using MPI One-Sided communications calls and multiple processing threads.
Fixed undetected data corruption when calling MPI_GET_ACCUMULATE with a user-defined, non-contiguous receive datatype.

March 2020
The following information is a summary of the updates that are made to the Spectrum MPI version 10.3.2 (POWER8 only) documentation:

Updated General limitations and restrictions. For more information, see Release Notes.


November 2019
The following information is a summary of the updates that are made to the Spectrum MPI version 10.3.1 documentation:

IBM Spectrum MPI now supports the Singularity container runtime. For more information, see Running containerized applications.

June 2019
The following information is a summary of the updates that are made to the Spectrum MPI version 10.3.0.1 documentation:

The Spectrum MPI version 10.3 team advises customers with POWER9™ systems to upgrade to MOFED 4.5-2.2.9.0. You must upgrade to MOFED 4.5-2.2.9.0 if you use the -HCOLL option. For POWER8® systems, SMPI 10.3.0.1 requires at least MOFED 4.5-2.2.0.1. Mellanox does not support MOFED 4.5-2.2.9.0 on POWER8 systems.
Re-enabled HCOLL, Bcast, Barrier, and Alltoall for MOFED 4.5-2.2.9.0 in the etc/smpi.conf file. The re-enablement requires MOFED 4.5-2.2.9.0.

April 2019
The following information is a summary of the updates that are made to the Spectrum MPI version 10.3 documentation:

IBM Spectrum MPI can be used on 64-bit in Little Endian mode for the following POWER9 systems: 8335-GTC, 8335-GTG, 8335-GTW, 8335-GTX, and 8335-GTH.
IBM Spectrum MPI supports the IBM® Platform Load Sharing Facility (LSF®) version 10.1.0.x, or later, for starting jobs. For more information, see the Running applications with IBM Platform LSF topic.
For parallel I/O, IBM Spectrum MPI supports ROMIO version 3.2.1 and OMPIO, as an unsupported technical preview.
IBM Spectrum MPI version 10.3 now includes a UCX path as a technical preview as an alternative to the PAMI path. To request UCX on clusters running Mellanox OFED 4.5.x, users should specify -ucx on the mpirun command line. See the UCX section in the Mellanox OFED document or Open Source UCX documentation for more information.

August 2018
The following information is a summary of the updates that are made to the Spectrum MPI version 10.3 documentation:

Added information about environment variables that can improve multithread performance. For more information, see the Tuning multithread controls and Tunnel atomics topics.
Added information about the Mellanox Multi-Host features that can improve performance. For more information, see the Mellanox Multi-host feature topic.
Added information about support for the Mellanox tag matching function. For more information, see the IBM Spectrum MPI supports Mellanox tag matching topic.
Added information about using the mpicc and mpifort commands to link Fortran applications. For more information, see the Compiling applications topic.
Added information about how you can use the MPI_Put and MPI_Get operations to optimize non-contiguous data transfers. For more information, see the Optimizing non-contiguous data transfers topic.


Added information about an extra progress option for applications that might require additional progress information. For more information, see the PAMI asynchronous thread topic.
Added information about the -disable_gpu_hooks option. For more information, see the Using the -disable_gpu_hooks option with a debugger topic.
Added information about the order of nodes in a host list file. For more information, see the Specifying hosts using a host list file topic.
Added information about enabling GPU Direct support in the Installing the ibm_gpu_support RPM topic.
Added information about new RPM names for installing Spectrum MPI in the Installation requirements topic.
Updated the steps with the new RPM names in the following topics:

Root installation
Non-root installation

Updated information about using the -aff option for some of the underlying Open MPI affinity options. For more information, see the Affinity shortcuts topic.
Updated information about using the libpami and libcollectives libraries at run time and by using the mpirun -verbsbypass command. For more information, see the Using the PAMI verbs bypass topic.

IBM Spectrum MPI version 10.3 PDFs

This topic contains links to PDFs of the IBM Spectrum™ MPI version 10.3 documentation.

The following is the PDF for IBM Spectrum MPI version 10.3.1:

IBM Spectrum MPI

Overview

IBM Spectrum™ MPI version 10.3 is a high-performance, production-quality implementation of the Message Passing Interface (MPI).

IBM Spectrum MPI version 10.3 is widely used in the high-performance computing (HPC) industry. It is considered one of the standards for developing scalable, parallel applications. IBM Spectrum MPI version 10.3 is based on Open MPI version 4.0.1 and implements the full MPI 3.2 standard.

IBM Spectrum MPI was previously branded as IBM® Platform MPI. IBM Spectrum MPI delivers an Open MPI-based implementation for HPC parallel applications with improved performance, scalability, and stability.

IBM Spectrum MPI version 10.3 incorporates advanced CPU affinity features, dynamic selection of network interface libraries, superior workload manager integrations, and improved performance. It supports a broad range of industry-standard platforms, interconnects, and open APIs to help ensure that parallel applications can run almost anywhere.

IBM Spectrum MPI version 10.3 delivers an improved, RDMA-capable Parallel Active Messaging Interface (PAMI) using Mellanox OFED on both POWER8® and POWER9™ systems in Little Endian mode. It also offers an improved collective MPI library that supports the seamless use of GPU memory buffers for the application developer. The library provides advanced logic to select the fastest algorithm of many implementations for each MPI collective operation.

IBM Spectrum MPI version 10.3 Release Notes


Contents

IBM Spectrum MPI version 10.3.1
IBM Spectrum MPI version 10.3 PTF1
Features
Limitations and restrictions
General limitations and restrictions
PAMI adapter affinity limitations and restrictions
GPU limitations and restrictions
NVIDIA CUDA limitations and restrictions

IBM Spectrum MPI version 10.3.1
IBM Spectrum MPI version 10.3 PTF1 (10.3.1) includes the following updates:

IBM Spectrum® MPI now supports the Singularity container runtime. For more information, see Running containerized applications.

IBM Spectrum MPI version 10.3 PTF1
IBM Spectrum MPI version 10.3 PTF1 (10.3.0.1) includes the following updates:

Added a Crossover threshold in PAMI to dynamically switch between GDR and CUDA-Aware [128k default]. Enabled explicit ODP for managed memory by default.
Created a workaround for PAMI -async and -gpu that prevents the use of cuda-aware. This implementation adds a workaround to the alias.pl file, which raises the cuda-aware threshold to max-long if it is running with the -async and -gpu flags.
Fixed memleak for the MPI_WIN_FLAVOR_ALLOCATE API.
Re-enabled CUDA IPC in the smpi.conf file.
Updated pre-compiled pgi wrappers and mod to use PGI 2019.
Re-enabled HCOLL, Bcast, Barrier, and Alltoall for MOFED 4.5-2.2.9.0.
Fixed mpitool to allow it to run without a license.
Users can activate the new mca_osc_ucx.so shared library only if the -UCX flag is specified.
Allow add-reductions of complex types in libcoll.
Revised tuning xml 20190409 at 80 ppn and 160 ppn on POWER9™ with fallback to base, hcoll, and sharp.
Restored more hwloc --cpu-set behavior from OMPI 3.x.
Removed an overly aggressive error check in binding.
Improved ROMIO HINTS and MPI Info interactions.
Fixed an unsafe usage of integer, disps[] (romio321 gpfs).
Resolved the iput operation delay that occurred when delivering data from HCA to memory on a remote side by a non-blocking get after iput.
LSF® supports multiple MPS on one host. We can now translate CUDA_MPS_PIPE_DIRECTORY1/2/3... to a CUDA_MPS_PIPE_DIRECTORY environment variable per specified rank.
Fixed pami gpu get path with -async.
Added a new environment variable, PAMI_IPC_MAX_CACHED_HANDLES (defaults to Max).

Features
The following features are provided by IBM Spectrum MPI version 10.3:

64-bit support
IBM Spectrum MPI can be used on 64-bit in Little Endian mode for the following IBM® Power Systems™ servers with and without GPUs:

POWER9 systems: 8335-GTC, 8335-GTG, 8335-GTW, 8335-GTX, and 8335-GTH
POWER8® systems: 8335-GCA and 8335-GTB

Thread safety
The MPI_THREAD_MULTIPLE (multiple threads that are run within the MPI library) option is fully supported.

GPU support
IBM Spectrum MPI supports the following GPU-acceleration technologies:

NVIDIA GPUDirect RDMA on POWER9 systems
CUDA-aware MPI on POWER8 and POWER9

By default, GPU support is disabled. To enable GPU support for GPUDirect RDMA, you can run the mpirun -gpu command. On POWER8 systems (or POWER9 systems that are not running gpusupport kernel modules), you must disable GPUDirect RDMA and turn on CUDA-aware support by running the mpirun -gpu -disable_gdr command due to hardware limitations. Alternatively, you can specify OMPI_MCA_pml_pami_disable_gdr = 1 and OMPI_MCA_common_pami_disable_gdr = 1 in the /opt/ibm/spectrum_mpi/etc/smpi.conf file to disable GPUDirect RDMA. Setting these options in the /opt/ibm/spectrum_mpi/etc/smpi.conf file is especially useful on POWER8 or POWER9 systems that are not running gpusupport kernel modules.

Note: To use the mpirun -gpu command, you must have NVIDIA driver 418.40, or later, installed. To check the version of the NVIDIA driver that is installed, run the cat /proc/driver/nvidia/version command. The output that is displayed when you run this command is similar to the following example:

NVRM version: NVIDIA UNIX ppc64le Kernel Module 418.39 Sat Feb 9 19:12:39 CST 2019
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)

To download the latest NVIDIA drivers, see the NVIDIA Drivers website. You must use CUDA Toolkit version 10.1, or later.
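The following is a hedged sketch of these launch options; the application name and process counts are placeholders, not from the product documentation:

mpirun -gpu -np 4 ./my_cuda_app                 # POWER9 with the gpusupport kernel modules: GPUDirect RDMA enabled
mpirun -gpu -disable_gdr -np 4 ./my_cuda_app    # POWER8, or POWER9 without the kernel modules: CUDA-aware copies only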

IBM Platform LSF
IBM Spectrum MPI supports the IBM Platform Load Sharing Facility (LSF) version 10.1.0.x, or later, for starting jobs. For more information, see the Running applications with IBM Platform LSF topic.

Debugger support
IBM Spectrum MPI supports the Allinea DDT and TotalView parallel debuggers. For more information, see the Debugging applications topic.

PMIx
IBM Spectrum MPI supports and redistributes PMI Exascale (PMIx) version 3.1.2. PMIx extends the PMI standard to support clusters of exascale size. For more information about PMIx, see the PMIx Programmer's Manual.

FCA (hcoll) support
For installations that use the InfiniBand interconnect, the Mellanox Fabric Collective Accelerator (FCA), which uses Core-Direct technology, can be used to accelerate collective operations. FCA is enabled through the hcoll collective component. hcoll 4.2.x, or later (included with Mellanox OFED 4.5.x), is required. For more information, see the Mellanox HPC-X Software Toolkit website.

Portable Hardware Locality (hwloc)
IBM Spectrum MPI uses hwloc (Portable Hardware Locality), which is an API that navigates the hardware topology of your server. You can view an abbreviated image of the server's hardware by using the --report-bindings option with the mpirun command.

For example:


% mpirun -np 1 --report-bindings ./any_mpi_program.x
[ibmgpu01:27613] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]

In the following example, the end of the output line indicates that the server has two sockets, each with eight cores, and that each core has two hyper-threads. This output also shows that the started MPI process is bound to the first socket.

For example:

[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]

hwloc provides IBM Spectrum MPI with details about NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading, system attributes, and the location of I/O devices. You can use this information to place processes, and the memory associated with them, most efficiently, and for best performance.

IBM Spectrum MPI includes its own build of hwloc version 2.0.3. For more information about hwloc, see the Open MPI website.

MPI-IO
MPI has a number of subroutines that enable your application program to perform efficient parallel input/output (I/O) operations. These subroutines (collectively referred to as MPI-IO) allow efficient file I/O on a data structure that is distributed across several tasks for computation, but is organized in a unified way in a single underlying file. MPI-IO presupposes a single parallel file system that underlies all the tasks in the parallel job. For IBM Spectrum MPI, this parallel file system is IBM Spectrum Scale™ version 5.0.x, or later.

For parallel I/O, IBM Spectrum MPI supports ROMIO version 3.2.1 and OMPIO, as an unsupported technical preview. To understand how either ROMIO or OMPIO was built, use the ompi_info command with the highest level of verbosity.

For example:

$MPI_ROOT/bin/ompi_info -l 9 --param io romio321
MCA io: 321 (MCA v2.1.0, API v2.0.0, Component v10.1.0)
MCA io 321: ---------------------------------------------------
MCA io 321: parameter "io_321_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
            Priority of the io romio component
MCA io 321: parameter "io_321_delete_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
            Delete priority of the io romio component
MCA io 321: informational "io_321_version" (current value: "from MPICH v3.1.4", data source: default, level: 9 dev/all, type: string)
            Version of ROMIO
MCA io 321: informational "io_321_user_configure_params" (current value: "", data source: default, level: 9 dev/all, type: string)
            User-specified command line parameters passed to ROMIO's configure script
MCA io 321: informational "io_321_complete_configure_params" (current value: " FROM_OMPI=yes CC='gcc -std=gnu99'


CFLAGS='-DNDEBUG -m64 -O3 -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -pthread --disable-aio --disable-weak-symbols --enable-strict", data source: default, level: 9 dev/all, type: string)
            Complete set of command line parameters passed to ROMIO's configure script

In Spectrum MPI v10.3.0.0 or later, OMPIO was added as a technical preview, optional MPI-IO implementation. To request OMPIO, users must specify -mca io ompio. OMPIO includes improved nonblocking MPI-IO implementations.
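For example, a possible selection of the OMPIO preview on the command line, assuming the io component name shown above (the application name and process count are illustrative only):

mpirun -mca io ompio -np 8 ./my_io_app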

For improved MPI-IO performance using ROMIO on the Spectrum Scale parallel file system on the POWER9 architecture, users should specify the following hints by creating a my_romio321_hints.txt file with the following content:

romio_cb_write enable
romio_ds_write enable
cb_buffer_size 16777216
cb_nodes <#nodes>

After creating the my_romio321_hints.txt file, users may pass the hints file path to SMPI by using the ROMIO_HINTS environment variable.

For example:

ROMIO_HINTS=/path/to/my_romio321_hints.txt.

Important: The romio_cb_write hint should not be set on POWER8 architecture.
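A minimal sketch of putting the hints file to work, assuming a Spectrum Scale file system on POWER9 (paths, host file, and process counts are placeholders):

export ROMIO_HINTS=/path/to/my_romio321_hints.txt
mpirun -np 16 -hostfile myhosts ./my_io_app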

Limitations and restrictions

General limitations and restrictions
SMPI 10.3.2.x has been tested with MOFED 4.7-1.0.0.2 and with MOFED 4.7-3.1.9.2.

There are several known issues with MOFED 4.7-1.0.0.2. Fixes are included in MOFED 4.7-3.1.9.2. Spectrum MPI 10.3.1 restricted some functions using the installed default smpi.conf file. A second smpi.conf file has been packaged with SMPI that removes these restrictions when running with MOFED 4.7-3.

When using MOFED 4.7-3.1.9.2, users can remove these restrictions. Instructions to change the default smpi.conf files are provided in the "Spectrum MPI & JSM eFix" section below.

IBM Spectrum MPI is restricting several functions via the /opt/ibm/spectrum_mpi/etc/smpi.conf file when running with MOFED 4.7-1.0.0.2:

# Do not allow Hardware Tag Matching w/ DCT - will crash the node, reboot required to recover
PAMI_DISALLOW_HWTM_WITH_DCT=1

# Do not allow Multithreaded with HCOLL - UCX does not support this in MOFED 4.7.1
OMPI_MCA_coll_hcoll_allow_thread_multiple=0


Users are encouraged to upgrade to MOFED 4.7-3.1.9.2. Once MOFED 4.7-3 is installed, administrators should run the following command to remove the above two restrictions from the smpi.conf file after Spectrum MPI is installed:

/usr/bin/sudo ln -sf /opt/ibm/spectrum_mpi/etc/smpi-MOFED-4_7_3.conf /opt/ibm/spectrum_mpi/etc/smpi.conf

Alternatively, administrators can automate the above in an xCAT postscript. The following xCAT postscript is an example of linking the correct smpi.conf based on IBM SMPI 10.03.01.x and Mellanox OFED 4.7.3:

if [[ -e /opt/ibm/spectrum_mpi/etc/smpi-MOFED-4_7_3.conf ]]
then
    # Link in the MOFED 4.7-3 configuration only when that MOFED level is installed
    if ofed_info -n | grep -q 4.7.3
    then
        rm /opt/ibm/spectrum_mpi/etc/smpi.conf
        ln -s /opt/ibm/spectrum_mpi/etc/smpi-MOFED-4_7_3.conf /opt/ibm/spectrum_mpi/etc/smpi.conf
    fi
fi

The use of PAMI's CQ overflow check is restricted, and enforced by the smpi.conf file:

# Do not use CQOverflow fix, this will cause MPI_Alltoallv hang
PAMI_IBV_SKIP_CQOVERFLOW_CHECK=1

The tuned Linux® service is associated with increased kernel activity and slower overall performance when polling for InfiniBand messages in MOFED 4.5. The IBM Spectrum MPI team recommends that the user stop the tuned service.

When GPU memory is allocated in a container via the cudaMallocManaged function (with the allocation size exceeding the GPU memory limit), GPU kernels and Host to Device data transfers have a lower performance compared to the same job running outside of the container. It is recommended to limit the aggregate size of cudaMallocManaged allocations per GPU over all MPI tasks to the GPU memory size to avoid a performance hit.

Using --container all with the --show, --showonly, or --onlyshow options may result in unexpected output.

POWER8 and POWER9 systems

The -async flag is restricted if you use the MPI-IO shared file pointers with CUDA buffers.
You can explicitly turn off UMR by setting the corresponding value to 0 if a script or local smpi.conf file is enabling it (for example, -mca common_pami_use_umr 0 or -x OMPI_MCA_common_pami_use_umr=0).

MPI progression slowdown might occur during calls to the MPI_Reduce_scatter() API when using the -HCOLL option. The IBM Spectrum MPI team did not observe the MPI progression slowdown when running with the -hcoll option. If your application calls the MPI_Reduce_scatter() API, use the -hcoll option to prevent the MPI progression slowdown.

Before you can start nvidia-persistenced and dcgm, nv_rsync_mem needs to be loaded.

Note: To load nv_rsync_mem, you need to add the following line at the end of the [Unit] block of /usr/lib/systemd/system/nv_rsync_mem.service:

Before=nvidia-persistenced.service dcgm.service

To verify that the module has loaded properly, run the lsmod | grep nv_rsync_mem command.

The romio_cb_write hint should not be set on POWER8 architecture.


The use of certain MPI collectives, when running with the -hcoll or -HCOLL option, is restricted in SMPI 10.3.0.0. These MPI collectives include:

MPI_Alltoall
MPI_Barrier
MPI_Bcast

Notes:
These MPI collectives are automatically disabled in the smpi.conf file when either the -hcoll or -HCOLL option is specified.
The issue surrounding these MPI collectives has been resolved in SMPI 10.3.0.1 on POWER9 systems when running with MOFED 4.5-2.2.9.0.

JSM Dynamic Tasking is restricted in Spectrum MPI 10.3.0.0. As an alternative, users must use mpirun to launch dynamic tasking.

The use of pointers to CUDA buffers in MPI-IO calls is not allowed with the -async flag.
IBM Spectrum MPI is not Application Binary Interface (ABI) compatible with any other MPI implementations such as Open MPI, Platform MPI, or IBM PE Runtime Edition.
IBM Spectrum MPI version 10.3 now includes a UCX path as a technical preview as an alternative to the PAMI path. To request UCX on clusters running Mellanox OFED 4.5.x, users must specify the -ucx flag on the mpirun command line. See the UCX section in the Mellanox OFED document or Open Source UCX documentation for more information.
The IBM Spectrum MPI collectives component (libcollectives or coll_ibm) does not support intercommunicators. By default, for intercommunicator collective support, IBM Spectrum MPI relies on Open MPI intercommunicator collective components (coll_inter).
The SMPI health checks (daxpy, CPU dgemm, GPU dgemm, jlink), which are used to check memory bandwidth, CPU and GPU throughput, and network bandwidth on a quiet cluster, do not support core isolation. Therefore, when you are running SMPI health checks, do not pass the -core_isolation parameter to the LSF bsub command.
Modifying the OMPI_MCA_coll_tuned_priority = -1 parameter in the /opt/ibm/spectrum_mpi/etc/smpi.conf file is unsupported.
IBM Spectrum MPI requires Mellanox OFED 4.5.x, or later.

Note: Users can override most of the values that are specified in the /opt/ibm/spectrum_mpi/etc/smpi.conf files by setting the values in the environment where you run the mpirun command. You can also override values that are specified in the /opt/ibm/spectrum_mpi/etc/smpi.conf files from the mpirun command line by using the mpirun -x <parameter>=<value> option. If administrators want to change these parameters, they can modify the values in the /opt/ibm/spectrum_mpi/etc/smpi.conf file directly on all nodes.
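For example, a hypothetical run that overrides one smpi.conf value for a single job (the variable shown is taken from elsewhere in these notes; the application name and process count are placeholders):

mpirun -x PAMI_IBV_ENABLE_DCT=1 -np 8 ./my_app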

To prevent data integrity problems when you are writing to a file, you must use General Parallel File System (GPFS™) version 4.2.3.7, or later, or version 5.0.0.0, or later.
The default HCOLL Allreduce behavior prevents auto rank reordering. This behavior can improve the reproducibility of certain HCOLL Allreduce calls. For applications that are not sensitive to the order of operations in MPI_Allreduce, users may wish to change this default.
The MPI_Comm_spawn API has reserved keys. When you are calling the MPI_Comm_spawn API, the path key and the soft key are ignored in IBM Spectrum MPI.
Dynamic process management (dynamic tasking) is supported on only IBM Power® System servers. To run applications with dynamic tasking, you must pass the -pami_dt flag to the mpirun command. When you pass the -pami_dt flag to the mpirun command, a different PAMI library is used that supports dynamic tasking but has a lower performance for non-dynamic tasking communication operations. The MPI_Comm_accept and MPI_Comm_connect APIs are not supported by dynamic tasking.


If you run applications which use dynamic tasking, you must disable the IBM collectives library by using the -mca coll ^IBM command.
If you launch jobs with the mpirun command without LSF, you must specify the launch node in the host list file or in the -host parameter. If you do not want to run processes on the launch node, you can specify slots=0 in the host list file or specify launchnodename:0 for the -host parameter. When you make these changes to the host list file or to the -host parameter, the value of the MPI_UNIVERSE_SIZE attribute is consistent with the number of processors that are available for running jobs that are started with the MPI_Comm_spawn API.
Dynamic tasking in IBM Spectrum MPI is supported similar to the Open MPI implementation. For more information about keys that are supported, see the MPI_Comm_spawn man page.
If you launch many processes per node, the local daemon might exhaust the file descriptor limit on the node. Administrators can increase the file descriptor limit by running the ulimit -n VALUE command, where VALUE is the upper limit of file descriptors you want to define per process on the node. You must set the VALUE option on all compute nodes in the system. You can estimate the number of file descriptors that are needed by using the following formula: 25 + (7 x PPN), where PPN is the number of processes that are launched on a single node. For example, if you had eight processes that are launched on a single node, an approximate value for the VALUE option would be 81 (25 + 7 x 8). In this example, to set the file descriptor limit to 81, you can run the ulimit -n 81 command.
If you are using nvprof, the NVIDIA profiler tool, you must use the -restrict_libs none option when you run the mpirun command to avoid unnecessary fork() system calls that can lead to incorrect answers from memory that is not being pinned as expected. This process is the default behavior.
The default Parallel Active Messaging Interface (PAMI) settings for multi-host support on POWER9 systems might not be optimal for bandwidth-sensitive applications (multi-hosting is a POWER9 feature to improve off-node bandwidth). You can use the following settings to achieve optimal node aggregate bandwidth:

PAMI_IBV_DEVICE_NAME=mlx5_0:1
PAMI_IBV_DEVICE_NAME_1=mlx5_3:1

In this example, the settings configure tasks on socket 0 to use HCA port 1 and tasks on socket 1 to use HCA port 2.

The host list file is a flat text file that contains the names of the hosts on which your applications run. Each host is included on a separate line. For example, the following example shows the contents of a simple host list file called myhosts:

node1.mydomain.com
node2.mydomain.com
node3.mydomain.com
node4.mydomain.com

In a host list file, the order of the nodes in the file is not preserved when you launch processes across resources. In the previous example of the host list file that is named myhosts, the node1.mydomain.com entry might not be the first node used, even though it is listed first in the host list file. For example, the following might be the order in which the nodes are used:

1. node3.mydomain.com
2. node2.mydomain.com
3. node1.mydomain.com
4. node4.mydomain.com

A failure might occur when many threads (seven or more) call the MPI_Intercomm_create() API.
The use of the MPI_File_set_atomicity() API call is not supported.
While creating n-dimensional topologies by using the MPI_Dims_create() API calls, the ndims value must be greater than 0.


While running Spectrum MPI over TCP on nodes with a virtual adapter, users must specify the correct adapter that must be used by using the -netaddr option, because Spectrum MPI does not ignore the virbr# named devices and the TCP BTL tries to stripe across all devices seen.

Multithreaded file I/O is not supported.
At the time of this release, the MPI standard has some ambiguity about the meaning of the MPI_Comm_get_info(), MPI_Win_get_info(), and MPI_File_get_info() APIs. These APIs return the current internal value of each info setting, which can differ from values that are provided in a previous version of these APIs. In the future, the MPI standard is likely to be changed to clarify that the get calls should return the same values that were set. So, the current behavior of these APIs is not compliant with the expected clarification of the MPI standard.

If your switch network topology consists of more than one InfiniBand network, IBM Spectrum MPI requires that a unique subnet prefix (also known as the network ID) is assigned to each network. If disconnected networks are assigned the same network ID, jobs might not run.

Querying or attempting to write the value of a performance variable by using the MPI Tool information interface (MPI_T) is not supported.

Fortran MPI applications require the libgfortran3 runtime library to be installed on each compute node, regardless of which Fortran compiler was used to compile and link the application. This is a requirement because the libmpi_ibm_usempi.so.3 library depends on the libgfortran.so.3 library.

When using -async with multiple PAMI contexts and transferring large messages (rendezvous protocol utilizing the PAMI_Rget API), PAMI occasionally fails with an assertion when destroying the PAMI memory region.

HCOLL collectives may hang in multi-threaded MPI applications with a large number of communicator creations.

HCOLL's MPI_Reduce may segfault when used with multiple threads. As a workaround, restrict the use of HCOLL MPI_Reduce when running multi-threaded by setting the environment variable HCOLL_ML_DISABLE_REDUCE=1.

MPI_Iscatter may hang when using HCOLL. As a workaround, either add -async to the mpirun command line, or set PAMI_IBV_ENABLE_DCT=1 in the job environment.

Singularity containers built from Docker containers may convert environment variables set in the Docker container to default, overridable environment variables in the Singularity container (depending on the Singularity version). As a result, the environment outside of the container may unintentionally override environment variables inside the Singularity container. Users may experience the impact of this issue in a variety of ways, including in the resolution of Spectrum MPI's MPI_ROOT environment variable. It is recommended that when running in the "orted" fully contained mode, you unset the MPI_ROOT environment variable before establishing the Singularity container environment in which you will be calling mpirun. By doing this, the Singularity container will use the value for MPI_ROOT defined in the container image instead of the value that may exist outside of the container.

PAMI adapter affinity limitations and restrictions
The use of MPI_Issend to SELF followed by MPI_Cancel in HWTM mode is restricted.

Adapter affinity is enabled by default when running MPI jobs over PAMI. This usually results in better performance when CPU affinity is enabled (the default setting). Adapter affinity instructs each rank to use the InfiniBand adapter that is physically closest to the core where the process is bound. However, adapter affinity can lead to either a segmentation violation (SEGV) when you are using -verbsbypass, or a job that hangs in the following circumstances:

If a user runs a job across nodes that have different numbers of InfiniBand adapters per node (for example, if some nodes have two adapters, and other nodes have only one adapter).
If a user runs a job across nodes that have one InfiniBand adapter on one fabric, and another adapter on a different fabric.


In either of these situations, you must disable adapter affinity by specifying the PAMI_IBV_ADAPTER_AFFINITY=0 environment variable with the mpirun command.

Adapter striping is also enabled by default when you are running MPI jobs over PAMI. However, both adapter affinity and adapter striping must be disabled before a user runs a job across a set of nodes in which one node contains a single link only. In this situation, to disable both adapter affinity and adapter striping, specify the following environment variables with the mpirun command:

PAMI_ENABLE_STRIPING=0
PAMI_IBV_DEVICE_NAME=mlx5_0:1
PAMI_IBV_ADAPTER_AFFINITY=0
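A hedged example of passing these settings on the mpirun command line (host file name, process count, and application name are placeholders):

mpirun -x PAMI_ENABLE_STRIPING=0 -x PAMI_IBV_DEVICE_NAME=mlx5_0:1 -x PAMI_IBV_ADAPTER_AFFINITY=0 -np 4 -hostfile myhosts ./my_app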

IBM Spectrum MPI version 10.3 supports the OpenSHMEM 1.4 API. For more information about OpenSHMEM, see the OpenSHMEM website.

By default, PAMI creates InfiniBand connections by using Reliably Connected (RC) queue pairs for each destination rank. When the MPI_Finalize API is called, these queue pairs are removed by PAMI. As job sizes grow, the number of queue pairs also grows, which requires more time in the MPI_Finalize API.

If a job is running on a single compute node, the -pami_noib flag to the mpirun command instructs IBM Spectrum MPI to use a shared memory only version of PAMI. If a job is run on more than one compute node, users can direct PAMI to create the InfiniBand connections for the node by using Dynamically Connected (DC) queue pairs. To force DC queue pairs, you must add the -x PAMI_IBV_ENABLE_DCT=1 flag to the mpirun command. By default, PAMI switches from its default RC to DC queue pairs at certain task geometries. Directing PAMI to use DC queue pairs at other geometries might impact latency.
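The following sketch shows both cases; process counts, host file, and application names are illustrative only:

mpirun -pami_noib -np 4 ./my_app                                    # single-node job, shared memory PAMI only
mpirun -x PAMI_IBV_ENABLE_DCT=1 -np 64 -hostfile myhosts ./my_app   # multi-node job forcing DC queue pairs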

GPU limitations and restrictions
On a node with GPUs, it is recommended that you run the following commands shortly after the node boots and before you run any GPU workload:

nvidia-smi -pm 1         # persistence mode
nvidia-modprobe -u -c=0  # pre-load uvm

When MPI-IO shared file pointers are used with CUDA buffers, the use of the -async feature is restricted.
PAMI_DISABLE_MM_IPC=1 has been set in the smpi.conf file to restrict using IPC with managed memory. Users should not change this setting.
Passing GPU buffers in MPI API calls is supported only if you are using the IBM Spectrum MPI PAMI backend and the IBM collective library (libcollectives). These are the default options for IBM Spectrum MPI.
One-sided communication is not supported with GPU buffers.

NVIDIA CUDA limitations and restrictions
If an application is built by using the NVIDIA CUDA Toolkit, the NVIDIA CUDA Toolkit must be installed on the node from which it is launched, and on each compute node. You must use CUDA Toolkit version 10.1.
The following MPI functions are not CUDA-aware:

MPI_Alltoallw
MPI_Ialltoallw
MPI_Ineighbor_allgather
MPI_Ineighbor_allgatherv
MPI_Ineighbor_alltoall
MPI_Ineighbor_alltoallv


MPI_Ineighbor_alltoallw
MPI_Neighbor_allgather
MPI_Neighbor_allgatherv
MPI_Neighbor_alltoall
MPI_Neighbor_alltoallv
MPI_Neighbor_alltoallw
All one-sided MPI calls with PAMI

Planning to install IBM Spectrum MPI

When you are planning to install the IBM Spectrum™ MPI software, you need to ensure that you meet all of the necessary system requirements. You also need to think about what your programming environment is and the strategy for using that environment.

What's new in Planning for the IBM Spectrum MPI
Read about new or significantly changed information for the Planning Spectrum MPI topic collection.

IBM Spectrum MPI prerequisites
Review the prerequisites for installing and running the IBM Spectrum MPI software, including prerequisites for hardware, software, RPMs, and disk space.

IBM Spectrum MPI node resources
How you plan your node resources depends on whether you are installing IBM Spectrum MPI with or without a resource manager.

IBM Spectrum MPI user authorization
IBM Spectrum MPI supports running jobs under the secure shell (ssh) or the remote shell (rsh).

Tuning your Linux system for more efficient parallel job performance
The Linux® default network and network device settings might not produce optimum throughput (bandwidth) and latency numbers for large parallel jobs. The information that is provided describes how to tune the Linux network and certain network devices for better parallel job performance.

Understanding the effects of changing the SMT or HT mode
Simultaneous multi-threading (SMT) is a function of Power Systems™ servers that allows multiple logical CPUs to share a physical core. This same function for Intel™ is called hyper-threading (HT). The SMT and HT settings can be changed by the system administrator at any time, and it is important to understand how changing these settings affects cpusets and running jobs.

What's new in Planning for the IBM Spectrum MPI

Read about new or significantly changed information for the Planning Spectrum™ MPI topic collection.

How to see what's new or changed
To help you see where technical changes have been made, the IBM Knowledge Center uses:

The image to mark where new or changed information begins.
The image to mark where new or changed information ends.

August 2018
The following information is a summary of the updates made for IBM Spectrum MPI version 10.3:

Added information about new RPM names for installing Spectrum MPI in the IBM Spectrum MPI prerequisites topic.


Parent topic: Planning to install IBM Spectrum MPI

IBM Spectrum MPI prerequisites

Review the prerequisites for installing and running the IBM Spectrum™ MPI software, including prerequisites for hardware, software, RPMs, and disk space.

Hardware requirements
For information on the hardware that is supported by IBM Spectrum MPI, see the current announcement letter on the IBM® Offering Information website.

Software requirements
The software that is required for IBM Spectrum MPI includes various components and, in some cases, extra software. You need to decide which components to install on your system based on the features you plan to use. You might also need to install some additional products or components, based on how you plan to use IBM Spectrum MPI.

The following RPM names are applicable to IBM Spectrum MPI version 10.3:

ibm_smpi_lic_s
Contains the license files for IBM Spectrum MPI.

ibm_smpi
Contains files that are required to run MPI programs. This file set must be installed on each host that runs MPI processes.

ibm_smpi_devel

Contains files that are required to build MPI programs and other content that is not required at run time, such as man pages.

ibm_smpi_jsm
Contains files that are required for the Job Step Manager (JSM) launcher.

ibm_smpi_pami_devel
Contains the PAMI header files, libcoll header files, libsym header files, and the .so files that you can use to link applications to the libraries in this file set.

ibm_spindle

Contains files that are required to support scalable loading of shared libraries.

ibm_gpu_support
Contains files that are required to enable IBM Spectrum MPI GPU support.

Note: Users that do not wish to install the root-level kernel modules delivered in the gpusupport RPM must either pass the -disable_gdr flag when passing -gpu, or specify the OMPI_MCA_common_pami_disable_gdr=1 environment variable. This uses the CUDA-aware approach of copying data from GPU to host before sending it over RDMA.
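For example, either of the following hypothetical invocations keeps GPU support while skipping the GPUDirect RDMA path (application name and process count are placeholders):

mpirun -gpu -disable_gdr -np 2 ./my_cuda_app
mpirun -gpu -x OMPI_MCA_common_pami_disable_gdr=1 -np 2 ./my_cuda_app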

MOFED
Spectrum MPI version 10.3 advises customers with POWER9™ systems to upgrade to MOFED 4.5-2.2.9.0. You must upgrade to MOFED 4.5-2.2.9.0 if you use the -HCOLL option. Spectrum MPI version 10.3 re-enables the HCOLL collectives Bcast, Barrier, and Alltoall in the etc/smpi.conf file. This re-enablement requires MOFED 4.5-2.2.9.0. For POWER8® systems, SMPI 10.3.0.1 requires at least MOFED 4.5-2.2.0.1.


Important: If you are installing Spectrum MPI version 10.3 on a POWER8 system and running with a MOFED older than 4.5-2.2.8.1, you must uncomment the following three lines in the /opt/ibm/spectrum_mpi/etc/smpi.conf file:

HCOLL_ML_DISABLE_ALLTOALL = 1
HCOLL_ML_DISABLE_BARRIER = 1
HCOLL_ML_DISABLE_BCAST = 1

Uncommenting these three lines in the /opt/ibm/spectrum_mpi/etc/smpi.conf file prevents the system from hanging or crashing. This issue has been resolved on POWER9 systems with MOFED 4.5-2.2.9.0, which is unavailable on POWER8 systems.

Software compatibility within workstation clusters
For all workstations that are within a workstation cluster, the same release level (including maintenance levels) of IBM Spectrum MPI software is required. If you run the same release level, you must verify that an individual application can run on any workstation in the cluster.

Additional software
The following table lists additional software that you can use with IBM Spectrum MPI version 10.3.

Table 1. Additional software

If you plan to: Compile parallel executables.
This software is required: A working C or FORTRAN compiler. IBM Spectrum MPI supports parallel program development that uses the following compilers.
The IBM compilers for Linux on Power Systems servers include:
IBM C/C++ compiler, V13.1.4, or later
IBM FORTRAN compiler, V15.1.4, or later
The GNU compilers for Linux include:
C compiler
C++ compiler, V4.4.7
FORTRAN compiler
The Intel compilers for Linux include:
C compiler, V12.1.5
C++ compiler
FORTRAN compiler
Things to consider: A working C or FORTRAN compiler is needed to build parallel applications.

If you plan to: Provide resource management and scheduling functions through IBM Platform LSF® to submit and run batch applications and manage network resources.
This software is required: LSF Version 10.1, or later.
Things to consider: For more information about installing LSF, see the Preparing your system for installation topic in the Spectrum LSF Knowledge Center.

If you plan to: Choose to have IBM Spectrum MPI generate lightweight core files instead of standard core files.
This software is required: The GDB command, which is included with the GDB RPM.
Things to consider: GDB is the GNU Project Debugger. For more information, see the GNU Project Debugger website.

Spectrum MPI components
Before you install IBM Spectrum MPI, you must be familiar with its software components that are used for submitting and running jobs in a high-performance cluster environment.

IBM Spectrum MPI 10.3.0.x contains the following components:

ibm_smpi
Core library, launcher, and components needed to run MPI apps (required).

ibm_smpi_lic_s
Standard edition license RPM. Admins must accept the license (required).

ibm_smpi-devel
Compiler wrappers and headers required to build MPI applications (optional).

ibm_smpi-libgpump
The libgpump library, header, examples, and source code (optional).

ibm_smpi-pami_devel
PAMI headers, libsymm header, lib, and examples (optional, unsupported).

ibm_smpi_gpusupport
nvidia_peer_mem and nvidia_rsync kernel modules (required for GPU jobs).

ibm_smpi_mpipsupport
The mpiP profiling library for profiling MPI applications (optional).

ibm_spindle
Scalable shared lib loading for JSM.

ibm_smpi_kt
xCAT Kit SMPI bundle for diskless images of some of the components listed above.

Message passing and collective communication API subroutine libraries
Libraries that contain subroutines that help application developers parallelize their code.

Optimized Collectives Library (libcollectives)
An optimized collectives library with both one-sided and active messages collectives.

Parallel Active Messaging Interface (PAMI)
A messaging API that supports point-to-point communications.

IBM Spectrum MPI examples
A set of example programs, included in the product package.

IBM Spectrum MPI documentation
You can view, search, and print the most recent IBM Spectrum MPI documentation in PDF and HTML format at the Spectrum MPI IBM Knowledge Center website.

File systems


IBM Spectrum MPI file sets are installed in the /opt/ibm/spectrum_mpi directory by default. The installation directory is relocatable.

If you do not use a shared file system, you need to copy the user's executable files to the other nodes. If you are managing your cluster with xCAT, you can use the xCAT xdsh and xdcp commands.
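For example, a hypothetical copy of an executable to every node in an xCAT compute node group (the group name and paths are placeholders):

xdcp compute /home/user/myapp /home/user/myapp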

Parent topic: Planning to install IBM Spectrum MPI

IBM Spectrum MPI node resources

How you plan your node resources depends on whether you are installing IBM Spectrum™ MPI with or without a resource manager.

On a cluster that uses a resource manager
The system administrator uses a resource manager to partition nodes into pools or features or both, to which they assign names or numbers and other information. The workstation from which parallel jobs are started is called the home node, and it can be any workstation on the LAN.

On a cluster without a resource manager
On an IBM® Power Systems™ cluster without a resource manager, you assign nodes or servers to the following categories:

Home node (workstation from which parallel jobs are started) for running the mpirun command.
Nodes or servers for developing and compiling applications.
Nodes or servers that run applications in the parallel environment.

You must identify the nodes or servers that run as execution nodes by name in a host list file.

Setting the maximum number of processes per user
An operating system limits the number of processes that can be created by a single user. IBM Spectrum MPI requires this limit to be at least four times the maximum number of tasks on a node to support both running and debugging parallel programs.

On Linux®, the limit is controlled by the nproc attribute in /etc/security/limits.conf, and can be changed only by root.
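As a sketch, assuming a node that runs at most 160 tasks (so a limit of at least 640), the corresponding /etc/security/limits.conf entries might look like the following; the domain field (*) and the numbers are illustrative only:

* soft nproc 640
* hard nproc 640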

Setting the memory limit per user
An operating system limits the amount of memory available per user. IBM Spectrum MPI requires this limit to be increased.

On Linux, the limit is controlled only by the system administrator (root) and by the memlock attribute in /etc/security/limits.conf. The value should be changed to at least 2 GB for all users, for both the soft and hard limits, as follows:

soft memlock 2097152
hard memlock 2097152

Note: Users cannot use the Linux ulimit command to set or change limits.

User IDs on remote nodes


On each remote node, the system administrator must set up a user ID, other than a root ID, for each user on each remote node who is executing serial or parallel applications. Each user must have an account on all nodes where a job runs. Both the user name and user ID must be the same on all nodes. Also, the user must be a member of the same named group on the home node and the remote nodes.

Parent topic: Planning to install IBM Spectrum MPI

IBM Spectrum MPI user authorization

IBM Spectrum™ MPI supports running jobs under the secure shell (ssh) or the remote shell (rsh). If you are using ssh to connect to a remote host, in order for mpirun to operate properly, it is recommended that you set up a passphrase for passwordless login. For more information, see the Open MPI website.

Parent topic: Planning to install IBM Spectrum MPI

Tuning your Linux system for more efficient parallel job performance

The Linux® default network and network device settings might not produce optimum throughput (bandwidth) and latency numbers for large parallel jobs. The information that is provided describes how to tune the Linux network and certain network devices for better parallel job performance.

This information is aimed at private networks with high-performance network devices such as the Gigabit Ethernet network, and might not produce similar results for 10/100 public Ethernet networks.

The following table provides examples for tuning your Linux system for better job performance. By following these examples, it is possible to improve the performance of a parallel job that runs over an IP network.

Network tuning factors

arp_ignore - With arp_ignore set to 1, a device answers only to an ARP request if the address matches its own.
Tuning for the current boot session: echo '1' > /proc/sys/net/ipv4/conf/all/arp_ignore
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_ignore = 1

arp_filter - With arp_filter set to 1, the kernel answers only to an ARP request if it matches its own IP address.
Tuning for the current boot session: echo '1' > /proc/sys/net/ipv4/conf/all/arp_filter
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.ipv4.conf.all.arp_filter = 1

rmem_default - Defines the default receive window size.
Tuning for the current boot session: echo '1048576' > /proc/sys/net/core/rmem_default
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.core.rmem_default = 1048576

rmem_max - Defines the maximum receive window size.
Tuning for the current boot session: echo '2097152' > /proc/sys/net/core/rmem_max
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.core.rmem_max = 2097152

wmem_default - Defines the default send window size.
Tuning for the current boot session: echo '1048576' > /proc/sys/net/core/wmem_default
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.core.wmem_default = 1048576

wmem_max - Defines the maximum send window size.
Tuning for the current boot session: echo '2097152' > /proc/sys/net/core/wmem_max
Modifying the system permanently: Add this line to the /etc/sysctl.conf file: net.core.wmem_max = 2097152

Set device txqueuelen - Sets each network device, for example, eth0, eth1, and so on.
Tuning for the current boot session: /sbin/ifconfig device_interface_name txqueuelen 4096
Modifying the system permanently: Not applicable

Turn off device interrupt coalescing - To improve latency.
Tuning for the current boot session: See the sample script. This script must be run after each reboot.
Modifying the system permanently: Not applicable

This sample script unloads the e1000 Gigabit Ethernet device driver and reloads it with interrupt coalescing disabled.

For example:

#!/bin/ksh
# Unload the Gigabit Ethernet driver and reload it with interrupt coalescing disabled
Interface=eth0
Device=e1000
Kernel_Version=`uname -r`
ifdown $Interface
rmmod $Device
insmod /lib/modules/$Kernel_Version/kernel/drivers/net/$Device/$Device.ko InterruptThrottleRate=0,0,0
ifconfig $Interface up
exit $?

MPI jobs use shared memory to handle intranode communication. You might need to modify the system default for allowable maximum shared memory size to allow a large MPI job to successfully enable shared memory usage. It is recommended that you set the system allowable maximum shared memory size to 256 MB or larger for supporting large MPI jobs.

To modify this limit for the current boot session, run the echo "268435456" > /proc/sys/kernel/shmmax command as root.

To modify this limit permanently, add kernel.shmmax = 268435456 to the /etc/sysctl.conf file and reboot the system.

DNS caching should be enabled to minimize runtime host name resolution, especially if LDAP is also enabled in the cluster.

The following additional network tuning factors should be tuned for the current boot session and updated in the boot image.

gc_thresh3 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh3 should be the maximum number of compute operating system nodes, plus 300.
echo "5300" > /proc/sys/net/ipv4/neigh/default/gc_thresh3

gc_thresh2 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh2 should be 100 less than gc_thresh3.
echo "5200" > /proc/sys/net/ipv4/neigh/default/gc_thresh2

gc_thresh1 - The value in /proc/sys/net/ipv4/neigh/default/gc_thresh1 should be 100 less than gc_thresh2.
echo "5100" > /proc/sys/net/ipv4/neigh/default/gc_thresh1

gc_interval - The ARP garbage collection interval on the compute nodes should be set high so that ARP cleanup is not processed.
echo "1000000000" > /proc/sys/net/ipv4/neigh/default/gc_interval

gc_stale_time - The ARP stale time should be set high so that ARP entries are not discarded.
echo "2147483647" > /proc/sys/net/ipv4/neigh/default/gc_stale_time

base_reachable_time_ms - The ARP valid entry time (in milliseconds) should be set high so that ARP entries are not discarded.
echo "2147483647" > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms

Parent topic: Planning to install IBM Spectrum MPI

Understanding the effects of changing the SMT or HT mode

Simultaneous multi-threading (SMT) is a function of Power Systems™ servers that allows multiple logical CPUs to share a physical core. The same function on Intel™ processors is called hyper-threading (HT). The SMT and HT settings can be changed by the system administrator at any time, and it is important to understand how changing these settings affects cpusets and running jobs.

For cpusets, note the following:

After lowering the SMT mode or disabling HT for a core:
The newly disabled CPUs are removed from the existing cpuset.
sched_setaffinity() automatically skips the newly disabled CPUs.
The newly disabled CPUs are removed from the binding list of running processes.

When raising the SMT mode or enabling HT for a core, the newly enabled CPUs are not added to the existing cpuset.

If the SMT or HT mode is changed, the CPU numbering remains the same.

For jobs that are currently running, note the following:

Lowering the SMT mode or disabling HT for a core means that the job that is running is updated by the kernel to use the most recently available CPUs.

When raising the SMT mode or enabling HT for a core, the existing per-job cpuset and the job that is running are not affected. However, the system administrator needs to regenerate the CPUs contained in the parallel_jobs cpuset so that the newly enabled CPUs will be available for future jobs.
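For example, on Power Systems servers the current SMT mode can be displayed and changed with the ppc64_cpu command, assuming the powerpc-utils package is installed; after raising the mode, regenerate the parallel_jobs cpuset as described above.

ppc64_cpu --smt        # display the current SMT mode
ppc64_cpu --smt=4      # set SMT mode to 4 (run as root)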

Parent topic: Planning to install IBM Spectrum MPI

Installing IBM Spectrum MPI

You can install IBM Spectrum™ MPI on IBM® Power Systems™ processor-based systems.


Directory Structure

Installing IBM Spectrum MPI creates directories in the /opt/ibm/spectrum_mpi directory (default location).

The following are other directories created when you install IBM Spectrum MPI:

[c712f6n01][/opt/ibm/spectrum_mpi]>ls

bin etc examples include lap_se lib profilesupport properties share

[c712f6n01][/opt/ibm/spectrum_mpi]>

IBM Spectrum MPI root installation
The following information describes the procedure for installing the IBM Spectrum MPI software as root on an IBM Power® System.

IBM Spectrum MPI non-root installation
The following information describes the procedure for installing the IBM Spectrum MPI software as a non-root user on an IBM Power® System.

Migrating to IBM Spectrum MPI version 10.3
When you migrate to IBM Spectrum MPI version 10.3, your MPI application must be recompiled and relinked with the new MPI libraries. Existing applications that are built with older versions of IBM Spectrum MPI do not run with the new environment. You must rebuild the old applications with the new version of IBM Spectrum MPI.

Enabling enhanced GPU support
To use IBM Spectrum MPI, you do not have to install the ibm_gpu_support RPM. For more information on disabling GPUDirect RDMA, see the Disabling GPUDirect RDMA section. If you want to enable enhanced GPU support for IBM Spectrum MPI, you must install the ibm_gpu_support RPM.

Troubleshooting IBM Spectrum MPI installation errors
If an installation error occurs, you might need to resolve the situation by removing the IBM Spectrum MPI software by using the rpm command options (such as --noscripts), correcting the issue, and then reinstalling the software.

IBM Spectrum MPI root installation

You must install and accept the IBM Spectrum™ MPI license packages on each node in a cluster to successfully install IBM Spectrum MPI. The license must be installed before or concurrently with the IBM Spectrum MPI component base RPM. IBM Spectrum MPI checks the current license during run time.

By default, the license and product packages are installed in the /opt/ibm/spectrum_mpi directory. You can choose an alternative directory to install packages by using the --prefix option with the rpm command.

Note: If you choose an alternative directory, the same directory must be used for both the license and product packages.

You can choose how to process the Spectrum MPI license. For an unattended installation, the license agreement can be accepted automatically when the license package is installed. Alternately, the license agreement can be viewed and accepted manually after the license package is installed.

To accept the license agreement automatically, without viewing it, complete the following steps:

1. Select the license package.
2. Log in as root.
3. Set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes.
4. Enter: rpm -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm. The license is not displayed, but it is accepted automatically.


To manually view and accept the license agreement, complete the following steps:

1. Select the license package.
2. Log in as root.
3. Set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=no.
4. Enter: rpm -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm.
5. When installation is complete, the license acceptance script must be started to review and accept the license. The script can be found under the installation path directory at lap_se/bin/accept_spectrum_mpi_license.sh.

To install the Spectrum MPI RPMs, you can use the Linux® rpm command. In general, IBM® supports only the -i, -U, and -e RPM options.

1. Get the Spectrum MPI packages from Passport Advantage Online.
2. Select the RPM packages that you want to install.
3. Log in as root.
4. Install the Spectrum MPI license package and accept the license agreement by running the /opt/ibm/spectrum_mpi/lap_se/bin/accept_spectrum_mpi_license.sh script. The license is saved in the /opt/ibm/spectrum_mpi/lap_se/ directory.
5. Install the core Spectrum MPI package by running the rpm -i ibm_smpi-10.2.*.ppc64le.rpm command.
6. (Optional) Install the devel package by running the rpm -i ibm_smpi_devel-.*.ppc64le.rpm command.
7. (Optional) Install the Job Step Manager (JSM) package by running the rpm -i ibm_smpi_jsm command.
8. (Optional) Install the PAMI package by running the rpm -i ibm_smpi_pami_devel command.
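For example, an unattended root installation might look like the following sketch; the RPM file names are the examples shown in the preceding steps and vary by release, and the devel, JSM, and PAMI packages are optional.

export IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes
rpm -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm     # license, accepted automatically
rpm -i ibm_smpi-10.2.*.ppc64le.rpm             # core package
rpm -i ibm_smpi_devel-*.ppc64le.rpm            # optional devel package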

IBM Spectrum MPI non-root installation

You can install the IBM Spectrum™ MPI software and license as a non-root user by using an alternative RPM database.

You must create an alternative RPM database on each node that you want to install IBM Spectrum MPI on. The alternative database path that you create is used in any future rpm commands that reference Spectrum MPI. Run the rpm --initdb --dbpath <alt_db_path> command to create the alternative RPM database, where <alt_db_path> is the path on the node where you want the RPM database to be created.

By default, the license and product packages are installed in the /opt/ibm/spectrum_mpi directory. As a non-root user, you cannot install Spectrum MPI in the /opt/ directory. You can specify any alternative directory to install the license and product packages into by using the --prefix option with the rpm command. You must install the license and product packages into the same directory.

You must install and accept the Spectrum MPI license packages on each node in a cluster. The license must be installed before you install any other packages. Spectrum MPI checks the current license during run time.

You can choose how to process the Spectrum MPI license. For an unattended installation, the license agreement can be accepted automatically when the license package is installed. Alternately, the license agreement can be viewed and accepted manually after the license package is installed.

Note: When you do a non-root installation with an alternative RPM database, the prerequisite checks cause a failure (/bin/sh). Therefore, you must use the --nodeps option when you run the rpm command.

To accept the license agreement automatically, without viewing it, complete the following steps:

1. Select the license package.
2. Log in as a non-root user.


3. Set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes.
4. Enter the following command:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm

The license is not displayed, but it is accepted automatically.

To manually view and accept the license agreement, complete the following steps:

1. Select the license package.
2. Log in as a non-root user.
3. Set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=no.
4. Enter the following command:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm

5. When installation is complete, the license acceptance script must be started to review and accept the license. The script can be found under the installation path directory at lap_se/bin/accept_spectrum_mpi_license.sh.

To install the Spectrum MPI RPMs into an alternate location, you can use the Linux® rpm --prefix command. Since you are installing as a non-root user, you must use the rpm --dbpath command argument. In general, IBM® supports only the -i, -U, and -e RPM options.

1. Select the RPM packages that you want to install.
2. Log in as a non-root user.
3. Install the Spectrum MPI license package and accept the license agreement by running the following script (the license is saved in the /opt/ibm/spectrum_mpi/lap_se/ directory):

/opt/ibm/spectrum_mpi/lap_se/bin/accept_spectrum_mpi_license.sh

4. Install the core Spectrum MPI package in an alternative location by running:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi-10.2.*.ppc64le.rpm

5. (Optional) Install the devel package in an alternative location by running:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi_devel

6. (Optional) Install the Job Step Manager (JSM) package in an alternative location by running:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi_jsm

7. (Optional) Install the PAMI package in an alternative location by running:

rpm --dbpath <alt_db_path> --prefix <install_prefix> --nodeps -i ibm_smpi_pami_devel
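For example, a non-root installation into an alternative location might look like the following sketch; the database path $HOME/smpi_rpmdb and the installation prefix $HOME/smpi are hypothetical values, and the package file names are the examples from the preceding steps.

rpm --initdb --dbpath $HOME/smpi_rpmdb
export IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes
rpm --dbpath $HOME/smpi_rpmdb --prefix $HOME/smpi --nodeps -i ibm_smpi_lic_s-10.2-rh7.ppc64le.rpm
rpm --dbpath $HOME/smpi_rpmdb --prefix $HOME/smpi --nodeps -i ibm_smpi-10.2.*.ppc64le.rpm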

Migrating to IBM Spectrum MPI version 10.3

When you migrate to IBM Spectrum™ MPI version 10.3, your MPI application must be recompiled and relinked with the new MPI libraries. Existing applications that are built with older versions of IBM Spectrum MPI do not run with the new environment. You must rebuild the old applications with the new version of IBM Spectrum MPI.

Parent topic: Installing IBM Spectrum MPI


Enabling enhanced GPU support

To use IBM Spectrum™ MPI, you do not have to install the ibm_gpu_support RPM. For more information on disabling GPUDirect RDMA, see the Disabling GPUDirect RDMA section. If you want to enable enhanced GPU support for IBM Spectrum MPI, you must install the ibm_gpu_support RPM.

You can use the files in the ibm_gpu_support RPM to build kernel modules that enable direct data transfer between GPU memory and the InfiniBand network adapter. You can install the ibm_gpu_support RPM as a root user or non-root user, but you must build binary packages from the source packages to enable the GPU Direct function by completing the following steps:

Note: You can complete steps 1-4 as either a non-root user or a root user. However, to complete steps 5-9, you must be a root user.

1. Install the ibm_gpu_support RPM by running the rpm -i ibm_smpi_gpusupport-10.02.00.XXprpq-rh7_YYYYMMDD.ppc64le.rpm command. For more information on installing RPMs as a non-root user, see the Non-root installation topic.

2. Go to the /opt/ibm/spectrum_mpi_gpusupport directory, and locate the nvidia_peer_memory and the nv_rsync_mem source RPMs. These source RPMs are used to build the kernel modules that are required to enable GPU Direct support. The nvidia_peer_memory module is required to enable GPU Direct support. The nv_rsync_mem module is optional, but provides optimal performance.

Run the following commands as a non-root user or a root user to unpack the source RPMs:

rpm -i nvidia_peer_memory-1.0-4.src.rpm
rpm -i nv_rsync_mem-1.0-0.src.rpm

After the installation is complete, the rpmbuild directory is created in your home path.

3. Go to the ~/rpmbuild/SPECS directory, and run the following commands to build the binary RPMs that install the kernel modules:

rpmbuild -ba nvidia_peer_memory.spec
rpmbuild -ba nv_rsync_mem.spec

4. As a root user, run the following commands to remove the nvidia_peer_memory and the nv_rsync_mem binary packages:

rpm -e nvidia_peer_memory-1.0-4.ppc64le

rpm -e nv_rsync_mem-1.0-0.ppc64le

5. As a root user, go to the ~/rpmbuild/RPMS/ppc64le directory and run the following commands to install the binary RPMs that were built in step 3:

rpm -i nvidia_peer_memory-1.0-4.ppc64le.rpm

rpm -i nv_rsync_mem-1.0-0.ppc64le.rpm

6. As a root user, go to the /opt/ibm/spectrum_mpi_gpusupport directory and copy the ibm_gpusupport.conf file to the /etc/modprobe.d directory.

7. As a root user, run the dracut -f command to regenerate the initial boot image to ensure that the kernel modules are installed at the correct time during system startup.


Note: Before you can start nvidia-persistenced and dcgm, nv_rsync_mem needs to be loaded. To load nv_rsync_mem, add the following line at the end of the [Unit] block of /usr/lib/systemd/system/nv_rsync_mem.service:

Before=nvidia-persistenced.service dcgm.service

To verify that the module is loaded properly, run the lsmod | grep nv_rsync_mem command.

8. As a root user, restart the system by running the reboot command. For more information about steps 1-9, view the /opt/ibm/spectrum_mpi_gpusupport/README.txt file.

9. After the system is back online, you can complete the following steps to verify that GPU support is enabled:

1. Verify that the nv_peer_mem and nv_rsync_mem modules are listed by running the lsmod | grep nv_ command.
2. Verify that the GPUs are accessible by running the nvidia-smi command.
3. Verify that the InfiniBand network adapters are accessible by running the ibv_devinfo command.

Note: You must repeat steps 4-9 if you change the kernel, CUDA runtime drivers, NVIDIA drivers, or Mellanox OFED drivers. Before you complete step 4, you must remove the contents of the ~/rpmbuild/RPMS/ directory.
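As a reference, the module build and installation sequence from the preceding steps can be summarized as the following sketch; the package versions (1.0-4 and 1.0-0) are the examples used above and might differ on your system, and everything after the rpmbuild step must be run as root.

# Unpack the source RPMs (non-root or root)
rpm -i /opt/ibm/spectrum_mpi_gpusupport/nvidia_peer_memory-1.0-4.src.rpm
rpm -i /opt/ibm/spectrum_mpi_gpusupport/nv_rsync_mem-1.0-0.src.rpm
# Build the binary RPMs
cd ~/rpmbuild/SPECS
rpmbuild -ba nvidia_peer_memory.spec
rpmbuild -ba nv_rsync_mem.spec
# As root: remove old packages, install the new ones, and regenerate the boot image
rpm -e nvidia_peer_memory-1.0-4.ppc64le
rpm -e nv_rsync_mem-1.0-0.ppc64le
rpm -i ~/rpmbuild/RPMS/ppc64le/nvidia_peer_memory-1.0-4.ppc64le.rpm
rpm -i ~/rpmbuild/RPMS/ppc64le/nv_rsync_mem-1.0-0.ppc64le.rpm
cp /opt/ibm/spectrum_mpi_gpusupport/ibm_gpusupport.conf /etc/modprobe.d/
dracut -f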

Disabling GPUDirect RDMA

On POWER9™ architecture, PAMI (with -gpu) makes use of NVIDIA GPUDirect RDMA by default, but requires you to install the gpusupport RPM kernel modules. If you do not install the gpusupport kernel modules, the GPUDirect RDMA functionality of PAMI on POWER8® and POWER9 systems must be disabled. To disable PAMI's usage of GPUDirect RDMA when -gpu is specified, either pass the -disable_gdr flag along with -gpu, or set the OMPI_MCA_common_pami_disable_gdr=1 environment variable. You can set the OMPI_MCA_common_pami_disable_gdr=1 environment variable in the /opt/ibm/spectrum_mpi/etc/smpi.conf file for all users. When GPUDirect RDMA is disabled, PAMI uses the CUDA-aware approach of copying data from the GPU to the host before sending it over RDMA.
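For example, either of the following commands disables PAMI's use of GPUDirect RDMA for a CUDA-aware run; the -np 2 option and the ./myapp executable are placeholders.

mpirun -gpu -disable_gdr -np 2 ./myapp
OMPI_MCA_common_pami_disable_gdr=1 mpirun -gpu -np 2 ./myapp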

Parent topic: Installing IBM Spectrum MPI

Troubleshooting IBM Spectrum MPI installation errors

If an installation error occurs, you might need to resolve the situation by removing the IBM Spectrum™ MPI software by using the rpm command options (such as --noscripts), correcting the issue, and then reinstalling the software. Installation errors can be caused by a lack of sufficient disk space to install the software. In such cases, you can expand the file systems or remove unneeded files to allow for sufficient disk space.

Finding installed components

To determine which IBM Spectrum MPI product RPMs, or which IBM Spectrum MPI license RPM, is installed, you can use the rpm -qa | grep smpi- command.

Removing a software component


You can remove any of the IBM Spectrum MPI RPMs, one RPM at a time, by using the rpm -e command with the name of the RPM that you want to remove.

You cannot delete these RPMs in an arbitrary order because the IBM Spectrum MPI components depend on each other. You must remove them in the reverse order in which you installed them. To correctly remove RPMs, delete them in the following order:

1. IBM Spectrum MPI pami_devel RPM
2. IBM Spectrum MPI jsm RPM
3. IBM Spectrum MPI devel RPM
4. IBM Spectrum MPI core RPM
5. IBM Spectrum MPI license RPM
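For example, the removal sequence might look like the following sketch; the package names shown are illustrative, so use the exact names reported by the rpm -qa query.

rpm -qa | grep smpi-        # list the installed Spectrum MPI packages
rpm -e ibm_smpi_pami_devel
rpm -e ibm_smpi_jsm
rpm -e ibm_smpi_devel
rpm -e ibm_smpi
rpm -e ibm_smpi_lic_s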

Parent topic: Installing IBM Spectrum MPI

Administering IBM Spectrum MPI

IBM Spectrum™ MPI is a high-performance implementation of the MPI (Message Passing Interface) Standard. It is widely used in the high-performance computing (HPC) industry for developing scalable, parallel applications.

Before using IBM Spectrum MPI, it is important to understand the environment in which you will be creating and running your applications, as well as its requirements and limitations. This topic collection assumes that one of the currently supported Linux® distributions is already installed. It also assumes that you have already installed IBM Spectrum MPI.

IBM Spectrum MPI supports a broad range of industry-standard platforms, interconnects, and operating systems, helping ensure that parallel applications can run almost anywhere.

IBM Spectrum MPI offers:

Portability
IBM Spectrum MPI allows a developer to build a single executable that can take advantage of the performance features of a wide variety of interconnects. As a result, applications have optimal latency and bandwidth for each protocol. This reduces development effort and enables applications to use the latest technologies on Linux without the need to recompile and relink applications. Application developers can confidently build and test applications on small clusters of machines, and deploy that same application to a larger cluster.

Network optimization
IBM Spectrum MPI supports a wide variety of networks and interconnects. This enables developers to build applications that run on more platforms, thereby reducing testing, maintenance, and support costs.

Collective optimization
IBM Spectrum MPI offers a library of collectives called libcollectives, which:
Supports the seamless use of GPU memory buffers
Offers a range of algorithms that provide enhanced performance, scalability, and stability for collective operations
Provides advanced logic to determine the fastest algorithm for any given collective operation.

Comparing a task in the Parallel Environment Runtime Edition to Spectrum MPI
All tasks that you completed in IBM® Parallel Environment Runtime Edition can also be completed in IBM Spectrum MPI.


Configuring key-based login with OpenSSH
To use key-based login with OpenSSH, users must generate SSH key files and add the contents of their public key file to the authorized_keys file in the $HOME/.ssh directory.

IBM Spectrum MPI code structure and library support
Because IBM Spectrum MPI is an implementation of Open MPI, its basic architecture and functionality are similar. IBM Spectrum MPI supports many, but not all, of the features offered by Open MPI, and adds some unique features of its own.

IBM Spectrum MPI's collective library (libcollectives)
IBM Spectrum MPI provides a library of collectives called libcollectives. The libcollectives library provides seamless use of GPU memory buffers and includes a number of algorithms that offer excellent performance, scalability, and stability for collective operations. The libcollectives library also provides advanced logic to determine the fastest algorithm for any given collective operation.

IBM Spectrum MPI applications
Learn about different IBM Spectrum MPI applications.

Interconnect selection
You can choose different options for selecting interconnects.

Dynamic MPI profiling interface with layering
The MPI standard defines a profiling interface (PMPI) that allows you to create profiling libraries by wrapping any of the standard MPI routines. A profiling wrapper library contains a subset of redefined MPI_* entry points, and inside those redefinitions, a combination of both MPI_* and PMPI_* symbols are called.

Managing IBM Spectrum MPI process placement and affinity
IBM Spectrum MPI follows Open MPI's support of processor affinity for improving performance. With processor affinity, MPI processes and their threads are bound to specific hardware resources such as cores, sockets, and so on.

Tuning the runtime environment
IBM Spectrum MPI utilizes the parameters of the Modular Component Architecture (MCA) as the primary mechanism for tuning the runtime environment. Each MCA parameter is a simple key=value pair that controls a specific aspect of the Spectrum MPI functionality.

PAMI APIs
Review the PAMI APIs that are available for Spectrum MPI.

Comparing a task in the Parallel Environment Runtime Edition to Spectrum MPI

All tasks that you completed in IBM® Parallel Environment Runtime Edition can also be completed in IBM Spectrum™ MPI.

The following table contains a list of basic end-user tasks, describes the method for completing those tasks with IBM PE Runtime Edition, and then shows you the equivalent method for carrying out the same tasks using IBM Spectrum MPI.

Table 1. IBM PE Runtime Edition tasks and IBM Spectrum MPI equivalents


Task: Executing programs
IBM PE Runtime Edition method: poe program [args] [options]
IBM Spectrum MPI method: mpirun [options] program [args]

Task: Compiling programs
IBM PE Runtime Edition method: the compiler commands mpfort, mpif77, mpif90, mpcc, mpicc, mpCC, mpic++, and mpicxx, or the environment variable setting MP_COMPILER=xl | gcc | nvcc
IBM Spectrum MPI method: the compiler commands mpfort, mpicc, mpiCC, mpic++, and mpicxx, or the environment variable settings OMPI_CC=xl | gcc, OMPI_FC=xlf | gfortran, and OMPI_CXX=xlC | g++

Task: Determining rank before MPI_Init
IBM PE Runtime Edition method: the MP_CHILD environment variable
IBM Spectrum MPI method: the OMPI_COMM_WORLD_RANK environment variable

Task: Specifying the local rank
IBM PE Runtime Edition method: the MPI_COMM_WORLD_LOCAL_RANK environment variable
IBM Spectrum MPI method: the OMPI_COMM_WORLD_LOCAL_RANK environment variable

Task: Setting affinity
IBM PE Runtime Edition method: the environment variables MP_TASK_AFFINITY=cpu, MP_TASK_AFFINITY=core, MP_TASK_AFFINITY=mcm, MP_TASK_AFFINITY=cpu:n, MP_TASK_AFFINITY=core:n, and MP_TASK_AFFINITY=1
IBM Spectrum MPI method: the mpirun options -aff width:hwthread, -aff width:core, -aff width:numa, --map-by ppr:$MP_TASKS_PER_NODE:node:pe=N --bind-to hwthread, --map-by ppr:$MP_TASKS_PER_NODE:node:pe=N --bind-to core, and -aff none

Task: Setting CUDA-aware
IBM PE Runtime Edition method: the MP_CUDA_AWARE environment variable
IBM Spectrum MPI method: the mpirun -gpu option

Task: Setting FCA
IBM PE Runtime Edition method: the MP_COLLECTIVE_OFFLOAD environment variable
IBM Spectrum MPI method: the mpirun -FCA and -fca options

Task: Setting RDMA
IBM PE Runtime Edition method: the MP_USE_BULK_XFER and MP_BULK_MIN_MSG_SIZE environment variables
IBM Spectrum MPI method: RDMA is used by default when the message size is greater than 64k

Task: Controlling the level of debug messages
IBM PE Runtime Edition method: the MP_INFOLEVEL environment variable
IBM Spectrum MPI method: the mpirun -d option

Task: Setting STDIO
IBM PE Runtime Edition method: the MP_STDINMODE, MP_STDOUTMODE, and MP_LABELIO environment variables
IBM Spectrum MPI method: the mpirun -stdio option

Task: Specifying the number of tasks
IBM PE Runtime Edition method: the MP_PROCS environment variable
IBM Spectrum MPI method: the mpirun -np option

Task: Specifying a host list file
IBM PE Runtime Edition method: the MP_HOSTFILE environment variable
IBM Spectrum MPI method: the mpirun -hostfile option
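For example, assuming a four-task run of a program named a.out and a host list file named hosts (both are placeholders), the following commands are roughly equivalent.

MP_PROCS=4 MP_HOSTFILE=hosts poe ./a.out       # IBM PE Runtime Edition
mpirun -np 4 -hostfile hosts ./a.out           # IBM Spectrum MPI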


Parent topic: Administering IBM Spectrum MPI

Related information

IBM PE Runtime Edition affinity equivalents

Configuring key-based login with OpenSSH

To use key-based login with OpenSSH, users must generate SSH key files and add the contents of their public key file to the authorized_keys file in the $HOME/.ssh directory.

If key-based login is not already configured for end users, you must complete the following steps:

1. If the $HOME/.ssh directory does not exist, create it, as follows:

mkdir $HOME/.ssh
chmod 700 $HOME/.ssh

2. If the end user does not have SSH keys, generate them with the ssh-keygen command (enter CR at all prompts).

3. If the end user's public key is not added to the $HOME/.ssh/authorized_keys file, add it as follows:

cat $HOME/.ssh/public_key_file >> $HOME/.ssh/authorized_keys

Where public_key_file is the .pub file that is generated by the ssh-keygen command.
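For example, the complete setup for a user might look like the following sketch; it assumes the default RSA key type and the default id_rsa.pub public key file name produced by ssh-keygen.

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa                                        # press Enter at all prompts
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys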

Parent topic: Administering IBM Spectrum MPI

IBM Spectrum MPI code structure and library support

Because IBM Spectrum™ MPI is an implementation of Open MPI, its basic architecture and functionality are similar. IBM Spectrum MPI supports many, but not all, of the features offered by Open MPI, and adds some unique features of its own.

IBM Spectrum MPI uses the same basic code structure as Open MPI, and is made up of the following sections:

OMPI - The Open MPI API
ORTE - The Open Run-Time Environment, which provides support for back-end runtime systems
OPAL - The Open Portable Access Layer, which provides utility code that is used by OMPI and ORTE

These sections are compiled into three separate libraries, respectively: libmpi_ibm.so, liborte, and libopal. An order of dependency is imposed on these libraries; OMPI depends on ORTE and OPAL, and ORTE depends on OPAL. However, OMPI, ORTE, and OPAL are not software layers, as one might expect. So, despite this dependency order, each of these sections of code can reach the operating system or a network interface without going through the other sections.

IBM Spectrum MPI works in conjunction with ORTE to launch jobs. The mpirun and mpiexec commands, which are used to run IBM Spectrum MPI jobs, are symbolic links to the orterun command.

For more information about the organization of the Open MPI code, refer to the Open MPI website.

MPI library support


To create a parallel program with IBM Spectrum MPI, use the API that is provided on the Open MPI website. Information about the Open MPI subroutines and commands, including the various compiler script commands, is also available at this location. It is important to note that if you used Open MPI to build your application, you must recompile and relink with IBM Spectrum MPI.

Parent topic: Administering IBM Spectrum MPI

IBM Spectrum MPI's collective library

IBM Spectrum™ MPI provides a library of collectives called libcollectives. The libcollectives library provides seamless use of GPU memory buffers and includes a number of algorithms that offer excellent performance, scalability, and stability for collective operations. The libcollectives library also provides advanced logic to determine the fastest algorithm for any given collective operation.

MCA parameters for collective communication

The MCA parameters can be used for managing collective communications for IBM Spectrum MPI.

To see the complete list of MCA parameters that pertain to libcollectives, use the ompi_info command. For example:

ompi_info --param coll ibm -l 9 --internal

The following are MCA parameters for libcollectives:

-mca coll_ibm_priority number
Changes the priority of the libcollectives component. By default, the libcollectives component has the highest priority (a value of 95).

Possible Values: A number less than or equal to 100. If a negative value is specified, the component is deselected.

Default: 95

-mca coll_ibm_verbose number
Changes the verbosity of the collective component. This can be useful for debugging.

Possible Values:
-1: Silence
0: Error messages only (the default)
10: Component level messages
20: Warnings. For example, when libcollectives is skipped and the algorithm with the next highest priority should be used instead.
40: Informational messages about algorithm availability and selection.
60: Tracing messages related to the call stack. A message is displayed before every collective operation.
80: Detailed debugging information.

Default: 0

-mca coll_ibm_display_table value
Displays a table of the algorithms that are available for each communicator (printed at rank 0 of that communicator).


Possible Values:

The value is boolean, and can be any one of the following:

0 | f | false | disabled | no
1 | t | true | enabled | yes

Default: 0 | f | false | disabled | no

-mca coll_ibm_tune_results directory-prefix
Specifies the prefix of a directory that contains the XML libcollectives tuning files that are used. The directory name is created by appending a .d to the prefix that you specify. You can have more than one XML file in a directory. The directory must contain a 0.xml file. A warning message is generated if the directory does not exist, has invalid permissions, or does not contain a 0.xml file.

Possible Values: Any valid directory.

Default: The default value for the directory prefix is /opt/ibm/spectrum_mpi/etc/libcoll_tune_results.xml. The resulting directory that uses the default value is /opt/ibm/spectrum_mpi/etc/libcoll_tune_results.xml.d.
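For example, the following command runs an application with the libcollectives algorithm table displayed and informational verbosity enabled; the -np 4 option and the ./myapp executable are placeholders.

mpirun -np 4 -mca coll_ibm_display_table 1 -mca coll_ibm_verbose 40 ./myapp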

Environment variables for collective communication

IBM Spectrum MPI provides a number of environment variables for managing collective communication.

By default, libcollectives is the collective algorithm that will be used with IBM Spectrum MPI.

When an application makes many calls to a non-synchronizing collective (for example, MPI_Bcast), it is possible to exhaust the unexpected message queue on one or more participants, causing the application to crash. To avoid this situation, Open MPI provides a barrier when it encounters a sequence of non-synchronizing collectives. By default:

A barrier is issued every 128 calls for short collectives.
A barrier is issued every 16 calls for long collectives.
The boundary between short and long collectives is 262144 bytes.

IBM Spectrum MPI provides the following environment variables for customizing the barrier values.

LIBCOLL_NUM_ASYNC_SHORT
For short operations, specifies the number of asynchronous calls per barrier.

LIBCOLL_NUM_ASYNC_LONG
For long operations, specifies the number of asynchronous calls per barrier.

LIBCOLL_NUM_ASYNC_CUTOFF
Specifies the boundary between short and long collectives (in bytes).

Parent topic: Administering IBM Spectrum MPI

IBM Spectrum MPI applications

Learn about different IBM Spectrum™ MPI applications.


Example applications

IBM Spectrum MPI provides example programs that demonstrate basic communication between ranks, I/O, and thread-safe programming concepts. Both C and FORTRAN compilers must be installed to compile and run the example programs. For more information, see the examples in the /opt/ibm/spectrum_mpi/examples directory.

Compiling applications
IBM Spectrum MPI supports multiple compilers for Power Systems™ users.

Debugging applications
IBM Spectrum MPI supports a number of tools for debugging parallel applications.

Running applications
With IBM Spectrum MPI, you can run applications under the secure shell (ssh) or the remote shell (rsh), by using the mpirun command, and by using IBM® Platform LSF® (LSF®).

OSHMEM applications
OpenSHMEM is a Partitioned Global Address Space (PGAS) library interface specification. OpenSHMEM aims to provide a standard Application Programming Interface (API) for SHMEM libraries to aid portability and facilitate uniform predictable results of OpenSHMEM programs by explicitly stating the behavior and semantics of the OpenSHMEM library calls.

Parent topic: Administering IBM Spectrum MPI

Compiling applications

IBM Spectrum™ MPI supports multiple compilers for Power Systems™ users. For Power Systems users, IBM Spectrum MPI supports the following compilers:

IBM® XL C, version 16.1.3, and IBM XL Fortran, version 16.1.3 (the default)
GNU GCC compilers for C and FORTRAN, version 4.8.x
PGI (the Portland Group) compiler, version 16.9, for C++ and Fortran

The compiler that is used to build your programs is selected automatically by IBM Spectrum MPI. XL compilers have first priority. GNU compilers have second priority, followed by other compilers.

To specify the compiler that you want to use, you must set one of the following environment variables.

If you are using the mpicc C compiler wrapper, use the OMPI_CC environment variable to specify the C compiler.
If you are using the mpicxx or mpiCC C++ compiler wrapper, use the OMPI_CXX environment variable to specify the C++ compiler. IBM Spectrum MPI supports the C++ language with the C MPI bindings.
If you are using the mpif77, mpif90, or mpifort Fortran compiler wrapper, use the OMPI_FC environment variable to specify the Fortran compiler.
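For example, the following commands select the GNU compilers through the wrapper scripts; the Fortran source file name hello_f.f90 is a placeholder.

OMPI_CC=gcc mpicc hello_world_smpi.c -o hello_world_smpi
OMPI_FC=gfortran mpifort hello_f.f90 -o hello_f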

It is important to note that if you used Open MPI to build your application, you must recompile and relink with IBM Spectrum MPI.

The wrapper scripts do not actually compile your applications; they simply invoke the compiler that is specified in the configure script. The wrapper scripts provide an easy and reliable way to specify options when you compile. As a result, it is strongly recommended that you use one of the wrapper compiler scripts instead of trying to compile your applications manually.


Note: Although you are strongly encouraged to use the wrapper compiler scripts, there might be a few circumstances in which doing so is not feasible. In this case, consult the Open MPI website FAQ for information about how to compile your application without using the wrappers.

The wrapper compiler scripts that are provided by IBM Spectrum MPI include:

Language: C
Wrapper compiler name: mpicc

Language: Fortran
Wrapper compiler name: mpifort (v1.7 or later); mpif77 and mpif90 (for earlier versions)

In the following example, the mpicc wrapper script is used to compile a C program called hello_world_smpi.c.

shell$ mpicc hello_world_smpi.c -o hello_world_smpi -g

To understand how the underlying compilers are invoked, you can use the various --showme options, which are available with all of the IBM Spectrum MPI wrapper scripts. The --showme options are:

--showme Displays all of the command line options that will be used to compile the program.

--showme:command Displays the underlying compiler command.

--showme:compile Displays the compiler flags that will be passed to the underlying compiler.

--showme:help Displays a usage message.

--showme:incdirs Displays a list of directories that the wrapper script will pass to the underlying compiler. These directories indicate the location of relevant header files. It is a space-delimited list.

--showme:libdirs Displays a list of directories that the wrapper script will pass to the underlying linker. These directories indicate the location of relevant libraries. It is a space-delimited list.

--showme:libs Displays a list of library names that the wrapper script will use to link an application. For example, mpi open-rte open-pal util. It is a space-delimited list.

--showme:link Displays the linker flags that will be passed to the underlying compiler.

--showme:version Displays the current Open MPI version number.

Using wrapper compiler scripts to link Fortran applications

In IBM Spectrum MPI version 10.3, you must recompile and relink your applications with the new MPI libraries. Typically, you do not need to specify the Spectrum MPI libraries that contain the MPI API symbols explicitly, because these libraries are added to the link command by the mpicc and mpifort commands.

However, an exception to this rule occurs if an application contains Fortran-built object files and is linked with the mpicc compiler. The following example shows a link attempt:

mpicc -c example_c.c
mpifort -c example_f.f
mpicc -o application.x example_c.o example_f.o

This scenario results in many unresolved Fortran symbols because the mpicc command links only with the MPI library that contains the C symbols. However, you can run the mpifort -o application.x example_c.o example_f.o command to correctly link the application so that it contains both the C and the Fortran API symbols.

You should use the mpifort command for linking any object files that were built with Fortran. However, if you must use the mpicc command, you can manually provide the libraries that contain the Fortran API symbols by setting the OMPI_LIBS environment variable to include them on the link line.

For example:

OMPI_LIBS='-lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lgfortran'

The following table describes the libraries that contain the MPI API symbols.

Symbols: C MPI symbols
IBM Spectrum MPI version 10.3 library name: libmpi_ibm.so
IBM Spectrum MPI version 10.1.0, or earlier, library name: libmpi_ibm.so

Symbols: Small collection of Fortran MPI symbols
IBM Spectrum MPI version 10.3 library name: libmpi_ibm_usempi.so
IBM Spectrum MPI version 10.1.0, or earlier, library name: libmpi_usempi.so

Symbols: Large collection of Fortran MPI symbols
IBM Spectrum MPI version 10.3 library name: libmpi_ibm_mpifh.so
IBM Spectrum MPI version 10.1.0, or earlier, library name: libmpi_mpifh.so

Examples

The following examples show how you can use the -showme option to see the behavior of the mpicc and mpifort commands.

1. You can run the mpicc -o application.x example_c.o example_f.o -showme command to see that the mpicc command does not include any libraries that contain Fortran API symbols.

In this example, the link ends with only the -lmpiprofilesupport and -lmpi_ibm libraries.

2. You can run the mpifort -o application.x example_c.o example_f.o -showme command to see that the mpifort command includes the required libraries for Fortran applications.

In this example, the link ends with the required -lmpiprofilesupport, -lmpi_ibm_usempi, -lmpi_ibm_mpifh, and -lmpi_ibm libraries.

3. You can run the following command to get the same results as running the mpifort command, but by using the mpicc command with the OMPI_LIBS environment variable:

env OMPI_LIBS='-lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lgfortran' mpicc -o application.x example_c.o example_f.o -showme

In this example, the link ends with the -lmpiprofilesupport, -lmpi_ibm_usempi, -lmpi_ibm_mpifh, and -lgfortran libraries.

Refer to the Open MPI website for additional information about compiling applications, such as:

Compiling programs without using the wrapper compiler scripts
Overriding the wrapper compiler flags
Determining the default values of the wrapper compiler flags
Adding flags to the wrapper compiler scripts.

Configuring the wrapper compiler scripts

IBM Spectrum MPI includes a set of wrapper compiler scripts that read the configuration script and then build the command line options that are supplied to the underlying compiler.


Parent topic: IBM Spectrum MPI applications

Debugging applications

IBM Spectrum™ MPI supports a number of tools for debugging parallel applications. You can debug an IBM Spectrum MPI application with a serial debugger such as GDB.

The following methods are often used by Open MPI developers to debug applications:

Attach to individual MPI processes after they are running.
Use the mpirun command to start multiple xterm windows, each running a serial debugger.

Debugging applications with the TotalView debugger and IBM Spectrum MPI
The RogueWave TotalView debugger can be used with IBM Spectrum MPI for viewing message queues and attaching to running parallel jobs.

Debugging applications with the Allinea DDT debugger and IBM Spectrum MPI
The Allinea DDT debugger provides built-in support for MPI applications.

Redirecting debugging output from IBM Spectrum MPI
You can redirect the output from IBM Spectrum MPI to the same stream as the application's stdout and stderr channels. This makes it easier to determine the location in the program from which the particular verbose output was emitted, and to identify, for example, which collective algorithm was called.

Using the -disable_gpu_hooks option with a debugger
You can use the -disable_gpu_hooks option when a debugger or a profiling tool is intercepting CUDA APIs, or when the dlsym command is used to get a pointer to the actual address of a function (such as malloc or free).

Parent topic: IBM Spectrum MPI applications

Debugging applications with the TotalView debugger and IBM Spectrum MPI

The RogueWave TotalView debugger can be used with IBM Spectrum™ MPI for viewing message queues and attaching to running parallel jobs.

In general, if TotalView is the first debugger in your path, you can use the following mpirun command to debug an IBM Spectrum MPI application:

mpirun --debug mpirun_arguments

When it encounters the mpirun command, IBM Spectrum MPI calls the correct underlying command to run your application with the TotalView debugger. In this case, the underlying command is:

totalview mpirun -a mpirun_arguments

If you want to run a two-process job of the executable a.out, the underlying command is totalview mpirun -a -np 2 a.out.

The mpirun command also provides the -tv option, which starts a job under the TotalView debugger. The mpirun -tv option provides the same function as TotalView's -a option. The following example shows how the two-process job from the preceding example can be run.


mpirun -tv -np 2 a.out

By default, TotalView tries to debug the mpirun code itself, which, at the least, is probably not useful to you. IBM Spectrum MPI provides instructions for avoiding this problem in a sample TotalView startup file called etc/openmpi-totalview.tcl. This file can be used to cause TotalView to ignore the mpirun code and, instead, debug only the application code. By default, etc/openmpi-totalview.tcl is installed to $prefix/etc/openmpi-totalview.tcl in the IBM Spectrum MPI installation.

Note: $prefix is the base SMPI install directory, which by default is /opt/ibm/spectrum_mpi.

To use the TotalView startup file, you can either copy it into the file that is called $HOME/.tvdrc or source it directly from $HOME/.tvdrc. For example, you can place the following line in $HOME/.tvdrc (replacing /path/to/spectrum_mpi/installation with the proper directory name), which causes IBM Spectrum MPI to use the TotalView startup file:

source /path/to/spectrum_mpi/installation/etc/openmpi-totalview.tcl

Parent topic: Debugging applications

Related information

TotalView for HPC

Debugging applications with the Allinea DDT debugger and IBM Spectrum MPI

The Allinea DDT debugger provides built-in support for MPI applications. In general, if Allinea DDT is the first debugger in your path, you can use the following mpirun command to debug an IBM Spectrum™ MPI application:

mpirun --debug mpirun_arguments

When it encounters the mpirun --debug command, IBM Spectrum MPI calls the correct underlying command to run your application with the Allinea debugger. In this case, the underlying command is:

ddt -n -start

Note: The Allinea DDT debugger does not support passing arbitrary arguments to the mpirun command.

With Allinea DDT, you can also attach to processes that are already running.

For example:

ddt -attach [ ...]

You can also attach to running processes by using the ddt -attach-file syntax.

Parent topic: Debugging applications

Related information

Allinea DDT


Redirecting debugging output from IBM Spectrum MPI

You can redirect the output from IBM Spectrum™ MPI to the same stream as the application's stdout and stderr channels. This makes it easier to determine the location in the program from which the particular verbose output was emitted, and to identify, for example, which collective algorithm was called.

The following MCA parameters can be used to redirect the standard output streams that are used by the application and IBM Spectrum MPI:

-mca iof_base_redirect_app_stderr_to_stdout 1
For a user's application, specifies that stderr is redirected to stdout at the source.
Possible Values: 0 | false, 1 | true
Default: 0 | false

-mca mca_base_verbose stdout
Specifies where the default error (or verbose) output stream is directed.
Possible Values: A comma-delimited list of the following values:
stderr
stdout
syslog
syslogpri:notice | info | debug
syslogid:string (string is the prefix string for all syslog notices.)
file:filename (If a file name is not specified, a default file name is used.)
fileappend (If fileappend is not specified, the file is opened for truncation.)
level:number (number specifies the integer verbose level, 0-9. If number is not specified, 0 is implied.)
Default: stderr,level:0

-mca orte_map_stddiag_to_stderr 1
Specifies that internal IBM Spectrum MPI messages (such as verbose output) are routed over the stderr of the application process rather than an internal stddiag channel.

Note: This parameter cannot be used with orte_map_stddiag_to_stdout. If orte_map_stddiag_to_stderr and orte_map_stddiag_to_stdout are used together, an error message is issued when the job is started.

Possible Values: 0 | false, 1 | true
Default: 0 | false

-mca orte_map_stddiag_to_stdout 1
Specifies that internal IBM Spectrum MPI messages (such as verbose output) are routed over the stdout of the application process rather than an internal stddiag channel. This parameter changes the default value of the mca_base_verbose parameter to stdout. The user can still override this parameter.

Note: This parameter cannot be used with orte_map_stddiag_to_stderr. If orte_map_stddiag_to_stdout and orte_map_stddiag_to_stderr are used together, an error message is issued when the job is started.

Possible Values: 0 | false, 1 | true
Default: 0 | false

Note: Although the behavior of the -mca orte_map_stddiag_to_stdout 1 and -mca mca_base_verbose stdout parameters seems identical, there is a subtle difference. orte_map_stddiag_to_stdout occurs in the launching daemon as the application process is being forked, whereas mca_base_verbose is read in the application process relatively early in MPI_Init. As a result, there is a window of time in which orte_map_stddiag_to_stdout takes precedence over mca_base_verbose.

The default value of mca_base_verbose is stderr,level:0 unless the user specifies orte_map_stddiag_to_stdout 1. In this case, IBM Spectrum MPI automatically switches the default value of mca_base_verbose to stdout,level:0 (which is closer to what you might expect from the orte_map_stddiag_to_stdout option).

So, this option:

-mca orte_map_stddiag_to_stdout 1

means exactly the same thing as these options, specified together:

-mca orte_map_stddiag_to_stdout 1 -mca mca_base_verbose stdout,level:0

You can override the switched default value for mca_base_verbose by passing both sets of parameters, and changing the value of mca_base_verbose, as follows:

-mca orte_map_stddiag_to_stdout 1 -mca mca_base_verbose stderr,level:0

This is useful if you want the internal Open MPI messages that are generated between MPI_Init and MPI_Finalize (inclusive of these functions) to be routed to a different output channel. In this case, you can adjust the mca_base_verbose variable separately from the orte_map_stddiag_to_stdout variable (which handles internal messages that are generated outside of the region of code between MPI_Init and MPI_Finalize).

When you specify the following:

-mca orte_map_stddiag_to_stdout 1

All internal Open MPI messages (for example, internal verbose level output) are routed to stdout at all times in the application lifecycle.

Examples

In the following example, the orte_map_stddiag_to_stdout parameter is used to redirect all of the Open MPI error and verbose messages to stdout.

mpirun -mca orte_map_stddiag_to_stdout 1 ./myapp

In the following example, the orte_map_stddiag_to_stderr parameter is used to redirect all of the Open MPIerror and verbose messages to stderr.

mpirun -mca orte_map_stddiag_to_stderr 1 ./myapp

In the following example, the iof_base_redirect_app_stderr_to_stdout parameter is used to redirect all ofthe application stdout and stderr to stdout only.

mpirun -mca iof_base_redirect_app_stderr_to_stdout 1 ./myapp

In the following examples, the iof_base_redirect_app_stderr_to_stdout, orte_map_stddiag_to_stderr, and orte_map_stddiag_to_stdout parameters can be used to redirect all of the Open MPI error and verbose messages and application stdout and stderr to stdout only. In the first example, stddiag is routed to stderr, which is then routed to stdout. In the second example, which is preferred, stddiag is directly routed to stdout.


mpirun -mca iof_base_redirect_app_stderr_to_stdout 1 -mca orte_map_stddiag_to_stderr 1 ./myapp
mpirun -mca iof_base_redirect_app_stderr_to_stdout 1 -mca orte_map_stddiag_to_stdout 1 ./myapp

Parent topic: Debugging applications

Using the -disable_gpu_hooks option with a debugger

You can use the -disable_gpu_hooks option when a debugger or a profiling tool is intercepting CUDA APIs, or when the dlsym command is used to get a pointer to the actual address of a function (such as malloc or free).

When you use the -disable_gpu_hooks option, the following occurs:

The libpami_cudahook.so library is not preloaded.
PAMI cannot use the libpami_cudahook.so library.
If you use the mpirun -gpu command with GPU buffers, the CUDA API exits.

Parent topic: Debugging applications

Running applications

With IBM Spectrum™ MPI, you can run applications under the secure shell (ssh) or the remote shell (rsh), by using the mpirun command, and by using IBM® Platform LSF® (LSF). For troubleshooting information related to running jobs, refer to the Open MPI website.

Establishing a path to the IBM Spectrum MPI executables and libraries
IBM Spectrum MPI needs to be able to locate its executables and libraries on every node on which applications will run. It can be installed locally, on each node that will be a part of the MPI job, or in a location that is accessible to the network. IBM Spectrum MPI installations are relocatable.

Running programs with the mpirun command
The mpirun, mpiexec, and the orterun commands can be used with IBM Spectrum MPI to run SPMD or MPMD jobs.

Running applications with IBM Platform LSF
IBM Spectrum MPI supports IBM Platform LSF version 9.1.3, or later, for launching jobs.

Running jobs with ssh or rsh
IBM Spectrum MPI supports running jobs under the secure shell (ssh) or the remote shell (rsh).

Managing IBM Spectrum MPI jobs
There are a number of tasks that apply to running IBM Spectrum MPI jobs.

Parent topic: IBM Spectrum MPI applications

Establishing a path to the IBM Spectrum MPI executables and libraries


IBM Spectrum™ MPI needs to be able to locate its executables and libraries on every node on which applications will run. It can be installed locally, on each node that will be a part of the MPI job, or in a location that is accessible to the network. IBM Spectrum MPI installations are relocatable.

Multiple versions of IBM Spectrum MPI can be installed on a cluster, or made available over a network shared file system.

The full path to the installed IBM Spectrum MPI must be the same on all the nodes that are participating in an MPI job.

To establish a path to your executables and libraries, do the following:

1. Set the MPI_ROOT environment variable to the installed root of the version of IBM Spectrum MPI that you want to use.

2. Add $MPI_ROOT/share/man to the MANPATH environment variable.

No other environmental setup is needed to run jobs with IBM Spectrum MPI.

Note: It is not recommended that users add any of the directories under MPI_ROOT to the PATH or LD_LIBRARY_PATH statements. Doing so can interfere with the normal functioning of some IBM Spectrum MPI features.
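For example, assuming the default installation location, the following settings establish the path; adjust MPI_ROOT if you installed IBM Spectrum MPI elsewhere.

export MPI_ROOT=/opt/ibm/spectrum_mpi
export MANPATH=$MPI_ROOT/share/man:$MANPATH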

Parent topic: Running applications

Running containerized applications

A container is a mechanism that bundles an application and its execution environment into a sharable image. The image can be shared between execution environments and users, which leads to improved application usability, reproducibility, and portability. Container runtimes leverage Linux® cgroups, namespaces, and capabilities to provide an isolated execution environment for the containerized application. Containers can natively access devices such as network devices and GPUs if the container runtime has the necessary level of support.

Spectrum MPI currently supports the Singularity container runtime.

There are two different container modes that are commonly used in HPC environments:

1. Rank contained mode: In this mode, there is one container per application process (for example, MPI rank). If there are multiple application processes assigned to a node, then there are multiple container instances on that node. The Spectrum MPI runtime from the host system (outside of the container) is used to launch the container runtime with the application container image for each process on each allocated compute node.

   1. The rank contained mode requires that the Spectrum MPI runtime on the host system outside of the container is compatible with the Spectrum MPI installed inside the container image.
   2. In this mode, it is recommended to bind mount the Spectrum MPI installed on the host system over the top of the Spectrum MPI inside the container. This method is the easiest way to guarantee compatibility across the container boundary.
   3. The Spectrum MPI version inside the container must be at version 10.3.1.0 or later.

2. Fully contained mode: In this mode, there is one container per allocated compute node. If there are multiple application processes assigned to that node, then all processes run in the same container instance. The Spectrum MPI runtime from inside the container is used to launch the application container image on each compute node.

   1. The fully contained mode does not require Spectrum MPI to be installed on the host system since the runtime components are part of the container image.
   2. This mode provides users with a more portable image and execution environment since there are fewer external dependencies to the container image.

Spectrum MPI provides options to support both containerization modes. The --container option takes a comma-separated list of directives to pick the mode and control the launch environment. Additionally, Spectrum MPI provides enhancements to choose from multiple Spectrum MPI installs within the container and to customize the containerized environment through special environment variables.

If the container requires the use of Mellanox InfiniBand or NVIDIA GPUs, the user must ensure that their container image contains compatible versions of Mellanox MOFED and NVIDIA CUDA with those installed on the host system. Container images cannot contain the kernel modules that are required by these libraries. As such, the container image must contain a compatible version of the user-space library.

The Spectrum MPI runtime requires the user to set the MPIRUN_CONTAINER_CMD environment variable before it calls mpirun with any of the --container options. The MPIRUN_CONTAINER_CMD environment variable specifies the container runtime command to use when launching the container image. This command might be a direct, parameterized call to the container runtime or a script that then calls the container runtime. Spectrum MPI prefixes the binary to be launched with the string specified by this environment variable.

The MPIRUN_CONTAINER_OPTIONS environment variable can be used instead of passing the --container option to the mpirun command. The value set in the MPIRUN_CONTAINER_OPTIONS environment variable is the same string that the user would pass to the --container option.

Running containers in a rank contained mode
The mpirun command's --container rank option tells the Spectrum MPI runtime that the application is to be launched in a rank contained mode. The user must set the MPIRUN_CONTAINER_CMD environment variable, which tells Spectrum MPI how to activate the container runtime for each process in this application launch.

Spectrum MPI uses an assistant script inside the container to help negotiate container runtime options and environment settings across the container boundary. For example, when you launch the application in rank contained mode by running the following commands:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
mpirun --container rank ./a.out arg1 arg2

Each MPI rank is executed as:

singularity exec --nv myapp.sif $MPI_ROOT/container/bin/incontainer.pl ./a.out arg1 arg2

If the user needs to replace or alter the assistant script, they can pass their own script by using the --container assist:<path> option. The path that you provide must be valid inside the container image. By default, the assistant script is set to the value seen in the previous example. Note that the --container option takes a comma-separated list of values. For example, a user might specify the following commands to launch a rank contained containerized application with a custom assistant script:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"

mpirun --container rank,assist:/examples/helper.py ./a.out arg1 arg2


Then each MPI rank is executed as:

singularity exec --nv myapp.sif /examples/helper.py ./a.out arg1 arg2

The MPI_ROOT variable points to the Spectrum MPI installed inside the container image. It is recommended that MPI_ROOT be set as an environment variable inside the container image. However, if the user does not have it set or wants to use a different MPI_ROOT inside the container, then they can use the --container root:<path> option with the mpirun command to set this environment variable. The path that you provide must be valid inside the container image.

Note: Users should not start containers without using the --container options and the MPIRUN_CONTAINER_CMD environment variable because Spectrum MPI requires the use of an assistant script to correctly set up the container environment.

Running containers in a fully contained mode
The --container all and --container orted options to the mpirun command tell the Spectrum MPI runtime that the application is to be launched in a fully contained mode. The user must set the MPIRUN_CONTAINER_CMD environment variable, which tells Spectrum MPI how to activate the container runtime to set up the execution environment. There is no assistant script that is used in the orted mode, so the --container assist option is ignored in the orted mode. The assist script is used in the all mode before relaunching the mpirun process.

The --container all option to the mpirun command causes mpirun to start the container image locally and reexecute mpirun from within that container instance. The contained mpirun launches, on each allocated compute node, a container instance with the Spectrum MPI daemon (orted) inside that container image, essentially wrapping the remote daemons in a container instance. The application processes are launched in the same container instance as the Spectrum MPI runtime.

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"

mpirun --container all ./a.out arg1 arg2

then the following is executed instead:

singularity exec --nv myapp.sif mpirun ./a.out arg1 arg2

Additional environment variables meaningful to mpirun are set to tell mpirun how to launch the remote daemons in their own private container instances.

The --container orted option to mpirun is similar to the all variant except that it assumes that mpirun is already executed from within a container instance. In this mode, mpirun does not reexecute itself, but sets up the environment to place the Spectrum MPI daemons inside container instances on the compute nodes. No container assistant script is used in the orted mode. As such, the assist and root options, and SMPI_CONTAINERENV_ prefixed environment variables, have no impact in the orted mode. The orted mode is helpful if the user needs to pre-process or post-process data from inside the same container instance as mpirun, or run multiple mpirun invocations from within the same container instance. A user can create a script.

For example:

$ cat run-test.sh
#!/bin/bash
export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
./pre-process.py data.in
mpirun --container orted ./a.out arg1 arg2
./post-process.py data.out

The user can run the script by invoking the container runtime around this script:

singularity exec --nv myapp.sif ./run-test.sh

In this example, the user is creating the container around mpirun, and mpirun is in charge of creating the container around the Spectrum MPI daemons (orted) and application processes.

Customizing environment variables for the container environment
Depending on the container runtime, some environment variables might not transfer across the container boundary. Spectrum MPI allows users to prefix the environment variables that they need moved across the container boundary.

Adding the prefix SMPI_CONTAINERENV_ to an environment variable passes that environment variable inside the container without the prefix. Adding the SMPI_CONTAINERENV_ prefix is helpful when you need to pass an environment variable that the container runtime would otherwise strip from the environment (for example, LD_PRELOAD). To propagate the environment variables to the remote nodes, each environment variable must have the SMPI_CONTAINERENV_ prefix and must be listed with the -x option on the mpirun command line.

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_FOO=bar
mpirun --container rank -x SMPI_CONTAINERENV_FOO ./a.out

This sets the environment variable FOO to the value bar inside the container. If the environment variable exists in the environment of the container instance, then it is replaced with this value.

Adding the prefix SMPI_CONTAINERENV_PREPEND_ to an environment variable prepends values to an existing environment variable inside the container. Adding the SMPI_CONTAINERENV_PREPEND_ prefix is helpful if you need to extend a default environment variable (for example, PATH).

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_PREPEND_PATH="/examples/bin"
mpirun --container rank -x SMPI_CONTAINERENV_PREPEND_PATH ./a.out

This prepends /examples/bin to the environment variable PATH inside the container instance. A (:) separator is added after the value if the environment variable exists in the container. If the environment variable does not exist in the container, then it is set to this value. Spectrum MPI places additional items in the PATH and LD_LIBRARY_PATH environment variables before any values specified by this mechanism.

Adding the prefix SMPI_CONTAINERENV_APPEND_ to an environment variable appends values to an existing environment variable inside the container. Adding the SMPI_CONTAINERENV_APPEND_ prefix is helpful if you need to extend a default environment variable (for example, PATH).

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_APPEND_PATH="/examples/bin"
mpirun --container rank -x SMPI_CONTAINERENV_APPEND_PATH ./a.out

This sample code appends /examples/bin to the environment variable PATH inside the container instance. A (:) separator is added before the value if the environment variable exists in the container. If the environment variable does not exist in the container, then it is set to this value.


Running programs with the mpirun command

The mpirun, mpiexec, and the orterun commands can be used with IBM Spectrum™ MPI to run SPMD or MPMD jobs.

The mpirun and mpiexec commands are identical in their functionality, and are both symbolic links to orterun, which is the job launching command of IBM Spectrum MPI's underlying Open Runtime Environment. Therefore, although this material refers only to the mpirun command, all references to it are considered synonymous with the mpiexec and orterun commands.

Specifying the hosts on which your application runs
In order to execute your program, IBM Spectrum MPI needs to know the hosts in your network on which it will run.

Starting a SPMD (Single Program, Multiple Data) application

Starting an MPMD (multiple program, multiple data) application
Learn how to start an MPMD (multiple program, multiple data) application.

mpirun command options
The mpirun command supports a large number of command line options.

Parent topic: Running applications

Specifying the hosts on which your application runs

In order to execute your program, IBM Spectrum™ MPI needs to know the hosts in your network on which it will run.

In general, when using the mpirun command, there are two ways that you can do this. You can either:

Enter the names of the hosts individually on the command line.
Create a text file containing the names of the hosts, and then specify the list on the command line at runtime. This is called a host list file. A host list file is useful when the number of hosts is large, and entering them individually on the command line would be too cumbersome and error-prone.

Specifying hosts individually
To specify individual hosts on the mpirun command line, use the --host option.

Specifying hosts using a host list file
The host list file is a flat text file that contains the names of the hosts on which your applications run.

Parent topic: Running programs with the mpirun command

Specifying hosts individually

To specify individual hosts on the mpirun command line, use the --host option.

In the following example, the --host option is used with mpirun to start one instance of prog01 on the h1 node and another instance of prog01 on the h2 node.


mpirun -host h1,h2 prog01

Note that if you wanted to start two instances of prog01 on the h1 node, and one instance of prog01 on the h2 node, you could do the following:

mpirun -host h1,h1,h2 prog01

Parent topic: Specifying the hosts on which your application runs

Related information

mpirun command options

Specifying hosts using a host list file

The host list file is a flat text file that contains the names of the hosts on which your applications run.

Each host is included on a separate line. For example, the following displays the contents of a simple host list file called myhosts:

node1.mydomain.com
node2.mydomain.com
node3.mydomain.com
node4.mydomain.com

In a host list file, the order of the nodes in the file is not preserved when you launch processes across resources. In the previous example, with the host list file named myhosts, the node1.mydomain.com entry might not be the first node that is used, even though it is listed first in the host list file. For example, the following might be the order in which the nodes are used:

1. node3.mydomain.com
2. node2.mydomain.com
3. node1.mydomain.com
4. node4.mydomain.com

After you have created the host list file, you can specify it on the command line using the --hostfile (also known as --machinefile) option of the mpirun command. For example, using the simple myhosts host list file, you could run your application, prog01, as follows:

mpirun -np 4 --hostfile myhosts prog01

For each host, the host list file can also specify:

The number of slots (the number of available processors on that host). The number of slots can be determined by the number of cores on the node or the number of processor sockets. If no slots are specified for a host, then the number of slots defaults to one. In this example, a host list file called myhosts specifies three nodes, and each node has two slots:

cat myhosts
node1 slots=2
node2 slots=2
node3 slots=2

Specifying the following command launches six instances of prog01; two on node1, two on node2, and two on node3.


For example:

mpirun -hostfile myhosts prog01

The maximum number of slots. Note that the maximum slot count on a host defaults to infinite, thereby allowing IBM Spectrum™ MPI to oversubscribe it. To avoid oversubscribing, you can provide a maximum slot value for the host (max-slots=n).

The host list file can also contain comments, which are prefixed by a pound sign (#). Blank lines are ignored.

For example:

# This is a single processor node:
node1.mydomain.com
# This is a dual-processor node:
node2.mydomain.com slots=2
# This is a quad-processor node. Oversubscribing to it is prevented by setting max-slots=4:
node3.mydomain.com slots=4 max-slots=4

Parent topic: Specifying the hosts on which your application runs

Related information

mpirun command options

Starting a SPMD (Single Program, Multiple Data) application

In general, for SPMD jobs, the mpirun command can be used in the following format:

mpirun -np num --hostfile filename program

In this command syntax:

-np num specifies the number of processes
--hostfile filename specifies the name of the host list file
program specifies the name of your application.

In other words, mpirun starts num instances of program on the hosts designated by a host list file called filename.

Consider the following example. You have a program called prog1 and a host list file called hosts that contains the following lines:

host1.mydomain.com
host2.mydomain.com
host3.mydomain.com

You could run prog1 using the following mpirun command syntax:

mpirun -np 3 --hostfile hosts prog1

Parent topic: Running programs with the mpirun command


Starting an MPMD (multiple program, multiple data) application

Learn how to start an MPMD (multiple program, multiple data) application. For MPMD applications, the basic syntax of the mpirun command is as follows:

mpirun -np num1 prog1 : -np num2 prog2

In this command syntax:

-np num1 specifies the number of processes for prog1
-np num2 specifies the number of processes for prog2
prog1 specifies the name of an application
prog2 specifies the name of a second application.

In other words, mpirun starts num1 copies (instances) of prog1 and also starts num2 instances of prog2. Consider the following example. You have two programs; one called prog3 and another called prog4. You want to run two instances of prog3, and also four instances of prog4. In this scenario, you could use the mpirun command, as follows:

mpirun -np 2 prog3 : -np 4 prog4

Parent topic: Running programs with the mpirun command

mpirun command options

The mpirun command supports a large number of command line options. The best way to see a complete list of these options is to issue the mpirun --help command. The --help option provides usage information and a summary of all of the currently supported options for mpirun.

The following are commonly-used options for starting applications with the mpirun command:

-np | -n number_of_processes: Specifies the number of instances of a program to start.

If -np number_of_processes:

Is not specified, mpirun launches the application on the number of slots that it can discover.
Is specified, mpirun launches the given number of processes, as long as it will not oversubscribe a node.

-nooversubscribe | --nooversubscribe
Indicates that the nodes must not be oversubscribed, even if the system supports such an operation. This is the default option.

-oversubscribe | --oversubscribe
Indicates that more processes should be assigned to any node in an allocation than that node has slots for. Nodes can be oversubscribed, even on a managed system. For more information about mapping, binding, and ordering behavior for mpirun jobs, see Managing IBM Spectrum MPI process placement and affinity.

-display-allocation | --display-allocation
Displays the Allocated Nodes table. This option is useful for verifying that mpirun has read in the correct node and slot combinations. For example:

```
shell$ mpirun -np 2 -host c712f5n07:4,c712f5n08:8 --display-allocation hostname

======================   ALLOCATED NODES   ======================
        c712f5n07: slots=4 max_slots=0 slots_inuse=0 state=UP
        c712f5n08: slots=8 max_slots=0 slots_inuse=0 state=UP
=================================================================
c712f5n07
c712f5n07
```

-do-not-launch | --do-not-launch
Performs all necessary operations to prepare to launch the application, but without actually launching it. This option is useful for checking the allocation (with --display-allocation) without actually launching the daemons and processes.

For example:

```
shell$ mpirun -np 2 -host c712f5n07:4,c712f5n08:8 --display-allocation --do-not-launch hostname

======================   ALLOCATED NODES   ======================
        c712f5n07: slots=4 max_slots=0 slots_inuse=0 state=UP
        c712f5n08: slots=8 max_slots=0 slots_inuse=0 state=UP
=================================================================
```

-hostfile | --hostfile hostfile, -machinefile | --machinefile machinefile Specifies a hostfile for launching the application.

-H | -host | --host hosts Specifies a list of hosts on which to invoke processes.

-rf | --rankfile file_names Specifies a rankfile file.

--timeout seconds Indicates that the job should be terminated after the specified number of seconds.

--report-state-on-timeout Reports all job and process status when a timeout occurs.

--get-stack-traces
Gets the stack traces of all application processes when a timeout occurs. The Linux™ gstack tool must be installed on all machines in order to generate the stack trace. If the gstack tool is not available, an error message is displayed instead of the stack trace.
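The following sketch shows how the timeout-related options might be combined; the 300-second limit, host list file, and program name are placeholder values:

mpirun -np 4 --hostfile myhosts --timeout 300 --report-state-on-timeout --get-stack-traces ./a.out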

mpirun options for on-host communication method Learn about mpirun options for the on-host communication method.

mpirun options for display interconnect Learn about different mpirun options for display interconnect

mpirun options for standard I/O Learn about mpirun options for standard I/O.


mpirun options for IP network selection
Learn about mpirun options for IP network selection.

mpirun options for affinity Learn about mpirun options for affinity.

mpirun options for PMPI layering Learn about mpirun options for PMPI layering.

Parent topic: Running programs with the mpirun command

mpirun options for on-host communication method

Learn about mpirun options for the on-host communication method.

The IBM Spectrum™ MPI PAMI component supports on-host shared memory. When running with the -PAMI option (the default), no additional parameters are required for on-host communication.

-intra=nic
Specifies that the off-host BTL should also be used for on-host traffic.

-intra=vader
Specifies that BTL=vader (shared memory) should be used for on-host traffic. This only applies if the PML (point-to-point messaging layer) is already ob1.

-intra=shm
Equivalent to -intra=vader.

-intra=sm
Specifies that BTL=sm (an older shared memory component) should be used for on-host traffic. This only applies if the PML is already ob1.

Note: The -intra flag is incompatible with GPU buffers because it does not allow you to specify PAMI.

Parent topic: mpirun command options

mpirun options for display interconnect

Learn about different mpirun options for display interconnect

-prot
Displays the interconnect type that is used by each host. The first rank on each host connects to all peer hosts in order to establish connections that might otherwise be on-demand.

-protlazy
Similar to -prot. Displays the interconnect type that is used by each host at MPI_Finalize. Connections to peer hosts are not established, so it is possible that many peers are unconnected.

-gpu
Enables GPU awareness in PAMI by one MCA option and an -x LD_PRELOAD of libpami_cudahook.so.


Note: Using the -gpu option causes additional runtime checking of every buffer that is passed to MPI. -gpu is only required for applications that pass pointers to GPU buffers to MPI API calls. Applications that use GPUs, but do not pass pointers that refer to memory that is managed by the GPU, are not required to pass the -gpu option.

Parent topic: mpirun command options

mpirun options for standard I/O

Learn about mpirun options for standard I/O.

-stdio=p
Specifies that each rank's output should be prefixed with [job,rank].

-stdio=t
Specifies that a timestamp should be included with the output.

-stdio=i[+|all|-|none|rank]
Specifies that stdin should be sent to all ranks (+), no ranks (-), or a single, specific rank (rank).

-stdio=file:prefix
Specifies that output should be sent to files that are named prefix.rank. Note that prefix can be either a file name or a path ending in a file name.

-stdio=option,option,...
Specifies a comma-separated list of the standard I/O options.
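As a sketch of how these options combine (the output prefix run1_out and the program name are placeholders), the following prefixes and timestamps each rank's output and writes it to per-rank files:

mpirun -np 4 -stdio=p,t,file:run1_out ./a.out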

Parent topic: mpirun command options

mpirun options for IP network selection

Learn about mpirun options for IP network selection.

-netaddr=spec,spec,...
Specifies the networks that should be used for TCP/IP traffic. This option applies to control messages as well as regular MPI rank traffic.

-netaddr=type:spec,spec,...
Specifies the networks that should be used for different types of traffic.

In this syntax, type can be one of the following:

rank Specifies the network for regular MPI rank-to-rank traffic.

control | mpirun
Specifies the network for control messages (for example, launching mpirun).

In this syntax, spec can be one of the following:


interface name
The interface name. For example, eth0.

CIDR notation
The CIDR (Classless Inter-Domain Routing) notation. For example, 10.10.1.0/24.
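For example, a minimal sketch (the CIDR block matches the example above and the program name is a placeholder) that restricts MPI rank-to-rank traffic to a particular network:

mpirun -np 4 -netaddr=rank:10.10.1.0/24 ./a.out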

Parent topic: mpirun command options

mpirun options for affinity

Learn about mpirun options for affinity.

-aff
Enables affinity, with the default option of bandwidth.

-aff=[option,option,…]
Enables affinity, with any of the following options.

v / vv
Displays output in verbose mode.

cycle:unit
Interleaves the binding over the specified element. The values that can be specified for unit are hwthread, core, socket (the default), or numa.

bandwidth | default
Interleaves sockets but reorders them.

latency
Pack.

width:unit
Binds each rank to an element of the size that is specified by unit. The values that can be specified for unit are hwthread, core, socket (the default), or numa.
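For example, a hedged sketch (rank count and program name are placeholders) that enables affinity with verbose output and interleaves the binding across NUMA domains:

mpirun -np 8 -aff=v,cycle:numa ./a.out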

Parent topic: mpirun command options

mpirun options for PMPI layering

Learn about mpirun options for PMPI layering.

-entry lib,...
Specifies a list of PMPI wrapping libraries. Each library can be specified in one of the following forms:

libfoo.so
/path/to/libfoo.so
foo (which is automatically expanded to libfoo.so for simple strings that contain only characters of a - z, A - Z, or 0 - 9. Expansion is not applicable for the strings fort, fortran, v, and vv.)


-entry fort | fortran
Specifies the layer into which the base MPI product's Fortran calls (which minimally wrap the C calls) should be installed.

-entrybase | -baseentry lib
Optionally specifies the libraries from which to get the bottom level MPI calls. The default value is RTLD_NEXT, which is the libmpi to which the executable is linked.

-entry v | -entry vv
Displays the layering of the MPI entry points in verbose mode.

Specifying a value of v prints verbose output that shows the layering levels of the MPI entry points.

Specifying a value of vv prints more detailed verbose output than the -entry v option. The -entry vv option shows the levels that are intended to be used, and confirms the libraries that are being opened. The output from -entry vv is less readable, but it allows you to confirm, more visibly, that interception is taking place.

Parent topic: mpirun command options

Running applications with IBM Platform LSF

IBM Spectrum™ MPI supports IBM® Platform LSF® version 9.1.3, or later, for launching jobs.

When a job is launched, the mpirun command searches for the LSF_ENVDIR and LSB_JOBID environment variables. If these environment variables are found, and mpirun can successfully reference and use the LSF library, then mpirun determines that it is in an LSF environment. The easiest way to make sure that the necessary LSF environment variables are set is by sourcing the LSF $LSF_TOP/conf/profile.lsf file in the LSF installation directory.

If LSB_AFFINITY_HOSTFILE is set, then the file that is specified by this environment variable determines the mapping, binding, and ordering for the processes that will be launched later. LSF generates LSB_AFFINITY_HOSTFILE during the setup of the allocation.

After the list of hosts is known, the lsf component of the PLM framework in mpirun launches an Open RTE daemon (orted) on each node using LSF's launch APIs.

After the list of hosts is known, the PLM framework of mpirun launches an Open RTE daemon (orted) on each node in a linear manner.

Previously, a limitation existed regarding the use of both short and long host names with LSF. Short names (for example, nodeA) could not be mixed with long names (for example, nodeA.mycluster.org) by LSF because Open MPI interpreted them as two different nodes, and then failed to launch. However, beginning with IBM Spectrum MPI 10.1.0.2, the MCA parameter setting of orte_keep_fqdn_hostnames=false now causes all long host names to be converted to short host names, by default, and MPI interprets them correctly. No other intervention is required by users.
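As a hedged sketch of a typical LSF workflow (the bsub resource request and program name are placeholders, and a standard LSF installation is assumed), you might do the following:

# Make the LSF environment variables available
source $LSF_TOP/conf/profile.lsf
# Submit an 8-way job; inside the allocation, mpirun detects LSF and launches the ranks
bsub -n 8 mpirun ./a.out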

Parent topic: Running applications

Running jobs with ssh or rsh


IBM Spectrum™ MPI supports running jobs under the secure shell (ssh) or the remote shell (rsh).

mpirun first looks for allocation information from a resource manager. If none is found, it uses the values provided for the -hostfile, -machinefile, -host, and -rankfile options, and then uses ssh or rsh to launch the Open RTE daemons on the remote nodes.

By default, jobs are launched using ssh; however, you can force the use of rsh by using the -mca plm_rsh_force_rsh parameter. The following list describes -mca plm_rsh_force_rsh, as well as other MCA parameters that are useful when running jobs under ssh or rsh.

-mca plm_rsh_agent
Specifies the agent that will launch executables on remote nodes. The value is a colon-delimited list of agents, in order of precedence.
Default: ssh : rsh

-mca plm_rsh_args
Specifies arguments that should be added to ssh or rsh.
Default: Not set

-mca plm_rsh_assume_same_shell
Specifies whether or not to assume that the shell on the remote node is the same as the shell on the local node. Valid values are 0 | f | false | disabled | no or 1 | t | true | enabled | yes.
Default: true (assume that the shell on the remote node is the same as the shell on the local node)

-mca plm_rsh_num_concurrent
Specifies the number of plm_rsh_agent instances to invoke concurrently. You must specify a value that is greater than 0.
Default: 128

-mca plm_rsh_pass_environ_mca_params
Specifies whether or not to include MCA parameters from the environment on the Open RTE (orted) command line. Valid values are 0 | f | false | disabled | no or 1 | t | true | enabled | yes.
Default: true (MCA parameters from the environment will be included on the orted command line)

-mca plm_rsh_force_rsh
Specifies whether or not to force the launcher to always use rsh. Valid values are 0 | f | false | disabled | no or 1 | t | true | enabled | yes.
Default: false (the launcher will not use rsh)

-mca plm_rsh_no_tree_spawn
Specifies whether or not to launch applications using a tree-based topology. Valid values are 0 | f | false | disabled | no or 1 | t | true | enabled | yes.
Default: false (applications are launched using a tree-based topology)

-mca plm_rsh_pass_libpath
Specifies the library path to prepend to the remote shell's LD_LIBRARY_PATH.
Default: Not set
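For example, a sketch of how these parameters might be used (the host list file, the ssh "-q" argument, and the program name are placeholders):

# Launch with ssh (the default) and pass an extra option to the ssh agent
mpirun -np 4 --hostfile myhosts -mca plm_rsh_args "-q" ./a.out
# Alternatively, force the use of rsh
mpirun -np 4 --hostfile myhosts -mca plm_rsh_force_rsh 1 ./a.out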


Note: If you are using ssh to connect to a remote host, in order for mpirun to operate properly, it is recommended that you set up a passphrase for passwordless login. For more information, see the Open MPI FAQ website (www.open-mpi.org/faq/?category=rsh).

Parent topic: Running applications

Managing IBM Spectrum MPI jobs

There are a number of tasks that apply to running IBM Spectrum™ MPI jobs.

Using LD_PRELOAD with Spectrum MPI
Internally, Spectrum MPI uses LD_PRELOAD to enable GPU-related features. However, this creates a potential conflict when a user also adds a setting to LD_PRELOAD.

To enable users to add settings to LD_PRELOAD, IBM Spectrum MPI provides the following environment variables:

OMPI_LD_PRELOAD_PREPEND
Inserts the user's setting at the beginning of any existing LD_PRELOAD setting before launching the MPI ranks.

OMPI_LD_PRELOAD_POSTPEND
Inserts the user's setting at the end of any existing LD_PRELOAD setting before launching the MPI ranks.
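For example, a sketch (the library path is a placeholder) that adds a user library to the front of whatever LD_PRELOAD setting Spectrum MPI itself needs:

export OMPI_LD_PRELOAD_PREPEND=/path/to/libmytrace.so
mpirun -np 2 ./a.out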

Using a pre-launch script
You can insert a program in front of all the ranks of a job by using the OMPI_PRELAUNCH environment variable. For example:

% env OMPI_PRELAUNCH=valgrind mpirun -np 2 ./a.out

This is equivalent to the following:

% mpirun -np 2 valgrind ./a.out

The OMPI_PRELAUNCH environment variable is mainly useful for debugging applications that have complex launching scripts that are otherwise difficult to modify.

Consider another example. Here, a sample pre-launch script is used to attach gdb to a particular rank:

#!/bin/sh
if [ $OMPI_COMM_WORLD_RANK -eq 0 ] ; then
    xterm -e gdb --args "$@"
else
    exec "$@"
fi

Using the InfiniBand Dynamically Connected Transport protocol
IBM Spectrum MPI supports the Mellanox InfiniBand Dynamically Connected Transport (DCT) protocol.

To turn DCT on, you can use one of the following environment variables:


PAMI_IBV_ENABLE_DCT
Use PAMI_IBV_ENABLE_DCT to explicitly turn DCT on and off. A value of 1 specifies that all jobs use DCT, regardless of the value of PAMI_IBV_RCQP_PERCENT. A value of 0 specifies that all jobs use RC, regardless of the value of PAMI_IBV_RCQP_PERCENT.

PAMI_IBV_RCQP_PERCENT
Specifies the threshold, as a percentage, of the number of qpairs that must be available for use in order to automatically enable DCT. By default, the value of PAMI_IBV_RCQP_PERCENT is 50. Allowable values are 1 through 100.

By default, the use of DCT is not enabled, and IBM Spectrum MPI uses RC mode. However, PAMI automatically switches to DCT mode when the following condition exists:

number_of_QPs >= (50/100) * total_QPs_available

In PAMI, the total_QPs_available is defined as 64k. The number_of_QPs is determined as follows:

(number_of_ranks_per_node^2) * number_of_nodes

For example, a 512 rank job on eight nodes uses (64^2)*8=32k QPs. Because 32k is >= 50/100 * 64k, DCT mode is used. In a similar example, a 512 rank job on 16 nodes uses (32^2)*16=16k QPs. Because 16k is not >= 50/100 * 64k, RC mode is used. If PAMI_IBV_RCQP_PERCENT is set to anything other than 50, the new percentage value is used with this formula.

Note: Although the PAMI verbs bypass can be used in DCT mode, you will not see the normal performance benefit of the verbs bypass.
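For example, a sketch that explicitly turns DCT on for a run (the -x option exports the variable to the ranks; the host list file and program name are placeholders):

export PAMI_IBV_ENABLE_DCT=1
mpirun -np 512 --hostfile myhosts -x PAMI_IBV_ENABLE_DCT ./a.out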

TCP
By default, when SMPI uses TCP, it uses all "up" interfaces. Sometimes these "up" interfaces can be virtual interfaces that are only used within a machine and without the ability to communicate to MPI ranks on other machines. To avoid hangs in such cases, SMPI needs to be told to ignore interfaces that won't be able to communicate to all peer ranks. For example, --mca btl_tcp_if_exclude virbr0,docker0.
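A minimal sketch of that usage (the interface names follow the example above; the host list file and program name are placeholders):

mpirun -np 4 --hostfile myhosts -TCP --mca btl_tcp_if_exclude virbr0,docker0 ./a.out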

Parent topic: Running applications

OSHMEM applications

OpenSHMEM is a Partitioned Global Address Space (PGAS) library interface specification. OpenSHMEM aims to provide a standard Application Programming Interface (API) for SHMEM libraries to aid portability and facilitate uniform predictable results of OpenSHMEM programs by explicitly stating the behavior and semantics of the OpenSHMEM library calls.

IBM Spectrum™ MPI ships with a new implementation of OSHMEM 1.4. For more information on OSHMEM applications, see the OpenSHMEM Application Programming Interface PDF.

Parent topic: IBM Spectrum MPI applications

Interconnect selection


IBM Spectrum™ MPI includes shortcuts for specifying the communication method that is to be used between the ranks. At the Open MPI level, point-to-point communication is handled by a PML (point-to-point message layer), which can perform communications directly, or use an MTL (matching transport layer) or BTL (byte transfer layer) to accomplish its work.

The types of PMLs that can be specified include:

pami: IBM Spectrum MPI PAMI (Parallel Active Messaging Interface).

Note: PAMI is the default interconnect.

yalla: Mellanox MXM (Mellanox Messaging Accelerator)
cm: Uses an MTL layer
ob1: Uses a BTL layer

The types of MTLs that can be specified include:

mxm: An alternate Mellanox MXM. However, yalla is preferred.

The types of BTLs that can be specified include:

tcp: TCP/IP
openib: OpenFabrics InfiniBand

IBM Spectrum MPI provides the following shortcuts (mpirun options) that allow you to specify which PML, MTL, or BTL layer should be used. Specifying an option in uppercase letters (for example, -MXM) forces the related PML, MTL, or BTL layer. The lowercase options are equivalent to the uppercase options.

-PAMI | -pami Specifies that IBM Spectrum MPI's PAMI should be used by way of the PML pami layer.

-PAMI_NOIB | -pami_noib
When running with the PAMI PML on a single node, specifies a libpami.so that does not attempt to open any IBV devices, and communicates only by way of PAMI's shared memory and cross memory attach mechanisms. The -PAMI_NOIB option is available for use with Power Systems™ servers only.

Important: You cannot use the -pami_noib option with On Demand Paging (ODP) turned on, that is, -mca common_pami_use_odp 1. They are not compatible because ODP requires IB to operate. Using the -pami_noib option with -mca common_pami_use_odp 1 in an mpirun or jsrun command causes the user's application to crash.

You can also enable the -PAMI_NOIB option with one of the following methods:

In the MPI_ROOT/etc/openmpi-mca-params.conf file, specify schizo_ompi_prepend_ld_library_path = pami_noib.

From the mpirun command line, enter --mca schizo_ompi_prepend_ld_library_path = pami_noib.

-MXM | -mxm
Specifies that Mellanox MXM should be used by way of the PML yalla layer. This is the preferred method.

-MXMC | -mxmc
Specifies that Mellanox MXM should be used by way of the PML cm and MTL mxm layers.

-PSM | -psm
Specifies that Intel™ PSM (formerly from QLogic) should be used by way of the PML cm and MTL psm layers.


-TCP | -tcp
Specifies that TCP/IP should be used by way of the PML ob1 and BTL tcp layers.

-UNIC | -unic | -USNIC | -usnic
Specifies that Cisco usNIC should be used by way of the PML ob1 and BTL usnic layers.

-IB | -ib | -IBV | -ibv | -OPENIB | -openib
Specifies that OpenFabrics InfiniBand should be used by way of the PML ob1 and BTL openib layers.

Using the PAMI verbs bypass
IBM Spectrum MPI includes a version of Parallel Active Messaging Interface (PAMI) that has in-lined many of the verb header structures, and has removed many of the conditional instructions (such as error checking) to reduce the latency for very short messages. PAMI files are located in the /opt/ibm/spectrum_mpi/lib/pami_451 directory, where 4.5.1 is the version of Mellanox OFED used to build and to run PAMI.

IBM Spectrum MPI supports Mellanox tag matching
Tag matching is the process of offloading the processing of MPI tag matching from the host system onto the network adapter.

PAMI asynchronous thread
The PAMI interconnect library supports a progress thread called asynchronous thread. The asynchronous thread provides extra progress for applications that might require additional progress.

Mellanox Multi-host feature
You can use the Mellanox Multi-host feature with an IBM® POWER9™ system that has a single Mellanox 2-port HCA adapter in the shared PCIe slot.

Specifying use of the FCA (hcoll) library
The IBM Spectrum MPI libcollectives collectives library is used by default. However, you can enable the Mellanox hcoll library (also known as FCA 3.x).

Managing on-host communication
If a BTL is used for point-to-point traffic, the most commonly-used on-host communication method is the shared memory BTL called vader. However, there is an alternate BTL called sm, and it is always possible to use an off-host BTL for on-host traffic, as well.

Specifying an IP network
If you are using TCP/IP, you can use the mpirun -netaddr option to specify the network over which traffic is sent.

Displaying communication methods between hosts
You can use IBM Spectrum MPI to print a two-dimensional table that shows the method that is used by each host to communicate with each of the other hosts.

Parent topic: Administering IBM Spectrum MPI

IBM Spectrum MPI supports Mellanox tag matching

Tag matching is the process of offloading the processing of MPI tag matching from the host system onto the network adapter.


If you do not use the Mellanox tag matching function, communication data is processed on the host CPU. In this scenario, if an application is busy with computational tasks, the communication data stalls until the CPU can process the communication data. However, if you use the Mellanox tag matching function, the communication data is processed in parallel with computational tasks by offloading the communication data onto the network adapter.

Spectrum™ MPI implements the Mellanox tag matching function through a 2-sided API. Spectrum MPI still enables the active message path as the default option. To enable tag matching with Spectrum MPI, you can run the mpirun -hwtm command.
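For example, a minimal sketch of enabling hardware tag matching for a run (the host list file and program name are placeholders):

mpirun -hwtm -np 4 --hostfile myhosts ./a.out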

Parent topic: Interconnect selection

Related information

Mellanox: Understanding tag matching

Mellanox Multi-host feature

You can use the Mellanox Multi-host feature with an IBM® POWER9™ system that has a single Mellanox 2-port HCA adapter in the shared PCIe slot.

The Mellanox Multi-host feature allows an HCA adapter to be shared between hosts, allowing each port on the adapter to use the adapter's full bandwidth capabilities. The following example shows a configuration for multi-host virtual adapters (mlx5_*):

mlx5_0, affinitized to socket 0, attached to physical HCA port 1
mlx5_1, affinitized to socket 0, attached to physical HCA port 2
mlx5_2, affinitized to socket 1, attached to physical HCA port 1
mlx5_3, affinitized to socket 1, attached to physical HCA port 2

Parent topic: Interconnect selection

Related information

Mellanox Multi-Host documentation

Specifying use of the FCA (hcoll) library

The IBM Spectrum™ MPI libcollectives collectives library is used by default. However, you can enable the Mellanox hcoll library (also known as FCA 3.x).

To enable the FCA (hcoll) library, you can use one of the following mpirun command line options:

-HCOLL | -FCA
Specifies that the hcoll collective library should be used universally.

-hcoll | -fca
Specifies that the IBM Spectrum MPI libcollectives collectives library retains the highest priority, but that it is able to fall back to any of the hcoll collectives.


Parent topic: Interconnect selection

Related information

IBM Spectrum MPI's collective library (libcollectives)

Managing on-host communication

If a BTL is used for point-to-point traffic, the most commonly-used on-host communication method is the shared memory BTL called vader. However, there is an alternate BTL called sm, and it is always possible to use an off-host BTL for on-host traffic, as well.

The vader BTL is likely to provide the best on-host performance, but it is possible for InfiniBand, for example, to provide higher on-host bandwidth than shared memory.

You can use the following options to specify how on-host communication should be performed. Note that these options only apply if a BTL is being used. They are not available for MXM, PSM, or PAMI.

-intra vader | -intra shm
Specifies that BTL=vader (shared memory) should be used for on-host traffic (only applies if the PML is already ob1).

-intra nic
Specifies that the off-host BTL should be used for on-host traffic.

-intra sm
Specifies that BTL=sm (an older shared memory component) should be used for on-host traffic (only applies if the PML is already ob1).

Parent topic: Interconnect selection

Specifying an IP network

If you are using TCP/IP, you can use the mpirun -netaddr option to specify the network over which traffic is sent. The following are the mpirun -netaddr options:

-netaddr spec,spec,...
Specifies the network to use for TCP/IP traffic. This option applies to control messages as well as the regular MPI rank traffic.

-netaddr type:spec,spec,...
Specifies the networks for particular types of traffic.

The type variable can be one of the following:

rank
Specifies the network for regular MPI rank-to-rank traffic.

control | mpirun
Specifies the network for control messages (for example, launching).

The spec variables can be one of the following:


An interface name. For example, eth0.
CIDR notation. For example, 10.10.1.0/24.

Parent topic: Interconnect selection

Displaying communication methods between hosts

You can use IBM Spectrum™ MPI to print a two-dimensional table that shows the method that is used by each host to communicate with each of the other hosts.

You can use the following options to print the table:

-prot
Displays the interconnect type that is used by each host. The first rank on each host connects to all peer hosts in order to establish connections that might otherwise be on-demand.

-protlazy
Similar to -prot. Displays the interconnect type that is used by each host at MPI_Finalize. Connections to peer hosts are not established, so it is possible that many peers are unconnected.

The output from either the -prot or -protlazy options looks similar to this:

Host 0 [mpi01] ranks 0 - 3
Host 1 [mpi02] ranks 4 - 7
Host 2 [mpi03] ranks 8 - 11
Host 3 [mpi04] ranks 12 - 15

 host | 0    1    2    3
======|=====================
    0 : shm  tcp  tcp  tcp
    1 : tcp  shm  tcp  tcp
    2 : tcp  tcp  shm  tcp
    3 : tcp  tcp  tcp  shm

Connection summary:
  on-host:  all connections are shm
  off-host: all connections are tcp

By default, the table only displays information for a maximum of 16 hosts (although the connection summary, which appears after the table, is not limited by size). If you have a larger cluster, you can use the MPI_PROT_MAX environment variable to increase the number of hosts that are displayed in the table. Note, however, that the larger this table becomes, the more difficult it is to use.
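For example, a sketch for a 32-host cluster (the host count, rank count, and program name are placeholders; -x exports the variable to the ranks):

export MPI_PROT_MAX=32
mpirun -np 128 --hostfile myhosts -x MPI_PROT_MAX -prot ./a.out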

Parent topic: Interconnect selection

Dynamic MPI profiling interface with layering

The MPI standard defines a profiling interface (PMPI) that allows you to create profiling libraries by wrapping any of the standard MPI routines. A profiling wrapper library contains a subset of redefined MPI_* entry points, and inside those redefinitions, a combination of both MPI_* and PMPI_* symbols are called.


This means that you can write functions with the MPI_* prefix that call the equivalent PMPI_* function. Functions that are written in this manner behave like the standard MPI function, but can also exhibit any other behavior that you add.

For example:

int MPI_Allgather(void *sbuf, int scount, MPI_Datatype sdt,
                  void *rbuf, int rcount, MPI_Datatype rdt, MPI_Comm comm)
{
    int rval;
    double t1, t2, t3;
    t1 = MPI_Wtime();
    MPI_Barrier(comm);
    t2 = MPI_Wtime();
    rval = PMPI_Allgather(sbuf, scount, sdt, rbuf, rcount, rdt, comm);
    t3 = MPI_Wtime();
    // record time waiting vs time spent in allgather..
    return(rval);
}

double MPI_Wtime()
{
    // insert hypothetical high-resolution replacement here, for example
}

Using two unrelated wrapper libraries is problematic because, in general, it is impossible to link them so that proper layering occurs.

For example, you could have two libraries:

libJobLog.so
In this library, MPI_Init and MPI_Finalize are wrapped, so that a log of every MPI job is generated, which lists hosts, run times, and CPU times.

libCollPerf.so
In this library, MPI_Init, MPI_Finalize and all the MPI collectives are wrapped, in order to gather statistics about how evenly the ranks enter the collectives.

With ordinary linking, each MPI_* call would resolve into one of the wrapper libraries, and from there, the wrapper library's call to PMPI_* would resolve into the bottom level library (libmpi.so). As a result, only one of the libraries would have its MPI_Init and MPI_Finalize routines called.

Defining consistent layering
You can define a consistent approach to layering, with dynamically loaded symbols, for any number of wrapper libraries.

Layered profiling implementation
Layered profiling is implemented by always linking MPI applications against a library called libmpiprofilesupport.so.

Using the MPE performance visualization tool
IBM Spectrum™ MPI includes version mpe2-2.4.9b of the MPE logging library from Argonne National Laboratory.

Using the MPE Jumpshot viewer
The Jumpshot viewer, which includes the jumpshot command, is a performance visualization tool that is distributed by Argonne National Laboratory with MPE.

Parent topic: Administering IBM Spectrum MPI


Defining consistent layering

You can define a consistent approach to layering, with dynamically loaded symbols, for any number of wrapper libraries.

If you have a wrapper library named libwrap.so, which redefines an MPI symbol, it can either call another MPI_* entry, or it can call a PMPI_* entry. In the case of ordinary single-level wrapping, the calls into MPI_* would resolve into libwrap.so first, and then libmpi.so if not found. The calls into PMPI_* would resolve into libmpi.so.

If multi-level layering were used, MPI_* would resolve to the current level and PMPI_* would resolve to the next level down in the hierarchy of libraries.

One way to achieve consistent layering is to establish a list of logical levels, where each level consists of MPI_* entry points from a given library. The bottom level would consist of MPI_* entry points from the base MPI library (libmpi.so). For example:

Level 0: libJobLog.so
Level 1: libCollPerf.so
Level 2: libmpi.so

When an application makes an MPI call, a depth counter would start at level 0 and search down the list until it finds a level that defines that MPI call. From there, if that routine calls another MPI or PMPI function, the depth counter would remain the same or be incremented respectively, to control the level from which the next function is called.

Using the mpirun -entry option to define consistent layering
You can establish this layering scheme by using the mpirun command line option -entry.

With -entry, you can specify a library in the form libfoo.so, /path/to/libfoo.so, or simply foo (which will be automatically expanded into libfoo.so for simple strings). For example, the following specification:

% mpirun -entry JobLog,CollPerf -np 2 ./example.x

is automatically expanded to:

% mpirun -entry libJobLog.so,libCollPerf.so -np 2 ./example.x

Note that the order in which you specify a list of libraries dictates each library's placement in the hierarchy of levels. By default, the base product's MPI library, libmpi.so, is placed at the bottom of the list, so it does not need to be specified with -entry. However, the -entrybase (or -baseentry) option enables you to specify a different library from which to get the bottom level MPI calls.

Note:

A profiling wrapper library cannot be specified with the mpirun -entry option unless it is implemented as a shared library.
In order for the libraries to be found, you must either set LD_LIBRARY_PATH or specify full paths to the libraries.
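As a hedged sketch of building and loading such a wrapper (assuming the MPI compiler wrapper mpicc is on your PATH, and that joblog.c is a hypothetical source file containing the redefined MPI_* entry points):

# Build the profiling wrapper as a shared library
mpicc -shared -fPIC -o libJobLog.so joblog.c
# Make it findable at run time and insert it into the PMPI layering
export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
mpirun -entry JobLog -np 2 ./a.out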

The syntax of the mpirun -entry option is:

mpirun -entry library Specifies a list of PMPI wrapper libraries.


mpirun -entry fort
Specifies the level at which to install the base MPI product's Fortran calls, which, at a minimum, wrap the C calls. The Fortran calls are placed at the top level, by default.

mpirun -entrybase library
Specifies an alternate library from which to get the bottom level calls.

mpirun -baseentry library
Synonym for mpirun -entrybase library.

mpirun -entry v
Prints verbose output that shows the layering levels of the MPI entry points.

For example:

Entrypoint MPI wrapper levels:
  1. (fortran from base product)
  2. libJobLog.so
  3. libCollPerf.so
  4. base product
Entrypoint MPI base product: (base MPI product as linked)

mpirun -entry vv
Prints more detailed verbose output than the -entry v option. The -entry vv option shows the levels that are intended to be used, and confirms the libraries that are being opened. The output from -entry vv is less readable, but it allows you to confirm, more visibly, that interception is taking place.

By default, the top layer is always the Fortran calls from the base MPI product. The Fortran calls are wrappers over corresponding C routines. As a result, if a profiling library intercepts the C call MPI_Send, and an application makes the Fortran call mpi_send, the profiling library's MPI_Send gets called, essentially wrapping Fortran for free. If this is not the behavior you want, you can include the fort string with the -entry option to specify where the base product's Fortran symbols should go. Specifying fort last is equivalent to not treating the Fortran symbols as special, and so wrapping C functions is unconnected to wrapping Fortran functions.

Parent topic: Dynamic MPI profiling interface with layering

Layered profiling implementation

Layered profiling is implemented by always linking MPI applications against a library called libmpiprofilesupport.so.

For performance, the default libmpiprofilesupport.so library is an empty stub and is, therefore, inactive in ordinary runs. When you specify -entry with a list of libraries, LD_LIBRARY_PATH is modified to include an alternate libmpiprofilesupport.so that redefines all MPI symbols, thereby allowing the layered profiling scheme.

When -entry is not used, there is no performance impact from being linked against the empty stub library. When -entry is used, the performance impact varies, depending on the machine. However, -entry has been seen to impact ping pong latency by approximately 15 nanoseconds.

Parent topic: Dynamic MPI profiling interface with layering


Using the MPE performance visualization tool

IBM Spectrum™ MPI includes version mpe2-2.4.9b of the MPE logging library from Argonne National Laboratory.

MPE uses the PMPI (standard MPI profiling) interface to provide graphical profiles of MPI traffic forperformance analysis. The MPE library is packaged with IBM Spectrum MPI as libmpe.so and can beaccessed dynamically with the mpirun -entry command without requiring the application to be recompiled orrelinked. For example:

% mpirun -np 2 -entry mpe ./program.x

The preceding command turns on MPE tracing and produces a logfile as output in the working directory ofrank 0 (for example, program.x.clog2). The jumpshot command can be used to convert this log file to differentformats and to view the results.

Parent topic: Dynamic MPI profiling interface with layering

Using the MPE Jumpshot viewer

The Jumpshot viewer, which includes the jumpshot command, is a performance visualization tool that is distributed by Argonne National Laboratory with MPE.

The jumpshot command is also included with IBM Spectrum™ MPI (in the bin directory). The jumpshot command can be used to view the MPE tracing output file, as follows:

% jumpshot program.x.clog2

Note that Jumpshot requires Java™. If Java is not in the path, you can set the JVM environment variable to the full path of the Java executable on your system.

The first time you run the jumpshot command, it might issue a prompt that asks you if you want to create a setup file with the default settings. Click OK and Yes. After that, for regular runs on a .clog2 file, Jumpshot issues another prompt that asks if you want to convert to the SLOG2 format. Click Yes, and then, on the next window, click Convert and then OK. The main window is then displayed with the MPE profiling data.

When using Jumpshot to view the MPE timings, several pop-up windows appear. The most important windows are the main window and a window that indicates the MPI calls by color. Time spent in the various MPI calls is displayed in different colors, and messages are shown as arrows. Right-click on the calls and the message arrows for more information.

Parent topic: Dynamic MPI profiling interface with layering

Related information

Performance visualization information at Argonne National Laboratory's website

Managing IBM Spectrum MPI process placement and affinity


IBM Spectrum™ MPI follows Open MPI's support of processor affinity for improving performance. With processor affinity, MPI processes and their threads are bound to specific hardware resources such as cores, sockets, and so on.

Open MPI's mpirun affinity options are based on the notions of mapping, ranking, and binding as separate steps, as follows:

Mapping
Mapping determines the number of processes that are launched, and on which hosts. Mapping can also be used to associate the hardware resources, such as sockets and cores, with each process.

Ranking
Ranking determines an MPI rank index for each process in the mapping. If options are not used to specify ranking behavior, a default granularity is chosen. The ranks are interleaved over the chosen granularity element to produce an ordering.

Binding
Binding is the final step and can deviate from the hardware associations that were made at the mapping stage. The binding unit can be larger or smaller than specified by the mapper, and is expanded or round-robined to achieve the final binding.

IBM Spectrum MPI affinity shortcuts
IBM Spectrum MPI provides shortcuts for some of the underlying Open MPI affinity options, by using the -aff option.

IBM PE Runtime Edition affinity equivalents
If you are migrating from IBM® Parallel Environment Runtime Edition, you can use the MP_TASK_AFFINITY and MP_CPU_BIND_LIST environment variable settings to create nearly the same functionality.

Mapping options and modifiers
This topic collection explains some of the options that are available for mapping and includes examples. Ranking and binding options are sometimes shown in the mapping examples.

Helper options
You can use the -display-devel-map option and the -report-bindings option to help understand MPI placement and affinity.

Managing oversubscription
Oversubscription refers to the concept of allowing more ranks to be assigned to a host than the number of slots that are available on that host.

Managing overload
Overload occurs in the binding stage of affinity, when ranks are assigned sets of cores. There is a small check to see if more ranks are assigned to any hardware element than there are cores within that hardware element. In that case, the MPI job aborts.

OpenMP (and similar APIs)
Open MPI only binds at the process level. The number of threads that are created by a rank and the binding of those threads is not directly controlled by Open MPI. However, by default, created threads would inherit the full mask that is given to the rank.

Parent topic: Administering IBM Spectrum MPI

IBM Spectrum MPI affinity shortcuts

IBM Spectrum™ MPI provides shortcuts for some of the underlying Open MPI affinity options, by using the -aff option.

IBM Spectrum MPI provides the following -aff shortcuts:

-aff bandwidth
Emulates the -map-by socket, -rank-by core, and -bind-to core Open MPI options. This shortcut alternates between sockets and puts ranks in a natural hardware order.

-aff latency
Emulates the -map-by core, -rank-by core, and -bind-to core Open MPI options. This shortcut packs ranks across the natural hardware order.

-aff cycle:<unit>
Emulates the -map-by <unit>, -rank-by <unit>, and -bind-to core Open MPI options. This shortcut alternates between the specified hardware elements.

-aff width:<unit>
Specifies an alternative -bind-to unit value. The value that is specified for <unit> can be hwthread, core, socket, or numa.

-aff v / -aff vv
Emulates the --report-bindings Open MPI option. This shortcut specifies verbose output.

-aff on
Enables affinity with the default option of bandwidth binding.

-aff auto
Same shortcut as the -aff bandwidth shortcut.

-aff default
Same shortcut as the -aff bandwidth shortcut.

-aff off
Disables affinity (unbind).

-aff none
Same shortcut as the -aff off shortcut.

-aff <shortcut>,<shortcut>,...
Specifies a comma-separated list of shortcuts. For example, -aff bandwidth,latency,v.
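For example, a hedged invocation (the program name ./a.out is illustrative) that combines the bandwidth shortcut with verbose binding output, using the comma-separated form shown above:

% mpirun -host hostA:4,hostB:2 -aff bandwidth,v ./a.out

Based on the shortcut definitions, this should behave like specifying -map-by socket -rank-by core -bind-to core together with --report-bindings.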

Bandwidth binding
The goal of bandwidth binding is to spread the ranks over a broad range of machine resources, such as cache and memory controllers. Another goal of bandwidth binding is to maintain the concept that closely numbered ranks are as topologically close to each other as possible. In the following example of bandwidth binding, the communication between ranks R1 and R2 might be faster if the ranks were on the same socket, but the overall application performance is better because of greater memory bandwidth and cache availability:

% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/../../../../../../..]

The shortcut for this bandwidth binding example is -aff bandwidth and it is equivalent to the -map-by socket, -rank-by core, and -bind-to core options in Open MPI.

Latency binding
Latency binding attempts to maximize communication speed between ranks. In applications that use a small amount of memory, it might not be ideal to span more sockets and cache. Therefore, you can use latency binding to prioritize communication speeds. The following example shows latency binding:

% mpirun -host hostA:4,hostB:2 -map-by core -rank-by core -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../BB/../../../../..][../../../../../../../..]
R3 hostA [../../../BB/../../../..][../../../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../BB/../../../../../..][../../../../../../../..]

The shortcut for this latency binding example is -aff latency and it is equivalent to the -map-by core, -rank-by core, and -bind-to core options in Open MPI.

Cyclic binding
Cyclic binding is similar to bandwidth binding, but it does not use the -rank-by core option when reordering the output. The following example shows cyclic binding:

% mpirun -host hostA:4,hostB:2 -map-by socket -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/../../../../../../..]
R2 hostA [../BB/../../../../../..][../../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/../../../../../../..]

The shortcut for this cyclic binding example is -aff cycle:<unit> and it is equivalent to the -map-by <unit>, -rank-by <unit>, and -bind-to core options in Open MPI.

Parent topic: Managing IBM Spectrum MPI process placement and affinity

IBM PE Runtime Edition affinity equivalents

If you are migrating from IBM® Parallel Environment Runtime Edition, you can use the MP_TASK_AFFINITY and MP_CPU_BIND_LIST environment variable settings to create nearly the same functionality.

MP_TASK_AFFINITY=core
The options -map-by core, -map-by socket, -rank-by core, and -bind-to core offer similar functionality to the MP_TASK_AFFINITY=core environment variable setting.

MP_TASK_AFFINITY=core:n
Learn more about the MP_TASK_AFFINITY=core:n environment variable setting.

MP_TASK_AFFINITY=cpu
Learn more about the MP_TASK_AFFINITY=cpu environment variable setting.

MP_TASK_AFFINITY=cpu:n
Learn more about the MP_TASK_AFFINITY=cpu:n environment variable setting.

MP_TASK_AFFINITY=mcm
The functionality of the -map-by socket or -map-by numa options is similar to the MP_TASK_AFFINITY=mcm environment variable setting. Note that in Open MPI terminology, node refers to a full host. The NUMA node level is referred to as numa.

MP_CPU_BIND_LIST=list_of_hyper-threads
In Open MPI, specific bindings on a per-rank basis can be made using a rankfile.


Parent topic: Managing IBM Spectrum MPI process placement and affinity

MP_TASK_AFFINITY=core

The options -map-by core, -map-by socket, -rank-by core, and -bind-to core offer similar functionality to the MP_TASK_AFFINITY=core environment variable setting.

For example:

% mpirun -host hostA:4,hostB:2 -map-by core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../BB/../../../../..][../../../../../../../..]
R3 hostA [../../../BB/../../../..][../../../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../BB/../../../../../..][../../../../../../../..]

% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/../../../../../../..]

Parent topic: IBM PE Runtime Edition affinity equivalents

MP_TASK_AFFINITY=core:n

Learn more about the MP_TASK_AFFINITY=core:n environment variable setting.

The following options offer similar functionality to the MP_TASK_AFFINITY=core:n environment variable setting:

-map-by slot:pe=n
-map-by socket:pe=n
-map-by ppr:ranks-per-socket:slot:pe=n
-map-by ppr:ranks-per-socket:socket:pe=n

Depending on the launching method, the rank count that is produced by the -map-by unit:pe=n option might not be what you expect because each rank uses n slots.

For example:

% mpirun -host hostA:8,hostB:4 -map-by slot:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]

% mpirun -host hostA:8,hostB:4 -map-by socket:pe=2 -rank-by core ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/../../../../../..]

Using a host file and the -mca rmaps seq option allows specific control of host layout, as long as a packed-style binding is acceptable:

% mpirun -hostfile hostfile --mca rmaps seq -map-by slot:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostA [../../../../../../../..][BB/BB/../../../../../..]

For the -map-by ppr options, the slot count must be able to satisfy the specified processes per resource, and the resulting layout across the hosts is chosen by MPI. For example, the following command is invalid because the two slots that are listed as available on hostB are not enough to satisfy the instruction to put four processes on each host.

% mpirun -host hostA:4,hostB:2 -map-by ppr:4:node:pe=2

In the next example, the instruction to put four ranks per host (node) is followed. Even though hostA is listed as having six slots, only four processes are placed on it.

% mpirun -host hostA:6,hostB:4 -map-by ppr:4:node:pe=2
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostB [../../../../BB/BB/../..][../../../../../../../..]
R7 hostB [../../../../../../BB/BB][../../../../../../../..]

Parent topic: IBM PE Runtime Edition affinity equivalents

MP_TASK_AFFINITY=cpu

Learn more about the MP_TASK_AFFINITY=cpu environment variable setting.

The following options offer similar functionality to the MP_TASK_AFFINITY=cpu environment variable setting:

-map-by hwthread
-map-by socket
-rank-by hwthread
-bind-to hwthread

For example:

% mpirun -host hostA:4,hostB:2 -map-by hwthread ...

R0 hostA [B./../../../../../../..][../../../../../../../..]
R1 hostA [.B/../../../../../../..][../../../../../../../..]
R2 hostA [../B./../../../../../..][../../../../../../../..]
R3 hostA [../.B/../../../../../..][../../../../../../../..]
R4 hostB [B./../../../../../../..][../../../../../../../..]
R5 hostB [.B/../../../../../../..][../../../../../../../..]

% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by hwthread -bind-to hwthread ...
R0 hostA [B./../../../../../../..][../../../../../../../..]
R1 hostA [.B/../../../../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][B./../../../../../../..]
R3 hostA [../../../../../../../..][.B/../../../../../../..]
R4 hostB [B./../../../../../../..][../../../../../../../..]
R5 hostB [.B/../../../../../../..][../../../../../../../..]
R6 hostB [../../../../../../../..][B./../../../../../../..]
R7 hostB [../../../../../../../..][.B/../../../../../../..]

Parent topic: IBM PE Runtime Edition affinity equivalents

MP_TASK_AFFINITY=cpu:n

The following options offer similar functionality to the MP_TASK_AFFINITY=cpu:n environment variable setting:

-map-by slot:pe=n -use-hwthread-cpus
-map-by socket:pe=n -use-hwthread-cpus
-map-by ppr:ranks-per-host:node:pe=n -use-hwthread-cpus
-map-by ppr:ranks-per-socket:socket:pe=n -use-hwthread-cpus

The -use-hwthread-cpus option causes the pe=n option to refer to hyper-threads instead of cores.

For example:

% mpirun -host hostA:16,hostB:8 -map-by slot:pe=4 -use-hwthread-cpus ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]

In the preceding example, the slot counts in the -host option are again increased to achieve the desired rank counts, because each rank is using four slots.

% mpirun -host hostA:16,hostB:8 -map-by socket:pe=4 -use-hwthread-cpus ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/BB/../../../../../..]
R2 hostA [../../BB/BB/../../../..][../../../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/../../../../../..]

The -map-by ppr option over hyper-threads works similarly:

% mpirun -host hostA:4,hostB:4 -map-by ppr:4:node:pe=4 -use-hwthread-cpus ...

R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostB [../../../../BB/BB/../..][../../../../../../../..]
R7 hostB [../../../../../../BB/BB][../../../../../../../..]

% mpirun -host hostA:4,hostB:4 -map-by ppr:2:socket:pe=4 -use-hwthread-cpus ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostB [../../../../../../../..][BB/BB/../../../../../..]
R7 hostB [../../../../../../../..][../../BB/BB/../../../..]

Parent topic: IBM PE Runtime Edition affinity equivalents

MP_TASK_AFFINITY=mcm

The functionality of the -map-by socket or -map-by numa options is similar to the MP_TASK_AFFINITY=mcm environment variable setting. Note that in Open MPI terminology, node refers to a full host. The NUMA node level is referred to as numa.

In Open MPI, the levels are:

hwthread (hyper-thread, or cpu in IBM® PE Runtime Edition terminology)
core
L1cache
L2cache
L3cache
numa (a NUMA node)
socket
board
node (the full host)

In Open MPI, the mcm level would equate to either socket or numa. For example:

% mpirun -host hostA:4,hostB:4 -map-by numa ...
R0 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R2 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R3 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4 hostB [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]

% mpirun -host hostA:4,hostB:4 -map-by socket -rank-by core ...
R0 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R3 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4 hostB [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]

Parent topic: IBM PE Runtime Edition affinity equivalents


MP_CPU_BIND_LIST=list_of_hyper-threads

In Open MPI, specific bindings on a per-rank basis can be made using a rankfile. The list of numbers that is specified in the rankfile refers to cores, and uses logical hardware ordering. If s:a-b is given, it refers to a socket and a range of cores on that socket.

For example:

% cat rankfile
rank 0=hostA slot=0,1
rank 1=hostA slot=2-3
rank 2=hostA slot=1:4-5
rank 3=hostA slot=0:4-7
rank 4=hostB slot=0-1,8,9
rank 5=hostB slot=2-3,7,8,10-11

% mpirun -rankfile rankfile
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][../../../../BB/BB/../..]
R3 hostA [../../../../BB/BB/BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][BB/BB/../../../../../..]
R5 hostB [../../BB/BB/../../../BB][BB/../BB/BB/../../../..]

When the -use-hwthread-cpus option is used, the numbers in the rankfile refer to hyper-threads (using logical hardware order):

% cat rankfile
rank 0=hostA slot=0-7
rank 1=hostA slot=4,5,6,7,8,9,10,11
rank 2=hostA slot=8-15
rank 3=hostA slot=0-23
rank 4=hostB slot=0-3,16-19
rank 5=hostB slot=4-7,20-23

% mpirun -rankfile rankfile -use-hwthread-cpus
R0 hostA [BB/BB/BB/BB/../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/BB/BB/../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/BB/BB][../../../../../../../..]
R3 hostA [BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][BB/BB/../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../BB/BB/../../../..]

If the socket:core#-core# syntax is used in a rankfile, those lines are still interpreted as socket:core even though the -use-hwthread-cpus option is specified.

For example:

% cat rankfile
rank 0=hostA slot=2-3
rank 1=hostA slot=1:2-3

% mpirun -rankfile rankfile -use-hwthread-cpus
R0 hostA [../BB/../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][../../BB/BB/../../../..]


Parent topic: IBM PE Runtime Edition affinity equivalents

Mapping options and modifiers

This topic collection explains some of the options that are available for mapping and includes examples. Ranking and binding options are sometimes shown in the mapping examples.

--map-by unit option
Learn when and how to use the --map-by unit option.

--map-by slot option
Mapping by slot resembles mapping by an actual hardware unit within the hosts, but each slot is associated with the whole host. The slot is essentially an imaginary hardware unit that exists in a certain number on each host.

--map-by unit:PE=n and --map-by slot:PE=n options
This option is used to bind n cores to each process.

--map-by ppr:n:unit and --map-by ppr:n:unit:pe=n options
Learn how and when to use the --map-by ppr:n:unit and --map-by ppr:n:unit:pe=n options.

--map-by dist:span option (adapter affinity)
Learn when and how to use the --map-by dist:span option (adapter affinity).

Parent topic: Managing IBM Spectrum MPI process placement and affinity

--map-by unit option

Learn when and how to use the --map-by unit option.

When using the --map-by unit option, unit can be any of the following values:

hwthread
core
L1cache
L2cache
L3cache
socket
numa
board
node

--map-by unit is the most basic of the mapping policies, and makes process assignments by iterating over the specified unit until the process count reaches the number of available slots.

The following example shows the output (in verbose mode) of the --map-by unit option, where core is the specified unit.

% mpirun -host hostA:4,hostB:2 -map-by core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../BB/../../../../..][../../../../../../../..]
R3 hostA [../../../BB/../../../..][../../../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../BB/../../../../../..][../../../../../../../..]

This is sometimes called a packed or latency binding because it tends to produce the fastest communication between ranks.

The following example shows the output (in verbose mode) of using the --map-by unit option, where socket is the specified unit.

% mpirun -host hostA:4,hostB:2 -map-by socket ...
R0 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R2 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R3 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4 hostB [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]

In the preceding examples, -host hostA:4,hostB:2 indicates that the cluster has six slots (spaces in which a process can run). Each rank consumes one slot, and processes are assigned hardware elements by iterating over the specified unit until the available slots are consumed.

The ordering of these examples is implicitly core and socket, respectively, so core and socket are iterated for each rank assignment. The binding is also implicitly core and socket, respectively, so the final binding is to the same element that was chosen by the mapping.

When options, such as the ranking unit and binding unit, are not explicitly specified, the -display-devel-map option can be used to display the implicit selections. In the preceding examples, the -display-devel-map output includes the following, respectively:

Mapping policy: BYCORE
Ranking policy: CORE
Binding policy: CORE:IF-SUPPORTED

Mapping policy: BYSOCKET
Ranking policy: SOCKET
Binding policy: SOCKET:IF-SUPPORTED

If no binding options are specified, by default, Open MPI assumes --map-by socket for jobs with more than two ranks. This produces the interleaved ordering in the preceding examples.

Note
IBM Spectrum™ MPI enables binding by default when using the orted tree to launch jobs. The default binding for a node that is less than fully subscribed, or fully subscribed, is --map-by socket. In this case, users might see improved latency by using either the -aff latency or --map-by core option. For more information on the -aff latency option, see the IBM Spectrum MPI affinity shortcuts topic.

A natural hardware ordering can be created by specifying a smaller unit over which to iterate for ranking.

For example:

% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core ...
R0 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R3 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4 hostB [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]

A common binding pattern involves binding to cores, but spanning those core assignments over all of the available sockets.

For example:

% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/../../../../../../..]

In this example, the final binding unit is smaller than the hardware selection that was made in the mapping step. As a result, the cores within the socket are iterated over for the ranks on the same socket. When the mapping unit and the binding unit differ, the -display-devel-map output can be used to display the mapping output from which the binding was taken. For example, at rank 0, the -display-devel-map output includes:

Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
Binding: [BB/../../../../../../..][../../../../../../../..]

A possible purpose for this binding is to use all the available hardware resources such as cache and memory bandwidth. This is sometimes called a bandwidth binding, and is a good starting point for overall application performance. The amount of cache and memory bandwidth is maximized, and the ranks are ordered so that close ranks by index are near each other in the hardware as much as possible while still spanning the available sockets.

On the hardware used in these examples, socket and numa are the same. On some hardware it may be desirable to iterate the process placement over the NUMA nodes instead of over the sockets. In this case, -map-by numa can be used.

For example:

% mpirun -host hostA:4,hostB:2 -map-by numa -rank-by core -bind-to core ...
R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../BB/../../../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostB [BB/../../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/../../../../../../..]

Note
In Open MPI's terminology, numa refers to a NUMA node within a host, while node refers to the whole host.

In the following example, the host (node) is iterated for process assignments. The ranking unit is also implicitly node, so the ordering of the ranks alternates between the hosts as well. However, the binding unit defaults to the smaller socket element and, similar to the preceding bandwidth example, iterates over sockets for subsequent ranks that have the same node binding at the mapping step.

For example:

% mpirun -host hostA:4,hostB:2 -map-by node ...
R0 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1 hostB [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R3 hostB [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4 hostA [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5 hostA [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]

Parent topic: Mapping options and modifiers

--map-by slot option


Mapping by slot resembles mapping by an actual hardware unit within the hosts, but each slot is associated with the whole host. The slot is essentially an imaginary hardware unit that exists in a certain number on each host. Because the slot does not represent a specific subset of cores within a host, slots can be useful in separating the assignment of processes to hosts from the assignment of processes to specific sockets or cores within the host.
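For example, a minimal sketch (the program name ./a.out is illustrative) using the same two-host layout as the other examples in this section: because slots are filled host by host, ranks 0 through 3 land on hostA and ranks 4 and 5 on hostB, and the choice of cores within each host is left to the binding step.

% mpirun -host hostA:4,hostB:2 -map-by slot ./a.out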

Parent topic: Mapping options and modifiers

--map-by unit:PE=n and --map-by slot:PE=n options

This option is used to bind n cores to each process.

This option requires that the specified unit contains at least n cores (or that slot is used). Otherwise, process assignments are iterated, as in the examples for --map-by unit and --map-by slot, with the caveat that each process assignment also consumes n slots.

For example:

% mpirun -host hostA:4,hostB:2 -map-by socket:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/BB/../../../../../..]
R2 hostB [BB/BB/../../../../../..][../../../../../../../..]

The most immediate point of interest in this example is that the rank count is only three, not six. This is because each process is consuming n=2 slots. In launching modes where the slot count represents the number of cores, this is probably desirable because it results in bindings that consume the available number of cores. However, if a specific rank count is desired, the -host launching method becomes inconvenient.

For example:

% mpirun -host hostA:8,hostB:4 -map-by socket:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/BB/../../../../../..]
R2 hostA [../../BB/BB/../../../..][../../../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/../../../../../..]

This example shows that the sockets are still iterated over and that the binding width becomes two cores.

If alternating sockets are not desired, a similar mapping can be accomplished by using slots.

For example:

% mpirun -host hostA:8,hostB:4 -map-by slot:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]

The preceding example resembles a packed binding. It also illustrates how iterating over slots for the mapping causes processes to be assigned to the same host, while leaving the assignment to cores within the host to the binding step.


Because the slot is an imaginary, largest-possible hardware unit inside the host that maps to the entire host, iterating rank placements over the slots causes processes to be assigned to the same host, until that host is full, and then moved to the next host. At the mapping stage, each process is assigned to the whole host because that is what a slot is. This can be seen in the output of --display-devel-map, which shows that the binding is not made more specific until the binding stage:

Locale: NODE

Binding: [BB/../../../../../../..][../../../../../../../..]

A similar bandwidth style binding can be produced by adding a -rank-by core to the socket mapping:

% mpirun -host hostA:8,hostB:4 -map-by socket:pe=2 -rank-by core ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/../../../../../..]

In the preceding examples, the slot counts in -host were modified to produce a desired rank count. A host file, with the special sequential option for the mapper, can be used to force any mapping of processes to hosts: --mca rmaps seq -hostfile file.

% cat hostfile
hostA
hostA
hostA
hostA
hostB
hostB
hostA

% mpirun -hostfile hostfile --mca rmaps seq -map-by socket:pe=2 ...
% mpirun -hostfile hostfile --mca rmaps seq -map-by slot:pe=2 ...

R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../BB/BB/../..][../../../../../../../..]
R3 hostA [../../../../../../BB/BB][../../../../../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostA [../../../../../../../..][BB/BB/../../../../../..]

The sequential mapper with a host file allows very flexible rank layouts to be made, but a side effect is that the mapping step only outputs host mapping information. Normally the two preceding examples would differ, with the -map-by socket alternating between the sockets to produce a more bandwidth-style result. But the sequential mapper's output is more coarse, and the preceding core mappings occur at the binding step.

The tradeoff here is minor, especially if you are launching fully subscribed jobs, in which case latency and bandwidth bindings are identical. Also, the sequential mapper requires that either a -map-by or -bind-to option be specified; otherwise, it is incomplete and fails to launch.

Parent topic: Mapping options and modifiers

--map-by ppr:n:unit and --map-by ppr:n:unit:pe=n options


Learn how and when to use --map-by ppr:n:unit and --map-by ppr:n:unit:pe=n options.

The ppr (processes per resource) mode is a convenient shortcut for specifying the number of processes to run on each resource (a socket, for example).

The purpose of the ppr:n:socket option is to launch n ranks on each socket. The purpose of the ppr:n:socket:pe=m option is to launch n ranks per socket, with each rank using m cores.

The following restrictions apply to ppr mode:

It will only launch if the slot count is high enough to satisfy the ppr instruction, for example, if enough processes are being started to put n on each socket.
The cluster must be fairly homogeneous in order to be able to meaningfully specify a single number as the ranks per socket.

In the --map-by unit:PE=n and --map-by slot:PE=n options topic, special considerations were given to the launching method because the number of slots used was not one-per-process. However, with ppr, slots are not taken into account other than the requirement that enough slots exist to satisfy the specified processes per resource instruction.

% mpirun -host hostA:4,hostB:4 --map-by ppr:2:socket:pe=2 ...
R0 hostA [BB/BB/../../../../../..][../../../../../../../..]
R1 hostA [../../BB/BB/../../../..][../../../../../../../..]
R2 hostA [../../../../../../../..][BB/BB/../../../../../..]
R3 hostA [../../../../../../../..][../../BB/BB/../../../..]
R4 hostB [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../BB/BB/../../../..][../../../../../../../..]
R6 hostB [../../../../../../../..][BB/BB/../../../../../..]
R7 hostB [../../../../../../../..][../../BB/BB/../../../..]

Parent topic: Mapping options and modifiers

--map-by dist:span option (adapter affinity)

Learn when and how to use the --map-by dist:span option (adapter affinity).

This option, along with --mca rmaps_dist_device device_name (for example, ib0), can be used to enable adapter affinity in Open MPI.

With --mca rmaps_dist_device, Open MPI must be allowed to choose the rank layout, so an explicit hostfile should not be used with this mode.

For example:

% mpirun -host hostA,hostB -np 18 -bind-to core -map-by dist:span --mca rmaps_dist_device mthca0 ...
R0 hostA [../../../../../../../..][BB/../../../../../../..]
R1 hostA [../../../../../../../..][../BB/../../../../../..]
R2 hostA [../../../../../../../..][../../BB/../../../../..]
R3 hostA [../../../../../../../..][../../../BB/../../../..]
R4 hostA [../../../../../../../..][../../../../BB/../../..]
R5 hostA [../../../../../../../..][../../../../../BB/../..]
R6 hostA [../../../../../../../..][../../../../../../BB/..]
R7 hostA [../../../../../../../..][../../../../../../../BB]
R8 hostA [BB/../../../../../../..][../../../../../../../..]
R9 hostB [../../../../../../../..][BB/../../../../../../..]


R10 hostB [../../../../../../../..][../BB/../../../../../..]
R11 hostB [../../../../../../../..][../../BB/../../../../..]
R12 hostB [../../../../../../../..][../../../BB/../../../..]
R13 hostB [../../../../../../../..][../../../../BB/../../..]
R14 hostB [../../../../../../../..][../../../../../BB/../..]
R15 hostB [../../../../../../../..][../../../../../../BB/..]
R16 hostB [../../../../../../../..][../../../../../../../BB]
R17 hostB [BB/../../../../../../..][../../../../../../../..]

% mpirun -host hostA,hostB -np 10 -bind-to core -map-by dist:span,pe=2 --mca rmaps_dist_device mthca0 ...
R0 hostA [../../../../../../../..][BB/BB/../../../../../..]
R1 hostA [../../../../../../../..][../../BB/BB/../../../..]
R2 hostA [../../../../../../../..][../../../../BB/BB/../..]
R3 hostA [../../../../../../../..][../../../../../../BB/BB]
R4 hostA [BB/BB/../../../../../..][../../../../../../../..]
R5 hostB [../../../../../../../..][BB/BB/../../../../../..]
R6 hostB [../../../../../../../..][../../BB/BB/../../../..]
R7 hostB [../../../../../../../..][../../../../BB/BB/../..]
R8 hostB [../../../../../../../..][../../../../../../BB/BB]
R9 hostB [BB/BB/../../../../../..][../../../../../../../..]

The -map-by dist option without span is less useful, as it fills each host before moving to the next:

% mpirun -host hostA,hostB -np 17 -bind-to core -map-by dist --mca rmaps_dist_device mthca0 ...
R0 hostA [../../../../../../../..][BB/../../../../../../..]
R1 hostA [../../../../../../../..][../BB/../../../../../..]
R2 hostA [../../../../../../../..][../../BB/../../../../..]
R3 hostA [../../../../../../../..][../../../BB/../../../..]
R4 hostA [../../../../../../../..][../../../../BB/../../..]
R5 hostA [../../../../../../../..][../../../../../BB/../..]
R6 hostA [../../../../../../../..][../../../../../../BB/..]
R7 hostA [../../../../../../../..][../../../../../../../BB]
R8 hostA [BB/../../../../../../..][../../../../../../../..]
R9 hostA [../BB/../../../../../..][../../../../../../../..]
R10 hostA [../../BB/../../../../..][../../../../../../../..]
R11 hostA [../../../BB/../../../..][../../../../../../../..]
R12 hostA [../../../../BB/../../..][../../../../../../../..]
R13 hostA [../../../../../BB/../..][../../../../../../../..]
R14 hostA [../../../../../../BB/..][../../../../../../../..]
R15 hostA [../../../../../../../BB][../../../../../../../..]
R16 hostB [../../../../../../../..][BB/../../../../../../..]

Parent topic: Mapping options and modifiers

Helper options

You can use the -display-devel-map option and the -report-bindings option to help understand MPI placement and affinity.

-report-bindings option
This option displays the binding for each rank similarly to the preceding examples, but in a slightly more expanded format:

% mpirun -host hostA:4,hostB:2 --report-bindings -map-by core ...


[hostA:ppid] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[hostA:ppid] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[hostA:ppid] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[hostA:ppid] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]
[hostB:ppid] MCW rank 4 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[hostB:ppid] MCW rank 5 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]

-display-devel-map option
Much of the information displayed with this option is internal, but various parts of the output can be helpful in diagnosing why a certain affinity option is behaving the way it is.

The policy names that the output reports for mapping, ranking, and binding are particularly useful. The -display-devel-map option also displays the number of slots that are used, and, under the Locale: output, it shows the hardware associations that were made in the mapping stage.

For example:

% mpirun -host hostA:4,hostB:2 --display-devel-map -map-by core ...

Mapper requested: NULL  Last mapper: round_robin
Mapping policy: BYCORE  Ranking policy: CORE  Binding policy: CORE:IF-SUPPORTED
Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
Num new daemons: 0  New daemon starting vpid INVALID
Num nodes: 2

Data for node: hostA  State: 3
  Daemon: [[11988,0],1]  Daemon launched: True
  Num slots: 4  Slots in use: 4  Oversubscribed: FALSE
  Num slots allocated: 4  Max slots: 0
  Num procs: 4  Next node_rank: 4
  Data for proc: [[11988,1],0]
    Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
    State: INITIALIZED  App_context: 0
    Locale: [BB/../../../../../../..][../../../../../../../..]
    Binding: [BB/../../../../../../..][../../../../../../../..]
  Data for proc: [[11988,1],1]
    Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
    State: INITIALIZED  App_context: 0
    Locale: [../BB/../../../../../..][../../../../../../../..]
    Binding: [../BB/../../../../../..][../../../../../../../..]
  Data for proc: [[11988,1],2]
    Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
    State: INITIALIZED  App_context: 0
    Locale: [../../BB/../../../../..][../../../../../../../..]
    Binding: [../../BB/../../../../..][../../../../../../../..]
  Data for proc: [[11988,1],3]
    Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
    State: INITIALIZED  App_context: 0
    Locale: [../../../BB/../../../..][../../../../../../../..]
    Binding: [../../../BB/../../../..][../../../../../../../..]

Data for node: hostB  State: 3
  Daemon: [[11988,0],2]  Daemon launched: True
  Num slots: 2  Slots in use: 2  Oversubscribed: FALSE
  Num slots allocated: 2  Max slots: 0
  Num procs: 2  Next node_rank: 2
  Data for proc: [[11988,1],4]
    Pid: 0  Local rank: 0  Node rank: 0  App rank: 4
    State: INITIALIZED  App_context: 0
    Locale: [BB/../../../../../../..][../../../../../../../..]
    Binding: [BB/../../../../../../..][../../../../../../../..]
  Data for proc: [[11988,1],5]
    Pid: 0  Local rank: 1  Node rank: 1  App rank: 5
    State: INITIALIZED  App_context: 0
    Locale: [../BB/../../../../../..][../../../../../../../..]
    Binding: [../BB/../../../../../..][../../../../../../../..]

Parent topic: Managing IBM Spectrum MPI process placement and affinity

Managing oversubscription

Oversubscription refers to the concept of allowing more ranks to be assigned to a host than the number of slots that are available on that host.

For example, by default, the following command:

% mpirun -host hostA:2,hostB:2 -np 5 ...

Reports the following output:

There are not enough slots available in the system to satisfy the 5 slots that were requested by the application:

To allow the mapper to put more ranks on the hosts, the -oversubscribe modifier can be given to the mapper. For example:

% mpirun -host hostA:2,hostB:2 -np 5 -oversubscribe ...

In this way, three ranks would be placed on hostA and two ranks would be placed on hostB. In this example, hostA is considered to be oversubscribed.

For each host that is oversubscribed, the MPI progression is tuned to yield more when handling MPI traffic, in order to use fewer CPU cycles.

The -oversubscribe option does not affect CPU affinity.

Parent topic: Managing IBM Spectrum MPI process placement and affinity

Managing overload

Overload occurs in the binding stage of affinity, when ranks are assigned sets of cores. There is a small check to see if more ranks are assigned to any hardware element than there are cores within that hardware element. In that case, the MPI job aborts.


For example, on a machine with only 16 cores, the following command:

% mpirun -host hostA:17 --bind-to core ...

Produces an error message similar to the following:

A request was made to bind to that would result in binding more
processes than cpus on a resource:

  Bind to:     CORE
  Node:        hostA
  #processes:  2
  #cpus:       1

Here, the binding aborts unless overloading is allowed. Overloading can be allowed by using a binding modifier, as follows:

% mpirun -host hostA:17 --bind-to core:overload-allowed ...

The overload-allowed binding modifier produces the following affinity without aborting, even though 9 ranks appear on the first socket:

R0 hostA [BB/../../../../../../..][../../../../../../../..]
R1 hostA [../../../../../../../..][BB/../../../../../../..]
R2 hostA [../BB/../../../../../..][../../../../../../../..]
R3 hostA [../../../../../../../..][../BB/../../../../../..]
R4 hostA [../../BB/../../../../..][../../../../../../../..]
R5 hostA [../../../../../../../..][../../BB/../../../../..]
R6 hostA [../../../BB/../../../..][../../../../../../../..]
R7 hostA [../../../../../../../..][../../../BB/../../../..]
R8 hostA [../../../../BB/../../..][../../../../../../../..]
R9 hostA [../../../../../../../..][../../../../BB/../../..]
R10 hostA [../../../../../BB/../..][../../../../../../../..]
R11 hostA [../../../../../../../..][../../../../../BB/../..]
R12 hostA [../../../../../../BB/..][../../../../../../../..]
R13 hostA [../../../../../../../..][../../../../../../BB/..]
R14 hostA [../../../../../../../BB][../../../../../../../..]
R15 hostA [../../../../../../../..][../../../../../../../BB]
R16 hostA [BB/../../../../../../..][../../../../../../../..]

Parent topic: Managing IBM Spectrum MPI process placement and affinity

OpenMP (and similar APIs)

Open MPI only binds at the process level. The number of threads that are created by a rank and the binding of those threads is not directly controlled by Open MPI. However, by default, created threads would inherit the full mask that is given to the rank.

OpenMP should detect the number of hyper-threads in the process' mask to determine how many threads to create. Alternately, the number of threads to create can be set manually using the OMP_NUM_THREADS environment variable.

In general, OpenMP is also capable of binding the individual threads more specifically than the inherited mask for the whole process. However, the mechanism varies across versions of OpenMP (settings to explore for this option include GOMP_CPU_AFFINITY, OMP_PROC_BIND, and KMP_AFFINITY).
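For example, a hedged sketch of a hybrid MPI/OpenMP launch (the program name ./hybrid.x, the slot counts, and the thread count are illustrative): reserving four cores per rank with -map-by socket:pe=4 and exporting OMP_NUM_THREADS=4 through mpirun's -x option gives each rank a four-core mask from which its OpenMP runtime can create and place threads.

% mpirun -host hostA:16,hostB:16 -map-by socket:pe=4 -x OMP_NUM_THREADS=4 ./hybrid.x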


Parent topic: Managing IBM Spectrum MPI process placement and affinity

Tuning the runtime environment

IBM Spectrum™ MPI utilizes the parameters of the Modular Component Architecture (MCA) as the primary mechanism for tuning the runtime environment. Each MCA parameter is a simple key=value pair that controls a specific aspect of the Spectrum MPI functionality.

The MCA parameters can be set to meet your particular runtime requirements in several ways. They can be specified on the mpirun command line, exported as environment variables, or supplied in a separate text file.
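For example, the following hedged sketches show the three approaches for one illustrative parameter (btl_tcp_if_include). The OMPI_MCA_ environment variable prefix and the $HOME/.openmpi/mca-params.conf file location follow standard Open MPI conventions and are assumptions here rather than Spectrum MPI-specific paths.

% mpirun --mca btl_tcp_if_include ib0 -np 2 ./a.out                  # on the mpirun command line

% export OMPI_MCA_btl_tcp_if_include=ib0                             # as an environment variable
% mpirun -np 2 ./a.out

% echo "btl_tcp_if_include = ib0" >> $HOME/.openmpi/mca-params.conf  # in a text file
% mpirun -np 2 ./a.out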

Frameworks, components, and MCA parameters
In order to understand how to use MCA parameters, you first need to understand their relationship to MCA's frameworks and components.

Displaying a list of MCA parameters
The ompi_info command displays information about the IBM Spectrum MPI installation. It can be used to display the MCA parameters and their values for a specific framework, a specific component, or for the entire installation.

Optimizing non-contiguous data transfers
When you are using PAMI with the MPI_Put, MPI_Get, MPI_Send, or MPI_Recv commands, you can optimize non-contiguous data transfers.

Controlling the level of MCA parameters that are displayed
Although there are many MCA parameters, only a small number are of interest to any given user at any given time. To simplify things when listing the parameters that are available, IBM Spectrum MPI provides the ompi_info --level option, which allows you to limit the number and type of MCA parameters that are returned.

Setting MCA parameters
IBM Spectrum MPI gives precedence to parameter values that are set by using the mpirun command. Therefore, a parameter's value that was set by using the mpirun command overrides the same parameter that was previously set as an environment variable or in a text file.

Tuning multithread controls
You can optimize the PAMI PML implementation by tuning multithread controls.

Tunnel atomics
You can optimize the PAMI OSC implementation by using tunnel atomics.

Parent topic: Administering IBM Spectrum MPI

Frameworks, components, and MCA parameters

In order to understand how to use MCA parameters, you first need to understand their relationship to MCA's frameworks and components.

The MCA frameworks are divided into the following basic types:

OMPI frameworks (in the MPI layer)
ORTE frameworks (in the runtime layer)
OPAL frameworks (in the operating system and platform layer)

An MCA framework uses the MCA's services to find and load components (implementations of the framework's interface) at run time.


The frameworks within the OMPI, ORTE, and OPAL types are further divided into subgroups according to function. For example, the OMPI framework contains a subgroup called btl, which is used to send and receive data on different kinds of networks. And within the btl framework, there are Byte Transfer Layer-related components (for example, components for shared memory, TCP, InfiniBand, and so on), which can be used at runtime.

Likewise, there are many MCA parameters that allow you to control the runtime environment, and these parameters apply to the same groups as the frameworks and components. So, considering the example of the btl framework, there is a corresponding collection of MCA parameters that can be used for setting conditions for the Byte Transfer Layer.

The frameworks and their components change over time. For the most up-to-date list of the OMPI, ORTE, and OPAL frameworks, refer to the Open MPI readme file.

Parent topic: Tuning the runtime environment

Displaying a list of MCA parameters

The ompi_info command displays information about the IBM Spectrum™ MPI installation. It can be used to display the MCA parameters and their values for a specific framework, a specific component, or for the entire installation.

The ompi_info command includes many options, including --param, which you can use to display MCA parameters. In general, when using the --param option, you specify two arguments. The first argument is the component type (framework), and the second argument is the specific component.

ompi_info --param type component

Displaying the MCA parameters for a framework
To display the parameters for an entire framework, specify all for the second argument. This instructs ompi_info to display the MCA parameters and their values for all components of the specified type (framework).

For example:

ompi_info --param pml all

Displaying the MCA parameters for a component
To display the parameters for a particular component, specify the type (framework) as the first argument and the component name as the second argument. For example, to display the MCA parameters for the tcp component of the btl (Byte Transfer Layer) framework (the component that uses TCP for MPI communications), you could specify ompi_info as follows:

ompi_info --param btl tcp

Displaying the MCA parameters for an entire installation
To display the MCA parameters for all frameworks and components in an IBM Spectrum MPI installation, specify all for both arguments:


ompi_info --param all all

Parent topic: Tuning the runtime environment

Optimizing non-contiguous data transfers

When you are using PAMI with the MPI_Put, MPI_Get, MPI_Send, or MPI_Recv commands, you can optimize non-contiguous data transfers.

To turn on this optimization, you must specify the -ompi_common_pami_use_umr MCA option. This code path is turned off by default, and by default it is not called when communication occurs between two ranks on the same node.

For the optimization of non-contiguous data transfers to start, the following conditions apply:

- The ORIGIN or TARGET elements are a non-contiguous data type and are not the same process. For example, the optimization is not started if rank 0 calls the MPI_Put command with a target of rank 0.
- The message size can be configured with the MCA_ompi_common_pami_remote_umr_limit parameter. If you set this parameter to zero (MCA_ompi_common_pami_remote_umr_limit=0), the User-mode Memory Registration (UMR) code path is used on all supported non-contiguous data types, regardless of the message size.
- The default value for the MCA_ompi_common_pami_umr_use_local parameter is zero. When the default value is used for this parameter (MCA_ompi_common_pami_umr_use_local=0), the UMR code path can be called only on ranks that are on different nodes. If you set the MCA_ompi_common_pami_umr_use_local parameter to one (MCA_ompi_common_pami_umr_use_local=1), the UMR code path can be called on both local ranks and remote ranks.

  Note: If you use the -pami_noib option with the MCA_ompi_common_pami_umr_use_local=1 parameter, your application might experience crashes or undefined behavior.

- The default value for the maximum number of blocks for non-contiguous data types is 1024. You can change the default value with the MCA_ompi_common_pami_umr_max_list_len parameter.
- The non-contiguous data types must be single-dimensioned for optimization with UMR.
- The number of blocks cannot be larger than the value specified in the MCA_ompi_common_pami_umr_max_list_len parameter for the following non-contiguous data types:

  MPI_COMBINER_INDEXED
  MPI_COMBINER_HINDEXED
  MPI_COMBINER_INDEXED_BLOCK
  MPI_COMBINER_HINDEXED_BLOCK
  MPI_COMBINER_STRUCT
  MPI_COMBINER_VECTOR (1)
  MPI_COMBINER_HVECTOR (1)

- The stride limit cannot be larger than 64 KB (65536 bytes) for the following non-contiguous data types:

  MPI_COMBINER_VECTOR (2)
  MPI_COMBINER_HVECTOR (2)

Note:
1. When sending more than one element.
2. When sending one element.
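To make the conditions above concrete, the following minimal sketch builds a single-dimension, non-contiguous (vector) data type and uses it as both the origin and target type of an MPI_Put between two different ranks. It is an illustrative example only, not taken from the product documentation; the count, block length, and stride are arbitrary values chosen to stay within the block-count and 64 KB stride limits, and the UMR code path is exercised only when the -ompi_common_pami_use_umr option described above is enabled at run time.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative sketch: a strided (MPI_COMBINER_VECTOR) MPI_Put between two
 * different ranks. Run with at least two ranks and with the UMR
 * optimization enabled as described above. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        MPI_Finalize();
        return 0;                       /* the example needs two ranks */
    }

    /* 128 blocks (below the 1024 block limit) with a 64-byte stride
     * (well below the 64 KB stride limit). */
    const int count = 128, blocklen = 4, stride = 8;
    double *winbuf = calloc(count * stride, sizeof(double));
    double *origin = calloc(count * stride, sizeof(double));

    MPI_Win win;
    MPI_Win_create(winbuf, count * stride * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Single-dimension non-contiguous type (MPI_COMBINER_VECTOR). */
    MPI_Datatype vec;
    MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Origin (rank 0) and target (rank 1) are different processes. */
        MPI_Put(origin, 1, vec, 1, 0, 1, vec, win);
    }
    MPI_Win_fence(0, win);

    MPI_Type_free(&vec);
    MPI_Win_free(&win);
    free(winbuf);
    free(origin);
    MPI_Finalize();
    return 0;
}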

Parent topic: Tuning the runtime environment

Controlling the level of MCA parameters that are displayed

Although there are many MCA parameters, only a small number are of interest to any given user at any given time. To simplify things when listing the parameters that are available, IBM Spectrum™ MPI provides the ompi_info --level option, which allows you to limit the number and type of MCA parameters that are returned.

The following nine levels can be specified with the ompi_info --level option:

1. Basic information that is of interest to end users.
2. Detailed information that is of interest to end users.
3. All remaining information that is of interest to end users.
4. Basic information that is required for application tuners.
5. Detailed information that is required for application tuners.
6. All remaining information that is required for application tuners.
7. Basic information for Open MPI implementers.
8. Detailed information for Open MPI implementers.
9. All remaining information for Open MPI implementers.

By default, ompi_info displays only level 1 MCA parameters (basic information that is of interest to end users). However, you can display the MCA parameters for additional levels by using the ompi_info --level option.

For example:

ompi_info --param pml pami --level 9

Parent topic: Tuning the runtime environment

Related information

ompi_info man page

Setting MCA parameters

IBM Spectrum™ MPI gives precedence to parameter values that are set by using the mpirun command. Therefore, a parameter's value that was set by using the mpirun command overrides the same parameter that was previously set as an environment variable or in a text file.

Setting MCA parameters with the mpirun command

To specify MCA parameters on the mpirun command line, use the --mca option. The basic syntax is:

mpirun --mca param_name value


In the following example, the MCA mpi_show_handle_leaks parameter is set to a value of 1 and the program a.out is run with four processes:

mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
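As an illustration of what this parameter reports, the following hedged sketch is one possible a.out: it duplicates MPI_COMM_WORLD and never frees the copy, leaving a handle leak for mpi_show_handle_leaks to flag when MPI_Finalize is called. The program is an assumption for demonstration purposes, not an example from the product documentation.

#include <mpi.h>

/* Hypothetical a.out for the example above: the duplicated communicator is
 * intentionally never freed, so a handle is still allocated at
 * MPI_Finalize time and can be reported when mpi_show_handle_leaks is 1. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm dup_comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);   /* leaked on purpose */

    MPI_Finalize();
    return 0;
}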

Note that if you want to specify a value that includes multiple words, you must surround the value in quotes so that the shell and IBM Spectrum MPI understand that it is a single value.

For example:

mpirun --mca param "multiple word value" ...

Setting MCA parameters as environment variables

The way in which you specify an MCA parameter as an environment variable differs, depending on the shell that you are using.

An sh-style example is:

OMPI_MCA_param="multiple word value"

A csh-style example is:

setenv OMPI_MCA_param "multiple word value"

For sh-style shells, the syntax of the mpi_show_handle_leaks example would be:

OMPI_MCA_mpi_show_handle_leaks=1
export OMPI_MCA_mpi_show_handle_leaks
mpirun -np 4 a.out

For csh-style shells, the syntax of this example would be:

setenv OMPI_MCA_mpi_show_handle_leaks 1
mpirun -np 4 a.out

Note that if you want to specify a value that includes multiple words, you must surround the value in quotes so that the shell and IBM Spectrum MPI understand that it is a single value.

Setting MCA parameters with a text file

MCA parameter values can be provided in a text file, called mca-params.conf. At runtime, IBM Spectrum MPI searches for the mca-params.conf file in one of the following locations, and in the following order:

- $HOME/.openmpi/mca-params.conf: This is the user-supplied set of values, which has the highest precedence.
- $prefix/etc/openmpi-mca-params.conf: This is the system-supplied set of values, which has a lower precedence.

Note: $prefix is the base SMPI install directory, which by default is /opt/ibm/spectrum_mpi.

The mca_param_files parameter specifies a colon-delimited path of files to search for MCA parameters. Files to the left have lower precedence, while files to the right have higher precedence.

The mca-params.conf file contains multiple parameter definitions, in which each parameter is specified on a separate line. The following example shows the mpi_show_handle_leaks parameter, as it is specified in a file:

# This is a comment
# Set the same MCA parameter as in previous examples
mpi_show_handle_leaks = 1


In MCA parameter files, quotes are not necessary for setting values that contain multiple words. If you include quotes in the MCA parameter file, they will be used as part of the value itself.

Parent topic: Tuning the runtime environment

Tuning multithread controls

You can optimize the PAMI PML implementation by tuning multithread controls.

Multithread controls

The PML pre-allocates and caches data structures per thread. The following MCA parameters control pre-allocation and caching:

--mca common_pami_freelist_cache_size <cache size>
Specifies the size of the per-thread free-list cache. The default value for <cache size> is 64. If you specify a value of 0 for <cache size>, the per-thread free-list cache is disabled.

--mca common_pami_max_threads <max threads>
Specifies the maximum number of application threads that are supported. The default value for <max threads> is 64. This parameter applies only when the MPI_THREAD_MULTIPLE mode is used and per-thread caching is enabled. If the number of application threads that call MPI is more than the number specified for <max threads>, the job is aborted.
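For context, the sketch below shows the kind of application these controls apply to: it requests MPI_THREAD_MULTIPLE and has four OpenMP threads per rank call MPI concurrently, which is the thread count that common_pami_max_threads bounds. The code is an illustrative assumption (compile it with an OpenMP-enabled compiler), not an example from the product documentation.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Illustrative sketch: four application threads per rank call MPI
 * concurrently under MPI_THREAD_MULTIPLE. */
int main(int argc, char *argv[])
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE is not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    #pragma omp parallel num_threads(4)
    {
        /* Each thread uses its own tag so its messages match independently. */
        int tag = omp_get_thread_num();
        int sendval = rank, recvval = -1;

        MPI_Sendrecv(&sendval, 1, MPI_INT, next, tag,
                     &recvval, 1, MPI_INT, prev, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}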

Parent topic: Tuning the runtime environment

Tunnel atomics

You can optimize the PAMI OSC implementation by using tunnel atomics.

If you are using IBM POWER9 systems or later, you can use tunneled atomics in one-sided calls such as the MPI_Accumulate and MPI_Raccumulate APIs. Tunnel atomics are disabled by default and work best with small request sizes.

To enable tunnel atomics, run the mpirun command with the -mca osc_pami_use_tunnel_atomics 1 option. To use the full capabilities of tunnel atomics, align your window buffers to 32 bytes and make the window size a multiple of 32 bytes.

You can pass a hint by using the -mca option to enable tunnel atomics for an entire buffer in a location where accumulate operations are guaranteed to be commutative. By default, tunnel atomics operate on requests that use a single 8-byte or 4-byte data type.

For example, if the application performs only commutative operations, such as MPI_SUM, you can pass the -mca osc_pami_tunnel_atomics_hint_commutative_operations 1 option. If you use the -async feature with the -mca osc_pami_use_tunnel_atomics 1 option, memory usage might be high if the application works with large buffers. To avoid resource constraints, you can flush local communication frequently.

By default, Spectrum™ MPI stops using tunnel atomics for buffers that are greater than 4000 bytes. You can change this default value by using the -mca osc_pami_tunnel_atomics_limit option.
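The following sketch illustrates the pattern these options target: a window buffer aligned to 32 bytes whose size is a multiple of 32 bytes, small 8-byte commutative MPI_Accumulate requests, and frequent local flushes. The buffer size, iteration count, and flush interval are illustrative assumptions; enable the feature at run time with the -mca osc_pami_use_tunnel_atomics 1 option as described above.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of small, commutative accumulate operations. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Window buffer aligned to 32 bytes; size is a multiple of 32 bytes. */
    const size_t winsize = 1024;
    void *winbuf = NULL;
    if (posix_memalign(&winbuf, 32, winsize) != 0) {
        fprintf(stderr, "allocation failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Win win;
    MPI_Win_create(winbuf, winsize, sizeof(long), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    long one = 1;
    for (int i = 0; i < 1000; i++) {
        int target = i % size;
        /* Small request: a single 8-byte element with a commutative op. */
        MPI_Accumulate(&one, 1, MPI_LONG, target, 0, 1, MPI_LONG,
                       MPI_SUM, win);
        if (i % 100 == 99)
            MPI_Win_flush_local_all(win);  /* flush local communication often */
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    free(winbuf);
    MPI_Finalize();
    return 0;
}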


Parent topic: Tuning the runtime environment
