{Open} MPI, Parallel Computing, Life, the Universe, and Everything
Dr. Jeffrey M. Squyres
November 7, 2013
Open MPI
[Figure: precursor projects PACX-MPI, LAM/MPI, LA-MPI, FT-MPI, and Sun CT 6 feeding into Open MPI]
• Project founded in 2003 after intense discussions between multiple open source MPI implementations
Open_MPI_Init()
shell$ svn log https://svn.open-mpi.org/svn/ompi -r 1
------------------------------------------------------------------------
r1 | jsquyres | 2003-11-22 11:36:58 -0500 (Sat, 22 Nov 2003) | 2 lines

First commit
------------------------------------------------------------------------
shell$
Open_MPI_Current_status()
shell$ svn log https://svn.open-mpi.org/svn/ompi -r HEAD
------------------------------------------------------------------------
r29619 | brbarret | 2013-11-06 09:14:24 -0800 (Wed, 06 Nov 2013) | 2 lines

update ignore file
------------------------------------------------------------------------
shell$
Open MPI 2014 membership
13 members, 15 contributors, 2 partners
Fun stats
• ohloh.net says:
  § 819,741 lines of code
  § Average 10-20 committers at a time
  § “Well-commented source code”
• I rank in top-25 ohloh stats for:
  § C
  § Automake
  § Shell script
  § Fortran (…ouch)
Current status
• Version 1.6.5 / stable series
  § Unlikely to see another release
• Version 1.7.3 / feature series
  § v1.7.4 due (hopefully) by end of 2013
  § Plan to transition to v1.8 in Q1 2014
MPI conformance
• MPI-2.2 conformant as of v1.7.3
  § Finally finished several 2.2 issues that no one really cares about
• MPI-3 conformance just missing new RMA
  § Tracked on wiki: https://svn.open-mpi.org/trac/ompi/wiki/MPIConformance
  § Hope to be done by v1.7.4
New MPI-3 features
• Mo’ betta Fortran bindings
  § You should “use mpi_f08”. Really.
• Matched probe
• Sparse and neighborhood collectives
• “MPI_T” tools interface
• Nonblocking communicator duplication
• Noncollective communicator creation
• Hindexed block datatype
New Open MPI features
• Better support for more runtime systems
  § PMI2 scalability, etc.
• New generalized processor affinity system
• Better CUDA support
• Java MPI bindings (!)
• Transports:
  § Cisco usNIC support
  § Mellanox MXM2 and hcoll support
  § Portals 4 support
My new favorite random feature
• mpirun CLI option <tab> completion
  § Bash and zsh
  § Contributed by Nathan Hjelm, LANL
shell$ mpirun --mca btl_usnic_<tab>
btl_usnic_cq_num            -- Number of completion queue…
btl_usnic_eager_limit       -- Eager send limit (0 = use …
btl_usnic_if_exclude        -- Comma-delimited list of de…
btl_usnic_if_include        -- Comma-delimited list of de…
btl_usnic_max_btls          -- Maximum number of usNICs t…
btl_usnic_mpool             -- Name of the memory pool to…
btl_usnic_prio_rd_num       -- Number of pre-posted prior…
btl_usnic_prio_sd_num       -- Maximum priority send desc…
btl_usnic_priority_limit    -- Max size of "priority" mes…
btl_usnic_rd_num            -- Number of pre-posted recei…
btl_usnic_retrans_timeout   -- Number of microseconds bef…
btl_usnic_rndv_eager_limit  -- Eager rendezvous limit (0 …
btl_usnic_sd_num            -- Maximum send descriptors t…
Two features to discuss in detail…
1. “MPI_T” interface
2. Flexible process affinity system
MPI_T interface
• Added in MPI-3.0
• So-called “MPI_T” because all the functions start with that prefix
  § T = tools
• APIs to get/set MPI implementation values
  § Control variables (e.g., implementation tunables)
  § Performance variables (e.g., run-time stats)
MPI_T control variables (“cvar”)
• Another interface to MCA param values
• In addition to existing methods:
  § mpirun CLI options
  § Environment variables
  § Config file(s)
• Allows tools / applications to programmatically list all OMPI MCA params
MPI_T cvar example
• MPI_T_cvar_get_num()
  § Returns the number of control variables
• MPI_T_cvar_get_info(index, …) returns:
  § String name and description
  § Verbosity level (see next slide)
  § Type of the variable (integer, double, etc.)
  § Type of MPI object (communicator, etc.)
  § “Writability” scope
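The query pattern above can be sketched with a toy model (a Python stand-in for the C calls; the cvar entries and their field names here are invented for illustration, the real interface is the C API named on this slide):

```python
# Toy model of MPI_T cvar enumeration (illustrative data, not OMPI's).
CVARS = [
    {"name": "btl_usnic_eager_limit", "desc": "Eager send limit",
     "verbosity": "TUNER_BASIC", "type": "int", "scope": "READONLY"},
    {"name": "mpi_show_handle_leaks", "desc": "Warn about leaked MPI handles",
     "verbosity": "USER_DETAIL", "type": "bool", "scope": "ALL_EQ"},
]

def cvar_get_num():
    """Mirrors MPI_T_cvar_get_num(): how many cvars exist."""
    return len(CVARS)

def cvar_get_info(index):
    """Mirrors MPI_T_cvar_get_info(index, ...): describe one cvar."""
    v = CVARS[index]
    return v["name"], v["desc"], v["verbosity"], v["type"], v["scope"]

# A tool would loop over all cvars exactly like this.
for i in range(cvar_get_num()):
    name, desc, verb, typ, scope = cvar_get_info(i)
    print(f"{name}: {desc} [{verb}, {scope}]")
```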
Verbosity levels
Level name      Level description
USER_BASIC      Basic information of interest to users
USER_DETAIL     Detailed information of interest to users
USER_ALL        All remaining information of interest to users
TUNER_BASIC     Basic information of interest for tuning
TUNER_DETAIL    Detailed information of interest for tuning
TUNER_ALL       All remaining information of interest for tuning
MPIDEV_BASIC    Basic information for MPI implementers
MPIDEV_DETAIL   Detailed information for MPI implementers
MPIDEV_ALL      All remaining information for MPI implementers
Open MPI interpretation of verbosity levels
1. User
  § Parameters required for correctness
  § As few as possible
2. Tuner
  § Tweak MPI performance
  § Resource levels, etc.
3. MPI developer
  § For Open MPI devs

1. Basic: even for less-advanced users and tuners
2. Detailed: useful, but you won’t need to change them often
3. All: anything else
“Writability” scope
Level name   Level description
CONSTANT     Read-only, constant value
READONLY     Read-only, but the value may change
LOCAL        Writing is a local operation
GROUP        Writing must be done as a group, and all values must be consistent
GROUP_EQ     Writing must be done as a group, and all values must be exactly the same
ALL          Writing must be done by all processes, and all values must be consistent
ALL_EQ       Writing must be done by all processes, and all values must be exactly the same
Reading / writing a cvar
• MPI_T_cvar_handle_alloc(index, handle, …)
  § Allocates an MPI_T handle
  § Binds it to a specific MPI handle (e.g., a communicator), or BIND_NO_OBJECT
• MPI_T_cvar_read(handle, buf)
• MPI_T_cvar_write(handle, buf)

→ OMPI has very, very few writable control variables after MPI_INIT
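The handle/read/write flow can be sketched with a toy model (a Python stand-in for the C API; `CvarHandle` and the registry are illustrative names, not MPI_T's real types, and the "few writable cvars" behavior is modeled with a `writable` flag):

```python
# Toy model of the MPI_T cvar handle flow (illustrative names only).
BIND_NO_OBJECT = None

# Pretend registry of control variables and their current values.
cvar_registry = {0: {"name": "btl_usnic_eager_limit", "value": 4096,
                     "writable": False}}

class CvarHandle:
    def __init__(self, index, obj=BIND_NO_OBJECT):
        # Mirrors MPI_T_cvar_handle_alloc: bind a cvar index to an MPI
        # object (e.g., a communicator), or to nothing at all.
        self.index, self.obj = index, obj

def cvar_read(handle):
    """Mirrors MPI_T_cvar_read(handle, buf)."""
    return cvar_registry[handle.index]["value"]

def cvar_write(handle, value):
    """Mirrors MPI_T_cvar_write: most OMPI cvars refuse writes post-init."""
    if not cvar_registry[handle.index]["writable"]:
        raise RuntimeError("cvar is not writable")
    cvar_registry[handle.index]["value"] = value

h = CvarHandle(0)
print(cvar_read(h))  # 4096
```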
MPI_T Performance variables (“pvar”)
• New information available from OMPI
  § Run-time statistics of implementation details
  § Similar interface to control variables
• Not many available in OMPI yet
• Cisco usNIC BTL exports 24 pvars
  § Per usNIC interface
  § Stats about underlying network

(more details to be provided in the usNIC talk)
Process affinity system
Locality matters
• Goals:
  § Minimize data transfer distance
  § Reduce network congestion and contention
• …and this matters inside the server, too!
[Figure: lstopo output of a two-socket server: 2 NUMANodes (64GB each), 8 cores per socket with 2 PUs (hardware threads) each, 32KB L1d/L1i and 256KB L2 per core, a shared 20MB L3 per socket, NICs eth0–eth7, and disks sda/sdb]
Intel Xeon E5-2690 (“Sandy Bridge”) 2 sockets, 8 cores, 64GB per socket
[Figure: the same lstopo diagram, annotated: eth0–eth3 are 1G NICs, eth4–eth7 are 10G NICs, L1 and L2 are per-core, the L3 is shared, and hyperthreading is enabled]
The intent of this work is to provide a mechanism that allows users to explore the process-placement space within the scope of their own applications.
A user’s playground
Two complementary systems
• Simple
  § mpirun --bind-to [ core | socket | … ] …
  § mpirun --by[node | slot | … ] …
  § …etc.
• Flexible
  § LAMA: Locality Aware Mapping Algorithm
LAMA
• Supports a wide range of regular mapping patterns
  § Drawn from much prior work
  § Most notably, heavily inspired by the BlueGene/P and /Q mapping systems
Launching MPI applications
• Three steps in MPI process placement:
  1. Mapping
  2. Ordering
  3. Binding
• Let's discuss how these work in Open MPI
1. Mapping
• Create a layout of processes-to-resources
[Figure: a 4×4 grid of servers, each filled with a grid of MPI processes laid out onto its resources]
Mapping
• MPI's runtime must create a map, pairing processes to processors (and memory)
• Basic technique:
  § Gather hwloc topologies from the allocated nodes
  § A mapping agent then makes a plan for which resources are assigned to which processes
Mapping agent
• The act of planning mappings:
  § Specify which process will be launched on each server
  § Identify if any hardware resource will be oversubscribed
• Processes are mapped at the resolution of a single processing unit (PU)
  § Smallest unit of allocation: hardware thread
  § In HPC, usually the same as a processor core
Oversubscription
• Common / usual definition:
  § When a single PU is assigned more than one process
• Complicating the definition:
  § Some applications may need more than one PU per process (multithreaded applications)
• How can the user express what their application means by “oversubscription”?
2. Ordering: by “slot”
Assigning MCW ranks to mapped processes
[Figure: by-slot ordering across the 16 servers: consecutive MCW ranks fill all the slots on one server before moving to the next]
2. Ordering: by node
Assigning MCW ranks to mapped processes
[Figure: by-node ordering across the 16 servers: MCW ranks are assigned round-robin, rank 0 on the first server, rank 1 on the second, and so on]
Ordering
• Each process must be assigned a unique rank in MPI_COMM_WORLD
• Two common types of ordering:
  § Natural: the order in which processes are mapped determines their rank in MCW
  § Sequential: processes are numbered sequentially, starting at the first processing unit and continuing until the last processing unit
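The two orderings can be contrasted in a few lines of Python (a conceptual sketch; the 2-node, 4-PU system and the variable names are invented for illustration):

```python
# Sketch: natural vs. sequential ordering for a by-node mapping of
# 8 processes onto 2 nodes with 4 PUs each (illustrative system).
nodes, pus = 2, 4

# By-node mapping: processes are placed round-robin across nodes,
# recorded in the order they were mapped.
placements = [(i % nodes, i // nodes) for i in range(nodes * pus)]

# Natural ordering: MCW rank = the order in which the process was mapped.
natural = {placement: rank for rank, placement in enumerate(placements)}

# Sequential ordering: MCW rank = position of the PU, node-major.
sequential = {(n, p): n * pus + p for (n, p) in placements}

print(natural[(1, 0)])     # PU 0 of node 1 was mapped second -> rank 1
print(sequential[(1, 0)])  # but it is the 5th PU overall -> rank 4
```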
3. Binding
• Launch processes and enforce the layout

[Figure: sequence of slides overlaying MCW ranks (0–15 on one server, 16–31 on the next, …) on the two-server lstopo diagrams as processes are launched and bound]
Binding
• Process-launching agent working with the OS to limit where each process can run:
  1. No restrictions
  2. Limited set of restrictions
  3. Specific resource restrictions
• “Binding width”
  § The number of PUs to which a process is bound
Command Line Interface (CLI)
• 4 levels of abstraction for the user
  § Level 1: None
  § Level 2: Simple, common patterns
  § Level 3: LAMA process layout regular patterns
  § Level 4: Irregular patterns (not described in this talk)
CLI: Level 1 (none)
• No mapping or binding options specified
  § May or may not specify the number of processes to launch (-np)
  § If not specified, default to the number of cores available in the allocation
  § One process is mapped to each core in the system in a "by-core" style
  § Processes are not bound
• …for backwards-compatibility reasons :-(
CLI: Level 2 (common)
• Simple, common patterns for mapping and binding
  § Specify mapping pattern with --map-by X (e.g., --map-by socket)
  § Specify binding option with --bind-to Y (e.g., --bind-to core)
  § All of these options are translated to Level 3 options for processing by LAMA
(full list of X / Y values shown later)
CLI: Level 3 (regular patterns)
• LAMA process layout regular patterns
  § For power users wanting something unique for their application
  § Four MCA run-time parameters:
    • rmaps_lama_map: mapping process layout
    • rmaps_lama_bind: binding width
    • rmaps_lama_order: ordering of MCW ranks
    • rmaps_lama_mppr: maximum allowable number of processes per resource (oversubscription)
rmaps_lama_map (map)
• Takes as an argument the "process layout"
  § A series of nine tokens, allowing 9! (362,880) mapping permutations
  § Preferred iteration order for LAMA:
    • Innermost iteration specified first
    • Outermost iteration specified last
Example system
2 servers (nodes), 4 sockets per server, 2 cores per socket, 2 PUs (hardware threads) per core

[Figure: the example system, drawn as Node 0 and Node 1, each with Sockets 0–3 holding Cores 0–1 with hardware threads H0/H1]
rmaps_lama_map (map)
• map=scbnh (a.k.a., by socket, then by core)

[Figure: the example system, with MCW ranks filled in as the traversal proceeds]

Step 1: Traverse sockets
Step 2: Ran out of sockets, so now traverse cores
Step 3: Now traverse boards (but there aren’t any)
Step 4: Now traverse server nodes
Step 5: After repeating s, c, and b on server node 2, traverse hardware threads
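The layered traversal above can be sketched in Python (a conceptual sketch, not Open MPI's implementation; `lama_map` and the `sizes` table are illustrative names):

```python
from itertools import product

# Sketch of LAMA's layered mapping traversal. The map string lists
# dimensions innermost-first: "scbnh" = socket, core, board, node,
# hardware thread.
def lama_map(map_str, sizes):
    """Yield one coordinate dict per MCW rank, in LAMA traversal order.
    sizes maps each token to the count of that resource."""
    # product() varies its *last* argument fastest, so reverse the map
    # string to make the first token the innermost (fastest) dimension.
    dims = list(reversed(map_str))
    for coord in product(*(range(sizes[d]) for d in dims)):
        yield dict(zip(dims, coord))

# Example system from the slides: 2 nodes, 1 board per node,
# 4 sockets, 2 cores per socket, 2 hardware threads per core.
sizes = {'n': 2, 'b': 1, 's': 4, 'c': 2, 'h': 2}
ranks = list(lama_map("scbnh", sizes))

# Rank 0 is node 0 / socket 0 / core 0 / hwthread 0; rank 1 moves to
# socket 1 (sockets vary fastest); rank 8 moves to node 1.
print(ranks[0], ranks[1], ranks[8])
```

Reversing the token string before handing it to `product()` is what makes the first token the fastest-varying (innermost) dimension, matching Steps 1–5 above.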
rmaps_lama_bind (bind)
• “Binding width” and layer
• Example: bind=3c (3 cores)

[Figure: lstopo diagram with each binding spanning 3 consecutive cores]
rmaps_lama_bind (bind)
• “Binding width” and layer
• Example: bind=2s (2 sockets)

[Figure: lstopo diagram with each binding spanning both sockets]
rmaps_lama_bind (bind)
• “Binding width” and layer
• Example: bind=12 (all PUs in an L2)
rmaps_lama_bind (bind)
• "Binding width" and layer
• Example: bind=1N (all PUs in a NUMA locality)
bind = 1N
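The bind specs above pair a width with a topology level (e.g., 2s = two sockets, 1N = one NUMA node). As a rough illustration of that syntax, here is a hypothetical parser sketch (not Open MPI source; the level-letter table is assumed from the examples in this talk):

```python
# Hypothetical sketch: parse a LAMA-style binding spec of the form
# <width><level>, e.g. "2s", "1N", "1c", "1L2". Not Open MPI's parser.
import re

# Level codes assumed from the slides: N=NUMA node, s=socket, c=core,
# h=hardware thread, L1/L2/L3=cache levels.
LEVELS = {"N": "NUMA node", "s": "socket", "c": "core", "h": "hwthread",
          "L1": "L1 cache", "L2": "L2 cache", "L3": "L3 cache"}

def parse_bind(spec):
    """Return (width, level name) for a bind spec like '2s'."""
    m = re.fullmatch(r"(\d+)(L[123]|[Nsch])", spec)
    if not m:
        raise ValueError(f"bad bind spec: {spec!r}")
    return int(m.group(1)), LEVELS[m.group(2)]

print(parse_bind("2s"))   # bind each process to 2 sockets' worth of PUs
print(parse_bind("1N"))   # bind to all PUs in one NUMA locality
```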
rmaps_lama_order (order)
• Selects how ranks in MPI_COMM_WORLD (MCW) are assigned to the mapped process locations
• There are other possible orderings, but no one has asked for them yet…
§ Natural order for map-by-node (default)
§ Sequential order for any mapping
rmaps_lama_mppr (mppr)
• mppr (mip-per) sets the Maximum number of allowable Processes Per Resource § User-specified definition of oversubscription
• Comma-delimited list of <#:resource>!§ 1:c à At most one process per core § 1:c,2:s à At most one process per core, and
at most two processes per socket
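The mppr rule above can be sketched in a few lines. This is an illustration only (a hypothetical helper, not Open MPI's implementation), assuming each process placement is described by the unique resource IDs it lands on:

```python
# Hypothetical sketch: parse an mppr string like "1:c,2:s" and check a
# set of process placements against it. Not Open MPI's implementation.
from collections import Counter

def parse_mppr(spec):
    """'1:c,2:s' -> {'c': 1, 's': 2} (max processes per resource type)."""
    limits = {}
    for item in spec.split(","):
        n, resource = item.split(":")
        limits[resource] = int(n)
    return limits

def oversubscribed(limits, placements):
    """placements: list of dicts like {'s': 0, 'c': 3}, giving the
    (globally unique) resource ID each process lands on. Returns True
    if any per-resource limit is exceeded."""
    for resource, limit in limits.items():
        counts = Counter(p[resource] for p in placements if resource in p)
        if any(c > limit for c in counts.values()):
            return True
    return False

limits = parse_mppr("1:c,2:s")
# Two processes on the same core violates the 1:c limit:
print(oversubscribed(limits, [{"s": 0, "c": 0}, {"s": 0, "c": 0}]))  # True
```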
MPPR
§ 1:c → At most one process per core

[lstopo diagram of the same 128GB dual-socket machine, with at most one process outlined per core]
MPPR
§ 1:c,2:s → At most one process per core and two processes per socket

[lstopo diagram of the same machine, with at most two processes outlined per socket]
Remember the prior example?
• -np 24 -mppr 2:c -map scbnh

[rank-placement diagram: 2 nodes × 4 sockets × 2 cores × 2 hwthreads. Ranks 0–3 go round-robin across node 0's sockets on core 0, ranks 4–7 on core 1, ranks 8–15 fill node 1 the same way, and ranks 16–23 wrap back onto node 0's second hardware threads]
Same example, different mapping
• -np 24 -mppr 2:c -map nbsch

[rank-placement diagram: nodes now vary fastest. Even ranks 0–14 land on node 0 and odd ranks 1–15 on node 1, round-robin across sockets (core 0 first, then core 1); ranks 16–23 fill the second hardware threads of each node's core 0s]
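The difference between the two mapping orders above comes down to which hardware level varies fastest as ranks are laid out. Here is a small sketch of that idea (illustration only, not Open MPI source; the topology sizes are taken from the example machine, and mppr is not modeled):

```python
# Hedged sketch of LAMA-style ordered mapping. The mapping string lists
# hardware levels with the LEFTMOST level varying FASTEST.
from itertools import product

# Topology assumed from the slides' example: 2 nodes, 1 board,
# 4 sockets, 2 cores/socket, 2 hwthreads/core.
SIZES = {"n": 2, "b": 1, "s": 4, "c": 2, "h": 2}

def placements(order, nprocs):
    """Yield (rank, {level: index}) pairs for a mapping order like 'scbnh'."""
    levels = list(order)
    # itertools.product varies its LAST factor fastest, so feed the
    # levels reversed to make order[0] the fastest-varying level.
    coords = product(*(range(SIZES[l]) for l in reversed(levels)))
    for rank, tup in zip(range(nprocs), coords):
        yield rank, dict(zip(reversed(levels), tup))

# Which ranks land on node 1, socket 0, core 0 under each mapping?
for order in ("scbnh", "nbsch"):
    ranks = [r for r, p in placements(order, 24)
             if p["n"] == 1 and p["s"] == 0 and p["c"] == 0]
    print(order, ranks)   # scbnh -> [8]; nbsch -> [1, 17]
```

This reproduces the diagrams: with scbnh only rank 8 reaches node 1's first core within 24 ranks, while nbsch puts ranks 1 and 17 there because nodes vary fastest.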
Report bindings
• --report-bindings displays a prettyprint representation of the binding actually used for each process
§ Visual feedback = quite helpful when exploring

mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c --report-bindings hello_world

MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
Feedback
• Available in Open MPI v1.7.2 (and later)
• Open questions to users:
§ Are more flexible ordering options useful?
§ What common mapping patterns are useful?
§ What additional features would you like to see?
Thank you