MVAPICH2 and MVAPICH2-MIC: Latest Status
Dhabaleswar K. (DK) Panda and Khaled Hamidouche
The Ohio State University
E-mail: {panda, hamidouc}@cse.ohio-state.edu
https://mvapich.cse.ohio-state.edu/
Presentation at IXPUG Meeting, July 2014
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
Figure: Growth of clusters in the Top500 over time (x-axis: timeline; left y-axis: number of clusters, 0-500; right y-axis: percentage of clusters, 0-100%).
Large-scale InfiniBand Installations
• 223 IB clusters (44.3%) in the June 2014 Top500 list (http://www.top500.org)
• Installations in the Top 50 (25 systems):
– 519,640 cores (Stampede) at TACC (7th)
– 62,640 cores (HPC2) in Italy (11th)
– 147,456 cores (SuperMUC) in Germany (12th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
– 194,616 cores (Cascade) at PNNL (15th)
– 110,400 cores (Pangea) at France/Total (16th)
– 96,192 cores (Pleiades) at NASA/Ames (21st)
– 73,584 cores (Spirit) at USA/Air Force (24th)
– 77,184 cores (Curie thin nodes) at France/CEA (26th)
– 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
– 120,640 cores (Nebulae) at China/NSCS (28th)
– 72,288 cores (Yellowstone) at NCAR (29th)
– 70,560 cores (Helios) at Japan/IFERC (30th)
– 138,368 cores (Tera-100) at France/CEA (35th)
– 222,072 cores (QUARTETTO) in Japan (37th)
– 53,504 cores (PRIMERGY) in Australia (38th)
– 77,520 cores (Conte) at Purdue University (39th)
– 44,520 cores (Spruce A) at AWE in UK (40th)
– 48,896 cores (MareNostrum) at Spain/BSC (41st)
– and many more!
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Support for GPGPUs and MIC
– Used by more than 2,150 organizations in 72 countries
– More than 218,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 7th ranked 519,640-core cluster (Stampede) at TACC
• 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
• 23rd ranked 96,192-core cluster (Pleiades) at NASA
– Available with software stacks of many IB, HSE, and server vendors including
Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede System
MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA
• Released on 06/20/14
• Major Features and Enhancements
– Based on MPICH-3.1
– MPI-3 RMA Support
– Support for non-blocking collectives (example sketch after this list)
– MPI-T support
– CMA support is default for intra-node communication
– Optimization of collectives with CMA support
– Large message transfer support
– Reduced memory footprint
– Improved job startup time
– Optimization of collectives and tuning for multiple platforms
– Updated hwloc to version 1.9
• MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM)
programming models
– Based on MVAPICH2 2.0 GA including MPI-3 features
– Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
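To make the non-blocking collectives support above concrete, here is a minimal sketch (standard MPI-3 code, not specific to MVAPICH2; runs with any number of ranks) that overlaps an MPI_Iallreduce with independent computation:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction without blocking the caller. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can overlap with the collective ... */

    /* Complete the collective before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}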
MVAPICH2-MIC: Optimized MPI Library for Xeon Phi Clusters
• MPI libraries run out of the box (or with minor changes) on the Xeon Phi
• Critical to optimize the runtimes for better performance
– Tune existing designs
– Designs using lower level features offered by MPSS
– Designs to address system-level limitations
• An initial version of MVAPICH2-MIC (based on the MVAPICH2 2.0a release) has been
available on Stampede since Oct ‘13
– Supports all modes of usage – host-only, offload, coprocessor-only and symmetric
– Improved shared memory communication channel
– SCIF-based designs for improved communication within MIC and between MICs and Hosts
– Proxy-based design to work around bandwidth limitations on Sandy Bridge platform
• An enhanced version based on the MVAPICH2 2.0 GA release is under development and
will be available soon
MPI Applications on MIC Clusters
Figure: Spectrum of MPI execution modes on a Xeon host + Xeon Phi node, from multi-core centric to many-core centric: Host-only, Offload (and reverse offload), Symmetric, and Coprocessor-only.
• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi
MPI Applications on MIC Clusters
Figure: Same mode spectrum, highlighting the Offload mode.
• Offload mode: a low-overhead way to extract performance from the Xeon Phi
MPI Data Movement in Offload Mode
Figure: In offload mode, MPI processes run only on the host CPUs; communication paths are (1) intra-socket, (2) inter-socket over QPI, and (3) inter-node over InfiniBand.
• Supported by the standard MVAPICH2 library (latest release: MVAPICH2 2.0 GA)
• MPI communication only from the host
One-way Latency: MPI over IB with MVAPICH
Figure: Small-message latency (labeled points: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, and 1.12 us) and large-message latency for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR.
Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel PCIe Gen2 with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel PCIe Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel PCIe Gen3 with IB switch.
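The one-way latencies above are measured with a ping-pong pattern like the one used by osu_latency; a simplified sketch (standard MPI, run with exactly two ranks; warm-up iterations and the message-size sweep are omitted):

#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};                 /* small message */
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* One-way latency = half the average round-trip time. */
        double latency_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS);
        printf("one-way latency: %.2f us\n", latency_us);
    }

    MPI_Finalize();
    return 0;
}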
Bandwidth: MPI over IB with MVAPICH
Figure: Unidirectional bandwidth (labeled peaks: 3280, 3385, 1917, 1706, 6343, 12485, and 12810 MBytes/s) and bidirectional bandwidth (labeled peaks: 3341, 3704, 4407, 11643, 6521, 21025, and 24727 MBytes/s) for the same set of adapters and platforms as in the previous figure.
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))
Latest MVAPICH2 2.0, Intel Ivy-Bridge
Figure: Intra-node small-message latency: 0.18 us intra-socket, 0.45 us inter-socket.
Figure: Intra-socket and inter-socket bandwidth for the CMA, shared-memory, and LiMIC channels; peak bandwidths of 14,250 MB/s and 13,749 MB/s.
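CMA (Cross Memory Attach), one of the kernel-assisted zero-copy paths above, lets one process copy directly out of another process's address space with a single system call. A minimal sketch of the underlying mechanism, outside of MPI and using the same process as both source and destination purely for illustration (in MVAPICH2 the pid and remote address belong to the peer rank and are exchanged during connection setup):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Source buffer; in the MPI case this lives in the peer process. */
    char src[64] = "payload in the 'remote' address space";
    char dst[64] = {0};

    struct iovec local  = { .iov_base = dst, .iov_len = sizeof dst };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof src };

    /* Single-copy transfer: the kernel copies directly between the two
       address spaces, avoiding a copy-in/copy-out through shared memory. */
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);

    printf("copied %zd bytes: %s\n", n, dst);
    return 0;
}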
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.0, Intel Sandy-Bridge with Connect-IB (single port)
Figure: Intra-socket Get/Put latency (down to 0.08 us for small messages) and intra-socket Get/Put bandwidth (peaks around 14,926 and 15,364 MBytes/s).
Figure: Inter-node Get/Put latency (1.56-2.04 us for small messages) and inter-node Get/Put bandwidth (peaks around 6,881 and 6,876 MBytes/s).
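The Put-with-flush operation timed above maps to standard MPI-3 RMA calls; a minimal sketch (run with at least two ranks; the window size and payload are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *win_buf;
    MPI_Win win;

    /* Expose one double per process as an RMA window. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = (double)rank;

    /* Passive-target epoch covering all ranks. */
    MPI_Win_lock_all(0, win);

    if (rank == 0) {
        double val = 42.0;
        /* Put into rank 1's window, then flush to guarantee remote
           completion; this Put + flush is the pattern the benchmark times. */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(1, win);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}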
HPCG Benchmark with Offload Mode
• Full subscription of a node:
– 1 MPI process + 10 OpenMP threads on the CPU
– 3 MPI processes + 6 OpenMP threads on the CPU + 240 OpenMP threads offloaded to the MIC
• Input data size = 128^3
• Binding using MV2_CPU_MAPPING (a hybrid MPI + OpenMP + offload sketch follows the figure below)
• 1 node: MVAPICH2 achieves 24 GFlops
• 1,024 nodes: MVAPICH2 achieves 18.2 TFlops
Figure: HPCG performance in offload mode (GFlops vs. number of nodes) with MVAPICH2-Offload, scaling from 1 to 32 nodes and from 64 to 1,024 nodes.
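The offload runs above combine host-side MPI ranks and OpenMP threads with compute regions pushed to the coprocessor. A minimal sketch of that structure, assuming the Intel compiler's offload pragma (the array size, loop body, and thread counts are placeholders rather than the HPCG kernels):

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (double)i;

    /* Host-side OpenMP region (e.g., OMP_NUM_THREADS=10 per MPI rank). */
    #pragma omp parallel
    {
        /* ... host portion of the computation ... */
    }

    /* Push a compute region to the Xeon Phi; the MIC-side OpenMP runtime
       (e.g., 240 threads) parallelizes the loop on the coprocessor. */
    #pragma offload target(mic) in(b) out(a)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }

    /* MPI communication stays on the host in offload mode. */
    double local = a[0], sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}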
MPI Applications on Xeon Phi Clusters - IntraNode
Figure: Mode spectrum (Host-only, Offload, Symmetric, Coprocessor-only).
• Xeon Phi as a many-core node
MPI Data Movement in Coprocessor-Only Mode
Figure: Intra-MIC communication: MPI processes run on the Xeon Phi (many cores with per-core L2 caches, backed by GDDR memory); the host CPUs are not involved.
MVAPICH2-MIC Channels for Intra-MIC Communication
• MVAPICH2 provides a hybrid of two channels:
– CH3-SHM
  • Derived from the shared-memory communication design used on the host
  • Tuned for the Xeon Phi architecture
– CH3-SCIF
  • SCIF is a low-level API provided by MPSS
  • Provides user-level control of the DMA engine
  • Takes advantage of SCIF to improve performance (see the SCIF sketch after the references below)
Figure: Within the Xeon Phi, both MVAPICH2 processes communicate over CH3-SHM (POSIX shared memory) and CH3-SCIF (SCIF).
S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla, and D. K. Panda, Efficient Intra-node Communication on Intel-MIC Clusters, Int'l Symposium on Cluster, Cloud and Grid Computing (CCGrid '13), May 2013.
S. Potluri, K. Tomko, D. Bureddy, and D. K. Panda, Intra-MIC MPI Communication using MVAPICH2: Early Experience, TACC-Intel Highly-Parallel Computing Symposium (TI-HPCS), April 2012. Best Student Paper.
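For context on what the CH3-SCIF channel builds on, a minimal user-level SCIF sketch, assuming the scif.h API installed by MPSS (the port number is illustrative and a matching scif_listen()/scif_accept() peer is assumed on the host; most error handling is omitted). This only shows the connection and a blocking send, not MVAPICH2's internal design:

/* Compile with the MPSS SDK and link with -lscif. */
#include <scif.h>
#include <stdio.h>

int main(void)
{
    /* Open an endpoint and connect to a peer endpoint.
       Node 0 is the host, node 1 the first Xeon Phi; the port number is
       illustrative and must match what the peer bound before listening. */
    scif_epd_t epd = scif_open();

    struct scif_portID peer;
    peer.node = 0;          /* connect from the MIC side to the host */
    peer.port = 2050;       /* hypothetical, application-chosen port  */

    if (scif_connect(epd, &peer) < 0) {
        perror("scif_connect");
        return 1;
    }

    /* Blocking send over the PCIe link. Large transfers would instead use
       scif_register() and scif_writeto()/scif_readfrom() to drive the DMA
       engine directly, which is what the CH3-SCIF channel exploits. */
    char msg[] = "hello over SCIF";
    scif_send(epd, msg, sizeof msg, SCIF_SEND_BLOCK);

    scif_close(epd);
    return 0;
}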
Intra-MIC - Point-to-Point Communication
Figure: Intra-MIC osu_latency (small and medium messages), osu_bw, and osu_bibw, comparing MV2-MIC-2.0GA with MV2-MIC-2.0a; improvements of up to 66% and 30% are highlighted.
Intra-MIC - Collective Communication – AlltoAll
Figure: Intra-MIC osu_alltoall latency (small messages) with 8 and 16 processes, MV2-MIC-2.0GA vs. MV2-MIC-2.0a (up to 70% improvement shown).
• 8 processes: 40% and 19% improvement for 4-byte and 4 KB messages
• 16 processes: 38% and 71% improvement for 64-byte and 1 KB messages
Intra-MIC – HPCG Benchmark
• 4 MPI processes with 60 OpenMP threads per process
• Binding using the MV2_MIC_MAPPING environment variable
Figure: Intra-MIC HPCG performance (GFlops) for input sizes 32 and 64, MV2-MIC-2.0GA vs. MV2-MIC-2.0a; up to 8% improvement.
MPI Applications on MIC Clusters - InterNode
Figure: Mode spectrum (Host-only, Offload, Symmetric, Coprocessor-only), now across nodes.
MPI Data Movement in Coprocessor-Only Mode
Figure: Inter-node coprocessor-only mode: MPI processes run on the MIC of each node; the communication path is (1) MIC to remote MIC over PCIe, the IB HCA, and the InfiniBand network.
MVAPICH2 Channels for MIC-RemoteMIC Communication
• CH3-IB channel
– High-performance InfiniBand channel in MVAPICH2
– Implemented using IB Verbs (DirectIB)
– Currently does not support advanced features such as hardware multicast
– Limited by the P2P bandwidth offered on the Sandy Bridge platform
Figure: Software stack on each Xeon Phi: MVAPICH2's OFA-IB-CH3 channel over IB Verbs, talking directly to the mlx4_0 IB HCA.
Host Proxy-based Designs in MVAPICH2-MIC
• The DirectIB channel is limited by the P2P read bandwidth
• MVAPICH2-MIC uses a hybrid DirectIB + host proxy-based approach to work around this
Figure: Bandwidth limits on a Sandy Bridge (SNB E5-2670) node: P2P read / IB read from the Xeon Phi reaches only 962.86 MB/s, while P2P write / IB write to the Xeon Phi reaches 5,280 MB/s; IB read from the host reaches 6,977 MB/s and Xeon Phi-to-host transfers reach 6,296 MB/s.
S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni, and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters, Int'l Conference on Supercomputing (SC '13), November 2013.
MIC-RemoteMIC Point-to-Point Communication (Active Proxy)
Figure: MIC-to-remote-MIC osu_latency (small and large messages), osu_bw, and osu_bibw with MV2-MIC-2.0GA, with and without the host proxy; with the proxy, bandwidth reaches 5,681 MB/s (unidirectional) and 8,290 MB/s (bidirectional).
InterNode - Coprocessor-only Mode - HPCG
• Full subscription of a MIC (the host is not involved):
– 3 MPI processes + 240 OpenMP threads on the MIC
• Input size = 96^3
• Binding using MV2_MIC_MAPPING
• 1 node: MVAPICH2-MIC achieves 15 GFlops
• 8 nodes: MVAPICH2-MIC achieves 105 GFlops
Figure: HPCG performance (GFlops) vs. number of MICs (1-8) with MV2-MIC-2.0GA.
MPI Applications on MIC Clusters - InterNode
Figure: Mode spectrum (Host-only, Offload, Symmetric, Coprocessor-only), now highlighting the Symmetric mode.
MPI Data Movement in Symmetric Mode - InterNode
Figure: In symmetric mode, MPI processes run on both the hosts and the MICs; inter-node paths are (1) MIC to remote MIC, (2) MIC to remote host (closer to the HCA), and (3) MIC to remote host (farther from the HCA).
• Uses the OFA-IB-CH3 channel backed by a host-based proxy (a symmetric-mode rank sketch follows below)
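In symmetric mode the same MPI program (built separately for the host and the MIC) runs on both sides, so applications typically detect at run time whether a rank sits on the coprocessor and size its workload accordingly. A minimal sketch, assuming the Intel compiler's __MIC__ predefined macro for the MIC-native build (the work-splitting ratio is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#ifdef __MIC__
    int on_mic = 1;      /* defined by the Intel compiler when building
                            the MIC-native (-mmic) binary */
#else
    int on_mic = 0;
#endif

    /* Illustrative load balancing: give MIC ranks a larger share because
       each one backs many more OpenMP threads than a host rank. */
    long base_work = 1000000;
    long my_work = on_mic ? 3 * base_work : base_work;

    printf("rank %d on %s: %ld work units\n",
           rank, on_mic ? "MIC" : "host", my_work);

    /* ... computation and MPI communication proceed identically on either
       side; MVAPICH2-MIC selects the appropriate channel underneath ... */

    MPI_Finalize();
    return 0;
}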
MIC-RemoteHost Point-to-Point Communication (Active Proxy)
Figure: MIC-to-remote-host osu_latency (small and large messages), osu_bw, and osu_bibw with MV2-MIC-2.0GA, with and without the host proxy; with the proxy, bandwidth reaches 5,701 MB/s (unidirectional) and 8,742 MB/s (bidirectional).
Inter-Node Symmetric Mode – Passive Proxy
64 MPI processes on 2 nodes (16 host + 16 MIC processes per node)
• Fully-subscribed mode (16 processes on the host)
• The passive proxy design outperforms the active proxy (the default in MV2-MIC-2.0a)
Figure: osu_alltoall and osu_allgather latency, MV2-MIC-2.0GA (passive proxy) vs. MV2-MIC-2.0a (active proxy); improvements of 1.5X and 1.8X.
Inter-Node Symmetric Mode – Redesigning Collectives
Figure: Latency of redesigned Allgather (32 nodes, 16H+16M and 8H+8M per node) and Alltoall (16 nodes, 8H+8M per node), MV2-MIC-2.0GA vs. MV2-MIC-2.0a; improvements of up to 3X, 2.5X, and 2X.
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche, and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.
InterNode – Symmetric Mode - HPCG
• Full subscription of a node:
– 2 MPI processes + 16 OpenMP threads on the CPU
– 3 MPI processes + 240 OpenMP threads on the MIC
• Input size = 96^3
• Binding using both MV2_CPU_MAPPING and MV2_MIC_MAPPING to explore different threading levels for the host and the Xeon Phi
• 1 node: MVAPICH2-MIC achieves 23.5 GFlops
• 16 nodes: MVAPICH2-MIC achieves 340 GFlops
Figure: HPCG performance (GFlops) vs. number of nodes (1-16) with MV2-MIC-2.0GA.
On-Going and Future Work
• Redesigning of other collective operations
• Optimizing MPI-3 RMA operations for the MIC
• Automatic proxy selection using architecture topology detection
• Evaluating application performance and scaling
• Extending designs to KNL-based self-hosted nodes
Conclusion
• MVAPICH2-MIC provides a high-performance and scalable MPI library
for InfiniBand clusters using Xeon Phi
• The initial version takes advantage of SCIF to improve intra-MIC and
intra-node MIC-host communication
• Host-proxy-based designs help work around P2P bandwidth
bottlenecks
• Provides initial support for heterogeneity-aware collectives
• An enhanced version of MVAPICH2-MIC (based on 2.0 GA) will be
available soon
Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting
• August 25-27, 2014; Columbus, Ohio, USA
• Keynote Talks, Invited Talks, Contributed Presentations
• Tutorial on MVAPICH2 and MVAPICH2-X optimization and tuning
• Speakers (Confirmed so far):
– Keynote: Dan Stanzione (TACC)
– Keynote: Darren Kerbyson (PNNL)
– Sadaf Alam (CSCS, Switzerland)
– Onur Celebioglu (Dell)
– Jens Glaser (Univ. of Michigan)
– Jeff Hammond (Intel)
– Adam Moody (LLNL)
– David Race (Cray)
– Davide Rossetti (NVIDIA)
– Gilad Shainer (Mellanox)
– Sayantan Sur (Intel)
– Mahidhar Tatineni (SDSC)
– Jerome Vienne (TACC)
• More details at: http://mug.mvapich.cse.ohio-state.edu
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu