Overview of the MVAPICH Project: Latest Status and Future Roadmap
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
MVAPICH2 User Group (MUG) Meeting
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure: multi-core processors; high-performance interconnects - InfiniBand with <1 usec latency and >100 Gbps bandwidth; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)]
Parallel Programming Models Overview
[Figure: three abstract machine models - the Shared Memory Model (SHMEM, DSM) with processes P1-P3 sharing one memory, the Distributed Memory Model (MPI, Message Passing Interface) with each process owning its own memory, and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, ...) with per-process memories under a logical shared memory]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: co-design stack - Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.); Middleware: communication library or runtime for programming models providing point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, and fault tolerance; Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC); co-design opportunities and challenges exist across all of these layers]
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
– Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)
– Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …)
• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
– Offload
– Non-blocking
– Topology-aware
– Power-aware
• Support for MPI-3 RMA Model
• Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
MVAPICH2/MVAPICH2-X Software
• http://mvapich.cse.ohio-state.edu
• High Performance open-source MPI Library for InfiniBand,
10Gig/iWARP, and RDMA over Converged Enhanced Ethernet
(RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Used by more than 2,200 organizations (HPC Centers, Industry and
Universities) in 73 countries
D. K. Panda, K. Tomko, K. Schulz and A. Majumdar, The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with Int'l Conference on Supercomputing (SC '13), November 2013.
MVAPICH Team Projects Timeline
[Figure: timeline of MVAPICH team projects (MVAPICH, MVAPICH2, OMB, MVAPICH2-X, MVAPICH2-GDR) spanning Oct 2000 through Aug 2014; MVAPICH (MPI-1) has reached end of life (EOL)]
MVAPICH/MVAPICH2 Release Timeline and Downloads
• Download counts from MVAPICH2 website
[Figure: cumulative number of downloads from the MVAPICH2 website, Sep 2004 - Apr 2014, with release milestones marked along the timeline (MV 0.9.4, MV 1.0, MV 1.1, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV2 1.0.3, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.0a, MV2 2.0b, MV2 2.0rc2, MV2 2.0)]
MVAPICH2/MVAPICH2-X Software (Cont'd)
• Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering many TOP500 clusters
– 7th ranked 519,640-core cluster (Stampede) at TACC
– 13th, 74,358-core (Tsubame 2.5) at Tokyo Institute of Technology
– 23rd, 96,192-core (Pleiades) at NASA and many others
• Partner in the U.S. NSF-TACC Stampede System
Strong Procedure for Design, Development and Release
• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best performing designs are incorporated into the codebase
• Rigorous QA procedure before making a release
– Exhaustive unit testing
– Various test procedures on a diverse range of platforms and interconnects
– Performance tuning
– Applications-based evaluation
– Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing
MVAPICH2 Architecture (Latest Release 2.0)
Major Computing Platforms: IA-32, EM64T, Ivybridge, Westmere, Sandybridge, Opteron, Magny, ...
[Figure: MVAPICH2 software architecture diagram, supporting all different PCI interfaces]
MVAPICH2 2.0 and MVAPICH2-X 2.0
• Released on 06/20/2014
• Major Features and Enhancements
– Based on MPICH-3.1
– Extended support for MPI-3 RMA in OFA-IB-CH3, OFA-IWARP-CH3, and OFA-RoCE-CH3 interfaces
– Multiple optimizations and performance enhancements for RMA in OFA-IB-CH3 channel
– MPI-3 RMA support for CH3-PSM channel
– CMA support is now enabled by default
– Enhanced intra-node SMP performance
– Dynamic CUDA initialization
– Support for running on heterogeneous clusters with GPU and non-GPU nodes
– Multi-rail support for GPU communication and UD-Hybrid channel
– Improved job-startup performance for large-scale mpirun_rsh jobs
– Reduced memory footprint
– Updated compiler wrappers to remove application dependency on network and other extra libraries
– Updated hwloc to version 1.9
• MVAPICH2-X 2.0GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models.
– Based on MVAPICH2 2.0GA including MPI-3 features; Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
– Enhanced optimization of OpenSHMEM collectives and optimized shmalloc routine
– Optimized UPC collectives and support for GUPC translator
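As a quick usage illustration of the compiler wrappers and launcher mentioned above (file, host, and process counts below are placeholders, not taken from the release notes), an application is typically built with the MVAPICH2 wrapper and started with mpirun_rsh:

  mpicc -o app app.c
  mpirun_rsh -np 64 -hostfile ./hosts ./app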
One-way Latency: MPI over IB with MVAPICH2
[Figure: one-way MPI latency vs. message size - small-message latencies range from 0.99 us to 1.82 us across Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; a second panel shows large-message latency for the same adapters.
Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, IB switch; ConnectIB dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch]
Bandwidth: MPI over IB with MVAPICH2
[Figure: unidirectional and bidirectional MPI bandwidth (MBytes/sec) vs. message size for the same seven adapters; peak unidirectional bandwidth reaches 12,810 MBytes/sec and peak bidirectional bandwidth reaches 24,727 MBytes/sec with dual-FDR ConnectIB.
Platforms: same as the latency figure above]
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
Latest MVAPICH2 2.0, Intel Ivy-Bridge
[Figure: intra-node latency (us) vs. message size (0 B - 1 KB): 0.18 us intra-socket, 0.45 us inter-socket.
Figure: intra-socket and inter-socket bandwidth (MB/s) vs. message size for the CMA, shared-memory, and LiMIC channels, peaking at 14,250 MB/s (intra-socket) and 13,749 MB/s (inter-socket)]
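The intra-node channels compared above can usually be selected at run time through MVAPICH2 environment variables; a hedged sketch (parameter names as documented for MVAPICH2, assuming CMA and LiMIC2 support were enabled at configure time, with the variables passed through mpirun_rsh):

  mpirun_rsh -np 2 -hostfile ./hosts MV2_SMP_USE_CMA=1 ./osu_latency
  mpirun_rsh -np 2 -hostfile ./hosts MV2_SMP_USE_CMA=0 MV2_SMP_USE_LIMIC2=1 ./osu_latency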
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.0, Intel Sandy-Bridge with Connect-IB (single port)
[Figure: inter-node Get/Put latency (us) vs. message size (1 B - 4 KB): 2.04 us (Get) and 1.56 us (Put) for small messages.
Figure: intra-socket Get/Put latency (1 B - 8 KB): as low as 0.08 us.
Figure: intra-socket Get/Put bandwidth: up to about 15,000 MBytes/sec (14,926 and 15,364 MBytes/sec).
Figure: inter-node Get/Put bandwidth: up to about 6,880 MBytes/sec (6,881 and 6,876 MBytes/sec)]
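The benchmark above exercises the MPI-3 pattern of issuing a Put inside a passive-target epoch and completing it with MPI_Win_flush. A minimal sketch in C (window size, value, and target rank are illustrative only, not taken from the benchmark):

  /* MPI-3 RMA: Put followed by flush in a passive-target epoch */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      double *base;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Each process exposes a small window of 1024 doubles */
      MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

      if (rank == 0 && nprocs > 1) {
          double val = 3.14;
          MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);   /* open passive-target epoch at rank 1 */
          MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
          MPI_Win_flush(1, win);                      /* Put is complete at the target after the flush */
          MPI_Win_unlock(1, win);
      }

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }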
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Figure: MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, and UPC calls into a unified MVAPICH2-X runtime running over InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM available with
MVAPICH2-X 1.9 onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant (with initial support for UPC 1.3)
– Scalable Inter-node and intra-node communication – point-to-point and collectives
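As an illustration of the hybrid model enabled by the unified runtime, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in one program. Only standard OpenSHMEM 1.0 and MPI calls are used; the data sizes are illustrative, and the exact initialization order and build flags for MVAPICH2-X should be taken from its user guide.

  /* Hybrid MPI + OpenSHMEM sketch */
  #include <mpi.h>
  #include <shmem.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      start_pes(0);                          /* OpenSHMEM 1.0-style initialization */

      int me = _my_pe();
      int npes = _num_pes();

      /* PGAS phase: put one value into the symmetric heap of the next PE */
      long *dst = (long *) shmalloc(sizeof(long));
      long src = (long) me;
      shmem_long_put(dst, &src, 1, (me + 1) % npes);
      shmem_barrier_all();

      /* MPI phase: reduce the received values across all ranks */
      long sum = 0;
      MPI_Allreduce(dst, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
      if (me == 0) printf("sum = %ld\n", sum);

      shfree(dst);
      MPI_Finalize();
      return 0;
  }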
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
[Figure: an HPC application composed of kernels 1..N, all written in MPI; selected kernels (e.g., Kernel 2, Kernel N) are re-written in PGAS based on their communication characteristics]
Hybrid MPI+OpenSHMEM Graph500 Design
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
[Figure: Graph500 traversed edges per second (billions of TEPS) for the MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs - one plot vs. graph scale (26-29) and one vs. number of processes (1,024-8,192), showing strong and weak scalability]
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance
Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
[Figure: Graph500 execution time (s) at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the hybrid design is annotated with 7.6X and 13X improvements over MPI-Simple]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
MVAPICH2 2.1a and MVAPICH2-X 2.1a
• Will be available later this week or next week
• Features
– Based on MPICH 3.1.2
– PMI-2-based startup with SLURM
– Improved job-startup, etc.
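With PMI-2-based startup, jobs can be launched directly through the SLURM launcher; a hedged example (process count and binary name are placeholders, and the --mpi=pmi2 option must be enabled in the SLURM installation):

  srun --mpi=pmi2 -n 1024 ./app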
MVAPICH2-GDR 2.0
• Released on 08/23/2014
• Major Features and Enhancements
– Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)
– High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) using
GPUDirect RDMA and pipelining
– Support for MPI-3 RMA (one-sided) communication and atomics using GPU buffers with GPUDirect RDMA and pipelining
– Efficient small message inter-node communication using the new NVIDIA GDRCOPY module
– Efficient small message inter-node communication using loopback design
– Multi-rail support for inter-node point-to-point GPU communication
– High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
using CUDA IPC and pipelining
– Automatic communication channel selection for different GPU communication modes (DD, HH and HD) in different
configurations (intra-IOH and inter-IOH)
– Automatic selection and binding of the best GPU/HCA pair in multi-GPU/HCA system configuration
– Optimized and tuned support for collective communication from GPU buffers
– Enhanced and Efficient support for datatype processing on GPU buffers including support for vector/h-vector, index/h-index,
array and subarray
– Dynamic CUDA initialization. Support GPU device selection after MPI_Init
– Support for non-blocking streams in asynchronous CUDA transfers for better overlap
– Efficient synchronization using CUDA Events for pipelined device data transfers
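To illustrate the GPU-GPU communication path listed above, the sketch below passes CUDA device pointers directly to MPI calls, as a CUDA-aware library such as MVAPICH2-GDR permits; the buffer size and ranks are illustrative, and the job would normally be run with MV2_USE_CUDA=1.

  /* CUDA-aware MPI sketch: device buffers passed directly to MPI_Send/MPI_Recv */
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);       /* with dynamic CUDA initialization, the GPU
                                       may also be selected after MPI_Init */
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int n = 1 << 20;
      double *d_buf;
      cudaMalloc((void **)&d_buf, n * sizeof(double));   /* device buffer */

      if (rank == 0) {
          MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer */
      } else if (rank == 1) {
          MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
  }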
Performance of MVAPICH2 with GPU-Direct-RDMA: Latency (GPU-GPU Inter-node MPI Latency)
MVAPICH2-GDR 2.0; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA
[Figure: GPU-GPU inter-node small-message latency (us) vs. message size (0 B - 2 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.0 achieves 2.18 us, with annotated improvements of 71% and 90%]
Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth (GPU-GPU Inter-node MPI Unidirectional Bandwidth)
MVAPICH2-GDR 2.0; same Ivy Bridge + Tesla K40c + Connect-IB dual-FDR platform as above
[Figure: GPU-GPU inter-node small-message bandwidth (MB/s) vs. message size (1 B - 4 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; annotated gains of about 10x and 2.2x]
Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth (GPU-GPU Inter-node MPI Bidirectional Bandwidth)
MVAPICH2-GDR 2.0; same Ivy Bridge + Tesla K40c + Connect-IB dual-FDR platform as above
[Figure: GPU-GPU inter-node small-message bidirectional bandwidth (MB/s) vs. message size (1 B - 4 KB) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; annotated gains of about 10x and 2x]
Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA (GPU-GPU Inter-node MPI Put Latency, Device to Device)
MVAPICH2-GDR 2.0; same Ivy Bridge + Tesla K40c + Connect-IB dual-FDR platform as above
• MPI-3 RMA provides flexible synchronization and completion primitives
[Figure: GPU-GPU inter-node small-message Put latency (us) vs. message size (0 B - 8 KB) for MV2-GDR 2.0 and MV2-GDR 2.0b; MV2-GDR 2.0 achieves 2.88 us, an annotated 83% improvement]
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• MV2-GDR 2.0 (released on 08/23/14): try it out!
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
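These parameters are run-time settings; a hedged example of passing a subset of them through mpirun_rsh for a two-process GPU run (hostfile and binary names are placeholders):

  mpirun_rsh -np 2 -hostfile ./hosts MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384 ./app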
[Figures: HOOMD-blue strong scaling and HOOMD-blue weak scaling]
MPI Applications on MIC Clusters
[Figure: MPI execution modes on nodes with Xeon hosts and Xeon Phi coprocessors, from multi-core centric to many-core centric - Host-only (MPI on the Xeon only), Offload/reverse offload (MPI on the Xeon with computation offloaded to the Xeon Phi), Symmetric (MPI ranks on both Xeon and Xeon Phi), and Coprocessor-only (MPI on the Xeon Phi only)]
• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi; a small symmetric-mode sketch is shown below
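A small sketch of the symmetric idea above: the same MPI source is built twice, once for the host and once for the Xeon Phi (e.g., with the Intel compiler's -mmic flag), and ranks from both binaries join one MPI job. The __MIC__ check below is the standard predefined macro for MIC builds; launch details are site-specific and not shown here.

  /* Symmetric-mode sketch: each rank reports whether it runs on host or Xeon Phi */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  #ifdef __MIC__
      printf("rank %d running on a Xeon Phi coprocessor\n", rank);
  #else
      printf("rank %d running on a Xeon host\n", rank);
  #endif
      MPI_Finalize();
      return 0;
  }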
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intra-node Communication
– Coprocessor-only Mode
– Symmetric Mode
• Inter-node Communication
– Coprocessor-only Mode
– Symmetric Mode
• Multi-MIC Node Configurations
MIC-Remote-MIC P2P Communication
[Figure: MIC-to-remote-MIC point-to-point large-message latency (usec) and bandwidth (MB/sec) vs. message size for intra-socket and inter-socket P2P configurations; peak bandwidths of 5,236 and 5,594 MB/sec (lower is better for latency, higher is better for bandwidth)]
Latest Status on MVAPICH2-MIC
• Running on three major systems
– Stampede: module swap mvapich2 mvapich2-mic/20130911
– Blueridge (Virginia Tech): module swap mvapich2 mvapich2-mic/1.9
– Beacon (UTK): module unload intel-mpi; module load mvapich2-mic/1.9
• A new version based on MVAPICH2 2.0 is under development
• Will be available in a few weeks
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.
[Figure: 32-node Allgather small-message latency (16H + 16M), Allgather large-message latency (8H + 8M), and Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt, plus P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M) with problem size 2K x 2K x 1K; the optimized designs are annotated with improvements of 76%, 58%, and 55%]
OSU Micro-Benchmarks (OMB)
• Started in 2004 and continuing steadily
• Allows MPI developers and users to
– Test and evaluate MPI libraries
• Has a wide-range of benchmarks
– Two-sided (MPI-1, MPI-2 and MPI-3)
– One-sided (MPI-2 and MPI-3)
– RMA (MPI-3)
– Collectives (MPI-1, MPI-2 and MPI-3)
– Extensions for GPU-aware communication (CUDA and OpenACC)
– UPC (Pt-to-Pt)
– OpenSHMEM (Pt-to-Pt and Collectives)
• Widely-used in the MPI community
OMB (OSU Micro-Benchmarks) 4.4
• Released on 08/23/2014
• Includes benchmarks for MPI, OpenSHMEM, and UPC programming models
• MPI
– Support MPI-1, MPI-2 and MPI-3 standards
– Point-to-Point Benchmarks: Latency, Multi-threaded latency, Multi-pair latency, Multiple bandwidth / message rate test, Bandwidth, Bidirectional bandwidth
– Collective Benchmarks: Collective latency test for MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast,
MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter, and vector collectives
– One-sided Benchmarks: Put latency, Put bandwidth, Put bidirectional bandwidth, Get latency, Get bandwidth,
Accumulate latency, Atomics latency, and Get_accumulate latency
– CUDA Extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices
– OpenACC Extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices
– Support for MPI-3 RMA operations using GPU buffers
• OpenSHMEM
– Point-to-Point benchmarks: Put latency, Get latency, Put message rate, and Atomics latency
– Collective benchmarks: Collect latency, FCollect latency, Broadcast Latency, Reduce latency, and Barrier latency
• Unified Parallel C (UPC)
– Point-to-Point benchmarks: Put latency, Get latency
– Collective benchmarks: Barrier latency, Exchange latency, Gatherall latency, Gather latency, Reduce latency, and Scatter
latency
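Typical invocations of the point-to-point benchmarks (paths and hostfile are placeholders; in CUDA-enabled builds the trailing D/H arguments select device or host buffers):

  mpirun_rsh -np 2 -hostfile ./hosts ./osu_latency
  mpirun_rsh -np 2 -hostfile ./hosts MV2_USE_CUDA=1 ./osu_bw D D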
Virtualization Support with MVAPICH2 using SR-IOV: Intra-node Inter-VM Point-to-Point Latency and Bandwidth
• 1 VM per Core
• MVAPICH2-SR-IOV-IB adds only 3-7% (latency) and 3-8% (bandwidth) overhead compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB)
[Figure: intra-node inter-VM point-to-point latency (us) and bandwidth (MB/s) vs. message size (1 B - 1 MB), MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB]
Performance Evaluations with NAS and Graph500
• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 3-7% and 3-9% overheads for
NAS Benchmarks and Graph500, respectively, compared to
MVAPICH2-Native-IB
[Figure: execution time for NAS benchmarks (MG-B-64, CG-B-64, EP-B-64, LU-B-64, BT-B-64, in seconds) and Graph500 (problem sizes 20,10; 20,16; 22,16; 24,16, in milliseconds), MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB]
Performance Evaluation with LAMMPS
• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total
• MVAPICH2-SR-IOV-IB brings 7% and 9% overheads for LJ
and CHAIN in LAMMPS, respectively, compared to
MVAPICH2-Native-IB
[Figure: LAMMPS execution time (s) for LJ-64-scaled and CHAIN-64-scaled, MVAPICH2-SR-IOV-IB vs. MVAPICH2-Native-IB]
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of the Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
– Core-Direct support
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
• Low-overhead and high-performance Virtualization support
Web Pointers
NOWLAB Web Page
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu