The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Presentation at the Mellanox Theater (SC '15)
High-End Computing (HEC): PetaFlop to ExaFlop
Expected: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Layer diagram: co-design opportunities and challenges across the layers below, with cross-cutting concerns of performance, scalability and fault-resilience]
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Middleware: communication library or runtime for the programming models
– Point-to-point communication (two-sided and one-sided)
– Collective communication
– Energy-awareness
– Synchronization and locks
– I/O and file systems
– Fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and OmniPath), multi/many-core architectures, and accelerators (NVIDIA and MIC)
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
– Offload
– Non-blocking (see the sketch after this list)
– Topology-aware
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
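The non-blocking collectives called out above are standard MPI-3 API, usable with any compliant library including MVAPICH2; a minimal sketch of overlapping independent computation with MPI_Iallreduce, assuming an ordinary mpicc build:

/* Minimal sketch: overlapping computation with a non-blocking
 * MPI-3 collective (MPI_Iallreduce).  Standard MPI-3 API, not
 * MVAPICH2-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double) rank, global = 0.0;
    MPI_Request req;

    /* Start the reduction but do not wait for it yet. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here, overlapped
     * with the collective (offloaded to the HCA where supported) ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum of ranks = %f\n", global);

    MPI_Finalize();
    return 0;
}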
MVAPICH2 Software
• High-performance open-source MPI library for InfiniBand, 10/40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Used by more than 2,475 organizations in 76 countries
– More than 307,000 downloads directly from the OSU site
– Empowering many TOP500 clusters (Nov '15 ranking)
• 10th-ranked 519,640-core cluster (Stampede) at TACC
• 13th-ranked 185,344-core cluster (Pleiades) at NASA
• 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH Project Timeline
[Timeline figure, Oct '02 to the present: release spans for MVAPICH (now EOL), MVAPICH2, OMB, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt (Jul '15), MVAPICH2-EA (Aug '15) and OSU-INAM (Sep '15)]
MVAPICH2 Software Family
Requirements and the MVAPICH2 library to use:
• MPI with IB, iWARP and RoCE: MVAPICH2
• Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
• MPI with IB & GPU: MVAPICH2-GDR
• MPI with IB & MIC: MVAPICH2-MIC
• HPC cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
MVAPICH/MVAPICH2 Release Timeline and Downloads
[Figure: cumulative downloads from Sep '04 to Oct '15, growing past 300,000, annotated with the releases MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2a, MV2-X 2.2b and MV2 2.2b]
MVAPICH2 Distributions
• MVAPICH2
– Basic MPI support for IB, iWARP and RoCE
• MVAPICH2-X
– MPI, PGAS and hybrid MPI+PGAS support for IB
• MVAPICH2-MIC
– Optimized for IB clusters with Intel Xeon Phis
• MVAPICH2-Virt
– Optimized for HPC clouds with IB and SR-IOV virtualization, with and without OpenStack
• MVAPICH2-EA
– Energy-efficient support for point-to-point and collective operations
– Compatible with the OSU Energy Monitoring Tool (OEMT-0.8)
• OSU INAM
– InfiniBand Network Analysis and Monitoring tool
• MVAPICH2-GDR (high performance for GPUs) will be presented tomorrow (Thursday, Nov 19th, 11:00-11:30am)
Performance of MPI over IB with MVAPICH2
[Figure: small-message latency and unidirectional bandwidth vs. message size; small-message latencies of 1.05-1.26 us and peak unidirectional bandwidths of 3,387 to 12,465 MB/s across the four adapters below]
Platforms: TrueScale-QDR, ConnectX-3 FDR and Connect-IB dual FDR with an IB switch, and ConnectX-4 EDR back-to-back, each on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCIe Gen3
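Latency and bandwidth figures of this kind are normally gathered with the OSU Micro-Benchmarks (osu_latency, osu_bw) distributed alongside MVAPICH2. As a rough illustration of the measurement, a minimal ping-pong latency sketch in the same spirit (not the actual OMB code), run with two ranks on two nodes:

/* Minimal ping-pong latency sketch in the spirit of osu_latency
 * from the OSU Micro-Benchmarks; illustration only. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define MSG_SIZE 8   /* small message, in bytes */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}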
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-Based Zero-Copy Support: LiMIC and CMA)
Latest MVAPICH2 2.2b on Intel Ivy-Bridge
[Figure: intra-node small-message latency of 0.18 us (intra-socket) and 0.45 us (inter-socket); intra-socket and inter-socket bandwidth for the CMA, shared-memory and LiMIC channels, peaking at 13,749-14,250 MB/s]
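CMA (Cross Memory Attach) is the kernel-assisted zero-copy path referenced above: the library moves large intra-node messages with a single kernel-mediated copy between the two processes' address spaces instead of staging them through a shared-memory buffer. A minimal illustration of the underlying Linux primitive follows; applications never call this directly (the MPI library does, and LiMIC uses its own kernel module instead). For simplicity the sketch "remote"-reads from its own PID:

/* Illustration of the CMA primitive behind kernel-assisted
 * intra-node zero-copy: one process reads directly from another
 * process's address space in a single copy. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* In an MPI library the pid and remote address belong to the
     * peer process; here we use our own for a runnable example. */
    char src[64] = "intra-node payload";
    char dst[64] = {0};

    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

    /* One kernel-mediated copy, no shared-memory bounce buffer. */
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
    printf("copied %zd bytes: %s\n", n, dst);
    return 0;
}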
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.2b, Intel Sandy-Bridge with Connect-IB (single port)
[Figure: inter-node Get/Put latency of 1.56-2.04 us and bandwidth of about 6,876-6,881 MB/s; intra-socket Get/Put latency of 0.08 us and bandwidth of about 14,926-15,364 MB/s]
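The numbers above exercise MPI-3 passive-target RMA, where a Put or Get is completed at the target by MPI_Win_flush without involving the target process. A minimal sketch of that pattern using standard MPI-3 calls (two ranks assumed):

/* Minimal MPI-3 passive-target RMA sketch: MPI_Put completed with
 * MPI_Win_flush, the pattern measured above.  Run with 2+ ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 42.0, *win_buf;
    MPI_Win win;

    /* Each rank exposes one double through the window. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0.0;

    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);   /* passive target: rank 1 */
        MPI_Put(&local, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(1, win);                      /* Put is remotely complete here */
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}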
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications
[Diagram: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC and CAF calls into a unified MVAPICH2-X runtime running over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available with MVAPICH2-X 1.9 (2012) onwards
– http://mvapich.cse.ohio-state.edu
• Feature highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
– Scalable inter-node and intra-node communication: point-to-point and collectives
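A minimal sketch of what the unified runtime enables: one program mixing two-sided MPI with one-sided OpenSHMEM operations. This is an illustration only; initialization order and build flags are runtime-specific, so consult the MVAPICH2-X user guide for the exact requirements:

/* Hybrid MPI + OpenSHMEM sketch for a unified runtime such as
 * MVAPICH2-X.  Illustration only; not a verbatim MVAPICH2-X example. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                       /* OpenSHMEM 1.0-style initialization */

    int rank, npes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    /* Symmetric heap allocation: same offset on every PE. */
    long *counter = (long *) shmalloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided OpenSHMEM put to the next PE ... */
    long val = rank;
    shmem_long_put(counter, &val, 1, (rank + 1) % npes);
    shmem_barrier_all();

    /* ... mixed freely with two-sided/collective MPI in the same program. */
    long sum = 0;
    MPI_Reduce(counter, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of received values = %ld\n", sum);

    shfree(counter);
    MPI_Finalize();
    return 0;
}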
Application-Level Performance with Graph500 and Sort
Graph500 execution time
[Figure: Graph500 execution time at 4K, 8K and 16K processes for the MPI-Simple, MPI-CSC, MPI-CSR and hybrid (MPI+OpenSHMEM) designs]
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X over MPI-Simple
Sort execution time
[Figure: Sort execution time for the MPI and hybrid designs from 500 GB at 512 processes up to 4 TB at 4K processes]
• Performance of the hybrid (MPI+OpenSHMEM) Sort application
– 4,096 processes, 4 TB input: MPI 2,408 s (0.16 TB/min) vs. hybrid 1,172 s (0.36 TB/min), a 51% improvement over the MPI design
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
Minimizing Memory Footprint Further with the DC Transport
[Diagram: processes P0-P7 on four nodes communicating over the IB network through Dynamically Connected endpoints]
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
– DC Target identified by a "DCT number"
– Messages routed with (DCT number, LID)
– Requires the same "DC key" to enable communication
• Available with MVAPICH2-X 2.2a
[Figures: NAMD apoa1 (large data set) normalized execution time at 160-620 processes for the RC, DC-Pool, UD and XRC transports; connection memory footprint for Alltoall at 80-640 processes, where RC grows from 10 KB to 97 KB while the other transports remain between roughly 1 KB and 10 KB]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
User-Mode Memory Registration (UMR)
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
• Avoids packing at the sender and unpacking at the receiver
• Available in MVAPICH2-X 2.2b
[Figure: MPI datatype latency with UMR vs. the default (pack/unpack) design for small and medium messages (4 KB-1 MB) and large messages (2 MB-16 MB)]
Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel nodes, PCIe Gen3, Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
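UMR is transparent to applications: a derived-datatype transfer such as the strided-column exchange below is written the same way whether the library packs internally or lets the HCA gather the noncontiguous data directly. A minimal sketch using standard MPI datatype calls (two ranks assumed):

/* Noncontiguous transfer with an MPI derived datatype (a strided
 * column of a matrix).  With UMR support the library can let the
 * HCA gather the strided data directly, avoiding pack/unpack; the
 * application code is identical either way. */
#include <mpi.h>

#define ROWS 1024
#define COLS 1024

int main(int argc, char **argv)
{
    static double matrix[ROWS][COLS];
    int rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of 1 double, separated by a stride of COLS doubles. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&matrix[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}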
MPI Applications on MIC Clusters
[Diagram: execution modes spanning Xeon hosts and Xeon Phi coprocessors, from multi-core-centric to many-core-centric: host-only, offload (/reverse offload), symmetric, and coprocessor-only]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
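In symmetric mode a single MPI job spans host CPUs and coprocessors; a small sketch that simply reports where each rank landed can help verify such a launch. Launch mechanics (for example, hostfile entries naming the MIC cards and per-device binaries) are system- and runtime-specific and are not shown:

/* Small sketch for a symmetric-mode job: each rank reports the node
 * (host or Xeon Phi) it is running on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);

    printf("rank %d running on %s\n", rank, name);

    MPI_Finalize();
    return 0;
}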
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
• Offload mode
• Intra-node communication
– Coprocessor-only and symmetric modes
• Inter-node communication
– Coprocessor-only and symmetric modes
• Multi-MIC node configurations
• Running on three major systems
– Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-Based Communication
[Figure: MIC-to-remote-MIC large-message latency and bandwidth for intra-socket and inter-socket P2P, with peak bandwidths of 5,236-5,594 MB/s]
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
[Figures: 32-node Allgather small-message latency (16H + 16M) and large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), comparing MV2-MIC with MV2-MIC-Opt; P3DFFT execution time on 32 nodes (8H + 8M) with a 2K x 2K x 1K problem size, split into communication and computation; annotated improvements of 76%, 58% and 55%]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Can HPC and Virtualization be Combined?
• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack)
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid '15
Application-Level Performance (8 VMs x 8 cores/VM)
[Figures: NAS Class C (FT, EP, LU, BT at 64 processes) and Graph500 (scale/edgefactor from 20,10 to 22,20) execution times for MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]
• Compared to native, 1-9% overhead for NAS
• Compared to native, 4-9% overhead for Graph500
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument
• Large-scale instrument
– Targeting Big Data, Big Compute, Big Instrument research
– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
• Reconfigurable instrument
– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use
• Connected instrument
– Workload and Trace Archive
– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
– Partnerships with users
• Complementary instrument
– Complementing GENI, Grid’5000, and other testbeds
• Sustainable instrument
– Industry connections
http://www.chameleoncloud.org/
Energy-Aware MVAPICH2 Library and OSU Energy Management Tool (OEMT)
• MVAPICH2-EA (Energy-Aware) MPI library
– Production-ready energy-aware MPI library
– New energy-efficient communication protocols for point-to-point and collective operations
– Intelligently applies the appropriate energy-saving techniques
– Application-oblivious energy saving
– Released 08/28/15
• OEMT
– A library utility to measure the energy consumption of MPI applications
– Works with all MPI runtimes
– PRELOAD option for precompiled applications
– Does not require root permission: a safe kernel module reads only a subset of MSRs
• Available from http://mvapich.cse.ohio-state.edu
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• A white-box approach
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy-reduction lever to each MPI call
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson and A. Hoisie, Supercomputing '15, Nov 2015, Best Student Paper Finalist; presented in the Technical Papers Program, Tuesday 3:30-4:00pm (Room 18CD)
OSU InfiniBand Network Analysis Monitoring Tool (INAM) – Network Level View
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network
[Screenshots: full network (152 nodes) and zoomed-in view of the network]
OSU INAM Tool – Job and Node Level Views
[Screenshots: visualizing a job (5 nodes); finding routes between nodes]
• Job-level view
– Shows different network metrics (load, error, etc.) for any live job
– Plays back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
– Network metrics (e.g., XmitDiscard, RcvError) per rank/node
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 500K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
– Support for UPC++
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
– User-Mode Memory Registration (UMR)
– On-Demand Paging
• Enhanced inter-node and intra-node communication schemes for upcoming OmniPath-enabled Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Energy-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
Two Additional Talks
• Today, Wednesday (2:30-3:00pm): High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
• Tomorrow, Thursday (11:00-11:30am): Exploiting the Full Potential of GPU Clusters with InfiniBand Using MVAPICH2-GDR
Funding Acknowledgments
[Logos: funding support by sponsoring agencies; equipment support by vendor partners]
Personnel Acknowledgments
Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Current Post-Doc
– J. Lin
– D. Banerjee
Current Programmer
– J. Perkins
Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Research Scientists Current Senior Research Associate
– H. Subramoni
– X. Lu
Past Programmers
– D. Bureddy
- K. Hamidouche
Current Research Specialist
– M. Arnold
Web Pointers
NOWLAB Web Page
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu