Page 1

The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Presentation at the Mellanox Theater (SC '15)

Page 2

High-End Computing (HEC): PetaFlop to ExaFlop

100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?

Page 3

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design opportunities and challenges across the following layers, with performance, scalability and fault-resilience as cross-cutting concerns]
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Middleware: communication library or runtime for programming models, covering point-to-point communication (two-sided and one-sided), collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance
• Networking technologies (InfiniBand, 40/100GigE, Aries, and OmniPath), multi/many-core architectures, and accelerators (NVIDIA and MIC)

Page 4

Designing (MPI+X) at Exascale

• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
– Offload
– Non-blocking (see the sketch after this list)
– Topology-aware

• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Integrated Support for GPGPUs and Accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, …)

• Virtualization

• Energy-Awareness
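
The non-blocking collectives called out above are standard MPI-3 calls, so the application-side pattern can be sketched without anything MVAPICH2-specific. A minimal C illustration (not from the slides): start an MPI_Iallreduce, overlap it with independent work, and complete it with MPI_Wait.

```c
/* Minimal sketch of a non-blocking collective with computation overlap.
 * Build with any MPI-3 library, e.g.: mpicc nbc_overlap.c -o nbc_overlap */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double) rank, global = 0.0;
    MPI_Request req;

    /* Start the reduction, then keep computing while it progresses */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    double work = 0.0;
    for (int i = 0; i < 1000000; i++)       /* independent work */
        work += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* collective now complete */

    if (rank == 0)
        printf("sum of ranks = %.0f (overlapped work = %.3f)\n", global, work);

    MPI_Finalize();
    return 0;
}
```

Offloaded and topology-aware collective designs keep this application-level interface unchanged; only how the library progresses the operation differs.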

Page 5

MVAPICH2 Software

• High Performance open-source MPI Library for InfiniBand, 10-40 GigE/iWARP, and RDMA over

Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Used by more than 2,475 organizations in 76 countries

– More than 307,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘15 ranking)

• 10th ranked 519,640-core cluster (Stampede) at TACC

• 13th ranked 185,344-core cluster (Pleiades) at NASA

• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 6

MVAPICH Project Timeline

[Figure: project timeline from Oct '02 through Sep '15 showing when each release series started: MVAPICH (Oct '02, now end-of-life), OMB, MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, MVAPICH2-Virt, MVAPICH2-EA, and OSU-INAM, with the newest components (MVAPICH2-Virt, MVAPICH2-EA, OSU-INAM) arriving between Jul '15 and Sep '15]

Page 7

MVAPICH2 Software Family

Requirements -> MVAPICH2 library to use
MPI with IB, iWARP and RoCE -> MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE -> MVAPICH2-X
MPI with IB & GPU -> MVAPICH2-GDR
MPI with IB & MIC -> MVAPICH2-MIC
HPC Cloud with MPI & IB -> MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE -> MVAPICH2-EA

Page 8

MVAPICH/MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site, Sep '04 to Oct '15, growing from a few thousand to more than 300,000. Release markers along the timeline include MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2a, MV2-X 2.2b, and MV2 2.2b]

Page 9

MVAPICH2 Distributions

• MVAPICH2

– Basic MPI support for IB, iWARP and RoCE

• MVAPICH2-X

– MPI, PGAS and Hybrid MPI+PGAS support for IB

• MVAPICH2-MIC

– Optimized for IB clusters with Intel Xeon Phis

• MVAPICH2-Virt

– Optimized for HPC Clouds with IB and SR-IOV virtualization

– with and without OpenStack

• MVAPICH2-EA

– Energy Efficient Support for point-to-point and collective operations

– Compatible with OSU Energy Monitoring Tool (OEMT-0.8)

• OSU INAM

– InfiniBand Network Analysis and Monitoring Tool

• MVAPICH2-GDR (High-Performance for GPUs) will be presented tomorrow (Thursday, Nov 19th from 11:00-11:30am)

Page 10

Performance of MPI over IB with MVAPICH2

[Figure: small-message latency and unidirectional bandwidth of MVAPICH2 over four InfiniBand configurations, each on 2.8 GHz deca-core (IvyBridge) Intel nodes with PCI Gen3: TrueScale-QDR, ConnectX-3-FDR and ConnectIB-Dual FDR with an IB switch, and ConnectX-4-EDR back-to-back. Best small-message latencies are 1.05, 1.15, 1.19 and 1.26 us; peak unidirectional bandwidths are 3387, 6356, 11497 and 12465 MBytes/sec. A minimal ping-pong sketch of this kind of latency measurement follows.]
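
Numbers like the ones above are typically gathered with a ping-pong microbenchmark in the style of the OSU Micro-Benchmarks (OMB). The sketch below shows only the core pattern and is illustrative rather than the actual benchmark: OMB adds warm-up iterations, a message-size sweep and careful buffer handling.

```c
/* Rough sketch of a two-rank ping-pong latency test; run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 10000
#define SIZE  8                /* small message, in bytes */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[SIZE];
    memset(buf, 'a', SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("avg latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```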

Page 11

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)

[Figure: latest MVAPICH2 2.2b on Intel Ivy Bridge. Small-message latency is 0.18 us intra-socket and 0.45 us inter-socket. Intra-socket and inter-socket bandwidth, comparing the CMA, shared-memory and LiMIC channels, peaks at 14,250 MB/s and 13,749 MB/s.]

Page 12

MPI-3 RMA Get/Put with Flush Performance

[Figure: latest MVAPICH2 2.2b on Intel Sandy Bridge with Connect-IB (single port). Inter-node Get/Put latency is 1.56-2.04 us for small messages, and intra-socket Get/Put latency drops to 0.08 us; inter-node Get/Put bandwidth reaches about 6876-6881 MBytes/sec and intra-socket Get/Put bandwidth about 14926-15364 MBytes/sec. A code sketch of the Put-with-flush pattern follows.]
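
As a reminder of what "Get/Put with Flush" means at the application level, here is a hedged C sketch (not from the slides) of the MPI-3 passive-target pattern behind these numbers: a Put issued inside an MPI_Win_lock_all epoch and completed at the target with MPI_Win_flush. It assumes the common unified memory model for the final read.

```c
/* Minimal MPI-3 Put-with-flush sketch; run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0.0;

    MPI_Win_lock_all(0, win);              /* passive-target access epoch */

    if (rank == 0) {
        double val = 42.0;
        MPI_Put(&val, 1, MPI_DOUBLE, 1 /* target */, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(1, win);             /* Put now complete at rank 1 */
    }

    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);           /* order the read after the Put */

    if (rank == 1)
        printf("rank 1 received %.1f via MPI_Put\n", *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```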

Page 13

MVAPICH2 Distributions (same overview as Page 9)

Page 14

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC and CAF calls into a unified MVAPICH2-X runtime running over InfiniBand, RoCE and iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with

MVAPICH2-X 1.9 (2012) onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC (see the hybrid sketch after this list)

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard

compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable Inter-node and intra-node communication – point-to-point and collectives

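
To make the hybrid style concrete, here is a hedged sketch of a program mixing OpenSHMEM one-sided operations with an MPI collective, the pattern the unified MVAPICH2-X runtime is meant to support. It uses the shmem_init/shmem_my_pe names from OpenSHMEM 1.2 onwards (older codes use start_pes()); the exact initialization order and launch options are assumptions here, so follow the MVAPICH2-X user guide for supported usage.

```c
/* Hybrid MPI + OpenSHMEM sketch (illustrative only). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

static long counter = 0;        /* global => symmetric, remotely accessible */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();               /* both models share one runtime underneath */

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* OpenSHMEM one-sided: every PE atomically increments a counter on PE 0 */
    shmem_long_add(&counter, 1L, 0);
    shmem_barrier_all();

    /* MPI collective in the same program (only PE 0's counter is nonzero) */
    long total = 0;
    MPI_Allreduce(&counter, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("PE 0 saw %ld increments from %d PEs (MPI sum = %ld)\n",
               counter, npes, total);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

The Graph500 and Sort results on the next page rely on exactly this kind of mixing: MPI for some phases and OpenSHMEM one-sided operations and collectives for others.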

Page 15

Application-Level Performance with Graph500 and Sort

[Figure: Graph500 execution time (seconds) at 4K, 8K and 16K processes for the MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM) designs]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

[Figure: Sort execution time (seconds) for input sizes from 500 GB on 512 processes up to 4 TB on 4K processes, MPI vs. Hybrid]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
– 4,096 processes, 4 TB input: MPI 2408 sec (0.16 TB/min) vs. Hybrid 1172 sec (0.36 TB/min), a 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013

Page 16

Minimizing Memory Footprint Further with DC Transport

[Figure: eight processes (P0-P7) spread across four nodes and connected through an IB network, sharing DC connections]

• Constant connection cost (One QP for any peer)

• Full Feature Set (RDMA, Atomics etc)

• Separate objects for send (DC Initiator) and receive (DC Target)

– DC Target identified by “DCT Number”

– Messages routed with (DCT Number, LID)

– Requires same “DC Key” to enable communication

• Available with MVAPICH2-X 2.2a

[Figure: NAMD apoa1 (large data set) normalized execution time at 160-640 processes, and connection memory footprint (KB) for Alltoall at 80-640 processes, comparing the RC, DC-Pool, UD and XRC transports]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected

Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14).


Page 17

User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote noncontiguous memory access
• Avoids packing at the sender and unpacking at the receiver for noncontiguous MPI datatypes (see the sketch below)
• Available in MVAPICH2-X 2.2b

[Figure: MPI datatype latency with UMR versus the default pack/unpack scheme, for small and medium messages (4K-1M bytes) and large messages (2M-16M bytes); UMR reduces latency for large noncontiguous transfers]

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, “High Performance MPI Datatype Support with User-

mode Memory Registration: Challenges, Designs and Benefits”, CLUSTER, 2015

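The packing and unpacking that UMR avoids arise whenever an application sends noncontiguous data described by an MPI derived datatype. The hedged sketch below (not from the slides) sends one column of a row-major matrix using MPI_Type_vector; with the default scheme the library packs the column into a contiguous buffer, while a UMR-based design can register the strided layout and transfer it directly.

```c
/* Noncontiguous (strided) transfer via an MPI derived datatype; 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 6

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double a[ROWS][COLS];

    /* one matrix column: ROWS blocks of 1 element, stride of COLS elements */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = i * COLS + j;
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);   /* column 2 */
    } else if (rank == 1) {
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got column 2: %.0f %.0f %.0f %.0f\n",
               a[0][2], a[1][2], a[2][2], a[3][2]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```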

Page 18

MVAPICH2 Distributions (same overview as Page 9)

Page 19

MPI Applications on MIC Clusters

[Figure: options for placing MPI ranks on clusters with Xeon hosts and Xeon Phi coprocessors, from multi-core centric to many-core centric: Host-only (MPI on the Xeon only), Offload/reverse offload (MPI on the Xeon with computation offloaded to the Phi), Symmetric (MPI ranks on both Xeon and Phi), and Coprocessor-only (MPI ranks on the Phi only)]

• Flexibility in launching MPI jobs on clusters with Xeon Phi


Page 20

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload Mode
• Intranode Communication
– Coprocessor-only and Symmetric Mode
• Internode Communication
– Coprocessor-only and Symmetric Mode
• Multi-MIC Node Configurations
• Running on three major systems
– Stampede, Blueridge (Virginia Tech) and Beacon (UTK)


Page 21

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Figure: intra-socket and inter-socket MIC-to-remote-MIC P2P latency for large messages (8K-2M bytes) and bandwidth (1 byte to 1M bytes) with proxy-based communication; peak bandwidths of 5236 and 5594 MB/sec]

Page 22

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Figure: 32-node Allgather small-message latency (16 hosts + 16 MICs), 32-node Allgather and Alltoall large-message latency (8 hosts + 8 MICs), and P3DFFT execution time split into communication and computation (32 nodes, 8H + 8M, size 2K*2K*1K), comparing MV2-MIC with MV2-MIC-Opt; the optimized designs show gains of up to 76%, with 58% and 55% improvements also marked]

Page 23

MVAPICH2 Distributions (same overview as Page 9)

Page 24

Can HPC and Virtualization be Combined?

• Virtualization has many benefits
– Fault-tolerance
– Job migration
– Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack)

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14

J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid '15

Page 25

Application-Level Performance (8 VMs x 8 cores/VM)

[Figure: execution time of Graph500 (problem sizes given as scale and edgefactor: 20,10 through 22,20) and the NAS Class C benchmarks (FT-64-C, EP-64-C, LU-64-C, BT-64-C) under MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]

• Compared to native, 1-9% overhead for NAS
• Compared to native, 4-9% overhead for Graph500

Page 26

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument

– Targeting Big Data, Big Compute, Big Instrument research

– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network

• Reconfigurable instrument

– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use

• Connected instrument

– Workload and Trace Archive

– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others

– Partnerships with users

• Complementary instrument

– Complementing GENI, Grid’5000, and other testbeds

• Sustainable instrument

– Industry connections

http://www.chameleoncloud.org/


Page 27

MVAPICH2 Distributions (same overview as Page 9)

Page 28

Energy-Aware MVAPICH2 Library and OSU Energy Management Tool (OEMT)

• MVAPICH2-EA (Energy-Aware) MPI library
– Production-ready energy-aware MPI library
– New energy-efficient communication protocols for point-to-point and collective operations
– Intelligently applies the appropriate energy-saving techniques
– Application-oblivious energy saving
– Released 08/28/15
• OEMT
– A library utility to measure energy consumption for MPI applications
– Works with all MPI runtimes
– PRELOAD option for precompiled applications
– Does not require root permission: a safe kernel module reads only a subset of MSRs
• Available from: http://mvapich.cse.ohio-state.edu

Page 29

MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• A white-box approach
• Automatically and transparently uses the best energy lever (a conceptual PMPI sketch follows)
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015; Best Student Paper Finalist, presented in the Technical Papers program, Tuesday 3:30-4:00pm (Room 18CD)
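
To illustrate the idea of applying an energy lever around MPI calls, here is a conceptual sketch only, not the MVAPICH2-EA implementation: blocking calls are interposed through the standard PMPI profiling interface, and lower_frequency()/restore_frequency() are hypothetical placeholders for a real lever such as DVFS or core idling.

```c
/* Conceptual PMPI interposition sketch (NOT the MVAPICH2-EA code).
 * Compile and link this file with an MPI application, or build it into a
 * shared library and preload it. */
#include <mpi.h>

static void lower_frequency(void)   { /* apply an energy lever (placeholder) */ }
static void restore_frequency(void) { /* undo the lever (placeholder) */ }

/* Interpose MPI_Wait: slack spent waiting runs at reduced power.
 * A "pessimistic" runtime wraps every blocking call this way; EAM instead
 * decides when the lever is worth applying. */
int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    lower_frequency();
    int rc = PMPI_Wait(request, status);    /* call the real MPI_Wait */
    restore_frequency();
    return rc;
}
```

The pessimistic variant on the slide applies the lever unconditionally like this; EAM's contribution is choosing, with a bound on slowdown, when applying it actually pays off.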

Page 30

MVAPICH2 Distributions (same overview as Page 9)

Page 31

OSU InfiniBand Network Analysis and Monitoring Tool (INAM) – Network Level View

• Show network topology of large clusters

• Visualize traffic pattern on different links

• Quickly identify congested links/links in error state

• See the history unfold – play back historical state of the network

[Figure: full network view (152 nodes) and zoomed-in view of the network]

Page 32

OSU INAM Tool – Job and Node Level Views

[Figure: visualizing a job (5 nodes); finding routes between nodes]

• Job-level view
– Shows different network metrics (load, error, etc.) for any live job
– Plays back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
– CPU utilization for each rank/node
– Bytes sent/received for MPI operations (point-to-point, collective, RMA)
– Network metrics (e.g. XmitDiscard, RcvError) per rank/node

Page 33

MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 500K-1M cores – Dynamically Connected Transport (DCT) service with Connect-IB

• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF …) – Support for UPC++

• Enhanced Optimization for GPU Support and Accelerators

• Taking advantage of advanced features – User Mode Memory Registration (UMR)

– On-demand Paging

• Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath enabled Knights Landing architectures

• Extended RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Energy-aware point-to-point (one-sided and two-sided) and collectives

• Extended Support for MPI Tools Interface (as in MPI 3.0; see the MPI_T sketch after this list)

• Extended Checkpoint-Restart and migration support with SCR
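
The MPI Tools Interface named in the list above is a standard MPI-3 API, so a generic sketch needs nothing MVAPICH2-specific; which control and performance variables a library exposes, and their names, are entirely implementation dependent.

```c
/* List the first few MPI_T control variables exposed by the MPI library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    printf("library exposes %d control variables\n", num_cvars);

    for (int i = 0; i < num_cvars && i < 5; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        printf("  cvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```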

Page 34

Two Additional Talks

• Today, Wednesday (2:30-3:00pm)
– High Performance Big Data (HiBD): Accelerating Hadoop, Spark and Memcached on Modern Clusters
• Tomorrow, Thursday (11:00-11:30am)
– Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Page 35

Funding Acknowledgments

Funding Support by

Equipment Support by


Page 36

Personnel Acknowledgments

Current Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

– M. Li (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist

– S. Sur

Current Post-Doc

– J. Lin

– D. Banerjee

Current Programmer

– J. Perkins

Past Post-Docs

– H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Research Scientists

– H. Subramoni

– X. Lu

Past Programmers

– D. Bureddy

Current Senior Research Associate

– K. Hamidouche

Current Research Specialist

– M. Arnold
