MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

GTC 2017 Presentation Transcript
Page 1

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

GPU Technology Conference GTC 2017

by

Hari Subramoni

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~subramon

Page 2

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• What's new with MVAPICH2-GDR
  • Efficient MPI-3 non-blocking collective support
  • Maximal overlap in MPI datatype processing
  • Efficient support for managed memory
  • RoCE and optimized collectives
  • Initial support for the GPUDirect Async feature
• Efficient Deep Learning with MVAPICH2-GDR
• OpenACC-aware support
• Conclusions

Outline

Page 3

Overview of the MVAPICH2 Project

• High performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,750 organizations in 84 countries

– More than 416,000 (> 0.4 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘16 ranking)

• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 13th, 241,108-core (Pleiades) at NASA

• 17th, 462,462-core (Stampede) at TACC

• 40th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Sunway TaihuLight (1st in Jun’16, 10M cores, 100 PFlops)

Page 4

MVAPICH2 Release Timeline and Downloads

[Figure: Cumulative number of downloads from the OSU site, Sep-04 through Mar-17 (0 to ~450,000), annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2rc1, MV2-X 2.2 and MV2 2.3a. X-axis: timeline; Y-axis: number of downloads.]

Page 5

MVAPICH2 Software Family

Requirements                                                   MVAPICH2 Library to use
MPI with IB, iWARP and RoCE                                    MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE     MVAPICH2-X
MPI with IB & GPU                                              MVAPICH2-GDR
MPI with IB & MIC                                              MVAPICH2-MIC
HPC Cloud with MPI & IB                                        MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE                       MVAPICH2-EA

Page 6

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC
• Transport mechanisms: shared memory, CMA, IVSHMEM
• Modern features: UMR, ODP, SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*

* Upcoming

Page 7

[Figure: Two nodes, each with CPUs connected by QPI, GPUs attached over PCIe, memory buffers, and an IB HCA, illustrating the possible communication paths:]

1. Intra-GPU
2. Intra-socket GPU-GPU
3. Inter-socket GPU-GPU
4. Inter-node GPU-GPU
5. Intra-socket GPU-Host
6. Inter-socket GPU-Host
7. Inter-node GPU-Host
8. Inter-node GPU-GPU with IB adapter on remote socket
... and more

• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• Critical for runtimes to optimize data movement while hiding the complexity

• Connected as PCIe devices – Flexibility but Complexity

Optimizing MPI Data Movement on GPU Clusters

Page 8

At Sender:
    MPI_Send(s_devbuf, size, ...);

At Receiver:
    MPI_Recv(r_devbuf, size, ...);

(handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

High Performance and High Productivity

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
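A minimal sketch of the pattern above, assuming a CUDA-aware MVAPICH2 build (e.g., run with MV2_USE_CUDA=1); the buffer name and message size are illustrative, not taken from the slides:

    /* Ranks 0 and 1 exchange a GPU-resident buffer by passing device pointers
     * directly to standard MPI calls; the library detects them via UVA. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                      /* illustrative message size */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }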

Page 9

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 10

• MVAPICH2-2.2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/

• Please select the best matching package for your system
• Most common combinations are available; if you do not find a match, please email OSU with the following details:

– OS versions

– OFED version

– CUDA Version

– Compiler (GCC, Intel and PGI)

• Install instructions
  • With root permissions:
    • On the default path: rpm -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
    • On a specific path: rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
  • Without root permissions:
    • rpm2cpio mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm | cpio -id

• For more details on the installation process, refer to:

http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2#_installing_mvapich2_gdr_library

Installing MVAPICH2-GDR

Page 11

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figures: GPU-GPU inter-node latency, bandwidth and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.2, MV2-GDR 2.0b and MV2 without GDR — latency as low as 2.18 us, with improvements of roughly 2x to 11x annotated across the three metrics.]

Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct RDMA.

Page 12

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB); HOOMD-blue version 1.0.5

• GDRCOPY enabled, with runtime parameters: MV2_USE_CUDA=1, MV2_IBA_HCA=mlx5_0, MV2_IBA_EAGER_THRESHOLD=32768, MV2_VBUF_TOTAL_SIZE=32768, MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768, MV2_USE_GPUDIRECT_GDRCOPY=1, MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (HOOMD-blue)

[Figures: Average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle runs, comparing MV2 and MV2+GDR — roughly 2x higher TPS with GDR.]

Page 13

Full and Efficient MPI-3 RMA Support

[Figure: MPI-3 RMA small-message latency from GPU buffers vs. message size — as low as 2.88 us, a 6x improvement.]

Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA.

Page 14

Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support

[Figures: GPU-GPU inter-node MPI uni-directional bandwidth (up to 8,759 MB/s) and bi-directional bandwidth (up to 15,111 MB/s) vs. message size, comparing MV2-GDR 2.1 and MV2-GDR 2.1 RC2 with multi-rail support — improvements of about 20-40%.]

Platform: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct RDMA.

Page 15


Outline

Page 16

Non-Blocking Collectives (NBC) using Core-Direct Offload

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HIPC, 2015

• MPI NBC decouples the initiation (e.g., Ialltoall) and completion (Wait) phases and provides overlap potential (Ialltoall + compute + Wait), but the CPU largely drives progress inside Wait, yielding close to zero overlap in practice (see the sketch after this list)

• The CORE-Direct feature provides true overlap by letting the application specify a priori a list of network tasks that are progressed by the NIC instead of the CPU (hence freeing it)

• We propose a design that combines GPUDirect RDMA and Core-Direct features to provide efficient support of CUDA-Aware NBC collectives on GPU buffers

• Overlap communication with CPU computation

• Overlap communication with GPU computation

• Extend OMB with CUDA-Aware NBC benchmarks to evaluate

• Latency

• Overlap on both CPU and GPU
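A hedged sketch of that initiate-compute-wait pattern with a CUDA-aware non-blocking collective; the kernel, buffer names and launch configuration are illustrative placeholders, not code from the slides:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Placeholder GPU work that is independent of the collective's buffers */
    __global__ void independent_compute(float *work, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) work[i] *= 2.0f;
    }

    /* Initiate a non-blocking all-to-all on device buffers, overlap it with
     * independent GPU compute, then complete it with MPI_Wait. */
    void ialltoall_overlap(float *d_send, float *d_recv, int count,
                           float *d_work, int n, MPI_Comm comm, cudaStream_t stream) {
        MPI_Request req;
        MPI_Ialltoall(d_send, count, MPI_FLOAT, d_recv, count, MPI_FLOAT, comm, &req);

        independent_compute<<<(n + 255) / 256, 256, 0, stream>>>(d_work, n);

        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* with CORE-Direct, the NIC progresses the collective */
        cudaStreamSynchronize(stream);
    }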

Page 17

[Figures: Medium/large-message overlap (%) on 64 GPU nodes for Ialltoall and Igather, 4 KB to 1 MB messages, with 1 process/node and with 2 processes/node (1 process/GPU).]

Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

Available since MVAPICH2-GDR 2.2b

CUDA-Aware Non-Blocking Collectives

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HIPC, 2015

Page 18


Outline

Page 19

• Multi-dimensional data
  • Row-based organization
  • Contiguous in one dimension
  • Non-contiguous in the other dimensions

• Halo data exchange
  • Duplicate the boundary
  • Exchange the boundary in each iteration

[Figure: Halo data exchange]

Non-contiguous Data Exchange

Page 20

MPI Datatype Support in MVAPICH2

• Datatype support in MPI

– Operate on customized datatypes to improve productivity

– Enable MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    ...
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Inside MVAPICH2
  - Uses datatype-specific CUDA kernels to pack data in chunks
  - Efficiently moves data between nodes using RDMA
  - In progress: currently optimizes vector and hindexed datatypes
  - Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
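For completeness, a hedged receiver-side sketch matching the vector layout above on a GPU buffer; the element type (MPI_DOUBLE) and all names are assumptions for illustration:

    #include <mpi.h>

    /* Receive a strided (non-contiguous) region directly into a GPU buffer that
     * uses the same block/stride layout as the sender's MPI_Type_vector. */
    void recv_strided_gpu(double *r_devbuf, int n_blocks, int n_elements, int stride,
                          int src, int tag, MPI_Comm comm) {
        MPI_Datatype new_type;
        MPI_Type_vector(n_blocks, n_elements, stride, MPI_DOUBLE, &new_type);
        MPI_Type_commit(&new_type);

        /* Count 1: a single element of new_type covers the whole strided region */
        MPI_Recv(r_devbuf, 1, new_type, src, tag, comm, MPI_STATUS_IGNORE);

        MPI_Type_free(&new_type);
    }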

Page 21

MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  • Targeted kernels for regular datatypes: vector, subarray, indexed_block
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  • Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM

Page 22

MPI Datatype Processing (Communication Optimization)

Common scenario: waste of computing resources on the CPU and GPU

    MPI_Isend(A, ..., datatype, ...);
    MPI_Isend(B, ..., datatype, ...);
    MPI_Isend(C, ..., datatype, ...);
    MPI_Isend(D, ..., datatype, ...);
    ...
    MPI_Waitall(...);

*A, B, ... contain non-contiguous MPI datatypes

Page 23

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figures: Normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing default, callback-based and event-based designs.]

• 2x improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

Page 24


Outline

Page 25

Initial (Basic) Support for GPU Managed Memory

● With CUDA 6.0, NVIDIA introduced CUDA Managed (Unified) Memory, which allows a common memory allocation for GPU and CPU through the cudaMallocManaged() call

● Significant productivity benefits by abstracting away explicit device allocation and cudaMemcpy()

● MVAPICH2 extended to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b)

● OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available since OMB 5.2)

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16

[Figure: 2D stencil halo exchange time (ms) for halo width 1, total dimension sizes from 1 byte to 16 KB, comparing explicit device buffers and managed buffers.]
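A minimal sketch of the usage this enables, assuming a managed-memory-aware MPI such as MVAPICH2-GDR 2.2b or later; sizes and names are illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 16;
        float *buf;
        cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);  /* visible to CPU and GPU */

        if (rank == 0) {
            for (int i = 0; i < n; i++) buf[i] = (float)i;      /* fill on the host */
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);  /* managed pointer passed directly */
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* buf is now usable by GPU kernels or host code without explicit cudaMemcpy */
        }

        cudaFree(buf);
        MPI_Finalize();
        return 0;
    }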

Page 26

Enhanced Support for Intra-node Managed Memory

● CUDA managed memory => no memory pin-down
  ● No IPC support for intra-node communication
  ● No GDR support for inter-node communication
● Initial and basic support in MVAPICH2-GDR
  ● For both intra- and inter-node, "pipeline through" host memory
● Enhanced intra-node managed memory support using IPC
  ● Double-buffering, pair-wise IPC-based scheme
  ● Brings IPC performance to managed memory
  ● High performance and high productivity
  ● 2.5x improvement in bandwidth

● Available since MVAPICH2-GDR 2.2RC1

[Figures: Intra-node latency (4 KB-4 MB) and bandwidth (32 KB-2 MB) for managed buffers, comparing the enhanced IPC-based design with MV2-GDR 2.2b — up to 2.5x bandwidth improvement.]

Page 27

Enhanced Support for Inter-node Managed Memory

● Enhanced inter-node managed memory support using GDR
● SGL-based scheme:
  ● Eager pre-allocated GPU VBUFs
  ● Register the GPU VBUFs with the HCA
  ● Scatter-Gather List to combine control and data messages
● Brings GDR performance to managed memory
● High performance and high productivity
● Up to 25% improvement for small and medium messages (micro-benchmark)
● Up to 1.92x improvement for the GPU-LBM application
● Will be available in a future MVAPICH2-GDR release

K. Hamidouche, A. A. Awan, A. Venkatesh and D. K. Panda, "CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC," 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), Hyderabad, 2016

Page 28


Outline

Page 29

• RoCE V1 and V2 support

• RDMA_CM connection support

• CUDA-Aware Collective Tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
    • Tuned thresholds for the different communication patterns and features
    • Depending on the system configuration (CPU, HCA and GPU models)
  – Tuning framework for GPU-based collectives
    • Selects the best algorithm depending on message size, system size and system configuration
    • Support for Bcast and Gather operations for different GDR-enabled systems
    • Available since the MVAPICH2-GDR 2.2RC1 release

RoCE and Optimized Collectives Support

Page 30


Outline

Page 31

Overview of GPUDirect aSync (GDS) Feature: Current MPI+CUDA interaction

    CUDA_Kernel_a<<<...>>>(A, ..., stream1);
    cudaStreamSynchronize(stream1);
    MPI_Isend(A, ..., req1);
    MPI_Wait(req1);
    CUDA_Kernel_b<<<...>>>(B, ..., stream1);

[Figure: timeline of GPU, CPU and HCA activity for the sequence above]

100% CPU control:
• Limits the throughput of a GPU
• Limits asynchronous progress
• Wastes CPU cycles

Page 32

MVAPICH2-GDS: Decouple GPU Control Flow from CPU

    CUDA_Kernel_a<<<...>>>(A, ..., stream1);
    MPI_Isend(A, ..., req1, stream1);
    MPI_Wait(req1, stream1);      /* non-blocking from the CPU */
    CUDA_Kernel_b<<<...>>>(B, ..., stream1);

[Figure: timeline of GPU, CPU and HCA activity for the sequence above]

The CPU offloads the compute, communication and synchronization tasks to the GPU:
• CPU is out of the critical path
• Tight interaction between GPU and HCA
• Hides the overhead of kernel launch
• Requires MPI semantics extensions
  • All operations are asynchronous from the CPU
  • Extend MPI semantics with stream-based semantics

Kernel launch overhead hidden

Page 33

MVAPICH2-GDS: Preliminary Micro-benchmark Results

[Figures: Latency (us) vs. message size for the latency-oriented case (Send+kernel and Recv+kernel, 8-512 bytes) and the throughput-oriented back-to-back case (16-1024 bytes), comparing Default and Enhanced-GDS.]

• Latency oriented: able to hide the kernel launch overhead
  – 25% improvement at 256 bytes compared to the default behavior

• Throughput oriented: asynchronously offloads the communication and computation tasks to the queue
  – 14% improvement at 1 KB message size
  – Requires some tuning; better performance expected for applications with different kernels

Platform: Intel Sandy Bridge, NVIDIA K20 and Mellanox FDR HCA

Page 34

MVAPICH2-GDS: Results with 1D Stencil Kernels

• Point-to-point (kernels+Send and Recv+kernels): hides the kernel launch overhead
  – 30% improvement at 16 KB compared to traditional MPI+CUDA programs

• Collective (Broadcast+kernels): overlaps computation with GPU communication (GDS)
  – 36% improvement at 64 KB compared to traditional MPI+CUDA programs

Platform: Intel Broadwell CPU, NVIDIA K80 and Mellanox EDR HCA

Akshay Venkatesh, Ching-Hsiang Chu, Khaled Hamidouche, Sreeram Potluri, Davide Rossetti and Dhabaleswar K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, ICPP 2017, Bristol, UK, Aug 14-17, 2017. (Accepted)

[Figures: Latency (us) for the point-to-point case (2 B-8 KB messages) and the collective Broadcast+kernels case (16 KB-64 KB messages), comparing Default MPI and MPI-GDS; annotated improvements of 1.4x and 1.8x.]

Page 35


Outline

Page 36

• Many Deep Learning (DL) frameworks have emerged

– Berkeley Caffe

– Google TensorFlow

– Microsoft CNTK

– Facebook Torch

– Facebook Caffe2

• Broadly, DL frameworks are being developed along two directions:

1. The HPC eco-system: MPI-based Deep Learning

2. Enterprise Eco-system: BigData-based Deep Learning

Deep Learning Frameworks

Courtesy: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/

Page 37

• Scale-up: Intra-node Performance
  – Many improvements, e.g., NVIDIA cuDNN, cuBLAS, NCCL

• Scale-out: Inter-node Performance
  – Most DL frameworks are single-node only
  – Distributed (parallel) training is an emerging trend
    • S-Caffe / OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI-based
    • Google TensorFlow – gRPC-based
    • Facebook Caffe2 – hybrid

Parallel Training: Scale-up and Scale-out

[Figure: Positioning of cuDNN, cuBLAS, NCCL, gRPC, Hadoop and MPI along scale-up performance and scale-out performance axes.]

Page 38

How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?

Broad Challenge

Page 39

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers

• How to address these newer requirements?
  – GPU-specific communication libraries (NCCL)
    • NVIDIA's NCCL library provides inter-GPU communication
  – CUDA-Aware MPI (MVAPICH2-GDR)
    • Provides support for GPU-based communication

• Can we exploit CUDA-Aware MPI and NCCL to support Deep Learning applications?

New Challenges for MPI Runtimes

[Figure: Hierarchical communication (K-nomial + NCCL ring)]

Page 40

• NCCL has some limitations
  – Only works within a single node; no scale-out across multiple nodes
  – Degradation across the IOH (socket) for scale-up within a node

• We propose an optimized MPI_Bcast
  – Communication of very large GPU buffers (order of megabytes)
  – Scale-out on a large number of dense multi-GPU nodes

• Hierarchical communication that efficiently exploits:
  – CUDA-Aware MPI_Bcast in MV2-GDR
  – The NCCL broadcast primitive

Efficient Broadcast: MVAPICH2-GDR and NCCL

[Figures: Performance benefits with the Microsoft CNTK DL framework (25% average improvement) and with OSU micro-benchmarks (up to 2.2x), comparing MV2-GDR and MV2-GDR-Opt on 2-64 GPUs.]

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
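A hedged sketch of the usage pattern this optimized broadcast targets: an MPI_Bcast over a large GPU-resident buffer, as used to distribute DL model parameters; the function name, size and communicator are illustrative, and the hierarchical MPI+NCCL design stays hidden inside the library:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Broadcast a large GPU buffer (order of megabytes) from rank 0 to all ranks. */
    void bcast_model_params(float *d_params, size_t count, MPI_Comm comm) {
        /* The device pointer goes straight to MPI_Bcast; a CUDA-aware library such
         * as MVAPICH2-GDR handles GPU staging and can internally combine its
         * CUDA-aware broadcast with NCCL within a node. */
        MPI_Bcast(d_params, (int)count, MPI_FLOAT, 0, comm);
    }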

Page 41

Large Message Optimized Collectives for Deep Learning

[Figures: Latency (ms) of Reduce (192 GPUs, 2-128 MB; 64 MB on 128-192 GPUs), Allreduce (128 MB on 16-64 GPUs) and Bcast (64 GPUs, 2-128 MB; 128 MB on 16-64 GPUs).]

• MV2-GDR provides optimized collectives for large message sizes

• Optimized Reduce, Allreduce, and Bcast

• Good scaling with large number of GPUs

• Available with MVAPICH2-GDR 2.2GA

[Figure: Allreduce latency (ms) on 64 GPUs for 2-128 MB messages.]

Page 42

• Caffe: a flexible and layered Deep Learning framework

• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out

• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – GoogLeNet training on the ImageNet dataset (results below)

OSU-Caffe: Scalable Deep Learning

[Figure: GoogLeNet (ImageNet) training time (seconds) on 8-128 GPUs, comparing Caffe, OSU-Caffe (1024) and OSU-Caffe (2048); plain Caffe beyond a single node is an invalid use case.]

OSU-Caffe is publicly available from: http://hidl.cse.ohio-state.edu

A. A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters,PPoPP, Sep 2017

Page 43


Outline

Page 44

OpenACC-Aware MPI

• acc_malloc to allocate device memory
  – No changes to MPI calls
  – MVAPICH2 detects the device pointer and optimizes data movement

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from compiler-allocated memory where OpenACC 2.0 implementations provide it
  – MVAPICH2 detects the device pointer and optimizes communication

• Delivers the same performance as with CUDA

Using acc_malloc:

    A = acc_malloc(sizeof(int) * N);
    ...
    #pragma acc parallel loop deviceptr(A)
    ...  /* compute loop */
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    ...
    acc_free(A);

Using acc_deviceptr:

    A = malloc(sizeof(int) * N);
    ...
    #pragma acc data copyin(A) ...
    {
    #pragma acc parallel loop
    ...  /* compute loop */
    MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    ...
    free(A);

Page 45


Outline

Page 46

• MVAPICH2 optimizes MPI communication on InfiniBand and RoCE clusters with GPUs

• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing and collective operations

• Efficient and maximal overlap for MPI-3 NBC collectives

• Delivers high performance and high productivity with support for the latest NVIDIA GPUs and InfiniBand/RoCE Adapters

• Looking forward to next-generation designs with GPUDirect Async (GDS) and application domains like Deep Learning

• Users are strongly encouraged to use the latest MVAPICH2-GDR release to take advantage of all features and performance benefits

Conclusions

Page 47

A Follow-up Talk on PGAS/OpenSHMEM

• S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions

– Day: Today, 05/11

– Time: 15:00 - 15:25

– Location: Room 211B

Page 48

Dr. Davide Rossetti and Dr. Sreeram Potluri

Filippo Spiga and Stuart Rankin, HPCS, University of Cambridge

(Wilkes Cluster)

Acknowledgments

Page 49

Thank You!

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

[email protected]
[email protected]

