
High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware Multicast for Streaming Applications

GTC 2015

Page 2:

Presented by Dhabaleswar K. (DK) Panda

The Ohio State University

Email: [email protected]

http://www.cse.ohio-state.edu/~panda

Page 3: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 4: Streaming Applications

• Examples: surveillance, habitat monitoring, etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks

Page 5: HPC Landscape

• Proliferation of multi-petaflop systems
• Heterogeneity in compute resources with GPGPUs
• High-performance interconnects with RDMA capabilities to host and GPU memories
• Streaming applications leverage such resources

Page 6: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 7: Nature of Streaming Applications

• Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
• Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
• The broadcast operation is a key determinant of streaming-application throughput
  - Reduced latency for each operation
  - Support for multiple back-to-back operations
  - More critical with accelerators

Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006.

Page 8: Shortcomings of Existing GPU Broadcast

• The traditional short-message broadcast between GPU buffers involves a Host-Staged Multicast (HSM), sketched below
  - Data is copied from GPU buffers to host memory
  - The broadcast then uses InfiniBand Unreliable Datagram (UD)-based hardware multicast
• Sub-optimal use of the near-scale-invariant UD-multicast performance
• PCIe resources are wasted and the benefits of multicast are nullified
• GPUDirect RDMA capabilities go unused
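For concreteness, the host-staged path can be pictured with a minimal sketch. This is an illustration, not MVAPICH2's actual code path: the helper name host_staged_bcast is hypothetical, and MPI_Bcast stands in for the library's UD-multicast-backed broadcast of the staged host copy.

#include <mpi.h>
#include <cuda_runtime.h>

/* Host-staged broadcast of a GPU buffer: stage through pinned host memory,
 * broadcast the host copy, then copy back to GPU memory at every receiver. */
void host_staged_bcast(void *gpu_buf, size_t len, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    void *host_stage;                      /* staging buffer in host memory  */
    cudaMallocHost(&host_stage, len);      /* pinned, so the HCA can DMA it  */

    if (rank == root)                      /* GPU -> host copy at the source */
        cudaMemcpy(host_stage, gpu_buf, len, cudaMemcpyDeviceToHost);

    /* Broadcast the host copy; for short messages MVAPICH2 can back this
     * with the UD-based hardware multicast.                                  */
    MPI_Bcast(host_stage, (int)len, MPI_BYTE, root, comm);

    if (rank != root)                      /* host -> GPU copy at every sink */
        cudaMemcpy(gpu_buf, host_stage, len, cudaMemcpyHostToDevice);

    cudaFreeHost(host_stage);
}

The two cudaMemcpy calls on the critical path are exactly the overhead that the designs discussed next try to remove.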

Page 9: Problem Statement

• Can we design a GPU broadcast mechanism that completely avoids host staging for streaming applications?
• Can we harness the capabilities of GPUDirect RDMA (GDR)?
• Can we overcome the limitations of the UD transport and realize the true potential of multicast for GPU buffers?
• Succinctly, how do we multicast GPU data using GDR efficiently?

Page 10: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 11: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
  - Critical Factors
• Proposed Approach
• Results
• Conclusion and Future Work

Page 12: Factors to Consider for an Efficient GPU Multicast

• Goal is to multicast GPU data in less time than the host-staged multicast (~20us)
• Cost of cudaMemcpy is ~8us for short messages, for host-to-GPU, GPU-to-host, and GPU-to-GPU transfers (a timing sketch follows this list)
• cudaMemcpy costs and memory registration costs determine the viability of a multicast protocol for GPU buffers
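The ~8us figure is easy to sanity-check on a given machine. A minimal sketch, assuming a pinned host buffer and averaging over many iterations (absolute numbers depend on the platform and driver):

#include <stdio.h>
#include <cuda_runtime.h>
#include <sys/time.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const size_t len = 1024;               /* a "short" message               */
    const int iters = 10000;
    void *h, *d;
    cudaMallocHost(&h, len);                /* pinned host buffer              */
    cudaMalloc(&d, len);

    cudaMemcpy(d, h, len, cudaMemcpyHostToDevice);   /* warm up               */
    double t0 = now_us();
    for (int i = 0; i < iters; i++)         /* time host -> GPU copies         */
        cudaMemcpy(d, h, len, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    printf("host->gpu: %.2f us per copy\n", (now_us() - t0) / iters);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

Pageable host buffers typically raise this cost further, which is why the staging copies dominate short-message host-staged multicast latency.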

Page 13: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
  - Eager Protocol
  - Rendezvous Protocol
• Proposed Approach
• Results
• Conclusion and Future Work

Page 14: Eager Protocol for GPU Multicast

• Copy user GPU data to host eager buffers
• Perform the multicast and copy back at the receivers (sketched below)
• cudaMemcpy dictates performance
• Similar variation with eager buffers on the GPU; header encoding is expensive

[Figure: protocol diagram across GPU, host (eager and user buffers), HCA, and the network, showing the CUDAMEMCPY and MCAST stages]
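A minimal sketch of the eager send side in IB verbs terms, under the assumption that a UD queue pair (mcast_qp) is already attached to the multicast group and that the host eager buffer was registered once at startup; the function name, Q_Key value, and omitted error handling are illustrative only:

#include <stdint.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

/* Root side of the eager protocol: stage the GPU payload into a pre-registered
 * host eager buffer, then post one UD send to the multicast address handle.  */
void eager_mcast_send(struct ibv_qp *mcast_qp, struct ibv_ah *mcast_ah,
                      struct ibv_mr *eager_mr, void *eager_buf,
                      const void *gpu_buf, size_t len)
{
    /* 1. GPU -> host staging copy; this cudaMemcpy dominates the latency.    */
    cudaMemcpy(eager_buf, gpu_buf, len, cudaMemcpyDeviceToHost);

    /* 2. One UD multicast send of the staged data (plus any header).         */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)eager_buf,
        .length = (uint32_t)len,
        .lkey   = eager_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.ud      = { .ah = mcast_ah,
                        .remote_qpn  = 0xFFFFFF,     /* well-known mcast QPN  */
                        .remote_qkey = 0x11111111 }, /* example Q_Key         */
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(mcast_qp, &wr, &bad_wr);

    /* Each receiver then copies from its eager buffer back into GPU memory
     * with a mirrored cudaMemcpy(..., cudaMemcpyHostToDevice).               */
}

The staging copy in step 1, and its mirror on every receiver, is the cost this protocol cannot hide, which motivates the GDR-based designs that follow.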

Page 15: Rendezvous Protocol for GPU Multicast

• Register the user GPU data and start an RTS multicast with control info
• Confirm ready receivers (equivalent to a 0-byte gather)
• Perform the data multicast (the steps are sketched below)
• Registration cost and gather limitations
• A handshake for each operation is not required for streaming applications, which are error tolerant

[Figure: protocol diagram across GPU, host, HCA, and the network, showing the registration, INFO MCAST, GATHER, and DATA MCAST stages]
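The root-side steps can be sketched as follows. Only the GDR registration call is literal verbs API; mcast_send and gather_zero_byte are hypothetical helpers standing in for the control multicast and the 0-byte gather, and the control-record layout is invented for illustration:

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical helpers for the control/collective traffic. */
void mcast_send(const void *buf, size_t len);
void gather_zero_byte(void);

struct rndv_ctrl { uint64_t addr; uint32_t rkey; uint32_t len; };

void rendezvous_mcast(struct ibv_pd *pd, void *gpu_buf, size_t len)
{
    /* 1. Register the GPU buffer directly; with GPUDirect RDMA, a device
     *    pointer from cudaMalloc() can be passed to ibv_reg_mr().            */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    /* 2. Multicast the RTS/control record so receivers can prepare.          */
    struct rndv_ctrl ctrl = { (uintptr_t)gpu_buf, mr->rkey, (uint32_t)len };
    mcast_send(&ctrl, sizeof ctrl);

    /* 3. Handshake: wait until every receiver reports ready (0-byte gather). */
    gather_zero_byte();

    /* 4. Multicast the payload straight from device memory.                  */
    mcast_send(gpu_buf, len);

    ibv_dereg_mr(mr);
}

The per-operation registration and handshake are exactly the overheads called out above; for an error-tolerant stream they buy little and cost extra round trips.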

Page 16: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 17: Orchestration of GDR-SGL-MCAST (GSM)

• One-time registration of a window of persistent buffers in streaming apps
• Combine control and user data at the source and scatter them at the destinations using the Scatter-Gather-List abstraction (sketched below)
• The scheme lends itself to the pipelined phases abundant in streaming applications and avoids stressing PCIe

[Figure: protocol diagram across GPU, host (control buffer), HCA, and the network, showing the gather at the source, the MCAST, and the scatters at the destinations]
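A minimal sketch of what such a source-side send could look like with IB verbs: one UD multicast work request whose scatter-gather list combines a small control header in host memory with the user payload in GPU memory, both registered once up front. The function name, Q_Key, and buffer arguments are illustrative and not the library's internal API:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Source side of a GSM-style operation: a single UD multicast work request
 * whose scatter-gather list combines a control header in host memory with
 * the user payload in GPU memory (both registered once, up front).           */
void gsm_mcast_send(struct ibv_qp *mcast_qp, struct ibv_ah *mcast_ah,
                    void *ctrl_buf, struct ibv_mr *ctrl_mr, uint32_t ctrl_len,
                    void *gpu_buf,  struct ibv_mr *gpu_mr,  uint32_t data_len)
{
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)ctrl_buf, .length = ctrl_len, .lkey = ctrl_mr->lkey },
        { .addr = (uintptr_t)gpu_buf,  .length = data_len, .lkey = gpu_mr->lkey  },
    };
    struct ibv_send_wr wr = {
        .sg_list    = sgl,
        .num_sge    = 2,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.ud      = { .ah = mcast_ah,
                        .remote_qpn  = 0xFFFFFF,     /* well-known mcast QPN  */
                        .remote_qkey = 0x11111111 }, /* example Q_Key         */
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(mcast_qp, &wr, &bad_wr);

    /* Each receiver posts a mirrored two-entry SGL, so the HCA scatters the
     * header into host memory and the payload directly into GPU memory,
     * with no cudaMemcpy on either side of the broadcast.                    */
}

Because the buffers are persistent, registration happens once per window and successive operations can be pipelined.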

Page 18: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 19: Experiment Setup

• Experiments were run on Wilkes @ University of Cambridge
  - 12-core Ivy Bridge Intel(R) Xeon(R) E5-2630 @ 2.60 GHz with 64 GB RAM
  - FDR ConnectX2 HCAs
  - NVIDIA K20c GPUs
  - Mellanox OFED version MLNX_OFED_LINUX-2.1-1.0.6, which supports the required GPUDirect RDMA (GDR)
• Baseline host-based MCAST uses MVAPICH2-GDR (http://mvapich.cse.ohio-state.edu/downloads)
• GDR-SGL-MCAST is based on MVAPICH2-GDR

Page 20: Host-Staged MCAST and GDR-SGL MCAST Latency (<= 8 nodes)

• GDR-SGL-MCAST (GSM) vs. Host-Staged-MCAST (HSM)
• GSM latency ≤ ~10us vs. HSM latency ≤ ~23us
• Small latency increase with scale

A. Venkatesh, H. Subramoni, K. Hamidouche and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," IEEE International Conference on High Performance Computing (HiPC '14), Dec 2014.

Page 21: Host-Staged MCAST and GDR-SGL MCAST Latency (<= 64 nodes)

• Both GSM and HSM continue to show near-scale-invariant latency, with a 60% improvement (8 bytes)

Page 22: Host-Staged MCAST and GDR-SGL MCAST Streaming Benchmark

• Based on a synthetic benchmark that mimics broadcast patterns in streaming applications (sketched below)
• A long window of persistent m-byte buffers with 1,000 back-to-back multicast operations issued
• Execution time reduces by 3x-4x
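A minimal sketch of such a benchmark, assuming a CUDA-aware MPI (e.g., MVAPICH2-GDR) so that device pointers can be passed to MPI_Bcast directly; the window size and message size below are placeholders:

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    enum { WINDOW = 64, ITERS = 1000, MSG = 1024 };
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *win[WINDOW];                     /* window of persistent GPU buffers */
    for (int i = 0; i < WINDOW; i++)
        cudaMalloc(&win[i], MSG);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)        /* back-to-back broadcasts that     */
        MPI_Bcast(win[i % WINDOW], MSG, MPI_BYTE, 0, MPI_COMM_WORLD);  /* cycle the window */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d broadcasts of %d bytes: %.3f s\n", ITERS, MSG, MPI_Wtime() - t0);

    for (int i = 0; i < WINDOW; i++)
        cudaFree(win[i]);
    MPI_Finalize();
    return 0;
}

Cycling through a window of persistent buffers keeps every operation's buffer registered and lets back-to-back broadcasts pipeline, which is the pattern the GSM design targets.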

Page 23: Outline

• Introduction
• Motivation and Problem Statement
• Design Considerations
• Proposed Approach
• Results
• Conclusion and Future Work

Page 24: Conclusion and Future Work

• Designed an efficient GPU data broadcast for streaming applications that uses the near-constant-latency hardware multicast feature and GPUDirect RDMA
• Proposed a new methodology that overcomes the performance challenges posed by the UD transport
• Demonstrated the benefits with a latency benchmark and a throughput benchmark that mimics streaming-application communication
• Future work: exploration of NVIDIA's Fastcopy module for MPI_Bcast

Page 25: One More Talk

Learn about recent advances and upcoming features in the CUDA-aware MVAPICH2-GPU library:

• S5461 - Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
• Thursday, 03/19 (Today)
• Time: 17:00–17:50
• Room 212 B

Page 26: Thanks! Questions?

Contact: [email protected]

http://mvapich.cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu

Page 27: Funding Acknowledgments

Funding support by: [sponsor logos]

Equipment support by: [sponsor logos]

Page 28: Backup

Page 29: Challenges in UD Transport, Multicast, and GPUDirect RDMA

• UD makes no ordering and reliability guarantees
• UD requires memory registration and has an MTU of 2KB
• Notification through polling is preferred for performance
• The multicast scheme is window-based and NACK-based
• GDR allows buffers in GPU memory to be registered (sketched below)
• Once registered, the IB network interface can directly access GPU memory
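A sketch tying these points together: create a UD queue pair, attach it to an already-joined multicast group, and register a GPU buffer for GPUDirect RDMA. The multicast group join through the subnet manager, QP state transitions, completion handling, and error checks are all omitted, and the function name and queue capacities are illustrative assumptions:

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

/* Create a UD QP, attach it to an (already joined) multicast group, and
 * register a GPU buffer so the HCA can access it via GPUDirect RDMA.        */
struct ibv_qp *setup_ud_mcast(struct ibv_pd *pd, struct ibv_cq *cq,
                              union ibv_gid *mgid, uint16_t mlid,
                              void **gpu_buf, struct ibv_mr **gpu_mr, size_t len)
{
    /* UD QP: unreliable, unordered, per-packet payload limited to the MTU.   */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_UD,
        .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                     .max_send_sge = 2,  .max_recv_sge = 2 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* Attach the QP so it receives packets sent to the multicast group.      */
    ibv_attach_mcast(qp, mgid, mlid);

    /* GPUDirect RDMA: register device memory; once registered, the IB HCA
     * can read and write it directly without host staging.                   */
    cudaMalloc(gpu_buf, len);
    *gpu_mr = ibv_reg_mr(pd, *gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);

    return qp;
}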

Page 30: Hybrid Approach Using Scatter-Gather Lists + GPUDirect RDMA: GDR-SGL-MCAST

• IB specifies the use of SG elements for non-contiguous transfers
• Control and data are specified in an array of SG elements
• Avoids expensive cudaMemcpy calls
• Persistent buffers amortize registration costs and facilitate pipelining in streaming applications

