PhD Defense
Optimizing Communication for Clusters of GPUs
Michael LeBeane ([email protected])
Advisor: Lizy K. John
▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond
– Excellent performance/watt for many classes of data-parallel computation
▪ Many GPUs are required to solve the biggest computational problems
– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– Path between GPU and NIC needs to be efficient
▪ Vendors are selling machines filled with many GPUs and NICs:
GPUs and Networks in the Wild
Michael LeBeane – PhD Defense, 07/16/2018
▪ Nvidia's DGX-2:
– 16 Tesla V100 GPUs
– 8 Mellanox 100G NICs
– 2 Ethernet NICs
– 2 Xeon Platinum CPUs
– 1.6:1 GPU/NIC ratio
▪ AMD's Project 47 Node:
– 4 Radeon Instinct GPUs
– 2 Mellanox 100G NICs
– 1 EPYC 7601 32-core CPU
– 2:1 GPU/NIC ratio
Problem Statement
▪ Industry has largely focused on an optimized data plane
– Path taken by the application data that needs to be transferred by the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers
Today’s GPU Networks
IOC = IO Controller
[Figure: initiator and target nodes, each with a CPU, caches, memory, an IOC, a GPU, and a NIC, connected over the network; data moves peer-to-peer between GPU memory and the NIC]
▪ Control plane is unoptimized!
– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU
Challenges with Today’s GPU Networks
▪ GPU Allreduce Computation
– Many communication/computation phases
– Scaling out increases the number of phases
Motivating Example for Control Plane Optimizations
[Figure: allreduce example; per-node buffers evolve over time through Initial, Communication, Compute, Communication, and Compute phases across the nodes/GPUs]
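The scaling behavior above can be made concrete with a small sketch. The following single-process C simulation of a ring-style allreduce (one common variant; names and layout are illustrative, not from the dissertation) shows why the phase count grows with node count: a reduce-scatter pass and an allgather pass each take N-1 communication phases.

```c
#include <assert.h>

#define NODES 4          /* simulated nodes/GPUs */
#define LEN   NODES      /* one chunk (element) per node */

/* Single-process sketch of a ring allreduce: a reduce-scatter pass
 * followed by an allgather pass.  Each pass takes NODES-1 phases, so
 * the total number of communication/compute phases grows linearly as
 * the job scales out. */
static void ring_allreduce(int buf[NODES][LEN])
{
    for (int p = 0; p < NODES - 1; ++p)          /* reduce-scatter */
        for (int n = 0; n < NODES; ++n) {
            int src = (n - 1 + NODES) % NODES;   /* left neighbor */
            int c   = (src - p + NODES) % NODES; /* chunk sent by src */
            buf[n][c] += buf[src][c];            /* reduce on arrival */
        }
    for (int p = 0; p < NODES - 1; ++p)          /* allgather */
        for (int n = 0; n < NODES; ++n) {
            int src = (n - 1 + NODES) % NODES;
            int c   = (src + 1 - p + NODES) % NODES; /* finished chunk */
            buf[n][c] = buf[src][c];             /* copy on arrival */
        }
}
```

With 4 nodes the sketch performs 6 phases; doubling the node count roughly doubles the phase count, which is exactly the scaling pressure that motivates the control-plane optimizations in this dissertation.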
Thesis Statement
GPU networking can be improved by both software and hardware enhancements
that enable GPUs to more directly interface with the network control plane.
▪ Proposed Solutions
– Extended Task Queuing
• Direct NIC-to-GPU active messaging
– Command Processor Networking
• Dynamic communication using on-chip GPU Command Processor
– GPU Triggered Networking
• Initiate messages without the CPU on the critical path
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
Outline
▪ GPUs consume work through in-memory
command queues
– Queue format standardized through
Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes unified virtual address space
▪ Can we extend this across nodes?
– NIC doesn’t know how to talk to HSA queues
– Initiator doesn’t know the virtual addresses of
resources at the target
Local GPU Work Dispatch
[Figure: a producer (GPU/CPU) writes a Command Packet into an in-memory Command Queue in virtual memory; a consumer GPU dequeues and executes it]
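A user-space sketch in C can illustrate the queueing model. The packet and queue layouts below are simplified stand-ins for the HSA/AQL format (field names are assumptions, not the real ABI): any producer with the queue mapped can append a packet and ring the doorbell, and the consuming device drains packets in order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QSIZE 8          /* ring entries; power of two */

/* Simplified stand-in for an HSA-style dispatch packet. */
typedef struct {
    uint64_t kernel_ptr;     /* code object to run */
    uint64_t kernarg_ptr;    /* kernel argument buffer */
    uint64_t completion_sig; /* signal written when the work finishes */
} cmd_packet_t;

typedef struct {
    cmd_packet_t ring[QSIZE];
    uint64_t write_idx;      /* bumped by any producer */
    uint64_t read_idx;       /* bumped by the consuming device */
    uint64_t doorbell;       /* last write_idx published to the consumer */
} cmd_queue_t;

/* Producer side: any agent with the queue mapped can enqueue work. */
static int enqueue(cmd_queue_t *q, const cmd_packet_t *pkt)
{
    if (q->write_idx - q->read_idx >= QSIZE)
        return -1;                          /* queue full */
    q->ring[q->write_idx % QSIZE] = *pkt;
    q->write_idx++;
    q->doorbell = q->write_idx;             /* "ring" the doorbell */
    return 0;
}

/* Consumer side: drain one packet if the doorbell indicates work. */
static int dequeue(cmd_queue_t *q, cmd_packet_t *out)
{
    if (q->read_idx == q->doorbell)
        return -1;                          /* nothing pending */
    *out = q->ring[q->read_idx % QSIZE];
    q->read_idx++;
    return 0;
}
```

Because the queue lives in ordinary virtual memory, the open question on this slide is exactly whether a NIC, rather than a local CPU or GPU, could play the producer role.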
Contribution 1: Extended Task Queuing (XTQ)
[Figure: initiator and target nodes, each with a CPU, caches, memory, and a GPU; XTQ NICs connect the two across the network]
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
▪ Enables Active Messaging without target CPU involvement
– Improves latency and frees CPU service thread(s)
Extended Task Queuing (XTQ) Overview
M. LeBeane, B. Potter, A. Pan, A. Dutu, V. Agarwala, W. Lee, D. Majeti, B. Ghimire, E. Van Tassell, S. Wasmundt, B. Benton, M. Breternitz, M. L. Chu, M. Thottethodi, L. K. John, and S. K. Reinhardt, "Extended Task Queuing: Active Messages for Heterogeneous Systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
▪ Payload data streams into target-side receive buffer
▪ Command descriptor is placed into command queue
Target-side XTQ Operation
[Figure: target node; the XTQ NIC performs a lookup, streams payload data and a command packet into virtual memory, and rings the doorbell of the tightly coupled CPU/GPU; a signal marks completion]
▪ NIC notifies the GPU using memory-mapped doorbell
▪ GPU reads command packet
▪ GPU reads transferred data
▪ GPU writes shared memory completion signal
▪ How does initiator know about remote VAs at the target?
▪ Use coordinated indices specified by the initiator
▪ Lookup tables are populated by the target-side XTQ Library
XTQ Coordinated Indices
[Figure: example queue lookup; the initiator's RDMA header and command packet carry the target PID and a queue index; the target NIC indexes a Queue Lookup Table (located via a base address register) to translate the index (e.g., to 0xF123) into the queue's virtual address in unified virtual memory; kernel arguments and the data payload follow]
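The translation step can be sketched in C. The table layout here is hypothetical: the NIC uses the target PID and the coordinated queue index carried in the message to recover the queue's virtual address, which the target-side XTQ library registered ahead of time.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 16

/* Hypothetical per-process queue lookup table, populated by the
 * target-side XTQ library at registration time. */
typedef struct {
    uint32_t pid;                      /* target process */
    uint64_t queue_va[MAX_QUEUES];     /* index -> queue virtual address */
} xtq_lookup_table_t;

/* NIC-side translation: coordinated index -> virtual address.
 * Returns 0 for an unregistered index or a PID mismatch. */
static uint64_t xtq_translate(const xtq_lookup_table_t *t,
                              uint32_t pid, uint32_t queue_idx)
{
    if (t->pid != pid || queue_idx >= MAX_QUEUES)
        return 0;
    return t->queue_va[queue_idx];
}
```

The key property is that the initiator never needs to know the target's virtual addresses, only the small integer indices both sides agreed on.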
▪ XTQ Put is implemented as a simple extension to standard RDMA put operation
– Compatible with many low-level RDMA transports (e.g. InfiniBand, RoCE, Portals 4, iWARP, etc.)
▪ XTQ Registration API is used to provide index-to-address translations
XTQ Runtime API
▪ Regular RDMA Put Operation (Put Command Fields):
– Target NID/PID
– Send Buffer Ptr.
– Send Buffer Length
– Target Buffer Index
– Transport-specific metadata
▪ XTQ-Enhanced RDMA Put Operation (Additional XTQ Fields):
– Remote Queue Index
– Remote Function/Kernel Index
– GPU command packet (Kernel/Function Launch Parameters)
▪ XTQ Registration API:
– Register Queue: Queue Desc. VA
– Register Function: Function Ptr. VA, Target-Side Buffer VA
– Register Kernel: Kernel Ptr. VA, Target-Side Buffer VA, Kernel Argument Size, Completion Signal VA
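The field split above can be written down as a C struct sketch. The names are illustrative rather than an actual wire format: an XTQ-enhanced put is a regular RDMA put descriptor extended with the indices the target NIC needs to dispatch a pre-registered kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a regular RDMA put (transport metadata omitted). */
typedef struct {
    uint32_t target_nid;       /* target node */
    uint32_t target_pid;       /* target process */
    const void *send_buf;      /* initiator-side payload */
    size_t   send_len;
    uint32_t target_buf_idx;   /* coordinated index of the receive buffer */
} rdma_put_t;

/* XTQ-enhanced put: the same descriptor plus dispatch information. */
typedef struct {
    rdma_put_t put;              /* everything a normal put carries */
    uint32_t remote_queue_idx;   /* which queue at the target */
    uint32_t remote_kernel_idx;  /* which pre-registered kernel to launch */
    /* The GPU command packet (kernel launch parameters) travels with
     * the message payload and is rewritten by the target NIC. */
} xtq_put_t;
```

Because the XTQ descriptor begins with an unmodified put, a transport that does not understand the extra fields can still treat the message as an ordinary RDMA put, which is why the design composes with InfiniBand, RoCE, Portals 4, and similar transports.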
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HSA: Currently available GPU
systems
– Involves CPU runtime
▪ XTQ: Extended Task Queuing
– Enables efficient active-messaging-style communication that bypasses the CPU on the target
Experimental Setup
CPU and Memory Configuration
Type 4-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR3, 8 Channels, 800MHz
GPU Configuration
Type AMD GCN3 @ 1GHz
CU Config 24 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration
Link 100ns latency / 100Gbps bandwidth
Topology Star
[Figure: three simulated nodes, each with a CPU (with caches), memory, a GPU, and a NIC]
Results
▪ MPI Accumulate: speedup over the CPU baseline vs. data items (4-byte integers, 1 to 1M); bigger is better
▪ MPI Allreduce: runtime (µs) vs. node count (up to 64 nodes); smaller is better
▪ Latency Decomposition (64B and 4KB messages): XTQ improves end-to-end latency by 19% and 15% over HSA
[Figure: stacked latency breakdown (µs) into CPU PtlPut, NIC Initiator Put, Network, NIC Target Put, GPU Launch, GPU Kernel Execution, and CPU Completion for CPU, HSA, and XTQ; smaller is better]
Results
Workload Name   Domain            %Blocked   Reductions
AlexNet         Classification    14%        4672
AN4 LSTM        Speech            50%        131192
CIFAR           Classification    4%         939820
Large Synth     Synthetic         28%        52800
MNIST Conv      Text Recognition  12%        900000
MNIST Hidden    Text Recognition  29%        900000
▪ Use Microsoft's Cognitive Toolkit and sample workloads
▪ Projected using simulation results + profiling data from TACC's Stampede supercomputer
▪ Speedups bound by the % of time the application is blocked on network data
[Figure: projected speedup per workload (AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, MNIST Hidden) for CPU, HSA, and XTQ; bigger is better]
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ XTQ provides optimized remote kernel invocation
– But still at kernel boundaries
– Kernel launches are expensive!
– Best case ~3µs
• Network latency is < 0.7µs
▪ Can we do better?
– Networking from within a kernel?
– What have other researchers tried?
Motivating Intra-kernel Networking
[Figure: kernel launch latency (µs, 0 to 20) vs. number of kernel commands queued (1 to 256) for three GPUs; latency grows with queue depth; smaller is better]
Contribution 2: Command Processor Networking (ComP-Net)
▪ GPU can send messages inside a kernel
▪ CPU thread is responsible for taking packets from the GPU and poking the NIC
▪ We refer to this style of intra-kernel networking as GPU Host Networking (GHN)
Prior-art in Intra-kernel Networking
[Figure: Put timelines; in Host-Driven Networking the CPU launches a kernel, waits, sends, and launches again; in GPU Host Networking the GPU requests the Send from inside the kernel and a CPU thread forwards it to the NIC]
S. Kim, S. Huh, Y. Hu, X. Zhang, E. Witchel, A. Wated, and M. Silberstein, “GPUnet: Networking Abstractions for GPU Programs,” In USENIX
Conf. on Operating Systems Design and Implementation (OSDI). 2014.
J. A. Stuart and J. D. Owens, “Message passing on data-parallel architectures,” In Intl. Symp. on Parallel Distributed Processing (IPDPS). 2009.
T. Gysi, J. Bär, and T. Hoefler., “dCUDA: hardware supported overlap of computation and communication,” In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2016.
▪ Host Driven Networking (e.g., MPI + CUDA)
▪ GPU Host Networking
▪ Need multiple trips over IO bus
▪ Where to place queues?
– GPU memory vs. host memory
– High latency in both cases
▪ Not scalable
– 4096 work-groups fill the GPU
– Still 40µs latency with 8 CPU service threads
Performance Problems with GPU Host Networking
[Figures: GHN Put timeline; service time (µs) vs. active work-groups (1 to 4096) for host-memory vs. GPU-memory queues; service time vs. active work-groups for 1, 2, 4, and 8 CPU service threads; all remain well above the raw network latency; smaller is better]
▪ GPUs have built-in CPUs called Command Processors (CPs)
– Scalar cores are well suited to running network runtime code
– Connect to GPU CUs through a shared LLC
▪ Traditionally used to launch kernels
– But intra-kernel networking encourages fewer kernels
Command Processor Overview
[Figure: GPU block diagram; Compute Units (four SIMD engines, Local Data Share, L1 cache) and the Command Processor (a CPU core with its own L1 cache) share the L2 cache and GPU memory]
▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
Command Processor Networking (ComP-Net) Overview
M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs," in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2018.
[Figure: GPU Host Networking keeps host queues and network queues in memory reached from the GPU over PCIe; ComP-Net keeps host queues in the GPU L2 cache and network queues in GPU memory, with CUs and CPs communicating on-chip and the CP driving the NIC]
▪ Main component of ComP-Net Runtime is CP/GPU producer/consumer queue
▪ Most steps are straightforward
ComP-Net Producer/Consumer Queue
[Figure: ComP-Net producer/consumer queue; Queue Entries with per-entry Status flags and a global Read Idx live at the cache/memory/GPU coherence point; the work-group keeps its Write Idx, Base Ptr, and a cached local Read Idx in LDS or registers/non-coherent cache; the Command Processor thread keeps its own Base Ptr and local Read Idx]
▪ Work-group side: 1a) Check if the queue is full (using the local Read Idx)
▪ 1b) If full, refresh the local Read Idx and loop until not full
▪ 2) Fill Queue Entry with networking metadata
– or inline small payloads in the Queue Entry itself
▪ 3) Set status flag with release marker to notify CP
▪ 4) Increment local Write Idx
▪ 5) Check status bit to determine when CP completes operation
▪ Command processor side: 1) Poll on the next Queue Entry based on the local Read Idx with acquire marker
▪ 2) Read data from Queue Entry
▪ 3) Perform the network operation and set the Status flag to 0 with release marker when complete
▪ 4a) Update the global Read Idx
▪ 4b) Update the local Read Idx with release marker
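The enqueue and dequeue steps above can be sketched as a single-producer/single-consumer queue in C11. This is a host-code approximation of the GPU/CP protocol, with C11 atomics standing in for the GPU's release and acquire markers; names and sizes are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

#define QLEN 4

typedef struct {
    atomic_int status;   /* 1 = full (CP must process), 0 = empty */
    int payload;         /* inlined small payload / network metadata */
} qentry_t;

typedef struct {
    qentry_t entry[QLEN];
    atomic_int read_idx; /* global read index, advanced by the CP */
    int write_idx;       /* producer-local (work-group) write index */
    int local_read_idx;  /* producer's cached copy of read_idx */
    int cp_read_idx;     /* consumer-local read index */
} cpnet_queue_t;

/* Work-group side, steps 1a-4: check full against the cached read
 * index, refresh it on demand, fill the entry, publish with release. */
static void gpu_enqueue(cpnet_queue_t *q, int payload)
{
    while (q->write_idx - q->local_read_idx >= QLEN)            /* 1a */
        q->local_read_idx = atomic_load(&q->read_idx);          /* 1b */
    qentry_t *e = &q->entry[q->write_idx % QLEN];
    e->payload = payload;                                       /* 2  */
    atomic_store_explicit(&e->status, 1, memory_order_release); /* 3  */
    q->write_idx++;                                             /* 4  */
}

/* Work-group side, step 5: wait until the CP clears the status flag. */
static void gpu_wait_done(cpnet_queue_t *q, int slot)
{
    while (atomic_load_explicit(&q->entry[slot % QLEN].status,
                                memory_order_acquire) != 0)
        ;
}

/* CP side, steps 1-4: poll status with acquire, consume the entry,
 * clear status with release, then advance the read indices. */
static int cp_dequeue(cpnet_queue_t *q)
{
    qentry_t *e = &q->entry[q->cp_read_idx % QLEN];
    while (atomic_load_explicit(&e->status,
                                memory_order_acquire) != 1)     /* 1  */
        ;
    int payload = e->payload;                                   /* 2  */
    /* ... the real CP would perform the network operation here ... */
    atomic_store_explicit(&e->status, 0, memory_order_release); /* 3  */
    atomic_fetch_add(&q->read_idx, 1);                          /* 4a */
    q->cp_read_idx++;                                           /* 4b */
    return payload;
}
```

The per-entry status flags let the producer and consumer avoid contending on a single shared index, which matters when the traffic between them flows through the GPU's L2 cache.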
▪ Residency of data in the GPU L2 is very short
▪ Work-group data produced for the CP is evicted when other work-groups perform streaming memory accesses
▪ Can be solved through cache line locking
– Preliminary results are promising
– Still much to explore here
Tackling GPU Cache Thrashing
[Figure: L2 hit rate for the CP vs. the ratio of networking to streaming wavefronts; LLC locking sustains a much higher hit rate than the baseline; bigger is better]
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– Kernel boundary networking (host MPI + CUDA)
Intra-kernel Networking Schemes:
▪ APU: CPU/GPU on the Same Die
– Intra-kernel networking through host threads on an APU
▪ dGPU: GPU Host Networking
– Intra-kernel networking through host threads on a dGPU
▪ ComP-Net: Command Processor Networking
– Intra-kernel networking through command processor
Experimental Setup
CPU and Memory Configuration
Type 8-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR4, 8 Channels, 2133MHz
GPU Configuration
Type AMD GCN3 @ 1.5GHz
CU Config 12 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
CP Configuration
Type 2-wide OOO, x86, 2 cores @ 2GHz
D-Cache 32KB, 8-way, 4 cycles
I-Cache 16KB, 8-way, 4 cycles
Results
▪ 2D Jacobi Stencil
– 1D data decomposition
– Iterative compute and halo exchange
– Three regions of interest
[Figure: relative speedup vs. the dGPU baseline across per-node problem sizes (N x N grid, N = 16 to 1024) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better. Inset: 1D decomposition with halo exchange between Node 0 (top) and Node 1 (bottom)]
▪ 64MB Reduction (strong scaling)
– APU performs better than ComP-Net
– ComP-Net is much more energy efficient
Results
[Figures: relative speedup and energy consumption vs. number of nodes (up to 36) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better for speedup, smaller is better for energy. Inset: vector sum illustration]
Results
[Figure: projected speedup per workload and on average for CPU, HDN, dGPU, APU, and ComP-Net; bigger is better. Workloads as in the XTQ study: AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, MNIST Hidden]
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ CPU creates the network operation off the critical path
– Registers it with the NIC
▪ GPU simply 'triggers' the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
GPU Triggered Networking (GPU-TN) Overview
[Figure: Put timelines; Host-Driven Networking (CPU launches kernels and sends between them), GPU Host Networking (GPU signals a waiting CPU thread inside the kernel), and GPU Triggered Networking (CPU registers the Put at launch; the GPU triggers the NIC directly from inside the kernel)]
M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "GPU Triggered Networking for Intra-Kernel Communications," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2017.
Contribution 3: GPU Triggered Networking (GPU-TN)
▪ CPU Creates Triggered Entry
– Trigger Entry consists of:
• Network Operation
• Tag
• Counter
• Threshold
– Appends entry to Trigger List
▪ GPU Fills Send Buffer
– During kernel execution
GPU-TN Architecture
[Figure: (1) CPU appends Trigger Entries to the NIC's Trigger List; (2) GPU fills the send buffer during kernel execution]
▪ GPU initiates the Put operation
– GPU provides the Tag
▪ NIC sends the message
– Message triggered when counter >= CPU-provided threshold
▪ HW complexity?
– 'Trigger list' might not be a list
▪ CPU/GPU race conditions?
– Allocate a null entry for unexpected triggers
GPU-TN Architecture
[Figure: (3) GPU writes a tag to the NIC; the NIC matches it against each Trigger Entry (fields: Network Operation, Tag, Counter, Threshold) and increments the matching Counter; (4) when Counter >= Threshold, the NIC begins the network operation]
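The matching logic can be sketched in C. This models only the NIC-side semantics, with hypothetical names: the CPU registers an entry carrying a tag, counter, and threshold, each matching GPU trigger increments the counter, and the operation fires once the counter reaches the threshold.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRIGGERS 8

typedef struct {
    int      valid;
    uint32_t tag;        /* matched against the value the GPU writes */
    uint32_t counter;    /* incremented on every matching trigger */
    uint32_t threshold;  /* CPU-provided firing point */
    int      fired;      /* stands in for "begin network operation" */
} trigger_entry_t;

typedef struct { trigger_entry_t entry[MAX_TRIGGERS]; } trigger_list_t;

/* CPU side: register the operation off the critical path. */
static int tn_register(trigger_list_t *l, uint32_t tag, uint32_t threshold)
{
    for (int i = 0; i < MAX_TRIGGERS; ++i)
        if (!l->entry[i].valid) {
            l->entry[i] = (trigger_entry_t){1, tag, 0, threshold, 0};
            return i;
        }
    return -1;                          /* list full */
}

/* NIC side: invoked when the GPU writes a tag to the trigger address.
 * Returns 1 if this write fired the operation. */
static int tn_trigger(trigger_list_t *l, uint32_t tag)
{
    for (int i = 0; i < MAX_TRIGGERS; ++i) {
        trigger_entry_t *e = &l->entry[i];
        if (e->valid && e->tag == tag) {
            if (++e->counter >= e->threshold && !e->fired) {
                e->fired = 1;           /* begin network operation */
                return 1;
            }
            return 0;
        }
    }
    return 0;   /* unexpected tag: the real design allocates a null entry */
}
```

A threshold above one lets many GPU work-groups contribute to a single message: the NIC sends only after the last contributor has triggered.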
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– No driver interactions on the critical path, but may involve the CPU runtime
▪ GDS-Sim: GPUDirect Async
– Preregistration of communication, but at kernel boundaries
▪ GHN: GPU Host Networking
– Intra-kernel networking through host threads
▪ GPU-TN: GPU Triggered Networking
– Preregistration of network operations and intra-kernel networking
Experimental Setup
CPU and Memory Configuration
Type 8-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR4, 8 Channels, 2133MHz
GPU Configuration
Type AMD GCN3 @ 1.5GHz
CU Config 24 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration
Link 100ns latency / 100Gbps bandwidth
Topology Star
Results
▪ 2D Jacobi Stencil: speedup vs. HDN across local 2D grid sizes (N x N, N = 16 to 1024) for CPU, GDS-Sim, GHN, and GPU-TN; bigger is better
▪ 64MB Reduction (strong scaling): speedup vs. node count (2 to 32) for HDN, GDS-Sim, GHN, and GPU-TN; bigger is better
▪ Machine Learning Training Phase: projected speedup for AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, and MNIST Hidden under CPU, HDN, GDS-Sim, GHN, and GPU-TN
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
Summary
▪ Presented 3 enhancements to improve GPU networking
– Extended Task Queuing
• Direct NIC-to-GPU active messaging
– Command Processor Networking
• Dynamic communication using on-chip GPU Command Processor
– GPU Triggered Networking
• Initiate messages without the CPU on the critical path
Conclusion
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
▪ Enables Active Messaging without target CPU involvement
– Improves latency and frees CPU service thread(s)
▪ Improves application performance by ~15%
Extended Task Queuing (XTQ) Summary
▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
▪ Improves application performance by ~20% vs. other GHN approaches
Command Processor Networking (ComP-Net) Summary
▪ CPU creates the network operation off the critical path
– Registers it with the NIC
▪ GPU simply 'triggers' the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
▪ Improves application performance by ~20% vs. GPUDirect Async
GPU Triggered Networking (GPU-TN) Summary
▪ This dissertation motivates the need for more independent accelerators
– Cannot funnel everything through a central CPU!
– Concepts are applicable to many types of accelerators and networks
▪ Still much to do!
– Application Redesign Opportunities
• Applications presented in this dissertation are scratching the surface
• Algorithms with dynamic communication could significantly benefit from these techniques
– Leveraging Emerging NIC Technologies for GPUs
• Mellanox BlueField, collective offload, programmable message handlers
• How could more intelligent NICs assist with GPU networking?
Towards the Future
Thank You!
▪ CPU controls networking through driver/runtime
▪ Messages sent at kernel boundaries
▪ Research implementations include:
– CUDA-Aware MPI [Kraus '14]
– CUDA-Aware OpenSHMEM [Hamidouche '16]
– GPUDirect RDMA [Mellanox '13]
Host-Driven Networking (HDN)
[Figure: Host-Driven Networking Put timeline; the CPU launches a kernel, waits for it, performs the Send, and launches the next kernel, with the NIC carrying out the Put]
J. Kraus, "Introduction to CUDA-aware MPI and Nvidia GPUDirect," GPU Tech. Conference, 2014.
K. Hamidouche, A. Venkatesh, A. A. Awan, H. Subramoni, C. H. Chu, and D. K. Panda, "CUDA-Aware OpenSHMEM," Journal on Parallel Computing, 2016.
Mellanox, "Mellanox GPUDirect RDMA User Manual," http://www.mellanox.com/related-docs/prod_software/Mellanox GPUDirect User Manual v1.2.pdf, 2015.
▪ GPU runs the networking stack
▪ Persistent kernels and LDS memory used for network data structures
▪ Research implementations include:
– GPUrdma [Daoud '16]
– IBV on GPUs [Oden '14]
GPU Native Networking (GNN)
[Figure: Put timelines; Host-Driven Networking vs. GPU Native Networking, where the GPU itself performs the Send from inside a persistent kernel]
F. Daoud, A. Watad, and M. Silberstein, "GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels," in Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2016.
L. Oden, H. Froning, and F. J. Pfreundt, "Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU," in Intl. Conf. on Parallel Distributed Processing Symposium Workshops (IPDPSW), 2014.
AMD <=> Nvidia Translator
▪ Work-item = Thread
▪ Wavefront (64 Threads) = Warp (32 Threads)
– Unit of thread dispatch
▪ Work-group = Thread Block
– Unit of Synchronization
▪ Local Data Share (LDS) = Shared Memory
– Work-group scratchpad
▪ Compute Unit (CU) = Streaming Multi-Processor (SM)
– Collection of SIMD engines sharing LDS and L1 cache
GPU Architecture and Terminology
54 07/16/2018
▪ Kernel
– GPU SIMT Function
▪ Command Processor (CP)
– Dispatch engine and scheduler
GPU-TN Kernel Programming Interface
Work-item level:
__kernel void
kern1(__global char *trigAddr,
      const int tagBase,
      __global void *buffer)
{
    // do work
    buffer = ...;
    int id = get_global_id(0);
    *trigAddr = tagBase + id;
    // do additional work
    ...
}

Work-group level:
__kernel void
kern2(__global char *trigAddr,
      const int tagBase,
      __global void *buffer)
{
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0)) {
        int id = get_group_id(0);
        *trigAddr = tagBase + id;
    }
    // do additional work
    ...
}

Kernel level:
__kernel void
kern3(__global char *trigAddr,
      const int tag,
      __global void *buffer)
{
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0))
        *trigAddr = tag;
    // do additional work
    ...
}
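The writes to trigAddr in the kernels above act as triggers: a tag written to the trigger address is matched against network commands registered ahead of time, and a match fires the command. A toy host-side model of that matching scheme (all names are hypothetical, for illustration only):

```python
# Toy model of tag-triggered network operations: commands are registered
# up front, and a write of a matching tag to the trigger address fires them.
# Names are hypothetical; this only illustrates the matching scheme.
class TriggeredNIC:
    def __init__(self):
        self.pending = {}   # tag -> registered command description
        self.fired = []     # commands that have been launched

    def register(self, tag, command):
        self.pending[tag] = command

    def on_trigger_write(self, tag):
        # Invoked when the GPU writes `tag` to the trigger address.
        cmd = self.pending.pop(tag, None)
        if cmd is not None:
            self.fired.append(cmd)

nic = TriggeredNIC()
tag_base = 100
for wg in range(4):                   # one put registered per work-group
    nic.register(tag_base + wg, f"put wg{wg}")
nic.on_trigger_write(tag_base + 2)    # work-group 2 reaches its trigger
assert nic.fired == ["put wg2"]
```

This mirrors the three granularities above: the work-item and work-group variants derive the tag from an ID, while the kernel-level variant uses a single fixed tag.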
▪ gem5 + AMD GCN3 GPU model + Custom Portals4 NIC Model
– CPU power model with McPAT
– Baseline model is coherent APU
• dGPU modeled with extra delay for IO bus, different memory controllers, and by disabling coherence probes
▪ Each section uses slightly different parameters
– Discussed before the corresponding results are presented
Simulation Infrastructure
56 07/16/2018
[Figure: Simulated system block diagram. CPU cores with private L1I/L1D and L2 caches share an L3, a directory, and memory controllers; GPU cores with private L1D caches share Sequencer Caches (SQCs) and a GPU L2; the NIC contains NIC processors, DMA engines, its own L1I/L1D caches, a command-processor core, and a network interface.]
▪ RDMA allows direct access to remote memory without involving the target CPU
– Heavy lifting is performed on the NIC (offload networking model)
– Generally expressed in terms of remote Put/Get operations
▪ Maps naturally to “one-sided” communication semantics
– Put/Get vs. Send/Receive
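The one-sided style can be sketched as follows: the target exposes a registered memory window, and the initiator reads or writes it directly, with no matching receive call on the target. This is a toy model of the semantics, not a real RDMA verbs API:

```python
# Toy model of one-sided Put/Get: the target only exposes a memory
# window; all transfer logic runs on the initiator side (i.e., its NIC).
# Illustrative only -- not a real RDMA verbs API.
class MemoryWindow:
    def __init__(self, size):
        self.mem = bytearray(size)

def rdma_put(window, offset, data):
    # Initiator-side: write bytes directly into the target's window.
    window.mem[offset:offset + len(data)] = data

def rdma_get(window, offset, length):
    # Initiator-side: read bytes directly out of the target's window.
    return bytes(window.mem[offset:offset + length])

target = MemoryWindow(64)        # registered once on the target, up front
rdma_put(target, 8, b"hello")    # no matching recv() runs on the target
assert rdma_get(target, 8, 5) == b"hello"
```

Contrast with two-sided Send/Receive, where the target must post a receive and the two sides rendezvous on every message.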
Remote Direct Memory Access (RDMA)
57 07/16/2018
[Figure: Initiator and target nodes connected over the network. On each node the NIC reaches memory through the IO controller, so transfers bypass the CPU and its caches on the target side.]
ComP-Net Host and GPU API
58 07/16/2018
Host Code:
__host__ void
hostInit() {
    // Initialize ComP-Net
    cpnet_handle_t *cpnet_handle;
    cpnet_init(&cpnet_handle, GRID_SZ / WG_SZ);
    // Allocate symmetric heap memory
    char *buf = cpnet_shmalloc(sizeof(char) * GRID_SZ / WG_SZ);
    // Initiator/target launches kernel
    if (cpnet_handle->pe == INITIATOR) {
        hipLaunchKernel(Ping, GRID_SZ, GRID_SZ / WG_SZ, 0, 0,
                        cpnet_handle, buf);
    } else { /* Launch target kernel. */ }
}

GPU Code:
__device__ void
Ping(cpnet_handle_t *cpnet_handle, char *wg_buffer) {
    // Extract context from global handle
    __shared__ cpnet_ctx_t cpnet_ctx;
    cpnet_ctx_create(cpnet_handle, cpnet_ctx);
    // Each WG pings the target
    cpnet_shmem_char_p(cpnet_ctx, wg_buffer[hipBlockIdx_x], 1, TARGET);
    // Each WG waits for the pong from the target
    cpnet_shmem_char_wait_until(wg_buffer[hipBlockIdx_x], 1);
    cpnet_ctx_destroy(cpnet_ctx);
}
▪ One-sided put latency benchmark
– Initiator launches a dummy kernel, executes the network command, and terminates
– Target polls on the put location
▪ Take-away messages
– End-to-end latency: GPU-TN < GDS-Sim < HDN
– GPU-TN actually overlaps kernel teardown with the network transfer!
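Using the end-to-end times from the latency chart (HDN 4.21 µs, GDS-Sim 3.76 µs, GPU-TN 2.71 µs), the relative improvements work out as follows:

```python
# End-to-end one-sided put latencies from the microbenchmark (us).
latency_us = {"HDN": 4.21, "GDS-Sim": 3.76, "GPU-TN": 2.71}

# Speedup of each scheme relative to host-driven networking (HDN).
speedup_vs_hdn = {k: latency_us["HDN"] / v for k, v in latency_us.items()}

assert round(speedup_vs_hdn["GPU-TN"], 2) == 1.55   # ~1.55x faster than HDN
assert round(speedup_vs_hdn["GDS-Sim"], 2) == 1.12  # ~1.12x faster than HDN
```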
Latency Microbenchmark
59 07/16/2018
[Figure: Latency breakdown (µs) for initiator and target under HDN, GDS-Sim, and GPU-TN, split into Kernel Launch, Kernel Execution, Kernel Teardown, Put, and Wait phases. End-to-end times: HDN 4.21 µs, GDS-Sim 3.76 µs, GPU-TN 2.71 µs. Smaller is better.]
Sweep of payload size for 1 WG and 1 thread
Microbenchmarks
60 07/16/2018
[Figures: (left) Remote Get time observed from the GPU (µs) vs. network payload size, from 1 B to 32 KB; (middle) Remote Get time observed from the GPU (µs) vs. number of network service threads; (right) energy consumed by network threads relative to dGPU vs. number of network service threads. Each plot compares ComP-Net, dGPU, and APU configurations.]
Sweep of threads for 1 byte transfers and 480 WGs
▪ Friendlier programming abstractions
– Nicer abstractions in CUDA and OpenCL
• Dynamic Parallelism, Unified Memory, etc.
– Single-source, kernel-less programming support
• C++ AMP, OpenMP, AMD HC Language, etc.
▪ Architectural Support
– User-level kernel launch
– Shared virtual address space
– Virtualization
– Multiprocessing
– (Sometimes) Coherent caches
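The architected-queuing model (CPU producer, GPU consumer, shared virtual memory) can be sketched as a user-level ring buffer of command packets. The names below are illustrative, not the actual HSA/AQL packet format:

```python
# Toy user-level command queue: a producer (the CPU) enqueues command
# packets in shared memory and a consumer (the GPU command processor)
# dequeues them. Illustrative only; not the actual HSA/AQL format.
class CommandQueue:
    def __init__(self, slots):
        self.ring = [None] * slots
        self.write_idx = 0   # advanced by the producer
        self.read_idx = 0    # advanced by the consumer

    def enqueue(self, packet):
        # Producer side (CPU): claim a slot and publish the packet.
        assert self.write_idx - self.read_idx < len(self.ring), "queue full"
        self.ring[self.write_idx % len(self.ring)] = packet
        self.write_idx += 1  # bumping the index acts as the "doorbell"

    def dequeue(self):
        # Consumer side (GPU command processor): drain the next packet.
        if self.read_idx == self.write_idx:
            return None      # queue empty
        packet = self.ring[self.read_idx % len(self.ring)]
        self.read_idx += 1
        return packet

q = CommandQueue(slots=4)
q.enqueue({"op": "launch_kernel", "kernel": "Ping"})
assert q.dequeue() == {"op": "launch_kernel", "kernel": "Ping"}
assert q.dequeue() is None
```

Because the queue lives in shared virtual memory and needs no OS-driver mediation, dispatch becomes a user-level store, which is the property the figure highlights.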
Where are GPUs heading?
61 07/16/2018
[Figure: Evolution toward tightly coupled devices. Left: CPU and GPU sit behind an MMU/IOMMU, with the OS driver mediating access to physical memory. Right: a CPU producer and GPU consumer share virtual memory and communicate through an architected command queue of command packets.]
What about networking support?