PhD Defense
Optimizing Communication for Clusters of GPUs
Michael LeBeane ([email protected])
Advisor: Lizy K. John
▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond
– Excellent performance/watt for many classes of data-parallel computation
▪ Many GPUs are required to solve the biggest computational problems
– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– Path between GPU and NIC needs to be efficient
▪ Vendors are selling machines filled with many GPUs and NICs:
GPUs and Networks in the Wild
Michael LeBeane – PhD Defense, 07/16/2018
▪ Nvidia's DGX-2:
– 16 Tesla V100 GPUs
– 8 Mellanox 100G NICs
– 2 Ethernet NICs
– 2 Xeon Platinum CPUs
– 1.6:1 GPU/NIC ratio
▪ AMD's Project 47 Node:
– 4 Radeon Instinct GPUs
– 2 Mellanox 100G NICs
– 1 EPYC 7601 32-core CPU
– 2:1 GPU/NIC ratio
Problem Statement
▪ Industry has largely focused on an optimized data plane
– Path taken by the application data that needs to be transferred by the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers
Today’s GPU Networks
IOC = IO Controller
[Figure: initiator and target nodes, each with a CPU, caches, memory, an IOC, a GPU, and a NIC, connected over the network; data moves peer-to-peer between GPU memory and the NIC]
▪ Control plane is unoptimized!
– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU
Challenges with Today’s GPU Networks
▪ GPU Allreduce Computation
– Many communication/computation phases
– Scaling out increases the number of phases
Motivating Example for Control Plane Optimizations
[Figure: allreduce example; per-node buffers evolve over time through Initial, Communication, Compute, Communication, and Compute phases across the nodes/GPUs]
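The scaling behavior above can be made concrete with a small sketch. The following single-process C simulation of a ring-style allreduce (one common variant; names and layout are illustrative, not from the dissertation) shows why the phase count grows with node count: a reduce-scatter pass and an allgather pass each take N-1 communication phases.

```c
#include <assert.h>

#define NODES 4          /* simulated nodes/GPUs */
#define LEN   NODES      /* one chunk (element) per node */

/* Single-process sketch of a ring allreduce: a reduce-scatter pass
 * followed by an allgather pass.  Each pass takes NODES-1 phases, so
 * the total number of communication/compute phases grows linearly as
 * the job scales out. */
static void ring_allreduce(int buf[NODES][LEN])
{
    for (int p = 0; p < NODES - 1; ++p)          /* reduce-scatter */
        for (int n = 0; n < NODES; ++n) {
            int src = (n - 1 + NODES) % NODES;   /* left neighbor */
            int c   = (src - p + NODES) % NODES; /* chunk sent by src */
            buf[n][c] += buf[src][c];            /* reduce on arrival */
        }
    for (int p = 0; p < NODES - 1; ++p)          /* allgather */
        for (int n = 0; n < NODES; ++n) {
            int src = (n - 1 + NODES) % NODES;
            int c   = (src + 1 - p + NODES) % NODES; /* finished chunk */
            buf[n][c] = buf[src][c];             /* copy on arrival */
        }
}
```

With 4 nodes the sketch performs 6 phases; doubling the node count roughly doubles the phase count, which is exactly the scaling pressure that motivates the control-plane optimizations in this dissertation.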
Thesis Statement
GPU networking can be improved by both software and hardware enhancements
that enable GPUs to more directly interface with the network control plane.
▪ Proposed Solutions
– Extended Task Queuing
• Direct NIC-to-GPU active messaging
– Command Processor Networking
• Dynamic communication using on-chip GPU Command Processor
– GPU Triggered Networking
• Initiate messages without the CPU on the critical path
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
Outline
▪ GPUs consume work through in-memory
command queues
– Queue format standardized through
Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes unified virtual address space
▪ Can we extend this across nodes?
– NIC doesn’t know how to talk to HSA queues
– Initiator doesn’t know the virtual addresses of
resources at the target
Local GPU Work Dispatch
[Figure: a producer (GPU/CPU) writes a Command Packet into an in-memory Command Queue in virtual memory; a consumer GPU dequeues and executes it]
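A user-space sketch in C can illustrate the queueing model. The packet and queue layouts below are simplified stand-ins for the HSA/AQL format (field names are assumptions, not the real ABI): any producer with the queue mapped can append a packet and ring the doorbell, and the consuming device drains packets in order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QSIZE 8          /* ring entries; power of two */

/* Simplified stand-in for an HSA-style dispatch packet. */
typedef struct {
    uint64_t kernel_ptr;     /* code object to run */
    uint64_t kernarg_ptr;    /* kernel argument buffer */
    uint64_t completion_sig; /* signal written when the work finishes */
} cmd_packet_t;

typedef struct {
    cmd_packet_t ring[QSIZE];
    uint64_t write_idx;      /* bumped by any producer */
    uint64_t read_idx;       /* bumped by the consuming device */
    uint64_t doorbell;       /* last write_idx published to the consumer */
} cmd_queue_t;

/* Producer side: any agent with the queue mapped can enqueue work. */
static int enqueue(cmd_queue_t *q, const cmd_packet_t *pkt)
{
    if (q->write_idx - q->read_idx >= QSIZE)
        return -1;                          /* queue full */
    q->ring[q->write_idx % QSIZE] = *pkt;
    q->write_idx++;
    q->doorbell = q->write_idx;             /* "ring" the doorbell */
    return 0;
}

/* Consumer side: drain one packet if the doorbell indicates work. */
static int dequeue(cmd_queue_t *q, cmd_packet_t *out)
{
    if (q->read_idx == q->doorbell)
        return -1;                          /* nothing pending */
    *out = q->ring[q->read_idx % QSIZE];
    q->read_idx++;
    return 0;
}
```

Because the queue lives in ordinary virtual memory, the open question on this slide is exactly whether a NIC, rather than a local CPU or GPU, could play the producer role.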
Contribution 1: Extended Task Queuing (XTQ)
[Figure: initiator and target nodes, each with a CPU, caches, memory, and a GPU; XTQ NICs connect the two across the network]
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
▪ Enables Active Messaging without target CPU involvement
– Improves latency and frees CPU service thread(s)
Extended Task Queuing (XTQ) Overview
M. LeBeane, B. Potter, A. Pan, A. Dutu, V. Agarwala, W. Lee, D. Majeti, B. Ghimire, E. Van Tassell, S. Wasmundt, B. Benton, M. Breternitz, M. L. Chu, M. Thottethodi, L. K. John, and S. K. Reinhardt, "Extended Task Queuing: Active Messages for Heterogeneous Systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
▪ Payload data streams into target-side receive buffer
▪ Command descriptor is placed into command queue
Target-side XTQ Operation
[Figure: target node; the XTQ NIC performs a lookup, streams payload data and a command packet into virtual memory, and rings the doorbell of the tightly coupled CPU/GPU; a signal marks completion]
▪ NIC notifies the GPU using memory-mapped doorbell
▪ GPU reads command packet
▪ GPU reads transferred data
▪ GPU writes shared memory completion signal
▪ How does initiator know about remote VAs at the target?
▪ Use coordinated indices specified by the initiator
▪ Lookup tables are populated by the target-side XTQ Library
XTQ Coordinated Indices
[Figure: example queue lookup; the initiator's RDMA header and command packet carry the target PID and a queue index; the target NIC indexes a Queue Lookup Table (located via a base address register) to translate the index (e.g., to 0xF123) into the queue's virtual address in unified virtual memory; kernel arguments and the data payload follow]
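The translation step can be sketched in C. The table layout here is hypothetical: the NIC uses the target PID and the coordinated queue index carried in the message to recover the queue's virtual address, which the target-side XTQ library registered ahead of time.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 16

/* Hypothetical per-process queue lookup table, populated by the
 * target-side XTQ library at registration time. */
typedef struct {
    uint32_t pid;                      /* target process */
    uint64_t queue_va[MAX_QUEUES];     /* index -> queue virtual address */
} xtq_lookup_table_t;

/* NIC-side translation: coordinated index -> virtual address.
 * Returns 0 for an unregistered index or a PID mismatch. */
static uint64_t xtq_translate(const xtq_lookup_table_t *t,
                              uint32_t pid, uint32_t queue_idx)
{
    if (t->pid != pid || queue_idx >= MAX_QUEUES)
        return 0;
    return t->queue_va[queue_idx];
}
```

The key property is that the initiator never needs to know the target's virtual addresses, only the small integer indices both sides agreed on.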
▪ XTQ Put is implemented as a simple extension to standard RDMA put operation
– Compatible with many low-level RDMA transports (e.g. InfiniBand, RoCE, Portals 4, iWARP, etc.)
▪ XTQ Registration API is used to provide index-to-address translations
XTQ Runtime API
▪ Regular RDMA Put Operation (Put Command Fields):
– Target NID/PID
– Send Buffer Ptr.
– Send Buffer Length
– Target Buffer Index
– Transport-specific metadata
▪ XTQ-Enhanced RDMA Put Operation (Additional XTQ Fields):
– Remote Queue Index
– Remote Function/Kernel Index
– GPU command packet (Kernel/Function Launch Parameters)
▪ XTQ Registration API:
– Register Queue: Queue Desc. VA
– Register Function: Function Ptr. VA, Target-Side Buffer VA
– Register Kernel: Kernel Ptr. VA, Target-Side Buffer VA, Kernel Argument Size, Completion Signal VA
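The field split above can be written down as a C struct sketch. The names are illustrative rather than an actual wire format: an XTQ-enhanced put is a regular RDMA put descriptor extended with the indices the target NIC needs to dispatch a pre-registered kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a regular RDMA put (transport metadata omitted). */
typedef struct {
    uint32_t target_nid;       /* target node */
    uint32_t target_pid;       /* target process */
    const void *send_buf;      /* initiator-side payload */
    size_t   send_len;
    uint32_t target_buf_idx;   /* coordinated index of the receive buffer */
} rdma_put_t;

/* XTQ-enhanced put: the same descriptor plus dispatch information. */
typedef struct {
    rdma_put_t put;              /* everything a normal put carries */
    uint32_t remote_queue_idx;   /* which queue at the target */
    uint32_t remote_kernel_idx;  /* which pre-registered kernel to launch */
    /* The GPU command packet (kernel launch parameters) travels with
     * the message payload and is rewritten by the target NIC. */
} xtq_put_t;
```

Because the XTQ descriptor begins with an unmodified put, a transport that does not understand the extra fields can still treat the message as an ordinary RDMA put, which is why the design composes with InfiniBand, RoCE, Portals 4, and similar transports.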
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HSA: Currently available GPU
systems
– Involves CPU runtime
▪ XTQ: Extended Task Queuing
– Enables efficient active-messaging-style communication that bypasses the CPU on the target
Experimental Setup
CPU and Memory Configuration
Type 4-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR3, 8 Channels, 800MHz
GPU Configuration
Type AMD GCN3 @ 1GHz
CU Config 24 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration
Link 100ns latency / 100Gbps bandwidth
Topology Star
[Figure: three simulated nodes, each with a CPU (with caches), memory, a GPU, and a NIC]
Results
▪ MPI Accumulate: speedup over the CPU baseline vs. data items (4-byte integers, 1 to 1M); bigger is better
▪ MPI Allreduce: runtime (µs) vs. node count (up to 64 nodes); smaller is better
▪ Latency Decomposition (64B and 4KB messages): XTQ improves end-to-end latency by 19% and 15% over HSA
[Figure: stacked latency breakdown (µs) into CPU PtlPut, NIC Initiator Put, Network, NIC Target Put, GPU Launch, GPU Kernel Execution, and CPU Completion for CPU, HSA, and XTQ; smaller is better]
Results
Workload Name   Domain            %Blocked   Reductions
AlexNet         Classification    14%        4672
AN4 LSTM        Speech            50%        131192
CIFAR           Classification    4%         939820
Large Synth     Synthetic         28%        52800
MNIST Conv      Text Recognition  12%        900000
MNIST Hidden    Text Recognition  29%        900000
▪ Use Microsoft's Cognitive Toolkit and sample workloads
▪ Projected using simulation results + profiling data from TACC's Stampede supercomputer
▪ Speedups bound by the % of time the application is blocked on network data
[Figure: projected speedup per workload (AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, MNIST Hidden) for CPU, HSA, and XTQ; bigger is better]
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ XTQ provides optimized remote kernel invocation
– But still at kernel boundaries
– Kernel launches are expensive!
– Best case ~3µs
• Network latency is < 0.7µs
▪ Can we do better?
– Networking from within a kernel?
– What have other researchers tried?
Motivating Intra-kernel Networking
[Figure: kernel launch latency (µs, 0 to 20) vs. number of kernel commands queued (1 to 256) for three GPUs; latency grows with queue depth; smaller is better]
Contribution 2: Command Processor Networking (ComP-Net)
▪ GPU can send messages inside a kernel
▪ CPU thread is responsible for taking packets from the GPU and poking the NIC
▪ We refer to this style of intra-kernel networking as GPU Host Networking (GHN)
Prior-art in Intra-kernel Networking
[Figure: Put timelines; in Host-Driven Networking the CPU launches a kernel, waits, sends, and launches again; in GPU Host Networking the GPU requests the Send from inside the kernel and a CPU thread forwards it to the NIC]
S. Kim, S. Huh, Y. Hu, X. Zhang, E. Witchel, A. Wated, and M. Silberstein, “GPUnet: Networking Abstractions for GPU Programs,” In USENIX
Conf. on Operating Systems Design and Implementation (OSDI). 2014.
J. A. Stuart and J. D. Owens, “Message passing on data-parallel architectures,” In Intl. Symp. on Parallel Distributed Processing (IPDPS). 2009.
T. Gysi, J. Bär, and T. Hoefler., “dCUDA: hardware supported overlap of computation and communication,” In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis (SC). 2016.
▪ Host Driven Networking (e.g., MPI + CUDA)
▪ GPU Host Networking
▪ Need multiple trips over IO bus
▪ Where to place queues?
– GPU memory vs. host memory
– High latency in both cases
▪ Not scalable
– 4096 work-groups fill the GPU
– Still 40µs latency with 8 CPU service threads
Performance Problems with GPU Host Networking
[Figures: GHN Put timeline; service time (µs) vs. active work-groups (1 to 4096) for host-memory vs. GPU-memory queues; service time vs. active work-groups for 1, 2, 4, and 8 CPU service threads; all remain well above the raw network latency; smaller is better]
▪ GPUs have built-in CPUs called Command Processors (CPs)
– Scalar cores are well suited to running network runtime code
– Connect to GPU CUs through a shared LLC
▪ Traditionally used to launch kernels
– But intra-kernel networking encourages fewer kernels
Command Processor Overview
[Figure: GPU block diagram; Compute Units (four SIMD engines, Local Data Share, L1 cache) and the Command Processor (a CPU core with its own L1 cache) share the L2 cache and GPU memory]
▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
Command Processor Networking (ComP-Net) Overview
M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "ComP-Net: Command Processor Networking for Efficient Intra-kernel Communications on GPUs," in Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2018.
[Figure: GPU Host Networking keeps host queues and network queues in memory reached from the GPU over PCIe; ComP-Net keeps host queues in the GPU L2 cache and network queues in GPU memory, with CUs and CPs communicating on-chip and the CP driving the NIC]
▪ Main component of ComP-Net Runtime is CP/GPU producer/consumer queue
▪ Most steps are straightforward
ComP-Net Producer/Consumer Queue
[Figure: ComP-Net producer/consumer queue; Queue Entries with per-entry Status flags and a global Read Idx live at the cache/memory/GPU coherence point; the work-group keeps its Write Idx, Base Ptr, and a cached local Read Idx in LDS or registers/non-coherent cache; the Command Processor thread keeps its own Base Ptr and local Read Idx]
▪ Work-group side: 1a) Check if the queue is full (using the local Read Idx)
▪ 1b) If full, refresh the local Read Idx and loop until not full
▪ 2) Fill Queue Entry with networking metadata
– or inline small payloads in the Queue Entry itself
▪ 3) Set status flag with release marker to notify CP
▪ 4) Increment local Write Idx
▪ 5) Check status bit to determine when CP completes operation
▪ Command processor side: 1) Poll on the next Queue Entry based on the local Read Idx with acquire marker
▪ 2) Read data from Queue Entry
▪ 3) Perform the network operation and set the Status flag to 0 with release marker when complete
▪ 4a) Update the global Read Idx
▪ 4b) Update the local Read Idx with release marker
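The enqueue and dequeue steps above can be sketched as a single-producer/single-consumer queue in C11. This is a host-code approximation of the GPU/CP protocol, with C11 atomics standing in for the GPU's release and acquire markers; names and sizes are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

#define QLEN 4

typedef struct {
    atomic_int status;   /* 1 = full (CP must process), 0 = empty */
    int payload;         /* inlined small payload / network metadata */
} qentry_t;

typedef struct {
    qentry_t entry[QLEN];
    atomic_int read_idx; /* global read index, advanced by the CP */
    int write_idx;       /* producer-local (work-group) write index */
    int local_read_idx;  /* producer's cached copy of read_idx */
    int cp_read_idx;     /* consumer-local read index */
} cpnet_queue_t;

/* Work-group side, steps 1a-4: check full against the cached read
 * index, refresh it on demand, fill the entry, publish with release. */
static void gpu_enqueue(cpnet_queue_t *q, int payload)
{
    while (q->write_idx - q->local_read_idx >= QLEN)            /* 1a */
        q->local_read_idx = atomic_load(&q->read_idx);          /* 1b */
    qentry_t *e = &q->entry[q->write_idx % QLEN];
    e->payload = payload;                                       /* 2  */
    atomic_store_explicit(&e->status, 1, memory_order_release); /* 3  */
    q->write_idx++;                                             /* 4  */
}

/* Work-group side, step 5: wait until the CP clears the status flag. */
static void gpu_wait_done(cpnet_queue_t *q, int slot)
{
    while (atomic_load_explicit(&q->entry[slot % QLEN].status,
                                memory_order_acquire) != 0)
        ;
}

/* CP side, steps 1-4: poll status with acquire, consume the entry,
 * clear status with release, then advance the read indices. */
static int cp_dequeue(cpnet_queue_t *q)
{
    qentry_t *e = &q->entry[q->cp_read_idx % QLEN];
    while (atomic_load_explicit(&e->status,
                                memory_order_acquire) != 1)     /* 1  */
        ;
    int payload = e->payload;                                   /* 2  */
    /* ... the real CP would perform the network operation here ... */
    atomic_store_explicit(&e->status, 0, memory_order_release); /* 3  */
    atomic_fetch_add(&q->read_idx, 1);                          /* 4a */
    q->cp_read_idx++;                                           /* 4b */
    return payload;
}
```

The per-entry status flags let the producer and consumer avoid contending on a single shared index, which matters when the traffic between them flows through the GPU's L2 cache.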
▪ Residency of data in the GPU L2 is very short
▪ Work-group data produced for the CP is evicted when other work-groups perform streaming memory accesses
▪ Can be solved through cache line locking
– Preliminary results are promising
– Still much to explore here
Tackling GPU Cache Thrashing
[Figure: L2 hit rate for the CP vs. the ratio of networking to streaming wavefronts; LLC locking sustains a much higher hit rate than the baseline; bigger is better]
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– Kernel boundary networking (host MPI + CUDA)
Intra-kernel Networking Schemes:
▪ APU: CPU/GPU on the Same Die
– Intra-kernel networking through host threads on an APU
▪ dGPU: GPU Host Networking
– Intra-kernel networking through host threads on a dGPU
▪ ComP-Net: Command Processor Networking
– Intra-kernel networking through command processor
Experimental Setup
CPU and Memory Configuration
Type 8-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR4, 8 Channels, 2133MHz
GPU Configuration
Type AMD GCN3 @ 1.5GHz
CU Config 12 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
CP Configuration
Type 2-wide OOO, x86, 2 cores @ 2GHz
D-Cache 32KB, 8-way, 4 cycles
I-Cache 16KB, 8-way, 4 cycles
Results
▪ 2D Jacobi Stencil
– 1D data decomposition
– Iterative compute and halo exchange
– Three regions of interest
[Figure: relative speedup vs. the dGPU baseline across per-node problem sizes (N x N grid, N = 16 to 1024) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better. Inset: 1D decomposition with halo exchange between Node 0 (top) and Node 1 (bottom)]
▪ 64MB Reduction (strong scaling)
– APU performs better than ComP-Net
– ComP-Net is much more energy efficient
Results
[Figures: relative speedup and energy consumption vs. number of nodes (up to 36) for ComP-Net, dGPU, APU, HDN, and CPU; bigger is better for speedup, smaller is better for energy. Inset: vector sum illustration]
Results
[Figure: projected speedup per workload and on average for CPU, HDN, dGPU, APU, and ComP-Net; bigger is better. Workloads as in the XTQ study: AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, MNIST Hidden]
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
▪ CPU creates the network operation off the critical path
– Registers it with the NIC
▪ GPU simply 'triggers' the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
GPU Triggered Networking (GPU-TN) Overview
[Figure: Put timelines; Host-Driven Networking (CPU launches kernels and sends between them), GPU Host Networking (GPU signals a waiting CPU thread inside the kernel), and GPU Triggered Networking (CPU registers the Put at launch; the GPU triggers the NIC directly from inside the kernel)]
M. LeBeane, K. Hamidouche, B. Benton, M. Breternitz, S. K. Reinhardt, and L. K. John, "GPU Triggered Networking for Intra-Kernel Communications," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2017.
Contribution 3: GPU Triggered Networking (GPU-TN)
▪ CPU Creates Triggered Entry
– Trigger Entry consists of:
• Network Operation
• Tag
• Counter
• Threshold
– Appends entry to Trigger List
▪ GPU Fills Send Buffer
– During kernel execution
GPU-TN Architecture
[Figure: (1) CPU appends Trigger Entries to the NIC's Trigger List; (2) GPU fills the send buffer during kernel execution]
▪ GPU initiates the Put operation
– GPU provides the Tag
▪ NIC sends the message
– Message triggered when counter >= CPU-provided threshold
▪ HW complexity?
– 'Trigger list' might not be a list
▪ CPU/GPU race conditions?
– Allocate a null entry for unexpected triggers
GPU-TN Architecture
[Figure: (3) GPU writes a tag to the NIC; the NIC matches it against each Trigger Entry (fields: Network Operation, Tag, Counter, Threshold) and increments the matching Counter; (4) when Counter >= Threshold, the NIC begins the network operation]
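The matching logic can be sketched in C. This models only the NIC-side semantics, with hypothetical names: the CPU registers an entry carrying a tag, counter, and threshold, each matching GPU trigger increments the counter, and the operation fires once the counter reaches the threshold.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRIGGERS 8

typedef struct {
    int      valid;
    uint32_t tag;        /* matched against the value the GPU writes */
    uint32_t counter;    /* incremented on every matching trigger */
    uint32_t threshold;  /* CPU-provided firing point */
    int      fired;      /* stands in for "begin network operation" */
} trigger_entry_t;

typedef struct { trigger_entry_t entry[MAX_TRIGGERS]; } trigger_list_t;

/* CPU side: register the operation off the critical path. */
static int tn_register(trigger_list_t *l, uint32_t tag, uint32_t threshold)
{
    for (int i = 0; i < MAX_TRIGGERS; ++i)
        if (!l->entry[i].valid) {
            l->entry[i] = (trigger_entry_t){1, tag, 0, threshold, 0};
            return i;
        }
    return -1;                          /* list full */
}

/* NIC side: invoked when the GPU writes a tag to the trigger address.
 * Returns 1 if this write fired the operation. */
static int tn_trigger(trigger_list_t *l, uint32_t tag)
{
    for (int i = 0; i < MAX_TRIGGERS; ++i) {
        trigger_entry_t *e = &l->entry[i];
        if (e->valid && e->tag == tag) {
            if (++e->counter >= e->threshold && !e->fired) {
                e->fired = 1;           /* begin network operation */
                return 1;
            }
            return 0;
        }
    }
    return 0;   /* unexpected tag: the real design allocates a null entry */
}
```

A threshold above one lets many GPU work-groups contribute to a single message: the NIC sends only after the last contributor has triggered.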
▪ CPU: Standard CPU-only systems
– Baseline non-accelerated system
▪ HDN: Host Driven Networking
– No driver interactions on the critical path, but may involve the CPU runtime
▪ GDS-Sim: GPUDirect Async
– Preregistration of communication, but at kernel boundaries
▪ GHN: GPU Host Networking
– Intra-kernel networking through host threads
▪ GPU-TN: GPU Triggered Networking
– Preregistration of network operations and intra-kernel networking
Experimental Setup
CPU and Memory Configuration
Type 8-wide OOO, x86, 8 cores @ 4GHz
I,D-Cache 64KB, 2-way, 2 cycles
L2-Cache 2MB, 8-way, 8 cycles
L3-Cache 16MB, 16-way, 20 cycles
DRAM DDR4, 8 Channels, 2133MHz
GPU Configuration
Type AMD GCN3 @ 1.5GHz
CU Config 24 CUs with 4 SIMD-16 engines
Wavefronts 40 Waves per SIMD (64 lanes)
V-Cache 32KB, 16-way, 12 cycles, per CU
K-Cache 32KB, 8-way, 12 cycles, per 4 CU
I-Cache 64KB, 8-way, 12 cycles, per 4 CU
L2-Cache 1MB, 16-way, 8 banks, 100 cycles
NIC Configuration
Link 100ns latency / 100Gbps bandwidth
Topology Star
Results
▪ 2D Jacobi Stencil: speedup vs. HDN across local 2D grid sizes (N x N, N = 16 to 1024) for CPU, GDS-Sim, GHN, and GPU-TN; bigger is better
▪ 64MB Reduction (strong scaling): speedup vs. node count (2 to 32) for HDN, GDS-Sim, GHN, and GPU-TN; bigger is better
▪ Machine Learning Training Phase: projected speedup for AlexNet, AN4 LSTM, CIFAR, Large Synth, MNIST Conv, and MNIST Hidden under CPU, HDN, GDS-Sim, GHN, and GPU-TN
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
Summary
▪ Presented 3 enhancements to improve GPU networking
– Extended Task Queuing
• Direct NIC-to-GPU active messaging
– Command Processor Networking
• Dynamic communication using on-chip GPU Command Processor
– GPU Triggered Networking
• Initiate messages without the CPU on the critical path
Conclusion
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
▪ Enables Active Messaging without target CPU involvement
– Improves latency and frees CPU service thread(s)
▪ Improves application performance by ~15%
Extended Task Queuing (XTQ) Summary
▪ Uses the built-in CP to support network operations
▪ CP and GPU communicate over the shared L2 cache instead of PCIe
▪ Potentially much lower latency than other GHN designs
▪ Scales naturally
– Every GPU has multiple CP threads
▪ Improves application performance by ~20% vs. other GHN approaches
Command Processor Networking (ComP-Net) Summary
▪ CPU creates the network operation off the critical path
– Registers it with the NIC
▪ GPU simply 'triggers' the operation when the data is ready
▪ Provides intra-kernel GPU networking without requiring a CPU thread
▪ Improves application performance by ~20% vs. GPUDirect Async
GPU Triggered Networking (GPU-TN) Summary
▪ This dissertation motivates the need for more independent accelerators
– Cannot funnel everything through a central CPU!
– Concepts are applicable to many types of accelerators and networks
▪ Still much to do!
– Application Redesign Opportunities
• Applications presented in this dissertation are scratching the surface
• Algorithms with dynamic communication could significantly benefit from these techniques
– Leveraging Emerging NIC Technologies for GPUs
• Mellanox BlueField, collective offload, programmable message handlers
• How could more intelligent NICs assist with GPU networking?
Towards the Future
Thank You!
▪ CPU controls networking through driver/runtime
▪ Messages sent at kernel boundaries
▪ Research implementations include:
– CUDA-Aware MPI [Kraus '14]
– CUDA-Aware OpenSHMEM [Hamidouche '16]
– GPUDirect RDMA [Mellanox '13]
Host-Driven Networking (HDN)
[Figure: Host-Driven Networking Put timeline; the CPU launches a kernel, waits for it, performs the Send, and launches the next kernel, with the NIC carrying out the Put]
J. Kraus, "Introduction to CUDA-aware MPI and Nvidia GPUDirect," GPU Tech. Conference, 2014.
K. Hamidouche, A. Venkatesh, A. A. Awan, H. Subramoni, C. H. Chu, and D. K. Panda, "CUDA-Aware OpenSHMEM," Journal on Parallel Computing, 2016.
Mellanox, "Mellanox GPUDirect RDMA User Manual," http://www.mellanox.com/related-docs/prod_software/Mellanox GPUDirect User Manual v1.2.pdf, 2015.
▪ GPU runs the networking stack
▪ Persistent kernels and LDS memory used for network data structures
▪ Research implementations include:
– GPUrdma [Daoud '16]
– IBV on GPUs [Oden '14]
GPU Native Networking (GNN)
[Figure: Put timelines; Host-Driven Networking vs. GPU Native Networking, where the GPU itself performs the Send from inside a persistent kernel]
F. Daoud, A. Watad, and M. Silberstein, "GPUrdma: GPU-side Library for High Performance Networking from GPU Kernels," in Intl. Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2016.
L. Oden, H. Froning, and F. J. Pfreundt, "Infiniband-Verbs on GPU: A Case Study of Controlling an Infiniband Network Device from the GPU," in Intl. Conf. on Parallel Distributed Processing Symposium Workshops (IPDPSW), 2014.
AMD <=> Nvidia Translator
▪ Work-item = Thread
▪ Wavefront (64 Threads) = Warp (32 Threads)
– Unit of thread dispatch
▪ Work-group = Thread Block
– Unit of Synchronization
▪ Local Data Share (LDS) = Shared Memory
– Work-group scratchpad
▪ Compute Unit (CU) = Streaming Multi-Processor (SM)
– Collection of SIMD engines sharing LDS and L1 cache
GPU Architecture and Terminology
54 07/16/2018
▪ Kernel
– GPU SIMT Function
▪ Command Processor (CP)
– Dispatch engine and scheduler
GPU-TN Kernel Programming Interface
Work-item level:
__kernel void
kern1(__global char *trigAddr,
      const int tagBase,
      __global void *buffer)
{
    // do work
    buffer = ...;
    int id = get_global_id(0);
    *trigAddr = tagBase + id;
    // do additional work
    ...
}

Work-group level:
__kernel void
kern2(__global char *trigAddr,
      const int tagBase,
      __global void *buffer)
{
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0)) {
        int id = get_group_id(0);
        *trigAddr = tagBase + id;
    }
    // do additional work
    ...
}

Kernel level:
__kernel void
kern3(__global char *trigAddr,
      const int tag,
      __global void *buffer)
{
    // do work
    buffer = ...;
    wg_barrier();
    if (!get_local_id(0))
        *trigAddr = tag;
    // do additional work
    ...
}
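The writes to trigAddr in the kernels above act as triggers: a tag written to the trigger address is matched against network commands registered ahead of time, and a match fires the command. A toy host-side model of that matching scheme (all names are hypothetical, for illustration only):

```python
# Toy model of tag-triggered network operations: commands are registered
# up front, and a write of a matching tag to the trigger address fires them.
# Names are hypothetical; this only illustrates the matching scheme.
class TriggeredNIC:
    def __init__(self):
        self.pending = {}   # tag -> registered command description
        self.fired = []     # commands that have been launched

    def register(self, tag, command):
        self.pending[tag] = command

    def on_trigger_write(self, tag):
        # Invoked when the GPU writes `tag` to the trigger address.
        cmd = self.pending.pop(tag, None)
        if cmd is not None:
            self.fired.append(cmd)

nic = TriggeredNIC()
tag_base = 100
for wg in range(4):                   # one put registered per work-group
    nic.register(tag_base + wg, f"put wg{wg}")
nic.on_trigger_write(tag_base + 2)    # work-group 2 reaches its trigger
assert nic.fired == ["put wg2"]
```

This mirrors the three granularities above: the work-item and work-group variants derive the tag from an ID, while the kernel-level variant uses a single fixed tag.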
▪ gem5 + AMD GCN3 GPU model + Custom Portals4 NIC Model
– CPU power model with McPAT
– Baseline model is coherent APU
• dGPU modeled with extra delay for IO bus, different memory controllers, and by disabling coherence probes
▪ Each section uses slightly different parameters
– Discussed before the corresponding results are presented
Simulation Infrastructure
56 07/16/2018
[Figure: Simulated system block diagram. CPU cores with private L1I/L1D and L2 caches share an L3, a directory, and memory controllers; GPU cores with private L1D caches share Sequencer Caches (SQCs) and a GPU L2; the NIC contains NIC processors, DMA engines, its own L1I/L1D caches, a command-processor core, and a network interface.]
▪ RDMA allows direct access to remote memory without involving the target CPU
– Heavy lifting is performed on the NIC (offload networking model)
– Generally expressed in terms of remote Put/Get operations
▪ Maps naturally to “one-sided” communication semantics
– Put/Get vs. Send/Receive
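The one-sided style can be sketched as follows: the target exposes a registered memory window, and the initiator reads or writes it directly, with no matching receive call on the target. This is a toy model of the semantics, not a real RDMA verbs API:

```python
# Toy model of one-sided Put/Get: the target only exposes a memory
# window; all transfer logic runs on the initiator side (i.e., its NIC).
# Illustrative only -- not a real RDMA verbs API.
class MemoryWindow:
    def __init__(self, size):
        self.mem = bytearray(size)

def rdma_put(window, offset, data):
    # Initiator-side: write bytes directly into the target's window.
    window.mem[offset:offset + len(data)] = data

def rdma_get(window, offset, length):
    # Initiator-side: read bytes directly out of the target's window.
    return bytes(window.mem[offset:offset + length])

target = MemoryWindow(64)        # registered once on the target, up front
rdma_put(target, 8, b"hello")    # no matching recv() runs on the target
assert rdma_get(target, 8, 5) == b"hello"
```

Contrast with two-sided Send/Receive, where the target must post a receive and the two sides rendezvous on every message.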
Remote Direct Memory Access (RDMA)
57 07/16/2018
[Figure: Initiator and target nodes connected over the network. On each node the NIC reaches memory through the IO controller, so transfers bypass the CPU and its caches on the target side.]
ComP-Net Host and GPU API
58 07/16/2018
Host Code:
__host__ void
hostInit() {
    // Initialize ComP-Net
    cpnet_handle_t *cpnet_handle;
    cpnet_init(&cpnet_handle, GRID_SZ / WG_SZ);
    // Allocate symmetric heap memory
    char *buf = cpnet_shmalloc(sizeof(char) * GRID_SZ / WG_SZ);
    // Initiator/target launches kernel
    if (cpnet_handle->pe == INITIATOR) {
        hipLaunchKernel(Ping, GRID_SZ, GRID_SZ / WG_SZ, 0, 0,
                        cpnet_handle, buf);
    } else { /* Launch target kernel. */ }
}

GPU Code:
__device__ void
Ping(cpnet_handle_t *cpnet_handle, char *wg_buffer) {
    // Extract context from global handle
    __shared__ cpnet_ctx_t cpnet_ctx;
    cpnet_ctx_create(cpnet_handle, cpnet_ctx);
    // Each WG pings the target
    cpnet_shmem_char_p(cpnet_ctx, wg_buffer[hipBlockIdx_x], 1, TARGET);
    // Each WG waits for the pong from the target
    cpnet_shmem_char_wait_until(wg_buffer[hipBlockIdx_x], 1);
    cpnet_ctx_destroy(cpnet_ctx);
}
▪ One-sided put latency benchmark
– Initiator launches a dummy kernel, executes the network command, and terminates
– Target polls on the put location
▪ Take-away messages
– End-to-end latency: GPU-TN < GDS-Sim < HDN
– GPU-TN actually overlaps kernel teardown with the network transfer!
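Using the end-to-end times from the latency chart (HDN 4.21 µs, GDS-Sim 3.76 µs, GPU-TN 2.71 µs), the relative improvements work out as follows:

```python
# End-to-end one-sided put latencies from the microbenchmark (us).
latency_us = {"HDN": 4.21, "GDS-Sim": 3.76, "GPU-TN": 2.71}

# Speedup of each scheme relative to host-driven networking (HDN).
speedup_vs_hdn = {k: latency_us["HDN"] / v for k, v in latency_us.items()}

assert round(speedup_vs_hdn["GPU-TN"], 2) == 1.55   # ~1.55x faster than HDN
assert round(speedup_vs_hdn["GDS-Sim"], 2) == 1.12  # ~1.12x faster than HDN
```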
Latency Microbenchmark
59 07/16/2018
[Figure: Latency breakdown (µs) for initiator and target under HDN, GDS-Sim, and GPU-TN, split into Kernel Launch, Kernel Execution, Kernel Teardown, Put, and Wait phases. End-to-end times: HDN 4.21 µs, GDS-Sim 3.76 µs, GPU-TN 2.71 µs. Smaller is better.]
Sweep of payload size for 1 WG and 1 thread
Microbenchmarks
60 07/16/2018
[Figures: (left) Remote Get time observed from the GPU (µs) vs. network payload size, from 1 B to 32 KB; (middle) Remote Get time observed from the GPU (µs) vs. number of network service threads; (right) energy consumed by network threads relative to dGPU vs. number of network service threads. Each plot compares ComP-Net, dGPU, and APU configurations.]
Sweep of threads for 1 byte transfers and 480 WGs
▪ Friendlier programming abstractions
– Nicer abstractions in CUDA and OpenCL
• Dynamic Parallelism, Unified Memory, etc.
– Single-source, kernel-less programming support
• C++ AMP, OpenMP, AMD HC Language, etc.
▪ Architectural Support
– User-level kernel launch
– Shared virtual address space
– Virtualization
– Multiprocessing
– (Sometimes) Coherent caches
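The architected-queuing model (CPU producer, GPU consumer, shared virtual memory) can be sketched as a user-level ring buffer of command packets. The names below are illustrative, not the actual HSA/AQL packet format:

```python
# Toy user-level command queue: a producer (the CPU) enqueues command
# packets in shared memory and a consumer (the GPU command processor)
# dequeues them. Illustrative only; not the actual HSA/AQL format.
class CommandQueue:
    def __init__(self, slots):
        self.ring = [None] * slots
        self.write_idx = 0   # advanced by the producer
        self.read_idx = 0    # advanced by the consumer

    def enqueue(self, packet):
        # Producer side (CPU): claim a slot and publish the packet.
        assert self.write_idx - self.read_idx < len(self.ring), "queue full"
        self.ring[self.write_idx % len(self.ring)] = packet
        self.write_idx += 1  # bumping the index acts as the "doorbell"

    def dequeue(self):
        # Consumer side (GPU command processor): drain the next packet.
        if self.read_idx == self.write_idx:
            return None      # queue empty
        packet = self.ring[self.read_idx % len(self.ring)]
        self.read_idx += 1
        return packet

q = CommandQueue(slots=4)
q.enqueue({"op": "launch_kernel", "kernel": "Ping"})
assert q.dequeue() == {"op": "launch_kernel", "kernel": "Ping"}
assert q.dequeue() is None
```

Because the queue lives in shared virtual memory and needs no OS-driver mediation, dispatch becomes a user-level store, which is the property the figure highlights.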
Where are GPUs heading?
61 07/16/2018
[Figure: Evolution toward tightly coupled devices. Left: CPU and GPU sit behind an MMU/IOMMU, with the OS driver mediating access to physical memory. Right: a CPU producer and GPU consumer share virtual memory and communicate through an architected command queue of command packets.]
What about networking support?