EECS 583 – Class 20
Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
- 2 -
Announcements & Reading Material
Exams graded and will be returned in Wednesday’s class
This class
» "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.
Next class – Research Topic 3: Security
» "Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005.
Stream Graph Modulo Scheduling
- 4 -
Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor – eight SPEs (SPE0–SPE7), each with an SPU, 256 KB local store, and MFC (DMA engine), connected over the EIB to the PPE (PowerPC) and DRAM.]
- 5 -
Preliminaries
Synchronous Data Flow (SDF) [Lee '87], StreamIt [Thies '02]
int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}
Push and pop items from input/output FIFOs
Stateless: the filter above carries no values from one work invocation to the next
Stateful: adding persistent state, e.g.
  int wgts[N];
  wgts = adapt(wgts);
makes the filter stateful, since wgts is carried across invocations
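For readers newer to StreamIt, here is a rough, self-contained C sketch of the same stateless FIR work function, with a hand-rolled ring buffer standing in for StreamIt's input channel (the FIFO helpers and driver are made up for illustration, not part of StreamIt):

#include <stdio.h>

/* C sketch of the stateless FIR filter: each firing peeks N inputs,
 * pushes one output, and pops one input. */
#define CAP 64
static int fifo[CAP];
static int head = 0, tail = 0;

static int  peek(int i)      { return fifo[(head + i) % CAP]; }
static void pop_one(void)    { head = (head + 1) % CAP; }
static void push_out(int v)  { printf("out: %d\n", v); }  /* stand-in for the output channel */

static void fir_work(int N, const int *wgts) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += peek(i) * wgts[i];   /* read inputs without consuming them */
    push_out(sum);                  /* push 1 */
    pop_one();                      /* pop 1 */
}

int main(void) {
    int wgts[3] = {1, 2, 3};
    for (int v = 0; v < 8; v++) fifo[tail++ % CAP] = v;  /* preload the input FIFO */
    for (int f = 0; f < 4; f++) fir_work(3, wgts);       /* four firings */
    return 0;
}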
- 6 -
SGMS Overview
[Figure: SGMS overview – the stream graph is partitioned across PE0–PE3 and coarse-grain software pipelined, with DMA transfers overlapped against computation and explicit prologue and epilogue phases; the single-PE iteration time T1 is roughly 4× the steady-state time T4 on four PEs.]
- 7 -
SGMS Phases
Fission + processor assignment (load balance) → Stage assignment (causality, DMA overlap) → Code generation
- 8 -
Processor Assignment: Maximizing Throughput
• Assigns each filter to a processor (four processing elements PE0–PE3; W = workload of a filter)
• ILP formulation, with a_ij = 1 if filter i is assigned to PE j and w_i the workload of filter i (illustrated in the sketch below):
  Minimize II
  subject to:
    Σj a_ij = 1          for all filters i = 1, …, N   (each filter assigned to exactly one PE)
    Σi w_i · a_ij ≤ II   for all PEs j = 1, …, P       (no PE's total load exceeds II)
[Figure: an example stream graph with six filters A–F whose workloads total T1 = 170. On a single PE one iteration takes T1 = 170; a balanced assignment onto the four PEs brings the maximum per-PE load down to the minimum II = T2 = 50 (balanced workload, maximum throughput), for a gain of T1/T2 = 3.4.]
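To make the objective concrete, the short C sketch below computes II for a given filter → PE assignment as the maximum per-PE load (the workloads and the particular assignment are illustrative values chosen to match the totals in the example, not taken from the paper):

#include <stdio.h>

/* II for a given assignment = the largest total workload on any PE. */
static int compute_II(int nfilters, const int *work, const int *pe_of, int npes) {
    int load[16] = {0};                     /* assumes npes <= 16 */
    for (int i = 0; i < nfilters; i++)
        load[pe_of[i]] += work[i];          /* accumulate each PE's load */
    int II = 0;
    for (int j = 0; j < npes; j++)
        if (load[j] > II) II = load[j];     /* the bottleneck PE determines II */
    return II;
}

int main(void) {
    int work[6]  = {20, 20, 20, 30, 50, 30};  /* filters A..F, total work = 170 */
    int pe_of[6] = {0, 1, 2, 1, 3, 2};        /* one balanced 4-PE assignment */
    printf("II = %d\n", compute_II(6, work, pe_of, 4));  /* prints II = 50 */
    return 0;
}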
- 9 -
Need More Than Just Processor Assignment
Assign filters to processors
» Goal: equal work distribution
Graph partitioning? Bin packing?
[Figure: an original stream program with filters A (work 5), B (40), C (10), and D (5), total work 60. The best two-PE partition is limited by B: speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 with split (S) and join (J) nodes, and the balanced mapping gives speedup = 60/32 ≈ 2.]
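Reading off the numbers: before fission the heaviest PE holds B alone, so each pipeline iteration is gated by 40 of the 60 total work units (60/40 = 1.5×); after B is split into two roughly 20-unit halves, the heaviest PE carries about 32 units once split/join overhead is counted, hence 60/32 ≈ 2×.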
- 10 -
Filter Fission Choices
[Figure: with four PEs (PE0–PE3), which filters should be fissed, and how many times, to approach speedup ≈ 4?]
- 11 -
Integrated Fission + PE Assignment
Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
Objective function
» Minimize the maximal load on any PE
Result
» Number of times to "split" each filter
» Filter → processor mapping
- 12 -
Step 2: Forming the Software Pipeline
To achieve speedup:
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: filters A and B assigned to PE0 and PE1. Running Ai and Bi in the same time step is incorrect – Bi needs Ai's output, which must first be moved by DMA. Software pipelining staggers the iterations: the A→B DMA and the next A execute while an earlier B runs, so Bi overlaps with Ai+2.]
- 13 -
Stage Assignment
Preserve causality (producer–consumer dependence)
» For an edge i → j with both filters on the same PE: Sj ≥ Si
Communication–computation overlap
» For an edge i → j crossing PEs, the DMA gets its own stage: SDMA > Si and Sj = SDMA + 1
Data flow traversal of the stream graph
» Assign stages using the above two rules (see the sketch below)
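A minimal C sketch of that traversal follows, applying the two rules edge by edge on the fissed example graph (the edge list, PE mapping, and the choice SDMA = Si + 1 are illustrative assumptions, not the paper's implementation):

#include <stdio.h>

/* Stage assignment by data-flow (topological) traversal of the stream graph.
 * Filters: 0=A, 1=S, 2=B1, 3=B2, 4=C, 5=J, 6=D (the fissed example).
 * Rules, applied per edge i -> j:
 *   same PE:        S[j] >= S[i]
 *   different PEs:  S_DMA = S[i] + 1 (DMA gets its own stage), S[j] >= S_DMA + 1 */
#define NF 7
#define NE 8

int main(void) {
    int src[NE] = {0, 0, 1, 1, 2, 3, 4, 5};   /* A->S, A->C, S->B1, S->B2, B1->J, B2->J, C->D, J->D */
    int dst[NE] = {1, 4, 2, 3, 5, 5, 6, 6};
    int pe[NF]  = {0, 0, 0, 1, 1, 1, 0};      /* A,S,B1,D on PE0; B2,C,J on PE1 */
    int stage[NF] = {0};

    for (int e = 0; e < NE; e++) {            /* edges listed in topological order */
        int i = src[e], j = dst[e];
        if (pe[i] == pe[j]) {
            if (stage[j] < stage[i]) stage[j] = stage[i];
        } else {
            int s_dma = stage[i] + 1;         /* stage of the DMA on this edge */
            if (stage[j] < s_dma + 1) stage[j] = s_dma + 1;
        }
    }
    for (int f = 0; f < NF; f++)
        printf("filter %d: PE %d, stage %d\n", f, pe[f], stage[f]);
    return 0;                                 /* A,S,B1 -> stage 0; B2,C,J -> stage 2; D -> stage 4 */
}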
- 14 -
Stage Assignment Example
[Figure: stage assignment for the fissed example – Stage 0: A, S, B1 on PE0; Stage 1: DMA transfers A→C, S→B2, B1→J; Stage 2: B2, C, J on PE1; Stage 3: DMA transfers C→D, J→D; Stage 4: D back on PE0.]
- 15 -
Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs
One thread / SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt the kernel-only code schema of a modulo schedule
- 16 -
Complete Example
void spe1_work()
{
  char stage[5] = {0};                  /* stage predicates; only stage 0 active at first */
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }   /* stage 0: filters mapped to this SPE */
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }   /* stage 2: start the DMAs that feed D */
    if (stage[3]) { }
    if (stage[4]) { D(); }              /* stage 4: consume the DMA-ed data */
    barrier();                          /* all SPEs advance stages in lockstep; in the full
                                           schema the stage flags are shifted each iteration
                                           to fill and drain the pipeline (prologue/epilogue) */
  }
}
[Figure: steady-state execution trace over time, with columns SPE1, DMA1, SPE2, DMA2 – each iteration SPE1 runs A, S, B1 (and later D); DMA1 carries AtoC, StoB2, B1toJ; SPE2 runs B2, J, C; DMA2 carries JtoD, CtoD. Successive iterations overlap, so computation and DMA proceed concurrently.]
- 17 -
SGMS (ILP) vs. Greedy
[Bar chart: relative speedup per benchmark (bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar) comparing ILP Partitioning, Greedy Partitioning, and Exposed DMA (MIT method, ASPLOS'06).]
• Solver time < 30 seconds for 16 processors
- 18 -
SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)
Scheduling framework
» Tradeoff memory space vs. load balance
» Memory constrained (embedded) systems
» Cache based systems
- 19 -
Discussion Points
Is it possible to convert stateful filters into stateless ones?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted for a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?
Compilation for GPUs
- 21 -
Why GPUs?
- 22 -
Efficiency of GPUs
High memory bandwidth
» GTX 285: 159 GB/s vs. i7: 32 GB/s
High flop rate
» Single precision: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
» Double precision: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
High flops per watt
» GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
High flops per dollar
» GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
- 23 -
GPU Architecture
[Figure: GPU architecture – 30 streaming multiprocessors (SM 0 … SM 29), each with shared memory, a register file, and 8 scalar cores, connected by an interconnection network to global (device) memory; the GPU attaches to the CPU and host memory over a PCIe bridge.]
- 24 -
CUDA
“Compute Unified Device Architecture”
General purpose programming model
» User kicks off batches of threads on the GPU (example below)
Advantages of CUDA
» Interface designed for compute: graphics-free API
» Orchestration of on-chip cores
» Explicit GPU memory management
» Full support for integer and bitwise operations
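As a minimal sketch of what "kicking off a batch of threads" looks like in CUDA (a generic vector-add written for this write-up, not taken from the lecture; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

// Host copies data to the device, launches a grid of thread blocks, and copies back.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                          // explicit GPU memory management
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);   // kick off a batch of threads
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);                   // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}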
- 25 -
Programming Model
[Figure: the host launches Kernel 1 on Grid 1, then Kernel 2 on Grid 2, on the device over time.]
- 26 -
GPU Scheduling
[Figure: the thread blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, …), each SM with its own shared memory, register file, and 8 cores.]
- 27 -
Warp Generation
[Figure: thread blocks Block 0 – Block 3 resident on SM0, sharing its shared memory and registers; within a block, consecutive thread IDs are grouped into warps – threads 0–31 form Warp 0, threads 32–63 form Warp 1.]
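A tiny CUDA kernel makes that mapping explicit (illustrative; device-side printf assumes compute capability 2.0 or newer):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: how thread IDs within a block map onto warps.
__global__ void showWarp() {
    int tid  = threadIdx.x;
    int warp = tid / warpSize;        // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    int lane = tid % warpSize;        // position of this thread inside its warp
    if (lane == 0)                    // one line of output per warp
        printf("thread %d is lane 0 of warp %d\n", tid, warp);
}

int main() {
    showWarp<<<1, 64>>>();            // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}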
- 28 -
Memory Hierarchy
Per-thread register: int RegisterVar
Per-thread local memory: int LocalVarArray[10]
Per-block shared memory: __shared__ int SharedVar
Per-application global memory: __device__ int GlobalVar
Per-application constant memory: __constant__ int ConstVar
Per-application texture memory: Texture<float,1,ReadMode> TextureVar
[Figure: registers and local memory are per thread, shared memory is per block, and global, constant, and texture memory are per application on the device, set up from the host.]
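The declarations above fit together as in the small illustrative kernel below, which touches registers, shared, constant, and global memory (texture memory omitted; the names and values are made up for the example):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float scale;                 // per-application constant memory
__device__ float globalOut[256];          // per-application global (device) memory

// Illustrative kernel touching several levels of the hierarchy.
__global__ void hierarchyDemo(const float *in) {
    __shared__ float tile[256];           // per-block shared memory
    float v = in[threadIdx.x];            // v lives in a per-thread register
    tile[threadIdx.x] = v * scale;        // stage the scaled value through shared memory
    __syncthreads();
    globalOut[threadIdx.x] = tile[255 - threadIdx.x];   // write the reversed tile to global memory
}

int main() {
    float h_in[256], h_out[256], s = 2.0f;
    for (int i = 0; i < 256; i++) h_in[i] = (float)i;

    float *d_in;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(float));        // initialize constant memory

    hierarchyDemo<<<1, 256>>>(d_in);
    cudaMemcpyFromSymbol(h_out, globalOut, sizeof(h_out));
    printf("out[0] = %f\n", h_out[0]);                   // expect 255 * 2 = 510
    cudaFree(d_in);
    return 0;
}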