EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan
November 30, 2011
Guest Speaker Today: Daya Khudia
- 2 -
Announcements & Reading Material
This class
» "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008
Next class – GPU compilation
» "Program optimization space pruning for a multithreaded GPU," S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008
- 3 -
Stream Graph Modulo Scheduling (SGMS)
Coarse-grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
» DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell architecture – SPE0 through SPE7, each with an SPU, a 256 KB local store, and an MFC (DMA engine), connected by the EIB to the PPE (PowerPC) and DRAM]
- 4 -
Preliminaries
Synchronous Data Flow (SDF) [Lee '87]; StreamIt [Thies '02]
Filters push and pop items from input/output FIFOs.

Stateless filter:
int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Stateful variant (weights updated across firings):
int wgts[N];
wgts = adapt(wgts);
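To make the FIFO semantics concrete, here is a minimal C sketch (my illustration, not StreamIt or the paper's runtime) of one firing of the FIR work function: peek(i) inspects the i-th element of the input queue without consuming it, pop() consumes one input item, and push() appends one output item.

```c
#define N 3                       /* FIR taps (illustrative) */

int in_q[16], out_q[16];          /* toy bounded FIFOs */
int head = 0, tail = 0, out_len = 0;

int  peek(int i)  { return in_q[head + i]; }   /* read without consuming */
void pop(void)    { head++; }                  /* consume one input item */
void push(int v)  { out_q[out_len++] = v; }    /* emit one output item */
void feed(int v)  { in_q[tail++] = v; }        /* producer side, for the demo */

/* One firing: "work pop 1 push 1" -- needs N items visible to peek. */
void fir_work(const int wgts[N]) {
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
    push(sum);
    pop();
}
```

Because each firing pops one item while peeking N, consecutive firings compute a sliding dot product; the filter is stateless because all persistent state lives in the FIFO, not in the filter.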
- 5 -
SGMS Overview
[Figure: SGMS schedule across PE0–PE3 – filter executions and DMA operations overlapped in a software pipeline with an explicit prologue and epilogue; single-PE time T1 is roughly 4× the per-iteration time T4 of the four-PE pipeline]
- 6 -
SGMS Phases
» Fission + processor assignment (load balance)
» Stage assignment (causality, DMA overlap)
» Code generation
- 7 -
Processor Assignment: Maximizing Throughputs
Minimize II, subject to:
  Σ_j a_ij = 1        for all filters i = 1, …, N   (each filter on exactly one PE)
  Σ_i w_i a_ij ≤ II   for all PEs j = 1, …, P       (no PE loaded beyond II)
[Figure: example stream graph A–F with workloads (W) A:20, B:20, C:20, D:30, E:50, F:30 on four processing elements PE0–PE3. On one PE: T1 = 170. A balanced assignment achieves the minimum II of 50 – maximum throughput – so T2 = 50 and T1/T2 = 3.4]
• Assigns each filter to a processor
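The ILP's objective can be sanity-checked by brute force on a graph this small. The sketch below (my illustration, not the paper's solver) enumerates every filter-to-PE assignment and returns the smallest achievable II, i.e. the minimum over assignments of the maximum per-PE load; on the example's weights (20, 20, 20, 30, 50, 30) it finds II = 50.

```c
enum { NF = 6, NP = 4 };   /* 6 filters (A-F), 4 PEs, as in the example */

/* Exhaustive search over all NP^NF assignments: returns the minimum
   achievable II = max per-PE load. Feasible only for tiny graphs; the
   paper uses an ILP solver instead. */
int min_ii(const int w[NF]) {
    int best = 1 << 30;
    int total = 1;
    for (int i = 0; i < NF; i++) total *= NP;   /* NP^NF assignments */
    for (int code = 0; code < total; code++) {
        int load[NP] = {0};
        int c = code;
        for (int i = 0; i < NF; i++) {          /* decode one assignment */
            load[c % NP] += w[i];
            c /= NP;
        }
        int ii = 0;
        for (int j = 0; j < NP; j++)
            if (load[j] > ii) ii = load[j];     /* max load = II */
        if (ii < best) best = ii;
    }
    return best;
}
```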
- 8 -
Need More Than Just Processor Assignment
Assign filters to processors
» Goal: equal work distribution
» Graph partitioning? Bin packing?
[Figure: original stream program – A (5) feeding B (40) and C (10), which feed D (5) – mapped to two PEs as PE0 = {B}, PE1 = {A, C, D}: speedup = 60/40 = 1.5. Modified stream program – B fissed into B1 and B2 with splitter S and joiner J – mapped as PE0 = {B2, C, J}, PE1 = {A, S, B1, D}: speedup = 60/32 ≈ 2]
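The slide's speedup arithmetic can be reproduced directly. Note the split/join cost is not stated on the slide; the value 2 passed below is an assumption chosen so the fissed loads come out to the slide's 32 (the workloads A=5, B=40, C=10, D=5 are from the figure).

```c
static int max2(int a, int b) { return a > b ? a : b; }

/* Without fission: PE0 = {B}, PE1 = {A, C, D}. */
int max_load_original(void) {
    return max2(40, 5 + 10 + 5);          /* = 40, so speedup 60/40 = 1.5 */
}

/* With B fissed into B1 + B2 (20 each) plus splitter S and joiner J:
   PE0 = {B2, C, J}, PE1 = {A, S, B1, D}. sj is the ASSUMED S/J cost. */
int max_load_fissed(int sj) {
    return max2(20 + 10 + sj, 5 + sj + 20 + 5);
}
```

With sj = 2 both PEs carry 32 units of work, matching the slide's speedup of 60/32 ≈ 2.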
- 9 -
Filter Fission Choices
[Figure: candidate ways to fiss filters across PE0–PE3 – can we approach speedup ≈ 4?]
- 10 -
Integrated Fission + PE Assignment
Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
Objective function
» Minimize the maximal load on any PE
Result
» Number of times to "split" each filter
» Filter → processor mapping
- 11 -
Step 2: Forming the Software Pipeline
To achieve speedup:
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: filters A → B mapped to PE0 and PE1. Running A_i and B_i back-to-back serializes on the A→B transfer; software pipelining instead overlaps A_{i+2} with B_i, hiding the A→B communication behind computation]
- 12 -
Stage Assignment
Same PE (i → j both on PE 1): S_j ≥ S_i
Different PEs (i on PE 1, j on PE 2, with a DMA in between): S_DMA > S_i and S_j = S_DMA + 1
The first rule preserves causality (producer–consumer dependence); the second enables communication–computation overlap.
Data-flow traversal of the stream graph
» Assign stages using the above two rules
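The two rules can be applied in a single forward pass over the graph in topological order. The sketch below is my illustration of that traversal (not the paper's code): a same-PE edge forces S_j ≥ S_i, while a cross-PE edge places a DMA at stage S_i + 1 and forces S_j ≥ S_i + 2.

```c
/* Assign pipeline stages given a filter->PE mapping.
   Edges (src[e] -> dst[e]) must be supplied in topological order. */
void assign_stages(int nv, int ne, const int src[], const int dst[],
                   const int pe[], int stage[]) {
    for (int v = 0; v < nv; v++) stage[v] = 0;
    for (int e = 0; e < ne; e++) {
        int i = src[e], j = dst[e];
        /* same PE: S_j >= S_i; cross PE: DMA at S_i+1, so S_j >= S_i+2 */
        int need = (pe[i] == pe[j]) ? stage[i] : stage[i] + 2;
        if (stage[j] < need) stage[j] = need;
    }
}
```

On the fissed example (A, S, B1, and D on one PE; B2, C, and J on the other) this reproduces the pictured assignment: A/S/B1 at stage 0, B2/C/J at stage 2, D at stage 4, with DMAs at stages 1 and 3.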
- 13 -
Stage Assignment Example
[Figure: stage assignment for the fissed example (A → S → B1/B2, A → C, B1/B2 → J, C/J → D) on PE0 and PE1. Stage 0: A, S, B1; Stage 1: DMA; Stage 2: C, B2, J; Stage 3: DMA; Stage 4: D]
- 14 -
Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs
One thread per SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt the kernel-only code schema of a modulo schedule
- 15 -
Complete Example
void spe1_work() {
  char stage[5] = {0};
  stage[0] = 1;
  for (int i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { }
    if (stage[4]) { D(); }
    if (i < 4) stage[i + 1] = 1;  /* prologue: enable one more stage per iteration */
    barrier();
  }
}
[Figure: execution timeline across SPE1, DMA1, SPE2, and DMA2. In steady state each iteration runs A, S, B1 on one SPE; DMA moves AtoC, StoB2, B1toJ; C, B2, J run on the other SPE; DMA moves CtoD, JtoD; then D fires – earlier iterations show the pipeline filling one stage at a time]
- 16 -
SGMS(ILP) vs. Greedy
[Chart: relative speedup on bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, and radar – ILP partitioning vs. greedy partitioning (MIT method, ASPLOS '06), with exposed DMA shown separately]
• Solver time < 30 seconds for 16 processors
- 17 -
SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse-grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than the greedy solution (11% on average)
Scheduling framework
» Trade off memory space vs. load balance
» Memory-constrained (embedded) systems vs. cache-based systems
- 18 -
Discussion Points
Is it possible to convert stateful filters into stateless ones?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted to a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?
“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”
- 20 -
Static versus Dynamic Scheduling
[Figure: a stream graph – A → splitter → B1, B2, B3, B4 → joiner → C → splitter → D1, D2 → joiner → E → F – statically scheduled onto four cores, each with its own memory]
Stream graph modulo scheduling is performed on the stream graph statically.
What happens in case of dynamic resource changes?
- 21 -
Overview of Flextream
Static phases:
» Prepass Replication – adjust the amount of parallelism for the target system by replicating actors
» Work Partitioning – find an optimal modulo schedule for a virtualized member of a family of processors
Dynamic phases – light-weight adaptation of the schedule for the current configuration of the target hardware:
» Partition Refinement – tunes the actor-to-processor mapping to the real configuration of the underlying hardware (load balance)
» Stage Assignment – specifies how actors execute in time in the new actor-to-processor mapping
» Buffer Allocation – tries to efficiently allocate the storage requirements of the new schedule into the available memory units
Input: streaming application; output: MSL commands
Goal: perform adaptive stream graph modulo scheduling
- 22 -
Overall Execution Flow
For every application, we may see multiple iterations of:
resource change requested → resource change granted
- 23 -
Prepass Replication [static]
[Figure: prepass replication example. Original graph A(10) → B(86) → C(246) → D(326) → E(566) → F(10); initial loads P0:10, P1:86, P2:246, P3:326, P4:566, P5:10, P6:0, P7:0. Replicating E into E0–E3, D into D0/D1, and C into C0–C3 (with splitters S0–S2 and joiners J0–J2) balances the loads to roughly P0:151.5, P1:147.5, P2:184.5, P3:163, P4:141.5, P5:151.5, P6:141.5, P7:163]
- 24 -
Partition Refinement [dynamic 1]
Available resources at runtime can be more limited than resources in static target architecture.
Partition refinement tunes actor to processor mapping for the active configuration.
A greedy iterative algorithm is used to achieve this goal.
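As a rough illustration of the greedy idea (a simplification I wrote, not Flextream's actual refinement pass): repeatedly move the lightest actor off the most-loaded processor onto the least-loaded one, stopping when a move would no longer reduce the larger of the two loads.

```c
typedef struct { int work; int pe; } Actor;

static int load_of(const Actor *a, int n, int p) {
    int s = 0;
    for (int i = 0; i < n; i++)
        if (a[i].pe == p) s += a[i].work;
    return s;
}

/* Greedily rebalance actors across npe processors; returns the final
   maximum per-processor load. */
int refine(Actor *a, int n, int npe) {
    for (;;) {
        int maxp = 0, minp = 0;
        for (int p = 1; p < npe; p++) {
            if (load_of(a, n, p) > load_of(a, n, maxp)) maxp = p;
            if (load_of(a, n, p) < load_of(a, n, minp)) minp = p;
        }
        int best = -1;  /* lightest actor on the most-loaded processor */
        for (int i = 0; i < n; i++)
            if (a[i].pe == maxp && (best < 0 || a[i].work < a[best].work))
                best = i;
        if (best < 0) break;
        /* stop when moving it would not reduce the larger load */
        if (load_of(a, n, minp) + a[best].work >= load_of(a, n, maxp)) break;
        a[best].pe = minp;
    }
    int m = 0;
    for (int p = 0; p < npe; p++)
        if (load_of(a, n, p) > m) m = load_of(a, n, p);
    return m;
}
```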
- 25 -
Partition Refinement Example
» Pick the processors with the most actors
» Sort the actors
» Find the processor with maximum work
» Assign minimum-work actors until the threshold is met
[Figure: refinement example – starting loads P0:184.5, P1:141.5, P2:171.5, P3:141.5, P4:151.5, P5:173, P6:140, P7:159.5; actors from the dropped processors (B, C0–C3, E2, S0–S2, J0–J2, F) are redistributed, ending with loads around P1:283, P3:289, P4:274.5, P5:270.5]
- 26 -
Stage Assignment [dynamic 2]
Processor assignment only specifies how actors are partitioned across processors.
Stage assignment finds how actors are overlapped in time.
Relative start times of the actors are based on stage numbers.
DMA operations get a separate stage.
- 27 -
Stage Assignment Example
[Figure: stage assignment on the refined graph (A through F, with the replicated C, D, and E actors and their splitters/joiners): stages 0, 2, 4, 6, 8, 10, 12, 14, 16, and 18 are assigned down the graph, with DMA operations occupying the odd stages in between]
- 28 -
Performance Comparison
- 29 -
Overhead Comparison
- 30 -
Flextream Conclusions
Static scheduling approaches are promising, but not sufficient on their own.
Dynamic adaptation is necessary for future systems.
Flextream provides a hybrid static/dynamic approach to improve efficiency.