Bottleneck Elimination from Stream Graphs
S. M. Farhad
The University of Sydney
Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz
Multicores Are Here!
[Figure: number of cores per chip (1 to 512, doubling at each tick) plotted against year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]
Uniprocessors: C is the common machine language
What is the common machine language for multicores?
Stream Programming Paradigm
Research topic in parallel programming
Various forms of parallelism
    Pipeline, task, and data
Applications
    Signal processing
    Multimedia
    High-performance computing
Programs expressed as stream graphs
    Streams: infinite sequences of data elements (a.k.a. tokens)
    Actors: functions applied to streams
[Figure: an actor with one incoming stream and one outgoing stream]
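A minimal Python sketch of streams and actors, assuming generators stand in for streams and generator functions for actors. The names `source`, `scale`, and `pairwise_sum` are illustrative only and do not come from any stream language.

```python
from itertools import count, islice

def source():
    """Stream: an unbounded sequence of tokens 0, 1, 2, ..."""
    return count()

def scale(stream, factor):
    """Stateless actor: consumes 1 token and produces 1 token per firing."""
    for token in stream:
        yield token * factor

def pairwise_sum(stream):
    """Actor that consumes 2 tokens and produces 1 token per firing."""
    it = iter(stream)
    for first, second in zip(it, it):
        yield first + second

# Stream graph as a pipeline: source -> scale -> pairwise_sum
pipeline = pairwise_sum(scale(source(), 10))

# The stream is infinite, so only a finite prefix is materialized.
first_three = list(islice(pipeline, 3))
print(first_three)  # [10, 50, 90]
```

The actors communicate only through the streams they are connected to, which is what makes the producer/consumer dependencies explicit.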
Properties of Stream Programs
Regular and repeating computation
Independent actors with explicit communication
Producer/consumer dependencies
[Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, and Speaker]
StreamIt Language [ASPLOS'02 and '06, PLDI'03]
An implementation of stream programming
Each construct has a single input stream and a single output stream
Hierarchical structure
Filters can be stateful or stateless
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter). Each component may be any StreamIt language construct]
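Mapping Actors
[Figure: stream graph A (5) -> B (60) -> C (60) -> D (5); the number is the work of each actor]
Mapping to three cores:
    Core 1: A (5), D (5)    T = 10s
    Core 2: B (60)          T = 60s
    Core 3: C (60)          T = 60s
Makespan = 60s, Speedup = 130/60 = 2.17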
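The arithmetic on this slide can be checked with a short sketch. It assumes the makespan is the load of the busiest core and the speedup is the single-core time (130s, the sum of all actor work) divided by the makespan. The function names are illustrative.

```python
def makespan(mapping):
    """Load of the most heavily loaded core."""
    return max(sum(work for _, work in actors) for actors in mapping.values())

def speedup(mapping, sequential_time):
    """Single-core execution time divided by the makespan of the mapping."""
    return sequential_time / makespan(mapping)

# Mapping from the slide: actor name and work in seconds per core.
mapping = {
    "Core 1": [("A", 5), ("D", 5)],
    "Core 2": [("B", 60)],
    "Core 3": [("C", 60)],
}
sequential_time = 5 + 60 + 60 + 5  # 130s on a single core

print(makespan(mapping))                            # 60
print(round(speedup(mapping, sequential_time), 2))  # 2.17
```

Cores 2 and 3 each hold a single 60s actor, so no reassignment of whole actors can push the makespan below 60s.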
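Bottleneck Actors Limit the Performance
[Figure: hot actor duplication. Original graph: A (5) -> B (60) -> C (60) -> D (5). After duplication: A (5) -> splitter s1 (2) -> B_1, B_2, B_3 (20 each) -> joiner j1 (2) -> splitter s2 (2) -> C_1, C_2, C_3 (20 each) -> joiner j2 (2) -> D (5)]
Mapping to three cores after hot actor duplication:
    Core 1: A (5), s1 (2), B_1 (20), C_1 (20)             T = 47s
    Core 2: B_2 (20), j1 (2), s2 (2), C_2 (20), j2 (2)    T = 46s
    Core 3: B_3 (20), C_3 (20), D (5)                     T = 45s
Makespan = 47s, Speedup = 130/47 = 2.77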
Bottleneck Resolving of Stream Programs (Contd.)
Current state of the art
    Integer Linear Programming
    Intractable
How to find a fast and good solution?
    Heuristics
    Optimal
Our Work
A data rate transfer model to detect and eliminate bottlenecks
We separate the bottleneck elimination from the actor allocation
Heuristics to solve the bottleneck problem efficiently
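Our Data Transfer Model
Throughput depends on the data rate of the actors (maximize)
The data transfer model forms a system of simultaneous functional linear equations
Compute a closed form of the output data rate
We also consider a processor utilization function for each actor
[Figure: pipeline A -> B -> C with input data rate $z$. On edge A -> B, A produces 1 token and B consumes 5; on edge B -> C, B produces 1 token and C consumes 1]
Data rate equations: $x_A = z$, $x_B = 0.2\,x_A$, $x_C = x_B$
Closed form: $x_A = z$, $x_B = 0.2\,z$, $x_C = 0.2\,z$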
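The closed form for the A -> B -> C example can be reproduced with a small sketch. It assumes a plain pipeline in which each actor's output rate is its input rate times (tokens produced / tokens consumed) per firing, and that A forwards its input unchanged. The function `output_rates` and the `(name, consumed, produced)` tuples are hypothetical and not taken from the authors' implementation.

```python
from fractions import Fraction

def output_rates(pipeline, z):
    """Propagate the input data rate z through a pipeline of actors.

    pipeline: list of (name, tokens_consumed, tokens_produced) per firing.
    Returns a dict mapping each actor name to its output data rate.
    """
    rates = {}
    rate_in = Fraction(z)
    for name, consumed, produced in pipeline:
        rate_out = rate_in * Fraction(produced, consumed)
        rates[name] = rate_out
        rate_in = rate_out  # the next actor reads what this one writes
    return rates

# Example from the slide: B consumes 5 tokens per token produced.
pipeline = [("A", 1, 1), ("B", 5, 1), ("C", 1, 1)]
rates = output_rates(pipeline, z=10)
print(rates["A"], rates["B"], rates["C"])  # 10 2 2
```

With $z = 10$ this gives $x_A = 10$ and $x_B = x_C = 0.2 \cdot 10 = 2$, matching the closed form.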
Bottleneck Analysis
The throughput is limited by
    Processor capacity of the cores
    Memory bandwidth
A quantitative analysis determines
    An upper bound of the throughput imposed by an actor
    An upper bound of the throughput imposed by the parallel system
Hot actor
    Upper bound (actor) < upper bound (system)
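The hot actor test can be illustrated with a sketch under simplifying assumptions that the slides do not state. An actor that runs on a single core is assumed to bound the throughput by 1 / work. The system is assumed to bound it by (number of cores) / (total work). Memory bandwidth is ignored. The work values are those of the A, B, C, D example used elsewhere in the talk.

```python
def hot_actors(work, num_cores):
    """Return the actors whose throughput bound is below the system bound.

    Assumptions (not from the slides): the bound imposed by an actor is
    1 / work, the bound imposed by the system is num_cores / total work,
    and memory bandwidth is not modelled.
    """
    system_bound = num_cores / sum(work.values())
    return [name for name, w in work.items() if 1.0 / w < system_bound]

work = {"A": 5, "B": 60, "C": 60, "D": 5}
print(hot_actors(work, num_cores=3))  # ['B', 'C']
```

Here the system bound is 3/130, about 0.023 iterations per second, while B and C each allow only 1/60, about 0.017, so both are flagged as hot.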
Hot Region
Maximal connected subgraph $h = (V', E')$ where $V' \subseteq V$, $E' \subseteq E$, and each $i \in V'$ is hot and stateless
[Figure: stream graph with actors A to G and two hot regions, h1 and h2]
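One way to compute hot regions from this definition is to take the connected components of the subgraph induced by the hot, stateless actors. The sketch below does that with a depth-first search and treats edges as undirected for connectivity. The graph, the hot set, and the stateless set are hypothetical and do not reproduce the figure.

```python
def hot_regions(edges, hot, stateless):
    """Connected components of the subgraph induced by hot, stateless actors."""
    candidates = hot & stateless
    neighbours = {v: set() for v in candidates}
    for u, v in edges:
        if u in candidates and v in candidates:
            neighbours[u].add(v)
            neighbours[v].add(u)

    regions, seen = [], set()
    for start in sorted(candidates):
        if start in seen:
            continue
        region, stack = set(), [start]
        while stack:  # depth-first search from the start actor
            node = stack.pop()
            if node in region:
                continue
            region.add(node)
            stack.extend(neighbours[node] - region)
        seen |= region
        regions.append(region)
    return regions

# Hypothetical pipeline A -> B -> ... -> G; G is hot but stateful.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G")]
hot = {"B", "C", "E", "F", "G"}
stateless = {"A", "B", "C", "D", "E", "F"}
regions = hot_regions(edges, hot, stateless)
```

In this example the regions are {B, C} and {E, F}. D separates them because it is not hot, and G is excluded because it is stateful.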
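Resolving Bottleneck Options
[Figure: original graph A (5) -> B (60) -> C (60) -> D (5) and two ways to resolve its bottleneck]
Hot region duplication: A (5) -> s1 (2) -> three parallel copies of the region B_i (20) -> C_i (20), i = 1, 2, 3 -> j1 (2) -> D (5)
Hot actor duplication: A (5) -> s1 (2) -> B_1, B_2, B_3 (20 each) -> j1 (2) -> s2 (2) -> C_1, C_2, C_3 (20 each) -> j2 (2) -> D (5)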
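The difference in overhead between the two options can be made concrete with a sketch. It assumes that work divides evenly among the copies (60 becomes 3 x 20, as in the example) and that every splitter and joiner costs 2, as in the example. The function names are illustrative.

```python
SPLIT_JOIN_COST = 2  # cost of one splitter or one joiner, as in the example

def duplicate_actors(hot_chain, k):
    """Hot actor duplication: every hot actor gets its own splitter and joiner."""
    result = []
    for idx, (name, work) in enumerate(hot_chain, start=1):
        result.append((f"s{idx}", SPLIT_JOIN_COST))
        result += [(f"{name}_{c}", work / k) for c in range(1, k + 1)]
        result.append((f"j{idx}", SPLIT_JOIN_COST))
    return result

def duplicate_region(hot_chain, k):
    """Hot region duplication: one splitter and one joiner around the whole chain."""
    result = [("s1", SPLIT_JOIN_COST)]
    for name, work in hot_chain:
        result += [(f"{name}_{c}", work / k) for c in range(1, k + 1)]
    result.append(("j1", SPLIT_JOIN_COST))
    return result

def total_work(actors):
    return sum(work for _, work in actors)

chain = [("B", 60), ("C", 60)]  # the hot actors of the example
rest = 5 + 5                    # work of A and D, which are not duplicated

print(rest + total_work(duplicate_actors(chain, 3)))  # 138.0
print(rest + total_work(duplicate_region(chain, 3)))  # 134.0
```

Region duplication saves one splitter and one joiner for each additional actor in the region, which leaves less total work to distribute over the cores.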
Region Duplication Further Increases Performance
[Figure: hot region duplication. A (5) -> s1 (2) -> three parallel copies of B_i (20) -> C_i (20), i = 1, 2, 3 -> j1 (2) -> D (5)]
Mapping to three cores:
    Core 1: A (5), B_1 (20), C_1 (20)             T = 45s
    Core 2: s1 (2), B_2 (20), j1 (2), C_2 (20)    T = 44s
    Core 3: B_3 (20), C_3 (20), D (5)             T = 45s
Makespan = 45s, Speedup = 130/45 = 2.89
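Cascading Effect of Duplication
Actors may become hot due to duplication of other actors
[Figure: stream graph with actors A to E before duplication, and the same graph after B and C have been duplicated three times (s1, B_1 to B_3, C_1 to C_3, j1)]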
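Duplication Factor of an Actor and a Hot Region
Duplication factor of an actor: the number of times the actor needs to be duplicated
Duplication factor of a hot region: the maximum duplication factor of the actors of the hot region
[Figure: stream graph with actors A to G and hot regions h1 and h2, annotated with duplication factors $d_i$: $d_B = 2$, $d_{h1} = 2$, $d_D = 3$, $d_E = 2$, $d_F = 3$, $d_{h2} = 3$]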
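The region rule is a maximum over the region's actors, as the following sketch shows. The actor factors are the values annotated in the figure. The membership of h1 and h2 is an assumption read off those annotations, since the graph itself is not recoverable here.

```python
def region_duplication_factor(region, actor_factor):
    """Duplication factor of a hot region: the maximum over its actors."""
    return max(actor_factor[actor] for actor in region)

# Actor duplication factors as annotated in the figure.
actor_factor = {"B": 2, "D": 3, "E": 2, "F": 3}

# Assumed region membership (not confirmed by the slides).
h1 = {"B"}
h2 = {"D", "E", "F"}

print(region_duplication_factor(h1, actor_factor))  # 2
print(region_duplication_factor(h2, actor_factor))  # 3
```

Taking the maximum means the most demanding actor in a region sets the number of copies for the whole region, so E is duplicated three times even though its own factor is 2.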
Heuristics to Resolve Bottlenecks
Determine hot regions
Determine duplication factors of hot regions
Duplicate hot regions
Optimal solution?
Experiment
Summary
A simple quantitative analysis to detect and eliminate bottlenecks
We separate the bottleneck elimination from the actor allocation
Heuristics to eliminate bottlenecks
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]