Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware

Mihai Budiu, CMU CS
Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze
Ph.D. Thesis defense, December 8, 2003
Thesis Statement

Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
CPU Problems
• Complexity
• Power
• Global signals
• Limited ILP
Design Complexity
from Michael Flynn’s FCRC 2003 talk
[Chart, 1981–2009: logic transistors per chip grow 58%/year, while design productivity (transistors/staff-month) grows only 21%/year; design time and CAD productivity favor FPL. Source: S. Malik, originally Sematech.]
Communication vs. Computation

Wires dominate: a gate delay is about 5 ps, while a wire delay is about 20 ps. Power consumption on wires is also dominant.
Our Approach: ASH
Application-Specific Hardware
Resource Binding Time

[Figure: when programs are bound to computational resources — on a CPU vs. on ASH.]
Hardware Interface

[Figure: on a CPU the ISA is the interface between software and hardware; in ASH a virtual ISA separates software from the hardware gates.]
Application-Specific Hardware

C program → Compiler → Dataflow IR → Reconfigurable/custom hardware
Contributions

[Figure: this work spans theory and systems, drawing on compilation, computer architecture, reconfigurable computing, embedded systems, asynchronous circuits, high-level synthesis, dataflow machines, and nanotechnology.]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Computation = Dataflow

• Operations → functional units
• Variables → wires
• No interpretation

    x = a & 7;
    ...
    y = x >> 2;

[Circuit: a and the constant 7 feed an AND gate producing x; x and the constant 2 feed a shifter producing y.]
Basic Operation

[Figure: a functional unit (+) with data, valid, and ack signals, feeding an output latch.]
Asynchronous Computation

[Animation, steps 1–8: data, valid, and ack handshakes propagate values through a chain of adders and latches.]
Distributed Control Logic

[Figure: each functional unit (+, −) has its own small FSM driving rdy/ack; asynchronous control uses short, local wires.]
Forward Branches

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: both -x and b*x are computed; the comparison x > 0 selects y. The critical path is highlighted.]

Conditionals → speculation.
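The speculation above can be written out as ordinary C: both arms execute unconditionally (as they do in space, on separate functional units) and the predicate merely selects the result. This is an editorial sketch, not output of the CASH compiler; it is safe only because neither arm has side effects.

```c
#include <assert.h>

/* If-converted (speculative) form of: if (x > 0) y = -x; else y = b*x; */
static int select_y(int x, int b) {
    int p  = (x > 0);   /* predicate                */
    int t1 = -x;        /* "then" arm, speculative  */
    int t2 = b * x;     /* "else" arm, speculative  */
    return p ? t1 : t2; /* multiplexer selects y    */
}
```

Both arms always run, so the latency is the slower arm plus a multiplexer, independent of which branch is taken.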
Control Flow → Data Flow

[Figure: CFG constructs map to dataflow operators — a merge (label) joins data values, a gateway passes data under a predicate, and a split (branch) steers data by p and !p.]
Loops

    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;

[Circuit: i circulates through +1 and the comparison < 100; sum circulates through the multiplier and adder, starting from 0; the negated loop predicate steers sum to ret.]
Predication and Side-Effects

[Figure: a Load operation takes an address, a predicate, and a token, and produces data plus a token to memory.]

• No speculation of side-effects
• Tokens sequence side-effects
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Availability Dataflow Analysis

    y = a*b;
    ...
    if (x) {
        ...
        ... = a*b;    // a*b is available in y
    }
Dataflow Analysis Is Conservative

    if (x) {
        ...
        y = a*b;
    }
    ...
    ... = a*b;    // y? — available only when x was true
Static Instantiation, Dynamic Evaluation

    flag = false;
    if (x) {
        ...
        y = a*b;
        flag = true;
    }
    ...
    ... = flag ? y : a*b;
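The transformation above is directly executable C: the flag is the statically instantiated dataflow fact ("a*b is available"), evaluated at run time. A minimal self-contained sketch (function name illustrative):

```c
#include <stdbool.h>

/* SIDE-style availability: instead of conservatively always
   recomputing a*b, record dynamically whether it is available. */
static int use_product(int a, int b, bool x) {
    bool flag = false;   /* "a*b available" — dynamic evaluation */
    int  y    = 0;
    if (x) {
        y = a * b;
        flag = true;
    }
    /* reuse y when available, recompute only otherwise */
    return flag ? y : a * b;
}
```

Either path yields the same value; the win is that when x was true, the multiply (or, for register promotion, a memory access) is not repeated.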
SIDE Register Promotion Impact
[Charts: percentage reduction in dynamic stores (top, axis 0–30%, one clipped bar annotated 53) and loads (bottom, axis 0–45%) from SIDE register promotion and PRE, across Mediabench (adpcm, gsm, epic, mpeg2, jpeg, pegwit, g721, pgp, rasta, mesa) and Spec benchmarks (099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf).]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Performance Evaluation

[Figure: ASH and a 4-way out-of-order (OOO) CPU are evaluated against the same memory hierarchy — a load-store queue (LSQ) with limited bandwidth, an 8K L1, a 1/4M L2, and main memory.]

Assumption: all operations have the same latency.
Media Kernels, vs. 4-way OOO

[Chart: speed-up of ASH over the 4-way OOO CPU ("times faster", axis 0–3×) on the media kernels (adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, rasta); clipped bars are annotated with their values.]
Media Kernels, IPC

[Chart: base CPU IPC vs. ASH IPC on the media kernels; the base CPU is bounded by its 4-way issue width, while ASH IPC reaches roughly 25.]
Speed-up / IPC Correlation

[Chart: for each media kernel, speed-up vs. IPC ratio ("times bigger", axis 0–10; one clipped bar annotated 12).]
Low-Level Evaluation

C → CASH core → Verilog back-end → Synopsys, Cadence place & route → ASIC

• Results shown so far come from the CASH core (all results are in the thesis).
• The next two slides use the ASIC flow: 180 nm standard-cell library, 2 V (~1999 technology).
Area

[Chart: synthesized area in mm² (axis 0–12) for the media kernels (adpcm_d, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e).]

Reference: a P4 in 180 nm is 217 mm².
Power

vs. a 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6 W.

[Chart: power ratio — times smaller than OOO:
adpcm_d 70, g721_d 41, g721_e 41, gsm_d 129, gsm_e 147, jpeg_d 94, mpeg2_d 121, mpeg2_e 136, pegwit_d 303, pegwit_e 303.]
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
Pipelining

    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;

[Animation, cycles 1–5: values flow through the loop circuit; the multiplier is pipelined (8 stages), so successive iterations (i=0, i=1, …) are in flight simultaneously — motivating pipeline balancing.]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
This Is Obvious!

ASH runs at full dataflow speed, so a CPU cannot do any better (assuming equally good compilers).
SpecInt95, ASH vs. 4-way OOO

[Chart: percent slower/faster (axis −50% to +30%) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex.]
Branch Prediction

    for (i = 0; i < N; i++) {
        ...
        if (exception) break;
    }

[Circuit: i flows through +1 and < N, combined with !exception. The exit branch is predicted not taken — effectively a no-op for the CPU — while the back edge is predicted taken. The CPU's critical path is thus shorter than ASH's: the CPU's result is available before ASH's inputs.]
SpecInt95, Perfect Prediction

[Chart: percent slower/faster (axis −60% to +60%) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, comparing baseline vs. perfect prediction; no data for one configuration.]
ASH Problems

• Both branch and join are not free
• Static dataflow (no re-issue of the same instruction)
• Memory is "far"
• Fully static:
  – no branch prediction
  – no dynamic unrolling
  – no register renaming
• Calls/returns are not lenient
• ...
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
+ Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
Strengths

ASH:
• low power
• simple verification?
• specialized to the application
• unlimited ILP
• simple hardware
• no fixed window

CPU:
• economies of scale
• highly optimized
• branch prediction
• control speculation
• full dataflow
• global signals/decisions
Conclusions
• Compiling “around the ISA” is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.
Backup Slides

• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic critical path
• Memory PRE
• Critical path analysis
• CPU + ASH
Control Logic

[Figure: control circuit built from C-elements and a register, with handshake signals rdy_in/ack_in and rdy_out/ack_out, and data_in/data_out.]
Last-Arrival Events

[Figure: a functional unit (+) with data, valid, and ack signals.]

• The event enabling the generation of a result
• May be an ack
• Critical path = the collection of last-arrival edges
Dynamic Critical Path

1. Start from the last node.
2. Trace back along last-arrival edges.
3. Some edges may repeat.
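The trace-back above is simple enough to sketch in C: each dynamic event records which incoming edge enabled it (its last-arrival edge), and the critical path is recovered by walking those pointers back from the final event. The event structure and ids here are illustrative, not the thesis's actual trace format.

```c
#include <stddef.h>

/* One dynamic event; last_arrival is the event whose edge enabled it. */
struct event {
    int id;
    struct event *last_arrival;
};

/* Walk back from the last event, recording ids newest-first into out[].
   The max bound guards the walk, since edges may repeat in a trace. */
static size_t critical_path(struct event *last, int *out, size_t max) {
    size_t n = 0;
    for (struct event *e = last; e != NULL && n < max; e = e->last_arrival)
        out[n++] = e->id;
    return n;
}
```

Applied to a three-event chain, the walk yields the events in reverse order of occurrence.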
Critical Paths

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: the speculative circuit from before, with the dynamic critical path highlighted.]
Lenient Operations

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: the multiplexer produces y as soon as the selected input is known.]

Lenient operations solve the problem of unbalanced paths.
Pipelining (continued)

[Animation, cycles 6–7: with the long-latency 8-stage multiplier, i's loop (short latency) tries to run ahead of sum's loop; the predicate ack edge ends up on the critical path, throttling i's loop.]
Pipeline Balancing

[Figure: a decoupling FIFO inserted on the predicate path lets i's loop run ahead of sum's loop, moving the critical path off the predicate ack edge.]
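The decoupling FIFO can be modeled in software: the fast producer loop (i) runs ahead, depositing operands into a bounded FIFO, while the slow consumer loop (sum, behind the long-latency multiplier) drains it. This is an editorial sketch of the idea; the FIFO size and names are illustrative, not taken from the thesis.

```c
/* Bounded decoupling FIFO between a fast producer and a slow consumer. */
#define FIFO_CAP 8

struct fifo { int buf[FIFO_CAP]; int head, tail, count; };

static void fifo_push(struct fifo *f, int v) {
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
}

static int fifo_pop(struct fifo *f) {
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return v;
}

/* sum of i*i for i in [0,100), with the i-loop and sum-loop decoupled. */
static int sum_squares(void) {
    struct fifo f = {{0}, 0, 0, 0};
    int sum = 0, produced = 0, consumed = 0;
    while (consumed < 100) {
        /* producer (i's loop) runs ahead while the FIFO has room */
        while (produced < 100 && f.count < FIFO_CAP)
            fifo_push(&f, produced++);
        /* consumer (sum's loop, the long-latency multiply) drains it */
        while (f.count > 0) {
            int i = fifo_pop(&f);
            sum += i * i;
            consumed++;
        }
    }
    return sum;  /* 328350 */
}
```

The result matches the unbuffered loop; only the schedule changes — which is exactly the point of pipeline balancing.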
Register Promotion

    Before:  *p = … (p1);   … = *p (p2);
    After:   *p = … (p1);   … = *p (p2 ∧ ¬p1);

The load is executed only if the store is not.
Register Promotion (2)

    Before:  *p = … (p1);   … = *p (p2);
    After:   *p = … (p1);   … = *p (false);

• When p2 ⇒ p1, the load becomes dead...
• ...i.e., when the store dominates the load in the CFG.
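In CFG terms the dead-load case looks like this in plain C: when the store dominates the load, the load can read the promoted (forwarded) value instead of memory. A before/after sketch with illustrative names:

```c
/* Before promotion: a load from *p even though a store to *p
   dominates it (p2 implies p1, so the load is dead). */
static int load_before(int *p, int v) {
    *p = v;        /* store, predicate p1 */
    return *p;     /* load, predicate p2  */
}

/* After promotion: the store is kept (memory state must be
   preserved), but the load is replaced by the forwarded value. */
static int load_after(int *p, int v) {
    *p = v;
    return v;      /* load eliminated */
}
```

Both versions return the same value and leave memory in the same state; the promoted version performs one memory access instead of two.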
≈ PRE

    … = *p (p1);   … = *p (p2);   →   … = *p (p1 ∨ p2)

This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
Store-Store (1)

    Before:  *p = … (p1);        *p = … (p2);
    After:   *p = … (p1 ∧ ¬p2);  *p = … (p2);

• When p1 ⇒ p2, the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG.
Store-Store (2)

    Before:  *p = … (p1);        *p = … (p2);
    After:   *p = … (p1 ∧ ¬p2);  *p = … (p2);

• The token edge is eliminated, but...
• ...the transitive closure of tokens is preserved.
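The dead-store case, written out in C (names illustrative): when the second store post-dominates the first, the first store never determines the final memory state and can be removed.

```c
/* Before elimination: the first store is post-dominated by the
   second (p1 implies p2), so its value never survives. */
static void store_before(int *p, int a, int b) {
    *p = a;   /* first store (p1): dead */
    *p = b;   /* second store (p2)      */
}

/* After elimination: first store removed; final state unchanged. */
static void store_after(int *p, int a, int b) {
    (void)a;  /* first store's operand is no longer needed */
    *p = b;
}
```

Memory ends in the same state either way, which is what makes the transformation sound.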
A Code Fragment

    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }

SpecINT95, 124.m88ksim: init_processor, stylized
Dynamic Critical Path

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: the dynamic critical path runs through the load predicate, the loop predicate, and the sizeof(X[j]) address computation.]
MIPS gcc Code

    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
    EXIT:

L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence.

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
If Branch Prediction Correct

L1 → L2 → L3 → L5 → L1: the superscalar is issue-limited — 2 cycles/iteration sustained.

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Same MIPS loop as on the previous slide.]
Critical Path with Prediction

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: the loads are not speculative, so they remain on the critical path.]
Prediction + Load Speculation

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: ~4 cycles/iteration; the load is not pipelined because of a self-anti-dependence (ack edge).]
OOO Pipe Snapshot

[Figure: pipeline occupancy (IF, DA, EX, WB, CT stages) of instructions L1–L5 of the MIPS loop across cycles, with register renaming.]
Unrolling?

    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            if (X[j+1].r == 0xF) break;
            if (X[j+1].r == i) break;
        }
        Y[i] = X[j].q;
    }

[Annotation: "when 1 iteration" marks the problematic case.]
Ideal Architecture

[Figure: a hybrid system sharing one memory — the CPU handles low-ILP computation, the OS, and virtual memory; ASH handles high-ILP computation.]