Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware

Mihai Budiu, CMU CS
Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze
Ph.D. Thesis defense, December 8, 2003
Thesis Statement

Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
CPU Problems
• Complexity
• Power
• Global signals
• Limited ILP
Design Complexity
from Michael Flynn’s FCRC 2003 talk
[Chart, 1981–2009: logic transistors per chip grow 58%/year, while design productivity (transistors/staff-month) grows only 21%/year; design time and CAD productivity favor FPL. Source: S. Malik, originally Sematech.]
Communication vs. Computation

Wires dominate: a gate delay is about 5 ps, while a wire delay is about 20 ps. Power consumption on wires is also dominant.
Our Approach: ASH
Application-Specific Hardware
Resource Binding Time

[Figure: when programs are bound to computational resources — on a CPU vs. on ASH.]
Hardware Interface

[Figure: on a CPU the ISA is the interface between software and hardware; in ASH a virtual ISA separates software from the hardware gates.]
Application-Specific Hardware

C program → Compiler → Dataflow IR → Reconfigurable/custom hardware
Contributions

[Figure: this work spans theory and systems, drawing on compilation, computer architecture, reconfigurable computing, embedded systems, asynchronous circuits, high-level synthesis, dataflow machines, and nanotechnology.]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Computation = Dataflow

• Operations → functional units
• Variables → wires
• No interpretation

    x = a & 7;
    ...
    y = x >> 2;

[Circuit: a and the constant 7 feed an AND gate producing x; x and the constant 2 feed a shifter producing y.]
Basic Operation

[Figure: a functional unit (+) with data, valid, and ack signals, feeding an output latch.]
Asynchronous Computation

[Animation, steps 1–8: data, valid, and ack handshakes propagate values through a chain of adders and latches.]
Distributed Control Logic

[Figure: each functional unit (+, −) has its own small FSM driving rdy/ack; asynchronous control uses short, local wires.]
Forward Branches

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: both -x and b*x are computed; the comparison x > 0 selects y. The critical path is highlighted.]

Conditionals → speculation.
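The speculation above can be written out as ordinary C: both arms execute unconditionally (as they do in space, on separate functional units) and the predicate merely selects the result. This is an editorial sketch, not output of the CASH compiler; it is safe only because neither arm has side effects.

```c
#include <assert.h>

/* If-converted (speculative) form of: if (x > 0) y = -x; else y = b*x; */
static int select_y(int x, int b) {
    int p  = (x > 0);   /* predicate                */
    int t1 = -x;        /* "then" arm, speculative  */
    int t2 = b * x;     /* "else" arm, speculative  */
    return p ? t1 : t2; /* multiplexer selects y    */
}
```

Both arms always run, so the latency is the slower arm plus a multiplexer, independent of which branch is taken.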
Control Flow → Data Flow

[Figure: CFG constructs map to dataflow operators — a merge (label) joins data values, a gateway passes data under a predicate, and a split (branch) steers data by p and !p.]
Loops

    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;

[Circuit: i circulates through +1 and the comparison < 100; sum circulates through the multiplier and adder, starting from 0; the negated loop predicate steers sum to ret.]
Predication and Side-Effects

[Figure: a Load operation takes an address, a predicate, and a token, and produces data plus a token to memory.]

• No speculation of side-effects
• Tokens sequence side-effects
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Availability Dataflow Analysis

    y = a*b;
    ...
    if (x) {
        ...
        ... = a*b;    // a*b is available in y
    }
Dataflow Analysis Is Conservative

    if (x) {
        ...
        y = a*b;
    }
    ...
    ... = a*b;    // y? — available only when x was true
Static Instantiation, Dynamic Evaluation

    flag = false;
    if (x) {
        ...
        y = a*b;
        flag = true;
    }
    ...
    ... = flag ? y : a*b;
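The transformation above is directly executable C: the flag is the statically instantiated dataflow fact ("a*b is available"), evaluated at run time. A minimal self-contained sketch (function name illustrative):

```c
#include <stdbool.h>

/* SIDE-style availability: instead of conservatively always
   recomputing a*b, record dynamically whether it is available. */
static int use_product(int a, int b, bool x) {
    bool flag = false;   /* "a*b available" — dynamic evaluation */
    int  y    = 0;
    if (x) {
        y = a * b;
        flag = true;
    }
    /* reuse y when available, recompute only otherwise */
    return flag ? y : a * b;
}
```

Either path yields the same value; the win is that when x was true, the multiply (or, for register promotion, a memory access) is not repeated.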
SIDE Register Promotion Impact
[Charts: percentage reduction in dynamic stores (top, axis 0–30%, one clipped bar annotated 53) and loads (bottom, axis 0–45%) from SIDE register promotion and PRE, across Mediabench (adpcm, gsm, epic, mpeg2, jpeg, pegwit, g721, pgp, rasta, mesa) and Spec benchmarks (099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 183.equake, 188.ammp, 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 300.twolf).]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
Performance Evaluation

[Figure: ASH and a 4-way out-of-order (OOO) CPU are evaluated against the same memory hierarchy — a load-store queue (LSQ) with limited bandwidth, an 8K L1, a 1/4M L2, and main memory.]

Assumption: all operations have the same latency.
Media Kernels, vs. 4-way OOO

[Chart: speed-up of ASH over the 4-way OOO CPU ("times faster", axis 0–3×) on the media kernels (adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, rasta); clipped bars are annotated with their values.]
Media Kernels, IPC

[Chart: base CPU IPC vs. ASH IPC on the media kernels; the base CPU is bounded by its 4-way issue width, while ASH IPC reaches roughly 25.]
Speed-up / IPC Correlation

[Chart: for each media kernel, speed-up vs. IPC ratio ("times bigger", axis 0–10; one clipped bar annotated 12).]
Low-Level Evaluation

C → CASH core → Verilog back-end → Synopsys, Cadence place & route → ASIC

• Results shown so far come from the CASH core (all results are in the thesis).
• The next two slides use the ASIC flow: 180 nm standard-cell library, 2 V (~1999 technology).
Area

[Chart: synthesized area in mm² (axis 0–12) for the media kernels (adpcm_d, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e).]

Reference: a P4 in 180 nm is 217 mm².
Power

vs. a 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6 W.

[Chart: power ratio — times smaller than OOO:
adpcm_d 70, g721_d 41, g721_e 41, gsm_d 129, gsm_e 147, jpeg_d 94, mpeg2_d 121, mpeg2_e 136, pegwit_d 303, pegwit_e 303.]
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
Pipelining

    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;

[Animation, cycles 1–5: values flow through the loop circuit; the multiplier is pipelined (8 stages), so successive iterations (i=0, i=1, …) are in flight simultaneously — motivating pipeline balancing.]
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
This Is Obvious!

ASH runs at full dataflow speed, so a CPU cannot do any better (assuming equally good compilers).
SpecInt95, ASH vs. 4-way OOO

[Chart: percent slower/faster (axis −50% to +30%) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex.]
Branch Prediction

    for (i = 0; i < N; i++) {
        ...
        if (exception) break;
    }

[Circuit: i flows through +1 and < N, combined with !exception. The exit branch is predicted not taken — effectively a no-op for the CPU — while the back edge is predicted taken. The CPU's critical path is thus shorter than ASH's: the CPU's result is available before ASH's inputs.]
SpecInt95, Perfect Prediction

[Chart: percent slower/faster (axis −60% to +60%) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, comparing baseline vs. perfect prediction; no data for one configuration.]
ASH Problems

• Both branch and join are not free
• Static dataflow (no re-issue of the same instruction)
• Memory is "far"
• Fully static:
  – no branch prediction
  – no dynamic unrolling
  – no register renaming
• Calls/returns are not lenient
• ...
Thesis Statement

Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, at very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
Outline
+ Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
Strengths

ASH:
• low power
• simple verification?
• specialized to the application
• unlimited ILP
• simple hardware
• no fixed window

CPU:
• economies of scale
• highly optimized
• branch prediction
• control speculation
• full dataflow
• global signals/decisions
Conclusions
• Compiling “around the ISA” is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.
Backup Slides

• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic critical path
• Memory PRE
• Critical path analysis
• CPU + ASH
Control Logic

[Figure: control circuit built from C-elements and a register, with handshake signals rdy_in/ack_in and rdy_out/ack_out, and data_in/data_out.]
Last-Arrival Events

[Figure: a functional unit (+) with data, valid, and ack signals.]

• The event enabling the generation of a result
• May be an ack
• Critical path = the collection of last-arrival edges
Dynamic Critical Path

1. Start from the last node.
2. Trace back along last-arrival edges.
3. Some edges may repeat.
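The trace-back above is simple enough to sketch in C: each dynamic event records which incoming edge enabled it (its last-arrival edge), and the critical path is recovered by walking those pointers back from the final event. The event structure and ids here are illustrative, not the thesis's actual trace format.

```c
#include <stddef.h>

/* One dynamic event; last_arrival is the event whose edge enabled it. */
struct event {
    int id;
    struct event *last_arrival;
};

/* Walk back from the last event, recording ids newest-first into out[].
   The max bound guards the walk, since edges may repeat in a trace. */
static size_t critical_path(struct event *last, int *out, size_t max) {
    size_t n = 0;
    for (struct event *e = last; e != NULL && n < max; e = e->last_arrival)
        out[n++] = e->id;
    return n;
}
```

Applied to a three-event chain, the walk yields the events in reverse order of occurrence.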
Critical Paths

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: the speculative circuit from before, with the dynamic critical path highlighted.]
Lenient Operations

    if (x > 0)
        y = -x;
    else
        y = b * x;

[Circuit: the multiplexer produces y as soon as the selected input is known.]

Lenient operations solve the problem of unbalanced paths.
Pipelining (continued)

[Animation, cycles 6–7: with the long-latency 8-stage multiplier, i's loop (short latency) tries to run ahead of sum's loop; the predicate ack edge ends up on the critical path, throttling i's loop.]
Pipeline Balancing

[Figure: a decoupling FIFO inserted on the predicate path lets i's loop run ahead of sum's loop, moving the critical path off the predicate ack edge.]
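The decoupling FIFO can be modeled in software: the fast producer loop (i) runs ahead, depositing operands into a bounded FIFO, while the slow consumer loop (sum, behind the long-latency multiplier) drains it. This is an editorial sketch of the idea; the FIFO size and names are illustrative, not taken from the thesis.

```c
/* Bounded decoupling FIFO between a fast producer and a slow consumer. */
#define FIFO_CAP 8

struct fifo { int buf[FIFO_CAP]; int head, tail, count; };

static void fifo_push(struct fifo *f, int v) {
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
}

static int fifo_pop(struct fifo *f) {
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return v;
}

/* sum of i*i for i in [0,100), with the i-loop and sum-loop decoupled. */
static int sum_squares(void) {
    struct fifo f = {{0}, 0, 0, 0};
    int sum = 0, produced = 0, consumed = 0;
    while (consumed < 100) {
        /* producer (i's loop) runs ahead while the FIFO has room */
        while (produced < 100 && f.count < FIFO_CAP)
            fifo_push(&f, produced++);
        /* consumer (sum's loop, the long-latency multiply) drains it */
        while (f.count > 0) {
            int i = fifo_pop(&f);
            sum += i * i;
            consumed++;
        }
    }
    return sum;  /* 328350 */
}
```

The result matches the unbuffered loop; only the schedule changes — which is exactly the point of pipeline balancing.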
Register Promotion

    Before:  *p = … (p1);   … = *p (p2);
    After:   *p = … (p1);   … = *p (p2 ∧ ¬p1);

The load is executed only if the store is not.
Register Promotion (2)

    Before:  *p = … (p1);   … = *p (p2);
    After:   *p = … (p1);   … = *p (false);

• When p2 ⇒ p1, the load becomes dead...
• ...i.e., when the store dominates the load in the CFG.
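In CFG terms the dead-load case looks like this in plain C: when the store dominates the load, the load can read the promoted (forwarded) value instead of memory. A before/after sketch with illustrative names:

```c
/* Before promotion: a load from *p even though a store to *p
   dominates it (p2 implies p1, so the load is dead). */
static int load_before(int *p, int v) {
    *p = v;        /* store, predicate p1 */
    return *p;     /* load, predicate p2  */
}

/* After promotion: the store is kept (memory state must be
   preserved), but the load is replaced by the forwarded value. */
static int load_after(int *p, int v) {
    *p = v;
    return v;      /* load eliminated */
}
```

Both versions return the same value and leave memory in the same state; the promoted version performs one memory access instead of two.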
≈ PRE

    … = *p (p1);   … = *p (p2);   →   … = *p (p1 ∨ p2)

This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
Store-Store (1)

    Before:  *p = … (p1);        *p = … (p2);
    After:   *p = … (p1 ∧ ¬p2);  *p = … (p2);

• When p1 ⇒ p2, the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG.
Store-Store (2)

    Before:  *p = … (p1);        *p = … (p2);
    After:   *p = … (p1 ∧ ¬p2);  *p = … (p2);

• The token edge is eliminated, but...
• ...the transitive closure of tokens is preserved.
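The dead-store case, written out in C (names illustrative): when the second store post-dominates the first, the first store never determines the final memory state and can be removed.

```c
/* Before elimination: the first store is post-dominated by the
   second (p1 implies p2), so its value never survives. */
static void store_before(int *p, int a, int b) {
    *p = a;   /* first store (p1): dead */
    *p = b;   /* second store (p2)      */
}

/* After elimination: first store removed; final state unchanged. */
static void store_after(int *p, int a, int b) {
    (void)a;  /* first store's operand is no longer needed */
    *p = b;
}
```

Memory ends in the same state either way, which is what makes the transformation sound.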
A Code Fragment

    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }

SpecINT95, 124.m88ksim: init_processor, stylized
Dynamic Critical Path

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: the dynamic critical path runs through the load predicate, the loop predicate, and the sizeof(X[j]) address computation.]
MIPS gcc Code

    LOOP:
    L1: beq   $v0,$a1,EXIT   ; X[j].r == i
    L2: addiu $v1,$v1,20     ; &X[j+1].r
    L3: lw    $v0,0($v1)     ; X[j+1].r
    L4: addiu $a0,$a0,1      ; j++
    L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
    EXIT:

L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence.

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
If Branch Prediction Correct

L1 → L2 → L3 → L5 → L1: the superscalar is issue-limited — 2 cycles/iteration sustained.

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Same MIPS loop as on the previous slide.]
Critical Path with Prediction

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: the loads are not speculative, so they remain on the critical path.]
Prediction + Load Speculation

    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;

[Figure: ~4 cycles/iteration; the load is not pipelined because of a self-anti-dependence (ack edge).]
OOO Pipe Snapshot

[Figure: pipeline occupancy (IF, DA, EX, WB, CT stages) of instructions L1–L5 of the MIPS loop across cycles, with register renaming.]
Unrolling?

    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            if (X[j+1].r == 0xF) break;
            if (X[j+1].r == i) break;
        }
        Y[i] = X[j].q;
    }

[Annotation: "when 1 iteration" marks the problematic case.]
Ideal Architecture

[Figure: a hybrid system sharing one memory — the CPU handles low-ILP computation, the OS, and virtual memory; ASH handles high-ILP computation.]