Page 1: Séminaire COSI ’01

Séminaire COSI-Roscoff’01 1

Séminaire COSI ’01

Power Driven Processor Array Partitioning for FPGA SoC

S. Derrien, S. Rajopadhye

Page 2: Séminaire COSI ’01

Content
  Context and motivations
    Silicon compilation tools
    Target architectures
    Power consumption
    Related work
  Partitioning
  Modeling power
  Experimental results
  Conclusion

Page 3: Séminaire COSI ’01

Silicon compilation tools
Parallel processor array architectures
  Regular and scalable (well suited to FPGAs)
  Specialized high-performance data-path
Restricted class of loops
  SUREs (uniform dependencies)
  Static polyhedral loop domain
Compute intensive nested loops
  Image processing (motion estimation, stereo vision)
  Signal processing (QR factorization, DLMS)

Page 4: Séminaire COSI ’01

Power consumption
General model and motivations
  P=Pstat+Vdd.Cd.Df (gate level model)
  Estimate at RTL level (entropy based models)
Mainly dictated by:
  On-chip area cost and activity
  Off-chip I/O volume
System level power model?
  Estimate from specs and target architecture

Page 5: Séminaire COSI ’01

Target architecture
[Figure: block diagram connecting the FPGA, the CPU, the system memory, and the external world]
Embedded CPU
  Power PC
  NIOS
SoC bus
  Amba, Coreconnect
  Plug 'n play IP cores
Shared memory
  Low latency
  High bandwidth

Page 6: Séminaire COSI ’01

Related work
Compiler transformations to reduce memory accesses [Kandemir]
  Loop fusion
  Loop tiling
  Loop reordering
Design space exploration for custom memory systems [Imec]
  Systematic exploration
  Multi-level memory hierarchy
  The approach is brute force

Page 7: Séminaire COSI ’01

Content
  Context and motivations
  Target architectures
  Partitioning
    Clustering (LSGP)
    Tiling (LPGS)
    Co-partitioning
  Modeling power
  Experimental results
  Conclusion

Page 8: Séminaire COSI ’01

Tiling (LPGS)
Partition the PE array into tiles
  Tiles are executed sequentially
  Intermediate results are stored in off-chip memory
  Requires unidirectional communications
Tile shape is rectangular
  Bounds parallel to the PE space basis vectors
  Perfect « tiling » of processor space

Page 9: Séminaire COSI ’01

Tiling (LPGS)

[Figure: PE array partitioned into tiles, with Mux/DeMux stages and FIFOs around the tile; tile size parameters 2 and 3; the tiling matrix is diagonal with |det| = Npe; H is the domain height]

Page 10: Séminaire COSI ’01

Clustering (LSGP)
Regroups PEs into clusters
  Operations executed sequentially
  I/O accesses reduced
Cluster shape is rectangular
  Bounds parallel to the PE space basis vectors
  Perfect « tiling » of processor space
Scheduling is axes-major
  Several possible schedulings
  Sequence of clusterings along each axis
  Simplifies control logic

Page 11: Séminaire COSI ’01

Clustering (LSGP)

[Figure: PE array grouped into clusters, with clustering parameters 2 and 3 along the processor axes; the clustering matrix is diagonal with |det| = Npe; labels: PE index vector, iteration index vector, original space-time mapping, domain height H]

Page 12: Séminaire COSI ’01

Clustering (LSGP)

[Figure: multiply-add PE datapath with operands A, B, C, shown for the original PE, for clustering x=2, and for clustering x=2, y=3, with the resource usage estimate for each case]

Page 13: Séminaire COSI ’01

Hybrid-partitioning
Step 1: array is tiled
  Tune the I/O volume
Step 2: tile is clustered
  Tune the resource usage
Trade-off
  Off-chip I/O volume
  Local memory sizes
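
The two partitioning steps can be illustrated along a single processor axis. The sketch below is a minimal illustration and not the authors' tool: the names `place`, `tile_size`, and `cluster_size` are hypothetical, and it assumes rectangular tiles and clusters whose sizes divide evenly. A virtual PE index is first assigned to a tile (tiles run one after another), then to a physical PE and a sequential slot inside that PE.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    tile: int  # sequential tile pass that handles the virtual PE (tiling step)
    pe: int    # physical PE inside the tile
    slot: int  # sequential slot inside that physical PE (clustering step)


def place(virtual_pe: int, tile_size: int, cluster_size: int) -> Placement:
    """Map a virtual PE index along one axis to (tile, physical PE, slot).

    Hypothetical helper: tile_size is counted in virtual PEs and must be a
    multiple of cluster_size so that clusters tile the tile perfectly.
    """
    if tile_size % cluster_size != 0:
        raise ValueError("tile_size must be a multiple of cluster_size")
    tile, offset = divmod(virtual_pe, tile_size)  # step 1: tiling
    pe, slot = divmod(offset, cluster_size)       # step 2: clustering
    return Placement(tile, pe, slot)


def physical_pes(tile_size: int, cluster_size: int) -> int:
    """Number of physical PEs needed along this axis."""
    return tile_size // cluster_size


if __name__ == "__main__":
    for v in range(12):
        print(v, place(v, tile_size=6, cluster_size=2))
```

Growing `tile_size` reduces the number of tile passes (and with it the off-chip traffic), while growing `cluster_size` reduces the number of physical PEs at the cost of more local storage per PE, which is the trade-off named above.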

Page 14: Séminaire COSI ’01

Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling power
    IO power model
    Core power model
    Putting it all together
  Experimental results
  Conclusion

Page 15: Séminaire COSI ’01

Dynamic IO energy model
IO energy depends on
  IO volume (RAM clock speed)
  Operation (Rd, Wr)
  Port toggle rate
Eio = Krd.Vrd + Kwr.Vwr
  Krd, Kwr: technological constants
  Vrd, Vwr: number of read and write I/O operations
Determine IO volume
  For all loop variables
  Given tiling parameters
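
The IO energy expression Eio = Krd.Vrd + Kwr.Vwr is simple enough to evaluate directly. The sketch below is only an illustration: the function name `io_energy` and the numeric constants in the usage example are hypothetical placeholders, not measured values from the talk.

```python
def io_energy(v_rd: float, v_wr: float, k_rd: float, k_wr: float) -> float:
    """Dynamic IO energy: Eio = Krd * Vrd + Kwr * Vwr.

    v_rd, v_wr: number of read and write I/O operations.
    k_rd, k_wr: technological constants (energy per read / per write).
    """
    return k_rd * v_rd + k_wr * v_wr


if __name__ == "__main__":
    # Placeholder constants chosen for illustration only.
    print(io_energy(v_rd=1000, v_wr=500, k_rd=2.0, k_wr=3.0))
```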

Page 16: Séminaire COSI ’01

IO volume estimate (1/2)
Tile IO volume is called « footprint »
Estimate for this footprint [Arg95]
Spread vector of dependencies
  $V = \sum_{i=1}^{n} \det A_{i \to a}$
  $A_{i \to a}$: A with its ith row substituted with the spread vector a
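
As a rough illustration of the footprint estimate, the sketch below sums, over every row i, the determinant of the tiling matrix with its ith row substituted with the spread vector. This is a reading of the slide, not a confirmed formula: taking the absolute value of each determinant is an assumption, and the names `det` and `footprint_estimate` are hypothetical. The determinant is computed in pure Python so the block has no dependencies.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total


def footprint_estimate(tile, spread):
    """Sum over i of |det(tile with row i replaced by the spread vector)|.

    The absolute value is an assumption made for this sketch.
    """
    n = len(tile)
    total = 0
    for i in range(n):
        substituted = [list(spread) if r == i else list(tile[r]) for r in range(n)]
        total += abs(det(substituted))
    return total


if __name__ == "__main__":
    tile = [[2, 0, 0], [0, 3, 0], [0, 0, 4]]  # diagonal (rectangular) tile
    print(footprint_estimate(tile, [1, 0, 0]))
    print(footprint_estimate(tile, [1, 1, 0]))
```

For a diagonal tiling matrix, each term reduces to the ith spread component times the product of the other tile sizes, so the estimate can be read as the area of the tile faces crossed by the dependence.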

Page 17: Séminaire COSI ’01

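IO volume estimate (2/2)
Total tile IO volume:
[Equation garbled in the transcript: Vio is a sum over the variables (index k) and the dimensions (index i) of a product over the dimensions (index j), weighted by the byte width l_k]
  v: number of variables
  l_k: kth variable byte width
  a: spread vector
  Tile size parameter
Example:
  dA=[1 0 0] aA=[1 0 0] lA=2 VA= 2.H.1
  dB=[0 1 0] aB=[1 0 0] lB=2 VB= 2.H.
  dC=[0 0 1] aC=[1 0 0] lC=4 VC=
[Figure: tile in (i, j, k) space with variables A, B, C, showing the tile input data dependencies and the tile output data dependencies]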

Page 18: Séminaire COSI ’01

Core power model (1/4)
FPGA power dissipation model
  Pcore = Pstat + Kc.Dlc.nlc.f
    Kc: technology constant
    Dlc: average toggle rate
    nlc: number of logic cells
    f: design operating frequency
Not suited to our target FPGA architecture
  Distinction between LCs (memory and logic)
  Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
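
The extended core power model can be written as a small function. This is a sketch under stated assumptions: the function name `core_power` and the values in the usage example are hypothetical, and Km, Dm, nm are taken to be the memory counterparts of Kc, Dlc, nlc (technology constant, average toggle rate, number of logic cells used as memory).

```python
def core_power(p_stat: float,
               k_c: float, d_lc: float, n_lc: float,
               k_m: float, d_m: float, n_m: float,
               f: float) -> float:
    """Pcore = Pstat + Kc*Dlc*nlc*f + Km*Dm*nm*f.

    Logic cells used as logic and logic cells used as memory each get their
    own technology constant, toggle rate, and cell count.
    """
    logic_term = k_c * d_lc * n_lc * f
    memory_term = k_m * d_m * n_m * f
    return p_stat + logic_term + memory_term


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(core_power(10, 2, 0.5, 100, 1, 0.25, 40, 3))
```

Setting `n_m` to zero recovers the first, logic-only form of the model.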

Page 19: Séminaire COSI ’01

Core power model (2/4)
Control logic is not modeled
  Too complex to estimate
  No significant contribution to power
Core power depends on
  Number of PEs: depends on the partitioning parameters (symbols lost in the transcript)
  Area usage for each PE: depends on the partitioning parameters (symbol lost in the transcript)
  Average toggle rate for PE datapath and local memory (application constant)

Page 20: Séminaire COSI ’01

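Core power model (3/4)
Memory resource usage
  LCs used as distributed memory (16x1 bits)
  Datapath is a design constant (library based)
Area cost for a PE array
  $A_d = n_p \cdot A_f$
  [Remaining equations garbled in the transcript: the number of PEs n_p is a ratio of two determinants, and the memory area is a sum over k = 0..p-1 divided by 16, the number of bits per LC]
  n_p: number of PEs
  A_f: datapath functional cost
  Clustering parameter along processor space j
  Register width along processor space k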

Page 21: Séminaire COSI ’01

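Core power model (4/4)
Energy cost for the whole loop nest
  We have Ec = Pc.ncycle.Tcycle
  We will consider ncycle = Vcalc/np
    Vcalc: total loop computation volume
Total core energy cost
  [Equation garbled in the transcript: Ecore is the sum of a datapath term in Kf, Vcalc, Df, Af and a memory term in Km, Vcalc, Dm and the memory area sum over k = 0..p-1, divided by 16]
  Df, Dm: average toggle rates
Energy is not dependent on np!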
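
The claim that the energy does not depend on np follows from the two relations Ec = Pc.ncycle.Tcycle and ncycle = Vcalc/np: the dynamic power grows linearly with np, while the cycle count shrinks by the same factor. The sketch below checks this numerically. It is a simplified illustration that assumes a single kind of logic cell, a fixed area per PE, Tcycle = 1/f, and ignores the static term Pstat; all names and values are hypothetical.

```python
import math


def dynamic_core_energy(n_p: int, v_calc: float, area_per_pe: float,
                        k: float, d: float, f: float) -> float:
    """Dynamic core energy for the whole loop nest (static power ignored)."""
    p_dyn = k * d * (n_p * area_per_pe) * f  # power scales with the PE count
    n_cycle = v_calc / n_p                   # more PEs means fewer cycles
    t_cycle = 1.0 / f                        # assumed cycle time
    return p_dyn * n_cycle * t_cycle


if __name__ == "__main__":
    for n_p in (1, 4, 16, 64):
        print(n_p, dynamic_core_energy(n_p, v_calc=1e6, area_per_pe=120,
                                       k=2e-9, d=0.3, f=50e6))
```

Every line printed carries the same energy, equal to k * d * area_per_pe * v_calc. In this simplified view, any dependence on the partitioning enters only through the area per PE, for example through local memory, which the sketch holds fixed.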

Page 22: Séminaire COSI ’01

Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling power
  Experimental results
    Model validation
    Extrapolations
  Conclusion

Page 23: Séminaire COSI ’01

IO power model results

[Figure: four surface plots over the parameters x and y: observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW), and relative error (%)]

Page 24: Séminaire COSI ’01

Core power model results

[Figure: four surface plots over the parameters x and y: observed core power (mW), predicted core power (mW), absolute error (mW), and relative error (%)]

Page 25: Séminaire COSI ’01

System power model

[Figure: four surface plots over the parameters x and y: predicted total energy dissipation (loop execution energy cost, J), observed total energy dissipation (loop execution energy cost, J), energy dissipation absolute error (J), and energy dissipation relative error (%)]

Page 26: Séminaire COSI ’01

Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling power
  Experimental results
  Conclusion
    Solving the optimisation problem (Lagrange multipliers)
    Custom cache for embedded CPUs
    Extension to SAREs (affine dependencies)

Page 27: Séminaire COSI ’01

Conclusion
Model matches experiments
  Cheap measurement setup
  Many components contribute to current dissipation (LEDs, PCI, etc.)
Observations
  Trade-off evolves with technology
  More sensitive for ASICs?

Page 28: Séminaire COSI ’01

Future work (1/2)
Formulation of the optimization problem
  Minimize energy/iteration
  Constraints on performance and area
Analytical solution?
  Lagrange multipliers
  No closed form for n>3, but fast numerical methods

Page 29: Séminaire COSI ’01

Future work (2/2)
Model for embedded CPUs
  Trade-off between cache size and memory accesses
  Determine optimal cache size and associated tiling parameters
Extension to SAREs?
  Affine dependencies
  More general loops

