Séminaire COSI-Roscoff ’01

Séminaire COSI ’01

Power Driven Processor Array Partitioning for FPGA SoC

S. Derrien, S. Rajopadhye


Content
  Context and motivations
    Silicon compilation tools
    Target architectures
    Power consumption
    Related work
  Partitioning
  Modeling Power
  Experimental results
  Conclusion


Silicon compilation tools
  Parallel processor array architectures
    Regular and scalable (well suited to FPGAs)
    Specialized high-performance data-path
  Restricted class of loops
    SUREs (uniform dependencies)
    Static polyhedral loop domain
  Compute intensive nested loops
    Image processing (motion estimation, stereo vision)
    Signal processing (QR factorization, DLMS)


Power consumption
  General model and motivations
    P = Pstat + Vdd.Cd.Df (gate-level model)
    Estimate at RTL level (entropy-based models)
  Mainly dictated by:
    On-chip area cost and activity
    Off-chip I/O volume
  System-level power model?
    Estimate from specs and target architecture


Target architecture
  [Figure: block diagram connecting the FPGA, the CPU, the system memory and the external world]
  Embedded CPU
    Power PC
    NIOS
  SoC bus
    Amba, Coreconnect
    Plug ’n play IP cores
  Shared memory
    Low latency
    High bandwidth


Related Work
  Compiler transformations to reduce memory accesses [Kandemir]
    Loop fusion
    Loop tiling
    Loop reordering
  Design space exploration for custom memory systems [Imec]
    Systematic exploration
    Multi-level memory hierarchy
    The approach is brute force


Content
  Context and motivations
  Target architectures
  Partitioning
    Clustering (LSGP)
    Tiling (LPGS)
    Co-partitioning
  Modeling Power
  Experimental results
  Conclusion


Tiling (LPGS)
  Partition the PE array into tiles
    Tiles are executed sequentially
    Intermediate results are stored in off-chip memory
    Requires unidirectional communications
  Tile shape is rectangular
    Bounds parallel to the PE space basis vectors
    Perfect « tiling » of the processor space
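As a rough illustration of the tiling scheme on this slide, the sketch below maps a virtual PE index to the tile that contains it and to the physical PE that executes it. The function names, the tuple-based indexing and the lexicographic tile order are assumptions made for the example; the slides do not specify them.

```python
from itertools import product


def lpgs_map(virtual_pe, tile_shape):
    """Map a virtual PE index to (tile index, physical PE index).

    Assumes rectangular tiles whose bounds are parallel to the PE space axes.
    """
    tile = tuple(v // s for v, s in zip(virtual_pe, tile_shape))
    physical = tuple(v % s for v, s in zip(virtual_pe, tile_shape))
    return tile, physical


def lpgs_tile_order(array_shape, tile_shape):
    """List the tile indices in an assumed lexicographic execution order."""
    if any(n % s for n, s in zip(array_shape, tile_shape)):
        # A perfect tiling needs the tile shape to divide the array shape.
        raise ValueError("tile shape must divide the array shape")
    counts = [n // s for n, s in zip(array_shape, tile_shape)]
    return list(product(*(range(c) for c in counts)))
```

With a 4 x 6 virtual array and 2 x 3 tiles, `lpgs_tile_order((4, 6), (2, 3))` lists four tiles, each executed in turn on the same 6 physical PEs.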


Tiling (LPGS)
  [Figure: PE array partitioned into tiles of size 2 and 3 along the two processor axes, together with Mux, DeMux and FIFO blocks. The tiling matrix is diagonal, with |det| = Npe. The space-time mapping matrix has entries 0, 1 and H, where H is the domain height.]


Clustering (LSGP)
  Regroup PEs into clusters
    Operations are executed sequentially
    I/O accesses are reduced
  Cluster shape is rectangular
    Bounds parallel to the PE space basis vectors
    Perfect « tiling » of the processor space
  Scheduling is axes-major
    Several possible schedulings
    Sequence of clusterings along each axis
    Simplifies control logic
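The clustering scheme can be sketched in the same style: each virtual PE is assigned to a physical PE and to a sequential slot inside that PE. The function name and the choice of axis 0 as the fastest-varying axis in the axes-major order are assumptions for this example; the slide notes that several schedulings are possible.

```python
def lsgp_map(virtual_pe, cluster_shape):
    """Map a virtual PE index to (physical PE index, sequential slot).

    Assumes rectangular clusters aligned with the PE space axes and an
    axes-major slot order in which axis 0 varies fastest.
    """
    physical = tuple(v // c for v, c in zip(virtual_pe, cluster_shape))
    local = tuple(v % c for v, c in zip(virtual_pe, cluster_shape))
    slot, stride = 0, 1
    for offset, size in zip(local, cluster_shape):
        slot += offset * stride  # position along this axis
        stride *= size           # the next axis advances once per full sweep
    return physical, slot
```

For a 2 x 3 cluster, the six virtual PEs of one cluster receive the slots 0 to 5, so one physical PE executes them one after another.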


Clustering (LSGP)
  [Figure: PE array grouped into clusters of size 2 and 3 along the two processor axes. The clustering matrix is diagonal, with |det| = Npe. The original space-time mapping (matrix with entries 0, 1 and H) relates the iteration index vector to the PE index vector.]


Clustering (LSGP)
  [Figure: PE datapath (multiply-add with registers A, B and C) shown for the original PE, for x=2, and for x=2, y=3]
  Resource usage estimate :


Hybrid partitioning
  Step 1: the array is tiled
    Tunes the I/O volume
  Step 2: each tile is clustered
    Tunes the resource usage
  Trade-off
    Off-chip I/O volume
    Local memory sizes
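A minimal sketch of the two-step scheme, assuming rectangular tiles and clusters and a cluster shape that divides the tile shape: the virtual PE is first located in a tile, then the position inside the tile is split into a physical PE and a local sequential offset. The names `hybrid_map` and `physical_pe_count` are hypothetical.

```python
def hybrid_map(virtual_pe, tile_shape, cluster_shape):
    """Return (tile index, physical PE index, local offset) for a virtual PE.

    tile_shape is given in virtual PEs; cluster_shape must divide it.
    """
    if any(t % c for t, c in zip(tile_shape, cluster_shape)):
        raise ValueError("cluster shape must divide the tile shape")
    tile, physical, local = [], [], []
    for v, t, c in zip(virtual_pe, tile_shape, cluster_shape):
        tile.append(v // t)            # step 1: which tile (sequential)
        within = v % t                 # position inside the tile
        physical.append(within // c)   # step 2: which physical PE
        local.append(within % c)       # sequential offset inside the cluster
    return tuple(tile), tuple(physical), tuple(local)


def physical_pe_count(tile_shape, cluster_shape):
    """Number of physical PEs needed for one clustered tile."""
    count = 1
    for t, c in zip(tile_shape, cluster_shape):
        count *= t // c
    return count
```

Larger tiles reduce the number of tile boundaries and therefore the off-chip traffic, while larger clusters reduce the PE count but increase the local storage per PE, which is the trade-off named on the slide.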


Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling Power
    IO power model
    Core power model
    Putting it all together
  Experimental results
  Conclusion


Dynamic IO Energy model
  IO energy depends on
    IO volume (RAM clock speed)
    Operation (Rd, Wr)
    Port toggle rate
  Eio = Krd.Vrd + Kwr.Vwr
    Vrd, Vwr: number of read and write I/O operations
    Krd, Kwr: technological constants
  Determine IO volume
    For all loop variables
    Given tiling parameters
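The IO energy formula can be evaluated directly once the read and write volumes are known. The sketch below is only a transcription of Eio = Krd.Vrd + Kwr.Vwr; the function name and the numeric values in the usage note are placeholders, not measured constants.

```python
def io_energy(k_rd, v_rd, k_wr, v_wr):
    """Dynamic IO energy: Eio = Krd.Vrd + Kwr.Vwr.

    k_rd, k_wr: energy per read / write operation (technological constants)
    v_rd, v_wr: number of read / write I/O operations
    """
    return k_rd * v_rd + k_wr * v_wr
```

For instance, `io_energy(2.0, 10, 3.0, 5)` evaluates to 35.0 in whatever energy unit the constants are expressed in.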


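IO Volume estimate (1/2)
  Tile IO volume is called the « footprint »
  Estimate for this footprint [Arg95], from the spread vector a of the dependencies:
  \[ V = \sum_{i=1}^{n} \left| \det A_{i \leftarrow a} \right| \]
  A_{i \leftarrow a}: A with its ith row substituted with the spread vector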


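IO Volume estimate (2/2)
  Total tile IO volume:
  \[ V_{io} = \sum_{k=0}^{v} l_k \sum_{i=1}^{n} \prod_{j=1}^{n} \left( (1-\delta_{i,j})\,\lambda_j + \delta_{i,j}\,a_{k,j} \right) \]
    l_k: kth variable byte width
    v: number of variables
    \lambda_j: tile size parameter
    a_k: spread vector
    \delta_{i,j}: Kronecker delta
  Example:
    dA=[1 0 0] aA=[1 0 0] lA=2 VA= 2.H.1
    dB=[0 1 0] aB=[1 0 0] lB=2 VB= 2.H.
    dC=[0 0 1] aC=[1 0 0] lC=4 VC=
  [Figure: tile in the (i, j, k) iteration space with the dependencies of A, B and C, marking tile input data and tile output data]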
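For a diagonal tile matrix, the determinant obtained by substituting the ith row with the spread vector reduces to the ith spread component times the product of the other tile sizes, so the footprint estimate can be computed without any linear algebra library. The sketch below assumes that diagonal case; the function names and the (byte width, spread vector) pair representation are choices made for the example.

```python
from math import prod


def footprint(tile_sizes, spread):
    """Sum over i of |det(A with row i replaced by spread)|, A = diag(tile_sizes)."""
    n = len(tile_sizes)
    total = 0
    for i in range(n):
        # Replacing row i of a diagonal matrix keeps the other diagonal entries.
        others = prod(tile_sizes[j] for j in range(n) if j != i)
        total += abs(spread[i]) * others
    return total


def tile_io_volume(tile_sizes, variables):
    """Total tile IO volume in bytes.

    variables: iterable of (byte_width, spread_vector) pairs, one per variable.
    """
    return sum(width * footprint(tile_sizes, spread) for width, spread in variables)
```

With tile sizes (2, 3, 4), a 2-byte variable with spread [1 0 0] contributes 2 x 3 x 4 = 24 bytes, since only the two tile sizes orthogonal to the spread direction enter the product.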


Core power model (1/4)
  FPGA power dissipation model
    Pcore = Pstat + Kc.Dlc.nlc.f
      Kc: technology constant
      Dlc: average toggle rate
      nlc: number of logic cells
      f: design operating frequency
  Not suited to our target FPGA architecture
    Distinction between LCs (memory and logic)
    Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
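The extended core power formula, with separate terms for logic cells used as logic and as memory, can be written as a small function. This is a direct transcription of Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f; the argument names are assumptions and the constants would have to be fitted for a given device.

```python
def core_power(p_stat, k_c, d_lc, n_lc, k_m, d_m, n_m, f):
    """Core power: static part plus logic and memory dynamic parts.

    k_c, k_m: technology constants for logic and memory cells
    d_lc, d_m: average toggle rates
    n_lc, n_m: number of logic cells used as logic / as memory
    f: design operating frequency
    """
    logic = k_c * d_lc * n_lc * f
    memory = k_m * d_m * n_m * f
    return p_stat + logic + memory
```

Setting `k_m` or `n_m` to zero gives back the single-term model quoted first on the slide.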


Core power model (2/4)
  Control logic is not modeled
    Too complex to estimate
    No significant contribution to power
  Core power depends on
    Number of PEs: depends on the partitioning parameters
    Area usage for each PE: depends on the clustering parameters
    Average toggle rate for the PE datapath and local memory (application constant)


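Core power model (3/4)
  Memory resource usage
    LCs used as distributed memory (16x1 bits)
    Datapath is a design constant (library based)
  Area cost for a PE array
    \[ A_d = n_p \, A_f \]
      A_f: datapath functional cost
      n_p: number of PEs, obtained from the determinants of the partitioning matrices
    A_m: memory cost, proportional to n_p; a sum over the processor space axes k = 0 to p-1 that combines the register width along axis k with the clustering parameters along the axes j, counted in 16-bit LC memories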


Core power model (4/4)
  Energy cost for the whole loop nest
    We have Ec = Pc.ncycle.Tcycle
    We will consider ncycle = Vcalc/np
  Total core energy cost (dynamic part)
    \[ E_{core} = K_f D_f V_{calc} A_f + K_m D_m V_{calc} \frac{A_m}{n_p} \]
      V_{calc}: total loop computation volume
      D_f, D_m: average toggle rates
      A_m / n_p: memory cost per PE
  Energy is not dependent on np!
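The claim that the energy does not depend on np can be checked numerically from the two relations on this slide, Ec = Pc.ncycle.Tcycle and ncycle = Vcalc/np. The sketch below assumes that both the datapath area and the memory area scale linearly with np, that Tcycle = 1/f, and that static power is left out; the function and argument names are hypothetical.

```python
def core_energy(n_p, v_calc, f, k_f, d_f, a_f, k_m, d_m, a_m_per_pe):
    """Dynamic core energy for the whole loop nest.

    Assumes area (and thus dynamic power) grows linearly with n_p.
    """
    # Dynamic power of n_p identical PEs (datapath plus local memory).
    p_dyn = (k_f * d_f * n_p * a_f + k_m * d_m * n_p * a_m_per_pe) * f
    n_cycle = v_calc / n_p   # work is spread evenly over the PEs
    t_cycle = 1.0 / f        # one cycle at the operating frequency
    return p_dyn * n_cycle * t_cycle
```

Under these assumptions np cancels between the power and the cycle count, so calling `core_energy` with np = 1, 4 or 16 and otherwise identical arguments gives the same result up to rounding.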


Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling Power
  Experimental results
    Model validation
    Extrapolations
  Conclusion


IO power model results
  [Figure: four surface plots over x and y (5 to 25): observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW), relative error (%)]


Core power model results
  [Figure: four surface plots over x and y (5 to 25): predicted core power (mW), observed core power (mW), absolute error (mW), relative error (%)]


System power model
  [Figure: four surface plots over x and y (5 to 25): predicted total energy dissipation (loop execution energy cost, J), observed total energy dissipation (J), energy dissipation absolute error (J), energy dissipation relative error (%)]


Content
  Context and motivations
  Target architectures
  Partitioning
  Modeling Power
  Experimental results
  Conclusion
    Solving the optimisation problem (Lagrange multipliers)
    Custom cache for embedded CPUs
    Extension to SAREs (affine dependencies)


Conclusion
  Models match experiments
    Cheap measurement setup
    Many components contribute to current dissipation (LEDs, PCI, etc.)
  Observations
    Trade-off evolves with technology
    More sensitive for ASICs?


Future Work (1/2)
  Formulation of the optimization problem
    Minimize energy per iteration
    Constraints on performance and area
  Analytical solution?
    Lagrange multipliers
    No closed form for n>3, BUT fast numerical methods


Future Work (2/2)
  Model for embedded CPUs
    Trade-off between cache size and memory accesses
    Determine the optimal cache size and associated tiling parameters
  Extension to SAREs?
    Affine dependencies
    More general loops