Séminaire COSI-Roscoff’01 1
Séminaire COSI ’01
Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye
Content
  - Context and motivations
      - Silicon compilation tools
      - Target architectures
      - Power consumption
      - Related work
  - Partitioning
  - Modeling Power
  - Experimental results
  - Conclusion
Silicon compilation tools
  - Parallel processor array architectures
      - Regular and scalable (well suited to FPGAs)
      - Specialized high-performance data-path
  - Restricted class of loops
      - SUREs (uniform dependencies)
      - Static polyhedral loop domain
  - Compute intensive nested loops
      - Image processing (motion estimation, stereo vision)
      - Signal processing (QR factorization, DLMS)
Power consumption
  - General model and motivations
      - P = Pstat + Vdd.Cd.Df (gate level model)
      - Estimate at RTL level (entropy based models)
  - Mainly dictated by:
      - On-chip area cost and activity
      - Off-chip I/O volume
  - System level power model?
      - Estimate from specs and target architecture
Target architecture
  [Diagram: blocks labelled FPGA, CPU, System memory and Ext world]
  - Embedded CPU
      - PowerPC
      - NIOS
  - SoC bus
      - AMBA, CoreConnect
      - Plug 'n play IP cores
  - Shared memory
      - Low latency
      - High bandwidth
Related work
  - Compiler transformations to reduce memory accesses [Kandemir]
      - Loop fusion
      - Loop tiling
      - Loop reordering
  - Design space exploration for custom memory systems [Imec]
      - Systematic exploration
      - Multi-level memory hierarchy
      - The approach is brute force
Content
  - Context and motivations
  - Target architectures
  - Partitioning
      - Clustering (LSGP)
      - Tiling (LPGS)
      - Co-partitioning
  - Modeling Power
  - Experimental results
  - Conclusion
Tiling (LPGS)
  - Partition the PE array into tiles
      - Tiles are executed sequentially
      - Intermediate results are stored in off-chip memory
      - Requires unidirectional communications
  - Tile shape is rectangular
      - Boundaries parallel to the PE space basis vectors
      - Perfect "tiling" of the processor space
Tiling (LPGS)
  [Figure: PE array partitioned into tiles, with Mux, DeMux and FIFO blocks around the physical PEs. Two tile size parameters (symbols lost in extraction) are set to 2 and 3. The partitioning matrix is diagonal, with |det| = Npe; H is the domain height.]
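The tiling step can be illustrated with a minimal sketch, assuming a 2-D virtual PE array and rectangular tiles of 2 x 3 PEs. The function names `lpgs_map` and `tile_schedule` and the row-major tile order are assumptions made for illustration; they do not come from the slides.

```python
def lpgs_map(pe, tile_shape):
    """Map a virtual PE index to (tile index, physical PE index) under LPGS.

    A tile contains exactly as many virtual PEs as there are physical PEs,
    so the position inside the tile selects the physical PE.
    """
    tile = tuple(p // t for p, t in zip(pe, tile_shape))
    physical = tuple(p % t for p, t in zip(pe, tile_shape))
    return tile, physical


def tile_schedule(array_shape, tile_shape):
    """List tile indices in an assumed row-major sequential execution order."""
    counts = [-(-a // t) for a, t in zip(array_shape, tile_shape)]  # ceiling division
    order = [()]
    for count in counts:
        order = [prefix + (i,) for prefix in order for i in range(count)]
    return order


# Virtual PE (5, 4) with 2 x 3 tiles: tile (2, 1), physical PE (1, 1).
example_mapping = lpgs_map((5, 4), (2, 3))
# A 4 x 6 virtual array is covered by four tiles, executed one after another.
example_order = tile_schedule((4, 6), (2, 3))
```

With a 2 x 3 tile, the physical array has 6 PEs, which matches the constraint that the determinant of the diagonal partitioning matrix equals Npe.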
Clustering (LSGP)
  - Regroups PEs into clusters
      - Operations are executed sequentially
      - I/O accesses are reduced
  - Cluster shape is rectangular
      - Boundaries parallel to the PE space basis vectors
      - Perfect "tiling" of the processor space
  - Scheduling is axes-major
      - Several possible schedulings
      - Sequence of clusterings along each axis
      - Simplifies control logic
Clustering (LSGP)
  [Figure: PE array grouped into clusters, with parameter labels "y=3" and "y=2" (leading symbols lost in extraction). The partitioning matrix is diagonal, with |det| = Npe. Other labels: PE index vector, iteration index vector, original space-time mapping, domain height H.]
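Clustering (LSGP)
  [Figure: PE datapath with + and * operators and variables A, B, C, shown for three cases: original PE, x=2, and x=2, y=3]
  - Resource usage estimate: [formula not recoverable from the extraction]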
Hybrid-partitioning
  - Step 1: the array is tiled
      - Tunes the I/O volume
  - Step 2: each tile is clustered
      - Tunes the resource usage
  - Trade-off
      - Off-chip I/O volume
      - Local memory sizes
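The two-step hybrid partitioning can be sketched as follows. This is only an illustration under stated assumptions: tile sizes are expressed in virtual PEs, each tile dimension is a multiple of the cluster dimension, and the names `hybrid_map` and `physical_pe_count` are hypothetical.

```python
def hybrid_map(pe, tile_shape, cluster_shape):
    """Tile the virtual PE array first, then cluster the PEs of each tile.

    Returns (tile index, physical PE index, slot inside the cluster).
    Tiles run one after another; the slots of a cluster run sequentially
    on the same physical PE.
    """
    tile, physical, slot = [], [], []
    for p, t, c in zip(pe, tile_shape, cluster_shape):
        if t % c:
            raise ValueError("tile size must be a multiple of cluster size")
        tile.append(p // t)          # step 1: which tile
        in_tile = p % t
        physical.append(in_tile // c)  # step 2: which cluster, i.e. physical PE
        slot.append(in_tile % c)       # position inside the cluster
    return tuple(tile), tuple(physical), tuple(slot)


def physical_pe_count(tile_shape, cluster_shape):
    """Number of physical PEs needed for one clustered tile."""
    count = 1
    for t, c in zip(tile_shape, cluster_shape):
        count *= t // c
    return count


# 8 x 6 tiles clustered by 2 x 3: each tile needs 4 x 2 = 8 physical PEs.
example = hybrid_map((13, 7), (8, 6), (2, 3))
```

Growing the tile reduces how often intermediate results cross the off-chip boundary, while growing the cluster reduces the number of physical PEs at the price of larger local memories, which is the trade-off listed above.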
Content
  - Context and motivations
  - Target architectures
  - Partitioning
  - Modeling Power
      - IO power model
      - Core power model
      - Putting it all together
  - Experimental results
  - Conclusion
Dynamic IO energy model
  - IO energy depends on
      - IO volume (RAM clock speed)
      - Operation (Rd, Wr)
      - Port toggle rate
  - Eio = Krd.Vrd + Kwr.Vwr
      - Krd, Kwr: technological constants
      - Vrd, Vwr: number of read and write I/O operations
  - Determine the IO volume
      - For all loop variables
      - Given the tiling parameters
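A minimal sketch of the model Eio = Krd.Vrd + Kwr.Vwr. The function name and the numeric constants are hypothetical; real values of Krd and Kwr would have to be measured for the target memory.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Dynamic IO energy: Eio = Krd*Vrd + Kwr*Vwr.

    v_rd, v_wr: number of read and write I/O operations.
    k_rd, k_wr: technological constants (energy per read and per write).
    """
    return k_rd * v_rd + k_wr * v_wr


# Hypothetical constants, for illustration only (not measured values).
example = io_energy(v_rd=1000, v_wr=500, k_rd=2.0, k_wr=3.0)
```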
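IO volume estimate (1/2)
  - Tile IO volume is called the "footprint"
  - Estimate for this footprint [Arg95]
      - Uses the spread vector a of the dependencies
      - $V = \sum_{i=1}^{n} \left| \det A_{i \leftarrow a} \right|$
      - $A_{i \leftarrow a}$: matrix $A$ with its i-th row substituted with the spread vector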
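IO volume estimate (2/2)
  - Total tile IO volume (the slide's expanded form is garbled in extraction; written here with the determinant estimate of the footprint):
      $V_{io} = \sum_{k} l_k \sum_{i=1}^{n} \left| \det A_{i \leftarrow a_k} \right|$
      - The sum over k runs over the variables (the slide labels their number v)
      - $l_k$: byte width of the k-th variable
      - $a_k$: spread vector of the k-th variable
      - The slide's expanded form uses the tile size parameters (symbol lost in extraction)
  - Example:
      dA = [1 0 0]   aA = [1 0 0]   lA = 2   VA = 2.H.1
      dB = [0 1 0]   aB = [1 0 0]   lB = 2   VB = 2.H.
      dC = [0 0 1]   aC = [1 0 0]   lC = 4   VC =
  [Figure: a tile with its input and output data dependencies for variables A, B and C, along axes i, j and k]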
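The footprint estimate (substitute each row of the tile matrix with the spread vector, then sum the absolute determinants) can be sketched in a few lines. The tile matrix diag(2, 3, 10), the helper names and the byte widths in the usage are assumptions chosen for illustration, not values from the experiments.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total


def footprint_estimate(tile_matrix, spread):
    """Sum over i of |det(tile matrix with row i replaced by the spread vector)|."""
    total = 0
    for i in range(len(tile_matrix)):
        substituted = [list(row) for row in tile_matrix]
        substituted[i] = list(spread)
        total += abs(det(substituted))
    return total


def tile_io_volume(tile_matrix, variables):
    """Total tile IO volume in bytes; variables is a list of (byte width, spread vector)."""
    return sum(width * footprint_estimate(tile_matrix, spread)
               for width, spread in variables)


# Assumed diagonal tile matrix: tile sizes 2 and 3, domain height H = 10.
tile = [[2, 0, 0], [0, 3, 0], [0, 0, 10]]
variables = [(2, [1, 0, 0]), (2, [1, 0, 0]), (4, [1, 0, 0])]
total_bytes = tile_io_volume(tile, variables)
```

For a diagonal tile matrix, each determinant reduces to one spread component times the product of the other tile sizes, so only the nonzero components of the spread vector contribute.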
Core power model (1/4)
  - FPGA power dissipation model
      - Pcore = Pstat + Kc.Dlc.nlc.f
      - Kc: technology constant
      - Dlc: average toggle rate
      - nlc: number of logic cells
      - f: design operating frequency
  - Not suited to our target FPGA architecture
      - Distinction between LCs (memory and logic)
      - Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
Core power model (2/4)
  - Control logic is not modeled
      - Too complex to estimate
      - No significant contribution to power
  - Core power depends on
      - Number of PEs: depends on [symbols lost in extraction]
      - Area usage for each PE: depends on [symbol lost in extraction]
      - Average toggle rate for PE datapath and local memory (application constant)
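Core power model (3/4)
  - Memory resource usage
      - LCs used as distributed memory (16x1 bits)
      - Datapath is a design constant (library based)
  - Area cost for a PE array (equations partly lost in extraction)
      - Datapath area: $A_d = n_p \cdot A_f$, with $n_p$ the number of PEs and $A_f$ the datapath functional cost
      - Number of PEs: $n_p$ is given as a ratio of two determinants (matrices not recoverable)
      - Memory area $A_m$: a sum over the processor space axes $k = 0 \dots p-1$ that combines the register width along axis k with the clustering parameters along axes j, divided by 16 (one LC stores 16 bits)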
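Core power model (4/4)
  - Energy cost for the whole loop nest
      - We have Ec = Pc.ncycle.Tcycle
      - We will consider ncycle = Vcalc/np
  - Total core energy cost (slide equation garbled in extraction; dynamic part rebuilt from the definitions of the core power model, with Tcycle = 1/f)
      - $E_{core} = K_f D_f V_{calc} A_f + K_m D_m V_{calc} \frac{A_m}{n_p}$
      - $V_{calc}$: total loop computation volume
      - $D_f$, $D_m$: average toggle rates of datapath and memory
      - $A_f$: datapath functional cost of one PE; $A_m / n_p$: memory area of one PE
  - Energy does not depend on np!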
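The independence from np follows from Ec = Pc.ncycle.Tcycle with ncycle = Vcalc/np: the dynamic power grows linearly with np while the cycle count shrinks by the same factor. The sketch below checks this numerically. It ignores the static term Pstat, assumes Tcycle = 1/f, and uses hypothetical parameter names and constants.

```python
import math


def core_dynamic_energy(v_calc, n_p, area_f, area_m_per_pe,
                        k_f, d_f, k_m, d_m, freq):
    """Dynamic core energy of the whole loop nest (static power ignored).

    area_f: datapath logic cells per PE; area_m_per_pe: memory cells per PE.
    """
    # Dynamic power of the array: every PE adds its datapath and memory cells.
    power = (k_f * d_f * area_f + k_m * d_m * area_m_per_pe) * n_p * freq
    n_cycle = v_calc / n_p   # ncycle = Vcalc / np
    t_cycle = 1.0 / freq     # assumed: one cycle lasts 1 / f
    return power * n_cycle * t_cycle


# Hypothetical constants, for illustration only.
params = dict(v_calc=1e6, area_f=100, area_m_per_pe=32,
              k_f=1e-9, d_f=0.25, k_m=2e-9, d_m=0.5, freq=50e6)
e_4 = core_dynamic_energy(n_p=4, **params)
e_16 = core_dynamic_energy(n_p=16, **params)
same_energy = math.isclose(e_4, e_16)
```

The per-PE memory area still depends on the clustering parameters, so the partitioning influences the core energy through the memory term even though the PE count cancels out.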
Content
  - Context and motivations
  - Target architectures
  - Partitioning
  - Modeling Power
  - Experimental results
      - Model validation
      - Extrapolations
  - Conclusion
IO power model results
  [Figure: four surface plots over two partitioning parameters (axes labelled x and y, ticks from 5 to 25): observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW), relative error (%)]
Core power model results
  [Figure: four surface plots over two partitioning parameters (axes labelled x and y, ticks from 5 to 25): predicted core power (mW), observed core power (mW), absolute error (mW), relative error (%)]
System power model
  [Figure: four surface plots over two partitioning parameters (axes labelled x and y, ticks from 5 to 25): predicted total energy dissipation (loop execution energy cost, J), observed total energy dissipation (loop execution energy cost, J), energy dissipation absolute error (J), energy dissipation relative error (%)]
Content
  - Context and motivations
  - Target architectures
  - Partitioning
  - Modeling Power
  - Experimental results
  - Conclusion
      - Solving the optimisation problem (Lagrange multipliers)
      - Custom cache for embedded CPUs
      - Extension to SAREs (affine dependencies)
Conclusion
  - Model matches experiments
      - Cheap measurement setup
      - Many components contribute to current dissipation (LEDs, PCI, etc.)
  - Observations
      - Trade-off evolves with technology
      - More sensitive for ASICs?
Future work (1/2)
  - Formulation of the optimization problem
      - Minimize energy per iteration
      - Constraints on performance and area
  - Analytical solution?
      - Lagrange multipliers
      - No closed form for n > 3
      - BUT fast numerical methods
Future work (2/2)
  - Model for embedded CPUs
      - Trade-off between cache size and memory accesses
      - Determine the optimal cache size and the associated tiling parameters
  - Extension to SAREs?
      - Affine dependencies
      - More general loops