Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong
University of California, Los Angeles
Partially supported by NSF Grants CCR-0096383, and CCR-0306682, and Altera under the California MICRO program
Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions
Why? FPGA is Known to be Power Inefficient!
FPGA consumes 50-100X more power
Why do we care about power optimization for FPGAs?
Source: [Zuchowski, et al, ICCAD02]
FPGA Advantages
Short TAT (total turnaround time)
No or very low NRE
ASICs Become Increasingly Expensive
Traditional ASIC designs face rapidly increasing NRE and mask-set costs at 90nm and below
Source: EETimes
[Figure: total cost for mask set ($M, axis $0.0 to $2.5) and cost per mask ($K, axis 0 to $60) at 250nm, 180nm, 130nm, and 100nm; cost per mask is labeled 7.5, 12, 40, and 60.]
Process (um)            2.0  …  0.8  0.6  0.35  0.25  0.18  0.13   0.10
Single Mask cost ($K)   1.5     1.5  2.5  4.5   7.5   12    40     60
# of Masks              12      12   12   16    20    26    30     34
Mask Set cost ($K)      18      18   30   72    150   312   1,000  2,000
Our Research
Power-Efficient FPGAs
Circuit Design
Fabric Design
System Design
Synthesis Tools
FPGA Architecture
[Figure: FPGA with Programmable IO, Programmable Logic, and Programmable Routing. Each logic block has I inputs, N outputs, and a clock, and contains BLE #1 to BLE #N; each BLE is a K-input LUT followed by a D FF.]
Evaluation Framework – fpgaEva-LP
fpgaEva-LP [Li, et al, FPGA’03]
[Figure: evaluation flow. BLIF -> Logic Optimization (SIS) -> Tech-Mapping (RASP) -> Timing-Driven Packing (T-VPack) -> Placement & Routing (VPR), driven by the Arch Spec, producing SLIF, Delay, and Area. The result feeds the BC-Netlist Generator, whose BC-Netlist goes to the Power Simulator, which reports Power.]
BC-Netlist Generator
Mapped Netlist Layout
Buffer Extraction
Netlist Generation for Logic Clusters
Capacitance Extraction
Delay Calculation
BC-Netlist
Back-annotation
Mixed-level Power Model – Overview
Dynamic power
  Switching power
  Short-circuit power
  Related to signal transitions
    Functional switch
    Glitch
Static power
  Sub-threshold leakage
  Gate leakage
  Reverse biased leakage
  Depending on the input vector
Power sources vs. components:
             Logic Block    Interconnect & clock
Dynamic      Macro-model    Switch-level model
Static       Macro-model    Macro-model
Cycle-Accurate Power Simulator
Inputs: Mixed-level Power Model; Post-layout extracted delay & capacitance; BC-Netlist
Random Vector Generation
Cycle Accurate Power Simulation with Glitch Analysis
All cycles finished? No: generate the next random vector. Yes: report Power Values.
Energy per cycle:
$$E_{cycle} = \sum_{i \in active} E_a(n_i) + \sum_{j \in idle} E_s(n_j)$$
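The per-cycle energy sum on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the fpgaEva-LP implementation: the node names, the energy values, and the per-node `switch_prob` used to pick active nodes are all hypothetical, and a real simulator would derive activity (including glitches) from the BC-netlist and the input vectors.

```python
import random

def cycle_energy(active, idle, e_active, e_static):
    """E_cycle = sum of E_a over active nodes + sum of E_s over idle nodes."""
    return sum(e_active[n] for n in active) + sum(e_static[n] for n in idle)

def simulate(nodes, e_active, e_static, switch_prob, num_cycles, seed=0):
    """Average energy per cycle over randomly generated activity.

    switch_prob[n] is a hypothetical probability that node n switches in a
    cycle; it stands in for real vector-driven simulation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_cycles):
        active = {n for n in nodes if rng.random() < switch_prob[n]}
        idle = [n for n in nodes if n not in active]
        total += cycle_energy(active, idle, e_active, e_static)
    return total / num_cycles

# Hypothetical per-node energies (arbitrary units).
nodes = ["lut0", "lut1", "wire0"]
e_active = {"lut0": 2.0, "lut1": 2.0, "wire0": 5.0}
e_static = {"lut0": 0.5, "lut1": 0.5, "wire0": 1.0}

one_cycle = cycle_energy(["lut0", "wire0"], ["lut1"], e_active, e_static)
print(one_cycle)
```

Keeping `cycle_energy` separate from the activity generator makes it easy to swap the random activity for one extracted from simulation.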
Power Breakdown
Interconnect power is dominant
Cluster Size = 12, LUT Size = 4: Logic Block Power 19%, Interconnect Power 59%, Clock Power 22%
Cluster Size = 12, LUT Size = 6: Logic Block Power 40%, Interconnect Power 45%, Clock Power 15%
Power Breakdown (cont’d)
Leakage power becomes increasingly important (100nm)
Cluster Size = 12, LUT Size = 4: Leakage Power 42%, Dynamic Power 58%
Cluster Size = 12, LUT Size = 6: Leakage Power 52%, Dynamic Power 48%
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture
Low Power Synthesis with Dual-Vdd
Conclusion
Total Power along LUT and Cluster Size Changes
[Figure: total FPGA power (normalized geometric mean, axis 1 to 2) versus LUT size (3 to 7), one curve per cluster size: 4, 6, 8, 10, 12.]
Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches
Routing Architecture Evaluation
Low-Power and High-Performance Architectures
Applications            Best FPGA architecture                               Energy (E)  Delay (t)  E3t     Et3
High-performance (Et3)  Cluster size 12, LUT size 4, wire segment length 4,  1.0502      0.8865     1.0268  0.7865
                        100% buffered routing switches
Low-power (E3t)         Cluster size 10, LUT size 4, wire segment length 4,  0.9653      0.9904     0.8909  1.0080
                        25% buffered routing switches
Architecture parameter selection leads to a 10% power/delay trade-off
Uniform FPGA fabrics provide limited power-performance tradeoff
Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
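The two figures of merit named on this slide weight energy and delay differently: Et3 favors performance, E3t favors power. The sketch below computes both from the table's values, reading E = 1.0502, t = 0.8865 for the high-performance row and E = 0.9653, t = 0.9904 for the low-power row. The function names are illustrative. Aggregate columns on the slide may have been averaged per benchmark, so they need not equal the product of the averaged E and t.

```python
def et3(energy, delay):
    """Energy-delay-cubed product: emphasizes performance."""
    return energy * delay ** 3

def e3t(energy, delay):
    """Energy-cubed-delay product: emphasizes power."""
    return energy ** 3 * delay

# (normalized energy, normalized delay) as read from the slide's table
archs = {
    "high_performance": (1.0502, 0.8865),
    "low_power": (0.9653, 0.9904),
}

for name, (e, t) in archs.items():
    print(name, "E3t =", round(e3t(e, t), 4), "Et3 =", round(et3(e, t), 4))
```

Under either reading, the high-performance architecture wins on Et3 and the low-power architecture wins on E3t, which is the trade-off the slide reports.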
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
Low Power Synthesis with Dual-Vdd
Conclusion
Dual-Vdd LUT Design
Dual-Vdd technique makes use of the timing slack to reduce power
  VddH devices on critical paths, for performance
  VddL devices on non-critical paths, for power
  Assume uniform Vdd for one LUT
Threshold voltage Vt should be adjusted carefully for different Vdd levels
  To compensate for the delay increase
  To avoid an excessive leakage power increase
Vdd/Vt-Scaling for LUTs
Three scaling schemes
  Constant-Vt scaling
  Fixed-Vdd/Vt-ratio scaling
  Constant-leakage scaling
[Figure: LUT delay (ns, axis 0.1 to 0.7) and LUT leakage power (uW, axis 0 to 10) versus Vdd (1.3 V, 1.0 V, 0.9 V, 0.8 V) for constant-Vt, fixed-Vdd/Vt-ratio, and constant-leakage scaling.]
Constant-leakage scaling obtains a good tradeoff
  Useful for both single-Vdd scaling and dual-Vdd design
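The qualitative behavior of the three schemes can be reproduced with textbook first-order models. Everything in the sketch below is an assumption and not taken from the slides: delay follows the alpha-power law, Vdd / (Vdd - Vt)^alpha; leakage power follows Vdd * exp(-Vt / (n * vT)); and the nominal point (1.3 V, Vt = 0.3 V) and constants are illustrative only.

```python
import math

ALPHA = 1.3            # velocity-saturation exponent (assumed)
N_VT = 0.039           # n * thermal voltage in volts (assumed: 1.5 * 26 mV)
VDD0, VT0 = 1.3, 0.3   # nominal operating point (assumed)

def delay(vdd, vt):
    """Alpha-power-law gate delay, up to a constant factor."""
    return vdd / (vdd - vt) ** ALPHA

def leakage(vdd, vt):
    """Sub-threshold leakage power, up to a constant factor."""
    return vdd * math.exp(-vt / N_VT)

def vt_for(scheme, vdd):
    """Threshold voltage chosen by each scaling scheme at supply vdd."""
    if scheme == "constant_vt":
        return VT0
    if scheme == "fixed_ratio":
        return VT0 * vdd / VDD0
    if scheme == "constant_leakage":
        # Solve leakage(vdd, vt) == leakage(VDD0, VT0) for vt.
        return VT0 + N_VT * math.log(vdd / VDD0)
    raise ValueError(scheme)

def relative(scheme, vdd):
    """(delay, leakage) at vdd, normalized to the nominal point."""
    vt = vt_for(scheme, vdd)
    return delay(vdd, vt) / delay(VDD0, VT0), leakage(vdd, vt) / leakage(VDD0, VT0)

for scheme in ("constant_vt", "fixed_ratio", "constant_leakage"):
    for vdd in (1.3, 1.0, 0.9, 0.8):
        d, p = relative(scheme, vdd)
        print(f"{scheme:17s} Vdd={vdd:.1f}  delay x{d:.2f}  leakage x{p:.2f}")
```

In this model constant-Vt scaling gives the slowest LUT, fixed-ratio scaling gives the fastest but with much higher leakage, and constant-leakage scaling sits between them on delay while holding leakage flat.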
Dual-Vt LUT Design
LUT is divided into two parts
  Part I: configuration cells, high Vt
  Part II: MUX tree and input buffers, normal Vt (decided by constant-leakage Vdd-scaling)
Configuration SRAM cells
  Content remains unchanged after configuration
  Read/write delay is not related to FPGA performance
Use high Vt, ~40% of Vdd
  Maintain signal integrity
  Reduce SRAM leakage by 15X and LUT leakage by 2.4X
  Increase configuration time by 13%
Pre-Defined Dual-Vt Fabric
Power saving
  11.6% for combinational circuits
  14.6% for sequential circuits

Table 1: Combinational circuits
Circuit    arch-SVST (Single Vt) power (watt)    arch-SVDT (Dual Vt) power saving
alu4       0.0798                                8.5%
apex2      0.108                                 9.3%
apex4      0.0536                                12.3%
des        0.234                                 10.7%
ex1010     0.179                                 17.3%
ex5p       0.059                                 11.6%
misex3     0.0753                                9.4%
pdc        0.256                                 14.7%
seq        0.0927                                9.4%
spla       0.180                                 12.4%
Avg.                                             11.6%

Table 2: Sequential circuits
Circuit    arch-SVST (Single Vt) power (watt)    arch-SVDT (Dual Vt) power saving
bigkey     0.148                                 12.3%
clma       0.632                                 14.8%
diffeq     0.0391                                19.7%
dsip       0.134                                 14.5%
elliptic   0.140                                 16.3%
frisc      0.190                                 19.2%
s298       0.0736                                13.4%
s38417     0.307                                 11.7%
s38484     0.261                                 10.2%
tseng      0.0351                                14.0%
Avg.                                             14.6%
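As a quick arithmetic check, the reported averages are the plain arithmetic means of the per-circuit savings in the two tables. The snippet below recomputes them; the dictionary names are illustrative.

```python
# Per-circuit power savings (%) of arch-SVDT over arch-SVST, from the tables.
combinational = {
    "alu4": 8.5, "apex2": 9.3, "apex4": 12.3, "des": 10.7, "ex1010": 17.3,
    "ex5p": 11.6, "misex3": 9.4, "pdc": 14.7, "seq": 9.4, "spla": 12.4,
}
sequential = {
    "bigkey": 12.3, "clma": 14.8, "diffeq": 19.7, "dsip": 14.5,
    "elliptic": 16.3, "frisc": 19.2, "s298": 13.4, "s38417": 11.7,
    "s38484": 10.2, "tseng": 14.0,
}

def mean(values):
    """Arithmetic mean of an iterable with a known length."""
    return sum(values) / len(values)

comb_avg = mean(combinational.values())
seq_avg = mean(sequential.values())
print(f"combinational: {comb_avg:.1f}%  sequential: {seq_avg:.1f}%")
```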
Dual-Vdd FPGA Fabric
Granularity: logic block (i.e., cluster of LUTs)
  Smaller granularity => intuitively more power saving
  But a larger implementation overhead
Layout pattern: pre-defined dual-Vdd pattern
  Row-based or interleaved pattern
  Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
Interconnect uses uniform VddH
L-block: VddL
H-block: VddH
Simple Design Flow for Dual-Vdd Fabric
Based on traditional design flow, but with new steps
Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
Step II: Dual-Vdd assignment based on sensitivity
Step III: Timing-driven P & R considering pre-defined dual-Vdd pattern (modified VPR)
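The slides do not define the sensitivity measure used in Step II, so the following is only one plausible greedy sketch. It assumes sensitivity is power saved per unit of added delay, tries blocks in decreasing sensitivity order, and keeps a block at VddL only if the critical-path delay of the all-VddH solution is preserved. Level converters and the layout pattern are ignored; all names and numbers are hypothetical.

```python
def critical_delay(order, fanins, delay_of):
    """Longest path delay in a DAG; `order` must be topological."""
    arrival = {}
    for n in order:
        arrival[n] = delay_of[n] + max((arrival[f] for f in fanins[n]), default=0.0)
    return max(arrival.values())

def assign_dual_vdd(order, fanins, d_h, d_l, p_h, p_l):
    """Greedily move blocks to VddL without exceeding the all-VddH delay."""
    level = {n: "H" for n in order}
    delay_of = dict(d_h)
    target = critical_delay(order, fanins, delay_of)

    def sensitivity(n):
        # Assumed metric: power saved per unit of delay added (d_l > d_h).
        return (p_h[n] - p_l[n]) / (d_l[n] - d_h[n])

    for n in sorted(order, key=sensitivity, reverse=True):
        delay_of[n] = d_l[n]                      # tentatively lower the supply
        if critical_delay(order, fanins, delay_of) <= target + 1e-12:
            level[n] = "L"
        else:
            delay_of[n] = d_h[n]                  # revert: timing would degrade
    return level, target

# Hypothetical four-block example: a -> c -> d is critical, b -> d has slack.
order = ["a", "b", "c", "d"]
fanins = {"a": [], "b": [], "c": ["a"], "d": ["c", "b"]}
d_h = {n: 1.0 for n in order}
d_l = {n: 1.5 for n in order}
p_h = {n: 4.0 for n in order}
p_l = {n: 2.0 for n in order}

level, target = assign_dual_vdd(order, fanins, d_h, d_l, p_h, p_l)
print(level, target)
```

Recomputing the full critical path after every tentative change is quadratic in the number of blocks; it keeps the sketch short, whereas a production flow would update slacks incrementally.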
Comparison Between Vdd-Scaling and Dual-Vdd
For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)
For low clock frequency, single-Vdd scaling is better
Still a large gap between ideal dual-Vdd and real case
  Ideal dual-Vdd is the result without layout pattern constraint
[Figure: power (watt, axis 0.03 to 0.09) versus max. clock frequency (MHz, axis 65 to 125) for circuit alu4. Curves: arch-SVDT (Vdd scaling), arch-DVDT (ideal case), arch-DVDT (pre-defined Vdd). Points are labeled with the supply setting, from 1.5v down to 0.9v for single Vdd and from 1.5v/1.0v down to 0.9v/0.8v for dual Vdd.]
Vdd-Programmable Logic Block
Power switches for Vdd selection and power gating
One-bit control is needed for Vdd selection, but two-bit control is needed for power gating
Experimental Results with Vdd-Programmable Blocks
Power vs. performance
Circuit: alu4
[Figure: total power (watt, axis 0.03 to 0.09) versus clock frequency (MHz, axis 65 to 125). Curves: arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), arch-DV (pre-defined Vdd). Points are labeled with the supply setting, from 1.5v down to 1.0v for single Vdd and from 1.5v/1.0v down to 0.9v/0.8v for dual Vdd.]
Low Power Synthesis for Dual Vdd FPGAs
FPGA architecture with dual-Vdds adds new layout constraints for synthesis tools
Novel synthesis tools are required to support the architecture
  Technology mapping [Chen, et al, FPGA’04]
  Circuit clustering [Chen, et al, ISLPED’04]
Technology Mapping for Low-Power FPGAs with Dual Vdds
Cut Enumeration:
Topological Order from PIs to POs.
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z. Cuts are annotated with delay and power (Delay 1, Power 1; Delay 2, Power 2; Delay 2, Power 2.5; Delay 2, Power 3.2; Delay 2, Power 3.5), and nodes with their optimal delay and power (Optimal Delay = 1, Power = 1; Optimal Delay = 1, Power = 1.5; Optimal Delay = 2, Power = 2.5).]
Each annotation represents one case: the single high-Vdd case
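Cut enumeration in topological order can be sketched as follows. This is the generic K-feasible cut merge, not the authors' mapper: it omits the power cost and any cut pruning, and the `optimal_depth` helper assumes a unit-delay LUT model. The three-input example network is hypothetical.

```python
from itertools import product

def enumerate_cuts(order, fanins, k):
    """Return {node: set of K-feasible cuts}, visiting nodes from PIs to POs."""
    cuts = {}
    for node in order:
        node_cuts = {frozenset([node])}          # trivial cut: the node itself
        if fanins[node]:
            # Merge one cut from each fanin; keep merges with at most k inputs.
            for combo in product(*(cuts[f] for f in fanins[node])):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts

def optimal_depth(order, fanins, cuts):
    """Minimum LUT depth per node under a unit-delay model (assumed)."""
    depth = {}
    for node in order:
        if not fanins[node]:
            depth[node] = 0                      # primary input
            continue
        trivial = frozenset([node])
        depth[node] = min(
            1 + max(depth[u] for u in cut)
            for cut in cuts[node] if cut != trivial
        )
    return depth

# Hypothetical network: x = f(a, b), y = g(x, c).
order = ["a", "b", "c", "x", "y"]
fanins = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["x", "c"]}

cuts3 = enumerate_cuts(order, fanins, 3)
cuts2 = enumerate_cuts(order, fanins, 2)
print(sorted(sorted(c) for c in cuts3["y"]))
print(optimal_depth(order, fanins, cuts3)["y"], optimal_depth(order, fanins, cuts2)["y"])
```

With 3-input LUTs the whole example fits in one LUT, so y has depth 1; with 2-input LUTs the cut {x, c} is the best available and the depth is 2.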
Dual-Vdd Cases
Consider:
  Converter delay & power
  VddL LUT delay & power
  VddH LUT delay & power
[Figure: example network with nodes a, b, c, d, e, w, x, y, z, showing an input LUT driving the target LUT; low-Vdd and high-Vdd LUTs are marked differently.]
Cases    Input LUT    Target LUT    Converter
1        VddL         VddL          No
2        VddL         VddH          Yes
3        VddH         VddL          No
4        VddH         VddH          No
Four extra cases for dual-Vdd consideration
  Produce these four cases for each cut and node
  More tradeoff solution points
    Smaller power requires larger delay
    Smaller delay requires larger power
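The four-case table reduces to one rule: a level converter is needed only when a VddL LUT drives a VddH LUT. The sketch below encodes that rule and adds the converter cost to the target LUT's delay and power. The delay and power numbers are hypothetical placeholders, not values from the slides.

```python
def needs_converter(input_vdd, target_vdd):
    """A level converter is required only for a VddL LUT driving a VddH LUT."""
    return input_vdd == "L" and target_vdd == "H"

def target_cost(input_vdd, target_vdd, lut, converter):
    """(delay, power) of the target LUT, including a converter if required."""
    delay, power = lut[target_vdd]
    if needs_converter(input_vdd, target_vdd):
        delay += converter[0]
        power += converter[1]
    return delay, power

# Hypothetical (delay, power) values in arbitrary units.
lut = {"H": (1.0, 1.0), "L": (1.3, 0.6)}
converter = (0.3, 0.1)

cases = [("L", "L"), ("L", "H"), ("H", "L"), ("H", "H")]
for number, (inp, tgt) in enumerate(cases, start=1):
    print(number, inp, tgt, needs_converter(inp, tgt), target_cost(inp, tgt, lut, converter))
```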
Mapping Solution Generation
From POs to PIs
Critical path driven by VddH LUT
Non-critical paths can be driven by VddL LUT, guided by low power
[Figure: the example network with nodes a, b, c, d, e, f, g, w, x, y, z.]
Two Types of Required Times
[Figure: node R, to be mapped, drives the mapped LUTs x and y, which are labeled VddL and VddH. Req’d times 1.8 and 2.0 are propagated back from the mapped LUTs along the critical path; with a converter, 1.7 = 2.0 - 0.3.]
Two cases are shown: R using VddH and R using VddL
Req’d time of R is 1.7 in one case and 1.8 in the other
Each node maintains two req’d times:
  Propagated separately
  Interact with each other
Experimental Results
SVmap (Single high Vdd) compared to Emap [Lamoureux, ICCAD03]
Mapping area    Total edges    Est'ed power    Real power
-4.04%          0.56%          -1.29%          -2.10%
Mapping area considerably better
Estimated power very close to the real power reported after P&R

DVmap (dual Vdd) compared to SVmap
SVmap    DVmap            DVmap            DVmap
v1.3     v1.3 - v0.8      v1.3 - v0.9      v1.3 - v1.0
         -11.63%          -10.72%          -9.44%
v1.3 as VddH and v0.8 as VddL is the best combination
Circuit Clustering with Dual Vdds
Given:
  A mapped FPGA design
  An FPGA architecture with Dual-Vdd configurable logic blocks
Goal:
  Cluster the LUTs into logic blocks
  Assign voltages to the logic blocks such that the design has
    Optimal delay
    Minimum power
Constraints:
  Logic Block Inputs ≤ K
  Logic Block Size ≤ M
  Logic Block Outputs ≤ M
  LUT delay = dL or dH
  Inter-block edge delay = D
[Figure: example logic block containing three LUTs, with Input = 5, Size = 3, Output = 2.]
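The three structural constraints (inputs, size, outputs) can be checked for a candidate cluster as sketched below. This is an illustration with hypothetical names; the example netlist is made up to reproduce the figure's counts of Input = 5, Size = 3, Output = 2.

```python
def cluster_stats(cluster, fanins, fanouts, primary_outputs):
    """Return (inputs, size, outputs) of a set of LUTs forming one logic block."""
    cluster = set(cluster)
    # Distinct signals entering the block from outside.
    inputs = {f for n in cluster for f in fanins[n] if f not in cluster}
    # LUTs whose value is needed outside the block.
    outputs = {
        n for n in cluster
        if n in primary_outputs or any(s not in cluster for s in fanouts[n])
    }
    return len(inputs), len(cluster), len(outputs)

def is_feasible(cluster, fanins, fanouts, primary_outputs, k, m):
    """Check Inputs <= K, Size <= M, Outputs <= M."""
    num_in, size, num_out = cluster_stats(cluster, fanins, fanouts, primary_outputs)
    return num_in <= k and size <= m and num_out <= m

# Hypothetical netlist: LUTs u, v, w; i1..i5 are external signals.
fanins = {"u": ["i1", "i2", "i3"], "v": ["i4", "i5"], "w": ["u", "v"]}
fanouts = {"u": ["w", "ext1"], "v": ["w"], "w": ["ext2"]}

stats = cluster_stats(["u", "v", "w"], fanins, fanouts, primary_outputs=set())
print(stats)
print(is_feasible(["u", "v", "w"], fanins, fanouts, set(), k=5, m=3))
```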
Cluster Enumeration – An Example
[Figure: example network with nodes m, n, o, p, q, r, s, t; r and s are the inputs of t and share common nodes.]
To get a cluster of size 6 on LUT t:
  Get 1 node on r, 4 on s, then merge with t ….
  Get 2 nodes on r, 3 on s …
  Get 3 on r …
Dynamic programming, from PIs to POs
Solution Generation
[Figure: the example network with nodes m, n, o, p, q, r, s, t, showing clusters s1 and s2.]
Solution propagation similar to [Vaishnav, ICCAD’99]
Delay, power and voltage (form solution points) propagate through the clusters and nodes iteratively
Try to get solutions for Cluster t1
  Get solutions for s
  Get solutions for r
Solution Curve on Node r
Good solutions: for any two delay-power-vdd points (D1, P1, V1) and (D2, P2, V2)
  if D1 > D2, then P1 < P2
  if D1 < D2, then P1 > P2
Good delay-power-vdd points:
Delay    Power    Vdd
5        10       H
5.4      9.8      L
6.1      8.24     H
6.2      8        L
[Figure: the corresponding solution curve, power (axis 0 to 12) versus delay (axis 0 to 7), with points labeled H or L.]
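The "good solutions" condition is a Pareto filter on delay and power. A minimal sketch, using the four points from the table plus one hypothetical inferior point (5.8, 9.9, L) added for illustration:

```python
def prune(points):
    """Keep only non-inferior (delay, power, vdd) points.

    After sorting by delay, a point survives only if its power is strictly
    lower than that of every faster point already kept.
    """
    kept = []
    for delay, power, vdd in sorted(points, key=lambda s: (s[0], s[1])):
        if not kept or power < kept[-1][1]:
            kept.append((delay, power, vdd))
    return kept

points = [
    (6.2, 8.0, "L"),
    (5.0, 10.0, "H"),
    (5.8, 9.9, "L"),   # hypothetical: slower and hungrier than (5.4, 9.8, L)
    (6.1, 8.24, "H"),
    (5.4, 9.8, "L"),
]

curve = prune(points)
print(curve)
```

Because the surviving list is sorted by increasing delay and strictly decreasing power, any two of its points satisfy the condition stated on the slide.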
Solution Propagation
[Figure: delay-power-vdd curve for r and delay-power-vdd curve for s (power, axis 0 to 12, versus delay, axis 0 to 7, points labeled H or L), combined into the delay-power-vdd curve for cluster t1 (power, axis 0 to 18, versus delay, axis 0 to 9).]
Consider:
  Converter
  VddL LUT
  VddH LUT
  Edge delay
All the good solutions are generated
All the inferior solutions are pruned away
Two Theorems
The algorithm gets the minimum number of solution points, W, optimally for each node
  W is upper bounded by $2(L+1)^2$, where L = level(v) for node v
The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs
Experimental Results Summary
Dual-Vdd clustering results compared to the single high-Vdd clustering results
Single Vdd    Dual Vdd         Dual Vdd         Dual Vdd
v1.3          v1.3 - v0.8      v1.3 - v0.9      v1.3 - v1.0
              -20.3%           -19.5%           -18.4%
v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three
Conclusions
FPGA power consumption
  Majority on programmable interconnects
  Leakage is significant
FPGA architecture optimization for power
  Architecture parameter tuning has a limited impact
  Using high Vt for configuration SRAM cells is helpful
  Using programmable dual Vdd for logic blocks is helpful
Power-efficient FPGA architectures introduce interesting CAD problems
  Dual-Vdd mapping
  Dual-Vdd clustering
  Up to 20% power saving reported using these algorithms