Date post: | 20-Dec-2015 |
Power Modeling and Architecture Evaluation for
FPGA with Novel Circuits for Vdd Programmability
Yan Lin, Fei Li and Lei He
EE Department, UCLA
Partially supported by NSF.
Overview
FPGA architecture evaluation
  Area and delay [Rose et al, JSSC’90]
  Power [Poon et al, FPLA’02] [Li et al, FPGA’03]
Vdd programmability for power reduction
  Concept in [FPGA’03]
  Application to logic [FPGA’04] [DAC’04]
  Application to interconnects [ICCAD’04] [Anderson et al, ICCAD’04]
Novel circuits and architecture evaluation for FPGAs with Vdd-programmability
  Reduce power by 50% with 17% area and 3% delay increase
Outline
Power modeling and architecture evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work
Framework fpgaEva-LP
[Figure: evaluation flow. Inputs: benchmark circuits and architecture specification. Stages: Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR), Parasitic Extraction, Cycle-accurate Power Simulator. Outputs: area, delay and power.]
FPGA Structure and Models
Cluster-based island-style FPGA structure
  100% buffered interconnects, subset switch block
  Input Fc = 50%, output Fc = 25%
Area and delay models similar to [Betz-Rose-Marquardt]
  But based on layout and SPICE for 100nm and below
Mixed-level power model from [FPGA’03]
  Dynamic power
    Capacitive power: functional switching and glitches
    Short-circuit power (depends on transition time)
  Static power
    Sub-threshold leakage
    Reverse-biased leakage
    Gate leakage
New Power Model in fpgaEva-LP2
Short-circuit power
  Function of switching time * switching power
  fpgaEva-LP used average signal transition time
  fpgaEva-LP2 calculates transition time for each buffer as t_r = α * t_buffer
    t_buffer is the buffer delay
    α is NOT a constant 2 as in literature, due to input slew
    α is pre-characterized by SPICE

  buffer delay   < 0.012 ns   < 0.03 ns   > 0.03 ns
  α              2            4.4         7
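The per-buffer transition time can be sketched as a table lookup followed by a multiplication, reading the slide as t_r = α * t_buffer. This is a minimal sketch, not the fpgaEva-LP2 code: the function names are hypothetical, the middle column is read as "0.012 ns up to 0.03 ns", and the handling of a delay of exactly 0.03 ns is an assumption because the slide only lists "< 0.03 ns" and "> 0.03 ns".

```python
def alpha_for_delay(buffer_delay_ns: float) -> float:
    """Return the SPICE-characterized factor alpha for a buffer delay in ns.

    Thresholds and alpha values are taken from the slide's table.
    """
    if buffer_delay_ns < 0.012:
        return 2.0
    if buffer_delay_ns < 0.03:
        return 4.4
    # Assumption: a delay of exactly 0.03 ns falls into the last bin.
    return 7.0


def transition_time_ns(buffer_delay_ns: float) -> float:
    """Estimate the output transition time as alpha * buffer delay."""
    return alpha_for_delay(buffer_delay_ns) * buffer_delay_ns


if __name__ == "__main__":
    for delay in (0.010, 0.020, 0.050):
        print(delay, alpha_for_delay(delay), transition_time_ns(delay))
```

A lookup keeps the SPICE characterization outside the power simulator's inner loop, which is presumably why the factor is pre-characterized rather than simulated per buffer.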
Validation Using SPICE
Validate by comparison for each power component
High fidelity with average absolute error of 8%
[Figure: FPGA power (watt) for benchmark circuits b1, parity, cm138a, z4ml and decode; series: SPICE simulation, fpgaEva-LP, fpgaEva-LP2]
Impact of Random Seeds in VPR
[Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for circuit s38584, one point per VPR run (10 random seeds); the spread is marked as +12% in delay and +5% in energy]
12% delay variation and 5% energy variation
Min-delay solution among 10 runs is used
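The selection rule above (run several seeds, keep the min-delay solution) can be sketched as follows. The `Run` type, the function names and the sample numbers are hypothetical and only illustrate the bookkeeping. Measuring the spread relative to the smallest value is also an assumption, since the slide does not say how the 12% and 5% figures are normalized.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    seed: int
    delay_ns: float
    energy_nj: float


def pick_min_delay(runs: List[Run]) -> Run:
    """Return the run with the smallest critical path delay."""
    return min(runs, key=lambda r: r.delay_ns)


def relative_spread(values: List[float]) -> float:
    """Spread of the values, relative to the smallest one (assumed definition)."""
    return (max(values) - min(values)) / min(values)


if __name__ == "__main__":
    # Made-up results for three seeds, not measured data.
    runs = [Run(1, 11.2, 5.40), Run(2, 10.5, 5.52), Run(3, 11.6, 5.31)]
    best = pick_min_delay(runs)
    print("chosen seed:", best.seed)
    print("delay spread:", relative_spread([r.delay_ns for r in runs]))
    print("energy spread:", relative_spread([r.energy_nj for r in runs]))
```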
Evaluation of Single-Vdd FPGAs
Architectures explored
  Cluster size N = {6, 8, 10, 12}
  LUT size k = {3, 4, 5, 6, 7}
Energy-delay (ED) dominant architectures
  Architecture with smaller delay or less energy (compared to any other architecture)
  Relaxed ED dominant set may be also valuable
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per architecture (N, k) for N in {6, 8, 10, 12} and k in {3, 4, 5, 6, 7}]
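The ED-dominant set described above is a Pareto front over (delay, energy). Below is a minimal sketch of that filter: an architecture is kept if no other architecture is at least as good in both metrics and strictly better in one. The function name is hypothetical and the delay and energy values are made up for illustration; they are not read off the plot.

```python
from typing import Dict, List, Tuple

Arch = Tuple[int, int]            # (N, k)
Point = Tuple[float, float]       # (delay_ns, energy_nj)


def ed_dominant(points: Dict[Arch, Point]) -> List[Arch]:
    """Return the architectures that no other architecture dominates."""
    kept = []
    for arch, (delay, energy) in points.items():
        dominated = any(
            d2 <= delay and e2 <= energy and (d2 < delay or e2 < energy)
            for other, (d2, e2) in points.items()
            if other != arch
        )
        if not dominated:
            kept.append(arch)
    return kept


if __name__ == "__main__":
    # Hypothetical sample points, for illustration only.
    sample = {
        (8, 7): (9.5, 8.0),
        (6, 7): (10.0, 7.0),
        (10, 4): (12.0, 3.5),
        (12, 7): (10.5, 8.5),   # worse than (8, 7) in both metrics
        (6, 3): (16.0, 4.5),    # worse than (10, 4) in both metrics
    }
    print(sorted(ed_dominant(sample)))
```

A relaxed ED-dominant set could be obtained by tolerating a small margin in the comparison, but the slides do not state the margin used, so it is left out here.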
Energy versus Delay
For 100nm ITRS technology
  Min-energy arch (N,k) = (10,4) or (8,4)
  Min-delay arch (N,k) = (8,7)
    0.8x delay but 1.7x power
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns) for all (N, k) architectures, with the annotation "Current commercial architecture"]
Vdd-programmable FPGA [FPGA’04][DAC’04][ICCAD’04]
Vdd-programmable logic block
  Vdd selection
  Power-gating unused blocks
Vdd-programmable switch
  Vdd-level conversion is needed when VddL drives VddH, to avoid excessive leakage
Vdd-programmable Routing Switch
Conventional routing switch
Vdd-programmable routing switch
  Brute-force design [ICCAD’04]
    Two extra SRAM cells for each routing switch
  New design
    One extra SRAM cell
    NAND2 gate: minimum size & high-Vt transistor
Vdd-Programmable Interconnect Connection Block
Brute-force design [ICCAD’04]
  2n extra SRAM cells for n connection switches
New design
  Only TWO extra SRAM cells for n connection switches
  Control logic includes 2n NAND2 gates and a decoder
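The SRAM counts quoted for the routing switch and the connection block can be tabulated with a few lines of arithmetic. This sketch only restates the counts on the slides (2 versus 1 extra cell per routing switch, 2n versus 2 extra cells per connection block with n switches). The function names and the example sizes are hypothetical, and the area of the NAND2 gates and decoder is not modeled.

```python
def extra_sram_routing(num_switches: int, design: str) -> int:
    """Extra SRAM cells for a group of Vdd-programmable routing switches."""
    per_switch = {"brute_force": 2, "new": 1}[design]
    return per_switch * num_switches


def extra_sram_connection_block(n: int, design: str) -> int:
    """Extra SRAM cells for one connection block with n connection switches."""
    if design == "brute_force":
        return 2 * n
    if design == "new":
        return 2  # independent of n; control logic uses 2n NAND2 + a decoder
    raise ValueError(f"unknown design: {design}")


if __name__ == "__main__":
    n = 16  # hypothetical number of connection switches in one block
    print("connection block, brute force:", extra_sram_connection_block(n, "brute_force"))
    print("connection block, new design: ", extra_sram_connection_block(n, "new"))
    print("10 routing switches, new design:", extra_sram_routing(10, "new"))
```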
Power and Delay
Vdd-programmable switch uses
  4X PMOS power transistor for 7X routing switch
  1X PMOS power transistor for 4X connection switch
Compared to conventional switch
  1000X less leakage power
  Connection box is 28% faster and has 18% less dynamic power, by moving the mux off the critical path of the connection box

(Vdd = 1.3 V)
Type         Switch delay (s)                              Energy per switch (J)
             w/o power transistor   w/ power transistor    w/o power transistor   w/ power transistor
Routing      5.9E-11                6.5E-11 (+11%)         3.3E-14                3.2E-14 (-2%)
Connection   2.9E-10                2.1E-10 (-28%)         3.8E-14                3.1E-14 (-18%)
Vdd-gateable Routing Switch
Vdd-gateable: two states, normal Vdd or power-gating
Enable power-gating capability w/o extra SRAM cells
Can be replaced by tri-state buffer
[Figure: conventional routing switch and Vdd-gateable routing switch with power transistor]
Vdd-gateable Connection Block
Enable power-gating capability w/ only one extra SRAM cell and a low-leakage decoder
[Figure: conventional connection block versus Vdd-gateable connection block]
FPGA Architecture Classes

Architecture Class   Logic Block             Interconnect
Class0 (baseline)    single-Vdd              single-Vdd
Class1               programmable dual-Vdd   programmable dual-Vdd, level converters in routing
Class2               programmable dual-Vdd   VddH and Vdd-gateable
Class3               programmable dual-Vdd   Class1, but no level converters in routing

High-Vt is applied to configuration SRAM cells for all the classes
Vdd-level Converters
Class3 removes Vdd-level converters from interconnects in Class1
  With the constraint that no VddL drives VddH
We developed a routing approach in which each routing tree has a single Vdd level
  But trees with different Vdd levels can share the same wire track
Alternative approaches:
  Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04]
  Our new work [DAC’05] allows dual Vdd in a tree, with chip-level time slack budgeting for extra power reduction
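The constraint that no VddL segment drives a VddH segment can be checked on a routing tree with a simple edge scan. The sketch below is an illustration only: the tree encoding (a dict from driver node to its fanout nodes), the "H"/"L" labels and the function names are assumptions, not the data structures of the authors' router.

```python
from typing import Dict, List, Tuple


def level_converter_violations(
    children: Dict[str, List[str]], vdd: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the (driver, sink) edges where a VddL node drives a VddH node."""
    return [
        (driver, sink)
        for driver, sinks in children.items()
        for sink in sinks
        if vdd[driver] == "L" and vdd[sink] == "H"
    ]


def is_single_vdd_tree(vdd: Dict[str, str]) -> bool:
    """True if every node of the tree uses the same Vdd level."""
    return len(set(vdd.values())) == 1


if __name__ == "__main__":
    # Hypothetical routing tree: src drives a and b, a drives c.
    tree = {"src": ["a", "b"], "a": ["c"]}
    uniform = {"src": "L", "a": "L", "b": "L", "c": "L"}
    mixed = {"src": "H", "a": "L", "b": "H", "c": "H"}
    print(is_single_vdd_tree(uniform), level_converter_violations(tree, uniform))
    print(is_single_vdd_tree(mixed), level_converter_violations(tree, mixed))
```

A single-Vdd tree satisfies the constraint trivially, which is why it needs no level converters. The mixed example shows the kind of L-to-H edge that would otherwise require one.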
Energy versus Delay
ED-product reduction
  20% by Class1 (Vdd-programmable interconnects w/ level converters)
  45% by Class2 (Vdd-gateable interconnects)
  50% by Class3 (Class1 minus level converters)
Performance degrades 3% due to Vdd programmability
[Figure: total FPGA energy per cycle (nJ) versus critical path delay (ns) for the dominant architectures of Class 0, Class 1, Class 2 and Class 3, points labeled by (N, k); annotations: LUT 4 low energy, LUT 7 high performance, min-area, min-energy]
Energy versus Area
[Figure: total FPGA energy per cycle (nJ) versus total FPGA device area for the dominant architectures of Class0, Class1, Class2 and Class3, points labeled by (N, k)]
Average area overhead
  118% for Class1 (Vdd-programmable interconnects w/ level converters)
  17% for Class2 (Vdd-gateable interconnects)
  52% for Class3 (Vdd-programmable interconnects w/o level converters)
Class2 is the best considering both energy and area
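One way to see why Class2 comes out ahead is to combine the ED-product reductions and average area overheads reported in the slides into a single figure. The combined metric below (relative ED product times relative area) is an assumption made for this sketch and is not stated as the authors' criterion; only the six percentages come from the slides.

```python
# ED-product reduction and average area overhead per class, from the slides.
CLASSES = {
    "Class1": {"ed_reduction": 0.20, "area_overhead": 1.18},
    "Class2": {"ed_reduction": 0.45, "area_overhead": 0.17},
    "Class3": {"ed_reduction": 0.50, "area_overhead": 0.52},
}


def ed_area_product(ed_reduction: float, area_overhead: float) -> float:
    """Relative ED product times relative area, with Class0 normalized to 1.0.

    Assumed metric for illustration; lower is better.
    """
    return (1.0 - ed_reduction) * (1.0 + area_overhead)


scores = {name: ed_area_product(**vals) for name, vals in CLASSES.items()}
best = min(scores, key=scores.get)

if __name__ == "__main__":
    for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{name}: {score:.3f}")
    print("lowest ED x area:", best)
```

Under this assumed metric Class1 ends up above the baseline value of 1.0, because its 118% area overhead outweighs its 20% ED-product reduction.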
Energy Breakdown
Class2 and Class3 dramatically reduce global interconnect leakage
But Class1 fails due to leakage in Vdd-level converters
[Figure: stacked bars of total FPGA energy (nJ/cycle) for Class0 to Class3, FPGA architecture (N,k) = (12,4); percentages listed in legend order]

Energy component                      Class0    Class1    Class2    Class3
Logic leakage energy                  2.94%     2.70%     4.07%     4.40%
Logic dynamic energy                  3.71%     3.04%     3.92%     4.32%
Local interconnect leakage energy     16.03%    26.22%    39.69%    42.93%
Local interconnect dynamic energy     8.09%     7.43%     9.81%     10.81%
Global interconnect leakage energy    49.89%    42.84%    4.88%     5.85%
Global interconnect dynamic energy    19.33%    17.77%    37.62%    31.70%
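The percentages on the energy breakdown slide can be sanity-checked and aggregated with a short script. The mapping used here is an assumption: the 24 numbers are read as four groups of six in class order (Class0 to Class3), each group in legend order. Each group summing to roughly 100% supports that reading, but the slide text itself does not confirm it.

```python
COMPONENTS = [
    "logic leakage",
    "logic dynamic",
    "local interconnect leakage",
    "local interconnect dynamic",
    "global interconnect leakage",
    "global interconnect dynamic",
]

# Energy shares in percent at (N, k) = (12, 4); assumed class/legend ordering.
SHARES = {
    "Class0": [2.94, 3.71, 16.03, 8.09, 49.89, 19.33],
    "Class1": [2.70, 3.04, 26.22, 7.43, 42.84, 17.77],
    "Class2": [4.07, 3.92, 39.69, 9.81, 4.88, 37.62],
    "Class3": [4.40, 4.32, 42.93, 10.81, 5.85, 31.70],
}


def leakage_share(shares):
    """Sum of the three leakage components (indices 0, 2 and 4)."""
    return shares[0] + shares[2] + shares[4]


if __name__ == "__main__":
    for name, shares in SHARES.items():
        total = sum(shares)
        global_leak = shares[COMPONENTS.index("global interconnect leakage")]
        print(f"{name}: total={total:.2f}%  leakage={leakage_share(shares):.2f}%  "
              f"global leakage={global_leak:.2f}%")
```

These are shares of each class's own total, so a larger local-interconnect share in Class2 and Class3 does not by itself mean more absolute energy there.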
Area Overhead
Class2: Vdd-gateable interconnects + Vdd-programmable CLBs, (N,k) = (12,4)
[Figure: FPGA area overhead by component. Legend: power transistors & SRAMs (CLBs), Vdd-level converters (CLBs), control (connection blocks), power transistors (connection blocks), SRAMs (connection blocks), power transistors (routing switches). Bar labels: 3.87%, 0.60%, 4.96%, 4.82%, 1.80%, 1.39%]
  Routing switches    3.87%
  Connection blocks   10.38%
  Logic blocks        3.19%
17% = 9% for power transistors + 5% for control + 2% for SRAM
Conclusions and New Results
Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGA
Vdd-gating offers a better area-power tradeoff than Vdd-selection
  45% energy-delay product reduction with 17% area overhead
Architecture with Vdd-programmability
  LUT size 4: low energy and area
  LUT size 7: best performance
New results [DAC’05]
  Time slack allocation for Vdd-programmable interconnects
  Device and architecture co-optimization for 77% energy-delay reduction
References and Download
All references and tools at http://eda.ee.ucla.edu
Results in the slides have been updated compared to the paper in ISFPGA’05