Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability Yan...

Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability

Yan Lin, Fei Li and Lei He
EE Department, UCLA

[email protected]

Partially supported by NSF.

Overview
FPGA architecture evaluation
  Area and delay [Rose et al, JSSC’90]
  Power [Poon et al, FPLA’02][Li et al, FPGA’03]
Vdd programmability for power reduction
  Concept in [FPGA’03]
  Application to logic [FPGA’04][DAC’04]
  Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04]
Novel circuits and architecture evaluation for FPGAs with Vdd-programmability
  Reduce power by 50% with 17% area and 3% delay increase

Outline
Power modeling and architecture evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work

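Framework fpgaEva-LP
[Figure: fpgaEva-LP evaluation flow. Inputs: benchmark circuits and architecture specification (Arch Spec). Stages: Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR), Parasitic Extraction, Cycle-accurate Power Simulator. Outputs: area, delay, power.]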

FPGA Structure and Models
Cluster-based island-style FPGA structure
  100% buffered interconnects, subset switch block
  Input Fc = 50%, output Fc = 25%
Area and delay models similar to [Betz-Rose-Marquardt]
  But based on layout and SPICE for 100nm and below
Mixed-level power model from [FPGA’03]
  Dynamic power
    Capacitive power: functional switching and glitches
    Short-circuit power (depends on signal transition time)
  Static power
    Sub-threshold leakage
    Reverse-biased leakage
    Gate leakage

New Power Model in fpgaEva-LP2
Short-circuit power
  Scales with switching time × switching power
  fpgaEva-LP used average signal transition time
  fpgaEva-LP2 calculates transition time for each buffer as $t_r = \alpha \cdot t_{buffer}$
    $t_{buffer}$ is the buffer delay
    $\alpha$ is NOT a constant 2 as in literature, due to input slew
    $\alpha$ is pre-characterized by SPICE

buffer delay   < 0.012 ns   < 0.03 ns   > 0.03 ns
α              2            4.4         7
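The per-buffer transition time lookup can be sketched as below. This is a minimal sketch, assuming the slide's relation is transition time = α × buffer delay and that the three α values (2, 4.4, 7) apply to the three buffer-delay ranges in the order listed. The function names and the handling of a delay of exactly 0.03 ns are assumptions, not taken from fpgaEva-LP2.

```python
def alpha_for_buffer_delay(t_buffer_ns: float) -> float:
    """Return the pre-characterized coefficient alpha for a buffer delay in ns."""
    if t_buffer_ns < 0.012:
        return 2.0
    if t_buffer_ns <= 0.03:  # assumption: 0.03 ns itself falls in the middle bin
        return 4.4
    return 7.0


def transition_time_ns(t_buffer_ns: float) -> float:
    """Estimate signal transition time as alpha times the buffer delay."""
    return alpha_for_buffer_delay(t_buffer_ns) * t_buffer_ns


if __name__ == "__main__":
    # Hypothetical buffer delays, one per bin.
    for delay in (0.010, 0.020, 0.050):
        print(f"t_buffer = {delay:.3f} ns -> t_r = {transition_time_ns(delay):.3f} ns")
```

A constant α = 2 would underestimate the transition time for the slower buffers in this table, which is presumably why the coefficient is binned by buffer delay.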

Validation Using SPICE
Validate by comparison for each power component
High fidelity with average absolute error of 8%
[Figure: FPGA power (watt) for benchmark circuits b1, parity, cm138a, z4ml and decode; bars for SPICE simulation, fpgaEva-LP and fpgaEva-LP2]

Impact of Random Seeds in VPR
[Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for circuit s38584, one point per VPR run (1 to 10); annotated spreads of +12% in delay and +5% in energy]
12% delay variation and 5% energy variation
Min-delay solution among 10 runs is used
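The seed-selection step can be sketched as follows. This is a minimal sketch: the run data is hypothetical (it is not read from the plot), the helper names are illustrative, and measuring variation as (max − min) / min is an assumption about how the slide defines it.

```python
def pick_min_delay_run(runs):
    """runs: list of (seed, delay_ns, energy_nj). Return the run with the smallest delay."""
    return min(runs, key=lambda run: run[1])


def spread_percent(values):
    """Spread of a list of values, as a percentage of the minimum (assumed definition)."""
    lowest, highest = min(values), max(values)
    return 100.0 * (highest - lowest) / lowest


# Hypothetical results for 10 place-and-route runs with different random seeds.
delays = [10.4, 10.9, 11.2, 10.6, 11.6, 10.8, 11.0, 11.4, 10.5, 11.1]
energies = [5.42, 5.35, 5.50, 5.30, 5.38, 5.45, 5.33, 5.55, 5.48, 5.40]
runs = [(seed, d, e) for seed, (d, e) in enumerate(zip(delays, energies), start=1)]

best = pick_min_delay_run(runs)
delay_spread = spread_percent(delays)
energy_spread = spread_percent(energies)

if __name__ == "__main__":
    print("min-delay run:", best)
    print(f"delay spread: {delay_spread:.1f}%, energy spread: {energy_spread:.1f}%")
```

Taking the minimum-delay run for every architecture keeps the comparison between architectures from being dominated by placement noise.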

Evaluation of Single-Vdd FPGAs
Architectures explored
  Cluster size N = {6, 8, 10, 12}
  LUT size k = {3, 4, 5, 6, 7}
Energy-delay (ED) dominant architectures
  An architecture is ED dominant if, compared to any other architecture, it has smaller delay or less energy
  Relaxed ED dominant set may also be valuable
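One way to compute the ED dominant set is a Pareto filter over (delay, energy) pairs. This is a minimal sketch, assuming "ED dominant" means that no other architecture is at least as good in both delay and energy and strictly better in one. The function name and the sample values are hypothetical and are not taken from the slide's plot.

```python
def ed_dominant(archs):
    """archs: dict mapping (N, k) -> (delay_ns, energy_nj).

    Return the sorted list of architectures that no other architecture
    beats in both delay and energy.
    """
    result = []
    for arch, (delay, energy) in archs.items():
        dominated = any(
            other_delay <= delay
            and other_energy <= energy
            and (other_delay < delay or other_energy < energy)
            for other, (other_delay, other_energy) in archs.items()
            if other != arch
        )
        if not dominated:
            result.append(arch)
    return sorted(result)


# Hypothetical (delay, energy) values for four architectures.
sample = {
    (8, 7): (9.5, 6.0),
    (10, 4): (12.0, 3.5),
    (6, 3): (16.0, 5.0),   # slower and costlier than (10, 4)
    (12, 7): (10.5, 7.5),  # slower and costlier than (8, 7)
}

if __name__ == "__main__":
    print(ed_dominant(sample))
```

A relaxed ED dominant set could be obtained by tolerating a small margin in the comparisons, though the slide does not say how the relaxation is defined.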

[Figure: total FPGA energy (nJ/cycle) versus delay for each architecture (N, k), with N in {6, 8, 10, 12} and k in {3, 4, 5, 6, 7}]

Energy versus Delay
For 100nm ITRS technology
  Min-energy architecture: (N,k) = (10,4) or (8,4)
  Min-delay architecture: (N,k) = (8,7)
    0.8x delay but 1.7x power
[Figure: total FPGA energy (nJ/cycle) versus delay for each architecture (N, k); one point is annotated "Current commercial architecture"]

Outline
Power modeling and evaluation methodology




Vdd-programmable FPGA [FPGA’04][DAC’04][ICCAD’04]
Vdd-programmable logic block
  Vdd selection
  Power-gating unused blocks
Vdd-programmable switch
  Vdd-level conversion is needed when VddL drives VddH
    To avoid excessive leakage

Vdd-programmable Routing Switch
Conventional routing switch
Vdd-programmable routing switch
  Brute-force design [ICCAD’04]
    Two extra SRAM cells for each routing switch
  New design
    One extra SRAM cell
    NAND2 gate: minimum size and high-Vt transistors

Vdd-Programmable Interconnect Connection Block
New design
  Only two extra SRAM cells for n connection switches
  Control logic includes 2n NAND2 gates and a decoder
Brute-force design [ICCAD’04]
  2n extra SRAM cells for n connection switches
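The SRAM savings can be tallied with a short calculation. This is a minimal sketch using the per-switch counts stated on the slides (brute force: 2 extra cells per routing switch and 2n per connection block; new design: 1 extra cell per routing switch and 2 per connection block). The function name and the switch counts in the example are hypothetical.

```python
def extra_sram_cells(routing_switches: int, connection_blocks: int,
                     switches_per_block: int, design: str) -> int:
    """Count extra configuration SRAM cells needed for Vdd programmability."""
    if design == "brute-force":
        per_routing_switch = 2
        per_connection_block = 2 * switches_per_block
    elif design == "new":
        per_routing_switch = 1
        per_connection_block = 2  # independent of the number of switches n
    else:
        raise ValueError(f"unknown design: {design}")
    return (routing_switches * per_routing_switch
            + connection_blocks * per_connection_block)


if __name__ == "__main__":
    # Hypothetical fabric: 100 routing switches, 10 connection blocks, n = 16.
    for design in ("brute-force", "new"):
        print(design, extra_sram_cells(100, 10, 16, design))
```

The connection-block saving grows with n, since the new design trades SRAM cells for 2n NAND2 gates and a decoder.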

Power and Delay
Vdd-programmable switch uses
  4X PMOS power transistor for 7X routing switch
  1X PMOS power transistor for 4X connection switch
Compared to conventional switch
  1000X less leakage power
  Connection box is 28% faster and has 18% less dynamic power
    By moving mux from critical path of connection box

Vdd = 1.3 V

Type         Switch delay (s)                           Energy per switch (Joule)
             w/o power transistor  w/ power transistor  w/o power transistor  w/ power transistor
Routing      5.9E-11               6.5E-11 (+11%)       3.3E-14               3.2E-14 (-2%)
Connection   2.9E-10               2.1E-10 (-28%)       3.8E-14               3.1E-14 (-18%)
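The percentage columns can be re-derived from the tabulated values. This is a minimal sketch: the helper name is illustrative, and the inputs are the rounded table entries, so small differences from the printed percentages are expected.

```python
def percent_change(without: float, with_transistor: float) -> float:
    """Relative change, in percent, when the power transistor is added."""
    return 100.0 * (with_transistor - without) / without


# (without, with) pairs copied from the table.
table = {
    "routing delay": (5.9e-11, 6.5e-11),
    "routing energy": (3.3e-14, 3.2e-14),
    "connection delay": (2.9e-10, 2.1e-10),
    "connection energy": (3.8e-14, 3.1e-14),
}

changes = {name: percent_change(*pair) for name, pair in table.items()}

if __name__ == "__main__":
    for name, change in changes.items():
        print(f"{name}: {change:+.1f}%")
```

The connection-switch entries round to the printed -28% and -18%. The routing-switch entries come out near +10% and -3% rather than the printed +11% and -2%, which may simply reflect rounding of the tabulated values.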

Vdd-gateable Routing Switch
Vdd-gateable: two states
  Normal Vdd or power-gating
Enable power-gating capability w/o extra SRAM cells
Can be replaced by tri-state buffer
[Figure: conventional routing switch versus Vdd-gateable routing switch with power transistor]

Vdd-gateable Connection Block
Enable power-gating capability w/ only one extra SRAM cell and a low-leakage decoder
[Figure: conventional versus Vdd-gateable connection block]

Outline
Power modeling and evaluation methodology




FPGA Architecture Classes

Architecture Class   Logic Block             Interconnect
Class0 (baseline)    single-Vdd              single-Vdd
Class1               programmable dual-Vdd   programmable dual-Vdd, level converters in routing
Class2               programmable dual-Vdd   VddH and Vdd-gateable
Class3               programmable dual-Vdd   Class1, but no level converters in routing

High-Vt is applied to configuration SRAM cells for all the classes

Vdd-level Converters
Class3 removes Vdd-level converters from interconnects in Class1
  With the constraint that no VddL drives VddH
We developed a routing algorithm in which each routing tree has a single Vdd level
  But trees with different Vdd levels can share the same wire track
Alternative approaches:
  Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04]
  Our new work [DAC’05] allows dual Vdd in a tree, with chip-level time slack budgeting for extra power reduction
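The Class3 routing constraint can be checked with a few lines of code. This is a minimal sketch under an assumed representation: each routing tree is a list of node names, each node has an assigned Vdd level, and edges are (driver, sink) pairs. The names and data structures are hypothetical and do not come from the authors' router.

```python
VDDH, VDDL = "VddH", "VddL"


def low_drives_high(edges, level):
    """Return the (driver, sink) edges where a VddL node drives a VddH node."""
    return [(d, s) for d, s in edges if level[d] == VDDL and level[s] == VDDH]


def tree_has_single_level(tree_nodes, level):
    """True if every node in the routing tree uses the same Vdd level."""
    return len({level[node] for node in tree_nodes}) == 1


# Hypothetical example: one VddH tree and one VddL tree.
level = {
    "src_a": VDDH, "sw_a": VDDH, "sink_a": VDDH,
    "src_b": VDDL, "sw_b": VDDL, "sink_b": VDDL,
}
tree_a = ["src_a", "sw_a", "sink_a"]
tree_b = ["src_b", "sw_b", "sink_b"]
edges = [("src_a", "sw_a"), ("sw_a", "sink_a"),
         ("src_b", "sw_b"), ("sw_b", "sink_b")]

if __name__ == "__main__":
    print(tree_has_single_level(tree_a, level), tree_has_single_level(tree_b, level))
    print(low_drives_high(edges, level))
```

If every tree has a single level, no edge inside a tree can have VddL driving VddH, so level converters are not needed in the interconnect.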

Energy versus Delay
ED-product reduction
  20% by Class1 (Vdd-programmable interconnects w/ level converters)
  45% by Class2 (Vdd-gateable interconnects)
  50% by Class3 (Class1 minus level converters)
Performance degrades 3% due to Vdd programmability
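As a rough cross-check, the energy saving implied by these figures can be estimated. This is a sketch under two assumptions the slide does not state: the ED product is energy per cycle multiplied by delay, and the 3% delay penalty applies equally to each class. The function name is illustrative.

```python
def implied_energy_reduction(ed_reduction: float, delay_increase: float) -> float:
    """Energy reduction implied by an ED-product reduction and a delay increase.

    With ED = energy * delay, the new energy ratio is
    (1 - ed_reduction) / (1 + delay_increase).
    """
    return 1.0 - (1.0 - ed_reduction) / (1.0 + delay_increase)


ed_reductions = {"Class1": 0.20, "Class2": 0.45, "Class3": 0.50}
implied = {name: implied_energy_reduction(r, 0.03)
           for name, r in ed_reductions.items()}

if __name__ == "__main__":
    for name, value in implied.items():
        print(f"{name}: about {100 * value:.1f}% less energy per cycle")
```

Under these assumptions the energy reduction is slightly larger than the ED-product reduction, because part of the energy gain is offset by the longer delay.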

[Figure: total FPGA energy/cycle (nJ) versus delay for Class0, Class1, Class2 and Class3, with points labeled by (N, k). Annotations on the plot: LUT 4, low energy; LUT 7, high performance; min-area; min-energy.]

Energy versus Area
[Figure: total FPGA energy/cycle (nJ) versus total FPGA device area for Class0, Class1, Class2 and Class3, with points labeled by (N, k)]
Average area overhead
  118% for Class1 (Vdd-programmable interconnects w/ level converters)
  17% for Class2 (Vdd-gateable interconnects)
  52% for Class3 (Vdd-programmable interconnects w/o level converters)
Class2 is the best considering both energy and area

Energy Breakdown
Class2 and Class3 dramatically reduce global interconnect leakage
But Class1 fails due to leakage in Vdd-level converters
[Figure: total FPGA energy (nJ/cycle) for Class0 to Class3, FPGA architecture (N,k) = (12,4), broken down by component]

Energy share by component (percentages as listed on the chart, read in legend order, one group per class)

Component                             Class0   Class1   Class2   Class3
Logic leakage energy                   2.94%    2.70%    4.07%    4.40%
Logic dynamic energy                   3.71%    3.04%    3.92%    4.32%
Local interconnect leakage energy     16.03%   26.22%   39.69%   42.93%
Local interconnect dynamic energy      8.09%    7.43%    9.81%   10.81%
Global interconnect leakage energy    49.89%   42.84%    4.88%    5.85%
Global interconnect dynamic energy    19.33%   17.77%   37.62%   31.70%

Area Overhead
Class2: Vdd-gateable interconnects + Vdd-programmable CLBs, (N,k) = (12,4)
[Figure: FPGA area overhead by component. Legend: Power Transistors & SRAMs (CLBs); Vdd-level Converters (CLBs); Control (Connection Blocks); Power Transistors (Connection Blocks); SRAMs (Connection Blocks); Power Transistors (Routing Switches). Component values shown: 3.87%, 0.60%, 4.96%, 4.82%, 1.80%, 1.39%.]
Overhead by block type
  Routing Switches 3.87%
  Connection Blocks 10.38%
  Logic Blocks 3.19%
17% = 9% for power transistors + 5% for control + 2% for SRAM
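The chart values can be checked against the 17% total with a short sum. This is a minimal sketch: the six component percentages and the three block-type totals are copied from the slide, while the variable names are illustrative. No assignment of individual values to legend entries is assumed.

```python
# Six component overheads (percent of baseline area), as listed on the chart.
components = [3.87, 0.60, 4.96, 4.82, 1.80, 1.39]

# Block-type totals as listed on the slide.
groups = {
    "Routing Switches": 3.87,
    "Connection Blocks": 10.38,
    "Logic Blocks": 3.19,
}

total_components = sum(components)
total_groups = sum(groups.values())

if __name__ == "__main__":
    print(f"sum of components: {total_components:.2f}%")
    print(f"sum of block types: {total_groups:.2f}%")
```

Both sums come to 17.44%, consistent with the 17% overhead quoted for Class2. The rounded split of 9% + 5% + 2% adds up to 16%, so that line appears to use rounded-down parts.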

Conclusions and New Results
Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGA
Vdd-gating offers a better area-power tradeoff than Vdd-selection
  45% energy-delay product reduction with 17% area overhead
Architecture with Vdd-programmability
  LUT size 4: low energy and area
  LUT size 7: best performance
New results [DAC’05]
  Time slack allocation for Vdd-programmable interconnects
  Device and architecture co-optimization for 77% energy-delay reduction

References and Download
All references and tools at http://eda.ee.ucla.edu

Results in the slides have been updated compared to the paper in ISFPGA’05

Date post:	20-Dec-2015