Estimation and Reduction of Power Consumption 4/21/14
Olivier Sentieys
Low-Power VLSI Circuits and Systems
Olivier Sentieys
ENSSAT - Université de Rennes 1, IRISA/INRIA
sen#[email protected]
Équipe-projet CAIRN
http://www.irisa.fr/cairn
Power estimation and reduction
1. Why care about power?
  – Heat dissipation
  – Limited energy in portable systems
  – Watt is the problem?
2. Where does power go in CMOS chips?
  – Digital integrated circuits
  – Microprocessors, DSPs, ...
3. How to estimate power?
4. How to reduce power?
  – Hardware and Software
5. Conclusions
Technological evolutions
• DEC/Compaq processor family [Herrick99]
• EV4
  – 200 MHz @ 3.3V – 16 gate delays per cycle – 30W @ 200 MHz & 3.3V – 1.7 million transistors – 233 mm2
• EV5 (21164)
  – 350 MHz @ 3.3V – 14 gate delays per cycle – 60W @ 350 MHz & 3.3V – 9.3 million transistors – 298 mm2
• EV6 (21264)
  – 575 MHz @ 2.2V – 12 gate delays per cycle – 90W @ 575 MHz & 2.2V – 15.2 million transistors – 314 mm2
• EV7 (21364)
  – > 1000 MHz @ 1.5V – 100W – ~100 million transistors – ~350 mm2
• EV8 (never fabricated…)
  – > 1-2 GHz (0.125 micron) – <150W – ~250 million transistors
• Intel Pentium 4 [Intel 2000]
  – 1.5 GHz, 0.18 micron
  – Power reduction of P4 using clock gating and power gating of unused blocks
  – Thermal sensors were embedded on the chip to cut the CPU in case of overheating!
  – 55 Watts at 1.5 GHz (instead of 90 Watts)
1. Heat Dissipation
• Undesirable effects
  – Decrease in performance and reliability
    • MTBF is halved for every +10°C
  – Increase in cost (cooling)
    • 1€/W when >40W
  – Increase in volume and weight
    • heat-sink, fan, batteries, …
• Will technological evolutions solve the problem?
Technological evolutions
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998 to 2014]
Technological evolutions
• Projections
  – ... 2000 Watts, 3000 A!
  – Chip area, transistor count or frequency must be kept constant to stay below limits of 100-200W and 300-500A
[Figure: power (Watts) and power supply current Icc (A) versus year, 1985 to 2010, logarithmic scale, with Vdd scaling; data points: 386, 486, Pentium Pro, PII, PIV]
2. Portability (1)
• Multimedia
  – Audio/Video encoding
  – Audio/Video decoding
• Interfaces
  – Voice recognition
  – Inertial pen, touch screen
• Encryption
• Mobility
  – LTE, UMTS, EDGE, GSM
  – Internet Protocol
  – WiFi
[Figure: terminal functions: Tx. Radio, Rx. Radio, Graphics, Video, Voice, Interface]
Battery technologies
• Battery performance [P. Senn 2000]
[Figure: Whr/kg (0 to 250, lighter) versus Whr/l (0 to 400, smaller) for NiCd, NiMh, Lithium-Ion Liquid, Lithium-Ion Polymer, LTC Lithium-Ion Polymer, LTC Lithium-Alloy Polymer]

Technologies          NiCd     NiMh     Li-ion    Li-poly
Year of production    1956     1990     1992      1996
Voltage (V)           1.2      1.2      3.6       3.7
Thickness (mm)        >6       >6       >6        3
Capacity (Whr/kg)     30-50    60-90    70-140    115-140
Lifetime (cycles)     ~1000    ~1000    500       500

• Typical example
  – 500 mAh Li-Pol = 1.7 Wh
• High-capacity example
  – 1400 mAh Li-Pol = 5 Wh
• For 10-hour autonomy
  – P < 400 mW
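The link between battery capacity and the allowed average power can be sketched as below. This is only an illustration: the function names are hypothetical, and the 3.6 V cell voltage used in the example is taken from the Li-ion column of the table, not from the capacity examples themselves.

```python
def battery_energy_wh(capacity_mah: float, voltage_v: float) -> float:
    """Stored energy in Wh from capacity (mAh) and nominal cell voltage (V)."""
    return capacity_mah / 1000.0 * voltage_v


def power_budget_mw(energy_wh: float, autonomy_h: float) -> float:
    """Average power (mW) that drains energy_wh in exactly autonomy_h hours."""
    return energy_wh / autonomy_h * 1000.0


# The two battery examples from the slide, for a 10-hour autonomy target.
typical_mw = power_budget_mw(1.7, 10.0)        # 500 mAh Li-Pol
high_capacity_mw = power_budget_mw(5.0, 10.0)  # 1400 mAh Li-Pol
print(typical_mw, high_capacity_mw)
```

Under these assumptions the two examples give budgets of a few hundred milliwatts, the same order of magnitude as the 400 mW bound quoted on the slide.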
3G Terminal
• Processing
  – digital baseband, video, graphics, etc.
  – >10 GOPS (Giga Operations Per Second)
• Battery life: 10h
• Weight: 100g (batteries)
[Figure: terminal functions: Tx. Radio, Rx. Radio, Graphics, Video, Voice, Interface]
500mW @ 6 GIPS, i.e. 12 GIPS/W @ 6 GIPS
• With current processors
  – 30 kg or 10 minutes!!!
  – ... with 10s of DSPs!!!
• Dedicated System-on-Chip
Conclusion (power)
• Technology evolution
  – Increase in transistor density
  – Increase in clock frequency
• Power density of ICs is still increasing despite:
  – Supply voltage decrease
  – Better design methods
• Limitations?
  – Heat dissipation limits
    • 100W/cm2 is a hard limit…
  – Limitations due to applications
    • Portable computers, smartphones
    • Embedded systems (e.g. drones, satellites)
    • Ultra-low power systems (e.g. sensor networks)
    • Data-centers, telecommunication base-stations, Internet routers, etc.
Conclusion (energy)
• Limited battery evolution
  – 10 - 15% per year
• Evolution of power/energy of integrated circuits
  – 35 - 40% per year
• Consequence
  – There is a significant gap between battery technologies and the current energy efficiency of electronic chips
Power estimation and reduction
1. Why care about power?
  – Heat dissipation
  – Limited energy in portable systems
  – Watt is the problem?
2. Where does power go in CMOS chips?
  – Recap
  – Microprocessors, DSPs, ...
3. How to estimate power?
4. How to reduce power?
  – Hardware and Software
5. Conclusions
Recap: Metrics
• Delay (sec)
  – Performance metric
• Energy (Joule)
  – Efficiency metric: effort to perform a task
• Power (Watt)
  – Energy consumed per unit time
• Power*Delay (Joule)
  – Mostly a technology parameter
  – Measures the efficiency of performing an operation in a given technology
• Energy*Delay = Power*Delay^2 (Joule-sec)
  – Combined performance and energy metric
  – Figure of merit of design style
• Other Metrics: Energy-Delay^n (Joule-sec^n)
  – Increased weight on performance over energy
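As a small illustration of how these metrics relate, the sketch below derives them from a power and a delay figure. The function name and the two design points are hypothetical and only serve to show that different metrics can rank the same designs differently.

```python
def design_metrics(power_w: float, delay_s: float, n: int = 2) -> dict:
    """Return the recap metrics for one design point."""
    energy = power_w * delay_s  # Power*Delay (Joule)
    return {
        "pdp": energy,                   # Power*Delay
        "edp": energy * delay_s,         # Energy*Delay = Power*Delay^2
        "ed_n": energy * delay_s ** n,   # Energy-Delay^n
    }


# Hypothetical designs: A is fast and power hungry, B is slow and frugal.
design_a = design_metrics(power_w=1.0, delay_s=10e-9)
design_b = design_metrics(power_w=0.3, delay_s=25e-9)
print("lower energy per operation:", "A" if design_a["pdp"] < design_b["pdp"] else "B")
print("lower energy-delay product:", "A" if design_a["edp"] < design_b["edp"] else "B")
```

With these made-up numbers, B wins on energy per operation while A wins on Energy*Delay, which is why the choice of metric matters.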
Recap: Power Equations in CMOS
$P = P_{dyn} + P_{sc} + P_s$
• Dynamic power: Pdyn
  – Charge and discharge of circuit capacitance
  – $P_{dyn} = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f$
• Short circuit power: Psc
  – Short circuit path in static logic cells (Vdd → Vss) during switching
  – Strongly depends on rise time and on Vth (NMOS/PMOS)
• Static power: Ps
  – Sub-threshold leakage current
  – Source/Drain-Bulk junction leakage (diodes)
Recap: Power Equations in CMOS
$P = \alpha f C_L V_{DD}^2 + V_{DD} I_{peak} (P_{0 \to 1} + P_{1 \to 0}) + V_{DD} I_{leak}$
• First term: dynamic power (≈ 40-70% today and decreasing relatively)
• Second term: short-circuit power (≈ 10% today and decreasing absolutely)
• Third term: leakage power (≈ 20-50% today and increasing)
$P = \frac{energy}{operation} \times rate + static\ power$
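The three-term power equation can be evaluated numerically as in the sketch below. The function names, the parameter `p_switch` (standing for P0→1 + P1→0) and the example values are assumptions made for illustration only.

```python
def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Dynamic term: alpha * f * C_L * VDD^2 (Watt)."""
    return alpha * c_load * vdd ** 2 * freq


def total_power(alpha, c_load, vdd, freq, i_peak, p_switch, i_leak) -> float:
    """Sum of the dynamic, short-circuit and leakage terms of the slide's equation."""
    short_circuit = vdd * i_peak * p_switch  # VDD * Ipeak * (P0->1 + P1->0)
    leakage = vdd * i_leak                   # VDD * Ileak
    return dynamic_power(alpha, c_load, vdd, freq) + short_circuit + leakage


# Hypothetical node: alpha = 0.25, 10 pF switched capacitance, 500 MHz.
p_high = dynamic_power(0.25, 10e-12, 1.2, 500e6)
p_low = dynamic_power(0.25, 10e-12, 0.6, 500e6)
print(p_high, p_low / p_high)  # halving VDD quarters the dynamic term
```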
Recap: Activity
• Probability propagation
[Figure: gate network with inputs A, B, C, internal node X and output S]
P(A) = 1/2, P(B) = 1/2, P(C) = 1/2
P(X=1) = 1/4
P(S=1) = 1/2 . 3/4 = 3/8
αx = P(X=0) . P(X=1) = (1 - P(X=1)) . P(X=1) = (1 - 1/4) . 1/4 = 3/16
αs = P(S=0) . P(S=1) = (1 - P(S=1)) . P(S=1) = (1 - 3/8) . 3/8 = 5/8 . 3/8 = 15/64
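The propagation above can be reproduced with exact fractions. The sketch assumes independent inputs and one gate arrangement that is consistent with the products on the slide (X = A AND B, S = C AND NOT X); the actual gates in the figure are not recoverable from the text, so treat this structure as an assumption.

```python
from fractions import Fraction


def p_and(pa: Fraction, pb: Fraction) -> Fraction:
    """P(output = 1) of an AND gate with independent inputs."""
    return pa * pb


def p_not(p: Fraction) -> Fraction:
    """P(output = 1) of an inverter."""
    return 1 - p


def activity(p_one: Fraction) -> Fraction:
    """Switching activity of a node: P(node = 0) * P(node = 1)."""
    return (1 - p_one) * p_one


half = Fraction(1, 2)
p_x = p_and(half, half)        # P(X=1) = 1/4
p_s = p_and(half, p_not(p_x))  # P(S=1) = 1/2 * 3/4 = 3/8
print(activity(p_x), activity(p_s))
```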
Recap: Glitches
• Glitch
  – Dynamic hazards
  – Useless switching
  – Significant wasted power
[Figure: the same circuit (inputs A, B, C, node X, output S) with waveforms of X and S when ABC switches from 101 to 000]
Where is power dissipated in CMOS chips?
Operators? Clock? Logic? Memory? RF? LCD, HDD, etc.?
Where is power dissipated?
• Internal Memory
  – Cache
  – Scratch pad
• External Memory
  – DRAM, Flash
  – Include power consumed by I/O pads
• Ex. Itanium 2
  – 60-70% of area is due to caches
Where is power dissipated?
• Clock
  – 40-50% of dissipated power is due to clock tree and clock drivers
• Example: DIGITAL Corp. Alpha 21164 processor, 1995
  – 2.5M gates, 9.3M transistors, 298 mm2
  – 300 MHz, 64/128 bits, 0.5µ
  – 60W @ 3.3V
[Figure: die photo showing the clock generator and clock drivers]
Where is energy consumed?
• H263 Image Coding
  – 1500K operations and 500K memory transfers
  – Energy(mem. transfer) ~ 33x Energy(operation)
  – Energy due to memory ~ 10x Energy due to processing
[Figure: energy per operation for Add, Mult, RAM Read, RAM Write, I/O, Memory Transfer (Off-Chip)]
Power Breakdown
• Portable Computer
  – Total power (Word application): 19.1W
[Pie chart: shares of 37%, 20%, 3%, 10% and 30% across CPU, LCD, Video, HDD and Logic]
[Source Hitachi]
Power Breakdown
• GSM Terminal

Mode                           Radio    Power Ampli    Base Band Codec
Paging Mode (80% of time)      40%      0%             60%
Speaking Mode (20% of time)    15%      50%            35%
Total                          20%      40%            40%

[Source Philips]
Power estimation and reduction
1. Why care about power?
2. Where does power go in CMOS chips?
3. How to estimate power?
  – Activity
  – CAD tools
4. How to reduce power?
  – Hardware and Software
5. Conclusions
Activity
• Flip-flop power: 9.5µW/MHz for each 0-to-1 output transition
[Figure: 8-bit register, inputs D0 to D7, outputs Q0 to Q7, common clock CLK]
Example: Registers
• Statistical approach
  – Random signal (white noise) at the inputs to estimate power
  – On average, 4 flip-flops switch, 2 of them from 0 to 1
  – Power: 2 x 9.5 = 19µW/MHz
[Figure: 8-bit register (D0 to D7, Q0 to Q7, CLK) driven by white noise]
Example: State Register
• Probabilistic approach
  – Transition probability depends on signal activity
• Example: Binary Coding of a State Register
  – 1 + 1/2 + 1/4 + 1/8 ... = 2
  – Power: 2 x 9.5/2 = 9.5µW/MHz
[Figure: 8-bit register (D0 to D7, Q0 to Q7) with binary coding of states]
Example: State Register
• Probabilistic approach
  – Transition probability depends on signal activity
• Example: Gray Coding of a State Register
  – 1/2 + 1/4 + 1/8 ... = 1
  – Power: 9.5/2 = 4.75µW/MHz
[Figure: 8-bit register (D0 to D7, Q0 to Q7) with Gray coding of states]
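The two series can be checked by counting bit toggles over a full counting cycle. The sketch below assumes an 8-bit state register that steps through its states in sequence (wrapping around at the end) and uses the reflected Gray code; the helper names are illustrative.

```python
def gray(value: int) -> int:
    """Reflected binary Gray code of value."""
    return value ^ (value >> 1)


def average_toggles(codes: list) -> float:
    """Mean number of bits that flip between successive codes, wrap-around included."""
    successors = codes[1:] + codes[:1]
    total = sum(bin(a ^ b).count("1") for a, b in zip(codes, successors))
    return total / len(codes)


BITS = 8
UW_PER_MHZ = 9.5  # flip-flop power per 0-to-1 output transition, from the slides

binary_codes = list(range(1 << BITS))
gray_codes = [gray(v) for v in binary_codes]

for name, codes in (("binary", binary_codes), ("gray", gray_codes)):
    toggles = average_toggles(codes)
    # Over a full cycle, half of the toggles are 0-to-1 transitions.
    print(name, toggles, toggles / 2 * UW_PER_MHZ, "uW/MHz")
```

For 8 bits the binary average is slightly below 2 because the series is truncated, while the Gray average is exactly 1.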
CAD Tools
[Figure: abstraction levels (Algorithmic, Architecture, RTL, Gate, Switch) against accuracy of power estimation, potential for power optimisation and speed of power optimisation; annotated values: 20%, x5-x10, 50%, 10%]
Tools named in the figure:
– SPICE et al., Epic/PowerMill, Avant!/ADM, Mentor Graphics/Lsim Power Analyst
– Synopsys/DesignPower, Sente/WattWatcher Architect, HLDS, Cadence/Top-Down Design Planner
– Epic/AMPS
– Synopsys/DesignPower/PrimePower, Veritools/Power_tool, Sente/WattWatcher Gate, Xpower
– Synopsys/Power Compiler
– Research (marked at three places in the figure)
Transistor-Level CAD Tools
• Accurate estimation
  – SPICE (and variants)
  – PowerMill (Epic/Synopsys), ADM (Avant!), LSIM Power (Mentor)
• Optimisation tools
  – Optimisation by transistor sizing: AMPS (Epic/Synopsys)
  – Reliability: RailMill, Thunder&Lightning
• Advantages
  – Highest accuracy
  – Easy to perform
• Limitations
  – Long simulation time
  – Limited to small blocks (100-10k transistors)
[Figure: power measurement setup: device under test (In, Cl, Vdd, Vss) supplied through a source Vs = 0 whose current Is is mirrored (ßIs) into Ry and Cy to produce the power value]
Logic-Level Estimation
• Two techniques
  – Statistical Estimation
  – Probabilistic Estimation
[Figure: two flows ending in a power report. Statistical: input stimuli, gate-level simulation, node activity monitoring, average. Probabilistic: input activity, activity propagation, analysis]
Quality of testbench is crucial!
Two delay models
• Zero-delay model
• Real-delay model
Glitches / Hazards
• Typically 20% of power is due to glitches
• Up to 70% in arithmetic operators (e.g. adders, multipliers)
Logic-Level CAD Tools
• Numerous examples
  – PrimePower (Epic/Synopsys), DesignPower (Synopsys)
  – QuickPower (Mentor Graphics), WattWatcher/Gate (Sente)
  – PowerCompiler (Synopsys): gate-level optimisation
• Advantages
  – Faster than transistor-level tools
  – Rely on existing logic simulators (e.g. ModelSim)
  – Probabilistic estimation for early and quick estimation
• Limitations
  – Interconnection (wire) models
  – Glitch estimation is limited by simulator precision
  – Speed and block size are still limited (full chip estimation is not possible)
DesignPower (Synopsys)
• Gate-level analysis
• Estimation of dynamic power (switching power, internal cell power) and static (leakage) power
• Probabilistic or statistical
[Figure: the gate-level netlist and switching information feed the estimation, through either of two analyses]
• Probabilistic Analysis
  – Default values or probabilities
  – Fast
• Simulation Analysis
  – Full-timing gate-level simulation
  – Time consuming
  – Accurate
DesignPower (Synopsys)
• Design Flow
  – Power estimation
  – Power constraints during logic synthesis
[Figure: design flow with power analysis alongside: HDL Design; Optimization (Timing, Area, Power); Optimization (Timing and Area); Gate-Level Simulation; Physical Design]
Power Models in DesignPower
• Switching Power
  $P_c = \frac{V_{dd}^2}{2} \sum_{\forall nets(i)} \left( C_{load_i} \times TR_i \right)$
  – $C_{load_i}$: load of net i
  – $TR_i$: toggle rate of net i
• Internal Cell Power
  – e.g. gate with 2 inputs A, B and 1 output Z
  $P_{internal\ cell} = f_Z(C_{load_Z}, WeightAvg_{trans}) \times TR_Z$
  $WeightAvg_{trans} = \frac{\sum_{i=A,B} TR_i \cdot Trans_i}{\sum_{i=A,B} TR_i}$
  – $WeightAvg_{trans}$: weighted average transition time for output Z
  – $Trans_i$: transition time of pin i
  – $TR_i$: toggle rate of pin i
• Total Dynamic Power
  $P_{dynamic} = \sum_{\forall cells(i)} P_{internal\ cell_i} + \frac{V_{dd}^2}{2} \sum_{\forall nets(i)} \left( C_{load_i} \times TR_i \right)$
• Leakage Power
  $P_{leakage} = \sum_{\forall cells(i)} P_{cell\ leakage_i}$
Power estimation and reduction
1. Why care about power?
2. Where does power go in CMOS chips?
3. How to estimate power?
4. How to reduce power?
  – Design flow and principles
  – Architecture-level optimisation
  – Software estimation and optimization
  – System-level optimisation
5. Conclusions
How to reduce power?
$P_{dyn} = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f$
Well… just need to reduce α, Cl, Vdd and f! How?
• Reduce Vdd (as low as possible)
• Minimize effective capacitance Ceff = α Cl
• Trade off performance against power by playing with clock frequency f
• And do not forget leakage!
Where?
• At each abstraction level
[Figure: abstraction levels, from a behavioural description such as SUM := A1+B1 down to the circuit]
Reducing Vdd
• Vdd has a quadratic effect on power
• Propagation delay increases if Vdd is reduced
  – but the power-delay product still improves (it decreases with Vdd)
[Figure: relative delay td and relative Pdyn versus supply voltage (VDD), from 0.8 V to 2.4 V]
Delay (td) and dynamic power (Pdyn) are functions of VDD
$P_{dyn} = \alpha \cdot C_L \cdot V_{dd}^2 \cdot f$
$t_d = \frac{C_L \cdot V_{dd}}{I_{ds}} = \frac{C_L \cdot V_{dd}}{k \frac{W}{L} (V_{dd} - V_t)^2}$
$t_d \propto 1 / V_{dd}^{n}$, with $n < 2$
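The trade-off between delay and dynamic power can be tabulated from the two expressions above. This is a sketch under stated assumptions: a threshold voltage of 0.5 V and a reference supply of 2.4 V are chosen for illustration, and the clock frequency is held constant when comparing power.

```python
def relative_delay(vdd: float, vt: float, vdd_ref: float) -> float:
    """Delay normalised to the delay at vdd_ref, using t_d ~ Vdd / (Vdd - Vt)^2."""
    def delay(v: float) -> float:
        return v / (v - vt) ** 2
    return delay(vdd) / delay(vdd_ref)


def relative_dynamic_power(vdd: float, vdd_ref: float) -> float:
    """Dynamic power normalised to vdd_ref at constant alpha, C_L and f."""
    return (vdd / vdd_ref) ** 2


VT = 0.5       # assumed threshold voltage (V), not given on the slide
VDD_REF = 2.4  # assumed reference supply (V), upper end of the plotted range

for vdd in (2.4, 2.0, 1.6, 1.2, 0.8):
    d = relative_delay(vdd, VT, VDD_REF)
    p = relative_dynamic_power(vdd, VDD_REF)
    print(f"Vdd={vdd:.1f} V  delay x{d:.2f}  Pdyn x{p:.2f}")
```

The printout shows delay growing slowly at first and then sharply as Vdd approaches Vt, while power falls quadratically throughout.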
Power Dissipation and Circuit Delay
[Figure: two 3-D plots, power (W) and delay (s), versus Vth (V) and VDD (V), with operating points A and B marked] [Sakurai03]
Multiple Vdd
• Main idea
  – Use of different supply voltages within the same design
  – High Vdd for critical parts (high performance needed)
  – Low Vdd for non-critical parts (only low performance demands)
• Usually two different VDD (but more are possible)
• Need for level converters
  – Necessary when a module at lower supply drives a gate at higher supply (step-up)
  – If a gate supplied with VddL drives a gate supplied with VddH, then the PMOS never turns off
[Figure: level shifter between VDDL and VDDH, with input Vin and output Vout]
Data Paths
• Data propagate through different data paths between registers (flip-flops, FF)
• Paths mostly differ in propagation delay times
• Frequency of clock signal (CLK) depends on path with longest delay → critical path
[Figure: three stages of flip-flops clocked by CLK, connected by combinational paths]
Data Paths: Slack
[Figure: gate G1 (inputs A, B, output Y) drives gate G2 (inputs Y, C). Timeline: all inputs of G1 arrived; delay of G1; G1 ready with evaluation; slack for G1; all inputs of G2 arrived]
Multiple VDD in Data Paths
• Minimum energy consumption when all logic paths are critical (same delay)
• Possible algorithm: clustered voltage-scaling
  – Each path starts with VddH and switches to VddL when slack is available
  – Level conversion in flip-flops at end of paths
[Figure: data paths with gates connected to VDDH and gates connected to VDDL]
Reducing Vdd
• Compensate for Vdd reduction, which decreases performance, by architectural optimizations
  – Example: 16-bit architecture of a Viterbi decoder
[Figure: reference datapath: registers A, B and C clocked at Tclk around an adder and a comparator]
$P_{ref} = C_{ref} \cdot V_{ref}^2 \cdot F_{ref} = 14.7$ mW
Cref = α . Σ Ci
Fref = 1/40ns = 25MHz
Vref = 5V
Area = 0.44 mm2
Parallel Architecture
[Figure: two copies of the datapath (registers A, B, C around the adder and comparator), each clocked at 2•Tclk, followed by a MUX switching at Tclk]
$P_{parallel} = (2.15 C_{ref}) \cdot (0.58 V_{ref})^2 \cdot (0.5 F_{ref}) \approx 0.36 P_{ref} = 5.3$ mW
Cparallel = 2.15 Cref
Fparallel = 1/80ns
Vdd parallel = 2.9V
Area = 0.87 mm2
Pipelined Architecture
[Figure: datapath with registers A, B and C and an additional pipeline register stage, all clocked at Tclk]
$P_{pipeline} = (1.15 C_{ref}) \cdot (0.58 V_{ref})^2 \cdot F_{ref} \approx 0.39 P_{ref} = 5.7$ mW
Cpipeline = 1.15 Cref (area advantage)
Fpipeline = Fref
Vdd pipeline = 2.9V
• Pipelined/Parallel Architecture
  – Vdd = 2V, 2.2Cref, P = 0.2Pref
  – Divide power by 5 at the cost of doubling area
Summary: Approximate Trend

                 N-parallel proc.         N-stage pipeline proc.
Capacitance      N.Cref                   Cref
Voltage          ≈Vref/N                  ≈Vref/N
Frequency        fref/N                   fref
Dynamic Power    Cref.Vref^2.fref/N^2     Cref.Vref^2.fref/N^2
Chip area        N times                  10-20% increase
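The figures quoted for the Viterbi decoder example follow directly from P = C·V²·F expressed relative to the reference design. The sketch below recomputes them; the function name is illustrative, and the combined pipelined/parallel case assumes the frequency is halved as in the parallel design, which the slides do not state explicitly.

```python
P_REF_MW = 14.7  # reference power from the slides (Vref = 5 V, Fref = 25 MHz)


def scaled_power(c_ratio: float, v_ratio: float, f_ratio: float) -> float:
    """Power relative to the reference: (C/Cref) * (V/Vref)^2 * (F/Fref)."""
    return c_ratio * v_ratio ** 2 * f_ratio


parallel = scaled_power(2.15, 0.58, 0.5)      # two datapaths at half frequency
pipeline = scaled_power(1.15, 0.58, 1.0)      # extra registers, same frequency
combined = scaled_power(2.2, 2.0 / 5.0, 0.5)  # Vdd = 2 V; halved frequency is an assumption

for name, ratio in (("parallel", parallel), ("pipeline", pipeline), ("combined", combined)):
    print(f"{name}: {ratio:.2f} x Pref = {ratio * P_REF_MW:.1f} mW")
```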
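Gated Clock
• D Flip-Flop
$P_{FF} = \alpha P_{latches} + P_{clock}$
[Figure: transistor-level D flip-flop with input D, outputs Q and its complement, and the clock CLK and its complement driving the latches]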
Gated Clock
• Remove useless switching when the register value does not change
• State-machine modification
• Gain could be high
  – depends upon activity
• Careful design… (not fully synchronous)
[Figure: register with enable (Din, En from FSM, Clk) replaced by a register whose clock is gated by a gated cell: a latch on the gate signal combined with Clk]
Conditional Flip-Flop
Clock-on-Demand (COD) F/F [Hamada99]
• Clock is provided to F/F only when new data comes
[Figure: flip-flop with a CLK controller generating internal clocks CKI and CKIB from CLK, D and Q; waveforms of CLK, D, Q and CKI over 0 to 15 ns]
Leakage Power Optimization: Power Gating
• Objective
  – Reduce leakage currents by inserting a switch transistor (usually high Vth) into the logic stack (usually low Vth)
    • Switch transistors change the bias points (VSB) of transistors
• Most effective for systems with standby operational modes
  – 1 to 3 orders of magnitude leakage reduction possible
  – But switches add many complications
[Figure: three arrangements: a logic cell connected directly to Vdd; a logic cell above a switch cell (sleep) creating a virtual ground; a logic cell below a switch cell (sleep) creating a virtual Vdd]
Leakage Power Optimization: Power Gating
• Memory is a great source of leakage
• Switch off memory banks when they are unused
[Figure: memory banks Memory 1, Memory 2, ..., Memory N sharing Vdd, each with its own switched Gnd]
Leakage Power Optimization: Multi-Vth
• Trade off positive slack for reduced leakage power
  – Objective: reduce leakage power where speed is not needed
  – Optimization performed post-route
  – Cells along paths with positive slack are replaced with high-Vth cells
    • Leakage currents are reduced where timing margins permit
    • Re-route not required: new cells have the same footprint as previous cells
[Figure: netlist of low-Vth cells (L: high speed, high leakage) before optimization, and the same netlist with some cells replaced by high-Vth cells (H: reduced speed, low leakage)]
Operator Isolation
• Activate FUs only when necessary
[Figure, before: instruction register, FSM and gated cell control a register fed by ADD and MUL units. The multiplier consumes energy even if unused]
[Figure, after: a latch is inserted in front of MUL. Multiplier inputs are latched when unused]
Pre-computation (1)
• Principle: avoid use of power-hungry blocks when results can be pre-computed by a less-hungry one
[Figure: inputs A and B feed both a computation block and a pre-computation block, producing output S]
Pre-computation (2)
• Example: comparator A > B
• If the 2 MSBs are different, the result can be determined without subtracting A and B:
  – If A[MSB] != B[MSB] && A[MSB] == 0 then A > B is true (1)
  – If A[MSB] != B[MSB] && A[MSB] == 1 then A > B is false (0)
[Figure: input registers A and B (clocked by Clk) feed the A>B comparator, followed by an output register]
Pre-computation (3)
• Example: comparator A > B
  – If the 2 MSBs are equal, then subtraction is needed
  – Otherwise, result is MSB of B
[Figure: A[MSB] and B[MSB] are registered separately; a gated cell enables the clock of the A and B registers only when A[MSB] == B[MSB] (signal = 1); an output multiplexer selects between the comparator result and B[MSB]]
What is the average gain?
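The behaviour of the pre-computed comparator can be sketched in software as follows. The sketch assumes two's-complement operands (which is what the MSB rule on the slides implies) and uniformly distributed inputs for the gain estimate; the function names are hypothetical and the "full comparison" stands in for the hardware subtractor.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a two's-complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def precomputed_greater(a: int, b: int, bits: int):
    """Return (A > B, used_full_comparator) for two's-complement bit patterns."""
    msb_a = (a >> (bits - 1)) & 1
    msb_b = (b >> (bits - 1)) & 1
    if msb_a != msb_b:
        # Signs differ: the result is simply the MSB of B, no subtraction needed.
        return bool(msb_b), False
    # Same sign: fall back to the full comparison (the power-hungry block).
    return to_signed(a, bits) > to_signed(b, bits), True


BITS = 4
pairs = [(a, b) for a in range(1 << BITS) for b in range(1 << BITS)]
full_uses = sum(precomputed_greater(a, b, BITS)[1] for a, b in pairs)
print("fraction of comparisons needing the full comparator:", full_uses / len(pairs))
```

Under the uniform-input assumption the full comparator is needed for half of the input pairs, which gives one possible answer to the question on the slide; real data with correlated signs would change this figure.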
Post-computing (1)
• Principle: do not load state register if next state is identical to current state
[Figure: next-state function f feeds the state register (D, Q); a comparator "= ?" between next state and current state drives a gated cell on Clk]
Post-computing (2)
• Example
[Figure: state graph with states E0, E1, E2, E3 and transitions labelled A, !A&!B, !A&B; the state register clock is gated by a Post block computed from A and B]
State Coding
• Binary coding: higher activity, lower capacitance (area)
• Gray coding: lower activity, higher area
• State encoding depending on transition probability
  – If Prob(Ck) is high, then E1 and Ek should be assigned codes with a low Hamming distance
[Figure: state E1 with outgoing transitions Ci, Cj, Ck (Ci+Cj+Ck=1); Ck leads to state Ek]
Glitch Power Reduction
• Design a digital circuit for minimum transient energy consumption by eliminating hazards
[Figure: waveform with total transitions = 6, essential transitions = 2, glitch transitions = 4]
Differential Path Delay
[Figure: gate with inputs A, B and output C, gate delay D < DPD; the output shows a hazard or glitch]
DPD: Differential path delay
Balanced Path Delays
[Figure: same gate (inputs A, B, output C, delay D < DPD) with a delay buffer inserted on the early input to cancel the DPD; no glitch at the output]
Glitch Filtering by Inertia
[Figure: same gate (inputs A, B, output C) with gate delay D > DPD; the glitch is filtered]
Designing a Glitch-Free Circuit
• Maintain specified critical path delay
• Glitch suppressed at all gates by
  – Path delay balancing
  – Glitch filtering by increasing inertial delay of gates or by inserting delay buffers when necessary
[Figure: gate with delay D fed by two paths with delays d1 and d2]
Minimum transient energy condition: |d1 – d2| < D
Designing a Glitch-Free Circuit
• Logical path delay balancing
  – Logic synthesis (e.g. Power Compiler)
  – Example
    • S = a.b.c.d with p(a) = 0.3; p(b) = 0.4; p(c) = 0.7; p(d) = 0.5
    • AND: Pout = PA.PB
[Figure: two implementations built from 2-input AND gates. Chain ((a.b).c).d with node probabilities 0.12, 0.084, 0.042. Balanced tree (a.b).(c.d) with node probabilities 0.12, 0.35, 0.042. Labels: less activity, less glitches]
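The node probabilities in the example can be recomputed with the AND rule Pout = PA.PB. The sketch assumes independent inputs and a zero-delay activity model α = p(1 − p), so it captures the "less activity" side of the comparison but not glitches, which depend on path delays; variable names are illustrative.

```python
def activity(p_one: float) -> float:
    """Zero-delay switching activity of a node: P(0) * P(1)."""
    return (1.0 - p_one) * p_one


p = {"a": 0.3, "b": 0.4, "c": 0.7, "d": 0.5}

# Chain: ((a.b).c).d
chain_nodes = [p["a"] * p["b"]]
chain_nodes.append(chain_nodes[-1] * p["c"])
chain_nodes.append(chain_nodes[-1] * p["d"])

# Balanced tree: (a.b).(c.d)
tree_nodes = [p["a"] * p["b"], p["c"] * p["d"]]
tree_nodes.append(tree_nodes[0] * tree_nodes[1])

chain_activity = sum(activity(x) for x in chain_nodes)
tree_activity = sum(activity(x) for x in tree_nodes)
print([round(x, 3) for x in chain_nodes], round(chain_activity, 3))
print([round(x, 3) for x in tree_nodes], round(tree_activity, 3))
```

Under this model the chain has the lower total activity, while the tree has equal-depth paths, which is the property that limits glitches.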
Resource Sharing
• Resource sharing
  – reduces area but increases activity
    • destroys data correlation
• Bus Multiplexing
  – Nbt: number of bus transitions per cycle
[Figure: Counter 1 and Counter 2 each drive their own bus (Bus 1: 1110, 1111; Bus 2: 0000, 0001), versus both counters sharing one bus through a MUX (1110, 0000, 1111, 0001)]
  – Separate buses: Nbt = 2(1+1/2+1/4+...) = 4
  – Multiplexed bus: Nbt >= 4 (depends on counter skew)
Activity Reduction
• Bus encoding to reduce activity
  – e.g. bus between cache memory and processor
  – objective: reduce number of transitions
    • e.g. activity of binary > activity of Gray
[Figure: Encoding Logic, bus, Decoding Logic; Input Activity > Bus Activity < Output Activity]
Bus Encoding
• Bus-Invert Coding
  – Take advantage of correlation between successive bus values
  – Choose sending true or complement form of bus values to minimize toggles (based on Hamming distance)
    • Can break bus into fields and apply bus-invert coding to each field
[Figure: XOR-based bus-invert encoder]
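A minimal sketch of bus-invert coding follows, assuming one extra invert line per bus and an initial bus state of all zeros; the function names and the example word sequence are illustrative and not taken from the slides.

```python
def popcount(value: int) -> int:
    """Number of bits set to 1."""
    return bin(value).count("1")


def bus_invert_encode(words, width: int):
    """Encode words as (bus_value, invert_bit) pairs, limiting data-line toggles."""
    mask = (1 << width) - 1
    bus, frames = 0, []
    for word in words:
        # Invert when more than half of the data lines would toggle.
        if popcount((word ^ bus) & mask) > width // 2:
            bus, invert = ~word & mask, 1
        else:
            bus, invert = word & mask, 0
        frames.append((bus, invert))
    return frames


def bus_invert_decode(frames, width: int):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(bus ^ mask) if invert else bus for bus, invert in frames]


def data_toggles(values, start: int = 0) -> int:
    """Total number of data-line toggles when driving values in sequence."""
    total, prev = 0, start
    for v in values:
        total += popcount(v ^ prev)
        prev = v
    return total


words = [0x00, 0xFF, 0x0F, 0xF0]
frames = bus_invert_encode(words, 8)
print(data_toggles(words), data_toggles([bus for bus, _ in frames]))
```

The invert line itself also toggles, so a complete comparison should count it as well; the sketch only reports data-line toggles.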
Bus Encoding
• Gray encoding
• T0 (binary) code
  – Take advantage of address sequences
  – Add a redundant line to the bus (INC)
    • INC = 1 if B(t)==B(t-1)+1 (or +K); bus is kept constant and receiver increases value by 1 (or K)
    • Otherwise INC = 0 and B(t) is normally transferred
[Thesis excerpt, p. 58, chapter "State of the art of performance optimisation techniques"]
[Figure: XOR-based encoder (B1 to B4 in, G1 to G4 out) and decoder (G1 to G4 in, B1 to B4 out)]
Fig. 3.10 – Architecture of the encoder and decoder for the Gray code. Example of a 4-bit bus.
– Bn is the value of uncoded bit n;
– Bn+1 is the value of uncoded bit n + 1.
At the decoder, the equation to apply in order to recover the pure binary coding is the following:
$B_n = G_n \oplus B_{n+1}$ (3.2)
From these formulas, it is easy to build the architecture of the encoder and decoder, as shown in Figure 3.10.
The experiments carried out in [SD95] show a 33% reduction in activity and a 77% reduction in the energy consumed on the bus.
However, for wide buses (when n is large), the decoder has a long critical path, since the exclusive-OR gates are cascaded from the MSBs to the LSBs.
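The decoder equation (3.2) can be written out bit by bit, which also makes the MSB-to-LSB cascade visible. In this sketch the encoder uses the standard reflected Gray relation G = B XOR (B >> 1), which is assumed here because the encoder equation is not part of the excerpt; bit 0 is taken as the LSB.

```python
def gray_encode(binary: int) -> int:
    """Standard reflected Gray code: G_n = B_n XOR B_(n+1)."""
    return binary ^ (binary >> 1)


def gray_decode(gray: int, width: int) -> int:
    """Apply B_n = G_n XOR B_(n+1) from the MSB down to the LSB."""
    binary = 0
    for n in range(width - 1, -1, -1):
        bit_above = (binary >> (n + 1)) & 1  # B_(n+1), already decoded (0 above the MSB)
        bit = ((gray >> n) & 1) ^ bit_above
        binary |= bit << n
    return binary


WIDTH = 4
codes = [gray_encode(v) for v in range(1 << WIDTH)]
print([format(c, "04b") for c in codes])
```

Each decoded bit depends on the one above it, which is the software counterpart of the long critical path mentioned for wide buses.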
T0 code
In [BMM+97], the proposed idea is to add a wire, called INC, which is set to a defined logic level when the accessed addresses are consecutive. To do so, the address value at clock cycle t - 1 is stored and incremented by 1, then compared with the value arriving at clock cycle t. If these two values are identical, the bus state does not change and the additional INC wire is set to a defined logic level. Otherwise, the address value is sent on the bus. At the decoder, a selection is made, according to the state of the INC wire, between the value on the bus and the previous value incremented by 1.
This technique reduces activity to 0 when the accessed addresses are consecutive, which strongly reduces the power consumed on the bus.
An extension of this technique is proposed in [FPSS00], where several increment steps (+step) can be defined for consecutive accesses.
[Source watermark: tel-00445791, version 1 - 11 Jan 2010]
[Figure: T0 encoder and decoder; each uses a +K incrementer and a 0/1 multiplexer, the encoder compares ("=") B(t) with the incremented previous value to generate INC]
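The T0 scheme described above can be sketched as an encoder/decoder pair. The function names and the address trace are hypothetical; the first transfer is assumed to be sent with INC = 0 because no previous address exists.

```python
def t0_encode(addresses, step: int = 1):
    """Return (bus_value, inc) frames; the bus is frozen while addresses are sequential."""
    frames, bus, previous = [], None, None
    for address in addresses:
        if previous is not None and address == previous + step:
            inc = 1               # sequential access: bus lines keep their value
        else:
            inc, bus = 0, address  # otherwise the address is driven on the bus
        frames.append((bus, inc))
        previous = address
    return frames


def t0_decode(frames, step: int = 1):
    """Rebuild the address stream from (bus_value, inc) frames."""
    addresses, previous = [], None
    for bus, inc in frames:
        address = previous + step if inc else bus
        addresses.append(address)
        previous = address
    return addresses


trace = [100, 101, 102, 103, 200, 201]
frames = t0_encode(trace)
print(frames)
```

In this trace the bus value changes only twice (at 100 and at 200), at the cost of the extra INC line.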
Memory Optimization
• Place data which are accessed frequently in internal memory or in registers
• Minimize memory size (for leakage) by maximizing data reuse
[Figure: memory hierarchy by available memory space: Registers, Scratchpad, Cache, external memory, around the CPU]
Split Memory Access
[Figure: split memory access built from two 16K x 32 RAM blocks; signals: addr[14:0], addr[14:1], addr[0], pre_addr (15-bit register clocked by clock), write, noe, din, dout (32 bits)]
Reducing Energy of Software
• Embedded software determines power/energy consumed by the processor
  – So why not modify software to reduce energy?
• Energy, power or performance?
  – Energy = battery life-time
  – Power = supply voltage distribution sizing, heat dissipation
Sequence 1: MOV DX,[BX]; MOV AX,CX; MOV AX,DX
  Power: 1.15 W
  Energy: 8.6 x 10^-8 J
Sequence 2: NOP; MOV AX,CX; MOV DX,[BX]; NOP; NOP; NOP; NOP; MOV AX,DX; NOP
  Power: 0.99 W (14% less)
  Energy: 22.3 x 10^-8 J (158% more)
Reducing Energy of Software
• Performance = use of memory bandwidth
• Energy = use of registers or scratchpad memory, reduce activity
Source code:
  int a[1000];
  c=a;
  for (i=1; i<100; i++) {
    b += *c;
    b += *(c+7);
    c+=1;
  }
Version 1 (2096 cycles, 19.92 uJ):
  LDR r3, [r2, #0]
  ADD r3,r0,r3
  MOV r0,#28
  LDR r0,[r2,r0]
  ADD r0,r3,r0
  ADD r2,r2,#4
  ADD r1,r1,#1
  CMP r1,#100
  BLT LL3
Version 2 (2231 cycles, 16.47 uJ):
  ADD r3,r0,r2
  MOV r0,#28
  MOV r2,r12
  MOV r12,r11
  MOV r11,r10
  MOV r0,r9
  MOV r9,r8
  MOV r8,r1
  LDR r1,[r4,r0]
  ADD r0,r3,r1
  ADD r4,r4,#4
  ADD r5,r5,#1
  CMP r5,#100
  BLT LL3
Reducing Energy of Software
• Use of internal registers (20%)
• Use of internal memory (scratchpad is better than cache) (40%)
• Transformations to reduce the number of reads/writes to memory: be aware of the code you write
  – Loop permutation, unrolling, tiling, fusion, fission, ...
  – Gain from 40% to x5
• Compiler: instruction selection, scheduling, ...
Example of loop fusion [Marwedel02]
Before:
  FOR i:= 1 TO N DO B[i] := f(A[i]) ;
  FOR i:= 1 TO N DO C[i] := g(B[i]) ;
After:
  FOR i:= 1 TO N DO
    B[i] := f(A[i]) ;
    C[i] := g(B[i]) ;
  END ;
Software Power Estimation
• Power estimation: processor instruction power models
• Methods based on simulation
  – Program is simulated on a low-level (RTL) model of the processor
• Physical measurements
  – Measure current of instruction sequences
[Figure: measurement setup: power source, ammeter (A), system with CPU]
Software Power Estimation
• Instruction-level model
$E = \sum_i (Base_i \cdot N_i) + \sum_{i,j} (Overhead_{i,j} \cdot N_{i,j}) + \sum_k Energy_k$
• Measure on instruction sequences: Base_i
  – Instructions in a loop or sequence of instructions

SPARClite
Instruction    Current (mA)    Cycles    Energy (nJ)
NOP            198             1         3.26
LD             213             1         3.51
ST             346             2         11.40
ADD            199             1         3.28
MULT           198             1         3.26
Software Power Estimation
• Inter-instruction effect: Overhead_i,j
  – Previous state of processor influences energy of next instruction
  – e.g. 486DX2
    • XOR BX,1
    • ADD RX,DX // overhead: 6.8 mA
• Pipeline, cache miss: Energy_k
$E = \sum_i (Base_i \cdot N_i) + \sum_{i,j} (Overhead_{i,j} \cdot N_{i,j}) + \sum_k Energy_k$
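The instruction-level model can be applied to an instruction trace as sketched below. The base energies are the SPARClite values from the table; the overhead value, the extra-event energy and the trace itself are hypothetical placeholders, since the slides give no such figures for this processor.

```python
# Base energy per instruction (nJ), from the SPARClite table.
BASE_NJ = {"NOP": 3.26, "LD": 3.51, "ST": 11.40, "ADD": 3.28, "MULT": 3.26}


def program_energy_nj(trace, overhead_nj=None, other_events_nj=0.0) -> float:
    """Sum of base costs, inter-instruction overheads and other effects (nJ).

    overhead_nj maps (previous, current) instruction pairs to an extra cost;
    other_events_nj lumps together effects such as pipeline stalls or cache misses.
    """
    overhead_nj = overhead_nj or {}
    base = sum(BASE_NJ[op] for op in trace)
    inter = sum(overhead_nj.get(pair, 0.0) for pair in zip(trace, trace[1:]))
    return base + inter + other_events_nj


trace = ["LD", "LD", "ADD", "ST"]
print(program_energy_nj(trace))
# Hypothetical overhead of 0.5 nJ whenever ST follows ADD.
print(program_energy_nj(trace, overhead_nj={("ADD", "ST"): 0.5}))
```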
Example: TMS 320C54x
• Application note from TI
  – http://www.ti.com/sc/docs/apps/dsp/tms320c54x.html
• Measure of current while executing code
  – (a) Instructions are repeated
  – (b) Instructions in loops
(a) straight-line method:
  testloop
    I1
    I1
    ... (256 times)
    I1
    I1
    B testloop
(b) RPT method:
  testloop
    RPT #255
    I1
    B testloop
Example: TMS 320C54x
Columns: current (mA per MIPS), current at 50 MIPS (mA), power at 50 MIPS, 3V (mW), instructions/applications

mA/MIPS    mA      mW       Instructions/Applications
0          0       0        IDLE3
0.03       1.5     4.5      IDLE2
0.12       6       18       IDLE1
0.3        15      45       Repeat NOPs
0.4        20      60       Inline NOPs
0.8        40      120      Block data transfer in on-chip DARAM using RPT
1.0        50      150      Repeat MAC with changing data (dual-operand addressing)
1.2        60      180      Inline MAC with changing data (dual-operand addressing)
0.8        40      120      Repeat MACD with changing data (single-operand addressing)
1.0        50      150      Inline MACD with changing data (single-operand addressing)
0.9        45      135      Repeated double-precision arithmetic instructions with changing data
1.1        55      165      Inline double-precision arithmetic instructions with changing data
1.2        60      180      Repeat FIRS with changing data
0.9        45      135      Inline FIRS with changing data
0.9        45      135      FIR filter
1.03       51.5    154.5    Full-rate GSM vocoder
1.07       53.5    160.5    Complex 256-point FFT
StrongArm
• Intel StrongArm SA-1110 (ARM V4)
[Figure: die photos of the Compaq/Digital StrongARM and the Intel StrongARM]
Power Management (PM)
• Play with power mode of processors (systems)
  – Reduce supply voltage Vdd
  – Sleep or Idle modes
    • Switch off PLLs, clock drivers, peripherals
• Example: StrongArm SA1100
  – States: RUN (400 mW), IDLE (50 mW), SLEEP (160 µW)
  – Transitions: RUN to IDLE 10 µs; IDLE to RUN 10 µs; RUN to SLEEP 90 µs; IDLE to SLEEP 90 µs; SLEEP to RUN 160 ms
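Whether SLEEP pays off compared with IDLE depends on how long the processor has nothing to do. The sketch below estimates the break-even inactivity period from the state powers and transition times of the example. It assumes that the processor draws RUN power during every transition, which is a pessimistic simplification not stated on the slide, and that the 160 ms figure is the SLEEP-to-RUN wake-up time.

```python
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 160e-6  # Watt
T_RUN_IDLE = T_IDLE_RUN = 10e-6                # seconds
T_RUN_SLEEP, T_SLEEP_RUN = 90e-6, 160e-3       # seconds


def period_energy(period_s: float, p_mode: float, t_enter: float, t_exit: float) -> float:
    """Energy (J) spent over an inactivity period using one low-power mode."""
    t_transitions = t_enter + t_exit
    if period_s < t_transitions:
        raise ValueError("period shorter than the mode's transition times")
    # Assumption: transitions are charged at RUN power.
    return P_RUN * t_transitions + p_mode * (period_s - t_transitions)


def idle_energy(period_s: float) -> float:
    return period_energy(period_s, P_IDLE, T_RUN_IDLE, T_IDLE_RUN)


def sleep_energy(period_s: float) -> float:
    return period_energy(period_s, P_SLEEP, T_RUN_SLEEP, T_SLEEP_RUN)


def break_even_s() -> float:
    """Inactivity period at which IDLE and SLEEP cost the same energy."""
    a = T_RUN_IDLE + T_IDLE_RUN
    b = T_RUN_SLEEP + T_SLEEP_RUN
    return ((P_RUN - P_SLEEP) * b - (P_RUN - P_IDLE) * a) / (P_IDLE - P_SLEEP)


print(f"SLEEP beats IDLE for inactivity periods longer than {break_even_s():.2f} s")
```

Under these assumptions the break-even point is a little over one second, dominated by the long wake-up time from SLEEP.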
Static Power Management (SPM)
• Different operation modes [IBM]
  – Full-On
  – Normal-On
  – Standby
  – Suspend
  – Hibernation
  – Off
[Figure: state diagram of the modes, driven by an activity monitor]
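Dynamic Power Management (DPM)
• Reduce speed (clock freq.) and Vdd depending on processor activity (and therefore input data)
  – e.g. MPEG4 coder
[Figure: processor speed versus time, before and after DPM]
  – Before: full speed, then IDLE: $E = C V_H^2 + E_{idle}$
  – After: reduced speed over the whole period: $E = C V_L^2$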
Dynamic Power Management (DPM)
• Smart DPM of Vdd and Fclock
DPM Example: TransMeta Crusoe
• Crusoe processor: x86 clone running at 700 MHz max
• Processor activity detection by HW monitors
• OS adjusts Fclock and Vdd

Fclock (MHz)    Vdd       Power
700             1.65 V    100%
400             1.4 V     41%
333             1.2 V     25%
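The power column of the table is consistent with the dynamic power law P ∝ f·Vdd², as the short check below shows. It assumes that dynamic power dominates and that α and the switched capacitance are the same at all operating points; the function name is illustrative.

```python
def relative_power(f_mhz: float, vdd: float, f_ref: float = 700.0, v_ref: float = 1.65) -> float:
    """Dynamic power relative to the reference point, using P ~ f * Vdd^2."""
    return (f_mhz / f_ref) * (vdd / v_ref) ** 2


for f_mhz, vdd in ((700, 1.65), (400, 1.4), (333, 1.2)):
    print(f"{f_mhz} MHz @ {vdd} V: {100 * relative_power(f_mhz, vdd):.0f}%")
```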
DPM Example: TransMeta Crusoe
[Figure: % of max power consumption (0 to 100) versus frequency (300 to 1000 MHz), with a typical operating region and a peak performance region]
Operating points: 300 MHz at 0.80 V; 433 MHz at 0.87 V; 533 MHz at 0.95 V; 667 MHz at 1.05 V; 800 MHz at 1.15 V; 900 MHz at 1.25 V; 1000 MHz at 1.30 V
Source: Transmeta
Conclusions: Is it still necessary to convince you?
• New metrics for IC design
  – Power and/or energy become a major constraint
[Figure: design metrics: Flexibility, Power, Energy, Cost, Performance]
Conclusions
• Power consumption needs to be estimated and optimized at each abstraction level
• Reduce supply voltage (Vdd) while keeping performance acceptable
• Reduce activity of internal and external signals
A smart design will always consume less power
So design with your brain on!
Perspectives
• Technologies
  – Multi-Vth
  – SOI, SiGe, ...
• Architecture-level
  – Parallelism, pipeline, parallelism, pipeline, …
  – Reduce activity
  – Memory hierarchy
• System-level
  – Dynamic management of Vdd/Vth/Fclk
  – Efficient SW compilation