Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
Hiroshi Nakamura (Univ. of Tokyo), Hideharu Amano (Keio Univ.), Masaaki Kondo (Univ. of Electro-Communications), Mitaro Namiki (Tokyo Univ. of Agriculture and Tech.), Kimiyoshi Usami (Shibaura Inst. of Tech.)
JST-CREST ULP Workshop (H.Nakamura)
Objective and Strategy
Objective: drastic power reduction of high-performance system LSIs
Strategy: innovative power control through tight Co-Optimization / Co-Design of system software, architecture, and circuit design
Principle:
  Performance: limited by a bottleneck
  Power: summation of whole system
  Low power and slow operation for unhurried / idle parts
[Figure: design stack (System Software, Compiler, Architecture, Circuit Technology) linked by Co-Optimization / Co-Design]
Role of Design Hierarchy for Low Power
Design hierarchy and the question each level answers:
  OS: When?
  Architecture: Where?
  Circuit: How?
  Device
Throttle levers of power/performance: Clock Gating, Dual Vth, DVFS, Power Gating, Back-bias
Circuit Level: Provide levers to throttle performance / power
  DVFS, Power Gating, Back bias, ...
Architecture, OS Level: Find a chance to set levers, when and where?
  architecture: Intra-task/process optimization
  OS: Inter-task/process optimization
Preferable Throttle Lever
Effectiveness of Power Reduction
Low Overhead in Area, Performance, Power
  Controlling the throttle lever itself takes time and consumes power
Fine Control Granularity in both Space and Time
  Locations of busy / idle parts are small and change frequently
[Figure: System LSI containing a Reconfig System, Processors (int, fp, cache), Memory, and Network, with busy / idle parts changing over time]
Example of Throttle Levers
for dynamic power: Clock Gating, DVFS
  both effective, DVFS in particular (Power ∝ Vdd^2)
  Clock Gating: very fine-grained control with little overhead
    easily utilized within circuit level design
  DVFS: tens of μs to change Vdd through regulator
    moderate granularity
for leakage power: Power Gating, Body Biasing
  both effective, but large overhead in power and performance
  Body biasing: spatial granularity
    statically defined regions
  not easy for fine-grained control
[Figure: Power Gating: a circuit block between Vdd and VGND, with a sleep transistor between VGND and GND driven by the sleep signal]
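The Vdd^2 relation noted for DVFS can be illustrated with a minimal sketch. It assumes the usual first-order CMOS model P = a * C * Vdd^2 * f; the function name and every number below are hypothetical and are not taken from the slides.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS dynamic power, P = a * C * Vdd^2 * f (assumed model)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating points: DVFS lowers Vdd and frequency together.
nominal = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=1.2, freq_hz=200e6)
scaled = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=0.9, freq_hz=100e6)

# Power falls with the square of the voltage ratio times the frequency ratio.
ratio = scaled / nominal
print(f"scaled / nominal dynamic power = {ratio:.3f}")
```

In this model clock gating corresponds to lowering `activity` for an idle block, while DVFS changes `vdd_v` and `freq_hz`.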
Role of Design Hierarchy for Low Power: The Ideal
[Figure: design hierarchy (System, OS, Architecture, Circuit, Device) annotated with When? / Where? / How?]
Spatial and Temporal Granularity is important
Co-Design of Circuit, Architecture and OS for Power
Co-Optimization of Throttle Lever Control:
  especially, Co-Optimization of Spatial and Temporal Granularity
  ex. activity localization to make full use of throttle lever characteristics by architecture/OS
Team Formation of our Research Project
Sub-theme (leader):
  Co-operative System Software with Arch. (Prof. Namiki)
  Ultra Low-Power Reconf. Architecture (Prof. Amano)
  Data Resident Architecture (Prof. Nakamura)
  Data Resident Compiler (Prof. Kondo)
  Ultra Low-Power Circuit Design (Prof. Usami)
Layers: System Software, Architecture/Compiler, Circuit Design
  Co-Optimization of System Software and Architecture
  Co-Optimization of Architecture and Circuit Design
[Figure: system LSI blocks (Reconfig, Network, Memory, Processor with int, fp, cache) and a VddH / VddL logic block]
(Project 1) Geyser: Low Power Processor through Fine-grained Runtime Power Gating
Target: Leakage Power
Background: Leakage reduction techniques so far
  Standby time: power-gating (Coarse Grain)
  Runtime: Cache-decay, Drowsy-cache (Coarse Grain in temporal)
Leakage for logic parts (ALU, multiplier, etc.) gets serious
  Fast but leaky transistors are used
  Active ratio of those parts is not necessarily high, but active parts change frequently, that is, cycle by cycle
Objective: Reduce runtime leakage power of logic parts
Challenge: how to optimize the granularity of power gating
Instruction Pipeline with Power-Gating
Geyser: MIPS compatible processor with 5-stage pipeline
Straightforward PG (power-gating)
  Turn EX-units into active mode only if necessary
  EX-unit gets active when an affecting instruction enters the IF stage
  The activated EX-unit returns to sleep mode after execution
[Figure: MIPS R3000 pipeline (IF, ID, EX, MEM, WB) with EX-units ALU, Shift, Mult, Div; the instruction is inspected to detect which unit will be used, and a wake-up signal is sent]
Challenges for Run-Time Power-Gating: Energy Overhead
[Figure: power versus time across a Sleep and Wake-Up sequence, relative to Normal Leakage]
  Areas 1 + 3: energy overhead
  Area 2: part of leakage saving
  Break-Even Time (BET): 1 + 3 = 2
  Area 4: net energy saving
Sleep period should be longer than BET
  Otherwise, total energy consumption increases
BET tells the smallest granularity for Power Gating
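The break-even condition can be sketched numerically. This is a simplified model, assuming the saved leakage power is constant during sleep (the slide's figure shows a gradual transition); the function names and values are hypothetical.

```python
def break_even_time(overhead_energy, saved_power):
    """Sleep time at which the saved leakage equals the sleep/wake-up overhead."""
    if saved_power <= 0:
        raise ValueError("saved_power must be positive")
    return overhead_energy / saved_power


def net_energy_saving(sleep_time, overhead_energy, saved_power):
    """Positive only when sleep_time exceeds the break-even time."""
    return saved_power * sleep_time - overhead_energy


# Hypothetical unit: overhead of 40 energy units, 2 energy units saved per cycle.
bet = break_even_time(overhead_energy=40.0, saved_power=2.0)
for sleep in (10, 20, 50):
    print(sleep, net_energy_saving(sleep, 40.0, 2.0))
```

A sleep period shorter than `bet` yields a negative result, which corresponds to total energy consumption increasing.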
Break Even Time of Each Functional Unit
[Figure: bar chart of BET in cycles @ 200 MHz for ALU, Shift, Mult, Div, and CP0 at 25℃, 65℃, 100℃, and 125℃; 90 nm technology]
BET is shortened when the chip temperature climbs up
  Leakage current depends on temperature heavily
We need Novel PG strategies taking BET into account
Power Gating Strategies
Requirement: Power off Ex-units longer than BET
static strategy
  straightforward: Ex-units always in sleep after execution
  ideal compiler (ideal compiler-directed): exact average idle time of Ex-units after each instruction is known (for reference only)
dynamic strategy
  L1 miss: Ex-units fall asleep only if encountering L1 cache misses
    L1 miss penalty = 15 cycles
  L2 miss: Ex-units fall asleep only if encountering L2 cache misses
    L2 miss penalty = 200 cycles
both static and dynamic strategies
  ideal compiler + L2 cache miss
ideal (God): ideal dynamic strategy
  exact idle time of Ex-units is known at any time, upper limit of PG (for reference only)
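The sleep decisions of some of these strategies can be compared with a small sketch. It is an illustrative model only: the trace format, the names, and the accounting (each sleep costs one BET worth of leakage) are assumptions, and the compiler-directed strategies are omitted.

```python
from dataclasses import dataclass


@dataclass
class Idle:
    cycles: int  # length of the idle interval of an Ex-unit
    cause: str   # "none", "l1" (L1 miss only), or "l2" (L2 miss)


def should_sleep(strategy, idle, bet):
    """Decide whether the Ex-unit is put to sleep for this idle interval."""
    if strategy == "straightforward":
        return True                       # always sleep after execution
    if strategy == "l1_miss":
        return idle.cause in ("l1", "l2")  # an L2 miss is also an L1 miss
    if strategy == "l2_miss":
        return idle.cause == "l2"
    if strategy == "ideal":
        return idle.cycles > bet          # oracle: exact idle time is known
    raise ValueError(f"unknown strategy: {strategy}")


def net_saving(strategy, trace, bet):
    """Net saving in units of one cycle of leakage; overhead is modeled as bet cycles."""
    return sum(i.cycles - bet for i in trace if should_sleep(strategy, i, bet))


# Hypothetical trace using the miss penalties from the slide (15 and 200 cycles).
trace = [Idle(3, "none"), Idle(15, "l1"), Idle(200, "l2"), Idle(5, "none")]
for bet in (10, 50):
    for s in ("straightforward", "l1_miss", "l2_miss", "ideal"):
        print(bet, s, net_saving(s, trace, bet))
```

In this toy trace the L1-miss strategy matches the oracle when BET is below 15 cycles, and the L2-miss strategy matches it for the longer BET.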
Result for Frequently Used Execution Unit
FPADD for MGRID
[Figure: Relative Energy compared to non-PG versus BET (cycle) for straightforward, ideal compiler, L1, L2, ideal comp. + L2, and ideal (God)]
straightforward: BET is longer than sleep time, waste of energy
ideal compiler: less chance for longer BET
L1: resulting sleep time is about 15
  ideal for BET<15, but waste of energy for longer BET
L2: resulting sleep time is 200
  ideal for longer BET
  for shorter BET, compiler is effective
Collaboration with Compiler / OS
Suggested Power Gating Strategy
Co-optimization on Control Granularity of the PG lever
  compiler direction by assuming short BET, because compiler-directed PG is effective for shorter BET
  for shorter BET (high temperature), compiler direction is put into use, and take (compiler + L2-miss) strategy
  for longer BET (low temperature), take L2-miss strategy, but ignore compiler direction
  OS is expected to switch between strategies by observing changes on BET
Power Gating Collaborated with Compiler / OS
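The OS-side switch between strategies could look like the sketch below. The function name and the threshold value are hypothetical; the slides do not state where the boundary between "shorter" and "longer" BET lies.

```python
def select_pg_strategy(bet_cycles, threshold_cycles=15):
    """Pick a PG strategy from the currently observed BET.

    threshold_cycles is an assumed tuning parameter, not a value from the slides.
    """
    if bet_cycles < threshold_cycles:
        # Shorter BET (high temperature): use the compiler direction as well.
        return "compiler + L2-miss"
    # Longer BET (low temperature): ignore the compiler direction.
    return "L2-miss"


for observed_bet in (8, 60):
    print(observed_bet, select_pg_strategy(observed_bet))
```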
Leakage Monitor [Koyama et al. ITC-CSCC 08][Usami et al. ISLPED2011 (poster 15)]
BET depends on the dynamic environment, such as temperature and the process variation.
on-chip leakage monitoring circuit
  More leakage results in faster charging of VGND
  Estimate leakage by measuring rise-time of VGND to VREF
OS can select the best PG strategy by observing this monitor
[Figure: VGND Voltage (V) versus Sleep time (s); with more leakage VGND reaches the Reference (VREF) sooner than with less leakage, and the monitor output switches between '0' and '1' at the rise]
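The rise-time idea can be sketched with a constant-current approximation, in which leakage charges a fixed VGND capacitance linearly up to VREF. This model, the function name, and the component values are assumptions for illustration and do not describe the actual monitor circuit.

```python
def estimate_leakage_current(rise_time_s, vref_v, vgnd_capacitance_f):
    """Constant-current approximation: I = C * VREF / t_rise (assumed model)."""
    if rise_time_s <= 0:
        raise ValueError("rise_time_s must be positive")
    return vgnd_capacitance_f * vref_v / rise_time_s


# Hypothetical measurements: a shorter rise time implies more leakage.
fast = estimate_leakage_current(rise_time_s=1e-6, vref_v=0.6, vgnd_capacitance_f=10e-12)
slow = estimate_leakage_current(rise_time_s=4e-6, vref_v=0.6, vgnd_capacitance_f=10e-12)
print(f"fast rise: {fast:.2e} A, slow rise: {slow:.2e} A")
```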
Co-Optimization of Throttle Lever Control in Fine-grained Runtime Power Gating
[Figure: OS sets the PG Strategy; Architecture performs PG Control through Activity Localization; Circuit provides the PG lever, controlled in 10~100 cycles; best granularity changes dynamically (e.g. temperature)]
Who should be responsible for PG Control depends on granularity of control
  PG control granularity (BET): 10 ~ 100 cycles
  best granularity of control changes every msec
Prototype CPU : Geyser-1 [Ikebuchi et al. ASSCC ’09]
MIPS R3000
Fujitsu e-shuttle 65nm
Vdd=1.2V
successfully in operation: the first successful cycle-by-cycle power gating
[Figure: chip photo, 2.1 mm x 4.2 mm, showing Shifter, DIV, MULT, ALU, and leakage monitor]
Prototype CPU : Geyser-2
Geyser-2: 2nd Prototype with caches and TLBs on-chip
max working frequency: 210MHz (wakeup latency is less than 5ns)
Demonstration @ ISLPED2011 booth ④
[Figure: Leakage Power [mW] versus Temperature [C]]
(Project 2) Cool Mega Array
Reconfigurable Accelerator: not for performance but power-efficiency
PE array consists of only combinational logic
  Power consumption of registers and clock distribution is reduced
Low-voltage and Low-power PE array operation balanced with data bandwidth of memory
localization of operations
  Operation / Reg. access
  Performance / Power
[Figure: Architecture of CMA: PE array (combinational circuit, DVS region), SE, and DMEM banks]
Prototype : CMA-1C
Fujitsu 65nm
8x8 PE array
12KB data memory
control part: 1.2V
Maximum power efficiency: 223.2 [MOPS/mW]
Demonstration @ ISLPED2011 booth ④
[Figure: Power Efficiency [MOPS/mW] versus PE Array Voltage [V]]
Summary and Future Direction
Geyser: Run-time Power Gating Processor
  first cycle-by-cycle power gating processor
Cool Mega Array: Power Efficiency Accelerator
Other Projects
  Fine Grain Power Gating NoCs [Matsutani et al. NOCS 2010][Matsutani et al. IEEE Trans. on CAD, 4/2011]
  Linux-based Evaluation Platform
  Demonstration @ ISLPED2011 booth ④
Towards Integrated System LSIs
  Evaluation through real integration via 3D wireless NoCs
[Figure: integrated system with Geyser CPU, CMA, L2 Cache, and Main Memory]
Selected Publications
1. N. Seki, et al., “A Fine Grain Dynamic Sleep Control Scheme in MIPS R3000”, Proc. of ICCD-2008, pp. 612-617, 2008
2. K. Usami, et al., “Design and Implementation of Fine-grain Power Gating with Ground Bounce Suppression”, Proc. of VLSI Design 2009, pp. 381-386, 2009
3. N. Takagi, et al., “Cooperative Shared Resource Access Control for Low Power Chip Multiprocessors”, ISLPED-2009, pp. 177-182, 2009
4. S. Saito, et al., “MuCCRA-Cube: A 3D Dynamically Reconfigurable Processor with Inductive Coupling Link”, Proc. of FPL09, pp. 6-11, 2009
5. D. Ikebuchi, et al., “Geyser-1: A MIPS R3000 CPU core with fine grain runtime power gating”, Proc. of IEEE ASSCC-2009, pp. 281-284, 2009
6. H. Matsutani, et al., “Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs”, Proc. of NOCS'10, pp. 61-68, 2010
7. H. Matsutani, et al., “Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs”, IEEE Trans. on CAD (TCAD), Vol. 30, No. 4, pp. 520-533, Apr 2011
8. K. Usami, et al., “On-chip Detection Methodology for Break-Even Time of Power Gated Function Units”, Proc. of ISLPED-2011, (to appear)