8/11/2019 L3 - System on Chip
Outline
The framework: computing platforms in the broad sense
Historical trends towards multi-core through Moore's law
Aggressive voltage scaling, minimum-energy computation and limits
Opportunities to improve energy efficiency/voltage scalability
Beyond-CMOS ultra-low voltage circuits
Prof. Massimo Alioto
The Framework: Computing Platforms in the Broad Sense
Computing Platforms: The Big Picture
Computing/sensing platforms are rapidly expanding*
networks move towards macro and nano scale
nano scale (self-powered nodes)
meso scale (portable/handheld)
macro scale (data centers)
macro: cloud computing
nano: ubiquitous computing/sensing
* adapted from MuSyC FCRP center
New concepts: Internet of Things, Central Nervous System for the Earth, collective intelligence
New applications: advanced water/energy management
Historical Trends towards Multi-Core
CMOS Integrated Circuits
Transistor: Shockley, Brattain, Bardeen (1947, Bell Labs)
Integrated circuit (IC) chip: multiple transistors + interconnects
Jack Kilby (1958) demonstrated the first IC
[Figure labels: packaged chip, PCB]
Gordon Moore's Prediction
CMOS technology scaling: dimensions X, Y, Z (previous generation) shrink to 0.7X, 0.7Y, 0.7Z (next generation)
2X more transistors/chip
Prediction in 1965 (not a law)
Moore's law: 1 generation/24 months
exponential growth in transistor count
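The scaling rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the 0.7X linear shrink and the 24-month generation stated on the slide; the function names are hypothetical and the transistor count is normalized to 1 at the starting point.

```python
def area_scaling(linear_shrink: float = 0.7) -> float:
    """Area of one transistor in the next generation, relative to the previous one."""
    return linear_shrink ** 2


def transistor_growth(years: float, months_per_generation: float = 24.0) -> float:
    """Relative transistor count per chip after `years`, doubling once per generation."""
    generations = years * 12.0 / months_per_generation
    return 2.0 ** generations


if __name__ == "__main__":
    # 0.7 x 0.7 is about 0.5, which is why the same chip area holds about 2X the transistors.
    print(f"area per transistor: {area_scaling():.2f}x")
    for years in (2, 10, 20):
        print(f"after {years} years: {transistor_growth(years):.0f}x transistors")
```

Under these assumptions the count grows by a factor of 32 in 10 years and 1024 in 20 years, which is the exponential growth the slide refers to.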
Colossal investments, coordinated by the International Technology Roadmap for Semiconductors (ITRS)
process, device, circuits
challenges, performance, consumption, capabilities
As a Result of CMOS Scaling
CMOS scaling trends for microprocessors (macro scale)
before 2005: Moore's law + Dennard scaling (voltage scaled down)
exponential growth in # of transistors and performance
[Figure: microprocessor scaling trends, Intel data]
Power vs. energy efficiency
$P_{chip} = \text{throughput} \cdot E_{op} + P_{lkg}$
keep leakage power small enough (10-15%)
$E_{op} \propto V^2$
single-core performance $\propto V$
reduce $E_{op}$ by using lower $V$
improve performance
Multi-Core: Numerical Example
Post-Dennard scaling: keep V_TH, V_DD constant
performance becomes power limited
use area (more transistors available: use them to improve efficiency)
use silicon for low power density blocks (cache: 10 W/cm^2) that strongly impact total speed, rather than logic (30 W/cm^2)
D. Frank, "Power-Constrained CMOS Scaling Limits," IBM J. Res. & Dev., vol. 46, no. 2/3, March/May 2002
Example (iso power/technology):
1 core: V_DD, f, area = 1, throughput = 1
2 cores: 0.8 V_DD, 0.8 f, area = 2, throughput = 1.6
4 cores: 0.63 V_DD, 0.63 f, area = 4, throughput = 1.6^2 = 2.5
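The numbers in the iso-power example can be reproduced with a simple model. This is a sketch under stated assumptions, not the derivation used on the slide: per-core dynamic power is taken as proportional to f * V^2, frequency as proportional to V, and leakage and parallelization overhead are ignored. The function name is hypothetical.

```python
from typing import Tuple


def iso_power_multicore(n_cores: int) -> Tuple[float, float]:
    """Return (relative V_DD and frequency, relative throughput) at constant total power.

    Assumes per-core dynamic power ~ f * V^2 with f ~ V, i.e. power ~ V^3,
    and ignores leakage and parallelization overhead.
    """
    v_rel = n_cores ** (-1.0 / 3.0)   # n * v^3 = 1  ->  v = n^(-1/3)
    throughput = n_cores * v_rel      # n cores, each running at f ~ v
    return v_rel, throughput


if __name__ == "__main__":
    for n in (1, 2, 4):
        v, t = iso_power_multicore(n)
        print(f"{n} core(s): V_DD = f = {v:.2f}, area = {n}, throughput = {t:.1f}")
```

With these assumptions, two cores run at about 0.79 of the original voltage and frequency for a throughput of about 1.6, and four cores at about 0.63 for a throughput of about 2.5, matching the slide up to rounding.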
Multi-Core Scaling
Multi-core era will not last long [ISCA2011]: announced catastrophe ("Dark Silicon and the End of Multicore Scaling") due to inadequate energy efficiency
percentage of unusable "dark" silicon is growing fast
new power crisis in 2016 for processors: no reason for scaling
[ISCA2011] H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ISCA, June 2011
A Broader View of Dark Silicon
At macro scale, "dark" refers to the spatial dimension
At nano scale (self-powered nodes): inadequate energy efficiency leads to dark silicon along the temporal dimension (intermittent available power)
[Figure: available energy vs. time t, alternating between "no operation" and "normal operation"]
At meso scale (portable): dark silicon in both the spatial (power constraint 1-2 W) and temporal dimension (limited lifetime @ given functionality)
Green IC Group
Painting silicon green: mission of the Green IC group (www.green-ic.org)
[Figure labels: meso, nano]
Aggressive Voltage Scaling, Minimum-Energy Computation and Limits
Voltage Scaling: Dynamic Energy
If dynamic energy per clock cycle dominates:
$E_{dyn} = \alpha_{SW} \, C \, V_{DD}^2$
affected by switching activity $\alpha_{SW}$, capacitance $C$, voltage $V_{DD}$
reduce V_DD as much as possible
energy reduction limited by V_DD,min (functional/timing failures)
[Figure: E_dyn vs. V_DD, down to V_DD,min]
Voltage Scaling: Leakage Energy
If leakage (static) energy per cycle dominates:
$E_{lkg} = V_{DD} \, I_{off} \, T_{CK}$
affected by supply voltage, leakage current, clock cycle (logic depth * gate delay)
V_DD reduction and trends: linear (V_DD), constant (I_off), exponentially growing (T_CK)
E_lkg exponentially increases at low V_DD
[Figure: pipeline of registers (D, Q, clk) and combinational blocks comb 1 ... comb n, stage 1 to stage n]
[Figure: E_lkg vs. V_DD, down to V_DD,min]
Voltage Scaling: Total Energy
Total energy vs. V_DD: tradeoff between E_dyn and E_lkg
minimum-energy point (MEP) exists
[Figure: E_TOT, E_dyn and E_lkg vs. V_DD, with V_DD,min and V_DD,opt marked]
MEP determined by optimal balance of E_dyn and E_lkg
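The existence of a minimum-energy point can be illustrated numerically. The sketch below is a toy model, not data from the lecture: it assumes a quadratic dynamic energy, a constant off-current, and a clock period that grows exponentially as V_DD drops (a subthreshold-style approximation). All constants and function names are hypothetical.

```python
import math

ALPHA_SW = 0.1        # switching activity (assumed)
LEAK_WEIGHT = 1000.0  # lumps logic depth, idle gates and I_off prefactor (assumed)
N_VT = 0.039          # slope factor times thermal voltage, in volts (assumed)


def e_dyn(vdd: float) -> float:
    """Dynamic energy per cycle with C normalized to 1: alpha_SW * V_DD^2."""
    return ALPHA_SW * vdd ** 2


def e_lkg(vdd: float) -> float:
    """Leakage energy per cycle: V_DD * I_off * T_CK with constant I_off."""
    t_ck = vdd * math.exp(-vdd / N_VT)  # clock period grows exponentially at low V_DD
    return LEAK_WEIGHT * vdd * t_ck


def find_mep(v_min_mv: int = 150, v_max_mv: int = 1000) -> float:
    """Grid search (1 mV steps) for the V_DD that minimizes E_dyn + E_lkg."""
    candidates = [mv / 1000.0 for mv in range(v_min_mv, v_max_mv + 1)]
    return min(candidates, key=lambda v: e_dyn(v) + e_lkg(v))


if __name__ == "__main__":
    v_opt = find_mep()
    print(f"V_DD,opt = {v_opt:.3f} V, E_dyn = {e_dyn(v_opt):.4f}, E_lkg = {e_lkg(v_opt):.4f}")
```

With these made-up constants the minimum falls strictly inside the search range: below it the exponentially growing leakage term dominates, above it the quadratic dynamic term does. The exponential delay model is only meaningful near and below threshold, so the high-voltage end of the sweep is indicative at best.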
Importance of Voltage Scaling: Broader View
Minimum-energy operation for better (10X) energy efficiency
+ circuit/architectural/SW integration
permits performance increase at macro scale
reduces battery size and extends lifetime at meso/nano scale
Voltage scaling is powerful: intrinsic in Dennard scaling
Post-Dennard scaling: aggressive voltage scaling, do it by yourself
as much as possible (give up something), variable workload
deal with related issues
Ultra-Low Voltage (ULV) Operation: Limits and Challenges
Energy reduction comes at a price:
performance
leakage energy
resiliency
[Figure: energy, E_lkg, performance, yield and failure rate vs. V_DD]
Limits and Challenges
Resiliency degraded at ULV (process/voltage/temperature)
5-10X more process variations (delay: easily 2X variations)
5X higher sensitivity to V_DD
[Figure: plot vs. V_DD (mV), M. Alioto (TCAS-I 2012)]
design margining: margin for process, voltage and temperature on top of the nominal delay
at near threshold, easily 2X margin (in speed binning, many discarded)
impact on performance/energy efficiency
[Figure: nominal delay plus margin, R. Krishnamurthy (Micro 2012)]
Limits and Challenges
Aging (depends on history, workload, voltage, temperature)
Soft errors: higher failure rate at ULV
Degraded functionality at ULV
degraded I_on/I_off (incomplete switching)
MEM arrays: much less scalability (0.6-0.7 V)
V_DD,min increase due to variations
V_DD,min increase due to intrinsic NMOS/PMOS imbalance
V_DD,min increase due to residual PUN/PDN imbalance
[Figure: V_DD,min contributions in multiples of the thermal voltage v_t, M. Alioto (TCAS-I 2012)]
Opportunities to Improve Energy Efficiency/Voltage Scalability
Near-Threshold ICs
Parallelism compensates speed loss (enhanced by 3D chip stacking)
near-threshold computing very promising: efficiency enables data center scalability (D. Blaauw/D. Sylvester)
can enable exascale computing by 2020 (Shekhar Borkar, Intel)
Near-threshold computers will be different
logic/MEM: different scaling (MEM becomes faster)
fewer cache levels, bigger cache
better logic/MEM coupling through 3D integration
More efficient and scalable microarchitectures
deep pipeline: lower leakage energy
$E_{lkg} = V_{DD} \, I_{off} \, T_{CK}$
[Figure: pipeline of registers (D, Q, clk) and combinational blocks comb 1 ... comb n, stage 1 to stage n]
ultra-low power = high speed: only 17 FO4/stage in 1,024-point complex FFT, 4X lower energy (D. Blaauw, D. Sylvester, ISSCC 2011)
Finer-grain voltage domains
currently: cores share same voltage, different frequency
$P_{chip} = \text{throughput} \cdot E_{op} + P_{lkg}$
slower cores might operate at lower voltage (lower E_op, P_lkg): not possible (share same voltage)
multiple on-chip regulators in sight
[Figure: cores with individual on-chip regulators, running at different frequencies]
can exploit workload reduction to further reduce E_op and P_lkg
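The benefit of per-core voltage domains can be sketched with normalized numbers. This is an illustration under assumptions that are not in the slides: the required voltage is taken as proportional to the core frequency, dynamic power per core as f * V^2, and leakage power is ignored. Function names and the example frequencies are hypothetical.

```python
from typing import Sequence


def dynamic_power(freqs: Sequence[float], vdds: Sequence[float]) -> float:
    """Sum of per-core dynamic power, each ~ f * V^2 (normalized units)."""
    return sum(f * v ** 2 for f, v in zip(freqs, vdds))


def shared_rail(freqs: Sequence[float]) -> float:
    """All cores sit on one rail set by the fastest core (assumes V ~ f)."""
    v_shared = max(freqs)
    return dynamic_power(freqs, [v_shared] * len(freqs))


def per_core_rails(freqs: Sequence[float]) -> float:
    """Each core gets its own regulator and only the voltage its frequency needs."""
    return dynamic_power(freqs, list(freqs))


if __name__ == "__main__":
    core_freqs = [1.0, 0.5, 0.5, 0.25]  # relative frequencies, one fast and three slow cores
    print(f"shared rail:    {shared_rail(core_freqs):.3f}")
    print(f"per-core rails: {per_core_rails(core_freqs):.3f}")
```

In this toy case the slow cores waste energy on a shared rail because their E_op is set by the fast core's voltage; individual rails remove that overhead. Regulator losses, which a real design would have to include, are left out.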
Enhance Energy Efficiency: Heterogeneity
Exploit heterogeneity (different scaling at ULV)
area is a commodity: give up flexibility for better efficiency
HW accelerators (media, image, crypto, radio, ...)
same function in different IPs
more extreme: use different replicas with different variations (module 1, module 2)
better energy efficiency, more testing
[Figure: regions "use module 1" and "use module 2" vs. delay, R. Krishnamurthy (Micro 2012)]
Enhance Energy Efficiency
Limit communication energy: exploit locality at different scales
limit off-chip (2-10X intra-chip)
limit intra-chip (1-10X computation)
B. Dally (CICC 2012)
restrict data structure and flow (SIMD)
better flip-flops (post-silicon tuning)
better clock domain design
clock slope optimization: 35% better energy efficiency [Alioto TCAS-I 2010]
M. Alioto (ISSCC 2012)
Margin Elimination: Design vs. Testing Time
Uncertainty margin at design time is too expensive: post-silicon (self-)tuning absolutely needed
eliminate margin: optimally allocate cost/design effort at design / testing / boot / run time
decisions related to design time: increase design margin, improve understanding/modeling, more robust design... (complexity, uncertainty)
decisions related to testing / boot / run time: post-silicon tune at testing time, adapt at run time
circuits people, architects and testing people need to play in the same field
Margin Elimination: Timing Error Detection
Reduce/eliminate worst-case margin by catching delay faults
correct at run-time, tune to compensate actual variations
run-time testing improves energy efficiency
in-situ monitoring: no margin; invasive, limited tuning
fault prediction (Tunable Replica Circuit): needs some margin (false positives, mimics only critical path); little invasive, tuning required, low overhead
[Figure: nominal delay plus margin]
Margin Elimination: Timing Error Detection
Timing monitoring: some circuit approaches
double sampling: Razor (Umich), DSTB (Intel)
transition detection: Razor II (Umich), TDTB (Intel)
errors reported to architecture through OR tree
hold-time/detection window (TD)
metastability in data (Razor)/error path (others)
Margin Elimination: Error Correction
Faults can be corrected at various levels
faster correction: SW -> Architecture -> Microarchitecture -> Circuit
less HW resources: Circuit -> Microarchitecture -> Architecture -> SW
lower energy/performance penalty: SW -> Architecture -> Microarchitecture -> Circuit
[Figure: energy, throughput/IPC and error rate vs. V_DD. Marked points: margined V_DD (traditional design), Point of First Failure (PoFF), energy-optimum V_DD. Annotations: energy of traditional margined design; energy reduction through margin elimination; energy reduction below PoFF; error rate increase below PoFF; throughput degradation due to increased error rate; energy overhead of correction; minimum energy under error det./corr.]
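The energy-optimum V_DD below the Point of First Failure can be illustrated with a toy model. Everything numeric here is an assumption made for illustration, not a value from the lecture: the PoFF location, the exponential error-rate law, and the per-error correction cost are all hypothetical, as are the function names.

```python
import math

V_POFF = 0.80           # Point of First Failure, normalized volts (assumed)
ERR_AT_POFF = 1e-6      # error rate just below the PoFF (assumed)
ERR_SLOPE = 0.01        # volts per e-fold increase in error rate (assumed)
CORRECTION_COST = 10.0  # energy of one correction, in units of one operation (assumed)


def error_rate(vdd: float) -> float:
    """No errors at or above the PoFF, exponentially more errors below it."""
    if vdd >= V_POFF:
        return 0.0
    return min(1.0, ERR_AT_POFF * math.exp((V_POFF - vdd) / ERR_SLOPE))


def energy_per_op(vdd: float) -> float:
    """Quadratic energy per operation plus the overhead of correcting errors."""
    return vdd ** 2 * (1.0 + CORRECTION_COST * error_rate(vdd))


def energy_optimum_vdd(v_min_mv: int = 600, v_max_mv: int = 1000) -> float:
    """Grid search (1 mV steps) for the V_DD with minimum energy per operation."""
    candidates = [mv / 1000.0 for mv in range(v_min_mv, v_max_mv + 1)]
    return min(candidates, key=energy_per_op)


if __name__ == "__main__":
    v_opt = energy_optimum_vdd()
    print(f"margined V_DD = 1.000: energy = {energy_per_op(1.0):.3f}")
    print(f"PoFF          = {V_POFF:.3f}: energy = {energy_per_op(V_POFF):.3f}")
    print(f"optimum       = {v_opt:.3f}: energy = {energy_per_op(v_opt):.3f}")
```

In this sketch the optimum sits somewhat below the PoFF: a small error rate is cheaper to correct than the energy saved by the lower voltage, until the exponentially rising error rate makes the correction overhead dominate.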
Margin Elimination: Error Correction
Existing approaches
circuit: clock gating (Umich), clock stretching (Georgia Tech); error propagation within a clock cycle (very hard)
microarchitecture: counterflow pipelining (Umich), micro-rollback (Umich), Bubble Razor (Umich); interferes with microarchitecture/cycle-based timing
architecture: instruction re-execution (Intel), simple, large penalty; checkpoint-restart (Wisc), simple, very large penalty
The Next Step: Sub-Cycle Detection/Correction
Existing approaches are cycle-based (from J. Crop et al., JLPEA, 2011)
correction interferes with microarchitecture (design effort)
errors affect timing at boundary: difficult SoC integration
large energy penalty in high error rate regime (future)
Our vision: sub-cycle detection/correction
errors detected/corrected in the same cycle
or, at least, errors do not have to propagate to the boundary, so that errors are confined and determine low energy penalty
Approximate Computing as Extreme Scaling
Some apps do not need to have perfect computation
aggressively push voltage and tolerate errors
approximate computing (voltage overscaling by N. Shanbhag, K. Roy)
ex.: multimedia (occasionally wrong pixels/samples)
errors not corrected on the fly
rather, avg error rate kept within bound (slow correction loop)
degradation of signal quality can be dynamically adjusted (application level)
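The slow correction loop that keeps the average error rate within a bound can be sketched as a simple bang-bang controller. This is an assumption-laden illustration: the exponential error model stands in for a measured average error rate, and the step size, starting voltage and function names are hypothetical.

```python
import math


def avg_error_rate(vdd: float) -> float:
    """Hypothetical average error rate vs. V_DD; a real system would measure this."""
    return min(1.0, 1e-6 * math.exp((0.80 - vdd) / 0.01))


def slow_correction_loop(target: float, vdd_mv: int = 800, step_mv: int = 5,
                         iterations: int = 100) -> int:
    """Scale V_DD down while the error rate stays within `target`, back off otherwise.

    Voltages are tracked in integer millivolts to avoid floating-point drift.
    """
    for _ in range(iterations):
        if avg_error_rate(vdd_mv / 1000.0) > target:
            vdd_mv += step_mv   # too many errors: raise the supply
        else:
            vdd_mv -= step_mv   # within the bound: keep pushing the voltage down
    return vdd_mv


if __name__ == "__main__":
    for target in (1e-4, 1e-3, 1e-2):
        print(f"target error rate {target:g}: settles near {slow_correction_loop(target)} mV")
```

Individual errors are never corrected here; the loop only steers the supply so that the average rate hovers around the bound. Relaxing the bound at the application level lets the controller settle at a lower voltage.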
Our Approach: User Experience-Centric Design
Voltage/energy reduction in portable multimedia for a given quality of user experience
tight link between circuit and final user: errors are acceptable
metrics for quality of user experience (PSNR)
close circuit design loop at application level
minimize energy for given quality
energy scalability: reduce energy if lower quality is accepted (dynamic scaling)
[Figure: example frames at PSNR = 24 dB and PSNR = 36 dB]
Limits of recent work on energy scalability (SRAM)
[Wolf2009], [Kurdahi2008]: aggressive V_DD scaling to reduce energy at the cost of higher BER
very limited voltage/energy scalability: BER depends exponentially on V_DD and abruptly increases
same limitation in mixed 6T/8T SRAM [Roy 2011]
near threshold, 6T array almost always fails, 8T almost never fails: not really scalable either
[Figure: BER (or PSNR) and energy vs. V_DD, with targeted quality marked]
Our approach
errors have different impact depending on where they occur
optimal energy allocation: protect (= spend energy) only important bits to have graceful degradation (various knobs)
when limiting precision, use unused bits to improve resiliency
can push more on V_DD to reduce energy at same quality
currently, ...-nm chip under test
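Why errors matter differently depending on where they occur can be seen by flipping one bit position in a fraction of the pixels and measuring PSNR. This is a minimal sketch with synthetic data; the helper names, the 1-in-10 fault pattern and the test image are hypothetical and are not the setup used for the chip results.

```python
import math
from typing import List, Sequence


def psnr(reference: Sequence[int], degraded: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally long sample sequences."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def flip_bit(pixels: Sequence[int], bit: int, every: int = 10) -> List[int]:
    """Flip `bit` in every `every`-th 8-bit pixel, mimicking faulty memory cells."""
    return [p ^ (1 << bit) if i % every == 0 else p for i, p in enumerate(pixels)]


if __name__ == "__main__":
    image = [(37 * i) % 256 for i in range(1000)]  # synthetic 8-bit samples
    for bit in (0, 4, 7):
        print(f"bit {bit} flipped: PSNR = {psnr(image, flip_bit(image, bit)):.1f} dB")
```

The same number of faulty cells costs far more quality in the most significant bit than in the least significant one, which is the motivation for spending protection energy only on the important bits.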
Results in 28-nm 32-kb SRAM, YUV format (QCIF 144x176)
Akiyo video, frame #30
PSNR w.r.t. voltage scaling
41% better PSNR (dB) at same energy
[Figure: frame comparison, panels A, B and Original]
Other Opportunities
Enable burst very high-speed computation (just violate reliability constraint)
temporarily exceed Thermal Design Power
leverage thermal capacitance for DVFS: Turbo Boost 2.0 [Intel, Rotem et al., HOTCHIPS 2011]
enhance thermal capacitance via phase change materials: Computational Sprinting [Raghavan, HPCA 2012]
Our Vision of Distributed Power Management
Globally green systems: energy-efficient, widely energy-scalable and externally tunable components
need for communication (energy state, knob tuning)
global policies based on information on energy state
[Figure: a MODULE with inputs, outputs and a register (REG), connected to a traditional communication channel (ex.: bus, NoC, crossbar...) and to an energy management channel, added to enable energy scalability and dynamic tradeoff with other assets. The energy management channel carries instantaneous requirements (ex.: throughput, arithmetic precision...) and energy-related parameters (ex.: timing slack, bit error rate...). Inside the module: sensors, settings processing, and knobs that self-adjust internally to minimize energy.]
Our Vision of Distributed Power Management
keep it simple (integration), yet maintain global view: hierarchical structure
[Figure: hierarchical structure, axis label "higher level in hierarchy"]
enables remote power management (global view and intelligence kept out of nano-scale nodes)
move computation where more efficient (computation vs. communication, locality, heterogeneity)
Beyond-CMOS Ultra-Low Voltage Circuits
Tunnel-FETs: a Very Promising Alternative
Main limit to voltage scaling of CMOS transistor: V_TH can be reduced only if subthreshold slope is lowered at given leakage
use new devices with lower subthreshold slope
Tunnel FETs: very promising (ITRS: after 2020)
Physical structure: p+ / i / n+
[Figure: device cross-section with metal gate]
Tunnel-FETs: Robustness Comparison
Comparison with CMOS bulk (FinFET) / SOI
fair: all optimized for ULV, same targets (leakage)
Noise margin degradation at ULV: linear
Tunnel-FETs: Energy Comparison
FO4 inverter chain (10% activity, 16 slices)
min. energy vs. logic depth
max. TFET advantage w.r.t. SOI: 35% @ 60 FO4
[Figure: min. energy vs. logic depth]
Tunnel-FETs: Energy Comparison
Impact of transistor stacking
at ULV, leakage reduction in 2-4 stacked TFETs is 5-8X better than SOI, 3-6X better than bulk
at ULV, I_on reduction in 2-4 stacked TFETs is ...
TFET cells with larger fan-in provide more benefits: faster, lower leakage, lower min. energy
TFET standard cell libraries must include higher fan-in cells
Example: zero-detector with 4-input gates, min. energy improved by 1.79X (1.84X) w.r.t. SOI (bulk)
Tunnel-FETs: SRAM cell
System voltage scalability limited by SRAM cell
small margins, sensitive to variations
8T cell: about same area (33 x 13.4 F^2)
TFET SNM scales better
V_DD > 140 mV: TFET 30% V_DD, SOI 35% V_DD, bulk 30% V_DD
[Figure: plot vs. V_DD (mV), from 100 to 500 mV, for TFET, SOI and bulk]
Conclusions
Future computing platforms (macro, meso, nano)
Green: energy efficiency is key in any component
Ultra-low voltage is really challenging: speed, leakage, resiliency (design margin)
Opportunities to overcome challenges:
margin reduction
heterogeneity
fine-grain/independent power domains
coordinated architecture/circuit design
use better devices
Speaker's Contacts
E-mail: [email protected]
Massimo Alioto, Ph.D.
ECE Department, National University of Singapore (NUS)
4 Engineering Drive 3, Singapore 117576