1
The implications of energetic and thermal constraints on current and future processors
Pierre Michaud
June 2008
2
Outline
1. Why temperature, power and energy must be limited
2. Some basics
3. Why power consumption became a problem
4. How the power problem has been tackled
5. The temperature problem
6. What future processors may look like
3
Why temperature, power and energy must be limited
4
Processing consumes energy
• When a processor executes a program it consumes some energy. This energy is transformed into internal energy.
• Temperature is a measure of the average kinetic energy associated with the disordered microscopic motion of particles
• A processor consuming some energy increases its own temperature and that of its environment
5
Temperature must be limited
• We don’t want the processor to burn
• A 10 ºC temperature increase halves the processor lifetime
  – Several aging phenomena are at work that are exponential with temperature
  – Example: electromigration
• Circuits get slower when temperature is higher
• ➔ Maximum temperature between 80 ºC and 100 ºC
6
Power consumption must be limited
• Power = energy per unit of time
• Electric power = voltage X current
• Power is limited by the power supply and by the maximum current
• A high sustained power generates a high temperature ➔ limiting power is a way to limit temperature
7
Energy consumption must be limited
• Some processors are battery-powered
  – Laptop computers, hand-held devices
  – A battery stores a finite amount of energy
  – If you spend less energy to do a given work, you are able to do more work before you need to recharge the battery
• Energy costs money
  – Will cost more and more (Google cares!)
  – Also an environmental cost
• Consuming less energy decreases power and temperature
8
Example: data center
• For each watt dissipated in the room, one extra watt must be consumed for the air conditioning
• Assume machines dissipate 100 kW, of which 30% (= 30 kW) comes from CPUs
• + 100 kW for cooling ➔ 200 kW
• If we halve the power consumed by each CPU, we can …
  – Decrease the electric bill (15 + 15 = 30 kW saved)
  – Or put more CPUs in the room
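The arithmetic of this example can be checked with a short sketch. The function name `total_power_kw` and the `cooling_overhead` parameter are illustrative choices, not part of the slides; the numbers are the ones given above.

```python
# Sketch of the data-center arithmetic from the slide.
# total_power_kw and cooling_overhead are hypothetical names.

def total_power_kw(machine_power_kw, cooling_overhead=1.0):
    """Machine power plus cooling: one extra watt of cooling per watt dissipated."""
    return machine_power_kw * (1.0 + cooling_overhead)

machines_kw = 100.0            # power dissipated by the machines
cpu_kw = 0.30 * machines_kw    # 30% of it comes from CPUs

before_kw = total_power_kw(machines_kw)
after_kw = total_power_kw(machines_kw - cpu_kw / 2)  # CPU power halved
saved_kw = before_kw - after_kw

print(f"before: {before_kw:.0f} kW, after: {after_kw:.0f} kW, saved: {saved_kw:.0f} kW")
```

Half of the saving comes from the CPUs themselves and half from the cooling that no longer has to remove that heat.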
9
Some basics
10
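MOSFET (Metal-Oxide Semiconductor Field-Effect Transistor)
[Figure: MOSFET cross-section with source, drain, gate, substrate and gate dielectric; channel length $L$, channel width $W$, gate dielectric thickness $t_{ox}$]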
11
Switching energy
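[Figure: circuit charging a capacitance $C$ from 0 to $V_{dd}$; labels: current $i$, voltage $v$, Joule heating]
$$\int_0^{\infty} (V_{dd} - v)\, i \, dt = \int_0^{V_{dd}} C\,(V_{dd} - v)\, dv = \frac{1}{2} C V_{dd}^2$$
Energy is consumed when a gate output voltage switches from low to high or from high to low
The switching energy depends on capacitance and supply voltage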
12
Dynamic power
• Dynamic power = switching energy consumed per second
$$P_d = C_e \times F \times V_{dd}^2$$
where $F$ is the clock frequency and $C_e$ is the equivalent capacitance
• Equivalent capacitance takes into account contributions from all switching gates on the chip
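As a rough illustration of the dynamic power formula (equivalent capacitance times clock frequency times supply voltage squared), the sketch below evaluates it for a made-up chip. The values of `c_e`, `freq_hz` and `v_dd` are hypothetical and only chosen to give round numbers.

```python
# Dynamic power: P_d = C_e * F * V_dd^2
# All numeric values below are hypothetical, for illustration only.

def dynamic_power(c_e, freq_hz, v_dd):
    """Switching energy consumed per second, in watts."""
    return c_e * freq_hz * v_dd ** 2

base = dynamic_power(c_e=10e-9, freq_hz=3e9, v_dd=1.0)   # about 30 W
low_v = dynamic_power(c_e=10e-9, freq_hz=3e9, v_dd=0.9)  # same chip at 0.9 V

print(f"{base:.1f} W at 1.0 V, {low_v:.1f} W at 0.9 V")
print(f"relative power at 0.9 V: {low_v / base:.2f}")
```

Because the voltage appears squared, a 10% voltage reduction at constant frequency cuts dynamic power by about 19%.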
13
Gate delay
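[Figure: capacitance $C$ switching between 0 and $V_{dd}$; $v$ is the voltage across $C$]
Approximate transistor as a current source:
$$t \approx \frac{C \times V_{dd}}{I_{dsat}}$$
Shockley model:
$$I_{dsat} \approx \frac{\mu\, \varepsilon_{ox}}{t_{ox}} \times \frac{W}{L} \times \frac{(V_{dd} - V_t)^2}{2}$$
• $t_{ox}$: gate dielectric thickness
• $L$: channel length
• $W$: channel width
• $V_t$: threshold voltage
• $\varepsilon_{ox}$: gate dielectric permittivity
• $\mu$: carrier mobility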
14
Why power consumption became a problem
15
Moore’s law
• The number of transistors on a processor chip doubles every 2 years
16
Classical scaling rules
• Dennard, Gaensslen, Yu, Rideout, Bassous, Leblanc, “Design of ion-implanted MOSFET’s with very small physical dimensions”, IEEE Journal of Solid-State Circuits, Oct. 1974
  – On each technology generation, divide all transistor and wire dimensions by $\sqrt{2}$
  – Divide all voltages by $\sqrt{2}$
• Dimensions scaling ➔ all parasitic capacitances (transistors & wires) are divided by $\sqrt{2}$
$$\text{parallel plate capacitance} = \frac{\text{dielectric permittivity} \times \text{plate area}}{\text{dielectric thickness}}$$
17
Classical scaling: impact on delay
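$$t \approx \frac{2}{\mu} \times \frac{L}{W} \times \frac{t_{ox}}{\varepsilon_{ox}} \times C \times \frac{V_{dd}}{(V_{dd} - V_t)^2}$$
Each of $L$, $W$, $t_{ox}$, $C$ and the voltages is multiplied by $1/\sqrt{2}$ ➔ the delay $t$ is multiplied by $1/\sqrt{2}$
Under classical scaling, clock frequency (inverse of delay) can be multiplied by $\sqrt{2} \approx 1.4$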
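The effect of classical scaling on delay can be checked numerically. The sketch below uses the gate delay model from the "Gate delay" slide (delay = C × Vdd / Idsat, with the Shockley saturation current) and divides dimensions, capacitance and voltages by √2. The parameter values and the helper names `gate_delay` and `scale_generation` are hypothetical; only the ratio between the two delays matters.

```python
import math

def gate_delay(mu, eps_ox, t_ox, w, l, c, v_dd, v_t):
    """Gate delay with the transistor approximated as a current source."""
    i_dsat = mu * (eps_ox / t_ox) * (w / l) * (v_dd - v_t) ** 2 / 2  # Shockley model
    return c * v_dd / i_dsat

def scale_generation(p, s=math.sqrt(2)):
    """Classical scaling: divide dimensions, capacitance and voltages by s."""
    return dict(p,
                t_ox=p["t_ox"] / s, w=p["w"] / s, l=p["l"] / s,
                c=p["c"] / s, v_dd=p["v_dd"] / s, v_t=p["v_t"] / s)

# Hypothetical starting point (SI units); the absolute values are arbitrary.
params = dict(mu=0.04, eps_ox=3.45e-11, t_ox=2e-9,
              w=1e-6, l=1e-7, c=1e-15, v_dd=1.2, v_t=0.3)

ratio = gate_delay(**scale_generation(params)) / gate_delay(**params)
print(f"delay ratio after one generation: {ratio:.3f}")
print(f"frequency gain: {1 / ratio:.2f}")
```

The delay ratio comes out at 1/√2 whatever the starting values, because W/L is unchanged, the thickness and capacitance terms each contribute 1/√2, and the voltage term contributes √2.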
18
Classical scaling: impact on power
$$P_d = C_e \times F \times V_{dd}^2 \propto \frac{C_e}{C} \times \frac{W}{L} \times \frac{\varepsilon_{ox}}{t_{ox}} \times V_{dd}\,(V_{dd} - V_t)^2$$
[Slide annotations: per-generation scaling factors marked on individual terms of the equation]
• Before 1990, supply voltage was kept constant (5 V)
• In the 1990’s, power became a concern and voltage was scaled
• Since 2000, voltage keeps decreasing, but more slowly
19
Leakage currents
• Subthreshold leakage current between drain and source when the gate-to-source voltage is below the threshold voltage
• Gate leakage current due to tunneling of electrons through the dielectric layer
• Classical scaling requires that the threshold voltage be decreased when the supply voltage is decreased
  – But this increases subthreshold leakage
  – ➔ Difficult to scale supply voltage further
• Classical scaling requires that the gate dielectric thickness be decreased
  – But this increases gate leakage
  – ➔ Use gate dielectric with higher permittivity (high-K)
[Figure: transistor cross-section; labels: source, drain, gate, dielectric]
20
Static power
• Leakage currents ➔ static power consumption Ps
$$\text{total power} = P_d + P_s = C_e \times F \times V_{dd}^2 + V_{dd} \times I$$
where $I$ is the total leakage current
• We spend energy doing no work ☹
• Subthreshold leakage increases with temperature !
21
Microarchitects are guilty !
• Extra transistors have been used to increase the processor performance, at the cost of more complexity and less energy efficiency
  – Superscalar, out-of-order, 64-bit operations, floating-point, SIMD, multicore, etc.
• Until 2000, clock frequency increased not only because of faster transistors but also because of pipelining
$$P_d = C_e \times F \times V_{dd}^2$$
[Slide annotations: constant chip area ➔ multiplying factor on $C_e$; pipelining ➔ multiplying factor on $F$]
22
Core complexity across generations
R. Kumar, D. Tullsen, N. Jouppi, P. Ranganathan, “Heterogeneous Chip Multiprocessors”, IEEE Computer, Nov. 2005
Alpha 21064 (EV4) to Alpha 21464 (EV8)
23
What happened ?
• In 2000, we thought we would have processors in 2008 clocked at 10 GHz ➔ this did not happen
• In 2004, the Intel Tejas microprocessor was cancelled ➔ too much power, too much heat
  – It was very difficult to continue pushing the complexity of superscalar processors and the clock frequency together
• But Moore’s law continues, so what do we do with extra transistors?
• ➔ big caches & multiple “simple” cores on the chip
• Power consumption has become a first-class constraint
24
How the power problem has been tackled
25
Energy-efficiency needed everywhere
• Technology
  – high-threshold-voltage transistors where speed is not critical (e.g., caches)
  – high-K gate dielectric
  – …
• Circuit
  – Sometimes, sacrificing speed a little permits saving significant energy
  – fine-grained clock gating ➔ don’t clock a flip-flop unless it has valid data at its input
  – …
• Microarchitecture
  – Find a good balance between complexity and performance
  – Disconnect parts that are not used
    • Example: if a program does not perform floating-point computations, turn the FP units off
26
Voltage / frequency
• Circuit can be clocked at frequency proportional to supply voltage
  – As long as supply voltage is not too close to transistor threshold voltage
• Example
  – Assume 75% of power is dynamic and 25% static
  – Multiply voltage and frequency simultaneously by 0.8
$$P_d + P_s = C_e \times F \times V_{dd}^2 + V_{dd} \times I$$
with $F$ multiplied by 0.8 and $V_{dd}^2$ by $(0.8)^2$ in the dynamic term, and $V_{dd}$ multiplied by 0.8 in the static term
$$0.75 \times 0.8^3 + 0.25 \times 0.8 \approx 0.6$$
➔ we get a 40% decrease of power if we decrease frequency by 20%
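The example can be reproduced with a few lines of code. This is a minimal sketch that assumes, as the slide does, that the leakage current stays constant and that frequency is scaled by the same factor as voltage; `relative_power` is a hypothetical helper name.

```python
# Relative total power after multiplying both V_dd and F by a factor s.
# Dynamic power scales as s^3 (F * V_dd^2), static power as s (V_dd * I),
# assuming the leakage current I is unchanged.

def relative_power(dynamic_fraction, static_fraction, s):
    return dynamic_fraction * s ** 3 + static_fraction * s

rel = relative_power(dynamic_fraction=0.75, static_fraction=0.25, s=0.8)
print(f"relative power: {rel:.3f}")        # roughly 0.6
print(f"power saved: {100 * (1 - rel):.0f}%")
```

The same function shows why the trade-off is attractive: the power saving grows much faster than the frequency loss as long as dynamic power dominates.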
27
Parallelism is power efficient
[Figure: 1 processor at frequency F and voltage V, compared with 2 processors each at frequency F/2 and voltage V/2]
• Parallelism makes it possible to get the same performance while consuming less power
• A multicore processor permits obtaining more performance with the same power consumption
  – provided the application is parallel
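A quick calculation shows how large the saving in this comparison is. The sketch below only counts dynamic power (capacitance × frequency × voltage squared), ignores static power, and assumes the application speeds up perfectly on two processors; the normalized values are illustrative.

```python
# Dynamic power of 1 processor at (F, V) versus 2 processors at (F/2, V/2).
# Static power is ignored and a perfectly parallel application is assumed.

def dynamic_power(c_e, freq, v):
    return c_e * freq * v ** 2

C, F, V = 1.0, 1.0, 1.0  # normalized units

single = dynamic_power(C, F, V)
dual = 2 * dynamic_power(C, F / 2, V / 2)
ratio = dual / single

print(f"dual-processor power relative to single: {ratio}")
```

Under these assumptions the two slower processors deliver the same aggregate throughput for a quarter of the dynamic power.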
28
The temperature problem
It is related to the power problem, but is not strictly equivalent to it
29
Processor heat sink
Air-blowing fan takes heat away from the chip
Put on top of the processor chip
30
Cooling a laptop
Heat pipe
Heat sink
CPU
31
Fourier’s law of heat conduction
$$\vec{q} = -K \times \vec{\nabla} T$$
• $\vec{q}$: heat flux, in $W/m^2$
• $K$: thermal conductivity, in $W/(m \cdot K)$
• $\vec{\nabla} T$: temperature gradient, in $K/m$
Heat flows from high-temperature regions to low-temperature ones at a rate proportional to the temperature difference
Thermal conductivity:
• silicon: 100-150 W/mK
• aluminum: 240 W/mK
• copper: 400 W/mK
32
Thermal resistance
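[Figure: bar of length $L$ and section $S$ with a thermally insulated side, at temperature $T_1$ at one end and $T_2$ at the other]
$$\text{power } P = Q \times S$$
$$\text{Fourier's law: } Q = \frac{P}{S} = K\,\frac{T_1 - T_2}{L}$$
$$\text{Thermal resistance } R = \frac{T_1 - T_2}{P} = \frac{L}{K \times S} \quad \text{(in kelvin per watt)}$$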
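To get a feel for the orders of magnitude, the conduction resistance (length divided by conductivity times section) can be evaluated for a simple slab. The geometry below (a copper slab 1 cm long with a 1 cm² section) and the 20 W load are hypothetical; the copper conductivity is the 400 W/mK value quoted with Fourier's law.

```python
# Thermal resistance of a slab by conduction: R = L / (K * S), in K/W.
# The slab geometry and the power value are hypothetical examples.

def conduction_resistance(length_m, conductivity_w_per_mk, section_m2):
    return length_m / (conductivity_w_per_mk * section_m2)

r_copper = conduction_resistance(length_m=0.01,
                                 conductivity_w_per_mk=400.0,
                                 section_m2=1e-4)
power_w = 20.0                      # hypothetical power flowing through the slab
delta_t = power_w * r_copper        # T1 - T2 = P * R

print(f"R = {r_copper:.2f} K/W, temperature drop = {delta_t:.1f} K")
```

Doubling the section or halving the length halves the resistance, which is why thin, wide layers conduct heat well.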
33
Convection cooling
[Figure: solid at temperature $T$ in contact with an ambient fluid in motion at temperature $T_0$]
Newton’s law of cooling:
$$Q = H\,(T - T_0)$$
• $Q$: heat flux, in $W/m^2$
• $H$: heat transfer coefficient, in $W/(m^2 \cdot K)$
Forced convection: the heat transfer coefficient increases with the fluid velocity
$$\text{Thermal resistance } R = \frac{T - T_0}{P} = \frac{1}{H \times S} \quad \text{(in kelvin per watt)}$$
where $S$ is the area in contact with the fluid
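The convective resistance (one over heat transfer coefficient times contact area) can be sketched the same way. The heat transfer coefficient, fin area and power used below are hypothetical values chosen for illustration, not figures from the slides.

```python
# Convective thermal resistance: R = 1 / (H * S), in K/W.
# H, S and the power are hypothetical example values.

def convection_resistance(h_w_per_m2k, area_m2):
    return 1.0 / (h_w_per_m2k * area_m2)

r_conv = convection_resistance(h_w_per_m2k=50.0, area_m2=0.1)
power_w = 100.0
delta_t = power_w * r_conv          # T - T0 = P * R

print(f"R = {r_conv:.2f} K/W, solid is {delta_t:.0f} K above ambient")
```

This shows the two levers a heat sink offers: a larger area in contact with the air (fins) and a larger heat transfer coefficient (faster air flow).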
34
Example
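[Figure: primary heat path, from the transistors & wires through the silicon die and the interface material to the heat sink]
• Heat sink: $R_{hs} = 0.3$ K/W
• Interface material: 50 µm, 3.33 W/mK, $R_{im} = 0.1$ K/W
• Silicon die: 500 µm, 150 mm², $R_{si} = 0.02$ K/W
• Transistors & wires: 5 µm
$$T_{circuit} - T_{air} = P \times (R_{si} + R_{im} + R_{hs})$$
Each watt dissipated contributes a 0.42 ºC temperature increase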
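The numbers of this example can be cross-checked with the conduction formula R = L / (K × S). The sketch below recomputes the interface-material resistance from its thickness, conductivity and the 150 mm² die area, then sums the series resistances. The 100 W power is a hypothetical value, and `conduction_resistance` is an illustrative helper name.

```python
# Series thermal resistances along the primary heat path.
# r_si and r_hs are taken from the slide; r_im is recomputed from its geometry.
# The 100 W chip power is a hypothetical value.

def conduction_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    return thickness_m / (conductivity_w_per_mk * area_m2)

die_area_m2 = 150e-6                                      # 150 mm^2
r_im = conduction_resistance(50e-6, 3.33, die_area_m2)    # interface material
r_si = 0.02                                               # silicon die, K/W
r_hs = 0.3                                                # heat sink, K/W

r_total = r_si + r_im + r_hs
power_w = 100.0
delta_t = power_w * r_total

print(f"R_im = {r_im:.3f} K/W, total = {r_total:.2f} K/W")
print(f"temperature rise at {power_w:.0f} W: {delta_t:.0f} K")
```

At this hypothetical power the circuit would sit roughly 42 K above the air temperature, and the heat sink accounts for most of the total resistance.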
35
Temperature is not uniform
J.D. Warnock et al., “The circuit and physical design of the POWER4 microprocessor”, IBM Journal of Research & Development, Jan. 2002.
36
Point source
[Figure: temperature (relative to ambient) as a function of the distance from a point source dissipating 1 watt]
For multiple sources, add the contributions from each source
37
Impact of miniaturization on temperature
[Figure: chip with dimensions scaled by $\times 1/\sqrt{2}$]
If power remains constant, temperature increases
38
Power must be decreased !
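[Figure: chip dissipating power $P$, and the same chip with dimensions scaled by $\times 1/\sqrt{2}$ dissipating power $P'$]
• $P' = P$ ➔ temperature increases
• $P' = P/2$ ➔ same power density (W/m²) ➔ temperature decreases
• $P' = P/\sqrt{2}$ ➔ temperature roughly the same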
39
From single to dual-core
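[Figure: chip floorplan, with the label “use this area for a second core”]
$P_d$, $F$, $V$: power, frequency and voltage of the original core; $P_d'$, $F'$, $V'$: the same quantities for each core after scaling
$$P_d = C_e \times F \times V^2$$
$$P_d' = \frac{C_e}{\sqrt{2}} \times F' \times V'^2$$
$$\text{same total power} \Rightarrow P_d' = \frac{P_d}{2} \Rightarrow F' V'^2 = \frac{F V^2}{\sqrt{2}}$$
$$\text{If we want } F' > F, \text{ we must have } V' < \frac{V}{\sqrt[4]{2}} \approx 0.84 \times V$$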
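The 0.84 bound follows from splitting the same total power between two cores whose equivalent capacitance has each been divided by √2. The sketch below solves that power budget for the voltage ratio; the function name `max_voltage_ratio` and its parameters are illustrative, and only dynamic power is modeled.

```python
import math

def max_voltage_ratio(freq_ratio, cores=2, cap_scale=1 / math.sqrt(2)):
    """Largest V'/V that keeps total dynamic power constant.

    Per-core budget: cap_scale * C_e * F' * V'^2 = C_e * F * V^2 / cores,
    so (V'/V)^2 = 1 / (cores * cap_scale * F'/F).
    """
    return math.sqrt(1.0 / (cores * cap_scale * freq_ratio))

limit = max_voltage_ratio(freq_ratio=1.0)   # F' = F
print(f"V'/V must stay below {limit:.2f} to run both cores at the old frequency")
```

Any frequency increase on top of that requires an even lower voltage, which is only feasible while the supply voltage stays well above the threshold voltage.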
40
“Dual-core” with a single valid core
• Yield issue: only a fraction of the chips on a wafer will eventually be sold
  – Other chips have defects
• Valid chips have either 1 or 2 valid cores
• Chips with a single valid core can use a higher voltage and frequency
  – (ignoring commercial considerations …)
$$\text{We are limited by temperature} \Rightarrow P_d' = \frac{P_d}{\sqrt{2}} \Rightarrow F' V'^2 = F V^2$$
$\sqrt{2}$ times higher than the chip with 2 valid cores
41
Dynamic voltage / frequency scaling
• Processors can vary voltage and frequency dynamically (DVFS)
  – Keep frequency proportional to voltage: $V = kF$, with $k$ a constant
• The operating system sets frequency and voltage depending on the situation
  – Thermal sensor indicates that temperature is too high ➔ decrease V & F
  – System activity is low ➔ decrease V & F to save energy
• When a single core is used, put the inactive core in low power mode and increase V & F of the active core to boost performance
  – Intel Penryn processor ➔ 10 % frequency boost when 2nd core inactive
    • Intel “Dynamic Acceleration Technology”
42
DVFS in multicores
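• 2N cores active ➔ $F_{2N}, V_{2N}$
• N cores active ➔ $F_N, V_N$
$$\frac{F_N V_N^2}{F_{2N} V_{2N}^2} = \sqrt{2} \;\Rightarrow\; \frac{k^2 F_N^3}{k^2 F_{2N}^3} = \sqrt{2} \;\Rightarrow\; \frac{F_N}{F_{2N}} = 2^{1/6}$$
(using voltage proportional to frequency, $V = kF$)
• 2 cores (Intel Penryn): $F_1 / F_2 = 2^{1/6} \approx 1.12$
• 4 cores (Intel Nehalem ?): $F_1 / F_4 = 2^{1/3} \approx 1.26$
• 8 cores: $F_1 / F_8 = 2^{1/2} \approx 1.41$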
43
Activity migration
• When using a single core at a time, migrating the execution periodically to a different core decreases temperature
  – Spreads the same heat on a larger area
52
DVFS in multicores + activity migration
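• 2N cores active ➔ $F_{2N}, V_{2N}$
• N cores active ➔ $F_N, V_N$
$$\frac{F_N V_N^2}{F_{2N} V_{2N}^2} = 2 \;\Rightarrow\; \frac{k^2 F_N^3}{k^2 F_{2N}^3} = 2 \;\Rightarrow\; \frac{F_N}{F_{2N}} = 2^{1/3}$$
(using voltage proportional to frequency, $V = kF$)
• $F_1 / F_2 = 2^{1/3} \approx 1.26$
• $F_1 / F_4 = 2^{2/3} \approx 1.59$
• $F_1 / F_8 = 2$
DVFS is more efficient when it is combined with activity migration
➔ potential speed-up of 2 for sequential execution on an 8-core processor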
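The boost factors with and without activity migration can be tabulated with a small sketch. It assumes, as in the slides, that voltage is kept proportional to frequency, so that each halving of the number of active cores multiplies the frequency by 2^(1/6) without migration and by 2^(1/3) with migration. The function name `single_core_boost` is hypothetical.

```python
import math

def single_core_boost(total_cores, active_cores=1, migration=False):
    """Frequency ratio F_active / F_all under the slides' model.

    Each halving of the active core count gives a factor 2^(1/6) without
    activity migration and 2^(1/3) with it (V proportional to F assumed).
    """
    halvings = math.log2(total_cores / active_cores)
    per_halving_exponent = 1 / 3 if migration else 1 / 6
    return 2 ** (per_halving_exponent * halvings)

for cores in (2, 4, 8):
    plain = single_core_boost(cores)
    migrated = single_core_boost(cores, migration=True)
    print(f"{cores} cores: x{plain:.2f} without migration, x{migrated:.2f} with migration")
```

On an 8-core chip this gives about 1.41 without migration and 2 with migration, which matches the figures on the two "DVFS in multicores" slides.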
53
The future ?
54
Let’s assume this scenario
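$$\text{frequency } F \propto \frac{1}{C} \times \frac{W}{L} \times \frac{\varepsilon_{ox}}{t_{ox}} \times \frac{(V_{dd} - V_t)^2}{V_{dd}}$$
$$\text{dynamic core power } P_d \propto \frac{C_e}{C} \times \frac{W}{L} \times \frac{\varepsilon_{ox}}{t_{ox}} \times (V_{dd} - V_t)^2\, V_{dd}$$
[Slide annotations: a per-generation multiplying factor marked on terms of each equation, and one quantity marked “constant: 0.1 V or 0.2 V”]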
55
My guess wish for 2010-2020
[Figure: chip floorplan with a cache, big cores and small cores; labels: low freq., high freq., low Vt, high Vt]
• Constant chip area
• Several big cores for high sequential performance
  – Vdd x 0.9 for constant core power
  – Frequency x 1.7 on each generation
  – Use a single big core at a time
  – Migrate periodically for temperature
• Many small cores for high parallel performance
  – Vdd x 0.8 for halving core power
  – Frequency increases slowly
  – Parallel performance doubles on each generation
56
Conclusion
• Sequential performance must increase, it is a necessity
  – Some applications have little parallelism
  – Amdahl’s law
  – Legacy code
  – Software productivity
  – Efficient activity migration may be the only long-term solution
• Peak parallel performance is likely to increase faster than sequential performance
57
Questions ?