
15-447 Computer Architecture, Fall 2008 ©

November 24, 2007
Nael Abu-Ghazaleh
naelag@cmu.edu
http://www.qatar.cmu.edu/~msakr/15447-f08

Lecture 27: Power Aware Architecture Design

CS 15-447: Computer Architecture

Uniprocessor Performance (SPECint)

[Figure: SPECint performance by year, 1978 to 2006, with trend segments labeled 25%/year, 52%/year, and ??%/year]

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006

Sea change in chip design: what is emerging?

Three walls

1. ILP Wall:
• Not enough parallelism available in one thread
• Very costly to find more
• Implication: can't continue to grow IPC
• VLIW? SIMD ISA extensions?

2. Memory Wall:
• Growing gap between DRAM and processor speed
• Caching helps, but only so much
• Implication: cache misses are getting more expensive
• Multithreaded processors?

3. Physics/Power Wall:
• Can't continue to shrink devices; running into physical limits
• Power dissipation is also increasing (more on this today)
• Implication: can't rely on a performance boost from shrinking transistors
• But we will continue to get more transistors

Multithreaded Processors

• What support is needed?

• I can use it to help ILP as well
– Which designs help ILP in the picture to the right?

Power-Efficient Processor Design

Goals:
1. Understand why energy efficiency is important
2. Learn the sources of energy dissipation
3. Survey a selection of approaches to reduce energy

Why Worry About Power?

• Embedded systems:
– Battery life

• High-end processors:
– Cooling (costs $1 per chip per watt if operating at >40 W)
– Power cost: 15 cents per kilowatt-hour (kWh)
• A single 900 W server costs about 100 USD/month to run, not including cooling costs!
– Packaging
– Reliability
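The 100 USD/month figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a constant 900 W draw and a 30-day month (both assumptions, not stated on the slide); the function name is illustrative.

```python
def monthly_energy_cost_usd(power_watts, usd_per_kwh=0.15, hours=30 * 24):
    """Electricity cost of a constant load over one (assumed 30-day) month."""
    kilowatt_hours = power_watts / 1000 * hours
    return kilowatt_hours * usd_per_kwh


server_cost = monthly_energy_cost_usd(900)  # the 900 W server from the slide
print(f"900 W server: {server_cost:.2f} USD/month")
```

Under these assumptions the result lands just under 100 USD, consistent with the rounded figure above.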

Why worry about power: Oak Ridge Lab's Jaguar

• Current highest-performance supercomputer
– 1.3 sustained petaflops (quadrillion FP operations per second)
– 45,000 processors, each a quad-core AMD Opteron
• 180,000 cores!
– 362 terabytes of memory; 10 petabytes of disk space
– Check top500.org for a list of the most powerful supercomputers

• Power consumption? (without cooling)
– 7 megawatts!
– 0.75 million USD/month to power
– There is a green500.org that rates computers based on flops/watt
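The Jaguar figures can be cross-checked the same way. The sketch below derives the core count, the flops/watt rating, and the monthly bill from the numbers on the slide; the 15 cents/kWh price and the 30-day month are carried over as assumptions.

```python
SUSTAINED_FLOPS = 1.3e15   # 1.3 petaflops, from the slide
POWER_WATTS = 7e6          # 7 MW without cooling, from the slide
PROCESSORS = 45_000
CORES_PER_PROCESSOR = 4    # quad-core Opteron

total_cores = PROCESSORS * CORES_PER_PROCESSOR
mflops_per_watt = SUSTAINED_FLOPS / POWER_WATTS / 1e6

# Assumed: 0.15 USD per kWh and a 30-day month.
monthly_cost_usd = POWER_WATTS / 1000 * 30 * 24 * 0.15

print(total_cores, round(mflops_per_watt, 1), round(monthly_cost_usd))
```

This works out to roughly 186 Mflops per watt and a bill close to the 0.75 million USD/month quoted above.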

• Alpha 21264: 95 W
• AMD Athlon XP: 67 W
• HP PA-8700: 75 W
• IBM Power 4: 135 W
• Intel Itanium: 130 W
• Intel Xeon: 59 W

Peak Power in Today’s CPUs

Even worse when we consider power density (W/cm²)

• Sources of power consumption in CMOS:
– Dynamic or active power (due to the switching of transistors)
– Short-circuit power
– Leakage power

• High temperature increases power consumption
– Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…

Where is This Power Coming From?

Power Consumption in CMOS

– Dynamic Power Consumption
• Charging and discharging capacitors

[Figure: CMOS inverter (In, Out) with the input switching 0→1 and 1→0; each transition is labeled E = CV²]

P = E × f = C × V² × f

Power = α × C × V² × f

• α, activity factor: how often do wires switch
• C, capacitance: function of wire length, transistor size
• V, supply voltage: has been dropping with successive process generations
• f, clock frequency: increasing

Dynamic Power Consumption
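A small sketch of the dynamic power relation, to show how each term scales. The activity factor, capacitance, voltage, and frequency values below are made up for illustration and do not describe any real chip.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    """Dynamic power in watts: activity factor * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating point: alpha = 0.1, C = 1 nF, V = 1.2 V, f = 2 GHz.
base = dynamic_power(0.1, 1e-9, 1.2, 2e9)
half_voltage = dynamic_power(0.1, 1e-9, 0.6, 2e9)
half_frequency = dynamic_power(0.1, 1e-9, 1.2, 1e9)

print(f"half voltage:   {half_voltage / base:.2f}x power")
print(f"half frequency: {half_frequency / base:.2f}x power")
```

Voltage enters quadratically and frequency linearly, which is why lowering the supply voltage is the most effective of the four knobs.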

– Short-circuit power
• Both PMOS and NMOS are conducting
• About 2% of the overall power

– Leakage power
• Transistors are not perfect switches, and they leak
• 20% of the overall power now; expect 40% in the next technology generation, and growing

[Figure: CMOS inverter (In, Out)]

• All of the consumed power has to be dissipated

• Done by means of heat pipes, heat sinks, fans, etc.

• Different segments use different cooling mechanisms.

• Costs $1-$3 or more per chip per Watt if operating @ >40W

• We may soon need budgets for liquid-cooling or refrigeration hardware.

Cooling

Power = α × C × V² × f

• α, activity factor: how often do wires switch
• C, capacitance: function of wire length, transistor size
• V, supply voltage: has been dropping with successive process generations
• f, clock frequency: increasing

Dynamic Power Consumption

• Transistors switch more slowly at lower voltage

• Leakage current grows exponentially as the threshold voltage decreases

• Leakage power goes through the roof

Voltage Scaling

• New process generation every 2-3 years
• Ideal shrink for 30% reduction in size:
– Voltage scales down by 30%
– Gate delays are shortened by 30%
  • ~50% frequency gain (500 ps cycle = 2 GHz clock, 333 ps cycle = 3 GHz clock)
– Transistor density increases by 2X
  • 0.7X shrink on a side, 2X area reduction
– Capacitance/transistor reduced by 30%

Technology Scaling: the Enabler

• 2/3 reduction in energy/transition (CV²: 0.7 × 0.7² = 0.34X)

• 1/2 reduction in power (CV²f: 0.7 × 0.7² × 1.5 = 0.5X)

• But twice as many transistors, or more if area increases

• Power density unchanged

Ideal Process Shrink: the Results

Looks good!
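The ideal-shrink arithmetic can be reproduced directly. This sketch only restates the scaling factors from the slides (0.7X capacitance and voltage, 1.5X frequency, 2X density); the variable names are illustrative.

```python
SHRINK = 0.7        # capacitance and voltage both scale by 0.7X
FREQ_GAIN = 1.5     # ~50% frequency gain
DENSITY_GAIN = 2.0  # 2X transistors per unit area

energy_per_transition = SHRINK * SHRINK ** 2               # C * V^2
power_per_transistor = energy_per_transition * FREQ_GAIN   # C * V^2 * f
power_density = power_per_transistor * DENSITY_GAIN        # per unit area

print(f"energy/transition: {energy_per_transition:.2f}X")
print(f"power/transistor:  {power_per_transistor:.2f}X")
print(f"power density:     {power_density:.2f}X")
```

The last value comes out very close to 1, which is the sense in which power density is unchanged under an ideal shrink.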

• Performance does not scale with frequency
– New designs increase frequency by 2X
– New designs use 2X-3X more transistors to get 1.4X-1.8X performance*

• So, every new process generation:
– Power goes up by about 2X (3X transistors × 2X switches × 1/3 energy)
– Leakage power is also increasing
– Power density goes up 30%~80% (2X power / 1.X area)

• Will get worse in future technologies, because voltage will scale down less

Process Technology – the Reality*

*Source: “Power – the Next Frontier: a Microarchitecture Perspective”, Ronny Ronen, Keynote speech at PACS’02 Workshop.

Ugly Numbers*

                i486 (0.8 µm)   Pentium 4 (0.18 µm)   Factor
Transistors     1.2M            42M                   35x
Frequency       50 MHz          2000 MHz              40x
Voltage         5 V             1.65 V                1/3x
Peak Power      5 W             100 W                 20x
Die size        0.73 cm²        2.17 cm²              3x
Power density   6.8 W/cm²       46 W/cm²              7x
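The Factor column and the power density row follow from the other rows. A short sketch, using only the numbers in the table (the dictionary keys are illustrative names):

```python
i486 = {"transistors": 1.2e6, "frequency_mhz": 50.0, "voltage_v": 5.0,
        "peak_power_w": 5.0, "die_cm2": 0.73}
pentium4 = {"transistors": 42e6, "frequency_mhz": 2000.0, "voltage_v": 1.65,
            "peak_power_w": 100.0, "die_cm2": 2.17}

# Ratio of each Pentium 4 figure to the i486 figure.
factors = {key: pentium4[key] / i486[key] for key in i486}

# Power density = peak power / die area.
density_i486 = i486["peak_power_w"] / i486["die_cm2"]
density_p4 = pentium4["peak_power_w"] / pentium4["die_cm2"]

for key, value in factors.items():
    print(f"{key:14s} {value:6.2f}x")
print(f"power density  {density_p4 / density_i486:6.2f}x")
```

Voltage dropped only about 3x while transistor count and frequency each rose 35x to 40x, so peak power and power density still climbed.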

• Circuits and process scaling alone can no longer solve all power problems

• SYSTEMS must also be power-aware
– OS
– Compilers
– Architecture

• Techniques at the architectural level are needed to reduce the absolute power dissipation as well as the power density

The Bottom Line

Microarchitectural Techniques for Power Reduction

[Figure: superscalar datapath with blocks labeled Fetch, Decode/Dispatch, Instruction dispatch, Instruction Issue, Function Units, Result/status forwarding, Architectural Register File, and D-cache]

A Superscalar Datapath

Performance = N × f × IPC

Actually, it’s the whole system, but we focus on the processor

• Dynamic power:
– Reduce the activity factor
– Reduce the switching capacitance (usually not possible)
– Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600 MHz, and its voltage can be dropped from 1.48 V to 0.95 V)

• Leakage power:
– Put some portions of the on-chip storage structures into a low-power standby mode, or even completely shut off the power supply to these partitions
– Resizing

• We usually give up some performance to save energy, but how much?

Microarchitectural Techniques—General Approach
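To get a feel for what the Pentium M voltage/frequency drop buys, the dynamic power relation (activity factor × C × V² × f) can be applied to the two operating points quoted above. This is a rough sketch that assumes the activity factor and capacitance stay the same and ignores leakage.

```python
def relative_dynamic_power(v_new, v_old, f_new, f_old):
    """Ratio of dynamic power after scaling, with activity factor and C unchanged."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Pentium M operating points from the slide: 1.48 V at 1.6 GHz, 0.95 V at 600 MHz.
ratio = relative_dynamic_power(0.95, 1.48, 600e6, 1.6e9)
print(f"dynamic power at the low setting: {ratio:.1%} of the high setting")
```

Under these assumptions the low setting draws roughly 15% of the dynamic power while running at 37.5% of the clock rate.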

• If we reduce voltage, there is a linear drop in maximum frequency (and performance)

• “The cube law”: P = kV³ (a ~1% change in V gives a ~3% change in P)
– If we use voltage scaling, we can trade approximately 1% of performance loss for 3% of power reduction.

• Any architectural technique that trades performance for power should do better than that (or at least as good). Otherwise simple voltage scaling can be used to achieve better tradeoffs.

Guideline
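The guideline can be turned into a quick test for any proposed technique. The sketch below assumes, as the slide does, that performance falls linearly with voltage and that power follows P = kV³; the function names and the sample numbers are hypothetical.

```python
def power_after_voltage_scaling(voltage_fraction):
    """Relative power under P = k * V**3 when V is scaled by voltage_fraction."""
    return voltage_fraction ** 3


def beats_voltage_scaling(perf_loss, power_saving):
    """True if a technique saves at least as much power as voltage scaling
    would for the same fractional performance loss."""
    baseline_saving = 1 - power_after_voltage_scaling(1 - perf_loss)
    return power_saving >= baseline_saving


saving_1pct = 1 - power_after_voltage_scaling(0.99)
print(f"1% lower voltage saves {saving_1pct:.2%} power")

# Hypothetical technique: 5% slower for 10% less power.
print(beats_voltage_scaling(perf_loss=0.05, power_saving=0.10))
```

In this example voltage scaling alone would save about 14% at a 5% performance loss, so the hypothetical technique fails the guideline.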

• Speculation is used to increase performance
• Wasted energy if it is wrong
• Can we speculate only when we think we’ll be right?
• Gating: temporarily prevent new instructions from entering the pipeline
• Use gating to avoid speculation beyond branches with low prediction accuracy
– The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long
– Reported: 38% energy savings in the wrong-path instructions with about 1% IPC loss

Examples: Front-End Throttling
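The gating decision described above amounts to a counter of unresolved low-confidence branches. The following is a minimal sketch, not the published design: the class name, the method names, and the threshold of 2 are assumptions, and the confidence estimate is taken as an input.

```python
class PipelineGate:
    """Gate fetch while too many low-confidence branches are unresolved."""

    def __init__(self, threshold=2):  # threshold is an assumed tuning parameter
        self.threshold = threshold
        self.low_confidence_in_flight = 0

    def branch_fetched(self, high_confidence):
        if not high_confidence:
            self.low_confidence_in_flight += 1

    def branch_resolved(self, high_confidence):
        if not high_confidence:
            self.low_confidence_in_flight -= 1

    def fetch_allowed(self):
        return self.low_confidence_in_flight < self.threshold


gate = PipelineGate(threshold=2)
gate.branch_fetched(high_confidence=False)
gate.branch_fetched(high_confidence=False)
gated = not gate.fetch_allowed()        # two low-confidence branches pending
gate.branch_resolved(high_confidence=False)
resumed = gate.fetch_allowed()          # one resolved, so fetch resumes
```

Fetch stays gated only as long as the count is at or above the threshold, which determines both when to gate and for how long.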

• Just-in-Time Instruction Delivery
– Fetch stage is throttled based on the number of in-flight instructions.
– If the number of in-flight instructions exceeds a predetermined threshold, the fetch is throttled
– Threshold is adjusted through the “tuning cycle”
– Reasons for energy savings:
  • Fewer instructions are processed along the mispredicted path
  • Instructions spend fewer cycles in the issue queue

Front-End Throttling (continued)

• General solutions:
– Use of multi-banked RFs. Each bank has fewer entries and fewer ports than the monolithic RF.
  • Problems:
    – Possible bank conflicts -> IPC loss
    – Overhead of the port arbitration logic
– Use of smaller cache-like structures to exploit access locality

Energy Reduction in the Register Files

• Value Aging Buffer (VAB)
– At the time of writeback, the results are written into a FIFO-style cache called the VAB
– The RF is updated only when values are evicted from the VAB.
– In many situations, this update can be avoided because a register may be deallocated during its residency in the VAB
– If a register is read from the VAB, there is no need to access the RF.
– Some performance loss due to the sequential access to the VAB and the RF.

Energy Reduction in the Register Files
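A behavioral sketch of the Value Aging Buffer idea, counting register file accesses rather than modeling timing. The class name, the capacity, and the counters are illustrative assumptions; only the FIFO write path, the eviction-time RF update, and the VAB-first read come from the slide.

```python
from collections import OrderedDict


class ValueAgingBuffer:
    """FIFO buffer in front of the register file (RF); counts RF accesses."""

    def __init__(self, capacity=4):  # capacity is an assumed parameter
        self.capacity = capacity
        self.entries = OrderedDict()  # physical register -> value, oldest first
        self.rf = {}
        self.rf_writes = 0
        self.rf_reads = 0

    def writeback(self, reg, value):
        if reg not in self.entries and len(self.entries) == self.capacity:
            old_reg, old_value = self.entries.popitem(last=False)  # evict oldest
            self.rf[old_reg] = old_value   # the RF is written only on eviction
            self.rf_writes += 1
        self.entries[reg] = value

    def deallocate(self, reg):
        # A register freed while still in the VAB never reaches the RF.
        self.entries.pop(reg, None)
        self.rf.pop(reg, None)

    def read(self, reg):
        if reg in self.entries:
            return self.entries[reg]       # VAB hit: no RF access needed
        self.rf_reads += 1
        return self.rf[reg]


vab = ValueAgingBuffer(capacity=2)
vab.writeback(31, 10)
vab.writeback(32, 20)
vab.deallocate(31)      # freed before eviction: RF write avoided
vab.writeback(33, 30)
vab.writeback(34, 40)   # evicts register 32 into the RF
```

In this run only one of the four results is ever written to the RF, and reads of registers 33 and 34 are served from the VAB.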

Isolation of short-lived operands

Out-of-Order Execution and In-Order Retirement

[Figure: in-order front end, out-of-order core, and in-order retirement, with blocks labeled Inst. Queue, Ex, and ARF]

• Used to cope with false data dependencies.
• A new physical register is allocated for EVERY new result
• P6 style: ROB slots serve as physical registers

Register Renaming

Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4

Renamed code:
LOAD P31, P2, 100
SUB P32, P31, P3
ADD P33, P32, P4

– Register Alias Table (RAT), also called the Rename Table (RT), maintains the mappings between logical and physical registers

Register Renaming: the Implementation

RAT columns: Arch. Reg, Phys. Reg., Location (0-ROB, 1-ARF)

Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4

Step 1: rename LOAD
RAT: 1 -> 31 (location 0)
Renamed code:
LOAD P31, R2, 100

Step 2: rename SUB
RAT: 1 -> 31 (location 0), 5 -> 32 (location 0)
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3

Step 3: rename ADD
RAT: 1 -> 33 (location 0), 5 -> 32 (location 0)
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
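The renaming steps can be sketched as a table lookup followed by an allocation. This is a simplified model under stated assumptions: physical registers are handed out sequentially starting at 31 (to match the example), sources with no mapping keep their architectural name (their value is in the ARF), and there is no free list or ROB. The function and variable names are illustrative.

```python
def rename(program, first_free_phys=31):
    """Rename destinations to fresh physical registers; returns (code, RAT)."""
    rat = {}                      # architectural register -> physical number
    next_phys = first_free_phys
    renamed = []
    for opcode, dest, *sources in program:
        # Sources are looked up BEFORE the destination mapping is updated.
        new_sources = [f"P{rat[s]}" if s in rat else s for s in sources]
        rat[dest] = next_phys     # every new result gets a new register
        next_phys += 1
        renamed.append(f"{opcode} P{rat[dest]}, " + ", ".join(new_sources))
    return renamed, rat


program = [
    ("LOAD", "R1", "R2", "100"),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed, rat = rename(program)
for line in renamed:
    print(line)
```

After the ADD is renamed, R1 maps to P33, so P31 no longer holds the current value of any architectural register.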

• Definition: a value is short-lived if the destination register is renamed by the time of the result generation.

• Identified one cycle before the result writeback

• A large percentage of all generated results are short-lived for SPEC 2000 benchmarks.

Short-Lived Values

LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
(figure label: RENAMER)

[Figure: Percentage of Short-Lived Values; 96-entry ROB, 4-way processor]

Percentage of Short-Lived Values

• Reasons for maintaining short-lived values:
– Recovering from branch mispredictions
– Reconstructing precise state if interrupts or exceptions occur

Why Keep Them?

LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4

Energy-dissipating Events

[Figure: in-order front end, out-of-order core, and in-order retirement, with blocks labeled Inst. Queue, Ex, and ARF; two events are marked "Write"]

Isolating Short-Lived Values: the Idea

[Figure: in-order front end, out-of-order core, and in-order retirement, with blocks labeled Inst. Queue, Ex, and ARF]

Write short-lived values into a small dedicated RF (SRF)

• Dynamically resizable caches
– Dynamically estimates the program requirements and adapts to the required cache size
– Cache is upsized or downsized at the end of periodic intervals based on the value of the cache miss counter
– Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors
– A bit mask is used to specify the number of address bits that are used for indexing into the set
– The cache size always changes by a factor of two

Energy Reduction in Caches
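The index mask and the factor-of-two resizing can be sketched as follows. All parameters here (set counts, block offset width, miss thresholds) are hypothetical, and the sketch ignores what happens to tags and data when the size changes.

```python
class ResizableCacheIndex:
    """Set indexing for a cache whose active set count changes by powers of two."""

    def __init__(self, max_sets=1024, min_sets=64, block_offset_bits=6):
        self.max_sets = max_sets
        self.min_sets = min_sets
        self.block_offset_bits = block_offset_bits
        self.active_sets = max_sets   # higher-numbered sets sleep when smaller

    @property
    def index_mask(self):
        # With a power-of-two set count, the mask selects the index bits in use.
        return self.active_sets - 1

    def set_index(self, address):
        return (address >> self.block_offset_bits) & self.index_mask

    def end_of_interval(self, miss_count, upsize_above=1000, downsize_below=100):
        # Thresholds are assumed values, compared against the miss counter.
        if miss_count > upsize_above and self.active_sets < self.max_sets:
            self.active_sets *= 2
        elif miss_count < downsize_below and self.active_sets > self.min_sets:
            self.active_sets //= 2


cache = ResizableCacheIndex()
index_full = cache.set_index(0xFFFF)      # all 10 index bits in use
cache.end_of_interval(miss_count=10)      # few misses: downsize to 512 sets
index_half = cache.set_index(0xFFFF)      # one fewer index bit
```

Dropping one mask bit halves the number of reachable sets, so the upper half can be put into the low-leakage mode.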

• Gating off portions of the execution units
– Disables the upper bits of the ALUs when they are not needed (for small operands)
– Energy can be reduced by 54% for integer programs

• Packing multiple narrow-width operations into a single ALU in the same cycle

• Steering instructions to FUs based on criticality information
– Critical instructions are steered to fast, power-hungry execution units; non-critical instructions are steered to slow, power-efficient units

Energy Reduction within the Execution Units

• Using Gray code for the addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994)
– Exploits the observation that programs often generate consecutive addresses
– Gray code: there is only a single transition on the address bus when consecutive addresses are accessed
– A 37% reduction in the switching activity is reported
– A Gray code encoder is placed at the transmitting end of the bus, and a decoder is needed at the receiving end

Encoding Addresses for Low Power
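A short sketch of the encoder and decoder, using the standard binary-reflected Gray code, together with a count of bus transitions for a run of consecutive addresses. The 16-address sequence is only an illustration, not the workload behind the reported 37%.

```python
def gray_encode(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(sequence):
    """Total number of bit flips between consecutive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))   # consecutive addresses 0..15
binary_toggles = transitions(addresses)
gray_toggles = transitions([gray_encode(a) for a in addresses])
print(binary_toggles, gray_toggles)
```

For purely sequential addresses the Gray-coded bus flips exactly one line per access; real address streams are only partly sequential, so the benefit in practice is smaller.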

• Bus-invert encoding
– Uses redundancy to reduce the number of transitions
– Adds one line to the bus to indicate whether the actual data or its complement is transmitted
– If the Hamming distance between the current value and the previous one is less than or equal to n/2 (for n bits), the value is transmitted as is, and 0 is transmitted on the extra line.
– Otherwise, the complement of the value is transmitted and the extra line is set to 1
– The average number of bus transitions per clock cycle is lowered by 25% as a result

Encoding Data for Low Power
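The bus-invert rule can be sketched in a few lines. Two assumptions are made here: the "previous value" is taken to be what was last driven on the data lines, and the bus starts at all zeros. The function names and the 8-bit width are illustrative.

```python
def bus_invert_encode(values, width=8):
    """Return a list of (data_lines, invert_line) pairs for the given values."""
    mask = (1 << width) - 1
    bus = 0   # assumed initial state of the data lines
    encoded = []
    for value in values:
        distance = bin((value ^ bus) & mask).count("1")   # Hamming distance
        if distance <= width // 2:
            bus, invert = value & mask, 0    # send the value as is
        else:
            bus, invert = ~value & mask, 1   # send the complement
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, width=8):
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


values = [0xFF, 0x00, 0x0F]
encoded = bus_invert_encode(values)
decoded = bus_invert_decode(encoded)
```

Sending 0xFF on an all-zero bus would flip all eight data lines; with bus-invert the data lines stay at zero and only the extra line changes.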

• Can the compiler help?

• Can the OS help?
– E.g., control voltage scaling
– Control turning off devices

OS and Compiler Techniques
