CR18: Advanced Compilers
L08: Memory & Power
Tomofumi Yuki
2
Memory Expansion
Recall Array Dataflow Analysis:
- start from loops, get value-based dependences
- corresponds to Alpha = no notion of memory
It is sometimes called Full Array Expansion:
- explicit dependences with single assignment
- full parallelism exposed
3
Memory vs Parallelism
More parallelism requires more memory; the obvious example is scalar accumulation.
One approach: ignore the problem by using memory-based dependences.
Alternatively, we can try to find a memory allocation afterwards.
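The scalar-accumulation example can be made concrete. A minimal sketch (illustration only, not from the slides; `sum_scalar` and `sum_expanded` are hypothetical names): expanding the accumulator into a single-assignment array replaces the memory-based dependence on one scalar with value-based dependences, at the cost of O(n) extra memory.

```python
# Full expansion of a scalar accumulation (illustrative sketch).

def sum_scalar(a):
    s = 0                       # one memory cell, reused every iteration
    for x in a:
        s = s + x               # dependence through memory: fully serial
    return s

def sum_expanded(a):
    n = len(a)
    s = [0] * (n + 1)           # one cell per iteration: single assignment
    for i in range(n):
        s[i + 1] = s[i] + a[i]  # value-based dependence only
    return s[n]                 # all partial sums s[1..n] remain available

a = [3, 1, 4, 1, 5]
assert sum_scalar(a) == sum_expanded(a) == 14
```

Once expanded, the chain of partial sums can be re-associated into a parallel reduction; the price is the n+1 memory cells that a later allocation phase would try to contract back.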
4
Memory Allocation
Given a schedule:
- Memory Reuse Analysis [1996]
- Lefebvre-Feautrier [1998]
- Quilleré-Rajopadhye [2000]
- Lattice-Based [2005]
For a set of schedules:
- Universal Occupancy Vectors [1998]
- Affine Universal Occupancy Vectors [2001]
- Quasi-Universal Occupancy Vectors [2013]
5
Occupancy Vectors
Main Concept: a vector (in the iteration space) that gives another iteration that can safely overwrite.
Universal OV: an OV that is legal for any schedule; the affine and quasi- variants restrict the universe to a smaller subset.
6
Universal Occupancy Vectors
Only for uniform dependences:
- all iterations have the same dependence pattern
- large enough domain (no thin strips)
Key Idea: Transitivity. Some iteration z can overwrite z’ if z depends on all uses of z’, possibly transitively.
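This transitivity condition can be checked mechanically for uniform dependences. A sketch (my own formulation of the slide's idea; dependences are constant vectors d, meaning iteration z reads the value produced at z - d): a candidate v is a valid UOV if, for every dependence d, the vector v - d is a non-negative integer combination of the dependence vectors, i.e., z' + v transitively depends on every use z' + d of z'.

```python
from itertools import product

# Uniform dependences: iteration z reads the value produced at z - d.
DEPS = [(1, 0), (0, 1)]

def reachable(v, deps, bound=10):
    """Is v a non-negative integer combination of deps (bounded search)?"""
    return any(
        all(sum(c * d[k] for c, d in zip(cs, deps)) == v[k]
            for k in range(len(v)))
        for cs in product(range(bound + 1), repeat=len(deps))
    )

def is_uov(v, deps=DEPS):
    """v is a UOV if z' + v transitively depends on every use z' + d."""
    return all(
        reachable(tuple(v[k] - d[k] for k in range(len(v))), deps)
        for d in deps
    )

assert is_uov((1, 1))      # z' + (1,1) reaches both uses of z'
assert not is_uov((1, 0))  # misses the use at z' + (0,1)
```

For the two-dependence example above, [1,1] passes the test while [1,0] does not, matching the questions on the next slides.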
7
UOV Example
Find a UOV for the following
(figure: iteration space, axes i and j)
8
UOV Example
Find a UOV for the following: is [1,1] a valid UOV? How does it translate to a memory mapping?
(figure: iteration space, axes i and j)
9
UOV Example
Find a UOV for the following: how about [1,0]?
(figure: iteration space, axes i and j)
10
UOV Example
Find a UOV for the following
(figure: iteration space, axes i and j)
11
UOV Example
Alternative Formulation: as the intersection of transitive closures
(figure: iteration space, axes i and j)
12
Affine UOV Example
Restrict to affine schedules but allow affine dependences
(figure: iteration space, axes i and j)
13
Relevance of UOVs
UOV allocates a (d-1)-dimensional array for a d-dimensional iteration space
Does this sound like a problem?
What can you say about programs with only uniform dependences?
How does this relate to tiling?
14
Memory Allocation/Contraction
We are given an affine schedule θ:
- per statement
- possibly multi-dimensional
Problem: find affine pseudo-projections:
- affine function + modulo factors, per statement
- usually minimizing the memory usage
15
Pseudo Projection
Assume lexicographic order as the schedule: what is a valid OV?
(figure: iteration space, axes i and j)
16
Pseudo Projection
Assume lexicographic order as the schedule: what is a valid OV? [2,0], which translates to
(figure: iteration space, axes i and j)

for i
  for j
    A[i%2, j] = foo(A[(i-1)%2, j], A[i%2, j-1]);
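The contraction can be validated by running it against a fully expanded reference. A small sketch (`foo` is a stand-in operator, as on the slide; boundaries initialized to zero for illustration): the two-row buffer A[i%2][j] produces the same final row as the full array.

```python
N = 8

def foo(up, left):
    # Hypothetical operator; any function of the two arguments works here.
    return up + left + 1

def full_array():
    A = [[0] * (N + 1) for _ in range(N + 1)]  # row/col 0 as zero boundary
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = foo(A[i - 1][j], A[i][j - 1])
    return A[N]

def contracted():
    A = [[0] * (N + 1) for _ in range(2)]      # only two rows live at once
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i % 2][j] = foo(A[(i - 1) % 2][j], A[i % 2][j - 1])
    return A[N % 2]

assert full_array() == contracted()
```

The mapping is valid because the value at (i, j) is dead once (i+1, j) has read it, so iterations two apart in i can share a cell: exactly the occupancy vector above.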
17
Allocation vs Contraction
Most programs have many more statements than arrays.
Memory allocation techniques: map each statement to its own array, then try to merge arrays afterwards.
Array contraction techniques: keep the original statement-to-array mapping.
There is little difference in the underlying theory.
18
Liveness of Values
Central analysis in memory allocation; called liveness analysis in register allocation (the same idea under different names).
Given a value computed at S(z) and used by T(z’): we cannot overwrite the value of S(z), written at θ(S,z), until θ(T,z’), for all T, z’.
19
Computing the Liveness
How to compute the liveness? Compare θ(i,j) = i with θ(i,j) = i+j.
(figure: iteration space, axes i and j)
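One brute-force way to answer this (my own sketch, assuming the two uniform dependences (1,0) and (0,1) from the running example): give every value a live interval from its write time to its last read time under the schedule, and count the maximum number of simultaneously live values, which bounds the memory needed.

```python
N = 4
DEPS = [(1, 0), (0, 1)]   # value at (i,j) is used by (i+1,j) and (i,j+1)

def max_live(theta):
    """Max number of values simultaneously live under schedule theta."""
    count = {}
    for i in range(N):
        for j in range(N):
            uses = [(i + di, j + dj) for (di, dj) in DEPS
                    if i + di < N and j + dj < N]
            if not uses:
                continue
            birth = theta(i, j)
            death = max(theta(*u) for u in uses)  # live until the last read
            for t in range(birth, death + 1):
                count[t] = count.get(t, 0) + 1
    return max(count.values())

# theta(i,j) = i: row i plus the still-read row i-1 are live: 2N values
assert max_live(lambda i, j: i) == 2 * N
# theta(i,j) = i+j: two adjacent anti-diagonals are live: fewer values
assert max_live(lambda i, j: i + j) < 2 * N
```

The resulting counts suggest the shape of the allocation: for θ(i,j) = i, two full rows must coexist, which is what the A[i%2, j] mapping provides.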
20
Lefebvre-Feautrier
How to find the allocation? For θ(i,j) = i:
1. Start with a scalar
2. Expand in a dimension
3. Use the max reuse distance as the modulo factor
(figure: iteration space, axes i and j)
21
Lefebvre-Feautrier
Alternative Description:
1. Start with the full array
2. Project in a dimension
3. Compute the modulo factor
(figure: iteration space, axes i and j)
22
Quilleré-Rajopadhye
Based on non-canonic projections.
Main Result: optimality. For a d-D space, if you find x independent projections, what can you say about the memory usage?
23
Lattice-Based Allocation
Different formulation using lattices: consider some basis of an integer lattice.
(figure: iteration space, axes i and j)
26
Lattice-Based Allocation
Lattices ≈ Occupancy Vectors
Conflict Set: values that cannot be mapped to the same memory location.
Find the smallest lattice that intersects the conflict set only at its base; enumeration of the space using HNF (Hermite Normal Form).
27
28
Energy-Aware Compilation
Power Wall: power density led to multi-core.
Saving energy is important:
- a barrier for exa-scale computing
- battery lifetime of laptops
Compiler optimization has focused on speed. Is there anything compilers can do for energy? Speed is still important.
29
Starting Hypothesis
Energy is power consumed over time: E = P·T
- P : power consumption
- E : energy consumption
- T : execution time
Faster execution time = lower energy consumption.
Hypothesis: optimizing for speed also optimizes energy.
30
Single Processor Case
Two main categories:
- purely program transformations: efficient use of the data cache; an energy-aware compilation framework
- Dynamic Voltage and Frequency Scaling (DVFS): profile-based; loop transformation + DVFS
31
Efficient use of data cache
HW with a configurable cache line size (CLS). Trade-off: a larger CLS gives better spatial locality but higher interference.
Main Contribution: a model to maximize the hit ratio.
A configurable CLS leads to an energy trade-off.
In general-purpose processors, data locality optimization ≈ energy optimization of the cache.
32
[D’Alberto et al., 2001]
Energy Aware Compilation
A compiler framework with energy in mind, based on predicting power consumption from high-level source code.
Energy-Aware Tiling: the optimal tiling strategy for speed != the one for energy. Key: tiling adds instructions.
Main Weakness: the improvement is relatively small (~10%), and energy is traded with speed.
33
[Kadayif et al., 2002]
Results by Kadayif et al.
Increase in energy / execution cycles when optimized for the other; the energy-delay product would not change much.
34

benchmark  Energy  Cycle
fir        4.1%    5.9%
conv       7.7%    8.7%
lms        6.8%    7.2%
real       3.9%    2.9%
biquad     2.0%    2.3%
complex    8.8%    7.6%
mxm        5.9%    9.2%
vpenta     7.3%    6.8%
HW for Further Optimization
Dynamic Voltage and Frequency Scaling. Power consumption model for CMOS:
  P = α C V² f   (dynamic power, with C the switched capacitance)
- V : supply voltage
- f : frequency
- α : activity rate
Voltage is the obvious target: high frequency requires high voltage, and reducing the frequency yields quadratic energy savings.
35
DVFS : Main Idea
Identify non-compute-intensive stages where frequency/voltage can be reduced without influencing speed (the processor is under-utilized).
DVFS states are coarse-grained:
- ~10 different frequency/voltage configurations
- state transitions are not free: 100s of cycles, extra energy consumed
36
36
DVFS : Single Processor
Profile-Based: profile to identify opportunities; compile-time vs. run-time; limited by the available opportunities.
Loop Transformation: first optimize for speed, then convert the speedup into energy savings; transformations to expose opportunities.
37
[Hsu and Kremer 2003, Hsu and Feng 2005]
[Ghodrat and Gvargis 2009]
DVFS : Single Processor
Task-Based Programs. Main Idea: Decoupled Access/Execute.
A compiler transformation splits the program into tasks:
- one does memory Accesses to fetch data
- another does Execute to compute
Apply DVFS: low frequency for Access, high frequency for Execute.
38
[Jimborean et al. 2014]
Single Processor : Summary
Purely software-based optimization: no significant gains over optimizing for speed; the hypothesis holds in this case.
DVFS-based approaches: HW for energy savings is exposed to software; identify when the processor is not fully utilized. HW support breaks the hypothesis.
39
Across Processors
Parallelization is necessary to utilize modern architectures
How does parallelism affect energy?
Amdahl’s Law for Energy
Opportunities in parallel programs
40
Static Power
A new term in the power model: some power is consumed even when idle.
  P = I V + α C V² f
  (static power)  (dynamic power)
- I : leakage current
DVFS has less effect; static power is reaching 50% of the total power.
41
Amdahl’s Law for Energy
Simple model of energy and parallelism; processors have DVFS.
Simple, but more complicated than the original.
Speed-up / energy trade-off analysis.
[Cho and Melhem 2008]
- s : sequential fraction
- p : parallel fraction
- N : number of processors
- λ : static power
- y : power consumption as a function of frequency
The model has three terms: sequential dynamic, parallel dynamic, and static energy.
42
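A toy version of such a model (a sketch in the spirit of the paper, not its exact formulation; the cubic dynamic-power law y(f) = f³ and the per-processor static power λ are my assumptions): the sequential part runs on one processor, the parallel part on N, and static power is paid by all N processors for the whole execution.

```python
def energy(f, N, s=0.2, p=0.8, lam=0.1):
    """Toy Amdahl-style energy model (illustrative only).

    f   : normalized frequency (1.0 = max), time scales as 1/f
    N   : number of processors
    s,p : sequential / parallel fractions of the work (s + p = 1)
    lam : static power per processor
    """
    y = f ** 3                                   # dynamic power, cubic in f
    T = s / f + p / (N * f)                      # execution time
    E_dyn = y * (s / f) + N * y * (p / (N * f))  # seq + parallel dynamic
    E_static = lam * N * T                       # all N leak for the whole run
    return E_dyn + E_static

# With low static power, slowing down saves energy...
assert energy(0.5, 4, lam=0.01) < energy(1.0, 4, lam=0.01)
# ...but with high static power, running fast wins (race-to-sleep).
assert energy(1.0, 4, lam=1.0) < energy(0.5, 4, lam=1.0)
```

The crossover between the two regimes is exactly the speed-up/energy trade-off the model is used to analyze.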
Illustrating example from the paper (plot: energy vs. frequency)
43
When Static Power is 50% (plot: energy vs. frequency)
44
Static Power dominates
Static power is significant, and it increases as N increases: excessive processors are bad.
With current technology (high static power and increasing core counts), running as fast as possible is a good way to save energy.
45
Generalizing a bit Further
Analysis based on a high-level energy model, with emphasis on the power breakdown: find when “race-to-sleep” is the best.
Survey the power breakdown of recent machines.
Goal: confirm that sophisticated use of DVFS by compilers is not likely to help much, e.g., analysis/transformations to find/expose a “sweet spot” for trading speed with energy.
46
Power Breakdown
Dynamic (Pd): consumed when bits flip; quadratic savings as voltage scales.
Static (Ps): leaked while current is flowing; linear savings as voltage scales.
Constant (Pc): everything else (e.g., memory, motherboard, disk, network card, power supply, cooling, ...); little or no effect from voltage scaling.
47
Influence on Execution Time
Voltage and frequency are linearly related, with a slope less than 1: scale voltage by half, and the frequency drop is less than half.
Simplifying assumptions:
- a frequency change directly influences execution time: scale frequency by x, and time becomes 1/x
- fully flexible (continuous) scaling; in practice there is only a small set of discrete states
48
Ratio is the Key: Pd : Ps : Pc
(each case: plots of power and time vs. scaling)
- Case 1: Dynamic dominates => energy: the slower the better
- Case 2: Static dominates => energy: no harm, but no gain
- Case 3: Constant dominates => energy: the faster the better
49
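The three cases follow from a one-line energy formula. Under the slide's scaling assumptions (time scales as 1/x; with V proportional to f, dynamic power scales as x³ and static power as x; the arithmetic sketch is mine), the energy at scaling factor x is E(x) = Pd·x² + Ps + Pc/x:

```python
def energy(x, Pd, Ps, Pc):
    """Energy at frequency scaling factor x (x = 1 is full speed).

    Power scales as Pd*x**3 + Ps*x + Pc and time as 1/x, so
    E(x) = Pd*x**2 + Ps + Pc/x.
    """
    return Pd * x ** 2 + Ps + Pc / x

full, half = 1.0, 0.5

# Case 1: dynamic dominates -> the slower the better.
assert energy(half, Pd=8, Ps=1, Pc=1) < energy(full, Pd=8, Ps=1, Pc=1)
# Case 2: static dominates -> no harm, no gain
# (the Ps term is independent of x in the energy).
# Case 3: constant dominates -> the faster the better.
assert energy(full, Pd=1, Ps=1, Pc=8) < energy(half, Pd=1, Ps=1, Pc=8)
```

Note how the static term drops out of the comparison entirely: linearly-scaling power contributes a constant amount of energy regardless of x.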
When do we have Case 3?
Static power is now more than dynamic power, and power gating doesn’t help while computing.
Assume Pd = Ps:
- 50% of CPU power is due to leakage
- roughly matches 45nm technology
- further shrinking = even more leakage
The borderline is when Pd = Ps = Pc: we have Case 3 when Pc is larger than Pd = Ps.
50
Extensions to The Model
Impact on execution time: it may not be directly proportional to frequency; this shifts the borderline in favor of DVFS (a larger Ps and/or Pc is required for Case 3).
Parallelism: no influence on the result. CPU power is even less significant than in the 1-core case; the power budget of a chip is shared (multi-core), and network cost is added (distributed).
51
Do we have Case 3?
Survey of machines and the significance of Pc. Based on published power budgets (TDP) and published power measures, not on detailed/individual measurements.
Conservative assumptions:
- use an upper bound for CPU power
- use a lower bound for constant powers
- assume high PSU efficiency
52
Pc in Current Machines
Sources of constant power:
- stand-by memory (1W/1GB): memory cannot go idle while the CPU is working
- power supply unit (10-20% loss): transforming AC to DC
- motherboard (6W)
- cooling fan (10-15W): fully active when the CPU is working
Desktop processor TDP ranges from 40-90W, up to 130W for large core counts (8 or 16).
53
Server and Desktop Machines: Methodology
Compute a lower bound on Pc. Does it exceed 33% of total system power? If so, Case 3 holds even if all the rest were consumed by the processor.
System load:
- desktop: compute-intensive benchmarks
- server: server workloads (not as compute-intensive)
54
Desktop and Server Machines
55
Cray Supercomputers
Methodology:
- let Pd + Ps be the sum of the processors’ TDPs
- let Pc be the sum of PSU loss (5%), cooling (10%), and memory (1W/1GB)
- check whether Pc exceeds Pd = Ps
Two cases for memory configuration (min/max).
56
Cray Supercomputers
(chart: stacked power breakdown, 0-100%, for XT5/XT6/XE6 with min and max memory; categories: Other, PSU+Cooling, Memory, CPU-static, CPU-dynamic)
59
DVFS for Memory
Still at the research stage (since ~2010): the same principle applied to memory.
There is a quadratic component in power w.r.t. voltage: 25% quadratic, 75% linear.
The model can be adapted:
- Pd becomes Pq (dynamic to quadratic)
- Ps becomes Pl (static to linear)
The same story, but with Pq : Pl : Pc.
60
Influence on “race-to-sleep”
Methodology Move memory power from Pc to Pq and
Pl
25% to Pq and 75% to Pl
Pc becomes 15% of total power for Server/Cray
“race-to-sleep” may not be the best anymore
remains to be around 30% for desktop Vary Pq:Pl ratio to find when “race-to-
sleep” is the winner again leakage is expected to keep increasing
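Plugging this methodology into the same kind of energy formula (illustrative numbers only, roughly matching the Server/Cray split above; the closed-form optimum is my own arithmetic): with the memory power folded into Pq and Pl, the energy-optimal scaling factor x* = (Pc / (2·Pq))^(1/3) drops below 1, so running at full speed no longer minimizes energy.

```python
def energy(x, Pq, Pl, Pc):
    """E(x) = Pq*x**2 + Pl + Pc/x: quadratic, linear, constant power parts.

    The linearly-scaling power Pl contributes a constant amount of energy,
    since (Pl * x) * (1 / x) = Pl.
    """
    return Pq * x ** 2 + Pl + Pc / x

# Assumed split (illustration, not measured): Pc down to 15% after moving
# memory power into Pq (25% of it) and Pl (75% of it).
Pq, Pl, Pc = 0.30, 0.55, 0.15

x_opt = (Pc / (2 * Pq)) ** (1 / 3)   # where dE/dx = 2*Pq*x - Pc/x**2 = 0
assert x_opt < 1                     # full speed is no longer optimal
assert energy(x_opt, Pq, Pl, Pc) < energy(1.0, Pq, Pl, Pc)
```

Conversely, race-to-speed wins again once Pc grows back to at least 2·Pq, which is why the desktop case (Pc around 30%) behaves differently.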
61
When “Race to Sleep” is optimal: when the derivative of energy w.r.t. scaling is > 0.
62
(plot: dE/dF against the linearly scaling fraction Pl / (Pq + Pl))
Summary and Conclusion
Diminishing returns of DVFS: the main reason is leakage power. Confirmation by a high-level energy model: “race-to-speed” seems to be the way to go, and memory DVFS won’t change the big picture.
Compilers can continue to focus on speed; there is no significant gain in energy efficiency by sacrificing speed.
63
Balancing Computation and I/O
DVFS can improve energy efficiency when speed is not sacrificed: bring the program to a compute-I/O balanced state.
- if it’s memory-bound, slow down the CPU
- if it’s compute-bound, slow down the memory
This still maximizes hardware utilization, but by lowering the hardware capability. Current hardware (e.g., Intel Turbo Boost) and/or the OS already do this for the processor.
64
65
66
The Punch Line Method
How to punch your audience, i.e., how to attract your audience and make your talk more effective. Learned from Michelle Strout (Colorado State University); applicable to any talk.
(chart: audience rating, from poor to excellent, for a Normal Talk vs. a Punch Line Talk)
67
The Punch Line
The key cool idea in your paper: the key insight.
It is not the key contribution (“X% better than Y”, “does well on all benchmarks”).
Examples: “... because of HW prefetching”, “... further improves locality after reaching compute-bound”.
68
Typical Conference Audience
Many things to do: check emails, browse websites, finish their own slides.
Attention level (made-up numbers):
- ~3 minutes: 90%
- ~5 minutes: 60%
- 5+ minutes: 30%
- conclusion: 70%
Punch here! Push these numbers up!
69
Typical (Boring) Talk
1. Introduction 2. Motivation 3. Background 4. Approach 5. Results 6. Discussion 7. Conclusion
70
Punch Line Talk
Two Talks in One:
- a 5-minute talk: introduction/motivation and the key idea; this is the punch, and the shortest path to it
- an (X-5)-minute talk: add some background, elaborate on the approach, ...
71
Pitfalls of Beamer
Beamer != bad slides, but it is an easy path to them.
Checklist for good slides:
- no full sentences
- LARGE font size
- few equations
- many figures
- !(paper structure)
Beamer is not the best tool to encourage these.