CR18: Advanced Compilers
L08: Memory & Power
Tomofumi Yuki
2
Memory Expansion
Recall Array Dataflow Analysis:
- start from loops, get value-based dependences
- corresponds to Alpha = no notion of memory
It is sometimes called Full Array Expansion:
- explicit dependences with single assignment
- full parallelism exposed
3
Memory vs Parallelism
More parallelism requires more memory; the obvious example is scalar accumulation.
One approach: ignore the problem by using memory-based dependences.
Alternatively, we can try to find a memory allocation afterwards.
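The scalar-accumulation example can be made concrete. A minimal sketch (illustration only, not from the slides; `sum_scalar` and `sum_expanded` are hypothetical names): expanding the accumulator into a single-assignment array replaces the memory-based dependence on one scalar with value-based dependences, at the cost of O(n) extra memory.

```python
# Full expansion of a scalar accumulation (illustrative sketch).

def sum_scalar(a):
    s = 0                       # one memory cell, reused every iteration
    for x in a:
        s = s + x               # dependence through memory: fully serial
    return s

def sum_expanded(a):
    n = len(a)
    s = [0] * (n + 1)           # one cell per iteration: single assignment
    for i in range(n):
        s[i + 1] = s[i] + a[i]  # value-based dependence only
    return s[n]                 # all partial sums s[1..n] remain available

a = [3, 1, 4, 1, 5]
assert sum_scalar(a) == sum_expanded(a) == 14
```

Once expanded, the chain of partial sums can be re-associated into a parallel reduction; the price is the n+1 memory cells that a later allocation phase would try to contract back.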
4
Memory Allocation
Given a schedule:
- Memory Reuse Analysis [1996]
- Lefebvre-Feautrier [1998]
- Quilleré-Rajopadhye [2000]
- Lattice-Based [2005]
For a set of schedules:
- Universal Occupancy Vectors [1998]
- Affine Universal Occupancy Vectors [2001]
- Quasi-Universal Occupancy Vectors [2013]
5
Occupancy Vectors
Main Concept: a vector (in the iteration space) that gives another iteration that can safely overwrite.
Universal OV: an OV that is legal for any schedule; the affine and quasi- variants restrict the universe to a smaller subset.
6
Universal Occupancy Vectors
Only for uniform dependences:
- all iterations have the same dependence pattern
- large enough domain (no thin strips)
Key Idea: Transitivity. Some iteration z can overwrite z’ if z depends on all uses of z’, possibly transitively.
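This transitivity condition can be checked mechanically for uniform dependences. A sketch (my own formulation of the slide's idea; dependences are constant vectors d, meaning iteration z reads the value produced at z - d): a candidate v is a valid UOV if, for every dependence d, the vector v - d is a non-negative integer combination of the dependence vectors, i.e., z' + v transitively depends on every use z' + d of z'.

```python
from itertools import product

# Uniform dependences: iteration z reads the value produced at z - d.
DEPS = [(1, 0), (0, 1)]

def reachable(v, deps, bound=10):
    """Is v a non-negative integer combination of deps (bounded search)?"""
    return any(
        all(sum(c * d[k] for c, d in zip(cs, deps)) == v[k]
            for k in range(len(v)))
        for cs in product(range(bound + 1), repeat=len(deps))
    )

def is_uov(v, deps=DEPS):
    """v is a UOV if z' + v transitively depends on every use z' + d."""
    return all(
        reachable(tuple(v[k] - d[k] for k in range(len(v))), deps)
        for d in deps
    )

assert is_uov((1, 1))      # z' + (1,1) reaches both uses of z'
assert not is_uov((1, 0))  # misses the use at z' + (0,1)
```

For the two-dependence example above, [1,1] passes the test while [1,0] does not, matching the questions on the next slides.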
7
UOV Example
Find a UOV for the following
(figure: iteration space, axes i and j)
8
UOV Example
Find a UOV for the following: is [1,1] a valid UOV? How does it translate to a memory mapping?
(figure: iteration space, axes i and j)
9
UOV Example
Find a UOV for the following: how about [1,0]?
(figure: iteration space, axes i and j)
10
UOV Example
Find a UOV for the following
(figure: iteration space, axes i and j)
11
UOV Example
Alternative Formulation: as the intersection of transitive closures
(figure: iteration space, axes i and j)
12
Affine UOV Example
Restrict to affine schedules but allow affine dependences
(figure: iteration space, axes i and j)
13
Relevance of UOVs
UOV allocates a (d-1)-dimensional array for a d-dimensional iteration space
Does this sound like a problem?
What can you say about programs with only uniform dependences?
How does this relate to tiling?
14
Memory Allocation/Contraction
We are given an affine schedule θ:
- per statement
- possibly multi-dimensional
Problem: find affine pseudo-projections:
- affine function + modulo factors, per statement
- usually minimizing the memory usage
15
Pseudo Projection
Assume lexicographic order as the schedule: what is a valid OV?
(figure: iteration space, axes i and j)
16
Pseudo Projection
Assume lexicographic order as the schedule: what is a valid OV? [2,0], which translates to
(figure: iteration space, axes i and j)

for i
  for j
    A[i%2, j] = foo(A[(i-1)%2, j], A[i%2, j-1]);
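The contraction can be validated by running it against a fully expanded reference. A small sketch (`foo` is a stand-in operator, as on the slide; boundaries initialized to zero for illustration): the two-row buffer A[i%2][j] produces the same final row as the full array.

```python
N = 8

def foo(up, left):
    # Hypothetical operator; any function of the two arguments works here.
    return up + left + 1

def full_array():
    A = [[0] * (N + 1) for _ in range(N + 1)]  # row/col 0 as zero boundary
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = foo(A[i - 1][j], A[i][j - 1])
    return A[N]

def contracted():
    A = [[0] * (N + 1) for _ in range(2)]      # only two rows live at once
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i % 2][j] = foo(A[(i - 1) % 2][j], A[i % 2][j - 1])
    return A[N % 2]

assert full_array() == contracted()
```

The mapping is valid because the value at (i, j) is dead once (i+1, j) has read it, so iterations two apart in i can share a cell: exactly the occupancy vector above.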
17
Allocation vs Contraction
Most programs have many more statements than arrays.
Memory allocation techniques: map each statement to its own array, then try to merge arrays afterwards.
Array contraction techniques: keep the original statement-to-array mapping.
There is little difference in the underlying theory.
18
Liveness of Values
Central analysis in memory allocation; called liveness analysis in register allocation (the same idea under different names).
Given a value computed at S(z) and used by T(z’): we cannot overwrite the value of S(z), written at θ(S,z), until θ(T,z’), for all T, z’.
19
Computing the Liveness
How to compute the liveness? Compare θ(i,j) = i with θ(i,j) = i+j.
(figure: iteration space, axes i and j)
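One brute-force way to answer this (my own sketch, assuming the two uniform dependences (1,0) and (0,1) from the running example): give every value a live interval from its write time to its last read time under the schedule, and count the maximum number of simultaneously live values, which bounds the memory needed.

```python
N = 4
DEPS = [(1, 0), (0, 1)]   # value at (i,j) is used by (i+1,j) and (i,j+1)

def max_live(theta):
    """Max number of values simultaneously live under schedule theta."""
    count = {}
    for i in range(N):
        for j in range(N):
            uses = [(i + di, j + dj) for (di, dj) in DEPS
                    if i + di < N and j + dj < N]
            if not uses:
                continue
            birth = theta(i, j)
            death = max(theta(*u) for u in uses)  # live until the last read
            for t in range(birth, death + 1):
                count[t] = count.get(t, 0) + 1
    return max(count.values())

# theta(i,j) = i: row i plus the still-read row i-1 are live: 2N values
assert max_live(lambda i, j: i) == 2 * N
# theta(i,j) = i+j: two adjacent anti-diagonals are live: fewer values
assert max_live(lambda i, j: i + j) < 2 * N
```

The resulting counts suggest the shape of the allocation: for θ(i,j) = i, two full rows must coexist, which is what the A[i%2, j] mapping provides.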
20
Lefebvre-Feautrier
How to find the allocation? For θ(i,j) = i:
1. Start with a scalar
2. Expand in a dimension
3. Use the max reuse distance as the modulo factor
(figure: iteration space, axes i and j)
21
Lefebvre-Feautrier
Alternative Description:
1. Start with the full array
2. Project in a dimension
3. Compute the modulo factor
(figure: iteration space, axes i and j)
22
Quilleré-Rajopadhye
Based on non-canonic projections.
Main Result: optimality. For a d-D space, if you find x independent projections, what can you say about the memory usage?
23
Lattice-Based Allocation
Different formulation using lattices: consider some basis of an integer lattice.
(figure: iteration space, axes i and j)
26
Lattice-Based Allocation
Lattices ≈ Occupancy Vectors
Conflict Set: values that cannot be mapped to the same memory location.
Find the smallest lattice that intersects the conflict set only at its base; enumeration of the space using HNF (Hermite Normal Form).
27
28
Energy-Aware Compilation
Power Wall: power density led to multi-core.
Saving energy is important:
- a barrier for exa-scale computing
- battery lifetime of laptops
Compiler optimization has focused on speed. Is there anything compilers can do for energy? Speed is still important.
29
Starting Hypothesis
Energy is power consumed over time: E = P·T
- P : power consumption
- E : energy consumption
- T : execution time
Faster execution time = lower energy consumption.
Hypothesis: optimizing for speed also optimizes energy.
30
Single Processor Case
Two main categories:
- purely program transformations: efficient use of the data cache; an energy-aware compilation framework
- Dynamic Voltage and Frequency Scaling (DVFS): profile-based; loop transformation + DVFS
31
Efficient use of data cache
HW with a configurable cache line size (CLS). Trade-off: a larger CLS gives better spatial locality but higher interference.
Main Contribution: a model to maximize the hit ratio.
A configurable CLS leads to an energy trade-off.
In general-purpose processors, data locality optimization ≈ energy optimization of the cache.
32
[D’Alberto et al., 2001]
Energy Aware Compilation
A compiler framework with energy in mind, based on predicting power consumption from high-level source code.
Energy-Aware Tiling: the optimal tiling strategy for speed != the one for energy. Key: tiling adds instructions.
Main Weakness: the improvement is relatively small (~10%), and energy is traded with speed.
33
[Kadayif et al., 2002]
Results by Kadayif et al.
Increase in energy / execution cycles when optimized for the other; the energy-delay product would not change much.
34

benchmark  Energy  Cycle
fir        4.1%    5.9%
conv       7.7%    8.7%
lms        6.8%    7.2%
real       3.9%    2.9%
biquad     2.0%    2.3%
complex    8.8%    7.6%
mxm        5.9%    9.2%
vpenta     7.3%    6.8%
HW for Further Optimization
Dynamic Voltage and Frequency Scaling. Power consumption model for CMOS:
  P = α C V² f   (dynamic power, with C the switched capacitance)
- V : supply voltage
- f : frequency
- α : activity rate
Voltage is the obvious target: high frequency requires high voltage, and reducing the frequency yields quadratic energy savings.
35
DVFS : Main Idea
Identify non-compute-intensive stages where frequency/voltage can be reduced without influencing speed (the processor is under-utilized).
DVFS states are coarse-grained:
- ~10 different frequency/voltage configurations
- state transitions are not free: 100s of cycles, extra energy consumed
36
36
DVFS : Single Processor
Profile-Based: profile to identify opportunities; compile-time vs. run-time; limited by the available opportunities.
Loop Transformation: first optimize for speed, then convert the speedup into energy savings; transformations to expose opportunities.
37
[Hsu and Kremer 2003, Hsu and Feng 2005]
[Ghodrat and Gvargis 2009]
DVFS : Single Processor
Task-Based Programs. Main Idea: Decoupled Access/Execute.
A compiler transformation splits the program into tasks:
- one does memory Accesses to fetch data
- another does Execute to compute
Apply DVFS: low frequency for Access, high frequency for Execute.
38
[Jimborean et al. 2014]
Single Processor : Summary
Purely software-based optimization: no significant gains over optimizing for speed; the hypothesis holds in this case.
DVFS-based approaches: HW for energy savings is exposed to software; identify when the processor is not fully utilized. HW support breaks the hypothesis.
39
Across Processors
Parallelization is necessary to utilize modern architectures
How does parallelism affect energy?
Amdahl’s Law for Energy
Opportunities in parallel programs
40
Static Power
A new term in the power model: some power is consumed even when idle.
  P = I V + α C V² f
  (static power)  (dynamic power)
- I : leakage current
DVFS has less effect; static power is reaching 50% of the total power.
41
Amdahl’s Law for Energy
Simple model of energy and parallelism; processors have DVFS.
Simple, but more complicated than the original.
Speed-up / energy trade-off analysis.
[Cho and Melhem 2008]
- s : sequential fraction
- p : parallel fraction
- N : number of processors
- λ : static power
- y : power consumption as a function of frequency
The model has three terms: sequential dynamic, parallel dynamic, and static energy.
42
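A toy version of such a model (a sketch in the spirit of the paper, not its exact formulation; the cubic dynamic-power law y(f) = f³ and the per-processor static power λ are my assumptions): the sequential part runs on one processor, the parallel part on N, and static power is paid by all N processors for the whole execution.

```python
def energy(f, N, s=0.2, p=0.8, lam=0.1):
    """Toy Amdahl-style energy model (illustrative only).

    f   : normalized frequency (1.0 = max), time scales as 1/f
    N   : number of processors
    s,p : sequential / parallel fractions of the work (s + p = 1)
    lam : static power per processor
    """
    y = f ** 3                                   # dynamic power, cubic in f
    T = s / f + p / (N * f)                      # execution time
    E_dyn = y * (s / f) + N * y * (p / (N * f))  # seq + parallel dynamic
    E_static = lam * N * T                       # all N leak for the whole run
    return E_dyn + E_static

# With low static power, slowing down saves energy...
assert energy(0.5, 4, lam=0.01) < energy(1.0, 4, lam=0.01)
# ...but with high static power, running fast wins (race-to-sleep).
assert energy(1.0, 4, lam=1.0) < energy(0.5, 4, lam=1.0)
```

The crossover between the two regimes is exactly the speed-up/energy trade-off the model is used to analyze.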
Illustrating example from the paper (plot: energy vs. frequency)
43
When Static Power is 50% (plot: energy vs. frequency)
44
Static Power dominates
Static power is significant, and it increases as N increases: excessive processors are bad.
With current technology (high static power and increasing core counts), running as fast as possible is a good way to save energy.
45
Generalizing a bit Further
Analysis based on a high-level energy model, with emphasis on the power breakdown: find when “race-to-sleep” is the best.
Survey the power breakdown of recent machines.
Goal: confirm that sophisticated use of DVFS by compilers is not likely to help much, e.g., analysis/transformations to find/expose a “sweet spot” for trading speed with energy.
46
Power Breakdown
Dynamic (Pd): consumed when bits flip; quadratic savings as voltage scales.
Static (Ps): leaked while current is flowing; linear savings as voltage scales.
Constant (Pc): everything else (e.g., memory, motherboard, disk, network card, power supply, cooling, ...); little or no effect from voltage scaling.
47
Influence on Execution Time
Voltage and frequency are linearly related, with a slope less than 1: scale voltage by half, and the frequency drop is less than half.
Simplifying assumptions:
- a frequency change directly influences execution time: scale frequency by x, and time becomes 1/x
- fully flexible (continuous) scaling; in practice there is only a small set of discrete states
48
Ratio is the Key: Pd : Ps : Pc
(each case: plots of power and time vs. scaling)
- Case 1: Dynamic dominates => energy: the slower the better
- Case 2: Static dominates => energy: no harm, but no gain
- Case 3: Constant dominates => energy: the faster the better
49
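The three cases follow from a one-line energy formula. Under the slide's scaling assumptions (time scales as 1/x; with V proportional to f, dynamic power scales as x³ and static power as x; the arithmetic sketch is mine), the energy at scaling factor x is E(x) = Pd·x² + Ps + Pc/x:

```python
def energy(x, Pd, Ps, Pc):
    """Energy at frequency scaling factor x (x = 1 is full speed).

    Power scales as Pd*x**3 + Ps*x + Pc and time as 1/x, so
    E(x) = Pd*x**2 + Ps + Pc/x.
    """
    return Pd * x ** 2 + Ps + Pc / x

full, half = 1.0, 0.5

# Case 1: dynamic dominates -> the slower the better.
assert energy(half, Pd=8, Ps=1, Pc=1) < energy(full, Pd=8, Ps=1, Pc=1)
# Case 2: static dominates -> no harm, no gain
# (the Ps term is independent of x in the energy).
# Case 3: constant dominates -> the faster the better.
assert energy(full, Pd=1, Ps=1, Pc=8) < energy(half, Pd=1, Ps=1, Pc=8)
```

Note how the static term drops out of the comparison entirely: linearly-scaling power contributes a constant amount of energy regardless of x.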
When do we have Case 3?
Static power is now more than dynamic power, and power gating doesn’t help while computing.
Assume Pd = Ps:
- 50% of CPU power is due to leakage
- roughly matches 45nm technology
- further shrinking = even more leakage
The borderline is when Pd = Ps = Pc: we have Case 3 when Pc is larger than Pd = Ps.
50
Extensions to The Model
Impact on execution time: it may not be directly proportional to frequency; this shifts the borderline in favor of DVFS (a larger Ps and/or Pc is required for Case 3).
Parallelism: no influence on the result. CPU power is even less significant than in the 1-core case; the power budget of a chip is shared (multi-core), and network cost is added (distributed).
51
Do we have Case 3?
Survey of machines and the significance of Pc. Based on published power budgets (TDP) and published power measures, not on detailed/individual measurements.
Conservative assumptions:
- use an upper bound for CPU power
- use a lower bound for constant powers
- assume high PSU efficiency
52
Pc in Current Machines
Sources of constant power:
- stand-by memory (1W/1GB): memory cannot go idle while the CPU is working
- power supply unit (10-20% loss): transforming AC to DC
- motherboard (6W)
- cooling fan (10-15W): fully active when the CPU is working
Desktop processor TDP ranges from 40-90W, up to 130W for large core counts (8 or 16).
53
Server and Desktop Machines: Methodology
Compute a lower bound on Pc. Does it exceed 33% of total system power? If so, Case 3 holds even if all the rest were consumed by the processor.
System load:
- desktop: compute-intensive benchmarks
- server: server workloads (not as compute-intensive)
54
Desktop and Server Machines
55
Cray Supercomputers
Methodology:
- let Pd + Ps be the sum of the processors’ TDPs
- let Pc be the sum of PSU loss (5%), cooling (10%), and memory (1W/1GB)
- check whether Pc exceeds Pd = Ps
Two cases for memory configuration (min/max).
56
Cray Supercomputers
(chart: stacked power breakdown, 0-100%, for XT5/XT6/XE6 with min and max memory; categories: Other, PSU+Cooling, Memory, CPU-static, CPU-dynamic)
59
DVFS for Memory
Still at the research stage (since ~2010): the same principle applied to memory.
There is a quadratic component in power w.r.t. voltage: 25% quadratic, 75% linear.
The model can be adapted:
- Pd becomes Pq (dynamic to quadratic)
- Ps becomes Pl (static to linear)
The same story, but with Pq : Pl : Pc.
60
Influence on “race-to-sleep”
Methodology Move memory power from Pc to Pq and
Pl
25% to Pq and 75% to Pl
Pc becomes 15% of total power for Server/Cray
“race-to-sleep” may not be the best anymore
remains to be around 30% for desktop Vary Pq:Pl ratio to find when “race-to-
sleep” is the winner again leakage is expected to keep increasing
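Plugging this methodology into the same kind of energy formula (illustrative numbers only, roughly matching the Server/Cray split above; the closed-form optimum is my own arithmetic): with the memory power folded into Pq and Pl, the energy-optimal scaling factor x* = (Pc / (2·Pq))^(1/3) drops below 1, so running at full speed no longer minimizes energy.

```python
def energy(x, Pq, Pl, Pc):
    """E(x) = Pq*x**2 + Pl + Pc/x: quadratic, linear, constant power parts.

    The linearly-scaling power Pl contributes a constant amount of energy,
    since (Pl * x) * (1 / x) = Pl.
    """
    return Pq * x ** 2 + Pl + Pc / x

# Assumed split (illustration, not measured): Pc down to 15% after moving
# memory power into Pq (25% of it) and Pl (75% of it).
Pq, Pl, Pc = 0.30, 0.55, 0.15

x_opt = (Pc / (2 * Pq)) ** (1 / 3)   # where dE/dx = 2*Pq*x - Pc/x**2 = 0
assert x_opt < 1                     # full speed is no longer optimal
assert energy(x_opt, Pq, Pl, Pc) < energy(1.0, Pq, Pl, Pc)
```

Conversely, race-to-speed wins again once Pc grows back to at least 2·Pq, which is why the desktop case (Pc around 30%) behaves differently.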
61
When “Race to Sleep” is optimal: when the derivative of energy w.r.t. scaling is > 0.
62
(plot: dE/dF against the linearly scaling fraction Pl / (Pq + Pl))
Summary and Conclusion
Diminishing returns of DVFS: the main reason is leakage power. Confirmation by a high-level energy model: “race-to-speed” seems to be the way to go, and memory DVFS won’t change the big picture.
Compilers can continue to focus on speed; there is no significant gain in energy efficiency by sacrificing speed.
63
Balancing Computation and I/O
DVFS can improve energy efficiency when speed is not sacrificed: bring the program to a compute-I/O balanced state.
- if it’s memory-bound, slow down the CPU
- if it’s compute-bound, slow down the memory
This still maximizes hardware utilization, but by lowering the hardware capability. Current hardware (e.g., Intel Turbo Boost) and/or the OS already do this for the processor.
64
65
66
The Punch Line Method
How to punch your audience, i.e., how to attract your audience and make your talk more effective. Learned from Michelle Strout (Colorado State University); applicable to any talk.
(chart: audience rating, from poor to excellent, for a Normal Talk vs. a Punch Line Talk)
67
The Punch Line
The key cool idea in your paper: the key insight.
It is not the key contribution (“X% better than Y”, “does well on all benchmarks”).
Examples: “... because of HW prefetching”, “... further improves locality after reaching compute-bound”.
68
Typical Conference Audience
Many things to do: check emails, browse websites, finish their own slides.
Attention level (made-up numbers):
- ~3 minutes: 90%
- ~5 minutes: 60%
- 5+ minutes: 30%
- conclusion: 70%
Punch here! Push these numbers up!
69
Typical (Boring) Talk
1. Introduction 2. Motivation 3. Background 4. Approach 5. Results 6. Discussion 7. Conclusion
70
Punch Line Talk
Two Talks in One:
- a 5-minute talk: introduction/motivation and the key idea; this is the punch, and the shortest path to it
- an (X-5)-minute talk: add some background, elaborate on the approach, ...
71
Pitfalls of Beamer
Beamer != bad slides, but it is an easy path to them.
Checklist for good slides:
- no full sentences
- LARGE font size
- few equations
- many figures
- !(paper structure)
Beamer is not the best tool to encourage these.