Instruction Scheduling— combining scheduling with allocation —
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
COMP 512, Rice University 1
Combining Scheduling & Allocation
Sometimes, combining two optimizations can produce solutions that cannot be obtained by solving them independently
• Requires bilateral interactions between optimizations
  See Click and Cooper, "Combining Analyses, Combining Optimizations," TOPLAS 17(2), March 1995.
• Combining two optimizations can be a challenge (SCCP)
Scheduling & allocation are a classic example
• Scheduling changes variable lifetimes
• Renaming in the allocator changes (false) dependences
• Spilling changes the underlying code
Many authors have tried to combine allocation & scheduling
• Underallocate to leave "room" for the scheduler
  Can result in underutilization of registers
• Preallocate to use all registers
  Can create false dependences
• Solving the problems together can produce solutions that cannot be obtained by solving them independently
  See Click and Cooper, "Combining Analyses, Combining Optimizations," TOPLAS 17(2), March 1995.

In general, these papers try to combine global allocators with local or regional schedulers — an algorithmic mismatch
Combining Scheduling & Allocation
Before we go there, a long digression about how much improvement we might expect …
Quick Review of Local Scheduling
Given a sequence of machine operations, reorder the operations so that
• Data dependences are respected
• Execution time is minimized
• Demand for registers is kept below k
Vocabulary:
• An operation is an indivisible command
• An instruction is a set of operations that issue in the same cycle
• A dependence graph is constructed to represent necessary delays
  (Nodes are operations; edges show the flow of values; edge weights represent operation latencies)
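To make the vocabulary concrete, a dependence graph for a straight-line block can be built by remembering, for each register, which operation last defined it. This is only a sketch; the triple-based instruction format, the latency table, and the function name are illustrative assumptions, not part of the lecture.

```python
# A small sketch of dependence-graph construction for a straight-line block.
# Each operation is a (target, opcode, operands) triple.

def build_dependence_graph(block, latency):
    """Return edges (i, j, delay) meaning op j uses the value op i defines."""
    last_def = {}                  # register -> index of the op that defined it
    edges = []
    for j, (target, opcode, operands) in enumerate(block):
        for r in operands:
            if r in last_def:      # flow of values: a def reaches this use
                i = last_def[r]
                edges.append((i, j, latency[block[i][1]]))
        last_def[target] = j       # this op now defines `target`
    return edges

block = [
    ('r1', 'load', ['r0']),        # r1 <- MEM[r0]
    ('r2', 'load', ['r0']),
    ('r3', 'add',  ['r1', 'r2']),  # r3 <- r1 + r2
]
edges = build_dependence_graph(block, {'load': 3, 'add': 1})
# edges: (0, 2, 3) and (1, 2, 3) -- the add must wait for both loads
```

Note that this captures only the flow of values (true dependences); a production scheduler must also add anti- and output-dependence edges, the false dependences mentioned above.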
Scheduling Example
• Many operations have non-zero latencies
• Modern machines can issue several operations per cycle
• Execution time is order-dependent (and has been since the 60's)
Assumed latencies (conservative)

Operation  Cycles
load       3
store      3
loadI      1
add        1
mult       2
fadd       1
fmult      2
shift      1
branch     0 to 8
• Loads & stores may or may not block
  > Non-blocking ops let the scheduler fill those issue slots
• Branch costs vary with the path taken
• Branches typically have delay slots
  > Fill the slots with unrelated operations
  > Percolates the branch upward
• The scheduler should hide the latencies
List scheduling is the dominant algorithm
Example
w ← w * 2 * x * y * z
Simple schedule: 2 registers, 20 cycles
Schedule loads early: 3 registers, 13 cycles
Reordering operations for speed is called instruction scheduling
Instruction Scheduling (The Abstract View)
To capture properties of the code, build a dependence graph G
• Nodes n ∈ G are operations with type(n) and delay(n)
• An edge e = (n1, n2) ∈ G if and only if n2 uses the result of n1
[Figure: the example code (ops a–i for w ← w * 2 * x * y * z) and its dependence graph]
Instruction Scheduling (Definitions)
A correct schedule S maps each n ∈ N into a non-negative integer representing its cycle number, such that
1. S(n) ≥ 0, for all n ∈ N, obviously
2. If (n1, n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2)
3. For each type of operation (functional unit) t, there are no more operations of type t in any cycle than the target machine can issue
The length of a schedule S, denoted L(S), is L(S) = max over n ∈ N of (S(n) + delay(n))
The goal is to find the shortest possible correct schedule.
S is time-optimal if L(S) ≤ L(S1), for all other schedules S1
A schedule might also be optimal in terms of registers or power, or ….
What’s so difficult?
Critical Points
• All operands must be available when an operation issues
• Multiple operations can be ready (& often are …)
• Moving an operation can lengthen or shorten register lifetimes
• Operands can have multiple predecessors (not SSA)
Together, these issues make scheduling hard (NP-complete)

Local scheduling is the simple case
• Restricted to straight-line code
• Consistent and predictable latencies
Instruction Scheduling
The big picture
1. Build a dependence graph, D
2. Compute a priority function over the nodes in D
3. Use list scheduling to construct a schedule, 1 cycle at a time
   a. Use a queue of operations that are ready
   b. At each cycle
      I. Choose a ready operation and schedule it
      II. Update the ready queue

Local list scheduling
• The dominant algorithm for twenty years
• A greedy, heuristic, local technique
Local List Scheduling
Cycle ← 1
Ready ← leaves of D
Active ← Ø

while (Ready ∪ Active ≠ Ø)
  if (Ready ≠ Ø) then
    remove an op from Ready          // removal in priority order
    S(op) ← Cycle
    Active ← Active ∪ { op }
  Cycle ← Cycle + 1
  for each op ∈ Active
    if (S(op) + delay(op) ≤ Cycle) then    // op has completed execution
      remove op from Active
      for each successor s of op in D
        if (s is ready) then               // its operands are ready
          Ready ← Ready ∪ { s }

Can improve efficiency by using a set of queues (1 more than maximum delay on target machine) — see 412 notes
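The pseudocode above can be rendered as a small executable sketch. The single-issue machine model and the latency-weighted-path priority function are illustrative assumptions; real list schedulers model multiple functional units and use richer tie-breaking.

```python
# A minimal local list scheduler, following the pseudocode above.
# Assumes a single-issue machine (one op per cycle).

def critical_path_priority(deps, delay):
    """Priority = longest latency-weighted path from each op to any root."""
    succs = {n: [] for n in delay}
    for src, dst in deps:
        succs[src].append(dst)
    prio = {}
    def walk(n):
        if n not in prio:
            prio[n] = delay[n] + max((walk(s) for s in succs[n]), default=0)
        return prio[n]
    for n in delay:
        walk(n)
    return prio

def list_schedule(deps, delay):
    """deps: list of (producer, consumer) edges; delay: op -> latency.
    Returns op -> issue cycle."""
    preds = {n: set() for n in delay}
    succs = {n: set() for n in delay}
    for src, dst in deps:
        preds[dst].add(src)
        succs[src].add(dst)
    prio = critical_path_priority(deps, delay)
    ready = {n for n in delay if not preds[n]}   # leaves of D
    active, schedule, done = set(), {}, set()
    cycle = 1
    while ready or active:
        if ready:
            op = max(ready, key=lambda n: prio[n])  # removal in priority order
            ready.discard(op)
            schedule[op] = cycle
            active.add(op)
        cycle += 1
        for op in list(active):
            if schedule[op] + delay[op] <= cycle:   # op has completed
                active.discard(op)
                done.add(op)
                for s in succs[op]:
                    if preds[s] <= done:            # all operands available
                        ready.add(s)
    return schedule
```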
Scheduling Example
1. Build the dependence graph

[Figure: the example code (ops a–i) and its dependence graph]
Scheduling Example
1. Build the dependence graph
2. Determine priorities: longest latency-weighted path

[Figure: the example code (ops a–i) and its dependence graph, with each node annotated by its latency-weighted priority]
Scheduling Example
1. Build the dependence graph
2. Determine priorities: longest latency-weighted path
3. Perform list scheduling

The scheduled code:

 1) a: loadAI  r0, @w ⇒ r1
 2) c: loadAI  r0, @x ⇒ r2
 3) e: loadAI  r0, @y ⇒ r3      ← new register name used
 4) b: add     r1, r1 ⇒ r1
 5) d: mult    r1, r2 ⇒ r1
 6) g: loadAI  r0, @z ⇒ r2
 7) f: mult    r1, r3 ⇒ r1
 9) h: mult    r1, r2 ⇒ r1
11) i: storeAI r1     ⇒ r0, @w

[Figure: the dependence graph for ops a–i, annotated with latency-weighted priorities]
Local Scheduling
As long as we stay within a single block
• List scheduling does well
• The problem is hard, so tie-breaking matters
  More descendants in the dependence graph
  Prefer an operation with a last use over one with none
  Breadth first makes progress on all paths
    Tends toward more ILP & fewer interlocks
  Depth first tries to complete uses of a value
    Tends to use fewer registers
Classic work on this is Gibbons & Muchnick (PLDI 86)
Local Scheduling
Forward and backward can produce different results
[Figure: dependence graph for a block from the SPEC benchmark "go". Five stores (store1 … store5) and a cmp feed the final cbr; the stores depend on add1 … add4 and an addI, which in turn depend on loadI1 … loadI4 and an lshift. Subscripts identify distinct operations. Each node is annotated with its latency-weighted distance to the cbr: cbr 1; cmp 2; stores 5; adds 7 (addI 6); loadIs & lshift 8.]

Operation   load   loadI   add   addI   store   cmp
Latency       1      1      2     1       4      1
Local Scheduling
Forward Schedule
Cycle   Int      Int      Mem
  1     loadI1   lshift
  2     loadI2   loadI3
  3     loadI4   add1
  4     add2     add3
  5     add4     addI     store1
  6     cmp               store2
  7                       store3
  8                       store4
  9                       store5
 13     cbr

Backward Schedule
Cycle   Int      Int      Mem
  1     loadI4
  2     addI     lshift
  3     add4     loadI3
  4     add3     loadI2   store5
  5     add2     loadI1   store4
  6     add1              store3
  7                       store2
  8                       store1
 11     cmp
 12     cbr

Using latency to root as the priority
Local Scheduling
Priority function strongly affects properties of the result
• Longest latency-weighted path biases the result toward finishing the long paths as soon as possible
  Execution speed is paramount
  May use more registers than the minimum
• A depth-first approach can reduce demand for registers
  Minimizes lifetimes of values
  Sethi-Ullman numbering, extended to DAGs
Iterative Repair Scheduling
The Problem
• List scheduling has dominated the field for 20 years
• Anecdotal evidence both good & bad, little solid evidence
• No intuitive paradigm for how it works
• It works well, but will it work well in the future ?
• Is there room for improvement? (e.g., with allocation?)
Our Idea
• Try more powerful algorithms from other domains
• Look for better schedules
• Look for understanding of the solution space
This led us to iterative repair scheduling
Iterative Repair Scheduling
The Algorithm
• Start from some approximation to a schedule (bad or broken)
• Find & prioritize all cycles that need repair (tried 6 schemes)
  Either resource or data constraints
• Perform the needed repairs, in priority order
  Break ties randomly
  Reschedule dependent operations, in random order
  An evaluation function on each repair can reject it (try another)
• Iterate until the repair list is empty
• Repeat this process many times to explore the solution space
  Keep the best result!
  (Randomization & restart is a fundamental theme of our recent work)
Iterative repair works well on many kinds of scheduling problems.
• Scheduling cargo for the space shuttle
• Typical problems in the literature involve 10s or 100s of repairs
We used it with millions of repairs
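The repair loop described above can be sketched as executable code. Everything below is a toy rendering under simplifying assumptions: a single-issue resource model, one repair move (push a violating op one cycle later), and uniform random initial schedules. It is not the six repair-prioritization schemes or the evaluation functions the slides mention.

```python
import random

# A toy iterative-repair scheduler: start from a broken schedule, repair
# data and resource violations until none remain, and restart many times,
# keeping the best result.

def find_repairs(schedule, deps, delay):
    """Return scheduled ops that violate a data or resource constraint."""
    bad = set()
    for src, dst in deps:                      # data constraints
        if schedule[src] + delay[src] > schedule[dst]:
            bad.add(dst)
    by_cycle = {}
    for op, c in schedule.items():             # resource constraint: 1 op/cycle
        by_cycle.setdefault(c, []).append(op)
    for ops in by_cycle.values():
        bad.update(ops[1:])                    # extra ops in a cycle need repair
    return bad

def iterative_repair(deps, delay, restarts=20, max_steps=1000, seed=0):
    rng = random.Random(seed)
    ops = list(delay)
    best = None
    for _ in range(restarts):                  # randomization & restart
        sched = {op: rng.randrange(1, len(ops) + 1) for op in ops}  # broken start
        for _ in range(max_steps):
            repairs = find_repairs(sched, deps, delay)
            if not repairs:                    # repair list is empty: done
                break
            op = rng.choice(sorted(repairs))   # break ties randomly
            sched[op] += 1                     # repair: delay the op one cycle
        if not find_repairs(sched, deps, delay):
            length = max(sched[op] + delay[op] for op in ops)
            if best is None or length < best[0]:
                best = (length, dict(sched))   # keep the best result
    return best
```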
Iterative Repair Scheduling
How does iterative repair do versus list scheduling?
• Found many schedules that used fewer registers
• Found very few faster schedules
• Were disappointed with the results
• Began a study of the properties of scheduling problems
Iterative repair, itself, doesn’t justify the additional costs
• Can we identify schedulers where it will win?
• Can we learn about the properties of scheduling problems? And about the behavior of list scheduling ...
Hopeful sign for this lecture
Instruction Scheduling Study

Methodology
• Looked at blocks & extended blocks in benchmarks
• Applied the RBF algorithm (a randomized version of backward & forward list scheduling) and tested each result for optimality (simple tests)
• If non-optimal, used IR to find its best schedule
• Checked these results against an IP formulation using CPLEX

The Results
• List scheduling does quite well on a conventional uniprocessor
  Over 92% of blocks scheduled optimally for speed
  Over 73% of extended blocks scheduled optimally for speed
Instruction Scheduling Study

Methodology
• Repeated the same experiment with randomly-generated blocks
• Generated over 85,000 random blocks of 10, 20, & 50 ops
  (3 compute months on a pair of UltraSparc workstations)

The Results
• List scheduling finds optimal schedules over 80% of the time
• Plotted % non-optimal against available ILP
  (Worst-case schedule length over critical-path length)
• Peak is around 2.8 for 1 functional unit and 4.7 for 2 units
• The IR scheduler usually found optimal schedules for these "harder" problems
  Use it when list scheduling fails
How Well Does List Scheduling Do?

[Figure: non-optimal list schedules (%) versus available parallelism; 1 functional unit, randomly generated blocks of 10, 20, & 50 ops. Most codes fall in the low-parallelism region, unless the compiler transforms them for ILP; if the compiler transforms the code, it should avoid the peak region. From Phil Schielke's thesis.]

At the peak, the compiler should apply other techniques
• Measure parallelism in the list scheduler
• Invoke stronger techniques when there is a high probability of payoff
Instruction Scheduling Study
The Lessons
• In general, list scheduling does well
  Use a randomized version, both directions, a few trials
• To find hard problems, measure the average length of the ready queue
• If it falls in the "hard" region
  Check the resulting schedule for interlocks or holes
  If non-optimal, run the IR scheduler
• If transforming for parallelism, avoid hitting the "hard" range
• CPLEX had a hard time with the easy blocks
  Too many optimal solutions to check
Combining Allocation & Scheduling
The Problem
• Well-understood that the problems are intricately related
• Previous work under-allocates or under-schedules
  Except Goodman & Hsu

Our Approach
• Formulate an iterative repair framework
  Moves for scheduling, as before
  Moves to decrease register pressure or to spill
• Allows fair competition in a combined attack
Grows out of search for novel techniques from other areas
Combining Allocation & Scheduling
The Details
• Run the IR scheduler & keep the schedule with the lowest pressure
• Start with an ALAP schedule rather than an ASAP schedule
• Reject any repair that increases maximum pressure
• A cycle with pressure > k triggers "pressure repair"
  Identify ops that reduce pressure & move one
  A threshold lower than k seems to help
• Ran it against the classic method
  Schedule, allocate, schedule (using Briggs' allocator)
Combining Allocation & Scheduling
The Results
• Many opportunities to lower pressure
  12% of basic blocks
  33% of extended blocks
• This can produce faster code
  Best case was 41.3%
  Average case, 16 regs, was 5.4%
  Average case, 32 regs, was 3.5% (whole applications)

This approach finds faster codes that spill fewer values
It is competing against a very good global allocator
  Rematerialization catches many of the same effects
Knowing that new solutions exist does not ensure that they are better solutions!
This work confirms years of suspicion, while providing an effective, albeit nontraditional, technique
The opportunity is present, but the IR scheduler is still quite slow …
Sethi-Ullman Numbering
Two-pass algorithm
• Number each subtree in the expression

    label(n) = 1                             if n is a leaf
             = max(label(l), label(r))       if label(l) ≠ label(r)
             = label(l) + 1                  if label(l) = label(r)

  where l and r are n's left and right children. Labels correspond to the registers needed to evaluate that subtree.

• Use the numbers to guide evaluation
  At each node, generate the more demanding subtree first:
  1. generate code for the subtree with the larger label
  2. store its result in a temporary register
  3. generate code for the subtree with the smaller label
  4. generate code for the node
This approach minimizes register use for a given tree
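The labeling pass above is only a few lines of code. The tuple-based tree representation below is an illustrative assumption.

```python
# A small sketch of Sethi-Ullman numbering for binary expression trees.
# A leaf is a string; an interior node is a tuple (op, left, right).

def label(tree):
    """Registers needed to evaluate `tree` without spilling."""
    if isinstance(tree, str):          # a leaf
        return 1
    _, left, right = tree
    l, r = label(left), label(right)
    return max(l, r) if l != r else l + 1
```

On the chain ((a + b) + c) + d the label is 2, while on the balanced tree (a + b) + (c + d) it is 3, matching the next slide's point that deep trees need fewer registers.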
Sethi-Ullman Numbering

Of course, code shape matters
• Deep trees use fewer registers than broad trees

[Figure: two trees for a + b + c + d. The left-leaning chain ((a + b) + c) + d evaluates in 2 registers but offers less ILP; the balanced tree (a + b) + (c + d) evaluates in 3 registers but offers more ILP.]
Balancing Speed and Register Pressure
Goodman & Hsu proposed a novel scheme
• Context: the debate about prepass versus postpass scheduling
• Problem: the tradeoff between allocation & scheduling
• Solution:
  Schedule for speed until fewer than Threshold registers remain free
  Schedule for registers until more than Threshold registers are free
• Details:
  "for speed" means one of the latency-weighted priorities
  "for registers" means an incremental adaptation of the SU scheme
James R. Goodman and Wei-Chung Hsu, “Code Scheduling and Register Allocation in Large Basic Blocks,” Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, 1988, pages 442-452.
Local Scheduling & Register Allocation
List scheduling is a local, incremental algorithm
• Decisions made on an operation-by-operation basis
• Uses local (basic-block level) metrics

Need a local, incremental register-allocation algorithm
• Best's algorithm, called "bottom-up local" in EaC
  To free a register, evict the value with the furthest next use
• Uses local (basic-block level) metrics

Combining these two algorithms leads to a fair, local algorithm for the combined problem, called IRIS
  The idea is due to Dae-Hwan Kim & Hyuk-Jae Lee
  Can use a non-local eviction heuristic (a new twist on Best's alg.)
D-H. Kim and H-J. Lee, “Integrated Instruction Scheduling and Fine-Grain Register Allocation for Embedded Processors”, in Embedded Computer Systems: Architectures, Modeling, and Simulation, LNCS 4017, pages 269-278.
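Best's eviction rule is easy to simulate on a straight-line block; it is the same idea as Belady's MIN policy for caches. The reference-string encoding below is an illustrative assumption, not code from the paper.

```python
# A toy simulation of Best's "furthest next use" rule on a block.

def best_allocate(block, k):
    """Count register misses (loads) with k registers on a straight-line
    block, evicting the value with the furthest next use.
    block: the sequence of value names referenced, in order."""
    # Next use position for each reference, computed back to front.
    nxt = [None] * len(block)
    last = {}
    for pos in range(len(block) - 1, -1, -1):
        nxt[pos] = last.get(block[pos], float('inf'))
        last[block[pos]] = pos
    regs = {}                                  # value -> its next use position
    misses = 0
    for pos, v in enumerate(block):
        if v not in regs:
            misses += 1                        # v must be loaded
            if len(regs) == k:                 # no free register:
                victim = max(regs, key=regs.get)
                del regs[victim]               # evict furthest next use
        regs[v] = nxt[pos]                     # update v's next use
    return misses
```

With 2 registers, the reference string a b c a b c costs 4 loads under this rule, versus 6 under LRU-style eviction, which is why the furthest-next-use heuristic works so well locally.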
Original Code for Local List Scheduling
Cycle ← 1
Ready ← leaves of D
Active ← Ø

while (Ready ∪ Active ≠ Ø)
  if (Ready ≠ Ø) then
    remove an op from Ready
    S(op) ← Cycle
    Active ← Active ∪ { op }
  Cycle ← Cycle + 1
  update the Ready queue

Paraphrasing the earlier slide …
The Combined Algorithm
Sketch of the algorithm
Cycle ← 1
Ready ← leaves of D
Active ← Ø

while (Ready ∪ Active ≠ Ø)
  if (Ready ≠ Ø) then
    remove an op from Ready
    make operands available in registers
    allocate a register for the target
    S(op) ← Cycle
    Active ← Active ∪ { op }
  Cycle ← Cycle + 1
  update the Ready queue

Reload Live-on-Exit values, if necessary

Keep a list of free registers
On the last use of a value, put its register back on the free list
To free a register, store the value used farthest in the future
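The combined loop can be sketched as executable code. This is a toy rendering under stated assumptions (single issue, one result register per op, an alphabetical stand-in for the priority function, eviction of the earliest-defined live value instead of the farthest next use, and no reload code); it is not the published IRIS algorithm.

```python
# List scheduling with an integrated free-register list, per the sketch above.

def combined_schedule(deps, delay, k):
    """deps: (producer, consumer) edges; delay: op -> latency; k registers.
    Returns (op -> cycle, live op -> register, spilled ops in order)."""
    preds = {n: set() for n in delay}
    succs = {n: set() for n in delay}
    for a, b in deps:
        preds[b].add(a)
        succs[a].add(b)
    uses_left = {n: len(succs[n]) for n in delay}
    free = ['r%d' % i for i in range(k)]         # the free-register list
    reg, spilled, sched = {}, [], {}
    ready = {n for n in delay if not preds[n]}
    active, done, cycle = set(), set(), 1
    while ready or active:
        if ready:
            op = min(ready)                      # stand-in priority function
            ready.discard(op)
            for p in preds[op]:                  # this is a use of each operand
                uses_left[p] -= 1
                if uses_left[p] == 0 and p in reg:
                    free.append(reg.pop(p))      # on last use, free the register
            if not free:                         # to free a register, spill a
                victim = min(reg, key=sched.get)  # live value (record a store)
                spilled.append(victim)
                free.append(reg.pop(victim))
            reg[op] = free.pop()                 # allocate a register for target
            sched[op] = cycle
            active.add(op)
        cycle += 1
        for a in list(active):                   # retire completed ops
            if sched[a] + delay[a] <= cycle:
                active.discard(a)
                done.add(a)
                for s in succs[a]:
                    if preds[s] <= done:
                        ready.add(s)
    return sched, reg, spilled
```

With enough registers no spills occur; squeezing the same block through one register forces a store, which is the point where the scheduler and allocator begin to compete.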