Mani Srivastava
UCLA, EE Department
Room: 6731-H Boelter Hall
Email: [email protected]
Phone: 310-267-2098
WWW: http://www.ee.ucla.edu/~mbs
Copyright 2003 Mani Srivastava
High-level Synthesis: Scheduling, Allocation, Assignment
Note: Several slides in this Lecture are from
Prof. Miodrag Potkonjak, UCLA CS
Overview
High Level Synthesis
Scheduling, Allocation and Assignment
Estimations
Transformations
Allocation, Assignment, and Scheduling
[Figure: data-flow graph with +, -, and >> operations mapped onto hardware]
Allocation: How much? (e.g., 2 adders, 1 shifter, registers)
Assignment: Where? (e.g., Shifter 1)
Schedule: When? (e.g., Time Slot 4)
Techniques Well Understood and Mature
Scheduling and Assignment
[Figure: data-flow graph with operations +1, +2, +3, *1, *2, *3 and two alternative schedules over 4 control steps]

Schedule 1:
  Control step 1: +1
  Control step 2: +2
  Control step 3: +3, *1
  Control step 4: *2, *3

Schedule 2:
  Control step 1: +3
  Control step 2: +1, *2
  Control step 3: +2, *3
  Control step 4: *1
ASAP Scheduling Algorithm
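The algorithm itself appears as a figure on the original slide; below is a minimal sketch of unconstrained ASAP scheduling over a DAG, assuming integer delays d[i] and a predecessor list per operation (the names and the toy example are illustrative, not the slide's exact pseudocode).

# Minimal ASAP sketch: start each op at the earliest step its predecessors allow.
def asap_schedule(ops, preds, d):
    start = {}
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if all(p in start for p in preds[op]):              # all predecessors scheduled
                start[op] = max((start[p] + d[p] for p in preds[op]), default=1)
                remaining.remove(op)
    return start

# Toy example: v1 -> v3 and v2 -> v3, unit delays.
print(asap_schedule(['v1', 'v2', 'v3'],
                    {'v1': [], 'v2': [], 'v3': ['v1', 'v2']},
                    {'v1': 1, 'v2': 1, 'v3': 1}))               # {'v1': 1, 'v2': 1, 'v3': 2}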
ASAP Scheduling Example
ASAP: Another Example
[Figure: sequence graph and its ASAP schedule]
ALAP Scheduling Algorithm
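As with ASAP, the algorithm is shown as a figure; below is a minimal sketch of ALAP scheduling under a latency constraint L, assuming successor lists and integer delays (all names are illustrative assumptions).

# Minimal ALAP sketch: start each op as late as its successors and the latency bound allow.
def alap_schedule(ops, succs, d, L):
    start = {}
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if all(s in start for s in succs[op]):
                latest_finish = min((start[s] for s in succs[op]), default=L + 1)
                start[op] = latest_finish - d[op]               # start as late as possible
                remaining.remove(op)
    return start

# Toy example with latency constraint 4: v1 -> v3 and v2 -> v3, unit delays.
print(alap_schedule(['v1', 'v2', 'v3'],
                    {'v1': ['v3'], 'v2': ['v3'], 'v3': []},
                    {'v1': 1, 'v2': 1, 'v3': 1}, L=4))          # {'v3': 4, 'v1': 3, 'v2': 3}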
Copyright 2003 Mani Srivastava9
ALAP Scheduling Example
Copyright 2003 Mani Srivastava10
ALAP: Another Example
[Figure: sequence graph and its ALAP schedule (latency constraint = 4)]
Observation about ALAP & ASAP
No priority is given to nodes on the critical path
As a result, less critical nodes may be scheduled ahead of critical nodes
This causes no problem if hardware is unlimited
However, if resources are limited, the less critical nodes may block the critical nodes and thus produce inferior schedules
List scheduling techniques overcome this problem by using a more global node-selection criterion
List Scheduling and Assignment
List_Scheduling() {
    Create_Candidate_List();
    while (Candidate_List != NULL) {
        Select_Candidate();
        Schedule_Candidate();
    }
}
[Figure: example data-flow graph with +1, +2, +3, *1, *2, *3 and the resulting 4-control-step list schedule]
List Scheduling Algorithm using Decreasing Criticalness Criterion
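The algorithm on this slide is a figure; below is a minimal sketch of resource-constrained list scheduling with a decreasing-criticalness priority, here approximated by "smaller ALAP start time = more critical". Unit delays, the resource table, and all identifiers are illustrative assumptions, not the slide's exact algorithm.

# List scheduling sketch: at each control step, schedule ready ops in order of
# criticalness until the resources of each type are exhausted.
def list_schedule(ops, preds, op_type, alap, n_units):
    start, step = {}, 1
    while len(start) < len(ops):
        # candidates: unscheduled ops whose predecessors have all completed (unit delays)
        ready = [o for o in ops if o not in start
                 and all(p in start and start[p] < step for p in preds[o])]
        ready.sort(key=lambda o: alap[o])                       # most critical first
        used = {k: 0 for k in n_units}
        for o in ready:
            k = op_type[o]
            if used[k] < n_units[k]:                            # a free unit of this type?
                start[o] = step
                used[k] += 1
        step += 1
    return start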
Copyright 2003 Mani Srivastava14
Scheduling
NP-complete problem
Optimal techniques
Heuristics: iterative improvement
Heuristics: constructive
Various versions of the problem:
  Unconstrained minimum latency
  Resource-constrained minimum latency
  Timing constrained
If all resources are identical, the problem reduces to multiprocessor scheduling
The minimum-latency multiprocessor problem is intractable
Scheduling - Optimal Techniques
Integer Linear Programming
Branch and Bound
Copyright 2003 Mani Srivastava16
Integer Linear Programming
Given: an integer-valued m x n matrix A,
vectors B = (b1, b2, ..., bm) and C = (c1, c2, ..., cn)

Minimize: C^T X
Subject to:
  A X ≥ B
  X = (x1, x2, ..., xn) is an integer-valued vector
Integer Linear Programming

Problem: For a set of (dependent) computations {t1, t2, ..., tn}, find the minimum number of units needed to complete the execution in k control steps.

Integer linear programming formulation: Let y0 be an integer variable. For each control step i (1 ≤ i ≤ k):
  define variable xij: xij = 1 if computation tj is executed in the i-th control step, xij = 0 otherwise;
  define variable yi = xi1 + xi2 + ... + xin.
Integer Linear Programming
For each computation dependency "ti must be done before tj", introduce a constraint:
  k·x1i + (k-1)·x2i + ... + xki ≥ k·x1j + (k-1)·x2j + ... + xkj + 1    (*)

Minimize: y0
Subject to:
  x1i + x2i + ... + xki = 1 for all 1 ≤ i ≤ n
  yi ≤ y0 for all 1 ≤ i ≤ k
  all computation-dependency constraints of type (*)
An Example
[Figure: dependency graph over computations c1, ..., c6]
6 computations, 3 control steps
An Example
Introduce variables: xij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 6
  yi = xi1 + xi2 + xi3 + xi4 + xi5 + xi6 for 1 ≤ i ≤ 3
  y0

Dependency constraints, e.g. execute c1 before c4:
  3·x11 + 2·x21 + x31 ≥ 3·x14 + 2·x24 + x34 + 1

Execution constraints:
  x1j + x2j + x3j = 1 for 1 ≤ j ≤ 6
An Example

Minimize: y0
Subject to:
  yi ≤ y0 for all 1 ≤ i ≤ 3
  dependency constraints
  execution constraints

One solution: y0 = 2
  x11 = 1, x12 = 1,
  x23 = 1, x24 = 1,
  x35 = 1, x36 = 1,
  all other xij = 0
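A sketch of this example in code, assuming the PuLP ILP library is available; only the c1-before-c4 dependency written out above is modeled, since the full edge list comes from the figure.

# ILP for the 6-computation / 3-control-step example (minimize the number of units y0).
import pulp

K, N = 3, 6
prob = pulp.LpProblem("min_units", pulp.LpMinimize)
x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
     for i in range(1, K + 1) for j in range(1, N + 1)}
y0 = pulp.LpVariable("y0", lowBound=0, cat="Integer")

prob += y0                                                      # objective: minimize y0
for j in range(1, N + 1):                                       # each computation in exactly one step
    prob += pulp.lpSum(x[i, j] for i in range(1, K + 1)) == 1
for i in range(1, K + 1):                                       # ops in step i  <=  y0
    prob += pulp.lpSum(x[i, j] for j in range(1, N + 1)) <= y0
# dependency c1 before c4:  3*x11 + 2*x21 + x31 >= 3*x14 + 2*x24 + x34 + 1
prob += (pulp.lpSum((K - i + 1) * x[i, 1] for i in range(1, K + 1))
         >= pulp.lpSum((K - i + 1) * x[i, 4] for i in range(1, K + 1)) + 1)

prob.solve()
print("y0 =", int(pulp.value(y0)))                              # 2 for this reduced edge set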
Copyright 2003 Mani Srivastava22
ILP Model of Scheduling
Binary decision variables xil:
  i = 0, 1, ..., n;  l = 1, 2, ..., λ+1
  xil = 1 if operation vi starts in step l, and 0 otherwise
Start time of each operation is unique:
  Σl xil = 1 for every operation vi, so its start time is ti = Σl l·xil
ILP Model of Scheduling (contd.)
Sequencing relationships must be satisfied:
  for every edge (vj, vi):  ti ≥ tj + dj, i.e.  Σl l·xil ≥ Σl l·xjl + dj
Resource bounds must be met; let the upper bound on the number of resources of type k be ak:
  for every type k and step l:  Σ over {i : T(vi) = k} of Σ from m = l-di+1 to l of xim ≤ ak
Minimum-latency Scheduling Under Resource-constraints
Let t be the vector whose entries are the start times
Formal ILP model: minimize c^T t subject to the uniqueness, sequencing, and resource constraints above
Example
Two types of resources:
  Multiplier
  ALU (adder/subtractor, comparator)
Both take 1 cycle of execution time
Example (contd.)
Heuristic (list scheduling) gives latency = 4 steps
Use ALAP and ASAP (with no resource constraints) to get bounds on start times
ASAP latency matches the heuristic's, so the heuristic is optimum; but let us ignore that!
Constraints?
Example (contd.)
Start time is unique
Example (contd.)
Sequencing constraints
  Note: only the non-trivial ones are listed, i.e. those where at least one operation has more than one possible start time
Example (contd.)
Resource constraints
Example (contd.)
Consider c = [0, 0, ..., 1]^T
  Minimum-latency schedule
  Since the sink has no mobility (xn,5 = 1), any feasible schedule is optimum
Consider c = [1, 1, ..., 1]^T
  Finds the earliest start times for all operations
  Equivalently, minimizes the sum of the start times Σi ti
Example Solution: Optimum Schedule Under Resource Constraints
Example (contd.)
Assume multiplier costs 5 units of area, and ALU costs 1 unit of area
Same uniqueness and sequencing constraints as before
Resource constraints are in terms of unknown variables a1 and a2
a1 = # of multipliers
a2 = # of ALUs
Copyright 2003 Mani Srivastava33
Example (contd.)

Resource constraints
Example Solution
Minimize c^T a = 5·a1 + 1·a2
Solution with cost 12
Copyright 2003 Mani Srivastava35
Precedence-constrained Multiprocessor Scheduling
All operations done by the same type of resource
An intractable problem
Intractable even if all operations have unit delay
Scheduling - Iterative Improvement
Kernighan-Lin (deterministic)
Simulated Annealing
Lottery Iterative Improvement
Neural Networks
Genetic Algorithms
Tabu Search
Scheduling - Constructive Techniques
Most Constrained
Least Constraining
Copyright 2003 Mani Srivastava38
Force Directed Scheduling
Goal is to reduce hardware by balancing concurrency
Iterative algorithm, one operation scheduled per iteration
Information (i.e., speed and area) is fed back into the scheduler
Copyright 2003 Mani Srivastava39
The Force Directed Scheduling Algorithm
Copyright 2003 Mani Srivastava40
Step 1
Determine ASAP and ALAP schedules
[Figure: example data-flow graph scheduled with ASAP (left) and ALAP (right)]
Step 2
Determine the time frame of each operation
  Length of box ~ possible execution cycles
  Width of box ~ probability of assignment
  Uniform distribution, area assigned = 1
[Figure: time frames of the example operations over C-steps 1-4, with probabilities 1/2 and 1/3 for operations whose frames span 2 and 3 steps]
Step 3
Create distribution graphs (DGs)
  Sum of the probabilities of each operation type in each C-step
  Indicates the concurrency of similar operations
  DG(i) = Σ over operations of Prob(Op, i)
[Figure: DG for Multiply and DG for Add/Sub/Compare]
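A small code sketch of Steps 2 and 3: time frames are taken from ASAP/ALAP start times, and each operation contributes a uniform probability over its frame. The operation names and schedules below are made-up placeholders, not the lecture's differential-equation example.

# Distribution graph under the uniform-probability assumption.
asap = {'m1': 1, 'm2': 1, 'a1': 3}                      # hypothetical ASAP start steps
alap = {'m1': 2, 'm2': 1, 'a1': 4}                      # hypothetical ALAP start steps
op_type = {'m1': 'mult', 'm2': 'mult', 'a1': 'alu'}

def distribution_graph(kind, steps=range(1, 5)):
    dg = {}
    for step in steps:
        total = 0.0
        for op, t in op_type.items():
            if t != kind:
                continue
            lo, hi = asap[op], alap[op]                 # the operation's time frame
            if lo <= step <= hi:
                total += 1.0 / (hi - lo + 1)            # uniform probability over the frame
        dg[step] = total
    return dg

print(distribution_graph('mult'))                       # {1: 1.5, 2: 0.5, 3: 0.0, 4: 0.0}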
Diff Eq Example: Precedence Graph Recalled
Diff Eq Example: Time Frame & Probability Calculation
Diff Eq Example: DG Calculation
Conditional Statements
Operations in different branches are mutually exclusive
Operations of the same type can be overlapped onto the DG
The probability of the most likely operation is added to the DG
[Figure: fork/join conditional branches with + and - operations, and the resulting DG for Add]
Self Forces

Scheduling an operation will affect the overall concurrency
Every operation has a 'self force' for every C-step of its time frame
Analogous to the effect of a spring: f = Kx
A desirable scheduling will have a negative self force
  It will achieve better concurrency (lower potential energy)

Force(i) = DG(i) * x(i)
  DG(i) ~ current distribution-graph value
  x(i) ~ change in the operation's probability
Self Force(j) = Σ from i = t to b of Force(i), summed over operation j's time frame (top step t to bottom step b)
Example

Attempt to schedule the multiply in C-step 1:
  Self Force(1) = Force(1) + Force(2)
               = ( DG(1) * x(1) ) + ( DG(2) * x(2) )
               = 2.833 * (+0.5) + 2.333 * (-0.5) = +0.25
This is positive, so scheduling the multiply in the first C-step would be bad
[Figure: DG for Multiply and the operations' time frames over C-steps 1-4]
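A sketch reproducing this self-force calculation. DG(1) = 2.833 and DG(2) = 2.333 are the slide's values; the function itself, with its uniform-probability time frame, is an illustrative reconstruction.

# Self force of tentatively fixing an op (2-step time frame) into a target C-step.
def self_force(dg, frame, target):
    before = 1.0 / len(frame)                           # probability before scheduling
    force = 0.0
    for step in frame:
        after = 1.0 if step == target else 0.0          # probability after fixing at `target`
        force += dg[step] * (after - before)            # Force(i) = DG(i) * x(i)
    return force

print(self_force({1: 2.833, 2: 2.333}, frame=[1, 2], target=1))   # +0.25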
Copyright 2003 Mani Srivastava49
Diff Eq Example: Self Force for Node 4
Copyright 2003 Mani Srivastava50
Predecessor & Successor Forces
Scheduling an operation may affect the time frames of other linked operations
This may negate the benefits of the desired assignment
Predecessor/successor forces = sum of the self forces of any implicitly scheduled operations
[Figure: example data-flow graph showing the linked operations]
Diff Eq Example: Successor Force on Node 4
If node 4 is scheduled in step 1:
  no effect on the time frame of successor node 8
  Total force = Force4(1) = +0.25
If node 4 is scheduled in step 2:
  it causes node 8 to be scheduled into step 3
  so the successor force must be calculated
Diff Eq Example: Final Time Frame and Schedule
Copyright 2003 Mani Srivastava53
Diff Eq Example: Final DG
Copyright 2003 Mani Srivastava54
Lookahead

Temporarily modify the constant DG(i) to include the effect of the iteration being considered:
  Force(i) = temp_DG(i) * x(i), where temp_DG(i) = DG(i) + x(i)/3

Consider the previous example:
  Self Force(1) = (DG(1) + x(1)/3)·x(1) + (DG(2) + x(2)/3)·x(2)
               = 0.5·(2.833 + 0.5/3) - 0.5·(2.333 - 0.5/3) = +0.41667
This is even worse than before
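A sketch of the lookahead variant, reusing the DG values above; it reproduces the +0.41667 result for C-step 1.

# Lookahead self force: temp_DG(i) = DG(i) + x(i)/3.
def lookahead_self_force(dg, frame, target):
    before = 1.0 / len(frame)
    total = 0.0
    for step in frame:
        dx = (1.0 if step == target else 0.0) - before
        total += (dg[step] + dx / 3.0) * dx             # Force(i) = temp_DG(i) * x(i)
    return total

print(lookahead_self_force({1: 2.833, 2: 2.333}, frame=[1, 2], target=1))   # about +0.4167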
Copyright 2003 Mani Srivastava55
Minimization of Bus Costs
The basic algorithm is suitable for a narrow class of problems
The algorithm can be refined to consider "cost" factors
Number of buses ~ number of concurrent data transfers
Number of buses = maximum number of transfers in any C-step
Create a modified DG that includes transfers: the Transfer DG
  Trans DG(i) = Σ [Prob(Op, i) * Opn_No_InOuts]
  Opn_No_InOuts ~ combined distinct inputs/outputs of the operation
Calculate the Force with this DG and add it to the Self Force
Minimization of Register Costs

The minimum number of registers required is given by the largest number of data arcs crossing a C-step boundary
Create storage operations at the output of any operation that transfers a value to a destination in a later C-step
Generate a Storage DG for these "operations"
The length of a storage operation depends on the final schedule
[Figure: storage distribution for a value S, showing its ASAP, MAX, and ALAP lifetimes]
Minimization of Register Costs (contd.)

avg life = ([ASAP life] + [MAX life] + [ALAP life]) / 3

storage DG(i) = [avg life] / [max life]   (no overlap between ASAP & ALAP lifetimes)
storage DG(i) = ([avg life] - [overlap]) / ([max life] - [overlap])   (if they overlap)

Calculate the "Storage" Force and add it to the Self Force

[Figure: register requirements for the example. ASAP: 7 registers minimum; Force Directed: 5 registers minimum]
Pipelining

[Figure: functional pipelining of a data-flow graph across two overlapping instances (steps 1-4 and 1'-4'), and structural pipelining within a multi-cycle multiply]

Functional Pipelining
  Pipelining across multiple operations
  Must balance the distribution across groups of concurrent C-steps
  Cut the DG horizontally and superimpose the pieces
  Finally, perform regular force-directed scheduling

Structural Pipelining
  Pipelining within an operation
  For non-data-dependent operations, only the first C-step need be considered
Other Optimizations

Local timing constraints
  Insert dummy timing operations -> restricted time frames
Multiclass FUs
  Create a multiclass DG by summing the probabilities of the relevant operations
Multistep/chained operations
  Carry propagation-delay information with the operation
  Extend time frames into other C-steps as required
Hardware constraints
  Use Force as the priority function in list scheduling algorithms
Scheduling using Simulated Annealing
Reference:
S. Devadas and A. R. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, pp. 768-781, July 1989.
Simulated Annealing
Local Search
[Figure: cost function over the solution space, illustrating a local search trajectory]
Statistical Mechanics
Combinatorial Optimization
State {ri} (configuration: a set of atomic positions)
Weight e^(-E({ri})/kB·T) (the Boltzmann distribution)
  E({ri}): energy of the configuration
  kB: Boltzmann constant
  T: temperature
Low temperature limit ⇒ ?
Analogy
Physical System         <->  Optimization Problem
State (configuration)   <->  Solution
Energy                  <->  Cost Function
Ground State            <->  Optimal Solution
Rapid Quenching         <->  Iterative Improvement
Careful Annealing       <->  Simulated Annealing
Generic Simulated Annealing Algorithm
1. Get an initial solution S
2. Get an initial temperature T > 0
3. While not yet 'frozen' do the following:
   3.1 For 1 ≤ i ≤ L, do the following:
       3.1.1 Pick a random neighbor S' of S
       3.1.2 Let Δ = cost(S') - cost(S)
       3.1.3 If Δ ≤ 0 (downhill move), set S = S'
       3.1.4 If Δ > 0 (uphill move), set S = S' with probability e^(-Δ/T)
   3.2 Set T = rT (reduce temperature)
4. Return S
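A minimal runnable sketch of this generic loop, applied to a toy one-dimensional cost function; neighbor(), cost(), and the schedule parameters (T, r, L) are illustrative assumptions, not part of the slides.

import math
import random

def cost(s):
    return (s - 7) ** 2 + 3 * math.sin(s)               # toy cost function

def neighbor(s):
    return s + random.uniform(-1.0, 1.0)                # random neighbor S' of S

def simulated_annealing(s, T=100.0, r=0.9, L=50, T_frozen=1e-3):
    while T > T_frozen:                                  # "not yet frozen"
        for _ in range(L):
            s_new = neighbor(s)
            delta = cost(s_new) - cost(s)
            # accept downhill moves always, uphill moves with probability e^(-delta/T)
            if delta <= 0 or random.random() < math.exp(-delta / T):
                s = s_new
        T *= r                                           # reduce temperature
    return s

print(simulated_annealing(0.0))                          # should land near the minimum (s ~ 5.7)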
Copyright 2003 Mani Srivastava65
Basic Ingredients for S.A.
Solution Space
Neighborhood Structure
Cost Function
Annealing Schedule
Observation
All scheduling algorithms we have discussed so far are critical path schedulers
They can only generate schedules with an iteration period larger than or equal to the critical path
They only exploit concurrency within a single iteration, and only utilize the intra-iteration precedence constraints
Example
Can one do better than an iteration period of 4?
Pipelining and retiming can reduce the critical path to 3, and also the number of functional units
Approaches:
  Transformations followed by scheduling
  Transformations integrated with scheduling
Conclusions
High Level Synthesis connects the Behavioral Description and the Structural Description
Scheduling, Estimations, Transformations
High Level of Abstraction, High Impact on the Final Design