Mani Srivastava
UCLA, EE Department
Room: 6731-H Boelter Hall
Email: [email protected]
Phone: 310-267-2098
WWW: http://www.ee.ucla.edu/~mbs
Copyright 2003 Mani Srivastava
High-level Synthesis: Scheduling, Allocation, Assignment
Note: Several slides in this Lecture are from
Prof. Miodrag Potkonjak, UCLA CS
Overview
High Level Synthesis
Scheduling, Allocation and Assignment
Estimations
Transformations
Allocation, Assignment, and Scheduling
[Figure: data-flow graph with +, -, and >> operations mapped onto hardware]
Allocation: How much? (e.g., 2 adders, 1 shifter, registers)
Assignment: Where? (e.g., Shifter 1)
Schedule: When? (e.g., Time Slot 4)
Techniques Well Understood and Mature
Scheduling and Assignment
[Figure: data-flow graph with operations +1, +2, +3, *1, *2, *3 and two alternative schedules over 4 control steps]

Schedule 1:
  Control step 1: +1
  Control step 2: +2
  Control step 3: +3, *1
  Control step 4: *2, *3

Schedule 2:
  Control step 1: +3
  Control step 2: +1, *2
  Control step 3: +2, *3
  Control step 4: *1
ASAP Scheduling Algorithm
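The algorithm itself appears as a figure on the original slide; below is a minimal sketch of unconstrained ASAP scheduling over a DAG, assuming integer delays d[i] and a predecessor list per operation (the names and the toy example are illustrative, not the slide's exact pseudocode).

# Minimal ASAP sketch: start each op at the earliest step its predecessors allow.
def asap_schedule(ops, preds, d):
    start = {}
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if all(p in start for p in preds[op]):              # all predecessors scheduled
                start[op] = max((start[p] + d[p] for p in preds[op]), default=1)
                remaining.remove(op)
    return start

# Toy example: v1 -> v3 and v2 -> v3, unit delays.
print(asap_schedule(['v1', 'v2', 'v3'],
                    {'v1': [], 'v2': [], 'v3': ['v1', 'v2']},
                    {'v1': 1, 'v2': 1, 'v3': 1}))               # {'v1': 1, 'v2': 1, 'v3': 2}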
ASAP Scheduling Example
ASAP: Another Example
[Figure: sequence graph and its ASAP schedule]
ALAP Scheduling Algorithm
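As with ASAP, the algorithm is shown as a figure; below is a minimal sketch of ALAP scheduling under a latency constraint L, assuming successor lists and integer delays (all names are illustrative assumptions).

# Minimal ALAP sketch: start each op as late as its successors and the latency bound allow.
def alap_schedule(ops, succs, d, L):
    start = {}
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if all(s in start for s in succs[op]):
                latest_finish = min((start[s] for s in succs[op]), default=L + 1)
                start[op] = latest_finish - d[op]               # start as late as possible
                remaining.remove(op)
    return start

# Toy example with latency constraint 4: v1 -> v3 and v2 -> v3, unit delays.
print(alap_schedule(['v1', 'v2', 'v3'],
                    {'v1': ['v3'], 'v2': ['v3'], 'v3': []},
                    {'v1': 1, 'v2': 1, 'v3': 1}, L=4))          # {'v3': 4, 'v1': 3, 'v2': 3}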
Copyright 2003 Mani Srivastava9
ALAP Scheduling Example
Copyright 2003 Mani Srivastava10
ALAP: Another Example
[Figure: sequence graph and its ALAP schedule (latency constraint = 4)]
Observation about ALAP & ASAP
No priority is given to nodes on the critical path
As a result, less critical nodes may be scheduled ahead of critical nodes
This causes no problem if hardware is unlimited
However, if resources are limited, the less critical nodes may block the critical nodes and thus produce inferior schedules
List scheduling techniques overcome this problem by using a more global node-selection criterion
List Scheduling and Assignment
List_Scheduling() {
    Create_Candidate_List();
    while (Candidate_List != NULL) {
        Select_Candidate();
        Schedule_Candidate();
    }
}
[Figure: example data-flow graph with +1, +2, +3, *1, *2, *3 and the resulting 4-control-step list schedule]
List Scheduling Algorithm using Decreasing Criticalness Criterion
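The algorithm on this slide is a figure; below is a minimal sketch of resource-constrained list scheduling with a decreasing-criticalness priority, here approximated by "smaller ALAP start time = more critical". Unit delays, the resource table, and all identifiers are illustrative assumptions, not the slide's exact algorithm.

# List scheduling sketch: at each control step, schedule ready ops in order of
# criticalness until the resources of each type are exhausted.
def list_schedule(ops, preds, op_type, alap, n_units):
    start, step = {}, 1
    while len(start) < len(ops):
        # candidates: unscheduled ops whose predecessors have all completed (unit delays)
        ready = [o for o in ops if o not in start
                 and all(p in start and start[p] < step for p in preds[o])]
        ready.sort(key=lambda o: alap[o])                       # most critical first
        used = {k: 0 for k in n_units}
        for o in ready:
            k = op_type[o]
            if used[k] < n_units[k]:                            # a free unit of this type?
                start[o] = step
                used[k] += 1
        step += 1
    return start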
Copyright 2003 Mani Srivastava14
Scheduling
NP-complete problem
Optimal techniques
Heuristics: iterative improvement
Heuristics: constructive
Various versions of the problem:
  Unconstrained minimum latency
  Resource-constrained minimum latency
  Timing constrained
If all resources are identical, the problem reduces to multiprocessor scheduling
The minimum-latency multiprocessor problem is intractable
Scheduling - Optimal Techniques
Integer Linear Programming
Branch and Bound
Copyright 2003 Mani Srivastava16
Integer Linear Programming
Given: an integer-valued m x n matrix A,
vectors B = (b1, b2, ..., bm) and C = (c1, c2, ..., cn)

Minimize: C^T X
Subject to:
  A X ≥ B
  X = (x1, x2, ..., xn) is an integer-valued vector
Integer Linear Programming

Problem: For a set of (dependent) computations {t1, t2, ..., tn}, find the minimum number of units needed to complete the execution in k control steps.

Integer linear programming formulation: Let y0 be an integer variable. For each control step i (1 ≤ i ≤ k):
  define variable xij: xij = 1 if computation tj is executed in the i-th control step, xij = 0 otherwise;
  define variable yi = xi1 + xi2 + ... + xin.
Integer Linear Programming
For each computation dependency "ti must be done before tj", introduce a constraint:
  k·x1i + (k-1)·x2i + ... + xki ≥ k·x1j + (k-1)·x2j + ... + xkj + 1    (*)

Minimize: y0
Subject to:
  x1i + x2i + ... + xki = 1 for all 1 ≤ i ≤ n
  yi ≤ y0 for all 1 ≤ i ≤ k
  all computation-dependency constraints of type (*)
An Example
[Figure: dependency graph over computations c1, ..., c6]
6 computations, 3 control steps
An Example
Introduce variables: xij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 6
  yi = xi1 + xi2 + xi3 + xi4 + xi5 + xi6 for 1 ≤ i ≤ 3
  y0

Dependency constraints, e.g. execute c1 before c4:
  3·x11 + 2·x21 + x31 ≥ 3·x14 + 2·x24 + x34 + 1

Execution constraints:
  x1j + x2j + x3j = 1 for 1 ≤ j ≤ 6
An Example

Minimize: y0
Subject to:
  yi ≤ y0 for all 1 ≤ i ≤ 3
  dependency constraints
  execution constraints

One solution: y0 = 2
  x11 = 1, x12 = 1,
  x23 = 1, x24 = 1,
  x35 = 1, x36 = 1,
  all other xij = 0
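A sketch of this example in code, assuming the PuLP ILP library is available; only the c1-before-c4 dependency written out above is modeled, since the full edge list comes from the figure.

# ILP for the 6-computation / 3-control-step example (minimize the number of units y0).
import pulp

K, N = 3, 6
prob = pulp.LpProblem("min_units", pulp.LpMinimize)
x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
     for i in range(1, K + 1) for j in range(1, N + 1)}
y0 = pulp.LpVariable("y0", lowBound=0, cat="Integer")

prob += y0                                                      # objective: minimize y0
for j in range(1, N + 1):                                       # each computation in exactly one step
    prob += pulp.lpSum(x[i, j] for i in range(1, K + 1)) == 1
for i in range(1, K + 1):                                       # ops in step i  <=  y0
    prob += pulp.lpSum(x[i, j] for j in range(1, N + 1)) <= y0
# dependency c1 before c4:  3*x11 + 2*x21 + x31 >= 3*x14 + 2*x24 + x34 + 1
prob += (pulp.lpSum((K - i + 1) * x[i, 1] for i in range(1, K + 1))
         >= pulp.lpSum((K - i + 1) * x[i, 4] for i in range(1, K + 1)) + 1)

prob.solve()
print("y0 =", int(pulp.value(y0)))                              # 2 for this reduced edge set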
Copyright 2003 Mani Srivastava22
ILP Model of Scheduling
Binary decision variables xil:
  i = 0, 1, ..., n;  l = 1, 2, ..., λ+1
  xil = 1 if operation vi starts in step l, and 0 otherwise
Start time of each operation is unique:
  Σl xil = 1 for every operation vi, so its start time is ti = Σl l·xil
ILP Model of Scheduling (contd.)
Sequencing relationships must be satisfied:
  for every edge (vj, vi):  ti ≥ tj + dj, i.e.  Σl l·xil ≥ Σl l·xjl + dj
Resource bounds must be met; let the upper bound on the number of resources of type k be ak:
  for every type k and step l:  Σ over {i : T(vi) = k} of Σ from m = l-di+1 to l of xim ≤ ak
Minimum-latency Scheduling Under Resource-constraints
Let t be the vector whose entries are the start times
Formal ILP model: minimize c^T t subject to the uniqueness, sequencing, and resource constraints above
Example
Two types of resources:
  Multiplier
  ALU (adder/subtractor, comparator)
Both take 1 cycle of execution time
Example (contd.)
Heuristic (list scheduling) gives latency = 4 steps
Use ALAP and ASAP (with no resource constraints) to get bounds on start times
ASAP latency matches the heuristic's, so the heuristic is optimum; but let us ignore that!
Constraints?
Example (contd.)
Start time is unique
Example (contd.)
Sequencing constraints
  Note: only the non-trivial ones are listed, i.e. those where at least one operation has more than one possible start time
Example (contd.)
Resource constraints
Example (contd.)
Consider c = [0, 0, ..., 1]^T
  Minimum-latency schedule
  Since the sink has no mobility (xn,5 = 1), any feasible schedule is optimum
Consider c = [1, 1, ..., 1]^T
  Finds the earliest start times for all operations
  Equivalently, minimizes the sum of the start times Σi ti
Example Solution: Optimum Schedule Under Resource Constraints
Example (contd.)
Assume multiplier costs 5 units of area, and ALU costs 1 unit of area
Same uniqueness and sequencing constraints as before
Resource constraints are in terms of unknown variables a1 and a2
a1 = # of multipliers
a2 = # of ALUs
Copyright 2003 Mani Srivastava33
Example (contd.)

Resource constraints
Example Solution
Minimize c^T a = 5·a1 + 1·a2
Solution with cost 12
Copyright 2003 Mani Srivastava35
Precedence-constrained Multiprocessor Scheduling
All operations done by the same type of resource
An intractable problem
Intractable even if all operations have unit delay
Scheduling - Iterative Improvement
Kernighan-Lin (deterministic)
Simulated Annealing
Lottery Iterative Improvement
Neural Networks
Genetic Algorithms
Tabu Search
Scheduling - Constructive Techniques
Most Constrained
Least Constraining
Copyright 2003 Mani Srivastava38
Force Directed Scheduling
Goal is to reduce hardware by balancing concurrency
Iterative algorithm, one operation scheduled per iteration
Information (i.e., speed and area) is fed back into the scheduler
Copyright 2003 Mani Srivastava39
The Force Directed Scheduling Algorithm
Copyright 2003 Mani Srivastava40
Step 1
Determine ASAP and ALAP schedules
[Figure: example data-flow graph scheduled with ASAP (left) and ALAP (right)]
Step 2
Determine the time frame of each operation
  Length of box ~ possible execution cycles
  Width of box ~ probability of assignment
  Uniform distribution, area assigned = 1
[Figure: time frames of the example operations over C-steps 1-4, with probabilities 1/2 and 1/3 for operations whose frames span 2 and 3 steps]
Step 3
Create distribution graphs (DGs)
  Sum of the probabilities of each operation type in each C-step
  Indicates the concurrency of similar operations
  DG(i) = Σ over operations of Prob(Op, i)
[Figure: DG for Multiply and DG for Add/Sub/Compare]
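A small code sketch of Steps 2 and 3: time frames are taken from ASAP/ALAP start times, and each operation contributes a uniform probability over its frame. The operation names and schedules below are made-up placeholders, not the lecture's differential-equation example.

# Distribution graph under the uniform-probability assumption.
asap = {'m1': 1, 'm2': 1, 'a1': 3}                      # hypothetical ASAP start steps
alap = {'m1': 2, 'm2': 1, 'a1': 4}                      # hypothetical ALAP start steps
op_type = {'m1': 'mult', 'm2': 'mult', 'a1': 'alu'}

def distribution_graph(kind, steps=range(1, 5)):
    dg = {}
    for step in steps:
        total = 0.0
        for op, t in op_type.items():
            if t != kind:
                continue
            lo, hi = asap[op], alap[op]                 # the operation's time frame
            if lo <= step <= hi:
                total += 1.0 / (hi - lo + 1)            # uniform probability over the frame
        dg[step] = total
    return dg

print(distribution_graph('mult'))                       # {1: 1.5, 2: 0.5, 3: 0.0, 4: 0.0}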
Diff Eq Example: Precedence Graph Recalled
Diff Eq Example: Time Frame & Probability Calculation
Diff Eq Example: DG Calculation
Conditional Statements
Operations in different branches are mutually exclusive
Operations of the same type can be overlapped onto the DG
The probability of the most likely operation is added to the DG
[Figure: fork/join conditional branches with + and - operations, and the resulting DG for Add]
Self Forces

Scheduling an operation will affect the overall concurrency
Every operation has a 'self force' for every C-step of its time frame
Analogous to the effect of a spring: f = Kx
A desirable scheduling will have a negative self force
  It will achieve better concurrency (lower potential energy)

Force(i) = DG(i) * x(i)
  DG(i) ~ current distribution-graph value
  x(i) ~ change in the operation's probability
Self Force(j) = Σ from i = t to b of Force(i), summed over operation j's time frame (top step t to bottom step b)
Example

Attempt to schedule the multiply in C-step 1:
  Self Force(1) = Force(1) + Force(2)
               = ( DG(1) * x(1) ) + ( DG(2) * x(2) )
               = 2.833 * (+0.5) + 2.333 * (-0.5) = +0.25
This is positive, so scheduling the multiply in the first C-step would be bad
[Figure: DG for Multiply and the operations' time frames over C-steps 1-4]
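A sketch reproducing this self-force calculation. DG(1) = 2.833 and DG(2) = 2.333 are the slide's values; the function itself, with its uniform-probability time frame, is an illustrative reconstruction.

# Self force of tentatively fixing an op (2-step time frame) into a target C-step.
def self_force(dg, frame, target):
    before = 1.0 / len(frame)                           # probability before scheduling
    force = 0.0
    for step in frame:
        after = 1.0 if step == target else 0.0          # probability after fixing at `target`
        force += dg[step] * (after - before)            # Force(i) = DG(i) * x(i)
    return force

print(self_force({1: 2.833, 2: 2.333}, frame=[1, 2], target=1))   # +0.25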
Copyright 2003 Mani Srivastava49
Diff Eq Example: Self Force for Node 4
Copyright 2003 Mani Srivastava50
Predecessor & Successor Forces
Scheduling an operation may affect the time frames of other linked operations
This may negate the benefits of the desired assignment
Predecessor/successor forces = sum of the self forces of any implicitly scheduled operations
[Figure: example data-flow graph showing the linked operations]
Diff Eq Example: Successor Force on Node 4
If node 4 is scheduled in step 1:
  no effect on the time frame of successor node 8
  Total force = Force4(1) = +0.25
If node 4 is scheduled in step 2:
  it causes node 8 to be scheduled into step 3
  so the successor force must be calculated
Diff Eq Example: Final Time Frame and Schedule
Copyright 2003 Mani Srivastava53
Diff Eq Example: Final DG
Copyright 2003 Mani Srivastava54
Lookahead

Temporarily modify the constant DG(i) to include the effect of the iteration being considered:
  Force(i) = temp_DG(i) * x(i), where temp_DG(i) = DG(i) + x(i)/3

Consider the previous example:
  Self Force(1) = (DG(1) + x(1)/3)·x(1) + (DG(2) + x(2)/3)·x(2)
               = 0.5·(2.833 + 0.5/3) - 0.5·(2.333 - 0.5/3) = +0.41667
This is even worse than before
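A sketch of the lookahead variant, reusing the DG values above; it reproduces the +0.41667 result for C-step 1.

# Lookahead self force: temp_DG(i) = DG(i) + x(i)/3.
def lookahead_self_force(dg, frame, target):
    before = 1.0 / len(frame)
    total = 0.0
    for step in frame:
        dx = (1.0 if step == target else 0.0) - before
        total += (dg[step] + dx / 3.0) * dx             # Force(i) = temp_DG(i) * x(i)
    return total

print(lookahead_self_force({1: 2.833, 2: 2.333}, frame=[1, 2], target=1))   # about +0.4167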
Copyright 2003 Mani Srivastava55
Minimization of Bus Costs
The basic algorithm is suitable for a narrow class of problems
The algorithm can be refined to consider "cost" factors
Number of buses ~ number of concurrent data transfers
Number of buses = maximum number of transfers in any C-step
Create a modified DG that includes transfers: the Transfer DG
  Trans DG(i) = Σ [Prob(Op, i) * Opn_No_InOuts]
  Opn_No_InOuts ~ combined distinct inputs/outputs of the operation
Calculate the Force with this DG and add it to the Self Force
Minimization of Register Costs

The minimum number of registers required is given by the largest number of data arcs crossing a C-step boundary
Create storage operations at the output of any operation that transfers a value to a destination in a later C-step
Generate a Storage DG for these "operations"
The length of a storage operation depends on the final schedule
[Figure: storage distribution for a value S, showing its ASAP, MAX, and ALAP lifetimes]
Minimization of Register Costs (contd.)

avg life = ([ASAP life] + [MAX life] + [ALAP life]) / 3

storage DG(i) = [avg life] / [max life]   (no overlap between ASAP & ALAP lifetimes)
storage DG(i) = ([avg life] - [overlap]) / ([max life] - [overlap])   (if they overlap)

Calculate the "Storage" Force and add it to the Self Force

[Figure: register requirements for the example. ASAP: 7 registers minimum; Force Directed: 5 registers minimum]
Pipelining

[Figure: functional pipelining of a data-flow graph across two overlapping instances (steps 1-4 and 1'-4'), and structural pipelining within a multi-cycle multiply]

Functional Pipelining
  Pipelining across multiple operations
  Must balance the distribution across groups of concurrent C-steps
  Cut the DG horizontally and superimpose the pieces
  Finally, perform regular force-directed scheduling

Structural Pipelining
  Pipelining within an operation
  For non-data-dependent operations, only the first C-step need be considered
Other Optimizations

Local timing constraints
  Insert dummy timing operations -> restricted time frames
Multiclass FUs
  Create a multiclass DG by summing the probabilities of the relevant operations
Multistep/chained operations
  Carry propagation-delay information with the operation
  Extend time frames into other C-steps as required
Hardware constraints
  Use Force as the priority function in list scheduling algorithms
Scheduling using Simulated Annealing
Reference:
S. Devadas and A. R. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 7, pp. 768-781, July 1989.
Simulated Annealing
Local Search
[Figure: cost function over the solution space, illustrating a local search trajectory]
Statistical Mechanics
Combinatorial Optimization
State {ri} (configuration: a set of atomic positions)
Weight e^(-E({ri})/kB·T) (the Boltzmann distribution)
  E({ri}): energy of the configuration
  kB: Boltzmann constant
  T: temperature
Low temperature limit ⇒ ?
Analogy
Physical System         <->  Optimization Problem
State (configuration)   <->  Solution
Energy                  <->  Cost Function
Ground State            <->  Optimal Solution
Rapid Quenching         <->  Iterative Improvement
Careful Annealing       <->  Simulated Annealing
Generic Simulated Annealing Algorithm
1. Get an initial solution S
2. Get an initial temperature T > 0
3. While not yet 'frozen' do the following:
   3.1 For 1 ≤ i ≤ L, do the following:
       3.1.1 Pick a random neighbor S' of S
       3.1.2 Let Δ = cost(S') - cost(S)
       3.1.3 If Δ ≤ 0 (downhill move), set S = S'
       3.1.4 If Δ > 0 (uphill move), set S = S' with probability e^(-Δ/T)
   3.2 Set T = rT (reduce temperature)
4. Return S
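A minimal runnable sketch of this generic loop, applied to a toy one-dimensional cost function; neighbor(), cost(), and the schedule parameters (T, r, L) are illustrative assumptions, not part of the slides.

import math
import random

def cost(s):
    return (s - 7) ** 2 + 3 * math.sin(s)               # toy cost function

def neighbor(s):
    return s + random.uniform(-1.0, 1.0)                # random neighbor S' of S

def simulated_annealing(s, T=100.0, r=0.9, L=50, T_frozen=1e-3):
    while T > T_frozen:                                  # "not yet frozen"
        for _ in range(L):
            s_new = neighbor(s)
            delta = cost(s_new) - cost(s)
            # accept downhill moves always, uphill moves with probability e^(-delta/T)
            if delta <= 0 or random.random() < math.exp(-delta / T):
                s = s_new
        T *= r                                           # reduce temperature
    return s

print(simulated_annealing(0.0))                          # should land near the minimum (s ~ 5.7)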
Copyright 2003 Mani Srivastava65
Basic Ingredients for S.A.
Solution Space
Neighborhood Structure
Cost Function
Annealing Schedule
Observation
All scheduling algorithms we have discussed so far are critical path schedulers
They can only generate schedules with an iteration period larger than or equal to the critical path
They only exploit concurrency within a single iteration, and only utilize the intra-iteration precedence constraints
Example
Can one do better than an iteration period of 4?
Pipelining and retiming can reduce the critical path to 3, and also the number of functional units
Approaches:
  Transformations followed by scheduling
  Transformations integrated with scheduling
Conclusions
High Level Synthesis connects the Behavioral Description and the Structural Description
Scheduling, Estimations, Transformations
High Level of Abstraction, High Impact on the Final Design