
Efficient scheduling of behavioural descriptions in high-level synthesis


P. Kollig, B.M. Al-Hashimi and K.M. Abbott

Indexing terms: High level synthesis

Abstract: A new heuristic scheduling algorithm for time constrained datapath synthesis is described. The algorithm is based on the distribution graph concept where a least mean square error function is used to schedule operations in sequence, resulting in a computationally efficient solution with the capability of including other high-level synthesis features such as register cost without significant increase in execution time. This new proposed method contrasts with previously published algorithms where the influence of all operations on the schedule is first evaluated before the most appropriate operation is selected and scheduled. An important feature of the presented algorithm is its ability to solve different scheduling problems, including conditional statements, multicycled functional units and structural pipelining. To illustrate the efficiency of the algorithm a set of benchmark examples has been synthesised and compared. It has been shown that the new algorithm produces high quality solutions when compared to other heuristic algorithms. Furthermore, it is simple to implement and computationally efficient, with execution times increasing approximately linearly with increasing time constraints allowing complex designs to be synthesised in an acceptable timescale. As an example, it takes < 30s to obtain an optimal schedule for the discrete cosine transform when the time constraint of a maximum 36 control steps is imposed.

1 Introduction

High-level synthesis is the translation of a behavioural description into a register transfer level (RTL) structure which consists of a datapath and control path [1]. The datapath is composed of registers, multiplexers and functional units which perform operations such as multiplications and additions, whilst the control path generates the necessary control signals for operation execution. Automatic synthesis of datapath structures involves the tasks of scheduling, allocation and binding. Scheduling is often considered the most important step during synthesis [2] and its main objective is to add timing and structural information to the initial behavioural description. There are two scheduling schemes: resource constrained scheduling (RCS) and time constrained scheduling (TCS) [3]. The aim of RCS is to minimise the algorithm execution time having specified the available hardware, whilst TCS determines the minimum required number of hardware components to perform the algorithm in a specified time.

© IEE, 1997. IEE Proceedings online no. 19971121. Paper first received 25th April and in revised form 18th October 1996. The authors are with the School of Engineering, Staffordshire University, Beaconside, Stafford ST18 0AD, UK.

IEE Proc.-Comput. Digit. Tech., Vol. 144, No. 2, March 1997

Over the past few years a number of techniques have been proposed to solve the scheduling problem. A popular algorithm is the force-directed scheduler (FDS) used in the HAL system [4] which solves the time-constrained scheduling problem. Calculation of forces is based on a probability approach which results in an easy implementation and low computational complexity. Recently, work has been reported on utilising an improved FDS for high-throughput DSP applications [5]. However, neither system considers conditional resource sharing, which is important in applications where the system behaviour depends on states not known during synthesis.

Several scheduling algorithms capable of conditional resource sharing have been proposed. For example, the algorithms presented by Wakabayashi [6] and Camposano [7] target the RCS problem, whilst the TASS system [8] targets the TCS problem. Recently, Kim [9] has reported an alternative technique for scheduling data flow graphs (DFGs) with conditional branches. His method is based on transforming DFGs with nested conditional branches into equivalent unconditional graphs.

Other important issues in datapath synthesis are the support of structural pipelining [10] and multicycled functional units. Pipelining is important in high speed applications, whilst multicycling allows efficient use of hardware resources. These issues have been addressed using a number of approaches. One approach is based on integer linear programming, and examples of such systems are OSCAR [11] and ALPS [12]. The main disadvantage is that execution times are generally high. A more flexible approach is to offer a choice of scheduling algorithms depending on the application. For example, Cathedral-II [13] uses integer programming models and also heuristic schedulers. A more recent example capable of scheduling pipelined and multicycled components is the COBRA system [14]. This system uses simulated annealing and is targeted at a fixed datapath architecture.


Despite the work carried out, researchers are still looking for more efficient solutions in terms of design structure and short execution times. The aim of this paper is to describe a new and efficient scheduling algorithm for time constrained datapath synthesis. It is capable of solving different scheduling problems including conditional resource sharing, multicycled functional units and structural pipelining. A least mean square error function approach is used to solve the scheduling problem.

2 Proposed algorithm

The presented algorithm performs time-constrained scheduling where operations are assigned to control steps such that the cost of hardware resources is minimised for a given maximum execution time. The algorithm is based on the probability that a certain operation is to be scheduled into a specific control step (c-step), similar to the approach used by the force-directed scheduler. This requires determination of as soon as possible (ASAP) and as late as possible (ALAP) schedules. To find the ALAP schedule the overall time constraint must be specified by the user. Both schedules define the time frame $[ASAP_i, ALAP_i]$ into which a particular operation can be assigned. Assuming that there is a uniform likelihood of scheduling a particular operation ($o_i$) into any control step ($cstep$) within the defined time frame, the probability is [4]

$$prob(o_i, cstep) = \begin{cases} \dfrac{1}{ALAP_i - ASAP_i + 1} & ASAP_i \le cstep \le ALAP_i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The calculated probability values are then used to create distribution graphs (DGs) where values of operations having the same type are added such that

$$DG(type, cstep) = \sum_{i \in I_{type}} prob(o_i, cstep) \qquad (2)$$

Fig. 1 Distribution graph prior to scheduling (horizontal axis: control step)

where $I_{type}$ is a set containing indices to all operations of the same type. Fig. 1 shows a DG with probability values that vary considerably. For example, the probability value is 1.6 in c-step 5, whilst it is 0.2 in c-step 11, which means that it is likely that more than one functional unit is required in c-step 5. To optimise the utilisation of functional units, it is necessary to assign operations to c-steps such that the maximum distribution values decrease and the overall DG is more balanced. It is desirable that the maximum DG value of a scheduled data flow graph is as low as possible since this determines the number of functional modules required. This is achieved by distributing the operations uniformly across all c-steps, resulting in an optimal usage of functional modules. For this example, an optimum schedule would require only one functional unit, resulting in a balanced DG with a constant value of 1 across the c-steps 0 to 11. The proposed scheduling algorithm achieves this balancing task by assessing DGs and the effect of operation assignments on them using a mean square error (MSE) function approach.
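As a rough illustration of eqns. 1 and 2, the following Python sketch computes the uniform probabilities and sums them into one distribution graph per operation type. The tuple layout `(op_type, asap, alap)`, the function names and the example operations are assumptions made for this sketch and do not come from the paper.

```python
from collections import defaultdict


def probability(asap, alap, cstep):
    """Eqn. 1: uniform probability inside the time frame, zero outside."""
    if asap <= cstep <= alap:
        return 1.0 / (alap - asap + 1)
    return 0.0


def distribution_graph(ops, n_csteps):
    """Eqn. 2: per-type sum of probabilities for every c-step.

    `ops` is a list of (op_type, asap, alap) tuples (hypothetical layout).
    """
    dg = defaultdict(lambda: [0.0] * n_csteps)
    for op_type, asap, alap in ops:
        for cstep in range(n_csteps):
            dg[op_type][cstep] += probability(asap, alap, cstep)
    return dict(dg)


# Made-up example: three additions and one multiplication over 4 c-steps.
ops = [("add", 0, 1), ("add", 1, 1), ("add", 0, 3), ("mul", 2, 3)]
dg = distribution_graph(ops, n_csteps=4)
print(dg["add"])  # [0.75, 1.75, 0.25, 0.25]
print(dg["mul"])  # [0.0, 0.0, 0.5, 0.5]
```

The value 1.75 in c-step 1 plays the same role as the peak of 1.6 in Fig. 1: it suggests that more than one adder would be needed unless operations are moved to other c-steps.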

Prior to the scheduling process, a control and data flow graph (CDFG) [15] is derived from the given system behavioural specification. For this CDFG a schedule is developed iteratively: individual operations are considered, assessed and finally scheduled into the most suitable control step one after another. This is unlike other scheduling algorithms where the influence of all unscheduled operations on the schedule is evaluated before the most suitable operation is chosen and scheduled. Since the order of operation assignment affects the scheduling process, the following approach is adopted to obtain optimum results. First, all operations are divided into two groups: one with low mobility and the other with high mobility operations. Operation mobility is defined as

$$mobility(o_i) = ALAP_i - ASAP_i \qquad (3)$$

and the group of low mobility operations is

$$O_{Low\_mob} = \left\{ o_i \in O \;\middle|\; mobility(o_i) < \frac{Min\_mob + Max\_mob}{2} \right\} \qquad (4)$$

where $O$ is the group of all operations, and $Min\_mob$ and $Max\_mob$ are the lowest and highest mobility values of all operations, respectively. High mobility operations have little effect on the overall operation distribution because of their long time frames which result in low probability values. This means that it is unnecessary to schedule high mobility operations early. Also, such operations have comparatively few preceding and succeeding operations, which simplifies scheduling further. For these reasons, low mobility operations are scheduled first. The obtained group sizes do not have any impact on scheduling since the operations in both groups are classified into a sorted list according to:

(i) operation mobility (low/high)
(ii) increasing ASAP time
(iii) decreasing number of succeeding nodes.

This list is used to schedule all operations one after the other without the need to re-sort the list, thus preserving the algorithm’s low complexity.
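The ordering rule can be sketched in Python as a single sort with a three-part key. The `Op` record, its field names and the example values are hypothetical, and the sketch assumes that "number of succeeding nodes" can be supplied as a precomputed count.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    asap: int
    alap: int
    n_successors: int


def mobility(op):
    """Eqn. 3: width of the time frame minus one."""
    return op.alap - op.asap


def scheduling_order(ops):
    """Sort by (low/high mobility group, increasing ASAP, decreasing successors)."""
    mobs = [mobility(op) for op in ops]
    threshold = (min(mobs) + max(mobs)) / 2  # eqn. 4

    def key(op):
        is_high = mobility(op) >= threshold  # False (low mobility) sorts first
        return (is_high, op.asap, -op.n_successors)

    return sorted(ops, key=key)


ops = [
    Op("a", 0, 0, 2),
    Op("b", 0, 4, 1),
    Op("c", 1, 2, 1),
    Op("d", 1, 2, 3),
    Op("e", 2, 5, 0),
]
order = [op.name for op in scheduling_order(ops)]
print(order)  # ['a', 'd', 'c', 'b', 'e']
```

Here the threshold is 2, so `a`, `c` and `d` form the low mobility group and are placed ahead of `b` and `e`; `d` precedes `c` because it has more successors.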

1  sort all operations according to operation mobility, ASAP time, number of successors
2  while there are unscheduled operations do
3      take the next operation out of the sorted list
4      for all c-steps into which the operation could be scheduled do
5          assign the operation temporarily to the c-step
6          update time frames of preceding and succeeding nodes
7          calculate distribution graph for the modified data flow graph
8          evaluate the mean square error function
9      end for
10     schedule operation into the c-step for which the lowest MSE value was found
11     update time frames of preceding and succeeding nodes
12     update distribution graph
13 end while

Fig. 2 Pseudocode of proposed scheduling algorithm

Now the scheduling process begins, for which the pseudocode is given in Fig. 2. The first unscheduled operation is taken out of the sorted list (line 3). To establish the optimal c-step into which this operation is scheduled, the operation is assigned to all valid c-steps $j \in [ASAP_i, ALAP_i]$ within its time frame. Scheduling a particular operation into a certain c-step affects time frames of preceding and succeeding nodes. As a result, probability values of these nodes vary and modified distribution graphs $DG'_j(type, i)$ should be determined for each c-step assignment $j$. To investigate the effect of different c-step assignments on the operation distribution, the balance of the temporary $DG'_j$ is assessed knowing that a good schedule has a balanced DG. The difference between the DG values and an average value ($AVG_{type}$) provides an indication of the graph balance. The average value is obtained from the original DG using

$$AVG_{type} = \frac{1}{M_{type}} \sum_{i=0}^{N-1} DG(type, i) \qquad (5)$$

where $M_{type}$ is the theoretical number of c-steps into which operations of this type can be scheduled and $N$ is the number of c-steps in the schedule. Fig. 3 shows two possible DGs assuming that a particular operation (N-15) is assigned to two different c-steps. Fig. 3a shows the DG obtained with operation N-15 assigned into c-step 2, whilst Fig. 3b shows the DG with the same operation assigned into c-step 6. It can be seen that the DG in Fig. 3a is more balanced because the values in the histogram are distributed more uniformly and the differences to the average of 1.0 are smaller than those in Fig. 3b. These difference values are used to obtain a numerical value for the schedule quality using a mean square error (MSE) function:

$$MSE(j, type) = \frac{1}{N} \sum_{i=0}^{N-1} \left( DG'_j(type, i) - AVG_{type} \right)^2 \qquad (6)$$

where $DG'_j(type, i)$ is the modified distribution graph for an operation assignment into c-step $j$. Note that there is one MSE value for each operation type and, to find an overall rate, all MSE values are added:

$$MSE(j) = \sum_{type} C_{type} \cdot MSE(j, type) \qquad (7)$$

where $C_{type}$ is an optional constant reflecting the cost of the particular hardware resource. Experience has shown that setting all $C$ values to 1 maximises the utilisation of each functional unit type simultaneously and thus provides optimum results. Using eqns. 5-7, the values MSE(2) = 0.743 and MSE(6) = 0.885 are obtained from the DGs in Fig. 3. This confirms that assigning operation N-15 into c-step 2 yields a more balanced DG resulting in a better schedule.
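A small Python sketch of eqns. 5-7 follows. It assumes that the average is the sum of the original DG divided by $M_{type}$ and that each DG is stored as a plain list indexed by c-step; the function names and the numbers are made up for illustration and are unrelated to Fig. 3.

```python
def average(dg_original, m_type):
    """Eqn. 5 (assumed form): sum of the original DG divided by M_type."""
    return sum(dg_original) / m_type


def mse_type(dg_modified, avg):
    """Eqn. 6: mean squared deviation of a modified DG from the average."""
    n = len(dg_modified)
    return sum((value - avg) ** 2 for value in dg_modified) / n


def mse_total(dgs_modified, averages, costs=None):
    """Eqn. 7: cost-weighted sum over operation types (C_type defaults to 1)."""
    costs = costs or {}
    return sum(
        costs.get(op_type, 1.0) * mse_type(dg, averages[op_type])
        for op_type, dg in dgs_modified.items()
    )


# Made-up numbers: one operation type, four c-steps, M_type = 4.
original = {"add": [0.5, 1.5, 1.5, 0.5]}
averages = {"add": average(original["add"], m_type=4)}

balanced = {"add": [1.0, 1.0, 1.0, 1.0]}    # DG after a good assignment
unbalanced = {"add": [0.0, 2.0, 1.5, 0.5]}  # DG after a poor assignment

print(averages["add"])                  # 1.0
print(mse_total(balanced, averages))    # 0.0
print(mse_total(unbalanced, averages))  # 0.625
```

The assignment with the lower total is the one the scheduler would keep, mirroring the comparison of MSE(2) and MSE(6) in the text.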

Fig. 3 Operation assignment to 2 different c-steps (horizontal axis: control step)
a N-15 into c-step 2
b N-15 into c-step 6

Having determined the MSE values for all valid c-steps (lines 4-9), the operation is finally scheduled into the c-step which results in the lowest MSE value. This is followed by adjusting ASAP and ALAP times of preceding and succeeding operations and updating the DG values. The previous steps (lines 2-13) are repeated until all operations are scheduled. Using the proposed algorithm, the DG of the optimum schedule for the example (Fig. 1) is shown in Fig. 4. Note that the maximum DG value is 1 and hence only one functional unit is required.
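The loop of Fig. 2 can be sketched end to end in Python for single-cycle operations. This is a simplified illustration, not the authors' C++ implementation: the graph encoding `name -> (type, predecessors)`, all function names, the use of direct successors for the sort key, all $C_{type}$ set to 1 and the choice of $M_{type}$ as the number of c-steps with a non-zero original DG value are assumptions, and the time constraint is assumed to be feasible.

```python
def successors(ops):
    """Map every operation name to the names of its direct successors."""
    succs = {name: [] for name in ops}
    for name, (_, preds) in ops.items():
        for p in preds:
            succs[p].append(name)
    return succs


def time_frames(ops, succs, n_csteps, fixed):
    """ASAP/ALAP c-steps; operations listed in `fixed` are pinned."""
    asap, alap = {}, {}

    def earliest(n):
        if n not in asap:
            bound = max((earliest(p) + 1 for p in ops[n][1]), default=0)
            asap[n] = fixed.get(n, bound)
        return asap[n]

    def latest(n):
        if n not in alap:
            bound = min((latest(s) - 1 for s in succs[n]), default=n_csteps - 1)
            alap[n] = fixed.get(n, bound)
        return alap[n]

    for n in ops:
        earliest(n)
        latest(n)
    return asap, alap


def distribution_graphs(ops, asap, alap, n_csteps):
    """Eqns. 1 and 2 for every operation type."""
    dgs = {}
    for name, (op_type, _) in ops.items():
        row = dgs.setdefault(op_type, [0.0] * n_csteps)
        p = 1.0 / (alap[name] - asap[name] + 1)
        for c in range(asap[name], alap[name] + 1):
            row[c] += p
    return dgs


def total_mse(dgs, avgs):
    """Eqns. 6 and 7 with all C_type = 1."""
    return sum(sum((v - avgs[t]) ** 2 for v in row) / len(row)
               for t, row in dgs.items())


def schedule(ops, n_csteps):
    succs = successors(ops)
    fixed = {}
    asap, alap = time_frames(ops, succs, n_csteps, fixed)
    dg0 = distribution_graphs(ops, asap, alap, n_csteps)
    # Assumption: M_type = number of c-steps with a non-zero original DG value.
    avgs = {t: sum(row) / sum(1 for v in row if v > 0) for t, row in dg0.items()}
    mob = {n: alap[n] - asap[n] for n in ops}
    threshold = (min(mob.values()) + max(mob.values())) / 2
    order = sorted(ops, key=lambda n: (mob[n] >= threshold, asap[n], -len(succs[n])))
    for name in order:                                  # Fig. 2, lines 2-3
        best_cost, best_step = None, None
        for c in range(asap[name], alap[name] + 1):     # lines 4-9
            trial = {**fixed, name: c}
            a, l = time_frames(ops, succs, n_csteps, trial)
            cost = total_mse(distribution_graphs(ops, a, l, n_csteps), avgs)
            if best_cost is None or cost < best_cost:
                best_cost, best_step = cost, c
        fixed[name] = best_step                         # line 10
        asap, alap = time_frames(ops, succs, n_csteps, fixed)  # lines 11-12
    return fixed


# Made-up example: four additions, b depends on a, time constraint of 4 c-steps.
ops = {"a": ("add", []), "b": ("add", ["a"]), "c": ("add", []), "d": ("add", [])}
result = schedule(ops, n_csteps=4)
print(result)
```

For this small graph every c-step ends up with exactly one addition, so a single adder suffices; which of the tied c-steps `b`, `c` and `d` receive depends on tie-breaking, which the sketch resolves in favour of the earliest candidate.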

Fig. 4 DG of an optimum schedule using the proposed algorithm (horizontal axis: control step)

Fig. 5 Data flow graph with conditional branches (the figure marks a subgraph and its conditional branches, e.g. {3,2})

2.1 Conditional statements

In solving practical design problems it is often required to synthesise behavioural descriptions containing conditional statements. Such statements result in DFGs with conditional branches where operations are mutually exclusive, which means that they are never executed at the same time. Thus, they can be scheduled into the same c-step without increasing the number of functional units required. Fig. 5 shows an example of a DFG with nested conditional branches. Two or more parallel conditional branches form a subgraph as shown in Fig. 5. To support scheduling of conditional DFGs using the proposed algorithm, it is necessary to number all subgraphs consecutively. Now, the branches within a subgraph are assigned a unique two-digit number {subgraph, branch} as shown. For example, the marked subgraph in Fig. 5 consists of the conditional branches {3,1} and {3,2}. Having numbered subgraphs and branches, the next step is to generate a set of DGs for each branch of the DFG using eqns. 1 and 2. To obtain the DG for a particular subgraph the DG values of all branches within the subgraph are combined. Therefore, the maximum function is used because mutually exclusive operations are never executed at the same time and the highest probability value in all conditional branches determines how many functional units are required. This results in the following formula:

$$DG_{subgraph}(j) = \max_{i \in Branches} DG_i(j) \qquad (8)$$

where $Branches$ is a set containing indices to all branches in a subgraph. Usually there is more than one subgraph and to obtain DG values of the equivalent graph the following equation is used:

$$DG(j) = \sum_{subgraphs} DG_{subgraph}(j) \qquad (9)$$

In Fig. 5 branches {2,1} and {2,2} are combined first into a DG for subgraph 2 using the maximum function. Having done this, the resulting value is added to the DG value of branch {1,2}. Nested conditional branches require that eqns. 8 and 9 are used recursively. This finally yields a DG of an equivalent unconditional DFG which is applied to the scheduling algorithm described above.

The main difference to previously published algorithms is that the critical path length of conditional branches does not have to be increased prior to scheduling and hence computationally intensive graph transformations [9] are avoided. In contrast to the condition vector concept [6] and TASS [8], the proposed algorithm does not need to determine which operations are actually mutually exclusive.
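The recursive combination of branch DGs described for Fig. 5 can be sketched as follows. The nested dictionary layout (`"dg"` for a block's own operations, `"subgraphs"` for lists of mutually exclusive branches) and all numbers are assumptions chosen for illustration; the sketch takes the maximum over the branches of a subgraph and adds the result to the DG of the enclosing block.

```python
def equivalent_dg(block):
    """Collapse a block with nested conditional subgraphs into one DG.

    `block["dg"]` holds the DG of the block's own operations and
    `block["subgraphs"]` is a list of subgraphs, each a list of branches;
    every branch is itself a block (hypothetical layout).
    """
    total = list(block["dg"])
    for subgraph in block.get("subgraphs", []):
        branch_dgs = [equivalent_dg(branch) for branch in subgraph]
        # Mutually exclusive branches: the largest value per c-step counts.
        combined = [max(values) for values in zip(*branch_dgs)]
        # The subgraph DG is added to the DG of the enclosing block.
        total = [t + c for t, c in zip(total, combined)]
    return total


# Made-up DGs for one operation type over four c-steps.
branch_2_1 = {"dg": [0.0, 0.0, 0.5, 0.5]}
branch_2_2 = {"dg": [0.0, 0.0, 1.0, 0.0]}
branch_1_1 = {"dg": [0.0, 0.5, 0.5, 0.0]}
branch_1_2 = {"dg": [0.0, 0.0, 0.25, 0.25], "subgraphs": [[branch_2_1, branch_2_2]]}
top = {"dg": [0.5, 0.5, 0.0, 0.0], "subgraphs": [[branch_1_1, branch_1_2]]}

print(equivalent_dg(branch_1_2))  # [0.0, 0.0, 1.25, 0.75]
print(equivalent_dg(top))         # [0.5, 1.0, 1.25, 0.75]
```

The resulting list can be handed to the scheduler in place of the plain sum of eqn. 2, which is why no explicit test for mutual exclusion between individual operations is needed.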

Fig. 6 Probability calculation for multicycled operations (panels a and b; horizontal axis: control step)

2.2 Multicycled functional units

One approach to achieve efficient implementation of a given behavioural description is to use multicycled functional units (MFUs). To consider MFUs using the proposed algorithm, two properties must be met. The first property requires that the sum of all probability values has to equal the number of c-steps required for execution. The second property is that probability values for an operation are no longer constant within the operation's time frame. Fig. 6 shows the distribution graph of a multicycled operation, assuming that the operation is scheduled into time frame [1, 3] and that execution takes two c-steps. So, the operation can be executed either in time interval [1, 2] (Fig. 6a) or in [2, 3] (Fig. 6b). Adding both probability values results in Fig. 6b, which indicates that it is more likely that a multicycled component is required in c-step 2. To achieve the two properties, the single cycle probability value calculation (eqn. 1) has to be modified to

$$prob(o_i, cstep) = \frac{\min(cycles_i,\; cstep - ASAP_i + 1,\; ALAP_i - cstep + 1)}{ALAP_i - ASAP_i - cycles_i + 2} \qquad (10)$$

where the parameter $cycles_i$ refers to the execution time of a multicycled operation measured in control steps. Furthermore, MFUs reduce the operation mobility as long as the length of the time frame $[ASAP_i, ALAP_i]$ remains unchanged. Hence, the mobility calculation (eqn. 3) is modified:

$$mobility(o_i) = ALAP_i - ASAP_i - cycles_i + 1 \qquad (11)$$

Eqns. 10 and 11 allow scheduling of multicycled functional units without further modifications to the proposed scheduling algorithm.
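The two-cycle example of Fig. 6 can be checked numerically with a short Python sketch of eqns. 10 and 11. The function names are hypothetical, and returning zero outside the time frame is an assumption carried over from eqn. 1.

```python
def prob_multicycle(asap, alap, cycles, cstep):
    """Eqn. 10: probability that a multicycled operation occupies `cstep`."""
    if not asap <= cstep <= alap:
        return 0.0  # assumption: zero outside the time frame, as in eqn. 1
    occupied = min(cycles, cstep - asap + 1, alap - cstep + 1)
    return occupied / (alap - asap - cycles + 2)


def mobility_multicycle(asap, alap, cycles):
    """Eqn. 11: mobility reduced by the extra execution cycles."""
    return alap - asap - cycles + 1


# Time frame [1, 3], execution takes two c-steps (the situation of Fig. 6).
probs = [prob_multicycle(1, 3, 2, c) for c in range(5)]
print(probs)                         # [0.0, 0.5, 1.0, 0.5, 0.0]
print(sum(probs))                    # 2.0, the number of execution cycles
print(mobility_multicycle(1, 3, 2))  # 1
```

The values sum to the number of execution cycles and peak in c-step 2, which matches both properties stated in the text; with `cycles=1` the function reduces to the single-cycle case of eqn. 1.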

Fig. 7 Pipelined functional unit (three stages: stage 1, stage 2, stage 3; the figure marks the point where the operation leaves the first stage and the point where the operation result becomes available)

2.3 Structural pipelining

Pipelining is used to increase the data throughput of a system. Pipelined components consist of multiple stages and intermediate results are passed from stage to stage. Thus, operations require multiple c-steps until the result is available, but a new operation can be started as soon as the first stage is available. For example, Fig. 7 shows a pipelined component with three stages. The result becomes available after a particular operation has gone through all three stages, while a new operation can be started as soon as the first stage is released. Support for scheduling of pipelined components can be easily incorporated into the algorithm when only the execution time of the first stage is considered while the DGs are generated. This results in a modification to eqn. 1:

$$prob(o_i, cstep) = \frac{\min(dii_i,\; cstep - ASAP_i + 1,\; ALAP_i - cstep + 1)}{ALAP_i - ASAP_i - latency_i + 2} \qquad (12)$$

where $dii_i$ refers to the number of clock cycles required for execution of the first pipeline stage. The modified probability equation allows scheduling of pipelined functional units without further modifications to the proposed scheduling algorithm.

Fig. 8 Register requirement DG (horizontal axis: control step)

2.4 Register cost

In scheduling, it is important to consider not only functional unit cost but also the cost of registers and interconnect structure in searching for the most efficient design realisation. This Section explains how the number of registers can be minimised by introducing an additional DG.

Prior to scheduling, it is not known in which c-step a particular result becomes available and until when it needs to be stored. This means that only an estimate of the expected variable lifetime is known and probability values indicate how likely it is that a particular result has to be stored in a specific c-step. Experience has shown that, in the majority of cases, this estimate represents a good approximation and is close to the final register count. In Fig. 8, it is assumed that an operation result becomes available either in c-step 2, 3 or 4. The probability that the result is already available in c-step 2 is one out of three and increases as the number of c-steps increases. For example, it is 1 after c-step 4 in Fig. 8. These probability values are obtained according to the following definition:

$$source(o_i, cstep) = \begin{cases} \min\left(1, \dfrac{cstep - ASAP_i + 1}{ALAP_i - ASAP_i + 1}\right) & cstep \ge ASAP_i \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

where $source(o_i, cstep)$ represents the probability that an operation result needs to be stored whilst the min function limits the probability values to 1. A similar definition is used to estimate the end of a variable life:

$$dest(o_i, cstep) = \begin{cases} \min\left(1, \dfrac{ALAP_i - cstep + 1}{ALAP_i - ASAP_i + 1}\right) & cstep \le ALAP_i \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

In some cases a result is used by more than one operation and the maximum of the probability value obtained using eqn. 14 determines how likely it is that a register is required in a particular c-step:

$$dest'(o_i, cstep) = \max_{j \in succ\_nodes} \left( dest(o_j, cstep) \right) \qquad (15)$$

where $succ\_nodes$ is a set containing indices to all operations that succeed operation $o_i$. Eqns. 13-15 describe initial probability values of register requirements which should be combined to generate the DG used in the scheduling process:

$$prob(o_i, cstep) = \min\left( source(o_i, cstep),\; dest'(o_i, cstep) \right) \qquad (16)$$

This minimum function determines the lower value of source and destination values, which in turn defines the probability that a register is needed. As an example, the DG in Fig. 9a is obtained using eqn. 16. It shows probability values of a result which becomes available either in c-step 2 or 3 and is last used in c-steps 5 or 6. Note that in c-steps 3-5 a register is always required independent of the scheduling result and therefore the probability value is 1 in these c-steps. In some cases, probability values are less than 1 in all c-steps because of overlapping ASAP and ALAP time frames. For example, in Fig. 9b a result becomes available in the time period [1, 4] and is last used in [3, 6]. Such probability values indicate that it is not known in which c-step a register is certainly required.

The resultant DG is treated during scheduling in the same way as any other DG, allowing consideration of the register cost during the scheduling process. If desired, additional cost measures can be incorporated when new DGs are introduced.

Fig. 9 Example DGs of register probabilities (panels a and b; horizontal axis: control step)
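The two register examples described for Fig. 9 can be reproduced with a short Python sketch. The exact bodies of the source and destination definitions are assumptions here: both are taken as linear ramps over the respective time frame, clipped to 1, which is consistent with the worked values in the text (one out of three in the first c-step of a three-step frame, and a probability of 1 in c-steps 3-5 of the first example). Function names and the tuple layout `(asap, alap)` are hypothetical.

```python
def source(asap, alap, cstep):
    """Assumed form of eqn. 13: probability that the result already exists."""
    if cstep < asap:
        return 0.0
    return min(1.0, (cstep - asap + 1) / (alap - asap + 1))


def dest(asap, alap, cstep):
    """Assumed form of eqn. 14: probability that a consumer still needs it."""
    if cstep > alap:
        return 0.0
    return min(1.0, (alap - cstep + 1) / (alap - asap + 1))


def register_prob(producer, consumers, cstep):
    """Eqns. 15 and 16: min of source and the max over all consumers."""
    needed = max(dest(asap, alap, cstep) for asap, alap in consumers)
    return min(source(*producer, cstep), needed)


# Result available in c-step 2 or 3, last used in c-step 5 or 6.
fig_9a = [register_prob((2, 3), [(5, 6)], c) for c in range(9)]
# Result available in [1, 4], last used in [3, 6].
fig_9b = [register_prob((1, 4), [(3, 6)], c) for c in range(9)]

print(fig_9a)  # [0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
print(fig_9b)  # [0.0, 0.25, 0.5, 0.75, 0.75, 0.5, 0.25, 0.0, 0.0]
```

Summing such lists over all results gives a register DG that can be balanced by the same MSE evaluation as the functional unit DGs.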

3 Experimental results

The algorithm has been implemented using C++ on a SUN Sparc 10. To verify the efficiency, a number of benchmark examples have been synthesised and compared to other high level synthesis systems.

Table 1 Synthesis results of 5th-order digital wave filter

Time constraint    17      18      21      25
Operation          +  *    +  *    +  *    +  *
Optimum            3  3    2  2    2  1    2  1
Proposed           3  3    2  2    2  1    2  1
HAL [4]            3  3    -  -    2  1    -  -
TASS [8]           3  3    2  2    2  1    2  1
ALPS [12]          3  3    2  2    2  1    -  -
COBRA [14]         -  -    3  2    2  1    -  -
HU [17]            3  3    2  2    2  1    -  -

- no result available

3.1 5th-order digital elliptic wave filter

This example [16] is often used as a standard benchmark because it is quite difficult to obtain an optimal schedule due to the filter structure complexity. To allow a comparison to previous work, it is assumed that a multiplication takes two c-steps and an addition takes one c-step for operation execution. Table 1 shows the scheduling results for different time constraints. It can be seen that the proposed algorithm is capable of producing the optimum solution for all constraints. Even for this comprehensive example an optimum schedule was obtained in < 10 s, which shows that the proposed algorithm is also computationally efficient.

Table 2 Results obtained using pipelined multipliers

Time constraint    17        18        19
Operation          +  (*)    +  (*)    +  (*)
Proposed           3   2     3   1     2   1
HAL [4]            3   2     3   1     2   1

(*) pipelined multiplier

Fig. 10 Schedule 5th-order digital wave filter, time constraint 21 c-steps

Fig. 11 Functional unit utilisation for additions and multiplications (horizontal axis: control step)
a Additions
b Multiplications

The proposed algorithm not only produces the optimum result regarding functional unit count, it also yields good results considering register count. For example, the schedule generated using the proposed algorithm requires ten registers for the time constraint 19 c-steps. The same description synthesised with HAL requires 12 registers, whilst COBRA uses 13 registers. Fig. 10 shows the schedule obtained for a time constraint of 21 c-steps. The operation names are given on the left of each operation, and the execution interval is given on the right. Fig. 11a shows the utilisation of adders, and the distribution of multiplications is given in Fig. 11b. Note that it is not possible to execute additions in c-steps 4 and 5 because of data dependencies. It can be seen that only one multiplier is required which is utilised in the most efficient way across c-steps 4-19.

To reduce the number of functional units, pipelined components can be utilised. Here, it is assumed that the pipelined multiplier has an execution time of two c-steps, while a new operation can be started in every c-step. Table 2 compares the results obtained by HAL [4] with those obtained using the proposed algorithm. It can be seen that results of equal quality are obtained.

3.2 Discrete cosine transform The previous example has demonstrated how designs with complex data dependencies are handled. The 8- point discrete cosine transform (DCT) [18J used in image coding and image compression investigates the capability of scheduling designs with massive data parallelism. The DCT consists of 25 additions, seven subtractions and 16 multiplications. To allow a com- parison to previously published results, it has been assumed that an addition can be performed in one c- step, while a multiplication takes two c-steps. Table 3 shows and compares the number of required functional units obtained by the different algorithms. For the time constraint of ten c-steps the proposed algorithm uses an additional ALU when compared with the other

IEE Proc -Comput Digit Tech, Vol. 144, No 2, March 1997


Fig. 12 Schedule for the discrete cosine transform, time constraint 19 c-steps

Fig. 13 Scheduling result for the Kim example

Table 3 Scheduling results for discrete cosine transform

Time constraint    10          14          19
Operation          ALU   *     ALU   *     ALU   *
ASCAM [19]         4     4     3     3     3     2
LSA [20]           4     4     3     3     2     3
Proposed           5     4     3     3     2     2

algorithms, while it finds the same solution for the constraint of 14 c-steps. A more efficient realisation, however, is obtained for the constraint of 19 c-steps. Compared to ASCAM one ALU less is used, and compared to LSA


Table 4 Execution times for discrete cosine transform

Time constraint    10       14       19
ASCAM [19]         140s     204s     569s
LSA [20]           10-17 min
Proposed           6s       11s      16s

one multiplier less is used. The schedule for the time constraint 19 c-steps is shown in Fig. 12, and it can be seen that a maximum of two multipliers and two ALUs is used in each c-step. Table 4 shows the CPU times of the various algorithms for different time constraints


highlighting the computational efficiency of the new scheduler where execution times are comparatively low and increase approximately linearly with the number of c-steps.

It has been found that the new algorithm produces good quality solutions in acceptable time periods for long time constraints. For example, it took < 30s to obtain the optimal solution (one multiplier and one adder) for the time constraint 36 c-steps.
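A simple resource bound helps to put the DCT results into context. The sketch below assumes non-pipelined units and ignores data dependencies, so it gives only a lower limit that a schedule may not be able to reach. It also assumes that additions and subtractions share ALUs, as the ALU column of Table 3 suggests. The function name is hypothetical.

```python
def fu_lower_bound(num_ops, latency, time_constraint):
    """Lower bound on the unit count, ignoring data dependencies.

    Each operation keeps a non-pipelined unit busy for `latency` c-steps,
    so num_ops * latency unit-steps must fit into `time_constraint`
    c-steps. Integer ceiling division avoids floating-point rounding.
    """
    return -(-(num_ops * latency) // time_constraint)


ALU_OPS = 25 + 7  # additions plus subtractions (assumed to share ALUs)
MUL_OPS = 16      # multiplications, two c-steps each

for t in (10, 14, 19, 36):
    alus = fu_lower_bound(ALU_OPS, 1, t)
    muls = fu_lower_bound(MUL_OPS, 2, t)
    print(f"T={t}: at least {alus} ALU(s) and {muls} multiplier(s)")
```

The bound evaluates to 4 and 4 units for ten c-steps, 3 and 3 for 14 c-steps, 2 and 2 for 19 c-steps, and 1 and 1 for 36 c-steps. The results of the proposed algorithm for 14, 19 and 36 c-steps meet this bound, whereas the result for ten c-steps lies one ALU above it.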

Table 5 Synthesis results for Kim example

             Time constraint    +    -    >
Proposed     7                  2    1    2
Kim [9]      7                  2    1    2

3.3 Kim example

This example [9] is chosen to illustrate the capability of scheduling mutually exclusive operations and conditional resource sharing. Fig. 13 shows the scheduled DFG of the Kim example. Note that the mutually exclusive operations N-69, N-105 and N-114 are scheduled into c-step 3, allowing them to be executed using the same functional unit (FU). Note also that the same FU is used in c-step 4 to execute the unconditional operation N-127. Table 5 compares the results obtained using the proposed algorithm with those published by Kim. It can be seen that the new algorithm provides results of equal quality. It should be mentioned that Kim’s algorithm does not provide the result immediately after scheduling: the extension of the critical path length inside conditional blocks during graph transformation results in a suboptimal solution, so a further optimisation process is needed.
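Conditional resource sharing can be sketched as a change in how FU demand is counted per c-step. In the following illustration, operations in different branches of the same condition are never executed together, so only the busiest branch contributes to the demand. The function name, the condition label "c" and the branch numbers are assumptions made for this example; the actual branch structure of the Kim example is not given here, and nested conditions are not handled.

```python
from collections import defaultdict


def fu_demand(ops):
    """Return the FU demand per c-step for one operation type.

    ops: list of (name, c_step, branch) tuples. `branch` is None for an
    unconditional operation, otherwise a (condition, branch_id) pair.
    """
    unconditional = defaultdict(int)
    conditional = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for _name, step, branch in ops:
        if branch is None:
            unconditional[step] += 1
        else:
            condition, branch_id = branch
            conditional[step][condition][branch_id] += 1
    steps = sorted(set(unconditional) | set(conditional))
    # Per condition only the busiest branch needs units in a given c-step.
    return {
        step: unconditional[step]
        + sum(max(counts.values()) for counts in conditional[step].values())
        for step in steps
    }


# Operation names from the text; the branch assignment is hypothetical.
ops = [
    ("N-69", 3, ("c", 0)),
    ("N-105", 3, ("c", 1)),
    ("N-114", 3, ("c", 2)),
    ("N-127", 4, None),
]
print(fu_demand(ops))  # {3: 1, 4: 1}
```

Under these assumptions a single FU serves the three mutually exclusive operations in c-step 3 and the unconditional operation in c-step 4.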

The algorithm has also been tested on various other synthesis examples such as the MAHA example [21] and the differential equation solver [22]. In all cases it has been shown that the proposed algorithm compares favourably in terms of FU and register count.

4 Conclusion

A new, simple and efficient heuristic scheduling algorithm for time constrained datapath synthesis has been introduced. Based on the distribution graph concept, a least mean square error function is used to solve a number of scheduling problems including conditional resource sharing, structural pipelining, multicycled functional units and consideration of register cost. A number of benchmark examples including the 5th-order wave digital filter and the 8-point discrete cosine transform have been presented, demonstrating the algorithm’s efficiency in terms of functional unit utilisation and execution time when compared to previously published algorithms.

5 References

1 GAJSKI, D.D., and RAMACHANDRAN, L.: ‘Introduction to high-level synthesis’, IEEE Des. Test Comput., pp. 44-54, Winter, 1994

2 MCFARLAND, M.C., PARKER, A.C., and CAMPOSANO, R.: ‘Tutorial on high-level synthesis’. Proc. 25th Design automation Conference, 1988, pp. 330-336

3 WALKER, R.A., and CHAUDHURI, S.: ‘Introduction to the scheduling problem’, IEEE Des. Test Comput., pp. 60-69, Summer, 1995

4 PAULIN, P.G., and KNIGHT, J.P.: ‘Force-directed scheduling for the behavioral synthesis of ASIC’s’, IEEE Trans. Comput.-Aided Des., 1989, 8, (6), pp. 661-679

5 VERHAEGH, W.F.J., LIPPENS, P.E.R., AARTS, E.H.L., KORST, J.H.M., VAN MEERBERGEN, J.L., and VAN DER WERF, A.: ‘Improved force-directed scheduling in high-throughput digital signal processing’, IEEE Trans. Comput.-Aided Des., 1995, 14, (8), pp. 945-960

6 WAKABAYASHI, K., and YOSHIMURA, T.: ‘A resource sharing and control synthesis method for conditional branches’. Proc. Int. Conference on Computer-aided design, 1989, pp. 62-65

7 CAMPOSANO, R.: ‘Path-based scheduling for synthesis’, IEEE Trans. Comput.-Aided Des., 1991, 10, (1), pp. 85-93

8 AMELLAL, A., and KAMINSKA, B.: ‘Functional synthesis of digital systems with TASS’, IEEE Trans. Comput.-Aided Des., 1994, 13, (9, pp. 537-582

9 KIM, T., YONEZAWA, N., LIU, J.W.S., and LIU, C.L.: ‘A scheduling algorithm for conditional resource sharing - A hierarchical reduction approach’, IEEE Trans. Comput.-Aided Des., 1994, 13, (4), pp. 428-438

10 PARK, N., and PARKER, A.: ‘SEHWA: A program for synthesis of pipelines’. Proc. 23rd Design automation Conference, 1986, pp. 454-460

11 LANDWEHR, B., MARWEDEL, P., and DOEMER, R.: ‘OSCAR: Optimum simultaneous scheduling, allocation and resource binding based on integer programming’. Proc. of EURO-DAC ’94, 1994, pp. 90-95

12 HWANG, C.T., LEE, J.H., and HSU, Y.C.: ‘A formal approach to the scheduling problem in high-level synthesis’, IEEE Trans. Comput.-Aided Des., 1991, 10, pp. 464-475

13 DE MAN, H., RABAEY, J., SIX, P., and CLAESEN, L.: ‘Cathedral-II: A silicon compiler for digital signal processing’, IEEE Des. Test Comput., 1986, 3, (6), pp. 13-25

14 DUNCAN, A.A., and HENDRY, D.C.: ‘High-level synthesis of DSP datapaths by global optimisation of variable lifetimes’, IEE Proc. Comput. Digit. Tech., 1995, 142, (3), pp. 218-224

15 VAN EIJNDHOVEN, J.T.J., and STOK, L.: ‘A data flow graph exchange standard’. Proc. European Conference on Design automation, Brussels, Belgium, March 1992, pp. 193-199

16 DEWILDE, P., DEPRETTERE, E., and NOUTA, R.: ‘Parallel and pipelined VLSI implementation of signal processing algorithms’ in KUNG, S.Y., WHITEHOUSE, H.J., and KAILATH, T. (Eds.): ‘VLSI and modern signal processing’ (Prentice-Hall Information and System Sciences Series, New Jersey, 1985)

17 HU, Y., and CARLSON, B.S.: ‘A unified algorithm for estimation and scheduling in data path synthesis’. Proc. International Symposium on Circuits and systems, London, UK, 1994, pp. 57-60

18 NEIL, J.P., and DENYER, P.B.: ‘Simulated annealing based synthesis of fast discrete cosine transform blocks’ in TAYLOR, G., and RUSSELL, G. (Eds.): ‘Algorithmic and knowledge based CAD for VLSI’ (Peter Peregrinus, 1992)

19 CIVERA, P., MASERA, G., PICCININI, G., and ZAMBONI, M.: ‘Algorithms for operation scheduling in VLSI circuit design’, IEE Proc. G, 1993, 140, (5), pp. 339-346

20 KRISHNAMOORTHY, G., and NESTOR, J.A.: ‘Data path allocation using an extended binding model’. Proc. 29th DAC, Anaheim, USA, 1992, pp. 279-284

21 PARKER, A.C., PIZARRO, J.T., and MLINAR, M.: ‘MAHA: A program for datapath synthesis’. Proc. 23rd Design automation Conference, 1986, pp. 461-466

22 PAULIN, P.G., KNIGHT, J.P., and GIRCZYC, E.F.: ‘HAL: A multi-paradigm approach to automatic data path synthesis’. Proc. 23rd Design automation Conference, 1986, pp. 263-270
