
Source level optimisation of VHDL for behavioural synthesis

T.P.K. Nijhar A.D. Brown

Indexing terms: Behavioural synthesis, VHDL, Code optimisation

Abstract: Optimisation in high level behavioural synthesis is usually performed by applying transforms to the datapath and control graphs. An alternative approach, however, is to apply transforms at a higher level in the process, specifically directly to the behavioural source description. This technique is analogous to the way in which the source code of a conventional sequential programming language may be processed by an optimising compiler. The application of this kind of preprocessing to a number of example behavioural VHDL source descriptions (which are then fed into a ‘conventional’ synthesis system) produces structural descriptions which are up to 32% smaller and 52% faster.

1 Introduction

The high-level behavioural synthesis problem has been attacked from a number of directions. One of the central subproblems of the process is the question of design optimisation: for a given behavioural description (the input to the synthesis system), a large number of equivalent structural descriptions (the output from the synthesis system) exist. These equivalent descriptions will differ in area, processing speed and overall power dissipation. Goal oriented optimisation within the synthesis process guides the system to produce a solution which complies with the users' requirements, and is usually achieved through a series of transformations, applied to the data and control path graphs.

Aside from the datapath optimisation, it is possible to enhance the final design by performing modifications at the VHDL source level. Although the optimisation of VHDL at this level can be regarded as similar in many ways to the optimisation of a conventional, sequential programming language, there are significant differences.

The optimisation goals associated with a sequential language are (usually) execution speed and program size, the latter goal referring to the memory occupancy of the final executable program. (Further levels of sophistication exist; for example, the code and data area requirements may be minimised independently.) However, these optimisation techniques are focused on maximising the execution speed on, and the resource utilisation within, a fixed target architecture - the hardware on which the program is to run. (Thus it follows that most good optimisers are machine dependent.)

© IEE, 1996. IEE Proceedings online no. 19960631. Paper first received 29th January 1996 and in revised form 17th May 1996. The authors are with the Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK.

It is easy to lose sight of the fact that VHDL is a hardware description language: it cannot be executed in the conventional sense. The output from a segment of VHDL source is a structural description. While the goals of VHDL optimisation are broadly the same as those of a sequential programming language - execution speed and design area (power dissipation has no sequential programming analogue) - optimising VHDL gives the optimiser a significant extra degree of freedom: it is free to manipulate the 'executing' architecture itself. While many of the optimisation operations are clearly based on corresponding sequential programming techniques, this extra degree of freedom requires us to modify some of the transformations, and allows us to introduce new ones that have no direct sequential programming analogue. (It also renders some transformations less useful, and introduces interactions that do not exist in the sequential programming world.)

Optimisation of conventional sequential programming languages is a well researched area [1-6], and parallel programming language optimisation is also well documented [7-12]. However, the problem of optimisation of hardware description languages has received relatively little attention.

[4] describes an optimiser that has been incorporated into a design automation system, taking as input a behavioural description in ISPS. This is translated into a group of directed acyclic graphs known as the value trace. It is on this structure that the optimisations are performed. The transformations reported are divided into three categories: OPERATOR transforms (carried out on individual operators or groups of operators), SELECT transforms (carried out on the conditional constructs), and VTBODY transforms (carried out on groups of value trace operators). Examples of the transformations include constant folding, dead operator elimination and loop unwinding. It should be noted that the only loop optimisation performed is loop expansion, although in general it is assumed that the transformations applied to a loop will be those that have the most effect. The results describe the effect of inline expansion, showing a decrease in the number of control steps of between 28% and 37%.
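The flavour of transforms such as constant folding can be sketched on a toy expression tree. The tuple representation below is an invention for illustration only; it does not model ISPS or the value trace.

```python
# Toy constant folding over a nested expression tree, in the spirit of the
# OPERATOR transforms described above. A node is (op, lhs, rhs); leaves are
# numbers or named signals. This representation is a sketch, not the value
# trace structure itself.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def fold(node):
    """Recursively replace operator nodes whose operands are both constants."""
    if not isinstance(node, tuple):
        return node                       # leaf: constant or named signal
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)       # both operands constant: fold now
    return (op, left, right)

# (a * (2 + 3)) + (4 * 5)  ->  (a * 5) + 20
tree = ('+', ('*', 'a', ('+', 2, 3)), ('*', 4, 5))
print(fold(tree))                         # ('+', ('*', 'a', 5), 20)
```

A dead-operator pass would then delete any folded node whose result is no longer referenced.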

IEE Proc.-Comput. Digit. Tech., Vol. 144, No. 1, January 1997


[13] describes an optimising compiler for VHDL. Only a restricted subset of VHDL, chosen so that a reducible flow graph can be constructed, is handled. Behavioural VHDL constructs are used to describe the design, using a single process statement (i.e. one sequential block). The description is parsed and converted into an intermediate description, which is the substrate for the subsequent optimisations. This intermediate representation is in the form of a 'three address code', which is used to build up a graphical representation of the design called a process graph. Data flow analysis is performed, using an iterative method, in order to implement the transformations correctly. Loop optimisations (where potentially the greatest changes may be effected) are not performed.
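The iterative dataflow analysis used by such optimisers can be sketched with the classic reaching-definitions fixed-point computation. The control flow graph and GEN/KILL sets below are invented for illustration; they are not taken from [13].

```python
# Minimal iterative reaching-definitions analysis over a basic-block CFG,
# of the kind needed to apply transformations safely. The example graph
# (B2 and B3 form a loop) and the GEN/KILL sets are invented.
def reaching_definitions(blocks, preds, gen, kill):
    """Iterate IN[b] = union of OUT[p]; OUT[b] = GEN[b] | (IN[b] - KILL[b])."""
    out = {b: set() for b in blocks}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for b in blocks:
            in_b = set().union(*[out[p] for p in preds[b]])
            new_out = gen[b] | (in_b - kill[b])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return out

blocks = ['B1', 'B2', 'B3']
preds  = {'B1': [], 'B2': ['B1', 'B3'], 'B3': ['B2']}   # B2/B3 form a loop
gen    = {'B1': {'d1'}, 'B2': {'d2'}, 'B3': set()}
kill   = {'B1': {'d2'}, 'B2': {'d1'}, 'B3': set()}
print(reaching_definitions(blocks, preds, gen, kill))
```

The loop converges because the OUT sets only ever grow and are bounded by the finite set of definitions.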

The Olympus system [14] incorporates a set of source level transformations into the front end of its synthesis system, including local and basic global transformations (such as common subexpression and dead code removal). The loop optimisations are limited in that there is the option of expanding fixed iteration loops - however, no attempt appears to be made to optimise the loop through transformations such as loop invariant or induction variable detection. The system also presents the option to inline expand subroutine calls (as does the system described here), and to bind operations to specific library functions.

The system described in this paper does not share the shortcomings of the above; it can perform a larger range of optimisations on the entire synthesisable subset [15] of VHDL.

Fig. 1 Gross dataflow of synthesis system: behavioural VHDL passes through the source optimiser (area/speed goals) to produce behavioural VHDL, which is then synthesised (area, speed, power, testability goals) into structural VHDL

Fig. 2 Detailed breakdown of the synthesis system, from lexical analyser through to FPGA mapper

2 Goal oriented behavioural synthesis

The preprocessor described here is used as a front end to the MOODS (multiple objective optimisation by datapath synthesis) system [16-18]. A brief description is relevant here because MOODS is needed to quantify the effects of the preprocessing. (For the purposes of characterising the process, we are effectively using MOODS as a postprocessor to the source optimiser.) The quantitative metrics used by MOODS for a design are area, speed, power dissipation and 'testability'; these are attributes that may only be associated with a structural VHDL description. (The latter two attributes are not discussed in detail further here.) Both the input and output of the source optimiser are behavioural VHDL [Note 1], which cannot possess physical characteristics. The gross dataflow of the complete synthesis system is shown in Fig. 1; Fig. 2 shows a more detailed breakdown.

MOODS performs the synthesis process by firstly constructing a control and datapath graph (a composite graph) of the behavioural description with a 1:1 mapping of datapath operators to source constructs.

Fig. 3 Movement of a synthesised design through design space (trajectory from initial configuration to final design; temperature decreases monotonically)

Fig. 4 Movement of a synthesised design through temperature/criteria configuration space (delay against temperature, from initial to final configuration)

Note 1: Nonbehavioural VHDL, such as generate statements and component calls, can pass through the source optimiser, although these are elaborated as opposed to being optimised.
Note 2: 'Small' and 'local' in this context do not necessarily imply datapath adjacency; multiplexing access to an expensive operator may create datapaths in the graph.



Under the guidance of a simulated annealing algorithm [19, 20], a series of small [Note 2], local, reversible transformations are performed on the composite graph, to bring the design closer to the user specified goals. The simulated annealing algorithm thus moves the design around in 'design space' (see Figs. 3 and 4). Note that each point on the trajectories of Figs. 3 and 4 corresponds to a different physical realisation of the required behaviour. Examples of the transformations used by MOODS are operator sharing (replacing two operators by a single operator and a multiplexer) and operator compression (replacing two simple combinatorial operators in a linear predicate relation in the datapath graph with a single, compound operator). Seventeen transforms in all are used by MOODS; a full list and description may be found elsewhere [18].
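The annealing-guided acceptance of reversible transforms can be illustrated with a generic simulated-annealing skeleton. This is a sketch of the general technique only: the one-dimensional cost function, the move set and all parameter values are invented for illustration, and do not reflect the transforms or cost metrics MOODS actually uses.

```python
# Skeleton of the simulated-annealing control loop: a worsening move is
# accepted with probability exp(-delta/T), and the temperature T is
# lowered monotonically, so the search freezes into a low-cost state.
import math
import random

def anneal(cost, moves, state, t=10.0, cooling=0.95, steps=200, seed=1):
    random.seed(seed)                 # deterministic run, for illustration
    best = state
    for _ in range(steps):
        candidate = random.choice(moves)(state)
        delta = cost(candidate) - cost(state)
        # always accept improvements; accept worsenings with Boltzmann prob.
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = candidate
        if cost(state) < cost(best):
            best = state
        t *= cooling                  # monotonically decreasing temperature
    return best

cost = lambda x: (x - 3) ** 2         # toy design-space "cost", minimum at 3
moves = [lambda x: x + 1, lambda x: x - 1]
print(anneal(cost, moves, state=20))  # walks down to the minimum at x = 3
```

In MOODS the "moves" are the reversible graph transforms and the cost is a weighted combination of the user's area, speed, power and testability goals.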

3 Source optimisation

The VHDL source is converted into a parse tree (consisting of a set of structures representing the various VHDL statements), and it is to this structure that the transforms are applied. In order to apply the optimisations, the data are viewed as a control flow graph (with nodes representing basic blocks and arcs indicating the data flow between them).

Associated with the control flow graph is a set of data structures representing definition-use (DU) and use-definition (UD) chains. The purpose of these lists is to keep track of the variables and signals defined, together with when and where they are being read from or written to. The lists are constantly updated, as each transform performed has an effect on the list.

The definition-use chain has an entry for each assignment statement, together with links to all the uses reached by this definition. The use-definition chain may be associated with any type of statement and it indicates all the possible definitions for each use in the statement.

Fig. 5 Definition-use/use-definition structure
a Generating fragment:
b = 34
a = x + 1
if (...) a = f
a = y * 2
p = b * a
q = 3 * a
b Corresponding DU/UD data structure (DU and UD tables for a and b)


Fig. 5 shows an example of these structures. Fig. 5a shows a fragment of VHDL code, and Fig. 5b the corresponding DU/UD structures. The DU table for the variable a contains entries corresponding to all the assignments (and threats) to a; each entry has pointers to subsequent uses of a (i.e. to statements whose outcome is predicated on the assignment/threat to a). Each entry in the UD table has pointers back to the appropriate DU entry.
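A minimal sketch of DU-chain construction for a straight-line fragment like that of Fig. 5 follows. The statement encoding is an assumption of the sketch; a conditional assignment is treated as a 'threat', i.e. a definition that may or may not execute and therefore does not kill earlier definitions.

```python
# Sketch of definition-use chain construction. Each statement is encoded
# as (lineno, target, uses, conditional); this tuple form is invented for
# illustration and is not the optimiser's parse-tree representation.
def du_chains(stmts):
    reaching = {}          # var -> set of line numbers that may define it
    du = {}                # (def_line, var) -> set of use line numbers
    for lineno, target, uses, conditional in stmts:
        for var in uses:   # link every reaching definition to this use
            for d in reaching.get(var, set()):
                du.setdefault((d, var), set()).add(lineno)
        if target:
            if conditional:
                reaching.setdefault(target, set()).add(lineno)  # threat
            else:
                reaching[target] = {lineno}    # unconditional: kills others
    return du

fragment = [
    (1, 'b', [], False),          # b = 34
    (2, 'a', ['x'], False),       # a = x + 1
    (3, 'a', ['f'], True),        # if (...) a = f   -- a threat to a
    (4, 'a', ['y'], False),       # a = y * 2
    (5, 'p', ['b', 'a'], False),  # p = b * a
    (6, 'q', ['a'], False),       # q = 3 * a
]
print(du_chains(fragment))
```

For this fragment the uses of a in lines 5 and 6 are reached only by the unconditional definition in line 4, which kills the earlier definition and threat.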

3.1 The transforms

The transforms applied are listed in this Section. Each transform may be loosely considered to be of one of three types: sequential, parallel or hardware.

Sequential transforms are those which are valid in a conventional sequential compilation environment, but are of no (or negative) use in a descriptive environment.

Parallel transforms are those which may be applied by parallel compilers, and are also applicable in a descriptive environment.

Hardware transforms are those which optimise to a goal unique to a hardware description language (such as power dissipation or system testability).

The optimiser currently uses a lexicon of 23 transforms (MOODS uses 17 for the composite graph optimisation). These are described in Table 1. As with a conventional sequential optimiser, the individual transforms may be grouped together to produce sets that are most appropriate for a particular goal (i.e. area or delay). By studying the effects of the transforms individually on a number of test data sets (we have a library of about 40 VHDL benchmarks [21-23]) we have arrived at the empirical groupings shown in the Table.

3.2 Transform groupings

As well as the goal groupings, there exist various relationships between the individual transforms (this helps determine the order in which they are applied). The transforms interact in one of three ways:

No relationship: the order of application of the transforms is interchangeable. This covers the majority of cases.

Mutually exclusive: the outcome of one optimisation renders the other ineffective. The two transforms should never be applied together.

Dependent transforms: the dependent transform is more effective if applied after the first (independent) operation.

Transforms that have a considerable effect on the data structure (such as loop unwinding, component flattening and inline expansion) are applied first so that the remaining transforms are applied to as large a data area as possible.

The peephole class of transforms are applied at several stages during the optimisation process. These are a fast and inexpensive set of transforms to apply, and can be used to simplify the data structure after transforms such as loop unwinding have been applied.
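The peephole transforms (constant folding, algebraic simplification, reduction in strength) can be sketched as local pattern rewrites over an expression tree. The (op, lhs, rhs) tuple encoding is an assumption of this sketch, not the optimiser's internal representation.

```python
# Peephole-style algebraic simplification: a single recursive pass over
# expression nodes, rewriting patterns such as x + 0 and x * 1, and
# reducing x * 2 in strength to a shift. Node shape (op, lhs, rhs) is an
# invented representation for illustration.
def peephole(node):
    if not isinstance(node, tuple):
        return node                               # leaf: signal or constant
    op, l, r = node[0], peephole(node[1]), peephole(node[2])
    if op == '+' and r == 0:
        return l                                  # x + 0 -> x
    if op == '*' and r == 1:
        return l                                  # x * 1 -> x
    if op == '*' and r == 0:
        return 0                                  # x * 0 -> 0
    if op == '*' and r == 2:
        return ('<<', l, 1)                       # reduction in strength
    return (op, l, r)

print(peephole(('+', ('*', ('+', 'x', 0), 2), 0)))   # ('<<', 'x', 1)
```

Because each rewrite is cheap and local, such a pass can be rerun after any structure-changing transform (e.g. loop unwinding) to mop up the expressions it exposes.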

4 Examples

Results from two sizeable behavioural descriptions are presented here. Recall that the source optimiser is a behaviour-to-behaviour translator, and that MOODS is a behaviour-to-structure translator.

The transform interactions are shown in Fig. 6.



Table 1: Transform taxonomy

(1) Constant folding (peephole): a transformation that replaces constant references with actual values and attempts to simplify the expression

(2) Algebraic simplification (peephole and operator independent): the simplification of expressions that follow a particular pattern, such as x + 0, x * 1

(3) Reduction in strength (peephole): replacement of specific operations with equivalent cheaper ones

(4) Common subexpression removal: identical subexpressions are detected and replaced with references to a new variable, which evaluates the common expression. This and transformation 22 (register minimisation through common expressions) have mutually opposed effects and should not be applied together

(5) Dead code removal: removal of code that has no further use

(6) Copy propagation: replacing references to copies (statements of the form f := g) with the value of the copy statement. This transformation must be applied prior to common subexpression removal so that as many similar expressions as possible may be detected. It must also be followed by dead code removal, to delete the redundant copy expressions

(7) Loop unwinding: expanding fixed iterative loops. The effect of this transform depends on the size and contents of the loop. It is effective in terms of area when the loop is small or the contents of the loop are highly dependent on the index, which allows various peephole transformations to be performed, simplifying a significant amount of the hardware generated

(8) Sequential loop fusion: merging two loops which have the same number of iterations. This optimisation is rendered ineffective if followed by loop unwinding

(9) Parallel loop fusion: an extension of sequential loop fusion, applied to two loops of differing sizes. The loops are fused together with the use of a conditional statement, so they can exit when necessary. This optimisation is rendered ineffective if followed by loop unwinding

(10) Loop invariant: moving of constant expressions within a loop to the outside of the loop

(11) Induction variable removal: locating two identifiers which remain in lockstep within a loop (i.e. every time variable 'a' is increased/decreased by 'x', 'b' is increased/decreased by 'y'). Once the induction variables have been detected, any multiplication operations can be reduced in strength by replacing them with addition/subtraction operations. An initialisation statement is placed in a preheader block

(12) Loop unswitching: for loops which contain only a conditional statement, the conditional is placed outside the loop and the branches enclosed within loops

(13) Conditional fusion: merging of conditional statements which are triggered by the same state

(14) Conditional parallelisation: placing each branch of a conditional in parallel. This is only effective when the conditional is a single branching if-then-else statement, or a multiple if-then-else construct with no default branch

(15) Subroutine inline expansion: replacing subroutine calls with the actual functionality of the unit being referenced. This is effective mainly when optimising with respect to delay, or when optimising for area and the subroutine in question is small or makes relatively few calls itself

(16) Merging subroutine calls: replacing subroutine calls which have identical input parameters with one call, and a series of assignment statements to the output parameters

(17) Component flattening: evaluation of any component calls within an architecture. The transformation has no effect whatsoever with respect to any of the stated optimisation goals, but it does empower a wider range of subsequent transformations

(18) Merging component calls: the same type of transformation as merging subroutine calls, applied to component references

(19) Grouping dependent statements: reordering statements so that dependent statements occur in sequence

(20) Bit range bound testing: checking that declared bit vectors make full use of the range specified and, if not, adjusting the range accordingly

(21) Library dependent replacement: this optimisation is system dependent. It replaces the designer's own type declarations and function routines with those designed specifically for MOODS

(22) Register minimisation through common expressions: replacing references to a variable with the expression represented by it. This and transformation 4 (common subexpression removal) have mutually opposed effects and should not be applied together. However, it will reduce register usage and decrease the delay of the design, with the possibility of area being reduced by subsequent dead code removal

(23) Pre-emptive processing: evaluating the possible targets of a conditional, in parallel, prior to calculating the conditional expression. The required result can then be assigned using copy statements
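The interplay between loop unwinding (transform 7) and the peephole transforms can be illustrated with a sketch. The textual statement representation and the `unwind` helper below are hypothetical; a real implementation would substitute the index at the parse-tree level rather than by naive string replacement.

```python
# Sketch of loop unwinding: a fixed-iteration loop body, parameterised by
# the index variable, is expanded into straight-line statements in which
# the index is a known constant -- exactly the situation in which the
# peephole transforms can then simplify the generated hardware.
# Caveat: str.replace is used for brevity; it would clobber identifiers
# containing the index letter, so real substitution needs tokenisation.
def unwind(index_var, bounds, body):
    """body: list of statement templates mentioning the index variable."""
    lo, hi = bounds
    return [stmt.replace(index_var, str(i))
            for i in range(lo, hi + 1)
            for stmt in body]

# for i in 0 to 2 loop  acc := acc + v(i) * i;  end loop;
print(unwind('i', (0, 2), ['acc := acc + v(i) * i;']))
```

After unwinding, the i = 0 statement contains a multiplication by a constant 0 and can be removed entirely by the peephole pass, and the remaining multiplications can be folded or reduced in strength.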

Thus, to quantify the effects of the source level optimisation, it is necessary to run the output of the source optimiser through MOODS, in order to generate quantitative metrics for the quality of the design. However, although MOODS is an optimising compiler, it can be directed to translate behaviour to structure on an 'as is' basis. This allows us the closest possible quantitative look at the output from the source optimiser.

Fig. 6 Transform interactions
A-B: A and B are mutually exclusive; their effects counteract each other
A → B: A must be applied before B to achieve maximum benefit

The possible routes for a VHDL description through the overall synthesis suite are shown in Table 2.

Table 2: Optimisation path labels

Source optimisation    Datapath optimisation    Path label
with respect to:       with respect to:
area                   NOTHING                  AN
area                   area                     AA
area                   delay                    AD
delay                  NOTHING                  DN
delay                  area                     DA
delay                  delay                    DD
DISABLED               NOTHING                  NN
DISABLED               area                     NA
DISABLED               delay                    ND

Table 3: Quantitative data for Fig. 7

Path label   Functional units   Execution time, s   Register count   Control states
NN           32                 -                   92               43
NA           22                 3.05                10               19
ND           24                 3.05                11               22
DN           79                 -                   73               202
DA           9                  121.43              8                71
DD           10                 65.15               7                68
AN           24                 -                   40               85
AA           18                 2.23                9                14
AD           19                 2.23                10               18

4.1 The Kalman filter
The Kalman filter is regarded as a primitive computational building block in many modern control theory applications. The goal of the Kalman filter is to predict a system state from a set of measurements of smaller dimensionality than the state vector [23].


The initial behavioural specification takes 111 lines of VHDL, and generates 32 functional units. The effects of the various optimisation paths for the filter are shown in Fig. 7. (The AD and DA paths might appear rather pointless - they imply contradictory goals for the two stages of optimisation - but they are included for the sake of completeness.)

Fig. 7 Design space trajectories for optimisation of Kalman filter
Path labels are described in Table 2; design data are tabulated in Table 3

Fig. 8 Design space trajectories for optimisation of differential heat release algorithm
Path labels are described in Table 2; design data are tabulated in Table 4

The results show that optimisation at the source level is clearly a worthwhile step in the overall optimisation process. The design optimised at the source level alone shows significant improvements over the NN optimised design, and the two optimisers (source level and composite graph) operating in cascade show greater benefits still.



What is unexpected from Fig. 7 is that optimisation with respect to area at source level followed by optimisation with respect to delay at composite graph level produces a design which has less delay than any other combination. This is completely counterintuitive, but the Kalman filter is not the only design to exhibit this effect. A detailed tabulation of the metrics of Fig. 7 (functional units, execution time, design register count and controller size) is given in Table 3.

This notwithstanding, the execution speed of the system is sufficiently fast that all eight combinations of optimisation may readily be tried for each design. (The runtimes shown in Fig. 7 are taken from a 486 50MHz PC running Linux.)

4.2 The differential heat release algorithm
The differential heat release algorithm models the heat release within an internal combustion engine. It consists of about 100 lines of code, in the form of a single process. Control features and data structures include a number of while loops and a bit vector array.

The impact of the source preprocessor on the design is clear from Fig. 8; the results show a decrease of between 35 and 42% for area and 69-85% for delay (as well as a significant decrease in datapath optimisation runtime). Quantitative details are presented in Table 4.

Table 4: Quantitative data for Fig. 8

Path label   Functional units   Execution time, s   Register count   Control states
NN           428                -                   238              1378
NA           424                45.65               222              1361
ND           415                36.18               220              1329
DN           27                 -                   38               99
DA           13                 3.07                7                19
DD           14                 2.96                6                20
AN           26                 -                   36               98
AA           13                 2.89                6                18
AD           14                 2.92                6                22

5 Final remarks

Figs. 7 and 8 show the interactions of the two optimisation techniques for different user goals. The symbiosis of the two techniques is evident from the juxtaposition of the resultant designs in the two-dimensional design space shown. In all cases, sensible application of the optimisations (i.e. the AA and DD designs) produces significantly superior designs.

The optimisation of synthesised digital design is a crucial step in the overall process, as it can reduce overall area and delay by factors of up to an order of magnitude. Optimisation of the composite graph has been shown to produce very efficient optimisations, and the inclusion of the technique described here produces further gains, at relatively little computational cost.

Further work is being undertaken on more sophisticated source level transforms that capitalise on aspects of the source description to hardware mapping. These will be reported in a later paper.

6 Acknowledgment

The work described in this paper was funded by the Engineering and Physical Sciences Research Council.

7 References

1 AHO, A.V., SETHI, R., and ULLMAN, J.: 'Compilers: principles, techniques and tools' (Addison-Wesley, 1986)

2 WULF, W.M., JOHNSSON, R.K., WEINSTOCK, C.B., HOBBS, S.O., and GESCHKE, C.M.: 'The design of an optimising compiler' (American Elsevier, New York, 1975)

3 BOYLE, D., MUNDY, P., and SPENCE, T.M.: 'Optimisation and code for several machines', IBM J., 1980, 24, (6), pp. 677-682

4 WALKER, R.A., and THOMAS, T.E.: 'Behavioural level transformation in the CMU-DA system'. Proceedings of the 20th IEEE Design Automation Conference, 1983, pp. 788-789

5 TANENBAUM, A.S., VAN STAVEREN, H., and STEVENSON, J.W.: 'Using peephole optimisation on intermediate code', ACM Trans. Program. Lang. Syst., 1982, 4, (1), pp. 21-36

6 LOWRY, E.S., and MEDLOCK, C.W.: 'Object code optimisation', Commun. ACM, 1969, 12, pp. 159-166

7 PADUA, S.D.: 'Issues in the compile time optimisation of parallel programs'. Proceedings of the International Conference on Parallel Processing, II, 1990

8 FERRANTE, J., and MACE, M.: 'On linearizing parallel code'. ACM, 1984, pp. 179-189

9 FERRANTE, J., MACE, M., and SIMONS, B.: 'Generating sequential code from parallel code'. ACM, 1988, pp. 582-593

10 GRUNWALD, D., and SRINIVASAN, S.: 'Data flow equations for explicitly parallel programs'. 4th ACM PPoPP, 1993, pp. 159-167

11 MASTICOLA, S.P., and RYDER, B.G.: 'Non-concurrency analysis'. 4th ACM PPoPP, 1993, pp. 129-137

12 SHASHA, D., and SNIR, M.: 'Efficient and correct execution of parallel programs that share memory', ACM Trans. Program. Lang. Syst., 1988, 10, (2), pp. 282-312

13 BHASKER, J.: 'Implementation of an optimising compiler for VHDL', SIGPLAN Not., 1988, 23, pp. 92-103

14 RUSHTON, A.: 'VHDL for logic synthesis' (McGraw-Hill, 1995)

15 BAKER, K.R., CURRIE, A.J., and NICHOLS, K.G.: 'Multiple objective optimisation in a behavioural synthesis system', IEE Proc. G, 1993, 140, (4), pp. 253-260

16 BAKER, K.R., BROWN, A.D., and CURRIE, A.J.: 'Optimisation efficiency in behavioural synthesis', IEE Proc. Circuits Devices Syst., 1994, 141, (5), pp. 399-406

17 KIRKPATRICK, S.: 'Optimisation by simulated annealing: quantitative studies', J. Stat. Phys., 1984, 34, (5/6), pp. 975-986

18 KIRKPATRICK, S., GELATT, C.D., and VECCHI, M.P.: 'Optimisation by simulated annealing', Science, 1983, 220, (4598), pp. 671-680

19 BAKER, K.R.: 'Multiple objective optimisation of data and control paths in a behavioural silicon compiler'. PhD thesis, University of Southampton, 1992

20 VEMURI, R., ROY, J., MAMTORA, P., et al.: 'Benchmarks for high level synthesis'. Electrical and Computer Engineering Department, University of Cincinnati, June 1991

21 PANDA, P.R., and DUTT, N.: '1995 high level synthesis design repository'. University of California, Irvine, February 1995

22 MAYO, R.: 'High level synthesis benchmarks', 1991

23 DEMICHELI, G., KU, D., MAILHOT, F., and TRUONG, T.: 'The Olympus synthesis system', IEEE Des. Test Comput., 1990, 7, pp. 37-53


