
Optimising Transformations for Hardware Compilation

Ashley Brown, Department of Computing, Imperial College London

Final Project Presentation, 21st June 2005

Contributions

• Transformation language for restructuring and optimisation of Handel-C supporting data-integrity conditions.
• Prototype transformation engine for the language.
• Automatic transformations giving a 35-70% reduction in execution time.
• An insight into the interaction of transformations: variability between platforms, difficulty of prediction.

Introduction

21st June 2005 Ashley Brown # 3

What would we like to do?

• Take an algorithm written in C.
• Generate an efficient hardware design, run it on an FPGA.
• Fast design cycle, easy to maintain code.
• C programmers should be able to create fast hardware!

Background: Handel-C

• C-based programming language for digital system design.

• One clock-cycle per statement.

• Explicit parallelism.

• Compiler generates hardware design from Handel-C source.

    while (j != 3) {
        par {
            t0 = aa[0] * bb[0];
            t1 = aa[1] * bb[1];
        }
        par {
            cc[i][j] = t0 + t1;
            j++;
        }
    }

Handel-C code example.
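
The timing rules above can be sketched with a toy cycle counter. This is an illustration only, not part of Handel-C or Cobble: it assumes each assignment costs one clock cycle and that a par block takes as long as its longest branch. The tuple encoding and the `cycles` function are hypothetical names introduced for this sketch.

```python
# Toy cycle-count model (an illustration, not part of Cobble or Handel-C).
# A node is either a string (one assignment, one cycle),
# ("seq", [...]) for sequential composition, or ("par", [...]) for a par block.

def cycles(node):
    """Return the number of clock cycles the node takes in this model."""
    if isinstance(node, str):
        return 1                       # one clock cycle per statement
    kind, children = node
    counts = [cycles(child) for child in children]
    if kind == "seq":
        return sum(counts)             # statements run one after another
    if kind == "par":
        return max(counts, default=0)  # branches run side by side
    raise ValueError(f"unknown node kind: {kind}")

# Loop body from the example above: two par blocks in sequence.
loop_body = ("seq", [
    ("par", ["t0 = aa[0] * bb[0];", "t1 = aa[1] * bb[1];"]),
    ("par", ["cc[i][j] = t0 + t1;", "j++;"]),
])

# The same four statements written without any par blocks.
sequential_body = ("seq", [
    "t0 = aa[0] * bb[0];", "t1 = aa[1] * bb[1];",
    "cc[i][j] = t0 + t1;", "j++;",
])

print(cycles(loop_body))        # 2
print(cycles(sequential_body))  # 4
```

Under these assumptions the parallel loop body takes two cycles per iteration where the sequential version takes four.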

Problems

• Software programmers: bad Handel-C, poor hardware.
  – No exploitation of statement-level parallelism.
  – Long expressions.
  – Lots of for loops!

• Experienced Handel-C designers: good hardware, hard to read code.
  – Trickery to reduce clock cycles, increase clock rate.

• Finding the “optimal” solution is not easy.
  – Optimisation effectiveness depends on the target architecture (see the results later!)

Solutions

• Restructure Handel-C code to optimise.
  – Can parallelise if desired.
  – Duplicate hardware if necessary.

• Apply transformations to the original source, leaving it intact.
  – The original readable description is still available.
  – A more efficient version is used for hardware generation.

• Allow the user to define custom transformations with a transformation language.

• Generate a whole design-space of solutions, with different optimisations.

Current Solutions

• ROSE, Stratego, CTT.

• CTT has straightforward syntax.
  – Others are more complicated, not intuitive.

• Stratego supports strategies.
  – Strategies in the hardware world are difficult to decide.
  – Need a different strategy for each architecture.

• Haydn-C: restructuring of code similar to Handel-C.
  – But not user-specified transformations.

What’s New?

• Previous work with user-specified transformations has been:
  – For software-based C.
  – Aimed at parallelising/optimising for microprocessors.

• Can’t duplicate microprocessor hardware on the fly – it’s either there or not.
  We can duplicate hardware, pipeline – FASTER DESIGN!

• Previous work on hardware language transformations does not allow the user to describe transformations (Haydn-C).
  We do – the user can target their code explicitly.

• Exploring an entire design-space is usually done at the hardware level, not high-level language (although not always, e.g. ASC).
  We generate a full design-space – find *the* best solution.

The Transformation Language

Cobble-CML

• Cobble: compiler framework for Handel-C.

• CML: partially defined proposal for a transformation language for Cobble, builds on CTT.

• Cobble-CML: Our solution.

    custom_transform {
        pattern {
            0 * expr(0)
        }
        generate {
            0
        }
    }

0-constant elimination defined in original CML.

Why choose CML?

• Familiar syntax to Handel-C users.
• Only partially defined, but showed potential.
• Problems:
  – No data flow conditions – can’t check that transformations won’t destroy data integrity.
  – Transformations don’t have names.

Changes to CML

• New conditions field, data integrity conditions – automatic parallelisation not safe without it.
• Naming of transformations.
• Wildcard matches named rather than numbered.
• Conditions allow more powerful transformations.

    transform zero_elim {
        pattern {
            cmlexpr(l) * cmlexpr(r)
        }
        generate {
            0
        }
        conditions {
            eval(cmlexpr(l) == 0 || cmlexpr(r) == 0);
        }
    }

0-constant elimination defined in CML.
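
As a rough illustration of what the zero_elim transform does, the sketch below applies the same rewrite to a small expression tree. It is not the Cobble-CML engine: the tuple encoding is hypothetical, and it assumes the eval condition can only be decided when an operand is a literal constant.

```python
# Expressions are ints, variable names (str), or (op, left, right) tuples.

def const_value(expr):
    """Return the integer value of expr if it is a literal, else None."""
    return expr if isinstance(expr, int) else None

def zero_elim(expr):
    """Rewrite l * r to 0 wherever l or r is the literal 0, working bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = zero_elim(left), zero_elim(right)
    # Mirrors: conditions { eval(cmlexpr(l) == 0 || cmlexpr(r) == 0); }
    if op == "*" and (const_value(left) == 0 or const_value(right) == 0):
        return 0
    return (op, left, right)

print(zero_elim(("+", "x", ("*", "y", 0))))   # ('+', 'x', 0)
print(zero_elim(("*", ("*", 0, "a"), "b")))   # 0
print(zero_elim(("*", "a", "b")))             # ('*', 'a', 'b')
```

Because the condition is checked after the children are rewritten, a zero produced deeper in the tree can trigger the rewrite again further up, as in the second call.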

Basic Components

    // 1 * x = x
    always transform std_times1_elim {
        pattern {
            1 * cmlexpr(operand)
        }
        generate {
            cmlexpr(operand)
        }
    }

• CML transformations are defined within transform blocks.
• The optional always keyword indicates that this transformation should always be applied where it can.
• Each transformation can have a name to identify it for reporting.
• The pattern section describes the format of the code to match for this transformation.
• The generate section describes the code which should replace the pattern.
• Wildcards, such as cmlexpr, allow a pattern to be matched and substituted into the new tree.

Wildcard matching:
• cmlexpr - matches any expression
• cmlstmt - matches any statement
• cmlstmtlist - matches a list of statements

Ensuring Data Integrity

• Three types of condition are defined to ensure data integrity:
  – Data-flow sets.
  – Expression evaluation.
  – Constant validation.

• Transformations have a conditions section to define these.

Data Dependencies

• Can’t modify source trees at will (we could … but we shouldn’t).
• Ideal: full data-dependency analysis.
• We can get away with less.
• Solution: Data-flow set manipulation.

Data Dependencies

    Statement                          defs     uses
    x = 1;                             x        (none)
    x = a + b;                         x        a, b
    par { x = a + b; y = c + d; }      x, y     a, b, c, d

Data Dependencies

    Symbol             Meaning
    ==                 Set equality comparison.
    !=                 Set inequality comparison.
    {a, b}             A set containing a and b.
    {}                 The empty set.
    defs(statement)    The set of variables statement assigns to.
    uses(statement)    The set of variables statement uses.
    &                  Set intersection.
    |                  Set union.
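
A minimal sketch of how defs and uses could be computed, using Python sets for the set operators listed above. The `Assign` and `Par` classes are hypothetical stand-ins for AST nodes and are not taken from Cobble; a par block is assumed to define and use the union of what its statements define and use.

```python
from typing import NamedTuple

class Assign(NamedTuple):
    target: str        # variable written by the statement
    operands: tuple    # variables read by the right-hand side

class Par(NamedTuple):
    body: tuple        # statements executed in parallel

def defs(stmt):
    """Set of variables the statement assigns to."""
    if isinstance(stmt, Par):
        return set().union(*(defs(s) for s in stmt.body))
    return {stmt.target}

def uses(stmt):
    """Set of variables the statement reads."""
    if isinstance(stmt, Par):
        return set().union(*(uses(s) for s in stmt.body))
    return set(stmt.operands)

s1 = Assign("x", ())                       # x = 1;
s2 = Assign("x", ("a", "b"))               # x = a + b;
s3 = Par((Assign("x", ("a", "b")),         # par { x = a + b; y = c + d; }
          Assign("y", ("c", "d"))))

for stmt in (s1, s2, s3):
    print(sorted(defs(stmt)), sorted(uses(stmt)))

# Python's & and | on sets correspond to the intersection and union symbols.
print(defs(s2) & defs(s3) == {"x"})
```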

Worked Matching Example

    transform auto_par {
        pattern {
            cmlstmtlist(preamble);
            cmlstmt(par1);
            cmlstmt(par2);
            cmlstmtlist(postamble);
        }
        generate {
            cmlstmtlist(preamble);
            par {
                cmlstmt(par1);
                cmlstmt(par2);
            }
            cmlstmtlist(postamble);
        }
        conditions {
            // don't assign to the same place
            defs(cmlstmt(par1);) & defs(cmlstmt(par2);) == {};
            // second statement not waiting on first
            defs(cmlstmt(par1);) & uses(cmlstmt(par2);) == {};
        }
    }

Code to Match

    q = a << 1;
    qp = q + 1;
    qm = q - 1;

Match Option #1

Code to Match

    q = a << 1;
    qp = q + 1;
    qm = q - 1;

    transform auto_par {
        pattern {
            cmlstmtlist(preamble);
            cmlstmt(par1);
            cmlstmt(par2);
            cmlstmtlist(postamble);
        }
    }

Wildcard Assignment

    preamble     empty
    par1         q = a << 1;
    par2         qp = q + 1;
    postamble    qm = q - 1;

Checking the conditions

    defs(par1) & defs(par2)  =  { q } & { qp }  =  {}       holds
    defs(par1) & uses(par2)  =  { q } & { q }   =  { q }    fails

Code that would be generated without the check:

    par {
        q = a << 1;
        qp = q + 1;
    }
    qm = q - 1;

Disaster if we did not check! In the par block qp would read q in the same cycle that q is assigned, so it would see the old value.

Match Option #2

Code to Match

    q = a << 1;
    qp = q + 1;
    qm = q - 1;

    transform auto_par {
        pattern {
            cmlstmtlist(preamble);
            cmlstmt(par1);
            cmlstmt(par2);
            cmlstmtlist(postamble);
        }
    }

Wildcard Assignment

    preamble     q = a << 1;
    par1         qp = q + 1;
    par2         qm = q - 1;
    postamble    empty

Checking the conditions

    defs(par1) & defs(par2)  =  { qp } & { qm }  =  {}    holds
    defs(par1) & uses(par2)  =  { qp } & { q }   =  {}    holds

Both conditions hold, so this match is accepted and the generate section gives:

    q = a << 1;
    par {
        qp = q + 1;
        qm = q - 1;
    }
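
The two match options can be replayed with a short sketch that tries every adjacent pair of statements as par1 and par2 and evaluates the two auto_par conditions. This is only an illustration of the check, not the engine's matching algorithm; the statement encoding (text, defs, uses) is a hypothetical simplification.

```python
# Each statement is (source text, defs set, uses set).
stmts = [
    ("q = a << 1;", {"q"},  {"a"}),
    ("qp = q + 1;", {"qp"}, {"q"}),
    ("qm = q - 1;", {"qm"}, {"q"}),
]

def conditions_hold(par1, par2):
    """Evaluate the two conditions of the auto_par transform."""
    _, defs1, _ = par1
    _, defs2, uses2 = par2
    no_shared_target = not (defs1 & defs2)   # don't assign to the same place
    no_dependency = not (defs1 & uses2)      # second statement not waiting on first
    return no_shared_target and no_dependency

def auto_par_matches(statements):
    """Return (index of par1, accepted?) for every adjacent pair."""
    return [(i, conditions_hold(statements[i], statements[i + 1]))
            for i in range(len(statements) - 1)]

print(auto_par_matches(stmts))  # [(0, False), (1, True)]
```

Index 0 corresponds to match option #1, which is rejected, and index 1 to match option #2, which is accepted.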

The Transformation Engine

Integrating with Cobble

Tree Matching

Transformation

    pattern {
        0 + cmlexpr(a)
    }
    generate {
        cmlexpr(a)
    }

Code

    b = 5*(0+1)
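
A minimal sketch of tree matching with a wildcard, applied to the code above. It is not the Cobble implementation: trees are encoded as hypothetical (operator, operand, operand) tuples, `Wildcard` plays the role of cmlexpr, and a wildcard that appears twice in one pattern is not checked for consistency.

```python
class Wildcard:
    """Stands in for cmlexpr(name): matches any expression subtree."""
    def __init__(self, name):
        self.name = name

def match(pattern, tree, bindings):
    """Return True if tree has the shape of pattern, recording wildcard bindings."""
    if isinstance(pattern, Wildcard):
        bindings[pattern.name] = tree
        return True
    if isinstance(pattern, tuple):
        return (isinstance(tree, tuple) and len(tree) == len(pattern)
                and all(match(p, t, bindings) for p, t in zip(pattern, tree)))
    return pattern == tree

def substitute(template, bindings):
    """Build the replacement tree, filling wildcards from the bindings."""
    if isinstance(template, Wildcard):
        return bindings[template.name]
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return template

def rewrite(tree, pattern, template):
    """Apply the rule at every node of the tree, children first."""
    if isinstance(tree, tuple):
        tree = tuple(rewrite(child, pattern, template) for child in tree)
    bindings = {}
    if match(pattern, tree, bindings):
        return substitute(template, bindings)
    return tree

pattern = ("+", 0, Wildcard("a"))     # 0 + cmlexpr(a)
template = Wildcard("a")              # cmlexpr(a)
code = ("=", "b", ("*", 5, ("+", 0, 1)))   # b = 5*(0+1)

print(rewrite(code, pattern, template))    # ('=', 'b', ('*', 5, 1))
```

Only the subtree 0+1 matches the pattern, so the result corresponds to b = 5*1.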


Just Handel-C?

• No need to limit to Handel-C.
• Tree-matching algorithm will work with any compatible ASTs.
• Any language we can turn into a Handel-C AST can be used.
• Automatic parallelisation: source language need not support it explicitly.

Factors in Hardware Design

• Speed
• Area
• Power

Design-Space Exploration

• Difficult to decide which transformation is best.
• Don’t guess, produce several solutions.
• Branch the AST whenever a transformation is applied.
  – In-place branches: small AST.
  – Propagate branches when no more transformations can be applied.
  – Repeat transformation process on each new solution.

Design-Space Exploration

Transform, creating a branch point.

Design-Space Exploration

Propagate branches to root – create several distinct solutions.
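
The idea of producing every solution rather than guessing can be sketched as a worklist search. This is a simplification under stated assumptions: programs are plain strings, rules are toy textual rewrites, and each variant is stored as a whole copy instead of the in-place branch nodes described above. The names `explore` and `replace_each` are hypothetical.

```python
from collections import deque

def explore(start, rules):
    """Return every program reachable from start by applying rules in any order."""
    seen = {start}
    queue = deque([start])
    while queue:
        program = queue.popleft()
        for rule in rules:
            for variant in rule(program):   # one variant per place the rule applies
                if variant not in seen:
                    seen.add(variant)
                    queue.append(variant)   # repeat the process on each new solution
    return seen

def replace_each(old, new):
    """Build a toy rule that rewrites one occurrence of old at a time."""
    def rule(program):
        variants = []
        start = program.find(old)
        while start != -1:
            variants.append(program[:start] + new + program[start + len(old):])
            start = program.find(old, start + 1)
        return variants
    return rule

rules = [replace_each("(0 + y)", "y"), replace_each("(x * 1)", "x")]
solutions = explore("(0 + y) + (x * 1)", rules)
for solution in sorted(solutions):
    print(solution)
```

Two independent rewrites give four solutions: the original, each rewrite alone, and both together. The number of solutions grows quickly as more transformations apply, which is why the in-place branch representation matters for keeping the AST small.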

Test Transformations

• Generic – applicable to all programs:
  – autopar – parallelise sequential statements with no dependencies.
  – fortowhile – convert for loops into corresponding while loops.
  – lttoeq – convert for loops with < in the loop condition to ==.

• Application specific – targeted at the test programs:
  – matrixpar – parallelisation of an inner loop.

More Transformations

• Various mathematical rearrangements:
  – Factorise to reduce multiplies.
  – Remove *1, *0, +0 etc.

• More interesting:
  – Dead-code elimination (remember data conditions!)
  – Variable replacement: remove dependencies in code by replacing variables with the expressions assigned to them last (again, remember data conditions!)

Results

Live Demo

• We take two blocks of sequential division code, one parallelised, one not.

• This should be a live demo, unless something breaks!

Hand-coded Parallel

Hand-coded by Matt Aubury, VP Engineering of Celoxica Ltd and former project student of Wayne Luk.

Pure Sequential

Same code, modified for Cobble but with no parallelism.

Tool-Generated

This should look familiar!

Execution Time Improvement

[Figure: Execution Time (s), axis 0.00 to 9.00, against Optimisation Applied (Optimisations are Cumulative): base, autopar, fortowhile, lttoeq, matrixpar. Series: Stratix, Stratix DSP, Cyclone, Virtex II Mult, Spartan 3.]

lttoeq increases fmax on Altera, but decreases it on Xilinx.

Platform Variance

[Figure: Stratix/Cyclone Area/Execution Time Comparison. Execution Time (us), axis 0.00 to 5.00, against Altera Logic Elements, axis 3000 to 6000. Points: base, autopar, fortowhile, lttoeq, matrixpar.]

Platform Variance

[Figure: Virtex II Area/Execution Time Comparison. Execution Time (us), axis 0.00 to 6.00, against Xilinx Slices, axis 2,600 to 3,000. Points: base, autopar, fortowhile, lttoeq, matrixpar.]

Cycle Count Improvements

    Program       matmultinf   2dedge   aes     hist
    base          189          4686     1948    3333
    lttoeq        106          2818     1022    1794
    % Decrease    44%          40%      48%     46%
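
The percentage row can be reproduced from the two cycle-count rows with a few lines of arithmetic. The program names are copied as they appear in the table; the `percent_decrease` helper is introduced only for this check.

```python
# Cycle counts copied from the table above.
base   = {"matmultinf": 189, "2dedge": 4686, "aes": 1948, "hist": 3333}
lttoeq = {"matmultinf": 106, "2dedge": 2818, "aes": 1022, "hist": 1794}

def percent_decrease(before, after):
    """Reduction in cycle count as a percentage of the original count."""
    return 100.0 * (before - after) / before

decrease = {name: round(percent_decrease(base[name], lttoeq[name]))
            for name in base}
print(decrease)  # {'matmultinf': 44, '2dedge': 40, 'aes': 48, 'hist': 46}
```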

Design Space Exploration

[Figure: fmax, axis 97 to 105, against Code Version, axis 0 to 300. Labelled solutions: 139, 232, 239.]

Design Space Exploration

• Assume design with an fmax of 104 MHz, must match that.
• Many solutions matching.
  – We should consider other factors such as area, power or number of cycles.
• Being brief: look at solutions 139 and 232.
• Only partially parallelised. Solution with most parallelism (239) does not meet the fmax requirement.

Future Work

• Extensions to the language to allow additional matching.
• expr replicator, complex expression matching.
• Preservation of structure – e.g. a++; does not become a = a + 1;
• Heuristics for selecting transformations to apply.
• Genetic algorithms for transformation selection? “Breed” good transformation solutions.

Future Applications

• Aspect-oriented concepts: automatically inserting debugging signals.

• Power-signature-masking code to avoid attacks in cryptographic applications.

Conclusion

• Matching method can achieve good results on naïve C code.

• Targeting domain- or application-specific constructs can provide large performance gains at the expense of resources.

• Scope to produce a much more powerful system with changes to the transformation language, heuristics and more efficient algorithms.

Contributions

• The first transformation language for parallelising hardware languages with data integrity conditions.

• A prototype transformation engine for implementing the language.

• Automatic transformations capable of achieving a 35-70% reduction in execution time.

• An insight into the interaction of transformations, both with each other and with the platform their output runs on.

Questions

This presentation, the final report, outsourcing report and source code are available from: https://www.doc.ic.ac.uk/~awb01/project/

