Page 1: SOAP: Structural Optimization of Arithmetic Expressions for High …cas.ee.ic.ac.uk/people/gac1/pubs/XitongFPT13.pdf · 2013-10-24 · SOAP: Structural Optimization of Arithmetic

SOAP: Structural Optimization of Arithmetic Expressions for High-Level Synthesis

Xitong Gao, Samuel Bayliss, George A. Constantinides
Department of Electrical and Electronic Engineering

Imperial College London
London SW7 2AZ, United Kingdom

{xi.gao08, s.bayliss08, g.constantinides}@imperial.ac.uk

Abstract—This paper introduces SOAP, a new tool to automatically optimize the structure of arithmetic expressions for FPGA implementation as part of a high-level synthesis flow, taking into account axiomatic rules derived from real arithmetic, such as distributivity, associativity and others. We explicitly target an optimized area/accuracy trade-off, allowing arithmetic expressions to be automatically rewritten for this purpose. For the first time, we bring rigorous approaches from software static analysis, specifically formal semantics and abstract interpretation, to bear on source-to-source transformation for high-level synthesis. New abstract semantics are developed to generate a computable subset of equivalent expressions from an original expression. Using formal semantics, we calculate two objectives, the accuracy of computation and an estimate of resource utilization in FPGA. The optimization of these objectives produces a Pareto frontier consisting of a set of expressions. This gives the synthesis tool the flexibility to choose an implementation satisfying constraints on both accuracy and resource usage. We thus go beyond existing literature by not only optimizing the precision requirements of an implementation, but changing the structure of the implementation itself. Using our tool to optimize the structure of a variety of real-world and artificially generated examples in single precision, we improve either their accuracy or the resource utilization by up to 60%.

I. INTRODUCTION

The IEEE 754 standard [1] for floating-point computation is ubiquitous in computing machines. In practice, it is often neglected that floating-point computations almost always have roundoff errors. In fact, associativity and distributivity properties which we consider to be fundamental laws of real numbers no longer hold under floating-point arithmetic. This opens the possibility of using these rules to generate an expression equivalent to the original expression in real arithmetic, which could have better quality than the original when evaluated in floating-point computation.

By exploiting rules of equivalence in arithmetic, such as associativity (a + b) + c ≡ a + (b + c) and distributivity (a + b) × c ≡ a × c + b × c, it is possible to automatically generate different implementations of the same arithmetic expression. We optimize the structures of arithmetic expressions in terms of the following two quality metrics relevant to FPGA implementation: the resource usage when synthesized into circuits, and a bound on roundoff errors when evaluated. Our goal is the joint minimization of these two quality metrics. This optimization process provides a Pareto optimal set of implementations. For example, our tool discovered that with single-precision floating-point representation, if a ∈ [0.1, 0.2], then the expression (a + 1)² uses fewest resources when implemented in the form (a + 1) × (a + 1), but is most accurate when expanded into (((a × a) + a) + a) + 1. However, it turns out that a third alternative, ((1 + a) + a) + (a × a), is never desirable, because it is neither more accurate nor uses fewer resources than the other two possible structures. Our aim is to automatically detect and utilize such information to optimize the structure of expressions.

A naïve implementation of equivalent expression finding would be to explore all possible equivalent expressions to find optimal choices; however, this would result in combinatorial explosion [2]. For instance, in the worst case, the parsing of a simple summation of n variables could result in (2n − 1)!! = 1 × 3 × 5 × ··· × (2n − 1) distinct expressions [2], [3]. This is further complicated by distributivity, as (a + b)^k could expand into an expression with a summation of 2^k terms, each with k − 1 multiplications. Therefore, it would usually be infeasible to generate a complete set of equivalent expressions using the rules of equivalence, since an expression with a moderate number of terms will have a very large number of equivalent expressions. The methodology explained in this paper makes use of formal semantics as well as abstract interpretation [4] to significantly reduce the space and time requirements and produce a subset of the Pareto frontier.

In order to further increase the options available in the Pareto frontier, we introduce freedom in choosing mantissa widths for the evaluation of the expressions. Generally, as the precision of the evaluation increases, the utilization of resources increases for the same expression. This gives flexibility in the trade-off between resource usage and precision. Our approach and its associated tool, SOAP, allow high-level synthesis flows to automatically determine whether it is a better choice to rewrite an expression, or change its precision, in order to meet optimization goals.

The three contributions of this paper are:
1) Efficient methods for discovering equivalent structures of arithmetic expressions.
2) A semantics-based program analysis that allows joint reasoning about the resource usage and safe ranges of values and errors in floating-point computation of arithmetic expressions.
3) A tool which produces RTL implementations on the area-accuracy trade-off curve derived from structural optimization.

This paper is structured as follows. Section II discusses related existing work in high-level synthesis and the optimization of arithmetic expressions. We explain the basic concepts of semantics with abstract interpretation used in this paper in Section III. Using this, Section IV explains the concrete and abstract semantics for finding equivalent structure in arithmetic expressions, as well as the analysis of their resource usage estimates and bounds of errors. Section V gives an overview of the implementation details in our tool. Then we discuss the results of optimized example expressions in Section VI and end with concluding remarks in Section VII.

II. RELATED WORK

High-level synthesis (HLS) is the process of compiling a high-level representation of an application (usually in C, C++ or MATLAB) into a register-transfer-level (RTL) implementation for FPGA [5], [6]. HLS tools enable us to work in a high-level language, as opposed to facing labor-intensive tasks such as optimizing timing and designing control logic in the RTL implementation. This allows application designers to instead focus on the algorithmic and functional aspects of their implementation [5]. Another advantage of using HLS over traditional RTL tools is that a C description is smaller than a traditional RTL description by a factor of 10 [5], [7], which means HLS tools are in general more productive and less error-prone to work with. HLS tools benefit us in their ability to automatically search the design space with a reasonable design cost [7], and to explore a large number of trade-offs between performance, cost and power [8], which is generally much more difficult to achieve in RTL tools. HLS has received a resurgence of interest recently, particularly in the FPGA community. Xilinx now incorporates a sophisticated HLS flow into its Vivado design suite [9], and the open-source HLS tool LegUp [10] is gaining significant traction in the research community.

However, in both commercial and academic HLS tools, there is very little support for static analysis of numerical algorithms. LLVM-based HLS tools such as Vivado HLS and LegUp usually have some traditional static analysis-based optimization passes such as constant propagation, alias analysis, bitwidth reduction or even expression tree balancing to reduce latency for numerical algorithms. There are also academic tools that perform precision-performance trade-off by optimizing word-lengths of data paths [11]. However, there are currently no HLS tools that perform the trade-off optimization between accuracy and resource usage by varying the structure of arithmetic expressions.

Even in the software community, there are only a few existing techniques for optimizing expressions by transformation, none of which consider accuracy/run-time trade-offs. Darulova et al. [12] employ a metaheuristic technique. They use genetic programming to evolve the structure of arithmetic expressions into more accurate forms. However, there are several disadvantages with metaheuristics: convergence can only be proved empirically, and scalability is difficult to control, because there is no definitive method to decide how long the algorithm must run until it reaches a satisfactory goal. Hosangadi et al. [13] propose an algorithm for the factorization of polynomials to reduce addition and multiplication counts, but this method is only suitable for factorization and it is not possible to choose different optimization levels. Peymandoust et al. [14] present an approach that only deals with the factorization of polynomials in HLS using Gröbner bases. The shortcomings of this are its dependence on a set of library expressions [13] and the high computational complexity of Gröbner bases. The method proposed by Martel [15] is based on operational semantics with abstract interpretation, but even their depth-limited strategy is, in practice, at least exponentially complex. Finally, Ioualalen et al. [2] introduce the abstract interpretation of equivalent expressions, and create a polynomially sized structure to represent an exponential number of equivalent expressions related by rules of equivalence. However, it restricts itself to only a handful of these rules to avoid combinatorial explosion of the structure, and there are no options for tuning its optimization level.

Since none of the above captures the optimization of both accuracy and performance by restructuring arithmetic expressions, we base ourselves on the software work of Martel [15], but extend this work in the following ways. Firstly, we develop new hardware-appropriate semantics to analyze not only accuracy but also resource usage, seamlessly taking into account common subexpression elimination. Secondly, because we consider both resource usage and accuracy, we develop a novel multi-objective optimization approach to scalably construct the Pareto frontier in a hierarchical manner, allowing fast design exploration. Thirdly, equivalence finding is guided by prior knowledge on the bounds of the expression variables, as well as local Pareto frontiers of subexpressions while it is optimizing expression trees in a bottom-up approach, which allows us to reduce the complexity of finding equivalent expressions without sacrificing our ability to optimize expressions.

We begin with an introduction to formal semantics in the following section; later, in Section IV, we explain our approach by extending the semantics to reason about errors, resource usage and equivalent expressions.

III. ABSTRACT INTERPRETATION

This section introduces the basic concepts of formal semantics and abstract interpretation used in this paper. We illustrate these concepts by putting the familiar idea of interval arithmetic [16] in the framework of abstract interpretation. This is then further extended later in the paper to define a scalable analysis capturing ranges, errors and resource utilization. As an illustration, consider the following expression and its DFG:

(a + b) × (a + b)    (1)

We may wish to ask: if initially a and b are real numbers in the range of [0.2, 0.3] and [2, 3] respectively, what would be the outcome of evaluating this expression with real arithmetic? A straightforward approach is simulation. Evaluating the expression for a large quantity of inputs will produce a set of possible outputs of the expression. However, the simulation approach is unsafe, since there is an infinite number of possible real-valued inputs and it is infeasible to simulate for all. A better method might be to represent the possible values of a and b using ranges.

[Fig. 1. The DFG for the sample program: leaves a (box 1) and b (box 2) feed an adder (box 3), whose output drives both inputs of a multiplier (box 4).]

To compute the ranges of its output values, we could operate on ranges rather than values (note that the superscript ♯ denotes ranges). Assume that a♯_init = [0.2, 0.3] and b♯_init = [2, 3], which are the input ranges of a and b, and that A(l), where l ∈ {1, 2, 3, 4}, are the intervals of the outputs of the boxes labelled with l in the DFG. We extract the data flow from the DFG to produce the following set of equations:

A(1) = a♯_init    A(2) = b♯_init
A(3) = A(1) + A(2)    A(4) = A(3) × A(3)    (2)

For the equations above to make sense, addition and multiplication need to be defined on intervals. We may define the following interval operations:

[a, b] + [c, d] = [a + c, b + d]
[a, b] − [c, d] = [a − d, b − c]
[a, b] × [c, d] = [min(s), max(s)], where s = {a × c, a × d, b × c, b × d}    (3)

The solution to the set of equations (2) for A(4) is [4.84, 10.89], which represents a safe bound on the output at the end of program execution. Note that in actual execution of the program, the semantics represent the values of intermediate variables, which are real values. In our case, a set of real values forms the set of all possible values produced by our code. However, computing this set precisely is not, in general, a possible task. Instead, we use abstract interpretation based on intervals, which gives the abstract semantics of this program. Here, we have achieved a classical interval analysis by defining the meaning of addition and multiplication on abstract mathematical structures (in this case intervals) which capture a safe approximation of the original semantics of the program. Later, in Section IV, we further generalize the idea by defining the meaning of these operations on more complex abstract structures which allow us to scalably reason about the range, error, and area of FPGA implementations.
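This interval analysis can be sketched in a few lines of Python. The `Interval` class and its operator overloads are our own naming; the arithmetic is taken directly from (3):

```python
# A minimal sketch of the interval analysis of this section. The class
# name is our own; the arithmetic rules are those of (3).

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # [a, b] - [c, d] = [a - d, b - c]
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # [a, b] * [c, d] = [min(s), max(s)], s = {ac, ad, bc, bd}
        s = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(s), max(s))

# The data flow of Fig. 1: A(3) = A(1) + A(2), A(4) = A(3) * A(3)
a, b = Interval(0.2, 0.3), Interval(2, 3)
r = (a + b) * (a + b)
print(r.lo, r.hi)  # approximately 4.84 10.89, the solution of (2)
```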

IV. NOVEL SEMANTICS

A. Accuracy Analysis

We first introduce the concepts of the floating-point representation [1]. Any value v representable in floating-point with the standard exponent offset can be expressed in the format given by the following equation:

v = s × 2^(e − (2^(k−1) − 1)) × 1.m₁m₂m₃…mₚ    (4)

In (4), the bit s is the sign bit, the k-bit unsigned integer e is known as the exponent bits, and the p bits m₁m₂m₃…mₚ are the mantissa bits; here we use 1.m₁m₂m₃…mₚ to indicate a fixed-point number represented in unsigned binary format.
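As a concrete illustration of the fields in (4), the sketch below unpacks an IEEE 754 single-precision word (k = 8 exponent bits, p = 23 mantissa bits) and reassembles its value; the function name is ours, and the sign bit s is read in the conventional (−1)^s sense:

```python
import struct

# Hedged sketch: extract the s / e / m fields of (4) from a
# single-precision word (k = 8, p = 23) and rebuild the value.
K, P = 8, 23

def decode_single(v):
    bits = struct.unpack('>I', struct.pack('>f', v))[0]
    s = bits >> 31              # sign bit
    e = (bits >> P) & 0xFF      # k-bit unsigned exponent field
    m = bits & 0x7FFFFF         # mantissa bits m1 ... mp
    return s, e, m

s, e, m = decode_single(1.5)
# Reassemble v via (4), with the standard offset 2^(k-1) - 1 = 127:
v = (-1) ** s * 2.0 ** (e - (2 ** (K - 1) - 1)) * (1 + m / 2 ** P)
print(s, e, m, v)  # 0 127 4194304 1.5
```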

Because of the finite characteristic of the IEEE 754 floating-point format, it is not always possible to represent exact values with it. Computations in floating-point arithmetic often induce roundoff errors. Therefore, following Martel [15], we bound with ranges the values of floating-point calculations, as well as their roundoff errors. Our accuracy analysis determines the bounds of all possible outputs and their associated range of roundoff errors for expressions. For example, assume that a ∈ [0.2, 0.3] and b ∈ [2.3, 2.4]; it is possible to derive that in single-precision floating-point computation with rounding to the nearest, (a + b)² ∈ [6.24999857, 7.29000187], and the error caused by this computation is bounded by [−1.60634534 × 10⁻⁶, 1.60634534 × 10⁻⁶].

We employ the abstract error semantics for the calculation of errors described in [2], [15]. First we define the domain E♯ = IntervalF × Interval, where Interval and IntervalF respectively represent the set of real intervals, and the set of floating-point intervals (intervals exactly representable in floating-point arithmetic). The value (x♯, μ♯) ∈ E♯ represents a safe bound on floating-point values and the accumulated error represented as a range of real values. Then addition and multiplication can be defined for the semantics as in (5):

(x♯₁, μ♯₁) + (x♯₂, μ♯₂) = (↑♯◦(x♯₁ + x♯₂), μ♯₁ + μ♯₂ + ↓♯◦(x♯₁ + x♯₂))
(x♯₁, μ♯₁) − (x♯₂, μ♯₂) = (↑♯◦(x♯₁ − x♯₂), μ♯₁ − μ♯₂ + ↓♯◦(x♯₁ − x♯₂))
(x♯₁, μ♯₁) × (x♯₂, μ♯₂) = (↑♯◦(x♯₁ × x♯₂), x♯₁ × μ♯₂ + x♯₂ × μ♯₁ + μ♯₁ × μ♯₂ + ↓♯◦(x♯₁ × x♯₂))
for (x♯₁, μ♯₁), (x♯₂, μ♯₂) ∈ E♯    (5)

The addition, subtraction and multiplication of intervals follow the standard rules of interval arithmetic defined earlier in (3). In (5), the function ↓♯◦ : Interval → Interval determines the range of roundoff error due to the floating-point computation under one of the rounding modes ◦ ∈ {−∞, ∞, 0, ¬0, ∼}, which are round towards negative infinity, towards infinity, towards zero, away from zero, and towards the nearest floating-point value, respectively. For instance, ↓♯∼([a, b]) = [−z, z], where z = max(ulp(a), ulp(b))/2, which denotes the maximum rounding error that can occur for values within the range [a, b], and the unit in the last place (ulp) function ulp(x) [17] characterizes the distance between two adjacent floating-point values f₁ and f₂ satisfying f₁ ≤ x ≤ f₂ [18]. In our analysis, the function ulp is defined as:

ulp(x) = 2^(e(x) − (2^(k−1) − 1)) × 2^(−p)    (6)

where e(x) is the exponent of x, and k and p are the parameters of the floating-point format as defined in (4). The function ↑♯◦ : Interval → IntervalF computes the floating-point bound from a real bound, by rounding the infimum a and supremum b of the input interval [a, b].
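A sketch of ulp as in (6) for single precision (p = 23), together with the round-to-nearest bound ↓♯∼, can be written as follows. The function names are ours; `math.frexp` recovers the unbiased exponent of the 1.m form:

```python
import math

# Sketch of ulp per (6) for single precision (p = 23), and the
# round-to-nearest error bound [-z, z] described above.
P = 23

def ulp(x):
    _, exp = math.frexp(abs(x))     # x = m * 2**exp with m in [0.5, 1)
    return 2.0 ** (exp - 1) * 2.0 ** -P

def round_error(lo, hi):
    # The round-to-nearest error interval: z = max(ulp(lo), ulp(hi)) / 2
    z = max(ulp(lo), ulp(hi)) / 2
    return (-z, z)

print(ulp(1.0) == 2 ** -23)   # True: one single-precision ulp at 1.0
print(round_error(0.2, 0.3))  # (-2**-26, 2**-26), i.e. ±1/67108864
```

The value ±1/67108864 matches the error component of the cast of a ∈ [0.2, 0.3] shown later in the text.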

Expressions can be evaluated for their accuracy by the following method. Initially, the expression is parsed into a data flow graph (DFG). By way of illustration, the sample expression (a + b)² has the tree structure in Fig. 1. Then the exact ranges of values of a and b are converted into the abstract semantics using a cast operation as in (7):

cast(x♯) = (↑♯◦(x♯), ↓♯◦(x♯))    (7)

For example, for the variable a ∈ [0.2, 0.3] under single precision with rounding to nearest, cast([0.2, 0.3]) = ([0.200000003, 0.300000012], [−1/67108864, 1/67108864]).

After this, the propagation of bounds in the data flow graph is carried out as described in Section III, where the difference is that the abstract error semantics defined in (5) are used in lieu of the interval semantics. At the root of the tree (i.e. the exit of the DFG) we find the value of the accuracy analysis result for the expression.

In this paper, the function Error : Expr → E♯ is used to represent the analysis of evaluation accuracy, where Expr denotes the set of all expressions.
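To illustrate how (5) and (7) compose, the following sketch propagates (value interval, error interval) pairs through (a + b) × (a + b) for a ∈ [0.2, 0.3] and b ∈ [2, 3]. All names are ours; we model single precision (p = 23) but, for brevity, omit the ↑♯◦ rounding of value bounds (the Python doubles used here are fine-grained enough for illustration), so this is a sketch rather than the paper's exact analysis:

```python
import math

# Illustrative composition of cast (7) with the error semantics (5).
P = 23

def ulp(x):
    _, exp = math.frexp(abs(x))
    return 2.0 ** (exp - 1) * 2.0 ** -P

def round_err(iv):
    # Round-to-nearest error interval for results falling in iv
    z = max(ulp(iv[0]), ulp(iv[1])) / 2
    return (-z, z)

def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def imul(x, y):
    s = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(s), max(s))

def eadd(a, b):
    # (x1, u1) + (x2, u2) = (x1 + x2, u1 + u2 + round_err(x1 + x2))
    x = iadd(a[0], b[0])
    return (x, iadd(iadd(a[1], b[1]), round_err(x)))

def emul(a, b):
    # (x1, u1) * (x2, u2) = (x1 * x2,
    #     x1 * u2 + x2 * u1 + u1 * u2 + round_err(x1 * x2))
    x = imul(a[0], b[0])
    u = iadd(iadd(imul(a[0], b[1]), imul(b[0], a[1])),
             iadd(imul(a[1], b[1]), round_err(x)))
    return (x, u)

a = ((0.2, 0.3), round_err((0.2, 0.3)))  # cast of a, per (7)
b = ((2.0, 3.0), (0.0, 0.0))             # 2 and 3 are exactly representable
t = eadd(a, b)
value, error = emul(t, t)
print(value)  # roughly [4.84, 10.89]
print(error)  # a symmetric error interval on the order of ±1e-6
```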

B. Resource Usage Analysis

Here we define similar formal semantics which calculate an approximation to the FPGA resource usage of an expression, taking into account common subexpression elimination. This is important as, for example, rewriting a × b + a × c as a × (b + c) in the larger expression (a × b + a × c) + (a × b)² causes the common subexpression a × b to be no longer present in both terms. Our analysis must capture this.

The analysis proceeds by labelling subexpressions. Intuitively, the set of labels Label is used to assign unique labels to unique expressions, so it is possible to easily identify and reuse them. For convenience, let the function fresh : Expr → Label assign a distinct label to each expression or variable, where Expr is the set of all expressions. Before we introduce the labeling semantics, we define the environment λ : Label → Expr ∪ {⊥}, which is a function that maps labels to expressions, and Env denotes the set of such environments. A label l in the domain of λ ∈ Env that maps to ⊥ indicates that l does not map to an expression. An element (l, λ) ∈ Label × Env stands for the labeling scheme of an expression. Initially, we map all labels to ⊥; then, in the mapping λ, each leaf of an expression is assigned a unique label, and the unique label l is used to identify the leaf. That is, for the leaf variable or constant x:

(l, λ) = (fresh(x), [fresh(x) ↦ x])    (8)

This equation uses [fresh(x) ↦ x] to indicate an environment that maps the label fresh(x) to the expression x and all other labels to ⊥; in other words, if l = fresh(x) and l′ ≠ l, then λ(l) = x and λ(l′) = ⊥. For example, for the DFG in Fig. 1, we have for the variables a and b:

(l_a, λ_a) = (fresh(a), [fresh(a) ↦ a]) = (l₁, [l₁ ↦ a])
(l_b, λ_b) = (l₂, [l₂ ↦ b])    (9)

Then the environments are propagated in the flow direction of the DFG, using the following formulation of the labeling semantics:

(l_x, λ_x) ◦ (l_y, λ_y) = (l, (λ_x ⊕ λ_y)[l ↦ l_x ◦ l_y]), where l = fresh(l_x ◦ l_y), ◦ ∈ {+, −, ×}    (10)

Specifically, λ = λ_x ⊕ λ_y signifies that λ_y is used to update the mappings in λ_x where they do not already exist in λ_x, resulting in a new environment λ; and λ[l ↦ x] is a shorthand for λ ⊕ [l ↦ x]. As an example, with the expression in Fig. 1, using (9), recall that l₁ = l_a and l₂ = l_b; we derive for the subexpression a + b:

(l_{a+b}, λ_{a+b}) = (l_a, λ_a) + (l_b, λ_b)
= (l₃, (λ_a ⊕ λ_b)[l₃ ↦ l_a + l_b]), where l₃ = fresh(l_a + l_b)
= (l₃, [l₁ ↦ a] ⊕ [l₂ ↦ b] ⊕ [l₃ ↦ l₁ + l₂])
= (l₃, [l₁ ↦ a, l₂ ↦ b, l₃ ↦ l₁ + l₂])    (11)

Finally, for the full expression (a + b) × (a + b):

(l, λ) = (l_{a+b}, λ_{a+b}) × (l_{a+b}, λ_{a+b})
= (l₄, [l₁ ↦ a, l₂ ↦ b, l₃ ↦ l₁ + l₂, l₄ ↦ l₃ × l₃])    (12)

From the above derivation, it is clear that the semantics capture the reuse of subexpressions. The estimation of area is performed by counting, for an expression, the numbers of additions, subtractions and multiplications in the final labeling environment, then calculating the number of LUTs used to synthesize the expression. If the number of operators is n_◦, where ◦ ∈ {+, −, ×}, then the total number of LUTs for the expression is estimated as ∑_{◦ ∈ {+,−,×}} A_◦ n_◦, where the value A_◦ denotes the number of LUTs per ◦ operator, which is dependent on the type of the operator and the floating-point format used to generate the operator.

In the following sections, we use the function Area : Expr → ℕ to denote our resource usage analysis.
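The labeling semantics above can be sketched by memoising labels on structurally equal subexpressions, so that shared terms are counted once. The expression encoding (nested `(op, left, right)` tuples) and the per-operator LUT costs below are our own illustrative choices, not figures from the paper:

```python
# Sketch of the labeling semantics of Section IV-B: structurally equal
# subexpressions receive the same label, so shared terms cost nothing
# extra. Encoding and LUT costs are illustrative assumptions.

def label(expr, env, labels):
    # expr: a variable name, or a tuple (op, left, right)
    if expr in labels:
        return labels[expr]            # reuse: fresh() is memoised
    if isinstance(expr, str):
        l = len(labels)                # a fresh leaf label
    else:
        op, x, y = expr
        l = (op, label(x, env, labels), label(y, env, labels))
    labels[expr] = l
    env[l] = expr
    return l

def area(expr, cost):
    # Count each operator once per distinct label, as in sum(A_op * n_op)
    env, labels = {}, {}
    label(expr, env, labels)
    ops = [l[0] for l in env if isinstance(l, tuple)]
    return sum(cost[op] for op in ops)

# Hypothetical LUT counts per single-precision operator
cost = {'+': 100, '*': 200}

# (a + b) * (a + b): the shared a + b is labelled only once
e = ('*', ('+', 'a', 'b'), ('+', 'a', 'b'))
print(area(e, cost))  # 100 + 200 = 300, not 400
```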

C. Equivalent Expressions Analysis

In earlier sections, we introduced semantics that define additions and multiplications on intervals, then gradually transitioned to error semantics that compute bounds of values and errors, as well as labelling environments that allow common subexpression elimination, by defining arithmetic operations on these structures. In this section, we now take the leap from analyzing an expression for its quality to defining arithmetic operations on sets of equivalent expressions, and use these rules to discover equivalent expressions. Before this, it is necessary to formally define equivalent expressions and functions to discover them.

1) Discovering Equivalent Expressions: From an expression, a set of equivalent expressions can be discovered by the following inference system of equivalence relations ▷ ⊆ Expr × Expr. Let us define e₁, e₂, e₃ ∈ Expr, v₁, v₂, v₃ ∈ ℝ, and ◦ ∈ {+, ×}. First, the arithmetic rules are:

Associativity(◦): (e₁ ◦ e₂) ◦ e₃ ▷ e₁ ◦ (e₂ ◦ e₃)
Commutativity(◦): e₁ ◦ e₂ ▷ e₂ ◦ e₁
Distributivity: e₁ × (e₂ + e₃) ▷ e₁ × e₂ + e₁ × e₃
Distributivity′: e₁ × e₂ + e₁ × e₃ ▷ e₁ × (e₂ + e₃)

Secondly, the reduction rules are:

Identity(×): e₁ × 1 ▷ e₁
Identity(+): e₁ + 0 ▷ e₁
ZeroProp: e₁ × 0 ▷ 0
ConstProp(◦): v₁ ◦ v₂ ▷ v₃, where v₃ = v₁ ◦ v₂

The ConstProp rule states that if an expression is a summation/multiplication of two values, then it can simply be evaluated to produce the result. Finally, the following rule allows structural induction on expression trees, i.e. it is possible to derive a + (b + c) ▷ a + (c + b) from b + c ▷ c + b:

Tree(◦): from e₁ ▷ e₂, derive e₃ ◦ e₁ ▷ e₃ ◦ e₂    (13)

It is possible to transform the structure of expressions using these above-mentioned inference rules. We define the function I : P(Expr) → P(Expr), where P(Expr) denotes the power set of Expr, which generates a (possibly larger) set of equivalent expressions from an initial set of equivalent expressions by one step of (13), i.e. I(ε) = {e′ ∈ Expr | e ▷ e′ ∧ e ∈ ε}, where ε is a set of equivalent expressions. From this, we may note that the set of equivalent expressions generated by taking the union of any number of steps of (13) of ε is the transitive closure I*(ε), as given by the following formula, where Iⁱ(ε) = I(Iⁱ⁻¹(ε)) and I⁰(ε) = ε:

I*(ε) = ⋃_{i=0}^{∞} Iⁱ(ε)    (14)

We say that e₁ is equivalent to e₂ if and only if e₁ ∈ I*({e}) and e₂ ∈ I*({e}) for some expression e. In general, because of combinatorial explosion, it is not possible to derive the full set of equivalent expressions by rules of equivalence, which motivates us to develop scalable methods that execute fast enough even with large expressions.
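A one-step transform I and its closure I* can be sketched as follows, restricted to associativity and commutativity of a single operator for brevity; the tuple encoding `(op, left, right)` is our own, not the paper's:

```python
# Sketch of one step of (13) and the closure I* of (14), with only
# the Associativity, Commutativity and Tree rules, for one operator.

def step(e):
    """All expressions reachable from e by one rule application."""
    out = set()
    if not isinstance(e, tuple):
        return out
    op, x, y = e
    out.add((op, y, x))                        # Commutativity
    if isinstance(x, tuple) and x[0] == op:
        out.add((op, x[1], (op, x[2], y)))     # Associativity
    for x2 in step(x):                         # Tree rule, left child
        out.add((op, x2, y))
    for y2 in step(y):                         # Tree rule, right child
        out.add((op, x, y2))
    return out

def closure(exprs):
    # I*(ε): iterate I to a fixpoint; finite here, since these rules
    # preserve the multiset of leaves.
    while True:
        new = exprs | {e2 for e in exprs for e2 in step(e)}
        if new == exprs:
            return exprs
        exprs = new

e = ('+', ('+', 'a', 'b'), 'c')
print(len(closure({e})))  # 12: every ordering/parenthesisation of a + b + c
```

Even for three terms the class already has 12 members, which hints at the (2n − 1)!! growth mentioned in the introduction.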

2) Scalable Methods: The complexity of equivalent expression finding is reduced by fixing the structure of subexpressions at a certain depth k in the original expression. The definition of depth is given as follows: first, the root of the parse tree of an expression is assigned depth d = 1; then we recursively define the depth of a node as one more than the depth of its greatest-depth parent. If the depth of a node is greater than k, then we fix the structure of its child nodes by disallowing any equivalence transformation beyond this node. We let I_k denote this "depth limited" equivalence finding function, where k is the depth limit used, and I*_k denotes its transitive closure. This approach is similar to Martel's depth-limited equivalent expression transform [15]; however, Martel's method eventually allows transformation of subexpressions beyond the depth limit, because rules of equivalence would transform these to have a smaller depth. This contributes to a time complexity at least exponential in terms of the expression size. In contrast, our technique has a time complexity that does not depend on the size of the input expression, but grows with respect to the depth limit k. Note that the full equivalence closure using the inference system we defined earlier in (14) is at least O((2n − 1)!!), where n is the number of terms in an expression, as we discussed earlier.
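The depth limit can be sketched by threading a depth counter through the one-step transform and refusing to rewrite below depth k. As before, the encoding and rule subset (commutativity and associativity of one operator) are our own simplifications:

```python
# Sketch of the depth-limited transform I_k of Section IV-C2: rule
# applications are only attempted at nodes of depth <= k; deeper
# subtrees are frozen. Encoding: (op, left, right) tuples.

def step_k(e, k, depth=1):
    out = set()
    if depth > k or not isinstance(e, tuple):
        return out                             # structure below k is fixed
    op, x, y = e
    out.add((op, y, x))                        # commutativity
    if isinstance(x, tuple) and x[0] == op:
        out.add((op, x[1], (op, x[2], y)))     # associativity
    for x2 in step_k(x, k, depth + 1):
        out.add((op, x2, y))
    for y2 in step_k(y, k, depth + 1):
        out.add((op, x, y2))
    return out

e = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')
# With k = 1 only the root may be rewritten; a larger k admits more.
print(len(step_k(e, 1)), len(step_k(e, 2)))  # 2 4
```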

3) Pareto Frontier: We optimize expressions for two quality metrics, namely the accuracy of computation and the estimated FPGA resource utilization. We desire the largest subset of all discovered equivalent expressions E such that within this subset, no expression dominates any other, in the sense of having both better area and better accuracy. This subset is known as the Pareto frontier. Fig. 2 shows a naïve algorithm for calculating the Pareto frontier of a set of equivalent expressions ε.

function FRONTIER(ε)
    e1, e2, . . . , en ← sort_by_accuracy(ε)
    e ← e1; frontier ← {e}
    for i ∈ {1, 2, . . . , n} do
        if Area(ei) < Area(e) then
            frontier ← frontier ∪ {ei}; e ← ei
        end if
    end for
    return frontier
end function

Fig. 2. The Pareto frontier from a set of equivalent expressions.
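The algorithm of Fig. 2 transcribes directly into code. In the sketch below, abs_error and area are placeholders for the analyses of the earlier sections, supplied here as plain functions over toy records:

```python
# Direct transcription of the FRONTIER algorithm in Fig. 2: sort by
# increasing error magnitude, then sweep once, keeping an expression
# only if it strictly improves on the best area seen so far.

def frontier(exprs, abs_error, area):
    ordered = sorted(exprs, key=abs_error)   # sort_by_accuracy
    best = ordered[0]
    result = [best]
    for e in ordered[1:]:
        if area(e) < area(best):             # strictly smaller area: keep
            result.append(e)
            best = e
    return result

# Toy candidates given as (name, error, area) records.
cands = [('e1', 1.0, 900), ('e2', 2.0, 700), ('e3', 3.0, 800), ('e4', 4.0, 600)]
pf = frontier(cands, abs_error=lambda c: c[1], area=lambda c: c[2])
# pf keeps e1 (most accurate), then e2 and e4 (each with smaller area);
# e3 is dominated by e2 (worse error and larger area).
```

After sorting, a single linear sweep suffices: any candidate whose area does not beat the running minimum is dominated by an earlier, more accurate expression.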

The function sort_by_accuracy sorts a set of expressions by their analyzed error bounds in increasing order, where the magnitudes of the error bounds are used to perform the ordering [15]:

    AbsError(e) = max(abs(µ♯min), abs(µ♯max)),
    where (x♯, [µ♯min, µ♯max]) = Error(e)    (15)

4) Equivalent Expressions Semantics: Similar to the analysis of accuracy and resource usage, a set of equivalent expressions can be computed with semantics. That is, we define structures, i.e. sets of equivalent expressions, that can be manipulated with arithmetic operators. In our equivalent expressions semantics, an element of P(Expr) is used to assign a set of expressions to each node in an expression parse tree. To begin with, at each leaf of the tree, the variable or constant is assigned a set containing itself; for example, for x, the set εx of equivalent expressions is εx = {x}. After this, we propagate the equivalent expressions in the parse tree's direction of flow, using (16), where E◦(εx, εy) = {ex ◦ ey | ex ∈ εx ∧ ey ∈ εy}, and ◦ ∈ {+, −, ×}:

    εx ◦ εy = FRONTIER(I*_k(E◦(εx, εy)))    (16)

This equation implies that the propagation procedure recursively constructs a set of equivalent subexpressions for the parent node from the two child expression sets, and uses the depth-limited equivalence function I*_k to work out a larger set of equivalent expressions. To reduce computation effort,


we select only those expressions on the Pareto frontier for propagation in the DFG. Although in the worst case the complexity of this process is exponential, the selection by Pareto optimality accelerates the algorithm significantly. For example, for the subexpression a + b of our sample expression:

    εa + εb = FRONTIER(I*_k(E+(εa, εb)))
            = FRONTIER(I*_k(E+({a}, {b})))
            = FRONTIER({a + b, b + a})    (17)
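A sketch of the propagation rule (16) for a single node is given below. Here combine builds the cross product E◦ of the two child sets, and a single commutativity rewrite stands in for the full closure I*_k; the FRONTIER pruning step is omitted for brevity:

```python
# Sketch of the propagation rule (16): the candidate set for a parent
# node is the cross product E_op of its children's equivalence sets,
# enlarged by rewriting.  One commutativity pass stands in for I*_k,
# and no Pareto pruning is applied in this toy version.

from itertools import product

def combine(op, eps_x, eps_y):
    """E_op(eps_x, eps_y) = {x op y | x in eps_x, y in eps_y}, then rewrite."""
    cands = {(op, x, y) for x, y in product(eps_x, eps_y)}
    # One rewriting step standing in for the closure: swap the root operands.
    cands |= {(op, y, x) for (op, x, y) in cands}
    return cands

# Leaves carry singleton sets, as in the text: eps_a = {a}, eps_b = {b}.
eps_ab = combine('+', {'a'}, {'b'})
# eps_ab == {('+', 'a', 'b'), ('+', 'b', 'a')}, matching (17) before
# the FRONTIER step.
```

Applying this bottom-up at every node, with FRONTIER pruning each result, reproduces the hierarchical propagation described above.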

Alternatively, we can view the semantics in terms of DFGs representing the algorithm for finding equivalent expressions. The parsing of an expression directly determines the structure of its DFG. For instance, the expression (a + b) × (a + b) generates the DFG illustrated in Fig. 3. The circles labeled 3 and 7 in this diagram are shorthand for the operations E+ and E×, respectively, where E+ and E× are defined in (16).

[Figure: a DFG in which input nodes a (1) and b (2) feed an E+ node (3), followed by a union node (4), an I_k node (5) in a feedback loop, and a FRONTIER node (6); its output feeds an E× node (7) with a second union (8), I_k (9) loop and FRONTIER node (10).]

Fig. 3. The DFG for finding equivalent expressions of (a + b) × (a + b).

For our example in Fig. 3, similar to the construction ofdata flow equations in Section III, we can produce a set ofequations from the data flow of the DFG, which now producesequivalent expressions:

    A(1) = A(1) ∪ {a}         A(2) = A(2) ∪ {b}
    A(3) = E+(A(1), A(2))     A(4) = A(3) ∪ A(5)
    A(5) = I_k(A(4))          A(6) = FRONTIER(A(5))
    A(7) = E×(A(6), A(6))     A(8) = A(7) ∪ A(9)
    A(9) = I_k(A(8))          A(10) = FRONTIER(A(9))    (18)

Because of loops in the DFG, it is no longer trivial to findthe solution. In general, the analysis equations are solvediteratively. We can regard the set of equations as a singletransfer function F as in (19), where the function F takesas input the variables A(1), . . . , A(10) appearing in the right-hand sides of (18) and outputs the values A(1), . . . , A(10)appearing in the left-hand sides. Our aim is then to find aninput ~x to F such that F (~x) = ~x, i.e. a fixpoint of F .

    F((A(1), . . . , A(10))) = (A(1) ∪ {a}, . . . , FRONTIER(A(9)))    (19)

Initially we assign A(i) = ∅ for i ∈ {1, 2, . . . , 10}, and we write ~∅ = (∅, . . . , ∅). We then compute iteratively F(~∅), F^2(~∅) = F(F(~∅)), and so forth, until a fixpoint solution F^n(~∅) is reached for some iteration n, i.e. F^n(~∅) = F(F^n(~∅)) = F^{n+1}(~∅). The fixpoint solution F^n(~∅) gives the set of equivalent expressions derived using our method, which is found at A(10). In essence, the depth limit acts as a sliding window: the semantics allow hierarchical transformation of subexpressions using a depth-limited search, and the propagation of sets of subexpressions that are locally Pareto-optimal to their parent expressions in a bottom-up hierarchy.
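The iteration can be sketched generically: start from the bottom element and apply F until the output stops changing. The transfer function below is a toy two-variable system, not the full ten-equation system of (18):

```python
# Sketch of the Kleene-style fixpoint iteration described above: start
# from a tuple of empty sets and apply the transfer function F until
# F(x) == x.

def fixpoint(F, bottom):
    x = bottom
    while True:
        y = F(x)
        if y == x:          # F^n = F^{n+1}: fixpoint reached
            return x
        x = y

def F(state):
    """Toy transfer function over two set-valued variables."""
    a1, a2 = state
    return (a1 | {'a'},     # A(1) = A(1) ∪ {a}
            a2 | a1)        # A(2) = A(2) ∪ A(1)

sol = fixpoint(F, (set(), set()))
# sol == ({'a'}, {'a'})
```

Because each equation only ever adds elements to a set, every iterate grows monotonically and the loop is guaranteed to terminate once the reachable expressions are exhausted.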

The problem with the above semantics is that the time complexity of I*_k scales poorly, since the worst-case number of subexpressions to explore increases exponentially with k. An alternative method therefore optimizes it by changing the structure of the DFG slightly, as shown in Fig. 4. The difference is that at each iteration, the Pareto frontier filters the results to decrease the number of expressions to process in the next iteration.

[Figure: the same DFG as Fig. 3, but with the FRONTIER node placed inside each I_k/union iteration loop, so the Pareto frontier filters the expression set at every iteration.]

Fig. 4. The alternative DFG for (a + b) × (a + b).

In the rest of this paper, we use frontier_trace to denote our equivalent expression finding semantics, and greedy_trace to denote the alternative method.

V. IMPLEMENTATION

The majority of our tool, SOAP, is implemented in Python. For computing errors in real arithmetic, we use exact arithmetic based on rational numbers from the GNU Multiple Precision Arithmetic Library (GMP) [19]. We also use the multiple-precision floating-point library MPFR [20] for access to floating-point rounding modes and arbitrary-precision floating-point computation.

Because of the workload of equivalent expression finding, the underlying algorithm is optimized in several ways. First, in each iteration, the relation finding function I_k is applied only to expressions newly discovered in the previous iteration. Secondly, the results of functions such as I_k, Area and Error are cached, since results for subexpressions are likely to be reused several times; subexpressions are also maximally shared to eliminate duplication in memory. Thirdly, the computation of I_k is fully multithreaded.
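As an illustration of the caching strategy (not SOAP's actual code), the sketch below memoizes a toy per-subexpression analysis with functools.cache, so a shared subexpression such as a + b in (a + b) × (a + b) is analyzed only once:

```python
# Illustration of the caching described above: per-subexpression
# analysis results are memoized, so shared subexpressions are analyzed
# once.  functools.cache requires hashable arguments, which the
# tuple-based expression encoding satisfies.

from functools import cache

calls = 0

@cache
def area(expr):
    """Toy area analysis: one unit per operator node in the tree."""
    global calls
    calls += 1
    if not isinstance(expr, tuple):
        return 0
    op, x, y = expr
    return 1 + area(x) + area(y)

ab = ('+', 'a', 'b')
total = area(('*', ab, ab))
# total == 3 (three operator nodes in the tree); thanks to the cache,
# area runs only 4 times (root, ab, a, b) instead of 7.
```

The same idea applies to Error and I_k: because maximally shared subexpressions are identical objects, the memoized analyses hit the cache on every repeated subtree.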

The resource statistics of operators are obtained using FloPoCo [21] and Xilinx Synthesis Technology (XST) [22]. Initially, for each combination of an operator, an exponent width between 5 and 16, and a mantissa width ranging from 10 to 113, a total of 2496 distinct implementations are generated using FloPoCo. All of them are optimized to use DSP blocks. They are then synthesized using XST, targeting a Virtex-6 FPGA device (XC6VLX760). Because LUTs are generally more constrained resources than DSP blocks in floating-point computations, we provide synthesis statistics in LUTs only.

VI. RESULTS

Because Martel’s approach defers selecting optimal optionsuntil the end of equivalent expression discovery, we developeda method that could produce exactly the same set of equivalentexpressions from the traces computed by Martel, and hasthe same time complexity, but we adopted it to generate aPareto frontier from the discovered expressions, instead of onlyerror bounds. This allows us to compare martel_trace,i.e. our implementation of Martel’s method, against our meth-ods frontier_trace and greedy_trace discussed inSection IV-C. Fig. 5(a) optimizes the expression (a+ b)

2 usingthe three methods above, all using depth limit 3, and the inputranges are a ∈ [5, 10] and b ∈ [0, 0.001] [15]. The IEEE754


single precision floating-point format with rounding to nearest was used for the evaluation of accuracy and area. The scatter points represent the different implementations of the original expression that were explored and analyzed, and the (overlapping) lines denote the Pareto frontiers. In this example, our methods produce the same Pareto frontier that Martel's method discovers, with up to 50% shorter run time. Because we consider an accuracy/area trade-off, we can not only obtain the most accurate implementation discovered by Martel, but also an option that is only 0.0005% less accurate yet uses 7% fewer LUTs.

We go beyond the optimization of a small expression by generating results in Fig. 5(b) to show that the same technique is applicable to the simultaneous optimization of multiple large expressions. The expressions ex and ey, with input ranges a ∈ [1, 2], b ∈ [10, 20], c ∈ [10, 200], are used as our example:

    ex = (a + a + b) × (a + b + b) × (b + b + c) ×
         (b + c + c) × (c + c + a) × (c + a + a)
    ey = (1 + b + c) × (a + 1 + c) × (a + b + 1)    (20)

We generated and optimized RTL implementations of ex and ey simultaneously using frontier_trace and greedy_trace, with the depth limits indicated by the numbers in the legend of Fig. 5(b). Note that because the expressions evaluate to large values, the errors are also relatively large. With the depth limit set to 2, greedy_trace executes much faster than frontier_trace, while discovering a subset of the Pareto frontier of frontier_trace. Our methods are also significantly faster and more scalable than martel_trace: because of the poor scalability discussed earlier, our computer ran out of memory before martel_trace could produce any results. If we normalize the time allowed for each method and compare performance, greedy_trace with depth limit 3 takes slightly less time than frontier_trace with depth limit 2, but produces a generally better Pareto frontier. The alternative implementations of the original expressions provided by the Pareto frontier of greedy_trace can either reduce the LUTs used by approximately 10% when accuracy is not crucial, or be about 10% more accurate when resources are not a concern. It also enables different trade-off options, such as an implementation that is 7% more accurate and uses 7% fewer LUTs than the original expression.

Furthermore, Fig. 5(c) varies the mantissa width of the floating-point format, and presents the Pareto frontier of ex and ey optimized together. Floating-point formats with mantissa widths ranging from 10 to 112 bits were used to optimize and evaluate the expressions for both accuracy and area. It turns out that some implementations originally on the Pareto frontier of Fig. 5(b) are no longer desirable, since by varying the mantissa width, new implementations arise that are both more accurate and less resource-demanding.

Besides the large example expressions above, Fig. 5(d) and Fig. 5(e) are produced by optimizing expressions from real applications under single precision. Fig. 5(d) shows the optimization of the Taylor expansion of sin(x + y), where x ∈ [−0.1, 0.1] and y ∈ [0, 1], using greedy_trace with depth limit 3. The function taylor(f, d) denotes the Taylor expansion of the function f(x, y) at x = y = 0 with maximum degree d. For order 5 we reduced the error by more than 60%. Fig. 5(e) illustrates the results obtained with depth limit 3 for the Motzkin polynomial [23] x^6 + y^4 z^2 + y^2 z^4 − 3 x^2 y^2 z^2, which is known to be difficult to evaluate accurately, especially with inputs x ∈ [−0.99, 1], y ∈ [1, 1.01], z ∈ [−0.01, 0.01].

Finally, Fig. 5(f) demonstrates the accuracy of the area estimation used in our analysis. It compares the actual number of LUTs required with the number estimated using our semantics, by synthesizing more than 6000 equivalent expressions derived from a + b + c, (a + 1) × (b + 1) × (c + 1), ex, and ey with varying mantissa widths. The dotted line indicates exact area estimation; a scatter point close to the line means the area estimate for that particular implementation is accurate. The solid black line is the linear regression of all scatter points. On average, our area estimation is a 6.1% over-approximation of the actual number of LUTs, and the worst-case over-approximation is 7.7%.

VII. CONCLUSION

We provide a formal approach to the optimization of arithmetic expressions for both accuracy and resource usage in high-level synthesis. Our method and the associated tool, SOAP, encompass three kinds of semantics, which respectively describe the accumulated roundoff errors, count operators in expressions while accounting for common subexpression elimination, and derive equivalent expressions. For a set of input expressions, the proposed approach works out the respective sets of equivalent expressions in a hierarchical bottom-up fashion, with a windowing depth limit and Pareto selection to help reduce the complexity of equivalent expression discovery. Using our tool, we improve either the accuracy of our sample expressions or their resource utilization by up to 60% over the originals under single precision. Our tool enables a high-level synthesis tool to optimize the structure as well as the precision of arithmetic expressions, and then to automatically choose an implementation that satisfies accuracy and resource usage constraints.

We believe it is possible to extend our tool for the multi-objective optimization of arithmetic expressions in the following ways. First, because of the semantic nature of our approach, it provides the necessary foundation for extending the method to general numerical program transformation in high-level synthesis. Secondly, it would be useful to further allow transformations of expressions with different mantissa widths in their subexpressions; this would further increase the number of options on the Pareto frontier, and lead to more optimized expressions. Thirdly, as an alternative to interval analysis, we could employ more sophisticated abstract domains that capture the correlations between variables and produce tighter bounds on results. Finally, there could be considerable interest in the HLS community in how our tool can be incorporated with Martin Langhammer's work on fused floating-point datapath synthesis [24].

[Figure: six scatter plots of absolute error against area (number of LUTs), with Pareto frontiers overlaid; legends report method, depth limit and run time for each trace.
(a) Optimization of (a + b)^2: original; greedy_trace, 3 (0.04s); frontier_trace, 3 (0.08s); martel_trace, 3 (0.09s); frontier expressions include ((a + b) * (a + b)), (((a + b) * a) + ((a + b) * b)), ((((a + b) + a) * b) + (a * a)) and ((((a * b) + (b * b)) + (a * b)) + (a * a)).
(b) Simultaneous optimization of both ex and ey: frontier_trace, 2 (2.97s); greedy_trace, 3 (2.39s); greedy_trace, 2 (0.27s); martel_trace, 2 (out of memory); original.
(c) Varying the mantissa width of (b): greedy_trace, 3 (274.91s); original; half-, single-, double- and quadruple-precision marked.
(d) The Taylor expansion of sin(x + y): taylor(sin(x + y), d) for d = 2 (0.05s), 3 (0.28s), 4 (1.14s), 5 (7.75s), each against its original.
(e) The Motzkin polynomial em: greedy_trace (0.88s); frontier_trace (2.07s); martel_trace (out of memory); original.
(f) Accuracy of area estimation: estimated against actual number of LUTs, with 6.1% average deviation.]

Fig. 5. Optimization Results.

In the future, after extending our method and SOAP with general numerical program transformation, we will make the tool available to the HLS community.

REFERENCES

[1] ANSI/IEEE, "IEEE Standard for Floating-Point Arithmetic," Microprocessor Standards Committee of the IEEE Computer Society, Tech. Rep., Aug. 2008.

[2] A. Ioualalen and M. Martel, "A new abstract domain for the representation of mathematically equivalent expressions," in Proceedings of the 19th International Conference on Static Analysis, SAS '12. Springer-Verlag, 2012, pp. 75–93.

[3] C. Mouilleron, "Efficient computation with structured matrices and arithmetic expressions," Ph.D. dissertation, Ecole Normale Superieure de Lyon-ENS LYON, 2011.

[4] P. Cousot and R. Cousot, "Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints," in Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL '77. ACM, 1977, pp. 238–252.

[5] P. Coussy and A. Morawiec, High-Level Synthesis: from Algorithm to Digital Circuit, 1st ed. Springer-Verlag, 2008.

[6] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis. Kluwer, Boston, 1992.

[7] Berkeley Design Technology, Inc., "High-level synthesis tools for Xilinx FPGAs," Berkeley Design Technology, Inc., Tech. Rep., 2010.

[8] M. C. McFarland, A. C. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proc. IEEE, vol. 78, no. 2, pp. 301–318, 1990.

[9] Xilinx, "Vivado design suite user guide—high-level synthesis," 2012.

[10] U. of Toronto, "LegUp documentation—release 3.0," Jan. 2013. [Online]. Available: http://legup.eecg.utoronto.ca/

[11] G. Constantinides, A. Kinsman, and N. Nicolici, "Numerical data representations for FPGA-based scientific computing," IEEE Des. Test. Comput., vol. 28, no. 4, pp. 8–17, 2011.

[12] E. Darulova, V. Kuncak, R. Majumdar, and I. Saha, "On the Generation of Precise Fixed-Point Expressions," Ecole Polytechnique Federale de Lausanne, Tech. Rep., 2013.

[13] A. Hosangadi, F. Fallah, and R. Kastner, "Factoring and eliminating common subexpressions in polynomial expressions," in Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, ICCAD '04. IEEE Computer Society, 2004, pp. 169–174.

[14] A. Peymandoust and G. De Micheli, "Using symbolic algebra in algorithmic level DSP synthesis," in Design Automation Conference, 2001. Proceedings, 2001, pp. 277–282.

[15] M. Martel, "Semantics-based transformation of arithmetic expressions," in Static Analysis, Lecture Notes in Computer Science, vol. 4634. Springer-Verlag, 2007, pp. 298–314.

[16] R. E. Moore, R. B. Kearfott, and M. J. Cloud, Introduction to Interval Analysis. Society for Industrial and Applied Mathematics, 2009.

[17] J.-M. Muller, "On the definition of ulp(x)," Research report RR2005-09, Laboratoire de l'Informatique du Parallelisme, Feb. 2005.

[18] D. Goldberg, "What every computer scientist should know about floating-point arithmetic," ACM Computing Surveys (CSUR), vol. 23, no. 1, pp. 5–48, 1991.

[19] T. Granlund et al., "GMP, the GNU multiple precision arithmetic library," 1991. [Online]. Available: http://gmplib.org/

[20] L. Fousse, G. Hanrot, V. Lefevre, P. Pelissier, and P. Zimmermann, "MPFR: A multiple-precision binary floating-point library with correct rounding," ACM Transactions on Mathematical Software (TOMS), vol. 33, no. 2, p. 13, 2007.

[21] F. de Dinechin and B. Pasca, "Designing custom arithmetic data paths with FloPoCo," IEEE Des. Test. Comput., vol. 28, no. 4, pp. 18–27, 2011.

[22] Xilinx, Inc., XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, Mar. 2013. [Online]. Available: http://www.xilinx.com/support/documentation/sw manuals/xilinx14 5/xst v6s6.pdf

[23] J. Demmel, "Accurate and efficient expression evaluation and linear algebra," in Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation, SNC '11. ACM, 2011, p. 2.

[24] M. Langhammer and T. VanCourt, "FPGA floating point datapath compiler," in 17th IEEE Symposium on Field Programmable Custom Computing Machines, FCCM '09. IEEE, 2009, pp. 259–262.

