
A Scalable Auto-tuning Framework for Compiler Optimization

Ananta Tiwari1, Chun Chen2∗, Jacqueline Chame3,

Mary Hall2 and Jeffrey K. Hollingsworth1

1University of Maryland 2University of Utah

Department of Computer Science School of Computing

College Park, MD 20740 Salt Lake City, UT 84112

{tiwari, hollings}@cs.umd.edu {chunchen, mhall}@cs.utah.edu

3University of Southern California

Information Sciences Institute

Marina del Rey, CA 90292

[email protected]

Abstract

We describe a scalable and general-purpose framework for auto-tuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the best-performing one. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of search steps.

1 Introduction

The complexity and diversity of today's parallel architectures overly burden application programmers in porting and tuning their code. At the very high end, processor utilization is notoriously low, and the high cost of wasting these precious resources motivates application programmers to devote significant time and energy to tuning their codes. This tuning process must be largely repeated to move from one architecture to another, as too often, a code that performs well on one architecture faces bottlenecks on another. As we are entering the era of petascale systems, the challenges facing application programmers in obtaining acceptable performance on their codes will only grow.

∗This work was done when the author was at USC/ISI.

To assist the application programmer in managing this complexity, much research in the last few years has been devoted to auto-tuning software that employs empirical techniques to evaluate a set of alternative mappings of computation kernels to an architecture and select the mapping that obtains the best performance. Auto-tuning software can be grouped into three categories: (1) self-tuning library generators such as ATLAS, PHiPAC and OSKI for linear algebra and FFTW and SPIRAL for signal processing [21, 3, 20, 9, 22]; (2) compiler-based auto-tuners that automatically generate and search a set of alternative implementations of a computation [7, 24, 11]; and (3) application-level auto-tuners that automate empirical search across a set of parameter values proposed by the application programmer [8, 16]. What is common across all these different categories of auto-tuners is the need to search a range of possible implementations to identify one that performs comparably to the best-performing solution. The resulting search space of alternative implementations can be prohibitively large. Therefore, a key challenge that faces auto-tuners, especially as we expand the scope of their capabilities, involves scalable search among alternative implementations.

As we look to the future, full applications will likely include a mix of auto-tuning software from the above three categories: automatically-generated libraries, compiler-generated code and application-level parameters exposed to auto-tuning environments. Thus, applications of the future will demand a cohesive environment that can seamlessly combine these different kinds of auto-tuning software and that employs scalable empirical search to manage the cost of the search process.

[Surface plot omitted: Runtime vs. Tile Size and Unroll Amount, titled "Parameter Interaction (Tiling and Unrolling for MM, N=800)".]

Figure 1. Parameter Search Space for Tiling and Unrolling (Figure is easier to see in color).

In this paper, we take an important step in the direction of building such an environment. We begin with Active Harmony [8], which permits application programmers to express application-level parameters, and automates the process of searching among a set of alternative implementations. We combine Active Harmony with CHiLL [5], a compiler framework designed to support convenient automatic generation of code variants and parameters from compiler-generated or user-specified transformation recipes. In combining these two systems, we have produced a unique and powerful framework for auto-tuning compiler-generated code that explores a richer space than compiler-based systems do today and can empower application programmers to develop self-tuning applications that include compiler transformations.

A unique feature of our system is a powerful parallel search algorithm which leverages parallel architectures to search across a set of optimization parameter values. Multiple, sometimes unrelated, points in the search space are evaluated at each timestep. With this approach, we both explore multiple parameter interactions at each iteration and also have different nodes of the parallel system evaluate different configurations to converge to a solution faster. In support of this search process, CHiLL provides a convenient high-level scripting interface to the compiler that simplifies code generation and the adjustment of optimization parameter values.

The remainder of the paper is organized as follows. The next section motivates the need for an effective search algorithm to explore compiler-generated parameter spaces. Section 3 describes our search algorithm, which is followed by a high-level description of CHiLL in section 4. In section 5, we give an overview of the tuning workflow in our framework. Section 6 presents an experimental evaluation of our framework. We discuss related work in section 7. Finally, section 8 provides concluding remarks and future implications of this work.

2 Motivation

Today's complex architecture features and deep memory hierarchies require applying nontrivial optimization strategies to loop nests to achieve high performance. This is true even for a simple loop nest like Matrix Multiply. Although naively tiling all three loops of Matrix Multiply would significantly increase its performance, the resulting performance is still well below that of hand-tuned libraries. Chen et al. [7] demonstrate that automatically generated optimized code can achieve performance comparable to hand-tuned libraries by using a more complex tiling strategy combined with other optimizations such as data copy and unroll-and-jam. Combining optimizations, however, is not an easy task, because loop transformation strategies interact with each other in complex ways.

Different loop optimizations usually have different goals, and when combined they might have unexpected (and sometimes undesirable) effects on each other. Even optimizations with similar goals but targeting different resources, such as unroll-and-jam plus scalar replacement targeting data reuse in registers, and loop tiling plus data copy for reuse in caches, must be carefully combined. Unroll-and-jam generally has more impact on performance than tiling for caches, since reuse in registers reduces the number of loads and stores. In addition, on architectures with SIMD units, unroll-and-jam can be used to expose SIMD parallelism. The unroll factors must be tuned so that reuse and SIMD are exploited without causing register spilling or instruction cache misses. On the other hand, tiling plus data copying for reuse in caches changes the iteration order and data layout, and may affect reuse in registers and SIMD parallelism. When combining unroll-and-jam and tiling, both unroll and tile sizes must be tuned so that the performance gains are complementary. Figure 1 illustrates these complex interactions by showing the performance of square matrix (of size 800 × 800) multiplication as a function of tiling and unrolling factors. Tiling factors range from 2 to 80 and unrolling factors from 2 to 32. We see a corridor of best performing combinations along the x-y diagonal where tiling and unrolling factors are equal, and smaller corridors where tile factors are multiples of unroll factors. The best performing code variant used a tiling factor of 24 and an unrolling factor of 24, and achieves a performance of 845 MFLOPS.

Empirical optimization can compensate for the lack of precise analytical models by performing a systematic search over a collection of automatically generated code variants. Each variant exposes a set of parameters that controls the application of different transformation strategies. Parameter configurations for variants serve as points in the search space, and the objective function values1 associated with the points are gathered by actually running the variants on the target architecture. The success of empirical search is largely driven by how well the chosen search algorithm navigates the search space. The search space shown in Figure 1 is not smooth and contains multiple minima and maxima. The best and the worst configurations differ in performance by a factor of six.

1The objective function values associated with points in the search space can be any desired metric of performance (for example, time per timestep, MFLOPS, cache utilization, etc.).

Active Harmony, an automated performance tuning infrastructure supporting both online and offline tuning for scientific applications2, provides a selection of search algorithms designed specifically to deal with search spaces where the explicit definition of the objective function is not available. Finding a good set of loop transformation parameters is a good example of the type of search that the Harmony system is designed to address.

Algorithm 1: PRO for Compiler Optimization

1:  Start with an initial simplex with vertices {v_0^0, ..., v_0^n} and evaluate f(v_0^j), j = 0, ..., n in parallel.
2:  k = 0
3:  while Stopping Criteria Not Valid do
4:    Reorder the simplex vertices so that f(v_k^0) ≤ ... ≤ f(v_k^n)
5:    Compute n reflection points r_k^j = Π(2 v_k^0 − v_k^j) and function values f(r_k^j), j = 1, ..., n in parallel. {Reflection step}
6:    l = arg min_j f(r_k^j) {Most promising point}
7:    if f(r_k^l) < f(v_k^0) then
8:      Compute n expansion points e_k^j = Π(3 v_k^0 − v_k^j) and function values f(e_k^j), j = 1, ..., n in parallel. {Expansion checking step}
9:      if f(e_k^l) < f(r_k^l) then {Accept expansion}
10:       v_{k+1}^j = e_k^j, j = 1, ..., n
11:     else {Send HALT signal to all processes and accept reflection}
12:       v_{k+1}^j = r_k^j, j = 1, ..., n
13:     end if
14:   else {Accept shrink}
15:     Compute v_{k+1}^j = Π(0.5 v_k^0 + 0.5 v_k^j) and f(v_{k+1}^j), j = 1, ..., n in parallel. {Shrink step}
16:   end if
17:   k = k + 1
18: end while

In the next section, we describe our parameter tuning algorithm for compiler-generated parameter spaces.

3 Parameter Tuning Algorithm

As previously shown, the loop transformation parameters interact with each other in complex ways. The search algorithm used to explore the parameter spaces of compiler-optimized computations must take into account such interactions and be able to tune the parameters simultaneously. The simultaneous tuning, however, leads to added dimensions in the search space. For our purposes, we use a modified version of the Parallel Rank Ordering (PRO) algorithm proposed by Tabatabaee et al. [19]. Although the original PRO algorithm can effectively deal with high-dimensional search spaces with unknown objective functions, there are two main differences between the type of search PRO was designed for and the type of search we want to conduct. First, PRO was designed for online tuning of SPMD-based parallel applications, while our approach needs an offline search. Second, Tabatabaee et al. only looked at (hyper-)rectangular search spaces instead of the more general parameter space used in our compiler optimization. In addition, we modified the initial simplex construction method to better suit our goal of using all available parallelism. We describe each modification in detail later in this section. We will refer to the modified algorithm as PRO-C (PRO for Compiler Optimization).

2Online tuning refers to adapting performance-related parameters during runtime. Offline tuning refers to tuning parameters that can be selected at compile/launch time but remain fixed throughout the execution.

The parameter tuning algorithm is given in Algorithm 1. For a function of N variables, PRO-C maintains a set of kN points forming the vertices of a simplex in an N-dimensional space. Each simplex transformation step3 (lines 5, 8 and 15) of the algorithm generates up to kN − 1 new vertices by reflecting, expanding, or shrinking the simplex around the best vertex. After each transformation step, the objective function value, f, associated with each of the newly generated points is calculated in parallel. The reflection step is considered successful if at least one of the kN − 1 new points has a better f than the best point in the simplex. If the reflection step is not successful, the simplex is shrunk around the best point. A successful reflection step is followed by an expansion check step (line 9). If the expansion check step is successful, the expanded simplex is accepted. Otherwise, the reflected simplex is accepted and the search moves on to the next iteration. A graphical illustration of the reflection, expansion and shrink steps is shown in Figure 2 for a 2-dimensional search space and a 4-point simplex. In the remainder of this section, we describe the modifications that we made to the original PRO algorithm to make it suitable for searching compiler-generated parameter spaces.
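The following sketch (ours, not code from the paper) mirrors the control flow of one Algorithm 1 iteration for a generic objective function f; evaluate_in_parallel and project are placeholders for the parallel evaluation backend and the projection operator Π described in Section 3.2.

# Minimal sketch of one PRO-C iteration (reflection / expansion check / shrink).
# `simplex` is a list of parameter vectors, `values` their measured costs;
# evaluate_in_parallel() and project() stand in for the parallel evaluation
# backend and the projection operator Pi.

def pro_c_step(simplex, values, evaluate_in_parallel, project):
    # Reorder vertices so that simplex[0] is the best (lowest-cost) point.
    order = sorted(range(len(simplex)), key=lambda j: values[j])
    simplex = [simplex[j] for j in order]
    values = [values[j] for j in order]
    best = simplex[0]

    # Reflection step: reflect every other vertex around the best point.
    reflected = [project([2 * b - x for b, x in zip(best, v)]) for v in simplex[1:]]
    refl_vals = evaluate_in_parallel(reflected)
    l = min(range(len(reflected)), key=lambda j: refl_vals[j])  # most promising point

    if refl_vals[l] < values[0]:
        # Expansion checking step: expand all vertices in parallel; keep the
        # expanded simplex only if the most promising expansion beats its
        # reflection (otherwise the reflection is accepted).
        expanded = [project([3 * b - x for b, x in zip(best, v)]) for v in simplex[1:]]
        exp_vals = evaluate_in_parallel(expanded)
        if exp_vals[l] < refl_vals[l]:
            return [best] + expanded, [values[0]] + exp_vals
        return [best] + reflected, [values[0]] + refl_vals

    # Shrink step: pull every vertex halfway toward the best point.
    shrunk = [project([0.5 * b + 0.5 * x for b, x in zip(best, v)]) for v in simplex[1:]]
    return [best] + shrunk, [values[0]] + evaluate_in_parallel(shrunk)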

3.1 Parallelizing Expansion Check Step

Recall that each simplex transformation step generates up to kN − 1 new vertices. The time required to complete the parallel evaluation of these new vertices is the time taken by the worst performing vertex. The decision to introduce the expansion-check step in PRO was motivated by the observation that there are some expansion points with very poor performance. For online tuning of SPMD-based parallel applications, such configurations slow down not only the search but also the execution of the application itself. To avoid these time-consuming instances, before evaluating all expansion points, PRO first calculates the expansion point performance of only the most promising case4, at the expense of parallelism. If the expansion checking step is successful, the algorithm performs expansion of the other points in the simplex. Assuming we have kN nodes available, each iteration of PRO therefore takes at most three search steps (reflection, expansion check and expansion).

3Each simplex transformation is considered to be a search step within one search iteration. One iteration of the search algorithm consists of all the simplex transformations that happen between successive reflection steps.

Figure 2. Simplex Transformation steps.

In an offline parallel search, however, processors participating in the search are independent, which allows us to take full advantage of the underlying parallelism while still avoiding expansion points with poor performance. To that end, PRO-C evaluates all expansion points, and the decision to accept or reject the expanded simplex is based on the performance of the most promising case. If the performance reported by the most promising case is worse than that of the best point in the reflected simplex, our system sends a signal to all the other processors to stop the evaluation of their candidate configurations and accepts the reflected simplex. The expansion of the simplex is accepted if the performance of the most promising case is better than the best vertex in the reflected simplex. With this modification, we not only reduce the number of steps within one iteration of the search algorithm to at most two (reflection-expansion and reflection-shrink) but also increase parallelism.

4The most promising point is the point in the original simplex whose reflection around the best point returns the best function value.

3.2 Projection Operator for Arbitrary Space

Offline tuning of loop transformation parameters is a constrained optimization problem. Therefore, in each step we have to make sure that the computed points are admissible, i.e., that they satisfy the constraints. The projection operator, the function Π(·) used in the pseudo-code, takes care of this problem by mapping points that are not admissible to admissible points. PRO uses a simple method that independently maps the computed value of each parameter to its lower or upper limit, whichever is closer. This method works well for hyper-rectangular search spaces, but not when we have an arbitrarily shaped space defined by (possibly non-linear) constraints on parameter values. Our projection operator accommodates such arbitrarily shaped spaces by projecting an inadmissible point to its nearest admissible neighbor. We define the distance between two points using the L1 distance, which is the sum of the absolute differences of their coordinates. The nearest neighbor of an inadmissible point (calculated in terms of L1) will thus be a legal point with the least amount of change (in terms of parameter values) summed over all dimensions.
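As a rough illustration of this operator (ours, not the actual implementation, which uses the approximate nearest-neighbor library described below), the following sketch projects an inadmissible point onto its closest admissible point under the L1 metric by brute force:

# Sketch of a projection operator for an arbitrarily shaped space: map an
# inadmissible point to its nearest admissible neighbor under the L1
# (Manhattan) distance. A brute-force scan is shown for clarity; the actual
# system uses an approximate nearest-neighbor (ANN) structure instead.

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def project(point, admissible_points, is_admissible):
    if is_admissible(point):
        return point
    return min(admissible_points, key=lambda q: l1_distance(point, q))

# Example: a space defined by a (possibly non-linear) constraint i * j <= 16.
admissible = [(i, j) for i in range(1, 9) for j in range(1, 9) if i * j <= 16]
print(project((6, 5), admissible, lambda p: p[0] * p[1] <= 16))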

Computing the least L1 distance unfortunately involves finding nearest neighbors in a high-dimensional space, which is a computationally intensive task. After experimenting with multiple nearest-neighbor algorithms, we adopted the Approximate Nearest Neighbor5 (ANN) [2] algorithm for two reasons. First, for approximate neighbors, ANN has linear space requirements and logarithmic time complexity in the number of points in the search space. Second, an efficient implementation of the ANN library is available [15]. The library supports a variety of metrics to define the distance between two points, including the L1 distance metric. We set ε = 0.5, which, for L1 distance, means that an error of at most one along at most one dimension is tolerated, a fairly small price to pay for logarithmic query time.

5Given any ε > 0, a (1 + ε)-nearest neighbor of q is a point p′ ∈ S such that dist(p′, q) / dist(p, q) ≤ 1 + ε, where p is the true nearest neighbor of q.

3.3 Simplex Construction and Size

The initial simplex, with size kN, needs to be non-degenerate so that it can span the whole parameter space; therefore, kN must be at least N + 1, where N is the number of tunable parameters. For a discrete parameter space, PRO's simplex construction method can generate only up to 2N points. In PRO-C, we extend the method to generate points for any kN ≥ N + 1. To exploit all available parallelism, kN can be set to the number of resources/processors available.

Unlike PRO's strategy of starting the search at the center of the search space (which is hard to ascertain in a high-dimensional constrained space), we randomly select kN points at the start of the algorithm. The first iteration of the algorithm evaluates these random configurations. The initial simplex is constructed by randomly sampling points at distance d (L1 distance) from the best performing point. The set of search directions/vectors (from the initial best point to the sampled points) generated in this fashion is guaranteed to be a linearly independent set, which is important because this property gives us kN unique parameter interactions.
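A minimal sketch of this construction follows (our illustration; random_point, sample_at_l1_distance and evaluate are placeholders, and the real system samples from the constrained space and verifies the linear independence of the resulting search directions):

# Sketch of PRO-C's initial simplex construction: evaluate k_N random
# configurations, pick the best one, then sample points at L1 distance d
# from it to form the remaining vertices.

def initial_simplex(k_n, d, evaluate, random_point, sample_at_l1_distance):
    # First iteration: k_N random configurations (evaluated in parallel in practice).
    candidates = [random_point() for _ in range(k_n)]
    best = min(candidates, key=evaluate)

    # Build the simplex by sampling distinct points at L1 distance d from the best point.
    simplex = [best]
    while len(simplex) < k_n:
        p = sample_at_l1_distance(best, d)
        if p not in simplex:
            simplex.append(p)
    return simplex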

In section 4, we describe CHiLL, our loop transformation and code generation framework.

4 CHiLL: A Framework for Composing High-Level Loop Transformations

Automatic tuning requires a compiler to be able to generate different code variants rapidly during the search by adjusting parameter values, without costly compiler reanalysis. It also demands that the compiler have a clean interface to a separate parameter search engine. CHiLL [5, 6], a polyhedral loop transformation and code generation framework, provides such a capability for composing high-level loop transformations, with a script interface that describes the transformations and the search space to the search engine. The polyhedral representation of loops enables compilers to compose complex loop transformations in a mathematically rigorous way that ensures code correctness. However, existing polyhedral frameworks are often too limited in supporting the wide array of loop transformations (for both perfect and imperfect loop nests) required to achieve high performance on today's computer architectures. CHiLL employs new design features such as iteration space alignment and auxiliary loops to greatly expand the capability of the polyhedral framework. Further, its high-level script interface allows compilers or application programmers to use a common interface to describe parameterized code transformations to be applied to a computation, whose parameters can be instantiated by an external search engine to find the best-performing implementation. We now briefly describe CHiLL's new features.

Page 6: A Scalable Auto-tuning Framework for Compiler Optimizationhollings/papers/ipdps09.pdf · produced a unique and powerful framework for auto-tuning compiler-generated code that explores

(a) Original code:

      DO I = 2, N
s1      SUM(I) = 0
        DO J = 1, I-1
s2        SUM(I) = SUM(I) + A(J,I)*B(J)
s3      B(I) = B(I) - SUM(I)

(b) Aligned iteration spaces:

IS1: {[i, j] | 2 ≤ i ≤ N ∧ j = 1}
IS2: {[i, j] | 1 ≤ j < i ≤ N}
IS3: {[i, j] | 2 ≤ i ≤ N ∧ j = i − 1}

(c) Dependence graph: nodes s1, s2, s3 with edges labeled by flow, anti and output dependences (e.g., flow(0,0), flow(0,+), flow(+,1), anti(0,+), output(0,0), output(0,+)).

(d) Transformation relations to generate the original loop nest in (a):

t_s1: {[∗, i, ∗, j, ∗] → [0, i, 0, j, 0]}
t_s2: {[∗, i, ∗, j, ∗] → [0, i, 1, j, 0]}
t_s3: {[∗, i, ∗, j, ∗] → [0, i, 2, j, 0]}

Figure 3. Representing Loop Nests and Transformations.

4.1 Polyhedral Representation

In a polyhedral representation, a loop nest is represented by the collection of iteration spaces of the statements inside the loop nest. Each statement has its own iteration space, derived from its enclosing loops. Thus, for imperfect loop nests, the number of dimensions of the iteration spaces of individual statements derived initially may differ. An additional iteration space alignment step brings every statement into the same unified iteration space. To generate imperfectly nested transformed loops, auxiliary loops are added to determine the lexicographical order among loops at each loop level. We will discuss both concepts in detail below.

Iteration space alignment can be thought of as a generalization of code sinking and loop fusion. For an imperfect loop nest such as the one in Figure 3(a), CHiLL extracts the iteration space for each statement as in Figure 3(b). Note that in CHiLL's representation every statement in the loop nest has the same number of dimensions in its iteration space. Although s1 and s3 are only surrounded by one loop, I, their iteration spaces are still 2-dimensional; more precisely, each represents a line aligned in a 2-dimensional iteration space. Once the iteration spaces of all statements are aligned in the same iteration space, CHiLL can transform perfect and imperfect loop nests in a systematic way, and the legality of a transformation can be determined in the same way as for perfect loop nests, i.e., from the data dependences (e.g., Figure 3(c)) prior to the transformation. The complete algorithm for iteration space alignment can be found in [5].

Auxiliary loops are introduced to allow a systematic code generation strategy for both perfect and imperfect loop nests. If the aligned iteration spaces only included dimensions for each loop level, there would be no information available as to the relationship or required execution order among statements, or how loops and statements would be organized at a specific loop level. To keep a simple and robust polyhedral scanning strategy for code generation, an auxiliary loop is associated with each loop level in the original nest. Each auxiliary loop carries the execution order of statements and loops at its associated level. An additional auxiliary loop is associated with the statements within the deepest level of the iteration space, and carries the execution order of these statements. By setting different constant integer values for these auxiliary loops, CHiLL establishes the lexicographical order of loops at each loop level as well as the lexicographical order of statements in the innermost loop. So for an n-deep loop nest, we have (2n + 1)-dimensional iteration spaces [c1, l1, c2, l2, ..., cn, ln, cn+1], where the ci are auxiliary loops. Each loop transformation from an n-deep loop nest to a new m-deep loop nest is represented as a set of relations:

t : {[c1, l1, ..., cn, ln, cn+1] → [c′1, l′1, ..., c′m, l′m, c′m+1] | ...}.

Figure 3(d) shows the transformation relations that generate the original loop nest, with the initial auxiliary loop values not yet determined. Since only constant values are allowed in auxiliary loops, no loops are generated for them in the final transformed code.
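As a small illustration of this notation (our example, not one taken from the paper), interchanging the two loops of a 2-deep perfect nest whose aligned iteration space is [c1, l1, c2, l2, c3] could be written as

t : {[c1, l1, c2, l2, c3] → [c1, l2, c2, l1, c3]},

with the auxiliary dimensions c1, c2 and c3 left unchanged.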

4.2 Code Transformations - Recipes

CHiLL takes as input the original code and a loop transformation recipe (a CHiLL script) describing how to optimize the code. Each line of the script describes a transformation to be applied to an existing loop representation. For illustration purposes, we list some of the most common high-level loop transformations below. As a general rule, each loop transformation affects a set of statements within the specified loop.

permute([stmt],order): the loop order of stmt is permuted to the new order, which is represented by a sequence of integers identifying the loops. If permute does not have a stmt parameter, it indicates that the loop order of all statements should be permuted.

tile(stmt,loop,size,[outer-loop]): Tile the loop at level loop of stmt, with the tile-controlling loop at loop level outer-loop (default value 1) and tile size size.

unroll(stmt,loop,size): Unroll stmt's loop at level loop by unroll factor size. For all unrolled statements, the inner loop bodies below loop level loop are jammed together.

datacopy(stmt,loop,array,[index]): For the specified array in stmt, a temporary array copy is introduced for all array accesses touched within loop level loop. The index (default value 0) specifies which subscript of array corresponds to the new temporary array's first index (assuming Fortran array layout). The array accesses in stmt are replaced by appropriate temporary array accesses.

split(stmt,loop,condition): Split stmt's loop level loop into multiple loops according to condition. The original stmt's iteration space will satisfy condition. The iteration space satisfying the complement of condition will be split into new statements.

nonsingular(matrix): Transform the perfect loop nest according to the nonsingular matrix matrix. This includes both unimodular and nonunimodular transformations.

In the next section, we describe how the CHiLL and Active Harmony frameworks interact with each other to generate a set of alternative implementations of computation kernels and to automatically search for and select the best-performing implementation.

5 Overall System Workflow

Figure 4 shows the overall workflow of our system. In the proposed framework, code transformation recipes and parameter specifications (i.e., parameter domains and constraints) can either be generated automatically by the compiler or be written by users tuning their application code. With this flexibility, our approach can support both fully automated compiler optimization and user-directed tuning. For our experiments, we translate loop transformation sequences from the algorithms presented by Chen et al. [7] to CHiLL scripts. Specifications for unbound parameters in the scripts are derived using simple heuristics based on architectural parameters (e.g., considering cache capacity to generate constraints for tile sizes). We elaborate more on parameter specification in the next section. If a user with domain knowledge wants more control over what part of the parameter space to focus on, he/she can provide additional constraints to fine-tune the search space. Using the parameter specifications, we normalize the domain of each parameter onto our internal integer-based coordinate system. This step is necessary to ensure that differences in the range of values parameters can take in different dimensions do not unduly influence the L1 distance metric.

Figure 4. Overall System Workflow Diagram.
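A simple sketch of such a normalization (ours; the actual internal representation may differ) maps each parameter's admissible values onto consecutive integer indices, so that one unit of L1 distance means one admissible step in any dimension:

# Sketch of normalizing heterogeneous parameter domains onto an integer
# coordinate system, so that dimensions with large value ranges (e.g. tile
# sizes up to 512) do not dominate dimensions with small ranges (e.g. unroll
# factors up to 16) when computing L1 distances.

def build_mapping(domains):
    """domains: dict of parameter name -> sorted list of admissible values."""
    to_index = {name: {v: i for i, v in enumerate(vals)} for name, vals in domains.items()}
    to_value = {name: dict(enumerate(vals)) for name, vals in domains.items()}
    return to_index, to_value

domains = {
    "TI": list(range(0, 514, 2)),   # 0, 2, 4, ..., 512
    "UI": list(range(1, 17)),       # 1, 2, ..., 16
}
to_index, to_value = build_mapping(domains)

# A configuration in normalized coordinates: each dimension is an index.
config = {"TI": to_index["TI"][128], "UI": to_index["UI"][4]}
print(config)                        # {'TI': 64, 'UI': 3}
print(to_value["TI"][config["TI"]])  # 128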

Parameters that appear in one or more constraints are considered to be interdependent and are evaluated as sets. For example, tile-size parameters for multiple loops may appear in one or more cache capacity constraints. A simple constraint solver is then used to enumerate points for each of these sets. Projection of an inadmissible point to a valid point in the search space is done (by the projection server) separately for different groups of parameters.

At each search step, Active Harmony's search-kernel requests CHiLL's code generator to generate code variants with given sets of parameters for loop transformations. The CHiLL-generated code variants are then compiled and run in parallel on the target architecture by the optimization driver. Measured performance values are consumed by the search-kernel to make simplex transformation decisions.
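A high-level sketch of this loop follows (ours; generate_variant, compile_and_run, update_simplex and converged are placeholders for the CHiLL code generator, the optimization driver, the Harmony search-kernel and its stopping criterion):

# Sketch of the outer tuning loop: the search kernel proposes parameter
# configurations, a code generator produces the corresponding variants, the
# driver compiles and runs them (in parallel on the real system), and the
# measured performance feeds the next simplex transformation.

def tune(recipe, initial_configs, generate_variant, compile_and_run,
         update_simplex, converged):
    configs = initial_configs
    measurements = {}
    while not converged(measurements):
        # One search step: one code variant per candidate configuration.
        variants = [generate_variant(recipe, cfg) for cfg in configs]
        # Evaluate variants (serially here; in parallel on the cluster).
        results = [compile_and_run(v) for v in variants]
        for cfg, perf in zip(configs, results):
            measurements[tuple(cfg.items())] = perf
        # The search kernel decides the next simplex (reflect/expand/shrink).
        configs = update_simplex(configs, results)
    best = min(measurements, key=measurements.get)
    return dict(best)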

Page 8: A Scalable Auto-tuning Framework for Compiler Optimizationhollings/papers/ipdps09.pdf · produced a unique and powerful framework for auto-tuning compiler-generated code that explores

Table 1. Kernels used for experiments

MM
  Naive code:
    DO K = 1, N
      DO J = 1, N
        DO I = 1, N
          C[I,J] = C[I,J] + A[I,K]*B[K,J]
  Transformation recipe:
    permute([3,1,2]), tile(0,2,TJ), tile(0,2,TI), tile(0,5,TK),
    datacopy(0,3,2,1), datacopy(0,4,3), unroll(0,4,UI), unroll(0,5,UJ)
  Constraints:
    TK × TI ≤ (1/2)(sizeL2/2), TK × TJ ≤ (1/2)(sizeL1/2), UI × UJ ≤ sizeR,
    TI, TJ, TK ∈ [0, 2, 4, ..., 512], UI, UJ ∈ [1, 2, ..., 16]

TRSM
  Naive code:
    DO J = 1, N
      DO K = 1, N
        DO I = K + 1, N
          B(I,J) = B(I,J) - B(K,J)*A(I,K)
  Transformation recipe:
    permute([1,3,2]), tile(0,3,TK), split(0,2,L3>=L1+TK), tile(0,3,TI,2),
    tile(0,3,TJ,2), datacopy(0,3,2), datacopy(0,4,3,1), unroll(0,4,UJ1),
    unroll(0,5,UI1), datacopy(1,2,3,1), unroll(1,2,UJ2), unroll(1,3,UI2)
  Constraints:
    TK × TK ≤ (1/2)(sizeL2/2), TK × TJ ≤ (1/2)(sizeL1/2), TK × TI ≤ (1/2)(sizeL2/2),
    UI1 × UJ1 ≤ sizeR, UI2 × UJ2 ≤ sizeR,
    TI, TJ, TK ∈ [0, 2, 4, ..., 512], UI1, UJ1, UI2, UJ2 ∈ [1, 2, ..., 16]

Jacobi
  Naive code:
    DO K = 2, N-1
      DO J = 2, N-1
        DO I = 2, N-1
          A(I,J,K) = C*(B(I-1,J,K)+B(I+1,J,K)+B(I,J-1,K)
                       +B(I,J+1,K)+B(I,J,K-1)+B(I,J,K+1))
  Transformation recipe:
    original(), tile(0,3,TI), tile(0,3,TJ), tile(0,3,TK), unroll(0,5,UJ)
  Constraints:
    TI, TJ, TK ∈ [0, 2, 4, ..., 512], UJ ∈ [1, 2, ..., 16]

6 Experimental Results

In this section, we present an experimental evaluation of our framework. First, we use a Matrix Multiplication kernel to explore the effectiveness of PRO-C on the search space for loop transformation parameters. We study how the size of the initial simplex (and hence the degree of parallelism) affects the convergence and performance of the search algorithm. In the second part, we use our framework to optimize two additional computational kernels: Triangular Solver (TRSM) and Jacobi. The use of linear algebra kernels (Matrix Multiplication and Triangular Solver) was motivated by our goal of comparing the effectiveness of our framework to well-tuned codes. The results for the Jacobi kernel show that our underlying polyhedral framework is a general-purpose loop transformation tool that can handle arbitrary code beyond the linear algebra domain. In addition, MM, TRSM and Jacobi all exhibit complex parameter interactions (discussed in section 2) on today's computer architectures. For all the kernels, we provide the original code, the transformation recipe and the constraints on unbound parameters in Table 1.

The experiments were performed on a 64-node Linux cluster. Each node is equipped with dual Intel Xeon 2.66 GHz (SSE2) processors. The L1-cache and L2-cache sizes are 128 KB and 4096 KB, respectively. We compare the performance of our code versions with those of the native compiler (ifort 10.0.026, compiled with -O3 -xN). When compiling our transformed code, we turn off the native compiler's loop transformations to prevent them from interfering with our optimizations. For Matrix Multiplication and Triangular Solver, we present the performance of the ATLAS (version 3.8) self-tuning libraries. In addition to a near-exhaustive sampling of the search space, ATLAS uses carefully hand-tuned BLAS routines contributed by expert programmers. To make a meaningful comparison, we also provide the performance of the search-only version of ATLAS, i.e., code generated by the ATLAS Code Generator via pure empirical search. The search-only version was generated by disabling the use of architectural defaults and turning off the use of hand-coded BLAS routines. For all our experiments, unroll factors and tile sizes are constrained by the storage capacity of their associated memory hierarchy levels. In addition, for tile sizes, we use a simple heuristic which tries to fit references with temporal reuse into half of the cache, leaving the other half for references with spatial or no reuse.
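The sketch below (ours) illustrates the flavor of this heuristic, assuming cache sizes are expressed in array elements and that a tile with footprint TK × TI elements must fit in the half of the cache reserved for temporal reuse; the exact constraints used in the experiments are the ones listed in Table 1.

# Sketch of deriving tile-size constraints from cache capacity, assuming
# cache sizes are given in array elements and half of each cache is reserved
# for references with temporal reuse (the other half for references with
# spatial or no reuse). Illustrative only.

def admissible_tiles(tile_candidates, cache_elements):
    budget = cache_elements // 2  # half of the cache for temporal reuse
    return [(tk, ti) for tk in tile_candidates for ti in tile_candidates
            if tk * ti <= budget]

# Example: a 4096 KB L2 cache holding 8-byte (double precision) elements.
l2_elements = 4096 * 1024 // 8
tiles = admissible_tiles(range(2, 514, 2), l2_elements)
print(len(tiles), max(tiles, key=lambda t: t[0] * t[1]))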

6.1 Performance of PRO-C

In this section, we use Matrix Multiplication (MM) to demonstrate the effectiveness of parallel search. The optimization strategy reflected in the transformation recipe in Table 1 exploits the reuse of C(I,J) in registers, and the reuse of A(I,K) and B(K,J) in caches (A and B have the same amount of temporal reuse, carried by different loops). The transformation recipe applies tiling to B in the L1 cache and A in the L2 cache. Data copying is applied to avoid conflict misses. In addition, to expose SSE optimization opportunities to the Intel compiler, the copying of A transposes the data into the temporary array. The values for the five unbound parameters TI, TJ, TK, UI and UJ are determined by the search algorithm.

[Plot omitted: speedup over the native compiler vs. search steps for 2N (10 Nodes), 4N (20 Nodes), 8N (40 Nodes) and 12N (60 Nodes) simplices.]

Figure 5. Effects of Different Degree of Parallelism on the Convergence of PRO-C.

To study the effect of simplex size, we considered four alternative simplex sizes: 2N (10 Nodes), 4N (20 Nodes), 8N (40 Nodes) and 12N (60 Nodes), where N is the number of unbound parameters (N = 5 for this experiment). Each simplex was constructed around the same initial point, which was randomly selected from the search space at the beginning of the experiment. The search algorithm was run for a square matrix of size 800 × 800. The results for this experiment are summarized in Table 2.

Figure 5 shows the performance of the best point in the simplex across search steps. Searches conducted with the 12N and 8N simplices clearly use fewer search steps than searches conducted with smaller simplices. Recall from our discussion in section 2 and from Figure 1 that the loop transformation parameter space is not smooth and contains multiple local minima and maxima. The existence of long stretches of consecutive search steps with minimal or no performance improvement (marked by arrows in Figure 5) in the 2N and 4N cases shows that more search steps are required to get out of local minima for smaller simplices. At the same time, by effectively harnessing the underlying parallelism, the 8N and 12N simplices evaluate more unique parameter configurations (see Table 2) and get out of local minima at a faster rate.

Table 2. MM Results - Alternate Simplex Sizes

                               2N     4N     8N    12N
  Number of Function Evals.   276    571    750    961
  Number of Search Steps       49     32     22     18
  Speedup over Native        2.30   2.33   2.32   2.33

[Histogram omitted: percentage of the total samples exceeding a given MFLOPS value; 1.7% of the 100K samples exceed 3 GFLOPS.]

Figure 6. Performance Distribution for randomly chosen MM Configurations.

Results summarized in Table 2 also show that as the simplex size increases, the number of search steps decreases, thereby confirming the effectiveness of increased parallelism. Using a 12N initial simplex, the search converges to a solution 2.7 times faster than using a 2N initial simplex.

The next question regarding the effectiveness of our framework relates to the quality of the search result. To answer this question, we selected 100,000 uniformly distributed samples from the search space, which has over 70 million total points, and evaluated the performance associated with all the samples. The performance distribution is shown in Figure 6. Approximately 1.7% of the total samples report performance greater than 3 GFLOPS. The best performance (3.22 GFLOPS) was associated with the configuration TI = 160, TJ = 6, TK = 162, UI = 1 and UJ = 6. For the same problem size, our code delivers 3.17 GFLOPS. The result demonstrates PRO-C's effectiveness on compiler-generated search spaces.

Finally, Figure 7 shows the performance of the code variant produced by a 12N simplex across a range of problem sizes, along with the performance of the native compiler and ATLAS' search-only and full versions. Our code version performs, on average, 2.36 times faster than the native compiler. The performance is 1.66 times faster than the search-only version of ATLAS. Our code variant also performs within 20% of ATLAS' full version (with processor-specific hand-coded assembly).

[Plot omitted: GFLOPS vs. matrix size (N) for Ifort, ATLAS search-only, Harmony-CHiLL and ATLAS Full.]

Figure 7. Results for MM Kernel.

6.2 Triangular Solver (TRSM)

The optimization strategy for the TRSM kernel is outlined in its transformation recipe provided in Table 1. Two inner loops are permuted to reuse B(I,J) in registers, and loops I and J are unrolled. For data reuse in cache, loop K is tiled first. The splitting condition is based on the decision to separate the read access B(I,J) from the write access B(K,J). After splitting, one subloop has non-overlapping read and write accesses, and it is optimized in the same way as matrix multiplication. The other subloop has only one non-overlapping read access, A(I,K), for which data copy is applied to reduce cache conflict misses caused by this array reference.

Unbound parameters in the transformation recipe, TI, TJ, TK, UI1, UJ1, UI2 and UJ2, form a seven-dimensional parameter space. PRO-C used a 60-point simplex and converged to a solution in 55 steps, evaluating 1,579 unique parameter configurations. Figure 8 shows the performance of the code variant along with the performance of the native compiler and both ATLAS versions. The parameter configuration selected by PRO-C performs, on average, 3.62 times faster than the native Intel compiler. The performance, on average, is 1.07 times faster than the search-only version of ATLAS. However, the ATLAS full version (with processor-specific hand-tuned assembly) is 1.55 times faster than our code variant.

[Plot omitted: GFLOPS vs. matrix size (N) for Ifort, ATLAS search-only, Harmony-CHiLL and ATLAS Full.]

Figure 8. Results for TRSM Kernel.

[Plot omitted: MFLOPS vs. matrix size (N) for Ifort and Harmony-CHiLL.]

Figure 9. Results for Jacobi Kernel.

6.3 Jacobi

The transformation recipe provided in Table 1 outlines the optimization strategy we use for this kernel. Since only array B has reuse in three dimensions, the loops are tiled on three dimensions for reuse in the L1 or L2 cache. Arrays A and B access data in the loop nest in the same order as the dimensionality of the iteration space; thus the original loop order is best for spatial reuse in cache and TLB. Finally, loop J is unrolled for register reuse. The four unbound parameters in the script, TI, TJ, TK and UJ, form a four-dimensional parameter space.

PRO-C took 23 steps (870 unique function evaluations) to converge to TI = 0, TJ = 22, TK = 0 and UJ = 1. The results TK = 0 and TI = 0 suggest that no tiling is needed for the K and I loops; tiling only the J loop produces the best performance. Also, no unrolling is performed. We suspect that the native compiler's scalar replacement cannot take advantage of the available register reuse across the I dimension, so there is little benefit from unrolling J. Figure 9 shows the performance of our code variant. On average, our code variant performs 1.35 times faster than the native Intel compiler.

7 Related Work

There are many research projects working on empirical optimization of linear algebra kernels and domain-specific libraries. ATLAS [21] uses the technique to generate highly optimized BLAS routines. It uses a near-exhaustive orthogonal search (searching one dimension at a time while keeping the rest of the parameters fixed). The OSKI (Optimized Sparse Kernel Interface) [20] library provides automatically tuned computational kernels for sparse matrices. FFTW [9] and SPIRAL [22] are domain-specific libraries. FFTW combines static models with empirical search to optimize FFTs. SPIRAL generates empirically tuned Digital Signal Processing (DSP) libraries. Rather than focusing on one particular domain, our framework aims at providing a general-purpose, compiler-based approach to tuning code.

Recently, many research projects on compiler transformation frameworks have focused on facilitating the exploration of a large optimization space of possible compiler transformations and their parameter values. TLOG [13] is a code generator for parameterized tiled loops where tile sizes are symbolic parameters. Symbolic tile sizes enable static or run-time tile-size optimization without repeatedly generating the code and recompiling it for each tile size. POET [23] is a transformation scripting language embedded in an arbitrary programming language. It is interpreted by a POET compiler to apply source-to-source code transformations. The Interactive Compilation Interface (ICI) [10] provides a flexible and portable interface to internal compiler optimizations so that iterative optimization [1] can be applied at the loop or instruction level by adjusting optimization decisions externally. WRaP-IT [11] and Petit [12] are both polyhedral loop transformation frameworks that support composition of transformations. They support many high-level loop transformations on perfect loop nests in a single transformation step and, by composing many low-level transformations on each individual loop, they also support arbitrary loop transformations on imperfect loop nests. LeTSeE [17] is an iterative optimization tool based on the polyhedral model. It finds all legal affine schedulings of a loop nest and explores this space to find the best scheduling and parameter values. Pluto [4] is an automatic parallelization and locality optimization tool also based on the polyhedral model.

There has also been some work on using search techniques to explore compiler-generated parameter spaces. Kisuki et al. [14] address the problem of selecting tile sizes and unroll factors simultaneously. Different search algorithms are used to explore the parameter space: Genetic algorithms, Simulated Annealing, Pyramid search, Window search and Random search. Qasem et al. [18] use a modified version of a pattern-based direct search algorithm to explore the same search space. Our work considers a much broader range of loop transformations. Also, Kisuki et al. report converging to a solution in hundreds of iterations. By effectively utilizing the underlying parallel infrastructure, we converge to solutions in a few tens of iterations.

8 Conclusion

In this paper, we integrated the capabilities of Active Harmony and CHiLL to create a unique and powerful framework that is capable of both fully automated code transformation and parameter search as well as user-assisted transformation combined with automatic parameter search. The resulting framework employs a parallel search technique to simultaneously evaluate different combinations of compiler optimizations. Our system is demonstrated on three computational kernels, compiled and tuned automatically and in parallel, achieving performance that greatly exceeds the Intel compiler and is comparable to (and sometimes exceeds) the near-exhaustive search of the ATLAS library system.

Our work on this topic is just beginning; in the near term, we plan to explore optimizing larger programs within our framework. We also plan to combine our current offline optimization approach with online optimization of application parameters.

Acknowledgements. This work was supported in part by DOE grants DE-CFC02-01ER25489, DE-FG02-01ER25510, DE-FC02-06ER25763, DE-FC02-06ER25765 and DE-FG02-08ER25834, by NSF awards EIA-0080206 and CSR-0615412, and by a gift from Intel Corporation.

References

[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2004.

[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891-923, 1998.

[3] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the 1997 ACM International Conference on Supercomputing, June 1997.

[4] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.

[5] C. Chen. Model-Guided Empirical Optimization for Memory Hierarchy. PhD thesis, University of Southern California, 2007.

[6] C. Chen, J. Chame, and M. Hall. CHiLL: A framework for composing high-level loop transformations. Technical report, University of Southern California, 2008.

[7] C. Chen, J. Chame, and M. W. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proceedings of the International Symposium on Code Generation and Optimization, Mar. 2005.

[8] I.-H. Chung and J. K. Hollingsworth. Using information from prior runs to improve automated tuning systems. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 30, Washington, DC, USA, 2004. IEEE Computer Society.

[9] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, May 1999.

[10] G. Fursin and A. Cohen. Building a practical iterative compiler. In Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART'09), Jan. 2007.

[11] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming, 34(3):261-317, June 2006.

[12] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Library interface guide. Technical Report CS-TR-3445, University of Maryland at College Park, Mar. 1995.

[13] D. Kim, L. Renganarayanan, D. Rostron, S. Rajopadhye, and M. M. Strout. Multi-level tiling: M for the price of one. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1-12, New York, NY, USA, 2007. ACM.

[14] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. In PACT '00: Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, page 237, Washington, DC, USA, 2000. IEEE Computer Society.

[15] D. M. Mount. http://www.cs.umd.edu/~mount/ANN/. [Last accessed: Feb 09, 2009].

[16] Y. Nelson, B. Bansal, M. Hall, A. Nakano, and K. Lerman. Model-guided performance tuning of parameter values: A case study with molecular dynamics visualization. In IPDPS 2008: IEEE International Symposium on Parallel and Distributed Processing, pages 1-8, April 2008.

[17] L.-N. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pages 90-100, Tucson, Arizona, June 2008. ACM Press.

[18] A. Qasem, K. Kennedy, and J. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. J. Supercomput., 36(2):183-196, 2006.

[19] V. Tabatabaee, A. Tiwari, and J. K. Hollingsworth. Parallel parameter tuning for applications with performance variability. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 57, Washington, DC, USA, 2005. IEEE Computer Society.

[20] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521-530, June 2005.

[21] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of Supercomputing '98, Nov. 1998.

[22] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2001.

[23] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. POET: Parameterized optimizations for empirical tuning. In IPDPS 2007: IEEE International Parallel and Distributed Processing Symposium, pages 1-8, March 2007.

[24] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, 93(2):358-386, Feb. 2005.

