Post on 18-Jan-2018
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology
Presented by: Luis Ortiz
Department of Computer Science, The University of Texas at San Antonio
F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich, University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
Outline
• Overview
• The Problem
• Reconfigurable Architectures
• Design Flow for Regular Mapping
• Parallelizing Transformations
• Constraints Related to CG Reconfigurable Arrays
• Case Study
• Results
• Conclusions and Future Work
Overview
Constructing a parallel program is equivalent to specifying its execution order
• the operations of a program form a set, and its execution order is a binary, transitive, and asymmetric relation
• the relevant sets are (unions of) Z-polytopes
• most of the optimizations may be presented as transformations of the original program
The problem of automatic parallelization
• given a set of operations E and a strict total order on it
• find a partial order on E such that execution of E under it is determinate and gives the same results as the original program
Overview (cont.)
Defining a polyhedron
• a set of linear inequalities: Ax + a ≥ 0
• the polyhedron is the set of all x which satisfy these inequalities
• the basic property of a polyhedron is convexity:
  • if two points a and b belong to a polyhedron, then so do all convex combinations λa + (1 – λ)b, 0 ≤ λ ≤ 1
• a bounded polyhedron is called a polytope
Overview (cont.)
The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
• the iteration domain of statement S: Dom(S) = {x | Dsx + ds ≥ 0}
• Ds and ds are the matrix and constant vector which define the iteration polytope; ds may depend linearly on the structure parameters
Overview (cont.)
Coarse-grained reconfigurable architectures
• provide the flexibility of software combined with the performance of hardware
• but hardware complexity is a problem due to a lack of mapping tools
Parallelization techniques and compilers
• map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays
The Problem
“Mapping a certain class of regular nested loop programs onto a dedicated processor array”
Reconfigurable Architectures
Span a wide range of abstraction levels
• from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components
Efficiency comparison
• standard arithmetic is less efficient on fine-grained architectures, due to the large routing area overhead
Little research work deals with compilation to coarse-grained reconfigurable architectures
Design Flow for Regular Mapping
A piecewise regular algorithm contains N quantified equations
• each equation Si[I] is of the form

  Si[I]:  xi[I] = fi( …, xj[I − dji], … ),  ∀I ∈ Ii

• xi[I] are indexed variables
• fi are arbitrary functions
• dji ∈ ℤn are constant data dependence vectors relating each equation to its arguments
• Ii are called index spaces
Design Flow for Regular Mapping (cont.)
Linearly bounded lattice
• I = {I ∈ ℤn | I = Mκ + c ∧ Aκ ≥ b}, where κ ∈ ℤl, M ∈ ℤn×l, c ∈ ℤn, A ∈ ℤm×l, b ∈ ℤm
• the integer points κ of the polyhedron Aκ ≥ b are affinely mapped onto iteration vectors I using the affine transformation I = Mκ + c
Block pipelining period (β)
• time interval between the initiations of two successive problem instances
Design Flow for Regular Mapping (cont.)
Parallelizing Transformations
Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:
• Affine Transformations
• Localization
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
Constraints Related to CG Reconfigurable Arrays
Coarse-grained (re)configurable architectures consist of an array of processor elements (PE), each containing:
• one or more dedicated functional units, or one or more arithmetic logic units (ALU)
• memory
  • local memory → register files
  • memory banks
  • an instruction memory is required if the PE contains an instruction-programmable ALU
• interconnect structures
• I/O ports
• synchronization and reconfiguration mechanisms
Case Study
Regular mapping methodology applied to a matrix multiplication algorithm
• target architecture: PACT XPP64-A reconfigurable processor array
  • 64 ALU-PAEs of 24-bit data width in an 8x8 array
  • each ALU-PAE consists of three objects:
    • the ALU object
    • Back-Register object (BREG)
    • Forward-Register object (FREG)
  • all objects are connected to horizontal routing channels
Case Study (cont.)
• RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
• RAM can be configured to FIFO mode
• each RAM-PAE has a 512x24-bit storage capacity
• four independent I/O interfaces are located in the corners of the array
Structure of the PACT XPP64-A reconfigurable processor
ALU-PAE objects
Case Study (cont.)
Matrix multiplication algorithm
• C = A · B, with A ∈ ℤN×N and B ∈ ℤN×N
• computations may be represented by a dependence graph (DG)
• dependence graphs can be represented in a reduced form
  • Reduced Dependence Graph: to each edge e = (vi, vj) there is associated a dependence vector dij ∈ ℤn
• virtual processor elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
Matrix multiplication algorithm, C-code
Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization
Case Study (cont.)
DG of the transformed matrix multiplication algorithm, N = 2
Reduced dependence graph
4 x 4 processor array
Case Study (cont.)
Output data
• Ox is the output-variable space of variable x of the space-time mapped or partitioned index space
• the output can be two-dimensional
• the transformed output variables are distributed over the entire array
• the data are collected from one processor line PL and fed out to an array border
• m ∈ ℤ1×n denotes the time instances t ∈ Tx(Pi,j) at which the variable x produces an output at processor element Pi,j
Case Study (cont.)
• if one of the following conditions holds, the output data can be serialized
a) Dataflow graph of the LPGS-partitioned matrix multiplication 4 x 4 example
b) Dataflow graph after performing localization inside each tile
c) Array implementation of the partitioned example
Partitioned implementation of the matrix multiplication algorithm
Case Study (cont.)
Results
Both implementations (full-size and partitioned) show optimal utilization of resources
Each configured MAC-unit performs one operation per cycle
It is observed that a better implementation using fewer resources can achieve more performance per cycle
The number of ALUs is reduced from 3N to N
Merging and writing of output data streams is overlapped with computations in PEs
Conclusions and Future Work
The mapping methodology based on loop parallelization in the polytope model provides results that are efficient in terms of utilization of resources and execution time
Future work focuses on automatic compilation of nested loop programs