Post on 18-Jan-2018
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology
Presented by: Luis Ortiz
Department of Computer Science, The University of Texas at San Antonio
F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich, University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
Outline
• Overview
• The Problem
• Reconfigurable Architectures
• Design Flow for Regular Mapping
• Parallelizing Transformations
• Constraints Related to CG Reconfigurable Arrays
• Case Study
• Results
• Conclusions and Future Work
Overview
Constructing a parallel program is equivalent to specifying its execution order
• the operations of a program form a set, and its execution order is a binary, transitive, and asymmetric relation
• the relevant sets are (unions of) Z-polytopes
• most of the optimizations may be presented as transformations of the original program
The problem of automatic parallelization
• given a set of operations E and a strict total order on it
• find a partial order on E such that execution of E under it is determinate and gives the same results as the original program
Overview (cont.)
Defining a polyhedron
• a set of linear inequalities: Ax + a ≥ 0
• the polyhedron is the set of all x which satisfy these inequalities
• the basic property of a polyhedron is convexity:
  • if two points a and b belong to a polyhedron, then so do all convex combinations λa + (1 – λ)b, 0 ≤ λ ≤ 1
• a bounded polyhedron is called a polytope
Overview (cont.)
The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
• the iteration domain of statement S: Dom(S) = {x | Dsx + ds ≥ 0}
• Ds and ds are the matrix and constant vector which define the iteration polytope; ds may depend linearly on the structure parameters
Overview (cont.)
Coarse-grained reconfigurable architectures
• provide the flexibility of software combined with the performance of hardware
• but hardware complexity is a problem due to a lack of mapping tools
Parallelization techniques and compilers
• map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays
The Problem
“Mapping a certain class of regular nested loop programs onto a dedicated processor array”
Reconfigurable Architectures
Span a wide range of abstraction levels
• from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components
Efficiency comparison
• standard arithmetic is less efficient on fine-grained architectures, due to the large routing area overhead
Little research work deals with compilation to coarse-grained reconfigurable architectures
Design Flow for Regular Mapping
A piecewise regular algorithm contains N quantified equations
• each equation Si[I] is of the form

  Si[I]:  xi[I] = fi( …, xj[I − dji], … ),  ∀I ∈ Ii

• xi[I] are indexed variables
• fi are arbitrary functions
• dji ∈ ℤn are constant data dependence vectors relating each equation to its arguments
• Ii are called index spaces
Design Flow for Regular Mapping (cont.)
Linearly bounded lattice
• I = {I ∈ ℤn | I = Mκ + c ∧ Aκ ≥ b}, where κ ∈ ℤl, M ∈ ℤn×l, c ∈ ℤn, A ∈ ℤm×l, b ∈ ℤm
• the integer points κ of the polyhedron Aκ ≥ b are affinely mapped onto iteration vectors I using the affine transformation I = Mκ + c
Block pipelining period (β)
• time interval between the initiations of two successive problem instances
Design Flow for Regular Mapping (cont.)
Parallelizing Transformations
Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:
• Affine Transformations
• Localization
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
Constraints Related to CG Reconfigurable Arrays
Coarse-grained (re)configurable architectures consist of an array of processor elements (PE), each containing:
• one or more dedicated functional units, or one or more arithmetic logic units (ALU)
• memory
  • local memory → register files
  • memory banks
  • an instruction memory is required if the PE contains an instruction-programmable ALU
• interconnect structures
• I/O ports
• synchronization and reconfiguration mechanisms
Case Study
Regular mapping methodology applied to a matrix multiplication algorithm
• target architecture: PACT XPP64-A reconfigurable processor array
  • 64 ALU-PAEs of 24-bit data width in an 8x8 array
  • each ALU-PAE consists of three objects:
    • the ALU object
    • Back-Register object (BREG)
    • Forward-Register object (FREG)
  • all objects are connected to horizontal routing channels
Case Study (cont.)
• RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
• RAM can be configured to FIFO mode
• each RAM-PAE has a 512x24-bit storage capacity
• four independent I/O interfaces are located in the corners of the array
Structure of the PACT XPP64-A reconfigurable processor
ALU-PAE objects
Case Study (cont.)
Matrix multiplication algorithm
• C = A · B, with A ∈ ℤN×N and B ∈ ℤN×N
• computations may be represented by a dependence graph (DG)
• dependence graphs can be represented in a reduced form
  • Reduced Dependence Graph: to each edge e = (vi, vj) there is associated a dependence vector dij ∈ ℤn
• virtual processor elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
Matrix multiplication algorithm, C-code
Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization
Case Study (cont.)
DG of the transformed matrix multiplication algorithm, N = 2
Reduced dependence graph
4 x 4 processor array
Case Study (cont.)
Output data
• Ox is the output-variable space of variable x of the space-time mapped or partitioned index space
• the output can be two-dimensional
• the transformed output variables are distributed over the entire array
• the data are collected from one processor line PL and fed out to an array border
• m ∈ ℤ1×n denotes the time instances t ∈ Tx(Pi,j) at which the variable x produces an output at processor element Pi,j
Case Study (cont.)
• if one of the following conditions holds, the output data can be serialized
a) Dataflow graph of the LPGS-partitioned matrix multiplication 4 x 4 example
b) Dataflow graph after performing localization inside each tile
c) Array implementation of the partitioned example
Partitioned implementation of the matrix multiplication algorithm
Case Study (cont.)
Results
Both implementations (full-size and partitioned) show optimal utilization of resources
Each configured MAC-unit performs one operation per cycle
It is observed that a better implementation using fewer resources can achieve more performance per cycle
The number of ALUs is reduced from 3N to N
Merging and writing of output data streams is overlapped with computations in PEs
Conclusions and Future Work
The mapping methodology based on loop parallelization in the polytope model provides results that are efficient in terms of utilization of resources and execution time
Future work focuses on automatic compilation of nested loop programs