Post on 27-Dec-2015
transcript
Optimizing Loop Performance for Clustered VLIW Architectures
by
Yi Qian (Texas Instruments)
Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)
Clustered VLIW Architecture
Motivation
Clustered VLIW architectures have been adopted to improve ILP and keep the port requirement of the register files low.
The compiler must
Expose maximal parallelism,
Maintain minimal communication overhead.
High-level optimizations can improve loop performance on clustered VLIW machines.
Background
Software Pipelining – modulo scheduling
Achieve ILP by overlapping the execution of different loop iterations.
Initiation Interval (II)
ResII -- constraints from the machine resources.
RecII -- constraints from the dependence recurrences.
MinII = max(ResII, RecII)
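For concreteness, the two bounds on II can be computed as below; a minimal sketch, where the op counts, unit counts, and latencies are illustrative, not from the talk.

```c
#include <assert.h>

/* Ceiling division for non-negative a, positive b. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* ResII: for each resource class, ops needed / units available;
 * iterations cannot start faster than the busiest resource allows. */
static int res_ii(int alu_ops, int alu_units, int mem_ops, int mem_units) {
    int r1 = ceil_div(alu_ops, alu_units);
    int r2 = ceil_div(mem_ops, mem_units);
    return r1 > r2 ? r1 : r2;
}

/* RecII: a dependence cycle with total latency L spanning `distance`
 * iterations forces II * distance >= L. */
static int rec_ii(int cycle_latency, int cycle_distance) {
    return ceil_div(cycle_latency, cycle_distance);
}

static int min_ii(int res, int rec) { return res > rec ? res : rec; }
```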
Loop Transformations
Scalar Replacement
replace array references with scalar variables.
improve register usage
Original loop:

  for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
      a[i] = a[i] + b[j] * x[i][j];

After scalar replacement:

  for (i = 0; i < n; ++i) {
    t = a[i];
    for (j = 0; j < n; ++j)
      t = t + b[j] * x[i][j];
    a[i] = t;
  }
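As a quick self-check (data and sizes are made up), both versions of the loop above can be run on the same input and compared:

```c
#include <assert.h>

#define N 4

/* Original: a[i] is re-read and re-written on every inner iteration. */
static void original(double a[N], const double b[N], double x[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i] = a[i] + b[j] * x[i][j];
}

/* Scalar-replaced: the running sum lives in the register candidate t. */
static void scalar_replaced(double a[N], const double b[N], double x[N][N]) {
    for (int i = 0; i < N; ++i) {
        double t = a[i];
        for (int j = 0; j < N; ++j)
            t = t + b[j] * x[i][j];
        a[i] = t;
    }
}

/* Returns 1 if both versions produce identical results on sample data. */
static int versions_agree(void) {
    double a1[N] = {1, 2, 3, 4}, a2[N] = {1, 2, 3, 4};
    double b[N] = {1, 1, 2, 2};
    double x[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            x[i][j] = i + j;
    original(a1, b, x);
    scalar_replaced(a2, b, x);
    for (int i = 0; i < N; ++i)
        if (a1[i] != a2[i]) return 0;
    return 1;
}
```

The two perform the same floating-point operations in the same order, so the results match exactly.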
Loop Transformations
Unrolling
  reduces inter-iteration overhead
  enlarges loop body size
Unroll-and-jam
  balances the computation and memory-access requirements
  improves uMinII (MinII / unroll amount)

Original loop (uMinII = 4):

  for (i = 1; i <= 2*n; ++i)
    for (j = 1; j <= n; ++j)
      a[i][j] = a[i][j] + b[j] * c[j];

Unroll-and-jammed loop (uMinII = 3):

  for (i = 1; i <= 2*n; i += 2)
    for (j = 1; j <= n; ++j) {
      a[i][j]   = a[i][j]   + b[j] * c[j];
      a[i+1][j] = a[i+1][j] + b[j] * c[j];
    }

(Machine: 1 computational unit, 1 memory unit.)
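The unroll-and-jam transformation above can be sanity-checked with a tiny driver; trip counts and data below are illustrative (the outer count must be even for the jammed version):

```c
#include <assert.h>

#define M 4   /* outer trip count, must be even for the jammed version */
#define N 3

static void original(double a[M][N], const double b[N], const double c[N]) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = a[i][j] + b[j] * c[j];
}

/* Outer loop unrolled by 2 and the copies jammed into one inner loop:
 * two independent statements per iteration balance compute and memory. */
static void unroll_and_jam(double a[M][N], const double b[N], const double c[N]) {
    for (int i = 0; i < M; i += 2)
        for (int j = 0; j < N; ++j) {
            a[i][j]   = a[i][j]   + b[j] * c[j];
            a[i+1][j] = a[i+1][j] + b[j] * c[j];
        }
}

/* Returns 1 if both versions produce identical results on sample data. */
static int versions_agree(void) {
    double a1[M][N], a2[M][N], b[N] = {1, 2, 3}, c[N] = {2, 0.5, 1};
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            a1[i][j] = a2[i][j] = i * N + j;
    original(a1, b, c);
    unroll_and_jam(a2, b, c);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            if (a1[i][j] != a2[i][j]) return 0;
    return 1;
}
```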
Loop Transformations
Unroll-and-jam/unrolling
  generates intercluster parallelism

Independent loop and its unrolled form:

  for (i = 0; i < 2*n; ++i)
    a[i] = a[i] + 1;

  for (i = 0; i < 2*n; i += 2) {
    a[i] = a[i] + 1;        /* cluster 0 */
    a[i+1] = a[i+1] + 1;    /* cluster 1 */
  }

Recurrence and its unrolled form:

  for (i = 0; i < 2*n; ++i)
    a[i] = a[i-1] + 1;

  for (i = 0; i < 2*n; i += 2) {
    a[i] = a[i-1] + 1;      /* cluster 0 */
    a[i+1] = a[i] + 1;      /* cluster 1: needs a[i] from cluster 0 */
  }
Loop Transformations
Loop Alignment
  Removes loop-carried dependences
  Alignment conflicts
  Used to determine intercluster communication cost

Original loop:

  for (i = 1; i < n; ++i) {
    a[i] = b[i] + c[i];
    x[i] = a[i-1] * q;
  }

Aligned loop:

  x[1] = a[0] * q;
  for (i = 1; i < n-1; ++i) {
    a[i] = b[i] + c[i];
    x[i+1] = a[i] * q;
  }
  a[n-1] = b[n-1] + c[n-1];
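The aligned loop, with its prologue and epilogue, can be checked against the original with a small driver (n = 6 and the data are made up):

```c
#include <assert.h>

#define N 6

static void original(double a[N], double x[N], const double b[N],
                     const double c[N], double q) {
    for (int i = 1; i < N; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;   /* loop-carried: uses the previous a[i-1] */
    }
}

static void aligned(double a[N], double x[N], const double b[N],
                    const double c[N], double q) {
    x[1] = a[0] * q;                   /* prologue */
    for (int i = 1; i < N - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;           /* now loop-independent */
    }
    a[N - 1] = b[N - 1] + c[N - 1];    /* epilogue */
}

/* Returns 1 if both versions produce identical results on sample data. */
static int versions_agree(void) {
    double a1[N], a2[N], x1[N], x2[N], b[N], c[N], q = 3.0;
    for (int i = 0; i < N; ++i) {
        a1[i] = a2[i] = i + 1;
        x1[i] = x2[i] = 0;
        b[i] = 2 * i;
        c[i] = i - 1;
    }
    original(a1, x1, b, c, q);
    aligned(a2, x2, b, c, q);
    for (int i = 0; i < N; ++i)
        if (a1[i] != a2[i] || x1[i] != x2[i]) return 0;
    return 1;
}
```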
Alignment conflict examples:

  <1> for (i = 1; i < n; ++i) {
        a[i] = b[i] + q;
        c[i] = a[i-1] + a[i-2];   /* two distances: cannot align both */
      }

  <2> for (i = 1; i < n; ++i)
        a[i] = a[i-1] + b[i];     /* recurrence: alignment cannot remove it */
Related Work
Partitioning Problem
  Ellis -- BUG
  Capitanio et al. -- LC-VLIW
  Nystrom et al. -- cluster assignment & software pipelining
  Ozer et al. -- UAS
  Sanchez et al. -- unified method
  Hiser et al. -- RCG
  Aleta et al. -- pseudo-scheduler
Loop Transformations
Scalar Replacement
  Callahan et al. -- pipelined architectures
  Carr, Kennedy -- general algorithm
  Duesterwald -- data-flow framework
Loop Alignment
  Allen et al. -- shared-memory machines
Unrolling/Unroll-and-jam
  Callahan et al. -- pipelined architectures
  Carr, Kennedy -- ILP
  Carr, Guan -- linear algebra
  Carr -- cache, software pipelining
  Sarkar -- ILP, IC
  Sanchez et al. -- clustered machines
  Huang et al. -- clustered machines
  Shin et al. -- superword register files
Optimization Strategy

  Source Code
  → Unroll-and-jam/Unrolling, Scalar Replacement
  → Intermediate Code Generator
  → Data-flow Optimization, Value Cloning
  → Register Partitioning, Software Pipelining
  → Assembly Code Generator
  → Target Code
Our Method
Picking loops to unroll
Computing uMinII
Computing register pressure (see paper)
Determining unroll amounts
Picking Loops to Unroll
  l_a : the loop that carries the most dependences amenable to scalar replacement.
  l_p : the loop that contains the fewest alignment conflicts.
Computing uMinII
uRecII does not increase.
uResII: unroll l_a by U_a and l_p by U_p. With
  F = f × U_a × U_p    (computation operations after unrolling)
  M = M_l × U_n        (memory operations after unrolling)
the resource bound per original iteration is
  uResII = max{ F/U_F, M/U_M, C/U_C } / (U_a × U_p)
where C is the number of intercluster copies and U_F, U_M, U_C are the numbers of functional, memory, and copy units.
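A sketch of that computation in C; the parameter names and the integer rounding are my assumptions, since the talk defers the details to the paper:

```c
#include <assert.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* uResII for the unrolled loop: F computation ops, M memory ops, and C
 * intercluster copies compete for fu_units, mem_units, and copy_units
 * (each assumed >= 1).  Dividing by Ua*Up normalizes the bound back to
 * one iteration of the original loop. */
static int u_res_ii(int F, int fu_units, int M, int mem_units,
                    int C, int copy_units, int Ua, int Up) {
    int f = ceil_div(F, fu_units);
    int m = ceil_div(M, mem_units);
    int c = ceil_div(C, copy_units);
    int max = f > m ? f : m;
    if (c > max) max = c;
    return ceil_div(max, Ua * Up);   /* per original iteration */
}
```

For example, 8 computation ops on 2 units give a bound of 4 cycles; after dividing by an unroll amount of 4, that is 1 cycle per original iteration.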
Computing Communication Cost for Unrolled Loops
Intercluster Copies
  Multiple loops: see paper.
  Single loop:
    innermost loop is unrolled -- invariant dep., variant dep.
    innermost loop is not unrolled -- invariant dep., variant dep.
Unrolling a Single Loop: Variant Dep.

[Figure: a dependence of distance d_l(e) before unrolling; after unrolling, its copies connect statements v_0 ... v_{U_a-1} to sinks w_0 ... w_n across the clusters.]

  uC_l(e) = number of copies of e per cluster whose sink lands on a different cluster.
  Sinks of the new dependences: copy m reaches copy (m + d_l(e)) mod (U_a × U_p).
  Total cost: C_l = Σ_{e ∈ E_l} uC_l(e) × U_p
Unrolling a Single Loop: Variant Dep. (Special Cases)

If d_l(e) mod (U_a × U_p) = 0, then uC_l(e) = 0:

  /* 4 clusters: Ua = 1, Up = 4 */
  for (i = 0; i < 4*n; i += 4) {
    a[i]   = a[i-4];
    a[i+1] = a[i-3];
    a[i+2] = a[i-2];
    a[i+3] = a[i-1];
  }

If d_l(e) ≤ U_a, then uC_l(e) = d_l(e):

  /* 2 clusters: Ua = 3, Up = 2 */
  for (i = 0; i < 6*n; i += 6) {
    a[i]   = a[i-2];  a[i+1] = a[i-1];  a[i+2] = a[i];
    a[i+3] = a[i+1];  a[i+4] = a[i+2];  a[i+5] = a[i+3];
  }
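These special cases can be checked with a small simulation. The cluster-assignment policy below (copy m of the statement placed on cluster m / U_a, i.e., blocks of U_a consecutive copies per cluster) is my assumption, chosen because it reproduces the slide's two examples:

```c
#include <assert.h>

/* Count intercluster copy operations for one variant dependence of
 * distance d after unrolling by Ua*Up.  Assumption (not from the talk):
 * copy m of the statement runs on cluster m / Ua, so each cluster holds
 * a block of Ua consecutive copies. */
static int intercluster_copies(int d, int Ua, int Up) {
    int U = Ua * Up, count = 0;
    for (int m = 0; m < U; ++m) {
        int sink = (m + d) % U;      /* copy that consumes the value */
        if (m / Ua != sink / Ua)     /* different cluster => copy op */
            ++count;
    }
    return count;
}
```

With d = 4, Ua = 1, Up = 4 the count is 0 (first special case); with d = 2, Ua = 3, Up = 2 it is 4, i.e., uC_l(e) × U_p = d_l(e) × U_p (second special case).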
Unrolling a Single Loop: Invariant Dep.

The U_a × U_p references can be reduced to one by scalar replacement; U_p - 1 clusters need a copy operation.

  for (j = 1; j <= 4*n; ++j)
    for (i = 1; i <= m; ++i)
      a[j][i] = a[j][i-1] + b[i];

  for (j = 1; j <= 4*n; j += 4)
    for (i = 1; i <= m; ++i) {
      t = b[i];
      a[j][i]   = a[j][i-1]   + t;
      a[j+1][i] = a[j+1][i-1] + t;
      a[j+2][i] = a[j+2][i-1] + t;
      a[j+3][i] = a[j+3][i-1] + t;
    }

Total cost: C_l = Σ_{e ∈ E_l^I} (U_p - 1)
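The unrolled, scalar-replaced loop above can be checked against the original with a small driver (trip counts are illustrative; the outer count must be a multiple of 4):

```c
#include <assert.h>

#define NJ 8   /* outer trip count, a multiple of the unroll amount 4 */
#define NI 4

static void original(double a[NJ+1][NI+1], const double b[NI+1]) {
    for (int j = 1; j <= NJ; ++j)
        for (int i = 1; i <= NI; ++i)
            a[j][i] = a[j][i-1] + b[i];
}

/* Outer loop unrolled by 4; the invariant b[i] is scalar-replaced into t,
 * so one cluster loads it and Up - 1 copies would broadcast it. */
static void unrolled(double a[NJ+1][NI+1], const double b[NI+1]) {
    for (int j = 1; j <= NJ; j += 4)
        for (int i = 1; i <= NI; ++i) {
            double t = b[i];
            a[j][i]   = a[j][i-1]   + t;
            a[j+1][i] = a[j+1][i-1] + t;
            a[j+2][i] = a[j+2][i-1] + t;
            a[j+3][i] = a[j+3][i-1] + t;
        }
}

/* Returns 1 if both versions produce identical results on sample data. */
static int versions_agree(void) {
    double a1[NJ+1][NI+1], a2[NJ+1][NI+1], b[NI+1];
    for (int j = 0; j <= NJ; ++j)
        for (int i = 0; i <= NI; ++i)
            a1[j][i] = a2[j][i] = j + 0.5 * i;
    for (int i = 0; i <= NI; ++i)
        b[i] = i * i;
    original(a1, b);
    unrolled(a2, b);
    for (int j = 0; j <= NJ; ++j)
        for (int i = 0; i <= NI; ++i)
            if (a1[j][i] != a2[j][i]) return 0;
    return 1;
}
```

Each row j depends only on its own earlier column, so processing rows in blocks of four preserves the result exactly.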
Determining Unroll Amounts

Integer optimization problem:
  minimize uMinII
  subject to U_a, U_p ≥ 1 and register pressure R ≤ R_M
Solved by exhaustive search or a heuristic method.
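A minimal sketch of the exhaustive search; the toy cost and pressure models, the search bound `max_u`, and all names are illustrative, not from the talk:

```c
#include <assert.h>

typedef int (*cost_fn)(int Ua, int Up);

/* Exhaustive search: pick (Ua, Up) minimizing uMinII subject to the
 * register-pressure limit R(Ua, Up) <= Rmax. */
static void best_unroll(cost_fn u_min_ii, cost_fn pressure, int Rmax,
                        int max_u, int *best_a, int *best_p) {
    int best = -1;
    *best_a = *best_p = 1;
    for (int Ua = 1; Ua <= max_u; ++Ua)
        for (int Up = 1; Up <= max_u; ++Up) {
            if (pressure(Ua, Up) > Rmax) continue;   /* infeasible point */
            int c = u_min_ii(Ua, Up);
            if (best < 0 || c < best) {
                best = c; *best_a = Ua; *best_p = Up;
            }
        }
}

/* Toy models for testing: more unrolling lowers II but raises pressure. */
static int toy_ii(int Ua, int Up)       { return 16 / (Ua * Up) + 1; }
static int toy_pressure(int Ua, int Up) { return 4 * Ua * Up; }

static int best_product(void) {
    int a, p;
    best_unroll(toy_ii, toy_pressure, 16, 4, &a, &p);
    return a * p;   /* the toy pressure model caps Ua*Up at 4 for Rmax = 16 */
}
```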
Experimental Results
Benchmarks
119 DSP loops from TI's benchmark suite
DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
Architectures
URM, a simulated architecture
8 functional units - 2 clusters, 4 clusters (1 copy unit)
16 functional units - 2 clusters, 4 clusters (2 copy units)
TMS320C64x
Unroll-and-jam/unrolling is applicable to 71 loops.
URM Speedups: Transformed vs. Original
  Width                   8             16
  Clusters              2     4      2     4
  Speedup (harmonic)  1.39  1.68   1.40  1.43
  Speedup (median)    1.52  1.78   1.60  1.60
  Loops improved        50    69     50    51
Our Algorithm vs. Fixed Unroll Amounts
Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

  Width                          8             16
  Clusters                     2     4      2     4
  Speedup (harmonic)         1.00  0.91   1.00  1.07
  Speedup (harmonic, fixed)  0.88  0.84   0.88  0.95
  # of loops                    9     4      9    21
C64x Results
TMS320C64x Speedups: Unrolled vs. Original
  Speedup (harmonic)  1.7
  Speedup (median)    2.0
  Loops improved       55
Accuracy of Communication Cost Model
Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops
2-cluster: 66 exact predictions; 4-cluster: 64 exact predictions.
Conclusion
  Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
  70%-90% of the 71 loops were improved, with speedups of 1.4-1.7.
  High-level transformations should be an integral part of compilation for clustered VLIW machines.