A Prototypical Self-Optimizing Package for Parallel Implementation
of Fast Signal Transforms
Kang Chen and Jeremy Johnson
Department of Mathematics and Computer Science
Drexel University
Motivation and Overview
• High performance implementation of critical signal processing kernels
• A self-optimizing parallel package for computing fast signal transforms
– Prototype transform (WHT)
– Build on existing sequential package
– SMP implementation using OpenMP
• Part of SPIRAL project
– http://www.ece.cmu.edu/~spiral
Outline
• Walsh-Hadamard Transform (WHT)
• Sequential performance and optimization using dynamic programming
• A parallel implementation of the WHT
• Parallel performance and optimization including parallelism in the search
Walsh-Hadamard Transform
The WHT of a signal x of size N = 2^n is y = WHT_{2^n} x, where

    WHT_{2^n} = WHT_2 ⊗ WHT_2 ⊗ ··· ⊗ WHT_2   (n factors),

⊗ is the tensor product, and

    WHT_2 = [ 1   1 ]
            [ 1  -1 ]

Fast WHT algorithms are obtained by factoring the WHT matrix, e.g.

    WHT_4 = [ 1  1  1  1 ]
            [ 1 -1  1 -1 ]  =  WHT_2 ⊗ WHT_2  =  (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)
            [ 1  1 -1 -1 ]
            [ 1 -1 -1  1 ]

In general, for n = n_1 + ··· + n_t,

    WHT_{2^n} = ∏_{i=1}^{t} ( I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}} )
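For illustration, the definition and the sample factorization can be checked numerically. This is a plain-Python sketch (the helper functions `kron`, `identity`, and `matmul` are ours, not part of the SPIRAL package):

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    cols = range(len(b[0]))
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in cols]
            for i in range(len(a))]

WHT2 = [[1, 1], [1, -1]]

# WHT_4 = WHT_2 (x) WHT_2 ...
WHT4 = kron(WHT2, WHT2)

# ... equals the factorization (WHT_2 (x) I_2)(I_2 (x) WHT_2)
factored = matmul(kron(WHT2, identity(2)), kron(identity(2), WHT2))
assert factored == WHT4
```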
SPIRAL WHT Package
• All WHT algorithms have the same arithmetic cost, O(N lg N), but different data access patterns
• Different factorizations lead to varying amounts of recursion and iteration
• Small transforms (sizes 2^1 to 2^8) are implemented in straight-line code to reduce overhead
• The WHT package allows exploration of different algorithms and implementations
• Optimization/adaptation to architectures is performed by searching for the fastest algorithm
Johnson and Püschel: ICASSP 2000
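The straight-line leaves can be pictured with a fully unrolled WHT_4: two butterfly stages, no recursion or loop overhead. This is a hypothetical sketch in Python for illustration only; the package itself generates such code in C:

```python
def wht4_straightline(x):
    """Fully unrolled WHT_4 = (WHT_2 (x) I_2)(I_2 (x) WHT_2)."""
    # First stage: I_2 (x) WHT_2 -- butterflies on adjacent pairs
    t0 = x[0] + x[1]; t1 = x[0] - x[1]
    t2 = x[2] + x[3]; t3 = x[2] - x[3]
    # Second stage: WHT_2 (x) I_2 -- butterflies at stride 2
    return [t0 + t2, t1 + t3, t0 - t2, t1 - t3]

# Applying it to a unit impulse yields a column of the WHT_4 matrix.
assert wht4_straightline([1, 0, 0, 0]) == [1, 1, 1, 1]
```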
Dynamic Programming
• Exhaustive search: try all possible algorithms
– Cost is Θ(4^n / n^{3/2}) for binary factorizations
• Dynamic programming: search only among algorithms generated from previously determined best algorithms
– Cost is O(n^2) for binary factorizations
[Diagram: DP example — the best algorithms already found at sizes 2^4 and 2^9 (itself split as 2^5 · 2^4) are combined into a possibly best algorithm at size 2^13 = 2^4 · 2^9.]
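The DP recurrence can be sketched as follows. This is a Python illustration with a made-up cost model standing in for measured runtimes; the real package times each candidate formula on the target machine:

```python
def time_of(tree):
    """Placeholder cost model (the real search measures runtimes)."""
    if isinstance(tree, int):
        return 1 << tree                      # straight-line leaf of size 2^n
    _, left, right = tree
    return 1.1 * (time_of(left) + time_of(right))

def dp_search(n, best={}):
    """Best binary partition tree for WHT of size 2^n, built from
    previously determined best subtrees (O(n^2) subproblems instead of
    the Theta(4^n / n^{3/2}) trees of exhaustive search)."""
    if n in best:
        return best[n]
    candidates = [n] if n <= 8 else []        # straight-line code up to 2^8
    for k in range(1, n):                     # binary split n = k + (n - k)
        candidates.append(('split', dp_search(k), dp_search(n - k)))
    best[n] = min(candidates, key=time_of)
    return best[n]

tree = dp_search(13)
```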
Performance of WHT Algorithms
• Iterative algorithms have less overhead
• Recursive algorithms have better data locality
• The best WHT algorithms are a compromise between low overhead and a good data flow pattern
[Figure: ratio of runtimes vs. WHT size log(N), comparing the best algorithms found by DP, a recursive algorithm, and an iterative algorithm.]
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.
Architecture Dependency
[Diagram: the best partition trees found on three architectures — UltraSPARC v9, POWER3 II, and PowerPC RS64 III — differ in shape and in where special nodes appear. Legend: 2^22 marks a DDL split node; 2^5,(1) marks an IL=1 straight-line WHT_32 node.]
Improved Data Access Patterns
• The stride tensor causes the WHT to access data outside a block, losing locality
• A large stride introduces more conflict cache misses

    WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
            stride tensor   union tensor

[Diagram: access order over time on x0 … x7 for the stride tensor WHT_2 ⊗ I_4 versus the union tensor I_2 ⊗ WHT_4.]
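The difference between the two tensors can be made concrete by listing which indices of x each small transform touches. A hypothetical Python sketch (helper names are ours):

```python
def union_accesses(R, S):
    """I_R (x) WHT_S: block j touches x[j*S .. j*S + S - 1] (contiguous)."""
    return [list(range(j * S, (j + 1) * S)) for j in range(R)]

def stride_accesses(R, S):
    """WHT_R (x) I_S: butterfly k touches x[k], x[k + S], ... (stride S)."""
    return [list(range(k, R * S, S)) for k in range(S)]

# For WHT_8 = (WHT_2 (x) I_4)(I_2 (x) WHT_4):
assert union_accesses(2, 4) == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert stride_accesses(2, 4) == [[0, 4], [1, 5], [2, 6], [3, 7]]
```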
Dynamic Data Layout
DDL uses an in-place pseudo transpose to swap the data in a special way so that a stride tensor is changed into a union tensor:

    WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
          = L^8_2 (I_4 ⊗ WHT_2) L^8_4 (I_2 ⊗ WHT_4)

[Diagram: the pseudo transpose maps the 2 × 4 data layout (x0 x1 x2 x3 / x4 x5 x6 x7) to its 4 × 2 transpose, bringing elements that were a stride of 4 apart into adjacent positions.]
N. Park and V. K. Prasanna: ICASSP 2001
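The DDL identity WHT_8 = L^8_2 (I_4 ⊗ WHT_2) L^8_4 (I_2 ⊗ WHT_4) can be checked numerically. A plain-Python sketch (helper names are ours; the pseudo transpose is modeled as an out-of-place matrix transpose rather than the package's in-place version):

```python
def kron(a, b):
    return [[p * q for p in ra for q in rb] for ra in a for rb in b]

def apply_block(mat, x):
    """Apply I (x) mat: transform each contiguous block of len(mat)."""
    s, y = len(mat), []
    for b in range(0, len(x), s):
        for row in mat:
            y.append(sum(r * x[b + j] for j, r in enumerate(row)))
    return y

def pseudo_transpose(x, rows, cols):
    """View x as a rows x cols matrix (row-major) and transpose it."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]

WHT2 = [[1, 1], [1, -1]]
WHT4 = kron(WHT2, WHT2)
WHT8 = kron(WHT2, WHT4)

x = list(range(8))
direct = [sum(r * v for r, v in zip(row, x)) for row in WHT8]

y = apply_block(WHT4, x)       # I_2 (x) WHT_4  (union tensor)
y = pseudo_transpose(y, 2, 4)  # L^8_4
y = apply_block(WHT2, y)       # I_4 (x) WHT_2  (union tensor again)
y = pseudo_transpose(y, 4, 2)  # L^8_2
assert y == direct
```

Note that after the pseudo transpose, both stages are union tensors operating on contiguous blocks, which is the whole point of DDL.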
Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
[Diagram: access order (1st through 4th) on x0 … x7 for WHT_2 ⊗ I_4 without interleaving and with IL=1 and IL=2.]
Gatlin and Carter: PACT 2000, Implemented by Bo Hong
Environment: PowerPC RS64 III/12 450 MHz, 128/128KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5
Best WHT Partition Trees
[Diagram: best partition trees for size 2^16 — the standard best tree, the best tree with DDL, and the best tree with IL. Legend: 2^16 marks a DDL split node; 2^5,(3) marks an IL=3 straight-line WHT_32 node.]
Effect of IL and DDL on Performance
DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes. IL level 4 reaches the maximal use of the cache line, 128 bytes = 2^4 × 8 bytes.
[Figure: ratio of runtimes vs. WHT size log(N) for the standard package and versions with DDL, IL=1, IL=4, and IL=5.]
Parallel WHT Package
• SMP implementation obtained using OpenMP
• The WHT partition tree is parallelized at the root node
– Simple to insert OpenMP directives
– Better performance obtained with manual scheduling
• DP decides when to use parallelism
• DP builds the partition with the best sequential subtrees
• DP decides the best parallel root node
– Parallel split
– Parallel split with DDL
– Parallel pseudo transpose
– Parallel split with IL
OpenMP Implementation
Work-sharing version:

    R = N; S = 1;
    for (i = 0; i < t; i ++) {
      R = R / N(i);
      # pragma omp parallel for
      for (j = 0; j < R; j ++)
        for (k = 0; k < S; k ++)
          WHT(N(i)) * x(j, k, S, N(i));
      S = S * N(i);
    }
    WHT_{2^n} = ∏_{i=1}^{t} ( I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}} )
Manually scheduled version:

    # pragma omp parallel
    {
      total = get_total_threads( );
      R = N; S = 1;
      for (i = 0; i < t; i ++) {
        R = R / N(i);
        for (id = get_thread_id( ); id < R * S; id += total) {
          j = id / S;
          k = id % S;
          WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        # pragma omp barrier
      }
    }
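The manual schedule can be modeled in plain Python to check that the cyclic assignment of the R · S independent tasks at each stage covers every task exactly once and balances the load. A sketch (thread count and sizes are arbitrary choices for illustration):

```python
def tasks_for_thread(tid, total, R, S):
    """Tasks (j, k) taken by thread `tid`: ids tid, tid+total, tid+2*total, ..."""
    return [(t // S, t % S) for t in range(tid, R * S, total)]

R, S, total = 8, 4, 10
assigned = [tasks_for_thread(tid, total, R, S) for tid in range(total)]
all_tasks = sorted(t for per_thread in assigned for t in per_thread)

# Every (j, k) task is executed exactly once ...
assert all_tasks == [(j, k) for j in range(R) for k in range(S)]
# ... and the load is balanced to within one task per thread.
assert max(map(len, assigned)) - min(map(len, assigned)) <= 1
```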
Parallel DDL

In WHT_{RS} = L^{RS}_R (I_S ⊗ WHT_R) L^{RS}_S (I_R ⊗ WHT_S), the pseudo transpose, L, can be parallelized at different granularities.

[Diagram: three parallelizations of the pseudo transpose of the R × S data array across threads 1–4 — coarse-grained, fine-grained, and fine-grained with ID shift.]
Comparison of Parallel Schemes
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with DDL, work-sharing OpenMP with 10 threads, coarse-grained DDL with 10 threads, and fine-grained ID-shift DDL with 10 threads.]
Best Tree of Parallel DDL Schemes
[Diagram: best partition trees of size 2^26 under three schemes — coarse-grained DDL, fine-grained DDL, and fine-grained with ID shift DDL. Legend: the root of each tree is a parallel DDL split node; DDL split nodes appear below it.]
Normalized Runtime of PowerPC RS64
[Figure: normalized runtime vs. WHT size log(N) on PowerPC RS64 III for the standard package and versions with DDL and IL=4.]
The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.
PowerPC RS64 III
Overall Parallel Speedup
0
2
4
6
8
10
1 6 11 16 21 26
WHT size log(N)
Spee
dup 1 thread
8 threads
10 threads
Parallel Performance
A. PowerPC RS64 III

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.99      100%
    3        2.91      97%
    4        3.94      99%
    5        4.74      95%
    6        5.59      93%
    7        6.30      90%
    8        7.71      96%
    9        7.85      87%
    10       8.79      88%

B. POWER3 II

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.74      87%
    3        2.15      72%
    4        2.52      63%

C. UltraSPARC v8plus

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.98      99%
    3        2.70      90%
    4        3.06      77%

Data size is 2^25 for Table A and 2^23 for Tables B and C.
Conclusion and Future Work
• The parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
– Self-adapts to different architectures using search
– Must take the data access pattern into account
– The parallel implementation should not constrain the search
– The package is available for download at the SPIRAL website: http://www.ece.cmu.edu/~spiral
• Working on a distributed memory version using MPI
Effect of Scheduling Strategy
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with IL=4, large-granularity scheduling with 2 threads and IL=4, and small-granularity scheduling with 2 threads and IL=4.]
Parallel Split Node with IL and DDL
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with IL, the best parallel algorithm with 2 threads and DDL, and the best parallel algorithm with 2 threads and IL=4.]
Parallel IL utilizes pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.
Modified Scheduling
Choice in scheduling the WHT tasks for (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):
• Small granularity: tasks of size R or S
• Large granularity: tasks of size R · S / (number of threads)
[Diagram: task assignment across threads 1–4 for the two granularities.]
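The two granularities can be sketched as index assignments for the S independent size-R transforms in WHT_R ⊗ I_S. A hypothetical Python illustration (`p` threads; function names are ours):

```python
def small_granularity(S, p):
    """One task per size-R transform, dealt out cyclically to p threads."""
    return [list(range(tid, S, p)) for tid in range(p)]

def large_granularity(S, p):
    """One contiguous chunk of about S/p transforms per thread."""
    chunk = (S + p - 1) // p
    return [list(range(tid * chunk, min((tid + 1) * chunk, S)))
            for tid in range(p)]

# Both schemes cover all S transforms; they differ in which thread
# touches which region of memory.
assert sorted(sum(small_granularity(8, 4), [])) == list(range(8))
assert large_granularity(8, 4) == [[0, 1], [2, 3], [4, 5], [6, 7]]
```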