A Prototypical Self-Optimizing Package for Parallel Implementation
of Fast Signal Transforms
Kang Chen and Jeremy Johnson
Department of Mathematics and Computer Science
Drexel University
Motivation and Overview
• High performance implementation of critical signal processing kernels
• A self-optimizing parallel package for computing fast signal transforms
– Prototype transform (WHT)
– Build on existing sequential package
– SMP implementation using OpenMP
• Part of SPIRAL project
– http://www.ece.cmu.edu/~spiral
Outline
• Walsh-Hadamard Transform (WHT)
• Sequential performance and optimization using dynamic programming
• A parallel implementation of the WHT
• Parallel performance and optimization including parallelism in the search
Walsh-Hadamard Transform
The WHT of a signal x of size N = 2^n is y = WHT_{2^n} x, where

    WHT_{2^n} = WHT_2 ⊗ WHT_2 ⊗ ··· ⊗ WHT_2   (n factors),

⊗ is the tensor product, and

    WHT_2 = [ 1   1 ]
            [ 1  -1 ]

Fast WHT algorithms are obtained by factoring the WHT matrix, e.g.

    WHT_4 = [ 1  1  1  1 ]
            [ 1 -1  1 -1 ]  =  WHT_2 ⊗ WHT_2  =  (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)
            [ 1  1 -1 -1 ]
            [ 1 -1 -1  1 ]

In general, for n = n_1 + ··· + n_t,

    WHT_{2^n} = ∏_{i=1}^{t} ( I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}} )
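For illustration, the definition and the sample factorization can be checked numerically. This is a plain-Python sketch (the helper functions `kron`, `identity`, and `matmul` are ours, not part of the SPIRAL package):

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    cols = range(len(b[0]))
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in cols]
            for i in range(len(a))]

WHT2 = [[1, 1], [1, -1]]

# WHT_4 = WHT_2 (x) WHT_2 ...
WHT4 = kron(WHT2, WHT2)

# ... equals the factorization (WHT_2 (x) I_2)(I_2 (x) WHT_2)
factored = matmul(kron(WHT2, identity(2)), kron(identity(2), WHT2))
assert factored == WHT4
```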
SPIRAL WHT Package
• All WHT algorithms have the same arithmetic cost, O(N lg N), but different data access patterns
• Different factorizations lead to varying amounts of recursion and iteration
• Small transforms (sizes 2^1 to 2^8) are implemented in straight-line code to reduce overhead
• The WHT package allows exploration of different algorithms and implementations
• Optimization/adaptation to architectures is performed by searching for the fastest algorithm
Johnson and Püschel: ICASSP 2000
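The straight-line leaves can be pictured with a fully unrolled WHT_4: two butterfly stages, no recursion or loop overhead. This is a hypothetical sketch in Python for illustration only; the package itself generates such code in C:

```python
def wht4_straightline(x):
    """Fully unrolled WHT_4 = (WHT_2 (x) I_2)(I_2 (x) WHT_2)."""
    # First stage: I_2 (x) WHT_2 -- butterflies on adjacent pairs
    t0 = x[0] + x[1]; t1 = x[0] - x[1]
    t2 = x[2] + x[3]; t3 = x[2] - x[3]
    # Second stage: WHT_2 (x) I_2 -- butterflies at stride 2
    return [t0 + t2, t1 + t3, t0 - t2, t1 - t3]

# Applying it to a unit impulse yields a column of the WHT_4 matrix.
assert wht4_straightline([1, 0, 0, 0]) == [1, 1, 1, 1]
```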
Dynamic Programming
• Exhaustive search: try all possible algorithms
– Cost is Θ(4^n / n^{3/2}) for binary factorizations
• Dynamic programming: search only among algorithms generated from previously determined best algorithms
– Cost is O(n^2) for binary factorizations
[Diagram: DP example — the best algorithms already found at sizes 2^4 and 2^9 (itself split as 2^5 · 2^4) are combined into a possibly best algorithm at size 2^13 = 2^4 · 2^9.]
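The DP recurrence can be sketched as follows. This is a Python illustration with a made-up cost model standing in for measured runtimes; the real package times each candidate formula on the target machine:

```python
def time_of(tree):
    """Placeholder cost model (the real search measures runtimes)."""
    if isinstance(tree, int):
        return 1 << tree                      # straight-line leaf of size 2^n
    _, left, right = tree
    return 1.1 * (time_of(left) + time_of(right))

def dp_search(n, best={}):
    """Best binary partition tree for WHT of size 2^n, built from
    previously determined best subtrees (O(n^2) subproblems instead of
    the Theta(4^n / n^{3/2}) trees of exhaustive search)."""
    if n in best:
        return best[n]
    candidates = [n] if n <= 8 else []        # straight-line code up to 2^8
    for k in range(1, n):                     # binary split n = k + (n - k)
        candidates.append(('split', dp_search(k), dp_search(n - k)))
    best[n] = min(candidates, key=time_of)
    return best[n]

tree = dp_search(13)
```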
Performance of WHT Algorithms
• Iterative algorithms have less overhead
• Recursive algorithms have better data locality
• The best WHT algorithms are a compromise between low overhead and a good data flow pattern
[Figure: ratio of runtimes vs. WHT size log(N), comparing the best algorithms found by DP, a recursive algorithm, and an iterative algorithm.]
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.
Architecture Dependency
[Diagram: the best partition trees found on three architectures — UltraSPARC v9, POWER3 II, and PowerPC RS64 III — differ in shape and in where special nodes appear. Legend: 2^22 marks a DDL split node; 2^5,(1) marks an IL=1 straight-line WHT_32 node.]
Improved Data Access Patterns
• The stride tensor causes the WHT to access data outside a block, losing locality
• A large stride introduces more conflict cache misses

    WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
            stride tensor   union tensor

[Diagram: access order over time on x0 … x7 for the stride tensor WHT_2 ⊗ I_4 versus the union tensor I_2 ⊗ WHT_4.]
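The difference between the two tensors can be made concrete by listing which indices of x each small transform touches. A hypothetical Python sketch (helper names are ours):

```python
def union_accesses(R, S):
    """I_R (x) WHT_S: block j touches x[j*S .. j*S + S - 1] (contiguous)."""
    return [list(range(j * S, (j + 1) * S)) for j in range(R)]

def stride_accesses(R, S):
    """WHT_R (x) I_S: butterfly k touches x[k], x[k + S], ... (stride S)."""
    return [list(range(k, R * S, S)) for k in range(S)]

# For WHT_8 = (WHT_2 (x) I_4)(I_2 (x) WHT_4):
assert union_accesses(2, 4) == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert stride_accesses(2, 4) == [[0, 4], [1, 5], [2, 6], [3, 7]]
```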
Dynamic Data Layout
DDL uses an in-place pseudo transpose to swap the data in a special way so that a stride tensor is changed into a union tensor:

    WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
          = L^8_2 (I_4 ⊗ WHT_2) L^8_4 (I_2 ⊗ WHT_4)

[Diagram: the pseudo transpose maps the 2 × 4 data layout (x0 x1 x2 x3 / x4 x5 x6 x7) to its 4 × 2 transpose, bringing elements that were a stride of 4 apart into adjacent positions.]
N. Park and V. K. Prasanna: ICASSP 2001
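The DDL identity WHT_8 = L^8_2 (I_4 ⊗ WHT_2) L^8_4 (I_2 ⊗ WHT_4) can be checked numerically. A plain-Python sketch (helper names are ours; the pseudo transpose is modeled as an out-of-place matrix transpose rather than the package's in-place version):

```python
def kron(a, b):
    return [[p * q for p in ra for q in rb] for ra in a for rb in b]

def apply_block(mat, x):
    """Apply I (x) mat: transform each contiguous block of len(mat)."""
    s, y = len(mat), []
    for b in range(0, len(x), s):
        for row in mat:
            y.append(sum(r * x[b + j] for j, r in enumerate(row)))
    return y

def pseudo_transpose(x, rows, cols):
    """View x as a rows x cols matrix (row-major) and transpose it."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]

WHT2 = [[1, 1], [1, -1]]
WHT4 = kron(WHT2, WHT2)
WHT8 = kron(WHT2, WHT4)

x = list(range(8))
direct = [sum(r * v for r, v in zip(row, x)) for row in WHT8]

y = apply_block(WHT4, x)       # I_2 (x) WHT_4  (union tensor)
y = pseudo_transpose(y, 2, 4)  # L^8_4
y = apply_block(WHT2, y)       # I_4 (x) WHT_2  (union tensor again)
y = pseudo_transpose(y, 4, 2)  # L^8_2
assert y == direct
```

Note that after the pseudo transpose, both stages are union tensors operating on contiguous blocks, which is the whole point of DDL.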
Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
[Diagram: access order (1st through 4th) on x0 … x7 for WHT_2 ⊗ I_4 without interleaving and with IL=1 and IL=2.]
Gatlin and Carter: PACT 2000, Implemented by Bo Hong
Environment: PowerPC RS64 III/12 450 MHz, 128/128KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5
Best WHT Partition Trees
[Diagram: best partition trees for size 2^16 — the standard best tree, the best tree with DDL, and the best tree with IL. Legend: 2^16 marks a DDL split node; 2^5,(3) marks an IL=3 straight-line WHT_32 node.]
Effect of IL and DDL on Performance
DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes. IL level 4 reaches the maximal use of the cache line, 128 bytes = 2^4 × 8 bytes.
[Figure: ratio of runtimes vs. WHT size log(N) for the standard package and versions with DDL, IL=1, IL=4, and IL=5.]
Parallel WHT Package
• SMP implementation obtained using OpenMP
• The WHT partition tree is parallelized at the root node
– Simple to insert OpenMP directives
– Better performance obtained with manual scheduling
• DP decides when to use parallelism
• DP builds the partition with the best sequential subtrees
• DP decides the best parallel root node
– Parallel split
– Parallel split with DDL
– Parallel pseudo transpose
– Parallel split with IL
OpenMP Implementation
Work-sharing version:

    R = N; S = 1;
    for (i = 0; i < t; i ++) {
      R = R / N(i);
      # pragma omp parallel for
      for (j = 0; j < R; j ++)
        for (k = 0; k < S; k ++)
          WHT(N(i)) * x(j, k, S, N(i));
      S = S * N(i);
    }
    WHT_{2^n} = ∏_{i=1}^{t} ( I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}} )
Manually scheduled version:

    # pragma omp parallel
    {
      total = get_total_threads( );
      R = N; S = 1;
      for (i = 0; i < t; i ++) {
        R = R / N(i);
        for (id = get_thread_id( ); id < R * S; id += total) {
          j = id / S;
          k = id % S;
          WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        # pragma omp barrier
      }
    }
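The manual schedule can be modeled in plain Python to check that the cyclic assignment of the R · S independent tasks at each stage covers every task exactly once and balances the load. A sketch (thread count and sizes are arbitrary choices for illustration):

```python
def tasks_for_thread(tid, total, R, S):
    """Tasks (j, k) taken by thread `tid`: ids tid, tid+total, tid+2*total, ..."""
    return [(t // S, t % S) for t in range(tid, R * S, total)]

R, S, total = 8, 4, 10
assigned = [tasks_for_thread(tid, total, R, S) for tid in range(total)]
all_tasks = sorted(t for per_thread in assigned for t in per_thread)

# Every (j, k) task is executed exactly once ...
assert all_tasks == [(j, k) for j in range(R) for k in range(S)]
# ... and the load is balanced to within one task per thread.
assert max(map(len, assigned)) - min(map(len, assigned)) <= 1
```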
Parallel DDL

In WHT_{RS} = L^{RS}_R (I_S ⊗ WHT_R) L^{RS}_S (I_R ⊗ WHT_S), the pseudo transpose, L, can be parallelized at different granularities.

[Diagram: three parallelizations of the pseudo transpose of the R × S data array across threads 1–4 — coarse-grained, fine-grained, and fine-grained with ID shift.]
Comparison of Parallel Schemes
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with DDL, work-sharing OpenMP with 10 threads, coarse-grained DDL with 10 threads, and fine-grained ID-shift DDL with 10 threads.]
Best Tree of Parallel DDL Schemes
[Diagram: best partition trees of size 2^26 under three schemes — coarse-grained DDL, fine-grained DDL, and fine-grained with ID shift DDL. Legend: the root of each tree is a parallel DDL split node; DDL split nodes appear below it.]
Normalized Runtime of PowerPC RS64
[Figure: normalized runtime vs. WHT size log(N) on PowerPC RS64 III for the standard package and versions with DDL and IL=4.]
The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.
PowerPC RS64 III
Overall Parallel Speedup
0
2
4
6
8
10
1 6 11 16 21 26
WHT size log(N)
Spee
dup 1 thread
8 threads
10 threads
Parallel Performance
A. PowerPC RS64 III

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.99      100%
    3        2.91      97%
    4        3.94      99%
    5        4.74      95%
    6        5.59      93%
    7        6.30      90%
    8        7.71      96%
    9        7.85      87%
    10       8.79      88%

B. POWER3 II

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.74      87%
    3        2.15      72%
    4        2.52      63%

C. UltraSPARC v8plus

    Thread   Speedup   Efficiency
    1        1.00      100%
    2        1.98      99%
    3        2.70      90%
    4        3.06      77%

Data size is 2^25 for Table A and 2^23 for Tables B and C.
Conclusion and Future Work
• The parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
– Self-adapts to different architectures using search
– Must take the data access pattern into account
– The parallel implementation should not constrain the search
– The package is available for download at the SPIRAL website: http://www.ece.cmu.edu/~spiral
• Working on a distributed memory version using MPI
Effect of Scheduling Strategy
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with IL=4, large-granularity scheduling with 2 threads and IL=4, and small-granularity scheduling with 2 threads and IL=4.]
Parallel Split Node with IL and DDL
[Figure: speedup vs. WHT size log(N) for the best sequential algorithm with IL, the best parallel algorithm with 2 threads and DDL, and the best parallel algorithm with 2 threads and IL=4.]
Parallel IL utilizes pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.
Modified Scheduling
Choice in scheduling the WHT tasks for (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):
• Small granularity: tasks of size R or S
• Large granularity: tasks of size R · S / (number of threads)
[Diagram: task assignment across threads 1–4 for the two granularities.]
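The two granularities can be sketched as index assignments for the S independent size-R transforms in WHT_R ⊗ I_S. A hypothetical Python illustration (`p` threads; function names are ours):

```python
def small_granularity(S, p):
    """One task per size-R transform, dealt out cyclically to p threads."""
    return [list(range(tid, S, p)) for tid in range(p)]

def large_granularity(S, p):
    """One contiguous chunk of about S/p transforms per thread."""
    chunk = (S + p - 1) // p
    return [list(range(tid * chunk, min((tid + 1) * chunk, S)))
            for tid in range(p)]

# Both schemes cover all S transforms; they differ in which thread
# touches which region of memory.
assert sorted(sum(small_granularity(8, 4), [])) == list(range(8))
assert large_granularity(8, 4) == [[0, 1], [2, 3], [4, 5], [6, 7]]
```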