Performance Analysis of Divide and Conquer Algorithms for the WHT
Jeremy Johnson
Mihai Furis, Pawel Hitczenko, Hung-Jen Huang
Dept. of Computer Science, Drexel University
www.spiral.net
LACSI 2006 – Automatic Tuning of Libraries and Applications
Motivation
• On modern machines operation count is not always the most important performance metric.
• Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code.
• Automatic Performance Tuning and Architecture Adaptation
– Generate and Test
– FFT, Matrix Multiplication, …
• Explain performance distribution
Outline
• Space of WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
– Instruction Count
– Cache
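Walsh-Hadamard Transform
• $y = \mathrm{WHT}_N \, x$, $N = 2^n$
$$\mathrm{WHT}_N = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n}$$
$$\mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$
$$\mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$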
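The definition above can be sketched directly in code. This is a minimal illustration, assuming matrices stored as Python lists of rows; the names `kron`, `wht_matrix`, and `apply_wht` are hypothetical helpers, not part of the WHT package.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

WHT2 = [[1, 1], [1, -1]]

def wht_matrix(n):
    """WHT_{2^n} built as the n-fold Kronecker power of WHT_2."""
    m = [[1]]
    for _ in range(n):
        m = kron(m, WHT2)
    return m

def apply_wht(n, x):
    """Dense matrix-vector product y = WHT_{2^n} x (O(N^2), for checking only)."""
    return [sum(c * v for c, v in zip(row, x)) for row in wht_matrix(n)]

print(wht_matrix(2))
print(apply_wht(2, [1, 2, 3, 4]))  # [10, -2, -4, 0]
```

The dense product costs N^2 operations; the factorizations in the following slides are what reduce this to n 2^n.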
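Factoring the WHT Matrix
• $AC \otimes BD = (A \otimes B)(C \otimes D)$
• $A \otimes B = (A \otimes I)(I \otimes B)$
• $A \otimes (B \otimes C) = (A \otimes B) \otimes C$
• $I_m \otimes I_n = I_{mn}$
$$\mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = (\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)$$
$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}$$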
Recursive Algorithm
[Figure: the 8 × 8 matrix WHT_8 written as a product of sparse factors]
$$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2))\bigr)$$
Iterative Algorithm
[Figure: the 8 × 8 matrix WHT_8 written as a product of sparse factors]
$$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_2 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)$$
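WHT Algorithms
• Recursive
$$\mathrm{WHT}_N = (\mathrm{WHT}_2 \otimes I_{N/2})(I_2 \otimes \mathrm{WHT}_{N/2})$$
• Iterative
$$\mathrm{WHT}_N = \prod_{i=1}^{n} (I_{2^{i-1}} \otimes \mathrm{WHT}_2 \otimes I_{2^{n-i}})$$
• General
$$\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}}), \quad \text{where } n = n_1 + \cdots + n_t$$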
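The general factorization can be checked numerically for small sizes. The sketch below assumes dense list-of-rows matrices and multiplies out the factors for a given composition n = n_1 + ... + n_t; the recursive rule corresponds to the parts (1, n-1) and the iterative rule to (1, 1, ..., 1). All function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    """Plain dense matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def identity(m):
    return [[int(i == j) for j in range(m)] for i in range(m)]

def wht(n):
    """WHT_{2^n} as the n-fold Kronecker power of WHT_2."""
    m = [[1]]
    for _ in range(n):
        m = kron(m, [[1, 1], [1, -1]])
    return m

def product_of_factors(parts):
    """Multiply out I (x) WHT_{2^{n_i}} (x) I for i = 1..t."""
    n = sum(parts)
    result = identity(1 << n)
    left = 0  # n_1 + ... + n_{i-1}
    for ni in parts:
        right = n - left - ni  # n_{i+1} + ... + n_t
        factor = kron(kron(identity(1 << left), wht(ni)), identity(1 << right))
        result = matmul(result, factor)
        left += ni
    return result

for parts in ([1, 2], [1, 1, 1], [1, 3], [2, 2], [1, 1, 1, 1]):
    assert product_of_factors(parts) == wht(sum(parts))
```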
WHT Implementation
– $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
– $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
– $x = \mathrm{WHT}_N \cdot x$
$$\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}})$$
• Implementation (nested loop)
    R = N; S = 1;
    for i = t, …, 1
        R = R / N_i
        for j = 0, …, R-1
            for k = 0, …, S-1
                x^{N_i}_{j N_i S + k, S} = WHT_{N_i} * x^{N_i}_{j N_i S + k, S}
        S = S * N_i;
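The nested loop can be sketched as a recursive in-place routine. This is an illustration under stated assumptions, not the WHT package code: a partition tree is encoded as an int n for a leaf small[n] or a list of subtrees for split[...], and the unrolled base cases are replaced by a generic butterfly loop. The names `apply_tree`, `small_wht`, `tree_size`, and `wht_direct` are hypothetical.

```python
def tree_size(tree):
    """Exponent n of the WHT_{2^n} computed by a partition tree."""
    return tree if isinstance(tree, int) else sum(tree_size(t) for t in tree)

def small_wht(x, n, base, stride):
    """In-place WHT_{2^n} on x[base], x[base+stride], ... (stand-in for unrolled code)."""
    size, h = 1 << n, 1
    while h < size:
        for start in range(0, size, 2 * h):
            for j in range(start, start + h):
                a = x[base + j * stride]
                b = x[base + (j + h) * stride]
                x[base + j * stride] = a + b
                x[base + (j + h) * stride] = a - b
        h *= 2

def apply_tree(tree, x, base=0, stride=1):
    """Apply the algorithm described by a partition tree, following the R/S loop nest."""
    if isinstance(tree, int):
        small_wht(x, tree, base, stride)
        return
    sizes = [tree_size(t) for t in tree]
    R, S = 1 << sum(sizes), 1
    for child, ni in zip(reversed(tree), reversed(sizes)):  # i = t, ..., 1
        Ni = 1 << ni
        R //= Ni
        for j in range(R):
            for k in range(S):
                # sub-vector starting at j*Ni*S + k with stride S
                apply_tree(child, x, base + (j * Ni * S + k) * stride, S * stride)
        S *= Ni

def wht_direct(x):
    """Reference O(N^2) transform: y[i] = sum_j (-1)^popcount(i & j) x[j]."""
    return [sum((-1) ** bin(i & j).count("1") * v for j, v in enumerate(x))
            for i in range(len(x))]

trees = [[1, [1, [1, 1]]], [1, 1, 1, 1], [[[1, 1], 1], 1], [[1, 1], [1, 1]]]
for tree in trees:
    x = list(range(16))
    apply_tree(tree, x)
    assert x == wht_direct(list(range(16)))
```

All four trees compute the same transform; they differ only in the order and stride of the memory accesses, which is what the performance analysis studies.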
Partition Trees
• Right Recursive: 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)
• Iterative: 4 → (1, 1, 1, 1)
• Left Recursive: 4 → (3, 1), 3 → (2, 1), 2 → (1, 1)
• Balanced: 4 → (2, 2), 2 → (1, 1)
• General example: 9 → (3, 4, 2), with the children split further into parts of size 1 and 2
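Number of Algorithms
$$T_1 = 1, \qquad T_n = 1 + \sum_{\substack{n_1 + \cdots + n_t = n \\ t > 1}} T_{n_1} \cdots T_{n_t}, \quad n > 1$$
$$T(z) = \sum_{n \ge 1} T_n z^n = \frac{z}{1-z} + \frac{T(z)^2}{1 - T(z)}$$
$$T(z) = \frac{1 - \sqrt{1 - 8z + 8z^2}}{2(2 - 2z)} = z + 2z^2 + 6z^3 + 24z^4 + \cdots$$
$$T_n = \Theta(\alpha^n / n^{3/2}), \quad \alpha \approx 6.8$$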
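The count of partition trees can be reproduced with a short recursion. The sketch assumes, as the generating function term z/(1-z) does, that a leaf small[n] is available for every n; `count_trees` and `count_sequences` are hypothetical names.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(n):
    """T_n: one leaf small[n] plus every split of n into t >= 2 ordered parts."""
    # Choose the first part k < n, then any nonempty sequence of trees for the rest.
    return 1 + sum(count_trees(k) * count_sequences(n - k) for k in range(1, n))

@lru_cache(maxsize=None)
def count_sequences(m):
    """Ways to write m >= 1 as one or more ordered parts, each part carrying a tree."""
    return sum(count_trees(k) * (count_sequences(m - k) if k < m else 1)
               for k in range(1, m + 1))

print([count_trees(n) for n in range(1, 6)])  # [1, 2, 6, 24, 112]
```

The first four values match the series coefficients 1, 2, 6, 24 of T(z).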
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
– Instruction Count
– Cache
WHT Package
Püschel & Johnson (ICASSP ’00)
• Allows easy implementation of any of the possible WHT algorithms
• Partition tree representation
W(n)=small[n] | split[W(n1),…W(nt)]
• Tools
– Measure runtime of any algorithm
– Measure hardware events (coupled with PCL/PAPI)
– Search for good implementation
• Dynamic programming
• Evolutionary algorithm
Algorithm Comparison
[Figure: four plots of ratio vs. WHT size (2^n). Panels: "Recursive/Iterative Runtime" (r1/i1, n = 1 to 22); "Rec & Bal/It Instruction Count" (rr1/i1, lr1/i1, bal1/i1); "Rec & It/Best Runtime" (r1/b, r3/b, i1/b, i3/b, b/b, n = 1 to 22); "Small/It Runtime" (I_1/rt, r_1/rt, n = 1 to 8)]
Cache Miss Data
[Figure: four plots of ratio vs. size (n = 1 to 22). Panels "Recursive vs. Best", "Recursive vs. Iterative", and "Iterative vs. Best" plot ratios of Instructions, L1 Data Cache Misses, and L2 Cache Misses; panel "Recursive vs. Iterative Normalized to Best" plots Recursive Time and Iterative Time as ratios to the best time]
Histogram (n = 16, 10,000 samples)
• Wide range in performance despite equal number of arithmetic operations ($n 2^n$ flops)
• Pentium III vs. UltraSPARC II
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
– Instruction Count
– Cache
Instruction Count Model
$$\mathrm{IC}(n) = \alpha A(n) + \sum_{l=1}^{3} \alpha_l A_l(n) + \sum_{i=1}^{3} \beta_i L_i(n)$$
A(n) = number of calls to the WHT procedure
α = number of instructions outside loops
A_l(n) = number of calls to the base case of size l
α_l = number of instructions in the base case of size l
L_i(n) = number of iterations of the outer (i=1), middle (i=2), and inner (i=3) loop
β_i = number of instructions in the outer (i=1), middle (i=2), and inner (i=3) loop body
Small[1]
        .file "s_1.c"
        .version "01.01"
gcc2_compiled.:
        .text
        .align 4
        .globl apply_small1
        .type apply_small1,@function
apply_small1:
        movl 8(%esp),%edx        // load stride S to EDX
        movl 12(%esp),%eax       // load x array's base address to EAX
        fldl (%eax)              // st(0)=R7=x[0]
        fldl (%eax,%edx,8)       // st(0)=R6=x[S]
        fld %st(1)               // st(0)=R5=x[0]
        fadd %st(1),%st          // R5=x[0]+x[S]
        fxch %st(2)              // st(0)=R5=x[0], st(2)=R7=x[0]+x[S]
        fsubp %st,%st(1)         // st(0)=R6=x[0]-x[S] (AT&T syntax: st(1)=st(0)-st(1), then pop)
        fxch %st(1)              // st(0)=R6=x[0]+x[S], st(1)=R7=x[0]-x[S]
        fstpl (%eax)             // store x[0]=x[0]+x[S]
        fstpl (%eax,%edx,8)      // store x[S]=x[0]-x[S]
        ret
Recurrences
For $n = n_1 + \cdots + n_t$:
$$A(n) = 1 + \sum_{i=1}^{t} 2^{n-n_i} A(n_i); \qquad A(n) = 0 \text{ for a leaf}$$
$$L_1(n) = t + \sum_{i=1}^{t} 2^{n-n_i} L_1(n_i)$$
$$L_2(n) = \sum_{i=1}^{t} \left( 2^{n_1+\cdots+n_{i-1}} + 2^{n-n_i} L_2(n_i) \right)$$
$$L_3(n) = \sum_{i=1}^{t} \left( 2^{n-n_i} + 2^{n-n_i} L_3(n_i) \right)$$
$$L_i(n) = 0 \text{ for a leaf}$$
$$A_l(n) = 2^{n-l} \cdot (\text{number of leaves of size } l)$$
Histogram using Instruction Model (P3)
$\alpha_1 = 12$, $\alpha_2 = 34$, and $\alpha_3 = 106$
$\alpha = 27$
$\beta_1 = 18$, $\beta_2 = 18$, and $\beta_3 = 20$
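The instruction count model can be evaluated for a given partition tree. The sketch below rests on two assumptions that should be checked against the original slides: the recurrences are read as A(n) = 1 + Σ 2^(n-n_i) A(n_i), L_1(n) = t + Σ 2^(n-n_i) L_1(n_i), L_2(n) = Σ (2^(n_1+...+n_(i-1)) + 2^(n-n_i) L_2(n_i)), L_3(n) = Σ (2^(n-n_i) + 2^(n-n_i) L_3(n_i)), and the Pentium III coefficients are read as α = 27, α_1 = 12, α_2 = 34, α_3 = 106, β = (18, 18, 20). Trees are encoded as an int for a leaf or a list for a split; all names are hypothetical.

```python
ALPHA = 27                           # instructions outside loops (assumed reading)
ALPHA_L = {1: 12, 2: 34, 3: 106}     # instructions per base case of size l (assumed reading)
BETA = (18, 18, 20)                  # instructions per outer, middle, inner loop body (assumed reading)

def size(tree):
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def counts(tree):
    """Return (A, {l: A_l}, [L1, L2, L3]) for a partition tree."""
    if isinstance(tree, int):
        return 0, {tree: 1}, [0, 0, 0]
    n = size(tree)
    a, leaf_calls, loops = 1, {}, [len(tree), 0, 0]
    prefix = 0  # n_1 + ... + n_{i-1}
    for child in tree:
        ni = size(child)
        mult = 1 << (n - ni)  # the child is invoked 2^(n - n_i) times
        ca, cl, cloops = counts(child)
        a += mult * ca
        for l, c in cl.items():
            leaf_calls[l] = leaf_calls.get(l, 0) + mult * c
        loops[0] += mult * cloops[0]
        loops[1] += (1 << prefix) + mult * cloops[1]
        loops[2] += mult + mult * cloops[2]
        prefix += ni
    return a, leaf_calls, loops

def instruction_count(tree):
    """IC = alpha*A + sum_l alpha_l*A_l + sum_i beta_i*L_i (leaf sizes 1 to 3 only)."""
    a, leaf_calls, loops = counts(tree)
    return (ALPHA * a
            + sum(ALPHA_L[l] * c for l, c in leaf_calls.items())
            + sum(b * L for b, L in zip(BETA, loops)))

print(instruction_count([1, 1, 1, 1]))        # iterative, n = 4
print(instruction_count([1, [1, [1, 1]]]))    # right recursive, n = 4
```

Under these assumed coefficients the two n = 4 trees perform the same 32 base-case calls but differ in call and loop overhead, which is the effect the histogram attributes to the model.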
Cache Model
• Different WHT algorithms access data in different patterns
• All algorithms with the same set of leaf nodes have the same number of memory accesses
• Count misses for accesses to data array
– Parameterized by cache size, associativity, and block size
– Simulate using program traces (restrict to data vector accesses)
– Analytic formula?
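A trace-driven simulator of the kind described can be sketched in a few lines. This is a minimal illustration, assuming LRU replacement (the slides do not state the policy) and cache size C, associativity A, and block size B all measured in array elements; `count_misses` is a hypothetical name.

```python
from collections import OrderedDict

def count_misses(trace, C, A, B):
    """Count misses for a sequence of element indices in a C-element cache
    with associativity A and block size B (LRU replacement assumed)."""
    num_sets = C // (A * B)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // B
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) == A:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses

blocked = [0, 1, 2, 3] * 2        # two passes over one contiguous 4-element block
interleaved = [0, 4, 8, 12] * 2   # two passes at stride 4
print(count_misses(blocked, C=4, A=1, B=1))      # 4
print(count_misses(interleaved, C=4, A=1, B=1))  # 8
print(count_misses(interleaved, C=4, A=4, B=1))  # 4
```

In the direct-mapped case the strided accesses all map to the same set and evict each other, while the contiguous accesses miss only on first touch.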
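Blocked Access
Partition tree: 4 → (1, 3), 3 → (1, 2)
$$\mathrm{WHT}_{16} = (\mathrm{WHT}_2 \otimes I_8)(I_2 \otimes \mathrm{WHT}_8) = (\mathrm{WHT}_2 \otimes I_8)\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_4))\bigr)$$
[Figure: access pattern over data indices 0 to 15]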
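Interleaved Access
Partition tree: 4 → (3, 1), 3 → (2, 1)
$$\mathrm{WHT}_{16} = (\mathrm{WHT}_8 \otimes I_2)(I_8 \otimes \mathrm{WHT}_2) = \bigl(((\mathrm{WHT}_4 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)) \otimes I_2\bigr)(I_8 \otimes \mathrm{WHT}_2)$$
[Figure: access pattern over data indices 0 to 15]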
Cache Simulator
Partition trees: 4 → (1, 3), 3 → (1, 2) and 4 → (3, 1), 3 → (2, 1)
• 144 memory accesses
– C = 4, A = 1, B = 1 (80, 112)
– C = 4, A = 4, B = 1 (48, 48)
– C = 4, A = 1, B = 2 (72, 88)
• Iterative vs. Recursive (192 memory accesses)
– C = 4, A = 1, B = 1 (128, 112)
Cache Misses as a Function of Cache Size
[Figure: four panels, for C = 2^2, C = 2^3, C = 2^4, and C = 2^5]
Formula for Cache Misses
• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$
[Equation: recursive definition of M(L, W_N, R) by cases, including the case where W_N is a leaf, a case that depends on the cache size C, and a sum over the children i = 1, …, t of W_N = split[W_{n_1}, …, W_{n_t}]]
Closed Form
• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$
• $M(0, W_n, 0) = 3(n-c) 2^n + k 2^n$
• $C = 2^c$, k = number of parts in the rightmost c positions
• c = 3, n = 4
Iterative: 4 → (1, 1, 1, 1)
Right Recursive: 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)
Balanced: 4 → (2, 2), 2 → (1, 1)
k = 3
k = 2
k = 1
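As a worked instance of the closed form, the snippet below evaluates M = 3(n - c) 2^n + k 2^n for the example parameters c = 3, n = 4 and the three listed values of k. The function name `closed_form_misses` is hypothetical, and the sketch assumes n ≥ c.

```python
def closed_form_misses(n, c, k):
    """M = 3(n - c) 2^n + k 2^n for a direct-mapped cache of size C = 2^c."""
    return 3 * (n - c) * 2**n + k * 2**n

for k in (3, 2, 1):
    print(k, closed_form_misses(4, 3, k))  # 96, 80, 64
```

With n - c = 1, every algorithm pays the same 48 misses for the part that exceeds the cache; the remaining difference is 16 misses per unit of k.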
Summary of Results and Future Work
• Instruction Count Model
– min, max, expected value, variance, limiting distribution
• Cache Model
– Direct mapped (closed form solution, distribution, expected value, and variance)
• Combine models
• Extend cache formula to include A and B
• Use as heuristic to limit search and predict performance
Sponsors
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting, DESA: Intelligent HW-SW
Compilers for Signal Processing Applications, and NSF
ITR/NGS #0325687: Intelligent HW/SW Compilers for DSP.