
Jeremy Johnson Dept. of Computer Science Drexel University

Date post: 04-Nov-2021
Page 1: Automatic Performance Tuning

Jeremy Johnson
Dept. of Computer Science
Drexel University

Page 2: Outline

• Scientific Computation Kernels
  – Matrix Multiplication
  – Fast Fourier Transform (FFT)

• Automated Performance Tuning (IEEE Proc. Vol. 93, No. 2, Feb. 2005)
  – ATLAS
  – FFTW
  – SPIRAL

Page 3: Matrix Multiplication and the FFT

\[ C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \]

\[ y_k = \sum_{l=0}^{N-1} \omega_N^{kl} \, x_l \]

\[ N = RS, \qquad k = S k_2 + k_1, \qquad l = R l_1 + l_2, \qquad 0 \le k_1, l_1 < S, \quad 0 \le k_2, l_2 < R \]

\[ y_{S k_2 + k_1} = \sum_{l_2=0}^{R-1} \omega_R^{k_2 l_2} \, \omega_N^{k_1 l_2} \sum_{l_1=0}^{S-1} \omega_S^{k_1 l_1} \, x_{R l_1 + l_2} \]
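The index splitting N = RS, k = S k_2 + k_1, l = R l_1 + l_2 can be checked numerically. The following is a minimal Python sketch, not code from the slides: the function names `dft` and `dft_split` are hypothetical, and the sign convention ω_N = exp(−2πi/N) is an assumption, since the slide does not fix it. The split version computes R-strided DFTs of size S, multiplies by the twiddle factors ω_N^{k_1 l_2}, and then applies DFTs of size R.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of y_k = sum_l w_N^(k*l) * x_l.

    Assumes w_N = exp(-2*pi*i/N); the opposite sign works the same way.
    """
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (k * l) * x[l] for l in range(n)) for k in range(n)]

def dft_split(x, r, s):
    """Same transform for N = r*s, using l = r*l1 + l2 and k = s*k2 + k1."""
    n = r * s
    assert len(x) == n
    w_n = cmath.exp(-2j * cmath.pi / n)
    # Inner stage: for each residue l2, a size-s DFT of x[l2], x[l2 + r], ...
    inner = [dft(x[l2::r]) for l2 in range(r)]          # inner[l2][k1]
    y = [0j] * n
    for k1 in range(s):
        # Twiddle by w_N^(k1*l2), then one size-r DFT over l2.
        column = [w_n ** (k1 * l2) * inner[l2][k1] for l2 in range(r)]
        outer = dft(column)                             # outer[k2]
        for k2 in range(r):
            y[s * k2 + k1] = outer[k2]
    return y
```

Applying `dft_split` recursively to its own sub-transforms, instead of calling the direct `dft`, is what turns this single splitting step into an O(N log N) algorithm.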

Page 4: Basic Linear Algebra Subprograms (BLAS)

• Level 1 – vector-vector, O(n) data, O(n) operations
• Level 2 – matrix-vector, O(n^2) data, O(n^2) operations
• Level 3 – matrix-matrix, O(n^2) data, O(n^3) operations = data reuse = locality!

• LAPACK built on top of BLAS (level 3)
  – Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms

• GEMM – General Matrix Multiplication
  – SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
  – C := alpha*op( A )*op( B ) + beta*C,
  – where op( X ) = X or X'

Page 5: DGEMM

…
*
*           Form  C := alpha*A*B + beta*C.
*
            DO 90, J = 1, N
               IF( BETA.EQ.ZERO )THEN
                  DO 50, I = 1, M
                     C( I, J ) = ZERO
   50             CONTINUE
               ELSE IF( BETA.NE.ONE )THEN
                  DO 60, I = 1, M
                     C( I, J ) = BETA*C( I, J )
   60             CONTINUE
               END IF
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE
…
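For readers less used to Fortran 77, the same loop nest can be transcribed into Python. This is an illustrative sketch only: the function name `gemm_reference` is hypothetical, it covers just the no-transpose case shown in the excerpt, and it stores matrices as lists of rows. The Fortran original stores matrices column by column, which is why its innermost loop over I walks memory with unit stride; the Python version does not reproduce that property.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Update C := alpha*A*B + beta*C in place (A is m-by-k, B is k-by-n).

    Loop order J, L, I mirrors the Fortran excerpt: one column of C is
    scaled by beta, then built up from columns of A scaled by entries of B.
    """
    m, k, n = len(a), len(b), len(b[0])
    for j in range(n):
        if beta == 0.0:
            for i in range(m):
                c[i][j] = 0.0            # ignore whatever C held before
        elif beta != 1.0:
            for i in range(m):
                c[i][j] = beta * c[i][j]
        for l in range(k):
            if b[l][j] != 0.0:           # skip work for zero entries of B
                temp = alpha * b[l][j]
                for i in range(m):
                    c[i][j] = c[i][j] + temp * a[i][l]
    return c
```

Each entry of A and B is reused across a whole loop here, but nothing keeps it in cache or registers; that is the gap the tuned implementations on the following slides try to close.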

Page 6: Matrix Multiplication Performance

Page 7: Matrix Multiplication Performance

Page 8: Numerical Recipes

• Numerical Recipes in C – The Art of Scientific Computing, 2nd Ed.
  – William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992.

• “This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.”

• 1. Preliminaries
• 2. Solution of Linear Algebraic Equations
• …
• 12. Fast Fourier Transform
• 19. Partial Differential Equations
• 20. Less-Numerical Algorithms

Page 9: four1

Page 10: four1 (cont)

Page 11: FFT Performance

Page 12: ATLAS Architecture and Search Parameters

• NB – L1 data cache tile size
• NCNB – L1 data cache tile size for non-copying version
• MU, NU – Register tile size
• KU – Unroll factor for k' loop
• LS – Latency for computation scheduling
• FMA – 1 if fused multiply-add available, 0 otherwise
• FF, IF, NF – Scheduling of loads

Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

Page 13: ATLAS Code Generation

• Optimization for locality
  – Cache tiling, Register tiling
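Cache tiling can be illustrated with a short Python sketch. It is not ATLAS's generated code: the function name `matmul_tiled` is hypothetical, the matrices are assumed square and stored as lists of rows, and the tile size `nb` plays the role of the NB parameter. The three outer loops walk over tiles, and the three inner loops multiply one pair of NB×NB tiles, so that each tile is reused many times while it is still in cache.

```python
def matmul_tiled(a, b, nb):
    """Return C = A*B for square n-by-n matrices, computed tile by tile."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                # Multiply one tile pair:
                # C[i0:i0+nb, j0:j0+nb] += A[i0:i0+nb, k0:k0+nb] * B[k0:k0+nb, j0:j0+nb]
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + nb, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds handle matrix sizes that are not a multiple of `nb`; a tuned library would typically generate separate cleanup code for those edge tiles instead.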

Page 14: ATLAS Code Generation

• Register Tiling
  – MU + NU + MU×NU ≤ NR
• Loop unrolling
• Scalar replacement
• Add/mul interleaving
• Loop skewing

• C_{i''j''} = C_{i''j''} + A_{i''k''} * B_{k''j''}

[Figure: operands A, B and C of the tiled multiplication, with dimensions labeled NB, K, MU and NU.]

[Figure: skewed multiply/add schedule: mul_1, mul_2, …, mul_Ls, add_1, mul_{Ls+1}, add_2, …, mul_{MU×NU}, add_{MU×NU−Ls+2}, …, add_{MU×NU}]
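The register tiling constraint MU + NU + MU×NU ≤ NR bounds the set of tile shapes worth trying. Read as a register count, it reserves MU registers for elements of A, NU for elements of B and MU×NU for the tile of C. A small Python sketch, with the hypothetical helper name `register_tiles`, lists the admissible shapes for a given NR:

```python
def register_tiles(nr):
    """All (MU, NU) with MU + NU + MU*NU <= nr, largest MU*NU first."""
    tiles = [(mu, nu)
             for mu in range(1, nr + 1)
             for nu in range(1, nr + 1)
             if mu + nu + mu * nu <= nr]
    # Larger tiles perform more multiply-adds per loaded element, so list them first.
    return sorted(tiles, key=lambda t: (-t[0] * t[1], t))
```

With NR = 16, for example, the first entry is the 3×3 tile, which uses 3 + 3 + 9 = 15 registers. Which admissible shape actually runs fastest still has to be measured, which is the job of the search.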

Page 15: ATLAS Search

• Estimate Machine Parameters (C1, NR, FMA, LS)
  – Used to bound search

• Orthogonal Line Search (fix all parameters except one and search for the optimal value of this parameter)
  – Search order
    • NB
    • MU, NU
    • KU
    • LS
    • FF, IF, NF
    • NCNB
    • Cleanup codes
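Orthogonal line search can be sketched in a few lines of Python. This is a simplified illustration, not ATLAS's search code: the function name `orthogonal_line_search`, the `cost` callback and the candidate lists are all hypothetical, and each parameter is tuned on its own, whereas the slide lists MU and NU as one joint step. In a real tuner, `cost` would generate, compile and time a kernel for the given parameter setting.

```python
def orthogonal_line_search(cost, start, candidates, order):
    """Tune one parameter at a time, holding all the others fixed.

    cost       -- function mapping a parameter dict to a number (lower is better)
    start      -- initial parameter dict
    candidates -- dict: parameter name -> list of values to try
    order      -- sequence of parameter names, in search order
    """
    best = dict(start)
    for name in order:
        trial_costs = {}
        for value in candidates[name]:
            trial = dict(best, **{name: value})   # change only this parameter
            trial_costs[value] = cost(trial)
        best[name] = min(trial_costs, key=trial_costs.get)
    return best
```

The number of measurements is the sum of the candidate-list lengths rather than their product, which is what makes the search affordable. The price is that the result depends on the search order and on the starting values, and it is only guaranteed optimal when the parameters do not interact.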

Page 16: Using FFTW

Page 17: FFTW Infrastructure

• Use dynamic programming to find an efficient way to combine code sequences.

• Combine code sequences using divide and conquer structure in FFT

• Codelets (optimized code sequences for small FFTs)

• Plan encodes divide and conquer strategy and stores “twiddle factors”

• Executor computes FFT of given data using algorithm described by plan.

[Figure: right-recursive plan tree with node labels 15 → (3, 12), 12 → (4, 8), 8 → (3, 5).]
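The dynamic programming idea can be sketched as follows. This is a toy model in Python, not FFTW's planner: the function name `plan_fft`, the plan representation and the cost formula are all assumptions made for illustration. Sizes are powers of two and identified by their exponent; `codelet_cost` maps an exponent a to a made-up cost for one size-2^a codelet and is assumed to contain the exponent 1. A plan is either a single codelet exponent or a pair (a, subplan), which matches the right-recursive shape in the figure: a codelet on the left, a recursively planned transform on the right.

```python
from functools import lru_cache

def plan_fft(n, codelet_cost):
    """Return (cost, plan) for a transform of size 2**n under a toy cost model."""
    @lru_cache(maxsize=None)
    def best(m):
        options = []
        if m in codelet_cost:
            options.append((codelet_cost[m], m))      # solve directly with one codelet
        for a in codelet_cost:
            if a < m:
                sub_cost, sub_plan = best(m - a)      # best plan for the right child
                # 2**(m-a) codelets of size 2**a, 2**a sub-transforms of size
                # 2**(m-a), plus one twiddle multiplication per data point.
                total = 2 ** (m - a) * codelet_cost[a] + 2 ** a * sub_cost + 2 ** m
                options.append((total, (a, sub_plan)))
        return min(options, key=lambda option: option[0])
    return best(n)
```

Because `best` is memoized, each exponent is planned once and reused by every larger size, so planning takes on the order of n times the number of codelets. In a real planner the costs would come from timing the candidate code on the target machine rather than from a formula.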

Page 18: SPIRAL system

[Figure: SPIRAL system diagram. The user specifies a DSP transform and goes for a coffee (or an espresso for small transforms). The Formula Generator (the "Mathematician") produces a fast algorithm as an SPL formula. The SPL Compiler (the "Expert Programmer") translates it into C/Fortran/SIMD code. The runtime on the given platform is fed back to the Search Engine, which controls algorithm generation and implementation options. The user comes back to a platform-adapted implementation.]

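Page 19: DSP Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):

\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\]

\[ DFT_4 = (DFT_2 \otimes I_2) \cdot T^4_2 \cdot (I_2 \otimes DFT_2) \cdot L^4_2 \]

DFT_2: Fourier transform; I_2: Identity; L^4_2: Permutation; T^4_2: Diagonal matrix (twiddles); ⊗: Kronecker product

• algorithms reduce arithmetic cost O(n^2) → O(n log(n))
• product of structured sparse matrices
• mathematical notation exhibits structure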
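The size-4 factorization DFT_4 = (DFT_2 ⊗ I_2) · T^4_2 · (I_2 ⊗ DFT_2) · L^4_2 can be checked by multiplying the four sparse factors. The Python sketch below uses only built-in complex numbers; the helper names `kron` and `matmul` are hypothetical, and the twiddle entry i corresponds to choosing ω_4 = i (with ω_4 = −i every i in the matrices changes sign).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

I2 = [[1, 0], [0, 1]]
DFT2 = [[1, 1], [1, -1]]
T42 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1j]]  # diagonal twiddle matrix
L42 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]   # stride permutation

def dft4_from_factors():
    """Multiply out (DFT2 (x) I2) * T42 * (I2 (x) DFT2) * L42."""
    product = kron(DFT2, I2)
    for factor in (T42, kron(I2, DFT2), L42):
        product = matmul(product, factor)
    return product

DFT4 = [[1, 1, 1, 1], [1, 1j, -1, -1j], [1, -1, 1, -1], [1, -1j, -1, 1j]]
```

Applied to a vector from right to left, the factors cost a permutation, two size-2 butterflies, one nontrivial multiplication by i and two more butterflies, instead of the 16 multiplications of the dense matrix.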

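Page 20: Algorithms = Ruletrees = Formulas

\[ \text{R1:} \quad DCT^{(II)}_n \rightarrow P \cdot \left( DCT^{(II)}_{n/2} \oplus DCT^{(IV)}_{n/2} \right) \cdot \left( F_2 \otimes I_{n/2} \right) \]

\[ \text{R3:} \quad DCT^{(II)}_2 \rightarrow \mathrm{diag}\left(1, \tfrac{1}{\sqrt{2}}\right) \cdot F_2 \]

\[ \text{R6:} \quad DCT^{(IV)}_n \rightarrow P \cdot DCT^{(II)}_n \cdot S \]

[Figure: rule tree for DCT^{(II)}_8. The root is split by R1 into DCT^{(II)}_4 and DCT^{(IV)}_4, which are expanded further with rules R1, R3, R4 and R6 down to the leaves DCT^{(II)}_2, DCT^{(IV)}_2, DST^{(II)}_2 and F_2.]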

Page 21: Generated DFT Vector Code: Pentium 4, SSE

[Plot: (pseudo) gflop/s (0 to 7) versus n (4 to 13) for DFT 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0. Series: Spiral SSE, Intel MKL interl., FFTW 2.1.3, Spiral C, Spiral C vect, SIMD-FFT. Annotation on the plot: hand-tuned vendor assembly code.]

• speedups (to C code) up to factor of 3.1

Page 22: Best DFT Trees, size 2^10 = 1024

[Figure: grid of best-found DFT rule trees, each rooted at a node labeled 10. Rows: scalar, C vect, SIMD. Columns: Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float.]

• trees platform/datatype dependent

Page 23: Crosstiming of best trees on Pentium 4

[Plot: slowdown factor w.r.t. best (1.00 to 5.00) versus n (4 to 13) for DFT 2^n, single precision, runtime of best found of other platforms. Series: Pentium 4 SSE, Pentium 4 SSE2, AthlonXP SSE, PentiumIII SSE, Pentium 4 float.]

• software adaptation is necessary

