Page 1: Targeting Multi-Core systems in Linear Algebra applications (cscads.rice.edu/Terpstra-DenseLA.pdf)

Targeting Multi-Core systems in Linear Algebra applications

Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou

presented by Dan Terpstra ([email protected])

CScADS Autotuning Workshop, Snowbird, Utah, July 9-12, 2007

Page 2

The free lunch is over

Problem

• power consumption
• heat dissipation
• pins

Solution

reduce the clock frequency and increase the number of execution units = multicore

Consequence

Non-parallel software won't run any faster. A new approach to programming is required.


Page 3

What is a Multicore processor, BTW?

“a processor that combines two or more independent processors into a single package” (Wikipedia)

What about:
• types of cores? homogeneous (AMD Opteron, Intel Woodcrest, ...) or heterogeneous (STI Cell, Sun Niagara, NVIDIA, ...)
• memory? how is it arranged?
• bus? is it going to be fast enough?
• cache? shared (Intel/AMD) or not present at all (STI Cell)?

• communications?

Page 4

What's the Multicore timeline?

* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper (via Labarta et al., SC06)

Page 5

Parallelism in Linear Algebra software so far

Shared-memory parallelism: LAPACK → threaded BLAS → PThreads / OpenMP
Distributed-memory parallelism: ScaLAPACK → PBLAS → BLACS + MPI


Page 7

Parallelism in LAPACK: Cholesky factorization

DPOTF2 (BLAS-2): non-blocked factorization of the panel
DTRSM (BLAS-3): updates by applying the transformation computed in DPOTF2
DGEMM/DSYRK (BLAS-3): updates the trailing submatrix

U = L^T

Page 8

Parallelism in LAPACK: Cholesky factorization

BLAS-2 operations cannot be efficiently parallelized because they are bandwidth-bound.

• strict synchronizations
• poor parallelism
• poor scalability

Page 9

Parallelism in LAPACK: Cholesky factorization

The execution flow is filled with stalls due to synchronizations and sequential operations.

[Figure: execution trace over time]

Page 10

Parallelism in LAPACK: Cholesky factorization

Tiling operations:

for each step k:
    do DPOTF2 on the diagonal tile
    for all tiles below it in the panel: do DTRSM
    for all tiles in the trailing submatrix: do DGEMM (DSYRK)
end

Page 11

Parallelism in LAPACK: Cholesky factorization

Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them.

As long as dependencies are not violated, tasks can be scheduled in any order.

[Figure: the DAG of Cholesky subtasks; nodes are labeled i:j]
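The scheduling rule can be sketched with a toy dependency-driven executor. The 4-task DAG and all names below are invented for illustration (task 0 precedes tasks 1 and 2, which both precede task 3); this is not the Cholesky DAG itself.

```c
/* Toy DAG executor: a task becomes ready once all its predecessors
 * have completed, and any ready task may run next. */
#define NT 4

static const int dep[NT][NT] = {   /* dep[i][j] = 1: task j depends on task i */
    {0, 1, 1, 0},
    {0, 0, 0, 1},
    {0, 0, 0, 1},
    {0, 0, 0, 0},
};

/* Run the DAG, recording the order tasks were executed in.
 * Returns 1 if every task ran with its dependencies satisfied. */
int run_dag(int order[NT]) {
    int remaining[NT] = {0}, done[NT] = {0}, executed = 0;
    for (int j = 0; j < NT; j++)                  /* unmet-dependency counts */
        for (int i = 0; i < NT; i++)
            remaining[j] += dep[i][j];
    while (executed < NT) {
        int picked = -1;
        for (int t = 0; t < NT; t++)              /* pick any ready task */
            if (!done[t] && remaining[t] == 0) { picked = t; break; }
        if (picked < 0) return 0;                 /* cycle; cannot happen here */
        done[picked] = 1;
        order[executed++] = picked;
        for (int j = 0; j < NT; j++)              /* release successors */
            if (dep[picked][j]) remaining[j]--;
    }
    return 1;
}

/* Any valid order must start with task 0 and end with task 3. */
int dag_demo(void) {
    int order[NT];
    return run_dag(order) && order[0] == 0 && order[NT-1] == 3;
}
```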

Page 12

Parallelism in LAPACK: Cholesky factorization

[Figure: execution trace over time]
• higher flexibility
• some degree of adaptivity
• no idle time
• better scalability

Cost: 1/3 n^3, n^3, 2n^3

Page 13

Parallelism in LAPACK: block data layout

[Figure: column-major storage vs. block data layout]


Page 16

Parallelism in LAPACK: block data layout

[Figure: speedup of DGEMM and DTRSM from block data layout, as a function of block size (64, 128, 256)]

The use of block data layout storage can significantly improve performance.
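A conversion from column-major to block data layout can be sketched as follows; the function name and the tile ordering (tiles stored tile-column by tile-column, each tile contiguous in column-major order) are choices made for this example.

```c
/* Convert an n x n column-major matrix into block data layout with
 * nb x nb tiles (n divisible by nb): tiles are stored one after
 * another, each tile contiguous in column-major order. */
void cm_to_block(const double *cm, double *bl, int n, int nb) {
    int nt = n / nb;                          /* tiles per dimension */
    for (int tj = 0; tj < nt; tj++)           /* tile column */
        for (int ti = 0; ti < nt; ti++)       /* tile row */
            for (int j = 0; j < nb; j++)      /* column inside tile */
                for (int i = 0; i < nb; i++)  /* row inside tile */
                    bl[((tj*nt + ti)*nb + j)*nb + i] =
                        cm[(tj*nb + j)*n + (ti*nb + i)];
}

/* 4x4 example, 2x2 tiles: the first element of tile (1,0) must be
 * the column-major entry at (row 2, col 0), i.e. cm[2] = 2. */
int layout_demo(void) {
    double cm[16], bl[16];
    for (int p = 0; p < 16; p++) cm[p] = p;
    cm_to_block(cm, bl, 4, 2);
    return bl[0] == 0.0 && bl[1] == 1.0 && bl[4] == 2.0;
}
```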

Page 17

Cholesky: performance

[Figure: Cholesky on a dual Clovertown — Gflop/s vs. problem size, comparing async. 2D blocking with LAPACK + threaded BLAS]

Page 18

Cholesky: performance

[Figure: Cholesky on an 8-way dual Opteron — Gflop/s vs. problem size, comparing async. 2D blocking with LAPACK + threaded BLAS]

Page 19

Parallelism in LAPACK: LU/QR factorizations

DGETF2 (BLAS-2): non-blocked panel factorization
DTRSM (BLAS-3): updates U with the transformation computed in DGETF2
DGEMM (BLAS-3): updates the trailing submatrix

Page 20

Parallelism in LAPACK: LU/QR factorizations

The LU and QR factorization algorithms in LAPACK don't allow for a 2D distribution and block storage format.

LU: pivoting takes into account the whole panel and cannot be split in a block fashion.

QR: the computation of Householder reflectors acts on the whole panel.

The application of the transformation can only be sliced but not blocked.

Page 21

Parallelism in LAPACK: LU/QR factorizations

[Figure: LU execution trace over time]

Page 22

LU factorization: performance

[Figure: LU on a dual Clovertown — Gflop/s vs. problem size, comparing async. 1D with LAPACK + threaded BLAS]

Page 23

Multicore friendly, "delightfully parallel*" algorithms

Computer Science can't go any further on old algorithms. We need some math...

* quote from Prof. S. Kale

Page 24

The QR factorization in LAPACK

The QR transformation factorizes a matrix A into the factors Q and R, where Q is unitary and R is upper triangular. It is based on Householder reflections.

Assume that part of the matrix has already been factorized and contains the Householder reflectors that determine the matrix Q.

Page 25

The QR factorization in LAPACK

The panel is factorized with DGEQR2.

Page 26

The QR factorization in LAPACK

The trailing submatrix is updated with DLARFB.

Page 27

The QR factorization in LAPACK

How does it compare to LU?
• It is stable because it uses Householder transformations, which are orthogonal.
• It is more expensive than LU: its operation count is 4/3 n^3 versus 2/3 n^3.

Page 28

Multicore friendly algorithms: QR

A different algorithm can be used where operations can be broken down into tiles.

The QR factorization of the upper left tile is performed (DGEQR2). This operation returns a small R factor and the corresponding Householder reflectors.

Page 29

Multicore friendly algorithms: QR

All the tiles in the first block-row are updated (DLARFB) by applying the transformation computed at the previous step.

Page 30

Multicore friendly algorithms: QR

The R factor computed at the first step is coupled with one tile in the block-column and a QR factorization is computed (DGEQR2). Flops can be saved due to the shape of the matrix resulting from the coupling.

Page 31

Multicore friendly algorithms: QR

Each couple of tiles along the corresponding block-rows is updated (DLARFB) by applying the transformations computed in the previous step. Flops can be saved considering the shape of the Householder vectors.

Page 32

Multicore friendly algorithms: QR

The last two steps are repeated for all the tiles in the first block-column.


Page 34

Multicore friendly algorithms: QR

25% more flops than the LAPACK version!*

* We are working on a way to remove these extra flops.

Page 35

Multicore friendly algorithms: QR

Page 36

Multicore friendly algorithms: QR

• Very fine granularity
• Few dependencies, i.e., high flexibility for the scheduling of tasks
• Block data layout is possible

Page 37

Multicore friendly algorithms: QR

[Figure: execution flow over time on an 8-way dual-core Opteron]

Page 38

Multicore friendly algorithms: QR

[Figure: QR factorization scaling on an 8-way dual Opteron — performance vs. number of processes (1-16), comparing LAPACK + threaded BLAS, async. 1D, and async. 2D blocking]


Page 40

Multicore friendly algorithms: QR

[Figure: QR factorization on an 8-way dual Opteron — Gflop/s vs. problem size, comparing LAPACK + threaded BLAS, async. 1D, and async. 2D blocking]

Page 41

Multicore friendly algorithms: QR

[Figure: QR factorization on a dual Clovertown — Gflop/s vs. problem size, comparing async. 2D blocking, async. 1D, and LAPACK + threaded BLAS]

Page 42

Current work and future plans

Page 43

Current work and future plans

• Implement LU factorization on multicores
• Is it possible to apply the same approach to two-sided transformations (Hessenberg, bi-diagonalization, tri-diagonalization)?
• Explore techniques to avoid extra flops
• Implement the new algorithms on distributed memory architectures (J. Langou and J. Demmel)
• Implement the new algorithms on the Cell processor
• Explore automatic exploitation of parallelism through graph-driven programming environments

Page 44

CellSuperScalar and SMPSuperScalar

http://www.bsc.es/cellsuperscalar

• uses source-to-source translation to determine dependencies among tasks
• scheduling of tasks is performed automatically by means of the features provided by a library
• it is easily possible to explore different scheduling policies
• all of this is obtained by decorating the code with pragmas and, thus, is transparent to other compilers

Page 45

Compilation Environment

[Figure: CSS compilation flow — app.c is processed by the CSS compiler into app_spe.c and app_ppe.c; the SPE compiler and PPE compiler produce app_spe.o and app_ppe.o; these are linked against lib_css-spe.so and lib_css-ppe.so by the SDK's SPE/PPE linkers, with the SPE Embedder wrapping the SPE executable into a PPE object, producing a Cell executable]

Page 46

CellSuperScalar and SMPSuperScalar

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}

void sgemm_tile(float *A, float *B, float *C);
void strsm_tile(float *T, float *B);
void ssyrk_tile(float *A, float *C);

Page 47

CellSuperScalar and SMPSuperScalar

The same loop nest, with each tile kernel declared as a CSS task:

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);

Page 48

Empirical Tuning of MADNESS

Haihang You and Keith Seymour

Page 49

What's MADNESS?

• SciDAC code by Robert Harrison @ ORNL
• Framework for adaptive multiresolution methods in multiwavelet bases
• Collaborative optimization effort as part of UTK's participation in PERI, the Performance Engineering Research Institute

Page 50

GCO Framework

[Figure: GCO framework — a front end parses the code into an IR; a loop analyzer extracts info on tuning parameters; a driver generator produces a testing driver; a search engine feeds tuning parameters to the code generator (CG), which emits candidate code]

Page 51

MADNESS Kernel Tuning

• GCO didn't work!
• Instead:
  – Extract the matrix-vector multiplication kernel from the doitgen routine
  – Design and hand-code a specific code generator for small-size matrix-vector multiplication
  – Tune optimal block size and unrolling factor separately for each input size
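The kind of kernel being tuned can be sketched as a small matrix-vector product with an explicit unrolling factor, the sort of parameter the search explores. This is a hand-written illustration with invented names, not the generated MADNESS kernel.

```c
/* Small matrix-vector product y = A*x (A is n x n, row-major),
 * with the inner dot product unrolled by 4 and a scalar remainder
 * loop.  A real tuned generator would emit one specialized kernel
 * per input size and unrolling factor. */
void mxv_unroll4(const double *A, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++) {
        const double *row = A + (long)i * n;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int j = 0;
        for (; j + 4 <= n; j += 4) {           /* unrolled by 4 */
            s0 += row[j]     * x[j];
            s1 += row[j + 1] * x[j + 1];
            s2 += row[j + 2] * x[j + 2];
            s3 += row[j + 3] * x[j + 3];
        }
        double s = s0 + s1 + s2 + s3;
        for (; j < n; j++) s += row[j] * x[j]; /* remainder */
        y[i] = s;
    }
}

/* 3x3 check: with x = (1,1,1), y must be the row sums 6, 15, 24. */
int mxv_demo(void) {
    double A[9] = {1,2,3, 4,5,6, 7,8,9};
    double x[3] = {1,1,1}, y[3];
    mxv_unroll4(A, x, y, 3);
    return y[0] == 6.0 && y[1] == 15.0 && y[2] == 24.0;
}
```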

Page 52

[Figure: MFLOPS vs. matrix size (1-31) on an Opteron (1.8 GHz), comparing the auto-tuned C matrix-vector kernel, the hand-tuned Fortran multi-resolution kernel, the reference kernel in C, and the ATLAS matrix-vector C kernel]

Page 53

[Figure: MFLOPS vs. matrix size (1-31) on a Pentium 4 (1.7 GHz), comparing the auto-tuned C matrix-vector kernel, the hand-tuned Fortran multi-resolution kernel, the reference kernel in C, and the ATLAS matrix-vector C kernel]

Page 54

[Figure: MFLOPS vs. matrix size on a Woodcrest (3.0 GHz), comparing the auto-tuned C matrix-vector kernel, the hand-tuned Fortran multi-resolution kernel, the reference kernel in C, and the ATLAS matrix-vector C kernel]

Page 55

MADNESS Conclusions

• We have demonstrated an effective empirical tuning strategy for optimizing the doitgen computational kernel code:
  – less effort than hand tuning
  – better performance than either hand-tuned or general-purpose optimization
• Future:
  – aggressive code generator for MV multiplication
  – parallelize the parameter search

Page 56

Thank you

http://icl.cs.utk.edu

Page 57

AllReduce algorithms

The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.

Input: A is block-distributed by rows
Output: Q is block-distributed by rows; R is global

[Figure: A partitioned by rows into A1, A2, A3; the output is Q1, Q2, Q3 and a single global R]

Page 58

AllReduce algorithms

They are used:

• in iterative methods with multiple right-hand sides (block iterative methods):
  – Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk)
  – BlockGMRES, BlockGCR, BlockCG, BlockQMR, ...
• in iterative methods with a single right-hand side:
  – s-step methods for linear systems of equations (e.g. A. Chronopoulos)
  – LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder), implemented in PETSc
  – recent work from M. Hoemmen and J. Demmel (U. California at Berkeley)
• in iterative eigenvalue solvers:
  – PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC)
  – HYPRE (Lawrence Livermore National Lab.) through BLOPEX
  – Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk)
  – PRIMME (A. Stathopoulos, Coll. William & Mary)

Page 59

AllReduce algorithms

[Figure: two processes hold the row blocks A0 and A1; time runs along the horizontal axis]

Page 60

AllReduce algorithms

Step 1: each process computes the QR factorization of its local block: (V0(0), R0(0)) = QR(A0) and (V1(0), R1(0)) = QR(A1).

Page 61

AllReduce algorithms

The local R factors R0(0) and R1(0) are exchanged between the two processes.

Page 62

AllReduce algorithms

Step 2: the stacked R factors are factorized, R0(1) = QR of (R0(0); R1(0)), producing the reflector blocks V0(1) and V1(1). R0(1) is the global R factor.

Page 63

AllReduce algorithms

Step 3: the reflectors V0(1) and V1(1) are applied to (In; 0n) to form the blocks Q0(1) and Q1(1).

Page 64

AllReduce algorithms

The blocks Q0(1) and Q1(1) are exchanged so that each process holds the block it needs for the final step.

Page 65

AllReduce algorithms

Step 4: each process applies its step-1 reflectors to recover its block of Q: Q0 = apply V0(0) to (Q0(1); 0n), and Q1 = apply V1(0) to (Q1(1); 0n).

Page 66

AllReduce algorithms

With four processes the reduction proceeds as a binary tree:
R0(0) = QR(A0), R1(0) = QR(A1), R2(0) = QR(A2), R3(0) = QR(A3);
R0(1) = QR of (R0(0); R1(0)), R2(1) = QR of (R2(0); R3(0));
R = QR of (R0(1); R2(1)).

Page 67

AllReduce algorithms: performance

[Figure, left — weak scalability: N = 50, locM = 100,000 on a Pentium III + Dolphin cluster; Mflop/s per processor vs. number of processors (up to ~64), comparing rhh_qr3 and qrf]
[Figure, right — strong scalability: N = 50, M = 100,000 on the same cluster; Mflop/s per processor vs. number of processors (up to ~32), comparing rhh_qr3 and qrf]

