Targeting Multi-Core systems in Linear Algebra applications
Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou
presented by Dan [email protected]
CScADS Autotuning Workshop, Snowbird, Utah, July 9 - 12, 2007
CScADS Autotuning Workshop
The free lunch is over
Problem
• power consumption
• heat dissipation
• pins
Solution
reduce clock and increase execution units = Multicore
Consequence
Non-parallel software won't run any faster. A new approach to programming is required.
Hardware
Software
What is a Multicore processor, BTW?
“a processor that combines two or more independent processors into a single package” (Wikipedia)
What about:
• types of core?
  homogeneous (AMD Opteron, Intel Woodcrest...)
  heterogeneous (STI Cell, Sun Niagara, NVIDIA...)
• memory?
  how is it arranged?
• bus?
  is it going to be fast enough?
• cache?
  shared? (Intel/AMD)
  not present at all? (STI Cell)
• communications?
What's the Multicore timeline?
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper (via LaBarta et al., SC06)
Parallelism in Linear Algebra software so far
Shared Memory parallelism: LAPACK, on top of Threaded BLAS, on top of PThreads / OpenMP
Distributed Memory parallelism: ScaLAPACK, on top of PBLAS, on top of BLACS + MPI
Parallelism in LAPACK: Cholesky factorization
DPOTF2: BLAS-2, non-blocked factorization of the panel
DTRSM: BLAS-3, updates by applying the transformation computed in DPOTF2
DGEMM (DSYRK): BLAS-3, updates trailing submatrix
U = L^T
Parallelism in LAPACK: Cholesky factorization
BLAS2 operations cannot be efficiently parallelized because they are bandwidth bound.
• strict synchronizations
• poor parallelism
• poor scalability
Parallelism in LAPACK: Cholesky factorization
The execution flow is filled with stalls due to synchronizations and sequential operations.
[Figure: execution trace; horizontal axis: time]
Parallelism in LAPACK: Cholesky factorization
Tiling operations:
for each step do
    do DPOTF2 on the diagonal tile
    for all tiles in the panel below it do
        DTRSM on the tile
    end
    for all tiles in the trailing submatrix do
        DGEMM (DSYRK) on the tile
    end
end
Parallelism in LAPACK: Cholesky factorization
Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them.
As long as dependencies are not violated, tasks can be scheduled in any order.
[Figure: DAG of the Cholesky subtasks; nodes are labeled with tile indices such as 1:1, 2:1, ..., 5:5]
Parallelism in LAPACK: Cholesky factorization
• higher flexibility
• some degree of adaptivity
• no idle time
• better scalability
[Figure: execution trace; horizontal axis: time]
Cost:
1/3 n^3
n^3
2 n^3
Parallelism in LAPACK: block data layout
[Figure: Column-Major storage compared with Block data layout]
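The difference between the two storage schemes can be illustrated with a small conversion routine. This is a sketch under assumptions the slides do not state: the matrix is square, its size is a multiple of the tile size, tiles are ordered column by column, and each tile is stored contiguously in column-major order. The name colmajor_to_bdl is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy an n x n column-major matrix a into block data layout t.
   The matrix is cut into nb x nb tiles; tile (ti, tj) starts at offset
   (tj * nt + ti) * nb * nb and is itself stored in column-major order. */
void colmajor_to_bdl(const double *a, double *t, int n, int nb) {
    assert(n % nb == 0);
    int nt = n / nb;
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *dst = t + ((size_t)tj * nt + ti) * nb * nb;
            for (int j = 0; j < nb; j++)
                for (int i = 0; i < nb; i++)
                    dst[j * nb + i] = a[(size_t)(tj * nb + j) * n + (ti * nb + i)];
        }
}
```

After the copy, every tile occupies nb * nb consecutive elements, so a kernel working on one tile reads a single contiguous region instead of nb strided column segments.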
Parallelism in LAPACK: block data layout
The use of block data layout storage can significantly improve performance
[Figure: "Blocking Speedup"; series: DGEMM, DTRSM; x-axis: block size (64, 128, 256); y-axis: speedup (0 to 2)]
Cholesky: performance
[Figure: "Cholesky -- Dual Clovertown"; series: async. 2D blocking, LAPACK + Th. BLAS; x-axis: problem size (0 to 10000); y-axis: Gflop/s (0 to 55)]
Cholesky: performance
[Figure: "Cholesky -- 8-way Dual Opteron"; series: async. 2D blocking, LAPACK + Th. BLAS; x-axis: problem size (0 to 15000); y-axis: Gflop/s (2.5 to 35)]
Parallelism in LAPACK: LU/QR factorizations
DGETF2: BLAS-2, non-blocked panel factorization
DTRSM: BLAS-3, updates U with transformation computed in DGETF2
DGEMM: BLAS-3, updates the trailing submatrix
Parallelism in LAPACK: LU/QR factorizations
The LU and QR factorization algorithms in LAPACK do not allow for 2D distribution or block storage format.
LU: pivoting takes into account the whole panel and cannot be split in a block fashion.
QR: the computation of Householder reflectors acts on the whole panel.
The application of the transformation can only be sliced, not blocked.
Parallelism in LAPACK: LU/QR factorizations
[Figure: LU execution trace; horizontal axis: time]
LU factorization: performance
[Figure: "LU -- Dual Clovertown"; series: async. 1D, LAPACK + Th. BLAS; x-axis: problem size (0 to 10000); y-axis: Gflop/s (0 to 35)]
Multicore friendly, “delightfully parallel*”, algorithms
Computer Science can't go any further on old algorithms. We need some math...
* quote from Prof. S. Kale
The QR factorization in LAPACK
The QR transformation factorizes a matrix A into the factors Q and R where Q is unitary and R is upper triangular. It is based on Householder reflections.
Assume that part of the matrix has already been factorized and contains the Householder reflectors that determine the matrix Q.
[Figure: the panel is factorized with DGEQR2]
[Figure: the trailing submatrix is updated with DLARFB]
How does it compare to LU?
• It is stable because it uses Householder transformations that are orthogonal
• It is more expensive than LU because its operation count is 4/3 n^3 versus 2/3 n^3
Multicore friendly algorithms: QR
A different algorithm can be used where operations can be broken down into tiles.
The QR factorization of the upper left tile is performed. This operation returns a small R factor and the corresponding Householder reflectors.
[Figure: DGEQR2 applied to the upper left tile]
All the tiles in the first block-row are updated by applying the transformation computed at the previous step.
[Figure: DLARFB applied to the tiles of the first block-row]
The R factor computed at the first step is coupled with one tile in the block-column and a QR factorization is computed. Flops can be saved due to the shape of the matrix resulting from the coupling.
[Figure: DGEQR2 applied to the R factor coupled with a tile of the first block-column]
Each pair of tiles along the corresponding block rows is updated by applying the transformations computed in the previous step. Flops can be saved considering the shape of the Householder vectors.
[Figure: DLARFB applied to the coupled tiles along the block rows]
The last two steps are repeated for all the tiles in the first block-column.
25% more Flops than the LAPACK version!!!*
*we are working on a way to remove these extra flops.
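The order of operations described in the steps above can be sketched as a loop nest that only enumerates the tasks. This is a sketch under assumptions: the slides describe the first step only, so repeating it on the remaining tiles for every subsequent step is an assumption, and the enum labels and the function name tile_qr_tasks are hypothetical. No numerical work is performed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical labels for the four kinds of task described on the slides. */
enum task_kind { FACTOR_TILE, UPDATE_ROW, FACTOR_COUPLE, UPDATE_COUPLE, N_KINDS };

/* Walk the tile QR loop nest for an nt x nt grid of tiles, count the tasks
   of each kind, and optionally print them in sequential order. */
void tile_qr_tasks(int nt, long count[N_KINDS], int verbose) {
    assert(nt > 0);
    for (int t = 0; t < N_KINDS; t++) count[t] = 0;
    for (int k = 0; k < nt; k++) {
        /* QR factorization of the diagonal tile */
        if (verbose) printf("DGEQR2 on tile (%d,%d)\n", k, k);
        count[FACTOR_TILE]++;
        /* update of the tiles to its right in the same block-row */
        for (int j = k + 1; j < nt; j++) {
            if (verbose) printf("DLARFB on tile (%d,%d)\n", k, j);
            count[UPDATE_ROW]++;
        }
        for (int i = k + 1; i < nt; i++) {
            /* R factor coupled with one tile of the block-column */
            if (verbose)
                printf("DGEQR2 on R(%d,%d) coupled with tile (%d,%d)\n", k, k, i, k);
            count[FACTOR_COUPLE]++;
            /* update of the corresponding pairs of tiles */
            for (int j = k + 1; j < nt; j++) {
                if (verbose)
                    printf("DLARFB on tiles (%d,%d) and (%d,%d)\n", k, j, i, j);
                count[UPDATE_COUPLE]++;
            }
        }
    }
}
```

Calling tile_qr_tasks(3, count, 1) prints the task list for a 3 x 3 grid of tiles; every printed line is a unit of work that a scheduler could dispatch once its inputs are ready.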
Multicore friendly algorithms: QR
• Very fine granularity
• Few dependencies, i.e., high flexibility for the scheduling of tasks
• Block data layout is possible
Multicore friendly algorithms: QR
Execution flow on an 8-way dual-core Opteron.
[Figure: execution trace; horizontal axis: time]
Multicore friendly algorithms: QR
[Figure: "QR Factorization: Scaling -- 8-way Dual Opteron"; series: LAPACK + Th. BLAS, async. 1D, async. 2D blocking; x-axis: n. of processes (1 to 16); y-axis: labeled Gflop/s (ticks 0 to 22000)]
Multicore friendly algorithms: QR
[Figure: "QR Factorization -- 8-way Dual Opteron"; series: LAPACK + Th. BLAS, async. 1D, async. 2D blocking; x-axis: problem size (0 to 10000); y-axis: Gflop/s (0 to 22.5)]
Multicore friendly algorithms: QR
[Figure: "QR Factorization -- Dual Clovertown"; series: async. 2D blocking, async. 1D, LAPACK + Th. BLAS; x-axis: problem size (0 to 10000); y-axis: Gflop/s (0 to 35)]
Current work and future plans
• Implement LU factorization on multicores
• Is it possible to apply the same approach to two-sided transformations (Hessenberg, Bi-Diag, Tri-Diag)?
• Explore techniques to avoid extra flops
• Implement the new algorithms on distributed memory architectures (J. Langou and J. Demmel)
• Implement the new algorithms on the Cell processor
• Explore automatic exploitation of parallelism through graph driven programming environments
CellSuperScalar and SMPSuperScalar
http://www.bsc.es/cellsuperscalar
• uses source-to-source translation to determine dependencies among tasks
• scheduling of tasks is performed automatically by means of the features provided by a library
• it is easily possible to explore different scheduling policies
• all of this is obtained by decorating the code with pragmas and, thus, is transparent to other compilers
Compilation Environment
[Figure: CellSs compilation flow from app.c to the Cell executable. Labeled elements: CSS compiler, app_spe.c, app_ppe.c, SPE Compiler, app_spe.o, PPE Compiler, app_ppe.o, SPE Linker, llib_css-spe.so, SPE executable, SPE Embedder, PPE Object, PPE Linker, llib_css-ppe.so, SDK]
CellSuperScalar and SMPSuperScalar
/* tile Cholesky: A[i][j] points to tile (i, j) */
for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        /* update tile (i, j) with the tiles to its left, then solve */
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    /* update the diagonal tile, then factorize it */
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}
void sgemm_tile(float *A, float *B, float *C);
void strsm_tile(float *T, float *B);
void ssyrk_tile(float *A, float *C);
CellSuperScalar and SMPSuperScalar
#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);
#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);
#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);
Empirical Tuning of MADNESS
Haihang You and Keith Seymour
What's MADNESS?
• SciDAC code by Robert Harrison @ ORNL
• Framework for adaptive multiresolution methods in multiwavelet bases
• Collaborative optimization effort as part of UTK’s participation in PERI, the Performance Engineering Research Institute
GCO Framework
[Figure: GCO framework. Labeled elements: code, front end, IR, Loop Analyzer, CG, Search Engine, Driver generator, testing driver, info of tuning parameters, tuning parameters]
MADNESS Kernel Tuning
• GCO didn’t work!
• Instead:
  – Extract matrix-vector multiplication kernel from doitgen routine
  – Design and hand-code a specific code generator for small size matrix-vector multiplication
  – Tune optimal block size and unrolling factor separately for each input size
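The last point, tuning a parameter separately for each input size, can be sketched as a timed search over candidate values. This is not the MADNESS code generator: matvec_blocked and tune_block_size are hypothetical names, only the block size is searched (no unrolling factor), and the kernel is a plain blocked matrix-vector product timed with the standard clock() function.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Keeps the timed results observable so the loops are not optimized away. */
static volatile double sink;

/* y = A * x for a row-major n x n matrix, sweeping the columns in blocks of bs. */
void matvec_blocked(const double *a, const double *x, double *y, int n, int bs) {
    assert(bs > 0);
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int jb = 0; jb < n; jb += bs) {
        int jend = jb + bs < n ? jb + bs : n;
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = jb; j < jend; j++) s += a[i * n + j] * x[j];
            y[i] += s;
        }
    }
}

/* Time every candidate block size for input size n; return the fastest one. */
int tune_block_size(int n, const int *cand, int ncand, int reps) {
    double *a = malloc(sizeof *a * n * n);
    double *x = malloc(sizeof *x * n);
    double *y = malloc(sizeof *y * n);
    assert(a && x && y && ncand > 0);
    for (int i = 0; i < n * n; i++) a[i] = 1.0;
    for (int i = 0; i < n; i++) x[i] = 1.0;
    int best = cand[0];
    double best_t = -1.0;
    for (int c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++) matvec_blocked(a, x, y, n, cand[c]);
        double t = (double)(clock() - t0);
        sink = y[0];
        if (best_t < 0.0 || t < best_t) { best_t = t; best = cand[c]; }
    }
    free(a); free(x); free(y);
    return best;
}
```

Running tune_block_size once per input size yields a table of tuned values, one entry per size; which candidate wins depends on the machine and on timing noise.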
[Figure: "MFLOPS Opteron (1.8 GHz)"; series: auto-tuned C matrix-vector kernel, hand-tuned Fortran multi-resolution kernel, reference kernel in C, atlas matrix-vector C kernel; x-axis: SIZE (1 to 31); y-axis: MFLOPS (0 to 2500)]
[Figure: "MFLOPS Pentium 4 (1.7 GHz)"; series: auto-tuned C matrix-vector kernel, hand-tuned Fortran multi-resolution kernel, reference kernel in C, atlas matrix-vector C kernel; x-axis: SIZE (1 to 31); y-axis: MFLOPS (0 to 1200)]
[Figure: "MFLOPS Woodcrest (3.0 GHz)"; series: auto-tuned C matrix-vector kernel, hand-tuned Fortran multi-resolution kernel, reference kernel in C, atlas matrix-vector C kernel; x-axis: SIZE (1 to 28); y-axis: MFLOPS (0 to 4000)]
MADNESS Conclusions
• We have demonstrated an effective empirical tuning strategy for optimizing the doitgen computational kernel code
  – less effort than hand tuning
  – better performance than either:
    • hand-tuned or
    • general purpose optimization
• Future
  – Aggressive code generator for MV multiplication
  – Parallelize parameter search
Thank you
http://icl.cs.utk.edu
AllReduce algorithms
The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.
Input: A is block distributed by rows
Output: Q is block distributed by rows, R is global
[Figure: A = (A1; A2; A3) is factorized into Q = (Q1; Q2; Q3) and R]
AllReduce algorithms
They are used in:
• iterative methods with multiple right-hand sides (block iterative methods):
  – Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
  – BlockGMRES, BlockGCR, BlockCG, BlockQMR, …
• iterative methods with a single right-hand side:
  – s-step methods for linear systems of equations (e.g. A. Chronopoulos),
  – LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,
  – recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).
• iterative eigenvalue solvers:
  – PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
  – HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
  – Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
  – PRIMME (A. Stathopoulos, Coll. William & Mary)
AllReduce algorithms
[Figure: two processes (vertical axis) over time (horizontal axis), in four steps:
 1. each process i computes ( R_i^(0), V_i^(0) ) = QR( A_i ), i = 0, 1
 2. ( R_0^(1), V_0^(1), V_1^(1) ) = QR( [ R_0^(0) ; R_1^(0) ] )
 3. ( Q_0^(1), Q_1^(1) ) = Apply( [ V_0^(1) ; V_1^(1) ] to [ I_n ; 0_n ] )
 4. Q_0 = Apply( V_0^(0) to [ Q_0^(1) ; 0_n ] ) and Q_1 = Apply( V_1^(0) to [ Q_1^(1) ; 0_n ] )]
AllReduce algorithms
[Figure: four processes (vertical axis) over time (horizontal axis), reduced along a binary tree:
 each process i computes R_i^(0) = QR( A_i ), i = 0, ..., 3
 then R_0^(1) = QR( [ R_0^(0) ; R_1^(0) ] ) and R_2^(1) = QR( [ R_2^(0) ; R_3^(0) ] )
 then R = QR( [ R_0^(1) ; R_2^(1) ] )]
AllReduce algorithms: performance
[Figure, Weak Scalability: "N = 50, locM = 100.000 -- Pentium III + Dolphin"; series: rhh_qr3, qrf; x-axis: # of processors (0 to 70); y-axis: Mflop/s per processor (10 to 130)]
[Figure, Strong Scalability: "N = 50, M = 100000 -- Pentium III + Dolphin"; series: rhh_qr3, qrf; x-axis: # of processors (0 to 35); y-axis: Mflop/s per processor (10 to 120)]