
August 28, 2014

Task-based linear solvers for modern architectures
E. Agullo, P. Ramet (and HiePACS team)

7th ITER International School, High Performance Computing in Fusion Science, Aix-en-Provence

E. Agullo, P. Ramet, HiePACS team, Inria Bordeaux Sud-Ouest, LaBRI, Bordeaux University

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 2

Introduction

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 3

1. Introduction

Introduction

Mixed/Hybrid direct-iterative methods

The "spectrum" of linear algebra solvers

Direct solvers:
- Robust/accurate for general problems
- BLAS-3 based implementation
- Memory/CPU prohibitive for large 3D problems
- Limited parallel scalability

Iterative solvers:
- Problem-dependent efficiency / controlled accuracy
- Only mat-vec products required, fine-grain computation
- Less memory consumption, possible trade-off with CPU
- Attractive "built-in" parallel features

E. Agullo, P. Ramet - ITER School August 28, 2014- 5

Direct

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 6

2. Sparse direct factorization - PaStiX

Direct

Major steps for solving sparse linear systems

1. Analysis: the matrix is preprocessed to improve its structural properties (A'x' = b' with A' = Pn P Dr A Dc Q P^T)

2. Factorization: the matrix is factorized as A = LU, LLT or LDLT

3. Solve: the solution x is computed by means of forward and backward substitutions (a minimal dense sketch follows)
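To make step 3 concrete, here is a minimal dense sketch in C of the forward and backward substitutions, assuming an LU factorization without pivoting stored in a single column-major array; it deliberately ignores sparsity and the permutations/scalings of the analysis step and is only an illustration, not the solver's actual kernel.

/* Solve L*y = b then U*x = y, where the unit lower triangular L and the
 * upper triangular U are stored together in the dense column-major n x n
 * array lu. b is overwritten with the solution x. A sparse supernodal
 * solver performs the same substitutions block by block. */
static void lu_solve_dense(int n, const double *lu, double *b)
{
    /* forward substitution: L*y = b (diagonal of L is implicitly 1) */
    for (int j = 0; j < n; j++)
        for (int i = j + 1; i < n; i++)
            b[i] -= lu[i + j * n] * b[j];

    /* backward substitution: U*x = y */
    for (int j = n - 1; j >= 0; j--) {
        b[j] /= lu[j + j * n];
        for (int i = 0; i < j; i++)
            b[i] -= lu[i + j * n] * b[j];
    }
}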

E. Agullo, P. Ramet - ITER School August 28, 2014- 8

Direct

Direct Method and Nested Dissection

E. Agullo, P. Ramet - ITER School August 28, 2014- 9

Direct

Supernodal methods

Definition: a supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure.

- All algorithms (ordering, symbolic factorization, factorization, solve) are generalized to block versions.
- Use of efficient matrix-matrix kernels to improve cache usage (see the sketch below).
- Same concept as supervariables for the elimination tree / minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
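To illustrate the matrix-matrix kernels mentioned above, the sketch below performs one supernodal update with a single BLAS-3 call through the standard CBLAS interface; the block layout and the function name are illustrative assumptions, not PaStiX's actual data structures.

#include <cblas.h>

/* C <- C - A1 * A2^T : subtract the product of two off-diagonal row
 * blocks A1 (m x k) and A2 (n x k) of a factored panel from a block C
 * (m x n) of a target supernode. All blocks are column-major and the
 * two row blocks share the leading dimension of their panel (lda).
 * One BLAS-3 call updates a whole block at once, which is where the
 * cache efficiency of supernodal methods comes from. */
static void supernode_update(int m, int n, int k,
                             const double *A1, const double *A2, int lda,
                             double *C, int ldc)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                -1.0, A1, lda, A2, lda,
                 1.0, C, ldc);
}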

E. Agullo, P. Ramet - ITER School August 28, 2014- 10

Direct

PaStiX main Features

I LLt, LDLt, LU : supernodal implementation (BLAS3)I Static pivoting + Refinement: CG/GMRES/BiCGstabI column-block or block mappingI Simple/Double precision + Float/Complex operationsI MPI/Threads (Cluster/Multicore/SMP/NUMA)I Multiple GPUs using DAG runtimesI Support external ordering library (PT-Scotch or METIS ...)I Multiple RHS (direct factorization)I Incomplete factorization with ILU(k) preconditionnerI Schur complement computationI C/C++/Fortran/Python/PETSc/Trilinos/FreeFem...

E. Agullo, P. Ramet - ITER School August 28, 2014- 11

Direct

Current works

- Astrid Casadei (PhD student): memory optimization to build a Schur complement in PaStiX, tight coupling between sparse direct and iterative solvers (HIPS) + graph partitioning with balanced halo

- Xavier Lacoste (PhD student) and the MUMPS team: GPU optimizations for sparse factorizations with StarPU

- Stojce Nakov (PhD student): tight coupling between sparse direct and iterative solvers (MaPHYS) + GPU optimizations for GMRES with StarPU

- Mathieu Faverge (Assistant Professor): redesigned PaStiX static/dynamic scheduling with PaRSEC in order to get a generic framework for multicore/MPI/GPU/out-of-core + compression

E. Agullo, P. Ramet - ITER School August 28, 2014- 12

Direct

Direct Solver Highlights

Fusion - ITER

PaStiX is used in the JOREK code developed by G. Huysmans at CEA/Cadarache in a fully implicit time evolution scheme for the numerical simulation of the ELM (Edge Localized Mode) instabilities commonly observed in the standard tokamak operating scenario.

MHD pellet injection simulated with the JOREK code

E. Agullo, P. Ramet - ITER School August 28, 2014- 13

Direct

Task algorithms

[Figure: (a) tasks for a dense tile and (b) tasks for a sparse supernode, decomposed into POTRF, TRSM, SYRK, and GEMM kernels.]

E. Agullo, P. Ramet - ITER School August 28, 2014- 14

Direct

DAG representation

[Figure: (c) DAG of the dense tile factorization, built from POTRF, TRSM, SYRK, and GEMM tasks linked by their data dependencies; (d) DAG of the sparse supernodal factorization, built from panel and gemm tasks.]

E. Agullo, P. Ramet - ITER School August 28, 2014- 15

Direct

Direct Solver Highlights (multicore) - SGI 160 cores

Name   N          NNZ_A      Fill ratio   Fact
Audi   9.44x10^5  3.93x10^7  31.28        float LLT
10M    1.04x10^7  8.91x10^7  75.66        complex LDLT

10M          10     20     40     80     160
Facto (s)    3020   1750   654    356    260
Mem (GB)     122    124    127    133    146
Solve (s)    24.6   13.5   3.87   2.90   2.89

Audi         128    2x64    4x32    8x16
Facto (s)    17.8   18.6    13.8    13.4
Mem (GB)     13.4   2x7.68  4x4.54  8x2.69
Solve (s)    0.40   0.32    0.21    0.14

E. Agullo, P. Ramet - ITER School August 28, 2014- 16

Direct

Direct Solver Highlights (cluster of multicore)

RC3 matrix - complex double precision
N = 730,700 - NNZ_A = 41,600,758 - Fill-in = 50

Facto (s)      1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6820    3520    1900    1890
6 threads      1020    639     337     287
12 threads     525     360     155     121

Mem (GB)       1 MPI   2 MPI   4 MPI   8 MPI
1 thread       34      19.2    12.5    9.22
6 threads      34.3    19.5    12.8    9.66
12 threads     34.6    19.7    13      9.14

Solve (s)      1 MPI   2 MPI   4 MPI   8 MPI
1 thread       6.97    3.75    1.93    1.03
6 threads      2.5     1.43    0.78    0.54
12 threads     1.33    0.93    0.66    0.59

E. Agullo, P. Ramet - ITER School August 28, 2014- 17

Direct

Block ILU(k): supernode amalgamation algorithm

Derive a block incomplete LU factorization from the supernodal parallel direct solver:
- Based on the existing PaStiX package
- Level-3 BLAS incomplete factorization implementation
- Fill-in strategy based on the level of fill among block structures identified thanks to the quotient graph (the scalar rule it generalizes is sketched after this slide)
- Amalgamation strategy to enlarge block sizes

Highlights:
- Handles high levels of fill efficiently
- Solving time faster than with scalar ILU(k)
- Scalable parallel implementation
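The block fill-in strategy above generalizes the classical scalar level-of-fill rule, sketched below on a dense level matrix purely for illustration: a fill entry (i,j) created while eliminating pivot k receives level min(lev(i,j), lev(i,k) + lev(k,j) + 1), and only entries with level at most k are kept.

#include <limits.h>

/* Symbolic ILU(k) levels on a dense n x n array: lev[i*n+j] holds the
 * level of entry (i,j). On input it is 0 for structural nonzeros of A
 * and INT_MAX otherwise; on output, entries with lev <= kmax form the
 * ILU(kmax) pattern. Scalar rule only; PaStiX applies it blockwise on
 * the quotient graph. */
static void ilu_levels(int n, int *lev, int kmax)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            if (lev[i * n + k] > kmax) continue;     /* L(i,k) dropped */
            for (int j = k + 1; j < n; j++) {
                if (lev[k * n + j] > kmax) continue; /* U(k,j) dropped */
                int fill = lev[i * n + k] + lev[k * n + j] + 1;
                if (fill < lev[i * n + j])
                    lev[i * n + j] = fill;
            }
        }
}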

E. Agullo, P. Ramet - ITER School August 28, 2014- 18

Manycore

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 19

3. Sparse solver on heterogeneous architectures

Manycore

Multiple layer approach

[Diagram: layered software design - ALGORITHM on top of a RUNTIME layer on top of KERNELS, mapped onto CPUs and GPUs.]

Governing ideas: enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines.

Basics:
- Graph of tasks
- Out-of-order scheduling
- Fine granularity

E. Agullo, P. Ramet - ITER School August 28, 2014- 21

Manycore

DAG schedulers considered

StarPU
- RunTime team – Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic task discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest Finish Time (HEFT) scheduling strategy

PaRSEC (formerly DAGuE)
- ICL – University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized Task Graph
- Only the most compute-intensive kernel on the accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled

E. Agullo, P. Ramet - ITER School August 28, 2014- 22

Manycore

Supernodal sequential algorithm

forall the supernodes S1 do
    panel(S1);                       /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 ← supernode in front of (Bi);
        gemm(S1, S2);                /* sparse GEMM B(k, k≥i) × Bi^T subtracted from S2 */
    end
end

E. Agullo, P. Ramet - ITER School August 28, 2014- 23

Manycore

StarPU task submission

forall the supernodes S1 do
    submit panel(S1);                /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 ← supernode in front of (Bi);
        submit gemm(S1, S2);         /* sparse GEMM B(k, k≥i) × Bi^T subtracted from S2 */
    end
    wait for all tasks();
end
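A minimal C sketch of this submission pattern with the StarPU API is shown below; runtime initialization, data registration, and the real kernel bodies are omitted, and the names panel_cpu, gemm_cpu, and snode are placeholders, not the actual PaStiX implementation.

#include <starpu.h>

/* Placeholder CPU kernels; in the real solver these call the BLAS-3
 * panel factorization and the sparse GEMM update. */
static void panel_cpu(void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }
static void gemm_cpu (void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }

static struct starpu_codelet panel_cl = {
    .cpu_funcs = { panel_cpu }, .nbuffers = 1, .modes = { STARPU_RW },
};
static struct starpu_codelet gemm_cl = {
    .cpu_funcs = { gemm_cpu }, .nbuffers = 2, .modes = { STARPU_R, STARPU_RW },
};

/* Submit the factorization as a graph of tasks; StarPU infers the
 * dependencies from the data handles and schedules out of order.
 * 'snode' holds one registered data handle per supernode; nblocks()
 * and front_of() stand for the symbolic structure of the matrix. */
static void submit_factorization(int nsupernodes, starpu_data_handle_t *snode,
                                 int (*nblocks)(int s1), int (*front_of)(int s1, int bi))
{
    for (int s1 = 0; s1 < nsupernodes; s1++) {
        starpu_task_insert(&panel_cl, STARPU_RW, snode[s1], 0);
        for (int bi = 0; bi < nblocks(s1); bi++) {
            int s2 = front_of(s1, bi);      /* supernode facing block bi */
            starpu_task_insert(&gemm_cl,
                               STARPU_R,  snode[s1],
                               STARPU_RW, snode[s2], 0);
        }
        starpu_task_wait_for_all();         /* synchronization point of the pseudo-code above */
    }
}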

E. Agullo, P. Ramet - ITER School August 28, 2014- 24

Manycore

PaRSEC’s parameterized task graph

[Figure: task graph (DAG) of the sparse supernodal factorization, with panel and gemm tasks and their data dependencies.]

panel(j)

/* Execution Space */
j = 0 .. cblknbr-1

/* Task Locality (Owner Compute) */
: A(j)

/* Data dependencies */
RW A <- (leaf) ? A(j) : C gemm(lastbrow)
     -> A gemm(firstblock+1 .. lastblock)
     -> A(j)

Panel Factorization in JDF Format

E. Agullo, P. Ramet - ITER School August 28, 2014- 25

Manycore

Matrices and Machines

Matrix     Prec   Method   Size      nnz_A    nnz_L     TFlop
FilterV2   Z      LU       0.6e+6    12e+6    536e+6    3.6
Flan       D      LLT      1.6e+6    59e+6    1712e+6   5.3
Audi       D      LLT      0.9e+6    39e+6    1325e+6   6.5
MHD        D      LU       0.5e+6    24e+6    1133e+6   6.6
Geo1438    D      LLT      1.4e+6    32e+6    2768e+6   23
Pmldf      Z      LDLT     1.0e+6    8e+6     1105e+6   28
Hook       D      LU       1.5e+6    31e+6    4168e+6   35
Serena     D      LDLT     1.4e+6    32e+6    3365e+6   47

Table: Matrix description (Z: double complex, D: double).

Machine    Processors                          Frequency   GPUs               RAM
Mirage     Westmere Intel Xeon X5650 (2 x 6)   2.67 GHz    Tesla M2070 (x3)   36 GB

E. Agullo, P. Ramet - ITER School August 28, 2014- 26

Manycore

CPU scaling study: GFlop/s for numerical factorization

[Figure: performance (GFlop/s) of the numerical factorization for afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LLT), audi (D, LLT), MHD (D, LU), Geo1438 (D, LLT), pmlDF (Z, LDLT), and HOOK (D, LU), comparing the native PaStiX scheduler, StarPU, and PaRSEC on 1, 3, 6, 9, and 12 cores.]

E. Agullo, P. Ramet - ITER School August 28, 2014- 27



Manycore

GPU scaling study: GFlop/s for numerical factorization

[Figure: performance (GFlop/s) of the numerical factorization for afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LLT), audi (D, LLT), MHD (D, LU), Geo1438 (D, LLT), pmlDF (Z, LDLT), and HOOK (D, LU), comparing the native scheduler (CPU only), StarPU (CPU only, 1-3 GPUs), PaRSEC with 1 stream (CPU only, 1-3 GPUs), and PaRSEC with 3 streams (1-3 GPUs).]

E. Agullo, P. Ramet - ITER School August 28, 2014- 28

Manycore

Xeon Phi (from the Max-Planck-Institut für Plasmaphysik) [Phi = 4 CPUs, GPU = 6 CPUs]

E. Agullo, P. Ramet - ITER School August 28, 2014- 29

Hybrid

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 30

4. Hybrid methods - HIPS and MaPHYS

Hybrid Linear Solvers

Develop robust, scalable, parallel hybrid direct/iterative linear solvers:
- Exploit the efficiency and robustness of sparse direct solvers
- Develop robust parallel preconditioners for iterative solvers
- Take advantage of scalable implementations of iterative solvers

Domain Decomposition (DD):
- Natural approach for PDEs
- Extends to general sparse matrices
- Partition the problem into subdomains
- Use a direct solver on the subdomains
- Robust preconditioned iterative solver

Hybrid

Method used in MaPHYS:
- Partitioning of the global matrix into several local matrices
  - MeTiS [G. Karypis and V. Kumar]
  - Scotch [Pellegrini et al.]
- Local factorization
  - MUMPS [P. Amestoy et al.] (with Schur option)
  - PaStiX [P. Ramet et al.] (with Schur option and multi-threaded version)
- Construction of the preconditioner
  - MKL library
- Solution of the reduced system
  - CG/GMRES/FGMRES on the reduced system

[Figure: sparsity pattern of a small example matrix (about 400 unknowns, nz = 1920) and its decomposition into subdomains: one domain Ω (0 cut edges), two domains Ω1, Ω2 with interface Γ (48 cut edges), and four domains Ω1..Ω4 with interface Γ (88 cut edges).]
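As a sketch of the algebra behind these steps (the exact preconditioning options offered by MaPHYS may differ), eliminating the interior unknowns I of each subdomain leaves a reduced system on the interface Γ, which is what the Krylov method above iterates on; a classical preconditioner is the additive Schwarz one assembled from local Schur complements:

S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma},
    with  S = \sum_i R_{\Gamma_i}^T S_i R_{\Gamma_i}
    and   M^{-1} = \sum_i R_{\Gamma_i}^T \bar{S}_i^{-1} R_{\Gamma_i},

where S_i is the Schur complement contributed by subdomain Ω_i (computed by the local direct factorization with the Schur option), R_{\Gamma_i} restricts the interface Γ to the part touching Ω_i, and \bar{S}_i = R_{\Gamma_i} S R_{\Gamma_i}^T is the assembled local Schur complement.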

E. Agullo, P. Ramet - ITER School August 28, 2014- 33


Hybrid

Experimental set up

Hopper platform (Hardware)

- Two twelve-core AMD 'Magny-Cours' processors at 2.1 GHz
- Memory: 32 GB DDR3
- Double precision

Matrices

Matrix    Tdr455K    Nachos4M
N         2,738K     4,147K
Nnz       112.7M     256.4M

Table: Overview of sparse matrices used on the Hopper platform

E. Agullo, P. Ramet - ITER School August 28, 2014- 34

Hybrid

Results on the Hopper platform

Achieved performance for the Tdr455K matrix
[Figure: time (s) for all computational steps and peak memory per node (MB) versus number of cores (96 to 3072), for 3, 6, 12, and 24 threads per process.]

Achieved performance for the Nachos4M matrix
[Figure: time (s) for all computational steps and peak memory per node (MB) versus number of cores (768 to 24,576), for 3, 6, 12, and 24 threads per process.]

E. Agullo, P. Ramet - ITER School August 28, 2014- 35

Hybrid

HIPS: hybrid direct-iterative solver

Based on a domain decomposition: the interface is one node wide (no overlap in DD lingo).

A = ( A_B   F  )
    (  E   A_C )

B: interior nodes of the subdomains (direct factorization).
C: interface nodes.

Special decomposition and ordering of the subset C.
Goal: build a global Schur complement preconditioner (ILU) from the local domain matrices only.
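In this block notation, eliminating the interior unknowns B yields the reduced interface system that HIPS preconditions with an incomplete factorization; stated explicitly (a standard identity, added here for clarity):

S = A_C - E A_B^{-1} F,
S x_C = b_C - E A_B^{-1} b_B,   and then   x_B = A_B^{-1} (b_B - F x_C).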

E. Agullo, P. Ramet - ITER School August 28, 2014- 36

Hybrid

HIPS: preconditioners

Main features:
- Iterative or "hybrid" direct/iterative methods are implemented.
- Mix direct supernodal (BLAS-3) and sparse ILUT factorizations in a seamless manner.
- Memory/load balancing: distribute the domains over the processors (domains > processors).

E. Agullo, P. Ramet - ITER School August 28, 2014- 37

H-PaStiX

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogenous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 38

5. Low-rank compression - H-PaStiX

H-PaStiX

Toward low-rank compression in a supernodal solver

Many works on hierarchical matrices and direct solvers:
- Eric Darve: hierarchical matrix classifications (Building O(N) Linear Solvers Using Nested Dissection)
- Sherry Li: multifrontal solver + HSS (Towards an Optimal-Order Approximate Sparse Factorization Exploiting Data-Sparseness in Separators)
- David Bindel: CHOLMOD + low rank (An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization)
- Jean-Yves L'Excellent: MUMPS + Block Low-Rank

E. Agullo, P. Ramet - ITER School August 28, 2014- 40

H-PaStiX

Symbolic factorization

E. Agullo, P. Ramet - ITER School August 28, 2014- 41

H-PaStiX

Nested dissection, 2D mesh/matrix

E. Agullo, P. Ramet - ITER School August 28, 2014- 42

H-PaStiX

Computational cost

E. Agullo, P. Ramet - ITER School August 28, 2014- 43

H-PaStiX

FastLA associate team between INRIA/Berkeley/Stanford

Supernodal solver + hierarchical matrices: O(N log^a N)

1. Check the potential compression ratio on the top-level blocks
2. Develop a prototype with:
   - low-rank compression on the larger supernodes
   - a compression tree built at each update
   - a complexity analysis of the approach
3. Study the coupling between nested dissection and compression tree ordering

Which algorithm to find a low-rank approximation? SVD, RR-LU, RR-QR, ACA, CUR, randomized methods, ...

Which family of hierarchical matrices? H, H2, HODLR, ...
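As one concrete option among the algorithms listed above, the sketch below compresses a dense block into a rank-r product U * Vt with a truncated SVD through the standard LAPACKE interface; the tolerance handling and the function name are illustrative and not tied to the H-PaStiX prototype.

#include <stdlib.h>
#include <lapacke.h>

/* Compress the dense m x n column-major block A into U (m x r) and
 * Vt (r x n) with ||A - U*Vt|| ~ tol * sigma_1, using a truncated SVD.
 * A is overwritten; only the first r columns of U and r rows of Vt are
 * meaningful. Returns the numerical rank r, or 0 on failure. */
static int lowrank_svd(int m, int n, double *A, double tol,
                       double **U_out, double **Vt_out)
{
    int k = m < n ? m : n;
    double *s  = malloc((size_t)k * sizeof(double));
    double *U  = malloc((size_t)m * k * sizeof(double));
    double *Vt = malloc((size_t)k * n * sizeof(double));
    double *superb = malloc((size_t)k * sizeof(double));
    if (!s || !U || !Vt || !superb ||
        LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'S', 'S', m, n, A, m,
                       s, U, m, Vt, k, superb) != 0) {
        free(s); free(U); free(Vt); free(superb);
        return 0;
    }

    int r = 0;                               /* numerical rank           */
    while (r < k && s[r] > tol * s[0]) r++;

    for (int i = 0; i < r; i++)              /* fold S into Vt: Vt <- S*Vt */
        for (int j = 0; j < n; j++)
            Vt[i + (size_t)j * k] *= s[i];

    free(s); free(superb);
    *U_out  = U;
    *Vt_out = Vt;
    return r;
}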

E. Agullo, P. Ramet - ITER School August 28, 2014- 44

Conclusion

Guideline

Introduction

Sparse direct factorization - PaStiX

Sparse solver on heterogeneous architectures

Hybrid methods - HIPS and MaPHYS

Low-rank compression - H-PaStiX

Conclusion

E. Agullo, P. Ramet - ITER School August 28, 2014- 45

6. Conclusion

Conclusion

Software

Graph/mesh partitioner and ordering:

http://scotch.gforge.inria.fr

Sparse linear system solvers:

http://pastix.gforge.inria.fr

http://hips.gforge.inria.fr

https://wiki.bordeaux.inria.fr/maphys/doku.php

E. Agullo, P. Ramet - ITER School August 28, 2014- 47

Conclusion

Software

Fast Multipole Method:

http://scalfmm-public.gforge.inria.fr/

Matrices Over Runtime Systems (with the University of Tennessee):

http://icl.cs.utk.edu/projectsdev/morse

E. Agullo, P. Ramet - ITER School August 28, 2014- 48

Conclusion

Thank You

Pierre Ramet, HiePACS

http://www.labri.fr/~ramet

