SIAM EX14 Workshop, July 7, Chicago, IL
Preliminary Investigations on Resilient Parallel Numerical Linear Algebra Solvers
Luc Giraud
Joint work with E. Agullo, P. Salas, E. F. Yetkin, M. Zounon
Funded by ANR RESCUE and G8-ECS
HiePACS Inria Project, Joint Inria-CERFACS lab, INRIA Bordeaux Sud-Ouest
Context
L. Giraud - Resilient numerical linear algebra solvers 2/ 25
Resilience: ability to compute a correct output in the presence of faults
- Context: numerical linear algebra
- Goal: keep converging in the presence of faults
- Method: recover-restart strategy without checkpointing
- HPC systems are not fault-free
- A faulty component (node, core, memory) loses all its data
- Simulations at exascale have to be resilient
Outline
Faults in HPC Systems
Sparse linear systems
Interpolation methods
Numerical experiments
Resilience in eigensolvers
Concluding remarks and perspectives
Faults in HPC Systems
Framework
Forecast for extreme-scale systems
- Mean Time Between Failures (MTBF): less than one hour
- Checkpoint time might be larger than MTBF
Objectives
- Explore fault-tolerant schemes with little or no overhead
- Numerical algorithms to deal with the overhead issue
Faults in this presentation
- Detected corrupted memory space (node crashes, damaged memory pages, uncorrected bit-flips, ...)
Sparse linear systems
[Figure: sparse linear system Ax = b]
We attempt to design fault-tolerant solvers for sparse linear systems Ax = b
Two classes of iterative methods
- Stationary methods (Jacobi, Gauss-Seidel, ...)
- Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)
- Krylov methods have attractive potential for extreme scale
Interpolation methods
Block row distribution
[Figure: Ax = b distributed by block rows over processors P1, P2, P3, P4. Legend: static data, dynamic data, lost data, interpolated data.]
We distinguish two categories of data:
- Static data
- Dynamic data
Let's assume that P1 fails
- Failed processor is replaced
- Static data are restored
Reset: set x1 to its initial value
Our algorithms aim at recovering x1 and restarting
Overview of our fault-tolerant algorithm
[Figure: timeline of processors P1 to P4. Legend: fault, successful iteration, failed iteration, interpolation, restart.]
- Sequential simulations
- Simulation of a parallel environment
- Generation of a fault trace
- Realistic probability distribution
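The fault-trace generation mentioned above could be sketched as follows. This is only an illustration: the slides do not name the probability distribution, so the Weibull inter-arrival times, the function name `generate_fault_trace` and all parameter values are assumptions made for this example.

```python
import math
import numpy as np

def generate_fault_trace(n_procs, horizon, mtbf, shape=0.7, seed=0):
    """Return a list of (time, processor) fault events in [0, horizon).

    Assumption (not from the slides): inter-arrival times follow a Weibull
    law whose mean equals `mtbf`; the failing processor is drawn uniformly.
    """
    rng = np.random.default_rng(seed)
    # Mean of a Weibull(shape, scale) law is scale * Gamma(1 + 1/shape).
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)
    events = []
    t = 0.0
    while True:
        t += scale * rng.weibull(shape)  # next inter-arrival time
        if t >= horizon:
            break
        events.append((t, int(rng.integers(n_procs))))
    return events

# Hypothetical setting: 4 simulated processors, times in arbitrary units.
trace = generate_fault_trace(n_procs=4, horizon=1000.0, mtbf=50.0)
```

In a sequential simulation of the parallel run, each event of the trace would mark the iteration at which the block owned by the chosen processor is declared lost.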
Fault in the linear system: how to recover x1?
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} ? \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$
Linear Interpolation (LI) [Langou, Chen, Bosilca, Dongarra, SISC, 2007]
Solve $A_{11} x_1 = b_1 - A_{12} x_2$
Least Squares Interpolation (LSI)
$$\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1 + \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$
$$x_1 = \arg\min_{x} \left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} - \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x - \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 \right\|_2$$
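The two recovery formulas above can be illustrated with a minimal NumPy sketch. It uses a small dense SPD matrix for readability, whereas the talk targets distributed sparse systems; the function names `li_recover` and `lsi_recover` and the test problem are assumptions made for this example.

```python
import numpy as np

def li_recover(A, b, x, lost):
    """Linear Interpolation: solve A11 x1 = b1 - A12 x2 for the lost block."""
    keep = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b[lost] - A[np.ix_(lost, keep)] @ x[keep]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def lsi_recover(A, b, x, lost):
    """Least Squares Interpolation: minimise ||b - A[:, lost] x1 - A[:, keep] x2||_2."""
    keep = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, keep] @ x[keep]
    x1, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new

# Hypothetical test problem: small dense SPD matrix, known exact solution.
rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
x_exact = rng.standard_normal(n)
b = A @ x_exact

x_before = x_exact + 1e-2 * rng.standard_normal(n)  # iterate just before the fault
lost = np.arange(3)                                  # entries owned by the failed process
x_fault = x_before.copy()
x_fault[lost] = 0.0                                  # lost entries are never read below

x_li = li_recover(A, b, x_fault, lost)
x_lsi = lsi_recover(A, b, x_fault, lost)

def a_norm(e):
    """A-norm of a vector for the SPD matrix A above."""
    return np.sqrt(e @ A @ e)
```

Both functions only read the surviving entries of `x`, so the value stored in the lost block at the time of the call does not matter.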
Main properties - basic linear algebra
Proposition
The initial guess generated by LI after a fault ensures that the A-norm of the forward error associated with the iterates computed by restarted CG or PCG is monotonically decreasing.
[LI might not be defined for non-SPD matrices, as diagonal blocks might be singular]
Proposition
The initial guess generated by LSI after a fault ensures the monotonic decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MINRES after a restart due to a failure.
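A short sketch of why the first property is plausible, assuming A is SPD. The notation $x^*$ for the exact solution and $e$ for the forward error is introduced here and does not appear on the slides.

```latex
% Sketch under the assumption that A is SPD; x^* is the exact solution.
\[
e = x^* - x = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}, \qquad
\|e\|_A^2 = e_1^T A_{11} e_1 + 2\, e_1^T A_{12} e_2 + e_2^T A_{22} e_2 .
\]
% With e_2 fixed by the surviving data, the minimiser over e_1 satisfies
\[
A_{11} e_1 + A_{12} e_2 = 0
\;\Longleftrightarrow\;
A_{11} x_1 = \underbrace{A_{11} x_1^* + A_{12} x_2^*}_{=\, b_1} - A_{12} x_2 ,
\]
% which is exactly the LI system A_{11} x_1 = b_1 - A_{12} x_2.
```

So, among all choices of x1 for the given x2, the LI guess minimizes the A-norm of the error; in particular it is no worse than the iterate held before the fault. The LSI guess has the analogous property for the residual 2-norm directly from its definition as a least squares minimizer.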
Numerical experiments
Impact of fault rate
Preconditioned GMRES (Kim1 - 2 % data lost)
[Figure: four panels (4, 8, 17 and 40 faults). Relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for Reset, LI, LSI, SC and REF.]
Impact of lost data volume
Preconditioned GMRES(100) (Averous/epb3 - 10 faults)
[Figure: four panels (3 %, 0.8 %, 0.2 % and 0.001 % data lost). Relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for Reset, LI, LSI, SC and REF.]
Penalty of restart strategy
- Recover-restart strategy
- When restarting, we lose the Krylov subspace built before the fault
- Consequence: delay of convergence due to restart
- Restarting mechanism is naturally implemented in GMRES to reduce the computational resource consumption
- CG does not need to be restarted
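The recover-restart idea can be sketched for unpreconditioned CG as follows. This is a toy sequential simulation, not the implementation used in the talk: the function name `cg_with_li_restart`, the 1D Poisson test matrix and the fault iterations are assumptions chosen for illustration. At a simulated fault, the lost block is rebuilt with LI and the search direction is reset, which is where the Krylov information is discarded.

```python
import numpy as np

def cg_with_li_restart(A, b, fault_iters, lost, tol=1e-10, max_iter=500):
    """CG that, at the iterations in `fault_iters`, pretends x[lost] is lost,
    rebuilds it with Linear Interpolation and restarts from that guess."""
    n = b.size
    keep = np.setdiff1d(np.arange(n), lost)
    b_norm = np.linalg.norm(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    restarts = 0
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        restart = k in fault_iters
        if restart:
            # LI recovery: solve A11 x1 = b1 - A12 x2 with the surviving x2.
            rhs = b[lost] - A[np.ix_(lost, keep)] @ x[keep]
            x[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
            r = b - A @ x  # residual recomputed from the recovered iterate
            restarts += 1
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            return x, k, restarts
        # On restart the previous search directions are dropped.
        p = r.copy() if restart else r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter, restarts

# Hypothetical test: 1D Poisson matrix, two simulated faults on the first block.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters, restarts = cg_with_li_restart(A, b, fault_iters={5, 12}, lost=np.arange(10))
```

Comparing `iters` with a fault-free run (`fault_iters=set()`) gives a rough feel for the convergence delay caused by the restarts.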
Penalty of restart strategy on PCG
[Figure: PCG on a 7-point stencil 3D Poisson equation with 70 faults - 5 % data lost. A-norm of the error (log scale, 1e-13 to 1) versus iterations (0 to 830) for Reset, LI, LSI, SC and REF.]
Resilience in eigensolvers
Recovery-restart for eigensolvers
Fault in the eigenproblem: how to recover x1?
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} ? \\ x_2 \end{pmatrix} = \lambda \begin{pmatrix} ? \\ x_2 \end{pmatrix}$$
Linear Interpolation (LI)
Solve the linear system
$$(A_{11} - \lambda I_1)\, x_1 = -A_{12} x_2$$
Least Squares Interpolation (LSI)
$$\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1 + \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2 = \lambda \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
$$x_1 = \arg\min_{x} \left\| \begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} x + \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} x_2 \right\|_2$$
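The eigenvector recovery formulas above can be sketched in NumPy as follows. The example is deliberately idealized: it uses a small dense symmetric matrix and plugs in an exact eigenpair, whereas in an actual eigensolver λ would be the current approximation. The function names `eig_li_recover` and `eig_lsi_recover` are assumptions made for this example.

```python
import numpy as np

def eig_li_recover(A, lam, x, lost):
    """LI for eigenvectors: solve (A11 - lam*I1) x1 = -A12 x2."""
    keep = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11_shifted = A[np.ix_(lost, lost)] - lam * np.eye(len(lost))
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11_shifted, -A[np.ix_(lost, keep)] @ x[keep])
    return x_new

def eig_lsi_recover(A, lam, x, lost):
    """LSI for eigenvectors: minimise ||(A - lam*I)[:, lost] x1 + (A - lam*I)[:, keep] x2||_2."""
    keep = np.setdiff1d(np.arange(A.shape[0]), lost)
    shifted = A - lam * np.eye(A.shape[0])
    x1, *_ = np.linalg.lstsq(shifted[:, lost], -shifted[:, keep] @ x[keep], rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new

# Hypothetical test: random symmetric matrix and its smallest eigenpair.
rng = np.random.default_rng(1)
n = 8
S = rng.standard_normal((n, n))
A = (S + S.T) / 2
eigvals, eigvecs = np.linalg.eigh(A)
lam, v = eigvals[0], eigvecs[:, 0]

lost = np.arange(3)
x_fault = v.copy()
x_fault[lost] = 0.0  # entries owned by the failed process

x_li = eig_li_recover(A, lam, x_fault, lost)
x_lsi = eig_lsi_recover(A, lam, x_fault, lost)
```

With an exact eigenpair both recoveries return the original eigenvector up to rounding; with an approximate pair they would only provide a restart guess.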
[Figure: eigenproblem Ax = λx]
If $Ax = \lambda x$ with $x \neq 0$, where $A \in \mathbb{C}^{n \times n}$, $x \in \mathbb{C}^n$, and $\lambda \in \mathbb{C}$, then
- λ: eigenvalue
- x: eigenvector
- (λ, x): eigenpair
Two classes of methods
- Fixed-point methods (power method, subspace iteration)
- Subspace methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)
Thermo-acoustic test example (a few smallest eigenvalues)
Jacobi-Davidson method
[Figure: Jacobi-Davidson method with 5 faults - 1 % lost data. Convergence history using LSI and checkpoint of the current iterate. Scaled residual ||Ax - lambda*x|| / ||lambda|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240); curves: LSI, REF.]
Concluding remarks and perspectives
Concluding remarks
Summary
- We have designed techniques to interpolate meaningful lost data based on simple linear algebra tools
- Our techniques preserve some of the key monotonicity properties of Krylov solvers, but LI lacks robustness for non-SPD problems
- The restarting effect remains reasonable within the GMRES context
- No fault, no overhead
- These techniques can be adapted to multiple faults
- What about silent soft errors? CGPOP preliminary experiments
Thank you for your attention
Questions?
https://team.inria.fr/hiepacs/