Brian Austin, Alex Druinsky, Osni Marquez, Eric Roman, Sherry Li (LBNL)
Incorporating Error Detection and Recovery into Hierarchically Semi-Separable Matrix Operations
April 8, 2015
Towards Optimal Order Resilient Solvers
at Extreme Scale (TOORSES)
Linear solvers are ubiquitous in scientific computing
• Performance
– HSS matrix format reduces computational complexity
• Resilience
– Error rates may increase on extreme-scale systems.
    • Increased concurrency: more parts that might fail
    • Potentially lower part reliability (smaller transistors, near-threshold voltage)
Outline
• Hierarchically Semi-Separable (HSS) decomposition
• Algorithm-based fault tolerance (ABFT) for dense matrices
• Error detection for HSS matrix-vector multiplication
• Error recovery using Containment Domains
• Performance results.
Hierarchically Semi-Separable (HSS) Matrix Decomposition
• Exploits low numerical rank of matrix.
• Structured block sparsity
• Factorization has bounded error.
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
HSS Matrix Vector multiplication
HSS Matrix-Vector multiplication: b=A.x
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
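The nested formula can be applied to a vector without ever forming A. The following is a minimal sketch in plain Python, assuming real matrices (so * is a transpose) stored as small dense lists of lists rather than the block-structured factors a real HSS code would use. The names hss_matvec and hss_dense and the example data are illustrative only and do not come from the HSS library.

```python
def matvec(M, x):
    """Dense matrix-vector product M.x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(M, N):
    """Dense matrix-matrix product M.N."""
    return [[sum(m * n for m, n in zip(row, col)) for col in zip(*N)] for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def madd(M, N):
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def hss_matvec(D, U, V, B, x):
    """b = A.x using the nested form; U, V are indexed 1..L, B is indexed 0..L-1."""
    L = len(U) - 1
    # Upsweep: xs[l] = V(l)* . V(l+1)* ... V(L)* . x
    xs = {L + 1: x}
    for l in range(L, 0, -1):
        xs[l] = matvec(transpose(V[l]), xs[l + 1])
    # Downsweep: expand the nested parentheses from the inside out.
    y = matvec(B[0], xs[1])
    for l in range(1, L):
        y = vadd(matvec(B[l], xs[l + 1]), matvec(U[l], y))
    return vadd(matvec(D, x), matvec(U[L], y))

def hss_dense(D, U, V, B):
    """Form A explicitly (for checking only; this defeats the purpose of HSS)."""
    L = len(U) - 1
    M = B[0]
    for l in range(1, L):
        M = madd(B[l], matmul(matmul(U[l], M), transpose(V[l])))
    return madd(D, matmul(matmul(U[L], M), transpose(V[L])))

# Hypothetical 4x4 example with three levels (inner ranks 3, 2, 1).
D = [[2, 0, 0, 0], [0, 3, 0, 0], [0, 0, 1, 0], [0, 0, 0, 4]]
U = [None, [[1], [2]], [[1, 0], [0, 1], [1, 1]],
     [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]]
V = [None, [[1], [1]], [[1, 1], [0, 1], [1, 0]],
     [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]]
B = [[[3]], [[2, 1], [0, 1]], [[1, 2, 0], [0, 1, 0], [3, 0, 1]]]
x = [1, 2, 3, 4]

b = hss_matvec(D, U, V, B, x)
assert b == matvec(hss_dense(D, U, V, B), x)  # integer data, so equality is exact
```

Integer entries are used so that the nested and explicit products agree exactly; with floating-point data the comparison would need a tolerance.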
Algorithm-Based Fault Tolerance (ABFT) for Dense Matrices (Huang & Abraham, 1984)
Checksum protection for individual matrices
Recovers up to one error per row/column
eT.A = [eTA]
A.e = [Ae]
Matrix multiplication preserves checksums
[eTA].B = eT.[AB]
A.[Be] = [AB].e
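The two checksum identities are enough to locate and repair a single corrupted entry of the product. The sketch below illustrates this in plain Python with integer data so that comparisons are exact; floating-point data would need a tolerance. The function names are hypothetical, and for brevity the input checksums are computed inside checked_matmul, whereas an ABFT scheme would compute and store them ahead of time.

```python
def matmul(A, B):
    """Dense matrix-matrix product A.B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def checked_matmul(A, B):
    """Return C = A.B plus the checksums of C predicted from the inputs alone."""
    eTA = [sum(col) for col in zip(*A)]   # column checksums [eTA]
    Be = [sum(row) for row in B]          # row checksums [Be]
    C = matmul(A, B)
    col_check = [sum(a * b for a, b in zip(eTA, col)) for col in zip(*B)]  # [eTA].B
    row_check = [sum(a * b for a, b in zip(row, Be)) for row in A]         # A.[Be]
    return C, col_check, row_check

def detect_and_correct(C, col_check, row_check):
    """Repair at most one bad entry of C in place; return its (row, col) or None."""
    bad_rows = [i for i, row in enumerate(C) if sum(row) != row_check[i]]
    bad_cols = [j for j, col in enumerate(zip(*C)) if sum(col) != col_check[j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i][j] += row_check[i] - sum(C[i])   # restore the row sum
        return (i, j)
    raise ValueError("error pattern is beyond what these checksums can recover")

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, col_check, row_check = checked_matmul(A, B)
C[1][0] += 100                               # inject a single fault into the result
location = detect_and_correct(C, col_check, row_check)
assert location == (1, 0)
assert C == matmul(A, B)
```

The intersection of the failing row and the failing column pinpoints the entry, which is why recovery is limited to one error per row or column.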
[Figure: A with its checksums [Ae] and [eTA], times B with its checksums [Be] and [eTB], equals C; the checksums satisfy A.[Be] = C.e and [eTA].B = eT.C.]
Checksum relationships can be derived from associative properties.
Intermediate error checking for HSS-MV
• Observation: between each pair of parentheses, there is an implicit (i.e. not explicitly stored) matrix.
• Many invariant conditions can be constructed using associativity.
• For example:
y . [ U(3) . U(2) . U(1) . e ] = [ y. U(3) . U(2) . U(1) ] . e
• Many options for error checking at different stages of HSS-MV
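The example invariant can be exercised on a small chain of factors. In this sketch (plain Python, integer data, hypothetical names and matrices), the right-hand checksum vector [U(3).U(2).U(1).e] is precomputed once by matrix-vector products, so the implicit product matrix is never formed; the left-to-right product y.U(3).U(2).U(1) is then verified against it.

```python
def matvec(M, x):
    """Matrix times column vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def vecmat(y, M):
    """Row vector times matrix."""
    return [sum(yi * m for yi, m in zip(y, col)) for col in zip(*M)]

def chain_checksum(factors):
    """Precompute [U(3).U(2).U(1).e] right to left, one matrix-vector product at a time."""
    c = [1] * len(factors[-1][0])   # e: all ones, one entry per column of the last factor
    for M in reversed(factors):
        c = matvec(M, c)
    return c

def checked_chain(y, factors, checksum):
    """Compute y.U(3).U(2).U(1) left to right and verify it against the checksum."""
    r = y
    for M in factors:
        r = vecmat(r, M)
    predicted = sum(yi * ci for yi, ci in zip(y, checksum))   # y.[U(3).U(2).U(1).e]
    if sum(r) != predicted:                                   # [y.U(3).U(2).U(1)].e
        raise ArithmeticError("checksum mismatch in factor chain")
    return r

U3 = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
U2 = [[1, 0], [0, 1], [1, 1]]
U1 = [[1, 2], [0, 1]]
factors = [U3, U2, U1]
y = [1, 2, 3, 4]

checksum = chain_checksum(factors)      # computed once, while the data is trusted
r = checked_chain(y, factors, checksum)
```

If an entry of one factor is corrupted after the checksum has been computed, the two sides generally disagree and checked_chain raises instead of returning a silently wrong result.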
A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
ABFT for HSS-MV
Error checking with adjustable granularity
• Coarse + CD
– [e.AHSS].x = e.[AHSS.x]
• Medium + CD
– e.[V(L).x] = [e.V(L)].x
– e.[V(0)…V(L).x] = [e.V(0)…V(L)].x
– [e.AHSS].x = e.[AHSS.x]
• Fine + CD
– Detect errors in each MV
• Encoded
– Detect and correct errors in each MV
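The coarse check compares one scalar per matrix-vector product. The following sketch shows the idea in plain Python under stated assumptions: a small dense matrix stands in for AHSS, the checksum row [e.AHSS] is computed once from trusted data, and the relative tolerance rel_tol is an arbitrary illustrative choice for floating-point round-off. None of the names come from the HSS code.

```python
def matvec(M, x):
    """Dense matrix-vector product M.x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def column_checksums(M):
    """[e.A]: the row vector of column sums, computed once and kept."""
    return [sum(col) for col in zip(*M)]

def coarse_checked_matvec(apply_A, eA, x, rel_tol=1e-10):
    """Run b = A.x through apply_A and verify [e.A].x == e.[A.x] up to round-off."""
    b = apply_A(x)
    predicted = sum(c * xi for c, xi in zip(eA, x))   # [e.A].x
    observed = sum(b)                                 # e.[A.x]
    scale = max(1.0, abs(predicted), abs(observed))
    if abs(predicted - observed) > rel_tol * scale:
        raise ArithmeticError("coarse checksum mismatch in HSS-MV")
    return b

# Dense stand-in for A_HSS (hypothetical data).
A = [[4.0, 1.0, 0.5], [1.0, 3.0, 0.25], [0.5, 0.25, 2.0]]
x = [1.0, -2.0, 0.5]
eA = column_checksums(A)

b = coarse_checked_matvec(lambda v: matvec(A, v), eA, x)

def faulty_apply(v):
    """Same product, with an injected error in one output entry."""
    out = matvec(A, v)
    out[1] += 1e-3
    return out
```

A mismatch only says that something went wrong somewhere in the product, so the coarse level relies on re-execution for recovery; the finer levels apply the same kind of test to individual stages.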
Error recovery by Containment Domains (CDs)
Error Detection
• Classical ABFT cannot recover all errors.
– Multiple errors per row.
– Errors in both A and B.
– Requires redesign for every algorithm.
• Containment Domains provide more robust recovery techniques.
– Users supply validation tests.
– Remote safe store
– Composable (nested,…)
– Automatic escalation
CD pseudocode
CD_Begin()
  // first pass: store “safe” copies of A, B
  // second pass: restore A, B
  CD_Preserve(A, [eTA], [Ae])
  CD_Preserve(B, [eTB], [Be])
  Compute: C = A.B
  CD_Assert(eT.C == [eTA].B)
CD_Complete()
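The preserve, compute, assert, re-execute pattern of the pseudocode can be mimicked in a few lines. This is a toy emulation in plain Python, not the Containment Domains library API: run_containment_domain, body and validate are hypothetical names, the safe store is just an in-memory deep copy, and the fault is injected deliberately on the first pass.

```python
import copy

def matmul(A, B):
    """Dense matrix-matrix product A.B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def run_containment_domain(inputs, body, validate, max_retries=2):
    """Preserve inputs, run body, validate; on failure restore and re-execute."""
    safe = copy.deepcopy(inputs)            # plays the role of CD_Preserve
    for attempt in range(max_retries + 1):
        if attempt > 0:
            inputs = copy.deepcopy(safe)    # second pass: restore A, B
        result = body(inputs)
        if validate(inputs, result):        # plays the role of CD_Assert
            return result, attempt
    raise RuntimeError("validation kept failing; escalate to the parent domain")

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
inputs = {"A": A, "B": B, "eTA": [sum(col) for col in zip(*A)]}
calls = {"n": 0}

def body(inp):
    """Compute C = A.B, corrupting one entry on the first pass only."""
    calls["n"] += 1
    C = matmul(inp["A"], inp["B"])
    if calls["n"] == 1:
        C[0][1] += 7
    return C

def validate(inp, C):
    """Check eT.C == [eTA].B."""
    lhs = [sum(col) for col in zip(*C)]
    rhs = [sum(a * b for a, b in zip(inp["eTA"], col)) for col in zip(*inp["B"])]
    return lhs == rhs

C, retries = run_containment_domain(inputs, body, validate)
```

Unlike checksum-based correction, this recovery does not depend on the error pattern: any failure that the user-supplied test detects is handled by restoring the preserved inputs and recomputing.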
Runtime overhead without error injection
[Figure: bar chart of time per HSS-MV iteration (s), y-axis from 0.234 to 0.246, for each protection scheme. Memory (GB) in parentheses: None (1148.2), Coarse (2290.4), Medium (2290.9), Fine (2292.2), Encoded + Coarse (2295.9), Encoded (1164.2).]
• Overhead is less than 2%.
• Comparable to natural performance variation.
HSS-MV performance with error injection
[Figure: time per HSS-MV iteration (s), y-axis from 0.20 to 0.45, versus error rate (#/s) from 1.0E-3 to 1.0E+0, for the Coarse, Medium, Fine and Encoded schemes.]
Conclusions & Future work
• Identified checksum relationships to validate HSS-MV operations.
• Fine-grained error checking:
– has very low overhead
– maintains excellent efficiency at high error rates.
• Containment Domains
– Fine-grained preservation incurs minimal runtime overhead.
– Preservation doubles memory capacity requirements.
• Merge fault-tolerance branch into main (parallel) HSS code.
• Incorporation into linear solver
Acknowledgement
• TOORSES (LBNL)
– Sherry Li (PI)
– Eric Roman
– Osni Marquez
– Alex Druinsky
• Strumpack, HSS Library
– Francois-Henry Rouet
• Containment Domains (UT)
– Mattan Erez
– Kyushick Lee
• Support
– This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231.
– This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.