Austin_SIAMCSE15

Brian Austin, Alex Druinsky, Osni Marquez, Eric Roman, Sherry Li (LBNL)

Incorporating Error Detection and Recovery into Hierarchically Semi-Separable Matrix Operations


April 8, 2015

Towards Optimal Order Resilient Solvers at Extreme Scale (TOORSES)

Linear solvers are ubiquitous in scientific computing

• Performance

– HSS matrix format reduces computational complexity

• Resilience

– Error rates may increase on extreme-scale systems.

  • Increased concurrency: more parts that might fail

  • Potentially lower part reliability (smaller transistors, near-threshold voltage)


Outline

• Hierarchically Semi-Separable (HSS) decomposition

• Algorithm-based fault tolerance (ABFT) for dense matrices

• Error detection for HSS matrix-vector multiplication

• Error recovery using Containment Domains

• Performance results.


Hierarchically Semi-Separable (HSS) Matrix Decomposition


• Exploits low numerical rank of matrix.

• Structured block sparsity

• Factorization has bounded error.

A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*

HSS Matrix-Vector Multiplication


HSS Matrix-Vector multiplication: b=A.x

D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*
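The nested form above can be applied to a vector without ever assembling A: one sweep applies the V factors from the outside in, and a second sweep combines the B and U factors from the inside out. The following is a minimal sketch, not the talk's implementation. It assumes real matrices (so * is a plain transpose) and uses small dense integer stand-ins for the factors, whereas in an actual HSS code each factor is block-diagonal. The helper names (`hss_matvec`, `hss_to_dense`) and the sizes are hypothetical.

```python
import random

def matvec(A, x):
    """Return A.x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mat_t_vec(A, x):
    """Return A^T.x without forming the transpose."""
    return [sum(a * xi for a, xi in zip(col, x)) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def rand_matrix(rng, rows, cols):
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

def hss_matvec(h, x):
    """Evaluate b = A.x from the nested factors (hypothetical helper)."""
    # First sweep: apply V(3)*, V(2)*, V(1)* from the outside in.
    t3 = mat_t_vec(h["V3"], x)
    t2 = mat_t_vec(h["V2"], t3)
    t1 = mat_t_vec(h["V1"], t2)
    # Second sweep: combine B and U factors from the inside out.
    s = matvec(h["B0"], t1)
    s = vadd(matvec(h["B1"], t2), matvec(h["U1"], s))
    s = vadd(matvec(h["B2"], t3), matvec(h["U2"], s))
    return vadd(matvec(h["D3"], x), matvec(h["U3"], s))

def hss_to_dense(h):
    """Assemble A explicitly, only to cross-check the nested evaluation."""
    M = madd(h["B1"], matmul(matmul(h["U1"], h["B0"]), transpose(h["V1"])))
    M = madd(h["B2"], matmul(matmul(h["U2"], M), transpose(h["V2"])))
    return madd(h["D3"], matmul(matmul(h["U3"], M), transpose(h["V3"])))

rng = random.Random(0)
n, r3, r2, r1 = 8, 6, 4, 2  # illustrative sizes, not from the talk
h = {
    "D3": rand_matrix(rng, n, n),
    "U3": rand_matrix(rng, n, r3), "V3": rand_matrix(rng, n, r3),
    "B2": rand_matrix(rng, r3, r3),
    "U2": rand_matrix(rng, r3, r2), "V2": rand_matrix(rng, r3, r2),
    "B1": rand_matrix(rng, r2, r2),
    "U1": rand_matrix(rng, r2, r1), "V1": rand_matrix(rng, r2, r1),
    "B0": rand_matrix(rng, r1, r1),
}
x = [rng.randint(-3, 3) for _ in range(n)]
b = hss_matvec(h, x)
```

Integer entries are used so that the nested result can be compared with `matvec(hss_to_dense(h), x)` exactly; with floating-point data the comparison would need a tolerance.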

Algorithm-Based Fault Tolerance (ABFT) for Dense Matrices (Huang & Abraham, 1984)

Checksum protection for individual matrices

Recovers up to one error per row/column

eT.A = [eTA]

A.e = [Ae]

Matrix multiplication preserves checksums

[eTA].B = eT.[AB]

A.[Be] = [AB].e


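[Figure: checksum-augmented product A × B = C. A carries the checksums [Ae] and [eTA]; B carries [Be] and [eTB]. The checksums of C satisfy A.[Be] = C.e and [eTA].B = eT.C.]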

Checksum relationships can be derived from associative properties.
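The checksum identities above can be exercised on a small example. This is a minimal sketch in the spirit of classical ABFT, assuming exact integer arithmetic and a single corrupted entry of C; the function names are hypothetical and the matrices are arbitrary illustrations.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def vecmat(v, B):
    return [sum(vi * b for vi, b in zip(v, col)) for col in zip(*B)]

def col_sums(A):
    """eT.A: one checksum per column."""
    return [sum(col) for col in zip(*A)]

def row_sums(A):
    """A.e: one checksum per row."""
    return [sum(row) for row in A]

def check_and_correct(C, expected_cols, expected_rows):
    """Compare the checksums of C against [eTA].B and A.[Be].

    A single bad entry shows up as exactly one mismatched row and one
    mismatched column, which locates it; the row discrepancy gives the fix.
    """
    bad_cols = [j for j, (got, exp) in enumerate(zip(col_sums(C), expected_cols)) if got != exp]
    bad_rows = [i for i, (got, exp) in enumerate(zip(row_sums(C), expected_rows)) if got != exp]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i][j] -= row_sums(C)[i] - expected_rows[i]
        return "corrected"
    return "uncorrectable"

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
B = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]

# Checksums derived from the inputs only, never from the computed C.
expected_cols = vecmat(col_sums(A), B)   # [eTA].B
expected_rows = matvec(A, row_sums(B))   # A.[Be]

C = matmul(A, B)
C[1][2] += 5                             # simulated soft error in one entry
status = check_and_correct(C, expected_cols, expected_rows)
```

Two errors in the same row or column would leave the mismatch pattern ambiguous, which is why the function reports "uncorrectable" in every other case.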

Intermediate error checking for HSS-MV

• Observation: between each pair of parentheses, there is an implicit (i.e. not explicitly stored) matrix.

• Many invariant conditions can be constructed using associativity.

• For example:

y . [ U(3) . U(2) . U(1) . e ] = [ y. U(3) . U(2) . U(1) ] . e

• Many options for error checking at different stages of HSS-MV
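The example invariant can be checked numerically. In this sketch the right-hand checksum vector U(3).U(2).U(1).e is computed once up front, and the chain y.U(3).U(2).U(1) is then evaluated stage by stage; a corrupted intermediate makes the two sides disagree. The matrices, the vector y, and the `fault_stage` parameter are made-up illustrations, and positive integer entries are assumed so that an injected error cannot cancel out as it propagates.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def vecmat(y, A):
    return [sum(yi * a for yi, a in zip(y, col)) for col in zip(*A)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Small stand-ins for the U factors (illustrative values only).
U3 = [[1, 2, 1], [2, 1, 3], [1, 1, 2], [3, 2, 1]]
U2 = [[1, 2], [2, 1], [1, 3]]
U1 = [[2, 1], [1, 2]]
e = [1, 1]

# Left side of the invariant: the checksum vector, computed once.
checksum = matvec(U3, matvec(U2, matvec(U1, e)))

def chain_with_check(y, fault_stage=None):
    """Evaluate y.U3.U2.U1 stage by stage, then test the invariant."""
    z = y
    for stage, U in enumerate((U3, U2, U1)):
        z = vecmat(z, U)
        if stage == fault_stage:
            z[0] += 1                    # simulated soft error
    return z, sum(z) == dot(y, checksum)

y = [1, -2, 3, 4]
z_clean, ok_clean = chain_with_check(y)
z_bad, ok_bad = chain_with_check(y, fault_stage=0)
```

With floating-point data the equality test would become a comparison against a tolerance scaled to the expected rounding error.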


A = D(3) + U(3) ( B(2) + U(2) ( B(1) + U(1) B(0) V(1)* ) V(2)* ) V(3)*

ABFT for HSS-MV

Error checking with adjustable granularity

• Coarse + CD

– [e.A_HSS].x = e.[A_HSS.x]

• Medium + CD

– e.[V(L).x] = [e.V(L)].x

– e.[V(0)…V(L).x] = [e.V(0)…V(L)].x

– [e.A_HSS].x = e.[A_HSS.x]

• Fine + CD

– Detect errors in each MV

• Encoded

– Detect & correct errors in each MV
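The difference between the coarse and medium granularities can be illustrated on a two-stage product b = U.(V*.x), a deliberately simplified stand-in for the full HSS-MV. The coarse check tests only the final result against a precomputed e.A; the medium check also tests the intermediate V*.x, so an error in the first stage is flagged one stage earlier. All names, values, and the `fault_stage` parameter below are hypothetical, and positive integer entries are assumed so that an injected error cannot cancel.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def vecmat(y, A):
    return [sum(yi * a for yi, a in zip(y, col)) for col in zip(*A)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

U = [[1, 2], [3, 1], [2, 2], [1, 4]]   # 4 x 2
V = [[2, 1], [1, 3], [4, 1]]           # 3 x 2, so A = U.V* is 4 x 3

# Checksum vectors, computed once from the factors.
e_Vt = [sum(row) for row in V]                        # e.V*
e_A = matvec(V, [sum(col) for col in zip(*U)])        # e.U.V* = e.A

def staged_matvec(x, granularity, fault_stage=None):
    """Return (b, stage at which an error was flagged, or None)."""
    t = vecmat(x, V)                   # stage 1: t = V*.x
    if fault_stage == 1:
        t[0] += 1                      # simulated soft error
    if granularity == "medium" and sum(t) != dot(e_Vt, x):
        return None, 1                 # e.[V*.x] != [e.V*].x
    b = matvec(U, t)                   # stage 2: b = U.t
    if fault_stage == 2:
        b[0] += 1
    if sum(b) != dot(e_A, x):
        return None, 2                 # e.[A.x] != [e.A].x
    return b, None

x = [1, -1, 2]
b, flagged = staged_matvec(x, "coarse")
_, coarse_flag = staged_matvec(x, "coarse", fault_stage=1)
_, medium_flag = staged_matvec(x, "medium", fault_stage=1)
```

Earlier detection means less work has to be repeated when the computation is rolled back, at the price of more checksum vectors to store and more comparisons per multiplication.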


HSS Matrix-Vector multiplication: b=A.x

Error recovery by Containment Domains (CDs)

Error Detection

• Classical ABFT cannot recover all errors.

– Multiple errors per row.

– Errors in both A and B.

– Redesign for every algorithm.

• Containment Domains provide more robust recovery techniques.

– Users supply validation tests.

– Remote safe store

– Composable (nested,…)

– Automatic escalation

CD pseudocode

CD_Begin()
  // first pass: store “safe” copies of A, B
  // second pass: restore A, B
  CD_Preserve(A, [eTA], [Ae])
  CD_Preserve(B, [eTB], [Be])
  Compute: C = A.B
  CD_Assert(eT.C == [eTA].B)
CD_Complete()
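The preserve, compute, assert, re-execute pattern of the pseudocode can be mimicked in a few lines. This is only a sketch of the control flow, not the Containment Domains library: `run_cd`, its arguments, and the retry limit are hypothetical names chosen for illustration, and the soft error is simulated on the first attempt only.

```python
import copy

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vecmat(v, B):
    return [sum(vi * b for vi, b in zip(v, col)) for col in zip(*B)]

def col_sums(A):
    return [sum(col) for col in zip(*A)]

def run_cd(preserved, body, check, max_retries=2):
    """Hypothetical stand-in for a containment domain.

    Keeps a safe copy of the preserved data, runs the body on a fresh
    working copy, and re-executes from the safe copy if the check fails.
    """
    safe = copy.deepcopy(preserved)          # first pass: store safe copies
    for attempt in range(max_retries + 1):
        state = copy.deepcopy(safe)          # later passes: restore from them
        result = body(state, attempt)
        if check(state, result):
            return result, attempt
    # Mirrors "automatic escalation": give up and let an outer domain recover.
    raise RuntimeError("validation kept failing; escalate to the parent domain")

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
preserved = {"A": A, "B": B, "eTA": col_sums(A)}

def body(state, attempt):
    C = matmul(state["A"], state["B"])
    if attempt == 0:
        C[0][1] += 7                         # simulated soft error, first pass only
    return C

def check(state, C):
    # The assertion from the pseudocode: eT.C == [eTA].B
    return col_sums(C) == vecmat(state["eTA"], state["B"])

C, retries = run_cd(preserved, body, check)
```

Because recovery is a plain re-execution from preserved data, the same wrapper works for errors that checksum correction alone cannot repair, such as several corrupted entries in one row.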


Runtime overhead without error injection


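[Figure: bar chart of time per HSS-MV iteration (s), vertical axis from 0.234 to 0.246, one bar per error-checking scheme. The number in parentheses after each scheme is its memory use in GB.]

Schemes (memory in GB): None (1148.2), Coarse (2290.4), Medium (2290.9), Fine (2292.2), Encoded + Coarse (2295.9), Encoded (1164.2)

• Overhead is less than 2%

– Comparable to natural performance variation.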
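The memory figures attached to the chart can be turned into ratios against the unprotected run. The short sketch below uses only the numbers reported on the slide; the dictionary and variable names are illustrative.

```python
# Memory use in GB, as labelled on the chart.
memory_gb = {
    "None": 1148.2,
    "Coarse": 2290.4,
    "Medium": 2290.9,
    "Fine": 2292.2,
    "Encoded + Coarse": 2295.9,
    "Encoded": 1164.2,
}

baseline = memory_gb["None"]
ratios = {name: gb / baseline for name, gb in memory_gb.items()}

for name, ratio in ratios.items():
    print(f"{name:18s} {ratio:.3f}x")
```

Every scheme that uses CD preservation lands just under twice the baseline, while the Encoded scheme alone adds only about one percent.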

HSS-MV performance with error injection


[Figure: time per HSS-MV iteration (s), vertical axis from 0.20 to 0.45, versus error rate (#/s), horizontal axis from 1.0E-3 to 1.0E+0 on a logarithmic scale. Series: Coarse, Medium, Fine, Encoded.]

Conclusions & Future work

• Identified checksum relationships to validate HSS-MV operations.

• Fine-grained error checking:

– has very low overhead

– maintains excellent efficiency at high error rates.

• Containment Domains

– Fine-grained preservation incurs minimal runtime overhead.

– Preservation doubles memory capacity requirements.

• Merge fault-tolerance branch into main (parallel) HSS code.

• Incorporation into linear solver


Acknowledgement

• TOORSES (LBNL)

– Sherry Li (PI)

– Eric Roman

– Osni Marquez

– Alex Druinsky

• Strumpack – HSS Library

– Francois-Henry Rouet

• Containment Domains (UT)

– Mattan Erez

– Kyushick Lee

• Support

– This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231.

– This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.


National Energy Research Scientific Computing Center

