Fault Tolerance & PetaScale Systems: Current Knowledge, Challenges and Opportunities
Franck Cappello, INRIA
Keynote @ EuroPVM/MPI, September 2008, Dublin, Ireland
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Why is Fault Tolerance in PetaScale systems for HPC applications challenging?
A) FT is a difficult research area:
• Related to many other issues: scalability, programming models, environments, communication libraries, storage, etc.
• FT is not treated as a first-class issue (many vendors simply do not provide software to tolerate failures of their systems)
• Software solutions have to work on a large variety of hardware
• New results require strong effort (an old research discipline)
• Etc.
B) We will soon reach a situation where “classic” Rollback-Recovery will simply not work anymore! --> let's see why
Classic approach for FT: Checkpoint-Restart
[Figure: typical “balanced architecture” for a PetaScale computer (photos: RoadRunner, TACC Ranger, LLNL BG/L): compute nodes connected through network(s) to I/O nodes and a parallel file system (1 to 2 PB); total memory: 100-200 TB; aggregate I/O bandwidth: 40 to 200 GB/s; 1000 sec. < Ckpt < 2500 sec.]
Systems          | Perf.  | Ckpt time | Source
RoadRunner       | 1 PF   | ~20 min.  | Panasas
LLNL BG/L        | 500 TF | >20 min.  | LLNL
Argonne BG/P     | 500 TF | ~30 min.  | LLNL
Total SGI Altix  | 100 TF | ~40 min.  | estimation
IDRIS BG/P       | 100 TF | 30 min.   | IDRIS
Without optimization, Checkpoint-Restart needs about 1 hour! (~30 minutes each)
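The 1000-2500 s bound follows directly from the ratio of memory to I/O bandwidth. As an illustration (numbers picked inside the ranges above, not measurements): checkpoint time ≈ total memory / aggregate I/O bandwidth, e.g. 100 TB / 100 GB/s = 1000 s and 200 TB / 80 GB/s = 2500 s, i.e. roughly 17 to 42 minutes per checkpoint.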
Failure rate and #sockets
In the Top500, machine performance doubles every year (see Jack Dongarra's Top500 slide) --> more than Moore's law plus the increase of #cores per CPU.
If we consider #cores x2 every 18, 24 and 30 months AND a fixed socket MTTI:
[Figures from Garth Gibson: projected system MTTI vs. year; RR, TR, 1 h. wall]
SMTTI ≈ 1 / (1 − (1 − 1/MTTI)^n), where MTTI is the per-socket MTTI and n is the number of sockets
We may reach the 1 h. wall as soon as 2012-2013. Another projection, from Charng-Da Lu, gives similar results.
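A minimal sketch of this kind of projection, with assumed baseline numbers (not the exact figures behind Gibson's curves): starting from an assumed socket count and per-socket MTTI, double the socket count on a fixed schedule and report the year when the system MTTI drops below the ~1 h needed for an unoptimized checkpoint/restart.

```python
# Sketch of the "1 hour wall" projection (assumed baseline values, for illustration only).

SOCKET_MTTI_H = 125_000      # assumed, fixed per-socket MTTI (hours)
N_SOCKETS_2008 = 25_000      # assumed socket count of a 2008-class petascale machine
CKPT_RESTART_H = 1.0         # unoptimized checkpoint + restart time (the "wall")

def system_mtti(socket_mtti_h, n_sockets):
    # small-n/MTTI approximation of SMTTI = 1 / (1 - (1 - 1/MTTI)^n)
    return socket_mtti_h / n_sockets

for doubling_months in (18, 24, 30):
    year, n = 2008.0, N_SOCKETS_2008
    while system_mtti(SOCKET_MTTI_H, n) > CKPT_RESTART_H:
        year += doubling_months / 12.0
        n *= 2
    print(f"#sockets x2 every {doubling_months} months -> SMTTI < 1 h around {year:.0f}")
```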
It's urgent to optimize Rollback-Recovery for PetaScale systems and to investigate alternatives.
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Understanding approach: failure logs
• The Computer Failure Data Repository (CFDR): http://cfdr.usenix.org/ (from '96 until now… HPC systems + Google)
• Failure logs from LANL, NERSC, PNNL, ask.com, SNL, LLNL, etc.
Ex: LANL released root-cause logs for 23,000 events causing application stops on 22 clusters (5,000 nodes), over 9 years
Analysis of failure logs
• In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours on the average to solve.”
• In 2007 (Garth Gibson, ICPP Keynote):
• In 2008 (Oliner and J. Stearley, DSN Conf.):
[Figures: failure root-cause breakdowns from these studies; hardware ≈ 50%]
Conclusion 1: Both hardware and software failures have to be considered.
Conclusion 2 (Oliner): logging tools fail too, some key info is missing; better filtering (correlation) is needed.
An FT system should cover all causes of failures (Rollback Recovery is consistent with this requirement*)
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Rollback-Recovery Protocols

Coordinated Checkpoint (Chandy/Lamport): saves snapshots (consistent global states) from which the distributed execution can be restarted. It uses marker messages to “flush” the network and coordinate the processes (the application is checkpointed when there are no in-transit messages). Needs global coordination and rollback.
[Diagram: nodes, sync, Ckpt, failure, detection / global stop, restart]

Uncoordinated Checkpoint: no global coordination (scalable); nodes may checkpoint at any time (independently of the others). Needs to log non-deterministic events: in-transit messages.
[Diagram: nodes, Ckpt, failure, detection, restart]
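A minimal sketch of the coordinated, Chandy-Lamport-style protocol described above, with hypothetical send/save_state callbacks; a real implementation (e.g., inside an MPI library) manages channel state, delivery and restart far more carefully.

```python
# Sketch of coordinated checkpointing with marker messages (illustrative, not an MPI implementation).

MARKER = "MARKER"

class Process:
    def __init__(self, rank, channels, send, save_state):
        self.rank = rank
        self.channels = set(channels)     # ranks this process exchanges messages with
        self.send = send                  # send(dst, msg) callback (assumed)
        self.save_state = save_state      # callback that saves the local state (assumed)
        self.recording = False
        self.marker_seen = set()
        self.channel_log = {c: [] for c in channels}

    def initiate_checkpoint(self):
        self._take_local_checkpoint()

    def on_message(self, src, msg):
        if msg == MARKER:
            if not self.recording:
                self._take_local_checkpoint()
            self.marker_seen.add(src)
            if self.marker_seen == self.channels:
                self.recording = False    # local snapshot + channel states are complete
        else:
            if self.recording and src not in self.marker_seen:
                # message was in transit when the snapshot started: it belongs to the channel state
                self.channel_log[src].append(msg)
            # ... deliver msg to the application as usual (omitted)

    def _take_local_checkpoint(self):
        self.save_state(self.rank)        # local checkpoint
        self.recording = True
        self.marker_seen = set()
        for dst in self.channels:
            self.send(dst, MARKER)        # "flush" every outgoing channel with a marker
```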
Rollback-Recovery Protocols: which one to choose (MPI)?
[Classification chart: fault-tolerant MPI systems, organized by technique (checkpoint-based vs. log-based: pessimistic, causal or optimistic logging; automatic vs. semi-automatic) and by level (framework, API, communication library):
• Cocheck: independent of MPI [Ste96]
• Starfish: enrichment of MPI [AF99]
• Clip: semi-transparent checkpoint [CLP97]
• Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
• Manetho: n faults [EZ92]
• Sender-based Mess. Log.: 1 fault, sender based [JZ87]
• Pruitt 98: 2 faults, sender based [PRU98]
• Egida [RAV99]
• MPI/FT: redundancy of tasks [BNC01]
• MPI-FT: n faults, centralized server [LNLE00]
• MPICH-V: n faults, distributed logging
• FT-MPI: modification of MPI routines, user fault treatment [FD00]
• LA-MPI: communication rerouting
• Other: LAM/MPI, OpenMPI-V, RADIC (Europar’08)]

The main research domains (protocols):
a) Removing the blocking of processes in coordinated checkpointing
b) Reducing the overhead of message logging protocols
Improved Message Logging
• Classic approach (MPICH-V) implements message logging at the device level: all messages are copied
• High-speed MPI implementations use zero copy and decompose Recv into: a) matching, b) delivery
• OpenMPI-V implements message logging within MPI: different event types are managed differently, with a distinction between deterministic and non-deterministic events, and optimized memory copies
[Figures from Bouteiller: bandwidth of OpenMPI-V compared to others; OpenMPI-V overhead on NAS (Myri 10G)]
Coordinated and message logging protocols have been improved --> improvements are probably still possible but very difficult to obtain!
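A rough sketch of the sender-based message-logging idea behind these protocols (illustrative only, not the OpenMPI-V implementation): the sender keeps a copy of each payload, the receiver records only the determinant (who was received, in what order), and a crashed process is replayed from its last checkpoint using both.

```python
# Sketch of sender-based message logging (illustrative; not the OpenMPI-V implementation).

class Sender:
    def __init__(self):
        self.log = []                         # payload copies kept in the sender's memory

    def send(self, dst, payload, transport):
        ssn = len(self.log)                   # sender sequence number
        self.log.append((dst, ssn, payload))
        transport(dst, ssn, payload)

    def replay_to(self, dst, transport):
        """Re-send logged messages to a recovering receiver."""
        for d, ssn, payload in self.log:
            if d == dst:
                transport(dst, ssn, payload)

class Receiver:
    def __init__(self):
        self.determinants = []                # (src, ssn) in delivery order: the non-deterministic events

    def deliver(self, src, ssn, payload):
        self.determinants.append((src, ssn))  # logged (e.g., to a reliable event logger)
        return payload

    def recover(self, replayed):
        """Re-deliver replayed messages in the original order recorded by the determinants."""
        by_key = {(src, ssn): payload for src, ssn, payload in replayed}
        return [by_key[key] for key in self.determinants if key in by_key]
```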
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Reduce the checkpoint time --> reduce the checkpoint size
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
1) Incremental Checkpointing
2) Application-level Checkpointing
3) Compiler-assisted application-level Checkpointing
4) Restart from local storage

Reduce the size of the data saved to and restored from the remote file system (100-200 TB --> 10-50 TB?)
Reducing Checkpoint Size 1/2
• Incremental Checkpointing: a runtime monitor detects memory regions that have not been modified between two adjacent checkpoints and omits them from the subsequent checkpoint.
OS-level incremental checkpointing uses the memory management subsystem to decide which data changed between consecutive checkpoints.
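The OS-level approach relies on page protection / dirty bits; below is a minimal user-level illustration of the same idea, which detects unchanged "pages" by hashing them and writes only the modified ones (assumed page size and file layout, for illustration only).

```python
# User-level sketch of incremental checkpointing: write only pages whose content changed
# since the previous checkpoint (real systems use the MMU dirty bits instead of hashing).
import hashlib

PAGE_SIZE = 4096

def incremental_checkpoint(memory: bytes, previous_hashes: dict, out_path: str) -> dict:
    new_hashes = {}
    with open(out_path, "wb") as ckpt:
        for offset in range(0, len(memory), PAGE_SIZE):
            page = memory[offset:offset + PAGE_SIZE]
            digest = hashlib.sha1(page).digest()
            new_hashes[offset] = digest
            if previous_hashes.get(offset) != digest:        # page modified since last ckpt
                ckpt.write(offset.to_bytes(8, "little"))      # (offset, data) record
                ckpt.write(page)
    return new_hashes   # becomes `previous_hashes` for the next checkpoint

# usage sketch:
# hashes = incremental_checkpoint(app_memory, {}, "ckpt_0")        # full first checkpoint
# hashes = incremental_checkpoint(app_memory, hashes, "ckpt_1")    # only modified pages
```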
[Figure from J.-C. Sancho: fraction of memory footprint overwritten during the main iteration (full vs. below-full memory footprint) for Sage (1000/500/100/50 MB), Sweep3D and the NAS SP, LU, BT and FT benchmarks]

Challenge (not scientific): establish a base of codes with Application-Level Checkpointing
• Application-Level Checkpointing: “Programmers know what data to save and when to save the state of the execution.” The programmer adds dedicated code in the application to save the state of the execution.
Few results available. Bronevetsky 2008: for the MDCASK code of the ASCI Purple benchmarks, a hand-written checkpointer eliminates 77% of the application state.
Limitation: impossible to optimize the checkpoint interval (the interval should be well chosen to avoid a large increase of the execution time --> cooperative checkpointing)
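A minimal sketch of what such a hand-written, application-level checkpointer looks like for an iterative solver; the variable names, the checkpoint interval and the two compute stubs are hypothetical placeholders, the point being that the programmer decides that only the solution array and the iteration counter need to survive.

```python
# Sketch of application-level checkpointing in an iterative solver (illustrative only).
import os
import pickle

CKPT_FILE = "solver.ckpt"
CKPT_INTERVAL = 100                      # iterations between checkpoints (chosen by the programmer)

def initial_solution():
    return [0.0] * 1000                  # placeholder initial state

def one_iteration(solution):
    return [x + 1.0 for x in solution]   # placeholder compute step

def save_state(iteration, solution):
    with open(CKPT_FILE + ".tmp", "wb") as f:
        pickle.dump({"iteration": iteration, "solution": solution}, f)
    os.replace(CKPT_FILE + ".tmp", CKPT_FILE)   # atomic rename: never leave a half-written ckpt

def load_state():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            state = pickle.load(f)
        return state["iteration"], state["solution"]
    return 0, initial_solution()         # fresh start

def run(max_iters):
    start, solution = load_state()       # restart transparently from the last checkpoint
    for it in range(start, max_iters):
        solution = one_iteration(solution)
        if (it + 1) % CKPT_INTERVAL == 0:
            save_state(it + 1, solution)
    return solution
```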
Reducing Checkpoint Size 2/2
Compiler-assisted application-level checkpointing
• From Plank (compiler-assisted memory exclusion)
• The user annotates codes for checkpointing
• The compiler detects dead data (not modified between 2 checkpoints) and omits it from the second checkpoint.
• Latest result (static analysis of 1D arrays) excludes live arrays with dead data --> 45% reduction in checkpoint size for mdcask, one of the ASCI Purple benchmarks
[Fig. from G. Bronevetsky: memory addresses vs. execution time (s); annotations: 22%, 100%]
• Inspector-Executor (trace-based) checkpointing (INRIA study). Ex: DGETRF (max gain 20% over incremental checkpointing). Needs more evaluation.
Challenge: reducing checkpoint size (probably one of the most difficult problems).
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Remove the bottleneck of the I/O nodes and file system --> Checkpointing without stable storage
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
1) Add storage devices in compute nodes and/or as extra “non computing” nodes
2) Diskless Checkpointing
No checkpoint will cross this line!
Store ckpt. on SSD (Flash mem.)
Challenge: integrate the SSD technology at a reasonable cost and without reducing the MTTI
[Figure: compute nodes, network(s), I/O nodes, parallel file system (1 to 2 PB); 40 to 200 GB/s; total memory: 100-200 TB]
• Current practice --> checkpoint on local disk (1 min.) and then move the CKPT images asynchronously to persistent storage --> checkpointing still needs 20-40 minutes
• Recent proposal: use SSDs (Flash memory) in the nodes or attached to the network --> increases the cost of the machine (100 TB of flash memory) + increases power consumption + needs replication of the ckpt image on remote nodes (if the SSDs are in the nodes) OR adds a large # of components to the system.
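A sketch of the two-level practice described above (fast local write, then asynchronous drain to the parallel file system), with hypothetical paths; real deployments also replicate the local copy on a buddy node so the checkpoint survives the loss of the node that wrote it.

```python
# Sketch of two-level checkpointing: fast local (disk/SSD) write, asynchronous copy to the PFS.
import shutil
import threading

LOCAL_DIR = "/local_ssd/ckpt"          # assumed node-local path
PFS_DIR = "/pfs/app/ckpt"              # assumed parallel-file-system path

def checkpoint(rank: int, state_bytes: bytes) -> threading.Thread:
    local_path = f"{LOCAL_DIR}/rank{rank}.ckpt"
    with open(local_path, "wb") as f:  # fast local write: the application blocks only for this
        f.write(state_bytes)

    def drain():                       # slow copy to stable storage happens in the background
        shutil.copy(local_path, f"{PFS_DIR}/rank{rank}.ckpt")

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t                           # caller may join() before taking the next checkpoint
```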
[Figures: SSDs added to the compute nodes or attached to the network]
Diskless Checkpointing 1/2
Principle: compute a checksum of the processes’ memory and store it on spare processors
Advantage: does not require ckpt on stable storage.
Images from George Bosilca
[Figure sequence: P1-P4 are 4 computing processors; a fifth “non-computing” processor Pc is added; the computation starts; a checkpoint is performed (Pc = P1 + P2 + P3 + P4); the computation continues; a failure hits P2; once ready for recovery, P2's data is recovered as P2 = Pc - P1 - P3 - P4]
A) Every process saves a copy of its local state in memory or on a local disc
B) Perform a global bitstream or floating-point operation on all the saved local states
On recovery, every surviving process restores its local state from the saved copy, and the failed process's state is rebuilt from the checksum.
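A small sketch of the encode/recover arithmetic shown in the figure, using a floating-point sum checksum over numpy arrays (illustrative only; a real implementation computes the checksum with a reduction, e.g. onto the spare processor, instead of gathering all states in one place).

```python
# Sketch of diskless checkpointing with a floating-point checksum (illustrative only).
import numpy as np

def encode(states):
    """states: list of per-processor arrays P1..Pp. Returns the checksum held by Pc."""
    return np.sum(states, axis=0)                  # Pc = P1 + P2 + ... + Pp

def recover(states, failed, checksum):
    """Rebuild the state of the failed processor from the survivors and the checksum."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return checksum - np.sum(survivors, axis=0)    # Pfailed = Pc - sum(others)

# usage sketch (4 computing processors + 1 checksum processor):
states = [np.random.rand(5) for _ in range(4)]     # local states saved in memory
pc = encode(states)                                # the "diskless checkpoint"
lost = 1                                           # P2 fails
rebuilt = recover(states, lost, pc)
assert np.allclose(rebuilt, states[lost])          # up to floating-point round-off
```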
Diskless Checkpointing 2/2
Challenge: experiment more with diskless checkpointing on very large machines (current results are for ~1000 CPUs)
•Need spare nodes and double the memory occupation (to survive failures during ckpt.) --> increases the overall cost and #failures
•Need coordinated checkpointing or message logging protocol
•Need very fast encoding & reduction operations
•Need automatic Ckpt protocol or program modifications
Images from Charng-Da Lu
•Could be done at application and system levels
•Process data could be considered (and encoded) either as bit-streams or as floating point numbers. Computing the checksum from bit-streams uses operations such as parity. Computing checksum from floating point numbers uses operations such as addition
•Can survive multiple failures of arbitrary patterns: Reed-Solomon for bit-streams and weighted checksums for floating-point numbers (sensitive to round-off errors).
•Works with incremental ckpt.
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Avoid Rollback-Recovery
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
• System monitoring and Proactive-Operations
No checkpoint at all!
Proactive Operations
•Principle: predict failures and trigger preventive actions when a node is suspected
•Much research on proactive operations assumes that failures can be predicted. Only a few papers are based on actual data.
•Most of this research refers to 2 papers, published in 2003 and 2005, on a 350-CPU cluster and a BG/L prototype (100 days, 128K CPUs)
More than 50 measurement points are monitored per Cray XT5 system blade.
[Graphs from R. Sahoo, BG/L prototype: a lot of fatal failures (up to >35 a day!), from many sources and everywhere in the system: memory, network, APP-IO, switch, node cards]
Traces from either a rather small system (350 CPUs) or the first 100 days of a large system not yet stabilized
Proactive Migration
•Principle: predict failures and migrate processes before failures occur
•Prediction models are based on the analysis of correlations between non-fatal and fatal errors, and on temporal and spatial correlations between failure events (see the toy sketch at the end of this slide).
•Results on the first 100 days of BlueGene/L demonstrate good failure predictability: 50% of I/O failures could have been predicted (based on trace analysis). Note that memory failures are much less predictable!
•Bad prediction has a cost (false positives and false negatives both impact performance) --> false negatives impose the use of rollback-recovery.
•Migration has a cost (need to checkpoint and log or delay messages)
•What to migrate?
•Virtual Machine, Process checkpoint?
•Only application state (user checkpoint)?
•What to do with predictable software failures?
Migrate OR keep safe software and replace dynamically the software that is predicted to fail?
Challenge: analyze more traces, identify more correlations, improve predictive algorithms
Proactive migration may help to significantly increase the checkpoint interval.
Results are lacking concerning real time predictions and actual benefits of migration in real conditions
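As an illustration of how such temporal correlations can be exploited, here is a toy predictor (entirely hypothetical thresholds and event handling, not one of the published models) that raises a migration trigger when non-fatal events on a node accumulate within a time window.

```python
# Toy failure predictor: trigger proactive migration when non-fatal events cluster on a node.
# Thresholds and window are hypothetical; real predictors are trained on actual system logs.
from collections import defaultdict, deque

WINDOW_S = 3600          # look at the last hour of events
THRESHOLD = 5            # assumed: >= 5 non-fatal events in the window => node suspected

class Predictor:
    def __init__(self):
        self.events = defaultdict(deque)   # node -> timestamps of non-fatal events

    def record(self, node: str, timestamp: float) -> bool:
        q = self.events[node]
        q.append(timestamp)
        while q and q[0] < timestamp - WINDOW_S:
            q.popleft()                    # drop events that fell outside the window
        return len(q) >= THRESHOLD         # True => suggest migrating processes off this node
```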
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Opportunities
May come from a strong modification of the problem statement:
Failures: from exceptions to normal events

From the system side: “Alternative FT Paradigms”:
– Replication (mask the effect of failures)
– Self-Stabilization (forward recovery: push the system towards a legitimate state)
– Speculative Execution (commit only correct speculative state modifications)

From the applications & algorithms side: “Failure-Aware Design”:
– Application-level fault management (FT-MPI: reorganize the computation)
– Fault-Tolerance-Friendly Parallel Patterns (confine failure effects)
– Algorithmic-Based Fault Tolerance (compute with redundant data)
– Naturally Fault-Tolerant Algorithms (algorithms resilient to failures)

Since these opportunities have received only little attention (until recently), they need further exploration in the context of PetaScale systems.
Does Replication make sense?
Needs investigation of the process slowdown with high-speed networks. Currently too expensive (doubles the hardware & power consumption).
•Design new parallel architectures with very cheap and low-power nodes
•Replicate only the nodes that are likely to fail --> failure prediction
Slide from Garth Gibson
The Chandy-Lamport algorithm assumes a “worst case” situation: all processes may communicate with all other ones --> these communications influence all processes.
Not necessarily true for all parallel programming/execution patterns.
1) Divide the system into recovery domains --> failures in one domain are confined to that domain and do not force further failure effects across domains. May need some message logging (interesting only if there are few inter-domain communications).
2) Dependency-based recovery: limit the rollback to those nodes that have acquired dependencies on the failed ones. In typical MPI applications, processes exchange messages with a limited number of other processes (be careful: domino effect).
FT-friendly parallel patterns still need fault detection and correction.
Examples of fault-tolerance-friendly parallel patterns: Master-Worker (see the sketch after this slide), Divide&Conquer (Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack) --> SATIN (the D&C framework of IBIS): a transparent FT strategy dedicated to the D&C pattern.
Fault Tolerance Friendly Parallel Patterns 1/2
Ideas from E. Elnozahy
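A minimal sketch of why the master-worker pattern is "FT friendly": the master keeps the set of outstanding tasks and simply re-queues the tasks of a worker it declares dead, so a failure never forces a global rollback. This is a hypothetical queue-based skeleton (send_task, is_alive and result_queue are assumed callbacks), not SATIN.

```python
# Sketch of a fault-tolerance-friendly master-worker loop (illustrative only).
from queue import Queue, Empty

def master(tasks, workers, send_task, result_queue: Queue, is_alive):
    pending = list(tasks)
    in_flight = {}                                   # worker -> task currently assigned
    results = {}
    while pending or in_flight:
        # failure handling: re-queue the tasks of workers that stopped answering
        for w in list(in_flight):
            if not is_alive(w):
                pending.append(in_flight.pop(w))     # only this work is redone, no global rollback
        # (re)assign work to idle, live workers
        for w in workers:
            if is_alive(w) and w not in in_flight and pending:
                task = pending.pop()
                in_flight[w] = task
                send_task(w, task)
        # collect results without blocking forever, so dead workers are re-checked
        try:
            worker, task, value = result_queue.get(timeout=1.0)
            results[task] = value
            in_flight.pop(worker, None)
        except Empty:
            pass
    return results
```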
FT Friendly Parallel Patterns 2/2
A simple example (IPDPS 2005): Fibonacci with SATIN (IBIS)
Figure by G. Wrzesińska
[Figure: a Fibonacci task tree (tasks 1-15) spread over processors 1, 2 and 3.
– Processor 3 is disconnected, leaving orphan jobs (tasks 9 and 15 on cpu3).
– Processor 3 broadcasts its list of orphans: (9, cpu3), (15, cpu3).
– When processor 1 re-computes task 2 and then task 4, it reconnects to processor 3 and the orphan jobs are recovered.]
•Divide&Conquer: many non-trivial parallel applications: Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack (combinatorial optimization problems)...

Works for many Linear Algebra operations:
Matrix Multiplication: A * B = C -> Ac * Br = Cf
LU Decomposition: C = L * U -> Cf = Lc * Ur
Addition: A + B = C -> Af + Bf = Cf
Scalar Multiplication: c * Af = (c * A)f
Transpose: Af^T = (A^T)f
Cholesky factorization & QR factorization
In 1984, Huang and Abraham proposed ABFT to detect and correct errors in some matrix operations on systolic arrays.
ABFT encodes the data & redesigns the algorithm to operate on the encoded data. Failures are detected and corrected off-line (after the execution).
ABFT variation for on-line recovery (the runtime detects failures + is robust to failures):

“Algorithmic-Based Fault Tolerance”
•Similar to diskless ckpt., an extra processor Pc is added to store the checksum of the data (vectors X and Y in this case): Xc = X1 + … + Xp, Yc = Y1 + … + Yp; Xf = [X1, …, Xp, Xc], Yf = [Y1, …, Yp, Yc].
•Operations are performed on Xf and Yf instead of X and Y: Zf = Xf + Yf
•Compared to diskless checkpointing, the memory AND CPU of Pc take part in the computation:
•No global operation for the checksum!
•No local checkpoint!
[Figure from G. Bosilca: processors P1-P4 plus Pc hold X1…X4, Xc and Y1…Y4, Yc; adding them gives Z1…Z4, Zc]
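A small numpy sketch of the encoding described above: the operation is carried out on the checksum-extended vectors, so the result is already encoded and a lost element can be rebuilt without any rollback. Illustrative only; in a real ABFT kernel the elements of Xf, Yf and Zf are distributed across the processors.

```python
# Sketch of ABFT on a vector addition: operate directly on checksum-encoded data.
import numpy as np

def encode(v):
    """Xf = [X1, ..., Xp, Xc] with Xc = X1 + ... + Xp."""
    return np.append(v, v.sum())

def recover(vf, lost):
    """Rebuild element `lost` of the encoded vector from the surviving ones."""
    others = np.delete(vf[:-1], lost)
    return vf[-1] - others.sum()        # Xlost = Xc - sum of surviving elements

x, y = np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0, 30.0, 40.0])
xf, yf = encode(x), encode(y)
zf = xf + yf                            # Zf = Xf + Yf is itself checksum-encoded
assert np.isclose(zf[-1], zf[:-1].sum())
lost = 2                                # say the processor holding Z3 fails
assert np.isclose(recover(zf, lost), zf[lost])
```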
[Figure from A. Geist: meshless formulation of a 2-D finite difference application]
“Naturally fault tolerant algorithms”
Natural fault tolerance is the ability to tolerate failures through the mathematical properties of the algorithm itself, without requiring notification or recovery.
The algorithm includes natural compensation for the lost information.
For example, an iterative algorithm may require more iterations to converge, but it still converges despite lost information
Assumes that a maximum of 0.1% of tasks may fail
Ex1: Meshless iterative methods + chaotic relaxation (asynchronous iterative methods)
Ex2: Global MAX (used in iterative methods to determine convergence)
This algorithm shares some features with Self-Stabilization algorithms: detection of termination is very hard! It provides the max “eventually”… BUT it does not tolerate Byzantine faults (Self-Stabilization does, for transient failures + acyclic topology). A toy gossip-style sketch of the global MAX follows.
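To make the global-MAX example concrete, a toy sketch in the spirit of the naturally fault-tolerant approach (assumed random-peer topology and message-loss model, not the algorithm from the cited work): each task repeatedly pushes its current maximum to random peers and keeps the largest value it has seen; lost messages or dead tasks delay convergence among the survivors but require no recovery action.

```python
# Toy sketch of a "naturally fault tolerant" global MAX by gossip (illustrative only).
import random

def gossip_max(values, rounds=50, p_msg_loss=0.1, dead=frozenset()):
    """values: {task_id: local value}; dead: ids of failed tasks (their values are simply lost)."""
    current = {t: v for t, v in values.items() if t not in dead}
    for _ in range(rounds):
        for task in list(current):
            peer = random.choice(list(current))            # pick a random live peer
            if peer != task and random.random() > p_msg_loss:
                # push-style exchange: the peer keeps the larger of the two estimates
                current[peer] = max(current[peer], current[task])
    return current                                         # eventually every survivor holds the max

vals = {i: random.random() for i in range(64)}
out = gossip_max(vals, dead={3, 17})
survivors_max = max(v for i, v in vals.items() if i not in {3, 17})
# with enough rounds, all surviving tasks converge to survivors_max (no notification, no recovery)
```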
Wrapping up
Fault tolerance is becoming a major issue for users of large-scale parallel systems.
Many challenges:
•Reduce the cost of checkpointing (checkpoint size & time)
•Design better logging and analysis tools
•Design less expensive replication approaches
•Integrate Flash memory technology while keeping cost low and MTTI high
•Investigate the scalability of Diskless Checkpointing
•Collect more traces, identify correlations, design new predictive algorithms
Opportunities may come from Failure-Aware application design and the investigation of alternative FT paradigms, in the context of HPC applications.