Fault Tolerance & PetaScale Systems: Current Knowledge, Challenges and Opportunities
Franck Cappello, INRIA
Keynote @ EuroPVM/MPI, September 2008, Dublin, Ireland
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Why is Fault Tolerance in PetaScale systems for HPC applications challenging?
A) FT is a difficult research area:
• Related to many other issues: scalability, programming models, environments, communication libraries, storage, etc.
• FT is not treated as a first-class issue (many vendors simply do not provide software to tolerate failures of their systems)
• Software solutions have to work on a large variety of hardware
• New results require strong effort (an old research discipline)
• Etc.
B) We will soon reach a situation where “classic” Rollback-Recovery will simply not work anymore! --> let's see why
Classic approach for FT: Checkpoint-Restart
[Figure: typical “balanced architecture” for a PetaScale computer (photos: RoadRunner, TACC Ranger, LLNL BG/L): compute nodes connected through network(s) to I/O nodes and a parallel file system (1 to 2 PB); total memory: 100-200 TB; aggregate I/O bandwidth: 40 to 200 GB/s; 1000 sec. < Ckpt < 2500 sec.]
Systems          | Perf.  | Ckpt time | Source
RoadRunner       | 1 PF   | ~20 min.  | Panasas
LLNL BG/L        | 500 TF | >20 min.  | LLNL
Argonne BG/P     | 500 TF | ~30 min.  | LLNL
Total SGI Altix  | 100 TF | ~40 min.  | estimation
IDRIS BG/P       | 100 TF | 30 min.   | IDRIS
Without optimization, Checkpoint-Restart needs about 1 hour! (~30 minutes each)
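The 1000-2500 s bound follows directly from the ratio of memory to I/O bandwidth. As an illustration (numbers picked inside the ranges above, not measurements): checkpoint time ≈ total memory / aggregate I/O bandwidth, e.g. 100 TB / 100 GB/s = 1000 s and 200 TB / 80 GB/s = 2500 s, i.e. roughly 17 to 42 minutes per checkpoint.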
Failure rate and #sockets
In the Top500, machine performance doubles every year (see Jack Dongarra's Top500 slide) --> more than Moore's law plus the increase of #cores per CPU.
If we consider #cores x2 every 18, 24 and 30 months AND a fixed socket MTTI:
[Figures from Garth Gibson: projected system MTTI vs. year; RR, TR, 1 h. wall]
SMTTI ≈ 1 / (1 − (1 − 1/MTTI)^n), where MTTI is the per-socket MTTI and n is the number of sockets
We may reach the 1 h. wall as soon as 2012-2013. Another projection, from Charng-Da Lu, gives similar results.
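A minimal sketch of this kind of projection, with assumed baseline numbers (not the exact figures behind Gibson's curves): starting from an assumed socket count and per-socket MTTI, double the socket count on a fixed schedule and report the year when the system MTTI drops below the ~1 h needed for an unoptimized checkpoint/restart.

```python
# Sketch of the "1 hour wall" projection (assumed baseline values, for illustration only).

SOCKET_MTTI_H = 125_000      # assumed, fixed per-socket MTTI (hours)
N_SOCKETS_2008 = 25_000      # assumed socket count of a 2008-class petascale machine
CKPT_RESTART_H = 1.0         # unoptimized checkpoint + restart time (the "wall")

def system_mtti(socket_mtti_h, n_sockets):
    # small-n/MTTI approximation of SMTTI = 1 / (1 - (1 - 1/MTTI)^n)
    return socket_mtti_h / n_sockets

for doubling_months in (18, 24, 30):
    year, n = 2008.0, N_SOCKETS_2008
    while system_mtti(SOCKET_MTTI_H, n) > CKPT_RESTART_H:
        year += doubling_months / 12.0
        n *= 2
    print(f"#sockets x2 every {doubling_months} months -> SMTTI < 1 h around {year:.0f}")
```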
It's urgent to optimize Rollback-Recovery for PetaScale systems and to investigate alternatives.
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Understanding approach: failure logs
• The Computer Failure Data Repository (CFDR): http://cfdr.usenix.org/ (from '96 until now… HPC systems + Google)
• Failure logs from LANL, NERSC, PNNL, ask.com, SNL, LLNL, etc.
Ex: LANL released root-cause logs for 23,000 events causing application stops on 22 clusters (5,000 nodes), over 9 years
Analysis of failure logs
• In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours on the average to solve.”
• In 2007 (Garth Gibson, ICPP Keynote):
• In 2008 (Oliner and J. Stearley, DSN Conf.):
[Figures: failure root-cause breakdowns from these studies; hardware ≈ 50%]
Conclusion 1: Both hardware and software failures have to be considered.
Conclusion 2 (Oliner): logging tools fail too, some key info is missing; better filtering (correlation) is needed.
An FT system should cover all causes of failures (Rollback Recovery is consistent with this requirement*)
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Rollback-Recovery Protocols

Coordinated Checkpoint (Chandy/Lamport): saves snapshots (consistent global states) from which the distributed execution can be restarted. It uses marker messages to “flush” the network and coordinate the processes (the application is checkpointed when there are no in-transit messages). Needs global coordination and rollback.
[Diagram: nodes, sync, Ckpt, failure, detection / global stop, restart]

Uncoordinated Checkpoint: no global coordination (scalable); nodes may checkpoint at any time (independently of the others). Needs to log non-deterministic events: in-transit messages.
[Diagram: nodes, Ckpt, failure, detection, restart]
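A minimal sketch of the coordinated, Chandy-Lamport-style protocol described above, with hypothetical send/save_state callbacks; a real implementation (e.g., inside an MPI library) manages channel state, delivery and restart far more carefully.

```python
# Sketch of coordinated checkpointing with marker messages (illustrative, not an MPI implementation).

MARKER = "MARKER"

class Process:
    def __init__(self, rank, channels, send, save_state):
        self.rank = rank
        self.channels = set(channels)     # ranks this process exchanges messages with
        self.send = send                  # send(dst, msg) callback (assumed)
        self.save_state = save_state      # callback that saves the local state (assumed)
        self.recording = False
        self.marker_seen = set()
        self.channel_log = {c: [] for c in channels}

    def initiate_checkpoint(self):
        self._take_local_checkpoint()

    def on_message(self, src, msg):
        if msg == MARKER:
            if not self.recording:
                self._take_local_checkpoint()
            self.marker_seen.add(src)
            if self.marker_seen == self.channels:
                self.recording = False    # local snapshot + channel states are complete
        else:
            if self.recording and src not in self.marker_seen:
                # message was in transit when the snapshot started: it belongs to the channel state
                self.channel_log[src].append(msg)
            # ... deliver msg to the application as usual (omitted)

    def _take_local_checkpoint(self):
        self.save_state(self.rank)        # local checkpoint
        self.recording = True
        self.marker_seen = set()
        for dst in self.channels:
            self.send(dst, MARKER)        # "flush" every outgoing channel with a marker
```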
Rollback-Recovery Protocols: which one to choose (MPI)?
[Classification chart: fault-tolerant MPI systems, organized by technique (checkpoint-based vs. log-based: pessimistic, causal or optimistic logging; automatic vs. semi-automatic) and by level (framework, API, communication library):
• Cocheck: independent of MPI [Ste96]
• Starfish: enrichment of MPI [AF99]
• Clip: semi-transparent checkpoint [CLP97]
• Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
• Manetho: n faults [EZ92]
• Sender-based Mess. Log.: 1 fault, sender based [JZ87]
• Pruitt 98: 2 faults, sender based [PRU98]
• Egida [RAV99]
• MPI/FT: redundancy of tasks [BNC01]
• MPI-FT: n faults, centralized server [LNLE00]
• MPICH-V: n faults, distributed logging
• FT-MPI: modification of MPI routines, user fault treatment [FD00]
• LA-MPI: communication rerouting
• Other: LAM/MPI, OpenMPI-V, RADIC (Europar’08)]

The main research domains (protocols):
a) Removing the blocking of processes in coordinated checkpointing
b) Reducing the overhead of message logging protocols
Improved Message Logging
• Classic approach (MPICH-V) implements message logging at the device level: all messages are copied
• High-speed MPI implementations use zero copy and decompose Recv into: a) matching, b) delivery
• OpenMPI-V implements message logging within MPI: different event types are managed differently, with a distinction between deterministic and non-deterministic events, and optimized memory copies
[Figures from Bouteiller: bandwidth of OpenMPI-V compared to others; OpenMPI-V overhead on NAS (Myri 10G)]
Coordinated and message logging protocols have been improved --> improvements are probably still possible but very difficult to obtain!
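A rough sketch of the sender-based message-logging idea behind these protocols (illustrative only, not the OpenMPI-V implementation): the sender keeps a copy of each payload, the receiver records only the determinant (who was received, in what order), and a crashed process is replayed from its last checkpoint using both.

```python
# Sketch of sender-based message logging (illustrative; not the OpenMPI-V implementation).

class Sender:
    def __init__(self):
        self.log = []                         # payload copies kept in the sender's memory

    def send(self, dst, payload, transport):
        ssn = len(self.log)                   # sender sequence number
        self.log.append((dst, ssn, payload))
        transport(dst, ssn, payload)

    def replay_to(self, dst, transport):
        """Re-send logged messages to a recovering receiver."""
        for d, ssn, payload in self.log:
            if d == dst:
                transport(dst, ssn, payload)

class Receiver:
    def __init__(self):
        self.determinants = []                # (src, ssn) in delivery order: the non-deterministic events

    def deliver(self, src, ssn, payload):
        self.determinants.append((src, ssn))  # logged (e.g., to a reliable event logger)
        return payload

    def recover(self, replayed):
        """Re-deliver replayed messages in the original order recorded by the determinants."""
        by_key = {(src, ssn): payload for src, ssn, payload in replayed}
        return [by_key[key] for key in self.determinants if key in by_key]
```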
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Reduce the checkpoint time --> reduce the checkpoint size
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
1) Incremental Checkpointing
2) Application-level Checkpointing
3) Compiler-assisted application-level Checkpointing
4) Restart from local storage

Reduce the size of the data saved to and restored from the remote file system (100-200 TB --> 10-50 TB?)
Reducing Checkpoint Size 1/2
• Incremental Checkpointing: a runtime monitor detects memory regions that have not been modified between two adjacent checkpoints and omits them from the subsequent checkpoint.
OS-level incremental checkpointing uses the memory management subsystem to decide which data changed between consecutive checkpoints.
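The OS-level approach relies on page protection / dirty bits; below is a minimal user-level illustration of the same idea, which detects unchanged "pages" by hashing them and writes only the modified ones (assumed page size and file layout, for illustration only).

```python
# User-level sketch of incremental checkpointing: write only pages whose content changed
# since the previous checkpoint (real systems use the MMU dirty bits instead of hashing).
import hashlib

PAGE_SIZE = 4096

def incremental_checkpoint(memory: bytes, previous_hashes: dict, out_path: str) -> dict:
    new_hashes = {}
    with open(out_path, "wb") as ckpt:
        for offset in range(0, len(memory), PAGE_SIZE):
            page = memory[offset:offset + PAGE_SIZE]
            digest = hashlib.sha1(page).digest()
            new_hashes[offset] = digest
            if previous_hashes.get(offset) != digest:        # page modified since last ckpt
                ckpt.write(offset.to_bytes(8, "little"))      # (offset, data) record
                ckpt.write(page)
    return new_hashes   # becomes `previous_hashes` for the next checkpoint

# usage sketch:
# hashes = incremental_checkpoint(app_memory, {}, "ckpt_0")        # full first checkpoint
# hashes = incremental_checkpoint(app_memory, hashes, "ckpt_1")    # only modified pages
```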
[Figure from J.-C. Sancho: fraction of memory footprint overwritten during the main iteration (full vs. below-full memory footprint) for Sage (1000/500/100/50 MB), Sweep3D and the NAS SP, LU, BT and FT benchmarks]

Challenge (not scientific): establish a base of codes with Application-Level Checkpointing
• Application-Level Checkpointing: “Programmers know what data to save and when to save the state of the execution.” The programmer adds dedicated code in the application to save the state of the execution.
Few results available. Bronevetsky 2008: for the MDCASK code of the ASCI Purple benchmarks, a hand-written checkpointer eliminates 77% of the application state.
Limitation: impossible to optimize the checkpoint interval (the interval should be well chosen to avoid a large increase of the execution time --> cooperative checkpointing)
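A minimal sketch of what such a hand-written, application-level checkpointer looks like for an iterative solver; the variable names, the checkpoint interval and the two compute stubs are hypothetical placeholders, the point being that the programmer decides that only the solution array and the iteration counter need to survive.

```python
# Sketch of application-level checkpointing in an iterative solver (illustrative only).
import os
import pickle

CKPT_FILE = "solver.ckpt"
CKPT_INTERVAL = 100                      # iterations between checkpoints (chosen by the programmer)

def initial_solution():
    return [0.0] * 1000                  # placeholder initial state

def one_iteration(solution):
    return [x + 1.0 for x in solution]   # placeholder compute step

def save_state(iteration, solution):
    with open(CKPT_FILE + ".tmp", "wb") as f:
        pickle.dump({"iteration": iteration, "solution": solution}, f)
    os.replace(CKPT_FILE + ".tmp", CKPT_FILE)   # atomic rename: never leave a half-written ckpt

def load_state():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            state = pickle.load(f)
        return state["iteration"], state["solution"]
    return 0, initial_solution()         # fresh start

def run(max_iters):
    start, solution = load_state()       # restart transparently from the last checkpoint
    for it in range(start, max_iters):
        solution = one_iteration(solution)
        if (it + 1) % CKPT_INTERVAL == 0:
            save_state(it + 1, solution)
    return solution
```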
Reducing Checkpoint Size 2/2
Compiler-assisted application-level checkpointing
• From Plank (compiler-assisted memory exclusion)
• The user annotates codes for checkpointing
• The compiler detects dead data (not modified between 2 checkpoints) and omits it from the second checkpoint.
• Latest result (static analysis of 1D arrays) excludes live arrays with dead data --> 45% reduction in checkpoint size for mdcask, one of the ASCI Purple benchmarks
[Fig. from G. Bronevetsky: memory addresses vs. execution time (s); annotations: 22%, 100%]
• Inspector-Executor (trace-based) checkpointing (INRIA study). Ex: DGETRF (max gain 20% over incremental checkpointing). Needs more evaluation.
Challenge: reducing checkpoint size (probably one of the most difficult problems).
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Remove the bottleneck of the I/O nodes and file system --> Checkpointing without stable storage
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
1) Add storage devices in compute nodes and/or as extra “non computing” nodes
2) Diskless Checkpointing
No checkpoint will cross this line!
Store ckpt. on SSD (Flash mem.)
Challenge: integrate the SSD technology at a reasonable cost and without reducing the MTTI
[Figure: compute nodes, network(s), I/O nodes, parallel file system (1 to 2 PB); 40 to 200 GB/s; total memory: 100-200 TB]
• Current practice --> checkpoint on local disk (1 min.) and then move the CKPT images asynchronously to persistent storage --> checkpointing still needs 20-40 minutes
• Recent proposal: use SSDs (Flash memory) in the nodes or attached to the network --> increases the cost of the machine (100 TB of flash memory) + increases power consumption + needs replication of the ckpt image on remote nodes (if the SSDs are in the nodes) OR adds a large # of components to the system.
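A sketch of the two-level practice described above (fast local write, then asynchronous drain to the parallel file system), with hypothetical paths; real deployments also replicate the local copy on a buddy node so the checkpoint survives the loss of the node that wrote it.

```python
# Sketch of two-level checkpointing: fast local (disk/SSD) write, asynchronous copy to the PFS.
import shutil
import threading

LOCAL_DIR = "/local_ssd/ckpt"          # assumed node-local path
PFS_DIR = "/pfs/app/ckpt"              # assumed parallel-file-system path

def checkpoint(rank: int, state_bytes: bytes) -> threading.Thread:
    local_path = f"{LOCAL_DIR}/rank{rank}.ckpt"
    with open(local_path, "wb") as f:  # fast local write: the application blocks only for this
        f.write(state_bytes)

    def drain():                       # slow copy to stable storage happens in the background
        shutil.copy(local_path, f"{PFS_DIR}/rank{rank}.ckpt")

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t                           # caller may join() before taking the next checkpoint
```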
[Figures: SSDs added to the compute nodes or attached to the network]
Diskless Checkpointing 1/2
Principle: compute a checksum of the processes’ memory and store it on spare processors
Advantage: does not require ckpt on stable storage.
Images from George Bosilca
[Figure sequence: P1-P4 are 4 computing processors; a fifth “non-computing” processor Pc is added; the computation starts; a checkpoint is performed (Pc = P1 + P2 + P3 + P4); the computation continues; a failure hits P2; once ready for recovery, P2's data is recovered as P2 = Pc - P1 - P3 - P4]
A) Every process saves a copy of its local state in memory or on a local disc
B) Perform a global bitstream or floating-point operation on all the saved local states
On recovery, every surviving process restores its local state from the saved copy, and the failed process's state is rebuilt from the checksum.
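A small sketch of the encode/recover arithmetic shown in the figure, using a floating-point sum checksum over numpy arrays (illustrative only; a real implementation computes the checksum with a reduction, e.g. onto the spare processor, instead of gathering all states in one place).

```python
# Sketch of diskless checkpointing with a floating-point checksum (illustrative only).
import numpy as np

def encode(states):
    """states: list of per-processor arrays P1..Pp. Returns the checksum held by Pc."""
    return np.sum(states, axis=0)                  # Pc = P1 + P2 + ... + Pp

def recover(states, failed, checksum):
    """Rebuild the state of the failed processor from the survivors and the checksum."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return checksum - np.sum(survivors, axis=0)    # Pfailed = Pc - sum(others)

# usage sketch (4 computing processors + 1 checksum processor):
states = [np.random.rand(5) for _ in range(4)]     # local states saved in memory
pc = encode(states)                                # the "diskless checkpoint"
lost = 1                                           # P2 fails
rebuilt = recover(states, lost, pc)
assert np.allclose(rebuilt, states[lost])          # up to floating-point round-off
```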
Diskless Checkpointing 2/2
Challenge: experiment more with diskless checkpointing on very large machines (current results are for ~1000 CPUs)
•Need spare nodes and double the memory occupation (to survive failures during ckpt.) --> increases the overall cost and #failures
•Need coordinated checkpointing or message logging protocol
•Need very fast encoding & reduction operations
•Need automatic Ckpt protocol or program modifications
Images from Charng-Da Lu
•Could be done at application and system levels
•Process data could be considered (and encoded) either as bit-streams or as floating point numbers. Computing the checksum from bit-streams uses operations such as parity. Computing checksum from floating point numbers uses operations such as addition
•Can survive multiple failures of arbitrary patterns: Reed-Solomon for bit-streams and weighted checksums for floating-point numbers (sensitive to round-off errors).
•Works with incremental ckpt.
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Avoid Rollback-Recovery
[Figure: typical “balanced architecture” for PetaScale computers: compute nodes, network(s), I/O nodes; total memory: 100-200 TB; I/O bandwidth: 40 to 200 GB/s]
• System monitoring and Proactive-Operations
No checkpoint at all!
Proactive Operations
•Principle: predict failures and trigger preventive actions when a node is suspected
•Much research on proactive operations assumes that failures can be predicted. Only a few papers are based on actual data.
•Most of this research refers to 2 papers, published in 2003 and 2005, on a 350-CPU cluster and a BG/L prototype (100 days, 128K CPUs)
More than 50 measurement points are monitored per Cray XT5 system blade.
[Graphs from R. Sahoo, BG/L prototype: a lot of fatal failures (up to >35 a day!), from many sources and everywhere in the system: memory, network, APP-IO, switch, node cards]
Traces from either a rather small system (350 CPUs) or the first 100 days of a large system not yet stabilized
Proactive Migration
•Principle: predict failures and migrate processes before failures occur
•Prediction models are based on the analysis of correlations between non-fatal and fatal errors, and on temporal and spatial correlations between failure events (see the toy sketch at the end of this slide).
•Results on the first 100 days of BlueGene/L demonstrate good failure predictability: 50% of I/O failures could have been predicted (based on trace analysis). Note that memory failures are much less predictable!
•Bad prediction has a cost (false positives and false negatives both impact performance) --> false negatives impose the use of rollback-recovery.
•Migration has a cost (need to checkpoint and log or delay messages)
•What to migrate?
•Virtual Machine, Process checkpoint?
•Only application state (user checkpoint)?
•What to do with predictable software failures?
Migrate OR keep safe software and replace dynamically the software that is predicted to fail?
Challenge: analyze more traces, identify more correlations, improve predictive algorithms
Proactive migration may help to significantly increase the checkpoint interval.
Results are lacking concerning real time predictions and actual benefits of migration in real conditions
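As an illustration of how such temporal correlations can be exploited, here is a toy predictor (entirely hypothetical thresholds and event handling, not one of the published models) that raises a migration trigger when non-fatal events on a node accumulate within a time window.

```python
# Toy failure predictor: trigger proactive migration when non-fatal events cluster on a node.
# Thresholds and window are hypothetical; real predictors are trained on actual system logs.
from collections import defaultdict, deque

WINDOW_S = 3600          # look at the last hour of events
THRESHOLD = 5            # assumed: >= 5 non-fatal events in the window => node suspected

class Predictor:
    def __init__(self):
        self.events = defaultdict(deque)   # node -> timestamps of non-fatal events

    def record(self, node: str, timestamp: float) -> bool:
        q = self.events[node]
        q.append(timestamp)
        while q and q[0] < timestamp - WINDOW_S:
            q.popleft()                    # drop events that fell outside the window
        return len(q) >= THRESHOLD         # True => suggest migrating processes off this node
```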
Agenda
• Why is Fault Tolerance Challenging?
• What are the main reasons behind failures?
• Rollback Recovery Protocols?
• Reducing Rollback Recovery Time?
• Rollback Recovery without stable storage?
• Alternatives to Rollback Recovery?
• Where are the opportunities?
Opportunities
May come from a strong modification of the problem statement:
Failures: from exceptions to normal events

From the system side: “Alternative FT Paradigms”:
– Replication (mask the effect of failures)
– Self-Stabilization (forward recovery: push the system towards a legitimate state)
– Speculative Execution (commit only correct speculative state modifications)

From the applications & algorithms side: “Failure-Aware Design”:
– Application-level fault management (FT-MPI: reorganize the computation)
– Fault-Tolerance-Friendly Parallel Patterns (confine failure effects)
– Algorithmic-Based Fault Tolerance (compute with redundant data)
– Naturally Fault-Tolerant Algorithms (algorithms resilient to failures)

Since these opportunities have received only little attention (until recently), they need further exploration in the context of PetaScale systems.
Does Replication make sense?
Needs investigation of the process slowdown with high-speed networks. Currently too expensive (doubles the hardware & power consumption).
•Design new parallel architectures with very cheap and low-power nodes
•Replicate only the nodes that are likely to fail --> failure prediction
Slide from Garth Gibson
The Chandy-Lamport algorithm assumes a “worst case” situation: all processes may communicate with all other ones --> these communications influence all processes.
Not necessarily true for all parallel programming/execution patterns.
1) Divide the system into recovery domains --> failures in one domain are confined to that domain and do not force further failure effects across domains. May need some message logging (interesting only if there are few inter-domain communications).
2) Dependency-based recovery: limit the rollback to those nodes that have acquired dependencies on the failed ones. In typical MPI applications, processes exchange messages with a limited number of other processes (be careful: domino effect).
FT-friendly parallel patterns still need fault detection and correction.
Examples of fault-tolerance-friendly parallel patterns: Master-Worker (see the sketch after this slide), Divide&Conquer (Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack) --> SATIN (the D&C framework of IBIS): a transparent FT strategy dedicated to the D&C pattern.
Fault Tolerance Friendly Parallel Patterns 1/2
Ideas from E. Elnozahy
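A minimal sketch of why the master-worker pattern is "FT friendly": the master keeps the set of outstanding tasks and simply re-queues the tasks of a worker it declares dead, so a failure never forces a global rollback. This is a hypothetical queue-based skeleton (send_task, is_alive and result_queue are assumed callbacks), not SATIN.

```python
# Sketch of a fault-tolerance-friendly master-worker loop (illustrative only).
from queue import Queue, Empty

def master(tasks, workers, send_task, result_queue: Queue, is_alive):
    pending = list(tasks)
    in_flight = {}                                   # worker -> task currently assigned
    results = {}
    while pending or in_flight:
        # failure handling: re-queue the tasks of workers that stopped answering
        for w in list(in_flight):
            if not is_alive(w):
                pending.append(in_flight.pop(w))     # only this work is redone, no global rollback
        # (re)assign work to idle, live workers
        for w in workers:
            if is_alive(w) and w not in in_flight and pending:
                task = pending.pop()
                in_flight[w] = task
                send_task(w, task)
        # collect results without blocking forever, so dead workers are re-checked
        try:
            worker, task, value = result_queue.get(timeout=1.0)
            results[task] = value
            in_flight.pop(worker, None)
        except Empty:
            pass
    return results
```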
FT Friendly Parallel Patterns 2/2
A simple example (IPDPS 2005): Fibonacci with SATIN (IBIS)
Figure by G. Wrzesińska
[Figure: a Fibonacci task tree (tasks 1-15) spread over processors 1, 2 and 3.
– Processor 3 is disconnected, leaving orphan jobs (tasks 9 and 15 on cpu3).
– Processor 3 broadcasts its list of orphans: (9, cpu3), (15, cpu3).
– When processor 1 re-computes task 2 and then task 4, it reconnects to processor 3 and the orphan jobs are recovered.]
•Divide&Conquer: many non-trivial parallel applications: Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack (combinatorial optimization problems)...

Works for many Linear Algebra operations:
Matrix Multiplication: A * B = C -> Ac * Br = Cf
LU Decomposition: C = L * U -> Cf = Lc * Ur
Addition: A + B = C -> Af + Bf = Cf
Scalar Multiplication: c * Af = (c * A)f
Transpose: Af^T = (A^T)f
Cholesky factorization & QR factorization
In 1984, Huang and Abraham proposed ABFT to detect and correct errors in some matrix operations on systolic arrays.
ABFT encodes the data & redesigns the algorithm to operate on the encoded data. Failures are detected and corrected off-line (after the execution).
ABFT variation for on-line recovery (the runtime detects failures + is robust to failures):

“Algorithmic-Based Fault Tolerance”
•Similar to diskless ckpt., an extra processor Pc is added to store the checksum of the data (vectors X and Y in this case): Xc = X1 + … + Xp, Yc = Y1 + … + Yp; Xf = [X1, …, Xp, Xc], Yf = [Y1, …, Yp, Yc].
•Operations are performed on Xf and Yf instead of X and Y: Zf = Xf + Yf
•Compared to diskless checkpointing, the memory AND CPU of Pc take part in the computation:
•No global operation for the checksum!
•No local checkpoint!
[Figure from G. Bosilca: processors P1-P4 plus Pc hold X1…X4, Xc and Y1…Y4, Yc; adding them gives Z1…Z4, Zc]
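A small numpy sketch of the encoding described above: the operation is carried out on the checksum-extended vectors, so the result is already encoded and a lost element can be rebuilt without any rollback. Illustrative only; in a real ABFT kernel the elements of Xf, Yf and Zf are distributed across the processors.

```python
# Sketch of ABFT on a vector addition: operate directly on checksum-encoded data.
import numpy as np

def encode(v):
    """Xf = [X1, ..., Xp, Xc] with Xc = X1 + ... + Xp."""
    return np.append(v, v.sum())

def recover(vf, lost):
    """Rebuild element `lost` of the encoded vector from the surviving ones."""
    others = np.delete(vf[:-1], lost)
    return vf[-1] - others.sum()        # Xlost = Xc - sum of surviving elements

x, y = np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0, 30.0, 40.0])
xf, yf = encode(x), encode(y)
zf = xf + yf                            # Zf = Xf + Yf is itself checksum-encoded
assert np.isclose(zf[-1], zf[:-1].sum())
lost = 2                                # say the processor holding Z3 fails
assert np.isclose(recover(zf, lost), zf[lost])
```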
[Figure from A. Geist: meshless formulation of a 2-D finite difference application]
“Naturally fault tolerant algorithms”
Natural fault tolerance is the ability to tolerate failures through the mathematical properties of the algorithm itself, without requiring notification or recovery.
The algorithm includes natural compensation for the lost information.
For example, an iterative algorithm may require more iterations to converge, but it still converges despite lost information
Assumes that a maximum of 0.1% of tasks may fail
Ex1: Meshless iterative methods + chaotic relaxation (asynchronous iterative methods)
Ex2: Global MAX (used in iterative methods to determine convergence)
This algorithm shares some features with Self-Stabilization algorithms: detection of termination is very hard! It provides the max “eventually”… BUT it does not tolerate Byzantine faults (Self-Stabilization does, for transient failures + acyclic topology). A toy gossip-style sketch of the global MAX follows.
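To make the global-MAX example concrete, a toy sketch in the spirit of the naturally fault-tolerant approach (assumed random-peer topology and message-loss model, not the algorithm from the cited work): each task repeatedly pushes its current maximum to random peers and keeps the largest value it has seen; lost messages or dead tasks delay convergence among the survivors but require no recovery action.

```python
# Toy sketch of a "naturally fault tolerant" global MAX by gossip (illustrative only).
import random

def gossip_max(values, rounds=50, p_msg_loss=0.1, dead=frozenset()):
    """values: {task_id: local value}; dead: ids of failed tasks (their values are simply lost)."""
    current = {t: v for t, v in values.items() if t not in dead}
    for _ in range(rounds):
        for task in list(current):
            peer = random.choice(list(current))            # pick a random live peer
            if peer != task and random.random() > p_msg_loss:
                # push-style exchange: the peer keeps the larger of the two estimates
                current[peer] = max(current[peer], current[task])
    return current                                         # eventually every survivor holds the max

vals = {i: random.random() for i in range(64)}
out = gossip_max(vals, dead={3, 17})
survivors_max = max(v for i, v in vals.items() if i not in {3, 17})
# with enough rounds, all surviving tasks converge to survivors_max (no notification, no recovery)
```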
Wrapping up
Fault tolerance is becoming a major issue for users of large-scale parallel systems.
Many challenges:
•Reduce the cost of checkpointing (checkpoint size & time)
•Design better logging and analysis tools
•Design less expensive replication approaches
•Integrate Flash memory technology while keeping cost low and MTTI high
•Investigate the scalability of Diskless Checkpointing
•Collect more traces, identify correlations, design new predictive algorithms
Opportunities may come from Failure-Aware application design and the investigation of alternative FT paradigms, in the context of HPC applications.