Post on 17-Jan-2016
Providing Fault-tolerance for Parallel Programs on Grid
(FT-MPICH)
Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
Hardware performance limits keep being pushed back in line with Moore's Law, and these cutting-edge technologies make "Tera-scale" clusters feasible.
However, what about system reliability? Distributed systems are still fragile in the face of unexpected failures.
Motivation
High-performance Network Trend:
- MPICH-G2 (Ethernet): good speed (1 Gbps), common; the MPICH standard. Demands fault-resilience!
- MPICH-GM (Myrinet): high speed (10 Gbps), popular; MPICH compatible. Demands fault-resilience!
- MVAPICH (InfiniBand): high speed (up to 30 Gbps), will be popular; MPICH compatible. Demands fault-resilience!
All of these call for a multiple fault-tolerant framework.
Introduction
Unreliability of distributed systems: even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure.
Our goal is to construct a practical multiple fault-tolerant framework for the various MPICH variants running on high-performance clusters/Grids.
Introduction
Why the Message Passing Interface (MPI)? Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
We chose the MPICH series because:
- MPI is the most popular programming model in cluster computing.
- Providing fault-tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.
Architecture -Concept-
- Monitoring / Failure Detection
- Checkpoint/Restart (C/R) Protocol
- Consensus & Election Protocol
Together, these components form the Multiple Fault-tolerant Framework.
Architecture -Overall System-
[Diagram: the Management System talks to each MPI process over Ethernet; the MPI processes communicate with one another over a high-speed network (Myrinet, InfiniBand), while management traffic runs over Gigabit Ethernet.]
Architecture -Development History-
- 2003: MPICH-GF, fault-tolerant MPICH-G2 (Ethernet)
- 2004: FT-MPICH-GM, fault-tolerant MPICH-GM (Myrinet)
- 2005 to current: FT-MVAPICH, fault-tolerant MVAPICH (InfiniBand)
Management System
The Management System makes MPI more reliable. Its functions:
- Failure Detection
- Checkpoint Coordination
- Recovery
- Initialization Coordination
- Output Management
- Checkpoint Transfer
Management System
[Diagram: a Leader Job Manager coordinates per-node Local Job Managers, each attached to an MPI process, and interfaces with a third-party scheduler (e.g. PBS, LSF), a user CLI, and stable storage. Management traffic runs over Ethernet; MPI traffic runs over a high-speed network (e.g. Myrinet, InfiniBand).]
Job Management System 1/2
The Job Management System:
- Manages and monitors multiple MPI processes and their execution environments
- Should be lightweight
- Helps the system take consistent checkpoints and recover from failures
- Has a fault-detection mechanism
Its two main components are the Central Manager and the Local Job Manager.
Job Management System 2/2
Central Manager:
- Manages all system functions and states
- Detects node failures (via periodic heartbeats) and Job Manager failures
Job Manager:
- Relays messages between the Central Manager and the MPI processes
- Detects unexpected MPI process failures
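The heartbeat-based failure detection above can be sketched as follows. This is a minimal illustration, not code from FT-MPICH: the class and parameter names (`HeartbeatMonitor`, `HEARTBEAT_INTERVAL`, `MAX_MISSED`) are assumptions, and a real Central Manager would receive beats over the network.

```python
import time

# Illustrative sketch: each node reports a heartbeat every
# HEARTBEAT_INTERVAL seconds; a node that misses MAX_MISSED
# consecutive beats is declared failed by the Central Manager.
HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MAX_MISSED = 3             # beats missed before a node is declared dead

class HeartbeatMonitor:
    def __init__(self):
        self.last_beat = {}   # node id -> timestamp of last heartbeat

    def record_beat(self, node, now=None):
        # Called when a heartbeat message arrives from a node.
        self.last_beat[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        # Any node silent for longer than the deadline is presumed failed.
        now = now if now is not None else time.time()
        deadline = MAX_MISSED * HEARTBEAT_INTERVAL
        return [n for n, t in self.last_beat.items() if now - t > deadline]
```

A detected failure would then trigger the recovery protocol described in the following slides.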
Fault-Tolerant MPI 1/3
To provide MPI fault-tolerance, we adopt:
- A coordinated checkpointing scheme (vs. an independent scheme); the Central Manager is the coordinator.
- Application-level checkpointing (vs. kernel-level checkpointing); this method requires no effort on the part of cluster administrators.
- A user-transparent checkpointing scheme (vs. user-aware); this method requires no modification of MPI source code.
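The coordinated scheme can be sketched in a few lines. This is a toy model under stated assumptions: the class names are invented for illustration, checkpoints are kept in memory rather than on stable storage, and the real protocol must also coordinate in-flight messages.

```python
# Toy sketch of coordinated checkpointing: the coordinator (the Central
# Manager) broadcasts a checkpoint command, every rank saves its state,
# and the checkpoint version is committed only when all ranks acknowledge.

class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0}   # application state of this rank
        self.saved = None          # last saved checkpoint

    def checkpoint(self):
        self.saved = dict(self.state)   # write state to "stable storage"
        return True                     # acknowledge to the coordinator

class Coordinator:
    def __init__(self, ranks):
        self.ranks = ranks
        self.version = 0   # last committed checkpoint version

    def take_checkpoint(self):
        acks = [r.checkpoint() for r in self.ranks]  # broadcast command
        if all(acks):          # commit only on global success
            self.version += 1
        return self.version
```

Because every rank checkpoints in the same coordinated round, the saved states form a consistent global snapshot, which an independent scheme cannot guarantee without message logging.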
Fault-Tolerant MPI 2/3
[Diagram: Coordinated Checkpointing. The Central Manager issues a checkpoint command to ranks 0 through 3; each rank writes checkpoint version 1 to stable storage while the Central Manager performs failure detection.]
Fault-Tolerant MPI 3/3
[Diagram: Recovery from failures. After a failure, the Central Manager directs ranks 0 through 3 to restart from checkpoint version 1 on stable storage.]
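The recovery step shown in the diagram amounts to rollback recovery: all ranks, survivors and restarted ones alike, reload the same committed checkpoint version so the job resumes from a consistent global state. A minimal sketch, with invented names and in-memory state standing in for stable storage:

```python
# Toy sketch of rollback recovery: on failure detection, the failed rank
# is re-spawned and every rank reloads the last committed checkpoint.

def recover_job(last_checkpoint, failed_ranks):
    """last_checkpoint: dict mapping rank id -> saved state from the last
    committed checkpoint version. Returns the restored per-rank state;
    failed ranks load the same version as the survivors."""
    restored = {rank: dict(state) for rank, state in last_checkpoint.items()}
    # Sanity check: every failed rank has a checkpoint to restart from.
    assert set(failed_ranks) <= set(restored)
    return restored
```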
Management System
MPICH-GF:
- Based on Globus Toolkit 2
- Hierarchical management system, suitable for multiple clusters
- Supports recovery from process/manager/node failures
Limitations:
- Does not support recovery from multiple failures
- Has a single point of failure (the Central Manager)
Management System
FT-MPICH-GM (new version):
- Does not rely on the Globus Toolkit
- Removes the hierarchical structure; Myrinet/InfiniBand clusters no longer require it
- Supports recovery from multiple failures
FT-MVAPICH (more robust):
- Removes the single point of failure through leader election for the job manager
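The slides do not spell out the election protocol, so here is one common way it could work: a bully-style election in which the live manager with the highest id becomes the new leader. This is an assumption for illustration, not the actual FT-MVAPICH algorithm.

```python
# Bully-style leader election sketch: assuming every surviving job
# manager knows the ids of all managers, the highest-id live manager
# is elected the new leader (replacing a failed Central Manager).

def elect_leader(manager_ids, alive):
    """Return the highest-id manager still alive, or None if none are."""
    candidates = [m for m in manager_ids if m in alive]
    return max(candidates) if candidates else None
```

With an election in place, any surviving job manager can take over coordination, so losing the leader no longer brings down the whole management system.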
Fault-tolerant MPICH-variants
FT Module Recovery Module
ConnectionRe-establishment
Ethernet
Checkpoint Toolkit
Atomic M
essage
Transfer
ADI(Abstract Device Interface)
Globus2 (Ethernet) GM (Myrinet) MVAPICH (InfiniBand)
Collective Operations
MPICH-GF
P2P Operations
FT-MPICH-GM FT-MVAPICH
Myrinet InfiniBand
Future Works
- Incorporate our FT protocol into the GT-4 framework; MPICH-GF is currently GT-2 compliant.
- Make MPICH work with different clusters: Gig-E, Myrinet, InfiniBand; also Open-MPI, VMI, etc.
- Support non-Intel CPUs: AMD (Opteron).
GRID Issues
Who should be responsible for:
- Monitoring nodes going up and down?
- Resubmitting failed processes?
- Allocating new nodes?
GRID job management spans resource management, scheduling, and health monitoring.