Post on 17-Jan-2016
Providing Fault-tolerance for Parallel Programs on Grid
(FT-MPICH)
Heon Y. Yeom, Distributed Computing Systems Lab., Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
Hardware performance limits keep being pushed back in line with Moore's Law, and these cutting-edge technologies make "Tera-scale" clusters feasible.
However, what about system reliability? Distributed systems are still fragile in the face of unexpected failures.
Motivation
High-performance Network Trend:
- MPICH-G2 (Ethernet): good speed (1 Gbps), common; the MPICH standard. Demands fault-resilience!
- MPICH-GM (Myrinet): high speed (10 Gbps), popular; MPICH compatible. Demands fault-resilience!
- MVAPICH (InfiniBand): high speed (up to 30 Gbps), will be popular; MPICH compatible. Demands fault-resilience!
All of these call for a multiple fault-tolerant framework.
Introduction
Unreliability of distributed systems: even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure.
Our goal is to construct a practical multiple fault-tolerant framework for the various MPICH variants running on high-performance clusters/Grids.
Introduction
Why the Message Passing Interface (MPI)? Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
We chose the MPICH series because:
- MPI is the most popular programming model in cluster computing.
- Providing fault-tolerance at the MPI level is more cost-effective than providing it in the OS or hardware.
Architecture -Concept-
- Monitoring / Failure Detection
- Checkpoint/Restart (C/R) Protocol
- Consensus & Election Protocol
Together, these components form the Multiple Fault-tolerant Framework.
Architecture -Overall System-
[Diagram: the Management System talks to each MPI process over Ethernet; the MPI processes communicate with one another over a high-speed network (Myrinet, InfiniBand), while management traffic runs over Gigabit Ethernet.]
Architecture -Development History-
- 2003: MPICH-GF, fault-tolerant MPICH-G2 (Ethernet)
- 2004: FT-MPICH-GM, fault-tolerant MPICH-GM (Myrinet)
- 2005 to current: FT-MVAPICH, fault-tolerant MVAPICH (InfiniBand)
Management System
The Management System makes MPI more reliable. Its functions:
- Failure Detection
- Checkpoint Coordination
- Recovery
- Initialization Coordination
- Output Management
- Checkpoint Transfer
Management System
[Diagram: a Leader Job Manager coordinates per-node Local Job Managers, each attached to an MPI process, and interfaces with a third-party scheduler (e.g. PBS, LSF), a user CLI, and stable storage. Management traffic runs over Ethernet; MPI traffic runs over a high-speed network (e.g. Myrinet, InfiniBand).]
Job Management System 1/2
The Job Management System:
- Manages and monitors multiple MPI processes and their execution environments
- Should be lightweight
- Helps the system take consistent checkpoints and recover from failures
- Has a fault-detection mechanism
Its two main components are the Central Manager and the Local Job Manager.
Job Management System 2/2
Central Manager:
- Manages all system functions and states
- Detects node failures (via periodic heartbeats) and Job Manager failures
Job Manager:
- Relays messages between the Central Manager and the MPI processes
- Detects unexpected MPI process failures
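The heartbeat-based failure detection above can be sketched as follows. This is a minimal illustration, not code from FT-MPICH: the class and parameter names (`HeartbeatMonitor`, `HEARTBEAT_INTERVAL`, `MAX_MISSED`) are assumptions, and a real Central Manager would receive beats over the network.

```python
import time

# Illustrative sketch: each node reports a heartbeat every
# HEARTBEAT_INTERVAL seconds; a node that misses MAX_MISSED
# consecutive beats is declared failed by the Central Manager.
HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MAX_MISSED = 3             # beats missed before a node is declared dead

class HeartbeatMonitor:
    def __init__(self):
        self.last_beat = {}   # node id -> timestamp of last heartbeat

    def record_beat(self, node, now=None):
        # Called when a heartbeat message arrives from a node.
        self.last_beat[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        # Any node silent for longer than the deadline is presumed failed.
        now = now if now is not None else time.time()
        deadline = MAX_MISSED * HEARTBEAT_INTERVAL
        return [n for n, t in self.last_beat.items() if now - t > deadline]
```

A detected failure would then trigger the recovery protocol described in the following slides.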
Fault-Tolerant MPI 1/3
To provide MPI fault-tolerance, we adopt:
- A coordinated checkpointing scheme (vs. an independent scheme); the Central Manager is the coordinator.
- Application-level checkpointing (vs. kernel-level checkpointing); this method requires no effort on the part of cluster administrators.
- A user-transparent checkpointing scheme (vs. user-aware); this method requires no modification of MPI source code.
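The coordinated scheme can be sketched in a few lines. This is a toy model under stated assumptions: the class names are invented for illustration, checkpoints are kept in memory rather than on stable storage, and the real protocol must also coordinate in-flight messages.

```python
# Toy sketch of coordinated checkpointing: the coordinator (the Central
# Manager) broadcasts a checkpoint command, every rank saves its state,
# and the checkpoint version is committed only when all ranks acknowledge.

class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0}   # application state of this rank
        self.saved = None          # last saved checkpoint

    def checkpoint(self):
        self.saved = dict(self.state)   # write state to "stable storage"
        return True                     # acknowledge to the coordinator

class Coordinator:
    def __init__(self, ranks):
        self.ranks = ranks
        self.version = 0   # last committed checkpoint version

    def take_checkpoint(self):
        acks = [r.checkpoint() for r in self.ranks]  # broadcast command
        if all(acks):          # commit only on global success
            self.version += 1
        return self.version
```

Because every rank checkpoints in the same coordinated round, the saved states form a consistent global snapshot, which an independent scheme cannot guarantee without message logging.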
Fault-Tolerant MPI 2/3
[Diagram: Coordinated Checkpointing. The Central Manager issues a checkpoint command to ranks 0 through 3; each rank writes checkpoint version 1 to stable storage while the Central Manager performs failure detection.]
Fault-Tolerant MPI 3/3
[Diagram: Recovery from failures. After a failure, the Central Manager directs ranks 0 through 3 to restart from checkpoint version 1 on stable storage.]
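The recovery step shown in the diagram amounts to rollback recovery: all ranks, survivors and restarted ones alike, reload the same committed checkpoint version so the job resumes from a consistent global state. A minimal sketch, with invented names and in-memory state standing in for stable storage:

```python
# Toy sketch of rollback recovery: on failure detection, the failed rank
# is re-spawned and every rank reloads the last committed checkpoint.

def recover_job(last_checkpoint, failed_ranks):
    """last_checkpoint: dict mapping rank id -> saved state from the last
    committed checkpoint version. Returns the restored per-rank state;
    failed ranks load the same version as the survivors."""
    restored = {rank: dict(state) for rank, state in last_checkpoint.items()}
    # Sanity check: every failed rank has a checkpoint to restart from.
    assert set(failed_ranks) <= set(restored)
    return restored
```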
Management System
MPICH-GF:
- Based on Globus Toolkit 2
- Hierarchical management system, suitable for multiple clusters
- Supports recovery from process/manager/node failures
Limitations:
- Does not support recovery from multiple failures
- Has a single point of failure (the Central Manager)
Management System
FT-MPICH-GM (new version):
- Does not rely on the Globus Toolkit
- Removes the hierarchical structure; Myrinet/InfiniBand clusters no longer require it
- Supports recovery from multiple failures
FT-MVAPICH (more robust):
- Removes the single point of failure through leader election for the job manager
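The slides do not spell out the election protocol, so here is one common way it could work: a bully-style election in which the live manager with the highest id becomes the new leader. This is an assumption for illustration, not the actual FT-MVAPICH algorithm.

```python
# Bully-style leader election sketch: assuming every surviving job
# manager knows the ids of all managers, the highest-id live manager
# is elected the new leader (replacing a failed Central Manager).

def elect_leader(manager_ids, alive):
    """Return the highest-id manager still alive, or None if none are."""
    candidates = [m for m in manager_ids if m in alive]
    return max(candidates) if candidates else None
```

With an election in place, any surviving job manager can take over coordination, so losing the leader no longer brings down the whole management system.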
Fault-tolerant MPICH-variants
FT Module Recovery Module
ConnectionRe-establishment
Ethernet
Checkpoint Toolkit
Atomic M
essage
Transfer
ADI(Abstract Device Interface)
Globus2 (Ethernet) GM (Myrinet) MVAPICH (InfiniBand)
Collective Operations
MPICH-GF
P2P Operations
FT-MPICH-GM FT-MVAPICH
Myrinet InfiniBand
Future Works
- Incorporate our FT protocol into the GT-4 framework; MPICH-GF is currently GT-2 compliant.
- Make MPICH work with different clusters: Gig-E, Myrinet, InfiniBand; also Open-MPI, VMI, etc.
- Support non-Intel CPUs: AMD (Opteron).
GRID Issues
Who should be responsible for:
- Monitoring nodes going up and down?
- Resubmitting failed processes?
- Allocating new nodes?
GRID job management spans resource management, scheduling, and health monitoring.