Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters
Raghunath Raja Chandrasekar
Abstract
This dissertation proposes a cross-layer framework that leverages the hierarchy of storage media to design scalable and low-overhead fault-tolerance mechanisms, which are inherently I/O-bound. The key components of the framework are: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a light-weight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that pro-actively monitors for failures.
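These components share a common pattern: checkpoints land first in a fast node-local tier (memory, persistent memory, or SSD) and are drained asynchronously to the parallel file system. The following is a minimal sketch of that pattern only; the paths, the drain_thread helper, and the single-file copy are illustrative assumptions, not the implementation of CRUISE or Stage-FS.

```c
/* Sketch: two-tier checkpointing -- write to a node-local tier, then drain
 * asynchronously to the parallel file system.  Paths and buffer sizes are
 * illustrative assumptions, not the dissertation's actual implementation. */
#include <pthread.h>
#include <stdio.h>

#define LOCAL_CKPT  "/dev/shm/ckpt.rank0"        /* fast node-local tier (RAM disk) */
#define REMOTE_CKPT "/lustre/scratch/ckpt.rank0" /* backend parallel file system    */

/* Background drainer: copy the node-local checkpoint to the PFS. */
static void *drain_thread(void *arg)
{
    (void)arg;
    FILE *src = fopen(LOCAL_CKPT, "rb");
    FILE *dst = fopen(REMOTE_CKPT, "wb");
    if (!src || !dst) { perror("drain"); return NULL; }

    char buf[1 << 20];                           /* 1 MiB copy buffer */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, dst);

    fclose(src);
    fclose(dst);
    return NULL;
}

int main(void)
{
    /* 1. Blocking phase: dump application state to the fast local tier. */
    FILE *f = fopen(LOCAL_CKPT, "wb");
    if (!f) { perror("local checkpoint"); return 1; }
    const char state[] = "application state ...";
    fwrite(state, 1, sizeof state, f);
    fclose(f);

    /* 2. Asynchronous phase: resume computation while the checkpoint is
     *    staged out to the parallel file system in the background.       */
    pthread_t t;
    pthread_create(&t, NULL, drain_thread, NULL);

    /* ... computation continues here ... */

    pthread_join(t, NULL);                       /* ensure the drain completes */
    return 0;
}
```

The application pays only the cost of the fast local write; the slow transfer to the backend file system overlaps with computation.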
Ongoing and Future Work
• Inline-compression strategies for the data-staging framework (see the measurement sketch after this item)
  • Compression has traditionally been considered only for space-constrained systems
  • A more compact representation of the data also makes network data movement more efficient
  • How compressible are application- and system-generated checkpoints?
  • Is inline checkpoint compression a viable strategy to reduce data-movement overheads in a data-staging framework, and what are the trade-offs involved?
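One way to approach the compressibility question is to measure offline how well existing checkpoint files compress. Below is a minimal sketch using zlib's compress2(); the checkpoint file name and the compression level are placeholder assumptions.

```c
/* Sketch: estimate the compressibility of a checkpoint file with zlib.
 * The file name and compression level are illustrative assumptions.      */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "ckpt.rank0";
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *in = malloc(size);
    fread(in, 1, size, f);
    fclose(f);

    uLongf out_len = compressBound(size);
    unsigned char *out = malloc(out_len);

    /* Level 1 favors speed, which matters when compression runs inline
     * with checkpointing rather than as a post-processing step.          */
    if (compress2(out, &out_len, in, size, 1) != Z_OK) {
        fprintf(stderr, "compression failed\n");
        return 1;
    }

    printf("%s: %ld -> %lu bytes (ratio %.2f)\n",
           path, size, (unsigned long)out_len, (double)size / out_len);
    free(in);
    free(out);
    return 0;
}
```

Running such a probe over checkpoints from several applications would indicate whether the CPU cost of inline compression is repaid by reduced staging traffic.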
• Energy-efficient checkpointing protocols (a frequency-scaling sketch follows this item)
  • Energy is one of "the most pervasive" challenges for Exascale computing
  • Power budgets are imposed system-wide
  • Power-aware job scheduling and accounting
  • I/O accounts for a significant portion of job wallclock time
  • Are there opportunities to reduce energy consumption during checkpointing?
  • How can existing I/O middleware be made power-conscious?
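A first experiment along these lines could lower the core frequency for the duration of a blocking checkpoint write, since cores are largely idle while I/O completes. The sketch below assumes a Linux cpufreq sysfs interface that the process is permitted to write; the governor names and checkpoint path are illustrative, and this is not the dissertation's protocol.

```c
/* Sketch: drop the CPU to a low-power state around a blocking checkpoint
 * write, using the Linux cpufreq sysfs interface.  Assumes permission to
 * change the governor; paths and governor names are examples.            */
#include <stdio.h>

#define GOV_PATH "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

static int set_governor(const char *gov)
{
    FILE *f = fopen(GOV_PATH, "w");
    if (!f) { perror(GOV_PATH); return -1; }
    fputs(gov, f);
    fclose(f);
    return 0;
}

static void write_checkpoint(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) { perror(path); return; }
    /* ... serialize application state here ... */
    fclose(f);
}

int main(void)
{
    /* The core mostly waits on I/O during the checkpoint, so running it
     * at a reduced frequency trades little time for energy savings.      */
    set_governor("powersave");
    write_checkpoint("/lustre/scratch/ckpt.rank0");
    set_governor("performance");   /* restore full speed for computation */
    return 0;
}
```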
Hierarchical RDMA-Based Checkpoint Data Staging
Advised by: Dhabaleswar K. Panda
Committee: K. Mohror (LLNL), P. Sadayappan (OSU), R. Teodorescu (OSU)
[Figure: Dissertation Research Framework — HPC scientific applications run atop fault-tolerance techniques (checkpoint-restart, process migration), supported by scalable and efficient I/O middleware: hierarchical data staging, QoS-aware checkpointing, inline compression for data staging, efficient in-memory checkpointing, checkpointing for heterogeneous systems, low-overhead fault prediction, and energy-aware checkpointing protocols. These span system-level, application-assisted, and mutually-beneficial mechanisms, built over NVM, Flash/SSDs, InfiniBand/10GigE, MIC/GPU, and Lustre/PVFS.]
Problem Statement
• Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?
• How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic? (See the sketch after this list.)
• How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?
• How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?
• Can low-overhead, timely failure-prediction mechanisms be designed for pro-active failure avoidance and recovery?
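On InfiniBand, one file-system-agnostic way to isolate staging traffic from MPI traffic (the contention question above) is to place checkpoint I/O on its own service level, which the subnet manager can map to a separate virtual lane with a bounded bandwidth share. The fragment below only shows where the service level is set when a verbs queue pair is moved to the RTR state; the remote LID/QPN/PSN and the SL value are placeholders, not the Stage-QoS implementation.

```c
/* Sketch: assign checkpoint-staging traffic to a dedicated InfiniBand
 * service level (SL) when connecting a queue pair, so the fabric's QoS
 * configuration can isolate it from MPI traffic on a separate virtual lane.
 * Assumes the QP is already in the INIT state; connection parameters and
 * the SL value are placeholders.                                          */
#include <infiniband/verbs.h>
#include <string.h>

#define STAGING_SL 4   /* assumed SL reserved for staging traffic */

int connect_staging_qp(struct ibv_qp *qp, uint16_t remote_lid,
                       uint32_t remote_qpn, uint32_t remote_psn)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof attr);

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;

    attr.ah_attr.is_global     = 0;
    attr.ah_attr.dlid          = remote_lid;
    attr.ah_attr.sl            = STAGING_SL;   /* <-- the QoS knob */
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num      = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```

The subnet manager's SL-to-VL mapping and virtual-lane arbitration then bound the bandwidth the staging lane may consume, limiting its interference with inter-process communication.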
Key Designs and Results
I/O Quality-of-Service Aware Checkpointing
Efficient In-Memory Checkpointing
Checkpoint-Restart for Heterogeneous Systems
Low-Overhead Fault Prediction
Checkpointing overhead reduced by 8.3x with the staging approach
[Figure: Software stack — MPI applications use I/O libraries (POSIX, HDF5, MPI-IO, NetCDF, etc.) and MPI libraries (MVAPICH2, OpenMPI, etc.) over the InfiniBand interconnect fabric and a backend parallel filesystem (Lustre, GPFS, PVFS, etc.), with the QoS-aware data-staging framework in between.]
QoS-Aware Data-Staging Framework
[Figure: QoS-aware data-staging architecture — client nodes 1..N are partitioned into staging groups 1..N, each served by a staging server with a local SSD that drains data to the parallel filesystem.]
[Plot: Normalized runtime of Anelastic Wave Propagation (64 MPI processes) — default, with I/O noise (17.9% overhead), and with I/O noise isolated (8% overhead).]
[Plot: Large-message bandwidth (MB/s) vs. message size (bytes) — default, QoS-aware I/O, and with I/O noise; the degradation with I/O noise is annotated at ~20%.]
[Figure: Staging testbed — multi-core client nodes (cores 0..7) connect through an InfiniBand switch to a staging server with a local SSD, which forwards data through a storage-network switch to the parallel filesystem.]
[Figure: Efficient in-memory checkpointing with CRUISE — the MPI application on each compute node checkpoints through SCR into CRUISE, which keeps data in RAM/persistent memory alongside node-local storage (RAM disk, SSD, HDD). A local RDMA agent uses get_data_region() and get_chunk_meta_list() to expose checkpoint chunks, and a remote RDMA agent drains them to the parallel file system (steps 1-9).]
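CRUISE works by intercepting the application's POSIX I/O to a designated checkpoint path and serving it from memory. The sketch below shows only the interception half of that idea (an LD_PRELOAD shim that redirects writes under a prefix into a memory buffer); the prefix, the fixed-size buffer, and the bookkeeping are illustrative simplifications, not CRUISE's internals.

```c
/* Sketch: LD_PRELOAD shim that redirects POSIX writes for files under a
 * checkpoint prefix into a memory buffer, in the spirit of CRUISE's
 * interception of checkpoint I/O.  Prefix, buffer size, and the single
 * fake descriptor are illustrative assumptions.                           */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>

#define CKPT_PREFIX "/ckpt/"          /* assumed checkpoint mount point   */
#define CKPT_FD     9999              /* fake descriptor for ckpt files   */

static char   ckpt_buf[64 << 20];     /* in-memory checkpoint store       */
static size_t ckpt_off;

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open) real_open = dlsym(RTLD_NEXT, "open");

    if (strncmp(path, CKPT_PREFIX, strlen(CKPT_PREFIX)) == 0) {
        ckpt_off = 0;                 /* start a new in-memory checkpoint */
        return CKPT_FD;
    }

    mode_t mode = 0;
    if (flags & O_CREAT) {            /* mode is only passed with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write) real_write = dlsym(RTLD_NEXT, "write");

    if (fd == CKPT_FD) {              /* checkpoint data stays in memory  */
        if (ckpt_off + count > sizeof ckpt_buf)
            count = sizeof ckpt_buf - ckpt_off;
        memcpy(ckpt_buf + ckpt_off, buf, count);
        ckpt_off += count;
        return (ssize_t)count;
    }
    return real_write(fd, buf, count);
}
```

Built with -shared -fPIC -ldl and activated through LD_PRELOAD, such a shim keeps checkpoint data off the file system entirely; in CRUISE the stored regions are then exposed to the RDMA staging agents through calls such as get_data_region() and get_chunk_meta_list().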
[Plot: Aggregate checkpoint bandwidth (TB/s, log scale) vs. node count (1K-96K) for memory, CRUISE, and RAM disk; run on Sequoia @LLNL with 50 MB checkpoints, 10 iterations, 4 MB chunks. Annotated: 1.21 PB/s @ 64 ppn (3 million processes), 1.16 PB/s @ 32 ppn (1.5 million processes), and 58.9 TB/s.]
Bandwidth reading from / writing to the Xeon Phi (MIC), as a fraction of the 6397 MB/s peak IB FDR bandwidth:

                                       Sandy Bridge        Ivy Bridge
  Same socket        Read from MIC      962 MB/s (15%)     3421 MB/s (54%)
                     Write to MIC      5280 MB/s (83%)     6396 MB/s (100%)
  Different socket   Read from MIC      370 MB/s (6%)       247 MB/s (4%)
                     Write to MIC      1075 MB/s (17%)     1179 MB/s (19%)

[Diagram: CPU and Xeon Phi connected over PCIe; CPU sockets connected over QPI.]
MCI = MIC-Check Interception Library
MCP = MIC-Check Proxy
[Figure: MIC-Check architecture — application processes on the Xeon Phi, linked with MVAPICH and the MCI interception library, have their checkpoint I/O intercepted and forwarded to the MCP on the host, whose buffer pools and I/O threads write to the parallel file system (steps 1-4).]
[Plot: Checkpoint time (sec) vs. number of nodes (1 to 128) with 1, 4, 16, and 32 I/O threads.]
[Figure: FTB-IPMI architecture — an FTB-IPMI daemon and FTB_Agent run on the front-end node, and an FTB_Agent runs on each client node 1..N; events are published over the Fault-Tolerance Backplane to the software stack on each client (applications, MPI library, filesystems).]
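FTB-IPMI obtains its hardware health data out-of-band, without putting the compute node's software stack on the critical path. The sketch below illustrates only the polling half of that idea: it shells out to ipmitool, flags sensor lines whose status column is not "ok", and marks where a fault event would be published on the Fault-Tolerance Backplane. The polling interval and the string matching are simplifying assumptions, not FTB-IPMI's actual parser.

```c
/* Sketch: out-of-band sensor polling in the spirit of FTB-IPMI.  Runs
 * ipmitool periodically and flags sensors that are not reporting "ok".
 * Interval and matching logic are illustrative assumptions.              */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define POLL_INTERVAL_SEC 60

static void poll_sensors(void)
{
    FILE *p = popen("ipmitool sensor", "r");
    if (!p) { perror("ipmitool"); return; }

    char line[512];
    while (fgets(line, sizeof line, p)) {
        /* Crude check: any sensor whose status field is not "ok" is
         * treated as a potential failure indicator.                      */
        if (strstr(line, "|") && !strstr(line, "| ok")) {
            /* Here FTB-IPMI would publish a fault event on the
             * Fault-Tolerance Backplane for subscribers (MPI library,
             * file systems, applications) to act on pro-actively.        */
            fprintf(stderr, "suspect sensor: %s", line);
        }
    }
    pclose(p);
}

int main(void)
{
    for (;;) {
        poll_sensors();
        sleep(POLL_INTERVAL_SEC);
    }
    return 0;
}
```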