Designing Scalable and Efficient I/O Middleware for Fault-Resilient HPC Clusters
Raghunath Raja Chandrasekar
Abstract
This dissertation proposes a cross-layer framework that leverages the hierarchy of storage media to design scalable and low-overhead fault-tolerance mechanisms, which are inherently I/O-bound. The key components of the framework are: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a light-weight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that pro-actively monitors for failures.
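These components share a common pattern: checkpoints land first in a fast node-local tier (memory, persistent memory, or SSD) and are drained asynchronously to the parallel file system. The following is a minimal sketch of that pattern only; the paths, the drain_thread helper, and the single-file copy are illustrative assumptions, not the implementation of CRUISE or Stage-FS.

```c
/* Sketch: two-tier checkpointing -- write to a node-local tier, then drain
 * asynchronously to the parallel file system.  Paths and buffer sizes are
 * illustrative assumptions, not the dissertation's actual implementation. */
#include <pthread.h>
#include <stdio.h>

#define LOCAL_CKPT  "/dev/shm/ckpt.rank0"        /* fast node-local tier (RAM disk) */
#define REMOTE_CKPT "/lustre/scratch/ckpt.rank0" /* backend parallel file system    */

/* Background drainer: copy the node-local checkpoint to the PFS. */
static void *drain_thread(void *arg)
{
    (void)arg;
    FILE *src = fopen(LOCAL_CKPT, "rb");
    FILE *dst = fopen(REMOTE_CKPT, "wb");
    if (!src || !dst) { perror("drain"); return NULL; }

    char buf[1 << 20];                           /* 1 MiB copy buffer */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, dst);

    fclose(src);
    fclose(dst);
    return NULL;
}

int main(void)
{
    /* 1. Blocking phase: dump application state to the fast local tier. */
    FILE *f = fopen(LOCAL_CKPT, "wb");
    if (!f) { perror("local checkpoint"); return 1; }
    const char state[] = "application state ...";
    fwrite(state, 1, sizeof state, f);
    fclose(f);

    /* 2. Asynchronous phase: resume computation while the checkpoint is
     *    staged out to the parallel file system in the background.       */
    pthread_t t;
    pthread_create(&t, NULL, drain_thread, NULL);

    /* ... computation continues here ... */

    pthread_join(t, NULL);                       /* ensure the drain completes */
    return 0;
}
```

The application pays only the cost of the fast local write; the slow transfer to the backend file system overlaps with computation.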
Ongoing and Future Work
• Inline-compression strategies for the data-staging framework (see the measurement sketch after this item)
  • Compression has traditionally been considered only for space-constrained systems
  • A more compact representation of the data also makes network data movement more efficient
  • How compressible are application- and system-generated checkpoints?
  • Is inline checkpoint compression a viable strategy to reduce data-movement overheads in a data-staging framework, and what are the trade-offs involved?
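One way to approach the compressibility question is to measure offline how well existing checkpoint files compress. Below is a minimal sketch using zlib's compress2(); the checkpoint file name and the compression level are placeholder assumptions.

```c
/* Sketch: estimate the compressibility of a checkpoint file with zlib.
 * The file name and compression level are illustrative assumptions.      */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "ckpt.rank0";
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *in = malloc(size);
    fread(in, 1, size, f);
    fclose(f);

    uLongf out_len = compressBound(size);
    unsigned char *out = malloc(out_len);

    /* Level 1 favors speed, which matters when compression runs inline
     * with checkpointing rather than as a post-processing step.          */
    if (compress2(out, &out_len, in, size, 1) != Z_OK) {
        fprintf(stderr, "compression failed\n");
        return 1;
    }

    printf("%s: %ld -> %lu bytes (ratio %.2f)\n",
           path, size, (unsigned long)out_len, (double)size / out_len);
    free(in);
    free(out);
    return 0;
}
```

Running such a probe over checkpoints from several applications would indicate whether the CPU cost of inline compression is repaid by reduced staging traffic.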
• Energy-efficient checkpointing protocols (a frequency-scaling sketch follows this item)
  • Energy is one of "the most pervasive" challenges for Exascale computing
  • Power budgets are imposed system-wide
  • Power-aware job scheduling and accounting
  • I/O accounts for a significant portion of job wallclock time
  • Are there opportunities to reduce energy consumption during checkpointing?
  • How can existing I/O middleware be made power-conscious?
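A first experiment along these lines could lower the core frequency for the duration of a blocking checkpoint write, since cores are largely idle while I/O completes. The sketch below assumes a Linux cpufreq sysfs interface that the process is permitted to write; the governor names and checkpoint path are illustrative, and this is not the dissertation's protocol.

```c
/* Sketch: drop the CPU to a low-power state around a blocking checkpoint
 * write, using the Linux cpufreq sysfs interface.  Assumes permission to
 * change the governor; paths and governor names are examples.            */
#include <stdio.h>

#define GOV_PATH "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

static int set_governor(const char *gov)
{
    FILE *f = fopen(GOV_PATH, "w");
    if (!f) { perror(GOV_PATH); return -1; }
    fputs(gov, f);
    fclose(f);
    return 0;
}

static void write_checkpoint(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) { perror(path); return; }
    /* ... serialize application state here ... */
    fclose(f);
}

int main(void)
{
    /* The core mostly waits on I/O during the checkpoint, so running it
     * at a reduced frequency trades little time for energy savings.      */
    set_governor("powersave");
    write_checkpoint("/lustre/scratch/ckpt.rank0");
    set_governor("performance");   /* restore full speed for computation */
    return 0;
}
```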
Hierarchical RDMA-Based Checkpoint Data Staging
Advised by: Dhabaleswar K. Panda
Committee: K. Mohror (LLNL), P. Sadayappan (OSU), R. Teodorescu (OSU)
[Figure: Dissertation Research Framework — HPC scientific applications run atop fault-tolerance techniques (checkpoint-restart, process migration), supported by scalable and efficient I/O middleware: hierarchical data staging, QoS-aware checkpointing, inline compression for data staging, efficient in-memory checkpointing, checkpointing for heterogeneous systems, low-overhead fault prediction, and energy-aware checkpointing protocols. These span system-level, application-assisted, and mutually-beneficial mechanisms, built over NVM, Flash/SSDs, InfiniBand/10GigE, MIC/GPU, and Lustre/PVFS.]
Problem Statement
• Can checkpoint-restart mechanisms benefit from a hierarchical data-staging framework?
• How can I/O middleware minimize the contention for network resources between checkpoint-restart traffic and inter-process communication traffic? (See the sketch after this list.)
• How can the behavior of HPC applications and I/O middleware be enhanced to leverage the deep storage hierarchies available on current-generation supercomputers?
• How can the capabilities of state-of-the-art checkpointing systems be enhanced to efficiently handle heterogeneous systems?
• Can low-overhead, timely failure-prediction mechanisms be designed for pro-active failure avoidance and recovery?
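On InfiniBand, one file-system-agnostic way to isolate staging traffic from MPI traffic (the contention question above) is to place checkpoint I/O on its own service level, which the subnet manager can map to a separate virtual lane with a bounded bandwidth share. The fragment below only shows where the service level is set when a verbs queue pair is moved to the RTR state; the remote LID/QPN/PSN and the SL value are placeholders, not the Stage-QoS implementation.

```c
/* Sketch: assign checkpoint-staging traffic to a dedicated InfiniBand
 * service level (SL) when connecting a queue pair, so the fabric's QoS
 * configuration can isolate it from MPI traffic on a separate virtual lane.
 * Assumes the QP is already in the INIT state; connection parameters and
 * the SL value are placeholders.                                          */
#include <infiniband/verbs.h>
#include <string.h>

#define STAGING_SL 4   /* assumed SL reserved for staging traffic */

int connect_staging_qp(struct ibv_qp *qp, uint16_t remote_lid,
                       uint32_t remote_qpn, uint32_t remote_psn)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof attr);

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;

    attr.ah_attr.is_global     = 0;
    attr.ah_attr.dlid          = remote_lid;
    attr.ah_attr.sl            = STAGING_SL;   /* <-- the QoS knob */
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num      = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```

The subnet manager's SL-to-VL mapping and virtual-lane arbitration then bound the bandwidth the staging lane may consume, limiting its interference with inter-process communication.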
Key Designs and Results
I/O Quality-of-Service Aware Checkpointing
Efficient In-Memory Checkpointing
Checkpoint-Restart for Heterogeneous Systems
Low-Overhead Fault Prediction
Checkpointing overhead reduced by 8.3x with the staging approach
[Figure: Software stack — MPI applications use I/O libraries (POSIX, HDF5, MPI-IO, NetCDF, etc.) and MPI libraries (MVAPICH2, OpenMPI, etc.) over the InfiniBand interconnect fabric and a backend parallel filesystem (Lustre, GPFS, PVFS, etc.), with the QoS-aware data-staging framework in between.]
QoS-Aware Data-Staging Framework
[Figure: QoS-aware data-staging architecture — client nodes 1..N are partitioned into staging groups 1..N, each served by a staging server with a local SSD that drains data to the parallel filesystem.]
[Plot: Normalized runtime of Anelastic Wave Propagation (64 MPI processes) — default, with I/O noise (17.9% overhead), and with I/O noise isolated (8% overhead).]
[Plot: Large-message bandwidth (MB/s) vs. message size (bytes) — default, QoS-aware I/O, and with I/O noise; the degradation with I/O noise is annotated at ~20%.]
[Figure: Staging testbed — multi-core client nodes (cores 0..7) connect through an InfiniBand switch to a staging server with a local SSD, which forwards data through a storage-network switch to the parallel filesystem.]
[Figure: Efficient in-memory checkpointing with CRUISE — the MPI application on each compute node checkpoints through SCR into CRUISE, which keeps data in RAM/persistent memory alongside node-local storage (RAM disk, SSD, HDD). A local RDMA agent uses get_data_region() and get_chunk_meta_list() to expose checkpoint chunks, and a remote RDMA agent drains them to the parallel file system (steps 1-9).]
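CRUISE works by intercepting the application's POSIX I/O to a designated checkpoint path and serving it from memory. The sketch below shows only the interception half of that idea (an LD_PRELOAD shim that redirects writes under a prefix into a memory buffer); the prefix, the fixed-size buffer, and the bookkeeping are illustrative simplifications, not CRUISE's internals.

```c
/* Sketch: LD_PRELOAD shim that redirects POSIX writes for files under a
 * checkpoint prefix into a memory buffer, in the spirit of CRUISE's
 * interception of checkpoint I/O.  Prefix, buffer size, and the single
 * fake descriptor are illustrative assumptions.                           */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>

#define CKPT_PREFIX "/ckpt/"          /* assumed checkpoint mount point   */
#define CKPT_FD     9999              /* fake descriptor for ckpt files   */

static char   ckpt_buf[64 << 20];     /* in-memory checkpoint store       */
static size_t ckpt_off;

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open) real_open = dlsym(RTLD_NEXT, "open");

    if (strncmp(path, CKPT_PREFIX, strlen(CKPT_PREFIX)) == 0) {
        ckpt_off = 0;                 /* start a new in-memory checkpoint */
        return CKPT_FD;
    }

    mode_t mode = 0;
    if (flags & O_CREAT) {            /* mode is only passed with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write) real_write = dlsym(RTLD_NEXT, "write");

    if (fd == CKPT_FD) {              /* checkpoint data stays in memory  */
        if (ckpt_off + count > sizeof ckpt_buf)
            count = sizeof ckpt_buf - ckpt_off;
        memcpy(ckpt_buf + ckpt_off, buf, count);
        ckpt_off += count;
        return (ssize_t)count;
    }
    return real_write(fd, buf, count);
}
```

Built with -shared -fPIC -ldl and activated through LD_PRELOAD, such a shim keeps checkpoint data off the file system entirely; in CRUISE the stored regions are then exposed to the RDMA staging agents through calls such as get_data_region() and get_chunk_meta_list().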
[Plot: Aggregate checkpoint bandwidth (TB/s, log scale) vs. node count (1K-96K) for memory, CRUISE, and RAM disk; run on Sequoia @LLNL with 50 MB checkpoints, 10 iterations, 4 MB chunks. Annotated: 1.21 PB/s @ 64 ppn (3 million processes), 1.16 PB/s @ 32 ppn (1.5 million processes), and 58.9 TB/s.]
Bandwidth reading from / writing to the Xeon Phi (MIC), as a fraction of the 6397 MB/s peak IB FDR bandwidth:

                                       Sandy Bridge        Ivy Bridge
  Same socket        Read from MIC      962 MB/s (15%)     3421 MB/s (54%)
                     Write to MIC      5280 MB/s (83%)     6396 MB/s (100%)
  Different socket   Read from MIC      370 MB/s (6%)       247 MB/s (4%)
                     Write to MIC      1075 MB/s (17%)     1179 MB/s (19%)

[Diagram: CPU and Xeon Phi connected over PCIe; CPU sockets connected over QPI.]
MCI = MIC-Check Interception Library
MCP = MIC-Check Proxy
[Figure: MIC-Check architecture — application processes on the Xeon Phi, linked with MVAPICH and the MCI interception library, have their checkpoint I/O intercepted and forwarded to the MCP on the host, whose buffer pools and I/O threads write to the parallel file system (steps 1-4).]
[Plot: Checkpoint time (sec) vs. number of nodes (1 to 128) with 1, 4, 16, and 32 I/O threads.]
[Figure: FTB-IPMI architecture — an FTB-IPMI daemon and FTB_Agent run on the front-end node, and an FTB_Agent runs on each client node 1..N; events are published over the Fault-Tolerance Backplane to the software stack on each client (applications, MPI library, filesystems).]
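FTB-IPMI obtains its hardware health data out-of-band, without putting the compute node's software stack on the critical path. The sketch below illustrates only the polling half of that idea: it shells out to ipmitool, flags sensor lines whose status column is not "ok", and marks where a fault event would be published on the Fault-Tolerance Backplane. The polling interval and the string matching are simplifying assumptions, not FTB-IPMI's actual parser.

```c
/* Sketch: out-of-band sensor polling in the spirit of FTB-IPMI.  Runs
 * ipmitool periodically and flags sensors that are not reporting "ok".
 * Interval and matching logic are illustrative assumptions.              */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define POLL_INTERVAL_SEC 60

static void poll_sensors(void)
{
    FILE *p = popen("ipmitool sensor", "r");
    if (!p) { perror("ipmitool"); return; }

    char line[512];
    while (fgets(line, sizeof line, p)) {
        /* Crude check: any sensor whose status field is not "ok" is
         * treated as a potential failure indicator.                      */
        if (strstr(line, "|") && !strstr(line, "| ok")) {
            /* Here FTB-IPMI would publish a fault event on the
             * Fault-Tolerance Backplane for subscribers (MPI library,
             * file systems, applications) to act on pro-actively.        */
            fprintf(stderr, "suspect sensor: %s", line);
        }
    }
    pclose(p);
}

int main(void)
{
    for (;;) {
        poll_sensors();
        sleep(POLL_INTERVAL_SEC);
    }
    return 0;
}
```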