Modularized Redundant Parallel Virtual System
Sheng-Kai Hung
HPCC Lab
Parallel Virtual File System Overview
- Developed by Clemson University
- Uses RAID-0-like striping to distribute files (illustrated below)
- Claims high read/write performance
- Based on a TCP/IP client-server model
- Centralized metadata server
- POSIX and MPI-IO compliant
- No fault-tolerance mechanism provided
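As a concrete illustration of RAID-0-style striping, here is a minimal C sketch that maps a file offset to an IO node and a node-local offset. The stripe size, node count, and all names are assumptions for illustration, not PVFS's actual code:

```c
#include <stdint.h>

#define STRIPE_SIZE 65536   /* assumed 64 KB striping unit */
#define NUM_IONODES 16      /* assumed number of IO nodes  */

struct stripe_loc { int node; uint64_t local_off; };

/* Round-robin (RAID-0-style) placement: striping unit k of a
 * file lands on IO node k % NUM_IONODES. */
static struct stripe_loc locate(uint64_t offset)
{
    uint64_t unit = offset / STRIPE_SIZE;
    struct stripe_loc loc = {
        .node = (int)(unit % NUM_IONODES),
        .local_off = (unit / NUM_IONODES) * STRIPE_SIZE
                     + offset % STRIPE_SIZE,
    };
    return loc;
}
```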
Our Previous Design
- Parity information is stored at the metadata server, a single point of failure
- Read/write performance: "delayed write" is used to reduce the parity overhead (sketched below)
- A buffer stores the differences of the blocks being written
- Reading the corresponding blocks is also needed
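A minimal C sketch of the "delayed write" idea, assuming a parity block is the XOR of its data blocks; the buffer layout and function names are hypothetical:

```c
#include <stddef.h>

#define BLK 1024  /* assumed block size */

static char diff_buf[BLK];  /* accumulated difference of written blocks */

/* On each write, fold the XOR difference (old ^ new) into the buffer
 * instead of updating the parity block immediately. */
static void record_write(const char *old_blk, const char *new_blk)
{
    for (size_t i = 0; i < BLK; i++)
        diff_buf[i] ^= old_blk[i] ^ new_blk[i];
}

/* Later, a single read + write updates the parity:
 * new_parity = old_parity ^ diff_buf. */
static void apply_to_parity(char *parity)
{
    for (size_t i = 0; i < BLK; i++)
        parity[i] ^= diff_buf[i];
}
```

Because XOR is associative, several delayed writes fold into one parity update, which is how the buffer reduces the parity overhead.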
[Diagrams: two PVFS deployments. In each, a metadata server running MGR exchanges data messages with IO nodes, each running an IOD; the labels SIO and SHM mark the two I/O paths.]
MTTF Formula

MTTF_pvfs = 1 / ( N/MTTF_D + N/MTTF_S )

MTTF_raidpvfs = 1 / ( N(n-1)MTTR/MTTF_D^2 + N/MTTF_S )

MTTF_rpvfs = (N * MTTF_pvfs)^2 / ( N(n-1)MTTR )

where N is the number of IO nodes, n = N/G is the size of each of the G parity groups, MTTF_D and MTTF_S are the MTTFs of a single disk and a single server, and MTTR is the mean time to repair.
Examples of MTTF
Assumptions:
- MTTF_D is no less than 100,000 hours (around 10 years)
- MTTF_S is 10,000 hours (around 1 year)
- MTTR is usually shorter than 4 hours
- Node number is 16

System    | MTTF (hours) | Group Size
PVFS      | 528          | -
PVFSraid  | 624          | 1
RPVFS     | 86,088       | 1
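As a rough check, a small C program evaluating the formulas as reconstructed above, under these assumptions. This is a sketch, not the authors' exact computation; the PVFSraid and RPVFS rows come out as in the table, while the slide's PVFS row differs slightly:

```c
#include <stdio.h>

int main(void)
{
    double mttf_d = 100000.0; /* disk MTTF, hours     */
    double mttf_s = 10000.0;  /* server MTTF, hours   */
    double mttr   = 4.0;      /* repair time, hours   */
    double N = 16.0, G = 1.0; /* nodes, parity groups */
    double n = N / G;         /* group size           */

    double pvfs  = 1.0 / (N / mttf_d + N / mttf_s);
    double raid  = 1.0 / (N * (n - 1) * mttr / (mttf_d * mttf_d)
                          + N / mttf_s);
    double rpvfs = (N * pvfs) * (N * pvfs) / (N * (n - 1) * mttr);

    printf("PVFS:     %.0f hours\n", pvfs);   /* ~568   */
    printf("PVFSraid: %.0f hours\n", raid);   /* ~625   */
    printf("RPVFS:    %.0f hours\n", rpvfs);  /* ~86088 */
    return 0;
}
```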
MTTF Result
[Plot: MTTF in hours (10 to 1x10^6, log scale) versus number of nodes (16 to 256) for the three schemes.]
Overhead of Using Parity
- Read: not involved in the process of parity construction
- Read-Modify-Write: some blocks are dirtied; needs 2 reads and 2 writes (see the sketch below)
- Write: the whole striping units are overwritten; 1 read and 2 writes
[Chart: relative cost of Read, Read-Modify-Write (XOR), and Write (XOR).]
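A hedged C sketch of the read-modify-write path, showing where the 2 reads and 2 writes come from; read_block() and write_block() are hypothetical I/O helpers and BLK is an assumed block size:

```c
#include <stddef.h>
#include <stdint.h>

#define BLK 1024  /* assumed block size */

/* Hypothetical node-level I/O helpers. */
void read_block(int node, uint64_t off, char *buf);
void write_block(int node, uint64_t off, const char *buf);

/* Update one dirty block of a stripe: 2 reads + 2 writes. */
void rmw_block(int data_node, int parity_node,
               uint64_t off, const char *new_data)
{
    char old_data[BLK], parity[BLK];

    read_block(data_node, off, old_data);    /* read 1: old data   */
    read_block(parity_node, off, parity);    /* read 2: old parity */

    /* new_parity = old_parity ^ old_data ^ new_data */
    for (size_t i = 0; i < BLK; i++)
        parity[i] ^= old_data[i] ^ new_data[i];

    write_block(data_node, off, new_data);   /* write 1: new data   */
    write_block(parity_node, off, parity);   /* write 2: new parity */
}
```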
System Architecture
[Diagram: the metadata server (MGR plus SIOD) holds the metadata, its real file data, and the parity cache table; a mirrored metadata server (MSIOD) holds the mirrored metadata; IO nodes running IODs each hold real file data; all exchange data messages.]
Parity Cache Table (1/3)
A pinned down memory region within the metadata node 4K entry each entry contain N data blocks plus a inode number tag
and a reference count Can aggregate the written block to reduce the
number of parity written We delay the writing and generating of parity block
Several blocks near by can be combined in a single write
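A minimal C sketch of what one cache entry might look like; the field names, block size, and group size are assumptions, not the authors' actual layout:

```c
#include <stdint.h>

#define PCT_ENTRIES 4096   /* 4K entries, per the slide        */
#define PCT_BLKSIZE 1024   /* assumed 1 KB striping unit       */
#define PCT_N       16     /* assumed blocks (nodes) per entry */

/* One parity-cache entry: N delayed data blocks plus an
 * inode-number tag and a reference count. */
struct pct_entry {
    uint64_t inode;                       /* inode-number tag */
    int      refcount;                    /* blocks buffered  */
    char     blocks[PCT_N][PCT_BLKSIZE];  /* delayed blocks   */
};

static struct pct_entry pct[PCT_ENTRIES]; /* pinned-down region */
```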
Parity Cache Table (2/3)
[Diagram: the 4096 table entries, each holding blocks 1..N tagged with an inode number.]
Hashing: entry index = (offset / 1024) % 4096
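The same hashing rule as a one-line C helper, assuming the 1 KB block size and the 4096-entry table above:

```c
#include <stdint.h>

/* Map a file offset to a parity-cache slot: block number
 * (offset / 1024) modulo the table size (4096). */
static inline unsigned pct_hash(uint64_t offset)
{
    return (unsigned)((offset / 1024) % 4096);
}
```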
Parity Cache Table (3/3)
When to write back the cache ? Replacement
Choosing the bigger {N,N} Ready
When all the blocks needed to compute a parity block is ready
Flush A routine like bdflush runs every 30 secs
Potential data loss ? On average 15 secs
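Continuing the pct_entry sketch above, a hedged illustration of the "ready" write-back rule: once all N blocks of a stripe are present, XOR them into a parity block and flush it. pct_write_parity() is a hypothetical I/O routine:

```c
#include <string.h>

/* Hypothetical routine that writes the parity block to disk. */
void pct_write_parity(uint64_t inode, const char *parity);

/* Write back when every block needed for the parity is ready. */
static void pct_try_writeback(struct pct_entry *e)
{
    if (e->refcount < PCT_N)
        return;                    /* stripe not complete yet */

    char parity[PCT_BLKSIZE];
    memset(parity, 0, sizeof parity);
    for (int i = 0; i < PCT_N; i++)
        for (int j = 0; j < PCT_BLKSIZE; j++)
            parity[j] ^= e->blocks[i][j];

    pct_write_parity(e->inode, parity);
    e->refcount = 0;               /* entry can be reused */
}
```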
Write Performance
[Surface plots, two panels: write throughput (0 to 18,000 and 0 to 30,000) versus file size (64 to 1,048,576) and record size (4 to 16,384).]
Read-Modify-Write Performance
[Surface plots, two panels: read-modify-write throughput (0 to 18,000 and 0 to 35,000) versus file size (64 to 1,048,576) and record size (4 to 16,384).]
Mirrored Parity Scheme (1/3)
RAID-1 Can not tolerate two
faults in the same mirrored group
For different groups 3 faults can be tolerated
Disk overhead is 100% RAID-4 (RAID-5)
Only can tolerate a single fault
Disk overhead always less than 33.3%
D0 D1 D2
D0' D1' D2'
Mirrored Disks
D0 D1 D2 P1
RAID 4
Mirrored Parity Scheme (2/3)
Can tolerate faults occurred in the same mirrored group D1 、 P12 faults
D0、 P01 faults
Can tolerate at most 3 faults,
except one case D1、 P12、 D0 all faults
The concept of grouping disappeared
Use the same disk overhead as Raid-1
D0 D1 D2
P01 P12 P012
Mirrored Parity Disks
1 0 01D D P
0 12 012D P P
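A self-contained C demo of the recovery identities above, using toy 8-byte blocks (the block contents are made up for the example):

```c
#include <stdio.h>
#include <string.h>

#define BLK 8  /* toy block size for the demo */

/* XOR two blocks into dst: dst = a ^ b */
static void xor_blocks(char *dst, const char *a, const char *b)
{
    for (int i = 0; i < BLK; i++)
        dst[i] = a[i] ^ b[i];
}

int main(void)
{
    char d0[BLK] = "AAAAAAA", d1[BLK] = "BBBBBBB", d2[BLK] = "CCCCCCC";
    char p01[BLK], p12[BLK], p012[BLK], rec[BLK];

    xor_blocks(p01, d0, d1);    /* P01  = D0 ^ D1      */
    xor_blocks(p12, d1, d2);    /* P12  = D1 ^ D2      */
    xor_blocks(p012, p01, d2);  /* P012 = D0 ^ D1 ^ D2 */

    /* Lose D0 and P01 together: recover D0 from P12 and P012. */
    xor_blocks(rec, p12, p012);
    printf("recovered D0: %s\n",
           memcmp(rec, d0, BLK) == 0 ? "ok" : "mismatch");
    return 0;
}
```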
Mirrored Parity Scheme (3/3)
Pro MTTF is higher Can tolerate more simultaneous fault when
compared with RAID-1 With the same disk overhead
Con Need at most N XOR operations to recovery the
corrupted data N is the nodes involved in a parity group XOR is a cheap operation, but read 3 blocks may be a
problem
Separate metadata cache
- Accessing metadata is a serialized process: there is only 1 metadata server with 1 disk
- Separate the metadata cache from the real-data cache, either on clients or on servers
- If placed on clients, a cache hit saves a socket connection
Distributed metadata
- Requires handling of the parity cache table: parity information must also be distributed
- Block-based parity needs to be modified