Modularized Redundant Parallel Virtual System
Sheng-Kai Hung, HPCC Lab
Date posted: 23-Jan-2016
Page 1: Modularized Redundant Parallel Virtual System Sheng-Kai Hung HPCC Lab.

Modularized Redundant Parallel Virtual System

Sheng-Kai Hung

HPCC Lab

Page 2: Parallel Virtual File System Overview

- Developed by Clemson University
- Uses RAID-0-like striping to distribute files
- Claims high read/write performance
- Based on a TCP/IP server-client model
- Centralized metadata server
- POSIX and MPI-IO compliant
- NO fault-tolerance mechanism provided
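The RAID-0-like striping mentioned above can be illustrated with a toy block-placement function. This is a sketch, not PVFS's actual code; the stripe-unit size and node count are illustrative parameters.

```python
def locate_block(offset: int, stripe_unit: int, n_nodes: int):
    """Map a byte offset to (io_node, local_index) under RAID-0-style
    striping: consecutive stripe units go to consecutive IO nodes,
    round-robin."""
    block = offset // stripe_unit      # global stripe-unit index
    node = block % n_nodes             # round-robin across IO nodes
    local_index = block // n_nodes     # position within that node's file
    return node, local_index

# With 4 IO nodes and 64 KB stripe units, offset 256 KB wraps back to node 0.
print(locate_block(262144, 65536, 4))  # (0, 1)
```

Because blocks are spread across all nodes with no redundancy, losing any one IO node loses part of every large file, which is why the lack of fault tolerance matters.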

Page 3: Our Previous Design

- Parity information is stored at the metadata server: a single point of failure
- Read/write performance: a "delayed write" is used to reduce the parity overhead
- A buffer stores the differences of the blocks being written
- Reading the corresponding blocks is also required

[Diagrams: metadata server (MGR) with SIO/SHM interface exchanging data messages with IO nodes, each running an IOD]

Page 4: MTTF Formula

MTTF_pvfs = 1 / (N_D/MTTF_D + N_S/MTTF_S)

MTTF_pvfsraid = 1 / ( N(G-1)·MTTR / MTTF_D² + N_S/MTTF_S )

MTTF_rpvfs = MTTF_pvfs² / (G·MTTR)

where MTTF_D and MTTF_S are the MTTFs of a disk and a server, N is the node count, G the parity-group size, and MTTR the mean time to repair.

Page 5: Examples of MTTF

Assumptions:
- MTTF_D is no less than 100,000 hours (about 11 years)
- MTTF_S is 10,000 hours (about 14 months)
- MTTR is usually shorter than 4 hours
- Node number is 16

System    | MTTF (hours) | Group Size
PVFS      | 528          | -
PVFSraid  | 624          | 1
RPVFS     | 86,088       | 1
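Under the usual series-system reliability model (the file system fails as soon as any component fails, with exponentially distributed failures), combined MTTF is the reciprocal of the summed failure rates. The sketch below is illustrative only: with just 16 disks and one metadata server it yields about 3,846 hours, not the table's 528, so the slide's 528-hour figure presumably counts additional per-node components not stated here.

```python
def mttf_series(parts):
    """MTTF of a series system: failure rates add, so the combined
    MTTF is the reciprocal of the sum of reciprocal MTTFs.
    `parts` is a list of (count, mttf_hours) pairs."""
    rate = sum(n / mttf for n, mttf in parts)
    return 1.0 / rate

# 16 IO-node disks (100,000 h each) plus one metadata server (10,000 h):
print(round(mttf_series([(16, 100_000), (1, 10_000)])))  # 3846
```

The qualitative point survives any parameter choice: adding nodes to an unprotected stripe only adds failure rate, while redundancy (the RPVFS row) multiplies MTTF by orders of magnitude.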

Page 6: MTTF Result

[Plot: MTTF in hours (log scale, 10×10⁰ to 1×10⁶) versus number of nodes (16 to 256), for the three schemes]

Page 7: Overhead of Using Parity

- Read: not involved in the process of parity construction
- Read-Modify-Write: some blocks in the stripe are dirtied; needs 2 reads and 2 writes
- Write: the whole stripe's units are overwritten; needs 1 read and 2 writes

[Figure: Read, Read-Modify-Write (XOR), and Write (XOR) paths]
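The read-modify-write cost above comes from the standard small-write parity update: changing one block requires reading the old data and old parity, then XORing both with the new data. A minimal sketch using `bytes` for blocks (function names are mine, not the implementation's):

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Small-write parity update: new_parity = old_parity ^ old_data ^ new_data.
    Costs 2 reads (old data, old parity) and 2 writes (new data, new parity)."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# Sanity check: updating one unit via RMW matches recomputing parity from scratch.
d0, d1, d2 = b"\x0f", b"\xf0", b"\xff"
parity = xor_blocks(xor_blocks(d0, d1), d2)
new_d1 = b"\x33"
assert rmw_parity(d1, new_d1, parity) == xor_blocks(xor_blocks(d0, new_d1), d2)
```

When the whole stripe is overwritten, the old data contributions cancel out entirely, so parity can be computed from the new units alone, which is why the full-stripe write path is cheaper.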

Page 8: System Architecture

[Diagram: metadata server (MGR, SIOD, parity cache table) with its metadata mirrored to a second metadata server (MSIOD); each IO node runs an IOD and stores the real files; data messages flow between the servers and the IO nodes]

Page 9: Parity Cache Table (1/3)

- A pinned-down memory region within the metadata node
- 4K entries; each entry contains N data blocks plus an inode-number tag and a reference count
- Aggregates written blocks to reduce the number of parity writes
- Writing and generating the parity block are delayed
- Several nearby blocks can be combined into a single write

Page 10: Parity Cache Table (2/3)

[Diagram: 4096 entries, each holding N block slots (1, 2, ..., N) tagged with an inode #]

Hashing: entry index = (offset / 1024) % 4096
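The structure on these two slides can be sketched as follows, assuming 1 KB blocks, 4096 entries, and the hash shown above; the class and field names are mine, and the eviction path is simplified (a real design would write back the displaced entry's parity).

```python
NUM_ENTRIES = 4096
BLOCK_SIZE = 1024

class CacheEntry:
    """One parity-cache slot: N data-block slots plus an inode-number
    tag and a reference count of filled slots."""
    def __init__(self, n_blocks):
        self.inode = None
        self.refcount = 0
        self.blocks = [None] * n_blocks

class ParityCacheTable:
    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.entries = [CacheEntry(n_blocks) for _ in range(NUM_ENTRIES)]

    @staticmethod
    def index(offset):
        # Hash from the slide: (offset / 1024) % 4096
        return (offset // BLOCK_SIZE) % NUM_ENTRIES

    def insert(self, inode, offset, slot, data):
        """Buffer a written block; returns True when the entry is
        'ready', i.e. all N blocks of the stripe are present and the
        parity block can be computed in one pass."""
        e = self.entries[self.index(offset)]
        if e.inode != inode:   # collision: evict (write-back omitted here)
            e.inode, e.refcount, e.blocks = inode, 0, [None] * self.n_blocks
        if e.blocks[slot] is None:
            e.refcount += 1
        e.blocks[slot] = data
        return e.refcount == self.n_blocks
```

Aggregating nearby writes in an entry is what lets the server emit one parity write per stripe instead of one per dirtied block.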

Page 11: Parity Cache Table (3/3)

When to write back the cache?
- Replacement: choosing the bigger {N, N}
- Ready: when all the blocks needed to compute a parity block are ready
- Flush: a routine like bdflush runs every 30 secs
  - Potential data loss? On average 15 secs
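The three write-back triggers above can be sketched as one decision routine. The 30-second period is from the slide; the function shape and names are assumptions of mine.

```python
FLUSH_PERIOD = 30  # seconds, like bdflush

def should_write_back(entry_full, collision, seconds_since_flush):
    """Decide why a parity-cache entry is written back:
    - replacement: a hash collision forces the old entry out;
    - ready: all N blocks of the stripe are present, so parity is computable;
    - flush: the periodic routine runs every FLUSH_PERIOD seconds, so an
      unflushed block is exposed for at most 30 s (15 s on average)."""
    if collision:
        return "replacement"
    if entry_full:
        return "ready"
    if seconds_since_flush >= FLUSH_PERIOD:
        return "flush"
    return None
```

The 15-second average exposure follows directly from the flush period: a write lands, on average, halfway through a 30-second window.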

Page 12: Write Performance

[Plots: write performance for request sizes 64 to 1,048,576 and record sizes 4 to 16,384; vertical scales 0–18,000 and 0–30,000]

Page 13: Read-Modify-Write Performance

[Plots: read-modify-write performance for request sizes 64 to 1,048,576 and record sizes 4 to 16,384; vertical scales 0–18,000 and 0–35,000]

Page 14: Mirrored Parity Scheme (1/3)

- RAID-1
  - Cannot tolerate two faults in the same mirrored group
  - Across different groups, 3 faults can be tolerated
  - Disk overhead is 100%
- RAID-4 (RAID-5)
  - Can only tolerate a single fault
  - Disk overhead is always less than 33.3%

[Diagram: mirrored disks D0 D1 D2 / D0' D1' D2'; RAID 4: D0 D1 D2 P1]

Page 15: Mirrored Parity Scheme (2/3)

- Can tolerate faults that occur in the same mirrored group
  - e.g. D1 and P12 both fail, or D0 and P01 both fail
- Can tolerate at most 3 faults, except one case: D1, P12, and D0 all fail
- The concept of grouping disappears
- Uses the same disk overhead as RAID-1

[Diagram: data disks D0 D1 D2; mirrored parity disks P01 P12 P012]

Recovery: D1 = D0 ⊕ P01, D0 = P12 ⊕ P012
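The recovery identities above follow from XOR algebra: with P01 = D0⊕D1, P12 = D1⊕D2, and P012 = D0⊕D1⊕D2, the D1 terms cancel in D0⊕P01, and the D1, D2 terms cancel in P12⊕P012. A quick check:

```python
def xor(*blocks):
    """XOR an arbitrary number of equal-length byte blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

d0, d1, d2 = b"\x12", b"\x34", b"\x56"
p01, p12, p012 = xor(d0, d1), xor(d1, d2), xor(d0, d1, d2)

assert xor(d0, p01) == d1     # D1 = D0 ^ P01
assert xor(p12, p012) == d0   # D0 = P12 ^ P012
```

The one unrecoverable triple on the slide (D1, P12, D0 all failed) corresponds to the case where every surviving combination of blocks still contains an unknown term.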

Page 16: Mirrored Parity Scheme (3/3)

- Pro
  - MTTF is higher
  - Can tolerate more simultaneous faults than RAID-1, with the same disk overhead
- Con
  - Needs at most N XOR operations to recover the corrupted data, where N is the number of nodes in a parity group
  - XOR is a cheap operation, but reading the 3 blocks may be a problem

Page 17: Future Work

- Separate metadata cache
  - Accessing metadata is a serialized process: there is only a single metadata server with one disk
  - Separate the metadata cache from the real-data cache, either on clients or on servers
  - If on clients, a socket connection is saved on a cache hit
- Distributed metadata
  - Requires handling of the parity cache table; parity information must also be distributed
  - Block-based parity needs to be modified
