Modularized Redundant Parallel Virtual System
Sheng-Kai Hung, HPCC Lab
Date posted: 23-Jan-2016
Page 1: Modularized Redundant Parallel Virtual System Sheng-Kai Hung HPCC Lab.

Modularized Redundant Parallel Virtual System

Sheng-Kai Hung

HPCC Lab

Page 2: Parallel Virtual File System Overview

- Developed by Clemson University
- Uses RAID-0-like striping to distribute files
- Claims high read/write performance
- Based on a TCP/IP server-client model
- Centralized metadata server
- POSIX and MPI-IO compliant
- NO fault-tolerance mechanism provided
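The RAID-0-like striping mentioned above can be illustrated with a toy block-placement function. This is a sketch, not PVFS's actual code; the stripe-unit size and node count are illustrative parameters.

```python
def locate_block(offset: int, stripe_unit: int, n_nodes: int):
    """Map a byte offset to (io_node, local_index) under RAID-0-style
    striping: consecutive stripe units go to consecutive IO nodes,
    round-robin."""
    block = offset // stripe_unit      # global stripe-unit index
    node = block % n_nodes             # round-robin across IO nodes
    local_index = block // n_nodes     # position within that node's file
    return node, local_index

# With 4 IO nodes and 64 KB stripe units, offset 256 KB wraps back to node 0.
print(locate_block(262144, 65536, 4))  # (0, 1)
```

Because blocks are spread across all nodes with no redundancy, losing any one IO node loses part of every large file, which is why the lack of fault tolerance matters.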

Page 3: Our Previous Design

- Parity information is stored at the metadata server: a single point of failure
- Read/write performance: a "delayed write" is used to reduce the parity overhead
- A buffer stores the differences of the blocks being written
- Reading the corresponding blocks is also required

[Diagrams: metadata server (MGR) with SIO/SHM interface exchanging data messages with IO nodes, each running an IOD]

Page 4: MTTF Formula

MTTF_pvfs = 1 / (N_D/MTTF_D + N_S/MTTF_S)

MTTF_pvfsraid = 1 / ( N(G-1)·MTTR / MTTF_D² + N_S/MTTF_S )

MTTF_rpvfs = MTTF_pvfs² / (G·MTTR)

where MTTF_D and MTTF_S are the MTTFs of a disk and a server, N is the node count, G the parity-group size, and MTTR the mean time to repair.

Page 5: Examples of MTTF

Assumptions:
- MTTF_D is no less than 100,000 hours (about 11 years)
- MTTF_S is 10,000 hours (about 14 months)
- MTTR is usually shorter than 4 hours
- Node number is 16

System    | MTTF (hours) | Group Size
PVFS      | 528          | -
PVFSraid  | 624          | 1
RPVFS     | 86,088       | 1
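Under the usual series-system reliability model (the file system fails as soon as any component fails, with exponentially distributed failures), combined MTTF is the reciprocal of the summed failure rates. The sketch below is illustrative only: with just 16 disks and one metadata server it yields about 3,846 hours, not the table's 528, so the slide's 528-hour figure presumably counts additional per-node components not stated here.

```python
def mttf_series(parts):
    """MTTF of a series system: failure rates add, so the combined
    MTTF is the reciprocal of the sum of reciprocal MTTFs.
    `parts` is a list of (count, mttf_hours) pairs."""
    rate = sum(n / mttf for n, mttf in parts)
    return 1.0 / rate

# 16 IO-node disks (100,000 h each) plus one metadata server (10,000 h):
print(round(mttf_series([(16, 100_000), (1, 10_000)])))  # 3846
```

The qualitative point survives any parameter choice: adding nodes to an unprotected stripe only adds failure rate, while redundancy (the RPVFS row) multiplies MTTF by orders of magnitude.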

Page 6: MTTF Result

[Plot: MTTF in hours (log scale, 10×10⁰ to 1×10⁶) versus number of nodes (16 to 256), for the three schemes]

Page 7: Overhead of Using Parity

- Read: not involved in the process of parity construction
- Read-Modify-Write: some blocks in the stripe are dirtied; needs 2 reads and 2 writes
- Write: the whole stripe's units are overwritten; needs 1 read and 2 writes

[Figure: Read, Read-Modify-Write (XOR), and Write (XOR) paths]
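The read-modify-write cost above comes from the standard small-write parity update: changing one block requires reading the old data and old parity, then XORing both with the new data. A minimal sketch using `bytes` for blocks (function names are mine, not the implementation's):

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Small-write parity update: new_parity = old_parity ^ old_data ^ new_data.
    Costs 2 reads (old data, old parity) and 2 writes (new data, new parity)."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# Sanity check: updating one unit via RMW matches recomputing parity from scratch.
d0, d1, d2 = b"\x0f", b"\xf0", b"\xff"
parity = xor_blocks(xor_blocks(d0, d1), d2)
new_d1 = b"\x33"
assert rmw_parity(d1, new_d1, parity) == xor_blocks(xor_blocks(d0, new_d1), d2)
```

When the whole stripe is overwritten, the old data contributions cancel out entirely, so parity can be computed from the new units alone, which is why the full-stripe write path is cheaper.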

Page 8: System Architecture

[Diagram: metadata server (MGR, SIOD, parity cache table) with its metadata mirrored to a second metadata server (MSIOD); each IO node runs an IOD and stores the real files; data messages flow between the servers and the IO nodes]

Page 9: Parity Cache Table (1/3)

- A pinned-down memory region within the metadata node
- 4K entries; each entry contains N data blocks plus an inode-number tag and a reference count
- Aggregates written blocks to reduce the number of parity writes
- Writing and generating the parity block are delayed
- Several nearby blocks can be combined into a single write

Page 10: Parity Cache Table (2/3)

[Diagram: 4096 entries, each holding N block slots (1, 2, ..., N) tagged with an inode #]

Hashing: entry index = (offset / 1024) % 4096
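The structure on these two slides can be sketched as follows, assuming 1 KB blocks, 4096 entries, and the hash shown above; the class and field names are mine, and the eviction path is simplified (a real design would write back the displaced entry's parity).

```python
NUM_ENTRIES = 4096
BLOCK_SIZE = 1024

class CacheEntry:
    """One parity-cache slot: N data-block slots plus an inode-number
    tag and a reference count of filled slots."""
    def __init__(self, n_blocks):
        self.inode = None
        self.refcount = 0
        self.blocks = [None] * n_blocks

class ParityCacheTable:
    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.entries = [CacheEntry(n_blocks) for _ in range(NUM_ENTRIES)]

    @staticmethod
    def index(offset):
        # Hash from the slide: (offset / 1024) % 4096
        return (offset // BLOCK_SIZE) % NUM_ENTRIES

    def insert(self, inode, offset, slot, data):
        """Buffer a written block; returns True when the entry is
        'ready', i.e. all N blocks of the stripe are present and the
        parity block can be computed in one pass."""
        e = self.entries[self.index(offset)]
        if e.inode != inode:   # collision: evict (write-back omitted here)
            e.inode, e.refcount, e.blocks = inode, 0, [None] * self.n_blocks
        if e.blocks[slot] is None:
            e.refcount += 1
        e.blocks[slot] = data
        return e.refcount == self.n_blocks
```

Aggregating nearby writes in an entry is what lets the server emit one parity write per stripe instead of one per dirtied block.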

Page 11: Parity Cache Table (3/3)

When to write back the cache?
- Replacement: choosing the bigger {N, N}
- Ready: when all the blocks needed to compute a parity block are ready
- Flush: a routine like bdflush runs every 30 secs
  - Potential data loss? On average 15 secs
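The three write-back triggers above can be sketched as one decision routine. The 30-second period is from the slide; the function shape and names are assumptions of mine.

```python
FLUSH_PERIOD = 30  # seconds, like bdflush

def should_write_back(entry_full, collision, seconds_since_flush):
    """Decide why a parity-cache entry is written back:
    - replacement: a hash collision forces the old entry out;
    - ready: all N blocks of the stripe are present, so parity is computable;
    - flush: the periodic routine runs every FLUSH_PERIOD seconds, so an
      unflushed block is exposed for at most 30 s (15 s on average)."""
    if collision:
        return "replacement"
    if entry_full:
        return "ready"
    if seconds_since_flush >= FLUSH_PERIOD:
        return "flush"
    return None
```

The 15-second average exposure follows directly from the flush period: a write lands, on average, halfway through a 30-second window.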

Page 12: Write Performance

[Plots: write performance for request sizes 64 to 1,048,576 and record sizes 4 to 16,384; vertical scales 0–18,000 and 0–30,000]

Page 13: Read-Modify-Write Performance

[Plots: read-modify-write performance for request sizes 64 to 1,048,576 and record sizes 4 to 16,384; vertical scales 0–18,000 and 0–35,000]

Page 14: Mirrored Parity Scheme (1/3)

- RAID-1
  - Cannot tolerate two faults in the same mirrored group
  - Across different groups, 3 faults can be tolerated
  - Disk overhead is 100%
- RAID-4 (RAID-5)
  - Can only tolerate a single fault
  - Disk overhead is always less than 33.3%

[Diagram: mirrored disks D0 D1 D2 / D0' D1' D2'; RAID 4: D0 D1 D2 P1]

Page 15: Mirrored Parity Scheme (2/3)

- Can tolerate faults that occur in the same mirrored group
  - e.g. D1 and P12 both fail, or D0 and P01 both fail
- Can tolerate at most 3 faults, except one case: D1, P12, and D0 all fail
- The concept of grouping disappears
- Uses the same disk overhead as RAID-1

[Diagram: data disks D0 D1 D2; mirrored parity disks P01 P12 P012]

Recovery: D1 = D0 ⊕ P01, D0 = P12 ⊕ P012
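The recovery identities above follow from XOR algebra: with P01 = D0⊕D1, P12 = D1⊕D2, and P012 = D0⊕D1⊕D2, the D1 terms cancel in D0⊕P01, and the D1, D2 terms cancel in P12⊕P012. A quick check:

```python
def xor(*blocks):
    """XOR an arbitrary number of equal-length byte blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

d0, d1, d2 = b"\x12", b"\x34", b"\x56"
p01, p12, p012 = xor(d0, d1), xor(d1, d2), xor(d0, d1, d2)

assert xor(d0, p01) == d1     # D1 = D0 ^ P01
assert xor(p12, p012) == d0   # D0 = P12 ^ P012
```

The one unrecoverable triple on the slide (D1, P12, D0 all failed) corresponds to the case where every surviving combination of blocks still contains an unknown term.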

Page 16: Mirrored Parity Scheme (3/3)

- Pro
  - MTTF is higher
  - Can tolerate more simultaneous faults than RAID-1, with the same disk overhead
- Con
  - Needs at most N XOR operations to recover the corrupted data, where N is the number of nodes in a parity group
  - XOR is a cheap operation, but reading the 3 blocks may be a problem

Page 17: Future Work

- Separate metadata cache
  - Accessing metadata is a serialized process: there is only a single metadata server with one disk
  - Separate the metadata cache from the real-data cache, either on clients or on servers
  - If on clients, a socket connection is saved on a cache hit
- Distributed metadata
  - Requires handling of the parity cache table; parity information must also be distributed
  - Block-based parity needs to be modified
