Modularized Redundant Parallel Virtual System
Sheng-Kai Hung
HPCC Lab
Parallel Virtual File System Overview
- Developed by Clemson University
- Uses RAID-0-like striping to distribute files (illustrated below)
- Claims high read/write performance
- Based on a TCP/IP client-server model
- Centralized metadata server
- POSIX and MPI-IO compliant
- No fault-tolerance mechanism provided
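As a concrete illustration of RAID-0-style striping, here is a minimal C sketch that maps a file offset to an IO node and a node-local offset. The stripe size, node count, and all names are assumptions for illustration, not PVFS's actual code:

```c
#include <stdint.h>

#define STRIPE_SIZE 65536   /* assumed 64 KB striping unit */
#define NUM_IONODES 16      /* assumed number of IO nodes  */

struct stripe_loc { int node; uint64_t local_off; };

/* Round-robin (RAID-0-style) placement: striping unit k of a
 * file lands on IO node k % NUM_IONODES. */
static struct stripe_loc locate(uint64_t offset)
{
    uint64_t unit = offset / STRIPE_SIZE;
    struct stripe_loc loc = {
        .node = (int)(unit % NUM_IONODES),
        .local_off = (unit / NUM_IONODES) * STRIPE_SIZE
                     + offset % STRIPE_SIZE,
    };
    return loc;
}
```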
Our Previous Design
- Parity information is stored at the metadata server, a single point of failure
- Read/write performance: "delayed write" is used to reduce the parity overhead (sketched below)
- A buffer stores the differences of the blocks being written
- Reading the corresponding blocks is also needed
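A minimal C sketch of the "delayed write" idea, assuming a parity block is the XOR of its data blocks; the buffer layout and function names are hypothetical:

```c
#include <stddef.h>

#define BLK 1024  /* assumed block size */

static char diff_buf[BLK];  /* accumulated difference of written blocks */

/* On each write, fold the XOR difference (old ^ new) into the buffer
 * instead of updating the parity block immediately. */
static void record_write(const char *old_blk, const char *new_blk)
{
    for (size_t i = 0; i < BLK; i++)
        diff_buf[i] ^= old_blk[i] ^ new_blk[i];
}

/* Later, a single read + write updates the parity:
 * new_parity = old_parity ^ diff_buf. */
static void apply_to_parity(char *parity)
{
    for (size_t i = 0; i < BLK; i++)
        parity[i] ^= diff_buf[i];
}
```

Because XOR is associative, several delayed writes fold into one parity update, which is how the buffer reduces the parity overhead.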
[Diagrams: two PVFS deployments. In each, a metadata server running MGR exchanges data messages with IO nodes, each running an IOD; the labels SIO and SHM mark the two I/O paths.]
MTTF Formula

MTTF_pvfs = 1 / ( N/MTTF_D + N/MTTF_S )

MTTF_raidpvfs = 1 / ( N(n-1)MTTR/MTTF_D^2 + N/MTTF_S )

MTTF_rpvfs = (N * MTTF_pvfs)^2 / ( N(n-1)MTTR )

where N is the number of IO nodes, n = N/G is the size of each of the G parity groups, MTTF_D and MTTF_S are the MTTFs of a single disk and a single server, and MTTR is the mean time to repair.
Examples of MTTF
Assumptions:
- MTTF_D is no less than 100,000 hours (around 10 years)
- MTTF_S is 10,000 hours (around 1 year)
- MTTR is usually shorter than 4 hours
- Node number is 16

System    | MTTF (hours) | Group Size
PVFS      | 528          | -
PVFSraid  | 624          | 1
RPVFS     | 86,088       | 1
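As a rough check, a small C program evaluating the formulas as reconstructed above, under these assumptions. This is a sketch, not the authors' exact computation; the PVFSraid and RPVFS rows come out as in the table, while the slide's PVFS row differs slightly:

```c
#include <stdio.h>

int main(void)
{
    double mttf_d = 100000.0; /* disk MTTF, hours     */
    double mttf_s = 10000.0;  /* server MTTF, hours   */
    double mttr   = 4.0;      /* repair time, hours   */
    double N = 16.0, G = 1.0; /* nodes, parity groups */
    double n = N / G;         /* group size           */

    double pvfs  = 1.0 / (N / mttf_d + N / mttf_s);
    double raid  = 1.0 / (N * (n - 1) * mttr / (mttf_d * mttf_d)
                          + N / mttf_s);
    double rpvfs = (N * pvfs) * (N * pvfs) / (N * (n - 1) * mttr);

    printf("PVFS:     %.0f hours\n", pvfs);   /* ~568   */
    printf("PVFSraid: %.0f hours\n", raid);   /* ~625   */
    printf("RPVFS:    %.0f hours\n", rpvfs);  /* ~86088 */
    return 0;
}
```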
MTTF Result
[Plot: MTTF in hours (10 to 1x10^6, log scale) versus number of nodes (16 to 256) for the three schemes.]
Overhead of Using Parity
- Read: not involved in the process of parity construction
- Read-Modify-Write: some blocks are dirtied; needs 2 reads and 2 writes (see the sketch below)
- Write: the whole striping units are overwritten; 1 read and 2 writes
[Chart: relative cost of Read, Read-Modify-Write (XOR), and Write (XOR).]
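A hedged C sketch of the read-modify-write path, showing where the 2 reads and 2 writes come from; read_block() and write_block() are hypothetical I/O helpers and BLK is an assumed block size:

```c
#include <stddef.h>
#include <stdint.h>

#define BLK 1024  /* assumed block size */

/* Hypothetical node-level I/O helpers. */
void read_block(int node, uint64_t off, char *buf);
void write_block(int node, uint64_t off, const char *buf);

/* Update one dirty block of a stripe: 2 reads + 2 writes. */
void rmw_block(int data_node, int parity_node,
               uint64_t off, const char *new_data)
{
    char old_data[BLK], parity[BLK];

    read_block(data_node, off, old_data);    /* read 1: old data   */
    read_block(parity_node, off, parity);    /* read 2: old parity */

    /* new_parity = old_parity ^ old_data ^ new_data */
    for (size_t i = 0; i < BLK; i++)
        parity[i] ^= old_data[i] ^ new_data[i];

    write_block(data_node, off, new_data);   /* write 1: new data   */
    write_block(parity_node, off, parity);   /* write 2: new parity */
}
```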
System Architecture
[Diagram: the metadata server (MGR plus SIOD) holds the metadata, its real file data, and the parity cache table; a mirrored metadata server (MSIOD) holds the mirrored metadata; IO nodes running IODs each hold real file data; all exchange data messages.]
Parity Cache Table (1/3)
A pinned down memory region within the metadata node 4K entry each entry contain N data blocks plus a inode number tag
and a reference count Can aggregate the written block to reduce the
number of parity written We delay the writing and generating of parity block
Several blocks near by can be combined in a single write
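A minimal C sketch of what one cache entry might look like; the field names, block size, and group size are assumptions, not the authors' actual layout:

```c
#include <stdint.h>

#define PCT_ENTRIES 4096   /* 4K entries, per the slide        */
#define PCT_BLKSIZE 1024   /* assumed 1 KB striping unit       */
#define PCT_N       16     /* assumed blocks (nodes) per entry */

/* One parity-cache entry: N delayed data blocks plus an
 * inode-number tag and a reference count. */
struct pct_entry {
    uint64_t inode;                       /* inode-number tag */
    int      refcount;                    /* blocks buffered  */
    char     blocks[PCT_N][PCT_BLKSIZE];  /* delayed blocks   */
};

static struct pct_entry pct[PCT_ENTRIES]; /* pinned-down region */
```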
Parity Cache Table (2/3)
[Diagram: the 4096 table entries, each holding blocks 1..N tagged with an inode number.]
Hashing: entry index = (offset / 1024) % 4096
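The same hashing rule as a one-line C helper, assuming the 1 KB block size and the 4096-entry table above:

```c
#include <stdint.h>

/* Map a file offset to a parity-cache slot: block number
 * (offset / 1024) modulo the table size (4096). */
static inline unsigned pct_hash(uint64_t offset)
{
    return (unsigned)((offset / 1024) % 4096);
}
```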
Parity Cache Table (3/3)
When to write back the cache ? Replacement
Choosing the bigger {N,N} Ready
When all the blocks needed to compute a parity block is ready
Flush A routine like bdflush runs every 30 secs
Potential data loss ? On average 15 secs
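Continuing the pct_entry sketch above, a hedged illustration of the "ready" write-back rule: once all N blocks of a stripe are present, XOR them into a parity block and flush it. pct_write_parity() is a hypothetical I/O routine:

```c
#include <string.h>

/* Hypothetical routine that writes the parity block to disk. */
void pct_write_parity(uint64_t inode, const char *parity);

/* Write back when every block needed for the parity is ready. */
static void pct_try_writeback(struct pct_entry *e)
{
    if (e->refcount < PCT_N)
        return;                    /* stripe not complete yet */

    char parity[PCT_BLKSIZE];
    memset(parity, 0, sizeof parity);
    for (int i = 0; i < PCT_N; i++)
        for (int j = 0; j < PCT_BLKSIZE; j++)
            parity[j] ^= e->blocks[i][j];

    pct_write_parity(e->inode, parity);
    e->refcount = 0;               /* entry can be reused */
}
```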
Write Performance
[Surface plots, two panels: write throughput (0 to 18,000 and 0 to 30,000) versus file size (64 to 1,048,576) and record size (4 to 16,384).]
Read-Modify-Write Performance
[Surface plots, two panels: read-modify-write throughput (0 to 18,000 and 0 to 35,000) versus file size (64 to 1,048,576) and record size (4 to 16,384).]
Mirrored Parity Scheme (1/3)
RAID-1 Can not tolerate two
faults in the same mirrored group
For different groups 3 faults can be tolerated
Disk overhead is 100% RAID-4 (RAID-5)
Only can tolerate a single fault
Disk overhead always less than 33.3%
D0 D1 D2
D0' D1' D2'
Mirrored Disks
D0 D1 D2 P1
RAID 4
Mirrored Parity Scheme (2/3)
Can tolerate faults occurred in the same mirrored group D1 、 P12 faults
D0、 P01 faults
Can tolerate at most 3 faults,
except one case D1、 P12、 D0 all faults
The concept of grouping disappeared
Use the same disk overhead as Raid-1
D0 D1 D2
P01 P12 P012
Mirrored Parity Disks
1 0 01D D P
0 12 012D P P
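A self-contained C demo of the recovery identities above, using toy 8-byte blocks (the block contents are made up for the example):

```c
#include <stdio.h>
#include <string.h>

#define BLK 8  /* toy block size for the demo */

/* XOR two blocks into dst: dst = a ^ b */
static void xor_blocks(char *dst, const char *a, const char *b)
{
    for (int i = 0; i < BLK; i++)
        dst[i] = a[i] ^ b[i];
}

int main(void)
{
    char d0[BLK] = "AAAAAAA", d1[BLK] = "BBBBBBB", d2[BLK] = "CCCCCCC";
    char p01[BLK], p12[BLK], p012[BLK], rec[BLK];

    xor_blocks(p01, d0, d1);    /* P01  = D0 ^ D1      */
    xor_blocks(p12, d1, d2);    /* P12  = D1 ^ D2      */
    xor_blocks(p012, p01, d2);  /* P012 = D0 ^ D1 ^ D2 */

    /* Lose D0 and P01 together: recover D0 from P12 and P012. */
    xor_blocks(rec, p12, p012);
    printf("recovered D0: %s\n",
           memcmp(rec, d0, BLK) == 0 ? "ok" : "mismatch");
    return 0;
}
```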
Mirrored Parity Scheme (3/3)
Pro MTTF is higher Can tolerate more simultaneous fault when
compared with RAID-1 With the same disk overhead
Con Need at most N XOR operations to recovery the
corrupted data N is the nodes involved in a parity group XOR is a cheap operation, but read 3 blocks may be a
problem
Separate metadata cache
- Accessing metadata is a serialized process: there is only 1 metadata server with 1 disk
- Separate the metadata cache from the real-data cache, either on clients or on servers
- If placed on clients, a cache hit saves a socket connection
Distributed metadata
- Requires handling of the parity cache table: parity information must also be distributed
- Block-based parity needs to be modified