A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes

Yunfeng Zhu¹, Patrick P. C. Lee², Liping Xiang¹, Yinlong Xu¹, Lingling Gao¹

¹ University of Science and Technology of China
² The Chinese University of Hong Kong

DSN’12

Fault Tolerance

Fault tolerance becomes more challenging in modern distributed storage systems
• Increase in scale
• Usage of inexpensive but less reliable storage nodes

Fault tolerance is ensured by introducing redundancy across storage nodes
• Replication
• Erasure codes (e.g., Reed-Solomon codes)

[Figure: replication (copies of A and B) versus erasure coding (A, B, A+B, A+2B)]

XOR-Based Erasure Codes

Encoding/decoding involve XOR operations only
• Low computational overhead

Different redundancy levels
• 2-fault tolerant: RDP, EVENODD, X-Code
• 3-fault tolerant: STAR
• General-fault tolerant: Cauchy Reed-Solomon (CRS)

Failure Recovery

Recovering node failures is necessary
• Preserve the required redundancy level
• Avoid data unavailability

Single-node failure recovery
• Single-node failure occurs more frequently than a concurrent multi-node failure

Example: Recovery in RDP

[Figure: an RDP code example with 8 nodes. Each node stores 6 symbols, d0,j to d5,j. Nodes 0 to 5 hold data, node 6 holds the row parities (XOR of each row), and node 7 holds the diagonal parities.]

Let’s say node 0 fails. How do we recover node 0?
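To make the figure concrete, here is a minimal Python sketch of RDP encoding and of the two ways a lost symbol of node 0 can be rebuilt. It assumes the usual RDP layout (prime p, p-1 rows, diagonal d containing the symbols with (row + node) mod p = d, and one diagonal left unstored); the function names are hypothetical and are not taken from the authors' source code.

```python
from functools import reduce


def xor(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rdp_encode(data, p):
    """data[i][j] is the symbol in row i of data node j (both in 0..p-2).

    Returns a stripe with p-1 rows and p+1 columns:
    columns 0..p-2 are data, column p-1 is row parity, column p is diagonal parity.
    """
    stripe = [list(row) + [xor(*row)] for row in data]
    for d in range(p - 1):
        # Diagonal d spans nodes 0..p-1; row p-1 is imaginary (all zeros) and skipped.
        members = [stripe[(d - j) % p][j] for j in range(p) if (d - j) % p != p - 1]
        stripe[d].append(xor(*members))
    return stripe


def recover_node0_by_rows(stripe, p):
    """Conventional recovery: rebuild every symbol of node 0 from its row parity set."""
    return [xor(*stripe[i][1:p]) for i in range(p - 1)]


def recover_symbol_by_diagonal(stripe, p, i):
    """Rebuild d(i,0) from its diagonal parity set (d(i,0) lies on diagonal i)."""
    members = [stripe[(i - j) % p][j] for j in range(1, p) if (i - j) % p != p - 1]
    return xor(stripe[i][p], *members)
```

With p = 7 the conventional path reads 6 symbols for each of the 6 lost symbols, which is the 36 reads quoted on the next slide; mixing the two functions per row is what the hybrid and CHR schemes exploit.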

Conventional Recovery

Idea: use only row parity sets. Recover each lost data symbol (i.e., data chunk) independently

[Figure: symbols read from node 0 to node 7 under conventional recovery]

Read symbols: 36

Then how do we recover node 0 efficiently?

Different metrics can be used to measure the efficiency of a recovery scheme

Minimize Number of Read Symbols

Idea: use a combination of row and diagonal parity sets to maximize overlapping symbols [Xiang, ToS’11]

[Figure: symbols read from node 0 to node 7 under hybrid recovery]

Read symbols: 27 (25% fewer than the 36 of conventional recovery)

Need A New Metric?

A modern storage system is naturally composed of heterogeneous types of storage nodes
• System upgrades
• New node addition

A heterogeneous environment

[Figure: a proxy connected to node 0 through node 7 and to a new node; link speeds shown are 26, 68, 109, 110, 113, 10, 110, and 86 Mbps]

Need a new efficient failure recovery solution for heterogeneous environments!

Related Work

Hybrid recovery
• Minimize number of read symbols for RAID-6 XOR-based erasure codes
• e.g., RDP [Xiang, ToS’11], EVENODD [Wang, Globecom’10]

Enumeration recovery [Khan, FAST’12]
• Enumerate all recovery possibilities to achieve optimal recovery for general XOR-based erasure codes

Greedy recovery [Zhu, MSST’12]
• Efficient search of recovery solutions for general XOR-based erasure codes

Regenerating codes [Dimakis, ToIT’10]
• Nodes encode data during recovery
• Minimize recovery bandwidth
• Heterogeneous case considered in [Li, Infocom’10], but requires node encoding and collaboration

Challenges

How to enable efficient failure recovery for heterogeneous settings?
• Minimizing # of read symbols → homogeneous settings
• Performance bottlenecked by poorly performing nodes

How to quickly find the recovery strategy?
• Minimizing # of read symbols → deterministic metric
• Minimizing general cost → non-deterministic metric → recovery decision typically can’t be pre-determined

Our Contributions

Target two RAID-6 codes: RDP and EVENODD
• XOR-based encoding operations

Goals:
• Minimize search time
• Minimize recovery cost

Cost-based single-node failure recovery for heterogeneous distributed storage systems

Our Contributions

Formulate an optimization problem for single-node failure recovery in heterogeneous settings

Propose a cost-based heterogeneous recovery (CHR) algorithm
• Narrow down search space
• Suitable for online recovery

Implement and experiment on a heterogeneous networked storage testbed

Model Formulation

[Figure: nodes V0, V1, …, Vk, …, Vp-1, Vp (Node 0 to Node p, with node k the failed node). Each surviving node Vi carries a weight wi and a download distribution entry yi.]

Our formulation:

Minimizing total recovery cost:

C = \sum_{i=0,\, i \neq k}^{p} w_i y_i

Physical Meanings

Choice of wi and the resulting meaning of C:
• wi = 1 for all i: C is the total number of symbols being read from surviving nodes
• wi = inverse of transmission bandwidth of node Vi: C is the total amount of transmission time to download symbols from surviving nodes
• wi = monetary cost of migrating per unit of data outbound from node Vi: C is the total monetary cost of migrating data from surviving nodes (or clouds)
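As a small illustration of the cost model with wi set to the inverse bandwidth, the sketch below evaluates C for the three download distributions used in the talk's running example (p = 7, node 0 failed). The pairing of nodes 1 to 7 with 68, 109, 110, 86, 110, 10, and 113 Mbps is read off the cost expressions on the Recovery Cost Comparison slide; the function and variable names are hypothetical.

```python
def recovery_cost(weights, downloads):
    """C = sum of w_i * y_i over the surviving nodes."""
    if len(weights) != len(downloads):
        raise ValueError("one weight per surviving node is required")
    return sum(w * y for w, y in zip(weights, downloads))


# Bandwidth (Mbps) of surviving nodes 1..7 in the talk's example.
bandwidth_mbps = [68, 109, 110, 86, 110, 10, 113]
weights = [1 / b for b in bandwidth_mbps]  # w_i = inverse of transmission bandwidth

conventional = [6, 6, 6, 6, 6, 6, 0]  # row parity only, diagonal parity node unused
hybrid = [5, 4, 3, 3, 4, 5, 3]        # min-read distribution of the hybrid approach
chr_choice = [3, 5, 4, 4, 5, 3, 3]    # min-read distribution chosen by CHR

cost_conventional = recovery_cost(weights, conventional)  # about 0.9221
cost_hybrid = recovery_cost(weights, hybrid)              # about 0.7353
cost_chr = recovery_cost(weights, chr_choice)             # about 0.5449
```

Hybrid and CHR both read 27 symbols; CHR is cheaper only because it reads fewer symbols from the 10 Mbps node.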

Solving the Model

Important: which symbols are fetched from surviving nodes must follow the inherent rules of the specific coding scheme

To solve the model, we introduce the recovery sequence (x0, x1, …, xp-2, 0)
– xi = 0: di,k is recovered from its row parity set
– xi = 1: di,k is recovered from its diagonal parity set

An example:

[Figure: an RDP stripe with 6 nodes (node 0 to node 5) and 4 rows, symbols d0,j to d3,j]

recovery sequence: (0, 0, 1, 1, 0)
download distribution: (3, 2, 2, 3, 2)

1) Each recovery sequence represents a feasible recovery solution;
2) Download distribution can be represented by recovery sequence;
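The mapping from a recovery sequence to a download distribution can be sketched by collecting the symbols each parity set needs and counting them per node. This is a sketch under two assumptions that the slide does not state outright: node 0 is the failed node (k = 0), and RDP diagonals follow (row + node) mod p. The function name is hypothetical.

```python
def download_distribution(p, seq):
    """Symbols read from nodes 1..p when node 0 of an RDP stripe fails.

    seq has p entries (the last one is always 0); seq[i] = 0 rebuilds row i from
    its row parity set, seq[i] = 1 rebuilds it from its diagonal parity set.
    """
    if len(seq) != p or seq[-1] != 0:
        raise ValueError("expected p entries with a trailing 0")
    needed = set()  # (row, node) pairs that must be read; overlaps are counted once
    for i in range(p - 1):
        if seq[i] == 0:
            # Row parity set: the same row on nodes 1..p-1 (node p-1 is row parity).
            needed.update((i, j) for j in range(1, p))
        else:
            # Diagonal i on nodes 1..p-1, skipping the imaginary row p-1 ...
            needed.update(((i - j) % p, j) for j in range(1, p) if (i - j) % p != p - 1)
            # ... plus the diagonal parity symbol stored on node p.
            needed.add((i, p))
    return tuple(sum(1 for _, node in needed if node == j) for j in range(1, p + 1))
```

Called as `download_distribution(5, (0, 0, 1, 1, 0))` it returns `(3, 2, 2, 3, 2)`, the distribution shown in the example above.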

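Solving the Model (2)

Step 1: use recovery sequence to represent downloads

y_j = (p-1) - \sum_{i=0}^{p-1} x_i + \sum_{i=0}^{p-1} x_i x_{\langle i+j-k \rangle_p}

Step 2: narrow down search space by only considering min-read recovery sequences (i.e., those that download the minimum number of symbols during recovery)

y_j = (p-1)/2 + \sum_{i=0}^{p-1} x_i x_{\langle i+j-k \rangle_p}

Step 3: reformulate the model as

Minimize \sum_{j \neq k} \sum_{i=0}^{p-1} x_i x_{\langle i+j-k \rangle_p} w_j

where \langle a \rangle_p denotes a mod p.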
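As a quick check on Step 2, the total number of symbols read can be written in terms of m, the number of 1-entries in the recovery sequence. The derivation below is a sketch: it takes the download expression y_j = (p-1) - \sum_i x_i + \sum_i x_i x_{\langle i+j-k \rangle_p} for the p-1 surviving data and row-parity nodes, and it assumes that the diagonal-parity node supplies one symbol per 1-entry (y_p = m). The second assumption matches the p = 5 example but is not spelled out on the slide.

```latex
\begin{aligned}
\sum_{j \neq k} \sum_{i=0}^{p-1} x_i x_{\langle i+j-k \rangle_p}
  &= \sum_{i=0}^{p-1} x_i (m - x_i) = m^2 - m,
  \qquad m = \sum_{i=0}^{p-1} x_i \\
\text{total reads}
  &= (p-1)(p-1-m) + (m^2 - m) + m
   = (p-1)^2 - (p-1)m + m^2 \\
\min_{m} \text{total reads}
  &= \tfrac{3}{4}(p-1)^2 \quad \text{at } m = \tfrac{p-1}{2}
\end{aligned}
```

For p = 7 this gives 27 reads at m = 3 against 36 reads at m = 0, matching the hybrid and conventional counts earlier in the talk. It also shows why min-read sequences are exactly those with (p-1)/2 entries equal to 1.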

Expensive Enumeration

p    Total # of recovery sequences   # of min-read recovery sequences   # of unique min-read recovery sequences
5    16                              6                                  2
7    64                              20                                 4
11   1024                            252                                26
13   4096                            924                                74
17   65536                           12870                              698
19   262144                          48620                              2338
23   4194304                         705432                             28216
29   268435456                       40116600                           1302688

Challenge: there are too many min-read recovery sequences to enumerate, even after we narrow down the search space

Observation: many min-read recovery sequences return the same download distribution
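The first rows of the table can be reproduced by brute force for small p. The sketch below assumes node 0 is the failed node and uses the closed-form download expression from Step 1 with k = 0; it interprets "unique" as the number of distinct download distributions, which is an assumption on my part. All names are hypothetical.

```python
from itertools import product


def distribution(p, x):
    """Download distribution for nodes 1..p, closed form with failed node k = 0."""
    ones = sum(x)
    per_node = [
        (p - 1) - ones + sum(x[i] * x[(i + j) % p] for i in range(p))
        for j in range(1, p)
    ]
    return tuple(per_node) + (ones,)  # the diagonal parity node supplies one symbol per 1-entry


def enumeration_stats(p):
    """Return (# recovery sequences, # min-read sequences, # distinct min-read distributions)."""
    sequences = [bits + (0,) for bits in product((0, 1), repeat=p - 1)]
    totals = {x: sum(distribution(p, x)) for x in sequences}
    fewest = min(totals.values())
    min_read = [x for x in sequences if totals[x] == fewest]
    distinct = {distribution(p, x) for x in min_read}
    return len(sequences), len(min_read), len(distinct)
```

`enumeration_stats(5)` returns `(16, 6, 2)` and `enumeration_stats(7)` returns `(64, 20, 4)`, in line with the first two rows. The brute force doubles in size with every increment of p, which is the cost the CHR algorithm tries to avoid.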

Optimize Enumeration Process

Two conditions under which different recovery sequences have the same download distribution:

Shift condition
(0, 0, 0, 1, 1, 1, 0), (0, 0, 1, 1, 1, 0, 0), (0, 1, 1, 1, 0, 0, 0), (1, 1, 1, 0, 0, 0, 0), …

Reverse condition
(0, 0, 0, 1, 1, 1, 0), (0, 1, 1, 1, 0, 0, 0)

Key idea: not all recovery sequences need to be enumerated (details in the paper)
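One way to generate the sequences covered by the two conditions is sketched below. It reads "shift" as a cyclic rotation of the full p-entry sequence and "reverse" as reversing that sequence, keeping only results whose last entry is 0. This reading fits the examples on the slide, but the paper's exact definitions may differ. The integer encoding (x0 as the least significant bit) is likewise inferred from the backup example, and both function names are hypothetical.

```python
def equivalents(seq):
    """Valid recovery sequences reachable from seq by cyclic shifts and reversal."""
    p = len(seq)
    found = set()
    for base in (tuple(seq), tuple(reversed(seq))):
        for shift in range(p):
            candidate = base[shift:] + base[:shift]
            if candidate[-1] == 0:  # a recovery sequence must end with 0
                found.add(candidate)
    return found


def to_index(seq):
    """Bitmap index of a sequence: x0 is the least significant bit, the trailing 0 is dropped."""
    return sum(bit << i for i, bit in enumerate(seq[:-1]))
```

For `(1, 1, 1, 0, 0, 0, 0)` the indices of the equivalent sequences are 7, 14, 28, and 56, the same bitmap entries that the backup example marks.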

Cost-based Heterogeneous Recovery (CHR) Algorithm: Intuition

Step 1: initialize a bitmap to track all possible min-read recovery sequences R

Step 2: compute recovery cost of R.

Step 3: mark all shifted and reverse sequences of R as being enumerated

Step 4: switch to another R; return the one with minimum cost
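The four steps can be sketched in Python as follows. This is not the authors' implementation: it assumes node 0 is the failed node, uses a set instead of a bitmap, enumerates min-read sequences with `itertools.combinations`, and treats shift and reverse as cyclic rotation and reversal of the full sequence. It also evaluates the all-zero (conventional) sequence at the end, as the backup slide's algorithm does. All names are hypothetical.

```python
from itertools import combinations


def distribution(p, x):
    """Download distribution for nodes 1..p, closed form with failed node k = 0."""
    ones = sum(x)
    per_node = [
        (p - 1) - ones + sum(x[i] * x[(i + j) % p] for i in range(p))
        for j in range(1, p)
    ]
    return tuple(per_node) + (ones,)


def equivalents(x):
    """Shifted and reversed versions of x that still end with 0."""
    p = len(x)
    rotations = (b[s:] + b[:s] for b in (x, x[::-1]) for s in range(p))
    return {c for c in rotations if c[-1] == 0}


def chr_search(p, weights):
    """weights[j-1] is the weight of surviving node j (1..p).

    Returns (best sequence, its cost, number of min-read sequences evaluated).
    """
    def cost(x):
        return sum(w * y for w, y in zip(weights, distribution(p, x)))

    seen = set()  # Step 1: nothing enumerated yet
    best_seq, best_cost, evaluated = None, float("inf"), 0
    for ones in combinations(range(p - 1), (p - 1) // 2):
        x = tuple(1 if i in ones else 0 for i in range(p))
        if x in seen:
            continue               # same download distribution as an earlier sequence
        seen |= equivalents(x)     # Step 3: mark shifted and reversed sequences
        evaluated += 1
        c = cost(x)                # Step 2: compute the recovery cost
        if c < best_cost:
            best_seq, best_cost = x, c
    conventional = (0,) * p        # final check of the row-parity-only sequence
    if cost(conventional) < best_cost:
        best_seq, best_cost = conventional, cost(conventional)
    return best_seq, best_cost, evaluated
```

With p = 7 and weights 1/68, 1/109, 1/110, 1/86, 1/110, 1/10, 1/113 for nodes 1 to 7, the search evaluates 4 of the 20 min-read sequences and returns `(1, 0, 1, 0, 1, 0, 0)` with a cost of about 0.5449.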


Example

[Figure: the heterogeneous environment again: a proxy connected to node 0 through node 7 and to a new node; link speeds shown are 26, 68, 109, 110, 113, 10, 110, and 86 Mbps]

Number of symbols read from node 1 to node 7 (node 0 is the failed node):
Our proposed CHR algorithm: 3 5 4 4 5 3 3
Hybrid approach [Xiang, ToS’11]: 5 4 3 3 4 5 3

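Recovery Cost Comparison

CHR approach:

C = \frac{3}{68} + \frac{5}{109} + \frac{4}{110} + \frac{4}{86} + \frac{5}{110} + \frac{3}{10} + \frac{3}{113} = 0.5449

Hybrid approach:

C = \frac{5}{68} + \frac{4}{109} + \frac{3}{110} + \frac{3}{86} + \frac{4}{110} + \frac{5}{10} + \frac{3}{113} = 0.7353

Conventional approach:

C = \frac{6}{68} + \frac{6}{109} + \frac{6}{110} + \frac{6}{86} + \frac{6}{110} + \frac{6}{10} = 0.9221

CHR reduces the recovery cost by 25.89% relative to the hybrid approach and by 40.91% relative to the conventional approach.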

Simulation Studies (1): Traverse Efficiency

Evaluate the computational time of CHR

p    Naive traverse time (ms)   CHR’s traverse time (ms)   Improved rate (%)
5    0.0220                     0.0100                     54.55
7    0.0950                     0.0310                     67.37
11   2.3160                     0.3910                     83.12
13   11.9840                    1.6150                     86.52
17   107.7410                   10.0790                    90.65
19   455.2760                   40.5370                    91.10
23   9230.7800                  691.2800                   92.51
29   752296.2700                45423.5570                 93.96

CHR significantly reduces the traverse time of the naive approach, by over 90% once p reaches 17!

Simulation Studies (2): Robustness Efficiency

Evaluate if CHR achieves the global optimal among all the 2^{p-1} feasible recovery sequences

p    Hit Global Optimal Probability (%)   Global Optimal Max Improvement (%)
5    94.9                                 6.12
7    94.5                                 5.54
11   93.6                                 5.98
13   93.2                                 6.46
17   92.8                                 5.97
19   93.1                                 5.73

CHR has a very high probability (92.8% or more for every p tested) of hitting the global optimal recovery cost!

Simulation Studies (3): Recovery Efficiency

Evaluate the recovery efficiency of CHR in a heterogeneous storage environment, via 100 runs for each p
• CHR can reduce recovery cost by up to 50% over the conventional approach
• CHR can reduce recovery cost by up to 30% over the hybrid approach

Experiments

Experiments on a networked storage testbed
• Conventional vs. Hybrid vs. CHR
• Default chunk size = 1MB
• Communication via ATA over Ethernet (AoE)
• Consider two codes: RDP and EVENODD
  • Only RDP results shown in this talk

Recovery operation:
• Read chunks from surviving nodes
• Reconstruct lost chunks
• Write reconstructed chunks to a new node

[Figure: recovery process; the nodes are connected through a Gigabit switch]

Experiments

The physical storage devices are equipped with two types of Ethernet interface card
• 100Mbps → set weight = 1/(100Mbps)
• 1Gbps → set weight = 1/(1Gbps)

Configuration for RDP code

p    Total # of nodes   # of nodes with 100Mbps   # of nodes with 1Gbps
5    6                  2                         4
7    8                  3                         5
11   12                 5                         7
13   14                 6                         8
17   18                 9                         9

Different Number of Storage Nodes

Total recovery time for RDP
• CHR improves conventional by 21-31%
• CHR improves hybrid by 15-20%

[Figure: bar chart of recovery time (in seconds) per MB versus p for Conventional, Hybrid, and CHR; the bar labels give CHR's improvement in %]

p                          5       7       11      13      17
CHR vs. hybrid (%)         16.47   17.11   15.28   18.1    20.78
CHR vs. conventional (%)   24.15   24.61   21.45   23.93   31.19

Different Chunk Size

Total recovery time for RDP (p = 11)
• CHR improves conventional by 18-26%
• CHR improves hybrid by 14-19%

[Figure: bar chart of recovery time (in seconds) per MB versus chunk size for Conventional, Hybrid, and CHR; the bar labels give CHR's improvement in %]

Chunk size                 256KB   512KB   1024KB   2048KB   4096KB
CHR vs. hybrid (%)         14.69   14.66   15.28    16.68    19.82
CHR vs. conventional (%)   18.32   22.02   21.45    23.76    25.66

Different Failed Nodes

Total recovery time for RDP (p = 11)
• CHR still outperforms conventional and hybrid

[Figure: bar chart of recovery time (in seconds) per MB versus failed node for Conventional, Hybrid, and CHR; the bar labels give CHR's improvement in %]

Failed node   CHR vs. hybrid (%)   CHR vs. conventional (%)
Node0         15.28                21.45
Node1         16.18                21.53
Node2         19.44                25.12
Node3         12.46                21.77
Node4         13.12                19.03
Node5         17.49                22.54
Node6         14.43                18.84
Node7         14.27                23.62
Node8         16.67                20.79
Node9         11.99                24.58
Node10        10.04                17.9

Conclusions

Address single-node failure recovery in RAID-6 coded heterogeneous storage systems

Formulate a computation-efficient optimization model

Propose a cost-based heterogeneous recovery algorithm

Validate the effectiveness of the CHR algorithm through extensive simulations and testbed experiments

Future work:
• Different cost formulations
• Extension for general XOR-based erasure codes
• Degraded reads

Source code:
• http://ansrlab.cse.cuhk.edu.hk/software/chr/

Backup

Cost-based Heterogeneous Recovery (CHR) Algorithm

Notation:
F: a bitmap that identifies if a min-read recovery sequence has been enumerated
R, C: a min-read recovery sequence with its recovery cost
R*, C*: the min-cost recovery sequence with the minimum total recovery cost

Algorithm:
1. Initialize F[0…2^{p-1}-1] with 0-bits; initialize R with (p-1)/2 1-bits followed by (p+1)/2 0-bits; initialize R* with R; initialize C* with MAX_VALUE
2. If R is null, then go to Step 4; convert R into integer value v; if R has already been enumerated, then go to Step 3; mark all the shifted and reverse recovery sequences of R as being enumerated; calculate the recovery cost C of R; update R* and C* if necessary
3. Get the next min-read recovery sequence R and go to Step 2
4. Finally, initialize R with all 0-bits; calculate the recovery cost C of R; update R* and C* if necessary

Example

[Figure: the heterogeneous environment again: a proxy connected to node 0 through node 7 and to a new node; link speeds shown are 26, 68, 109, 110, 113, 10, 110, and 86 Mbps]

Step 1: Initialize F[0..63] with 0-bits, R = {1110000}, the recovery cost C* = MAX_VALUE

Step 2: F[7]=1, mark R’s shifted and reverse recovery sequences: F[56]=F[28]=F[14]=1; calculate the recovery cost for R, C will be 0.7353α; R*, C* will be updated by R, C

Step 3: Get the next min-read recovery sequence R and go to Step 2

Step 4: Finally, we can find that R* = {1010100} and C* = 0.5449α

Number of symbols read from node 1 to node 7 under R*: 3 5 4 4 5 3 3


Different Number of Storage Nodes

Consider the overall performance of the complete recovery operation for EVENODD

[Figure: bar chart of recovery time (in seconds) per MB versus p (p = 5, 7, 11, 13, 17) for Conventional, Hybrid, and CHR. Bar labels, lower series: 14.319.69, 14.44, 15.06, 18.31; upper series: 19, 25.47, 25.17, 27.49, 32.11]

Different Chunk Size

Evaluate the impact of chunk size for EVENODD on the recovery time performance

[Figure: bar chart of recovery time (in seconds) per MB versus chunk size for Conventional, Hybrid, and CHR; the bar labels give CHR's improvement in %]

Chunk size                 256KB   512KB   1024KB   2048KB   4096KB
CHR vs. hybrid (%)         9.4     11.07   14.44    16.27    20.5
CHR vs. conventional (%)   15.57   26.25   25.17    26.39    25.48

Different Failed Nodes

Evaluate the recovery time performance for EVENODD when the failed node is in a different column

[Figure: bar chart of recovery time (in seconds) per MB versus failed node for Conventional, Hybrid, and CHR; the bar labels give CHR's improvement in %]

Failed node   CHR vs. hybrid (%)   CHR vs. conventional (%)
Node0         14.44                25.17
Node1         8.1                  17.06
Node2         11.52                17.09
Node3         10.22                21.8
Node4         13.97                22.62
Node5         13.06                23.93
Node6         9.4                  18.58
Node7         15.31                19.77
Node8         13.79                18.19
Node9         8.83                 22.56
Node10        9.98                 16.95

