
Disk Arrays

COEN 180

Large Storage Systems

Collection of disks to store large amounts of data.

Performance advantage: each drive can satisfy only so many I/Os per second, so data spread across more drives is more accessible.

JBOD: Just a Bunch Of Disks.

Large Storage Systems

Principal difficulty: reliability. Data needs to be stored redundantly, either by:

Mirroring / replication: simple; expensive (double, triple, … storage costs); good performance.

Erasure-correcting codes: complex; save storage; moderate performance.

Large Storage Systems

Mirrored Disks

Used by Tandem (1974 – 1997, bought by Compaq). The NonStop architecture used redundancy (CPU, storage) for fail-over capacity.

Data is replicated on both drives.

Performance:
Writes: as fast as the single-disk model.
Reads: slightly faster, since we can serve the read from the drive with the best expected service time.

Disk Performance Modeling Basics

Service Time: time to satisfy a request if the system is otherwise idle.

Response Time: time to satisfy a request at a given system load. Response time = service time + waiting time.

Utilization: fraction of the time the system is busy.

Disk Performance Modeling Basics

M/M/1 queue (single server). Assume Poisson arrivals at rate λ and exponential service time S.

Utilization U = λS (Little's law).

Determine the response time R by R = S + U·R, hence

R = S/(1 − U) = S/(1 − λS)

[Plot: response time R versus utilization U for S = 1 (hence U = λ); R grows sharply as U approaches 1.]
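A small sketch of this formula in Python (the request rates below are made up for illustration):

    def mm1_response_time(arrival_rate: float, service_time: float) -> float:
        """M/M/1 response time: R = S / (1 - U), with U = arrival_rate * S."""
        utilization = arrival_rate * service_time
        if utilization >= 1.0:
            raise ValueError("unstable queue: utilization >= 1")
        return service_time / (1.0 - utilization)

    # A disk with S = 10 ms at increasing request rates (requests/second).
    for rate in (20, 50, 80):
        print(rate, mm1_response_time(rate, 0.010) * 1000, "ms")  # 12.5, 20, 50 ms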

Disk Performance Modeling Basics

Need to determine the service time of a disk request.

Service time = seek time + latency + transfer time.

Industrial (but wrong) determination: seek time = time to travel one third of the way across the disk. Why one third?

Disk Performance Modeling Basics

Assume that the head is positioned on a random track, and that the target track is another random track.

Given x ∈ [0, 1], calculate D(x) = distance of a random point in [0, 1] from x.

Disk Performance Modeling Basics

Given x ∈ [0, 1], calculate D(x) = distance of a random point in [0, 1] from x:

D(x) = ∫₀¹ |x − y| dy = ∫₀ˣ (x − y) dy + ∫ₓ¹ (y − x) dy = x²/2 + (1 − x)²/2 = x² − x + 1/2

[Plot: D(x) for x ∈ [0, 1]; minimum 1/4 at x = 1/2, rising to 1/2 at the endpoints.]

Disk Performance Modeling Basics

Now calculate the average distance from a random point to a random point in [0, 1]:

D̄ = ∫₀¹ D(x) dx = [x³/3 − x²/2 + x/2]₀¹ = 1/3

So the average seek distance is one third of the full track range.
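A quick Monte Carlo check of both results (a sketch; the sample size is arbitrary):

    import random

    N = 1_000_000
    # Average distance between two uniform random points in [0, 1]: ~1/3.
    print(sum(abs(random.random() - random.random()) for _ in range(N)) / N)

    # D(x) for a fixed head position x: should approach x**2 - x + 1/2.
    x = 0.2
    print(sum(abs(x - random.random()) for _ in range(N)) / N, x * x - x + 0.5)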

Disk Performance Modeling Basics

Is Average Seek Time = Seek Time for Average Distance?

NO: seek time does not depend linearly on seek distance. A seek consists of acceleration, cruising (if the seek distance is long), braking, and exact positioning.

Disk Performance Modeling Basics

Is Average Seek Time = Seek Time for Average Distance?

Practical measurements suggest that seek time depends on seek distance roughly as the square root of the distance.

[Plot: seek time versus seek distance, showing a square-root-shaped curve.]
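A sketch of such a square-root seek model (the coefficients are made up, not measured):

    import math

    def seek_time_ms(distance: float, settle_ms: float = 1.0, coeff: float = 0.1) -> float:
        """Square-root seek model: t(d) = settle + coeff * sqrt(d); t(0) = 0."""
        return 0.0 if distance == 0 else settle_ms + coeff * math.sqrt(distance)

    print(seek_time_ms(100), seek_time_ms(10_000))  # short vs. long seek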

Disk Performance Modeling Basics

Rule of thumb: keep the utilization of disks between 50% and 80%.

Disk Arrays

Dealing with reliability: RAID

Redundant Array of Inexpensive (Independent) Disks

RAID Levels:
RAID Level 0: JBOD (striping).
RAID Level 1: mirroring.
RAID Level 2: encodes symbols (bytes) with a Hamming code and stores each bit of a symbol on a different disk. Not used in practice.

Disk Arrays

Dealing with reliability: RAID Levels (continued)

RAID Level 3: encodes symbols (bytes) with the simple parity code. Breaks a file up into n stripes, calculates the parity stripe, and stores all n + 1 stripes on n + 1 disks.

Disk Arrays

Dealing with reliability: RAID Levels (continued)

RAID Level 4: maintains n data drives. Files are stored completely on one drive (or perhaps in stripes, if files become very large). An additional drive stores the byte-wise parity of the data drives.

[Layout: Data | Data | Data | Parity]

Disk Arrays

Level 4 RAID: uneven load on the parity drive versus the data drives.

Disk Arrays

Dealing with reliability: RAID Level 5

No dedicated parity disk. Data is stored in blocks; blocks in parallel positions on the disks form a reliability stripe, and one block in each reliability stripe is the parity of the others.

No performance bottleneck.

Disk Arrays

Dealing with reliability: RAID Level 6

Like RAID Level 5, but every stripe has two parity blocks. Lower write performance; resilient against two failures.

RAID Level 7: proprietary name for a RAID Level 3 with lots of caching. (Marketing bogus.)

Disk Arrays

Disk Array Operations

Reads: served directly from the data in RAID Levels 3 – 6.

Writes:
Large writes: write all blocks in a single reliability stripe; calculate the parity from the data and write it as well.
Small writes: need to maintain parity.
Option 1: write the data, then read all other blocks in the stripe and recalculate the parity.
Option 2: read the old data, then overwrite it. Calculate the difference (XOR) between the old and new data. Then read the old parity, XOR it with the result of the previous operation, and overwrite the parity block with it (see the sketch below).
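A sketch of the option-2 parity update on byte strings (block contents are made up):

    def small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        """New parity = old parity XOR old data XOR new data."""
        return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

    # Toy stripe: three data blocks and their parity.
    d = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*d))
    new_d1 = bytes([40, 50, 60])
    parity = small_write_parity(d[1], new_d1, parity)
    d[1] = new_d1
    assert parity == bytes(a ^ b ^ c for a, b, c in zip(*d))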

Disk Arrays

Disk Array Operations: Reconstruction (RAID Levels 4 – 5)

Systematically: reconstruct only the lost data. Read all surviving blocks in the reliability stripe and calculate their parity; this is the lost data block. Write the reconstructed data block in place of the parity block (sketch below).

Out-of-order reconstruction for data that is being read.
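A sketch of recovering one lost block: XOR all surviving blocks of the stripe, parity included.

    from functools import reduce

    def reconstruct(survivors: list[bytes]) -> bytes:
        """XOR the surviving blocks (data + parity) to recover the lost block."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

    d0, d1, d2 = bytes([1, 2]), bytes([3, 4]), bytes([5, 6])
    parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))
    assert reconstruct([d0, d2, parity]) == d1   # d1's disk failed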

Disk Arrays Performance Analysis

Assume that read and write service times are the same: seek + latency + transfer.

A small write involves a read-modify-write operation, which takes about twice as long as the plain read/write service time: seek + latency + transfer, plus two more latencies (a full rotation) and a second transfer.

Disk Arrays

Performance Analysis Level 4 RAID

Offered read load λr, offered write load λw, n disks (n − 1 data disks + 1 parity disk).

Utilization at a data disk: λr·S/(n − 1) + λw·2S/(n − 1)

Utilization at the parity disk: λw·2S

Equal utilization only if λr = 2(n − 2)·λw
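A small sketch of these formulas in Python (the load numbers are illustrative):

    def raid4_utilizations(read_rate: float, write_rate: float,
                           n: int, service_time: float) -> tuple[float, float]:
        """Per-disk utilizations for an n-disk RAID 4 (n - 1 data + 1 parity).

        Each small write is a read-modify-write (~2S) on both the affected
        data disk and the parity disk.
        """
        data = (read_rate * service_time + write_rate * 2 * service_time) / (n - 1)
        parity = write_rate * 2 * service_time
        return data, parity

    # 4+1 layout, 10 ms service time, 70 reads/s and 30 writes/s (made up).
    print(raid4_utilizations(70, 30, 5, 0.010))   # -> (0.325, 0.6)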

Disk Arrays

Performance Analysis Level 4 RAID

Offered load λ; read ratio ρ; assume only small writes.

Utilization at a data disk: (ρ·λ·S + (1 − ρ)·λ·2S)/n

Utilization at the parity disk: (1 − ρ)·λ·2S

[Plot: utilization versus offered load (IO/sec) for the parity disk and a data disk. Parameters: 4+1 layout, 70% reads, service time 10 msec. The parity disk saturates first.]

Disk Arrays

Performance Analysis: RAID Level 5

Offered load λ, read ratio ρ, n disks.

Read load per disk: ρ·λ·S/n

Write load per disk: (1 − ρ)·λ·4S/n (every write leads to two read-modify-write operations: one for the data block, one for the parity block)
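A sketch of the RAID Level 5 per-disk utilization (same illustrative parameters as above):

    def raid5_utilization(load: float, read_ratio: float,
                          n: int, service_time: float) -> float:
        """Per-disk utilization for an n-disk RAID 5 with rotated parity.

        Reads cost S; each small write costs two read-modify-writes (4S),
        all spread evenly over the n disks.
        """
        reads = read_ratio * load * service_time / n
        writes = (1 - read_ratio) * load * 4 * service_time / n
        return reads + writes

    # 4+1 layout (n = 5), 70% reads, 10 ms service time, 100 IO/s offered.
    print(raid5_utilization(100, 0.7, 5, 0.010))   # -> 0.38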


Disk Arrays

Level 4 RAID vs Level 5 RAID

[Plots: utilization versus offered load for the Level 4 RAID (parity drive vs. data drive), for RAID Level 5, and for the same disks without a parity disk (JBOD). Parameters: 4+1 layout, 70% reads, service time 10 msec.]

Disk Arrays

Performance: small writes are expensive.

Parity logging (Daniel Stodolsky, Garth Gibson, Mark Holland), sketched below:

Write operation: read the old data, write the new data, and send the XOR of old and new data to a parity log file.

Whenever the parity log file becomes too big, process it by updating the parity information.
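A toy in-memory sketch of the idea (the stripe layout and the log-size threshold are made up):

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    class ParityLog:
        """Toy parity log: parity updates are appended, then applied in batches."""

        def __init__(self, data: list[bytes], parity: list[bytes], limit: int = 4):
            self.data, self.parity = data, parity   # parity[i] guards data[i]'s stripe
            self.log: list[tuple[int, bytes]] = []
            self.limit = limit

        def write(self, i: int, new: bytes) -> None:
            self.log.append((i, xor(self.data[i], new)))  # send XOR to the log
            self.data[i] = new                            # write the new data
            if len(self.log) >= self.limit:               # log too big: process it
                for j, delta in self.log:
                    self.parity[j] = xor(self.parity[j], delta)
                self.log.clear()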

Disk Arrays

Reliability: accurately given by the probability of failure at every moment in time.

[Plot: two reliability curves over time (0 – 30 years); see the MTTDL warning below.]

Disk Arrays

Reliability is often given by the Mean Time To Data Loss (MTTDL).

Warning: MTTDL numbers can be deceiving. In the plot above, the red line is more reliable during the design life but has the lower MTTDL.

Disk Arrays

Use a Markov model to describe the system in its various states. States describe the system; transitions correspond to component failures and component repairs. Assumes constant transition rates.

Disk Arrays

One-component system

Initial State --λ--> Failure State (absorbing)

MTTDL = MTTF = 1/λ

Disk Arrays

Two-component system without repair

Initial State (2 components working) --2λ--> (1 component working, one failed) --λ--> Failure State (absorbing)

Disk Arrays

Two-component system with repair

Initial State (2 components working) --2λ--> (1 component working, one failed) --λ--> Failure State (absorbing)

Repair: (1 component working) --μ--> (2 components working)

Disk Arrays

How to calculate the MTTF:

Start with the original Markov model. Remove the failure state and replace transition(s) to the failure state with failure transitions back to the initial state. This models a meta-system where we replace a failed system immediately with a new one.

Now calculate the steady-state solution of the Markov model; it has typically become ergodic.

Use this to calculate the average rate at which a failure transition is taken. Its inverse gives the MTTF.
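A sketch of this recipe in Python, applied to the two-component system without repair that the next slides work out by hand (made-up failure rate):

    import numpy as np

    def mttf(Q: np.ndarray, loss_rates: np.ndarray) -> float:
        """Solve pi Q = 0 with sum(pi) = 1, then MTTF = 1 / (pi . loss_rates)."""
        n = Q.shape[0]
        A = np.vstack([Q.T, np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return 1.0 / float(pi @ loss_rates)

    lam = 1e-5   # made-up failure rate (per hour)
    # States: 0 = two working, 1 = one working. The failure transition from
    # state 1 loops back to state 0 at rate lam (meta-system).
    Q = np.array([[-2 * lam, 2 * lam],
                  [     lam,    -lam]])
    print(mttf(Q, np.array([0.0, lam])))   # ~1.5 / lam

Adding a repair transition at rate μ from state 1 back to state 0 only changes the second row of Q.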

Disk Arrays

One-component system

Meta-system: the failure transition loops back to the initial state, so the system is in the initial state all the time, and the failure transition is taken at rate λ.

"Loss rate" L = λ.

MTTDL = 1/L = 1/λ.

Disk Arrays

Two-component system without repair

Meta-system: 2 components working (probability x) --2λ--> 1 component working (probability y); the failure transition (rate λ) leads back to the initial state.

Steady-state solution:

Inflow into state 2 = outflow from state 2: λy = 2λx, i.e., 2x = y.

Total sum of probabilities is 1: x + y = 1.

Disk Arrays

Two-component system without repair (continued)

From 2x = y and x + y = 1, the solution is x = 1/3, y = 2/3.

Loss rate L = (2/3)·λ.

MTTF = 1/L = 1.5 · (1/λ)

(1.5 times better than before.)

Disk Arrays

Two-component system with repair

Meta-system: 2 components working (probability x) --2λ--> 1 component working (probability y); repair at rate μ back to the initial state; the failure transition (rate λ) also leads back to the initial state.

Steady-state solution:

2λx = (λ + μ)y, x + y = 1

⇒ x = (λ + μ)/(3λ + μ), y = 2λ/(3λ + μ)

Loss rate L = λy = 2λ²/(3λ + μ)

MTTF = 1/L = (3λ + μ)/(2λ²)
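A quick numeric check of the result (made-up rates):

    lam, mu = 1e-5, 1e-2   # assumed failure and repair rates (per hour)
    print((3 * lam + mu) / (2 * lam ** 2))   # MTTF with repair: ~5.0e7 hours
    print(1.5 / lam)                         # without repair: 1.5e5 hours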

Disk Arrays

RAID Level 4/5 Reliability

Initial State (n disks) --nλ--> (n − 1 disks) --(n − 1)λ--> Failure State (absorbing)

Repair: (n − 1 disks) --μ--> (n disks)
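Applying the same steady-state recipe to this two-state chain gives a closed form; a sketch with illustrative rates:

    def raid5_mttdl(n: int, lam: float, mu: float) -> float:
        """MTTDL of an n-disk single-parity array (RAID Level 4/5): data is lost
        when a second disk fails before the first failure is repaired."""
        return ((2 * n - 1) * lam + mu) / (n * (n - 1) * lam ** 2)

    # 4+1 layout, disk MTTF 100,000 h, repair within 24 h (made-up numbers).
    print(raid5_mttdl(5, 1e-5, 1 / 24))   # ~2.1e7 hours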

Disk Arrays

RAID Level 6 Reliability

Initial State (n disks) --nλ--> (n − 1 disks) --(n − 1)λ--> (n − 2 disks) --(n − 2)λ--> Failure State (absorbing)

Repairs: (n − 1 disks) --μ--> (n disks), (n − 2 disks) --2μ--> (n − 1 disks)
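The three-state RAID Level 6 chain can be solved numerically with the same recipe (a sketch; the 2μ rate assumes both failed disks are rebuilt in parallel, and all rates are made up):

    import numpy as np

    n, lam, mu = 7, 1e-5, 1 / 24   # 5+2 layout; made-up rates (per hour)
    # States: 0 = n disks, 1 = n-1 disks, 2 = n-2 disks; the data-loss
    # transition from state 2 loops back to state 0 (meta-system).
    Q = np.array([
        [     -n * lam,               n * lam,                       0.0],
        [           mu, -(mu + (n - 1) * lam),             (n - 1) * lam],
        [(n - 2) * lam,                2 * mu, -(2 * mu + (n - 2) * lam)],
    ])
    A = np.vstack([Q.T, np.ones(3)])
    pi, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)
    print(1.0 / (pi[2] * (n - 2) * lam))   # MTTDL in hours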

Disk Arrays

Sparing

Create more resilience by adding a hot spare. Failover to the hot spare reconstructs the contents of the lost disk and places them on the spare disk.

Distributed sparing (Menon et al.): distribute the spare space throughout the disk array.