HARD DISKS AND OTHER STORAGE DEVICES
Jehan-François Pâris, Spring 2015



Magnetic disks (I)

Sole part of the computer architecture with moving parts:
  Data stored on circular tracks of a disk
  Spinning speed between 5,400 and 15,000 rotations per minute
  Accessed through a read/write head

Magnetic disks (II)

[Diagram of a disk drive: platters, R/W heads, arms, and servo]

Magnetic disks (III)

Data are stored on circular tracks
Tracks are partitioned into a variable number of fixed-size sectors
  Outside tracks have more sectors than inside tracks
If the disk drive has more than one platter, all tracks corresponding to the same position of the R/W head form a cylinder

Seagate ST4000DM000 (I)

Interface: SATA 6 Gb/s (750 MB/s)
Capacity: 4 TB
Cache: 64 MB, multisegmented
Average seek times:
  Read: < 8.5 ms
  Write: < 9.5 ms
Average data rate: 146 MB/s (read/write)
Maximum sustained data rate: 180 MB/s

Seagate ST4000DM000 (II)

Number of platters: 4
Number of heads: 8
Bytes per sector: 4,096
Irrecoverable read errors per bit read: 1 in 10^14
Power consumption:
  Operating: 7.5 W
  Idle: 5 W
  Standby & sleep: 0.75 W
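A quick back-of-envelope check on these figures (a Python sketch using decimal units and the slide's numbers, so the results are only approximate):

```python
# Rough sanity checks on the ST4000DM000 figures above (decimal units).
CAPACITY_BYTES = 4e12        # 4 TB
SUSTAINED_RATE = 180e6       # 180 MB/s maximum sustained data rate
BIT_ERROR_RATE = 1e-14       # 1 irrecoverable read error per 10^14 bits read

full_scan_hours = CAPACITY_BYTES / SUSTAINED_RATE / 3600
expected_errors = CAPACITY_BYTES * 8 * BIT_ERROR_RATE

print(f"Sequential read of the whole disk: about {full_scan_hours:.1f} hours")
print(f"Expected irrecoverable read errors in one full scan: {expected_errors:.2f}")
```

Even at the maximum sustained rate a full scan takes more than six hours, and the expected number of irrecoverable read errors per scan is already about 0.3.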

Sectors and blocks

Sectors are the smallest physical storage unit on a disk
  Fixed-size, traditionally 512 bytes
  Separated by intersector gaps
Blocks are the smallest transfer unit between the disk and main memory

Magnetic disks (IV)

The disk spins at a speed varying between 5,400 rpm (laptops) and 15,000 rpm (Seagate Cheetah X15, …)
Accessing data requires (a rough cost estimate is sketched below):
  Positioning the head on the right track: seek time
  Waiting for the data to reach the head: on average half a rotation
  Transferring the data
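A simple cost model combining the three components listed above (a sketch; the 7,200 rpm and 4 KB block values are illustrative assumptions, not taken from the slides):

```python
# Estimated time to access one block: seek + rotational delay + transfer.
def access_time_ms(seek_ms, rpm, block_bytes, rate_mb_per_s):
    rotational_delay_ms = 0.5 * 60_000 / rpm            # half a rotation on average
    transfer_ms = block_bytes / (rate_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_delay_ms + transfer_ms

# Illustrative numbers: 8.5 ms average seek, 7,200 rpm, 4 KB block, 146 MB/s
print(f"{access_time_ms(8.5, 7200, 4096, 146):.1f} ms")   # about 12.7 ms
```

Note how seek time and rotational delay dominate: transferring a single 4 KB block contributes only a few hundredths of a millisecond.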

Accessing disk contents

Each block on a disk has a unique address, normally a single number
  Logical block addressing (LBA), the standard since 1996
Older disks used a different scheme: cylinder-head-sector (CHS)
  Exposed the disk's internal organization
  Old CHS triples can still be mapped onto LBA addresses (see the sketch below)
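For illustration, the classic CHS-to-LBA mapping, a sketch assuming the drive exposes a geometry of heads_per_cylinder heads and sectors_per_track sectors per track (CHS sectors are numbered from 1):

```python
# Map a (cylinder, head, sector) triple onto a logical block address.
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)

# Example with a small legacy geometry: 16 heads, 63 sectors per track
print(chs_to_lba(cylinder=2, head=3, sector=1,
                 heads_per_cylinder=16, sectors_per_track=63))   # 2205
```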

Disk access times

Dominated by seek time and rotational delay
We try to reduce seek times by placing data that are likely to be accessed together on nearby tracks or on the same cylinder
We cannot do as much about rotational delay

Seek times (I)

Depend on the distance between the two tracks
Minimal delay for:
  Seeks between adjacent tracks, i.e., track to track (1-3 ms)
  Switching between tracks within the same cylinder
Worst delay for end-to-end seeks

Seek times (II)

[Chart: end-to-end seek time is roughly 3 to 5 times the track-to-track seek time]

Rotational latency

Half a rotation on average, the same for reads and writes
One and a half rotations for write/verify

Average rotational delay

RPM      Average delay (ms)
5,400    5.6
7,200    4.2
10,000   3.0
15,000   2.0
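The delays above follow directly from the rotation speed: the average wait is half a revolution, i.e.

```latex
t_{\text{avg}} = \frac{1}{2}\cdot\frac{60{,}000\ \text{ms}}{\text{RPM}}
```

for example, 30,000 / 7,200 ≈ 4.2 ms.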

Transfer rate (I)

Burst rate:
  Observed while transferring a block
  Highest for blocks on outside tracks: more of them on each track
Sustained transfer rate:
  Observed while reading sequential blocks
  Lower

Transfer rate (II)

[Graph: actual transfer rate]

Double buffering (I)

Speeds up the handling of sequential files

[Diagram: file blocks B0 B1 B2 B3 B4 B5 B6 …; the buffer holding B1 is being processed by the DBMS while the buffer holding B2 is in transfer]

Double buffering (II)

When both tasks are completed, the two buffers swap roles (a code sketch of this loop follows below)

[Diagram: the buffer holding B2 is now being processed by the DBMS while the buffer holding B3 is in transfer]
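A minimal sketch of the double-buffering loop in Python, with hypothetical read_block and process placeholders standing in for the disk transfer and the DBMS work:

```python
import threading

def read_block(file_blocks, i, buf):
    buf[:] = file_blocks[i]                  # simulated transfer from disk into the buffer

def process(buf):
    pass                                     # stand-in for the DBMS processing a block

def double_buffered_scan(file_blocks):
    buffers = [bytearray(), bytearray()]
    read_block(file_blocks, 0, buffers[0])   # prime the first buffer
    for i in range(len(file_blocks)):
        current, spare = buffers[i % 2], buffers[(i + 1) % 2]
        transfer = None
        if i + 1 < len(file_blocks):         # start transferring block i+1 ...
            transfer = threading.Thread(target=read_block,
                                        args=(file_blocks, i + 1, spare))
            transfer.start()
        process(current)                     # ... while block i is being processed
        if transfer:
            transfer.join()                  # continue only when both tasks are done

double_buffered_scan([b"B0", b"B1", b"B2", b"B3", b"B4", b"B5", b"B6"])
```

With a single buffer the transfer and the processing would alternate; with two buffers they overlap, which is exactly what the two diagrams above illustrate.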

The five minute rule

Jim Gray: keep in memory any data item that will be used during the next five minutes
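The rule comes from a break-even argument that compares the cost of keeping a page in RAM with the cost of re-reading it from disk. A commonly cited form of the formula (the exact figures depend on current hardware prices, so treat this as an approximation of Gray and Putzolu's original reasoning) is:

```latex
\text{BreakEvenInterval (s)} \approx
  \frac{\text{PagesPerMBofRAM}}{\text{AccessesPerSecondPerDisk}}
  \times
  \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofRAM}}
```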

The internal disk controller

A printed circuit board attached to the disk drive
  As powerful as the CPU of a personal computer of the early 80's
Functions include:
  Speed buffering
  Disk scheduling
  …

Reliability Issues

Disk failure rates

Failure rates follow a bathtub curve:
  High infant mortality
  Low failure rate during useful life
  Higher failure rates as disks wear out

Disk failure rates (II)

[Figure: bathtub curve of failure rate over time, with infant mortality, useful life, and wearout phases]

Disk failure rates (III)

Infant mortality effect can last for months for disk drives

Cheap SATA disk drives seem to age less gracefully than SCSI drives

The Backblaze study

Reported on the disk failure rates of more than 25,000 disks at Backblaze
Their disks tend to fail at a rate of:
  5.1 percent per year during their first eighteen months
  1.4 percent per year during the next eighteen months
  11.8 percent per year after that

[Chart: yearly failure rate (percent) over time (0-48 months): early failure stage at 5.1%, random failure stage at 1.4%, wearout failure stage at 11.8%]
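Compounding those three rates gives a rough idea of how many drives survive their first four years (a simple sketch; the study itself reports proper survival curves):

```python
# Approximate four-year survival from the three yearly failure rates above.
survival = ((1 - 0.051) ** 1.5      # 5.1 %/year over the first 18 months
            * (1 - 0.014) ** 1.5    # 1.4 %/year over the next 18 months
            * (1 - 0.118) ** 1.0)   # 11.8 %/year afterwards (one more year here)
print(f"Estimated fraction of drives alive after 4 years: {survival:.0%}")   # about 80%
```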

MTTF

Disk manufacturers advertise very high Mean Times To Fail (MTTF) for their products:
  500,000 to 1,000,000 hours, that is, 57 to 114 years
Does not mean that a disk will last that long!
Means that disks will fail at an average rate of one failure per 500,000 to 1,000,000 hours during their useful life

More MTTF Issues (I)

Manufacturers' claims are not supported by solid experimental evidence

Obtained by submitting disks to a stress test at high temperature and extrapolating the results to ideal conditions
  The procedure raises many issues

More MTTF Issues (II)

Failure rates observed in the field are much higher
  Can go up to 8 to 9 percent per year
  Corresponding MTTFs are 11 to 12.5 years
If we have 100 disks and an MTTF of 12.5 years, we can expect an average of 8 disk failures per year
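The arithmetic behind that last example, as a small sketch converting an MTTF into an expected number of failures per year for a disk population (valid only during the useful-life phase, where the failure rate is roughly constant):

```python
HOURS_PER_YEAR = 8760

def expected_failures_per_year(n_disks, mttf_hours):
    # With a constant failure rate, each disk fails on average once per MTTF hours.
    return n_disks * HOURS_PER_YEAR / mttf_hours

print(expected_failures_per_year(100, 12.5 * HOURS_PER_YEAR))   # 8.0 failures/year
print(expected_failures_per_year(100, 1_000_000))               # about 0.9 with a vendor MTTF
```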

Flash Drives

What about flash?
Widely used in flash drives, most MP3 players and some small portable computers
Several important limitations:
  Limited write bandwidth
    Must erase a whole block of data before overwriting any part of it
  Limited endurance: 10,000 to 100,000 write cycles

Flash drives

Widely used in flash drives, most MP3 players and some small portable computers
Similar technology to EEPROM
Three technologies:
  NOR flash
  NAND flash
  Vertical NAND

NOR Technology

Each cell has:
  One end connected straight to ground
  The other end connected straight to a bit line
Longest erase and write times
Allows random access to any memory location
Good choice for storing BIOS code
  Replaces older ROM chips

NAND Technology

Shorter erase and write times
Requires less chip area per cell
Up to ten times the endurance of NOR flash
Disk-like interface:
  Data must be read on a page-wise basis
Block erasure:
  Erasing older data must be performed one block at a time
  A block typically contains 32, 64 or 128 pages
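A toy model of the block-erasure constraint described above (a sketch with hypothetical NandBlock bookkeeping, not any real controller interface):

```python
PAGES_PER_BLOCK = 64                      # typical block sizes: 32, 64 or 128 pages

class NandBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0

    def erase(self):
        # Erasure always affects the whole block and wears it a little more.
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1

    def write_page(self, index, data):
        if self.pages[index] is not None:
            raise ValueError("page already written: erase the whole block first")
        self.pages[index] = data

block = NandBlock()
block.write_page(0, b"old data")
block.erase()                             # overwriting page 0 forces a full-block erase
block.write_page(0, b"new data")
```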

Vertical NAND Technology

The fastest of the three flash technologies

The flash drive controller

Performs:
  Error correction
    Higher flash densities result in more errors
  Wear leveling
    Distributes writes among blocks to prevent failures resulting from uneven numbers of erase cycles
Flash drives work best with sequential workloads
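A minimal sketch of the wear-leveling decision, assuming the controller keeps a (hypothetical) table of erase counts for its free blocks:

```python
# Send the next write to the free block that has been erased the fewest times,
# so that erase cycles stay roughly even across the whole device.
def pick_block_for_write(free_block_erase_counts):
    return min(free_block_erase_counts, key=free_block_erase_counts.get)

free_block_erase_counts = {0: 120, 1: 97, 2: 103}
print(pick_block_for_write(free_block_erase_counts))   # block 1, the least-worn block
```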

Performance data

Vary widely between models
One random pair of specs:
  Read speed: 22 MB/s
  Write speed: 15 MB/s

RAID level 0

No replication
Advantages:
  Simple to implement
  No overhead
Disadvantage:
  If the array has n disks, its failure rate is n times the failure rate of a single disk
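RAID 0 gets its performance by striping blocks across the disks (the same striping the RAID 10 slide refers to later). A minimal sketch of the simplest, block-level mapping:

```python
# Logical block b of an n-disk RAID 0 array lives on disk b mod n,
# at position b div n on that disk.
def raid0_location(logical_block, n_disks):
    return logical_block % n_disks, logical_block // n_disks

print(raid0_location(11, n_disks=4))   # (3, 2): disk 3, third block on that disk
```

Real arrays often stripe in chunks larger than a single block, but the mapping idea is the same.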

RAID levels 0 and 1

[Diagram: a RAID level 0 array and a RAID level 1 array with mirrored disks]

RAID level 1

Mirroring:
  Two copies of each disk block
Advantages:
  Simple to implement
  Fault-tolerant
Disadvantage:
  Requires twice the disk capacity of normal file systems

RAID level 4 (I)

Requires N + 1 disk drives
  N drives contain data
    Individual blocks, not chunks
  Blocks with the same disk address form a stripe

[Diagram: a stripe of data blocks (x x x x) and the parity block to be computed (?)]

RAID level 4 (II)

The parity drive contains the exclusive or (XOR) of the N blocks in the stripe:
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
The parity block now reflects the contents of several blocks!
Can now do parallel reads and writes (a worked example follows below)
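A worked example of the parity computation (a sketch using byte strings as stand-ins for disk blocks):

```python
from functools import reduce

def xor_blocks(blocks):
    # Bytewise exclusive or of same-sized blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripe = [b"AAAA", b"BBBB", b"CCCC"]            # N = 3 data blocks of one stripe
parity = xor_blocks(stripe)                     # p[k] = b[k] xor b[k+1] xor b[k+2]

# If the disk holding the second block fails, its contents can be rebuilt
# from the surviving blocks and the parity block:
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
```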

RAID levels 4 and 5

[Diagram: a RAID level 4 array, whose dedicated parity drive is a bottleneck, and a RAID level 5 array with the parity blocks distributed across all drives]

RAID level 5

The single parity drive of RAID level 4 is involved in every write
  This limits parallelism
RAID level 5 distributes the parity blocks among the N + 1 drives
  Much better

The small write problem

Specific to RAID 5
Happens when we want to update a single block
  The block belongs to a stripe
  How can we compute the new value of the parity block?

[Diagram: a stripe containing blocks b[k], b[k+1], b[k+2], … and its parity block p[k]]

First solution

Read the values of the N-1 other blocks in the stripe
Recompute
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
This solution requires:
  N-1 reads
  2 writes (the new block and the new parity block)

Second solution

Assume we want to update block b[m]
Read the old values of b[m] and of the parity block p[k]
Compute
  new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
This solution requires (see the sketch below):
  2 reads (the old values of the block and of the parity block)
  2 writes (the new block and the new parity block)
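The same toy blocks can illustrate the read-modify-write update of the second solution (a sketch; real controllers do this at the firmware level):

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_block, new_block, old_parity):
    # XOR the old data out of the parity and the new data in.
    return xor_bytes(xor_bytes(new_block, old_block), old_parity)

stripe = [b"AAAA", b"BBBB", b"CCCC"]
old_parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

# Update the second block: two reads (old block, old parity), two writes.
new_block = b"ZZZZ"
new_parity = update_parity(stripe[1], new_block, old_parity)

# Same result as recomputing the parity from the whole new stripe:
assert new_parity == xor_bytes(xor_bytes(stripe[0], new_block), stripe[2])
```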

Other RAID organizations (I)

RAID 6:
  Two check disks
  Tolerates two disk failures
  More complex updates

Other RAID organizations (II)

RAID 10:
  Also known as RAID 1 + 0
  Data are striped (as in RAID 0 or RAID 5) over pairs of mirrored disks (RAID 1)

[Diagram: a RAID 0 stripe across four RAID 1 mirrored pairs]

Other RAID organizations (III)

Two-dimensional RAIDs:
  Designed for archival storage
    Data are written once and read maybe (WORM)
  Update rate is less important than:
    High reliability
    Low storage costs

Complete 2D RAID arrays

Have n parity disks and n(n – 1)/2 data disks

[Diagram: complete 2D RAID array for n = 4, with parity disks P1, P2, P3, P4 and data disks D12, D13, D14, D23, D24, D34]

Main advantage
Work in progress