
1 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Chapter 6

Storage and Other I/O Topics

2 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Introduction §  I/O devices can be characterized by

•  Behaviour: input, output, storage •  Partner: human or machine •  Data rate: bytes/sec, transfers/sec

§  I/O bus connections

§6.1 Introduction

3 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O System Characteristics §  Dependability is important

•  Particularly for storage devices §  Performance measures

•  Latency (response time) •  Throughput (bandwidth) •  Desktops & embedded systems

– Mainly interested in response time & diversity of devices •  Servers

– Mainly interested in throughput & expandability of devices

4 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Dependability

§  Fault: failure of a component
•  May or may not lead to system failure

§6.2 Dependability, Reliability, and Availability

[Figure: two service states, "Service accomplishment" (service delivered as specified) and "Service interruption" (deviation from specified service), with "Failure" and "Restoration" transitions between them]

5 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Dependability Measures §  Reliability: mean time to failure (MTTF) §  Service interruption:

mean time to repair (MTTR) §  Mean time between failures

•  MTBF = MTTF + MTTR

§  Availability = MTTF / (MTTF + MTTR) §  Improving Availability

•  Increase MTTF: fault avoidance, fault tolerance, fault forecasting

•  Reduce MTTR: improved tools and processes for diagnosis and repair
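As a concrete illustration, here is a short C sketch of these measures, using the 1,200,000-hour MTTF quoted later in the chapter and an assumed 24-hour MTTR:

```c
/* Dependability measures from this slide:
 *   MTBF = MTTF + MTTR,  Availability = MTTF / (MTTF + MTTR).
 * MTTF is the 1,200,000 hr figure quoted later; MTTR is an assumption. */
#include <stdio.h>

int main(void)
{
    double mttf = 1200000.0;   /* mean time to failure, hours           */
    double mttr = 24.0;        /* mean time to repair, hours (assumed)  */

    double mtbf = mttf + mttr;
    double availability = mttf / (mttf + mttr);

    printf("MTBF         = %.0f hours\n", mtbf);
    printf("Availability = %.6f (%.4f%% uptime)\n",
           availability, 100.0 * availability);
    return 0;
}
```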

6 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Disk Storage §  Nonvolatile, rotating magnetic storage

§6.3 Disk Storage

7 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Disk Sectors and Access §  Each sector records

•  Sector ID •  Data (512 bytes, 4096 bytes proposed) •  Error correcting code (ECC)

– Used to hide defects and recording errors •  Synchronization fields and gaps

§  Access to a sector involves •  Queuing delay if other accesses are pending •  Seek: move the heads •  Rotational latency •  Data transfer •  Controller overhead

8 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Disk Access Example §  Given

•  512B sector, 15,000rpm, 4ms average seek time, 100MB/s transfer rate, 0.2ms controller overhead, idle disk

§  Average read time
•  4ms seek time
+ ½ rotation / (15,000/60 rotations per second) = 2ms rotational latency
+ 512B / 100MB/s = 0.005ms transfer time
+ 0.2ms controller delay
= 6.2ms

§  If actual average seek time is 1ms •  Average read time = 3.2ms
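The same arithmetic as a small C program; the figures are the ones given on this slide:

```c
/* Average disk read time = seek + rotational latency + transfer
 * + controller overhead, with the parameters from this slide. */
#include <stdio.h>

int main(void)
{
    double seek_ms       = 4.0;                          /* average seek    */
    double rpm           = 15000.0;
    double rotation_ms   = 0.5 / (rpm / 60.0) * 1000.0;  /* half a rotation */
    double transfer_ms   = 512.0 / 100e6 * 1000.0;       /* 512B at 100MB/s */
    double controller_ms = 0.2;

    double total = seek_ms + rotation_ms + transfer_ms + controller_ms;
    printf("average read time = %.3f ms\n", total);                /* ~6.2 */

    /* With a measured average seek of 1 ms instead of 4 ms: */
    printf("with 1 ms seek    = %.3f ms\n", total - seek_ms + 1.0); /* ~3.2 */
    return 0;
}
```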

9 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Disk Performance Issues §  Manufacturers quote average seek time

•  Based on all possible seeks •  Locality and OS scheduling lead to

smaller actual average seek times §  Smart disk controllers allocate physical sectors

on disk •  Present logical sector interface to host •  SCSI, ATA, SATA

§  Disk drives include caches •  Prefetch sectors in anticipation of access •  Avoid seek and rotational delay

10 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Flash Storage §  Nonvolatile semiconductor storage

•  100× – 1000× faster than disk •  Smaller, lower power, more robust •  But more $/GB (between disk and DRAM)

§6.4 Flash Storage

11 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Flash Types §  NOR flash: bit cell like a NOR gate

•  Random read/write access •  Used for instruction memory in embedded systems

§  NAND flash: bit cell like a NAND gate •  Denser (bits/area), but block-at-a-time access •  Cheaper per GB •  MLC vs SLC •  Used for USB keys, media storage, …

§  Flash bits wear out after 1000’s of accesses •  Not suitable for direct RAM or disk replacement •  Wear leveling: remap data to less used blocks

12 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Flash Translation Layer §  Erase before write (no in-place update)

•  Erase at block granularity •  Write at page granularity

§  Wear-leveling §  Solution: FTL (Flash Translation Layer)

… on the underlying FTL algorithms and implementation details. This mapping table is accessed in every request and should be maintained when the storage system is powered off. Therefore, the memory for this translation mapping table should be stored in fast and non-volatile media such as battery-backed DRAM or battery-backed SRAM. Because battery-backed DRAM is also a significant power consumer, the memory component for storing the mapping table is also suspected to contribute to the power consumption of flash-based SSDs.

2.2. Flash translation layer

The FTL maintains a mapping table of virtual sector numbers, which are seen from the host system, to physical sector numbers of the flash memory. It emulates the functionality of a normal block device by exposing only read/write operations to the upper software layers and hiding the presence of the erase operation.

Processing read requests in the FTL is simple. When a read request arrives from the host system, the FTL returns the corresponding physical sector number after searching the mapping table.

In comparison to read requests, write requests induce a complex update process. Basically, the sequence of processing a write request is as follows: (i) find a suitable erased page to be written, (ii) write the new data to the chosen page, (iii) invalidate the original page mapped to the logical page currently in process, if it exists, and (iv) update the mapping table to reflect this change.

FTL algorithms can be roughly categorized into three groups by their mapping granularity.

The page-mapping algorithm [16], which is a basic form of the FTL, uses a translation table that translates external logical page addresses into internal physical page addresses on the flash memory, as shown in Fig. 2(a). Because the granularity of each entry is a page, each mapping entry implies a one-to-one mapping relationship from a logical page to the corresponding physical page. When an overwrite occurs, the FTL only needs to remap the corresponding entry.

By using this straightforward table management, the page-mapping algorithm has a fast translation speed and a simple page-update mechanism, which only changes the mapping entry for the page being updated. However, because there must be as many mapping entries as there are pages, the mapping table takes a large amount of space, which we will explore in Section 3.3.
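A minimal C sketch of the page-mapping write path just described (steps i–iv); the sizes, names, and simulated flash array are illustrative only, and garbage collection is omitted:

```c
/* Page-mapping FTL sketch: one mapping entry per logical page.
 * ftl_write follows the four steps above; ftl_read is a table lookup. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE          4096
#define NUM_LOGICAL_PAGES  1024
#define NUM_PHYSICAL_PAGES 1280
#define INVALID            UINT32_MAX

static uint32_t map[NUM_LOGICAL_PAGES];               /* logical -> physical */
static uint8_t  valid[NUM_PHYSICAL_PAGES];            /* 1 = live data       */
static uint8_t  flash[NUM_PHYSICAL_PAGES][PAGE_SIZE]; /* simulated flash     */
static uint32_t next_erased;                          /* log-style allocator */

void ftl_init(void)
{
    for (uint32_t i = 0; i < NUM_LOGICAL_PAGES; i++)
        map[i] = INVALID;
}

void ftl_write(uint32_t lpn, const uint8_t *data)
{
    uint32_t new_ppn = next_erased++;          /* (i)   pick an erased page  */
    memcpy(flash[new_ppn], data, PAGE_SIZE);   /* (ii)  program the new data */

    if (map[lpn] != INVALID)
        valid[map[lpn]] = 0;                   /* (iii) invalidate old copy  */

    map[lpn] = new_ppn;                        /* (iv)  update mapping table */
    valid[new_ppn] = 1;
}

const uint8_t *ftl_read(uint32_t lpn)          /* reads: one table lookup    */
{
    return (map[lpn] == INVALID) ? NULL : flash[map[lpn]];
}
```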

Therefore, the block-mapping algorithm [29] and its variants were proposed. As shown in Fig. 2(b), the mapping table in the block-mapping FTL only contains block-granularity mapping entries, each of which maps a logical page address to the combination of a physical block number and an offset number. The actual physical page, which stores the data for the logical page, is determined by the offset field of the logical page address within the block.

Although the table size of the block-mapping FTL is very small, processing partial rewrites of previously written blocks induces a substantial amount of copy and erase operations. This is because, in order to overwrite a page, a new free block has to be allocated, and all the pages, except the page to be overwritten in the original block, have to be copied over to the pages with the same offsets in the new block.

Therefore, the hybrid FTL (also known as the log-based FTL) [19,23], illustrated in Fig. 3, has gained popularity [17] in commodity flash-based storage devices, for which production cost is of the highest priority.

The hybrid FTL has a few spare blocks, called log blocks, which are used as temporary space for updated pages. When a page overwrite operation occurs, a free page in the log blocks is …

Fig. 1. Block-diagram of a typical SSD inside [3].

Fig. 2. Address translation in the page-mapping and block-mapping FTL.

[Source: S. Park et al., Journal of Systems Architecture 57 (2011) 354–365]

13 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Hybrid Storage

[Figure: a host (disk driver, disk interface) attached to a hybrid storage device that pairs a disk interface and micro-tiering controller with hard disk media and flash media]

14 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Hybrid Storage §  Conceptual Benefit

•  Large capacity of HDDs with the high performance of SSDs •  Low power most of the time

§  Reality •  What to put in each media? •  When? At what granularity? •  Complex metadata management

15 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Flash Cache

FlashTier: A Lightweight, Consistent and Durable Storage Cache, by Mohit Saxena, Michael M. Swift, and Yiying Zhang

Introduction
High-speed solid-state drives (SSDs) composed of NAND flash are often deployed as a cache in front of high-capacity disk storage. Several vendors adopt SSDs as a block cache transparent to file systems:
• OS vendors: Oracle, Microsoft, Linux
• Storage vendors: Intel, OCZ, NetApp, EMC, FusionIO
• Applications: Facebook, Google


Design Overview
The FlashTier system consists of two components: a solid-state cache (SSC), a flash device designed specifically for caching, and a cache manager within the OS for migrating data between the flash caching tier and disk storage.

References
[1] N. Agrawal, V. Prabhakaran, T. Wobber, J. Davis, M. Manasse, and R. Panigrahy. Design Tradeoffs for SSD Performance. In USENIX ATC 2008.
[2] Facebook FlashCache, https://github.com/facebook/flashcache.
[3] Google Inc., Sparse Hash Map, http://goog-sparsehash.sourceforge.net.
[4] M. Saxena and M. Swift. FlashVM: Virtual Memory Management on Flash. In USENIX ATC 2010.

Problem Statement: Inefficiencies with the SSD block cache
• Address Space Management: two levels of indirection for address translation
• Consistency and Durability: several hours to days of warm-up after a reboot or crash
• Free Space Management: low write performance and endurance due to garbage collection for caching

Two levels of indirection: On an SSD cache, the cache manager maintains a mapping table within the host DRAM translating disk addresses to SSD addresses. The SSD block cache keeps another mapping (FTL) within the device memory translating SSD logical addresses to flash addresses to avoid in-place writes. These two mappings pose memory overhead for host DRAM and device memory.

Inconsistency: On an SSD cache, keeping cached data consistent and durable requires the cache manager to persist the cache block state and mapping within the host DRAM. To avoid the runtime cost of persisting the mapping, most cache managers do not provide any consistency or durability.
• Without durability, filling a 100GB cache from a 500 IOPS disk system can take over 14 hours after a reboot.
• Without consistency, cache update and invalidate operations can result in a stale copy of cached data being read later.
FlashTier provides an SSC interface to keep the mapping consistent and durable.

Free Space Management
Garbage collection: On an SSD cache, garbage collection results in additional copies of valid pages to create free erased blocks for new writes.
• Full devices lacking free blocks behave worse, with up to 83% lower performance and 80% lower write endurance [Intel IDF ‘10].
• Caches are often full.

[Figure: architecture of an SSD block cache vs. FlashTier. In both, the application, file system, block layer, and cache manager sit in the OS; the SSD cache path goes through the SSD's flash translation layer and mapping table, with read miss, write-through, write-back, and discard traffic, while the FlashTier cache manager addresses the SSC directly alongside the disks.]

[Figure: total host and device memory (MB) for write-intensive and read-intensive workloads.]

[Figure: unified address space. Disk LBNs map to PBAs through a sparse hash map (lookup, insert, remove) over the data blocks (read/write).]

Command      Purpose
write-dirty  Insert new block or update existing block with dirty data.
write-clean  Insert new block or update existing block with clean data.
read         Read block if present or return error.
evict        Invalidate block immediately.
clean        Allow future eviction of block.
exists       Test for presence of dirty blocks.
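To make the command set concrete, here is a hedged C sketch of how a write-through cache manager might drive it; the ssc_... and disk_... functions are hypothetical stand-ins for the device interface, not FlashTier's actual API:

```c
/* Sketch of a write-through cache manager on top of the SSC commands
 * above. The ssc_... and disk_... functions are assumed, not a real API. */
#include <stdint.h>

extern int  ssc_read(uint64_t lbn, void *buf);          /* error if absent */
extern void ssc_write_clean(uint64_t lbn, const void *buf);
extern void disk_read(uint64_t lbn, void *buf);
extern void disk_write(uint64_t lbn, const void *buf);

/* Read: try the cache; on a miss, fetch from disk and insert clean data. */
void cache_read(uint64_t lbn, void *buf)
{
    if (ssc_read(lbn, buf) != 0) {       /* read returns an error on a miss */
        disk_read(lbn, buf);
        ssc_write_clean(lbn, buf);       /* populate the cache (clean)      */
    }
}

/* Write-through: update the disk first, then keep the cached copy clean. */
void cache_write(uint64_t lbn, const void *buf)
{
    disk_write(lbn, buf);
    ssc_write_clean(lbn, buf);
}
```

A write-back manager would instead use write-dirty and later clean or evict once the block has been flushed to disk.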

[Figure: % throughput relative to None (no consistency) for the mail, homes, usr, and proj workloads (write-intensive and read-intensive), comparing SSD WB, SSC WB, and SSC Dirty; the reported bars fall between 71% and 97%.]

[Figure: garbage collection on an SSD copies valid pages (5 reads/writes in the example) to reclaim stale blocks, while the SSC silently evicts clean data and needs no reads or writes; blocks are labeled valid, stale, clean, dirty, or free.]

[Figure: SSC with fixed log and data space (evict, GC) vs. SSC-V, which can convert blocks between log space and data space.]

[Figure: performance relative to SSD WB for the homes, mail, usr, and proj workloads (write-intensive and read-intensive), comparing SSD, SSC WT, SSC-V WT, SSC WB, and SSC-V WB; peaks of 201% and 268% on the write-intensive workloads and 98–99% on the read-intensive ones.]

Silent Eviction: An SSC employs silent eviction to lose clean data rather than copying it. It uses a cost/benefit mechanism to select blocks for eviction:
• Usage benefit: The cache manager uses evict/clean to identify cold clean data.
• Eviction cost: The SSC silently evicts the least-utilized cold clean data.

Caching is different from Storage
An SSD block cache is hindered by the narrow block interface and internal block management of SSDs, which are designed to serve as a disk replacement for persistent storage.


SSC: The SSC matches the requirements of caching behavior. It provides:
• Unified cache address space
• Consistent cache interface
• Cache-aware free space management
Cache Manager in OS:
• Adapts to the new interface of the SSC
• Operation modes: write-back for higher performance or write-through for higher safety

FlashTier Unified Address Space
• The cache manager directly addresses flash blocks in the SSC by disk LBN.
• The SSC only stores the hot data out of the terabytes of disk storage. As a result, the SSC optimizes for a sparse address space using a Sparse Hash Map [3], requiring only 8.4 bytes/key to keep the device memory footprint low.

Host and Device Memory Usage
The graph shows the host and device memory usage for the SSD cache and for FlashTier with SSC and SSC-V.
• The SSD cache manager stores state for all cached blocks.
• The FlashTier cache manager only maintains a dirty block table to clean/flush dirty LRU blocks to disk on exceeding the dirty mark.
• SSC-V stores more page-level mappings for variable log space and improved performance.
FlashTier reduces memory usage for host and device combined by 78% for SSC and 60% for SSC-V relative to the SSD block cache.

Consistent SSC Interface

Cost of Crash Consistency
The graph shows the relative cost of consistency for the SSD block cache persisting dirty pages, the SSC persisting both clean and dirty pages, and SSC Dirty persisting only dirty pages. The SSD block cache manager uses synchronous metadata updates, and the SSC uses internal logging and checkpointing mechanisms for a persistent mapping.

Crash Guarantees: An SSC never returns stale data or loses dirty data. This is guaranteed by consistent reads following a cache write to dirty/clean data and cache eviction. It is always safe for the cache manager to consult the SSC after a crash.

FlashTier decreases the cost of crash consistency relative to the SSD block cache. It has less than 16% overhead for all workloads.

Policies for Silent Eviction
The SSC uses two different policies for silently evicting a data block, which trade off cache performance against the device memory needed for page-level mappings of log blocks:
• SSC with fixed log space: Evicted data blocks are recycled for use as data blocks.
• SSC-V with variable log space: Evicted data blocks can be recycled for use as data blocks or can increase the fraction of log blocks.

System Performance
The graph shows the performance of SSC and SSC-V in write-back and write-through modes relative to the SSD cache in write-back mode.
FlashTier improves cache performance for write-intensive workloads by 168% with SSC-V and 101% with SSC. For read-intensive workloads, FlashTier performs as well as the SSD block cache despite silent evictions.

16 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Interconnecting Components §  Need interconnections between

•  CPU, memory, I/O controllers §  Bus: shared communication channel

•  Parallel set of wires for data and synchronization of data transfer

•  Can become a bottleneck §  Performance limited by physical factors

•  Wire length, number of connections §  More recent alternative:

high-speed serial connections with switches •  Like networks

§6.5 Connecting Processors, Memory, and I/O Devices

17 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Bus Types §  Processor-Memory buses

•  Short, high speed •  Design is matched to memory organization

§  I/O buses •  Longer, allowing multiple connections •  Specified by standards for interoperability •  Connect to processor-memory bus through a bridge

18 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Bus Signals and Synchronization §  Data lines

•  Carry address and data •  Multiplexed or separate

§  Control lines •  Indicate data type, synchronize transactions

§  Synchronous •  Uses a bus clock

§  Asynchronous •  Uses request/acknowledge control lines for

handshaking

19 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O Bus Examples

Firewire: external; 63 devices per channel; data width 4; peak bandwidth 50MB/s or 100MB/s; hot pluggable: yes; max length 4.5m; standard: IEEE 1394
USB 2.0: external; 127 devices per channel; data width 2; peak bandwidth 0.2MB/s, 1.5MB/s, or 60MB/s; hot pluggable: yes; max length 5m; standard: USB Implementers Forum
PCI Express: internal; 1 device per channel; data width 2/lane; peak bandwidth 250MB/s/lane (1×, 2×, 4×, 8×, 16×, 32×); hot pluggable: depends; max length 0.5m; standard: PCI-SIG
Serial ATA: internal; 1 device per channel; data width 4; peak bandwidth 300MB/s; hot pluggable: yes; max length 1m; standard: SATA-IO
Serial Attached SCSI: external; 4 devices per channel; data width 4; peak bandwidth 300MB/s; hot pluggable: yes; max length 8m; standard: INCITS TC T10

20 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Typical x86 PC I/O System

21 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O Management §  I/O is mediated by the OS

•  Multiple programs share I/O resources – Need protection and scheduling

•  I/O causes asynchronous interrupts –  Same mechanism as exceptions

•  I/O programming is fiddly – OS provides abstractions to programs

§6.6 Interfacing I/O Devices …

22 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O Commands §  I/O devices are managed by

I/O controller hardware •  Transfers data to/from device •  Synchronizes operations with software

§  Command registers •  Cause device to do something

§  Status registers •  Indicate what the device is doing and occurrence of

errors §  Data registers

•  Write: transfer data to a device •  Read: transfer data from a device

23 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O Register Mapping §  Memory mapped I/O

•  Registers are addressed in same space as memory •  Address decoder distinguishes between them •  OS uses address translation mechanism to make them

only accessible to kernel §  I/O instructions

•  Separate instructions to access I/O registers •  Can only be executed in kernel mode •  Example: x86
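A bare-metal C sketch of the memory-mapped style; the device, base address, and register layout are hypothetical:

```c
/* Memory-mapped I/O: device registers sit at fixed physical addresses and
 * are accessed with ordinary loads and stores. The layout below is made up. */
#include <stdint.h>

#define UART_BASE 0x10000000UL            /* assumed device base address */

typedef struct {
    volatile uint32_t status;             /* bit 0: ready, bit 1: error  */
    volatile uint32_t command;
    volatile uint32_t data;
} uart_regs_t;

#define UART ((uart_regs_t *)UART_BASE)

void uart_putc(char c)
{
    while ((UART->status & 0x1) == 0)     /* wait until the device is ready */
        ;                                 /* (polling; see the next slides) */
    UART->data = (uint32_t)c;             /* a store writes the data reg.   */
}
```

With the x86 I/O-instruction alternative, the same register would instead be reached with in/out instructions on a separate I/O port address space.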

24 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Polling §  Periodically check I/O status register

•  If device ready, do operation •  If error, take action

§  Common in small or low-performance real-time embedded systems •  Predictable timing •  Low hardware cost

§  In other systems, wastes CPU time
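A polling read in the same hypothetical register style as the previous sketch: check the status register, act when the device is ready, and handle the error bit:

```c
/* Polling loop over hypothetical status/data registers. */
#include <stdint.h>

#define STATUS_READY 0x1
#define STATUS_ERROR 0x2

#define UART_STATUS (*(volatile uint32_t *)0x10000000UL)  /* assumed addr */
#define UART_DATA   (*(volatile uint32_t *)0x10000008UL)  /* assumed addr */

int poll_getc(void)
{
    for (;;) {
        uint32_t s = UART_STATUS;           /* periodically check status  */
        if (s & STATUS_ERROR)
            return -1;                      /* if error, take action      */
        if (s & STATUS_READY)
            return (int)(UART_DATA & 0xFF); /* if ready, do the operation */
        /* otherwise keep spinning: timing is predictable, but in a
         * general-purpose system this busy-wait wastes CPU time.         */
    }
}
```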

25 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Interrupts §  When a device is ready or error occurs

•  Controller interrupts CPU §  Interrupt is like an exception

•  But not synchronized to instruction execution •  Can invoke handler between instructions •  Cause information often identifies

the interrupting device §  Priority interrupts

•  Devices needing more urgent attention get higher priority

•  Can interrupt handler for a lower priority interrupt

26 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O Data Transfer §  Polling and interrupt-driven I/O

•  CPU transfers data between memory and I/O data registers

•  Time consuming for high-speed devices

§  Direct memory access (DMA) •  OS provides starting address in memory •  I/O controller transfers to/from memory autonomously •  Controller interrupts on completion or error
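A sketch of how the OS might program a DMA transfer; the controller register layout is hypothetical:

```c
/* DMA: the OS hands the controller a (physical) buffer address and a
 * length, starts the transfer, and is interrupted on completion. */
#include <stdint.h>

typedef struct {
    volatile uint32_t addr;      /* physical address of the memory buffer */
    volatile uint32_t length;    /* transfer size in bytes                */
    volatile uint32_t control;   /* bit 0: start, bit 1: interrupt enable */
    volatile uint32_t status;    /* bit 0: done,  bit 1: error            */
} dma_regs_t;

/* Start a device-to-memory transfer. The buffer must be pinned and its
 * physical address known (see the DMA/VM interaction slide below). */
void dma_start(dma_regs_t *dma, uint32_t buf_phys, uint32_t len)
{
    dma->addr    = buf_phys;
    dma->length  = len;
    dma->control = 0x3;          /* start, and interrupt when finished */
}

/* Checked from the interrupt handler when the controller raises its IRQ. */
int dma_done_ok(const dma_regs_t *dma)
{
    return (dma->status & 0x3) == 0x1;   /* done and no error */
}
```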

27 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

DMA/Cache Interaction §  If DMA writes to a memory block

that is cached •  Cached copy becomes stale

§  If write-back cache has dirty block, and DMA reads memory block •  Reads stale data

§  Need to ensure cache coherence •  Flush blocks from cache if they will be used for DMA •  Or use non-cacheable memory locations for I/O

28 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

DMA/VM Interaction §  OS uses virtual addresses for memory

•  DMA blocks may not be contiguous in physical memory §  Should DMA use virtual addresses?

•  Would require controller to do translation §  If DMA uses physical addresses

•  May need to break transfers into page-sized chunks •  Or chain multiple transfers •  Or allocate contiguous physical pages for DMA

29 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Measuring I/O Performance §  I/O performance depends on

•  Hardware: CPU, memory, controllers, buses •  Software: operating system,

database management system, application •  Workload: request rates and patterns

§  I/O system design can trade-off between response time and throughput •  Measurements of throughput often done with

constrained response-time

§6.7 I/O Performance Measures: …

30 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Transaction Processing Benchmarks §  Transactions

•  Small data accesses to a DBMS •  Interested in I/O rate, not data rate

§  Measure throughput •  Subject to response time limits and failure handling •  ACID (Atomicity, Consistency, Isolation, Durability) •  Overall cost per transaction

§  Transaction Processing Performance Council (TPC) benchmarks (www.tpc.org)
•  TPC-App: B2B application server and web services
•  TPC-C: on-line order entry environment
•  TPC-E: on-line transaction processing for a brokerage firm
•  TPC-H: decision support (business-oriented ad-hoc queries)

31 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

File System & Web Benchmarks §  SPEC System File System (SFS)

•  Synthetic workload for NFS server, based on monitoring real systems

•  Results –  Throughput (operations/sec) – Response time (average ms/operation)

§  SPEC Web Server benchmark •  Measures simultaneous user sessions,

subject to required throughput/session •  Three workloads: Banking, Ecommerce, and Support

32 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O vs. CPU Performance §  Amdahl’s Law

•  Don’t neglect I/O performance as parallelism increases compute performance

§  Example •  Benchmark takes 90s CPU time, 10s I/O time •  Double the number of CPUs/2 years

–  I/O unchanged

Year   CPU time   I/O time   Elapsed time   % I/O time
now    90s        10s        100s           10%
+2     45s        10s        55s            18%
+4     23s        10s        33s            31%
+6     11s        10s        21s            47%

§6.9 Parallelism and I/O: RAID
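The table can be reproduced with a few lines of C (the slide rounds CPU and elapsed times to whole seconds):

```c
/* I/O vs. CPU performance: CPU time halves every two years, I/O time is
 * unchanged, so I/O becomes a growing fraction of elapsed time. */
#include <stdio.h>

int main(void)
{
    double cpu_s = 90.0, io_s = 10.0;

    printf("Year  CPU time  I/O time  Elapsed  %% I/O time\n");
    for (int year = 0; year <= 6; year += 2) {
        double elapsed = cpu_s + io_s;
        printf("+%d    %7.1fs  %7.1fs  %6.1fs     %.0f%%\n",
               year, cpu_s, io_s, elapsed, 100.0 * io_s / elapsed);
        cpu_s /= 2.0;                 /* double the CPUs every two years */
    }
    return 0;
}
```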

33 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID §  Redundant Array of Inexpensive (Independent)

Disks •  Use multiple smaller disks (c.f. one large disk) •  Parallelism improves performance •  Plus extra disk(s) for redundant data storage

§  Provides fault tolerant storage system •  Especially if failed disks can be “hot swapped”

§  RAID 0 •  No redundancy (“AID”?)

–  Just stripe data over multiple disks •  But it does improve performance

34 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 1 & 2 §  RAID 1: Mirroring

•  N + N disks, replicate data – Write data to both data disk and mirror disk – On disk failure, read from mirror

§  RAID 2: Error correcting code (ECC) •  N + E disks (e.g., 10 + 4) •  Split data at bit level across N disks •  Generate E-bit ECC •  Too complex, not used in practice

35 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 3: Bit-Interleaved Parity §  N + 1 disks

•  Data striped across N disks at byte level •  Redundant disk stores parity •  Read access

– Read all disks

•  Write access – Generate new parity and update all disks

•  On failure – Use parity to reconstruct missing data

§  Not widely used

36 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 4: Block-Interleaved Parity

§  N + 1 disks •  Data striped across N disks at block level •  Redundant disk stores parity for a group of blocks •  Read access

– Read only the disk holding the required block

•  Write access –  Just read disk containing modified block, and parity disk – Calculate new parity, update data disk and parity disk

•  On failure – Use parity to reconstruct missing data

§  Not widely used
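The small write described above boils down to XOR: the new parity is the old parity XOR the old data XOR the new data, so only the data disk and the parity disk are touched. A minimal C sketch (block size is illustrative):

```c
/* RAID 4/5 small-write parity update over one block. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

void update_parity(const uint8_t *old_data,   /* read from the data disk  */
                   const uint8_t *new_data,   /* block being written      */
                   uint8_t       *parity)     /* old parity in, new out   */
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```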

37 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 3 vs RAID 4

38 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 5: Distributed Parity §  N + 1 disks

•  Like RAID 4, but parity blocks distributed across disks –  Avoids parity disk being a bottleneck

§  Widely used

39 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID 6: P + Q Redundancy §  N + 2 disks

•  Like RAID 5, but two lots of parity •  Greater fault tolerance through more redundancy

§  Multiple RAID
•  More advanced systems give similar fault tolerance with better performance

40 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

RAID Summary §  RAID can improve performance and

availability •  High availability requires hot swapping

§  Assumes independent disk failures •  Too bad if the building burns down!

§  JBOD (Just a Bunch of Disks) •  An alternative solution to RAID

41 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

I/O System Design §  Satisfying latency requirements

•  For time-critical operations •  If system is unloaded

–  Add up latency of components

§  Maximizing throughput •  Find “weakest link” (lowest-bandwidth component) •  Configure to operate at its maximum bandwidth •  Balance remaining components in the system

§  If system is loaded, simple analysis is insufficient •  Need to use queuing models or simulation

§6.8 Designing an I/O System

42 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Fallacy: Disk Dependability §  If a disk manufacturer quotes MTTF as

1,200,000hr (140yr) •  A disk will work that long

§  Wrong: this is the mean time to failure •  What is the distribution of failures? •  What if you have 1000 disks

– How many will fail per year?

§6.12 Fallacies and Pitfalls

Annual Failure Rate (AFR) = (1000 disks × 8760 hrs/disk) / 1,200,000 hrs/failure = 7.3 failures/year, i.e., 0.73% of the disks fail per year
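The same calculation in C:

```c
/* AFR example: 1000 disks running a full year, quoted MTTF 1,200,000 hrs. */
#include <stdio.h>

int main(void)
{
    double disks      = 1000.0;
    double hours_year = 8760.0;
    double mttf_hours = 1200000.0;

    double failures_per_year = disks * hours_year / mttf_hours;    /* 7.3  */
    printf("failures/year = %.1f  (AFR = %.2f%%)\n",
           failures_per_year, 100.0 * failures_per_year / disks);  /* 0.73 */
    return 0;
}
```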

43 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Fallacies §  Disk failure rates are as specified

•  Studies of failure rates in the field –  Schroeder and Gibson: 2% to 4% vs. 0.6% to 0.8% –  Pinheiro, et al.: 1.7% (first year) to 8.6% (third year) vs. 1.5%

•  Why?

§  A 1GB/s interconnect transfers 1GB in one sec •  But what’s a GB? •  For bandwidth, use 1GB = 109 B •  For storage, use 1GB = 230 B = 1.075×109 B •  So 1GB/sec is 0.93GB in one second

–  About 7% error

44 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Pitfall: Offloading to I/O Processors

§  Overhead of managing I/O processor request may dominate •  Quicker to do small operation on the CPU •  But I/O architecture may prevent that

§  I/O processor may be slower •  Since it’s supposed to be simpler

§  Making it faster makes it into a major system component •  Might need its own coprocessors!

45 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Pitfall: Backing Up to Tape §  Magnetic tape used to have advantages

•  Removable, high capacity

§  Advantages eroded by disk technology developments

§  Makes better sense to replicate data •  E.g., RAID, remote mirroring

46 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Fallacy: Disk Scheduling §  Best to let the OS schedule disk accesses

•  But modern drives deal with logical block addresses – Map to physical track, cylinder, sector locations –  Also, blocks are cached by the drive

•  OS is unaware of physical locations – Reordering can reduce performance – Depending on placement and caching

47 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Pitfall: Peak Performance §  Peak I/O rates are nearly impossible to achieve

•  Usually, some other system component limits performance

•  E.g., transfers to memory over a bus – Collision with DRAM refresh –  Arbitration contention with other bus masters

•  E.g., PCI bus: peak bandwidth ~133 MB/sec –  In practice, max 80MB/sec sustainable

48 ICE3003: Computer Architecture | Spring 2014 | Euiseong Seo ([email protected])

Concluding Remarks §  I/O performance measures

•  Throughput, response time •  Dependability and cost also important

§  Buses used to connect CPU, memory, I/O controllers •  Polling, interrupts, DMA

§  I/O benchmarks •  TPC, SPECSFS, SPECWeb

§  RAID •  Improves performance and dependability

§6.13 Concluding Remarks

