20 November 2012 Sam Siewert
CSE A215 Assembly Language Programming
for Engineers Lecture 13 – Storage and I/O
(MMIO, Devices, Reliability/Availability, Performance)
Hardware/Software Interface for I/O
Basics and Driver Concept
PCI (Peripheral Component Interconnect) System
[Figure: PCI system block diagram. Labeled blocks: CPU, FSB, North Bridge, Graphics Adapter (AGP), SDRAM/DDR, PCI 2.x Bus, South Bridge, Ethernet, Expansion Slots, IDE, Audio, ISA Bus, Super IO, COM-A, COM-B]
Hardware View of Device Interfaces
Analog I/O
  – DAC analog output: servos, motors, heaters, ...
  – ADC analog input: photodiodes, thermistors, ...
Digital I/O
  – Direct TTL I/O or GPIO
  – Digital Serial (I2C, SPI, ... chip-to-chip)
  – Bus Interfaces
Parallel
  – PCI 2.x, PCI-X, SCSI, etc. (32-bit, 64-bit, synchronous parallel transfer)
Differential Serial
  – USB
  – InfiniBand
  – GigE / 10GbE Ethernet
  – Fibre Channel
  – SAS/SATA
Software View of Drivers
MMIO
  – Device Buffers Decode Memory Bus Addresses (Outside of RAM address space)
Character
  – Register Control/Config, Status, Data
  – Typical of Low-Rate I/O Interfaces (RS232)
  – Linux User Space Buffer Drivers (Direct IO), e.g. SCSI Generic
Block
  – FIFOs, Dual-Port RAM and DMA
  – Typical of High-Rate I/O Interfaces (Network, Storage)
  – Only Interface for 512 Byte LBA/Sector HDDs
Network
  – Driver Stacks
  – OSI 7 Layer Model (Phy, Link, Network, Transport, Session, Presentation, Application)
  – TCP/IP/Ethernet/Cat-6e
Linux Char Driver Design
Application Interface
  – Application Policy
  – Blocking/Non-Blocking
  – Multi-thread access
  – Abstraction
Device Interface
  – SW/HW Interface
  – Immediate Buffering
  – Interrupt Service Routine
[Figure: Application(s), App/Device Interface, and Hardware Device. Labels on the application side: open/close, read/write, creat, ioctl; EAGAIN, Block, Data, Status. Labels on the device side: ISR, SemGive, Input Ring-Buffer, Output Ring-Buffer]
Write path: If Output Ring-Buffer Full then {SemTake or EAGAIN} else {Process and Return}
Read path: If Input Ring-Buffer Empty then {SemTake or EAGAIN} else {Process and Return}
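The blocking versus EAGAIN choice on the read path can be sketched in user-space Python. This is a toy model, not kernel code: the class name `CharDeviceSketch`, its `isr` and `read` methods, and the default capacity are all hypothetical names chosen for illustration.

```python
import errno
import threading
from collections import deque


class CharDeviceSketch:
    """Toy model of the input side of a character driver (not kernel code)."""

    def __init__(self, capacity=8):
        self.ring = deque()                       # input ring-buffer filled by the "ISR"
        self.capacity = capacity
        self.lock = threading.Lock()              # protects the ring-buffer
        self.data_ready = threading.Semaphore(0)  # counts buffered bytes

    def isr(self, byte):
        """Called when the simulated device delivers one byte."""
        with self.lock:
            if len(self.ring) >= self.capacity:
                return False                      # overrun: the byte is dropped
            self.ring.append(byte)
        self.data_ready.release()                 # SemGive: wake a blocked reader
        return True

    def read(self, blocking=True):
        """Return one byte; block while empty, or raise EAGAIN if non-blocking."""
        # SemTake: waits for data when blocking, fails immediately otherwise
        if not self.data_ready.acquire(blocking=blocking):
            raise BlockingIOError(errno.EAGAIN, "input ring-buffer empty")
        with self.lock:
            return self.ring.popleft()
```

A non-blocking caller would catch `BlockingIOError` and retry later, while a blocking caller simply sleeps in `read()` until `isr()` gives the semaphore.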
Cached Memory and DMA
Cache Coherency
  – Making sure that cached data and memory are in sync
  – Can become out of sync due to DMAs and Multi-Processor Caches
  – Push Caches Allow for DMA into and out of Cache Directly
  – Cache Snooping by HW may Obviate Need for Invalidate
Drivers Must Ensure Cache Coherency
  – Invalidate Memory Locations on DMA Read Completion
  – Flush Cache Prior to DMA Write Initiation
IO Data Cache Line Alignment
  – Ensure that IO Data is Aligned on Cache Line Boundaries
  – Other Data That Shares Cache Line with IO Data Could Otherwise Be Errantly Invalidated
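The alignment rule can be checked with a little address arithmetic. The sketch below assumes a 64-byte cache line (an assumption; the real line size depends on the CPU), and the helper names are hypothetical.

```python
CACHE_LINE = 64  # bytes; assumed line size, check the target CPU


def align_down(addr, line=CACHE_LINE):
    """Round an address down to the start of its cache line."""
    return addr & ~(line - 1)


def align_up(addr, line=CACHE_LINE):
    """Round an address up to the next cache line boundary."""
    return (addr + line - 1) & ~(line - 1)


def shares_lines_with_other_data(addr, length, line=CACHE_LINE):
    """True if the first or last cache line of the buffer also covers
    bytes outside the buffer, so invalidating it could hit other data."""
    first_line = align_down(addr, line)
    end_of_last_line = align_up(addr + length, line)
    return first_line != addr or end_of_last_line != addr + length


print(shares_lines_with_other_data(0x1000, 512))  # aligned start, whole lines
print(shares_lines_with_other_data(0x1010, 512))  # starts mid-line
```

A DMA buffer that starts on a line boundary and spans a whole number of lines can be invalidated without touching neighbouring variables.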
How Reliable and Available are Data Center Systems?
Availability vs. Reliability
Reliability and Recovery
Redundancy
  – Dual String: Side A, Side B; Pilot, Co-Pilot
  – Fail-Over: Fault Detection, Protection, Recovery
  – Backup System: Independent Design, e.g. Backup Flight System
  – Cross Strapping of Sides
Dual String A & B
  – 3 Components C1, C2, C3
  – 8 Possible Configurations
  – 4 Component Switches
  – A|B Select Switch
[Figure: Side A and Side B, each with components C1, C2, C3, cross-strapped through switches SW1, SW2, SW3, SW4]
Configuration   C1   C2   C3
1               A    A    A
2               A    A    B
3               A    B    A
4               A    B    B
5               B    A    A
6               B    A    B
7               B    B    A
8               B    B    B
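The eight cross-strapped configurations are every way of choosing side A or side B for each of the three components, that is 2^3 = 8. A minimal sketch that enumerates them in the same order as the configuration table:

```python
from itertools import product

components = ("C1", "C2", "C3")

# Each component is independently taken from side A or side B.
configs = list(product("AB", repeat=len(components)))

for number, sides in enumerate(configs, start=1):
    print(number, dict(zip(components, sides)))
```

Adding a fourth cross-strapped component would double the count to 16, which is why the number of configurations to test grows quickly with the number of cross-strapped components.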
High Availability
Service Up-time is Figure of Merit
  – Number of Times down?
  – How long down?
  – Quick recovery is key
Hot or Warm Spare Equipment Protection
  – Fault Detection and Fail Over Without Service Outage
  – Excess Capacity
E.g.
  – Diverse Routing in a Network
  – Overlapping Coverage in Cell Phone Systems
  – On-orbit spare satellites
Availability vs. Reliability
Are They the Same?
  – Are all Reliable Systems Highly Available?
  – Are all Highly Available Systems Reliable?
Reliability = Long MTTF
  – Mean Time To Failure
  – Mean Time Between Failures, MTBF = MTTF + MTTR
  – FDIR When Failures Do Occur: Fault Detection, Isolation and Recovery; Safing
  – MTTR (Mean Time to Recover)
Availability = MTTF / (MTTF + MTTR) = % Uptime
  – MTTF = 8,766 hours (525,960 minutes)
  – MTTR = 5 minutes
  – Availability = 525,960 / (525,960 + 5) = 99.999% Uptime
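The "five nines" arithmetic can be reproduced directly from the formula. A minimal sketch using the slide's numbers; the function name `availability` is just an illustrative helper.

```python
def availability(mttf, mttr):
    """Fraction of time in service; both arguments in the same time unit."""
    return mttf / (mttf + mttr)


mttf_minutes = 8766 * 60   # 8,766 hours is one average year (365.25 days)
mttr_minutes = 5

a = availability(mttf_minutes, mttr_minutes)
print(f"MTTF = {mttf_minutes:,} minutes")
print(f"Availability = {a:.5%}")
```

With the same MTTF, raising MTTR from 5 minutes to 53 minutes drops availability to roughly 99.99%, which illustrates why quick recovery matters as much as long time to failure.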
Storage I/O
Storing Data Long Term
A Single Disk Drive
  – Read and Write 512-byte Sectors at LBA (Logical Block Address)
  – A 2TB 3.5” SATA Disk Drive has about 4 billion 512-byte sectors to manage
  – The Operating System SATA/SCSI Driver and Filesystem Layered on the Block Driver Provide Use of a Disk Drive
  – The Operating System Caches Pages (Typically 4K) that are Written Back (like CPU cache) from RAM to Disk When Needed (See slabtop)
  – Filesystem Manages Access to Sectors
  – Block I/O Can Be Done Directly as Well
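The sector count and the LBA-to-byte mapping are simple arithmetic. The sketch below assumes decimal terabytes (2 × 10^12 bytes), as drive vendors usually quote capacity; the helper names are hypothetical.

```python
SECTOR_BYTES = 512


def sector_count(capacity_bytes):
    """Number of whole 512-byte sectors in a given capacity."""
    return capacity_bytes // SECTOR_BYTES


def lba_to_byte_offset(lba):
    """Byte offset on the device where a logical block starts."""
    return lba * SECTOR_BYTES


def byte_range_to_lbas(offset, length):
    """First and last LBA touched by a byte range (inclusive)."""
    first = offset // SECTOR_BYTES
    last = (offset + length - 1) // SECTOR_BYTES
    return first, last


two_tb = 2 * 10**12                      # assumed decimal terabytes
print(f"{sector_count(two_tb):,} sectors")
print(byte_range_to_lbas(4096, 4096))    # one 4K page at byte offset 4096
```

A 4K page maps onto eight consecutive 512-byte sectors, which is why page-sized writes from the cache turn into multi-sector block requests.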
RAID-10
RAID-0 Striping Over RAID-1 Mirrors
[Figure: blocks A1, A2, A3, … A12 striped over three RAID-1 Mirror pairs; each block appears twice, once on each disk of its mirror pair]
RAID5,6 XOR Parity Encoding
MDS Encoding, Can Achieve High Storage Efficiency with N+1: N/(N+1) and N+2: N/(N+2)
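The single-parity (N+1) case can be illustrated with plain XOR: the parity block is the XOR of the data blocks, and any one lost block is the XOR of everything that survives. This sketch covers only the RAID-5 style P parity; the second (Q) parity used by RAID-6 needs Galois-field arithmetic and is not shown. The block contents are made-up sample data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # sample data blocks (N = 4)
parity = xor_blocks(data)                      # the +1 parity block

# Simulate losing the third block, then rebuild it from the survivors.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])
```

The same function computes the parity and performs the rebuild, because XOR is its own inverse.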
[Figure: Storage Efficiency (0% to 100%) versus Number of Data Devices (1 to 20) for 1 XOR or 2 P,Q Encoded Devices; curves for RAID5 and RAID6]
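The efficiency curves follow from the ratios N/(N+1) for one parity device and N/(N+2) for two. A minimal sketch that tabulates a few points; the function name is an illustrative helper.

```python
def storage_efficiency(n_data, n_parity):
    """Fraction of raw capacity that holds user data."""
    return n_data / (n_data + n_parity)


print(" N   RAID5   RAID6")
for n in (1, 2, 4, 8, 16, 20):
    raid5 = storage_efficiency(n, 1)   # N+1: one XOR parity device
    raid6 = storage_efficiency(n, 2)   # N+2: P and Q parity devices
    print(f"{n:2d}  {raid5:6.1%}  {raid6:6.1%}")
```

Both ratios approach 100% as N grows, with RAID6 always below RAID5 for the same number of data devices.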
RAID-50
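RAID-0 Striping Over RAID-5 Sets
[Figure: two RAID-5 Sets of five disks each with rotating parity; layout reconstructed from the slide, second set identical with suffix 2]
RAID-5 Set 1
A1        B1        C1        D1        P(ABCD)
E1        F1        G1        P(EFGH)   H1
I1        J1        P(IJKL)   K1        L1
M1        P(MNOP)   N1        O1        P1
P(QRST)   Q1        R1        S1        T1
RAID-5 Set 2
A2        B2        C2        D2        P(ABCD)
E2        F2        G2        P(EFGH)   H2
I2        J2        P(IJKL)   K2        L2
M2        P(MNOP)   N2        O2        P2
P(QRST)   Q2        R2        S2        T2
Striping order: A1,B1,C1,D1,A2,B2,C2,D2,E1,F1,G1,H1,…, Q2,R2,S2,T2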
RAID-60 (Reed-Solomon Encoding)
RAID-0 Striping Over RAID-6 Sets
[Figure: two RAID-6 Sets of six disks each with rotating P and Q parity; layout reconstructed from the slide, second set identical with suffix 2]
RAID-6 Set 1
Disk1     Disk2     Disk3     Disk4     Disk5     Disk6
A1        B1        C1        D1        P(ABCD)   Q(ABCD)
E1        F1        G1        P(EFGH)   Q(EFGH)   H1
I1        J1        P(IJKL)   Q(IJKL)   K1        L1
M1        P(MNOP)   Q(MNOP)   N1        O1        P1
P(QRST)   Q(QRST)   Q1        R1        S1        T1
RAID-6 Set 2
Disk1     Disk2     Disk3     Disk4     Disk5     Disk6
A2        B2        C2        D2        P(ABCD)   Q(ABCD)
E2        F2        G2        P(EFGH)   Q(EFGH)   H2
I2        J2        P(IJKL)   Q(IJKL)   K2        L2
M2        P(MNOP)   Q(MNOP)   N2        O2        P2
P(QRST)   Q(QRST)   Q2        R2        S2        T2
Striping order: A1,B1,C1,D1,A2,B2,C2,D2,E1,F1,G1,H1,…, Q2,R2,S2,T2
I/O Performance
Some Methods to Improve I/O on Linux
Hiding IO Latency – Overlapping with Processing
Simple Design
  – Each Thread has READ, PROCESS, WRITE-BACK Execution
  – Frame rate is set by READ+PROCESS+WRITE-BACK latency, e.g. 10 fps for 100 milliseconds per frame
  – If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK 20 msec, the predominant time is IO time, not processing
  – Disk drive with 100 MB/sec READ rate can only read 16 fps at 62.5 msec READ latency per frame
READ F(1) Process F(1) Write-back F(1) READ F(2)
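The frame-rate numbers for the simple sequential design can be worked through as follows. The 6.25 MB frame size is not stated on the slide; it is the size implied by a 62.5 msec read at 100 MB/sec, and is computed here only to show the relationship.

```python
def frames_per_second(*stage_msec):
    """Frame rate when every stage of a frame runs back to back."""
    return 1000.0 / sum(stage_msec)


read_ms, process_ms, write_back_ms = 70, 10, 20

fps = frames_per_second(read_ms, process_ms, write_back_ms)
io_fraction = (read_ms + write_back_ms) / (read_ms + process_ms + write_back_ms)
print(f"{fps:.1f} fps, {io_fraction:.0%} of each frame spent in IO")

# Read-only limit for a 100 MB/sec drive at 62.5 msec per frame.
read_rate_bytes_per_sec = 100e6
read_latency_sec = 0.0625
implied_frame_bytes = read_rate_bytes_per_sec * read_latency_sec
print(f"{1 / read_latency_sec:.0f} fps read limit, "
      f"{implied_frame_bytes / 1e6:.2f} MB per frame implied")
```

Because IO takes most of each frame period, speeding up PROCESS alone changes the frame rate very little; overlapping IO with processing is what pays off.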
Hiding IO Latency
Schedule Multiple Overlapping Threads?
  – Requires Nthreads = Nstages x Ncores
  – 1.5 to 2x Number of Threads for SMT (Hyper-threading)
  – For IO Stage Duration Similar to Processing Time
  – More Threads if IO Time (Read+WB+Read) >> 3 x Processing Time
[Figure: overlapped thread timelines per core; Start-up followed by Continuous Processing]
Core #1
  READ F1 | Process F1 | Write-back F1 | READ F4 | Process F4 | Write-back F4
  READ F2 | Process F2 | Write-back F2 | READ F5 | Process F5 | …
  READ F3 | Process F3 | Write-back F3 | Read F6 | …
Core #2
  READ F1 | Process F1 | Write-back F1 | READ F4 | Process F4 | Write-back F4
  READ F2 | Process F2 | Write-back F2 | READ F5 | Process F5 | …
  READ F3 | Process F3 | Write-back F3 | Read F6 | …
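The thread-count rule for overlapped threads can be written as a small helper. This is only a sizing sketch: the function name is hypothetical, and treating the SMT guidance as a multiplier in the range 1.5 to 2.0 is an interpretation of the rule of thumb above.

```python
import math


def overlapped_thread_count(n_stages, n_cores, smt_factor=1.0):
    """Threads needed so each core always has one thread per stage in flight.

    smt_factor is 1.0 without SMT, or an assumed 1.5 to 2.0 with SMT.
    """
    return math.ceil(n_stages * n_cores * smt_factor)


# READ, PROCESS, WRITE-BACK on a dual-core machine
print(overlapped_thread_count(3, 2))
print(overlapped_thread_count(3, 2, smt_factor=1.5))
```

The result is a starting point for tuning; when IO time is much longer than processing time, more threads are needed to keep the cores busy.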
Hiding Latency – Dedicated IO
Schedule Reads Ahead of Processing
  – Requires Nthreads = 2 + Ncores
  – Synchronize Frame Ready/Write-backs
  – Balance Stage Read/Write-Back Latency to Processing
  – 1.5 to 2x Threads for SMT (Hyper-threading)
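[Figure: Dual-Core Concurrent Processing with dedicated read and write-back threads; Start-up, then Completion]
  Processing: Wait | Process F1 | Process F3 | Process F5 | …
  Processing: Wait | Process F2 | Process F4 | Process F6
  Read:       Read F1 | Read F2 | Read F3 | Read F4 | Read F5 | Read F6 | Read F7 | Read F8
  Write-back: Wait | … | WB F1 | WB F2 | WB F3 | WB F4 | WB F5 | WB F6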
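A dedicated-IO pipeline with 2 + Ncores threads can be sketched with queues: one reader thread, one write-back thread, and one processing thread per core. This is a minimal illustration in which list iteration stands in for disk reads and a dictionary stands in for disk write-back; `run_pipeline` and the read-ahead depth of 4 are assumptions, not part of the lecture.

```python
import queue
import threading


def run_pipeline(frames, transform, n_workers=2):
    """Process frames with 1 reader + n_workers processors + 1 writer."""
    to_process = queue.Queue(maxsize=4)   # bounded read-ahead (assumed depth)
    to_write = queue.Queue()
    written = {}

    def reader():
        for index, frame in enumerate(frames):   # stands in for disk reads
            to_process.put((index, frame))        # blocks when read-ahead is full
        for _ in range(n_workers):
            to_process.put(None)                  # one stop marker per worker

    def worker():
        while True:
            item = to_process.get()               # waits until a frame is ready
            if item is None:
                break
            index, frame = item
            to_write.put((index, transform(frame)))
        to_write.put(None)                        # tell the writer this worker is done

    def writer():
        finished = 0
        while finished < n_workers:
            item = to_write.get()
            if item is None:
                finished += 1
            else:
                written[item[0]] = item[1]        # stands in for disk write-back

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [written[i] for i in range(len(written))]


print(run_pipeline([1, 2, 3, 4, 5], lambda x: x * 2))
```

The blocking `get()` and `put()` calls provide the frame-ready and write-back synchronization; in CPython the global interpreter lock limits true parallel processing, so this shows the structure rather than the speedup.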
Processing Latency Alone
Write Code with Memory Resident Frames
  – Load Frames in Advance
  – Process In-Memory Frames Over and Over
  – Do No IO During Processing
  – Provides Baseline Measurement of Processing Latency per Frame Alone
  – Provides Method of Optimizing Processing Without IO Latency
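A processing-only baseline can be measured by timing the transform over frames that are already in memory. The sketch below uses synthetic frames and a hypothetical `invert` transform as stand-ins for real image data and real processing.

```python
import time


def mean_processing_latency(frames, transform, passes=5):
    """Average seconds per frame with all frames already resident in memory."""
    start = time.perf_counter()
    for _ in range(passes):            # process the same frames over and over
        for frame in frames:
            transform(frame)           # no IO happens inside this loop
    elapsed = time.perf_counter() - start
    return elapsed / (passes * len(frames))


def invert(frame):
    """Hypothetical per-frame transform: invert every byte."""
    return bytes(255 - b for b in frame)


# Load frames in advance (synthetic 16 KiB frames here).
frames = [bytes(range(256)) * 64 for _ in range(8)]

latency = mean_processing_latency(frames, invert)
print(f"{latency * 1e3:.3f} msec per frame, processing only")
```

Repeating the passes averages out timer noise and gives a number that can be compared before and after optimizing the transform.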
IO Latency Alone
Comment Out Frame Transformation Code or Call Stubbed NULL Function
  – Provides Measurement of IO Frame Rate Alone
  – Essentially Zero Latency Transform
  – No Change Between Input Frames and Output Frames
  – Allows for Tuning of IO Scheduler and Threading
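An IO-only measurement keeps the read and write-back code and swaps the transform for a stub. The sketch below uses temporary files with random contents as hypothetical frames; on a real system the page cache will serve recently written files from RAM, so measured rates for such small files mostly reflect memory speed.

```python
import os
import tempfile
import time


def null_transform(frame):
    """Stubbed transform: output frame is the input frame, unchanged."""
    return frame


def io_only_frame_rate(paths, out_dir, transform=null_transform):
    """Frames per second for read + write-back with a zero-latency transform."""
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as src:
            frame = src.read()
        out_path = os.path.join(out_dir, os.path.basename(path))
        with open(out_path, "wb") as dst:
            dst.write(transform(frame))
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed


with tempfile.TemporaryDirectory() as in_dir, tempfile.TemporaryDirectory() as out_dir:
    paths = []
    for i in range(4):                               # create four 64 KiB test frames
        path = os.path.join(in_dir, f"frame{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(64 * 1024))
        paths.append(path)

    fps = io_only_frame_rate(paths, out_dir)

    outputs_match = True                             # stub must not change frames
    for path in paths:
        with open(path, "rb") as a, \
                open(os.path.join(out_dir, os.path.basename(path)), "rb") as b:
            outputs_match = outputs_match and a.read() == b.read()

print(f"{fps:.0f} frames/sec, IO only; outputs unchanged: {outputs_match}")
```

Comparing this rate with the processing-only baseline shows which side of the pipeline limits the overall frame rate.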
Tips for IO Scheduling
blockdev --getra /dev/sda
  – Should return 256 (512-byte sectors)
  – Means that reads read-ahead up to 128K
  – Function calls (read, fread) should request as much as possible
  – Check “actual bytes read”, re-read as needed in a loop
blockdev --setra 16384 /dev/sda (8MB)
Switch CFQ to Deadline
  – Use “lsscsi” to verify your disk is /dev/sda … substitute block driver interface used for file system if not sda
  – cat /sys/block/sda/queue/scheduler
  – echo deadline > /sys/block/sda/queue/scheduler
  – Options are noop, cfq, deadline
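The advice to check the actual bytes read and re-read in a loop can be sketched with `os.read`, which may return fewer bytes than requested. The helper names `read_fully` and `readahead_bytes` are hypothetical; the second one only converts a read-ahead setting in 512-byte sectors to bytes.

```python
import os


def read_fully(fd, nbytes):
    """Read up to nbytes from fd, looping because a read may return fewer."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.read(fd, remaining)   # request as much as is still needed
        if not chunk:                    # empty result means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)          # actual bytes read may be short
    return b"".join(chunks)


def readahead_bytes(sectors, sector_bytes=512):
    """Convert a read-ahead value in sectors to bytes."""
    return sectors * sector_bytes


print(readahead_bytes(256) // 1024, "KiB read-ahead at 256 sectors")
print(readahead_bytes(16384) // (1024 * 1024), "MiB read-ahead at 16384 sectors")
```

A caller would open the frame file with `os.open(path, os.O_RDONLY)`, pass the descriptor and the frame size to `read_fully`, and compare the returned length with the expected size.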