High Speed Sequential IO on Windows NT™ 4.0 (sp3)
Erik Riedel (of CMU)
Catharine van Ingen
Jim Gray
http://Research.Microsoft.com/BARC/Sequential_IO/
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
We Got a Lot of Help
• Brad Waters, Wael Bahaa-El-Din, and Maurice Franklin shared experience, results, tools, and a hardware lab; helped us understand NT; gave feedback on our preliminary measurements
• Tom Barclay: iostress benchmark program
• Barry Nolte & Mike Parkes: allocate issues
• Doug Treuting, Steve Mattos + Adaptec: SCSI and Adaptec device drivers
• Bill Courtright, Stan Skelton, Richard Vanderbilt, Mark Regester: loaned us a Symbios Logic array, host adapters, and their expertise
• Will Dahli: helped us understand NT configuration and measurement
• Joe Barrera, Don Slutz & Felipe Cabrera: valuable comments and feedback, and help in understanding NTFS internals
• David Solomon: Inside Windows NT 2nd edition draft
The Actors: Measured & Modeled Sequential IO
• Where are the bottlenecks?
• How does it scale with SMP, RAID, new interconnects?
[Diagram: app address space – file cache – memory – mem bus – PCI – adapter – SCSI – controller – disk]
Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s)
PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram: application data -> file system buffers -> PCI -> SCSI -> disk; advertised vs actual sequential rate:]
  – System bus: 422 MBps advertised, 7.2 MB/s actual
  – PCI: 133 MBps advertised, 7.2 MB/s actual
  – SCSI: 40 MBps advertised, 7.2 MB/s actual
  – Disk: 10-15 MBps advertised, 7.2 MB/s actual
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
Two Basic Shapes
• Circle (disk)
  – storage frequently returns to the same spot, so less total surface area
• Line (tape)
  – lots more area
  – longer time to get to the data
• Key idea: multiplex an expensive read/write head over a large storage area: trade $/GB for accesses/second
Disk Terms
• A disk's recording surfaces are called platters
• Data is recorded on tracks (circles) on each platter
• Tracks are formatted into fixed-size sectors
• A pair of read/write heads for each platter, mounted on a disk arm
• Clients address logical blocks (cylinder, head, sector)
• Bad blocks are remapped to spare good blocks
Disk Access Time
• Access time = SeekTime (6 ms) + RotateTime (3 ms) + ReadTime (1 ms)
• Rotate time: 5,000 to 10,000 rpm
  – ~12 to 6 milliseconds per rotation
  – ~6 to 3 ms rotational latency
  – Improved 3x in 20 years
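For example, at 7,200 rpm (the speed of the disks measured later) one rotation takes 60,000 / 7,200 ≈ 8.3 ms, so the average rotational latency is about 4.2 ms (half a rotation).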
Disk Seek Time
• Seek time is ~ Sqrt(distance), since distance = 1/2 x acceleration x time²
• Specs assume a seek of 1/3 of the disk
• Short seeks are common (over 50% are zero length)
• Typical 1/3-stroke seek time: 8 ms
• 4x improvement in 20 years
[Figure: head speed vs time during a seek – full accelerate, then full stop]
Read/Write Time: Density
• Time = Size / BytesPerSecond
• Bytes/Second = Speed x Density: 5 to 15 MBps
• MAD (Magnetic Areal Density)
  – today: 3 Gbits/inch² (5 Gbpsi in the lab)
  – rising > 60%/year
  – paramagnetic limit: 10 Gbits/inch²
  – linear density grows ~sqrt(10)x per decade
[Chart: Hoagland's Law – MAD (Mbpsi) vs year, 1970-2000, log scale 1 to 10,000]
[Chart: throughput (MB/s, 0-10) vs radial distance (0%-100%) for Fast Wide SCSI and Ultra SCSI]
Read/Write Time: Rotational Speed
• Bytes/Second = Speed x Density
• Speed is greater at the edge of the circle
• Speed: 3,600 -> 10,000 rpm (5%/year improvement)
• Bit rate varies by ~1.5x today
[Diagram: doubling the radius (r = 1 -> r = 2) quadruples the area (r² = 1 -> r² = 4)]
Read/Write Time: Zones
• Disks are sectored
  – typical: 512 bytes/sector
  – sector is the read/write unit
  – failfast: can detect bad sectors
• Disks are zoned
  – outer zones have more sectors
  – bytes/second is higher in outer zones
[Diagram: zoned platter – 14 sectors/track in the outer zone, 8 sectors/track in the inner zones]
Disk Access Time
• Access time = SeekTime (6 ms, improving 5%/y) + RotateTime (3 ms, 5%/y) + ReadTime (1 ms, 25%/y)
• Other useful facts:
  – Power rises more than size³ (so small is indeed beautiful)
  – Small devices are more rugged
  – Small devices can use plastics (forces are much smaller), e.g. bugs fall without breaking anything
The Access Time Myth
• The Myth: seek or pick time dominates
• The Reality: (1) queuing dominates, (2) transfer dominates BLOBs, (3) disk seeks are often short
• Implication: many cheap servers are better than one fast, expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is now obvious for disk arrays; it will be obvious for tape arrays.
[Diagram: access-time pie charts – seek, rotate, transfer; and with queuing: wait, seek, rotate, transfer]
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
[Chart: disk performance vs time, 1980-2000 – seeks per second, bandwidth (MB/s), capacity (GB)]
[Chart: disk accesses/second vs time, 1980-2000]
[Chart: storage price vs time, 1980-2000 – megabytes per kilo-dollar, 0.1 to 10,000]
• DRAM/disk media price ratio changed
  – 1970-1990: 100:1
  – 1990-1995: 10:1
  – 1995-1997: 50:1
  – today: ~ $0.2/MB disk, $10/MB DRAM
Year 2002 Disks
• Big disk (10 $/GB)
  – 3”
  – 100 GB
  – 150 kaps (k accesses per second)
  – 20 MBps sequential
• Small disk (20 $/GB)
  – 3”
  – 4 GB
  – 100 kaps
  – 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)
Tape & Optical: Beware of the Media Myth
• Optical is cheap: 200 $/platter, 3 GB/platter => 70 $/GB (cheaper than disc)
• Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc)
The Media Myth
• Tape needs a robot (10 k$ … 3 m$) for 10 … 1,000 tapes (at 20 GB each) => 10 $/GB … 150 $/GB (1x…10x cheaper than disc)
• Optical needs a robot (100 k$) for 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than magnetic disc)
• Robots have poor access times: not good for the Library of Congress (25 TB)
• Data motel: data checks in, but it never checks out!
Crazy Disk Ideas
• Disk farm on a card: surface-mount disks
• Disk (magnetic store) on a chip: micro-machines in silicon
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM)

The Disk Farm On a Card
• The 100 GB disc card: an array of discs (plus an ASIC) on a 14" card
• Can be used as 100 discs, 1 striped disc, 10 fault-tolerant discs, …etc
• LOTS of accesses/second and bandwidth
• Life is cheap, it's the accessories that cost ya.
• Processors are cheap, it's the peripherals that cost ya (a 10 k$ disc card).
Functionally Specialized Cards
• Storage
• Network
• Display
Each card carries a P-mips processor, M MB of DRAM, and ASICs.
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB
It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get:
  – several network interfaces
  – a PostScript engine (cpu, memory, software, a spooler (soon))
  – and… a print engine.
Tera Byte Backplane
• TODAY
  – The disk controller is a 10-mips risc engine with 2 MB DRAM
  – The NIC is of similar power
• SOON
  – They will become 100-mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in the disk controller).
• Advantages:
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)
All Device Controllers will be Cray 1’s
[Diagram: central processor & memory surrounded by device controllers]
System On A Chip
• Integrate processing with memory on one chip
  – chip is 75% memory now
  – 1 MB cache >> 1960 supercomputers
  – 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM,… projects abound
• Integrate networking with processing on one chip
  – system bus is a kind of network
  – ATM, FiberChannel, Ethernet,… logic on chip
  – direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
With Tera Byte Interconnect and Super Computer Adapters
• Processing is incidental to networking, storage, and UI
• The disk controller/NIC is faster than the device, close to the device, and can borrow the device package & power
• So use the idle capacity for computation: run the app in the device.
[Diagram: devices on a terabyte backplane]
Implications
• Conventional:
  – Offload device handling to the NIC/HBA
  – Higher-level protocols: I2O, NASD, VIA…
  – SMP and cluster parallelism is important.
• Radical:
  – Move the app to the NIC/device controller
  – Higher-higher level protocols: CORBA / DCOM
  – Cluster parallelism is VERY important.
[Diagram: conventional vs radical – central processor & memory with devices on a terabyte backplane]
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? DCOM? IIOP? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed-system story.
[Diagram: two stacks of Applications over RPC?, streams, and datagrams, over VIAL/VIPL, connected by wire(s)]
Will He Ever Get to The Point?
• I thought this was about NTFS sequential IO.
• Why is he telling me all this other crap?
It is relevant background.
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
The Actors
[Diagram: app address space – file cache – memory – mem bus – PCI – adapter – SCSI – controller]
• Processor – memory bus
• Memory: holds the file cache and app data
• Application: reads and writes memory
• The disk: writes, stores, reads data
• The disk controller: manages the drive (error handling), reads & writes the drive, converts SCSI commands to disk actions, may buffer or do RAID
• The SCSI bus: carries bytes
• The host-bus adapter: protocol converter to the system bus, may do RAID
Sequential vs Random IO
• Random IO is typically small (8 KB)
  – seek + rotate + transfer is ~10 ms
  – 100 IOs per second
  – 800 KB per second
• Sequential IO is typically large
  – almost no seek (one per cylinder read/written)
  – no rotational delay (reading the whole disk track)
  – runs at MEDIA speed: 8 MB per second
• Sequential is 10x more bandwidth than random!
Basic File Concepts
• Buffered:
  – file reads/writes go to the file cache
  – the file system does pre-fetch, post-write, and aggregation
  – data is written to disk at file close, or by LRU, or by lazy write
• Unbuffered: bypasses the file cache
• Overlapped:
  – requests are pipelined
  – completions arrive via events or completion ports
  – a simpler alternative to multi-threaded IO
• Temporary files:
  – written to the cache, not flushed on close
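Each mode is selected by flags at CreateFile time. A minimal sketch of the four variants (file names are hypothetical; error checking omitted, as in the sample that follows):

#include <windows.h>

int main()
{
    // Buffered, sequential: the default cached path; the hint enables
    // NTFS 64 KB read-ahead.
    HANDLE hBuf = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    // Unbuffered: bypasses the file cache; buffer addresses and request
    // sizes must then be sector aligned.
    HANDLE hRaw = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                    OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);

    // Overlapped: lets one thread keep several requests in flight.
    HANDLE hOvl = CreateFile("C:\\input.dat", GENERIC_READ, 0, NULL,
                    OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    // Temporary: written to the cache; the system avoids flushing it
    // if it can.
    HANDLE hTmp = CreateFile("C:\\scratch.tmp", GENERIC_READ | GENERIC_WRITE,
                    0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_TEMPORARY, NULL);

    CloseHandle(hBuf); CloseHandle(hRaw); CloseHandle(hOvl); CloseHandle(hTmp);
    return 0;
}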
Experiment Background
• Intel/Gateway 2000 G6: 200 MHz Pentium Pro
• 64 MB DRAM (4x interleave)
• 32-bit PCI
• Adaptec 2940 Fast-Wide (20 MBps) and Ultra-Wide (40 MBps) controllers
• Seagate 4 GB SCSI disks (fast and ultra)
  – 7,200 rpm, 7-15 MBps “internal”
• NT 4.0 SP3, NTFS
• i.e.: modest 1997 technology
• Not multi-processor, not DEC Alpha, some RAID
Simplest Possible Code
• Error checking adds some more, but still, it’s easy:

#include <stdio.h>
#include <windows.h>

int main()
{
    const int iREQUEST_SIZE = 65536;
    char cRequest[iREQUEST_SIZE];
    unsigned long ibytes;

    HANDLE hFile = CreateFile("C:\\input.dat",       // name
                     GENERIC_READ,                   // desired access
                     0, NULL,                        // share & security
                     OPEN_EXISTING,                  // pre-existing file
                     FILE_ATTRIBUTE_TEMPORARY |
                     FILE_FLAG_SEQUENTIAL_SCAN,
                     NULL);                          // file template

    while (ReadFile(hFile, cRequest, iREQUEST_SIZE, &ibytes, NULL)) // do read
    {
        if (ibytes == 0) break;                      // break on end of file
        /* do something with the data */
    }

    CloseHandle(hFile);
    return 0;
}
The Best Case: Temp File, NO IO
• Temp-file read/write goes to the file system cache
• Program uses a small (in-cpu-cache) buffer
• So write/read time is bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it, then read it.
• This hardware is limited to 150 MBps per processor
[Chart: Temp File Read/Write (MBps) – temp read 148, temp write 136, memcopy() 54]
Out of the Box Disk File Performance
• One NTFS disk
• Buffered read
• NTFS does 64 KB read-ahead
  – if you ask for FILE_FLAG_SEQUENTIAL_SCAN
  – or if it thinks you are sequential
• NTFS does 64 KB write-behind
  – under the same conditions
  – aggregates many small IOs into a few big IOs
Synchronous Buffered Read/Write
• Read throughput is GREAT!
• Write throughput is 40% of read
• WCE is fast but dangerous
[Chart: Out of the Box Throughput – MB/s (0-10) vs request size (2-192 K bytes); series: Read, Write, Write+WCE]
[Chart: Out of the Box Overhead – cpu msec/MB (0-80) vs request size (2-192 K bytes); series: Read, Write, Write+WCE]
• Net: default out-of-the-box performance is good.
• 20 ms/MB ~ 2 instructions/byte!
• CPU will saturate at 50 MBps
Write Multiples of the Cluster Size
• For IOs smaller than 4 KB, if OVERWRITING data, the file system reads the 4 KB page, then overwrites the bytes, then writes the page back
• Cuts throughput by 2x - 3x
• So, write in multiples of the cluster size (see the sketch after the chart).
[Chart: Out of the Box Throughput – MB/s (0-10) vs request size (2-192 K bytes); series: Read, Write, Write+WCE]
2 KB writes are 5x slower than reads, and 2x or 3x slower than 4 KB writes.
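One way to honor the cluster-size rule (a sketch; the 6,000-byte request is a hypothetical example) is to ask the volume for its cluster geometry with GetDiskFreeSpace and round request sizes up:

#include <stdio.h>
#include <windows.h>

int main()
{
    DWORD dwSectorsPerCluster, dwBytesPerSector, dwFree, dwTotal;

    // Ask the volume for its cluster geometry.
    GetDiskFreeSpace("C:\\", &dwSectorsPerCluster, &dwBytesPerSector,
                     &dwFree, &dwTotal);
    DWORD dwClusterSize = dwSectorsPerCluster * dwBytesPerSector;

    // Round a desired request size up to a multiple of the cluster size,
    // so overwrites avoid the read-modify-write path described above.
    DWORD dwRequest = 6000;    // hypothetical application request size
    DWORD dwRounded = ((dwRequest + dwClusterSize - 1) / dwClusterSize)
                      * dwClusterSize;

    printf("cluster = %lu bytes; request rounded %lu -> %lu\n",
           dwClusterSize, dwRequest, dwRounded);
    return 0;
}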
What is WCE?
• Write Cache Enable lets the disk controller respond “yes” before the data is on disk.
• Dangerous
  – If power fails, WCE can destroy data integrity
  – Most RAID controllers have non-volatile RAM, which makes WCE safe (invisible) if they do RESET right.
• About 50% of the disks we see have WCE on; you can turn it off with 3rd-party SCSI utilities.
• As seen later: 3-deep request buffering gets similar performance.
Synchronous Un-Buffered Read/Write
• Reads do well above 2 KB
• Writes are terrible
• WCE helps writes
• Ultra media is 1.5x faster
[Chart: Unbuffered Throughput – MB/s (0-10) vs request size (2-192 K bytes); series: Ultra Read, Fast Read, Ultra Write, Fast Write]
[Chart: WCE Unbuffered Write Throughput – MB/s (0-10) vs request size (2-192 K bytes); series: Fast Write WCE, Ultra Write WCE]
• 1/2-power point:
  – Read: 4 KB
  – Write: 64 KB without WCE, 4 KB with WCE
Cost of Un-Buffered IO
• Saves the buffer memory copy
• Was 20 ms/MB, now 2 ms/MB
• Cost/request ~ 120 µs (wow)
• Note: unbuffered IO must be sector aligned (see the sketch below)
• Buffered: saturates the CPU at 50 MB/s
• Un-buffered: saturates the CPU at 500 MB/s
[Chart: CPU milliseconds per MB – 1-100 (log) vs request size (2-192 K bytes)]
[Chart: CPU utilization – 0%-35% vs request size (2-192 K bytes); the cpu is idle because non-WCE writes are so slow]
[Chart: CPU milliseconds per request – 0.10-0.30 vs request size (2-192 K bytes); series: Fast Read, Ultra Read, Fast Write, Ultra Write, Ultra Write WCE, Fast Write WCE]
Summary
• Out of the box:
  – Read RAP ~ PAP (thanks, NTFS)
  – Write RAP ~ PAP/10 … PAP/2
• Buffering small IO is great!
• Buffering large IO is expensive
• WCE is a dangerous way out, but frequently used.
• Parallelism tricks:
  – deep requests (async, overlap)
  – striping (raid0, raid5)
  – allocation and other tricks
[Chart: Out of the Box Overhead – cpu msec/MB (0-60) vs request size (2-192 K bytes); series: Read Buffered, Write Buffered, Write Buffered+WCE, Read, Write, Write+WCE]
[Chart: Out of the Box Throughput – MB/s (0-10) vs request size (2-192 K bytes); un-buffered read & write vs FS-buffered read & write]
[Chart: WCE Out of Box Throughput – MB/s (0-10) vs request size (2-192 K bytes); un-buffered write vs buffered write]
Bottleneck Analysis
• Drawn to linear scale:
  – Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
  – Memory read/write: ~150 MBps
  – MemCopy: ~50 MBps
  – Disk R/W: ~9 MBps
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
Kinds of Parallel Execution
• Pipeline: one sequential step feeds the next
• Partition: outputs split N ways, inputs merge M ways
[Diagram: a pipeline of sequential steps; partitioned parallel sequential steps with split/merge]
Pipeline Requests to One Disk
• Does not help reads much: they were already pipelined by the disk controller
• Pipelined (async, overlapped) IO is a BIG win (RAP ~ 85% of PAP)
• Helps writes a LOT above 16 KB: 3-deep matches WCE (see the sketch below)
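A hedged sketch of the deep-request trick (the file name, the 1,000-request count, and the fixed 3-deep depth are all chosen for illustration): open the file overlapped, prime the pipeline with three writes, then reissue each slot as it completes:

#include <windows.h>

#define DEPTH   3                  // requests kept in flight
#define REQUEST 65536              // 64 KB per request

int main()
{
    OVERLAPPED aOv[DEPTH];
    char *apBuf[DEPTH];
    LONGLONG llOffset = 0;
    DWORD dwDone;
    int i, n;

    HANDLE hFile = CreateFile("C:\\output.dat", GENERIC_WRITE, 0, NULL,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    // Prime the pipeline: issue DEPTH writes before waiting on any.
    for (i = 0; i < DEPTH; i++)
    {
        apBuf[i] = (char *)VirtualAlloc(NULL, REQUEST, MEM_COMMIT,
                                        PAGE_READWRITE);
        ZeroMemory(&aOv[i], sizeof(aOv[i]));
        aOv[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        aOv[i].Offset = (DWORD)llOffset;
        aOv[i].OffsetHigh = (DWORD)(llOffset >> 32);
        WriteFile(hFile, apBuf[i], REQUEST, NULL, &aOv[i]);
        llOffset += REQUEST;
    }

    // Steady state: as each write completes, reuse its slot.
    for (n = 0; n < 1000; n++)
    {
        i = n % DEPTH;
        GetOverlappedResult(hFile, &aOv[i], &dwDone, TRUE);   // wait
        aOv[i].Offset = (DWORD)llOffset;
        aOv[i].OffsetHigh = (DWORD)(llOffset >> 32);
        WriteFile(hFile, apBuf[i], REQUEST, NULL, &aOv[i]);
        llOffset += REQUEST;
    }

    // Drain the pipeline and clean up.
    for (i = 0; i < DEPTH; i++)
    {
        GetOverlappedResult(hFile, &aOv[i], &dwDone, TRUE);
        CloseHandle(aOv[i].hEvent);
        VirtualFree(apBuf[i], 0, MEM_RELEASE);
    }
    CloseHandle(hFile);
    return 0;
}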
[Chart: Read Throughput – 1 fast disk, various request depths; MB/s (0-10) vs request size (2-192 K bytes)]
[Chart: Write Throughput – 1 fast disk, various request depths; MB/s (0-10) vs request size (2-192 K bytes); series: WCE, 1 buffer, 3 buffers, 8 buffers]
Parallel Access To Data?
• At 10 MB/s, it takes 1.2 days to scan 1 terabyte.
• 1,000-way parallel: a 100-second SCAN at 10 GB/s of bandwidth.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
Pipeline Access: Stripe Across 4 Disks
• Stripes NEED pipelining
• 3-deep is good enough
• Saturates at 15 MBps
• An 8-deep pipeline matches WCE
[Chart: Write 4-disk stripes – throughput (MB/s, 0-20) vs request size (2-192 K bytes) and request depth; series: WCE, 1 buffer, 3 buffers, 8 buffers]
[Chart: Read 4-disk stripes – throughput (MB/s, 0-20) vs request size (2-192 K bytes) and request depth]
3 Stripes and You’re Out!
• 3 disks can saturate the adapter; similar story with UltraWide
• CPU time goes down with request size
• Ftdisk striping is cheap
[Chart: Read throughput vs stripes, 3-deep, fast – MB/s (0-20) vs request size (2-192 K bytes); series: 1, 2, 3, 4 disks]
[Chart: Write throughput vs stripes, 3-deep, fast – MB/s (0-20) vs request size (2-192 K bytes); series: 1, 2, 3, 4 disks]
[Chart: CPU milliseconds per MB – 1-100 (log) vs request size]
Parallel SCSI Busses Help
• A second SCSI bus nearly doubles read and WCE throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)
[Chart: One or two SCSI busses – throughput (MB/s, 0-25) vs request size (2-192 K bytes); Read, Write, WCE on 1 bus vs 2 busses; 2 busses give ~2x]
File System Buffering & Stripes (UltraWide Drives)
• FS buffering helps small reads
• FS-buffered writes peak at 12 MBps
• 3-deep async helps
• Write peaks at 20 MBps; read peaks at 30 MBps
[Chart: Three disks, 1 deep – throughput (MB/s, 0-35) vs request size (2-192 K bytes); series: FS Read, Read, FS Write WCE, Write WCE]
[Chart: Three disks, 3 deep – throughput (MB/s, 0-35) vs request size (2-192 K bytes)]
PAP vs RAP
• Reads are easy, writes are hard
• Async write can match WCE.
[Diagram: application data -> file system -> PCI -> SCSI -> disks; advertised vs measured:]
  – System bus: 422 MBps advertised, 142 MBps measured
  – PCI: 133 MBps advertised, 72 MBps measured
  – SCSI: 40 MBps advertised, 31 MBps measured
  – Disks: 10-15 MBps advertised, 9 MBps measured
Bottleneck Analysis
• NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI bus:
  – ~65 MBps unbuffered read
  – ~43 MBps unbuffered write
  – ~40 MBps buffered read
  – ~35 MBps buffered write
[Diagram: memory read/write ~150 MBps; PCI ~70 MBps; each adapter ~30 MBps]
Hypothetical Bottleneck Analysis
• NTFS read/write with 12 disks, 4 SCSI busses, 2 PCI busses (not measured; we had only one PCI bus available, the 2nd one was “internal”):
  – ~120 MBps unbuffered read
  – ~80 MBps unbuffered write
  – ~40 MBps buffered read
  – ~35 MBps buffered write
[Diagram: memory read/write ~150 MBps; two PCI busses ~70 MBps each; four adapters ~30 MBps each -> ~120 MBps]
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary
Stripes, Mirrors, Parity (RAID 0, 1, 5)
• RAID 0: stripes
  – bandwidth
  – disk blocks: 0,3,6,.. | 1,4,7,.. | 2,5,8,..
• RAID 1: mirrors, shadows,…
  – fault tolerance
  – reads faster, writes 2x slower
  – disk blocks: 0,1,2,.. | 0,1,2,..
• RAID 5: parity
  – fault tolerance
  – reads faster
  – writes 4x or 6x slower
  – disk blocks: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
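A sketch of the block maps those layouts imply (the RAID 5 parity rotation shown matches the diagram above; real controllers use several variants):

#include <stdio.h>

// RAID 0: round-robin striping, as in 0,3,6,.. | 1,4,7,.. | 2,5,8,..
void raid0(int logical, int nDisks, int *disk, int *block)
{
    *disk  = logical % nDisks;
    *block = logical / nDisks;
}

// RAID 5: parity rotates across the disks and data fills the other
// slots, as in 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
void raid5(int logical, int nDisks, int *disk, int *block)
{
    int stripe = logical / (nDisks - 1);                // row in the array
    int parityDisk = (nDisks - 1) - (stripe % nDisks);  // rotating parity
    int d = logical % (nDisks - 1);
    *block = stripe;
    *disk  = (d >= parityDisk) ? d + 1 : d;             // skip parity slot
}

int main()
{
    int d, b, i;
    for (i = 0; i < 6; i++)
    {
        raid5(i, 3, &d, &b);
        printf("RAID5 block %d -> disk %d, stripe %d\n", i, d, b);
    }
    return 0;
}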
Where To Do RAID?
• RAID in the host (= NT)
  – no special hardware
  – FtDisk is responsible for data integrity
  – can stripe across multiple busses/adapters
• RAID in the adapter
  – gets safe WCE if non-volatile
  – offloads the host
  – not good for WolfPack
• RAID in the disk controller
  – gets safe WCE if non-volatile
  – offloads the host
  – best data integrity for MSCS
NT Host-Based Striping is OK
• 3 Ultra disks per stripe
• WCE is enabled in all cases
• Requests are 3-deep
[Chart: Striping read throughput – MB/s (0-35) vs request size (2-128 K bytes); series: controller-based, host-based, array-based striping]
[Chart: Striping write throughput – MB/s (0-35) vs request size (2-128 K bytes)]
Surprise: Good NT RAID5 Performance
• Ignores read performance in the case of a disk fault.
• Above 32 KB requests, CPU write cost is significant.
• At 8 KB, performance is similar.
• Write performance is bad in all cases.
[Chart: RAID5 throughput vs request depth – MB/s (0-35) vs request size (K bytes); series: Read, Write]
[Chart: RAID5 CPU milliseconds per MB – 1-100 (log) vs request size (K bytes); series: Array Read, Array Write, Host Read, Host Write]
Controller & Adapters are Complex
• Min response time 300 µs
• Typical 1 ms for 8 KB
• Many strange effects (e.g. the Ultra cache is busted)
[Chart: elapsed time vs request size – 0.1-10 ms (log) over 0-70 K bytes; controller cache vs controller prefetch; series: Ultra, Fast, Narrow cached; Narrow, Fast, Ultra prefetch]
Bus Overhead Grows
• Small requests (8 KB) are more than 1/2 overhead
• 3x more disks means 5x more overhead
[Chart: SCSI bus utilization, data vs overhead – 1 disk 8 KB, 1 disk 64 KB, 2 disks 64 KB, 3 disks 64 KB; utilization values range from 3% to 80%]
Allocate/Extend Suppresses Async Writes
• When you allocate space, NT zeroes it (both DRAM and disk)
• This prevents others from reading data you “deleted”
• It also “kills” pipelined writes.
• Solution: pre-allocate or reuse files whenever you can (a sketch follows).
• Do VERY large writes.
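A hedged sketch of the pre-allocation trick (the name and the 100 MB size are hypothetical): set the file length once with SetFilePointer/SetEndOfFile, then write into the pre-sized region, so no extension happens inside the write loop:

#include <windows.h>

int main()
{
    LARGE_INTEGER liSize;
    liSize.QuadPart = 100 * 1024 * 1024;     // 100 MB, chosen up front

    HANDLE hFile = CreateFile("C:\\output.dat", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                              NULL);

    // Set the file size once...
    SetFilePointer(hFile, liSize.LowPart, &liSize.HighPart, FILE_BEGIN);
    SetEndOfFile(hFile);

    // ...then seek back and write sequentially; allocate/extend no
    // longer serializes the pipelined writes.
    SetFilePointer(hFile, 0, NULL, FILE_BEGIN);
    /* ... pipelined writes go here ... */

    CloseHandle(hFile);
    return 0;
}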
[Chart: Allocate/extend while writing – throughput (MB/s, 0-20) vs request size (2-192 K bytes); 4-disk write 8-deep no-extend; 1-disk write 8-deep no-extend; with extend, 1-deep equals 8-deep]
Stripe Alignment: Chunk vs Cluster
[Chart: Alignment, 4-disk (ultra), 3-deep – throughput (MB/s, 0-35) vs request size (2-192 K bytes); series: Unaligned Read, Aligned Read, Aligned Write, Unaligned Write]
[Diagram: a 64 KB request laid over 64 KB stripe chunks, split into 4 KB + 60 KB pieces]
• A 64 KB read becomes two reads: 4 KB and 60 KB
• Twice as many physical requests
• Stripe has a chunk size (64 KB)
• Volume has a cluster size – the default is 4 KB (for big disks)
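A small sketch of the arithmetic (the 4 KB starting offset is illustrative): a request splits into two physical IOs whenever it starts and ends in different chunks:

#include <stdio.h>

int main()
{
    const long lCHUNK   = 64 * 1024;   // stripe chunk size
    const long lREQUEST = 64 * 1024;   // application request size
    long lOffset = 4096;               // hypothetical unaligned offset

    // The request splits when it starts and ends in different chunks.
    long lFirst = lOffset / lCHUNK;
    long lLast  = (lOffset + lREQUEST - 1) / lCHUNK;

    if (lFirst != lLast)
    {
        long lPiece1 = lCHUNK - (lOffset % lCHUNK);    // 60 KB here
        printf("split: %ld + %ld bytes (two physical IOs)\n",
               lPiece1, lREQUEST - lPiece1);
    }
    else
        printf("aligned: one physical IO\n");
    return 0;
}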
Other Issues
• Multi-processor
• DEC Alpha
• Memory-mapped files
• Fragmentation
• Ultra-2, Merced, FC,…
• NT5:
  – Veritas volume manager
  – 64-bit
  – performance improvements
  – I2O,…
Summary
• Read is easy, write is hard
• SCSI & FS read prefetch work
  – Read RAP ~ 0.8 PAP
  – Write RAP ~ 0.05 PAP to 0.8 PAP
• NTFS buffering is good for small IOs: it coalesces them into 64 KB requests
• Bigger is better: 8 KB is ok, 64 KB is best
• Deep requests help: 3-deep is good, 8-deep is better
• WCE is fast but dangerous; 3-deep writes approximate WCE for > 8 KB requests.
• 3 disks can saturate a SCSI bus, both Fast-Wide (15 MBps) and Ultra-Wide (31 MBps)
• Memory speed is the ultimate limit with multiple disks and multiple PCI busses: 50 MBps copy, 150 MBps r/w
• Avoid FS buffering above 16 KB: it costs 20 ms/MB of cpu
• Preallocate & reuse files when possible: avoids allocate/extend synchronous IO
• Software RAID5 performs well, but fault tolerance is a problem; writes are expensive in any case

Pitfalls
• Read-before-write: 2 KB buffered IO
• Allocate/extend: synchronous write
• Zoned disks => 50% speed bump
• RAID alignment => 20% speed bump
More Details at
• The web site has:
  – the paper
  – sample code
  – the test program we used
  – these slides
• http://research.Microsoft.com/BARC/Sequential_IO/
Outline
• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
  – Single disk (temp, buffered, unbuffered, deep)
  – Multiple disks and busses
  – RAID
  – Pitfalls
• Summary