Slide 1
Towards Benchmarks for Availability,
Maintainability, and Evolutionary Growth (AME)
A Case Study of Software RAID Systems
Aaron Brown
2000 Winter IRAM/ISTORE Retreat
Slide 2
Outline
• Motivation: why new benchmarks, why AME?
• Availability benchmarks: a general approach
• Case study: availability of Software RAID
• Conclusions and future directions
Slide 3
Why benchmark AME?
• Most benchmarks measure performance
• Today's most pressing problems aren't about performance
  – online service providers face new problems as they rapidly deploy, alter, and expand Internet services
    » providing 24x7 availability (A)
    » managing system maintenance (M)
    » handling evolutionary growth (E)
  – AME issues affect in-house providers and smaller installations as well
    » even the UCB CS department
Slide 4
Why benchmarks?
• Growing consensus that we need to focus on these problems in the research community
  – HotOS-7: "No-Futz Computing"
  – Hennessy at FCRC:
    » "other qualities [than performance] will become crucial: availability, maintainability, scalability"
    » "if access to services on servers is the killer app--availability is the key metric"
  – Lampson at SOSP '99:
    » the big challenge for systems research is building "systems that work: meet specifications, are always available, evolve while they run, grow without practical limits"
• Benchmarks shape a field!
  – they define what we can measure
Slide 5
The components of AME
• Availability
  – what factors affect the quality of service delivered by the system, and by how much and for how long?
  – how well can systems survive typical failure scenarios?
• Maintainability
  – how hard is it to keep a system running at a certain level of availability and performance?
  – how complex are maintenance tasks like upgrades, repair, tuning, troubleshooting, and disaster recovery?
• Evolutionary Growth
  – can the system be grown incrementally with heterogeneous components?
  – what's the impact of such growth on availability, maintainability, and sustained performance?
Slide 6
Outline
• Motivation: why new benchmarks, why AME?
• Availability benchmarks: a general approach
• Case study: availability of Software RAID
• Conclusions and future directions
Slide 7
How can we measure availability?
• Traditionally, the percentage of time the system is up
  – a time-averaged, binary view of system state (up/down)
• The traditional metric is too inflexible
  – it doesn't capture degraded states
    » a non-binary spectrum between "up" and "down"
  – time-averaging discards important temporal behavior
    » compare two systems with 96.7% traditional availability:
      • system A is down for 2 seconds per minute
      • system B is down for 1 day per month
• Solution: measure variation in system quality of service metrics over time (a small worked example follows)
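A small worked example, not from the original slides, makes the comparison concrete: both outage patterns yield the same time-averaged availability figure even though they feel completely different to users.

```python
# Traditional availability treats the system as simply up or down and time-averages.
# System A (down 2 s per minute) and system B (down 1 day per month) both score ~96.7%.

def traditional_availability(downtime_s: float, period_s: float) -> float:
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_s / period_s

system_a = traditional_availability(downtime_s=2, period_s=60)
system_b = traditional_availability(downtime_s=24 * 3600, period_s=30 * 24 * 3600)

print(f"System A: {system_a:.1%}")   # 96.7%
print(f"System B: {system_b:.1%}")   # 96.7%
```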
Slide 8
Example Quality of Service metrics
• Performance
  – e.g., user-perceived latency, server throughput
• Degree of fault-tolerance
• Completeness
  – e.g., how much of the relevant data is used to answer a query
• Accuracy
  – e.g., of a computation or decoding/encoding process
• Capacity
  – e.g., admission control limits, access to non-essential services
Slide 9
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
  – to generate fair workloads
  – to measure & trace quality of service metrics
• Use fault injection to compromise the system
  – hardware faults (disk, memory, network, power)
  – software faults (corrupt input, driver error returns)
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks
Slide 10
Methodology: reporting results
• Results are most accessible graphically
  – plot change in QoS metrics over time
  – compare to "normal" behavior
    » 99% confidence intervals calculated from no-fault runs
[Figure: performance vs. time (2-minute intervals), showing the normal-behavior band (99% conf), the injected disk failure, and the subsequent reconstruction period]
• Graphs can be distilled into numbers
  – quantify the distribution of deviations from normal behavior, compute the area under the curve for deviations, ... (a sketch of one such calculation follows)
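To make the "distill into numbers" step concrete, here is a minimal sketch. It assumes a normal approximation for the 99% confidence band and the 2-minute sampling interval used elsewhere in the slides; the exact statistical procedure behind the graphs is not specified, so treat the details below as assumptions.

```python
import statistics

def normal_band(no_fault_samples, z=2.576):
    """Approximate 99% confidence band for 'normal' QoS, built from no-fault runs.
    Assumes a normal approximation; z = 2.576 is the two-sided 99% quantile."""
    mean = statistics.mean(no_fault_samples)
    half_width = z * statistics.stdev(no_fault_samples) / len(no_fault_samples) ** 0.5
    return mean - half_width, mean + half_width

def deviation_area(qos_trace, band, interval_s=120):
    """Area between the QoS trace and the lower edge of the normal band
    (QoS units x seconds), summed over intervals where QoS dips below normal."""
    lower, _ = band
    return sum((lower - q) * interval_s for q in qos_trace if q < lower)

# Example: hits/sec means from no-fault runs, then a trace from a fault-injection run
band = normal_band([205, 208, 203, 207, 206, 204])
print(band, deviation_area([206, 180, 175, 190, 205], band))
```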
Slide 11
Outline
• Motivation: why new benchmarks, why AME?
• Availability benchmarks: a general approach
• Case study: availability of Software RAID
• Conclusions and future directions
Slide 12
Case study
• Software RAID-5 plus web server
  – Linux/Apache vs. Windows 2000/IIS
• Why software RAID?
  – well-defined availability guarantees
    » RAID-5 volume should tolerate a single disk failure
    » reduced performance (degraded mode) after failure
    » may automatically rebuild redundancy onto a spare disk
  – simple system
  – easy to inject storage faults
• Why a web server?
  – an application with measurable QoS metrics that depend on RAID availability and performance
Slide 13
Benchmark environment: metrics
• QoS metrics measured
  – hits per second
    » roughly tracks response time in our experiments
  – degree of fault tolerance in the storage system
• Workload generator and data collector
  – SpecWeb99 web benchmark
    » simulates realistic high-volume user load
    » mostly static read-only workload; some dynamic content
    » modified to run continuously and to measure average hits per second over each 2-minute interval (see the sketch below)
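The modification is described only at a high level; the sketch below just shows the bookkeeping it implies, averaging hits per second over fixed 2-minute intervals from raw hit timestamps (the function name and input format are assumptions, not SpecWeb99 internals).

```python
from collections import Counter

def hits_per_second_by_interval(hit_timestamps, interval_s=120):
    """Average hits per second in each consecutive 2-minute interval.
    hit_timestamps: seconds since the start of the run, one entry per completed hit."""
    counts = Counter(int(t // interval_s) for t in hit_timestamps)
    n_intervals = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) / interval_s for i in range(n_intervals)]

# hits_per_second_by_interval([0.5, 1.2, 130.0, 131.5, 240.0]) ~= [0.017, 0.017, 0.008]
```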
Slide 14
Benchmark environment: faults
• Focus on faults in the storage system (disks)
• How do disks fail?
  – according to the Tertiary Disk project, failures include:
    » recovered media errors
    » uncorrectable write failures
    » hardware errors (e.g., diagnostic failures)
    » SCSI timeouts
    » SCSI parity errors
  – note: no head crashes, no fail-stop failures
Slide 15
Disk fault injection technique
• To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
  – a PC that appears as a disk on the SCSI bus
  – I/O requests processed in software, reflected to a local disk
  – fault injection performed by altering SCSI command processing in the emulation software (a hypothetical sketch follows)
• Types of emulated faults:
  – media errors (transient, correctable, uncorrectable)
  – hardware errors (firmware, mechanical)
  – parity errors
  – power failures
  – disk hangs/timeouts
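The slides say only that faults are injected by altering SCSI command processing in the emulator; the sketch below illustrates that idea in minimal form. The command format, fault table, and function names are all assumptions for illustration, not the actual ASC VirtualSCSI-based implementation.

```python
import random

# Hypothetical fault table: each entry says how command handling should be corrupted.
FAULTS = {
    "media_error_transient": {"sense": "MEDIUM ERROR", "sticky": False},
    "media_error_sticky":    {"sense": "MEDIUM ERROR", "sticky": True},
    "illegal_command":       {"sense": "ILLEGAL REQUEST", "sticky": False},
    "disk_hang":             {"hang": True},
}

def handle_scsi_command(cmd, backing_file, active_fault=None):
    """Serve one emulated SCSI command from the local backing disk,
    optionally injecting the currently configured fault."""
    fault = FAULTS.get(active_fault, {})
    if fault.get("hang"):
        while True:                      # emulate a hung disk: the command never completes
            pass
    if fault and (fault["sticky"] or random.random() < 0.5):
        # sticky faults always fire; transient faults fire only sometimes (crude 50% model)
        return {"status": "CHECK CONDITION", "sense_key": fault["sense"]}
    backing_file.seek(cmd["lba"] * 512)  # normal path: reflect the request to the local disk
    if cmd["op"] == "READ":
        return {"status": "GOOD", "data": backing_file.read(cmd["blocks"] * 512)}
    backing_file.write(cmd["data"])
    return {"status": "GOOD"}
```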
Slide 16
System configuration
• RAID-5 volume: 3 GB capacity, 1 GB used per disk
  – 3 physical disks, 1 emulated disk, 1 emulated spare disk
• 2 web clients connected via 100 Mb switched Ethernet
[Diagram: server and disk emulator connected by a Fast/Wide SCSI bus (20 MB/sec) through Adaptec 2940 host adapters]
• Server: AMD K6-2-333, 64 MB DRAM, Linux or Win2000, IDE system disk
• RAID data disks: IBM 18 GB, 10k RPM
• Disk emulator: AMD K6-2-350, Windows NT 4.0, ASC VirtualSCSI library, AdvStor ASC-U2W UltraSCSI adapter, emulator backing disk (NTFS), SCSI system disk; presents the emulated disk and the emulated spare disk
Slide 17
Results: single-fault experiments
• One experiment for each type of fault (15 total)
  – only one fault injected per experiment
  – no human intervention
  – system allowed to continue until it stabilized or crashed
• Four distinct system behaviors observed:
  (A) no effect: system ignores the fault
  (B) RAID system enters degraded mode
  (C) RAID system begins reconstruction onto the spare disk
  (D) system failure (hang or crash)
Slide 18
System behavior: single-fault
[Four panels plotting hits per second and the number of failures tolerated against time (2-minute intervals), one for each observed behavior: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure]
Slide 19
System behavior: single-fault (2)

Fault type                        | Linux       | Win2000
Correctable read, transient       | reconstruct | no effect
Correctable read, sticky          | reconstruct | no effect
Uncorrectable read, transient     | reconstruct | no effect
Uncorrectable read, sticky        | reconstruct | degraded
Correctable write, transient      | reconstruct | no effect
Correctable write, sticky         | reconstruct | no effect
Uncorrectable write, transient    | reconstruct | degraded
Uncorrectable write, sticky       | reconstruct | degraded
Hardware error, transient         | reconstruct | no effect
Illegal command, transient        | reconstruct | no effect
Disk hang on read                 | failure     | failure
Disk hang on write                | failure     | failure
Disk hang, not on a command       | failure     | failure
Power failure during command      | reconstruct | degraded
Physical removal of active disk   | reconstruct | degraded

– Windows ignores benign faults
– Windows can't automatically rebuild
– Linux reconstructs on all errors
– Both systems fail when a disk hangs
Slide 20
Interpretation: single-fault experiments
• Linux and Windows take opposite approaches to managing benign and transient faults
  – these faults do not necessarily imply a failing disk
    » Tertiary Disk: 368/368 disks had transient SCSI errors; 13/368 disks had transient hardware errors; only 2/368 needed replacing
  – Linux is paranoid and stops using a disk on any error
    » fragile: system is more vulnerable to multiple faults
    » but no chance of a slowly-failing disk impacting performance
  – Windows ignores most benign/transient faults
    » robust: less likely to lose data, more disk-efficient
    » less likely to catch slowly-failing disks and remove them
• Neither policy is ideal!
  – need a hybrid?
Slide 21
Results: multiple-fault experiments
• Scenario (a scripted sketch follows)
  (1) data disk fails
  (2) data is reconstructed onto the spare
  (3) spare fails
  (4) administrator replaces both failed disks
  (5) data is reconstructed onto the new disks
• Requires human intervention
  – to initiate reconstruction on Windows 2000
    » simulate a 6-minute sysadmin response time
  – to replace disks
    » simulate 90 seconds to replace hot-swap disks
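Below is a hedged sketch of how such a run might be scripted. The harness interface (inject_fault, start_reconstruction, wait_for_rebuild, replace_disk) is entirely hypothetical; only the event order and the simulated delays (6 minutes and 90 seconds) come from the slide.

```python
import time

# All harness methods used here are hypothetical placeholders for the test rig.
ADMIN_RESPONSE_S = 6 * 60   # simulated sysadmin response time before a manual Win2000 rebuild
HOT_SWAP_S = 90             # simulated time to physically replace the hot-swap disks

def rebuild(harness, manual_start):
    if manual_start:                         # Windows 2000: reconstruction must be initiated by hand
        time.sleep(ADMIN_RESPONSE_S)
        harness.start_reconstruction()
    harness.wait_for_rebuild()

def run_multifault_scenario(harness, manual_start):
    harness.inject_fault("data_disk")        # (1) data disk fails
    rebuild(harness, manual_start)           # (2) reconstruction onto the spare
    harness.inject_fault("spare_disk")       # (3) spare fails
    time.sleep(HOT_SWAP_S)                   # (4) administrator replaces both failed disks
    harness.replace_disk("data_disk")
    harness.replace_disk("spare_disk")
    rebuild(harness, manual_start)           # (5) reconstruction onto the new disks
```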
Slide 22
System behavior: multiple-fault

[Two panels, one for Windows 2000/IIS and one for Linux/Apache, plotting hits per second against time (2-minute intervals) with the normal-behavior band (99% conf) and annotations for: data disk faulted, reconstruction (manual on Windows, automatic on Linux), spare faulted, disks replaced]

• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Slide 23
Interpretation: multi-fault experiments
• Linux and Windows have different reconstruction philosophies
  – Linux uses idle bandwidth for reconstruction
    » little impact on application performance
    » increases the length of time the system is vulnerable to faults
  – Windows steals application bandwidth for reconstruction
    » reduces application performance
    » minimizes system vulnerability
    » but must be manually initiated (or scripted)
  – Windows favors fault-tolerance over performance; Linux favors performance over fault-tolerance
    » the same design philosophies seen in the single-fault experiments
Slide 24
Maintainability Observations
• Scenario: administrator accidentally removes and replaces a live disk while the array is in degraded mode
  – a double failure; no guarantee of data integrity
  – theoretically, recovery is possible if writes are queued
  – Windows recovers, but loses active writes
    » the journalling NTFS file system is not corrupted
    » all data not being actively written is intact
  – Linux will not allow the removed disk to be reintegrated
    » total loss of all data on the RAID volume!
Slide 25
Maintainability Observations (2)
• Scenario: administrator adds a new spare
  – a common task that can be done with hot-swap drive trays
  – Linux requires a reboot for the disk to be recognized
  – Windows can dynamically detect the new disk
• Windows 2000 RAID is easier to maintain
  – easier GUI configuration
  – more flexible in adding disks
  – SCSI rescan and NTFS deal with administrator goofs
  – less likely to require administration due to transient errors
    » BUT reconstruction must be initiated manually when needed
Slide 26
Outline
• Motivation: why new benchmarks, why AME?
• Availability benchmarks: a general approach
• Case study: availability of Software RAID
• Conclusions and future directions
Slide 27
Conclusions
• AME benchmarks are needed to direct research toward today's pressing problems
• Our availability benchmark methodology is powerful
  – it revealed undocumented design decisions in the Linux and Windows 2000 software RAID implementations
    » transient error handling
    » reconstruction priorities
• Windows & Linux SW RAIDs are imperfect
  – but Windows is easier to maintain, less likely to fail due to double faults, and less likely to waste disks
  – if spares are cheap and plentiful, Linux auto-reconstruction gives it the edge
Slide 28
Future Directions
• Add maintainability
  – use the methodology from the availability benchmark
  – but include the administrator's response to faults
  – must develop a model of typical administrator behavior
  – can we quantify the administrative work needed to maintain a certain level of availability?
• Expand availability benchmarks
  – apply to other systems: DBMSs, mail, distributed apps
  – use the ISTORE-1 prototype
    » an 80-node x86 cluster with built-in fault injection and diagnostics
• Add evolutionary growth
  – just extend the maintainability/availability techniques?
Slide 29
Backup Slides
Slide 30
Availability Example: SW RAID
• Win2k/IIS, Linux/Apache on software RAID-5 volumes

[Two panels, one per system, plotting hits per second against time (2-minute intervals) with the normal-behavior band (99% conf) and annotations for the injected disk failure and the start of reconstruction (automatic on Linux/Apache, initiated manually on Windows 2000/IIS)]

• Windows gives more bandwidth to reconstruction, minimizing fault vulnerability at the cost of application performance
  – compare to Linux, which does the opposite
Slide 31
System behavior: multiple-fault

[Two panels, one for Windows 2000/IIS and one for Linux/Apache, plotting hits per second against time (2-minute intervals) with the normal-behavior band (99% conf) and numbered events: (1) data disk faulted, (2) reconstruction, (3) spare faulted, (4) disks replaced, (5) reconstruction; reconstruction is initiated manually on Windows and automatically on Linux]

• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not