Slide 1
Initial Availability Benchmarking of a Database System
Aaron [email protected]
DBLunch Seminar, 1/23/01
Slide 2
Motivation
• Availability is a key metric for modern apps
  – e-commerce, enterprise apps, online services, ISPs
• Database availability is particularly important
  – databases hold the critical hard state for most enterprise and e-business applications
    » the most important system component to keep available
  – we trust databases to be highly dependable. Should we?
    » how do DBMSs react to hardware faults/failures?
    » what is the user-visible impact of such failures?
Slide 3
Overview of approach
• Use availability benchmarking to evaluate database dependability
  – an empirical technique based on simulated faults
  – study 3-tier OLTP workload
    » back-end: commercial database
    » middleware: transaction monitor & business logic
    » front-end: web-based form interface
  – focus on storage system faults/failures
  – measure availability in terms of performance
    » also possible to look at consistency of data
Slide 4
Outline
• Availability benchmarking methodology
• Adapting methodology for OLTP databases
• Case study of Microsoft SQL Server 2000
• Discussion and future directions
Slide 5
Availability benchmarking
• A general methodology for defining and measuring availability
  – focused toward research, not marketing
  – empirically demonstrated with software RAID systems [Usenix00]
• 3 components:
  1) metrics
  2) benchmarking techniques
  3) representation of results
Slide 6
Part 1: Availability metrics
• Traditionally, percentage of time system is up
  – time-averaged, binary view of system state (up/down)
• This metric is inflexible
  – doesn’t capture degraded states
    » a non-binary spectrum between “up” and “down”
  – time-averaging discards important temporal behavior
    » compare 2 systems with 96.7% traditional availability:
      • system A is down for 2 seconds per minute
      • system B is down for 1 day per month
• Our solution: measure variation in system quality-of-service metrics over time
  – performance, fault-tolerance, completeness, accuracy
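The A-vs-B comparison on this slide can be checked with a few lines of arithmetic (illustrative only, not part of the talk's tooling): both systems score the same time-averaged availability even though their failure patterns differ wildly.

```python
def availability(downtime, period):
    """Fraction of the period the system is up (time-averaged, binary view)."""
    return 1 - downtime / period

# System A: down 2 seconds out of every minute
a = availability(downtime=2, period=60)
# System B: down 1 day out of every 30-day month
b = availability(downtime=1, period=30)

print(f"A: {a:.3%}  B: {b:.3%}")  # identical ~96.7%, yet very different behavior
```

The metric collapses to a single number, which is exactly why the slide argues for tracking QoS variation over time instead.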
Slide 7
Part 2: Measurement techniques
• Goal: quantify variation in QoS metrics as system availability is compromised
• Leverage existing performance benchmarks
  – to measure & trace quality-of-service metrics
  – to generate fair workloads
• Use fault injection to compromise system
  – hardware and software faults
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks
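A single-fault micro-benchmark run, as described above, can be sketched as a loop that samples a QoS metric each interval and injects exactly one fault partway through. The `measure_qos`/`inject_fault` hooks here are hypothetical stand-ins, not the harness actually used in the study.

```python
def single_fault_run(measure_qos, inject_fault, duration=120, fault_at=40):
    """Return a time series of (tick, qos) samples with one fault injected.

    measure_qos  -- callable returning the QoS sample for one interval
                    (e.g., transactions completed this tick)
    inject_fault -- callable that triggers the simulated fault once
                    (e.g., emulated disk starts returning media errors)
    """
    trace, injected = [], False
    for t in range(duration):
        if t >= fault_at and not injected:
            inject_fault()
            injected = True
        trace.append((t, measure_qos()))
    return trace
```

Multi-fault macro-benchmarks would chain several such injections (plus repair events) inside one run.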
Slide 8
Part 3: Representing results
• Results are most accessible graphically
  – plot change in QoS metrics over time
  – compare to “normal” behavior
    » 99% confidence intervals calculated from no-fault runs
• Graphs can be distilled into numbers
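One way to compute the 99% "normal behavior" band mentioned above is a normal approximation over per-interval throughput samples from fault-free runs; this is a sketch of that idea, not necessarily the exact interval construction the study used, and the sample values are made up.

```python
import statistics

def normal_band(no_fault_samples, z=2.576):
    """99% two-sided band (z = 2.576) around mean of fault-free QoS samples.

    Samples from a faulty run falling outside this band mark intervals
    where availability was measurably impaired.
    """
    mu = statistics.mean(no_fault_samples)
    sigma = statistics.stdev(no_fault_samples)
    return mu - z * sigma, mu + z * sigma

# Hypothetical per-minute throughput from no-fault runs
lo, hi = normal_band([980, 1010, 995, 1002, 990, 1005])
```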
Slide 9
Outline
• Availability benchmarking methodology
• Adapting methodology for OLTP databases
  – metrics
  – workload and fault injection
• Case study of Microsoft SQL Server 2000
• Discussion and future directions
Slide 10
Availability metrics for databases
• Possible OLTP quality-of-service metrics
  – transaction throughput
  – transaction response time
    » better: % of transactions longer than a fixed cutoff
  – rate of transactions aborted due to errors
  – consistency of database
  – fraction of database content available
• Our experiments focused on throughput
  – rates of normal and failed transactions
Slide 11
Workload & fault injection
• Performance workload
  – easy: TPC-C
• Fault workload: disk subsystem
  – realistic fault set based on Tertiary Disk study
    » correctable & uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
    » both transient and “sticky” faults
  – injected via an emulated SCSI disk (~0.5 ms overhead)
  – faults injected in one of two partitions:
    » database data partition
    » database’s write-ahead log partition
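The transient-vs-sticky distinction above can be sketched as a shim in the I/O path that returns media errors instead of data: a transient fault fails once and clears, a sticky fault persists for the affected sectors. This is a hedged toy model of the emulated-disk idea; all class and method names are hypothetical.

```python
class FaultyDisk:
    """Toy emulated disk: injects transient or sticky media errors on read."""

    def __init__(self, backing):
        self.backing = backing    # dict: sector -> bytes
        self.transient = set()    # sectors that fail exactly once
        self.sticky = set()       # sectors that fail on every access

    def read(self, sector):
        if sector in self.sticky:
            raise IOError("uncorrectable media error (sticky)")
        if sector in self.transient:
            self.transient.discard(sector)  # fault clears after one failure
            raise IOError("media error (transient)")
        return self.backing[sector]

disk = FaultyDisk({7: b"payload"})
disk.transient.add(7)   # next read of sector 7 fails once, then succeeds
```

The real injector sat behind an emulated SCSI interface, so the DBMS saw ordinary SCSI error codes with only ~0.5 ms of added latency.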
Slide 12
Outline
• Availability benchmarking methodology
• Adapting methodology for OLTP databases
• Case study of Microsoft SQL Server 2000
• Discussion and future directions
Slide 13
Experimental setup
• Database
  – Microsoft SQL Server 2000, default configuration
• Middleware/front-end software
  – Microsoft COM+ transaction monitor/coordinator
  – IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
  – Microsoft BenchCraft remote terminal emulator
• TPC-C-like OLTP order-entry workload
  – 10 warehouses, 100 active users, ~860 MB database
• Measured metrics
  – throughput of correct NewOrder transactions/min
  – rate of aborted NewOrder transactions (txn/min)
Slide 14
Experimental setup (2)
• Database installed in one of two configurations:
  – data on emulated disk, log on real (IBM) disk
  – data on real (IBM) disk, log on emulated disk
[Diagram: three machines on 100 Mb Ethernet. Front end (Intel P-III/450, 256 MB DRAM, Windows 2000 AS): MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+, SCSI system disk. DB server (AMD K6-2/333, 128 MB DRAM, Windows 2000 AS): SQL Server 2000, IDE system disk, DB data/log disks — one IBM 18 GB 10k RPM disk and one emulated disk — on a Fast/Wide SCSI bus (Adaptec 3940, 20 MB/s). Disk emulator (Intel P-II/300, 128 MB DRAM, Windows NT 4.0): ASC VirtualSCSI library, AdvStor ASC-U2W UltraSCSI adapter, SCSI system disk, emulator backing disk (NTFS) on Adaptec 2940.]
Slide 15
Results
• All results are from single-fault micro-benchmarks
• 14 different fault types
  – injected once for each of data and log partitions
• 4 categories of behavior detected:
  1) normal
  2) transient glitch
  3) degraded
  4) failed
Slide 16
Type 1: normal behavior
• System tolerates fault
• Demonstrated for all sector-level faults except:
  – sticky uncorrectable read, data partition
  – sticky uncorrectable write, log partition
Slide 17
Type 2: transient glitch
• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:
  – sticky uncorrectable read, data partition
Slide 18
Type 3: degraded behavior
• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded performance
• Demonstrated for one fault only:
  – sticky uncorrectable write, log partition
Slide 19
Type 4: failure
• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults
  – SCSI hangs, disk power failures
• Example behaviors (10 distinct variants observed):
  – disk hang during write to data disk
  – simulated log disk power failure
Slide 20
Results: summary
• DBMS was robust to a wide range of faults
  – tolerated all transient and recoverable errors
  – tolerated some unrecoverable faults
    » transparently (e.g., uncorrectable data writes)
    » or by reflecting fault back via transaction abort
    » these were not tolerated by the SW RAID systems
• Overall, DBMS is significantly more robust to disk faults than software RAID on the same OS!
Slide 21
Outline
• Availability benchmarking methodology
• Adapting methodology for OLTP databases
• Case study of Microsoft SQL Server 2000
• Discussion and future directions
Slide 22
Results: discussion
• DBMS’s extra robustness comes from:
  – redundant data representation in form of log
  – transactions
    » standard mechanism for reporting errors (txn abort)
    » encapsulate meaningful unit of work, providing consistent rollback upon failure
    » compare RAID: blocks don’t let you do this
• But, middleware was not robust, compromising overall system availability
  – crashed or behaved erratically when DBMS recovered or returned errors
  – user cannot distinguish DBMS and middleware failure
  – system is only as robust as its weakest component!
Slide 23
Discussion of methodology
• General availability benchmarking methodology does work on more than just RAID systems
• Issues in adapting the methodology
  – defining appropriate metrics
  – measuring non-performance availability metrics
  – understanding layered (multi-tier) systems with only end-to-end instrumentation
Slide 25
Future directions
• Direct extensions of this work:
  – expand metrics, including tests of ACID properties
  – consider other fault injection points besides disks
  – investigate clustered database designs
  – study issues in benchmarking layered systems
Slide 26
Future directions (2)
• Availability/maintainability extensions to TPC
  – proposed by James Hamilton at ISTORE retreat
  – an “optional maintainability test” after regular run
  – sponsor supplies N best administrators
  – TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform
  – measure availability, performance, admin. time, . . .
  – requires:
    » characterization of typical failure modes, admin. tasks
    » scalable, easy-to-deploy fault-injection harness
• This work is a (small) step toward that goal
  – and hints at poor state of the art in TPC-C benchmark middleware fault handling
Slide 27
Thanks!
• Microsoft SQL Server group
  – for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
  – James Hamilton
  – Jamie Redding and Charles Levine
Slide 29
Example results: failing data disk
• Transient, correctable read fault (system tolerates fault)
• Sticky, uncorrectable read fault (transaction is aborted with error)
• Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
• Disk hang during a data write (DBMS hangs, middleware crashes)