
Slide 1

Initial Availability Benchmarking of a Database System

Aaron Brown, [email protected]

DBLunch Seminar, 1/23/01

Slide 2

Motivation

• Availability is a key metric for modern apps.
– e-commerce, enterprise apps, online services, ISPs

• Database availability is particularly important
– databases hold the critical hard state for most enterprise and e-business applications
» the most important system component to keep available
– we trust databases to be highly dependable. Should we?
» how do DBMSs react to hardware faults/failures?
» what is the user-visible impact of such failures?

Slide 3

Overview of approach

• Use availability benchmarking to evaluate database dependability
– an empirical technique based on simulated faults
– study 3-tier OLTP workload
» back-end: commercial database
» middleware: transaction monitor & business logic
» front-end: web-based form interface
– focus on storage system faults/failures
– measure availability in terms of performance
» also possible to look at consistency of data

Slide 4

Outline

• Availability benchmarking methodology

• Adapting methodology for OLTP databases

• Case study of Microsoft SQL Server 2000

• Discussion and future directions

Slide 5

Availability benchmarking

• A general methodology for defining and measuring availability
– focused toward research, not marketing
– empirically demonstrated with software RAID systems [Usenix00]

• 3 components
1) metrics
2) benchmarking techniques
3) representation of results

Slide 6

Part 1: Availability metrics

• Traditionally, percentage of time system is up
– time-averaged, binary view of system state (up/down)

• This metric is inflexible
– doesn’t capture degraded states
» a non-binary spectrum between “up” and “down”
– time-averaging discards important temporal behavior
» compare 2 systems with 96.7% traditional availability (worked out below):
• system A is down for 2 seconds per minute
• system B is down for 1 day per month

• Our solution: measure variation in system quality of service metrics over time
– performance, fault-tolerance, completeness, accuracy
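For reference, the 96.7% figure for both systems is just time-averaged availability (assuming a 30-day month):

\[ A = 1 - \frac{t_{\text{down}}}{t_{\text{total}}}, \qquad A_{\text{A}} = 1 - \frac{2\,\text{s}}{60\,\text{s}} \approx 96.7\%, \qquad A_{\text{B}} = 1 - \frac{1\,\text{day}}{30\,\text{days}} \approx 96.7\% \]

The averaged number is identical, yet system A fails briefly every minute while system B takes a single day-long outage each month; tracking QoS over time exposes that difference.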

Slide 7

Part 2: Measurement techniques

• Goal: quantify variation in QoS metrics as system availability is compromised

• Leverage existing performance benchmarks
– to measure & trace quality of service metrics
– to generate fair workloads

• Use fault injection to compromise system
– hardware and software faults
– maintenance events (repairs, SW/HW upgrades)

• Examine single-fault and multi-fault workloads (a single-fault run is sketched below)
– the availability analogues of performance micro- and macro-benchmarks
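A single-fault run can be pictured roughly as follows. This is a minimal sketch in Python, not the authors’ harness; inject_fault and sample_qos are hypothetical caller-supplied hooks (e.g., triggering the emulated disk and reading the benchmark’s per-interval transaction count).

import time

def single_fault_run(inject_fault, sample_qos, warmup_s=300, observe_s=600, sample_s=10):
    """Trace a QoS metric over time around one injected fault."""
    trace = []                      # (elapsed seconds, QoS sample) pairs
    start = time.time()
    fault_injected = False
    while (elapsed := time.time() - start) < warmup_s + observe_s:
        if not fault_injected and elapsed >= warmup_s:
            inject_fault()          # compromise the system once it reaches steady state
            fault_injected = True
        trace.append((elapsed, sample_qos()))
        time.sleep(sample_s)
    return trace                    # later plotted against no-fault runs

A multi-fault macro-benchmark would repeat the same loop with a scripted sequence of faults and maintenance events instead of a single injection.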

Slide 8

Part 3: Representing results

• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior
» 99% confidence intervals calculated from no-fault runs

• Graphs can be distilled into numbers (see the sketch below)
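One plausible way to compute the “normal” band and distill a faulted run into numbers is sketched below; the function names, the per-run array layout, and the t-based 99% interval are illustrative assumptions, not the talk’s actual analysis code.

import numpy as np
from scipy import stats

def normal_band(no_fault_runs, confidence=0.99):
    """Confidence band for normal QoS, from several no-fault runs.

    no_fault_runs: array of shape (runs, samples), one QoS trace per row.
    """
    samples = np.asarray(no_fault_runs, dtype=float)
    mean = samples.mean(axis=0)
    half_width = stats.sem(samples, axis=0) * stats.t.ppf((1 + confidence) / 2,
                                                          df=samples.shape[0] - 1)
    return mean - half_width, mean + half_width

def distill(fault_run, lower, sample_s=10):
    """Reduce one faulted trace to numbers: time spent and worst depth below the band."""
    run = np.asarray(fault_run, dtype=float)
    below = run < lower
    seconds_degraded = int(below.sum()) * sample_s
    worst_drop = float((lower - run)[below].max()) if below.any() else 0.0
    return seconds_degraded, worst_drop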

Slide 9

Outline

• Availability benchmarking methodology

• Adapting methodology for OLTP databases
– metrics
– workload and fault injection

• Case study of Microsoft SQL Server 2000

• Discussion and future directions

Slide 10

Availability metrics for databases

• Possible OLTP quality of service metrics
– transaction throughput
– transaction response time
» better: % of transactions longer than a fixed cutoff
– rate of transactions aborted due to errors
– consistency of database
– fraction of database content available

• Our experiments focused on throughput (see the sketch below)
– rates of normal and failed transactions
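From a raw transaction trace, these throughput-oriented metrics could be computed roughly as follows; the (time, status, latency) record layout is an assumption made for illustration, not the benchmark kit’s actual output format.

from collections import Counter

def oltp_qos(txn_log, latency_cutoff_s=5.0):
    """Per-minute committed/aborted transaction rates and % of slow transactions.

    txn_log: iterable of (completion_time_s, status, latency_s) tuples,
    with status either "committed" or "aborted".
    """
    committed_per_min = Counter()
    aborted_per_min = Counter()
    slow = total = 0
    for t, status, latency in txn_log:
        minute = int(t // 60)
        if status == "committed":
            committed_per_min[minute] += 1
        else:
            aborted_per_min[minute] += 1
        total += 1
        slow += latency > latency_cutoff_s
    pct_slow = 100.0 * slow / total if total else 0.0
    return committed_per_min, aborted_per_min, pct_slow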

Slide 11

Workload & fault injection

• Performance workload
– easy: TPC-C

• Fault workload: disk subsystem
– realistic fault set based on Tertiary Disk study
» correctable & uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
» both transient and “sticky” faults
– injected via an emulated SCSI disk (~0.5 ms overhead)
– faults injected in one of two partitions (combinations sketched below):
» database data partition
» database’s write-ahead log partition
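The fault workload can be thought of as a cross product of fault kind, persistence, and target partition. The fault kinds listed here are illustrative stand-ins only; the talk’s actual 14-type fault set comes from the Tertiary Disk failure data and is not reproduced.

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DiskFault:
    kind: str        # e.g. "uncorrectable_read_error", "disk_hang"
    sticky: bool     # False = transient, True = persists until cleared
    partition: str   # "data" or "log"

KINDS = ["correctable_read_error", "uncorrectable_read_error",
         "correctable_write_error", "uncorrectable_write_error",
         "disk_hang", "power_failure"]

# One single-fault micro-benchmark per combination of kind, persistence, and partition.
FAULT_SPACE = [DiskFault(kind, sticky, part)
               for kind, sticky, part in product(KINDS, (False, True), ("data", "log"))]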

Slide 12

Outline

• Availability benchmarking methodology

• Adapting methodology for OLTP databases

• Case study of Microsoft SQL Server 2000

• Discussion and future directions

Slide 13

Experimental setup

• Database
– Microsoft SQL Server 2000, default configuration

• Middleware/front-end software
– Microsoft COM+ transaction monitor/coordinator
– IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
– Microsoft BenchCraft remote terminal emulator

• TPC-C-like OLTP order-entry workload
– 10 warehouses, 100 active users, ~860 MB database

• Measured metrics
– throughput of correct NewOrder transactions/min
– rate of aborted NewOrder transactions (txn/min)

Slide 14

Experimental setup (2)

• Database installed in one of two configurations:
– data on emulated disk, log on real (IBM) disk
– data on real (IBM) disk, log on emulated disk

[Diagram: experimental hardware. Three machines connected by 100 Mb Ethernet; “=” in the original figure denotes a Fast/Wide SCSI bus, 20 MB/sec.]
– DB Server: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS, SQL Server 2000; IDE system disk; Adaptec 3940 controller to the DB data/log disks (IBM 18 GB 10k RPM disk and the emulated disk)
– Front End: Intel P-III/450, 256 MB DRAM, Windows 2000 AS; MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+; SCSI system disk
– Disk Emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0; ASC VirtualSCSI lib. with AdvStor ASC-U2W UltraSCSI adapter; Adaptec 2940; emulator backing disk (NTFS); IBM 18 GB 10k RPM; SCSI system disk

Slide 15

Results

• All results are from single-fault micro-benchmarks

• 14 different fault types
– injected once for each of data and log partitions

• 4 categories of behavior detected
1) normal
2) transient glitch
3) degraded
4) failed

Slide 16

Type 1: normal behavior

• System tolerates fault
• Demonstrated for all sector-level faults except:
– sticky uncorrectable read, data partition
– sticky uncorrectable write, log partition

Slide 17

Type 2: transient glitch

• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:

– sticky uncorrectable read, data partition

Slide 18

Type 3: degraded behavior

• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded perf.
• Demonstrated for one fault only:

– sticky uncorrectable write, log partition

Slide 19

Type 4: failure

• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults

– SCSI hangs, disk power failures

• Example behaviors (10 distinct variants observed)

[Graphs: disk hang during write to data disk; simulated log disk power failure]

Slide 20

Results: summary

• DBMS was robust to a wide range of faults
– tolerated all transient and recoverable errors
– tolerated some unrecoverable faults
» transparently (e.g., uncorrectable data writes)
» or by reflecting fault back via transaction abort
» these were not tolerated by the SW RAID systems

• Overall, DBMS is significantly more robust to disk faults than software RAID on same OS!

Slide 21

Outline

• Availability benchmarking methodology

• Adapting methodology for OLTP databases

• Case study of Microsoft SQL Server 2000

• Discussion and future directions

Slide 22

Results: discussion

• DBMS’s extra robustness comes from:
– redundant data representation in form of log
– transactions
» standard mechanism for reporting errors (txn abort)
» encapsulate meaningful unit of work, providing consistent rollback upon failure
» (annotation) compare RAID: blocks don’t let you do this

• But, middleware was not robust, compromising overall system availability
– crashed or behaved erratically when DBMS recovered or returned errors
– user cannot distinguish DBMS and middleware failure
– system is only as robust as its weakest component!

Slide 23

Discussion of methodology

• General availability benchmarking methodology does work on more than just RAID systems

• Issues in adapting the methodology
– defining appropriate metrics
– measuring non-performance availability metrics
– understanding layered (multi-tier) systems with only end-to-end instrumentation

Slide 25

Future directions

• Direct extensions of this work:
– expand metrics, including tests of ACID properties
– consider other fault injection points besides disks
– investigate clustered database designs
– study issues in benchmarking layered systems

Slide 26

Future directions (2)

• Availability/maintainability extensions to TPC
– proposed by James Hamilton at ISTORE retreat
– an “optional maintainability test” after regular run
– sponsor supplies N best administrators
– TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform
– measure availability, performance, admin. time, ...
– requires:
» characterization of typical failure modes, admin. tasks
» scalable, easy-to-deploy fault-injection harness

• This work is a (small) step toward that goal
– and hints at the poor state of the art in TPC-C benchmark middleware fault handling

Slide 27

Thanks!

• Microsoft SQL Server group
– for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
– James Hamilton
– Jamie Redding and Charles Levine

Slide 28

Backup slides

Slide 29

Example results: failing data disk

[Graphs: four fault-injection runs against the data disk]
– Transient, correctable read fault (system tolerates fault)
– Sticky, uncorrectable read fault (transaction is aborted with error)
– Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
– Disk hang during a data write (DBMS hangs, middleware crashes)

Slide 30

Example results: failing log disk

[Graphs: four fault-injection runs against the log disk]
– Transient, correctable write fault (system tolerates fault)
– Sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
– Simulated disk power failure (DBMS aborts all txns with errors)
– Disk hang between SCSI commands (DBMS hangs, middleware hangs)

