Latency as a Performability Metric: Experimental Results
Pete [email protected]
Outline
1. Motivation and background
   • Performability overview
   • Project summary
2. Test setup
   • PRESS web server
   • Mendosus fault injection system
3. Experimental results & analysis
   • How to represent latency
   • Questions for future research
Performability overview
• Goal of ROC project: develop metrics to evaluate new recovery techniques
• Performability: a class of metrics that describe how a system performs in the presence of faults
  – First used in the fault-tolerant computing field1
  – Now being applied to online services

1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Example: microbenchmark
[Figure: RAID disk failure]
Project motivation
• Rutgers study: performability analysis of a web server, using throughput
• Other studies (esp. from the HP Labs Storage group) also use response time as a metric
• Assertion: latency and data quality describe the user experience better than throughput does
• How best to represent latency in performability reports?
Project overview
• Goals:
  1. Replicate the PRESS/Mendosus study with response time measurements
  2. Discuss how to incorporate latency into performability statistics
• Contributions:
  1. Provide a latency-based analysis of a web server’s performability (currently rare)
  2. Further the development of more comprehensive dependability benchmarks
Experiment components
• The Mendosus fault injection system
  – From Rutgers (Rich Martin)
  – Goal: low-overhead emulation of a cluster of workstations, with injection of likely faults
• The PRESS web server
  – Cluster-based, uses cooperative caching; designed by Carreira et al. (Rutgers)
  – Perf-PRESS: basic version
  – HA-PRESS: incorporates heartbeats and a master node for automated cluster management
• Client simulators
  – Submit a set number of requests/sec, based on real traces
Mendosus design
[Diagram: a Java global controller reads fault, LAN-emulation and application config files and drives the workstations (real or VMs); each workstation runs its apps plus a user-level Java daemon, a modified NIC driver, and SCSI and proc kernel modules, all connected over an emulated LAN]
Experimental setup
Fault types

Category | Fault | Possible root cause
Node | Node crash | Operator error, OS bug, hardware component failure, power outage
Node | Node freeze | OS or kernel module bug
Application | App crash | Application bug or resource unavailability
Application | App hang | Application bug or resource contention with other processes
Network | Link down or flaky | Broken, damaged or misattached cable
Network | Switch down or flaky | Damaged or misconfigured switch, power outage
Test case timeline
• Warm-up time: 30-60 seconds
• Time to repair: up to 90 seconds
Simplifying assumptions
• Operator repairs any non-transient failure after 90 seconds
• Web page size is constant
• Faults are independent
• Each client request is independent of all others (no sessions!)
  – Request arrival times are determined by a Poisson process (not self-similar)
• Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs (see the sketch after this list)
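As a sketch of this client model, the snippet below draws Poisson arrival times and buckets each request by the two cut-offs. The function names, request rate and bucket labels are illustrative assumptions, not taken from the actual PRESS/Mendosus client simulators.

```python
import random

CONNECT_ABORT_SECS = 2.0   # client abandons the connection attempt
PAGE_TIMEOUT_SECS = 8.0    # client gives up on the page load

def poisson_arrivals(duration_secs, rate_per_sec):
    """Yield request arrival times from a Poisson process:
    independent, exponentially distributed inter-arrival gaps."""
    t = random.expovariate(rate_per_sec)
    while t < duration_secs:
        yield t
        t += random.expovariate(rate_per_sec)

def classify(connect_secs, total_secs):
    """Bucket one request the way the sample-result charts do."""
    if connect_secs > CONNECT_ABORT_SECS:
        return "aborted"
    if total_secs > PAGE_TIMEOUT_SECS:
        return "timed out"
    return "success"
```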
Sample result: app crash
[Figure: for Perf-PRESS and HA-PRESS, throughput (# requests: success, aborted at 2 s, timed out at 8 s) and average response time (secs) vs. time elapsed, 0-180 secs]
Sample result: node hang
[Figure: for Perf-PRESS and HA-PRESS, throughput (# requests: success, aborted at 2 s, timed out at 8 s) and average response time (secs) vs. time elapsed, 0-410 secs]
Representing latency
• Total seconds of wait time
  – Not good for comparing cases with different workloads
• Average (mean) wait time per request
  – OK, but requires that the expected (normal) response time be given separately
• Variance of wait time
  – Not very intuitive to describe; also, a read-only workload means that all variance is toward longer wait times anyway
Representing latency (2)
• Consider “goodput”-based availability:
  availability = total responses served / total requests
• Idea: latency-based “punctuality”:
  punctuality = ideal total latency / actual total latency
• Like goodput, its maximum value is 1
• “Ideal” total latency: average latency for non-fault cases x total # requests (shouldn’t be 0)
Representing latency (3)
• Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
  – These can be captured in a separate statistic (e.g., 1% of 100k responses took >8 sec), as in the sketch below
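A minimal sketch of the two indices and the separate spike statistic, computed from per-request response times. The function names and argument conventions are illustrative assumptions.

```python
def availability(responses_served, total_requests):
    """Goodput-based availability: fraction of requests answered."""
    return responses_served / total_requests

def punctuality(latencies_secs, ideal_avg_latency_secs):
    """Latency-based punctuality: ideal total latency / actual total.

    ideal_avg_latency_secs is the average latency measured in
    non-fault runs (shouldn't be 0); like goodput, the maximum is 1.
    """
    ideal_total = ideal_avg_latency_secs * len(latencies_secs)
    return ideal_total / sum(latencies_secs)

def slow_fraction(latencies_secs, threshold_secs=8.0):
    """Spike statistic, e.g. '1% of 100k responses took >8 sec'."""
    slow = sum(1 for l in latencies_secs if l > threshold_secs)
    return slow / len(latencies_secs)
```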
Availability and punctuality
[Charts: throughput-based availability of the PRESS versions (availability index, 0-1) and latency-based “punctuality” (punctuality index, 0-0.12) for Perf-PRESS vs. HA-PRESS across fault scenarios: app hang, app crash, node crash, node freeze, link down]
Other metrics
• Data quality, latency and throughput are interrelated
  – Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”?
• To combine DQ, latency and throughput, one can use a “demerit” system (proposed by Keynote)1
  – These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available

1 Zona Research and Keynote Systems, The Need for Speed II, 2001
Sample demerit system
• Rules (sketched in code below):
  – Each aborted (2 s) connection: 2 demerits
  – Each connection error: 1 demerit
  – Each user timeout (8 s): 8 demerits
  – Each second of total latency above the ideal level: (1 demerit / total # requests) x scaling factor
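A minimal sketch of these rules; the outcome labels, record format and default scaling factor are illustrative assumptions, not part of Keynote's published system.

```python
OUTCOME_DEMERITS = {"aborted": 2, "conn_error": 1, "timed_out": 8}

def total_demerits(outcomes, latencies_secs, ideal_total_latency_secs,
                   scaling_factor=1.0):
    """outcomes: one label per request; latencies_secs: response times."""
    score = sum(OUTCOME_DEMERITS.get(o, 0) for o in outcomes)
    # Each second of total latency above the ideal level costs
    # (1 demerit / total # requests) x scaling factor.
    excess_secs = max(0.0, sum(latencies_secs) - ideal_total_latency_secs)
    score += excess_secs * scaling_factor / len(outcomes)
    return score
```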
[Chart: demerits (0-500,000) for Perf-PRESS vs. HA-PRESS under app hang, app crash, node crash, node freeze and link down, split into latency and throughput components]
Online service optimization
[Diagram: a design space spanned by performance metrics (throughput, latency & data quality), environment (workload & faults), and cost of operations & components; its corners range from “cheap, fast & flaky” and “expensive & robust, but slow” to the optimum, “cheap, robust & fast”]
Conclusions
• Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
• The applied workload is very important
  – Reliability metrics do not (and should not) reflect maximum performance/workload!
• Latency did not degrade gracefully in proportion to workload
  – At high loads, PRESS “oscillates” between full service and 100% load shedding
Further work
• Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?)
• Further study will benefit from more sophisticated client & workload simulators
• Services that generate dynamic content should lead to more interesting data (ex: RUBiS)
Example: long-term model
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array1
• p_i(t) = probability that the system is in state i at time t
• w_i(t) = reward (disk I/O operations/sec)
• λ = failure rate of a single disk drive
• μ = disk repair rate
• D = number of data disks

1 Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997

[Diagram: state 0, “normal operation” (reward w_0(t)), moves to state 1, “1 disk failed, repair necessary” (w_1(t)), at rate (D+1)λ and returns at repair rate μ; state 1 moves to the absorbing state 2, “failure: data loss” (w_2(t)), at rate Dλ]
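To make the chain concrete, here is a minimal sketch that steps the three-state model in discrete time and reports the expected reward, sum over i of w_i * p_i(t). The step granularity, rates and per-state rewards are placeholder assumptions, not values from Kari's thesis.

```python
D = 4          # number of data disks (assumed)
LAM = 1e-6     # lambda: per-disk failure probability per time step (assumed)
MU = 1e-2      # mu: disk repair probability per time step (assumed)
REWARDS = [1000.0, 700.0, 0.0]  # w_i: disk I/O ops/sec in states 0, 1, 2

def step(p):
    """Advance the state distribution (p0, p1, p2) by one step:
    0 -> 1 at (D+1)*lambda, 1 -> 0 at mu, 1 -> 2 at D*lambda,
    and state 2 (data loss) is absorbing."""
    p0, p1, p2 = p
    return (p0 * (1 - (D + 1) * LAM) + p1 * MU,
            p0 * (D + 1) * LAM + p1 * (1 - MU - D * LAM),
            p1 * D * LAM + p2)

def expected_reward(t_steps):
    """E[reward at time t] = sum_i w_i * p_i(t), starting in state 0."""
    p = (1.0, 0.0, 0.0)
    for _ in range(t_steps):
        p = step(p)
    return sum(w * pi for w, pi in zip(REWARDS, p))

print(expected_reward(100_000))
```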