Automated Diagnosis of Chronic Problems in Production Systems
Soila Kavulya
Thesis Committee
  Christos Faloutsos, CMU
  Greg Ganger, CMU
  Matti Hiltunen, AT&T
  Priya Narasimhan, CMU (Advisor)
Soila Kavulya @ March 2012
Outline
  Motivation
  Thesis Statement
  Approach
    End-to-end trace construction
    Anomaly detection
    Localization
  Evaluation
    VoIP
    Hadoop
  Critique & Related Work
  Pending Work
Motivation
  Chronics are problems that are
    Not transient
    Not resulting in system-wide outage
  Chronics occur in real production systems
    VoIP
      User’s calls fail due to version conflict between user and upgraded server
    Hadoop (CMU’s OpenCloud)
      A user job sporadically fails in map phase with cryptic block I/O error
      User and admins spend 2 months troubleshooting
      Traced to large heap size in tasktracker starving collocated datanodes
  Chronics are due to a variety of root-causes
    Configuration problems, bad hardware, software bugs
  Thesis: Automate chronics diagnosis in production systems
Challenge for Diagnosis
  Single manifestation, multiple possible causes
    Due to single node?
    Due to complex interactions between nodes?
    Due to multiple independent nodes?
[Figure: five interconnected nodes, Node1 through Node5]
Challenges in Production Systems
  Labeled failure-data is not always available
    Difficult to diagnose problems not encountered before
  Sysadmins’ perspective may not correspond to users’
    No access to user configurations, user behavior
    No access to application semantics
    First sign of trouble is often a customer complaint
    Customer complaints can be cryptic
  Desired level of instrumentation may not be possible
    As-is vendor instrumentation with limited control
    Cost of added instrumentation may be high
    Granularity of diagnosis consequently limited
Objectives
  “Is there a problem?” (anomaly detection)
    Detect a problem despite potentially not having seen it before
    Distinguish a genuine problem from a workload change
  “Where is the problem?” (localization)
    Drill down by analyzing different instrumentation perspectives
  “What kind of problems?” (chronics)
    Manifestation: exceptions, performance degradations
    Root-cause: misconfiguration, bad hardware, bugs, contention
    Origin: single/multiple independent sources, interacting sources
  “What kind of environments?” (production systems)
    Production VoIP system at AT&T
    Hadoop: Open-source implementation of MapReduce
Thesis Statement
Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems.
*Comparison of some performance metric across similar (peer) system elements
What was our Inspiration?
  rika (Swahili), noun. peer, contemporary, age-set, undergoing rites of passage (marriage) at similar times.
What is a Peer?
  Temporal similarity
    Age-set: Born around the same time
    Anomaly detection: Events within same time window
  Spatial similarity
    Age-set: Live in same location
    Anomaly detection: Run on same node
  Phase similarity
    Age-set: (birth, initiation, marriage)
    Anomaly detection: (map, shuffle, reduce)
  Contextual similarity
    Age-set: Same gender, clan
    Anomaly detection: Same workload, h/w
Target Systems for Validation
  VoIP system at large telecommunication provider
    10s of millions of calls per day, diverse workloads
    100s of network elements with heterogeneous hardware
    24x7 Ops team uses alarm correlation to diagnose outages
    Separate team troubleshoots long-term chronics
    Labeled traces available
  Hadoop: Open-source implementation of MapReduce
    Diverse kinds of real workloads
      Graph mining, language translation
    Hadoop clusters with homogeneous hardware
      Yahoo! M45 & OpenCloud production clusters
      Controlled experiments in Amazon EC2 cluster
    Long running jobs (> 100s): Hard to label failures
In Support of Thesis Statement
OBJECTIVE | VoIP | HADOOP
Anomaly Detection | Heuristics-based, peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system, OpenCloud pending
Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, ICDCS’10, NOMS’10, CCGRID’10
Goals & Non-Goals
  Goals
    Anomaly detection in the absence of labeled failure-data
    Diagnosis based on available instrumentation sources
    Differentiation of workload changes from anomalies
  Non-goals
    Diagnosis of system-wide outages
    Diagnosis of value faults and transient faults
    Root-cause analysis at code-level
    Online/runtime diagnosis
    Recovery based on diagnosis
Assumptions
  Majority of system is working correctly
  Problems manifest in observable behavioral changes
    Exceptions or performance degradations
  All instrumentation is locally timestamped
  Clocks are synchronized to enable system-wide correlation of data
  Instrumentation faithfully captures system behavior
Overview of Approach
[Figure: pipeline diagram. Inputs: application logs, performance counters. Stages: end-to-end trace construction, anomaly detection, localization. Output: ranked list of root-causes.]
Target System #1: VoIP
[Figure: VoIP architecture showing PSTN access and IP access, gateway servers, IP base elements, call control elements, and application servers within the ISP’s network]
Target System #2: Hadoop
[Figure: Hadoop architecture. Master node: JobTracker, NameNode. Slave nodes: TaskTracker (map/reduce tasks), DataNode (HDFS blocks). Hadoop logs and OS data are collected on both master and slave nodes.]
Performance Counters
  For both Hadoop and VoIP
  Metrics collected periodically from /proc in OS
  Monitoring interval varies from 1 sec to 15 min
  Examples of metrics collected
    CPU utilization
    CPU run-queue size
    Pages in/out
    Memory used/free
    Context switches
    Packets sent/received
    Disk blocks read/written
End-to-End Trace Construction
Application Logs
  Each node logs each request that passes through it
    Timestamp, IP address, request duration/size, phone no., …
  Log formats vary across components and systems
    Application-specific parsers extract relevant attributes
  Construction of end-to-end traces
    Pre-defined schema used to stitch requests across nodes
    Match on key attributes
      In Hadoop, match tasks with same task IDs
      In VoIP, match calls with same sender/receiver phone no.
    Incorporate time-based correlation
      In Hadoop, consider block reads in same time interval as maps
      In VoIP, consider calls with same phone no. within same time interval
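The stitching step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the thesis: the record fields, the `stitch` helper, and the 60-second window are all hypothetical. Records that share the key attributes are grouped, and a group is split wherever consecutive records are further apart than the time window.

```python
from collections import defaultdict

# Hypothetical per-node log records; field names are illustrative only.
records = [
    {"node": "gateway", "time": 100, "caller": "973-123-8888",
     "callee": "409-555-5555", "event": "ATTEMPT"},
    {"node": "app-server", "time": 101, "caller": "973-123-8888",
     "callee": "409-555-5555", "event": "START"},
    {"node": "gateway", "time": 102, "caller": "973-123-8888",
     "callee": "409-555-5555", "event": "STOP"},
    {"node": "gateway", "time": 900, "caller": "973-123-8888",
     "callee": "409-555-5555", "event": "ATTEMPT"},
]


def stitch(records, key_fields, window):
    """Group records by key attributes, then split each group wherever the
    gap between consecutive records exceeds `window` (same units as time)."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[tuple(rec[f] for f in key_fields)].append(rec)

    traces = []
    for recs in by_key.values():
        recs.sort(key=lambda r: r["time"])
        current = [recs[0]]
        for rec in recs[1:]:
            if rec["time"] - current[-1]["time"] <= window:
                current.append(rec)
            else:
                traces.append(current)
                current = [rec]
        traces.append(current)
    return traces


traces = stitch(records, ("caller", "callee"), window=60)
for trace in traces:
    print([(r["node"], r["event"]) for r in trace])
```

With the sample records, the first three events form one call trace and the later attempt becomes a separate trace, because it falls outside the time window.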
Application Logs: VoIP
  Combine per-element logs to obtain per-call traces
    Approximate match on key attributes
    Timestamps, caller-callee numbers, IP, ports
  Determine call status from per-element codes
    Zero talk-time, callback soon after call termination
[Figure: sample log entries from the IP Base Element, Call Control Element, Application Server, and Gateway Server]
  10:03:59, START, 973-123-8888 to 409-555-5555, 192.156.1.2 to 11.22.34.1
  10:03:59, STOP
  10:03:59, ATTEMPT, 973-123-8888 to 409-555-5555
  10:04:01, ATTEMPT, 973-123-xxxx to 409-555-xxxx, 192.156.1.2 to 11.22.34.1
Application Logs: Hadoop (1)
  Peer-comparable attributes extracted from logs
  Correlate traces using IDs and request schema
  Sample log lines:
    2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)
    2009-03-06 23:06:01,612 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_200903062245_0051_m_000055_0 …from ip-10-250-90-207.ec2.internal
  Annotations on the log lines:
    Temporal similarity: Timestamps
    Spatial similarity: Hostnames
    Phase similarity: MapReduce
    Context similarity: TaskType
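A parser for the first sample log line might look like the sketch below. The regular expression and the output field names are assumptions made for illustration; they cover only the line format shown on the slide and are not the parsers used in the thesis.

```python
import re
from datetime import datetime

# Log line copied from the slide.
LINE = ("2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: "
        "attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known "
        "outputs (0 slow hosts and 105 dup hosts)")

# Hypothetical pattern: timestamp, level, Java class, then a task-attempt ID.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<cls>[\w.$]+): "
    r".*?attempt_(?P<job>\d+_\d+)_(?P<kind>[mr])_(?P<task>\d+)_(?P<attempt>\d+)"
)


def parse(line):
    """Return peer-comparable attributes for one log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        # Temporal similarity: timestamp of the event.
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        # Phase similarity: the logging class hints at the phase.
        "class": m.group("cls").rsplit(".", 1)[-1],
        "job": m.group("job"),
        # Context similarity: map or reduce task.
        "task_type": "map" if m.group("kind") == "m" else "reduce",
        "task": int(m.group("task")),
    }


record = parse(LINE)
print(record)
```

Spatial similarity would come from the hostname, which appears only in some lines (such as the second sample) and is left out of this sketch.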
Application Logs: Hadoop (2)
  No global IDs for correlating logs in Hadoop & VoIP
  Extract causal flows using predefined schemas
[Figure: diagram with elements application logs, extract events, NoSQL database, flow schema (JSON), and causal flows]
  Application log line:
    2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)
  Extracted event:
    <time=t2, type=shuffle, reduceid=reduce1, mapid=map1, duration=2s>
  Flow schema (JSON):
    MapReduce: { "events" : { "Map" : { "primary-key" : "MapID", "join-key" : "MapID", "next-event" : "Shuffle"},…
Anomaly Detection
Anomaly Detection Overview
  Some systems have rules for anomaly detection
    Redialing number immediately after disconnection
    Server reported error codes and exceptions
  If no rules available, rely on peer-comparison
    Identifies peers (nodes, flows) in distributed systems
    Detect anomalies by identifying “odd-man-out”
Anomaly Detection (1)
  Empirically determine best peer groupings
    Window size, request-flow types, job information
    Best grouping minimizes false positives in fault-free runs
  Peer-comparison identifies “odd-man-out” behavior
    Robust to workload changes
    Relies on histogram-comparison
    Less sensitive to timing differences
  Multiple suspects might be identified
    Due to propagating errors, multiple independent problems
Anomaly Detection (2)
  Histogram comparison identifies anomalous flows
    Generate aggregate histogram that represents majority behavior
    Compare each node’s histogram against aggregate histogram (O(n) comparisons)
    Compute anomaly score using Kullback-Leibler divergence
    Detect anomaly if score exceeds pre-specified threshold
[Figure: histograms (distributions) of durations of flows for one faulty node and two normal nodes; y-axis: normalized counts (total 1.0)]
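The histogram comparison above could be sketched as follows. The bin layout, the add-one smoothing, the sample durations, and the threshold of 0.4 are all assumptions chosen for illustration; the slides do not specify them.

```python
import math


def histogram(values, lo, width, nbins):
    """Normalized counts (summing to 1.0) over uniform bins.
    Add-one smoothing (an assumption here) keeps every bin non-zero so the
    KL divergence stays finite."""
    counts = [1.0] * nbins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), nbins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def score_nodes(durations_by_node, lo, width, nbins):
    """Compare each node's histogram against the aggregate histogram."""
    everything = [v for vals in durations_by_node.values() for v in vals]
    aggregate = histogram(everything, lo, width, nbins)
    return {
        node: kl_divergence(histogram(vals, lo, width, nbins), aggregate)
        for node, vals in durations_by_node.items()
    }


# Hypothetical flow durations (seconds) per node; node4 is the odd man out.
durations = {
    "node1": [1, 2, 2, 3, 2, 1, 2, 3],
    "node2": [2, 2, 3, 1, 2, 2, 3, 1],
    "node3": [2, 1, 2, 3, 2, 2, 1, 3],
    "node4": [9, 8, 9, 9, 8, 9, 9, 8],
}

THRESHOLD = 0.4  # hypothetical pre-specified threshold
scores = score_nodes(durations, lo=0, width=2, nbins=5)
flagged = [node for node, s in scores.items() if s > THRESHOLD]
print(scores)
print(flagged)
```

Each node is compared once against the aggregate, so the cost grows linearly with the number of nodes, in line with the O(n) note on the slide.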
Localization
“Truth table” Request Representation
  Log Snippet
    Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock
    Req2: 20100901064930,FAIL,Node2,Map,ReadBlock
  Truth table
         Node1  Node2  Map  ReadBlock  Outcome
    Req1   1      0     1       1      SUCCESS
    Req2   0      1     1       1      FAIL
Identify Suspect Attributes
  Assume each attribute represented as “coin toss”
  Estimate attribute distribution using Bayes
    Success distribution: Prob(Attribute|Success)
    Anomalous distribution: Prob(Attribute|Anomalous)
    Anomaly score: KL-divergence between the two distributions
  Indict attributes with highest divergence between distributions
[Figure: belief vs. Probability(Node2=TRUE) for successful requests and anomalous requests]
http://www.pdl.cmu.edu/
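A simplified version of this scoring is sketched below. It swaps the full belief distributions for a point estimate: each attribute's probability is the posterior mean under a uniform Beta(1, 1) prior, and the score is the KL divergence between the two resulting Bernoulli distributions. That simplification, the request data, and all function names are assumptions for illustration.

```python
import math


def posterior_mean(hits, total):
    """Posterior mean of a coin's bias under a uniform Beta(1, 1) prior."""
    return (hits + 1) / (total + 2)


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def score_attributes(requests):
    """Rank attributes by divergence between their failed-request and
    successful-request distributions (highest first)."""
    failed = [r for r in requests if r["outcome"] == "FAIL"]
    ok = [r for r in requests if r["outcome"] == "SUCCESS"]
    attrs = sorted({a for r in requests for a in r["attrs"]})
    scores = {}
    for a in attrs:
        p_fail = posterior_mean(sum(a in r["attrs"] for r in failed), len(failed))
        p_ok = posterior_mean(sum(a in r["attrs"] for r in ok), len(ok))
        scores[a] = bernoulli_kl(p_fail, p_ok)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def req(outcome, *attrs):
    return {"outcome": outcome, "attrs": set(attrs)}


# Hypothetical requests in truth-table form: each request lists the
# attributes that are TRUE for it.
requests = (
    [req("SUCCESS", "Node1", "Map", "ReadBlock")] * 5
    + [req("SUCCESS", "Node3", "Map", "ReadBlock")] * 5
    + [req("SUCCESS", "Node2", "Map", "ReadBlock")] * 1
    + [req("FAIL", "Node2", "Map", "ReadBlock")] * 4
)

ranked = score_attributes(requests)
for attr, score in ranked:
    print(f"{attr}: {score:.3f}")
```

In this data Node2 ranks first because it appears in every failed request but in few successful ones, while Map and ReadBlock appear in nearly all requests of both kinds and score close to zero.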
Rank Problems by Severity
  Indict path with highest anomaly score
  Step 1: All requests
    Problem1: Node2 Map
  Step 2: Filter all requests except those matching Problem1
    Problem2: Node3 Shuffle
[Figure: attribute trees for each step with nodes Map, Shuffle, Node2, Node3, ExceptionX, ExceptionY and numeric labels 350, 120, 670, 90, 290, 450, 160, 340]
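The iterative ranking can be sketched as a loop that indicts the top-scoring attribute, sets aside the requests it explains, and repeats. Several things here are assumptions: the sketch indicts single attributes rather than full paths such as "Node2 Map", it reads the filtering step as removing requests that match an already-indicted problem, and the scoring function, request data, and `min_score` cutoff are hypothetical.

```python
import math


def kl_score(attr, failed, ok):
    """Bernoulli KL divergence between failed and successful requests for
    one attribute, using smoothed frequency estimates."""
    p = (sum(attr in r for r in failed) + 1) / (len(failed) + 2)
    q = (sum(attr in r for r in ok) + 1) / (len(ok) + 2)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def rank_problems(failed, ok, max_problems=5, min_score=0.1):
    """Repeatedly indict the highest-scoring attribute, then drop every
    request matching it so that the next problem can surface."""
    failed, ok = list(failed), list(ok)
    problems = []
    while failed and len(problems) < max_problems:
        attrs = sorted({a for r in failed for a in r})
        best = max(attrs, key=lambda a: kl_score(a, failed, ok))
        score = kl_score(best, failed, ok)
        if score < min_score:
            break
        problems.append((best, score))
        failed = [r for r in failed if best not in r]
        ok = [r for r in ok if best not in r]
    return problems


# Hypothetical requests, each a set of TRUE attributes.
failed_requests = [{"Node2", "Map"}] * 6 + [{"Node3", "Shuffle"}] * 3
ok_requests = (
    [{"Node1", "Map"}] * 10
    + [{"Node1", "Shuffle"}] * 10
    + [{"Node3", "Map"}] * 5
    + [{"Node2", "Shuffle"}] * 5
)

problems = rank_problems(failed_requests, ok_requests)
for rank, (attr, score) in enumerate(problems, start=1):
    print(f"Problem{rank}: {attr} (score {score:.3f})")
```

The second problem only becomes the clear top suspect once the requests explained by the first have been set aside, which is the point of the filtering step.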
Incorporate Performance Counters (1)
  Annotate requests on indicted nodes with performance counters based on timestamps
  Identify metrics most correlated with problem
    Compare distribution of metrics in successful and failed requests
  Requests on node2
    # Timestamp,CallNo,Status,Memory(%),CPU(%)
    20100901064914, 1, SUCCESS, 54, 6
    20100901065030, 2, SUCCESS, 54, 6
    20100901065530, 3, SUCCESS, 56, 4
    20100901070030, 4, FAIL, 52, 45
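The annotation step might be sketched as a timestamp join: each request picks up the most recent counter sample taken at or before it. The counter samples below are hypothetical, chosen so the annotated rows match the table on the slide. Comparing the gap between mean values is a stand-in assumption for the distribution comparison the slide mentions.

```python
from bisect import bisect_right

# Requests on node2 (timestamp, status), taken from the slide.
requests = [
    (20100901064914, "SUCCESS"),
    (20100901065030, "SUCCESS"),
    (20100901065530, "SUCCESS"),
    (20100901070030, "FAIL"),
]

# Hypothetical periodic counter samples: (timestamp, metrics), sorted by time.
samples = [
    (20100901064500, {"memory": 54, "cpu": 6}),
    (20100901065500, {"memory": 56, "cpu": 4}),
    (20100901070000, {"memory": 52, "cpu": 45}),
]


def annotate(requests, samples):
    """Attach to each request the latest sample taken at or before it."""
    times = [t for t, _ in samples]
    rows = []
    for ts, status in requests:
        i = bisect_right(times, ts) - 1
        if i >= 0:  # skip requests that precede the first sample
            rows.append({"time": ts, "status": status, **samples[i][1]})
    return rows


def mean_gap(rows, metric):
    """Absolute difference between the metric's mean in failed and in
    successful requests (a simple proxy for comparing distributions)."""
    ok = [r[metric] for r in rows if r["status"] == "SUCCESS"]
    bad = [r[metric] for r in rows if r["status"] == "FAIL"]
    return abs(sum(bad) / len(bad) - sum(ok) / len(ok))


rows = annotate(requests, samples)
suspect = max(["memory", "cpu"], key=lambda m: mean_gap(rows, m))
print(rows)
print(suspect)
```

The timestamps use the YYYYMMDDHHMMSS form from the slide, so comparing them as integers preserves chronological order.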
Incorporate Performance Counters (2)
  Incorporate performance counters in diagnosis
  Problem1: Node2 Map High CPU
[Figure: attribute tree over all requests with nodes Map, Shuffle, Node2, Node3, numeric labels 350, 120, 670, 90, and a "High CPU" annotation]
Why Does It Work?
  Real-world data backs up utility of peer-comparison
    Task durations peer-comparable in >75% of jobs [CCGrid’10]
  Approach analyzes both successful and failed requests
    Analyzing only failed requests might elevate common elements over causal elements
  Iterative approach discovers correlated attributes
    Identifies problems due to conjunctions of attributes
    Filtering step identifies multiple ongoing problems
  Handles unencountered problems
    Does not rely on historical models of normal behavior
    Does not rely on signatures of known defects
VoIP: Diagnosis of Real Incidents
Examples of real-world incidents | Diagnosed | Resource Indicted
Customers use wrong codec to send faxes | ✓ | NA
Customer problem causes blocked calls at IPBE | ✓ | NA
Blocked circuit identification codes on trunk group | ✓ | NA
Software bug at control server causes blocked calls | ✓ | NA
Problem with customer equipment leads to poor QoS | ✓ | NA
Debug tracing overloads servers during peak traffic | ✓ | CPU
Performance problem at application server | ✓ | CPU/Memory
Congestion at gateway servers due to high load | ✓ | CPU/Concurrent Sessions
Power outage causes brief outages | ✗ | NA
PSX not responding to invites from app. server | ✗ | Low responses at app. server
8 out of 10 real incidents diagnosed
VoIP: Case Studies
  Incident 1: Chronic due to unsupported fax codec
  Incident 2: Chronic server problem
[Figure: two time-series panels over Day1 to Day6, "Failed calls for two customers" and "Failed calls for server", annotated with: Chronic nightly problem; Customers stop using unsupported codec; Unrelated chronic server problem emerges; Server reset]
Implementation of Approach
Draco: Deployment in Production at AT&T
[Screenshot: Draco search/filter interface listing ranked problems]
  1. Problem1: STOP.IP-TO-PS.487.3 STOP.IP-TO-PSTN.41.0.-.- Chicago* GSX Servers MemoryOverload
  2. Problem2: STOP.IP-TO-PSTN.102.0.102.102 ServiceB CustomerAcme IP_w.x.y.z
~8500 lines of C code
VoIP: Ranking Multiple Problems
Draco performs better at ranking multiple independent problems
VoIP: Performance of Algorithm
Offline Analysis | Avg. Log Size | Avg. Data Load Time | Avg. Diagnosis Time
Draco simulated-1hr (C++) | 271 MB | 8s | 4s
Draco real-1day (C++) | 2.4 GB | 7min | 8min
Running on 16-core Xeon (@ 2.4GHz), 24 GB Memory
Hadoop: Target Clusters
  10 to 100-node Amazon EC2 cluster
    Commercial, pay-as-you-use cloud-computing resource
    Workloads under our control, problems injected by us
      gridmix, nutch, sort, random writer
    Can harvest logs and OS data of only our workloads
  4000-processor M45 & 64-node OpenCloud cluster
    Production environment
    Offered to CMU as free cloud-computing resource
    Diverse kinds of real workloads, problems in the wild
      Massive machine-learning, language/machine-translation
    Permission to harvest all logs and OS data
Hadoop: EC2 Fault Injection
Category | Fault | Description
Resource contention | CPU hog | External process uses 70% of CPU
Resource contention | Packet-loss | 5% or 50% of incoming packets dropped
Resource contention | Disk hog | 20GB file repeatedly written to
Application bugs (Source: Hadoop JIRA) | HADOOP-1036 | Maps hang due to unhandled exception
Application bugs (Source: Hadoop JIRA) | HADOOP-1152 | Reduces fail while copying map output
Application bugs (Source: Hadoop JIRA) | HADOOP-2080 | Reduces fail due to incorrect checksum
Injected fault on single node
Hadoop: Peer-comparison Results Without Causal Flows
[Figure: true positive rates (y-axis) per metric (x-axis)]
  Different metrics detect different problems
  Correlated problems (e.g., packet-loss) harder to localize
Hadoop: Peer-comparison Results With Causal Flows + Localization
Examples of real-world incidents | Diagnosed | Metrics Indicted
CPU hog | ✓ | Node
Packet-loss | ✓ | Node+Shuffle
Disk hog | ✓ | Node
HADOOP-1036 | ✓ | Node+Map
HADOOP-1152 | ✓ | Node+Shuffle
HADOOP-2080 | ✓ | Node+Shuffle
Correlated problems correctly identified
Critique of Approach
  Anomaly detection thresholds are fragile
    Need to use statistical tests
  Anomaly detection does not address problems at master
  Peer-groups are defined statically
    Assumes homogeneous clusters
    Need to automate identification of peers
  False positives occur if root-cause not in logs
    Algorithm tends to implicate adjacent network elements
    Need to incorporate more data to improve visibility
Related Work
  Chronics fly under the radar
    Undetected by alarm mining [Mahimkar09]
  Chronics can persist undetected for long periods of time
    Hard to detect using change-points [Kandula09]
    Hard to demarcate problem periods [Sambasivan11]
  Multiple ongoing problems at a time
    Single fault assumption inadequate [Cohen05, Bodik10]
  Peer-comparison on its own inadequate
    Hard to localize propagating problems [Kasick10, Tan10, Kang10]
Pending Work
OBJECTIVE | VoIP | HADOOP
Anomaly Detection | Heuristics-based, peer-comparison pending | Peer comparison without labeled data
Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource
Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending
Production Systems | AT&T production system | EC2 test system, OpenCloud pending
Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, NOMS’10, CCGRID’10
Pending Work: Details
  OpenCloud production cluster & multiple-source problems [April-June 2012]
    64-node cluster housed at Carnegie Mellon
    Obtained and parsed logs from 25 real OpenCloud incidents
    Root-causes include misconfigurations, h/w issues, buggy apps
    Yet to analyze logs
  Peer comparison in VoIP [June-July 2012]
    Examining data that is not labeled, and identifying peers
    Notion of a peer might be determined by function and location
    Root-causes under investigation are as before
  Dissertation writing [June-August 2012]
  Defense [September 2012]
Collaborators & Thanks
  VoIP (AT&T)
    Matti Hiltunen, Kaustubh Joshi, Scott Daniels
  Hadoop diagnosis
    Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli
  Hadoop visualization
    Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI 05-610 team
  OpenCloud
    Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken
Summary
  Peer-comparison effective for anomaly detection
    Robust to workload changes
    Requires little training data
  Incremental fusion of different instrumentation sources enables localization of chronics
    Starts with user-visible symptoms of a problem
    Drills down to localize root-cause of problem
  Usefulness of approach in two production systems
    VoIP system at large telecommunication provider (demonstrated)
    Hadoop clusters (underway)
Questions?
Climbing Mt. Kilimanjaro comes a distant second to a thesis proposal!
Selected Publications (1)
  Diagnosis in Production VoIP system
    DSN12: Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. To appear DSN 2012.
    OSR12: Practical Experiences with Chronics Discovery in Large Telecommunications Systems. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Best Papers from SLAML 2011 in Operating Systems Review, 2011.
  Survey Paper & Workload Analysis of Production Hadoop Cluster
    RAE12: Failure Diagnosis of Complex Systems. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. To appear in Book on Resilience Assessment and Evaluation. Wolter, 2012.
    An analysis of traces from a production MapReduce cluster. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. CCGrid 2010.
Selected Publications (2)
  Visualization in Hadoop
    CHIMIT11: Understanding and improving the diagnostic workflow of MapReduce users. J. D. Campbell, A. B. Ganesan, B. Gotow, S. P. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. CHIMIT 2011.
    ICDCS10: Visual, log-based causal tracing for performance debugging of MapReduce systems. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ICDCS 2010.
  Diagnosis in Hadoop (Application logs + performance counters)
    NOMS10: Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. NOMS 2010.
    ISSRE09: Blind Men and the Elephant (BLIMEy): Piecing together Hadoop for Diagnosis. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ISSRE 2009.
Selected Publications (3)
  Diagnosis in Hadoop (Performance counters)
    HotMetrics09: Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. HotMetrics 2009.
  Diagnosis in Hadoop (Application logs)
    WASL08: SALSA: Analyzing Logs as StAte Machines. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. WASL 2008.
  Diagnosis in Group Communication Systems
    SRDS08: Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. S. Kavulya, R. Gandhi, P. Narasimhan. SRDS 2008.
    SysML07: Fingerpointing Correlated Failures in Replicated Systems. S. Pertet, R. Gandhi, P. Narasimhan. SysML, April 2007.
Related Work (1)
  [Bodik10]: Fingerprinting the datacenter: automated classification of performance crises. Peter Bodík, Moisés Goldszmidt, Armando Fox, Dawn B. Woodard, Hans Andersen. EuroSys 2010.
  [Cohen05]: Capturing, indexing, clustering and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox. SOSP 2005.
  [Kandula09]: Detailed diagnosis in enterprise networks. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl. SIGCOMM 2009.
  [Kasick10]: Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. FAST 2010.
  [Kiciman05]: Detecting application-level failures in component-based Internet Services. Emre Kiciman, Armando Fox. IEEE Trans. on Neural Networks 2005.
Related Work (2)
  [Mahimkar09]: Towards automated performance diagnosis in a large IPTV network. Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao. SIGCOMM 2009.
  [Sambasivan11]: Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. NSDI 2011.