A Statistical Learning Approach to Diagnosing eBay’s Site
Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric [email protected]
Path-based Diagnosis Jan 12, 2004 Slide 2
Motivation
Fast failure detection and diagnosis are critical to high availability
– But the exact root cause may not be required for many recovery techniques
Many potential causes of failures
– Software bugs, hardware, configuration, network, database, etc.
– Manual diagnosis is slow and inconsistent
Statistical approaches are ideal
– Simultaneously examine many possible causes of failures
– Robust to noise
Path-based Diagnosis Jan 12, 2004 Slide 3
Challenges
Lots of (noisy) data
Near real-time detection and diagnosis
Multiple independent failures
Root cause might not be captured in logs
Path-based Diagnosis Jan 12, 2004 Slide 4
Talk Outline
Introduction
eBay’s infrastructure
3 statistical approaches
Early results
Path-based Diagnosis Jan 12, 2004 Slide 5
eBay’s Infrastructure
2 physical tiers
– Web server/app server + DB
– Migrating to Java (WebSphere) from C++
SuperCAL (Centralized Application Logging)
– API for app developers to log anything to CAL
– Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
– Supports nested txns
– A path can be identified via thread ID + host ID
Path-based Diagnosis Jan 12, 2004 Slide 6
SuperCAL Architecture
Stats
– 2K app servers, 40 SuperCAL machines
– 1B URLs/day
– 1TB raw logs/day (150GB gzipped), 200Mbps peak
[Architecture diagram; labels: AppServers, LB switch, real-time msg bus, detection, diagnosis]
Path-based Diagnosis Jan 12, 2004 Slide 7
Failure Analysis
Summarize each transaction into:
ID | Type | Name | Pool | Host | Version | DB | Status
1 | URL | ViewFeedback | Cgi0 | 134 | 1.2.1 | FeedbackDB, UserDB, … | NullPointer
2 | URL | Bid | Cgi2 | 231 | 1.0.3 | PriceDB | Success
3 | XML | … | … | … | … | … | …
(Type through DB are the features; Status is the class)
What features are causing requests to fail?
– Txn type, txn name, pool, host, version, DB, or a combination of these?
– Different causes require different recovery techniques
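The per-transaction summary on this slide can be sketched as a small record type plus a failure counter. This is only an illustration: the `Txn` class, its field names, and `failures_by` are hypothetical, and the two rows are the ones shown on the slide (the elided DB entries are left out).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Txn:
    """One summarized transaction: the features plus the class label (status)."""
    txn_type: str
    name: str
    pool: str
    host: str
    version: str
    dbs: tuple      # a txn may touch a variable number of DBs
    status: str     # class label, e.g. "Success" or an error name

# The two fully specified rows from the slide's table.
txns = [
    Txn("URL", "ViewFeedback", "Cgi0", "134", "1.2.1", ("FeedbackDB", "UserDB"), "NullPointer"),
    Txn("URL", "Bid", "Cgi2", "231", "1.0.3", ("PriceDB",), "Success"),
]

def failures_by(txns, feature):
    """Count failed transactions per value of one feature (hypothetical helper)."""
    counts = Counter()
    for t in txns:
        if t.status == "Success":
            continue
        value = getattr(t, feature)
        # Multi-valued features (the DB list) contribute one count per value.
        for v in (value if isinstance(value, tuple) else (value,)):
            counts[v] += 1
    return counts

pool_failures = failures_by(txns, "pool")
db_failures = failures_by(txns, "dbs")
```

Counting failures per feature value is the raw input that each of the three approaches in the talk works from.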
Path-based Diagnosis Jan 12, 2004 Slide 8
3 Approaches
Machine learning
– Decision trees
– MinEntropy: eBay’s greedy variant of decision trees
Data mining
– Association rules
Path-based Diagnosis Jan 12, 2004 Slide 9
Decision Trees
Classifiers developed in the statistical machine learning field
Example: go skiing tomorrow?
“Learning” => inferring the decision tree’s rules from data
[Figure: two example decision trees for the skiing question, splitting on new snow vs. no new snow and sunny vs. cloudy, with leaves Y and N]
Path-based Diagnosis Jan 12, 2004 Slide 10
Decision Trees
Feature selection
– Look for features that best separate the classes
– Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)
The goal of a decision tree algorithm
– Split nodes until leaves are “pure” enough or until no further split is possible
• i.e. pure => all data points have the same class label
– Use pruning heuristics to control over-fitting
TxnName Failed
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736
… …
Machine Failed
Attila 2985
Lenin 20
Marcus 4
Scipio 5
… …
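To make the feature-selection step concrete, here is a minimal sketch of information gain, the metric the slide names. The toy rows are hypothetical (the slide gives only failure counts, not success counts), and the functions are illustrative rather than C4.5's actual implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature, label="status"):
    """Label entropy minus the size-weighted label entropy after splitting on `feature`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[label] for r in rows]) - remainder

# Hypothetical toy data: every failure is on one machine, spread evenly over txn names.
rows = (
    [{"machine": "Attila", "txn": "MyEBay", "status": "failed"}] * 4
    + [{"machine": "Attila", "txn": "MyEBayLogin", "status": "failed"}] * 4
    + [{"machine": "Lenin", "txn": "MyEBay", "status": "success"}] * 4
    + [{"machine": "Lenin", "txn": "MyEBayLogin", "status": "success"}] * 4
)

gains = {f: information_gain(rows, f) for f in ("machine", "txn")}
best_feature = max(gains, key=gains.get)
```

In this toy data, splitting on the machine separates failed from successful transactions completely, while splitting on the transaction name separates nothing, which mirrors the skewed Machine table versus the flat TxnName table.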
Path-based Diagnosis Jan 12, 2004 Slide 11
Decision Trees – Sample Output
Pool = icgi1
| TxnName = LeaveFeedback: failed (8,1)
| TxnName = MyFeedback: failed (205,3)
Pool = icgi2
| TxnName = Respond: failed (1)
| TxnName = ViewFeedback: failed (3554,52)
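(Leaf counts are given as (correct, incorrect))
[Figure: the same tree drawn with pools icgi1 and icgi2 as internal nodes and leaves LeaveFdbk (8), MyFdbk (205), Respond (1), ViewFdbk (3554)]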
Naïve diagnosis:
1. Pool=icgi1 and TxnName=LeaveFeedback
2. Pool=icgi1 and TxnName=MyFeedback
3. Pool=icgi2 and TxnName=Respond
4. Pool=icgi2 and TxnName=ViewFeedback
Path-based Diagnosis Jan 12, 2004 Slide 12
Feature Selection Heuristics
1. Ignore leaf nodes with no failed transactions
2. Problem: noisy leaves
– Keep the top N leaves, or ignore nodes with < M% failures
3. Problem: features may not be independent
– Drop ancestor nodes that are “subsumed” by the leaves
4. Rank by impact
– Sort the predicted causes by failure count
[Figure: the sample tree after each heuristic: first all four leaves (8, 205, 1, 3554) under icgi1 and icgi2, then only the 205 and 3554 leaves under their pools, then those two leaves alone]
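Heuristics 1, 2 and 4 can be sketched on the leaves from the sample output. This is a rough illustration under stated assumptions: "< M% failures" is read here as a leaf's share of all failed transactions, the 5% threshold is hypothetical, and `rank_causes` is an invented name. Heuristic 3 (dropping subsumed ancestors) is left out because it needs the underlying transactions.

```python
# Leaves from the sample decision-tree output: (pool, txn_name, failed_count).
LEAVES = [
    ("icgi1", "LeaveFeedback", 8),
    ("icgi1", "MyFeedback", 205),
    ("icgi2", "Respond", 1),
    ("icgi2", "ViewFeedback", 3554),
]

def rank_causes(leaves, min_pct):
    """Drop empty and noisy leaves, then sort the rest by failure count.

    ASSUMPTION: a leaf is "noisy" when it holds less than `min_pct` percent
    of all failed transactions across the leaves.
    """
    total = sum(failed for _, _, failed in leaves)
    kept = [
        leaf for leaf in leaves
        if leaf[2] > 0 and 100.0 * leaf[2] / total >= min_pct
    ]
    return sorted(kept, key=lambda leaf: leaf[2], reverse=True)

causes = rank_causes(LEAVES, min_pct=5.0)  # 5% is a hypothetical threshold
```

With this threshold the two small leaves (8 and 1 failures) are filtered out and ViewFeedback ranks ahead of MyFeedback.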
Path-based Diagnosis Jan 12, 2004 Slide 13
MinEntropy
Entropy measures the randomness of data
– E.g. if failure is evenly distributed (very random), then entropy is high
Rank features by the normalized entropy
– Greedy approach searches for the leaf node with most failures
Always produces exactly one diagnosis
Deployed on the entire eBay site
– Sends real-time alerts to ops
– Pros: fast (<1s for 100K txns and scales linearly)
– Cons: optimized for single faults
Path-based Diagnosis Jan 12, 2004 Slide 14
MinEntropy example
TxnType Errors
URL 4350
SQL 47
EMAIL 12
XSLT 0
… …
Pool Errors
Cgi0 12
Cgi1 4002
Cgi2 30
Cgi3 8
Cgi4 5
… …
Machine Errors
Attila 1985
Lenin 2002
Marcus 4
Scipio 0
… …
TxnName Errors
MyEBay 636
MyEBaySeller 512
MyEBayLogin 736
… …
Version Errors
E293 3987
E291 15
Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
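The ranking behind this alert can be approximated by computing a normalized entropy for each feature from the error counts shown. The slides do not give eBay's exact formula, so this sketch assumes Shannon entropy divided by log2 of the number of values, and it uses only the rows visible in the tables (the elided rows are ignored).

```python
import math

def normalized_entropy(counts):
    """Shannon entropy of an error distribution, scaled to [0, 1].

    ASSUMPTION: normalization divides by log2(number of values); the talk
    does not specify how MinEntropy normalizes.
    """
    counts = list(counts)
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(len(counts))

# Error counts copied from the visible rows of the slide's tables.
errors = {
    "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
    "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    "TxnName": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Version": {"E293": 3987, "E291": 15},
}

# Lowest entropy first: errors concentrated on a single value of that feature.
ranked = sorted(errors, key=lambda f: normalized_entropy(errors[f].values()))
top_values = {f: max(errors[f], key=errors[f].get) for f in ranked[:3]}
```

Under these assumptions the three lowest-entropy features are Version, Pool and TxnType, with E293, Cgi1 and URL as their dominant values, while Machine and TxnName stay spread out. That is consistent with the alert naming a version, a pool and a transaction type but no specific URL.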
Path-based Diagnosis Jan 12, 2004 Slide 15
Association Rules
Data mining technique to compute item sets
– e.g. Shoppers who bought this item also shopped for …
Metrics
– Confidence: (# of A & B) / # of A
• Conditional probability of B given A
– Support: (# of A & B) / total # of txns
Generates rules for all possible sets
– e.g. machine=abc, txn=login => status=NullPointer (conf:0.1, support=0.02)
Applied to failure diagnosis
– Find all rules that have failed status on the right, then rank by conf
– Pros: looks at combinations of features
– Cons: generates many rules
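The two metrics can be checked with a few lines of code. The sketch below computes support and confidence for a single rule; the function name and the toy transactions are hypothetical, sized so that the numbers match the slide's example rule (conf 0.1, support 0.02).

```python
def support_and_confidence(txns, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    txns is a list of dicts; antecedent and consequent map feature -> value.
    """
    def matches(txn, items):
        return all(txn.get(k) == v for k, v in items.items())

    n_a = sum(1 for t in txns if matches(t, antecedent))
    n_ab = sum(1 for t in txns if matches(t, antecedent) and matches(t, consequent))
    support = n_ab / len(txns) if txns else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Hypothetical data: 50 txns, 10 match the left side, 1 of those fails.
txns = (
    [{"machine": "abc", "txn": "login", "status": "NullPointer"}] * 1
    + [{"machine": "abc", "txn": "login", "status": "Success"}] * 9
    + [{"machine": "xyz", "txn": "bid", "status": "Success"}] * 40
)

support, confidence = support_and_confidence(
    txns,
    antecedent={"machine": "abc", "txn": "login"},
    consequent={"status": "NullPointer"},
)
```

A real miner would enumerate many candidate rules (for example with Apriori); this only evaluates one rule that is already given.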
Path-based Diagnosis Jan 12, 2004 Slide 16
Association Rules – Sample Output
Sample output (rules containing failures):
TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Problem: features may not be independent
– e.g. all LeaveFeedback txns are of type URL
– Drop rules that are subsumed by more specific rules
Diagnosis: TxnName=LeaveFeedback
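One reading that reproduces the diagnosis shown is to treat a rule as redundant when another rule with the same confidence uses a strict subset of its conditions. The sketch below implements only that reading; it is an assumption about how the pruning might work, not the authors' stated procedure, and `collapse_redundant` is an invented name.

```python
# The four sample rules: (antecedent conditions, confidence).
rules = [
    ({"TxnType": "URL", "Pool": "icgi2", "TxnName": "LeaveFeedback"}, 0.28),
    ({"Pool": "icgi2", "TxnName": "LeaveFeedback"}, 0.28),
    ({"TxnType": "URL", "TxnName": "LeaveFeedback"}, 0.28),
    ({"TxnName": "LeaveFeedback"}, 0.28),
]

def collapse_redundant(rules):
    """Keep a rule unless an equally confident rule uses a strict subset of its conditions.

    ASSUMPTION: equal confidence means the extra conditions add no information.
    """
    kept = []
    for antecedent, conf in rules:
        redundant = any(
            other != antecedent
            and other.items() <= antecedent.items()  # other's conditions are a subset
            and other_conf == conf
            for other, other_conf in rules
        )
        if not redundant:
            kept.append((antecedent, conf))
    return kept

diagnosis = collapse_redundant(rules)
```

On the sample output this leaves the single rule on TxnName=LeaveFeedback, matching the diagnosis line.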
Path-based Diagnosis Jan 12, 2004 Slide 17
Experimental Setup
Dataset
– About 1/8 of the whole site
– 10 one-minute traces, 4 with 2 concurrent faults
• Total of 14 independent faults
– True faults identified through post-mortems, ops chat logs, application logs, etc.
Metrics
– Precision: (# of correctly identified faults) / (# of predicted faults)
– Recall: (# of correctly identified faults) / (# of true faults)
Type | Name | Pool | Machine | Version | Database | Status
10 | 300 | 15 | 260 | 7 | 40 | 8
Host | DB | Host, Host | Host, DB | Host, SW | DB, SW
2 | 4 | 1 | 1 | 1 | 1
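As a quick sketch of the evaluation metrics, the function below uses the standard definitions: precision is the fraction of predicted faults that are true faults, and recall is the fraction of true faults that were predicted. The fault identifiers in the example are hypothetical.

```python
def precision_recall(predicted, true):
    """Return (precision, recall) for a set of predicted faults against the true faults."""
    predicted, true = set(predicted), set(true)
    hits = predicted & true
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(true) if true else 0.0
    return precision, recall

# Hypothetical trace with two concurrent true faults and four predicted causes.
true_faults = {("Host", "134"), ("DB", "FeedbackDB")}
predicted_faults = {
    ("Host", "134"),
    ("DB", "FeedbackDB"),
    ("Pool", "icgi1"),
    ("TxnName", "Bid"),
}

precision, recall = precision_recall(predicted_faults, true_faults)
```

Here both true faults are found (recall 1.0) but half of the predictions are wrong (precision 0.5), which is the kind of tradeoff the noise-filtering heuristics aim to improve.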
Path-based Diagnosis Jan 12, 2004 Slide 18
Results: DBs in Dataset
[Bar chart: recall and precision (0% to 100%) for C4.5 naïve, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming)]
True causes for DB-related failures are captured in the dataset
– Variable number of DBs used by each txn
Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering
– Ignore nodes with < M% failures (in this case, M = 10)
3. Path trimming
– Drop ancestor nodes subsumed by the leaf nodes
Path-based Diagnosis Jan 12, 2004 Slide 19
Results: DBs not in Dataset
[Bar chart: precision and recall (0% to 100%) for C4.5, MinEntropy, Association Rules (N=5), and Association Rules (N=10)]
True cause not captured for DB-related failures
C4.5 suffers from unbalanced dataset
– i.e. produces a single rule that predicts every txn to be successful
Path-based Diagnosis Jan 12, 2004 Slide 20
What’s next?
ROC curves
– Show tradeoff between precision and recall
Transient failures
– Up-sample to balance dataset or use cost matrix
Some measure of the “confidence” of the prediction
More data points
– Have 20hrs of logs that have failures
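The up-sampling idea can be sketched as random oversampling of the smaller class until the classes are the same size. This is one simple way to balance a dataset and is not necessarily what the authors used; the function name, the `status` field, and the toy rows are hypothetical.

```python
import random

def upsample_minority(rows, label="status", seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    if not by_class:
        return []
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap (k may be 0 for the largest class).
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical unbalanced trace: 2 failed txns among 10.
rows = [{"status": "failed"}] * 2 + [{"status": "success"}] * 8
balanced = upsample_minority(rows)
n_failed = sum(1 for r in balanced if r["status"] == "failed")
n_success = sum(1 for r in balanced if r["status"] == "success")
```

Balancing matters here because a tree trained on mostly successful transactions can score well by predicting success for everything.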
Path-based Diagnosis Jan 12, 2004 Slide 21
Open Questions
How to deal with multiple symptoms?
– E.g. DB outage causing multiple types of requests to fail
– Treat it as multiple failures?
Failure importance (count vs. rate)
– Two failures may have similar failure count
– Low volume and higher failure rate vs. high volume and lower failure rate
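A small worked example shows why count and rate can disagree. The numbers below are hypothetical: two candidate causes with similar failure counts but very different volumes.

```python
def failure_rate(failed, total):
    """Fraction of transactions that failed (0.0 when there is no traffic)."""
    return failed / total if total else 0.0

# Hypothetical candidates with similar failure counts but different volumes.
candidates = {
    "low-volume txn": {"failed": 50, "total": 100},
    "high-volume txn": {"failed": 60, "total": 10_000},
}

by_count = sorted(candidates, key=lambda k: candidates[k]["failed"], reverse=True)
by_rate = sorted(candidates, key=lambda k: failure_rate(**candidates[k]), reverse=True)
```

Ranking by count puts the high-volume transaction first, while ranking by rate puts the low-volume one first, so the choice of metric changes which failure is reported as most important.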