A Statistical Learning Approach to Diagnosing eBay’s Site

Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric Brewer

Path-based Diagnosis Jan 12, 2004 Slide 2

Motivation

Fast failure detection and diagnosis are critical to high availability
– But the exact root cause may not be required for many recovery techniques
Many potential causes of failures
– Software bugs, hardware, configuration, network, database, etc.
– Manual diagnosis is slow and inconsistent
Statistical approaches are ideal
– Simultaneously examine many possible causes of failures
– Robust to noise

Challenges

Lots of (noisy) data
Near real-time detection and diagnosis
Multiple independent failures
Root cause might not be captured in logs

Talk Outline

Introduction
eBay’s infrastructure
3 statistical approaches
Early results

eBay’s Infrastructure

2 physical tiers
– Web server/app server + DB
– Migrating to Java (WebSphere) from C++
SuperCAL (Centralized Application Logging)
– API for app developers to log anything to CAL
– Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.
– Supports nested txns
– A path can be identified via thread ID + host ID

SuperCAL Architecture

Stats
– 2K app servers, 40 SuperCAL machines
– 1B URLs/day
– 1TB raw logs/day (150GB gzipped), 200Mbps peak

[Figure: architecture diagram with the components LB switch, app servers, real-time msg bus, detection, and diagnosis]

Failure Analysis

Summarize each transaction into:

ID   Type   Name           Pool   Host   Version   DB                      Status
1    URL    ViewFeedback   Cgi0   134    1.2.1     FeedbackDB, UserDB, …   NullPointer
2    URL    Bid            Cgi2   231    1.0.3     PriceDB                 Success
3    XML    …              …      …      …         …                       …

(Type through DB are the features; Status is the class.)

What features are causing requests to fail?
– Txn type, txn name, pool, host, version, DB, or a combination of these?
– Different causes require different recovery techniques

3 Approaches

Machine learning
– Decision trees
– MinEntropy – eBay’s greedy variant of decision trees
Data mining
– Association rules

Decision Trees

Classifiers developed in the statistical machine learning field
Example: go skiing tomorrow?
“Learning” => inferring the decision tree’s rules from data

[Figure: two decision trees for the skiing example, one splitting first on new snow vs. no new snow and the other first on sunny vs. cloudy, each with Y/N leaves]

Decision Trees

Feature selection
– Look for the features that best separate the classes
– Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)
The goal of a decision tree algorithm
– Split nodes until the leaves are “pure” enough or until no further split is possible
  • i.e. pure => all data points have the same class label
– Use pruning heuristics to control over-fitting

TxnName        Failed
MyEBay         636
MyEBaySeller   512
MyEBayLogin    736
…              …

Machine   Failed
Attila    2985
Lenin     20
Marcus    4
Scipio    5
…         …
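As a rough illustration of the feature-selection step, the sketch below computes information gain for a candidate split: the entropy of the class labels minus the weighted entropy after splitting on one feature. The function names and the four toy records are hypothetical; they are not eBay data and this is not C4.5's own code.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, feature, label="status"):
    """Label entropy minus the weighted label entropy after splitting on `feature`."""
    labels = [r[label] for r in records]
    groups = defaultdict(list)
    for r in records:
        groups[r[feature]].append(r[label])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: every failure happens on one machine.
records = [
    {"machine": "Attila", "txn": "MyEBay", "status": "failed"},
    {"machine": "Attila", "txn": "MyEBayLogin", "status": "failed"},
    {"machine": "Lenin", "txn": "MyEBay", "status": "ok"},
    {"machine": "Lenin", "txn": "MyEBayLogin", "status": "ok"},
]
gain_machine = information_gain(records, "machine")
gain_txn = information_gain(records, "txn")
```

In this toy data, splitting on machine yields pure groups while splitting on txn separates nothing, so a tree learner using this metric would split on machine first.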

Decision Trees – Sample Output

Pool = icgi1
| TxnName = LeaveFeedback: failed (8,1)
| TxnName = MyFeedback: failed (205,3)
Pool = icgi2
| TxnName = Respond: failed (1)
| TxnName = ViewFeedback: failed (3554,52)

The numbers in parentheses are (correct, incorrect).

[Figure: the same tree drawn with branches icgi1 and icgi2, leaves LeaveFdbk, MyFdbk, Respond, and ViewFdbk, and failure counts 8, 205, 1, and 3554]

Naïve diagnosis:
1. Pool=icgi1 and TxnName=LeaveFeedback
2. Pool=icgi1 and TxnName=MyFeedback
3. Pool=icgi2 and TxnName=Respond
4. Pool=icgi2 and TxnName=ViewFeedback

Feature Selection Heuristics

1. Ignore leaf nodes with no failed transactions
2. Problem: noisy leaves
– keep the top N leaves, or ignore nodes with < M% failures
3. Problem: features may not be independent
– drop ancestor nodes that are “subsumed” by the leaves
4. Rank by impact
– sort the predicted causes by failure count

[Figure: the sample tree at three stages: all four leaves (8, 205, 1, 3554) under icgi1 and icgi2; only the 205 and 3554 leaves under icgi1 and icgi2; the 205 and 3554 leaves alone]
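Heuristics 1, 2, and 4 can be sketched on the four leaves from the sample output. This is an assumed reading, not eBay's implementation: `diagnose`, `min_share`, and `top_n` are hypothetical names, "M%" is interpreted as a share of all failures, and the 5% threshold is chosen only for illustration. Heuristic 3 (trimming subsumed ancestors) is left out because it needs information about how features depend on each other.

```python
def diagnose(leaves, min_share=0.05, top_n=None):
    """Filter and rank decision-tree leaves that predict failure.

    leaves: list of (path, failed_count), where path is a tuple of
    "feature=value" strings. min_share is the assumed meaning of "M%":
    the minimum fraction of all failures a leaf must account for.
    """
    failing = [(p, n) for p, n in leaves if n > 0]                  # heuristic 1
    total = sum(n for _, n in failing)
    if total == 0:
        return []
    kept = [(p, n) for p, n in failing if n / total >= min_share]   # heuristic 2
    kept.sort(key=lambda leaf: leaf[1], reverse=True)               # heuristic 4
    return kept[:top_n] if top_n else kept

# Failure counts taken from the sample output.
leaves = [
    (("Pool=icgi1", "TxnName=LeaveFeedback"), 8),
    (("Pool=icgi1", "TxnName=MyFeedback"), 205),
    (("Pool=icgi2", "TxnName=Respond"), 1),
    (("Pool=icgi2", "TxnName=ViewFeedback"), 3554),
]
ranked = diagnose(leaves)
```

With these settings the leaves with 8 and 1 failures are filtered out as noise, and ViewFeedback ranks ahead of MyFeedback.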

MinEntropy

Entropy measures the randomness of data
– E.g. if failures are evenly distributed (very random), then entropy is high
Rank features by the normalized entropy
– Greedy approach searches for the leaf node with the most failures
Always produces exactly one diagnosis
Deployed on the entire eBay site
– Sends real-time alerts to ops
– Pros: fast (<1s for 100K txns and scales linearly)
– Cons: optimized for single faults

MinEntropy example

TxnType   Errors
URL       4350
SQL       47
EMAIL     12
XSLT      0
…         …

Pool   Errors
Cgi0   12
Cgi1   4002
Cgi2   30
Cgi3   8
Cgi4   5
…      …

Machine   Errors
Attila    1985
Lenin     2002
Marcus    4
Scipio    0
…         …

TxnName        Errors
MyEBay         636
MyEBaySeller   512
MyEBayLogin    736
…              …

Version   Errors
E293      3987
E291      15

Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
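The ranking step can be sketched on the counts shown in the tables above (visible rows only; the elided "…" rows are ignored). This is a minimal sketch under one assumption: "normalized" is taken to mean entropy divided by the log of the number of values, so each feature scores between 0 and 1. The function name is hypothetical, and the greedy search over leaves is not reproduced here.

```python
import math

def normalized_entropy(counts):
    """Entropy of the error distribution over a feature's values,
    divided by log(number of values) so the result lies in [0, 1]."""
    if len(counts) < 2:
        return 0.0  # a single-valued feature has no spread to measure
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts))

errors = {
    "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
    "Pool": {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    "TxnName": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Version": {"E293": 3987, "E291": 15},
}

# (score, feature, value with the most errors), lowest entropy first
ranked = sorted(
    (normalized_entropy(c), feature, max(c, key=c.get))
    for feature, c in errors.items()
)
```

On these counts Version, Pool, and TxnType score lowest (errors concentrated in E293, Cgi1, and URL), Machine sits in the middle because errors are split between two hosts, and TxnName scores close to 1. That pattern is consistent with the alert on the slide.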

Association Rules

Data mining technique to compute item sets
– e.g. Shoppers who bought this item also shopped for …
Metrics
– Confidence: (# of A & B) / (# of A)
  • Conditional probability of B given A
– Support: (# of A & B) / (total # of txns)
Generates rules for all possible sets
– e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)
Applied to failure diagnosis
– Find all rules that have a failed status on the right, then rank by confidence
– Pros: looks at combinations of features
– Cons: generates many rules
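The two metrics can be computed directly from a list of transactions. The sketch below is illustrative only: `rule_metrics` is a hypothetical helper and the four transactions are made up, not eBay data.

```python
def rule_metrics(txns, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent.

    txns: list of dicts; antecedent and consequent: dicts of feature -> value.
    """
    def matches(txn, cond):
        return all(txn.get(k) == v for k, v in cond.items())

    n_a = sum(1 for t in txns if matches(t, antecedent))
    n_ab = sum(1 for t in txns if matches(t, antecedent) and matches(t, consequent))
    confidence = n_ab / n_a if n_a else 0.0       # (# of A & B) / (# of A)
    support = n_ab / len(txns) if txns else 0.0   # (# of A & B) / (total # of txns)
    return confidence, support

# Hypothetical transactions
txns = [
    {"machine": "abc", "txn": "login", "status": "NullPointer"},
    {"machine": "abc", "txn": "login", "status": "Success"},
    {"machine": "abc", "txn": "bid", "status": "Success"},
    {"machine": "xyz", "txn": "login", "status": "Success"},
]
conf, supp = rule_metrics(
    txns, {"machine": "abc", "txn": "login"}, {"status": "NullPointer"}
)
```

Here two transactions match the left-hand side and one of them fails, so confidence is 1/2, while support is 1/4 because only one of the four transactions matches both sides.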

Association Rules – Sample Output

Sample output (rules containing failures):

TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)
TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

Problem: features may not be independent
– e.g. all LeaveFeedback txns are of type URL
– Drop rules that are subsumed by more specific rules

Diagnosis: TxnName=LeaveFeedback
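One way to arrive at the single diagnosis shown is to treat rules with equal confidence as redundant and keep only the one with the fewest conditions. The sketch below implements that reading; it is an assumption about the pruning step, not a description of eBay's implementation, and `prune_redundant` and `tol` are hypothetical names.

```python
def prune_redundant(rules, tol=1e-9):
    """Keep a rule only if no more general rule has the same confidence.

    rules: list of (antecedent, confidence), where antecedent is a frozenset
    of "feature=value" strings. A rule is more general than another when its
    antecedent is a proper subset of the other's.
    """
    kept = []
    for ante, conf in rules:
        redundant = any(
            other < ante and abs(other_conf - conf) <= tol
            for other, other_conf in rules
        )
        if not redundant:
            kept.append((ante, conf))
    return kept

# The four rules from the sample output, all predicting Status=Failed.
rules = [
    (frozenset({"TxnType=URL", "Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"Pool=icgi2", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnType=URL", "TxnName=LeaveFeedback"}), 0.28),
    (frozenset({"TxnName=LeaveFeedback"}), 0.28),
]
kept = prune_redundant(rules)
```

Because adding TxnType or Pool to the left-hand side does not change the confidence, only the rule on TxnName=LeaveFeedback survives.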

Experimental Setup

Dataset
– About 1/8 of the whole site
– 10 one-minute traces, 4 with 2 concurrent faults
  • total of 14 independent faults
– True faults identified through post-mortems, ops chat logs, application logs, etc.
Metrics
– Precision: (# of correctly identified faults) / (# of predicted faults)
– Recall: (# of correctly identified faults) / (# of true faults)

Feature   Type   Name   Pool   Machine   Version   Database   Status
Count     10     300    15     260       7         40         8

Faults in trace   Host   DB   Host, Host   Host, DB   Host, SW   DB, SW
Traces            2      4    1            1          1          1
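Under the usual definitions (precision is the share of predicted faults that are true; recall is the share of true faults that were predicted), both metrics reduce to a set intersection. The sketch below assumes faults are represented as "feature=value" strings; the helper name and the example trace are hypothetical.

```python
def precision_recall(predicted, true):
    """Precision and recall for sets of predicted and true faults."""
    correct = predicted & true
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(true) if true else 0.0
    return precision, recall

# Hypothetical trace with two concurrent faults and three predicted causes.
true_faults = {"Host=134", "DB=FeedbackDB"}
predicted = {"Host=134", "Pool=Cgi0", "TxnName=ViewFeedback"}
p, r = precision_recall(predicted, true_faults)
```

In this example one of three predictions is correct (precision 1/3) and one of two true faults is found (recall 1/2).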

Results: DBs in Dataset

[Figure: bar chart of recall and precision (0% to 100%) for C4.5 naïve, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming)]

True causes for DB-related failures are captured in the dataset
– Variable number of DBs used by each txn
Feature selection heuristics
1. Ignore leaf nodes with no failed transactions
2. Noise filtering
– ignore nodes with < M% failures (in this case, M = 10)
3. Path trimming
– drop ancestor nodes subsumed by the leaf nodes

Results: DBs not in Dataset

[Figure: bar chart of precision and recall (0% to 100%) for C4.5, MinEntropy, Association Rules (N=5), and Association Rules (N=10)]

True cause not captured for DB-related failures
C4.5 suffers from the unbalanced dataset
– i.e. it produces a single rule that predicts every txn to be successful

What’s next?

ROC curves
– show tradeoff between precision and recall
Transient failures
– Up-sample to balance dataset or use cost matrix
Some measure of the “confidence” of the prediction
More data points
– Have 20hrs of logs that have failures

Open Questions

How to deal with multiple symptoms?
– E.g. DB outage causing multiple types of requests to fail
– Treat it as multiple failures?
Failure importance (count vs. rate)
– Two failures may have similar failure count
– Low volume and higher failure rate vs. high volume and lower failure rate
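The count-versus-rate question can be made concrete with a small worked example. The volumes and names below are hypothetical, chosen so that the two groups have similar failure counts but very different failure rates.

```python
def failure_stats(groups):
    """Map each group name to (failure count, failure rate)."""
    return {name: (failed, failed / total) for name, (failed, total) in groups.items()}

# Hypothetical (failed, total) volumes per transaction name.
groups = {"LowVolumeTxn": (90, 100), "HighVolumeTxn": (100, 10_000)}
stats = failure_stats(groups)

by_count = max(stats, key=lambda g: stats[g][0])  # ranks by absolute impact
by_rate = max(stats, key=lambda g: stats[g][1])   # ranks by how broken the group is
```

Ranking by count puts HighVolumeTxn first (100 failures at a 1% rate), while ranking by rate puts LowVolumeTxn first (90 failures at a 90% rate), which is the tension the slide raises.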

