A Statistical Learning Approach to Diagnosing eBay’s Site

Mike Chen, Alice Zheng, Jim Lloyd, Michael Jordan, Eric [email protected]

Path-based Diagnosis Jan 12, 2004 Slide 2

Motivation

Fast failure detection and diagnosis are critical to high availability

– But, exact root cause may not be required for many recovery techniques

Many potential causes of failures

– Software bugs, hardware, configuration, network, database, etc.

– Manual diagnosis is slow and inconsistent

Statistical approaches are ideal

– Simultaneously examining many possible causes of failures

– Robust to noise


Challenges

Lots of (noisy) data

Near real-time detection and diagnosis

Multiple independent failures

Root cause might not be captured in logs


Talk Outline

Introduction

eBay’s infrastructure

3 statistical approaches

Early results


eBay’s Infrastructure

2 physical tiers

– Web server/app server + DB

– Migrating to Java (WebSphere) from C++

SuperCAL (Centralized Application Logging)

– API for app developer to log anything to CAL

– Runtime platform provides application-generic logging: cookie, host, URL, DB table(s), status, duration, etc.

– Supports nested txns

– A path can be identified via thread ID + host ID
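The last bullet can be illustrated with a minimal sketch that groups log records into paths keyed by (host ID, thread ID). The record layout and field names below are hypothetical and are not SuperCAL's actual schema; a real system would also need transaction boundaries, since a thread is reused across requests.

```python
from collections import defaultdict

# Hypothetical log records; the field names are illustrative assumptions.
records = [
    {"host": 134, "thread": 7, "event": "URL ViewFeedback"},
    {"host": 134, "thread": 7, "event": "SQL FeedbackDB"},
    {"host": 231, "thread": 7, "event": "URL Bid"},
    {"host": 134, "thread": 7, "event": "SQL UserDB"},
]

def group_into_paths(records):
    """Group events by (host, thread), preserving arrival order within each path."""
    paths = defaultdict(list)
    for rec in records:
        paths[(rec["host"], rec["thread"])].append(rec["event"])
    return dict(paths)

paths = group_into_paths(records)
for key, events in paths.items():
    print(key, events)
```

Note that the same thread ID on two different hosts yields two separate paths, which is why both IDs are part of the key.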


SuperCAL Architecture

Stats

– 2K app servers, 40 SuperCAL machines

– 1B URLs/day

– 1TB raw logs/day (150GB gzipped), 200Mbps peak

[Figure: architecture diagram with components labeled LB Switch, AppServers, real-time msg bus, detection, and diagnosis]


Failure Analysis

Summarize each transaction into:

ID  Type  Name          Pool  Host  Version  DB                     Status
1   URL   ViewFeedback  Cgi0  134   1.2.1    FeedbackDB, UserDB, …  NullPointer
2   URL   Bid           Cgi2  231   1.0.3    PriceDB                Success
3   XML   …             …     …     …        …                      …

(Type through DB are the features; Status is the class.)

What features are causing requests to fail?

– Txn type, txn name, pool, host, version, DB, or a combination of these?

– Different causes require different recovery techniques


3 Approaches

Machine learning

– Decision trees

– MinEntropy: eBay’s greedy variant of decision trees

Data mining

– Association rules


Decision Trees

Classifiers developed in the statistical machine learning field

Example: go skiing tomorrow?

“Learning” => inferring the decision tree rules from data

[Figure: two example decision trees for the skiing question, each splitting on new snow vs. no new snow and on sunny vs. cloudy, with Y/N leaves]


Decision Trees

Feature selection

– Look for features that best separate the classes

– Different algorithms use different metrics to measure “skewness” (e.g. C4.5 uses information gain)

The goal of the decision tree algorithm

– To split nodes until leaves are “pure” enough or until no further split is possible

• i.e. pure => all data points have the same class label

– Use pruning heuristics to control over-fitting

TxnName       Failed
MyEBay        636
MyEBaySeller  512
MyEBayLogin   736
…             …

Machine  Failed
Attila   2985
Lenin    20
Marcus   4
Scipio   5
…        …
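To make the feature-selection idea concrete, here is a minimal sketch of information gain computed per feature. The four transactions are made-up toy data (not from the eBay traces), chosen so that Machine separates the classes perfectly while TxnName does not separate them at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, feature, label="Status"):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    base = entropy([r[label] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Toy data (hypothetical): every failure is on one machine.
rows = [
    {"Machine": "Attila", "TxnName": "MyEBay",      "Status": "failed"},
    {"Machine": "Attila", "TxnName": "MyEBayLogin", "Status": "failed"},
    {"Machine": "Lenin",  "TxnName": "MyEBay",      "Status": "ok"},
    {"Machine": "Lenin",  "TxnName": "MyEBayLogin", "Status": "ok"},
]

gain_machine = information_gain(rows, "Machine")
gain_txn = information_gain(rows, "TxnName")
print(gain_machine, gain_txn)
```

A tree learner would split on Machine first here, which mirrors the two tables above: failures concentrated on one machine are a sharper signal than failures spread evenly across transaction names.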


Decision Trees – Sample Output

Pool = icgi1

| TxnName = LeaveFeedback: failed (8,1)

| TxnName = MyFeedback: failed (205,3)

Pool = icgi2

| TxnName = Respond: failed (1)

| TxnName = ViewFeedback: failed (3554,52)

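The numbers in parentheses are (correct, incorrect) counts for each leaf.

[Figure: the same tree drawn with pools icgi1 and icgi2 at the first level and leaves LeaveFdbk (8), MyFdbk (205), Respond (1), and ViewFdbk (3554)]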

Naïve diagnosis:

1. Pool=icgi1 and TxnName=LeaveFeedback

2. Pool=icgi1 and TxnName=MyFeedback

3. Pool=icgi2 and TxnName=Respond

4. Pool=icgi2 and TxnName=ViewFeedback


Feature Selection Heuristics

1. Ignore leaf nodes with no failed transactions

2. Problem: noisy leaves

– Keep the top N leaves, or ignore nodes with < M% failures

3. Problem: features may not be independent

– Drop ancestor nodes that are “subsumed” by the leaves

4. Rank by impact

– Sort the predicted causes by failure count

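[Figure: the sample tree after each heuristic. The full tree has leaves with 8, 205, 1, and 3554 failures under pools icgi1 and icgi2; noise filtering keeps only the 205 and 3554 leaves; path trimming then drops the pool nodes above them]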
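Heuristics 1, 2, and 4 can be sketched as a simple filter over the leaves. This is an illustration under stated assumptions, not eBay's implementation: "M% failures" is interpreted here as a leaf's share of all failures, the threshold of 5% is a hypothetical value, and heuristic 3 (path trimming) is omitted because it needs knowledge of which features depend on each other. The failure counts come from the sample output.

```python
def rank_leaves(leaves, min_share_pct=5.0):
    """Drop leaves with no failures, drop noisy leaves, rank the rest by impact."""
    failing = [leaf for leaf in leaves if leaf["failed"] > 0]      # heuristic 1
    total = sum(leaf["failed"] for leaf in failing)
    kept = [leaf for leaf in failing                               # heuristic 2
            if 100.0 * leaf["failed"] / total >= min_share_pct]
    return sorted(kept, key=lambda leaf: leaf["failed"], reverse=True)  # heuristic 4

# Leaves from the sample decision tree output.
leaves = [
    {"Pool": "icgi1", "TxnName": "LeaveFeedback", "failed": 8},
    {"Pool": "icgi1", "TxnName": "MyFeedback",    "failed": 205},
    {"Pool": "icgi2", "TxnName": "Respond",       "failed": 1},
    {"Pool": "icgi2", "TxnName": "ViewFeedback",  "failed": 3554},
]

ranked = rank_leaves(leaves)
for leaf in ranked:
    print(leaf["Pool"], leaf["TxnName"], leaf["failed"])
```

With this threshold, only the 3554 and 205 leaves survive, which matches the counts that remain after noise filtering in the slide's figure.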


MinEntropy

Entropy measures the randomness of data

– E.g. if failure is evenly distributed (very random), then entropy is high

Rank features by the normalized entropy

– Greedy approach searches for the leaf node with most failures

Always produces one and exactly one diagnosis

Deployed on the entire eBay site

– Sends real-time alerts to ops

– Pros: fast (<1s for 100K txns and scales linearly)

– Cons: optimized for single faults


MinEntropy example

TxnType  Errors
URL      4350
SQL      47
EMAIL    12
XSLT     0
…        …

Pool  Errors
Cgi0  12
Cgi1  4002
Cgi2  30
Cgi3  8
Cgi4  5
…     …

Machine  Errors
Attila   1985
Lenin    2002
Marcus   4
Scipio   0
…        …

TxnName       Errors
MyEBay        636
MyEBaySeller  512
MyEBayLogin   736
…             …

Version  Errors
E293     3987
E291     15

Alert: Version E293 causing URL failures (not specific to any URL) in pool CGI1
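The ranking step can be sketched as follows. The slides do not define the normalization, so this sketch assumes normalized entropy means the Shannon entropy of the error distribution divided by log2 of the number of values; it also uses only the rows listed in the tables above and ignores the elided ones. It is not eBay's deployed MinEntropy code.

```python
import math

def normalized_entropy(counts):
    """Entropy of the error distribution divided by its maximum, log2(k).

    The normalization is an assumption; zero counts contribute nothing.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h / math.log2(k)

# Error counts copied from the example tables (listed rows only).
errors = {
    "TxnType": {"URL": 4350, "SQL": 47, "EMAIL": 12, "XSLT": 0},
    "Pool":    {"Cgi0": 12, "Cgi1": 4002, "Cgi2": 30, "Cgi3": 8, "Cgi4": 5},
    "Machine": {"Attila": 1985, "Lenin": 2002, "Marcus": 4, "Scipio": 0},
    "TxnName": {"MyEBay": 636, "MyEBaySeller": 512, "MyEBayLogin": 736},
    "Version": {"E293": 3987, "E291": 15},
}

# Lowest normalized entropy first: errors concentrated in one value.
ranked = sorted(
    ((name, normalized_entropy(list(vals.values()))) for name, vals in errors.items()),
    key=lambda pair: pair[1],
)
top_feature = ranked[0][0]
top_value = max(errors[top_feature], key=errors[top_feature].get)
print(ranked)
print(top_feature, top_value)
```

Under these assumptions, Version ranks first with E293 as its dominant value, while TxnName ranks last because its errors are spread almost evenly, which is consistent with the alert saying the failure is not specific to any URL.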


Association Rules

Data mining technique to compute item sets

– e.g. Shoppers who bought this item also shopped for …

Metrics

– Confidence: (# of A & B) / # of A

• Conditional probability of B given A

– Support: (# of A & B) / total # of txns

Generates rules for all possible sets

– e.g. machine=abc, txn=login => status=NullPointer (conf: 0.1, support: 0.02)

Applied to failure diagnosis

– Find all rules that have failed status on the right, then rank by conf

– Pros: looks at combinations of features

– Cons: generates many rules
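The two metrics can be computed directly from their definitions. The sketch below evaluates a single rule A => B over a small made-up set of transactions; the transactions and the helper name `rule_metrics` are hypothetical, and a real rule miner would enumerate candidate rules rather than score one by hand.

```python
def rule_metrics(txns, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent."""
    def matches(txn, cond):
        return all(txn.get(key) == value for key, value in cond.items())

    n_a = sum(1 for t in txns if matches(t, antecedent))
    n_ab = sum(1 for t in txns
               if matches(t, antecedent) and matches(t, consequent))
    confidence = n_ab / n_a if n_a else 0.0        # (# of A & B) / # of A
    support = n_ab / len(txns) if txns else 0.0    # (# of A & B) / total # of txns
    return confidence, support

# Toy transactions (hypothetical): 10 in total, 4 match A, 1 matches A and B.
txns = (
    [{"machine": "abc", "txn": "login", "status": "NullPointer"}]
    + [{"machine": "abc", "txn": "login", "status": "Success"}] * 3
    + [{"machine": "xyz", "txn": "bid", "status": "Success"}] * 6
)

confidence, support = rule_metrics(
    txns,
    antecedent={"machine": "abc", "txn": "login"},
    consequent={"status": "NullPointer"},
)
print(confidence, support)
```

Confidence is computed relative to the transactions matching A only, while support is relative to all transactions, so a rule can have high confidence and still be rare.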


Association Rules – Sample Output

Sample output (rules containing failures):

TxnType=URL Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

Pool=icgi2 TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

TxnType=URL TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

TxnName=LeaveFeedback ==> Status=Failed conf:(0.28)

Problem: features may not be independent

– e.g. all LeaveFeedback txns are of type URL

– Drop rules that are subsumed by more specific rules

Diagnosis: TxnName=LeaveFeedback


Experimental Setup

Dataset

– About 1/8 of the whole site

– 10 one-minute traces, 4 with 2 concurrent faults

• Total of 14 independent faults

– True faults identified through post-mortems, ops chat logs, application logs, etc.

Metrics

– Precision: (# of correctly identified faults) / (# of predicted faults)

– Recall: (# of correctly identified faults) / (# of true faults)

Type  Name  Pool  Machine  Version  Database  Status
10    300   15    260      7        40        8

Host  DB  Host, Host  Host, DB  Host, SW  DB, SW
2     4   1           1         1         1
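As a small worked example of the metrics, the sketch below scores one set of predicted faults against the true faults, using the standard definitions: precision is the fraction of predicted faults that are correct, and recall is the fraction of true faults that were found. The fault names are made up for illustration and do not come from the eBay traces.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of fault identifiers."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical trace with two concurrent true faults and three predictions.
actual = {("Host", "Attila"), ("DB", "FeedbackDB")}
predicted = {("Host", "Attila"), ("Pool", "icgi2"), ("TxnName", "Bid")}

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Here one of three predictions is correct and one of two true faults is found, so a diagnosis that emits more candidate causes can raise recall while lowering precision.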


Results: DBs in Dataset

[Figure: bar chart of recall and precision (0% to 100%) for C4.5 naïve, C4.5 (noise filtering), and C4.5 (noise filtering + path trimming)]

True causes for DB-related failures are captured in the dataset

– Variable number of DBs used by each txn

Feature selection heuristics

1. Ignore leaf nodes with no failed transactions

2. Noise filtering

– Ignore nodes with < M% failures (in this case, M = 10)

3. Path trimming

– Drop ancestor nodes subsumed by the leaf nodes


Results: DBs not in Dataset

[Figure: bar chart of precision and recall (0% to 100%) for C4.5, MinEntropy, Association Rules (N=5), and Association Rules (N=10)]

True cause not captured for DB-related failures

C4.5 suffers from unbalanced dataset

– i.e. produces a single rule that predicts every txn to be successful


What’s next?

ROC curves

– Show tradeoff between precision and recall

Transient failures

– Up-sample to balance dataset or use cost matrix

Some measure of the “confidence” of the prediction

More data points

– Have 20hrs of logs that have failures


Open Questions

How to deal with multiple symptoms?

– E.g. DB outage causing multiple types of requests to fail

– Treat it as multiple failures?

Failure importance (count vs. rate)

– Two failures may have similar failure count

– Low volume and higher failure rate vs. high volume and lower failure rate
