Ranking the Importance of Alerts for Problem Determination in Large Computer Systems



Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena

NEC Laboratories America, Princeton

Outline
• Introduction
 – Motivation & Goal
• System Invariants
 – Invariants extraction
 – Value propagation
• Collaborative peer review mechanism
 – Rules & Fault model
 – Ranking alerts
• Experiment result
• Conclusion


Motivation


• Large & complex systems are deployed by integrating many heterogeneous components:
 – servers, routers, storage & software from multiple vendors
 – hidden dependencies among components
• Log/performance data is collected from the components:
 – Operators set many rules to check it and trigger alerts, e.g. CPU% @ Web > 70%.
 – Rule setting is independent & isolated, based on each operator's own system knowledge.

Goal


• Which alerts should we analyze first?
 – Get more consensus from others.
 – Blend system-management knowledge from multiple operators.
• We introduce a "peer-review" mechanism to rank the importance of alerts.
• Operators can then prioritize the problem-determination process.

Example (full automation):
 – Alert 1: CPU% @ Web > 70%
 – Alert 2: DiskUsg @ Web > 150
 – Alert 3: CPU% @ DB > 60%
 – Alert 4: Network @ AP > 35k
Ranked output: Alert 3 > Alert 1 > Alert 2 > Alert 4

Alerts Ranking Process

[Diagram: the alert-ranking workflow]

Offline:
 1. Extract invariants from the monitoring data of the large system to build an invariants model [ICAC 2006][TDSC 2006][TKDE 2007][DSN 2006].
 2. Operators (with domain knowledge) define the alert rules (Alert 1: CPU% @ Web > 70%; Alert 2: DiskUsg @ Web > 150; Alert 3: CPU% @ DB > 60%; Alert 4: Network @ AP > 35k).
 3. Sort the alert rules using the invariants model and the domain information.

Online (at the time alerts are received):
 4. Rank the real alerts (e.g. Alert 1, Alert 1, Alert 1, Alert 4).

System Invariants

[Figure: user requests flow into the target system; internal measurements m1, m2, …, mn are collected over time. Is there any constant relationship among them?]

Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.

• User requests flow through the system endlessly, and much of the internal monitoring data reacts to the volume of user requests accordingly.
• We search for relationships among these internal measurements collected at various points.
• If the modeled relationships continue to hold all the time, they can be regarded as invariants of the system.

Invariant Examples

• Check the implicit relationships, not the raw values of the flow intensities, which are always changing.
• However, many relationships are constant!
 – Example: x and y keep changing, but the equation y = f(x) stays constant.

[Figure: two invariant examples]
 – Load balancer: the input flow equals the sum of the output flows, I1 = O1 + O2 + O3.
 – Database server: packet volume V1 and SQL query number N1 satisfy the invariant V1 = f(N1).

Automated Invariants Search

[Diagram: sequential validation of invariant candidates]

• Monitor the target system and collect observation data over [t0-t1].
• Pick any two measurements i, j and learn a model f_ij from the model library (template); each learned f_ij is an invariant candidate.
• Sequential validation: with each new window of observation data [t1-t2], …, [tk-tk+1], test whether f_ij still holds. If NO, drop the variant f_ij; if YES, update its confidence score (P0, P1, …, Pk).
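A minimal sketch of this loop in Python. The survival test and the confidence update are illustrative assumptions (the slides only say that failing candidates are dropped and survivors carry a confidence score Pk): here a candidate survives a window if its fitness score stays above a fixed threshold, and its confidence is the fraction of windows survived.

```python
import numpy as np

def fitness_score(y, y_hat):
    """F(theta) from the ARX slide: 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    return 100.0 * (1.0 - np.sqrt(np.sum((y - y_hat) ** 2) /
                                  np.sum((y - np.mean(y)) ** 2)))

def sequential_validation(candidates, windows, threshold=50.0):
    """candidates: {(i, j): model}, each model exposing predict(x) -> y_hat.
    windows: list of dicts {measurement_name: samples}, one per interval
    [t1-t2], ..., [tk-tk+1]. Drops a candidate f_ij as soon as its fitness
    falls below `threshold` (assumed test); returns the survivors with
    confidence = fraction of windows survived."""
    survived = {pair: 0 for pair in candidates}
    for data in windows:
        for (i, j), model in list(candidates.items()):
            y_hat = model.predict(data[i])
            if fitness_score(data[j], y_hat) < threshold:
                del candidates[(i, j)]   # drop the variant f_ij
            else:
                survived[(i, j)] += 1
    n = max(len(windows), 1)
    return {pair: survived[pair] / n for pair in candidates}
```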

One example in the model library


• We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow-intensity measurements.
• Given a sequence of real observations, we learn the model by least squares, minimizing the prediction error.
• A fitness function can then be used to evaluate how well the learned model fits the real data.

ARX model:

$$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_0 x(t-k) + b_1 x(t-k-1) + \dots + b_m x(t-k-m) + c$$

Define the parameter and regressor vectors:

$$\theta = [a_1, \dots, a_n, b_0, \dots, b_m, c]^T$$
$$\varphi(t) = [-y(t-1), \dots, -y(t-n), x(t-k), \dots, x(t-k-m), 1]^T, \qquad y(t) = \varphi(t)^T \theta$$

Least-squares estimate over N observations:

$$\hat{\theta} = \Big[ \sum_{t=1}^{N} \varphi(t)\varphi(t)^T \Big]^{-1} \sum_{t=1}^{N} \varphi(t)\, y(t)$$

Fitness score, where $\hat{y}(t)$ is the model's prediction and $\bar{y}$ the mean of the observations:

$$F(\theta) = \Bigg[ 1 - \sqrt{ \frac{ \sum_{t=1}^{N} |y(t) - \hat{y}(t)|^2 }{ \sum_{t=1}^{N} |y(t) - \bar{y}|^2 } } \Bigg] \times 100$$
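A small numerical sketch of this fit, assuming the simplest case n = 1, m = 0, k = 0 (so the model is y(t) + a1·y(t-1) = b0·x(t) + c); numpy.linalg.lstsq performs the least-squares step.

```python
import numpy as np

def fit_arx(x, y):
    """Fit y(t) + a1*y(t-1) = b0*x(t) + c by least squares (n=1, m=0, k=0).
    Returns theta = [a1, b0, c] and the fitness score F(theta)."""
    # Regressor phi(t) = [-y(t-1), x(t), 1] for t = 1..N-1
    phi = np.column_stack([-y[:-1], x[1:], np.ones(len(y) - 1)])
    target = y[1:]
    theta, *_ = np.linalg.lstsq(phi, target, rcond=None)
    y_hat = phi @ theta
    F = 100.0 * (1.0 - np.sqrt(np.sum((target - y_hat) ** 2) /
                               np.sum((target - np.mean(target)) ** 2)))
    return theta, F

# Synthetic data generated by y(t) = 0.5*y(t-1) + 2*x(t) + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 500)
y = np.zeros_like(x)
for t in range(1, len(x)):
    y[t] = 0.5 * y[t - 1] + 2.0 * x[t] + 1.0 + rng.normal(0.0, 0.1)
theta, F = fit_arx(x, y)
print(theta, F)   # theta close to [-0.5, 2.0, 1.0]; F close to 100
```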

Value Propagation with Invariants

[Diagram: invariant network x → y = f(x) → z = g(y), and x → u = h(x) → v = s(u)]

• Extract the invariants, then propagate values along them. For an ARX invariant

$$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_0 x(t) + b_1 x(t-1) + \dots + b_m x(t-m) + c,$$

setting the input to a constant, $x(t) = \bar{x}$, makes the output converge to

$$\bar{y} = \frac{\sum_{i=0}^{m} b_i\, \bar{x} + c}{1 + \sum_{j=1}^{n} a_j}.$$

• Multi-hop propagation: z = g(f(x)), v = s(h(x)).
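A sketch of this converged-value propagation, assuming each edge of the invariant network is stored as its ARX coefficients; chaining propagate() across edges yields the multi-hop values g(f(x)) and s(h(x)). The coefficients below are made up for illustration.

```python
def propagate(a, b, c, x_bar):
    """Converged output of an ARX invariant when its input is held at x_bar:
    y_bar = (sum(b) * x_bar + c) / (1 + sum(a))."""
    return (sum(b) * x_bar + c) / (1.0 + sum(a))

# Hypothetical two-hop chain x -> y -> z:
f = {"a": [-0.5], "b": [2.0], "c": 1.0}        # y = f(x)
g = {"a": [0.2], "b": [0.8, 0.1], "c": 0.0}    # z = g(y)

y_bar = propagate(f["a"], f["b"], f["c"], x_bar=70.0)
z_bar = propagate(g["a"], g["b"], g["c"], y_bar)   # z = g(f(x))
print(y_bar, z_bar)
```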

Rules and Fault Model

Rule: if $x > T_x$ then generate_alert(1), i.e. a predicate plus an action.

[Figure: the fault model of a rule, giving the probability of fault occurrence as a function of x. The ideal model is a 0-to-1 step at the threshold $T_x$; a realistic model rises smoothly, so a hard threshold yields false positives on one side and false negatives on the other.]
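A tiny sketch of such a fault model. The logistic shape is purely an assumption for illustration (the slides only draw a smooth curve, not a formula):

```python
import math

def fault_probability(x, threshold, steepness=0.2):
    """Assumed realistic fault model: probability that the system is actually
    faulty when measurement value x triggers the rule, modeled here as a
    logistic curve centered at the rule's threshold. The ideal model would
    be a hard 0/1 step at `threshold`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

print(fault_probability(60, 70), fault_probability(70, 70), fault_probability(85, 70))
# rises smoothly, passing through 0.5 exactly at the threshold
```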

Probability of Reporting a True Positive Alert

• Importance of an alert: $\mathrm{Prob}(\mathrm{true} \mid x)$, the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.
• Even a very small false-positive rate leads to a large number of false-positive reports.
 – Example: one measurement is checked every minute with an FP rate of 0.1% ⇒ 60 × 24 × 365 × 0.1% ≈ 526 FP reports per year. What if there are thousands of measurements?
 – Example: in a real operation support system, 80% of reports are FPs.

Local Context Mapping to Global Context

[Diagram: three-tier system Web – AP – DB with four alert rules]
 – Alert 1: CPU% @ Web > 70%
 – Alert 2: DiskUsg @ Web > 150
 – Alert 3: CPU% @ DB > 60%
 – Alert 4: Network @ AP > 35k

• Each threshold has different local semantics. Invariants map them into a global context, e.g. onto CPU%@Web:
 – CPU%@Web = fa(Network@AP)
 – CPU%@Web = fb(CPU%@DB)
 – CPU%@Web = fc(DiskUsg@Web)
• On the fault model (PRTP curve) of CPU%@Web, the propagated thresholds satisfy
 Prob(true|x_CPU@DB) > Prob(true|x_T) > Prob(true|x_DiskUsg@Web) > Prob(true|x_Network@AP)
 ⇒ ranking: Alert 3 > Alert 1 > Alert 2 > Alert 4.

Local Context Mapping to Global Context (cont.)

• The same thresholds can instead be mapped onto the fault model of Network@AP. There the propagated thresholds satisfy
 Prob(true|x_CPU@DB) > Prob(true|x_CPU@WEB) > Prob(true|x_DiskUsg@Web) > Prob(true|x_T)
• Alert ranking: no change, still Alert 3 > Alert 1 > Alert 2 > Alert 4.
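A sketch of the offline rule sorting this implies: propagate every rule's local threshold onto one common measurement's fault-model axis, then rank the rules by the PRTP of the mapped values. Since PRTP increases monotonically in x, sorting by the mapped thresholds suffices. The numbers below are hypothetical stand-ins for fa(35k), fb(60), fc(150).

```python
# Hypothetical thresholds mapped onto the CPU%@Web axis; in practice the
# invariant functions fa, fb, fc would produce these values.
mapped_threshold = {
    "Alert 1 (CPU%@Web > 70)":     70.0,  # its own threshold x_T
    "Alert 2 (DiskUsg@Web > 150)": 66.0,  # fc(150), assumed
    "Alert 3 (CPU%@DB > 60)":      78.0,  # fb(60), assumed
    "Alert 4 (Network@AP > 35k)":  55.0,  # fa(35000), assumed
}

# Higher mapped threshold -> higher PRTP -> more important rule.
ranking = sorted(mapped_threshold, key=mapped_threshold.get, reverse=True)
for rank, rule in enumerate(ranking, start=1):
    print(rank, rule)
# 1 Alert 3, 2 Alert 1, 3 Alert 2, 4 Alert 4 -- matching the slide
```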

Alerts Ranking Process (recap)

• Online, at the time alerts are received, step 4 ranks the real alerts (e.g. Alert 1, Alert 1, Alert 1, Alert 4).

Ranking Alerts (Case I)

Case I: receive ONLY ALERTS, no monitoring data from the components.

• Offline, the alert rules are sorted using the operators' knowledge & configuration and the system invariants network:
 Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4.
• Online, when 5 alerts are generated (Alerts 2, 3, 7, 5, 1), they are ranked by that precomputed order (see the sketch below):
 1. Alert 2, 2. Alert 3, 3. Alert 7, 4. Alert 5, 5. Alert 1.
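A minimal sketch: with only alert identities available, ranking reduces to filtering the offline sorted-rule list down to the alerts that actually fired.

```python
# Offline output: rules sorted by importance (the slide's example order).
sorted_rules = ["Alert 6", "Alert 2", "Alert 3", "Alert 7", "Alert 5",
                "Alert 9", "Alert 1", "Alert 8", "Alert 4"]

def rank_case1(fired, sorted_rules):
    """Rank the fired alerts by their position in the precomputed order."""
    position = {rule: i for i, rule in enumerate(sorted_rules)}
    return sorted(set(fired), key=position.get)

print(rank_case1(["Alert 5", "Alert 1", "Alert 2", "Alert 7", "Alert 3"],
                 sorted_rules))
# ['Alert 2', 'Alert 3', 'Alert 7', 'Alert 5', 'Alert 1']
```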

Ranking Alerts (Case II)

Case II: receive both alerts and monitoring data from the components.

• Map all thresholds into the alerting measurement's local context (as before) and compare the observed value against them.
• Number of Threshold Violations (NTV): how many of those thresholds the observed value exceeds.
 – Observed value X(CPU%@Web): exceeds 3 thresholds on the fault model of CPU%@Web ⇒ NTV = 3.
 – Observed value X(Network@AP): exceeds 2 thresholds on the fault model of Network@AP ⇒ NTV = 2.
• The alert from CPU%@Web is therefore more important than the one from Network@AP. (A sketch of the NTV computation follows.)
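A sketch of the NTV computation, assuming the propagated thresholds for each measurement are already available (as in the tables on the next slides). Counting the rule's own threshold along with the mapped ones matches the NTV values reported there.

```python
def ntv(observed, own_threshold, mapped_thresholds):
    """Number of Threshold Violations: how many thresholds (the rule's own
    plus the others mapped into this measurement's context) the observed
    value exceeds. Returns None when the rule itself did not fire."""
    if observed <= own_threshold:
        return None   # no alert from this rule
    return sum(observed > t for t in [own_threshold] + list(mapped_thresholds))
```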

Index
• Introduction
 – Motivation & Goal
• System Invariants
 – Invariants extraction
 – Value propagation
• Collaborative peer review mechanism
 – Rules & Fault model
 – Ranking alerts
• Experiment result
• Conclusion

Experimental system

[Diagram: experimental four-tier system with components A, B, C, D]

Flow intensities:
 – I_ejb(t): the number of EJBs created at time t.
 – I_jvm(t): the JVM processing time at time t.
 – I_sql(t): the number of SQL queries at time t.

Invariant examples:

$$I_{ejb}(t) = 0.07\, I_{ejb}(t-1) + 0.57\, I_{jvm}(t)$$
$$I_{sql}(t) = 0.34\, I_{sql}(t-1) + 1.41\, I_{ejb}(t) + 0.2\, I_{ejb}(t-1)$$

Extracted Invariants Network

[Figure: the invariant network extracted over the six monitored measurements m1-m6.]

Thresholds of Measurements

Rule for each measurement: if $x_i > T_{m_i}$ then generate_alert($m_i$).

Local thresholds: $T_{m_1}$ = 70, $T_{m_2}$ = 30000, $T_{m_3}$ = 80, $T_{m_4}$ = 70, $T_{m_5}$ = 30000, $T_{m_6}$ = 20000.

Example: mapped into m1's local context, the other five thresholds become 63.6, 70.2, 70.5, 77.0, 59.8.

Thresholds of Measurements (cont.)

Each row lists a measurement's own threshold and the other five thresholds mapped into its local context:

 m1: T = 70    | 63.6, 70.2, 70.5, 77.0, 59.8
 m2: T = 30000 | 32726, 33006, 33212, 36316, 28207
 m3: T = 80    | 71.4, 78.0, 86.4, 81.0, 66.9
 m4: T = 70    | 57.4, 62.8, 63.7, 54.1, 63.0
 m5: T = 30000 | 29540, 29646, 32613, 25469, 27018
 m6: T = 20000 | 23208, 23291, 25688, 21200, 23509

Ranking Alerts with NTVs (1)

Using the thresholds above, the observed values and resulting NTVs at one sampling point are:

 m1: observed 73.6  ⇒ NTV = 5
 m2: observed 34319 ⇒ NTV = 5
 m3: observed 81.6  ⇒ NTV = 5
 m4: observed 71.4  ⇒ NTV = 6
 m5: observed 30621 ⇒ NTV = 5
 m6: observed 22620 ⇒ NTV = 2

All six rules fire; the alert on m4 ranks first and the alert on m6 last.
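Reproducing these NTVs with the ntv() sketch from earlier and the table data:

```python
table = {   # own threshold, then the other thresholds mapped into this context
    "m1": (70,    [63.6, 70.2, 70.5, 77.0, 59.8]),
    "m2": (30000, [32726, 33006, 33212, 36316, 28207]),
    "m3": (80,    [71.4, 78.0, 86.4, 81.0, 66.9]),
    "m4": (70,    [57.4, 62.8, 63.7, 54.1, 63.0]),
    "m5": (30000, [29540, 29646, 32613, 25469, 27018]),
    "m6": (20000, [23208, 23291, 25688, 21200, 23509]),
}
observed = {"m1": 73.6, "m2": 34319, "m3": 81.6,
            "m4": 71.4, "m5": 30621, "m6": 22620}

ntvs = {m: ntv(observed[m], *table[m]) for m in table}
print(ntvs)  # {'m1': 5, 'm2': 5, 'm3': 5, 'm4': 6, 'm5': 5, 'm6': 2}
```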


Ranking Alerts with NTVs (2)

After injecting a problem (an SCP copy) into the Web server, the observed values against the same thresholds are:

 m1: observed 73.5  ⇒ NTV = 5
 m2: observed 31478 ⇒ NTV = 2
 m3: observed 54.6  ⇒ no alert
 m4: observed 46.1  ⇒ no alert
 m5: observed 22712 ⇒ no alert
 m6: observed 18564 ⇒ no alert

Only two alerts fire, and the alert on m1 ranks above the alert on m2.


Conclusion
• We introduced a peer-review mechanism to rank alerts from heterogeneous components:
 – by mapping the local thresholds of the various rules into their equivalent values in a global context,
 – based on the system-invariants network model.
• It supports operators' consultation for prioritizing problem determination.

Thank You!

• Questions?
