Ranking the Importance of Alerts for Problem Determination in Large Computer Systems



Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena

NEC Laboratories America, Princeton

Outline
• Introduction
 – Motivation & Goal
• System Invariants
 – Invariants extraction
 – Value propagation
• Collaborative peer review mechanism
 – Rules & Fault model
 – Ranking alerts
• Experiment result
• Conclusion


Motivation


• Large & complex systems are deployed by integrating many heterogeneous components:
 – servers, routers, storage & software from multiple vendors
 – hidden dependencies among components
• Log/performance data is collected from the components:
 – Operators set many rules to check it and trigger alerts, e.g. CPU% @ Web > 70%.
 – Rule setting is independent & isolated, based on each operator's own system knowledge.

Goal


• Which alerts should we analyze first?
 – Get more consensus from others.
 – Blend system-management knowledge from multiple operators.
• We introduce a "peer-review" mechanism to rank the importance of alerts.
• Operators can then prioritize the problem-determination process.

Example (full automation):
 – Alert 1: CPU% @ Web > 70%
 – Alert 2: DiskUsg @ Web > 150
 – Alert 3: CPU% @ DB > 60%
 – Alert 4: Network @ AP > 35k
Ranked output: Alert 3 > Alert 1 > Alert 2 > Alert 4

Alerts Ranking Process

[Diagram: the alert-ranking workflow]

Offline:
 1. Extract invariants from the monitoring data of the large system to build an invariants model [ICAC 2006][TDSC 2006][TKDE 2007][DSN 2006].
 2. Operators (with domain knowledge) define the alert rules (Alert 1: CPU% @ Web > 70%; Alert 2: DiskUsg @ Web > 150; Alert 3: CPU% @ DB > 60%; Alert 4: Network @ AP > 35k).
 3. Sort the alert rules using the invariants model and the domain information.

Online (at the time alerts are received):
 4. Rank the real alerts (e.g. Alert 1, Alert 1, Alert 1, Alert 4).

System Invariants

[Figure: user requests flow into the target system; internal measurements m1, m2, …, mn are collected over time. Is there any constant relationship among them?]

Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.

• User requests flow through the system endlessly, and much of the internal monitoring data reacts to the volume of user requests accordingly.
• We search for relationships among these internal measurements collected at various points.
• If the modeled relationships continue to hold all the time, they can be regarded as invariants of the system.

Invariant Examples

• Check the implicit relationships, not the raw values of the flow intensities, which are always changing.
• However, many relationships are constant!
 – Example: x and y keep changing, but the equation y = f(x) stays constant.

[Figure: two invariant examples]
 – Load balancer: the input flow equals the sum of the output flows, I1 = O1 + O2 + O3.
 – Database server: packet volume V1 and SQL query number N1 satisfy the invariant V1 = f(N1).

Automated Invariants Search

[Diagram: sequential validation of invariant candidates]

• Monitor the target system and collect observation data over [t0-t1].
• Pick any two measurements i, j and learn a model f_ij from the model library (template); each learned f_ij is an invariant candidate.
• Sequential validation: with each new window of observation data [t1-t2], …, [tk-tk+1], test whether f_ij still holds. If NO, drop the variant f_ij; if YES, update its confidence score (P0, P1, …, Pk).
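A minimal sketch of this loop in Python. The survival test and the confidence update are illustrative assumptions (the slides only say that failing candidates are dropped and survivors carry a confidence score Pk): here a candidate survives a window if its fitness score stays above a fixed threshold, and its confidence is the fraction of windows survived.

```python
import numpy as np

def fitness_score(y, y_hat):
    """F(theta) from the ARX slide: 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    return 100.0 * (1.0 - np.sqrt(np.sum((y - y_hat) ** 2) /
                                  np.sum((y - np.mean(y)) ** 2)))

def sequential_validation(candidates, windows, threshold=50.0):
    """candidates: {(i, j): model}, each model exposing predict(x) -> y_hat.
    windows: list of dicts {measurement_name: samples}, one per interval
    [t1-t2], ..., [tk-tk+1]. Drops a candidate f_ij as soon as its fitness
    falls below `threshold` (assumed test); returns the survivors with
    confidence = fraction of windows survived."""
    survived = {pair: 0 for pair in candidates}
    for data in windows:
        for (i, j), model in list(candidates.items()):
            y_hat = model.predict(data[i])
            if fitness_score(data[j], y_hat) < threshold:
                del candidates[(i, j)]   # drop the variant f_ij
            else:
                survived[(i, j)] += 1
    n = max(len(windows), 1)
    return {pair: survived[pair] / n for pair in candidates}
```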

One example in the model library


• We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow-intensity measurements.
• Given a sequence of real observations, we learn the model by least squares, minimizing the prediction error.
• A fitness function can then be used to evaluate how well the learned model fits the real data.

ARX model:

$$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_0 x(t-k) + b_1 x(t-k-1) + \dots + b_m x(t-k-m) + c$$

Define the parameter and regressor vectors:

$$\theta = [a_1, \dots, a_n, b_0, \dots, b_m, c]^T$$
$$\varphi(t) = [-y(t-1), \dots, -y(t-n), x(t-k), \dots, x(t-k-m), 1]^T, \qquad y(t) = \varphi(t)^T \theta$$

Least-squares estimate over N observations:

$$\hat{\theta} = \Big[ \sum_{t=1}^{N} \varphi(t)\varphi(t)^T \Big]^{-1} \sum_{t=1}^{N} \varphi(t)\, y(t)$$

Fitness score, where $\hat{y}(t)$ is the model's prediction and $\bar{y}$ the mean of the observations:

$$F(\theta) = \Bigg[ 1 - \sqrt{ \frac{ \sum_{t=1}^{N} |y(t) - \hat{y}(t)|^2 }{ \sum_{t=1}^{N} |y(t) - \bar{y}|^2 } } \Bigg] \times 100$$
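A small numerical sketch of this fit, assuming the simplest case n = 1, m = 0, k = 0 (so the model is y(t) + a1·y(t-1) = b0·x(t) + c); numpy.linalg.lstsq performs the least-squares step.

```python
import numpy as np

def fit_arx(x, y):
    """Fit y(t) + a1*y(t-1) = b0*x(t) + c by least squares (n=1, m=0, k=0).
    Returns theta = [a1, b0, c] and the fitness score F(theta)."""
    # Regressor phi(t) = [-y(t-1), x(t), 1] for t = 1..N-1
    phi = np.column_stack([-y[:-1], x[1:], np.ones(len(y) - 1)])
    target = y[1:]
    theta, *_ = np.linalg.lstsq(phi, target, rcond=None)
    y_hat = phi @ theta
    F = 100.0 * (1.0 - np.sqrt(np.sum((target - y_hat) ** 2) /
                               np.sum((target - np.mean(target)) ** 2)))
    return theta, F

# Synthetic data generated by y(t) = 0.5*y(t-1) + 2*x(t) + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 500)
y = np.zeros_like(x)
for t in range(1, len(x)):
    y[t] = 0.5 * y[t - 1] + 2.0 * x[t] + 1.0 + rng.normal(0.0, 0.1)
theta, F = fit_arx(x, y)
print(theta, F)   # theta close to [-0.5, 2.0, 1.0]; F close to 100
```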

Value Propagation with Invariants

[Diagram: invariant network x → y = f(x) → z = g(y), and x → u = h(x) → v = s(u)]

• Extract the invariants, then propagate values along them. For an ARX invariant

$$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_0 x(t) + b_1 x(t-1) + \dots + b_m x(t-m) + c,$$

setting the input to a constant, $x(t) = \bar{x}$, makes the output converge to

$$\bar{y} = \frac{\sum_{i=0}^{m} b_i\, \bar{x} + c}{1 + \sum_{j=1}^{n} a_j}.$$

• Multi-hop propagation: z = g(f(x)), v = s(h(x)).
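A sketch of this converged-value propagation, assuming each edge of the invariant network is stored as its ARX coefficients; chaining propagate() across edges yields the multi-hop values g(f(x)) and s(h(x)). The coefficients below are made up for illustration.

```python
def propagate(a, b, c, x_bar):
    """Converged output of an ARX invariant when its input is held at x_bar:
    y_bar = (sum(b) * x_bar + c) / (1 + sum(a))."""
    return (sum(b) * x_bar + c) / (1.0 + sum(a))

# Hypothetical two-hop chain x -> y -> z:
f = {"a": [-0.5], "b": [2.0], "c": 1.0}        # y = f(x)
g = {"a": [0.2], "b": [0.8, 0.1], "c": 0.0}    # z = g(y)

y_bar = propagate(f["a"], f["b"], f["c"], x_bar=70.0)
z_bar = propagate(g["a"], g["b"], g["c"], y_bar)   # z = g(f(x))
print(y_bar, z_bar)
```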

Rules and Fault Model

Rule: if $x > T_x$ then generate_alert(1), i.e. a predicate plus an action.

[Figure: the fault model of a rule, giving the probability of fault occurrence as a function of x. The ideal model is a 0-to-1 step at the threshold $T_x$; a realistic model rises smoothly, so a hard threshold yields false positives on one side and false negatives on the other.]
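A tiny sketch of such a fault model. The logistic shape is purely an assumption for illustration (the slides only draw a smooth curve, not a formula):

```python
import math

def fault_probability(x, threshold, steepness=0.2):
    """Assumed realistic fault model: probability that the system is actually
    faulty when measurement value x triggers the rule, modeled here as a
    logistic curve centered at the rule's threshold. The ideal model would
    be a hard 0/1 step at `threshold`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

print(fault_probability(60, 70), fault_probability(70, 70), fault_probability(85, 70))
# rises smoothly, passing through 0.5 exactly at the threshold
```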

Probability of Reporting a True Positive Alert

• Importance of an alert: $\mathrm{Prob}(\mathrm{true} \mid x)$, the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.
• Even a very small false-positive rate leads to a large number of false-positive reports.
 – Example: one measurement is checked every minute with an FP rate of 0.1% ⇒ 60 × 24 × 365 × 0.1% ≈ 526 FP reports per year. What if there are thousands of measurements?
 – Example: in a real operation support system, 80% of reports are FPs.

Local Context Mapping to Global Context

[Diagram: three-tier system Web – AP – DB with four alert rules]
 – Alert 1: CPU% @ Web > 70%
 – Alert 2: DiskUsg @ Web > 150
 – Alert 3: CPU% @ DB > 60%
 – Alert 4: Network @ AP > 35k

• Each threshold has different local semantics. Invariants map them into a global context, e.g. onto CPU%@Web:
 – CPU%@Web = fa(Network@AP)
 – CPU%@Web = fb(CPU%@DB)
 – CPU%@Web = fc(DiskUsg@Web)
• On the fault model (PRTP curve) of CPU%@Web, the propagated thresholds satisfy
 Prob(true|x_CPU@DB) > Prob(true|x_T) > Prob(true|x_DiskUsg@Web) > Prob(true|x_Network@AP)
 ⇒ ranking: Alert 3 > Alert 1 > Alert 2 > Alert 4.

Local Context Mapping to Global Context (cont.)

• The same thresholds can instead be mapped onto the fault model of Network@AP. There the propagated thresholds satisfy
 Prob(true|x_CPU@DB) > Prob(true|x_CPU@WEB) > Prob(true|x_DiskUsg@Web) > Prob(true|x_T)
• Alert ranking: no change, still Alert 3 > Alert 1 > Alert 2 > Alert 4.
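A sketch of the offline rule sorting this implies: propagate every rule's local threshold onto one common measurement's fault-model axis, then rank the rules by the PRTP of the mapped values. Since PRTP increases monotonically in x, sorting by the mapped thresholds suffices. The numbers below are hypothetical stand-ins for fa(35k), fb(60), fc(150).

```python
# Hypothetical thresholds mapped onto the CPU%@Web axis; in practice the
# invariant functions fa, fb, fc would produce these values.
mapped_threshold = {
    "Alert 1 (CPU%@Web > 70)":     70.0,  # its own threshold x_T
    "Alert 2 (DiskUsg@Web > 150)": 66.0,  # fc(150), assumed
    "Alert 3 (CPU%@DB > 60)":      78.0,  # fb(60), assumed
    "Alert 4 (Network@AP > 35k)":  55.0,  # fa(35000), assumed
}

# Higher mapped threshold -> higher PRTP -> more important rule.
ranking = sorted(mapped_threshold, key=mapped_threshold.get, reverse=True)
for rank, rule in enumerate(ranking, start=1):
    print(rank, rule)
# 1 Alert 3, 2 Alert 1, 3 Alert 2, 4 Alert 4 -- matching the slide
```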

Alerts Ranking Process (recap)

• Online, at the time alerts are received, step 4 ranks the real alerts (e.g. Alert 1, Alert 1, Alert 1, Alert 4).

Ranking Alerts (Case I)

Case I: receive ONLY ALERTS, no monitoring data from the components.

• Offline, the alert rules are sorted using the operators' knowledge & configuration and the system invariants network:
 Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4.
• Online, when 5 alerts are generated (Alerts 2, 3, 7, 5, 1), they are ranked by that precomputed order (see the sketch below):
 1. Alert 2, 2. Alert 3, 3. Alert 7, 4. Alert 5, 5. Alert 1.
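A minimal sketch: with only alert identities available, ranking reduces to filtering the offline sorted-rule list down to the alerts that actually fired.

```python
# Offline output: rules sorted by importance (the slide's example order).
sorted_rules = ["Alert 6", "Alert 2", "Alert 3", "Alert 7", "Alert 5",
                "Alert 9", "Alert 1", "Alert 8", "Alert 4"]

def rank_case1(fired, sorted_rules):
    """Rank the fired alerts by their position in the precomputed order."""
    position = {rule: i for i, rule in enumerate(sorted_rules)}
    return sorted(set(fired), key=position.get)

print(rank_case1(["Alert 5", "Alert 1", "Alert 2", "Alert 7", "Alert 3"],
                 sorted_rules))
# ['Alert 2', 'Alert 3', 'Alert 7', 'Alert 5', 'Alert 1']
```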

Ranking Alerts (Case II)

Case II: receive both alerts and monitoring data from the components.

• Map all thresholds into the alerting measurement's local context (as before) and compare the observed value against them.
• Number of Threshold Violations (NTV): how many of those thresholds the observed value exceeds.
 – Observed value X(CPU%@Web): exceeds 3 thresholds on the fault model of CPU%@Web ⇒ NTV = 3.
 – Observed value X(Network@AP): exceeds 2 thresholds on the fault model of Network@AP ⇒ NTV = 2.
• The alert from CPU%@Web is therefore more important than the one from Network@AP. (A sketch of the NTV computation follows.)
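A sketch of the NTV computation, assuming the propagated thresholds for each measurement are already available (as in the tables on the next slides). Counting the rule's own threshold along with the mapped ones matches the NTV values reported there.

```python
def ntv(observed, own_threshold, mapped_thresholds):
    """Number of Threshold Violations: how many thresholds (the rule's own
    plus the others mapped into this measurement's context) the observed
    value exceeds. Returns None when the rule itself did not fire."""
    if observed <= own_threshold:
        return None   # no alert from this rule
    return sum(observed > t for t in [own_threshold] + list(mapped_thresholds))
```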

Index
• Introduction
 – Motivation & Goal
• System Invariants
 – Invariants extraction
 – Value propagation
• Collaborative peer review mechanism
 – Rules & Fault model
 – Ranking alerts
• Experiment result
• Conclusion

Experimental system

[Diagram: experimental four-tier system with components A, B, C, D]

Flow intensities:
 – I_ejb(t): the number of EJBs created at time t.
 – I_jvm(t): the JVM processing time at time t.
 – I_sql(t): the number of SQL queries at time t.

Invariant examples:

$$I_{ejb}(t) = 0.07\, I_{ejb}(t-1) + 0.57\, I_{jvm}(t)$$
$$I_{sql}(t) = 0.34\, I_{sql}(t-1) + 1.41\, I_{ejb}(t) + 0.2\, I_{ejb}(t-1)$$

Extracted Invariants Network

[Figure: the invariant network extracted over the six monitored measurements m1-m6.]

Thresholds of Measurements

Rule for each measurement: if $x_i > T_{m_i}$ then generate_alert($m_i$).

Local thresholds: $T_{m_1}$ = 70, $T_{m_2}$ = 30000, $T_{m_3}$ = 80, $T_{m_4}$ = 70, $T_{m_5}$ = 30000, $T_{m_6}$ = 20000.

Example: mapped into m1's local context, the other five thresholds become 63.6, 70.2, 70.5, 77.0, 59.8.

Thresholds of Measurements (cont.)

Each row lists a measurement's own threshold and the other five thresholds mapped into its local context:

 m1: T = 70    | 63.6, 70.2, 70.5, 77.0, 59.8
 m2: T = 30000 | 32726, 33006, 33212, 36316, 28207
 m3: T = 80    | 71.4, 78.0, 86.4, 81.0, 66.9
 m4: T = 70    | 57.4, 62.8, 63.7, 54.1, 63.0
 m5: T = 30000 | 29540, 29646, 32613, 25469, 27018
 m6: T = 20000 | 23208, 23291, 25688, 21200, 23509

Ranking Alerts with NTVs (1)

Using the thresholds above, the observed values and resulting NTVs at one sampling point are:

 m1: observed 73.6  ⇒ NTV = 5
 m2: observed 34319 ⇒ NTV = 5
 m3: observed 81.6  ⇒ NTV = 5
 m4: observed 71.4  ⇒ NTV = 6
 m5: observed 30621 ⇒ NTV = 5
 m6: observed 22620 ⇒ NTV = 2

All six rules fire; the alert on m4 ranks first and the alert on m6 last.
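Reproducing these NTVs with the ntv() sketch from earlier and the table data:

```python
table = {   # own threshold, then the other thresholds mapped into this context
    "m1": (70,    [63.6, 70.2, 70.5, 77.0, 59.8]),
    "m2": (30000, [32726, 33006, 33212, 36316, 28207]),
    "m3": (80,    [71.4, 78.0, 86.4, 81.0, 66.9]),
    "m4": (70,    [57.4, 62.8, 63.7, 54.1, 63.0]),
    "m5": (30000, [29540, 29646, 32613, 25469, 27018]),
    "m6": (20000, [23208, 23291, 25688, 21200, 23509]),
}
observed = {"m1": 73.6, "m2": 34319, "m3": 81.6,
            "m4": 71.4, "m5": 30621, "m6": 22620}

ntvs = {m: ntv(observed[m], *table[m]) for m in table}
print(ntvs)  # {'m1': 5, 'm2': 5, 'm3': 5, 'm4': 6, 'm5': 5, 'm6': 2}
```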


Ranking Alerts with NTVs (2)

After injecting a problem (an SCP copy) into the Web server, the observed values against the same thresholds are:

 m1: observed 73.5  ⇒ NTV = 5
 m2: observed 31478 ⇒ NTV = 2
 m3: observed 54.6  ⇒ no alert
 m4: observed 46.1  ⇒ no alert
 m5: observed 22712 ⇒ no alert
 m6: observed 18564 ⇒ no alert

Only two alerts fire, and the alert on m1 ranks above the alert on m2.


Conclusion
• We introduced a peer-review mechanism to rank alerts from heterogeneous components:
 – by mapping the local thresholds of the various rules into their equivalent values in a global context,
 – based on the system-invariants network model.
• It supports operators' consultation for prioritizing problem determination.

Thank You!

• Questions?
