Ranking the Importance of Alerts for Problem Determination in Large Computer System

Page 1: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena

NEC Laboratories America, Princeton

Page 2: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Outline

• Introduction
  – Motivation & Goal
• System Invariants
  – Invariants extraction
  – Value propagation
• Collaborative peer review mechanism
  – Rules & Fault model
  – Ranking alerts
• Experiment results
• Conclusion

ICAC 2009 : 6/16/2009 2

Page 3: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Motivation

• Large, complex systems are deployed by integrating many heterogeneous components:
  – Servers, routers, storage, and software from multiple vendors.
  – Hidden dependencies.
• Log and performance data are collected from components.
  – Operators set many rules to check the data and trigger alerts.
    • E.g., CPU% @ Web > 70%
  – Rule setting is independent and isolated.
    • Each rule reflects the operator's own system knowledge.

Page 4: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Goal

• Which alerts should we analyze first?
  – Get more consensus from others
  – Blend system management knowledge from multiple operators
• We introduce a "peer-review" mechanism
  – To rank the importance of alerts.
• Operators can then prioritize the problem determination process.

Example alert rules:
  Alert 1: CPU% @Web > 70%
  Alert 2: DiskUsg@Web > 150
  Alert 3: CPU% @DB > 60%
  Alert 4: Network@AP > 35k

Example ranking: Alert 3, Alert 1, Alert 2, Alert 4

Page 5: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Alerts Ranking Process

[Diagram: full automation. Operators (with domain knowledge and domain information) and monitoring data from the large system feed the steps below.]

Offline:
  1. Extract invariants from monitoring data to build the invariants model
  2. Define alert rules
       Alert 1: CPU% @Web > 70%
       Alert 2: DiskUsg@Web > 150
       Alert 3: CPU% @DB > 60%
       Alert 4: Network@AP > 35k
  3. Sort alert rules

Online (at the time alerts are received):
  4. Rank alerts
       Real alerts: Alert 1, Alert 1, Alert 1, Alert 4

[ICAC 2006][TDSC 2006][TKDE 2007][DSN 2006]

Page 6: Ranking the Importance of Alerts for Problem Determination in Large Computer System

System Invariants

[Figure: user requests enter the target system; time series of the internal measurements m1, m2, m3, m4, ..., mi, mi+1, mi+2, ..., mn are collected over time t. Question posed on the slide: is there any constant relationship among them?]

Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.

• User requests flow through the system continuously, and many internal monitoring measurements react to the volume of user requests accordingly.

• We search for relationships among these internal measurements, which are collected at various points.

• If a modeled relationship continues to hold over time, it can be regarded as an invariant of the system.

Page 7: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Invariant Examples

• We check implicit relationships rather than the real values of flow intensities, which are always changing.
• However, many relationships are constant.
  – Example: x and y are changing, but the equation y = f(x) is constant.

[Figure: two invariant examples]
  – Load balancer with input I1 and outputs O1, O2, O3: I1 = O1 + O2 + O3
  – Database server with packet volume V1 and SQL query number N1: V1 = f(N1)

Page 8: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Automated Invariants Search

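[Flow diagram, summarized as steps]

1. Monitor the target system and collect observation data over [t0-t1].
2. Take a template f from the model library and pick any two measurements i, j to learn f_ij.
   – The learned f_ij are invariant candidates, with confidence score P0.
3. Sequential validation: with new observation data [t1-t2], does f_ij hold?
   – Yes: keep f_ij, with confidence score P1.
   – No: drop the variants f_ij.
4. Repeat for each new window: with new data [tk-tk+1], does f_ij hold?
   – Yes: keep f_ij, with confidence score Pk.
   – No: drop the variants f_ij.

Pi: confidence score after the i-th validation window.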
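The sequential validation loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not define how the confidence score is computed, so the pass rate over the windows seen so far is used here as an assumption, and `holds`, `validate`, `tol`, and `min_conf` are hypothetical names and parameters.

```python
from typing import Callable, List, Sequence, Tuple

# One validation window: (x, y) pairs observed during [tk, tk+1].
Window = Sequence[Tuple[float, float]]


def holds(model: Callable[[float], float], window: Window, tol: float) -> bool:
    """True if the candidate predicts every y in the window within a relative tolerance."""
    return all(abs(model(x) - y) <= tol * max(abs(y), 1e-9) for x, y in window)


def validate(model: Callable[[float], float], windows: List[Window],
             tol: float = 0.05, min_conf: float = 0.8):
    """Sequentially validate one invariant candidate; return (kept, scores)."""
    passed = 0
    scores = []
    for k, window in enumerate(windows, start=1):
        passed += holds(model, window, tol)
        conf = passed / k        # assumed confidence score: pass rate so far
        scores.append(conf)
        if conf < min_conf:      # the candidate is dropped as a variant
            return False, scores
    return True, scores


# Hypothetical candidate y = 2x checked against two sets of windows.
stable = [[(1.0, 2.0), (2.0, 4.02)], [(3.0, 6.0)]]
broken = [[(1.0, 2.0)], [(2.0, 9.0)]]
kept_stable, scores_stable = validate(lambda x: 2 * x, stable)
kept_broken, scores_broken = validate(lambda x: 2 * x, broken)
```

With these made-up windows the first candidate survives both windows, while the second is dropped as soon as its pass rate falls below `min_conf`.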

Page 9: Ranking the Importance of Alerts for Problem Determination in Large Computer System

One example in model library

• We use an AutoRegressive model with eXogenous input (ARX) to learn the relationship between two flow intensity measurements x(t) and y(t):

$$y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_0 x(t-k) + \dots + b_m x(t-k-m) + c$$

• Define

$$\theta = [a_1, \dots, a_n, b_0, \dots, b_m, c]^T$$

$$\varphi(t) = [-y(t-1), \dots, -y(t-n), x(t-k), \dots, x(t-k-m), 1]^T$$

$$y(t) = \varphi(t)^T \theta$$

• Given a sequence of real observations, using LMS, we learn the model by minimizing the error:

$$\hat{\theta}_N = \Big[\sum_{t=1}^{N} \varphi(t)\varphi(t)^T\Big]^{-1} \sum_{t=1}^{N} \varphi(t)\, y(t)$$

• A fitness function can be used to evaluate how well the learned model fits the real data:

$$F(\theta) = \Bigg[1 - \sqrt{\frac{\sum_{t=1}^{N} |y(t) - \hat{y}(t|\theta)|^2}{\sum_{t=1}^{N} |y(t) - \bar{y}|^2}}\Bigg] \times 100$$
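As a rough illustration of the estimator and fitness score above, the sketch below fits an ARX model by solving the least-squares normal equations in plain Python. The function names (`solve`, `regressor`, `fit_arx`) and the synthetic data are assumptions made for this example; the sign convention follows the equation with the a-terms on the left-hand side.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def regressor(x, y, t, n, m, k):
    """phi(t) = [-y(t-1), ..., -y(t-n), x(t-k), ..., x(t-k-m), 1]."""
    past_y = [-y[t - j] for j in range(1, n + 1)]
    past_x = [x[t - k - i] for i in range(m + 1)]
    return past_y + past_x + [1.0]


def fit_arx(x, y, n=1, m=0, k=0):
    """Return (theta, fitness) with theta = [a1..an, b0..bm, c]."""
    start = max(n, k + m)
    rows = [regressor(x, y, t, n, m, k) for t in range(start, len(y))]
    targets = y[start:]
    d = n + m + 2
    # Normal equations: (sum phi phi^T) theta = sum phi y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yt for r, yt in zip(rows, targets)) for i in range(d)]
    theta = solve(A, b)
    preds = [sum(p * q for p, q in zip(r, theta)) for r in rows]
    mean = sum(targets) / len(targets)
    num = sum((yt - p) ** 2 for yt, p in zip(targets, preds))
    den = sum((yt - mean) ** 2 for yt in targets)
    return theta, (1.0 - (num / den) ** 0.5) * 100.0


# Synthetic data generated by y(t) = 0.5*y(t-1) + 2*x(t) + 1,
# i.e. a1 = -0.5, b0 = 2, c = 1 in the convention above.
x = [float((3 * t) % 7) for t in range(30)]
y = [0.0]
for t in range(1, 30):
    y.append(0.5 * y[-1] + 2.0 * x[t] + 1.0)
theta, fitness = fit_arx(x, y, n=1, m=0, k=0)
```

Because the synthetic data is noise-free, the recovered parameters should be close to the generating ones and the fitness score close to 100; real monitoring data would give lower scores.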

Page 10: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Value Propagation with Invariants

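[Figure: invariant network with nodes x, y, z, u, v and edges y = f(x), z = g(y), u = h(x), v = s(u)]

• Extract invariants.
• With the ARX model, set x(t) = x. Once the output has converged, y(t) = y, so

$$y + a_1 y + \dots + a_n y = b_0 x + \dots + b_m x + c$$

$$y = \frac{\sum_{i=0}^{m} b_i\, x + c}{1 + \sum_{j=1}^{n} a_j}$$

• Multi hops: z = g(f(x)), v = s(h(x))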
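The converged-value formula and the multi-hop composition can be sketched in a few lines. The coefficients for `f` and `g` below are made-up values chosen only for illustration, and `steady_state` and `propagate` are hypothetical helper names.

```python
def steady_state(a, b, c, x):
    """Converged y for a constant input x under
    y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + bm*x(t-k-m) + c."""
    return (sum(b) * x + c) / (1.0 + sum(a))


def propagate(chain, x):
    """Push a value through a chain of invariants, e.g. z = g(f(x))."""
    for a, b, c in chain:
        x = steady_state(a, b, c, x)
    return x


# Hypothetical invariants, each given as (a coefficients, b coefficients, c).
f = ([-0.5], [2.0], 1.0)  # y = (2x + 1) / 0.5
g = ([0.2], [0.6], 0.0)   # z = 0.6y / 1.2

y_value = steady_state(*f, 10.0)   # one hop: x -> y
z_value = propagate([f, g], 10.0)  # two hops: x -> y -> z
```

For x = 10 this gives y = 42 and z close to 21 (up to floating-point rounding), which is how a threshold on one measurement could be carried to another measurement several hops away.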

Page 11: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Rules and Fault Model

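Rule 1: if (x > x_T), then generate_alert
  – Predicate: x > x_T
  – Action: generate_alert

Fault model for each rule

[Figure: probability of fault occurrence (from 0 to 1) plotted against x, with the threshold x_T marked on the x axis. Two curves are shown, the ideal model and the realistic model; the regions between them are labeled false positive and false negative.]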

Page 12: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Probability of Reporting a True Positive Alert

• Importance of an alert: Prob(true | x), the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.

• A very small false positive rate leads to a large number of false positive reports.
  – Ex. One measurement is checked every minute and its FP rate is 0.1% => 60 x 24 x 365 x 0.1% ≈ 526 FP reports per year. With thousands of measurements, the count grows accordingly.
  – Ex. In a real operation support system, 80% of reports are FPs.
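The arithmetic in the first example can be reproduced directly. The helper name `expected_false_positives` and the figure of 1,000 measurements are assumptions for illustration; the slide only says "thousands".

```python
def expected_false_positives(checks_per_day, fp_rate, measurements=1, days=365):
    """Expected number of false positive reports over the given period."""
    return checks_per_day * days * fp_rate * measurements


# One measurement checked every minute with a 0.1% false positive rate.
per_year_single = expected_false_positives(60 * 24, 0.001)

# Hypothetical scale-up to 1,000 such measurements.
per_year_thousand = expected_false_positives(60 * 24, 0.001, measurements=1000)
```

The single-measurement case evaluates to about 525.6 reports per year, which the slide rounds to 526; with 1,000 measurements the expected count is about 525,600.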

Page 13: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Local Context Mapping to Global Context

Alert rules (local context, each with different semantics):
  Alert 1: CPU%@Web > 70%
  Alert 2: DiskUsg@Web > 150
  Alert 3: CPU%@DB > 60%
  Alert 4: Network@AP > 35k

[Figure: three components, Web, AP and DB]

Global context: invariants express CPU%@Web in terms of the other measurements
  CPU%@Web = fa(Network@AP)
  CPU%@Web = fb(CPU%@DB)
  CPU%@Web = fc(DiskUsg@Web)

[Figure: fault model of CPU%@Web. Vertical axis: PRTP from 0 to 1. Horizontal axis: x, marked with x_T and the mapped thresholds x_Network@AP = fa(Network@AP), x_DiskUsg@Web = fc(DiskUsg@Web), x_CPU@DB = fb(CPU%@DB).]

Prob(true|x_CPU@DB) > Prob(true|x_T) > Prob(true|x_DiskUsg@Web) > Prob(true|x_Network@AP)

Resulting order: Alert 3, Alert 1, Alert 2, Alert 4
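A small sketch of this mapping step: each rule's local threshold is pushed through an invariant into the CPU%@Web context, and rules are sorted by the resulting equivalent value, on the assumption that PRTP increases with x. The linear functions standing in for fa, fb and fc are invented for illustration and are not taken from the slides; they are chosen so that the resulting order matches the slide.

```python
# Local alert rules from the slide: name -> (measurement, threshold).
rules = {
    "Alert 1": ("CPU%@Web", 70.0),
    "Alert 2": ("DiskUsg@Web", 150.0),
    "Alert 3": ("CPU%@DB", 60.0),
    "Alert 4": ("Network@AP", 35000.0),
}

# Hypothetical stand-ins for the invariants fa, fb, fc (assumed linear).
to_cpu_web = {
    "CPU%@Web": lambda v: v,                    # already in the global context
    "Network@AP": lambda v: 0.0015 * v + 5.0,   # stand-in for fa
    "CPU%@DB": lambda v: 1.3 * v + 2.0,         # stand-in for fb
    "DiskUsg@Web": lambda v: 0.4 * v + 5.0,     # stand-in for fc
}


def rank_rules(rules, mapping):
    """Map each threshold into the global context and sort from highest to lowest."""
    mapped = {name: mapping[metric](thr) for name, (metric, thr) in rules.items()}
    order = sorted(mapped, key=mapped.get, reverse=True)
    return order, mapped


order, mapped = rank_rules(rules, to_cpu_web)
```

A higher equivalent threshold corresponds to a higher PRTP under a monotonically increasing fault model, so `order` lists the rules from most to least important.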

Page 14: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Local Context Mapping to Global Context

Alert rules:
  Alert 1: CPU%@Web > 70%
  Alert 2: DiskUsg@Web > 150
  Alert 3: CPU%@DB > 60%
  Alert 4: Network@AP > 35k

[Figure: three components, Web, AP and DB]

[Figure: fault model of Network@AP. Vertical axis: PRTP from 0 to 1. Horizontal axis: x, marked with x_T and the mapped thresholds x_CPU@Web, x_CPU@DB, x_DiskUsg@Web.]

Prob(true|x_CPU@DB) > Prob(true|x_CPU@Web) > Prob(true|x_DiskUsg@Web) > Prob(true|x_T)

Resulting order: Alert 3, Alert 1, Alert 2, Alert 4

Alert ranking: no change

Page 15: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Alerts Ranking Process

Online (at the time alerts are received):
  4. Rank alerts
       Real alerts: Alert 1, Alert 1, Alert 1, Alert 4

Page 16: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts (Case I)

Case I: Receive ONLY ALERTS, no monitoring data from components

Inputs to the sorted rule list: operator's knowledge & configuration, and the system invariants network.

Sorted alert rules: Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4

5 alerts generated. Alerts ranking:
  1. Alert 2
  2. Alert 3
  3. Alert 7
  4. Alert 5
  5. Alert 1
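Case I amounts to filtering the pre-sorted rule list by the alerts actually received. A minimal sketch, assuming the sorted order reads as listed on this slide; `rank_received` is a hypothetical helper name.

```python
# Offline result: alert rules sorted by importance (order as read from the slide).
sorted_rules = [
    "Alert 6", "Alert 2", "Alert 3", "Alert 7", "Alert 5",
    "Alert 9", "Alert 1", "Alert 8", "Alert 4",
]


def rank_received(sorted_rules, received):
    """Keep only the received alerts, in the order given by the sorted rule list."""
    received = set(received)
    return [rule for rule in sorted_rules if rule in received]


# The five alerts generated in the slide's example, in arbitrary arrival order.
ranking = rank_received(
    sorted_rules, ["Alert 1", "Alert 5", "Alert 7", "Alert 2", "Alert 3"]
)
```

No monitoring data is needed online in this case, since the ordering was fixed offline.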

Page 17: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts (Case II)

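Case II: Receive both alerts and monitoring data from components

Number of Threshold Violations (NTV): the number of thresholds, local and mapped, that the observed value exceeds.

[Figure: fault model of CPU%@Web. Vertical axis: PRTP from 0 to 1. Horizontal axis: x, marked with x_T and the mapped thresholds x_Network@AP = fa(Network@AP), x_DiskUsg@Web = fc(DiskUsg@Web), x_CPU@DB = fb(CPU%@DB). The observed value X(CPU%@Web) gives NTV = 3.]

[Figure: fault model of Network@AP. Vertical axis: PRTP from 0 to 1. Horizontal axis: x, marked with x_T and the mapped thresholds x_CPU@Web, x_CPU@DB, x_DiskUsg@Web. The observed value X(Network@AP) gives NTV = 2.]

• An alert from CPU%@Web is more important than one from Network@AP.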

Page 18: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Index

• Introduction
  – Motivation & Goal
• System Invariants
  – Invariants extraction
  – Value propagation
• Collaborative peer review mechanism
  – Rules & Fault model
  – Ranking alerts
• Experiment results
• Conclusion

Page 19: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Experimental system

Flow intensities:
  – I_ejb(t): the number of EJBs created at time t.
  – I_jvm(t): the JVM processing time at time t.
  – I_sql(t): the number of SQL queries at time t.

[Figure: experimental system with components labeled A, B, C and D]

Invariant examples:
  – I_ejb(t) is modeled from the terms 0.07 I_ejb(t-1) and 0.57 I_jvm(t).
  – I_sql(t) is modeled from the terms 0.34 I_sql(t-1), 1.41 I_ejb(t) and 0.2 I_ejb(t-1).

Page 20: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Extracted Invariants Network

[Figure: extracted invariants network connecting the measurements m1, m2, m3, m4, m5 and m6]

Page 21: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Thresholds of Measurements

Rule for each measurement: if (x_i > m_i^T), then generate_alert_m_i

Measurement            m1     m2      m3     m4      m5     m6
Own threshold (miT)    70     30000   80     30000   70     20000

The other five thresholds mapped into the context of m1 (as extracted): 63.6, 70.2, 70.5, 77.0, 59.8

Page 22: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Thresholds of Measurements

Rule for each measurement: if (x_i > m_i^T), then generate_alert_m_i

Measurement               m1     m2      m3     m4      m5     m6
Own threshold (miT)       70     30000   80     30000   70     20000
Equivalent values of      63.6   32726   71.4   29540   57.4   23208
the other five            70.2   33006   78.0   29646   62.8   23291
thresholds, mapped        70.5   33212   86.4   32613   63.7   25688
into each column's        77.0   36316   81.0   25469   54.1   21200
context (as extracted)    59.8   28207   66.9   27018   63.0   23509

Page 23: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts with NTVs (1)

Thresholds and mapped values are the same as on the previous slide.

Measurement       m1     m2      m3     m4      m5     m6
Observed value    73.6   34319   81.6   30621   71.4   22620
NTVs              5      5       5      5       6      2
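The NTV row can be recomputed from the threshold values shown on the slides. The sketch below assumes the column assignment m4 = 30621 and m5 = 71.4 for the observed values (the two are fused in the extracted text), and lists each measurement's own threshold first, followed by the five mapped values; `ntv` is a hypothetical helper name.

```python
# Own threshold first, then the five equivalent values mapped from other rules.
thresholds = {
    "m1": [70, 63.6, 70.2, 70.5, 77.0, 59.8],
    "m2": [30000, 32726, 33006, 33212, 36316, 28207],
    "m3": [80, 71.4, 78.0, 86.4, 81.0, 66.9],
    "m4": [30000, 29540, 29646, 32613, 25469, 27018],
    "m5": [70, 57.4, 62.8, 63.7, 54.1, 63.0],
    "m6": [20000, 23208, 23291, 25688, 21200, 23509],
}

# Observed values for case (1); the m4/m5 assignment is an assumption.
observed = {"m1": 73.6, "m2": 34319, "m3": 81.6,
            "m4": 30621, "m5": 71.4, "m6": 22620}


def ntv(value, levels):
    """Number of threshold violations, or None if the rule's own threshold holds."""
    if value <= levels[0]:
        return None  # no alert is generated for this measurement
    return sum(value > level for level in levels)


ntvs = {name: ntv(observed[name], thresholds[name]) for name in thresholds}
alerts = [name for name in thresholds if ntvs[name] is not None]
ranking = sorted(alerts, key=lambda name: ntvs[name], reverse=True)
```

Under these assumptions the computed counts reproduce the NTV row of the slide, with the m5 alert ranked first and the m6 alert last; ties among m1 to m4 keep their listed order because Python's sort is stable.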

Page 24: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts with NTVs (1)

Page 25: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts with NTVs (2)

Thresholds and mapped values are the same as on the "Thresholds of Measurements" slide.

Measurement       m1     m2      m3     m4      m5     m6
Observed value    73.5   31478   54.6   22712   46.1   18564
NTVs              5      2       -      -       -      -

Page 26: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Ranking Alerts with NTVs (2)

Inject a problem (SCP copy) into the Web server.

Page 27: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Conclusion

• We introduce a peer review mechanism to rank alerts from heterogeneous components
  – By mapping local thresholds of various rules into their equivalent values in a global context
  – Based on the system invariants network model
• The ranking helps operators prioritize problem determination.

Page 28: Ranking the Importance of Alerts for Problem Determination in Large Computer System

Thank You!

• Questions?
