Date post: 23-Sep-2020
Pythia: Detection, Localization, and Diagnosis of Performance Problems using perfSONAR Partha Kanuparthy Constantine Dovrolis (PI) Georgia Institute of Technology
Page 1: Pythia: Detection, Localization, and Diagnosis of ...dovrolis/Papers/escc-summer11-pythia.… · Pythia: one tool, three objectives Detection: “noticeable loss rate between ORNL

Pythia: Detection, Localization, and Diagnosis of Performance Problems using perfSONAR

Partha Kanuparthy

Constantine Dovrolis (PI)

Georgia Institute of Technology


Intro

Pythia is a data-analysis tool

data from perfSONAR

Focus: performance problems

Funded by DoE: 3 yrs (since Sept’11)

This talk: detection, localization, WiP


Pythia: one tool, three objectives

Detection: “noticeable loss rate between ORNL and SLAC on 07/11/11 at 09:00:02 EDT”

Localization: “it happened at DENV-SLAC link”

Diagnosis: “it was due to insufficient router buffers”


Detection

First step: “Is there a problem?”

Look for deviations from baseline

but not monitor-related events!

[Figure: Congestion: NY-CLEV. Delay (ms) vs. sequence number; panel title “Possible-congestion in path NEWY_OWAMP_ES_NET to CLEV_OWAMP_ES_NET”.]

[Figure: Monitor event: ALBU-ATL. Delay (ms) vs. sequence number; panel title “context-switch in path ALBU_OWAMP_ES_NET to ATLA_OWAMP_ES_NET”; annotations: baseline, 2.5s rise!]


Detection Implementation

A single pass through OWAMP timeseries

Discard monitor-related events

Discard level-shifts (e.g., NTP synchronization; TODO: detecting routing changes)

The rest are network performance problems!
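The steps above can be sketched roughly as follows. This is only an illustration, not the Pythia implementation: the baseline is assumed to be supplied by the caller, the thresholds `rise_ms` and `monitor_ms` are hypothetical, and treating a rise that never returns to baseline as a level-shift is an assumed rule. The 10 s minimum duration comes from the slides.

```python
from typing import List, Tuple

Sample = Tuple[float, float]       # (timestamp in seconds, one-way delay in ms)
Event = Tuple[float, float, str]   # (start, end, kind)


def classify_run(run: List[Sample], baseline_ms: float, returned: bool,
                 monitor_ms: float, min_len_s: float) -> str:
    """Label one run of consecutive samples that sit above the baseline."""
    if not returned:
        # Assumed rule: delay never came back down, so call it a level-shift.
        return "level-shift"
    peak = max(d for _, d in run) - baseline_ms
    if peak >= monitor_ms:
        # Very large spikes (e.g. a 2.5 s rise) are blamed on the monitor host.
        return "monitor"
    if run[-1][0] - run[0][0] >= min_len_s:
        return "congestion"
    return "ignored"  # too short to report


def detect(samples: List[Sample], baseline_ms: float, rise_ms: float = 5.0,
           monitor_ms: float = 1000.0, min_len_s: float = 10.0) -> List[Event]:
    """Single pass over a time-ordered delay series; returns labeled events."""
    events: List[Event] = []
    run: List[Sample] = []
    for t, d in samples:
        if d - baseline_ms > rise_ms:
            run.append((t, d))
        elif run:
            kind = classify_run(run, baseline_ms, True, monitor_ms, min_len_s)
            events.append((run[0][0], run[-1][0], kind))
            run = []
    if run:  # series ended while delay was still elevated
        kind = classify_run(run, baseline_ms, False, monitor_ms, min_len_s)
        events.append((run[0][0], run[-1][0], kind))
    return events


def congestion_events(samples: List[Sample], baseline_ms: float) -> List[Event]:
    """Discard monitor events, level-shifts, and short runs; keep the rest."""
    return [e for e in detect(samples, baseline_ms) if e[2] == "congestion"]
```

Each reported event carries start and end timestamps, matching what the detection step is said to output.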

[Figure: NTP level-shifts. Delay (ms) vs. sequence number; panel title “levelshift in path ALBU_OWAMP_ES_NET to ATLA_OWAMP_ES_NET”.]

[Figure: Congestion: NY-CLEV. Delay (ms) vs. sequence number; panel title “Possible-congestion in path NEWY_OWAMP_ES_NET to CLEV_OWAMP_ES_NET”.]

[Figure: Delay (ms) vs. sequence number; panel title “Possible-congestion in path NEWY_OWAMP_ES_NET to CLEV_OWAMP_ES_NET”.]


Detection: In Practice

Detection outputs congestion events

> 10s long

start, end timestamps

ESnet data: 12 days, 33 monitors

Internet2 data: 22 days, 9 monitors

            Monitor events   Congestion events   Congestion events / path / day

ESnet       2.2 million      933                 0.1

Internet2   11,200           2268                1.4
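The “congestion events / path / day” figures are consistent with dividing the event count by the number of paths and days, if one assumes a full mesh of N(N-1) directed paths among N monitors. That path count is an assumption; the slides do not state it. A quick check:

```python
def events_per_path_per_day(events: int, monitors: int, days: int) -> float:
    """Event rate per path per day, assuming a full mesh of directed paths."""
    paths = monitors * (monitors - 1)  # assumption: every ordered monitor pair
    return events / (paths * days)


esnet = events_per_path_per_day(933, 33, 12)       # 1056 paths, 12 days
internet2 = events_per_path_per_day(2268, 9, 22)   # 72 paths, 22 days
print(round(esnet, 1), round(internet2, 1))        # prints: 0.1 1.4
```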


How long are congestion events?

ESnet, Internet2: 90% of events were 10-20 s long

this is long enough to affect application performance

delay increases by tens of milliseconds

some events are common across paths

[Figure: ESnet: CDF of duration of congestion events which are at least 10 seconds long. Axes: CDF vs. duration (in seconds); legend: ESnet, Internet2.]

[Figure: Internet2: Duration of congestion events. Axes: CDF vs. duration (in seconds).]


Are lossy events common?

Answer: No

ESnet: no lossy congestion events

Internet2: 6 of 2268 congestion events were lossy

< 0.1% loss rate as sampled by OWAMP

[Figure: Internet2: Fraction of lost packets for lossy congestion events. Axes: CDF vs. percentage of packets lost; legend: Internet2.]


Localization

Follow-up to detection: “Which link is bad?”

Link/path performance levels are discrete: e.g., high delay, medium delay, low delay

Localization: minimum number of bad links that can explain bad paths

use greedy heuristic to solve iteratively
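One way to read “minimum number of bad links that can explain bad paths” is as a set-cover problem solved greedily. The sketch below is an assumed formulation, not the Pythia code: it reduces the discrete performance levels to good/bad, and it assumes that any link on a good path is itself good. The function and variable names are hypothetical.

```python
from typing import Dict, List, Set


def localize(paths: Dict[str, List[str]], bad_paths: Set[str]) -> List[str]:
    """Greedily pick a small set of links that explains every bad path.

    paths maps a path name to the list of links it traverses.
    """
    # Assumption: a path that looks good exonerates every link on it.
    good_links = {link
                  for name, links in paths.items() if name not in bad_paths
                  for link in links}

    # For each remaining link, record which bad paths it could explain.
    candidates: Dict[str, Set[str]] = {}
    for name in bad_paths:
        for link in paths[name]:
            if link not in good_links:
                candidates.setdefault(link, set()).add(name)

    unexplained = set(bad_paths)
    chosen: List[str] = []
    while unexplained:
        # Pick the link covering the most unexplained bad paths
        # (ties broken alphabetically to keep the result deterministic).
        best = max(sorted(candidates),
                   key=lambda link: len(candidates[link] & unexplained),
                   default=None)
        if best is None or not candidates[best] & unexplained:
            break  # no candidate explains the remaining bad paths
        chosen.append(best)
        unexplained -= candidates[best]
    return chosen
```

Greedy set cover does not guarantee the true minimum, which is presumably why the slide calls it a heuristic.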


Localizing Bad Links

ESnet: 9 congestion events

1 bad link localized for each

up to 75 paths affected by an event:

[Figure: bar chart of number of paths affected per bad link. Bad links shown: washcr1-aofacr2.es.net, starcr1-chicr1.es.net, bnlmr2-bnlowamp.es.net, clevcr1-ip-bostcr1.es.net, clevcr1-ip-bostcr1.es.net, bostcr1-ip-aofacr2.es.net, chiccr1-ip-clevcr1.es.net, clevcr1-ip-chiccr1.es.net.]

[Figure: One Way Delay (in ms) vs. Time (in seconds).]


Localization: Internet2

Internet2: 266 congestion events in 22 days

3 bad links: 1 case

2 bad links: 6 cases

1 bad link: rest

a few bad links account for about 90% of events:

ge-6-2-0.0.rtr.kans (58% of events)

ge-1-2-0.0.rtr.chic (25%)

xe-1-1-0.0.rtr.hous (6%)

[Figure: Timeline of bad links: peaks around 7th March 2011. Axes: link events per day vs. day since 23rd Feb 2011; legend: ge-6-2-0.0.rtr.kans, ge-1-2-0.0.rtr.chic, xe-1-1-0.0.rtr.hous.]


Case Study

Internet2: event with two bad links

28th Feb 2011, 00:10:51 GMT

Localized bad links: ge-6-2-0.0-rtr.KANS, ge-6-1-0.0-rtr.LOSA

Predicted bad link performance (avg): 26ms and 57ms

[Figure: Path: CHIC_LAT to LOSA_LAT. One Way Delay (in ms) vs. Time (in seconds); legend: path: CHIC to LOSA, path: ATLA to KANS, path: HOUS to LOSA.]

[Figure: Path: HOUS_LAT to LOSA_LAT. One Way Delay (in ms) vs. Time (in seconds).]

[Figure: Path: ATLA_LAT to KANS_LAT. One Way Delay (in ms) vs. Time (in seconds).]


Diagnosis

“What was the problem?”

Match signatures to identify known problems

delays, losses, reordering, etc.

Unknown signatures:

store in database for future diagnosis

operators can “label” them for reference

Work in progress
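Since diagnosis is work in progress, the following is only a toy illustration of the idea of matching signatures, storing unknown ones, and letting operators label them. Representing a signature as a set of symptom names, and requiring an exact match, are both assumptions; `SignatureStore` is a hypothetical name.

```python
from typing import Dict, FrozenSet, Iterable, Optional

Signature = FrozenSet[str]  # assumed form: a set of observed symptoms


class SignatureStore:
    """Maps signatures to operator-supplied labels (None means unlabeled)."""

    def __init__(self) -> None:
        self.labels: Dict[Signature, Optional[str]] = {}

    def diagnose(self, symptoms: Iterable[str]) -> Optional[str]:
        """Return the label for a known signature, else store it as unknown."""
        sig = frozenset(symptoms)
        if sig not in self.labels:
            self.labels[sig] = None  # keep it for future diagnosis
        return self.labels[sig]

    def label(self, symptoms: Iterable[str], diagnosis: str) -> None:
        """Let an operator attach a diagnosis to a signature."""
        self.labels[frozenset(symptoms)] = diagnosis
```

For example, once an operator labels the signature {delay, loss} as “insufficient router buffers”, later events with the same symptoms return that label.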


Pythia: Real-time System

Centralized process talks to perfSONAR MAs to collect data

Work in progress

[Diagram: perfSONAR MAs (traceroute MA, OWAMP MA 1, OWAMP MA 2, OWAMP MA 3, BWCTL MA, ... MA) connected to the Pythia server, whose stages are Preprocessing, Detection, Localization, Diagnosis.]


Pre-processing

traceroute: compensate for “* * *”s

Clock skew: use a 1000s window to normalize delays

Clock offset between monitors: use a 2s window to identify congestion events

Identify simultaneous events across paths
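A rough sketch of the two windowing steps, under stated assumptions: the slides give only the window sizes (1000 s and 2 s), so subtracting each window's minimum delay and grouping events by start time are guesses at the mechanics, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def normalize_delays(samples: List[Tuple[float, float]],
                     window_s: float = 1000.0) -> List[Tuple[float, float]]:
    """Subtract the minimum delay seen in each fixed window.

    samples are (timestamp_s, delay_ms). Removing a per-window floor is one
    assumed way to keep slow clock drift out of the delay series.
    """
    minima: Dict[int, float] = {}
    for t, d in samples:
        w = int(t // window_s)
        minima[w] = min(d, minima.get(w, d))
    return [(t, d - minima[int(t // window_s)]) for t, d in samples]


def group_simultaneous(events: List[Tuple[float, str]],
                       window_s: float = 2.0) -> List[List[Tuple[float, str]]]:
    """Group (start_time_s, path) events that start within window_s of the
    first event in the group, tolerating clock offset between monitors."""
    groups: List[List[Tuple[float, str]]] = []
    for start, path in sorted(events):
        if groups and start - groups[-1][0][0] <= window_s:
            groups[-1].append((start, path))
        else:
            groups.append([(start, path)])
    return groups
```

Each group of simultaneous events would then give the set of bad paths passed to localization.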

