OSPF Monitor - NSDI 2004 1

OSPF Monitor Architecture, Design and Deployment Experience

Aman Shaikh

Albert Greenberg

AT&T Labs - Research

NSDI 2004

Objectives for OSPF Monitor
• Real-time analysis of OSPF behavior
  – Troubleshooting, alerting, validation of maintenance
  – Real-time snapshots of OSPF network topology
• Off-line analysis
  – Post-mortem analysis of recurring problems
  – Generate statistics and reports about network performance
  – Identify anomaly signatures
  – Facilitate tuning of configurable parameters
  – Improve maintenance procedures
  – Analyze OSPF behavior in commercial networks

OSPF Monitor in a Nutshell
• Collect OSPF LSAs (Link State Advertisements) passively from network
  – Every router describes its local connectivity in an LSA
  – Router originates an LSA due to...
    • Change in network topology
    • Periodic soft-state refresh
  – LSA is flooded to other routers in the domain
    • Flooding is reliable and hop-by-hop
    • Flooding leads to duplicate copies of LSAs being received
  – Every router stores LSAs (self-originated + received) in link-state database (= topology graph)
• Real-time analysis of LSA streams
• Archive LSAs for off-line analysis

Components
• Data collection: LSA Reflector (LSAR)
  – Passively collects OSPF LSAs from network
  – “Reflects” streams of LSAs to LSAG
  – Archives LSAs for analysis by OSPFScan
• Real-time analysis: LSA aGgregator (LSAG)
  – Monitors network for topology changes, LSA storms, node flaps and anomalies
• Off-line analysis: OSPFScan
  – Supports queries on LSA archives
  – Allows playback and modeling of topology changes
  – Allows emulation of OSPF routing

Example
[Figure: an OSPF network with Area 0, Area 1 and Area 2. LSAR 1 and LSAR 2 receive LSAs from the network, store them in LSA archives, and “reflect” them over TCP connections to LSAG for real-time monitoring. The LSA archives are replicated to OSPFScan for off-line analysis.]

How LSAR attaches to Network
• Host mode
  – Join multicast group
  – Adv: completely passive
  – Disadv: not reliable, delayed initialization of LSDB
• Full adjacency mode
  – Form full adjacency (= peering session) with a router
  – Adv: reliable, immediate initialization of LSDB
  – Disadv: LSAR’s instability can impact entire network
• Partial adjacency mode
  – Keep adjacency in a state that allows LSAR to receive LSAs, but does not allow data forwarding over link
  – Adv: reliable, LSAR’s instability does not impact entire network, immediate initialization of LSDB
  – Disadv: can raise alarms on the router
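Host mode can be pictured as a receive-only socket that joins the OSPF multicast groups. The following is a minimal sketch, not LSAR's actual code: the function names are hypothetical, and joining both AllSPFRouters and AllDRouters is an assumption (the slide only says "join multicast group"). The protocol number and group addresses are the ones defined for OSPF in RFC 2328.

```python
import socket

OSPF_PROTO = 89                 # IP protocol number assigned to OSPF
ALL_SPF_ROUTERS = "224.0.0.5"   # AllSPFRouters multicast group (RFC 2328)
ALL_D_ROUTERS = "224.0.0.6"     # AllDRouters multicast group (RFC 2328)


def membership_request(group: str, local_ip: str) -> bytes:
    """Build the ip_mreq payload for IP_ADD_MEMBERSHIP:
    the multicast group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def open_host_mode_listener(local_ip: str) -> socket.socket:
    """Hypothetical helper: open a raw socket that receives OSPF packets
    sent to the multicast groups on the interface owning local_ip.
    Needs root privileges. The socket is only ever read from, so the
    listener stays completely passive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, OSPF_PROTO)
    for group in (ALL_SPF_ROUTERS, ALL_D_ROUTERS):
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group, local_ip))
    return sock
```

Because nothing is acknowledged or requested, a lost packet is never retransmitted to such a listener, and the LSDB fills in only as routers refresh their LSAs. This matches the "not reliable, delayed initialization" disadvantage listed for host mode.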

Partial Adjacency for LSAR
[Figure: LSAR attached to router R with the adjacency held in a partial state. Labels: “I have LSA L”; “Please send me LSA L” (repeated); “I need LSA L from LSAR”.]
• LSAR-R link is not used for data forwarding
• Router R does not advertise a link to LSAR
• Routers (except R) not aware of LSAR’s presence
• Does not trigger routing calculations in network
• LSAR’s going up/down does not impact network
• LSAR does not originate any LSAs

LSA aGgregator (LSAG)
• Analyzes “reflected” LSAs from LSARs in real-time
• Generates console messages:
  – Change in OSPF network topology
    • ADJACENCY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0
  – Node flaps
    • RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec
  – LSA storms
    • LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area 0.0.0.0 no_lsas 7 storm_window 470 sec
  – Anomalous behavior
    • TYPE-3 ROUTE FROM NON-BORDER RTR: ntw 10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0
• Dumps snapshots of network topology
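One way to picture how a node-flap message could be produced is a sliding-window counter per router. This is a sketch under stated assumptions, not LSAG's actual logic: the class name, the thresholds `min_flaps` and `window_sec`, and the meaning given to `flap_window` (the time between the oldest and newest flap in the window) are all hypothetical. Only the message layout is copied from the RTR FLAP example above.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Counts flap events per router inside a sliding time window."""

    def __init__(self, min_flaps=5, window_sec=600.0):
        self.min_flaps = min_flaps      # hypothetical alerting threshold
        self.window_sec = window_sec    # hypothetical window length
        self._events = defaultdict(deque)

    def record_flap(self, router_id, timestamp):
        """Record one flap; return a console message once the router has
        flapped at least min_flaps times within the window, else None."""
        events = self._events[router_id]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and timestamp - events[0] > self.window_sec:
            events.popleft()
        if len(events) >= self.min_flaps:
            span = int(events[-1] - events[0])
            return (f"RTR FLAP: rtr {router_id} no_flaps {len(events)} "
                    f"flap_window {span} sec")
        return None


detector = FlapDetector(min_flaps=7, window_sec=600.0)
messages = [detector.record_flap("10.0.0.12", i * 95) for i in range(7)]
```

The same structure, keyed on (lstype, lsid, advrt) instead of the router ID, would serve as a starting point for an LSA-storm counter.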

OSPFScan
• Tools for off-line analysis of LSA archives
  – Parse, select (based on queries), and analyze
• Functionality supported by OSPFScan
  – Classification of LSA traffic
    • Change LSAs, refresh LSAs, duplicate LSAs
  – Emulation of OSPF routing
    • How OSPF routing tables evolved in response to network changes
    • What an end-to-end path within the OSPF domain looked like at any instant
  – Modeling of topology changes
    • Vertex addition/deletion and link addition/deletion/change_cost
  – Playback of topology change events
  – Statistics and report generation
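The change/refresh/duplicate classification can be sketched by replaying archived LSAs against a link-state database. The rule used here is an assumption rather than OSPFScan's documented behavior: an LSA instance that is no newer than the stored one is a duplicate, a newer instance with identical contents is a refresh, and a newer instance with different contents (or a previously unseen LSA) is a change. Sequence numbers are simplified to plain integers, and LSA age and MaxAge handling are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LSA:
    lstype: int     # LSA type (1 = router LSA, 3 = summary LSA, ...)
    lsid: str       # link-state ID
    advrt: str      # advertising router
    seq: int        # sequence number (simplified to a plain integer)
    body: bytes     # advertised contents

    @property
    def key(self):
        """An LSA is identified by its type, link-state ID and originator."""
        return (self.lstype, self.lsid, self.advrt)


def classify(lsdb: dict, lsa: LSA) -> str:
    """Classify one received LSA and update the database in place."""
    current = lsdb.get(lsa.key)
    if current is not None and lsa.seq <= current.seq:
        return "duplicate"          # already have this (or a newer) instance
    lsdb[lsa.key] = lsa
    if current is None or current.body != lsa.body:
        return "change"             # new information about the topology
    return "refresh"                # same contents, only re-originated


lsdb = {}
stream = [
    LSA(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),
    LSA(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),   # second copy via flooding
    LSA(1, "10.0.0.1", "10.0.0.1", 2, b"cost=1000"),   # periodic refresh
    LSA(1, "10.0.0.1", "10.0.0.1", 3, b"cost=50000"),  # cost changed
]
labels = [classify(lsdb, lsa) for lsa in stream]
```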

Performance Evaluation
• Performance of LSAR and LSAG through lab experiments
  – LSAR and LSAG are key to real-time monitoring
• How performance scales with LSA-rate and network size

Experimental Setup
[Figure: a PC running Zebra with an emulated topology sends LSAs to LSAR over an OSPF adjacency. LSAs also travel over TCP connections between Zebra, LSAR and LSAG. Labels: PC, Zebra, SUT, LSAR, LSAG, emulated topology.]
• Measure LSA pass-through time for LSAR
• Measure LSA processing time for LSAG

Methodology
• Send a burst of LSAs from Zebra to LSAR
  – Vary number of LSAs (l) in a burst of 1 sec duration
• Use of fully connected graph as the emulated topology
  – Vary number of nodes (n) in the topology
• Performance measurements
  – LSAR performance: LSA “pass-through” time
    • Zebra measures time difference between sending and receiving an LSA from LSAR
  – LSAG performance: LSA processing time
    • Instrumentation of LSAG code

LSAR Performance
[Figure: Mean LSA pass-through time (LSAR) vs. burst-size. X-axis: number of LSAs per burst (50 to 500). Y-axis: time in seconds (0 to 0.9). Curves: n = 100, LSAR + LSAG; n = 50, LSAR + LSAG; n = 100, LSAR only; n = 50, LSAR only.]

LSAG Performance
[Figure: Mean LSA processing time (LSAG) vs. network size. X-axis: number of nodes in the topology (50 to 100). Y-axis: time in seconds (0 to 0.06). Curves: burst-size = 500 LSAs; burst-size = 100 LSAs.]

Deployment
• Tier-1 ISP network
  – Area 0, 100+ routers; point-to-point links
  – Deployed since January, 2003
  – LSA archive size: 8 MB/day
  – LSAR connection: partial adjacency mode
• Enterprise network
  – 15 areas, 500+ routers; Ethernet-based LANs
  – Deployed since February, 2002
  – LSA archive size: 10 MB/day
  – LSAR connection: host mode

LSAG in Day-to-day Operations
• Generation of alarms by feeding messages into higher layer network management systems
  – Grouping of messages to reduce the number of alarms
  – Prioritization of messages
• Validation of maintenance steps and monitoring the impact of these steps on network-wide OSPF behavior
  – Example:
    • Network operators use cost-out/cost-in of links to carry out maintenance
    • A “link-audit” web-page allows operators to keep track of link costs in real-time
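The link-audit idea can be sketched as a table of current link costs that is updated on every cost-change event and queried for links that are currently costed out. Everything below is illustrative: the class, its methods, and the `COST_OUT_THRESHOLD` value are hypothetical, since the slides do not say how the web page decides that a link is costed out.

```python
COST_OUT_THRESHOLD = 50000  # hypothetical: costs at or above this count as "costed out"


class LinkAudit:
    """Tracks the latest advertised cost of each directed link."""

    def __init__(self):
        self.costs = {}  # (router, neighbor) -> current cost

    def cost_change(self, rtr, nbr, new_cost):
        """Apply one cost-change event reported by the real-time monitor."""
        self.costs[(rtr, nbr)] = new_cost

    def costed_out(self):
        """Return the links whose current cost marks them as out of service."""
        return sorted(link for link, cost in self.costs.items()
                      if cost >= COST_OUT_THRESHOLD)


audit = LinkAudit()
audit.cost_change("10.0.0.1", "10.0.0.5", 50000)   # cost-out before maintenance
during = audit.costed_out()
audit.cost_change("10.0.0.1", "10.0.0.5", 1000)    # cost-in after maintenance
after = audit.costed_out()
```

An operator would expect `after` to be empty once maintenance is complete. A link that stays in the list points to a cost-in step that was missed.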

Problems Caught by LSAG
• Equipment problem
  – Detected internal problems in a crucial router in enterprise network
    • Problem manifested as episodes of OSPF adjacency flapping
• Configuration problem
  – Identified assignment of same router-id to two routers in enterprise network
• OSPF implementation bug
  – Caught a bug in type-3 LSA generation code of a router vendor in ISP network
    • Faster refresh of LSAs than standards-mandated rate
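A refresh-rate check of the kind that would expose the last bug can be sketched as follows. It assumes the refresh LSAs for a single LSA have already been separated from change and duplicate LSAs, and it compares the gaps between them with the 30-minute refresh interval (LSRefreshTime, 1800 seconds in RFC 2328). The function name and the `tolerance` factor are hypothetical.

```python
LS_REFRESH_TIME = 1800  # seconds; refresh interval defined in RFC 2328


def fast_refreshes(refresh_times, tolerance=0.9):
    """Given time-ordered arrival timestamps (seconds) of refresh LSAs for
    one LSA, return the (previous, current) pairs whose gap is shorter than
    tolerance * LS_REFRESH_TIME."""
    pairs = zip(refresh_times, refresh_times[1:])
    return [(a, b) for a, b in pairs if b - a < tolerance * LS_REFRESH_TIME]


suspicious = fast_refreshes([0, 1800, 2400, 4200])
```

Here only the 600-second gap between 1800 and 2400 is flagged. The `tolerance` factor leaves room for implementations that refresh slightly early on purpose.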

Long Term Analysis by OSPFScan
• LSA traffic analysis
  – Identified excessive duplicate LSA traffic in some areas of Enterprise Network
    • Led to root-cause analysis and preventative steps
• Statistics generation
  – Inter-arrival time of change LSAs in ISP network
    • Fine-tuning configurable timers related to route calculation (= SPF calculation)
  – Mean down-time and up-time for links and routers in ISP network
    • Assessment of reliability and availability
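Mean up-time and down-time can be derived from the sequence of state changes recorded for a link or router. The sketch below assumes a hypothetical event format, a time-ordered list of `(timestamp, state)` pairs with state `"up"` or `"down"`, and closes the final interval at the end of the observation period, so that last interval is truncated rather than complete.

```python
def mean_up_down(events, end_time):
    """events: time-ordered list of (timestamp, state) pairs, where state is
    'up' or 'down'. Returns the mean duration spent in each state, or None
    for a state that never occurred."""
    durations = {"up": [], "down": []}
    following = events[1:] + [(end_time, None)]
    for (start, state), (stop, _) in zip(events, following):
        durations[state].append(stop - start)
    return {state: (sum(d) / len(d) if d else None)
            for state, d in durations.items()}


link_events = [(0, "up"), (100, "down"), (130, "up"), (330, "down"), (340, "up")]
stats = mean_up_down(link_events, end_time=400)
```

With these made-up timestamps the up intervals last 100, 200 and 60 seconds and the down intervals last 30 and 10 seconds.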

Lessons Learned through Deployment
• New tools reveal new failure modes
• Real-time alerting and off-line analysis are complementary
  – Distributed architecture helped a lot
• OSPF exhibits significant activity in real networks
  – Maintenance and genuine problems
• Add functionality incrementally and through interaction with users
• Archive all LSAs
  – LSA volume is manageable
  – Don’t throw away refresh and duplicate LSAs

Conclusion
• Three component architecture
  – LSAR: data collection
  – LSAG: real-time analysis
  – OSPFScan: off-line analysis
• Performance analysis
  – LSAR and LSAG scale well as LSA-rate and network size increase
• Deployment
  – Deployed in Tier-1 ISP and Enterprise network
• Has proved to be an extremely valuable tool for network management
  • “OSPF Monitor was a Lifesaver”
    – VP of Networking, Enterprise network

Future Work
• Real-time analysis
  – Correlation with other fault and performance data for more meaningful alerting
  – Prioritization of alerts
• Off-line analysis
  – Correlation with other data sources
    • Work already underway: BGP, fault, performance
  – Identification of problem signatures and feeding them into real-time component for problem prediction

Backup Slides

Overview of OSPF
• OSPF is a link-state protocol
  – Every router learns entire network topology
    • Topology is represented as graph
  – Routers are vertices, links are edges
  – Every link is assigned weight through configuration
  – Every router uses Dijkstra’s single source shortest path algorithm to build its forwarding table
    • Router builds Shortest Path Tree (SPT) with itself as root
    • Shortest path calculation (SPF = Shortest Path First)
  – Packets are forwarded along shortest paths defined by link weights
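The SPF calculation described above can be sketched with Dijkstra's algorithm over a weighted graph. The topology and weights below are made up for illustration, and the sketch keeps a single parent per node, so equal-cost multipath, which real OSPF implementations support, is ignored.

```python
import heapq


def shortest_path_tree(graph, root):
    """Dijkstra's algorithm. graph maps each router to {neighbor: link weight}.
    Returns (dist, parent): the cost from root to every reachable router and
    each router's parent in the shortest path tree."""
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; a shorter path was already found
        for v, weight in graph.get(u, {}).items():
            nd = d + weight
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def next_hop(parent, root, dest):
    """Walk up the tree from dest to find the first hop out of root."""
    if dest == root:
        return None
    node = dest
    while parent[node] != root:
        node = parent[node]
    return node


# Hypothetical three-router topology with symmetric link weights.
graph = {
    "R1": {"R2": 100, "R3": 300},
    "R2": {"R1": 100, "R3": 100},
    "R3": {"R1": 300, "R2": 100},
}
dist, parent = shortest_path_tree(graph, "R1")
```

In this example R1 reaches R3 through R2 at cost 200 instead of using the direct link of cost 300, so R1's forwarding entry for R3 points at R2.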

Areas in OSPF
• OSPF allows domain to be divided into areas for scalability
  – Areas are numbered 0, 1, 2 …
  – Hub-and-spoke with area 0 as hub
  – Every link is assigned to exactly one area
  – Routers with links in multiple areas are called border routers
[Figure: Area 1 and Area 2 attached to Area 0 through border routers.]

Summarization with Areas
• Each router learns
  – Entire topology of its attached areas
  – Information about subnets in remote areas and their distance from the border routers
    • Distance = sum of link costs from border router to subnet
[Figure: an OSPF domain with Area 0 (routers R1, R2, R3 and border routers B1, B2) and Area 1 (routers C1, C2 and subnets 10.10.4.0/24 and 10.10.5.0/24), shown next to R1’s view, in which Area 1 appears only as the two subnets with their distances from B1 and B2.]
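How a router outside the area might use these summaries can be sketched as follows: its cost to a remote subnet through a given border router is its own shortest-path distance to that border router plus the distance the border router advertises, and it picks the border router that minimizes the sum. The function name and all costs below are hypothetical. They are not the link costs from the figure.

```python
def best_route(dist_to_border, summaries, subnet):
    """dist_to_border: {border router: this router's SPF distance to it}.
    summaries: {border router: {subnet: advertised distance}}.
    Returns (total cost, border router) for the cheapest route, or None if
    no reachable border router advertises the subnet."""
    candidates = [
        (dist_to_border[border] + advertised[subnet], border)
        for border, advertised in summaries.items()
        if subnet in advertised and border in dist_to_border
    ]
    return min(candidates) if candidates else None


# Hypothetical distances from an Area 0 router to the two border routers.
dist_to_border = {"B1": 200, "B2": 230}
# Hypothetical summaries advertised by the border routers for Area 1 subnets.
summaries = {
    "B1": {"10.10.4.0/24": 20, "10.10.5.0/24": 70},
    "B2": {"10.10.4.0/24": 60, "10.10.5.0/24": 10},
}
route_4 = best_route(dist_to_border, summaries, "10.10.4.0/24")
route_5 = best_route(dist_to_border, summaries, "10.10.5.0/24")
```

With these numbers the router reaches 10.10.4.0/24 through B1 (200 + 20) and 10.10.5.0/24 through B2 (230 + 10), without knowing anything about the links inside Area 1.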

Link State Advertisements (LSAs)
• Every router describes its local connectivity in Link State Advertisements (LSAs)
• Router originates an LSA due to…
  – Change in network topology
    • Example: link goes down or comes up
  – Periodic soft-state refresh
    • Recommended value of interval is 30 minutes
• LSA is flooded to other routers in the domain
  – Flooding is reliable and hop-by-hop
  – Includes change and refresh LSAs
  – Flooding leads to duplicate copies of LSAs being received
• Every router stores LSAs (self-originated + received) in link-state database (= topology graph)

Adjacency
• Neighbor routers (i.e., routers connected by a physical link) form an adjacency
• The purpose is to make sure
  – Link is operational and routers can communicate with each other
  – Neighbor routers have consistent view of network topology
    • To avoid loops and black holes
• Link gets used for data forwarding only after adjacency is established
• Use of periodic Hellos to monitor the status of link and adjacency

Equipment Problem at Enterprise Network
• Internal errors in a router in area 0
  – Episodes where router would drop adjacencies with other routers
• Problem manifested in LSAG as “ADJ UP” and “ADJ DOWN” messages
  – Not visible in other network management systems
• Led to proactive maintenance
[Figure: two plots, each showing total LSAs in area 0 and total LSAs due to router bug. One is by hour on April 16, 2002 (y-axis 0 to 100); the other is by day in April, 2002 (y-axis 0 to 1500).]

LSA Traffic in Enterprise Network
[Figure: four panels (Area 0, Area 2, Area 3, Area 4) plotting duplicate, change and refresh LSAs per day (x-axis ticks 1, 11, 21). Area 0 y-axis ticks: 1, 100, 10000, 1000000; other panels: 0 to 8000. Annotations: “Artifact: 23 hr day (Apr 7)” and two “Genuine Anomaly” markers.]

Overhead: Duplicate LSAs
• Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any?
  – OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic
[Figure: duplicate LSAs in area 3 and duplicate LSAs in area 2, by day (x-axis ticks 1, 11, 21; y-axis -50 to 2950).]

