Date post: | 17-Jan-2018 |
Category: |
Documents |
Upload: | cecily-boone |
View: | 226 times |
Download: | 0 times |
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
1
Network Failures and Their Impacts;
Fast IGP Routing Convergence and IP Fast Re-Routing (IP FRR)• Network Failures and Intra-Domain
Routing (IGP)– ISP Link Failure Studies– Failure Characteristics, Causes and Impacts
• IGP Fast Routing Convergence •Speed up routing convergence after routing changes
• IP Fast Re-Routing (IP FRR)– Fast Rerouting Schemes: Failure Insensitive Routing– Other Schemes
Readings: Please do the required readings
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
2
Why Network Fails• Many, many possible reasons and causes• Human Errors
– misconfigurations– other mistakes: e.g, let’s see what that red button does
• Software Bugs– buggy implementation, incompatibility, …
• Hardware failures– flaky interfaces, link errors, fiber cuts, router crashes due to CPU
overload or running of memory, ….• Malicious attacks • Network Overload
– traffic surges causing network congestion, … • Others: e.g., natural disasters, major accidents
– E.g., Baltimore tunnel fire, Ohio train accident, …
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
3
Understanding Network Link Failure Characteristics
• Failure Characteristics Within an ISP network – How often do links/routers failure?– How many? Are they random, correlated?– How long do they last?– …
• What about inter-domain or Internet wide?– What causes BGP to update/withdraw routes?– Destination network down? AS internal failures? BGP session
resets? Policy changes? …• How do we measure, detect and analyze network failures?• How do we trouble-shoot network failures and perform
root-case analysis?• How do we design more robust and resilient mechanisms?
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
4
This Lecture: Focusing on Impact of Failures within an ISP
Network• With IP networks becoming the dominant and
“converged” information delivery substrate, displacing telephone networks, and eventually cable TV?– Need to better “service availability”– Telephone networks: service availability metrics: 5 9’s: i.e., 99.999%– What about IP networks?
• Effect of IP network failures:– routers lose “reachability”: i.e., no forward entries– or existence of transient/permanent forwarding loops
• What are impacts of network failures?– In particular, on VoIP services
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
5
Failures Affect Link Loads• Many ISP networks are “over-provisioned” so as to
handle network failures:– Many claim: normal load utilization < 50%
• But still high variability in link utilization: – Can find a link w/ load > 50% every 15 minutes; > 90%
every 8 days
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
6
Traffic Potholes or BlackholesSprint Measurement Study Anecdote:
• Average delay over 5 sec intervals
• Traffic was blackholed for more than 10 minutes
• It took about 40 minutes for the network to reach a stable state
• Root Cause:Route Misconfigurations!
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
7
Routing Loops under Failures• Loops due to link failures/new route advertisement• Measurements from 3 backbone links
– 25% packets caught in a loop in one failure instance– 1% lost due to expire TTL; those that escape have long
delays
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
8
Sprint Link Failure Study• Link “failures” occur fairly frequently, well spread
over time– Inter-POP links are more stable than intra-PoP links– Many intra-PoP link “failures” due to planned events, less
impact on traffic due to “full-mesh” intra-PoP topology• Most link failures tend to be transient
– Excluding “planned” failures– Most are single link failures– Some are correlated link failures
• Link failure characteristics vary depend on links– Depending causes of failures, e.g., flaky interfaces, router
overloads, fiber cuts, etc.,• Impact of link failures
– OC48 link down for 6 seconds: 3 million packets may be lost! – significant impact on applications such as VoIP, on-line
gamiing
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
9
Methodology: Integrated MonitoringSprint Measurement Study [I+02,M+04]• Tier-1 ISP backbone (600+ nodes)• Passive route listener software to collect IS-IS & BGP
updates• IPMON passive traffic monitoring & active probes• SONET alarm logs; router configurations and BGP
policies
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
10
IGP Failure Events• IP link: adjacency between two IS-IS routers• Link Failure: loss of this adjacency• Results shown in the following slides only include
– US inter-PoP links (OC48)– Failures less than 24 hrs long
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
11
Sprint Study: Link Failure Frequencies
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
12
Sprint Study: Duration of Failures
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
13
Sprint Study: Failures across Links
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
14
Scatter Plot of US Failure Events• Apr. – Nov. 2002
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
15
Maintenance (or Planned Failures)• Weekly schedule (Mondays 5am – 2pm UC): 20% of
failures
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
16
Examples of Planned Failures• Upgrades
– Changing link to higher capacity– Loading new operating system on a router– Swapping out an old interface card
• Maintenance– Fixing a flaky optical amplifier– Configuration changes that require a reboot– Responsible for 50% of intradomain failures
• Cable intrusions– Construction activities near a fiber
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
17
Failure Classification
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
18
Anomalies Found in Shaikh04 paper
• Intermittent hardware problem– Router periodically losing OSPF adjacencies– Risk of network partition if 2nd failure occurred
• External link flaps– Congestion on edge link causing lost messages– Lost adjacency leading to flapping routes
• Configuration errors– Two routers assigned the same IP address– Inefficient config leading to duplicate LSAs
• Vendor implementation bug– More frequent refreshing of LSAs than specified
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
19
Converging After a Failure• Failure detection
– Router recognizes an incident link has failed• Failure notification
– Router informs other routers about the change• Path re-computation
– Routers compute new paths avoiding the link• Forwarding-table update
– Routers update their forwarding tables– Data traffic starts to flow over the new path
• AT&T, Sprint studies show– convergence time 100s milliseconds up to a few
seconds
Routing convergence
Forwarding convergence
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
20
All Together: Looking Inside RouterRoute Processor (CPU)
FIB
Interface card Interface card
Forwarding
SwitchingFabric
Data packet
Data packet
TopologyView
SPF Calculation
OSPF Process
LSA
LS Ack
LSA
Forwarding
LSA Processing
LSA Flooding
SPF Calculation
FIB Update
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
21
Bad Things Happen During Convergence
• Transient inconsistencies– Creating “transient forwarding loops” due to– Routers have different views of the network– Forwarding decisions may be inconsistent
• Effects on data traffic– Black-hole: packet loss– Loops: packets going in circles– Delay: packets going on very long paths– Out-of-order: new packets arrive before old ones
• Want to minimize convergence delay– … and especially the effects on the data traffic
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
22
Example: Transient Forwarding Loop (or Micro-loop)
• Set of routers disagree– One router acting on old information– Another router acting on new information
s d
Loop!
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
23
Reducing Impact of Link Failures Assuming Traditional Link-State Protocols• Improving convergence time of control/data plane
– Reducing timer value for HELO messages– Can achieve sub-second convergence time
• 200 msecs common target, threshold for VoIP quality, do-able!– However,
• Still react to failure events, can’t prevent packet loops or losses during convergence
• may amplify effect of short “transient” failures that last sub-seconds• Prevent “micro loops” during transient routing
convergence periods – One solution: using “ordered FIB updates”– requires coordination among routers, adds complexity,
delays convergence time• Dealing with “Planned Failures” ?
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
24
Reducing Impact of Link FailuresUsing MPLS• Can pre-compute back-up paths
– Often done using the “link protection” scheme– For each link, there is a MPLS protection (back-up) path
• But– Need to change “forwarding plane” of routers – Many networks don’t have MPLS deployed
Question: Can we perform fast rerouting using “traditional”
link state routing protocols without resort to MPLS?
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
25
Fast Re-Routing using Link State Protocols [Nelakuditi et al]
• Motivations– Most common link failures are transient single-link failures– Hastily react to such failures by LSA flooding may do more
harm than good, causing network instability! – Suppress such failures unless it lasts longer than a threshold– But we want to be able to re-route affected packets along a
back-up path, not simply dropping them !• FIFR (failure Insensitive Fast Re-routing): nearly 100%
forwarding continuity– prepare for (instead of react to) failures– adapt to changes while ensuring stability– Other Advantages:
• no change to forwarding plane• minimal change to routing plane
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
26
What is Interface Specific Forwarding?
• Interface-independent forwarding– destination next-hop– Each line card has a copy of the same FIB
• Interface-specific forwarding– <incoming interface, destination> next-hop– Different forwarding entries at each line card
• Forwarding operation remains the same
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
27
ISF Enables Local Rerouting• Infer failures based on interface and destination
– Find the farthest keylink whose failure would cause a packet to arrive at the unusual interface along the reverse shortest path to the destination
• Precompute interface-specific forwarding tables– Avoid the keylink in choosing next hop for a destination
• Failure Inferencing based Fast Rerouting– IP fast reroute without explicit routing/tunneling
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
28
Illustration: No Failure Scenario
B BC CD DE BF B
A AC AED AE EF E
F
F
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
29
Illustration: Local Rerouting without ISF
B BC CD DE BF B
A AC AD AE AF A
Flink B – E fails!
new routing table at router B after detecting the failure
F
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
30
Illustration: Local Rerouting with ISF
B BC CD DE BF B
A AC AD AE AF A
B -C CD DE CF D
F
F
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
31
ISF Table Computation• Infer failed links from packet’s arrival at an interface
– keylink whose failure causes packet to d arrive at i from j
– A link u -> v is a candidate keylink if• with u->v, j is a next hop from i to d• without u->v, edge j->i is along the shortest path from u to d
– is the farthest one from i among candidate keylinks
• Avoid keylink in choosing the destination’s next hop– next hops to d from i when packet arrives at i from j
• Failure inferencing is not done per packet– ISF table entries computed upon link state updates
dijF
)\( dij
di
dij KERF
dijK
dijK
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
32
Illustration: ISF Table ComputationB -C CD DE CF D
E
ABK {B-E}
D
ABK {}
C
ABK {}
{E-F}F
ABKB BC CD -E BF B
B BC -D DE BF B
When no more than one link failure is suppressed in a network with symmetric weights, FIFR always forwards successfully to a destination if a path to it exists
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
33
Operations under FIFREvent Adjacent nodes Other nodesPacket arrival Interface-specific forwarding
Link down Initiate local rerouting
Link up before suppression interval
Resume forwarding on the recovered link
Link down beyond suppression interval
Link state update Recompute interface-specific forwarding tables
Link up after suppression interval
Link state update Recompute interface-specific forwarding tables
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
34
Handling both Link and Node Failures
• Infer keynodes instead of keylinks– A node u is a candidate keynode if
• with u, j is a next hop from i to d• without u, edge j->i is along the shortest path from the
upstream node of u (w.r.t. the path from i to u) to d– Keynode is the farthest one from i among candidates
• When no route to destination without a node– Node adjacent to the failure assumes link failure– Non-adjacent nodes treat it as adjacent node failure– May cause loops when destination is indeed not reachable
• Protects against non-partitioning single failures
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
35
Networks with Asymmetric Link Weights
• FIFR can handle asymmetric link weights– By forcing packets to take reverse shortest path– Provided links are bidirectional
• Keynode computation based on rSPF– A node u is a candidate keynode if
• with u, j is a next hop from i to d• without u, edge i->j is along the shortest path from d to
the upstream node of u (w.r.t the path from i to u)– Keynode is the farthest one from i among candidates– Works with both symmetric and asymmetric weights
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
36
Networks with Broadcast Links
• FIFR applicable to networks with broadcast links– A broadcast link is modeled with point to point links
from/to the designated router
• Adjacent failures– Broadcast link failure treated as that of designated router
• Non-adjacent failures– Not necessary to know the previous hop of a packet to
compute interface-specific keynode per destination– Failure inferencing can be done as before
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
37
Summary of FIFR• Fast reroute under any single failures
– Without changing/encapsulating IP datagram
• May cause loops under multiple failures– With ISF, guaranteed-protection against single failures
or loop-freedom under multiple failures but not both– Blacklist-based Interface Specific Forwarding
• Needs interface-specific forwarding– Two forwarding entries per destination– O(|E|log2|V|) to compute forwarding entries
CSci5221: Network Failures; IGP Fast Convergence and IP Fast Re-Routing
38
Other ApproachesSee the optional reading [GRY07] for more detials.•Loop-free Alternative (LFA): fast re-routing only when direct link to (default) next-hop fails
– simpler computation – we know exactly which link to remove when computed new next-hop; but protection limited
– using IP-tunnels, etc.•U-turn: allow protection over multiple hops •Using “Not-Via” Addresses •Multi-topology routing
– routers and links (with possibly different link weights) belong to multiple topologies
• E.g., a default topology, plus “back-up” topologies with various (assumed) links removed (or new link weights)
– packets are “marked” with “topology id” for look-up• IETF Fast Rerouting and MT-Routing Working Groups