LIFEGUARD: Practical Repair of Persistent Route Failures
Ethan Katz-Bassett (USC), Colin Scott, David Choffnes, Italo Cunha,
Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy
This work is generously funded in part by Google, Cisco and the NSF.
Long Outages Cause Most Unavailability
- Monitor outages from Amazon’s EC2
- Fraction of outages of duration ≤ X?
- Fraction of unavailability due to outages of duration ≤ X?
86% of outages last less than 5 minutes
But longer outages account for 90% of the unavailability
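The difference between the two fractions can be illustrated with a small calculation. The sketch below uses made-up outage durations, not the EC2 measurements from the talk; the function name `outage_fractions` and the sample data are assumptions for illustration only.

```python
def outage_fractions(durations_min, threshold_min):
    """Return (fraction of outages, fraction of unavailability) contributed
    by outages lasting at most threshold_min minutes."""
    short = [d for d in durations_min if d <= threshold_min]
    frac_outages = len(short) / len(durations_min)
    frac_unavailability = sum(short) / sum(durations_min)
    return frac_outages, frac_unavailability


# Hypothetical durations in minutes: many short outages, two long ones.
durations = [1, 2, 2, 3, 3, 4, 4, 5, 120, 300]
by_count, by_time = outage_fractions(durations, threshold_min=5)
print(f"{by_count:.0%} of outages, {by_time:.1%} of unavailability")
```

In this toy data set, most outages are short, yet nearly all of the downtime comes from the two long ones, which is the same shape of result the slide reports.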
Operators Struggle to Locate Failures
“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.” (Outages mailing list, Dec. 2010)
Mailing List User 1
1 Home router
2 Verizon in Baltimore
3 Verizon in Philly
4 Alter.net in DC
5 Level3 in DC
6 * * *
7 * * *
Mailing List User 2
1 Home router
2 Verizon in DC
3 Alter.net in DC
4 Level3 in DC
5 Level3 in Chicago
6 Level3 in Denver
7 * * *
8 * * *
Reasons for Long-Lasting Outages
Long-term outages are:
- Repaired over slow, human timescales
- Not well understood
- Caused by routers advertising paths that do not work
  - E.g., corrupted memory on line card causes black hole
  - E.g., bad cross-layer interactions cause failed MPLS tunnel
- Complicated by lack of visibility into or control over routes in other ISPs
Our Approach and Outline
LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
- Locate the ISP / link causing the problem
  - Building blocks
  - Example
  - Description of technique
- Suggest that other ISPs reroute around the problem
Building blocks for failure isolation
LIFEGUARD can use:
- Ping to test reachability
- Traceroute to measure forward path
- Distributed vantage points (VPs)
  - PlanetLab for our experiments
  - Some can source spoof
- Reverse traceroute to measure reverse path (NSDI ’10)
- Atlas of historical forward/reverse paths between VPs and targets
How does LIFEGUARD locate a failure?
Before outage:
- Historical atlas enables reasoning about changes
- Traceroute yields only path from GMU to target
- Reverse traceroute reveals path asymmetry
[Figure: routes between source GMU and target Smartkom, passing through Level3, Telia, TransTelecom, ZSTTK, Rostelecom, and NTT]
How does LIFEGUARD locate a failure?
During outage:
[Figure: the same topology between source GMU and target Smartkom (Level3, Telia, ZSTTK, Rostelecom, NTT, TransTelecom), annotated with the probes listed below]
- Problem with ZSTTK?
- Spoofed ping through a vantage point (VP): request "Ping? Fr: VP", reply "Ping! To: VP"
- Forward path works
- Ping NTT from GMU: request "NTT: Ping? Fr: GMU", reply "GMU: Ping! Fr: NTT"
- Ping Rostelecom from GMU: request "Rostele: Ping? Fr: GMU"
- Rostelecom is not forwarding traffic towards GMU
How LIFEGUARD Locates Failures
LIFEGUARD:
1. Maintains background historical atlas
2. Isolates direction of failure, measures working direction
3. Tests historical paths in failing direction in order to prune candidate failure locations
4. Locates failure as being at the horizon of reachability
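Steps 3 and 4 can be sketched in a few lines. This is a simplified illustration, not LIFEGUARD's implementation: the function `find_horizon`, the hop ordering, and the set of responding hops are all assumptions, and the reachability test stands in for real ping measurements.

```python
def find_horizon(path, can_reach_source):
    """Walk a historical path in the failing direction, starting next to the
    source, and return (suspect, last_working).

    path: hops ordered from the target's side to the source's side
          (hypothetical ordering for this example).
    can_reach_source: callable hop -> bool, a stand-in for a ping test.
    """
    last_working = None
    for hop in reversed(path):            # begin with the hop nearest the source
        if can_reach_source(hop):
            last_working = hop            # pruned: this hop still works
        else:
            return hop, last_working      # first hop beyond the horizon
    return None, last_working             # every hop answered on this path


# Hypothetical reverse path and probe results, loosely based on the example.
path = ["ZSTTK", "Rostelecom", "NTT", "Level3"]
responding = {"NTT", "Level3"}
suspect, last_ok = find_horizon(path, lambda hop: hop in responding)
print(suspect, last_ok)  # Rostelecom NTT
```

The suspect is the hop just past the last one that can still reach the source, which is what "horizon of reachability" refers to in step 4.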
Our Approach and Outline
LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
- Locate the ISP / link causing the problem
- Suggest that other ISPs reroute around the problem
  - What would we like to add to BGP to enable this?
  - What can we deploy today, using only available protocols and router support?
Our Goal for Failure Avoidance
- Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them
Setting
- Assume we can locate problem
- Assume we are multi-homed / have multiple data centers
- Assume we speak BGP
  - We use BGP-Mux to speak BGP to the real Internet: 5 US universities as providers
Self-Repair of Forward Paths
Straightforward: Choose a path that avoids the problem.
A Mechanism for Failure Avoidance
Forward path: Choose route that avoids ISP or ISP-ISP link
Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
- Want a BGP announcement AVOID(X,P):
  - Any ISP with a route to P that avoids X uses such a route
  - Any ISP not using X need only pass on the announcement
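The intended semantics of AVOID(X,P) can be sketched as a route-selection rule. AVOID is a hypothetical announcement, so nothing below corresponds to a real BGP API; `uses` and `handle_avoid` are illustrative names, and the fallback when no alternative exists is an assumption, since the slide does not specify it.

```python
def uses(route, x):
    """True if the AS path traverses x, where x is an ISP name or an
    (ISP, ISP) link given as a tuple."""
    if isinstance(x, tuple):
        return x in zip(route, route[1:])   # consecutive hops form links
    return x in route


def handle_avoid(current, alternatives, x):
    """Route an ISP would use toward prefix P after hearing AVOID(x, P)."""
    if not uses(current, x):
        return current                      # not using x: just pass it on
    for route in alternatives:
        if not uses(route, x):
            return route                    # switch to a route avoiding x
    return current                          # assumption: keep the old route


# Hypothetical routes toward WS, using ISP names from the later example.
current = ("UW", "L3", "ATT", "WS")
alternatives = [("UW", "Sprint", "Qwest", "WS")]
print(handle_avoid(current, alternatives, "L3"))
print(handle_avoid(current, alternatives, ("L3", "ATT")))
```

Both calls select the route through Sprint and Qwest, once because it avoids the ISP L3 and once because it avoids the L3-ATT link.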
Ideal Self-Repair of Reverse Paths
[Figure: the announcement AVOID(L3,WS) is passed along from ISP to ISP]
Do paths exist that AVOID problem?
LIFEGUARD repairs outages by instructing others to avoid particular routes.
Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.
- Simulated 10 million AVOIDs on actual measured routes.
Practical Self-Repair of Reverse Paths
[Figure: AS paths toward WS as its announcement propagates]
WS
ATT → WS
Qwest → WS
L3 → ATT → WS
Sprint → Qwest → WS
AISP → Qwest → WS
UW → L3 → ATT → WS
Practical Self-Repair of Reverse Paths
- In place of AVOID(L3,WS), WS announces the path WS → L3 → WS
- BGP loop prevention encourages switch to working path.
[Figure: AS paths toward WS as the new announcement propagates]
WS → L3 → WS
ATT → WS → L3 → WS
Qwest → WS → L3 → WS
Sprint → Qwest → WS → L3 → WS
AISP → Qwest → WS → L3 → WS
L3: ? (previously L3 → ATT → WS)
UW → Sprint → Qwest → WS → L3 → WS (previously UW → L3 → ATT → WS)
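The effect of loop prevention on a poisoned announcement can be sketched with a toy route selector. This is a deliberate simplification: real BGP route selection involves local preference and other attributes, and `select` here (shortest loop-free path) is an illustrative stand-in, not a router's decision process.

```python
def loop_free(asn, as_path):
    """BGP loop prevention: an AS rejects any path that already contains it."""
    return asn not in as_path


def select(asn, offers):
    """Pick the shortest loop-free offer and prepend asn; None if no usable
    offer remains (simplified stand-in for BGP route selection)."""
    usable = [path for path in offers if loop_free(asn, path)]
    return (asn, *min(usable, key=len)) if usable else None


# WS inserts L3 into its own announcement so that L3 discards it.
poisoned = ("WS", "L3", "WS")
att = select("ATT", [poisoned])
qwest = select("Qwest", [poisoned])
sprint = select("Sprint", [qwest])
l3 = select("L3", [att])                      # contains L3: rejected
uw = select("UW", [p for p in (l3, sprint) if p is not None])
print(l3)
print(uw)
```

With L3 left without a route to offer, UW falls back to the path through Sprint and Qwest, matching the final state shown on the slide.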
Stuff I Don’t Have Time to Talk About
Results from real poisonings
- Poisoning in the wild / poisoning anomalies
- Case study of restoring connectivity
Making poisoning flexible
- Monitoring broken path while it is disabled
- Allowing ISPs w/o alternatives to use disabled route
LIFEGUARD’s scalability
- Overhead and speed of failure location
- Router update load if many ISPs deploy our approach
Alternatives to poisoning
- Compatibility with secure routing (BGPSEC, etc.)
- Comparing to other route control mechanisms
Can poisoning approximate AVOID effects?
LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) Under certain circumstances, we can disable a link without disabling the full ISP.
(b) We can speed BGP convergence by carefully crafting announcements.
What if some routes in an ISP still work?
[Figure: topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, and D2. Legend: network link, transitive link, pre-poisoning (original) path, post-poisoning (new) path. Announcement labels shown across the builds: O, O-A-O, and, in the final build, O-O-O and O-A-O next to D1 and D2]
- We only want C3 to change its route, to avoid A-B2
- Forward direction is easy: choose a different route
- Poisoning seems blunt, disabling an entire ISP
- Selective advertising via just D1 is also blunt
- If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1
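Selective poisoning amounts to sending different AS paths through different neighbors. The sketch below assumes D1 and D2 are the neighbors through which O announces its prefix, as the figure labels O-O-O and O-A-O suggest; the function name and data layout are illustrative only.

```python
def selective_announcements(origin, poisoned_as, neighbors, poison_via):
    """Build one AS path per neighbor: poisoned where requested, otherwise
    the origin repeated so both paths have the same length."""
    plain = (origin, origin, origin)
    poisoned = (origin, poisoned_as, origin)
    return {n: (poisoned if n in poison_via else plain) for n in neighbors}


ann = selective_announcements("O", "A", ["D1", "D2"], poison_via={"D2"})

# A discards paths that contain A (loop prevention), so it can only keep
# a route learned through the neighbor that carries the unpoisoned path.
usable_by_a = [n for n, path in ann.items() if "A" not in path]
print(ann)
print(usable_by_a)
```

Only the part of A that hears the announcement via D2 loses its route, so routes entering A on the D1 side stay available.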
Can poisoning approximate AVOID effects?
LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
    - Real-world results from 5 provider BGP-Mux testbed
(b) We can speed BGP convergence by carefully crafting announcements.
Naive Poisoning Causes Transient Loss
- Some ISPs may have working paths that avoid problem ISP X
- Naively, poisoning causes path exploration even for these ISPs
- Path exploration causes transient loss
[Figure: topology with origin O and ISPs A, B, C, D, E, and F. To approximate AVOID(X,P), O changes its announcement from O to O-X-O. Paths before: A-O, B-A-O, D-A-O, E-D-A-O, F-B-A-O. During convergence, new paths such as B-A-O-X-O and D-A-O-X-O coexist with old paths such as E-D-A-O and F-B-A-O. Paths after: A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, F-B-A-O-X-O]
Prepend to Reduce Path Exploration
- Most routing decisions based on: (1) next hop ISP (2) path length
- Keep these fixed to speed convergence
- Prepending prepares ISPs for later poison
[Figure: the same topology. O first announces the prepended path O-O-O, giving A-O-O-O, B-A-O-O-O, D-A-O-O-O, E-D-A-O-O-O, F-B-A-O-O-O. To approximate AVOID(X,P), O then announces O-X-O, and the paths become A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, F-B-A-O-X-O]
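Why prepending helps can be checked by comparing the two attributes named on the slide before and after the poison. The sketch below follows the chain O, A, B, F from the figure; `propagate` and `changed_attributes` are illustrative helpers, not part of any BGP implementation.

```python
def propagate(origin_path, chain):
    """Prepend each AS in chain (nearest to the origin first) to the path
    announced by the origin."""
    path = tuple(origin_path)
    for asn in chain:
        path = (asn,) + path
    return path


def changed_attributes(old, new):
    """Compare next hop (second entry of the AS path) and path length."""
    return {"next_hop": old[1] != new[1], "length": len(old) != len(new)}


chain = ["A", "B", "F"]
baseline = propagate(("O",), chain)              # F-B-A-O
prepended = propagate(("O", "O", "O"), chain)    # F-B-A-O-O-O
poisoned = propagate(("O", "X", "O"), chain)     # F-B-A-O-X-O

print(changed_attributes(baseline, poisoned))    # length changes
print(changed_attributes(prepended, poisoned))   # nothing changes
```

Starting from the prepended baseline, the poisoned path has the same next hop and the same length from F's point of view, so F has no reason to explore other paths.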
Prepending Speeds Convergence
[Figure: CDF of peer convergence time. x-axis: Peer Convergence Time (minutes), 0 to 8; y-axis: Cumulative Fraction of Convergences (CDF), with ticks at 0.650, 0.95, 0.99, 0.999, and 0.9999; series: "Prepend, no change" and "No prepend, no change"]
- With no prepend, only 65% of unaffected ISPs converge instantly
- With prepending, 95% of unaffected ISPs re-converge instantly, 98% in < 1/2 min.
- Also speeds convergence to new paths for affected peers
Conclusion
- We increasingly depend on the Internet, but availability lags
- Much of Internet unavailability due to long-lasting outages
- LIFEGUARD: Let edge networks reroute around failures
- Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
  - Use reverse traceroute, isolate directions, use historical view
- Avoidance challenge: Reroute without participation of transit networks
  - BGP poisoning gives control to the destination
  - Well-crafted announcements ease concerns