Date post: 31-Aug-2020
LIFEGUARD: Practical Repair of Persistent Route Failures. Ethan Katz-Bassett (USC), Colin Scott, David Choffnes, Italo Cunha, Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy. This work is generously funded in part by Google, Cisco and the NSF.
Transcript
Page 1: LIFEGUARD: Practical Repair of Persistent Route Failures

Ethan Katz-Bassett (USC), Colin Scott, David Choffnes, Italo Cunha, Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy

This work is generously funded in part by Google, Cisco and the NSF.

Page 8 (slide 5): Long Outages Cause Most Unavailability

• Monitor outages from Amazon’s EC2
• Fraction of outages of duration ≤ X?
• Fraction of unavailability due to outages of duration ≤ X?

86% of outages last less than 5 minutes.
But longer outages account for 90% of the unavailability.
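
The two quantities on this slide can be computed from a list of outage durations. A minimal sketch, assuming made-up durations (the numbers below are synthetic and are not the EC2 measurements) and hypothetical function names:

```python
def fraction_of_outages(durations, x):
    """Fraction of outages whose duration is at most x."""
    return sum(1 for d in durations if d <= x) / len(durations)


def fraction_of_unavailability(durations, x):
    """Fraction of total outage time contributed by outages of duration at most x."""
    return sum(d for d in durations if d <= x) / sum(durations)


# Synthetic durations in minutes (an assumption for illustration only).
durations = [1, 1, 2, 2, 3, 3, 4, 4, 60, 120]

short_share = fraction_of_outages(durations, 5)
short_downtime = fraction_of_unavailability(durations, 5)
print(f"{short_share:.0%} of outages last at most 5 minutes")
print(f"{1 - short_downtime:.0%} of unavailability comes from longer outages")
```

A few long outages dominate the total downtime even when short outages are far more numerous, which is the point the slide makes with the measured data.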

Page 11 (slide 6): Operators Struggle to Locate Failures

“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.” Outages mailing list, Dec. 2010

Mailing List User 1
1 Home router
2 Verizon in Baltimore
3 Verizon in Philly
4 Alter.net in DC
5 Level3 in DC
6 * * *
7 * * *

Mailing List User 2
1 Home router
2 Verizon in DC
3 Alter.net in DC
4 Level3 in DC
5 Level3 in Chicago
6 Level3 in Denver
7 * * *
8 * * *

Page 12 (slide 7): Reasons for Long-Lasting Outages

Long-term outages are:
• Repaired over slow, human timescales
• Not well understood
• Caused by routers advertising paths that do not work
  • E.g., corrupted memory on line card causes black hole
  • E.g., bad cross-layer interactions cause failed MPLS tunnel
• Complicated by lack of visibility into or control over routes in other ISPs

Page 14 (slide 8): Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
  • Building blocks
  • Example
  • Description of technique
• Suggest that other ISPs reroute around the problem

Page 15 (slide 9): Building blocks for failure isolation

LIFEGUARD can use:
• Ping to test reachability
• Traceroute to measure forward path
• Distributed vantage points (VPs)
  • PlanetLab for our experiments
  • Some can source spoof
• Reverse traceroute to measure reverse path (NSDI ’10)
• Atlas of historical forward/reverse paths between VPs and targets

Page 18 (slide 10): How does LIFEGUARD locate a failure?

Before outage:
• Historical atlas enables reasoning about changes
• Traceroute yields only path from GMU to target
• Reverse traceroute reveals path asymmetry

[Figure: Source: GMU and Target: Smartkom. The first build adds Level3, Telia, TransTelecom and ZSTTK; the next build adds Rostelecom and NTT.]

Page 29 (slide 11): How does LIFEGUARD locate a failure?

During outage:
• Problem with ZSTTK?
• Ping? Fr:VP
• Ping! To:VP
• Forward path works

[Figure: Source: GMU and Target: Smartkom, with Level3, Telia, ZSTTK, Rostelecom, NTT and TransTelecom, plus a vantage point (VP). A "?" marks the outage until the probes complete.]
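
The probes on this slide can be read as a small decision procedure: a ping sent toward the target with the VP's address as its source ("Ping? Fr:VP") produces a reply that arrives at the VP ("Ping! To:VP"), which shows that the forward direction still delivers packets. A rough sketch, assuming probe outcomes are already available as booleans (nothing here sends packets). The `classify_failure` helper and the mirror-image probe for the reverse direction are assumptions of this sketch and are not shown on the slide:

```python
from enum import Enum


class Failure(Enum):
    NONE = "no failure observed"
    FORWARD = "source-to-target direction is failing"
    REVERSE = "target-to-source direction is failing"
    UNRESOLVED = "both directions failing, or target unreachable"


def classify_failure(direct_ok, reply_seen_at_vp, reply_seen_at_source):
    """Classify which direction of a source/target path is failing.

    direct_ok: the source pinged the target and received the reply itself.
    reply_seen_at_vp: the source pinged the target using the VP's address
        as the packet source, and the VP received the reply.
    reply_seen_at_source: a VP pinged the target using the source's address
        as the packet source, and the source received the reply
        (hypothetical mirror-image probe).
    """
    if direct_ok:
        return Failure.NONE
    if reply_seen_at_vp:
        # The probe reached the target, so only the way back is broken.
        return Failure.REVERSE
    if reply_seen_at_source:
        # The target can still reach the source, so the way out is broken.
        return Failure.FORWARD
    return Failure.UNRESOLVED


# The case on the slide: direct pings fail, but the VP hears the reply.
result = classify_failure(direct_ok=False, reply_seen_at_vp=True,
                          reply_seen_at_source=False)
print(result.value)
```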

Page 36 (slide 12): How does LIFEGUARD locate a failure?

During outage:
• Forward path works
• NTT: Ping? Fr:GMU
• GMU: Ping! Fr:NTT

[Figure: Source: GMU and Target: Smartkom, with Level3, Telia, ZSTTK, Rostelecom, NTT and TransTelecom.]

Page 41 (slide 13): How does LIFEGUARD locate a failure?

During outage:
• Forward path works
• Rostelecom: Ping? Fr:GMU
• Rostelecom is not forwarding traffic towards GMU

[Figure: Source: GMU and Target: Smartkom, with Level3, Telia, ZSTTK, Rostelecom, NTT and TransTelecom.]

Page 42 (slide 14): How LIFEGUARD Locates Failures

LIFEGUARD:
1. Maintains background historical atlas
2. Isolates direction of failure, measures working direction
3. Tests historical paths in failing direction in order to prune candidate failure locations
4. Locates failure as being at the horizon of reachability
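
Steps 3 and 4 can be sketched as a walk along a historical path in the failing direction. This is a minimal sketch, not the talk's implementation: `locate_horizon` and the `can_reach_source` callback (standing in for a ping test such as "NTT: Ping? Fr:GMU") are hypothetical, and the hop order in the example is an assumption based on the GMU/Smartkom slides.

```python
def locate_horizon(path_toward_source, can_reach_source):
    """Find the horizon of reachability along a historical path.

    path_toward_source: hops ordered from the target side toward the source.
    can_reach_source(hop): True if that hop can still deliver packets
        to the source.
    Returns (last_working, suspect): the hop farthest from the source that
    still reaches it, and the next hop beyond it, which does not.
    """
    last_working = None
    suspect = None
    # Start with the hop nearest the source and move toward the target.
    for hop in reversed(path_toward_source):
        if can_reach_source(hop):
            last_working = hop  # pruned: this hop is not the failure
        else:
            suspect = hop       # first hop that cannot reach the source
            break
    return last_working, suspect


# Assumed historical reverse path, listed from the target side toward GMU.
historical_path = ["ZSTTK", "Rostelecom", "NTT"]
# Assumed probe outcomes, matching the slides: NTT reaches GMU,
# Rostelecom does not.
probe_results = {"ZSTTK": False, "Rostelecom": False, "NTT": True}

print(locate_horizon(historical_path, lambda hop: probe_results[hop]))
```

Under these assumptions the walk stops at Rostelecom, the first hop past the last one that can still reach GMU.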

Page 44 (slide 15): Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
• Suggest that other ISPs reroute around the problem
  • What would we like to add to BGP to enable this?
  • What can we deploy today, using only available protocols and router support?

Page 45 (slide 16): Our Goal for Failure Avoidance

• Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them

Setting
• Assume we can locate problem
• Assume we are multi-homed / have multiple data centers
• Assume we speak BGP
  • We use BGP-Mux to speak BGP to the real Internet: 5 US universities as providers

Page 49 (slide 17): Self-Repair of Forward Paths

Straightforward: Choose a path that avoids the problem.

Page 50 (slide 18): A Mechanism for Failure Avoidance

Forward path: Choose route that avoids ISP or ISP-ISP link

Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
• Want a BGP announcement AVOID(X,P):
  • Any ISP with a route to P that avoids X uses such a route
  • Any ISP not using X need only pass on the announcement
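
A minimal sketch of how an ISP might react to AVOID(X,P), assuming it stores its candidate routes to P as AS-path lists ordered by preference. AVOID is the announcement the talk wishes for, not an existing BGP feature. `select_route` is a hypothetical helper, the fallback for ISPs with no alternative is an assumption, and the sketch handles only the case where X is an ISP rather than an ISP-ISP link.

```python
def select_route(candidate_paths, avoid):
    """Pick the most preferred AS path to prefix P that does not traverse `avoid`.

    candidate_paths: AS paths to P, most preferred first.
    avoid: the ISP named X in AVOID(X, P).
    Falls back to the most preferred path when no alternative exists
    (an assumption of this sketch), or None when there are no paths.
    """
    for path in candidate_paths:
        if avoid not in path:
            return path
    return candidate_paths[0] if candidate_paths else None


# Hypothetical candidates for an ISP that currently prefers a route via L3.
candidates = [["L3", "ATT", "WS"], ["Sprint", "Qwest", "WS"]]
print(select_route(candidates, avoid="L3"))
```

An ISP whose preferred path already avoids X keeps it unchanged, which corresponds to the "need only pass on the announcement" case.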

Page 55 (slide 19): Ideal Self-Repair of Reverse Paths

[Figure: successive builds add AVOID(L3,WS) announcement labels, up to three in the final build.]

Page 56 (slide 20): Do paths exist that AVOID problem?

LIFEGUARD repairs outages by instructing others to avoid particular routes.

Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.
• Simulated 10 million AVOIDs on actual measured routes.

Page 63 (slide 21): Practical Self-Repair of Reverse Paths

[Figure: the announcement from WS and the route labels added as it propagates.]
• WS
• ATT → WS
• Qwest → WS
• L3 → ATT → WS
• Sprint → Qwest → WS
• AISP → Qwest → WS
• UW → L3 → ATT → WS

Page 64: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WS

AISP ! Qwest ! WS Qwest ! WS

AVOID(L3,WS)

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

Page 65: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WS

AISP ! Qwest ! WS Qwest ! WS

WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

BGP loop prevention encourages switch to working path.

Page 66: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WS

AISP ! Qwest ! WS

WS ! L3! WS

Qwest ! WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

BGP loop prevention encourages switch to working path.

Page 67: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WS

AISP ! Qwest ! WS ! L3! WS

WS ! L3! WS

Qwest ! WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

BGP loop prevention encourages switch to working path.

Page 68: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WSSprint ! Qwest ! WS ! L3! WS WS ! L3! WS

Qwest ! WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

BGP loop prevention encourages switch to working path.

Page 69: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WSSprint ! Qwest ! WS ! L3! WS

ATT ! WS ! L3! WS

WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

L3 ! ATT ! WS

BGP loop prevention encourages switch to working path.

Page 70: LIFEGUARD: Practical Repair of Persistent Route Failures...NTT Rostelecom TransTelecom Source: GMU 13 LIFEGUARD: Practical Repair of Persistent Route Failures! Forward path works!

LIFEGUARD: Practical Repair of Persistent Route Failures

WS

ATT ! WS

UW ! L3 ! ATT ! WS

Sprint ! Qwest ! WS

?

Sprint ! Qwest ! WS ! L3! WS

ATT ! WS ! L3! WS

WS ! L3! WS

22

Practical Self-Repair of Reverse Paths

BGP loop prevention encourages switch to working path.

Page 71 (slide 22):

Practical Self-Repair of Reverse Paths

[Figure: AS paths toward WS. Labels shown: WS; ATT → WS; UW → L3 → ATT → WS; Sprint → Qwest → WS; "?"; WS → L3 → WS; ATT → WS → L3 → WS; Sprint → Qwest → WS → L3 → WS; UW → Sprint → Qwest → WS → L3 → WS.]

BGP loop prevention encourages switch to working path.
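
The loop-prevention behaviour these builds rely on can be sketched in a few lines of Python. This is an illustrative model only, not LIFEGUARD's implementation: the helper names `accepts` and `poisoned_path` are hypothetical, and the AS names from the slide stand in for AS numbers.

```python
def accepts(local_as, as_path):
    """BGP loop prevention: an AS rejects any route whose AS path
    already contains its own identifier."""
    return local_as not in as_path


def poisoned_path(origin, poisoned_as):
    """Hypothetical helper: the origin inserts the AS it wants to avoid
    between two copies of itself, as in WS -> L3 -> WS on the slide."""
    return [origin, poisoned_as, origin]


announcement = poisoned_path("WS", "L3")

# L3 sees itself in the path and drops the route; the others keep it.
results = {asn: accepts(asn, announcement) for asn in ["ATT", "Qwest", "L3"]}
print(results)
```

Under this model, once L3 drops the route, a neighbor such as UW that was routing through L3 has to pick another path, which matches the final build where UW reaches WS via Sprint and Qwest.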

Page 73 (slide 23):

Stuff I Don’t Have Time to Talk About

Results from real poisonings
• Poisoning in the wild / poisoning anomalies
• Case study of restoring connectivity

Making poisoning flexible
• Monitoring broken path while it is disabled
• Allowing ISPs w/o alternatives to use disabled route

LIFEGUARD’s scalability
• Overhead and speed of failure location
• Router update load if many ISPs deploy our approach

Alternatives to poisoning
• Compatibility with secure routing (BGPSEC, etc.)
• Comparing to other route control mechanisms

Page 74 (slide 24):

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) Under certain circumstances, we can disable a link without disabling the full ISP.
(b) We can speed BGP convergence by carefully crafting announcements.

Page 75 (slide 25):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Legend: network link, transitive link, original path, new path.]

• We only want C3 to change its route, to avoid A-B2

Page 77 (slide 26):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Legend: network link, transitive link, original path, new path.]

• We only want C3 to change its route, to avoid A-B2
• Forward direction is easy: choose a different route

Page 79 (slide 27):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Legend: network link, transitive link, original path, new path.]

• We only want C3 to change its route, to avoid A-B2
• Forward direction is easy: choose a different route

Page 80 (slide 28):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement labels shown: O, O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP

Page 82 (slide 29):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement labels shown: O-O-O, O-A-O, O-A-O, O-A-O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP

Page 83 (slide 30):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Labels shown: "?", "?"; O-O-O, O-A-O, O-A-O, O-A-O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP

Page 84 (slide 31):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement label shown: O. Legend: network link, transitive link, original path, new path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP
• Selective advertising via just D1 is also blunt

Page 86 (slide 32):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Labels shown: "?", "?", "?"; O. Legend: network link, transitive link, original path, new path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP
• Selective advertising via just D1 is also blunt

Page 87 (slide 33):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement labels shown: O, O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP
• If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1

Page 89 (slide 34):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement labels shown: O-O-O, O-A-O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP
• If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1

Page 90 (slide 35):

What if some routes in an ISP still work?

[Figure: AS topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, D2. Announcement labels shown: O-O-O, O-A-O. Legend: network link, transitive link, pre-poisoning path, post-poisoning path.]

• We only want C3 to change its route, to avoid A-B2
• Poisoning seems blunt, disabling an entire ISP
• If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1
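
Selective poisoning as described on this slide can be sketched as a per-provider choice of announcement. The Python below is a minimal illustration under stated assumptions: the function name `selective_announcements` is hypothetical, AS names stand in for AS numbers, and the unpoisoned baseline is written as O-O-O to match the labels in the figure.

```python
def selective_announcements(origin, providers, poison_via, avoid):
    """Return the AS path the origin announces through each provider.

    Providers listed in `poison_via` receive the poisoned path
    (origin, avoid, origin); the rest receive a prepended baseline
    of the same length (origin, origin, origin).
    """
    paths = {}
    for provider in providers:
        if provider in poison_via:
            paths[provider] = [origin, avoid, origin]
        else:
            paths[provider] = [origin, origin, origin]
    return paths


paths = selective_announcements("O", ["D1", "D2"], poison_via={"D2"}, avoid="A")

# A applies loop prevention: it can only use announcements without "A" in them.
usable_by_a = [p for p, path in paths.items() if "A" not in path]
print(paths, usable_by_a)
```

In this sketch A keeps a route only through the side reached via D1, which is the effect the slide is after when D1 and D2 connect to different PoPs of A.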

Page 91 (slide 36):

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
    ‣ Real-world results from 5 provider BGP-Mux testbed
(b) We can speed BGP convergence by carefully crafting announcements.

Page 92 (slide 37):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O; A-O; B-A-O; D-A-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 93 (slide 38):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O; B-A-O; D-A-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 94 (slide 39):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O-X-O; B-A-O; D-A-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 95 (slide 40):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown (several repeated): O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 96 (slide 41):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown (several repeated): O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 97 (slide 42):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown (several repeated): O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 98 (slide 43):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown (several repeated): O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O; F-B-A-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 99 (slide 44):

Naive Poisoning Causes Transient Loss

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O-X-O; F-B-A-O-X-O. Annotation: AVOID(X,P).]

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

Page 100 (slide 45):

Prepend to Reduce Path Exploration

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-O-O; A-O-O-O; B-A-O-O-O; D-A-O-O-O; E-D-A-O-O-O; F-B-A-O-O-O. Annotation: AVOID(X,P).]

• Most routing decisions based on: (1) next hop ISP (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison

Page 101 (slide 46):

Prepend to Reduce Path Exploration

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; O-O-O; A-O-O-O; B-A-O-O-O; D-A-O-O-O; E-D-A-O-O-O; F-B-A-O-O-O. Annotation: AVOID(X,P).]

• Most routing decisions based on: (1) next hop ISP (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison

Page 102 (slide 47):

Prepend to Reduce Path Exploration

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O-X-O; O-O-O; A-O-O-O; B-A-O-O-O; D-A-O-O-O; E-D-A-O-O-O; F-B-A-O-O-O. Annotation: AVOID(X,P).]

• Most routing decisions based on: (1) next hop ISP (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison

Page 103 (slide 48):

Prepend to Reduce Path Exploration

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O-O-O; F-B-A-O-O-O. Annotation: AVOID(X,P).]

• Most routing decisions based on: (1) next hop ISP (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison

Page 104 (slide 49):

Prepend to Reduce Path Exploration

[Figure: AS topology with nodes O, A, B, C, D, E, F. Path labels shown: O-X-O; A-O-X-O; B-A-O-X-O; D-A-O-X-O; E-D-A-O-X-O; F-B-A-O-X-O. Annotation: AVOID(X,P).]

• Most routing decisions based on: (1) next hop ISP (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison
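
Why prepending helps can be illustrated with a deliberately simplified decision model. The sketch below assumes, as the slide does, that a route choice depends only on the next-hop ISP and the AS-path length; real BGP decision processes have more steps. The names `decision_key` and `received_by_f` are hypothetical.

```python
def decision_key(as_path):
    """Simplified route comparison key: (next-hop AS, AS-path length).

    Assumption: only these two attributes matter, per the slide."""
    return (as_path[0], len(as_path))


def received_by_f(origin_announcement):
    """Path F receives from B when the route propagates O -> A -> B -> F."""
    return ["B", "A"] + origin_announcement


naive_before = received_by_f(["O"])                 # B-A-O
prepended_before = received_by_f(["O", "O", "O"])   # B-A-O-O-O
after_poison = received_by_f(["O", "X", "O"])       # B-A-O-X-O

# Without the prepended baseline, the poison changes the path length at F.
naive_changed = decision_key(naive_before) != decision_key(after_poison)
# With the baseline, next hop and length stay the same at F.
prepended_changed = decision_key(prepended_before) != decision_key(after_poison)
print(naive_changed, prepended_changed)
```

In this model, an ISP whose working path avoids X sees no change in the attributes it compares when the baseline O-O-O is replaced by O-X-O, so it has no reason to explore other paths.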

Page 105 (slide 50):

Prepending Speeds Convergence

[Figure: CDF plot. x-axis: Peer Convergence Time (minutes), 0 to 8. y-axis: Cumulative Fraction of Convergences (CDF), ticks at 0.650, 0.95, 0.99, 0.999, 0.9999. Series: "Prepend, no change" and "No prepend, no change".]

• With no prepend, only 65% of unaffected ISPs converge instantly
• With prepending, 95% of unaffected ISPs re-converge instantly, 98% in < 1/2 min.
• Also speeds convergence to new paths for affected peers

Page 106 (slide 51):

Conclusion

• We increasingly depend on the Internet, but availability lags
• Much of Internet unavailability due to long-lasting outages
• LIFEGUARD: Let edge networks reroute around failures
• Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
  ‣ Use reverse traceroute, isolate directions, use historical view
• Avoidance challenge: Reroute without participation of transit networks
  ‣ BGP poisoning gives control to the destination
  ‣ Well-crafted announcements ease concerns

