
Post on 31-Aug-2020


LIFEGUARD: Practical Repair of Persistent Route Failures

Ethan Katz-Bassett (USC)

Colin Scott, David Choffnes, Italo Cunha, Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy

This work is generously funded in part by Google, Cisco and the NSF.

Long Outages Cause Most Unavailability

• Monitor outages from Amazon’s EC2
• Fraction of outages of duration ≤ X?
• Fraction of unavailability due to outages of duration ≤ X?

86% of outages last less than 5 minutes

But longer outages account for 90% of the unavailability
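The two questions on this slide can be made concrete with a small calculation. The sketch below is only illustrative: the function name `outage_fractions` and the sample durations are hypothetical and are not the measured EC2 data. It shows how a handful of long outages can dominate total unavailability even when most outages are short.

```python
from typing import List, Tuple


def outage_fractions(durations_min: List[float], x: float) -> Tuple[float, float]:
    """Return (fraction of outages with duration <= x,
               fraction of total unavailability due to those outages)."""
    short = [d for d in durations_min if d <= x]
    count_fraction = len(short) / len(durations_min)
    time_fraction = sum(short) / sum(durations_min)
    return count_fraction, time_fraction


# Synthetic durations in minutes (an assumption, not data from the talk):
# four short outages and one long one.
sample = [1, 2, 3, 4, 120]
count_fraction, time_fraction = outage_fractions(sample, x=5)
print(count_fraction)  # 0.8 of the outages are short
print(round(time_fraction, 3))  # but they cause only 0.077 of the downtime
```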

Operators Struggle to Locate Failures

“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.” Outages mailing list, Dec. 2010

Mailing List User 1
1 Home router
2 Verizon in Baltimore
3 Verizon in Philly
4 Alter.net in DC
5 Level3 in DC
6 * * *
7 * * *

Mailing List User 2
1 Home router
2 Verizon in DC
3 Alter.net in DC
4 Level3 in DC
5 Level3 in Chicago
6 Level3 in Denver
7 * * *
8 * * *

Reasons for Long-Lasting Outages

Long-term outages are:
• Repaired over slow, human timescales
• Not well understood
• Caused by routers advertising paths that do not work
  ‣ E.g., corrupted memory on line card causes black hole
  ‣ E.g., bad cross-layer interactions cause failed MPLS tunnel
• Complicated by lack of visibility into or control over routes in other ISPs

Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
  ‣ Building blocks
  ‣ Example
  ‣ Description of technique
• Suggest that other ISPs reroute around the problem

Building Blocks for Failure Isolation

LIFEGUARD can use:
• Ping to test reachability
• Traceroute to measure forward path
• Distributed vantage points (VPs)
  ‣ PlanetLab for our experiments
  ‣ Some can source spoof
• Reverse traceroute to measure reverse path (NSDI ’10)
• Atlas of historical forward/reverse paths between VPs and targets

How does LIFEGUARD locate a failure?

Before outage:
• Historical atlas enables reasoning about changes
• Traceroute yields only path from GMU to target
• Reverse traceroute reveals path asymmetry

[Figure: paths between Source: GMU and Target: Smartkom, crossing Level3, Telia, TransTelecom, ZSTTK, Rostelecom, and NTT]

How does LIFEGUARD locate a failure?

During outage:

[Figure: the same topology between Source: GMU and Target: Smartkom (Level3, Telia, TransTelecom, ZSTTK, Rostelecom, NTT), annotated with the probes below]

• The measured path ends in "?": Problem with ZSTTK?
• A ping carrying a vantage point (VP) as its source (Ping? Fr:VP) draws a reply that reaches the VP (Ping! To:VP)
  ‣ Forward path works
• A ping to NTT from GMU (NTT: Ping? Fr:GMU) draws a reply to GMU (GMU: Ping! Fr:NTT)
• A ping to Rostelecom from GMU (Rostele: Ping? Fr:GMU) is also sent
  ‣ Rostelecom is not forwarding traffic towards GMU

How LIFEGUARD Locates Failures

LIFEGUARD:
1. Maintains background historical atlas
2. Isolates direction of failure, measures working direction
3. Tests historical paths in failing direction in order to prune candidate failure locations
4. Locates failure as being at the horizon of reachability
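Steps 3 and 4 can be sketched as a scan over a historical path. This is a minimal illustration under stated assumptions, not LIFEGUARD's implementation: `locate_failure` and the `reachable` callback are hypothetical names, the callback stands in for real ping measurements, and the hop order in the example (NTT nearer to GMU than Rostelecom on the reverse path) is assumed for illustration.

```python
from typing import Callable, List, Optional, Tuple


def locate_failure(
    historical_path: List[str],
    reachable: Callable[[str], bool],
) -> Optional[Tuple[Optional[str], str]]:
    """Find the horizon of reachability on a historical path.

    historical_path lists hops in the failing direction, ordered from the
    hop nearest the source outward. reachable(hop) reports whether the hop
    still has a working path to the source.

    Returns (last reachable hop, first unreachable hop), or None if every
    hop is reachable.
    """
    last_ok: Optional[str] = None
    for hop in historical_path:
        if reachable(hop):
            last_ok = hop
        else:
            # The failure is suspected at the boundary between these hops.
            return (last_ok, hop)
    return None


# Outcomes taken from the example slides: NTT answers GMU, Rostelecom does not.
answers_gmu = {"NTT": True, "Rostelecom": False}
horizon = locate_failure(["NTT", "Rostelecom"], lambda hop: answers_gmu[hop])
print(horizon)  # ('NTT', 'Rostelecom')
```

A hop can fail to answer for reasons unrelated to the outage (rate limiting, filtering), so a real system would need to cross-check against more than one probe.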

Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically
• Locate the ISP / link causing the problem
• Suggest that other ISPs reroute around the problem
  ‣ What would we like to add to BGP to enable this?
  ‣ What can we deploy today, using only available protocols and router support?

Our Goal for Failure Avoidance

• Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them

Setting
• Assume we can locate problem
• Assume we are multi-homed / have multiple data centers
• Assume we speak BGP
  ‣ We use BGP-Mux to speak BGP to the real Internet: 5 US universities as providers

Self-Repair of Forward Paths

Straightforward: Choose a path that avoids the problem.

A Mechanism for Failure Avoidance

Forward path: Choose route that avoids ISP or ISP-ISP link

Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
• Want a BGP announcement AVOID(X,P):
  ‣ Any ISP with a route to P that avoids X uses such a route
  ‣ Any ISP not using X need only pass on the announcement
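The intended effect of AVOID(X,P) on a single ISP's route choice can be sketched as follows. AVOID is a proposed primitive, not an existing BGP feature, so everything here is hypothetical: the function name `select_route`, the shortest-path tie-break, and in particular the fallback to a route through X when no alternative exists are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def select_route(candidates: Sequence[List[str]], avoid: str) -> Optional[List[str]]:
    """Choose among candidate AS paths to prefix P under AVOID(avoid, P).

    Prefers the shortest path that does not traverse `avoid`. If no
    candidate avoids it, falls back to the shortest path overall
    (an assumption; the slide does not specify this case).
    """
    if not candidates:
        return None
    clean = [path for path in candidates if avoid not in path]
    pool = clean if clean else list(candidates)
    return min(pool, key=len)


# Candidate paths to WS as seen by one ISP (names from the later example).
routes = [["L3", "ATT", "WS"], ["Sprint", "Qwest", "WS"]]
print(select_route(routes, avoid="L3"))  # ['Sprint', 'Qwest', 'WS']
```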

Ideal Self-Repair of Reverse Paths

[Figure: the announcement AVOID(L3,WS) propagating from ISP to ISP]

Do paths exist that AVOID problem?

LIFEGUARD repairs outages by instructing others to avoid particular routes.

Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.
• Simulated 10 million AVOIDs on actual measured routes.

Practical Self-Repair of Reverse Paths

[Figure: AS paths toward WS held by each ISP, before and after WS poisons L3]

Paths before poisoning (WS announces the path WS):
• ATT → WS
• Qwest → WS
• L3 → ATT → WS
• Sprint → Qwest → WS
• AISP → Qwest → WS
• UW → L3 → ATT → WS

AVOID(L3,WS) is approximated by announcing the poisoned path WS → L3 → WS:
• ATT → WS → L3 → WS
• Qwest → WS → L3 → WS
• Sprint → Qwest → WS → L3 → WS
• AISP → Qwest → WS → L3 → WS
• L3: ? (the poisoned path already contains L3)
• UW → Sprint → Qwest → WS → L3 → WS

BGP loop prevention encourages switch to working path.
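The loop-prevention behavior that poisoning relies on can be sketched in a few lines. This is a toy model rather than router code: the helper names `poison`, `accepts`, and `propagate` are hypothetical, and real BGP routers apply many more policies than the single AS-path check shown here.

```python
from typing import List, Optional


def poison(origin: str, target: str) -> List[str]:
    """Build a poisoned AS path: the origin inserts the AS it wants to disable."""
    return [origin, target, origin]


def accepts(asn: str, as_path: List[str]) -> bool:
    """BGP loop prevention: an AS rejects any path that already contains it."""
    return asn not in as_path


def propagate(asn: str, as_path: List[str]) -> Optional[List[str]]:
    """Return the path `asn` would pass on, or None if it rejects the route."""
    return [asn] + as_path if accepts(asn, as_path) else None


announcement = poison("WS", "L3")
print(announcement)                    # ['WS', 'L3', 'WS']
print(propagate("L3", announcement))   # None
print(propagate("ATT", announcement))  # ['ATT', 'WS', 'L3', 'WS']
```

Because L3 discards the route, ISPs that previously reached WS through L3 stop hearing a route from it and have to pick another neighbor's route instead.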

Stuff I Don’t Have Time to Talk About

Results from real poisonings
• Poisoning in the wild / poisoning anomalies
• Case study of restoring connectivity

Making poisoning flexible
• Monitoring broken path while it is disabled
• Allowing ISPs w/o alternatives to use disabled route

LIFEGUARD’s scalability
• Overhead and speed of failure location
• Router update load if many ISPs deploy our approach

Alternatives to poisoning
• Compatibility with secure routing (BGPSEC, etc.)
• Comparing to other route control mechanisms

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) Under certain circumstances, we can disable a link without disabling the full ISP.
(b) We can speed BGP convergence by carefully crafting announcements.

What if some routes in an ISP still work?

• We only want C3 to change its route, to avoid A-B2
• Forward direction is easy: choose a different route
• Poisoning seems blunt, disabling an entire ISP
• Selective advertising via just D1 is also blunt
• If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1

[Figure: topology with origin O, its providers D1 and D2, ISP A, and networks B1, B2, C1, C2, C3, C4. Legend: network link, transitive link, pre-poisoning path, post-poisoning path. Announcements shown at the origin: O-O-O and O-A-O]
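Selective poisoning amounts to sending different AS paths to different providers. The sketch below assumes, based on the last bullet and the O-O-O / O-A-O labels in the figure, that the unpoisoned path goes to D1 and the poisoned one to D2. The function name `build_announcements` and its parameters are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def build_announcements(
    origin: str,
    providers: Iterable[str],
    poison_via: Set[str],
    target: str,
) -> Dict[str, List[str]]:
    """Return the AS path to announce to each provider.

    Providers in `poison_via` receive a path poisoning `target`; the rest
    receive a prepended path of the same length, so only routes learned
    through the poisoned providers are disabled inside `target`.
    """
    plain = [origin, origin, origin]
    poisoned = [origin, target, origin]
    return {p: (poisoned if p in poison_via else plain) for p in providers}


paths = build_announcements("O", ["D1", "D2"], poison_via={"D2"}, target="A")
print(paths["D1"])  # ['O', 'O', 'O']
print(paths["D2"])  # ['O', 'A', 'O']
```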

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
    ‣ Real-world results from 5 provider BGP-Mux testbed
(b) We can speed BGP convergence by carefully crafting announcements.

Naive Poisoning Causes Transient Loss

• Some ISPs may have working paths that avoid problem ISP X
• Naively, poisoning causes path exploration even for these ISPs
• Path exploration causes transient loss

[Figure: AVOID(X,P) example with origin O and ISPs A, B, C, D, E, F. Initial paths include A-O, B-A-O, D-A-O, E-D-A-O, and F-B-A-O. After O announces the poison O-X-O, updated paths (A-O-X-O, B-A-O-X-O, D-A-O-X-O) spread hop by hop while the older paths E-D-A-O and F-B-A-O are still circulating, until the ISPs settle on E-D-A-O-X-O and F-B-A-O-X-O]

Prepend to Reduce Path Exploration

• Most routing decisions based on:
  (1) next hop ISP
  (2) path length
• Keep these fixed to speed convergence
• Prepending prepares ISPs for later poison

[Figure: AVOID(X,P) example with prepending. O first announces the prepended path O-O-O, giving A-O-O-O, B-A-O-O-O, D-A-O-O-O, E-D-A-O-O-O, and F-B-A-O-O-O. The later poison O-X-O has the same length, and the paths become A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, and F-B-A-O-X-O]
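The reason prepending helps can be checked with the paths from the figure. The sketch below models a routing decision as depending only on the two attributes named on the slide, next hop and path length. That simplification, and the helper names `prepended`, `poisoned`, and `decision_key`, are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def prepended(origin: str) -> List[str]:
    """Baseline announcement, padded to the length of a poisoned path."""
    return [origin, origin, origin]


def poisoned(origin: str, problem_isp: str) -> List[str]:
    """Poisoned announcement of the same length as the baseline."""
    return [origin, problem_isp, origin]


def decision_key(path: List[str]) -> Tuple[Optional[str], int]:
    """(next hop, path length) for a path written from the ISP's own view."""
    next_hop = path[1] if len(path) > 1 else None
    return (next_hop, len(path))


upstream = ["F", "B", "A"]  # F reaches O via B and A, as in the figure

# With prepending, the poison leaves both attributes unchanged for F.
before = decision_key(upstream + prepended("O"))
after = decision_key(upstream + poisoned("O", "X"))
print(before, after)  # ('B', 6) ('B', 6)

# Without prepending, the path length jumps from 4 to 6.
naive_before = decision_key(upstream + ["O"])
print(naive_before, after)  # ('B', 4) ('B', 6)
```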

Prepending Speeds Convergence

[Figure: CDF of peer convergence time. x-axis: Peer Convergence Time (minutes), 0 to 8. y-axis: Cumulative Fraction of Convergences (CDF). Series: "Prepend, no change" and "No prepend, no change"]

• With no prepend, only 65% of unaffected ISPs converge instantly
• With prepending, 95% of unaffected ISPs re-converge instantly, 98% in < 1/2 min.
• Also speeds convergence to new paths for affected peers

Conclusion

• We increasingly depend on the Internet, but availability lags
• Much of Internet unavailability due to long-lasting outages
• LIFEGUARD: Let edge networks reroute around failures
• Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
  ‣ Use reverse traceroute, isolate directions, use historical view
• Avoidance challenge: Reroute without participation of transit networks
  ‣ BGP poisoning gives control to the destination
  ‣ Well-crafted announcements ease concerns