National Institute of Advanced Industrial Science and Technology
An Analysis of ISP Backbone Availability
Katsushi [email protected]
National Institute of Advanced Industrial Science and Technology
• All results in this talk are based only on the IS-IS messages provided by the Internet2 observatory. Therefore, the results for specific links and nodes in this presentation do not directly reflect the quality of the service or of the equipment.
How much availability does ISP infrastructure provide?
• Your ISP offers a 99.9% SLA for intra-ISP traffic:
• Is it really premium?
• Is it worth paying more for?
• This talk presents infrastructure availability only, not taking into account:
• Any convergence delay of the routing protocol
• Packet behavior
Internet infrastructure: viewpoint from routing
• Breaking down network failures into their causes:
• Routing and centralized NMS (Labovitz ’99)
• A lot of BGP activity
• BGP failures affect the worldwide Internet system
• BGP can be seen from other ISPs
• BGP continues to be recorded, as in UO’s RouteViews
ISP infrastructure: viewpoint from IGP
• Fewer IGP measurement activities than for BGP:
• IS-IS on Qwest, Alaettinoglu (’02)
• OSPF on Michi-Net, Watson (’03)
• A collector must be installed inside the ISP network.
• An IGP dataset would disclose ISP backbone quality.
• Or: it is not news that a network is working fine :)
• IGP messages represent infrastructure events:
• Lost adjacency, ext. route: circuit / switch / interface down
• Est. adjacency, ext. route: circuit / switch / interface up
• Lost LSP/LSA: router down
• Reset LSP/LSA seq.: router up
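As a rough illustration of how such events can be read off IS-IS data, the sketch below compares the neighbor lists of two successive LSPs from one router and reports lost and established adjacencies. The function name `diff_adjacencies` and the event labels are hypothetical, and real LSP parsing is not shown.

```python
def diff_adjacencies(router, old_neighbors, new_neighbors):
    """Compare the neighbor lists of two successive LSPs from `router`.

    Returns a sorted list of (event, router, neighbor) tuples, where event is
    "lost_adjacency" (circuit / switch / interface down) or
    "est_adjacency" (circuit / switch / interface up).
    """
    old, new = set(old_neighbors), set(new_neighbors)
    events = [("lost_adjacency", router, n) for n in old - new]
    events += [("est_adjacency", router, n) for n in new - old]
    return sorted(events)


# Hypothetical example: IPLS stops advertising its adjacency to ATLA.
events = diff_adjacencies("IPLS", ["ATLA", "CHIN", "KSCY"], ["CHIN", "KSCY"])
print(events)
```

A router-down or router-up event would need LSP lifetime and sequence-number tracking as well, which this sketch leaves out.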
IS-IS collector in Abilene
• The IS-IS collector is part of the I2 Abilene observatory activity.
http://ndb2-blmt.abilene.ucaid.edu/isis/ Contributed by Shu Zhang [ZK06]
• Deployed at all Abilene nodes to provide multiple observation points.
• Synchronized with a CDMA timer (GPS based)
• The data set from Aug. ’04 to Apr. ’07 is available.
[ZK06] S. Zhang and K. Kobayashi, “Rtanaly: A System to Detect and Measure IGP Routing Changes”
Abilene Network Map
[Figure: map of the Abilene network]
Nodes: Seattle, Denver, Sunnyvale, Los Angeles, Kansas City, Chicago, Indianapolis, Atlanta, Washington, New York City, Houston
11 nodes with T640 routers, and 14 OC192 circuits.
Abilene IS-IS operation
• 9 sec. hello interval; the IS-IS adjacency is lost after missing 3 hellos
• A failure detection delay of 22.5 sec. is assumed.
• Faster failure detection is possible, e.g., a shorter hello interval, BFD, or carrier loss on circuit failure.
• The IGP maintains infrastructure information only.
• Minimize the IGP database
• Do not import any BGP routes into IS-IS.
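One way to arrive at the 22.5 sec. figure is sketched below. It assumes (this is not stated on the slide) that the hold time is three hello intervals and that a failure occurs, on average, halfway between two hellos.

```python
HELLO_INTERVAL = 9.0  # sec., from the slide
MISSED_HELLOS = 3     # adjacency is dropped after 3 missed hellos


def detection_delay(hello_interval, missed_hellos):
    """Return (worst_case, average) failure detection delay in seconds.

    Assumption: the hold timer is restarted by each received hello and
    expires after missed_hellos * hello_interval. A failure right after a
    hello waits the full hold time; on average the failure falls halfway
    through a hello interval.
    """
    hold_time = missed_hellos * hello_interval
    average = hold_time - hello_interval / 2.0
    return hold_time, average


worst, avg = detection_delay(HELLO_INTERVAL, MISSED_HELLOS)
print(worst, avg)  # 27.0 22.5
```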
• Network availability, as used hereafter:
• The whole network works without any failure.
• From the network operator’s viewpoint.
• The availability of a specific source-destination path is not considered.
• Not from the customer’s viewpoint.
• Timeframe:
• May include more than one event at the same time.
[Figure: timelines for ATLA, IPLS, and the whole network, marking a timeframe of failure and a timeframe of double failure]
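The timeframe idea can be sketched as interval merging: outages on individual links that overlap in time are folded into one network-level timeframe, and availability is the fraction of the monitored period outside all timeframes. This is a minimal sketch under that reading; the function names and the sample intervals are hypothetical, not the actual analysis code.

```python
def merge_timeframes(outages):
    """Merge overlapping (start, end) outage intervals, in seconds.

    Returns a list of [start, end, count], where count is the number of
    individual outages folded into that timeframe.
    """
    frames = []
    for start, end in sorted(outages):
        if frames and start <= frames[-1][1]:
            # Overlaps the current timeframe: extend it.
            frames[-1][1] = max(frames[-1][1], end)
            frames[-1][2] += 1
        else:
            frames.append([start, end, 1])
    return frames


def availability(frames, monitor_seconds):
    """Fraction of the monitored period with no failure anywhere."""
    down = sum(end - start for start, end, _ in frames)
    return 1.0 - down / monitor_seconds


# Hypothetical outages: two overlap, one stands alone.
outages = [(100, 160), (150, 400), (1000, 1030)]
frames = merge_timeframes(outages)
print(frames)
print(availability(frames, 365 * 86400))
```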
Abilene IS-IS overview ’05-’06
• Node failure: node LSP timeout, or seq. number reset.
• Only once in ’05 (53 sec. downtime), twice in ’06 (1,298 sec.)
• Circuit failure: an adjacency disappears from the list in the LSP
• Commonly found: 635 timeframes in ’05, 513 in ’06.
• Ext. route failure: a route disappears from the LSP
• Represents edge troubles?
• Difficult to identify whether serious or trivial.
To focus on this failure.
Lost adjacency event
Note that the histograms on this slide are drawn from IS-IS data captured at Atlanta. A few details differ at other IS-IS observation points.
[Figure: two histograms of single-failure disrupt time; x-axis log_10(disrupt time (sec.)) with marks at 60 sec., 1 hour, and 1 day; y-axis frequency]
2005/Jan.-Dec.: monitor duration 365 days, total disrupt count 635, availability 0.95443
2006/Jan.-Dec.: monitor duration 365 days, total disrupt count 513, availability 0.98424
Breakdown in ’05
[Figure: per-link histograms of disrupt time; x-axis log_10(disrupt time (sec.)), y-axis frequency; monitor duration 365 days for each link]
ATLA-IPLS: disrupt count 288, availability 0.99137
CHIN-IPLS: disrupt count 34, availability 0.99981
CHIN-NYCM: disrupt count 64, availability 0.99947
DNVR-KSCY: disrupt count 4, availability 0.99997
DNVR-SNVA: disrupt count 12, availability 0.99302
DNVR-STTL: disrupt count 8, availability 0.99997
Availability Map (05/01-12)
[Figure: Abilene map (Seattle, Denver, Sunnyvale, Los Angeles, Kansas City, Chicago, Indianapolis, Atlanta, Washington, New York City, Houston) with each link labeled as listed below; Hurricane Katrina, Aug. ’05, is marked on the map]
Link labels: Availability / Disrupt count / Longest down time (sec.)
0.9999/11/800
0.9991/122/12,494
0.9730/24/819,803
0.9930/12/183,352
0.9913/288/170,303
0.9998/34/1,364
0.9994/64/5,090
0.9998/10/3,940
0.9997/54/1,194
0.9999/4/398
0.9992/16/7,071
0.9997/12/2,349
0.9999/8/501
0.9993/18/14,192
Yearly summary ’05 - ’06
            2005/Jan.-Dec.          2006/Jan.-Dec.
Link        Avail.   Disrupt cnt.   Avail.   Disrupt cnt.
ATLA-HSTN   0.9738   24             0.9990   39
ATLA-IPLS   0.9914   288            0.9975   48
ATLA-WASH   0.9998   12             0.9994   25
CHIN-IPLS   0.9998   34             0.9998   14
CHIN-NYCM   0.9995   64             0.9999   30
DNVR-KSCY   1.0000   4              0.9999   18
DNVR-SNVA   0.9930   12             0.9922   51
DNVR-STTL   1.0000   8              0.9999   5
HSTN-KSCY   0.9993   18             0.9990   19
HSTN-LOSA   0.9991   121            0.9996   40
IPLS-KSCY   0.9998   10             0.9998   17
LOSA-SNVA   0.9997   54             0.9993   128
NYCM-WASH   0.9993   17             0.9989   113
SNVA-STTL   1.0000   11             1.0000   129
Total(*)    0.9544   677            0.9842   676
Critical events
• Two or more lost adjacencies in the same timeframe
• Some combinations have a serious impact, but not every event leads to a split-graph condition.
• 32 timeframes (47 disruptions) in ’05, 58 (61) in ’06
• 26/47 timeframes in ’05, and 49/61 in ’06, are attributed to a node missing from the LSP database.
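Whether a given combination of lost adjacencies splits the graph can be checked with a plain connectivity search. The sketch below uses the 14 links named in the yearly summary as the topology; the function name `is_split` and the sample failure sets are illustrative only.

```python
from collections import deque

# The 14 Abilene links, as named in the yearly summary table.
LINKS = [
    ("ATLA", "HSTN"), ("ATLA", "IPLS"), ("ATLA", "WASH"), ("CHIN", "IPLS"),
    ("CHIN", "NYCM"), ("DNVR", "KSCY"), ("DNVR", "SNVA"), ("DNVR", "STTL"),
    ("HSTN", "KSCY"), ("HSTN", "LOSA"), ("IPLS", "KSCY"), ("LOSA", "SNVA"),
    ("NYCM", "WASH"), ("SNVA", "STTL"),
]


def is_split(links, failed):
    """Return True if removing the `failed` links disconnects the graph."""
    down = {frozenset(link) for link in failed}
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if frozenset((a, b)) not in down:
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Breadth-first search from an arbitrary node.
    start = next(iter(neighbors))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) < len(neighbors)


# A double failure that leaves the network connected ...
print(is_split(LINKS, [("ATLA", "IPLS"), ("DNVR", "SNVA")]))
# ... and one that isolates Chicago.
print(is_split(LINKS, [("CHIN", "IPLS"), ("CHIN", "NYCM")]))
```

A missing node corresponds to all of its links failing at once, so it can be tested by passing every link that touches that node as `failed`.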
Two or more link failures (2) - Missing node -
[Figure: Abilene map (Seattle, Denver, Sunnyvale, Los Angeles, Kansas City, Chicago, Indianapolis, Atlanta, Washington, New York City, Houston) illustrating a missing node]
Missing IPLS router at:
...
06/02/19 05:31-05:56
06/02/19 06:30-06:35
06/02/19 15:47-15:51
...
Two or more failures in ’05
[Figure: two histograms for 2005/Jan.-Dec.; x-axis log_10(disrupt time (sec.)) with marks at 60 sec., 1 hour, and 1 day; y-axis frequency; monitor duration 365 days]
All lost adjacency events (panel "single-failure"): total disrupt count 637, availability 0.95435
Two or more missing (panel "double-failure"): total disrupt count 47, availability 0.99976
Two or more failures in ’06
[Figure: two histograms for 2006/Jan.-Dec.; x-axis log_10(disrupt time (sec.)) with marks at 60 sec., 1 hour, and 1 day; y-axis frequency; monitor duration 365 days]
All lost adjacency events (panel "single-failure"): total disrupt count 514, availability 0.98419
Two or more missing (panel "double-failure"): total disrupt count 61, availability 0.99959
Is a single link failure trivial? (1)
• Events with two or more lost adjacencies are rare: more than 99.95% availability, i.e., < 5 hours/year downtime.
• More than 500 single lost adjacencies were found:
• 637 in ’05, and 514 in ’06
• 3-4 hours/year of downtime is estimated:
• Assuming only 22 sec. of downtime for each lost adjacency.
• Other delays, e.g., routing convergence, degrade it further.
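The figures on this slide can be reproduced with simple arithmetic, sketched here. The only inputs are the numbers quoted above (99.95% availability, 637 and 514 lost adjacencies, 22 sec. each); the helper names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_from_availability(avail):
    """Yearly downtime in hours implied by an availability ratio."""
    return (1.0 - avail) * HOURS_PER_YEAR


def downtime_hours_from_events(count, sec_per_event):
    """Yearly downtime in hours if each event costs sec_per_event seconds."""
    return count * sec_per_event / 3600.0


print(round(downtime_hours_from_availability(0.9995), 2))  # 4.38, i.e. < 5 hours
print(round(downtime_hours_from_events(637, 22), 2))       # 3.89 (’05)
print(round(downtime_hours_from_events(514, 22), 2))       # 3.14 (’06)
```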
Is a single link failure trivial? (2)
• Is 22 sec. of downtime for each lost adjacency an overestimate?
• A router can detect a circuit failure faster when triggered by lower-layer information, e.g., loss of optical signal, framer error.
• IGP timer tuning or BFD provides faster failure detection, sub-second or less [AC02].
• The sub-second figure derives from the propagation delay limit, which cannot be reduced.
• IP FRR would help more.
Conclusion
• Full-year availability evaluation for ’05-’06 using Abilene IS-IS trace data:
• > 99.95% backbone network availability, from the IGP viewpoint.
• This is better than the real availability, which also depends on:
• routing convergence delay / access link
• The Abilene backbone is over-provisioned in bandwidth.
• It is not news that the network worked fine :-)
• Thanks to Shu Zhang, Randy Bush, and Xing Li