The final goal
• We prefer not to see:
Telecommunication Networks
[Figure: a high-speed backbone linking service providers (PSTN, Internet, video) with metro, business, and mobile access networks. Source: http://www.icn.co]
Traditional network architecture in backbone networks
• IP (Internet Protocol): addressing, routing
• ATM (Asynchronous Transfer Mode): traffic engineering
• SDH/SONET (Synchronous Digital Hierarchy): transport and protection
• WDM (Wavelength Division Multiplexing): high bandwidth
Evolution of network layers
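[Figure: the layer stack over time. 1999: IP / ATM / SONET / Optics (layers 3, 2, 1, 0). 2003: IP / MPLS / Thin SONET / Optics. 201x: Packet (IP/Ethernet, layer 2/3) over Smart Optical (layer 0/1), with IP and GMPLS interworking between the packet and optical layers.]
• Typical recovery times: BGP-4: 15 to 30 minutes; OSPF: 10 seconds to minutes; SONET: 50 milliseconds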
IP - Internet Protocol
• Packet switched
  – Hop-by-hop routing
  – Packets are forwarded based on forwarding tables
• Distributed control
  – Shortest path routing
    • via link-state protocols: OSPF (Open Shortest Path First), IS-IS (Intermediate System to Intermediate System)
• Routing on a logical topology
• Widespread, and its role is straightforward
  – Not very popular from a technical point of view
Optical backbone
• Circuit switched
  – Centralized control
  – Exact knowledge of the physical topology
• Logical links are lightpaths
  – Source and destination node pairs, bandwidth
Optical Backbone Networks
[Figure: IP routers A to E attached to wavelength cross-connects; lightpaths between router pairs form the logical links.]
Motivation Behind Survivable Network Design
FAILURE SOURCES
Failure Sources – HW Failures
• Network element failures
  – Type failures
    • Manufacturing or design failures
    • These show up in the testing phase
  – Wear out
    • Processor, memory, main board, interface cards
    • Components with moving parts:
      – Cooling fans, hard disk, power supply
      – Natural phenomena mostly influence and damage these devices (e.g. high humidity, high temperature, earthquake)
    • Circuit breakers, transistors, etc.
Failure Sources – SW Failures
• Design errors
  • High complexity and compound failures
• Faulty implementations
  • Typos in variable names
    – The compiler detects most of these failures
  • Failed memory read/write operations
Failure Sources – Operator Errors (1)
• Unplanned maintenance
  – Misconfiguration
    • Routing and addressing
      – Misconfigured addresses or prefixes, interface identifiers, link metrics, and timers and queues (DiffServ)
    • Traffic conditioners
      – Policers, classifiers, markers, shapers
    • Wrong security settings
      – Block legacy traffic
  – Other operation faults:
    • Accidental errors (unplug, reset)
    • Access denial (forgotten password)
• Planned maintenance
  • The upgrade takes longer than planned
Failure Sources – Operator Errors (2)
• Topology/dimensioning/implementation design errors
  – Weak processor in routers
  – High BER in long cables
  – Topology is not meshed enough (not enough redundancy in protection path selection)
• Compatibility errors
  – Between different vendors and versions
  – Between service providers or ASs (Autonomous Systems)
    • Different routing settings and admission control between two ASs
Failure Sources – Operator errors (3)
• Operation and maintenance errors
[Figure: share of operation and maintenance errors by activity: updates and patches, misconfiguration, device upgrade, maintenance, data mirroring or recovery, monitoring and testing, teaching users, other.]
Failure Sources – User Errors
• Failures from malicious users
  – Physical devices
    • Robbery, damage to the device
  – Against nodes
    • Viruses
  – DoS (denial-of-service) attack (e.g. as used on the Internet)
    • Routers are overloaded
    • At once from many addresses
    • IP address spoofing
    • Example: Ping of Death. The maximum size of a ping packet is 65535 bytes. In 1996 computers could be frozen by receiving larger packets.
• Unexpected user behavior
  – Short term
    • Extreme events (mass calling)
    • Mobility of users (e.g. after a football match the given cell is congested)
  – Long term
    • New popular sites and killer applications
Failure Sources – Environmental Causes
• Cable cuts
  – Road construction (‘Universal Cable Locator’)
  – Rodent bites
• Fading of radio waves
  – New skyscraper (e.g. CN Tower)
  – Clouds, fog, smog, etc.
  – Birds, planes
• Electromagnetic interference
  – Electromagnetic noise
  – Solar flares
• Power outage
• Humidity and temperature
  – Air-conditioner fault
• Natural disasters
  – Fires, floods, terrorist attacks, lightning, earthquakes, etc.
Operating Routers During Hurricane Sandy
Michnet ISP Backbone (1998)
[Figure: failure causes in the Michnet backbone: maintenance, power outage, fiber cut/circuit/carrier problem, hardware problem, routing problems, interface down, congestion/sluggish, malicious attack, software problem.]
• Which failures were the most probable ones?
Michnet ISP Backbone (1998)
Share of failures by type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                              Type           #     [%]
Maintenance                        Operator       272   16.2
Power Outage                       Environmental  273   16.0
Fiber Cut/Circuit/Carrier Problem  Environmental  261   15.3
Unreachable                        Operator       215   12.6
Hardware Problem                   Hardware       154    9.0
Interface Down                     Hardware       105    6.2
Routing Problems                   Operator       104    6.1
Miscellaneous                      Unknown         86    5.9
Unknown/Undetermined/No problem    Unknown         32    5.6
Congestion/Sluggish                User            65    4.6
Malicious Attack                   Malice          26    1.5
Software Problem                   Software        23    1.3
Case study - 2002
• D. Patterson et al.: “Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002.
Failure Sources - Summary
• Operator errors (misconfiguration)
  – Simple solutions needed
  – Sometimes reach 90% of all failures
• Planned maintenance
  – Run at night
  – Sometimes reaches 20% of all failures
• DoS attacks
  – Expected to get worse in the future
• Software failures
  – Source code of 10 million lines
• Link failures
  – Anything that makes a point-to-point connection fail (not only cable cuts)
Reliability
• Failure
  – is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment $t_f$
• Reliability, R(t)
  – continuous operation of a system or service
  – refers to the probability of the system being adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures
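Reliability (2)
• Reliability, R(t)
  – Defined as 1 - F(t), where F(t) is the cumulative distribution function (cdf)
  – Simple model: exponentially distributed variables
    $R(t) = 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}$
• Properties:
  – non-increasing
  – $R(0) = 1$
  – $\lim_{t \to \infty} R(t) = 0$
[Figure: R(t) plotted against t, falling from 1 at t = 0 toward 0, with the value R(a) marked at t = a.]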
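The exponential reliability model above can be checked numerically. The following is a minimal sketch; the failure rate of one failure per 10,000 hours is an illustrative assumption, not a value from the slides.

```python
import math

def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for exponentially distributed times to failure."""
    return math.exp(-failure_rate * t_hours)

# Hypothetical element: on average one failure per 10,000 hours.
lam = 1.0 / 10_000

for t in (0, 1_000, 10_000, 100_000):
    # R(0) is 1, and R(t) decreases toward 0 as t grows.
    print(t, round(reliability(t, lam), 5))
```

The printed values illustrate the listed properties: R(t) starts at 1 and never increases.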
Network with Reparable Subsystems
[Figure: timeline alternating between UP (device is operational) and DOWN (the network element has failed, repair action is in progress); each failure moves the device from UP to DOWN.]
• Measures to characterize a reparable system are:
  – Availability, A(t)
    • refers to the probability of a reparable system to be found in the operational state at some time t in the future
    • A(t) = P(time = t, system = UP)
  – Unavailability, U(t)
    • refers to the probability of a reparable system to be found in the faulty state at some time t in the future
    • U(t) = P(time = t, system = DOWN)
  – A(t) + U(t) = 1 at any time t
Element Availability Assignment
• The mainly used measures are
  – MTTR - Mean Time To Repair
  – MTTF - Mean Time To Failure
    • MTTR << MTTF
  – MTBF - Mean Time Between Failures
    • MTBF = MTTF + MTTR
    • if the repair is fast, MTBF is approximately the same as MTTF
    • Sometimes given in FITs (Failures in Time), MTBF[h] = 10^9 / FIT
• Another notation
  – MUT - Mean Up Time
    • Like MTTF
  – MDT - Mean Down Time
    • Like MTTR
  – MCT - Mean Cycle Time
    • MCT = MUT + MDT
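As a small worked example of these measures, the sketch below converts a FIT value to MTBF and derives the long-run availability as MTTF / (MTTF + MTTR), which is the same as MUT / MCT. The 5000 FIT and 4 hour repair figures are illustrative assumptions.

```python
def mtbf_hours_from_fit(fit: float) -> float:
    """FIT counts failures per 10^9 device-hours, so MTBF[h] = 10^9 / FIT."""
    return 1e9 / fit

def availability(mttf_h: float, mttr_h: float) -> float:
    """Long-run availability: MTTF / (MTTF + MTTR), i.e. MUT / MCT."""
    return mttf_h / (mttf_h + mttr_h)

# Hypothetical element: 5000 FIT and a 4 hour mean repair time.
mtbf = mtbf_hours_from_fit(5000)   # 200,000 hours
mttr = 4.0
mttf = mtbf - mttr                 # from MTBF = MTTF + MTTR

print(mtbf, availability(mttf, mttr))
```

Because MTTR << MTTF here, using MTBF in place of MTTF would change the result only in the tenth decimal place.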
Availability in Hours

Availability  Nines                        Outage time/year  Outage time/month  Outage time/week
90%           1 nine                       36.52 day         73.04 hour         16.80 hour
95%           -                            18.26 day         36.52 hour         8.40 hour
98%           -                            7.30 day          14.60 hour         3.36 hour
99%           2 nines (maintained)         3.65 day          7.30 hour          1.68 hour
99.5%         -                            1.83 day          3.65 hour          50.40 min
99.8%         -                            17.53 hour        87.66 min          20.16 min
99.9%         3 nines (well maintained)    8.77 hour         43.83 min          10.08 min
99.95%        -                            4.38 hour         21.91 min          5.04 min
99.99%        4 nines                      52.59 min         4.38 min           1.01 min
99.999%       5 nines (failure protected)  5.26 min          25.9 sec           6.05 sec
99.9999%      6 nines (high reliability)   31.56 sec         2.62 sec           0.61 sec
99.99999%     7 nines                      3.16 sec          0.26 sec           0.06 sec
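The outage columns follow directly from the unavailability 1 - A multiplied by the length of the period. A minimal sketch, assuming a year of 365.25 days (8766 hours), which is what the yearly values in the table imply:

```python
HOURS_PER_YEAR = 365.25 * 24   # 8766 h, assumed year length
HOURS_PER_WEEK = 7 * 24        # 168 h

def outage_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Expected outage within a period: (1 - A) * period length."""
    return (1.0 - availability) * period_hours

for a in (0.99, 0.999, 0.9999, 0.99999):
    yearly = outage_hours(a)
    weekly = outage_hours(a, HOURS_PER_WEEK)
    print(f"{a}: {yearly:.2f} h/year, {weekly * 60:.2f} min/week")
```

For example, 99.9% gives about 8.77 hours per year and about 10.08 minutes per week, matching the "3 nines" row.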
Availability Evaluation – Assumptions
• Failure arrival times
  – independent and identically distributed (iid) variables following an exponential distribution
  – sometimes the Weibull distribution is used (hard): $F(t) = 1 - e^{-\lambda t^{\alpha}}$
  – λ > 0 failure rate (time independent!)
• Repair times
  – iid exponential variables
  – sometimes the Weibull distribution is used (hard)
  – µ > 0 repair rate (time independent!)
• If both failure arrival times and repair times are exponentially distributed we have a simple model
  – Continuous Time Markov Chain
Two-State Markov Model Steady State Analysis (1)
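[Figure: two-state chain with states UP (1) and DN (0); UP moves to DN with probability λ and stays UP with 1 - λ; DN moves to UP with probability µ and stays DN with 1 - µ.]
Mean of exponentially distributed variables: $MTTF = 1/\lambda$, $MTTR = 1/\mu$
• Transition probability distribution in matrix form
  – Transition matrix P (stochastic matrix)
• Time-homogeneous Markov chain
  – The transition matrix after k steps: $P^k$
  – The stationary distribution is a row vector π for which $\pi = \pi \cdot P$
  – π exists (and in this case it is unique)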
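Two-state Markov Model Steady State Analysis (2)
[Figure: the two-state chain UP (1) / DN (0) with transition probabilities λ and µ and self-loops 1 - λ and 1 - µ.]
Transition matrix:
$P = \begin{pmatrix} 1-\lambda & \lambda \\ \mu & 1-\mu \end{pmatrix}$
Stationary distribution:
$\Pi = (UP, DOWN) = (A, U)$
$(A \;\; U) = (A \;\; U) \cdot \begin{pmatrix} 1-\lambda & \lambda \\ \mu & 1-\mu \end{pmatrix}$
$A = (1-\lambda) \cdot A + \mu \cdot U$
$\lambda \cdot A = \mu \cdot U = \mu \cdot (1 - A)$ (we have seen that $A + U = 1$)
$A = \frac{\mu}{\lambda + \mu}$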
Two-State Markov-model - Summary
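$A_{ss} = \frac{\mu}{\lambda + \mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}$
$A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$
• Without the assumption of reparable subsystems (µ = 0)
  • availability is the same as reliability: $A(t)|_{\mu = 0} = e^{-\lambda t} = R(t)$
[Figure: A(t) plotted against t, starting at 1 and settling at $A_{ss}$.]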
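The stationary result can be cross-checked by repeatedly applying the transition matrix to a starting distribution. This is a minimal sketch of that iteration for the two-state chain; the per-step probabilities λ = 0.01 and µ = 0.2 are illustrative assumptions.

```python
def stationary_availability(lam: float, mu: float, steps: int = 10_000) -> float:
    """Iterate pi <- pi * P for the two-state chain and return the UP probability."""
    up, down = 1.0, 0.0  # start in the UP state
    for _ in range(steps):
        # Row vector times P = [[1 - lam, lam], [mu, 1 - mu]].
        up, down = up * (1 - lam) + down * mu, up * lam + down * (1 - mu)
    return up

lam, mu = 0.01, 0.2                  # hypothetical per-step probabilities
numeric = stationary_availability(lam, mu)
closed_form = mu / (lam + mu)        # A = mu / (lambda + mu)
print(numeric, closed_form)
```

Both values agree, and the starting state does not matter once enough steps have been taken.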
Estimating the Failure Rate - Military Handbook (1990)
• First for electric devices
• MIL-HDBK-217 (Military Handbook, Reliability Prediction of Electronic Equipment)
  • Microelectronic circuits
  • Semiconductors
  • Passive elements
  – Fit curves to the observations to get λp: $R(t) = e^{-\lambda_p t}$
    • where λp = the failure rate of the element
Estimating the Failure Rate - Telcordia standard
• The operating environment is considered in the estimation
  – Data measured on the spot
  – Data tested in the laboratory
• AT&T Bell Labs
  – Since then called the Telcordia standard (1998)
  – France Telecom (CNET93) and British Telecom (HRD5) improved the method
Equipment Availability - IP Router
[Figure: IP router, simplified model and configuration example: HW common parts (power supply, housing, conditioning), SW library, and 8 available slots holding 1 × 4-port OC3/STM1 POS line card, 2 × 1-port Gigabit Ethernet modules, and 4 × 1-port OC48/STM16 POS line cards; one slot is not used.]
• IP router, interface card: MTBF[h] = 8.5·10^4, MTTR[h] = 4
• IP router, SW: MTBF[h] = 3·10^4
  – MTTR[h] = 0.0004 (SW restart)
  – MTTR[h] = 0.02 (SW reload)
  – MTTR[h] = 0.25 (no automatic restart)
• IP router, route processor: MTBF[h] = 2·10^5, MTTR[h] = 4
Equipment availability – DXC in SDH/SONET
[Figure: SDH DXC/ADM with trunk transponders, tributary transponders, OEO conversion, and control.]
• SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4
• A DXC has more ports than an IP router
• SDH – Synchronous Digital Hierarchy
• SONET – Synchronous Optical NETworking
• DXC – digital cross connect
• ADM – add-drop multiplexer
• OEO – optical electrical optical conversion
Equipment Availability – WDM System
[Figure: WDM system built from an OXC, transponders, the WDM line system, amplifiers, and the cable/fibre.]
• WDM OXC (OEO) or OADM: MTBF = 1·10^5, MTTR = 6
• Transponder: MTBF = 400·10^3, MTTR = 6
• Amplifier: MTBF = 250·10^3, MTTR = 6
• WDM line system: MTBF = 160·10^3, MTTR = 6
• Aerial cable: MTBF[km] = 1.75·10^5, MTTR = 6
• Buried cable: MTBF[km] = 2.6·10^5, MTTR = 12
• Submarine cables: MTBF[km] = 4.64·10^6, MTTR = 540
• WDM – wavelength division multiplexing
• OXC – optical cross connect
• OADM – optical add-drop multiplexer
Single WDM lightpath
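[Figure: a lightpath from s to d passing through OXC, transponder, WDM line system, ground cable with amplifier, WDM line system, transponder, and OXC.]
• WDM OXC: MTBF = 1·10^5, MTTR = 6
• Transponder: MTBF = 4·10^5, MTTR = 6
• Amplifier: MTBF = 2.5·10^5, MTTR = 6
• WDM line system: MTBF = 1.6·10^5, MTTR = 6
• Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12
Series rule: $A = \prod_{i=1}^{m} A_i$
$A_{s-d} = A_{OXC} \cdot A_{tr} \cdot A_{MUX} \cdot A_{cable} \cdot A_{amp} \cdot A_{MUX} \cdot A_{tr} \cdot A_{OXC}$
= 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
= 0.99994 · 0.99074 · 0.99994 = 0.99062
Roughly 3.4 day/year outage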
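The series calculation can be reproduced from the MTBF and MTTR values. The sketch below assumes each element availability is computed as A = 1 - MTTR/MTBF (that is, MTTF = MTBF - MTTR), because this reproduces the per-element figures quoted on the slide, and that the per-km cable MTBF is divided by the cable length.

```python
from math import prod

def avail(mtbf_h: float, mttr_h: float) -> float:
    # Assumed form: A = 1 - MTTR / MTBF, which matches the slide's element values.
    return 1.0 - mttr_h / mtbf_h

cable_km = 200
a_oxc = avail(1e5, 6)
a_tr = avail(4e5, 6)
a_mux = avail(1.6e5, 6)                  # WDM line system
a_amp = avail(2.5e5, 6)
a_cable = avail(2.63e5 / cable_km, 12)   # per-km MTBF scaled by the length

# Series rule: every element on the lightpath must be up.
a_sd = prod([a_oxc, a_tr, a_mux, a_cable, a_amp, a_mux, a_tr, a_oxc])
print(round(a_cable, 5), round(a_sd, 5))
```

The cable term dominates: the 200 km of fibre contributes almost all of the unavailability of the lightpath.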
1+1 Protection (disjoint pair of paths)
• 200 km lightpath: 0.99074
Parallel rule: $A = 1 - \prod_{i=1}^{m} (1 - A_i)$
$A_{s-d} = A_{OXC} \cdot [1 - (1 - A_{path1}) \cdot (1 - A_{path2})] \cdot A_{OXC}$
= 0.99994 · [1 - (1 - 0.99074) · (1 - 0.99074)] · 0.99994 = 0.99979
Roughly 1.8 hour/year outage
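The parallel rule combines with the series rule in a few lines. This minimal sketch reuses the availabilities quoted above (0.99994 for an OXC, 0.99074 for one 200 km path); the function names are only illustrative.

```python
from math import prod

def series(avails):
    """Series rule: all elements must be up."""
    return prod(avails)

def parallel(avails):
    """Parallel rule: the group is down only if every branch is down."""
    return 1.0 - prod(1.0 - a for a in avails)

a_oxc, a_path = 0.99994, 0.99074
a_sd = series([a_oxc, parallel([a_path, a_path]), a_oxc])
print(round(a_sd, 5))
```

With two disjoint paths the end OXCs, which are not duplicated, become the main source of unavailability.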
Design goals in Survivable Networks
• High connection availability
• Short recovery time
• Scalability
• Maintainability
• Efficient usage of network resources
• We search for the best trade-off
  – Efficiency vs. complexity (from simple to complex)
Dedicated Protection
• For a single connection, 1 working + 1 protection path is allocated
• The two paths are disjoint
[Figure: example network with capacity labels on the links; two protection paths cross a common link.]
• The reserved capacity along the common link is A + B
• PRO: instantaneous recovery (no action is needed)
Shared Protection
• If two working paths are (SRLG) disjoint, the capacity along their protection routes can be shared
  – At most one of them is activated after a single failure
[Figure: example network with capacity labels on the links; two protection paths cross a common link.]
• The spare capacity along the common link is max{A, B}
• CONS: we need actions (signaling) after failure
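The difference between the two reservation rules on a common link can be stated in a few lines. This is a minimal sketch that assumes all working paths are pairwise SRLG-disjoint, so that at most one protection path is active at a time; the bandwidth values are illustrative.

```python
def dedicated_spare(protection_bandwidths):
    """Dedicated protection: every protection path keeps its own capacity (A + B)."""
    return sum(protection_bandwidths)

def shared_spare(protection_bandwidths):
    """Shared protection over SRLG-disjoint working paths: max{A, B}."""
    return max(protection_bandwidths, default=0)

demands = [10, 5]   # hypothetical bandwidths A and B on the common link
print(dedicated_spare(demands), shared_spare(demands))
```

The saving grows with the number of protection paths that cross the same link, at the price of signaling after a failure.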
Recovery Time – The Tasks After Failure
Failure management
• 1st phase: Failure detection (depends only on the network architecture)
• 2nd phase: Failure localization (isolation) (tl)
• 3rd phase: Failure notification (tn)
• 4th phase: Failure correlation (tc)
• 5th phase: Fault restoration
  – Path selection (tp)
  – Device configuration (td)
Recovery Cycle
[Figure: timeline of the recovery cycle. The service is operational until the failure; then follow the fault detection time (failure detected by the nearest node), the hold-off time, the fault notification time (sending fault notification), the recovery operation (switching) time (the protection path is deployed), and the traffic recovery time (data flow arrives at the destination node), after which the service is operational again. The whole span is the recovery time.]
• In the shared protection example: tl = 10 ms, tn = 20-30 ms, tc = 20-30 ms, tp = 0-30 ms, td = 50 ms, tR = 100-150 ms
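The total recovery time in the example is the sum of the phase times. A minimal sketch that adds up the lower and upper bounds of the quoted ranges:

```python
# (min, max) duration of each phase in milliseconds, taken from the example.
phases_ms = {
    "tl": (10, 10),   # failure localization
    "tn": (20, 30),   # failure notification
    "tc": (20, 30),   # failure correlation
    "tp": (0, 30),    # path selection
    "td": (50, 50),   # device configuration
}

t_min = sum(lo for lo, _ in phases_ms.values())
t_max = sum(hi for _, hi in phases_ms.values())
print(f"tR = {t_min}-{t_max} ms")
```

The fixed device configuration time accounts for a third to a half of the total.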
Network resource usage vs. recovery time
[Figure: three panels comparing dedicated protection, shared protection (pre-planned), and dynamic restoration on two scales: resource usage R (0 % to 100 %) and recovery time T (0 ms to 150 ms).]
• Protection: the restoration process (e.g. protection paths) is planned at connection setup
• Dynamic restoration: the restoration process is computed on the fly after failure
Link, Segment or Path Protection
[Figure: a 9-node example network (nodes 1 to 9) showing the working path, the fault, and the protection route under each scheme.]
• Link protection: local, loop back
• Segment protection: a good compromise
• Path protection: global, efficient
Protection and restoration
• Pre-planned (protection): 100% guarantee, fast
• After the failure event occurs (restoration): no guarantee, slower
[Figure: classification tree. Protection and restoration each split into link, path, and segment schemes; each scheme is dedicated or shared; the lowest level is failure dependent or failure independent (the failed element is unknown).]
• Different protection approaches are read from bottom to top (e.g. Dedicated Path Protection or Failure Dependent Shared Link Protection)
Dedicated 1+1 Path Protection
• Two signals are sent in parallel along the working path and along the protection path
• If the working path is interrupted by a fault
  – The destination node switches to the protection path
• Simple, high network resource usage (100% redundancy)
[Figure: source S sends on both paths; destination D performs the switching.]
Dedicated 1:1 protection
• We reserve two disjoint paths for the connection
• If the working path is interrupted by a fault
  – The source and destination nodes switch to the protection path
• In the failure-free state the protection route can be used for best-effort traffic
  – It is called "preemption"
[Figure: both source S and destination D perform switching.]
Dedicated 1:n Path Protection
• There are n disjoint working paths between the same source and destination nodes
  – PRO: better capacity efficiency
  – CON: slightly smaller availability
• What is the availability of 1:1 protection?
  – Aw, Ap
  – A = 1 - (1 - Aw)(1 - Ap) = Aw + Ap - Aw·Ap
• What about 1:2?
  – Aw1, Aw2, Ap
  – A = Aw1·Aw2 + (1 - Aw1)·Aw2·Ap + Aw1·(1 - Aw2)·Ap
[Figure: working and protection paths between source S and destination D.]
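The two availability formulas can be compared numerically. This is a minimal sketch in which the 1:2 value is read as the probability that both connections are served; the path availability of 0.99 is an illustrative assumption.

```python
def avail_1_to_1(aw: float, ap: float) -> float:
    """1:1 protection: down only if working and protection paths are both down."""
    return 1 - (1 - aw) * (1 - ap)

def avail_1_to_2(aw1: float, aw2: float, ap: float) -> float:
    """1:2 protection: both working paths up, or exactly one down and covered."""
    return aw1 * aw2 + (1 - aw1) * aw2 * ap + aw1 * (1 - aw2) * ap

a = 0.99   # hypothetical availability of every path
print(avail_1_to_1(a, a), avail_1_to_2(a, a, a))
```

With these inputs the 1:2 scheme is slightly less available than 1:1, which is the trade-off paid for the better capacity efficiency.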
Diversity Coding (DC)
• Split the traffic into n sub data flows
• Use coding techniques along the protection route
  – For a single failure
  – For n = 2 it is the bitwise XOR of the two working paths
• There are not many (short) disjoint paths in the network
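For n = 2 the coding step is small enough to show directly. The following minimal sketch uses made-up byte strings as the two sub-flows and recovers either one from the other plus the XOR sent on the protection route.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical sub-flows sent on the two working paths.
flow1 = b"\x12\x34\x56"
flow2 = b"\xab\xcd\xef"

# The protection route carries the XOR of the two flows.
protection = xor_bytes(flow1, flow2)

# If working path 1 fails, the destination rebuilds flow1 from what still arrives.
recovered = xor_bytes(flow2, protection)
print(recovered == flow1)
```

No switching or signaling is needed at the source: the destination decodes from whichever two of the three paths survive.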
Self healing rings – 1+1 dedicated path protection
• Used in ring access networks
[Figure: ring with nodes A and B. The A → B signal is sent on both Path 1 and Path 2; after a failure on one path, B switches to the other.]
Self healing rings – 1:1 dedicated link protection
• Used inside a building/office
[Figure: working ring and protection ring between nodes A and B; after a failure, two switches move the A → B traffic onto the protection ring.]
P-cycles
[Figure: p-cycle drawn on a European backbone topology with nodes Amsterdam, London, Brussels, Paris, Zurich, Milan, Berlin, Vienna, Prague, Munich, Rome, Hamburg, Lyon, Frankfurt, Strasbourg, and Zagreb.]
• Shared protection
• Protection cycles are defined in advance in the spare capacity of the network
• On-cycle and straddling links
• Only two switching actions
P-cycles
• Similar to self-healing rings
• The working path is routed along the shortest path
• A failure occurs along
  – an on-cycle link
    • Route the connection in the other direction
  – a straddling link
    • Decompose the working data into two parts
P-cycles
• Unit bandwidth along the p-cycle
  – Protects one unit of working bandwidth if the working path is routed along the cycle
  – Protects two units of working bandwidth if the working path traverses a straddling link
• Pros:
  – No spare capacity reservation along straddling links
  – There can be a lot of straddling links
  – Efficient bandwidth usage
  – Only two switching actions needed at recovery
    • At two nodes along the cycle
Shared protection
• The working path is reserved
• Protection paths are only calculated
  – They are built up in the optical control plane, but the switches are not configured
• Soft-switching
• Shared protection
• Backup multiplexing
[Figure: example network with capacity labels on the links.]
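Capacity on the edges
[Figure: the capacity of link j shown twice. Left: working capacity, spare capacity, and free capacity. Right, for a working path W: working capacity, non-shareable spare capacity $s_j^W$, shareable spare capacity, and free capacity.]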
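Example
[Figure: example link capacities split into working, spare, and free parts, with the values 10, 10, 5, 5, 10, 10 shown in the drawing.]
• Single link SRLGs are considered!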
Calculation of the shareable spare capacity
• Depends on the working paths
  – Which SRLGs they are involved in
[Figure: the SRLGs of the example network.]
Spare provision matrix
• $s_{l,j}$ = non-shareable spare capacity along link j, if the working path is in SRLG l
• Column l of S: SRLG (the working edge involved)
• Row j of S: protection edges (all edges in the network)
S =
  3 1 ... 2 ...
  2 2 ... 3 ...
  1 2 ... 5 ...
  2 1 ... 2 ...
  2 2 ... 4 ...
Spare Provision Matrix
[Figure: example network with link l (column l of S) and link j (row j of S); the capacity values 20, 10, 10, 10, 10 are shown in the drawing.]
• To obtain the matrix we need to keep track of the network state after each failure
• In the single-failure scenario, only one SRLG can have failed at any moment.
Spare Provision Matrix
[Figure: the same example continued for link l (column l of S) and link j (row j of S); the values 20, 10, 10 and 10, 5, 5, 0, 0, 0 are shown in the drawing.]
Spare provision matrix
• How much is the spare capacity on link j?
  $v_j = \max_{l \in SRLG} s_{l,j}$
• How much is the non-shareable spare capacity along link j if the working path is known?
  – finding the maximum demand of spare capacity among all the SRLGs traversed by W
  $s_j^W = \max_{l \in W} s_{l,j}$
[Figure: matrix S with the SRLGs as columns l and the edges as rows j.]
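Both maxima are simple row operations on S. This is a minimal sketch with a made-up 3 × 3 matrix, stored so that `S[j][l]` holds $s_{l,j}$ (rows are links j, columns are SRLGs l); the function names are only illustrative.

```python
# Hypothetical spare provision matrix: S[j][l] = s_{l,j}.
S = [
    [0, 10, 5],
    [20, 0, 10],
    [10, 5, 0],
]

def spare_capacity(S, j):
    """v_j: the largest entry in row j, taken over all SRLGs l."""
    return max(S[j])

def non_shareable(S, j, working_srlgs):
    """s_j^W: the largest entry in row j over the SRLGs traversed by W."""
    return max((S[j][l] for l in working_srlgs), default=0)

j = 1
W = [1, 2]   # hypothetical SRLGs traversed by the working path
print(spare_capacity(S, j), non_shareable(S, j, W))
```

For this row the spare capacity is 20 while only 10 of it is non-shareable for W, so the remaining 10 could back up the new connection without extra reservation.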
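Shared protection
When a new demand arrives:
• The whole capacity of the working path needs to be reserved
• In the case of the protection path, for each link j and demand bandwidth b:
  – ADMIT without new reservation if $v_j - s_j^W \geq b$
  – ADMIT by taking the missing part from the free capacity if $v_j - s_j^W < b$ (including $v_j - s_j^W = 0$) and $f_j + v_j - s_j^W \geq b$
  – BLOCK if $f_j + v_j - s_j^W < b$
[Figure: link j divided into working capacity, spare capacity $v_j$ (non-shareable part $s_j^W$ and shareable part $h_j^W$ for working path W), and free capacity $f_j$.]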
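The per-link admission test can be sketched as a single function. It is a minimal illustration of the inequalities above, with b read as the bandwidth of the new demand; the function name and the convention of returning `None` for a blocked link are assumptions made here.

```python
from typing import Optional

def extra_reservation(f_j: int, v_j: int, s_w_j: int, b: int) -> Optional[int]:
    """Free capacity to turn into spare on link j, or None if the link blocks."""
    shareable = v_j - s_w_j        # spare capacity the new protection path may reuse
    if shareable >= b:
        return 0                   # ADMIT: fully covered by shareable spare
    missing = b - shareable
    if f_j >= missing:             # same as f_j + v_j - s_w_j >= b
        return missing             # ADMIT: top up from the free capacity
    return None                    # BLOCK: f_j + v_j - s_w_j < b

print(extra_reservation(f_j=5, v_j=20, s_w_j=10, b=10))   # enough shareable spare
print(extra_reservation(f_j=5, v_j=20, s_w_j=15, b=10))   # needs free capacity
print(extra_reservation(f_j=2, v_j=20, s_w_j=15, b=10))   # not enough capacity
```

A protection path is admitted only if every link along it passes this test.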