
Survivable Network Design

Dr. János Tapolcai

The final goal

•  We prefer not to see:

Telecommunication Networks

[Figure: service providers (PSTN, Internet, video) and the mobile access, metro and business networks, connected through a high-speed backbone. Source: http://www.icn.co]

Traditional network architecture in backbone networks

•  IP (Internet Protocol): addressing, routing
•  ATM (Asynchronous Transfer Mode): traffic engineering
•  SDH/SONET (Synchronous Digital Hierarchy): transport and protection
•  WDM (Wavelength Division Multiplexing): high bandwidth

Evolution of network layers

[Figure: layer stacks over time.
1999: IP (layer 3), ATM (layer 2), SONET (layer 1), optics (layer 0).
2003: IP, MPLS, thin SONET, optics.
201x: packet layer (IP/Ethernet, layers 2/3) interworking with a smart optical layer (layers 0/1); IP with GMPLS.]

Recovery times:
•  BGP-4: 15–30 minutes
•  OSPF: 10 seconds to minutes
•  SONET: 50 milliseconds

IP - Internet Protocol

•  Packet switched
   –  Hop-by-hop routing
   –  Packets are forwarded based on forwarding tables
•  Distributed control
   –  Shortest path routing
      •  via link-state protocols: OSPF (Open Shortest Path First), IS-IS (Intermediate System to Intermediate System)
•  Routing on a logical topology
•  Widespread, its role is straightforward
   –  From a technical point of view not very popular

Optical backbone

•  Circuit switched
   –  Centralized control
   –  Exact knowledge of the physical topology
•  Logical links are lightpaths
   –  Source and destination node pairs, bandwidth

Optical Backbone Networks

[Figure: IP routers attached to wavelength cross-connects at nodes A–E, interconnected by lightpaths]

Motivation Behind Survivable Network Design

FAILURE SOURCES


Failure Sources – HW Failures

•  Network element failures
   –  Type failures
      •  Manufacturing or design failures
      •  They usually show up in the testing phase
   –  Wear-out
      •  Processor, memory, main board, interface cards
      •  Components with moving parts:
         –  Cooling fans, hard disk, power supply
         –  Natural phenomena mostly affect and damage these devices (e.g. high humidity, high temperature, earthquake)
      •  Circuit breakers, transistors, etc.

Failure Sources – SW Failures

•  Design errors
   •  High complexity and compound failures
•  Faulty implementations
   •  Typos in variable names
      –  The compiler detects most of these failures
•  Failed memory read/write operations

Failure Sources – Operator Errors (1)

•  Unplanned maintenance
   –  Misconfiguration
      •  Routing and addressing
         –  misconfigured addresses or prefixes, interface identifiers, link metrics, and timers and queues (Diffserv)
      •  Traffic conditioners
         –  Policers, classifiers, markers, shapers
      •  Wrong security settings
         –  Block legacy traffic
   –  Other operation faults:
      •  Accidental errors (unplug, reset)
      •  Access denial (forgotten password)
•  Planned maintenance
   •  Upgrade takes longer than planned

Failure Sources – Operator Errors (2)

•  Topology/dimensioning/implementation design errors
   –  Weak processor in routers
   –  High BER in long cables
   –  Topology is not meshed enough (not enough redundancy in protection path selection)
•  Compatibility errors
   –  Between different vendors and versions
   –  Between service providers or ASs (Autonomous Systems)
      •  Different routing settings and admission control between two ASs

Failure Sources – Operator Errors (3)

•  Operation and maintenance errors
   –  Updates and patches
   –  Misconfiguration
   –  Device upgrade
   –  Maintenance
   –  Data mirroring or recovery
   –  Monitoring and testing
   –  Teaching users
   –  Other

Failure Sources – User Errors

•  Failures from malicious users
   –  Physical devices
      •  Robbery, damage to the device
   –  Against nodes
      •  Viruses
   –  DoS (denial-of-service) attacks (e.g. on the Internet)
      •  Routers are overloaded
      •  Launched at once from many addresses
      •  IP address spoofing
      •  Example: Ping of Death. The maximal size of a ping packet is 65,535 bytes. In 1996 computers could be frozen by receiving larger packets.
•  Unexpected user behavior
   –  Short term
      •  Extreme events (mass calling)
      •  Mobility of users (e.g. after a football match the given cell is congested)
   –  Long term
      •  New popular sites and killer applications

Failure Sources – Environmental Causes

•  Cable cuts
   –  Road construction ('Universal Cable Locator')
   –  Rodent bites
•  Fading of radio waves
   –  New skyscraper (e.g. CN Tower)
   –  Clouds, fog, smog, etc.
   –  Birds, planes
•  Electro-magnetic interference
   –  Electro-magnetic noise
   –  Solar flares
•  Power outage
•  Humidity and temperature
   –  Air-conditioner fault
•  Natural disasters
   –  Fires, floods, terrorist attacks, lightning, earthquakes, etc.

Operating Routers During Hurricane Sandy

Michnet ISP Backbone (1998)

•  Which failures were the most probable ones?
   –  Maintenance
   –  Power outage
   –  Fiber cut/circuit/carrier problem
   –  Hardware problem
   –  Routing problems
   –  Interface down
   –  Congestion/sluggish
   –  Malicious attack
   –  Software problem

Michnet ISP Backbone (1998)

Share by type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                               Type            #     [%]
Maintenance                         Operator        272   16.2
Power Outage                        Environmental   273   16.0
Fiber Cut/Circuit/Carrier Problem   Environmental   261   15.3
Unreachable                         Operator        215   12.6
Hardware Problem                    Hardware        154   9.0
Interface Down                      Hardware        105   6.2
Routing Problems                    Operator        104   6.1
Miscellaneous                       Unknown         86    5.9
Unknown/Undetermined/No problem     Unknown         32    5.6
Congestion/Sluggish                 User            65    4.6
Malicious Attack                    Malice          26    1.5
Software Problem                    Software        23    1.3

Case study - 2002

•  D. Patterson et al.: "Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies", UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002.

Failure Sources - Summary

•  Operator errors (misconfiguration)
   –  Simple solutions needed
   –  Sometimes reach 90% of all failures
•  Planned maintenance
   –  Running at night
   –  Sometimes reaches 20% of all failures
•  DoS attack
   –  It will be worse in the future
•  Software failures
   –  Source code of 10 million lines
•  Link failures
   –  Anything that makes a point-to-point connection fail (not only cable cuts)

Reliability

•  Failure
   –  is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment t_f
•  Reliability, R(t)
   –  continuous operation of a system or service
   –  refers to the probability of the system being adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures

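Reliability (2)

•  Reliability, R(t)
   –  Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf)
   –  Simple model: exponentially distributed variables

$$R(t) = 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}$$

•  Properties:
   –  non-increasing
   –  $R(0) = 1$
   –  $\lim_{t \to \infty} R(t) = 0$

[Figure: R(t) decreasing from 1 at t = 0 towards 0, with the value R(a) marked at t = a]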

Network with Reparable Subsystems

[Figure: timeline alternating between UP (the device is operational) and DOWN (the network element has failed, repair action is in progress); each UP period ends with a failure]

•  Measures to characterize a reparable system are:
   –  Availability, A(t)
      •  refers to the probability of a reparable system to be found in the operational state at some time t in the future
      •  A(t) = P(time = t, system = UP)
   –  Unavailability, U(t)
      •  refers to the probability of a reparable system to be found in the faulty state at some time t in the future
      •  U(t) = P(time = t, system = DOWN)
•  A(t) + U(t) = 1 at time t

Element Availability Assignment

•  The most commonly used measures are
   –  MTTR - Mean Time To Repair
   –  MTTF - Mean Time To Failure
      •  MTTR << MTTF
   –  MTBF - Mean Time Between Failures
      •  MTBF = MTTF + MTTR
      •  if the repair is fast, MTBF is approximately the same as MTTF
      •  Sometimes given in FITs (Failures in Time): MTBF[h] = 10^9 / FIT
•  Another notation
   –  MUT - Mean Up Time
      •  Like MTTF
   –  MDT - Mean Down Time
      •  Like MTTR
   –  MCT - Mean Cycle Time
      •  MCT = MUT + MDT
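The measures above can be turned into numbers with a few lines of Python. This is only a sketch: the function names and the 2500 FIT / 4 hour example element are assumptions made for illustration, and the availability is taken as the steady-state ratio MTTF / (MTTF + MTTR).

```python
def mtbf_hours_from_fit(fit: float) -> float:
    """MTBF in hours for a failure rate given in FITs (MTBF[h] = 10^9 / FIT)."""
    return 1e9 / fit


def steady_state_availability(mttf_h: float, mttr_h: float) -> float:
    """Long-run fraction of time the element is up: MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


# Hypothetical element: 2500 FIT, repaired in 4 hours on average.
mtbf = mtbf_hours_from_fit(2500)      # 400000 h
mttr = 4.0
mttf = mtbf - mttr                    # because MTBF = MTTF + MTTR
print(mtbf, steady_state_availability(mttf, mttr))
```

Because MTTR << MTTF here, using MTBF in place of MTTF would change the result only in the tenth decimal place.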

Availability in Hours

Availability   Nines                          Outage/year   Outage/month   Outage/week
90%            1 nine                         36.52 days    73.04 hours    16.80 hours
95%            -                              18.26 days    36.52 hours    8.40 hours
98%            -                              7.30 days     14.60 hours    3.36 hours
99%            2 nines (maintained)           3.65 days     7.30 hours     1.68 hours
99.5%          -                              1.83 days     3.65 hours     50.40 min
99.8%          -                              17.53 hours   87.66 min      20.16 min
99.9%          3 nines (well maintained)      8.77 hours    43.83 min      10.08 min
99.95%         -                              4.38 hours    21.91 min      5.04 min
99.99%         4 nines                        52.59 min     4.38 min       1.01 min
99.999%        5 nines (failure protected)    5.26 min      25.9 sec       6.05 sec
99.9999%       6 nines (high reliability)     31.56 sec     2.62 sec       0.61 sec
99.99999%      7 nines                        3.16 sec      0.26 sec       0.06 sec
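The yearly column of the table follows directly from the unavailability 1 − A. A minimal sketch, assuming a year of 365.25 days (the value that reproduces the table); the function name is chosen here for illustration:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h


def outage_hours_per_year(availability: float) -> float:
    """Expected yearly outage in hours for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


for a in (0.99, 0.999, 0.99999):
    print(f"{a}: {outage_hours_per_year(a):.3f} h/year")
```

For 99.9% this gives about 8.77 hours per year, and for 99% about 3.65 days, matching the corresponding rows.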

Availability Evaluation – Assumptions

•  Failure arrival times
   –  independent and identically distributed (iid) variables following an exponential distribution
   –  sometimes a Weibull distribution is used (hard): $F(t) = 1 - e^{-\lambda t^{\alpha}}$
   –  λ > 0 failure rate (time independent!)
•  Repair times
   –  iid exponential variables
   –  sometimes a Weibull distribution is used (hard)
   –  µ > 0 repair rate (time independent!)
•  If both failure arrival times and repair times are exponentially distributed we have a simple model
   –  Continuous Time Markov Chain

Two-State Markov Model Steady State Analysis (1)

[State diagram: UP (1) goes to DN (0) with probability λ and stays with 1 − λ; DN (0) goes to UP (1) with probability µ and stays with 1 − µ]

Mean of exp. dist. variables: $MTTF = 1/\lambda$, $MTTR = 1/\mu$

•  Transition probability distribution in matrix form
   –  Transition matrix P (stochastic matrix)
•  Time-homogeneous Markov chain
   –  The transition matrix after k steps: $P^k$
   –  The stationary distribution is a row vector π for which $\pi = \pi \cdot P$
   –  π exists (and in this case it is unique)

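Two-state Markov Model Steady State Analysis (2)

[State diagram: UP (1) and DN (0) with transition probabilities λ and µ, self-loops 1 − λ and 1 − µ]

Transition matrix:

$$P = \begin{pmatrix} 1-\lambda & \lambda \\ \mu & 1-\mu \end{pmatrix}$$

Stationary distribution: $\Pi = (UP, DOWN) = (A, U)$

$$(A \;\; U) = (A \;\; U) \cdot \begin{pmatrix} 1-\lambda & \lambda \\ \mu & 1-\mu \end{pmatrix}$$

$$A = (1-\lambda) \cdot A + \mu \cdot U \quad\Rightarrow\quad \lambda \cdot A = \mu \cdot U$$

We have seen that $U = 1 - A$, hence

$$A = \frac{\mu}{\lambda + \mu}$$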
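The stationary distribution can also be checked numerically by repeatedly applying π ← π · P. The sketch below uses plain Python; the values λ = 0.01 and µ = 0.2 are hypothetical per-step probabilities chosen only for illustration.

```python
def stationary_two_state(lam: float, mu: float, steps: int = 10_000):
    """Iterate pi <- pi * P for P = [[1 - lam, lam], [mu, 1 - mu]]."""
    up, down = 1.0, 0.0  # start in the UP state
    for _ in range(steps):
        up, down = up * (1 - lam) + down * mu, up * lam + down * (1 - mu)
    return up, down


lam, mu = 0.01, 0.2  # hypothetical failure and repair probabilities per step
a, u = stationary_two_state(lam, mu)
print(a, mu / (lam + mu))  # the iterate and the closed form should agree
```

The iteration converges because the second eigenvalue of P is 1 − λ − µ, whose absolute value is below 1 for these values.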

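Two-State Markov-model - Summary

$$A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t}$$

$$A_{ss} = \frac{\mu}{\lambda+\mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}$$

[Figure: A(t) decreasing from 1 at t = 0 towards the steady-state value A_ss]

•  Without the assumption of reparable subsystems (µ = 0)
   •  availability is the same as reliability

$$A(t)\Big|_{\mu=0} = e^{-\lambda t} = R(t)$$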
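A short numerical sketch of the time-dependent availability of the two-state model, A(t) = µ/(λ+µ) + λ/(λ+µ)·e^{−(λ+µ)t}. The rates are hypothetical (MTTF = 2·10^5 h, MTTR = 4 h), and the function name is an assumption of this example.

```python
import math


def availability(t: float, lam: float, mu: float) -> float:
    """A(t) of the two-state model, starting in the UP state at t = 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


lam = 1 / 2e5  # hypothetical failure rate, MTTF = 2e5 h
mu = 1 / 4     # hypothetical repair rate, MTTR = 4 h

print(availability(0.0, lam, mu))      # 1.0: the element starts operational
print(availability(1000.0, lam, mu))   # practically the steady-state value
print(availability(1000.0, lam, 0.0))  # no repair: equals exp(-lam * t) = R(t)
```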

Estimating the Failure Rate - Military Handbook (1990)

•  First for electric devices
•  MIL-HDBK-217 (Military Handbook, Reliability Prediction of Electronic Equipment)
   •  Microelectronic circuits
   •  Semiconductors
   •  Passive elements
      –  Fit curves to the observations to get λ_p

$$R(t) = e^{-\lambda_p t}$$

•  where λ_p = the failure rate of the element

Estimating the Failure Rate - Telcordia standard

•  The operating environment is considered in the estimation
   –  Data measured on the spot
   –  Data tested in the laboratory
•  AT&T Bell Labs.
   –  Since then called the Telcordia standard (1998)
   –  France Telecom (CNET93) and British Telecom (HRD5) improved the method

Equipment Availability - IP Router

IP router (simplified model, configuration example), 8 slots available:
•  HW common parts
•  SW library
•  1 × 4-port OC3/STM1 POS line card
•  2 × 1-port Gigabit Ethernet module
•  4 × 1-port OC48/STM16 POS line card
•  Not used
•  Power supply, housing, conditioning

•  IP router: interface card
   –  MTBF[h] = 8.5·10^4
   –  MTTR[h] = 4
•  IP router: SW
   –  MTBF[h] = 3·10^4
   –  MTTR[h] = 0.0004 (SW restart)
   –  MTTR[h] = 0.02 (SW reload)
   –  MTTR[h] = 0.25 (no automatic restart)
•  IP router: route processor
   –  MTBF[h] = 2·10^5
   –  MTTR[h] = 4

Equipment availability – DXC in SDH/SONET

[Figure: DXC with trunk transponders, tributary transponders, OEO conversion and control]

•  SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4
•  DXC has more ports than IP routers

SDH – Synchronous Digital Hierarchy
SONET – Synchronous Optical NETworking
DXC – digital cross connect
ADM – add-drop multiplexer
OEO – optical electrical optical conversion

Equipment Availability – WDM System

[Figure: OXC, transponder, WDM line system, amplifier and cable/fibre along a WDM link]

•  WDM OXC (OEO) or OADM: MTBF = 1·10^5, MTTR = 6
•  Transponder: MTBF = 400·10^3, MTTR = 6
•  Amplifier: MTBF = 250·10^3, MTTR = 6
•  WDM line system: MTBF = 160·10^3, MTTR = 6
•  Aerial cable: MTBF[km] = 1.75·10^5, MTTR = 6
•  Buried cable: MTBF[km] = 2.6·10^5, MTTR = 12
•  Submarine cables: MTBF[km] = 4.64·10^6, MTTR = 540

WDM – wavelength division multiplexing
OXC – optical cross connect
OADM – optical add-drop multiplexer

Single WDM lightpath

•  WDM OXC: MTBF = 1·10^5, MTTR = 6
•  Transponder: MTBF = 4·10^5, MTTR = 6
•  Amplifier: MTBF = 2.5·10^5, MTTR = 6
•  WDM line system (MUX): MTBF = 1.6·10^5, MTTR = 6
•  Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12

Series rule: $A = \prod_{i=1}^{m} A_i$

A_s-d = A_OXC · A_tr · A_MUX · A_cable · A_amp · A_MUX · A_tr · A_OXC
      = 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
      = 0.99994 · 0.99074 · 0.99994
      = 0.99062

About 3.4 days/year outage

1+1 Protection (disjoint pair of paths)

•  200 km lightpath: 0.99074

Parallel rule: $A = 1 - \prod_{i=1}^{m} (1 - A_i)$

A_s-d = A_OXC · [1 − (1 − A_path1) · (1 − A_path2)] · A_OXC
      = 0.99994 · [1 − (1 − 0.99074) · (1 − 0.99074)] · 0.99994
      = 0.99979

About 110 min/year outage
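The series and parallel rules can be checked against the lightpath figures quoted above. A minimal sketch in Python; the helper names `series` and `parallel` are assumptions of this example, and the component availabilities are the ones listed for the 200 km lightpath.

```python
from math import prod


def series(avails):
    """Series rule: every element must be up."""
    return prod(avails)


def parallel(avails):
    """Parallel rule: the system is down only if every branch is down."""
    return 1.0 - prod(1.0 - a for a in avails)


A_OXC = 0.99994
# transponder, MUX, cable, amplifier, MUX, transponder
path = series([0.999985, 0.9999625, 0.99087, 0.999976, 0.9999625, 0.999985])

unprotected = series([A_OXC, path, A_OXC])
protected = series([A_OXC, parallel([path, path]), A_OXC])
print(round(path, 5), round(unprotected, 5), round(protected, 5))
```

With 1+1 protection the two end OXCs stay in series with the protected pair, so they become the dominant contribution to the remaining unavailability.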

Design goals in Survivable Networks

•  High connection availability
•  Short recovery time
•  Scalability
•  Maintainability
•  Efficient usage of network resources
•  We search for the best trade-off
   –  Efficiency vs. complexity (from simple to complex)

Dedicated Protection

•  For a single connection, 1 working + 1 protection path is allocated
•  The two paths are disjoint
•  The reserved capacity along the common link is: A + B
•  PRO: instantaneous recovery (no action is needed)

[Figure: example network with unit reservations on the links; the common link of the two protection paths carries 2 units]

Shared Protection

•  If two working paths are (SRLG) disjoint, the capacity along their protection routes can be shared
   –  At most one of them is activated after a single failure
•  The spare capacity along the common link is: max{A, B}
•  CONS: we need actions (signaling) after failure

[Figure: example network with unit reservations on the links; the common link of the two protection paths carries 1 unit]
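The difference between the two reservation rules on a common link can be written down in a couple of lines. This is a sketch under the assumption that all working paths involved are pairwise SRLG-disjoint (otherwise the max rule does not apply); the bandwidth values are hypothetical.

```python
def dedicated_spare(demands):
    """Dedicated protection: every protection path keeps its own capacity."""
    return sum(demands)


def shared_spare(demands):
    """Shared protection with SRLG-disjoint working paths: at most one
    protection path is active after a single failure, so the largest
    demand is enough."""
    return max(demands, default=0)


a, b = 10, 5  # hypothetical bandwidths A and B of two connections
print(dedicated_spare([a, b]), shared_spare([a, b]))
```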

Recovery Time – The Tasks After Failure

Failure management:
•  1st phase: Failure detection (depends only on the network architecture)
•  2nd phase: Failure localization (isolation) (t_l)
•  3rd phase: Failure notification (t_n)
•  4th phase: Failure correlation (t_c)
•  5th phase: Fault restoration
   –  Path selection (t_p)
   –  Device configuration (t_d)

Recovery Cycle

[Figure: timeline of the recovery cycle. The service is operational; a failure occurs; the failure is detected by the nearest node (fault detection time); after the hold-off time the fault notification is sent (fault notification time); the protection path is deployed (recovery operation (switching) time); the data flow arrives at the destination node (traffic recovery time); the service is operational again. The recovery time spans the whole sequence.]

Example for shared protection: t_l = 10 ms, t_n = 20–30 ms, t_c = 20–30 ms, t_p = 0–30 ms, t_d = 50 ms, t_R = 100–150 ms
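The total recovery time in the shared protection example is the sum of the phase times. A small sketch that adds up the quoted ranges; the dictionary layout and function name are assumptions of this example.

```python
# (min, max) duration of each phase in milliseconds, from the example
PHASES_MS = {
    "localization t_l": (10, 10),
    "notification t_n": (20, 30),
    "correlation t_c": (20, 30),
    "path selection t_p": (0, 30),
    "device configuration t_d": (50, 50),
}


def recovery_time_range(phases):
    """Sum the lower and the upper bounds of all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


print(recovery_time_range(PHASES_MS))  # (100, 150)
```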

Network resource usage vs. recovery time

[Figure: three panels (dedicated protection, shared protection (pre-planned), dynamic restoration), each showing resource usage R on a 0–100% scale and recovery time T on a 0–150 ms scale]

•  Protection: the restoration process (e.g. protection paths) is planned at connection setup
•  Dynamic restoration: the restoration process is computed on-the-fly after failure

Link, Segment or Path Protection

[Figure: four copies of a nine-node network (nodes 1–9) showing the working path and, after a fault, the three protection schemes]

•  Link protection: local, loop back
•  Segment protection: a good compromise
•  Path protection: global, efficient

Protection and restoration

•  Pre-planned (protection): 100%, fast
•  After the failure event occurs (restoration): no guarantee, slower

Both branches are classified further, level by level:
•  link, path or segment
•  dedicated or shared
•  failure dependent or failure independent (the failed element is unknown)

The different protection approaches are read from bottom to top (e.g. Dedicated Path Protection or Failure Dependent Shared Link Protection)

Dedicated 1+1 Path Protection

•  The signal is sent in parallel along the working path and along the protection path
•  If the working path is interrupted by a fault
   –  The destination node switches to the protection path
•  Simple, high network resource usage (100% redundancy)

[Figure: source S and destination D; switching only at D]

Dedicated 1:1 protection

•  We reserve two disjoint paths for the connection
•  If the working path is interrupted by a fault
   –  The source and destination nodes switch to the protection path
•  In the no-failure state the protection route can be used for best-effort traffic
   –  It is called "preemption"

[Figure: source S and destination D; switching at both S and D]

Dedicated 1:n Path Protection

•  There are n disjoint working paths between the same source and destination nodes
   –  PRO: better capacity efficiency
   –  CON: slightly smaller availability
•  What is the availability of 1:1 protection? (A_w, A_p)

$$A = 1 - (1 - A_w)(1 - A_p) = A_w + A_p - A_w A_p$$

•  What about 1:2? (A_w1, A_w2, A_p)

$$A = A_{w1} A_{w2} + (1 - A_{w1}) A_{w2} A_p + A_{w1} (1 - A_{w2}) A_p$$
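The 1:1 and 1:2 formulas follow the same pattern: either all working paths are up, or exactly one has failed and the protection path is up. A sketch that generalizes this pattern to n working paths; the generalization and the function name are assumptions of this example, not taken from the slides.

```python
from math import prod


def availability_1_to_n(working, protection):
    """Probability that all n connections are served: all working paths up,
    or exactly one working path down while the protection path is up."""
    all_up = prod(working)
    exactly_one_down = sum(
        (1 - w) * prod(x for j, x in enumerate(working) if j != i)
        for i, w in enumerate(working)
    )
    return all_up + exactly_one_down * protection


aw, ap = 0.99, 0.98  # hypothetical path availabilities
print(availability_1_to_n([aw], ap))      # 1:1, equals aw + ap - aw * ap
print(availability_1_to_n([aw, aw], ap))  # 1:2, slightly smaller
```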

Diversity Coding (DC)

•  Split the traffic into n sub data flows
•  Use coding techniques along the protection route
   –  For a single failure
   –  For n = 2 it is the bitwise XOR of the two working flows
•  There are not many (short) disjoint paths in the network
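For n = 2 the coding step is a bitwise XOR, so either sub-flow can be rebuilt from the other one and the protection flow. A minimal sketch with hypothetical byte strings standing in for the two sub data flows:

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bitwise XOR of two equally long byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


flow1 = b"\x12\x34\x56"  # hypothetical sub-flow on working path 1
flow2 = b"\xab\xcd\xef"  # hypothetical sub-flow on working path 2
parity = xor_bytes(flow1, flow2)  # sent along the protection route

# Working path 1 fails: the destination rebuilds flow1 from what arrived.
recovered = xor_bytes(parity, flow2)
print(recovered == flow1)
```

The destination needs no signaling to recover, but three mutually disjoint paths are required, which is the limitation noted in the last bullet.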

Self healing rings – 1+1 dedicated path protection

•  Used in ring access networks

[Figure: traffic A → B is sent along both Path 1 and Path 2 around the ring; after a failure, B switches to the surviving path]

Self healing rings – 1:1 dedicated link protection

•  Used inside a building/office

[Figure: a working ring and a protection ring between A and B; after a failure, two switches redirect the traffic A → B onto the protection ring]

P-cycles

•  Shared protection
•  Protection cycles are defined in advance in the spare capacity of the network
•  On-cycle and straddling links
•  Only two switching actions

[Figure: example European network with nodes Amsterdam, London, Brussels, Paris, Zurich, Milan, Berlin, Vienna, Prague, Munich, Rome, Hamburg, Lyon, Frankfurt, Strasbourg, Zagreb]

P-cycles

•  Similar to self-healing rings
•  Working path is routed along the shortest path
•  Failure occurs along
   –  On-cycle link
      •  Route the connection into the other direction
   –  Straddling links
      •  Decompose the working data into two parts

P-cycles

•  Unit bandwidth along the p-cycle
   –  Protects one unit of working bandwidth if the working path is routed along the cycle
   –  Protects two units of working bandwidth if the working path traverses a straddling link
•  Pros:
   –  No spare capacity reservation along straddling links
   –  There can be a lot of straddling links
   –  Efficient bandwidth usage
   –  Only two switching actions needed at recovery
      •  Two nodes along the cycle
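The on-cycle versus straddling distinction can be expressed as a small classification function. This is a sketch under simple assumptions: the p-cycle is given as an ordered list of distinct nodes, the link is an existing network link with two different end nodes, and the node names are hypothetical.

```python
def protected_units(cycle, link):
    """Working units that one unit of p-cycle capacity protects on `link`.

    Returns 1 for an on-cycle link, 2 for a straddling link and 0 if the
    link has an end node outside the cycle (not protected by this cycle).
    """
    u, v = link
    if u not in cycle or v not in cycle:
        return 0
    n = len(cycle)
    gap = (cycle.index(u) - cycle.index(v)) % n
    on_cycle = gap in (1, n - 1)  # neighbours along the cycle
    return 1 if on_cycle else 2


cycle = ["A", "B", "C", "D", "E"]          # hypothetical p-cycle A-B-C-D-E-A
print(protected_units(cycle, ("A", "B")))  # on-cycle link
print(protected_units(cycle, ("A", "C")))  # straddling link
print(protected_units(cycle, ("A", "X")))  # not covered by this cycle
```

A straddling link gets two units because both directions around the cycle survive its failure, which is where the capacity efficiency comes from.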

Shared protection

•  Working path is reserved
•  Protection paths are only calculated
   –  They are built up in the optical control plane, but the switches are not configured
•  Soft-switching
•  Shared protection
•  Backup multiplexing

[Figure: example network with unit reservations on the links]

Capacity on the edges

[Figure: the capacity of link j is split into free, spare and working capacity. For a given working path W, the spare capacity splits further into a shareable part and a part that is non-shareable with W, $s_j^W$.]

Example

[Figure: example network with link capacities split into spare, working and free capacity; values shown: 10, 10, 5, 5, 10, 10]

Single link SRLGs are considered!

Calculation of the shareable spare capacity

•  Depends on the working paths
   –  Which SRLGs they are involved in

Spare provision matrix

$s_{j,l}$ = non-shareable spare capacity along link j, if the working path is in SRLG l

•  Column l: SRLG (working edge involved)
•  Row j: protection edges (all edges in the network)

S =
   3 1 … 2 …
   2 2 … 3 …
   1 2 … 5 …
   2 1 … 2 …
   2 2 … 4 …

Spare Provision Matrix

[Figure: example network and matrix S; column l corresponds to link l, row j to link j; values shown: 20, 10, 10, 10, 10]

•  To obtain the matrix we need to keep track of the network state after each failure
•  With the single failure scenario, only one SRLG can be failed at any moment.

Spare Provision Matrix

[Figure: example network and matrix S; column l corresponds to link l, row j to link j; values shown: 20, 10, 10 and 10, 5, 5, 0, 0, 0]

Spare provision matrix

•  How much is the spare capacity on link j?

$$v_j = \max_{l \in SRLG} s_{j,l}$$

•  How much is the non-shareable spare capacity along link j if the working path W is known?
   –  find the maximum demand of spare capacity among all the SRLGs traversed by W

$$s_j^W = \max_{l \in W} s_{j,l}$$

[Figure: matrix S with row j (edge) and column l (SRLG)]
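Both quantities are row maxima of the spare provision matrix, taken over all SRLGs or only over the SRLGs of the working path. A sketch with a small hypothetical matrix (rows are links j, columns are SRLGs l); the function names are assumptions of this example.

```python
# Hypothetical spare provision matrix: S[j][l] is the spare capacity
# needed on link j when SRLG l fails.
S = [
    [3, 1, 2],
    [2, 2, 3],
    [1, 2, 5],
]


def spare_capacity(S, j):
    """v_j: the largest requirement on link j over all SRLGs."""
    return max(S[j])


def non_shareable(S, j, working_srlgs):
    """s_j^W: the largest requirement on link j over the SRLGs that the
    working path W traverses (0 if W traverses none)."""
    return max((S[j][l] for l in working_srlgs), default=0)


j = 2
W = [0, 1]  # hypothetical: the working path is involved in SRLGs 0 and 1
print(spare_capacity(S, j), non_shareable(S, j, W))
print(spare_capacity(S, j) - non_shareable(S, j, W))  # shareable on link j
```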

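Shared protection

When a new demand arrives (bandwidth b, working path W):
•  The whole capacity of the working path needs to be reserved
•  In the case of the protection path, for each link j:
   –  $v_j - s_j^W \ge b$: ADMIT, the shareable spare capacity covers the demand
   –  $v_j - s_j^W < b$ and $f_j + v_j - s_j^W \ge b$: ADMIT, part of the free capacity is needed
   –  $v_j - s_j^W = 0$: no spare capacity can be shared
   –  $f_j + v_j - s_j^W < b$: BLOCK

[Figure: capacity of link j split into free ($f_j$), spare ($v_j$, with a shareable part and a non-shareable part $s_j^W$) and working capacity; a further label $h_j^W$ appears in the figure]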
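The per-link decision for the protection path can be sketched as follows. The function name and return convention are assumptions of this example; b is the bandwidth of the new demand, f_j the free capacity, v_j the spare capacity and s_j^W the non-shareable spare capacity on link j.

```python
def admit_on_link(f_j, v_j, s_w_j, b):
    """Return (admitted, extra), where `extra` is the free capacity that has
    to be turned into spare capacity on link j for the new protection path."""
    shareable = v_j - s_w_j
    if shareable >= b:
        return True, 0          # existing spare capacity is enough
    extra = b - shareable
    if f_j >= extra:            # same as f_j + v_j - s_w_j >= b
        return True, extra
    return False, 0             # block: not enough free capacity


print(admit_on_link(f_j=5, v_j=10, s_w_j=4, b=6))  # fully shared
print(admit_on_link(f_j=5, v_j=10, s_w_j=8, b=6))  # needs 4 free units
print(admit_on_link(f_j=1, v_j=10, s_w_j=8, b=6))  # blocked
```

A protection path is admitted only if every link along it is admitted, so a routing algorithm would apply this check link by link.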

References

•  Dr. Chidung LAC, "Telecommunication network reliability"
•  D. Arci et al., "Availability models for protection techniques in WDM networks"
•  Jim Kurose, Keith Ross, Computer Networking: A Top Down Approach Featuring the Internet, 3rd edition. Addison-Wesley, July 2004.
•  J. Vasseur, M. Pickavet, and P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Morgan Kaufmann Publishers, 2004.
•  Darli A. A. Mello et al., "A Matrix-Based Analytical Approach to Connection Unavailability Estimation in Shared Backup Path Protection"
•  Kefei Wang, "Protection & Restoration for Optical Ethernet"
•  Jesús F. Lobo, Gaël Hernández, Alberto Soria, "MPLS Fast Reroute"
•  Ling Huang, "Protection and Restoration in Optical Network"
