
Dipartimento di Informatica e Sistemistica, University of Napoli Federico II – Comics Group

An introduction to

NETWORK RESILIENCY

Giorgio Ventre & Stefano Avallone

COMICS Group

Dipartimento di Informatica e Sistemistica

Università di Napoli Federico II


References

Jean-Philippe Vasseur, Mario Pickavet, Piet Demeester. “Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS”. Morgan Kaufmann.

AA. VV. “Building Survivable Networks”, Feature Issue of IEEE Network Magazine, March/April 2004.


Communication Networks Relevance

Communication networks are becoming fundamental infrastructures:
• the amount of data carried by communication networks has grown considerably in recent years;
• many social and economic activities depend on communication networks;
• many safety-critical activities depend on communication networks.

Reliability is an essential feature of today's communication networks!


Network Reliability: definition [1]

(a) The ability of a network to maintain or restore an acceptable level of performance during network failures by applying various restoration techniques, and (b) the mitigation or prevention of service outages from network failures by applying preventive techniques.

Synonym: Network Survivability.

[1] Alliance for Telecommunications Industry Solutions (ATIS), http://www.atis.org/tg2k/_network_reliability.html


Network Reliability: related concepts

Many concepts are related to network reliability, for example:
• network element reliability: the probability that a network element is fully operational during a certain period of time;
• network element availability: the probability that a network element is in an up state at a given instant of time t;
• network element fault: the inability of a network element to perform a required action;
• ...


Which failures may occur?

The ability of a network to provide the required services may be compromised by different kinds of failures:
• planned or unplanned failures;
• internal or external failures;
• software or hardware failures;
• malicious or accidental failures;
• ...


Accounted Failures

Providing actions to address every failure that may occur in a communication network is unfeasible.

Network providers and ISPs normally prepare action plans that address only the most frequent failures.

These failures are called accounted failures.

The most common types of accounted failure are:
• single link failure;
• single node failure.


Failures' Impact

In today's communication networks a single failure may cause a major disruption in network availability.

A single cut in an optical cable may drop thousands of logical network connections.
• On July 5, 2002 a submarine cable break affected the Asia Pacific Cable Network 2 (APCN 2), causing a considerable slowdown in network connections among Japan, China, South Korea, etc.


Failures' Impact: ATC systems

Press release (http://www.natca.org/mediacenter/press-release-detail.aspx?id=394):

MASSIVE POWER, COMMUNICATIONS FAILURE AT MAJOR AIR TRAFFIC CONTROL CENTER PUTS CONTROLLERS IN DARK, FLIGHTS IN JEOPARDY

07/19/2006, Bob Marks

PALMDALE, Calif. – A massive power and communications failure late Tuesday at the Los Angeles Air Route Traffic Control Center left scrambling air traffic controllers to deal with a nightmare scenario – how to keep dozens of flights away from each other above a large swath of the Southwestern United States despite the inability to see them, talk to them or relay crucial instructions for 15 excruciatingly long minutes.

Every ounce of skill, heart and determination that controllers bring into the control room every day was put to the test during one of the worst outages to ever hit the facility. It was so bad, controllers say, that the only thing they had of use to aid the situation that actually worked was their cell phones – devices which the Federal Aviation Administration, inexplicably, has barred from control rooms, further impeding the safety of the system.

More details in http://themainbang.typepad.com/blog/2006/07/complete_failur.html


Network Reliability Parameters

Some parameters that may be used to characterize the reliability of a network are defined in ITU-T Recommendation G.911:

“Parameters and Calculation Methodologies for Reliability and Availability of Fibre Optic Systems”

The following slides introduce some of the parameters defined in G.911.


Failure in Time (FITs) and Maintenance Time

Failure in Time:
• the number of failures of a device occurring in a specific time interval;
• normally expressed as failures per billion (10^9) device-hours.

Maintenance Time:
• the time interval during which a maintenance action is performed on an item, either manually or automatically, ...


Mean Time Between Failures (MTBF)

The Mean Time Between Failures (MTBF) is the steady-state expectation of the time between failures.

Mathematically, the MTBF (in years per failure) is related to the failure rate F (in FITs, i.e. failures per 10^9 hours) as follows, since 10^9 hours correspond to about 1.14 × 10^5 years:

$$\mathrm{MTBF} = \frac{1.14 \times 10^{5}}{F}$$


Mean Time To Repair (MTTR)

The Mean Time To Repair (MTTR) is defined as the total corrective maintenance time divided by the total number of corrective maintenance actions during a given period of time.

Given the definitions of MTBF and MTTR, the availability A of an item may be derived as:

$$A = 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}}$$
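As a rough numerical illustration of the two relations (MTBF from the FIT rate, availability from MTTR and MTBF), the following minimal Python sketch may help. The function names, the 1000-FIT device and the 4-hour MTTR are hypothetical values chosen for the example, and a year is assumed to have 8760 hours.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days x 24 hours


def mtbf_years(fits: float) -> float:
    """MTBF in years for a failure rate given in FITs (failures per 1e9 device-hours)."""
    return 1e9 / (fits * HOURS_PER_YEAR)


def availability(mttr_hours: float, mtbf_hours: float) -> float:
    """Availability A = 1 - MTTR/MTBF, with both times in the same unit."""
    return 1.0 - mttr_hours / mtbf_hours


# Hypothetical device: 1000 FITs, 4 hours mean time to repair.
mtbf = mtbf_years(1000)                       # about 114 years
a = availability(4, mtbf * HOURS_PER_YEAR)    # about 0.999996
print(f"MTBF = {mtbf:.1f} years, A = {a:.6f}")
```

The constant 1.14 × 10^5 in the MTBF formula is simply 10^9 / 8760 rounded to three digits.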


Users, services and reliability requirements

Network reliability is a “relative concept”.

The reliability requirements of a communication network depend on:
• the user type;
• the service type.

Different user-service combinations lead to different requirements in terms of MTBF and MTTR.


User classification

According to their reliability requirements, network users may be classified into the following categories:

Safety-critical users. Users for whom service interruptions are unacceptable.

Business-critical users. Users for whom any service interruption leads to a high financial loss.

Low-cost users. Users for whom service interruptions cause only discomfort.

Basic-level users. Users for whom service reliability is only a side effect.


Availability: Impact of Outages

Ref: “Service Applications for SONET DCS Distributed Restoration”, IEEE J. Selected Areas in Comm., Jan. 1994

[Figure: service outage impact versus restoration time after failure detection.
Time axis marks: 50 ms, 200 ms, 2 s, 10 s, 5 min, 15 min, 30 min.
Ranges along the time axis: Protection Switching Range, then 1st, 2nd, 3rd and 4th Restoration Target Range.
Labels on the impact scale: Service “Hit” (reframes); Undesirable; Social/Business Impact; Unacceptable.
Annotated impacts, in order of increasing restoration time:
• Potential voiceband disconnects (<5%)
• Trigger changeover of CCS7 STP signaling links
• Affect cell rerouting process
• May drop voiceband calls depending on channel bank vintage
• Drop all circuit-switched connections
• PL disconnects
• Potential packet (X.25) disconnects
• Potential data session time-outs
• Packet (X.25) disconnects
• Data session time-outs
• Network congestion
• Minor social/business impacts
• Potentially FCC reportable
• Major social/business impacts]


Market Drivers for Survivability

• Customer Relations
• Competitive Advantage
• Revenue
  – Negative: Tariff Rebates
  – Positive: Premium Services
  – Business Customers, Medical Institutions, Government Agencies
• Impact on Operations
• Minimize Liability


Network Survivability

Availability: 99.999% (5 nines) => about 5 min of downtime per year

Since a network is made up of several components, the ONLY way to reach 5 nines is to add survivability in the face of failures…

Survivability = continued service in the presence of failures

Protection switching or restoration: mechanisms used to ensure survivability
• Add redundant capacity, detect faults and automatically re-route traffic around the failure

Restoration: related term, but on a slower time-scale

Protection: fast time-scale, 10s-100s of ms; implemented in a distributed manner to ensure fast restoration


Failure Types & Other Motivations

Types of failure:

Components: links, nodes, channels in WDM, active components, software…

Human error: backhoe fiber cut
• Fiber inside oil/gas pipelines is less likely to be cut

Systems: entire COs (central offices) can fail due to catastrophic events

Protection allows easy maintenance and upgrades:
• E.g.: switch traffic over when servicing a link…

Single failure vs. multiple concurrent failures…

Goal: mean repair time << mean time between failures…

Protection also depends on the kind of application.

Survivability may hence be provided at several layers.


Network Survivability Architectures

[Figure: taxonomy of network survivability architectures. Top-level branches: Restoration and Protection. Nodes shown: Protection Switching, Self-healing Network, Re-Configurable Network, Mesh Restoration Architectures, Linear Protection Architectures, Ring Protection Architectures.]


Network Availability & Survivability

Availability is the probability that an item will be able to perform its designed functions at the stated performance level, within the stated conditions and in the stated environment when called upon to do so.

$$\text{Availability} = \frac{\text{Reliability}}{\text{Reliability} + \text{Recovery}}$$


Quantification of Availability

Percent Availability   N-Nines   Downtime (Minutes/Year)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
...
99.9999%               6-Nines   0.5 Min/Yr
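The downtime column follows directly from the availability percentage. The short Python sketch below recomputes it for 2 to 6 nines. It assumes a 365-day year (525,600 minutes), and the function name is ours, so the exact results differ slightly from the rounded figures in the table.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for nines in range(2, 7):
    percent = 100 * (1 - 10 ** -nines)   # 99, 99.9, ..., 99.9999
    minutes = downtime_minutes_per_year(percent)
    print(f"{nines}-Nines: {minutes:.3f} min/yr")
```

For example, five nines gives roughly 5.26 minutes of downtime per year.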


PSTN

Individual elements have an availability of 99.99%.

One cut-off call in 8,000 calls (3 min for an average call). Five ineffective calls in every 10,000 calls.

[Figure: end-to-end PSTN path between two facility entrances, crossing an access network (AN), a network interface (NI) and a local exchange (LE) on each side, plus a long-distance (LD) segment. Unavailability per segment: each AN 0.01 %, each NI and LE 0.005 %, LD 0.02 %.]

PSTN End-2-End Availability: 99.94%

NI: Network Interface
LE: Local Exchange
LD: Long Distance
AN: Access Network

Source: http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf
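The 99.94% end-to-end figure can be checked by treating the path as a series system: it is up only when every segment is up. The Python sketch below uses the per-segment unavailability values from the figure. The order of the segments in the list and the function name are assumptions, but the order does not affect the result.

```python
# Per-segment unavailability in percent, as given in the PSTN figure.
segments = [
    ("AN", 0.01), ("NI", 0.005), ("LE", 0.005),
    ("LD", 0.02),
    ("LE", 0.005), ("NI", 0.005), ("AN", 0.01),
]


def series_availability_percent(unavailabilities_percent):
    """Availability of elements in series: product of the element availabilities."""
    availability = 1.0
    for u in unavailabilities_percent:
        availability *= 1 - u / 100
    return 100 * availability


exact = series_availability_percent(u for _, u in segments)
# For small values the product is well approximated by summing unavailabilities.
approx = 100 - sum(u for _, u in segments)
print(f"exact: {exact:.4f} %, approximation: {approx:.2f} %")
```

Both the exact product and the simple sum of unavailabilities (0.06 %) round to 99.94 %.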


IP Network Expectations

Service                                        Delay  Jitter  Loss  Availability
Real Time Interactive (VoIP, Cell Relay ...)   L      L       L     H
Layer 2 & Layer 3 VPNs (FR/Ethernet/AAL5)      M, H, L, L (column assignment unclear)
Internet Service                               H      H       M     L
Video Services                                 L      M       M     H

L: Low   M: Medium   H: High


Measuring Availability: The Port Method

• Based on the port count in the network
• Does not take into account the bandwidth of the ports, e.g. OC-192 and 64k are both ports
• Good for dedicated access service, because ports are tied to customers

$$\text{Availability (\%)} = \frac{(\text{total number of ports} \times \text{sample period}) - (\text{number of impacted ports} \times \text{outage duration})}{\text{total number of ports} \times \text{sample period}} \times 100$$


The Port Method Example

• Network with 10,000 active access ports
• An access router with 100 access ports fails for 30 minutes.
  – Total available port-hours = 10,000 × 24 = 240,000
  – Total down port-hours = 100 × 0.5 = 50
  – Availability for a single day = (240,000 − 50) / 240,000 × 100 = 99.979166 %


The Bandwidth Method

• Based on the amount of bandwidth available in the network
• Takes into account the bandwidth of the ports
• Good for core routers

$$\text{Availability (\%)} = \frac{(\text{total amount of BW} \times \text{sample period}) - (\text{amount of BW impacted} \times \text{outage duration})}{\text{total amount of BW} \times \text{sample period}} \times 100$$


The Bandwidth Method Example

• Total capacity of the network: 100 Gigabits/sec
• An access router with 1 Gigabit/sec of bandwidth fails for 30 minutes.
  – Total BW available in the network for a day = 100 × 24 = 2,400
  – Total BW lost in the outage = 1 × 0.5 = 0.5
  – Availability for a single day = ((2,400 − 0.5) / 2,400) × 100 = 99.979166 %
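The port method and the bandwidth method share the same formula and differ only in the unit being counted (ports or Gigabits/sec). A minimal Python sketch, with a function name and parameter names chosen here for illustration, reproduces both worked examples:

```python
def availability_percent(total_units, sample_period_h, impacted_units, outage_h):
    """Availability over a sample period; units may be ports or bandwidth."""
    total = total_units * sample_period_h   # e.g. port-hours
    lost = impacted_units * outage_h        # e.g. down port-hours
    return (total - lost) / total * 100


# Port method: 10,000 ports, 100 of them down for 0.5 h in a 24 h day.
port_result = availability_percent(10_000, 24, 100, 0.5)

# Bandwidth method: 100 Gbit/s total, 1 Gbit/s down for 0.5 h in a 24 h day.
bw_result = availability_percent(100, 24, 1, 0.5)

print(f"port method: {port_result:.6f} %, bandwidth method: {bw_result:.6f} %")
```

The two results coincide here only because the failed router accounts for 1% of the ports and 1% of the bandwidth. In a network with mixed port speeds the two methods would generally give different values.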


Basic Ideas: Working and Protect Fibers


Service classification (1/2)

Communication networks are used to carry many different services.

Different services may have different reliability requirements.

The reliability requirements of such services are related to QoS parameters:
• Bit Rate;
• Delay;
• Jitter;
• ...


Service classification (2/2)

Application                  Bit Rate            Bit Rate Variation  Delay Sensitivity  Need for Recovery
Plain Old Telephone Service  31-32 Kbps          Constant            5                  5
Voice over IP                8-32 Kbps           Constant            5                  5
Video-telephony              256-1920 Kbps       High                5                  5
Videoconferencing            at least 256 Kbps   High                5                  5
Teleworking                  64 Kbps – 2 Mbps    Very High           5                  4
TV broadcast                 2-8 Mbps            High                4                  3
Distance Learning            64 Kbps – 2 Mbps    Very High           5                  5
Movies on Demand             750 Kbps – 4 Mbps   High                4                  5
News on Demand               64 Kbps             Very High           2                  2
Internet Access              64 Kbps – 2 Mbps    Very High           1                  2
Teleshopping                 64 Kbps – 2 Mbps    Very High           2                  2

[2] A. Lason, et al., “Network Scenarios and Requirements”, European IST project Layers Internetworking in Optical Network (LION), deliverable D6, September 1999.


How to increase network reliability?

Prevent network failures:
• put network cables deeper in the ground;
• test hardware and software more thoroughly;
• ...

Duplicate vulnerable network elements:
• dual homing.

Regardless of these measures, network failures still occur.

There is a need for network recovery or resilience schemes!


Network recovery basic idea

Build networks to have alternate paths

Design systems to have alternate entities

Monitor for possible failures

Manage networks proactively


Network recovery requirements

Network recovery imposes several requirements. For example:
• there should be backup capacity to create a recovery path;
• the backup capacity must be sufficient to satisfy the QoS constraints;
• single points of failure must be avoided;
• ...


Recovery and reversion cycles

[Figures: Recovery Cycle; Reversion Cycle]


Recovery mechanisms

A wide variety of recovery mechanisms exists.

Every mechanism has advantages and drawbacks.

The following slides report some criteria that may be used to evaluate and classify recovery mechanisms [3, 4].

[3] V. Sharma et al., “Framework for MPLS-based recovery”, RFC 3469, IETF web site, Feb 2003

[4] K. Owens, V. Sharma, M. Oommen, and F. Hellstrand, “Network Survivability Considerations for Traffic Engineered IP Networks”, Internet draft: draft-owens-te-network-survivability-03, May 2002. Available at: www.ietf.org. Accessed July 2005


Backup Capacity

Dedicated
• one-to-one relationship between the backup resources and the working path;
• the simplest solution;
• an inefficient solution.

Shared
• the backup resources are shared among different working paths;
• a more complex solution;
• a more efficient solution.
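The efficiency gain of shared backup capacity can be illustrated with a small Python sketch. It assumes that only one failure occurs at a time, so working paths hit by different failures never need their backup capacity simultaneously. The working paths, capacities and failure names are hypothetical.

```python
# Hypothetical backup demands on ONE backup link:
# (working path, capacity units needed, failure that triggers the backup)
backup_demands = [
    ("W1", 10, "link A-B"),
    ("W2", 10, "link C-D"),
    ("W3", 5, "link A-B"),
]


def dedicated_capacity(demands):
    """Dedicated backup: every working path reserves its own capacity."""
    return sum(capacity for _, capacity, _ in demands)


def shared_capacity(demands):
    """Shared backup under a single-failure assumption: reserve only the
    worst-case total triggered by any one failure."""
    per_failure = {}
    for _, capacity, failure in demands:
        per_failure[failure] = per_failure.get(failure, 0) + capacity
    return max(per_failure.values(), default=0)


print(dedicated_capacity(backup_demands))  # 25
print(shared_capacity(backup_demands))     # 15
```

The saving comes at the price of extra bookkeeping (which working paths share which failure), which is why sharing is the more complex option.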


Recovery Path

Preplanned
• recovery paths for all accounted failure scenarios are calculated in advance;
• allows fast recovery from failures;
• lacks flexibility for unaccounted failure scenarios.

Dynamic
• the recovery path is calculated “on the fly” when the failure is detected;
• may also be used to find recovery paths for unaccounted failure scenarios.


Recovery Approaches

Protection
• the recovery paths are preplanned and fully signaled before a failure occurs;
• when a failure occurs, no additional signaling is needed to establish the recovery path;
• it is the faster solution.

Restoration
• the recovery paths may be preplanned or dynamically allocated, but they are not signaled in advance;
• when a failure occurs, additional signaling is needed to establish the recovery path;
• it is the more flexible solution.


Protection Variants (1/2)

1+1 Protection (Dedicated Protection)
• there is exactly one dedicated recovery path for each working segment;
• the traffic is permanently duplicated on both the working path and the recovery path;
• it is a quite expensive solution.

1:1 Protection (Dedicated Protection with extra traffic)
• there is exactly one dedicated recovery path for each working segment;
• the traffic is transmitted over only one path at a time;
• extra traffic can be transported along the recovery path in failure-free conditions.


Protection Variants (2/2)

1:N (Shared Recovery with Extra Traffic)
• each recovery entity is used to protect N working entities;
• the recovery entities can be used to transport extra traffic in failure-free conditions.

M:N (M ≤ N)
• a set of M recovery entities is used to protect a set of N working entities;
• the recovery entities can be used to transport extra traffic in failure-free conditions.
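One simple way to compare these variants is the ratio of recovery entities to working entities. The Python sketch below computes it for a few configurations. The function name and the idea of using this ratio as a cost indicator are illustrative assumptions, not part of the slides.

```python
from fractions import Fraction


def protection_overhead(m: int, n: int) -> Fraction:
    """Recovery entities per working entity in an M:N scheme (requires 1 <= M <= N)."""
    if not 1 <= m <= n:
        raise ValueError("M:N protection requires 1 <= M <= N")
    return Fraction(m, n)


# 1+1 and 1:1 both reserve one recovery entity per working entity.
print(protection_overhead(1, 1))   # 1   (100% extra capacity)
print(protection_overhead(1, 4))   # 1/4 (1:N with N = 4)
print(protection_overhead(2, 6))   # 1/3 (M:N with M = 2, N = 6)
```

The ratio does not capture the difference between 1+1 and 1:1: both reserve the same capacity, but only 1:1 lets the recovery path carry extra traffic when there is no failure.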


Recovery Extent (1/2)

Local Recovery
• in a failure condition only the affected network elements are bypassed using the recovery path;
• the RHE (recovery head-end) and RTE (recovery tail-end) are close to the failure, so they may detect it quickly, leading to a shorter recovery time;
• after a failure the route followed by the traffic may not be optimal (e.g. the same traffic may cross a link twice!);
• recovery fails if two successive nodes fail.


Recovery Extent (2/2)

Global Recovery
• in a failure condition the complete working path between source and destination is bypassed;
• the recovery time is greater than that of local recovery;
• an optimal recovery path is used in case of failure;
• it can still resolve the failure of two successive nodes;
• it may generate more “state overhead” than the local approach.

An intermediate solution between the local and global approaches may be adopted!
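The difference between local and global recovery can be sketched on a small example. The Python code below uses a hypothetical six-node topology and a fewest-hop breadth-first search as the path computation. Both are assumptions made for illustration; a real network would also account for capacity and for disjointness from the working path.

```python
from collections import deque


def shortest_path(links, src, dst, failed):
    """Fewest-hop path from src to dst that avoids the failed link (BFS)."""
    adjacency = {}
    for a, b in links:
        if {a, b} == set(failed):
            continue  # skip the failed link
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


def local_recovery(links, working, failed):
    """Bypass only the failed link: detour between its two end nodes."""
    head, tail = failed  # recovery head-end and tail-end, in working-path order
    detour = shortest_path(links, head, tail, failed)
    if detour is None:
        return None
    i, j = working.index(head), working.index(tail)
    return working[:i] + detour + working[j + 1:]


def global_recovery(links, working, failed):
    """Replace the whole working path between source and destination."""
    return shortest_path(links, working[0], working[-1], failed)


# Hypothetical topology; the working path is S-A-B-D and link A-B fails.
links = [("S", "A"), ("A", "B"), ("B", "D"), ("A", "X"), ("X", "B"),
         ("S", "Y"), ("Y", "D")]
working = ["S", "A", "B", "D"]
print(local_recovery(links, working, ("A", "B")))   # ['S', 'A', 'X', 'B', 'D']
print(global_recovery(links, working, ("A", "B")))  # ['S', 'Y', 'D']
```

In this example local recovery only has to involve the nodes next to the failure but yields a four-hop route, while global recovery finds a two-hop route at the cost of rerouting from the source.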


Control of Recovery Mechanisms (1/2)

Centralized
• a central controller determines the action to take in case of failure;
• the central controller also determines when and where a fault has occurred;
• the central controller is a single point of failure;
• it is generally an efficient approach;
• in principle it is a simpler approach, but the central controller may become a very complex system.


Control of Recovery Mechanisms (2/2)

Distributed
• there is no centralized controller: all the network elements are able to react autonomously to failures;
• with this approach there is no global view of the network condition;
• the network elements may have to exchange information to keep a consistent view of the network;
• it is a more scalable approach.


Protection Topologies - Ring

Two or more nodes connected to each other with a ring of links

[Figure: ring of nodes with ports labelled E and W; further labels D and L; legend: Working, Protect]


Protection Topologies - Mesh

Three or more nodes connected to each other

Can be sparse or complete meshes

Spans may be individually protected with linear protection

Overall edge-to-edge connectivity is protected through multiple paths

[Figure: mesh of nodes; legend: Working, Protect]


Protection Switching Terminology

1+1 architectures: permanent bridge at the source, select at the sink

m:n architectures: m entities provide protection for n working entities, where m is less than or equal to n
• allows unprotected extra traffic
• most common: SONET linear 1:1 and 1:n


1+1 vs 1:n

[Figure: Working and Protect links in the (1+1) and (1:n) architectures]


SONET Linear 1+1 APS

[Figure: SONET linear 1+1 APS between two nodes; blocks: BR, SW, TX, RX; lines: Working and Protection, one pair per direction]

TX = Transmitter, RX = Receiver

BR = Bridge, SW = Switch


SONET 1:1 Linear APS

[Figure: SONET 1:1 linear APS between two nodes; blocks: BR, SW, TX, RX; lines: Working and Protection, plus an APS Channel]

TX = Transmitter, RX = Receiver

BR = Bridge, SW = Switch


Protection Switching: Terminology

Dedicated vs. shared: the working connection is assigned dedicated or shared protection bandwidth
• 1+1 is dedicated, 1:n is shared

Revertive vs. non-revertive: after the failure is fixed, traffic is automatically or manually switched back
• Shared protection schemes are usually revertive

Uni-directional or bi-directional protection:
• Uni-directional: each direction of traffic is handled independently of the other. Fiber cut => only one direction is switched over to protection. Usually done with dedicated protection; no signaling required.
• Bi-directional: transmission in both directions on a fiber (full duplex) => requires bi-directional switching, and signaling is required.


Mesh Restoration

[Figure: mesh of six DCS nodes showing a Working Path, Line or Link Restoration, and Path Restoration]

• Control: Centralized or Distributed
• Route Calculation: Preplanned or Dynamic
• Type of Alternate Routing: Line or Path


Link vs. Path restoration

Link restoration
• Requires the ability to identify the failed link at both ends.
• Cannot protect against node failure.
• Link based
• Mesh (generalized loopback): insensitive to additions to the network (scalable); the backup path can be pre-computed (fast recovery); dynamic rerouting

Path restoration
• More resilient than link restoration.
• Reroutes the traffic from the primary path to a Shared Risk Group (SRG)-disjoint backup path.
• Protects both end-to-end paths and single links.

• Preferred: Path Based


Link vs. Path restoration

[Figure: six-node topology (A, B, C, D, E, F) carrying Flow 1: A-C-D and Flow 2: E-C-D-F, drawn three times; panel titles: Fault: Link Cut; Link (Generalized Loopback) Restoration; Path Restoration]


Pre-compute vs. Real-time

Pre-computed
• Calculates restoration paths before a failure happens.
• Allows prior availability of reroute information to the nodes where actions need to be taken after a failure is detected.
• Enables fast restoration.

Real-time
• Calculates restoration paths after a failure happens.
• Restoration is slower.
• Enables more efficient capacity utilization.

• Preferred: Pre-computed


Centralized vs. Distributed

Centralized restoration:
• Computes restoration and primary paths for all demands with up-to-date information
• Routes may then be downloaded into nodal databases. Effectiveness?
  – More capacity efficient
  – Possibly slow (but may be executed in the background)
  – Scalability in question.

Distributed restoration:
• Source and destination nodes dynamically search for the protection wavelengths required to reestablish the disrupted lightpath
• Lacking knowledge of the sharing databases of other OXCs, a node may not be able to determine backup sharability for any given primary path

• Preferred:
  – Central path determination
  – Distributed restoration


Protection Topologies - Linear

Two nodes connected to each other with two or more sets of links

[Figure: Working and Protect links in the (1+1) and (1:n) configurations]


Mesh Restoration vs Ring/Linear Protection

Attributes                         Linear APS  Ring PS         Mesh Restoration
Spare Capacity Needed              Most        Moderate        Least
Fiber Counts                       Highest     Moderate        Moderate
Restoration Time                   <50 ms      <50 ms          2-10 seconds
Software Complexity                Least       Moderate        Most
Protection Against Major Failures  Worst       Medium          Best
Planning/Operations Complexity     Least       Moderate/least  Most

Extracted from: T-H. Wu, Emerging Technologies for Fiber Network Survivability, See References


IP layer restoration

IP Layer Restoration (real-time)

Achieved by exchanging control messages between adjacent routers
• Re-determine the affected route
• Update routing tables
• Propagate changes (OSPF, BGP-4)

Capable of recovery from multiple faults

Slow (10s of seconds to minutes – Fumagalli); requires online processing upon failure
• Fault discovery:
  – Explicitly: ICMP messaging
  – Implicitly: expiring of timers

Guarantees network-wide survivability

Independent of the underlying physical network

[Figure: layer stack (Physical, Data Link, Network (IP), Transport, Session, Presentation, Application)]


MPLS layer restoration

MPLS Layer Protection
• Real-time or pre-computed
• Line or path level protection
• Protection path is node and link disjoint from the primary path.
• Protection path may be allocated to low-priority traffic in the absence of network failure.
• Faster than dynamic IP rerouting
• Working LSPs have pre-established node/link disjoint protection paths

[Figure: layer stack with MPLS shown alongside Physical, Data Link, Network, Transport, Session, Presentation, Application]


Optical layer restoration

Optical layer restoration
• Real-time or pre-computed
• Ring protection or mesh restoration
• No visibility into higher layer operations.
• May be a wasteful use of resources.
  – For ring protection, there is over 100% capacity redundancy
  – For mesh restoration, a 60-80% physical redundancy level is typical.
• Not recommended for node (or software) failures
• Faster than higher layer restorations (??)

[Figure: layer stack (Physical, DWDM (Optical), Network (IP), Transport, Session, Presentation, Application)]


Multilayer Recovery (1/2)

In a multilayer network, each layer may have its own recovery mechanisms.

Not every failure in a particular layer can be resolved in that same layer.

If a failure can be resolved in several layers, uncoordinated actions may produce inefficient results.

Coordination among the layers is needed!


Multilayer Recovery (2/2)

Sequential Approach [1]
• a hold-off time imposes a chronological order among the recovery mechanisms adopted in the different layers;
• alternatively, a “token” may be used to impose a sequential order among the different layers.

Integrated Approach [1]
• a single recovery scheme has a full overview of all the layers;
• the recovery scheme may decide when and in which layer (or layers) the recovery actions must be taken.

[1] D. Colle et al., “Data-centric optical networks and their survivability”, IEEE Journal on Selected Areas in Communications, Volume 20, Issue 1, Jan. 2002, pages 6-20
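The hold-off timer of the sequential approach can be sketched in a few lines of Python. The model is deliberately simplified and makes several assumptions: two layers only, the upper layer starts recovering only when the hold-off timer expires, and all timing values are hypothetical.

```python
def sequential_recovery(lower_recovery_ms, upper_recovery_ms, hold_off_ms):
    """Return (layer that restores the service, total recovery time in ms).

    lower_recovery_ms is None when the lower layer cannot resolve the failure.
    The upper layer waits for the hold-off timer before acting.
    """
    if lower_recovery_ms is not None and lower_recovery_ms <= hold_off_ms:
        return ("lower", lower_recovery_ms)
    return ("upper", hold_off_ms + upper_recovery_ms)


# Hypothetical values: lower-layer protection in 50 ms, upper-layer rerouting
# in 1000 ms, hold-off timer of 100 ms.
print(sequential_recovery(50, 1000, 100))    # ('lower', 50)
print(sequential_recovery(None, 1000, 100))  # ('upper', 1100)
```

The second call shows the cost of the approach: when the lower layer cannot recover, the hold-off time is added to the upper layer's own recovery time.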