
WCE-Reliability and Availability

Date post: 21-Dec-2015
Category:
Upload: nymiter
COPYRIGHT © 2014 ALCATEL-LUCENT. ALL RIGHTS RESERVED.

ALCATEL-LUCENT — CONFIDENTIAL — SOLELY FOR AUTHORIZED PERSONS HAVING A NEED TO KNOW — PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION

vRAN 2014/09/17

9771 WCE Reliability & Availability

[Figure: title-slide diagram. Labels: Wireless packet core; Standard Router; ALU Provided or Telco Provided Computing Cloud; IP; 3G – 4G]

Bringing NFV & vRAN to market Now

[Figure: Wireless Cloud Element compared with a conventional deployment. Labels: BBU; IP mobile backhaul; Centralized baseband; CPRI over fiber; RF only sites; Macrocell; Metrocell; IP]

AGENDA

Wireless Cloud Element Product Strategy

Wireless Cloud Element


Wireless Cloud Element Demo

Wireless Cloud Element High Availability – Reliability

No Single Point of Failure

• 99.999% availability in a Network Function (e.g. RNC) requires that failures are rapidly detected and that a spare component takes over

• Failures with a duration of 15 s or greater are classified as outages

• To ensure that the active and spare components do not fail together, they must never both depend on a single component or point of failure

• WCE uses “anti-affinity” rules to prevent allocation of active and spare VMs on the same blade

• WCE’s use of redundant links, switches and storage ensures that no single failure disables application redundancy schemes

• Common sparing algorithms are:

- 1+1

- N+1

- N+M

• All three techniques are used in the virtual RNC
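
The 99.999% target and the 15 s outage threshold above can be turned into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative and not part of WCE.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def downtime_budget_seconds(availability: float) -> float:
    """Return the allowed downtime per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def is_outage(duration_seconds: float, threshold_seconds: float = 15.0) -> bool:
    """Classify a failure as an outage using the 15 s threshold from the slide."""
    return duration_seconds >= threshold_seconds


budget = downtime_budget_seconds(0.99999)
print(f"Five-nines budget: {budget:.1f} s/year ({budget / 60:.2f} min/year)")
```

Under these assumptions the budget is roughly 315 seconds (a little over 5 minutes) per year, which is why failures have to be detected and covered by a spare within seconds.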

WCE Platform – Hardware Configurations

• Two Cabinets with up to 4 Blade System enclosures (2nd Cabinet has no SAN)

• Minimum Configuration: 6 server blades

• Max Configuration: 64 Server Blades (16 blades per enclosure)

• Server blades can be added in increments of 1 (the reference configuration concept used in the 9370 does not exist)

• Coverage requirements (# of cells) determine how many CMUs are required

• Traffic requirements (Erlangs, Mbps, signaling) determine how many UMUs and PCs are required

• The numbers of CMUs, UMUs and PCs determine how many blades are required
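
The sizing chain above (cells and traffic determine VM counts, VM counts determine blades) can be sketched as below. This is only an illustration: `vms_per_blade` is a hypothetical packing density that the slides do not give, and the sketch ignores other VM types and anti-affinity constraints. The 6-blade minimum and 64-blade maximum come from the slide.

```python
import math

MIN_BLADES = 6    # minimum configuration stated on the slide
MAX_BLADES = 64   # 4 enclosures x 16 blades, stated on the slide


def blades_required(cmus: int, umus: int, pcs: int, vms_per_blade: int) -> int:
    """Estimate the blade count for a given VM mix.

    vms_per_blade is a hypothetical packing density, not a WCE figure.
    """
    if vms_per_blade < 1:
        raise ValueError("vms_per_blade must be at least 1")
    total_vms = cmus + umus + pcs
    # Blades are added in increments of 1, so round up to a whole blade.
    blades = max(MIN_BLADES, math.ceil(total_vms / vms_per_blade))
    if blades > MAX_BLADES:
        raise ValueError("configuration exceeds the 64-blade maximum")
    return blades


print(blades_required(cmus=4, umus=10, pcs=4, vms_per_blade=2))
```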

WCE Hardware – HP C7000 Front View

• The HP C7000 is a fully redundant, carrier-grade computing system

• The front of the chassis contains:

- 16 hot-swappable, dual-socket Xeon processor blades

- 6 (5+1) -48 V DC power supplies

- HP Insight Display for local maintenance

WCE Hardware – HP C7000 Back View

• The back of the C7000 contains:

- All cables

- 10 shared fan modules

- DC power input

- Redundant Onboard Administrators with OAM access to all of the boards

- Dual 6125XLG Ethernet switches

- (Optional) 6120 Ethernet OAM switch(es)

SAN Storage

NetApp E5424

• Fully redundant path from host ports to drives

• Each controller can access all drive ports

• Each drive chip can access every drive

• Top-down, bottom-up cabling ensures continuous access

• Expandable via an expansion unit from 24 to 48 drives

• DC powered, NEBS certified

• Data is spread across the drives via distributed RAID-6

[Figure: SAN block diagram. Labels: Controller A; Controller B; 6 Gb SAS chip (x2); 32-port 6 Gb SAS (x2); ESM (x2); 6 Gb SAS expander (x2); 24 drives (x2); drive]

WCE Link and Switch Redundancy

• WCE’s hardware configuration ensures there is no single point of failure

• Redundant, active-active, 10 Gigabit Ethernet data-path links

• Redundant switches

• Link or switch failure is rapidly detected and traffic is moved to alternate links

WCE 4 Interconnect Topology

[Figure: interconnect topology. Labels: e5400 SAN with Ctlr A and Ctlr B (ports A1, A2, B1, B2); 10GbE iSCSI; C7000 #1 to #4, each with two 6125XLG switches; 4x 10GbE (Backplane IRF); 40GbE (IRF); 40GbE MAD; LAG Group (x2); *NHR Active; *NHR Standby. Legend: 10GbE DAC cable; 10GbE MM fiber; 40GbE QSFP+ DAC cable; Multi-Chassis LAG link]

*NOTE: If Core Network and RAN NHR are required, the uplink connections are replicated to the second pair of NHRs

Actual link configuration depends on bandwidth requirements

Dynamic VM Allocation – Anti-Affinity

• WCE uses fully dynamic VM allocation: there is no static association between VMs and blades

• VMs can be created on any blade assigned to the cluster, even if that blade is in another physical frame

• Multiple tenants can share a single data-center

• To ensure services are highly available, “anti-affinity” rules separate redundant VMs onto different physical blades

[Figure: Multiple RNCs within a single data-center]
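
The anti-affinity idea can be sketched as a placement routine that refuses to put two VMs of the same redundancy group on one blade. This is a simplified illustration, not the scheduler WCE or VMware actually uses; the least-loaded tie-break and all names are assumptions.

```python
from collections import defaultdict


def place_vms(vms, blades):
    """Assign each VM to a blade so that no two VMs of one group share a blade.

    vms: list of (vm_name, anti_affinity_group) tuples.
    blades: list of blade names.
    Returns a dict mapping vm_name -> blade name.
    """
    groups_on_blade = defaultdict(set)   # blade -> groups already hosted there
    load = {blade: 0 for blade in blades}
    placement = {}
    for vm_name, group in vms:
        # Only blades that do not yet host a member of this group qualify.
        candidates = [b for b in blades if group not in groups_on_blade[b]]
        if not candidates:
            raise RuntimeError(f"no blade satisfies anti-affinity for {vm_name}")
        chosen = min(candidates, key=lambda b: load[b])  # least-loaded blade
        placement[vm_name] = chosen
        groups_on_blade[chosen].add(group)
        load[chosen] += 1
    return placement


placement = place_vms(
    [("3goam-a", "3goam"), ("3goam-b", "3goam"), ("da-a", "da"), ("da-b", "da")],
    ["blade-1", "blade-2"],
)
print(placement)
```

With only one blade available, the second VM of any group cannot be placed and the sketch raises an error, which mirrors the rule that an active and its spare must never share a blade.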

Carrier Grade Overview

Software implementation of Carrier Grade Sparing

99.999% availability requires that failures are rapidly detected and that a spare component takes over

Two CG Domains:

• Spared Nodes (3gOAM, CMU, PC & DA) that are required to maintain cells and allow new calls to be originated. Characterized by shared data and rapid (on the order of seconds) switch of activity.

• Un-spared Nodes (UMU) that are UE-specific, where sharing data is not required and return to service is slow (on the order of minutes).

[Figure: vRNC node types and sparing. Labels: vCenter Server; LRC Mgr; 3gOAM (RNC Operations), 1+1 (2); Disk Access, 1+1 (2); NAS Front End, to SAN; CMU (Cell Management); PC (Transport Termination); UMU (User Management); N+M (2-16); N+1 (2-30); Un-spared (1-65); No Single Point of Failure]

Message Transfer Framework & Multi-Ring TOTEM Protocol (Patent Pending)

• When building carrier grade applications on an infrastructure that cannot provide lossless message transfer, network functions must provide their own messaging systems. A new Message Transfer Framework (MTF) was created for the WCE virtual RNC that offers the following services:

- Point-to-point best effort (over UDP),

- Point-to-point reliable (over SCTP),

- Point-to-point reliable and ordered (over multi-ring TOTEM),

- Multicast best effort (over UDP), and

- Multicast reliable and ordered (over multi-ring TOTEM).

[Figure: Multi-Ring TOTEM Ring Messaging. Labels: UMU Ring; CMU Ring; PC Ring; 3gOAM (gateway); Corosync nodes within a ring]
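
The five MTF services listed above amount to a lookup from delivery requirements to a transport. The sketch below encodes only that list; the `Transport` enum and `select_transport` function are illustrative names, not the MTF API, which the slides do not describe.

```python
from enum import Enum


class Transport(Enum):
    UDP = "UDP"
    SCTP = "SCTP"
    TOTEM = "multi-ring TOTEM"


# (multicast, reliable, ordered) -> transport, exactly as listed on the slide.
MTF_SERVICES = {
    (False, False, False): Transport.UDP,    # point-to-point best effort
    (False, True, False): Transport.SCTP,    # point-to-point reliable
    (False, True, True): Transport.TOTEM,    # point-to-point reliable, ordered
    (True, False, False): Transport.UDP,     # multicast best effort
    (True, True, True): Transport.TOTEM,     # multicast reliable, ordered
}


def select_transport(multicast: bool, reliable: bool, ordered: bool) -> Transport:
    """Return the transport for a requested service, or raise if not listed."""
    key = (multicast, reliable, ordered)
    if key not in MTF_SERVICES:
        raise ValueError("combination not offered in the service list")
    return MTF_SERVICES[key]


print(select_transport(multicast=True, reliable=True, ordered=True).value)
```

Combinations that do not appear in the list (for example reliable but unordered multicast) are rejected rather than guessed.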

CMU Sparing

• Each Cell Management Unit (CMU) hosts 3 Service Groups (collections of NodeBs)

• Each Service Group is independently spared

- single failures can be restarted, or

- relocated to another CMU

• Entire CMU failure results in all Service Groups being relocated

- the failed CMU is restarted as a spare

• Failover typically takes 10-15 seconds

[Figure: Before failure: CMU-1 hosts SG1, SG2, SG3; CMU-2 hosts SG4, SG5; CMU-3 hosts SG6. After failure of CMU-1: CMU-1 is empty; CMU-2 hosts SG4, SG3, SG5; CMU-3 hosts SG1, SG6, SG2]
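
The whole-CMU failure case can be sketched as follows. The slides do not state how a target CMU is chosen, so the least-loaded policy and the treatment of "3 Service Groups per CMU" as a hard capacity are assumptions; the exact assignment this sketch produces may differ from the figure.

```python
SLOTS_PER_CMU = 3  # assumption: 3 Service Groups per CMU is a capacity limit


def relocate_service_groups(hosting, failed_cmu):
    """Move every Service Group off failed_cmu onto surviving CMUs.

    hosting: dict mapping CMU name -> list of Service Group names.
    Returns a new dict; the failed CMU comes back empty, as a spare.
    """
    result = {cmu: list(sgs) for cmu, sgs in hosting.items()}
    orphans = result[failed_cmu]
    result[failed_cmu] = []  # restarted as a spare
    survivors = [cmu for cmu in result if cmu != failed_cmu]
    for sg in orphans:
        free = [c for c in survivors if len(result[c]) < SLOTS_PER_CMU]
        if not free:
            raise RuntimeError(f"no spare capacity for {sg}")
        target = min(free, key=lambda c: len(result[c]))  # least-loaded CMU
        result[target].append(sg)
    return result


before = {"CMU-1": ["SG1", "SG2", "SG3"], "CMU-2": ["SG4", "SG5"], "CMU-3": ["SG6"]}
after = relocate_service_groups(before, "CMU-1")
print(after)
```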

VMware HA Failover – Fast Return to Redundancy

• VMware’s HA feature isn’t fast enough for Telecom application failover, but it is useful for return to redundancy

• The vRNC application HA mechanisms detect failures and start spare VMs in seconds

• VMware’s HA feature detects failed hosts and restarts (now spare) VMs on a designated failover host in minutes

• Mean Time To Redundancy goes from hours to minutes

[Figure: hosts before and after failure, including the designated failover host]

WCE Geo-Redundancy / Reflection

[Figure: two sites, WCE Uptown and WCE Midtown. Labels: VMs of RNC-1; VMs of RNC-2; RNC-1 Reflection; RNC-2 Reflection]

AGENDA

Wireless Cloud Element Product Strategy

Wireless Cloud Element


Wireless Cloud Element Demo

Wireless Cloud Element High Availability – Availability

Shadow (Patent Pending)

• WCE provides the capability to create a “shadow” of a network element within the same physical hardware as the service-providing network element

• This shadow can be used for:

- Fast and Graceful Reset

- Software Upgrade

- Software Configuration Change

- Virtual or Physical Machine Change

- Geo-Redundancy

• Roll-back to the previous version is possible after an upgrade

[Figure: Core Network, RNC instances and iBTSs]

Shadow Network Functions

Shadow technology is useful for any type of critical configuration change.

Here is the process:

1. While the active tenant provides service, create the shadow tenant

2. Once the shadow tenant is ready, the operator initiates a switch

2a. The active tenant is disconnected from the network, becoming a shadow

2b. The shadow tenant is connected to the network, becoming active

2c. The tenant re-establishes links to other network equipment

3. Once the operator is satisfied with the configuration change, the prior version of the tenant is removed

[Figure: Core Network, RNC tenants and iBTSs at each step]
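
The switch in step 2 can be modeled as a role swap between two tenants. This is a minimal sketch under stated assumptions: the `Tenant` class, the `connected` flag and the version labels are hypothetical, and link re-establishment (2c) is only marked, not modeled.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    version: str
    connected: bool = False  # True while attached to the network


def shadow_switch(active: Tenant, shadow: Tenant):
    """Steps 2a-2c: swap roles between the active and the shadow tenant."""
    if not active.connected or shadow.connected:
        raise ValueError("expected a connected active and a disconnected shadow")
    active.connected = False   # 2a: active is disconnected, becoming a shadow
    shadow.connected = True    # 2b: shadow is connected, becoming active
    # 2c: link re-establishment toward peers would happen here (not modeled)
    return shadow, active      # (new active, new shadow)


old = Tenant("release-A", connected=True)   # hypothetical version labels
new = Tenant("release-B")                   # step 1: shadow tenant created
active, shadow = shadow_switch(old, new)    # step 2: operator-initiated switch
print(active.version, shadow.version)
```

Because the prior tenant is kept until step 3, rolling back is the same switch applied in the opposite direction: `shadow_switch(active, shadow)`.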

WCE Zero Downtime Maintenance

• Higher-level virtualization features are built around the capability to move active VMs between servers (called vMotion or Live Migration)

• Using live migration we enable zero downtime maintenance of hardware

[Figure: a Virtual Machine (x86, Guest O/S, Disk) shown on the Hypervisor of Blade 1 and of Blade 2]

Key Performance Indicators during Maintenance

Measured results from a Live vRNC

Call success stayed above 95% for the entire time even though all of the hardware was upgraded, a new S/W load was applied and a critical configuration change (requiring an RNC reset) was made.

A combination of Shadow upgrades (a total of four) and Live VM Migration was used.

• 32 blades upgraded (HeartBleed)

• New S/W Load

• Critical Configuration Change

[Figure: KPIs of a Live vRNC During Shadow Upgrades. Vertical axis: 0% to 100%. Horizontal axis: 5/18/14 6:00 PM to 5/22/14 12:00 AM, with "Start of Upgrade" marked. Series: CallSetupSuccessRateCS(%); HSDPA_Accessibility_SuccessRate(%); 3G2GHHOExecutionSuccessRate(%); SHO_SuccessRate_Cel_SQM(%); HSUPA(%); Calldrop_HSDPA_cell(%); HSUPA_DropRate(%); RABDropRate_PS(%); RABDropRateCS_Voice(%)]

Key Performance Indicators during Live Migration

Measured results from a vRNC under heavy traffic

Movement of live VMs from host to host results in small impacts to KPIs

The worst case was the movement of cmu-7, which resulted in an HA system timeout and switch of activity

Note the vertical axis ranges from 99.9% to 100%

[Figure: KPIs over time, 5/13/2014 18:02 to 18:53. Percentage axis: 99.9 to 100. Count axis: 40000 to 45000. Annotated events: add/delete; vMotion umu-29; vMotion pc-4; vMotion cmu-7; vMotion 3goam; vMotion da. Other labels: establishments; session; mobility; transfer; calls]

