Achieving Five Nines of VNF Reliability in Telco-grade OpenStack Cloud
Panel Discussion
Kandan Kathirvel, AT&T; Eoin Walsh, Intel; Rimma Iontel, Red Hat, Inc.; and Fausto Marzi, Ericsson
Moderated by Haseeb Akhtar, Ericsson
1:50 PM – 2:30 PM on Wednesday, April 27, 2016
Converting PNF to VNF without cloud awareness is not optimal

Aspect | Physical Network Function (PNF) | Virtual Network Function (VNF)
Application | Purpose-built software, operationally perfected over decades | Early adoption and rapid evolution; multi-tenant (common framework)
Orchestration software | Mostly manual or contained | Common automation, rapidly evolving
Network fabric / SDN | Purpose-built & dedicated (most cases); physical connections, no SDN | New & evolving; requires significant innovation
Host OS & virtualization | Not virtualized (most cases); vendor-provided OS | Cloud provided (common)
Compute | Purpose-built and dedicated for the field of use | Commodity hardware (any vendor)
Datacenter / Central Office | Same | Same
VNF availability depends on Cloud & VNF resiliency
High Risk of Application Outages Low Risk of Application Outages Cloud Aware Applications
Most of VNFs Few
1 2 3 4 5
Openstack Region
Geo Location 1
(DC1) Geo Location 2
(DC2)
Geo Location 1
(DC1) Geo Location 4
(DC4)
99.99% (52.56 mins down/year) 99.9% (8.76 hrs down/year) 99.999%(5.26 Mins down/year)
VNF Availability Some VNFs current state
VNF HA in a region VNF - Single VM VNF HA in 2 regions at same DC VNF HA across 2 DCs VNF HA across 4 DCs
Optimal VNF
Single Instance of OpenStack region is about 99.9% (8.76 hours unplanned downtime per year)
Openstack Region
Openstack Region1
Openstack Region2
Single DC Single DC Single DC
Few Optimal 2 Regions at a DC
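A quick way to sanity-check these tiers is to convert availability percentages to downtime and to compose independent regions. The following is a minimal Python sketch under idealized assumptions (independent regional failures, instantaneous failover) that real deployments will not fully meet:

    # Availability arithmetic behind the tiers above (idealized sketch).
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    def downtime_minutes_per_year(availability):
        """Unplanned downtime implied by an availability fraction."""
        return (1 - availability) * MINUTES_PER_YEAR

    print(downtime_minutes_per_year(0.999))    # 525.6 min (~8.76 hrs) - three nines
    print(downtime_minutes_per_year(0.9999))   # 52.56 min             - four nines
    print(downtime_minutes_per_year(0.99999))  # 5.256 min             - five nines

    def combined_availability(region_availability, num_regions):
        """Availability of a VNF running HA across independent regions:
        the service is down only if every region is down at once."""
        return 1 - (1 - region_availability) ** num_regions

    # Two independent 99.9% regions already exceed five nines on paper:
    print(combined_availability(0.999, 2))  # 0.999999

In practice, correlated failures (shared power, fabric, or control plane) and non-zero failover time keep real numbers below this bound, which is why the later slides push HA across DCs and geo locations.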
Proposed OpenStack Enhancements
• Hitless upgrades – reduce overall platform downtime
• Policy-driven live/offline migration, inclusive of SR-IOV, CPU pinning, and huge pages support
• Multi-location awareness & workload placement
• Resiliency/stability testing framework in OpenStack Rally – measure and report
• Auto-healing framework for OpenStack controllers
• Automated provisioning and monitoring (Ceilometer, Heat, and Ironic)
• Intelligent workload placement (Nova scheduler)
• Tools to measure, monitor, and report end-to-end platform SLA
• Global and local compute HA management

VNF evolution
• Support HA both locally and globally
• Leverage OpenStack/cloud platform resiliency features, e.g., anti-affinity to place VMs on different servers (see the sketch below)
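To make the anti-affinity bullet concrete, here is a minimal sketch using python-novaclient (the era-appropriate client) to create an anti-affinity server group and boot two VNF VMs into it. The Keystone session, image ID, and flavor ID are placeholders, not values from the talk:

    # Sketch: place two VNF VMs on different compute hosts via anti-affinity.
    from novaclient import client

    # Assumes an already-authenticated Keystone session (placeholder).
    nova = client.Client('2', session=keystone_session)

    # A server group whose policy forbids co-locating its members.
    group = nova.server_groups.create(name='vnf-anti-affinity',
                                      policies=['anti-affinity'])

    # Boot each VM into the group; the Nova scheduler places them on
    # distinct hosts, or fails the boot if it cannot honor the policy.
    for name in ('vnf-vm-1', 'vnf-vm-2'):
        nova.servers.create(name=name,
                            image='IMAGE_ID',    # placeholder
                            flavor='FLAVOR_ID',  # placeholder
                            scheduler_hints={'group': group.id})

The same hint can be expressed declaratively in a Heat template via the OS::Nova::ServerGroup resource, which also fits the automated-provisioning bullet above.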
NFV Ready Architecture

[Architecture diagram: OSS/BSS on top, exposed through Open APIs; a service orchestration layer with Service Orchestration, Service Catalog, Service Assurance, Descriptor Repositories, and Analytics; VNF Manager, Network Orchestration, and SDN Controller; VNFs composed of VNFCs running on the virtualization layer (vCompute, vStorage, vNetwork) over physical Compute, Storage, and Network; the VIM provides Enhanced Platform Awareness plus Platform and Virtual Resource Monitoring & Reporting; Security and Services span the stack.]
Compute Node HA – Local

Prerequisite:
• Compute nodes use shared storage

Disaster workflow:
1. Disaster is detected
2. The compute node is evacuated (see the sketch below)
3. Users connect to the new service

Risks:
• Node fencing

[Diagram: compute nodes 1 through n, each running VMs and an HA agent, watched by an active HA controller; control nodes 1-3 form a Corosync + Pacemaker cluster with a replicated database.]
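As a minimal sketch of the evacuation step such an HA controller might perform, the snippet below reuses the python-novaclient handle from the earlier example. The host name is hypothetical, and the failed node is assumed to be fenced already; with shared storage, Nova rebuilds each instance on a new host while reusing its disk:

    # Sketch: evacuate all instances off a fenced compute node (shared storage).
    failed_host = 'compute-node-1'  # hypothetical; reported down and fenced

    # Every instance scheduled on the failed host, across all tenants.
    servers = nova.servers.list(search_opts={'host': failed_host,
                                             'all_tenants': 1})

    for server in servers:
        # With no target host given, the Nova scheduler picks one;
        # on_shared_storage=True reuses the instance disk rather than
        # rebuilding it from the image.
        nova.servers.evacuate(server, on_shared_storage=True)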
Compute Node HA – Global

[Diagram: DC1 (1.1.1.0/24) and DC2 (1.1.2.0/24) reach the Internet under BGP AS XXXXX; VM1, with floating IP 100.100.1.15, is evacuated from DC1 to DC2; a freezer-dr-api instance runs in each DC so the floating IP can follow the VM.]

Prerequisite:
• Data replication between the DCs

Operational workflow:
1. Floating IPs are retrieved from Nova (see the sketch below)
2. The IPs are announced with BGP or OSPF

Disaster workflow:
1. Disaster is detected
2. The compute node is evacuated
3. The floating IPs are retrieved on the other compute nodes
4. The floating IPs are announced with BGP or OSPF
5. Users connect to the new service

Risks:
• Node and DC fencing
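A minimal sketch of the "retrieve floating IPs from Nova" step, again with the novaclient handle from earlier; the host name is hypothetical. The announce step is left as a comment because it belongs to the routing stack (Freezer-DR plus a BGP/OSPF speaker), not to this snippet:

    # Sketch: collect the floating IPs of every instance on a given host.
    def floating_ips_for(nova, host):
        ips = []
        for server in nova.servers.list(search_opts={'host': host,
                                                     'all_tenants': 1}):
            # server.addresses maps network name -> list of address dicts;
            # floating IPs are tagged with OS-EXT-IPS:type == 'floating'.
            for addresses in server.addresses.values():
                ips.extend(a['addr'] for a in addresses
                           if a.get('OS-EXT-IPS:type') == 'floating')
        return ips

    for ip in floating_ips_for(nova, 'compute-node-dc2'):  # hypothetical host
        # Announce ip/32 via the local BGP or OSPF speaker here.
        print('would announce %s/32' % ip)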
Thanks! We need to build this together.