
Post on 24-May-2020



How Service Optimization Keeps ViaSat Flying

Mike Craft - @crafty_house
Brian Eckblad - @brianeckblad
Travis Newhouse - @travis_newhouse

Outline

Who is ViaSat?

Why OpenStack? Yussa lika some cloudz?

Current state of the cloud

Key areas of interest to be successful

Challenges and solutions

ViaSat

ViaSat is a global broadband services and technology company.

We provide consumer, commercial, and government customers with communications services and systems that exceed expectations for performance, anywhere in the world.

We think big, we act intelligently, and we’re not done…we’re just beginning.

www.viasat.com

Why OpenStack at ViaSat?

Motivation:

● On-Demand Infrastructure with self service
● Reduce infrastructure capital and operational costs
● Transparent capacity scaling

Range of Applications …

● Customer-facing (Airline, Government, Residential, Commercial)
● Employee-facing enterprise applications
● Internal development and test environments

○ Enterprise, Service Delivery w/NFV

History of OpenStack at ViaSat

Internal POC environment went into service in mid-2014, used by select development teams - Havana w/OVS

Production design started in Sept 2014; buildout started in early Nov 2014, with the first production cloud going live in Dec 2014 - Icehouse w/Linux Bridge

Production clouds are running Juno, Kilo and Liberty

For our production releases, we partnered with Rackspace for both professional services and support, allowing us to bring the cloud to the enterprise more quickly

Evolving toward in-house supported deployments

OpenStack Operations at ViaSat

5 private clouds deployed using OpenStack-Ansible (OSA)

200+ hosts

7000+ instances

2+ PB storage between Cinder and Swift

Linux Bridge ML2 plugin

6-member DevOps team, no silos!

300+ internal users/customers

Unique aspects

Providing self-service IT to users

Supporting projects spanning private and public platforms

Lean operations team: 1 operator per 1750+ instances

Highly dynamic workloads: 100+ instances created/deleted per day

Oversubscribed compute ratio of up to 4:1 depending on workload

Densest cloud supports NFV development: 50+ hosts, 2500+ instances, up to 80 instances per host

Network underlay is vendor agnostic
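The 4:1 CPU oversubscription above is the kind of policy set through Nova's allocation-ratio options; a minimal sketch of the relevant nova.conf lines (the exact values are per-host tuning decisions, illustrative here):

```ini
[DEFAULT]
# Schedule up to 4 vCPUs per physical core (the 4:1 ratio above).
cpu_allocation_ratio = 4.0
# Keep memory closer to 1:1 to reduce the risk of host swapping.
ram_allocation_ratio = 1.0
```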

Operating a lean team

Availability

Visibility

Cost Management

Capacity Planning

Self-service all the things

Availability is key

Understand your customer

Operator must know if resources are available

Real-time data + history for comparison against baseline
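A minimal sketch of comparing real-time data against a historical baseline (the sample latency values and the 3-sigma threshold are illustrative assumptions):

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """Flag a current reading that deviates more than k standard
    deviations from the historical baseline."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) > k * sigma

# Hypothetical per-minute API latency samples (ms) forming the baseline.
baseline = [12, 14, 11, 13, 12, 15, 13, 12, 14, 13]
print(is_anomalous(baseline, 40))  # spike well outside the baseline
print(is_anomalous(baseline, 13))  # within normal range
```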

Visibility into hypervisor and instances

Where does the problem exist: hypervisor or instance?

Is hypervisor overloaded?

Which instance is generating IOPS load?
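A minimal sketch of answering "which instance is generating the IOPS load" from per-instance block counters (for example, deltas of the counters `virsh domblkstat` reports per domain; the instance names and numbers here are illustrative):

```python
def top_iops(stats, n=3):
    """Rank instances by total read+write operations in a sampling
    interval. `stats` maps instance name -> (read_ops, write_ops)."""
    ranked = sorted(stats.items(),
                    key=lambda kv: kv[1][0] + kv[1][1],
                    reverse=True)
    return [name for name, _ in ranked[:n]]

sample = {
    "web-01": (120, 40),
    "db-02": (900, 1500),     # heavy writer
    "ci-runner": (300, 700),
}
print(top_iops(sample, 2))  # ['db-02', 'ci-runner']
```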

Manage infrastructure cost

Private cloud differs from the pay-per-use model of public cloud

The organization must collaborate to utilize resources efficiently

Reclaim unused and under-utilized resources

What instances are under-utilized?

Does a user require a specific flavor size for an instance? Make the legos fit!
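A minimal sketch of flagging reclaim candidates from utilization history (the 5% CPU threshold and 14-day window are illustrative assumptions, not an actual policy):

```python
def reclaim_candidates(usage, cpu_pct=5.0, days=14):
    """Flag instances whose daily average CPU utilization stayed below
    a threshold for the whole observation window."""
    out = []
    for name, samples in usage.items():
        if len(samples) >= days and max(samples) < cpu_pct:
            out.append(name)
    return out

usage = {
    "idle-dev": [1.2] * 14,
    "busy-db": [45.0] * 14,
    "new-vm": [0.5] * 3,   # too little history to judge
}
print(reclaim_candidates(usage))  # ['idle-dev']
```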

Capacity Planning

What is the utilization of the infrastructure? Memory? Disk? CPU?

What is the utilization trend over time?

When will infrastructure resources be exhausted?
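The exhaustion question above can be sketched as a least-squares trend projection (daily utilization samples assumed; values illustrative):

```python
def days_until_exhaustion(samples, capacity):
    """Fit a least-squares line to daily utilization samples and project
    how many days remain until the trend crosses total capacity.
    Returns None if the trend is flat or decreasing."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    return (capacity - intercept) / slope - (n - 1)

# Growing 10 TB/day toward a 100 TB ceiling: 5 days of headroom left.
print(days_until_exhaustion([10, 20, 30, 40, 50], 100))
```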

Self-Service for Users

Partnered with AppFormix to provide monitoring as a service for projects and instances

Expose underlying infrastructure data points

Enables users to answer questions about their resource utilization

Users can understand issues around hypervisor and storage health

Transparency is key to empower users

Standardized network design

Challenge: Right-sizing Instances

Started with custom flavors for everything.
● Not scalable for operations team
● Inefficient workload placement, e.g., CPU exhaustion vs. disk

Now, standardized on flavor sizes:
● Avoids resource fragmentation
● Improves capacity planning

Users need data to choose the right size
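A minimal sketch of "making the legos fit": snapping a custom request to the smallest standard flavor that satisfies it (the flavor catalog here is hypothetical):

```python
# Hypothetical standard flavor catalog: name -> (vCPUs, RAM GiB, disk GiB)
FLAVORS = {
    "m1.small":  (1, 2, 20),
    "m1.medium": (2, 4, 40),
    "m1.large":  (4, 8, 80),
    "m1.xlarge": (8, 16, 160),
}

def smallest_fit(vcpus, ram_gb, disk_gb):
    """Return the smallest standard flavor that satisfies the request,
    so odd-shaped custom requests snap to the catalog."""
    candidates = [
        (spec, name) for name, spec in FLAVORS.items()
        if spec[0] >= vcpus and spec[1] >= ram_gb and spec[2] >= disk_gb
    ]
    return min(candidates)[1] if candidates else None

print(smallest_fit(2, 3, 25))  # 'm1.medium'
```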


Challenge: Hypervisor health

Virtual memory thrashing

- Is it instance memory oversubscription?
- Is it disk block cache exhaustion?

CPU contention

Disk I/O

- Latency issues
- Tenant misusing software RAID over Cinder LVM
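A minimal sketch of the memory-oversubscription question above: compare allocated guest memory against physical host memory (the host and instance sizes are illustrative):

```python
def memory_overcommit(host_ram_gb, instance_ram_gb):
    """Ratio of allocated guest memory to physical host memory; ratios
    well above 1.0 raise the risk of host swapping and virtual memory
    thrashing."""
    return sum(instance_ram_gb) / host_ram_gb

# A 256 GiB host carrying 288 GiB of allocated guest memory.
ratio = memory_overcommit(256, [32, 64, 64, 64, 64])
print(ratio)
```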


Challenge: Right-sizing network

Initially allowed each project to request a network of any size or design. Too many snowflakes.

Standardized on L2/L3 project design

Standardized IP project allocations

Re-architected the underlay to simplify the design and provide better tenant isolation

Resulted in a better end user experience
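The standardized IP allocation above can be sketched with Python's `ipaddress` module: carve one fixed-size subnet per project out of a supernet (the supernet and prefix length are illustrative):

```python
import ipaddress

def allocate_project_subnets(supernet, prefix, count):
    """Carve fixed-size (standardized) subnets out of a supernet,
    one per project, in order."""
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=prefix)
    return [str(next(subnets)) for _ in range(count)]

# Three projects, each getting a standard /24 from a /16.
print(allocate_project_subnets("10.20.0.0/16", 24, 3))
# ['10.20.0.0/24', '10.20.1.0/24', '10.20.2.0/24']
```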

Questions?