
Downtime is not an option - day 2 operations - Jörg Schad

Date post: 21-Jan-2018
Upload: codemotion
Transcript
Page 1: Downtime is not an option - day 2 operations -  Jörg Schad

1

Downtime is not an option How Fast Data and Microservices change the datacenter

@joerg_schad @dcos

Page 2: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 2

Jörg Schad

Distributed Systems Engineer

@joerg_schad

Page 3: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 3

In the beginning there was a big

Monolith

Page 4: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 4

Page 5: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

Hardware

Operating System

Application

5

COMPUTERS

Page 6: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

noun | ˈmīkrō/ /ˈsərvəs/ :

an approach to application development in which a large application is built as a suite of modular services. Each module supports a specific business goal and uses a simple, well-defined interface to communicate with other modules.*

Microservices are designed to be flexible, resilient, efficient, robust, and individually scalable.

*From whatis.com

OVERVIEW

Page 7: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

MICROSERVICES
- Polyglot
- Single Responsibility
- Smaller Teams
- Utilization
- Machine types/groups
- Dependency hell

[Diagram: three machines on shared infrastructure, each with its own operating system, running apps and services]

Page 8: Downtime is not an option - day 2 operations -  Jörg Schad

© Gerard Julien/AFP

Run everything in containers!

Page 9: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

CONTAINERS
- Rapid deployment
- Dependency vendoring
- Container image repositories
- Spreadsheet scheduling

[Diagram: three machines on shared infrastructure, each with an OS and a container runtime, running apps and services]

Page 10: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 10

CONTAINER ORCHESTRATION

CONTAINER SCHEDULING

RESOURCE MANAGEMENT

SERVICE MANAGEMENT
- Load Balancing
- Readiness Checking

Page 11: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 11

CONTAINER ORCHESTRATION

CONTAINER SCHEDULING
- Placement
- Replication/Scaling
- Resurrection
- Rescheduling
- Rolling Deployment
- Upgrades
- Downgrades
- Collocation

RESOURCE MANAGEMENT
- Memory
- CPU
- GPU
- Volumes
- Ports
- IPs
- Images/Artifacts

SERVICE MANAGEMENT
- Labels
- Groups/Namespaces
- Dependencies
- Load Balancing
- Readiness Checking

Page 12: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

CONTAINER ORCHESTRATION

[Diagram: layered stack with the labels Machine Infrastructure, Machine & OS (three machines), Container Runtime (one per machine), Orchestration (Resource Management, Scheduling, Service Management), and Web Apps & Services]

Page 13: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 13

MapReduce is crunching Data

Meanwhile...

Page 14: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 14

But then business demanded

FAST DATA

We need to turn faster!

Page 15: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 15

Fast Data

Processing modes: Batch, Micro-Batch, Event Processing

Latency scale: Days, Hours, Minutes, Seconds, Microseconds

Slower end of the scale: reports what has happened using descriptive analytics

Faster end of the scale: solves problems using predictive and prescriptive analytics

Example use cases: Billing, Chargeback; Product recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface

Page 16: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 16

The SMACK Stack

EVENTS: Ubiquitous data streams from connected devices (Sensors, Devices, Clients)

INGEST: Apache Kafka. Ingest millions of events per second

STORE: Apache Cassandra. Distributed & highly scalable database

ANALYZE: Apache Spark. Real-time and batch process data

ACT: Akka. Visualize data and build data driven applications

All components run on Mesos / DC/OS

Page 17: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 17

Datacenter

Page 18: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 18

NAIVE APPROACH

Typical Datacenter: siloed, over-provisioned servers, low utilization

Industry Average: 12-15% utilization

[Diagram: separate server silos for MySQL, microservice, Cassandra, Spark/Hadoop, and Kafka]

Page 19: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 19

Page 20: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 20

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS

Typical Datacenter: siloed, over-provisioned servers, low utilization

Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines

[Diagram: MySQL, microservice, Cassandra, Spark/Hadoop, and Kafka workloads, siloed versus multiplexed]
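
The utilization argument can be made concrete with a little arithmetic. The sketch below compares dedicated silos with a shared cluster. All numbers in it (per-workload CPU demand, server size, silo size, target headroom) are hypothetical and chosen only for illustration; they are not from the talk.

```python
import math

# Hypothetical average CPU demand per workload, in cores (illustrative only).
demand = {
    "MySQL": 2.0,
    "microservice": 1.5,
    "Cassandra": 3.0,
    "Spark/Hadoop": 4.0,
    "Kafka": 1.5,
}

CORES_PER_SERVER = 16  # assumed server size
SERVERS_PER_SILO = 2   # assumed over-provisioning: every silo owns two servers
HEADROOM = 0.7         # assumed target utilization for the shared cluster


def siloed_utilization(demand):
    """Utilization when every workload owns a dedicated group of servers."""
    capacity = len(demand) * SERVERS_PER_SILO * CORES_PER_SERVER
    return sum(demand.values()) / capacity


def shared_cluster(demand):
    """Servers needed, and resulting utilization, when workloads share machines."""
    total = sum(demand.values())
    servers = math.ceil(total / (CORES_PER_SERVER * HEADROOM))
    return servers, total / (servers * CORES_PER_SERVER)


servers, shared = shared_cluster(demand)
print(f"siloed: {siloed_utilization(demand):.1%}")
print(f"shared: {shared:.1%} on {servers} servers")
```

With these made-up inputs the shared cluster needs far fewer machines for the same demand, which is the effect the slide attributes to workload multiplexing.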

Page 21: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

Apache Mesos

• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support

Page 22: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 22

Page 23: Downtime is not an option - day 2 operations -  Jörg Schad

DC/OS ENABLES MODERN DISTRIBUTED APPS

[Diagram: layered stack]

Modern App Components: Microservices (in containers), Functions & Logic, Big Data + Analytics Engines, Streaming, Batch, Machine Learning, Analytics, Search, Time Series, SQL / NoSQL, Databases

Datacenter Operating System (DC/OS)

Distributed Systems Kernel (Mesos)

Any Infrastructure (Physical, Virtual, Cloud)

Page 24: Downtime is not an option - day 2 operations -  Jörg Schad

THE BASICS

DC/OS is …
● 100% open source (ASL2.0)
  + A big, diverse community
● An umbrella for ~30 OSS projects
  + Roadmap and designs
  + Docs and tutorials
● Not limited in any way
● Familiar, with more features
  + Networking, Security, CLI, UI, Service Discovery, Load Balancing, Packages, ...

Page 25: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

Container Options

Universal containerizer

Enhancements to the Mesos containerizer to support launching specific container formats (Docker, AppC, OCI (future), etc.)
● Reduces the need to maintain and update multiple containerizers
● Supports multiple container formats with a single containerizer

Container image support

An image provisioner component added to the Mesos containerizer, responsible for pulling, caching, and preparing container root filesystems

[Diagram labels: Launcher, Isolators, Provisioner; Process management, Container lifecycle hook, Container image support]

Page 26: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 26

DEMO

Page 27: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 27

GEO-ENABLED IoT

Page 28: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 28

DATA FLOW

Page 29: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 29

Keep it running!

Page 30: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 30

DAY 2 OPERATIONS

Monitoring
- Collecting metrics
- Routing events
- Downstream processing
  - Alerting
  - Dashboards
  - Storage (long-term retention)

Logging
- Scopes
- Local vs. Central
- Security considerations

Page 31: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 31

DAY 2 OPERATIONS

Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery

Troubleshooting
- Debugging
  - Services
  - System
  - Access?
- Tracing
- Chaos Engineering

Page 32: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 32

Troubleshooting
● Services: typically specific to the service; use logging (for example, dcos task log) and dcos node ssh for per-node investigations
● dcos task exec
  ○ Permissions?
● System:
  ○ Simple diagnostics via dcos node diagnostics
  ○ Comprehensive dump via clump
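
As a rough sketch of how the service-level commands above might be scripted, the helpers below only build the argument lists for the DC/OS CLI and run them if a `dcos` binary is on the PATH. The task name `my-service` is a placeholder, and the wrapper functions are hypothetical names introduced for illustration; only the `dcos task log`, `dcos task exec`, and `dcos node ssh --master-proxy --leader` invocations are taken from the CLI itself.

```python
import shutil
import subprocess


def dcos_command(*args):
    """Build the argv list for a DC/OS CLI call."""
    return ["dcos", *args]


def task_log(task_id):
    # `dcos task log <task>` prints the log of the given task.
    return dcos_command("task", "log", task_id)


def task_exec(task_id, *cmd):
    # `dcos task exec <task> <cmd>` runs a command inside the task's container.
    return dcos_command("task", "exec", task_id, *cmd)


def ssh_leader():
    # Open an SSH session on the leading master via the master proxy.
    return dcos_command("node", "ssh", "--master-proxy", "--leader")


def run(argv):
    """Run the command only if the CLI is installed; return its stdout or None."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


# "my-service" is a placeholder task name.
print(" ".join(task_log("my-service")))
print(" ".join(task_exec("my-service", "ps", "aux")))
```

Whether `dcos task exec` is permitted depends on the cluster's access configuration, which is the "Permissions?" question raised on the slide.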

Page 33: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 33

THANK YOU!

ANY QUESTIONS?

@dcos

[email protected]

/groups/8295652

/dcos/dcos/examples/dcos/demos

chat.dcos.io

Page 34: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 34

Failures

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 35: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

Distributed Systems could be so easy...

35

1. The network is reliable.

2. Latency is zero.

3. Bandwidth is infinite.

4. The network is secure.

5. Topology doesn't change.

6. There is one administrator.

7. Transport cost is zero.

8. The network is homogeneous.

*) https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

Page 36: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 36

Questions?

Code: https://git.io/vXUoy

http://grnh.se/ie76ru

Page 37: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved. 37

Monitoring

Page 38: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

METRICS

Measurements captured to determine the health and performance of the cluster

- How utilized is the cluster?
- Are resources being optimally used?
- Is the system performing better or worse over time?
- Are there bottlenecks in the system?
- What is the response time of applications?

Page 39: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

DC/OS METRIC SOURCES

● Mesos metrics
  ○ Resource, frameworks, masters, agents, tasks, system, events
● Container metrics
  ○ CPU, mem, disk, network
● Application metrics
  ○ QPS, latency, response time, hits, active users, errors

[Diagram: apps running in containers, on Mesos, on the OS]

Page 40: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

Before upgrading
1. Make sure cluster is healthy!
2. Perform backup
   a. ZK
   b. Replicated logs
   c. other state
3. Review release notes
4. Generate install bundle
   a. Validate versions

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 41: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

MESOS MASTER METRICS

● Metrics for the master node are available at the following URL:
  ○ http://<mesos-master-ip>/mesos/master/metrics/snapshot
  ○ The response is a JSON object that contains metric names and values as key-value pairs.
● Metric Groups:
  ○ Resources
  ○ Master
  ○ System
  ○ Slaves
  ○ Frameworks
  ○ Tasks
  ○ Messages
  ○ Event Queue
  ○ Registrar
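
A minimal sketch of reading that endpoint with the Python standard library follows. It assumes the URL shape shown on the slide and an endpoint reachable without authentication, which may not hold on a secured cluster. The helper names and the sample payload are made up for illustration.

```python
import json
from urllib.request import urlopen


def snapshot_url(master_ip):
    """URL of the master metrics snapshot, in the form given on the slide."""
    return f"http://{master_ip}/mesos/master/metrics/snapshot"


def parse_snapshot(body):
    """Parse the JSON response into a dict of metric name -> value."""
    return json.loads(body)


def group(metrics, prefix):
    """Select the metrics whose names start with `prefix`, e.g. "master/"."""
    return {name: value for name, value in metrics.items() if name.startswith(prefix)}


def fetch_snapshot(master_ip, timeout=5):
    """Fetch and parse a live snapshot (needs network access to the master)."""
    with urlopen(snapshot_url(master_ip), timeout=timeout) as response:
        return parse_snapshot(response.read().decode("utf-8"))


# Hand-written sample payload; the values are invented.
sample = '{"master/uptime_secs": 1234.5, "master/elected": 1.0, "system/load_1min": 0.4}'
metrics = parse_snapshot(sample)
print(group(metrics, "master/"))
```

Grouping by name prefix mirrors the metric groups listed above, since the metric names are slash-separated.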

Page 42: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

MESOS MASTER BASIC ALERTS

Metric | Value | Inference
master/uptime_secs | is low | The master has restarted
master/uptime_secs | < 60 for sustained periods of time | The cluster has a flapping master node
master/tasks_lost | is increasing rapidly | Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos
master/slaves_active | is low | Slaves are having trouble connecting to the master
master/cpus_percent | > 0.9 for sustained periods of time | DC/OS cluster CPU utilization is close to capacity
master/mem_percent | > 0.9 for sustained periods of time | DC/OS cluster memory utilization is close to capacity
master/disk_used & master/disk_percent | (no threshold given) | DC/OS disk space consumed by reservations
master/elected | is 0 for sustained periods of time | No master is currently elected
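
The alert rules in the table could be sketched as a function over one metrics snapshot. This is a simplification under stated assumptions: it evaluates a single snapshot (plus an optional previous one) rather than "sustained periods of time", and the `LOST_TASKS_STEP` threshold and the function name are hypothetical. Only the metric names and the 60-second and 0.9 thresholds come from the table.

```python
LOST_TASKS_STEP = 10  # assumed threshold for "increasing rapidly" between two snapshots


def check_master(metrics, previous=None):
    """Return alert messages for one snapshot (metric name -> value)."""
    alerts = []
    if metrics.get("master/elected", 0) == 0:
        alerts.append("no master is currently elected")
    # A missing uptime metric is treated as "no alert" rather than as a restart.
    if metrics.get("master/uptime_secs", float("inf")) < 60:
        alerts.append("master restarted recently")
    for name, label in (("master/cpus_percent", "CPU"), ("master/mem_percent", "memory")):
        if metrics.get(name, 0) > 0.9:
            alerts.append(f"cluster {label} utilization is close to capacity")
    if previous is not None:
        lost = metrics.get("master/tasks_lost", 0) - previous.get("master/tasks_lost", 0)
        if lost > LOST_TASKS_STEP:
            alerts.append("tasks are being lost rapidly")
    return alerts


healthy = {"master/elected": 1, "master/uptime_secs": 5000,
           "master/cpus_percent": 0.4, "master/mem_percent": 0.5, "master/tasks_lost": 3}
degraded = {"master/elected": 0, "master/uptime_secs": 12,
            "master/cpus_percent": 0.95, "master/mem_percent": 0.5, "master/tasks_lost": 40}

for message in check_master(degraded, previous=healthy):
    print(message)
```

A production setup would more likely express these rules in an alerting tool and require each condition to persist across several samples before firing.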

Page 43: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 43

Operations

UPGRADES

Page 44: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
   a. Start with standby
   b. Install new DC/OS
2. Agent rolling upgrade
3. Framework upgrades

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 45: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
2. Agent rolling upgrade
   a. Uninstall DC/OS
   b. Install new DC/OS
3. Framework upgrades

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 46: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
2. Agent rolling upgrade
3. Framework upgrades
   a. Orthogonal to DC/OS
   b. Ensure changes don’t affect existing apps

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]
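
The ordering of the three upgrade phases can be sketched as a small planning function. It is only an illustration of the sequence on the slides: the node and framework names are placeholders, and upgrading the leader last is an assumption derived from "Start with standby".

```python
def upgrade_plan(masters, leader, agents, frameworks):
    """Return (kind, name) upgrade steps in the order described on the slides."""
    standbys = [m for m in masters if m != leader]
    plan = [("master", m) for m in standbys]         # 1a. start with the standbys
    plan.append(("master", leader))                  # assumed: the leader goes last
    plan += [("agent", a) for a in agents]           # 2. agents, one at a time
    plan += [("framework", f) for f in frameworks]   # 3. frameworks, orthogonal to DC/OS
    return plan


# Placeholder names for a three-master, two-agent cluster.
plan = upgrade_plan(["m1", "m2", "m3"], "m2", ["a1", "a2"], ["marathon"])
for kind, name in plan:
    print(kind, name)
```

Upgrading one node at a time keeps a quorum of masters and most agents available throughout, which is what makes the procedure compatible with the "no downtime" goal of the talk.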

Page 47: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved. 47

Failure Handling

Page 48: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 48

Failure Handling

MESOS TASK FAILURE

Page 49: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 50: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; the TASK fails with "SEGFAULT :(" and Status Update messages are sent]

Page 51: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; three Launch Task messages lead to an EXECUTOR with a TASK]

Page 52: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; three Status Update messages report on the EXECUTOR's TASK]

Page 53: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 53

Failure Handling

MESOS AGENT FAILURE

Page 54: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 55: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 56: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 57: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 58: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; a Re-register message is shown]

Page 59: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 59

Failure Handling

MESOS HOST FAILURE

Page 60: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 61: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and two AGENTs; the third AGENT, its EXECUTOR, and its TASK no longer appear]

Page 62: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

LOCAL AGENT FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and two AGENTs; a Status Update message is sent]

Page 63: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and two AGENTs; a Resource Offer and three Launch Task messages lead to an EXECUTOR with a TASK]

Page 64: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MESOS TASK FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and two AGENTs; three Status Update messages report on the EXECUTOR's TASK]

Page 65: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 65

Failure Handling

MESOS MASTER FAILURE

Page 66: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 67: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 68: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 69: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; three "Leading Master" labels are shown]

Page 70: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; four Reregister messages are shown]

Page 71: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

MASTER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; four Reregistered messages are shown]

Page 72: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 72

Failure Handling

SCHEDULER FAILURE

Page 73: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 74: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 75: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK]

Page 76: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; "Framework ID" and "Leading Master" labels are shown]

Page 77: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; a Reregister message is shown]

Page 78: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; a Reregistered message is shown]

Page 79: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved.

SCHEDULER FAILURE

[Diagram: ZK, MASTER, MARATHON, CLIENT, and three AGENTs; one AGENT runs an EXECUTOR with a TASK; "Reconcile Tasks" and "Status Update" messages are shown]
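
The last step of scheduler failover, reconciling tasks, can be illustrated with a toy model. This is not the Mesos scheduler API: the function, the task ids, and the data layout are hypothetical. Only the idea comes from the slides, namely that a re-registered scheduler compares what it believes with the status updates it receives.

```python
def reconcile(known_tasks, reported_states):
    """Compare the scheduler's recorded task states with the reported ones.

    known_tasks:     task id -> state the scheduler last recorded
    reported_states: task id -> state received in status updates after reconciling
    Returns (changed, unknown): tasks whose state differs, and task ids for
    which no status was reported.
    """
    changed = {
        task: state
        for task, state in reported_states.items()
        if task in known_tasks and known_tasks[task] != state
    }
    unknown = sorted(task for task in known_tasks if task not in reported_states)
    return changed, unknown


# Placeholder task ids; the state names follow Mesos task state naming.
known = {"web.1": "TASK_RUNNING", "web.2": "TASK_RUNNING", "db.1": "TASK_RUNNING"}
reported = {"web.1": "TASK_RUNNING", "web.2": "TASK_FAILED"}

changed, unknown = reconcile(known, reported)
print("changed:", changed)
print("unknown:", unknown)
```

A scheduler would typically relaunch the tasks in either result set, since both indicate work that is no longer running as recorded.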

Page 80: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
   a. Start with standby
   b. Uninstall DC/OS
   c. Install new DC/OS
2. Agent rolling upgrade
3. Framework upgrades

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 81: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
2. Agent rolling upgrade
   a. Uninstall DC/OS
   b. Install new DC/OS
3. Framework upgrades

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 82: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

UPGRADE PROCEDURE

1. Master rolling upgrade
2. Agent rolling upgrade
3. Framework upgrades
   a. Orthogonal to DC/OS
   b. Ensure changes don’t affect existing apps

[Diagram: a Framework with its Scheduler; three masters (LEADER, STANDBY, STANDBY) and three ZK nodes; two Agents, each running an Executor with a Task]

Page 83: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved.

FUTURES (TBD)

Leverage maintenance primitives in Mesos to drain host

Upgrade management through DC/OS to perform rolling upgrades

Page 84: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 84

DAY 2 OPERATIONS

Monitoring
- Collecting metrics
- Routing events
- Downstream processing
  - Alerting
  - Dashboards
  - Storage (long-term retention)

Logging
- Scopes
- Local vs. Central
- Security considerations

Page 85: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 85

DAY 2 OPERATIONS

Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery

Troubleshooting
- Debugging
  - Services
  - System
- Tracing
- Chaos Engineering

Page 86: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 86

MONITORING

Page 87: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 87

MONITORING CONCEPT

Page 88: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 88

MONITORING TOOLING EXAMPLES

● local scraping:

a. collectd

b. cAdvisor*

● event router:

a. fluentd

b. Flume

c. Kafka*

d. logstash*

e. Riemann

*) available via Mesosphere Universe

Page 89: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 89

MONITORING TOOLING EXAMPLES

● storage:

a. Elasticsearch*

b. Graphite

c. InfluxDB*

d. KairosDB/Cassandra*

e. OpenTSDB/HBase

f. others such as a local filesystem, Ceph FS*, HDFS*, etc.

*) available via Mesosphere Universe

Page 90: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 90

MONITORING TOOLING EXAMPLES

● dashboard:

a. D3

b. Grafana*

c. SignalFx

● alerting:

a. BigPanda

b. PagerDuty

c. SignalFx

d. VictorOps

*) available via Mesosphere Universe

Page 91: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 91

MONITORING TOOLING EXAMPLES (INTEGRATED)

● Amazon CloudWatch
● AppDynamics
● Azure Monitor
● Circonus
● DataDog*
● dcos/metrics
● Ganglia
● Google Stackdriver
● Hawkular
● Icinga
● Librato
● Nagios
● New Relic
● OpsGenie
● Pingdom
● Prometheus
● Ruxit Dynatrace*
● Sensu
● Sysdig*
● Zabbix

*) available via Mesosphere Universe

Page 92: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 92

LOGGING

Page 93: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 93

LOGGING SCOPES

Page 94: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 94

LOGGING TOOLING EXAMPLES (PRIMITIVES)

● DC/OS logging overview

● Docker logging drivers

● systemd's journalctl

Page 96: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 96

MAINTENANCE

Page 97: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 97

Overview

● Upgrades: How to install a new version of X?
● Sizing: When to scale what (service-level vs. nodes)?
● User and package management: Who gets to access/install which services in what way?
● Networking: What services can talk to each other and in which way?
● Auditing: Who accessed what, when and how?
● Disaster Recovery: How is the continuous operation of the cluster and the services accomplished? What happens when the cluster (or critical infra components like ZK) goes down?

Page 98: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 98

OTHER TROUBLESHOOTING TECHNIQUES

● Tracing

○ Idea: identify latency issues and perform root-cause analysis in a distributed setup

○ OpenTracing

● Chaos Engineering

○ Idea: proactively break (parts of) the system to understand how it reacts

○ Chaos Monkey

○ DRAX

Page 99: Downtime is not an option - day 2 operations -  Jörg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 99

Page 100: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 100

ARCHITECTURE

MESOS FUNDAMENTALS

● Agents advertise resources to Master
● Master offers resources to Framework
● Framework rejects/uses resources
● Agents report task status to Master
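
The offer cycle in these four bullets can be sketched as a toy model of the framework's side: for each offer it either launches a pending task or declines. This is an illustration only, not the Mesos API. The `Offer` and `TaskRequest` types, the agent and task names, and the one-task-per-offer rule are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    cpus: float
    mem: float  # MB


@dataclass
class TaskRequest:
    name: str
    cpus: float
    mem: float  # MB


def handle_offers(offers, pending):
    """Use each offer for the first pending task that fits; decline the rest."""
    launched, declined = [], []
    pending = list(pending)  # work on a copy so the caller's list is untouched
    for offer in offers:
        fit = next(
            (t for t in pending if t.cpus <= offer.cpus and t.mem <= offer.mem),
            None,
        )
        if fit is None:
            declined.append(offer.agent)
        else:
            pending.remove(fit)
            launched.append((fit.name, offer.agent))
    return launched, declined


# Placeholder offers, as the master might forward them from three agents.
offers = [Offer("agent-1", 1.0, 512), Offer("agent-2", 4.0, 8192), Offer("agent-3", 2.0, 2048)]
pending = [TaskRequest("kafka-broker", 2.0, 4096), TaskRequest("web", 0.5, 256)]

launched, declined = handle_offers(offers, pending)
print("launched:", launched)
print("declined:", declined)
```

Leaving the accept-or-decline decision to the framework is what lets different schedulers, such as Marathon or a Spark framework, share one pool of machines.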


Page 105: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 105

Questions?

Code: https://git.io/vXUoy

Psssssssst …… we are hiring!

http://grnh.se/ie76ru

Page 106: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

CONTAINER SCHEDULING

106

Page 107: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

RESOURCE MANAGEMENT

107

Page 108: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

SERVICE MANAGEMENT

108

Page 109: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

SERVICE-ORIENTED ARCHITECTURE
- Separation of concerns
- Optimization of bottlenecks
- Smaller teams
- API Contracts
- Data replication
- Complicated provisioning
- Dependency management

[Diagram: three hardware hosts, each with an operating system, running a Web App and a Service]

Page 110: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved.

MICROSERVICES
- Polyglot
- Single Responsibility
- Smaller Teams
- Utilization
- Machine types/groups
- Dependency hell

[Diagram: three machines on shared infrastructure, each with its own operating system, running apps and services]

Page 111: Downtime is not an option - day 2 operations -  Jörg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 111

THE BIRTH OF MESOS

Spring 2009: CS262B. Ben Hindman, Andy Konwinski and Matei Zaharia create “Nexus” as their CS262B class project.

March 2010: TWITTER TECH TALK. The grad students working on Mesos give a tech talk at Twitter.

September 2010: MESOS PUBLISHED. "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center" is published as a technical report.

December 2010: APACHE INCUBATION. Mesos enters the Apache Incubator.

April 2016: DC/OS

Page 112: Downtime is not an option - day 2 operations -  Jörg Schad

© 2015 Mesosphere, Inc. All Rights Reserved. 112

Monitoring
- Collecting metrics
- Routing events
- Downstream processing
  ○ Alerting
  ○ Dashboards
  ○ Storage (long-term retention)

Logging
- Scopes
- Local vs. Central
- Security considerations

Maintenance
- Cluster Upgrades
- Cluster Resizing
- Capacity Planning
- User & Package Management
- Networking Policies
- Auditing
- Backups & Disaster Recovery

Troubleshooting
- Debugging
  ○ Services
  ○ System
- Tracing
- Chaos Engineering

