Managing a Multi-Tenant Data Lake

Post on 06-Jan-2017

Copyright 2016 Comcast Corporation. All Rights Reserved.

Agenda

• Timeline for Evolution
• Why Governance
• Multi-Tenancy Anti-Patterns / Warning Signs
• Instituting Governance
• Managing through Chaos
• Monitoring/Metrics
• Environment Tools
• SLA Management
• Support and Staffing
• Demo - Command Center

Timeline – 2013

2013 – “The Experiment”

Started with a 10-node cluster

Experimentation with batch processing and enrichment of event data

Team assembled from across organization

Primarily solving single use case

30 nodes by end of trial

2 Racks

Timeline – 2014 (H1)

2014 Production “Honeymoon”

Added 70 more nodes along with lower environments (Dev & QA)

Onboard additional ~20 data sets through batch ETL

Supporting a dozen use cases

5 Racks

Timeline – 2014 (H2)

2014 Production “Tiger’s Tail”

Total of 200 nodes to support additional use cases (data science)

Total of ~30 more data sets through batch ETL

Supporting several dozen use cases and ad-hoc exploration

Starting to have difficulty managing resource requests

9 Racks

Timeline – 2015

2015 Production “Cortez”

Adding 250 more nodes to production environment

Fully embraced governance

Supporting 24x7 production use cases

19 Racks

Timeline – 2016

2016 Production “Planetary”

Adding 1300 more nodes to production environment

Standing up separate 500 node data science cluster

Spinning off critical compute to boundary satellite clusters

Reaping benefits from governance and resource planning

48 Racks

Why Governance?

It’s about establishing acceptable behaviors for the benefit of the community

Minimize user/application impact on cluster

Users will do whatever is technically possible

Everyone has been conditioned to work “smarter not harder”

Establish guardrails, not edicts

Multi-Tenancy Anti-Patterns

Speculative Execution

Optional User Training

Lack of Resource Isolation

Lack of Testing and Measurement

Ad-hoc Communication Channels

Excessive Resource Utilization/Reservation

Informal Service Level Agreements (SLAs)

Public Domain: Plynn9

Signs of Looming Disaster

Pending Jobs

Queue Fidgeting

Job Rescheduling

Non-Predictive Workloads

Cluster Storage Out Of Balance

Public Domain: US DOE
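
One of the warning signs above, cluster storage out of balance, can be checked mechanically. The following is only an illustrative sketch, not the tooling described in this talk: it assumes per-DataNode used and configured capacity figures have already been collected, and the `DataNode` record, host names, and numbers are hypothetical. It flags nodes whose utilization deviates from the overall cluster utilization by more than a threshold; the 10-point default mirrors the default threshold of the HDFS balancer.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str            # hypothetical host name
    used_tb: float       # DFS space used on this node
    capacity_tb: float   # DFS capacity configured on this node

    @property
    def utilization_pct(self) -> float:
        return 100.0 * self.used_tb / self.capacity_tb


def cluster_utilization_pct(nodes):
    """Overall utilization: total used space over total capacity."""
    total_used = sum(n.used_tb for n in nodes)
    total_capacity = sum(n.capacity_tb for n in nodes)
    return 100.0 * total_used / total_capacity


def out_of_balance(nodes, threshold_pct=10.0):
    """Return (name, deviation) for nodes outside the threshold band."""
    cluster_pct = cluster_utilization_pct(nodes)
    flagged = []
    for node in nodes:
        deviation = node.utilization_pct - cluster_pct
        if abs(deviation) > threshold_pct:
            flagged.append((node.name, round(deviation, 1)))
    return flagged


if __name__ == "__main__":
    # Made-up numbers: two older nodes and one freshly added, nearly empty node.
    nodes = [
        DataNode("dn01", used_tb=40.0, capacity_tb=48.0),
        DataNode("dn02", used_tb=30.0, capacity_tb=48.0),
        DataNode("dn03", used_tb=8.0, capacity_tb=48.0),
    ]
    for name, deviation in out_of_balance(nodes):
        print(f"{name}: {deviation:+.1f} points from cluster utilization")
```

A check like this is a leading indicator only; what to do about flagged nodes (for example, scheduling a balancer run) is an operational decision the sketch does not make.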

Instituting Governance

Governance is not a technology problem

Governance must be solved using:
• People – Who
• Processes – What / When / How
• Policy – Why

Always employ technology to help with enforcement and measurement

Setting Out Governance Standards – Starting Out

Involve the business users to define lightweight policies and processes:
• Onboarding users/applications/tools
• Resource Utilization Worksheets
• Deployment checklists
• Service Level Agreements / Penalties
• Updates of Governance Standards

You MUST socialize and educate your community on these policies and processes

Strive for evolution not revolution

Setting Out Governance Standards – Measurement

Define universally accepted performance measures:
• Storage
• Compute
• System Availability
• Issues and MTTR
• Average Completion Time
• Average Pending Apps

Be transparent with results and make them available to entire community

Establish monthly performance reviews with key stakeholders
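
To make three of these measures concrete (MTTR, average completion time, and average pending apps), here is a minimal sketch of how they might be computed for a monthly review. It is an assumption-laden illustration rather than the presenters' method: the input shapes (pairs of timestamps for incidents and applications, and a list of periodic pending-app counts) and all function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttr(incidents):
    """Mean time to repair: average of (resolved - opened) per incident."""
    durations = [(resolved - opened).total_seconds() for opened, resolved in incidents]
    return timedelta(seconds=mean(durations))


def average_completion_time(apps):
    """Average wall-clock run time of finished applications."""
    durations = [(finished - started).total_seconds() for started, finished in apps]
    return timedelta(seconds=mean(durations))


def average_pending_apps(samples):
    """Average number of pending applications across periodic samples."""
    return mean(samples)


if __name__ == "__main__":
    day = datetime(2016, 6, 1)
    # Hypothetical incident log: (opened, resolved)
    incidents = [
        (day.replace(hour=9), day.replace(hour=9, minute=30)),
        (day.replace(hour=14), day.replace(hour=15, minute=30)),
    ]
    # Hypothetical application runs: (started, finished)
    apps = [
        (day.replace(hour=1), day.replace(hour=1, minute=20)),
        (day.replace(hour=2), day.replace(hour=2, minute=40)),
    ]
    pending_samples = [0, 4, 11, 5]   # e.g. one sample per polling interval

    print("MTTR:", mttr(incidents))
    print("Average completion time:", average_completion_time(apps))
    print("Average pending apps:", average_pending_apps(pending_samples))
```

Keeping each measure as a small, separately named function makes it easier to publish the exact definition alongside the numbers, which supports the transparency goal above.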

Setting Out Governance Standards – Enforcement

Lock down as many resources as possible

Monitor resource utilization for compliance

Automate corrective measures

It’s all about transitioning from defense to offense and eliminating surprises!

Setting Out Governance Standards – Enforcement

Hadoop provides some base capabilities:
• YARN Queues for compute
• HDFS Quotas/ACLs for storage

Implement custom solutions for proactive offensive capabilities:
• Job monitoring and migration (Penalty Box)
• Dynamic Allocation / Queue Flexing
• Monitor and track leading indicators (Command Center)
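
The slides do not describe how the Penalty Box is implemented, so the following is only a sketch of the general idea under stated assumptions: the guardrail (a per-application memory ceiling), the queue name `penalty_box`, and the `RunningApp` record are all hypothetical. The one real interface it refers to is the YARN CLI command `yarn application -movetoqueue <application id> -queue <queue name>`; newer Hadoop releases document `-changeQueue` as its replacement, so check the CLI reference for your version. The sketch only builds the commands and does not execute them.

```python
from dataclasses import dataclass


@dataclass
class RunningApp:
    app_id: str        # YARN application id
    queue: str         # queue the application currently runs in
    allocated_mb: int  # memory currently allocated to the application


def penalty_moves(apps, max_mb_per_app, penalty_queue="penalty_box"):
    """Build one move command per application that exceeds the guardrail.

    The guardrail and the penalty queue name are assumptions made for
    illustration; nothing is executed here.
    """
    commands = []
    for app in apps:
        already_penalized = app.queue == penalty_queue
        if not already_penalized and app.allocated_mb > max_mb_per_app:
            commands.append([
                "yarn", "application",
                "-movetoqueue", app.app_id,
                "-queue", penalty_queue,
            ])
    return commands


if __name__ == "__main__":
    # Made-up applications for illustration.
    apps = [
        RunningApp("application_1480000000000_0001", "etl", 64_000),
        RunningApp("application_1480000000000_0002", "adhoc", 900_000),
        RunningApp("application_1480000000000_0003", "penalty_box", 950_000),
    ]
    for command in penalty_moves(apps, max_mb_per_app=512_000):
        print(" ".join(command))
```

Returning the commands as argument lists rather than running them keeps the policy decision testable on its own; an operator or scheduler could review the list before anything is moved.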

Multi-Tenancy: Understanding the Chaos - Monitoring/Metrics

Image Attribution: Pixabay - Creative Commons CC0

Use Case – Extreme Ad Hoc (Data Science)

Challenges? You bet!

Challenges Monitoring and Managing a Multi-tenant Hadoop Environment – Diverse User Community

Images: Creative Commons

Challenges Monitoring and Managing a Multi-tenant Hadoop Environment – Diverse SLAs
Challenges Monitoring and Managing a Multi-tenant Hadoop Environment - Governance

Images: Creative Commons

Challenges Monitoring and Managing a Multi-tenant Hadoop Environment – Monitoring & Forecasting

Images: Creative Commons

Environment

Our Environment - Tools for Monitoring

Standard Hadoop Monitoring

Environment - Tools for Monitoring

Command Center

Pepperdata

SLA Management

Application Timing

Resource Management

Capacity Management

Images: Creative Commons

Support & Staffing

Images: Creative Commons

Takeaways for DevOps Model in Hadoop

Train Your Teams (!!!)

Measure, Forecast and Model

Automation and Frameworks

Comcast Command Center

The Command Center: Our Focus

Visualizations & Design

Ease Of Use

Extensibility

Alerting

The Command Center for Monitoring and Alerting

• Missed SLAs
• Guardrails broken

• Definitions
• Links

• Containers
• Queue capacity

• Status
• Measures

• HDFS Usage
• Queue Usage

Continuous Evolution

Continuous Engagement
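
The two alert types listed on this slide, missed SLAs and broken guardrails, can be illustrated with a small sketch. This is not the Command Center's code, which the talk does not show: the `JobRun` record, the job and queue names, and the rule that a guardrail is broken when a queue's usage exceeds a percentage limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobRun:
    name: str
    sla_deadline: datetime
    finished_at: Optional[datetime]   # None while the job is still running


def missed_slas(runs, now):
    """Return one alert message per run that missed its SLA deadline."""
    alerts = []
    for run in runs:
        if run.finished_at is None:
            if now > run.sla_deadline:
                alerts.append(f"{run.name}: still running past SLA deadline")
        elif run.finished_at > run.sla_deadline:
            late_by = run.finished_at - run.sla_deadline
            alerts.append(f"{run.name}: finished {late_by} late")
    return alerts


def broken_guardrails(queue_usage_pct, limit_pct=100.0):
    """Return the names of queues whose usage exceeds the (assumed) limit."""
    return sorted(q for q, pct in queue_usage_pct.items() if pct > limit_pct)


if __name__ == "__main__":
    now = datetime(2016, 6, 1, 8, 0)
    runs = [
        JobRun("daily_enrichment", datetime(2016, 6, 1, 6, 0), datetime(2016, 6, 1, 5, 40)),
        JobRun("billing_rollup", datetime(2016, 6, 1, 6, 0), datetime(2016, 6, 1, 6, 25)),
        JobRun("model_training", datetime(2016, 6, 1, 7, 0), None),
    ]
    for alert in missed_slas(runs, now):
        print("SLA:", alert)
    for queue in broken_guardrails({"etl": 85.0, "adhoc": 140.0}):
        print("Guardrail:", queue)
```

Separating the SLA check from the guardrail check keeps each alert rule small enough to explain to the user community, which fits the emphasis on definitions and links in the same dashboard.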

Monitoring and Alerting at Comcast

The Command Center!

Thanks!

Ray Harrison, Principal DevOps Architect
Ray_Harrison@cable.comcast.com

Mike Fagan, Principal Big Data Architect
Michael_Fagan@cable.comcast.com

We Are Hiring!