Scale-Out Resource Management at Microsoft Using Apache YARN
Raghu Ramakrishnan
CTO for Data, Technical Fellow
Head, Big Data Engineering
Microsoft
Store any data (HDFS): files, relations, docs, logs, graphs, multimedia …
Do any analysis: SQL queries, ML, image processing, log analytics … (Hive, Map-Reduce, Spark-ML, …)
At any speed: batch, interactive, streaming (Hive, Spark, Storm, …)
At any scale … elastic! "Terabytes or more" or "as much as you need"
Anywhere: on-prem, cloud, hybrid
Data to Intelligent Action
The Big Data Service Microsoft Runs On
Windows, SMSG, Live, Ads, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
• Interactive and Real-Time Analytics requires in-memory data, NVRAM, SSDs, RDMA, etc.
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage: seamlessly move data across tiers, mirroring life-cycle and usage patterns; schedule compute near low-latency copies of data
Where Should the Data Be?
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs: policy-driven management of vast compute pools co-located with data
Fundamental meme: schedule computation "near" the data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
Shared Data and Compute
Tiered Storage
Relational Query Engine
Machine Learning
Compute Fabric (Resource Management)
Multiple analytic engines sharing the same resource pool
Compute and store/cache on same machines
What’s Behind a U-SQL Query
[Diagram: a U-SQL query submitted from U-SQL Studio or the Web Interface / API passes through the Compiler, Optimizer, Scheduler, and Runtime, on top of the ResourceManager. 20+ years of SQL Server smarts! Data-aware, fault-aware!]
Resource Managers for Big Data
Allocate compute containers to competing jobs; multiple job engines can request from the shared pool. Containers are the unit of resource; they can fail or be taken away.
YARN: the resource manager for Hadoop 2.x. Other RMs include Corona, Mesos, and Omega.
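As a concrete illustration, here is a minimal sketch (in Java, against the stock YARN client API) of how an application master asks the ResourceManager for containers from the shared pool; the container size and priority are placeholder values, not from the talk:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
  import org.apache.hadoop.yarn.api.records.*;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
      AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
      rmClient.init(new Configuration());
      rmClient.start();
      // Register this application master with the ResourceManager.
      rmClient.registerApplicationMaster("", 0, "");
      // Ask for a 1 GB / 1 vcore container anywhere in the cluster.
      rmClient.addContainerRequest(new ContainerRequest(
          Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
      // Heartbeat; allocated containers arrive asynchronously and can later
      // fail or be taken away (preempted) by the RM.
      AllocateResponse response = rmClient.allocate(0.1f);
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Got container " + c.getId() + " on " + c.getNodeId());
      }
      rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
      rmClient.stop();
    }
  }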
YARN Gaps
• No way to provide resource allocation SLOs to production jobs
• YARN RM has scalability limitations; known to scale to about 5K nodes
• High allocation latency, which affects the "small" jobs that make up the majority
• Support for specialized execution frameworks: interactive environments, long-running services
Microsoft Contributions to OSS Apache YARN
• Amoeba (work-preserving pre-emption) and Rayon (predictable resource allocation and SLOs via intelligent admission control). Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq: merge Cosmos-style optimistic allocation with YARN conservatism to improve performance while preserving global policy control. Status: prototypes, JIRAs and papers
• Federation: scale clusters up to over 100K machines; enable federation across data centers for fluid load shifting and BCDR. Status: prototype and JIRA
• Framework-level Pooling: enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times. Status: spec
Amoeba: Adding Work-Preserving Pre-emption to YARN
Dynamic optimization: leveraging checkpointing and preemption for parsimonious scheduling in MapReduce
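A rough sketch of the AM side of work-preserving preemption, assuming the AM heartbeats through AMRMClient: instead of waiting to be killed, it inspects the RM's PreemptionMessage and checkpoints the affected tasks. The checkpointAndRelease helper is hypothetical, standing in for the MapReduce checkpointing that Amoeba adds:

  import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
  import org.apache.hadoop.yarn.api.records.ContainerId;
  import org.apache.hadoop.yarn.api.records.PreemptionContainer;
  import org.apache.hadoop.yarn.api.records.PreemptionMessage;

  public class PreemptionSketch {
    // Called on every AM heartbeat with the RM's response.
    void handleHeartbeat(AllocateResponse response) {
      PreemptionMessage msg = response.getPreemptionMessage();
      if (msg == null) {
        return; // nothing to give back
      }
      // Strict contract: these specific containers will be reclaimed regardless.
      if (msg.getStrictContract() != null) {
        for (PreemptionContainer pc : msg.getStrictContract().getContainers()) {
          checkpointAndRelease(pc.getId());
        }
      }
      // Negotiable contract: the AM may satisfy it by giving back equivalent resources.
      if (msg.getContract() != null) {
        for (PreemptionContainer pc : msg.getContract().getContainers()) {
          checkpointAndRelease(pc.getId());
        }
      }
    }

    // Hypothetical helper: save task state so a later container can resume it,
    // then return the container to the RM (work-preserving, rather than a kill).
    void checkpointAndRelease(ContainerId id) {
      // ... checkpoint task output/state, then release the container ...
    }
  }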
Killing Tasks vs. Preemption
[Chart: % complete over time (s), comparing killing tasks (Kill) vs. preemption (Preempt); preemption gives a 33% improvement]
[Diagram: a client submits Job1 to the RM/Scheduler; the AppMaster and its tasks run in containers across several NodeManagers]
Related JIRAs: YARN-45, YARN-567, YARN-568, YARN-569, MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197, (…)
Contributing to Apache
Engaging with OSS: talk with active developers; show early/partial work; small patches; it is OK to leave things unfinished
Rayon: Adding Capacity Reservation to YARN
Sharing a Cluster Between Production & Best-Effort Jobs
Production jobs (P): money-making, large (recurrent) jobs with SLAs, e.g., shows up at 3pm, deadline at 6pm, ~90 min runtime
Best-effort jobs: interactive exploratory jobs submitted by data scientists without SLAs; however, latency is still important (a user is waiting)
Not supported well currently in YARN.
New idea: support SLOs for production jobs by using job-provided resource requirements in RDL and system-enforced admission control.
Reservation-Based Scheduling in Hadoop (Curino, Krishnan, Difallah, Douglas, Ramakrishnan, Rao; Rayon paper, SoCC 2014)
Resource Definition Language (RDL)
e.g., atom (<2GB,1core>, 1, 10, 1min, 10bundle/min)
(simplified for OSS release)
Steps:
1. App formulates reservation request in RDL
2. Request is "placed" in the Plan
3. Allocation is validated against the sharing policy
4. System commits to deliver resources on time
5. Plan is dynamically enacted
6. Jobs get (reserved) resources
7. System adapts to live conditions
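A minimal sketch of step 1 in Java, using the reservation API as it shipped in Apache Hadoop 2.6 (YarnClient.submitReservation); the queue name, sizes, and time window are made-up placeholders, and later Hadoop versions also pass a ReservationId in the request:

  import java.util.Collections;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
  import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionResponse;
  import org.apache.hadoop.yarn.api.records.*;
  import org.apache.hadoop.yarn.client.api.YarnClient;

  public class RayonReservationSketch {
    public static void main(String[] args) throws Exception {
      YarnClient yarn = YarnClient.createYarnClient();
      yarn.init(new Configuration());
      yarn.start();
      // RDL-style atom: 10 containers of <2GB, 1 core>, each needed for ~1 minute.
      ReservationRequest atom = ReservationRequest.newInstance(
          Resource.newInstance(2048, 1), 10, 1, 60_000L);
      ReservationRequests wants = ReservationRequests.newInstance(
          Collections.singletonList(atom), ReservationRequestInterpreter.R_ALL);
      // Window between arrival and deadline in which the resources must be delivered.
      long arrival = System.currentTimeMillis() + 60_000L;   // placeholder
      long deadline = arrival + 3 * 60 * 60 * 1000L;         // placeholder
      ReservationDefinition def =
          ReservationDefinition.newInstance(arrival, deadline, wants, "prod-job-sla");
      ReservationSubmissionRequest req =
          ReservationSubmissionRequest.newInstance(def, "production"); // queue is a placeholder
      ReservationSubmissionResponse resp = yarn.submitReservation(req);
      // The returned ReservationId is later set on the job's
      // ApplicationSubmissionContext so it runs against the reserved capacity.
      System.out.println("Reservation accepted: " + resp.getReservationId());
      yarn.stop();
    }
  }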
Architecture: Teach the RM About Time
Results:
• Meets all production job SLAs
• Lowers best-effort job latency
• Increases cluster utilization and throughput
Committed to Hadoop trunk and 2.6 release
Now part of Cloudera CDH and Hortonworks HDP
Comparing Rayon With CapacityScheduler
Initial Umbrella JIRA: YARN-1051 (14 sub-tasks)
Rayon OSS
Rayon V2 Umbrella JIRA: YARN-2572 (25 sub-tasks): tooling, REST APIs, UI, documentation, perf improvements
High-Availability Umbrella JIRA: YARN-2573 (7 sub-tasks)
Heterogeneity/node-labels Umbrella JIRA: YARN-4193 (8 sub-tasks)
Algo enhancements: YARN-3656 (1 sub-task)
Folks involved: Carlo Curino, Subru Krishnan, Ishai Menache, Sean Po, Jonathan Yaniv, Arun Suresh, Anubhav Dhoot, Alexey Tumanov
Included in Apache Hadoop 2.6; various enhancements in the upcoming Apache Hadoop 2.8
YARN Federation: Improving YARN's Scalability
Why Federation?
Problem:
• YARN scalability is bounded by the centralized ResourceManager, proportional to #nodes, #apps, #containers, and heartbeat frequency
• Maintenance and operations on a single massive cluster are painful
Solution:
• Scale by federating multiple YARN clusters, which appear as a single massive cluster to an app
• NodeManagers heartbeat to one RM only
• Most apps talk with one RM only; a few apps might span sub-clusters (achieved by transparently proxying AM-RM communication), e.g., when a single app exceeds sub-cluster capacity or for load reasons
• Easier provisioning / maintenance; leverages the cross-company stabilization effort on smaller YARN clusters
• Use ~6K-node YARN clusters as-is as building blocks
Federation Architecture
[Diagram: a Client submits jobs through a Router to one of N YARN sub-clusters, each with its own RM (RM-001 … RM-NNN) and NodeManagers (NM-001 … NM-NNN); a Global Policy Generator, a State Store, and a Policy Store coordinate the sub-clusters; HDFS read placement hints are optional]
Flow:
1. Heartbeat (membership)
2. Request capacity allocation
3. Write (initial) capacity allocation
4. Read membership and load
5. Write capacity allocation (updates) & policies
6. Submit job
7. Read policies (capacity allocation and job routing)
8. Write app -> sub-cluster mapping
9. Submit job
Federation JIRAs
Umbrella JIRA: YARN-2915
Main sub-JIRAs (work item, JIRA, author):
• Federation StateStore APIs: YARN-3662 (Subru Krishnan)
• Federation PolicyStore APIs: YARN-3664 (Subru Krishnan)
• Federation "Capacity Allocation" across sub-clusters: YARN-3658 (Carlo Curino)
• Federation Router: YARN-3658 (Giovanni Fumarola)
• Federation intercepting and propagating AM-RM communications: YARN-3666 (Kishore Chaliparambil)
• Federation maintenance mechanisms (command propagation): YARN-3657 (Carlo Curino)
• Federation sub-cluster membership mechanisms: YARN-3665 (Subru Krishnan)
• Federation State and Policy Store (SQL implementation): YARN-3663 (Giovanni Fumarola)
• Federation Global Policy Generator (load balancing): YARN-3660 (Subru Krishnan)
Mercury and Yaq: Improving Cluster Utilization and Job Completion Times
• Feedback delays impact cluster utilization: NMs report resource availability through heartbeats (HBs), so resources can remain idle between HBs and utilization is suboptimal, especially for shorter tasks
• Mercury: introduces distributed scheduling in YARN to minimize allocation latencies
• Yaq: queuing of tasks at the NMs to minimize idle times between allocations; queue management techniques to improve job completion time
Cluster Utilization in YARN
Average YARN cluster utilization for varying workloads:
Workload:     5 sec    10 sec   50 sec   Mixed-5-50   Cosmos-gm
Utilization:  60.59%   78.35%   92.38%   78.54%       83.38%
Two types of schedulers:
• Central scheduler (YARN): scheduling policies/guarantees, slow(er) decisions
• Distributed schedulers (new): fast/low-latency decisions, ideal for short or lower-priority tasks
The AM specifies the resource type for each request: guaranteed (via YARN) or opportunistic (via distributed scheduling); see the sketch after the architecture diagram below.
Mercury: Distributed Scheduling in YARN
[Diagram: Mercury Resource Management Framework. An App Master sends requests tagged with a resource type; guaranteed requests go to the Central Scheduler and opportunistic requests go to Distributed Schedulers, coordinated by the Mercury Coordinator, with a Mercury Runtime on each node]
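A minimal sketch of an AM tagging a request as opportunistic, using the ExecutionType API that this line of work later landed in stock Apache Hadoop (2.9+); exact signatures vary by version, and the priority and container size here are placeholders:

  import org.apache.hadoop.yarn.api.records.ExecutionType;
  import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.api.records.ResourceRequest;

  public class OpportunisticRequestSketch {
    // Build a ResourceRequest that can be served by the fast
    // distributed/opportunistic path instead of the guaranteed one.
    static ResourceRequest shortTaskRequest() {
      ResourceRequest req = ResourceRequest.newInstance(
          Priority.newInstance(10),        // lower priority: fine to run opportunistically
          ResourceRequest.ANY,             // no locality constraint
          Resource.newInstance(512, 1),    // small container for a short task
          4);                              // number of containers
      // OPPORTUNISTIC: dispatched quickly, may be preempted by GUARANTEED work.
      req.setExecutionTypeRequest(
          ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));
      return req;
    }
  }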
Mercury: Task Throughput as Task Duration Increases
Results:
• 41% task throughput improvement for short tasks
• Improvement drops for longer task durations
• Best results when carefully combining only-G with only-Q ("only-G" is stock YARN, "only-Q" is Mercury)
Yaq: Efficient Management of NM Queues
• Introduce queuing of tasks at NMs: minimize idle times between allocations, improve cluster utilization
• Queue management to improve job completion times; explore queue management techniques: placement of tasks at queues, bounding queue length, dynamic queue reordering (a sketch follows below)
• Techniques applied to YARN (1.7x improvement in median job completion time) and to Mercury (3.9x improvement)
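Purely as an illustration of the queue-management ideas above (not Yaq's actual implementation), here is a hypothetical bounded node-local task queue that orders queued tasks by estimated duration so that short tasks are dequeued first:

  import java.util.Comparator;
  import java.util.concurrent.PriorityBlockingQueue;

  public class BoundedNodeQueueSketch {
    // Hypothetical stand-in for a task waiting at a NodeManager.
    static final class QueuedTask {
      final String id;
      final long estimatedDurationMs;
      QueuedTask(String id, long estimatedDurationMs) {
        this.id = id;
        this.estimatedDurationMs = estimatedDurationMs;
      }
    }

    private final int maxQueueLength; // bounding the queue length
    private final PriorityBlockingQueue<QueuedTask> queue =
        new PriorityBlockingQueue<>(16,
            Comparator.comparingLong((QueuedTask t) -> t.estimatedDurationMs)); // reordering

    public BoundedNodeQueueSketch(int maxQueueLength) {
      this.maxQueueLength = maxQueueLength;
    }

    // Placement: a task is queued here only if the bound is not exceeded;
    // otherwise the scheduler should pick another node.
    public synchronized boolean tryEnqueue(QueuedTask t) {
      if (queue.size() >= maxQueueLength) {
        return false;
      }
      queue.offer(t);
      return true;
    }

    // Called by the NM when resources free up; the shortest queued task runs next.
    public QueuedTask nextTask() throws InterruptedException {
      return queue.take();
    }
  }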
Evaluating Yaq on a Production Workload: Yaq-on-YARN (Yaq-c) vs. Yaq-on-Mercury (Yaq-d)
Mercury and Yaq OSS
• Umbrella JIRA for Mercury: YARN-2877
• Introduce OPPORTUNISTIC container type [YARN-2882, YARN-4335]: RESOLVED (Konstantinos Karanasos)
• Proxying AM-RM communications [YARN-2884]: RESOLVED (Kishore Chaliparambil)
• Distributed scheduling interceptor [YARN-2885]: RESOLVED (Arun Suresh)
• Queuing of containers at NMs [YARN-2883]: PATCH AVAILABLE (Konstantinos Karanasos)
• Notify RM of OPPORTUNISTIC containers [YARN-4738]: PATCH AVAILABLE (Konstantinos Karanasos)
• Cluster monitor to gather NM queue state [YARN-4412]: RESOLVED (Arun Suresh)
• NM queue rebalancing [YARN-2888]: RESOLVED (Arun Suresh)
Microsoft Contributions to OSS Apache YARN
• Amoeba (work-preserving pre-emption) and Rayon (predictable resource allocation and SLOs via intelligent admission control). Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq: merge Cosmos-style optimistic allocation with YARN conservatism to improve performance while preserving global policy control. Status: prototypes, JIRAs and papers
• Federation: scale clusters up to over 100K machines; enable federation across data centers for fluid load shifting and BCDR. Status: prototype and JIRA
• Framework-level Pooling: enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times. Status: spec
Papers
• Amoeba [SoCC'12]: Ganesh Ananthanarayanan, Sriram Rao, Chris Douglas, Raghu Ramakrishnan, Ion Stoica
• Apache YARN [SoCC'13]: authored by CISL (Chris Douglas, Carlo Curino), Yahoo!, HW, Inmobi (won Best Paper at SoCC'13)
• Rayon [SoCC'14]: Carlo Curino, Chris Douglas, Djellel Difallah, Raghu Ramakrishnan, Sriram Rao (CISL)
• Tetris [SIGCOMM'14]: Robert Grandl, Ganesh Ananthanarayanan and Srikanth Kandula (MSR), Sriram Rao (CISL)
• Mercury [USENIX'15 paper and Eurosys'15 poster]: authored by CISL (Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas) and the RM team (Kishore Chaliparambil, Giovanni Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga)
• Yaq [Eurosys'16 paper and NSDI'16 poster]: Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, Milan Vojnovic, and Sriram Rao
Comparison with Mesos/Omega/Borg
• Reservations and planning: YARN is the first cluster scheduler to support such features, through Rayon
• Queue management techniques: no support for advanced queue management in Mesos/Omega/Borg
• Scalability: YARN using Federation can now match the scalability of Mesos and Borg
REEF: Control Flow Framework for YARN Applications
YARN provides containers; REEF makes it easy to build event flows / iterative computations using YARN. Think "libc for BigData".
Project history:
• 2012: started the project in CISL
• 2013: engaged with product teams
• 2014: open source release
• Apache incubation (~20 devs from 7 organizations)
• Azure Stream Analytics public preview (built with REEF)
• Azure ML workflow scheduler on REEF
Learn more: http://www.reef-project.org and http://reef.incubator.apache.org
© 2015 Microsoft Corporation. All rights reserved.
For more, see my blog post at http://aka.ms/adltechblog/
For more on REEF: http://www.reef-project.org and http://reef.incubator.apache.org