Scale-Out Resource Management at Microsoft Using Apache YARN
Raghu Ramakrishnan
CTO for Data, Technical Fellow
Head, Big Data Engineering
Microsoft
Store any data (HDFS): files, relations, docs, logs, graphs, multimedia …
Do any analysis: SQL queries, ML, image processing, log analytics … (Hive, Map-Reduce, Spark-ML, …)
At any speed: batch, interactive, streaming (Hive, Spark, Storm, …)
At any scale … elastic! "Terabytes or more" or "as much as you need"
Anywhere: on-prem, cloud, hybrid
Data to Intelligent Action
The Big Data Service Microsoft Runs On
Windows, SMSG, Live, Ads, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
• Interactive and Real-Time Analytics requires in-memory data, NVRAM, SSDs, RDMA, etc.
• Massive data volumes require scale-out stores using commodity servers, even archival storage
Tiered Storage: seamlessly move data across tiers, mirroring life-cycle and usage patterns; schedule compute near low-latency copies of data
Where Should the Data Be?
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs: policy-driven management of vast compute pools co-located with data
Fundamental meme: schedule computation "near" the data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
Shared Data and Compute
Tiered Storage
Relational Query Engine
Machine Learning
Compute Fabric (Resource Management)
Multiple analytic engines sharing the same resource pool
Compute and store/cache on same machines
What’s Behind a U-SQL Query
[Diagram: a U-SQL query submitted from U-SQL Studio or the Web Interface / API passes through the Compiler, Optimizer, Scheduler, and Runtime, on top of the ResourceManager. 20+ years of SQL Server smarts! Data-aware, fault-aware!]
Resource Managers for Big Data
Allocate compute containers to competing jobs; multiple job engines can request from the shared pool. Containers are the unit of resource; they can fail or be taken away.
YARN: the resource manager for Hadoop 2.x. Other RMs include Corona, Mesos, and Omega.
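As a concrete illustration, here is a minimal sketch (in Java, against the stock YARN client API) of how an application master asks the ResourceManager for containers from the shared pool; the container size and priority are placeholder values, not from the talk:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
  import org.apache.hadoop.yarn.api.records.*;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
      AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
      rmClient.init(new Configuration());
      rmClient.start();
      // Register this application master with the ResourceManager.
      rmClient.registerApplicationMaster("", 0, "");
      // Ask for a 1 GB / 1 vcore container anywhere in the cluster.
      rmClient.addContainerRequest(new ContainerRequest(
          Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
      // Heartbeat; allocated containers arrive asynchronously and can later
      // fail or be taken away (preempted) by the RM.
      AllocateResponse response = rmClient.allocate(0.1f);
      for (Container c : response.getAllocatedContainers()) {
        System.out.println("Got container " + c.getId() + " on " + c.getNodeId());
      }
      rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
      rmClient.stop();
    }
  }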
YARN Gaps
• No way to provide resource allocation SLOs to production jobs
• YARN RM has scalability limitations; known to scale to about 5K nodes
• High allocation latency, which affects the "small" jobs that make up the majority
• Support for specialized execution frameworks: interactive environments, long-running services
Microsoft Contributions to OSS Apache YARN
• Amoeba (work-preserving pre-emption) and Rayon (predictable resource allocation and SLOs via intelligent admission control). Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq: merge Cosmos-style optimistic allocation with YARN conservatism to improve performance while preserving global policy control. Status: prototypes, JIRAs and papers
• Federation: scale clusters up to over 100K machines; enable federation across data centers for fluid load shifting and BCDR. Status: prototype and JIRA
• Framework-level Pooling: enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times. Status: spec
Amoeba: Adding Work-Preserving Pre-emption to YARN
Dynamic optimization: leveraging checkpointing and preemption for parsimonious scheduling in MapReduce
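A rough sketch of the AM side of work-preserving preemption, assuming the AM heartbeats through AMRMClient: instead of waiting to be killed, it inspects the RM's PreemptionMessage and checkpoints the affected tasks. The checkpointAndRelease helper is hypothetical, standing in for the MapReduce checkpointing that Amoeba adds:

  import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
  import org.apache.hadoop.yarn.api.records.ContainerId;
  import org.apache.hadoop.yarn.api.records.PreemptionContainer;
  import org.apache.hadoop.yarn.api.records.PreemptionMessage;

  public class PreemptionSketch {
    // Called on every AM heartbeat with the RM's response.
    void handleHeartbeat(AllocateResponse response) {
      PreemptionMessage msg = response.getPreemptionMessage();
      if (msg == null) {
        return; // nothing to give back
      }
      // Strict contract: these specific containers will be reclaimed regardless.
      if (msg.getStrictContract() != null) {
        for (PreemptionContainer pc : msg.getStrictContract().getContainers()) {
          checkpointAndRelease(pc.getId());
        }
      }
      // Negotiable contract: the AM may satisfy it by giving back equivalent resources.
      if (msg.getContract() != null) {
        for (PreemptionContainer pc : msg.getContract().getContainers()) {
          checkpointAndRelease(pc.getId());
        }
      }
    }

    // Hypothetical helper: save task state so a later container can resume it,
    // then return the container to the RM (work-preserving, rather than a kill).
    void checkpointAndRelease(ContainerId id) {
      // ... checkpoint task output/state, then release the container ...
    }
  }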
Killing Tasks vs. Preemption
[Chart: % complete over time (s), comparing killing tasks (Kill) vs. preemption (Preempt); preemption gives a 33% improvement]
[Diagram: a client submits Job1 to the RM/Scheduler; the AppMaster and its tasks run in containers across several NodeManagers]
Related JIRAs: YARN-45, YARN-567, YARN-568, YARN-569, MR-5176, MR-5189, MR-5192, MR-5194, MR-5196, MR-5197, (…)
Contributing to Apache
Engaging with OSS: talk with active developers; show early/partial work; small patches; it is OK to leave things unfinished
Rayon: Adding Capacity Reservation to YARN
Sharing a Cluster Between Production & Best-Effort Jobs
Production jobs (P): money-making, large (recurrent) jobs with SLAs, e.g., shows up at 3pm, deadline at 6pm, ~90 min runtime
Best-effort jobs: interactive exploratory jobs submitted by data scientists without SLAs; however, latency is still important (a user is waiting)
Not supported well currently in YARN.
New idea: support SLOs for production jobs by using job-provided resource requirements in RDL and system-enforced admission control.
Reservation-Based Scheduling in Hadoop (Curino, Krishnan, Difallah, Douglas, Ramakrishnan, Rao; Rayon paper, SoCC 2014)
Resource Definition Language (RDL)
e.g., atom (<2GB,1core>, 1, 10, 1min, 10bundle/min)
(simplified for OSS release)
Steps:
1. App formulates reservation request in RDL
2. Request is "placed" in the Plan
3. Allocation is validated against the sharing policy
4. System commits to deliver resources on time
5. Plan is dynamically enacted
6. Jobs get (reserved) resources
7. System adapts to live conditions
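A minimal sketch of step 1 in Java, using the reservation API as it shipped in Apache Hadoop 2.6 (YarnClient.submitReservation); the queue name, sizes, and time window are made-up placeholders, and later Hadoop versions also pass a ReservationId in the request:

  import java.util.Collections;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
  import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionResponse;
  import org.apache.hadoop.yarn.api.records.*;
  import org.apache.hadoop.yarn.client.api.YarnClient;

  public class RayonReservationSketch {
    public static void main(String[] args) throws Exception {
      YarnClient yarn = YarnClient.createYarnClient();
      yarn.init(new Configuration());
      yarn.start();
      // RDL-style atom: 10 containers of <2GB, 1 core>, each needed for ~1 minute.
      ReservationRequest atom = ReservationRequest.newInstance(
          Resource.newInstance(2048, 1), 10, 1, 60_000L);
      ReservationRequests wants = ReservationRequests.newInstance(
          Collections.singletonList(atom), ReservationRequestInterpreter.R_ALL);
      // Window between arrival and deadline in which the resources must be delivered.
      long arrival = System.currentTimeMillis() + 60_000L;   // placeholder
      long deadline = arrival + 3 * 60 * 60 * 1000L;         // placeholder
      ReservationDefinition def =
          ReservationDefinition.newInstance(arrival, deadline, wants, "prod-job-sla");
      ReservationSubmissionRequest req =
          ReservationSubmissionRequest.newInstance(def, "production"); // queue is a placeholder
      ReservationSubmissionResponse resp = yarn.submitReservation(req);
      // The returned ReservationId is later set on the job's
      // ApplicationSubmissionContext so it runs against the reserved capacity.
      System.out.println("Reservation accepted: " + resp.getReservationId());
      yarn.stop();
    }
  }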
Architecture: Teach the RM About Time
Results:
• Meets all production job SLAs
• Lowers best-effort job latency
• Increases cluster utilization and throughput
Committed to Hadoop trunk and 2.6 release
Now part of Cloudera CDH and Hortonworks HDP
Comparing Rayon With CapacityScheduler
Initial Umbrella JIRA: YARN-1051 (14 sub-tasks)
Rayon OSS
Rayon V2 Umbrella JIRA: YARN-2572 (25 sub-tasks): tooling, REST APIs, UI, documentation, perf improvements
High-Availability Umbrella JIRA: YARN-2573 (7 sub-tasks)
Heterogeneity/node-labels Umbrella JIRA: YARN-4193 (8 sub-tasks)
Algo enhancements: YARN-3656 (1 sub-task)
Folks involved: Carlo Curino, Subru Krishnan, Ishai Menache, Sean Po, Jonathan Yaniv, Arun Suresh, Anubhav Dhoot, Alexey Tumanov
Included in Apache Hadoop 2.6; various enhancements in the upcoming Apache Hadoop 2.8
YARN Federation: Improving YARN's Scalability
Why Federation?
Problem:
• YARN scalability is bounded by the centralized ResourceManager, proportional to #nodes, #apps, #containers, and heartbeat frequency
• Maintenance and operations on a single massive cluster are painful
Solution:
• Scale by federating multiple YARN clusters, which appear as a single massive cluster to an app
• NodeManagers heartbeat to one RM only
• Most apps talk with one RM only; a few apps might span sub-clusters (achieved by transparently proxying AM-RM communication), e.g., when a single app exceeds sub-cluster capacity or for load reasons
• Easier provisioning / maintenance; leverages the cross-company stabilization effort on smaller YARN clusters
• Use ~6K-node YARN clusters as-is as building blocks
Federation Architecture
[Diagram: a Client submits jobs through a Router to one of N YARN sub-clusters, each with its own RM (RM-001 … RM-NNN) and NodeManagers (NM-001 … NM-NNN); a Global Policy Generator, a State Store, and a Policy Store coordinate the sub-clusters; HDFS read placement hints are optional]
Flow:
1. Heartbeat (membership)
2. Request capacity allocation
3. Write (initial) capacity allocation
4. Read membership and load
5. Write capacity allocation (updates) & policies
6. Submit job
7. Read policies (capacity allocation and job routing)
8. Write app -> sub-cluster mapping
9. Submit job
Federation JIRAs
Umbrella JIRA: YARN-2915
Main sub-JIRAs (work item, JIRA, author):
• Federation StateStore APIs: YARN-3662 (Subru Krishnan)
• Federation PolicyStore APIs: YARN-3664 (Subru Krishnan)
• Federation "Capacity Allocation" across sub-clusters: YARN-3658 (Carlo Curino)
• Federation Router: YARN-3658 (Giovanni Fumarola)
• Federation intercepting and propagating AM-RM communications: YARN-3666 (Kishore Chaliparambil)
• Federation maintenance mechanisms (command propagation): YARN-3657 (Carlo Curino)
• Federation sub-cluster membership mechanisms: YARN-3665 (Subru Krishnan)
• Federation State and Policy Store (SQL implementation): YARN-3663 (Giovanni Fumarola)
• Federation Global Policy Generator (load balancing): YARN-3660 (Subru Krishnan)
Mercury and Yaq: Improving Cluster Utilization and Job Completion Times
• Feedback delays impact cluster utilization: NMs report resource availability through heartbeats (HBs), so resources can remain idle between HBs and utilization is suboptimal, especially for shorter tasks
• Mercury: introduces distributed scheduling in YARN to minimize allocation latencies
• Yaq: queuing of tasks at the NMs to minimize idle times between allocations; queue management techniques to improve job completion time
Cluster Utilization in YARN
Average YARN cluster utilization for varying workloads:
Workload:     5 sec    10 sec   50 sec   Mixed-5-50   Cosmos-gm
Utilization:  60.59%   78.35%   92.38%   78.54%       83.38%
Two types of schedulers:
• Central scheduler (YARN): scheduling policies/guarantees, slow(er) decisions
• Distributed schedulers (new): fast/low-latency decisions, ideal for short or lower-priority tasks
The AM specifies the resource type for each request: guaranteed (via YARN) or opportunistic (via distributed scheduling); see the sketch after the architecture diagram below.
Mercury: Distributed Scheduling in YARN
[Diagram: Mercury Resource Management Framework. An App Master sends requests tagged with a resource type; guaranteed requests go to the Central Scheduler and opportunistic requests go to Distributed Schedulers, coordinated by the Mercury Coordinator, with a Mercury Runtime on each node]
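A minimal sketch of an AM tagging a request as opportunistic, using the ExecutionType API that this line of work later landed in stock Apache Hadoop (2.9+); exact signatures vary by version, and the priority and container size here are placeholders:

  import org.apache.hadoop.yarn.api.records.ExecutionType;
  import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.api.records.ResourceRequest;

  public class OpportunisticRequestSketch {
    // Build a ResourceRequest that can be served by the fast
    // distributed/opportunistic path instead of the guaranteed one.
    static ResourceRequest shortTaskRequest() {
      ResourceRequest req = ResourceRequest.newInstance(
          Priority.newInstance(10),        // lower priority: fine to run opportunistically
          ResourceRequest.ANY,             // no locality constraint
          Resource.newInstance(512, 1),    // small container for a short task
          4);                              // number of containers
      // OPPORTUNISTIC: dispatched quickly, may be preempted by GUARANTEED work.
      req.setExecutionTypeRequest(
          ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));
      return req;
    }
  }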
Mercury: Task Throughput as Task Duration Increases
Results:
• 41% task throughput improvement for short tasks
• Improvement drops for longer task durations
• Best results when carefully combining only-G with only-Q ("only-G" is stock YARN, "only-Q" is Mercury)
Yaq: Efficient Management of NM Queues
• Introduce queuing of tasks at NMs: minimize idle times between allocations, improve cluster utilization
• Queue management to improve job completion times; explore queue management techniques: placement of tasks at queues, bounding queue length, dynamic queue reordering (a sketch follows below)
• Techniques applied to YARN (1.7x improvement in median job completion time) and to Mercury (3.9x improvement)
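Purely as an illustration of the queue-management ideas above (not Yaq's actual implementation), here is a hypothetical bounded node-local task queue that orders queued tasks by estimated duration so that short tasks are dequeued first:

  import java.util.Comparator;
  import java.util.concurrent.PriorityBlockingQueue;

  public class BoundedNodeQueueSketch {
    // Hypothetical stand-in for a task waiting at a NodeManager.
    static final class QueuedTask {
      final String id;
      final long estimatedDurationMs;
      QueuedTask(String id, long estimatedDurationMs) {
        this.id = id;
        this.estimatedDurationMs = estimatedDurationMs;
      }
    }

    private final int maxQueueLength; // bounding the queue length
    private final PriorityBlockingQueue<QueuedTask> queue =
        new PriorityBlockingQueue<>(16,
            Comparator.comparingLong((QueuedTask t) -> t.estimatedDurationMs)); // reordering

    public BoundedNodeQueueSketch(int maxQueueLength) {
      this.maxQueueLength = maxQueueLength;
    }

    // Placement: a task is queued here only if the bound is not exceeded;
    // otherwise the scheduler should pick another node.
    public synchronized boolean tryEnqueue(QueuedTask t) {
      if (queue.size() >= maxQueueLength) {
        return false;
      }
      queue.offer(t);
      return true;
    }

    // Called by the NM when resources free up; the shortest queued task runs next.
    public QueuedTask nextTask() throws InterruptedException {
      return queue.take();
    }
  }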
Evaluating Yaq on a Production Workload: Yaq-on-YARN (Yaq-c) vs. Yaq-on-Mercury (Yaq-d)
Mercury and Yaq OSS
• Umbrella JIRA for Mercury: YARN-2877
• Introduce OPPORTUNISTIC container type [YARN-2882, YARN-4335]: RESOLVED (Konstantinos Karanasos)
• Proxying AM-RM communications [YARN-2884]: RESOLVED (Kishore Chaliparambil)
• Distributed scheduling interceptor [YARN-2885]: RESOLVED (Arun Suresh)
• Queuing of containers at NMs [YARN-2883]: PATCH AVAILABLE (Konstantinos Karanasos)
• Notify RM of OPPORTUNISTIC containers [YARN-4738]: PATCH AVAILABLE (Konstantinos Karanasos)
• Cluster monitor to gather NM queue state [YARN-4412]: RESOLVED (Arun Suresh)
• NM queue rebalancing [YARN-2888]: RESOLVED (Arun Suresh)
Microsoft Contributions to OSS Apache YARN
• Amoeba (work-preserving pre-emption) and Rayon (predictable resource allocation and SLOs via intelligent admission control). Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq: merge Cosmos-style optimistic allocation with YARN conservatism to improve performance while preserving global policy control. Status: prototypes, JIRAs and papers
• Federation: scale clusters up to over 100K machines; enable federation across data centers for fluid load shifting and BCDR. Status: prototype and JIRA
• Framework-level Pooling: enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times. Status: spec
Papers
• Amoeba [SoCC'12]: Ganesh Ananthanarayanan, Sriram Rao, Chris Douglas, Raghu Ramakrishnan, Ion Stoica
• Apache YARN [SoCC'13]: authored by CISL (Chris Douglas, Carlo Curino), Yahoo!, HW, Inmobi (won Best Paper at SoCC'13)
• Rayon [SoCC'14]: Carlo Curino, Chris Douglas, Djellel Difallah, Raghu Ramakrishnan, Sriram Rao (CISL)
• Tetris [SIGCOMM'14]: Robert Grandl, Ganesh Ananthanarayanan and Srikanth Kandula (MSR), Sriram Rao (CISL)
• Mercury [USENIX'15 paper and Eurosys'15 poster]: authored by CISL (Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas) and the RM team (Kishore Chaliparambil, Giovanni Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga)
• Yaq [Eurosys'16 paper and NSDI'16 poster]: Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, Milan Vojnovic, and Sriram Rao
Comparison with Mesos/Omega/Borg
• Reservations and planning: YARN is the first cluster scheduler to support such features, through Rayon
• Queue management techniques: no support for advanced queue management in Mesos/Omega/Borg
• Scalability: YARN using Federation can now match the scalability of Mesos and Borg
REEF: Control Flow Framework for YARN Applications
YARN provides containers; REEF makes it easy to build event flows / iterative computations using YARN. Think "libc for BigData".
Project history:
• 2012: started the project in CISL
• 2013: engaged with product teams
• 2014: open source release
• Apache incubation (~20 devs from 7 organizations)
• Azure Stream Analytics public preview (built with REEF)
• Azure ML workflow scheduler on REEF
Learn more: http://www.reef-project.org and http://reef.incubator.apache.org
© 2015 Microsoft Corporation. All rights reserved.
For more, see my blog post at http://aka.ms/adltechblog/
For more on REEF: http://www.reef-project.org and http://reef.incubator.apache.org