Cloud Scheduling
CS 525 Spring 2011: Advanced Distributed Systems
Department of Computer Science, UIUC
Presented by: Muntasir Raihan Rahman and Anupam Das
Transcript
Page 1:

Cloud Scheduling

CS 525 Spring 2011: Advanced Distributed Systems

Department of Computer Science, UIUC

Presented by: Muntasir Raihan Rahman and Anupam Das

Page 2:

Papers Presented

Improving MapReduce Performance in Heterogeneous Environments @ OSDI 2008

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica @ UC Berkeley RAD Lab

Quincy: Fair Scheduling for Distributed Computing Clusters @ SOSP 2009

Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, Andrew Goldberg @ MSR Silicon Valley

Reining in the Outliers in Map-Reduce Clusters using Mantri @ OSDI 2010

Ganesh Ananthanarayanan, Ion Stoica @ UC Berkeley RAD Lab; Srikanth Kandula, Albert Greenberg @ MSR; Yi Lu @ UIUC; Bikas Saha, Edward Harris @ Microsoft Bing

Page 3:

Quincy: Fair Scheduling for Distributed Computing Clusters [@ SOSP 2009]

Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg @ MSR Silicon Valley


Page 4:

Motivation

• Fairness:
– The existing Dryad scheduler is unfair [greedy approach].
– Subsequent small jobs wait for a large job to finish.

• Data Locality:
– HPC jobs fetch data from a SAN, so there is no need to co-locate data and computation.
– Data-intensive workloads have storage attached to the computers.
– Scheduling tasks near their data improves performance.

Page 5:

Fair Sharing

• Job X takes t seconds when it runs exclusively on the cluster.

• X should take no more than Jt seconds when the cluster runs J concurrent jobs.

• Formally, with N computers and J jobs, each job should get at least N/J computers.

Page 6:

Quincy Assumptions

• Homogeneous clusters
– Heterogeneity is addressed in the next paper [LATE scheduler].

• Global uniform cost measure
– E.g., Quincy assumes that the cost of preempting a running job can be expressed in the same units as the cost of data transfer.

Page 7:

Fine Grain Resource Sharing

• For MPI jobs, coarse-grain scheduling:
– Devote a fixed set of computers to a particular job.
– Static allocation; the allocation rarely changes.

• For data-intensive jobs (MapReduce, Dryad), we need fine-grain resource sharing:
– Multiplex all computers in the cluster between all jobs.
– Large datasets are attached to each computer.
– Tasks are independent (it is less costly to kill a task and restart it).

Page 8:

Example of Coarse Grain Sharing

Page 9:

Example of Fine Grain Sharing

Page 10:

Goals of Quincy

• Fair sharing with locality: N computers, J jobs,
– Each job gets at least N/J computers,
– With data locality: place tasks near their data to avoid network bottlenecks.

• Feels like a multi-constrained optimization problem with trade-offs!
– A joint optimization of fairness and data locality.
– These objectives might be at odds!

Page 11:

Cluster Architecture

Page 12:

Baseline: Queue Based Scheduler

Page 13:

Flow Based Scheduler = Quincy

• Main Idea: [Matching = Scheduling]
– Construct a graph based on the scheduling constraints and the cluster architecture.
– Assign costs to each matching.
– Finding a min-cost flow on the graph is equivalent to finding a feasible schedule.
– Each task is either scheduled on a computer or remains unscheduled.
– Fairness constrains the number of tasks scheduled for each job.

Page 14:

New Goal

• Minimize the matching cost while obeying fairness constraints.
– Instead of making local [greedy] decisions, solve the problem globally.

• Issues:
– How to construct the graph?
– How to embed fairness and locality constraints in the graph?
• Details are in the appendix of the paper.

Page 15:

Graph Construction

• Start with a directed graph representation of the cluster architecture.

Page 16:

Graph Construction (2)

• Add an unscheduled node Uj.

• Each worker task has an edge to Uj.

• There is a single edge from Uj to the sink.

• High cost on edges from tasks to Uj.

• The cost and flow on the edge from Uj to the sink control fairness.

• Fairness is controlled by adjusting the number of tasks allowed for each job.


Page 17:

Graph Construction (3)

• Add edges from tasks (T) to computers (C), racks (R), and the cluster (X).

• cost(T-C) << cost(T-R) << cost(T-X); this gives control over data locality (a toy construction is sketched below).

• A 0-cost edge from the root task to its computer avoids preempting the root task.
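
To make the matching-as-scheduling idea concrete, here is a minimal toy sketch using the networkx library. The two-task, one-rack topology and all costs and capacities are illustrative assumptions, not Quincy's actual parameters; the paper's appendix gives the real construction.

```python
# Toy version of Quincy's matching-as-min-cost-flow construction using
# networkx. Topology, costs, and capacities are illustrative assumptions,
# not the paper's actual values.
import networkx as nx

G = nx.DiGraph()

# Two worker tasks of one job; each must push one unit of flow to the sink.
for t in ("T0", "T1"):
    G.add_node(t, demand=-1)
G.add_node("sink", demand=2)

# Cluster aggregation nodes: cluster X -> rack R0 -> computers C0, C1.
G.add_edge("X", "R0", weight=0, capacity=2)
G.add_edge("R0", "C0", weight=0, capacity=1)
G.add_edge("R0", "C1", weight=0, capacity=1)
for c in ("C0", "C1"):
    G.add_edge(c, "sink", weight=0, capacity=1)  # one task per computer

# Task edges: cheap to the computer holding the task's data, pricier to the
# rack, priciest to the whole cluster, very costly to stay unscheduled (U).
for t, local in (("T0", "C0"), ("T1", "C1")):
    G.add_edge(t, local, weight=1, capacity=1)   # data-local
    G.add_edge(t, "R0", weight=5, capacity=1)    # rack-local
    G.add_edge(t, "X", weight=20, capacity=1)    # anywhere in the cluster
    G.add_edge(t, "U", weight=100, capacity=1)   # leave unscheduled

# The capacity of U -> sink bounds how many tasks may stay unscheduled,
# which is the knob that encodes the fairness constraint.
G.add_edge("U", "sink", weight=0, capacity=2)

flow = nx.min_cost_flow(G)  # a min-cost flow is a cheapest feasible schedule
for t in ("T0", "T1"):
    print(t, "->", [v for v, f in flow[t].items() if f > 0])
```

Running this places each task on the computer holding its data, since those edges are cheapest and the fairness capacity permits it.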

Page 18:

A Feasible Matching


• The cost of a T-U edge increases over time.

• A new cost is assigned to a scheduled T-C edge, and it too increases over time.

Page 19:

Final Graph

Page 20:

Workloads

• Typical Dryad jobs (Sort, Join, PageRank, WordCount, Prime).

• Prime used as a worst-case job that hogs the cluster if started first.

• 240 computers in the cluster: 8 racks, 29-31 computers per rack.

• More than one metric used for evaluation.


Page 21:

Experiments

Page 22:

Experiments (2)

Page 23:

Experiments (3)

Page 24:

Experiments (4)

Page 25:

Experiments (5)

Page 26:

Conclusion

• A new computational model for data-intensive computing.

• An elegant mapping of scheduling to min-cost flow/matching.

Page 27:

Discussion Points

• The min-cost flow is recomputed from scratch each time a change occurs.
– Improvement: incremental flow computation.

• No theoretical stability guarantee.

• Fairness measure: Quincy controls the number of tasks for each job; are there other measures?

• Correlation constraints?

• Other models: auctions to allocate resources.

• Selfish behavior: jobs manipulating costs.

• Heterogeneous data centers.

• Centralized Quincy controller: single point of failure.

Page 28:

Improving MapReduce Performance in Heterogeneous Environments

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica


Page 29:

Background

[Figure: the Map-Reduce phases. Input records are split; the map phase emits key-value pairs (k1, v1), (k2, v2), ...; the shuffle groups all values with the same key; the reduce phase folds each group into output records.]

Map-Reduce Phases
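
To make the dataflow concrete, here is a tiny in-memory word count in plain Python (not Hadoop code) that walks through the three phases:

```python
# A tiny in-memory illustration of the map -> shuffle -> reduce dataflow
# (word count), in plain Python; this is not Hadoop code.
from collections import defaultdict

records = ["a b a", "b c"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group all values with the same key together.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: fold each key's group of values into an output record.
output = {key: sum(values) for key, values in groups.items()}
print(output)  # {'a': 2, 'b': 2, 'c': 1}
```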


Page 30:

Motivation

1. MapReduce is becoming popular.
Open-source implementation: Hadoop, used by Yahoo! and Facebook.
Scale: 20 PB/day at Google, O(10,000) nodes at Yahoo!, 3,000 jobs/day at Facebook.

2. Utility computing services like Amazon Elastic Compute Cloud (EC2) provide cheap on-demand computing.
Price: 10 cents/VM/hour.
Scale: thousands of VMs.
Caveat: less control over performance.

So even a small increase in performance has a significant impact.

Page 31:

Motivation

The performance of MapReduce (Hadoop) depends on its task scheduler, which handles stragglers.

The task scheduler makes its decisions based on the assumption that cluster nodes are homogeneous.

However, this assumption does not hold in heterogeneous environments like Amazon EC2.

Page 32:

Goal

Improve the performance of speculative execution, which deals with re-running stragglers, by:

• Defining a new scheduling metric.

• Choosing the right machines to run speculative tasks on.

• Capping the amount of speculative execution.

Page 33:

Speculative Execution

Slow nodes (stragglers) are the main reason jobs fail to finish in time, so to reduce response time, stragglers are speculatively re-executed on other free nodes.

[Diagram: a task running slowly on Node 1 is speculatively re-executed on the free Node 2.]

How can this be done in a heterogeneous environment?

Page 34:

Speculative Execution in Hadoop

Each task has an associated progress score in [0, 1].

• Map task: the progress score is the fraction of input data read.

• Reduce task: the score is 1/3 × (copy phase) + 1/3 × (sort phase) + 1/3 × (reduce phase), depending on how far each phase has progressed.

E.g., a task halfway through the reduce phase scores 1/3 × 1 + 1/3 × 1 + 1/3 × 1/2 = 5/6.

Speculative execution threshold: progress < avgProgress - 0.2.
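
As a minimal sketch, this scoring and threshold rule translates directly into Python; the function names are ours, and Hadoop's real implementation differs in detail:

```python
# Minimal sketch of Hadoop's progress score and speculation test as
# described above; function names are ours.

def reduce_progress(copy: float, sort: float, reduce: float) -> float:
    """Each reduce phase (copy, sort, reduce) contributes one third."""
    return (copy + sort + reduce) / 3.0

def needs_speculation(progress: float, avg_progress: float) -> bool:
    """Speculate a task whose progress trails the average by more than 0.2."""
    return progress < avg_progress - 0.2

# The slide's example: copy and sort done, halfway through the reduce phase.
print(reduce_progress(1.0, 1.0, 0.5))   # 0.8333... = 5/6
print(needs_speculation(0.35, 0.6))     # True: 0.35 < 0.6 - 0.2
```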


Page 35:

Hadoop’s Assumptions

1. Nodes can perform work at exactly the same rate

2. Tasks progress at a constant rate throughout time

3. There is no cost to launching a speculative task on an idle node

4. The three phases of execution take approximately the same time

5. Tasks with a low progress score are stragglers

6. Maps and Reduces require roughly the same amount of work


Page 36:

Breaking Down the Assumptions

The first two assumptions concern homogeneity, but heterogeneity exists due to:

1. Multiple generations of hardware.
2. Co-location of multiple VMs on the same physical host.

Page 37:

Breaking Down the Assumptions

Assumption 3: There is no cost to launching a speculative task on an idle node.

This is not true in situations where resources are shared, e.g.:

• Network bandwidth

• Disk I/O operations

Page 38:

Breaking Down the Assumptions

Assumption 4: The three phases of execution take approximately the same time.

In practice, the copy phase of the reduce task is the slowest, while the other two phases are relatively fast.

Suppose 40% of the reducers have finished the copy phase and quickly completed their remaining work, while the remaining 60% are near the end of the copy phase.

Average progress: 0.4 × 1 + 0.6 × 1/3 = 60%.

Progress of the 60% of reduce tasks ≈ 33.33%, so progress < avgProgress - 0.2 holds for all of them and they are needlessly speculated.
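
The same arithmetic, checked in a few lines of Python:

```python
# Checking the slide's arithmetic: perfectly healthy reducers still in the
# copy phase fall below Hadoop's speculation threshold.
avg_progress = 0.4 * 1.0 + 0.6 * (1 / 3)         # ~0.6
copy_phase_progress = 1 / 3                       # the slower 60% of reducers
print(copy_phase_progress < avg_progress - 0.2)   # True: all of them speculated
```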


Page 39:

Breaking Down the Assumptions

Assumption 5: Tasks with a low progress score are stragglers.

Suppose a task has finished 80% of its work and from that point onward becomes really slow. Because of the 20% threshold it can never be speculated: its progress of 80% never falls below 100% - 20%.

Page 40:

Problems With Speculative Execution

1. Too many backups, thrashing shared resources like network bandwidth.
2. The wrong tasks are backed up.
3. Backups may be placed on slow nodes.

Example: the authors observed that ~80% of reduces were backed up; most of the backups lost to the originals, and the network was thrashed.

Page 41:

Idea: Use Progress Rate

Instead of using the progress score, compute progress rates, and back up tasks that are “far enough” below the mean.

Progress Rate = Progress Score / Execution Time

Problem: this can still select the wrong tasks.

Page 42:

Progress Rate Example

[Diagram: a job with 3 tasks over 2 minutes. Node 1 runs at 1 task/min, Node 2 is 3x slower, and Node 3 is 1.9x slower.]

Page 43:

Progress Rate Example

What if the job had 5 tasks?

[Diagram: Node 1 runs Task 1; Node 2 runs Task 2 then Task 4; Node 3 runs Task 3 and Task 5. At the 2-minute mark a slot frees up; Node 2's task has 1 minute left, and Node 3's Task 5 has 1.8 minutes left.]

Progress rates: Node 2 = 0.33, Node 3 = 0.53, so Node 2's task is selected for backup.

But Node 2 is merely the slowest node; the scheduler should back up Node 3's task, which will finish last!

Page 44:

LATE Scheduler: Longest Approximate Time to End

Primary assumption: the best task to speculatively execute is the one that will finish furthest into the future.

Secondary assumption: tasks make progress at an approximately constant rate. (Caveat: LATE can still select the wrong tasks.)

Estimated time left = (1 - Progress Score) / Progress Rate

Page 45:

LATE Scheduler

[Diagram: at the 2-minute mark, Node 1 runs Task 1, Node 2 runs Task 2 and Task 4, and Node 3 runs Task 3 and Task 5; a free slot opens for a backup copy.]

Task 2 (progress = 66%): estimated time left = (1 - 0.66) / (1/3) = 1 min.

Task 5 (progress = 5.3%): estimated time left = (1 - 0.05) / (1/1.9) = 1.8 min.

LATE correctly picks Node 3's Task 5 and launches a copy of Task 5.

Page 46:

Other Details of LATE

• Cap the number of speculative tasks (see the sketch below).
– SpeculativeCap: 10%.
– Avoids unnecessary speculation, limiting contention and lost throughput.

• Select a fast node to launch backups on.
– SlowNodeThreshold: 25th percentile, based on total work performed.

• Only back up tasks that are sufficiently slow.
– SlowTaskThreshold: 25th percentile, based on task progress rate.

• LATE does not consider data locality.
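
Putting these pieces together, here is a condensed, non-authoritative sketch of the LATE policy, assuming simplified task records and the slide's default thresholds:

```python
# Condensed sketch of the LATE policy described above. The Task record and
# helper names are ours; the thresholds follow the slide's defaults.
from dataclasses import dataclass

@dataclass
class Task:
    progress: float  # progress score in [0, 1]
    rate: float      # progress score gained per minute

def time_left(t: Task) -> float:
    """LATE's estimate: (1 - progress score) / progress rate."""
    return (1.0 - t.progress) / t.rate

def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p / 100 * (len(xs) - 1))]

def pick_backup(tasks, node_work, free_node_work, spec_running, total_slots):
    """Return the task to back up on the free node, or None."""
    if spec_running >= 0.10 * total_slots:            # SpeculativeCap: 10%
        return None
    if free_node_work <= percentile(node_work, 25):   # SlowNodeThreshold
        return None                                   # never back up on a slow node
    slow = percentile([t.rate for t in tasks], 25)    # SlowTaskThreshold
    candidates = [t for t in tasks if t.rate <= slow]
    # Among sufficiently slow tasks, back up the one that is estimated to
    # finish furthest into the future.
    return max(candidates, key=time_left, default=None)

# The slide's two slow tasks plus six healthy ones: LATE backs up the task
# with ~1.8 minutes left rather than the task on the slowest node.
tasks = [Task(0.66, 1 / 3), Task(0.05, 1 / 1.9)] + [Task(0.9, 1.0)] * 6
print(pick_backup(tasks, node_work=[5, 8, 10, 12], free_node_work=9,
                  spec_running=0, total_slots=100))
```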


Page 47:

Evaluation Environment

Environments:

• Amazon EC2 (200-250 nodes)
• A small local testbed (9 nodes)

Heterogeneity setup:

• Assigning a varying number of VMs to each node.
• Running CPU- and I/O-intensive jobs to intentionally create stragglers.

Page 48:

EC2 Sort with Heterogeneity

Each host sorted 128 MB, for a total of 30 GB of data.

Average 27% speedup over native, 31% over no backups.

Page 49:

EC2 Sort with Stragglers

Each node sorted 256 MB, for a total of 25 GB of data. Stragglers were created with 4 CPU-intensive (800 KB array sort) and 4 disk-intensive (dd) processes.

Average 58% speedup over native, 220% over no backups.

Page 50:

EC2 Grep and WordCount with Stragglers

Grep:
• 36% gain over native
• 57% gain over no backups

WordCount:
• 8.5% gain over native
• 179% gain over no backups

Page 51:

Remarks

Pros:

• Considers the heterogeneity that appears in real-life systems.
• LATE speculatively executes, on fast nodes, the tasks that hurt response time the most.
• LATE caps speculative tasks to avoid overloading resources.

Cons:

• Does not consider data locality.
• Tasks may require different amounts of computation.

Page 52:

Discussion Points

What is the impact of allowing more than one speculative copy of a given task to run?

How would LATE perform on larger VMs?

How could we use data locality to improve the performance of LATE?

How generic are the optimizations made by LATE?


Page 53:

Reining in the Outliers in Map-Reduce Clusters using Mantri

Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris


Page 54:

What this Paper is About

• Current schemes (e.g. Hadoop, LATE) duplicate long-running tasks based on some metrics.

• Mantri: a cause- and resource-aware mitigation scheme.
– Case-by-case analysis: it takes distinct actions based on the cause.
– It considers the opportunity cost of actions.

Page 55:

The Outlier Problem: Causes and Solutions

What are outliers?

• Stragglers: tasks that take >= 1.5 times as long as the median task in that phase.
• Re-computes: tasks that are re-run because their output was lost (not considered in the LATE paper).

Page 56:

Frequency of Outliers

(1) The median phase has 10% stragglers and no recomputes.
(2) 10% of the stragglers take > 10x longer.

Page 57:

Mantri [Resource-Aware Restart]

• Problem: outliers due to machine contention.
• Idea: restart tasks elsewhere in the cluster.
• Challenge: the earlier the better, but restart or duplicate?
• Mantri's solution (sketched below):
– Do either iff P(tnew < trem) is high.
– Mantri kills and restarts a task only if its remaining time is so large that a restart has a high chance of finishing earlier: trem > E(tnew) + C.
– Mantri starts a duplicate only if the total amount of resource consumed decreases: P(trem > tnew · (c+1)/c) is high, where c is the number of copies currently running.
– Continuously observe and kill wasteful copies; at most 3 copies of a task exist.
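
A sketch of this decision rule under simplifying assumptions: e_tnew (the expected runtime of a fresh copy), delta (the slide's constant C), and the use of point estimates instead of the paper's probability tests are all ours.

```python
# Hedged sketch of the restart-or-duplicate rule from the bullets above.
# e_tnew, delta, and the point-estimate comparisons are simplifying
# assumptions; the paper estimates these quantities from task statistics.

MAX_COPIES = 3  # Mantri keeps at most 3 copies of a task alive

def mantri_action(t_rem: float, e_tnew: float, c: int, delta: float) -> str:
    """t_rem: estimated remaining time; c: copies currently running (>= 1)."""
    if c >= MAX_COPIES:
        return "wait"
    # Kill and restart only if a fresh copy very likely finishes earlier.
    if t_rem > e_tnew + delta:
        return "kill_and_restart"
    # Duplicate only if total resource use decreases: with c + 1 copies
    # running, that needs t_rem > e_tnew * (c + 1) / c.
    if t_rem > e_tnew * (c + 1) / c:
        return "duplicate"
    return "wait"

print(mantri_action(t_rem=120, e_tnew=40, c=1, delta=30))  # kill_and_restart
print(mantri_action(t_rem=90, e_tnew=40, c=1, delta=60))   # duplicate
```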


Page 58:

Mantri [Network-Aware Placement]

• Problem: tasks reading input across the network experience variable congestion.
• Idea: avoid hot-spots; keep the traffic on each link proportional to its bandwidth.
• Challenges: global coordination, congestion detection.
• Insights:
– Local control is a good approximation.
– Link utilization averages out over the long term, and is steady over the short term.

If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place an a_i fraction of the reduces in rack i such that no link becomes a bottleneck (a sketch follows).
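
The slide sets up the quantities, but the objective itself appears only in the paper; the following is a rough reconstruction of the idea under the stated setup, a sketch rather than Mantri's actual algorithm.

```python
# Rough reconstruction of the placement rule (the exact objective is in the
# paper). Rack i holds d_i map output out of D total; placing an a_i
# fraction of the reduces there sends d_i*(1 - a_i) bytes out on its uplink
# (bandwidth u_i) and pulls (D - d_i)*a_i bytes in on its downlink
# (bandwidth v_i). We pick a_i to equalize the two transfer times per rack,
# then normalize the fractions.

def place_reduces(d, u, v):
    D = sum(d)
    a = []
    for d_i, u_i, v_i in zip(d, u, v):
        # Solve d_i*(1 - a)/u_i == (D - d_i)*a/v_i for a.
        a.append(d_i * v_i / (d_i * v_i + (D - d_i) * u_i))
    total = sum(a)
    return [x / total for x in a]   # fraction of reduce tasks per rack

# With equal bandwidths, a rack with 60% of the map output gets 60% of the
# reduces, so no single link becomes the bottleneck.
print(place_reduces(d=[60, 20, 20], u=[1, 1, 1], v=[1, 1, 1]))
```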


Page 59:

Mantri [Avoid Recomputations]

• Problem: tasks have to be recomputed when their input becomes unavailable.
• Insights:
– 50% of recomputes happen on 5% of the machines.
– Weigh the cost to recompute against the cost to replicate output.

For a phase-2 task on machine M2 whose input came from a phase-1 task on machine M1: t_redo = r_2 × (t_2 + t1_redo).

• The cost to recompute depends on the data-loss probabilities and the time taken, and it recursively accounts for prior phases (see the sketch below).
• Mantri preferentially acts on the more costly inputs.
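
A small recursive sketch of this cost model, with made-up probabilities and runtimes:

```python
# Recursive sketch of the recompute cost model above: redoing a task costs
# its own runtime plus, recursively, redoing any lost inputs from the prior
# phase, weighted by the probability r_i that a redo is needed. The inputs
# here are illustrative, not measured values.

def t_redo(phases):
    """phases = [(r_1, t_1), (r_2, t_2), ...], first phase first."""
    cost = 0.0
    for r_i, t_i in phases:
        cost = r_i * (t_i + cost)  # t_redo_i = r_i * (t_i + t_redo_{i-1})
    return cost

# Two phases, as in the slide's M1 -> M2 picture:
print(t_redo([(0.1, 100.0), (0.2, 50.0)]))  # 0.2 * (50 + 0.1 * 100) = 12.0
```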


Page 60:

Mantri [Data-Aware Task Ordering]

• Problem: workload imbalance causes tasks to straggle.
• Idea: restarting outliers that are simply lengthy is counter-productive.
• Insight:
– Theorem [Graham, 1969]: scheduling the tasks with the longest processing time first is at most 33% worse than the optimal schedule.
• Mantri's solution (sketched below):
– Schedule the tasks in a phase in descending order of input size.
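
A short sketch of the resulting rule, assuming a simple least-loaded-slot model with input size as the proxy for processing time:

```python
# Sketch of longest-processing-time-first scheduling, with input size as
# the proxy for processing time; the slot model is our simplification.
import heapq

def schedule_lpt(task_sizes, n_slots):
    """Assign tasks, largest first, to the currently least-loaded slot."""
    slots = [(0.0, i) for i in range(n_slots)]  # (load, slot id) min-heap
    heapq.heapify(slots)
    assignment = {}
    for size in sorted(task_sizes, reverse=True):
        load, slot = heapq.heappop(slots)
        assignment.setdefault(slot, []).append(size)
        heapq.heappush(slots, (load + size, slot))
    return assignment

# Makespan is 10 here; Graham's bound says LPT is within 33% of optimal.
print(schedule_lpt([7, 5, 4, 3, 1], n_slots=2))  # {0: [7, 3], 1: [5, 4, 1]}
```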


Page 61:

Summary

• Outliers are a significant problem, and they arise from many causes.

• Mantri: cause- and resource-aware mitigation that outperforms prior schemes.

Page 62:

Discussion Points

• Too many schemes packed together, no unifying theme!

• Mantri does case by case analysis for each cause, what if the causes are inter-dependent?


