Jockey - EuroSys presentation
cs.brown.edu/~adf/work/EuroSys2012-talk.pdf


Jockey: Guaranteed Job Latency in Data Parallel Clusters

Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca

DATA PARALLEL CLUSTERS

Users of data parallel clusters want predictability: jobs have deadlines.

VARIABLE LATENCY

[CDF of latency (minutes, 0–40) across runs of the same job; the spread is annotated as 4.3x.]

Why does latency vary?

1.  Pipeline complexity
2.  Noisy execution environment

MICROSOFT’S DATA PARALLEL CLUSTERS

Cosmos:
•  CosmosStore
•  Dryad
•  SCOPE

DRYAD’S DAG WORKFLOW

[Diagram: a Cosmos cluster runs a pipeline of jobs, each with its own deadline; each job is a DAG of stages, and each stage is made up of many tasks.]

EXPRESSING PERFORMANCE TARGETS

Priorities? Not expressive enough
Weights? Difficult for users to set
Utility curves? Capture deadline & penalty

OUR GOAL

Maximize utility while minimizing resources by dynamically adjusting the allocation.

Jockey

•  Large clusters
•  Many users
•  Prior execution

JOCKEY – MODEL

f(job state, allocation) -> remaining run time

JOCKEY – CONTROL LOOP

[Diagram of the control loop.]

JOCKEY – MODEL

f(job state, allocation) -> remaining run time

Job state is summarized by a progress indicator:

f(progress, allocation) -> remaining run time

JOCKEY – PROGRESS INDICATOR

For each stage, weight the stage's time (total running + total queuing) by its fraction of completed tasks (# complete / total tasks), then sum across stages:

progress = Σ over stages [ (# complete / total tasks) × (total running + total queuing) ]
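The build above spells the indicator out piece by piece. Below is a minimal sketch of that computation in Python; the StageStats fields and the way per-stage statistics are gathered are assumptions for illustration, not Jockey's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class StageStats:
    """Hypothetical per-stage statistics gathered from the running job."""
    tasks_complete: int     # number of finished tasks in this stage
    tasks_total: int        # total tasks in this stage
    total_running: float    # sum of task running times observed so far (seconds)
    total_queuing: float    # sum of task queuing times observed so far (seconds)

def progress_indicator(stages: list[StageStats]) -> float:
    """Jockey-style progress: per stage, weight the stage's time
    (running + queuing) by its fraction of completed tasks, then sum."""
    progress = 0.0
    for s in stages:
        fraction_complete = s.tasks_complete / s.tasks_total
        progress += fraction_complete * (s.total_running + s.total_queuing)
    return progress

# Example: three stages at different points of completion.
stages = [
    StageStats(100, 100, 4000.0, 500.0),   # Stage 1 finished
    StageStats(40, 100, 1500.0, 300.0),    # Stage 2 in flight
    StageStats(0, 50, 0.0, 0.0),           # Stage 3 not started
]
print(progress_indicator(stages))
```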

JOCKEY – PROGRESS INDICATOR

[Plots: job progress and estimated job completion [min] versus time [min] (0–60) for one run.]

JOCKEY – CONTROL LOOP

Job model (predicted remaining run time):

               10 nodes      20 nodes      30 nodes
1% complete    60 minutes    40 minutes    25 minutes
2% complete    59 minutes    39 minutes    24 minutes
3% complete    58 minutes    37 minutes    22 minutes
4% complete    56 minutes    36 minutes    21 minutes
5% complete    54 minutes    34 minutes    20 minutes

Example scenarios (see the sketch below):
•  Deadline: 50 min., Completion: 1%
•  Deadline: 40 min., Completion: 3%
•  Deadline: 30 min., Completion: 5%
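Given such a precomputed table, the control loop can look up the predicted remaining run time for the current completion level and pick an allocation that still meets the deadline. This is a simplified sketch: it ignores Jockey's slack, hysteresis, and dead-zone refinements, and the table values are just the example numbers from the slide.

```python
# Predicted remaining run time (minutes), indexed by percent complete, then nodes.
MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(percent_complete: int, minutes_to_deadline: float) -> int | None:
    """Smallest allocation predicted to finish before the deadline,
    or None if even the largest allocation is predicted to miss it."""
    row = MODEL[percent_complete]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return None

print(pick_allocation(1, 50))   # 20 nodes: 40 min predicted <= 50 min deadline
print(pick_allocation(3, 40))   # 20 nodes: 37 min predicted <= 40 min deadline
print(pick_allocation(5, 30))   # 30 nodes: 20 min predicted <= 30 min deadline
```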

JOCKEY – MODEL

f(progress, allocation) -> remaining run time

An analytic model? Machine learning? Jockey uses a simulator.
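One way to turn the simulator into the function f(progress, allocation) is to sweep it offline over a grid of progress levels and allocations and store the results; the backup slides mention precomputed values used for speedup. The sketch below assumes a stand-in `simulate` function and summarizes repeated runs with a high quantile; both choices are illustrative rather than Jockey's exact method.

```python
import random

def simulate(progress: float, allocation: int, seed: int) -> float:
    """Stand-in for Jockey's event-based simulator: returns one sampled
    remaining run time (minutes) for a job at `progress` with `allocation`
    machines. The real simulator replays the job's DAG using task-time
    distributions from prior runs; this placeholder just fakes a number."""
    rng = random.Random(seed)
    return (1.0 - progress) * 600.0 / allocation * rng.uniform(0.9, 1.3)

def precompute_model(progress_levels, allocations, runs=100, quantile=0.9):
    """Build the f(progress, allocation) table by running the simulator
    many times per cell and keeping a conservative (high) quantile.
    The quantile summary is an assumption for this sketch."""
    table = {}
    for p in progress_levels:
        for a in allocations:
            samples = sorted(simulate(p, a, seed=i) for i in range(runs))
            table[(p, a)] = samples[int(quantile * (runs - 1))]
    return table

model = precompute_model([0.01, 0.02, 0.03], [10, 20, 30])
print(model[(0.01, 20)])
```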

JOCKEY

Problem               Solution
Pipeline complexity   Use a simulator
Noisy environment     Dynamic control

Jockey in Action

•  Real job
•  Production cluster
•  CPU load: ~80%

[Timeline plot of the run, annotated:]
•  Initial deadline: 140 minutes
•  New deadline: 70 minutes
•  Release resources due to excess pessimism
•  “Oracle” allocation = total allocation-hours / deadline
•  Available parallelism less than allocation
•  Allocation above oracle

Evaluation

•  Production cluster
•  21 jobs
•  SLO met?
•  Cluster impact?

Evaluation

[CDF of job completion time relative to deadline (10%–130%) for jobs run with Jockey, with the max allocation, with the allocation from the simulator, and with the control loop only; a vertical line marks the deadline, and jobs to its left met the SLO. The Jockey curve is annotated “1.4x”.]

•  Jockey: missed 1 of 94 deadlines
•  Max allocation: allocated too many resources
•  Allocation from simulator: the simulator made good predictions; 80% finish before the deadline
•  Control loop only: the control loop is stable and successful

Evaluation

[Plot: fraction of deadlines missed (0%–20%) versus fraction of allocation above oracle (0%–100%), comparing Jockey, max allocation, allocation from simulator, and control loop only.]

Conclusion

Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters, making simple models inaccurate.

Jockey combines a simulator with a control loop.

Questions?

Andrew Ferguson
adf@cs.brown.edu

Co-authors:
•  Peter Bodík (Microsoft Research)
•  Srikanth Kandula (Microsoft Research)
•  Eric Boutin (Microsoft)
•  Rodrigo Fonseca (Brown)

Backup Slides


Utility Curves

[Figure: example utility curve with the deadline marked.]

•  For single jobs, scale doesn’t matter
•  For multiple jobs, use financial penalties
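A small sketch of a utility curve with a deadline and a financial penalty; the shape and all constants here are illustrative assumptions, not values from the talk.

```python
def utility(completion_minutes: float,
            deadline_minutes: float = 60.0,
            reward: float = 100.0,
            penalty: float = -50.0,
            grace_minutes: float = 30.0) -> float:
    """Step-like utility curve: full reward before the deadline, then the
    value falls off linearly toward a financial penalty. The shape and all
    constants are illustrative assumptions, not taken from the paper."""
    if completion_minutes <= deadline_minutes:
        return reward
    overrun = completion_minutes - deadline_minutes
    if overrun >= grace_minutes:
        return penalty
    # Linear interpolation between reward and penalty during the grace window.
    frac = overrun / grace_minutes
    return reward + frac * (penalty - reward)

print(utility(55.0))   # before the deadline: full reward
print(utility(75.0))   # halfway through the grace window
print(utility(120.0))  # far past the deadline: penalty
```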

Jockey Resource allocation control loop

Control-theory techniques used to keep the loop stable:
1. Slack
2. Hysteresis
3. Dead Zone

[Diagram labels: Prediction, Run Time, Utility]
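As a hedged sketch of how these three techniques might shape the allocation decision; the talk names the techniques but not their exact form, so the constants and the formula below are assumptions.

```python
def next_allocation(raw_target: int,
                    current: int,
                    slack_factor: float = 1.2,
                    dead_zone: int = 2,
                    hysteresis: float = 0.5) -> int:
    """Turn the model's raw allocation target into the allocation actually
    requested. Constants and form are illustrative, not Jockey's.
    - slack: over-provision the target to absorb prediction error
    - dead zone: ignore changes smaller than a threshold
    - hysteresis: move only part of the way toward the new target"""
    padded = int(round(raw_target * slack_factor))        # slack
    delta = padded - current
    if abs(delta) <= dead_zone:                           # dead zone
        return current
    return current + int(round(delta * hysteresis))       # hysteresis

alloc = 20
for target in (25, 26, 40, 41, 30):
    alloc = next_allocation(target, alloc)
    print(target, "->", alloc)
```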

Resource sharing in Cosmos

•  Resources are allocated with a form of fair sharing across business groups and their jobs (like Hadoop FairScheduler or CapacityScheduler).
•  Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token. Tokens are released on task completion.
•  A token is a guaranteed share of CPU and memory.
•  To increase efficiency, unused tokens are re-allocated to jobs with available work. (See the sketch below.)
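A toy sketch of the token accounting described above — not Cosmos code, just the acquire-on-start, release-on-completion discipline.

```python
class TokenPool:
    """Toy model of Cosmos-style tokens: a job is guaranteed some number of
    tokens; each running or initializing task holds one until it completes."""

    def __init__(self, guaranteed_tokens: int):
        self.available = guaranteed_tokens

    def try_start_task(self) -> bool:
        """Start a task if a token is free; otherwise the task must wait."""
        if self.available > 0:
            self.available -= 1
            return True
        return False

    def complete_task(self) -> None:
        """Return the finished task's token to the pool."""
        self.available += 1

pool = TokenPool(guaranteed_tokens=2)
print(pool.try_start_task())  # True  (token 1 in use)
print(pool.try_start_task())  # True  (token 2 in use)
print(pool.try_start_task())  # False (no tokens free; task queues)
pool.complete_task()          # a task finishes, its token is freed
print(pool.try_start_task())  # True
```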

Jockey Progress indicator

•  Can use many features of the job to build a progress indicator
•  Earlier work (ParaTimer) concentrated on fraction of tasks completed
•  Our indicator is very simple, but we found it performs best for Jockey’s needs; it combines total vertex initialization time, total vertex run time, and fraction of completed vertices

Comparison with ARIA

•  ARIA uses analytic models
•  Designed for 3 stages: Map, Shuffle, Reduce
•  Jockey’s control loop is robust due to control-theory improvements
•  ARIA tested on small (66-node) cluster without a network bottleneck
•  We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.

Jockey Latency prediction: C(p, a)

•  Event-based simulator
   –  Same scheduling logic as actual Job Manager
   –  Captures important features of job progress
   –  Does not model input size variation or speculative re-execution of stragglers
   –  Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
•  Analytic model
   –  Inspired by Amdahl’s Law: T = S + P/N
   –  S is remaining work on critical path, P is all remaining work, N is number of machines
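The analytic model named above is simple enough to state directly; here it is as a one-line function. S and P would come from the job's profile; the numbers below are made up for illustration.

```python
def remaining_time_amdahl(s_critical_path: float,
                          p_total_work: float,
                          n_machines: int) -> float:
    """Amdahl's-Law-inspired estimate from the slide: T = S + P/N, where
    S is remaining work on the critical path, P is all remaining work,
    and N is the number of machines."""
    return s_critical_path + p_total_work / n_machines

# Example: 10 minutes of critical-path work, 600 machine-minutes of total work.
for n in (10, 20, 30):
    print(n, "machines:", remaining_time_amdahl(10.0, 600.0, n), "minutes")
```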

Jockey Resource allocation control loop

•  Executes in Dryad’s Job Manager
•  Inputs: fraction of completed tasks in each stage, time job has spent running, utility function, precomputed values (for speedup)
•  Output: number of tokens to allocate
•  Improved with techniques from control theory

Jockey

[Diagram. Offline: the job profile feeds the simulator. During job runtime: the running job reports job stats; the simulator’s latency predictions, the job stats, and the utility function feed the resource allocation control loop, which sets the allocation for the running job.]