• Applied research group: systems + database people building prototypes, publishing papers
• Collaborating with the Big Data product group at MS: shipping our code to production
• Open-sourcing our code: Apache Hadoop, REEF, Heron
Research areas: resource management, distributed tiered storage, query optimization, log analytics, stream processing.
[Figure: YARN architecture. A central Resource Manager coordinates a cluster of Node Managers; an application obtains resources in three steps.]

1. Request: the application asks the Resource Manager for containers
2. Allocation: the Resource Manager grants containers on specific Node Managers
3. Start task: the application launches its tasks in the allocated containers
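The three-step lifecycle can be sketched as a toy centralized scheduler; the class and method names below are illustrative, not YARN's actual API.

```python
class ResourceManager:
    """Toy centralized RM: tracks free slots per node and grants allocations."""

    def __init__(self, slots_per_node):
        self.free = dict(slots_per_node)  # node -> free slots

    def request(self, job, num_tasks):
        """Steps 1-2: the job requests containers; the RM allocates
        them on nodes that still have capacity."""
        allocations = []
        for node in sorted(self.free, key=self.free.get, reverse=True):
            while self.free[node] > 0 and len(allocations) < num_tasks:
                self.free[node] -= 1
                allocations.append((job, node))
        return allocations

    def start_task(self, job, node):
        """Step 3: the job launches a task in its allocated container."""
        return f"{job} running on {node}"

rm = ResourceManager({"N1": 2, "N2": 2})
allocs = rm.request("j1", 3)                         # 1. Request / 2. Allocation
started = [rm.start_task(j, n) for j, n in allocs]   # 3. Start task
```

Note that every allocation requires a round trip through the central RM; this is the source of the feedback-delay problem discussed later in the deck.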
Do we really need a Resource Manager?
[Figure: the Hadoop 1 World vs. the Hadoop 2 World, layer by layer]

Users: ad-hoc apps, Hive / Pig (in both worlds)
Application frameworks / programming model(s): Hadoop 1.x (MapReduce, MR v1) vs. MR v2, Tez, Giraph, Storm, Dryad, Spark, REEF, Heron, ...
Cluster OS (resource management): YARN (Hadoop 2 only)
File system: HDFS 1 vs. HDFS 2
Hardware

• Hadoop 1 was monolithic: MapReduce served as both programming model and resource manager
• Hadoop 2 allows reuse of the RM component (YARN) across frameworks
• YARN layering abstractions separate the cluster OS from the programming models
But is all this good enough for the Microsoft clusters?
Requirements for the Microsoft clusters:
• High resource utilization (the goal: close to 100% utilization)
• Scalability
• Workload heterogeneity: a wide variety of jobs
• Production jobs and predictability: recurring jobs (>60% of the workload), many with deadlines; today predictability comes from over-provisioned clusters
• Rayon/Morpheus:
• Mercury/Yaq: [Hadoop 3.0; ATC 2015, EuroSys 2016]
• YARN Federation:
• Medea:
4 Hadoop committers in CISL; 404 patches as of last night
[Figure: a centralized RM scheduling jobs j1 and j2 on nodes N1 and N2, one allocation round at a time]

• Feedback delays: nodes sit idle between allocations, waiting for the RM's next scheduling decision
• Actual slot utilization for different task durations:

Task duration  5 sec    10 sec   50 sec   Mixed-5-50   Cosmos-gm
Utilization    60.59%   78.35%   92.38%   78.54%       83.38%
• Introduce task queuing at nodes
  • Mask feedback delays
  • Improve cluster utilization
  • Improve task throughput (by up to 40%)
• Container types: GUARANTEED and OPPORTUNISTIC
  • Keep guarantees for important jobs
  • Use opportunistic execution to improve utilization
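A minimal sketch of the two container types; the class and method names are illustrative, not Mercury's actual implementation. A node runs GUARANTEED containers immediately, queues OPPORTUNISTIC ones when busy, and preempts opportunistic work to honor guarantees.

```python
from collections import deque

class NodeManager:
    """Toy node with Mercury-style container types (illustrative only)."""

    def __init__(self, slots):
        self.slots = slots
        self.running = []     # list of (container_id, type)
        self.queue = deque()  # queued opportunistic containers

    def start(self, cid, ctype):
        if len(self.running) < self.slots:
            self.running.append((cid, ctype))
        elif ctype == "GUARANTEED":
            # Preempt an opportunistic container to honor the guarantee.
            for i, (rid, rtype) in enumerate(self.running):
                if rtype == "OPPORTUNISTIC":
                    self.queue.appendleft((rid, rtype))  # re-queue preempted task
                    self.running[i] = (cid, ctype)
                    return
            raise RuntimeError("node saturated with guaranteed containers")
        else:
            self.queue.append((cid, ctype))  # queue instead of idling: masks RM delay

    def finish(self, cid):
        self.running = [(r, t) for r, t in self.running if r != cid]
        if self.queue and len(self.running) < self.slots:
            self.running.append(self.queue.popleft())

nm = NodeManager(slots=1)
nm.start("o1", "OPPORTUNISTIC")  # runs immediately
nm.start("o2", "OPPORTUNISTIC")  # node busy: queued
nm.start("g1", "GUARANTEED")     # preempts o1, which is re-queued
```

When `g1` finishes, the node pulls `o1` from its local queue with no RM round trip, which is how queuing masks the feedback delay.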
[Figure: the same scheduling animation, now with task queues at the nodes; queued tasks keep N1 and N2 busy between RM allocation rounds]
So all we need to do is use long queues?
Despite the utilization gains, long queues can be detrimental for job completion times. Proper queue management techniques are required.
[Figure: nodes N1, N2, N3, each with a task queue]

Yaq's queue management techniques:
• Place tasks to node queues
• Prioritize task execution (queue reordering)
• Bound queue lengths

Yaq improves median job completion time by 1.7x over YARN
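The third technique, bounding queue lengths, can be sketched as a simple admission check at each node; the static cap here is illustrative, not Yaq's actual bound.

```python
MAX_QUEUE = 4  # illustrative static cap, not Yaq's actual policy

def try_enqueue(queue, task, max_len=MAX_QUEUE):
    """Admit a task only if the node's queue is below its bound;
    otherwise the scheduler must place it elsewhere."""
    if len(queue) < max_len:
        queue.append(task)
        return True
    return False

q = []
admitted = [try_enqueue(q, f"t{i}") for i in range(6)]
# → [True, True, True, True, False, False]; q holds t0..t3
```

Bounding prevents any single node from hoarding tasks that would wait a long time behind its queue while other nodes drain.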
[Figure: the RM placing tasks across the queues of N1, N2, N3]

Task placement policies:
• Queue length: place each task on the node with the shortest queue
• Queue wait time: place each task on the node with the lowest estimated queue wait time
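The two placement policies can be sketched as follows; the function names are mine, and the per-task duration estimates are assumed to come from the history of recurring jobs.

```python
def place_by_queue_length(queues):
    """Pick the node with the fewest queued tasks."""
    return min(queues, key=lambda node: len(queues[node]))

def place_by_wait_time(queues, est_duration):
    """Pick the node with the lowest estimated queue wait time,
    i.e. the sum of estimated durations of already-queued tasks."""
    return min(queues, key=lambda node: sum(est_duration[t] for t in queues[node]))

queues = {"N1": ["t1", "t2"], "N2": ["t3"], "N3": ["t4"]}
est_duration = {"t1": 5, "t2": 5, "t3": 50, "t4": 8}

place_by_queue_length(queues)             # → "N2" (first of the tied shortest queues)
place_by_wait_time(queues, est_duration)  # → "N3" (8s wait beats N1's 10s, N2's 50s)
```

The example shows why wait time is the better signal: N2 has the shortest queue but the longest wait, because its one queued task is long.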
Queue reordering policies:
• Shortest Remaining Job First (SRJF)
• Least Remaining Tasks First (LRTF)

Example: j1 has 21 remaining tasks, j2 has 5, j3 has 9. A job-aware policy such as LRTF runs j2's queued tasks first, then j3's, then j1's. Job-unaware ordering leads to lower throughput and longer job completion times.
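A minimal sketch of LRTF reordering, assuming each queued task is tagged with its job id and the scheduler knows per-job remaining task counts (the "job:task" encoding is illustrative):

```python
def reorder_lrtf(queue, remaining_tasks):
    """Least Remaining Tasks First: run tasks of the job with the
    fewest remaining tasks first, so short jobs finish quickly."""
    return sorted(queue, key=lambda task: remaining_tasks[task.split(":")[0]])

remaining_tasks = {"j1": 21, "j2": 5, "j3": 9}
queue = ["j1:t1", "j2:t1", "j3:t1", "j1:t2", "j2:t2"]
reorder_lrtf(queue, remaining_tasks)
# → ["j2:t1", "j2:t2", "j3:t1", "j1:t1", "j1:t2"]
```

SRJF would look the same but sort by estimated remaining work (remaining tasks times estimated task duration) rather than raw task count.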
Takeaways:
• Proper queue management techniques improve both cluster utilization and job completion time (1.7x improvement in median JCT over YARN)
• Container types (GUARANTEED / OPPORTUNISTIC) enable distributed scheduling and can support any distributed scheduler
• Open directions: over-commitment, multi-tenancy, pricing