Starfish: A Self-tuning System for Big Data Analytics
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu
Department of Computer Science, Duke University
Starfish Overview
[Figure: Starfish in the Hadoop ecosystem. Starfish sits on an extensible MapReduce execution engine over HDFS and other storage engines; above it are clients and interfaces (SQL and Java clients, web frontends, Oozie, Hive, Pig, Elastic MapReduce), with data inputs and outputs flowing through Scribe, Flume, middleware/DB/key-value stores, database systems, and OLTP systems.]
Hadoop is a MAD system for data analytics
Magnetism: attracts all sources of data
Agility: adapts in sync with rapid data evolution
Depth: supports complex analytics needs
Starfish makes Hadoop MADDER and Self-Tuning
Data-lifecycle-awareness: achieves good performance throughout the data lifecycle
Elasticity: adjusts resources and operational costs
Robustness: provides availability and predictability
Just-in-Time Job Optimization
Goal
Find good settings for configuration parameters
Settings depend on job, data, and cluster characteristics
Challenges
Data opacity until processing
File-based processing
Heavy use of programming languages
Approach
Profiler: Uses dynamic instrumentation to learn performance models (job profiles) for unmodified MapReduce programs
Sampler: Collects statistics about the input, intermediate, and output key-value spaces of a MapReduce job
What-if Engine: Uses a mix of simulation and model-based estimation to predict job performance
Just-in-Time Optimizer: Searches through the high-dimensional space of configuration parameter settings (a minimal optimizer loop is sketched below)
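To make the search concrete, here is a minimal sketch of the optimizer loop in Java, assuming the What-if Engine is exposed as a runtime estimator over candidate settings. The two parameters, their ranges, and the stand-in cost model are illustrative assumptions, not Starfish's actual models or interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a just-in-time optimizer loop. The What-if Engine
 * is stood in for by a toy runtime model; in Starfish the prediction
 * would come from the job profile learned by the Profiler.
 */
public class JitOptimizerSketch {

    /** One candidate assignment of two (hypothetical) Hadoop parameters. */
    record Settings(int reduceTasks, int ioSortMb) {}

    /** Stand-in for the What-if Engine: predicted runtime in seconds. */
    static double predictRuntime(Settings s) {
        // Toy model: more reducers amortize shuffle work but add fixed
        // startup overhead; a larger sort buffer reduces map-side spills.
        double shuffle = 600.0 / s.reduceTasks() + 2.0 * s.reduceTasks();
        double spills  = 300.0 / s.ioSortMb();
        return shuffle + spills;
    }

    public static void main(String[] args) {
        // Enumerate a small grid of the high-dimensional parameter space;
        // Starfish searches this space with smarter strategies.
        List<Settings> candidates = new ArrayList<>();
        for (int reducers : new int[] {2, 4, 8, 16, 32})
            for (int sortMb : new int[] {50, 100, 200})
                candidates.add(new Settings(reducers, sortMb));

        Settings best = null;
        double bestTime = Double.MAX_VALUE;
        for (Settings s : candidates) {
            double t = predictRuntime(s);  // ask the "What-if Engine"
            if (t < bestTime) { bestTime = t; best = s; }
        }
        System.out.printf("best = %s, predicted runtime = %.1f s%n", best, bestTime);
    }
}
```

The point of the structure is that predictions are cheap relative to running the job, so many candidate settings can be evaluated just in time, before the job is submitted.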
[Figure: Components in the Starfish architecture. Job-level tuning: Just-in-Time Optimizer, Profiler, Sampler, and What-if Engine. Workflow-level tuning: Workflow-aware Scheduler and a Data Manager comprising a Metadata Manager, an Intermediate Data Manager, and a Data Layout & Storage Manager. Workload-level tuning: Workload Optimizer and Elastisizer.]
[Figure: Response surfaces of MapReduce programs in Hadoop; Running Time (sec) surfaces for TeraSort and WordCount]
Workflow-aware Scheduling
Scheduling Objectives
Ensure balanced data layout
Avoid cascading reexecution under node failure or data corruption
Ensure power proportional computing
Adapt to imbalance in load or cost of energy across data centers
Causes of unbalanced data layouts
Skewed data
Data-layout-unaware scheduling of tasks
Addition or dropping of nodes without rebalancing operations
Approach
Consider interactions between scheduling policies and block placement policies of the storage system
Use smart scheduling to perform rebalancing automatically (see the placement sketch below)
Exploit opportunities for collocating data sets
[Figure: Unbalanced data layout after executing one MapReduce job]
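A minimal sketch of the rebalancing idea, assuming the scheduler can choose the target node for each output block: greedily routing new blocks to the least-loaded node counteracts skew like that shown in the figure. The node count, usage figures, and per-block cost are hypothetical.

```java
import java.util.Arrays;

/**
 * Layout-aware placement sketch: each new output block goes to the
 * data node with the lowest current disk usage, so an initially skewed
 * layout drifts back toward balance as jobs run.
 */
public class BalancedPlacementSketch {

    public static void main(String[] args) {
        // Disk usage (%) per data node; deliberately skewed to start.
        double[] usage = {18, 3, 4, 17, 2, 5, 16, 3};
        double blockCost = 1.5;  // usage added by one new block (assumed)

        // Place 20 output blocks, always on the least-loaded node.
        for (int b = 0; b < 20; b++) {
            int target = 0;
            for (int n = 1; n < usage.length; n++) {
                if (usage[n] < usage[target]) target = n;
            }
            usage[target] += blockCost;
        }
        System.out.println("usage after placement: " + Arrays.toString(usage));
    }
}
```

A real policy must also respect data locality and replication constraints, which is why the approach above frames rebalancing as an interaction between scheduling and block placement.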
Workload Optimization
Processing multiple workflows on the same data
Optimization Techniques
Data-flow sharing
Materialization
Reorganization
Challenges
Interactions of the above techniques with each other and with scheduling, data layout policies, and configuration parameter settings
Jumbo Operator
Use a single MapReduce job to process multiple Select-Project-Aggregate operations over a table
Enables sharing of scans, computation, sorting, shuffling, and output generation (a shared-scan sketch follows)
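A minimal sketch of the shared-scan idea behind the Jumbo operator, written as plain Java rather than an actual Hadoop job: one pass over a table computes partial aggregates for several Select-Project-Aggregate queries at once, tagging each key with a query id so a single shuffle and reduce phase could serve all of them. The table schema echoes the Clicks relation from the workload figure below; the two queries and the key-tagging scheme are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Shared-scan sketch: the map side of one "jumbo" job evaluates two
 * select-aggregate queries during a single scan, keying partial
 * aggregates by (queryId, groupKey) so one shuffle serves every query.
 */
public class JumboScanSketch {

    record Click(String username, String url, int value) {}

    public static void main(String[] args) {
        Click[] table = {
            new Click("ann", "sports/1", 3),
            new Click("bob", "news/7",  0),
            new Click("ann", "sports/2", 5),
        };

        Map<String, Integer> partials = new HashMap<>();
        for (Click c : table) {              // one scan, many queries
            // Q1: count clicks per user where value > 0
            if (c.value() > 0) {
                partials.merge("Q1|" + c.username(), 1, Integer::sum);
            }
            // Q2: total value per url type (prefix before the slash)
            String type = c.url().split("/")[0];
            partials.merge("Q2|" + type, c.value(), Integer::sum);
        }
        partials.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```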
Provisioning for Hadoop Workloads
Goal
Make provisioning decisions based on workload requirements (see the search sketch below)
Provisioning Choices
Number of nodes
Cluster configuration
Network configuration
Long-term vision: Hadoop Analytics as a Service
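A minimal sketch of the provisioning search, assuming runtime predictions per configuration (the kind of estimate the What-if Engine produces) and hourly on-demand pricing: enumerate node types and cluster sizes, keep the configurations that meet a deadline, and pick the cheapest. The node types mirror the charts below, but the prices, speedups, and scaling model are illustrative assumptions, not measured Elastic MapReduce numbers.

```java
/**
 * Provisioning sketch: choose the cheapest (node type, node count)
 * configuration whose predicted runtime meets a deadline.
 */
public class ProvisioningSketch {

    record NodeType(String name, double dollarsPerHour, double speedup) {}

    public static void main(String[] args) {
        NodeType[] types = {
            new NodeType("m1.small",  0.10, 1.0),
            new NodeType("m1.large",  0.40, 3.5),
            new NodeType("m1.xlarge", 0.80, 6.5),
        };
        double baseSeconds = 12000;  // predicted runtime on 1 small node
        double deadline = 3600;      // must finish within an hour

        String best = null;
        double bestCost = Double.MAX_VALUE;
        for (NodeType t : types) {
            for (int nodes : new int[] {2, 4, 6}) {
                // Toy scaling model: near-linear speedup with node count.
                double runtime = baseSeconds / (t.speedup() * nodes * 0.9);
                double hours = Math.ceil(runtime / 3600.0);  // billed hourly
                double cost = hours * nodes * t.dollarsPerHour();
                if (runtime <= deadline && cost < bestCost) {
                    bestCost = cost;
                    best = nodes + " x " + t.name();
                }
            }
        }
        System.out.printf("cheapest within deadline: %s ($%.2f)%n", best, bestCost);
    }
}
```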
[Figure: Workload performance and pay-as-you-go cost under various cluster configurations on Amazon Elastic MapReduce; see charts below]
[Figure: Example analytics workload for Amazon Elastic MapReduce. Six jobs (I-VI) operate on three datasets copied from Amazon S3 storage: Users (username, age, ipaddr), GeoInfo (ipaddr, region), and Clicks (username, url, value). Operations include partitioning Users by age into <20, ≤25, ≤35, >35; counting users per age < 20; joining with GeoInfo to count users per region with age > 25; filtering Clicks on value > 0 and joining to count clicks per region and age; and filtering on age > 35 and "Sports"-type urls to count clicks per url type and per age.]
[Charts: Execution Time (sec) and Cost ($) for the workload above, by node type (m1.small, m1.large, m1.xlarge) and cluster size (2, 4, 6 nodes)]
[Chart: Disk Usage (%) across data nodes 1-15 for the initial layout vs. after a map-only aggregation (data local and non-data local) and a partition job with replication count = 1; illustrates the unbalanced layout discussed above]
[Charts: Execution Time (sec, 0-500) comparing Serial, Concurrent, and Jumbo execution of multiple Select-Project-Aggregate jobs; one panel includes a Partitioning job]