Automatic Optimization of MapReduce Programs
Michael Cafarella, Eaman Jahani, Christopher Re
August 2011
MapReduce is victorious
• Google statistics:

                          Aug 04    Mar 06    Sept 07   May 10
Number of jobs            29K       171K      2127K     4474K
Machine years used        217       2002      11081     39121
Input Data (TB)           3,288     52,254    403,152   946,460
Output Data (TB)          193       2,970     14,018    45,720
Average worker machines   157       268       394       368

• Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]
1. Omer Trajman, Cloudera VP, http://www.dbms2.com/
MapReduce in relational land
• Designers' original intention: free-form data
o Web-scale indexing/log processing
• But there are many relational workloads [1]
o Complex queries/data analysis
• Caveat: MR performance lags RDBMS performance
1. Karmasphere corporation: A study of hadoop developers, http://karmasphere.com, 2010
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Selection is Slower with MapReduce
Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009
Join is Even Slower
MR Lags in Relational Land
• Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]
• Query processing tasks
o No metadata, semantics, indices
o Free-form input is a double-edged sword
1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008
Manimal
• Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques
• These techniques are found today only in RDBMSs, but should be in MapReduce, too.
Manimal Approach
[Figure: architecture diagram. Labels: bytecode (*.class), MR Engine, Static Analyzer, Optimizer logic, Execution Framework, optimization opportunities, execution path]
void map(Text key, WebPage w) {
    if (w.rank > 10) emit(w.url, w.rank);
}
• Challenges:
o Safely detect query semantic optimizations
o How much performance gain?
SELECTION from B+Tree index on W.RANK
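The selection-from-index idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `WebPage` class is a stand-in for the slide's record type, a `java.util.TreeMap` stands in for a B+Tree, and `fullScan` / `indexedScan` are hypothetical names rather than Manimal's API. The point is only that an index on `rank` lets the selection `w.rank > 10` skip non-matching records while producing the same set of emitted pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SelectionIndexSketch {
    // Hypothetical stand-in for the WebPage record used on the slides.
    static final class WebPage {
        final String url;
        final int rank;
        WebPage(String url, int rank) { this.url = url; this.rank = rank; }
    }

    // Baseline: what the unmodified map() does, one test per input record.
    static List<String> fullScan(List<WebPage> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.rank > threshold) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }

    // Build a sorted index on rank (a TreeMap standing in for a B+Tree).
    static TreeMap<Integer, List<WebPage>> buildIndex(List<WebPage> pages) {
        TreeMap<Integer, List<WebPage>> index = new TreeMap<>();
        for (WebPage w : pages) {
            index.computeIfAbsent(w.rank, r -> new ArrayList<>()).add(w);
        }
        return index;
    }

    // Indexed selection: visit only the entries with rank strictly above threshold.
    static List<String> indexedScan(TreeMap<Integer, List<WebPage>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<WebPage>> e : index.tailMap(threshold, false).entrySet()) {
            for (WebPage w : e.getValue()) out.add(w.url + "\t" + w.rank);
        }
        return out;
    }
}
```

The indexed scan returns pairs in rank order rather than input order, so the two results should be compared as sets of emitted pairs.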
Manimal Contributions
• Our Manimal system:
o Detects safe relational optimizations in users' compiled MapReduce programs
• Our results:
o Runs with unmodified MapReduce code
o Runs up to 11x faster on the same code
o Provides a framework for more optimizations
Outline
• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
o Analyzer recall
o Performance gain
• Related Work and Conclusion
Execution framework
public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) {
    if (w.rank > 10) emit(w.url, w.rank);
}
Execution Framework
[Figure: the user program's bytecode (varload ‘value’, invokevirtual, astore ‘text’, …, ifeq …) flows through three stages: Analyzer, Optimizer, Execution]
Execution Framework
Analyzer in: user program
Analyzer out: optimization descriptor, index-generation program
Optimization descriptor: (SELECT f, w.rank>10)
Index-generation program:
void map(k, w) { out.set(indexedOutputFormat); emit(w.rank, (k,w)) }
Execution Framework
Optimizer in: optimization descriptor, catalog
Optimizer out: execution descriptor
Optimization descriptor: (SELECT f, w.rank>10)
Catalog:
/logs/log.1 /logs/log.1.idx select src…
/logs/log.2 /logs/log.2.idx select src…
Execution descriptor: (SELECT,“log.1.idx”,w.rank>10)
Execution Framework
Execution in: execution descriptor, user program
Execution out: program output
Execution descriptor: (SELECT,“log.1.idx”,w.rank>10)
Program output: numwords 19519
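The hand-off between the three stages can be sketched as plain data objects. This is only an illustration under stated assumptions: the class names, the field names, and the `optimize` method are hypothetical, and the catalog is simplified to a map from input path to index path (the slide's catalog rows carry a third, elided column that is not modeled here).

```java
import java.util.Map;
import java.util.Optional;

final class DescriptorPipelineSketch {
    // Analyzer output, e.g. (SELECT f, w.rank>10). Field names are assumptions.
    static final class OptimizationDescriptor {
        final String kind;
        final String predicate;
        OptimizationDescriptor(String kind, String predicate) {
            this.kind = kind;
            this.predicate = predicate;
        }
    }

    // Optimizer output, e.g. (SELECT, "log.1.idx", w.rank>10).
    static final class ExecutionDescriptor {
        final String kind;
        final String indexPath;
        final String predicate;
        ExecutionDescriptor(String kind, String indexPath, String predicate) {
            this.kind = kind;
            this.indexPath = indexPath;
            this.predicate = predicate;
        }
    }

    // Optimizer step: consult the catalog for an index covering the input.
    // An empty result means no index is known, so the execution stage would
    // simply run the unmodified user program.
    static Optional<ExecutionDescriptor> optimize(OptimizationDescriptor d,
                                                  String inputPath,
                                                  Map<String, String> catalog) {
        String indexPath = catalog.get(inputPath);
        if (indexPath == null) return Optional.empty();
        return Optional.of(new ExecutionDescriptor(d.kind, indexPath, d.predicate));
    }
}
```

Falling back to the original program when the catalog has no matching index is one way to keep the optimization safe: the worst case is an unoptimized run with identical output.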
An Optimization Example
// webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }
// mapper.java
void map(Text key, WebPage w) {
    if (w.url.equals("teaparty.fr")) emit(w.url, 1);
}
• Data-centric programming idioms == relational ops
PROJECTED view: (url,null,null)
DIRECT-OP on compressed WebPage
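A projected view can be sketched as follows, assuming a hypothetical `WebPage` class with the three fields from the schema slide. The mapper reads only `url`, so a view that stores just that field is enough to reproduce its output. The method names are illustrative and not part of Manimal.

```java
import java.util.ArrayList;
import java.util.List;

final class ProjectionSketch {
    // Hypothetical record type mirroring the slide's schema.
    static final class WebPage {
        final String url;
        final int rank;
        final String content;
        WebPage(String url, int rank, String content) {
            this.url = url;
            this.rank = rank;
            this.content = content;
        }
    }

    // Original mapper logic: emits (url, 1) for matching pages, reading only w.url.
    static List<String> mapOverFull(List<WebPage> pages, String target) {
        List<String> out = new ArrayList<>();
        for (WebPage w : pages) {
            if (w.url.equals(target)) out.add(w.url + "\t1");
        }
        return out;
    }

    // Projected view: keep only the field the mapper actually reads.
    static List<String> project(List<WebPage> pages) {
        List<String> urls = new ArrayList<>();
        for (WebPage w : pages) urls.add(w.url);
        return urls;
    }

    // Same mapper logic run against the much smaller projected view.
    static List<String> mapOverProjected(List<String> urls, String target) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (url.equals(target)) out.add(url + "\t1");
        }
        return out;
    }
}
```

The saving comes from never reading `rank` or `content` from disk; the emitted pairs are unchanged.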
Semantic Extraction
• Query semantics are obvious to human readers, but not explicit in the code for the framework
• EXTRACT IT!
o Static code analysis
o Control-flow graph and data-flow graph
o Find opportunities: selection, projection, direct op
o Safe optimizations: same output
Analyzer: An Example
// webpage.java
class WebPage { String url; int rank; String content; }
// mapper.java
void map(Text key, WebPage w) {
    if (w.rank > 10) emit(w.url, w.rank);
}
[Figure: control-flow graph produced by the Analyzer: Fn Entry → w.rank > 10 → emit(url,rank) → Fn Exit]
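One way to picture the control-flow check is a toy graph search: a condition counts as a selection predicate if no emit can be reached once its true branch is cut. This is a deliberately simplified sketch, not Manimal's analyzer. The `Node` type and the `guards` method are hypothetical, the graph is built by hand rather than from bytecode, and the data-flow checks that a safe optimization would also need are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GuardDetectionSketch {
    enum Kind { ENTRY, COND, EMIT, EXIT }

    static final class Node {
        final Kind kind;
        final String label;
        Node onTrue;   // for non-COND nodes, the only successor
        Node onFalse;  // used by COND nodes only
        Node(Kind kind, String label) { this.kind = kind; this.label = label; }
    }

    // Is any EMIT reachable from start if the true edge of 'blocked' is ignored?
    static boolean emitReachable(Node start, Node blocked) {
        Deque<Node> work = new ArrayDeque<>();
        Set<Node> seen = new HashSet<>();
        work.push(start);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (!seen.add(n)) continue;          // already visited
            if (n.kind == Kind.EMIT) return true;
            if (n != blocked && n.onTrue != null) work.push(n.onTrue);
            if (n.onFalse != null) work.push(n.onFalse);
        }
        return false;
    }

    // Labels of conditions whose true branch every emit depends on.
    static List<String> guards(Node entry, List<Node> nodes) {
        List<String> out = new ArrayList<>();
        if (!emitReachable(entry, null)) return out; // program never emits
        for (Node n : nodes) {
            if (n.kind == Kind.COND && !emitReachable(entry, n)) out.add(n.label);
        }
        return out;
    }

    // The graph from the slide: Fn Entry -> (w.rank > 10) -> emit -> Fn Exit.
    static List<Node> slideExample() {
        Node entry = new Node(Kind.ENTRY, "Fn Entry");
        Node cond = new Node(Kind.COND, "w.rank > 10");
        Node emit = new Node(Kind.EMIT, "emit(url,rank)");
        Node exit = new Node(Kind.EXIT, "Fn Exit");
        entry.onTrue = cond;
        cond.onTrue = emit;
        cond.onFalse = exit;
        emit.onTrue = exit;
        return Arrays.asList(entry, cond, emit, exit);
    }
}
```

If the false branch of the condition could also reach an emit, the condition is not reported, which matches the "same output" safety requirement: records failing the test must contribute nothing.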
Current Optimizations
• B+-Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data
• Hadoop compression is not semantic-aware
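Delta compression on a numeric field can be sketched in a few lines. The slides do not describe Manimal's actual encoding, so this is a generic illustration with hypothetical method names: each value is stored as its difference from the previous one, which tends to produce small numbers that a later stage can store in fewer bits.

```java
final class DeltaCodecSketch {
    // Replace each value with its difference from the previous value.
    // The first value is stored as a difference from 0.
    static int[] encode(int[] values) {
        int[] deltas = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Rebuild the original values with a running sum over the deltas.
    static int[] decode(int[] deltas) {
        int[] values = new int[deltas.length];
        int prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            values[i] = prev;
        }
        return values;
    }
}
```

For example, the values 100, 103, 104, 110 encode to 100, 3, 1, 6. Knowing that the field is numeric is what makes this possible, which is the sense in which generic block compression is not semantic-aware.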
Experiments: Analyzer
• Test MapReduce programs from Pavlo et al., SIGMOD ’09
• Detected 5 out of 8 opportunities:
o Two misses due to a custom serialization class
o Another miss requires knowledge of java.util.Hashtable semantics
Experiments: Performance
• Optimize four Web page handling tasks:
o Selection (filtering)
o Projection (aggregation on subfield of page)
o Join (pages to user visits)
o User Defined Functions (aggregation)
• 5 cluster nodes, 123GB of data
Experiments: Performance
• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo et al.
• UDF not detected: running time identical

Description   Hadoop    Manimal   Speedup   Space Overhead
Selection     430 s     38 s      11.2      0.1%
Projection    5496 s    1856 s    2.96      20%
Join          6078 s    904 s     6.73      11.7%
Related Work
• Lots of recent MapReduce activity
o Quincy: task scheduling (Isard et al., SOSP 2009)
o HadoopDB (Abouzeid et al., PVLDB 2009)
o Hadoop++ (Dittrich et al., PVLDB 2010)
o HaLoop (Bu et al., PVLDB 2010)
o Twister (Ekanayake et al., HPDC 2010)
o Starfish (Herodotou et al., CIDR 2011)
• Manimal does not introduce new optimizations. It detects and applies existing optimizations to code
Lessons Learned
• The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world
• The Ugly: When we started this project in 2009, we underestimated the interest in writing in higher-level languages (e.g., Pig Latin)
Conclusion
• Manimal provides a framework for applying well-known optimization techniques to MapReduce
o Automatic optimization of user code
o Up to 11x speed increase
o Provides a framework for more optimizations