Date posted: 28-Nov-2014
Uploaded by: dacong-yan
YSmart vs SCOPE
YSmart Revisited
What is YSmart?
Yet Another SQL-to-MR Translator
Why “yet another”?
Sentence-by-sentence translation fails!
Wrong View
a = 1; b = 2; x = a j1 b
c = 1; y = j2 c; d = y; z = j3 d;
Correct View
a = <exp1>; b = <exp2>; x = a J1 b
c = <exp1>; y = J2 c; d = y; z = J3 d;
<exp1>, <exp2> are expensive data loading!
<J1>, <J2>, <J3> are expensive computation!
Example
[Diagram: job graph over a, b, c, d connected through J1, J2, J3]
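The point of the two views can be sketched in a few lines of Python. This is only an illustration: the `loads` list is a hypothetical encoding of the three loading statements from the Correct View, and the two counting functions are names made up here, not part of YSmart.

```python
# Hypothetical encoding of the slide's loading statements: (variable, expression).
# a and c both load <exp1>, so a sentence-by-sentence translator loads it twice.
loads = [("a", "exp1"), ("b", "exp2"), ("c", "exp1")]


def naive_load_count(statements):
    """One expensive load per statement, as in sentence-by-sentence translation."""
    return len(statements)


def shared_load_count(statements):
    """One expensive load per distinct expression, reusing repeated loads."""
    return len({expr for _, expr in statements})


print(naive_load_count(loads), shared_load_count(loads))
```

Under these assumptions the naive translation performs three loads where two would do.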
Big Data!
We cannot afford redundancies anymore!
Let’s eliminate redundancies: YSmart!
Correlation-Aware SQL-to-MR Translator
[Diagram: SQL-like queries → Primitive MR Jobs → Identify Correlations → Merge Correlated MR jobs → MR Jobs for best performance]
Input Correlation (IC)
Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint
[Diagram: J1 over lineitem and orders; J2 over lineitem]
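The IC definition above amounts to a set-disjointness test. A minimal sketch, assuming each job is described only by the set of relations it reads (the function name and the `customer` relation are hypothetical, chosen for illustration):

```python
def has_input_correlation(inputs_a, inputs_b):
    """True if the two jobs' input relation sets are not disjoint."""
    return not set(inputs_a).isdisjoint(inputs_b)


# Input sets read off the slide's diagram.
j1_inputs = {"lineitem", "orders"}
j2_inputs = {"lineitem"}

print(has_input_correlation(j1_inputs, j2_inputs))     # J1 and J2 share lineitem
print(has_input_correlation({"customer"}, j2_inputs))  # hypothetical unrelated job
```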
Transit Correlation (TC)
Multiple MR jobs have transit correlation (TC) if they have input correlation (IC), and they have the same Partition Key
[Diagram: J1 over lineitem and orders, J2 over lineitem; Key: l_orderkey for both jobs; J1 and J2 shown together]
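The TC definition adds one condition to the IC test. A minimal sketch, assuming a job can be summarized by its input relations and a single partition key; the `MRJob` class and the `o_custkey` variant are hypothetical and exist only for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRJob:
    """Hypothetical job summary: name, input relations, partition key."""
    name: str
    inputs: frozenset
    partition_key: str


def has_transit_correlation(a, b):
    """TC = input correlation (shared input) plus the same partition key."""
    shares_input = bool(a.inputs & b.inputs)
    return shares_input and a.partition_key == b.partition_key


j1 = MRJob("J1", frozenset({"lineitem", "orders"}), "l_orderkey")
j2 = MRJob("J2", frozenset({"lineitem"}), "l_orderkey")
# Same inputs as J2 but an assumed different key: IC holds, TC does not.
j2_other_key = MRJob("J2'", frozenset({"lineitem"}), "o_custkey")

print(has_transit_correlation(j1, j2))
print(has_transit_correlation(j1, j2_other_key))
```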
Job Flow Correlation (JFC)
An MR job has Job Flow Correlation (JFC) with one of its child MR jobs if it has the same partition key as that MR job
[Diagram labels: Map Func. of MR Job 1; Map Func. of MR Job 2; Partition Key; Other Data; Reduce Func. of MR Job 1; Reduce Func. of MR Job 2; Output of MR Job 2. Example: lineitem, orders, J1, J2]
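The JFC definition can be sketched as a check over a job's children. This is an illustration under assumptions: the `MRJob` class is hypothetical, J1 is taken to be a child of J2, and the partition keys are assumed, since the slide does not state them for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MRJob:
    """Hypothetical job summary: name, partition key, and child jobs."""
    name: str
    partition_key: str
    children: list = field(default_factory=list)


def jfc_children(job):
    """Children that have job flow correlation with `job` (same partition key)."""
    return [c for c in job.children if c.partition_key == job.partition_key]


# Assumed keys: J1 and J2 both partition on l_orderkey; J0 is an invented
# child with a different key, added to show a non-correlated case.
j1 = MRJob("J1", "l_orderkey")
j0 = MRJob("J0", "o_custkey")
j2 = MRJob("J2", "l_orderkey", children=[j1, j0])

print([c.name for c in jfc_children(j2)])
```

When the keys match, the child's reduce output is already partitioned the way the parent needs, which is what makes running both in one job plausible.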
Put it all together
1: Sentence-to-Sentence Translation
• 5 MR jobs
2: Input Correlation + Transit Correlation
• 3 MR jobs
3: Input Correlation + Transit Correlation + Job Flow Correlation
• 1 MR job
4: Hand-coding (similar to Case 3)
• In the reduce function, we optimize the code according to the query semantics
[Diagram: the job plans for each case over lineitem and orders, built from Join1, AGG1, AGG2, Join2, and Left-outer-Join]
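The job counts for Cases 1 to 3 can be reproduced with a small grouping sketch. Everything about the wiring here is an assumption made for illustration: the flattened diagram does not show which job feeds which, so the `JOBS` table below guesses that Join2 consumes Join1 and AGG1, that Left-outer-Join consumes Join2 and AGG2, and that every job partitions on `l_orderkey`. The merging rule is also simplified to plain grouping and is not YSmart's actual algorithm.

```python
from itertools import combinations

# Assumed wiring and keys (not stated on the slide).
JOBS = {
    "Join1": {"inputs": {"lineitem", "orders"}, "key": "l_orderkey"},
    "AGG1": {"inputs": {"lineitem"}, "key": "l_orderkey"},
    "AGG2": {"inputs": {"lineitem"}, "key": "l_orderkey"},
    "Join2": {"inputs": {"Join1", "AGG1"}, "key": "l_orderkey"},
    "Left-outer-Join": {"inputs": {"Join2", "AGG2"}, "key": "l_orderkey"},
}


def count_jobs(jobs, use_ic_tc=False, use_jfc=False):
    """Count MR jobs left after grouping correlated jobs (union-find)."""
    parent = {name: name for name in jobs}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    def union(a, b):
        parent[find(a)] = find(b)

    if use_ic_tc:
        # Group jobs that share an input and use the same partition key.
        for a, b in combinations(jobs, 2):
            shared = jobs[a]["inputs"] & jobs[b]["inputs"]
            if shared and jobs[a]["key"] == jobs[b]["key"]:
                union(a, b)
    if use_jfc:
        # Group a job with any job it consumes that has the same key.
        for name, job in jobs.items():
            for source in job["inputs"]:
                if source in jobs and jobs[source]["key"] == job["key"]:
                    union(name, source)
    return len({find(name) for name in jobs})


print(count_jobs(JOBS))
print(count_jobs(JOBS, use_ic_tc=True))
print(count_jobs(JOBS, use_ic_tc=True, use_jfc=True))
```

With the assumed wiring, the three calls give the 5, 3, and 1 job counts listed for Cases 1, 2, and 3.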
YSmart vs SCOPE
[Diagram: (Big) Data Processing Language → Naïve Translation → Optimization → Big Data Analytic Jobs]
• YSmart: look at data dependence and control dependence
  • Identify three correlations
  • Merge jobs to eliminate redundancy (straightforward)
• SCOPE: look at the actual structure of the input data
  • Identify structural property correlations
  • Partition, group, merge (complicated)
Big Picture
[Diagram: (Big) Data Processing Language → Naïve Translation → Input-Independent Optimization (YSmart) and Input-Dependent Optimization (SCOPE) → Big Data Analytic Jobs]
YSmart vs SCOPE
[Diagram: (Big) Data Processing Language → Naïve Translation → Optimization → Big Data Analytic Jobs]
• The diagram is actually an over-abstraction.
• In reality,
  • YSmart: source-to-source transformation
  • SCOPE: run-time optimizing compiler tightly coupled with underlying execution environment
YSmart vs SCOPE
Not an apples-to-apples comparison!
But, let’s do it anyway…
YSmart Alone
[Diagram: (Big) Data Processing Language → Naïve Translation → YSmart → Big Data Analytic Jobs]
• 3x speedup, but 17% slower than human
• It is supposed to be smarter than human!
• What went wrong:
  • Bad input code
  • Not enough optimization
SCOPE Alone
[Diagram: (Big) Data Processing Language → Naïve Translation → SCOPE → Big Data Analytic Jobs]
• No thorough evaluation; 2x speedup on a specific case
• Problem
  • They are looking at structures, but at a wrong level.
  • Very likely, they are optimizing computations that are not strictly necessary!
Discussions
1. Is SQL good enough as a big data analytics processing language?
• Bad language design can be detrimental
• Redundancies could be introduced unnecessarily simply due to poor expressiveness of the language
2. How to migrate traditional program analysis and compiler optimization to the big data era?
• Correlation detection in YSmart is inherently similar to dependency analysis.
• In compiler optimization, we focus on def-use statements and expressions; in big data, we should focus on big data transfer and big data tables.
Fundamentally, we need good programming languages & program analyses for big data analytics!