RIOS: Runtime Integrated Query Optimizer for Spark
Youfu Li (UCLA), Mingda Li (UCLA), Ling Ding (UCLA), and Matteo Interlandi (Microsoft)
Cloud Computing Programs
Open Source Data-Intensive Scalable Computing (DISC) Platforms: Hadoop MapReduce and Spark
◦ functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
Several years later, introduction of high-level SQL-like declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives
Query Optimization
Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join orders and implementation)
Physical optimizer in RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each
(3) select a sufficiently good plan
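The three steps above can be sketched in a few lines of Python. This is a toy illustration, not how any particular optimizer is implemented: the table names, cardinalities, and selectivities are made-up values, and the cost model (sum of intermediate result sizes for left-deep plans, with independent predicates) is an assumption chosen for brevity.

```python
from itertools import permutations

# Hypothetical statistics: table cardinalities (rows) and join selectivities.
CARD = {"A": 1_000, "B": 50_000, "C": 200}
SEL = {frozenset("AB"): 0.001, frozenset("AC"): 0.01, frozenset("BC"): 0.0005}


def plan_cost(order):
    """Cost of a left-deep plan: total rows over all intermediate results."""
    joined = [order[0]]
    rows = CARD[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Assumed independence: multiply the selectivities of every join
        # predicate between the new table and the tables already joined.
        sel = 1.0
        for prev in joined:
            sel *= SEL[frozenset((prev, table))]
        rows = rows * CARD[table] * sel
        cost += rows
        joined.append(table)
    return cost


def choose_plan(tables):
    candidates = list(permutations(tables))            # (1) enumerate
    costed = [(plan_cost(p), p) for p in candidates]   # (2) estimate
    return min(costed)                                 # (3) select


best_cost, best_order = choose_plan(["A", "B", "C"])
print(best_order, best_cost)
```

With these made-up numbers, joining the two small tables first keeps the intermediate result small, while starting with A and B produces a 50,000-row intermediate result.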
Query Optimization: Why Important?
[Figure: execution time in seconds (log scale) at Scale Factor = 10 for Spark, AsterixDB, Hive, and Pig, with a W bar and an R bar per system; some runs were unable to finish in 5+ hours.]
Bad plans over Big Data can be disastrous!
Challenges for Cost-based Optimizer in DISC
Lack of upfront statistics:
◦ data sits in HDFS and is unstructured
Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
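A small worked example of the error-propagation point, under a simplifying assumption that is not from the slides: each join's cardinality estimate is off by the same multiplicative factor, and the errors compound through a chain of joins.

```python
def compounded_error(per_join_error: float, num_joins: int) -> float:
    """Multiplicative estimation error after a chain of joins, assuming each
    join's estimate is off by the same factor and the errors multiply."""
    return per_join_error ** num_joins


# A 2x error per join grows to 256x after eight joins.
for n in (1, 3, 5, 8):
    print(n, compounded_error(2.0, n))
```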
Cost-based Optimizer in DISC: State of the Art
Pre-existing statistics
◦ Spark CBO [1]
• Collect and store statistics
• No runtime plan revision
Bad statistics
◦ Adaptive Query planning [2]
• Assumption is that some initial statistics exist
No upfront statistics
◦ Pilot runs (samples) [3]
• Samples are expensive
• Only foreign-key joins
• No runtime plan revision
Traditional Query Planning VS RIOS
[Diagram: in traditional query planning, the logical plan and statistics feed the query optimizer, which emits a physical plan that is then executed. In RIOS, the logical plan enters a loop in which planning and execution alternate, with runtime stats from execution feeding back into planning of the physical plan.]
RIOS
Runtime Integrated Optimizer for Spark
Key idea: Execute-Gather-Aggregate-Plan strategy (EGAP)
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Aggregate statistics after gathering
◦ Joins are greedily planned for execution
◦ Plan can be dynamically changed if a bad decision was made
Neither upfront statistics nor pilot runs are required
◦ Raw dataset size is required for initial guess
Support for non-foreign-key joins
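The EGAP loop can be illustrated with a minimal, Spark-free sketch in plain Python. Everything here is an assumption made for illustration: relations are lists of partitions holding (key, payload) rows, all relations share one join key, the only statistic is the record count, and the greedy rule is "join the two smallest relations next". The function names (gather, aggregate, hash_join, egap) are hypothetical and not part of RIOS or Spark.

```python
from collections import defaultdict


def gather(partitions):
    """Gather: each partition reports its own record count."""
    return [len(part) for part in partitions]


def aggregate(per_partition_counts):
    """Aggregate: the driver combines the per-partition statistics."""
    return sum(per_partition_counts)


def hash_join(small, large):
    """Execute: build a hash index on the smaller input, probe with the larger."""
    index = defaultdict(list)
    for part in small:
        for key, payload in part:
            index[key].append(payload)
    return [[(key, sp + lp) for key, lp in part for sp in index.get(key, ())]
            for part in large]


def egap(relations):
    """relations: name -> list of partitions. Returns (join order, result)."""
    relations = dict(relations)
    stats = {name: aggregate(gather(parts)) for name, parts in relations.items()}
    order = []
    while len(relations) > 1:
        # Plan: greedily pick the two relations with the fewest records,
        # using statistics observed so far rather than upfront estimates.
        a, b = sorted(stats, key=stats.get)[:2]
        order.append((a, b))
        joined = hash_join(relations.pop(a), relations.pop(b))  # Execute
        del stats[a], stats[b]
        relations[a + b] = joined
        stats[a + b] = aggregate(gather(joined))  # Gather + Aggregate
    return order, next(iter(relations.values()))


A = [[(1, ("a1",)), (2, ("a2",))], [(3, ("a3",))]]   # 3 records
B = [[(k, ("b%d" % k,)) for k in range(1, 6)]]       # 5 records
C = [[(1, ("c1",))]]                                 # 1 record
order, result = egap({"A": A, "B": B, "C": C})
print(order)
print(result)
```

Because the next join is chosen only after the previous result's statistics are known, a later decision always reflects actual intermediate sizes rather than the initial guess.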
Runtime Optimizer: an Example
[Animated diagram over relations A, B, and C, under the assumption A < C. The builds step through: Execute step, Gather step (statistics S per partition), Aggregate step (at the driver), Plan step, Execute step (intermediate result AB), Gather step, Plan step, and a final Execute step that produces the joined result ABC.]
Runtime Optimizer: Wrong Guess
[Animated diagram: the same example with selections σ applied, labelled with the assumption A < C and the observation σ(A) > σ(C). The builds add a repartition step for B, then the intermediate result BC with its statistics S, and end with the joined result ABC.]
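A minimal sketch of the wrong-guess case, with invented sizes: the initial choice uses raw dataset sizes (A < C), and the choice is revised once the filtered sizes are known at runtime (σ(A) > σ(C)). The function name and all numbers are hypothetical.

```python
def plan_first_join(sizes, candidates=("A", "C")):
    """Pick which candidate relation to join with B first: the smaller one."""
    return min(candidates, key=lambda name: sizes[name])


# Initial guess from raw dataset sizes: A < C, so A is joined with B first.
raw_sizes = {"A": 1_000, "B": 50_000, "C": 5_000}
initial = plan_first_join(raw_sizes)

# Runtime statistics after the selections: sigma(A) > sigma(C), so the plan
# is revised and C is joined with B first.
runtime_sizes = {"A": 900, "B": 50_000, "C": 40}
revised = plan_first_join(runtime_sizes)

print(initial, revised)
```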
Runtime Integrated Optimizer for Spark
Spark batch execution model allows late binding of joins
Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record
Statistics are aggregated using a Spark job or accumulators
Join implementations are picked based on thresholds
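A rough sketch of how per-partition statistics might be merged and then compared against a threshold. The slides do not give the actual thresholds or data structures, so the PartitionStats type, the pick_join rule, and the 10 MB value are assumptions; 10 MB is used only because it matches the default of Spark's own spark.sql.autoBroadcastJoinThreshold setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    num_records: int
    total_bytes: int


def aggregate(stats):
    """Merge per-partition statistics, as an accumulator would on the driver."""
    return PartitionStats(
        num_records=sum(s.num_records for s in stats),
        total_bytes=sum(s.total_bytes for s in stats),
    )


def avg_record_size(s):
    """Average size of each record in bytes (0.0 for an empty relation)."""
    return s.total_bytes / s.num_records if s.num_records else 0.0


# Hypothetical threshold, not taken from RIOS.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join(left, right, threshold=BROADCAST_THRESHOLD_BYTES):
    """Broadcast the smaller side if it fits under the threshold, else shuffle."""
    smaller = min(left.total_bytes, right.total_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


small = aggregate([PartitionStats(100, 6_400), PartitionStats(300, 19_200)])
big = PartitionStats(10_000_000, 1_000_000_000)
print(small, avg_record_size(small))
print(pick_join(small, big), pick_join(big, big))
```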
Challenges and Optimizations
Execute: Block and revise execution plans without wasting computation
Aggregate: Efficient accumulation of statistics
Plan: Try to schedule as many broadcast joins as possible
Experiments
Q1: How does RIOS perform compared to regular Spark, pilot runs, and Spark CBO?
Q2: How expensive are wrong guesses?
Mini benchmark with 3 Fact Tables
[Figure: execution time in seconds (log scale) versus scale factor (1, 10, 100, 1000) for spark good-order, RIOS R2R, RIOS W2R, pilot-run, and spark wrong-order; some runs were unable to finish in 5+ hours.]
Q1: RIOS is always faster than Spark and pilot run
Q2: Not much, around 15% in the worst case
TPCDS and TPCH Queries
[Figure: four panels (Query 17, Query 50, Query 28, Query 9) showing execution time in seconds (log scale) versus scale factor (1, 10, 100, 1000) for spark good-order, CBO-plan, RIOS, pilot-run, and spark bad-order; one run was unable to finish in 5+ hours.]
Q1: RIOS is always the fastest approach
Conclusions
RIOS: cost-based query optimizer for Spark
Statistics are gathered at runtime (no need for initial statistics or pilot runs)
Late binding of joins
Up to 2x faster than the best plan generated by pilot run, and more than 100x faster than previous approaches for fact table joins.
Experiment Configuration
๏ Datasets:
• TPCDS
• TPCH
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32 GB of RAM, and a 1 TB disk
• Spark 2.2.1
• Scale factor from 1 to 1000 (~1 TB)
References
1. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
2. S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In NSDI, 2012.
3. K. Karanasos, A. Balmin, M. Kutsch, F. Ozcan, V. Ercegovac, C. Xia, and J. Jackson. Dynamically optimizing queries over large scale data platforms. In SIGMOD, pages 943–954.
4. S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2), June 2007.
5. O. Papapetrou, W. Siberski, and W. Nejdl. Cardinality estimation and dynamic length adaptation for bloom filters. Distributed and Parallel Databases, 28(2):119–156, 2010.
Thank you