Post on 22-May-2020
SAMPLING-BASED QUERY RE-OPTIMIZATION
Wentao Wu
Microsoft Research
Background
Query optimization remains challenging despite decades of effort and progress.
Cardinality estimation is the key challenge.
Selectivity of join predicates
Correlation of columns
Histogram vs. Sampling
Single-column histograms cannot capture data
correlations between columns.
They rely on the attribute-value-independence (AVI) assumption.
Sampling is better than histograms at capturing
data correlations.
We run the query over actual rather than summarized data.
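To make the contrast concrete, here is a minimal sketch on synthetic data. The table, its value domain, the predicate, and the sample size are all assumptions chosen for illustration; none of them come from the talk. Two fully correlated columns are filtered with a conjunctive predicate: the AVI estimate multiplies the per-column selectivities, while the sampling estimate evaluates the whole predicate on a random sample.

```python
import random

def avi_estimate(rows, a_val, b_val):
    """Selectivity of (A = a_val AND B = b_val) assuming the columns are independent."""
    n = len(rows)
    sel_a = sum(1 for a, _ in rows if a == a_val) / n
    sel_b = sum(1 for _, b in rows if b == b_val) / n
    return sel_a * sel_b

def sampling_estimate(rows, a_val, b_val, sample_size, rng):
    """Selectivity of the same predicate, measured on a uniform random sample."""
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for a, b in sample if a == a_val and b == b_val)
    return hits / sample_size

rng = random.Random(0)
# Hypothetical table with B = A (fully correlated); A takes 10 distinct values.
rows = [(v, v) for v in (rng.randrange(10) for _ in range(100_000))]

true_sel = sum(1 for a, b in rows if a == 3 and b == 3) / len(rows)
avi = avi_estimate(rows, 3, 3)
sampled = sampling_estimate(rows, 3, 3, 2_000, rng)
print(f"true={true_sel:.4f}  avi={avi:.4f}  sampled={sampled:.4f}")
```

Because B = A here, the AVI estimate equals the square of the true selectivity, while the sampled estimate stays close to the true value.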
But Why are Histograms Dominant?
The overhead is much smaller, compared with other
cardinality estimation approaches.
Sampling incurs additional overhead and should be
used conservatively.
A naïve idea: use sampling for all plans considered by
the optimizer.
Cost-Based Query Optimization
For large N, it is not affordable to run sampling for every plan.
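[Figure: candidate plans P1, …, PN. For example, P1 joins A and B first and then C (merge join, hash join), while PN joins A and C first and then B (hash join, nested loop).]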
Pick the best plan from N candidates:
N could be large (10^2 or even 10^3)!
Our Idea
Use sampling as a post-processing validation step.
Detect cardinality estimation errors for the final plan
returned by the optimizer.
Re-optimize the query if cardinality estimation errors
are detected.
Catch big mistakes of the optimizer before the plan runs!
The Re-optimization Algorithm
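[Figure: the re-optimization loop. Query q goes to the query optimizer, which returns plan Pq. Sampling-based validation of Pq produces refined cardinality estimates, which update the optimizer's cardinalities. The loop ends with the final plan.]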
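The loop can be sketched in a few lines. This is a toy under stated assumptions, not the PostgreSQL implementation from the talk: the "optimizer" only chooses which pair of tables to join first, validation is a dictionary lookup standing in for real sampling, and the loop stops when the optimizer returns the same plan twice in a row. The initial estimates and the validated value 1000 for A ⋈ B follow the example in the talk; the other "sampled" cardinalities and all function names are hypothetical.

```python
from itertools import combinations

def optimize(tables, estimates, validated):
    """Toy optimizer: join first the pair with the smallest estimated cardinality.
    Validated (sampled) cardinalities override the optimizer's own estimates."""
    def card(join):
        return validated.get(join, estimates[join])
    return min((frozenset(p) for p in combinations(tables, 2)), key=card)

def validate(plan, sampled_cards):
    """Stand-in for sampling-based validation of the joins in the plan."""
    return {plan: sampled_cards[plan]}

def reoptimize(tables, estimates, sampled_cards):
    validated, previous, history = {}, None, []
    while True:
        plan = optimize(tables, estimates, validated)
        history.append(plan)
        if plan == previous:          # same plan returned twice: stop
            return plan, history
        validated.update(validate(plan, sampled_cards))
        previous = plan

AB, BC, AC = frozenset("AB"), frozenset("BC"), frozenset("AC")
estimates = {AB: 100, BC: 300, AC: 500}    # optimizer's initial estimates
sampled = {AB: 1000, BC: 300, AC: 500}     # hypothetical sampled values
final_plan, history = reoptimize("ABC", estimates, sampled)
print(sorted(final_plan), [sorted(p) for p in history])
```

In this run the optimizer first picks A ⋈ B, validation raises its cardinality to 1000, and the next round switches to B ⋈ C, which then survives validation.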
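The Re-optimization Algorithm (Cont.)
Example: q = A ⋈ B ⋈ C
Join     Cardinality
A ⋈ B    100
B ⋈ C    300
A ⋈ C    500
[Figure: the query optimizer returns plan P1 for query q; sampling-based validation reports A ⋈ B: 1000; the cardinality table is updated and the optimizer is invoked again, returning P2; the loop ends with the final plan.]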
Efficiency of Re-optimization
The worst-case expected number of iterations: $S_N \sim O(\sqrt{N})$, where
$$S_N = \sum_{k=1}^{N} k \cdot \left(1 - \frac{1}{N}\right) \cdots \left(1 - \frac{k-1}{N}\right) \cdot \frac{k}{N}$$
N is the number of join trees in the search space.
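To get a feel for how slowly this sum grows, it can be evaluated numerically. The sketch below is only an illustration; the function name and the chosen values of N are assumptions. It keeps a running product for the factors (1 - 1/N)···(1 - (k-1)/N) and prints the sum next to its ratio to √N.

```python
import math

def expected_steps(n):
    """S_N = sum over k = 1..N of k * (1 - 1/N)...(1 - (k-1)/N) * k/N."""
    total = 0.0
    survive = 1.0                 # running product of (1 - i/N) for i < k
    for k in range(1, n + 1):
        total += k * survive * k / n
        survive *= 1 - k / n
    return total

for n in (100, 1_000, 10_000):
    s = expected_steps(n)
    print(n, round(s, 2), round(s / math.sqrt(n), 3))
```

The ratio to √N stays bounded as N grows, which is what the bound on the expected number of iterations predicts.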
Quality of Re-optimized Plans
If sampling-based cost estimates are consistent with
the actual costs, that is,
cost_est(P1) < cost_est(P2) => cost_act(P1) < cost_act(P2),
then the final re-optimized plan is locally optimal:
cost_act(Pfinal) <= cost_act(P), for any plan P generated during re-optimization.
However, cost models are imperfect, and cardinality
estimates based on sampling are imperfect, too.
See experimental results.
Experimental Evaluation
We implemented the re-optimization procedure in
PostgreSQL 9.0.4.
We have two goals:
Test the approach for “common” cases.
Test the approach for “corner” cases.
Experimental Evaluation (Cont.)
“Common” cases
10GB TPC-H benchmark
“Corner” cases
(Homegrown) Optimizer “Torture Test” (OTT)
Specially designed database and queries with high data
correlation that can challenge query optimizers.
Experimental Evaluation (Cont.)
Results on the 10GB TPC-H database
Experimental Evaluation (Cont.)
Results of the “torture test” (5-join queries, log-scale)
Details of OTT
More details about OTT:
K tables R1, …, RK, with Rk(Ak, Bk)
Each Rk is generated independently, with Bk = Ak.
Ak (and thus Bk) is uniformly distributed.
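The queries look like:
$$\sigma_{A_1 = c_1 \wedge \cdots \wedge A_K = c_K \wedge B_1 = B_2 \wedge \cdots \wedge B_{K-1} = B_K}(R_1 \times \cdots \times R_K)$$
Property: These queries are non-empty if and only if the selected values agree, A1 = … = AK (that is, c1 = … = cK)!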
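The construction is easy to reproduce at small scale. The following sketch is an illustration only: the table sizes, the value domain, and the function names are assumptions, and the query is evaluated naively rather than by a DBMS. It builds K = 3 tables with Bk = Ak and shows that a query whose constants disagree is empty even though a sub-query over the first two tables is not.

```python
import random
from itertools import product

def make_table(num_rows, domain, rng):
    """R_k(A_k, B_k): A_k uniform over range(domain), and B_k = A_k."""
    return [(a, a) for a in (rng.randrange(domain) for _ in range(num_rows))]

def ott_query(tables, constants):
    """Select A_k = c_k on every table, then keep combinations with B_1 = ... = B_K."""
    selected = [[row for row in t if row[0] == c]
                for t, c in zip(tables, constants)]
    return [combo for combo in product(*selected)
            if all(row[1] == combo[0][1] for row in combo)]

rng = random.Random(0)
tables = [make_table(200, 5, rng) for _ in range(3)]

print(len(ott_query(tables, [1, 1, 1])) > 0)    # equal constants
print(len(ott_query(tables, [1, 1, 2])))        # unequal constants
print(len(ott_query(tables[:2], [1, 1])) > 0)   # sub-query of the empty query
```

An optimizer that multiplies per-predicate selectivities cannot see that the second query is empty, which is what makes such queries hard.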
Details of OTT (Cont.)
An instance of OTT used in our experiments:
Use 6 TPC-H tables (excluding “nation” and “region”).
Use a set of empty queries with non-empty sub-queries.
[Figure: Bad Plan (non-empty sub-query) vs. Good Plan (empty sub-query!)]
Summary
Sampling as post-processing: efficiency/effectiveness tradeoff!
[Figure: query q goes to the query optimizer, which returns plan Pq; sampling-based validation sends feedback to the optimizer; the result is an improved query plan.]
Q & A
Thank you
Cardinality Estimation Methods
Histograms
Single-column histograms (dominant in current DBMS)
Multi-column histograms
Other methods
Offline approaches: sampling, sketch, graphical models
Online approaches: dynamic query plans, parametric
query optimization, query feedback, mid-query re-optimization,
plan bouquets
A Sampling-Based Estimator
Estimate the selectivity ρ_q of a join query q = R1 ⋈ R2. [Haas et al., J. Comput. Syst. Sci. 1996]
Draw a sample R1^s = {t_11, t_12, …, t_1n1} from R1 and a sample R2^s = {t_21, t_22, …, t_2n2} from R2.
Do a “cross product” over the samples: for each pair (t_1i, t_2j), ρ(i, j) = 0 or 1.
$$\hat{\rho}_q = \frac{\sum_{i,j} \rho(i, j)}{n_1 n_2} = \frac{|R_1^s \bowtie R_2^s|}{|R_1^s| \times |R_2^s|}$$
The estimator ρ̂_q is unbiased and strongly consistent.
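The estimator translates almost directly into code. The sketch below assumes tuple-level uniform sampling without replacement and an equi-join on single-column relations; the relation contents, sample sizes, and function name are illustrative assumptions, not the setup used in the talk.

```python
import random
from collections import Counter

def estimate_join_selectivity(r1, r2, n1, n2, join_pred, rng):
    """Cross-product estimator: the fraction of sample pairs that join."""
    s1 = rng.sample(r1, n1)
    s2 = rng.sample(r2, n2)
    matches = sum(1 for t1 in s1 for t2 in s2 if join_pred(t1, t2))
    return matches / (n1 * n2)

rng = random.Random(42)
# Hypothetical single-column relations with join keys drawn from 50 values.
r1 = [rng.randrange(50) for _ in range(5_000)]
r2 = [rng.randrange(50) for _ in range(5_000)]

# Exact selectivity of the equi-join, computed from key frequencies.
c1, c2 = Counter(r1), Counter(r2)
true_sel = sum(c1[v] * c2[v] for v in c1) / (len(r1) * len(r2))

est = estimate_join_selectivity(r1, r2, 500, 500, lambda a, b: a == b, rng)
print(f"true={true_sel:.4f}  estimated={est:.4f}")
print("estimated cardinality:", round(est * len(r1) * len(r2)))
```

Multiplying the estimated selectivity by |R1| × |R2| gives the estimated join cardinality that would be fed back to the optimizer.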
Other Sampling-Based Methods
Sampling-Based Estimation of the Number of Distinct Values of an Attribute, VLDB’95
Towards Estimation Error Guarantees for Distinct Values, PODS’00
End-biased Samples for Join Cardinality Estimation, ICDE’06
Join Size Estimation Subject to Filter Conditions, VLDB’15
Convergence of Re-optimization
Convergence Condition of Re-optimization
Theorem: The re-optimization procedure terminates when
all the joins in the returned query plan have been
observed in previous rounds of iteration.
For example, re-optimization will terminate after T1′ is returned.
Convergence of Re-optimization (Cont.)
The previous convergence condition is sufficient but not necessary.
Re-optimization could terminate even before it meets the previous condition.
Termination
To understand re-optimization better, we need the notion
of local/global transformations.
Local/Global Transformations
Local transformation of query plans
Two plans are local transformations of each other if they share the same joins
and differ only in the choice of specific physical operators.
Otherwise, the transformation is global.
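One way to check this property mechanically is to compare the sets of joins in two plans. The sketch below rests on assumptions: plans are nested tuples of the form (operator, left, right) with table names as leaves, and a join is identified by the set of base tables it covers, ignoring operand order. The operator names are only labels.

```python
def joins(plan):
    """Return (tables under this node, set of joins in the subtree).
    Each join is identified by the frozenset of base tables it covers."""
    if isinstance(plan, str):                 # leaf: a base table
        return frozenset([plan]), set()
    _operator, left, right = plan
    left_tables, left_joins = joins(left)
    right_tables, right_joins = joins(right)
    tables = left_tables | right_tables
    return tables, left_joins | right_joins | {tables}

def is_local_transformation(p, q):
    """True if both plans contain the same joins (operators may differ)."""
    return joins(p)[1] == joins(q)[1]

p1 = ("HashJoin", ("MergeJoin", "A", "B"), "C")
p1_local = ("NestedLoop", ("HashJoin", "A", "B"), "C")   # same joins, other operators
pn = ("NestedLoop", ("HashJoin", "A", "C"), "B")         # joins A and C first

print(is_local_transformation(p1, p1_local))
print(is_local_transformation(p1, pn))
```

Here p1 and p1_local differ only in physical operators, while pn changes which tables are joined first and is therefore a global transformation of p1.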
Characterization of Re-optimization
The three possible cases in re-optimization:
(1) It terminates in two steps with P2 = P1.
(2) It terminates in n + 1 steps (n > 1) where all plan
transitions are global transformations.
(3) It terminates in n + 1 steps (n > 1) where only the last
transition is a local transformation: the others are all global
transformations.
Characterization of Re-optimization (Cont.)
An illustration of Cases (2) and (3):
The number of iterations thus depends on the number of
global transformations!
Analysis of Efficiency
A probabilistic model for analyzing the expected number of steps in re-optimization:
We have N balls in a queue, initially unmarked.
[Figure: a queue of balls b1, …, bN. Is b1 marked? Yes: exit. No: mark b1 and insert it back into the queue.]
The ball is inserted back at each position in the queue
with equal probability 1/N.
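The procedure can be simulated directly. The sketch below is a Monte Carlo illustration, not part of the talk: it represents the queue as a list of booleans (marked or not), counts how many balls get marked before a marked ball shows up at the head, and averages over many runs. The function names, the seed, and the number of trials are assumptions.

```python
import random

def simulate_once(n, rng):
    """Number of balls marked before a marked ball reaches the head of the queue."""
    queue = [False] * n                        # False = unmarked; index 0 is the head
    steps = 0
    while not queue[0]:
        queue.pop(0)                           # take b1 off the head
        queue.insert(rng.randrange(n), True)   # mark it, reinsert at a uniform position
        steps += 1
    return steps

def average_steps(n, trials, seed=0):
    rng = random.Random(seed)
    return sum(simulate_once(n, rng) for _ in range(trials)) / trials

avg_100 = average_steps(100, 2_000)
print(avg_100)
```

For N = 100 the average is far smaller than N, since the procedure usually exits long before most balls have been marked.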
Analysis of Efficiency (Cont.)
The expected number of steps of the previous procedure is:
$$S_N = \sum_{k=1}^{N} k \cdot \left(1 - \frac{1}{N}\right) \cdots \left(1 - \frac{k-1}{N}\right) \cdot \frac{k}{N}$$
How is it related to query optimization?
Think of query plans (or, globally different join trees) as balls!
The uniform distribution employed in the model may be invalid in practice.
We have more analysis for situations where underestimation or overestimation is dominant. (And more analysis could be done in the future.)