Post on 24-Dec-2015
Administrivia
• Homework 3 available today
  – Written exercise; will be posted on class website
  – Due date: Tuesday, March 20 by end of class period
• Homework 4 available later this week
  – Implement nested loops and hash join operators for minibase
  – Due date: April 5 (after Spring Break)
• Midterm 2 is 3/22, 2 weeks from today
  – In class, covers lectures 10-17
  – Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall
Review
Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
Now you are here
You were here
• Operators are the building blocks for computing results of queries
  – Sort
  – Project
  – Join
  – Filter
  – Access methods for files
  – ...
• A query plan is a tree of operators that computes the result of a query
• Optimization is the process of picking the best plan
• Execution is the process of executing the plan
Query Plans: turning text into tuples
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid
  AND (A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
Query result:
Name        Shift
Aslan       3
Bageera     3
Elsa        3
Shere Khan  3
Tigger      2

Feeding (sample rows):
10  2  1  100  3
10  3  2  100  3
20  3  2  100  3
20  2  3  100  3
30  3  2  100  3
…   …  …  …    …

Animals (sample rows):
40  100  Aslan       Big Cat
50  300  Baloo       Bear
60  100  Bageera     Big Cat
70  100  Shere Khan  Big Cat
90  100  Dumbo       Elephant
…   …    …           …

[Figure: Query -> Query Plan (tree of Operators) -> Query Result]
Operator Review
• Access Path: pulls tuples from tables
  – File scans
  – Index scans (clustered or unclustered)
  – Index-only scans
• Select (or Filter): conditionally excludes tuples
  – Can be ‘pushed’/combined with Access Path operator
    • Use indexes where possible and apply other predicates on the result
  – Can also be applied at intermediate point in query plan
• Projection: removes columns and duplicates
  – Column projection often done by operators
  – Duplicate elimination via Sort or Hash
Operator Review
• Sort: sorts tuples in a particular order
  – Simple merge sort
  – General external merge sort (with various optimizations)
  – B+ tree traversal
• Join: combine tuples from 2 other operators
  – Page nested loops
  – Block nested loops
  – Index nested loops
  – Sort-merge join
  – Hash-join
• Other operators for
  – Group By, Temping, …
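As a rough illustration of how operators compose into a plan, here is a minimal Python sketch built on generators. The function names (`scan`, `select`, `project`, `nested_loops_join`), the column names, and the sample rows are assumptions made for this example; they are not Minibase code and not the rows from the lecture figure. The join is a simple tuple-at-a-time nested loops, not one of the page-oriented variants.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan(table: List[Row]) -> Iterator[Row]:
    """Access path: a plain file scan that yields every tuple."""
    yield from table


def select(rows: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter: pass through only the tuples that satisfy the predicate."""
    return (row for row in rows if pred(row))


def project(rows: Iterable[Row], cols: List[str]) -> Iterator[Row]:
    """Column projection (no duplicate elimination)."""
    return ({c: row[c] for c in cols} for row in rows)


def nested_loops_join(outer: Iterable[Row], inner: List[Row], key: str) -> Iterator[Row]:
    """Simple nested loops: rescan the inner table for every outer tuple."""
    for o in outer:
        for i in scan(inner):
            if o[key] == i[key]:
                yield {**o, **i}


# Hypothetical sample data, invented for this sketch.
animals = [
    {"aid": 1, "aname": "Aslan", "species": "Big Cat"},
    {"aid": 2, "aname": "Baloo", "species": "Bear"},
    {"aid": 3, "aname": "Dumbo", "species": "Elephant"},
]
feeding = [
    {"aid": 1, "feedingshift": 3},
    {"aid": 3, "feedingshift": 1},
    {"aid": 1, "feedingshift": 2},
]

# The plan is a tree of operators; each one pulls tuples from its child.
filtered = select(scan(animals), lambda r: r["species"] in ("Big Cat", "Bear"))
plan = project(nested_loops_join(filtered, feeding, "aid"), ["aname", "feedingshift"])
result = list(plan)
print(result)
```

Because every operator consumes and produces a stream of tuples, the select is effectively pushed below the join just by changing how the calls are nested.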
Query Optimization steps
1. Parse query from text to ‘intermediate model’
2. Traverse ‘intermediate model’ and produce alternate query plans
  – Query plan = tree of relational operators
  – Optimizer keeps track of cost and properties of plans
3. Pick the cheapest plan
4. Pass cheapest plan on to query execution engine to execute and produce results of query
[Figure: the query text goes to the Query parser, which produces Block 1, Block 2 and Block 3; the Query optimizer compares candidate plans (Cost = 200, Cost = 150, Cost = 500) and passes the cheapest one to the execution engine.]
Query Blocks: Units of Optimization
• Intermediate model is a set of query blocks
  – 1 per SELECT/FROM/WHERE/GROUP BY/HAVING clause

Query Block:
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid
  AND (A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1

• Subqueries produce nested query blocks
  – treated as calls to a subroutine, made once per tuple produced by outer query block
  – sometimes subqueries can be rewritten to produce cheaper plan

SELECT S.sname                         <- Outer Query Block
FROM Sailors S
WHERE S.age IN (SELECT MAX (S2.age)    <- Nested Query Block
                FROM Sailors S2
                GROUP BY S2.rating)

[Figure label: Rewritten Query Block]
Query blocks are optimized 1 at a time
1. Convert block to relational algebra tree
2. Traverse tree and build plan bottom up:
  • Pick best access method for each relation in FROM clause
    – Applying predicates if possible
  • Consider all join trees
    – All ways to join relations in FROM clause 1-at-a-time
    – Consider multiple permutations and join methods
    – But not all! Too many choices
    – Restrict to left-deep plans
    – Prune bad plans along the way
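To get a feel for why the plan space has to be restricted, the following Python sketch enumerates every left-deep join order for a list of relations. The helper names `left_deep_orders` and `show` and the relation names are hypothetical; a real optimizer would also vary the join method and prune as it goes, which this sketch does not attempt.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def left_deep_orders(relations: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every left-deep order: the first two relations are joined, then one more is added at a time."""
    return list(permutations(relations))


def show(order: Sequence[str]) -> str:
    """Render one left-deep order as a nested join expression."""
    plan = order[0]
    for rel in order[1:]:
        plan = f"({plan} JOIN {rel})"
    return plan


orders = left_deep_orders(["A", "B", "C"])
for order in orders:
    print(show(order))
print(len(orders), "left-deep orders")
```

With n relations there are n! left-deep orders before join methods are even considered, so the count grows quickly even inside this restricted space.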
Converting Query Blocks to Relational Algebra Trees
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Relational algebra tree (root first):
Π sname
  σ bid=100 ∧ rating > 5
    ⋈ sid=sid
      Reserves
      Sailors

• SQL is relationally complete; can express everything in relational algebra

Πsname(σ bid=100 ∧ rating>5 (Reserves ⋈sid=sid Sailors))
Πsname(σ(age in set from subquery) Sailors)
SQL extends Relational Algebra
• SQL is more powerful than relational algebra
  – extend relational algebra to include aggregate ops: GROUP BY, HAVING
• How is this query block expressed?
    SELECT S.sname
    FROM Sailors S
    WHERE S.age IN (constant set from subquery)
  – Πsname(σ(age in set from subquery) Sailors)
• And this query block?
    SELECT MAX (S2.age)
    FROM Sailors S2
    GROUP BY S2.rating
  – ΠMax(age)(GroupByRating(Sailors))
Why optimize?
Π sname
  σ bid=100 ∧ rating > 5
    ⋈ sid=sid
      Reserves
      Sailors

• Operators have implementation choices
  – Index scan? File scan? Nested loop join? Sort merge?
• Operators can also be applied in different order!
Motivating Example -- Schema used
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

• As seen in previous lectures…
• Reserves:
  – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
  – Assume there are 100 boats
• Sailors:
  – Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
  – Assume there are 10 different ratings
• Assume we have 5 pages in our buffer pool!
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Plan:
Π sname (On-the-fly)
  σ bid=100 ∧ rating > 5 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      Sailors (outer, 500 pages)
      Reserves (inner, 1000 pages)

• Cost: 500+500*1000 = 500,500 I/Os
• Not the worst plan, but…
• Misses several opportunities:
  – selections could have been 'pushed' earlier,
  – indexes might have been helpful….
• Goal of optimization: To find more efficient plans that compute the same answer.
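The cost figure for this plan can be reproduced with a small Python sketch. The function name `page_nested_loops_cost` is an assumption for illustration; the formula is the one used on the slide (read the outer once, and scan the whole inner once per outer page).

```python
def page_nested_loops_cost(outer_pages: int, inner_pages: int) -> int:
    """I/O cost of page-oriented nested loops: outer scan plus one inner scan per outer page."""
    return outer_pages + outer_pages * inner_pages


SAILORS_PAGES = 500    # from the schema slide
RESERVES_PAGES = 1000  # from the schema slide

cost = page_nested_loops_cost(SAILORS_PAGES, RESERVES_PAGES)
print(cost)  # 500500
```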
Selectivity calculation
• Sailors: 500 pages, 80 tuples per page, 10 ratings
• Selectivity of S.rating > 5?
  – ½ -> 500*80/2 = 20,000 tuples
  – 20,000/80 = 250 pages
• Reserves: 1000 pages, 100 tuples per page, 100 boats
• Selectivity of R.bid = 100?
  – 1/100 -> 1000*100/100 = 1000 tuples
  – 1000/100 = 10 pages
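The arithmetic above can be written as a short Python sketch. The function name `estimate_selection` is hypothetical, and the reduction factors (1/2 and 1/100) are the slide's uniform-distribution assumptions, passed in as exact fractions to avoid rounding surprises.

```python
import math
from fractions import Fraction
from typing import Tuple


def estimate_selection(pages: int, tuples_per_page: int, reduction_factor: Fraction) -> Tuple[int, int]:
    """Return (tuples, pages) expected to survive a selection with the given reduction factor."""
    tuples = pages * tuples_per_page * reduction_factor
    out_pages = math.ceil(tuples / tuples_per_page)
    return int(tuples), out_pages


# S.rating > 5 with 10 ratings: assumed to keep half of Sailors.
sailors_est = estimate_selection(500, 80, Fraction(1, 2))
# R.bid = 100 with 100 boats: assumed to keep 1/100 of Reserves.
reserves_est = estimate_selection(1000, 100, Fraction(1, 100))

print(sailors_est)   # (20000, 250)
print(reserves_est)  # (1000, 10)
```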
Alternative Plans – Push Selects (No Indexes)

Original plan:
Π sname (On-the-fly)
  σ bid=100 ∧ rating > 5 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      Sailors (500 pages)
      Reserves (1000 pages)
Cost: 500,500 IOs

Plan with rating > 5 pushed below the join:
Π sname (On-the-fly)
  σ bid=100 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      σ rating > 5 (On-the-fly)
        Sailors (500 pages in, 250 pages out)
      Reserves (1000 pages)
Cost: 500 + 250 * 1000 = 250,500 IOs
Alternative Plans – Push Selects (No Indexes)
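Plan with rating > 5 pushed below the join:
Π sname (On-the-fly)
  σ bid=100 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      σ rating > 5 (On-the-fly)
        Sailors (500 pages in, 250 pages out)
      Reserves (1000 pages)
Cost: 250,500 IOs

Plan with both selections pushed below the join:
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ rating > 5 (On-the-fly)
      Sailors (500 pages in, 250 pages out)
    σ bid=100 (On-the-fly)
      Reserves (1000 pages in, 10 pages out)
Cost: 500 + 250 * 1000 = 250,500 IOs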
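A small Python sketch of the pushed-selection cost, assuming the page counts from the selectivity slide. The function name `page_nl_cost` is hypothetical. It makes explicit why also pushing bid=100 onto the inner does not change the cost here: with an on-the-fly selection and no temp file, all 1000 pages of Reserves are still read for every surviving outer page.

```python
def page_nl_cost(outer_scan_pages: int, outer_pages_after_select: int, inner_scan_pages: int) -> int:
    """Read the outer once, then scan the full inner once per outer page that survives the selection."""
    return outer_scan_pages + outer_pages_after_select * inner_scan_pages


# Only rating > 5 pushed: 250 of the 500 Sailors pages survive.
only_rating_pushed = page_nl_cost(500, 250, 1000)

# bid = 100 also pushed, but applied on-the-fly: the inner scan is still 1000 pages.
both_pushed = page_nl_cost(500, 250, 1000)

print(only_rating_pushed, both_pushed)  # 250500 250500
```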
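Alternative Plans – Try different join order

Before the swap (Sailors as outer):
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ rating > 5 (On-the-fly)
      Sailors (500 pages in, 250 pages out)
    σ bid=100 (On-the-fly)
      Reserves (1000 pages in, 10 pages out)
Cost: 250,500 IOs

After the swap (Reserves as outer):
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ bid=100 (On-the-fly)
      Reserves (1000 pages in, 10 pages out)
    σ rating > 5 (On-the-fly)
      Sailors (500 pages)
Cost: 1000 + 10 * 500 = 6000 IOs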
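The effect of swapping the join order can be checked with a short Python sketch. The dictionary layout and the name `page_nl_cost` are assumptions for illustration; the page counts come from the earlier slides.

```python
def page_nl_cost(outer_scan_pages: int, outer_pages_after_select: int, inner_scan_pages: int) -> int:
    """Outer scan plus one full inner scan per surviving outer page."""
    return outer_scan_pages + outer_pages_after_select * inner_scan_pages


# (outer scan, outer pages after selection, inner scan) for each join order.
join_orders = {
    "Sailors outer": (500, 250, 1000),
    "Reserves outer": (1000, 10, 500),
}

costs = {name: page_nl_cost(*args) for name, args in join_orders.items()}
best = min(costs, key=costs.get)

print(costs)  # {'Sailors outer': 250500, 'Reserves outer': 6000}
print(best)   # Reserves outer
```

The more selective relation makes the better outer here because its 10 surviving pages, not 250, multiply the inner scan.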
Alternative Plans – Push Selects and precompute result (No Indexes)

Previous plan (both selections on-the-fly, Reserves as outer):
Cost: 6000 IOs

Plan with the inner selection precomputed into a temp file:
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ bid=100 (On-the-fly)
      Reserves (1000 pages in, 10 pages out)
    σ rating > 5 (Scan & write to temp T2)
      Sailors (500 pages in, 250 pages out)
Cost: 1000 + 500 + 250 + (10 * 250) = 4250 IOs
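Alternative Plans – Try different join order

Before the swap (Reserves as outer, selected Sailors in temp T2):
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ bid=100 (On-the-fly)
      Reserves (1000 pages in, 10 pages out)
    σ rating > 5 (Scan & write to temp T2)
      Sailors (500 pages in, 250 pages out)
Cost: 4250 IOs

After the swap (Sailors as outer, selected Reserves in temp T2):
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ rating > 5 (On-the-fly)
      Sailors (500 pages in, 250 pages out)
    σ bid=100 (Scan & write to temp T2)
      Reserves (1000 pages in, 10 pages out)
Cost: 500 + 1000 + 10 + (250 * 10) = 4010 IOs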
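The two temp-file plans can be compared with a minimal Python sketch. The function name `nl_with_temp_inner` is hypothetical; the cost terms follow the slide's formulas (scan both relations, write the filtered inner to a temp file, then scan the temp once per surviving outer page).

```python
def nl_with_temp_inner(outer_scan: int, outer_filtered: int, inner_scan: int, inner_filtered: int) -> int:
    """Page nested loops where the filtered inner is first materialized in a temp file."""
    write_temp = inner_filtered              # pages written to the temp file
    join = outer_filtered * inner_filtered   # temp file scanned once per outer page
    return outer_scan + inner_scan + write_temp + join


# Reserves as outer, selected Sailors in the temp file.
reserves_outer = nl_with_temp_inner(1000, 10, 500, 250)
# Sailors as outer, selected Reserves in the temp file.
sailors_outer = nl_with_temp_inner(500, 250, 1000, 10)

original_plan = 500 + 500 * 1000
print(reserves_outer, sailors_outer)   # 4250 4010
print(original_plan // sailors_outer)  # 124
```

The join term is the same in both orders (2500 page reads); the 240-page difference comes entirely from the size of the temp file that has to be written.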
Optimized query is 124x cheaper than the original!

Original plan:
Π sname (On-the-fly)
  σ bid=100 ∧ rating > 5 (On-the-fly)
    ⋈ sid=sid (Page-Oriented Nested loops)
      Sailors
      Reserves
Cost: 500,500 IOs

Optimized plan:
Π sname (On-the-fly)
  ⋈ sid=sid (Page-Oriented Nested loops)
    σ rating > 5 (On-the-fly)
      Sailors
    σ bid=100 (Scan & write to temp T2)
      Reserves
Cost: 4010 IOs
More Alternative Plans (No Indexes)
Plan:
Π sname (On-the-fly)
  ⋈ sid=sid (Sort-Merge Join)
    σ bid=100 (Scan; write to temp T1)
      Reserves
    σ rating > 5 (Scan; write to temp T2)
      Sailors

• Main difference: Sort Merge Join
• With 5 buffers, cost of plan:
  – Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution) = 1010.
  – Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings) = 750.
  – Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300
  – Total: 4060 page I/Os.
• If we use BNL join, join = 10+4*250, total cost = 2770.
• Can also 'push' projections, but must be careful!
  – T1 has only sid, T2 only sid, sname:
  – T1 fits in 3 pgs, cost of BNL under 250 pgs, total < 2000.

Merge passes: ceil(log4 ceil(10/5)) = 1 for T1; ceil(log4 ceil(250/5)) = 3 for T2. Adding the initial run-forming pass gives the 2 and 4 passes used in the sort costs above.
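The sort-merge and block nested loops totals can be reproduced with a Python sketch of the standard external merge sort pass count. The function names are hypothetical, and the BNL block size of B-2 pages is an assumption chosen because it matches the slide's 4*250 term.

```python
import math

BUFFERS = 5  # buffer pool size from the schema slide


def sort_passes(pages: int, buffers: int) -> int:
    """Passes of external merge sort: one run-forming pass, then (buffers - 1)-way merges."""
    runs = math.ceil(pages / buffers)
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffers - 1))
        passes += 1
    return passes


def external_sort_cost(pages: int, buffers: int) -> int:
    """Each pass reads and writes every page once."""
    return 2 * pages * sort_passes(pages, buffers)


def bnl_join_cost(outer_pages: int, inner_pages: int, buffers: int) -> int:
    """Block nested loops, assuming blocks of (buffers - 2) outer pages."""
    blocks = math.ceil(outer_pages / (buffers - 2))
    return outer_pages + blocks * inner_pages


T1_PAGES, T2_PAGES = 10, 250
build_temps = (1000 + T1_PAGES) + (500 + T2_PAGES)  # scan + write for both relations

smj_total = (build_temps
             + external_sort_cost(T1_PAGES, BUFFERS)
             + external_sort_cost(T2_PAGES, BUFFERS)
             + (T1_PAGES + T2_PAGES))               # final merge of the sorted temps
bnl_total = build_temps + bnl_join_cost(T1_PAGES, T2_PAGES, BUFFERS)

print(sort_passes(T1_PAGES, BUFFERS), sort_passes(T2_PAGES, BUFFERS))  # 2 4
print(smj_total, bnl_total)  # 4060 2770
```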
More Alt Plans: Indexes

Plan:
Π sname (On-the-fly)
  σ rating > 5 (On-the-fly)
    ⋈ sid=sid (Index Nested Loops, with pipelining)
      σ bid=100 (Use hash index, do not write to temp)
        Reserves
      Sailors

• With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
• INL with outer not materialized.
  – Projecting out unnecessary fields from outer doesn’t help.
• Join column sid is a key for Sailors.
  – At most one matching tuple, unclustered index on sid OK.
• Decision not to push rating>5 before the join is based on availability of sid index on Sailors.
• Cost: Selection of Reserves tuples (10 I/Os); then, for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
  – 10 I/Os for 1000 tuples on 10 pages
  – For each tuple assume 1.2 pages to find match
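The index nested loops cost can be checked with a short Python sketch. The function name `index_nl_cost` is hypothetical, and the probe cost of 1.2 I/Os per outer tuple is the slide's assumption, written as an exact fraction so the total comes out exact.

```python
from fractions import Fraction


def index_nl_cost(outer_pages: int, outer_tuples: int, probe_cost: Fraction) -> Fraction:
    """Read the selected outer pages, then probe the inner index once per outer tuple."""
    return outer_pages + outer_tuples * probe_cost


# 1000 matching Reserves tuples on 10 pages; 1.2 I/Os assumed per Sailors lookup.
cost = index_nl_cost(10, 1000, Fraction(12, 10))
print(int(cost))  # 1210
```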
What is needed for optimization?
• A closed set of operators
  – Relational ops (table in, table out)
  – Encapsulation based on iterators
• Plan space
  – Based on relational equivalences, different implementations
• Cost Estimation, based on
  – Cost formulas
  – Size estimation, based on
    • Catalog information on base tables
    • Selectivity (Reduction Factor) estimation
• A search algorithm
  – To sift through the plan space based on cost!
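Putting the pieces together, the following Python sketch enumerates a tiny plan space for the Sailors/Reserves example and picks the cheapest plan by estimated cost. It is an exhaustive search over four page nested loops variants only (choice of outer, and whether the filtered inner is written to a temp file), not the optimizer algorithm covered in the course. The names `STATS`, `plan_cost`, and `enumerate_plans` are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple

# Catalog-style statistics: base pages, and pages left after the pushed selection.
STATS = {
    "Sailors": {"pages": 500, "filtered": 250},
    "Reserves": {"pages": 1000, "filtered": 10},
}

Plan = Tuple[str, str, bool]  # (outer, inner, inner written to temp?)


def plan_cost(outer: str, inner: str, temp_inner: bool) -> int:
    """Cost formulas from the earlier slides for page nested loops with pushed selections."""
    o, i = STATS[outer], STATS[inner]
    if temp_inner:
        # Scan both inputs, write the filtered inner, rescan the temp per outer page.
        return o["pages"] + i["pages"] + i["filtered"] + o["filtered"] * i["filtered"]
    # Inner filtered on-the-fly: the full inner is rescanned per outer page.
    return o["pages"] + o["filtered"] * i["pages"]


def enumerate_plans() -> Iterator[Tuple[Plan, int]]:
    """Plan space: every choice of outer relation, with and without a temp inner."""
    for outer, temp_inner in product(STATS, (False, True)):
        inner = next(r for r in STATS if r != outer)
        yield (outer, inner, temp_inner), plan_cost(outer, inner, temp_inner)


all_plans = dict(enumerate_plans())
best_plan, best_cost = min(all_plans.items(), key=lambda item: item[1])

print(best_plan, best_cost)  # ('Sailors', 'Reserves', True) 4010
```

The three ingredients from the list above each appear once: `STATS` stands in for the catalog, `plan_cost` for cost estimation, and the `min` over `enumerate_plans()` for the search.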