Query Optimization R&G, Chapter 15 Lecture 16. Administrivia Homework 3 available today –Written...

transcript

Query Optimization

R&G, Chapter 15Lecture 16

Administrivia

• Homework 3 available today– Written exercise; will be posted on class website– Due date: Tuesday, March 20 by end of class period

• Homework 4 available later this week– Implement nested loops and hash join operators for

minibase– Due date: April 5 (after Spring Break)

• Midterm 2 is 3/22, 2 weeks from today– In class, covers lectures 10-17– Review will be held Tuesday 3/20 7-9 pm 306 Soda

Review

Query Optimizationand Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

Now you are here

You were here

•Operators are the building blocks for computing results of queries

•Sort•Project•Join•Filter•Access methods for files•...

•Query plans are a tree of operators that compute the result of a query•Optimization is the process of picking the best plan•Execution is the process of executing the plan

Query Plans: turning text into tuples

SELECT A.aname, max(F.feedingshift)FROM Animals A, Feeding FWHERE A.aid = F.aid AND(A.species = 'Big Cat' or A.species = 'Bear')GROUP BY A.aname

HAVING COUNT(*) > 1

Aslan 3

Bageera 3

Elsa 3

Shere Khan 3

Tigger 2

Name Shift

10 2 1 100 3

10 3 2 100 3

20 3 2 100 3

20 2 3 100 3

30 3 2 100 3

… … … … …

40 100 Aslan Big Cat

50 300 Baloo Bear

60 100 Bageera Big Cat

70 100Shere Khan

Big Cat

90 100 Dumbo Elephant

… … … …

OperatorsQuery Plan

Query ResultQuery

Operator Review

• Access Path : pulls tuples from tables– File scans– Index scans (clustered or unclustered)– Index-only scans

• Select (or Filter): conditionally excludes tuples– Can be ‘pushed’/combined with Access

Path operator• Use indexes where possible and apply

other predicates on the result– Can also be applied at intermediate

point in query plan• Projection: removes columns and

duplicates– Column projection often done by

operators– Duplicate elimination via Sort or Hash

Operator Review

• Sort: sorts tuples in a particular order– Simple merge sort– General external merge sort (with

various optimizations)– B+ tree traversal

• Join: combine tuples from 2 other operators– Page nested loops– Block nested loops– Index nested loops– Sort-merge join– Hash-join

• Other operators for– Group By, Temping, …

Query Optimization steps

1. Parse query from text to ‘intermediate model’

2. Traverse ‘intermediate model’ and produce alternate query plans– Query plan = tree of relational

operators– Optimizer keeps track of cost

and properties of plans

3. Pick the cheapest plan4. Pass cheapest plan on to query

execution engine to execute and produce results of query

Query parser

Query optimizer

SELECT A.aname, max(F.feedingshift)FROM Animals A, Feeding FWHERE A.aid = F.aid AND(A.species = 'Big Cat' or A.species = 'Bear')GROUP BY A.aname

HAVING COUNT(*) > 1

Block 2Block 1

Block 3

Cost = 200 Cost = 150 Cost = 500

To execution engine

Query Blocks: Units of Optimization

• Intermediate model is a set of query blocks– 1 per

SELECT/FROM/WHERE/GROUP BY/HAVING clause

SELECT S.snameFROM Sailors SWHERE S.age IN (SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating)

SELECT A.aname, max(F.feedingshift)FROM Animals A, Feeding FWHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear')GROUP BY A.aname

HAVING COUNT(*) > 1

Query Block

• Subqueries produce nested query blocks – treated as calls to a

subroutine, made once per tuple produced by outer query block

– sometimes subqueries can be rewritten to produce cheaper plan

Outer Query Block

Nested Query BlockRewritten Query BlockX

Query blocks are optimized 1 at a time

1. Convert block to relational algebra tree

2. Traverse tree and build plan bottom up:• Pick best access method for each

relation in FROM clause• Applying predicates if possible

• Consider all join trees • All ways to join relations in FROM

clause 1-at-a time• Consider multiple permutations

and join methods– But not all! too many choices – Restrict to left-deep plans– Prune bad plans along the way

Query Block

Converting Query Blocks to Relational Algebra Trees

SELECT S.snameFROM Reserves R, Sailors SWHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Reserves Sailors

sid=sid

bid=100 rating > 5

• SQL is relationally complete; can express everything in relational algebra

(sname)(bid=100 rating > 5) (Reserves Sailors)

Πsname(σ(age in set from subquery) Sailors)

SQL extends Relational Algebra

• SQL is more powerful than relational algebra– extend relational algebra to include

aggregate ops: GROUP BY, HAVING

• How is this query block expressed?SELECT S.sname

FROM Sailors SWHERE S.age IN (constant set from subquery)

• And this query block?SELECT MAX (S2.age)

FROM Sailors S2 GROUP BY S2.rating

σ(age in set from subquery) Sailors

ΠMax(age)(GroupByRating(Sailors) )GroupByRating(Sailors)

Why optimize?

Reserves Sailors

sid=sid

bid=100 rating > 5

• Operators have implementation choices – Index scan? File scan? Nested loop join? Sort merge?

• Operators can also be applied in different order!

Motivating Example -- Schema used

• As seen in previous lectures…• Reserves:

– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

– Assume there are 100 boats• Sailors:

– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

– Assume there are 10 different ratings • Assume we have 5 pages in our buffer pool!

Sailors (sid: integer, sname: string, rating: integer, age: real)Reserves (sid: integer, bid: integer, day: dates, rname: string)

Motivating Example

• Cost: 500+500*1000 I/Os• Not the worst plan, but… • Misses several opportunities:

– selections could have been `pushed’ earlier,

– indexes might have been helpful….

• Goal of optimization: To find more efficient plans that compute the same answer.

SELECT S.snameFROM Reserves R, Sailors SWHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Sailors Reserves

sid=sid

bid=100 rating > 5

(Page-Oriented Nested loops)

(On-the-fly)

500 1000

Selectivity calculation

• Sailors: 500 pages, 80 tuples per page, 10 ratings

• Selectivity of S.rating > 5?– ½ -> 500*80/2 = 20,000 tuples – 20,000/80 = 250 pages

• Reserves: 1000 pages, 100 tuples per page, 100 boats

• Selectivity of R.bid = 100?– 1/100 -> 1000*100/100 = 1000 tuples– 1000/100 = 10 pages

500,500 IOs

Alternative Plans – Push Selects (No Indexes)

Sailors Reserves

sid=sid

bid=100 rating > 5

(On-the-fly)

Sailors

Reserves

sid=sid

rating > 5

(On-the-fly)

bid=100 (On-the-fly)

250,500 IOs

500 + 250 *1000 =

250 1000

Alternative Plans – Push Selects (No Indexes)

Sailors

Reserves

sid=sid

rating > 5

(On-the-fly)

Sailors Reserves

sid=sid

bid = 100

(On-the-fly)

rating > 5

(On-the-fly)(On-the-fly)

250,500 IOs250,500 IOs

1000250

500 + 250 *1000 =

250 10

Sailors

Reserves

sid=sid

rating > 5

(On-the-fly)

6000 IOs

Sailors

Reserves

sid=sid

rating > 5

(On-the-fly)

bid=100

(On-the-fly)

250,500 IOs

Alternative Plans – Try different join order

swap10

1000 + 10 *500=

SailorsReserves

sid=sid

rating > 5

(On-the-fly)

bid=100

(Scan &Write totemp T2)(On-the-fly)

6000 IOs

Sailors

Reserves

sid=sid

rating > 5

(On-the-fly)

bid=100

(On-the-fly)

Alternative Plans – Push Selects and precompute result (No Indexes)

1000 + 500+ 250 + (10 * 250) = 1000

10 500

4250 IOs

ReservesSailors

sid=sid

bid=100

(On-the-fly)

rating>5

Alternative Plans – Try different join order

500 + 1000 +10 +(250 *10) =

SailorsReserves

sid=sid

rating > 5

(On-the-fly)

bid=100

4250 IOs

4010 IOs

500,500 IOs

Optimized query is 124x cheaper than the original!

Sailors Reserves

sid=sid

bid=100 rating > 5

(On-the-fly)

ReservesSailors

sid=sid

bid=100

(On-the-fly)

rating>5

4010 IOs

More Alternative Plans (No Indexes)

• Main difference: Sort Merge Join

• With 5 buffers, cost of plan:– Scan Reserves (1000) + write temp T1 (10 pages, if we

have 100 boats, uniform distribution) = 1010.– Scan Sailors (500) + write temp T2 (250 pages, if have 10 ratings) =

750.– Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300– Total: 4060 page I/Os.

• If use BNL join, join = 10+4*250, total cost = 2770.• Can also `push’ projections, but must be careful!

– T1 has only sid, T2 only sid, sname:– T1 fits in 3 pgs, cost of BNL under 250 pgs, total < 2000.

Reserves Sailors

sid=sid

bid=100

sname(On-the-fly)

rating > 5(Scan;write to temp T1)

(Scan;write totemp T2)

(Sort-Merge Join)

log4 ceil(10/5) = 1; log4 ceil(250/5))=3

More Alt Plans: Indexes• With clustered index on

bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.

• INL with outer not materialized.

Decision not to push rating>5 before the join is based on availability of sid index on Sailors. Cost: Selection of Reserves tuples (10 I/Os); then, for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os.

Join column sid is a key for Sailors.–At most one matching tuple, unclustered index on sid OK.

– Projecting out unnecessary fields from outer doesn’t help.

(On-the-fly)

(Use hashIndex, donot writeto temp)

Reserves

Sailors

sid=sid

bid=100

rating > 5

(Index Nested Loops,

with pipelining )

(On-the-fly)

10 I/Os for 1000 tuples on 10 pagesFor each tuple assume 1.2 pages to find match

What is needed for optimization?

• A closed set of operators – Relational ops (table in, table out)– Encapsulation based on iterators

• Plan space, based on– Based on relational equivalences, different

implementations• Cost Estimation, based on

– Cost formulas– Size estimation, based on

• Catalog information on base tables• Selectivity (Reduction Factor) estimation

• A search algorithm– To sift through the plan space based on cost!

Query Optimization R&G, Chapter 15 Lecture 16. Administrivia Homework 3 available today –Written...

Documents