Query Optimization Chap. 19. Evaluation of SQL Conceptual order of evaluation – Cartesian product...

Query Optimization

Chap. 19

Evaluation of SQL

• Conceptual order of evaluation
  – Cartesian product of all tables in the from clause
  – Rows not satisfying the where clause eliminated
  – Rows grouped according to the group by clause
  – Groups not satisfying the having clause eliminated
  – Select clause target list evaluated
  – If distinct is specified, duplicate rows eliminated
  – Union taken after each subselect is evaluated
  – Rows sorted according to order by
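The conceptual order can be traced step by step in a few lines of Python. This is only a sketch of the semantics, not how a DBMS evaluates a query: rows are plain dicts, the function name `conceptual_eval` and the sample data are made up for illustration, and union is left out for brevity.

```python
from itertools import product

def conceptual_eval(tables, where, group_key, having, target, distinct, order_key):
    """Evaluate one query block in the conceptual order (union omitted)."""
    # 1. Cartesian product of all tables in the from clause.
    rows = [{k: v for r in combo for k, v in r.items()} for combo in product(*tables)]
    # 2. Eliminate rows not satisfying the where clause.
    rows = [r for r in rows if where(r)]
    # 3. Group the remaining rows.
    groups = {}
    for r in rows:
        groups.setdefault(group_key(r), []).append(r)
    # 4. Eliminate groups not satisfying having.
    kept = [g for g in groups.values() if having(g)]
    # 5. Evaluate the select target list, one output row per group.
    out = [target(g) for g in kept]
    # 6. Eliminate duplicates if distinct was requested.
    if distinct:
        out = list(dict.fromkeys(out))
    # 7. Sort according to order by.
    return sorted(out, key=order_key)

# Hypothetical sample data.
employee = [{"ssn": 1, "lname": "Smith"}, {"ssn": 2, "lname": "Wong"},
            {"ssn": 3, "lname": "Adams"}]
works_on = [{"essn": 1, "pno": 10}, {"essn": 1, "pno": 20}, {"essn": 2, "pno": 10},
            {"essn": 3, "pno": 10}, {"essn": 3, "pno": 30}]

# select lname, count(*) from employee, works_on where ssn = essn
# group by lname having count(*) >= 2 order by lname
result = conceptual_eval(
    [employee, works_on],
    where=lambda r: r["ssn"] == r["essn"],
    group_key=lambda r: r["lname"],
    having=lambda g: len(g) >= 2,
    target=lambda g: (g[0]["lname"], len(g)),
    distinct=True,
    order_key=lambda t: t[0],
)
print(result)
```

Step 1 builds all 15 combinations before anything is filtered, which is exactly the cost a real optimizer tries to avoid.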

Actual Order of Evaluation

• Order chosen by query optimizer
  – Determines an efficient way to evaluate the query
• Steps to optimization:
  – Syntax checking phase
    • scan - identify tokens of text
    • parse - check syntax
    • validate - check attribute and relation names
  – Query optimization phase
    • create internal representation - Query Tree
    • identify execution strategies for Query Plan
      – maintains statistics for tables, columns and indexes
    • choose suitable Query Plan to optimize query, e.g. order of execution of ops, use of indexes, etc.

How to produce an execution plan: Oracle’s example

Evaluation cont’d

  – Execution phase
    • query optimizer produces execution plan
    • code generator generates code
    • runtime db processor runs query code
  – To minimize run time
    • chosen strategy is not necessarily optimal, but reasonably efficient

• For procedural languages, limited need for query optimization

Strategy

• How would you optimize an SQL query?

Heuristics in Query Optimization

• Apply select and project before join or other binary operations. Why?
  – select and project reduce the size of the intermediate results
• Strategy is obvious, but the challenge was to show it could be done with rules
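A small experiment makes the size argument concrete. The sketch below runs the same two-table query with the select applied after and before the join, and reports the size of the intermediate result. The table contents and function names are invented for illustration; only the column names come from the example query used later in these slides.

```python
def join(left, right, cond):
    """Naive join: every pair of rows is tested against the join condition."""
    return [{**l, **r} for l in left for r in right if cond(l, r)]

def same_project(w, p):
    return w["pno"] == p["pnumber"]

def plan_select_late(works_on, project):
    # Join first, then select: the whole join result is the intermediate table.
    joined = join(works_on, project, same_project)
    result = [t for t in joined if t["pname"] == "Aquarius"]
    return len(joined), result

def plan_select_early(works_on, project):
    # Select first, then join: only the matching project rows reach the join.
    selected = [p for p in project if p["pname"] == "Aquarius"]
    result = join(works_on, selected, same_project)
    return len(selected), result

# Hypothetical data: 100 projects, 500 assignments, 5 of them on 'Aquarius'.
project = [{"pnumber": n, "pname": "Aquarius" if n == 7 else f"P{n}"}
           for n in range(1, 101)]
works_on = [{"essn": i, "pno": i % 100 + 1} for i in range(500)]

late_size, late_result = plan_select_late(works_on, project)
early_size, early_result = plan_select_early(works_on, project)
print(late_size, early_size, len(early_result))
```

Both plans return the same 5 rows, but the late plan carries 500 intermediate tuples where the early plan carries 1.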

Create Internal Representation – Query Tree

Converting Query Trees into Query Plans

• An execution plan for a query tree includes information about access methods for each relation and algorithms for operators

• For example:
  – For σ operation
    • Choose an index
    • Use a table scan
  – For |X|
    • Use nested loop, sort-merge or hash join
  – For π
    • Scan result of join, or combine with |X| when writing out the result from the join
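Two of the join algorithms named above can be sketched in memory as follows (sort-merge is omitted). This is an illustrative sketch over lists of dicts, not a disk-based implementation; the function names and sample rows are assumptions.

```python
from collections import defaultdict

def nested_loop_join(left, right, lkey, rkey):
    """Compare every left row with every right row: |left| * |right| tests."""
    return [{**l, **r} for l in left for r in right if l[lkey] == r[rkey]]

def hash_join(left, right, lkey, rkey):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = defaultdict(list)
    for r in right:                      # build phase
        buckets[r[rkey]].append(r)
    return [{**l, **r}                   # probe phase
            for l in left
            for r in buckets.get(l[lkey], [])]

# Hypothetical sample rows.
works_on = [{"essn": 1, "pno": 7}, {"essn": 2, "pno": 8}, {"essn": 3, "pno": 7}]
project = [{"pnumber": 7, "pname": "Aquarius"}, {"pnumber": 8, "pname": "Orion"}]

nl = nested_loop_join(works_on, project, "pno", "pnumber")
hj = hash_join(works_on, project, "pno", "pnumber")
print(nl == hj, len(hj))
```

Both return the same rows; the choice between them is a cost question, which is why the plan has to record an algorithm for each operator.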

Query Optimization

• Canonical form (initial query tree - conceptual order of evaluation)
  – Leaf nodes are tables
  – Internal nodes are operations
  – Begin by separating select conditions from joins (end up with X)
  – Combine all selects, then all projects
• Transform to final query tree using general transformation rules for relational algebra

Query tree for SQL query

Select lname
From employee, works_on, project
Where pname = 'Aquarius' and pnumber = pno
  and essn = ssn and bdate > '1987-12-31'

• Write as canonical query tree
• Write as relational algebra expression
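One possible answer for the relational algebra expression, following the canonical recipe of Cartesian products at the bottom, one combined select above them and one project at the root. The nesting order of the products is an arbitrary choice here, and which table owns each attribute is assumed from the query text.

```latex
\pi_{lname}\Bigl(
  \sigma_{pname = \text{'Aquarius'} \,\wedge\, pnumber = pno \,\wedge\, essn = ssn \,\wedge\, bdate > \text{'1987-12-31'}}
  \bigl( (employee \times works\_on) \times project \bigr)
\Bigr)
```

Read as a tree, the three tables are the leaves, the two products are the lower internal nodes, and the select and project sit above them.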

General Transformation Rules for Relational Algebra

1. Cascade of σ
2. Commutativity of σ
3. Cascade of π
4. Commuting σ with π
5. Commutativity of |X| (and X)
6. Commuting σ with |X| (or X)
7. Commuting π with |X| (or X)

General Transformation Rules for Relational Algebra

8. Commutativity of set operations ∪ and ∩
9. Associativity of |X|, X, ∪ and ∩
10. Commuting σ with set operations
11. The π operation commutes with ∪
12. Converting a (σ, X) sequence into |X|

Outline of a Heuristic Algebraic Optimization Algorithm

• Use Rule 1 to break up conjunctive σ’s into cascades of σ’s
• Use Rules 2, 4, 6, 10 for commutativity of σ to:
  – move σ’s as far down the tree as possible
• Use Rules 5 and 9 for commutativity and associativity of binary operations to:
  – Place most restrictive σ (and |X|) so they are executed first
    • fewest tuples, smallest absolute size or smallest selectivity
    • But make sure no Cartesian products result

Outline of a Heuristic Algebraic Optimization Algorithm

• Use Rule 12 to combine a Cartesian product with a subsequent σ to:
  – create |X|
• Use Rules 3, 4, 7, 11 concerning cascade of π’s and commuting π with other ops to:
  – move π’s down the tree as far as possible
• Identify subtrees that represent groups of operations that can be executed by a single algorithm
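The select-related steps can be sketched on the example query. The code below covers only Rule 1 (the conjunctive condition is already given as a list of conjuncts), Rule 6 (each conjunct is pushed to the lowest subtree that contains all its attributes) and Rule 12 (a conjunct left on top of a Cartesian product turns it into a join). Reordering of leaves and pushing of π are not attempted. The class names, the helper `cond`, and the attribute sets assigned to each table are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cond:
    attrs: frozenset   # attribute names the condition mentions
    text: str

@dataclass(frozen=True)
class Table:
    name: str
    attrs: frozenset

@dataclass(frozen=True)
class Select:
    conds: tuple
    child: object

@dataclass(frozen=True)
class Product:
    left: object
    right: object

@dataclass(frozen=True)
class Join:
    conds: tuple
    left: object
    right: object

def attrs_of(node):
    """Attributes available in the output of a Table or Product node."""
    if isinstance(node, Table):
        return node.attrs
    return attrs_of(node.left) | attrs_of(node.right)

def push_down(conds, node):
    """Attach each conjunct at the lowest node that covers its attributes."""
    if isinstance(node, Table):
        return Select(tuple(conds), node) if conds else node
    left_c = [c for c in conds if c.attrs <= attrs_of(node.left)]
    right_c = [c for c in conds if c not in left_c and c.attrs <= attrs_of(node.right)]
    here = [c for c in conds if c not in left_c and c not in right_c]
    left, right = push_down(left_c, node.left), push_down(right_c, node.right)
    # Rule 12: a select directly above a product becomes a join.
    return Join(tuple(here), left, right) if here else Product(left, right)

def show(node):
    if isinstance(node, Table):
        return node.name
    if isinstance(node, Select):
        return "σ[" + " and ".join(c.text for c in node.conds) + "](" + show(node.child) + ")"
    op = "X"
    if isinstance(node, Join):
        op = "|X|[" + " and ".join(c.text for c in node.conds) + "]"
    return "(" + show(node.left) + " " + op + " " + show(node.right) + ")"

def cond(text, *names):
    return Cond(frozenset(names), text)

employee = Table("employee", frozenset({"ssn", "lname", "bdate"}))
works_on = Table("works_on", frozenset({"essn", "pno"}))
project = Table("project", frozenset({"pname", "pnumber"}))
canonical = Product(Product(employee, works_on), project)
conds = [cond("pname='Aquarius'", "pname"), cond("pnumber=pno", "pnumber", "pno"),
         cond("essn=ssn", "essn", "ssn"), cond("bdate>'1987-12-31'", "bdate")]
plan = push_down(conds, canonical)
print(show(plan))
```

The result has no Cartesian products left: each single-table condition sits directly on its table, and each two-table condition has become a join condition.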

Summary of Heuristics

• Apply first the operations that reduce the size of intermediate results
  – Perform σ’s and π’s as early as possible (move down tree as far as possible)
  – Execute most restrictive σ and |X| first (reorder leaf nodes but avoid Cartesian products)

Multiple table joins

• Multiple table joins
  – Query plan identifies most efficient ordering of joins
  – uses dynamic programming
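A minimal sketch of dynamic programming over join orders, in the spirit of classic cost-based optimizers. Several simplifications are assumed and are not taken from the slides: only left-deep orders are considered, the cost of a plan is the total number of rows in its intermediate results, Cartesian products are skipped, and the join graph is connected. The cardinalities and selectivities are invented.

```python
from fractions import Fraction
from itertools import combinations

def best_join_order(card, sel):
    """card: {table: rows}; sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, order) of the cheapest left-deep join order."""
    tables = sorted(card)
    # best[subset] = (cost so far, rows produced, join order)
    best = {frozenset([t]): (0, card[t], (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for t in sorted(subset):          # t is the table joined last
                rest = subset - {t}
                if rest not in best:
                    continue
                preds = [sel[frozenset({t, u})] for u in rest
                         if frozenset({t, u}) in sel]
                if not preds:
                    continue                  # would be a Cartesian product
                cost, rows, order = best[rest]
                out = rows * card[t]
                for s in preds:
                    out *= s
                cand = (cost + out, out, order + (t,))
                if subset not in best or cand[0] < best[subset][0]:
                    best[subset] = cand
    cost, _, order = best[frozenset(tables)]
    return cost, order

# Hypothetical cardinalities after the selects have been applied.
card = {"employee": 2000, "works_on": 50000, "project": 1}
sel = {frozenset({"employee", "works_on"}): Fraction(1, 10000),
       frozenset({"works_on", "project"}): Fraction(1, 100)}

cost, order = best_join_order(card, sel)
print(cost, order)
```

The best plan for each subset of tables is computed once and reused by every larger subset, which is what keeps the search from enumerating all orderings.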

Order of joins - Oracle

• Much more complicated
  – when determining order of joins, keep track of resulting sort order (interesting order)
• Dynamic programming considers the order of the result
  – can avoid a redundant sort operation later and/or speed up a subsequent join
• Can flatten nested joins
  – dynamic programming can escalate optimization time, so use rules instead
• Estimating cost is difficult

Joins

– May not have to materialize actual table resulting from join

– Instead use pipelining - successive rows output from one step are fed into the next step of the plan

Converting trees into Query plans

• Pipelined evaluation
  – Forward result from an operation directly to next operation
    • Result from σ placed in buffer
    • |X| consumes tuples from buffer
    • Result from |X| pipelined to π
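Python generators give a compact way to picture a pipeline: each operator pulls one row at a time from the operator below it, so the select and join results are never stored as whole tables. This is a sketch under assumptions: the operator functions and sample rows are invented, and the inner input of the join is held in an in-memory hash table.

```python
def scan(table):
    for row in table:
        yield row

def select(rows, pred):
    for row in rows:
        if pred(row):
            yield row

def join(outer_rows, inner_table, outer_key, inner_key):
    # The inner input is materialized once; the outer input is streamed.
    index = {}
    for r in inner_table:
        index.setdefault(r[inner_key], []).append(r)
    for o in outer_rows:
        for r in index.get(o[outer_key], []):
            yield {**o, **r}

def project(rows, cols):
    for row in rows:
        yield tuple(row[c] for c in cols)

# Hypothetical sample rows.
project_rows = [{"pname": "Aquarius", "pnumber": 7}, {"pname": "Orion", "pnumber": 8}]
works_on = [{"essn": 1, "pno": 7}, {"essn": 2, "pno": 8}, {"essn": 3, "pno": 7}]

pipeline = project(
    join(select(scan(project_rows), lambda r: r["pname"] == "Aquarius"),
         works_on, "pnumber", "pno"),
    ["essn"],
)
result = list(pipeline)
print(result)
```

Nothing runs until `list(pipeline)` asks for rows; each requested row travels through σ, |X| and π before the next one is read.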

Query Tree Question

• Should we do π pname, pnumber, then σ pname = 'Aquarius', then π pnumber?
• No, since the operations are done together
  – the processor would read a row of project, see if pname = 'Aquarius', then use pnumber to perform the join.

Algorithms

• DBMS has general access algorithms to implement select, join, or combinations of ops

• Don't have separate access routines for each op
  – Creating temporary files is inefficient
  – Generate algorithms for combinations of operations
  – join, select, project - code is created dynamically to implement multiple operations

Materialized table

• Think about what operations require utilizing a materialized table
  – Input to select?
  – Input to project?
  – Input to join?

Identify Execution Strategies and Suitable Query Plans

Cost

• Optimizers combine:
  – Heuristic rules for ordering ops
  – Systematic estimation of the cost of different execution strategies - choose lowest cost
    • E.g. nested loop or hash join?
    • CPU cost usually similar for the Query Plans

Cost of Query Plans

• Cost components for query execution
  – Computation cost:
    • in-memory ops on data buffers, e.g. sorting, searching, implementation/order of operations
    • CPU cost usually similar for the Query Plans
  – Memory usage cost:
    • Number of memory buffers needed
  – Access cost to secondary storage:
    • cost of search for hashing, indexes, R/W
    • contiguous versus scattered storage on disk
  – Storage cost
    • if intermediate tables are stored
  – Communication cost:
    • cost to ship query and results from DB site to request site

Cost cont’d

• For small DB's, minimize computation cost
• For large DB's, minimize access cost to secondary storage, e.g. block transfers between disk and main memory
• For distributed DB's, minimize communication cost
• Workload
  – Mix of queries and frequencies of queries
  – Given workload and query execution plans, can determine CPU and I/O resource needs

Statistics

• Statistics
  – Maintain statistics about tables
    • # rows, # columns, domains
    • SYSCOLUMNS: col_name, table_name, # of values, High, Low
  – Statistics on columns that deviate strongly from the uniform assumption
  – Selectivity of values
  – Execute special command to gather info
    • RUNSTATS (DB2)
    • ANALYZE (Oracle)

Cost function information

• Number of tuples or records (r)
• Record size (R)
• Number of blocks (b)
• Blocking factor – records per block (bfr)
• Number of distinct values (d)
• Selectivity (sl)
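These quantities are related: given r and bfr, the number of blocks is b = ceil(r / bfr). The sketch below also derives bfr from a block size, which is an extra assumption not listed above (fixed-length records that do not span blocks); the numbers are illustrative.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr: whole records per block, assuming unspanned fixed-length records."""
    return block_size // record_size

def num_blocks(r: int, bfr: int) -> int:
    """b = ceil(r / bfr), computed with integer arithmetic."""
    return (r + bfr - 1) // bfr

bfr = blocking_factor(block_size=4096, record_size=100)   # assumed sizes in bytes
b = num_blocks(r=10_000, bfr=bfr)
print(bfr, b)
```

A full table scan of this hypothetical table would therefore read about 250 blocks, which is the kind of figure the access-cost estimates are built from.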

Selectivity sl a.k.a Filter Factor FF

• Fraction of rows with the specified value(s) for a specific attribute, i.e. the fraction that results from the predicate restriction
• Estimates how many tuples satisfy the predicate
• Hopefully only need to access those tuples + index

Selectivity sl, Filter Factor FF

• sl = (# records satisfying condition c) / (total # of records in relation)
• Estimate for an attribute with i distinct values:
  – Assume |R| is # rows in table R
    FF = sl = (|R|/i) / |R| = 1/i = 1/col_cardinality
  – E.g. 10,000 rows and 2 distinct values:
    FF = sl = (10,000/2)/10,000 = 1/2
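The estimate is easy to reproduce. A minimal sketch, with function names chosen for illustration and exact fractions used to avoid rounding:

```python
from fractions import Fraction

def eq_selectivity(col_cardinality: int) -> Fraction:
    """sl = 1 / (number of distinct values), assuming a uniform distribution."""
    return Fraction(1, col_cardinality)

def estimated_rows(num_rows: int, sl: Fraction) -> Fraction:
    """Expected number of rows that satisfy the predicate."""
    return num_rows * sl

sl = eq_selectivity(2)
rows = estimated_rows(10_000, sl)
print(sl, rows)
```

With 10,000 rows and 2 distinct values the optimizer would expect 5,000 qualifying rows.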

Examples of sl

• If the SQL statement specifies:
  – col = const
    • DB2 assumes sl = 1/col_cardinality
  – col between const1 and const2
    • DB2 assumes sl = (const2 - const1)/(High - Low)
• For some predicates, sl is not predictable by a simple formula
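The range formula can be sketched as below. Clamping the constants to the [Low, High] interval is an added safeguard for this sketch, not something the formula above states; the column values are invented.

```python
def between_selectivity(const1, const2, low, high):
    """sl = (const2 - const1) / (High - Low), with the range clamped to [Low, High]."""
    if high == low:
        return 1.0                      # single-valued column: every row matches
    lo = max(const1, low)
    hi = min(const2, high)
    if hi < lo:
        return 0.0                      # requested range lies outside the column's values
    return (hi - lo) / (high - low)

# Hypothetical salary column with Low = 10,000 and High = 110,000.
sl_inside = between_selectivity(20_000, 30_000, low=10_000, high=110_000)
sl_outside = between_selectivity(200_000, 300_000, low=10_000, high=110_000)
print(sl_inside, sl_outside)
```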

Assumptions

• Uniform distribution of column values
• Attribute values independent
• Independent distribution of values from any 2 columns C1 and C2:
    sl(C1 and C2) = sl(C1) * sl(C2)
  e.g. 1/2 (gender) * 1/4 (class) = 1/8 of undergrads are female freshmen
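Under the independence assumption, the selectivities of conjoined predicates simply multiply. A minimal sketch (the function name is an assumption; `math.prod` requires Python 3.8 or later):

```python
from fractions import Fraction
from math import prod

def combined_selectivity(selectivities):
    """Selectivity of a conjunction, assuming the predicates are independent."""
    return prod(selectivities, start=Fraction(1))

sl = combined_selectivity([Fraction(1, 2), Fraction(1, 4)])
print(sl)
```

If the columns are correlated, this product can be far from the true fraction, which is one reason cost estimates go wrong.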

Cost of a Join

• How to determine join selectivity (js)
  – js = |(R |X|c S)| / |(R X S)| = |(R |X|c S)| / (|R| * |S|)
• If no join condition, js = ?
  – js = 1
• If no matching tuples, js = ?
  – js = 0

Cost of a Join

• Assuming R.A = S.B is the join condition
  – if A is a key of R, js = ?
    • js ≤ 1/|R|
  – Unless B is a foreign key and NOT NULL, then js = ?
    • js = 1/|R|
  – if B is a key of S, js = ?
    • js ≤ 1/|S|
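The definition and the key cases can be checked numerically. In this sketch the function names and table sizes are assumptions, and both relations are assumed to be non-empty.

```python
from fractions import Fraction

def join_selectivity(join_size: int, r_size: int, s_size: int) -> Fraction:
    """js = |R join S| / (|R| * |S|); R and S must be non-empty."""
    return Fraction(join_size, r_size * s_size)

def estimated_join_size(js: Fraction, r_size: int, s_size: int) -> Fraction:
    """Expected number of rows in the join result."""
    return js * r_size * s_size

# Hypothetical sizes: R has 50 rows with key A, S has 10,000 rows,
# and S.B is a NOT NULL foreign key referencing R.A, so js = 1/|R|.
r_size, s_size = 50, 10_000
size = estimated_join_size(Fraction(1, r_size), r_size, s_size)
print(size)
```

In the foreign-key case every row of S matches exactly one row of R, so the estimated join size equals |S|.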

• Cost of different implementations of join:

You Can Retrieve Query Plan

Explain plan set queryno = 1000 for
  select * from customers where city = 'Boston';

Select * from plan_table where queryno = 1000;

Gives access type (index or not), columns, etc.
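To try plan retrieval without a DB2 installation, the same idea can be shown with SQLite, which ships with Python. This swaps in a different DBMS: SQLite's `EXPLAIN QUERY PLAN` statement is not the DB2 syntax above, and its output format varies between SQLite versions. The table definition and index name are made up for illustration.

```python
import sqlite3

def plan_details(conn, sql):
    """Return the human-readable last column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cid INTEGER PRIMARY KEY, cname TEXT, city TEXT)")
query = "SELECT * FROM customers WHERE city = 'Boston'"

before = plan_details(conn, query)          # no index yet: expect a table scan
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
after = plan_details(conn, query)           # expect a search using the new index

print(before)
print(after)
```

Comparing the two outputs shows the access type changing from a scan to an index search once the index exists.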

Information for Optimization

• When using indexes
  – Cluster Ratio
    • how well the clustering property holds for rows with respect to a given index
    • if clustering ratio is 80% or more, use sequential prefetch

Date post:	01-Jan-2016
Upload:	jayson-townsend