
Temple University – CIS Dept. CIS616– Principles of Data Management

Date post: 31-Dec-2015
Page 1: Temple University – CIS Dept. CIS616– Principles of Data Management

Temple University – CIS Dept. CIS616– Principles of Data Management

V. Megalooikonomou

Query Processing / Optimization

(based on notes by C. Faloutsos at CMU)

Page 2: Temple University – CIS Dept. CIS616– Principles of Data Management

Overview of a DBMS
[Figure: DBMS architecture. Users: naïve user, casual user, DBA. Components: DML parser, DML precompiler, DDL parser, buffer manager, transaction manager. Storage: catalog, data files.]

Page 3: Temple University – CIS Dept. CIS616– Principles of Data Management

Overview - detailed
  Motivation - Why q-opt?
  Equivalence of expressions
  Cost estimation
  Cost of indices
  Join strategies

Page 4: Temple University – CIS Dept. CIS616– Principles of Data Management

Why Q-opt?
  SQL is (mostly) declarative: the query says what to retrieve, not how
  good q-opt -> big difference
  e.g., seq. scan vs. B-tree index, on P = 1,000 pages

Page 5: Temple University – CIS Dept. CIS616– Principles of Data Management

Q-opt steps
  bring query into internal form (e.g., parse tree)
  … into ‘canonical form’ (syntactic q-opt)
  generate alternative plans
  estimate cost; pick best

Page 6: Temple University – CIS Dept. CIS616– Principles of Data Management

Q-opt - example
  select name
  from STUDENT, TAKES
  where c-id = 'CIS616' and
        STUDENT.ssn = TAKES.ssn
[Figure: two parse trees over STUDENT and TAKES; the second is labeled ‘Canonical form’]

Page 7: Temple University – CIS Dept. CIS616– Principles of Data Management

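Q-opt - example
[Figure: plan tree joining STUDENT and TAKES, annotated with the alternatives below]
  access methods: index; seq. scan
  join methods: hash join; merge join; nested loops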

Page 8: Temple University – CIS Dept. CIS616– Principles of Data Management

Overview - detailed
  Why q-opt?
  Equivalence of expressions
  Cost estimation
  Cost of indices
  Join strategies

Page 9: Temple University – CIS Dept. CIS616– Principles of Data Management

Equivalence of expressions
  … or syntactic q-opt
  In short: perform selections and projections early
  More details: see transformation rules in text
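The "selections early" heuristic can be illustrated on the STUDENT/TAKES query from the earlier example slide. The sketch below is only an illustration: the rows, the `c_id` spelling of the column, and the helper `join_on_ssn` are made up here, and counting compared tuple pairs is just a rough stand-in for cost.

```python
# Toy relations; the rows are invented for illustration.
STUDENT = [
    {"ssn": 1, "name": "Ann"},
    {"ssn": 2, "name": "Bob"},
    {"ssn": 3, "name": "Cy"},
]
TAKES = [
    {"ssn": 1, "c_id": "CIS616", "grade": "A"},
    {"ssn": 2, "c_id": "CIS601", "grade": "B"},
    {"ssn": 3, "c_id": "CIS616", "grade": "B"},
]


def join_on_ssn(left, right):
    """Nested-loop equi-join on ssn (hypothetical helper); returns merged rows."""
    return [{**l, **r} for l in left for r in right if l["ssn"] == r["ssn"]]


# Plan A: join first, then select on c_id, then project name.
joined = join_on_ssn(STUDENT, TAKES)
plan_a = sorted(row["name"] for row in joined if row["c_id"] == "CIS616")

# Plan B: select on TAKES first (selection pushed below the join),
# then join and project name.
takes_616 = [t for t in TAKES if t["c_id"] == "CIS616"]
plan_b = sorted(row["name"] for row in join_on_ssn(STUDENT, takes_616))

# Rough work proxy: number of tuple pairs each join has to compare.
pairs_a = len(STUDENT) * len(TAKES)
pairs_b = len(STUDENT) * len(takes_616)

print(plan_a, plan_b, pairs_a, pairs_b)
```

Both plans return the same names, but the second join sees fewer TAKES tuples, which is the point of the heuristic.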

Page 10: Temple University – CIS Dept. CIS616– Principles of Data Management

Equivalence of expressions
  Q: How to prove a transformation rule?
  A: use TRC (tuple relational calculus) to show that LHS = RHS, e.g.:
     $\sigma_P(R_1 \cup R_2) \overset{?}{=} \sigma_P(R_1) \cup \sigma_P(R_2)$

Page 11: Temple University – CIS Dept. CIS616– Principles of Data Management

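Equivalence of expressions
  $\sigma_P(R_1 \cup R_2) \overset{?}{=} \sigma_P(R_1) \cup \sigma_P(R_2)$
  $$
  \begin{aligned}
  t \in \mathrm{LHS} &\Leftrightarrow t \in (R_1 \cup R_2) \wedge P(t) \\
  &\Leftrightarrow (t \in R_1 \vee t \in R_2) \wedge P(t) \\
  &\Leftrightarrow (t \in R_1 \wedge P(t)) \vee (t \in R_2 \wedge P(t))
  \end{aligned}
  $$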

Page 12: Temple University – CIS Dept. CIS616– Principles of Data Management

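Equivalence of expressions
  $\sigma_P(R_1 \cup R_2) \overset{?}{=} \sigma_P(R_1) \cup \sigma_P(R_2)$
  $$
  \begin{aligned}
  \dots &\Leftrightarrow (t \in R_1 \wedge P(t)) \vee (t \in R_2 \wedge P(t)) \\
  &\Leftrightarrow (t \in \sigma_P(R_1)) \vee (t \in \sigma_P(R_2)) \\
  &\Leftrightarrow t \in \sigma_P(R_1) \cup \sigma_P(R_2) \\
  &\Leftrightarrow t \in \mathrm{RHS}
  \end{aligned}
  $$
  QED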

Page 13: Temple University – CIS Dept. CIS616– Principles of Data Management

Equivalence of expressions
  Selections
    perform them early
    break a complex predicate, and push:
      $\sigma_{p_1 \wedge p_2 \wedge \dots \wedge p_n}(R) = \sigma_{p_1}(\sigma_{p_2}(\dots(\sigma_{p_n}(R))\dots))$
    simplify a complex predicate: (‘X=Y and Y=3’) -> ‘X=3 and Y=3’

Page 14: Temple University – CIS Dept. CIS616– Principles of Data Management

Equivalence of expressions
  Projections
    perform them early (but carefully…)
      smaller tuples
      fewer tuples (if duplicates are eliminated)
    project out all attributes except the ones requested or required (e.g., joining attributes)

Page 15: Temple University – CIS Dept. CIS616– Principles of Data Management

Equivalence of expressions
  Joins
    commutative, associative:
      $R \bowtie S = S \bowtie R$
      $(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$
    Q: n-way join - how many diff. orderings? …
    Exhaustive enumeration too slow…
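The slide leaves the count of orderings open. As a rough illustration of why exhaustive enumeration is too slow, the sketch below uses two standard counting conventions, both of which are assumptions here and not taken from the slides: n! left-deep orders, and (2(n-1))!/(n-1)! join trees when every tree shape is allowed and R ⋈ S is counted separately from S ⋈ R.

```python
from math import factorial


def left_deep_orders(n):
    """Left-deep plans only: one plan per permutation of the n relations."""
    return factorial(n)


def all_join_trees(n):
    """All binary join trees over n relations, counting operand order.

    Equals (2(n-1))! / (n-1)!, i.e. Catalan(n-1) tree shapes times n! leaf orders.
    """
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 3, 5, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

Even at n = 10 the second count is in the tens of billions, which is why optimizers prune the search space instead of enumerating it.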

Page 16: Temple University – CIS Dept. CIS616– Principles of Data Management

Q-opt steps
  bring query into internal form (e.g., parse tree)
  … into ‘canonical form’ (syntactic q-opt)
  generate alt. plans
  estimate cost; pick best

Page 17: Temple University – CIS Dept. CIS616– Principles of Data Management


Cost estimation
  E.g., find ssn’s of students with an ‘A’ in CIS616 (using seq. scanning)
  How long will a query take?
    CPU (but: small cost; decreasing; tough to estimate)
    Disk (mainly, # block transfers)
  How many tuples will qualify?
  (what statistics do we need to keep?)

Page 18: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation
  Statistics: for each relation ‘r’ we keep
    nr: # tuples
    Sr: size of tuple in bytes
    V(A,r): number of distinct values of attr. ‘A’
    (recently, histograms, too)
[Figure: relation r drawn as tuples #1, #2, #3, …, #nr, each Sr bytes wide]

Page 19: Temple University – CIS Dept. CIS616– Principles of Data Management

Derivable statistics
  fr: blocking factor = max # records/block (= ??)
  br: # blocks (= ??)
  SC(A,r) = selection cardinality = avg # of records with A = given value (= ??)
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]

Page 20: Temple University – CIS Dept. CIS616– Principles of Data Management

Derivable statistics
  fr: blocking factor = max # records/block (= B/Sr; B: block size in bytes)
  br: # blocks (= nr / fr)

Page 21: Temple University – CIS Dept. CIS616– Principles of Data Management

Derivable statistics
  SC(A,r) = selection cardinality = avg # of records with A = given value (= nr / V(A,r))
  (assumes uniformity...)
  e.g.: 30,000 students, 10 colleges – how many students in CST?
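The three derived statistics can be computed directly from the stored ones. This is a minimal sketch: the block size (4096 bytes) and tuple size (200 bytes) are hypothetical values chosen for illustration, and rounding fr down and br up is an assumption (the slides write plain divisions).

```python
import math


def blocking_factor(block_bytes, tuple_bytes):
    """fr = B / Sr, rounded down (assumes records do not span blocks)."""
    return block_bytes // tuple_bytes


def num_blocks(n_tuples, f_r):
    """br = nr / fr, rounded up so a partly filled last block is counted."""
    return math.ceil(n_tuples / f_r)


def selection_cardinality(n_tuples, distinct_values):
    """SC(A,r) = nr / V(A,r); only valid under the uniformity assumption."""
    return n_tuples / distinct_values


# Hypothetical sizes: B = 4096 bytes, Sr = 200 bytes.
f_r = blocking_factor(4096, 200)
# The slide's example: 30,000 students spread over 10 colleges.
b_r = num_blocks(30_000, f_r)
sc = selection_cardinality(30_000, 10)

print(f_r, b_r, sc)
```

Under uniformity the estimate for any single college, CST included, is 30,000 / 10 = 3,000 students; a skewed distribution would make this estimate wrong, which is where histograms help.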

Page 22: Temple University – CIS Dept. CIS616– Principles of Data Management

Additional quantities we need:
  For index ‘i’:
    fi: average fanout - degree (~50-100)
    HTi: # levels of index ‘i’ (~2-3)
         ~ log(#entries)/log(fi)
    LBi: # blocks at leaf level
[Figure: index tree of height HTi]
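The height estimate log(#entries)/log(fi) has to be rounded up to a whole number of levels. The sketch below computes that rounded-up value with integer arithmetic to avoid floating-point edge cases; the function name `index_height` and the choice of a minimum of one level are assumptions made here.

```python
def index_height(num_entries, fanout):
    """Smallest number of levels h such that fanout**h >= num_entries.

    Matches ceil(log(num_entries) / log(fanout)), with a minimum of 1 level.
    Assumes fanout >= 2.
    """
    levels = 1
    reach = fanout  # entries addressable with the current number of levels
    while reach < num_entries:
        reach *= fanout
        levels += 1
    return levels


print(index_height(50, 20))          # 50 entries, fanout 20
print(index_height(1_000_000, 100))  # a larger, hypothetical index
```

With fanouts of 50-100, even a million entries need only about three levels, consistent with the "~2-3" on the slide.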

Page 23: Temple University – CIS Dept. CIS616– Principles of Data Management

Statistics
  Where do we store them?
  How often do we update them?

Page 24: Temple University – CIS Dept. CIS616– Principles of Data Management

Q-opt steps
  bring query into internal form (e.g., parse tree)
  … into ‘canonical form’ (syntactic q-opt)
  generate alt. plans
    selections; sorting; projections
    joins
  estimate cost; pick best

Page 25: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  Selections – e.g.,
    select *
    from TAKES
    where grade = 'A'
  Plans?
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]

Page 26: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  Plans?
    seq. scan
    binary search (if sorted & consecutive)
    index search, if an index exists
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]

Page 27: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  seq. scan – cost?
    br (worst case)
    br/2 (average, if we search for primary key)
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]

Page 28: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  binary search – cost?
    if sorted and consecutive:
      ~log(br) + SC(A,r)/fr - 1
      (SC(A,r)/fr = # blocks spanned by qualified tuples)
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]
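The sequential-scan and binary-search estimates can be written as two small functions. This is a sketch under assumptions: the logarithm is taken base 2 and both terms are rounded up to whole blocks, neither of which the slides spell out. The input numbers (br = 500, SC = 200, fr = 20) are borrowed from the account example that appears later in these slides.

```python
import math


def seq_scan_cost(b_r, key_lookup=False):
    """Block reads for a sequential scan.

    Worst case reads all b_r blocks; a lookup on a primary key can stop at the
    first match, so it reads b_r / 2 blocks on average.
    """
    return b_r / 2 if key_lookup else b_r


def binary_search_cost(b_r, sc, f_r):
    """~log(br) + SC/fr - 1 for a file sorted on the search attribute.

    log2 and the ceilings are assumptions; SC/fr is the number of blocks
    spanned by the qualifying tuples, one of which the search already read.
    """
    return math.ceil(math.log2(b_r)) + math.ceil(sc / f_r) - 1


print(seq_scan_cost(500))
print(seq_scan_cost(500, key_lookup=True))
print(binary_search_cost(500, 200, 20))
```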

Page 29: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  estimation of selection cardinalities SC(A,r): non-trivial…
[Figure: relation r stored in blocks #1, #2, …, #br; fr records per block, Sr bytes per record]

Page 30: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  method#3: index – cost?
    levels of index + blocks w/ qual. tuples
    case#1: primary key
    case#2: sec. key – clustering index
    case#3: sec. key – non-clust. index
[Figure: index over blocks #1, #2, …, #br of relation r; fr records per block, Sr bytes per record]

Page 31: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  method#3: index – cost?
    levels of index + blocks w/ qual. tuples
    case#1: primary key – cost: HTi + 1
[Figure: index of HTi levels over blocks #1, #2, …, #br of relation r]

Page 32: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  method#3: index – cost?
    levels of index + blocks w/ qual. tuples
    case#2: sec. key – clustering index, OR prim. index on non-key
      …retrieve multiple records
      cost: HTi + SC(A,r)/fr
[Figure: index of HTi levels over blocks #1, #2, …, #br of relation r]

Page 33: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  method#3: index – cost?
    levels of index + blocks w/ qual. tuples
    case#3: sec. key – non-clust. index
      cost: HTi + SC(A,r)
      (actually, pessimistic...)
[Figure: index over blocks #1, #2, …, #br of relation r]
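The three index cases differ only in how many data blocks follow the index descent. The sketch below restates the slide formulas as functions; the function names are made up here, rounding up to whole blocks is an assumption, and the sample inputs (HTi = 2, SC = 200, fr = 20) are the values used in the account example later in the slides.

```python
import math


def index_cost_primary_key(ht):
    """Case 1: descend HTi index levels, then read the one block holding the record."""
    return ht + 1


def index_cost_clustering(ht, sc, f_r):
    """Case 2: qualifying records are stored consecutively, so they span SC/fr blocks."""
    return ht + math.ceil(sc / f_r)


def index_cost_nonclustering(ht, sc):
    """Case 3: each qualifying record may sit in a different block (pessimistic)."""
    return ht + math.ceil(sc)


print(index_cost_primary_key(2))
print(index_cost_clustering(2, 200, 20))
print(index_cost_nonclustering(2, 200))
```

The non-clustering estimate is pessimistic because several qualifying records may happen to share a block, and a buffered block need not be read twice.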

Page 34: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation – arithmetic examples
  find accounts with branch-name = ‘Perryridge’
  account(branch-name, balance, ...)

Page 35: Temple University – CIS Dept. CIS616– Principles of Data Management

Arithm. examples – cont’d
  n-account = 10,000 tuples
  f-account = 20 tuples/block
  V(balance, account) = 500 distinct values
  V(branch-name, account) = 50 distinct values
  for branch-index: fanout fi = 20

Page 36: Temple University – CIS Dept. CIS616– Principles of Data Management

Arithm. examples
  Q1: cost of seq. scan?
  A1: 500 disk accesses (b-account = n-account / f-account = 10,000 / 20 = 500 blocks)
  Q2: assume a clustering index on branch-name – cost?

Page 37: Temple University – CIS Dept. CIS616– Principles of Data Management

Cost estimation + plan generation
  method#3: index – cost?
    levels of index + blocks w/ qual. tuples
    case#2: sec. key – clustering index
      cost: HTi + SC(A,r)/fr
[Figure: index of HTi levels over blocks #1, #2, …, #br of relation r]

Page 38: Temple University – CIS Dept. CIS616– Principles of Data Management

Arithm. examples
  A2: HTi + SC(branch-name, account)/f-account
    HTi: 50 values, with index fanout 20 -> HT = 2 levels (log(50)/log(20) = 1+, rounded up)
    SC(..) = # qualified records = nr/V(A,r) = 10,000/50 = 200 tuples
    SC/f: spanning 200/20 blocks = 10 blocks

Page 39: Temple University – CIS Dept. CIS616– Principles of Data Management

Arithm. examples
  A2 final answer: 2 + 10 = 12 block accesses (vs. 500 block accesses of seq. scan)
  footnote: in all fairness
    seq. disk accesses: ~2 msec or less
    random disk accesses: ~10 msec
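The whole Perryridge calculation, including the footnote about access times, fits in a few lines. The statistics are the ones given on the slides; treating every index-plan access as random and every scan access as sequential is a simplifying assumption made for this sketch.

```python
import math

# Statistics from the slides.
n_account = 10_000   # tuples
f_account = 20       # tuples per block
v_branch = 50        # distinct branch-name values
fanout = 20          # index fanout fi

# Q1: a sequential scan reads every block.
b_account = math.ceil(n_account / f_account)

# Q2: clustering index on branch-name.
ht = 1
reach = fanout
while reach < v_branch:          # smallest height with fanout**ht >= 50
    reach *= fanout
    ht += 1
sc = n_account // v_branch       # qualifying tuples, assuming uniformity
data_blocks = math.ceil(sc / f_account)
index_cost = ht + data_blocks

# Footnote: weight the accesses by rough per-access times (assumed split).
SEQ_MS, RANDOM_MS = 2, 10
scan_ms = b_account * SEQ_MS
index_ms = index_cost * RANDOM_MS

print(b_account, ht, sc, data_blocks, index_cost)
print(scan_ms, index_ms)
```

Counted in block accesses the index plan wins by roughly 40x; weighted by access time the gap shrinks to under 10x, which is the caveat the footnote raises.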

Page 40: Temple University – CIS Dept. CIS616– Principles of Data Management

Overview - detailed
  Motivation - Why q-opt?
  Equivalence of expressions
  Cost estimation
  Cost of indices
  Join strategies

Page 41: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  algorithm(s) for r JOIN s?
  nr and ns tuples, respectively
[Figure: relations r(A, ...) with nr tuples and s(A, ......) with ns tuples]

Page 42: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Algorithm #0: (naive) nested loop (SLOW!)
    for each tuple tr of r
      for each tuple ts of s
        print, if they match
[Figure: relations r(A, ...) with nr tuples and s(A, ......) with ns tuples]
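A direct transcription of the naive nested loop into runnable form might look like the sketch below. The sample tuples and the use of dictionaries keyed by the join attribute "A" are assumptions for illustration, and the function collects matching pairs instead of printing them.

```python
def nested_loop_join(r, s, key="A"):
    """Naive tuple-at-a-time nested-loop join on a shared attribute."""
    matches = []
    for tr in r:                      # outer loop: every tuple of r
        for ts in s:                  # inner loop: every tuple of s
            if tr[key] == ts[key]:    # "print, if they match"
                matches.append((tr, ts))
    return matches


# Made-up sample data.
r = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}, {"A": 2, "x": "r3"}]
s = [{"A": 2, "y": "s1"}, {"A": 3, "y": "s2"}]

result = nested_loop_join(r, s)
for tr, ts in result:
    print(tr, ts)
```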

Page 43: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Algorithm #0: why is it bad?
  how many disk accesses (‘br’ and ‘bs’ are the number of blocks for ‘r’ and ‘s’)?
    nr*bs + br
[Figure: relations r(A, ...) with nr tuples and s(A, ......) with ns tuples]

Page 44: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Algorithm #1: Blocked nested-loop join
    for each block of r: read it in
      for each block of s: read it in
        print matching tuples
  cost: br + br * bs
[Figure: r(A, ...) with nr records, br blocks; s(A, ......) with ns records, bs blocks]
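The cost formula br + br * bs can be checked by simulating the blocked join and counting block reads. In this sketch a "block" is just a Python list of tuples and `to_blocks` is a hypothetical helper; the data is made up so that r has 3 blocks and s has 2.

```python
def to_blocks(tuples, per_block):
    """Split a list of tuples into fixed-size blocks (hypothetical helper)."""
    return [tuples[i:i + per_block] for i in range(0, len(tuples), per_block)]


def block_nested_loop_join(r_blocks, s_blocks, key=0):
    """Blocked nested-loop join; returns (matching pairs, number of block reads)."""
    reads = 0
    matches = []
    for rb in r_blocks:
        reads += 1                    # read one block of r
        for sb in s_blocks:
            reads += 1                # read one block of s
            for tr in rb:             # join the two in-memory blocks
                for ts in sb:
                    if tr[key] == ts[key]:
                        matches.append((tr, ts))
    return matches, reads


# r: keys 0..5 in 3 blocks; s: keys 0, 2, 4, 6 in 2 blocks.
r_blocks = to_blocks([(k, f"r{k}") for k in range(6)], 2)
s_blocks = to_blocks([(k, f"s{k}") for k in range(0, 8, 2)], 2)

matches, reads = block_nested_loop_join(r_blocks, s_blocks)
print(len(matches), reads)
```

The counter comes out to 3 + 3 * 2 = 9 reads, matching br + br * bs for br = 3 and bs = 2.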

Page 45: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Arithmetic example:
    nr = 10,000 tuples, br = 1,000 blocks
    ns = 1,000 tuples, bs = 200 blocks
    alg#0: 2,001,000 disk accesses (= nr*bs + br)
    alg#1: 201,000 disk accesses (= br + br*bs)
[Figure: r(A, ...) with 10,000 records, 1,000 blocks; s(A, ......) with 1,000 records, 200 blocks]
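The two totals on this slide follow directly from the cost formulas of the previous slides. A short check, using only the numbers given above (the variable names are chosen here):

```python
# Statistics from the slide.
n_r, b_r = 10_000, 1_000
n_s, b_s = 1_000, 200

# Algorithm #0 (naive nested loop): one pass over s per tuple of r, plus reading r.
naive_cost = n_r * b_s + b_r

# Algorithm #1 (blocked nested loop): one pass over s per block of r, plus reading r.
blocked_cost = b_r + b_r * b_s

print(naive_cost, blocked_cost, round(naive_cost / blocked_cost, 2))
```

The blocked variant is about ten times cheaper here because r holds ten tuples per block, so s is re-read once per block of r and not once per tuple.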

Page 46: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Observation1: Algo#1 is asymmetric:
    cost: br + br * bs
    reverse roles: cost = bs + bs * br
  Best choice? smaller relation in outer loop
[Figure: r(A, ...) with nr records, br blocks; s(A, ......) with ns records, bs blocks]
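Since the product term is the same either way, only the first term depends on which relation is outer. The sketch below compares both orders using the block counts from the arithmetic example (br = 1,000, bs = 200); the function name is made up for illustration.

```python
def blocked_nl_cost(b_outer, b_inner):
    """Blocked nested-loop cost: read the outer once, scan the inner once per outer block."""
    return b_outer + b_outer * b_inner


b_r, b_s = 1_000, 200

cost_r_outer = blocked_nl_cost(b_r, b_s)
cost_s_outer = blocked_nl_cost(b_s, b_r)
best = "s" if cost_s_outer < cost_r_outer else "r"

print(cost_r_outer, cost_s_outer, best)
```

The saving equals the difference in block counts (800 accesses here), which is small next to the shared br * bs term.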

Page 47: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins
  Other algorithm(s) for r JOIN s?
  nr and ns tuples, respectively
[Figure: relations r(A, ...) with nr tuples and s(A, ......) with ns tuples]

Page 48: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins - other algo’s
  sort-merge
    sort ‘r’; sort ‘s’; merge sorted versions
    (good, if one or both are already sorted)
[Figure: relations r(A, ...) with nr tuples and s(A, ......) with ns tuples]
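An in-memory sketch of sort-merge join is given below. It assumes tuples whose first field is the join attribute and handles duplicate keys by pairing the runs of equal keys on both sides; a disk-based version would use external sorting, which is not shown.

```python
def sort_merge_join(r, s, key=0):
    """Sort both inputs on the join attribute, then merge them in one pass."""
    r = sorted(r, key=lambda t: t[key])
    s = sorted(s, key=lambda t: t[key])
    matches = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][key] < s[j][key]:
            i += 1
        elif r[i][key] > s[j][key]:
            j += 1
        else:
            value = r[i][key]
            # Find the run of tuples with this key value on each side.
            i_end = i
            while i_end < len(r) and r[i_end][key] == value:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][key] == value:
                j_end += 1
            # Every tuple in the r-run joins with every tuple in the s-run.
            for tr in r[i:i_end]:
                for ts in s[j:j_end]:
                    matches.append((tr, ts))
            i, j = i_end, j_end
    return matches


# Made-up sample data with a duplicated key (2) on both sides.
r = [(3, "r3"), (1, "r1"), (2, "r2a"), (2, "r2b")]
s = [(2, "s2a"), (4, "s4"), (2, "s2b"), (3, "s3")]

result = sort_merge_join(r, s)
print(result)
```

If an input is already sorted on the join attribute, its sort step can be skipped, which is the case the slide calls out as favorable.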

Page 49: Temple University – CIS Dept. CIS616– Principles of Data Management

2-way joins - other algo’s
  hash join:
    hash ‘r’ into (0, 1, ..., ‘max’) buckets
    hash ‘s’ into buckets (same hash function)
    join each pair of matching buckets
[Figure: r(A, ...) and s(A, ......) each partitioned into buckets 0, 1, …, max]
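The three hash-join steps can be sketched in memory as follows. The bucket count, the use of Python's built-in `hash` with a modulus, and the integer-keyed sample tuples are all assumptions for illustration; a real system would partition to disk and pick the number of buckets from the available buffer space.

```python
def hash_join(r, s, key=0, num_buckets=4):
    """Partition both inputs with the same hash function, then join bucket by bucket."""
    r_buckets = [[] for _ in range(num_buckets)]
    s_buckets = [[] for _ in range(num_buckets)]
    for t in r:                                   # step 1: hash r into buckets
        r_buckets[hash(t[key]) % num_buckets].append(t)
    for t in s:                                   # step 2: hash s, same function
        s_buckets[hash(t[key]) % num_buckets].append(t)

    matches = []
    for rb, sb in zip(r_buckets, s_buckets):      # step 3: join matching bucket pairs
        for tr in rb:
            for ts in sb:
                # Different keys can share a bucket, so compare the actual values.
                if tr[key] == ts[key]:
                    matches.append((tr, ts))
    return matches


# Made-up sample data; keys 1, 5 and 9 land in the same bucket when num_buckets = 4.
r = [(1, "r1"), (2, "r2"), (5, "r5")]
s = [(5, "s5"), (2, "s2"), (9, "s9")]

result = hash_join(r, s)
print(sorted(result))
```

Tuples with equal join values always hash to the same bucket number, so no match can be missed by comparing only matching bucket pairs.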

Page 50: Temple University – CIS Dept. CIS616– Principles of Data Management

Structure of query optimizers:
  More heuristics by Oracle, Sybase and Starburst (-> DB2): in book
  In general: q-opt is very important for large databases.
  (‘explain select <sql-statement>’ gives plan)

Page 51: Temple University – CIS Dept. CIS616– Principles of Data Management

Conclusions -- Q-opt steps
  bring query into internal form (e.g., parse tree)
  … into ‘canonical form’ (syntactic q-opt)
  generate alt. plans
    selections (simple; complex predicates)
    sorting; projections
    joins
  estimate cost; pick best

