Date post: | 03-Jan-2016 |
Category: |
Documents |
Upload: | lester-hahn |
View: | 56 times |
Download: | 4 times |
File Processing : Query Processing
2014, Spring
Pusan National University
Ki-Joune Li
STEMPNU
Basic Concepts of Query
Query Retrieve records satisfying predicates
Types of Query Operators Aggregate Query Sorting
STEMPNU
Relational Operators : Select
Selection (condition) Retrieve records satisfying predicates Example
Find Student where Student.Score > 3.5 score>3.5(Student)
Index or Hash
Select
Predicate
STEMPNU
Relational Operators : Project
Project (attributes) Extract interesting attributes Example
Find Student.name where score > 3.5
name(acore>3.5(Student))
Full Scan
Interesting attributes to get
Extract
STEMPNU
Cartesian Product
Cartesian Product () Two Tables : R1 R2
Produce all cross products
Join ( )
r11
r12
…
r1m
R1
r21
r22
…
r2n
R2
=
r11
r11
…
r11
r21
r22
…
r2n
r12
r12
…
r12
r21
r22
…
r2n
r1m r21
r22
…
r2n
…
r1m
r1m
…
…
STEMPNU
Join
Join ( ) Select combined records of cartesian product with same value of
a common attribute (Natural Join) Example
Student (StudentName, AdvisorProfessorID, Department, Score)
Professor(ProfessorName, ProfessorID, Department)
Student AdivsorProfessorID=ProfessorID Professor
= AdivsorProfessorID=ProfessorID(Student Professor)
Double Scan : Expensive Operation
STEMPNU
Relational Algebra
Relational Algebra Operand : Table (Relation) Operator : Relational Operator (, , , etc) Example: SQL and relational algebra
find Student.Name from Student, Professorwhere Student.Score > 3.5 and Student.AdvisorProfessorID=Professor.ID and Professor.Department=‘CSE’
student.name(score>3.5(Student) Department=‘CSE’ (Professor) )
Relational Algebra Specifies the sequence of operations
STEMPNU
Query Processing Mechanism
Query Processing Steps
1. Parsing and translation
2. Optimization
3. Evaluation
STEMPNU
Parsing and Translation
Parsing Query Statement (e.g. in SQL) Translation into relational algebra Equivalent Expression
For a same query statement
several relation algebraic expressions are possible Example
name(balance 2500(account )) name (balance 2500(name, balance (account )))
Different execution schedules
Query Execution Plan (QEP) Determined by relational algebra Several QEPs may be produced by Parsing and Translation
STEMPNU
Query Optimization
Choose ONE QEP among QEPs based on Execution Cost of each QEP, where cost means execution time
How to find cost of each QEP ? Real Execution
Exact but Not Feasible Cost Estimation
Types of Operations Number of Records Selectivity Distribution of data
STEMPNU
Cost Model : Basic Concepts
Cost Model : Number of Block Accesses Cost
C = Cindex + Cdata
where Cindex : Cost for Index Access
Cdata : Cost for Data Block Retrieval
Cindex vs. Cdata ? Cindex : depends on index
Cdata depends on selectivity Random Access or Sequential Access
Selectivity Number (or Ratio) of Objects Selected by Query
STEMPNU
Cost Model : Type of Operations
Cost model for each type of operations Select Project Join Aggregate Query
Query Processing Method for each type of operations
Index/Hash or Not
STEMPNU
Cost Model : Number of Records
Number of Records Nrecord Nblocks
Number of Scans Single Scan
O(N) : Linear Scan O(logN ) : Index
Multiple Scans O(NM ) : Multiple Linear Scans O(N logM ) : Multiple Scans with Index
STEMPNU
Selectivity
Selectivity Affects on Cdata
Random Access Scattered on several blocks Nblock Nselected
Sequential Access Contiguously stored on blocks Nblock = Nselected / Bf
STEMPNU
Selectivity Estimation
Selectivity Estimation Depends on Data Distribution Example
Q1 : Find students where 60 < weight < 70 Q2 : Find students where 80 < weight < 90
How to find the distribution Parametric Method
e.g. Gaussian Distribution No a priori knowledge
Non-Parametric Method e.g. Histogram Smoothing is necessary
Wavelet, Discrete Cosine
30 40 50 60 70 80 90 100
Frequency
STEMPNU
Select : Linear Search
Algorithm : linear search Scan each file block and test all records to see whether they
satisfy the selection condition.
Cost estimate (number of disk blocks scanned) = br
br denotes number of blocks containing records from relation r
If selection is on a key attribute (sorted), cost = (br /2) stop on finding record
Linear search can be applied regardless of selection condition or ordering of records in the file, or availability of indices
STEMPNU
Select : Range Search
Algorithm : primary index, comparison Relation is sorted on A For A V (r)
Step 1: use index to find first tuple v and Step 2: scan relation sequentially
For AV (r) just scan relation sequentially till first tuple > v; do not use index
Algorithm : secondary index, comparison For A V (r)
Step 1: use index to find first index entry v and Step 2: scan index sequentially to find pointers to records.
For AV (r) scan leaf nodes of index finding pointers to records, till first entry > v
STEMPNU
Select : Range Search
Comparison between Searching with Index and Linear Search
Secondary Index retrieval of records that are pointed to requires an I/O for each record
Linear file scan may be cheaper if records are scattered on many blocks clustering is important for this reason
STEMPNU
Select : Complex Query
Conjunction : 1 2 . . . n(r) Algorithm : selection using one index
Step 1: Select a condition of i (i (r) ) Step 2: Test other conditions on tuple after fetching it into memory
buffer. Algorithm : selection using multiple-key index
Use appropriate multiple-attribute index if available. Algorithm : selection by intersection of identifiers
Step 1: Requires indices with record pointers. Step 2: Intersection of all the obtained sets of record pointers. Step 3: Then fetch records from file
Disjunction : 1 2 . . . n (r) Algorithm : Disjunctive selection by union of identifiers
STEMPNU
Join Operation
Several different algorithms to implement joins Nested-loop join Block nested-loop join Indexed nested-loop join Merge-join Hash-join
Choice based on cost estimate Examples use the following information
Number of records of (S)customer: 10,000 (R)depositor: 5000 Number of blocks of customer: 400 depositor: 100 Blocking factors of customer : 250 depositor: 50
STEMPNU
Nested-Loop Join
Algorithm NLJ the theta join r sFor each tuple tr in r do begin
For each tuple ts in s do begin
test pair (tr,ts) to see if they satisfy the join condition if they do, add tr • ts to the result.
EndEnd
r : outer relation, s : inner relation. No indices, any kind of join condition. Expensive
STEMPNU
s1
s2
…
r1
r2
…
r50
s250
s251
…
s500
r51
r52
…
r100
s9751 r4951
r4952
…
r5000
…
s9752
s10000
…
…
Example: Nested-Loop Join
B1
B2
B400
B1
B2
B100
STEMPNU
Nested-Loop Join : Performance
Worst case the estimated cost is nr bs + br disk accesses, if not enough
memory only to hold one block of each relation,
Example 5000 400 + 100 = 2,000,100 disk accesses with depositor as
outer relation, and 1000 100 + 400 = 1,000,400 disk accesses with customer as the
outer relation.
If the smaller relation fits entirely in memory, use that as the inner relation. Reduces cost to br + bs disk accesses. If smaller relation (depositor) fits entirely in memory,
cost estimate will be 500 disk accesses.
STEMPNU
Block Nested-Loop Join
Algoritm BNLJ
For each block Br of r do
Get Block Br For each block Bs of s do
Get Block Bs For each tuple tr in Br do
For each tuple ts in Bs do Check if (tr, ts) satisfy the join condition if they do, add tr
• ts to the result.End
EndEnd
End
No disk access required
No disk access requiredDisk access happens here
STEMPNU
s1
s2
…
r1
r2
…
r50
s250
s251
…
s500
r51
r52
…
r100
s9751 r4951
r4952
…
r5000
…
s9752
s10000
…
…
Example: Block-Oriented Nested-Loop Join
B1
B2
B400
B1
B2
B100
STEMPNU
Block Nested-Loop Join : Performance
Worst case Estimate: br bs + br block accesses. Each block in the inner relation s is read once for each block
in the outer relation (instead of once for each tuple in the outer relation)
Improvements : If M blocks can be buffered use (M-2) disk blocks as blocking unit for outer relations, use remaining two blocks to buffer inner relation and output
Then the cost becomes br / (M-2) bs + br
STEMPNU
Indexed Nested-Loop Join
Index lookups can replace file scans if join is an equi-join or natural join and an index is available on the inner relation’s join attribute
Can construct an index just to compute a join.
Algorithm INLJFor each block Br of r do
Get Block Br For each tuple tr in Br do
Search Index (IDXr , tr.key)
if found, add tr • ts to the result.
End
End
STEMPNU
Indexed Nested-Loop Join : Performance
Worst case buffer has space for only one page of r,
Cost of the join: br + nr c Where c is the cost of traversing index and fetching matching tuple Number of matching tuples may be greater than one.
If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation
STEMPNU
Example of Nested-Loop Join Costs
Assume depositor customer, with depositor as the outer relation. customer have a primary B+-tree index on the join attribute
customer-name, which contains 20 entries in each index node. customer has 10,000 tuples,
the height of the tree is 4, and one more access is needed to find the actual data
Depositor has 5000 tuples
Cost of block nested loops join 400*100 + 100 = 40,100 disk accesses assuming worst case memory
Cost of indexed nested loops join 100 + 5000 * 5 = 25,100 disk accesses.
STEMPNU
Hash-Join
Applicable for equi-joins and natural joins. A hash function h is used to partition tuples of both relations
h : A→ { 0, 1, ..., n } r0, r1, . . ., rn : partitions of r tuples
s0, s1. . ., sn : partitions of s tuples
r tuples in ri need only to be compared with s tuples in si .