Date posted: 18-Dec-2015
SPRING 2004 CENG 352 2
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Basic Steps in Query Processing (Cont.)
• Parsing and translation
– Translate the query into its internal form, which is then translated into relational algebra.
– The parser checks syntax and verifies relations.
• Evaluation
– The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Basic Steps: Optimization
• A relational algebra expression may have many equivalent expressions.
– E.g., σ_{balance<2500}(Π_{balance}(account)) is equivalent to Π_{balance}(σ_{balance<2500}(account)).
• Each relational algebra operation can be evaluated using one of several different algorithms.
– Correspondingly, a relational-algebra expression can be evaluated in many ways.
• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
– E.g., can use an index on balance to find accounts with balance < 2500,
– or can perform a complete relation scan and discard accounts with balance ≥ 2500.
Basic Steps: Optimization (Cont.)
• Query optimization: amongst all equivalent evaluation plans, choose the one with the lowest cost.
– Cost is estimated using statistical information from the database catalog, e.g., number of tuples in each relation, size of tuples, etc.
• We first study
– How to measure query costs
– Algorithms for evaluating relational algebra operations
– How to combine algorithms for individual operations in order to evaluate a complete expression
• Then
– We study how to optimize queries, that is, how to find an evaluation plan with the lowest estimated cost
Measures of Query Cost
• Cost is generally measured as total elapsed time for answering a query.
– Many factors contribute to time cost: disk accesses, CPU, or even network communication.
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account:
– Number of seeks * average-seek-cost
– Number of blocks read * average-block-transfer-cost
– Number of blocks written * average-block-transfer-cost
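The three terms above can be combined into a single estimate. The following is a minimal sketch; the function name, parameter names, and the timings (in microseconds) are illustrative assumptions, not values from the slides.

```python
def disk_cost(seeks, blocks_read, blocks_written, avg_seek_cost, avg_transfer_cost):
    """Estimated disk time: seek time plus transfer time for reads and writes."""
    return (seeks * avg_seek_cost
            + blocks_read * avg_transfer_cost
            + blocks_written * avg_transfer_cost)

# Hypothetical device: 4000 microseconds per seek, 100 microseconds per block.
cost_us = disk_cost(seeks=10, blocks_read=1000, blocks_written=250,
                    avg_seek_cost=4000, avg_transfer_cost=100)
print(cost_us)
```

Reads and writes share one transfer cost here because the slide uses the same average-block-transfer-cost for both.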
Some Common Techniques
• Algorithms for evaluating relational operators use some simple ideas extensively:
– Indexing: Can use WHERE conditions to retrieve a small set of tuples (selections, joins).
– Iteration: Sometimes, faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
– Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
Statistics and Catalogs
• Need information about the relations and indexes involved. Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) for each relation.
– # distinct key values (NKeys) and NPages for each index.
– Index height, low/high key values (Low/High) for each tree index.
• Catalogs are updated periodically.
– Updating whenever data changes is too expensive; there is lots of approximation anyway, so slight inconsistency is ok.
• More detailed information (e.g., histograms of the values in some field) is sometimes stored.
Relational Operations
• We will consider how to implement:
– Selection (σ): Selects a subset of rows from relation.
– Projection (π): Deletes unwanted columns from relation.
– Join (⋈): Allows us to combine two relations.
– Set-difference (−): Tuples in reln. 1, but not in reln. 2.
– Union (∪): Tuples in reln. 1 or in reln. 2.
– Aggregation (SUM, MIN, etc.) and GROUP BY
• Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.
Selection Operation
• File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
• Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
– Cost estimate (number of disk blocks scanned) = b_r, where b_r denotes the number of blocks containing records from relation r.
– If the selection is on a key attribute, average cost = b_r / 2, since the scan can stop on finding the record.
– Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices.
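As a rough illustration of A1, the sketch below scans a toy file block by block and counts block reads. The in-memory list of blocks, the record layout, and the `key_attribute` flag are assumptions made for the example only.

```python
def linear_search(blocks, predicate, key_attribute=False):
    """Scan blocks in order; return (matching records, number of blocks read)."""
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_attribute:
                    # At most one record can match a key value, so stop early.
                    return matches, blocks_read
    return matches, blocks_read

# Toy file: 4 blocks of 2 records each; records are (id, name) pairs.
file_blocks = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")],
               [(5, "e"), (6, "f")], [(7, "g"), (8, "h")]]

print(linear_search(file_blocks, lambda r: r[0] == 3, key_attribute=True))
print(linear_search(file_blocks, lambda r: r[1] > "f"))
```

The first call stops after two blocks because the selection is on a key; the second must read all four blocks.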
Selection Operation (Cont.)
• A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
– Assume that the blocks of a relation are stored contiguously.
– Cost estimate (number of disk blocks to be scanned): ⌈log2(b_r)⌉, the cost of locating the first tuple by a binary search on the blocks,
– plus the number of blocks containing records that satisfy the selection condition.
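The A2 estimate can be written down directly. This is a sketch of the formula as stated on the slide; the function and parameter names are assumptions.

```python
import math

def binary_search_cost(b_r, matching_blocks):
    """Blocks read: ceil(log2(b_r)) to locate the first tuple, plus matching blocks."""
    return math.ceil(math.log2(b_r)) + matching_blocks

# A 1000-block relation where the qualifying records span 2 blocks.
print(binary_search_cost(1000, 2))
```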
Selections Using Indices
• Index scan – search algorithms that use an index
– The selection condition must be on the search-key of the index.
– HT_i denotes the height of index i.
• A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition.
– Cost = HT_i + 1
• A4 (primary index on nonkey, equality). Retrieve multiple records.
– Records will be on consecutive blocks.
– Cost = HT_i + number of blocks containing retrieved records
• A5 (equality on search-key of secondary index).
– Retrieve a single record if the search-key is a candidate key: Cost = HT_i + 1
– Retrieve multiple records if the search-key is not a candidate key: Cost = HT_i + number of records retrieved
– Can be very expensive! Each record may be on a different block, so one block access is needed for each retrieved record.
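The cost formulas for A3 through A5 differ only in what is added to the index height. A minimal sketch follows, with hypothetical parameter names and made-up sizes.

```python
def index_equality_cost(height, primary, candidate_key, n_records=1, n_blocks=1):
    """Estimated block accesses for an equality selection through an index.

    height: index height (HT_i); n_records / n_blocks: matching records and
    the blocks that hold them (only used when the search key is not a key).
    """
    if candidate_key:
        return height + 1          # A3, or A5 on a candidate key
    if primary:
        return height + n_blocks   # A4: matches sit on consecutive blocks
    return height + n_records      # A5 on a nonkey: one access per record

# 200 matching records stored on 5 consecutive blocks, index height 3.
print(index_equality_cost(3, primary=True, candidate_key=False, n_records=200, n_blocks=5))
print(index_equality_cost(3, primary=False, candidate_key=False, n_records=200, n_blocks=5))
```

The gap between the two printed costs is the reason the slide warns that a secondary index on a nonkey attribute can be very expensive.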
Selections Involving Comparisons
• Can implement selections of the form σ_{A≤V}(r) or σ_{A≥V}(r) by using
– a linear file scan or binary search,
– or by using indices in the following ways:
• A6 (primary index, comparison). (Relation is sorted on A.)
– For σ_{A≥V}(r), use the index to find the first tuple ≥ v and scan the relation sequentially from there.
– For σ_{A≤V}(r), just scan the relation sequentially till the first tuple > v; do not use the index.
• A7 (secondary index, comparison).
– For σ_{A≥V}(r), use the index to find the first index entry ≥ v and scan the index sequentially from there, to find pointers to records.
– For σ_{A≤V}(r), just scan the leaf pages of the index, finding pointers to records, till the first entry > v.
– In either case, retrieve the records that are pointed to. This requires an I/O for each record, so a linear file scan may be cheaper if many records are to be fetched!
Implementation of Complex Selections
• Conjunction: σ_{θ1 ∧ θ2 ∧ ... ∧ θn}(r)
• A8 (conjunctive selection using one index).
– Select a combination of θi and algorithms A1 through A7 that results in the least cost for σ_{θi}(r).
– Test other conditions on the tuple after fetching it into the memory buffer.
• A9 (conjunctive selection using multiple-key index).
– Use an appropriate composite (multiple-key) index if available.
• A10 (conjunctive selection by intersection of identifiers).
– Requires indices with record pointers.
– Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
– Then fetch records from the file.
– If some conditions do not have appropriate indices, apply the test in memory.
Algorithms for Complex Selections
• Disjunction: σ_{θ1 ∨ θ2 ∨ ... ∨ θn}(r)
• A11 (disjunctive selection by union of identifiers).
– Applicable if all conditions have available indices; otherwise use linear scan.
– Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
– Then fetch records from the file.
• Negation: σ_{¬θ}(r)
– Use linear scan on file.
– If very few records satisfy ¬θ, and an index is applicable to θ, find satisfying records using the index and fetch them from the file.
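A11 can be sketched with Python sets standing in for sets of record pointers. The dictionary-based "indexes" and the sample data below are assumptions for illustration; a real system would probe B+ tree or hash indexes instead.

```python
def disjunctive_rids(conditions, indexes):
    """Union of rid sets for equality conditions joined by OR.

    conditions: list of (attribute, value) pairs.
    indexes: dict mapping attribute -> {value: set of rids}.
    Returns None when some condition has no index, signalling that the
    caller should fall back to a linear scan.
    """
    if any(attr not in indexes for attr, _ in conditions):
        return None
    rids = set()
    for attr, value in conditions:
        rids |= indexes[attr].get(value, set())
    return rids

# Toy indexes on bid and sid (made-up data).
toy_indexes = {"bid": {5: {1, 3, 4}}, "sid": {3: {1, 2}}}

print(disjunctive_rids([("bid", 5), ("sid", 3)], toy_indexes))
print(disjunctive_rids([("bid", 5), ("rname", "Paul")], toy_indexes))
```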
Schema for Examples
• Similar to old schema; rname added for variations.
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Simple Selections
SELECT *
FROM Reserves R
WHERE R.rname = 'Joe'
• With no index, unsorted: Must essentially scan the whole relation; cost is 1000 I/Os (#pages in R).
• With no index, sorted data: Utilize the sort order on rname by doing a binary search to locate the first Joe. Cost is log2 1000 ≈ 10 I/Os.
• With a B+ tree index on selection attribute: Use index to find qualifying data entries, then retrieve corresponding data records. Cost of finding the starting page is 2 or 3 I/Os; for a clustered index add one more I/O; for an unclustered index add one page per qualifying tuple.
• Hash index: 1 or 2 I/Os to retrieve the index pages. If there are 100 reservations by Joe, then an additional 1-100 disk accesses, depending on how these records are distributed.
Using an Index for Selections
SELECT *
FROM Reserves R
WHERE R.rname < 'C%'
• Cost depends on #qualifying tuples, and clustering.
• Assume we estimate roughly 10% of Reserves tuples will be in result (= 10,000 tuples, or 100 pages).
– With a clustered index: cost is 100 I/Os + 1 or 2 disk accesses for index.
– With an unclustered index: cost could be as high as 10,000 I/Os in the worst case (it might be cheaper to simply scan the entire relation).
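The numbers on this slide follow from the Reserves statistics (1000 pages, 100 tuples per page). A short sketch of the arithmetic, with variable names chosen for the example:

```python
n_pages = 1000            # Reserves pages
tuples_per_page = 100     # Reserves tuples per page
selectivity_percent = 10  # estimated share of tuples that qualify

n_tuples = n_pages * tuples_per_page
qualifying_tuples = n_tuples * selectivity_percent // 100
qualifying_pages = qualifying_tuples // tuples_per_page

clustered_cost = qualifying_pages     # plus 1 or 2 I/Os for the index itself
unclustered_worst = qualifying_tuples # one I/O per qualifying tuple
full_scan_cost = n_pages

print(qualifying_tuples, qualifying_pages)
print(clustered_cost, unclustered_worst, full_scan_cost)
```

With these estimates the full scan (1000 I/Os) beats the worst case for the unclustered index (10,000 I/Os) by a factor of ten.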
A Note on Complex Selections
• Selection conditions are first converted to conjunctive normal form (CNF). For example, the condition
(day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3
is converted to
(day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)
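One way to convince yourself that the two forms are equivalent is to compare them on every truth assignment. In the sketch below each term is replaced by a boolean variable: d for day<8/9/94, r for rname='Paul', b for bid=5, s for sid=3. The function names are assumptions.

```python
from itertools import product

def original_form(d, r, b, s):
    # (day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3
    return (d and r) or b or s

def cnf_form(d, r, b, s):
    # (day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)
    return (d or b or s) and (r or b or s)

# Check all 16 combinations of truth values for the four terms.
equivalent = all(original_form(*v) == cnf_form(*v)
                 for v in product([False, True], repeat=4))
print(equivalent)
```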
Two Approaches to General Selections
• Consider the selection condition:
day<8/9/94 AND bid=5 AND sid=3
• First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don't match the index:
1. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple.
2. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.
– Terms that match the index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.
Intersection of Rids
• Second approach (if we have 2 or more matching indexes):
– Get sets of rids of data records using each matching index.
– Then intersect these sets of rids.
– Retrieve the records and apply any remaining terms.
• For the given example condition:
– If we have a B+ tree index on day and an index on sid, we can retrieve rids of records satisfying day<8/9/94 using the first, rids of records satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
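The rid-intersection plan for this example can be sketched with Python sets. The toy Reserves records and the dictionary-based indexes are made up for illustration, and 8/9/94 is read as August 9, 1994, which is an assumption about the date format.

```python
from datetime import date

# Toy Reserves records keyed by rid: (sid, bid, day). All values are made up.
reserves = {
    1: (3, 5, date(1994, 7, 1)),
    2: (3, 7, date(1994, 8, 1)),
    3: (4, 5, date(1994, 6, 15)),
    4: (3, 5, date(1994, 9, 30)),
}

# Stand-ins for the two indexes: each maps a search-key value to a set of rids.
day_index, sid_index = {}, {}
for rid, (sid, bid, day) in reserves.items():
    day_index.setdefault(day, set()).add(rid)
    sid_index.setdefault(sid, set()).add(rid)

cutoff = date(1994, 8, 9)
rids_day = {rid for day, rids in day_index.items() if day < cutoff for rid in rids}
rids_sid = sid_index.get(3, set())

# Intersect the rid sets, then fetch only those records and check bid=5.
candidates = rids_day & rids_sid
result = sorted(rid for rid in candidates if reserves[rid][1] == 5)
print(candidates, result)
```

Only the records in `candidates` are fetched from the file, which is where the savings come from: bid=5 is checked on two records instead of three or four.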
The Projection Operation
• To implement projection we have to do the following:
1. Remove unwanted attributes.
2. Eliminate any duplicate tuples produced.
• The expensive part is removing duplicates.
– SQL systems don't remove duplicates unless the keyword DISTINCT is specified in a query.
• There are two basic algorithms:
1. Sorting Approach.
2. Hashing Approach.
Approach based on sorting
• Modify Pass 1 of external sort to eliminate unwanted fields. If B buffer pages are available, runs of about 2B pages can be produced, but tuples in runs are smaller than input tuples. (Size ratio depends on # and size of fields that are dropped.)
• Modify merging passes to eliminate duplicates. Thus, number of result tuples smaller than input. (Difference depends on # of duplicates.)
Example
SELECT DISTINCT R.sid, R.bid
FROM Reserves R
Cost:
• In Pass 1, read the original relation (1000 pages) and write out the same number of smaller tuples.
– Assume that each smaller tuple is 10 bytes long.
– Thus 250 pages are written out, so Pass 1 costs 1000 + 250 = 1250 I/Os.
• In merging passes, fewer tuples are written out in each pass.
– Assuming we have 20 buffer pages, the temporary relation can be sorted in 2 passes.
– In the first pass, the 250 pages are written out as 7 runs, each about 40 pages long.
– In the second pass, we read the runs at a cost of 250 I/Os and merge them.
• The total cost is 1000 + 250 + 250 = 1500 I/Os.
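The cost figures for this example can be reproduced from the Reserves statistics. The sketch below assumes, as the slides do, that Pass 1 produces runs of about 2B pages and that the cost of writing the final result is not counted; variable names are chosen for the example.

```python
import math

input_pages = 1000      # Reserves pages
tuple_bytes = 40        # original tuple size
projected_bytes = 10    # size after keeping only sid and bid
buffer_pages = 20       # B

projected_pages = input_pages * projected_bytes // tuple_bytes
run_length = 2 * buffer_pages
n_runs = math.ceil(projected_pages / run_length)

pass1_cost = input_pages + projected_pages  # read input, write projected runs
merge_cost = projected_pages                # read the runs once to merge them
total_cost = pass1_cost + merge_cost

# A single merge pass suffices if every run gets its own input buffer.
one_merge_pass = n_runs <= buffer_pages - 1
print(projected_pages, n_runs, total_cost, one_merge_pass)
```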
Projection Based on Hashing
• Partitioning phase: Read R using one input buffer. For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers.
– Result is B-1 partitions (of tuples with no unwanted fields). 2 tuples from different partitions are guaranteed to be distinct.
• Duplicate elimination phase: For each partition, read it and build an in-memory hash table, using hash fn h2 (≠ h1) on all fields, while discarding duplicates.
– If a partition does not fit in memory, can apply the hash-based projection algorithm recursively to this partition.
• Cost: For partitioning, read R, write out each tuple, but with fewer fields. This is read in the next phase.
– In our projection example this cost is 1000 + 2 * 250 = 1500 I/Os.
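The two phases can be sketched in memory as follows. This is a simplified illustration, not a disk-based implementation: Python's built-in `hash` modulo the partition count stands in for h1, a Python `set` stands in for the in-memory hash table built with h2, and `n_partitions` plays the role of B-1. All names and sample rows are assumptions.

```python
def hash_project(tuples, wanted, n_partitions):
    """Project each tuple onto the `wanted` positions and drop duplicates."""
    # Partitioning phase: drop unwanted fields, route by hash (stand-in for h1).
    partitions = [[] for _ in range(n_partitions)]
    for t in tuples:
        projected = tuple(t[i] for i in wanted)
        partitions[hash(projected) % n_partitions].append(projected)

    # Duplicate elimination phase: one in-memory hash table per partition.
    result = []
    for partition in partitions:
        seen = set()  # stand-in for the hash table built with h2
        for projected in partition:
            if projected not in seen:
                seen.add(projected)
                result.append(projected)
    return result

# Toy Reserves rows: (sid, bid, rname). Project onto (sid, bid).
rows = [(1, 101, "x"), (1, 101, "y"), (2, 102, "z"), (1, 103, "w"), (2, 102, "q")]
print(sorted(hash_project(rows, wanted=(0, 1), n_partitions=3)))
```

Because equal projected tuples always hash to the same partition, duplicates only need to be looked for within a partition, never across partitions.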
Discussion of Projection
• Sort-based approach is the standard; better handling of duplicate elimination and result is sorted.
• If an index on the relation contains all wanted attributes in its search key, can do index-only scan.
– Apply projection techniques to data entries (much smaller!)
• If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, can do even better:
– Retrieve data entries in order (index-only scan), discard unwanted fields, compare adjacent tuples to check for duplicates.
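The last case needs no sorting or hashing at all, since duplicates of a key prefix are adjacent in an ordered index. A minimal sketch, assuming the data entries arrive as tuples already in search-key order (for instance from a hypothetical tree index on <sid, bid, day>):

```python
def distinct_prefix(sorted_entries, n_wanted):
    """Keep the first n_wanted fields of each entry and drop adjacent duplicates."""
    result, previous = [], None
    for entry in sorted_entries:
        projected = entry[:n_wanted]
        if projected != previous:  # duplicates can only be adjacent
            result.append(projected)
            previous = projected
    return result

# Data entries in index order; the day strings are made-up values.
entries = [(1, 101, "1994-07-01"), (1, 101, "1994-08-01"),
           (1, 103, "1994-06-15"), (2, 102, "1994-07-04"),
           (2, 102, "1994-09-30")]
print(distinct_prefix(entries, 2))
```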