Changes
• 13.1 – Added content to slide #8
• 13.2 – Added extra slide
• 13.4 – Added content to slide #9
• 13.8 – Added content to slide #7
• 15.1 – Added content to slide #8
• 15.2 – Added content to slide #3
• 15.4 – Added title to slide #12
• 18.3 – Added content to slides #2 and #3
• 18.4 – Added extra slide
Secondary Storage Management: The Memory Hierarchy
The Memory Hierarchy
• Computer systems have several different components in which data may be stored.
• Data capacities & access speeds range over at least seven orders of magnitude
• Devices with smallest capacity also offer the fastest access speed
Description of Levels
1. Cache
• A megabyte or more of cache storage.
• On-board cache: on the same chip as the processor.
• Level-2 cache: on another chip.
• Cache data can be accessed in a few nanoseconds.
• Data is moved from main memory to cache when needed by the processor.
• Volatile
Description of Levels
2. Main Memory
• 1 GB or more of main memory.
• Instruction execution and data manipulation involve information resident in main memory.
• Time to move data from main memory to the processor or cache is in the 10-100 nanosecond range.
• Volatile
3. Secondary Storage
• Typically a magnetic disk.
• Capacity up to 1 TB.
• One machine can have several disk units.
• Time to transfer a single byte between disk and main memory is around 10 milliseconds.
Description of Levels
4. Tertiary Storage
• Holds data volumes measured in terabytes.
• Significantly higher read/write times.
• Lower cost per byte.
• Retrieval takes seconds or minutes, but capacities in the petabyte range are possible.
Transfer of Data Between Levels
• Data moves between adjacent levels of the hierarchy.
• Each level is organized to transfer large amounts of data to or from the level below
• Key technique for speeding up database operations is to arrange data so that when one piece of a disk block is needed, it is likely that other data on the same block will also be needed at about the same time.
Volatile & Non Volatile Storage
• A volatile device “forgets” what is stored in it when the power goes off.
• Example: Main Memory
• A nonvolatile device, on the other hand, is expected to keep its contents intact even for long periods when the device is turned off or there is a power failure.
• Example: Secondary & Tertiary Storage
Note: No change to the database can be considered final until it has migrated to nonvolatile, secondary storage.
Virtual Memory
• Managed by the operating system.
• Some of the virtual address space resides in main memory and the rest on disk.
• Transfer between the two is in units of disk blocks (pages).
• Not a level of the memory hierarchy; it is an artifact of the operating system and its use of the machine’s hardware.
Thank you!
Section 13.2 – Secondary storage management
CS-257 Database System Principles
Avinash Anantharamu
102
• 13.2 Disks
• 13.2.1 Mechanics of Disks
• 13.2.2 The Disk Controller
• 13.2.3 Disk Access Characteristics
Index
• The use of secondary storage is one of the important characteristics of a DBMS, and secondary storage is almost exclusively based on magnetic disks
Disks:
Structure of a Disk
• 0’s and 1’s are represented by different patterns in the magnetic material.
• A common diameter for the disk platters is 3.5 inches.
Data in Disk
• Two principal moving pieces of a hard drive:
1. Head assembly
2. Disk assembly
• The disk assembly has one or more circular platters that rotate around a central spindle.
• Platters are covered with a thin layer of magnetic material.
Mechanics of Disks
Top View of Disk Surface
• Tracks are concentric circles on a platter.
• Tracks are organized into sectors, which are segments of the circle.
• Sectors are indivisible as far as errors are concerned.
• Blocks are logical data transfer units.
Mechanics of Disks
• Should a portion of the magnetic layer become corrupted in some way, so that it cannot store information, then the entire sector containing this portion cannot be used.
Corruptions
• Control the actuator to move head assembly
• Selecting the surface from which to read or write
• Transfer bits from desired sector to main memory
Disk Controller
Simple Single Processor Computer
• Seek time: The disk controller positions the head assembly at the cylinder containing the track on which the block is located. The time to do so is the seek time.
• Rotational latency: The disk controller waits while the first sector of the block moves under the head. This time is called the rotational latency.
Disk Access characteristics
• Transfer time: All the sectors and the gaps between them pass under the head, while the disk controller reads or writes data in these sectors. This delay is called the transfer time.
• Latency of the disk: The sum of the seek time, rotational latency, and transfer time is the latency of the disk.
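As a small worked illustration of how the three components add up, the sketch below simply sums them. The numeric values are hypothetical, chosen only for illustration; they do not come from the slides.

```python
def disk_latency_ms(seek_ms, rotational_ms, transfer_ms):
    """Latency of one block access = seek time + rotational latency + transfer time."""
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical timings for a single block access (illustration only).
latency = disk_latency_ms(seek_ms=6.0, rotational_ms=4.0, transfer_ms=0.5)
print(f"latency = {latency} ms")
```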
Disk Access characteristics
• Database Systems -The complete Book-Second Edition
Reference:
Thank you Any Questions?
? Questions ?
13.3 Accelerating Access to Secondary Storage
San Jose State University
Spring 2012
13.3 Accelerating Access to Secondary Storage
Section Overview
13.3.1: The I/O Model of Computation
13.3.2: Organizing Data by Cylinders
13.3.3: Using Multiple Disks
13.3.4: Mirroring Disks
13.3.5: Disk Scheduling and the Elevator Algorithm
13.3.6: Prefetching and Large-Scale Buffering
13.3 Introduction
Average block access time is ~10 ms.
Disks may be busy when a request arrives.
If requests arrive faster than the disk can serve them, scheduling latency grows without bound.
There are various strategies to increase disk throughput.
The “I/O Model” is the correct model for determining the speed of database operations.
13.3 Introduction (Contd.)
Actions that improve database access speed:
– Place blocks closer, within the same cylinder
– Increase the number of disks
– Mirror disks
– Use an improved disk-scheduling algorithm
– Use prefetching
13.3.1 The I/O Model of Computation
If we have a computer running a DBMS that:
– Is trying to serve a number of users
– Has 1 processor, 1 disk controller, and 1 disk
– Each user is accessing different parts of the DB
It can be assumed that:
– Time required for disk access is much larger than access to main memory; and as a result:
– The number of block accesses is a good approximation of time required by a DB algorithm
13.3.2 Organizing Data by Cylinders
It is more efficient to store data that might be accessed together in the same or adjacent cylinder(s).
In a relational database, related data should be stored in the same cylinder.
13.3.3 Using Multiple Disks
If the disk controller supports the addition of multiple disks and has efficient scheduling, using multiple disks can improve performance significantly
By striping a relation across multiple disks, each chunk of data can be retrieved in a parallel fashion, improving performance by up to a factor of n, where n is the total number of disks the data is striped over
A drawback of striping data across multiple disks is that you increase your chances of disk failure.
To mitigate this risk, some DBMS use a disk mirroring configuration
Disk mirroring makes each disk a copy of the other disks, so that if any disk fails, the data is not lost
Since all the data is in multiple places, access speedup can be increased by more than n since the disk with the head closest to the requested block can be chosen
13.3.4 Mirroring Disks
13.3.4 Mirroring Disks
Striping
  Advantages: read/write speedup ~n; capacity increased by ~n
  Disadvantages: higher risk of failure
Mirroring
  Advantages: read speedup ~n; reduced failure risk; fast initial access
  Disadvantages: high cost per bit; slow writes compared to striping
One way to improve disk throughput is to improve disk scheduling, prioritizing requests such that they are more efficient
– The elevator algorithm is a simple yet effective disk scheduling algorithm
– The algorithm makes the heads of a disk oscillate back and forth similar to how an elevator goes up and down
– The access requests closest to the heads' current position, in the current direction of travel, are processed first
13.3.5 Disk Scheduling
When sweeping outward, the direction of head movement changes only after the largest cylinder request has been processed
When sweeping inward, the direction of head movement changes only after the smallest cylinder request has been processed
Example:
13.3.5 Disk Scheduling
Cylinder Time Requested (ms)
8000 0
24000 0
56000 0
16000 10
64000 20
40000 30
Cylinder Time Completed (ms)
8000 4.3
24000 13.6
56000 26.9
64000 34.2
40000 45.5
16000 56.8
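The schedule above can be reproduced with a short simulation of the elevator algorithm. This is a sketch under stated assumptions, none of which appear on the slides: the heads start at cylinder 8000 sweeping outward, a seek across n cylinders takes 1 + n/4000 ms, and each request needs a further 4.3 ms for rotational latency plus transfer. These values were chosen because they reproduce the completion times in the table.

```python
def elevator_schedule(requests, start_cylinder, seek_ms, service_ms):
    """Simulate the elevator algorithm.

    requests: list of (cylinder, arrival_ms) pairs.
    Returns a list of (cylinder, completion_ms) in the order served.
    """
    pending = sorted(requests, key=lambda r: r[1])  # ordered by arrival time
    time, head, direction = 0.0, start_cylinder, +1  # +1 = sweeping outward
    done = []
    while pending:
        arrived = [r for r in pending if r[1] <= time]
        if not arrived:
            time = pending[0][1]  # idle until the next request arrives
            continue
        # Requests at or beyond the head in the current sweep direction.
        ahead = [r for r in arrived if (r[0] - head) * direction >= 0]
        if not ahead:
            direction = -direction  # nothing left this way: reverse the sweep
            continue
        nxt = min(ahead, key=lambda r: abs(r[0] - head))
        distance = abs(nxt[0] - head)
        time += (seek_ms(distance) if distance else 0.0) + service_ms
        head = nxt[0]
        pending.remove(nxt)
        done.append((nxt[0], round(time, 1)))
    return done


# Assumed timing model (not stated on the slides).
requests = [(8000, 0), (24000, 0), (56000, 0), (16000, 10), (64000, 20), (40000, 30)]
schedule = elevator_schedule(requests, start_cylinder=8000,
                             seek_ms=lambda n: 1 + n / 4000, service_ms=4.3)
for cylinder, finished in schedule:
    print(cylinder, finished)
```

Note how the request for cylinder 16000, although it arrives at 10 ms, is served last: it lies behind the heads during the outward sweep.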
In some cases we can anticipate what data will be needed
We can take advantage of this by prefetching data from the disk before the DBMS requests it
Since the data is already in memory, the DBMS receives it instantly
13.3.6 Prefetching and Large-Scale Buffering
? Questions ?
Disk Failures
Presented by Timothy Chen
Spring 2013
Index
• 13.4 Disk Failures
13.4.1 Intermittent Failures
13.4.2 Checksums
13.4.3 Stable Storage
13.4.4 Error-Handling Capabilities of Stable Storage
13.4.5 Recovery from Disk Crashes
13.4.6 Mirroring as a Redundancy Technique
13.4.7 Parity Blocks
13.4.8 An Improvement: RAID 5
13.4.9 Coping With Multiple Disk Crashes
Intermittent Failures
• An intermittent failure occurs when we try to read a sector but the correct content of that sector is not delivered to the disk controller.
• The controller must be able to tell good sectors from bad ones.
• To check that a write was correct, a read is performed.
• Whether a sector is good or bad is determined by the read operation.
CheckSum
• The read operation uses a checksum to determine whether a sector is good or bad.
How CheckSum is performed
• Each sector has some additional bits, the checksum.
• They are set depending on the values of the data bits stored in that sector.
• If, on reading, the checksum is not proper for the data bits, we know there was an error in reading.
• Odd number of 1's: the bits have odd parity (e.g. 01101000).
• Even number of 1's: the bits have even parity (e.g. 111011100).
• A single parity bit finds an error if exactly one bit is wrong.
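The parity rule can be sketched in a few lines. The function names are hypothetical, and the even-parity convention (store one extra bit so that the total number of 1's is even) is an assumption made for the illustration.

```python
def parity_bit(bits: str) -> int:
    """Return 1 if the bit string contains an odd number of 1's, else 0."""
    return bits.count("1") % 2


def with_even_parity(data_bits: str) -> str:
    """Append a parity bit so the stored string has an even number of 1's."""
    return data_bits + str(parity_bit(data_bits))


def looks_intact(stored_bits: str) -> bool:
    """A stored string passes the check when its overall parity is even."""
    return parity_bit(stored_bits) == 0


stored = with_even_parity("01101000")    # three 1's, so the parity bit is 1
print(stored, looks_intact(stored))
corrupted = "1" + stored[1:]             # flip the first bit
print(corrupted, looks_intact(corrupted))
```

A single flipped bit changes the parity and is detected; two flipped bits cancel out and go unnoticed, which is why real checksums use more than one bit.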
Stable Storage
• A policy for dealing with disk errors.
• Sectors are paired, and each pair represents one sector-content X, with left and right copies XL and XR.
• The parity (checksum) of the left and right copies is checked, substituting a spare sector for XL or XR if necessary, until a good value is returned.
Error-Handling Capabilities of Stable Storage
• Since there are two copies, XL and XR, if one of them fails we can still read the other.
• The chance that both of them fail is very small.
• A write failure can happen during a power outage.
Recover Disk Crash
• The most serious mode of failure for disks is the “head crash”, where data is permanently destroyed.
• To recover from crashes, we use RAID techniques.
Mirroring as a Redundancy Technique
• This is called RAID 1.
• Just mirror each disk.
• We call one disk the data disk and the second a redundant disk.
Raid 1 graph
Parity Block
• This technique is often called RAID 4.
• To compute the redundant disk, read the corresponding block from each of the other disks and take the modulo-2 sum of each column.
disk 1: 11110000
disk 2: 10101010
disk 3: 00111000
The redundant disk 4 holds the result (even number of 1's = 0, odd number of 1's = 1):
disk 4: 01100010
Raid 4 graphic
Parity Block- Fail Recovery
• A parity block can recover from only one disk failure.
• If more than one disk fails, for example two, the data cannot be recovered using the modulo-2 sum.
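The modulo-2 sum is a bitwise XOR, so the parity example and the single-disk recovery can be checked with a short sketch. The helper name is hypothetical; the disk contents are the ones from the example.

```python
from functools import reduce


def modulo2_sum(blocks):
    """Column-wise modulo-2 sum (bitwise XOR) of equally sized blocks."""
    return reduce(lambda a, b: a ^ b, blocks)


data_disks = [0b11110000, 0b10101010, 0b00111000]   # disks 1-3 from the example
redundant = modulo2_sum(data_disks)                  # contents of disk 4
print(format(redundant, "08b"))

# Suppose disk 2 crashes: XOR the surviving data disks with the redundant disk.
recovered = modulo2_sum([data_disks[0], data_disks[2], redundant])
print(format(recovered, "08b"))
```

With two disks lost there are two unknowns per column and only one parity equation, so this method cannot rebuild them.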
An Improvement Raid 5
Coping with multiple Disk Crash
• When more than one disk fails, neither RAID 4 nor RAID 5 can recover the data.
• So we need RAID 6.
• It needs at least 2 redundant disks.
Raid 6
Reference
• http://www.definethecloud.net/wp-content/uploads/2010/12/325px-RAID_1.svg_.png
• http://en.wikipedia.org/wiki/RAID
Secondary Storage Management
13.5 Arranging data on disk
Mangesh Dahale
ID-105
CS 257
Outline
• Fixed-Length Records
• Example of Fixed-Length Records
• Packing Fixed-Length Records into Blocks
• Example of Packing Fixed-Length Records into Blocks
• Details of Block header
Arranging Data on Disk
• A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in some disk block.
Fixed Length Records
The Simplest record consists of fixed length fields.
The record begins with a header, a fixed-length region where information about the record itself is kept.
Fixed Length Record header
1. A pointer to record schema.
2. The length of the record.
3. A timestamp indicating when the record was created.
Example
CREATE TABLE employee(
name CHAR(30) PRIMARY KEY,
address VARCHAR(255),
gender CHAR(1),
birthdate DATE
);
Packing Fixed Length Records into Blocks
• Records are stored in blocks of the disk and moved into main memory when we need to access or update them.
• A block header is written first, and it is followed by a series of records.
Example
• Along with the header, we pack as many records as we can into one block, as shown in the figure, and the remaining space is unused.
Block header contains the following information
• Links to one or more other blocks that are part of a network of blocks
• Information about the role played by this block in such a network
• Information about which relation the tuples of this block belong to.
• A “directory” giving the offset of each record in the block
• Timestamp(s) indicating the time of the block's last modification and/or access
•Thank You
Variable Length Data and Records
- Ashwin Kalbhor Class ID : 107
Agenda
• Records with Variable Length Fields
• Records with Repeating Fields
• Variable Format Records
• Records that do not fit in a block
• Example of a record
[Figure: record layout with fields name, address, gender, birth date at byte offsets 0, 30, 286, 287, ending at 297]
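The offsets 0, 30, 286, 287 and 297 follow from laying the fields end to end. The sketch below recomputes them; the field sizes (30, 256, 1 and 10 bytes) are assumptions inferred from those offsets, for instance 256 bytes for a VARCHAR(255) with one extra byte.

```python
from itertools import accumulate

# (field name, size in bytes); sizes are assumed, inferred from the offsets shown.
FIELDS = [("name", 30), ("address", 256), ("gender", 1), ("birthdate", 10)]


def field_offsets(fields):
    """Return ({field: starting offset}, total record length) for fixed-length fields."""
    starts = [0] + list(accumulate(size for _, size in fields))
    offsets = {name: start for (name, _), start in zip(fields, starts)}
    return offsets, starts[-1]


offsets, record_length = field_offsets(FIELDS)
print(offsets, record_length)
```

Because every field has a fixed size, the position of any field in any record is known without reading the record itself.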
Records with Variable Length Fields
• A simple and effective way to represent variable length records is as follows:
1. Fixed length fields are kept ahead of the variable length fields.
2. A header is put in front of the record.
3. The record header contains
• the length of the record
• pointers to the beginning of all variable length fields except the first one.
Example
Record with name and address as variable length fields.
[Figure: record layout: header information (record length, pointer to address), then birth date, gender, name, address]
Records with repeating fields
• Repeating fields simply means fields of the same length L.
• All occurrences of field F are grouped together.
• A pointer to the first field F is put in the header.
• Based on the length L, the starting offset of any repeating field can be obtained.
Example of a record with Repeating Fields
Movie star record with “movies” as the repeating field.
[Figure: record layout: header (other header information, record length, pointer to address, pointer to movie pointers), then name, address, and pointers to movies]
Alternative representation
• The record is of fixed length.
• Variable length fields are stored on a separate block.
• The record itself keeps track of:
1. Pointers to the place where each repeating field begins, and
2. Either how many repetitions there are, or where the repetitions end.
Storing variable length fields separately from the record.
Variable Format Records
• Records that do not have a fixed schema
• Represented by a sequence of tagged fields
• Each of the tagged fields consists of the following information:
• Attribute or field name
• Type of the field
• Length of the field
• Value of the field
Variable Format Records
[Figure: two tagged fields. First: N (code for name), S (code for string type), 14 (length), “Clint Eastwood”. Second: R (code for restaurant owned), S (code for string type), 16 (length), “Hog’s Breath Inn”]
Records that do not fit in a block
• When the length of a record is greater than the block size, the record is divided and placed into two or more blocks
• The portion of the record in each block is referred to as a RECORD FRAGMENT
• A record with two or more fragments is called a SPANNED RECORD
• A record that does not cross a block boundary is called an UNSPANNED RECORD
Spanned Records
• Spanned records require the following extra header information:
• A bit indicating whether it is a fragment or not
• A bit indicating whether it is the first or last fragment of a record
• Pointers to the next or previous fragment of the same record
Spanned Records
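[Figure: block 1 and block 2, each with a block header; record 2 is split into fragments 2-a and 2-b across the two blocks, alongside record 1 and record 3; each record or fragment has a record header]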
Thank You.
13.8 Record Modifications
CS257
Lok Kei Leong (108)
Outline
• Record Insertion
• Record Deletion
• Record Update
Insertion
• Insert new records into a relation
- records of a relation in no particular order
- records of a relation in fixed order (e.g. sorted by primary key)
• A pointer to a record from outside the block is a “structured address”
[Figure: block containing a header, an offset table, unused space, and records 4, 3, 2, 1]
What If The Block is Full?
• What should we do if we need to insert a record into a particular block but the block is full?
• Find room outside the block
• There are 2 solutions:
I. Find space on a nearby block
II. Create an overflow block
Insertion (solution 1)
• Find space on a “nearby” block
• Block B1 has no space
• If space is available on block B2, move records of B1 to B2
• If there are external pointers to records of B1 moved to B2, leave a forwarding address in the offset table of B1
Insertion (solution 2)
• Create an overflow block
• Each block B has in its header a pointer to an overflow block where additional records of B can be placed
Block B Overflow block for B
Deletion
• Slide records around the block, using an offset table
• If we cannot slide records:
- maintain an available-space list in the block header to keep track of space available
• Since there may be pointers to the deleted record, we need to avoid dangling pointers or winding up pointing to a new record
Tombstone
• What about pointers to deleted records?
• A tombstone is placed in place of each deleted record
• A tombstone is a bit placed at the first byte of the deleted record to indicate the record was deleted (0 – Not Deleted, 1 – Deleted)
• A tombstone is permanent
Record 1 Record 2
Update
• For fixed-length records, there is no effect on the storage system
• For variable-length records:
• the problems are those associated with insertion and deletion (but we never create a tombstone for the old record)
• If the updated record is longer, create more space on its block by
- sliding records
- creating an overflow block
Question?
Query Execution
Section 15.1
Sweta Shah
CS257: Database Systems
ID: 118
Query Processor
Query compilation
Physical Query Plan Operators
Scanning Tables
– Table Scan
– Index scan
Sorting while scanning tables
Model of computation for physical operators
Parameters for measuring cost
Iterators
Agenda
The Query Processor is a group of components of a DBMS that turns user queries and data-modification commands into a sequence of database operations and executes those operations
Query processor is responsible for supplying details regarding how the query is to be executed
Query Processor
The major parts of the query processor
Query compilation itself is a multi-step process consisting of:
Parsing: in which a parse tree representing the query and its structure is constructed
Query rewrite: in which the parse tree is converted to an initial query plan
Physical plan generation: where the logical query plan is turned into a physical query plan by selecting algorithms.
Query compilation
Outline of query compilation
Physical query plans are built from operators.
Each of the operators implements one step of the plan.
They are particular implementations for one of the operators of relational algebra.
They can also be non-relational-algebra operators like “scan”, which scans tables.
Physical Query Plan Operators
One of the most basic things in a physical query plan.
We read the entire contents of relation R.
Necessary when we want to perform a join or union of a relation with another relation.
Scanning Tables
Two basic approaches to locating the tuples of a relation R
Table-scan
Relation R is stored in secondary memory with its tuples arranged in blocks.
It is possible to get the blocks one by one.
This operation is called Table Scan.
Two basic approaches to locating the tuples of a relation R
Index-scan
There is an index on some attribute of relation R.
Use this index to get all the tuples of R.
This operation is called Index Scan.
Why do we need sorting while scanning?
The query could include an ORDER BY clause requiring that a relation be sorted.
Various algorithms for relational-algebra operations require one or both of their arguments to be sorted relations.
Sort-scan takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in that sorted order
Sorting While Scanning Tables
Choosing physical plan operators wisely is essential for a good query processor.
The cost of an operation is measured in the number of disk I/O operations.
If an operator requires the final answer to a query to be written back to disk, the total cost depends on the length of the answer and includes the cost of that final write.
Model of Computation for Physical Operators
Major improvements in cost of the physical operators can be achieved by avoiding or reducing the number of disk i/o operations
This can be achieved by passing the answer of one operator to the other in the main memory itself without writing it to the disk.
Improvements in cost
Parameters that affect the performance of a query:
Buffer space available in main memory at the time the query is executed
Size of the input and size of the output generated
The size of a block on disk and the size of a buffer in main memory also affect performance
Parameters for Measuring Costs
Many physical operators can be implemented as an iterator
It is a group of three functions that allows a consumer of the result of the physical operator to get the result one tuple at a time
Iterators for Implementation of Physical Operators
The three functions forming the iterator are:
Open: This function starts the process of getting tuples. It initializes any data structures needed to perform the operation.
Iterator
GetNext: This function returns the next tuple in the result.
It adjusts data structures as necessary to allow subsequent tuples to be obtained.
If there are no more tuples to return, GetNext returns a special value NotFound.
Iterator
Close: This function ends the iteration after all tuples have been obtained. It calls Close on any arguments of the operator.
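The Open/GetNext/Close protocol can be sketched as two small classes: a table scan over in-memory "blocks" and a selection that pulls tuples from it. This is an illustrative sketch only; the class names, the NOT_FOUND sentinel, and the use of Python lists to stand in for disk blocks are all assumptions.

```python
NOT_FOUND = object()  # sentinel playing the role of the special value NotFound


class TableScan:
    """Iterator over a relation stored as a list of blocks (lists of tuples)."""

    def __init__(self, blocks):
        self.blocks = blocks

    def open(self):
        self.block_no, self.pos = 0, 0  # initialize the scan position

    def get_next(self):
        while self.block_no < len(self.blocks):
            block = self.blocks[self.block_no]
            if self.pos < len(block):
                self.pos += 1
                return block[self.pos - 1]
            self.block_no, self.pos = self.block_no + 1, 0  # move to next block
        return NOT_FOUND

    def close(self):
        pass  # nothing to release in this in-memory sketch


class Select:
    """Selection operator that consumes its argument one tuple at a time."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def get_next(self):
        while True:
            t = self.child.get_next()
            if t is NOT_FOUND or self.predicate(t):
                return t

    def close(self):
        self.child.close()  # Close is passed on to the argument


def drain(iterator):
    """Consumer: open the iterator, collect every tuple, then close it."""
    iterator.open()
    result = []
    while (t := iterator.get_next()) is not NOT_FOUND:
        result.append(t)
    iterator.close()
    return result


print(drain(Select(TableScan([[1, 2], [3], [], [4, 5]]), lambda t: t % 2 == 1)))
```

No intermediate result is materialized: each call to get_next on the selection pulls just enough tuples from the scan to produce one output tuple.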
Iterator
Thank You !!!
Query Execution
One-pass algorithm for database operations
Chetan Sharma
008565661
Overview
One-Pass Algorithm
One-Pass Algorithm Methods:
1) Tuple-at-a-time, unary operations.
2) Full-relation, unary operations.
3) Full-relation, binary operations.
Various Algorithms
We can divide algorithms for operators into three “degrees” of difficulty and cost:
1) One-Pass Algorithm: Require at least one of the arguments to fit in main memory.
2) Two-pass Algorithm: Some methods work for data that is too large to fit in available main memory but not for the largest imaginable data sets.
3) Three-or-more-pass (multipass) algorithms: Some methods work without a limit on the size of the data. They are recursive generalizations of the two-pass algorithms.
One-Pass Algorithm
• Reading the data only once from disk.
• Usually, they require at least one of the arguments to fit in main memory
Tuple-at-a-Time
• These operations do not require an entire relation, or even a large part of it, in memory at once. Thus, we can read a block at a time, use one main memory buffer, and produce our output.
• Ex- selection and projection
Tuple-at-a-Time
A selection or projection being performed on a relation R
Full-relation, unary operations
• Now, let us consider the unary operations that apply to relations as a whole , rather than to one tuple at a time:
a) Duplicate elimination.
b) Grouping.
a) Duplicate elimination
b) Grouping
• MIN (a),MAX (a) aggregate, record the minimum or maximum value, respectively, of attribute a seen for any tuple in the group so far.
• COUNT aggregation, add one for each tuple of the group that is seen.
• SUM (a), add the value of attribute a to the accumulated sum for its group.
• AVG (a) is the hard case. We must maintain two accumulations: the count of the number of tuples in the group and the sum of the a-values of these tuples.
b) Grouping
When all tuples of R have been read into the input buffer and contributed to the aggregation(s) for their group, we can produce the output by writing the tuple for each group. Note that until the last tuple is seen, we cannot begin to create output for a grouping operation. Thus, this algorithm does not fit the iterator framework very well; the entire grouping has to be done by the Open method before the first tuple can be retrieved.
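The accumulators described for MIN, MAX, COUNT, SUM and AVG can be sketched as a single pass that keeps one small entry per group in memory. The function name and the dictionary layout are illustrative assumptions.

```python
def one_pass_group(tuples, key_index, value_index):
    """Group tuples on one attribute and aggregate another in a single pass."""
    groups = {}
    for t in tuples:
        key, a = t[key_index], t[value_index]
        acc = groups.get(key)
        if acc is None:
            groups[key] = {"min": a, "max": a, "count": 1, "sum": a}
        else:
            acc["min"] = min(acc["min"], a)
            acc["max"] = max(acc["max"], a)
            acc["count"] += 1
            acc["sum"] += a
    # AVG is derived only at the end, from the two accumulations it needs.
    return {k: dict(acc, avg=acc["sum"] / acc["count"]) for k, acc in groups.items()}


print(one_pass_group([("a", 1), ("b", 10), ("a", 3)], key_index=0, value_index=1))
```

Output can only be produced after the loop finishes, which is the reason the text gives for the poor fit with the iterator framework.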
One-Pass Algorithms for Binary Operations
• All other operations are in this class: set and bag versions of union, intersection, difference, joins, and products.
• Binary operations require reading the smaller of the operands R and S into main memory and building a suitable data structure so tuples can be both inserted quickly and found quickly.
• The condition for a binary operation to be performed in one pass is, approximately: min(B(R),B(S)) <= M
Some examples
In each case, we assume R is the larger of the relations, and we house S in main memory.
• Set Union:
- We read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple.
- All these tuples are also copied to the output.
- Read each block of R into the Mth buffer, one at a time.
- For each tuple t of R, see if t is in S, and if not, we copy t to the output. If t is also in S, we skip t.
• Set Intersection :
- Read S into M - 1 buffers and build a search structure with full tuples as the search key.
- Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output, and if not, ignore t.
examples continued..
• Product
- Read S into M - 1 buffers of main memory; no special data structure is needed.
- Then read each block of R, and for each tuple t of R concatenate t with each tuple of S in main memory. Output each concatenated tuple as it is formed.
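The three examples can be sketched as follows, with S held entirely in memory and R read block by block. This is a sketch under assumptions: blocks are modeled as Python lists of tuples, a Python set plays the role of the search structure, and the function names are hypothetical.

```python
def set_union(r_blocks, s_tuples):
    """One-pass set union: S is in memory, R streams through block by block."""
    s_index = set(s_tuples)          # search structure keyed on the entire tuple
    output = list(s_tuples)          # all tuples of S go to the output
    for block in r_blocks:           # one block of R at a time
        output.extend(t for t in block if t not in s_index)
    return output


def set_intersection(r_blocks, s_tuples):
    """One-pass set intersection: output tuples of R that also appear in S."""
    s_index = set(s_tuples)
    return [t for block in r_blocks for t in block if t in s_index]


def product(r_blocks, s_tuples):
    """One-pass product: concatenate each tuple of R with each tuple of S."""
    return [t + s for block in r_blocks for t in block for s in s_tuples]


S = [(1,), (2,)]
R_BLOCKS = [[(2,), (3,)], [(4,)]]
print(set_union(R_BLOCKS, S))
print(set_intersection(R_BLOCKS, S))
print(product(R_BLOCKS, S))
```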
Summary
One-Pass Algorithm
One-Pass Algorithm Methods:
1) Tuple-at-a-time, unary operations.
2) Full-relation, unary operations.
3) Full-relation, binary operations.
Questions
&
Nested Loops Joins
Book Section of chapter 15.3
Guide: Prof. Dr. T.Y. LIN
Name: Sanya Valsan
Roll: 120
Topic to be covered
• Tuple-Based Nested-Loop Join
• An Iterator for Tuple-Based Nested-Loop Join
• A Block-Based Nested-Loop Join Algorithm
• Analysis of Nested-Loop Join
Introduction
• Nested-loop joins are, in a sense, “one-and-a-half” passes, since in each variation one of the two arguments has its tuples read only once, while the other argument has to be read repeatedly.
• Nested-loop joins can be used for relations of any size; it is not necessary that one relation fit in main memory.
15.3.1 Tuple-Based Nested-Loop Join
• The simplest variation of nested-loop join has loops that range over individual tuples of the relations involved. In this algorithm, which we call tuple-based nested-loop join, we compute the join R ⋈ S as follows:
Continued
FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make a tuple t THEN
            output t;
– If we are careless about how we buffer the blocks of relations R and S, then this algorithm could require as many as T(R)T(S) disk I/O’s. There are many situations where this algorithm can be modified to have much lower cost.
Note: T(R) – number of tuples in relation R.
Continued
• One improvement looks much more carefully at the way tuples of R and S are divided among blocks, and uses as much of the memory as it can to reduce the number of disk I/O's as we go through the inner loop. We shall consider this block-based version of nested-loop join.
15.3.2 An Iterator for Tuple-Based Nested-Loop Join
Open() {
    R.Open();
    S.Open();
    s := S.GetNext();
}
GetNext() {
    REPEAT {
        r := R.GetNext();
        IF (r = NotFound) { /* R is exhausted for the current s */
            R.Close();
            s := S.GetNext();
            IF (s = NotFound) RETURN NotFound; /* both R and S are exhausted */
            R.Open();
            r := R.GetNext();
        }
    }
    UNTIL (r and s join);
    RETURN the join of r and s;
}
Close() {
    R.Close();
    S.Close();
}
15.3.3 A Block-Based Nested-Loop Join Algorithm
We can improve the nested-loop join for computing R ⋈ S by:
1. Organizing access to both argument relations by blocks.
2. Using as much main memory as we can to store tuples belonging to the relation S, the relation of the outer loop.
The nested-loop join algorithm
FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;
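A minimal Python sketch of the block-based algorithm follows. It assumes blocks are lists of tuples held in memory, uses a dictionary as the search structure, and takes the positions of the join attribute as parameters; these names and choices are illustrative, not part of the slides. For simplicity each output tuple is the plain concatenation of t and s, so the join attribute appears twice.

```python
from collections import defaultdict


def block_nested_loop_join(r_blocks, s_blocks, m, r_key, s_key):
    """Join R and S with S as the outer relation, M-1 blocks of S per chunk."""
    output = []
    chunk_size = m - 1
    for start in range(0, len(s_blocks), chunk_size):
        # Read up to M-1 blocks of S and index their tuples on the join attribute.
        index = defaultdict(list)
        for block in s_blocks[start:start + chunk_size]:
            for s in block:
                index[s[s_key]].append(s)
        # Scan all of R once per chunk, using the one remaining buffer.
        for block in r_blocks:
            for t in block:
                for s in index.get(t[r_key], []):
                    output.append(t + s)
    return output


R_BLOCKS = [[(1, "a"), (2, "b")], [(3, "a")]]        # R(X, Y)
S_BLOCKS = [[("a", 10)], [("b", 20)], [("a", 30)]]   # S(Y, Z)
print(block_nested_loop_join(R_BLOCKS, S_BLOCKS, m=3, r_key=1, s_key=0))
```

With m=3 the three blocks of S form two chunks, so R is scanned twice rather than once per tuple of S.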
15.3.4 Analysis of Nested-Loop Join
Assuming S is the smaller relation, the number of chunks, or iterations of the outer loop, is B(S)/(M - 1). At each iteration, we read M - 1 blocks of S and B(R) blocks of R. The number of disk I/O's is thus
(B(S)/(M-1)) × (M-1+B(R)), or B(S) + B(S)B(R)/(M-1)
Assuming all of M, B(S), and B(R) are large, but M is the smallest of these, an approximation to the above formula is B(S)B(R)/M. That is, cost is proportional to the product of the sizes of the two relations, divided by the amount of available main memory.
Example
• B(R) = 1000, B(S) = 500, M = 101
– Important Aside: 101 buffer blocks is not as unrealistic as it sounds. There may be many queries at the same time, competing for main memory buffers.
• Outer loop iterates 5 times
• At each iteration we read M-1 (i.e. 100) blocks of S and all of R (i.e. 1000 blocks).
• Total time: 5*(100 + 1000) = 5500 I/O’s
• Question: What if we reversed the roles of R and S?
• We would iterate 10 times, and in each we would read 100+500 blocks, for a total of 6000 I/O’s.
• Compare with one-pass join, if it could be done!
• We would need 1500 disk I/O’s if B(S) ≤ M-1
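The arithmetic of this example can be checked with a small cost function. The function name is hypothetical, and rounding the number of chunks up to a whole number is an assumption that matters only when B(S) is not a multiple of M-1.

```python
from math import ceil


def nested_loop_join_io(b_outer, b_inner, m):
    """Disk I/O's of block-based nested-loop join.

    The outer relation is read once in chunks of M-1 blocks;
    the inner relation is read in full once per chunk.
    """
    chunks = ceil(b_outer / (m - 1))
    return b_outer + chunks * b_inner


print(nested_loop_join_io(b_outer=500, b_inner=1000, m=101))   # S as the outer relation
print(nested_loop_join_io(b_outer=1000, b_inner=500, m=101))   # roles reversed
```

When the outer relation fits in M-1 buffers there is a single chunk and the cost falls to B(R) + B(S), the one-pass figure.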
Continued…….
1. The cost of the nested-loop join is not much greater than the cost of a one-pass join, which is 1500 disk I/O's for this example.
2. Nested-loop join is generally not the most efficient join algorithm.
Summary of the topic
In this topic we have learned how tuple-based and block-based nested-loop joins are used in query execution, and how each of them proceeds.
Any Questions
?
Two Pass Algorithm Based On Sorting
Section 15.4
CS257 Spring 2013
Swapna Vemparala
Class ID : 131
Contents
• Two-Pass Algorithms
• Two-Phase, Multiway Merge-Sort
• Duplicate Elimination Using Sorting
• Grouping and Aggregation Using Sorting
• A Sort-Based Union Algorithm
• Sort-Based Intersection and Difference
• A Simple Sort-Based Join Algorithm
• A More Efficient Sort-Based Join
Two-Pass Algorithms
• Data from operand relation is read into main memory, processed, written out to disk again, and reread from disk to complete the operation.
15.4.1 Two-Phase, Multiway Merge-Sort
• To sort very large relations in two passes.
• Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any main-memory sorting algorithm. Write out each sorted sublist to secondary storage.
• Phase 2: Merge the sorted sublists. For this phase to work, there can be at most M - 1 sorted sublists, which limits the size of R. We allocate one input block to each sorted sublist and one block to the output.
Merging
• Find the smallest key among the first remaining elements of all the sublists.
• Move the smallest element to the first available position of the output block.
• If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block.
• If the block from which the smallest element was taken is now exhausted of records, read the next block from the same sorted sublist into the same buffer that was used for the block just exhausted.
• If no blocks remain, stop.
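The two phases can be sketched in memory as follows. This is only an illustration: Python lists stand in for disk blocks and for the sorted sublists written to secondary storage, and the standard-library `heapq.merge` performs the repeated "find the smallest first element" step across sublists.

```python
import heapq


def two_phase_merge_sort(blocks, m):
    """Sort a relation given as a list of blocks, using M buffers."""
    # Phase 1: fill the M buffers, sort their tuples, write out a sorted sublist.
    sublists = []
    for start in range(0, len(blocks), m):
        tuples = [t for block in blocks[start:start + m] for t in block]
        sublists.append(sorted(tuples))
    # Phase 2 needs one input buffer per sublist plus one output buffer.
    if len(sublists) > m - 1:
        raise ValueError("relation is too large to sort in two passes")
    # Merge: repeatedly take the smallest first element among all sublists.
    return list(heapq.merge(*sublists))


BLOCKS = [[5, 2], [9, 1], [7, 3], [8, 4], [6, 0]]
print(two_phase_merge_sort(BLOCKS, m=3))
```

With m=3 the five blocks produce two sorted sublists, which is exactly the M - 1 limit for the second phase.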
15.4.2 Duplicate Elimination Using Sorting
• The first phase is the same as in 2PMMS: create sorted sublists.
• In the second pass, instead of merging, repeatedly select the first unconsidered tuple t among all sorted sublists.
• Write one copy of t to the output and eliminate from the input blocks all occurrences of t.
• Output: exactly one copy of any tuple in R.
15.4.3 Grouping and Aggregation Using Sorting
• Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.
• Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.
• Repeatedly find the least value of the sort key present among the first available tuples in the buffers.
15.4.4 A Sort-Based Union Algorithm
• In the first phase, create sorted sublists from both R and S.
• Use one main-memory buffer for each sublist of R and S. Initialize each with the first block from the corresponding sublist.
• Repeatedly find the first remaining tuple t among all the buffers
15.4.5 Sort-Based Intersection and Difference
• For both the set and bag versions, the algorithm is the same as for set union, except for the way we handle the copies of a tuple t at the fronts of the sorted sublists.
• For set intersection: output t if it appears in both R and S.
• For bag intersection: output t the minimum of the number of times it appears in R and in S.
• For set difference: output t if and only if it appears in R but not in S.
• For bag difference: output t the number of times it appears in R minus the number of times it appears in S.
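The four output rules can be written as a small function of the multiplicities of t in R and S. This is a sketch with assumed names; clamping the bag difference at zero is an assumption added here for the case where t occurs more often in S than in R.

```python
def copies_to_output(op, count_r, count_s):
    """How many copies of t to emit, given its multiplicity in R and in S."""
    rules = {
        "set_intersection": 1 if count_r > 0 and count_s > 0 else 0,
        "bag_intersection": min(count_r, count_s),
        "set_difference": 1 if count_r > 0 and count_s == 0 else 0,
        "bag_difference": max(count_r - count_s, 0),   # assumed: never negative
    }
    return rules[op]
```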
15.4.6 A Simple Sort-Based Join Algorithm
• Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers
• Sort R, using 2PMMS, with Y as the sort key.
• Sort S similarly.
• Merge the sorted R and S, using only two buffers.
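The merge step can be sketched as follows, assuming both relations are already sorted on Y and held as Python lists of pairs (an illustration only; the real algorithm works block by block with two buffers, and `merge_join` is a hypothetical name).

```python
def merge_join(r_sorted, s_sorted):
    """Join R(X, Y) and S(Y, Z), both given as lists sorted on Y."""
    result, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        y_r, y_s = r_sorted[i][1], s_sorted[j][0]
        if y_r < y_s:
            i += 1
        elif y_r > y_s:
            j += 1
        else:
            # Collect every tuple on both sides that shares this Y-value.
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][1] == y_r:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][0] == y_r:
                j_end += 1
            for x, _ in r_sorted[i:i_end]:
                for _, z in s_sorted[j:j_end]:
                    result.append((x, y_r, z))
            i, j = i_end, j_end
    return result
```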
15.4.8 A More Efficient Sort-Based Join
• If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk I/O's per block by combining the second phase of the sorts with the join itself.
• To compute R(X, Y) ►◄ S(Y, Z) using M main-memory buffers:
– Create sorted sublists of size M, using Y as the sort key, for both R and S.
– Bring the first block of each sublist into a buffer.
– Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value.
Thank you
Query Execution
15.5 Two-Pass Algorithms Based on Hashing
By Avinash Anantharamu
At a glimpse
• Introduction
• Partitioning Relations by Hashing
• Algorithm for Duplicate Elimination
• Grouping and Aggregation
• Union, Intersection, and Difference
• Hash-Join Algorithm
• Sort based Vs Hash based
• Summary
Introduction
• Hashing is used when the data is too big to store in main-memory buffers.
– Hash all the tuples of the argument(s) using an appropriate hash key.
– For all the common operations, there is a way to select the hash key so that all the tuples that need to be considered together when we perform the operation have the same hash value.
– This reduces the size of the operand(s) by a factor equal to the number of buckets.
Partitioning Relations by Hashing
Algorithm:
initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
    read block b into the Mth buffer;
    FOR each tuple t in b DO BEGIN
        IF the buffer for bucket h(t) has no room for t THEN
        BEGIN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        END;
        copy t to the buffer for bucket h(t);
    END;
END;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
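The pseudocode above translates fairly directly into Python. In this sketch a "disk" is just a list of blocks per bucket, and `block_capacity` and the default hash function are assumptions made for illustration.

```python
def partition_by_hash(blocks, m, block_capacity, h=hash):
    """Partition a relation, given as a list of blocks, into M-1 buckets."""
    n_buckets = m - 1
    buffers = [[] for _ in range(n_buckets)]   # one in-memory buffer per bucket
    disk = [[] for _ in range(n_buckets)]      # blocks written out, per bucket
    for block in blocks:                       # block is read into the Mth buffer
        for t in block:
            b = h(t) % n_buckets
            if len(buffers[b]) == block_capacity:
                disk[b].append(buffers[b])     # buffer full: copy it to disk
                buffers[b] = []
            buffers[b].append(t)
    for b in range(n_buckets):                 # flush the non-empty buffers
        if buffers[b]:
            disk[b].append(buffers[b])
    return disk
```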
Duplicate Elimination
• For the operation δ(R), hash R to M-1 buckets. (Note that two copies of the same tuple t will hash to the same bucket.)
• Do duplicate elimination on each bucket Ri independently, using the one-pass algorithm.
• The result is the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.
Requirements
• Number of disk I/O's: 3*B(R)
– The two-pass, hash-based algorithm works only if B(R) ≤ M(M-1).
• In order for this to work, we need:
– the hash function h to distribute the tuples evenly among the buckets
– each bucket Ri to fit in main memory (to allow the one-pass algorithm)
– i.e., approximately B(R) ≤ M²
Grouping and Aggregation
• Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes. (Note: all tuples in the same group end up in the same bucket.)
• Use the one-pass algorithm to process each bucket independently.
• Uses 3*B(R) disk I/O's; requires B(R) ≤ M².
Union, Intersection, and Difference
• For a binary operation, we use the same hash function to hash the tuples of both arguments.
• R ∪ S: hash both R and S to M-1 buckets each.
• R ∩ S: hash both R and S to 2(M-1) buckets in total (M-1 for each relation).
• R - S: hash both R and S to 2(M-1) buckets in total (M-1 for each relation).
• Requires 3(B(R)+B(S)) disk I/O’s.
• The two-pass, hash-based algorithm requires min(B(R), B(S)) ≤ M².
Hash-Join Algorithm
• Use same hash function for both relations; hash function should depend only on the join attributes
• Hash R to M-1 buckets R1, R2, …, R(M-1).
• Hash S to M-1 buckets S1, S2, …, S(M-1).
• Do a one-pass join of Ri and Si, for all i.
• 3*(B(R) + B(S)) disk I/O's; requires min(B(R), B(S)) ≤ M².
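A minimal in-memory sketch of these steps is shown below. Dictionaries stand in for the buckets on disk, and `hash_join` with its parameters is a hypothetical name chosen for illustration.

```python
from collections import defaultdict

def hash_join(r, s, m):
    """Join R(X, Y) with S(Y, Z) by partitioning both on Y into M-1 buckets."""
    n_buckets = m - 1
    r_buckets = defaultdict(list)
    s_buckets = defaultdict(list)
    for x, y in r:                       # same hash function for both relations
        r_buckets[hash(y) % n_buckets].append((x, y))
    for y, z in s:
        s_buckets[hash(y) % n_buckets].append((y, z))
    result = []
    for i in range(n_buckets):
        # One-pass join of the pair (Ri, Si): build on Si, probe with Ri.
        table = defaultdict(list)
        for y, z in s_buckets[i]:
            table[y].append(z)
        for x, y in r_buckets[i]:
            for z in table.get(y, []):
                result.append((x, y, z))
    return result
```

Tuples that can join always share a Y-value, so they land in buckets with the same index and only corresponding pairs need to be compared.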
Sort based Vs Hash based
• For binary operations, hash-based only limits size to min of arguments, not sum
• Sort-based can produce output in sorted order, which can be helpful
• Hash-based depends on buckets being of equal size
• Sort-based algorithms can write their sorted sublists to consecutive disk blocks, which can reduce rotational latency and seek time
Summary
• Partitioning Relations by Hashing
• Algorithm for Duplicate Elimination
• Grouping and Aggregation
• Union, Intersection, and Difference
• Hash-Join Algorithm
• Sort based Vs Hash based
Thank you
15.6 Index Based Algorithms
By: Tomas Tupy (123)
Outline
• Terminology
• Clustered Indexes
– Example
• Non-Clustered Indexes
• Index Based Selection
• Joining Using an Index
• Join Using a Sorted Index
What is an Index?
• A data structure which improves the speed of data retrieval operations on a relation, at the cost of slower writes and the use of more storage space.
• Enables sub-linear time lookup.
• Data is stored in arbitrary order, while logical ordering is achieved by using the index.
Terminology Recap
• B(R) – Number of blocks needed to hold R
• T(R) – Number of tuples in R
• V(R,a) – Number of distinct values of the column for a in R.
• Clustered Relation – Tuples are packed into as few blocks as possible.
• Clustered Indexes – Indexes on attribute(s) such that all tuples with a fixed value for the search key appear on as few blocks as possible.
Clustering Indexes
• A relation is clustered if its tuples are packed into relatively few blocks.
• Clustering indexes are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear in as few blocks as possible.
• Tuples are stored to match the index order.
• A relation that isn’t clustered cannot have a clustering index.
Clustering Indexes
• Let R(a,b) be a relation sorted on attribute a.
• Let the index on a be a clustering index.
• Let a1 be a specific value for a.
• A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
[Figure: all the a1 tuples, packed into a small number of blocks]
Pros/Cons
• Pros
– Faster reads for particular selections
• Cons
– Writing to a table with a clustered index can be slower since there might be a need to rearrange data.
– Only one clustered index possible.
Clustered Index Example
[Figure: two tables, Customer(ID, Name, Address) and Order(ID, CustomerID, Price)]
Problem: We want to quickly retrieve all orders for a particular customer.
How do we do this?
Clustered Index Example
• Solution: Create a clustered index on the “CustomerID” column of the Order table.
• Now the tuples with the same CustomerID will be physically stored close to one another on the disk.
Non Clustered Indexes
• There can be many per table.
• Quicker for insert and update operations.
• The physical order of tuples is not the same as the index order.
Index Based Algorithms
• Especially useful for the selection operator.• Join and other binary operators also benefit.
Index-Based Selection
• No index
– Without an index on relation R, we have to read all the tuples in order to implement the selection σC(R), and see which tuples match our condition C.
– What is the cost in terms of disk I/O’s to implement σC(R)? (For both clustered and non-clustered relations)
Index-Based Selection
• No index
– Answer:
• B(R) if our relation is clustered
• Up to T(R) if the relation is not clustered.
Index-Based Selection
• Let us consider an index on attribute a where our condition C is a = v.
• σa=v(R)
• In this case we just search the index for value v and we get pointers to exactly the tuples we need.
Index-Based Selection
• Let’s say for our selection σa=v(R), our index is clustering.
• What is the cost in the # of disk I/O’s to retrieve the set σa=v(R)?
Index-Based Selection
• Answer
– The average is: B(R) / V(R,a)
• A few more I/Os:
– Index might not be in main memory
– Tuples with a = v might not be block aligned.
– Even if clustered, might not be packed as tightly as possible. (Extra space for insertion)
Index-Based Selection
• Non-clustering index for our selection σa=v(R)
• What is the cost in the # of disk I/O’s to retrieve the set σa=v(R)?
Index-Based Selection
• Answer
– Worst case is: T(R) / V(R,a)
– This can happen if the matching tuples live in different blocks.
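The two estimates can be compared with a small calculator. The function and parameter names are assumptions for illustration, and the result ignores the extra I/Os for reading the index itself.

```python
import math

def selection_cost(b_r, t_r, v_r_a, clustering_index):
    """Rough disk-I/O estimate for the selection a = v on R using an index on a."""
    if clustering_index:
        return math.ceil(b_r / v_r_a)   # matching tuples are packed together
    return math.ceil(t_r / v_r_a)       # each matching tuple may be on its own block
```

For example, with B(R) = 1000, T(R) = 20,000 and V(R,a) = 100, a clustering index gives about 10 disk I/O's, while a non-clustering index gives up to 200.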
Joining by Using an Index(Algorithm 1)
• Consider natural join: R(X,Y) |><| S(Y,Z)
• Suppose S has an index on attribute Y.
• Start by examining each block of R, and within each block consider each tuple t, where tY is the component of t corresponding to the attribute Y.
• Now we use the index to find tuples of S that have tY in their Y component.
• These tuples will create the join.
Joining by Using an Index(Algorithm 1) Analysis
• Consider R(X,Y) |><| S(Y,Z)
• If R is clustered, then we have to read B(R) blocks to get all tuples of R. If R is not clustered then up to T(R) disk I/O’s are required.
• For each tuple t of R, we must read an average of T(S) / V(S,Y) tuples of S.
• Total: about T(R)T(S) / V(S,Y) disk I/O’s if the index on S is non-clustering, and about T(R)B(S) / V(S,Y) if it is clustering, plus the cost of reading R.
Join Using a Sorted Index
• Consider R(X,Y) |><| S(Y,Z)
• Data structures such as B-Trees provide the best sorted indexes.
• In the best case, if we have sorting indexes on Y for both R and S then we perform only the last step of the simple sort-based join. (Sometimes called zig-zag join)
Join Using a Sorted Index(Zig-zag join)
• Consider R(X,Y) |><| S(Y,Z) where we have indexes on Y for both R and S.
• Tuples from R with a Y value that does not appear in S never need to be retrieved, and vice-versa…
[Figure: zig-zag scan between the index on Y in R and the index on Y in S]
Thank You!
• Questions?
15.7 Buffer Management
By Snigdha Rao Parvatneni
SJSU ID: 008648978
Class Roll Number: 124
Course: CS257
Agenda
• Introduction
• Role of Buffer Manager
• Architecture of Buffer Management
• Buffer Management Strategies
• Relation between Physical Operator Selection and Buffer Management
• Example
Introduction
• Generally, we assume that operators on relations have some main-memory buffers available to store the data they need.
• It is very rare that these buffers are allocated in advance to the operator.
• Task of assigning main memory buffers to processes is given to the Buffer Manager.
• The Buffer Manager is responsible for allocating the main memory to the process as per the need and minimizing the delays and unsatisfiable requests.
Role of Buffer Manager
• The Buffer Manager responds to requests for main-memory access to disk blocks.
[Figure: role of the buffer manager]
Architecture of Buffer Management
• There are two broad architectures for a buffer manager:
– The buffer manager controls main memory directly, as in many relational DBMSs.
– The buffer manager allocates buffers in virtual memory and lets the OS decide which buffers are in main memory and which are in the OS-managed disk swap space, as in many object-oriented DBMSs and main-memory DBMSs.
Problem
• Irrespective of the approach, the buffer manager has to limit the number of buffers in use so that they fit in the available main memory.
– When the buffer manager controls main memory directly
• If requests exceed the available space, the buffer manager has to select a buffer to empty by returning its contents to disk.
• Blocks that have not been changed are simply erased from main memory; blocks that have been changed are written back to their place on disk.
– When the buffer manager allocates space in virtual memory
• The buffer manager has the option of allocating more buffers than can actually fit into main memory. When all these buffers are in use, there will be thrashing.
• Thrashing is an operating-system problem where many blocks are moved in and out of the disk’s swap space, so the system ends up spending most of its time swapping blocks and getting very little work done.
Solution
• To resolve this problem, the number of buffers is set when the DBMS is initialized.
• Users need not worry about the mode of buffering used.
• For users there is a fixed-size buffer pool; in other words, a set of buffers is available to queries and to other database actions.
Buffer Management Strategies
• The Buffer Manager needs to make a critical choice of which block to keep and which block to discard when a buffer is needed for a newly requested block.
• To make this choice, the buffer manager uses a buffer-replacement strategy. Some common strategies are:
– Least-Recently-Used (LRU)
– First-In-First-Out (FIFO)
– The Clock Algorithm (Second Chance)
– System Control
Least-Recently Used (LRU)
• This rule is to throw out the block that has not been read or written for the longest time.
• To do this, the Buffer Manager needs to maintain a table which indicates the last time the block in each buffer was accessed.
• Each database access must also make an entry in this table, so a significant amount of effort is involved in maintaining this information.
• Buffers that have not been used for a long time are less likely to be accessed soon than buffers that have been accessed recently. Hence, it is an effective strategy.
First-In-First-Out (FIFO)
• In this rule, when a buffer is needed, the buffer that has been occupied the longest by the same block is emptied and used for the new block.
• To do this, the Buffer Manager needs to know only the time at which the block occupying a buffer was loaded into the buffer.
• An entry in the table is made when a block is read from disk, not every time it is accessed.
• Involves less maintenance than LRU, but it is more prone to mistakes: a heavily used block can be thrown out simply because it was loaded long ago.
The Clock Algorithm
• It is an efficient approximation of LRU and is commonly implemented.
• Buffers are treated as if arranged in a circle, with an arrow pointing to one of the buffers. The arrow rotates clockwise when it needs to find a buffer in which to place a disk block.
• Each buffer has an associated flag with value 0 or 1. Buffers with flag value 0 are vulnerable to having their contents sent back to disk, whereas buffers with flag value 1 are not.
• Whenever a block is read into a buffer or the contents of a buffer are accessed, the flag associated with it is set to 1.
Working of Clock’s Algorithm
• Whenever a buffer is needed for a block, the arrow looks for the first 0 it can find in the clockwise direction.
• As the arrow moves past a buffer, it changes that buffer’s flag from 1 to 0.
• A block is thrown out of its buffer only when it remains unaccessed, i.e. its flag stays 0, for the time between two rotations of the arrow:
– the first rotation, when the flag is set from 1 to 0, and
– the second rotation, when the arrow comes back to check the flag value.
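The behaviour of the arrow and the flags can be sketched as a small class. The class and method names are assumptions for illustration, and writing dirty blocks back to disk is left out.

```python
class ClockBufferPool:
    """Second-chance replacement over a fixed number of buffers."""

    def __init__(self, n_buffers):
        self.blocks = [None] * n_buffers   # block held by each buffer
        self.flags = [0] * n_buffers       # 1 = recently used, 0 = replaceable
        self.hand = 0                      # position of the arrow

    def access(self, block):
        """Access `block`; return the block evicted to make room, if any."""
        if block in self.blocks:
            self.flags[self.blocks.index(block)] = 1
            return None
        # Advance the arrow, clearing 1-flags, until a 0-flag is found.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        evicted = self.blocks[self.hand]
        self.blocks[self.hand] = block
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.blocks)
        return evicted
```

A block accessed between two visits of the arrow gets its flag reset to 1 and so survives the next rotation.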
System Control
• The query processor and other DBMS components can advise the buffer manager in order to avoid some of the mistakes that occur with LRU, FIFO, or Clock.
• Some blocks cannot be moved out of main memory without modifying other blocks that point to them. Such blocks are called pinned blocks.
• The Buffer Manager needs to modify its buffer-replacement strategy to avoid expelling pinned blocks. That is why some blocks remain in main memory even though there is no technical reason for not writing them to disk.
Relation Between Physical Operator Selection And Buffer Management
• The query optimizer selects the physical operators used to execute the query. Each physical operator expects a certain number of buffers for its execution.
• However, the Buffer Manager does not guarantee the availability of these buffers when the query is executed.
• In this situation two questions arise:
– Can an algorithm adapt to changes in the number of available main-memory buffers?
– When fewer buffers are available than expected, some blocks have to be put on disk instead of in main memory. How does the buffer-replacement strategy affect performance?
Example
• Block-based nested-loop join: the algorithm itself does not depend on the number of available buffers M, but its performance does.
• For each group of M-1 blocks of the outer-loop relation S, read the blocks into main memory and organize their tuples into a search structure whose key is the common attribute(s) of R and S.
• Then, for each block b of the inner-loop relation R, read b into main memory and, for each tuple t of b, find the tuples of S in main memory that join with t.
• The number of outer-loop iterations depends on the average number of buffers available at each iteration: S uses M-1 buffers, and one buffer is reserved for R.
• If we pin the M-1 blocks that we use for S in one iteration of the outer loop, then we cannot lose these buffers during that round. If more buffers become available, then more blocks of R can be kept in memory. Will it improve the running time?
Cases with LRU
• Case 1
– LRU is used as the buffer-replacement strategy and k buffers are available to hold blocks of R.
– R is read in order, so the blocks that remain in the buffers at the end of an iteration of the outer loop are the last k blocks of R.
– The next iteration starts again from the beginning of R. Therefore, all k buffers for R need to be replaced, and the extra buffers save nothing.
• Case 2
– With a better implementation of nested-loop join under LRU, visit the blocks of R in an order that alternates: first to last, then last to first.
– In this way we save k disk I/O’s on each iteration except the first.
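The difference between the two cases can be checked with a small simulation. This is a sketch with assumed names that counts only reads of R's blocks, with an `OrderedDict` acting as the LRU-managed set of k buffers.

```python
from collections import OrderedDict

def count_reads_of_r(n_blocks_r, k, iterations, alternate):
    """Count disk reads of R's blocks across outer-loop iterations under LRU."""
    cache = OrderedDict()               # block id -> None, least recent first
    reads = 0
    for it in range(iterations):
        order = range(n_blocks_r)
        if alternate and it % 2 == 1:
            order = reversed(order)     # scan last-to-first on odd iterations
        for b in order:
            if b in cache:
                cache.move_to_end(b)    # hit: mark as most recently used
            else:
                reads += 1              # miss: one disk I/O
                cache[b] = None
                if len(cache) > k:
                    cache.popitem(last=False)   # evict the least recently used
    return reads
```

With 10 blocks of R, k = 3 and 4 iterations, the forward-only scan performs 40 reads, while the alternating scan performs 31, a saving of k reads on each iteration after the first.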
With Other Algorithms
• Other algorithms are also affected by the fact that the availability of buffers can vary, and by the buffer-replacement strategy used by the buffer manager.
• In sort-based algorithms, when fewer buffers are available we can reduce the size of the sublists. The major limitation is that we may be forced to create so many sublists that we cannot allocate a buffer for each sublist in the merging process.
• In hash-based algorithms, when fewer buffers are available we can reduce the number of buckets, provided the buckets do not then become so large that they do not fit into the allotted main memory.
References
• DATABASE SYSTEMS: The Complete Book, Second Edition by Hector Garcia-Molina, Jeffrey D. Ullman & Jennifer Widom
Thank You
15.8 Algorithms using more than two passes
Presented By: Seungbeom Ma (ID 125)
Professor: Dr. T. Y. Lin
Computer Science Department
San Jose State University
Multipass Algorithms
• Previously, most of the algorithms required two passes.
• There are cases where we need more than two passes.
• Case: the data is too big for a two-pass algorithm with the available main memory.
– We have to hash or sort the relation with multipass algorithms.
Agenda
• 1. Multipass Sort-Based Algorithm • 2. Multipass Hash-Based Algorithm
Multipass sort-based algorithm.
• M: Number of memory buffers
• R: Relation
• B(R): Number of blocks holding the relation
• BASIS: If R fits in M blocks (B(R) <= M):
• 1. Read R into main memory.
• 2. Sort R in main memory with any sorting algorithm.
• 3. Write the sorted relation to disk.
Multipass sort-based algorithm.
• INDUCTION: (B(R) > M)
• 1. If R does not fit into main memory, partition the blocks holding R into M groups, called R1, R2, …, RM.
• 2. Recursively sort Ri for each i = 1, 2, …, M.
• 3. Once the sorting is done, merge the M sorted sublists.
Performance: Multipass Sort-Based Algorithms
1) Each pass of a sorting algorithm:
1. Reads the data from disk.
2. Sorts the data with any sorting algorithm.
3. Writes the data back to disk.
2-1) A k-pass sorting algorithm needs 2k B(R) disk I/O’s.
2-2) A k-pass sort-based join of R and S needs A + B disk I/O’s, where
A: 2(k-1)(B(R) + B(S)) [disk I/O’s to sort the sublists]
B: B(R) + B(S) [disk I/O’s to read the sorted sublists in the final pass]
Total: (2k-1)(B(R)+B(S)) disk I/O’s
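The two cost formulas above can be turned into a quick calculator (function names are assumptions for illustration):

```python
def sort_cost(b_r, k):
    """Disk I/O's for a k-pass sort of R: each pass reads and writes every block."""
    return 2 * k * b_r

def sort_join_cost(b_r, b_s, k):
    """Disk I/O's for a k-pass sort-based join of R and S."""
    sorting = 2 * (k - 1) * (b_r + b_s)   # A: k-1 passes that read and write
    final_read = b_r + b_s                # B: final pass only reads the sublists
    return sorting + final_read           # equals (2k-1)(B(R)+B(S))
```

For k = 2 the join cost reduces to 3(B(R)+B(S)).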
Multipass Hash-Based Algorithms
• 1. Hash the relations into M-1 buckets, where M is the number of memory buffers.
• 2. Unary case: apply the operation to each bucket individually.
– Operations: duplicate elimination (δ) and grouping (γ).
– 1) Grouping: MIN, MAX, COUNT, SUM, AVG, computed per group of the table
– 2) Duplicate elimination: DISTINCT
– Basis: if the relation fits in M memory blocks, read the relation into memory and perform the operation.
• 3. Binary case: apply the operation to each corresponding pair of buckets.
– Query operations: union, intersection, difference, and join
– Basis: if either relation fits in M-1 memory blocks, read that relation into M-1 blocks of main memory, read the other relation one block at a time into the Mth block, and then perform the operation.
INDUCTION
• If the relation (unary case) or both relations (binary case) do not fit into the main-memory buffers:
1. Hash each relation into M-1 buckets.
2. Recursively perform the operation on each bucket or corresponding pair of buckets.
3. Accumulate the output from each bucket or pair.
Hash-Based Algorithms: Unary Operators
Performance: Hash-Based Algorithms
• R: Relation.
• Operations are like δ and γ
• M: Buffers
• u(M, k): Number of blocks in the largest relation that a k-pass hashing algorithm can handle.
Performance: Induction
Induction:
1. Assume that the first step divides relation R into M-1 equal buckets.
2. The buckets for the next pass must be small enough to be handled in k-1 passes, i.e. each has at most u(M, k-1) blocks.
3. Since R is divided into M-1 buckets, we need u(M, k) = (M-1)u(M, k-1).
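The recurrence can be evaluated directly. The sketch below assumes the basis u(M, 1) = M (a relation that fits in the M buffers is handled in one pass), which matches the basis case of the algorithm; the function name is hypothetical.

```python
def max_blocks_hash(m, k):
    """u(M, k): largest relation (in blocks) a k-pass hash algorithm handles."""
    if k == 1:
        return m                           # assumed basis: relation fits in memory
    return (m - 1) * max_blocks_hash(m, k - 1)
```

Unrolling the recurrence gives u(M, k) = M(M-1)^(k-1), so each extra pass multiplies the capacity by M-1.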
Sort-Based VS Hash-Based
1. Sort-based algorithms can produce output in sorted order, and they may help reduce rotational latency or seek time.
2. Hash-based algorithms depend on the buckets being of equal size. For binary operations, the hash-based size limit applies only to the smaller relation. Therefore, hash-based can be faster than sort-based when one of the relations is small.
THANKS
David Le
CS257, ID: 126
Feb 28, 2013
15.9 Query Execution Summary
Overview
• Query Processing
• Outline of Query Compilation
• Table Scanning
• Cost Measures
• Review of Algorithms
– One-pass Methods
– Nested-Loop Join
– Two-pass (Sort-based, Hash-based)
– Index-based
– Multi-pass
Query Processing
[Figure: a query enters Query Compilation, which uses metadata to produce a query plan; Query Execution runs the plan against the data]
The query is compiled. This involves extensive optimization using operations of relational algebra.
It is first compiled into a logical query plan, e.g. using expressions of relational algebra.
It is then converted to a physical query plan, by selecting an implementation for each operator, ordering the joins, etc.
The query is then executed.
Outline of Query Compilation
[Figure: SQL query → Parse query → expression tree → Select logical plan → logical query plan tree → Select physical plan → physical query plan tree → Execute plan; the two plan-selection steps make up query optimization]
Parsing: A parse tree for the query is constructed.
Query Rewrite: The parse tree is converted to an initial query plan and transformed into a logical query plan.
Physical Plan Generation: The logical plan is converted into a physical plan by selecting algorithms and an order of execution.
Table Scanning
• There are two approaches for locating the tuples of a relation R:
– Table-scan: Get the blocks one by one.
– Index-scan: Use an index to lead us to all blocks holding R.
• Sort-scan takes a relation R and sorting specifications and produces R in sorted order. This can be accomplished with the SQL clause ‘ORDER BY’.
Estimates of cost are essential for query optimization.
They allow us to determine the slow and fast parts of a query plan.
Reading many consecutive blocks on a track is extremely important, since disk I/O’s are expensive in terms of time.
EXPLAIN SELECT * FROM a JOIN b on a.id = b.id;
Cost Measures
EXPLAIN SELECT snp.* FROM snp JOIN chr ON snp.chr_key = chr.chr_key WHERE snp_name <> ''
Cost Measures
Optimizing Queries:
One-pass Methods
Tuple-at-a-time: Selection and projection, which do not require an entire relation in memory at once.
Full-relation, unary operations: Must see all or most of the tuples in memory at once. These are the grouping and duplicate-elimination operators. A hash table (O(n) expected time) or a balanced binary search tree (O(n log n)) is used to speed up the detection of duplicates.
Full-relation, binary operations: These include union, intersection, difference, product, and join.
Review of Algorithms
Nested-Loop Joins
In a sense, it is ‘one-and-a-half’ passes, since one argument has its tuples read only once, while the other will be read repeatedly.
Can be used for relations of any size; neither relation has to fit entirely in main memory.
Two variations of nested-loop joins:
Tuple-based: Simplest form; can be very slow since it may take T(R)*T(S) disk I/O’s if we are joining R(x,y) with S(y,z).
Block-based: Organize access to both argument relations by blocks and use as much main memory as we can to store tuples.
Review of Algorithms
Two-pass Algorithms
Usually enough even for large relations.
Based on Sorting:
Partition the arguments into memory-sized, sorted sublists.
Sorted sublists are then merged appropriately to produce desired results.
Based on Hashing:
Partition the arguments into buckets. Useful if data is too big to store in memory.
Review of Algorithms
Two-pass Algorithms
Sort-based vs. Hash-based:
Hash-based algorithms are often superior to sort-based ones, since they require only one of the arguments to be small.
Sort-based algorithms work well when there is reason to keep some of the data sorted.
Review of Algorithms
Index-based Algorithms
Index-based joins are excellent when one of the relations is small, and the other has an index on join attributes.
Clustering and non-clustering indexes:
A clustering index has all tuples with a fixed value packed into the minimum number of blocks.
A clustered relation can have non-clustering indexes.
Review of Algorithms
Multi-pass Algorithms
Two-pass algorithms based on sorting or hashing can usually be generalized to three or more passes, and will then work for larger data sets.
Each pass of a sorting algorithm reads all data from disk and writes it out again.
Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.
Review of Algorithms
Database Systems: The Complete Book, 2nd Edition. Chapter 15, sections 1 to 9.
Reference
Thank You.
Concurrency Control
18.1 Serial and Serializable Schedules
Dona Baysa
ID: 127
CS 257
Spring 2013
Chapter 18.1 Topics
• Intro
– Concurrency Control
– Scheduler
• Serializability
• Schedules
– Serial and Serializable
Intro: Concurrency Control & Scheduler
• Concurrently executing transactions can cause inconsistent database state
• Concurrency Control assures transactions preserve consistency
• Scheduler:
– Regulates individual steps of different transactions
– Takes read/write requests from transactions and executes or delays them
Intro: Scheduler
• Transaction requests are passed to the Scheduler
• The Scheduler determines the execution of requests
[Figure: Transaction manager → (read/write requests) → Scheduler → (reads and writes) → Buffers]
Serializability
• How to assure that concurrently executing transactions preserve the correctness of the database state?
– Serializability: schedule transactions as if they were executed one at a time
– Determine a schedule
Schedules
• Schedule – sequence of important actions performed by transactions
– Actions: reads and writes
• Example: Transactions and actions
T1 T2
READ(A, t) READ(A, s)
t := t+100 s := s*2
WRITE(A,t) WRITE(A,s)
READ(B,t) READ(B,s)
t := t+100 s := s*2
WRITE(B,t) WRITE(B,s)
Serial Schedules
• All actions of one transaction are followed by all actions of another transaction, and so on.
• No mixing of actions
• Depends only on the order of transactions
• Serial schedules:
– T1 precedes T2
– T2 precedes T1
Serial Schedule: Example
• T1 precedes T2
– Notation: (T1, T2)
• Consistency constraint: A = B
• Final value: A = B = 250
– Consistency is preserved
T1            T2            A     B
READ(A, t)                  25    25
t := t+100
WRITE(A,t)                  125
READ(B,t)
t := t+100
WRITE(B,t)                        125
              READ(A, s)
              s := s*2
              WRITE(A,s)    250
              READ(B,s)
              s := s*2
              WRITE(B,s)          250
Serializable Schedules
• Serial schedules preserve consistency
• Are there any other schedules that also guarantee consistency?
– Serializable schedules
• Definition: A schedule S is serializable if there’s a serial schedule S’ such that for every initial database state, the effects of S and S’ are the same.
Serializable Schedule: Example
• Serializable, but not serial, schedule
• T2 acts on A after T1, but before T1 acts on B
• Effect is the same as that of the serial schedule (T1, T2)
T1            T2            A     B
                            25    25
READ(A, t)
t := t+100
WRITE(A,t)                  125
              READ(A, s)
              s := s*2
              WRITE(A,s)    250
READ(B,t)
t := t+100
WRITE(B,t)                        125
              READ(B,s)
              s := s*2
              WRITE(B,s)          250
Notation: Transactions and Schedules
• Transaction: Ti (for example T1, T2, …)
• Database element: X
• Actions: read/write
– rTi(X) = ri(X)
– wTi(X) = wi(X)
• Examples
– Transactions:
T1: r1(A); w1(A); r1(B); w1(B);
T2: r2(A); w2(A); r2(B); w2(B);
– Schedule:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
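The example schedules can be replayed in a few lines of Python to confirm that the interleaved schedule leaves the database in the same state as the serial schedule (T1, T2). This is an illustrative sketch: the encoding of actions as pairs and the name `run_schedule` are assumptions, with T1 adding 100 and T2 doubling, as in the example.

```python
def run_schedule(schedule, a=25, b=25):
    """Execute (transaction, action) steps; return the final (A, B)."""
    db = {"A": a, "B": b}
    local = {"T1": 0, "T2": 0}              # each transaction's variable t or s
    for txn, (kind, element) in schedule:
        if kind == "r":
            local[txn] = db[element]
        else:
            # T1 adds 100 and T2 doubles just before writing.
            local[txn] = local[txn] + 100 if txn == "T1" else local[txn] * 2
            db[element] = local[txn]
    return db["A"], db["B"]

T1 = [("T1", ("r", "A")), ("T1", ("w", "A")), ("T1", ("r", "B")), ("T1", ("w", "B"))]
T2 = [("T2", ("r", "A")), ("T2", ("w", "A")), ("T2", ("r", "B")), ("T2", ("w", "B"))]
serial = T1 + T2
interleaved = T1[:2] + T2[:2] + T1[2:] + T2[2:]
```

Both `run_schedule(serial)` and `run_schedule(interleaved)` end with A = B = 250, while the other serial order, T2 then T1, ends with A = B = 150, which is also consistent.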
Thank you!
Concurrency Control
18.2 Conflict Serializability
Geetha Ranjini Viswanathan
ID: 121
18.2 Conflict-Serializability
• 18.2.1 Conflicts
• 18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• 18.2.3 Why the Precedence-Graph Test Works
18.2.1 Conflicts
• Conflict - a pair of consecutive actions in a schedule such that, if their order is interchanged, the final state produced by the schedule is changed.
18.2.1 Conflicts
• Non-conflicting situations: Assuming Ti and Tj are different transactions, i.e., i ≠ j:
• ri(X); rj(Y) will never conflict, even if X = Y.
• ri(X); wj(Y) will not conflict for X ≠ Y.
• wi(X); rj(Y) will not conflict for X ≠ Y.
• wi(X); wj(Y) will not conflict for X ≠ Y.
18.2.1 Conflicts
• Conflicting situations: Three situations where actions may not be swapped:
• Two actions of the same transaction always conflict, e.g. ri(X); wi(Y)
• Two writes of the same database element by different transactions conflict: wi(X); wj(X)
• A read and a write of the same database element by different transactions conflict: ri(X); wj(X) and wi(X); rj(X)
18.2.1 Conflicts
• Conclusions: Any two actions of different transactions may be swapped unless:
– They involve the same database element, and
– At least one is a write
• The schedules S and S’ are conflict-equivalent, if S can be transformed into S’ by a sequence of non-conflicting swaps of adjacent actions.
• A schedule is conflict-serializable if it is conflict-equivalent to a serial schedule.
18.2.1 Conflicts
• Example 18.6
Conflict-serializable schedule
S: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
The above schedule is converted to the serial schedule S’ (T1, T2) through a sequence of swaps:
r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B);
r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B);
r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B);
S’: r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B);
18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• Given a schedule S, involving transactions T1 and T2, T1 takes precedence over T2 (T1 <S T2), if there are actions A1 of T1 and A2 of T2, such that:
• A1 is ahead of A2 in S,
• Both A1 and A2 involve the same database element, and
• At least one of A1 and A2 is a write action
• We cannot swap the order of A1 and A2.
• A1 will appear before A2 in any schedule that is conflict-equivalent to S.
• A conflict-equivalent serial schedule must have T1 before T2.
18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• Precedence graph:
– Nodes represent transactions of S
– Arc from node i to node j, if Ti <S Tj
• Example 18.7
Given schedule S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);
Precedence graph: T1 → T2 → T3
Acyclic graph, so the schedule is conflict-serializable
18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• Example 18.8
S: r2(A); r1(B); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);
Convert S to serial schedule S’ (T1, T2, T3).
r1(B); r2(A); w2(A); r3(A); w1(B); w3(A); r2(B); w2(B);
r1(B); r2(A); w2(A); w1(B); r3(A); w3(A); r2(B); w2(B);
r1(B); r2(A); w1(B); w2(A); r3(A); w3(A); r2(B); w2(B);
r1(B); w1(B); r2(A); w2(A); r3(A); w3(A); r2(B); w2(B);
r1(B); w1(B); r2(A); w2(A); r3(A); r2(B); w3(A); w2(B);
r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w3(A); w2(B);
r1(B); w1(B); r2(A); w2(A); r2(B); r3(A); w2(B); w3(A);
r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);
S’: r1(B); w1(B); r2(A); w2(A); r2(B); w2(B); r3(A); w3(A);
18.2.2 Precedence Graphs and a Test for Conflict-Serializability
• Example 18.9: Given schedule
S: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B);
Precedence graph: T1 → T2, T2 → T1, T2 → T3
The graph is cyclic (T1 → T2 → T1), so the schedule is NOT conflict-serializable.
18.2.3 Why the Precedence-Graph Test Works
• Consider a cycle involving n transactions T1 → T2 → ... → Tn → T1
• In the hypothetical serial order, the actions of T1 must precede those of T2, which precede those of T3, and so on, up to Tn.
• But the actions of Tn, which therefore come after those of T1, are also required to precede those of T1 because of the arc Tn → T1. No serial order can satisfy both constraints.
• Thus, if there is a cycle in the precedence graph, then the schedule is not conflict-serializable.
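The precedence-graph test can be sketched in a few lines of Python. This is only an illustration: the function names (parse, precedence_graph, has_cycle, conflict_serializable) and the string format for schedules are assumptions made for this sketch, not part of the slides.

```python
from itertools import combinations

def parse(schedule):
    """Turn 'r2(A); w1(B);' into [('r', 2, 'A'), ('w', 1, 'B')]."""
    actions = []
    for token in schedule.replace(" ", "").split(";"):
        if token:
            paren = token.index("(")
            actions.append((token[0], int(token[1:paren]), token[paren + 1:-1]))
    return actions

def precedence_graph(actions):
    """Arc (i, j) whenever an action of Ti conflicts with a later action of Tj."""
    arcs = set()
    # combinations() keeps schedule order, so the first action is always the earlier one
    for (op1, t1, x1), (op2, t2, x2) in combinations(actions, 2):
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            arcs.add((t1, t2))
    return arcs

def has_cycle(arcs):
    """Depth-first search; a cycle exists if we reach a node that is still open."""
    state = {}
    def visit(n):
        state[n] = "open"
        for a, b in arcs:
            if a == n and (state.get(b) == "open" or (b not in state and visit(b))):
                return True
        state[n] = "done"
        return False
    nodes = {n for arc in arcs for n in arc}
    return any(n not in state and visit(n) for n in nodes)

def conflict_serializable(schedule):
    return not has_cycle(precedence_graph(parse(schedule)))
```

Applied to the schedules of Examples 18.7 and 18.9, the sketch reports the first as conflict-serializable and the second as not.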
Questions?
Shailesh Padave
ID 111
CS257 Spring 2013
18.3 Enforcing Serializability by locks
INTRODUCTION
• Enforcing serializability by locks
  – Locks
  – Locking scheduler
  – Two phase locking
Locks
• The most common architecture for a scheduler is one in which locks are maintained on database elements to prevent unserializable behavior.
• It works as follows:
  – A transaction issues a request.
  – The scheduler checks the lock table to guide its decision.
  – The scheduler generates a serializable schedule of actions.
Consistency of transactions
• Actions and locks must relate to each other in expected ways:
  – A transaction can read or write a database element only if it holds a lock on that element.
  – A transaction must eventually unlock every element it has locked.
• Legality of schedules
  – No two transactions can hold a lock on the same element at the same time; the first must release its lock before the second can acquire it.
Locking scheduler
• Grants a lock request only if granting it results in a legal schedule.
• Lock table stores the information about current locks on the elements.
• Consider
  – li(X): Transaction Ti requests a lock on database element X
  – ui(X): Transaction Ti releases its lock on database element X
Locking scheduler (contd.)
• A legal schedule of consistent transactions; unfortunately, it is not serializable.
T1                  T2                  A      B
                                        25     25
l1(A); r1(A);
A:=A+100;
w1(A); u1(A);                           125
                    l2(A); r2(A);
                    A:=A*2;
                    w2(A); u2(A);       250
                    l2(B); r2(B);
                    B:=B*2;
                    w2(B); u2(B);              50
l1(B); r1(B);
B:=B+100;
w1(B); u1(B);                                  150
Locking scheduler (contd.)
• The locking scheduler delays requests in order to maintain a consistent database state.
T1                      T2                      A      B
                                                25     25
l1(A); r1(A);
A:=A+100;
w1(A); l1(B); u1(A);                            125
                        l2(A); r2(A);
                        A:=A*2;
                        w2(A);                  250
                        l2(B); Denied
r1(B); B:=B+100;
w1(B); u1(B);                                          125
                        l2(B); u2(A); r2(B);
                        B:=B*2;
                        w2(B); u2(B);                  250
Two-phase locking(2PL)
• Guarantees a legal schedule of consistent transactions is conflict-serializable.
• All lock requests precede all unlock requests.
• The growing phase:
  – Obtain all the locks; no unlocks allowed.
• The shrinking phase:
  – Release all the locks; no new locks allowed.
Working of Two-Phase locking
• Assures serializability.
• Two protocols for 2PL:
  – Strict two phase locking: Transaction holds all its write locks till commit / abort.
  – Rigorous two phase locking: Transaction holds all locks till commit / abort.
• Two phase transactions are ordered in the same order as their first unlocks.
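A minimal sketch of the two-phase condition, assuming a transaction is given as a list of action strings in the l/u/r/w notation used above. The function name is_two_phase is hypothetical.

```python
def is_two_phase(actions):
    """Return True if no lock request follows an unlock request.

    actions: one transaction's actions in order, e.g. ['l1(A)', 'r1(A)', 'u1(A)'].
    Only actions starting with 'l' (lock) or 'u' (unlock) matter here.
    """
    seen_unlock = False
    for action in actions:
        if action.startswith("u"):
            seen_unlock = True          # the shrinking phase has begun
        elif action.startswith("l") and seen_unlock:
            return False                # a lock after an unlock violates 2PL
    return True
```

For example, l1(A); r1(A); w1(A); u1(A); l1(B); ... is not two-phase, while l1(A); r1(A); w1(A); l1(B); u1(A); ... is.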
Two Phase Locking
[Figure: locks acquired by a transaction plotted against time, with the instant marked at which the transaction may be thought to execute.]
Every two-phase-locked transaction has a point at which it may be thought to execute instantaneously.
Failure of 2PL.
• 2PL does not protect against deadlocks.
• T1: l1(A); r1(A); A:=A+100; w1(A); l1(B); u1(A); r1(B); B:=B+100; w1(B); u1(B);
• T2: l2(B); r2(B); B:=B*2; w2(B); l2(A); u2(B); r2(A); A:=A*2; w2(A); u2(A);
T1                  T2                  A      B
                                        25     25
l1(A); r1(A);
                    l2(B); r2(B);
A:=A+100;
                    B:=B*2;
w1(A);                                  125
                    w2(B);                     50
l1(B); Denied
                    l2(A); Denied
Neither transaction can proceed: each is waiting for a lock that the other holds.
Thank You
18. Concurrency Control
18.4 Locking Systems With Several Lock Modes
by Kiruthika Sivaraman
ID: 129008663109
Introduction
• Problem:
  – A transaction T must take a lock on a database element X even if it only wants to read X and not write it.
  – This prevents another transaction from modifying the element while T reads it, which would cause unserializable behavior.
• Therefore we have two kinds of lock:
  – Shared lock (“read lock”)
  – Exclusive lock (“write lock”)
Lock Types
• Shared Lock or Read Lock
  – To read database element X we use a shared lock.
  – There can be more than one shared lock on X.
• Exclusive Lock or Write Lock
  – To write to database element X we use an exclusive lock.
  – There can be only one exclusive lock on X.
Notations Used
• sli(X) – Transaction Ti requests shared lock on database element X.
• xli(X) – Transaction Ti requests exclusive lock on database element X.
• ui(X) – Transaction Ti unlocks X.
Requirements
• Consistency of transactions
  – A transaction may not write without an exclusive lock and may not read without some kind of lock.
• Two phase locking of transactions
  – Locking must precede unlocking.
  – xli(X) or sli(X) cannot be preceded by ui(Y) for any Y.
• Legality of schedules
  – An element can be locked exclusively by one transaction or by several in shared mode, but not both.
Compatibility Matrices
• Compatibility matrix describes lock management policy.
• Has row and column for each lock mode
• Row corresponds to lock held on element X by a transaction.
• Column corresponds to mode of lock requested on X.
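As an illustration, the compatibility matrix for shared and exclusive locks can be written as a lookup table, following the legality rule stated earlier (several shared locks, or one exclusive lock, but not both). The names COMPATIBLE and can_grant are assumptions made for this sketch.

```python
# COMPATIBLE[(held, requested)]: row = lock held by another transaction,
# column = lock mode being requested on the same element.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """A request is granted only if it is compatible with every lock
    that other transactions currently hold on the element."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)
```

With no locks held, any request is granted; with two shared locks held, a third shared request is granted but an exclusive request is not.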
Upgrading Locks
• Transaction T first takes a shared lock on X.
• Later, when T is ready to write, it upgrades its lock to an exclusive lock on X.
• ui(X) releases all locks that transaction Ti holds on X.
• In this way transaction T remains friendly to other transactions: they can still read X while T holds only a shared lock.
Upgrading Locks - Drawback
• T1 first establishes a shared lock on X.
• T2 also establishes a shared lock on X.
• T1 tries to upgrade its lock to exclusive lock.
• T2 also tries the same.
• Deadlock!
Update Locks
• Update locks are used to avoid the upgrade deadlock.
• An update lock is similar to a shared lock: it gives the transaction only the privilege to read X. The difference is that only an update lock can later be upgraded to an exclusive lock; a shared lock cannot.
• Once a transaction holds an update lock on X, no other locks are granted on X.
Increment Lock
• Two transactions can hold an increment lock on a database element at the same time.
• Useful when the order of the writes is not important, because increments commute:
INC(A,2); INC(A,10) leaves A with the same value as INC(A,10); INC(A,2)
Increment Lock Compatibility Matrix
THANK YOU!
Akash Patel
Concurrency Control
Architecture for a Locking Scheduler
Section 18.5
Presented By: Akash Patel
ID: 113
Overview
• Overview of Locking Scheduler
• Scheduler That Inserts Lock Actions
• The Lock Table
• Handling Locking and Unlocking Request
Principles of simple scheduler architecture
• The transactions themselves do not request locks, or cannot be relied upon to do so. It is the job of the scheduler to insert lock actions into the stream of reads, writes, and other actions that access data.
• Transactions do not release locks. Rather, the scheduler releases the locks when the transaction manager tells it that the transaction will commit or abort.
Scheduler That Inserts Lock Actions into the transactions request stream
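[Figure: requests from transactions (Read(A); Write(B); Commit(T)) enter Scheduler, Part I, which passes lock and access actions (Lock(A); Read(A)) to Scheduler, Part II. Part II consults the lock table and passes Read(A); Write(B) on to the database.]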
• The scheduler maintains a lock table, which, although it is shown as secondary-storage data, may be partially or completely in main memory
• Actions requested by a transaction are generally transmitted through the scheduler and executed on the database.
• Under some circumstances a transaction is delayed, waiting for a lock, and its requests are not (yet) transmitted to the database.
The two parts of the scheduler
• Part I takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of all database-access operations, such as read, write, increment, or update.
• Part II takes the sequence of lock and database-access actions passed to it by Part I, and executes each appropriately.
Part II determines the transaction T to which each action belongs and the status of T (delayed or not). If T is not delayed, then:
1. A database-access action is transmitted to the database and executed.
2. If a lock action is received by Part II, it checks the lock table to see whether the lock can be granted.
  – Granted: the lock table is modified to include the granted lock.
  – Not granted: the lock table is updated to record the requested lock, and Part II delays transaction T.
3. When T commits or aborts, Part I is notified by the transaction manager and releases all locks held by T. If any transactions are waiting for these locks, Part I notifies Part II.
4. When Part II is notified that a lock on some database element is available, it determines the next transaction T’ to get that lock and allows it to continue.
The Lock Table
• A relation that associates database elements with locking information about that element
• Implemented with a hash table using database elements as the hash key
• Size is proportional to the number of locked elements only, not to the size of the entire database
DB element A
Lock information for A
Lock table Entry Field
• Group Mode
  – S means that only shared locks are held.
  – U means that there is one update lock and perhaps one or more shared locks.
  – X means there is one exclusive lock and no other locks.
• Waiting
  – The waiting bit tells that there is at least one transaction waiting for a lock on A.
• A list
  – A list describing all those transactions that either currently hold locks on A or are waiting for a lock on A.
Handling Lock Requests
• Suppose transaction T requests a lock on A
• If there is no lock table entry for A, then there are no locks on A, so create the entry and grant the lock request
• If the lock table entry for A exists, use the group mode to guide the decision about the lock request
1) If group mode is U (update) or X (exclusive), no other lock can be granted
  • Deny the lock request by T
  • Place an entry on the list saying T requests a lock
  • Set Wait? = ‘yes’
2) If group mode is S (shared), another shared or update lock can be granted
  • Grant the request for an S or U lock
  • Create an entry for T on the list with Wait? = ‘no’
  • Change group mode to U if the new lock is an update lock
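The request-handling rules above can be sketched as a small lock table keyed by database element. The class and field names (Entry, LockTable, group_mode, waiting, requests) are illustrative assumptions, and the sketch ignores the case of a transaction re-requesting a lock it already holds.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    group_mode: str                                 # 'S', 'U' or 'X'
    waiting: bool = False                           # the waiting bit
    requests: list = field(default_factory=list)    # (transaction, mode, wait?)

class LockTable:
    def __init__(self):
        self.entries = {}                           # hash table: element -> Entry

    def request(self, txn, element, mode):
        """Return True if the lock is granted, False if txn must wait."""
        entry = self.entries.get(element)
        if entry is None:
            # No entry means no locks on the element: create it and grant.
            self.entries[element] = Entry(mode, False, [(txn, mode, False)])
            return True
        if entry.group_mode == "S" and mode in ("S", "U"):
            # Group mode S: another shared or update lock can be granted.
            entry.requests.append((txn, mode, False))
            if mode == "U":
                entry.group_mode = "U"
            return True
        # Group mode U or X (or an X request under S): deny and record the wait.
        entry.requests.append((txn, mode, True))
        entry.waiting = True
        return False
```

Because the table only holds entries for elements that have been requested, its size follows the number of locked elements rather than the size of the database.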
How to deal with existing Lock
Handling Unlock Requests
• Now suppose transaction T unlocks A
• Delete T’s entry on the list for A
• If T’s lock is not the same as the group mode, no need to change group mode
• Otherwise check entire list for new group mode
Handling Unlock Requests
• If the value of Waiting is “yes”, we need to grant one or more locks using one of the following approaches:
  – First-Come-First-Served:
    • Grant the lock to the longest waiting request.
    • No starvation (waiting forever for a lock)
  – Priority to Shared Locks:
    • Grant all S locks waiting, then one U lock.
    • Grant an X lock if no others are waiting
  – Priority to Upgrading:
    • If there is a U lock waiting to upgrade to an X lock, grant that first.
Thank you
18. Concurrency Control
18.6 Hierarchies of Database Elements
Mehal Patel (ID: 114)
Hierarchies of Database Elements
• Two problems arise when tree structures in data are encountered; each raises the question of how to provide locking.
1) Hierarchy of lockable elements (this section)
  - Large elements (relations)
  - Smaller elements within relations (blocks or individual tuples)
2) Data itself organized as a tree (next section, 18.7)
Locks with Multiple Granularity
• A database element can be a relation, block or a tuple
• Different Database Systems use different size of data base elements for locking purpose
Example
• Consider a database for a bank
  – Choosing relations as database elements means one lock for an entire relation
  – If we were dealing with a relation holding account balances, this kind of lock would be very inflexible and thus provide very little concurrency
  – This means only one deposit or withdrawal can be made at a time
  – A better way is to lock individual blocks or pages, so that two accounts in different blocks can be updated simultaneously
Warning (Intention) Locks
• This protocol helps manage locks at different levels of the hierarchy
Database Elements Organized in Hierarchy
Warning Protocol Rules
• These involve both ordinary (S and X) and warning (IS and IX) locks
• The following rules are followed:
  – Begin at the root of the hierarchy
  – Request the S or X lock if we are at the desired element
  – If the desired element is further down the hierarchy, place a warning lock (IS if S and IX if X)
  – When the warning lock is granted, proceed to the child node and repeat the above steps until the desired node is reached
Uses Compatibility matrix
Compatibility Matrix for Shared, Exclusive and Intention Locks
IS IX S X
IS Yes Yes Yes No
IX Yes Yes No No
S Yes No Yes No
X No No No No
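A short sketch of how the warning protocol uses this matrix. The names COMPAT, locks_for and can_acquire, and the node labels in the usage note, are assumptions for illustration only.

```python
# COMPAT[held][requested], copied from the matrix above
# (row = lock held by another transaction, column = lock requested).
COMPAT = {
    "IS": {"IS": True,  "IX": True,  "S": True,  "X": False},
    "IX": {"IS": True,  "IX": True,  "S": False, "X": False},
    "S":  {"IS": True,  "IX": False, "S": True,  "X": False},
    "X":  {"IS": False, "IX": False, "S": False, "X": False},
}

def locks_for(path, mode):
    """Locks needed to take `mode` ('S' or 'X') on the last node of a
    root-to-target path: warning locks on the ancestors, then the real lock."""
    warning = "IS" if mode == "S" else "IX"
    return [(node, warning) for node in path[:-1]] + [(path[-1], mode)]

def can_acquire(held, path, mode):
    """held maps each node to the lock modes other transactions hold on it."""
    return all(
        COMPAT[h][needed]
        for node, needed in locks_for(path, mode)
        for h in held.get(node, [])
    )
```

For instance, if one transaction holds IS on a relation and S on one of its tuples, another transaction can still take IX on the relation and X on a different tuple, but not X on the same tuple.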
Example
• Movie(title, year, length, studioName)
Transaction T1
  - SELECT * FROM Movie WHERE title = ‘King Kong’
Transaction T2
  - UPDATE Movie SET year = 1939 WHERE title = ‘Gone with the Wind’
• If T2 had updated the tuple for ‘King Kong’, it would have to wait until T1 releases its S lock.
• S and X are not compatible
Phantoms and Handling Insertions Correctly
• The phantom problem arises when transactions create new sub-elements of a lockable element
• Since we can lock only existing elements, the new elements fail to be locked
• An example follows
Example
• Consider a transaction T3
– Select sum(length) from movie where studioName =‘Disney’
– This calculates the total length of all tuples having studioName=Disney
– Thus, T3 acquires IS for relation and S for targeted tuples
– Now, if another transaction T4 inserts a new tuple for studioName = ‘Disney’, the result of T3 becomes incorrect
…contd
• By itself this is not a concurrency problem, since the outcome is equivalent to the serial order (T3, T4)
• Now consider T4 writing a new tuple while T3 is reading tuples of the same relation:
  – r3(d1); r3(d2); w4(d3); w4(X); w3(L); w3(X)
  – We want the total length of all movies for a given studioName; the outcome should be consistent.
  – The above scenario is not consistent. T3 will not get the correct sum because a lock on the new element (d3) was not obtained.
…contd
• This problem arises because the relation has a phantom tuple (the newly inserted tuple), which should have been locked but did not exist when T3 acquired its locks.
• The occurrence of phantoms can be avoided if all insertion and deletion transactions are treated as write operations (exclusive lock, X) on the whole relation.
Thank You
18.7 THE TREE PROTOCOL
MALLIKA PEREPA
ID: 115
Outline-
• Introduction
• Motivation
• Rules for Access to Tree-Structured Data
• Why the Tree Protocol Works
Introduction-
• In this section, we deal with tree structures that are formed by the link pattern.
– Database elements are disjoint pieces of data, but the only way to get to a node is through its parent.
– B-trees are an important example of this sort of data.
– Traversing a particular path to an element gives us freedom to manage locks differently from the two-phase locking approaches we have seen previously.
B-tree Details-
• Basic DS:
  - Keeps records in sorted order; allows searches, sequential access, insertions and deletions in logarithmic time.
  - It is used in databases and file systems.
• Locking structure:
  - Granularity is at node level. Treating smaller pieces as elements is not beneficial, and treating the entire B-tree as a single element is infeasible!
Motivation for Tree based locking-
• If we use a standard set of lock modes (shared, update and exclusive) and two phase locking, then concurrent use of the B-tree is almost impossible.
• The reason is that every transaction must begin by locking the root node of the B-tree.
• Any transaction that inserts or deletes could wind up rewriting the root of the B-tree.
• Thus, only one transaction that is not read-only can access the B-tree at any time
• However, in most situations a B-tree node will not be rewritten, even if the transaction inserts or deletes a tuple.
• Thus, as soon as a transaction moves to a child of the root and observes the situation that rules out a rewrite of the root, we would like to release the lock on the root.
• Releasing the lock on the root early will violate two phase locking, so we cannot be sure that the schedule of several transactions accessing the B-tree will be serializable.
• The solution is a specialized protocol for transactions that access tree structured data like B-trees.
Rules for Access to Tree-Structured Data-
• The following restrictions on locks form the tree protocol.
Assumptions:
  – There is only ONE kind of lock, represented by the form li(X)
  – Transactions are consistent, and schedules must be legal, but there is no two-phase locking requirement on transactions.
Rules:
  – A transaction’s first lock may be at any node of the tree
  – Subsequent locks may only be acquired if the transaction currently has a lock on the parent node.
  – Nodes may be unlocked at any time.
  – A transaction may not relock a node on which it has released a lock, even if it still holds a lock on the node’s parent
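The four rules can be checked mechanically for a single transaction. The sketch below assumes the tree is given as a child-to-parent map and the transaction as a list of ('l', node) and ('u', node) pairs; the function name obeys_tree_protocol is hypothetical.

```python
def obeys_tree_protocol(actions, parent):
    """Check one transaction's lock/unlock sequence against the tree protocol.

    actions: list of ('l', node) or ('u', node) in the order issued.
    parent:  dict mapping each non-root node to its parent.
    """
    held, released = set(), set()
    first_lock = True
    for op, node in actions:
        if op == "l":
            if node in released:
                return False        # may not relock a node it has released
            if not first_lock and parent.get(node) not in held:
                return False        # later locks need a lock on the parent
            held.add(node)
            first_lock = False      # only the first lock may be anywhere
        elif op == "u":
            if node not in held:
                return False        # consistency: cannot unlock what is not held
            held.remove(node)
            released.add(node)
    return True
```

Note that the check deliberately contains no two-phase test: a transaction may lock a child after unlocking some other node, as long as it still holds the child's parent.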
Example…
A tree of lockable elements
[Figure: A is the root, with children B and C; B has children D and E; E has children F and G.]
3 transactions following the protocol…
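Each row is one step, in time order, labeled with the transaction that performs it.
T1: l1(A); r1(A);
T1: l1(B); r1(B);
T1: l1(C); r1(C);
T1: w1(A); u1(A);
T1: l1(D); r1(D);
T1: w1(B); u1(B);
T2: l2(B); r2(B);
T3: l3(E); r3(E);
T1: w1(D); u1(D);
T1: w1(C); u1(C);
T2: l2(E) denied
T3: l3(F); r3(F);
T3: w3(F); u3(F);
T3: l3(G); r3(G);
T3: w3(E); u3(E);
T2: l2(E); r2(E);
T3: w3(G); u3(G);
T2: w2(B); u2(B);
T2: w2(E); u2(E);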
Why Tree Protocol works-
• The tree protocol implies a serial order on the transactions involved in a schedule.
• The order of precedence can be defined as: Ti <S Tj if, in a schedule S, the transactions Ti and Tj lock a node in common, and Ti locks the node first.
• If the precedence graph drawn from the precedence relations defined above has no cycles, then we claim that any topological order of the transactions is an equivalent serial schedule.
• For example, either (T1, T3, T2) or (T3, T1, T2) is an equivalent serial schedule. The reason for this serial order is that all the nodes are touched in the same order as they are originally scheduled.
• If two transactions lock several elements in common, then they are all locked in same order.
Precedence graph derived from schedule: T1 → T2 and T3 → T2
Example: Path of elements locked by two transactions
[Figure: a path of elements locked by two transactions T and U, marking which transaction locks each node first.]
• Now consider an arbitrary set of transactions T1, T2, ..., Tn that obey the tree protocol and lock some of the nodes of a tree according to schedule S.
• First consider those transactions that lock the root; they lock it in some order.
• If Ti locks the root before Tj, then Ti locks every node it has in common with Tj before Tj does. That is, Ti <S Tj, but not Tj <S Ti.
THANK YOU
CS-257
CONCURRENCY CONTROL
SECTION 18.8
Timestamps
What is Timestamping?
The scheduler assigns each transaction T a unique number, its timestamp TS(T).
Timestamps must be issued in ascending order, at the time when a transaction first notifies the scheduler that it is beginning.
Timestamp TS(T)
Two methods of generating timestamps:
  – Use the value of the system clock as the timestamp.
  – Use a logical counter that is incremented after a new timestamp has been assigned.
Whichever method is used, the scheduler maintains a table of currently active transactions and their timestamps.
Timestamps for database element X and commit bit
RT(X):- The read time of X, which is the highest timestamp of a transaction that has read X.
WT(X):- The write time of X, which is the highest timestamp of a transaction that has written X.
C(X):- The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.
Physically Unrealizable Behavior
Read too late: A transaction U that started after transaction T wrote a value for X before T reads X.
[Figure: timeline in order: T start, U start, U writes X, T reads X.]
Physically Unrealizable Behavior
Write too late: A transaction U that started after T read X before T got a chance to write X.
[Figure: Transaction T tries to write too late. Timeline in order: T start, U start, U reads X, T writes X.]
Dirty Read
It is possible that after T reads the value of X written by U, transaction U will abort.
[Figure: T could perform a dirty read if it reads X when shown. Timeline in order: U start, T start, U writes X, T reads X, U aborts.]
Thomas Write Rule
Too late to repair the damage
[Figure: timeline in order: T start, U start, U writes X, T writes X, T commits, U aborts.]
Rules for Timestamps-Based scheduling
The scheduler, in response to a read or write request from a transaction T has the choice of:
a) Granting the request,
b) Aborting T (if T would violate physical reality) and restarting T with a new timestamp (abort followed by restart is often called rollback), or
c) Delaying T and later deciding whether to abort T or to grant the request(if the request is a read, and the read might be dirty)
Rules for Timestamps-Based scheduling
Scheduler receives a request rT(X).
• If TS(T) ≥ WT(X), the read is physically realizable.
  – If C(X) is true, grant the request. If TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X).
  – If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts.
• If TS(T) < WT(X), the read is physically unrealizable. Rollback T.
Rules for Timestamps-Based scheduling (Cont.)
Scheduler receives a request wT(X).
• If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed.
  – Write the new value for X,
  – Set WT(X) := TS(T), and
  – Set C(X) := false.
• If TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later value in X.
  – If C(X) is true, then the previous writer of X is committed, and we ignore the write by T.
  – If C(X) is false, we must delay T.
• If TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.
Rules for Timestamps-Based scheduling (Cont.)
Scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.
Scheduler receives a request to abort T or decides to rollback T, then any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
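The read and write rules can be sketched as two functions that return the scheduler's decision. This is a simplified illustration: the names Element, read and write are assumptions, elements are assumed to start with RT = WT = 0 and the commit bit set, and the sketch does not treat a transaction reading its own uncommitted write as a special case.

```python
from dataclasses import dataclass

@dataclass
class Element:
    rt: int = 0              # RT(X): highest timestamp that has read X
    wt: int = 0              # WT(X): highest timestamp that has written X
    committed: bool = True   # C(X): has the most recent writer committed?

def read(ts, x):
    """Decision for request rT(X), where ts = TS(T)."""
    if ts < x.wt:
        return "rollback"            # read too late
    if not x.committed:
        return "delay"               # possible dirty read
    x.rt = max(x.rt, ts)
    return "grant"

def write(ts, x):
    """Decision for request wT(X), where ts = TS(T)."""
    if ts < x.rt:
        return "rollback"            # write too late
    if ts >= x.wt:
        x.wt, x.committed = ts, False
        return "grant"
    # A later write already exists: ignore it if committed (Thomas write rule).
    return "ignore" if x.committed else "delay"
```

On commit, the scheduler would set the commit bit of every element the transaction wrote, which is what turns a "delay" into either "grant" or "ignore" on the retried request.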
Three transactions executing under a timestamp-based scheduler
T1       T2       T3       A         B         C
200      150      175      RT=0      RT=0      RT=0
                           WT=0      WT=0      WT=0
r1(B);                               RT=200
         r2(A);            RT=150
                  r3(C);                       RT=175
w1(B);                               WT=200
w1(A);                     WT=200
         w2(C);
         Abort
                  w3(A);
Timestamps and Locking
Generally, timestamping performs better than locking in situations where:
  – Most transactions are read-only.
  – It is rare that concurrent transactions will try to read and write the same element.
In high-conflict situations, locking performs better than timestamps.
The argument for this rule of thumb is:
  – Locking will frequently delay transactions as they wait for locks.
  – But if concurrent transactions frequently read and write elements in common, then rollbacks will be frequent in a timestamp scheduler, introducing even more delay than a locking system.
Thank You
?
Concurrency control by validation
Anusha Damodaran
ID : 130
CS 257 : Database System Principles
Section 18.9
At a Glance
• What is Validation?
• Architecture of Validation based Scheduler
• Validation Rules
• Comparison between Concurrency Control Mechanisms
Validation (p1)
• Another type of optimistic concurrency control
• Allows transactions to access data without locks
• Validation scheduler: keeps a record of what active transactions are doing
• A transaction goes through a ‘validation phase’ before it starts to write values of database elements
• If there is physically unrealizable behavior, the transaction is rolled back
18.9.1 Architecture of Validation based Scheduler (p1)
Scheduler must be told for each transaction T
– Read set, RS(T): set of database elements T reads
– Write set, WS(T): set of database elements T writes
– Three phases of the validation scheduler
  • Read
    – Transaction reads from the database all elements in its read set.
    – Also computes in its local address space all the results it is going to write.
  • Validate
    – Validates the transaction by comparing its read and write sets with those of other transactions.
    – If validation fails, the transaction is rolled back; otherwise it proceeds to the write phase.
  • Write
    – Writes to the database its values for the elements in its write set.
Validation based Scheduler
– Scheduler has an assumed serial order of the transactions to work with.
– Maintains three sets
  • START: set of transactions that have started but not yet completed validation
    – START(T): time at which T started
  • VAL: set of transactions that have been validated but have not yet finished the writing of phase 3
    – START(T) and VAL(T), the time at which T validated
  • FIN: set of transactions that have completed phase 3
    – START(T), VAL(T), and FIN(T), the time at which T finished
18.9.2 Validation Rules
– Case 1:
  • U is in VAL or FIN, that is, U is validated
  • FIN(U) > START(T), that is, U did not finish before T started
  • RS(T) ∩ WS(U) is not empty (let it contain database element X)
  • Since we don’t know whether or not T got to read U’s value, we must rollback T to avoid a risk that the actions of T and U will not be consistent with the assumed serial order.
[Figure: timeline: U starts, T starts, U validated, T validating; T reads X and U writes X.]
18.9.2 Validation Rules
– Case 2:
  • U is in VAL, i.e. U has successfully validated
  • FIN(U) > VAL(T), i.e. U did not finish before T entered its validation phase
  • WS(T) ∩ WS(U) is not empty (let database element X be in both write sets)
  • T and U must both write values of X, and if we let T validate, it is possible that it will write X before U does. Since we cannot be sure, we rollback T to make sure it does not violate the assumed serial order in which it follows U.
[Figure: timeline: U validated, T validating, U finish; T writes X and U writes X.]
Rules for Validating a transaction T
• Check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e., if FIN(U) > START(T).
• Check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e., if FIN(U) > VAL(T).
1. RS(T) ∩ WS(U) = ∅ ; FIN(U) > START(T)
2. WS(T) ∩ WS(U) = ∅ ; FIN(U) > VAL(T)
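The two checks can be sketched as a single function. The dictionary layout ('rs', 'ws', 'start', 'val', 'fin') and the name validate are assumptions made for this illustration; a transaction that has not finished yet is represented with fin = None.

```python
def validate(t, validated):
    """Return True if transaction t passes validation.

    t:         dict with 'rs', 'ws' (sets), 'start' and 'val' (times);
               t['val'] is the time at which t is validating.
    validated: previously validated transactions, each a dict with 'ws'
               and 'fin' (finish time, or None if not yet finished).
    """
    for u in validated:
        fin = float("inf") if u["fin"] is None else u["fin"]
        # Rule 1: U did not finish before T started.
        if fin > t["start"] and t["rs"] & u["ws"]:
            return False
        # Rule 2: U did not finish before T validated.
        if fin > t["val"] and t["ws"] & u["ws"]:
            return False
    return True
```

A transaction that finished before t started is skipped by both rules, so only transactions that overlap t in time can force a rollback.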
Example 18.2.9
• 4 transactions T, U, V, W attempt to execute and validate
T: RS = {A,B} WS = {A,C}
U: RS = {B} WS = {D}
W: RS = {A,D} WS = {A,C}
V: RS = {B} WS = {D, E}
[Figure: timeline marking the read, validate and write phases of each transaction.]
Example 18.2.9
• Validation of U [RS = {B}; WS = {D}]
  – Nothing to check; reads {B}
  – U validates successfully
  – Writes {D}
• Validation of T [RS = {A,B}; WS = {A,C}]
  – FIN(U) > START(T); RS(T) ∩ WS(U) should be empty: {A,B} ∩ {D} = ∅
  – FIN(U) > VAL(T); WS(T) ∩ WS(U) should be empty: {A,C} ∩ {D} = ∅
• Validation of V [RS = {B}; WS = {D, E}]
  – FIN(T) > START(V); RS(V) ∩ WS(T) should be empty: {B} ∩ {A,C} = ∅
  – FIN(T) > VAL(V); WS(V) ∩ WS(T) should be empty: {D,E} ∩ {A,C} = ∅
  – FIN(U) > START(V); RS(V) ∩ WS(U) should be empty: {B} ∩ {D} = ∅
• Validation of W [RS = {A,D}; WS = {A,C}]
  – FIN(T) > START(W); RS(W) ∩ WS(T) should be empty: {A,D} ∩ {A,C} = {A}
  – FIN(V) > START(W); RS(W) ∩ WS(V) should be empty: {A,D} ∩ {D,E} = {D}
  – FIN(V) > VAL(W); WS(W) ∩ WS(V) should be empty: {A,C} ∩ {D,E} = ∅
  – W is not validated; it is rolled back and hence does not write values for A and C
18.9.3 Comparison between Concurrency Control Mechanisms – Storage Utilization
Concurrency control mechanism: storage utilization
• Locks: Space in the lock table is proportional to the number of database elements locked.
• Timestamps: Space is needed for read- and write-times with every database element, whether or not it is currently accessed.
• Validation: Space is used for timestamps and read or write sets for each currently active transaction, plus a few more transactions that finished after some currently active transaction began.
• Timestamping and validation may use slightly more space than locking.
• A potential problem with validation is that the write set for a transaction must be known before the writes occur
18.9.3 Comparison between Concurrency Mechanisms - Delay
• The performance of the three methods depends on whether interaction among transactions is high or low. (Interaction is the likelihood that a transaction will access an element that is also being accessed by a concurrent transaction.)
  – Locking delays transactions but avoids rollbacks, even when interaction is high.
  – Timestamps and validation do not delay transactions, but can cause them to rollback, which is a more serious form of delay and also wastes resources.
• If interaction is low, then neither timestamps nor validation will cause many rollbacks, and they may be preferable to locking
Questions ?
Thank you