
Part III: Storage and Indexing (Chapters 8-11)

Part V: Transactions, Concurrency Control, Scheduling, and Recovery (Chapters 16-18)

Why Be Concerned with Storage and Indexing?

Performance is another major factor in user satisfaction. It depends on:

• Efficient data structures for data representation
• Efficiency of system operations on those structures

Disks contain data files and system files, including dictionary and index files. Disk access is one of the most critical factors in performance.

Why Not Store Everything in Main Memory?

Cost and size. Also, main memory is volatile: what's the problem?

Typical storage hierarchy (factors: access speed, cost per unit, reliability):

• Cache and main memory (RAM) for currently used data: fast but costly
• Flash memory: limited number of writes (and slow), non-volatile, a disk substitute in embedded systems
• Disk for the main database (secondary storage)
• Tapes for archiving older versions of the data (tertiary storage)

Buffer Management in a DBMS

CC & Recovery may require additional I/O when a frame is chosen for replacement. Why?

[Figure: the buffer pool in main memory holds frames; page requests from higher levels bring disk pages from the DB on disk into free frames, with the choice of frame dictated by the replacement policy.]

Indexes

An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. The search key is not the same as a key (a minimal set of fields that uniquely identifies a record in a relation).

An index contains a collection of data entries and supports efficient retrieval of all data entries k* with a given key value k: given data entry k*, we can find the record with key k quickly.

Classes: dense/sparse index, primary/secondary, clustered/unclustered.

Dense vs. Sparse Index

Dense index: one index entry per search key value.

Sparse index: index records for only some of the records.

• Every sparse index is clustered!
• Sparse indexes are smaller.
• Which one is faster? Which one has less overhead?

[Figure: a data file of (name, age, sal) records -- Ashby 25 3000, Basu 33 4003, Bristow 30 2007, Cass 50 5004, Daniels 22 6003, Jones 40 6003, Smith 44 3000, Tracy 44 5004 -- with a sparse index on Name (entries Ashby, Cass, Smith) and a dense index on Age (entries 22, 25, 30, 33, 40, 44, 44, 50).]
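To make the lookup path concrete, here is a minimal sketch (not from the slides) of a search through a sparse index: a binary search over the index entries picks the page, and a scan of that page finds the record. The page layout and data below are illustrative assumptions.

```python
# Minimal sketch of a lookup through a sparse index (illustrative data/layout).
import bisect

# Data file sorted on name, one list per page; values echo the figure above.
pages = [[("Ashby", 25), ("Basu", 33)],
         [("Bristow", 30), ("Cass", 50)],
         [("Daniels", 22), ("Smith", 44)]]
sparse_index = [page[0][0] for page in pages]  # one entry per page: its first key

def lookup(name):
    p = bisect.bisect_right(sparse_index, name) - 1  # last index entry <= name
    return next((r for r in pages[p] if r[0] == name), None)

print(lookup("Cass"))  # ('Cass', 50) -- one index probe, then one page scan
```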

Clustered vs. Unclustered Index

Suppose that Alternative (2) is used for data entries, and that the data records are stored in a heap file.

• To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
• Overflow pages may be needed for inserts. (Thus, the order of data records is `close to', but not identical to, the sort order.)

[Figure: CLUSTERED -- index entries in the index file direct the search to data entries whose order matches the data records in the data file; UNCLUSTERED -- data entries point to data records scattered throughout the data file.]

Index Trees

As for any index, there are 3 alternatives for data entries k*:

• the data record with key value k
• <k, rid of data record with search key value k>
• <k, list of rids of data records with search key k>

The choice is orthogonal to the indexing technique used to locate data entries k*. Tree-structured indexing techniques support both range searches and equality searches.

• ISAM (indexed sequential access method): a static structure (data entries reside in leaf pages and overflow pages)
• B+ tree: dynamic, adjusts gracefully under inserts and deletes

ISAM

The index file may still be quite large, but we can apply the idea repeatedly, making a tree. Leaf pages contain data entries.

[Figure: an index entry is the sequence P0, K1, P1, K2, P2, ..., Km, Pm of page pointers and keys; non-leaf pages sit above the leaf pages, and the leaf level consists of primary pages plus overflow pages.]

Example ISAM Tree

Each node can hold 2 entries; there is no need for `next-leaf-page' pointers. Why? Sequential allocation of leaf pages.

Insert 23. Insert 48, 41, 42.

[Figure: root 40; index pages (20, 33) and (51, 63); primary leaf pages 10* 15* | 20* 27* | 33* 37* | 40* 46* | 51* 55* | 63* 97*.]

After inserting 23*, 48*, 41*, 42* ...

[Figure: the index and primary leaf pages are unchanged; the inserted entries 23*, 48*, 41*, and 42* hang off the affected primary pages in chained overflow pages.]

B+ Tree: Most Widely Used Index

Insert/delete at log_F N cost (F = fanout, N = # leaf pages); the tree is kept height-balanced. F is typically >> 2. Why?

Minimum 50% occupancy (except for the root): each node contains m entries, with d <= m <= 2d. The parameter d is called the order of the tree.

Supports equality searches and range searches efficiently.

[Figure: index entries in the upper levels direct the search; the data entries at the leaf level form the "sequence set".]
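As a back-of-the-envelope illustration of why a large fanout matters (the numbers below are assumed, not from the slides):

```python
import math

# Illustrative numbers (assumed): fanout F of an index page, N leaf pages.
F, N = 133, 1_000_000
height = math.ceil(math.log(N) / math.log(F))  # root-to-leaf I/Os ~ ceil(log_F N)
print(height)  # 3 -- about three page reads, fewer with the top levels cached
```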

Extendible Hashing

Situation: a bucket (primary page) becomes full. Why not reorganize the file by doubling the number of buckets? Because reading and writing all pages to reorganize a data file is expensive.

Idea: use a directory of pointers to buckets; double the number of buckets by doubling the directory, splitting just the bucket that overflowed!

• The directory is much smaller than the file, so doubling it is much cheaper.
• Only one page of data entries is split. No overflow pages!
• The trick lies in how the hash function is adjusted!

Example

The directory is an array of size 4. To find the bucket for r, take the last `global depth' # of bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.

Insert: if the bucket is full, split it (allocate a new page, redistribute). If necessary, double the directory. (When does splitting a bucket not require doubling? We can tell by comparing the global depth with the local depth of the split bucket.)

[Figure: global depth 2; directory entries 00, 01, 10, 11 point to Bucket A (4* 12* 32* 16*), Bucket B (1* 5* 21* 13*), Bucket C (10*), and Bucket D (15* 7* 19*), each with local depth 2.]
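The sketch below (simplified, assumed code, not the textbook's) implements exactly this scheme: lookup takes the last global-depth bits of h(r), a full bucket is split, and the directory doubles only when the split bucket's local depth equals the global depth. The trailing inserts rebuild the directory state above; the final insert triggers the doubling shown on the next slide.

```python
# Sketch of extendible hashing; capacity and hash function are illustrative.
BUCKET_CAPACITY = 4                       # data entries per primary page (assumed)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by the last 1 bit of h(r)

    def _index(self, key):                # last `global depth' bits of h(r)
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        b = self.directory[self._index(key)]
        if len(b.keys) < BUCKET_CAPACITY:
            b.keys.append(key)
            return
        # Bucket full: double the directory only if local depth == global depth.
        if b.local_depth == self.global_depth:
            self.directory += self.directory   # new half aliases the old buckets
            self.global_depth += 1
        b.local_depth += 1
        image = Bucket(b.local_depth)          # `split image' of b
        for i, entry in enumerate(self.directory):
            if entry is b and (i >> (b.local_depth - 1)) & 1:
                self.directory[i] = image      # re-point half of b's entries
        old, b.keys = b.keys, []
        for k in old + [key]:                  # redistribute just this bucket's entries
            self.insert(k)

eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(k)
eh.insert(20)   # with these values, this splits bucket A and doubles the directory
```

With these keys, bucket A ends up holding 32* and 16*, and its split image A2 holds 4*, 12*, and 20*, matching the figures.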

Insert h(r) = 20 (Causes Doubling) --- How do we determine the keys in A and A2?

[Figure, during the split: Bucket A (local depth 2, holding 4* 12* 32* 16*) overflows; it splits into Bucket A (32* 16*) and Bucket A2 (4* 12* 20*), the `split image' of Bucket A, each with local depth 3, while Buckets B (1* 5* 21* 13*), C (10*), and D (15* 7* 19*) are untouched.]

[Figure, after doubling: global depth 3; directory entries 000-111; entries 000 and 100 point to Buckets A and A2 respectively (local depth 3), while B, C, and D (local depth 2) are each pointed to by two directory entries.]

Linear Hashing

This is another dynamic hashing scheme, an alternative to Extendible Hashing. LH handles the problem of long overflow chains without using a directory, and handles duplicates.

Idea: use a family of hash functions h0, h1, h2, ..., where h_i(key) = h(key) mod (2^i N), N is the initial number of buckets, and h is some hash function (its range is not 0 to N-1).

• If N = 2^d0 for some d0, then h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
• h_{i+1} doubles the range of h_i (similar to directory doubling).
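A direct transcription of the family (N and the base hash below are assumed for illustration):

```python
# The LH hash-function family h_i(key) = h(key) mod (2^i * N), with assumed N.
N = 4                              # initial # of buckets; N = 2**d0 with d0 = 2

def h_i(i, key):
    return hash(key) % (2**i * N)  # equivalently: the last d0 + i bits of h(key)

# h_{i+1} doubles the range of h_i: h_1(k) is either h_0(k) or h_0(k) + N.
assert h_i(1, 13) in (h_i(0, 13), h_i(0, 13) + N)
```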

Overview of LH File

In the middle of a round:

[Figure: the buckets that existed at the beginning of this round form the range of h_Level; `Next' marks the bucket to be split next. Buckets already split in this round have `split image' buckets, created through the splitting of other buckets in this round. If h_Level(search key value) falls in the already-split range, we must use h_{Level+1}(search key value) to decide whether the entry is in the original bucket or in its `split image' bucket.]

Lawyers

Recently reported in the Massachusetts Bar Association Lawyers Journal, the following are questions actually asked of witnesses by attorneys during trials:

"Now doctor, isn't it true that when a person dies in his sleep, he doesn't know about it until the next morning?"

"Were you alone or by yourself?"

"Was it you or your younger brother who was killed in the war?"

"How far apart were the vehicles at the time of the collision?"

Q: "How was your first marriage terminated?"
A: "By death."
Q: "And by whose death was it terminated?"

Part V: Transactions, Concurrency Control, Scheduling, and Recovery (Chapters 16-18)

Overview

• Transactions and ACID properties
• Serial execution and serializable execution
• Serializability (dependency) graph
• Serializability theorem
• Conflict equivalence and view equivalence
• Properties of schedules: SR, RC, ACA, ST
• Schedulers: 3 options to handle a request from the TM; optimistic (aggressive) vs. pessimistic (conservative); 2PL and Strict 2PL; timestamp ordering and Strict TO

Atomicity of Transactions

A transaction might commit after completing all its actions, or it could abort (or be aborted by the DBMS) after executing some actions.

A very important property guaranteed by the DBMS for all transactions is that they are atomic.

Atomicity: a transaction appears to execute all its actions in one step, or not to execute any action at all. Not easy to achieve. Why?

Consistency, Isolation, Durability

A transaction executed in isolation must preserve DB consistency.

Even if multiple transactions execute concurrently, each should be unaware that other transactions are executing concurrently.

When a transaction completes successfully, the changes it made must persist, even through later failures.

Conflicts and Equivalence

When do two operations conflict?

• They are issued by different transactions
• They operate on the same data object
• At least one of them is a write operation

Conflict equivalence: two executions are conflict equivalent if, in both executions, all conflicting operations occur in the same order.

Serializability Correctness Criterion

• Serializability is the correctness definition for a DB
• All serializable schedules are equally correct
• Scheduling algorithms enforce certain orderings
• In a distributed DBMS, variable delays may disturb any particular ordering that is supposed to occur

The serialization graph (dependency graph) shows the dependency relationships among transactions.

Serializability Theorem: for a schedule H, if SG(H) is acyclic, then H is serializable.
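To make the theorem concrete, here is a small sketch (illustrative, not from the slides) that builds SG(H) from a schedule, using the conflict definition from the previous slide, and tests it for cycles:

```python
# Sketch: build SG(H) and test acyclicity; the schedule encoding is illustrative.
from itertools import combinations

def serialization_graph(schedule):
    """Edges Ti -> Tj when an op of Ti conflicts with and precedes an op of Tj."""
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        if t1 != t2 and x1 == x2 and 'w' in (op1, op2):   # conflict definition
            edges.add((t1, t2))
    return edges

def is_acyclic(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    def dfs(v):                          # depth-first search for a back edge
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY or (color[w] == WHITE and not dfs(w)):
                return False
        color[v] = BLACK
        return True
    return all(dfs(v) for v in graph if color[v] == WHITE)

# r2(x) w3(x) w1(y) r2(y) w2(y): edges T2->T3 and T1->T2; acyclic => serializable.
H = [('T2','r','x'), ('T3','w','x'), ('T1','w','y'), ('T2','r','y'), ('T2','w','y')]
print(is_acyclic(serialization_graph(H)))   # True
```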

Properties of Schedules: Recoverability

Goal: to ensure that aborting a transaction does not change the semantics of committed transactions. Consider:

w1(x) r2(x) w2(y) C2

Is it recoverable? What if T1 aborts? Recoverable execution depends on commit order: a transaction cannot commit until all values it read are guaranteed not to be aborted. How to do it?

• Delaying commit: T2 cannot commit until T1 commits
• Cascaded abort is sometimes necessary. Why?

w1(x) r2(x) w2(x) A1

Properties of Schedules: Avoiding Cascaded Aborts

Recoverability: cascaded abort is sometimes necessary.

w1(x) r2(x) w2(x) A1

Avoiding cascaded aborts:

• Achieved if every transaction reads only values written by committed transactions
• Must delay each r(x) until all transactions that issued a w(x) are either committed or aborted:

w1(x) ... C1 r2(x) w2(y) ...

Properties of Schedules: Strictness

Restoring before images:

• Implementing transaction abort by simply restoring the before images of all writes is very convenient.

w0(x) w1(x) w2(x) A1 A2

• The value of x must be restored to the initial value, not the value written by T1.
• Solution: delay w(x) until all transactions that have written x are either committed or aborted.

Strictness: executions that satisfy both requirements -- delay both r(x) and w(x) until all transactions that have previously written x are either committed or aborted.

r1(x) w1(x) w1(y) w2(z) w2(x) C1 --- is it strict?

Relationships among Properties

• Recoverability (RC): H is RC if, whenever Ti reads from Tj and Ci is in H, Ci follows Cj.
• Avoiding cascaded aborts (ACA): H is ACA if, whenever Ti reads x from Tj, ri(x) follows Cj.
• Strictness (ST): H is ST if, whenever Oi(x) follows wj(x), Oi(x) also follows either Aj or Cj.

What is the relationship among ST, ACA, and RC? ST ⊂ ACA ⊂ RC: every strict execution avoids cascaded aborts, and every ACA execution is recoverable.

What about with SR and serial execution?

Two-Phase Locking (2PL)

Two-Phase Locking Protocol:

• Each Xact must obtain an S (shared) lock on an object before reading it, and an X (exclusive) lock on an object before writing it.
• A transaction cannot request additional locks once it releases any lock.
• If an Xact holds an X lock on an object, no other Xact can get a lock (S or X) on that object.
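A minimal sketch of the growing/shrinking discipline (the lock-manager object and its blocking acquire/release are assumed, not a real API):

```python
# Sketch: one transaction's view of 2PL; lock-manager details are assumed.
class TwoPhaseLockingXact:
    def __init__(self, lock_manager):
        self.lm = lock_manager
        self.held = set()
        self.shrinking = False      # flips to True at the first release

    def lock(self, obj, mode):      # mode: 'S' before reads, 'X' before writes
        assert not self.shrinking, "2PL violated: lock requested after a release"
        self.lm.acquire(obj, mode)  # assumed to block while incompatible locks exist
        self.held.add((obj, mode))

    def release(self, obj, mode):   # entering the shrinking phase
        self.shrinking = True
        self.held.discard((obj, mode))
        self.lm.release(obj, mode)
```

Strict 2PL is the variant in which release is only ever called at commit or abort time, which also yields strictness (ST).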

Multiple-Granularity Locks

Why consider it? A database consists of tables, pages, and tuples (records), and it is hard to decide what granularity to lock (tuples vs. pages vs. tables). We shouldn't have to decide. How?

Data "containers" are nested: the database contains tables, which contain pages, which contain tuples.

The Phantom Problem

T1 implicitly assumes that it has locked the set of all sailor records with rating = 1.

• The assumption only holds if no sailor records are added while T1 is executing!
• Why did this problem happen? The example shows that conflict serializability guarantees serializability only if the set of objects is fixed!
• We need some mechanism to enforce this assumption -- index locking and predicate locking.

Index Locking

If there is a dense index on the rating field using Alternative (2), T1 should lock the index page containing the data entries with rating = 1.

• If there are no records with rating = 1, T1 must lock the index page where such a data entry would be, if it existed!

What if there is no suitable index? T1 must lock all pages, and lock the file/table to prevent new pages from being added, to ensure that no new records with rating = 1 are added.

[Figure: an index over the data file directing the search for rating = 1.]

Predicate Locking

Grant a lock on all records that satisfy some logical predicate, e.g., age > 2*salary.

• Index locking is a special case of predicate locking, for which an index supports efficient implementation of the predicate lock.
• What is the predicate in the sailor example? rating = 1.

Why not use predicate locks in commercial DBMSs? In general, predicate locking has a significant locking overhead.

Locking in B+ Trees

How can we efficiently lock a particular leaf node? (Don't confuse this with multiple-granularity locking -- how are they different?)

One solution: ignore the tree structure and just lock pages while traversing the tree, following 2PL. What's wrong? This has terrible performance! The root node (and many higher-level nodes) becomes a bottleneck, because every tree access begins at the root.

B+ Tree Locking

Higher levels of the tree only direct searches for leaf pages.

For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.)

We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.

A Simple Tree Locking Algorithm

Search: start at the root and go down; repeatedly, S lock the child, then unlock the parent.

Insert/Delete: start at the root and go down, obtaining X locks as needed. Once a child is locked, check if it is safe; if the child is safe, release all locks on its ancestors. (A sketch of both traversals follows this list.)

Safe node: a node such that changes will not propagate up beyond it.

• When is a node safe for inserts? When the node is not full.
• When is a node safe for deletes? When the node is not half-empty.
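Here is a sketch of both traversals; every node and lock method below (s_lock, x_lock, child_for, and so on) is a hypothetical API, not one from the book. The delete variant is analogous, with the merge-safety test in place of is_full.

```python
# Lock-crabbing sketches; all node/lock methods are assumed, not a real API.
def tree_search(root, key):
    node = root
    node.s_lock()
    while not node.is_leaf():
        child = node.child_for(key)   # follow the appropriate child pointer
        child.s_lock()                # S lock child ...
        node.s_unlock()               # ... then unlock parent
        node = child
    return node                       # leaf is returned S-locked

def tree_insert(root, key):
    node = root
    node.x_lock()
    locked = [node]                   # ancestors we still hold X locks on
    while not node.is_leaf():
        child = node.child_for(key)
        child.x_lock()
        if not child.is_full():       # safe: a split cannot propagate past it
            for n in locked:          # release all locks on ancestors
                n.x_unlock()
            locked = []
        locked.append(child)
        node = child
    node.insert_entry(key)            # may split, but only locked nodes change
    for n in locked:
        n.x_unlock()
```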

Timestamp Ordering

Idea: any conflicting operations are executed in their timestamp order. This is simple and aggressive: schedule immediately, and reject requests that arrive too late.

How do you know a request has arrived too late? Give each object a read timestamp (RTS) and a write timestamp (WTS), and give each transaction a timestamp (TS) when it begins.

Timestamp Ordering

Timestamp ordering rule: if Oi(x) and Oj(x) are conflicting operations, then Oi(x) is processed before Oj(x) if and only if TS(Ti) < TS(Tj).

A request arrives too late when Oi(x) arrives after the scheduler has already sent a conflicting operation Oj(x) with TS(Tj) > TS(Ti).

Basic Timestamp Ordering

• Ri(x): if TS(Ti) < WTS(x), reject it; otherwise (TS(Ti) >= WTS(x)), schedule it and set RTS(x) to max(RTS(x), TS(Ti)).
• Wi(x): if TS(Ti) < RTS(x) or TS(Ti) < WTS(x), reject it; otherwise, schedule it and set WTS(x) to max(WTS(x), TS(Ti)).
• When restarted, Ti is assigned a new timestamp.

Thomas Write Rule: for wi(x), if TS(Ti) < WTS(x) and TS(Ti) >= RTS(x), then wi(x) can be ignored rather than rejected. Why is it correct? It ignores an obsolete write. (A sketch follows.)
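A compact sketch of the scheduler state and both rules (the dictionaries and the Abort exception are illustrative assumptions):

```python
# Sketch of Basic TO with the Thomas Write Rule; all structures are illustrative.
class Abort(Exception):
    """Rejected operation: the transaction restarts with a new timestamp."""

rts = {}   # RTS(x): largest timestamp of any transaction that read x
wts = {}   # WTS(x): largest timestamp of any transaction that wrote x

def read(ts, x):
    if ts < wts.get(x, 0):
        raise Abort        # too late: a younger transaction already wrote x
    rts[x] = max(rts.get(x, 0), ts)
    # ... schedule the read ...

def write(ts, x):
    if ts < rts.get(x, 0):
        raise Abort        # too late: a younger transaction already read x
    if ts < wts.get(x, 0):
        return             # Thomas Write Rule: obsolete write, silently ignored
    wts[x] = max(wts.get(x, 0), ts)
    # ... schedule the write ...
```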

Exercise: Non-Equivalence of 2PL and TO

H1 = r2(x) w3(x) C3 w1(y) C1 r2(y) w2(y) C2

1. Is this schedule possible under timestamp ordering?
2. Is it possible with 2PL?

H1 is legal with strict timestamp ordering. What is the equivalent serial schedule? T1 T2 T3. It is not possible with 2PL: T2 must release its lock on x for T3, but then it acquires a lock on y -- a violation of two-phaseness.

Relationship between 2PL and TO

Schedules generated by 2PL and TO:

• They are all correct (serializable).
• They are not the same set: H1 shows that.
• Is the relationship inclusive? Is S{schedules by 2PL} a subset of S{schedules by TO}? Is S{schedules by TO} a subset of S{schedules by 2PL}?

Consider w3(x) C3 w2(x) C2 r1(x). Is it legal with TO? Is it legal with 2PL? The two sets of schedules intersect, but neither is a subset of the other.

[Figure: Venn diagram -- the sets of schedules produced by 2PL and TO overlap, and both lie inside SR.]

Failure and Recovery

Failures and consistency:

• Transaction failures
• System failures
• Media failures

Principle of recovery: redundancy. The database can be protected by ensuring that its correct state can be reconstructed from information stored redundantly in the system.

Recovering the database -- the restart operation: bring the stable DB to a consistent state by removing the effects of uncommitted transactions and applying the missing effects of committed transactions.

Recovery and Restart

Types of storage media:

• Volatile storage: fast, but does not survive system failures
• Non-volatile storage
• Stable storage: information is (practically) never lost

Recovery:

• Ideally, the stable DB should contain, for each data item, the last value written by a committed transaction.
• Practically, the stable DB may contain values written by uncommitted transactions, or may not contain the last committed values. Why? 1) In-place updates by uncommitted transactions; 2) buffering of committed values in the cache.

Function of Recovery Manager

Atomicity: transactions may abort ("rollback").

Durability: what if the DBMS stops running? (Causes?)

[Figure: a timeline of transactions T1-T5 around a crash. Desired behavior after the system restarts: T1, T2 and T3 should be durable; T4 and T5 should be aborted (effects not seen).]

Recovery Management

Design rules for the recovery manager:

• Undo rule: committed values must be saved before being overwritten by uncommitted values in the stable DB.
• Redo rule: before a transaction commits, the new values it wrote must be in stable storage (DB or log).

Restart activity:

• Preparation, during normal operation: logging and checkpointing
• Actual recovery, after a failure

Cache Manager

Two operations, fetch and flush, with a dirty bit deciding whether a flush must write:

• Flush: if the slot in the cache is not dirty, do nothing; otherwise, copy the value into stable storage.
• Fetch: select a slot, using a replacement algorithm if the cache is full (and flushing if necessary); copy the value into the slot; reset the dirty bit; update the cache directory. (See the sketch below.)

When to flush? That depends on the recovery strategy of the system; different recovery algorithms use different strategies.

Idempotence of restart: any sequence of incomplete executions of restart, followed by one complete execution of restart, has the same effect as just one complete execution.
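A compact sketch of the two operations (the slot table and stable DB are plain dictionaries here, purely for illustration):

```python
# Sketch of fetch/flush with a dirty bit; structures are illustrative dicts.
class CacheManager:
    def __init__(self, num_slots, stable_db):
        self.slots = {}                # page_id -> [value, dirty_bit]
        self.num_slots = num_slots
        self.stable_db = stable_db

    def flush(self, page_id):
        value, dirty = self.slots[page_id]
        if dirty:                      # not dirty: do nothing
            self.stable_db[page_id] = value
            self.slots[page_id][1] = False

    def write(self, page_id, value):   # an in-place update marks the slot dirty
        self.slots[page_id] = [value, True]

    def fetch(self, page_id):
        if page_id not in self.slots:
            if len(self.slots) >= self.num_slots:
                victim = next(iter(self.slots))   # stand-in replacement policy
                self.flush(victim)                # flush if necessary
                del self.slots[victim]
            # copy the value into the slot, with the dirty bit reset
            self.slots[page_id] = [self.stable_db.get(page_id), False]
        return self.slots[page_id][0]
```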

Handling the Memory Pool

Write to disk: force/no-force. Cache page: steal/no-steal.

• Force every write to disk? Poor response time, but it provides durability.
• Steal buffer-pool frames from uncommitted Xacts? If not, poor throughput. If so, how can we ensure atomicity?

The four combinations form a matrix: Force with No Steal is the trivial case; No Force with Steal is the desired one.

More on Steal and Force

STEAL (why enforcing atomicity is hard): to steal frame F, the current page in F (say P) is written to disk while some Xact holds a lock on P.

• What if the Xact with the lock on P aborts?
• We must remember the old value of P at steal time (to support UNDOing the write to page P).

NO FORCE (why enforcing durability is hard): what if the system crashes before a modified page is written to disk? Write as little as possible, in a convenient place, at commit time, to support REDOing modifications.

Recovery Algorithms

Undo/redo algorithm:

• The most complicated of the four recovery algorithms
• Flexible in deciding when to flush (no-force)
• Maximizes efficiency during normal operation at the expense of less efficient recovery

Comparison with the other recovery algorithms (issues: disk I/O, log space, recovery time):

• No-redo requires more frequent flushes (force).
• An uncommitted transaction is allowed to replace a dirty slot for in-place updates, so undo might be necessary.

Restart procedure: process the log forward and backward for redo and undo.

Undo/Redo Recovery

A transaction T writes value V to data object X. What will happen?

• The system fetches X if it is not already in the cache.
• It records V in the log and in X's cache slot C.
• There is no need for the cache manager to flush C.

If the cache manager replaces C (steal), and either T aborts or the system fails before T commits, undo is required. If T commits and the system fails before C is flushed (no force), redo is required.

Restart Procedure for Undo/Redo Recovery

1. Discard all cache slots.
2. Scan the log to analyze which transactions committed, aborted, or were still active, to determine the data for redo/undo.
3. Redo all actions that were committed but not recorded in the stable DB.
4. Undo all actions of transactions that were aborted or active at the time of failure.
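Finally, a sketch of the whole procedure over a toy log. The record format -- ('write', T, x, before, after) and ('commit', T) -- is an assumption for illustration:

```python
# Sketch of undo/redo restart over a toy log; the record format is assumed.
def restart(log, stable_db):
    # Step 1 is implicit: no cache survives the crash; we work on the stable DB.
    committed = {rec[1] for rec in log if rec[0] == 'commit'}   # step 2: analysis
    for kind, t, x, before, after in (r for r in log if r[0] == 'write'):
        if t in committed:
            stable_db[x] = after              # step 3: redo committed actions
    for kind, t, x, before, after in (r for r in reversed(log) if r[0] == 'write'):
        if t not in committed:
            stable_db[x] = before             # step 4: undo aborted/active actions

db = {'x': 0, 'y': 0}
log = [('write', 'T1', 'x', 0, 1), ('write', 'T2', 'y', 0, 9), ('commit', 'T1')]
restart(log, db)
print(db)   # {'x': 1, 'y': 0} -- T1's write redone, uncommitted T2's write undone
```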