File Organisation+Some Short Notes

File Organization

The database is stored as a collection of files. Each file is a sequence of records, and a record is a sequence of fields.

One approach:

assume record size is fixed
each file has records of one particular type only
different files are used for different relations

This case is the easiest to implement; variable-length records are considered later.

Fixed-Length Records

Simple approach: store record i starting from byte n * (i – 1), where n is the size of each record. Record access is simple, but records may cross blocks. Modification: do not allow records to cross block boundaries.

Deletion of record i – alternatives:

move records i + 1, ..., n to i, ..., n – 1
move record n to i
do not move records, but link all free records on a free list

Free Lists

Store the address of the first deleted record in the file header. Use this first record to store the address of the second deleted record, and so on. These stored addresses can be thought of as pointers, since they “point” to the location of a record. A more space-efficient representation: reuse the space for the normal attributes of free records to store the pointers. (No pointers are stored in in-use records.)
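A minimal in-memory sketch of this idea in C, assuming a fixed-length record with two illustrative attributes; the slot, header, and function names below are hypothetical, not from the notes:

#include <stdio.h>

#define NUM_SLOTS   8
#define END_OF_LIST (-1)

/* A fixed-length record slot; when the slot is free, next_free reuses its space. */
struct record_slot {
    int in_use;
    union {
        struct { char branch[16]; int balance; } rec;   /* normal attributes        */
        int next_free;                                  /* link to next free record */
    } u;
};

struct file_header { int first_free; };                 /* head of the free list */

static struct file_header header = { END_OF_LIST };
static struct record_slot slots[NUM_SLOTS];

/* Delete record i: push its slot onto the free list. */
void delete_record(int i) {
    slots[i].in_use = 0;
    slots[i].u.next_free = header.first_free;
    header.first_free = i;
}

/* Insert a record into the first free slot, or report that the file is full. */
int insert_record(const char *branch, int balance) {
    int i = header.first_free;
    if (i == END_OF_LIST) return -1;                    /* no free slot       */
    header.first_free = slots[i].u.next_free;           /* pop the free list  */
    slots[i].in_use = 1;
    snprintf(slots[i].u.rec.branch, sizeof slots[i].u.rec.branch, "%s", branch);
    slots[i].u.rec.balance = balance;
    return i;
}

int main(void) {
    for (int i = NUM_SLOTS - 1; i >= 0; i--)            /* initially every slot is free */
        delete_record(i);
    int a = insert_record("Perryridge", 500);
    delete_record(a);
    printf("first free slot after delete: %d\n", header.first_free);
    return 0;
}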


Variable-Length Records

Variable-length records arise in database systems in several ways:

Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields (used in some older data models).

Slotted Page Structure

The slotted page header contains:

the number of record entries
the end of free space in the block
the location and size of each record

Records can be moved around within a page to keep them contiguous with no empty space between them; the corresponding entry in the header must then be updated. Pointers should not point directly to a record — instead they should point to the record's entry in the header.
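A rough C sketch of such a page layout, assuming a 4 KB block; the type and field names are illustrative only:

#include <stdint.h>

#define BLOCK_SIZE 4096

/* One header entry per record: where the record starts in the block and how long
   it is. A length of 0 can be used to mark a deleted record. */
struct slot_entry {
    uint16_t offset;            /* byte offset of the record within the block */
    uint16_t length;            /* record length in bytes                     */
};

/* The header grows from the front of the block and records grow from the back;
   the space in between is the free space tracked by free_space_end. */
struct slotted_page_header {
    uint16_t num_entries;       /* number of record entries         */
    uint16_t free_space_end;    /* end of free space in the block   */
    struct slot_entry slots[];  /* location and size of each record */
};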


Organization of Records in Files

Heap – a record can be placed anywhere in the file where there is space.
Sequential – records are stored in sequential order, based on the value of the search key of each record.
Hashing – a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed.

The records of each relation may be stored in a separate file. In a multi-table clustering file organization, records of several different relations can be stored in the same file.

Sequential File Organization

Suitable for applications that require sequential processing of the entire file. The records in the file are ordered by a search key.

Deletion – use pointer chains.
Insertion – locate the position where the record is to be inserted:

if there is free space, insert the record there.
if there is no free space, insert the record in an overflow block.
In either case, the pointer chain must be updated.

The file needs to be reorganized from time to time to restore sequential order.


Multitable Clustering File Organization

Several relations are stored in one file using a multi-table clustering file organization. A multi-table clustering organization of customer and depositor is:

good for queries involving the join of depositor and customer, and for queries involving one single customer and his accounts;
bad for queries involving only customer.

Hashing

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array (cf. associative array). The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks—such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.


A hash function may map two or more keys to the same hash value. In many applications, it is desirable to minimize the occurrence of such collisions, which means that the hash function must map the keys to the hash values as evenly as possible.

Fig: A hash function that maps names to integers from 0 to 15. There is a collision between the keys "John Smith" and "Sandra Dee".

Hash functions are related to checksums, check digits, fingerprints, randomization functions, error correcting codes, and cryptographic hash functions.

Static Hashing

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

In a hash file organization we obtain the bucket of a record directly from its search key value using a hash function.

Hash function h is a function from the set of all search key values K to the set of all bucket addresses B.

Hash function is used to locate records for access, insertion as well as deletion.

Records with different search key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.


Example of Hash File Organization

There are 10 buckets. The binary representation of the ith letter of the alphabet is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10. E.g.:

h(Perryridge) = 5
h(Round Hill) = 3
h(Brighton) = 3

Account table: Acc_No, Branch_name, Balance.

This gives a hash file organization of the account file, using branch_name as the key.
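A small C sketch of this bucket computation, assuming letters are numbered a/A = 1 through z/Z = 26 and non-letter characters (such as the space in "Round Hill") are ignored; the function name is illustrative:

#include <ctype.h>
#include <stdio.h>

#define NUM_BUCKETS 10

/* Sum the alphabet positions of the letters in the key, modulo the bucket count. */
int branch_hash(const char *key) {
    int sum = 0;
    for (; *key; key++) {
        if (isalpha((unsigned char)*key))
            sum += tolower((unsigned char)*key) - 'a' + 1;   /* a = 1, ..., z = 26 */
    }
    return sum % NUM_BUCKETS;
}

int main(void) {
    printf("h(Perryridge) = %d\n", branch_hash("Perryridge"));  /* 5 */
    printf("h(Round Hill) = %d\n", branch_hash("Round Hill"));  /* 3 */
    printf("h(Brighton)   = %d\n", branch_hash("Brighton"));    /* 3 */
    return 0;
}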


Hash Functions

A hash function maps keys to small integers (buckets). An ideal hash function maps the keys to the integers in a random-like manner, so that bucket values are evenly distributed even if there are regularities in the input data.

This process can be divided into two steps:

Map the key to an integer.
Map the integer to a bucket.

We will assume that our keys are either integers, things that can be treated as integers (e.g. characters, pointers) or 1D sequence of such things (lists of integers, strings of characters).

Simple hash functions

The following functions map a single integer key (k) to a small integer bucket value h(k). m is the size of the hash table (number of buckets).

Division method (Cormen): Choose m to be a prime that isn't close to a power of 2. h(k) = k mod m. Works badly for many types of patterns in the input data.

Knuth Variant on Division h(k) = k(k+3) mod m. Supposedly works much better than the raw division method.

Multiplication Method (Cormen): Choose m to be a power of 2. Let A be some random-looking real number; Knuth suggests A = 0.5*(sqrt(5) - 1). Then do the following:

s = k*A
x = fractional part of s
h(k) = floor(m*x)

This seems to be the method that the theoreticians like.

To do this quickly with integer arithmetic, let w be the number of bits in a word (e.g. 32) and suppose m is 2^p. Then compute:

s = floor(A * 2^w)
x = k*s
h(k) = x >> (w-p)    // i.e. right shift x by (w-p) bits,
                     // extracting the p most significant bits of x
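A compilable C sketch of this integer version, assuming 32-bit unsigned words and m = 2^p buckets; the constant below is floor(0.6180339887 * 2^32), and all names are illustrative:

#include <stdint.h>
#include <stdio.h>

#define W 32                       /* bits per word          */
#define P 10                       /* m = 2^P = 1024 buckets */

/* s = floor(A * 2^w) with A = 0.5*(sqrt(5) - 1) ~ 0.6180339887 */
static const uint32_t S = 2654435769u;

/* Multiplication-method hash: multiply (wrapping modulo 2^w) and keep the top p bits. */
uint32_t mult_hash(uint32_t k) {
    uint32_t x = k * S;            /* x = k*s, modulo 2^32             */
    return x >> (W - P);           /* the p most significant bits of x */
}

int main(void) {
    for (uint32_t k = 1000; k < 1005; k++)
        printf("h(%u) = %u\n", k, mult_hash(k));
    return 0;
}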

Hashing sequences of characters

The hash functions in this section take a sequence of integers k = k1, ..., kn and produce a small integer bucket value h(k). m is the size of the hash table (number of buckets), which should be a prime number. The sequence of integers might be a list of integers or it might be an array of characters (a string).

The specific tuning of the following algorithms assumes that the integers are all, in fact, character codes. In C++, a character is a char variable, which is an 8-bit integer. ASCII uses only 7 of these 8 bits. Of those 7, the common characters (alphabetic and numeric) use only the low-order 6 bits, and the first of those 6 bits primarily indicates the case of a character, which is relatively insignificant. So the following algorithms concentrate on preserving as much information as possible from the last 5 bits of each number, and make less use of the first 3 bits.

When using the following algorithms, the inputs ki must be unsigned integers. Feeding them signed integers may result in odd behavior.

For each of these algorithms, let h be the output value. Set h to 0. Walk down the sequence of integers, adding the integers one by one to h. The algorithms differ in exactly how to combine an integer ki with h. The final return value is h mod m.

CRC variant: Do a 5-bit left circular shift of h. Then XOR in ki. Specifically:

highorder = h & 0xf8000000 // extract high-order 5 bits from h // 0xf8000000 is the hexadecimal representation // for the 32-bit number with the first five // bits = 1 and the other bits = 0 h = h << 5 // shift h left by 5 bits h = h ^ (highorder >> 27) // move the highorder 5 bits to the low-order // end and XOR into h h = h ^ ki // XOR h and ki

PJW hash (Aho, Sethi, and Ullman pp. 434-438): Left shift h by 4 bits. Add in ki. Move the top 4 bits of h to the bottom. Specifically:

// The top 4 bits of h are all zero
h = (h << 4) + ki          // shift h 4 bits left, add in ki
g = h & 0xf0000000         // get the top 4 bits of h
if (g != 0) {              // if the top 4 bits aren't zero,
    h = h ^ (g >> 24)      //   move them to the low end of h
    h = h ^ g              //   and clear them out of the top
}
// The top 4 bits of h are again all zero

PJW and the CRC variant both work well and there's not much difference between them. We believe that the CRC variant is probably slightly better because

It uses all 32 bits. PJW uses only 24 bits. This is probably not a major issue since the final value m will be much smaller than either.

5 bits is probably a better shift value than 4. Shifts of 3, 4, and 5 bits are all supposed to work OK.

Combining values with XOR is probably slightly better than adding them. However, again, the difference is slight.

BUZ hash: Set up a function R that takes 8-bit character values and returns random numbers. This function can be pre-computed and stored in an array. Then, to add each character ki to h, do a 1-bit left circular shift of h and then XOR in the random value for ki. That is:


highorder = h & 0x80000000 // extract high-order bit from h h = h << 1 // shift h left by 1 bit h = h ^ (highorder >> 31) // move them to the low-order end and // XOR into h h = h ^ R[ki] // XOR h and the random value for ki

Handling of Bucket Overflows

Bucket overflow can occur because of:

insufficient buckets;
skew in the distribution of records. This can occur for two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values.

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.

The above scheme is called closed hashing. An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
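A toy C sketch of overflow chaining: when a bucket fills up, records spill into a linked chain of overflow buckets. The capacity, types and names are all illustrative:

#include <stdio.h>
#include <stdlib.h>

#define BUCKET_CAPACITY 4
#define NUM_BUCKETS     10

/* A bucket holds a few fixed-length records; overflow points to a chained overflow bucket. */
struct bucket {
    int  count;
    int  keys[BUCKET_CAPACITY];
    struct bucket *overflow;
};

static struct bucket table[NUM_BUCKETS];

/* Insert a key into its hash bucket, extending the overflow chain if the bucket is full. */
void hash_insert(int key) {
    struct bucket *b = &table[key % NUM_BUCKETS];
    while (b->count == BUCKET_CAPACITY) {          /* bucket full: follow or extend the chain */
        if (b->overflow == NULL)
            b->overflow = calloc(1, sizeof *b->overflow);
        b = b->overflow;
    }
    b->keys[b->count++] = key;
}

int main(void) {
    for (int k = 0; k < 100; k += 10)              /* all of these hash to bucket 0 */
        hash_insert(k);
    printf("records in first overflow bucket: %d\n", table[0].overflow->count);
    return 0;
}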


Indexing

Indexing mechanisms are used to speed up access to desired data, e.g., the author catalog in a library.

Search key – an attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form (search-key value, pointer). Index files are typically much smaller than the original file.

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order.
Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.


Indexes can also be characterized as dense or sparse.

A dense index has an index entry for every search key value (and hence every record) in the data file.

A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.

On the basis of the number of levels, indexes can be classified into two types:

Single-level ordered indexes
Multilevel indexes

Single-Level Ordered Indexes

These can be classified into three types:

Primary indexes
Clustering indexes
Secondary indexes

Primary Index

A primary index is an ordered file whose records are of fixed length with two fields. The first field is of the same data type as the ordering key field—called the primary key—of the data file, and the second field is a pointer to a disk block (a block address). There is one index entry (or index record) in the index file for each block in the data file. Each index entry has the value of the primary key field for the first record in a block and a pointer to that block as its two field values. The following figure illustrates this primary index. The total number of entries in the index is the same as the number of disk blocks in the ordered data file. The first record in each block of the data file is called the anchor record of the block, or simply the block anchor.

[Figure: primary index on Roll_no. Index entries hold the primary-key value of each block anchor (1, 5, 9) and a block pointer; the data blocks hold (Roll_no, Name) records 1-4, 5-8 and 9-12.]
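A brief C sketch of how such an index is used for a lookup, assuming the index entries are held in a sorted array; the names index_entry and find_block are hypothetical:

#include <stdio.h>

/* One primary-index entry per data block: the primary-key value of the block's
   anchor (first) record plus a pointer to (here, the number of) that block. */
struct index_entry {
    int primary_key;    /* key of the block anchor     */
    int block_no;       /* "pointer" to the data block */
};

/* Binary-search for the last entry whose anchor key <= search_key;
   the wanted record, if it exists, must lie in that block. */
int find_block(const struct index_entry *index, int n, int search_key) {
    int lo = 0, hi = n - 1, ans = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (index[mid].primary_key <= search_key) { ans = mid; lo = mid + 1; }
        else                                      { hi = mid - 1; }
    }
    return ans < 0 ? -1 : index[ans].block_no;
}

int main(void) {
    struct index_entry idx[] = { {1, 0}, {5, 1}, {9, 2} };        /* anchors as in the figure */
    printf("Roll_no 7 lies in block %d\n", find_block(idx, 3, 7)); /* prints block 1 */
    return 0;
}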

Clustering Index

If records of a file are physically ordered on a nonkey field—which does not have a distinct value for each record—that field is called the clustering field. We can create a different type of index, called a clustering index, to speed up retrieval of records that have the same value for the clustering field. This differs from a primary index, which requires that the ordering field of the data file have a distinct value for each record. A clustering index is also an ordered file with two fields; the first field is of the same type as the clustering field of the data file, and the second field is a block pointer. There is one entry in the clustering index for each distinct value of the clustering field, containing the value and a pointer to the first block in the data file that has a record with that value for its clustering field. The following figure illustrates a clustering index.

[Figure: clustering index on City. One index entry per distinct clustering-field value (DGP, ASN, KOL), each with a block pointer to the first block of (City, Name) records having that value.]

Secondary Index

A secondary index is also an ordered file with two fields. The first field is of the same data type as some non-ordering field of the data file that is an indexing field. The second field is either a block pointer or a record pointer. The secondary field may be a key or a non-key. In the case of a key secondary field, the field is also called a secondary key. Here there is one index entry for each record in the data file, which contains the value of the secondary key for the record and a pointer either to the block in which the record is stored or to the record itself. Hence, such an index is dense.

[Figure: secondary index on a key field (Reg_no). One index entry per record, with Reg_no values 1-6, each holding a record pointer into the blocks of (Reg_no, Roll_no) records.]

In the case of a non-key secondary field, numerous records in the data file can have the same value for the indexing field. Here, we create an extra level of indirection to handle the multiple pointers. In this non-dense scheme, the pointer in the index file points to a block of record pointers; each record pointer in that block points to one of the data file records with that value for the indexing field.

[Figure: secondary index on a non-key field (Year). Each index entry (Year values 1-3) holds a block pointer to a block of record pointers, which in turn point to the (Year, Roll_no) records having that value.]


Multilevel Index

If the primary index does not fit in memory, access becomes expensive. Solution: treat the primary index kept on disk as a sequential file and construct a sparse index on it.

outer index – a sparse index of the primary index
inner index – the primary index file

If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.

Indices at all levels must be updated on insertion into or deletion from the file.

Embedded SQL



Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL. Embedded SQL statements are SQL statements written inline with the program source code of the host language. The embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-language calls to a code library. The output from the preprocessor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages such as: C/C++, COBOL and Fortran. Thus the embedded SQL provides the 3GL with a way to manipulate a database, supporting:

highly customized applications
background applications running without user intervention
database manipulation which exceeds the abilities of simple SQL
applications linking to Oracle packages, e.g. forms and reports
applications which need customized window interfaces
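A minimal Pro*C-style sketch of embedded SQL in a C host program; the connect string, table and column names are illustrative, and the exact syntax varies with the precompiler being used:

#include <stdio.h>

EXEC SQL INCLUDE SQLCA;                /* SQL communications area for status codes */

EXEC SQL BEGIN DECLARE SECTION;        /* host variables shared between C and SQL  */
    char  uid[] = "scott/tiger";       /* illustrative connect string              */
    char  emp_name[31];
    float emp_sal;
    int   dept_no;
EXEC SQL END DECLARE SECTION;

int main(void)
{
    EXEC SQL CONNECT :uid;

    dept_no = 10;

    /* Host variables are referenced inside SQL with a ':' prefix. */
    EXEC SQL SELECT name, sal INTO :emp_name, :emp_sal
             FROM emp
             WHERE dno = :dept_no;

    if (sqlca.sqlcode == 0)
        printf("%s earns %.2f\n", emp_name, emp_sal);

    EXEC SQL COMMIT WORK RELEASE;
    return 0;
}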

Query Optimization

Given a query, there are many plans that a database management system (DBMS) can follow to process it and produce its answer. All plans are equivalent in terms of their final output but vary in their cost, i.e., the amount of time that they need to run. Query optimization is the process of finding the plan that needs the least amount of time.

Such query optimization is absolutely necessary in a DBMS. The cost difference between two alternatives can be enormous. For example, consider the following database schema, which will be used in the example below:

emp(name, age, sal, dno)
dept(dno, dname, floor, budget, mgr, ano)
acnt(ano, type, balance, bno)
bank(bno, bname, address)


Further, consider the following very simple SQL query:

select name, floor
from emp, dept
where emp.dno = dept.dno and sal > 100K

Assume the characteristics below for the database contents, structure, and run-time environment:

Parameter                                 Value
Number of emp pages                       20,000
Number of emp tuples                      100,000
Number of emp tuples with sal > 100K      10
Number of dept pages                      10
Number of dept tuples                     100
Indices of emp                            Clustered B+-tree on emp.sal (3 levels deep)
Indices of dept                           Clustered hashing on dept.dno (average bucket length of 1.2 pages)
Number of buffer pages                    3
Cost of one disk page access              20 ms

Consider the following three different plans:

P1 – Through the B+-tree, find all tuples of emp that satisfy the selection on emp.sal. For each one, use the hashing index to find the corresponding dept tuples. (Nested loops, using the index on both relations.)

P2 – For each dept page, scan the entire emp relation. If an emp tuple agrees on the dno attribute with a tuple on the dept page and satisfies the selection on emp.sal, then the emp-dept tuple pair appears in the result. (Page-level nested loops, using no index.)

P3 – For each dept tuple, scan the entire emp relation and store all emp-dept tuple pairs. Then, scan this set of pairs and, for each one, check if it has the same values in the two dno attributes and satisfies the selection on emp.sal. (Tuple-level formation of the cross product, with a subsequent scan to test the join and the selection.)

Calculating the expected I/O costs of these three plans shows the tremendous difference in efficiency that equivalent plans may have: P1 needs 0.32 seconds, P2 needs a bit more than an hour, and P3 needs more than a whole day.
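A rough accounting of these figures, using only the parameters above (an illustrative back-of-the-envelope sketch, not taken from the notes): P1 reads about 3 B+-tree pages plus one or two emp pages holding the 10 qualifying tuples, and then roughly 1.2 bucket pages for each of the 10 dept look-ups, i.e. around 16 page accesses, or 16 x 20 ms = 0.32 s. P2 scans all 20,000 emp pages once for each of the 10 dept pages, i.e. about 10 + 10 x 20,000 = 200,010 page accesses, which at 20 ms each is roughly 4,000 s, a little over an hour. P3 materializes and then rescans the full cross product of 100,000 x 100 tuple pairs, which pushes its I/O cost beyond a day.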


Without query optimization, a system may choose plan P2 or P3 to execute this query, with devastating results. Query optimizers, however, examine “all” alternatives, so they should have no trouble choosing P1 to process the query.

The path that a query traverses through a DBMS until its answer is generated is shown in Figure 1. The system modules through which it moves have the following functionality:

The Query Parser checks the validity of the query and then translates it into an internal form, usually a relational calculus expression or something equivalent.

The Query Optimizer examines all algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the cheapest.

The Code Generator or the Interpreter transforms the access plan generated by the optimizer into calls to the query processor.

The Query Processor actually executes the query.


Database securityDatabase security concerns the use of a broad range of information security controls to protect databases (potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links) against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural/administrative and physical. Database security is a specialist topic within the broader realms of computer security, information security and risk management.

Security risks to database systems include, for example:


Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or by unauthorized users or hackers (e.g. inappropriate access to sensitive data, metadata or functions within databases, or inappropriate changes to the database programs, structures or security configurations);

Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or damage to the data or programs, interruption or denial of authorized access to the database, attacks on other systems and the unanticipated failure of database services;

Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;

Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment failures and obsolescence;

Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation), data loss/corruption, performance degradation etc.;

Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes, sabotage/criminal damage etc.

Many layers and types of information security control are appropriate to databases, including:

Access control
Auditing

Authentication

Encryption

Integrity controls

Backups

Application security

Traditionally databases have been largely secured against hackers through network security measures such as firewalls, and network-based intrusion detection systems. While network security controls remain valuable in this regard, securing the database systems themselves, and the programs/functions and data within them, has arguably become more critical as networks are increasingly opened to wider access, in particular access from the Internet. Furthermore, system, program, function and data access controls, along with the associated user identification, authentication and rights management functions, have always been important to limit and in some cases log the activities of authorized users and administrators. In other words, these are complementary approaches to database security, working from both the outside-in and the inside-out as it were.


Many organizations develop their own "baseline" security standards and designs detailing basic security control measures for their database systems. These may reflect general information security requirements or obligations imposed by corporate information security policies and applicable laws and regulations (e.g. concerning privacy, financial management and reporting systems), along with generally-accepted good database security practices (such as appropriate hardening of the underlying systems) and perhaps security recommendations from the relevant database system and software vendors. The security designs for specific database systems typically specify further security administration and management functions (such as administration and reporting of user access rights, log management and analysis, database replication/synchronization and backups) along with various business-driven information security controls within the database programs and functions (e.g. data entry validation and audit trails). Furthermore, various security-related activities (manual controls) are normally incorporated into the procedures, guidelines etc. relating to the design, development, configuration, use, management and maintenance of databases.

