1 Chapter 7 Indexing File Structures by Folk, Zoellick, and Ricarrdi.

Post on 04-Jan-2016

221 views 1 download

transcript

1

Chapter 7

Indexing

File Structures by Folk, Zoellick, and Ricarrdi

2

Chapter Objectives

Index files. Operations Required to Maintain an Index File. Primary keys. Secondary keys.

3

Index a tool for finding records in a file consists of:

key field field on which the index is searched

reference (address or RRN) field tells where to find the data file record associated with

a particular key.

7.1 What is an Index

4

Examples of an Index book index

usually at the end of the book arranged alphabetically by topic

The index in a library (an on-line catalog) allows you to locate items by an author, by a title, or by a call number.

photo thumbnails usually represents a link to the actual photo

much smaller file, can be loaded quickly actual photo takes much longer to load

if index was actual photos, would take long to load

5

Book Index

6

Example: Index in Databases

University uses an index file to keep track of its courses.

The data file consists of the following fields in each record: Department Title Professor Student List Room & Time

7

Example: Primary key Department

not specific enough Course Number

not unique Professor

not unique Room & Time

possible classes aren’t identified this way

Department + Course Number -> Obvious?

7.2 A Simple Index

8

Index file It is used to provide rapid access to

individual records in the data file via the keys

Example index file consists of the following fields: key (e.g. CIS402) reference (address) =address of the

corresponding record in the data file

9

10

Primary Index

k1 k2 k4 k5 k7 k9

k1 k2 k4 k5 k7 k9

AAA ZZZ CCC XXX EEE FFF

Index File

Data File

7.1 What Is an Index?

7.2 A Simple Index for E-S Files

Class TextIndex{ public: TextIndex(int maxKeys = 100, int unique = 1);

int Insert(const char*ckey, int recAddr); //add to index int Remove(const char* key); //remove key from index int Search(const char* key) const;

//search for key, return recAddr void Print (ostream &) const; protected: int MaxKeys; // maximum num of entries int NumKeys;// actual num of entries char **Keys; // array of key values int* RecAddrs; // array of record references int Find (const chat* key) const; int Init (int maxKeys, int unique); int Unique;// if true --> each key must be unique}

Index Class Interface

13

Operations on an Indexed File

Create (when data file is created) Load into memory (whole file, if possible and

prudent) Write updated file to permanent storage Record(s) added to data file Record(s) deleted from data file Update record(s) in data file

Searches

14

15

Creating Files of Data Create files

index file data record file

Load Index via

buffer I/O an array.

Writing Back Index File Can be part of the close operation for the index file

close function in index object can write the buffer/array to the disk before closing file

16

Record addition Adding a new data record to the data file

requires adding a new record to the index file If the index file is sorted:

adding a new record may require rearranging the records in this file.depends upon index file representation in memory if sort necessary, easily done if the indices are in main

memory

17

Record deletion Deletion of a data record requires deletion of the

corresponding index record. Can space in data file be reclaimed?

Difficult, as with index file organization all data records are pinned a pinned data record is one that has a reference to its address in

an index file

Other consequences Resorting difficulty

Solution: Sort the file via the indices

18

Record Updating Two categories of updates:

modification of key value re-ordering of the index file might be required

two possible situations1. modifying key reorders file

2. see below

modification of non-key value might still require reordering of records in the data

file. (WHY?)

size of data record might increase, requiring moving it to space that can hold it

must reset index for that record

19

Indexes too large for Memory

kept on the secondary storage disadvantages

time consumptionsearching the index file

requires disk accesses instead of just memory accesses

rearranging indexesrequires disk accesses

7.5 Indexes That Are Too Large to Hold in Memory

20

Solutions to Index Files in 2ndary storage

If the index file is too large to be kept in main memory than the following alternative organizations should be considered:

a hashed organization (if access speed is very important)

a tree structured organization, or a multilevel index such as a B-tree

21

Pros of a simple index file allows for use of binary search sorting and maintaining an index is much

easier than for a data file true if index entries are much smaller than data

records,

if data records are pinned, can rearrange keys without moving data records

apply them to multiple simple indexes...

22

Indexing with Multiple Key Access

unique primary key often used as a search keyword. Example primary key

CS215

What if you’d like to include the prof in the search? Two keys: Course & Prof Could also be:

Course & Time Location & Time (?)

7.6 Indexing to Provide Access by Multiple Keys

23

Secondary key A secondary key is a key for

which multiple records may exist in the data file.

Example: Sorting an Excel sheet using two

fields (e.g. name & section) A professor teaches more than

one class

24

Secondary Index File create for the possible secondary indexes.

secondary keys can be sharedprimary keys were unique

Example: Professor El-Ramly secondary keys:

Primary keys containing this prof: CS352 CS215

Can access those courses via the secondary key What if course has multiple sections?

25

Record Addition Adding a record to the data file likely requires

adding a record to the secondary index file. Costs are similar to the cost of adding a record

in the primary index file. records might have to be shifted indexes may have to be rearranged

26

27

Record Deletion must remove all references to that record in the file system. search for primary key in primary index file

remove index search in secondary index file

for the primary key of the record to be deleted remove index from the secondary index file.

what if secondary keys are maintained? secondary key refers to primary key primary key will have been deleted, and will not exist if we consider this possibility, don’t have to delete secondary key pitfalls?

28

Record Updating There are three possible situations:

secondary key altered may have to rearrange the secondary key index so it stays in sorted

order primary key altered

big impact on the primary key index in the secondary key index only need to update the affected primary

key field confined to non-key fields

all updates that do not affect either the primary or secondary key fields do not affect the secondary key index, even if the update is substantial. recall, can affect primary index, since that refers to location in data file

29

30

Retrieving Data with Multiple Secondary Keys

Example: All courses taught by Spiegel or Gordon Requires two searches searches produce a list of courses by providing

primary keys.Spiegel: CIS136, CIS235, CIS402Gordon: CIS425, CIS520, CIS243

31

Boolean AND in searches Example:

Search for courses: taught by Spiegel located in Lytle Hall

Courses found are in intersection ofcourses taught by Spiegelcourses offered in Lytle Hall

32

Boolean OR searches Example:

Search for courses: taught by Spiegel located in Lytle Hall

Courses found are in union ofcourses taught by Spiegelcourses offered in Lytle Hall

33

Cons of the Current Secondary Index Structure

index file has to be rearranged every time a new record is added to the file.

for duplicate secondary keys, secondary key field is repeated for each entry.

Secondary Key Primary Key

El-Ramly CS215

El-Ramly CS352

Khattab CS214

Khattab CS316

7.8 Improving the Secondary Index Structure

34

Cons of the Current Secondary Index Structure

Solution A: by an array of references

Solution B: by linking the list of references

7.8 Improving the Secondary Index Structure

35

Improvements to the secondary index key

structure Solution 1

Allow for multiple primary keys to be associated with a single secondary key by allocating a primary key list (STL vector is best; why?) for each secondary key entry. Solves the problem of sorting each time when an new entry is

added. According to text: Suffers from internal fragmentation due to

fixed nature of list, and the number of allocated entries in the array may prove too small. STL (or Java) vector fixes this: How?

A. Array of References

BEETHOVEN ANG3795 DG139201 DG18807 RCA2626

COREA WAR23699

DVORAK COL31809

PROKOFIEV LON2312

RIMSKY-KORSAKOV MER75016

SPRINGSTEEN COL38358

SWEET HONEY IN THE R FF245

Secondary key Set of primary key references

Revised composer index

* no need to rearrange

* limited reference array

* internal fragmentation

37

Allow for multiple primary keys to be associated with a single secondary key by allocating a primary key array (STL vector is best; why?) for each secondary key entry.

Solves the problem of sorting each time when an new entry is added.

According to text: Suffers from internal fragmentation due to fixed nature of list, and the number of allocated entries in the array may prove too small. STL (or Java) vector fixes this: How?

A. Array of References

38

Solution B Create an inverted list of indexes. Have each

secondary key point to a list of primary key references associated with it.This method eliminates most of the problems

associated with maintaining a secondary index file.

Which solution is better?

B. Inverted List

Inverted Lists

Guidelines for better solution no reorganization when adding no limitation for duplicate key no internal fragmentation

Solution B: by Linking the list of references

A list of primary key references

secondary key field, relative record number of the first

corresponding primary key reference

PROKOFIEV ANG36193

LON2312

Linking List of References (1)

BEETHOVEN

COREA

PROKOFIEV

RIMSKY-KORSAKOV

SPINGSTEEN

SWEET HONEY IN THE R

DVORAK

3

2

7

10

6

4

9

LON2312

RCA2626

ANG23699

COL38358

DG18807

MER75016

COL31809

DG139201

ANG36193

WAR23699

-1

-1

-1

8

-1

1

-1

-1

5

0

0

1

2

3

4

5

6

7

8

9 FF245 -1

Secondary Index file Label ID List file

Improved revision of the composer index

0

1

2

3

4

5

6

10

Linking List of References (2)

The primary key references in a separate, entry-sequenced file

Advantages rearranges only when secondary key changes rearrangement is quick less penalty associated with keeping the secondary index file on

secondary storage (less need for sorting) Label ID List file not need to be sorted reusing the space of deleted record is easy

Linking List of References (3)

Disadvantage same secondary key references may not be

physically grouped lack of localitycould involve a large amount of seekingsolution: reside in memory

same Label ID list can hold the lists of a number of secondary index files

if too large in memory, can load only a part of it

43

Selective Indexes

Selective Index: Index on a subset of records Selective index contains only some part of

entire index provide a selective view useful when contents of a file fall into several

categories e.g. 20 < Age < 30 and $1000 < Salarye.g. Courses offered after 12 noon

7.9 Selective Indexes