Post on 20-Dec-2015
1
File Structure
File as a stream of characters
  No structure
  Consider students registered in a course
  320587Joe SmithSC953184923Kathy LeeEN923249793Albert ChanSC943
File as a structured collection of related data
  A set of related data forms a record; a file consists of records
  Information about each student forms a record
  320587Joe SmithSC953184923Kathy LiEN923249793Albert ChanSC943
  What is the meaning of each piece of information about each student?
2
DBMS Structures & Files
DBMS Structures    File Structures
Attribute          Field
Tuple              Record
Relation           File
3
Fields
Each record consists of a set of fields
Fields separate data units
  Identification of the pieces of data in a record
  320587Joe SmithSC953
Usually the same fields exist in all records in a file
4
Field Separation Alternatives
Fixed length fields
  A given field (e.g., NAME) is the same size for all records
  Easy and fast reading but wastes space
  320587 Joe Smith SC 95 3184923 Kathy Lee EN 92 3249793 Albert Chan SC 94 3
Length indicator at the beginning of each field
  Also wastes space (at least 1 byte per field)
  You have to know the length before you store
  63205879Joe Smith2SC2951361849238Kathy Li2EN29213624979311Albert Chan2SC29413
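The length-indicator layout can be sketched in Python as below. This is only an illustration: the one-byte binary length and the helper names pack_fields and unpack_fields are assumptions, not something the slides specify.

```python
# Sketch: every field is stored as <1-byte length><data>.
# The 1-byte length (max 255) is an assumption made for this example.

def pack_fields(fields):
    out = bytearray()
    for field in fields:
        data = field.encode("ascii")
        if len(data) > 255:
            raise ValueError("field too long for a 1-byte length")
        out.append(len(data))   # length indicator first
        out += data             # then the field itself
    return bytes(out)

def unpack_fields(buf, count):
    """Read `count` fields; return them and the number of bytes consumed."""
    fields, pos = [], 0
    for _ in range(count):
        n = buf[pos]
        fields.append(buf[pos + 1:pos + 1 + n].decode("ascii"))
        pos += 1 + n
    return fields, pos

record = pack_fields(["320587", "Joe Smith", "SC", "95", "3"])
fields, used = unpack_fields(record, 5)
```

The overhead is one byte per field, which matches the "at least 1 byte per field" cost noted above.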
5
Field Separation Alternatives
Separate fields with delimiters
  Use white space characters (blank, new line, tab)
  Easy to read, uses one byte per field, have to be careful in the choice of the delimiter
  |320587|Joe Smith|SC|95|3||184923|Kathy Li|EN|92|3||249793|Albert Chan|SC|94|3|
Use keywords
  Each field has a keyword that indicates what the field is
  Self describing but high space overhead
  ID=320587NAME=Joe SmithFACULTY=SCDEG=95YEAR=3ID=184923NAME=Kathy LiFACULTY=ENDEG=92YEAR=3ID=249793NAME=Albert ChanFACULTY=SCDEG=94YEAR=3
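A minimal Python sketch of reading one record in each of the two styles follows. The function names, and the assumption that keywords always appear in the order ID, NAME, FACULTY, DEG, YEAR, are made up for this example.

```python
import re

def parse_delimited(record, delimiter="|"):
    # Drop the leading/trailing delimiter, then split on the rest.
    # Assumes the delimiter never occurs inside a field.
    return record.strip(delimiter).split(delimiter)

KEYWORDS = ("ID", "NAME", "FACULTY", "DEG", "YEAR")  # assumed fixed order

def parse_keyword(record, keywords=KEYWORDS):
    # Builds ID=(...)NAME=(...)... ; assumes "KEYWORD=" never occurs in a value.
    pattern = "".join(f"{k}=(?P<{k}>.*?)" for k in keywords) + "$"
    match = re.match(pattern, record)
    return match.groupdict() if match else None

by_delimiter = parse_delimited("|320587|Joe Smith|SC|95|3|")
by_keyword = parse_keyword("ID=320587NAME=Joe SmithFACULTY=SCDEG=95YEAR=3")
```

Both parsers fail if the delimiter or a keyword shows up inside the data, which is the "careful choice of the delimiter" concern.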
6
Record Organization Alternatives
Fixed length records
  All records are the same length
  320587 Joe Smith SC 95 3184923 Kathy Lee EN 92 3249793 Albert Chan SC 94 3
  The number and size of fields in each record may be variable
  |320587|Joe Smith|SC|95|3| Padding|184923|Kathy Li|EN|92|3| Padding|249793|Albert Chan|SC|94|3| Padding
7
Record Organization Alternatives
Variable Length Records
Fixed number of fields
  Count the fields to detect the end of record
Length field at the beginning
  Put the length of each record in front of it
  You have to buffer the record before writing
  24320587|Joe Smith|SC|95|323184923|Kathy Li|EN|92|326249793|Albert Chan|SC|94|3
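The record-length prefix can be sketched as follows. The two-digit decimal length is an assumption read off the example string, and write_records and read_records are hypothetical helper names.

```python
def write_records(records):
    out = []
    for fields in records:
        body = "|".join(fields)      # buffer the whole record first ...
        if len(body) > 99:
            raise ValueError("record too long for a 2-digit length")
        out.append(f"{len(body):02d}{body}")  # ... so its length is known
    return "".join(out)

def read_records(stream):
    records, pos = [], 0
    while pos < len(stream):
        n = int(stream[pos:pos + 2])          # length field
        records.append(stream[pos + 2:pos + 2 + n].split("|"))
        pos += 2 + n
    return records

students = [
    ["320587", "Joe Smith", "SC", "95", "3"],
    ["184923", "Kathy Li", "EN", "92", "3"],
    ["249793", "Albert Chan", "SC", "94", "3"],
]
stream = write_records(students)
```

With these three students the stream comes out identical to the example string on the slide.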
8
Record Organization Alternatives
Variable Length Records (cont’d)
Index the beginning
  Build a secondary index that shows where each record begins
  320587|Joe Smith|SC|95|3184923|Kathy Li|EN|92|3249793|Albert …
  Index: 00 24 47
End-of-record markers
  Put a special end-of-record marker
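As a rough sketch of the "index the beginning" approach, the offsets can be collected while the records are written. The names build_offsets and get_record are illustrative only.

```python
records = [
    "320587|Joe Smith|SC|95|3",
    "184923|Kathy Li|EN|92|3",
    "249793|Albert Chan|SC|94|3",
]

def build_offsets(records):
    """Return the starting offset of each record in the concatenated file."""
    offsets, pos = [], 0
    for rec in records:
        offsets.append(pos)
        pos += len(rec)
    return offsets

def get_record(data, offsets, i):
    # A record ends where the next one begins (or at end of file).
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[offsets[i]:end]

data = "".join(records)
offsets = build_offsets(records)   # [0, 24, 47] for these three records
```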
9
Summary
File System consists of: File File File File File …
File consists of: Header Record Record Record Record Record …
Record consists of: Field Field Field …
10
Accessing a File
Sequential access
  Based on key values
  Useful when file is small or most (all) of the file needs to be searched
  Complexity O(n) where n is the number of disk reads
  Block records to reduce n
  Block size should match physical disk organization: multiples of sector size
Direct access
  Based on relative record number (RRN)
  Record-based file systems can jump to the record directly
  Stream-based systems calculate byte offset = RRN * record length
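Direct access on a stream-based system can be sketched like this. The 22-byte record layout, the 0-based RRN, and the helper names are assumptions for the example; an in-memory io.BytesIO stands in for a real file.

```python
import io

RECORD_LEN = 22  # assumed layout: ID 6 + NAME 11 + FACULTY 2 + DEG 2 + YEAR 1

def make_record(student_id, name, faculty, deg, year):
    rec = f"{student_id:<6}{name:<11}{faculty:<2}{deg:<2}{year:<1}"
    if len(rec) != RECORD_LEN:
        raise ValueError("a field is too long for its slot")
    return rec.encode("ascii")

def read_by_rrn(f, rrn):
    f.seek(rrn * RECORD_LEN)          # byte offset = RRN * record length
    return f.read(RECORD_LEN).decode("ascii")

students = io.BytesIO(b"".join([
    make_record("320587", "Joe Smith", "SC", "95", "3"),
    make_record("184923", "Kathy Lee", "EN", "92", "3"),
    make_record("249793", "Albert Chan", "SC", "94", "3"),
]))
third = read_by_rrn(students, 2)      # one seek, no scan of records 0 and 1
```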
11
Header Records
May be the same or different length than the rest of the records in the file
May contain information about the file
  Number of records
  Size of records
  Date of file creation
  Date of last file modification
  Name of file creator/owner
  Meta information
    Formats of data
    Origin of data
    Units used
    …
12
File Organization Issues
Primary concern: Organizing files for improving performance
  Data compression
  Reclaiming space in files
  Search and sorting
  Indexing
13
Data Compression
Encoding information to reduce size of files
Reversible compression
  redundancy reduction
    short notations: AB for Alberta
    suppressing repeating sequences (images): 22 23 24 24 24 24 24 24 25 becomes 22 23 ff 24 06 25
    variable length coding (Huffman): most frequently used letters get the shortest codes
Irreversible compression
  from GIF to JPEG: save 20 ~ 90 %
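The repeating-sequence suppression can be sketched as a small run-length coder. The ff marker and the (marker, value, count) layout follow the slide's example; the minimum run of 4 and the handling of a literal ff byte are assumptions added to make the sketch reversible.

```python
RUN_MARKER = 0xFF

def rle_encode(data, min_run=4):
    out, i = bytearray(), 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= min_run or data[i] == RUN_MARKER:
            # marker, repeated value, repeat count
            out += bytes([RUN_MARKER, data[i], run])
        else:
            out += data[i:i + run]    # short runs are copied unchanged
        i += run
    return bytes(out)

def rle_decode(data):
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == RUN_MARKER:
            out += bytes([data[i + 1]]) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

pixels = bytes.fromhex("222324242424242425")
packed = rle_encode(pixels)
```

Decoding the packed bytes gives back the original sequence, which is what makes this a reversible scheme.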
14
Reclaiming Space in Files
File updates
  record addition
  record deletion
  record modification
Requirements
  how to recognize deleted records: tombstone: *
  how to utilize space left by deleted records
    storage compaction
      reconstruct the file to reclaim space occupied by all deleted records
      how often?
    Available List
15
Available List
Consider fixed length records
  Available list is a linked list of deleted records
  Implemented as a stack
  Use relative record number (RRN) for physical addresses
Example (List Head, then the slots at RRN 0 1 2 3 4 5 6 7)
  List Head -1: Adam Barb Peter Susan Brenda Sue Tim Jack
  List Head 3: Adam Barb Peter -1 Brenda Sue Tim Jack
  List Head 6: Adam Barb Peter -1 Brenda Sue 3 Jack
  List Head 3: Adam Barb Peter -1 Brenda Sue Tamer Jack
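A small in-memory sketch of the available list for fixed length records follows. The class name FixedFile is hypothetical, a Python list stands in for the file's slots, and the tombstone character is left out to keep the example short.

```python
class FixedFile:
    def __init__(self, names):
        self.slots = list(names)   # slot i is the record at RRN i
        self.head = -1             # -1 means the available list is empty

    def delete(self, rrn):
        # Push the freed slot: it stores the old head, then becomes the head.
        self.slots[rrn] = self.head
        self.head = rrn

    def insert(self, name):
        if self.head == -1:        # no free slot: append at end of file
            self.slots.append(name)
            return len(self.slots) - 1
        rrn = self.head            # pop the most recently freed slot
        self.head = self.slots[rrn]
        self.slots[rrn] = name
        return rrn

f = FixedFile(["Adam", "Barb", "Peter", "Susan", "Brenda", "Sue", "Tim", "Jack"])
f.delete(3)                   # Susan: head becomes 3, slot 3 holds -1
f.delete(6)                   # Tim: head becomes 6, slot 6 holds 3
reused = f.insert("Tamer")    # reuses slot 6, head goes back to 3
```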
16
Variable Length Records Case
Problems
  RRNs cannot be used
  Fitting
  Fragmentation
    internal fragmentation: occurs if variable length records are stored in fixed size slots with padding
    external fragmentation: split record leftover may be too small to hold any record
Solutions
  An available list with the byte offset
  Placement strategies
  Storage compaction
  Coalescing holes: combining adjacent slots to form a bigger one
17
Placement Strategies
First fit
  unsorted list, the newly deleted record is put at the front
  insertion uses the first one on the list that fits
Best fit
  the list is sorted in ascending order
  insertion uses the first one on the list that fits
  too much fragmentation
Worst fit
  the list is sorted in descending order
  insertion always uses the first one if possible
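The three strategies can be compared with one small function. This is a sketch under assumptions: holes are (byte offset, size) pairs, the list is re-sorted on every call instead of being kept sorted, and any leftover is simply put back at the front of the list.

```python
def take_hole(holes, size, strategy="first"):
    """Pick a hole for a record of `size` bytes; return its offset or None."""
    if strategy == "first":
        candidates = list(holes)                       # list order as-is
    elif strategy == "best":
        candidates = sorted(holes, key=lambda h: h[1])               # ascending
    elif strategy == "worst":
        candidates = sorted(holes, key=lambda h: h[1], reverse=True)  # descending
    else:
        raise ValueError(f"unknown strategy {strategy!r}")
    for offset, hole_size in candidates:
        if hole_size >= size:
            holes.remove((offset, hole_size))
            leftover = hole_size - size
            if leftover:
                holes.insert(0, (offset + size, leftover))  # split remainder
            return offset
    return None

first = take_hole([(0, 40), (100, 25), (200, 60)], 20, "first")
best = take_hole([(0, 40), (100, 25), (200, 60)], 20, "best")
worst = take_hole([(0, 40), (100, 25), (200, 60)], 20, "worst")
```

For a 20-byte record, best fit leaves a 5-byte leftover (likely unusable, hence the fragmentation remark), while worst fit leaves a 40-byte hole that can still hold another record.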
18
Search Problem
Find a record with a given key value
  Sequential search: O(n)
  Binary search: O(log n)
    the file must be sorted
    how to maintain the sorting order?
      deleting, insertion
      variable length records
  Sorting
    RAM sort: read the whole file into RAM, sort it, and then write it back to disk
    Keysort: read the keys into RAM, sort keys in RAM and then rearrange records according to sorted keys
  Index
19
Keysorting
Records in the file (RRN 1, 2, 3)
  320587 Joe Smith SC 95 3
  184923 Kathy Lee EN 92 3
  249793 Albert Chan SC 94 3
Keys and RRNs before sorting
  320587 1
  184923 2
  249793 3
Keys and RRNs after sorting
  184923 2
  249793 3
  320587 1
Problem: Now the physical file has to be rearranged
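Keysort can be sketched in a few lines. The 1-based RRNs follow the figure; the function name and the assumption that the key is the first 6 characters of each record are illustrative.

```python
def keysort(records, key_len=6):
    # Only (key, RRN) pairs are sorted in RAM, not the full records.
    keys = [(rec[:key_len], rrn) for rrn, rec in enumerate(records, start=1)]
    keys.sort()
    # Rearranging the records is the costly part on a real disk:
    # each lookup below would be a seek to that RRN.
    rearranged = [records[rrn - 1] for _, rrn in keys]
    return keys, rearranged

records = [
    "320587 Joe Smith SC 95 3",
    "184923 Kathy Lee EN 92 3",
    "249793 Albert Chan SC 94 3",
]
keys, rearranged = keysort(records)
```

Keeping the sorted (key, RRN) list and skipping the rearrangement step altogether is the idea that leads to an index.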
20
Indexing
A tool used to find things
  book index, student record indexes
A function from keys to addresses
  A record consisting of two fields
    key: on which the index is searched
    reference: location of data record associated with the key
Advantages
  smaller size of the index file makes RAM index possible
  binary search from files of variable length records
  rearrange keys without moving records
  multiple indexes: primary and secondary
21
Operations With an Indexed File
Create original index and data file
Load index file into RAM before using it
Rewrite index file after using it
  file header
Update
  insertion
  deletion
  update
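A RAM-resident index with these operations might look like the sketch below. The class name SimpleIndex is hypothetical, the references are byte offsets into the data file, and loading and rewriting the index file are left out.

```python
import bisect

class SimpleIndex:
    """Sorted (key, reference) pairs kept in RAM, searched by binary search."""

    def __init__(self):
        self.keys = []   # sorted primary keys
        self.refs = []   # refs[i] is the data-file location for keys[i]

    def insert(self, key, ref):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            raise KeyError(f"duplicate primary key {key!r}")
        self.keys.insert(i, key)   # only the index is reordered,
        self.refs.insert(i, ref)   # the data records never move

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.refs[i]
        return None

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i], self.refs[i]
            return True
        return False

index = SimpleIndex()
for key, offset in [("320587", 0), ("184923", 24), ("249793", 47)]:
    index.insert(key, offset)
```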
22
Secondary Index
Provides multiple views of records
Example: Consider a collection of music CDs
Primary index (CD # to physical location)
  ABG379 ...
Composer index (composer to CD #)
  Beethoven ABG379
Title index (title to CD #)
  Symphony ABG379
23
Primary vs Secondary Keys
Uniqueness
  a primary key is a unique identification of a record
  a secondary key may be associated with many records
Binding: association of key and address
We may retrieve records using combinations of secondary keys
  FIND all records WHERE Composer = “Beethoven” AND Title = “Symphony 9”
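The combined query can be sketched with two secondary indexes that map to CD numbers, plus a primary index that maps CD numbers to locations. Only ABG379, Beethoven and the symphony come from the slides; the other CDs (HYP001, HYP002) are made-up sample data, and list positions stand in for physical locations.

```python
from collections import defaultdict

cds = [
    ("ABG379", "Beethoven", "Symphony 9"),
    ("HYP001", "Beethoven", "Violin Concerto"),   # hypothetical entry
    ("HYP002", "Dvorak", "Symphony 9"),           # hypothetical entry
]

primary_index = {}                    # CD # -> location (bound to an address)
composer_index = defaultdict(set)     # composer -> CD #s
title_index = defaultdict(set)        # title -> CD #s

for location, (cd, composer, title) in enumerate(cds):
    primary_index[cd] = location
    composer_index[composer].add(cd)
    title_index[title].add(cd)

def find(composer, title):
    # AND of two secondary keys = intersection of their CD # sets;
    # the primary index is consulted only for the final matches.
    matches = composer_index.get(composer, set()) & title_index.get(title, set())
    return [cds[primary_index[cd]] for cd in sorted(matches)]

result = find("Beethoven", "Symphony 9")
```

Because the secondary indexes hold CD numbers rather than addresses, moving a record only requires updating the primary index.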
24
Binding
Association between a key and a physical address
Tight binding
  bind early: the binding takes place when the file is constructed
  advantage: high performance
  disadvantage: updates
Lazy binding
  bind later: the binding takes place when the keys are actually used
  advantage: easy updates
  safer: consistency
Primary index: tight binding; secondary index: lazy binding