1
Chapter 5 – Managing Files of Records
2
What’s Up for This Chapter?
This Chapter’s Material
– Accessing records in files
– Record structures for access
– File access methods vs. file organizations
– Some real-world examples of file structures
– File portability issues
3
The Central Problem
Locating Stored Data
– Once the data has been stored into a file, how do you find it to retrieve it?
– What does “find the data” even mean?
    How do you decide what you want to find?
    How do you look for it?
    What if it’s not there?
    What if something very much like it is there?
    What if there are lots of “it” there?
– And, of course, there are efficiency considerations
    How fast is your search algorithm?
    What would you have to do to the file to use a faster one?
    Which will you do more often, add records or find them?
– Bringing you back to the design of the file itself
4
Record Keys
What Is a Key?
– Data stored in a record by which you look for the record
– Can be one field or a set of fields
    Examples – { name } or { last_name + first_name }
Two Types of Keys
– Primary key
    Key value, unique in entire file, by which an individual record can be located or determined to be absent
– Secondary key
    Key value by which one or more records can be located
5
Primary Keys
Required Characteristics
– Unique across the entire file
    Can never have 2 records with same primary key
    Error to try to add record with duplicate primary key
– In “canonical” form
    Format precisely known, so search candidates can be brought into that same format before the search
    Example – words (names, etc.) in all upper-case
    – Not often used any more: rather, program the system to do the search independently of case
– Unchanging
    Value for given record should never change
    – Given primary key value should always identify same record
    – Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
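One way to picture “canonical form” is a small function that every key passes through before it is stored or compared. The sketch below is only an illustration: the name canonical_key and the choice of “trim blanks, then upper-case” are assumptions, not a format prescribed by the chapter.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: bring a key into one agreed form
// (leading/trailing blanks removed, letters upper-cased).
std::string canonical_key(const std::string& raw) {
    const auto first = raw.find_first_not_of(" \t");
    if (first == std::string::npos) return "";   // key was all blanks
    const auto last = raw.find_last_not_of(" \t");
    std::string key = raw.substr(first, last - first + 1);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return key;
}
```

Both the stored key and the search candidate go through the same function, so “ames”, “Ames ”, and “AMES” all compare equal.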
6
Primary Keys, cont’d.
Implications for File Design
– Don’t use possibly non-unique field(s) as primary key
    Bad – name, birth date, etc.
– Don’t use anything that can possibly change
    Bad – name, address, etc.
– What can we use?
    Best – artificial identifier
    – Student number
    – Driver’s license number
    – Other artificially created unique value
7
Secondary Keys
Not Such Stringent Rules
– Duplicates allowed
    Still have to define what “find” means if duplicates allowed
– Usually real data, as opposed to primary keys
    The kinds of thing you’d want to search for in real life
– Not used to impose any order on the file
    Can return results based on secondary key(s)
    – Selected by secondary key value(s)
    – Sorted on secondary key value(s)
8
Searching
From 2325 – Two Major Methods
– Sequential
    Start at beginning, look until you find what you’re after
    Choices:
    – Non-unique keys allowed?
    – Return first match or all of them?
– Binary
    Start in middle, remove half the list each time through
    Requires:
    – Primary key values unique across file
    – File sorted on primary key
    – Records directly accessible
There are others, but …
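The binary method can be sketched in a few lines. To keep the example short it works on an in-memory list of keys rather than a file; the function name binary_search_keys is made up for illustration, and the index it returns plays the role of a record number.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the position of key in sorted_keys, or -1 if it is absent.
// Assumes the keys are unique and sorted in ascending order.
long binary_search_keys(const std::vector<std::string>& sorted_keys,
                        const std::string& key) {
    long low = 0;
    long high = static_cast<long>(sorted_keys.size()) - 1;
    while (low <= high) {
        const long mid = low + (high - low) / 2;   // middle of what is left
        if (sorted_keys[mid] == key) return mid;
        if (sorted_keys[mid] < key) low = mid + 1; // discard lower half
        else high = mid - 1;                       // discard upper half
    }
    return -1;
}
```

Each pass through the loop halves the range still under consideration, which is why the three requirements listed above (unique keys, sorted order, direct access to any position) all matter.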
9
Sequential Searching
Performance
– It might take 1 try; it might take N tries
    Average number of tries ≈ N / 2 (for a key that is present) if:
    – Searching on a unique key
    – Returning first match
    Number of tries = N (every time) if:
    – Returning all matches
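The two cases differ only in whether the loop is allowed to stop early. The sketch below counts “tries” so the difference is visible; the Person record and both function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Person {              // hypothetical record layout
    std::string id;          // primary key (unique)
    std::string last_name;   // secondary key (duplicates allowed)
};

// First match on a unique key: stops as soon as the record is found.
long find_first(const std::vector<Person>& recs, const std::string& id,
                std::size_t& tries) {
    tries = 0;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        ++tries;
        if (recs[i].id == id) return static_cast<long>(i);
    }
    return -1;               // absent: all N records were examined
}

// All matches on a secondary key: must look at every record.
std::vector<std::size_t> find_all(const std::vector<Person>& recs,
                                  const std::string& last_name,
                                  std::size_t& tries) {
    tries = 0;
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        ++tries;
        if (recs[i].last_name == last_name) hits.push_back(i);
    }
    return hits;
}
```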
10
Sequential Searching
Performance
– Big factor is disk access
    Worst case:
    – File fragmented around the disk
    – Each program read takes one physical read
    Best case:
    – File fairly contiguous on disk
    – I/O System buffers things so very few (1?) actual reads are done
    – In multi-user OSs, this seldom happens
    However:
    – If read/write head didn’t move between accesses
        • Rotational latency & transfer times small compared to seek time
        • Multiple physical reads wouldn’t have as much of an impact
    – However, most OSs are multi-tasking now
        • Can’t rely on read/write head’s being where you left it
        • Must assume N physical reads take N full disk accesses
11
Improving Sequential Searches
Reduce Number of Physical Reads
– We can’t do anything about:
    File fragmentation
    – If file’s clusters scattered around disk, multiple seeks are necessary
    Multi-tasking environment
    – Have to assume each program read causes a physical read
    – (May not be true, if I/O System has good internal caching)
– So what do we do?
    Increase the number of records pulled in by each physical read
    – Saw this with magnetic tape – group the records into blocks
    – Similar to way we collected fields into records, but …
        • Grouping fields into records is dependent on data characteristics
        • Grouping records into blocks is dependent on I/O system & disk
    – Block size should be:
        • Multiple of disk sector size
        • Compatible with I/O System’s ability to read
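Blocking can be sketched as “one read call per block, then search the block in memory.” The function below is an illustration under assumptions: fixed-length records, a key stored at the start of each record, and a caller-chosen block size. The name blocked_search and its parameters are hypothetical, and the read count it reports is the number of program reads, not a measurement of physical disk activity.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Scans fixed-length records, pulling in records_per_block of them per read.
// Returns the record number of the first record starting with key, or -1.
long blocked_search(std::istream& in, std::size_t record_size,
                    std::size_t records_per_block, const std::string& key,
                    std::size_t& reads_done) {
    reads_done = 0;
    if (record_size == 0 || records_per_block == 0) return -1;
    std::vector<char> block(record_size * records_per_block);
    long rrn = 0;
    // The last block may be short, so also accept a partial read.
    while (in.read(block.data(), static_cast<std::streamsize>(block.size()))
           || in.gcount() > 0) {
        ++reads_done;
        const std::size_t count =
            static_cast<std::size_t>(in.gcount()) / record_size;
        for (std::size_t i = 0; i < count; ++i, ++rrn) {
            const std::string record(&block[i * record_size], record_size);
            if (record.compare(0, key.size(), key) == 0) return rrn;
        }
    }
    return -1;
}
```

With 4-byte records and 2 records per block, finding the fifth record takes 3 reads instead of 5. With a real file the stream would be a std::ifstream opened in binary mode.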
12
When to Use Sequential Searching
Sequential Searching is Good for:
– Text files where you’re looking for a pattern
    Unix ‘grep’ (global regular expression print) command
– Small files
    Like you use in labs here
– Files that are searched very infrequently
    Not worth the effort to sort to make binary search work
– When you expect a large number of matches
    Example – searching on a secondary key
It’s Not so Good for:
– Binary files
– Sorted files
– Big files
13
Unix Tools for Sequential Access
cat
– Seen this one – concatenate files
– cat F1 F2 >F3
wc
– Word count (also character & line count)
– wc article.txt
grep
– Search file for occurrences of regular expression pattern
– grep "Ames" personlist.txt
od
– Octal dump – or hex, or …
– od -ch list.dat
14
Direct Access
What is it?
– Go straight to the record you want in the file
    No searching
    No unnecessary disk accesses
– What’s its “order”?
    O(1): time to find a record is independent of number of records
However, it can be harder to do
15
Direct Access
How to Do It?
– At I/O System level, seek to record
    C++ seek operations go to relative byte address (RBA) in file
    Variants:
    – Seek with “get” pointer vs. seek with “put” pointer
    – Relative to start, current position, or end of file (default: start)
– But that still doesn’t answer the question
    How do we know what RBA a particular record starts at?
    We’ve talked about index files – but that’s for later
    We could move the problem up one level
    – Use relative record number (RRN)
    But that’s no real help
    – Still need some kind of index – way to find record’s RRN
    – Also requires use of fixed-length records:
        RBA = RRN * Record_Size
        (assuming, of course, that the first RRN is 0)
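The RBA formula translates almost directly into a seek followed by a read. This is a minimal sketch assuming fixed-length records with no header in front of them; read_record is a made-up name, while seekg, read, and gcount are the standard C++ stream calls.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads the fixed-length record with the given RRN (first record is RRN 0).
// Returns false if a full record could not be read at that position.
bool read_record(std::istream& in, std::size_t rrn, std::size_t record_size,
                 std::string& record) {
    const std::streamoff rba = static_cast<std::streamoff>(rrn) *
                               static_cast<std::streamoff>(record_size);
    in.clear();                          // forget any earlier EOF or failure
    in.seekg(rba, std::ios_base::beg);   // move the "get" pointer to the RBA
    record.assign(record_size, '\0');
    in.read(&record[0], static_cast<std::streamsize>(record_size));
    return record_size > 0 &&
           static_cast<std::size_t>(in.gcount()) == record_size;
}
```

Records can be fetched in any order, and the cost of each fetch does not depend on how many records precede it. For a disk file the stream would be a std::ifstream opened with std::ios::binary.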
16
Building a File of Records
Like Building a Record of Fields
– Same problem, up one level
    Fixed-length or specified-length records?
    How to directly access records?
– But wait – there’s more:
    Want to require software to know as few details about file as possible
    To do that, those details need to be stored with (in) the file
– File header records
    Store file-specific information at start of file
    Header record format
    – Constant across all file types within one system
    – Why?
17
File Header Records
Things a Header Record Might Contain
– File structure
    Type of record structure
    Number of data records
    Length of records (if fixed-length)
    Record delimiter (if delimited)
– Record structure (if records have consistent structure)
    Number of fields
    Length of each field or delimiter between each field
    Format of each field
    Key information – if needed
    – Primary key field
    – Secondary key field(s), if any
– Date/time of most recent access
– Date/time of most recent update
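As a rough illustration of the “file structure” part of such a header, the sketch below stores three of the items listed above in a fixed-size, human-readable record. The layout (20 characters, codes 'F' and 'D', 8-digit counts) and all names are assumptions made for this example, not a format defined in the chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

struct FileHeader {
    char record_type;        // 'F' = fixed-length, 'D' = delimited (made-up codes)
    unsigned record_count;   // number of data records
    unsigned record_length;  // meaningful only for fixed-length records
};

const std::size_t kHeaderSize = 20;   // "T nnnnnnnn nnnnnnnn\n"

// Encodes the header as exactly kHeaderSize characters.
// Assumes both numbers fit in 8 decimal digits.
std::string encode_header(const FileHeader& h) {
    char buf[kHeaderSize + 1];
    std::snprintf(buf, sizeof buf, "%c %08u %08u\n",
                  h.record_type, h.record_count, h.record_length);
    return std::string(buf, kHeaderSize);
}

// Decodes a header produced by encode_header; returns false on bad input.
bool decode_header(const std::string& text, FileHeader& h) {
    if (text.size() < kHeaderSize) return false;
    return std::sscanf(text.c_str(), "%c %8u %8u", &h.record_type,
                       &h.record_count, &h.record_length) == 3;
}
```

Because the header has a known size, the data records simply start after it: RBA = kHeaderSize + RRN * record_length.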
18
File Header Records, continued
Header Record Format
– Binary or character?
    Depends – is it important for people to read it?
– Here’s a place where HTML-style format might work
    Lets files of different formats have different headers (in some ways)
    Only invokes that parse overhead once per file
19
What’s the Difference?
File Organization
– Format of the file itself
    Fixed-length, specified-length, or delimited records
    ASCII or binary character encoding
File Access Method
– Way(s) software can get at contents of file
    Sequential vs. direct
    Indexed sequential
20
Designing a File
Access Affects Organization
– If sequential access is all we need
    Pretty much any organization is OK
    Subject, of course, to application needs
– If we need direct access
    Need fixed-length records
    Can also use indexed files, but that’s for later on
But Organization Also Affects Access
– What if data to be stored in a record is wildly variable?
    Fixed-length records would be extremely wasteful
    But if we use specified-length records, how to do direct access?
    – Just about have to use indexing then
21
Metadata
Data About Data
– Usually in the form of a file header
– Example in text
    Astronomy image storage format
    HTML format (name = value)
    But look on page 177: coding style makes a BIG difference
– Parsing this kind of data
    Read field name; read field value
    Convert ASCII value to type required for storage & use
    Store converted value into right variable
– Why use this type of header?
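The three parsing steps can be sketched as follows. This is not the textbook’s code: the function names and the sample field names (WIDTH, HEIGHT) are invented, and the parser assumes one “name = value” pair per line.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Removes blanks, tabs, and carriage returns from both ends of s.
std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Steps 1 and 2: read each field name and its (still textual) value.
// Lines without '=' are skipped.
std::map<std::string, std::string> parse_header(std::istream& in) {
    std::map<std::string, std::string> fields;
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        fields[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return fields;
}

// Step 3: convert the ASCII value to the type the program needs.
// Throws if the field is missing or is not a number.
int int_field(const std::map<std::string, std::string>& fields,
              const std::string& name) {
    return std::stoi(fields.at(name));
}
```

Because fields are looked up by name, a header can gain new fields, or list them in a different order, without breaking programs that read only the fields they know about.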
22
More Metadata
PC Graphics Storage Formats
– Data
    Color values for each pixel in image
    Data compression often used (GIF, JPG)
    Different color “depth” possibilities
– Metadata
    Height, width
    Number of bits per pixel (color depth)
    If not true color (24 bits / pixel)
    – Color look-up table
        • Normally 256 entries
        • Indexed by values stored for each pixel (normally 1 byte)
        • Contains R/G/B values for color combination
    – Formatted to be loaded directly into PC graphics RAM
23
Mixing Data Objects in a File
Objective
– Store different types of data in the same file
– Textbook example – mix of astronomy data
    “File” header (HTML-style)
    “File” of notes – lines of ASCII text
    “File” of image data – in whatever format
– So our data file becomes a file of files
    Each individual “file” (header, notes, or image) looks like a record in this new “mega-file”
    These “mega-records” are of varying length
    How do we store the “records” in the “mega-records”?
    – Could use another level of specified-length record software
    – Or, …
24
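Our “Mega-File”
Organization
[Diagram: the mega-file starts with a Mega-file Header, followed by a sequence of sub-files: Notes Sub-file, Image Sub-file, Notes Sub-file, Image Sub-file, …
 Each Notes Sub-file holds a Notes Header, then text lines, then a terminator line.
 Each Image Sub-file holds an Image Header, then the Image Data.]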
25
More on Our Mega-File
Access
– Can we just read it sequentially?
    Why or why not?
    What if we wanted to skip a notes sub-file?
    What if some image didn’t even have a notes sub-file?
– Can we access it directly?
    What would the header have to include to allow that?
    – An index of the “records” in the file
    – We call the entries in that index “tags”
    Each tag in the tag list has:
    – Type of sub-file referred to
        • Special-case type: end of file
    – RBA of sub-file in mega-file
    – Length of sub-file (not necessary, but helpful)
    – Key information, if any, for sub-file
26
More on Our Mega-File
Access, continued
– So how do we access the mega-file now?
    Read and process the header
    – Get whole-file information
    – Build in-memory tag table for sub-files
    Sequential access
    – Same as before
    – May be able to program in some speed-ups from tag table
    Direct access
    – Locate sub-file in tag table
    – Go right to it
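An in-memory tag table and the “locate, then go right to it” step might look like the sketch below. The Tag fields follow the tag contents listed earlier in the chapter (type, RBA, length, key); the type names, the function names, and the use of a std::string as a stand-in for the mega-file are assumptions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class SubFileType { Notes, Image, EndOfFile };

struct Tag {
    SubFileType type;   // kind of sub-file this tag refers to
    long rba;           // where the sub-file starts in the mega-file
    long length;        // size of the sub-file in bytes
    std::string key;    // key information, if any
};

// Locates the tag for a sub-file by type and key; nullptr if there is none.
const Tag* find_tag(const std::vector<Tag>& table, SubFileType type,
                    const std::string& key) {
    for (const Tag& tag : table) {
        if (tag.type == SubFileType::EndOfFile) break;   // end of tag list
        if (tag.type == type && tag.key == key) return &tag;
    }
    return nullptr;
}

// "Go right to it": pulls the sub-file's bytes out of the mega-file image.
std::string extract(const std::string& mega_file, const Tag& tag) {
    return mega_file.substr(static_cast<std::size_t>(tag.rba),
                            static_cast<std::size_t>(tag.length));
}
```

The mega-file processor never looks inside the bytes it extracts; it hands them to whatever code understands that sub-file type.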
27
Extensibility
Look at Our “Mega-File” Format Again
– Header tells us things about the sub-files:
    What kinds of files they are
    Where to find them
– Files themselves
    To the mega-file processor, just random bytes
    To the sub-file processor, meaningful information
What if we need a new type of sub-file?
– Define a new type of header entry
– Extend header processor to understand that entry
– Write (or borrow or buy) code to handle new sub-file
Cardinal Rule:
– Everything changes – file types, data types, ...
28
Factors Affecting Portability - 1
Operating System Differences
– Example – text lines
    End with line-feed character
    End with carriage-return and line-feed
    Prefixed by a count of characters in the line
Natural Language Differences
– Example – character coding
    Single-byte coding – ASCII, EBCDIC
    Multi-byte coding – Unicode (e.g., UTF-16 uses 2 or 4 bytes per character)
Programming Language Differences
– Pascal can’t directly process varying-length records
– Different C++ compilers use different byte lengths for the standard data types
29
Factors Affecting Portability - 2
Computer Architecture Differences
– Byte order in 16-bit and 32-bit integer values
    Big-endian – leftmost (lowest-address) byte is most significant
    Little-endian – rightmost (highest-address) byte is most significant
    Example – the two bytes 0x15 0x32, stored in that order:
    – Big-endian interpretation: 0x1532
    – Little-endian interpretation: 0x3215
– Storage of data in memory
    Some architectures require values that are N bytes long to start at a byte whose address is divisible by N
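The two interpretations of the bytes 0x15 0x32 can be reproduced with shifts, which behave the same way on any host. The function names below are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Interprets two stored bytes as big-endian: first byte is most significant.
std::uint16_t read_u16_big(const unsigned char* bytes) {
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Interprets the same two bytes as little-endian: first byte is least significant.
std::uint16_t read_u16_little(const unsigned char* bytes) {
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}
```

Building the value from individual bytes, rather than copying the bytes straight into an integer variable, means the code gives the same answer whatever the byte order of the machine it runs on.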
30
How to Port Files
Define Your Format C*A*R*E*F*U*L*L*Y
– Once a format is defined, never change it
    If you need a new format, add it so as not to invalidate the existing formats
    If you need to change a format, add a new one instead, and let programs that need the new version use it
– Decide on a standard format for data elements
    Text lines
    – ASCII, EBCDIC, or Unicode?
    – Which character(s) to end lines?
    Binary
    – Tightly packed or multiple-of-N addressing?
    – Which “endian”?
– You can always write code to convert to & from the standard format on a new language, computer, etc.
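Converting to and from a standard binary format can be as small as a pair of functions per data type. The sketch below assumes the file’s standard is “32-bit unsigned integers, tightly packed, big-endian”; that choice and the names put_u32 and get_u32 are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends value to out as 4 bytes, most significant byte first.
void put_u32(std::string& out, std::uint32_t value) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((value >> shift) & 0xFF));
}

// Rebuilds a value from the 4 bytes starting at offset.
// The caller must make sure offset + 4 does not run past the end of in.
std::uint32_t get_u32(const std::string& in, std::size_t offset) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i)
        value = (value << 8) | static_cast<unsigned char>(in[offset + i]);
    return value;
}
```

Only these two functions know the file’s byte order; the rest of the program works with ordinary integers, so moving to a new machine does not change the file format.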
31
The Conversion Problem
Few Environments – can convert directly
[Diagram: direct converters between two environments: IBM → VAX and VAX → IBM]
Many Environments – need intermediate form
[Diagram: IBM, VAX, IA-32, IA-64, … each convert to and from one intermediate form: XML (or some other standard format)]
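A quick count shows why the intermediate form wins as environments are added. Assuming n environments and one converter per direction, the totals are:

```latex
\begin{align*}
\text{direct converters (one per ordered pair)} &= n(n-1) \\
\text{converters via a standard format (one in, one out per environment)} &= 2n \\
n(n-1) > 2n &\iff n > 3
\end{align*}
```

For n = 2 (IBM and VAX) that is 2 direct converters versus 4. The two approaches tie at n = 3, and from n = 4 on the standard format needs fewer converters (8 versus 12).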