LIS 7450, Searching Electronic Databases
Basic: Database Structure & Database Construction
Dialog: Database Construction for Dialog (FYI)
Deborah A. Torres
Database Structure
Organization of Data Elements and Records
Database Record
Record – the basic unit of information in a database (file). Example: a bibliographic record contains
descriptive information such as author, title, and publisher.
Fields
Field – a distinct part or section of a record (a unit of information within the record). Example of personnel record fields:
employee’s name, special identifier number, address, date of hire, etc.
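A record and its fields can be sketched as a small data type. This is only an illustration: the class name, the field names, and the sample values are assumptions made for the example, not part of any particular database.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class PersonnelRecord:
    """One record; each attribute is a field (names are illustrative)."""
    name: str            # text field
    employee_id: int     # numeric field (the "special identifier number")
    address: str         # text field
    date_hired: date     # date field


# Made-up sample data for one record.
record = PersonnelRecord(
    name="Pat Example",
    employee_id=1042,
    address="12 Main St.",
    date_hired=date(2020, 1, 15),
)

# The field names defined for this record type, in declaration order.
field_names = [f.name for f in fields(record)]
print(field_names)
```

Declaring a type for each field is one way to record the "format for that information" decision discussed under Field Design Decisions.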
Field Design Decisions
For each field, decide:
What information is placed within that field, and the format for that information (text, numeric)
Should there be subfields within a field?
What to call the fields?
Field codes (abbreviations, numbering)
Order of the fields
Example: MARC Record (a type of record you should be familiar with)
Record Fields & Codes
The 100 field contains author information.
The 245 field contains main title information.
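The idea of numeric field codes can be sketched with a simple mapping from tag to content. This is a simplified, MARC-like illustration only: real MARC records also carry indicators and subfields, and the sample values and the helper name describe are made up for the example.

```python
# A simplified, MARC-like record: numeric field tags map to field contents.
# Only the meanings of tags 100 and 245 come from the slide; values are made up.
marc_like_record = {
    "245": "An example title",
    "100": "Example, Author A.",
}

# Human-readable labels for the two tags discussed above.
FIELD_LABELS = {"100": "author", "245": "main title"}


def describe(record):
    """Return 'tag (label): value' strings, ordered by tag."""
    return [
        f"{tag} ({FIELD_LABELS.get(tag, 'unknown')}): {value}"
        for tag, value in sorted(record.items())
    ]


for line in describe(marc_like_record):
    print(line)
```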
Other Design Decisions
Hyphenated words: home-school
Stop words: high-frequency words not useful for searching
Single words and phrases: library, library science, color of money
Alternative spellings of words: color, colour
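Two of these decisions, how to treat hyphens and how to treat alternative spellings, can be sketched as options on an indexing function. The function name index_terms, its parameters, and the one-entry spelling table are assumptions for illustration; a real database would document its own rules.

```python
import re

# Hypothetical table mapping a variant spelling to one preferred form.
SPELLING_VARIANTS = {"colour": "color"}


def index_terms(text, split_hyphens=True, normalize_spelling=True):
    """Return the index terms for `text` under two design choices."""
    text = text.lower()
    if split_hyphens:
        # home-school is indexed as two words: home, school
        text = text.replace("-", " ")
    # Words are runs of letters; an internal hyphen is kept if still present.
    terms = re.findall(r"[a-z]+(?:-[a-z]+)*", text)
    if normalize_spelling:
        terms = [SPELLING_VARIANTS.get(t, t) for t in terms]
    return terms


print(index_terms("Home-school"))
print(index_terms("Home-school", split_hyphens=False))
print(index_terms("Colour of money"))
```

A searcher who types home-school finds different records depending on which choice the database made, which is why these decisions matter for searching.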
Types of Databases
Bibliographic – references and abstracts of published documents
Full text – complete text of an article, dictionary entry, code of law, or other such document.
Directory – factual information about organizations, companies, products, people, or materials.
Numeric – data in a tabular or statistically manipulated form, often with some added text.
Hybrid – a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.
Database Construction
Basic Steps for automatic indexing of text documents
Six Basic Steps
Step 1: Parse text into words
Step 2: Compare to stoplist and eliminate stopwords
Step 3: Stem content words (reduce to root words) (skip this step if you decide not to stem)
Step 4: Count stemmed word occurrences
Step 5: Create union list of terms
Step 6: Create data structure for specific retrieval techniques (e.g., an inverted file)
Example: Simple Set of Five One-Sentence Documents
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
“D” stands for document
Step 1: Parse Text into Words
D1: it is a dog eat dog world
D2: while the world sleeps
D3: let sleeping dogs lie
D4: I will eat my hat
D5: my dog wears a hat
Note: Some databases remove punctuation from words, such as the apostrophe in possessives; others preserve it. What difference would this make?
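Step 1 can be sketched with a regular expression. This is a minimal sketch, not how any particular database parses text: the function name parse and the keep_apostrophes option are assumptions, and unlike the slide it lowercases every word, including "I".

```python
import re

# The five example documents.
documents = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}


def parse(text, keep_apostrophes=False):
    """Split text into lowercase words, dropping punctuation."""
    pattern = r"[a-z']+" if keep_apostrophes else r"[a-z]+"
    return re.findall(pattern, text.lower())


parsed = {doc_id: parse(text) for doc_id, text in documents.items()}
print(parsed["D1"])

# The note's question in miniature: the same phrase yields different words.
print(parse("the dog's bowl"))
print(parse("the dog's bowl", keep_apostrophes=True))
```

When the apostrophe is removed, the possessive splits into "dog" and a stray "s"; when it is preserved, "dog's" becomes its own index entry and a search for "dog" alone would not match it.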
Step 2: Eliminate Stop Words
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat
Stop words are content-free words – those not useful in determining the content of the document.
Examples: pronouns (I, my), prepositions (of, by, on), articles and demonstratives (a, the, this)
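Step 2 is a filter against a stoplist. The stoplist below is an assumption: it contains just the words needed to reproduce the example result, and real stoplists vary from one database to another.

```python
# Assumed stoplist, chosen to match the example; not a standard list.
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}

# Output of Step 1 (all lowercase).
parsed = {
    "D1": ["it", "is", "a", "dog", "eat", "dog", "world"],
    "D2": ["while", "the", "world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["i", "will", "eat", "my", "hat"],
    "D5": ["my", "dog", "wears", "a", "hat"],
}


def remove_stopwords(words, stoplist=STOPLIST):
    """Keep only the words that are not on the stoplist."""
    return [w for w in words if w not in stoplist]


content_words = {doc_id: remove_stopwords(words) for doc_id, words in parsed.items()}
for doc_id, words in content_words.items():
    print(doc_id, words)
```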
Step 3: Stemming (remember not all databases stem words)
Before stemming:
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat
After stemming:
D1: dog eat dog world
D2: world sleep
D3: let sleep dog lie
D4: eat hat
D5: dog wear hat
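Step 3 can be sketched with a toy suffix stripper. The stem function below is an assumption made for this example: it handles only -ing and -s, which is enough for these five documents, and it is not one of the published stemming algorithms.

```python
# Output of Step 2.
content_words = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wears", "hat"],
}


def stem(word):
    """Toy stemmer: strip -ing or -s when at least three letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stemmed = {doc_id: [stem(w) for w in words] for doc_id, words in content_words.items()}
for doc_id, words in stemmed.items():
    print(doc_id, words)
```

The three-letter minimum is a crude guard against over-stripping short words; real stemmers use far more careful rules.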
Types of Stemming Decisions
No Stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture
Weak Stemming: Inflections: -s, -es, -ed, -ing, -’s
Strong Stemming: Derivations: -tion, -ly, -ally
Stemming reduces words to a root variant; there are different stemming algorithms.
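The effect of each decision on the "contract" word family can be sketched as follows. The weak suffix list follows the inflections named above; the derivational suffixes (-ion, -or, -ual, -ure) are assumptions picked to fit this word family, and the helper names are hypothetical.

```python
WORDS = [
    "contract", "contracts", "contracted", "contracting",
    "contractor", "contraction", "contractual", "contracture",
]

INFLECTIONS = ("ing", "'s", "es", "ed", "s")   # weak stemming
DERIVATIONS = ("ion", "ual", "ure", "or")      # assumed, for this family only


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def weak_stem(word):
    return strip_suffix(word, INFLECTIONS)


def strong_stem(word):
    # Strong stemming removes inflections first, then derivational endings.
    return strip_suffix(weak_stem(word), DERIVATIONS)


print("no stemming:    ", len(set(WORDS)), "index terms")
print("weak stemming:  ", sorted({weak_stem(w) for w in WORDS}))
print("strong stemming:", sorted({strong_stem(w) for w in WORDS}))
```

Stronger stemming gives fewer index terms and broader retrieval, at the cost of merging words whose meanings differ (a contractor is not a contracture).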
A bit more about stemming for searching…
Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered.
Example: if you entered computer, the database would also search for computing, computers, computation, etc.
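Automatic stem searching can be sketched as expanding the entered word to every indexed word with the same stem. The vocabulary, the toy stem rules, and the exact option are assumptions for illustration; each database has its own way of switching this behavior off.

```python
# Hypothetical list of words present in a database's index.
VOCABULARY = ["compost", "computation", "computer", "computers", "computing", "library"]


def stem(word):
    """Toy stemmer covering only the suffixes needed for this example."""
    for suffix in ("ation", "ers", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(term, vocabulary, exact=False):
    """Return the index words a search for `term` would cover."""
    if exact:
        return [term] if term in vocabulary else []
    target = stem(term)
    return sorted(w for w in vocabulary if stem(w) == target)


print(expand_query("computer", VOCABULARY))
print(expand_query("computer", VOCABULARY, exact=True))
```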
Step 4: Sort Words, Count Duplicates
Sort into alphabetical order:
D1: dog dog eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
Count any duplicates:
D1: dog (2) eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
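Step 4 can be sketched with Python's collections.Counter, which tallies how often each stemmed word occurs in a document. The variable names are assumptions for the example.

```python
from collections import Counter

# Output of Step 3.
stemmed = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

# For each document: (term, count) pairs in alphabetical order.
term_counts = {
    doc_id: sorted(Counter(words).items())
    for doc_id, words in stemmed.items()
}

for doc_id, counts in term_counts.items():
    print(doc_id, counts)
```

The per-document counts are what later allow a retrieval system to weight a term by how often it appears.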
Step 5: Create Union List of Unique Terms
Unsorted List:
dog eat world sleep world dog let lie sleep eat hat dog hat wear
Sorted List:
dog dog dog eat eat hat hat let lie sleep sleep wear world world
Sorted, Unique List:
dog eat hat let lie sleep wear world
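Step 5 can be sketched in three lines: concatenate each document's terms, sort them, and remove duplicates. The variable names are assumptions for the example.

```python
# Each document's distinct terms after Step 4.
doc_terms = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}

# Unsorted list: all terms in document order.
unsorted_list = [term for terms in doc_terms.values() for term in terms]

# Sorted list: same terms, alphabetical, duplicates still present.
sorted_list = sorted(unsorted_list)

# Sorted, unique list: the union of all documents' terms.
union_list = sorted(set(unsorted_list))

print(unsorted_list)
print(sorted_list)
print(union_list)
```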
Step 6: Create Inverted Index (inverted file)
Union List (unique terms):
dog eat hat let lie sleep wear world
Inverted Index (has pointers to the documents in which each word occurs):
dog: D1 D3 D5
eat: D1 D4
hat: D4 D5
let: D3
lie: D3
sleep: D2 D3
wear: D5
world: D1 D2
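Step 6 can be sketched as a dictionary from each term to the list of documents that contain it. This is a minimal in-memory sketch; the function names build_inverted_index and search_all are assumptions, and a production index would also store positions and counts.

```python
from collections import defaultdict

# Each document's distinct terms after stemming.
doc_terms = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}


def build_inverted_index(doc_terms):
    """Map each term to the sorted list of documents in which it occurs."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):
        for term in set(doc_terms[doc_id]):
            index[term].append(doc_id)
    return dict(sorted(index.items()))


def search_all(index, *terms):
    """Return the documents that contain every term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


inverted_index = build_inverted_index(doc_terms)
for term, doc_ids in inverted_index.items():
    print(f"{term}: {' '.join(doc_ids)}")

print(search_all(inverted_index, "dog", "hat"))
```

A search never scans the documents themselves: it looks up each term's pointer list and combines the lists, which is why the inverted file makes retrieval fast.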
Dialog Database Construction
FYI: For those interested in Dialog
Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records.
Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH.
Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms.
Step 4: Create the Additional Indexes: all terms from all remaining fields.
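The four steps can be sketched in a few lines. This is only an illustration of the idea and not Dialog's actual implementation: the sample records are made up, treating TI as the only subject-related field is an assumption, and the index layouts are hypothetical.

```python
# Stop words as listed in Step 2.
STOP_WORDS = {"an", "for", "the", "and", "from", "to", "by", "with"}

# Assumption: only the title is treated as a subject-related, word-indexed field.
SUBJECT_FIELDS = {"TI"}

# Made-up records, already labeled with field codes (Step 2).
incoming = [
    {"AU": "Smith, J.", "TI": "Searching the online databases"},
    {"AU": "Lee, K.", "TI": "Databases for law and medicine"},
]

# Step 1: linear file with sequential accession numbers.
linear_file = dict(enumerate(incoming, start=1))

basic_index = {}        # word -> accession numbers
additional_index = {}   # (field code, whole field value) -> accession numbers

for number, record in linear_file.items():
    for code, value in record.items():
        if code in SUBJECT_FIELDS:
            # Step 3: Basic Index from the words of subject-related fields.
            for word in value.lower().split():
                if word not in STOP_WORDS:
                    basic_index.setdefault(word, []).append(number)
        else:
            # Step 4: Additional Indexes from all remaining fields.
            additional_index.setdefault((code, value), []).append(number)

print(basic_index)
print(additional_index)
```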