INLS 520 – Fall 2007, Erik Mitchell
INLS 520
Information Organization
Review
• Last week
  – Types of categorization & classification structures
• Classification
  – Definitions
  – Look at library classification systems for Dewey & Library of Congress
Today
• Controlled vocabularies
  – Types
  – Basic concepts
• Related technologies
  – Metadata standards
  – Example systems
• Knowledge organization systems
  – Term Lists, Thesauri, Taxonomies, Ontologies
Concepts & definitions
• Controlled Vocabularies
  – “organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast)
  – “the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19)
• Knowledge organization systems
  – “tools that present the organized interpretation of knowledge structures” (Hjørland)
  – “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge)
  – “It depends upon what the meaning of the word ‘is’ is.” (Clinton)
Uses of controlled vocabulary (1)
• Define scope, content, and context of information
• Navigation, breadcrumbs
• Map to user terminology
• Enhance browsing, searching
• Term consistency and relationships
Functions of a CV
• Removes ambiguity
  – Synonyms, homonyms, polysemes
• Defines relationships
  – Equivalence, hierarchical, associative (BT, NT, RT, CR), reciprocity
• Provides context
  – Category, scope, qualifiers, modifiers, scope notes
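The reciprocity listed above (a BT link in one direction implies an NT link in the other, and RT is symmetric) can be sketched in a few lines of Python. This is only an illustrative sketch: the `Thesaurus` class, the `INVERSE` table, and the sample terms are assumptions made for the example, not part of any standard or tool named in these slides.

```python
from collections import defaultdict

# Reciprocal relationship types: recording one direction implies the other.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}


class Thesaurus:
    """Hypothetical store of term relationships that are kept reciprocal."""

    def __init__(self):
        # term -> relationship type -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def relate(self, term, rel, other):
        """Record `term rel other` and the inverse link from `other` to `term`."""
        self.relations[term][rel].add(other)
        self.relations[other][INVERSE[rel]].add(term)

    def related(self, term, rel):
        """Return the terms linked to `term` by relationship `rel`, sorted."""
        return sorted(self.relations[term][rel])


# Sample terms chosen for illustration only.
thesaurus = Thesaurus()
thesaurus.relate("Thesauri", "BT", "Controlled vocabularies")
thesaurus.relate("Taxonomies", "BT", "Controlled vocabularies")
thesaurus.relate("Thesauri", "RT", "Ontologies")

print(thesaurus.related("Controlled vocabularies", "NT"))  # ['Taxonomies', 'Thesauri']
print(thesaurus.related("Ontologies", "RT"))               # ['Thesauri']
```

Storing the inverse link at write time means a lookup never has to search the whole vocabulary to answer "what are the narrower terms of X?".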
Types of Controlled Vocabularies
• Term Lists
  – Glossaries, Dictionaries, Gazetteers, Folksonomies
• Synonym rings
  – Z39.19 example
  – Oracle Text
• Taxonomies
  – Website navigation scheme
• Thesauri / Ontologies
  – Authority files, subject thesauri, topic maps
A conceptual map
http://www.taxotips.com/
CV Concepts
• Content Analysis
  – Ambiguity
  – Synonymy
  – Exhaustivity
  – Specificity
  – Co-extensivity
  – Aboutness
  – Semantic structure
  – Warrant (User, Literary, Organization)
• Form Analysis
  – Linguistics
  – Grammar
  – Semiotics
  – Single / Multiple terms
• Indexing & Retrieval
  – Pre vs. Post Coordinate
  – Recall vs. Precision
  – Natural language processing (NLP)
Content Analysis (1)
• Ambiguity
  – Each term should relate to a single concept
• Synonymy
  – Each concept should be identified by a single entry
• Specificity
  – Using the most specific word or phrase expressing the subject
• Exhaustivity
  – The extent to which the entire document is indexed (summarization, depth)
• Co-extensivity
  – “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster)
  – “nothing more, nothing less”
• Semantic Structure
  – Terms can be related with equivalence, hierarchical, or associative relationships (Use, See, NT, BT, RT)
Content Analysis (2)
• Aboutness = Subject/topic?
  – Wilson (1968)
    • Author intent, topicality, relationship to other resources, textual analysis
  – Fairthorne (1969)
    • Intentional aboutness (author), extensional aboutness (document)
  – Maron (1977)
    • Objective about (document), subjective about (user), and retrieval about (information retrieval)
  – Hjørland (2001)
    • “Closely related to theories of meaning, interpretation, and epistemology”
Content Analysis (3)
• Wilson’s criteria for evaluating aboutness (1968)
  – Identify author’s purpose (intent)
  – Weigh the predominant topics, elements (topical analysis)
  – Group/count a document’s use of concepts and references (bibliometrics)
  – Identify essential elements (text analysis)
Content Analysis (4)
• Literary Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term ‘oncology.’ Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term ‘cancer.’” (Glosso-Thesaurus)
• User Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus)
• Organizational Warrant
  – “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)
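The search log analysis mentioned under user warrant can be sketched as a simple frequency count. This is a minimal sketch under stated assumptions: the function name `candidate_terms`, the `min_count` threshold, and the sample log are all hypothetical, and a real log would need more normalization (stemming, spelling variants) than the lowercasing shown here.

```python
from collections import Counter


def candidate_terms(search_log, vocabulary, min_count=2):
    """Return (query, count) pairs that users typed often but the CV lacks.

    `min_count` is an arbitrary, hypothetical threshold for "used often enough".
    """
    # Normalize queries by trimming whitespace and lowercasing; skip blanks.
    counts = Counter(q.strip().lower() for q in search_log if q.strip())
    known = {term.lower() for term in vocabulary}
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term not in known
    ]


# Hypothetical search log and vocabulary, for illustration only.
search_log = ["cancer", "Cancer", "oncology", "cancer ", "tumor", "Tumor"]
vocabulary = {"Oncology"}

print(candidate_terms(search_log, vocabulary))  # [('cancer', 3), ('tumor', 2)]
```

Terms surfaced this way are only candidates; whether they become preferred terms or entry terms pointing to an existing term remains an editorial decision.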
Form Analysis
• Linguistics
  – Syntax/Form (grammar)
  – Morphology (internal word structure)
  – Semantics (meaning)
  – Pragmatics, discourse analysis (word/phrase use)
• Semiotics
  – Study of signs/symbols
• Lexical structure
  – Document layout, markup, tags (think DOM)
Indexing & Retrieval
• Pre/Post-Coordinate
  – Pre-coordinate: organization prior to retrieval
  – Post-coordinate: organization at the point of retrieval
• Recall / Precision
  – Recall: number of retrieved relevant docs / total number of relevant docs in collection
  – Precision: number of retrieved relevant docs / total number of retrieved docs
• Natural language processing
  – Uses semantics and syntax to automatically distill ‘aboutness’
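Post-coordination, where single-concept terms are combined only at the point of retrieval, can be illustrated with a small inverted index and a Boolean AND. This is a sketch, not a description of any particular system: the documents, index terms, and the `post_coordinate_search` function are hypothetical. A pre-coordinate system would instead assign a combined heading such as "Thesauri, Design" at indexing time.

```python
from collections import defaultdict

# Hypothetical documents, each indexed with single-concept terms.
documents = {
    "doc1": {"thesauri", "design"},
    "doc2": {"thesauri", "evaluation"},
    "doc3": {"taxonomies", "design"},
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, terms in documents.items():
    for term in terms:
        index[term].add(doc_id)


def post_coordinate_search(*terms):
    """Combine single terms at retrieval time with a Boolean AND."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


print(post_coordinate_search("thesauri", "design"))  # ['doc1']
```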
Recall & Precision
• A collection of 100 documents
• Searches
  – “Vocabularies”
    • Recall 100/100 = 1
    • Precision 100/100 = 1
  – “Facet”
    • Recall 20/100 = .2
    • Precision 20/28 = .71
  – “OWL”
    • Recall 1/100 = .01
    • Precision 1/1 = 1
CV Entry                   # of docs
Controlled Vocabularies    100
Faceted analysis           20
Ontologies                 5
OWL                        1
RDF                        3
Recall = # of relevant docs retrieved / total # of relevant docs in collection
Precision = # of relevant docs retrieved / total # of docs retrieved
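For reference, the standard information-retrieval definitions are recall = relevant documents retrieved / all relevant documents, and precision = relevant documents retrieved / all documents retrieved. The sketch below computes both from sets of document ids; the ids and the relevance judgments are made up for illustration and are not the figures from the slide's table.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved / all relevant documents in the collection."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Relevant documents retrieved / all documents retrieved."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical judgments: documents 1-20 are relevant to the query,
# and the search returned documents 11-35.
relevant = set(range(1, 21))
retrieved = set(range(11, 36))

print(recall(retrieved, relevant))     # 0.5  (10 of 20 relevant docs found)
print(precision(retrieved, relevant))  # 0.4  (10 of 25 retrieved docs relevant)
```

The two measures usually trade off: broadening a query tends to raise recall and lower precision, and narrowing it does the reverse.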
Term List Examples
• Authority files – Maps to preferred terms
  – Library of Congress
  – Encoded Archival Context
  – Union List of Artist Names
• Glossaries/Dictionaries – Words & definitions, sometimes topic focused
  – Glosso-Thesaurus
• Folksonomies
  – Contextualization, trend discovery, personal information
• Synonym rings – Used for back-end equivalence in searching
  – Princeton WordNet
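Back-end equivalence with a synonym ring can be sketched as query expansion: every member of a ring is treated as equivalent and no term is preferred. The rings and the `expand_query` helper below are hypothetical examples, not data taken from WordNet or Z39.19.

```python
# Hypothetical synonym rings; no member is preferred over the others.
SYNONYM_RINGS = [
    {"heart attack", "myocardial infarction"},
    {"cancer", "neoplasm", "tumor"},
]


def expand_query(term):
    """Return the term plus every member of any ring that contains it."""
    normalized = term.strip().lower()
    expanded = {normalized}
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            expanded |= ring
    return sorted(expanded)


print(expand_query("Heart attack"))  # ['heart attack', 'myocardial infarction']
print(expand_query("thesaurus"))     # ['thesaurus']
```

A search back end would then run the expanded list as an OR query, so the user retrieves the same documents whichever variant they typed.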
Thesauri & taxonomy examples
• List of vocabularies
  – http://www.slais.ubc.ca/resources/indexing/database1.htm
  – Taxonomy Warehouse
• Two Examples
  – Health & Ageing Thesaurus
  – Thesaurus of Geographic Names
Interoperable system example
• NCBI Entrez
  – 35 databases using interoperable controlled vocabulary systems to provide rich meta-searching
• Cross-database discovery – search for “heart attack”
• Cross-database linking – search for aconitase, follow the “other links” tab
Vocabulary and Classification systems - exercise
• Organization structures
  – Term Lists / Enumerative systems
  – Hierarchies
  – Trees
  – Paradigms
  – Facets / Associative relationships
  – Folksonomies
• Break into groups, discuss & list
  – Goal
  – Structure
  – Issues
  – Benefits
• Resources
  – Kwasnik, Boxes & Arrows
Choosing a framework
• Use questions
  – Who is your user, what are their needs?
  – What systems are your users familiar with?
  – Will this system be internal/external?
• Content questions
  – How extensive, defined is the information?
  – Is your subject matter static or fluid?
  – What organizational framework best describes your content?
• System questions
  – What access are you trying to provide?
  – What external pressures exist?
  – What external entities/theories will interact with this system?
Interoperability issues
• Similarity of subject matter in domains
• Multiple CVs accepted in a domain
• Specificity/granularity of content indexing
• Use of synonyms, warrant
• Intended use, purpose of system
Creating a CV (1)
• Design methods
  – Re-use existing, start with content & desired use ideas
  – Committee / community approach
• Top-down
  – Concept driven
• Bottom-up
  – Document driven
  – Empirical approach
• Deductive approach
  – Select terms, create relationships, perform term control
• Inductive approach
  – Establish CV at outset, build hierarchies on an as-needed basis
Creating a CV (2)
• Top-Down
  – Identify audience
  – Identify all topics, concepts, uses, and context of the domain
  – Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted)
  – Solidify structure and clean up gaps & redundancies
  – Assign documents to categories, test retrieval
• Bottom-up
  – Identify audience
  – Survey documents for topics/concepts
  – Build system on the fly – let content drive structure and limits of system
  – Identify gaps & redundancies in system
  – Test retrieval
Creating a CV (3)
• Think about scope, use, content, maintenance
• Gather Terms
  – Based on existing systems, content
  – Based on user needs/expectations
  – Investigate issues of specificity, exhaustivity, granularity
• Build hierarchies, relationships
  – Broader/narrower terms, Related terms, Use/Use for, see/see also
• Establish Rules
• Implement
• Evaluate
• Maintain
http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary
Evaluating a CV
• Goals
  – Determine if the CV solves retrieval needs of user/system
  – Determine if CV matches user’s content model/term expectations
• Methods
  – Expert evaluation of CV
  – User-based card sorting compared to actual CV
  – Identification of non-included documents
  – Analysis of use of system – HCI
CV Maintenance
• Primary responsibility
  – Editor, board, committee
• New terms
  – Is it really new or a different view?
  – What is the proper form & placement?
• Modified terms
  – Include a change log
  – Use a “USE” reference to point to new term
• Deleted terms
  – Unused / Overused terms
  – May want to keep for historical retrieval purposes
• Modification history
  – Use modification notes, date/time stamps
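The maintenance practices above (a change log, a “USE” reference from a modified term to its replacement, and time-stamped modification notes) can be sketched together. This is one possible arrangement under stated assumptions: the `Vocabulary` class, its method names, and the sample terms are hypothetical and do not reflect any specific vocabulary management tool.

```python
from datetime import datetime, timezone


class Vocabulary:
    """Hypothetical CV that keeps USE references and a change log."""

    def __init__(self, terms):
        self.active = set(terms)
        self.use = {}          # retired term -> replacement ("USE" reference)
        self.change_log = []   # modification history, oldest first

    def _log(self, action, term, note):
        # Time-stamp every change so the history can be audited later.
        self.change_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "term": term,
            "note": note,
        })

    def rename(self, old, new, note=""):
        """Retire `old`, activate `new`, and leave a USE reference behind."""
        if old not in self.active:
            raise KeyError(old)
        self.active.discard(old)
        self.active.add(new)
        self.use[old] = new
        self._log("modified", old, note or f"USE {new}")

    def resolve(self, term):
        """Follow USE references until an active term is reached, else None."""
        seen = set()
        while term in self.use and term not in seen:
            seen.add(term)
            term = self.use[term]
        return term if term in self.active else None


# Sample terms for illustration only.
vocab = Vocabulary({"Cookery"})
vocab.rename("Cookery", "Cooking", note="Updated to current usage")

print(vocab.resolve("Cookery"))       # Cooking
print(vocab.change_log[0]["action"])  # modified
```

Keeping the retired term as a USE reference rather than deleting it means documents indexed under the old form remain findable, which matches the slide's point about historical retrieval.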
Class exercise
• Protégé overview
  – Orientation
  – Object types (Classes, Slots, Instances)
  – Relationships (hierarchies, associative)
• Replication of the Glosso-Thesaurus
  – Visit the Boxes & Arrows Glosso-Thesaurus
  – Look at the data there and come up with a structure in Protégé that allows replication of the thesaurus
  – Some issues to consider are:
    • Do you want terms to be classes or instances?
    • What is the easiest way to show the relationships (broader term, narrower term, etc.)?
    • Do you need to allow multiple relationships for a given type (BT, RT, etc.)?
    • If you have multiple classes, at what level should you create the slots?
Next Week
• More on Knowledge organization systems
  – Taxonomies, Ontologies
  – More work with Protégé