Controlled Subject Vocabularies and Thesauri

Date post: 31-Jan-2016
Page 1: Controlled Subject Vocabularies and Thesauri

8/28/97 Information Organization and Retrieval

Controlled Subject Vocabularies and Thesauri

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Page 2: Controlled Subject Vocabularies and Thesauri

Review

• Controlled vocabularies

• Choice of names

• Form of names

• Name Authority files

Page 3: Controlled Subject Vocabularies and Thesauri

Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.

Page 4: Controlled Subject Vocabularies and Thesauri

Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person

Page 5: Controlled Subject Vocabularies and Thesauri

Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

Page 6: Controlled Subject Vocabularies and Thesauri

Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 06-06-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name

Page 7: Controlled Subject Vocabularies and Thesauri

Categorization Summary

• Processes of categorization underlie many of the issues having to do with information organization

• Categorization is messier than our computer systems would like

• Human categories have graded membership, consisting of family resemblances.

• Family resemblance is expressed in part by which subset of features are shared

• It is also determined by underlying understandings of the world that do not get represented in most systems

Page 8: Controlled Subject Vocabularies and Thesauri

Today

• Origins and Uses of Controlled Vocabularies for Information Retrieval

• Types of Indexing Languages, Thesauri and Classification Systems

• Process of Design and Development of Thesauri

Page 9: Controlled Subject Vocabularies and Thesauri

Origins

• Very early history of content representation

– Sumerian tokens and “envelopes”

– Alexandria - pinakes

– Indices

Page 10: Controlled Subject Vocabularies and Thesauri

Origins

• Biblical Indexes and Concordances

• Journal Indexes

• “Information Explosion” following WWII

– Cranfield Studies of indexing languages and information retrieval

– Development of bibliographic databases

• Index Medicus -- production and Medlars searching

Page 11: Controlled Subject Vocabularies and Thesauri

Origins

• Communication theory revisited

• Problems with transmission of meaning

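[Diagram: communication model. Source → Encoding → Channel (subject to Noise) → Decoding → Destination, with a Message entering and leaving the Channel.]

[Diagram: the same model applied to retrieval. Source → Encoding (writing/indexing) → Storage → Decoding (Retrieval/Reading) → Destination, with a Message entering and leaving Storage.]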

Page 12: Controlled Subject Vocabularies and Thesauri

What is a “Controlled Vocabulary”?

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

• Similarly, there are too many ways of expressing or explaining the topic of a document.

• A controlled vocabulary combines a set of Rules for topic identification and indexing with a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “Indexing Language”, sometimes with special coding or structures.

Page 13: Controlled Subject Vocabularies and Thesauri

Structure of an IR System

[Diagram: Information Storage and Retrieval System. Search line: Interest profiles & Queries → Formulating query in terms of descriptors → Storage of profiles → Store1: Profiles/Search requests. Storage line: Documents & data → Indexing (Descriptive and Subject) → Storage of Documents → Store2: Document representations. Both lines are governed by the Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language). Store1 and Store2 feed Comparison/Matching, which yields Potentially Relevant Documents.]

Adapted from Soergel, p. 19

Page 14: Controlled Subject Vocabularies and Thesauri

Uses of Controlled Vocabularies

• Library Subject Headings, Classification and Authority Files.

• Commercial Journal Indexing Services and databases

• Yahoo, and other Web classification schemes

• Online and Manual Systems within organizations

– SunSolve

– MacArthur

Page 15: Controlled Subject Vocabularies and Thesauri

Types of Indexing Languages

• Uncontrolled Keyword Indexing

• Indexing Languages

– Controlled, but not structured

• Thesauri

– Controlled and Structured

• Classification Systems

– Controlled, Structured, and Coded

• Faceted Classification Systems

Page 16: Controlled Subject Vocabularies and Thesauri

Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.

• An Indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.

Page 17: Controlled Subject Vocabularies and Thesauri

Indexing Languages

• Library of Congress Subject Headings

• Yellow Pages Topics

• Wilson Indexes (“Readers’ Guide”)

Page 18: Controlled Subject Vocabularies and Thesauri

Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
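
As a rough illustration of this definition, the sketch below models a thesaurus as a set of preferred terms linked by Used-For (synonym or equivalent), Broader (BT), Narrower (NT) and Related (RT) references. The class names, field names and sample terms are hypothetical and are not taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A preferred term and its links to other terms (hypothetical layout)."""
    term: str
    used_for: set = field(default_factory=set)   # synonymous / equivalent entry terms
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT


class Thesaurus:
    def __init__(self):
        self.descriptors = {}   # preferred term -> Descriptor
        self.lead_in = {}       # lowercased entry term -> preferred term

    def add(self, term, used_for=(), broader=()):
        d = self.descriptors.setdefault(term, Descriptor(term))
        self.lead_in[term.lower()] = term
        for synonym in used_for:
            d.used_for.add(synonym)
            self.lead_in[synonym.lower()] = term
        for bt in broader:
            # Keep BT and NT links reciprocal.
            d.broader.add(bt)
            self.descriptors.setdefault(bt, Descriptor(bt)).narrower.add(term)
            self.lead_in.setdefault(bt.lower(), bt)
        return d

    def relate(self, a, b):
        """Record a symmetric Related Term link between two descriptors."""
        self.descriptors[a].related.add(b)
        self.descriptors[b].related.add(a)

    def preferred(self, entry_term):
        """Map any entry term to its preferred term, or None if unknown."""
        return self.lead_in.get(entry_term.lower())


t = Thesaurus()
t.add("Vehicles")
t.add("Automobiles", used_for=["Cars", "Motor cars"], broader=["Vehicles"])
t.add("Bicycles", broader=["Vehicles"])
t.relate("Automobiles", "Bicycles")
```

Here `t.preferred("cars")` leads the searcher from an entry term to the descriptor "Automobiles", while the narrower links under "Vehicles" let a search be broadened or narrowed.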

Page 19: Controlled Subject Vocabularies and Thesauri

Thesauri (cont.)

• National and International Standards for Thesauri

– ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri

– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval

– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri

– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

Page 20: Controlled Subject Vocabularies and Thesauri

Thesauri (cont.)

• Examples:

– The Thesaurus of ERIC Descriptors

– The Art and Architecture Thesaurus

– The Medical Subject Headings (MeSH) of the National Library of Medicine

Page 21: Controlled Subject Vocabularies and Thesauri

Classification Systems

• A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.
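
One way a coded notation can express a topic's place among other topics is to make each code an extension of the code for its broader class. The sketch below assumes such a prefix-based notation; the codes and captions are invented for illustration and do not belong to any real scheme.

```python
# Hypothetical notation: each extra character narrows the topic.
SCHEME = {
    "A": "Science",
    "AB": "Mathematics",
    "ABC": "Geometry",
    "ABD": "Algebra",
    "AE": "Physics",
}


def broader(code):
    """Return the codes of all broader classes, nearest first."""
    return [code[:i] for i in range(len(code) - 1, 0, -1) if code[:i] in SCHEME]


def narrower(code):
    """Return the codes exactly one level below the given class."""
    return sorted(c for c in SCHEME
                  if c.startswith(code) and len(c) == len(code) + 1)
```

With this layout the notation alone is enough to move up or down the hierarchy: `broader("ABC")` walks from Geometry to Mathematics to Science without consulting any separate list of relationships.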

Page 22: Controlled Subject Vocabularies and Thesauri

Classification Systems (cont.)

• Examples:

– The Library of Congress Classification System

– The Dewey Decimal Classification System

– The ACM Computing Reviews Categories

– The American Mathematical Society Classification System

Page 23: Controlled Subject Vocabularies and Thesauri

Automatic Indexing and Classification

• Automatic indexing typically just derives keywords from a document and provides access to all of those words.

• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.

• Automatic classification attempts to automatically group similar documents using either:

– A fully automatic clustering method.

– An established classification scheme and set of documents already indexed by that scheme.
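
The simplest case described above, deriving keywords and providing access to all of them, can be sketched as an inverted index. The stop-word list, the tokenizing rule and the sample documents are assumptions made for this example only.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}


def derive_keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS}


def build_index(documents):
    """Map every derived keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return index


docs = {
    "d1": "The design of a thesaurus",
    "d2": "Indexing and the thesaurus",
}
index = build_index(docs)
```

Every surviving word becomes an access point, so nothing controls synonyms or word forms; that gap is what the more complex systems try to close by selecting controlled vocabulary terms.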

Page 24: Controlled Subject Vocabularies and Thesauri

Clustering

Agglomerative methods

[Diagram: individual documents (Doc) grouped step by step into clusters.]
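
A minimal sketch of an agglomerative method follows: every document starts as its own cluster, and the two most similar clusters are merged repeatedly. The Jaccard word-overlap measure, the single-link rule and the stopping threshold are assumptions chosen to keep the example short; the slides do not name a particular method.

```python
def jaccard(a, b):
    """Similarity of two word sets: shared words over all words."""
    return len(a & b) / len(a | b) if a | b else 0.0


def agglomerate(docs, threshold=0.2):
    """Repeatedly merge the two most similar clusters (single link).

    docs maps a document id to its set of words. Merging stops when no
    pair of clusters reaches the threshold, an arbitrary cut-off here.
    """
    clusters = [[doc_id] for doc_id in sorted(docs)]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(docs[x], docs[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        pairs = [(link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


sample = {
    "d1": {"thesaurus", "indexing", "terms"},
    "d2": {"thesaurus", "terms", "design"},
    "d3": {"clustering", "documents"},
    "d4": {"clustering", "documents", "automatic"},
}
groups = agglomerate(sample)
```

On this sample the two thesaurus documents end up in one cluster and the two clustering documents in another, because no word is shared across the two groups.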

Page 25: Controlled Subject Vocabularies and Thesauri

Automatic Class Assignment

[Diagram: a set of documents (Doc) and a Search Engine.]

1. Search using document contents

2. Obtain ranked list

3. Assign document to N categories ranked over threshold.
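
The three steps can be sketched as follows, with a simple word-overlap score standing in for the search engine. The scoring function, the sample categories, and the default values of `n` and `threshold` are all assumptions for illustration.

```python
from collections import defaultdict


def overlap(a, b):
    """Jaccard overlap of two word sets, standing in for a search engine score."""
    return len(a & b) / len(a | b) if a | b else 0.0


def assign_classes(new_doc, indexed_docs, n=2, threshold=0.25):
    """indexed_docs is a list of (word_set, category) pairs already classified."""
    # 1. Search using the document contents as the query.
    # 2. Obtain a ranked list, here by each category's best-scoring document.
    best = defaultdict(float)
    for words, category in indexed_docs:
        best[category] = max(best[category], overlap(new_doc, words))
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    # 3. Assign the document to at most n categories at or above the threshold.
    return [category for category, score in ranked[:n] if score >= threshold]


classified = [
    ({"thesaurus", "descriptor", "indexing"}, "Indexing"),
    ({"cluster", "similarity", "documents"}, "Clustering"),
    ({"library", "catalog", "authority"}, "Cataloging"),
]
new_doc = {"thesaurus", "indexing", "documents"}
```

Lowering the threshold admits weaker matches: with the defaults only "Indexing" is assigned to `new_doc`, while `threshold=0.1` also admits "Clustering".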

Page 26: Controlled Subject Vocabularies and Thesauri

Development of a Thesaurus

• Term Selection.

• Merging and Development of Concept Classes.

• Definition of Broad Subject Fields and Subfields.

• Development of Classificatory structure

• Review, Testing, Application, Revision.

Page 27: Controlled Subject Vocabularies and Thesauri

1. Preliminary Term Selection

• Select sources for the collection of terms.

– Prearranged Sources

– Open-ended Sources

• Assign codes to each source.

• Selection of terms

– For part of prearranged and for all open-ended sources

• Enter terms into database with all information.

Page 28: Controlled Subject Vocabularies and Thesauri

2. Merging and Development of Concept Classes

• Sort Term DB into alphabetical order.

• First Round: Merge information for Identical terms -- possibly pulling info from additional sources.

• Second Round: Merge synonyms or terms in the same concept class.
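
The two merge rounds above can be sketched on a toy term database. The record layout (a term plus a source code), the case-insensitive matching, and the editor-supplied `concept_of` mapping are assumptions for this example; in practice the second round depends on human judgment.

```python
def merge_identical(records):
    """First round: sort alphabetically and merge records for identical terms."""
    merged = {}
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged.setdefault(term.lower(), set()).add(source)
    return merged


def merge_synonyms(merged, concept_of):
    """Second round: pool terms an editor has placed in the same concept class.

    concept_of maps a term to a concept label; unlisted terms form their own class.
    """
    classes = {}
    for term, sources in merged.items():
        label = concept_of.get(term, term)
        entry = classes.setdefault(label, {"terms": set(), "sources": set()})
        entry["terms"].add(term)
        entry["sources"] |= sources
    return classes


records = [("Cars", "S1"), ("automobiles", "S2"), ("cars", "S3"), ("Bicycles", "S1")]
first = merge_identical(records)
second = merge_synonyms(first, {"cars": "automobiles"})
```

Keeping the source codes attached to each merged entry preserves the evidence for a term, which helps later when preferred terms are selected.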

Page 29: Controlled Subject Vocabularies and Thesauri

3. Definition of Broad Subject Fields and Subfields

• Define Broad Subject fields and sort terms into these broad fields

• Define subfields within each broad field and sort terms into these subfields.

• Work out the detailed structure

– Select Preferred Terms

– Merge information for terms in the same concept class

• Repeat these steps

– for each subfield within a broad field

– and for each broad field

– Until all terms have been consolidated and preferred terms selected

Page 30: Controlled Subject Vocabularies and Thesauri

4. Development of Classificatory Structure

• Produce preliminary version of classified index and update the working database.

• Improve classificatory structure

• Reality check: produce a version of the classified index and distribute it to users/experts.

Page 31: Controlled Subject Vocabularies and Thesauri

5. Final Stages

• Review

• Testing

• Application

• Revision

Page 32: Controlled Subject Vocabularies and Thesauri

Review

• Discuss classified index with users/experts.

– Select descriptors and checklist descriptors.

• Assign Notational Symbols

• Produce Main Thesaurus & Indexes

Page 33: Controlled Subject Vocabularies and Thesauri

Review (cont.)

• Check cross references and insert where needed

• Produce Test Version

• Test by Indexing

• Modify as needed

• Produce Production Version.

Page 34: Controlled Subject Vocabularies and Thesauri

Testing a Thesaurus

• Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus).

• Test retrieval using sample questions and see how effectively the thesaurus maps each one to the appropriate descriptor.
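
Both tests can be approximated with simple bookkeeping, as in the sketch below. It assumes the lead-in vocabulary is available as a dictionary from lowercased entry terms to descriptors, and that each sample question has been reduced to one query term with an expected descriptor; both are simplifications made for illustration.

```python
def find_gaps(lead_in, sample_concepts):
    """Return concepts from new documents that no thesaurus entry covers."""
    return sorted(c for c in sample_concepts if c.lower() not in lead_in)


def mapping_accuracy(lead_in, sample_questions):
    """Fraction of sample query terms that map to the expected descriptor.

    sample_questions is a list of (query_term, expected_descriptor) pairs.
    """
    if not sample_questions:
        return 0.0
    hits = sum(1 for query, expected in sample_questions
               if lead_in.get(query.lower()) == expected)
    return hits / len(sample_questions)


lead_in = {"cars": "Automobiles", "automobiles": "Automobiles", "bikes": "Bicycles"}
questions = [
    ("cars", "Automobiles"),
    ("bikes", "Bicycles"),
    ("trucks", "Trucks"),       # no entry term yet: a gap
    ("Bikes", "Automobiles"),   # maps, but to a different descriptor
]
```

The list returned by `find_gaps` suggests candidate terms to add, while a low `mapping_accuracy` points at missing or misleading lead-in entries.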

Page 35: Controlled Subject Vocabularies and Thesauri

The Indexing Process

• Concept identification

• Term selection (via thesaurus)

• Term assignment
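
The three steps can be sketched as one small function. Concept identification is assumed to be done by a human indexer and passed in as a list of phrases; the dictionary form of the thesaurus lookup and the sample terms are likewise assumptions for this example.

```python
def index_document(concepts, lead_in):
    """Select descriptors for the identified concepts and assign them.

    concepts: phrases an indexer identified in the document (step 1, by hand).
    lead_in:  lowercased entry term -> preferred descriptor (lookup for step 2).
    Returns the assigned descriptors and the concepts that found no term.
    """
    assigned, unmatched = [], []
    for concept in concepts:
        descriptor = lead_in.get(concept.lower())   # step 2: term selection
        if descriptor is None:
            unmatched.append(concept)               # candidate for a new term
        elif descriptor not in assigned:
            assigned.append(descriptor)             # step 3: term assignment
    return assigned, unmatched


entry_terms = {"cars": "Automobiles", "automobiles": "Automobiles", "bikes": "Bicycles"}
result = index_document(["Cars", "automobiles", "Scooters"], entry_terms)
```

Two different phrasings of the same concept collapse to one descriptor, and the unmatched concept is returned separately so that someone can decide whether the thesaurus needs a new term.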

Page 36: Controlled Subject Vocabularies and Thesauri

Application: The Indexing Process (Manual)

[Flowchart: manual indexing procedure, from Start to End. Steps: Examine Document and Identify Significant Concepts; Consider First Concept; Establish Term Denoting Concept; Consider Preferred Term; Select Preferred Term; Consider any associated terms in Thesaurus (NT, BT); Consider Each of These Terms; Select Alternative term to represent Concept; Prefer Alternative Term(s); Admit New Term Into Thesaurus; Assign Terms to Document. Decisions (YES/NO): Does Thesaurus contain term for Concept? Preferred Term? Is Term suitable? Would Concept be better represented by one of these terms? Can Concept be expressed combining terms? Is There Another Concept?]

Adapted from ISO 5963, p.5

Page 37: Controlled Subject Vocabularies and Thesauri

Thesaurus Revision and Updates

• There will always be new concepts, products, or expressions that need to be added to the thesaurus.

– Set a regular schedule of reviews and revisions.

– Collect complaints, problems, etc. and fold into revision of the thesaurus

Page 38: Controlled Subject Vocabularies and Thesauri

References

• Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.

• Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.

• Standards:

– ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri

– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval

– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri

– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

