Date post: | 22-Dec-2015 |
Prof. Ray R. Larson
University of California, Berkeley School of Information
Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?
Overview
• Metadata as Infrastructure
  – What, Where, When and Who?
• What are Entry Vocabulary Indexes?
  – Notion of an EVI
  – How are EVIs Built
• Time Period Directories
  – Mining Metadata for new metadata
• 4W Demo
• New Project: Bringing Lives to Light
Metadata as Infrastructure
The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?
Metadata as Infrastructure
The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.
What?
Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.
Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.
Problem: Controlled vocabularies can be difficult for people to use.
• Index term: “pass mtr veh spark ign eng”
• In Library of Congress subject headings, use “Economic Policy” for “Wirtschaftspolitik”
“What” and Entry Vocabulary Indexes
EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
Building an Entry Vocabulary Module (EVI) and Searching
[Flowchart]
User has a question but is unfamiliar with the domain he wants to search.
User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
Has an Entry Vocabulary Module been built?
NO: Build an Entry Vocabulary Module (EVI)
• Download a set of training data from an Internet DB indexed with a controlled vocabulary.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
YES: Use an existing EVI (Searching)
• Map user’s query to ranked list of controlled vocabulary terms.
• User selects search terms from the ranked list of terms returned by the EVI.
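The search branch of the flowchart can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `suggest_headings`, the nested-dictionary layout of `weights`, and the numeric weights are all assumptions made for the example. It sums the association weight of each query word for each controlled-vocabulary heading and returns the headings in ranked order.

```python
from collections import defaultdict

def suggest_headings(query, weights, top_n=5):
    """Rank controlled-vocabulary headings for a free-text query.

    `weights` maps an ordinary-language term to {heading: association weight}.
    The layout is hypothetical; a real EVI would hold far more entries.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        for heading, weight in weights.get(term, {}).items():
            scores[heading] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Illustrative weights only; they are not taken from any real EVI.
EXAMPLE_WEIGHTS = {
    "automobile": {"Automobiles": 12.0, "Internal combustion engines": 4.5},
    "engine": {"Internal combustion engines": 9.0},
}
```

The user would then pick search terms from the returned list, as in the last step of the flowchart.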
Building and Searching EVIs: Technical Details
Building an Entry Vocabulary Module (EVI)
• Download a set of training data from an Internet DB indexed with a controlled vocabulary.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
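The counting behind the "build associations" step can be sketched as below. The sketch assumes each training record has already been reduced to a pair (extracted terms, assigned headings); the record layout and the function names `count_associations` and `contingency` are hypothetical. It tallies how often each term and each heading occur together and apart, which yields the four cells a, b, c, d of a 2x2 term/class table.

```python
from collections import Counter

def count_associations(records):
    """Tally term/class co-occurrence over training records.

    Each record is (terms, classes): terms extracted from a title and
    abstract, and the controlled-vocabulary headings assigned to it.
    """
    pair_counts = Counter()   # (term, class) -> records containing both
    term_counts = Counter()   # term -> records containing the term
    class_counts = Counter()  # class -> records assigned the class
    for terms, classes in records:
        terms, classes = set(terms), set(classes)
        term_counts.update(terms)
        class_counts.update(classes)
        pair_counts.update((t, c) for t in terms for c in classes)
    return pair_counts, term_counts, class_counts, len(records)

def contingency(term, cls, pair_counts, term_counts, class_counts, n):
    """Return the 2x2 cells (a, b, c, d) for one term/class pair."""
    a = pair_counts[(term, cls)]      # term present, class assigned
    b = term_counts[term] - a         # term present, class not assigned
    c = class_counts[cls] - a         # term absent, class assigned
    d = n - a - b - c                 # term absent, class not assigned
    return a, b, c, d
```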
Association Measure
       |  C  | ¬C
  t    |  a  |  b
  ¬t   |  c  |  d
Where t is the occurrence of a term and C is the occurrence of a class in the training set.
Association Measure
Maximum likelihood ratio (after Dunning):
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log(p) + (n - k) \log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
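A small numeric sketch of the likelihood-ratio measure may help. It assumes the helper takes its arguments in the order (p, k, n), meaning k occurrences out of n trials, which is the reading under which the calls in W(C,t) are consistent. The function names `log_l` and `association` are illustrative. Terms of the form 0·log(0) are treated as zero, and both rows of the table are assumed to be non-empty.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p), with 0*log(0) = 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def association(a, b, c, d):
    """W(C,t) for a 2x2 table with rows (t, not t) and columns (C, not C).

    Assumes a + b > 0 and c + d > 0, i.e. the term occurs in some
    training records and is absent from others.
    """
    p1 = a / (a + b)                  # share of records with the term that are in C
    p2 = c / (c + d)                  # share of records without the term that are in C
    p = (a + c) / (a + b + c + d)     # overall share of records in C
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))
```

When the term and the class are independent (p1 = p2 = p) the measure is zero, and it grows as the term becomes more predictive of the class.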
Alternatively
Because the “evidence” terms associated with each class in an EVI can be treated as a document, IR ranking techniques also apply: rank the classes against the query and use the top-ranked classes for classification or query expansion.
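One way to picture this alternative is a TF-IDF cosine ranking, shown below. The slides do not say which IR technique was used, so TF-IDF with cosine similarity is a stand-in chosen for illustration, and `rank_classes` and the sample evidence lists are hypothetical. Each class's evidence terms form one pseudo-document, and the query is ranked against them.

```python
import math
from collections import Counter

def rank_classes(query_terms, class_evidence):
    """Rank classes by TF-IDF cosine similarity between query and evidence terms."""
    n = len(class_evidence)
    df = Counter()                      # number of classes whose evidence has the term
    for terms in class_evidence.values():
        df.update(set(terms))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}

    def vector(terms):
        tf = Counter(t for t in terms if t in idf)
        return {t: tf[t] * idf[t] for t in tf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(query_terms)
    scores = {c: cosine(q, vector(terms)) for c, terms in class_evidence.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative evidence terms only.
EVIDENCE = {
    "Automobiles": ["car", "automobile", "vehicle", "automobile"],
    "Economic policy": ["policy", "economy", "trade"],
}
```

The top-ranked class can be assigned directly (classification), or its evidence terms can be added to the query (query expansion).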
[Figure: Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil. Statistical association W(C,t) links the query to digital library resources.]
EVI example
User query: “Automobile”
EVI 1 index term: “pass mtr veh spark ign eng”
EVI 2 index term: “automobiles” OR “internal combustion engines”
Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Why not treat language the same way?
Support for the Learner with a Query
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
Any catalog: Archives, Libraries, Museums, TV, Publishers
Facet | Vocabulary                               | Displays
WHAT  | Thesaurus, e.g. LCSH                     | Cross-references
WHERE | Gazetteer                                | Map
WHEN  | Period directory                         | Timeline
WHO   | Biographical dictionary, e.g. Who’s Who  | Personal relations
It is also difficult to move between different media forms.
[Diagram: Texts; Numeric datasets; Thesaurus/Ontology; EVI]
Searching across data types
Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results
But texts associated with numeric data can be mapped as well…
[Diagram: Texts; Numeric datasets; captions; Thesaurus/Ontology; two EVIs]
But there are also geographic dependencies…
[Diagram: Texts; Numeric datasets; captions; Maps/Geo Data; Thesaurus/Ontology; two EVIs]
WHERE: Place names are problematic…
• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
• Anachronisms: No Germany before 1870.
• Vague, e.g. Midwest, Silicon Valley.
• Unstable boundaries: 19th century Poland; Balkans; USSR.
Use a gazetteer!
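How a gazetteer absorbs variant names and homographs can be sketched with a small lookup structure. The `Place` and `Gazetteer` classes are hypothetical, and the coordinates are approximate values included only for illustration. Every variant name points to the same place record, and a homograph returns several candidates until a region narrows the choice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Place:
    preferred: str
    region: str
    lat: float   # approximate, for illustration only
    lon: float

class Gazetteer:
    def __init__(self):
        self._by_name = {}   # case-folded name -> list of Place records

    def add(self, place, *variants):
        """Register a place under its preferred name and any variant names."""
        for name in (place.preferred, *variants):
            self._by_name.setdefault(name.casefold(), []).append(place)

    def lookup(self, name, region=None):
        """Return all places known by this name, optionally limited to a region."""
        hits = self._by_name.get(name.casefold(), [])
        if region is not None:
            hits = [p for p in hits if p.region == region]
        return hits

gaz = Gazetteer()
gaz.add(Place("Cluj", "Romania", 46.77, 23.59), "Klausenburg", "Kolozsvar")
gaz.add(Place("Vienna", "Austria", 48.21, 16.37))
gaz.add(Place("Vienna", "Virginia", 38.90, -77.27))
```

A fuller gazetteer entry would also carry a place type and a time range, which is what makes anachronisms and boundary changes tractable.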
WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.
Timebar
So geographic search becomes part of the infrastructure
[Diagram: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI]
WHEN: Search by time is also weakly supported…
• Calendars are the standard for time.
• But people use the names of events to refer to time periods.
• Named time periods resemble place names in being:
  – Unstable: European War, Great War, First World War
  – Multiple: Second World War, Great Patriotic War
  – Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region.
Vocabularies are the key!
Want: Kung-fu movies?
Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.
Linking vocabularies: WHAT, WHERE, WHEN
• Library subject headings: Topic – Geographic subdivision – Chronological subdivision
• Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When
• Time Period Directory: Period name – Type – Time markers (Calendar) – Where
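A time period directory entry of the shape "Period name – Type – Time markers – Where" can be sketched as a small record plus two lookups. The `Period` class, its field names, and the helper functions are assumptions for illustration. The sample entries use the three "Civil war" periods to show how the Where field resolves an ambiguous period name into calendar dates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Period:
    name: str
    kind: str    # period type, e.g. "war"
    start: int   # calendar year the period begins
    end: int     # calendar year the period ends
    where: str   # place the period applies to

DIRECTORY = [
    Period("Civil war", "war", 1642, 1651, "England"),
    Period("Civil war", "war", 1861, 1865, "USA"),
    Period("Civil war", "war", 1936, 1939, "Spain"),
]

def find_periods(directory, name, where=None):
    """Resolve a period name to entries, optionally restricted by place."""
    hits = [p for p in directory if p.name.casefold() == name.casefold()]
    if where is not None:
        hits = [p for p in hits if p.where == where]
    return hits

def periods_in_range(directory, start, end):
    """Return entries whose time span overlaps the years [start, end]."""
    return [p for p in directory if p.start <= end and p.end >= start]
```

Because each entry carries both dates and a place, a named period can be turned into a date range for catalog searching, or into a place for a gazetteer lookup.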
Time period directories link via the place (or time)
[Diagram: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI; Time Period Directory; Time lines, Chronologies]
WHO: Biographical Dictionary
Complex relationships
Life events metadata
• WHAT (Actions): prisoner
• WHERE (Places): Holstein
• WHEN (Times): 1261-1262
• WHO (People): Margaret Sambiria
Need external links
Any document, object, or performance
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages
Any catalog: Archives, Libraries, Museums, TV, Publishers
Connect it with its context – and other resources.
Facet | Vocabulary                               | Displays
WHAT  | Thesaurus, e.g. LCSH                     | Cross-references
WHERE | Gazetteer                                | Map
WHEN  | Period directory                         | Timeline
WHO   | Biographical dictionary, e.g. Who’s Who  | Personal relations
More information about the country of India…
• Wikipedia
• CIA Factbook
• BBC
• Ethnologue
• Berkeley Natural History Museums
New Project: Bringing Lives to Light:
Biography in Context
Ray R. Larson, Michael Buckland, Fredric Gey
University of California, Berkeley
WHEN, WHERE and WHO
• Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.
• Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.
• Biographical dictionaries are also heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
• Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
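The "life as a series of episodes" idea can be sketched as a list of records with the four facets as fields. The `Episode` class and the helper functions are hypothetical, and the data are the Emanuel Goldberg entries quoted above, reduced to years. Once a life is stored this way it can be queried by time or by person.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    what: str          # activity
    where: str         # place
    start: int         # first year of the episode
    end: int           # last year of the episode
    who: tuple = ()    # other people involved

GOLDBERG = [
    Episode("Born", "Moscow", 1881, 1881),
    Episode("PhD", "Leipzig", 1906, 1906, ("Wilhelm Ostwald",)),
    Episode("Director, Zeiss Ikon", "Dresden", 1926, 1933),
    Episode("Moved", "Palestine", 1937, 1937),
    Episode("Died", "Tel Aviv", 1970, 1970),
]

def episodes_during(episodes, start, end):
    """Episodes whose span overlaps the years [start, end]."""
    return [e for e in episodes if e.start <= end and e.end >= start]

def episodes_with(episodes, person):
    """Episodes that involve the named person."""
    return [e for e in episodes if person in e.who]
```

The where and start/end fields are the hooks through which an episode could be joined to a gazetteer and a time period directory.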
A new form of biographical dictionary would link to all
[Diagram: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI; Time Period Directory; Time lines, Chronologies; Biographical Dictionary]
Projected Work
• Develop XML markup for Biographical Events
  – Most likely to be an adaptation and extension of existing biographical event markup
  – Example: EAC/EAD
• Harvest biographical resources
  – Wikipedia, etc.
• Integrate as next generation of current interface
EAC/EAD
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
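As a sketch of how such markup could feed life events metadata, the snippet below reads a chronlist with Python's standard `xml.etree.ElementTree` and emits one WHAT/WHERE/WHEN/WHO record per chronitem. The function name `chron_events` and the output dictionary keys are assumptions for illustration, and the sample is a two-item excerpt of the record above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
  </chronlist>
</bioghist>
"""

def chron_events(xml_text):
    """Turn each <chronitem> into a dict of when / what / where / who."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.iter("chronitem"):
        event = item.find("event")
        # Flatten the event's mixed content and normalize whitespace.
        what = " ".join("".join(event.itertext()).split()) if event is not None else ""
        events.append({
            "when": item.findtext("date", default="").strip(),
            "what": what,
            "where": [g.text.strip() for g in item.iter("geogname") if g.text],
            "who": [p.text.strip() for p in item.iter("persname") if p.text],
        })
    return events
```

Dates are kept as strings here, since entries such as "1892, May 7" and "1917-1919" would need separate normalization before any time-based search.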
Wikipedia data
Life events metadata
• WHAT (Actions): prisoner
• WHERE (Places): Holstein
• WHEN (Times): 1261-1262
• WHO (People): Margaret Sambiria
Need external links
A Metadata Infrastructure
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
INTERMEDIA INFRASTRUCTURE
Facet | Authority Control       | Special Display Tools
WHAT  | Thesaurus               | Syndetic Structure
WHERE | Gazetteer               | Maps
WHEN  | Time Period Directory   | Timelines
WHO   | Biographical Dictionary | Dossiers
Learners
Acknowledgements
• Electronic Cultural Atlas Initiative project
• This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
Contact: [email protected]