
Prof. Ray R. Larson

University of California, Berkeley, School of Information

Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?

Overview

• Metadata as Infrastructure
  – What, Where, When and Who?

• What are Entry Vocabulary Indexes?
  – Notion of an EVI
  – How are EVIs Built

• Time Period Directories
  – Mining Metadata for new metadata

• 4W Demo

• New Project: Bringing Lives to Light

Metadata as Infrastructure

The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Metadata as Infrastructure

The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.

The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

What?

Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.

Two kinds of mapping in every search:

• Documents are assigned to topic categories, e.g. Dewey

• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.

Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

‘What’ searches involve mapping to controlled vocabularies

[Diagram components: Texts, Thesaurus/Ontology]

Building a Search Term Recommender

• Start with a collection of documents.

• Classify and index with controlled vocabulary, or use a pre-indexed collection.

• Problem: Controlled vocabularies can be difficult for people to use.
  – Index term: “pass mtr veh spark ign eng”
  – Use: “Economic Policy” in Library of Congress subj. for: “Wirtschaftspolitik”

• Solution: Entry Level Vocabulary Indexes.
  – EVI: “pass mtr veh spark ign eng” = “Automobile”

“What” and Entry Vocabulary Indexes

EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

Building an Entry Vocabulary Module (EVI)

Searching

• User has question but is unfamiliar with the domain he wants to search.

• User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.

• Has an Entry Vocabulary Module been built?
  – YES: Use an existing EVI.
  – NO: Build one from an Internet DB indexed with a controlled vocabulary:
    download a set of training data; part of speech tagging (for noun phrases); extract terms (words and noun phrases) from titles and abstracts; build associations between extracted terms & controlled vocabularies.

• Map user’s query to ranked list of controlled vocabulary terms.

• User selects search terms from the ranked list of terms returned by the EVI.

Building and Searching EVIs

Technical Details

Building an Entry Vocabulary Module (EVI)

• Internet DB indexed with a controlled vocabulary.

• Download a set of training data.

• Part of speech tagging (for noun phrases).

• Extract terms (words and noun phrases) from titles and abstracts.

• Build associations between extracted terms & controlled vocabularies.

Association Measure

        C     ¬C
 t      a     b
¬t      c     d

Where t is the occurrence of a term and C is the occurrence of a class in the training set

Association Measure

Maximum Likelihood ratio

$$W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]$$

where

$$\log L(p, k, n) = k \log(p) + (n - k)\log(1 - p)$$

and

$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$

Viz. Dunning
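As a rough illustration, the likelihood-ratio association between a term t and a class C can be computed directly from the four contingency counts a, b, c, d. The sketch below is not the project's code: the function names `log_l` and `association` are made up for this example, `log_l` takes k successes out of n trials, and both rows of the table are assumed to be non-empty.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) taken as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def association(a, b, c, d):
    """Likelihood-ratio weight W(C, t) from the 2x2 table.

    a: term present, class present     b: term present, class absent
    c: term absent,  class present     d: term absent,  class absent
    Assumes a + b > 0 and c + d > 0 (hypothetical precondition for this sketch).
    """
    p1 = a / (a + b)
    p2 = c / (c + d)
    p = (a + c) / (a + b + c + d)
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )
```

A term that is distributed identically inside and outside the class gets a weight of zero; the more the two rows of the table differ, the larger the weight, so the highest-scoring classes for a query term can be offered as suggested controlled-vocabulary entries.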

Alternatively

Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
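One way to read this alternative is to treat each class's evidence terms as a small document and rank classes against the query with an ordinary vector-space model. The sketch below uses TF-IDF weights and cosine similarity, which is only one possible IR technique and is not stated in the slides; the function name `rank_classes` and the toy evidence terms are assumptions for illustration.

```python
import math
from collections import Counter


def rank_classes(query, class_evidence):
    """Rank classes by cosine similarity between the query and each class's evidence terms."""
    n = len(class_evidence)
    # Document frequency: in how many class "documents" each term occurs.
    df = Counter()
    for terms in class_evidence.values():
        df.update(set(terms))

    def vector(terms):
        tf = Counter(t for t in terms if t in df)
        return {t: f * math.log(1 + n / df[t]) for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(query.lower().split())
    scores = {c: cosine(q, vector(terms)) for c, terms in class_evidence.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical evidence terms harvested from titles and abstracts.
CLASS_EVIDENCE = {
    "Automobiles": ["car", "automobile", "engine", "vehicle", "car"],
    "Economic policy": ["economy", "policy", "fiscal", "government"],
}

ranking = rank_classes("car engine", CLASS_EVIDENCE)
```

The top-ranked classes in `ranking` can then be used either as the predicted classification or as extra terms for query expansion.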

Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Statistical association: W(c, t) = 2[logL(p1, a, a+b) ...]

Digital library resources

EVI example

User query: “Automobile”

EVI 1: Index term: “pass mtr veh spark ign eng”

EVI 2: Index term: “automobiles” OR “internal combustion engines”

But why stop there?

[Diagram: several indexes, each with its own EVI] “Which EVI do I use?”

EVI to EVIs

[Diagram: the same indexes and EVIs, with a further EVI (EVI2) in front of them]

Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Why not treat language the same way?

Support for the Learner with a Query

Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

Any catalog: Archives, Libraries, Museums, TV, Publishers

Facet    Vocabulary                         Displays
WHAT     Thesaurus, e.g. LCSH               Cross-references
WHERE    Gazetteer                          Map
WHEN     Period directory                   Timeline
WHO      Biograph. dict., e.g. Who’s Who    Personal relations

It is also difficult to move between different media forms

[Diagram components: Texts, Numeric datasets, Thesaurus/Ontology, EVI]

Searching across data types

Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results.

But texts associated with numeric data can be mapped as well…

[Diagram components: Texts, Numeric datasets, captions, Thesaurus/Ontology, EVI, EVI]

But there are also geographic dependencies…

[Diagram components: Texts, Numeric datasets, captions, Maps/Geo Data, Thesaurus/Ontology, EVI, EVI]

WHERE: Place names are problematic…

• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .

• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.

• Name changes: Bombay → Mumbai.

• Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.

• Anachronisms: No Germany before 1870

• Vague, e.g. Midwest, Silicon Valley

• Unstable boundaries: 19th century Poland; Balkans; USSR

Use a gazetteer!
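To make the gazetteer idea concrete, here is a minimal sketch of a lookup that resolves variant, former, and homographic names to place records with coordinates. The `Place` and `Gazetteer` classes are hypothetical, not the project's gazetteer, and the coordinates are rounded approximations included only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    preferred: str          # preferred display name
    feature_type: str       # e.g. "city"
    lat: float              # approximate latitude, degrees north
    lon: float              # approximate longitude, degrees east
    variants: list = field(default_factory=list)  # variant and former names


class Gazetteer:
    def __init__(self, places):
        # Index every name form (case-insensitively) to the records that use it.
        self._index = {}
        for place in places:
            for name in [place.preferred, *place.variants]:
                self._index.setdefault(name.casefold(), []).append(place)

    def lookup(self, name):
        """Return all places known under this name (several for homographs)."""
        return self._index.get(name.casefold(), [])


PLACES = [
    Place("Saint Petersburg", "city", 59.94, 30.31,
          ["St. Petersburg", "Санкт Петербург", "Saint-Pétersbourg"]),
    Place("Cluj", "city", 46.77, 23.59, ["Klausenburg", "Kolozsvar"]),
    Place("Mumbai", "city", 19.08, 72.88, ["Bombay"]),
    Place("Vienna, Austria", "city", 48.21, 16.37, ["Vienna"]),
    Place("Vienna, VA", "town", 38.90, -77.27, ["Vienna"]),
]

gazetteer = Gazetteer(PLACES)
```

A lookup for "Bombay" or "Klausenburg" returns a single record, while "Vienna" returns two candidates that the interface would have to disambiguate, for example by showing both on a map.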

WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.

Timebar

Zoom on map. Click on place for a list of records. Click on record to display text.

So geographic search becomes part of the infrastructure

[Diagram components: Texts, Numeric datasets, captions, Maps/Geo Data, Thesaurus/Ontology, Gazetteers, EVI]

WHEN: Search by time is also weakly supported…

• Calendars are the standard for time

• But people use the names of events to refer to time periods

• Named time periods resemble place names in being:
  – Unstable: European War, Great War, First World War
  – Multiple: Second World War, Great Patriotic War
  – Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.

• Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

Vocabularies are the key!

Want: Kung-fu movies?

Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.

Linking vocabularies WHAT, WHERE, WHEN

Library subject headings: Topic – Geographic subdivision – Chronological subdivision

Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When

Time Period Directory: Period name – Type – Time markers (Calendar) – Where
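A Time Period Directory entry with the four parts listed above (period name, type, calendar markers, where) could be modelled roughly as follows. This is a sketch under assumptions: the `Period` record, the `find_periods` helper, and the three sample entries are illustrative and do not reflect the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    name: str          # period name
    period_type: str   # type of period, e.g. "war"
    start_year: int    # calendar time markers
    end_year: int
    where: str         # place the period applies to


def find_periods(directory, name=None, where=None, year=None):
    """Filter directory entries by any combination of name, place, and year."""
    hits = []
    for p in directory:
        if name is not None and p.name.casefold() != name.casefold():
            continue
        if where is not None and p.where.casefold() != where.casefold():
            continue
        if year is not None and not (p.start_year <= year <= p.end_year):
            continue
        hits.append(p)
    return hits


# The ambiguous name "Civil war" resolved by place or by date.
DIRECTORY = [
    Period("Civil war", "war", 1642, 1651, "England"),
    Period("Civil war", "war", 1861, 1865, "USA"),
    Period("Civil war", "war", 1936, 1939, "Spain"),
]
```

Searching by name alone returns all three entries; adding either a place ("Spain") or a year (1863) narrows the result to one, which is the sense in which period names link to both calendars and gazetteers.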

Time period directories link via the place (or time)

[Diagram components: Texts, Numeric datasets, captions, Maps/Geo Data, Thesaurus/Ontology, Gazetteers, EVI, Time Period Directory, Time lines, Chronologies]

WHEN: Time Period Directory Timeline

Link to Catalog

Link to Wikipedia

WHO: Biographical Dictionary

Complex relationships

Life events metadata

• WHAT (Actions): prisoner

• WHERE (Places): Holstein

• WHEN (Times): 1261-1262

• WHO (People): Margaret Sambiria

Need external links

Any document, object, or performance

Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

Any catalog: Archives, Libraries, Museums, TV, Publishers

Connect it with its context – and other resources.

Facet    Vocabulary                         Displays
WHAT     Thesaurus, e.g. LCSH               Cross-references
WHERE    Gazetteer                          Map
WHEN     Period directory                   Timeline
WHO      Biograph. dict., e.g. Who’s Who    Personal relations

Demo of search interface

Entry Vocabulary Index suggests correct LCSH with different spelling

Related places

Potentially related people

Potentially related periods

Mostly in India 16th-18th century

Find out more about this area.

Different Browsing Options!

Zooming in to South Asia

Restricting time frame

Select

More information about the country of India…

Wikipedia, CIA Factbook, BBC, Ethnologue

Berkeley Natural History Museums

Historical events – linked to Library catalog & Wikipedia : none avail. for this time period

ECAI Cultural Atlases: presenting history in its geographical & chronological contexts

Mongol Empire Video

Demo Interface

http://ecai.berkeley.edu/imls2004/imls4w/

New Project: Bringing Lives to Light:

Biography in Context

Ray R. Larson, Michael Buckland, Fredric Gey

University of California, Berkeley

Overview

Focussing on the Who in Who, What, Where and When

Types of Biographical Markup

WHEN, WHERE and WHO

Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Place and time are broadly important across numerous tools and genres including, e.g. Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc.

Biographical dictionaries are also heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
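The episode view of a life can be sketched as a list of records with WHAT, WHERE, WHEN, and WHO fields. The example below encodes the Emanuel Goldberg entry quoted above; the `Episode` class and the `episodes_in` helper are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    what: str                                  # activity
    where: str                                 # place
    when: tuple                                # (start_year, end_year)
    who: list = field(default_factory=list)    # other people involved


# Episodes taken from the biographical dictionary entry quoted above.
goldberg = [
    Episode("Born", "Moscow", (1881, 1881)),
    Episode("PhD", "Univ. of Leipzig", (1906, 1906), ["Wilhelm Ostwald"]),
    Episode("Director, Zeiss Ikon", "Dresden", (1926, 1933)),
    Episode("Moved", "Palestine", (1937, 1937)),
    Episode("Died", "Tel Aviv", (1970, 1970)),
]


def episodes_in(episodes, start, end):
    """Return episodes whose time span overlaps the years start..end."""
    return [e for e in episodes if e.when[0] <= end and e.when[1] >= start]


between_wars = episodes_in(goldberg, 1920, 1940)
```

Because every episode carries a place and a date range, the same records can be handed to a gazetteer for mapping or to a time period directory for context, and the WHO field gives a starting point for links to other biographies.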

A new form of biographical dictionary would link to all

[Diagram components: Texts, Numeric datasets, captions, Maps/Geo Data, Thesaurus/Ontology, Gazetteers, EVI, Time Period Directory, Time lines, Chronologies, Biographical Dictionary]

Projected Work

• Develop XML markup for Biographical Events
  – Most likely to be adaptation and extension of existing biographical event markup
  – Example: EAC/EAD

• Harvest biographical resources
  – Wikipedia, etc.

• Integrate as next generation of current interface

EAC/EAD

<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
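As a rough sketch of how such markup could feed the WHAT/WHERE/WHEN/WHO facets, the chronology can be read with Python's standard `xml.etree.ElementTree`. The function name `chron_events` and the output field names are assumptions for this example; only the element names come from the EAD sample above.

```python
import xml.etree.ElementTree as ET

BIOGHIST = """
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
"""


def chron_events(xml_text):
    """Turn each <chronitem> into a dict of when / what / where / who / org."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.iter("chronitem"):
        event = item.find("event")
        events.append({
            "when": item.findtext("date"),
            # Full event text with nested tags flattened and whitespace normalized.
            "what": " ".join("".join(event.itertext()).split()),
            "where": [g.text.strip() for g in event.iter("geogname")],
            "who": [p.text.strip() for p in event.iter("persname")],
            "org": [c.text.strip().rstrip(",") for c in event.iter("corpname")],
        })
    return events


events = chron_events(BIOGHIST)
```

Each resulting dictionary is one life event; the tagged names give direct hooks for gazetteer, biographical dictionary, and corporate name lookups, while untagged places such as "New Haven, Conn." would still need to be recognized in the running text.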

Wikipedia data

Life events metadata

• WHAT (Actions): prisoner

• WHERE (Places): Holstein

• WHEN (Times): 1261-1262

• WHO (People): Margaret Sambiria

Need external links

A Metadata Infrastructure

CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers

RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages

INTERMEDIA INFRASTRUCTURE

Facet    Authority Control          Special Display Tools
WHO      Biographical Dictionary
WHEN     Time Period Directory      Timelines
WHERE    Gazetteer                  Maps
WHAT     Thesaurus                  Syndetic Structure

Learners

Dossiers

Acknowledgements

Electronic Cultural Atlas Initiative project

This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries

Contact: [email protected]

