CLiMB: Computational Linguistics for Metadata Building

Page 1: CLiMB:  Computational Linguistics for  Metadata Building

CLiMB: Computational Linguistics for Metadata Building

Center for Research on Information Access

Columbia University Libraries

Page 2: CLiMB:  Computational Linguistics for  Metadata Building

January 21, 2003 CLiMB - Columbia University 2

Page 3: CLiMB:  Computational Linguistics for  Metadata Building

Overall Goals

• Research: Development of richer retrieval through increased numbers of descriptors

• Research and Practice: Creation of enabling technologies for new large digitization projects

• Research and Practice: Expand capability for cross-collection searching

• Practice: Development of a suite of CLiMB tools

• Resources: Vocabulary list which can be used by other visual resource professionals

The essence of CLiMB:

• Use scholars themselves as “catalogers” by utilizing scholarly publications

• Enhance existing descriptive metadata

Page 4: CLiMB:  Computational Linguistics for  Metadata Building

Computational Linguistic Techniques

• What techniques have we tried?

• How well have they worked?

• What else do we want to try?

Page 5: CLiMB:  Computational Linguistics for  Metadata Building

Computational Linguistic Techniques

• What techniques have we tried?

– Goal: Identify high quality metadata terms

– Goal: Use metadata for finding images

• How well have they worked?

• What else do we want to try?

Page 6: CLiMB:  Computational Linguistics for  Metadata Building

Text about Images

The Blacker House is known for its porte cochère and adjacent terraces. Samuel Parker Williams, an occasional Greene collaborator, worked on the site, particularly on the sandstone boulder foundation for the sleeping porch.

-- Based on Bosley

Page 7: CLiMB:  Computational Linguistics for  Metadata Building

Techniques We Have Tried

Unsupervised
– Part of speech tagging
– Noun phrase identification
– Proper noun identification

Supervised (using existing resources)
– Matching algorithms - proper names & variants
– Back of book index analysis
– Composite list of terms from authoritative lists

Page 8: CLiMB:  Computational Linguistics for  Metadata Building

What about LSI?

• Latent Semantic Indexing

• Builds a representation of a document

• Effective in information retrieval

• Why not for CLiMB?
– LSI is useful for text query and document retrieval
– LSI, a statistical technique, removes phrasal info
– CLiMB needs high quality phrases
– May be useful in later stages

Page 9: CLiMB:  Computational Linguistics for  Metadata Building

Indexing for What Purpose

• Index = find important terms and phrases

• Index = characterize a document with a set of terms that occur in the doc

Page 10: CLiMB:  Computational Linguistics for  Metadata Building

Indexing for What Purpose

• Index = find important terms and phrases
– sleeping porch
– occasional collaborator
– sandstone boulder foundation

• Index = characterize a document with a set of terms that occur in the doc
– sleep*, porch, occas*, collaborat*, foundat*
– enables location of docs with similar profile
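
The second sense of indexing can be sketched in a few lines: each document becomes a bag of truncated terms, and documents with similar profiles score high on cosine similarity. This is only an illustration under stated assumptions, not CLiMB code. The prefix truncation is a crude stand-in for a real stemmer, the function names are hypothetical, and the second and third sample texts are invented.

```python
import math
import re
from collections import Counter

def profile(text, prefix_len=6):
    """Bag of crudely truncated terms (a stand-in for real stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w[:prefix_len] for w in words if len(w) > 3)

def cosine(a, b):
    """Cosine similarity between two term-count profiles."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = profile("sandstone boulder foundation for the sleeping porch")
doc_b = profile("the porch foundations were built of sandstone boulders")  # invented
doc_c = profile("the committee approved the annual budget")  # invented

# The two architectural texts share four truncated terms; the third shares none.
print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))
```

A profile like this finds similar documents but discards the phrases themselves, which is why the first sense of indexing matters for metadata terms.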

Page 11: CLiMB:  Computational Linguistics for  Metadata Building

Finding Similar Documents

• Linear Algebra Techniques
– Latent Semantic Indexing
  • Singular Value Decomposition (SVD)
– Semidiscrete Decomposition

• Vector Space Models
– Term by Document matrices
– Term Weighting
– Polysemy and Synonymy

• Clustering Techniques
– K-means
– EM Clustering
– Wavelet

Page 12: CLiMB:  Computational Linguistics for  Metadata Building

Computational Linguistic Techniques

• What techniques have we tried?

– Goal: Identify high quality metadata terms

– Goal: Load metadata into image search database

– Goal: Use enriched metadata for finding images

• How well have they worked?

• What else do we want to try?

Page 13: CLiMB:  Computational Linguistics for  Metadata Building

Art Object Identification (AO-ID)

• Need Unique Identifiers
– Key of database records

• Varies from collection to collection
– Greene & Greene – Project Names
– Chinese Paper Gods – God Names
– South Asian Temples – Temple Names

Page 14: CLiMB:  Computational Linguistics for  Metadata Building

Text about Images

The Blacker House is known for its porte cochère and adjacent terraces. Samuel Parker Williams, an occasional Greene collaborator, worked on the site, particularly on the sandstone boulder foundation for the sleeping porch.

-- Based on Bosley

Page 15: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 16: CLiMB:  Computational Linguistics for  Metadata Building

Create Composite List of Subject Terms

Philosophy: Use whatever resources exist

• Catalog records
– Robert R. Blacker house (Pasadena, Calif.)
– Greene, Charles Sumner
– Blacker, Robert R.

• Art and Architecture Thesaurus
– porte cochère

• Back of the book index
– Blacker house

Page 17: CLiMB:  Computational Linguistics for  Metadata Building

Progress – Composite List

• Greene & Greene
– Extracted back of the book indexes
– Direct matching of index terms to the text

• Terms found - highlighted in yellow
– David Gamble
– Pasadena
– Westmoreland Place
– furniture

Page 18: CLiMB:  Computational Linguistics for  Metadata Building

Page 19: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 20: CLiMB:  Computational Linguistics for  Metadata Building

Three Term Types and Approaches

1) Art Object ID names and other proper nouns important to the domain (Charles Pratt)
• Named Entity noun phrase finders, POS taggers

2) Common noun terms, semantically significant to the domain (V-shaped plan)
• List of domain terms from authority sources

3) Common noun phrases in a generic domain vocabulary (chimney)
• Statistical methods for identifying relevant terms

Page 21: CLiMB:  Computational Linguistics for  Metadata Building

Part of Speech (POS) taggers

• Why use a part of speech tagger?
– To identify nouns, verbs and proper nouns

• The Blacker House is known for its porte cochère…
– <Determiner>The
– <Proper_Noun>
  • <Singular_Proper_Noun>Blacker
  • <Singular_Proper_Noun>House
– <Verb_Present>is
– <Verb_Past_Participle>known
– <Preposition>for
– <Possessive_Pronoun>its
– <Adjective>adjacent
– <Noun_Plural>terraces

Page 22: CLiMB:  Computational Linguistics for  Metadata Building

Part of Speech (POS) taggers

• Strength: An essential step that allows the rest of the system to work

• Weakness: The best POS taggers have 95% accuracy
– A typical 20-word sentence is likely to have a mistake!

• But: some errors do not matter much
– E.g. sleeping porch
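
The claim about a 20-word sentence follows from simple arithmetic. The sketch below assumes that tagging errors are independent from word to word, which is a simplification not stated on the slide.

```python
accuracy = 0.95       # per-word tagging accuracy quoted on the slide
sentence_length = 20  # words in a "typical" sentence

# Probability that every word is tagged correctly, assuming independent errors.
p_all_correct = accuracy ** sentence_length
p_at_least_one_error = 1 - p_all_correct

print(f"{p_at_least_one_error:.2f}")  # about 0.64
```

Under that assumption, roughly two sentences in three contain at least one tagging error.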

Page 23: CLiMB:  Computational Linguistics for  Metadata Building

What We Tried: POS Taggers

• Mitre Alembic WorkBench
– Freeware from Mitre corporation
– Strong for proper nouns
– Average for common nouns

• IBM’s Nominator
– Accurate for both
– Restrictive licensing

Page 24: CLiMB:  Computational Linguistics for  Metadata Building

Proper Nouns

• Alembic WorkBench Results
– 91.2% recall
  • Misses The senior Pratt, Hall brothers
– 97.5% precision using Alembic
  • Successfully finds William Issac Ott, University of California

• This is very good!

• Highlighted in light green
– Mary
– Greene
– Persian
– Etc.

Page 25: CLiMB:  Computational Linguistics for  Metadata Building

Page 26: CLiMB:  Computational Linguistics for  Metadata Building

Noun Phrase Chunking

[The [ Blacker House ] ] is known for [ [its Porte Cochère] and [adjacent terraces] ]. [Samuel Parker Williams], [an occasional Greene collaborator], worked on [the site], particularly on [the [ [sandstone boulder] foundation] ] for [the [ sleeping porch ] ].

-- Based on Bosley

Page 27: CLiMB:  Computational Linguistics for  Metadata Building

NP Chunkers

• Columbia’s LinkIT
– Regular expression grammar over POS tags
– Improves WorkBench results through finding simplex NPs

• LTChunk
– By LTG Group, University of Edinburgh
– Not as many NPs

• Arizona - commercialized

• IBM – also commercial
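
To make the idea of a regular expression grammar over POS tags concrete, here is a minimal sketch. It is not LinkIT's grammar: the pattern, the Penn-Treebank-style tag names, the hand-assigned tags, and the function name are all assumptions chosen for illustration.

```python
import re

# Hand-tagged tokens; the tagset (DT, JJ, NN, NNP, ...) is Penn-Treebank-style
# and assigned by hand for illustration only.
tagged = [
    ("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"), ("is", "VBZ"),
    ("known", "VBN"), ("for", "IN"), ("its", "PRP$"), ("porte", "NN"),
    ("cochère", "NN"), ("and", "CC"), ("adjacent", "JJ"), ("terraces", "NNS"),
]

# Simplex NP: optional determiner or possessive, any adjectives, one or more nouns.
# The lookbehind keeps a match from starting in the middle of a tag.
NP_PATTERN = re.compile(r"(?<!\S)(?:(?:DT|PRP\$) )?(?:JJ )*(?:NNP?S? )+")

def chunk_noun_phrases(tagged_tokens):
    """Return noun phrases matched by a regular expression over the tag sequence."""
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag is followed by one space, so counting spaces gives token indices.
        start = tag_string[:match.start()].count(" ")
        end = tag_string[:match.end()].count(" ")
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return phrases

print(chunk_noun_phrases(tagged))
```

Because the pattern runs over tags rather than words, a tagging error upstream changes which chunks are found, which is why tagger accuracy matters for this step.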

Page 28: CLiMB:  Computational Linguistics for  Metadata Building

Results: Proper Nouns

Tool                Precision   Recall
Alembic WorkBench   97.50       91.20
LinkIT              68.94       98.81
LTChunk             68.13       63.48
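
For readers unfamiliar with the two measures, the sketch below shows how precision and recall percentages of this kind are computed from a set of extracted terms and a gold-standard list. The data are a toy example and do not reproduce the figures above; the function name is hypothetical.

```python
def precision_recall(found, gold):
    """Precision and recall (as percentages) of a set of extracted terms."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = 100.0 * correct / len(found) if found else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall

# Toy data: a hypothetical gold list and a hypothetical tool output.
gold = {"William Issac Ott", "University of California",
        "The senior Pratt", "Hall brothers"}
found = {"William Issac Ott", "University of California", "sleeping porch"}

precision, recall = precision_recall(found, gold)
print(round(precision, 2), round(recall, 2))
```

Precision falls when a tool proposes terms that are not proper nouns; recall falls when it misses proper nouns that are in the text.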

Page 29: CLiMB:  Computational Linguistics for  Metadata Building

Results: Proper Nouns

[Bar chart: recall and precision (0 to 100) of proper nouns in Bosley Chapter 5, for WorkBench, WorkBench and LinkIT, and LTChunk]

Page 30: CLiMB:  Computational Linguistics for  Metadata Building

Results: NP Chunking

• Highlighted in purple:
– The design process
– The southwest adobe-stucco
– July 1907

Page 31: CLiMB:  Computational Linguistics for  Metadata Building

Page 32: CLiMB:  Computational Linguistics for  Metadata Building

Experiments with Algorithms

• TF/IDF and term frequency ratios
– Filter technical terms from frequent common nouns
– Term frequency ratio algorithm to improve accuracy

• Co-occurrence
– Useful terms may appear near other good ones

• Machine learning
– Use learning algorithms to discover complex associational context
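
One plausible reading of a term frequency ratio is sketched below: compare how often a term occurs in the domain text with how often it occurs in general text. This is an assumption about the general approach, not the project's algorithm. The add-one smoothing, the function name, and both token lists are invented for illustration.

```python
from collections import Counter

def frequency_ratios(domain_tokens, general_tokens, smoothing=1.0):
    """Ratio of a term's relative frequency in domain text to a general corpus."""
    domain, general = Counter(domain_tokens), Counter(general_tokens)
    vocab_size = len(set(domain) | set(general))
    domain_total = len(domain_tokens) + smoothing * vocab_size
    general_total = len(general_tokens) + smoothing * vocab_size
    return {
        term: ((domain[term] + smoothing) / domain_total)
        / ((general[term] + smoothing) / general_total)
        for term in domain
    }

# Invented token lists standing in for a domain chapter and general text.
domain_tokens = "house porch terrace house gable porch house time".split()
general_tokens = "time house people time year way time day".split()

ratios = frequency_ratios(domain_tokens, general_tokens)
for term, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {ratio:.2f}")
```

Terms with a ratio well above 1 are candidates for domain vocabulary; frequent common nouns such as "time" fall below 1 and can be filtered out.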

Page 33: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 34: CLiMB:  Computational Linguistics for  Metadata Building

What is Segmentation?

• Divide texts into cohesive chunks

• Needed for determining associational context

• Needed to determine what terms are related to an art object

Page 35: CLiMB:  Computational Linguistics for  Metadata Building

Results: Segmentation

[Chart: "Project People, Frequency". Y axis: frequency (0 to 12). X axis: paragraph (1 to 49). Series: Cole, Bolton, Thorsen, Pratt, Gamble, Blacker, Robinson, Ford]

• Use the frequency with which our terms appear within a document to estimate where the document is about each term

• This graph shows where different names are mentioned in Bosley on Greene & Greene Ch. 5
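
The counting behind a graph of this kind can be sketched as follows. The paragraphs are invented placeholders, the function names are hypothetical, and simple substring counting is an assumption; the slides do not say how mentions were counted.

```python
def name_frequencies(paragraphs, names):
    """Count case-insensitive mentions of each name in each paragraph."""
    return {
        name: [paragraph.lower().count(name.lower()) for paragraph in paragraphs]
        for name in names
    }

def peak_paragraph(counts):
    """1-based index of the paragraph where a name is mentioned most often."""
    return counts.index(max(counts)) + 1

# Invented paragraphs; the real input would be the chapter text.
paragraphs = [
    "The Gamble house has wide terraces. Gamble approved the plans.",
    "Blacker asked for changes. The Blacker house followed. Blacker agreed.",
    "Pratt and Gamble both visited the site.",
]
freqs = name_frequencies(paragraphs, ["Gamble", "Blacker", "Pratt"])
print(freqs)
print(peak_paragraph(freqs["Blacker"]))
```

Runs of paragraphs in which one name dominates suggest where the text is about that project, which is the cue the frequency-based segmentation relies on.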

Page 36: CLiMB:  Computational Linguistics for  Metadata Building

What We’ve Tried: Segmenters

• Marti Hearst’s TextTiling
– Performs well for a general algorithm, but not sufficient for this specialized task
– M. Hearst, ACL, 1993

• F. Choi’s C99 segmenter
– Performance comparable to TextTiling
– F. Y. Y. Choi, NAACL, 2000

• Frequency ratio approach outperformed TextTiling

• In-house tool to be tested
– Kan & Klavans, WVLC-6, 1998, Segmenter

Page 37: CLiMB:  Computational Linguistics for  Metadata Building

Meronymy as “Part-Of”

• Why is this potentially useful?
– A method for identifying “hot” paragraphs

• Descriptive text contains “part of” relations

• Details that correlate to the whole
– Porch is a part of house

• An early hypothesis – in testing stages

Page 38: CLiMB:  Computational Linguistics for  Metadata Building

Meronymy for Cohesion

The Spinks house design is an elaboration of the rectangular, large-gabled form of the “California House” ….has … porches and terraces. In front, an expanse of …lawn rises nearly to the level of the entry terrace…. The front door is approached obliquely in the shaded recess of the terrace….
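
A rough sketch of how part-of relations could flag a descriptive paragraph is given below. The small part-of table is hand-built and hypothetical (a real system might draw on WordNet), the function name is invented, and the first sample text is adapted from the passage above with the elisions dropped.

```python
import re

# Hypothetical hand-built part-of table; not taken from any real resource.
PART_OF = {
    "porch": "house", "terrace": "house", "door": "house",
    "entry": "house", "gable": "house",
}

def part_mentions(paragraph, whole):
    """Count words in the paragraph that name a part of the given whole."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    count = 0
    for word in words:
        # Strip a plural "s" only when the singular form is a known part.
        singular = word[:-1] if word.endswith("s") and word[:-1] in PART_OF else word
        if PART_OF.get(singular) == whole:
            count += 1
    return count

descriptive = ("In front, an expanse of lawn rises nearly to the level of the "
               "entry terrace. The front door is approached obliquely in the "
               "shaded recess of the terrace.")
other = "Samuel Parker Williams, an occasional Greene collaborator, worked on the site."

print(part_mentions(descriptive, "house"), part_mentions(other, "house"))
```

A paragraph dense in parts of a house is more likely to describe the building itself, which is the sense in which meronymy might mark a "hot" paragraph.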

Page 39: CLiMB:  Computational Linguistics for  Metadata Building

Meronymy and Other Relations

[Diagram: relations among The California House, Spinks House, and Other Houses, with the parts porch, terrace, entry terrace, front door, and front entry]

Page 40: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 41: CLiMB:  Computational Linguistics for  Metadata Building

Progress – Project Name Matching

• Finding project names in Greene & Greene

• Challenge: finding variations
– AO-ID Robert Roe Blacker House
– RRB House
– The house
– 1214 Fairlawn Terrace.

• Possible techniques to improve matching
– Developing a semi-automatic technique
– Use existing information to label text
– An iterative platform for manual intervention

Page 42: CLiMB:  Computational Linguistics for  Metadata Building

Variants of The Culbertson House

• Cordelia A. Culbertson house (Pasadena, Calif.)
• Francis F. Prentiss house (Pasadena, Calif.)
• Culbertson sisters house (Pasadena, Calif.)
• Prentiss, Francis F.
• Culbertson, Cordelia A.
• Allen, Elizabeth S.
• Allen, Mrs. Dudley P.

• The house was purchased by Allen, who remarried and became Prentiss!

Page 43: CLiMB:  Computational Linguistics for  Metadata Building

Zaoshen (Chinese deity)
• USE FOR: Dingfuzhenjun (Chinese deity)
• USE FOR: Kitchen God (Chinese deity)
• USE FOR: Simingzaojun (Chinese deity)
• USE FOR: Simingzaoshen (Chinese deity)
• USE FOR: Ssu-ming-tsao-chün (Chinese deity)
• USE FOR: Ssu-ming-tsao-shen (Chinese deity)
• USE FOR: Ting-fu-chen-chün (Chinese deity)
• USE FOR: Tsao-chün (Chinese deity)
• USE FOR: Tsao-shen (Chinese deity)
• USE FOR: Tsao-wang (Chinese deity)
• USE FOR: Tsao-wang-yeh (Chinese deity)
• USE FOR: Zaojun (Chinese deity)
• USE FOR: Zaowang (Chinese deity)
• REFERENCE: Encyc. Britannica (Tsao Shen, pinyin Zao Shen, in Chinese mythology, the god of the kitchen (god of the hearth), who is believed to report to the celestial gods on family conduct and have it within his power to bestow poverty or riches on individual families; has also been confused with Ho Shen (god of fire) and Tsao Chün (Furnace Prince))
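
An authority record like this one can drive a simple variant lookup, so that any listed form of a name resolves to the preferred heading. The sketch below is a minimal illustration with hypothetical function names; it uses only a handful of the USE FOR forms from the record.

```python
# Preferred heading and a few of its USE FOR variants from the authority record.
AUTHORITY = {
    "Zaoshen": ["Kitchen God", "Tsao-shen", "Tsao-wang", "Zaojun", "Zaowang"],
}

def build_variant_index(authority):
    """Map every lower-cased name form to its preferred heading."""
    index = {}
    for preferred, variants in authority.items():
        for form in [preferred, *variants]:
            index[form.lower()] = preferred
    return index

def normalize(name, index):
    """Return the preferred heading for a name, or None if it is unknown."""
    return index.get(name.strip().lower())

index = build_variant_index(AUTHORITY)
print(normalize("kitchen god", index))
print(normalize("Tsao-wang", index))
print(normalize("Ho Shen", index))
```

Exact lookup handles only forms that appear in the record; forms such as "the house" for a project name need the looser matching discussed on the neighbouring slides.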

Page 44: CLiMB:  Computational Linguistics for  Metadata Building

Some Data to Illustrate

• Unaltered Project Names
– 0 matches (both case sensitive and insensitive)

• Case Insensitive Project Name matching
– 4 matches
– {Theodore Irwin house} occurs 1 time
– {California Institute of Technology} occurs 1 time
– {William R. Thorsen house} occurs 1 time
– {William T. Bolton house} occurs 1 time

• At least double in the chapter
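
The effect of case sensitivity on direct matching can be illustrated with a short sketch. The sample sentence is invented and the function name is hypothetical; the counts it produces are not the project's results.

```python
import re

def count_project_names(text, project_names, ignore_case=True):
    """Count occurrences of each project name in the text."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for name in project_names:
        # re.escape keeps the periods in initials from acting as wildcards.
        hits = re.findall(re.escape(name), text, flags)
        if hits:
            counts[name] = len(hits)
    return counts

names = ["William R. Thorsen house", "William T. Bolton house"]
# Invented sentence in which "House" is capitalized, unlike the name list.
text = "The William R. Thorsen House and the William T. Bolton House are discussed."

print(count_project_names(text, names, ignore_case=False))
print(count_project_names(text, names))
```

Even the case-insensitive pass finds only literal full names, so shortened or reworded references still go uncounted.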

Page 45: CLiMB:  Computational Linguistics for  Metadata Building

A Future Solution

• Bootstrapping algorithm
– Seed terms hand labelled
– Terms mapped into multi-dimensional feature space
– Other terms that are close to the seed terms are added to the set

• Features:
– Window size
– Headedness
– Modifier similar to that of a seed term
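
A heavily simplified sketch of the bootstrapping idea follows. Instead of a multi-dimensional feature space it uses just two of the listed cues, a shared head word or a shared modifier, and it treats the last word of a phrase as its head. The function name, these simplifications, and the candidate list are assumptions for illustration.

```python
def bootstrap(seeds, candidates, max_rounds=5):
    """Grow a term set from seeds by adding candidates that share a head or modifier."""
    def head(term):
        return term.lower().split()[-1]

    def modifiers(term):
        return set(term.lower().split()[:-1])

    accepted = set(seeds)
    for _ in range(max_rounds):
        added = {
            cand for cand in candidates
            if cand not in accepted and any(
                head(cand) == head(term) or modifiers(cand) & modifiers(term)
                for term in accepted
            )
        }
        if not added:
            break  # nothing new is close to the current set
        accepted |= added
    return accepted

seeds = {"sleeping porch"}  # hand-labelled seed term
candidates = ["entry porch", "entry terrace", "design process"]  # invented list
print(sorted(bootstrap(seeds, candidates)))
```

Here "entry porch" joins in the first round through its head, and "entry terrace" joins in the second through the shared modifier, which shows how the set can drift outward from the seeds with each iteration.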

Page 46: CLiMB:  Computational Linguistics for  Metadata Building

Summary: Research Tools Tested

• Part of Speech Taggers

• Noun Phrase Chunkers

• Merging techniques

• Proper Noun Finders

• Proper Name Variant Finder

• Segmenters

Page 47: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 48: CLiMB:  Computational Linguistics for  Metadata Building

Future: Determine relationships

• The Blacker House related to Greene
– The Greenes built the house.

• Porte Cochère is related to Blacker House
– because they are directly a part of the house.

• William Issac Ott is related to
– Blacker House (on which he worked)
– Greene (with whom he worked).

• Detecting these semantic relationships statistically is a challenge for our next steps:
– Co-occurrence
– Use of subject headings
– Meronymy and other relations (WordNet)
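
Of the three proposed cues, co-occurrence is the simplest to sketch: count how often two terms appear in the same text segment. The example below treats each sentence of the Blacker House passage as a segment. The function name and the segment size are assumptions, and co-occurrence alone says nothing about what kind of relation holds.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(segments, terms):
    """Count how often each pair of terms appears in the same text segment."""
    pairs = Counter()
    for segment in segments:
        lowered = segment.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted(t for t in terms if t.lower() in lowered)
        pairs.update(combinations(present, 2))
    return pairs

segments = [
    "The Blacker House is known for its porte cochère and adjacent terraces.",
    "Samuel Parker Williams, an occasional Greene collaborator, worked on the site.",
]
terms = ["Blacker House", "porte cochère", "Greene", "Samuel Parker Williams"]

counts = cooccurrence(segments, terms)
print(counts[("Blacker House", "porte cochère")])
print(counts[("Greene", "Samuel Parker Williams")])
print(counts[("Blacker House", "Greene")])
```

The last count is zero because the two names never share a sentence here, which illustrates why segment boundaries and other cues such as subject headings matter for finding relationships.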

Page 49: CLiMB:  Computational Linguistics for  Metadata Building

Compile list of subject vocabulary

Find meaningful terms in texts

Collect terms from all sources. Identify and link AO-ID described in text.

Determine term relationships

Insert into existing metadata records. Mount in image search platform.

Process queries and evaluate

Segment relevant texts

Extract metadata

Page 50: CLiMB:  Computational Linguistics for  Metadata Building

Thank you!

Any questions?

www.columbia.edu/cu/cria/climb

