7/16/2002 JCDL 2002, Ray Larson

The “Entry Vocabulary Index” Approach to Multilingual Search

Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland

University of California, Berkeley
School of Information Management and Systems and UC Data

Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries

Overview

• What are Entry Vocabulary Indexes?
  – EVI Research at Berkeley
  – Notion of an EVI
  – How are EVIs Built
• Berkeley Multilingual EVI
  – Technology components
  – Database
  – Examples of operation
• Ongoing research

Entry Vocabulary Index Research Projects at Berkeley

• DARPA Information Management Program
  – “Search Support for Unfamiliar Metadata Vocabularies”
• Institute of Museum and Library Services
  – “Seamless Searching of Numeric and Textual Resources”
• DARPA TIDES program
  – “Translingual Information Management Using Domain Ontologies”
• NSF/NASA/DARPA: DLI-2 (IDL)
  – “Discovery and Use of Textual, Numeric and Spatial Data”

The IMLS project: To demonstrate improved access to written material and numerical data on the same topic when searching two very different kinds of database:

--- books, articles, and their bibliographic records;

--- numerical data in socio-economic databases.

PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library users’ own terms and would suggest which terms to use from the specialized categorization of the resource being searched.

PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you find documents on the same topic in a text database, and vice versa.

TIDES Project

• Translingual Information Detection, Extraction and Summarization
  – Building EVIs to map across languages

• Using the same notion with training data in different languages

• Using Library of Congress Subject Headings from the CDL MELVYL database

What is an Entry Vocabulary Index?

• EVIs are a means of mapping from a user’s vocabulary to the controlled vocabulary of a collection of documents…

Start with a collection of documents.

Classify and index with controlled vocabulary.

Ideally, use a database already indexed.

[Diagram label: Index]

Problem: Controlled vocabularies can be difficult for people to use.

“pass mtr veh spark ign eng”

[Diagram label: Index]

Solution: Entry Level Vocabulary Indexes.

“pass mtr veh spark ign eng” = “Automobile”

[Diagram labels: Index, EVI]

EVI example

User query: “Automobile”

EVI 1: Index term: “pass mtr veh spark ign eng”

EVI 2: Index term: “automobiles” OR “internal combustion engines”
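The lookup on this slide can be sketched as a plain dictionary from the user's word to the index terms of each collection. This is only an illustration: the names `EVI_1`, `EVI_2`, `suggest_index_terms`, and `to_boolean_query` are hypothetical, and a real EVI would return terms ranked by a statistical association score rather than a fixed list.

```python
from typing import Dict, List

# Hypothetical EVIs: each maps a lower-cased user term to the index terms
# of one collection's controlled vocabulary (entries taken from the slide).
EVI_1: Dict[str, List[str]] = {
    "automobile": ["pass mtr veh spark ign eng"],
}
EVI_2: Dict[str, List[str]] = {
    "automobile": ["automobiles", "internal combustion engines"],
}


def suggest_index_terms(evi: Dict[str, List[str]], query: str) -> List[str]:
    """Return the controlled-vocabulary terms the EVI associates with the query."""
    return evi.get(query.strip().lower(), [])


def to_boolean_query(terms: List[str]) -> str:
    """Join the suggested index terms into a quoted OR query."""
    return " OR ".join(f'"{term}"' for term in terms)


if __name__ == "__main__":
    for name, evi in (("EVI 1", EVI_1), ("EVI 2", EVI_2)):
        print(name, "->", to_boolean_query(suggest_index_terms(evi, "Automobile")))
```

The same user word yields a different controlled-vocabulary query for each collection, which is the point of keeping one EVI per index.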

But why stop there?

[Diagram labels: Index, EVI]

“Which EVI do I use?”

[Diagram: several Index/EVI pairs]

EVI to EVIs

[Diagram: several Index/EVI pairs and an additional EVI labeled EVI2]

Find: Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Why not treat language the same way?

Find: Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Statistical association: $W(C, t) = 2[\log L(p_1, a, a+b) + \ldots]$

Digital library resources

Background on Online Library Catalogs

• Library catalogs have been automated at a furious pace worldwide since the late ’70s
• Library objects (books, maps, pictures) in 400+ languages
• Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
• Objects have been classified by subject by librarians
  – Library of Congress Subject Heading (Islamic Fundamentalism)
  – Library of Congress Classification (BP60, BP63, KF27)
  – Dewey Decimal Classification (297.2, 306.6, 320.5)
• International standard (MARC) for catalog metadata
• Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50

What can libraries and their catalogs provide?

• Millions of sentences in multiple languages
• Sentences with topical content identified from 150,000 Library of Congress Subject Headings
• Transfer point (interlingua) between English topics and words in other languages
• Can be used to create:
  – Bilingual dictionaries
  – Query expansion in cross-language information retrieval

Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”

Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”
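A search like this amounts to filtering catalog records on two fields and keeping the title text as a language sample. The sketch below assumes a simplified, hypothetical record layout (a dict with `title`, `language`, and `subjects` keys) rather than real MARC fields, and the function name `harvest_samples` is made up for illustration.

```python
from typing import Dict, List, Union

# Hypothetical simplified catalog record, not an actual MARC structure.
Record = Dict[str, Union[str, List[str]]]


def harvest_samples(records: List[Record], subject: str, language: str) -> List[str]:
    """Return titles of records in `language` that carry the heading `subject`."""
    subject_key = subject.casefold()
    samples = []
    for record in records:
        if record["language"] != language:
            continue
        # Compare headings case-insensitively.
        if any(heading.casefold() == subject_key for heading in record["subjects"]):
            samples.append(record["title"])
    return samples


if __name__ == "__main__":
    # Placeholder records; titles are stand-ins, not real catalog data.
    catalog = [
        {"title": "<romanized Arabic title 1>", "language": "Arabic",
         "subjects": ["Islamic fundamentalism"]},
        {"title": "<French title>", "language": "French",
         "subjects": ["Islamic fundamentalism"]},
        {"title": "<romanized Arabic title 2>", "language": "Arabic",
         "subjects": ["Poetry"]},
    ]
    print(harvest_samples(catalog, "Islamic Fundamentalism", "Arabic"))
```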

Our Training Set and Prototype

• University of California/CDL MELVYL catalog
• Private copy, 10 million+ records (5 million non-English)
• Records in over 100 languages
• Obtained in MARC database standard format
• Foreign language titles use Library of Congress transliteration (Romanization) standard
• Prototype search software maps from/to English and
  – Arabic, Chinese, French, German
  – Italian, Japanese, Russian, Spanish

Technical Details

Building an Entry Vocabulary Module (EVI)

• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Extract terms (words and noun phrases) from titles and abstracts.
  – Part of speech tagging for noun phrases
• Build associations between extracted terms & controlled vocabularies.
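The term-extraction and association-building steps can be sketched as a counting pass over the training records. This is a minimal sketch under stated assumptions: records are hypothetical `(title, headings)` pairs, terms are plain lower-cased words (no part-of-speech tagging or noun-phrase extraction), and the names `extract_terms` and `count_cooccurrences` are illustrative, not the project's code.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical training record: (title text, controlled-vocabulary headings).
Record = Tuple[str, List[str]]


def extract_terms(title: str) -> Set[str]:
    """Lower-cased word tokens; a stand-in for POS tagging and phrase extraction."""
    return set(re.findall(r"[^\W\d_]+", title.lower()))


def count_cooccurrences(records: Iterable[Record]):
    """Count, per record, which terms and headings occur and which occur together."""
    term_counts: Counter = Counter()
    heading_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    total = 0
    for title, headings in records:
        total += 1
        terms = extract_terms(title)
        unique_headings = set(headings)
        term_counts.update(terms)
        heading_counts.update(unique_headings)
        for term in terms:
            for heading in unique_headings:
                pair_counts[(term, heading)] += 1
    return term_counts, heading_counts, pair_counts, total


if __name__ == "__main__":
    toy = [("Cars and engines", ["Automobiles"]),
           ("Engines", ["Engines"]),
           ("Cars", ["Automobiles"])]
    terms, headings, pairs, n = count_cooccurrences(toy)
    print(n, pairs[("cars", "Automobiles")])
```

Counting each term and heading at most once per record keeps the counts usable as cells of a term-by-classification contingency table.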

Association Measure

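        C     ¬C
t       a     b
¬t      c     d

Where t is the occurrence of a term and C is the occurrence of a classification in the training set. Each cell is a count of training records: a is the number containing both t and C, b those with t but not C, c those with C but not t, and d those with neither.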
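The four cells follow from three simple counts and the size of the training set. A minimal sketch, assuming each record counts a term or classification at most once; the function name `contingency` and its parameter names are hypothetical.

```python
from typing import Tuple


def contingency(pair_count: int, term_count: int,
                class_count: int, total: int) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for term t and classification C.

    pair_count  -- records containing both t and C
    term_count  -- records containing t
    class_count -- records assigned C
    total       -- all records in the training set
    """
    a = pair_count
    b = term_count - pair_count          # t without C
    c = class_count - pair_count         # C without t
    d = total - term_count - class_count + pair_count  # neither t nor C
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    return a, b, c, d


if __name__ == "__main__":
    print(contingency(pair_count=10, term_count=30, class_count=40, total=1000))
```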

Association Measure

• Maximum Likelihood ratio

$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$

where

$$\log L(p, k, n) = k \log p + (n - k) \log(1 - p)$$

and

$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$

Viz. Dunning
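The measure can be computed directly from the four contingency counts. In this sketch the second argument of `log_l` is treated as the co-occurrence count k and the third as the row total n, which is the reading under which the calls logL(p1, a, a+b) make sense; the function names and the zero-count handling (0·log 0 taken as 0) are assumptions, not taken from the slides.

```python
import math


def log_l(p: float, k: int, n: int) -> float:
    """k*log(p) + (n - k)*log(1 - p), with 0*log(0) treated as 0."""
    def xlogy(x: float, y: float) -> float:
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)


def association(a: int, b: int, c: int, d: int) -> float:
    """Likelihood-ratio weight W(C, t) from the 2x2 table cells a, b, c, d."""
    if a + b == 0 or c + d == 0:
        raise ValueError("each table row needs at least one record")
    p1 = a / (a + b)                 # P(C | t)
    p2 = c / (c + d)                 # P(C | not t)
    p = (a + c) / (a + b + c + d)    # P(C) overall
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))


if __name__ == "__main__":
    print(association(10, 10, 10, 10))    # term and class independent
    print(association(10, 20, 30, 940))   # term and class co-occur often
```

The score is zero when the classification is equally likely with and without the term and grows as the two rates diverge. It does not by itself say whether the association is positive or negative, so a ranking would typically also require p1 > p2.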

Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages

Non-English words can be mapped to English subject headings

Examples

Catalog Languages vs. FBIS Languages (University of California online catalog, 10 million records)

Approx. language distribution (Berkeley: # sentences; FBIS: est. # lines of source)

Language        Berkeley   FBIS        Language        Berkeley   FBIS
German          840,032    49,872      Danish          41,517     18,688
Spanish         614,025    388,772     Hebrew          41,468     3,500
French          609,089    2,871       Czech           35,432     3,647
Russian         341,050    15,415      Urdu            30,206
Italian         266,424    254         Turkish         30,015
Portuguese      149,389    24,930      Bulgarian       27,850
Chinese         127,636    246,549     Norwegian       26,478     13,596
Japanese        110,956                Korean          25,979     68,607
Arabic          96,124     (8263)*     Rumanian        25,874
Dutch           90,170                 Finnish         25,027     8,187
Latin           88,818                 Thai            24,693
Polish          81,698                 Serbo-Croatian  24,601     36,139
Indonesian      59,445                 Greek           23,926
Swedish         53,854     16,652      Bengali         23,430
Hungarian       46,330     6,631       Catalan         20,392
Hindi           42,886                 Tamil           20,232

*English only, no source text

106 languages with > 500 records

Future Research

• Add content from other online library catalogs
  – RLIN (>30M records, >900K Chinese, >250K Arabic)
  – COPAC [UK] (9M records, 40K Arabic)
• Transliteration and back-transliteration for scripted languages
• Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
• Further evaluation (TREC, CLEF, NTCIR and local analysis)
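One common way to identify bigrams with mutual information is to score adjacent word pairs by pointwise mutual information (PMI) and keep the high-scoring ones. The sketch below is an assumption about how that could look, not the project's method; `pmi_bigrams`, the `min_count` cut-off, and the toy English tokens are all illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def pmi_bigrams(sentences: Sequence[Sequence[str]],
                min_count: int = 2) -> List[Tuple[Tuple[str, str], float]]:
    """Score adjacent token pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = [["new", "york", "city"], ["new", "york", "times"],
           ["the", "city"], ["the", "times"]]
    for pair, score in pmi_bigrams(toy):
        print(pair, round(score, 3))
```

Trigrams could be handled the same way by scoring a bigram followed by a third token, at the cost of sparser counts.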

Prototype available

http://otlet.sims.berkeley.edu/mulevm2.html

