7/16/2002 JCDL 2002, Ray Larson
The “Entry Vocabulary Index” Approach to Multilingual Search
Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland
University of California, Berkeley
School of Information Management and Systems and UC Data
Harvesting Translingual Vocabulary Mappings for
Multilingual Digital Libraries
Overview
• What are Entry Vocabulary Indexes?
– EVI research at Berkeley
– Notion of an EVI
– How are EVIs built?
• Berkeley Multilingual EVI
– Technology components
– Database
– Examples of operation
• Ongoing research
Entry Vocabulary Index Research Projects at Berkeley
• DARPA Information Management Program
– “Search Support for Unfamiliar Metadata Vocabularies”
• Institute of Museum and Library Services
– “Seamless Searching of Numeric and Textual Resources”
• DARPA TIDES program
– “Translingual Information Management Using Domain Ontologies”
• NSF/NASA/DARPA: DLI-2 (IDL)
– “Discovery and Use of Textual, Numeric and Spatial Data”
The IMLS project: To demonstrate improved access to written material and numerical data on the same topic when searching two very different databases:
--- books, articles, and their bibliographic records;
--- numerical data in socio-economic databases.
PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library user’s own terms and would suggest which terms to use from the specialized categorization of the resource to be searched.
PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you find documents on the same topic in a text database, and vice versa.
TIDES Project
• Translingual Information Detection, Extraction and Summarization
– Building EVIs to map across languages
• Using same notion with training data in different languages
• Using Library of Congress Subject Headings from the CDL MELVYL database
What is an Entry Vocabulary Index?
• EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
Start with a collection of documents.
Classify and index with controlled vocabulary.
Ideally, use a database already indexed.
Problem: Controlled vocabularies can be difficult for people to use.
Example index term: “pass mtr veh spark ign eng”
Solution: Entry Level Vocabulary Indexes.
EVI: “pass mtr veh spark ign eng” = “Automobile”
EVI example
User query: “Automobile”
EVI 1 index term: “pass mtr veh spark ign eng”
EVI 2 index term: “automobiles” OR “internal combustion engines”
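The mapping on this slide can be pictured as a simple lookup from a user's word to the index terms of each collection. The sketch below is only an illustration: the dictionaries `EVI_1` and `EVI_2` and the helper names are hypothetical, and a real EVI would hold ranked statistical associations rather than a hand-written table.

```python
from typing import Dict, List

# Hypothetical toy EVIs: each maps a lowercased user term to the
# controlled-vocabulary terms of one particular index.
EVI_1: Dict[str, List[str]] = {
    "automobile": ["pass mtr veh spark ign eng"],
}
EVI_2: Dict[str, List[str]] = {
    "automobile": ["automobiles", "internal combustion engines"],
}


def lookup(evi: Dict[str, List[str]], query: str) -> List[str]:
    """Return the index terms suggested for a user query (empty list if unknown)."""
    return evi.get(query.strip().lower(), [])


def to_boolean_query(terms: List[str]) -> str:
    """Join the suggested index terms into a quoted OR query."""
    return " OR ".join(f'"{term}"' for term in terms)


if __name__ == "__main__":
    for name, evi in (("EVI 1", EVI_1), ("EVI 2", EVI_2)):
        print(name, "->", to_boolean_query(lookup(evi, "Automobile")))
```

The same user query yields different index terms depending on which collection's EVI is consulted, which is the point of the example.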
But why stop there?
[Figure: an Index with its EVI]
“Which EVI do I use?”
[Figure: several Indexes, each with its own EVI]
EVI to EVIs
[Figure: several Index/EVI pairs and EVI2]
Find: Plutonium
In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Why not treat language the same way?
Statistical association: $W(C,t) = 2[\log L(p_1, a, a+b) + \dots]$
Digital library resources
Background on Online Library Catalogs
• Library catalogs have been automated at a furious pace worldwide since the late ’70s
• Library objects (books, maps, pictures) in 400+ languages
• Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
• Objects have been classified by subject by librarians
– Library of Congress Subject Heading (Islamic Fundamentalism)
– Library of Congress Classification (BP60, BP63, KF27)
– Dewey Decimal Classification (297.2, 306.6, 320.5)
• International standard (MARC) for catalog metadata
• Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50
What can libraries and their catalogs provide?
• Millions of sentences in multiple languages
• Sentences with topical content identified from 150,000 Library of Congress Subject Headings
• Transfer point (interlingua) between English topics and words in other languages
• Can be used to create:
– Bilingual dictionaries
– Query expansion in cross-language information retrieval
Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”
Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”
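A search like the one above amounts to filtering catalog records on two fields and keeping the titles as language samples. The following is a minimal sketch under stated assumptions: the `Record` class and its field names are hypothetical stand-ins for parsed catalog records, not the MARC structure or the actual prototype code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical, simplified catalog record (not a real MARC layout)."""
    title: str            # romanized title in the record's language
    language: str         # language name, e.g. "Arabic"
    subjects: List[str]   # assigned subject headings


def select_samples(records: List[Record], subject: str, language: str) -> List[str]:
    """Return titles of records in `language` that carry the heading `subject`."""
    subject_key = subject.lower()
    language_key = language.lower()
    return [
        record.title
        for record in records
        if record.language.lower() == language_key
        and any(heading.lower() == subject_key for heading in record.subjects)
    ]
```

Each returned title is a sample of text in the chosen language that a cataloger has already tied to the English subject heading.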
Our Training Set and Prototype
• University of California/CDL MELVYL catalog
• Private copy, 10 million+ records (5 million non-English)
• Records in over 100 languages
• Obtained in MARC database standard format
• Foreign language titles use Library of Congress transliteration (Romanization) standard
• Prototype search software maps from/to English and
– Arabic, Chinese, French, German
– Italian, Japanese, Russian, Spanish
Technical Details
Building an Entry Vocabulary Module (EVI)
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
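The steps on this slide (take training records, extract terms, and relate them to controlled-vocabulary headings) might be sketched as below. This is a simplified illustration, not the project's code: it assumes the training data is already available as (title, headings) pairs, uses plain word tokenization in place of part-of-speech tagging and noun-phrase extraction, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def extract_terms(title: str) -> List[str]:
    """Lowercase the title and return its alphabetic words (no POS tagging here)."""
    return re.findall(r"[^\W\d_]+", title.lower())


def count_cooccurrences(
    training: Iterable[Tuple[str, List[str]]]
) -> Tuple[Counter, Counter, Counter, int]:
    """Count, per record, which terms and headings occur and which occur together.

    Returns (pair_counts, term_counts, heading_counts, number_of_records),
    where every count is a number of records, not a number of tokens.
    """
    pair_counts: Counter = Counter()
    term_counts: Counter = Counter()
    heading_counts: Counter = Counter()
    total = 0
    for title, headings in training:
        total += 1
        terms = set(extract_terms(title))       # count each term once per record
        unique_headings = set(headings)
        term_counts.update(terms)
        heading_counts.update(unique_headings)
        pair_counts.update((t, h) for t in terms for h in unique_headings)
    return pair_counts, term_counts, heading_counts, total
```

These record-level counts are the raw material for a term-by-heading association score.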
Association Measure
        C    ¬C
  t     a    b
  ¬t    c    d
Where t is the occurrence of a term and C is the occurrence of a classification in the training set
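Assuming the usual reading of this 2×2 table (a counts training records with both t and C, b records with t but not C, c records with C but not t, and d records with neither), the four cells follow from three marginal counts and the collection size. The function name and argument names below are hypothetical.

```python
from typing import Tuple


def contingency(pair_count: int, term_count: int, class_count: int, total: int) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for one term t and one classification C.

    pair_count  : records containing both t and C
    term_count  : records containing t
    class_count : records assigned C
    total       : all records in the training set
    """
    a = pair_count
    b = term_count - pair_count      # t without C
    c = class_count - pair_count     # C without t
    d = total - a - b - c            # neither t nor C
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    return a, b, c, d
```

For example, a term in 40 records and a classification on 50 records, co-occurring in 30 of 1000 records, gives cells (30, 10, 20, 940).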
Association Measure
• Maximum likelihood ratio
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log p + (n - k)\log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
Following Dunning.
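A direct transcription of this measure into code may help make the arithmetic concrete. The sketch below assumes the cell counts a, b, c, d of the 2×2 table and reads log L(p, k, n) with k as the number of co-occurrences and n as the row total, matching the calls log L(p1, a, a+b); the function names and the zero-count guards are illustrative choices, not part of the slides.

```python
import math


def log_l(p: float, k: int, n: int) -> float:
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), skipping zero-count terms."""
    value = 0.0
    if k > 0:
        value += k * math.log(p)
    if n - k > 0:
        value += (n - k) * math.log(1.0 - p)
    return value


def association(a: int, b: int, c: int, d: int) -> float:
    """Likelihood-ratio association W(C, t) computed from the 2x2 table cells."""
    total = a + b + c + d
    if total == 0:
        return 0.0
    p1 = a / (a + b) if a + b else 0.0   # rate of C among records with t
    p2 = c / (c + d) if c + d else 0.0   # rate of C among records without t
    p = (a + c) / total                  # overall rate of C
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )
```

The statistic is zero when C is equally frequent with and without t and grows as the two rates diverge. It is large for negative as well as positive association, so a ranking of candidate terms may also want to check that p1 exceeds p2.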
Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages
Non-English words can be mapped to English subject headings
Examples
Catalog Languages vs. FBIS Languages (University of California online catalog, 10 million records)
Approx. language distribution (Berkeley: # sentences; FBIS: est. # lines of source)

Language      Berkeley   FBIS        Language         Berkeley   FBIS
German        840,032    49,872      Danish           41,517     18,688
Spanish       614,025    388,772     Hebrew           41,468     3,500
French        609,089    2,871       Czech            35,432     3,647
Russian       341,050    15,415      Urdu             30,206     -
Italian       266,424    254         Turkish          30,015     -
Portuguese    149,389    24,930      Bulgarian        27,850     -
Chinese       127,636    246,549     Norwegian        26,478     13,596
Japanese      110,956    -           Korean           25,979     68,607
Arabic        96,124     (8263)*     Rumanian         25,874     -
Dutch         90,170     -           Finnish          25,027     8,187
Latin         88,818     -           Thai             24,693     -
Polish        81,698     -           Serbo-Croatian   24,601     36,139
Indonesian    59,445     -           Greek            23,926     -
Swedish       53,854     16,652      Bengali          23,430     -
Hungarian     46,330     6,631       Catalan          20,392     -
Hindi         42,886     -           Tamil            20,232     -

* English only, no source text
- no FBIS figure given
106 languages with > 500 records
Future Research
• Add content from other online library catalogs
– RLIN (>30M records, >900K Chinese, >250K Arabic)
– COPAC [UK] (9M records, 40K Arabic)
• Transliteration and back-transliteration for scripted languages
• Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
• Further evaluation (TREC, CLEF, NTCIR and local analysis)
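The phrase-mapping item mentions identifying bigrams by mutual information. One common way to do this is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is an assumption about what such a step could look like, not a description of the planned system; the function name and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def pmi_bigrams(sentences: List[List[str]], min_count: int = 2) -> List[Tuple[Tuple[str, str], float]]:
    """Score adjacent word pairs by pointwise mutual information, highest first.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated
    from unigram and bigram frequencies in the tokenized sentences.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    unigram_total = sum(unigrams.values())
    bigram_total = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:        # rare pairs give unreliable PMI estimates
            continue
        p_pair = count / bigram_total
        p_first = unigrams[first] / unigram_total
        p_second = unigrams[second] / unigram_total
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Pairs that score well above zero co-occur more often than chance would predict and are candidates for treatment as phrases.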
Prototype available
http://otlet.sims.berkeley.edu/mulevm2.html