
The Effect of Pseudo Relevance Feedback on MT-Based CLIR

Yan Qu, Alla N. Eilerman, Hongming Jin, David A. Evans
CLARITECH Corporation
April 12, 2000


Outline

• Our approach to Cross-Language Information Retrieval (CLIR)
• Objectives of this work
• Review of previous work with Pseudo Relevance Feedback (PRF)
• System diagram
• Data for experiments
• Error analysis of MT-based query translation
• The effect of PRF on French monolingual retrieval
• The effect of PRF on English-to-French cross-language retrieval
• Summary and conclusions


Our Approach to CLIR

• Used MT-based query translation to bridge the language gap

• Adapted pseudo relevance feedback to CLIR
  – pre-translation query expansion
  – post-translation query expansion
  – combined (pre- and post-translation) query expansion


Objectives

• Identify factors that affect the quality of MT-based query translation

• Evaluate the effectiveness of using pseudo relevance feedback for improving CLIR performance

• Identify contexts for selecting these feedback methods


Relevance Feedback in Monolingual Retrieval

• Relevance feedback (Salton & Buckley, 1990; Evans et al., 1999)

• Pseudo relevance feedback (PRF) (Evans & Lefferts, 1994; Milic-Frayling et al., 1998)

• Both have been demonstrated to be effective in improving retrieval performance
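In outline, pseudo relevance feedback runs the original query, treats the top-ranked documents as if they had been judged relevant, extracts expansion terms from them, and re-runs the enlarged query. A minimal sketch of that loop, assuming hypothetical `search` and `extract_terms` helpers for a generic vector-space engine (illustrative only, not the CLARIT implementation):

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, search, extract_terms,
                              top_docs=10, num_expansion_terms=20):
    """Illustrative PRF loop: expand a query with terms from its own top hits.

    `search(terms, k)` is assumed to return the k highest-ranked documents;
    `extract_terms(doc)` is assumed to return the indexing terms of a document.
    """
    # 1. Initial retrieval with the unexpanded query.
    pseudo_relevant = search(query_terms, k=top_docs)

    # 2. Treat the top-ranked documents as if they were relevant and pool
    #    their terms ("thesaurus extraction" in the slides' terminology).
    pooled = Counter()
    for doc in pseudo_relevant:
        pooled.update(extract_terms(doc))

    # 3. Keep the most frequent new terms and add them to the query.
    expansion = [t for t, _ in pooled.most_common()
                 if t not in query_terms][:num_expansion_terms]

    # 4. Final retrieval with the expanded query.
    return search(query_terms + expansion, k=1000)
```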


Pseudo Relevance Feedback in CLIR

                          PRE-TRANSLATION             POST-TRANSLATION            COMBINED
Parallel corpora          Carbonell et al., 1997      —                           —
Bilingual dictionaries    Ballesteros & Croft, 1998   Ballesteros & Croft, 1998   Ballesteros & Croft, 1998
MT system                 ???                         ???                         ???


CLIR with Simple MT-based Query Translation

Queries in SL → MT → Queries in TL → Retrieval → Ranked list from TL database


CLIR with Query Expansion Before MT

Queries in SL → Retrieval & thesaurus extraction → Thesaurus terms in SL → MT → Queries in TL & thesaurus terms in TL → Retrieval → Ranked list from TL database


CLIR with Query Expansion After MT

Queries in SL → MT → Queries in TL → Retrieval & thesaurus extraction → Queries in TL & thesaurus terms in TL → Retrieval → Ranked list from TL database


Process Summary

• Simple MT-based translation: Queries in SL → MT → Queries in TL → Retrieval → Ranked list from TL database
• Pre-translation expansion: Queries in SL → Retrieval & thesaurus extraction → Thesaurus terms in SL → MT → Queries in TL & thesaurus terms in TL → Retrieval → Ranked list from TL database
• Post-translation expansion: Queries in SL → MT → Queries in TL → Retrieval & thesaurus extraction → Queries in TL & thesaurus terms in TL → Retrieval → Ranked list from TL database
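Read as data flow, the three configurations differ only in where the feedback step sits relative to machine translation. A schematic sketch, treating `translate` (the MT black box), `retrieve`, and the expansion steps as assumed helpers:

```python
def clir_no_feedback(query_sl, translate, retrieve):
    """Simple MT-based query translation: SL query -> MT -> TL retrieval."""
    return retrieve(translate(query_sl))

def clir_pre_translation(query_sl, translate, retrieve, expand_sl):
    """Pre-translation expansion: PRF against an SL collection, then MT."""
    expanded_sl = expand_sl(query_sl)        # SL query + SL thesaurus terms
    return retrieve(translate(expanded_sl))  # translated to TL, then retrieval

def clir_post_translation(query_sl, translate, retrieve, expand_tl):
    """Post-translation expansion: MT first, then PRF against the TL collection."""
    query_tl = translate(query_sl)
    return retrieve(expand_tl(query_tl))     # TL query + TL thesaurus terms

def clir_combined(query_sl, translate, retrieve, expand_sl, expand_tl):
    """Combined expansion: PRF both before and after translation."""
    return retrieve(expand_tl(translate(expand_sl(query_sl))))
```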


CLARIT English NLP

• Used for processing the English corpus and the English queries
• Consists of a parser and morphological analyzer
• Uses an English lexicon and grammar to identify linguistic structures in texts
• Supplemented by a “stop word” list to filter out substantive words that are extraneous to the topics (e.g., document, relevant)


French Text Processing (Pseudo-NLP Approach)

• Goal: to obtain mostly correct phrase segmentation
• Manually constructed resources
  – lexicon of closed-class categories with 1081 entries
  – “stop word” lexicon including 525 words and their inflected forms that are extraneous to the topics (e.g., document, pertinent)
  – grammar based on the CLARIT English grammar and adapted to accommodate French categories
  – no French morphological normalization
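The pseudo-NLP strategy can be illustrated by splitting text at closed-class words and stop words and keeping the remaining runs of content words as candidate phrases. The sketch below uses tiny made-up lexicons standing in for the 1081-entry closed-class lexicon and the 525-word stop list; it is a rough illustration of the idea, not the CLARIT grammar:

```python
# Hypothetical mini-lexicons standing in for the resources described above.
CLOSED_CLASS = {"le", "la", "les", "de", "des", "du", "et", "en",
                "pendant", "aux", "sur"}
STOP_WORDS = {"document", "documents", "pertinent", "pertinents"}

def segment_phrases(text):
    """Split French text into candidate phrases at closed-class/stop words."""
    phrases, current = [], []
    for token in text.lower().replace(",", " ").replace(".", " ").split():
        if token in CLOSED_CLASS or token in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)   # no morphological normalization applied
    if current:
        phrases.append(" ".join(current))
    return phrases

print(segment_phrases(
    "la participation du président autrichien pendant la deuxième guerre mondiale"))
# ['participation', 'président autrichien', 'deuxième guerre mondiale']
```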


English-to-French Translation

• SYSTRAN Enterprise translation software

• Translation direction: English to French

• Client-server architecture

• Translation is a black box to our system

• No special or additional resources were used to supplement the translation process


Data Sources for Experiments

• TREC-6 CLIR track data collections provided by NIST (Voorhees & Harman, 1998)
  – 250 MB collection of French SDA news (1988–1990) from the Swiss News Agency: 141,656 documents
  – 750 MB collection of English AP news (1988–1990) from the Associated Press: 242,918 documents


Topics for Experiments

• TREC-6 CLIR track topics provided by NIST (Voorhees & Harman, 1998)
  – 22 English topics for the English-to-French cross-language runs
  – 22 French topics for the French monolingual runs
  – Equivalent across languages
  – Prepared by humans
  – Composed of the title, description, and narrative fields


A Sample English Topic

<num> Number: CL1
<E-title> Waldheim Affair
<E-desc> Description:
Reasons for controversy surrounding Waldheim's World War II actions.
<E-narr> Narrative:
Revelations about Austrian President Kurt Waldheim’s participation in Nazi crimes during World War II are argued on both sides. Relevant documents are those that express doubts about the truth of these revelations. Documents that just discuss the affair are not relevant.


An Ideal French Topic

<num> Number: CL1
<F-title> Affaire Waldheim
<F-desc> Description:
Raisons de la controverse à l'égard des agissements de Waldheim pendant la deuxième guerre mondiale.
<F-narr> Narrative:
Les révélations sur la participation du président autrichien Kurt Waldheim aux crimes nazis pendant la deuxième guerre mondiale font l'objet de controverses. Les documents pertinents font état de doutes sur la culpabilité de Waldheim. Les articles qui ne font que mentionner l'affaire ne sont pas valables.


CLARIT Queries

• Composed of the title, description, and narrative fields
• Processed automatically into query vectors


Sample English Query Vector

<cf="1" tf="1">waldheim affair</>
<cf="1" tf="1">waldheim world war ii</>
<cf="1" tf="1">nazi crime</>
<cf="1" tf="1">austrian president kurt waldheim</>
<cf="1" tf="1">austrian president</>
<cf="1" tf="1">controversy surround</>
<cf="1" tf="1">president kurt waldheim</>
<cf="1" tf="1">kurt waldheim</>
<cf="1" tf="3">waldheim</>
<cf="1" tf="1">kurt</>
<cf="1" tf="2">revelation</>
<cf="1" tf="1">austrian</>
<cf="1" tf="1">participation</>
<cf="1" tf="1">surround</>
<cf="1" tf="1">truth</>


Sample French Query Vector

<cf="1" tf="1">crimes nazis</>
<cf="1" tf="1">affaire waldheim</>
<cf="1" tf="1">président autrichien kurt waldheim</>
<cf="1" tf="1">président autrichien</>
<cf="1" tf="1">controverses</>
<cf="1" tf="1">agissements</>
<cf="1" tf="1">kurt waldheim</>
<cf="1" tf="1">culpabilité</>
<cf="1" tf="4">waldheim</>
<cf="1" tf="2">deuxième guerre mondiale</>
<cf="1" tf="2">deuxième guerre</>
<cf="1" tf="1">doutes</>
<cf="1" tf="1">révélations</>
<cf="1" tf="1">nazis</>
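The query-vector dumps above follow a simple `<cf="..." tf="...">term</>` pattern, so they can be pulled apart mechanically. A small illustrative parser (the entry format is taken from the slides; the parser itself assumes nothing beyond that pattern):

```python
import re

ENTRY = re.compile(r'<cf="(\d+)"\s+tf="(\d+)">(.*?)</>')

def parse_query_vector(text):
    """Return (term, cf, tf) triples from a CLARIT-style query vector dump."""
    return [(term, int(cf), int(tf)) for cf, tf, term in ENTRY.findall(text)]

sample = '<cf="1" tf="3">waldheim</><cf="1" tf="1">nazi crime</>'
print(parse_query_vector(sample))
# [('waldheim', 1, 3), ('nazi crime', 1, 1)]
```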


Topic and Query Statistics

                   English Topics   Ideal French Topics
Avg. # of Words    43               51.3
Avg. # of Terms    17.3             15.4


Evaluation

• Relevance judgments on the French SDA news, prepared by NIST judges (TREC-6)
• Evaluation measures:
  – eleven-point average precision (N=1000 documents)
  – precision at low recall levels (10, 20, and 100 documents)
  – recall
  – exact precision
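For reference, these measures can be computed directly from a ranked result list and the set of judged-relevant documents. A compact sketch using standard TREC-style definitions; “exact precision” is read here as R-precision (precision after R documents, R being the number of relevant documents), which is an assumption since the slide does not define the term:

```python
def precision_at(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at(ranked, relevant, k=1000):
    """Fraction of all relevant documents found in the top-k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def eleven_point_average_precision(ranked, relevant, k=1000):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []                      # (recall, precision) after each hit
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    levels = [r / 10 for r in range(11)]
    interpolated = [max((p for rec, p in points if rec >= level), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(levels)

def exact_precision(ranked, relevant):
    """R-precision: precision at R = number of relevant documents (assumed reading)."""
    return precision_at(ranked, relevant, len(relevant))
```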


English-to-French Retrieval vs. French Monolingual Retrieval (without PRF)

                        F-nf (baseline)   EF-nf    Percentage of baseline
RelRetDocCount          1006              845      84%
Recall                  0.7306            0.6137   84%
Average Precision       0.2548            0.1862   73%
Precision at 10 Docs    0.3727            0.3136   84%
Precision at 20 Docs    0.3386            0.2818   83%
Precision at 100 Docs   0.1909            0.1618   85%
Exact Precision         0.3143            0.2213   70%
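The “Percentage of baseline” column is simply the cross-language score divided by the monolingual baseline: for example, 845 relevant documents retrieved against a baseline of 1006 gives 845 / 1006 ≈ 84%, and an average precision of 0.1862 against 0.2548 gives ≈ 73%. The “Increase” columns on the later PRF slides are the analogous relative change, e.g. (1147 − 1006) / 1006 ≈ 14%.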


Types of Translation Errors

• E1: missing translation of an English term
• E2: unnecessary translation of a borrowed English term
• E3: wrong sense disambiguation
• E4: wrong sense disambiguation caused by removed capitalization
• E5: word-by-word translation of a multiword (idiomatic) term
• E6: wrong phrase construction
• E7: broken phrase


Error Type 1: Missing Translation

English: agencies’

Ideal French translation: (des) agences

MT output: (d’)agencies


Error Type 2: Unnecessary Translation

English: fast food

Ideal French translation: fast food

MT output: aliments de préparation rapide (food of fast preparation)


Error Type 3: Wrong Sense Disambiguation

English: logging

Ideal French translation: déforestation (deforestation)

MT output: notation (notation)


Error Type 4: Wrong Disambiguation Caused by Removed Capitalization

English: aids (AIDS)

Ideal French translation: sida (SIDA “AIDS”)

MT output: aides (assistants)


Error Type 5: Word-by-Word Translation of a Multiword Idiomatic Term

English: death penalty

Ideal French translation: la peine de mort

MT output: la pénalité de la mort


Error Type 6: Wrong Phrase Construction

English: austrian president kurt waldheim’s participation

Ideal French translation: la participation du président autrichien kurt waldheim

MT output: la participation autrichienne de waldheim de kurt de président


Error Type 7: Broken Phrase

English: sex education

Ideal French translation: éducation sexuelle

MT output: éducation de sexe


Error Distributions

[Bar chart of error frequency by type (E1–E7). The labeled bars are wrong sense disambiguation, word-by-word translation, wrong phrase construction, and broken phrases.]


The Effect of PRF on French Monolingual Retrieval

                        F-nf (baseline)   F-prf    Increase
RelRetDocCount          1006              1147     14%
Recall                  0.7306            0.8330   14%
Average Precision       0.2548            0.2968   16%
Precision at 10 Docs    0.3727            0.4273   15%
Precision at 20 Docs    0.3386            0.3523   4%
Precision at 100 Docs   0.1909            0.2236   17%
Exact Precision         0.3143            0.334    6%


The Effect of PRF on English-to-French Retrieval

                        EF-nf (baseline)   EF-pf-pre   Increase   EF-pf-post   Increase   EF-pf-comb   Increase
RelRetDocCount          845                1010        19.5%      1010         19.5%      1047         23.9%
Recall                  0.6137             0.7335      19.5%      0.7335       19.5%      0.7603       23.9%
Average Precision       0.1862             0.2099      12.7%      0.2392       28.5%      0.2176       16.9%
Precision at 10 Docs    0.3136             0.3455      10.2%      0.3409       8.7%       0.3455       10.2%
Precision at 20 Docs    0.2818             0.2977      5.6%       0.3023       7.3%       0.3045       8.1%
Precision at 100 Docs   0.1618             0.1864      15.2%      0.1973       21.9%      0.1864       15.2%
Exact Precision         0.2213             0.2552      15.3%      0.2582       16.7%      0.2617       18.3%


English-to-French Retrieval vs. French Monolingual Retrieval (with PRF)

                        F-prf (baseline)   EF-prf-pre   % of baseline   EF-prf-post   % of baseline   EF-prf-comb   % of baseline
RelRetDocCount          1147               1010         88%             1010          88%             1047          91%
Recall                  0.8330             0.7335       88%             0.7335        88%             0.7603        91%
Average Precision       0.2968             0.2099       71%             0.2392        81%             0.2176        73%
Precision at 10 Docs    0.4273             0.3455       81%             0.3409        80%             0.3455        81%
Precision at 20 Docs    0.3523             0.2977       85%             0.3023        86%             0.3045        86%
Precision at 100 Docs   0.2236             0.1864       83%             0.1973        88%             0.1864        83%
Exact Precision         0.334              0.2552       76%             0.2582        77%             0.2617        78%


Cross-Language Retrieval vs. Monolingual Retrieval

[Bar chart comparing F-nf, F-pf, EF-nf, EF-pf-pre, EF-pf-post, and EF-pf-comb on average recall, average precision, exact precision, and precision at 100 documents.]




Performance of Different PRF Methods

[Per-topic chart comparing the translated query without feedback (query) with the pre-translation (pf-pre), post-translation (pf-post), and combined (pf-comb) feedback runs. Topic 1009 “Effects of logging” and Topic 1016 “Tuberculosis” are highlighted and discussed on the following slides.]


Topic 1009 “Effects of Logging”

• Key concept lost due to wrong sense disambiguation (E3 error): logging (felling trees) → notation (notation)
• Pre-translation feedback
  – neutralized the effect of the translation error by bringing in useful thesaurus terms (tropical forest, tree, earth, sea, ocean, land, atmosphere, carbon dioxide, ozone depletion, greenhouse effect, global warming, destruction, pollution, damage, environment, environmentalist, conference, organization, world, nation, country)
  – Result: 688% increase in average precision
• Post-translation feedback
  – returned some useful terms
  – introduced noise caused by the wrong translation of logging
  – Result: 29% increase in average precision
• Combined feedback
  – created a strong base query prior to translation
  – further improved it with appropriate terms after translation
  – avoided too much noise
  – Result: 621% increase in average precision


Topic 1016 “Tuberculosis”

• Key term is translated correctly: tuberculosis → tuberculose
• Translation errors affected some important terms: aids (AIDS) → aides (assistants); third-world (countries) → le troisième-monde (the third world)
• Pre-translation and combined feedback created additional sources of errors and noise by introducing
  – ambiguous thesaurus terms (cases, tests), which were mistranslated (caisse instead of cas, essai instead of test)
  – acronyms (AIDS, CDC, HIV), either mistranslated or not translated
  – Result: 29–30% decrease in average precision
• Post-translation feedback compensated for translation errors by bringing in
  – correct terms (SIDA “AIDS”, tiers monde “third world”)
  – additional useful terms (bacille, tuberculeux, virus, infectées, maladie, risque, santé, problème, etc.)
  – Result: 32% increase in average precision


Performance of Different PRF Methods

PRE-TRANSLATION FEEDBACK   POST-TRANSLATION FEEDBACK   COMBINED FEEDBACK   NUMBER OF QUERIES
+                          +                           +                   9
+                          –                           +                   2
+                          –                           –                   1
–                          +                           +                   1
–                          +                           –                   8
–                          –                           –                   1
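One plausible way to arrive at such counts is to compare each query's score with and without each feedback method and tally the resulting +/– patterns. The slide does not say exactly how the per-query judgments were made, so the sketch below is only an illustration of that reading (ties are counted as “–”):

```python
from collections import Counter

def sign_pattern_counts(baseline_ap, method_ap):
    """Count queries by whether each feedback method improved average precision.

    `baseline_ap` maps query id -> AP without feedback; `method_ap` maps a
    method name ("pre", "post", "comb") -> {query id -> AP with that method}.
    """
    patterns = Counter()
    for qid, base in baseline_ap.items():
        pattern = tuple("+" if method_ap[m][qid] > base else "–"
                        for m in ("pre", "post", "comb"))
        patterns[pattern] += 1
    return patterns
```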


Decision Tree for Selecting PRF Methods

• Are most key terms translated correctly?
  – Yes: all three methods behave similarly and generally improve retrieval performance.
  – No: is the loss of meaning compensated by context terms?
    – Yes: post-MT feedback is generally better than pre-MT and combined feedback.
    – No: pre-MT and combined feedback are generally better than post-MT feedback.
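The tree reduces to two questions per query. A direct transcription into code (the two boolean inputs would have to be judged for each query; the answer strings are taken from the tree itself):

```python
def recommend_prf_method(most_key_terms_translated_correctly: bool,
                         meaning_loss_compensated_by_context: bool) -> str:
    """Transcription of the decision tree for choosing a PRF method."""
    if most_key_terms_translated_correctly:
        return ("all three methods behave similarly and generally "
                "improve retrieval performance")
    if meaning_loss_compensated_by_context:
        return "post-MT feedback is generally better than pre-MT and combined feedback"
    return "pre-MT and combined feedback are generally better than post-MT feedback"
```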


Summary

• Adopted pseudo relevance feedback for query expansion in CLIR with MT-based query translation
• Conducted an analysis of translation errors
• Empirically evaluated the effect of three feedback methods on retrieval performance
• Examined contexts where different feedback methods are effective


Conclusions

• Wrong sense disambiguation and inappropriate translation of multi-word terms are the most frequent translation errors when using MT.
• All feedback methods demonstrated significant performance improvement in CLIR compared with not using feedback.
• The use of PRF in general helps to reduce the negative effect of translation errors.
• Post-translation feedback generally outperforms pre-translation and combined feedback.
• The effectiveness of different feedback methods depends on the types of translation errors and the relative importance of the terms affected by these errors.


Future Work

• Investigate the effect of query length

• Investigate the effect of context

• Develop measures to evaluate the original query quality

• Develop measures to evaluate the translated query quality

• Investigate the empirical conditions for selecting different feedback methods


The End

