
Cross-language IR and statistical MT

Page 1: Cross-language IR and statistical MT

Academia Sinica 05 1

CROSS-LANGUAGE IR AND STATISTICAL MT

Jian-Yun Nie
DIRO, University of Montreal
http://www.iro.umontreal.ca/~nie

Page 2: Cross-language IR and statistical MT

Outline
• What are the problems in CLIR?
• The approaches proposed in the literature
• Their effectiveness
• Remaining problems

Page 3: Cross-language IR and statistical MT

Problem of CLIR
• Cross-language IR (CLIR)
  • Query in one language (e.g. Chinese) and documents in another language (e.g. English)
• Multilingual IR (MLIR)
  • Query in one language and documents in several languages
• Where are CLIR and MLIR useful?
  • Searching for international patents
  • Identifying possible competitors or collaborators in other countries
  • Searching for local information that is only available in a local language
  • Multilingual users: avoid the burden of issuing several queries
  • …
• In many cases, the retrieved documents still have to be translated into the language of the query (the goal of machine translation)

Page 4: Cross-language IR and statistical MT

History
• In the 1970s: first papers on CLIR
• TREC-3 (1994): Spanish (monolingual): El Norte newspaper (SP 1-25)
• TREC-4 (1995): Spanish (monolingual): El Norte newspaper (SP 26-50)
• TREC-5 (1996):
  • Spanish (monolingual): El Norte newspaper and Agence France Presse (SP 51-75)
  • Chinese (monolingual): Xinhua News Agency, People’s Daily (CH 1-28)
• TREC-6 (1997):
  • Chinese (monolingual): the same documents as TREC-5 (CH 29-54)
  • CLIR: English: Associated Press; French, German: Schweizerische Depeschenagentur (SDA) (CL 1-25)
• TREC-7 (1998): CLIR: English, French, German, Italian (SDA) + German: New Zurich Newspaper (NZZ) (CL 26-53)
• TREC-8 (1999): CLIR (English, French, German, Italian): as in TREC-7 (CL 54-81)
• TREC-9 (2000): English-Chinese: Chinese newswire articles from Hong Kong (CH 55-79)
• TREC 2001: English-Arabic: Arabic newswire from Agence France Presse (1-25)
• TREC 2002: English-Arabic: Arabic newswire from Agence France Presse (26-75)

Page 5: Cross-language IR and statistical MT

History
• NTCIR (Japan, NII) (1999 -)
  • Asian languages (CJK) + English
  • Patent retrieval, blogs, evaluation methodology, …
• CLEF (Europe) (2000 -)
  • European languages
  • Image retrieval, Wikipedia, …
• SEWM: Chinese IR (2004 -)
• FIRE: IR in Indian languages (2008 -)
• Russian, …
• Search engines
  • Yahoo!: 2006, French/German->German/French, English, Spanish, Italian
  • Google: 2007, query translation, translation of retrieved documents

Page 6: Cross-language IR and statistical MT

Problems in CLIR
• Translation of the query (or the documents) so that they can be compared
• Similarities with MT
  • Translation
  • Similar methods can be used
• A task different from MT
  • Short queries (2-3 words): HD video recording
  • Flexible syntax: video hd recording, recording hd, …
  • Goal: help find relevant documents, not make the translated query readable
  • The "translation" can consist of related words (same or related subject, …)
  • Less strict translation
• Important to weight translation terms
  • weight = correctness of translation + utility for IR
  • E.g. cost for computers
    Translation -> 计算机成本, 计算机开销, 计算机价格, …
    Utility for IR -> 计算机成本, 计算机开销, 计算机价格, …

Page 7: Cross-language IR and statistical MT

Strategies
• Translate the query
• Translate the documents
• The two strategies have similar effectiveness
• Translating the documents is more complex
• Translate both query and documents into a third language (pivot language)
  • Less effective than direct translation
  • (Related) Transitive translation: French->English->Chinese

Page 8: Cross-language IR and statistical MT

How to translate
1. Machine Translation (MT)
2. Bilingual dictionaries, thesauri, lexical resources, …
3. Parallel texts: translated texts
   Parallel texts encompass translation knowledge

Page 9: Cross-language IR and statistical MT

Approach 1: Using MT
• Seems to be the ideal tool for CLIR and MLIR (if the translation quality is high)
  Query in F -> MT -> Translation in E -> matched against documents in E
• Typical effectiveness: 80-100% of the monolingual effectiveness
• Problems:
  • Quality
  • Availability
  • Development cost

Page 10: Cross-language IR and statistical MT

Problems of MT
• Wrong choice of translation word/term
  • organic food – nourriture organique (biologique)
  • train skilled personnel – personnel habile de train (ambiguity)
• Wrong syntax
  • human-assisted machine translation – traduction automatique humain-aidée
• Unknown words
  • Personal names:
    Bérégovoy -> Bérégovoy, Beregovoy
    邓小平 -> Deng Xiaoping, Deng Hsao-ping, Deng Hsiao p'ing
• For CLIR: MT chooses one translation word
  • E.g. organic – organique
  • Better to keep all the synonyms (organique, biologique)? – query expansion effect

Page 11: Cross-language IR and statistical MT

Examples (Systran vs. Google)
• 1. drug traffic
  • Systran: trafic de stupéfiants (correct); 毒品交易 (correct)
  • Google: trafic de stupéfiants (correct); 毒品贩运 (correct)
• 2. drug insurance
  • Systran: assurance de drogue (incorrect); 药物保险 (correct)
  • Google: d'assurance médicaments (correct); 药物保险 (correct)
• 3. drug research
  • Systran: recherche de drogue (incorrect); 药物研究 (correct)
  • Google: la recherche sur les drogues (incorrect); 药物研究 (correct)
• 4. drug for treatment of Friedreich’s ataxia
  • Systran: drogue pour le traitement de l'ataxie de Friedreich (incorrect); Friedreich 的不整齐的治疗的药物 (correct)
  • Google: médicament pour le traitement de l'Ataxie de Friedreich (correct); 药物治疗弗里德的共济失调 (correct)
• 5. drug control
  • Systran: commande de drogue (likely incorrect); 药物管制 (likely)
  • Google: contrôle des drogues (likely); 药物管制 (likely)
• 6. drug production
  • Systran: production de drogue (likely); 药物生产 (likely)
  • Google: la production de drogues (likely); 药物生产 (likely)

Page 12: Cross-language IR and statistical MT

Approach 2: Using bilingual dictionaries
• High-quality MT systems are unavailable for many language pairs
• MT systems are often a closed box that is difficult to adapt to the IR task
• Bilingual dictionary:
  • An inexpensive alternative
  • Usually available

Page 13: Cross-language IR and statistical MT

Approach 2: Using bilingual dictionaries
• General form of a dictionary (e.g. Freedict)
  access: attaque, accéder, intelligence, entrée, accès
  academic: étudiant, académique
  branch: filiale, succursale, spécialité, branche
  data: données, matériau, data
• LDC English-Chinese
  • AIDS / 艾滋病 / 爱滋病 /
  • data / 材料 / 资料 / 事实 / 数据 / 基准 /
  • prevention / 阻碍 / 防止 / 妨碍 / 预防 / 预防法 /
  • problem / 问题 / 难题 / 疑问 / 习题 / 作图题 / 将军 / 课题 / 困难 / 难 / 题是 /
  • structure / 构造 / 构成 / 结构 / 组织 / 化学构造 / 石理 / 纹路 / 构造物 / 建筑物 / 建造 / 物 /

Page 14: Cross-language IR and statistical MT

Basic methods
• Use all the translation terms
  • data / 材料 / 资料 / 事实 / 数据 / 基准 /
  • structure / 构造 / 构成 / 结构 / 组织 / 化学构造 / 石理 / 纹路 / 构造物 / 建筑物 / 建造 / 物 /
  • Introduces noise
  • Implicitly, a term with more translations is assigned higher importance
• Use the first (or most frequent) translation
  • Limit to the most frequent translation (when frequency is available)
  • Not always an appropriate choice
• General effectiveness: 50-60% of monolingual IR
• Problems of dictionaries
  • Coverage (unknown words, unknown translations)
  • [Xu and Weischedel 2005] tested the impact of dictionary coverage on CLIR (En-Ch)
  • The effectiveness increases up to 10 000 entries

Page 15: Cross-language IR and statistical MT

Translate the query as a whole
• Phrase translation [Ballesteros and Croft, 1996, 1997]
  base de données: database
  pomme de terre: potato
  • Translate phrases first
  • Then the remaining words
• Best global translation for the whole query
  1. Candidates: for each query word, determine all the possible translations (through a dictionary)
  2. Selection: select the set of translation words that produces the highest cohesion

Page 16: Cross-language IR and statistical MT

Cohesion
• Cohesion ~ frequency of two translation words together
  E.g.
  • data: données, matériau, data
  • access: attaque, accéder, intelligence, entrée, accès
  (accès, données) 152 *
  (accéder, données) 31
  (données, entrée) 21
  (entrée, matériau) 3
  …
• Freq. from a document collection or from the Web (Grefenstette 99)
• (Gao, Nie et al. 2001) (Liu and Jin 2005) (Seo et al. 2005) …

$$\hat{T}_Q = \arg\max_{T_Q} \mathrm{Cohesion}(T_Q) = \arg\max_{T_Q} \sum_{t_i \in Q} \sum_{t_j \in Q,\ t_j \neq t_i} \mathrm{sim}(T_i, T_j)$$

  (T_Q: a set of translation words for the query Q; T_i: the translation chosen for query word t_i)
  • sim: co-occurrence, mutual information, statistical dependence
• Dynamic translation: graph of terms in two languages connected by dictionary translations + random walk
• Improved effectiveness (80-100% of monolingual IR)
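The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the co-occurrence counts are the four pairs shown on the slide (any unlisted pair is assumed to have count 0), sim is taken to be the raw co-occurrence count, and the names `COOC`, `CANDIDATES`, `cohesion` and `best_translation` are hypothetical. The search is exhaustive, which is only practical for short queries.

```python
from itertools import combinations, product

# Co-occurrence counts copied from the slide; unlisted pairs are assumed to be 0.
COOC = {
    frozenset(("accès", "données")): 152,
    frozenset(("accéder", "données")): 31,
    frozenset(("données", "entrée")): 21,
    frozenset(("entrée", "matériau")): 3,
}

# Candidate translations per query word (the dictionary entries shown above).
CANDIDATES = {
    "data": ["données", "matériau", "data"],
    "access": ["attaque", "accéder", "intelligence", "entrée", "accès"],
}


def sim(a, b):
    """Similarity of two translation words: here, their raw co-occurrence count."""
    return COOC.get(frozenset((a, b)), 0)


def cohesion(translation_set):
    """Sum of pairwise similarities over one candidate set of translations."""
    return sum(sim(a, b) for a, b in combinations(translation_set, 2))


def best_translation(query_words, candidates):
    """Try every combination of one translation per query word; keep the most cohesive."""
    options = [candidates[w] for w in query_words]
    return max(product(*options), key=cohesion)


best = best_translation(["data", "access"], CANDIDATES)
print(best, cohesion(best))
```

Summing over unordered pairs counts each pair once rather than twice, which does not change the argmax. For longer queries an approximate search (greedy or beam) would be needed in place of the exhaustive product.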

Page 17: Cross-language IR and statistical MT

Approach 3: using parallel texts
• Training a translation model (IBM 1)
• Principle:
  • Train a statistical translation model from a set of parallel texts: p(tj|si)
  • The more often tj appears in the parallel texts of si, the higher p(tj|si)
• Given a query, use the translation words with the highest probabilities as its translation

Page 18: Cross-language IR and statistical MT

Simple utilization
• Determine the probability of a word translation

$$P(f \mid Q_E) = \sum_{e \in Q_E} t(f \mid e)\, P(e \mid Q_E)$$

• One should also take into account the discriminant power of the translation (IDF)

$$w(f, Q_E) = \sum_{e \in Q_E} t(f \mid e) \cdot \log \frac{|C_F|}{n_f}$$
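A minimal sketch of the IDF-weighted formula, assuming that |C_F| is the number of documents in the target-language collection and n_f the number of documents containing f (the slide does not define them). The function name `translation_weights`, its parameters and all numbers in the toy data are hypothetical.

```python
import math


def translation_weights(query, t, doc_freq, n_docs, top_n=None):
    """Compute w(f, Q_E) = sum over e in Q_E of t(f|e) * log(|C_F| / n_f).

    query    : list of source words e
    t        : dict e -> {f: t(f|e)}
    doc_freq : dict f -> n_f (assumed: number of target documents containing f)
    n_docs   : |C_F| (assumed: number of documents in the target collection)
    """
    weights = {}
    for e in query:
        for f, prob in t.get(e, {}).items():
            n_f = doc_freq.get(f, 0)
            if n_f == 0:
                continue  # f never occurs in the target collection
            weights[f] = weights.get(f, 0.0) + prob * math.log(n_docs / n_f)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n else ranked


# Toy data, invented for illustration only.
t = {
    "drug": {"drogue": 0.4, "médicament": 0.5},
    "traffic": {"trafic": 0.6, "circulation": 0.3},
}
doc_freq = {"drogue": 10, "médicament": 100, "trafic": 10, "circulation": 100}
ranked = translation_weights(["drug", "traffic"], t, doc_freq, n_docs=1000)
for f, w in ranked:
    print(f"{f}\t{w:.3f}")
```

In this toy setting the rarer word "drogue" outranks "médicament" even though its translation probability is lower, which is the effect the IDF factor is meant to have.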

Page 19: Cross-language IR and statistical MT

Example
Query #3: What measures are being taken to stem international drug traffic?
  médicament=0.110892
  mesure=0.091091
  international=0.086505
  trafic=0.052353
  drogue=0.041383
  découler=0.024199
  circulation=0.019576
  pharmaceutique=0.018728
  pouvoir=0.013451
  prendre=0.012588
  extérieur=0.011669
  passer=0.007799
  demander=0.007422
  endiguer=0.006685
  nouveau=0.006016
  stupéfiant=0.005265
  produit=0.004789
• Multiple translations, but ambiguity is kept
• Unknown word in target language

Page 20: Cross-language IR and statistical MT

IBM 1 + dictionary
• The weight of each translation word in the dictionary is increased (TREC-6)
• MAP-mono = 0.3731
• MAP-LOGOS = 0.2866 (76.8%), MAP-Systran = 0.2763 (74.1%)

MAP by default probability (rows) and number of translation words (columns):

Default prob.   10       20       30       40       50       100
0.005           0.2671   0.2787   0.2812   0.2813   0.2829   0.2671
0.01            0.2755   0.2873   0.2891   0.2896   0.2906   0.2742
0.02            0.2873   0.2959   0.2962   0.2967   0.2985   0.2825
0.03            0.2811   0.2906   0.2898   0.2897   0.2904   0.2744
0.04            0.2751   0.2842   0.2827   0.2826   0.2831   0.2683
0.05            0.2687   0.2761   0.2729   0.2729   0.2730   0.2578

Without default prob.:

Number N of translation words   MAP (% of monolingual IR)
10                              0.2546 (68.24%)
20                              0.2635 (70.62%)
30                              0.2660 (71.30%)
40                              0.2664 (71.40%)
50                              0.2671 (71.59%)
100                             0.2506 (67.14%)

Page 21: Cross-language IR and statistical MT

Integrating translation in an IR model (Kraaij et al. 2003)

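• The problem of CLIR:

$$\mathrm{Score}(D, Q) = \sum_{t_i \in V} P(t_i \mid Q)\, \log P(t_i \mid D)$$

• Query translation (QT)

$$P(t_i \mid Q_s) = \sum_{s_j \in V_s} P(t_i \mid s_j, Q_s)\, P_{ML}(s_j \mid Q_s) \approx \sum_{s_j \in V_s} t(t_i \mid s_j)\, P_{ML}(s_j \mid Q_s)$$

• Document translation (DT)

$$P(s_i \mid D_t) = \sum_{t_j \in V_t} P(s_i \mid t_j, D_t)\, P(t_j \mid D_t) \approx \sum_{t_j \in V_t} t(s_i \mid t_j)\, P(t_j \mid D_t)$$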
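The query-translation (QT) variant can be sketched as follows. This is not the implementation of Kraaij et al.; it is a small illustration assuming a maximum-likelihood query model, a translation table stored as a nested dict, and Jelinek-Mercer smoothing of the document model against a collection model (the smoothing choice, the `lam` parameter and all names and numbers are assumptions, not taken from the slides).

```python
import math
from collections import Counter


def translate_query_model(query_src, t):
    """P(t_i | Q_s) ~= sum_j t(t_i | s_j) * P_ML(s_j | Q_s)."""
    counts = Counter(query_src)
    total = sum(counts.values())
    model = {}
    for s, c in counts.items():
        for tgt, prob in t.get(s, {}).items():
            model[tgt] = model.get(tgt, 0.0) + prob * c / total
    return model


def score(query_model, doc_tokens, collection_model, lam=0.5):
    """Score(D, Q) = sum_i P(t_i | Q) * log P(t_i | D), with a smoothed P(t_i | D).

    Assumes doc_tokens is non-empty. Terms unseen in the collection model get
    a small floor probability so that the logarithm stays defined.
    """
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    total = 0.0
    for term, p_q in query_model.items():
        p_d = lam * counts[term] / n + (1 - lam) * collection_model.get(term, 1e-9)
        total += p_q * math.log(p_d)
    return total


# Toy translation table and documents, invented for illustration.
t = {"drug": {"drogue": 0.5, "médicament": 0.5}, "traffic": {"trafic": 1.0}}
qm = translate_query_model(["drug", "traffic"], t)
collection = {"trafic": 0.01, "drogue": 0.01, "médicament": 0.01}
s1 = score(qm, ["trafic", "de", "drogue"], collection)
s2 = score(qm, ["circulation", "routière"], collection)
print(qm, s1, s2)
```

The document-translation (DT) variant would instead map each target-language document model into the source vocabulary with t(s_i|t_j) and score it against the untranslated query.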

Page 22: Cross-language IR and statistical MT

Results (CLEF 2000-2002)

Run    EN-FR    FR-EN    EN-IT    IT-EN
Mono   0.4233   0.4705   0.4542   0.4705
MT     0.3478   0.4043   0.3060   0.3249
QT     0.3878   0.4194   0.3519   0.3678
DT     0.3909   0.4073   0.3728   0.3547

Translation model (IBM 1) trained on a web collection

Page 23: Cross-language IR and statistical MT

Principle of translation model training
• p(tj|si) is estimated from a parallel training corpus, aligned into parallel sentences
• IBM models 1, 2, 3, …
• Process:
  • Input = two sets of parallel texts
  • Sentence alignment A: Sk <-> Tl (bitext)
  • Initial probability assignment: t(tj|si, A)
  • Expectation Maximization (EM): t(tj|si, A)
  • Final result: t(tj|si) = t(tj|si, A)

Page 24: Cross-language IR and statistical MT

Details on translation model training on a parallel corpus
• Sentence alignment
  • Align a sentence in the source language to its translation(s) in the target language
• Translation model
  • Extract translation relationships
  • Various models (assumptions)

Page 25: Cross-language IR and statistical MT

Sentence alignment
• Assumption:
  • The order of sentences in two parallel texts is similar
  • A sentence and its translation have similar length (length-based alignment, e.g. Gale & Church)
  • A translation contains some “known” translation words, or cognates (e.g. Simard et al. 93)

$$D(i, j) = \min \begin{cases}
D(i, j-1) + d_1 \\
D(i-1, j) + d_2 \\
D(i-1, j-1) + d_3 \\
D(i-1, j-2) + d_4 \\
D(i-2, j-1) + d_5 \\
D(i-2, j-2) + d_6
\end{cases}$$

di: distance for different patterns (0-1, 1-1, …)
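The recurrence above is a standard dynamic program. The sketch below fills the table D(i, j) and traces back the best sequence of alignment patterns. It is not Gale & Church's probabilistic distance: the cost function `align_cost` and the fixed pattern penalties in `PATTERNS` are simplified, hypothetical stand-ins chosen only to make the example runnable, and sentence lengths are assumed to be given in characters.

```python
def align_cost(src_len, tgt_len):
    """Hypothetical distance: relative length mismatch of the two sentence groups."""
    return abs(src_len - tgt_len) / max(src_len + tgt_len, 1)


# (source sentences consumed, target sentences consumed, assumed fixed penalty)
PATTERNS = [(0, 1, 0.5), (1, 0, 0.5), (1, 1, 0.0), (1, 2, 0.2), (2, 1, 0.2), (2, 2, 0.3)]


def align(src_lens, tgt_lens):
    """Return the minimum-cost list of (source indices, target indices) groups."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj, penalty in PATTERNS:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or D[pi][pj] == inf:
                    continue
                d = penalty + align_cost(sum(src_lens[pi:i]), sum(tgt_lens[pj:j]))
                if D[pi][pj] + d < D[i][j]:
                    D[i][j] = D[pi][pj] + d
                    back[i][j] = (di, dj)
    # Trace back from the bottom-right corner to recover the chosen patterns.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        groups.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return groups[::-1]


# Three source sentences against two target sentences (lengths in characters).
result = align([20, 50, 30], [21, 82])
print(result)
```

With these lengths the cheapest path pairs the first sentences 1-1 and merges the two remaining source sentences against the second target sentence (a 2-1 pattern).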

Page 26: Cross-language IR and statistical MT

Example of aligned sentences (Canadian Hansards)

Rows as laid out on the slide (French | English):
1. Débat | Artificial intelligence
2. L'intelligence artificielle | A Debate
3. Depuis 35 ans, les spécialistes d'intelligence artificielle cherchent à construire des machines pensantes. | Attempts to produce thinking machines have met during the past 35 years with a curious mix of progress and failure.
4. Leurs avancées et leurs insuccès alternent curieusement. | (none)
5. (none) | Two further points are important.
6. Les symboles et les programmes sont des notions purement abstraites. | First, symbols and programs are purely abstract notions.

Page 27: Cross-language IR and statistical MT

TM training: Initial probability assignment t(tj|si, A)

French: même un cardinal n’est pas à l’abri des cartels de la drogue .
English: even a cardinal is not safe from drug cartels .
[Figure: links between the French and the English words]

Page 28: Cross-language IR and statistical MT

TM training: Application of EM: t(tj|si, A)

French: même un cardinal n’est pas à l’abri des cartels de la drogue .
English: even a cardinal is not safe from drug cartels .
[Figure: links between the French and the English words]

Page 29: Cross-language IR and statistical MT

IBM models (Brown et al.)
• IBM 1: does not consider positional information and sentence length
• IBM 2: considers sentence length and word position
• IBM 3, 4, 5: fertility in translation
• For CLIR, IBM 1 seems to correspond to the current (bag-of-words) approaches to IR.

Page 30: Cross-language IR and statistical MT

IBM translation models: principle
• Input: bitexts (set of aligned sentences)
• Output: transfer probability t(f|e)
  the: (le, 0.18) (la, 0.15) (de, 0.12) …
  minister: (ministre, 0.8) (le, 0.12) …
  people: (gens, 0.25) (les, 0.16) (personnes, 0.1) …
  years: (ans, 0.38) (années, 0.31) (depuis, 0.12) …
• Each pair (e, f) is a parameter of the model
• $\forall s,\ \sum_{t} p(t \mid s) = 1$
• In practice:
  • Limit to only 1-1 sentence alignments, of length <= 40 words
  • Words of freq. = 1 replaced by UNK
• Using EM to re-estimate parameters

Page 31: Cross-language IR and statistical MT

Word alignment for one sentence pair
Source sentence in training: e = e1, …, el (+NULL)
Target sentence in training: f = f1, …, fm

Only consider alignments in which each target word (at position j) is aligned to one source word (at position aj)

The set of all the possible word alignments: A(e,f)

Page 32: Cross-language IR and statistical MT

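General formula

$$P(\mathbf{f} \mid \mathbf{e}) = \sum_{\mathbf{a} \in A(\mathbf{e}, \mathbf{f})} P(\mathbf{f}, \mathbf{a} \mid \mathbf{e})$$

with $\mathbf{a} = (a_1, \ldots, a_m)$, $a_j \in [0, l]$, $j \in [1, m]$, $l = |\mathbf{e}|$, $m = |\mathbf{f}|$

$$P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = P(m \mid \mathbf{e}) \prod_{j=1}^{m} P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})\; P(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e})$$

• $P(m \mid \mathbf{e})$: prob. that e is translated into a sentence of length m
• $P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})$: prob. that the j-th target word is aligned with the aj-th source word
• $P(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e})$: prob. to produce the word fj at position j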

Page 33: Cross-language IR and statistical MT

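Example

e = NULL it is automatically translated
f = c’est traduit automatiquement (words: c’, est, traduit, automatiquement)
a = (1,2,4,3)

$$P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = P(m \mid \mathbf{e}) \prod_{j=1}^{m} P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})\; P(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e})$$

$$\begin{aligned}
p(\text{c' est traduit automatiquement}, \mathbf{a} \mid \text{it is automatically translated}) = {} & P(m = 4 \mid \mathbf{e}) \\
\times\ & P(a_1 = 1 \mid 4, \mathbf{e}) \times P(f_1 = \text{c'} \mid a_1 = 1, 4, \mathbf{e}) \\
\times\ & P(a_2 = 2 \mid a_1 = 1, \text{c'}, 4, \mathbf{e}) \times P(f_2 = \text{est} \mid (1,2), \text{c'}, 4, \mathbf{e}) \\
\times\ & P(a_3 = 4 \mid (1,2), \text{c' est}, 4, \mathbf{e}) \times P(f_3 = \text{traduit} \mid (1,2,4), \text{c' est}, 4, \mathbf{e}) \\
\times\ & P(a_4 = 3 \mid (1,2,4), \text{c' est traduit}, 4, \mathbf{e}) \times P(f_4 = \text{automatiquement} \mid (1,2,4,3), \text{c' est traduit}, 4, \mathbf{e})
\end{aligned}$$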

Page 34: Cross-language IR and statistical MT

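IBM model 1

• Simplifications

$$P(m \mid \mathbf{e}) = \epsilon$$
  Any length generation is equally probable – a constant

$$P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e}) = p(a_j \mid l) = \frac{1}{l+1}$$
  Position alignment is uniformly distributed

$$P(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e}) = p(f_j \mid e_{a_j}) = t(f_j \mid e_{a_j})$$
  Context-independent word translation

• The model becomes (for one sentence alignment a)

$$p(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \epsilon \prod_{j=1}^{m} \frac{1}{l+1}\, t(f_j \mid e_{a_j}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$$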

Page 35: Cross-language IR and statistical MT

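Example Model 1

e = NULL it is automatically translated
f = c’est traduit automatiquement (words: c’, est, traduit, automatiquement)
a = (1,2,4,3)

Assume:
t(c’|it) = 0.2, t(est|is) = 0.7, t(traduit|translated) = 0.45, t(automatiquement|automatically) = 0.8

$$p(\text{c' est traduit automatiquement}, \mathbf{a} \mid \text{it is automatically translated}) = \frac{\epsilon}{5^4} \times 0.2 \times 0.7 \times 0.45 \times 0.8 = 8.064 \times 10^{-5}\, \epsilon$$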
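The arithmetic of this example can be checked with a short script. It uses the four translation probabilities assumed on the slide and sets ε = 1; the function name `model1_alignment_prob` and the dictionary layout are hypothetical.

```python
def model1_alignment_prob(f_words, e_words, alignment, t, epsilon=1.0):
    """p(f, a | e) = epsilon / (l + 1)^m * prod_j t(f_j | e_{a_j})."""
    l, m = len(e_words), len(f_words)
    source = ["NULL"] + e_words  # position 0 is the NULL word
    prob = epsilon / (l + 1) ** m
    for f, a_j in zip(f_words, alignment):
        prob *= t[(f, source[a_j])]
    return prob


# Translation probabilities assumed on the slide, keyed as (f, e).
t = {
    ("c'", "it"): 0.2,
    ("est", "is"): 0.7,
    ("traduit", "translated"): 0.45,
    ("automatiquement", "automatically"): 0.8,
}
e = ["it", "is", "automatically", "translated"]
f = ["c'", "est", "traduit", "automatiquement"]

p = model1_alignment_prob(f, e, [1, 2, 4, 3], t)
print(f"{p:.4e}")  # about 8.064e-05 with epsilon = 1
```

Here l = 4 source words plus NULL give the factor 1/5 per target position, and m = 4 target words give 1/5^4 = 1/625.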

Page 36: Cross-language IR and statistical MT

Sum up all the alignments

$$p(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$

• Problem: we want to optimize $t(f_j \mid e_i)$ so as to maximize the likelihood of the given sentence alignments
• Solution: using EM

Page 37: Cross-language IR and statistical MT

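Parameter estimation
1. An initial value for t(f|e) (f, e are words)
2. Compute the count of word alignment e-f in the pair of sentences $(\mathbf{e}^{(s)}, \mathbf{f}^{(s)})$ (E-step)

$$c(f \mid e;\ \mathbf{f}^{(s)}, \mathbf{e}^{(s)}) = \sum_{\mathbf{a}} p(\mathbf{a} \mid \mathbf{e}^{(s)}, \mathbf{f}^{(s)}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}) = \frac{t(f \mid e)}{\sum_{i=0}^{l} t(f \mid e_i)} \; \sum_{j=1}^{m} \delta(f, f_j) \; \sum_{i=0}^{l} \delta(e, e_i)$$

   ($\sum_{j} \delta(f, f_j)$: count of f in f; $\sum_{i} \delta(e, e_i)$: count of e in e)

3. Maximization (M-step)

$$t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e;\ \mathbf{f}^{(s)}, \mathbf{e}^{(s)}), \qquad \lambda_e = \sum_{f} \sum_{s=1}^{S} c(f \mid e;\ \mathbf{f}^{(s)}, \mathbf{e}^{(s)}) \quad \text{(normalization factor)}$$

4. Loop on 2-3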
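The four steps above map directly onto a short training loop. The following is a compact sketch of IBM Model 1 EM training, not an optimized or reference implementation: the function name `train_ibm1`, the uniform initialization over the target vocabulary, the fixed iteration count and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate t(f|e) with EM; the result is keyed as t[(f, e)].

    bitext: list of (source_words, target_words) sentence pairs.
    A NULL word is prepended to every source sentence.
    """
    bitext = [(["NULL"] + list(e), list(f)) for e, f in bitext]
    f_vocab = {f for _, fs in bitext for f in fs}
    # Step 1: uniform initial value for t(f|e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f|e), summed over sentence pairs
        total = defaultdict(float)  # normalization factor lambda_e
        # Step 2 (E-step): distribute each target word over the source words.
        for es, fs in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # Step 3 (M-step): renormalize the counts for each source word e.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy parallel corpus, invented for illustration.
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]
t = train_ibm1(corpus)
for f in ("livre", "le", "un"):
    print(f, round(t[(f, "book")], 3))
```

Looping over word positions rather than word types reproduces the two "count" factors of the E-step formula: a word that occurs twice in a sentence simply contributes twice.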

Page 38: Cross-language IR and statistical MT

Utilization of TM in CLIR
• Query: a set of source words
• Each source word: a set of weighted target words
  • Some filtering: stopwords, prob. threshold, number of translations, …
• All the target words -> query “translation”
• Query “translation” used with monolingual IR

Page 39: Cross-language IR and statistical MT

How effective is this approach? (with the Hansard model)

                     F-E (TREC-6)      F-E (TREC-7)      E-F (TREC-6)     E-F (TREC-7)
Monolingual          0.2865            0.3202            0.3686           0.2764
Dict.                0.1707 (59.0%)    0.1701 (53.1%)    0.2305 (62.5%)   0.1352 (48.9%)
Systran              0.3098 (107.0%)   0.3293 (102.8%)   0.2727 (74.0%)   0.2327 (84.2%)
Hansard TM           0.2166 (74.8%)    0.3124 (97.6%)    0.2501 (67.9%)   0.2587 (93.6%)
Hansard TM + dict.   0.2560 (88.4%)    0.3245 (101.3%)   0.3053 (82.8%)   0.2649 (95.8%)

Page 40: Cross-language IR and statistical MT

Problem of parallel texts
• Only a few large parallel corpora
  • e.g. Canadian Hansards, EU parliament, Hong Kong Hansards, UN documents, …
• Many languages are not covered
• Is it possible to extract parallel texts from the Web?
  • STRAND
  • PTMiner

Page 41: Cross-language IR and statistical MT

An example of “parallel” pages
http://www.iro.umontreal.ca/index.html
http://www.iro.umontreal.ca/index-english.html

Page 42: Cross-language IR and statistical MT

STRAND [Resnik 98]
• Assumption:
  If
  - a Web page contains 2 pointers, and
  - the anchor text of each pointer identifies a language
  Then the two pages referenced are “parallel”

[Figure: a page with two links labelled “French” and “English”, pointing to the French text and the English text]

Page 43: Cross-language IR and statistical MT

PTMiner (Nie & Chen 1999)
• Candidate Site Selection
  By sending queries to AltaVista, find the Web sites that may contain parallel text.
• File Name Fetching
  For each site, fetch all the file names that are indexed by search engines. Use a host crawler to thoroughly retrieve file names from each site.
• Pair Scanning
  From the file names fetched, scan for pairs that satisfy the common naming rules.

Page 44: Cross-language IR and statistical MT


Candidate Sites Searching

• Assumption: A candidate site contains at least one such Web page referencing another language.

• Take advantage of existing search engines (AltaVista)

Page 45: Cross-language IR and statistical MT

File Name Fetching
• Initial set of files (seeds) from a candidate site:
  host:www.info.gov.hk
• Breadth-first exploration from the seeds to discover other documents from the site

Page 46: Cross-language IR and statistical MT

Pair Scanning
• Naming examples:
  index.html vs. index_f.html
  /english/index.html vs. /french/index.html
• General idea:
  parallel Web pages = similar URLs that differ only by a tag identifying a language
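The general idea can be sketched as: strip a language tag from each URL, then group URLs that collapse to the same key. This is not PTMiner's actual rule set; the patterns in `LANG_TAGS` are a small hypothetical sample covering only the naming examples shown in these slides, and a page without any tag is treated as the default-language version.

```python
import re
from collections import defaultdict

# Hypothetical language-tag patterns; real sites use many more naming variants.
LANG_TAGS = {
    "en": [r"/english/", r"_e(?=\.)", r"-english(?=\.)"],
    "fr": [r"/french/", r"_f(?=\.)", r"-french(?=\.)"],
}


def normalize(url):
    """Return (url with its language tag removed, language code or None)."""
    for lang, patterns in LANG_TAGS.items():
        for pat in patterns:
            if re.search(pat, url):
                repl = "/" if pat.startswith("/") else ""
                return re.sub(pat, repl, url, count=1), lang
    return url, None


def scan_pairs(urls):
    """Group URLs that become identical once the language tag is removed."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = normalize(url)
        groups[key][lang] = url
    return [tuple(sorted(g.values())) for g in groups.values() if len(g) >= 2]


urls = [
    "index.html",
    "index_f.html",
    "/english/index.html",
    "/french/index.html",
    "about.html",
]
pairs = scan_pairs(urls)
print(pairs)
```

Candidate pairs found this way would still need the content-based checks (length, language, HTML structure) before being accepted as parallel.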

Page 47: Cross-language IR and statistical MT

Further verification of parallelism
• Download files (for verification with document contents)
• Compare file lengths
• Check file languages (by an automatic language detector – SILC)
• Compare HTML structures
• (Sentence alignment)

Page 48: Cross-language IR and statistical MT

Mining Results (several years ago)
• French-English
  • Exploration of 30% of 5,474 candidate sites
  • 14,198 pairs of parallel pages
  • 135 MB French texts and 118 MB English texts
• Chinese-English
  • 196 candidate sites
  • 14,820 pairs of parallel pages
  • 117.2 MB Chinese texts and 136.5 MB English texts
• Several other languages I-E, G-E, D-E, …

Page 49: Cross-language IR and statistical MT

CLIR results: F-E

              F-E (TREC-6)      F-E (TREC-7)      E-F (TREC-6)     E-F (TREC-7)
Monolingual   0.2865            0.3202            0.3686           0.2764
Systran       0.3098 (107.0%)   0.3293 (102.8%)   0.2727 (74.0%)   0.2327 (84.2%)
Hansard TM    0.2166 (74.8%)    0.3124 (97.6%)    0.2501 (67.9%)   0.2587 (93.6%)
Web TM        0.2389 (82.5%)    0.3146 (98.3%)    0.2504 (67.9%)   0.2289 (82.8%)

• Web TM comparable to Hansard TM

Page 50: Cross-language IR and statistical MT

CLIR Results: C-E

• Chinese: People’s Daily, Xinhua news agency
• English: AP
• MT system: E-C: 0.2001 (50.3%); C-E: (56 - 70%)

                     C-E               E-C
Monolingual          0.3861            0.3976
Dictionary (EDict)   0.1530 (39.6%)    0.1427 (35.9%)
TM                   0.1654 (42.84%)   0.1591 (40.02%)
TM + Dict            0.2583 (66.90%)   0.2232 (56.14%)

Page 51: Cross-language IR and statistical MT

Other methods – using parallel texts for pseudo-relevance feedback

• Given a query in F
• Find relevant documents in the parallel corpus
• Extract keywords from their parallel documents, and consider them as a query translation

Query in F -> relevant documents in F -> corresponding documents in E -> words in E

Page 52: Cross-language IR and statistical MT

Other methods - LSI
• Monolingual LSI:
  • Create a latent semantic space
  • Each dimension represents a combination of initial dimensions (terms, documents)
  • Comparison of document-query in the new space
• Bilingual LSI:
  • Create a latent semantic space for both languages on a parallel corpus
  • Concatenate two parallel texts together
  • Convert terms in both languages into the semantic space
• Problems:
  • The dimensions in the latent space are determined to minimise some representational error – may be different from translational error
  • Coverage of terms by the parallel corpus
  • Complexity in creating the semantic space
• Effectiveness – usually lower than using a translation model

Page 53: Cross-language IR and statistical MT

Using a comparable corpus
• Comparable: news articles published in two newspapers on the same day
• Estimate cross-lingual similarity (less precise than translation)
  • Similar methods to co-occurrence analysis
• Less effective than using a parallel corpus
• To be used only when there is no parallel corpus, or the parallel corpus is not large enough

Page 54: Cross-language IR and statistical MT

Other problems – unknown words
• Proper names (‘Pierre Nadeau’ in Chinese?)
• New technical terms (‘web surfing’ in Chinese at the beginning of the web?)
• Solutions
  • Transliteration
  • Mining the web

Page 55: Cross-language IR and statistical MT

Transliteration
• Translate a name phonetically
  • Generate the pronunciation of the name
  • Transform the sounds into the target language sounds
  • Generate the characters to represent the sounds

English name              Frances Taylor
English phonemes          F R AE N S IH S T EY L ER
Chinese phonemes          f u l ang x i s i t ai l e
Chinese Pinyin            fu lang xi si tai le
Chinese transliteration   弗 朗 西 丝 泰 勒

Page 56: Cross-language IR and statistical MT

Mining the web - 1
• A site is referred to by several pages with different anchor texts in different languages
• Anchor texts as parallel texts
• Useful for the translation of organization names (故宫博物馆 - National Museum)

Anchor texts pointing to http://www.yahoo.com:
雅虎搜索引擎, Yahoo!搜索引擎, 雅虎, 美国雅虎, Yahoo!モバイル, Yahoo の検索エンジン, 美國雅虎, Yahoo search engine, 雅虎 WWW 站, Yahoo!

Page 57: Cross-language IR and statistical MT

Mining the web - 2
• Some “monolingual” texts may contain translations
  • 现在网上最热的词条,就是这个“ Barack Obama” ,巴拉克 · 欧巴马(巴拉克 · 奥巴马)。
  • 这就诞生了潜语义索引 (Latent Semantic Indexing) …
• Templates:
  • Source-name (Target-name)
  • Source-name, Target-name
  • …
• May be used to complete the existing dictionary

Page 58: Cross-language IR and statistical MT

Other improvement measures
• Pre- and post-translation expansion
  • Query expansion before the translation
  • Query expansion after the translation
• Fuzzy matching
  • information - información – informazione
  • ~cognate
  • Matching n-grams (e.g. 4-grams)
  • Transformation using rules (konvektio -> convection)
• Combine translations using different tools
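Matching on character n-grams can be illustrated with a few lines of Python. The slides only name the idea, so the details here are assumptions: words are padded with a boundary marker "#", 4-grams are compared as sets, and the Dice coefficient is used as the similarity measure. The function names are hypothetical.

```python
def char_ngrams(word, n=4):
    """Set of character n-grams of a word, with '#' marking the word boundaries."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=4):
    """Dice coefficient over the character n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(ngram_similarity("information", "informazione"))
print(ngram_similarity("information", "window"))
```

Cognates such as information / informazione share their leading 4-grams and score well above unrelated words, so an untranslated query term can still match documents in a related language.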

Page 59: Cross-language IR and statistical MT

Current state
• Effectiveness of CLIR
  • Between European languages ~90-100% monolingual
  • Between European and Asian languages ~80-100%
• A usable quality
• One always needs translation of the retrieved documents
• The need for CLIR is still limited / Tools for CLIR are limited

Page 60: Cross-language IR and statistical MT

Remaining problems
• Current approaches:
  • CLIR = translation + monolingual IR
  • The resources and tools are usually developed for MT, not for CLIR
• Problem of context
  • window 7 update -> fenêtre 7 mise à jour
  • Hints to be used:
    • window 7 (a frequent context)
    • window – update -> window
• Dependent words do not always form a phrase
  • Take into account more flexible dependencies (even proximity)
  • How to train a translation model in such a context?
  • This is not only a problem in CLIR but also in general IR.
  • See the lecture on dependency models

Page 61: Cross-language IR and statistical MT

The future
• CLIR ≠ translation + monolingual IR
• Translation is a step in CLIR
  • For IR
  • Similar to query expansion
  • Can use similar approaches to query expansion

Page 62: Cross-language IR and statistical MT

Summary
• High-quality MT usually offers the best solution
• Well-trained TM based on parallel texts can match or outperform MT (Kraaij et al. 03)
• Dictionary
  • Simple utilization is not good
  • Complex approaches improve quality
• The performance of CLIR is usually lower than that of monolingual IR (between 50% and 100%)
• Filtering a noisy parallel corpus is useful
• Better translation model = better CLIR effectiveness
  • Consider compound terms in TM
  • Ongoing work…

