Morpho-Syntax in Statistical Machine Translation
© 2006 IBM Corporation
Young-Suk Lee
IBM T. J. Watson Research Center
OpenLab 2006
March 30 − April 1, 2006
Reordering Rules: Motivations
IN NNS RB JJS
de transportes especialmente peligrosos
of extremely dangerous transport

DTS NNS JJS JJS
los procedimientos administrativos complejos
ø complex administrative procedures
Outline
• Baseline Phrase Translation System
o Block Acquisition & Decoding
• Acquisition of Reordering Rules
o Base Reordering Rules
o Lexicalized Reordering Rules
• Experimental Results
• Related and Ongoing Work
Baseline Block Acquisition
[Figure: word alignment grid between source words f1 f2 f3 (axis f) and target words e1 e2 e3 e4 e5 e6 (axis e)]
Block (b): a phrase translation pair consisting of a source phrase and a target phrase
Tillmann 2003, EMNLP Proceedings
Extended Block Acquisition Algorithm
o Expansion word list: A list of target words typically aligned to null source words (e.g. I, we, are)
o Extend the target phrase to include an expansion word if it occurs in the neighborhood of a block
I think that we are getting a package
creo que realizamos un paquete
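The extension step can be sketched as follows. This is only an illustration: the function name `extend_target_span`, the half-open span representation, and the reading of "neighborhood" as "directly adjacent to the block" are assumptions, not details given on the slide.

```python
def extend_target_span(target_words, start, end, expansion_words):
    """Grow the half-open target span [start, end) over adjacent expansion words.

    Assumption: "neighborhood" means immediately to the left or right of the block.
    """
    # Extend to the left while the preceding word is an expansion word.
    while start > 0 and target_words[start - 1].lower() in expansion_words:
        start -= 1
    # Extend to the right while the following word is an expansion word.
    while end < len(target_words) and target_words[end].lower() in expansion_words:
        end += 1
    return start, end


target = "I think that we are getting a package".split()
expansion = {"i", "we", "are"}  # expansion word list from the slide, lowercased

# Hypothetical baseline block: "realizamos" <-> "getting" (target span [5, 6)).
start, end = extend_target_span(target, 5, 6, expansion)
extended_phrase = " ".join(target[start:end])
```

Under these assumptions the block for "realizamos" grows from "getting" to "we are getting", which is the kind of extension the example sentence pair suggests.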
Decoding
• Phrase translation models
• Direct model: $p(e \mid f) = \frac{\mathit{count}(e, f)}{\sum_{e'} \mathit{count}(e', f)}$
• Source channel model: $p(f \mid e) = \frac{\mathit{count}(f, e)}{\sum_{f'} \mathit{count}(f', e)}$
• Block unigram model: $p(b) = \frac{\mathit{count}(b)}{\sum_{b'} \mathit{count}(b')}$, where $b = (e, f)$
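All three phrase translation models are relative-frequency estimates over block counts. A minimal sketch, assuming blocks are stored as (e, f) string pairs; the function name `phrase_models` and the toy counts are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter


def phrase_models(block_counts):
    """Return (direct, source_channel, unigram) probability dicts.

    block_counts maps (e, f) phrase pairs to their training counts.
    """
    f_totals = Counter()  # sum over e' of count(e', f)
    e_totals = Counter()  # sum over f' of count(f', e)
    for (e, f), c in block_counts.items():
        f_totals[f] += c
        e_totals[e] += c
    total = sum(block_counts.values())  # sum over b' of count(b')

    direct = {(e, f): c / f_totals[f] for (e, f), c in block_counts.items()}   # p(e|f)
    channel = {(e, f): c / e_totals[e] for (e, f), c in block_counts.items()}  # p(f|e)
    unigram = {b: c / total for b, c in block_counts.items()}                  # p(b)
    return direct, channel, unigram


# Toy counts, invented for illustration only.
counts = Counter({
    ("we are getting", "realizamos"): 3,
    ("we carry out", "realizamos"): 1,
    ("a package", "un paquete"): 4,
})
direct, channel, unigram = phrase_models(counts)
```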
Decoding Cont’d ...
• IBM Model 1 cost per phrase in both directions
• Word & part-of-speech tag trigram language models
• Word-level distortion models applied to blocks
  • Al-Onaizan 2004, DARPA MT Evaluation Workshop
• Word & block count penalty
  • Zens and Ney 2004, HLT Proceedings
Model 1 cost: $-\sum_{j=1}^{m} \log_{10} \max_{1 \le i \le n} p(f_j \mid e_i)$
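One way to compute a Model 1 style cost for a phrase pair is to take, for each source word, the best lexical probability over the target words and sum the negative base-10 logs. The sketch below covers one direction only; the function name `model1_cost`, the `(f, e)`-keyed lexicon, the probability floor for unseen pairs, and the toy probabilities are all assumptions.

```python
import math


def model1_cost(source_words, target_words, lex_prob, floor=1e-10):
    """Sum over source words of -log10 of the best p(f | e) among target words.

    lex_prob maps (f, e) to p(f | e); unseen pairs fall back to `floor`
    (the floor value is an assumption, not taken from the slides).
    """
    cost = 0.0
    for f in source_words:
        best = max(lex_prob.get((f, e), floor) for e in target_words)
        cost -= math.log10(best)
    return cost


# Toy lexical probabilities, invented for illustration only.
lex = {("un", "a"): 0.1, ("paquete", "package"): 0.01, ("un", "package"): 0.001}
cost = model1_cost(["un", "paquete"], ["a", "package"], lex)
```

The cost in the other direction would follow by swapping the roles of the two phrases and using a lexicon of p(e | f).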
Acquisition of Base Reordering Rules
• Viterbi-align a part-of-speech tagged source language corpus with an untagged target language corpus
• Identify each source language part-of-speech tag sequence (monotone increasing in source position) whose corresponding target word sequence is not monotone increasing
• Compute the reordering probabilities of each part-of-speech tag sequence
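The identification step can be sketched on a single aligned sentence. The sketch assumes that each source word is aligned to exactly one target position and that a reordering pattern lists the 1-based source positions in target order; the function names and the fixed window length `n` are hypothetical.

```python
from collections import Counter


def reorder_pattern(target_positions):
    """Source positions (1-based) listed in target order, e.g. (1, 3, 4, 2)."""
    order = sorted(range(len(target_positions)), key=lambda j: target_positions[j])
    return tuple(j + 1 for j in order)


def is_monotone(pattern):
    """True if the pattern keeps the source order (1, 2, ..., n)."""
    return pattern == tuple(range(1, len(pattern) + 1))


def count_patterns(tags, alignment, n, counts):
    """Count (tag sequence, pattern) pairs over every n-word source window.

    alignment[j] is the target position aligned to source word j
    (assumption: one target position per source word).
    """
    for i in range(len(tags) - n + 1):
        window = alignment[i:i + n]
        if len(set(window)) < n:
            continue  # skip windows where two source words share a target word
        counts[(tuple(tags[i:i + n]), reorder_pattern(window))] += 1
    return counts


# "de transportes especialmente peligrosos" -> "of extremely dangerous transport"
tags = ["IN", "NNS", "RB", "JJS"]
alignment = [0, 3, 1, 2]  # de->of, transportes->transport, especialmente->extremely, peligrosos->dangerous
counts = count_patterns(tags, alignment, 4, Counter())
```

Monotone windows are counted as well, since they are needed in the denominator when the counts are normalized into probabilities.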
Reordering Probability Computation
$p(\mathit{reorder}_i \mid \mathit{tag}_k) = \frac{\mathit{count}(\mathit{reorder}_i, \mathit{tag}_k)}{\sum_{\mathit{reorder}'} \mathit{count}(\mathit{reorder}', \mathit{tag}_k)}$
reorder'   p(reorder' | IN1 NNS2 RB3 JJS4)   p(reorder' | DTS1 NNS2 JJS3 JJS4)
1 2 3 4    0.179                             0.107
1 2 4 3    0.047                             0.059
1 3 2 4    0.094                             0.239
1 3 4 2    0.560                             0.109
1 4 2 3    0.072                             0.111
1 4 3 2    0.048                             0.375
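The probability computation is a per-tag-sequence normalization of pattern counts. In the sketch below the counts are hypothetical: they were chosen so that the relative frequencies match the values reported for IN1 NNS2 RB3 JJS4, and the real training counts are not given in the slides. The function name `reorder_probs` is likewise an assumption.

```python
from collections import defaultdict


def reorder_probs(counts):
    """Normalize (tag sequence, pattern) counts into p(pattern | tag sequence)."""
    totals = defaultdict(int)
    for (tag_seq, _pattern), c in counts.items():
        totals[tag_seq] += c
    return {(tag_seq, pattern): c / totals[tag_seq]
            for (tag_seq, pattern), c in counts.items()}


tag_seq = ("IN", "NNS", "RB", "JJS")
# Hypothetical counts (per 1000 occurrences), not actual corpus counts.
counts = {
    (tag_seq, (1, 2, 3, 4)): 179,
    (tag_seq, (1, 2, 4, 3)): 47,
    (tag_seq, (1, 3, 2, 4)): 94,
    (tag_seq, (1, 3, 4, 2)): 560,
    (tag_seq, (1, 4, 2, 3)): 72,
    (tag_seq, (1, 4, 3, 2)): 48,
}
probs = reorder_probs(counts)
```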
One Best Reordering Rules
$p(\mathit{reorder}_f \mid \mathit{tag}) > p(\mathit{reorder}_s \mid \mathit{tag}) + \alpha$
IN1 NNS2 RB3 JJS4 → IN1 RB3 JJS4 NNS2
DTS1 NNS2 JJS3 JJS4 → DTS1 JJS4 JJS3 NNS2
DT1 NN2 JJ3 IN4 → DT1 JJ3 NN2 IN4
NNS1 JJS2 CC3 JJS4 → JJS2 CC3 JJS4 NNS1
DT1 NN2 CC3 NN4 JJ5 → DT1 JJ5 NN2 CC3 NN4
Lexicalization of Exceptions
The Fund must of course continue to serve its purpose and pursue research into1 varieties2 more3 suited4 to demand and causing as little harm as possible .
El Fondo, por supuesto, debe continuar cumpliendo con su misión de investigación sobre la búsqueda de1/IN variedades2/NNS más3/RB adaptadas4/JJS a la demanda y lo menos nocivas posible,
IN1 NNS2 RB3 JJS4[~adaptadas] → IN1 RB3 JJS4 NNS2

the operational support of the1 Secretary2 General3 of4 the Council
el apoyo operativo de la1/DT Secretaría2/NN General3/JJ del4/IN Consejo
DT1 NN2 JJ3[~General] IN4 → DT1 JJ3 NN2 IN4
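A rule with a lexical exception, written with the `[~word]` notation above, can be applied to a tagged source sentence roughly as follows. The function name `apply_rule`, the `key_index` argument marking the lexicalized position, and the left-to-right greedy matching are assumptions made for this sketch.

```python
def apply_rule(tagged, rule_tags, pattern, key_index, exceptions):
    """Reorder every match of rule_tags unless its key word is an exception.

    tagged: list of (word, tag); pattern: 1-based source positions in output order;
    key_index: 0-based position of the tag carrying the [~word] exception.
    """
    out, i, n = [], 0, len(rule_tags)
    while i < len(tagged):
        window = tagged[i:i + n]
        matches = [tag for _, tag in window] == list(rule_tags)
        if matches and window[key_index][0] not in exceptions:
            out.extend(window[p - 1] for p in pattern)  # emit the window reordered
            i += n
        else:
            out.append(tagged[i])  # no match or exception: keep source order
            i += 1
    return out


rule_tags, pattern = ("IN", "NNS", "RB", "JJS"), (1, 3, 4, 2)
plain = [("de", "IN"), ("transportes", "NNS"), ("especialmente", "RB"), ("peligrosos", "JJS")]
blocked = [("de", "IN"), ("variedades", "NNS"), ("más", "RB"), ("adaptadas", "JJS")]

reordered = apply_rule(plain, rule_tags, pattern, 3, {"adaptadas"})
unchanged = apply_rule(blocked, rule_tags, pattern, 3, {"adaptadas"})
```

The first phrase is reordered to match English word order, while the phrase containing "adaptadas" is left in source order, mirroring the "varieties more suited" example.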
Lexicalized Reordering Rules
• Identify the key part-of-speech tag in the base reordering rules
• Replace the key part-of-speech tag with the corresponding word
o DT NN JJ IN → DT NN General IN
• Compute reordering probabilities of lexicalized part-of-speech tag sequences
• Exception word list
o If the reordering pattern with the highest probability is monotone increasing, select the word in the pattern as an exception
Lexicalized Reordering Probabilities
reorder'   p(reorder' | DT1 NN2 General3 IN4)
1 2 3 4    0.454
1 2 4 3    0.021
1 3 2 4    0.201
1 3 4 2    0.012
1 4 2 3    0.194
1 4 3 2    0.098
4 1 2 3    0.007
4 1 3 2    0.013
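Lexicalization and exception detection can be sketched with the probabilities reported for DT1 NN2 General3 IN4. The helper names `lexicalize` and `is_exception` are hypothetical, and how the key tag is chosen is not specified in the slides, so the key position is passed in explicitly.

```python
def lexicalize(tags, words, key_index):
    """Replace the key part-of-speech tag with the word at the same position."""
    out = list(tags)
    out[key_index] = words[key_index]
    return tuple(out)


def is_exception(pattern_probs):
    """True if the most probable pattern is monotone increasing (source order kept)."""
    best = max(pattern_probs, key=pattern_probs.get)
    return best == tuple(range(1, len(best) + 1))


lexicalized = lexicalize(("DT", "NN", "JJ", "IN"),
                         ("la", "Secretaría", "General", "del"), key_index=2)

# Probabilities reported in the deck for DT1 NN2 General3 IN4.
general_probs = {
    (1, 2, 3, 4): 0.454, (1, 2, 4, 3): 0.021, (1, 3, 2, 4): 0.201, (1, 3, 4, 2): 0.012,
    (1, 4, 2, 3): 0.194, (1, 4, 3, 2): 0.098, (4, 1, 2, 3): 0.007, (4, 1, 3, 2): 0.013,
}
general_is_exception = is_exception(general_probs)
```

Because the monotone pattern 1 2 3 4 has the highest probability, "General" would be placed on the exception word list for the base rule DT1 NN2 JJ3 IN4 → DT1 JJ3 NN2 IN4.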
Performance Evaluations
• Translation model training corpus
• ~1.3 M sentence pairs from EPPS distributed by RWTH
• Language model training corpus
• EPPS English corpus: ~35 M words
• UN parallel corpus English (LDC94T4A): ~45 M words
• English gigaword second edition (LDC2005T12): ~2.5 B words
Evaluation Corpus Statistics
Data Sets           # of Segments   Avg. Segment Length
EPPS Dev06 FTE      699             35 words/segment
EPPS Dev06 VHT      792             31 words/segment
CORTES Dev06 FTE    753             37 words/segment
CORTES Dev06 VHT    920             31 words/segment
Lexicalized Reordering Rules: Impact
[Figure: bar chart of BLEU r2n4c scores (axis range 0.4 to 0.56) on EPPS Dev06 FTE, EPPS Dev06 VHT, CORTES Dev06 FTE, and CORTES Dev06 VHT. Data labels, in extraction order: 0.5322, 0.5123, 0.4439, 0.4186; 0.5434, 0.5204, 0.4507, 0.4242.]
Base vs. Lexicalized Reordering Rules
[Figure: bar chart of BLEU r2n4c scores (axis range 0.4 to 0.55) on EPPS Dev06 FTE, EPPS Dev06 VHT, CORTES Dev06 FTE, and CORTES Dev06 VHT. Data labels, in extraction order: 0.5123, 0.4439, 0.4186, 0.5413, 0.5205, 0.4452, 0.4196, 0.5434, 0.4507, 0.4242, 0.5322, 0.5204.]
Related Work
• N-best Reordering in Arabic-to-English Translation
o Statistically significant performance improvement by applying local reordering to noun-phrase-parsed Arabic
o IBM Site Report: DARPA MT Evaluation Workshop 2004
• Morphological Analysis for Statistical Machine Translation
o Identify one-to-one word correspondences between Arabic and English to improve word-to-word translation quality
o Companion Volume of HLT-NAACL 2004, pages 57−60
• Local Reordering for Spanish-English Translations
o Presentation at TC-STAR 2005 Evaluation Workshop
o April 21-22, 2005, Trento, Italy
Ongoing Work
• Non-local reordering models
• [Se ha puesto a prueba]VP [su voluntad]NP →
[Its will]NP [has been put to the test]VP
• Todas sus Señorías firmaron [con los electores]PP [un contrato]NP →
All your ladies and gentlemen signed [a contract]NP [with the electors]PP
• Integration of reordering models into the decoder