Open Source Toolkit for Statistical Machine Translation:
Factored Translation Models and Lattice Decoding
Final Presentation
Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Chris Callison-Burch, Ondrej Bojar, Brooke Cowan,
Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Evan Herbst, Christine Moran
17 August 2006
Philipp Koehn et al., JHU 2006 WS on MT Final Presentation 17 August 2006
Schedule
• First session: Overview and toolkit development
– Factored models and confusion network decoding (Koehn, Federico)
– Moses toolkit (Hoang, Dyer, Herbst, Callison-Burch, Bertoldi)
• Second session: Experiments
– Experiments in small data settings (Shen, Bojar, Moran, Cowan)
– Factored models for morphologically rich languages (Dyer, Koehn, Cowan, Constantin)
– Confusion network experiments (Zens)
Accomplishments
• Open source toolkit
– advances state-of-the-art of statistical machine translation models
– best performance on the European Parliament task
– competitive on IWSLT and TC-Star
• Factored models
– outperform traditional phrase-based models
– framework for a wide range of models
– integrated approach to morphology and syntax
• Confusion networks
– exploit ambiguous input and outperform 1-best
– enable integrated approach to speech translation
Phrase-Based Translation
er geht ja nicht nach hause
he does not go home
• Foreign input is segmented into phrases
– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English, phrases are reordered
• Log-linear model: many feature functions h_i(e, f) with weights λ_i, combined into an overall score ∑_i λ_i h_i(e, f) → easy to extend
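As an illustration (not Moses code), the log-linear combination is just a weighted sum of feature values; the feature values and weights below are invented numbers:

```python
import math

def loglinear_score(features, weights):
    """Overall score: sum_i lambda_i * h_i(e, f)."""
    return sum(lam * h for lam, h in zip(weights, features))

# Invented feature values for one sentence pair:
# [log p_TM, log p_LM, distortion penalty, word penalty]
h = [math.log(0.25), math.log(0.1), -2.0, -4.0]
weights = [1.0, 0.8, 0.3, 0.1]
score = loglinear_score(h, weights)
```

Adding a new feature function only means appending one more value and one more weight, which is what makes the framework easy to extend.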
Translation
• Task: translate this sentence from German into English
er geht ja nicht nach hause
Translation step 1
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er → he
• Pick phrase in input, translate
Translation step 2
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er ja nicht → he does not
• Pick phrase in input, translate
– words may be picked out of sequence (reordering)
– phrases may have multiple words: many-to-many translation
Translation step 3
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er geht ja nicht → he does not go
• Pick phrase in input, translate
Translation step 4
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er geht ja nicht nach hause → he does not go home
• Pick phrase in input, translate
Translation options
[Figure: table of translation options for "er geht ja nicht nach hause", e.g. er → he / it; geht → goes / go / is; ja → yes / of course; nicht → not / do not / is not; nach → after / to / according to; hause → house / home / chamber; plus multi-word options such as ja nicht → does not, nach hause → home / return home]
• Phrase translation tables provide many translation options
• Learned from automatically word-aligned corpora
Translation options
• The machine translation decoder does not know the right answer
→ Search problem solved by heuristic beam search
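The beam search can be sketched as a simplified stack decoder in the spirit of these slides (not the Moses implementation): hypotheses are binned by the number of covered source words, recombined by coverage set, and pruned per stack. The toy phrase table and log-probabilities below are invented for illustration:

```python
def beam_decode(source, phrases, beam_size=10):
    """Toy stack decoding: stacks[k] holds hypotheses covering k source words,
    each keyed by its coverage set for hypothesis recombination."""
    n = len(source)
    stacks = [dict() for _ in range(n + 1)]   # coverage -> (score, output)
    stacks[0][frozenset()] = (0.0, ())
    for k in range(n):
        # prune: keep only the best `beam_size` hypotheses in this stack
        best = sorted(stacks[k].items(), key=lambda kv: -kv[1][0])[:beam_size]
        for cov, (score, out) in best:
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = tuple(source[i:j])
                    # skip unknown phrases and already-covered positions
                    if span not in phrases or any(p in cov for p in range(i, j)):
                        continue
                    for trans, logprob in phrases[span]:
                        ncov = cov | frozenset(range(i, j))
                        cand = (score + logprob, out + (trans,))
                        stack = stacks[len(ncov)]
                        # recombination: keep the best hypothesis per coverage
                        if ncov not in stack or cand[0] > stack[ncov][0]:
                            stack[ncov] = cand
    return stacks[n].get(frozenset(range(n)))

# Hypothetical phrase table with made-up log-probabilities:
PHRASES = {
    ("er",): [("he", -0.2), ("it", -0.9)],
    ("geht",): [("goes", -0.3)],
    ("ja", "nicht"): [("does not", -0.4)],
    ("geht", "ja", "nicht"): [("does not go", -0.5)],
    ("nach", "hause"): [("home", -0.3)],
}
```

A real decoder would also score the language model, distortion, and future cost; this sketch keeps only the phrase translation score to show the stack and recombination mechanics.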
Decoding process: precompute translation options

er geht ja nicht nach hause
Decoding process: start with initial hypothesis

er geht ja nicht nach hause
Decoding process: hypothesis expansion

[Figure: initial hypothesis for "er geht ja nicht nach hause" expanded with a first option, "are"]
Decoding process: hypothesis expansion

[Figure: alternative first-phrase expansions "are", "it", "he"]
Decoding process: hypothesis expansion

[Figure: hypotheses extended further with options such as goes, does not, yes, go, to, home]
Decoding process: find best path

[Figure: best-scoring path through the expanded hypothesis graph]
Statistical machine translation today
• Best performing methods based on surface word phrases
– use mappings of short chunks of text (mostly 1-3 words)
– sophisticated methods for phrase extraction and modeling (EM algorithm, generative models, discriminative training)
• Translation solely based on surface forms of words
– no use of explicit syntactic information
– no use of morphological information
• How can we build richer models?
One motivation: morphology
• Current models treat house and houses as completely different words
– training occurrences of house have no effect on learning the translation of houses
– if we only see house, we do not know how to translate houses
– rich morphology (German, Arabic, Finnish, Czech, ...) → many word forms
• Better approach combines evidence for house and houses
– analyze surface word forms into lemma and morphology, e.g.: Haus +plural
– translate lemma and morphology separately, e.g.: Haus → house; +pl → +pl
– generate target surface form, e.g.: house +pl → houses
Factored translation models
• Factored representation of words
[Figure: input and output words represented as vectors of factors: word, lemma, part-of-speech, morphology, word class, ...]

• Benefits
– generalization, e.g. by translating lemmas, not surface forms
– richer model, e.g. using syntax for reordering, language modeling
Example factored model
• Our example as factored model:
[Figure: factored model mapping input {word, lemma, morphology} to output {word, lemma, morphology}]
• Translation process broken up into mapping steps
– translation of lemma
– translation of morphology
– generation of word from lemma and morphology
Expansion of input phrase
• Probabilistic mapping steps
– translation step: lemma → lemma
  haus → house, home, chamber, ...
– translation step: morphology → morphology
  single-noun → single-noun, single-pronoun, plural-noun, ...
– generation step: lemma, morphology → word
  house, single-noun → house
  house, plural-noun → houses
• Still a phrase model
– translation steps may map phrases
  nach hause → home, return home
– generation steps operate on single words
– traditional phrase models are a special case: single-factor models
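A minimal sketch of this expansion for the haus example, with invented probabilities and table entries (the real model combines scores log-linearly rather than multiplying raw probabilities):

```python
# Hypothetical mapping tables (illustrative entries only)
LEMMA_T = {"haus": [("house", 0.7), ("home", 0.2), ("chamber", 0.1)]}
MORPH_T = {"sg-noun": [("sg-noun", 0.8), ("pl-noun", 0.2)]}
GEN = {("house", "sg-noun"): "house", ("house", "pl-noun"): "houses",
       ("home", "sg-noun"): "home", ("chamber", "sg-noun"): "chamber"}

def expand(lemma, morph):
    """Expand one factored input word into scored surface-form options:
    translate lemma and morphology separately, then generate the word."""
    options = []
    for tl, p1 in LEMMA_T.get(lemma, []):
        for tm, p2 in MORPH_T.get(morph, []):
            surface = GEN.get((tl, tm))
            if surface is not None:      # the generation step may rule tuples out
                options.append((surface, p1 * p2))
    return sorted(options, key=lambda x: -x[1])
```

Note how houses is produced even though only Haus (singular) entries might exist at the surface level: the generalization comes from translating lemma and morphology separately.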
Computational complexity of mapping steps
• Number of factored expansions may grow exponentially
• Key insights to reduce complexity for a given input sentence:
– expansions can be pre-computed and stored as translation options
– translation options can be pruned early
• Future work: problems with more complex models need to be addressed
– we had problems using some models with three or more steps
– see student proposals (Hoang, Dyer) for solutions
Spoken Language Translationwith Confusion Networks
Marcello Federico, Nicola Bertoldi, Wade Shen, Richard Zens
August 17, 2006
Marcello Federico, ITC-irst Trento Project Summary August 17, 2006
Outline
• Spoken language translation
• Approaches to SLT
• Confusion network decoding
• Computational issues
• Implementation in Moses
• Language model interface
• Other applications of confusion networks
Spoken Language Translation
Translation from speech input is likely more difficult than translation from text input:
• many styles and genres: formal read speech, unplanned speeches, interviews, spontaneous conversations, ...
• less controlled language: relaxed syntax, spontaneous speech phenomena
• automatic speech recognition is prone to errors: possible corruption of syntax and meaning
This work addresses methods to improve the performance of spoken language translation by better integrating speech recognition and machine translation models.
Integrating Speech Recognition and Translation
• Correlation between transcription word-error-rate and translation quality:
[Plot: BLEU score (y-axis, 38.5 to 42.5) vs. WER of transcriptions (x-axis, 14 to 21): translation quality degrades as transcription WER increases]
• Better transcriptions may have been considered during ASR decoding but discarded due to lower scores
• Potential for improving translation quality by exploiting more transcription hypotheses generated during ASR.
Statistical Spoken Language Translation
• Let o be the spoken input in the foreign language
• let F(o) be a set of possible transcriptions of o
Goal: find the best English translation through the approximate criterion:

e∗ = arg max_e Pr(e | o) ≈ arg max_e max_{f ∈ F(o)} Pr(e, f | o)
Pr(e, f | o) is computed with a log-linear model incorporating:
• acoustic features, i.e. probs that some foreign words are in the input
• linguistic features, i.e. probs of foreign and English sentences
• translation features, i.e. probs of foreign phrases into English
• alignment features, i.e. probs for word re-ordering
ASR Word Graph
A very general set of transcriptions F(o) can be represented by a word-graph:
• directly computed from the ASR word lattice (e.g. HTK format, lattice tool)
• provides a good representation of all hypotheses analyzed by the ASR system
• arcs are labeled with words, acoustic and language model probabilities
• paths correspond to transcription hypotheses for which probabilities can be computed
[Figure: example ASR word graph]
Approaches to Spoken Language Translation
The previous statistical framework includes several alternative implementations:
• 1-best translation: translate only the most probable hypothesis in the word graph
– pros: very efficient
– cons: no potential to recover from recognition errors in the 1-best transcription
• N-best translation: translate only the N most probable hypotheses in the word graph
– pros: can exploit more accurate transcriptions in the word graph
– cons: N must be large in order to include good transcriptions, and decoding time increases linearly with N
Approaches to Spoken Language Translation
• Transducer: compose the word graph with a translation FSN and apply a transducer algorithm
– pros: straightforward method that works on the full word graph
– cons: computationally prohibitive with large-vocabulary tasks and long-range word re-ordering
• Confusion network: translate a suitable approximation of the word graph
– pros: effectively explores all paths in the word graph, with no problems in re-ordering
– cons: can only exploit limited information in the word graph
Confusion Network (Mangu 1999)
A confusion network approximates a word graph with a linear network, s.t.:
• arcs are labeled with words or with the empty word (ε-word)
• arcs are weighted with word posterior probabilities
• paths are a superset of those in the word graph
[Figure: example confusion network: a linear sequence of columns whose arcs carry words (or ε) and posterior probabilities]
CNs can be conveniently represented as sequences of columns of different depths.
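Such a column sequence is easy to hold in memory: a minimal sketch, with posteriors loosely based on the example figures (here "eps" marks the empty word):

```python
# A confusion network as a list of columns; each column maps a word to its
# posterior probability. Probabilities in a column sum to roughly 1.
cn = [
    {"era": 0.997, "e": 0.002, "eps": 0.001},
    {"cancello": 0.995, "vacanza": 0.004, "eps": 0.001},
    {"eps": 0.999, "la": 0.001},
]

def path_posterior(cn, words):
    """Posterior of one transcription hypothesis (one word per column)."""
    p = 1.0
    for col, w in zip(cn, words):
        p *= col.get(w, 0.0)
    return p
```

Because the network is linear, every hypothesis picks exactly one arc per column, which is what makes CN decoding so much simpler than full word-graph decoding.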
Confusion Network Decoding:
Extension of basic phrase-based decoding step:
• cover some not yet covered consecutive columns (span)
• retrieve phrase-translations for all paths inside the columns
• compute translation, distortion and target language model scores
Example. Coverage set: 01110...  Path: cancello d’

  era 0.997   cancello 0.995   ε  0.999   di   0.615   imbarco 0.999
  e   0.002   vacanza  0.004   la 0.001   d’   0.376   bar     0.001
  ε   0.001   ε        0.002              all’ 0.005
                                          l’   0.002
                                          ε    0.001
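Enumerating the source phrases spanned by a set of consecutive columns amounts to a cartesian product over those columns, dropping ε-arcs; a sketch of the idea (illustrative only, using "eps" for the empty word):

```python
from itertools import product

def span_paths(cn, i, j, prune=0.0):
    """All word sequences (eps removed) over columns i..j-1 with posteriors.
    Sequences reachable via different eps placements have their mass merged."""
    paths = {}
    for combo in product(*(col.items() for col in cn[i:j])):
        words = tuple(w for w, _ in combo if w != "eps")
        p = 1.0
        for _, q in combo:
            p *= q
        if p > prune:
            paths[words] = paths.get(words, 0.0) + p
    return paths
```

The number of combinations grows exponentially with the span length, which is exactly the computational issue addressed on the next slide.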
Confusion Network Decoding
Computational issues:
• Number of paths grows exponentially with span length
• Implies look-up of translations for a huge number of source phrases
• Factored models require considering joint translations over all factors (tuples):
– cartesian product of all translations of each single factor
Solutions implemented in Moses:
• Source entries of the phrase-table are stored with prefix-trees
• Translations of all possible coverage sets are pre-fetched from disk
• Efficiency achieved by incrementally pre-fetching over the span length
• Phrase translations over all factors are extracted independently, then translation tuples are generated and pruned, adding one factor at a time
Once translation tuples are generated, usual decoding applies.
Implementation into Moses
• Input format: CN input can be rather large, so it is better to put one word position per line:

Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0

Each line represents the alternatives at one word position with their probabilities.
• Factored confusion networks: alternatives are over the full factor space:
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
Note: a confusion network can be projected onto single factors.
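A parser for this one-column-per-line format takes only a few lines; this is a sketch of the format, not the Moses reader:

```python
def parse_cn(text):
    """Parse CN input: one column per line, alternating 'word prob' tokens."""
    cn = []
    for line in text.strip().splitlines():
        toks = line.split()
        # pair up word/probability tokens into one column dictionary
        col = {toks[k]: float(toks[k + 1]) for k in range(0, len(toks), 2)}
        cn.append(col)
    return cn

sample = """Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0"""
```

The factored variant is parsed the same way; each "word" token is then itself split on `|` into its factors.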
Implementation into Moses
Decoding CN with Factored Models
• at each step of the search, a portion of the CN is explored, e.g.

... ...
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
... ... ... ...

... and translations are looked up for each factor.
Features:
• Efficiency by pre-filtering possible translations for each factor
• Confusion network handling is completely hidden from the rest of the decoder.
Other Applications of Confusion Networks
Translation tasks with ambiguous input:
• linguistic annotation for factored models
– avoid hard decisions by linguistic tools; instead provide alternative annotations with their respective scores
– e.g. for particularly ambiguous part-of-speech tags
• insertion of punctuation marks missing in the input
– model all possible insertions of punctuation marks in the input
• translation of input similar to that produced by speech recognition
– e.g. OCR output for optical text translation
• ...
Language Model Interface
• Features
– compact binary format for very large language models
– quantization of probabilities (8 bits)
– fast loading of the language model from disk
– loading of n-grams on demand
• Comparison with SRI LM Toolkit
– memory: 50% less with large quantized models
– speed: 10% slower in decoding with a 3-gram LM
• Recent work and improvements
– speed-up by directly storing log-probs
– addition of a memory cache on the n-gram internal data structure
– analysis of LM score computations by the search algorithm
– caching of probabilities and LM states
(the search algorithm requests the same probabilities many times)
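The 8-bit quantization idea can be illustrated with uniform binning of log-probabilities into 256 levels; the actual implementation may use a different codebook, so this is only a sketch:

```python
def quantize(logprobs, bits=8):
    """Map each log-prob to one of 2**bits codebook levels (uniform binning).
    Returns the per-value codes and the shared codebook of level centers."""
    levels = 2 ** bits
    lo, hi = min(logprobs), max(logprobs)
    step = (hi - lo) / (levels - 1) or 1.0   # avoid zero step if all equal
    codes = [round((x - lo) / step) for x in logprobs]
    centers = [lo + c * step for c in range(levels)]
    return codes, centers
```

Each stored value shrinks to one byte, at the cost of a bounded rounding error of at most half a bin width, which is the memory/accuracy trade-off behind the 50% figure above.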
Requests of N-grams by Decoder
Requests of 3-gram probabilities during decoding of a single sentence: about 1.6M requests involving about 120K distinct 3-grams.
Conclusions
Implementation work
• Efficient on-demand pre-fetching of phrase translations
• Tuning of parameters for confusion network decoding
• Language model interface and pre-fetching of n-grams
Development of state-of-the-art baselines for SLT
• IWSLT BTEC Chinese-English SLT
– submissions to the IWSLT 2006 evaluation
• EPPS Spanish-English SLT
– performance comparable with the best TC-STAR systems
Achievements
• SLT decoder more efficient than current implementations by IRST and MIT/LL
• works with large-data tasks and large confusion networks
• works with factored confusion networks
Engineering Results
JHU SWS 2006, Aug 17, 2006
Open software, so what?

State of the world, June 2006
“Black box” decoder (Pharaoh) widely used
20+ citations in this year’s ACL Proceedings alone
Ubiquitous baseline system
But… it is difficult to extend
New features limited to what can be expressed in the existing phrase-table format (source, target, feature vector)
Many interesting projects require reinventing the wheel just to change one spoke
Software Goals

Accessibility
Easy to maintain
Flexibility
Easy for distributed team development
Portability
Accessibility

Easy to read
“Nothing should be a black box”
Descriptive names
Uniform coding style
Available immediately
Source code on Sourceforge.net
Cross-platform compatibility
Windows, Linux, MacOS X, 64-bit OS
void Load(const std::string &fileName, FactorCollection &factorCollection, FactorType factorType, float weight, size_t nGramOrder);
Easy to Maintain

Modular code
Team development
Object-oriented framework
Integrated documentation framework
Using Doxygen
Easy-to-maintain Wiki documentation on the Web
Documentation
Extensibility

Open architecture designed for extensibility
Architecture matches theoretical descriptions of phrase-based MT models
Short ramp-up time for researchers familiar with SMT but not with any particular decoder
Feature function evaluation decoupled from search algorithms
Facilitates experimentation with new classes of feature functions
Modular design
Framework to allow different replacements of all parts of the decoder
Multiple implementations of translation tables
Language models
Different types of models
Case Study: Lexicalized Reordering

Very successful model, but implementation not possible with a “black box” decoder
With Moses, anyone with an idea can try it
Adding support for LR models to moses required code changes in four (relatively logical) locations:
Feature-function base class (ScoreProducer) extended, logic for feature value computation implemented
Enable the model based on configuration
Call to evaluate the feature function when extending a hypothesis
Add the feature values to n-best list output for tuning algorithms
Regression Testing

Pharaoh scores used as baseline, updated as models changed (for example, hypothesis recombination based on LM state rather than n-gram order)
Detailed logging enables strict test coverage for all model types
Regression test suite was run approximately 3000 times during the workshop
Accomplishments

Code contributions from every member of the team
Performance improvements:
Day 1: 5.01 sec/sentence avg decoding time
Today: 1.43 sec/sentence avg decoding time
Summary

State of the world, August 2006
“White box” multi-factored decoder (Moses) available
Drop-in replacement for Pharaoh
Further experimentation and development anticipated at: Aachen, Charles University, Cornell, Edinburgh, IRST, MIT, Lincoln Labs, UMD… and many more.
Code size
10,000 lines at the beginning of the workshop
16,000 lines now
System tuning
• Log Linear Model
e∗ = arg max_e Pr(e | f) = arg max_e p_λ(e | f) = arg max_e ∑_i λ_i h_i(e, f)    (1)
• real-valued feature functions:
– model a specific component of the translation process: fluency, adequacy, reordering, ...
– statistical models are estimated on specific training data
• feature weights:
– balance ranges of feature scores
– weight the importance of features
– tuned through Minimum Error Training (MET)
Nicola Bertoldi, ITC-irst Minimum Error Training August 17, 2006
Minimum Error Training
• automatic procedure to optimize feature weights
• minimization of translation errors
• development set (f , ref)
• automatic error function Err(e; ref): (100-BLEU) score
e∗ = e∗(λ) = arg max_e p_λ(e | f)    (2)

λ∗ = arg min_λ Err(e∗(λ); ref)    (3)
• Err(e) is not mathematically well-behaved ⇒ no exact solution
• approximate iterative algorithms: gradient descent, downhill simplex
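The inner optimization can be illustrated by brute-force search over a discretized weight grid, rescoring a fixed n-best list; this is a toy sketch of the idea, not the workshop implementation:

```python
import itertools

def rescore_1best(nbest, weights):
    """Pick the top hypothesis per sentence under the given weights.
    nbest: per-sentence lists of (translation, feature_vector)."""
    return [max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[1])))[0]
            for hyps in nbest]

def met_grid(nbest, refs, error_fn, grid):
    """Search a discretized weight grid for the lowest error on the dev set."""
    best = None
    for weights in itertools.product(*grid):
        err = error_fn(rescore_1best(nbest, weights), refs)
        if best is None or err < best[0]:
            best = (err, weights)
    return best
```

The outer loop would then re-decode with the winning weights, append the new n-best lists, and repeat until the translations stop changing.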
CLSP-WS solution for MET
[Diagram: outer loop: Moses decodes the dev input with the current weights and produces n-best lists; the Extractor collects features. Inner loop: the Optimizer and Scorer tune weights on the n-best lists against the reference until convergence, yielding the optimal weights.]
• outer loop:
– decoding with the current lambdas
– generation of n-best translations
– addition to previously collected translations
• inner loop:
– optimization over the n-best lists
– decoder and “random” weights as initial points
• optimizer:
– iterative optimization of single weights
– discretization of the r-dimensional weight space
MET vs. size of nbest list
• German-English EuroParl task
• tuning on dev set of 2000 sentences
• evaluation on test set of 2000 sentences
• convergence in 5-6 iterations:
– good: faster outer loop
• no impact of n-best size:
– good: faster inner loop
[Plot: BLEU (18 to 26) vs. MET iteration (0 to 14) for n-best sizes 100, 200, 400, 800; all curves converge to similar scores]
MET vs. size of development set
• extraction of 4 subsets: 100, 200, 400, 800 sentences
• larger dev set:
– more stable results
– fewer iterations
– better results
• bad:
– overfitting
– a large dev set slows the outer loop (decoding)
[Plot: BLEU (0 to 30) vs. MET iteration (0 to 18) for dev sets of 100, 200, 400, 800 and 2000 sentences]
                 iterations   BLEU
100 sentences        18       24.3
200 sentences        15       25.1
400 sentences        16       24.6
800 sentences        14       24.9
2000 sentences        9       25.3
MET vs. optimization algorithm
• task: Spanish-English EPPS, speech input
• dev set of 2643 Confusion Networks, test set of 1073 CNs
• CLSP-WS algorithm vs. downhill simplex (RWTH)
                    iterations   ∆ BLEU (dev)   ∆ BLEU (test)
CLSP-WS algorithm        4           +1.0            +0.4
downhill simplex         7           +2.9            +3.4
• mismatch between internal score of CLSP-WS algorithm and official score
• better performance of the downhill simplex algorithm
• post-workshop investigation
Moses in parallel
• effective R&D cycle:
– fast experiments
• computing facilities:
– 6 clusters, 200 machines
• parallelization of translation
• ‘split and merge’ technique
• translation time:
– splitting/merging ≈ constant, negligible
– access to cluster related to cluster load
– loading data ≈ constant
– decoding ∝ input length
[Diagram: the Splitter divides the source input into parts 1..N; each part is translated by a Moses instance on a (remote) cluster machine; the Merger joins translations 1..N into the final translation]
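The split-and-merge idea can be sketched sequentially (in practice each part would be dispatched to a different cluster machine rather than translated in a loop):

```python
def split_merge(sentences, n_parts, translate_part):
    """Split the input into parts, translate each part, merge in order.
    translate_part stands in for a Moses invocation on one cluster node."""
    size = -(-len(sentences) // n_parts)   # ceiling division
    parts = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    results = [translate_part(p) for p in parts]   # dispatched in parallel in reality
    return [t for part in results for t in part]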
Nicola Bertoldi, ITC-irst Moses in parallel August 17, 2006
Moses in parallel
• Spanish-English EuroParl task
• CLSP cluster, 18 machines
• no control of cluster load
                 standard   1 job   5 jobs   10 jobs   20 jobs
10 sentences        6.3      13.1     9.0       9.0        –
100 sentences       5.2       5.6     3.0       1.7       1.7
1000 sentences      6.3       6.5     2.0       1.6       1.1

Average time per sentence (seconds).
Decoder Output Analysis
Evan Herbst
8 / 17 / 06
Evan Herbst Decoder Output Analysis 8 / 17 / 06
Measurables
• Difficulty
– perplexity
• Error
– WER
– PWER
– BLEU
– confidence intervals
• Significance
– t-test
– sign test
Definition: Perplexity
Measure the likelihood of a corpus given a model (e.g. a language model):

PX = 2^(−(1/N) ∑_i log2 p_LM(w_i)),   w_i words
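A direct transcription of the formula, assuming the inputs are per-word model probabilities:

```python
import math

def perplexity(word_probs):
    """PX = 2 ** (-(1/N) * sum_i log2 p(w_i)) over per-word probabilities."""
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)
```

For example, a model that assigns every word probability 1/4 has perplexity 4: it is as uncertain as a uniform choice among four words.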
Definition: WER
Word Error Rate: modified edit distance
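A standard dynamic-programming sketch, normalized by reference length:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(ref) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(hyp)][len(ref)] / len(ref)
```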
Definition: PWER
Position-independent Word Error Rate: match bags of words
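Matching bags of words ignores ordering entirely. One common formulation (there are variants) counts the words left unmatched after multiset intersection and normalizes by reference length:

```python
from collections import Counter

def pwer(hyp, ref):
    """Position-independent WER: unmatched words after bag-of-words
    matching, divided by reference length (one common formulation)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum((h & r).values())  # multiset intersection
    errors = max(len(hyp.split()), len(ref.split())) - matched
    return errors / len(ref.split())

# Word order does not matter, so only "goes" counts as an error here:
print(pwer("home go he", "he goes home"))
```

Since PWER ignores reordering while WER does not, the PWER/WER ratio in the table below is a rough indicator of how much reordering error the system makes.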
Definition: BLEU
BiLingual Evaluation Understudy: n-gram precision and length comparison
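A single-reference sketch of the metric: the geometric mean of modified (clipped) n-gram precisions for n = 1..4, multiplied by a brevity penalty for hypotheses shorter than the reference. Real BLEU is computed over a whole corpus with multiple references; this per-sentence version is only illustrative.

```python
import math
from collections import Counter

def bleu(hyp, ref, max_n=4):
    """Single-reference, single-sentence BLEU sketch."""
    h, r = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        clipped = sum((h_ngrams & r_ngrams).values())  # clipped counts
        total = max(sum(h_ngrams.values()), 1)
        # Floor at a tiny value so a zero precision does not zero the log.
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(r) / len(h)))  # brevity penalty
    return bp * math.exp(log_prec)

print(round(bleu("he does not go home", "he does not go home"), 2))  # 1.0
```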
Numbers
Dataset: 2000-sentence Europarl subset
                 Pharaoh                                    Moses baseline
                 de-en               en-de                  de-en               en-de
BLEU             .2557               .1775                  .2554               .1776
WER              .5432               .6144                  .5428               .6145
PWER/WER         .865                .940                   .865                .947
Lemma BLEU       .2625               .2170                  .2622               .2180
N-gram prec.     .609/.315/.188/.119 .519/.223/.122/.070    .609/.314/.188/.119 .519/.223/.122/.070
Perplexity       40.97               62.01                  40.94               61.77
Ref. perplexity  68.81               125.29                 68.81               125.29
Inferences
• lemmas vs. surface: morphology
• output vs. reference perplexity: fluency
• PWER/WER ratio: reordering; phrase tables
Tool: Comparison
Tool: Alignment
Suffix Arrays for More Statistics (and Less Disk Space!)
Chris Callison-Burch
August 17, 2006
Phrase Tables in Statistical Machine Translation
• Using longer phrases leads to better translation quality
• Phrase tables can become unwieldy with long phrases
• Problem of large tables is compounded for factored translation models
Phrase Tables in Factored Translation Models
• Translation tables between source and target phrases, POS tags, stems, morphological markers, etc.
• Plus generation tables
• Want longer sequences for factors with smaller tag sets
• Number of tables depends on the number of conditioning variables and on back-off strategies
• Potentially more tables than all pairwise combinations of factors
Ad Hoc Solutions
• Limit length of phrases
• Only extract phrases for test data
• Make unnecessary independence assumptions
Proposed Solution: Intelligent Data Structure
• Uses less memory than table-based data structures
• Allows us to condition on whatever factors we want and to back off easily
• Retrieve translation / generation probabilities for arbitrarily long sequences
• Suffix arrays to index parallel corpus
How Suffix Arrays Work
Corpus (word indices 0–9): Spain declined to confirm that Spain declined to aid Morocco

Initialized, unsorted suffix array: s[i] points to the suffix starting at word i, e.g.
s[0] = "Spain declined to confirm that Spain declined to aid Morocco", …, s[9] = "Morocco".
Alphabetically Sorted
Sorted suffix array: [8, 3, 6, 1, 9, 5, 0, 4, 7, 2], i.e. the suffixes in alphabetical order:
s[0] → aid Morocco
s[1] → confirm that Spain declined to aid Morocco
s[2] → declined to aid Morocco
s[3] → declined to confirm that Spain declined to aid Morocco
s[4] → Morocco
s[5] → Spain declined to aid Morocco
s[6] → Spain declined to confirm that Spain declined to aid Morocco
s[7] → that Spain declined to aid Morocco
s[8] → to aid Morocco
s[9] → to confirm that Spain declined to aid Morocco
(Reasonably) Fast Find
[Same sorted suffix array as above: a phrase is located with two binary searches over s[0]…s[9], giving all its corpus occurrences in O(log N) comparisons.]
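The construction and lookup can be sketched in a few lines. The sort key is case-folded so the order matches the slide's alphabetical listing; the binary search materializes the sorted prefix list for brevity, where a real implementation would compare suffixes lazily.

```python
import bisect

corpus = "Spain declined to confirm that Spain declined to aid Morocco".split()
words = [w.lower() for w in corpus]  # case-folded, as on the slide

# Suffix array: start positions, ordered alphabetically by the suffix
# beginning at each position.
sa = sorted(range(len(words)), key=lambda i: words[i:])
print(sa)  # [8, 3, 6, 1, 9, 5, 0, 4, 7, 2]

def occurrences(phrase):
    """All corpus positions of a word sequence, via binary search over
    the sorted suffixes (here via a materialized prefix list)."""
    n = len(phrase)
    prefixes = [words[i:i + n] for i in sa]  # sorted along with sa
    lo = bisect.bisect_left(prefixes, phrase)
    hi = bisect.bisect_right(prefixes, phrase)
    return sorted(sa[lo:hi])

print(occurrences(["spain", "declined"]))  # [0, 5]
```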
Applied to Factored Translation Models
Factored corpus (word indices 0–9):
words: Spain declined to confirm that Spain declined to aid Morocco
POS:   NNP VBD TO VB IN NNP VBN TO VB NNP
stems: spain declin to confirm that spain declin to aid morocco
• Index each factor
• Store word-level alignments
• Calculate probabilities on the fly
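The "on the fly" calculation amounts to finding all occurrences of a surface phrase and counting the factor sequences at those positions. The sketch below uses a linear scan where the suffix array would give logarithmic lookup; the function name is my own, not part of Moses.

```python
from collections import Counter

# Word and POS factors of the example corpus (POS sequence from the slide).
words = "Spain declined to confirm that Spain declined to aid Morocco".split()
pos = "NNP VBD TO VB IN NNP VBN TO VB NNP".split()

def pos_given_words(phrase):
    """p(POS sequence | surface phrase), estimated by relative frequency
    over the corpus occurrences of the phrase."""
    n = len(phrase)
    counts = Counter(
        tuple(pos[i:i + n])
        for i in range(len(words) - n + 1)
        if words[i:i + n] == phrase
    )
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}

print(pos_given_words(["Spain", "declined"]))
# {('NNP', 'VBD'): 0.5, ('NNP', 'VBN'): 0.5}
```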
Generation Probabilities
p(NNP VBN | Spain declined) = 0.5
p(NNP VBD | Spain declined) = 0.5

[Figure: the sorted suffix array locates both occurrences of "Spain declined" (positions 0 and 5); the POS factors at those positions, NNP VBD and NNP VBN, give the generation counts.]
Generation Probabilities
p(Spain | NNP) ≈ 0.667
p(Morocco | NNP) ≈ 0.333

[Figure: a second suffix array indexes the POS factor sequence NNP VBD TO VB IN NNP VBN TO VB NNP (sorted order [4, 9, 0, 5, 2, 7, 3, 8, 1, 6]); the three NNP positions generate "Spain" twice and "Morocco" once.]
Translation Probabilities
[Figure: the English corpus "Spain declined to confirm that Spain declined to aid Morocco" word-aligned with the French "L'Espagne a refusé de confirmer que l'Espagne avait refusé d'aider le Maroc"; the sorted suffix array locates both occurrences of "Spain declined", and the stored word alignments give the translation counts.]

p(L'Espagne a refusé de | Spain declined) = 0.5
p(l'Espagne avait refusé d' | Spain declined) = 0.5
Translation Probabilities
[Figure: the same word-aligned English-French corpus, with the POS factor sequence NNP VBD TO VB IN NNP VBN TO VB NNP indexed by a second suffix array; conditioning on both the surface phrase "Spain declined" and the POS sequence NNP VBN selects only the second occurrence.]

p(l'Espagne avait refusé d' | Spain declined, NNP VBN) = 1
Advantages
• Memory reduction
– Memory = 2 × num factors × corpus + word alignments
– Significantly less than phrase tables!
• Greater range of statistics
– Arbitrary number of conditioning variables– Allows range of back-off strategies
• Can extract statistics for arbitrarily long sequences
Research to be Undertaken
• Integrate into Moses decoder
• Deal with increased computational complexity
• Change search strategies to incorporate longer factor sequences of different levels of granularity
• Experiment to test if longer sequences improve translation quality
• Experiment with what variables to condition upon, how to back off
MIT Lincoln + Computer Science AI Labs
Charles University

Wade Shen, Brooke Cowan, Ondrej Bojar and Christine Moran

Factored Translation Models for Small Data Problems
Experiments with Spanish, Czech and Chinese

8/14/2006
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
General Motivations
Challenges with Small Data

• Phrase-based MT relies on large data
– learn "phrase" co-occurrence within a language
– learn translation templates/phrases across languages

• Problems for phrase-based MT with small data
– word alignment
– hard to see enough phrases (coverage), especially in morphologically rich languages
– tend to rely on shorter phrases
  → increased local agreement problems
  → increased long-distance coherence problems
Possible Advantages of Factored Models
Generalization over Morphology

• We can model morphological variation and phrase translation separately for better statistics: translation + generation

– Spanish gender:
  masculine: Él es un jugador rojo (morph: m 3p+sing m m m)
  feminine:  Ella es una jugadora roja (morph: f 3p+sing f f f)
  English for both: "he/she is a red player"; lemmas: el ser un jugador roj

– Czech case:
  nominative plural: černé kočky (morph: nom+pl nom+pl)
  dative plural:     černým kočkám (morph: dat+pl dat+pl)
  English for both: "black cats"; lemmas: černá kočka
Factors as Type Checking
Long Range Phenomena and Divergence

• Long range dependencies can be modeled with latent factors
– Spanish: verb-subject number agreement
• Verb-argument dependencies

Spanish: Mi hija de dos años tiene catarro (gloss: My daughter of two years has cold)
  — subject 3p+sing agrees with verb 3p+sing
Czech: Nachlazena je moje dvouletá dcera. — the same agreement holds despite different word order

Czech: Napsal zprávu o matčině domu na papír (gloss: He wrote a message about mother's house on a paper)
  — the verb selects an accusative noun
Czech: Našel zprávu o matčině domu na papíře (gloss: He found a message about mother's house on a paper)
  — the verb selects a locative noun
Phrase-Level Generalization
• Class-based divergences
– Chinese-English resultative constructions, e.g. gloss "you hit broken done it" → English "you broke it"; a similar pattern holds for a large class of verbs

• Longer distance movement dependencies
– Chinese-English questions: 你 要 回答 [clause…] 吗, gloss "you want reply [clause…] y/n-marker" → English "would you like to reply to [clause…]?"
– the final question particle (tag: Part) causes verb-specific reordering (tags: VModal, Pn)
Large vs. Small Data
How Generalizations May Affect SMT Performance

• With large data sets these phenomena can be learned
– language models should capture local agreement phenomena given enough data
– long range agreement/coherence still problematic
– generalization may still be better, but errors in analysis can limit it

• Generalization may be advantageous for small data
– for example (Spanish/Czech agreement): can't learn every noun/adjective/determiner triple
– this is the situation for many real-world problems
Outline
• Motivations
• Experimental Design and Baselines– Approaches– Data Sets
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Data Sets and Baselines
Data set       Direction        Size                        Baseline w/ diff. LMs (BLEU, surface)
Full Europarl  English→Spanish  950k LM train, 700k bitext  3g 29.35; 4g 29.57; 5g 29.54
Euromini       English→Spanish  60k LM train, 40k bitext    3g 23.41; 3g (950k) 25.10
Czech WSJ      English→Czech    20k LM train, 20k bitext    3g 25.82 (four references)
IWSLT Chinese  Chinese→English  40k LM train, 40k bitext    4g 19.54 (seven references)
Using Factored Models
Approaches for Small-Data Tasks

• Factored models we tried
– different levels of linguistic information modeled separately (example: morphology vs. phrasal content)
– feature "checking" of existing phrasal models with LMs over factors
– generalized factor-based distortion: a phrase is likely to move distance X if the preceding word has tag Y

• Hypothesis: these models allow better utilization of limited training data
Good (high likelihood):  words "I would like some donuts",   POS pn mod vb det np
Bad (low likelihood):    words "I would like some big jump", POS pn mod vb det adj vb
Different Factored Approaches
Overview of Models Tried

Supervised analysis (explicit agreement, long distance coherence):
• high-order language models over POS
• LMs over verbs/subject; LMs over nouns, determiners, adjectives
• parallel translation models over lemmas and morphology

Unsupervised (agreement/coherence):
• LMs over word classes
• parallel translation models over word classes and surface forms
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Spanish Experiments
Language Models over Morphological Features

• NDA: noun/determiner/adjective agreement — generate only on N, D and A tags (don't-cares elsewhere)
  N/D/A features — gender: masc, fem, common, none; number: sing, plural, invariable, none

• VPN: verb/noun/preposition selection agreement — generate on V, N or P
  V/N/P features — number: sing, plural, invariable, none; person: 1p, 2p, 3p, none; prep-ID: preposition, none

[Model diagram: surface words generate latent nda/vpn factors, which are then checked by LMs over those factors.]
ModelModel
Spanish Experiments
Skipped LMs for Agreement

• Allow NULL factors to be generated
• Increase effective context length to model longer range dependencies

[Example: target phrase "dio a la mujer" for source "…gave the woman"; the vpn factor is 3+s on "dio" and NULL (X) elsewhere, the nda factors are s+f/s/s+f on the noun phrase — NULL positions are skipped by the factor LM.]
Spanish Agreement LMs
Experimental Results

• With skipping (EuroMini): baseline 23.41, NDA+skip 24.03, VPN+skip 24.16
• No skipping, LM counts don't-care positions (EuroMini): baseline 23.41, NDA 24.47, VPN 24.33, both 24.54
• No skipping, all morphological features w/ and w/o POS (EuroMini): baseline 23.41, morph 24.66, morph+POS 24.25

• All models beat the baseline
– skipping doesn't seem to help
– full morphology is best
Spanish Experiments
Parallel Lemma/Morphology Translation

• Factor surface forms into lemma and morphology features (person + number + gender + case)
• Translate both simultaneously
• Re-generate the target surface form
• Apply LMs to both surface and morphology features

[Diagram: "Me" analyzed into lemma "I" + morphology 1ps+Acc, translated to lemma "Yo" + 1ps+Acc, generated as surface "Mi".]

• Results (EuroMini + 950k LM): baseline 25.10, lemma 25.71
Scaling Up to Large Training
POS Language Models

• Full training set → less/no gain from richer features

[Chart: BLEU score (28–31; note the compressed scale) vs. POS n-gram order (3g–9g) for Baseline, POS-LM, and Full Tags.]
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Factors for Coping with Limited Data
Better Word Alignment for Czech

• Word alignment is difficult when data is limited and morphology is rich
– data: 20k bitext sentences, large vocabulary
– contrast set: 20k + 840k (out-of-domain) sentences
– task: English→Czech

• Two methods to deal with limited data: stem alignment and lemma alignment
• Contrastive behavior for small and large data:

  Data set        Word-Word  Stem-Lemma  Stem-Stem
  20k Czech       25.17      25.23       25.82
  Large contrast  25.40      –           24.99
Czeching Rich Morphology with Tags
Tagged Czech Language Models

• Idea: use morphologically rich POS tag sequences to "czech" target output generation (e.g. generate the tag N+acc for "kočky" and apply an LM over the tag sequence)

• POS information configurations (baseline: 25.82):
– Full tags: feature 1, feature 2, … (15 total); 1098 tags; result 27.04
– CNG tags: case and number+gender on V, P, PP, N, A; 707 tags; result 27.45
– CNG+VP: CNG features + person+tense+aspect (verbs) + lemma+case (prepositions); 899 tags; result 27.62
Comparing with Larger Data Models
Tagged Czech Language Models

• Large vs. small data:

  Model     Data set                       BLEU   Relative improvement
  Baseline  20k Czech                      25.82  –
  Baseline  Large contrast (20k+840k OOD)  27.47  –
  CNG+VP    20k Czech                      27.62  +6.97%
  CNG+VP    Large contrast (20k+840k OOD)  28.12  +2.37%

• Tagged language models improve performance for small data significantly, approaching large-data performance
• The large task also improves, but much less (+2.37% vs. +6.97%)
Parallel Translation Models for Czech
• Motivation: factored LM models seem to lose number information
• Translate a POS tag + CNG features factor in parallel with the surface or lemma (e.g. surface "him" / factor 3p+acc → surface "ho" / lemma "on" / factor 3p+acc)

  Model                              Result
  Surface → surface + POS+CNG tags   25.94
  Surface → lemma + POS+CNG tags     26.43

• Better than baseline, but worse than both CNG and CNG+VP
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Generalized Distortion Modeling
Introduction to Distortion

• For each phrase pair we learn its likely placement relative to the previous phrase
• Orientations:
– monotone: word alignment point on the top left
– swap: word alignment point on the top right
– discontinuous: neither monotone nor swap
• Example: la casa roja → the red house (D NN ADJ → D ADJ NN)

[Figure: source-target alignment grid illustrating the monotone, swap, and discontinuous orientations.]
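The orientation of a phrase pair can be read off the source spans of consecutive target-side phrases. This is a hedged sketch under my own span convention (inclusive `(src_start, src_end)` positions), not Moses code:

```python
def orientation(prev_span, cur_span):
    """Orientation of the current phrase relative to the previous one.
    prev_span, cur_span: inclusive (src_start, src_end) source spans of
    the two phrases, taken in target order."""
    if cur_span[0] == prev_span[1] + 1:
        return "monotone"       # continues right after the previous span
    if cur_span[1] == prev_span[0] - 1:
        return "swap"           # ends right before the previous span
    return "discontinuous"

# "la casa roja" -> "the red house": after "la" (src 0) the target
# continues with "roja" (src 2), then "casa" (src 1):
print(orientation((0, 0), (2, 2)))  # discontinuous
print(orientation((2, 2), (1, 1)))  # swap
```

A lexicalized distortion model collects these orientation counts per phrase pair; the factor-based extension below collects them per factor value (e.g. per POS) instead.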
Factor-based Distortion Models
• A factor-based extension of lexicalized distortion
– use of more general factors, e.g. POSf-POSe, lemma-lemma
• Can model longer range dependencies
– more conditioning variables
• Motivating results
– hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (de Gispert et al., 2006)
Factor-based Distortion
Spanish Experiments

• Lexicalized distortion only (Europarl, Pharaoh vs. Moses):

  Language pair  Pharaoh  Moses
  En→De          18.15    18.85
  Es→En          31.06    31.85
  En→Es          31.46    32.37

• Factor-based distortion on small data, systems compared: baseline (no lexicalized distortion), baseline + lexicalized distortion, factored POS-POS, combined lexical + POS-POS

• Further experiments
– other factors
– minimizing model parameters
– combining different models
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
IWSLT Chinese
Experiments with Unsupervised Annotation

• Data: travel-domain sentences, limited vocabulary, short sentences
• Task: text and ASR translation, Chinese→English
• Can we use automatic word classes to learn general sequence constraints?
• First experiment: LMs of varying orders over bigram-induced word classes

[Model diagram: source phrase 总共 多少 钱 ? → target phrase "How much is it?", with a surface word factor and an automatic word-class factor (c55 c3 c22 c1) scored by a class LM.]
IWSLT Chinese
Alignment Templates for Translation

• Second experiment: extend the class-based LM to the translation model
• Bigram word classes for source and target
• Translate alignment templates, similar to [Och 98], plus surface forms
• Apply LMs to both surface and class factors

[Model diagram: surface and word-class factors translated in parallel, with generation between them.]
IWSLT Chinese
Autoclass Results

[Chart: BLEU score (18–22.5; note the compressed scale) vs. class n-gram order (3g–9g) for Baseline, Class-LM, and ClassTrans+LM.]

• Class-LM significantly better (p = 0.05, ~1.0 BLEU)
• Class-Trans may be limited by the synchronous phrase-table constraint
– started to address this here, but not in time for the eval
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Coping with Rich Morphological Constraints in Czech
• Conclusions and Follow On Research
Conclusions and Future Work
• Factored approach can help with small data
– large-data tasks may need different factored approaches

• MIT/LL + CSAIL
– continue experiments with morphology and coherence
– fully asynchronous factor translation
– apply techniques to other languages; extend existing LCTL experiments
– syntax-driven reordering models (Brooke)

• Asynchronous factors translation (Hieu)

• Making use of verb sub-categorization information (Ondrej)
Valency-Aware Machine Translation
Project Proposal
Ondrej [email protected]
August 17, 2006
Overview
• JHU Workshop motivation and one of the results.
• State-of-the-art MT errors.
• Project goal.
• Motivation: Why Czech.
• Proposed strategy and information sources.
• Summary.
Appendices: References, illustrations and further details on Czech and English
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
2
Workshop Motivation
• Statistical machine translation (SMT) into morphologically rich languages is more difficult than from them.
See e.g. Koehn (2005).
• One of the workshop goals: examine the utility of factored translation models to translate into morphologically rich languages.
• There was room for improvement:
Regular BLEU, English→Czech: 25%
BLEU of lemmatized MT output against lemmatized references: 32%
⇒ Errors in morphology cause large BLEU loss.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
3
One of the Workshop Results
• Significant improvements gained on small data sets: English→Czech, 20k sentences, BLEU 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data.
• Still far below the margin of lemmatized BLEU (35%).
• However local agreement already very good:
Microstudy of adjective-noun agreement: 74% correct, 2% mismatch; other cases: missing noun etc.
⇒ So where are the morphological errors?
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
4
Current English→Czech MT Errors

Microstudy of the current best MT output (BLEU 28.12%), intuitive metric:
• 15 sentences, 77 verb-modifier pairs in the source text examined:

  Translation of…  preserves meaning  is disrupted  is missing
  Verb             43%                14%           21%
  Modifier         79%                12%           6%

But: when verb and modifier are both correct, 44% of cases are still non-grammatical or meaning-disturbing relations.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
5
Sample Errors

Input: Keep on investing.
MT output: Pokračovalo investování. (grammar correct here!)
Gloss: Continued investing. (Meaning: The investing continued.)
Correct: Pokračujte v investování.
⇒ the language model misled us ⇒ need to include source valency information.
Input: brokerage firms rushed out ads . . .
MT output: brokerské firmy vyběhl reklamy
Gloss: brokerage firms(pl.fem) ran(sg.masc) ads(pl.nom/pl.acc/pl.voc/sg.gen)
Correct option 1: brokerské firmy vyběhly s reklamami(pl.instr)
Correct option 2: brokerské firmy vydaly reklamy(pl.acc)

Target-side data may be rich enough to learn: vyběhnout–s–instr
Not rich enough to learn all morphological and lexical variants: vyběhl–s–reklamou, vyběhla–s–reklamami, vyběhl–s–prohlášením, vyběhli–s–oznámením, …
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
6
Project Goal
Improve MT output quality by valency information.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
7
Motivation: Why Czech

• Relevant properties: very rich morphological system and relatively free word order.
• Well-established theory on syntax and valency in particular: Sgall, Hajicova, and Panevova (1986), Panevova (1994).
• Data available: monolingual and parallel corpora, manual surface and deep treebanks (parallel forthcoming!), manual valency lexicons

  Language  Corpus                                         Annotation up to                        Tokens
  Cs        PDT 2.0 (Hajic, 2005)                          manual surface and deep syntax          1.5M surf.
  Cs        CNC (Kocek, Koprivova, and Kucera, 2000)       automatic lemmatization and morphology  114M
  Cs        Web corpus                                     automatic surface syntax                100M
  Cs↔En     PCEDT 1.0 (Cmejrek, Curın, and Havelka, 2003)  automatic surface and deep syntax       500k
  Cs↔En     CzEng 0.5                                      automatic surface syntax                15M
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
8
Proposed Strategy

Preliminary experiments at the workshop:
• Factored models touching valency explored during the workshop perform badly: no gain or a slight loss.

Future:
• Evaluate the causes. Was it just sparse data?
• Check subcategorization using partially lexicalized language models. A morphological LM with verbs lexicalized should capture subcategorization.
• Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
• Map explicit subcategorization information from source to target: translate lemma+subcat to lemma+subcat and POS to POS, then generate the surface form from these.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
9
Project Will Use these Sources of Information
• Available valency/subcategorization dictionaries: VALLEX for Czech (∼PropBank for English).
• Automatically collected subcategorization data: (Korhonen, 2002) and previous work; my dissertation in preparation.
• Word-sense-like algorithms to label verb occurrences with frames: (Bojar, Semecky, and Benesova, 2005), and all WSD community results.

Compare with simple approaches:
• More monolingual data for plain n-gram language models may help enough.
• Are valency-based generalizations useful in general / on small data / on out-of-domain data?
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
10
Summary
• Factored models help fix morphology → local dependencies are already largely correct.
• Significant margin for improving verb-modifier agreement.
• English→Czech pair is a good fit for the experiments.
• Improved valency models should improve translation quality:Valency theory, data and methods available.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
11
References
Bojar, Ondrej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of
Mathematical Linguistics, 79–80:101–120.
Bojar, Ondrej, Jirı Semecky, and Vaclava Benesova. 2005. VALEVAL: Testing VALLEX
Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of
Mathematical Linguistics, 83:5–17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the
Association for Computational Linguistics, pages 116–123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling.
In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting
of the Association for Computational Linguistics and Seventeenth International Conference
on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann
Publishers.
Cmejrek, Martin, Jan Curın, and Jirı Havelka. 2003. Czech-English Dependency-based Machine
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
12
Translation. In EACL 2003 Proceedings of the Conference, pages 83–90. Association for
Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In
Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages
184–191.
Collins, Michael, Jan Hajic, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A
Statistical Parser of Czech. In Proceedings of 37th ACL Conference, pages 505–512, University
of Maryland, College Park, USA.
Hajic, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Maria
Simkova, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54–73, Bratislava,
Slovakia. Veda, vydavatelstvo SAV.
Holan, Tomas. 2003. K syntakticke analyze ceskych(!) vet. In MIS 2003. MATFYZPRESS,
January 18–25, 2003.
Kocek, Jan, Marie Koprivova, and Karel Kucera, editors. 2000. Cesky narodnı korpus - uvod a
prırucka uzivatele. FF UK - UCNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In
Proceedings of MT Summit X, September.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
13
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530,
University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on
Ideas and Strategies for Multilingual Grammar Development.
Panevova, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L.
Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223–243,
Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajicova, and Jarmila Panevova. 1986. The Meaning of the Sentence and
Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech
Republic/Dordrecht, Netherlands.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
14
Analysis of Czech

Analytic (surface syntactic):

[Tree #36 for "Zakony udelejte pro lidi" (Laws make for people): PRED udelejte, with OBJ Zakony, AUXP pro, ADV lidi.]
Tectogrammatical (deep syntactic):
#36zakonPl
lawPl
udelatimp
makeimpyou
clovekPl,pro
personPl,for
BENACTPAT
PRED
Morphological:Form Lemma Morphological tag
zakony zakon NNIP1-----A----
zakony zakon NNIP4-----A----
zakony zakon NNIP5-----A----
zakony zakon NNIP7-----A----
udelejte udelat Vi-P---2--A----
udelejte udelat Vi-P---3--A---4
pro pro-1 RR--4----------
lidi clovek NNMP1-----A----
lidi clovek NNMP4-----A----
lidi clovek NNMP5-----A----
Properties of the Czech language

                          Czech                                 English
Rich morphology           ≥ 4,000 tags possible, ≥ 2,300 seen   50 used
Word order                free                                  rigid

• rigid global word order phenomena: clitics
• rigid local word order phenomena: coordination, mutual order of clitics

Nonprojective sentences   16,920   23.3%
Nonprojective edges       23,691    1.9%

Known parsing results     Czech         English
Edge accuracy             69.2–82.5%    91%
Sentence correctness      15.0–30.9%    43%

Data by Collins et al. (1999), Holan (2003), Zeman
(http://ckl.mff.cuni.cz/~zeman/projekty/neproj/index.html),
and Bojar (2003). Consult Kruijff (2003) for measuring word order freeness.
Detailed numbers on Czech

Edge length    1      ≤ 2    ≤ 5
English [%]    74.2   86.3   95.6
Czech [%]      51.8   72.1   90.2
(English data by Collins (1996); Czech data by Holan (2003).)

Number of gaps   0      1      2
Sentences [%]    76.9   22.7   0.42
(Data by Holan (2003).)

Climbing steps   1      2     3     4     5
Nodes [%]        90.3   8.0   1.3   0.3   0.1
(Data by Holan (2003).)
Analytic vs. Tectogrammatical (2)

Analytic tree, sentence #45, 'To by se mělo změnit.' (it / conditional particle / reflexive particle / should / change / full stop): mělo is the PRED; to (SB), by (AUXV), se (AUXR), změnit (OBJ), and the full stop (AUXK) attach to it.

Tectogrammatical tree: mít ('should') is the PRED, governing to ('it'), změnit ('change'), and a Generic Actor node (functors ACT and PAT among its dependents).
Asynchronous Factored Translation
Hieu Hoang, University of Edinburgh
Current System

Phrase Table 1: Je vous achète → I am buying you
Phrase Table 2: PRO PRO VB → PRO VB VB PRO

Translating:
Je vous achète un chat
PRO PRO VB ART NN
Limitations: Synchronous

Phrase Table 1: Je → I; vous → you; achète → am buying
Phrase Table 2: PRO PRO VB → PRO VB VB PRO

Je vous achète un chat
PRO PRO VB ART NN
Asynchronous Translation
Tiling

Je vous achète un chat
PRO PRO VB ART NN

Current System vs. Future: tile the source as [Je] [vous] [achète] [un chat], translating each factor with its own segmentation.
Long Templates

Phrase Table 1: I; am buying; you; a cat
Phrase Table 2: PRO PRO VB ART NN → PRO VB VB PRO ART NN

Je vous achète un chat
PRO PRO VB ART NN
Templates

Phrase Table 1: Je → I; vous → you; achète → am buying; un chat → a cat
Phrase Table 2: PRO PRO VB ART NN → PRO VB VB PRO ART NN
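As a sketch of the tiling idea on the preceding slides (not the actual Moses implementation), each factor's sequence can be covered greedily with the longest matching entries from that factor's own table; the toy tables below mirror the French–English example above:

```python
def tile(seq, table):
    """Greedily cover seq with the longest phrases found in table;
    unknown single tokens are passed through unchanged."""
    out, i = [], 0
    while i < len(seq):
        for j in range(len(seq), i, -1):      # try the longest span first
            if tuple(seq[i:j]) in table:
                out.append(table[tuple(seq[i:j])])
                i = j
                break
        else:
            out.append(seq[i])                # no phrase covers seq[i]
            i += 1
    return out

# Toy tables from the slides; each factor is tiled independently,
# so the word and POS segmentations need not match (asynchronous).
words_table = {("Je",): "I", ("vous",): "you",
               ("achète",): "am buying", ("un", "chat"): "a cat"}
pos_table = {("PRO", "PRO", "VB"): "PRO VB VB PRO", ("ART", "NN"): "ART NN"}
```

Tiling 'Je vous achète un chat' with words_table yields four word tiles, while tiling its tag sequence with pos_table yields two POS tiles; reconciling such mismatched tilings is exactly what the asynchronous model must handle.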
Combining information from different factors

Surface: ni suo ta da mingzi le ma ?  →  You said his name, right ?
Gloss:   you say his name already question
Tense:   past → past
Challenges

• Computational complexity
• Pruning strategies
• Recombination
• Scoring
Translation of morphologically rich languages with additional linguistic information
Chris Dyer, Philipp Koehn, Chris Callison-Burch, Hieu Hoang
17 August 2006
Dyer, Koehn, Callison-Burch, Hoang Morphologically rich languages 17 August 2006
Morphologically rich languages
• Languages differ in their morphological markup
• Examples with increasing complexity:
  – Chinese: no marking for number, gender, tense, or aspect
  – English: number (2) for nouns, four verb forms
  – Spanish: number (2) and gender (2) for adjectives, ...
  – German: number (2), gender (3), case (4), definiteness for adjectives, ...
  – Arabic: number (3), gender (2), case (3), definiteness, possessors for nouns
  – Finnish: prepositions often expressed morphologically
Language   Vocabulary size in Europarl
English     65,887 word forms
Spanish    102,886 word forms
German     195,290 word forms
Finnish    358,345 word forms
Impact of morphological complexity
• How much information do we have if we discount inflectional morphology?
• Experiment (systems trained on full 700,000 sentence Europarl corpus):
Method                          devtest       test
surface → surface               18.22 BLEU    18.04 BLEU
surface → surface (lemmatize)   22.27 BLEU    22.15 BLEU
surface → lemma                 22.70 BLEU    22.45 BLEU
• Gain of 4 BLEU points possible, if we can solve morphology
Problem: unknown word forms
• Unknown surface word forms (German)

test set       unigrams   bigrams   trigrams
devtest-2006   0.71%      12.00%    40.46%
test-2006      0.69%      12.20%    41.08%

• Unknown lemmas (German)

test set       unigrams   bigrams   trigrams
devtest-2006   0.64%      9.05%     33.93%
test-2006      0.64%      9.14%     34.36%
Factored models
• Factored models allow us to address these problems
• Sparse data
– back off to translation of lemmas
– back off to language models with richer statistics
• Agreement and grammatical coherence
– use of factors that enforce agreement within noun phrases
– use of factors that enforce agreement on the clause level
Addressing data sparseness with lemmas
[Diagram: input word → output word and output lemma; the output surface form is generated from the lemma]
• Translate surface into lemma
• Generate surface from lemma
• Translate surface into surface
• Language models over surface and lemma
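A minimal sketch of how the lemma path adds translation options, with invented toy tables standing in for the learned phrase and generation tables (these are not Moses data structures): a surface form unseen in the surface table can still be translated by going through its lemma and generating target surface forms.

```python
def translation_options(word, lemma, surface_table, lemma_table, generation_table):
    """Union of the direct surface path and the lemma back-off path."""
    options = set(surface_table.get(word, []))       # surface -> surface
    for target_lemma in lemma_table.get(lemma, []):  # lemma -> target lemma
        options |= set(generation_table.get(target_lemma, []))  # lemma -> surface forms
    return options

# Toy example: 'ging' (lemma 'gehen') is unseen as a surface form,
# but the lemma route still yields English surface candidates.
surface_table = {}
lemma_table = {"gehen": ["go"]}
generation_table = {"go": ["go", "goes", "went"]}
```

With these tables, the unseen form 'ging' still receives the candidates generated from the lemma 'go', while a surface-table hit would be used directly.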
Addressing data sparseness with lemmas, model 2
[Diagram: input word → output word; the output lemma is generated from the output word]
• Translate surface into surface
• Generate lemma from surface
• Language models over surface and lemma
Experimental Results
Method                         devtest   test
baseline                       18.22     18.04
hidden lemma (gen only)        18.82     18.69
hidden lemma (gen and trans)   18.41     18.52
best published results         –         18.15
• Better performance than baseline model
• Simpler model has higher performance
– fewer search errors
Addressing data sparseness with factored models
[Diagram: input lemma → output lemma; input part-of-speech → output part-of-speech and morphology; the output word is generated from lemma + morphology]
• Morphological analysis and generation model
• Pitfalls of this approach
– tag set does not necessarily have sufficient information
– explosive search space on large models
Overall grammatical coherence
[Diagram: input word → output word, with a part-of-speech factor attached to the output word]
• High order language models over POS
• Motivation: syntactic tags should enforce syntactic sentence structure
• Results: No major impact with 7-gram POS model (BLEU 18.25 vs. 18.22)
• Analysis: local grammatical coherence is already fairly good; the POS sequence LM is not strong enough to support major restructuring
Local agreement (esp. within noun phrases)
[Diagram: input word → output word, with part-of-speech and morphology factors attached to the output word]
• High order language models over POS and morphology
• Motivation
– DET-sgl NOUN-sgl: good sequence
– DET-sgl NOUN-plural: bad sequence
Agreement within noun phrases
• Experiment: 7-gram POS, morph LM in addition to 3-gram word LM
• Results
Method           Agreement errors in NP ≥ 3 words   devtest      test
baseline         15%                                18.22 BLEU   18.04 BLEU
factored model   4%                                 18.25 BLEU   18.22 BLEU
• Example
– baseline: ... zur zwischenstaatlichen methoden ...
– factored model: ... zu zwischenstaatlichen methoden ...
• Example
– baseline: ... das zweite wichtige änderung ...
– factored model: ... die zweite wichtige änderung ...
Subject-verb agreement
• Lexical n-gram language model would prefer
the paintings of the old man is beautiful
'old man is' is a better trigram than 'old man are'
• Correct translation
the paintings   of  the old man     are      beautiful
-    SBJ-plural -   -   -   -       V-plural -
• Special tag that tracks count of subject and verb
p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
Experiment on English–German
• Add special features for subject and verb
• Verb
– our morphological analyzer does not provide verb morphology
  → use of surface forms
• Subject
– subject identified with a German parser (Amit Dubey's parser trained on the TIGER treebank)
– if pronoun: surface form of pronoun
– if noun phrase: POS and morphological tags of determiner, adjective, and noun
Skip language models
• Full language model confused by many non-items:
  p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)

• Skip language models ignore irrelevant tags:
  p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular)
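A skip language model of this kind can be illustrated as follows (a toy sketch with hand-set probabilities; the workshop's models were trained, not hand-set): filler tags are dropped before scoring, so the subject and verb tags become adjacent.

```python
def skip_lm_score(tags, relevant, bigram_prob):
    """Drop tags outside `relevant`, then score the rest with a bigram model."""
    kept = [t for t in tags if t in relevant]
    score = 1.0
    for prev, cur in zip(kept, kept[1:]):
        score *= bigram_prob.get((prev, cur), 1e-6)  # small floor for unseen bigrams
    return score

# Hand-set toy probabilities: a plural subject prefers a plural verb.
bigrams = {("SBJ-plural", "V-plural"): 0.7, ("SBJ-plural", "V-singular"): 0.1}
relevant = {"SBJ-plural", "V-plural", "V-singular"}
```

Scoring the two tag sequences from the slide then reproduces the intended ranking p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular), untouched by the '-' fillers in between.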
• Results: experiments had not finished as of the workshop; preliminary results inconclusive
Reflection on the data
• Clause elements are translated reasonably well

– high agreement within noun phrases now (4% error with the factored model)
• Overall sentence structure muddled
– subject–verb agreement hard to enforce, since it is hard to establish which noun phrase is the subject
– role (and hence case) of noun phrases often wrong, since the relation to the verb is unclear
• Similar problems when translating Arabic–English, Chinese–English
– this motivates work on syntax-based machine translation
– one solution: syntactic restructuring models (Brooke's presentation)
– another solution: clause-level sequence models
Clause level sequence models
• Correct sentence with verb
the paintings of the old man are beautiful
SBJ SBJ OBJ OBJ OBJ OBJ V ADJ
• Incorrect sentence without verb
the paintings of the old man beautiful
SBJ SBJ OBJ OBJ OBJ OBJ ADJ
• Syntactic role label sequence model is on the steering wheel!
p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,V,ADJ) > p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,ADJ)
• May be simplified using skip language models to
p(SBJ,OBJ,V,ADJ) > p(SBJ,OBJ,ADJ)
Another reality check
• One typical error of the current system
wir haben daher nicht für diesen bericht stimmen
we  have  hence not   for  this   report  voting
SUBJ AUX PART PART PP-OBJ PP-OBJ PP-OBJ VINF
• Typical sentences have many particles floating around
– if interested in core sentence structure: ignore them
– if interested in all parts of the clause: include them
• Key lesson: feature engineering
– know your tag sets and morphological features
– be aware of what problem you want to address
– create a factor for this purpose
8/22/2006 JHUSWS 2006 1
Future Research

Back-off models: improving MT through smarter searching and better use of data

Chris Dyer, University of Maryland
Two Goals

Smarter search
– Mitigate sparse-data effects in multi-factored models
– Recover from search errors
– Enable well-motivated models for translating into morphologically complex languages

Back-off models
– Take advantage of single-factored models when it makes sense to do so
Smarter Search: Motivation

Morphological complexity poses problems for "white-space tokenized" statistical MT. Beyond data sparseness, conventional models run into search problems for rare surface forms.

Lemmatizing yields considerable German performance gains:

                               devtest-2006   test-2006
surface → surface              18.22          18.04
surface → surface, lemmatize   22.27          22.15
surface → lemma                22.70          22.45
Smarter Search: Motivation

Single-factor models do not generalize: they cannot produce a target form unless it was seen in the training data. Basic generation models let us improve translation coverage with (inexpensive) monolingual resources.

Translating English → German:

Generation training data       Size                # distinct words produceable
Surface only                   n/a                 105,000 distinct words
Lemmas only                    n/a                  85,000 distinct lemmas
Lemmas + bitext Europarl       15 million words    117,000 distinct words
Lemmas + full Europarl         27 million words    122,000 distinct words
Lemmas + 1.2M EP + Wikipedia   113 million words   137,000 distinct words

Net result: 30% increase in forms produceable over a single-factor model
Morphological Analysis and Generation Model

4-step model:
1. Translate surface to lemma
2. Generate morphology from lemma
3. Translate POS to morphology
4. Generate surface from lemma + morphology

n-gram LMs over surface, lemmata, and morphology
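The 4-step model can be caricatured with dictionaries standing in for the phrase and generation tables (every entry below is invented for illustration; real Moses tables are learned from data):

```python
# Invented toy tables, one per mapping step.
trans_lemma = {"buys": ["kaufen"]}                 # 1. surface -> target lemma
gen_morph   = {"kaufen": {"V.3.SG", "V.INF"}}      # 2. lemma -> candidate morphologies
trans_pos   = {"VBZ": {"V.3.SG"}}                  # 3. source POS -> target morphology
gen_surface = {("kaufen", "V.3.SG"): "kauft",      # 4. lemma + morphology -> surface
               ("kaufen", "V.INF"): "kaufen"}

def translate_word(surface, pos):
    """Run one word through the 4 mapping steps."""
    out = []
    for lemma in trans_lemma.get(surface, []):               # step 1
        morphs = gen_morph.get(lemma, set())                 # step 2
        morphs = morphs & trans_pos.get(pos, set())          # step 3: intersect
        out += sorted(gen_surface[(lemma, m)] for m in morphs)  # step 4
    return out
```

Intersecting the morphology candidates from steps 2 and 3 is what prunes the generated forms down to those licensed by the source POS.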
Initial results were disappointing…

– BLEU scores well below baseline (~11)
– Tuning took an entire weekend on a very small tuning set
The Problem

Search errors
– Aggressive pruning
– Each step multiplies the number of states in the search space over a single-factored model
– Spans must overlap exactly
The Problem: an illustration

Translation options for 'the right approach':
der richtige Ansatz
dem richtigen Ansatz
den richtigen Ansatz
The Solution

Back off to shorter spans: when a dead end is reached, break up the source span into smaller spans and translate those.
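The back-off idea can be sketched recursively (a simplification of the real decoder, which keeps multiple options and scores them; here a dead-end span is simply split in half):

```python
def backoff_translate(words, phrase_table):
    """Translate a source span; if no phrase covers it, back off to sub-spans."""
    key = tuple(words)
    if key in phrase_table:
        return list(phrase_table[key])
    if len(words) == 1:
        return list(words)            # pass an unknown word through
    mid = len(words) // 2             # dead end: break the span and recurse
    return (backoff_translate(words[:mid], phrase_table)
            + backoff_translate(words[mid:], phrase_table))
```

With entries for ('the',) and ('right', 'approach') but none for the full span, the span 'the right approach' is still covered by its pieces instead of failing outright.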
The Solution: an illustration

Translation options for 'the': der, die, das, dem, den, des
Translation options for 'right approach': richtiger Ansatz, Ansatz, richtigen Ansatzes
Back-off Models

Lexicalized surface forms are common. Because of lexicalization, obscure morphology or root forms are often retained.
  Ex. "be that as it may"
Translations are often approximate, unusual when analyzed in more abstract layers. If you mistranslate common stock phrases because of a rigid analysis and generation process, fluency suffers.
Back-off Models

Solution: try to let a single translation step cover all factors; back off to the multi-factored model.
Back-off Models: Implementation

"Primary" phrase table
– Standard form
– Contains all factors on the target side (necessary for secondary-factor LMs)
– May be trained on single-factor data with "best guesses" for secondary factors
– May be aggressively filtered, e.g. for > n occurrences, etc.
Back-off Models: Implementation

Key idea: back-off weight
– Feature associated with choosing a single-factored path
– Tuned along with other feature weights
– Function of source phrase length?
Summary

Increase performance of multi-factored models
– Recover from search errors
– Recover from data sparseness (make more efficient use of longer underlying phrases)

Extend the benefits of multi-factor models to target languages where sparse data and search errors are not generally an issue (e.g. English)
Translation with syntax and factors: Handling global and local dependencies in SMT

Brooke Cowan, MIT CSAIL
August 17, 2006
Brooke Cowan, MIT CSAIL Syntax and factors in SMT August 17, 2006
Goals of statistical machine translation
• Linguistically-correct output
– learn correct syntax and morphology in target language
– e.g., noun-phrase agreement, subject-verb agreement, verbs and their arguments
• Meaning-preserving output
– learn mapping between source and target sentence elements
– e.g., identify the subject in the source and ensure it plays the proper role in the target
– can involve a significant amount of reordering
Linguistically-correct output
• E.g., in Spanish noun phrases, nouns, determiners, and adjectives are constrained to agree in gender and number:

las   políticas   pesqueras   comunitarias
the   policies    fisheries   common
det   noun        adj         adj
(all FEMININE PLURAL)
• Phrasal agreement phenomena are generally local in nature.
Meaning-preserving output: free word order
• E.g., when translating from German to English, we want to identify and place the subject, object, and phrasal modifiers in the output
i would like to thank the rapporteur for his report
ich möchte dem berichterstatter für seinen bericht danken
dem berichterstatter möchte ich für seinen bericht danken
für seinen bericht möchte ich dem berichterstatter danken
• Translation involving free-word-order languages, or language pairs with very different basic word order, can be quite challenging because these phenomena are generally global in nature.
A hybrid system
• A syntax-based system

– handle global phenomena in translation
  ∗ inter-phrasal reordering
  ∗ verb/argument structure
  ∗ some long-distance agreement phenomena (e.g., subject-verb agreement)

• A factored phrase-based system

– handle local phenomena in translation
  ∗ agreement and reorderings
Combining the two systems
• Use the syntax-based system to reorder the source-language input
• Feed the output of the syntax-based system into the phrase-based system
German input:
für seinen bericht möchte ich dem berichterstatter danken

Modified German input:
ich WOULD LIKE TO THANK dem berichterstatter für seinen bericht

English output:
i would like to thank the rapporteur for his report
The syntax-based system
• Discriminatively-trained, tree-to-tree translation system (Cowan, Collins, and Kučerová, EMNLP ’06)
• Fully implemented and tested on German-to-English Europarl task
• Model predicts an aligned extended projection (AEP) on the target side
– a syntactic structure encapsulating the argument structure of the main target-side verb, and
– alignment information between the modifiers on the source and target sides
What is an AEP?

[Figure: a German clause ('zwischen beiden gesetzen bestehen also erhebliche rechtliche, praktische und wirtschaftliche unterschiede') alongside its English AEP. The AEP pairs the extended projection (EP) of the main verb (Frank 2002), S → NP-A VP with V = are, with alignment information: SUBJECT: there, OBJECT: 3, MOD(1): post-object, MOD(2): pre-subject.]
Integration with Moses
• Factor-based systems handle local phenomena well
• Extensions to Moses
Modified German input:
[ ich ] [ WOULD LIKE TO THANK ] [ dem berichterstatter ] [ für seinen bericht ]
– externally-provided translation options
– constraints on reordering
– n-best lists of AEPs
Research questions
• Factor the translation problem into two parts
– syntax-based system to handle global reorderings and agreements
– factor-based system to handle local reordering and agreements
• Can this approach improve overall translation quality?
– past work in rule-based clause restructuring (e.g., Collins, Koehn, and Kučerová, ACL ’05)
• What is the best way to combine these systems?
– hard constraints vs. soft constraints
– voting/backoff framework
Part of Speech Information for Alignment
Alexandra Constantin
2006 CLSP Summer Workshop
Bilingual Dictionary
Haus – house, building, home, household
Lexical Translation Probability Distribution / Implicit Alignment

Das(1) Haus(2) ist(3) klein(4) .
The(1) house(2) is(3) small(4) .

Alignment Function a

Klein(1) ist(2) das(3) Haus(4)
The(1) house(2) is(3) small(4)
POS Motivation
POS information for infrequent words
Example IBM Model 1 - Notations
e = target word
f = source word
t(e|f) = probability of translating foreign word f into English word e
f = (f_1, …, f_n) = foreign sentence
e = (e_1,…,e_m) = English sentence
p(e|f) = translation probability
a = alignment function
IBM Model 1 EM Algorithm
1. Initialize model (typically with uniform distribution)
2. Apply the model to the data (expectation step)
3. Learn the model from the data (maximization step)
4. Iterate steps 2-3 until convergence
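The four steps above can be sketched for IBM Model 1 in a few lines (a textbook EM implementation for illustration, not the workshop's actual training setup):

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Estimate t(e|f) from (foreign, english) sentence pairs with EM."""
    # 1. Initialize t(e|f) uniformly over the English vocabulary
    e_vocab = {e for _, es in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)    # expected counts c(e, f)
        total = defaultdict(float)    # marginal counts c(f)
        # 2. Expectation: distribute each e fractionally over its possible f's
        for fs, es in bitext:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # 3. Maximization: re-estimate the translation table from the counts
        t = defaultdict(float,
                        {(e, f): c / total[f] for (e, f), c in count.items()})
    return t
```

Even on a two-sentence toy bitext, a few iterations pull the probability mass onto the co-occurring pairs that actually translate each other.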
[Formula slides: Expectation Step; Expectation Step – p(e|f); Maximization Step]
Adding POS Information / Experiments – AER

Compare generated alignments against manual alignments.
Manual alignments: probable (P) and sure (S)
Automated alignments: A
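AER (following Och and Ney's definition, which I take to be the metric used here) compares the system links A against the sure links S and probable links P:

```python
def alignment_error_rate(A, S, P):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P.
    Each alignment is a set of (source_position, target_position) links."""
    A, S, P = set(A), set(S), set(P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```

An alignment that recovers every sure link and stays inside the probable links scores 0; stray links and missed sure links both raise the rate.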
Results
AER        10k    20k    40k    60k    80k    100k
Baseline   53.7   51.8   49.3   48.6   47.5   47.1
Only POS   76.0   75.4   75.5   75.1   75.3   75.1
+ POS      53.6   51.5   49.6   48.4   47.7   47.3
Future Work
– Use alignments to train MT system and compare BLEU scores
– Use POS information in more complicated alignment methods
– Use other factors
JHU CLSP Summer Workshop 2006 – Team Presentation

Experimental Results for Confusion Network Decoding
Richard Zens, Nicola Bertoldi, Marcello Federico, Wade Shen
Zens, Bertoldi, Federico, Shen: Results for Confusion Net Decoding, August 17, 2006
IWSLT Task
• Chinese–English, domain: phrase book entries
• corpus statistics:
                Chinese   English
sentences           40 K
running words      351 K     365 K
vocabulary          11 K      10 K
• confusion network statistics (489 sentences):
                       read speech   spontaneous speech
avg. length            17.2          17.4
avg. / max. depth      2.2 / 92      2.9 / 82
avg. number of paths   10^21         10^32
• no development data for confusion networks
Results for IWSLT
• phrase table provided by MIT/LL
• competitive baseline results
• results:

                      read speech   spontaneous speech
                      BLEU [%]      BLEU [%]
verbatim              21.4
1-best from lattice   19.0          17.2
1-best from CN        19.0          17.2
full CN               19.3          17.8
• improvements are statistically significant (89% confidence)
Other Ambiguous Input: Punctuation
• Chinese input does not contain punctuation
• illustration:
hello world →

position:   1           2       3           4
            hello 1.0   ε 0.9   world 1.0   ! 0.7
                        , 0.1               . 0.2
                                            ε 0.1
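The network above is small enough to enumerate. As a sketch (not the Moses decoder's internal representation), a confusion network can be stored as one list of (token, probability) arcs per position, with ε arcs emitting nothing:

```python
from itertools import product

EPS = "eps"
# One column per position; each arc is (token, probability).
# These columns transcribe the punctuation illustration above.
cn = [[("hello", 1.0)],
      [(EPS, 0.9), (",", 0.1)],
      [("world", 1.0)],
      [("!", 0.7), (".", 0.2), (EPS, 0.1)]]

def cn_paths(cn):
    """Yield every (sentence, probability) path through the network."""
    for arcs in product(*cn):
        prob = 1.0
        words = []
        for token, p in arcs:
            prob *= p
            if token != EPS:
                words.append(token)
        yield " ".join(words), prob
```

The six paths sum to probability 1, and the most likely one is 'hello world !' with probability 0.9 × 0.7 = 0.63.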
• results for verbatim input:

punctuation input type   BLEU [%]
1-best                   20.8
confusion network        21.0
• competitive performance without tuning→ room for improvement
Truecasing
truecasing, i.e. restoring case information in lowercase text
• common approach:
– core MT system produces lowercase output– truecasing is done as postprocessing step
• application of factored translation models
1. translate lowercase2. generate truecase output (using a truecase LM)
• results:

             BLEU [%]
two-step     18.9
integrated   17.8
→ somewhat worse performance than the dedicated tool
EPPS Task

• EPPS: European Parliament Plenary Sessions
• Spanish-English speech-to-speech translation task
• corpus statistics:
                Spanish   English
sentences          1.2 M
running words       31 M      30 M
vocabulary         140 K      94 K
• confusion network statistics:

                       dev         test
sentences              2,633       1,071
avg. length            10.6        23.6
avg. / max. depth      2.8 / 165   2.7 / 136
avg. number of paths   10^38       10^75
Results for EPPS Task
dev testASR-WER BLEU ASR-WER BLEU
1-best lattice 19.3 42.2 22.4 37.61-best CN 21.7 40.3 23.3 36.7full CN 7.0 42.4 8.5 38.9
• best result for test in previous work: 37.2 BLEU
• in comparison with previous work on this task, we have
1. a stronger baseline,
2. larger improvements, and
3. much more efficient decoding (4x vs. 25x)
note: all figures in percent
Exploration of Confusion Networks

[Plot: average number per sentence (log scale, 0.1 to 10^10) against path length (0–14), for three curves: CN total, CN explored, 1-best explored.]
JHU CLSP Summer Workshop 2006 – Proposal for Follow-up Research

Exploiting Ambiguous Input in Statistical Machine Translation
Richard Zens
Human Language Technology and Pattern RecognitionLehrstuhl für Informatik 6
Computer Science DepartmentRWTH Aachen University, Germany
Zens: Exploiting Ambiguous Input in SMT, August 17, 2006
Motivation
• MT is often used in a pipeline, i.e. the input to the MT system is the output of another imperfect NLP system, e.g.
– spoken language translation: ASR– segmentation: Chinese words, Arabic tokens– named entity recognition / translation
• traditional approach: ignore problem, i.e. translate 1-best
• result of previous work: improvements if ambiguity is taken into account
Previous Approaches
1. confusion network decoding
• advantages: efficiency, reordering is straightforward
• problem: representing alternative segmentations
2. lattice decoding
• advantage: representing alternative segmentations
• problem: reordering
goal:
⇒ exploit the advantages of both approaches,
⇒ but avoid their weaknesses
Generalized Confusion Networks
• confusion networks:

[Diagram: a linear chain of nodes 0–4; every arc spans exactly one position]
• generalization:

[Diagram: the same chain of nodes 0–4, with additional arcs spanning multiple positions]
– add edges that cover multiple positions
  → representation of alternative segmentations
– do not add nodes
  → retain efficiency, straightforward reordering
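A sketch of the resulting data structure (with invented romanized tokens illustrating a segmentation ambiguity of the Chinese-word-segmentation kind): each column holds arcs (token, probability, span), where span is the number of positions the arc covers, and a plain confusion network is the special case span = 1.

```python
# Each arc: (token, probability, span). Tokens are invented for illustration.
gcn = [
    [("xin", 0.6, 1), ("xinyue", 0.4, 2)],   # position 0: two competing segmentations
    [("yue", 1.0, 1)],                        # position 1 (skipped by the span-2 arc)
    [("cheng", 1.0, 1)],                      # position 2
]

def gcn_paths(gcn, i=0):
    """Enumerate (token tuple, probability) paths, honoring arc spans."""
    if i == len(gcn):
        yield (), 1.0
        return
    for token, prob, span in gcn[i]:
        for rest, p in gcn_paths(gcn, i + span):
            yield (token,) + rest, prob * p
```

Two paths result: ('xin', 'yue', 'cheng') with probability 0.6 and ('xinyue', 'cheng') with 0.4; because no nodes are added, the position-based reordering used for plain CNs still applies.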
Improved Reordering for Lattice Input
• confusion network is approximation of lattice→ valuable information might be lost→ potential improvement when using lattices
• so far:
– only very local reordering on lattices:
  ∗ skip 1 phrase [Zens & Bender+ 05]
∗ switch positions of 2 or 3 phrases [Kumar & Byrne 05]
• idea:
– generalize the reordering scheme used for CNs to lattice input
→ long-range reordering
Goals
• improve robustness to imperfect input
• investigate novel approaches:
– generalized confusion networks
– reordering strategies for lattice input
• perform a systematic comparison in terms of MT quality and computational requirements
• scalability → apply to tasks of different sizes:
small: IWSLT, medium: EPPS/TC-Star, large: NIST/GALE
Targeted Applications
• spoken language translation:
– output of ASR system
– punctuation insertion / sentence boundary detection
– disfluency detection
• named entity recognition / translation
• Chinese word segmentation
• Arabic tokenization
References

[Kumar & Byrne 05] S. Kumar, W. Byrne: Local Phrase Reordering Models for Statistical Machine Translation. Proc. HLT/EMNLP, pp. 161–168, Vancouver, Canada, October 2005.

[Sadat & Habash 06] F. Sadat, N. Habash: Combination of Preprocessing Schemes for Statistical MT. Proc. COLING/ACL, pp. 1–8, Sydney, Australia, July 2006.

[Xu & Matusov+ 05] J. Xu, E. Matusov, R. Zens, H. Ney: Integrated Chinese Word Segmentation in Statistical Machine Translation. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 141–147, Pittsburgh, PA, October 2005.

[Zens & Bender+ 05] R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, H. Ney: The RWTH Phrase-based Statistical Machine Translation System. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 155–162, Pittsburgh, PA, October 2005.

[Zens & Och+ 02] R. Zens, F.J. Och, H. Ney: Phrase-Based Statistical Machine Translation. In M. Jarke, J. Koehler, G. Lakemeyer, editors, 25th German Conf. on Artificial Intelligence (KI 2002), Vol. 2479 of Lecture Notes in Artificial Intelligence (LNAI), pp. 18–32, Aachen, Germany, September 2002. Springer Verlag.