Open Source Toolkit for Statistical Machine Translation:
Factored Translation Models and Lattice Decoding
Final Presentation
Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Chris Callison-Burch, Ondrej Bojar, Brooke Cowan,
Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Evan Herbst, Christine Moran
17 August 2006
Philipp Koehn et al., JHU 2006 WS on MT Final Presentation 17 August 2006
Schedule
• First session: Overview and toolkit development
– Factored models and confusion network decoding (Koehn, Federico)
– Moses toolkit (Hoang, Dyer, Herbst, Callison-Burch, Bertoldi)
• Second session: Experiments
– Experiments in small data settings (Shen, Bojar, Moran, Cowan)
– Factored models for morphologically rich languages (Dyer, Koehn, Cowan, Constantin)
– Confusion network experiments (Zens)
Accomplishments
• Open source toolkit
– advances state-of-the-art of statistical machine translation models
– best performance on the European Parliament task
– competitive on IWSLT and TC-Star
• Factored models
– outperform traditional phrase-based models
– framework for a wide range of models
– integrated approach to morphology and syntax
• Confusion networks
– exploit ambiguous input and outperform 1-best
– enable integrated approach to speech translation
Phrase-Based Translation
er geht ja nicht nach hause
he does not go home
• Foreign input is segmented into phrases
– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English, phrases are reordered
• Log-linear model: many feature functions h_i(e, f) with weights λ_i, combined into an overall score ∑_i λ_i h_i(e, f) → easy to extend
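As an illustration (not Moses code), the log-linear combination is just a weighted sum of feature values; the feature values and weights below are invented numbers:

```python
import math

def loglinear_score(features, weights):
    """Overall score: sum_i lambda_i * h_i(e, f)."""
    return sum(lam * h for lam, h in zip(weights, features))

# Invented feature values for one sentence pair:
# [log p_TM, log p_LM, distortion penalty, word penalty]
h = [math.log(0.25), math.log(0.1), -2.0, -4.0]
weights = [1.0, 0.8, 0.3, 0.1]
score = loglinear_score(h, weights)
```

Adding a new feature function only means appending one more value and one more weight, which is what makes the framework easy to extend.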
Translation
• Task: translate this sentence from German into English
er geht ja nicht nach hause
Translation step 1
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er → he
• Pick phrase in input, translate
Translation step 2
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er ja nicht → he does not
• Pick phrase in input, translate
– words may be picked out of sequence (reordering)
– phrases may have multiple words: many-to-many translation
Translation step 3
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er geht ja nicht → he does not go
• Pick phrase in input, translate
Translation step 4
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er geht ja nicht nach hause → he does not go home
• Pick phrase in input, translate
Translation options
[Figure: table of translation options for "er geht ja nicht nach hause", e.g. er → he / it; geht → goes / go / is; ja → yes / of course; nicht → not / do not / is not; nach → after / to / according to; hause → house / home / chamber; plus multi-word options such as ja nicht → does not, nach hause → home / return home]
• Phrase translation tables provide many translation options
• Learned from automatically word-aligned corpora
Translation options
• The machine translation decoder does not know the right answer
→ Search problem solved by heuristic beam search
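The beam search can be sketched as a simplified stack decoder in the spirit of these slides (not the Moses implementation): hypotheses are binned by the number of covered source words, recombined by coverage set, and pruned per stack. The toy phrase table and log-probabilities below are invented for illustration:

```python
def beam_decode(source, phrases, beam_size=10):
    """Toy stack decoding: stacks[k] holds hypotheses covering k source words,
    each keyed by its coverage set for hypothesis recombination."""
    n = len(source)
    stacks = [dict() for _ in range(n + 1)]   # coverage -> (score, output)
    stacks[0][frozenset()] = (0.0, ())
    for k in range(n):
        # prune: keep only the best `beam_size` hypotheses in this stack
        best = sorted(stacks[k].items(), key=lambda kv: -kv[1][0])[:beam_size]
        for cov, (score, out) in best:
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = tuple(source[i:j])
                    # skip unknown phrases and already-covered positions
                    if span not in phrases or any(p in cov for p in range(i, j)):
                        continue
                    for trans, logprob in phrases[span]:
                        ncov = cov | frozenset(range(i, j))
                        cand = (score + logprob, out + (trans,))
                        stack = stacks[len(ncov)]
                        # recombination: keep the best hypothesis per coverage
                        if ncov not in stack or cand[0] > stack[ncov][0]:
                            stack[ncov] = cand
    return stacks[n].get(frozenset(range(n)))

# Hypothetical phrase table with made-up log-probabilities:
PHRASES = {
    ("er",): [("he", -0.2), ("it", -0.9)],
    ("geht",): [("goes", -0.3)],
    ("ja", "nicht"): [("does not", -0.4)],
    ("geht", "ja", "nicht"): [("does not go", -0.5)],
    ("nach", "hause"): [("home", -0.3)],
}
```

A real decoder would also score the language model, distortion, and future cost; this sketch keeps only the phrase translation score to show the stack and recombination mechanics.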
Decoding process: precompute translation options

er geht ja nicht nach hause
Decoding process: start with initial hypothesis

er geht ja nicht nach hause
Decoding process: hypothesis expansion

[Figure: initial hypothesis for "er geht ja nicht nach hause" expanded with a first option, "are"]
Decoding process: hypothesis expansion

[Figure: alternative first-phrase expansions "are", "it", "he"]
Decoding process: hypothesis expansion

[Figure: hypotheses extended further with options such as goes, does not, yes, go, to, home]
Decoding process: find best path

[Figure: best-scoring path through the expanded hypothesis graph]
Statistical machine translation today
• Best performing methods based on surface word phrases
– use mappings of short chunks of text (mostly 1-3 words)
– sophisticated methods for phrase extraction and modeling (EM algorithm, generative models, discriminative training)
• Translation solely based on surface forms of words
– no use of explicit syntactic information
– no use of morphological information
• How can we build richer models?
One motivation: morphology
• Current models treat house and houses as completely different words
– training occurrences of house have no effect on learning the translation of houses
– if we only see house, we do not know how to translate houses
– rich morphology (German, Arabic, Finnish, Czech, ...) → many word forms
• Better approach combines evidence for house and houses
– analyze surface word forms into lemma and morphology, e.g.: Haus +plural
– translate lemma and morphology separately, e.g.: Haus → house; +pl → +pl
– generate target surface form, e.g.: house +pl → houses
Factored translation models
• Factored representation of words
[Figure: input and output words represented as vectors of factors: word, lemma, part-of-speech, morphology, word class, ...]

• Benefits
– generalization, e.g. by translating lemmas, not surface forms
– richer model, e.g. using syntax for reordering, language modeling
Example factored model
• Our example as factored model:
[Figure: factored model mapping input {word, lemma, morphology} to output {word, lemma, morphology}]
• Translation process broken up into mapping steps
– translation of lemma
– translation of morphology
– generation of word from lemma and morphology
Expansion of input phrase
• Probabilistic mapping steps
– translation step: lemma → lemma
  haus → house, home, chamber, ...
– translation step: morphology → morphology
  single-noun → single-noun, single-pronoun, plural-noun, ...
– generation step: lemma, morphology → word
  house, single-noun → house
  house, plural-noun → houses
• Still a phrase model
– translation steps may map phrases
  nach hause → home, return home
– generation steps operate on single words
– traditional phrase models are a special case: single-factor models
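A minimal sketch of this expansion for the haus example, with invented probabilities and table entries (the real model combines scores log-linearly rather than multiplying raw probabilities):

```python
# Hypothetical mapping tables (illustrative entries only)
LEMMA_T = {"haus": [("house", 0.7), ("home", 0.2), ("chamber", 0.1)]}
MORPH_T = {"sg-noun": [("sg-noun", 0.8), ("pl-noun", 0.2)]}
GEN = {("house", "sg-noun"): "house", ("house", "pl-noun"): "houses",
       ("home", "sg-noun"): "home", ("chamber", "sg-noun"): "chamber"}

def expand(lemma, morph):
    """Expand one factored input word into scored surface-form options:
    translate lemma and morphology separately, then generate the word."""
    options = []
    for tl, p1 in LEMMA_T.get(lemma, []):
        for tm, p2 in MORPH_T.get(morph, []):
            surface = GEN.get((tl, tm))
            if surface is not None:      # the generation step may rule tuples out
                options.append((surface, p1 * p2))
    return sorted(options, key=lambda x: -x[1])
```

Note how houses is produced even though only Haus (singular) entries might exist at the surface level: the generalization comes from translating lemma and morphology separately.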
Computational complexity of mapping steps
• Number of factored expansions may grow exponentially
• Key insights to reduce complexity for a given input sentence:
– expansions can be pre-computed and stored as translation options
– translation options can be pruned early
• Future work: problems with more complex models need to be addressed
– we had problems using some models with three or more steps
– see student proposals (Hoang, Dyer) for solutions
Spoken Language Translationwith Confusion Networks
Marcello Federico, Nicola Bertoldi, Wade Shen, Richard Zens
August 17, 2006
Marcello Federico, ITC-irst Trento Project Summary August 17, 2006
Outline
• Spoken language translation
• Approaches to SLT
• Confusion network decoding
• Computational issues
• Implementation in Moses
• Language model interface
• Other applications of confusion networks
Spoken Language Translation
Translation from speech input is likely more difficult than translation from text input:
• many styles and genres: formal read speech, unplanned speeches, interviews, spontaneous conversations, ...
• less controlled language: relaxed syntax, spontaneous speech phenomena
• automatic speech recognition is prone to errors: possible corruption of syntax and meaning
This work addresses methods to improve the performance of spoken language translation by better integrating speech recognition and machine translation models.
Integrating Speech Recognition and Translation
• Correlation between transcription word-error-rate and translation quality:
[Plot: BLEU score (y-axis, 38.5 to 42.5) vs. WER of transcriptions (x-axis, 14 to 21): translation quality degrades as transcription WER increases]
• Better transcriptions may have been considered during ASR decoding but discarded due to lower scores
• Potential for improving translation quality by exploiting more transcription hypotheses generated during ASR.
Statistical Spoken Language Translation
• Let o be the spoken input in the foreign language
• let F(o) be a set of possible transcriptions of o
Goal: find the best English translation through the approximate criterion:

e∗ = arg max_e Pr(e | o) ≈ arg max_e max_{f ∈ F(o)} Pr(e, f | o)
Pr(e, f | o) is computed with a log-linear model incorporating:
• acoustic features, i.e. probs that some foreign words are in the input
• linguistic features, i.e. probs of foreign and English sentences
• translation features, i.e. probs of foreign phrases into English
• alignment features, i.e. probs for word re-ordering
ASR Word Graph
A very general set of transcriptions F(o) can be represented by a word-graph:
• directly computed from the ASR word lattice (e.g. HTK format, lattice tool)
• provides a good representation of all hypotheses analyzed by the ASR system
• arcs are labeled with words, acoustic and language model probabilities
• paths correspond to transcription hypotheses for which probabilities can be computed
[Figure: example ASR word graph]
Approaches to Spoken Language Translation
The previous statistical framework includes several alternative implementations:
• 1-best translation: translate only the most probable hypothesis in the word graph
– pros: very efficient
– cons: no potential to recover from recognition errors in the 1-best transcription
• N-best translation: translate only the N most probable hypotheses in the word graph
– pros: can exploit more accurate transcriptions in the word graph
– cons: N must be large in order to include good transcriptions, and decoding time increases linearly with N
Approaches to Spoken Language Translation
• Transducer: compose the word graph with a translation FSN and apply a transducer algorithm
– pros: straightforward method that works on the full word graph
– cons: computationally prohibitive with large-vocabulary tasks and long-range word re-ordering
• Confusion network: translate a suitable approximation of the word graph
– pros: effectively explores all paths in the word graph, with no problems in re-ordering
– cons: can only exploit limited information in the word graph
Confusion Network (Mangu 1999)
A confusion network approximates a word graph with a linear network, s.t.:
• arcs are labeled with words or with the empty word (ε-word)
• arcs are weighted with word posterior probabilities
• paths are a superset of those in the word graph
[Figure: example confusion network: a linear sequence of columns whose arcs carry words (or ε) and posterior probabilities]
CNs can be conveniently represented as sequences of columns of different depths.
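Such a column sequence is easy to hold in memory: a minimal sketch, with posteriors loosely based on the example figures (here "eps" marks the empty word):

```python
# A confusion network as a list of columns; each column maps a word to its
# posterior probability. Probabilities in a column sum to roughly 1.
cn = [
    {"era": 0.997, "e": 0.002, "eps": 0.001},
    {"cancello": 0.995, "vacanza": 0.004, "eps": 0.001},
    {"eps": 0.999, "la": 0.001},
]

def path_posterior(cn, words):
    """Posterior of one transcription hypothesis (one word per column)."""
    p = 1.0
    for col, w in zip(cn, words):
        p *= col.get(w, 0.0)
    return p
```

Because the network is linear, every hypothesis picks exactly one arc per column, which is what makes CN decoding so much simpler than full word-graph decoding.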
Confusion Network Decoding:
Extension of basic phrase-based decoding step:
• cover some not yet covered consecutive columns (span)
• retrieve phrase-translations for all paths inside the columns
• compute translation, distortion and target language model scores
Example. Coverage set: 01110...  Path: cancello d’

  era 0.997   cancello 0.995   ε  0.999   di   0.615   imbarco 0.999
  e   0.002   vacanza  0.004   la 0.001   d’   0.376   bar     0.001
  ε   0.001   ε        0.002              all’ 0.005
                                          l’   0.002
                                          ε    0.001
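Enumerating the source phrases spanned by a set of consecutive columns amounts to a cartesian product over those columns, dropping ε-arcs; a sketch of the idea (illustrative only, using "eps" for the empty word):

```python
from itertools import product

def span_paths(cn, i, j, prune=0.0):
    """All word sequences (eps removed) over columns i..j-1 with posteriors.
    Sequences reachable via different eps placements have their mass merged."""
    paths = {}
    for combo in product(*(col.items() for col in cn[i:j])):
        words = tuple(w for w, _ in combo if w != "eps")
        p = 1.0
        for _, q in combo:
            p *= q
        if p > prune:
            paths[words] = paths.get(words, 0.0) + p
    return paths
```

The number of combinations grows exponentially with the span length, which is exactly the computational issue addressed on the next slide.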
Confusion Network Decoding
Computational issues:
• Number of paths grows exponentially with span length
• Implies look-up of translations for a huge number of source phrases
• Factored models require considering joint translations over all factors (tuples):
– cartesian product of all translations of each single factor
Solutions implemented in Moses:
• Source entries of the phrase-table are stored with prefix-trees
• Translations of all possible coverage sets are pre-fetched from disk
• Efficiency achieved by incrementally pre-fetching over the span length
• Phrase translations over all factors are extracted independently, then translation tuples are generated and pruned, adding one factor at a time
Once translation tuples are generated, usual decoding applies.
Implementation into Moses
• Input format: CN input can be rather large, so it is better to put one word position per line:

Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0

Each line represents the alternatives at one word position with their probabilities.
• Factored confusion networks: alternatives are over the full factor space:
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
Note: a confusion network can be projected onto single factors.
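A parser for this one-column-per-line format takes only a few lines; this is a sketch of the format, not the Moses reader:

```python
def parse_cn(text):
    """Parse CN input: one column per line, alternating 'word prob' tokens."""
    cn = []
    for line in text.strip().splitlines():
        toks = line.split()
        # pair up word/probability tokens into one column dictionary
        col = {toks[k]: float(toks[k + 1]) for k in range(0, len(toks), 2)}
        cn.append(col)
    return cn

sample = """Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0"""
```

The factored variant is parsed the same way; each "word" token is then itself split on `|` into its factors.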
Implementation into Moses
Decoding CN with Factored Models
• at each step of the search, a portion of the CN is explored, e.g.

... ...
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
... ... ... ...

... and translations are looked up for each factor.
Features:
• Efficiency by pre-filtering possible translations for each factor
• Confusion network handling is completely hidden from the rest of the decoder.
Other Applications of Confusion Networks
Translation tasks with ambiguous input:
• linguistic annotation for factored models
– avoid hard decisions by linguistic tools; instead provide alternative annotations with their respective scores
– e.g. for particularly ambiguous part-of-speech tags
• insertion of punctuation marks missing in the input
– model all possible insertions of punctuation marks in the input
• translation of input similar to that produced by speech recognition
– e.g. OCR output for optical text translation
• ...
Language Model Interface
• Features
– compact binary format for very large language models
– quantization of probabilities (8 bits)
– fast loading of the language model from disk
– loading of n-grams on demand
• Comparison with SRI LM Toolkit
– memory: 50% less with large quantized models
– speed: 10% slower in decoding with a 3-gram LM
• Recent work and improvements
– speed-up by directly storing log-probs
– addition of a memory cache on the n-gram internal data structure
– analysis of LM score computations by the search algorithm
– caching of probabilities and LM states
(the search algorithm requests the same probabilities many times)
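The 8-bit quantization idea can be illustrated with uniform binning of log-probabilities into 256 levels; the actual implementation may use a different codebook, so this is only a sketch:

```python
def quantize(logprobs, bits=8):
    """Map each log-prob to one of 2**bits codebook levels (uniform binning).
    Returns the per-value codes and the shared codebook of level centers."""
    levels = 2 ** bits
    lo, hi = min(logprobs), max(logprobs)
    step = (hi - lo) / (levels - 1) or 1.0   # avoid zero step if all equal
    codes = [round((x - lo) / step) for x in logprobs]
    centers = [lo + c * step for c in range(levels)]
    return codes, centers
```

Each stored value shrinks to one byte, at the cost of a bounded rounding error of at most half a bin width, which is the memory/accuracy trade-off behind the 50% figure above.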
Requests of N-grams by Decoder
Requests of 3-gram probabilities during decoding of a single sentence: about 1.6M requests involving about 120K distinct 3-grams.
Conclusions
Implementation work
• Efficient on-demand pre-fetching of phrase translations
• Tuning of parameters for confusion network decoding
• Language model interface and pre-fetching of n-grams
Development of state-of-the-art baselines for SLT
• IWSLT BTEC Chinese-English SLT
– submissions to the IWSLT 2006 evaluation
• EPPS Spanish-English SLT
– performance comparable with the best TC-STAR systems
Achievements
• SLT decoder more efficient than current implementations by IRST and MIT/LL
• works with large-data tasks and large confusion networks
• works with factored confusion networks
Engineering Results
JHU SWS 2006, Aug 17, 2006
Open software, so what?

State of the world, June 2006
“Black box” decoder (Pharaoh) widely used
20+ citations in this year’s ACL Proceedings alone
Ubiquitous baseline system
But… it is difficult to extend
New features limited to what can be expressed in the existing phrase-table format (source, target, feature vector)
Many interesting projects require reinventing the wheel just to change one spoke
Software Goals

Accessibility
Easy to maintain
Flexibility
Easy for distributed team development
Portability
Accessibility

Easy to read
“Nothing should be a black box”
Descriptive names
Uniform coding style
Available immediately
Source code on Sourceforge.net
Cross-platform compatibility
Windows, Linux, MacOS X, 64-bit OS
void Load(const std::string &fileName, FactorCollection &factorCollection, FactorType factorType, float weight, size_t nGramOrder);
Easy to Maintain

Modular code
Team development
Object-oriented framework
Integrated documentation framework
Using Doxygen
Easy-to-maintain Wiki documentation on the Web
Documentation
Extensibility

Open architecture designed for extensibility
Architecture matches theoretical descriptions of phrase-based MT models
Short ramp-up time for researchers familiar with SMT but not with any particular decoder
Feature function evaluation decoupled from search algorithms
Facilitates experimentation with new classes of feature functions
Modular design
Framework to allow different replacements of all parts of the decoder
Multiple implementations of translation tables
Language models
Different types of models
Case Study: Lexicalized Reordering

Very successful model, but implementation not possible with a “black box” decoder
With Moses, anyone with an idea can try it
Adding support for LR models to moses required code changes in four (relatively logical) locations:
Feature-function base class (ScoreProducer) extended, logic for feature value computation implemented
Enable the model based on configuration
Call to evaluate the feature function when extending a hypothesis
Add the feature values to n-best list output for tuning algorithms
Regression Testing

Pharaoh scores used as baseline, updated as models changed (for example, hypothesis recombination based on LM state rather than n-gram order)
Detailed logging enables strict test coverage for all model types
Regression test suite was run approximately 3000 times during the workshop
Accomplishments

Code contributions from every member of the team
Performance improvements:
Day 1: 5.01 sec/sentence avg decoding time
Today: 1.43 sec/sentence avg decoding time
Summary

State of the world, August 2006
“White box” multi-factored decoder (Moses) available
Drop-in replacement for Pharaoh
Further experimentation and development anticipated at: Aachen, Charles University, Cornell, Edinburgh, IRST, MIT, Lincoln Labs, UMD… and many more.
Code size
10,000 lines at the beginning of the workshop
16,000 lines now
System tuning
• Log Linear Model
e∗ = arg max_e Pr(e | f) = arg max_e p_λ(e | f) = arg max_e ∑_i λ_i h_i(e, f)    (1)
• real-valued feature functions:
– model a specific component of the translation process: fluency, adequacy, reordering, ...
– statistical models are estimated on specific training data
• feature weights:
– balance ranges of feature scores
– weight the importance of features
– tuned through Minimum Error Training (MET)
Nicola Bertoldi, ITC-irst Minimum Error Training August 17, 2006
Minimum Error Training
• automatic procedure to optimize feature weights
• minimization of translation errors
• development set (f , ref)
• automatic error function Err(e; ref): (100-BLEU) score
e∗ = e∗(λ) = arg max_e p_λ(e | f)    (2)

λ∗ = arg min_λ Err(e∗(λ); ref)    (3)
• Err(e) is not mathematically well-behaved ⇒ no exact solution
• approximate iterative algorithms: gradient descent, downhill simplex
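The inner optimization can be illustrated by brute-force search over a discretized weight grid, rescoring a fixed n-best list; this is a toy sketch of the idea, not the workshop implementation:

```python
import itertools

def rescore_1best(nbest, weights):
    """Pick the top hypothesis per sentence under the given weights.
    nbest: per-sentence lists of (translation, feature_vector)."""
    return [max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[1])))[0]
            for hyps in nbest]

def met_grid(nbest, refs, error_fn, grid):
    """Search a discretized weight grid for the lowest error on the dev set."""
    best = None
    for weights in itertools.product(*grid):
        err = error_fn(rescore_1best(nbest, weights), refs)
        if best is None or err < best[0]:
            best = (err, weights)
    return best
```

The outer loop would then re-decode with the winning weights, append the new n-best lists, and repeat until the translations stop changing.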
CLSP-WS solution for MET
[Diagram: outer loop: Moses decodes the dev input with the current weights and produces n-best lists; the Extractor collects features. Inner loop: the Optimizer and Scorer tune weights on the n-best lists against the reference until convergence, yielding the optimal weights.]
• outer loop:
– decoding with the current lambdas
– generation of n-best translations
– addition to previously collected translations
• inner loop:
– optimization over the n-best lists
– decoder and “random” weights as initial points
• optimizer:
– iterative optimization of single weights
– discretization of the r-dimensional weight space
MET vs. size of nbest list
• German-English EuroParl task
• tuning on dev set of 2000 sentences
• evaluation on test set of 2000 sentences
• convergence in 5-6 iterations:
– good: faster outer loop
• no impact of n-best size:
– good: faster inner loop
[Plot: BLEU (18 to 26) vs. MET iteration (0 to 14) for n-best sizes 100, 200, 400, 800; all curves converge to similar scores]
MET vs. size of development set
• extraction of 4 subsets: 100, 200, 400, 800 sentences
• larger dev set:
– more stable results
– fewer iterations
– better results
• bad:
– overfitting
– a large dev set slows the outer loop (decoding)
[Plot: BLEU (0 to 30) vs. MET iteration (0 to 18) for dev sets of 100, 200, 400, 800 and 2000 sentences]
                 iterations   BLEU
100 sentences        18       24.3
200 sentences        15       25.1
400 sentences        16       24.6
800 sentences        14       24.9
2000 sentences        9       25.3
MET vs. optimization algorithm
• task: Spanish-English EPPS, speech input
• dev set of 2643 Confusion Networks, test set of 1073 CNs
• CLSP-WS algorithm vs. downhill simplex (RWTH)
                    iterations   ∆ BLEU (dev)   ∆ BLEU (test)
CLSP-WS algorithm        4           +1.0            +0.4
downhill simplex         7           +2.9            +3.4
• mismatch between internal score of CLSP-WS algorithm and official score
• better performance of the downhill simplex algorithm
• post-workshop investigation
Moses in parallel
• effective R&D cycle:
– fast experiments
• computing facilities:
– 6 clusters, 200 machines
• parallelization of translation
• ‘split and merge’ technique
• translation time:
– splitting/merging ≈ constant, negligible
– access to cluster related to cluster load
– loading data ≈ constant
– decoding ∝ input length
[Diagram: the Splitter divides the source input into parts 1..N; each part is translated by a Moses instance on a (remote) cluster machine; the Merger joins translations 1..N into the final translation]
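The split-and-merge idea can be sketched sequentially (in practice each part would be dispatched to a different cluster machine rather than translated in a loop):

```python
def split_merge(sentences, n_parts, translate_part):
    """Split the input into parts, translate each part, merge in order.
    translate_part stands in for a Moses invocation on one cluster node."""
    size = -(-len(sentences) // n_parts)   # ceiling division
    parts = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    results = [translate_part(p) for p in parts]   # dispatched in parallel in reality
    return [t for part in results for t in part]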
Nicola Bertoldi, ITC-irst Moses in parallel August 17, 2006
Moses in parallel
• Spanish-English EuroParl task
• CLSP cluster, 18 machines
• no control of cluster load
                 standard   1 job   5 jobs   10 jobs   20 jobs
10 sentences        6.3      13.1     9.0       9.0        –
100 sentences       5.2       5.6     3.0       1.7       1.7
1000 sentences      6.3       6.5     2.0       1.6       1.1

Average time per sentence (seconds).
Decoder Output Analysis
Evan Herbst
8 / 17 / 06
Evan Herbst Decoder Output Analysis 8 / 17 / 06
Measurables
• Difficulty
– perplexity
• Error
– WER
– PWER
– BLEU
– confidence intervals
• Significance
– t-test
– sign test
Definition: Perplexity
Measure the likelihood of a corpus given a model (e.g. a language model):

PX = 2^(−(1/N) ∑_i log2 p_LM(w_i)),   w_i words
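A direct transcription of the formula, assuming the inputs are per-word model probabilities:

```python
import math

def perplexity(word_probs):
    """PX = 2 ** (-(1/N) * sum_i log2 p(w_i)) over per-word probabilities."""
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)
```

For example, a model that assigns every word probability 1/4 has perplexity 4: it is as uncertain as a uniform choice among four words.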
Definition: WER
Word Error Rate: modified edit distance
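A standard dynamic-programming sketch, normalized by reference length:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(ref) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(hyp)][len(ref)] / len(ref)
```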
Definition: PWER
Position-independent Word Error Rate: match bags of words
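Matching bags of words ignores ordering entirely. One common formulation (there are variants) counts the words left unmatched after multiset intersection and normalizes by reference length:

```python
from collections import Counter

def pwer(hyp, ref):
    """Position-independent WER: unmatched words after bag-of-words
    matching, divided by reference length (one common formulation)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum((h & r).values())  # multiset intersection
    errors = max(len(hyp.split()), len(ref.split())) - matched
    return errors / len(ref.split())

# Word order does not matter, so only "goes" counts as an error here:
print(pwer("home go he", "he goes home"))
```

Since PWER ignores reordering while WER does not, the PWER/WER ratio in the table below is a rough indicator of how much reordering error the system makes.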
Definition: BLEU
BiLingual Evaluation Understudy: n-gram precision and length comparison
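A single-reference sketch of the metric: the geometric mean of modified (clipped) n-gram precisions for n = 1..4, multiplied by a brevity penalty for hypotheses shorter than the reference. Real BLEU is computed over a whole corpus with multiple references; this per-sentence version is only illustrative.

```python
import math
from collections import Counter

def bleu(hyp, ref, max_n=4):
    """Single-reference, single-sentence BLEU sketch."""
    h, r = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        clipped = sum((h_ngrams & r_ngrams).values())  # clipped counts
        total = max(sum(h_ngrams.values()), 1)
        # Floor at a tiny value so a zero precision does not zero the log.
        log_prec += math.log(max(clipped, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(r) / len(h)))  # brevity penalty
    return bp * math.exp(log_prec)

print(round(bleu("he does not go home", "he does not go home"), 2))  # 1.0
```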
Numbers
Dataset: 2000-sentence Europarl subset
                 Pharaoh                                    Moses baseline
                 de-en               en-de                  de-en               en-de
BLEU             .2557               .1775                  .2554               .1776
WER              .5432               .6144                  .5428               .6145
PWER/WER         .865                .940                   .865                .947
Lemma BLEU       .2625               .2170                  .2622               .2180
N-gram prec.     .609/.315/.188/.119 .519/.223/.122/.070    .609/.314/.188/.119 .519/.223/.122/.070
Perplexity       40.97               62.01                  40.94               61.77
Ref. perplexity  68.81               125.29                 68.81               125.29
Inferences
• lemmas vs. surface: morphology
• output vs. reference perplexity: fluency
• PWER/WER ratio: reordering; phrase tables
Tool: Comparison
Tool: Alignment
Suffix Arrays for More Statistics (and Less Disk Space!)
Chris Callison-Burch
August 17, 2006
Phrase Tables in Statistical Machine Translation
• Using longer phrases leads to better translation quality
• Phrase tables can become unwieldy with long phrases
• Problem of large tables is compounded for factored translation models
Phrase Tables in Factored Translation Models
• Translation tables between source and target phrases, POS tags, stems, morphological markers, etc.
• Plus generation tables
• Want longer sequences for factors with smaller tag sets
• Number of tables depends on the number of conditioning variables and on back-off strategies
• Potentially more tables than all pairwise combinations of factors
Ad Hoc Solutions
• Limit length of phrases
• Only extract phrases for test data
• Make unnecessary independence assumptions
Proposed Solution: Intelligent Data Structure
• Uses less memory than table-based data structures
• Allows us to condition on whatever factors we want and to back off easily
• Retrieve translation / generation probabilities for arbitrarily long sequences
• Suffix arrays to index parallel corpus
How Suffix Arrays Work
Corpus (word indices 0–9): Spain declined to confirm that Spain declined to aid Morocco

Initialized, unsorted suffix array: s[i] points to the suffix starting at word i, e.g.
s[0] = "Spain declined to confirm that Spain declined to aid Morocco", …, s[9] = "Morocco".
Alphabetically Sorted
Sorted suffix array: [8, 3, 6, 1, 9, 5, 0, 4, 7, 2], i.e. the suffixes in alphabetical order:
s[0] → aid Morocco
s[1] → confirm that Spain declined to aid Morocco
s[2] → declined to aid Morocco
s[3] → declined to confirm that Spain declined to aid Morocco
s[4] → Morocco
s[5] → Spain declined to aid Morocco
s[6] → Spain declined to confirm that Spain declined to aid Morocco
s[7] → that Spain declined to aid Morocco
s[8] → to aid Morocco
s[9] → to confirm that Spain declined to aid Morocco
(Reasonably) Fast Find
[Same sorted suffix array as above: a phrase is located with two binary searches over s[0]…s[9], giving all its corpus occurrences in O(log N) comparisons.]
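The construction and lookup can be sketched in a few lines. The sort key is case-folded so the order matches the slide's alphabetical listing; the binary search materializes the sorted prefix list for brevity, where a real implementation would compare suffixes lazily.

```python
import bisect

corpus = "Spain declined to confirm that Spain declined to aid Morocco".split()
words = [w.lower() for w in corpus]  # case-folded, as on the slide

# Suffix array: start positions, ordered alphabetically by the suffix
# beginning at each position.
sa = sorted(range(len(words)), key=lambda i: words[i:])
print(sa)  # [8, 3, 6, 1, 9, 5, 0, 4, 7, 2]

def occurrences(phrase):
    """All corpus positions of a word sequence, via binary search over
    the sorted suffixes (here via a materialized prefix list)."""
    n = len(phrase)
    prefixes = [words[i:i + n] for i in sa]  # sorted along with sa
    lo = bisect.bisect_left(prefixes, phrase)
    hi = bisect.bisect_right(prefixes, phrase)
    return sorted(sa[lo:hi])

print(occurrences(["spain", "declined"]))  # [0, 5]
```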
Applied to Factored Translation Models
Factored corpus (word indices 0–9):
words: Spain declined to confirm that Spain declined to aid Morocco
POS:   NNP VBD TO VB IN NNP VBN TO VB NNP
stems: spain declin to confirm that spain declin to aid morocco
• Index each factor
• Store word-level alignments
• Calculate probabilities on the fly
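The "on the fly" calculation amounts to finding all occurrences of a surface phrase and counting the factor sequences at those positions. The sketch below uses a linear scan where the suffix array would give logarithmic lookup; the function name is my own, not part of Moses.

```python
from collections import Counter

# Word and POS factors of the example corpus (POS sequence from the slide).
words = "Spain declined to confirm that Spain declined to aid Morocco".split()
pos = "NNP VBD TO VB IN NNP VBN TO VB NNP".split()

def pos_given_words(phrase):
    """p(POS sequence | surface phrase), estimated by relative frequency
    over the corpus occurrences of the phrase."""
    n = len(phrase)
    counts = Counter(
        tuple(pos[i:i + n])
        for i in range(len(words) - n + 1)
        if words[i:i + n] == phrase
    )
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()}

print(pos_given_words(["Spain", "declined"]))
# {('NNP', 'VBD'): 0.5, ('NNP', 'VBN'): 0.5}
```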
Generation Probabilities
p(NNP VBN | Spain declined) = 0.5
p(NNP VBD | Spain declined) = 0.5

[Figure: the sorted suffix array locates both occurrences of "Spain declined" (positions 0 and 5); the POS factors at those positions, NNP VBD and NNP VBN, give the generation counts.]
Generation Probabilities
p(Spain | NNP) ≈ 0.667
p(Morocco | NNP) ≈ 0.333

[Figure: a second suffix array indexes the POS factor sequence NNP VBD TO VB IN NNP VBN TO VB NNP (sorted order [4, 9, 0, 5, 2, 7, 3, 8, 1, 6]); the three NNP positions generate "Spain" twice and "Morocco" once.]
Translation Probabilities
[Figure: the English corpus "Spain declined to confirm that Spain declined to aid Morocco" word-aligned with the French "L'Espagne a refusé de confirmer que l'Espagne avait refusé d'aider le Maroc"; the sorted suffix array locates both occurrences of "Spain declined", and the stored word alignments give the translation counts.]

p(L'Espagne a refusé de | Spain declined) = 0.5
p(l'Espagne avait refusé d' | Spain declined) = 0.5
Translation Probabilities
[Figure: the same word-aligned English-French corpus, with the POS factor sequence NNP VBD TO VB IN NNP VBN TO VB NNP indexed by a second suffix array; conditioning on both the surface phrase "Spain declined" and the POS sequence NNP VBN selects only the second occurrence.]

p(l'Espagne avait refusé d' | Spain declined, NNP VBN) = 1
Advantages
• Memory reduction
– Memory = 2 × num factors × corpus + word alignments
– Significantly less than phrase tables!
• Greater range of statistics
– Arbitrary number of conditioning variables– Allows range of back-off strategies
• Can extract statistics for arbitrarily long sequences
Research to be Undertaken
• Integrate into Moses decoder
• Deal with increased computational complexity
• Change search strategies to incorporate longer factor sequences of different levels of granularity
• Experiment to test if longer sequences improve translation quality
• Experiment with what variables to condition upon, how to back off
MIT Lincoln + Computer Science AI Labs
Charles University

Wade Shen, Brooke Cowan, Ondrej Bojar and Christine Moran

Factored Translation Models for Small Data Problems
Experiments with Spanish, Czech and Chinese

8/14/2006
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
General Motivations
Challenges with Small Data

• Phrase-based MT relies on large data
– learn "phrase" co-occurrence within a language
– learn translation templates/phrases across languages

• Problems for phrase-based MT with small data
– word alignment
– hard to see enough phrases (coverage), especially in morphologically rich languages
– tend to rely on shorter phrases
  → increased local agreement problems
  → increased long-distance coherence problems
Possible Advantages of Factored Models
Generalization over Morphology

• We can model morphological variation and phrase translation separately for better statistics: translation + generation

– Spanish gender:
  masculine: Él es un jugador rojo (morph: m 3p+sing m m m)
  feminine:  Ella es una jugadora roja (morph: f 3p+sing f f f)
  English for both: "he/she is a red player"; lemmas: el ser un jugador roj

– Czech case:
  nominative plural: černé kočky (morph: nom+pl nom+pl)
  dative plural:     černým kočkám (morph: dat+pl dat+pl)
  English for both: "black cats"; lemmas: černá kočka
Factors as Type Checking
Long Range Phenomena and Divergence

• Long range dependencies can be modeled with latent factors
– Spanish: verb-subject number agreement
• Verb-argument dependencies

Spanish: Mi hija de dos años tiene catarro (gloss: My daughter of two years has cold)
  — subject 3p+sing agrees with verb 3p+sing
Czech: Nachlazena je moje dvouletá dcera. — the same agreement holds despite different word order

Czech: Napsal zprávu o matčině domu na papír (gloss: He wrote a message about mother's house on a paper)
  — the verb selects an accusative noun
Czech: Našel zprávu o matčině domu na papíře (gloss: He found a message about mother's house on a paper)
  — the verb selects a locative noun
Phrase-Level Generalization
• Class-based divergences
– Chinese-English resultative constructions, e.g. gloss "you hit broken done it" → English "you broke it"; a similar pattern holds for a large class of verbs

• Longer distance movement dependencies
– Chinese-English questions: 你 要 回答 [clause…] 吗, gloss "you want reply [clause…] y/n-marker" → English "would you like to reply to [clause…]?"
– the final question particle (tag: Part) causes verb-specific reordering (tags: VModal, Pn)
Large vs. Small Data
How Generalizations May Affect SMT Performance

• With large data sets these phenomena can be learned
– language models should capture local agreement phenomena given enough data
– long range agreement/coherence still problematic
– generalization may still be better, but errors in analysis can limit it

• Generalization may be advantageous for small data
– for example (Spanish/Czech agreement): can't learn every noun/adjective/determiner triple
– this is the situation for many real-world problems
Outline
• Motivations
• Experimental Design and Baselines– Approaches– Data Sets
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Data Sets and Baselines
Data set       Direction        Size                        Baseline w/ diff. LMs (BLEU, surface)
Full Europarl  English→Spanish  950k LM train, 700k bitext  3g 29.35; 4g 29.57; 5g 29.54
Euromini       English→Spanish  60k LM train, 40k bitext    3g 23.41; 3g (950k) 25.10
Czech WSJ      English→Czech    20k LM train, 20k bitext    3g 25.82 (four references)
IWSLT Chinese  Chinese→English  40k LM train, 40k bitext    4g 19.54 (seven references)
Using Factored Models
Approaches for Small-Data Tasks

• Factored models we tried
– different levels of linguistic information modeled separately (example: morphology vs. phrasal content)
– feature "checking" of existing phrasal models with LMs over factors
– generalized factor-based distortion: a phrase is likely to move distance X if the preceding word has tag Y

• Hypothesis: these models allow better utilization of limited training data
Good (high likelihood):  words "I would like some donuts",   POS pn mod vb det np
Bad (low likelihood):    words "I would like some big jump", POS pn mod vb det adj vb
Different Factored Approaches
Overview of Models Tried

Supervised analysis (explicit agreement, long distance coherence):
• high-order language models over POS
• LMs over verbs/subject; LMs over nouns, determiners, adjectives
• parallel translation models over lemmas and morphology

Unsupervised (agreement/coherence):
• LMs over word classes
• parallel translation models over word classes and surface forms
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Spanish Experiments
Language Models over Morphological Features

• NDA: noun/determiner/adjective agreement — generate only on N, D and A tags (don't-cares elsewhere)
  N/D/A features — gender: masc, fem, common, none; number: sing, plural, invariable, none

• VPN: verb/noun/preposition selection agreement — generate on V, N or P
  V/N/P features — number: sing, plural, invariable, none; person: 1p, 2p, 3p, none; prep-ID: preposition, none

[Model diagram: surface words generate latent nda/vpn factors, which are then checked by LMs over those factors.]
ModelModel
Spanish Experiments
Skipped LMs for Agreement

• Allow NULL factors to be generated
• Increase effective context length to model longer range dependencies

[Example: target phrase "dio a la mujer" for source "…gave the woman"; the vpn factor is 3+s on "dio" and NULL (X) elsewhere, the nda factors are s+f/s/s+f on the noun phrase — NULL positions are skipped by the factor LM.]
Spanish Agreement LMs
Experimental Results

• With skipping (EuroMini): baseline 23.41, NDA+skip 24.03, VPN+skip 24.16
• No skipping, LM counts don't-care positions (EuroMini): baseline 23.41, NDA 24.47, VPN 24.33, both 24.54
• No skipping, all morphological features w/ and w/o POS (EuroMini): baseline 23.41, morph 24.66, morph+POS 24.25

• All models beat the baseline
– skipping doesn't seem to help
– full morphology is best
Spanish Experiments
Parallel Lemma/Morphology Translation

• Factor surface forms into lemma and morphology features (person + number + gender + case)
• Translate both simultaneously
• Re-generate the target surface form
• Apply LMs to both surface and morphology features

[Diagram: "Me" analyzed into lemma "I" + morphology 1ps+Acc, translated to lemma "Yo" + 1ps+Acc, generated as surface "Mi".]

• Results (EuroMini + 950k LM): baseline 25.10, lemma 25.71
Scaling Up to Large Training
POS Language Models

• Full training set → less/no gain from richer features

[Chart: BLEU score (28–31; note the compressed scale) vs. POS n-gram order (3g–9g) for Baseline, POS-LM, and Full Tags.]
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Factors for Coping with Limited Data
Better Word Alignment for Czech

• Word alignment is difficult when data is limited and morphology is rich
– data: 20k bitext sentences, large vocabulary
– contrast set: 20k + 840k (out-of-domain) sentences
– task: English→Czech

• Two methods to deal with limited data: stem alignment and lemma alignment
• Contrastive behavior for small and large data:

  Data set        Word-Word  Stem-Lemma  Stem-Stem
  20k Czech       25.17      25.23       25.82
  Large contrast  25.40      –           24.99
Czeching Rich Morphology with Tags
Tagged Czech Language Models

• Idea: use morphologically rich POS tag sequences to "czech" target output generation (e.g. generate the tag N+acc for "kočky" and apply an LM over the tag sequence)

• POS information configurations (baseline: 25.82):
– Full tags: feature 1, feature 2, … (15 total); 1098 tags; result 27.04
– CNG tags: case and number+gender on V, P, PP, N, A; 707 tags; result 27.45
– CNG+VP: CNG features + person+tense+aspect (verbs) + lemma+case (prepositions); 899 tags; result 27.62
Comparing with Larger Data Models
Tagged Czech Language Models

• Large vs. small data:

  Model     Data set                       BLEU   Relative improvement
  Baseline  20k Czech                      25.82  –
  Baseline  Large contrast (20k+840k OOD)  27.47  –
  CNG+VP    20k Czech                      27.62  +6.97%
  CNG+VP    Large contrast (20k+840k OOD)  28.12  +2.37%

• Tagged language models improve performance for small data significantly, approaching large-data performance
• The large task also improves, but much less (+2.37% vs. +6.97%)
Parallel Translation Models for Czech
• Motivation: factored LM models seem to lose number information
• Translate a POS tag + CNG features factor in parallel with the surface or lemma (e.g. surface "him" / factor 3p+acc → surface "ho" / lemma "on" / factor 3p+acc)

  Model                              Result
  Surface → surface + POS+CNG tags   25.94
  Surface → lemma + POS+CNG tags     26.43

• Better than baseline, but worse than both CNG and CNG+VP
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Generalized Distortion Modeling
Introduction to Distortion

• For each phrase pair we learn its likely placement relative to the previous phrase
• Orientations:
– monotone: word alignment point on the top left
– swap: word alignment point on the top right
– discontinuous: neither monotone nor swap
• Example: la casa roja → the red house (D NN ADJ → D ADJ NN)

[Figure: source-target alignment grid illustrating the monotone, swap, and discontinuous orientations.]
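The orientation of a phrase pair can be read off the source spans of consecutive target-side phrases. This is a hedged sketch under my own span convention (inclusive `(src_start, src_end)` positions), not Moses code:

```python
def orientation(prev_span, cur_span):
    """Orientation of the current phrase relative to the previous one.
    prev_span, cur_span: inclusive (src_start, src_end) source spans of
    the two phrases, taken in target order."""
    if cur_span[0] == prev_span[1] + 1:
        return "monotone"       # continues right after the previous span
    if cur_span[1] == prev_span[0] - 1:
        return "swap"           # ends right before the previous span
    return "discontinuous"

# "la casa roja" -> "the red house": after "la" (src 0) the target
# continues with "roja" (src 2), then "casa" (src 1):
print(orientation((0, 0), (2, 2)))  # discontinuous
print(orientation((2, 2), (1, 1)))  # swap
```

A lexicalized distortion model collects these orientation counts per phrase pair; the factor-based extension below collects them per factor value (e.g. per POS) instead.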
Factor-based Distortion Models
• A factor-based extension of lexicalized distortion
– use of more general factors, e.g. POSf-POSe, lemma-lemma
• Can model longer range dependencies
– more conditioning variables
• Motivating results
– hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (de Gispert et al., 2006)
Factor-based Distortion
Spanish Experiments

• Lexicalized distortion only (Europarl, Pharaoh vs. Moses):

  Language pair  Pharaoh  Moses
  En→De          18.15    18.85
  Es→En          31.06    31.85
  En→Es          31.46    32.37

• Factor-based distortion on small data, systems compared: baseline (no lexicalized distortion), baseline + lexicalized distortion, factored POS-POS, combined lexical + POS-POS

• Further experiments
– other factors
– minimizing model parameters
– combining different models
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
IWSLT Chinese
Experiments with Unsupervised Annotation

• Data: travel-domain sentences, limited vocabulary, short sentences
• Task: text and ASR translation, Chinese→English
• Can we use automatic word classes to learn general sequence constraints?
• First experiment: LMs of varying orders over bigram-induced word classes

[Model diagram: source phrase 总共 多少 钱 ? → target phrase "How much is it?", with a surface word factor and an automatic word-class factor (c55 c3 c22 c1) scored by a class LM.]
IWSLT Chinese
Alignment Templates for Translation

• Second experiment: extend the class-based LM to the translation model
• Bigram word classes for source and target
• Translate alignment templates, similar to [Och 98], plus surface forms
• Apply LMs to both surface and class factors

[Model diagram: surface and word-class factors translated in parallel, with generation between them.]
IWSLT Chinese
Autoclass Results

[Chart: BLEU score (18–22.5; note the compressed scale) vs. class n-gram order (3g–9g) for Baseline, Class-LM, and ClassTrans+LM.]

• Class-LM significantly better (p = 0.05, ~1.0 BLEU)
• Class-Trans may be limited by the synchronous phrase-table constraint
– started to address this here, but not in time for the eval
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Coping with Rich Morphological Constraints in Czech
• Conclusions and Follow On Research
Conclusions and Future Work
• Factored approach can help with small data
– large-data tasks may need different factored approaches

• MIT/LL + CSAIL
– continue experiments with morphology and coherence
– fully asynchronous factor translation
– apply techniques to other languages; extend existing LCTL experiments
– syntax-driven reordering models (Brooke)

• Asynchronous factors translation (Hieu)

• Making use of verb sub-categorization information (Ondrej)
Valency-Aware Machine Translation
Project Proposal
Ondrej [email protected]
August 17, 2006
Overview
• JHU Workshop motivation and one of the results.
• State-of-the-art MT errors.
• Project goal.
• Motivation: Why Czech.
• Proposed strategy and information sources.
• Summary.
Appendices: References, illustrations and further details on Czech and English
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
2
Workshop Motivation
• Statistical machine translation (SMT) into morphologically rich languages is more difficult than from them.
See e.g. Koehn (2005).
• One of the workshop goals: examine the utility of factored translation models to translate into morphologically rich languages.
• There was room for improvement:
Regular BLEU, English→Czech: 25%
BLEU of lemmatized MT output against lemmatized references: 32%
⇒ Errors in morphology cause large BLEU loss.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
3
One of the Workshop Results
• Significant improvements gained on small data sets: English→Czech, 20k sentences, BLEU 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data.
• Still far below the margin of lemmatized BLEU (35%).
• However local agreement already very good:
Microstudy of adjective-noun agreement: 74% correct, 2% mismatch; other cases: missing noun etc.
⇒ So where are the morphological errors?
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
4
Current English→Czech MT Errors

Microstudy of the current best MT output (BLEU 28.12%), intuitive metric:
• 15 sentences, 77 verb-modifier pairs in the source text examined:

  Translation of…  preserves meaning  is disrupted  is missing
  Verb             43%                14%           21%
  Modifier         79%                12%           6%

But: when verb and modifier are both correct, 44% of cases are still non-grammatical or meaning-disturbing relations.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
5
Sample Errors

Input: Keep on investing.
MT output: Pokračovalo investování. (grammar correct here!)
Gloss: Continued investing. (Meaning: The investing continued.)
Correct: Pokračujte v investování.
⇒ the language model misled us ⇒ need to include source valency information.
Input: brokerage firms rushed out ads . . .
MT output: brokerské firmy vyběhl reklamy
Gloss: brokerage firms(pl.fem) ran(sg.masc) ads(pl.nom/pl.acc/pl.voc/sg.gen)
Correct option 1: brokerské firmy vyběhly s reklamami(pl.instr)
Correct option 2: brokerské firmy vydaly reklamy(pl.acc)

Target-side data may be rich enough to learn: vyběhnout–s–instr
Not rich enough to learn all morphological and lexical variants: vyběhl–s–reklamou, vyběhla–s–reklamami, vyběhl–s–prohlášením, vyběhli–s–oznámením, …
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
6
Project Goal
Improve MT output quality by valency information.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
7
Motivation: Why Czech

• Relevant properties: very rich morphological system and relatively free word order.
• Well-established theory on syntax and valency in particular: Sgall, Hajicova, and Panevova (1986), Panevova (1994).
• Data available: monolingual and parallel corpora, manual surface and deep treebanks (parallel forthcoming!), manual valency lexicons

  Language  Corpus                                         Annotation up to                        Tokens
  Cs        PDT 2.0 (Hajic, 2005)                          manual surface and deep syntax          1.5M surf.
  Cs        CNC (Kocek, Koprivova, and Kucera, 2000)       automatic lemmatization and morphology  114M
  Cs        Web corpus                                     automatic surface syntax                100M
  Cs↔En     PCEDT 1.0 (Cmejrek, Curın, and Havelka, 2003)  automatic surface and deep syntax       500k
  Cs↔En     CzEng 0.5                                      automatic surface syntax                15M
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
8
Proposed Strategy

Preliminary experiments at the workshop:
• Factored models touching valency explored during the workshop perform badly: no gain or a slight loss.

Future:
• Evaluate the causes. Was it just sparse data?
• Check subcategorization using partially lexicalized language models. A morphological LM with verbs lexicalized should capture subcategorization.
• Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
• Map explicit subcategorization information from source to target: translate lemma+subcat to lemma+subcat and POS to POS, then generate the surface form from these.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
9
Project Will Use these Sources of Information
• Available valency/subcategorization dictionaries: VALLEX for Czech (∼PropBank for English).
• Automatically collected subcategorization data: (Korhonen, 2002) and previous work; my dissertation in preparation.
• Word-sense-like algorithms to label verb occurrences with frames: (Bojar, Semecky, and Benesova, 2005), and all WSD community results.

Compare with simple approaches:
• More monolingual data for plain n-gram language models may help enough.
• Are valency-based generalizations useful in general / on small data / on out-of-domain data?
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
10
Summary
• Factored models help fix morphology → local dependencies are already largely correct.
• Significant margin for improving verb-modifier agreement.
• English→Czech pair is a good fit for the experiments.
• Improved valency models should improve translation quality:Valency theory, data and methods available.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
11
References
Bojar, Ondrej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of
Mathematical Linguistics, 79–80:101–120.
Bojar, Ondrej, Jirı Semecky, and Vaclava Benesova. 2005. VALEVAL: Testing VALLEX
Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of
Mathematical Linguistics, 83:5–17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the
Association for Computational Linguistics, pages 116–123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling.
In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting
of the Association for Computational Linguistics and Seventeenth International Conference
on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann
Publishers.
Cmejrek, Martin, Jan Curın, and Jirı Havelka. 2003. Czech-English Dependency-based Machine
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
12
Translation. In EACL 2003 Proceedings of the Conference, pages 83–90. Association for
Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In
Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages
184–191.
Collins, Michael, Jan Hajic, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A
Statistical Parser of Czech. In Proceedings of 37th ACL Conference, pages 505–512, University
of Maryland, College Park, USA.
Hajic, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Maria
Simkova, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54–73, Bratislava,
Slovakia. Veda, vydavatelstvo SAV.
Holan, Tomas. 2003. K syntakticke analyze ceskych(!) vet. In MIS 2003. MATFYZPRESS,
January 18–25, 2003.
Kocek, Jan, Marie Koprivova, and Karel Kucera, editors. 2000. Cesky narodnı korpus - uvod a
prırucka uzivatele. FF UK - UCNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In
Proceedings of MT Summit X, September.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
13
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530,
University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on
Ideas and Strategies for Multilingual Grammar Development.
Panevova, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L.
Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223–243,
Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajicova, and Jarmila Panevova. 1986. The Meaning of the Sentence and
Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech
Republic/Dordrecht, Netherlands.
Ondrej Bojar Valency-Aware Machine Translation August 17, 2006
14
Analysis of Czech

Analytic (surface syntactic):

[Tree #36 for "Zakony udelejte pro lidi" (Laws make for people): PRED udelejte, with OBJ Zakony, AUXP pro, ADV lidi.]
Tectogrammatical (deep syntactic):
#36zakonPl
lawPl
udelatimp
makeimpyou
clovekPl,pro
personPl,for
BENACTPAT
PRED
Morphological:Form Lemma Morphological tag
zakony zakon NNIP1-----A----
zakony zakon NNIP4-----A----
zakony zakon NNIP5-----A----
zakony zakon NNIP7-----A----
udelejte udelat Vi-P---2--A----
udelejte udelat Vi-P---3--A---4
pro pro-1 RR--4----------
lidi clovek NNMP1-----A----
lidi clovek NNMP4-----A----
lidi clovek NNMP5-----A----
Properties of the Czech language

                          Czech                                 English
Rich morphology           ≥ 4,000 tags possible, ≥ 2,300 seen   50 used
Word order                free                                  rigid

• rigid global word order phenomena: clitics
• rigid local word order phenomena: coordination, mutual order of clitics

Nonprojective sentences   16,920   23.3%
Nonprojective edges       23,691    1.9%

Known parsing results     Czech         English
Edge accuracy             69.2–82.5%    91%
Sentence correctness      15.0–30.9%    43%

Data by Collins et al. (1999), Holan (2003), Zeman
(http://ckl.mff.cuni.cz/~zeman/projekty/neproj/index.html),
and Bojar (2003). Consult Kruijff (2003) for measuring word order freeness.
Detailed numbers on Czech

Edge length    1      ≤ 2    ≤ 5
English [%]    74.2   86.3   95.6
Czech [%]      51.8   72.1   90.2
(English data by Collins (1996); Czech data by Holan (2003).)

Number of gaps   0      1      2
Sentences [%]    76.9   22.7   0.42
(Data by Holan (2003).)

Climbing steps   1      2     3     4     5
Nodes [%]        90.3   8.0   1.3   0.3   0.1
(Data by Holan (2003).)
Analytic vs. Tectogrammatical (2)

Analytic tree, sentence #45, 'To by se mělo změnit.' (it / conditional particle / reflexive particle / should / change / full stop): mělo is the PRED; to (SB), by (AUXV), se (AUXR), změnit (OBJ), and the full stop (AUXK) attach to it.

Tectogrammatical tree: mít ('should') is the PRED, governing to ('it'), změnit ('change'), and a Generic Actor node (functors ACT and PAT among its dependents).
Asynchronous Factored Translation
Hieu Hoang, University of Edinburgh
Current System

Phrase Table 1: Je vous achète → I am buying you
Phrase Table 2: PRO PRO VB → PRO VB VB PRO

Translating:
Je vous achète un chat
PRO PRO VB ART NN
Limitations: Synchronous

Phrase Table 1: Je → I; vous → you; achète → am buying
Phrase Table 2: PRO PRO VB → PRO VB VB PRO

Je vous achète un chat
PRO PRO VB ART NN
Asynchronous Translation
Tiling

Je vous achète un chat
PRO PRO VB ART NN

Current System vs. Future: tile the source as [Je] [vous] [achète] [un chat], translating each factor with its own segmentation.
Long Templates

Phrase Table 1: I; am buying; you; a cat
Phrase Table 2: PRO PRO VB ART NN → PRO VB VB PRO ART NN

Je vous achète un chat
PRO PRO VB ART NN
Templates

Phrase Table 1: Je → I; vous → you; achète → am buying; un chat → a cat
Phrase Table 2: PRO PRO VB ART NN → PRO VB VB PRO ART NN
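As a sketch of the tiling idea on the preceding slides (not the actual Moses implementation), each factor's sequence can be covered greedily with the longest matching entries from that factor's own table; the toy tables below mirror the French–English example above:

```python
def tile(seq, table):
    """Greedily cover seq with the longest phrases found in table;
    unknown single tokens are passed through unchanged."""
    out, i = [], 0
    while i < len(seq):
        for j in range(len(seq), i, -1):      # try the longest span first
            if tuple(seq[i:j]) in table:
                out.append(table[tuple(seq[i:j])])
                i = j
                break
        else:
            out.append(seq[i])                # no phrase covers seq[i]
            i += 1
    return out

# Toy tables from the slides; each factor is tiled independently,
# so the word and POS segmentations need not match (asynchronous).
words_table = {("Je",): "I", ("vous",): "you",
               ("achète",): "am buying", ("un", "chat"): "a cat"}
pos_table = {("PRO", "PRO", "VB"): "PRO VB VB PRO", ("ART", "NN"): "ART NN"}
```

Tiling 'Je vous achète un chat' with words_table yields four word tiles, while tiling its tag sequence with pos_table yields two POS tiles; reconciling such mismatched tilings is exactly what the asynchronous model must handle.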
Combining information from different factors

Surface: ni suo ta da mingzi le ma ?  →  You said his name, right ?
Gloss:   you say his name already question
Tense:   past → past
Challenges

• Computational complexity
• Pruning strategies
• Recombination
• Scoring
Translation of morphologically rich languages with additional linguistic information
Chris Dyer, Philipp Koehn, Chris Callison-Burch, Hieu Hoang
17 August 2006
Dyer, Koehn, Callison-Burch, Hoang Morphologically rich languages 17 August 2006
Morphologically rich languages
• Languages differ in their morphological markup
• Examples with increasing complexity:
  – Chinese: no marking for number, gender, tense, or aspect
  – English: number (2) for nouns, four verb forms
  – Spanish: number (2) and gender (2) for adjectives, ...
  – German: number (2), gender (3), case (4), definiteness for adjectives, ...
  – Arabic: number (3), gender (2), case (3), definiteness, possessors for nouns
  – Finnish: prepositions often expressed morphologically
Language   Vocabulary size in Europarl
English     65,887 word forms
Spanish    102,886 word forms
German     195,290 word forms
Finnish    358,345 word forms
Impact of morphological complexity
• How much information do we have if we discount inflectional morphology?
• Experiment (systems trained on full 700,000 sentence Europarl corpus):
Method                          devtest       test
surface → surface               18.22 BLEU    18.04 BLEU
surface → surface (lemmatize)   22.27 BLEU    22.15 BLEU
surface → lemma                 22.70 BLEU    22.45 BLEU
• Gain of 4 BLEU points possible, if we can solve morphology
Problem: unknown word forms
• Unknown surface word forms (German)

test set       unigrams   bigrams   trigrams
devtest-2006   0.71%      12.00%    40.46%
test-2006      0.69%      12.20%    41.08%

• Unknown lemmas (German)

test set       unigrams   bigrams   trigrams
devtest-2006   0.64%      9.05%     33.93%
test-2006      0.64%      9.14%     34.36%
Factored models
• Factored models allow us to address these problems
• Sparse data
– back off to translation of lemmas
– back off to language models with richer statistics
• Agreement and grammatical coherence
– use of factors that enforce agreement within noun phrases
– use of factors that enforce agreement on the clause level
Addressing data sparseness with lemmas
[Diagram: input word → output word and output lemma; the output surface form is generated from the lemma]
• Translate surface into lemma
• Generate surface from lemma
• Translate surface into surface
• Language models over surface and lemma
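A minimal sketch of how the lemma path adds translation options, with invented toy tables standing in for the learned phrase and generation tables (these are not Moses data structures): a surface form unseen in the surface table can still be translated by going through its lemma and generating target surface forms.

```python
def translation_options(word, lemma, surface_table, lemma_table, generation_table):
    """Union of the direct surface path and the lemma back-off path."""
    options = set(surface_table.get(word, []))       # surface -> surface
    for target_lemma in lemma_table.get(lemma, []):  # lemma -> target lemma
        options |= set(generation_table.get(target_lemma, []))  # lemma -> surface forms
    return options

# Toy example: 'ging' (lemma 'gehen') is unseen as a surface form,
# but the lemma route still yields English surface candidates.
surface_table = {}
lemma_table = {"gehen": ["go"]}
generation_table = {"go": ["go", "goes", "went"]}
```

With these tables, the unseen form 'ging' still receives the candidates generated from the lemma 'go', while a surface-table hit would be used directly.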
Addressing data sparseness with lemmas, model 2
[Diagram: input word → output word; the output lemma is generated from the output word]
• Translate surface into surface
• Generate lemma from surface
• Language models over surface and lemma
Experimental Results
Method                         devtest   test
baseline                       18.22     18.04
hidden lemma (gen only)        18.82     18.69
hidden lemma (gen and trans)   18.41     18.52
best published results         –         18.15
• Better performance than baseline model
• Simpler model has higher performance
– fewer search errors
Addressing data sparseness with factored models
[Diagram: input lemma → output lemma; input part-of-speech → output part-of-speech and morphology; the output word is generated from lemma + morphology]
• Morphological analysis and generation model
• Pitfalls of this approach
– tag set does not necessarily have sufficient information
– explosive search space on large models
Overall grammatical coherence
[Diagram: input word → output word, with a part-of-speech factor attached to the output word]
• High order language models over POS
• Motivation: syntactic tags should enforce syntactic sentence structure
• Results: No major impact with 7-gram POS model (BLEU 18.25 vs. 18.22)
• Analysis: local grammatical coherence is already fairly good; the POS sequence LM is not strong enough to support major restructuring
Local agreement (esp. within noun phrases)
[Diagram: input word → output word, with part-of-speech and morphology factors attached to the output word]
• High order language models over POS and morphology
• Motivation
– DET-sgl NOUN-sgl: good sequence
– DET-sgl NOUN-plural: bad sequence
Agreement within noun phrases
• Experiment: 7-gram POS, morph LM in addition to 3-gram word LM
• Results
Method           Agreement errors in NP ≥ 3 words   devtest      test
baseline         15%                                18.22 BLEU   18.04 BLEU
factored model   4%                                 18.25 BLEU   18.22 BLEU
• Example
– baseline: ... zur zwischenstaatlichen methoden ...
– factored model: ... zu zwischenstaatlichen methoden ...
• Example
– baseline: ... das zweite wichtige änderung ...
– factored model: ... die zweite wichtige änderung ...
Subject-verb agreement
• Lexical n-gram language model would prefer
the paintings of the old man is beautiful
'old man is' is a better trigram than 'old man are'
• Correct translation
the paintings   of  the old man     are      beautiful
-    SBJ-plural -   -   -   -       V-plural -
• Special tag that tracks count of subject and verb
p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
Experiment on English–German
• Add special features for subject and verb
• Verb
– our morphological analyzer does not provide verb morphology
  → use of surface forms
• Subject
– subject identified with a German parser (Amit Dubey's parser trained on the TIGER treebank)
– if pronoun: surface form of pronoun
– if noun phrase: POS and morphological tags of determiner, adjective, and noun
Skip language models
• Full language model confused by many non-items:
  p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)

• Skip language models ignore irrelevant tags:
  p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular)
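A skip language model of this kind can be illustrated as follows (a toy sketch with hand-set probabilities; the workshop's models were trained, not hand-set): filler tags are dropped before scoring, so the subject and verb tags become adjacent.

```python
def skip_lm_score(tags, relevant, bigram_prob):
    """Drop tags outside `relevant`, then score the rest with a bigram model."""
    kept = [t for t in tags if t in relevant]
    score = 1.0
    for prev, cur in zip(kept, kept[1:]):
        score *= bigram_prob.get((prev, cur), 1e-6)  # small floor for unseen bigrams
    return score

# Hand-set toy probabilities: a plural subject prefers a plural verb.
bigrams = {("SBJ-plural", "V-plural"): 0.7, ("SBJ-plural", "V-singular"): 0.1}
relevant = {"SBJ-plural", "V-plural", "V-singular"}
```

Scoring the two tag sequences from the slide then reproduces the intended ranking p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular), untouched by the '-' fillers in between.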
• Results: experiments had not finished as of the workshop; preliminary results inconclusive
Reflection on the data
• Clause elements are translated reasonably well

– high agreement within noun phrases now (4% error with the factored model)
• Overall sentence structure muddled
– subject–verb agreement hard to enforce, since it is hard to establish which noun phrase is the subject
– role (and hence case) of noun phrases often wrong, since the relation to the verb is unclear
• Similar problems when translating Arabic–English, Chinese–English
– this motivates work on syntax-based machine translation
– one solution: syntactic restructuring models (Brooke's presentation)
– another solution: clause-level sequence models
Clause level sequence models
• Correct sentence with verb
the paintings of the old man are beautiful
SBJ SBJ OBJ OBJ OBJ OBJ V ADJ
• Incorrect sentence without verb
the paintings of the old man beautiful
SBJ SBJ OBJ OBJ OBJ OBJ ADJ
• Syntactic role label sequence model is on the steering wheel!
p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,V,ADJ) > p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,ADJ)
• May be simplified using skip language models to
p(SBJ,OBJ,V,ADJ) > p(SBJ,OBJ,ADJ)
Another reality check
• One typical error of the current system
wir haben daher nicht für diesen bericht stimmen
we  have  hence not   for  this   report  voting
SUBJ AUX PART PART PP-OBJ PP-OBJ PP-OBJ VINF
• Typical sentences have many particles floating around
– if interested in core sentence structure: ignore them
– if interested in all parts of the clause: include them
• Key lesson: feature engineering
– know your tag sets and morphological features
– be aware of what problem you want to address
– create a factor for this purpose
8/22/2006 JHUSWS 2006 1
Future Research

Back-off models: improving MT through smarter searching and better use of data

Chris Dyer, University of Maryland
Two Goals

Smarter search
– Mitigate sparse-data effects in multi-factored models
– Recover from search errors
– Enable well-motivated models for translating into morphologically complex languages

Back-off models
– Take advantage of single-factored models when it makes sense to do so
Smarter Search: Motivation

Morphological complexity poses problems for "white-space tokenized" statistical MT. Beyond data sparseness, conventional models run into search problems for rare surface forms.

Lemmatizing yields considerable German performance gains:

                               devtest-2006   test-2006
surface → surface              18.22          18.04
surface → surface, lemmatize   22.27          22.15
surface → lemma                22.70          22.45
Smarter Search: Motivation

Single-factor models do not generalize: they cannot produce a target form unless it was seen in the training data. Basic generation models let us improve translation coverage with (inexpensive) monolingual resources.

Translating English → German:

Generation training data       Size                # distinct words produceable
Surface only                   n/a                 105,000 distinct words
Lemmas only                    n/a                  85,000 distinct lemmas
Lemmas + bitext Europarl       15 million words    117,000 distinct words
Lemmas + full Europarl         27 million words    122,000 distinct words
Lemmas + 1.2M EP + Wikipedia   113 million words   137,000 distinct words

Net result: 30% increase in forms produceable over a single-factor model
Morphological Analysis and Generation Model

4-step model:
1. Translate surface to lemma
2. Generate morphology from lemma
3. Translate POS to morphology
4. Generate surface from lemma + morphology

n-gram LMs over surface, lemmata, and morphology
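The 4-step model can be caricatured with dictionaries standing in for the phrase and generation tables (every entry below is invented for illustration; real Moses tables are learned from data):

```python
# Invented toy tables, one per mapping step.
trans_lemma = {"buys": ["kaufen"]}                 # 1. surface -> target lemma
gen_morph   = {"kaufen": {"V.3.SG", "V.INF"}}      # 2. lemma -> candidate morphologies
trans_pos   = {"VBZ": {"V.3.SG"}}                  # 3. source POS -> target morphology
gen_surface = {("kaufen", "V.3.SG"): "kauft",      # 4. lemma + morphology -> surface
               ("kaufen", "V.INF"): "kaufen"}

def translate_word(surface, pos):
    """Run one word through the 4 mapping steps."""
    out = []
    for lemma in trans_lemma.get(surface, []):               # step 1
        morphs = gen_morph.get(lemma, set())                 # step 2
        morphs = morphs & trans_pos.get(pos, set())          # step 3: intersect
        out += sorted(gen_surface[(lemma, m)] for m in morphs)  # step 4
    return out
```

Intersecting the morphology candidates from steps 2 and 3 is what prunes the generated forms down to those licensed by the source POS.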
Initial results were disappointing…

– BLEU scores well below baseline (~11)
– Tuning took an entire weekend on a very small tuning set
The Problem

Search errors
– Aggressive pruning
– Each step multiplies the number of states in the search space over a single-factored model
– Spans must overlap exactly
The Problem: an illustration

Translation options for 'the right approach':
der richtige Ansatz
dem richtigen Ansatz
den richtigen Ansatz
The Solution

Back off to shorter spans: when a dead end is reached, break up the source span into smaller spans and translate those.
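The back-off idea can be sketched recursively (a simplification of the real decoder, which keeps multiple options and scores them; here a dead-end span is simply split in half):

```python
def backoff_translate(words, phrase_table):
    """Translate a source span; if no phrase covers it, back off to sub-spans."""
    key = tuple(words)
    if key in phrase_table:
        return list(phrase_table[key])
    if len(words) == 1:
        return list(words)            # pass an unknown word through
    mid = len(words) // 2             # dead end: break the span and recurse
    return (backoff_translate(words[:mid], phrase_table)
            + backoff_translate(words[mid:], phrase_table))
```

With entries for ('the',) and ('right', 'approach') but none for the full span, the span 'the right approach' is still covered by its pieces instead of failing outright.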
The Solution: an illustration

Translation options for 'the': der, die, das, dem, den, des
Translation options for 'right approach': richtiger Ansatz, Ansatz, richtigen Ansatzes
Back-off Models

Lexicalized surface forms are common. Because of lexicalization, obscure morphology or root forms are often retained.
  Ex. "be that as it may"
Translations are often approximate, unusual when analyzed in more abstract layers. If you mistranslate common stock phrases because of a rigid analysis and generation process, fluency suffers.
Back-off Models

Solution: try to let a single translation step cover all factors; back off to the multi-factored model.
Back-off Models: Implementation

"Primary" phrase table
– Standard form
– Contains all factors on the target side (necessary for secondary-factor LMs)
– May be trained on single-factor data with "best guesses" for secondary factors
– May be aggressively filtered, e.g. for > n occurrences, etc.
Back-off Models: Implementation

Key idea: back-off weight
– Feature associated with choosing a single-factored path
– Tuned along with other feature weights
– Function of source phrase length?
Summary

Increase performance of multi-factored models
– Recover from search errors
– Recover from data sparseness (make more efficient use of longer underlying phrases)

Extend the benefits of multi-factor models to target languages where sparse data and search errors are not generally an issue (e.g. English)
Translation with syntax and factors: Handling global and local dependencies in SMT

Brooke Cowan, MIT CSAIL
August 17, 2006
Brooke Cowan, MIT CSAIL Syntax and factors in SMT August 17, 2006
Goals of statistical machine translation
• Linguistically-correct output
– learn correct syntax and morphology in target language
– e.g., noun-phrase agreement, subject-verb agreement, verbs and their arguments
• Meaning-preserving output
– learn mapping between source and target sentence elements
– e.g., identify the subject in the source and ensure it plays the proper role in the target
– can involve a significant amount of reordering
Linguistically-correct output
• E.g., in Spanish noun phrases, nouns, determiners, and adjectives are constrained to agree in gender and number:

las   políticas   pesqueras   comunitarias
the   policies    fisheries   common
det   noun        adj         adj
(all FEMININE PLURAL)
• Phrasal agreement phenomena are generally local in nature.
Meaning-preserving output: free word order
• E.g., when translating from German to English, we want to identify and place the subject, object, and phrasal modifiers in the output
i would like to thank the rapporteur for his report
ich möchte dem berichterstatter für seinen bericht danken
dem berichterstatter möchte ich für seinen bericht danken
für seinen bericht möchte ich dem berichterstatter danken
• Translation involving free-word-order languages, or language pairs with very different basic word order, can be quite challenging because these phenomena are generally global in nature.
A hybrid system
• A syntax-based system

– handle global phenomena in translation
  ∗ inter-phrasal reordering
  ∗ verb/argument structure
  ∗ some long-distance agreement phenomena (e.g., subject-verb agreement)

• A factored phrase-based system

– handle local phenomena in translation
  ∗ agreement and reorderings
Combining the two systems
• Use the syntax-based system to reorder the source-language input
• Feed the output of the syntax-based system into the phrase-based system
German input:
für seinen bericht möchte ich dem berichterstatter danken

Modified German input:
ich WOULD LIKE TO THANK dem berichterstatter für seinen bericht

English output:
i would like to thank the rapporteur for his report
The syntax-based system
• Discriminatively-trained, tree-to-tree translation system (Cowan, Collins, and Kučerová, EMNLP ’06)
• Fully implemented and tested on German-to-English Europarl task
• Model predicts an aligned extended projection (AEP) on the target side
– a syntactic structure encapsulating the argument structure of the main target-side verb, and
– alignment information between the modifiers on the source and target sides
What is an AEP?

[Figure: a German clause ('zwischen beiden gesetzen bestehen also erhebliche rechtliche, praktische und wirtschaftliche unterschiede') alongside its English AEP. The AEP pairs the extended projection (EP) of the main verb (Frank 2002), S → NP-A VP with V = are, with alignment information: SUBJECT: there, OBJECT: 3, MOD(1): post-object, MOD(2): pre-subject.]
Integration with Moses
• Factor-based systems handle local phenomena well
• Extensions to Moses
Modified German input:
[ ich ] [ WOULD LIKE TO THANK ] [ dem berichterstatter ] [ für seinen bericht ]
– externally-provided translation options
– constraints on reordering
– n-best lists of AEPs
Research questions
• Factor the translation problem into two parts
– syntax-based system to handle global reorderings and agreements
– factor-based system to handle local reordering and agreements
• Can this approach improve overall translation quality?
– past work in rule-based clause restructuring (e.g., Collins, Koehn, and Kučerová, ACL ’05)
• What is the best way to combine these systems?
– hard constraints vs. soft constraints
– voting/backoff framework
Part of Speech Information for Alignment
Alexandra Constantin
2006 CLSP Summer Workshop
Bilingual Dictionary
Haus – house, building, home, household
Lexical Translation Probability Distribution / Implicit Alignment

Das(1) Haus(2) ist(3) klein(4) .
The(1) house(2) is(3) small(4) .

Alignment Function a

Klein(1) ist(2) das(3) Haus(4)
The(1) house(2) is(3) small(4)
POS Motivation
POS information for infrequent words
Example IBM Model 1 - Notations
e = target word
f = source word
t(e|f) = probability of translating foreign word f into English word e
f = (f_1, …, f_n) = foreign sentence
e = (e_1,…,e_m) = English sentence
p(e|f) = translation probability
a = alignment function
IBM Model 1 EM Algorithm
1. Initialize model (typically with uniform distribution)
2. Apply the model to the data (expectation step)
3. Learn the model from the data (maximization step)
4. Iterate steps 2-3 until convergence
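The four steps above can be sketched for IBM Model 1 in a few lines (a textbook EM implementation for illustration, not the workshop's actual training setup):

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Estimate t(e|f) from (foreign, english) sentence pairs with EM."""
    # 1. Initialize t(e|f) uniformly over the English vocabulary
    e_vocab = {e for _, es in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)    # expected counts c(e, f)
        total = defaultdict(float)    # marginal counts c(f)
        # 2. Expectation: distribute each e fractionally over its possible f's
        for fs, es in bitext:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # 3. Maximization: re-estimate the translation table from the counts
        t = defaultdict(float,
                        {(e, f): c / total[f] for (e, f), c in count.items()})
    return t
```

Even on a two-sentence toy bitext, a few iterations pull the probability mass onto the co-occurring pairs that actually translate each other.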
[Formula slides: Expectation Step; Expectation Step – p(e|f); Maximization Step]
Adding POS Information / Experiments – AER

Compare generated alignments against manual alignments.
Manual alignments: probable (P) and sure (S)
Automated alignments: A
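AER (following Och and Ney's definition, which I take to be the metric used here) compares the system links A against the sure links S and probable links P:

```python
def alignment_error_rate(A, S, P):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P.
    Each alignment is a set of (source_position, target_position) links."""
    A, S, P = set(A), set(S), set(P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
```

An alignment that recovers every sure link and stays inside the probable links scores 0; stray links and missed sure links both raise the rate.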
Results
AER        10k    20k    40k    60k    80k    100k
Baseline   53.7   51.8   49.3   48.6   47.5   47.1
Only POS   76.0   75.4   75.5   75.1   75.3   75.1
+ POS      53.6   51.5   49.6   48.4   47.7   47.3
Future Work
– Use alignments to train MT system and compare BLEU scores
– Use POS information in more complicated alignment methods
– Use other factors
JHU CLSP Summer Workshop 2006 – Team Presentation

Experimental Results for Confusion Network Decoding
Richard Zens, Nicola Bertoldi, Marcello Federico, Wade Shen
Zens, Bertoldi, Federico, Shen: Results for Confusion Net Decoding, August 17, 2006
IWSLT Task
• Chinese–English, domain: phrase book entries
• corpus statistics:
                Chinese   English
sentences           40 K
running words      351 K     365 K
vocabulary          11 K      10 K
• confusion network statistics (489 sentences):
                       read speech   spontaneous speech
avg. length            17.2          17.4
avg. / max. depth      2.2 / 92      2.9 / 82
avg. number of paths   10^21         10^32
• no development data for confusion networks
Results for IWSLT
• phrase table provided by MIT/LL
• competitive baseline results
• results:

                      read speech   spontaneous speech
                      BLEU [%]      BLEU [%]
verbatim              21.4
1-best from lattice   19.0          17.2
1-best from CN        19.0          17.2
full CN               19.3          17.8
• improvements are statistically significant (89% confidence)
Other Ambiguous Input: Punctuation
• Chinese input does not contain punctuation
• illustration:
hello world →

position:   1           2       3           4
            hello 1.0   ε 0.9   world 1.0   ! 0.7
                        , 0.1               . 0.2
                                            ε 0.1
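The network above is small enough to enumerate. As a sketch (not the Moses decoder's internal representation), a confusion network can be stored as one list of (token, probability) arcs per position, with ε arcs emitting nothing:

```python
from itertools import product

EPS = "eps"
# One column per position; each arc is (token, probability).
# These columns transcribe the punctuation illustration above.
cn = [[("hello", 1.0)],
      [(EPS, 0.9), (",", 0.1)],
      [("world", 1.0)],
      [("!", 0.7), (".", 0.2), (EPS, 0.1)]]

def cn_paths(cn):
    """Yield every (sentence, probability) path through the network."""
    for arcs in product(*cn):
        prob = 1.0
        words = []
        for token, p in arcs:
            prob *= p
            if token != EPS:
                words.append(token)
        yield " ".join(words), prob
```

The six paths sum to probability 1, and the most likely one is 'hello world !' with probability 0.9 × 0.7 = 0.63.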
• results for verbatim input:

punctuation input type   BLEU [%]
1-best                   20.8
confusion network        21.0
• competitive performance without tuning→ room for improvement
Truecasing
truecasing, i.e. restoring case information in lowercase text
• common approach:
– core MT system produces lowercase output– truecasing is done as postprocessing step
• application of factored translation models
1. translate lowercase2. generate truecase output (using a truecase LM)
• results:

             BLEU [%]
two-step     18.9
integrated   17.8
→ somewhat worse performance than the dedicated tool
EPPS Task

• EPPS: European Parliament Plenary Sessions
• Spanish-English speech-to-speech translation task
• corpus statistics:
                Spanish   English
sentences          1.2 M
running words       31 M      30 M
vocabulary         140 K      94 K
• confusion network statistics:

                       dev         test
sentences              2,633       1,071
avg. length            10.6        23.6
avg. / max. depth      2.8 / 165   2.7 / 136
avg. number of paths   10^38       10^75
Results for EPPS Task
dev testASR-WER BLEU ASR-WER BLEU
1-best lattice 19.3 42.2 22.4 37.61-best CN 21.7 40.3 23.3 36.7full CN 7.0 42.4 8.5 38.9
• best result for test in previous work: 37.2 BLEU
• in comparison with previous work on this task, we have
1. a stronger baseline,
2. larger improvements, and
3. much more efficient decoding (4x vs. 25x)
note: all figures in percent
Exploration of Confusion Networks

[Plot: average number per sentence (log scale, 0.1 to 10^10) against path length (0–14), for three curves: CN total, CN explored, 1-best explored.]
JHU CLSP Summer Workshop 2006 – Proposal for Follow-up Research

Exploiting Ambiguous Input in Statistical Machine Translation
Richard Zens
Human Language Technology and Pattern RecognitionLehrstuhl für Informatik 6
Computer Science DepartmentRWTH Aachen University, Germany
Zens: Exploiting Ambiguous Input in SMT, August 17, 2006
Motivation
• MT is often used in a pipeline, i.e. the input to the MT system is the output of another imperfect NLP system, e.g.
– spoken language translation: ASR– segmentation: Chinese words, Arabic tokens– named entity recognition / translation
• traditional approach: ignore problem, i.e. translate 1-best
• result of previous work: improvements if ambiguity is taken into account
Previous Approaches
1. confusion network decoding
• advantages: efficiency, reordering is straightforward
• problem: representing alternative segmentations
2. lattice decoding
• advantage: representing alternative segmentations
• problem: reordering
goal:
⇒ exploit the advantages of both approaches,
⇒ but avoid their weaknesses
Generalized Confusion Networks
• confusion networks:

[Diagram: a linear chain of nodes 0–4; every arc spans exactly one position]
• generalization:

[Diagram: the same chain of nodes 0–4, with additional arcs spanning multiple positions]
– add edges that cover multiple positions
  → representation of alternative segmentations
– do not add nodes
  → retain efficiency, straightforward reordering
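A sketch of the resulting data structure (with invented romanized tokens illustrating a segmentation ambiguity of the Chinese-word-segmentation kind): each column holds arcs (token, probability, span), where span is the number of positions the arc covers, and a plain confusion network is the special case span = 1.

```python
# Each arc: (token, probability, span). Tokens are invented for illustration.
gcn = [
    [("xin", 0.6, 1), ("xinyue", 0.4, 2)],   # position 0: two competing segmentations
    [("yue", 1.0, 1)],                        # position 1 (skipped by the span-2 arc)
    [("cheng", 1.0, 1)],                      # position 2
]

def gcn_paths(gcn, i=0):
    """Enumerate (token tuple, probability) paths, honoring arc spans."""
    if i == len(gcn):
        yield (), 1.0
        return
    for token, prob, span in gcn[i]:
        for rest, p in gcn_paths(gcn, i + span):
            yield (token,) + rest, prob * p
```

Two paths result: ('xin', 'yue', 'cheng') with probability 0.6 and ('xinyue', 'cheng') with 0.4; because no nodes are added, the position-based reordering used for plain CNs still applies.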
Improved Reordering for Lattice Input
• confusion network is approximation of lattice→ valuable information might be lost→ potential improvement when using lattices
• so far:
– only very local reordering on lattices:
  ∗ skip 1 phrase [Zens & Bender+ 05]
∗ switch positions of 2 or 3 phrases [Kumar & Byrne 05]
• idea:
– generalize the reordering scheme used for CNs to lattice input
→ long-range reordering
Goals
• improve robustness to imperfect input
• investigate novel approaches:
– generalized confusion networks
– reordering strategies for lattice input
• perform a systematic comparison in terms of MT quality and computational requirements
• scalability → apply to tasks of different sizes:
small: IWSLT, medium: EPPS/TC-Star, large: NIST/GALE
Targeted Applications
• spoken language translation:
– output of ASR system
– punctuation insertion / sentence boundary detection
– disfluency detection
• named entity recognition / translation
• Chinese word segmentation
• Arabic tokenization
References

[Kumar & Byrne 05] S. Kumar, W. Byrne: Local Phrase Reordering Models for Statistical Machine Translation. Proc. HLT/EMNLP, pp. 161–168, Vancouver, Canada, October 2005.

[Sadat & Habash 06] F. Sadat, N. Habash: Combination of Preprocessing Schemes for Statistical MT. Proc. COLING/ACL, pp. 1–8, Sydney, Australia, July 2006.

[Xu & Matusov+ 05] J. Xu, E. Matusov, R. Zens, H. Ney: Integrated Chinese Word Segmentation in Statistical Machine Translation. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 141–147, Pittsburgh, PA, October 2005.

[Zens & Bender+ 05] R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, H. Ney: The RWTH Phrase-based Statistical Machine Translation System. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 155–162, Pittsburgh, PA, October 2005.

[Zens & Och+ 02] R. Zens, F.J. Och, H. Ney: Phrase-Based Statistical Machine Translation. In M. Jarke, J. Koehler, G. Lakemeyer, editors, 25th German Conf. on Artificial Intelligence (KI 2002), Vol. 2479 of Lecture Notes in Artificial Intelligence (LNAI), pp. 18–32, Aachen, Germany, September 2002. Springer Verlag.