MIT Lincoln + Computer Science AI Labs
8/14/2006
Charles University
Wade Shen, Brooke Cowan, Ondrej Bojar and Christine Moran
Factored Translation Models for Small Data Problems
Experiments with Spanish, Czech and Chinese
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
General Motivations
Challenges with Small Data
• Phrase-based MT relies on large data
– Learn “phrase” co-occurrence within a language
– Learn translation templates/phrases across languages
• Problems for phrase-based MT with small data
– Word alignment
– Hard to see enough phrases (coverage), especially in morphologically rich languages
– Tend to rely on shorter phrases, which increases local agreement problems and long-distance coherence problems
Possible Advantages of Factored Models
Generalization over Morphology
• We can model morphological variation and phrase translation separately for better statistics: Translation + Generation
– Spanish Gender
           Masculine                 Feminine
  Spanish  Él es un jugador rojo     Ella es una jugadora roja
  English  he is a red player        she is a red player
  Lemma    el ser un jugador roj     el ser un jugador roj
  Morph    m 3p+sing m m m           f 3p+sing f f f
– Czech Case
           Nominative + Plural       Dative + Plural
  Czech    černé kočky               černým kočkám
  English  black cats                black cats
  Lemma    černá kočka               černá kočka
  Morph    nom+pl nom+pl             dat+pl dat+pl
Factors as Type Checking
Long Range Phenomena and Divergence
• Long range dependencies can be modeled with latent factors
– Spanish: Verb – Subject Number Agreement
  Spanish  Mi hija de dos años tiene catarro
  Gloss    My daughter of two years has cold
  AGR      Subject: 3p+sing, verb: 3p+sing
  Czech    Nachlazena je moje dvouletá dcera.
  AGR      verb: 3p+sing, Subject: 3p+sing
• Verb-Argument dependencies
  Czech    Napsal zprávu o matčině domu na papír
  Gloss    He wrote a message about mother’s house on a paper
  verb selects noun: accusative
  Czech    Našel zprávu o matčině domu na papíře
  Gloss    He found a message about mother’s house on a paper
  verb selects noun: locative
Phrase-Level Generalization
• Class-based divergences
– Chinese-English resultative constructions
  Similar pattern for large class of verbs (verb specific)
  Chinese  你 要 回 答 破 吗
  Gloss    you made it hit broken done
  English  you broke it
• Longer distance movement dependencies
– Chinese-English Questions
  Chinese  你 要 回答 [clause…] 吗
  Gloss    you want reply [clause…] y/n-marker
  English  would you like to reply to [clause…] ?
  Tags: VModal Pn; Tag: Part (causes reordering)
Large vs. Small Data
How generalizations may affect SMT Performance
• With large data sets these phenomena can be learned
– Language models should capture local agreement phenomena with enough data
– Long range agreement/coherence is still problematic
– Generalization may still be better, but errors in analysis can limit the gains
• Generalization may be advantageous for small data
– For example (Spanish/Czech agreement): can’t learn every noun/adjective/determiner triple
– This is the situation for many real-world problems
Outline
• Motivations
• Experimental Design and Baselines
– Approaches
– Data Sets
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Data Sets and Baselines
Data Set        Translation Direction(s)   Size                          Baseline w/ diff LMs (BLEU/Surface)
Full Europarl   English → Spanish          950k LM Train, 700k Bitext    3g 29.35, 4g 29.57, 5g 29.54
Euromini        English → Spanish          60k LM Train, 40k Bitext      3g 23.41, 3g (950k) 25.10
Czech WSJ       English → Czech            20k LM Train, 20k Bitext      3g 25.82 (four references)
IWSLT Chinese   Chinese → English          40k LM Train, 40k Bitext      4g 19.54 (seven references)
Using Factored Models
Approaches for Small-Data Tasks
• Factored Models we tried
– Different levels of linguistic information modeled separately
  example: Morphology vs. phrasal content
– Feature “Checking” of existing phrasal models with LMs on factors
           Good (high likelihood)      Bad (low likelihood)
  Words    I would like some donuts    I would like some big jump
  POS      pn mod vb det np            pn mod vb det adj vb
– Generalized Factor-based Distortion
  Phrases are likely to move distance X if the preceding word is Tag Y
• Hypothesis: These models allow better utilization of limited training data
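The feature-checking idea above can be sketched as a small language model over POS tags that scores candidate outputs. This is a toy illustration only, not the workshop system: the training sequences, the add-one smoothing and the names `train_bigrams` and `pos_bigram_logprob` are assumptions made for the example.

```python
import math
from collections import Counter

# Toy POS-tagged training data (hypothetical); a real model would be
# trained on the tagged target side of the training corpus.
TRAIN = [
    "pn mod vb det np".split(),
    "pn vb det adj np".split(),
    "pn mod vb det adj np".split(),
]

def train_bigrams(sentences):
    """Collect context and bigram counts over padded tag sequences."""
    contexts, bigrams = Counter(), Counter()
    for tags in sentences:
        padded = ["<s>"] + tags + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab = {t for tags in sentences for t in tags} | {"</s>"}
    return contexts, bigrams, vocab

def pos_bigram_logprob(tags, model):
    """Log-probability of a tag sequence under an add-one smoothed bigram LM."""
    contexts, bigrams, vocab = model
    padded = ["<s>"] + tags + ["</s>"]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab)))
    return logp

model = train_bigrams(TRAIN)
good = pos_bigram_logprob("pn mod vb det np".split(), model)      # "I would like some donuts"
bad = pos_bigram_logprob("pn mod vb det adj vb".split(), model)   # "I would like some big jump"
print(good > bad)
```

In a decoder such a score would be one more feature in the log-linear model, so it nudges the search toward well-formed tag sequences rather than acting as a hard filter.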
Different Factored Approaches
Overview of Models Tried
• High Order Language Models
  Analysis       Problems Addressed         Model Types
  Unsupervised   Agreement/Coherence        LMs over Word-Classes
  Supervised     Explicit Agreement         LMs over verbs/subject; LMs over nouns/determiner/adjectives
  Supervised     Long Distance Coherence    LMs over POS
• Parallel Translation Models
  Analysis       Problem Types              Model Types
  Unsupervised   Agreement/Coherence        Parallel Translation Models over Word-Classes and Surface
  Supervised     Explicit Agreement         Parallel Translation Models over Lemmas and Morphology
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Spanish Experiments
Language Models over Morphological Features
• NDA
– Noun/Determiner/Adjective Agreement
– Generate only on N, D and A tags (don’t cares elsewhere)
– N/D/A Features
  Gender: masc, fem, common, none
  Number: sing, plural, invariable, none
• VNP
– Verb/Noun/Preposition Selection Agreement
– Generate on V, N or P
– V/N/P Features
  Number: sing, plural, invariable, none
  Person: 1p, 2p, 3p, none
  Prep-ID: Preposition, none
[Diagram: Model. The surface word is generated together with the latent factors nda and vpn (Generate + Check Latent Factors)]
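A minimal sketch of how the two latent factors could be derived from an analyzed token, with a don't-care symbol everywhere else. The token dictionaries, the abbreviated feature values (`s`, `f`, `3`) and the function names are hypothetical; the slides do not specify the tagger output format.

```python
DONT_CARE = "X"  # placeholder emitted where a factor does not apply

def nda_factor(token):
    """Number+gender on nouns, determiners and adjectives; don't-care elsewhere."""
    if token["pos"] in {"N", "D", "A"}:
        return f'{token["number"]}+{token["gender"]}'
    return DONT_CARE

def vpn_factor(token):
    """Person+number on verbs, number on nouns, identity on prepositions."""
    pos = token["pos"]
    if pos == "V":
        return f'{token["person"]}+{token["number"]}'
    if pos == "N":
        return token["number"]
    if pos == "P":
        return token["word"]
    return DONT_CARE

# Hypothetical analysis of the Spanish phrase "dio a la mujer"
sentence = [
    {"word": "dio", "pos": "V", "person": "3", "number": "s"},
    {"word": "a", "pos": "P"},
    {"word": "la", "pos": "D", "number": "s", "gender": "f"},
    {"word": "mujer", "pos": "N", "number": "s", "gender": "f"},
]
nda = [nda_factor(t) for t in sentence]
vpn = [vpn_factor(t) for t in sentence]
print(nda, vpn)
```

Each factor sequence can then be scored by its own n-gram LM, independently of the surface LM.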
Spanish Experiments
Skipped LMs for Agreement
• Allow NULL factors to be generated
• Increase effective context length to model longer range dependencies
[Diagram: Model. Generate Latent Factors]
  Source Phrase   …gave the woman
  Target Phrase   dio   a     la    mujer
  nda             X     X     s+f   s+f
  vpn             3+s   “a”   X     s
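The effect of skipping can be sketched by dropping don't-care positions before n-grams are extracted, so that the remaining factors fall into the same context window. This is an illustrative assumption about the mechanism, with hypothetical helper names, not the decoder's actual implementation.

```python
DONT_CARE = "X"

def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def skip_dont_cares(factors):
    """Remove don't-care positions so the LM context spans only real factors."""
    return [f for f in factors if f != DONT_CARE]

# vpn factors for the target phrase "dio a la mujer"
vpn = ["3+s", "a", "X", "s"]

without_skipping = ngrams(vpn, 3)
with_skipping = ngrams(skip_dont_cares(vpn), 3)
print(without_skipping)
print(with_skipping)
```

With skipping, the verb, the preposition and the noun share a single trigram; without it, every trigram containing the noun also spends one position on the don't-care symbol.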
Spanish Agreement LMs
Experimental Results
• No Skipping (LM counts don’t-care positions)
  Data Set   Baseline   NDA     VPN     Both
  EuroMini   23.41      24.47   24.33   24.54
• With Skipping
  Data Set   Baseline   NDA+Skip   VPN+Skip
  EuroMini   23.41      24.03      24.16
• No Skipping with all morphological features w/ and w/o POS
  Data Set   Baseline   Morph   Morph+POS
  EuroMini   23.41      24.66   24.25
• All models beat baseline
– Skipping doesn’t seem to help
– Full morphology is best
Spanish Experiments
Parallel Lemma/Morphology Translation
• Factor surface into lemma and morphology features
• Translate both simultaneously
• Re-generate target surface form
• Apply LM on both surface and morphology features
[Diagram: Analysis splits the surface form “Me” into lemma “I” and morphology “1ps+Acc” (Person + Number + Gender + Case); these translate to lemma “Yo” and morphology “1ps+Acc”; Generation produces the surface form “Mi”]
• Results:
  Data Set             Baseline   Lemma
  EuroMini + 950k LM   25.10      25.71
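The analysis, parallel translation and generation steps can be sketched with toy lookup tables. Every table entry below is hypothetical and chosen only to mirror the example in the diagram; a real system would score many competing lemma, morphology and surface candidates inside the decoder rather than do a deterministic lookup.

```python
# Hypothetical one-entry tables mirroring the "Me" example.
ANALYZE = {"me": ("I", "1ps+Acc")}          # source surface -> (lemma, morphology)
LEMMA_TABLE = {"I": "yo"}                   # source lemma -> target lemma
MORPH_TABLE = {"1ps+Acc": "1ps+Acc"}        # source morphology -> target morphology
GENERATE = {("yo", "1ps+Acc"): "mi"}        # (target lemma, morphology) -> target surface

def translate_word(surface):
    """Analyze, translate lemma and morphology separately, then regenerate."""
    lemma, morph = ANALYZE[surface.lower()]
    tgt_lemma = LEMMA_TABLE[lemma]
    tgt_morph = MORPH_TABLE[morph]
    return GENERATE[(tgt_lemma, tgt_morph)]

result = translate_word("Me")
print(result)
```

The point of the factorization is that the lemma table and the morphology table are each estimated from more observations than a table over full surface forms.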
Scaling Up to Large Training
POS Language Models
• Full Train → Less/No Gain from richer features
[Plot: POS-LM vs. Baseline. x-axis: POS N-gram Order (3g to 9g); y-axis: BLEU Score (28 to 31); series: Baseline, POS-LM, Full Tags. NOTE: Scale]
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Factors for Coping with Limited Data
Better Word Alignment for Czech
• Word Alignment is difficult when data is limited and morphology is rich
– Data: 20k bitext sentences, large vocabulary
– Contrast Set: 20k + 840k (Out of Domain) sentences
– Task: English → Czech
• Two methods to deal with limited data
– Stem Alignment
– Lemma Alignment
• Contrastive behavior for small and large data
  20k Czech: Word-Word 25.17, Stem-Lemma 25.23, Stem-Stem 25.82
  Large Contrast: 25.40, 24.99
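Why aligning on stems helps with limited data can be sketched by comparing vocabulary sizes before and after stemming. The slides do not say how stems were computed; the fixed-length prefix used here is purely an assumption for illustration, as is the function name `stem`.

```python
import unicodedata

def stem(word, length=4):
    """Crude stem: lowercase prefix of fixed length (an assumption, not the actual stemmer)."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return normalized[:length]

# Three inflected forms of "black cat(s)" in Czech
czech = ["černé kočky", "černým kočkám", "černá kočka"]

surface_vocab = {w for s in czech for w in s.split()}
stem_vocab = {stem(w) for s in czech for w in s.split()}
print(len(surface_vocab), len(stem_vocab))
```

The word aligner is run on the stemmed text, and the resulting alignment points are then reused for the original surface words, since stemming does not change token positions.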
Czeching Rich Morphology with Tags
Tagged Czech Language Models
• Idea: Use morphologically rich POS tag sequences to “czech” target output generation
[Diagram: Generation of surface “kočky” with tag N+acc from “cat”; Apply LM over the tags]
• POS Information Configurations (Baseline: 25.82)
– Full Tags: Feature 1, Feature 2, … (15 total). Size: 1098 tags. Result: 27.04
– CNG Tags: Case, Number + Gender on V, P, PP, N, A. Size: 707 tags. Result: 27.45
– CNG+VP: CNG features, Person + Tense + Aspect (verbs), Lemma + Case (prepositions). Size: 899 tags. Result: 27.62
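Reducing a full tag to a CNG tag can be sketched as follows. The sketch assumes 15-position tags in the style of the Prague Dependency Treebank (position 1 POS, 3 gender, 4 number, 5 case); the slides only state that there are 15 features, so the tag layout, the POS letters in `CNG_POS` and the output format are all assumptions.

```python
# POS classes that keep case/number/gender (hypothetical letter codes:
# noun, adjective, pronoun, verb, preposition).
CNG_POS = {"N", "A", "P", "V", "R"}

def cng_tag(full_tag):
    """Collapse a 15-position tag to POS plus case, number and gender."""
    pos, gender, number, case = full_tag[0], full_tag[2], full_tag[3], full_tag[4]
    if pos in CNG_POS:
        return f"{pos}+{case}{number}{gender}"
    return pos

noun = cng_tag("NNFP4-----A----")    # feminine plural accusative noun, e.g. "kočky"
adverb = cng_tag("Dg-------1A----")  # adverb: no case/number/gender kept
print(noun, adverb)
```

Collapsing tags this way shrinks the tag vocabulary, which gives the tag LM denser statistics while keeping the features that drive agreement.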
Comparing with Larger Data Models
Tagged Czech Language Models
• Large vs. Small Data
  Model      Data Set                          BLEU    Relative Improvement
  Baseline   20k Czech                         25.82   –
  Baseline   Large Contrast (20k + 840k OOD)   27.47   –
  CNG+VP     20k Czech                         27.62   6.97%
  CNG+VP     Large Contrast (20k + 840k OOD)   28.12   2.37%
• Tagged Language Models improve performance for small data significantly
– approaches large data performance
• Large Task also improves (but much less: 2.37% vs. 6.97%)
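The relative improvements follow directly from the BLEU scores in the table; a short check, with the function name `relative_improvement` chosen here for illustration:

```python
def relative_improvement(baseline, system):
    """Relative gain of system over baseline, in percent."""
    return 100.0 * (system - baseline) / baseline

small = relative_improvement(25.82, 27.62)   # 20k Czech: Baseline vs. CNG+VP
large = relative_improvement(27.47, 28.12)   # Large Contrast: Baseline vs. CNG+VP
print(f"{small:.2f}% {large:.2f}%")
```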
Parallel Translation Models for Czech
• Motivation: Factored LM models seem to lose number information
[Diagram: surface “him” with factor 3p+acc (POS Tag + CNG Features) translated to surface “ho” with factor 3p+acc; variant on Lemma]
  Model                             Result
  Surface Surface + POS POS+CNG     25.94
  Surface Lemma + POS POS+CNG       26.43
• Better than baseline, but worse than both CNG & CNG+VP
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Generalized Distortion Modeling
Introduction to Distortion
• For each phrase pair we learn its likely placement relative to the previous phrase
• Orientations
– Monotone: word alignment point on top left
– Swap: word alignment point on top right
– Discontinuous: not monotone or swap
• Examples
– la casa roja → the red house
– D NN ADJ → D ADJ NN
[Diagram: Source vs. Target alignment grid showing Monotone, Swap and Discontinuous orientations]
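The three orientations can be sketched as a check on the word alignment points adjacent to a phrase pair. This follows the common word-alignment-based definition; the index convention (source index, target index), the virtual point before the sentence start and the function name are assumptions of the sketch.

```python
def orientation(alignment, src_start, src_end, tgt_start):
    """Classify a phrase pair by the alignment point next to its top corners."""
    # Virtual point before the sentence start, so the first phrase counts as monotone.
    points = set(alignment) | {(-1, -1)}
    if (src_start - 1, tgt_start - 1) in points:
        return "monotone"        # alignment point on top left
    if (src_end + 1, tgt_start - 1) in points:
        return "swap"            # alignment point on top right
    return "discontinuous"

# la(0) casa(1) roja(2)  ->  the(0) red(1) house(2)
alignment = {(0, 0), (1, 2), (2, 1)}

la_the = orientation(alignment, 0, 0, 0)
casa_house = orientation(alignment, 1, 1, 2)
roja_red = orientation(alignment, 2, 2, 1)
print(la_the, casa_house, roja_red)
```

Counting these labels per phrase pair over the training corpus gives the orientation probabilities used by a lexicalized distortion model.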
Factor-based Distortion Models
• A factor-based extension of Lexicalized Distortion
– Use of more general factors, e.g. POSf-POSe, Lemma-Lemma
• Can model longer range dependencies
– More conditioning variables
• Motivating Results
– Hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (Gispert et al. 2006)
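Conditioning orientation on a factor pair instead of on words can be sketched as a smoothed relative-frequency estimate keyed by (POSf, POSe). The training events, the add-alpha smoothing and the function name `train_orientation_model` are hypothetical; the actual parameterization is not given in the slides.

```python
from collections import Counter, defaultdict

ORIENTATIONS = ("monotone", "swap", "discontinuous")

def train_orientation_model(events, alpha=0.5):
    """Estimate P(orientation | factor key) with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for key, orient in events:
        counts[key][orient] += 1

    def prob(key, orient):
        c = counts.get(key, Counter())
        total = sum(c.values()) + alpha * len(ORIENTATIONS)
        return (c[orient] + alpha) / total

    return prob

# Hypothetical events: ((POSf, POSe), observed orientation)
events = [
    (("ADJ", "ADJ"), "swap"),
    (("ADJ", "ADJ"), "swap"),
    (("ADJ", "ADJ"), "swap"),
    (("ADJ", "ADJ"), "monotone"),
    (("D", "D"), "monotone"),
]
prob = train_orientation_model(events)
p_swap = prob(("ADJ", "ADJ"), "swap")
p_mono = prob(("ADJ", "ADJ"), "monotone")
p_unseen = prob(("V", "V"), "swap")
print(p_swap, p_mono, p_unseen)
```

Because many word pairs share one POS pair, each key is observed far more often than any single lexical pair, which is the source of the generalization on small data.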
Factor-based Distortion
Spanish Experiments
• Lexicalized Distortion only
  Europarl Lang   Pharaoh   Moses
  En → De         18.15     18.85
  Es → En, En → Es: 31.06 31.85 31.46 32.37
• Factor-based Distortion on small data
  Systems: Baseline (No Lexical); Baseline Lexical; Factored: POS-POS; Combined: Lexical + POS-POS
• Further Experiments
– Other Factors
– Minimizing Model Parameters
– Combining different models
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
IWSLT Chinese
Experiments with Unsupervised Annotation
• Data: Travel-domain sentences, limited vocabulary, short sentences
• Task: Text and ASR translation, Chinese → English
• Can we use automatic word classes to learn general sequence constraints?
• First Experiment: LMs of varying orders over 2-gram word classes
[Diagram: Model. Source Phrase: 总共 多少 钱 ?; Target Phrase (surface): How much is it?; generated factors: word and class (c55, c3, c22, c1)]
IWSLT Chinese
Alignment Templates for Translation
• Second Experiment: Extend class-based LM to the translation model
• Bigram word classes for source and target
• Translate alignment templates similar to [Och 98] + surface
• Apply LM to surface and class
[Diagram: Word Class and Surface factors translated in parallel, followed by Generation; example tokens: Me, I, Yo, Mi]
IWSLT Chinese
Autoclass Results
[Plot: x-axis: Class N-gram Order (3g to 9g); y-axis: BLEU Score (18 to 22.5); series: Baseline, Class-LM, ClassTrans+LM. NOTE: Scale]
• Class-LM significantly better (p=0.05, ~1.0 BLEU)
• Class-Trans may be limited by synchronous PT constraint
– Start to address here, but not in time for eval
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Coping with Rich Morphological Constraints in Czech
• Conclusions and Follow On Research
Conclusions and Future Work
• Factored Approach can help with small data
– Large Data tasks may need different factored approaches
• MIT/LL + CSAIL
– Continue experiments with morphology and coherence
– Fully Asynchronous Factor Translation
– Apply techniques to other languages
  Extend existing LCTL experiments
– Syntax-driven reordering models (Brooke)
• Asynchronous Factors Translation (Hieu)
• Making use of verb sub-categorization information (Ondrej)