
MIT Lincoln + Computer Science AI Labs

8/14/2006

Charles University

Wade Shen, Brooke Cowan, Ondrej Bojar and Christine Moran

Factored Translation Models for Small Data Problems

Experiments with Spanish, Czech and Chinese


Outline

• Motivations

• Experimental Design and Baselines

• Models for Agreement in Spanish

• Coping with Rich Morphological Constraints in Czech

• Generalizing Lexical Distortion Models

• Models for Sparse Statistics in Chinese

• Conclusions and Follow-on Research



General Motivations: Challenges with Small Data

• Phrase-based MT relies on large data
– Learn “phrase” co-occurrence within a language
– Learn translation templates/phrases across languages

• Problems for phrase-based MT with small data
– Word alignment
– Hard to see enough phrases (coverage), especially in morphologically rich languages
– Tend to rely on shorter phrases, which increases local agreement problems and long-distance coherence problems



Possible Advantages of Factored Models: Generalization over Morphology

• We can model morphological variation and phrase translation separately for better statistics: Translation + Generation

– Spanish Gender

           Masculine                 Feminine
Spanish    Él es un jugador rojo     Ella es una jugadora roja
English    he is a red player        she is a red player
Morph      m 3p+sing m m m           f 3p+sing f f f

Shared lemmas: el ser un jugador roj

– Czech Case

           Nominative + Plural       Dative + Plural
Czech      černé kočky               černým kočkám
English    black cats                black cats
Morph      nom+pl nom+pl             dat+pl dat+pl

Shared lemmas: černá kočka



Factors as Type Checking: Long Range Phenomena and Divergence

• Long range dependencies can be modeled with latent factors
– Spanish: Verb – Subject Number Agreement

Spanish    Mi hija de dos años tiene catarro
Gloss      My daughter of two years has cold
AGR        Subject: 3p+sing, verb: 3p+sing

Czech      Nachlazena je moje dvouletá dcera.
AGR        verb: 3p+sing, Subject: 3p+sing

• Verb-Argument dependencies

Czech      Napsal zprávu o matčině domu na papír
Gloss      He wrote a message about mother’s house on a paper
Verb select: noun: accusative

Czech      Našel zprávu o matčině domu na papíře
Gloss      He found a message about mother’s house on a paper
Verb select: noun: locative



Phrase-Level Generalization

• Class-based divergences
– Chinese-English resultative constructions
– Similar pattern for large class of verbs (verb specific)

Chinese    你 … 破 …
Gloss      you made it hit broken done
English    you broke it

• Longer distance movement dependencies
– Chinese-English questions

Chinese    你 要 回答 [clause…] 吗
Gloss      you want reply [clause…] y/n-marker
English    would you like to reply to [clause…] ?
Tags: VModal, Pn; Tag: Part; causes reordering



Large vs. Small Data: How Generalizations May Affect SMT Performance

• With large data sets these phenomena can be learned
– Language models should capture local agreement phenomena with enough data
– Long range agreement/coherence is still problematic
– Generalization may still be better, but errors in analysis can limit it

• Generalization may be advantageous for small data
– For example (Spanish/Czech agreement): can’t learn every noun/adjective/determiner triple
– Situation for many real-world problems



Outline

• Motivations

• Experimental Design and Baselines
– Approaches
– Data Sets








Data Sets and Baselines

Data Set        Translation Direction(s)   Size                          Baseline w/ diff LMs (BLEU/Surface)
Full Europarl   English → Spanish          950k LM Train, 700k Bitext    3g 29.35, 4g 29.57, 5g 29.54
Euromini        English → Spanish                                        3g 23.41, 3g (950k) 25.10
Czech WSJ       English → Czech                                          3g 25.82 (four references)
IWSLT Chinese   Chinese → English                                        4g 19.54 (seven references)



Using Factored Models: Approaches for Small-Data Tasks

• Factored models we tried
– Different levels of linguistic information modeled separately (example: morphology vs. phrasal content)
– Feature “checking” of existing phrasal models with LMs on factors
– Generalized factor-based distortion: phrases are likely to move distance X if the preceding word has tag Y

• Hypothesis: these models allow better utilization of limited training data

         Good (high likelihood)       Bad (low likelihood)
Words    I would like some donuts     I would like some big jump
POS      pn mod vb det np             pn mod vb det adj vb
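The "checking" idea on this slide can be illustrated with a toy bigram LM over POS tags. This is a minimal sketch, not the workshop implementation: the training tag sequences, the add-one smoothing and the names `train_bigram` and `logprob` are all assumptions made for illustration. It only shows that a tag-level LM can prefer the "Good" tag sequence over the "Bad" one.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Collect history and bigram counts over tag sequences, with boundary markers."""
    histories, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        histories.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = {tag for seq in sequences for tag in seq} | {"</s>"}
    return histories, bigrams, vocab

def logprob(seq, model):
    """Add-one smoothed bigram log-probability of a tag sequence."""
    histories, bigrams, vocab = model
    padded = ["<s>"] + seq + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + len(vocab)))
    return total

# Hypothetical training tag sequences (not taken from the slides).
training = [
    "pn mod vb det np".split(),
    "pn vb det adj np".split(),
    "det np vb det np".split(),
    "pn mod vb np".split(),
]
model = train_bigram(training)

good = "pn mod vb det np".split()      # I would like some donuts
bad = "pn mod vb det adj vb".split()   # I would like some big jump

good_score = logprob(good, model)
bad_score = logprob(bad, model)
```

In a factored decoder a score of this kind would be one feature among several; here it is simply compared directly, and `good_score` comes out higher than `bad_score` on this toy data.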



Different Factored Approaches: Overview of Models Tried

• High Order Language Models

Analysis       Problems Addressed         Model Types
Supervised     Explicit Agreement         LMs over verbs/subject; LMs over nouns, determiners, adjectives
Supervised     Long Distance Coherence    LMs over POS
Unsupervised   Agreement/Coherence        LMs over Word-Classes

• Parallel Translation Models

Analysis       Problem Types              Model Types
Supervised     Explicit Agreement         Parallel Translation Models over Lemmas and Morphology
Unsupervised   Agreement/Coherence        Parallel Translation Models over Word-Classes and Surface



Outline

• Motivations


• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)







Spanish Experiments: Language Models over Morphological Features

• NDA
– Noun/Determiner/Adjective agreement
– Generate only on N, D and A tags (don’t-cares elsewhere)

• VPN
– Verb/Noun/Preposition selection agreement
– Generate on V, N or P

Model: Surface word → Generate + Check Latent Factors (nda, vpn)

N/D/A Features
– Gender: masc, fem, common, none
– Number: sing, plural, invariable, none

V/N/P Features
– Number: sing, plural, invariable, none
– Person: 1p, 2p, 3p, none
– Prep-ID: Preposition, none
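As an illustration of how the two latent factors could be derived from an analyzed word, here is a minimal sketch. The `Token` class, the coarse tag letters, the factor string format and the helper names are assumptions for this example; the slides only specify which features are generated on which tags.

```python
from dataclasses import dataclass

DONT_CARE = "X"  # placeholder emitted on positions the factor does not cover

@dataclass
class Token:
    word: str
    pos: str              # coarse tag: "V", "N", "D", "A", "P", ...
    gender: str = "none"
    number: str = "none"
    person: str = "none"

def nda_factor(tok):
    """Number+gender on nouns, determiners and adjectives; don't-care elsewhere."""
    if tok.pos in {"N", "D", "A"}:
        return f"{tok.number}+{tok.gender}"
    return DONT_CARE

def vpn_factor(tok):
    """Person+number on verbs, number on nouns, preposition identity on prepositions."""
    if tok.pos == "V":
        return f"{tok.person}+{tok.number}"
    if tok.pos == "N":
        return tok.number
    if tok.pos == "P":
        return tok.word
    return DONT_CARE

# Hypothetical analysis of "dio a la mujer" ("gave the woman").
sentence = [
    Token("dio", "V", number="sing", person="3p"),
    Token("a", "P"),
    Token("la", "D", gender="fem", number="sing"),
    Token("mujer", "N", gender="fem", number="sing"),
]
nda = [nda_factor(t) for t in sentence]
vpn = [vpn_factor(t) for t in sentence]
```

An LM trained over the `nda` or `vpn` sequence then scores agreement independently of the surface words.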



Spanish Experiments: Skipped LMs for Agreement

• Allow NULL factors to be generated
• Increase effective context length to model longer range dependencies

Model: Surface → Generate Latent Factors

Source Phrase    …gave the woman
Target Phrase    dio    a      la     mujer
nda              X      X      s+f    s+f
vpn              3+s    “a”    X      s
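The effect of skipping on context length can be sketched as follows. Treating skipping as a simple filter over the factor sequence is an assumption made for this illustration (a decoder would handle NULL factors during search); the helper names are hypothetical.

```python
def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def skip_dont_care(factors, dont_care="X"):
    """Drop don't-care positions so the remaining factors become adjacent."""
    return [f for f in factors if f != dont_care]

# vpn factors for "dio a la mujer", as on the slide.
vpn = ["3+s", "a", "X", "s"]

without_skipping = ngrams(vpn, 2)
with_skipping = ngrams(skip_dont_care(vpn), 2)
```

Without skipping, the bigram LM sees the noun's number only next to a don't-care symbol. With skipping, the preposition and the noun factor fall inside the same bigram, so a fixed-order LM covers a longer stretch of the surface string.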



Spanish Agreement LMs: Experimental Results

• No Skipping (LM counts don’t-care positions)

Data Set    Baseline    NDA      VPN      Both
EuroMini    23.41       24.47    24.33    24.54

• With Skipping

Data Set    Baseline    NDA+Skip    VPN+Skip
EuroMini    23.41       24.03       24.16

• No Skipping with all morphological features, with and without POS

Data Set    Baseline    Morph    Morph+POS
EuroMini    23.41       24.66    24.25

• All models beat baseline
– Skipping doesn’t seem to help
– Full morphology is best



Spanish Experiments: Parallel Lemma/Morphology Translation

• Factor surface into lemma and morphology features
• Translate both simultaneously
• Re-generate target surface form
• Apply LM on both surface and morphology features

Factor                                          Source (Analysis)    Target (Generation)
Surface                                         Me                   Mi
Lemma                                           I                    Yo
Morphology (Person + Number + Gender + Case)    1ps+Acc              1ps+Acc

• Results:

Data Set              Baseline    Lemma
EuroMini + 950k LM    25.10       25.71
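The analysis, parallel translation and generation steps can be sketched with toy lookup tables. Every table entry and name below is hypothetical and hand-written for illustration; a real system learns translation and generation tables from data and searches over many alternatives rather than doing a single lookup.

```python
# Hypothetical toy tables, keyed on lowercase forms.
ANALYSIS = {            # source surface -> (lemma, morphology)
    "me": ("I", "1ps+Acc"),
    "i": ("I", "1ps+Nom"),
}
LEMMA_TABLE = {"I": "yo"}                                          # lemma -> lemma
MORPH_TABLE = {"1ps+Acc": "1ps+Acc", "1ps+Nom": "1ps+Nom"}         # morph -> morph
GENERATION = {          # (target lemma, target morphology) -> target surface
    ("yo", "1ps+Acc"): "me",
    ("yo", "1ps+Nom"): "yo",
}

def translate(word):
    """Factor the surface form, translate both factors, then regenerate the surface."""
    lemma, morph = ANALYSIS[word.lower()]
    target_lemma = LEMMA_TABLE[lemma]
    target_morph = MORPH_TABLE[morph]
    return GENERATION[(target_lemma, target_morph)]

accusative = translate("Me")
nominative = translate("I")
```

Because the lemma and morphology tables are separate, a lemma seen only in one inflection can still be generated in another, provided the generation table covers that combination.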



Scaling Up to Large Training: POS Language Models

• Full Train → Less/No Gain from richer features

[Figure: POS-LM vs. Baseline. BLEU Score (y-axis, 28 to 31) against POS N-gram Order (x-axis, 3g to 9g) for Baseline, POS-LM and Full Tags. NOTE: Scale]



Outline

• Motivations



• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation



• Analysis and Conclusions

• Follow-on Research



Factors for Coping with Limited Data: Better Word Alignment for Czech

• Word alignment is difficult when data is limited and morphology is rich
– Data: 20k bitext sentences, large vocabulary
– Contrast Set: 20k + 840k (Out of Domain) sentences
– Task: English → Czech

• Two methods to deal with limited data
– Stem Alignment
– Lemma Alignment

• Contrastive behavior for small and large data

Data Set     Word-Word    Stem-Lemma    Stem-Stem
20k Czech    25.17        25.23         25.82

Large Contrast: 24.99, 25.40
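Why stemming can help alignment on small data is easy to see on a toy vocabulary count. The slides do not say how stems were computed, so the fixed-length prefix stem below is purely an assumption, and `stem` and `vocab_size` are hypothetical helper names.

```python
def stem(word, length=4):
    """Crude stem: lowercase the word and keep its first `length` characters (assumed scheme)."""
    return word[:length].lower()

def vocab_size(sentences, transform=lambda w: w):
    """Number of distinct word types after applying `transform` to every token."""
    return len({transform(w) for s in sentences for w in s.split()})

# Inflected forms of "black cat" taken from the earlier Czech example.
czech = ["černé kočky", "černým kočkám", "černá kočka"]

surface_types = vocab_size(czech)
stem_types = vocab_size(czech, stem)
```

Six surface types collapse to two stems here, so each alignment parameter is estimated from more observations. The alignment found on stems can then be carried back to the original surface words.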



Czeching Rich Morphology with Tags: Tagged Czech Language Models

• Idea: Use morphologically rich POS tag sequences to “czech” target output generation

Model: Surface (cat → kočky) → Generation of tag (N+acc) → Apply LM

• POS Information Configurations (Baseline: 25.82)

Configuration    Features                                                                       Size         Result
Full Tags        Feature 1, Feature 2, … (15 total)                                             1098 tags    27.04
CNG Tags         Case, Number + Gender on V, P, PP, N, A                                        707 tags     27.45
CNG+VP           CNG features, Person + Tense + Aspect (verbs), Lemma + Case (prepositions)     899 tags     27.62
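A reduction from full tags to case/number/gender tags might look like the sketch below. It assumes the 15-position layout of the Prague Dependency Treebank positional tagset (position 1 POS, 3 gender, 4 number, 5 case), which the slides do not state explicitly. The `keep_pos` set and the output format are also assumptions.

```python
def cng_tag(full_tag, keep_pos=frozenset("NAPVR")):
    """Reduce a 15-position tag to POS+case+number+gender on selected POS classes.

    Tags of other POS classes are reduced to the bare POS letter.
    """
    pos = full_tag[0]
    if pos not in keep_pos:
        return pos
    gender, number, case = full_tag[2], full_tag[3], full_tag[4]
    return f"{pos}+{case}+{number}+{gender}"

# Hypothetical full tags for noun forms (feminine plural) and an adverb.
nominative = "NNFP1-----A----"   # e.g. kočky
dative = "NNFP3-----A----"       # e.g. kočkám
negated = "NNFP1-----N----"      # differs from `nominative` only in negation
adverb = "Db" + "-" * 13

reduced = [cng_tag(t) for t in (nominative, dative, negated, adverb)]
```

Tags that differ only in features outside case, number and gender collapse to the same reduced tag, which is how the tag inventory shrinks and the tag LM gets denser counts.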



Comparing with Larger Data Models: Tagged Czech Language Models

• Large vs. Small Data

Model       Data Set                            BLEU     Relative Improvement
Baseline    20k Czech                           25.82    –
Baseline    Large Contrast (20k + 840k OOD)     27.47    –
CNG+VP      20k Czech                           27.62    6.97%
CNG+VP      Large Contrast (20k + 840k OOD)     28.12    2.37%

• Tagged language models improve performance for small data significantly
– Approaches large data performance
• Large task also improves (but much less: 2.37% vs. 6.97%)
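The relative improvements follow directly from the BLEU scores in the table. A short check, with `relative_improvement` as a hypothetical helper name:

```python
def relative_improvement(baseline, system):
    """Relative BLEU gain of `system` over `baseline`, in percent."""
    return 100.0 * (system - baseline) / baseline

small = relative_improvement(25.82, 27.62)   # 20k Czech, Baseline vs. CNG+VP
large = relative_improvement(27.47, 28.12)   # Large Contrast, Baseline vs. CNG+VP

small_rounded = round(small, 2)
large_rounded = round(large, 2)
```

On the small set the gain is 1.80 BLEU over 25.82, on the large set 0.65 BLEU over 27.47, which gives the 6.97% and 2.37% shown in the table.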



Parallel Translation Models for Czech

• Motivation: Factored LM models seem to lose number information

Model: Surface (him → ho) translated in parallel with POS Tag + CNG Features (3p+acc → 3p+acc)

Model                                Result
Surface → Surface + POS → POS+CNG    25.94
Surface → Lemma + POS → POS+CNG      26.43

• Better than baseline, but worse than both CNG & CNG+VP



Outline

• Motivations




• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results


• Analysis and Conclusions

• Follow-on Research



Generalized Distortion Modeling: Introduction to Distortion

• For each phrase pair we learn its likely placement relative to the previous phrase

• Orientations
– Monotone: word alignment point on top left
– Swap: word alignment point on top right
– Discontinuous: not monotone or swap

• Examples
– la casa roja → the red house
– D NN ADJ → D ADJ NN

[Figure: alignment grid with Source and Target axes illustrating the Monotone, Swap and Discontinuous orientations]
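The three orientations can be read off the word alignment, as in this minimal sketch. It assumes 0-based (source index, target index) alignment points, inclusive phrase spans and a virtual alignment point before the sentence start; the function name and these conventions are illustrative, not taken from the slides.

```python
def orientation(alignment, src_span, tgt_span):
    """Classify a phrase pair by the alignment point diagonally before it.

    alignment: iterable of (source_index, target_index) pairs.
    src_span, tgt_span: inclusive (start, end) word indices of the phrase pair.
    """
    s_start, s_end = src_span
    t_start, _t_end = tgt_span
    points = set(alignment) | {(-1, -1)}          # virtual sentence-start point
    if (s_start - 1, t_start - 1) in points:      # "top left"
        return "monotone"
    if (s_end + 1, t_start - 1) in points:        # "top right"
        return "swap"
    return "discontinuous"

# la casa roja -> the red house: la-the, casa-house, roja-red
alignment = [(0, 0), (1, 2), (2, 1)]

whole_np = orientation(alignment, (1, 2), (1, 2))   # "casa roja" / "red house"
house = orientation(alignment, (1, 1), (2, 2))      # "casa" / "house"
red = orientation(alignment, (2, 2), (1, 1))        # "roja" / "red"
```

The longer phrase pair absorbs the noun-adjective inversion and is monotone. The single-word pair for "house" is a swap, and "red" is discontinuous because neither neighbouring alignment point exists.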



Factor-based Distortion Models

• A factor-based extension of lexicalized distortion
– Use of more general factors, e.g. POSf-POSe, Lemma-Lemma

• Can model longer range dependencies
– More conditioning variables

• Motivating results
– Hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (Gispert et al. 2006)
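The benefit of conditioning on more general factors can be sketched by estimating orientation distributions twice from the same events, once keyed on words and once keyed on POS pairs. The event list, field names and the `estimate` helper are hypothetical, and relative-frequency estimation is an assumption made for illustration.

```python
from collections import Counter, defaultdict

def estimate(events, key):
    """Relative-frequency orientation distribution for each conditioning key."""
    counts = defaultdict(Counter)
    for event in events:
        counts[key(event)][event["orientation"]] += 1
    return {
        k: {o: c / sum(dist.values()) for o, c in dist.items()}
        for k, dist in counts.items()
    }

# Hypothetical English -> Spanish training events.
events = [
    {"src": "the", "tgt": "la", "src_pos": "D", "tgt_pos": "D", "orientation": "monotone"},
    {"src": "red", "tgt": "roja", "src_pos": "ADJ", "tgt_pos": "ADJ", "orientation": "swap"},
    {"src": "blue", "tgt": "azul", "src_pos": "ADJ", "tgt_pos": "ADJ", "orientation": "swap"},
    {"src": "big", "tgt": "grande", "src_pos": "ADJ", "tgt_pos": "ADJ", "orientation": "swap"},
]

lexical = estimate(events, key=lambda e: (e["src"], e["tgt"]))
pos_pos = estimate(events, key=lambda e: (e["src_pos"], e["tgt_pos"]))

unseen_pair_known = ("green", "verde") in lexical
adjective_dist = pos_pos[("ADJ", "ADJ")]
```

The lexicalized table has no entry for an unseen pair such as green/verde. The POS-POS table still predicts a swap for it, because all adjective events share one parameter set.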



Factor-based Distortion: Spanish Experiments

• Lexicalized Distortion only

Europarl Lang    Pharaoh    Moses
En → De          18.15      18.85
Es → En          31.06      31.85
En → Es          31.46      32.37

• Factor-based Distortion on small data

System                           Result
Baseline (No Lexical)
Baseline Lexical
Factored: POS-POS
Combined: Lexical + POS-POS

• Further Experiments
– Other Factors
– Minimizing Model Parameters
– Combining different models



Outline

• Motivations









IWSLT Chinese: Experiments with Unsupervised Annotation

• Data: Travel-domain sentences, limited vocabulary, short sentences
• Task: Text and ASR translation, Chinese → English

• Can we use automatic word classes to learn general sequence constraints?

• First Experiment: LMs of varying orders over 2-gram word classes

Model (Source Phrase / Target Phrase):

Surface    How much is it?
word       总共    多少    钱     ?
class      c1      c22     c3     c55
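One reason class sequences support higher-order LMs on small data is that they produce far fewer distinct n-grams than words do. The sketch below counts trigram types on a toy corpus; the corpus, the hand-written word-to-class map and the helper name are hypothetical (real classes would be induced automatically).

```python
def ngram_types(sentences, n):
    """Set of distinct n-grams over a list of token lists."""
    return {tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1)}

# Hypothetical word classes, written by hand for this example.
WORD_CLASS = {
    "how": "c1", "much": "c2", "far": "c2", "is": "c3",
    "it": "c4", "this": "c4", "that": "c4",
}

corpus = [
    "how much is it".split(),
    "how much is this".split(),
    "how far is it".split(),
    "how far is that".split(),
]
class_corpus = [[WORD_CLASS[w] for w in sentence] for sentence in corpus]

word_trigrams = ngram_types(corpus, 3)
class_trigrams = ngram_types(class_corpus, 3)
```

The same eight trigram tokens are spread over six word types but only two class types, so class n-gram counts stay usable at orders where word n-gram counts are mostly singletons.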



IWSLT Chinese: Alignment Templates for Translation

• Second Experiment: Extend class-based LM to the translation model

• Bigram word classes for source and target

• Translate alignment templates similar to [Och 98] + surface

• Apply LM to surface and class

[Diagram: Word Class and Surface factors translated in parallel, followed by Generation; example labels: Me, I, Yo, Mi]



IWSLT Chinese: Autoclass Results

[Figure: BLEU Score (y-axis, 18 to 22.5) against Class N-gram Order (x-axis, 3g to 9g) for Baseline, Class-LM and ClassTrans+LM. NOTE: Scale]

• Class-LM significantly better (p=0.05, ~1.0 BLEU)
• Class-Trans may be limited by synchronous PT constraint
– Start to address here, but not in time for eval



Outline

• Motivations






• Conclusions and Follow-on Research



Conclusions and Future Work

• Factored approach can help with small data
– Large data tasks may need different factored approaches

• MIT/LL + CSAIL
– Continue experiments with morphology and coherence
– Fully asynchronous factor translation
– Apply techniques to other languages: extend existing LCTL experiments
– Syntax-driven reordering models (Brooke)

• Asynchronous factor translation (Hieu)

• Making use of verb sub-categorization information (Ondrej)
