Date post: | 31-Mar-2015 |
Upload: | jordan-earles |
1
Machine Translation
Domain Adaptation
Day 19
2
PROJECT #2
MEMM tools
• Online description of project #2 has been updated with more information
Quick walk through

training.txt
I/PRP left/VBD ./.
John/NNP arrived/VBD ./.

You write code to convert this to features!
“featurize.pl training.txt training.feats”

training.feats
PRP w0=I:1 w-1=<s>:1
VBD w0=left:1 w-1=I:1
. w0=.:1 w-1=left:1

NNP w0=John:1 w-1=<s>:1
VBD w0=arrived:1 w-1=John:1
. w0=.:1 w-1=arrived:1

Run memm_train to train this model
“memm_train --input training.feats --classifier trigram.model --markovOrder 2”

trigram.model
<binary gobbledygook>

Get some unseen test data…

test.txt
he/PRP arrived/VBD ./.
John/NNP left/VBD ./.

Use the same featurization code on test data
“featurize.pl test.txt test.feats”

test.feats
PRP w0=he:1 w-1=<s>:1
VBD w0=arrived:1 w-1=he:1
. w0=.:1 w-1=arrived:1

NNP w0=John:1 w-1=<s>:1
VBD w0=left:1 w-1=John:1
. w0=.:1 w-1=left:1

memm_test predicts tags (memm_test ignores first column; can include true tags)
“memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags”

test.tags
PRP
VBD
.

NNP
VBD
.
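The featurization step in the walk through could be sketched as follows. This is a minimal Python illustration, not the course's featurize.pl: the function name `featurize`, the single-space separators, and the blank line between sentences are assumptions read off the example files above.

```python
def featurize(tagged_sentences):
    """Turn 'word/TAG' sentences into one feature line per token.

    Assumed output format (from the walk-through example):
    TAG w0=<current word>:1 w-1=<previous word>:1
    with an empty line after each sentence.
    """
    lines = []
    for sentence in tagged_sentences:
        prev_word = "<s>"  # sentence-start marker for the first token
        for token in sentence.split():
            # Split on the LAST slash so that './.' yields word '.' and tag '.'
            word, tag = token.rsplit("/", 1)
            lines.append(f"{tag} w0={word}:1 w-1={prev_word}:1")
            prev_word = word
        lines.append("")  # assumed sentence separator
    return lines


if __name__ == "__main__":
    for line in featurize(["I/PRP left/VBD ./.", "John/NNP arrived/VBD ./."]):
        print(line)
```

The same function would be applied unchanged to the test data, which is the point of the "use the same featurization code" step.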
MEMM features

training.txt
I/PRP left/VBD ./.
John/NNP arrived/VBD ./.

training.feats
PRP w0=I:1 w-1=<s>:1
VBD w0=left:1 w-1=I:1
. w0=.:1 w-1=left:1

NNP w0=John:1 w-1=<s>:1
VBD w0=arrived:1 w-1=John:1
. w0=.:1 w-1=arrived:1

You provide these features… and add the argument “--markovOrder 2”

Actual features used by MEMM
PRP w0=I:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]=<s>:1
. w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
<s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
NNP w0=John:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]=<s>:1
. w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
<s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1

The MEMM adds in features about tag context at training and test time
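To make the tag-context step concrete, here is a rough Python sketch of how previous-tag features could be appended for a given Markov order. It is not the memm_train implementation: the function name `add_tag_context` is hypothetical, and the sketch omits the end-of-sentence `<s>` event shown on the slide.

```python
def add_tag_context(feature_lines, tags, order=2):
    """Append t[-1], (t[-1],t[-2]), ... features to each token's features.

    feature_lines: per-token feature strings WITHOUT the leading tag,
                   e.g. 'w0=left:1 w-1=I:1'
    tags:          the tag for each token (true tags at training time;
                   at test time these would be the tags hypothesized so far)
    order:         Markov order, mirroring '--markovOrder 2'
    """
    out = []
    history = ["<s>"] * order  # pad with sentence-start tags
    for feats, tag in zip(feature_lines, tags):
        context, extra = [], []
        for k in range(1, order + 1):
            context.append(f"t[-{k}]={history[-k]}")
            # One conjoined feature per context length: t[-1], then t[-1],t[-2], ...
            extra.append(",".join(context) + ":1")
        out.append(f"{tag} {feats} " + " ".join(extra))
        history.append(tag)
    return out


if __name__ == "__main__":
    feats = ["w0=I:1 w-1=<s>:1", "w0=left:1 w-1=I:1", "w0=.:1 w-1=left:1"]
    for line in add_tag_context(feats, ["PRP", "VBD", "."]):
        print(line)
```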
11
MACHINE TRANSLATION
12
Acknowledgments
• Many thanks to (for helpful content and input on content):
  – Chris Callison-Burch, Matt Post, & Adam Lopez (JHU)
  – Philipp Koehn & Barry Haddow (U Edinburgh)
  – Kevin Knight (ISI)
13
14
15
Translation: global problem and interesting research problem
Internet users – 2007
English 32%
Chinese 13%
Spanish 9%
Japanese 7%
French 5%
German 4%
Arabic 4%
Portuguese 4%
Other 21%
• Non-English Internet content and user communities are increasing explosively
• Human translation costs are excessive: major languages range from 10-50 cents per word
Result: the vast majority of published material remains untranslated!
16
Prevalence of MT on the Web
[Bar chart: Proportion of MT’d Content by language. Languages shown: Estonian, Hungarian, Slovenian, Slovak, Romanian, Latvian, Lithuanian. Values shown: 12.13%, 12.93%, 25.47%, 46.40%, 47.40%, 50.07%, 51.53%.]
From Rarrick et al, 2010
17
18
The Goal: (sentence) translation
• Translate source sentences into target sentences
  – For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries
滴水之恩當以涌泉相報
A drop of water shall be returned with a burst of spring.
19
Types of MT systems
• Source of information
  – Rule based: People write rules to specify translations of words, phrases
  – Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora)
• Level of representation (modified Vauquois pyramid, top to bottom)
  Interlingua
  Semantic forms
  Syntax trees
  Phrases
  Words
20
Advantages of data-driven translation
• We can model the genres of documents that we would like to model
  – Learn contextually appropriate translations for technical data, chat data, etc.
• Very flexible system
  – Given corpus C = ({x1,y1}, {x2,y2}, …) of sentence pairs
  – Translate(C, x) = y is a function of the training data and the input sentence
  – To build a new system (or optimize our old one) we just change the data
  – But…we need oodles of data to get “good” models
21
Statistical MT
• Learn word and phrase alignments from “parallel” data
22
Statistical MT
• Learn word and phrase alignments from “parallel” data
  – Parallel data?
  – Parallel documents?
23
Statistical MT
• Learn word and phrase alignments from “parallel” data
  – Parallel documents?
24
25
26
Statistical MT
• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)
27
28
29
Some Hmong
a house ib lub tsev
a new house ib lub tsev tshiab
my new house kuv lub tsev tshiab
eight new houses yim lub tsev tshiab
my eight new houses kuv yim lub tsev tshiab
30
Some More Hmong
a house ib lub tsev
a new house ib lub tsev tshiab
my new house kuv lub tsev tshiab
eight new houses yim lub tsev tshiab
my eight new houses kuv yim lub tsev tshiab
the house lub tsev
31
Even More Hmong
kuv pluag heev    I'm very poor
ib pluag mov    a meal
ib taig mov    a bowl of rice
ib taig zaub    a bowl of vegetables
32
33
Statistical MT
• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)
• Use monolingual data to
  – Build language models
    • Inform ordering
    • Choose best translation from n-best list
34
35
36
Statistical MT Recipe
Start With
• Parallel sentences
  – Align words & phrases, & generate counts
• Monolingual data
• Decoding Algorithm
Build These Components
• Translation Model
  – Probs associated with aligned words & phrases – P (E|F)
• Language Model – P(E)
• Decoder
  – Maximizes P(F|E)*P(E)
37
Statistical Machine Translation
• Given foreign f, find best English translation e*:
  $e^* = \arg\max_e P(e \mid f)$
• Use Bayes’ rule to get the “noisy channel” model:
  $P(e \mid f) = \dfrac{P(f \mid e)\, P(e)}{P(f)}$
  $\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$
• P(f | e) is the channel or translation model
• P(e) is the language model
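As a small illustration of the noisy channel argmax, the sketch below picks the candidate e with the highest log P(f | e) + log P(e) for one fixed foreign sentence f. The function name `decode`, the candidate strings, and all probabilities are made up for illustration only; a real decoder searches a far larger space instead of scoring a given list.

```python
import math


def decode(candidates, tm_logprob, lm_logprob):
    """Return the candidate e maximizing log P(f|e) + log P(e).

    P(f) is dropped because it is constant across candidates.
    """
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])


# Hypothetical scores: both candidates explain f equally well under the
# translation model, so the language model decides between them.
tm = {
    "clients do not sell": math.log(0.20),
    "clients not sell do": math.log(0.20),
}
lm = {
    "clients do not sell": math.log(0.010),
    "clients not sell do": math.log(0.0001),
}

best = decode(list(tm), tm, lm)
print(best)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.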
38
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Slides 38-74 adapted from Kevin Knight and CCB’s JHU crew
39
Centauri/Arcturan [Knight, 1997]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
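One way to start on this puzzle by machine is to intersect the Arcturan sides of every sentence pair whose Centauri side contains a given word. The Python sketch below does only that; it is a simple co-occurrence heuristic, not the EM-based word alignment covered later, and the names `CORPUS` and `candidates` are hypothetical.

```python
# The twelve sentence pairs from the slide, without the final periods.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidates(word):
    """Arcturan words that occur in EVERY pair whose Centauri side has `word`."""
    sets = [set(arc.split()) for cen, arc in CORPUS if word in cen.split()]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
        print(w, sorted(candidates(w)))
```

Words seen in only one sentence pair keep that pair's whole Arcturan side as candidates, which is where further reasoning over the other pairs has to take over.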
40
41
42
Centauri/Arcturan [Knight, 1997]
???
43
44
45
46
Centauri/Arcturan [Knight, 1997]
???
47
48
Centauri/Arcturan [Knight, 1997]
process of elimination
49
Centauri/Arcturan [Knight, 1997]
cognate?
50
Centauri/Arcturan [Knight, 1997]
zero fertility
51
It’s Really Spanish/English
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zanzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
52
53
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
Centauri/Arcturan [Knight, 1997]
zero fertility
54
Reorder
55
Reorder
56
Reorder
57
Reorder
5040 Possible Orderings!!
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
Language Model
• Use a standard n-gram language model for P(E).
• Trained on large monolingual corpus
  – 4- or 5-gram is typical
  – Often uses target side of parallel data + monolingual data
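To show how a language model can inform ordering, here is a toy Python sketch that trains a bigram model on the Arcturan side of the Knight corpus and scores every ordering of the word bag from the reordering exercise. The bigram order, the add-one smoothing, and the names `ARCTURAN`, `logprob`, and `bag` are assumptions chosen to keep the sketch small; the slide itself describes 4- or 5-gram models trained on large corpora.

```python
import math
from collections import Counter
from itertools import permutations

# Arcturan side of the twelve sentence pairs, without final periods.
ARCTURAN = [
    "at-voon bichat dat",
    "at-drubel at-voon pippat rrat dat",
    "totat dat arrat vat hilat",
    "at-voon krat pippat sat lat",
    "totat jjat quat cat",
    "wat dat krat quat cat",
    "wat jjat bichat wat dat vat eneat",
    "iat lat pippat rrat nnat",
    "totat nnat quat oloat at-yurp",
    "wat nnat gat mat bat hilat",
    "wat nnat arrat mat zanzanat",
    "wat nnat forat arrat vat gat",
]

bigrams, contexts = Counter(), Counter()
for sent in ARCTURAN:
    words = ["<s>"] + sent.split() + ["</s>"]
    contexts.update(words[:-1])            # every word that has a successor
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs

# Vocabulary size for smoothing: all word types plus the end marker.
VOCAB = len({w for s in ARCTURAN for w in s.split()}) + 1


def logprob(words):
    """Add-one smoothed bigram log-probability of a word sequence."""
    seq = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + VOCAB))
        for a, b in zip(seq, seq[1:])
    )


bag = ["jjat", "arrat", "mat", "bat", "oloat", "at-yurp"]
best = max(permutations(bag), key=logprob)

if __name__ == "__main__":
    print(" ".join(best), logprob(best))
```

Exhaustive enumeration only works for tiny bags like this one; real decoders prune the search rather than scoring every permutation.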
76
Translation Model
• “Phrase table”
  – N-gram pairs and probabilities
77
Statistical Machine Translation
78
EVALUATING MT
MT Evaluation
Source: ズキズキ 痛み ます 。
16 human translations:
• I have a throbbing pain.
• I am experiencing a throbbing pain.
• I am suffering from a throbbing pain.
• I am feeling a throbbing pain.
• It is a throbbing pain.
• It's throbbing and it really hurts.
• It's painful and it's throbbing.
• It's throbbing with pain.
• It's in throbbing pain.
• It hurts so much it's throbbing.
• I've got a throbbing pain.
• I can feel a throbbing pain.
• I am suffering from a throbbing pain.
• I am experiencing a throbbing pain.
• I have a painful throbbing.
• I feel a painful throbbing.
79
Data from International Workshop on Spoken Language Translation
80
MT Evaluation
• No “right answer”!
• What can we test instead?
  – Human adequacy / fluency ratings
  – Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
  – Very accurate, but slow & expensive
• Agreement with reference translations
  – BLEU (BiLingual Evaluation Understudy: IBM)
  – Fast system development
81
BLEU (Papineni, ACL 2002)
• MT output:
  1: It is a guide to action which ensures that the military always obeys the commands of the party.
  2: It is to insure the troops forever hearing the activity guidebook that party direct.
• Human (reference) translations:
  1: It is a guide to action that ensures that the military will forever heed Party commands.
  2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  3: It is the practical guide for the army always to heed the directions of the party.
82
83
84
BLEU: observations
1: It is a guide to action which ensures that the military always obeys the commands of the party.
2: It is to insure the troops forever hearing the activity guidebook that party direct.
• Observations
  – Word overlap is indicative
  – n-gram (word sequence) overlap is even more distinct
  – Drawing from multiple reference translations helps
85
BLEU metric
• Compute n-gram precisions:
  $P_n = \dfrac{c(\text{matched } n\text{-grams})}{c(n\text{-grams in candidate})}$
• Compute a brevity penalty (prevents candidates from deleting difficult words):
  $\mathrm{BP} = \exp\left(\min\left(1 - \frac{r}{c},\ 0\right)\right)$, where $r$ = reference length and $c$ = candidate length
• Combine using the geometric mean:
  $\mathrm{BLEU} = \mathrm{BP} \cdot \left(\prod_{i=1}^{n} P_i\right)^{1/n}$
• Produces a score on a 0-1 scale, often expressed as a “percentage” (e.g., × 100)
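The metric above can be sketched in a few lines of Python. This is a simplified sentence-level illustration, not a reference implementation: it assumes lowercased whitespace tokenization, counts a candidate n-gram as "matched" at most as often as it appears in any single reference (the clipping used by Papineni et al.), takes r as the reference length closest to the candidate length, and returns 0 when any precision is 0. The function names `ngrams` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count the n-grams (as tuples) in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(candidate, references, max_n=4):
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Largest count of each n-gram in any one reference (clipping limit).
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    c = len(cand)
    # Assumed choice of r: the reference length closest to the candidate's.
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = math.exp(min(1 - r / c, 0))
    return bp * math.exp(sum(log_precisions) / max_n)


REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
MT1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
MT2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

if __name__ == "__main__":
    print(bleu(MT1, REFS), bleu(MT2, REFS))
```

On the slide's example, the first MT output shares long word sequences with the references while the second shares no trigram with any of them, so the sketch ranks the first above the second.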
BLEU results circa 2002
[from Papineni et al., ACL 2002] [from G. Doddington, NIST]
Distinguishes humans from machines… …correlates well with human judgments
86
However nowadays we’re starting to see problems:
- Some systems score better than human translations
- In competitions, some “gaming of BLEU”
- Rule based systems are at a disadvantage after tuning
87
Next Time
• MT & Word Alignment
• Application of EM