Post on 26-Jun-2015
Feature-Based Models
• Some (not all) key ingredients in Google Translate:
  • Phrase-based translation models
  • ... learned heuristically from word alignments
  • ... coupled with a huge language model
  • ... and very tight pruning heuristics
• Today: more flexible parameterizations.
p(English|Chinese) ∝ p(English) × p(Chinese|English)

Bayes’ Rule: p(English) is the language model, p(Chinese|English) is the translation model.
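As a toy illustration of the noisy-channel decomposition above, the sketch below ranks candidate translations by p(English) × p(Chinese|English). All sentences and probabilities are invented for illustration, not outputs of real models:

```python
# Toy noisy-channel ranking: pick the English candidate maximizing
# p(English) * p(Chinese|English). All numbers are made-up toy values.
candidates = {
    # English candidate:   (p(English), p(Chinese|English))
    "he goes to school": (1e-4, 2e-3),
    "he go to school":   (1e-7, 5e-3),  # disfluent: language model penalizes it
    "school he goes":    (1e-9, 4e-3),
}

def channel_score(lm_prob, tm_prob):
    # p(English|Chinese) is proportional to this product (Bayes' rule,
    # with the constant p(Chinese) dropped).
    return lm_prob * tm_prob

best = max(candidates, key=lambda e: channel_score(*candidates[e]))
print(best)  # the fluent candidate wins despite a lower channel probability
```

Note how the language model dominates here: the fluent sentence wins even though its translation-model probability is the lowest of the three.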
argmax_English p(Chinese|English) × p(English) ∼ p(English|Chinese)

Raising the component models to different powers reweights their influence on the argmax:

argmax_English p(Chinese|English)^1 × p(English)^1 ∼ p(English|Chinese)
argmax_English p(Chinese|English)^2 × p(English)^1 ∼ p(English|Chinese)
argmax_English p(Chinese|English)^(1/2) × p(English)^1 ∼ p(English|Chinese)
argmax_English p(Chinese|English)^0 × p(English)^1 ∼ p(English|Chinese)
In the log domain, the exponents become coefficients:

argmax_English 0 · log p(Chinese|English) + 1 · log p(English) ∼ log p(English|Chinese)

log(x) is monotonic for positive x: log(x) > log(y) iff x > y, so taking logs does not change the argmax.

In general, define a weighted score:

score(English|Chinese) = λ1 log p(Chinese|English) + λ2 log p(English)
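Because log is monotonic, the weighted log score ranks candidates exactly as the corresponding product of powered probabilities does. A minimal sketch with invented probabilities and arbitrary λ values:

```python
import math

# The log-domain score lambda1*log p(C|E) + lambda2*log p(E) ranks
# candidates identically to p(C|E)^lambda1 * p(E)^lambda2, because
# log is monotonic. All probabilities are invented toy numbers.
candidates = {
    "sentence a": (2e-3, 1e-4),  # (p(Chinese|English), p(English))
    "sentence b": (5e-3, 1e-7),
    "sentence c": (4e-3, 1e-9),
}
lam1, lam2 = 0.7, 1.3  # arbitrary positive weights

def product_score(tm, lm):
    return (tm ** lam1) * (lm ** lam2)

def log_score(tm, lm):
    return lam1 * math.log(tm) + lam2 * math.log(lm)

best_product = max(candidates, key=lambda e: product_score(*candidates[e]))
best_log = max(candidates, key=lambda e: log_score(*candidates[e]))
assert best_log == best_product  # same argmax either way
print(best_log)
```

In practice the log form is preferred: sums of log probabilities avoid the numerical underflow that products of many small probabilities cause.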
score(English|Chinese) = exp(λ1 log p(Chinese|English) + λ2 log p(English))

Normalizing over all English candidates turns the score into a probability:

p(English|Chinese) = exp(λ1 log p(Chinese|English) + λ2 log p(English)) / Σ_English′ exp(λ1 log p(Chinese|English′) + λ2 log p(English′))
This model is known as a log-linear model, maximum entropy model, conditional model, or undirected model.

Note: the original model p(English) × p(Chinese|English) is a special case of this model (set λ1 = λ2 = 1)!
More generally, with arbitrary feature functions hk:

p(English|Chinese) = exp(Σk λk hk(English, Chinese)) / Σ_English′ exp(Σk λk hk(English′, Chinese))
Writing the denominator as Z:

p(English|Chinese) = (1/Z) exp(Σk λk hk(English, Chinese))

Z is the normalization term or partition function.
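For a finite candidate set, the normalized model can be computed directly; the feature vectors and weights below are invented toy values:

```python
import math

# Sketch of the normalized log-linear model over a finite candidate set:
# p(E|C) = exp(sum_k lambda_k * h_k(E, C)) / Z, where Z sums the same
# exponentiated scores over all candidates E'. Toy feature values.
lambdas = [0.5, 1.0]
feature_vectors = {            # h(E, C) for three hypothetical candidates
    "candidate 1": [2.0, 1.0],
    "candidate 2": [1.0, 3.0],
    "candidate 3": [0.0, 0.5],
}

def unnorm(h):
    # Unnormalized score: exp of the weighted feature sum.
    return math.exp(sum(l * f for l, f in zip(lambdas, h)))

Z = sum(unnorm(h) for h in feature_vectors.values())  # partition function
probs = {e: unnorm(h) / Z for e, h in feature_vectors.items()}

assert abs(sum(probs.values()) - 1.0) < 1e-12  # Z makes it a distribution
print(max(probs, key=probs.get))
```

Dividing by Z is what makes the exponentiated scores sum to one; without it the model assigns scores, not probabilities.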
The functions hk are features or feature functions. They are deterministic (fixed) functions of the input/output pair. The parameters of the model are the λk terms.
What’s a Feature?

A feature can be any function of the form hk : English × Chinese → R+, for example:
• Language model: p(English)
• Translation model: p(Chinese|English)
• Reverse translation model: p(English|Chinese)
• The number of words in the English sentence.
• The number of verbs in the English sentence.
• 1 if the English sentence has a verb, 0 otherwise.
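The simpler of these features are easy to state as code. A sketch, with a toy verb list standing in for a real part-of-speech tagger (both the lexicon and the sentence pair are invented):

```python
# Feature functions h_k mapping an (English, Chinese) sentence pair to a
# number, mirroring the examples above. TOY_VERBS is a hypothetical
# stand-in for a real tagger or lexicon.
TOY_VERBS = {"goes", "go", "eats", "is"}

def h_length(english, chinese):
    # Number of words in the English sentence.
    return len(english.split())

def h_verb_count(english, chinese):
    # Number of (toy-lexicon) verbs in the English sentence.
    return sum(1 for w in english.split() if w in TOY_VERBS)

def h_has_verb(english, chinese):
    # 1 if the English sentence has a verb, 0 otherwise.
    return 1 if h_verb_count(english, chinese) > 0 else 0

pair = ("he goes to school", "他去上学")
print([h(*pair) for h in (h_length, h_verb_count, h_has_verb)])  # [4, 1, 1]
```

Each function takes the whole pair even when it only inspects one side, matching the hk : English × Chinese → R+ signature.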
What’s a Feature?

A feature can be any function of the form hk : English × Chinese → R+, for example:
• A word-based translation model: p(Chinese|English)
• Agreement features in the English sentence.
• Features over part-of-speech sequences in the English sentence.
• How many times the sentence pair includes the English word north and the Chinese word 北.
• Do the words north and 北 appear in a dictionary?
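The lexical co-occurrence feature admits several definitions; one possible reading (an assumption) counts the product of occurrences on the two sides. The sentence pair is invented:

```python
# One possible reading of the north/北 co-occurrence feature: the product
# of the counts on each side of the sentence pair. This definition is an
# illustrative assumption, not the slide's fixed choice.
def h_north(english, chinese):
    return english.split().count("north") * chinese.count("北")

print(h_north("the north wind blows north", "北风吹"))  # 2 occurrences * 1
```

The dictionary feature would similarly be a 0/1 indicator looking the pair up in a bilingual lexicon.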
Learning

argmax_θ (1/Z) exp(Σk λk hk(English, Chinese)), where θ = ⟨λ1, ..., λK⟩

Techniques: SGD, L-BFGS. These require computing derivatives (expectations!) and iterating.
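The "derivatives (expectations!)" point can be made concrete: for one training pair, the gradient of the log-likelihood with respect to λk is the observed feature value minus its expectation under the model. A toy SGD step over an invented two-candidate list:

```python
import math

# One SGD step on the log-likelihood of a single training pair, with the
# candidate space approximated by a tiny hypothetical n-best list.
# Gradient w.r.t. lambda_k: h_k(observed) - E_p[h_k]  (an expectation).
lambdas = [0.0, 0.0]
candidates = {                 # h(E, C) for each candidate English output
    "good translation": [2.0, 1.0],
    "bad translation":  [0.0, 3.0],
}
observed = "good translation"  # the reference output for this pair
lr = 0.1                       # learning rate

def model_probs(lams):
    scores = {e: math.exp(sum(l * f for l, f in zip(lams, h)))
              for e, h in candidates.items()}
    Z = sum(scores.values())
    return {e: s / Z for e, s in scores.items()}

probs = model_probs(lambdas)   # model distribution at the current weights
for k in range(len(lambdas)):
    expected = sum(p * candidates[e][k] for e, p in probs.items())
    gradient = candidates[observed][k] - expected
    lambdas[k] += lr * gradient

print(model_probs(lambdas)[observed])  # probability of the reference rises
```

After the step, the model assigns more probability to the observed output; repeating this over the training data is (a caricature of) maximum-likelihood training.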
Problems
• Inference is intractable!
  • Compute over n-best lists of outputs.
  • Compute over pruned search graphs.
• Reachability: what if the data likelihood is zero?
  • Throw away data.
  • Pretend the sentence with the highest BLEU score is observed.
Problems
• Why maximize likelihood if we care about BLEU or some other metric?

Instead, evaluate the metric on the model’s own outputs:

BLEU(MT output)
BLEU(argmax_English score(English|Chinese))
(1/|Test|) Σ_Chinese∈Test BLEU(argmax_English score(English|Chinese))

• Optimization