Roadmap (for the next two classes)

Review LMs: What are they? How (and where) are they used? How are they trained?
Evaluation metrics: entropy, perplexity
Smoothing: Good-Turing, backoff and interpolation, absolute discounting, Kneser-Ney
What is a language model?

Assigns a probability to communication: to transmitted signals, to information (Claude Shannon, Information Theory)
Lots of ties to cryptography and information theory
We most often use n-gram models
Applications

What word sequence (English) does this phoneme sequence correspond to?
AY D L AY K T UW
R EH K AH N AY S B IY CH

Goal of LM: P("I'd like to recognize speech") > P("I'd like to wreck a nice beach")
Phoneme Example Translation
AH hut HH AH T
AY hide HH AY D
B be B IY
CH cheese CH IY Z
D dee D IY
EH Ed EH D
IY eat IY T
K key K IY
L lee L IY
N knee N IY
P pee P IY
S sea S IY
T tea T IY
UW two T UW
Z zee Z IY
Why n-gram LMs?

We could just count how often a sentence occurs…
…but language is too productive (infinite combinations!). Break it down by word: predict each word given its history.
We could just count words in context…
…but even contexts get too sparse. Just use the last n-1 words: an n-gram model.
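To make "predict each word given its history" concrete, here is a minimal sketch of a count-based bigram MLE model; the toy corpus and function names are illustrative, not from the slides.

from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w | prev) by relative frequency: C(prev, w) / C(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])  # counts of each history word
        bigrams.update(zip(words[:-1], words[1:]))
    # Unseen histories divide by zero; unseen bigrams get probability 0.
    # This is exactly the sparsity problem that smoothing (later) addresses.
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev]

p = train_bigram_mle(["i like speech", "i like to recognize speech"])
print(p("i", "like"))   # 1.0 in this toy corpus
print(p("like", "to"))  # 0.5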
Entropy and perplexity

Entropy measures the information content in a distribution, i.e. the uncertainty
If I can predict the next word before it comes, there's no information content
Zero uncertainty means the signal has zero information
How many bits of additional information do I need to guess the next symbol?
Perplexity is the average branching factor:
If the message has zero information, the branching factor is 1
If the message needs one bit, the branching factor is 2
If the message needs two bits, the branching factor is 4
Entropy and perplexity measure the same thing (uncertainty / information content) on different scales
Information in a fair coin flip

event space    probability    entropy (bits)    perplexity
heads          0.5            0.5
tails          0.5            0.5
total          1              1                 2
Information in a single fair die

event space    probability    entropy (bits)    perplexity
1              1/6            0.43
2              1/6            0.43
3              1/6            0.43
4              1/6            0.43
5              1/6            0.43
6              1/6            0.43
total          1              2.58              6
Information in sum of two dice?
event space    probability    entropy (bits)    perplexity
2              1/36           0.14
3              1/18           0.23
4              1/12           0.30
5              1/9            0.35
6              5/36           0.40
7              1/6            0.43
8              5/36           0.40
9              1/9            0.35
10             1/12           0.30
11             1/18           0.23
12             1/36           0.14
total          1              3.27              9.68
Entropy of a distribution

Start with a distribution p over events in the event space X
Entropy measures the minimum number of bits necessary to encode a message, assuming the message has distribution p
Key notion: you can use shorter codes for more common messages
(If you've heard of Huffman coding, here it is…)
Computing Entropy

H(X) = Σx p(x) · (−log2 p(x))

Here −log2 p(x) is the ideal code length for symbol x, and p(x) is the expected rate of occurrence of x.
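As a sanity check on the tables above, a small sketch in plain Python that recomputes entropy and perplexity for the coin, the die, and the sum of two dice; the helper names are illustrative.

import math
from fractions import Fraction

def entropy(dist):
    """H(X) = sum over x of p(x) * (-log2 p(x)), in bits."""
    return sum(p * -math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    """Average branching factor: 2 raised to the entropy."""
    return 2 ** entropy(dist)

coin = {"heads": 0.5, "tails": 0.5}
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = {}
for a in range(1, 7):
    for b in range(1, 7):
        two_dice[a + b] = two_dice.get(a + b, 0) + Fraction(1, 36)

print(entropy(coin), perplexity(coin))          # 1.0 bit, 2.0
print(entropy(die), perplexity(die))            # ~2.58 bits, ~6.0
print(entropy(two_dice), perplexity(two_dice))  # ~3.27 bits, ~9.68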
Perplexity

If entropy measures the number of bits per symbol, just exponentiate to get the branching factor: perplexity = 2^H
The Train/Test Split and Entropy

Before, we were computing the entropy H(p) with the true distribution p in hand
This scores how well we're doing if we know the true distribution
In practice we estimate parameters on training data and evaluate on test data
Cross entropy

Estimate a distribution on the training corpus; see how well it predicts the testing corpus
Let q be the distribution we learned from the training data, and w1 … wN be the test data
Then the cross entropy of the test data given the training data is:
H = −(1/N) Σi log2 q(wi)
This is the negative average logprob
It is also the average number of bits required to encode each test-data symbol using our learned distribution
Cross entropy, formally

True distribution p, assumed distribution q: we write the codebook using q, but encode messages drawn from p
Let p̃ be the count-based (empirical) distribution of the test data; then
H(p̃, q) = −Σx p̃(x) · log2 q(x)
Language model perplexity

Recipe: train a language model on the training data; get the negative logprobs of the test data and compute their average; exponentiate!
Perplexity correlates rather well with speech recognition error rates and MT quality metrics
LM perplexities for word-based models are normally between, say, 50 and 1000
Need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
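The recipe in code, as a minimal sketch: `model` is a placeholder for any smoothed function returning P(word | history), not a particular toolkit's API.

import math

def perplexity(model, test_words):
    """Exponentiated negative average log2-probability of the test data."""
    total_logprob = 0.0
    for i, word in enumerate(test_words):
        # model must return a nonzero (smoothed) probability for every word
        total_logprob += math.log2(model(test_words[:i], word))
    cross_entropy = -total_logprob / len(test_words)  # bits per word
    return 2 ** cross_entropy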
Tasks

You get parameters, e.g. θ = (P(heads), P(tails))
You want to produce data that conforms to this distribution
This is simulation, or data generation
Tasks

You get parameters θ and observations: HHHHTHTTHHHHH
You need to answer: "How likely is this data according to the model?"
This is evaluating the likelihood function
Tasks

You get observations: HHTHTTHTHTHHTHTHTTHTHT
You need to find a set of parameters θ that explains them
This is parameter estimation
Parameter estimation

We keep talking about things like P(X; θ): a distribution with parameters θ
How do we estimate the parameters? What is the likelihood of a given parameter setting?

Observations: HHTHTHTTTHTHTTTTTTTTTTTTHHHTHHHTHHHH
Candidate parameters (P(Heads), P(Tails)): (0.00, 1.00), (0.50, 0.50), (0.75, 0.25)
Parameter estimation techniques

Often use the Relative Frequency Estimate
For certain distributions: "how likely is it that I get k heads when I flip n times" (binomial distributions), "how likely is it that I get five 6s when I roll five dice" (multinomial distributions)
…the Relative Frequency Estimate = the Maximum Likelihood Estimate (MLE)
This is the set of parameters under which the observed data has maximum likelihood (another max!)
Formalizes your intuition from the prior slide
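To make relative frequency as MLE concrete, a small sketch using the coin data from the previous slide; the helper names are illustrative.

import math

def mle(observations):
    """Relative frequency estimate: count(x) / total."""
    n = len(observations)
    return {x: observations.count(x) / n for x in set(observations)}

def log_likelihood(params, observations):
    """log P(data | params); -inf if any observation has probability 0."""
    total = 0.0
    for x in observations:
        p = params.get(x, 0.0)
        total += math.log(p) if p > 0 else -math.inf
    return total

data = list("HHTHTHTTTHTHTTTTTTTTTTTTHHHTHHHTHHHH")
theta = mle(data)
print(theta)                                       # the relative frequencies
print(log_likelihood(theta, data))                 # maximal over all (pH, pT)
print(log_likelihood({"H": 0.5, "T": 0.5}, data))  # strictly lower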
Maximum Likelihood has problems :/

Remember: PMLE(wi|wi-1) = C(wi-1 wi) / C(wi-1)
Two problems:
What happens if C(wi-1 wi) = 0? We assign zero probability to an event…
Even worse, what if C(wi-1) = 0? Divide by zero is undefined!
Smoothing

Main goal: prevent zero numerators (zero probabilities) and zero denominators (divide-by-zeros)
Make a "sharp" distribution (where some outputs have large probabilities and others have zero probability) "smoother"
The smoothest distribution is the uniform distribution
Constraint: the result should still be a distribution
Smoothing techniques

Add one (Laplace): PLAP(wi|wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + |V|)
This can help, but it generally doesn't do a good job of estimating what's going on
Mixtures / interpolation

Say I have two distributions p1 and p2
Pick any number λ between 0 and 1; then λ·p1 + (1−λ)·p2 is a distribution
Two things to show:
(a) It sums to one: Σx (λ p1(x) + (1−λ) p2(x)) = λ Σx p1(x) + (1−λ) Σx p2(x) = λ + (1−λ) = 1
(b) All values are ≥ 0: p1(x) ≥ 0 and p2(x) ≥ 0 because they're distributions, and λ ≥ 0 and (1−λ) ≥ 0, so the sum is non-negative, and we're done
Laplace as a mixture

Say we have K outcomes and N total observations. Laplace says:
PLAP(x) = (c(x) + 1) / (N + K) = (N / (N + K)) · PMLE(x) + (K / (N + K)) · (1/K)
Laplace is a mixture between the MLE and the uniform distribution!
The mixture weight is determined by N and K
Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1w2) = 10, and C(trigram w1w2w3) = 9
PMLE = 9/10 = 0.9
PLAP = (9+1)/(10+100K) ≈ 0.0001
Too much probability mass is 'shaved off' for the zeroes: too sharp a change in probabilities, problematic in practice
Add-δ Smoothing

Problem: adding 1 moves too much probability mass
Proposal: add a smaller fractional mass δ
Padd-δ(wi|wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δ|V|)
Issues: need to pick δ, and it still performs badly (see the sketch below)
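A sketch of add-δ smoothing (δ = 1 recovers Laplace), reproducing the numbers from the example above; the function name is illustrative.

def p_add_delta(c_ngram, c_history, vocab_size, delta=1.0):
    """(C(history, w) + delta) / (C(history) + delta * |V|)."""
    return (c_ngram + delta) / (c_history + delta * vocab_size)

V = 100_000
print(9 / 10)                             # P_MLE = 0.9
print(p_add_delta(9, 10, V, delta=1.0))   # P_LAP ~ 0.0001
print(p_add_delta(9, 10, V, delta=0.01))  # ~0.0089: less mass shaved off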
Good-Turing Smoothing

New idea: use counts of things you have seen to estimate counts of those you haven't
Good-Turing approach: use the frequency of singletons to re-estimate the frequency of zero-count n-grams
Notation: Nc is the frequency of frequency c, i.e. the number of n-grams which appear c times
N0: # of n-grams with count 0; N1: # of n-grams with count 1
Estimate the probability of things which occur c times using the count of things which occur c+1 times: c* = (c+1) · Nc+1 / Nc
Good-Turing: Josh Goodman's Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass
You have caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18 (the three singleton species out of 18 catches)
Assuming so, how likely is it that the next species is trout? Must be less than 1/18
Slide adapted from Josh Goodman, Dan Jurafsky
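The fishing numbers in code, a minimal sketch of the Good-Turing bookkeeping (N is the catch size, Nc the frequency of frequency c); variable names are illustrative.

from collections import Counter

catch = (["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2
         + ["trout"] + ["salmon"] + ["eel"])
counts = Counter(catch)                  # 18 fish total
N = sum(counts.values())
freq_of_freq = Counter(counts.values())  # N_c: how many species seen c times

# P(next fish is a brand-new species) = N_1 / N
print(freq_of_freq[1] / N)               # 3/18 ~ 0.167

# Good-Turing count for a singleton like trout: c* = (c+1) * N_{c+1} / N_c
c = 1
c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
print(c_star / N)                        # ~0.037 < 1/18, as the slide argues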
Backoff and Interpolation

Another really useful source of knowledge
If we are estimating the trigram p(z|x,y) but count(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z)
How do we combine this trigram, bigram, and unigram info in a valid fashion?
Backoff vs. Interpolation

Backoff: use the trigram if you have it; otherwise the bigram; otherwise the unigram
Interpolation: always mix all three
Interpolation

Simple interpolation: P̂(wn|wn-2 wn-1) = λ1 P(wn|wn-2 wn-1) + λ2 P(wn|wn-1) + λ3 P(wn), with Σi λi = 1
Lambdas can be made conditional on context; intuition: put higher weight on more frequent n-grams
How to Set the Lambdas?

Use a held-out, or development, corpus
Choose the lambdas which maximize the probability of the held-out data: fix the n-gram probabilities, then search for the lambda values that, plugged into the previous equation, give the largest probability for the held-out set
Can use EM to do this search (a brute-force sketch follows below)
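A sketch of the held-out search described above, using a brute-force grid instead of EM; the probability functions p_uni, p_bi, p_tri are placeholders for smoothed estimates trained elsewhere.

import itertools
import math

def best_lambdas(held_out, p_uni, p_bi, p_tri, step=0.1):
    """Grid-search lambda triples (summing to 1) that maximize
    held-out log-likelihood; EM would find these more efficiently."""
    grid = [i * step for i in range(int(1 / step) + 1)]
    best, best_ll = None, -math.inf
    for l1, l2 in itertools.product(grid, grid):
        l3 = 1 - l1 - l2
        if l3 < -1e-9:          # guard against float error at the boundary
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for (x, y, z) in held_out:  # trigram tuples from held-out text
            p = l1 * p_tri(z, x, y) + l2 * p_bi(z, y) + l3 * p_uni(z)
            ll += math.log(p) if p > 0 else -math.inf
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best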
Katz Backoff

Note: we used P* (discounted probabilities) and α weights on the backoff values. Why not just use regular MLE estimates?
Sum over all wi in the n-gram context: if we back off to a lower-order n-gram with plain MLE estimates, the total probability mass exceeds 1
Solution: use the P* discounts to save mass for the lower-order n-grams, and apply the α weights so we hand out exactly the amount saved
Details in 4.7.1
Toolkits

Two major language modeling toolkits: SRILM and the Cambridge-CMU toolkit
Publicly available, with similar functionality
Training: create a language model from a text file
Decoding: compute the perplexity/probability of text
OOV words: the <UNK> word

Out Of Vocabulary = OOV words
We don't use GT smoothing for these, because GT assumes we know the number of unseen events
Instead: create an unknown word token <UNK>
Training of <UNK> probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; we then train its probabilities like a normal word
At decoding time: use the <UNK> probabilities for any input word not seen in training (see the sketch below)
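A sketch of the <UNK> normalization step; the frequency-based cutoff is one illustrative way to fix the lexicon L.

from collections import Counter

def build_lexicon(train_words, max_size):
    """Fixed lexicon L: the max_size most frequent training words
    (ties broken arbitrarily in this sketch)."""
    return {w for w, _ in Counter(train_words).most_common(max_size)}

def normalize(words, lexicon):
    """Replace any word outside the lexicon with the <UNK> token."""
    return [w if w in lexicon else "<UNK>" for w in words]

train = "the cat sat on the mat the cat ran".split()
L = build_lexicon(train, max_size=3)
print(normalize(train, L))                  # rare training words become <UNK>
print(normalize("the dog sat".split(), L))  # unseen 'dog' -> <UNK> at decoding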
Google N-Gram Release

serve as the incoming        92
serve as the incubator       99
serve as the independent    794
serve as the index          223
serve as the indication      72
serve as the indicator      120
serve as the indicators      45
serve as the indispensable  111
serve as the indispensible   40
serve as the individual     234
Google Caveat

Remember the lesson about test sets and training sets: test sets should be similar to the training set (drawn from the same distribution) for the probabilities to be meaningful.
So the Google corpus is fine if your application deals with arbitrary English text on the Web.
If not, then a smaller domain-specific corpus is likely to yield better results.
Class-Based Language Models

Variant of n-gram models using classes or clusters
Motivation: sparseness. In a flight app, instead of estimating P(ORD|to), P(JFK|to), … separately, use P(airport_name|to): relate the probability of an n-gram to word classes and a class n-gram
IBM clustering: assume each word belongs to a single class, so P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci), learned by MLE from data
Where do classes come from? Hand-designed for the application (e.g. ATIS), or automatically induced clusters from a corpus (a sketch of the decomposition follows below)
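A sketch of the IBM-clustering decomposition; the toy counts and class map are invented for illustration.

def class_bigram_p(w, w_prev, word2class, class_bigram, class_count, word_count):
    """P(w | w_prev) ~ P(c | c_prev) * P(w | c), all terms MLE from counts."""
    c, c_prev = word2class[w], word2class[w_prev]
    p_class = class_bigram[(c_prev, c)] / class_count[c_prev]  # P(c | c_prev)
    p_word = word_count[w] / class_count[c]                    # P(w | c)
    return p_class * p_word

word2class = {"to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT"}
class_bigram = {("PREP", "AIRPORT"): 50}
class_count = {"PREP": 100, "AIRPORT": 60}
word_count = {"ORD": 20, "JFK": 40}
print(class_bigram_p("JFK", "to", word2class, class_bigram,
                     class_count, word_count))
# (50/100) * (40/60) ~ 0.33: mass is shared across all airport names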
LM Adaptation

Challenge: we need an LM for a new domain but have little in-domain data
Intuition: much of language is pretty general, so we can build from a 'general' LM plus in-domain data
Approach: LM adaptation. Train on a large domain-independent corpus; adapt with a small in-domain data set
What large corpus? Web counts! E.g. Google n-grams
Incorporating Longer Distance Context

Why use longer context? N-grams are only an approximation, limited by model size and sparseness
What sorts of information live in longer context? Priming, topic, sentence type, dialogue act, syntax
Long Distance LMs

Bigger n! With 284M words, n-grams up to 6 improve; orders 7 through 20 are no better
Cache n-grams: intuition is priming; a word used previously is more likely to recur. Incrementally build a 'cache' unigram model on the test corpus and mix it with the main n-gram LM (sketch below)
Topic models: intuition is that text is about some topic, and on-topic words are likely: P(w|h) ≈ Σt P(w|t) P(t|h)
Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams
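A sketch of the cache mixture; the interpolation weight and base model are placeholders, not from the slides.

from collections import Counter

class CacheLM:
    """Mix a static n-gram LM with a unigram cache built incrementally
    over the text being processed (priming: recent words are likelier)."""

    def __init__(self, base_p, weight=0.1):
        self.base_p = base_p  # base_p(word, history) -> probability
        self.weight = weight  # mass given to the cache model
        self.cache = Counter()
        self.total = 0

    def p(self, word, history):
        cache_p = self.cache[word] / self.total if self.total else 0.0
        return (1 - self.weight) * self.base_p(word, history) \
            + self.weight * cache_p

    def observe(self, word):
        """Call after each processed word to update the cache."""
        self.cache[word] += 1
        self.total += 1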
Language Models

N-gram models: a finite approximation of an infinite context history
Issues: zeroes and other sparseness
Strategies: smoothing (add-one, add-δ, Good-Turing, etc.); use partial n-grams: interpolation, backoff
Refinements: class, cache, topic, and trigger LMs
Kneser-Ney Smoothing

Most commonly used modern smoothing technique
Intuition: improving backoff. "I can't see without my reading……"
Compare P(Francisco|reading) vs P(glasses|reading): P(Francisco|reading) backs off to P(Francisco), and the high unigram frequency of "Francisco" makes it beat P(glasses|reading), even though "glasses" is the word that should follow "reading"
However, Francisco appears in few contexts (essentially only after "San"), while glasses appears in many
So interpolate based on the number of contexts: words seen in more contexts are more likely to appear in new ones
Kneser-Ney Smoothing

Model the diversity of contexts with a continuation probability:
Pcontinuation(w) = |{w' : C(w' w) > 0}| / Σw'' |{w' : C(w' w'') > 0}|
This continuation probability replaces the raw unigram in both the backoff and the interpolated forms of the model; e.g. the interpolated form is
PKN(wi|wi-1) = max(C(wi-1 wi) − d, 0) / C(wi-1) + λ(wi-1) · Pcontinuation(wi)
where λ(wi-1) is set so the discounted mass is exactly redistributed
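A sketch of the continuation probability computed from bigram counts; the toy counts mirror the Francisco/glasses example and the names are illustrative.

from collections import defaultdict

def continuation_p(bigram_counts):
    """P_cont(w) = (# distinct left contexts of w) / (# distinct bigram types)."""
    left_contexts = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            left_contexts[w].add(w_prev)
    total_types = sum(len(s) for s in left_contexts.values())
    return {w: len(s) / total_types for w, s in left_contexts.items()}

# 'Francisco' is frequent but follows only 'San'; 'glasses' follows many words.
counts = {("San", "Francisco"): 100, ("my", "glasses"): 2,
          ("reading", "glasses"): 2, ("his", "glasses"): 1}
p = continuation_p(counts)
print(p["Francisco"], p["glasses"])  # 0.25 vs 0.75, despite raw frequency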
Issues

Relative frequency: we typically compute the count of a sequence and divide by the count of its prefix (formula below)
Corpus sensitivity: a model trained on Shakespeare looks very unnatural on the Wall Street Journal, and vice versa
N-gram order: unigrams capture little; bigrams capture collocations; trigrams capture phrases
P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
Additional Issues in Good-Turing

General approach: the estimate c* for count c depends on Nc+1
What if Nc+1 = 0? More zero-count problems!
Not uncommon: e.g. in the fish example there are no species caught 4 times, so N4 = 0