LING 438/538 Computational Linguistics
Sandiway Fong
Lecture 18: 10/26
2
Administrivia
• Reminder
  – Homework 4 due tonight
  – Questions? After class
3
Last Time
• Background
  – general introduction to probability concepts
    • Sample Space and Events
    • Permutations and Combinations
    • Rule of Counting
    • Event Probability / Conditional Probability
    • Uncertainty and Entropy
[figures: statistical experiment and its outcomes]
4
Today’s Topic
• statistical methods are widely used in language processing
  – apply probability theory to language
• N-grams
• reading
  – textbook chapter 6: N-grams
5
N-grams: Unigrams
• introduction
  – given a corpus of text, the n-grams are the sequences of n consecutive words that occur in the corpus
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=1 (8 distinct unigrams)
  – the 3
  – sat 2
  – on 2
  – cat 1
  – that 1
  – sofa 1
  – also 1
  – mat 1
(a small counting sketch follows below)
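A minimal sketch of how these counts can be produced, assuming a plain whitespace-split token list; the function name ngram_counts is just an illustrative choice, not something from the lecture or the textbook.

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count the n-grams (sequences of n consecutive words) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    sentence = "the cat that sat on the sofa also sat on the mat".split()

    for n in (1, 2, 3, 4):
        counts = ngram_counts(sentence, n)
        print(f"N={n}: {len(counts)} distinct n-grams")
        for gram, freq in counts.most_common():
            print("  " + " ".join(gram), freq)

Run on the example sentence this gives 8 distinct unigrams, 9 bigrams, 9 trigrams, and 9 quadrigrams, matching the counts on these slides.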
6
N-grams: Bigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=2 (9 distinct bigrams)
  – sat on 2
  – on the 2
  – the cat 1
  – cat that 1
  – that sat 1
  – the sofa 1
  – sofa also 1
  – also sat 1
  – the mat 1
(each bigram spans 2 words)
7
N-grams: Trigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=3 (9 distinct trigrams)
  – most language models stop here, some stop at quadrigrams
    • too many n-grams
    • low frequencies
  – sat on the 2
  – the cat that 1
  – cat that sat 1
  – that sat on 1
  – on the sofa 1
  – the sofa also 1
  – sofa also sat 1
  – also sat on 1
  – on the mat 1
(each trigram spans 3 words)
8
N-grams: Quadrigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=4 (9 distinct quadrigrams)
  – the cat that sat 1
  – cat that sat on 1
  – that sat on the 1
  – sat on the sofa 1
  – on the sofa also 1
  – the sofa also sat 1
  – sofa also sat on 1
  – also sat on the 1
  – sat on the mat 1
(each quadrigram spans 4 words)
9
N-grams: frequency curves
• family of curves sorted by frequency
  – unigrams, bigrams, trigrams, quadrigrams ...
  – decreasing frequency
[figure: family of frequency curves (frequency f against rank), one curve per n-gram order]
10
N-grams: the word as a unit
• we count words
• but what counts as a word?
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)
  – capitalization
    • They, they: same token or not?
  – wordform vs. lemma
    • cats, cat: same token or not?
  – disfluencies
    • part of spoken language
    • er, um, main- mainly
    • speech recognition systems have to cope with them
(a small tokenization sketch follows below)
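One way to make some of these decisions concrete in code; this is a sketch, not the lecture's prescription, and the regular expression and the lowercasing choice are illustrative assumptions.

    import re

    def tokenize(text, lowercase=True):
        """Split text into word and punctuation tokens and prefix the <s> dummy word.
        Keeping punctuation as separate tokens mirrors part-of-speech tagging practice."""
        if lowercase:                                # collapse They/they into one token
            text = text.lower()
        tokens = re.findall(r"\w+|[^\w\s]", text)    # words, or single punctuation marks
        return ["<s>"] + tokens

    print(tokenize("They saw the cats; er, um, main- mainly the cat."))

Note that this still treats cats and cat as different tokens; mapping wordforms to lemmas would need a morphological analyzer.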
11
N-grams: Word
• what counts as a word?
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)
[figure: punctuation entries from the Penn Treebank tagset]
12
Language Models and N-grams
• Brown corpus (1 million words):
      word w     f(w)       p(w)
      the        69,971     0.070
      rabbit     11         0.000011
• given a word sequence
  – w1 w2 w3 ... wn
  – probability of seeing wi depends on what we have seen before
    • recall conditional probability introduced last time
• example (section 6.2)
  – Just then, the white rabbit ...
  – candidate next words after white: rabbit vs. the
  – expectation is p(rabbit|white) > p(the|white)
  – but p(the) > p(rabbit)
13
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
  – ...
  – p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
• note
  – it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences
14
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov assumption: finite length history
  – 1st order Markov model
  – chain rule: p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
  – approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
(a small scoring sketch follows below)
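A sketch of how the bigram approximation turns into code, assuming the bigram probabilities have already been estimated and stored in a dictionary; the probability values below are made-up toy numbers, used only so the example runs.

    # hypothetical bigram probabilities, keyed on (previous word, current word)
    toy_bigram_p = {
        ("<s>", "the"): 0.4,
        ("the", "cat"): 0.05,
        ("cat", "sat"): 0.2,
    }

    def bigram_sentence_prob(words, bigram_p):
        """p(w1 ... wn) ~ p(w1|<s>) p(w2|w1) ... p(wn|wn-1)  (1st order Markov)."""
        prob = 1.0
        for prev, cur in zip(["<s>"] + words, words):
            prob *= bigram_p[(prev, cur)]
        return prob

    print(bigram_sentence_prob(["the", "cat", "sat"], toy_bigram_p))   # 0.4 * 0.05 * 0.2 = 0.004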
15
Language Models and N-grams
• trigram approximation
  – 2nd order Markov model
  – just look at the preceding two words only
  – chain rule: p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
  – approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
• note
  – p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
16
Language Models and N-grams
• example (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – using p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
figure 6.2
17
Language Models and N-grams
• example (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
figure 6.3
  – p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
  – = 0.0000081 (different from textbook)
(the arithmetic is checked in the sketch below)
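A two-line check of the arithmetic, using the probabilities quoted above:

    p = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
    print(p)   # 8.112e-06, i.e. roughly 0.0000081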
18
Language Models and N-grams
• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w)      (where w ranges over all words)
  – since Σw f(wn-1 w) = f(wn-1), the unigram frequency of wn-1,
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)           (relative frequency)
• note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
(a small estimation sketch follows below)
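A sketch of the relative-frequency (MLE) estimate, reusing the example sentence from the earlier slides as a toy training corpus; a real model would of course be estimated from something much larger.

    from collections import Counter

    corpus = "the cat that sat on the sofa also sat on the mat".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))

    def p_mle(wn, wn_1):
        """MLE estimate: p(wn|wn-1) = f(wn-1 wn) / f(wn-1)."""
        return bigram_f[(wn_1, wn)] / unigram_f[wn_1]

    print(p_mle("sofa", "the"))   # f(the sofa) = 1, f(the) = 3  ->  0.333...
    print(p_mle("mat", "the"))    # f(the mat)  = 1, f(the) = 3  ->  0.333...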
19
Language Models and N-grams
• example
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
  – = 0.0000081
• in practice calculations are done in log space
  – p(I want to eat British food) = 0.0000081, a tiny number
  – use logprob (log2 probability)
  – actually sum of (negative) log2s of probabilities
• Question:
  – why sum negative log of probabilities?
• Answer (Part 1):
  – computer floating point number storage has limited range
    • double (64 bit): roughly 5.0×10^-324 to 1.7×10^308
  – danger of underflow
(see the log-space sketch below)
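A sketch of both points: the direct product underflows once enough small probabilities are multiplied, while the sum of negative log2 probabilities stays in a comfortable range. The sentence length chosen to trigger underflow is just illustrative.

    import math

    probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]   # the six bigram probabilities from the example

    print(math.prod(probs))           # ~8.1e-06: fine for a short sentence
    print(math.prod([1e-5] * 70))     # 1e-350 is below the double range, so this prints 0.0

    logprob = sum(-math.log2(p) for p in probs)     # sum of negative log2 probabilities
    print(logprob)                    # ~16.9
    print(2 ** -logprob)              # back to ~8.1e-06 when needed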
20
Language Models and N-grams
• Question:
  – why sum negative log of probabilities?
• Answer (Part 2):
  – A = BC
  – log(A) = log(B) + log(C)
  – probabilities are in range (0, 1]
  – note:
    • want probabilities to be non-zero
    • log(0) = -∞
  – logs of probabilities will be negative (up to 0)
  – take the negative to make them positive
[figure: plot of the log function, with the region of interest (0, 1] marked]
21
Motivation for smoothing
• Smoothing: avoid zero probabilities
• consider the bigram model
  – p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• what happens when any individual probability component is zero?
  – multiplication law: 0 × X = 0
  – very brittle!
• even in a large corpus, many n-grams will have zero frequency
  – particularly so for larger n
22
Language Models and N-grams
• Example:
[figures: unigram frequencies; bigram frequencies f(wn-1 wn); and the resulting bigram probabilities, laid out as a matrix with rows wn-1 and columns wn (a sparse matrix)]
  – zeros render probabilities unusable
  – (we'll need to add fudge factors - i.e. do smoothing)
23
Smoothing and N-grams
• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)      (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1 wn) / f(wn-1)      (relative frequency)
    • bigram f(wn-1 wn) doesn't exist in the dataset
• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value
  – we'll look at two ways here (just one of them today)
24
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
  – simple and no more zeros (but there are better methods)
• unigram
  – p(w) = f(w)/N                         (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)                 (with Add-One)
  – f*(w) = (f(w)+1) * N/(N+V)            (with Add-One)
    • V = number of distinct words in corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)                        (before Add-One)
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)                (after Add-One)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)     (after Add-One)
  – must rescale so that total probability mass stays at 1
(a small Add-One sketch follows below)
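A sketch of Add-One smoothing for bigrams, again over the toy example sentence used earlier; here V = 8, the number of distinct words in that sentence.

    from collections import Counter

    corpus = "the cat that sat on the sofa also sat on the mat".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))
    V = len(unigram_f)     # number of distinct words (8 here)

    def p_addone(wn, wn_1):
        """Add-One: p(wn|wn-1) = (f(wn-1 wn) + 1) / (f(wn-1) + V)."""
        return (bigram_f[(wn_1, wn)] + 1) / (unigram_f[wn_1] + V)

    def f_star(wn_1, wn):
        """Reconstituted count: f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)."""
        return (bigram_f[(wn_1, wn)] + 1) * unigram_f[wn_1] / (unigram_f[wn_1] + V)

    print(p_addone("sofa", "the"))   # seen bigram:   (1+1)/(3+8)  ~ 0.18
    print(p_addone("cat", "sofa"))   # unseen bigram: (0+1)/(1+8)  ~ 0.11, no longer zero
    print(f_star("the", "sofa"))     # (1+1)*3/(3+8)  ~ 0.55, reconstituted from a raw count of 1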
25
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
• frequencies
• Remarks: perturbation problem
  – add-one causes large changes in some frequencies due to relative size of V (1616)
  – e.g. f(want to): 786 → 338

original bigram frequencies f(wn-1 wn) (= figure 6.4):
             I      want   to     eat    Chinese  food   lunch
  I          8      1087   0      13     0        0      0
  want       3      0      786    0      6        8      6
  to         3      0      10     860    3        0      12
  eat        0      0      2      0      19       2      52
  Chinese    2      0      0      0      0        120    1
  food       19     0      17     0      0        0      0
  lunch      4      0      0      0      0        1      0

Add-One adjusted frequencies f*(wn-1 wn) (= figure 6.8):
             I      want    to      eat     Chinese  food    lunch
  I          6.12   740.05  0.68    9.52    0.68     0.68    0.68
  want       1.72   0.43    337.76  0.43    3.00     3.86    3.00
  to         2.67   0.67    7.35    575.41  2.67     0.67    8.69
  eat        0.37   0.37    1.10    0.37    7.35     1.10    19.47
  Chinese    0.35   0.12    0.12    0.12    0.12     14.09   0.23
  food       9.65   0.48    8.68    0.48    0.48     0.48    0.48
  lunch      1.11   0.22    0.22    0.22    0.22     0.44    0.22

(a sketch reproducing a few of these adjusted counts follows below)
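A sketch reproducing two of the adjusted counts in the table above. The unigram frequencies f(wn-1) are not printed on this slide; the values below are the textbook's counts for these seven words, so treat them as an assumption carried over from the book.

    # assumed unigram frequencies for the seven words (from the textbook's corpus)
    unigram_f = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                 "Chinese": 213, "food": 1506, "lunch": 459}
    V = 1616   # vocabulary size quoted in the remark above

    def f_star(wn_1, wn, f_bigram):
        """f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)"""
        return (f_bigram + 1) * unigram_f[wn_1] / (unigram_f[wn_1] + V)

    print(round(f_star("I", "want", 1087), 2))   # 740.05, as in the adjusted table
    print(round(f_star("want", "to", 786), 2))   # 337.76, the 786 -> 338 change noted above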
26
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
• Probabilities
• Remarks: perturbation problem
  – similar changes in probabilities

Add-One smoothed bigram probabilities (= figure 6.7):
             I        want     to       eat      Chinese  food     lunch
  I          0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
  want       0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
  to         0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
  eat        0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
  Chinese    0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
  food       0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
  lunch      0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048

original (unsmoothed) bigram probabilities (= figure 6.5):
             I        want     to       eat      Chinese  food     lunch
  I          0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
  want       0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
  to         0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
  eat        0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
  Chinese    0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
  food       0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
  lunch     0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000
27
Smoothing and N-grams
• Excel spreadsheet available
  – addone.xls