LING 438/538 Computational Linguistics
Sandiway Fong
Lecture 18: 10/26
2
Administrivia
• Reminder
  – Homework 4 due tonight
  – Questions? After class
3
Last Time
• Background
  – general introduction to probability concepts
    • Sample Space and Events
    • Permutations and Combinations
    • Rule of Counting
    • Event Probability / Conditional Probability
    • Uncertainty and Entropy
[figures: statistical experiment and its outcomes]
4
Today’s Topic
• statistical methods are widely used in language processing
  – apply probability theory to language
• N-grams
• reading
  – textbook chapter 6: N-grams
5
N-grams: Unigrams
• introduction
  – given a corpus of text, the n-grams are the sequences of n consecutive words that occur in the corpus
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=1 (8 distinct unigrams)
  – the 3
  – sat 2
  – on 2
  – cat 1
  – that 1
  – sofa 1
  – also 1
  – mat 1
(a small counting sketch follows below)
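A minimal sketch of how these counts can be produced, assuming a plain whitespace-split token list; the function name ngram_counts is just an illustrative choice, not something from the lecture or the textbook.

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count the n-grams (sequences of n consecutive words) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    sentence = "the cat that sat on the sofa also sat on the mat".split()

    for n in (1, 2, 3, 4):
        counts = ngram_counts(sentence, n)
        print(f"N={n}: {len(counts)} distinct n-grams")
        for gram, freq in counts.most_common():
            print("  " + " ".join(gram), freq)

Run on the example sentence this gives 8 distinct unigrams, 9 bigrams, 9 trigrams, and 9 quadrigrams, matching the counts on these slides.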
6
N-grams: Bigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=2 (9 distinct bigrams)
  – sat on 2
  – on the 2
  – the cat 1
  – cat that 1
  – that sat 1
  – the sofa 1
  – sofa also 1
  – also sat 1
  – the mat 1
(each bigram spans 2 words)
7
N-grams: Trigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=3 (9 distinct trigrams)
  – most language models stop here, some stop at quadrigrams
    • too many n-grams
    • low frequencies
  – sat on the 2
  – the cat that 1
  – cat that sat 1
  – that sat on 1
  – on the sofa 1
  – the sofa also 1
  – sofa also sat 1
  – also sat on 1
  – on the mat 1
(each trigram spans 3 words)
8
N-grams: Quadrigrams
• example (12 word sentence)
  – the cat that sat on the sofa also sat on the mat
• N=4 (9 distinct quadrigrams)
  – the cat that sat 1
  – cat that sat on 1
  – that sat on the 1
  – sat on the sofa 1
  – on the sofa also 1
  – the sofa also sat 1
  – sofa also sat on 1
  – also sat on the 1
  – sat on the mat 1
(each quadrigram spans 4 words)
9
N-grams: frequency curves
• family of curves sorted by frequency
  – unigrams, bigrams, trigrams, quadrigrams ...
  – decreasing frequency
[figure: family of frequency curves (frequency f against rank), one curve per n-gram order]
10
N-grams: the word as a unit
• we count words
• but what counts as a word?
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)
  – capitalization
    • They, they: same token or not?
  – wordform vs. lemma
    • cats, cat: same token or not?
  – disfluencies
    • part of spoken language
    • er, um, main- mainly
    • speech recognition systems have to cope with them
(a small tokenization sketch follows below)
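One way to make some of these decisions concrete in code; this is a sketch, not the lecture's prescription, and the regular expression and the lowercasing choice are illustrative assumptions.

    import re

    def tokenize(text, lowercase=True):
        """Split text into word and punctuation tokens and prefix the <s> dummy word.
        Keeping punctuation as separate tokens mirrors part-of-speech tagging practice."""
        if lowercase:                                # collapse They/they into one token
            text = text.lower()
        tokens = re.findall(r"\w+|[^\w\s]", text)    # words, or single punctuation marks
        return ["<s>"] + tokens

    print(tokenize("They saw the cats; er, um, main- mainly the cat."))

Note that this still treats cats and cat as different tokens; mapping wordforms to lemmas would need a morphological analyzer.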
11
N-grams: Word
• what counts as a word?
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)
[figure: punctuation entries from the Penn Treebank tagset]
12
Language Models and N-grams
• Brown corpus (1 million words):
      word w     f(w)       p(w)
      the        69,971     0.070
      rabbit     11         0.000011
• given a word sequence
  – w1 w2 w3 ... wn
  – probability of seeing wi depends on what we have seen before
    • recall conditional probability introduced last time
• example (section 6.2)
  – Just then, the white rabbit ...
  – candidate next words after white: rabbit vs. the
  – expectation is p(rabbit|white) > p(the|white)
  – but p(the) > p(rabbit)
13
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
  – ...
  – p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
• note
  – it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences
14
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov assumption: finite length history
  – 1st order Markov model
  – chain rule: p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
  – approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
(a small scoring sketch follows below)
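A sketch of how the bigram approximation turns into code, assuming the bigram probabilities have already been estimated and stored in a dictionary; the probability values below are made-up toy numbers, used only so the example runs.

    # hypothetical bigram probabilities, keyed on (previous word, current word)
    toy_bigram_p = {
        ("<s>", "the"): 0.4,
        ("the", "cat"): 0.05,
        ("cat", "sat"): 0.2,
    }

    def bigram_sentence_prob(words, bigram_p):
        """p(w1 ... wn) ~ p(w1|<s>) p(w2|w1) ... p(wn|wn-1)  (1st order Markov)."""
        prob = 1.0
        for prev, cur in zip(["<s>"] + words, words):
            prob *= bigram_p[(prev, cur)]
        return prob

    print(bigram_sentence_prob(["the", "cat", "sat"], toy_bigram_p))   # 0.4 * 0.05 * 0.2 = 0.004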
15
Language Models and N-grams
• trigram approximation
  – 2nd order Markov model
  – just look at the preceding two words only
  – chain rule: p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
  – approximation: p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
• note
  – p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
16
Language Models and N-grams
• example (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – using p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
figure 6.2
17
Language Models and N-grams
• example (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
figure 6.3
  – p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
  – = 0.0000081 (different from textbook)
(the arithmetic is checked in the sketch below)
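A two-line check of the arithmetic, using the probabilities quoted above:

    p = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
    print(p)   # 8.112e-06, i.e. roughly 0.0000081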
18
Language Models and N-grams
• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w)      (where w ranges over all words)
  – since Σw f(wn-1 w) = f(wn-1), the unigram frequency of wn-1,
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)           (relative frequency)
• note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
(a small estimation sketch follows below)
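A sketch of the relative-frequency (MLE) estimate, reusing the example sentence from the earlier slides as a toy training corpus; a real model would of course be estimated from something much larger.

    from collections import Counter

    corpus = "the cat that sat on the sofa also sat on the mat".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))

    def p_mle(wn, wn_1):
        """MLE estimate: p(wn|wn-1) = f(wn-1 wn) / f(wn-1)."""
        return bigram_f[(wn_1, wn)] / unigram_f[wn_1]

    print(p_mle("sofa", "the"))   # f(the sofa) = 1, f(the) = 3  ->  0.333...
    print(p_mle("mat", "the"))    # f(the mat)  = 1, f(the) = 3  ->  0.333...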
19
Language Models and N-grams
• example
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
  – = 0.0000081
• in practice calculations are done in log space
  – p(I want to eat British food) = 0.0000081, a tiny number
  – use logprob (log2 probability)
  – actually sum of (negative) log2s of probabilities
• Question:
  – why sum negative log of probabilities?
• Answer (Part 1):
  – computer floating point number storage has limited range
    • double (64 bit): roughly 5.0×10^-324 to 1.7×10^308
  – danger of underflow
(see the log-space sketch below)
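A sketch of both points: the direct product underflows once enough small probabilities are multiplied, while the sum of negative log2 probabilities stays in a comfortable range. The sentence length chosen to trigger underflow is just illustrative.

    import math

    probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]   # the six bigram probabilities from the example

    print(math.prod(probs))           # ~8.1e-06: fine for a short sentence
    print(math.prod([1e-5] * 70))     # 1e-350 is below the double range, so this prints 0.0

    logprob = sum(-math.log2(p) for p in probs)     # sum of negative log2 probabilities
    print(logprob)                    # ~16.9
    print(2 ** -logprob)              # back to ~8.1e-06 when needed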
20
Language Models and N-grams
• Question:
  – why sum negative log of probabilities?
• Answer (Part 2):
  – A = BC
  – log(A) = log(B) + log(C)
  – probabilities are in range (0, 1]
  – note:
    • want probabilities to be non-zero
    • log(0) = -∞
  – logs of probabilities will be negative (up to 0)
  – take the negative to make them positive
[figure: plot of the log function, with the region of interest (0, 1] marked]
21
Motivation for smoothing
• Smoothing: avoid zero probabilities
• consider the bigram model
  – p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• what happens when any individual probability component is zero?
  – multiplication law: 0 × X = 0
  – very brittle!
• even in a large corpus, many n-grams will have zero frequency
  – particularly so for larger n
22
Language Models and N-grams
• Example:
[figures: unigram frequencies; bigram frequencies f(wn-1 wn); and the resulting bigram probabilities, laid out as a matrix with rows wn-1 and columns wn (a sparse matrix)]
  – zeros render probabilities unusable
  – (we'll need to add fudge factors - i.e. do smoothing)
23
Smoothing and N-grams
• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)      (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1 wn) / f(wn-1)      (relative frequency)
    • bigram f(wn-1 wn) doesn't exist in the dataset
• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value
  – we'll look at two ways here (just one of them today)
24
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
  – simple and no more zeros (but there are better methods)
• unigram
  – p(w) = f(w)/N                         (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)                 (with Add-One)
  – f*(w) = (f(w)+1) * N/(N+V)            (with Add-One)
    • V = number of distinct words in corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)                        (before Add-One)
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)                (after Add-One)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)     (after Add-One)
  – must rescale so that total probability mass stays at 1
(a small Add-One sketch follows below)
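A sketch of Add-One smoothing for bigrams, again over the toy example sentence used earlier; here V = 8, the number of distinct words in that sentence.

    from collections import Counter

    corpus = "the cat that sat on the sofa also sat on the mat".split()
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))
    V = len(unigram_f)     # number of distinct words (8 here)

    def p_addone(wn, wn_1):
        """Add-One: p(wn|wn-1) = (f(wn-1 wn) + 1) / (f(wn-1) + V)."""
        return (bigram_f[(wn_1, wn)] + 1) / (unigram_f[wn_1] + V)

    def f_star(wn_1, wn):
        """Reconstituted count: f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)."""
        return (bigram_f[(wn_1, wn)] + 1) * unigram_f[wn_1] / (unigram_f[wn_1] + V)

    print(p_addone("sofa", "the"))   # seen bigram:   (1+1)/(3+8)  ~ 0.18
    print(p_addone("cat", "sofa"))   # unseen bigram: (0+1)/(1+8)  ~ 0.11, no longer zero
    print(f_star("the", "sofa"))     # (1+1)*3/(3+8)  ~ 0.55, reconstituted from a raw count of 1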
25
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
• frequencies
• Remarks: perturbation problem
  – add-one causes large changes in some frequencies due to relative size of V (1616)
  – e.g. f(want to): 786 → 338

original bigram frequencies f(wn-1 wn) (= figure 6.4):
             I      want   to     eat    Chinese  food   lunch
  I          8      1087   0      13     0        0      0
  want       3      0      786    0      6        8      6
  to         3      0      10     860    3        0      12
  eat        0      0      2      0      19       2      52
  Chinese    2      0      0      0      0        120    1
  food       19     0      17     0      0        0      0
  lunch      4      0      0      0      0        1      0

Add-One adjusted frequencies f*(wn-1 wn) (= figure 6.8):
             I      want    to      eat     Chinese  food    lunch
  I          6.12   740.05  0.68    9.52    0.68     0.68    0.68
  want       1.72   0.43    337.76  0.43    3.00     3.86    3.00
  to         2.67   0.67    7.35    575.41  2.67     0.67    8.69
  eat        0.37   0.37    1.10    0.37    7.35     1.10    19.47
  Chinese    0.35   0.12    0.12    0.12    0.12     14.09   0.23
  food       9.65   0.48    8.68    0.48    0.48     0.48    0.48
  lunch      1.11   0.22    0.22    0.22    0.22     0.44    0.22

(a sketch reproducing a few of these adjusted counts follows below)
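A sketch reproducing two of the adjusted counts in the table above. The unigram frequencies f(wn-1) are not printed on this slide; the values below are the textbook's counts for these seven words, so treat them as an assumption carried over from the book.

    # assumed unigram frequencies for the seven words (from the textbook's corpus)
    unigram_f = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                 "Chinese": 213, "food": 1506, "lunch": 459}
    V = 1616   # vocabulary size quoted in the remark above

    def f_star(wn_1, wn, f_bigram):
        """f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)"""
        return (f_bigram + 1) * unigram_f[wn_1] / (unigram_f[wn_1] + V)

    print(round(f_star("I", "want", 1087), 2))   # 740.05, as in the adjusted table
    print(round(f_star("want", "to", 786), 2))   # 337.76, the 786 -> 338 change noted above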
26
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
• Probabilities
• Remarks: perturbation problem
  – similar changes in probabilities

Add-One smoothed bigram probabilities (= figure 6.7):
             I        want     to       eat      Chinese  food     lunch
  I          0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
  want       0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
  to         0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
  eat        0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
  Chinese    0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
  food       0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
  lunch      0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048

original (unsmoothed) bigram probabilities (= figure 6.5):
             I        want     to       eat      Chinese  food     lunch
  I          0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
  want       0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
  to         0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
  eat        0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
  Chinese    0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
  food       0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
  lunch     0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000
27
Smoothing and N-grams
• Excel spreadsheet available
  – addone.xls