LING 438/538 Computational Linguistics
Sandiway Fong
Lecture 18: 10/26

Transcript

Page 1: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

LING 438/538 Computational Linguistics

Sandiway Fong

Lecture 18: 10/26

Page 2: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

2

Administrivia

• Reminder
  – Homework 4 due tonight
  – Questions? After class

Page 3: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

3

Last Time

• Background
  – general introduction to probability concepts
    • Sample Space and Events
    • Permutations and Combinations
    • Rule of Counting
    • Event Probability / Conditional Probability
    • Uncertainty and Entropy

[figures: statistical experiment outcomes]

Page 4: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

4

Today’s Topic

• statistical methods are widely used in language processing
  – apply probability theory to language

• N-grams

• reading
  – textbook chapter 6: N-grams

Page 5: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

5

N-grams: Unigrams

• introduction
  – given a corpus of text, the n-grams are the sequences of n consecutive words that occur in the corpus

• example (12-word sentence)
  – the cat that sat on the sofa also sat on the mat

• N=1 (8 distinct unigrams)
  – the 3
  – sat 2
  – on 2
  – cat 1
  – that 1
  – sofa 1
  – also 1
  – mat 1

Page 6: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

6

N-grams: Bigrams

• example (12-word sentence)
  – the cat that sat on the sofa also sat on the mat

• N=2 (9 distinct bigrams)
  – sat on 2
  – on the 2
  – the cat 1
  – cat that 1
  – that sat 1
  – the sofa 1
  – sofa also 1
  – also sat 1
  – the mat 1

2 words

Page 7: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

7

N-grams: Trigrams

• example (12-word sentence)
  – the cat that sat on the sofa also sat on the mat

• N=3 (9 distinct trigrams)
  – most language models stop here, some stop at quadrigrams
    • too many n-grams
    • low frequencies

  – sat on the 2
  – the cat that 1
  – cat that sat 1
  – that sat on 1
  – on the sofa 1
  – the sofa also 1
  – sofa also sat 1
  – also sat on 1
  – on the mat 1

3 words

Page 8: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

8

N-grams: Quadrigrams

• example (12-word sentence)
  – the cat that sat on the sofa also sat on the mat

• N=4 (9 distinct quadrigrams; see the counting sketch below)
  – the cat that sat 1
  – cat that sat on 1
  – that sat on the 1
  – sat on the sofa 1
  – on the sofa also 1
  – the sofa also sat 1
  – sofa also sat on 1
  – also sat on the 1
  – sat on the mat 1

4 words
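The counts on the last four slides can be reproduced with a short script. Below is a minimal sketch in Python, assuming plain whitespace tokenization of the example sentence; the helper name ngrams is just for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat that sat on the sofa also sat on the mat"
tokens = sentence.split()

for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams"), (4, "quadrigrams")]:
    counts = Counter(ngrams(tokens, n))
    print(f"N={n}: {len(counts)} distinct {name}")
    for gram, freq in counts.most_common():
        print("  ", " ".join(gram), freq)
```

Running it gives 8 distinct unigrams and 9 distinct bigrams, trigrams, and quadrigrams for the 12-word example, matching the lists above.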

Page 9: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

9

N-grams: frequency curves

• family of curves, each sorted by decreasing frequency
  – unigrams, bigrams, trigrams, quadrigrams ...

[figure: family of frequency curves (frequency f vs. rank)]

Page 10: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

10

N-grams: the word as a unit

• we count words
• but what counts as a word? (see the tokenization sketch below)
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)
  – capitalization
    • They, they: same token or not?
  – wordform vs. lemma
    • cats, cat: same token or not?
  – disfluencies
    • part of spoken language
    • er, um, main- mainly
    • speech recognition systems have to cope with them
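The choices listed above have to be made before any counting starts. A minimal tokenization sketch in Python; the choices encoded here (keep punctuation as tokens, prepend <s>, optionally lowercase) are one reasonable combination, not the lecture's own tokenizer:

```python
import re

def tokenize(sentence, lowercase=True):
    """Split a sentence into word and punctuation tokens, prefixed by <s>.

    Illustrative choices:
    - punctuation marks become separate tokens (useful surface cues)
    - <s> marks the beginning of the sentence, as a dummy word
    - lowercasing collapses They/they into one type
      (cats vs. cat would still need lemmatization, not handled here)
    """
    if lowercase:
        sentence = sentence.lower()
    # words (allowing internal apostrophes/hyphens) or single punctuation marks
    tokens = re.findall(r"[\w'-]+|[^\w\s]", sentence)
    return ["<s>"] + tokens

print(tokenize("They saw the cat, er, the cats."))
# ['<s>', 'they', 'saw', 'the', 'cat', ',', 'er', ',', 'the', 'cats', '.']
```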

Page 11: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

11

N-grams: Word

• what counts as a word?
  – punctuation
    • useful surface cue
    • also <s> = beginning of a sentence, as a dummy word
    • part-of-speech taggers include punctuation as words (why?)

[figure: punctuation tags from the Penn Treebank tagset]

Page 12: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

12

Language Models and N-grams

• Brown corpus (1 million words):

  word w    f(w)      p(w)
  the       69,971    0.070
  rabbit    11        0.000011

• given a word sequence
  – w1 w2 w3 ... wn
  – probability of seeing wi depends on what we have seen before
    • recall conditional probability introduced last time

• example (section 6.2)
  – Just then, the white rabbit ...
  – expectation is p(rabbit|white) > p(the|white)
  – but p(the) > p(rabbit)

Page 13: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

13

Language Models and N-grams

• given a word sequence
  – w1 w2 w3 ... wn

• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
  – ...
  – p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1 ... wn-2 wn-1)

• note
  – it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences

Page 14: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

14

Language Models and N-grams

• given a word sequence
  – w1 w2 w3 ... wn

• bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov assumption: finite length history
  – 1st order Markov model
  – p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1 ... wn-2 wn-1)
  – p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)

Page 15: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

15

Language Models and N-grams

• trigram approximation
  – 2nd order Markov model
  – just look at the preceding two words only
  – p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1 ... wn-2 wn-1)
  – p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2 wn-1)

• note
  – p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)

Page 16: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

16

Language Models and N-grams

• example: (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)

p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

figure 6.2

Page 17: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

17

Language Models and N-grams

• example: (bigram language model from section 6.2)
  – <s> = start of sentence
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)

figure 6.3

– p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
– = 0.0000081 (different from textbook; checked in the sketch below)
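The arithmetic above is easy to check. A quick sketch using the bigram probabilities quoted on the slide (the values themselves are taken as given):

```python
# bigram probabilities as quoted on the slide
bigram_probs = [
    0.25,   # p(I|<s>)
    0.32,   # p(want|I)
    0.65,   # p(to|want)
    0.26,   # p(eat|to)
    0.001,  # p(British|eat)
    0.60,   # p(food|British)
]

p = 1.0
for prob in bigram_probs:
    p *= prob

print(p)  # ~8.1e-06, i.e. 0.0000081
```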

Page 18: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

18

Language Models and N-grams

• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w)        (w is any word)
  – since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)             (relative frequency)

• note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE); see the sketch below
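A minimal sketch of the relative-frequency (MLE) estimate for bigrams, using the toy sentence from earlier; the function name bigram_mle is illustrative:

```python
from collections import Counter

def bigram_mle(tokens):
    """MLE bigram probabilities: p(wn|wn-1) = f(wn-1 wn) / f(wn-1)."""
    unigram_f = Counter(tokens)
    bigram_f = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): f / unigram_f[w1] for (w1, w2), f in bigram_f.items()}

tokens = "the cat that sat on the sofa also sat on the mat".split()
p = bigram_mle(tokens)
print(p[("sat", "on")])   # 1.0    (f(sat on) = 2, f(sat) = 2)
print(p[("the", "cat")])  # ~0.33  (f(the cat) = 1, f(the) = 3)
```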

Page 19: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

19

Language Models and N-grams

• example
  – p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  – = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
  – = 0.0000081

• in practice calculations are done in log space
  – p(I want to eat British food) = 0.0000081, a tiny number
  – use logprob (log2 probability)
  – actually the sum of (negative) log2s of probabilities

• Question:
  – why sum negative logs of probabilities?

• Answer (Part 1):
  – computer floating point number storage has limited range
    • double (64 bit): roughly 5.0×10^−324 to 1.7×10^308
  – danger of underflow (illustrated below)
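The underflow danger can be made concrete. A small sketch; the per-word probability of 0.0001 and the 100-word length are illustrative values only:

```python
import math

# multiplying many small probabilities underflows in double precision
p_word = 0.0001        # illustrative per-word bigram probability
n_words = 100

direct = p_word ** n_words
print(direct)          # 0.0 -- the true value 1e-400 is below ~5e-324, so it underflows

# summing negative log2 probabilities stays comfortably in range
neg_logprob = sum(-math.log2(p_word) for _ in range(n_words))
print(neg_logprob)     # ~1328.77 (bits); corresponds to 2 ** -1328.77
```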

Page 20: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

20

Language Models and N-grams

• Question:
  – why sum negative logs of probabilities?

• Answer (Part 2):
  – A = BC
  – log(A) = log(B) + log(C)
  – probabilities are in range (0, 1]
  – Note:
    • want probabilities to be non-zero
    • log(0) = -∞
  – logs of probabilities will be negative (up to 0)
  – take the negative to make them positive

[figure: the log function, with the region of interest (0, 1] marked]

Page 21: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

21

Motivation for smoothing

• smoothing: avoid zero probabilities
• consider the bigram model

  p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

• what happens when any individual probability component is zero?
  – multiplication law: 0×X = 0
  – very brittle!
• even in a large corpus, many n-grams will have zero frequency
  – particularly so for larger n

Page 22: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

22

Language Models and N-grams

• Example:

[figures: unigram frequencies, bigram frequencies f(wn-1 wn), and bigram probabilities laid out as a wn-1 × wn matrix]

  – sparse matrix: zeros render the probabilities unusable
  – we'll need to add fudge factors, i.e. do smoothing

Page 23: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

23

Smoothing and N-grams

• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
    • bigram f(wn-1 wn) doesn't exist in the dataset

• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value
  – we'll look at two ways here (just one of them today)

Page 24: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

24

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts
  – simple and no more zeros (but there are better methods)

• unigram
  – p(w) = f(w)/N                       (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)               (with Add-One)
  – f*(w) = (f(w)+1)*N/(N+V)            (with Add-One)
    • V = number of distinct words in the corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One

• bigram (see the sketch below)
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)                      (before Add-One)
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)              (after Add-One)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)     (after Add-One)

must rescale so that total probability mass stays at 1
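These formulas drop straight into code. A minimal sketch of the bigram case; V = 1616 and f(want to) = 786 appear on the next slides, while the unigram count f(want) = 1215 is an assumed value taken from the textbook's restaurant corpus:

```python
def add_one_bigram_prob(f_bigram, f_unigram, V):
    """Add-One smoothed bigram probability: (f(wn-1 wn) + 1) / (f(wn-1) + V)."""
    return (f_bigram + 1) / (f_unigram + V)

def add_one_adjusted_count(f_bigram, f_unigram, V):
    """Reconstituted count f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)."""
    return (f_bigram + 1) * f_unigram / (f_unigram + V)

V = 1616           # number of distinct words (from the next slide)
f_want_to = 786    # bigram count f(want to) (from the next slide)
f_want = 1215      # unigram count f(want) -- assumed, from the textbook's corpus

print(add_one_bigram_prob(f_want_to, f_want, V))     # ~0.278
print(add_one_adjusted_count(f_want_to, f_want, V))  # ~337.8
```

The adjusted count shows the perturbation flagged on the next slide: the observed 786 for want to shrinks to roughly 338 once the probability mass is rescaled.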

Page 25: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

25

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)

• frequencies

Remarks: perturbation problem
  – add-one causes large changes in some frequencies due to the relative size of V (1616)
  – e.g. want to: 786 → 338

bigram frequencies (= figure 6.4):

           I     want   to    eat   Chinese  food  lunch
  I        8     1087   0     13    0        0     0
  want     3     0      786   0     6        8     6
  to       3     0      10    860   3        0     12
  eat      0     0      2     0     19       2     52
  Chinese  2     0      0     0     0        120   1
  food     19    0      17    0     0        0     0
  lunch    4     0      0     0     0        1     0

add-one adjusted frequencies (= figure 6.8):

           I     want    to      eat     Chinese  food   lunch
  I        6.12  740.05  0.68    9.52    0.68     0.68   0.68
  want     1.72  0.43    337.76  0.43    3.00     3.86   3.00
  to       2.67  0.67    7.35    575.41  2.67     0.67   8.69
  eat      0.37  0.37    1.10    0.37    7.35     1.10   19.47
  Chinese  0.35  0.12    0.12    0.12    0.12     14.09  0.23
  food     9.65  0.48    8.68    0.48    0.48     0.48   0.48
  lunch    1.11  0.22    0.22    0.22    0.22     0.44   0.22

Page 26: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

26

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)

• probabilities

Remarks: perturbation problem
  – similar changes in probabilities

add-one smoothed probabilities (= figure 6.7):

           I        want     to       eat      Chinese  food     lunch
  I        0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
  want     0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
  to       0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
  eat      0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
  Chinese  0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
  food     0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
  lunch    0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048

original bigram probabilities (= figure 6.5):

           I        want     to       eat      Chinese  food     lunch
  I        0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
  want     0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
  to       0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
  eat      0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
  Chinese  0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
  food     0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
  lunch    0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000

Page 27: LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

27

Smoothing and N-grams

• Excel spreadsheet available
  – addone.xls

