Introduction to Natural Language Processing (600.465): Language Modeling (and the Noisy Channel)


1

Introduction to Natural Language Processing (600.465)

Language Modeling (and the Noisy Channel)

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

hajic@cs.jhu.edu

www.cs.jhu.edu/~hajic

2

The Noisy Channel

• Prototypical case: Input → The channel (adds noise) → Output (noisy)

  e.g. 0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...

• Model: probability of error (noise):

• Example: p(0|1) = .3 p(1|1) = .7 p(1|0) = .4 p(0|0) = .6

• The Task:

known: the noisy output; want to know: the input (decoding)

3

Noisy Channel Applications

• OCR
  – straightforward: text → print (adds noise), scan → image

• Handwriting recognition
  – text → neurons, muscles ("noise"), scan/digitize → image

• Speech recognition (dictation, commands, etc.)
  – text → conversion to acoustic signal ("noise") → acoustic waves

• Machine Translation
  – text in target language → translation ("noise") → source language

• Also: Part of Speech Tagging
  – sequence of tags → selection of word forms → text

4

Noisy Channel: The Golden Rule of OCR, ASR, HR, MT, ...

• Recall:

p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)

A_best = argmax_A p(B|A) p(A)   (The Golden Rule)

• p(B|A): the acoustic/image/translation/lexical model
  – application-specific name
  – will be explored later

• p(A): the language model

Dan Jurafsky

Probabilistic Language Models

Why?

Dan Jurafsky

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5, ..., wn)

• Related task: probability of an upcoming word:
  P(w5 | w1, w2, w3, w4)

• A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.

• Better: the grammar. But "language model" or LM is standard.

7

The Perfect Language Model

• Sequence of word forms [forget about tagging for the moment]

• Notation: A ~ W = (w1,w2,w3,...,wd)

• The big (modeling) question:

p(W) = ?

• Well, we know (Bayes/chain rule →):

p(W) = p(w1, w2, w3, ..., wd) =
     = p(w1) × p(w2|w1) × p(w3|w1,w2) × ... × p(wd|w1,w2,...,wd-1)

• Not practical (even for short W → too many parameters)

8

Markov Chain

• Unlimited memory (cf. previous foil):
  – for wi, we know all its predecessors w1, w2, w3, ..., wi-1

• Limited memory:
  – we disregard "too old" predecessors
  – remember only k previous words: wi-k, wi-k+1, ..., wi-1
  – called "kth order Markov approximation"

• + stationary character (no change over time):

p(W) ≅ ∏i=1..d p(wi | wi-k, wi-k+1, ..., wi-1),   d = |W|

9

n-gram Language Models

• (n-1)th order Markov approximation → n-gram LM:

p(W) =df ∏i=1..d p(wi | wi-n+1, wi-n+2, ..., wi-1)  !

• In particular (assume vocabulary |V| = 60k):
  • 0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
  • 1-gram LM: unigram model, p(w), 6 × 10^4 parameters
  • 2-gram LM: bigram model, p(wi|wi-1), 3.6 × 10^9 parameters
  • 3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16 × 10^14 parameters

(wi: the prediction; wi-n+1, ..., wi-1: the history)

10

LM: Observations

• How large n?
  – nothing is enough (theoretically)
  – but anyway: as much as possible (→ close to the "perfect" model)
  – empirically: 3
    • parameter estimation? (reliability, data availability, storage space, ...)
    • 4 is too much: |V| = 60k → 1.296 × 10^19 parameters
    • but: 6-7 would be (almost) ideal (having enough data): in fact, one can recover the original text from 7-grams!

• Reliability ~ (1 / Detail) (→ need a compromise) (detail = longer n-grams)

• For now, keep word forms (no “linguistic” processing)

11

Parameter Estimation

• Parameter: numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  • get rid of formatting etc. ("text cleaning")
  • define words (separate but include punctuation, call it "word")
  • define sentence boundaries (insert "words" <s> and </s>)
  • letter case: keep, discard, or be smart:
    – name recognition
    – number type identification
    [these are huge problems per se!]
  • numbers: keep, replace by <num>, or be smart (form ~ pronunciation)

12

Maximum Likelihood Estimate

• MLE: Relative Frequency...
  – ...best predicts the data at hand (the "training data")

• Trigrams from Training Data T:
  – count sequences of three words in T: c3(wi-2,wi-1,wi)
  – [NB: notation: just saying that the three words follow each other]
  – count sequences of two words in T: c2(wi-1,wi):
    • either use c2(y,z) = Σw c3(y,z,w)
    • or count differently at the beginning (& end) of the data!

p(wi|wi-2,wi-1) =est. c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)  !
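As a concrete illustration of the relative-frequency estimate, here is a minimal Python sketch; the tokenization (including the two <s> padding tokens and the separated period) is an assumption for the example, not something prescribed by the slide.

from collections import Counter

def mle_trigram_model(tokens):
    # MLE: p(w | h2, h1) = c3(h2, h1, w) / c2(h2, h1)
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))  # trigram counts
    c2 = Counter(zip(tokens, tokens[1:]))              # bigram (history) counts
    def p(w, h2, h1):
        return c3[(h2, h1, w)] / c2[(h2, h1)] if c2[(h2, h1)] else 0.0
    return p

train = "<s> <s> He can buy the can of soda .".split()
p = mle_trigram_model(train)
print(p("buy", "He", "can"))   # 1.0 ("He can buy" occurs once, as does the history "He can")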

13

Character Language Model

• Use individual characters instead of words:

p(W) =df ∏i=1..d p(ci | ci-n+1, ci-n+2, ..., ci-1)

• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform cross-entropy between letter- and word-based models:
  HS(pc) = HS(pw) / (avg. # of characters per word in S)

14

LM: an Example

• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125, p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!

Dan Jurafsky

Language Modeling Toolkits

• SRILM
  • http://www.speech.sri.com/projects/srilm/

Dan Jurafsky

Google N-Gram Release, August 2006

Dan Jurafsky

Google N-Gram Release

• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Dan Jurafsky

Google Book N-grams

• http://ngrams.googlelabs.com/

Dan Jurafsky

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?
  • Assigns higher probability to "real" or "frequently observed" sentences
    than to "ungrammatical" or "rarely observed" sentences

• We train the parameters of our model on a training set.
• We test the model's performance on data we haven't seen.

• A test set is an unseen dataset that is different from our training set, totally unused.

• An evaluation metric tells us how well our model does on the test set.

Dan Jurafsky

Extrinsic evaluation of N-gram models

• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly

• Compare accuracy for A and B

Dan Jurafsky

Difficulty of extrinsic (in-vivo) evaluation of N-gram models

• Extrinsic evaluation
  • Time-consuming; can take days or weeks

• So
  • Sometimes use intrinsic evaluation: perplexity
  • Bad approximation
    • unless the test data looks just like the training data
    • So generally only useful in pilot experiments

• But is helpful to think about.

Dan Jurafsky

Intuition of Perplexity

• The Shannon Game:
  • How well can we predict the next word?
  • Unigrams are terrible at this game. (Why?)

• A better model of a text
  • is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____

The 33rd President of the US was ____

I saw a ____

mushrooms 0.1

pepperoni 0.1

anchovies 0.01

….

fried rice 0.0001

….

and 1e-100

Claude Shannon

Dan Jurafsky

Perplexity

• Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 ... wN) ^ (-1/N)

• Chain rule:

  PP(W) = ( ∏i=1..N 1 / P(wi | w1 ... wi-1) ) ^ (1/N)

• For bigrams:

  PP(W) = ( ∏i=1..N 1 / P(wi | wi-1) ) ^ (1/N)

Minimizing perplexity is the same as maximizing probability

The best language model is one that best predicts an unseen test set
  • Gives the highest P(sentence)
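A minimal sketch of this computation under a bigram model; bigram_prob is a hypothetical callback standing in for whatever (smoothed) model is being evaluated, and it must return a nonzero probability for every test bigram.

import math

def perplexity(test_tokens, bigram_prob):
    # PP(W) = P(w1 ... wN) ** (-1/N), accumulated in log space for numerical stability
    log_prob, n = 0.0, 0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_prob(w, prev))
        n += 1
    return math.exp(-log_prob / n)

# Sanity check: a model assigning P = 1/10 to each digit has perplexity 10,
# the "branching factor" example a few slides below.
digits = list("0123456789") * 3
print(perplexity(digits, lambda w, prev: 0.1))   # 10.0 (up to rounding)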

Dan Jurafsky

The Shannon Game intuition for perplexity

• From Josh Goodman: how hard is the task of recognizing digits '0,1,2,3,4,5,6,7,8,9'?
  • Perplexity 10

• How hard is recognizing (30,000) names at Microsoft?
  • Perplexity = 30,000

• If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
  • Perplexity is 53

• Perplexity is weighted equivalent branching factor

Dan Jurafsky

Perplexity as branching factor

• Let's suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?

  PP(W) = ((1/10)^N) ^ (-1/N) = 10

Dan Jurafsky

Lower perplexity = better model

• Training 38 million words, test 1.5 million words, WSJ

N-gram Order Unigram Bigram Trigram

Perplexity 962 170 109

Dan Jurafsky

The wall street journal

28

LM: an Example

• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125, p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!

29

LM: an Example (The Problem)

• Cross-entropy:
  • S = <s> <s> It was the greatest buy of all. (test data)

• Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

– all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.

– all bigram probabilities are 0.

– all trigram probabilities are 0.

• We want to make all probabilities non-zero → data sparseness handling

30

The Zero Problem

• "Raw" n-gram language model estimate:
  – necessarily, some zeros
    • many!: trigram model → 2.16 × 10^14 parameters, data ~ 10^9 words
  – which are true 0?
    • optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability vs. other trigrams
    • the optimal situation cannot happen, unfortunately (open question: how much data would we need?)
  – → we don't know
  – we must eliminate the zeros

• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

31

Why do we need Nonzero Probs?

• To avoid infinite cross-entropy:
  – happens when an event is found in the test data which has not been seen in the training data
  – H(p) = ∞ prevents comparing data with ≥ 0 "errors"

• To make the system more robust
  – low count estimates:
    • they typically happen for "detailed" but relatively rare appearances
  – high count estimates: reliable but less "detailed"

32

Eliminating the Zero Probabilities: Smoothing

• Get new p'(w) (same Ω): almost p(w), but no zeros
• Discount w for (some) p(w) > 0: new p'(w) < p(w)

  Σw∈discounted (p(w) - p'(w)) = D

• Distribute D among the w with p(w) = 0: new p'(w) > p(w)
  – possibly also to other w with low p(w)

• For some w (possibly): p'(w) = p(w)

• Make sure Σw∈Ω p'(w) = 1

• There are many ways of smoothing

33

Smoothing by Adding 1 (Laplace)

• Simplest but not really usable:
  – Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    • for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
  – Problem if |V| > c(h) (as is often the case; even >> c(h)!)

• Example: Training data: <s> what is it what is small ?    |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≅ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0
  • p'(it) = .1, p'(what) = .15, p'(.) = .05
    p'(what is it?) = .15^2 × .1^2 ≅ .0002
    p'(it is flying.) = .1 × .15 × .05^2 ≅ .00004

(assume word independence!)

34

Adding less than 1

• Equally simple:
  – Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
    • for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)

• Example: Training data: <s> what is it what is small ?    |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≅ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0
  • Use λ = .1:
  • p'(it) ≅ .12, p'(what) ≅ .23, p'(.) ≅ .01
    p'(what is it?) = .23^2 × .12^2 ≅ .0007
    p'(it is flying.) = .12 × .23 × .01^2 ≅ .000003
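A sketch of the unconditional add-λ estimate used in these two examples; λ = 1 recovers Laplace (add-one) smoothing, and the commented values reproduce the numbers above. Names and tokenization are illustrative.

from collections import Counter

def add_lambda_unigram(train_tokens, vocab, lam=1.0):
    # p'(w) = (c(w) + lam) / (|T| + lam * |V|); lam = 1.0 is Laplace (add-one)
    counts = Counter(train_tokens)
    T, V = len(train_tokens), len(vocab)
    return {w: (counts[w] + lam) / (T + lam * V) for w in vocab}

train = "<s> what is it what is small ?".split()
vocab = {"what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."}
p1 = add_lambda_unigram(train, vocab, lam=1.0)   # p'(it) = .10, p'(what) = .15, p'(.) = .05
p01 = add_lambda_unigram(train, vocab, lam=0.1)  # p'(it) ≅ .12, p'(what) ≅ .23, p'(.) ≅ .01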

Language Modeling

Advanced: Good Turing Smoothing

Reminder: Add-1 (Laplace) Smoothing

More general formulations: Add-k

Unigram prior smoothing

Advanced smoothing algorithms

• Intuition used by many smoothing algorithms
  – Good-Turing
  – Kneser-Ney
  – Witten-Bell

• Use the count of things we've seen once
  – to help estimate the count of things we've never seen

Notation: Nc = Frequency of frequency c

• Nc = the count of things we’ve seen c times

• Sam I am I am Sam I do not eat

I 3, sam 2, am 2, do 1, not 1, eat 1

40

N1 = 3, N2 = 2, N3 = 1

Good-Turing smoothing intuition

• You are fishing (a scenario from Josh Goodman), and caught:
  – 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

• How likely is it that the next species is trout?
  – 1/18

• How likely is it that the next species is new (i.e. catfish or bass)?
  – Let's use our estimate of things-we-saw-once to estimate the new things.
  – 3/18 (because N1 = 3)

• Assuming so, how likely is it that the next species is trout?
  – Must be less than 1/18 – discounted by 3/18!
  – How to estimate?

• Seen once (trout)
  • c = 1
  • MLE p = 1/18
  • c*(trout) = 2 * N2/N1 = 2 * 1/3 = 2/3
  • P*_GT(trout) = (2/3) / 18 = 1/27

Good Turing calculations

• Unseen (bass or catfish)
  – c = 0
  – MLE p = 0/18 = 0
  – P*_GT(unseen) = N1/N = 3/18
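A toy sketch of the Good-Turing re-estimate c* = (c+1) N(c+1) / N(c) on the fishing example; a real implementation would first smooth the N(c) curve (Simple Good-Turing), which this version deliberately omits.

from collections import Counter

def good_turing(counts):
    # c* = (c + 1) * N_{c+1} / N_c, plus the total mass N_1 / N reserved for unseen events.
    # Note: for the largest counts N_{c+1} is 0, so c* collapses to 0 unless N_c is smoothed.
    N = sum(counts.values())
    Nc = Counter(counts.values())                  # frequency of frequencies
    c_star = {c: (c + 1) * Nc.get(c + 1, 0) / Nc[c] for c in Nc}
    return c_star, Nc.get(1, 0) / N

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
c_star, p_unseen = good_turing(catch)
print(c_star[1])                        # 2/3 = 2 * N2/N1
print(c_star[1] / sum(catch.values()))  # P*_GT(trout) = (2/3)/18 ≈ 1/27
print(p_unseen)                         # 3/18: mass reserved for unseen species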

Ney et al.’s Good Turing Intuition

43

Held-out words:

H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. PAMI. 17:12,1202-1212

Ney et al. Good Turing Intuition(slide from Dan Klein)

• Intuition from leave-one-out validation
  – Take each of the c training words out in turn
  – c training sets of size c-1, held-out of size 1
  – What fraction of held-out words are unseen in training?
    • N1/c
  – What fraction of held-out words are seen k times in training?
    • (k+1)Nk+1/c
  – So in the future we expect (k+1)Nk+1/c of the words to be those with training count k
  – There are Nk words with training count k
  – Each should occur with probability:
    • (k+1)Nk+1/c/Nk
  – ...or expected count: k* = (k+1)Nk+1/Nk

[Diagram: frequency-of-frequency buckets N1, N2, N3, ..., N3511, N4417 in the training data map to buckets N0, N1, N2, ..., N3510, N4416 in the held-out data]

Good-Turing complications (slide from Dan Klein)

• Problem: what about “the”? (say c=4417)

– For small k, Nk > Nk+1

– For large k, too jumpy, zeros wreck estimates

– Simple Good-Turing [Gale and Sampson]: replace empirical Nk with a best-fit power law once counts get unreliable


Resulting Good-Turing numbers

• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire

Count c    Good Turing c*
0          .0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

Language Modeling

Advanced:

Kneser-Ney Smoothing

Resulting Good-Turing numbers

• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire

• It sure looks like c* = (c - .75)

Count c

Good Turing c*

0 .0000270

1 0.446

2 1.26

3 2.24

4 3.24

5 4.22

6 5.19

7 6.21

8 7.24

9 8.25

Absolute Discounting Interpolation

• Save ourselves some time and just subtract 0.75 (or some d)!

– (Maybe keeping a couple extra values of d for counts 1 and 2)

• But should we really just use the regular unigram P(w)?

49

P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1) + λ(wi-1) P(wi)

  (first term: the discounted bigram; P(wi): the unigram; λ(wi-1): the interpolation weight)

Kneser-Ney Smoothing I

• Better estimate for probabilities of lower-order unigrams!
  – Shannon game: I can't see without my reading ___________? (Francisco or glasses?)
  – "Francisco" is more common than "glasses"
  – ... but "Francisco" always follows "San"

• The unigram is useful exactly when we haven't seen this bigram!

• Instead of P(w): "How likely is w?"

• Pcontinuation(w): "How likely is w to appear as a novel continuation?"
  – For each word, count the number of bigram types it completes
  – Every bigram type was a novel continuation the first time it was seen

Kneser-Ney Smoothing II

• How many times does w appear as a novel continuation?
  – Pcontinuation(w) ∝ |{wi-1 : c(wi-1, w) > 0}|

• Normalized by the total number of word bigram types:
  – Pcontinuation(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|

Kneser-Ney Smoothing III

• Alternative metaphor: the number of word types seen to precede w:
  – |{wi-1 : c(wi-1, w) > 0}|

• Normalized by the number of word types preceding all words:
  – Pcontinuation(w) = |{wi-1 : c(wi-1, w) > 0}| / Σw' |{w'i-1 : c(w'i-1, w') > 0}|

• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability

Kneser-Ney Smoothing IV

53

P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0) / c(wi-1) + λ(wi-1) Pcontinuation(wi)

• λ is a normalizing constant: the probability mass we've discounted

  λ(wi-1) = (d / c(wi-1)) × |{w : c(wi-1, w) > 0}|
            (the normalized discount) × (the number of word types that can follow wi-1
             = # of word types we discounted = # of times we applied the normalized discount)

Kneser-Ney Smoothing: Recursive formulation

54

P_KN(wi | wi-n+1..wi-1) = max(c_KN(wi-n+1..wi) - d, 0) / c_KN(wi-n+1..wi-1) + λ(wi-n+1..wi-1) P_KN(wi | wi-n+2..wi-1)

where c_KN(·) = the ordinary count for the highest order, and the continuation count for lower orders.
Continuation count = the number of unique single-word contexts for •.
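A bigram-only interpolated Kneser-Ney sketch that puts the above pieces together (absolute discount d = 0.75, continuation probability as defined earlier); the treatment of unseen histories and the function shape itself are assumptions of this sketch, not anything prescribed on the slides.

from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    # P_KN(w | prev) = max(c(prev, w) - d, 0) / c(prev) + lambda(prev) * P_cont(w)
    # P_cont(w)      = |{u : c(u, w) > 0}| / |bigram types|
    # lambda(prev)   = d * |{w : c(prev, w) > 0}| / c(prev)
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])                    # count of each word as a history
    followers, preceders = defaultdict(set), defaultdict(set)
    for (u, w) in bigrams:
        followers[u].add(w)
        preceders[w].add(u)
    n_types = len(bigrams)

    def p(w, prev):
        cont = len(preceders[w]) / n_types         # continuation probability
        if hist[prev] == 0:
            return cont                            # unseen history: fall back entirely
        lam = d * len(followers[prev]) / hist[prev]
        return max(bigrams[(prev, w)] - d, 0) / hist[prev] + lam * cont
    return p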

Backoff and Interpolation

• Sometimes it helps to use less context
  – Condition on less context for contexts you haven't learned much about

• Backoff:
  – use trigram if you have good evidence,
  – otherwise bigram, otherwise unigram

• Interpolation:
  – mix unigram, bigram, trigram

• Interpolation works better

56

Smoothing by Combination: Linear Interpolation

• Combine what?
  • distributions of various levels of detail vs. reliability

• n-gram models:
  • use (n-1)-gram, (n-2)-gram, ..., uniform
    (detail decreases and reliability increases as the order goes down)

• Simplest possible combination:
  – sum of probabilities, normalize:
    • p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
    • p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3
    • (e.g. p'(0|0) = 0.5 × p(0|0) + 0.5 × p(0))

57

Typical n-gram LM Smoothing

• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):

  p'λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) + λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|

• Normalize: λi > 0, Σi=0..n λi = 1 is sufficient (λ0 = 1 - Σi=1..n λi)   (n = 3)

• Estimation using MLE:
  – fix the p3, p2, p1 and |V| parameters as estimated from the training data
  – then find such {λi} which minimizes the cross-entropy (maximizes the probability of the data):
    -(1/|D|) Σi=1..|D| log2(p'λ(wi|hi))
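The mixture itself is one term per order; a sketch in which p3, p2, p1 are assumed to be ML-estimated component models passed in as functions, and λ = (λ0, λ1, λ2, λ3) sums to one.

def interpolated_trigram(p3, p2, p1, V, lambdas):
    # p'(w | h2, h1) = l3*p3(w|h2,h1) + l2*p2(w|h1) + l1*p1(w) + l0/|V|
    l0, l1, l2, l3 = lambdas          # must be nonnegative and sum to 1
    def p(w, h2, h1):
        return l3 * p3(w, h2, h1) + l2 * p2(w, h1) + l1 * p1(w) + l0 / V
    return p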

58

Held-out Data

• What data to use? (to estimate λ)
  – (bad) try the training data T: but we will always get λ3 = 1
    • why? (let piT be an i-gram distribution estimated using relative freq. from T)
    • minimizing HT(p'λ) over a vector λ, p'λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|
      – remember: HT(p'λ) = H(p3T) + D(p3T || p'λ); p3T fixed → H(p3T) fixed, best
      – which p'λ minimizes HT(p'λ)? Obviously, a p'λ for which D(p3T || p'λ) = 0
      – ...and that's p3T (because D(p||p) = 0, as we know)
      – ...and certainly p'λ = p3T if λ3 = 1 (maybe in some other cases, too)
      – (p'λ = 1 × p3T + 0 × p2T + 0 × p1T + 0/|V|)
  – thus: do not use the training data for estimation of λ!
    • must hold out part of the training data (heldout data, H)
    • ...call the remaining data the (true/raw) training data, T
    • the test data S (e.g., for comparison purposes): still different data!

59

The Formulas (for H)

• Repeat: minimizing -(1/|H|) Σi=1..|H| log2(p'λ(wi|hi)) over λ

  p'λ(wi | hi) = p'λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) + λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|  !

• "Expected Counts (of lambdas)": j = 0..3

  c(λj) = Σi=1..|H| (λj pj(wi|hi) / p'λ(wi|hi))  !

• "Next λ": j = 0..3

  λj,next = c(λj) / Σk=0..3 c(λk)  !

60

The (Smoothing) EM Algorithm

1. Start with some λ, such that λj > 0 for all j ∈ 0..3.

2. Compute the "Expected Counts" for each λj.

3. Compute the new set of λj, using the "Next λ" formula.

4. Start over at step 2, unless a termination condition is met.
   • Termination condition: convergence of λ.
   – Simply set an ε, and finish if |λj - λj,next| < ε for each j (step 3).

• Guaranteed to converge: follows from Jensen’s inequality, plus a technical proof.

61

Simple Example

• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z

• Heldout data: baby; use one set of λs (λ1: unigram, λ0: uniform)

• Start with λ1 = .5:
  p'(b) = .5 × .5 + .5 / 26 = .27
  p'(a) = .5 × .25 + .5 / 26 = .14
  p'(y) = .5 × 0 + .5 / 26 = .02

  c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28

  Normalize: λ1,next = .68, λ0,next = .32.

  Repeat from step 2 (recompute p' first for efficient computation, then the c(λi), ...)
  Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).
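The whole EM loop fits in a few lines; the sketch below reruns the toy example (the first iteration yields λ ≈ (.32, .68), as above). The function and variable names are illustrative, not from the slides.

def em_lambdas(heldout, dists, lambdas, tol=0.01, max_iter=100):
    # dists: list of functions w -> p_j(w); lambdas: initial weights, all > 0, summing to 1
    for _ in range(max_iter):
        # E-step: expected counts c(lambda_j) = sum_i lambda_j * p_j(w_i) / p'(w_i)
        counts = [0.0] * len(dists)
        for w in heldout:
            p_mix = sum(l * p(w) for l, p in zip(lambdas, dists))
            for j, (l, p) in enumerate(zip(lambdas, dists)):
                counts[j] += l * p(w) / p_mix
        # M-step: normalize the expected counts to get the next lambdas
        new = [c / sum(counts) for c in counts]
        if max(abs(a - b) for a, b in zip(lambdas, new)) < tol:
            return new
        lambdas = new
    return lambdas

# Toy example: unigram p(a)=.25, p(b)=.5, p(c..r)=1/64, 0 elsewhere; uniform over 26 letters
unigram = lambda w: {"a": .25, "b": .5}.get(w, 1/64 if "c" <= w <= "r" else 0.0)
uniform = lambda w: 1 / 26
print(em_lambdas("baby", [uniform, unigram], [0.5, 0.5]))   # returns [lambda0, lambda1]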

62

Some More Technical Hints

• Set V = {all words from the training data}.
  • You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
  • But: you must never use the test data for your vocabulary!

• Prepend two "words" in front of all data:
  • avoids beginning-of-data problems
  • call these index -1 and 0: then the formulas hold exactly

• When cn(w) = 0:
  • Assign 0 probability to pn(w|h) where cn-1(h) > 0, but a uniform probability (1/|V|) to those pn(w|h) where cn-1(h) = 0
    [this must be done both when working on the heldout data during EM, and when computing the cross-entropy on the test data!]

63

Introduction to Natural Language Processing (600.465)

Mutual Information and Word Classes (class n-gram)

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

hajic@cs.jhu.edu

www.cs.jhu.edu/~hajic

64

The Problem

• Not enough data
• Language Modeling: we do not see "correct" n-grams
  – solution so far: smoothing

• Suppose we see:
  – short homework, short assignment, simple homework

• but not:
  – simple assignment

• What happens to our (bigram) LM?
  – p(homework | simple) = high probability
  – p(assignment | simple) = low probability (smoothed with p(assignment))
  – They should be much closer!

65

Word Classes

• Observation: similar words behave in a similar way
  – trigram LM:
    – in the ... (all nouns/adjectives);
    – catch a ... (all things which can be caught, incl. their accompanying adjectives);
  – trigram LM, conditioning:
    – a ... homework (any attribute of homework: short, simple, late, difficult),
    – ... the woods (any verb that takes the woods as an object: walk, cut, save)
  – trigram LM, both:
    – a (short, long, difficult, ...) (homework, assignment, task, job, ...)

66

Solution

• Use the Word Classes as the "reliability" measure
• Example: we see
  • short homework, short assignment, simple homework
  – but not:
    • simple assignment
  – Cluster into classes:
    • (short, simple) (homework, assignment)
  – covers "simple assignment", too

• Gaining: realistic estimates for unseen n-grams
• Losing: accuracy (level of detail) within classes

67

The New Model

• Rewrite the n-gram LM using classes:
  – Was: [k = 1..n]
    • pk(wi|hi) = c(hi,wi) / c(hi)   [history: (k-1) words]
  – Introduce classes:

    pk(wi|hi) = p(wi|ci) pk(ci|hi)  !

    • history: classes, too [for trigram: hi = ci-2,ci-1; for bigram: hi = ci-1]
  – Smoothing as usual
    • over pk(wi|hi), where each is defined as above (except the uniform distribution, which stays at 1/|V|)

68

Training Data

• Suppose we already have a mapping:
  – r: V → C assigning each word its class (ci = r(wi))

• Expand the training data:
  – T = (w1, w2, ..., w|T|) into
  – TC = (<w1,r(w1)>, <w2,r(w2)>, ..., <w|T|,r(w|T|)>)

• Effectively, we have two streams of data:
  – word stream: w1, w2, ..., w|T|
  – class stream: c1, c2, ..., c|T| (def. as ci = r(wi))

• Expand the Heldout and Test data, too

69

Training the New Model

• As expected, using ML estimates:
  – p(wi|ci) = p(wi|r(wi)) = c(wi,ci) / c(ci) = c(wi) / c(ci)
    • !!! c(wi,ci) = c(wi)   [since ci is determined by wi]
  – pk(ci|hi):
    • p3(ci|hi) = p3(ci|ci-2,ci-1) = c(ci-2,ci-1,ci) / c(ci-2,ci-1)
    • p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1)
    • p1(ci|hi) = p1(ci) = c(ci) / |T|

• Then smooth as usual
  – not the p(wi|ci) nor the pk(ci|hi) individually, but the pk(wi|hi)
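A raw (unsmoothed) sketch of the class-based bigram p(wi|ci) · p2(ci|ci-1); the word-to-class mapping r is assumed given, the toy data and class names are hypothetical, and smoothing would, as the slide says, be applied afterwards to the combined pk(wi|hi).

from collections import Counter

def class_bigram_lm(tokens, r):
    # p(w | prev) = p(w | c) * p2(c | c_prev), with c = r[w]; raw ML estimates, no smoothing
    classes = [r[w] for w in tokens]
    cw = Counter(tokens)                       # c(w) = c(w, r(w)), since r(w) is determined by w
    cc = Counter(classes)                      # c(c)
    cbi = Counter(zip(classes, classes[1:]))   # c(c_prev, c)
    def p(w, prev):
        c, cprev = r[w], r[prev]
        return (cw[w] / cc[c]) * (cbi[(cprev, c)] / cc[cprev])
    return p

# Toy usage with a hypothetical two-class mapping
r = {"short": "ADJ", "simple": "ADJ", "homework": "NOUN", "assignment": "NOUN"}
tokens = "short homework short assignment simple homework".split()
p = class_bigram_lm(tokens, r)
print(p("assignment", "simple"))   # > 0 even though this bigram was never seen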

70

Classes: How To Get Them

• We supposed the classes are given
• Maybe they are in [human] dictionaries, but...
  – dictionaries are incomplete
  – dictionaries are unreliable
  – they do not define classes as an equivalence relation (overlap)
  – they do not define classes suitable for LM
    • small, short... maybe; small and difficult?
• We have to construct them from data (again...)

71

Creating the Word-to-Class Map

• We will talk about bigrams from now on
• Bigram estimate:
  • p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1) = c(r(wi-1),r(wi)) / c(r(wi-1))

• Form of the model (class bigram):
  – just the raw bigram for now:
    • P(T) = ∏i=1..|T| p(wi|r(wi)) p2(r(wi)|r(wi-1))     (p2(c1|c0) =df p(c1))

• Maximize over r (given r → fixed p, p2):
  – define the objective L(r) = (1/|T|) Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1)))
  – rbest = argmaxr L(r)   (L(r) = normalized logprob of the training data, as usual; or negative cross-entropy)

72

Simplifying the Objective Function

• Start from L(r) = (1/|T|) Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1))):

  (1/|T|) Σi=1..|T| log(p(wi|r(wi)) p(r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
  (1/|T|) Σi=1..|T| log(p(wi,r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
  (1/|T|) Σi=1..|T| log(p(wi)) + (1/|T|) Σi=1..|T| log(p2(r(wi)|r(wi-1)) / p(r(wi))) =
  -H(W) + (1/|T|) Σi=1..|T| log(p2(r(wi)|r(wi-1)) p(r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
  -H(W) + (1/|T|) Σi=1..|T| log(p(r(wi),r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
  -H(W) + Σd,e∈C p(d,e) log( p(d,e) / (p(d) p(e)) ) =
  -H(W) + I(D,E)   (the event E picks the class adjacent, to the right, to the one picked by D)

• Since H(W) does not depend on r, we ended up with the need to maximize I(D,E).

73

Maximizing Mutual Information (dependent on the mapping r)

• Result from the previous foil:
  – Maximizing the probability of the data amounts to maximizing I(D,E), the mutual information of the adjacent classes.

• Good:
  – We know what MI is, and we know how to maximize it.

• Bad:
  – There is no way to maximize over so many possible partitionings (~|V|^|V|) – no way to test them all.

74

The Greedy Algorithm

• Define a merging operation on the mapping r: V → C:
  – merge: R × C × C → R' × C': (r,k,l) → (r',C') such that
  – C' = (C - {k,l}) ∪ {m}   (throw out k and l, add a new class m ∉ C)
  – r'(w) = m for w ∈ r⁻¹({k,l}),
    r'(w) = r(w) otherwise.

• 1. Start with each word in its own class (C = V), r = identity.

• 2. Merge two classes k, l into one, m, such that (k,l) = argmaxk,l I_merge(r,k,l)(D,E).

• 3. Set new (r,C) = merge(r,k,l).

• 4. Repeat 2 and 3 until |C| reaches a predetermined size.
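A deliberately naive sketch of this procedure: it recomputes I(D,E) from scratch for every candidate merge, which is far too slow for a real vocabulary but makes the objective explicit (practical class-induction algorithms maintain incremental updates of the MI instead). All names here are illustrative.

import math
from collections import Counter

def adjacent_class_mi(tokens, r):
    # I(D, E) over adjacent class pairs under the mapping r: word -> class
    classes = [r[w] for w in tokens]
    pairs = Counter(zip(classes, classes[1:]))
    left = Counter(d for d, _ in pairs.elements())
    right = Counter(e for _, e in pairs.elements())
    n = sum(pairs.values())
    return sum((c / n) * math.log2((c / n) / ((left[d] / n) * (right[e] / n)))
               for (d, e), c in pairs.items())

def greedy_merge(tokens, target_size):
    r = {w: w for w in set(tokens)}            # step 1: each word in its own class, r = identity
    while len(set(r.values())) > target_size:
        cls = sorted(set(r.values()))
        best = None
        for i, k in enumerate(cls):            # step 2: try every pair (k, l)
            for l in cls[i + 1:]:
                merged = {w: (k if c in (k, l) else c) for w, c in r.items()}
                mi = adjacent_class_mi(tokens, merged)
                if best is None or mi > best[0]:
                    best = (mi, k, l)
        _, k, l = best                         # step 3: keep the best-scoring merge
        r = {w: (k if c in (k, l) else c) for w, c in r.items()}
    return r                                   # step 4: stop at the predetermined size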

75

Word Classes in Applications

• Word Sense Disambiguation: context not seen [enough(-times)]
• Parsing: verb-subject, verb-object relations
• Speech recognition (acoustic model): need more instances of [rare(r)] sequences of phonemes
• Machine Translation: translation equivalent selection [for rare(r) words]

Spelling Correction and

the Noisy Channel

The Spelling Correction Task

Dan Jurafsky

Applications for spelling correction

77

Web search

Phones

Word processing

Dan Jurafsky

Spelling Tasks

• Spelling Error Detection
• Spelling Error Correction:
  • Autocorrect: hte → the
  • Suggest a correction
  • Suggestion lists

78

Dan Jurafsky

Types of spelling errors

• Non-word Errors
  • graffe → giraffe

• Real-word Errors
  • Typographical errors
    • three → there
  • Cognitive Errors (homophones)
    • piece → peace
    • too → two

79

Dan Jurafsky

Rates of spelling errors

26%: Web queries (Wang et al. 2003)

13%: Retyping, no backspace (Whitelaw et al., English & German)

7%: Words corrected retyping on a phone-sized organizer
2%: Words uncorrected on the organizer (Soukoreff & MacKenzie 2003)

1-2%: Retyping (Kane and Wobbrock 2007, Grudin 1983)

80

Dan Jurafsky

Non-word spelling errors

• Non-word spelling error detection:
  • Any word not in a dictionary is an error
  • The larger the dictionary the better

• Non-word spelling error correction:
  • Generate candidates: real words that are similar to the error
  • Choose the one which is best:
    • Shortest weighted edit distance
    • Highest noisy channel probability

81

Dan Jurafsky

Real word spelling errors

• For each word w, generate a candidate set:
  • Find candidate words with similar pronunciations
  • Find candidate words with similar spelling
  • Include w in the candidate set

• Choose the best candidate
  • Noisy Channel
  • Classifier

82

Spelling Correction and

the Noisy Channel

The Noisy Channel Model of Spelling

Dan Jurafsky

Noisy Channel Intuition

84

Dan Jurafsky

Noisy Channel

• We see an observation x of a misspelled word
• Find the correct word ŵ:

  ŵ = argmaxw∈V P(w|x) = argmaxw∈V P(x|w) P(w)

85

Dan Jurafsky

History: Noisy channel for spelling proposed around 1990

• IBM
  • Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517-522.

• AT&T Bell Labs
  • Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210.

Dan Jurafsky

Non-word spelling error example

acress

87

Dan Jurafsky

Candidate generation

• Words with similar spelling
  • Small edit distance to the error

• Words with similar pronunciation
  • Small edit distance of the pronunciation to the error

88

Dan Jurafsky

Damerau-Levenshtein edit distance

• Minimal edit distance between two strings, where the edits are:
  • Insertion
  • Deletion
  • Substitution
  • Transposition of two adjacent letters
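A sketch of the (restricted) Damerau-Levenshtein distance with exactly these four operations; under it, "acress" is one edit away from every candidate in the table on the next slide.

def damerau_levenshtein(a, b):
    # Dynamic programming over insertions, deletions, substitutions,
    # and transpositions of adjacent letters (restricted / "optimal string alignment" variant)
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("acress", "caress"))    # 1 (transposition)
print(damerau_levenshtein("acress", "actress"))   # 1 (one missing letter)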

89

Dan Jurafsky

Words within edit distance 1 of "acress"

Error     Candidate Correction    Correct Letter    Error Letter    Type
acress    actress                 t                 -               deletion
acress    cress                   -                 a               insertion
acress    caress                  ca                ac              transposition
acress    access                  c                 r               substitution
acress    across                  o                 e               substitution
acress    acres                   -                 s               insertion
acress    acres                   -                 s               insertion

90

Dan Jurafsky

Candidate generation

• 80% of errors are within edit distance 1
• Almost all errors are within edit distance 2

• Also allow insertion of a space or hyphen
  • thisidea → this idea
  • inlaw → in-law
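A candidate generator in the style of Peter Norvig's spelling corrector (a sketch, not his exact code): enumerate every string within one edit and keep those that appear in the vocabulary.

import string

def edits1(word):
    # Every string one delete, transpose, replace, or insert away from `word`
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocab):
    return {w for w in edits1(word) | {word} if w in vocab}

print(candidates("acress", {"actress", "cress", "caress", "access", "across", "acres"}))
# all six candidates from the "acress" table are within edit distance 1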

91

Dan Jurafsky

Language Model

• Use any of the language modeling algorithms we've learned
  • Unigram, bigram, trigram
• Web-scale spelling correction (web-scale language modeling)
  • Stupid backoff

92

• "Stupid backoff" (Brants et al. 2007)
  • No discounting, just use relative frequencies
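A bigram/unigram sketch of stupid backoff; 0.4 is the backoff factor reported by Brants et al., and the result is a score, not a normalized probability.

from collections import Counter

def stupid_backoff(tokens, alpha=0.4):
    # S(w | prev) = c(prev, w) / c(prev) if the bigram was seen, else alpha * c(w) / N
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    def score(w, prev):
        if bi[(prev, w)] > 0:
            return bi[(prev, w)] / uni[prev]
        return alpha * uni[w] / N
    return score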

Dan Jurafsky

Unigram Prior probability

word       Frequency of word    P(word)
actress    9,321                .0000230573
cress      220                  .0000005442
caress     686                  .0000016969
access     37,038               .0000916207
across     120,844              .0002989314
acres      12,874               .0000318463

93

Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

Dan Jurafsky

Channel model probability

• Error model probability, edit probability
  • Kernighan, Church, Gale 1990

• Misspelled word x = x1, x2, x3, ..., xm
• Correct word w = w1, w2, w3, ..., wn

• P(x|w) = probability of the edit
  • (deletion/insertion/substitution/transposition)

94

Dan Jurafsky

Computing error probability: confusion matrix

del[x,y]:   count(xy typed as x)
ins[x,y]:   count(x typed as xy)
sub[x,y]:   count(x typed as y)
trans[x,y]: count(xy typed as yx)

Insertion and deletion are conditioned on the previous character

95

Dan Jurafsky

Confusion matrix for spelling errors

Dan Jurafsky

Generating the confusion matrix

• Peter Norvig's list of errors
• Peter Norvig's list of counts of single-edit errors

97

Dan Jurafsky

Channel model

98

P(x|w) =
  del[wi-1, wi] / count[wi-1 wi]       if deletion
  ins[wi-1, xi] / count[wi-1]          if insertion
  sub[xi, wi] / count[wi]              if substitution
  trans[wi, wi+1] / count[wi wi+1]     if transposition

Kernighan, Church, Gale 1990

Dan Jurafsky

Channel model for "acress"

Candidate Correction    Correct Letter    Error Letter    x|w      P(x|word)
actress                 t                 -               c|ct     .000117
cress                   -                 a               a|#      .00000144
caress                  ca                ac              ac|ca    .00000164
access                  c                 r               r|c      .000000209
across                  o                 e               e|o      .0000093
acres                   -                 s               es|e     .0000321
acres                   -                 s               ss|s     .0000342

99

Dan Jurafsky

Noisy channel probability for "acress"

Candidate Correction    Correct Letter    Error Letter    x|w      P(x|word)     P(word)       10^9 × P(x|w)P(w)
actress                 t                 -               c|ct     .000117       .0000231      2.7
cress                   -                 a               a|#      .00000144     .000000544    .00078
caress                  ca                ac              ac|ca    .00000164     .00000170     .0028
access                  c                 r               r|c      .000000209    .0000916      .019
across                  o                 e               e|o      .0000093      .000299       2.8
acres                   -                 s               es|e     .0000321      .0000318      1.0
acres                   -                 s               ss|s     .0000342      .0000318      1.0

100
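Combining the two columns is just a product; a sketch using the table's own numbers (the two "acres" rows are kept apart only to mirror the table, and the labels are illustrative).

# Channel (error model) and unigram prior probabilities, copied from the tables above
channel = {"actress": 0.000117, "cress": 0.00000144, "caress": 0.00000164,
           "access": 0.000000209, "across": 0.0000093,
           "acres (es|e)": 0.0000321, "acres (ss|s)": 0.0000342}
prior = {"actress": 0.0000231, "cress": 0.000000544, "caress": 0.0000017,
         "access": 0.0000916, "across": 0.000299,
         "acres (es|e)": 0.0000318, "acres (ss|s)": 0.0000318}

# Rank candidate corrections w for x = "acress" by P(x|w) * P(w)
scores = {w: channel[w] * prior[w] for w in channel}
for w, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{w:15s} {s * 1e9:7.3f}  x 10^-9")
# "across" and "actress" end up on top; the bigram context on the next slide
# ("versatile ___ whose") is what finally favors "actress".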

Dan Jurafsky

Using a bigram language model

• “a stellar and versatile acress whose combination of sass and glamour…”

• Counts from the Corpus of Contemporary American English with add-1 smoothing

• P(actress|versatile) = .000021      P(whose|actress) = .0010
• P(across|versatile) = .000021       P(whose|across) = .000006

• P("versatile actress whose") = .000021 × .0010 = 210 × 10^-10
• P("versatile across whose") = .000021 × .000006 = 1 × 10^-10

101


Dan Jurafsky

Evaluation

• Some spelling error test sets:
  • Wikipedia's list of common English misspellings
  • Aspell filtered version of that list
  • Birkbeck spelling error corpus
  • Peter Norvig's list of errors (includes Wikipedia and Birkbeck, for training or testing)

103

Spelling Correction and

the Noisy Channel

Real-Word Spelling Correction

Dan Jurafsky

Real-word spelling errors

• ...leaving in about fifteen minuets to go to her house.
• The design an construction of the system...
• Can they lave him my messages?
• The study was conducted mainly be John Black.

• 25-40% of spelling errors are real words (Kukich 1992)

105

Dan Jurafsky

Solving real-word spelling errors

• For each word in the sentence
  • Generate a candidate set:
    • the word itself
    • all single-letter edits that are English words
    • words that are homophones

• Choose the best candidate
  • Noisy channel model
  • Task-specific classifier

106

Dan Jurafsky

Noisy channel for real-word spell correction

• Given a sentence w1,w2,w3,…,wn

• Generate a set of candidates for each word wi

• Candidate(w1) = {w1, w’1 , w’’1 , w’’’1 ,…}

• Candidate(w2) = {w2, w’2 , w’’2 , w’’’2 ,…}

• Candidate(wn) = {wn, w’n , w’’n , w’’’n ,…}

• Choose the sequence W that maximizes P(W)

Dan Jurafsky

Noisy channel for real-word spell correction

108

Dan Jurafsky

Noisy channel for real-word spell correction

109

Dan Jurafsky

Simplification: One error per sentence

• Out of all possible sentences with one word replaced:
  • w1, w''2, w3, w4     two off thew
  • w1, w2, w'3, w4      two of the
  • w'''1, w2, w3, w4    too of thew

• …

• Choose the sequence W that maximizes P(W)
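A sketch of this one-error-per-sentence search; candidates(), channel(), and lm_score() are hypothetical callbacks standing in for the candidate generator, the error model, and the language model, and p_no_error is the P(w|w) discussed a couple of slides below.

def correct_sentence(words, candidates, channel, lm_score, p_no_error=0.95):
    # Score the unchanged sentence with the no-error channel probability,
    # then every sentence obtained by replacing exactly one word with a candidate,
    # and return the highest-scoring word sequence.
    best_score, best = p_no_error * lm_score(words), list(words)
    for i, x in enumerate(words):
        for w in candidates(x):
            if w == x:
                continue
            hyp = list(words)
            hyp[i] = w
            score = channel(x, w) * lm_score(hyp)   # P(x|w) * P(W)
            if score > best_score:
                best_score, best = score, hyp
    return best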

Dan Jurafsky

Where to get the probabilities

• Language model
  • Unigram
  • Bigram
  • Etc.

• Channel model
  • Same as for non-word spelling correction
  • Plus we need the probability for no error, P(w|w)

111

Dan Jurafsky

Probability of no error

• What is the channel probability for a correctly typed word?
  • P("the"|"the")

• Obviously this depends on the application
  • .90 (1 error in 10 words)
  • .95 (1 error in 20 words)
  • .99 (1 error in 100 words)
  • .995 (1 error in 200 words)

112

Dan Jurafsky

Peter Norvig’s “thew” example

113

x       w       x|w      P(x|w)       P(w)          10^9 P(x|w)P(w)
thew    the     ew|e     0.000007     0.02          144
thew    thew             0.95         0.00000009    90
thew    thaw    e|a      0.001        0.0000007     0.7
thew    threw   h|hr     0.000008     0.000004      0.03
thew    thwe    ew|we    0.000003     0.00000004    0.0001

Spelling Correction and

the Noisy Channel

State-of-the-art Systems

Dan Jurafsky

HCI issues in spelling

• If very confident in the correction
  • Autocorrect

• Less confident
  • Give the single best correction

• Less confident
  • Give a correction list

• Unconfident
  • Just flag as an error

115

Dan Jurafsky

State of the art noisy channel

• We never just multiply the prior and the error model
  • Independence assumptions → the probabilities are not commensurate
  • Instead: weigh them

    ŵ = argmaxw P(x|w) P(w)^λ

• Learn λ from a development test set

116

Dan Jurafsky

Phonetic error model

• Metaphone, used in GNU aspell
  • Convert the misspelling to a metaphone pronunciation
    • "Drop duplicate adjacent letters, except for C."
    • "If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter."
    • "Drop 'B' if after 'M' and if it is at the end of the word."
    • ...
  • Find words whose pronunciation is 1-2 edit distance from the misspelling's
  • Score the result list
    • Weighted edit distance of candidate to misspelling
    • Edit distance of candidate pronunciation to misspelling pronunciation

117

Dan Jurafsky

Improvements to channel model

• Allow richer edits (Brill and Moore 2000)
  • ent → ant
  • ph → f
  • le → al

• Incorporate pronunciation into the channel (Toutanova and Moore 2002)

118

Dan Jurafsky

Channel model

• Factors that could influence p(misspelling|word):
  • The source letter
  • The target letter
  • Surrounding letters
  • The position in the word
  • Nearby keys on the keyboard
  • Homology on the keyboard
  • Pronunciations
  • Likely morpheme transformations

119

Dan Jurafsky

Nearby keys

Dan Jurafsky

Classifier-based methods for real-word spelling correction

• Instead of just a channel model and a language model
• Use many features in a classifier such as MaxEnt or CRF
• Build a classifier for a specific pair like whether/weather
  • "cloudy" within ±10 words
  • ___ to VERB
  • ___ or not

121