
Language Modeling


Roadmap (for next two classes)

Review LM evaluation metrics: entropy, perplexity

Smoothing: Good-Turing, backoff and interpolation, absolute discounting, Kneser-Ney


Language Model Evaluation Metrics


Applications


Entropy and perplexity

Entropy: measures information content, in bits.
H(X) = -Σ_x p(x) log2 p(x)
H(X) is the message length with an ideal code; use log base 2 if you want to measure in bits!

Cross entropy: measures the ability of a trained model m to compactly represent test data.
H(p, m) = -(1/N) Σ_i log2 m(w_i | w_1 … w_{i-1}), the average negative logprob of the test data.

Perplexity: measures the average branching factor.
Perplexity = 2^(cross entropy)


Language model perplexity

Recipe: train a language model on training data; get the negative logprobs of the test data and compute the average; exponentiate!

Perplexity correlates rather well with speech recognition error rates and MT quality metrics.

LM perplexities for word-based models are normally between, say, 50 and 1000.

You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact.
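As an illustration of the recipe, here is a minimal Python sketch; the model.logprob(word, history) interface returning a base-2 log probability is an assumption for illustration, not something from the slides:

```python
import math

def perplexity(model, test_tokens):
    """Average negative log2 prob of the test data, exponentiated.

    `model.logprob(word, history)` is assumed to return log2 P(word | history);
    any smoothed n-gram model exposing that interface would work here.
    """
    total_neg_logprob = 0.0
    for i, word in enumerate(test_tokens):
        history = test_tokens[max(0, i - 2):i]   # e.g. a trigram model's history
        total_neg_logprob += -model.logprob(word, history)
    avg = total_neg_logprob / len(test_tokens)   # cross entropy, in bits
    return 2 ** avg                              # perplexity
```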


Parameter estimation

What is it?


Parameter estimation

The model form is fixed (coin unigrams, word bigrams, …); we have observations:

H H H T T H T H H

We want to find the parameters. Maximum Likelihood Estimation picks the parameters that assign the most probability to our training data: c(H) = 6, c(T) = 3, so P(H) = 6/9 = 2/3 and P(T) = 3/9 = 1/3.

MLE picks the parameters that are best for the training data… but these don't generalize well to test data: zeros!
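A tiny sketch of MLE for a bigram model on a made-up corpus, showing how unseen test events get probability zero (the corpus and test words are invented for illustration):

```python
from collections import Counter

train = "the cat sat on the mat".split()
bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)

def p_mle(w, prev):
    # MLE bigram probability: c(prev, w) / c(prev)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_mle("cat", "the"))   # 0.5 -- bigram seen in training
print(p_mle("dog", "the"))   # 0.0 -- unseen bigram: the zero problem
```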


Smoothing

Take mass from seen events, give to unseen events: Robin Hood for probability models

MLE at one end of the spectrum; uniform distribution the other

Need to pick a happy medium, and yet maintain a distribution


Smoothing techniques

Laplace, Good-Turing, backoff, mixtures, interpolation, Kneser-Ney


Laplace

From MLE: P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})

To Laplace: P_Laplace(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V), where V is the vocabulary size
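A minimal sketch of add-one (Laplace) smoothing for a bigram model, following the formulas above; the counts interface is an assumption for illustration:

```python
from collections import Counter

def p_laplace(w, prev, bigrams: Counter, unigrams: Counter, V: int):
    """Add-one (Laplace) smoothed bigram: (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# Every bigram, seen or unseen, now gets non-zero probability,
# at the cost of moving quite a lot of mass away from seen events.
```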

Good-Turing Smoothing

New idea: Use counts of things you have seen to estimate those you haven’t


Good-Turing: Josh Goodman Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass. You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18

Assuming so, how likely is it that the next fish is a trout? Must be less than 1/18

Slide adapted from Josh Goodman, Dan Jurafsky


Some more hypotheticals

Species    Puget Sound   Lake Washington   Greenlake
Salmon     8             12                0
Trout      3             1                 1
Cod        1             1                 0
Rockfish   1             0                 0
Snapper    1             0                 0
Skate      1             0                 0
Bass       0             1                 14
TOTAL      15            15                15

How likely is it to find a new (previously unseen) species in each of these places?


Good-Turing Smoothing

New idea: Use counts of things you have seen to estimate those you haven’t

Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

Notation: N_c is the frequency of frequency c, i.e. the number of n-grams which appear c times. N_0 = # of n-grams with count 0; N_1 = # of n-grams with count 1

Good-Turing Smoothing

Estimate the probability of things which occur c times using the probability of things which occur c+1 times.

Discounted counts: steal mass from seen cases to provide for the unseen:

MLE: P(w) = c(w) / N

GT: c*(w) = (c(w) + 1) * N_{c+1} / N_c, and P_GT(w) = c*(w) / N; the total mass left for unseen events is N_1 / N

GT Fish Example
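The slide's worked table isn't reproduced here, but the standard calculation can be sketched from the formulas above (my reconstruction, not the original slide's numbers):

```python
from collections import Counter

# Fish example: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (N = 18)
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                 # 18
Nc = Counter(counts.values())            # N_1 = 3, N_2 = 1, N_3 = 1, N_10 = 1

# Probability mass reserved for unseen species (catfish, bass): N_1 / N
p_unseen = Nc[1] / N                     # 3/18

# Discounted count for singletons (c = 1): c* = (c + 1) * N_{c+1} / N_c
c_star_1 = (1 + 1) * Nc[2] / Nc[1]       # 2 * 1 / 3 = 0.67
p_trout = c_star_1 / N                   # ~0.037, indeed less than the MLE 1/18
print(p_unseen, p_trout)
```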


Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ.

Different languages: Chinese has almost no morphology; Turkish has a lot of morphology, so lots of new words in Turkish!

Different domains: airplane maintenance manuals have a controlled vocabulary; random web posts have an uncontrolled vocabulary.

Bigram Frequencies of Frequencies and GT Re-estimates

Good-Turing Smoothing

From n-gram counts to conditional probabilities: use the discounted count c* from the GT estimate in place of the raw count, e.g. P(w_i | w_{i-1}) = c*(w_{i-1} w_i) / c(w_{i-1})

Additional Issues in Good-Turing

General approach: the estimate of c* for count c depends on N_{c+1}.

What if N_{c+1} = 0? More zero-count problems, and not uncommon: e.g. in the fish example there are no species with count 4.

Modifications

Simple Good-Turing: compute the N_c bins, then smooth N_c to replace the zeroes. Fit a linear regression in log space: log(N_c) = a + b log(c).

What about large c's? They should be reliable, so assume c* = c if c is large, e.g. c > k (Katz: k = 5).

Typically combined with other approaches.
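A minimal sketch of the Simple Good-Turing idea of fitting log(N_c) = a + b log(c) and reading smoothed counts-of-counts off the fit (a plain least-squares fit; real implementations differ in the details):

```python
import math

def fit_log_linear(Nc: dict):
    """Least-squares fit of log(N_c) = a + b*log(c) over observed counts-of-counts."""
    pairs = sorted(Nc.items())
    xs = [math.log(c) for c, _ in pairs]
    ys = [math.log(n) for _, n in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda c: math.exp(a + b * math.log(c))   # smoothed N_c, defined even where N_c was 0

# Discounted counts from the smoothed N_c, so N_{c+1} = 0 is no longer a problem:
smoothed = fit_log_linear({1: 3, 2: 1, 3: 1, 10: 1})       # fish counts-of-counts
c_star = lambda c: (c + 1) * smoothed(c + 1) / smoothed(c)
```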


Backoff and Interpolation

Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but count(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z).

How do we combine this trigram, bigram, and unigram info in a valid fashion?


Backoff vs. Interpolation

Backoff: use trigram if you have it, otherwise bigram, otherwise unigram

Interpolation: always mix all three


Backoff

Start with the bigram distribution P(z|y). But it could be zero… What if we fell back (or "backed off") to the unigram distribution P(z)? That also could be zero…


Backoff

What’s wrong with this distribution?

Doesn’t sum to one! Need to steal mass…


Mixtures

Given distributions P1 and P2, pick any number λ between 0 and 1: λ P1 + (1-λ) P2 is a distribution (Laplace is a mixture!)

Interpolation

Simple interpolation: P_interp(z|x,y) = λ3 P(z|x,y) + λ2 P(z|y) + λ1 P(z), with the λs summing to 1.

Or, pick the interpolation values based on the context (condition the λs on the preceding words).

Intuition: put higher weight on more frequent n-grams.

How to Set the Lambdas?

Use a held-out, or development, corpus. Choose the lambdas which maximize the probability of the held-out data: fix the n-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. EM can be used to do this search.
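A minimal sketch of simple interpolation plus a brute-force grid search for the lambdas on held-out data (a crude stand-in for EM; the p_tri/p_bi/p_uni functions are assumed to be MLE estimates supplied elsewhere):

```python
import itertools, math

def p_interp(z, x, y, lams, p_tri, p_bi, p_uni):
    l3, l2, l1 = lams
    return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)

def tune_lambdas(heldout_trigrams, p_tri, p_bi, p_uni, step=0.1):
    """Grid-search lambdas (summing to 1) that maximize held-out log-likelihood."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l3, l2 in itertools.product(grid, repeat=2):
        l1 = 1.0 - l3 - l2
        if l1 < -1e-9:
            continue
        l1 = max(l1, 0.0)
        ll = sum(math.log(p_interp(z, x, y, (l3, l2, l1), p_tri, p_bi, p_uni) or 1e-12)
                 for x, y, z in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = (l3, l2, l1), ll
    return best
```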


Kneser-Ney Smoothing

Most commonly used modern smoothing technique. Intuition: improving backoff.

"I can't see without my reading……" Compare P(Francisco|reading) vs P(glasses|reading).

P(Francisco|reading) backs off to P(Francisco), and P(glasses|reading) > 0; but the high unigram frequency of Francisco makes the backed-off estimate larger than P(glasses|reading). However, Francisco appears in few contexts, glasses in many.

Interpolate based on the number of contexts: words seen in more contexts are more likely to appear in others.


Kneser-Ney Smoothing: bigrams

Modeling the diversity of contexts: count how many distinct words precede w, i.e. |{w' : c(w', w) > 0}|.

So the unigram term becomes a continuation probability:
P_continuation(w) = |{w' : c(w', w) > 0}| / |{(w', w'') : c(w', w'') > 0}|


Kneser-Ney Smoothing: bigrams

Backoff: P_KN(w | w_prev) = max(c(w_prev w) - d, 0) / c(w_prev) if c(w_prev w) > 0; otherwise α(w_prev) P_continuation(w)


Kneser-Ney Smoothing: bigrams

Interpolation: P_KN(w | w_prev) = max(c(w_prev w) - d, 0) / c(w_prev) + λ(w_prev) P_continuation(w), with λ(w_prev) = (d / c(w_prev)) |{w' : c(w_prev, w') > 0}|
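A minimal sketch of interpolated Kneser-Ney for bigrams, following the formulas above (the discount d = 0.75 and the token-list interface are assumptions for illustration):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])            # c(w_prev)
    # Continuation counts: how many distinct left contexts each word appears after
    left_contexts = defaultdict(set)
    for w1, w2 in bigram_counts:
        left_contexts[w2].add(w1)
    total_bigram_types = len(bigram_counts)

    def p_continuation(w):
        return len(left_contexts[w]) / total_bigram_types

    def p_kn(w, prev):
        c_prev = history_counts[prev]
        if c_prev == 0:
            return p_continuation(w)                 # unseen history: fall back entirely
        discounted = max(bigram_counts[(prev, w)] - d, 0) / c_prev
        # lambda: mass freed by discounting, proportional to # of distinct words after prev
        lam = d * len({w2 for (w1, w2) in bigram_counts if w1 == prev}) / c_prev
        return discounted + lam * p_continuation(w)

    return p_kn
```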


OOV words: the <UNK> word

Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events.

Instead: create an unknown word token <UNK>.

Training the <UNK> probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word.

At decoding time, for text input: use the <UNK> probabilities for any word not in training, plus an additional penalty! <UNK> predicts the class of unknown words; then we need to pick a member.
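A minimal sketch of the <UNK> recipe (the frequency-based lexicon size is an assumption for illustration; the slides only say "a fixed lexicon L of size V"):

```python
from collections import Counter

def build_lexicon(train_tokens, max_vocab=10000):
    """Fixed lexicon L of size V: here, keep the most frequent words."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(max_vocab)}

def normalize(tokens, lexicon):
    """Map any word outside the lexicon to the <UNK> token (training and test alike)."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

# After normalization, <UNK> is trained like any other word in the n-gram model.
```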


Class-Based Language Models

A variant of n-gram models using classes or clusters. Motivation: sparseness.

Flight app: P(ORD|to), P(JFK|to), … become P(airport_name|to). Relate the probability of an n-gram to word classes and a class n-gram.

IBM clustering: assume each word is in a single class, and P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) × P(w_i | c_i), learned by MLE from data.

Where do classes come from? Hand-designed for the application (e.g. ATIS), or automatically induced clusters from a corpus.
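A minimal sketch of the IBM-style class bigram decomposition, assuming a word-to-class mapping is already available (hand-designed or induced):

```python
from collections import Counter

def class_bigram_lm(tokens, word2class):
    """IBM clustering: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i), both by MLE."""
    classes = [word2class[w] for w in tokens]
    class_bigrams = Counter(zip(classes, classes[1:]))
    history_counts = Counter(classes[:-1])    # denominator for P(c_i | c_{i-1})
    class_counts = Counter(classes)           # denominator for P(w_i | c_i)
    word_counts = Counter(tokens)

    def p(w, prev):
        # assumes prev's class occurs as a history in training (sketch, not production code)
        c, c_prev = word2class[w], word2class[prev]
        p_class = class_bigrams[(c_prev, c)] / history_counts[c_prev]
        p_word_given_class = word_counts[w] / class_counts[c]
        return p_class * p_word_given_class

    return p
```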



LM Adaptation

Challenge: we need an LM for a new domain but have little in-domain data.

Intuition: much of language is pretty general, so we can build from a 'general' LM plus in-domain data.

Approach: LM adaptation. Train on a large domain-independent corpus, then adapt with a small in-domain data set.

What large corpus? Web counts! E.g. the Google n-grams.
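One simple adaptation recipe is to interpolate the general and in-domain models; a minimal sketch (the 0.5 weight and the .prob interfaces are assumptions, and the slides do not commit to this particular method):

```python
def adapted_prob(word, history, general_lm, indomain_lm, lam=0.5):
    """Mix a large domain-independent LM with a small in-domain LM."""
    # lam can be tuned on held-out in-domain data, just like interpolation lambdas
    return lam * indomain_lm.prob(word, history) + (1 - lam) * general_lm.prob(word, history)
```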


Incorporating Longer Distance Context

Why use longer context? N-grams are an approximation, limited by model size and sparseness.

What sorts of information live in the longer context? Priming, topic, sentence type, dialogue act, syntax.


Long Distance LMs

Bigger n! With 284M words, n-grams up to 6 improve perplexity; 7 to 20 give no further gains.

Cache n-grams. Intuition: priming (a word used previously is more likely to recur). Incrementally build a 'cache' unigram model on the test corpus and mix it with the main n-gram LM.

Topic models. Intuition: text is about some topic, and on-topic words are likely: P(w|h) ~ Σ_t P(w|t) P(t|h).

Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams.
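A minimal sketch of mixing a cache unigram model with a main n-gram LM (the mixing weight, vocabulary size, and base_lm.prob interface are assumptions for illustration):

```python
from collections import Counter

class CacheLM:
    """Mix a main LM with a unigram 'cache' built incrementally from the test text."""
    def __init__(self, base_lm, lam=0.9, vocab_size=10000):
        self.base_lm = base_lm        # assumed: base_lm.prob(word, history) -> P(word | history)
        self.lam = lam
        self.cache = Counter()
        self.vocab_size = vocab_size

    def prob(self, word, history):
        total = sum(self.cache.values())
        # add-one smoothed cache unigram, so words not yet in the cache keep some mass
        p_cache = (self.cache[word] + 1) / (total + self.vocab_size)
        return self.lam * self.base_lm.prob(word, history) + (1 - self.lam) * p_cache

    def observe(self, word):
        self.cache[word] += 1         # update the cache as the test text is processed
```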

Language Models

N-gram models: a finite approximation of an infinite context history.

Issues: zeroes and other sparseness. Strategies: smoothing (add-one, add-δ, Good-Turing, etc.) and using partial n-grams (interpolation, backoff).

Refinements: class, cache, topic, and trigger LMs.