
1

Language Modeling

2

Roadmap (for next two classes)

Review LMs What are they? How (and where) are they used? How are they trained?

Evaluation metrics Entropy Perplexity

Smoothing Good-Turing Backoff and Interpolation Absolute Discounting Kneser-Ney

3

What is a language model?

Gives a probability of communication, of transmitted signals, of information (Claude Shannon, Information Theory)

Lots of ties to Cryptography and Information Theory

We most often use n-gram models

4

Applications

5

Applications

What word sequence (English) does this phoneme sequence correspond to:

AY D L AY K T UW

Phoneme Example Translation

AH hut HH AH T

AY hide HH AY D

B be B IY

CH cheese CH IY Z

D dee D IY

EH Ed EH D

IY eat IY T

K key K IY

L lee L IY

N knee N IY

P pee P IY

S sea S IY

T tea T IY

UW two T UW

Z zee Z IY

6

Applications

What word sequence (English) does this phoneme sequence correspond to:

AY D L AY K T UW

R EH K AH N AY S B IY CH


7

Applications

What word sequence (English) does this phoneme sequence correspond to:

AY D L AY K T UW

R EH K AH N AY S B IY CH

Goal of LM: P(“I’d like to recognize speech”) > P(“I’d like to wreck a nice beach”)


8

Why n-gram LMs?

We could just count how often a sentence occurs… …but language is too productive – infinite combos! Break down by word – predict each given its history

We could just count words in context… …but even contexts get too sparse. Just use the last n-1 words in an n-gram model.

9

Language Model Evaluation Metrics

10

Entropy and perplexity

Entropy measures the information content in a distribution, i.e., the uncertainty.

If I can predict the next word before it comes, there is no information content. Zero uncertainty means the signal carries zero information. How many bits of additional information do I need to guess the next symbol?

Perplexity is the average branching factor. If the message carries zero information, the branching factor is 1; if it needs one bit, the branching factor is 2; if it needs two bits, the branching factor is 4.

Entropy and perplexity measure the same thing (uncertainty / information content) with different scales

17

Information in a fair coin flip

event space    probability    entropy (bits)    perplexity
heads          0.5            0.5
tails          0.5            0.5
total          1              1                 2


20

Information in a single fair die

event space    probability    entropy (bits)    perplexity
1              1/6            0.43
2              1/6            0.43
3              1/6            0.43
4              1/6            0.43
5              1/6            0.43
6              1/6            0.43
total          1              2.58              6

24

Information in sum of two dice?

event space    probability    entropy (bits)    perplexity
2              1/36           0.14
3              1/18           0.23
4              1/12           0.30
5              1/9            0.35
6              5/36           0.40
7              1/6            0.43
8              5/36           0.40
9              1/9            0.35
10             1/12           0.30
11             1/18           0.23
12             1/36           0.14
total          1              3.27              9.68
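To make the tables concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the coin, die, and two-dice numbers above:

    import math
    from fractions import Fraction

    def entropy(dist):
        """Entropy in bits: H = sum over x of p(x) * (-log2 p(x))."""
        return sum(p * -math.log2(p) for p in dist if p > 0)

    def perplexity(dist):
        """Perplexity = 2 ** entropy, i.e. the average branching factor."""
        return 2 ** entropy(dist)

    coin = [0.5, 0.5]
    die = [Fraction(1, 6)] * 6
    two_dice = [Fraction(n, 36) for n in (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]

    print(entropy(coin), perplexity(coin))          # 1.0, 2.0
    print(entropy(die), perplexity(die))            # ~2.58, ~6.0
    print(entropy(two_dice), perplexity(two_dice))  # ~3.27, ~9.7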

25

Entropy of a distribution

Start with a distribution over events in the event space

Entropy measures the minimum number of bits necessary to encode a message, assuming the message has that distribution.

Key notion – you can use shorter codes for more common messages

(If you’ve heard of Huffman coding, here it is…)

27

Computing Entropy

H(X) = Σx p(x) · (−log2 p(x))

Here −log2 p(x) is the ideal code length for symbol x, and p(x) weights it by the expected occurrences of that symbol.

30

Entropy example

What binary code would I use to represent these?

Sample: cabaa, i.e., counts a/b/c = 3/1/1
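One possible answer (not spelled out in the transcript): with p(a) = 3/5 and p(b) = p(c) = 1/5, a Huffman-style code assigns a → 0, b → 10, c → 11. The expected code length is 0.6·1 + 0.2·2 + 0.2·2 = 1.4 bits per symbol, slightly above the entropy H ≈ 0.6·0.74 + 0.2·2.32 + 0.2·2.32 ≈ 1.37 bits.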

32

Perplexity

If entropy measures the number of bits per symbol, just exponentiate (perplexity = 2^H) to get the branching factor.

33

BIG SWITCH: Cross entropy and language model perplexity

34

The Train/Test Split and Entropy

Before, we were computing entropy using the true distribution itself.

This scores how well we’re doing if we know the true distribution

We estimate parameters on training and evaluate on test

35

Cross entropy

Estimate distribution on training corpus; see how well it predicts testing corpus

Let q be the distribution we learned from the training data, and let w1 … wN be the test data.

Then the cross entropy of the test data given the training data is:

H = −(1/N) Σi log2 q(wi)

This is the negative average logprob. It is also the average number of bits required to encode each test data symbol using our learned distribution.

36

Cross entropy, formally

True distribution p, assumed distribution q: we wrote the codebook using q, but the messages come from p.

H(p, q) = Σx p(x) · (−log2 q(x))

Let p̃ be the count-based (empirical) distribution of the test data; then H(p̃, q) is exactly the negative average logprob from the previous slide.

37

Language model perplexity

Recipe: Train a language model on training data Get negative logprobs of test data, compute average Exponentiate!

Perplexity correlates rather well with: Speech recognition error rates MT quality metrics

LM Perplexities for word-based models are normally between say 50 and 1000

Need to drop perplexity by a significant fraction (not absolute amount) to make a visible impact
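A minimal sketch of that recipe in Python, assuming we already have some model that returns P(w | history); the toy unigram model below is a placeholder, not from the slides:

    import math

    def perplexity(model, test_tokens):
        """Perplexity = 2 ** (negative average log2 probability per test token)."""
        history = []
        logprobs = []
        for w in test_tokens:
            logprobs.append(math.log2(model(w, tuple(history))))
            history.append(w)
        return 2 ** (-sum(logprobs) / len(logprobs))

    # Toy stand-in model: a unigram distribution that ignores the history.
    unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
    model = lambda w, history: unigram[w]
    print(perplexity(model, ["the", "cat", "sat", "the"]))  # ~2.83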

38

Parameter estimation and smoothing

39

Tasks

You get parameters

You want to produce data that conforms to this distribution

This is simulation or data generation

40

Tasks

You get parameters

And observations HHHHTHTTHHHHH

You need to answer: “How likely is this data according to the model?”

This is evaluating the likelihood function

41

Tasks

You get observations: HHTHTTHTHTHHTHTHTTHTHT

You need to find a set of parameters:

This is parameter estimation

42

Parameter estimation

We keep talking about models as distributions with parameters.

How do we estimate parameters? What’s the likelihood of these parameters?

Observations: HHTHTHTTTHTHTTTTTTTTTTTTHHHTHHHTHHHH

Candidate parameters (P(Heads), P(Tails)): (0.00, 1.00), (0.50, 0.50), (0.75, 0.25)

43

Parameter estimation techniques

Often use Relative Frequency Estimate

For certain distributions (“how likely is it that I get k heads when I flip n times?” is a binomial; “how likely is it that I get five 6s when I roll five dice?” is a multinomial), the relative frequency estimate equals the Maximum Likelihood Estimate (MLE).

This is the set of parameters for which the underlying distribution has the max likelihood (another max!)

Formalizes your intuition from the prior slide
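A minimal sketch of relative-frequency (MLE) estimation, for the coin observations above and for bigram counts; the sentence used in the bigram example is made up:

    from collections import Counter

    def mle_coin(observations):
        """Relative frequency = MLE for a coin: P(outcome) = count / N."""
        counts = Counter(observations)
        n = len(observations)
        return {outcome: c / n for outcome, c in counts.items()}

    def mle_bigrams(tokens):
        """MLE bigram model: P(w2 | w1) = C(w1 w2) / C(w1)."""
        contexts = Counter(tokens[:-1])
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(w1, w2): c / contexts[w1] for (w1, w2), c in bigrams.items()}

    print(mle_coin("HHTHTTHTHTHHTHTHTTHTHT"))          # {'H': 0.5, 'T': 0.5}
    print(mle_bigrams("the cat sat on the mat".split()))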

44

Maximum Likelihood has problems :/ Remember: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Two problems: What happens if C(wn-1 wn) = 0?

We assign zero probability to an event… Even worse, what if C(wn-1) = 0?

Divide by zero is undefined!

45

Smoothing

Main goal: prevent zero numerators (zero probs) and zero denominators (divide by zeros)

Make a “sharp” distribution (where some outputs have large probabilities and others have zero probs) be “smoother”

The smoothest distribution is the uniform distribution Constraint:

Result should still be a distribution

46

Smoothing techniques

Add one (Laplace): add 1 to every count, so P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + |V|), where |V| is the vocabulary size.

This can help, but it generally doesn’t do a good job of estimating what’s going on

47

Mixtures / interpolation

Say I have two distributions P1 and P2. Pick any number λ between 0 and 1. Then λP1 + (1 − λ)P2 is a distribution. Two things to show:

(a) Sums to one: Σx [λP1(x) + (1 − λ)P2(x)] = λ·1 + (1 − λ)·1 = 1

(b) All values are ≥ 0: P1(x) and P2(x) are ≥ 0 because they’re distributions, and λ ≥ 0 and (1 − λ) ≥ 0, so the sum is non-negative, and we’re done.

48

Laplace as a mixture

Say we have K possible outcomes and N total observations. Laplace says: PLAP(x) = (c(x) + 1) / (N + K)

Laplace is a mixture between MLE and uniform! Mixture weight is determined by N and K
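The algebra behind the mixture claim (not spelled out on the slide):

(c(x) + 1) / (N + K) = (N / (N + K)) · (c(x) / N) + (K / (N + K)) · (1 / K)

i.e., a mixture of the MLE c(x)/N and the uniform 1/K, with mixture weight λ = N / (N + K) on the MLE.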

BERP Corpus Bigrams

Original bigram probabilities

BERP Smoothed Bigrams

Smoothed bigram probabilities from the BERP corpus

Laplace Smoothing Example

Consider the case where |V| = 100K, C(bigram w1 w2) = 10, and C(trigram w1 w2 w3) = 9.

PMLE(w3 | w1 w2) = 9/10 = 0.9

PLAP(w3 | w1 w2) = (9 + 1) / (10 + 100K) ≈ 0.0001

Too much probability mass is ‘shaved off’ for the zeroes: too sharp a change in probabilities, which is problematic in practice.

Add-δ Smoothing

Problem: Adding 1 moves too much probability mass.

Proposal: Add a smaller fractional mass δ: Padd-δ(wi | wi-1) = (C(wi-1 wi) + δ) / (C(wi-1) + δ|V|)

Issues: Need to pick δ. Still performs badly.
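A minimal sketch of an add-δ smoothed bigram estimate; the toy sentence and the default δ = 0.1 are illustrative choices, not values from the slides:

    from collections import Counter

    def train_counts(tokens):
        """Raw counts needed for an add-δ bigram model."""
        contexts = Counter(tokens[:-1])
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab = set(tokens)
        return contexts, bigrams, vocab

    def p_add_delta(w_prev, w, contexts, bigrams, vocab, delta=0.1):
        """P(w | w_prev) = (C(w_prev w) + δ) / (C(w_prev) + δ·|V|)."""
        return (bigrams[(w_prev, w)] + delta) / (contexts[w_prev] + delta * len(vocab))

    contexts, bigrams, vocab = train_counts("the cat sat on the mat".split())
    print(p_add_delta("the", "cat", contexts, bigrams, vocab))  # seen bigram
    print(p_add_delta("the", "sat", contexts, bigrams, vocab))  # unseen bigram, still > 0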

Good-Turing Smoothing

New idea: Use counts of things you have seen to estimate those you haven’t.

Good-Turing approach: Use the frequency of singletons to re-estimate the frequency of zero-count n-grams.

Notation: Nc is the frequency of frequency c, i.e., the number of n-grams which appear c times. N0 = # of n-grams with count 0; N1 = # of n-grams with count 1.

Good-Turing Smoothing: Estimate the probability of things which occur c times using the count of things which occur c+1 times: c* = (c + 1) · Nc+1 / Nc

Good-Turing: Josh Goodman’s Intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.

You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.

How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18

Assuming so, how likely is it that the next fish is a trout? Must be less than 1/18.

Slide adapted from Josh Goodman, Dan Jurafsky
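A minimal sketch of the Good-Turing re-estimate c* = (c+1)·Nc+1/Nc applied to the fish counts above, treating each species like an n-gram type:

    from collections import Counter

    catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
    N = sum(catch.values())          # 18 fish
    Nc = Counter(catch.values())     # frequency of frequencies: N1=3, N2=1, N3=1, N10=1

    # Probability mass reserved for unseen species: N1 / N
    print(Nc[1] / N)                 # 3/18 ~ 0.17

    # Good-Turing re-estimated count for a species seen once (e.g. trout)
    c = 1
    c_star = (c + 1) * Nc[c + 1] / Nc[c]   # 2 * 1 / 3 ~ 0.67
    print(c_star / N)                      # ~0.037, less than 1/18 ~ 0.056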

GT Fish Example

Bigram Frequencies of Frequencies and GT Re-estimates

Good-Turing Smoothing

From n-gram counts to a conditional probability: use c* from the GT estimate in place of the raw count.

Backoff and Interpolation

Another really useful source of knowledge. If we are estimating the trigram p(z|x,y) but count(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z).

How to combine this trigram, bigram, unigram info in a valid fashion?

Backoff Vs. Interpolation

Backoff: use trigram if you have it, otherwise bigram, otherwise unigram


Interpolation: always mix all three

Interpolation

Simple interpolation: P̂(z | x, y) = λ3·p(z | x, y) + λ2·p(z | y) + λ1·p(z), with the λs summing to 1.

Lambdas can also be conditioned on context. Intuition: higher weight on more frequent n-grams.

How to Set the Lambdas? Use a held-out (development) corpus and choose the lambdas which maximize the probability of the held-out data: fix the n-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. Can use EM to do this search.
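A minimal sketch of simple linear interpolation over MLE trigram, bigram, and unigram estimates; the fixed lambdas and the toy sentence are placeholders (in practice the lambdas would be tuned on held-out data as described above):

    from collections import Counter

    def train(tokens):
        """Raw n-gram counts for n = 1, 2, 3."""
        return {n: Counter(zip(*[tokens[i:] for i in range(n)])) for n in (1, 2, 3)}

    def p_interp(x, y, z, counts, n_tokens, lambdas=(0.1, 0.3, 0.6)):
        """lambda1*p(z) + lambda2*p(z|y) + lambda3*p(z|x,y)."""
        l1, l2, l3 = lambdas
        p1 = counts[1][(z,)] / n_tokens
        p2 = counts[2][(y, z)] / counts[1][(y,)] if counts[1][(y,)] else 0.0
        p3 = counts[3][(x, y, z)] / counts[2][(x, y)] if counts[2][(x, y)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    tokens = "the cat sat on the mat".split()
    counts = train(tokens)
    print(p_interp("the", "cat", "sat", counts, len(tokens)))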

Katz Backoff

Note: We used P* (discounted probabilities) and α weights on the backoff values. Why not just use regular MLE estimates?

Sum over all wi in the n-gram context: if we also backed off to the lower-order n-gram with undiscounted estimates, we would have too much probability mass (the sum would exceed 1).

Solution: Use the P* discounts to save mass for the lower-order n-grams, and apply the α weights so that the backed-off probabilities sum to exactly the amount saved. Details in section 4.7.1.
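A structural sketch only: it follows the backoff shape described above (discounted P* for seen bigrams, α times the unigram otherwise) but uses a single fixed absolute discount as a stand-in for the Good-Turing-based P* of real Katz backoff:

    from collections import Counter

    def katz_style_bigram(w_prev, w, unigrams, bigrams, n_tokens, discount=0.5):
        """Discounted P* for seen bigrams; α-weighted unigram backoff otherwise."""
        if bigrams[(w_prev, w)] > 0:
            return (bigrams[(w_prev, w)] - discount) / unigrams[w_prev]      # P*
        seen = [v for (u, v) in bigrams if u == w_prev]
        saved = discount * len(seen) / unigrams[w_prev]                      # mass held back
        unseen_unigram_mass = 1.0 - sum(unigrams[v] for v in seen) / n_tokens
        alpha = saved / unseen_unigram_mass
        return alpha * unigrams[w] / n_tokens                                # backoff

    tokens = "the cat sat on the mat".split()
    unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
    print(katz_style_bigram("the", "cat", unigrams, bigrams, len(tokens)))   # seen
    print(katz_style_bigram("the", "sat", unigrams, bigrams, len(tokens)))   # backed off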

Toolkits

Two major language modeling toolkits SRILM Cambridge-CMU toolkit


Publicly available, similar functionality Training: Create language model from text file Decoding: Computes perplexity/probability of text

OOV words: <UNK> word

Out Of Vocabulary = OOV words. We don’t use GT smoothing for these, because GT assumes we know the number of unseen events.

Instead: create an unknown word token <UNK>.

Training of <UNK> probabilities: Create a fixed lexicon L of size V. At the text normalization phase, any training word not in L is changed to <UNK>. Then we train its probabilities like a normal word.

At decoding time, if text input: use the <UNK> probabilities for any word not seen in training.
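A minimal sketch of the <UNK> recipe: fix a lexicon, rewrite out-of-lexicon training words, and map unseen test words to <UNK> at decoding time. The rule used to pick the lexicon here (keep words seen at least twice) is an illustrative choice, not from the slides:

    from collections import Counter

    UNK = "<UNK>"

    def build_lexicon(train_tokens, min_count=2):
        """Fixed lexicon L; here: words seen at least min_count times (illustrative rule)."""
        counts = Counter(train_tokens)
        return {w for w, c in counts.items() if c >= min_count}

    def normalize(tokens, lexicon):
        """Replace any word outside the lexicon with <UNK>."""
        return [w if w in lexicon else UNK for w in tokens]

    train = "the cat sat on the mat the cat ran".split()
    lexicon = build_lexicon(train)
    print(normalize(train, lexicon))                   # train the LM on this
    print(normalize("the dog sat".split(), lexicon))   # 'dog' and 'sat' become <UNK>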

Google N-Gram Release


serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234

Google Caveat

Remember the lesson about test sets and training sets... Test sets should be similar to the training set (drawn from the same distribution) for the probabilities to be meaningful.

So... The Google corpus is fine if your application deals with arbitrary English text on the Web.

If not then a smaller domain specific corpus is likely to yield better results.

Class-Based Language Models

Variant of n-gram models using classes or clusters.

Motivation: Sparseness. Flight app.: P(ORD|to), P(JFK|to), … vs. P(airport_name|to). Relate the probability of an n-gram to word classes and a class n-gram.

IBM clustering: assume each word is in a single class, P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci), learned by MLE from data.

Where do classes come from? Hand-designed for the application (e.g. ATIS), or automatically induced clusters from the corpus.
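A minimal sketch of the IBM-style factorization P(wi|wi-1) ≈ P(ci|ci-1) × P(wi|ci); the tiny hand-made word-to-class map and the toy text are mine, not from the slides:

    from collections import Counter

    word2class = {"fly": "VERB", "to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT"}

    def train_class_lm(tokens):
        """MLE estimates for P(c2 | c1) and P(w | c)."""
        classes = [word2class[w] for w in tokens]
        class_bigrams = Counter(zip(classes, classes[1:]))
        class_unigrams = Counter(classes)
        word_given_class = Counter(zip(classes, tokens))
        return class_bigrams, class_unigrams, word_given_class

    def p_class_bigram(w_prev, w, class_bigrams, class_unigrams, word_given_class):
        """P(w | w_prev) ~ P(class(w) | class(w_prev)) * P(w | class(w))."""
        c_prev, c = word2class[w_prev], word2class[w]
        p_cc = class_bigrams[(c_prev, c)] / class_unigrams[c_prev]
        p_wc = word_given_class[(c, w)] / class_unigrams[c]
        return p_cc * p_wc

    model = train_class_lm("fly to ORD fly to JFK".split())
    print(p_class_bigram("to", "JFK", *model))   # airports share strength via the class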

LM Adaptation

Challenge: Need an LM for a new domain, but have little in-domain data.

Intuition: Much of language is pretty general, so we can build from a ‘general’ LM + in-domain data.

Approach: LM adaptation. Train on a large domain-independent corpus; adapt with the small in-domain data set.

What large corpus? Web counts! e.g. the Google n-grams.

Incorporating Longer Distance Context

Why use longer context? N-grams are an approximation, constrained by model size and sparseness.

What sorts of information live in longer context? Priming, topic, sentence type, dialogue act, syntax.

Long Distance LMs

Bigger n! With 284M words of training data, n-grams up to 6 improve the model; 7-20 give no further benefit.

Cache n-grams: Intuition: priming: a word used previously is more likely to be used again. Incrementally create a ‘cache’ unigram model on the test corpus and mix it with the main n-gram LM.

Topic models: Intuition: text is about some topic, and on-topic words are more likely: P(w|h) ≈ Σt P(w|t) P(t|h)

Non-consecutive n-grams: skip n-grams, triggers, variable lengths n-grams
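A minimal sketch of the cache idea: a unigram cache built incrementally from the test text and mixed with the main model; the mixing weight 0.9 and the add-one smoothing of the cache are illustrative choices:

    from collections import Counter

    class CacheMixLM:
        """Mix a main LM with a unigram cache of the words seen so far."""

        def __init__(self, main_model, vocab_size, lam=0.9):
            self.main = main_model          # callable returning P(w | history)
            self.vocab_size = vocab_size
            self.lam = lam
            self.cache = Counter()

        def prob(self, w, history):
            # Add-one smoothing keeps the cache term nonzero for words not yet seen.
            p_cache = (self.cache[w] + 1) / (sum(self.cache.values()) + self.vocab_size)
            p = self.lam * self.main(w, history) + (1 - self.lam) * p_cache
            self.cache[w] += 1              # update the cache after scoring this word
            return p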

Language Models

N-gram models: Finite approximation of infinite context history

Issues: Zeroes and other sparseness Strategies: Smoothing

Add-one, add-δ, Good-Turing, etc Use partial n-grams: interpolation, backoff

Refinements Class, cache, topic, trigger LMs

Kneser-Ney Smoothing

Most commonly used modern smoothing technique. Intuition: improving backoff.

I can’t see without my reading …… Compare P(Francisco|reading) vs P(glasses|reading).

If the bigram is unseen, P(Francisco|reading) backs off to P(Francisco); the high unigram frequency of Francisco then pushes it above P(glasses|reading). However, Francisco appears in few contexts, glasses in many.

So interpolate based on the number of contexts: words seen in more contexts are more likely to appear in new ones.

Kneser-Ney Smoothing

Modeling the diversity of contexts: the continuation probability

Pcontinuation(w) = |{w′ : C(w′ w) > 0}| / |{(u, v) : C(u v) > 0}|

i.e., the number of distinct contexts w has appeared in, normalized by the total number of distinct bigram types. This continuation probability replaces the plain unigram in both the backoff and the interpolation forms of the model.
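A minimal sketch of interpolated Kneser-Ney for bigrams, with a fixed absolute discount d = 0.75 (a common default; the slides do not give a value). The continuation probability counts distinct left contexts, as described above:

    from collections import Counter, defaultdict

    def train_kn_bigram(tokens, d=0.75):
        bigrams = Counter(zip(tokens, tokens[1:]))
        contexts = Counter(tokens[:-1])
        left_contexts = defaultdict(set)   # word -> distinct words seen before it
        followers = defaultdict(set)       # word -> distinct words seen after it
        for (u, v) in bigrams:
            left_contexts[v].add(u)
            followers[u].add(v)
        n_bigram_types = len(bigrams)

        def p_kn(w_prev, w):
            # Continuation probability: in how many distinct contexts does w appear?
            p_cont = len(left_contexts[w]) / n_bigram_types
            # Interpolation weight: the discount mass saved from w_prev's continuations.
            lam = d * len(followers[w_prev]) / contexts[w_prev]
            return max(bigrams[(w_prev, w)] - d, 0) / contexts[w_prev] + lam * p_cont

        return p_kn

    p_kn = train_kn_bigram("the cat sat on the mat".split())
    print(p_kn("the", "cat"))   # seen bigram
    print(p_kn("the", "sat"))   # unseen bigram: gets mass via the continuation probability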

Issues

Relative frequency: typically compute the count of the sequence and divide by the count of its prefix (see the formula below).

Corpus sensitivity Shakespeare vs Wall Street Journal

Very unnatural

N-grams: unigrams capture little; bigrams capture collocations; trigrams capture phrases.

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Additional Issues in Good-Turing

General approach: the estimate of c* for Nc depends on Nc+1.

What if Nc+1 = 0? More zero count problems Not uncommon: e.g. fish example, no 4s

Modifications

Simple Good-Turing Compute Nc bins, then smooth Nc to replace zeroes

Fit a linear regression in log space: log(Nc) = a + b·log(c)

What about large c’s? Those counts should be reliable, so assume c* = c if c is large, e.g. c > k (Katz: k = 5).

Typically combined with other interpolation/backoff
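A minimal sketch of the Simple Good-Turing idea: fit log(Nc) = a + b·log(c) by least squares and use the smoothed Nc values in the c* formula; the toy frequency-of-frequency counts below are made up and deliberately have gaps:

    import math

    def fit_log_linear(nc):
        """Least-squares fit of log(Nc) = a + b*log(c) over the observed (c, Nc) pairs."""
        xs = [math.log(c) for c in nc]
        ys = [math.log(n) for n in nc.values()]
        k = len(xs)
        mean_x, mean_y = sum(xs) / k, sum(ys) / k
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
        return mean_y - b * mean_x, b      # a, b

    def smoothed_c_star(c, a, b):
        """c* = (c+1) * S(c+1) / S(c), where S(c) = exp(a + b*log(c))."""
        s = lambda k: math.exp(a + b * math.log(k))
        return (c + 1) * s(c + 1) / s(c)

    nc = {1: 120, 2: 40, 3: 24, 5: 10, 7: 6}   # toy Nc values; note no c = 4 or c = 6
    a, b = fit_log_linear(nc)
    print(smoothed_c_star(3, a, b))            # works even though N4 = 0 in the raw counts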

