
1

Language Model (LM)

LING 570

Fei Xia

Week 4: 10/21/2009

2

LM

• Word prediction: predict the next word in a sentence.
  – Ex: I’d like to make a collect ___

• Statistical models of word sequences are called language models (LMs).

• Task:
  – Build a statistical model from the training data.
  – Given a sentence w1 w2 … wn, estimate its probability P(w1 … wn).

• Goal: the model should prefer good sentences to bad ones.

3

Some Terms

• Corpus: a collection of text or speech

• Words: may or may not include punctuation marks.

• Types: the number of distinct words in a corpus

• Tokens: the total number of words in a corpus
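As a quick illustration of types vs. tokens, a minimal Python sketch; the one-line corpus is made up:

```python
# Types vs. tokens on a tiny whitespace-tokenized corpus
# (the one-line text is made up for illustration).
corpus = "the cat sat on the mat . the dog sat ."
tokens = corpus.split()        # every occurrence counts
types = set(tokens)            # distinct words only

print("tokens:", len(tokens))  # 11
print("types :", len(types))   # 7
```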

4

Applications of LMs

• Speech recognition
  – Ex: I bought two/too/to books.

• Handwriting recognition

• Machine translation

• Spelling correction

• …

5

Outline

• N-gram LM

• Evaluation

6

N-gram LM

7

N-gram LM

• Given a sentence w1 w2 … wn, how do we estimate its probability P(w1 … wn)?

• The Markov independence assumption: P(wn | w1, …, wn-1) depends only on the previous k words (for a k-th order Markov model).

• P(w1 … wn) = P(w1) * P(w2|w1) * … * P(wn | w1, …, wn-1)
             ≈ P(w1) * P(w2|w1) * … * P(wn | wn-k, …, wn-1)

• 0th order Markov model: unigram model
• 1st order Markov model: bigram model
• 2nd order Markov model: trigram model
• …
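For concreteness (the four-word sentence is made up), the bigram (1st order) case factors as:

$$P(\text{I like black tea}) \approx P(\text{I})\,P(\text{like}\mid\text{I})\,P(\text{black}\mid\text{like})\,P(\text{tea}\mid\text{black})$$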

8

Unigram LM

• P(w1 … wn) ≈ P(w1) * P(w2) * … * P(wn)

• Estimating P(w):
  – MLE: P(w) = C(w)/N, where N is the number of tokens

• How many states in the FSA?

• How many model parameters?
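A minimal sketch of the unigram MLE above; the toy corpus and variable names are illustrative, not from the slides:

```python
from collections import Counter

# MLE unigram estimates: P(w) = C(w) / N, where N is the number of tokens.
# The training text is a made-up toy corpus.
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = len(tokens)

P = {w: c / N for w, c in counts.items()}
print(P["the"])   # 2/6 ≈ 0.333
```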

9

Bigram LM

• P(w1 … wn) = P(BOS w1 … wn EOS)
             ≈ P(BOS) * P(w1|BOS) * … * P(wn|wn-1) * P(EOS|wn)

• Estimating P(wn | wn-1):
  – MLE: P(wn | wn-1) = C(wn-1, wn) / C(wn-1)

• How many states in the FSA?
• How many model parameters?
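A hedged sketch of bigram MLE with BOS/EOS padding; the toy sentences, the <BOS>/<EOS> symbols, and the helper name are my own illustrative choices:

```python
from collections import Counter

# MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}),
# with sentences padded by <BOS>/<EOS>. The toy corpus is made up.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]

bigram_counts = Counter()
history_counts = Counter()
for s in sentences:
    padded = ["<BOS>"] + s + ["<EOS>"]
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        history_counts[prev] += 1

def p_bigram(cur, prev):
    """MLE estimate; 0.0 for an unseen history or bigram (no smoothing)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / history_counts[prev]

print(p_bigram("cat", "the"))  # 0.5
print(p_bigram("sat", "dog"))  # 1.0
```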

10

Trigram LM

• P(w1 … wn) = P(BOS w1 … wn EOS)
             ≈ P(BOS) * P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)

• Estimating P(wn | wn-2, wn-1):
  – MLE: P(wn | wn-2, wn-1) = C(wn-2, wn-1, wn) / C(wn-2, wn-1)

• How many states in the FSA?
• How many model parameters?

11

Text generation

Unigram: To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have

Bigram: What means, sir. I confess she? then all sorts, he is trim, captain.

Trigram: Sweet prince, Falstaff shall die. Harry of Monmouth’s grave.

4-gram: Will you not tell me who I am? It cannot be but so.
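The sentences above come from models trained on Shakespeare (as in J&M); the sketch below only shows the mechanics of sampling from a bigram count table, using a tiny made-up table rather than that model:

```python
import random

# Sampling sentences from a bigram count table. The tiny table below is
# made up; it is not the Shakespeare model behind the examples above.
bigram_counts = {
    "<BOS>":  {"sweet": 2, "will": 1},
    "sweet":  {"prince": 3},
    "prince": {"<EOS>": 1, ",": 2},
    ",":      {"will": 1},
    "will":   {"you": 2},
    "you":    {"<EOS>": 1},
}

def sample_sentence():
    word, out = "<BOS>", []
    while True:
        words, weights = zip(*bigram_counts[word].items())
        word = random.choices(words, weights=weights)[0]
        if word == "<EOS>":
            return " ".join(out)
        out.append(word)

print(sample_sentence())   # e.g. "sweet prince , will you"
```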

12

N-gram LM packages

• SRI LM toolkit

• CMU LM toolkit

• …

13

So far

• N-grams:
  – Number of states in the FSA: |V|^(N-1)
  – Number of model parameters: |V|^N (see the sketch below)

• Remaining issues:
  – Data sparseness problem → smoothing
    • Unknown words: OOV rate
  – Mismatch between training and test data → model adaptation
  – Other LM models: structured LM, class-based LM
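A back-of-the-envelope illustration of those counts, assuming a hypothetical 20,000-word vocabulary:

```python
# Rough size of an unsmoothed N-gram model for a hypothetical 20,000-word
# vocabulary: |V|^(N-1) FSA states and |V|^N parameters.
V = 20_000
for N in (1, 2, 3):
    print(f"{N}-gram: {V ** (N - 1):.2e} states, {V ** N:.2e} parameters")
```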

14

Evaluation

15

Evaluation (in general)

• Evaluation is required for almost all CompLing papers.

• There are many factors to consider:

– Data
– Metrics
– Results of competing systems
– …

• You need to think about evaluation from the very beginning.

16

Rules for evaluation

• Always evaluate your system
• Use standard metrics
• Separate training/dev/test data
• Use standard training/dev/test data
• Clearly specify the experimental setting
• Include a baseline and results from competing systems
• Perform error analysis
• Show the system is useful for real applications (optional)

17

Division of data

• Training data
  – True training data: to learn model parameters
  – Held-out data: to tune other parameters

• Development data: used when developing a system.

• Test data: used only once, for the final evaluation

• Dividing the data:
  – Common ratio: 80%, 10%, 10%
  – N-fold cross-validation
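A minimal sketch of the common 80%/10%/10% split; the sentence list is a placeholder, and real work often uses standard published splits instead:

```python
import random

# One common way to make an 80%/10%/10% train/dev/test split.
# The sentence list is a placeholder; real work often uses standard splits.
random.seed(0)
sentences = [f"sentence {i}" for i in range(1000)]
random.shuffle(sentences)

n = len(sentences)
train = sentences[:int(0.8 * n)]
dev = sentences[int(0.8 * n):int(0.9 * n)]
test = sentences[int(0.9 * n):]

print(len(train), len(dev), len(test))   # 800 100 100
```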

18

Standard metrics for LM

• Direct evaluation
  – Perplexity

• Indirect evaluation:
  – ASR
  – MT
  – …

19

Perplexity

• Perplexity is based on computing the probability of each sentence in the test set.

• Intuitively, whichever model assigns a higher probability to the test set is a better model.

20

Definition of perplexity

Test data T = s1 … sm. Let N be the total number of words in T.

$PP(T) = P(T)^{-1/N}$

Lower values mean that the model is better.

21

Perplexity

22

Calculating Perplexity

Suppose T consists of m sentences: s1, …, sm

$PP(T) = 10^{-\frac{1}{N}\sum_{i=1}^{m}\log_{10} P(s_i)}$

N = word_num + sent_num − oov_num

23

Calculating P(s)

• Let s = w1 … wn

  P(w1 … wn) = P(BOS w1 … wn EOS)
             = P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)

• If an n-gram contains an unknown word, skip that n-gram (i.e., remove it from the equation) and increment oov_num.
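Putting slides 22-23 together, a hedged sketch for a bigram model; `logprob10` is an assumed helper returning log10 P(cur | prev), and `vocab` is the training vocabulary:

```python
def perplexity(sentences, logprob10, vocab):
    """Sketch of slides 22-23 for a bigram model.
    `logprob10(prev, cur)` is an assumed helper returning log10 P(cur | prev);
    `vocab` is the training vocabulary. N = word_num + sent_num - oov_num,
    and PP = 10 ** (-(1/N) * log10 P(T))."""
    total_lp = 0.0
    word_num = sent_num = oov_num = 0
    for s in sentences:
        sent_num += 1
        word_num += len(s)
        oov_num += sum(1 for w in s if w not in vocab)
        padded = ["<BOS>"] + s + ["<EOS>"]
        for prev, cur in zip(padded, padded[1:]):
            # skip any n-gram that contains an unknown word
            if all(w in vocab or w in ("<BOS>", "<EOS>") for w in (prev, cur)):
                total_lp += logprob10(prev, cur)
    N = word_num + sent_num - oov_num
    return 10 ** (-total_lp / N)
```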

24

Some intuition about perplexity

• Given a vocabulary V, assume a uniform distribution, i.e., P(w) = 1/|V|.

• The perplexity of any test data T under this unigram LM is |V| (a short derivation follows below).

• Perplexity is a measure of effective “branching factor”.
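A short derivation of that |V| value, using the definition $PP(T) = P(T)^{-1/N}$ for a test set of N tokens:

$$PP(T) = P(T)^{-1/N} = \left(\prod_{i=1}^{N}\frac{1}{|V|}\right)^{-1/N} = \left(|V|^{-N}\right)^{-1/N} = |V|$$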

25

Standard metrics for LM

• Direct evaluation
  – Perplexity

• Indirect evaluation:
  – ASR
  – MT
  – …

26

ASR

• Word error rate (WER):
  – System: And he saw apart of the movie
  – Gold: Andy saw a part of the movie
  – WER = 3/7
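A sketch of WER as word-level Levenshtein distance divided by the length of the gold reference; the example pair at the bottom is made up, not the slide's:

```python
def wer(gold, hyp):
    """WER = (substitutions + insertions + deletions) / number of gold words,
    via word-level Levenshtein dynamic programming."""
    g, h = gold.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(g) + 1)]
    for i in range(len(g) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(g) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (g[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(g)][len(h)] / len(g)

# made-up example: one deletion out of 8 gold words
print(wer("and he saw a part of the movie",
          "and he saw part of the movie"))   # 0.125
```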

27

Summary

• N-gram LMs: the Markov assumption; MLE estimates from n-gram counts

• Evaluation for LM:
  – Perplexity: $PP(T) = 10^{-\frac{1}{N}\log_{10} P(T)} = 2^{H(L,P)}$
  – Indirect measures: WER for ASR, BLEU for MT, etc.

28

Next time

• Smoothing: J&M 4.5-4.9

• Other LMs: class-based LM, structured LM

29

Additional slides

30

Entropy

• Entropy is a measure of the uncertainty associated with a distribution.

• It gives the lower bound on the average number of bits needed to transmit messages.

• An example:
  – Display the results of horse races.
  – Goal: minimize the number of bits to encode the results.

$H(X) = -\sum_x p(x)\log p(x)$

31

An example

• Uniform distribution: pi=1/8.

• Non-uniform distribution: (1/2,1/4,1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

$H(X) = -8 \cdot \tfrac{1}{8}\log_2\tfrac{1}{8} = 3$ bits

$H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{16}\log_2\tfrac{1}{16} + 4\cdot\tfrac{1}{64}\log_2\tfrac{1}{64}\right) = 2$ bits

An optimal code for the non-uniform case: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)

The uniform distribution has higher entropy.

MaxEnt: make the distribution as “uniform” as possible.
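A quick numeric check of the two entropy values above (base-2 logs):

```python
import math

# Entropy of the two horse-race distributions above.
def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

print(entropy(uniform))  # 3.0 bits
print(entropy(skewed))   # 2.0 bits
```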

32

Cross Entropy

• Entropy: $H(X) = -\sum_x p(x)\log p(x)$

• Cross entropy: $H_c(X) = -\sum_x p(x)\log q(x)$

• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).

• $H(X) \le H_c(X)$
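A small numeric illustration of H(X) ≤ H_c(X); the two distributions below are made up:

```python
import math

# Cross entropy H_c(X) = -sum_x p(x) log2 q(x); p is the true distribution,
# q the estimate. The two distributions below are made up.
def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1 / 2, 1 / 4, 1 / 8, 1 / 8]   # true distribution
q = [1 / 4, 1 / 4, 1 / 4, 1 / 4]   # our estimate

print(cross_entropy(p, p))  # 1.75 bits = H(X)
print(cross_entropy(p, q))  # 2.0  bits >= H(X)
```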

33

Cross entropy of a language

• The cross entropy of a language L:

$H(L, q) = \lim_{n\to\infty} -\frac{1}{n}\sum_{x_{1n}} p(x_{1n}) \log q(x_{1n})$

• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:

$H(L, q) = \lim_{n\to\infty} -\frac{1}{n} \log q(x_{1n}) \approx -\frac{1}{n} \log q(x_{1n})$
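To connect this to the summary's perplexity formula, a restatement of my own under the same "nice language" assumption, for a test corpus of N words and a model q:

$$H(L,q) \approx -\frac{1}{N}\log_2 q(w_1 \dots w_N), \qquad PP = 2^{H(L,q)} = q(w_1 \dots w_N)^{-1/N}$$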

34

Perplexity

