LM
• Word prediction: predict the next word in a sentence.
  – Ex: I'd like to make a collect ___
• Statistical models of word sequences are called language models (LMs).
• Task:
  – Build a statistical model from the training data.
  – Given a sentence w1 w2 … wn, estimate its probability P(w1 … wn).
• Goal: the model should prefer good sentences to bad ones.
Some Terms
• Corpus: a collection of text or speech
• Words: counts may or may not include punctuation marks.
• Types: the number of distinct words in a corpus
• Tokens: the total number of words in a corpus
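A quick illustration of the type/token distinction (a minimal sketch; the toy string and whitespace tokenization are assumptions for the example):

```python
# Count word types (distinct words) vs. tokens (total words) in a toy corpus.
corpus = "the cat sat on the mat"   # hypothetical toy corpus
tokens = corpus.split()             # naive whitespace tokenization
types = set(tokens)                 # the distinct words

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types ("the" occurs twice)
```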
Applications of LMs
• Speech recognition
  – Ex: I bought two/too/to books.
• Handwriting recognition
• Machine translation
• Spelling correction
• …
N-gram LM

• Given a sentence w1 w2 … wn, how do we estimate its probability P(w1 … wn)?
• The Markov independence assumption: the next word depends only on the previous k-1 words (for a k-gram model).
• P(w1 … wn) = P(w1) * P(w2|w1) * … * P(wn | w1, …, wn-1)
             ≈ P(w1) * P(w2|w1) * … * P(wn | wn-k+1, …, wn-1)
• 0th order Markov model: unigram model
• 1st order Markov model: bigram model
• 2nd order Markov model: trigram model
• …
Unigram LM
• P(w1 … wn) ≈ P(w1) * P(w2) * … * P(wn)
• Estimating P(w):
  – MLE: P(w) = C(w)/N, where N is the number of tokens
• How many states in the FSA?
• How many model parameters?
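A minimal sketch of the MLE estimate above (the toy corpus is an assumption):

```python
from collections import Counter

def unigram_mle(tokens):
    """MLE unigram estimates: P(w) = C(w) / N."""
    counts = Counter(tokens)
    n = len(tokens)  # N: total number of tokens
    return {w: c / n for w, c in counts.items()}

probs = unigram_mle("the cat sat on the mat".split())
print(probs["the"])  # 2/6 = 0.3333...
```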
Bigram LM
• P(w1 … wn) = P(BOS w1 … wn EOS)
             ≈ P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)
• Estimating P(wn|wn-1):
  – MLE: P(wn|wn-1) = C(wn-1, wn) / C(wn-1)
• How many states in the FSA?
• How many model parameters?
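A minimal sketch of the bigram MLE estimate with BOS/EOS markers (the toy sentences and marker symbols are assumptions):

```python
from collections import Counter

def bigram_mle(sentences):
    """MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})."""
    bigram_counts, context_counts = Counter(), Counter()
    for sent in sentences:
        words = ["<BOS>"] + sent + ["<EOS>"]
        for prev, cur in zip(words, words[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

probs = bigram_mle([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(probs[("<BOS>", "the")])  # 1.0: both toy sentences start with "the"
print(probs[("the", "cat")])    # 0.5: "the" is followed by "cat" half the time
```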
Trigram LM
• P(w1 … wn) = P(BOS w1 … wn EOS)
             ≈ P(w1|BOS) * P(w2|BOS, w1) * … * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)
• Estimating P(wn|wn-2, wn-1):
  – MLE: P(wn|wn-2, wn-1) = C(wn-2, wn-1, wn) / C(wn-2, wn-1)
• How many states in the FSA?
• How many model parameters?
Text generation
Unigram: To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
Bigram: What means, sir. I confess she? then all sorts, he is trim, captain.
Trigram: Sweet prince, Falstaff shall die. Harry of Monmouth’s grave.
4-gram: Will you not tell me who I am? It cannot be but so.
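Samples like these are produced by repeatedly drawing the next word from the model's conditional distribution until EOS; a minimal bigram sampler (the toy counts below are assumptions, not the model behind the samples above):

```python
import random

# Hypothetical bigram counts; in practice these come from training data.
bigram_counts = {
    "<BOS>": {"sweet": 2, "will": 1},
    "sweet": {"prince": 3},
    "prince": {"<EOS>": 1, ",": 2},
    ",":     {"<EOS>": 1},
    "will":  {"<EOS>": 1},
}

def generate(counts, max_len=20):
    """Sample a sentence word by word from bigram counts."""
    word, out = "<BOS>", []
    for _ in range(max_len):
        nexts = counts[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "<EOS>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigram_counts))  # e.g. "sweet prince ,"
```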
So far
• N-grams:
  – Number of states in the FSA: |V|^(N-1)
  – Number of model parameters: |V|^N
• Remaining issues:
  – Data sparseness problem → smoothing
    • Unknown words: OOV rate
  – Mismatch between training and test data → model adaptation
  – Other LM models: structured LMs, class-based LMs
Evaluation (in general)
• Evaluation is required for almost all CompLing papers.
• There are many factors to consider:
  – Data
  – Metrics
  – Results of competing systems
  – …
• You need to think about evaluation from the very beginning.
Rules for evaluation
• Always evaluate your system
• Use standard metrics
• Separate training/dev/test data
• Use standard training/dev/test data
• Clearly specify the experimental setting
• Include a baseline and results from competing systems
• Perform error analysis
• Show the system is useful for real applications (optional)
Division of data
• Training data:
  – True training data: to learn model parameters
  – Held-out data: to tune other parameters
• Development data: used while developing a system
• Test data: used only once, for the final evaluation
• Dividing the data:
  – Common ratio: 80% / 10% / 10%
  – N-fold cross-validation
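A minimal sketch of the common 80/10/10 split (shuffling before splitting is an assumption; with standard benchmark data the split is usually fixed in advance):

```python
import random

def split_data(sentences, seed=0):
    """Shuffle, then split into 80% train / 10% dev / 10% test."""
    sentences = list(sentences)            # copy so the input is not mutated
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    train = sentences[: int(0.8 * n)]
    dev = sentences[int(0.8 * n): int(0.9 * n)]
    test = sentences[int(0.9 * n):]
    return train, dev, test
```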
Perplexity
• Perplexity is based on computing the probability of each sentence in the test set.
• Intuitively, whichever model assigns a higher probability to the test set is a better model.
Definition of perplexity
Test data T = s1 … sm. Let N be the total number of words in T.

  $PP(T) = P(T)^{-1/N} = 2^{-\frac{1}{N} \log_2 P(T)}$

Lower values mean that the model is better.
Calculating Perplexity
Suppose T consists of m sentences s1, …, sm; then P(T) = P(s1) * … * P(sm).
N = word_num + sent_num – oov_num
Calculating P(s)
• Let s = w1 … wn. With a trigram LM:

  P(w1 … wn) = P(BOS w1 … wn EOS)
             = P(w1|BOS) * P(w2|BOS, w1) * …
               * P(wn|wn-2, wn-1) * P(EOS|wn-1, wn)

• If an n-gram contains an unknown word, skip that n-gram (i.e., remove it from the equation) and increment oov_num.
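Putting the last two slides together, a sketch of the perplexity computation for a trigram LM (the `logprob2` model interface, returning log2 P(word | context), and the marker symbols are assumptions):

```python
def perplexity(test_sentences, logprob2, vocab):
    """PP(T) = 2^(-(1/N) * log2 P(T)), with N = word_num + sent_num - oov_num."""
    total_lp = 0.0
    n = 0
    for sent in test_sentences:
        words = ["<BOS>"] + sent + ["<EOS>"]
        n += len(sent) + 1                  # word_num plus one EOS per sentence
        for i in range(1, len(words)):
            context = tuple(words[max(0, i - 2):i])
            ngram = context + (words[i],)
            if any(w not in vocab and w not in ("<BOS>", "<EOS>") for w in ngram):
                n -= 1                      # skip n-grams containing unknown words
                continue
            total_lp += logprob2(context, words[i])
    return 2 ** (-total_lp / n)
```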
Some intuition about perplexity
• Given a vocabulary V, assume a uniform distribution, i.e., P(w) = 1/|V|.
• The perplexity of any test data T under this unigram LM is PP(T) = |V|.
• Perplexity is a measure of the effective "branching factor".
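A short derivation of that claim, for test data T with N tokens, each assigned probability 1/|V|:

$PP(T) = P(T)^{-1/N} = \left(\prod_{i=1}^{N} \frac{1}{|V|}\right)^{-1/N} = \left(|V|^{-N}\right)^{-1/N} = |V|$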
ASR
• Word error rate (WER):
  – System: And he saw apart of the movie
  – Gold: Andy saw a part of the movie
  – WER = 3/7
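A sketch of WER as word-level Levenshtein (edit) distance divided by the reference length. Note that counting conventions vary, e.g., whether a split such as "apart" vs. "a part" counts as one error or two, so the exact count for the example above depends on the convention used:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                        # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                        # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)
```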
Summary
• N-gram LMs
• Evaluation for LMs:
  – Perplexity = 2^{-(1/N) log2 P(T)} = 2^{H(L,q)}
  – Indirect measures: WER for ASR, BLEU for MT, etc.
Entropy
• Entropy is a measure of the uncertainty associated with a distribution.
• It gives a lower bound on the average number of bits needed to transmit messages drawn from the distribution.
• An example:
  – Display the results of horse races.
  – Goal: minimize the expected number of bits needed to encode the results.
  $H(X) = -\sum_x p(x) \log_2 p(x)$
An example
• Uniform distribution: pi=1/8.
• Non-uniform distribution: (1/2,1/4,1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
Uniform: $H(X) = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits

Non-uniform: $H(X) = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{4}\log_2\frac{1}{4} + \frac{1}{8}\log_2\frac{1}{8} + \frac{1}{16}\log_2\frac{1}{16} + 4 \times \frac{1}{64}\log_2\frac{1}{64}\right) = 2$ bits
Optimal code for the non-uniform distribution: (0, 10, 110, 1110, 111100, 111101, 111110, 111111); its average code length is exactly 2 bits, matching the entropy.
• The uniform distribution has higher entropy.
• MaxEnt: make the distribution as "uniform" as possible.
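Both entropy values above can be checked directly (a minimal sketch):

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))                           # 3.0 bits (uniform)
print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))  # 2.0 bits (non-uniform)
```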
Cross Entropy
• Entropy:
  $H(X) = -\sum_x p(x) \log_2 p(x)$
• Cross entropy:
  $H_c(X) = -\sum_x p(x) \log_2 q(x)$
• Cross entropy measures how well our estimate q(x) matches the true distribution p(x).
• $H(X) \le H_c(X)$, with equality iff q = p.
Cross entropy of a language
• The cross entropy of a language L under a model q:

  $H(L, q) = \lim_{n \to \infty} -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 q(x_{1n})$

• If we make certain assumptions that the language is "nice" (stationary and ergodic), the cross entropy can be calculated from a single long sample:

  $H(L, q) = \lim_{n \to \infty} -\frac{1}{n} \log_2 q(x_{1n})$
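In practice the limit is approximated on finite test data T with N words, which is exactly how perplexity was computed earlier:

$H(L, q) \approx -\frac{1}{N} \log_2 q(T), \qquad PP(T) = 2^{H(L, q)}$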