Page 1: Language Modeling

Language Modeling

Anytime a linguist leaves the group the recognition rate goes up. (Fred Jelinek)

Page 2: Language Modeling

Word Prediction in Application Domains

Guessing the next word/letter:

Once upon a time there was ……. C’era una volta …. (Italian: “Once upon a time ….”)

Domains: speech modeling, augmentative communication systems (for users with disabilities), T9 text input

Page 3: Language Modeling

Word Prediction for Spelling

Italian examples (each sentence contains a real-word or grammatical error that a word-prediction model can flag):
Andranno a trovarlo alla sua cassa domani. Se andrei al mare sarei abbronzato. Vado a spiaggia.

English examples:
Hopefully, all with continue smoothly in my absence. Can they lave him my message? I need to notified the bank of this problem.

Page 4: Language Modeling

Probabilities

P(D): prior probability that the training data D will be observed.

P(h): prior probability of h; it may include any prior knowledge that h is the correct hypothesis.

P(D|h): probability of observing the data D given a world in which hypothesis h holds.

P(h|D): probability that h holds given the data D, i.e. the posterior probability of h, because it reflects our confidence that h holds after we have seen the data D.

Page 5: Language Modeling

The Bayes Rule (Theorem)

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
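A small worked example with made-up numbers: suppose $P(h) = 0.01$, $P(D \mid h) = 0.9$, and $P(D \mid \neg h) = 0.2$. Then

$$P(h \mid D) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.2 \times 0.99} = \frac{0.009}{0.207} \approx 0.043$$

so even fairly strong evidence leaves the posterior small when the prior is small.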

Page 6: Language Modeling

Maximum A Posteriori Hypothesis and Maximum Likelihood

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

Likelihood of the data D: $P(D \mid h)$

$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$

Page 7: Language Modeling

Bayes Optimal Classifier

Motivation: 3 hypotheses with posterior probabilities 0.4, 0.3 and 0.3. Thus, the first one is the MAP hypothesis. (!) BUT:

(A problem) Suppose a new instance is classified positive by the first hypothesis, but negative by the other two. So the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification. Yet the MAP hypothesis is the 0.4 one!

Solution: The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

Page 8: Language Modeling

Bayes Optimal Classifier

$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Classification: choose the class

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$
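A tiny sketch in Python of the combination rule, reusing the 0.4/0.3/0.3 posteriors from the motivating slide; the class labels and per-hypothesis predictions are illustrative assumptions.

```python
def bayes_optimal(posteriors, predictions, classes):
    """P(v|D) = sum_h P(v|h) * P(h|D); pick the class with the highest combined probability."""
    return max(classes,
               key=lambda v: sum(p_h * (1.0 if pred == v else 0.0)
                                 for p_h, pred in zip(posteriors, predictions)))

# Three hypotheses with posteriors 0.4, 0.3, 0.3; h1 predicts "+", h2 and h3 predict "-".
print(bayes_optimal([0.4, 0.3, 0.3], ["+", "-", "-"], ["+", "-"]))   # -> "-"
```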

Page 9: Language Modeling

Naïve Bayes Classifier

Bayes classifier over an attribute vector $a_1, a_2, \ldots, a_n$:

$$v = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\,P(v_j)$$

Naïve version (the attributes are assumed conditionally independent given the class):

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
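A minimal Naïve Bayes sketch in Python for text classification, assuming a toy training set; word probabilities use the add-one estimate from the next slide and the argmax is computed in log space.

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors, per-class word counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)              # word_counts[label][word]
    for tokens, label in docs:
        word_counts[label].update(tokens)
    priors = {v: c / len(docs) for v, c in label_counts.items()}
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocab

def classify_nb(tokens, priors, word_counts, vocab):
    """argmax_v P(v) * prod_i P(a_i | v), with add-one smoothed estimates, in log space."""
    best, best_score = None, float("-inf")
    for v, prior in priors.items():
        n = sum(word_counts[v].values())
        score = math.log(prior) + sum(
            math.log((word_counts[v][w] + 1) / (n + len(vocab))) for w in tokens)
        if score > best_score:
            best, best_score = v, score
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham")]
priors, counts, vocab = train_nb(docs)
print(classify_nb("cheap pills".split(), priors, counts, vocab))    # -> "spam"
```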

Page 10: Language Modeling

m-estimate of probability

$$P = \frac{n_c + m\,p}{n + m}$$

where $n$ is the number of observations, $n_c$ the number of observations of the event of interest, $p$ the prior estimate, and $m$ the equivalent sample size. With $p = \frac{1}{|Vocabulary|}$ and $m = |Vocabulary|$ this becomes the add-one estimate:

$$P = \frac{n_c + 1}{n + |Vocabulary|}$$
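A one-line helper for the estimate above; the counts and vocabulary size in the usage line are made up for illustration.

```python
def m_estimate(n_c, n, p, m):
    """General m-estimate: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# With p = 1/|Vocabulary| and m = |Vocabulary| this is the add-one estimate.
vocab_size = 10000
print(m_estimate(n_c=3, n=120, p=1.0 / vocab_size, m=vocab_size))   # (3 + 1) / (120 + 10000)
```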

Page 11: Language Modeling

Tagging

P (tag = Noun | word = saw) = ?

$$P(t \mid w) = \frac{P(w \mid t)\,P(t)}{P(w)}$$

$$P(t \mid w)\,P(w) = P(w \mid t)\,P(t)$$

$$\arg\max_t P(t \mid w) = \arg\max_t P(w \mid t)\,P(t)$$
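A toy illustration with made-up counts: if the word form "saw" occurs 120 times in a tagged corpus, 20 times as a noun and 100 times as a verb, then P(Noun | saw) = 20/120 ≈ 0.17 and P(Verb | saw) ≈ 0.83; equivalently, via the argmax form, compare P(saw | Noun)·P(Noun) with P(saw | Verb)·P(Verb).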

Page 12: Language Modeling

$$\arg\max_t P(t \mid w) = \arg\max_t P(w \mid t)\,P(t)$$

Language model: use a corpus to estimate these probabilities.

Page 13: Language Modeling

N-gram Model

The N-th word is predicted by the previous N-1 words.

What is a word? Token, word-form, lemma, m-tag, …

$$P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$

Page 14: Language Modeling

N-gram approximation models

Chain rule:

$$P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$$
$$P(w_1 w_2 w_3) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)$$
$$\ldots$$
$$P(w_1 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_1^{n-1})$$

N-gram approximation:

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

Page 15: Language Modeling

bi-gram and tri-gram models

$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$

N = 2 (bi): $\;P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

N = 3 (tri): $\;P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2} w_{k-1})$

Page 16: Language Modeling

Counting n-grams

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$$
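A minimal sketch of the maximum-likelihood bigram estimate on a toy corpus; the corpus and the <s>/</s> boundary markers are illustrative assumptions.

```python
from collections import Counter

def bigram_probs(tokens):
    """P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), estimated by counting."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1] for (w1, w2), c in bigram_counts.items()}

corpus = "<s> today is a beautiful day </s> <s> today is cold </s>".split()
probs = bigram_probs(corpus)
print(probs[("today", "is")])    # 1.0 -- "is" always follows "today" in this toy corpus
```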

Page 17: Language Modeling

The Language Model Allows Us to Calculate Sentence Probabilities

P( Today is a beautiful day . ) =
P( Today | <Start> ) * P( is | Today ) * P( a | is ) * P( beautiful | a ) * P( day | beautiful ) * P( . | day ) * P( <End> | . )

Work in log space! (A product of many small probabilities quickly underflows.)
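A sketch of the sentence probability in log space, assuming a dictionary of bigram probabilities like the one built in the counting sketch above; unseen bigrams would raise a KeyError here, which is exactly the problem the smoothing slides address.

```python
import math

def sentence_logprob(sentence, bigram_p):
    """log P(sentence) as a sum of log bigram probabilities over the padded token sequence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_p[(w1, w2)]) for w1, w2 in zip(tokens[:-1], tokens[1:]))
```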

Page 18: Language Modeling

Unseen n-grams and Smoothing

Discounting (several types)
Backoff
Deleted Interpolation

Page 19: Language Modeling

Deleted Interpolation

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n) + \lambda_4 \frac{1}{|V|}$$

with $\sum_i \lambda_i = 1$.
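A sketch of the interpolated estimate, assuming p_tri, p_bi and p_uni are dictionaries of maximum-likelihood estimates and that the lambdas have already been trained (e.g. with the EM procedure on the later slides).

```python
def interp_prob(w, w_prev1, w_prev2, p_tri, p_bi, p_uni, lambdas, vocab_size):
    """P-hat(w | w_prev2 w_prev1) = l1*P(w|w_prev2 w_prev1) + l2*P(w|w_prev1) + l3*P(w) + l4/|V|."""
    l1, l2, l3, l4 = lambdas                      # must sum to 1
    return (l1 * p_tri.get((w_prev2, w_prev1, w), 0.0)
            + l2 * p_bi.get((w_prev1, w), 0.0)
            + l3 * p_uni.get(w, 0.0)
            + l4 / vocab_size)
```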

Page 20: Language Modeling

Searching For the Best Tagging

W_1    W_2    W_3    W_4    W_5    W_6    W_7    W_8
t_1_1  t_1_2  t_1_3  t_1_4  t_1_5  t_1_6  t_1_7  t_1_8
t_2_1  t_2_2  t_2_3         t_2_5                t_2_8
t_3_1         t_3_3
t_4_1

(Each word W_i has a column of candidate tags t_j_i; the columns form a lattice.)

Use Viterbi search to find the best path through the lattice.
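A minimal Viterbi sketch over such a lattice, assuming bigram tag-transition and per-tag emission probabilities stored in dictionaries in log space; the dictionary layout and the <s> start symbol are assumptions of the sketch.

```python
def viterbi(words, tags, log_trans, log_emit):
    """Best tag path: log_trans[(t_prev, t)] = log P(t|t_prev), log_emit[(t, w)] = log P(w|t).
    Missing entries count as log(0) = -inf."""
    NEG_INF = float("-inf")
    # best[i][t] = (score of best path ending in tag t at position i, backpointer tag)
    best = [{} for _ in words]
    for t in tags:
        best[0][t] = (log_trans.get(("<s>", t), NEG_INF)
                      + log_emit.get((t, words[0]), NEG_INF), None)
    for i in range(1, len(words)):
        for t in tags:
            emit = log_emit.get((t, words[i]), NEG_INF)
            score, back = max((best[i - 1][tp][0] + log_trans.get((tp, t), NEG_INF) + emit, tp)
                              for tp in tags)
            best[i][t] = (score, back)
    last = max(tags, key=lambda t: best[-1][t][0])    # best final tag
    path = [last]
    for i in range(len(words) - 1, 0, -1):            # follow backpointers
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```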

Page 21: Language Modeling

Cross Entropy

Entropy from the point of view of a user who has misinterpreted the source distribution to be q rather than the true p. [Cross entropy is an upper bound on the entropy.]

$$H(p, q) = -\sum_i p_i \log q_i \;\ge\; -\sum_i p_i \log p_i = H(p)$$

Page 22: Language Modeling

Cross Entropy as a Quality Measure

Two models give two upper bounds on the entropy. The more accurate model is the one with the lower cross entropy.
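A small sketch comparing two candidate models against a "true" distribution (all numbers made up); the model with the lower cross entropy is the better one.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), an upper bound on H(p)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p       = [0.5, 0.25, 0.25]      # "true" source distribution
model_a = [0.5, 0.25, 0.25]      # matches p exactly
model_b = [0.4, 0.4, 0.2]
print(cross_entropy(p, model_a))  # 1.5  == H(p), the lower bound
print(cross_entropy(p, model_b))  # ~1.57, higher, so model_a is the more accurate model
```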

Page 23: Language Modeling

Imagine that y was generated with either model A or model B. Then:

Let x be a new random variable with values in {A, B}, and let

$$\lambda_A = P(x = A), \qquad \lambda_B = P(x = B)$$

$$P(y) = P(A)\,P_A(y) + P(B)\,P_B(y) = \lambda_A P_A(y) + \lambda_B P_B(y)$$

$$P(x = A \mid y) = \frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}, \qquad P(x = B \mid y) = \frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}$$

Page 24: Language Modeling

Cont. (proof of convergence of the EM algorithm)

Define

$$F(\lambda) = \sum_{y} \tilde{P}(y) \log P_\lambda(y)$$

$$A(\lambda', \lambda) = \sum_{y} \tilde{P}(y) \sum_{x \in \{A, B\}} P_\lambda(x \mid y) \log \frac{P_{\lambda'}(x, y)}{P_\lambda(x, y)}$$

Then

$$F(\lambda') - F(\lambda) = \sum_{y} \tilde{P}(y) \log \frac{P_{\lambda'}(y)}{P_\lambda(y)} \;\ge\; A(\lambda', \lambda) - A(\lambda, \lambda)$$

Page 25: Language Modeling

Expectation-Maximization (EM) Algorithm

Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (with equal variances).

The hypothesis is therefore defined by the vector of the means of the distributions.

Page 26: Language Modeling

Expectation-Maximization Algorithm

Step 1: Calculate the expected value of each distribution, assuming that the current hypothesis holds.

Step 2: Calculate a new maximum-likelihood hypothesis, assuming that the expected values are the true values. Then make the new hypothesis the current one.

Step 3: Go to Step 1.

Page 27: Language Modeling

If we find a λ' such that A(λ', λ) ≥ A(λ, λ), then F(λ') ≥ F(λ). So we need to maximize A with respect to λ', under the constraint that all lambdas sum up to one. Use Lagrange multipliers.

$$G = A(\lambda', \lambda) + \gamma\Big(1 - \sum_i \lambda'_i\Big), \qquad \frac{\partial G}{\partial \lambda'_A} = 0$$

$$\lambda'_A = \sum_{y} \tilde{P}(y)\,\frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}, \qquad \lambda'_B = \sum_{y} \tilde{P}(y)\,\frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}$$

Page 28: Language Modeling

The EM Algorithm

Define

$$\tilde{P}(y) = \frac{C(y)}{|D|}$$

$$C_A = \sum_{y} \tilde{P}(y)\,\frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}, \qquad C_B = \sum_{y} \tilde{P}(y)\,\frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}$$

Assign

$$\lambda_A = \frac{C_A}{C_A + C_B}, \qquad \lambda_B = \frac{C_B}{C_A + C_B}$$

and iterate!

(Can be analogically generalized to more lambdas.)
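A sketch in Python of the two-lambda iteration just described; p_a and p_b stand for the component models P_A and P_B, and the data, the initialization and the number of iterations are assumptions of the sketch.

```python
def em_lambdas(data, p_a, p_b, iters=20):
    """Re-estimate interpolation weights lambda_A, lambda_B on held-out data.
    p_a(y), p_b(y) return the probabilities P_A(y) and P_B(y) of item y."""
    lam_a, lam_b = 0.5, 0.5                      # any positive start summing to 1
    for _ in range(iters):
        c_a = c_b = 0.0
        for y in data:                           # expectation step: expected counts
            denom = lam_a * p_a(y) + lam_b * p_b(y)
            c_a += lam_a * p_a(y) / denom
            c_b += lam_b * p_b(y) / denom
        lam_a = c_a / (c_a + c_b)                # maximization step: re-assign the lambdas
        lam_b = c_b / (c_a + c_b)
    return lam_a, lam_b
```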

Page 29: Language Modeling

Measuring success rates

Recall = (#correct answers) / (#total possible answers)

Precision = (#correct answers) / (#answers given)

Fallout = (#incorrect answers) / (#of spurious facts in the text)

F-measure = [(b^2 + 1) * P * R] / (b^2 * P + R)

If b > 1, recall is favored; if b < 1, precision is favored.
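A small helper for the formula above (b = 1 gives the balanced F1).

```python
def f_measure(precision, recall, b=1.0):
    """F = ((b^2 + 1) * P * R) / (b^2 * P + R); b > 1 weights recall more heavily."""
    return ((b ** 2 + 1) * precision * recall) / (b ** 2 * precision + recall)

print(f_measure(precision=0.8, recall=0.6))      # ~0.686
```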

Page 30: Language Modeling

Chunking as Tagging

Even certain parsing problems can be solved via tagging

E.g.: ((A B) C ((D F) G)) BIA tags: A/B B/A C/I D/B F/A G/A

