
CPSC 503 Computational Linguistics

Lecture 4
Giuseppe Carenini

Knowledge-Formalisms Map
(including probabilistic formalisms)

[Diagram relating linguistic levels to formalisms]
• Levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)
• Formalisms: State Machines and prob. versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and prob. versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners

Today (Sep 18)

• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models

Background knowledge

• Morphological analysis
• P(x) (prob. distribution)
• Joint p(x,y)
• Conditional p(x|y)
• Bayes rule
• Chain rule

Spelling: the problem(s)

Non-word, isolated
  – Detection: is the word in the lexicon (w ∈ V)?
  – Correction: find the most likely correct word
    funn -> funny, fun, ...

Non-word, context
  – Correction: ...in this context
    – trust funn
    – a lot of funn

Real-word, isolated
  – Detection: ?!

Real-word, context
  – Detection: is it an impossible (or very unlikely) word in this context?
    .. a wild dig.
  – Correction: find the most likely substitution word in this context

Spelling: Data

• Error rates: 0.05% - 3% - 38%
• 80% of misspelled words contain a single error:
  – insertion (toy -> tony)
  – deletion (tuna -> tua)
  – substitution (tone -> tony)
  – transposition (length -> legnth)
• Types of errors
  – Typographic (more common; user knows the correct spelling... the -> rhe)
  – Cognitive (user doesn't know... piece -> peace)

Noisy Channel

• An influential metaphor in language processing is the noisy channel model
• Special case of Bayesian classification

[Diagram: signal -> noisy channel -> noisy signal]

Bayes and the Noisy Channel: Spelling (non-word, isolated)

Goal: find the most likely word given some observed (misspelled) word

\hat{w} = \arg\max_{w \in V} P(w \mid O)

Problem

• P(w|O) is hard/impossible to get (why?)

  e.g., P(wine|winw) = ?

Solution

1. Apply Bayes Rule
2. Simplify (P(O) is the same for every candidate w, so it can be dropped)

\hat{w} = \arg\max_{w \in V} P(w \mid O)
        = \arg\max_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)}
        = \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}} \; \underbrace{P(w)}_{\text{prior}}
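A minimal sketch of this decision rule, assuming some estimate of the likelihood P(O|w) (the channel model) and the prior P(w) is already available; the names `candidates`, `likelihood` and `prior` are illustrative, not from the lecture:

```python
def correct(observed, candidates, likelihood, prior):
    """Return argmax over candidate words w of P(O|w) * P(w).

    P(O) is identical for every candidate, so it is dropped.
    `likelihood(observed, w)` ~ P(O|w) and `prior(w)` ~ P(w) are assumed callables.
    """
    return max(candidates, key=lambda w: likelihood(observed, w) * prior(w))
```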

Estimate of prior P(w) (Easy)

Maximum likelihood estimate:
P(w) = \frac{C(w)}{N}

With smoothing (add 0.5 to every count):
P(w) = \frac{C(w) + 0.5}{N + 0.5\,|V|}

Always verify that the estimates form a probability distribution:
\sum_{w \in V} P(w) = \sum_{w \in V} \frac{C(w) + 0.5}{N + 0.5\,|V|} = 1
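As a rough illustration (not from the slides), the smoothed prior could be computed like this; `corpus_tokens` and `vocab` are assumed inputs:

```python
from collections import Counter

def make_prior(corpus_tokens, vocab):
    """Add-0.5 smoothed unigram prior: P(w) = (C(w) + 0.5) / (N + 0.5 * |V|)."""
    counts = Counter(corpus_tokens)
    denom = len(corpus_tokens) + 0.5 * len(vocab)
    return lambda w: (counts[w] + 0.5) / denom

# "Always verify": if every corpus token is in vocab, the probabilities sum to 1:
# prior = make_prior(tokens, vocab); assert abs(sum(map(prior, vocab)) - 1.0) < 1e-9
```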

Estimate of P(O|w) is feasible (Kernighan et al. '90)

For one-error misspellings:
• Estimate the probability of each possible error type
  e.g., insert a after c, substitute f with h
• P(O|w) is equal to the probability of the error that generated O from w
  e.g., P(cbat|cat) = P(insert b after c)

Estimate P(error type)

e.g., substitution: keep a count matrix sub[x,y]

[Confusion matrix: rows = intended character (a, b, c, ...), columns = typed character;
cell sub[a,b] = # times b was incorrectly used for a]

From a large corpus of errors, compute the confusion matrices, then:

P(\text{sub } b \text{ for } a) = \frac{\text{sub}[a,b]}{\text{count}(a)}

where count(a) = # of occurrences of a in the corpus.
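A small sketch of the substitution case, assuming the confusion counts have already been collected; `sub_counts` and `char_counts` are hypothetical names:

```python
def p_substitution(a, b, sub_counts, char_counts):
    """P(b was typed when a was intended) = sub[a, b] / count(a).

    sub_counts[(a, b)]: # times b was incorrectly used for a (from an error corpus)
    char_counts[a]: # occurrences of a in the corpus
    """
    return sub_counts.get((a, b), 0) / char_counts[a]
```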

Corpus: Example

... On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope ...

Final Method: single error

(1) Given O, collect all the w_i that could have generated O by one error.
    E.g., O = acress => w_1 = actress (t deletion), w_2 = across (sub o with e), ...
    How to do (1): generate all the strings that could have generated O by one error (how? see the sketch below), and keep only those that are words.

(2) For all the w_i compute:
    P(O \mid w_i)\, P(w_i)
    (the probability of the error generating O from w_i, times the word prior)

(3) Sort and display the top n to the user.
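One way to do step (1), in the spirit of the Norvig spell-corrector linked later in these slides (a sketch; `lexicon` is an assumed set of valid words):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + inserts)

def candidates(observed, lexicon):
    """Keep only the one-edit strings that are actual words."""
    return {w for w in edits1(observed) if w in lexicon}
```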

Example: O = acress

Context: "...stellar and versatile acress whose..."

[Table, to be filled in: for each candidate w_i, the columns are w_i, C(w_i), P(w_i), P(O|w_i), and P(O|w_i) P(w_i); counts from the 1988 AP newswire corpus, 44 million words]

Evaluation against a "correct" system

[Table residue: result categories 0, 1, 2, other]

Corpora: issues to remember

• Zero counts in the corpus: just because an event didn't happen in the corpus doesn't mean it won't happen
  e.g., "cress" does not really have zero probability
• Getting a corpus that matches the actual use
  e.g., kids don't misspell the same way that adults do

Multiple Spelling Errors

• (BEFORE) Given O, collect all the w_i that could have generated O by one error...
• (NOW) Given O, collect all the w_i that could have generated O by 1..k errors
  How? (for two errors): collect all the strings that could have generated O by one error, then collect all the w_i that could have generated one of those strings by one error. Etc.

Final Method: multiple errors

(1) Given O, for each w_i that can be generated from O by a sequence of edit operations EdOp_i, save EdOp_i.

(2) For all the w_i compute:
    P(O \mid w_i)\, P(w_i)
    where P(O \mid w_i) = \prod_{x \in EdOp_i} P(x)
    (the probability of the errors generating O from w_i, times the word prior; see the sketch below)

(3) Sort and display the top n to the user.
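A sketch of the multi-error channel model in step (2): P(O|w_i) is approximated as the product of the individual edit-operation probabilities. `p_edit` is a hypothetical function, e.g. a lookup into the confusion matrices:

```python
from math import prod  # Python 3.8+

def p_observation(ed_ops, p_edit):
    """P(O | w_i) ~ product of P(x) over the edit operations x in EdOp_i."""
    return prod(p_edit(x) for x in ed_ops)

# e.g. p_observation([("del", "a", "t"), ("sub", "i", "y")], p_edit)
```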

Spelling: the problem(s)

Non-word, isolated
  – Detection: is the word in the lexicon (w ∈ V)?
  – Correction: find the most likely correct word
    funn -> funny, funnel, ...

Non-word, context
  – Correction: ...in this context
    – trust funn
    – a lot of funn

Real-word, isolated
  – Detection: ?!

Real-word, context
  – Detection: is it an impossible (or very unlikely) word in this context?
    .. a wild dig.
  – Correction: find the most likely substitution word in this context

Real-Word Spelling Errors

• Collect a set of common confusion sets: C = {C_1 .. C_n}
  e.g., {(their/they're/there), (to/too/two), (weather/whether), (lave/have), ...}
• Whenever some c' ∈ C_i is encountered:
  – compute the probability of the sentence in which it appears
  – substitute each c ∈ C_i (c ≠ c') and compute the probability of the resulting sentence
  – choose the one with the higher probability
  (a rough sketch follows below)
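A rough sketch of this confusion-set procedure; `sentence_prob` stands in for a language model that scores a whole token sequence (e.g. the n-gram models introduced next):

```python
CONFUSION_SETS = [
    {"their", "they're", "there"},
    {"to", "too", "two"},
    {"weather", "whether"},
    {"lave", "have"},
]

def correct_real_words(tokens, sentence_prob):
    """Greedily replace each confusable token by the set member that
    maximizes the probability of the whole sentence."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        for cset in CONFUSION_SETS:
            if tok in cset:
                out[i] = max(cset, key=lambda c: sentence_prob(out[:i] + [c] + out[i + 1:]))
    return out
```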

Want to play with spelling correction? A minimal noisy channel model implementation:

• (Python) http://www.norvig.com/spell-correct.html

• By the way, Peter Norvig is Director of Research at Google Inc.

Today (Sep 18)

• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models

Minimum Edit Distance

• Def.: the minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.

Example:  gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u by a)

Minimum Edit Distance Algorithm

• Dynamic programming (a very common technique in NLP)
• High-level description:
  – fills in a matrix of partial comparisons
  – the value of a cell is computed as a "simple" function of surrounding cells
  – output: not only the number of edit operations but also the sequence of operations

Minimum Edit Distance Algorithm: Details

ed[i,j] = min distance between the first i chars of the source and the first j chars of the target

Costs: del-cost = 1, ins-cost = 1, sub-cost = 2 (0 if the two characters are equal)

[Diagram: matrix with the source along one axis and the target along the other; cell (i,j) is updated from its three neighbours]

  ed[i-1, j]   + 1          (deletion)
  ed[i, j-1]   + 1          (insertion)
  ed[i-1, j-1] + (2 or 0)   (substitution, or equal)

  ed[i,j] = MIN of the three
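A compact implementation of this recurrence with the costs on the slide (insert = delete = 1, substitute = 2, match = 0); this sketch returns only the distance, not the operation sequence:

```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        ed[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            ed[i][j] = min(ed[i - 1][j] + 1,        # deletion
                           ed[i][j - 1] + 1,        # insertion
                           ed[i - 1][j - 1] + sub)  # substitution or match
    return ed[n][m]

# e.g. min_edit_distance("gumbo", "gam") == 4  (delete o, delete b, substitute u by a)
```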



Min edit distance and alignment

See demo

Today (Sep 18)

• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models


Key Transition

• Up to this point we’ve mostly been discussing words in isolation

• Now we’re switching to sequences of words

• And we’re going to worry about assigning probabilities to sequences of words

Knowledge-Formalisms Map
(including probabilistic formalisms)

[Diagram relating linguistic levels to formalisms]
• Levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)
• Formalisms: State Machines and prob. versions (Finite State Automata, Finite State Transducers, Markov Models); Rule systems and prob. versions (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners

Only Spelling?

A. Assign a probability to a sentence
   • Part-of-speech tagging
   • Word-sense disambiguation
   • Probabilistic parsing
B. Predict the next word
   • Speech recognition
   • Handwriting recognition
   • Augmentative communication for the disabled

Both A and B need:  P(w_1, ..., w_n) = ?   Impossible to estimate

Decompose: apply chain rule

Chain Rule:
P(A_1, A_2, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})

Applied to a word sequence from position 1 to n (abbreviated w_1^n):
P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1})
         = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})
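A literal rendering of the chain rule above as code (a sketch; `cond_prob(w, prefix)` is an assumed estimator of P(w | prefix), which the following slides show is impractical to obtain for long prefixes):

```python
def chain_rule_prob(words, cond_prob):
    """P(w_1 .. w_n) = product over k of P(w_k | w_1 .. w_{k-1})."""
    p = 1.0
    for k, w in enumerate(words):
        p *= cond_prob(w, words[:k])  # prefix is empty for the first word
    return p
```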

Example

• Sequence: "The big red dog barks"
• P(The big red dog barks) = P(The) * P(big|the) * P(red|the big) * P(dog|the big red) * P(barks|the big red dog)

Note: P(The) is better expressed as P(The | <beginning of sentence>), written P(The | <S>)

Not a satisfying solution

Even for small n (e.g., 6) we would need a far too large corpus to estimate:
P(w_6 \mid w_1^5)

Markov Assumption: the entire prefix history isn't necessary.
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})

N = 1 (unigram):  P(w_n \mid w_1^{n-1}) \approx P(w_n)
N = 2 (bigram):   P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})
N = 3 (trigram):  P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}, w_{n-2})

Prob of a sentence: N-Grams

Chain rule:  P(w_1^n) = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})

unigram:  P(w_1^n) \approx P(w_1) \prod_{k=2}^{n} P(w_k)
bigram:   P(w_1^n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1})
trigram:  P(w_1^n) \approx P(w_1)\, P(w_2 \mid w_1) \prod_{k=3}^{n} P(w_k \mid w_{k-1}, w_{k-2})

Bigram

<s> The big red dog barks

P(The big red dog barks) = P(The|<S>) * P(big|the) * P(red|big) * P(dog|red) * P(barks|dog)

In general:
P(w_1^n) \approx P(w_1 \mid \langle S \rangle) \prod_{k=2}^{n} P(w_k \mid w_{k-1})

(a small bigram sketch follows below)

Trigram?
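A small sketch of the bigram computation above; `bigram_prob(cur, prev)` is an assumed estimate of P(cur | prev), e.g. built from counts as on the next slide:

```python
def bigram_sentence_prob(tokens, bigram_prob):
    """P(w_1^n) ~ P(w_1|<s>) * product of P(w_k | w_{k-1})."""
    padded = ["<s>"] + list(tokens)
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_prob(cur, prev)
    return p

# bigram_sentence_prob(["the", "big", "red", "dog", "barks"], bigram_prob)
```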

Estimates for N-Grams

bigram:
P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})}
                    = \frac{C(w_{n-1}, w_n) / N_{pairs}}{C(w_{n-1}) / N_{words}}
                    \approx \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}

...in general:
P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}
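As a sketch, the bigram estimate above can be computed directly from corpus counts (no smoothing; `tokens` is an assumed list of corpus tokens, ideally including sentence-boundary markers):

```python
from collections import Counter

def train_bigram(tokens):
    """MLE bigram model: P(w_n | w_{n-1}) ~ C(w_{n-1}, w_n) / C(w_{n-1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def bigram_prob(cur, prev):
        return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

    return bigram_prob
```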

Next Time

• Finish N-Grams (Chap. 4)
• Model Evaluation (Sec. 4.4)
• No smoothing (Sec. 4.5-4.7)
• Start Hidden Markov Models

