Page 1:

CPSC 503: Computational Linguistics

Lecture 4
Giuseppe Carenini

Page 2:

Knowledge-Formalisms Map (including probabilistic formalisms)

[Map relating formalisms to the linguistic levels they model]
• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Linguistic levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)

Page 3:

Today Sep 21
• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models

Page 4:

Background knowledge

• Morphological analysis

• P(x) (prob. distribution)

• joint P(x,y)

• conditional P(x|y)

• Bayes rule

• Chain rule

Page 5:

Spelling: the problem(s)

• Non-word, isolated
  – Detection: is w in the lexicon V?
  – Correction: find the most likely correct word (funn -> funny, fun, ...)
• Non-word, in context
  – Correction: find the most likely correct word in this context (“trust funn” vs. “a lot of funn”)
• Real-word, isolated
  – Detection: ?!
• Real-word, in context
  – Detection: is it an impossible (or very unlikely) word in this context? (e.g., “.. a wild dig.”)
  – Correction: find the most likely substitution word in this context

Page 6:

Spelling: Data
• Error rates: .05% - 3% - 38% (varies with the application)
• 80% of misspelled words contain a single error:
  – insertion (toy -> tony)
  – deletion (tuna -> tua)
  – substitution (tone -> tony)
  – transposition (length -> legnth)
• Types of errors:
  – Typographic (more common; the user knows the correct spelling, e.g., the -> rhe)
  – Cognitive (the user doesn’t know, e.g., piece -> peace)

Page 7:

Noisy Channel
• An influential metaphor in language processing is the noisy channel model
• Special case of Bayesian classification

[Diagram: signal -> noisy channel -> noisy signal]

Page 8:

$\hat{w} = \operatorname{argmax}_{w \in V} P(w \mid O)$

Goal: find the most likely word given some observed (misspelled) word

Bayes and the Noisy Channel: Spelling (non-word, isolated)

Page 9:

Problem

• P(w|O) is hard/impossible to get (why?)

e.g., P(wine | winw) = ?

Page 10:

Solution:
1. Apply Bayes Rule
2. Simplify

$\hat{w} = \operatorname{argmax}_{w \in V} P(w \mid O)$

$\hat{w} = \operatorname{argmax}_{w \in V} \dfrac{P(O \mid w)\,P(w)}{P(O)}$   (Bayes rule)

$\hat{w} = \operatorname{argmax}_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}}\;\underbrace{P(w)}_{\text{prior}}$   (P(O) is the same for every w)
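A minimal Python sketch of this argmax, assuming we already have an error model for the likelihood P(O|w) and a prior P(w); the function names and the candidate set are illustrative, not part of the slides:

    # Pick w_hat = argmax over candidate words of P(O|w) * P(w)
    def best_correction(observed, candidates, error_model, prior):
        """error_model(observed, w) ~ P(O|w); prior(w) ~ P(w)."""
        return max(candidates, key=lambda w: error_model(observed, w) * prior(w))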

Page 11:

Estimate of prior P(w) (Easy)

Maximum likelihood:  $P(w) = \dfrac{C(w)}{N}$

With smoothing:  $P(w) = \dfrac{C(w) + 0.5}{N + 0.5\,|V|}$

Always verify that it is still a probability distribution:
$\sum_{w \in V} P(w) = \sum_{w \in V} \dfrac{C(w) + 0.5}{N + 0.5\,|V|} = \dfrac{\left(\sum_{w \in V} C(w)\right) + 0.5\,|V|}{N + 0.5\,|V|} = 1$
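A small sketch of the smoothed prior; the add-0.5 constant follows the slide, while the function name and data structures are assumptions:

    from collections import Counter

    def smoothed_prior(corpus_tokens, vocab):
        """P(w) = (C(w) + 0.5) / (N + 0.5 * |V|) for every w in the lexicon."""
        counts = Counter(corpus_tokens)
        N = len(corpus_tokens)
        denom = N + 0.5 * len(vocab)
        return {w: (counts[w] + 0.5) / denom for w in vocab}

    # Always verify: the values sum to 1 (when every corpus token is in vocab)
    # assert abs(sum(smoothed_prior(tokens, V).values()) - 1.0) < 1e-9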

Page 12:

Estimate of P(O|w) is feasible (Kernighan et al. ’90)

For one-error misspellings:
• Estimate the probability of each possible error type, e.g., insert a after c, substitute f with h
• P(O|w) is equal to the probability of the error that generated O from w,
  e.g., P(cbat | cat) = P(insert b after c)

Page 13:

Estimate P(error type)

From a large corpus, compute confusion matrices that count each error type (e.g., for substitution, sub[x,y]).

[Confusion matrix fragment: rows and columns indexed by letters a, b, c, ...; the cell sub[a,b] holds the number of times b was incorrectly used for a]

$P(b \text{ substitutes for } a) = \dfrac{sub[a,b]}{count(a)}$,   where count(a) = # of a in the corpus
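A sketch of reading the substitution probability off such counts; the dictionaries stand in for the confusion matrix and the character counts and are hypothetical:

    def p_substitution(a, b, sub, char_count):
        """P(b was typed for intended a) = sub[a][b] / count(a).
        sub[a][b]: # of times b was incorrectly used for a
        char_count[a]: # of occurrences of a in the corpus"""
        return sub[a][b] / char_count[a]

Analogous count matrices (for deletion, insertion, and transposition) would give the probabilities of the other error types.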

Page 14:

Corpus: Example

… On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope……..

Page 15:

Final Method: single error

(1) Given O, collect all the wi that could have generated O by one error.
    E.g., O = acress => w1 = actress (t deletion), w2 = across (sub o with e), ...
    How to do (1): generate all the strings that could have generated O by one error (how?), and keep only the ones that are words (see the sketch below).

(2) For all the wi compute: $P(O \mid w_i)\,P(w_i)$
    (the probability of the error generating O from wi, times the word prior)

(3) Sort and display the top n to the user
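A Norvig-style sketch of step (1): enumerate every string one edit away from O and keep only those in the lexicon V (here a Python set). The alphabet and the function name are assumptions:

    def one_edit_words(O, V, alphabet="abcdefghijklmnopqrstuvwxyz"):
        splits      = [(O[:i], O[i:]) for i in range(len(O) + 1)]
        deletes     = [L + R[1:] for L, R in splits if R]
        transposes  = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
        inserts     = [L + c + R for L, R in splits for c in alphabet]
        return set(deletes + transposes + substitutes + inserts) & V

    # one_edit_words("acress", V) should contain "actress", "across", "acres", ...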

Page 16:

Example: collect all the wi that could have generated “acress” by one error.

a c r e s s

[Diagram: for each position, count the possible deletions, insertions, alterations (substitutions), and transpositions]

Page 17:

Example: O = acress

Context: “...stellar and versatile acress whose...”

[Table to fill in, one row per candidate: $w_i$ | $C(w_i)$ | $P(w_i)$ | $P(O \mid w_i)$ | $P(O \mid w_i)\,P(w_i)$]

Counts from the 1988 AP newswire corpus (44 million words)

Page 18:

Evaluation against a “correct” system

[Table with columns: 0, 1, 2, other]

Page 19:

Corpora: issues to remember

• Zero counts in the corpus: just because an event didn’t happen in the corpus doesn’t mean it won’t happen. E.g., cress does not really have zero probability.
• Getting a corpus that matches the actual use. E.g., kids don’t misspell the same way that adults do.

Page 20:

Multiple Spelling Errors

• (BEFORE) Given O, collect all the wi that could have generated O by one error...
• (NOW) Given O, collect all the wi that could have generated O by 1..k errors.
  How? (for two errors): collect all the strings that could have generated O by one error, then collect all the wi that could have generated one of those strings by one error. Etc. (see the sketch below)
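A sketch of the two-error case, assuming an edits1(s) helper that returns all strings one edit away from s (the generator from the single-error sketch, without the final lexicon filter):

    def up_to_two_edit_words(O, V, edits1):
        """Words reachable from O by one or two edits."""
        one_away = edits1(O)
        two_away = {w for s in one_away for w in edits1(s)}
        return (one_away | two_away) & V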

Page 21:

Final Method: multiple errors

(1) Given O, for each wi that can be generated from O by a sequence of edit operations EdOpi, save EdOpi.

(2) For all the wi compute: $P(O \mid w_i)\,P(w_i)$  (probability of the errors times the word prior), where the probability of the errors generating O from wi is
    $P(O \mid w_i) = \prod_{x \in EdOp_i} P(x)$

(3) Sort and display the top n to the user

Page 22:

Spelling: the problem(s)

• Non-word, isolated
  – Detection: is w in the lexicon V?
  – Correction: find the most likely correct word (funn -> funny, funnel, ...)
• Non-word, in context
  – Correction: find the most likely correct word in this context (“trust funn” vs. “a lot of funn”)
• Real-word, isolated
  – Detection: ?!
• Real-word, in context
  – Detection: is it an impossible (or very unlikely) word in this context? (e.g., “.. a wild dig.”)
  – Correction: find the most likely substitution word in this context

Page 23:

Real-Word Spelling Errors

• Collect a set of common confusion sets: C = {C1 .. Cn}
  e.g., {(their/they’re/there), (to/too/two), (weather/whether), (lave/have), ...}
• Whenever some c’ ∈ Ci is encountered:
  – compute the probability of the sentence in which it appears
  – substitute each c ∈ Ci (c ≠ c’) and compute the probability of the resulting sentence
  – choose the one with the highest probability (a sketch follows below)
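A sketch of that procedure, assuming a sentence_prob() language model (e.g., the n-gram models introduced later in the lecture) is available; the confusion sets and names are illustrative:

    CONFUSION_SETS = [{"their", "they're", "there"},
                      {"to", "too", "two"},
                      {"weather", "whether"},
                      {"lave", "have"}]

    def correct_real_words(words, sentence_prob):
        """For every word that belongs to a confusion set, keep the member
        that yields the most probable sentence."""
        words = list(words)
        for i, w in enumerate(words):
            for conf_set in CONFUSION_SETS:
                if w in conf_set:
                    words[i] = max(conf_set,
                                   key=lambda c: sentence_prob(words[:i] + [c] + words[i + 1:]))
        return words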

Page 24:

Want to play with spelling correction? A minimal noisy-channel model implementation:

• (Python) http://www.norvig.com/spell-correct.html


• By the way, Peter Norvig is Director of Research at Google Inc.

• (He will be visiting our dept. on Thurs!)

Page 25:

Today Sep 21
• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models

Page 26:

Minimum Edit Distance
• Def. The minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.

Example: gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u by a)

Page 27:

Minimum Edit Distance Algorithm

• Dynamic programming (very common technique in NLP)

• High-level description:
  – Fills in a matrix of partial comparisons
  – Value of a cell computed as a “simple” function of surrounding cells
  – Output: not only the number of edit operations but also the sequence of operations

Page 28:

Minimum Edit Distance Algorithm Details

ed[i,j] = min distance between the first i chars of the source and the first j chars of the target

Costs: del-cost = 1, ins-cost = 1, sub-cost = 2 (0 if the two characters are equal)

[Diagram: the matrix has the source along one axis (index i) and the target along the other (index j); each cell ed[i,j] is updated from its three neighbours:]

ed[i,j] = MIN( ed[i-1, j] + 1 (del),  ed[i, j-1] + 1 (ins),  ed[i-1, j-1] + (2 or 0) (sub or equal) )
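A sketch of the DP with the costs above (del = 1, ins = 1, sub = 2, or 0 when the characters already match); it returns only the distance, but keeping back-pointers per cell would also recover the sequence of operations:

    def min_edit_distance(source, target):
        n, m = len(source), len(target)
        ed = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            ed[i][0] = i                              # delete all of source[:i]
        for j in range(1, m + 1):
            ed[0][j] = j                              # insert all of target[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 2
                ed[i][j] = min(ed[i - 1][j] + 1,        # deletion
                               ed[i][j - 1] + 1,        # insertion
                               ed[i - 1][j - 1] + sub)  # substitution / equal
        return ed[n][m]

    # min_edit_distance("gumbo", "gam") -> 4  (delete o, delete b, substitute u by a)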

Page 30:

Min edit distance and alignment

See demo

Page 31:

Today Sep 21
• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Min Edit Distance ?
• Start n-gram models: Language Models

Page 32:

Key Transition

• Up to this point we’ve mostly been discussing words in isolation

• Now we’re switching to sequences of words

• And we’re going to worry about assigning probabilities to sequences of words

Page 33:

Knowledge-Formalisms Map (including probabilistic formalisms)

[Map relating formalisms to the linguistic levels they model]
• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models
• Rule systems (and prob. versions), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Linguistic levels: Morphology, Syntax, Semantics, Pragmatics (Discourse and Dialogue)

Page 34:

Only Spelling?

A. Assign a probability to a sentence
  • Part-of-speech tagging
  • Word-sense disambiguation
  • Probabilistic parsing

B. Predict the next word
  • Speech recognition
  • Handwriting recognition
  • Augmentative communication for the disabled

$P(w_1, \ldots, w_n) = ?$   Impossible to estimate directly

Page 35:

Decompose: apply chain rule

Chain Rule:  $P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$

Applied to a word sequence from position 1 to n (written $w_1^n$):

$P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, \cdots\, P(w_n \mid w_1^{n-1}) = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

Page 36:

Example
• Sequence: “The big red dog barks”
• P(The big red dog barks) = P(The) * P(big | the) * P(red | the big) * P(dog | the big red) * P(barks | the big red dog)

Note: P(The) is better expressed as P(The | <Beginning of sentence>), written as P(The | <S>)

Page 37:

Not a satisfying solution

Even for small n (e.g., 6) we would need a far too large corpus to estimate:
$P(w_6 \mid w_1^{5})$

Markov Assumption: the entire prefix history isn’t necessary.
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

N = 1 (unigram):  $P(w_n \mid w_1^{n-1}) \approx P(w_n)$
N = 2 (bigram):   $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
N = 3 (trigram):  $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}, w_{n-2})$

Page 38:

Prob of a sentence: N-Grams

Chain rule:  $P(w_1, \ldots, w_n) = P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

unigram:  $P(w_1, \ldots, w_n) \approx \prod_{k=1}^{n} P(w_k)$

bigram:   $P(w_1, \ldots, w_n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1})$

trigram:  $P(w_1, \ldots, w_n) \approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1}, w_{k-2})$

Page 39:

Bigram:  <s> The big red dog barks

P(The big red dog barks) = P(The | <S>) * P(big | the) * P(red | big) * P(dog | red) * P(barks | dog)

$P(w_1, \ldots, w_n) \approx P(w_1 \mid \langle S \rangle) \prod_{k=2}^{n} P(w_k \mid w_{k-1})$

Trigram?

Page 40:

Estimates for N-Grams

bigram:
$P(w_n \mid w_{n-1}) = \dfrac{P(w_{n-1}, w_n)}{P(w_{n-1})} = \dfrac{C(w_{n-1}, w_n)/N_{pairs}}{C(w_{n-1})/N_{words}} \approx \dfrac{C(w_{n-1}, w_n)}{C(w_{n-1})}$

..in general:
$P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
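A toy sketch of the bigram estimate above (MLE, no smoothing); the corpus and helper names are illustrative:

    from collections import Counter

    def train_bigram_mle(sentences):
        """Returns p(w, prev) ~ C(prev, w) / C(prev)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.lower().split()
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

    # p = train_bigram_mle(["the big red dog barks", "the dog barks"])
    # p("the", "<s>") -> 1.0    p("big", "the") -> 0.5    p("dog", "red") -> 1.0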

Page 41:

Next Time

• N-Grams (Chp. 4)
• Model Evaluation (Sec. 4.4)
• No smoothing (4.5-4.7)

