CPSC 503 Computational Linguistics
Lecture 4 (Giuseppe Carenini)

Transcript
Page 1

CPSC 503 Computational Linguistics

Lecture 4, Giuseppe Carenini

Page 2

Knowledge-Formalisms Map (including probabilistic formalisms)

• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models -> Morphology
• Rule systems (and prob. versions): e.g., (Prob.) Context-Free Grammars -> Syntax
• Logical formalisms (First-Order Logics) -> Semantics
• AI planners -> Pragmatics, Discourse and Dialogue

Page 3

Today Sep 17

• Dealing with spelling errors
  – Noisy channel model
  – Bayes rule applied to the noisy channel model (single and multiple spelling errors)
• Start n-gram models: Language Models (LM)

Page 4

Background knowledge

• Morphological analysis
• P(x) (prob. distribution)
• joint P(x,y)
• conditional P(x|y)
• Bayes rule
• Chain rule

Page 5

Spelling: the problem(s)

Detection
• Non-word, isolated: is w in V?
• Real-word, isolated: ?!
• Real-word, context: is it an impossible (or very unlikely) word in this context? (e.g., ".. a wild big.")

Correction
• Non-word, isolated: find the most likely correct word (funn -> funny, fun, ...)
• Non-word, context: ...and the most likely correct word in this context (– trust funn – a lot of funn)
• Real-word, context: find the most likely substitution word in this context

Page 6

Spelling: Data

• 0.05% - 3% - 38% (error rates vary widely across applications)
• 80% of misspelled words contain a single error:
  – insertion (toy -> tony)
  – deletion (tuna -> tua)
  – substitution (tone -> tony)
  – transposition (length -> legnth)
• Types of errors:
  – Typographic (more common; the user knows the correct spelling… the -> rhe)
  – Cognitive (the user doesn't know… piece -> peace)

Page 7

Noisy Channel

• An influential metaphor in language processing is the noisy channel model
• Special case of Bayesian classification

signal -> [noisy channel] -> noisy signal

Page 8

Bayes and the Noisy Channel: Spelling (Non-word, isolated)

Goal: find the most likely word given some observed (misspelled) word:

$\hat{w} = \arg\max_{w \in V} P(w \mid O)$

Page 9

Problem

• P(w|O) is hard/impossible to get (why?)

e.g., P(wine | winw) = ?

Page 10

Solution

1. Apply Bayes Rule
2. Simplify

$\hat{w} = \arg\max_{w \in V} P(w \mid O) = \arg\max_{w \in V} \frac{P(O \mid w)\,P(w)}{P(O)} = \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}}\;\underbrace{P(w)}_{\text{prior}}$

Page 11

Estimate of prior P(w) (Easy)

$P(w) = \frac{C(w)}{N}$

With smoothing:

$P(w) = \frac{C(w) + 0.5}{N + 0.5\,|V|}$

Always verify that it is a distribution:

$\sum_{w \in V} P(w) = \sum_{w \in V} \frac{C(w) + 0.5}{N + 0.5\,|V|} = 1$
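A minimal sketch of this smoothed estimate (my illustration, not the lecture's code); any word-frequency table works as input:

```python
from collections import Counter

def smoothed_prior(counts):
    """P(w) = (C(w) + 0.5) / (N + 0.5*|V|), the smoothed estimate above."""
    N = sum(counts.values())   # corpus size N
    V = len(counts)            # vocabulary size |V|
    return {w: (c + 0.5) / (N + 0.5 * V) for w, c in counts.items()}

# toy corpus; real estimates need a large corpus
counts = Counter("the cat sat on the mat the end".split())
P = smoothed_prior(counts)
assert abs(sum(P.values()) - 1.0) < 1e-9  # "Always verify" it sums to 1
```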

Page 12

Estimate of P(O|w) is feasible (Kernighan et al. '90)

For one-error misspellings:
• Estimate the probability of each possible error type, e.g., insert a after c, substitute f with h
• P(O|w) is equal to the probability of the error that generated O from w, e.g., P(cbat | cat) = P(insert b after c)

Page 13

Estimate P(error type)

From a large corpus, compute confusion matrices, one per error type (e.g., substitution: sub[x,y]), where cell sub[a,b] counts the number of times b was incorrectly used for a.

$P(b \text{ subs for } a) = \frac{\text{sub}[a,b]}{\text{count}(a)}$

where count(a) = # of a in the corpus.
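A sketch of how this estimate could be computed, assuming the corpus has been annotated with (intended, typed) substitution pairs; the helper names are illustrative, not from Kernighan et al.:

```python
from collections import Counter, defaultdict

sub = defaultdict(Counter)   # sub[a][b] = # of times b was incorrectly used for a
count = Counter()            # count[a] = # of occurrences of a in the corpus

def observe(corpus_text, errors):
    """errors: list of (intended_char, typed_char) substitution pairs."""
    count.update(corpus_text)
    for a, b in errors:
        sub[a][b] += 1

def p_sub(a, b):
    """P(b subs for a) = sub[a,b] / count(a)."""
    return sub[a][b] / count[a] if count[a] else 0.0

observe("on 16 january he said that he may stay" * 100, [("y", "i"), ("y", "i")])
print(p_sub("y", "i"))
```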

Page 14

Corpus: Example

… On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope…

Page 15

Final Method: single error

(1) Given O, collect all the wi that could have generated O by one error. E.g., O = acress => w1 = actress (t deletion), w2 = across (sub o with e), ...

(2) For all the wi compute: $P(O \mid w_i)\,P(w_i)$ (the probability of the error generating O from wi, times the word prior).

(3) Sort and display top-n to user.
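A compact sketch of steps (1)-(3), in the spirit of the Norvig implementation linked later in this lecture; V (vocabulary), P (word prior), and p_error (channel model from the confusion matrices) are assumed inputs:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (deletion, insertion, substitution, transposition) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    inserts    = [L + c + R for L, R in splits for c in alphabet]
    subs       = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    return set(deletes + inserts + subs + transposes)

def correct(O, V, P, p_error, n=5):
    candidates = [w for w in edits1(O) if w in V]             # (1) one error away from O
    scored = [(p_error(O, w) * P[w], w) for w in candidates]  # (2) P(O|w) * P(w)
    return sorted(scored, reverse=True)[:n]                   # (3) top-n for the user
```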

Page 16

Example: O = acress ("…stellar and versatile acress whose…")

Table (to be filled in): for each candidate, $w_i$, $C(w_i)$, $P(w_i)$, $P(O \mid w_i)$, and $P(O \mid w_i)\,P(w_i)$, with counts from the 1988 AP newswire corpus (44 million words).

Page 17

Evaluation: "correct" vs. system (0, 1, 2, other)

Page 18

Corpora: issues to remember

• Zero counts in the corpus: just because an event didn't happen in the corpus doesn't mean it won't happen; e.g., cress does not really have zero probability
• Getting a corpus that matches the actual use; e.g., kids don't misspell the same way that adults do

Page 19

Multiple Spelling Errors

• (BEFORE) Given O, collect all the wi that could have generated O by one error
• (NOW) Given O, collect all the wi that could have generated O by 1..k errors

General solution: how to compute the # and type of errors "between" O and wi?

Page 20

Minimum Edit Distance

• Def. The minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.

Example: gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u by a)

Page 21

Minimum Edit Distance Algorithm

• Dynamic programming (very common technique in NLP)
• High-level description:
  – Fills in a matrix of partial comparisons
  – Value of a cell computed as a "simple" function of surrounding cells
  – Output: not only the number of edit operations but also the sequence of operations

Page 22

Minimum Edit Distance Algorithm: Details

ed[i,j] = min distance between the first i chars of the source and the first j chars of the target.

Costs: del-cost = 1, ins-cost = 1, sub-cost = 2 (or 0 if the two characters are equal).

Update: each cell (i,j) is computed from its neighbors x = ed[i-1, j-1], y = ed[i, j-1], z = ed[i-1, j]:

ed[i,j] = MIN( z + 1 (del), y + 1 (ins), x + (2 or 0) (sub or equal) )
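A direct implementation of this recurrence (a sketch; the demo referenced below presumably also recovers the sequence of operations via backtrace pointers, omitted here for brevity):

```python
def min_edit_distance(source, target):
    """ed[i][j] = min edit distance between source[:i] and target[:j]."""
    m, n = len(source), len(target)
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        ed[i][0] = i                                   # i deletions
    for j in range(1, n + 1):
        ed[0][j] = j                                   # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            ed[i][j] = min(ed[i - 1][j] + 1,           # deletion (z + 1)
                           ed[i][j - 1] + 1,           # insertion (y + 1)
                           ed[i - 1][j - 1] + sub)     # substitution or equal (x + 2 or 0)
    return ed[m][n]

print(min_edit_distance("gumbo", "gam"))  # 4: delete o, delete b, substitute u by a
```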

Page 24

Min edit distance and alignment

See demo

Page 25

Final Method: multiple errors

(1) Given O, for each wi compute: mei = min-edit-distance(wi, O); if mei < k, save the corresponding edit operations in EdOpi.

(2) For all the wi compute: $P(O \mid w_i)\,P(w_i)$, where $P(O \mid w_i) = \prod_{x \in EdOp_i} P(x)$ (the probability of the errors generating O from wi) and $P(w_i)$ is the word prior.

(3) Sort and display top-n to user.
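A sketch of step (2) for one candidate, assuming two helpers that are illustrative rather than from the lecture: edit_ops (the operations backtraced from the min-edit-distance matrix) and p_op (per-operation probabilities from the confusion matrices):

```python
import math

def score(O, w, P, edit_ops, p_op, k=3):
    """P(O|w) * P(w), with P(O|w) = product of the per-edit probabilities."""
    ops = edit_ops(w, O)                 # e.g., [("del", "o"), ("del", "b"), ("sub", "u", "a")]
    if len(ops) >= k:                    # only candidates within k errors are considered
        return 0.0
    log_p = sum(math.log(p_op(x)) for x in ops)   # multiply in log space for stability
    return math.exp(log_p) * P[w]
```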

Page 26

Spelling: the problem(s)

Detection
• Non-word, isolated: is w in V?
• Real-word, isolated: ?!
• Real-word, context: is it an impossible (or very unlikely) word in this context? (e.g., ".. a wild big.")

Correction
• Non-word, isolated: find the most likely correct word (funn -> funny, funnel, ...)
• Non-word, context: ...and the most likely correct word in this context (– trust funn – a lot of funn)
• Real-word, context: find the most likely substitution word in this context

Page 27

Real Word Spelling Errors

• Collect a set of common confusion sets: C = {C1 .. Cn}, e.g., {(their/they're/there), (to/too/two), (weather/whether), (lave/have), ...}
• Whenever some c' ∈ Ci is encountered:
  – Compute the probability of the sentence in which it appears
  – Substitute each c ∈ Ci (c ≠ c') and compute the probability of the resulting sentence
  – Choose the one with the higher probability
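A sketch of this procedure, assuming a sentence-probability function p_sentence supplied by a language model (such as the n-gram models introduced next):

```python
CONFUSION_SETS = [
    {"their", "they're", "there"},
    {"to", "too", "two"},
    {"weather", "whether"},
]

def correct_real_words(tokens, p_sentence):
    """Replace each confusable word with the variant yielding the most probable sentence."""
    tokens = list(tokens)
    for i, word in enumerate(tokens):
        for conf_set in CONFUSION_SETS:
            if word in conf_set:
                # substitute each alternative and compare sentence probabilities
                tokens[i] = max(conf_set,
                                key=lambda c: p_sentence(tokens[:i] + [c] + tokens[i + 1:]))
    return tokens
```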

Page 28

Want to play with spelling correction? A minimal noisy channel model implementation:

• (Python) http://www.norvig.com/spell-correct.html
• By the way, Peter Norvig is Director of Research at Google Inc.

Page 29

Key Transition

• Up to this point we’ve mostly been discussing words in isolation

• Now we’re switching to sequences of words

• And we’re going to worry about assigning probabilities to sequences of words

Page 30

Knowledge-Formalisms Map (including probabilistic formalisms)

• State Machines (and prob. versions): Finite State Automata, Finite State Transducers, Markov Models -> Morphology
• Rule systems (and prob. versions): e.g., (Prob.) Context-Free Grammars -> Syntax
• Logical formalisms (First-Order Logics) -> Semantics
• AI planners -> Pragmatics, Discourse and Dialogue

Page 31

Only Spelling?

A. Assign a probability to a sentence
• Part-of-speech tagging
• Word-sense disambiguation
• Probabilistic parsing

B. Predict the next word
• Speech recognition
• Handwriting recognition
• Augmentative communication for the disabled

$P(w_1, \dots, w_n) = ?$ Impossible to estimate directly.

Page 32

Decompose: apply chain rule

Chain Rule:

$P(A_1, \dots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1^{i-1})$

Applied to a word sequence from position 1 to n ($w_1^n$):

$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = P(w_1)\prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

Page 33

Example

• Sequence: "The big red dog barks"
• P(The big red dog barks) = P(The) * P(big|the) * P(red|the big) * P(dog|the big red) * P(barks|the big red dog)

Note: P(The) is better expressed as P(The | <Beginning of sentence>), written as P(The | <S>).

Page 34

Not a satisfying solution

Even for small n (e.g., 6) we would need a far too large corpus to estimate $P(w_6 \mid w_1^5)$.

Markov Assumption: the entire prefix history isn't necessary.

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

• N = 1 (unigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n)$
• N = 2 (bigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
• N = 3 (trigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}, w_{n-2})$

Page 35

Prob of a sentence: N-Grams

Chain rule: $P(w_1^n) = P(w_1)\prod_{k=2}^{n} P(w_k \mid w_1^{k-1})$

• unigram: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k)$
• bigram: $P(w_1^n) \approx P(w_1)\prod_{k=2}^{n} P(w_k \mid w_{k-1})$
• trigram: $P(w_1^n) \approx P(w_1)\,P(w_2 \mid w_1)\prod_{k=3}^{n} P(w_k \mid w_{k-1}, w_{k-2})$

Page 36

Bigram

<s> The big red dog barks

$P(w_1^n) \approx P(w_1 \mid \langle S\rangle)\prod_{k=2}^{n} P(w_k \mid w_{k-1})$

P(The big red dog barks) = P(The|<S>) * P(big|the) * P(red|big) * P(dog|red) * P(barks|dog)

Trigram?
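A sketch of evaluating the bigram product for this sentence; the probability table here is a toy stand-in for estimates from a corpus (see the next slide):

```python
import math

def bigram_logprob(tokens, p_bigram):
    """log P(w_1..w_n) = sum of log P(w_k | w_{k-1}), starting from <S>."""
    logp, prev = 0.0, "<S>"
    for w in tokens:
        logp += math.log(p_bigram[(prev, w)])
        prev = w
    return logp

# toy conditional probabilities P(w | previous word)
p_bigram = {("<S>", "the"): 0.3, ("the", "big"): 0.1, ("big", "red"): 0.2,
            ("red", "dog"): 0.2, ("dog", "barks"): 0.4}
print(math.exp(bigram_logprob("the big red dog barks".split(), p_bigram)))
```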

Page 37

Estimates for N-Grams

bigram:

$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})} = \frac{C(w_{n-1}, w_n)/N_{\text{pairs}}}{C(w_{n-1})/N_{\text{words}}} \approx \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$

...in general:

$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\,w_n)}{C(w_{n-N+1}^{n-1})}$

Page 38

Next Time

• Finish N-Grams (Chp. 4)
• Model Evaluation (Sec. 4.4)
• No smoothing (4.5-4.7)
• Start Hidden Markov Models

Assignment 1 is due

