Introduction to Natural Language Processing
coling.epfl.ch/slides/coursTAL-corrlex-en.pdf

Out of Vocabulary Forms
Spelling Error Correction

J.-C. Chappelier & M. Rajman
Laboratoire d'Intelligence Artificielle
Faculté I&C
© EPFL




Contents

- Out of Vocabulary Forms
- Spelling Error Correction
    - Edit distance
    - Spelling error correction with FSA
    - Weighted edit distance





Out of Vocabulary forms

- Out of Vocabulary (OoV) forms matter: they occur quite frequently (e.g. ≈ 10% in newspapers)

What do they consist of?
- spelling errors: foget, summmary, usqge, ...
- neologisms: Internetization, Tacherism, ...
- borrowings: gestalt, rendez-vous, ...
- forms difficult to exhaustively lexicalize: (numbers,) proper names, abbreviations, ...
- identification based on patterns is not well-adapted for all OoV forms

⇒ We will focus here on spelling errors, neologisms and borrowings





Spelling errors and neologisms

- for spelling errors (resp. neologisms), distortions (resp. derivations) are modelled by transformations, i.e. rewriting rules (sometimes weighted)

  Examples:
    - transposition (distortion): XY → YX [1.0], where X and Y stand for variables
    - tripling (distortion): XX → XXX [1.0]
    - name derivation: ize:INF → ization:N [1.0]

- a given lexicon (regular language) and a set of transformations define the edit space to be explored

⇒ The aim is to find the position of the OoV forms in the edit space with respect to known (lexicalized) forms (neighbourhoods, similarity, distance)





Spelling errors and neologisms (2)

- if the transformation set is simple enough, automatic (or semi-automatic) learning of the transformation set is possible

  Examples:
    - morphological rules for Spanish
    - transformations for spelling error correction after OCR:
        1. print a selected set of texts (that you have in electronic form) using several different printers and fonts
        2. OCR the printed copies
        3. learn the required transformations from the errors





Borrowings

For borrowings ⇒ identification of the source language
[when no large-coverage lexica are available for the other languages, but only representative texts]

Decomposition into n-grams of characters.
Example for trigrams:

    dribble → (dri, rib, ibb, bbl, ble)

In practice: n varies from 2 to 4

From reference corpora, estimate the likelihood of a word to belong to a given language.
Example for trigrams:

$$P(\text{dribble} \mid L) = P(\text{dri} \mid L) \cdot \frac{P(\text{rib} \mid L)}{P(\text{ri} \mid L)} \cdot \ldots \cdot \frac{P(\text{ble} \mid L)}{P(\text{bl} \mid L)}$$

Trigrams for French, English, German and Spanish: 87% discrimination accuracy
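
As a rough illustration of the trigram estimate above, here is a minimal Python sketch. The two toy "reference corpora", the add-alpha smoothing, the alphabet size of 26 and all names (`char_ngrams`, `TrigramModel`, `identify`, `MODELS`) are assumptions made for this example, not part of the lecture. Each ratio P(xyz|L) / P(xy|L) is estimated as the smoothed relative frequency of z after xy.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """All character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

class TrigramModel:
    """Character-trigram model of one language (add-alpha smoothing is an assumption)."""

    def __init__(self, text, alpha=0.1, alphabet_size=26):
        self.alpha = alpha
        self.alphabet_size = alphabet_size
        self.trigrams = Counter()   # count of each trigram xyz
        self.contexts = Counter()   # count of trigrams starting with the bigram xy
        for word in text.lower().split():
            for gram in char_ngrams(word):
                self.trigrams[gram] += 1
                self.contexts[gram[:2]] += 1
        self.total = sum(self.trigrams.values())

    def log_likelihood(self, word):
        """log P(word | L) = log P(first trigram) + sum of log P(z | xy)."""
        grams = char_ngrams(word.lower())
        if not grams:               # words shorter than 3 characters carry no evidence
            return 0.0
        a, k = self.alpha, self.alphabet_size
        logp = math.log((self.trigrams[grams[0]] + a) / (self.total + a * k ** 3))
        for gram in grams[1:]:
            # P(xyz|L) / P(xy|L), estimated as the smoothed frequency of z after xy
            logp += math.log((self.trigrams[gram] + a) / (self.contexts[gram[:2]] + a * k))
        return logp

def identify(word, models):
    """Language whose model gives the word the highest likelihood."""
    return max(models, key=lambda lang: models[lang].log_likelihood(word))

# Toy reference corpora (made up for this sketch; real ones would be far larger).
MODELS = {
    "en": TrigramModel("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": TrigramModel("le renard brun saute par dessus le chien paresseux et le chat"),
}

print(char_ngrams("dribble"))
print(identify("dog", MODELS), identify("chien", MODELS))
```

Working with log-probabilities avoids numerical underflow on long words; the smoothing keeps unseen trigrams from giving a zero likelihood.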




Likelihood vs. Posterior probability

In the previous slide, why use the likelihood P(w|L) rather than the posterior probability P(L|w)?

- Both are hard to model accurately without further assumptions (w belongs to a huge set!), but:
    - no further simplification can be made on P(L|w): w is fixed (and there is nothing to gain from "simplifying" L!)
    - P(w|L) can be further simplified by making assumptions on w

- Using Bayes' rule:

$$\operatorname*{argmax}_{L} P(L \mid w) = \operatorname*{argmax}_{L} P(w \mid L) \cdot P(L)$$

  ⇒ introduces the likelihood anyway! (which could then be simplified further)

- If you can accurately estimate P(L), sure, make use of it!

- Otherwise, the least biased hypothesis (maximum entropy) is to assume a priori that all languages are equally likely: maximizing the posterior probability is then the same as maximizing the likelihood





Contents

- Out of Vocabulary Forms
- → Spelling Error Correction
    - Edit distance
    - Spelling error correction with FSA
    - Weighted edit distance




Spelling error correction

[Figure: the input string inside the set of all strings; the solutions are the correct strings lying within the maximal distance of the input string]

Two approaches:

                     Exact, lexicon-based    Probabilistic
    correct forms:   lexicon                 any string
    metric:          edit distance           probability

In this lecture:
- only a few words about the probabilistic approach (next slide)
- mainly: the exact, lexicon-based approach




Probabilistic approach summarized (1/2)

Make (one more time!) use of n-grams (both levels, characters and tokens, are combined)

w: OoV token to be corrected

c: candidate correction, out of C(w), the set of possible candidates for w

$$\operatorname*{argmax}_{c \in C(w)} P(c \mid w) = \operatorname*{argmax}_{c \in C(w)} P(c) \cdot P(w \mid c)$$

P(c): language model (n-grams of tokens/words; n = 1 here, but could easily be extended to neighboring tokens, with n > 1)

P(w|c): error model: edit distance and/or m-grams of characters




Probabilistic approach summarized (2/2)

A usual (implicit?) assumption is that P(w|c) is many orders of magnitude higher for smaller edit distances than for larger ones: closer candidates are thus considered first, leading to this simple algorithm, where $C_d(w)$ is the set of candidates at distance d from w:

- if $C_1(w)$ is not empty, return $\operatorname*{argmax}_{c \in C_1(w)} P(c)$;
- (else) if $C_2(w)$ is not empty, return $\operatorname*{argmax}_{c \in C_2(w)} P(c)$;
- etc.

For more details, see http://norvig.com/spell-correct.html
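
The distance-ordered candidate search can be sketched as follows, in the spirit of the page cited above. This is only an illustration: the word counts in `COUNTS` are made up, P(c) is taken proportional to these counts (unigram model, n = 1), and the names `edits1` and `correct` are choices of this sketch.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Toy unigram counts standing in for P(c) (made-up numbers).
COUNTS = Counter({"forget": 5, "summary": 3, "usage": 2, "forge": 1})

def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts, max_distance=2):
    """Return the most frequent lexicon word among the closest non-empty C_d(word)."""
    if word in counts:                      # already a known form
        return word
    frontier = {word}
    for _ in range(max_distance):
        # Strings reachable with one more edit from the current frontier.
        frontier = {edited for w in frontier for edited in edits1(w)}
        candidates = [c for c in frontier if c in counts]
        if candidates:                      # C_d(w) is not empty: stop at this distance
            return max(candidates, key=counts.get)
    return word                             # no candidate found: leave the token as is

print(correct("foget", COUNTS), correct("summmary", COUNTS), correct("usqge", COUNTS))
```

Stopping at the first non-empty distance is exactly where the "orders of magnitude" assumption enters: a frequent word at distance 2 never beats a rare word at distance 1.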



Edit distance

also called "Levenshtein distance"

⇒ distance between 2 forms/strings = minimal number of transformations needed to change one into the other

⇒ depends on the set of transformations considered

Examples of transformations:
- insertion: exmple → example
- deletion: example → exmple
- substitution: exemple → example
- transposition: exmaple → example




Computation of edit distance (1)

Notations:
- $X_i$: i-th character of string X
- $X_i^j$: if $i \le j$, the substring $X_i, \ldots, X_j$; the empty string otherwise

Example: X = castle: $X_3 = \text{s}$, $X_4^6 = \text{tle}$, $X_1^4 = \text{cast}$, $X_1^0 = \varepsilon$

Computation of the distance D(X, Y) by dynamic programming:

⇒ step by step in a chart m where each cell $m_{ij}$ contains the distance between the two substrings $X_1^i$ and $Y_1^j$:

$$m_{ij} = D(X_1^i, Y_1^j)$$




Computation of edit distance (2)

Initialization:

$$D(X_1^0, Y_1^j) = j \qquad D(X_1^i, Y_1^0) = i$$

Recurrence:

$$
D(X_1^i, Y_1^j) =
\begin{cases}
D(X_1^{i-1}, Y_1^{j-1}) & \text{if } X_i = Y_j \text{ (equality)} \\[4pt]
1 + \min\{ D(X_1^{i-2}, Y_1^{j-2}),\ D(X_1^{i-1}, Y_1^{j}),\ D(X_1^{i}, Y_1^{j-1}) \} & \text{else if } i \ge 2,\ j \ge 2,\ X_{i-1} = Y_j \text{ and } X_i = Y_{j-1} \\
 & \text{(transposition, deletion, insertion)} \\[4pt]
1 + \min\{ D(X_1^{i-1}, Y_1^{j-1}),\ D(X_1^{i-1}, Y_1^{j}),\ D(X_1^{i}, Y_1^{j-1}) \} & \text{else (substitution, deletion, insertion)}
\end{cases}
$$




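Computation of edit distance: computation order

[Figure: the chart with axes X (index i) and Y (index j); arrows labelled Id, Subs, Transp, Del and Ins show which earlier cells, back to i-2 and j-2, feed cell (i, j)]

⇒ several possible ways of computing: rowwise, columnwise or diagonal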




Computation of edit distance (3)

Example, columnwise:

    for all i from 0 to |X| (size of X) do
        m[i,0] = i
    for all j from 1 to |Y| do
        m[0,j] = j
        for all i from 1 to |X| do
            if X_i = Y_j then
                m[i,j] = m[i-1,j-1]
            else if i ≥ 2 and j ≥ 2 and X_{i-1} = Y_j and X_i = Y_{j-1} then
                m[i,j] = 1 + min{ m[i-2,j-2]; m[i,j-1]; m[i-1,j] }
            else
                m[i,j] = 1 + min{ m[i-1,j-1]; m[i,j-1]; m[i-1,j] }
    return m[|X|,|Y|]
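
A direct Python transcription of the columnwise pseudocode above may help; it is a minimal sketch and the function name `edit_distance` is a choice of this example. Python strings are 0-indexed, so the slide's $X_i$ is `x[i - 1]`.

```python
def edit_distance(x, y):
    """Unit-cost edit distance with insertion, deletion, substitution and transposition."""
    n, m_len = len(x), len(y)
    # m[i][j] holds D(x[:i], y[:j]).
    m = [[0] * (m_len + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        m[i][0] = i                       # delete the i first characters of x
    for j in range(1, m_len + 1):         # columnwise: one column per character of y
        m[0][j] = j                       # insert the j first characters of y
        for i in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1]                      # equality
            elif i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                # transposition, insertion, deletion
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                # substitution, insertion, deletion
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m[n][m_len]

print(edit_distance("exmple", "exemple"), edit_distance("exmaple", "example"))
```

Only the two previous columns are actually needed at any time, so the full chart could be replaced by three columns when memory matters.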




Edit Distance (example)

D(exmple, exemple):

          e  x  e  m  p  l  e
       0  1  2  3  4  5  6  7
    e  1  0  1  2  3  4  5  6
    x  2  1  0  1  2  3  4  5
    m  3  2  1  1  1  2  3  4
    p  4  3  2  2  2  1  2  3
    l  5  4  3  3  3  2  1  2
    e  6  5  4  3  4  3  2  1

D(exmaple, example):

          e  x  a  m  p  l  e
       0  1  2  3  4  5  6  7
    e  1  0  1  2  3  4  5  6
    x  2  1  0  1  2  3  4  5
    m  3  2  1  1  1  2  3  4
    a  4  3  2  1  1  2  3  4
    p  5  4  3  2  2  1  2  3
    l  6  5  4  3  3  2  1  2
    e  7  6  5  4  4  3  2  1




Spelling error correction using a FSA

Problem: approximate search of lexicalized (surface) forms, i.e. within a maximal distance range

i.e. fault-tolerant recognition (within a regular language):

    Find all ending paths such that the corresponding string is at distance at most θ from the given input string.

Remark: a trie is a special case of FSA




Finite-State Automata (FSA)

Formally:
- Q: (finite) set of states
- Σ: (finite) alphabet
- δ: arcs (mapping from Q × Σ to Q); an arc labelled a from q1 to q2 means δ(q1, a) = q2
- q0 ∈ Q: initial state
- F ⊂ Q: final states

Interface:
- initialState(): provides q0
- (q, a) = nextAfter(p, c): returns the next character and the state it leads to, after character c, starting from state p.
  Formally: returns the pair (q, α) ∈ Q × Σ with the smallest α such that α > c and δ(p, α) = q
- isFinal(p): are we done with p? Checks whether p ∈ F or not.

[Figure: state p with several outgoing arcs, among them one labelled c and one labelled a leading to q]
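
Since a trie is a special case of FSA, the three-function interface above can be sketched on a trie. This is an illustrative sketch only: the class name `TrieFSA`, the integer state numbering and the use of `None` for "no further arc" are assumptions of this example, and the empty string plays the role of ε (it compares lower than any character).

```python
class TrieFSA:
    """Trie-shaped FSA exposing initial_state / next_after / is_final."""

    def __init__(self, words):
        self.arcs = [{}]          # arcs[p][a] = q encodes delta(p, a) = q
        self.finals = set()
        for word in words:
            p = 0
            for a in word:
                if a not in self.arcs[p]:
                    self.arcs.append({})                 # create a fresh state
                    self.arcs[p][a] = len(self.arcs) - 1
                p = self.arcs[p][a]
            self.finals.add(p)                           # end of a lexicon word

    def initial_state(self):
        return 0

    def next_after(self, p, c):
        """Arc leaving p with the smallest label strictly greater than c, or None."""
        later = [a for a in self.arcs[p] if a > c]
        if not later:
            return None
        a = min(later)
        return self.arcs[p][a], a

    def is_final(self, p):
        return p in self.finals

fsa = TrieFSA(["car", "cat", "do"])
arc, label = fsa.next_after(fsa.initial_state(), ""), ""
while arc is not None:            # enumerate the arcs leaving q0 in alphabetical order
    state, label = arc
    print(label, state)
    arc = fsa.next_after(fsa.initial_state(), label)
```

Calling `next_after` repeatedly with the last returned character enumerates the outgoing arcs one by one, which is what lets a depth-first search resume a state without storing an iterator.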




Pruning criteria: cut-off edit distance

To make it useful in practice ⇒ fast ⇒ good pruning
⇒ cut-off edit distance [Oflazer 1996]:

$$Co(X_1^n, Y_1^m) = \min_{I(m) \le i \le J(m)} D(X_1^i, Y_1^m)$$

$$I(m) = \min(n, \max(1, m - \theta)) \qquad J(m) = \min(n, \max(1, m + \theta))$$

Important property:

$$Co(X, Y) > \theta \implies \forall Z \quad D(X, Y + Z) > \theta$$




Cut-off Edit Distance: example

θ = 2, X = exmples (n = 7), Y = exam (m = 4)

I = 4 − 2 = 2, J = 4 + 2 = 6: Y is compared with the prefixes ex, exm, exmp, exmpl and exmple of X

          e  x  a  m
       0  1  2  3  4
    e  1  0  1  2  3
    x  2  1  0  1  2
    m  3  2  1  1  1
    p  4  3  2  2  2
    l  5  4  3  3  3
    e  6  5  4  4  4
    s  7  6  5  5  5

Co(X, Y) = min{2, 1, 2, 3, 4} = 1
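
The cut-off distance can be read off the last column of the chart, as the following minimal sketch shows. The names `distance_chart` and `cutoff_distance` are choices of this example, and the whole chart is recomputed on each call for clarity, which a real implementation would avoid.

```python
def distance_chart(x, y):
    """Chart m with m[i][j] = D(x[:i], y[:j]) (unit costs, with transposition)."""
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i
    for j in range(1, len(y) + 1):
        m[0][j] = j
        for i in range(1, len(x) + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            elif i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m

def cutoff_distance(x, y, theta):
    """Co(x, y): smallest D(x[:i], y) over the prefixes of x with I(m) <= i <= J(m)."""
    n, m_len = len(x), len(y)
    chart = distance_chart(x, y)
    low = min(n, max(1, m_len - theta))    # I(m)
    high = min(n, max(1, m_len + theta))   # J(m)
    return min(chart[i][m_len] for i in range(low, high + 1))

print(cutoff_distance("exmples", "exam", 2))
```

With X = exmples, Y = exam and θ = 2, the minimum is taken over rows 2 to 6 of the last column, i.e. over the values 2, 1, 2, 3, 4 of the chart above.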




Walk through a FSA within a θ distance range
Prefix-compatible depth-first version

Input: a string to be corrected (X), a lexicon in the form of a FSA and a maximal error threshold (θ)

    Push(ε, ε, q0)
    while Stack is not empty do
        Pop(Z, c, p)
        (q, a) = nextAfter(p, c)
        if (q, a) ≠ ∅ then
            Push(Z, a, p)
            Y ← Z + a
            if Co(X, Y) ≤ θ then
                Push(Y, ε, q)
                if isFinal(q) and D(X, Y) ≤ θ then
                    Add Y to solutions

[Figure: a path from q0 to p spelling Z; from p, an arc labelled c and an arc labelled a leading to q]
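
The depth-first walk can be sketched in Python as below. This is a hedged illustration rather than a reference implementation: the dictionary-based `FSA` class, the function names and the use of `""` for ε and `None` for ∅ are assumptions of this sketch, and D and Co are recomputed from scratch at each step instead of updating only the last column.

```python
def distance_chart(x, y):
    """Chart m with m[i][j] = D(x[:i], y[:j]) (unit costs, with transposition)."""
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i
    for j in range(1, len(y) + 1):
        m[0][j] = j
        for i in range(1, len(x) + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            elif i >= 2 and j >= 2 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m

def cutoff_distance(x, y, theta):
    """Co(x, y) as defined with I(m) and J(m)."""
    n, m_len = len(x), len(y)
    chart = distance_chart(x, y)
    low, high = min(n, max(1, m_len - theta)), min(n, max(1, m_len + theta))
    return min(chart[i][m_len] for i in range(low, high + 1))

class FSA:
    """Deterministic FSA given as arcs[p][a] = q (hypothetical helper class)."""
    def __init__(self, arcs, initial, finals):
        self.arcs, self.initial, self.finals = arcs, initial, set(finals)
    def initial_state(self):
        return self.initial
    def next_after(self, p, c):
        later = [a for a in self.arcs.get(p, {}) if a > c]
        return (self.arcs[p][min(later)], min(later)) if later else None
    def is_final(self, p):
        return p in self.finals

def corrections(x, fsa, theta):
    """All accepted strings Y reached by the walk with D(x, Y) <= theta."""
    solutions = []
    stack = [("", "", fsa.initial_state())]           # Push(eps, eps, q0)
    while stack:
        z, c, p = stack.pop()
        arc = fsa.next_after(p, c)
        if arc is None:                                # no arc left from p after c
            continue
        q, a = arc
        stack.append((z, a, p))                        # come back later for the arcs after a
        y = z + a
        if cutoff_distance(x, y, theta) <= theta:      # otherwise prune this prefix
            stack.append((y, "", q))
            if fsa.is_final(q) and distance_chart(x, y)[len(x)][len(y)] <= theta:
                solutions.append(y)
    return solutions

# FSA for (aba|bab)*: state 0 is both initial and final.
LEXICON = FSA({0: {"a": 1, "b": 3}, 1: {"b": 2}, 2: {"a": 0}, 3: {"a": 4}, 4: {"b": 0}}, 0, [0])
print(sorted(corrections("ababa", LEXICON, 1)))
```

The search terminates even though this automaton has cycles, because any prefix longer than |X| + θ has a cut-off distance above θ and is pruned. Python strings are immutable, so `y = z + a` creates a fresh copy at each step; in a language with mutable buffers this is where care is needed.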




Example: ababa vs. (aba|bab)* with θ = 1

[Figure: the FSA for (aba|bab)* and the paths explored for X = ababa with θ = 1; arcs are labelled a or b and the explored nodes are annotated with bracketed values [0], [1] or [2]]

Chart for X = ababa and Y = abaaba:

          a  b  a  a  b  a
       0  1  2  3  4  5  6
    a  1  0  1  2  3  4  5
    b  2  1  0  1  2  3  4
    a  3  2  1  0  1  2  3
    b  4  3  2  1  1  1  2
    a  5  4  3  2  1  1  1

Solutions: abaaba, ababab, bababa



Implementation issues

1. Efficient computation of Co with the previously described chart:
   ⇒ recomputation of the last column (m) only
   ⇒ computation of D and Co in the same loop

2. Y ← Z + a: beware (local copies, pointers, etc.).
   Similarly, do not naively implement "Push(Y, q)".

3. In some (programming) languages, it could be worth transposing the algorithm: Y (which is changing) for rows and X for columns




Contents

- Out of Vocabulary Forms
- Spelling Error Correction
    - Edit distance
    - Spelling error correction with FSA
    - → Weighted edit distance




Limitations?

- weighting
  Example: diacritics, uppercase
      eleves → élèves        aloves → élèves

- specific transformations
  Example: typing errors
      tupe → type        usqge → usage
      more generally: deuit → fruit

- whitespaces
      theothers → the others        othe rs → others

⇒ 3 aspects of the same problem

Solution: generalization of the edit distance: weighted edit distance




Weighted Edit Distance

weighted transformations such that:
- W(Id) = 0
- W(f) > 0 for f ≠ Id
- W(f⁻¹) = W(f)
- W(f ∘ g) = W(f) + W(g)

$$D(X, Y) = \min_{f \,:\, Y = f(X)} W(f)$$

⇒ It is actually a distance on Σ*

Difference with the preceding distance: W(f) is no longer necessarily the same (= 1) for every transformation.




Remarks

1. Distance on Σ* ⇒ ∀X, Y, ∃f : Y = f(X)
   True if Ins and Del are in the transformation set

2. Non-overlapping transformations, i.e. a transformation cannot be applied to the result of the previous transformation
   Counter-example: ba →(Transp) ab →(Sub) ac




Coherence Constraints

"Semantic integrity":
- W(Del) + W(Ins(x)) > W(Sub(x))
- W(Split) < W(Ins(x)) (which implies W(Merge) < W(Del))
- W(Transp) < W(Ins(x)) + W(Del)

⇒ Introducing a new f such that $f = \circ_i f_i$ is useful if and only if $W(f) < \sum_i W(f_i)$




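Weighted Edit Distance: computation

[Figure: prefixes of X and Y side by side; a transformation f rewrites its input range at the end of the X prefix; min1(f) is the chart value before that range and min2 the value after applying f]

$$\text{min2} = \min_{f} \left[\, \text{min1}(f) + W(f) \,\right]$$

(min1 and min2 are the values stored in the chart)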




Weighted Edit Distance: computation (2)

Initialization:

$$D(X_1^0, Y_1^j) = j \qquad D(X_1^i, Y_1^0) = i$$

Recurrence:

$$
D(X_1^i, Y_1^j) =
\begin{cases}
D(X_1^{i-1}, Y_1^{j-1}) & \text{if } X_i = Y_j \text{ (equality)} \\[4pt]
W(f) + \min\{\text{min1}(f)\} & \text{for all applicable transformations } f \text{ of the same weight} \\[4pt]
\ldots & \text{for all possible weights, by increasing } W(f)
\end{cases}
$$

⇒ The optimization lies in the grouping of similar cases: same weight and compatible transformations
(Example: previously, Transp and Sub were incompatible because W(Transp) < 2 W(Sub). But each of them is compatible with Del and Ins.)

Note: {min1(f)} is the set of all the minimal values for all possible f at this point; they must, of course, already have been computed at this point (loop condition)




Example

D(example, exemple):

          e  x  e  m  p  l  e
       0  1  2  3  4  5  6  7
    e  1  0  1  2  3  4  5  6
    x  2  1  0  1  2  3  4  5
    a  3  2  1  1  2  3  4  5
    m  4  3  2  2  1  2  3  4
    p  5  4  3  3  2  1  2  3
    l  6  5  4  4  3  2  1  2
    e  7  6  5  4  4  3  2  1

D(exémple, exemple):

            e    x    e    m    p    l    e
       0    1    2    3    4    5    6    7
    e  1    0    1    2    3    4    5    6
    x  2    1    0    1    2    3    4    5
    é  3    2    1    0.1  1.1  2.1  3.1  4.1
    m  4    3    2    1.1  0.1  1.1  2.1  3.1
    p  5    4    3    2.1  1.1  0.1  1.1  2.1
    l  6    5    4    3.1  2.1  1.1  0.1  1.1
    e  7    6    5    4    3.1  2.1  1.1  0.1

W(é ↔ e) = 0.1
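
A weighted variant of the chart computation can be sketched as follows, taking at each cell the minimum of min1(f) + W(f) over the applicable transformations. The function names, the default weights of 1.0 and the restriction to insertion, deletion, substitution and transposition are assumptions of this sketch; only W(é ↔ e) = 0.1 comes from the example above.

```python
def sub_weight(a, b):
    """Substitution weight: free for identity, cheap for the diacritic pair e/é."""
    if a == b:
        return 0.0
    if {a, b} == {"e", "é"}:
        return 0.1
    return 1.0

def weighted_distance(x, y, sub_w=sub_weight, ins_w=1.0, del_w=1.0, transp_w=1.0):
    """Weighted edit distance: each cell is the min over f of min1(f) + W(f)."""
    n, m_len = len(x), len(y)
    chart = [[0.0] * (m_len + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        chart[i][0] = chart[i - 1][0] + del_w
    for j in range(1, m_len + 1):
        chart[0][j] = chart[0][j - 1] + ins_w
        for i in range(1, n + 1):
            candidates = [
                chart[i - 1][j] + del_w,                              # deletion
                chart[i][j - 1] + ins_w,                              # insertion
                chart[i - 1][j - 1] + sub_w(x[i - 1], y[j - 1]),      # identity or substitution
            ]
            if (i >= 2 and j >= 2 and x[i - 1] != y[j - 1]
                    and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                candidates.append(chart[i - 2][j - 2] + transp_w)     # transposition
            chart[i][j] = min(candidates)
    return chart[n][m_len]

print(weighted_distance("example", "exemple"), weighted_distance("exémple", "exemple"))
```

Passing a different `sub_w` function is one way to encode keyboard-neighbour typing errors or case differences; handling whitespace splits and merges would need extra transformations that this sketch does not include.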




Keypoints

- One has to handle out of vocabulary forms
- Edit (Levenshtein) distance, weighted edit distance
- Spelling error correction with FSA




References

- K. Oflazer, "Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction", Computational Linguistics, Volume 22, Number 1, 1996.
- Section 8.2 in M. Rajman (editor), "Speech and Language Engineering", EPFL Press, 2006.
- Sections 3.10 and 3.11 in D. Jurafsky and J. H. Martin, "Speech and Language Processing", Prentice Hall, 2008 (2nd edition).
- Section 3.3 in C. D. Manning, P. Raghavan and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.

