Out of Vocabulary Forms
Spelling Error Correction
J.-C. Chappelier & M. Rajman
Laboratoire d’Intelligence Artificielle, Faculté I&C
© EPFL
Introduction to INLP
Contents
- Out of Vocabulary Forms
- Spelling Error Correction
  - Edit distance
  - Spelling error correction with FSA
  - Weighted edit distance
Out of Vocabulary forms
- Out of Vocabulary (OoV) forms matter: they occur quite frequently (e.g. ≈ 10% in newspapers)
What do they consist of?
- spelling errors: foget, summmary, usqge, ...
- neologisms: Internetization, Tacherism, ...
- borrowings: gestalt, rendez-vous, ...
- forms difficult to exhaustively lexicalize: (numbers,) proper names, abbreviations, ...
- identification based on patterns is not well-adapted for all OoV forms
⇒ We will focus here on spelling errors, neologisms and borrowings
Spelling errors and neologisms
- for spelling errors (resp. neologisms), distortions (resp. derivations) are modelled by transformations, i.e. rewriting rules (sometimes weighted)
  Example:
  - transposition (distortion): XY → YX [1.0], where X and Y stand for variables
  - tripling (distortion): XX → XXX [1.0]
  - name derivation: ize:INF → ization:N [1.0]
- a given lexicon (regular language) and a set of transformations define the edit space to be explored
⇒ The aim is to find the position of the OoV forms in the edit space with respect to known (lexicalized) forms (neighbourhoods, similarity, distance)
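As a rough illustration of rewriting rules applied to a lexicon, the sketch below applies the two distortion rules (transposition and tripling) to each lexicalized form and reports which form and rule could explain a given OoV form. The function names, the `RULES` table and the toy lexicon are assumptions made for this example only; weights are ignored.

```python
def transpositions(form):
    """Distortion XY -> YX applied at every position of the form."""
    return {form[:i] + form[i + 1] + form[i] + form[i + 2:]
            for i in range(len(form) - 1)}


def triplings(form):
    """Distortion XX -> XXX applied to every doubled letter of the form."""
    return {form[:i] + form[i] + form[i:]
            for i in range(len(form) - 1) if form[i] == form[i + 1]}


# Hypothetical rule table: rule name -> function producing the distorted forms.
RULES = {"transposition": transpositions, "tripling": triplings}


def explain(oov, lexicon):
    """List the (lexicalized form, rule) pairs that produce the OoV form in one step."""
    return sorted((form, name)
                  for form in lexicon
                  for name, rule in RULES.items()
                  if oov in rule(form))


LEXICON = {"summary", "example", "from"}  # toy lexicon, for illustration only
print(explain("summmary", LEXICON))
print(explain("exmaple", LEXICON))
```

Only one rule application is explored here; exploring the edit space further would mean chaining rules, which is what the distance-based methods later in the lecture organize efficiently.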
Spelling errors and neologisms (2)
- if the transformation set is simple enough: automatic (or semi-automatic) learning of the transformation set is possible
  Examples:
  - morphological rules for Spanish
  - transformations for spelling error correction after OCR:
    - print a selected set of texts (that you have in electronic form), making use of several different printers and fonts
    - OCR the printed copies
    - learn the required transformations from the errors
Borrowings
For borrowings ⇒ identification of the source language
[when no large-coverage lexica are available for the other languages, but only representative texts]
Decomposition into n-grams of characters.
Example: for trigrams
dribble → (dri, rib, ibb, bbl, ble)
In practice: n varies from 2 to 4
From reference corpora, estimate the likelihood of a word to belong to a given language.
Example for trigrams:
\[ P(\text{dribble} \mid L) = P(\text{dri} \mid L) \cdot \frac{P(\text{rib} \mid L)}{P(\text{ri} \mid L)} \cdot \ldots \cdot \frac{P(\text{ble} \mid L)}{P(\text{bl} \mid L)} \]
Trigrams for French, English, German and Spanish: 87% discrimination accuracy
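The trigram likelihood above can be sketched in a few lines. This is a minimal illustration, not the setup behind the 87% figure: the additive smoothing (`alpha`, `alphabet_size`), the function names and the three-word "corpora" are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def char_ngrams(word, n):
    """All contiguous character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def train(words):
    """Count character trigrams and bigrams in a reference word list."""
    tri = Counter(g for w in words for g in char_ngrams(w, 3))
    bi = Counter(g for w in words for g in char_ngrams(w, 2))
    return tri, bi


def log_likelihood(word, model, alpha=0.5, alphabet_size=26):
    """log P(word | L) under the trigram decomposition, with additive smoothing."""
    tri, bi = model
    grams = char_ngrams(word, 3)
    if not grams:
        return 0.0  # word shorter than a trigram: no evidence either way
    n_tri = sum(tri.values())
    # First factor, e.g. P(dri | L): smoothed relative frequency of the trigram.
    total = math.log((tri[grams[0]] + alpha)
                     / (n_tri + alpha * alphabet_size ** 3))
    for g in grams[1:]:
        # Ratio such as P(rib | L) / P(ri | L), estimated from smoothed counts.
        total += math.log((tri[g] + alpha)
                          / (bi[g[:2]] + alpha * alphabet_size))
    return total


def guess_language(word, models):
    """Language whose model gives the word the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(word, models[lang]))


# Toy reference "corpora" (assumption: real ones would be large text samples).
TOY_MODELS = {
    "english": train(["dribble", "nibble", "table"]),
    "french": train(["fromage", "bonjour", "maison"]),
}
print(guess_language("dribble", TOY_MODELS))
```

Working in log space avoids underflow when many small probabilities are multiplied.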
Likelihood vs. Posterior probability
In the previous slide, why make use of the likelihood $P(w \mid L)$ rather than the posterior probability $P(L \mid w)$?
- They are both hard to accurately model without further assumptions (w belongs to a huge set!), but no further simplification can be made on $P(L \mid w)$: w is fixed (and there is nothing to gain in “simplifying” L!), whereas $P(w \mid L)$ can be further simplified by making assumptions on w
- Using Bayes’ rule (the denominator $P(w)$ does not depend on L and can be dropped):
  \[ \operatorname{argmax}_L P(L \mid w) = \operatorname{argmax}_L P(w \mid L) \cdot P(L) \]
  ⇒ introduces the likelihood anyway! (which could then be simplified further)
- If you can accurately estimate $P(L)$, sure, make use of it!
- Otherwise, the least biased hypothesis (maximum entropy) is to assume a priori that all languages are equally possible: maximizing the posterior probability is then the same as maximizing the likelihood
Spelling error correction
[Figure: two panels showing the input string, the max. distance around it, the correct strings among all strings, and the solutions]
Two approaches:
                  Exact, lexicon-based    Probabilistic
correct forms:    lexicon                 any string
metric:           edit distance           probability
In this lecture:
- only a few words about the probabilistic approach (next slide)
- mainly: the exact, lexicon-based approach
Probabilistic approach summarized (1/2)
Make (one more time!) use of n-grams (both levels, characters and tokens, are combined)
w: OoV token to be corrected
c: candidate correction, out of C(w), the set of possible candidates for w
\[ \operatorname{argmax}_{c \in C(w)} P(c \mid w) = \operatorname{argmax}_{c \in C(w)} P(c) \cdot P(w \mid c) \]
P(c): language model (n-grams of tokens/words; n = 1 here, but could easily be extended to neighboring tokens (n > 1 then))
P(w|c): error model: edit distance and/or m-grams of characters
Probabilistic approach summarized (2/2)
A usual (implicit?) assumption is that $P(w \mid c)$ is many orders of magnitude higher for smaller edit distances (than for larger ones): thus closer candidates are considered first, leading to this simple algorithm, where $C_d(w)$ is the set of candidates at distance d from w:
- if $C_1(w)$ is not empty, return $\operatorname{argmax}_{c \in C_1(w)} P(c)$;
- (else) if $C_2(w)$ is not empty, return $\operatorname{argmax}_{c \in C_2(w)} P(c)$;
- etc.
For more details: see http://norvig.com/spell-correct.html
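The tiered algorithm above can be sketched as follows. This is an illustrative approximation under stated assumptions: candidates are generated by applying single-character edits to w (lowercase ASCII alphabet only), P(c) is replaced by raw unigram counts from a toy table, and the names `edits1`, `correct` and `COUNTS` are chosen for this example.

```python
import string
from collections import Counter


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts, max_distance=2):
    """Return the most frequent known word at the smallest distance that has any."""
    if word in counts:
        return word
    frontier = {word}
    for _ in range(max_distance):
        # Expand by one more edit; earlier tiers were empty, so only new hits matter.
        frontier = {e for w in frontier for e in edits1(w)}
        candidates = [c for c in frontier if c in counts]
        if candidates:
            return max(candidates, key=counts.get)  # argmax of P(c), unigram model
    return word  # no candidate within max_distance: leave the token unchanged


# Toy unigram counts standing in for the language model P(c) (assumption).
COUNTS = Counter({"forget": 5, "forgot": 3, "summary": 4, "usage": 2})
print(correct("foget", COUNTS), correct("summmary", COUNTS), correct("usqge", COUNTS))
```

Generating edits of w and filtering by the lexicon is practical for distances 1 and 2; for larger thresholds the lexicon-driven FSA search of the following slides scales better.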
Edit distance
also called “Levenshtein distance”
⇒ distance between 2 forms/strings = minimal number of transformations to change one into the other
⇒ depends on the set of transformations considered
Examples of transformations:
- insertion: exmple → example
- deletion: example → exmple
- substitution: exemple → example
- transposition: exmaple → example
Computation of edit distance (1)
Notations:
- $X_i$: $i$-th char of string $X$
- $X_i^j$: if $i \le j$: substring $X_i, \ldots, X_j$; empty string otherwise
Example: $X = \text{castle}$: $X_3 = \text{s}$, $X_4^6 = \text{tle}$, $X_1^4 = \text{cast}$, $X_1^0 = \varepsilon$
Computation of the distance $D(X, Y)$ by dynamic programming:
⇒ step by step in a chart $m$ where each cell $m_{ij}$ contains the distance between the two substrings $X_1^i$ and $Y_1^j$:
\[ m_{ij} = D(X_1^i, Y_1^j) \]
Computation of edit distance (2)
Initialization:
\[ D(X_1^0; Y_1^j) = j, \qquad D(X_1^i; Y_1^0) = i \]
Recurrence:
\[
D(X_1^i; Y_1^j) =
\begin{cases}
D(X_1^{i-1}; Y_1^{j-1}) & \text{if } X_i = Y_j \text{ (equality)} \\[4pt]
1 + \min\{ D(X_1^{i-2}; Y_1^{j-2}),\ D(X_1^{i-1}; Y_1^{j}),\ D(X_1^{i}; Y_1^{j-1}) \} & \text{else if } i \ge 2,\ j \ge 2,\ X_{i-1} = Y_j \text{ and } X_i = Y_{j-1} \\
 & \text{(transposition, deletion, insertion)} \\[4pt]
1 + \min\{ D(X_1^{i-1}; Y_1^{j-1}),\ D(X_1^{i-1}; Y_1^{j}),\ D(X_1^{i}; Y_1^{j-1}) \} & \text{else (substitution, deletion, insertion)}
\end{cases}
\]
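Computation of edit distance: computation order
[Figure: chart with X on the rows (index i) and Y on the columns (index j); cell (i, j) is computed from cells (i−1, j−1) (Id, Subs), (i−2, j−2) (Transp), (i−1, j) (Del) and (i, j−1) (Ins)]
⇒ several possible ways of computing: rowwise, columnwise or diagonal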
Computation of edit distance (3)
Example, columnwise:
for all i from 0 to |X| (size of X) do
    m_{i,0} = i
for all j from 1 to |Y| do
    m_{0,j} = j
    for all i from 1 to |X| do
        if X_i = Y_j then
            m_{i,j} = m_{i−1,j−1}
        else if i ≥ 2 and j ≥ 2 and X_{i−1} = Y_j and X_i = Y_{j−1} then
            m_{i,j} = 1 + min{ m_{i−2,j−2}; m_{i,j−1}; m_{i−1,j} }
        else
            m_{i,j} = 1 + min{ m_{i−1,j−1}; m_{i,j−1}; m_{i−1,j} }
Return m_{|X|,|Y|}
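The columnwise pseudocode translates almost line by line into Python. The sketch below keeps the same three cases; the only adaptation is that Python strings are 0-indexed, so X_i becomes `x[i - 1]`. The function name `edit_distance` is chosen for this example.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with insertion, deletion, substitution and transposition."""
    # m[i][j] holds the distance between the prefixes x[:i] and y[:j].
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i
    for j in range(1, len(y) + 1):          # one column of the chart at a time
        m[0][j] = j
        for i in range(1, len(x) + 1):
            if x[i - 1] == y[j - 1]:
                # Equality: no cost, reuse the diagonal cell.
                m[i][j] = m[i - 1][j - 1]
            elif (i >= 2 and j >= 2
                  and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                # Transposition, insertion, deletion.
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                # Substitution, insertion, deletion.
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m[len(x)][len(y)]


print(edit_distance("exmple", "exemple"), edit_distance("exmaple", "example"))
```

Both calls correspond to the worked charts of the lecture, where the bottom-right cell is 1 in each case.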
Edit Distance (example)
D(exmple; exemple):
         e  x  e  m  p  l  e
      0  1  2  3  4  5  6  7
   e  1  0  1  2  3  4  5  6
   x  2  1  0  1  2  3  4  5
   m  3  2  1  1  1  2  3  4
   p  4  3  2  2  2  1  2  3
   l  5  4  3  3  3  2  1  2
   e  6  5  4  3  4  3  2  1
D(exmaple; example):
         e  x  a  m  p  l  e
      0  1  2  3  4  5  6  7
   e  1  0  1  2  3  4  5  6
   x  2  1  0  1  2  3  4  5
   m  3  2  1  1  1  2  3  4
   a  4  3  2  1  1  2  3  4
   p  5  4  3  2  2  1  2  3
   l  6  5  4  3  3  2  1  2
   e  7  6  5  4  4  3  2  1
Spelling error correction using an FSA
Problem: approximate search of lexicalized (surface) forms = within a max. distance range
i.e. fault-tolerant recognition (within a regular language):
Find all ending paths such that the corresponding string is within a distance of at most θ of the given input string.
Remark: a trie is a special case of FSA
Finite-State Automata (FSA)
Formally:
- Q: (finite) set of states
- Σ: (finite) alphabet
- δ: arcs (mapping from Q × Σ to Q); e.g. an arc labelled a from q1 to q2 means δ(q1, a) = q2
- q0 ∈ Q: initial state
- F ⊂ Q: final states
Interface:
- initialState(): provides q0
- (q, a) = nextAfter(p, c): returns the next state and character after character 'c' starting from state 'p'
  Formally: returns Argmin_α {(q, α) ∈ Q × Σ such that α > c and δ(p, α) = q}
- isFinal(p): are we done with p? Checks whether p ∈ F or not.
[Figure: state p with its outgoing arcs, among them c followed by a, the arc a leading to state q]
Pruning criteria: cut-off edit distance
To make it useful in practice ⇒ fast ⇒ good pruning
⇒ cut-off edit distance [Oflazer 1996]:
\[ Co(X_1^n, Y_1^m) = \min_{I(m) \le i \le J(m)} D(X_1^i; Y_1^m) \]
\[ I(m) = \min(n, \max(1, m - \theta)) \qquad J(m) = \min(n, \max(1, m + \theta)) \]
Important property:
\[ Co(X, Y) > \theta \implies \forall Z \;\; D(X, Y + Z) > \theta \]
Cut-off Edit Distance: example
θ = 2, X = exmples (n = 7), Y = exam (m = 4)
I = 4 − 2 = 2, J = 4 + 2 = 6: the prefixes of X compared with Y are ex, exm, exmp, exmpl, exmple
         e  x  a  m
      0  1  2  3  4
   e  1  0  1  2  3
   x  2  1  0  1  2
   m  3  2  1  1  1
   p  4  3  2  2  2
   l  5  4  3  3  3
   e  6  5  4  4  4
   s  7  6  5  5  5
Co(X, Y) = min{2, 1, 2, 3, 4} = 1
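The example above can be reproduced with a short sketch: compute the full chart, then take the minimum of the last column over the rows I(m) to J(m). The helper names `edit_chart` and `cutoff_distance` are assumptions for this illustration, and recomputing the whole chart on every call is a simplification made for readability.

```python
def edit_chart(x, y):
    """Chart m with m[i][j] = distance between the prefixes x[:i] and y[:j]."""
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i
    for j in range(1, len(y) + 1):
        m[0][j] = j
        for i in range(1, len(x) + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            elif (i >= 2 and j >= 2
                  and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m


def cutoff_distance(x, y, theta):
    """Co(X, Y): best distance between Y and the prefixes of X of length I(m)..J(m)."""
    chart = edit_chart(x, y)
    n, m_len = len(x), len(y)
    low = min(n, max(1, m_len - theta))    # I(m)
    high = min(n, max(1, m_len + theta))   # J(m)
    return min(chart[i][m_len] for i in range(low, high + 1))


print(cutoff_distance("exmples", "exam", 2))
```

A value above θ means that no extension of Y can come back within θ of X, so the search branch starting with Y can be abandoned.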
Walk through an FSA within a θ distance range
Prefix-compatible depth-first version
Input: a string to be corrected (X), a lexicon in the form of an FSA and a maximal error threshold (θ)
Push(ε, ε, q0)
while Stack is not empty do
    Pop(Z, c, p)
    (q, a) = nextAfter(p, c)
    if (q, a) ≠ ∅ then
        Push(Z, a, p)
        Y ← Z + a
        if Co(X, Y) ≤ θ then
            Push(Y, ε, q)
            if isFinal(q) and D(X, Y) ≤ θ then
                Add Y to solutions
[Figure: path labelled Z from q0 to state p; from p, arc c has already been explored and the next arc a leads to state q]
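A possible rendering of this walk is sketched below, assuming the lexicon is stored as a trie (a special case of FSA) made of nested dictionaries. The names `build_trie`, `next_after`, `search` and the `FINAL` marker are hypothetical, ε is represented by the empty string, and the chart is recomputed from scratch for each prefix, which is simpler but slower than updating only the last column.

```python
FINAL = None  # dictionary key marking a final state (never a character)


def build_trie(words):
    """Lexicon as a trie of nested dicts: {char: child}, plus FINAL on final states."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[FINAL] = True
    return root


def next_after(node, c):
    """Smallest arc label strictly greater than c, with its target state, or None."""
    labels = sorted(k for k in node if k is not FINAL and k > c)
    return (node[labels[0]], labels[0]) if labels else None


def edit_chart(x, y):
    """Chart m with m[i][j] = distance between the prefixes x[:i] and y[:j]."""
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i
    for j in range(1, len(y) + 1):
        m[0][j] = j
        for i in range(1, len(x) + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            elif (i >= 2 and j >= 2
                  and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                m[i][j] = 1 + min(m[i - 2][j - 2], m[i][j - 1], m[i - 1][j])
            else:
                m[i][j] = 1 + min(m[i - 1][j - 1], m[i][j - 1], m[i - 1][j])
    return m


def search(x, trie, theta):
    """All lexicon forms within distance theta of x (depth-first, with pruning)."""
    solutions = []
    stack = [("", "", trie)]                  # Push(epsilon, epsilon, q0)
    while stack:
        z, c, p = stack.pop()
        step = next_after(p, c)
        if step is None:
            continue                          # no arc left to explore from p
        q, a = step
        stack.append((z, a, p))               # come back later for the arcs after a
        y = z + a
        chart = edit_chart(x, y)
        n, k = len(x), len(y)
        low, high = min(n, max(1, k - theta)), min(n, max(1, k + theta))
        if min(chart[i][k] for i in range(low, high + 1)) <= theta:   # Co(X, Y)
            stack.append((y, "", q))          # keep extending Y
            if FINAL in q and chart[n][k] <= theta:
                solutions.append(y)
    return solutions


LEXICON = build_trie(["aba", "bab", "abaaba", "ababab", "bababa", "babbab"])
print(sorted(search("ababa", LEXICON, 1)))
```

The toy lexicon lists the non-empty strings of (aba|bab)* up to length 6, so the search is restricted to those; a real FSA would represent the full regular language.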
Example: ababa vs. (aba|bab)* with θ = 1
[Figure: depth-first walk through the automaton of (aba|bab)* for X = ababa and θ = 1; arcs are labelled a or b, states are numbered 1 to 5, and nodes carry bracketed values [0], [1] or [2]]
Chart for X = ababa (rows) and Y = abaaba (columns):
         a  b  a  a  b  a
      0  1  2  3  4  5  6
   a  1  0  1  2  3  4  5
   b  2  1  0  1  2  3  4
   a  3  2  1  0  1  2  3
   b  4  3  2  1  1  1  2
   a  5  4  3  2  1  1  1
Solutions: abaaba, ababab, bababa
Implementation issues
1. Efficient computation of Co with the previously described chart:
   ⇒ recomputation of the last column (m) only
   ⇒ computation of D and Co in the same loop
2. Y ← Z + a: beware (local copies, pointers, etc.). Similarly, do not naively implement "Push(Y, q)".
3. In some (programming) languages, it could be worth transposing the algorithm: Y (which is changing) for rows and X for columns
Limitations?
- weighting
  Example: diacritics, uppercase
  eleves → élèves, aloves → élèves
- specific transformations
  Example: typing errors
  tupe → type; more generally: deuit → fruit, usqge → usage
- whitespaces
  theothers → the others, othe rs → others
⇒ 3 aspects of the same problem
Solution: generalization of the edit distance: weighted edit distance
Weighted Edit Distance
weighted transformations such that:
- $W(\mathrm{Id}) = 0$
- $W(f) > 0$ for $f \neq \mathrm{Id}$
- $W(f^{-1}) = W(f)$
- $W(f \circ g) = W(f) + W(g)$
\[ D(X; Y) = \min_{f : Y = f(X)} W(f) \]
⇒ It is actually a distance on $\Sigma^*$
Difference with the preceding distance: $W(f)$ is no longer necessarily the same for every transformation (previously $W(f) = 1$).
Remarks
1. Distance on Σ* ⇒ ∀X, Y, ∃f: Y = f(X)
   True if Ins and Del are in the transformation set
2. Non-overlapping transformations, i.e. a transformation cannot be applied to the result of the previous transformation
   Counter-example: ba →(Transp) ab →(Sub) ac
Coherence Constraints
"Semantic Integrity":
- W(Del) + W(Ins(x)) > W(Sub(x))
- W(Split) < W(Ins(x)) (which implies W(Merge) < W(Del))
- W(Transp) < W(Ins(x)) + W(Del)
⇒ Introducing a new f such that $f = \circ_i f_i$ is useful if and only if $W(f) < \sum_i W(f_i)$
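Weighted Edit Distance: computation
[Figure: strings X and Y aligned, showing the input range of a transformation f, with the chart values min1 (before f) and min2 (after f)]
\[ \text{min2} = \min_f \left[ \text{min1}(f) + W(f) \right] \]
(min1 and min2 are the values stored in the chart)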
Weighted Edit Distance: computation (2)
Initialization:
\[ D(X_1^0; Y_1^j) = j, \qquad D(X_1^i; Y_1^0) = i \]
Recurrence (cases examined by increasing W(f)):
\[
D(X_1^i; Y_1^j) =
\begin{cases}
D(X_1^{i-1}; Y_1^{j-1}) & \text{if } X_i = Y_j \text{ (equality)} \\
W(f) + \min\{\text{min1}(f)\} & \text{for all applicable transformations } f \text{ of the same weight} \\
\ldots & \text{for all possible weights}
\end{cases}
\]
⇒ The optimization lies in the grouping of similar cases: same weight and compatible transformations
(Example: previously Transp and Sub were incompatible because W(Transp) < 2 W(Sub). But each of them is compatible with Del and Ins.)
Note: {min1(f)} is the set of all the minimal values for all possible f at this point; they shall, of course, already be computed at this point (loop condition)
Example
D(example; exemple):
         e  x  e  m  p  l  e
      0  1  2  3  4  5  6  7
   e  1  0  1  2  3  4  5  6
   x  2  1  0  1  2  3  4  5
   a  3  2  1  1  2  3  4  5
   m  4  3  2  2  1  2  3  4
   p  5  4  3  3  2  1  2  3
   l  6  5  4  4  3  2  1  2
   e  7  6  5  4  4  3  2  1
D(exémple; exemple), with W(é↔e) = 0.1:
         e    x    e    m    p    l    e
      0  1    2    3    4    5    6    7
   e  1  0    1    2    3    4    5    6
   x  2  1    0    1    2    3    4    5
   é  3  2    1    0.1  1.1  2.1  3.1  4.1
   m  4  3    2    1.1  0.1  1.1  2.1  3.1
   p  5  4    3    2.1  1.1  0.1  1.1  2.1
   l  6  5    4    3.1  2.1  1.1  0.1  1.1
   e  7  6    5    4    3.1  2.1  1.1  0.1
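The second chart can be checked with a small weighted variant of the dynamic program. This sketch is an assumption-laden simplification: it takes the minimum over all applicable transformations directly rather than grouping them by weight as described in the lecture, and the parameter names (`sub_cost`, `ins`, `dele`, `transp`) are chosen for this example. Substitution weights are looked up symmetrically, in line with W(f⁻¹) = W(f).

```python
def weighted_edit_distance(x, y, sub_cost=None, ins=1.0, dele=1.0, transp=1.0):
    """Edit distance where each transformation carries its own weight."""
    sub_cost = sub_cost or {}

    def w_sub(a, b):
        if a == b:
            return 0.0                      # identity costs nothing
        # Symmetric lookup; unlisted substitutions default to weight 1.
        return sub_cost.get((a, b), sub_cost.get((b, a), 1.0))

    m = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        m[i][0] = m[i - 1][0] + dele
    for j in range(1, len(y) + 1):
        m[0][j] = m[0][j - 1] + ins
    for j in range(1, len(y) + 1):
        for i in range(1, len(x) + 1):
            best = min(m[i - 1][j - 1] + w_sub(x[i - 1], y[j - 1]),
                       m[i - 1][j] + dele,
                       m[i][j - 1] + ins)
            if (i >= 2 and j >= 2
                    and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]):
                best = min(best, m[i - 2][j - 2] + transp)
            m[i][j] = best
    return m[len(x)][len(y)]


E_ACUTE = "\u00e9"  # the single code point for e with acute accent
weights = {(E_ACUTE, "e"): 0.1}
print(weighted_edit_distance("ex" + E_ACUTE + "mple", "exemple", weights))
print(weighted_edit_distance("example", "exemple", weights))
```

The accented character is written as an explicit code point so that the comparison does not depend on how the source file encodes it.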
Keypoints
- One has to handle out of vocabulary forms
- Edit (Levenshtein) distance, weighted edit distance
- Spelling error correction with FSA
References
- K. Oflazer, "Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction", Computational Linguistics, Volume 22, Number 1, 1996.
- Section 8.2 in M. Rajman (editor), "Speech and Language Engineering", EPFL Press, 2006.
- Sections 3.10 and 3.11 in D. Jurafsky and J. H. Martin, "Speech and Language Processing", Prentice Hall, 2008 (2nd edition).
- Section 3.3 in C. D. Manning, P. Raghavan and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.