Evaluation of pairwise string alignment methods
Martijn Wieling, Jelena Prokic and John Nerbonne
Department of Computational Linguistics, University of Groningen
Feb. 20, 2009, Kampala
Martijn Wieling, Jelena Prokic and John Nerbonne 1/20
Overview
Introduction
Dataset and gold standard
Algorithms
Evaluation method
Results
Discussion
Introduction
There are many string-similarity measures based on pairwise string alignment (PWA)
Evaluations at the aggregate level show almost no performance difference between PWA methods (Heeringa et al., 2006; Wieling et al., 2007)
More sensitive evaluation techniques needed to determine effect at the alignment level
Dataset
Bulgarian dialect data
  Transcriptions of 152 words in 197 sites
  98 phonetic types
Gold standard
Automatically generated from manually corrected multiple alignment

L1:  j  "A  -  -  -  -
L2:  -  "A  s  -  -  -
L3:  -  "6  s  -  -  -
L4:  j  "A  s  -  -  -
L5:  j  "A  z  e  k  a

Each transcription in the gold standard is aligned with all others (L1:L2, L1:L3, ...): 3.5 million pairwise alignments
Gap-gap alignments are removed

L2:  "A  s
L3:  "6  s
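A minimal sketch of how pairwise gold-standard alignments could be read off a multiple alignment, using the five rows shown above. The function name `pairwise_from_msa` and the dictionary layout are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

GAP = "-"

def pairwise_from_msa(rows):
    """Yield ((name_a, name_b), columns) for every pair of rows.

    rows maps a site name to its list of aligned tokens (equal lengths).
    Columns in which both rows have a gap are dropped.
    """
    for (name_a, a), (name_b, b) in combinations(rows.items(), 2):
        columns = [(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP)]
        yield (name_a, name_b), columns

# The multiple alignment from the slide (tokens in X-SAMPA-like notation).
msa = {
    "L1": ["j", '"A', "-", "-", "-", "-"],
    "L2": ["-", '"A', "s", "-", "-", "-"],
    "L3": ["-", '"6', "s", "-", "-", "-"],
    "L4": ["j", '"A', "s", "-", "-", "-"],
    "L5": ["j", '"A', "z", "e", "k", "a"],
}

pairs = dict(pairwise_from_msa(msa))
print(pairs[("L2", "L3")])
```

Five rows give ten pairs; the L2:L3 pair keeps only the two columns in which at least one row has a segment.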
Pairwise alignment algorithms
Evaluated algorithms
  Regular Levenshtein algorithm
  Levenshtein algorithm with swap-operation
  Levenshtein algorithm with PMI generated segment distances
  Pair Hidden Markov Model - Viterbi algorithm
Regular Levenshtein algorithm
One of the most popular pairwise string alignment methods
Vowel-consonant alignment restriction

j"As    delete j      1
"As     subst. s/z    1
"Az     insert i      1
"Azi
        total         3

j  "A  s  -
-  "A  z  i
1      1  1
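The edit distance above can be sketched as follows. This is an illustration under assumptions: the vowel inventory `VOWELS` is a hypothetical stand-in for the real phonetic classes, and the vowel-consonant restriction is modelled by giving such substitutions infinite cost, so the aligner falls back on an insertion plus a deletion.

```python
# Hypothetical vowel inventory (X-SAMPA-like symbols); an assumption for illustration.
VOWELS = set("aeiouAEIOU@67")

def is_vowel(token):
    """Treat a token as a vowel if its base symbol (stress mark removed) is in VOWELS."""
    return token.lstrip('"')[:1] in VOWELS

def levenshtein_vc(a, b):
    """Levenshtein distance over token lists; vowel-consonant substitutions are forbidden."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(1, m + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                sub = 0
            elif is_vowel(x) == is_vowel(y):
                sub = 1      # vowel-vowel or consonant-consonant
            else:
                sub = inf    # vowel-consonant: not allowed
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[n][m]

print(levenshtein_vc(["j", '"A', "s"], ['"A', "z", "i"]))  # 3, as on the slide
```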
Levenshtein with swap-operation
Bulgarian dialect data often contains metathesis
Implementation: the swap operation is used whenever possible

v  r   "7
v  "7  r
   ><       1

But only involving exactly the same symbols

v  r   "7
v  "a  r
   1   1
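One way to add the swap operation is the restricted Damerau-style recurrence sketched below: two adjacent tokens may be exchanged at cost 1, but only when exactly the same two symbols appear in reverse order. The function name is an assumption, and the vowel-consonant restriction is left out to keep the sketch short.

```python
def levenshtein_swap(a, b):
    """Levenshtein distance over token lists with an adjacent-swap operation of cost 1."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
            # Swap: the last two tokens of a equal the last two tokens of b, reversed.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1] and a[i - 1] != a[i - 2]):
                best = min(best, d[i - 2][j - 2] + 1)
            d[i][j] = best
    return d[n][m]

print(levenshtein_swap(["v", "r", '"7'], ["v", '"7', "r"]))  # swap applies: 1
print(levenshtein_swap(["v", "r", '"7'], ["v", '"a', "r"]))  # no swap: 2
```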
Levenshtein with PMI segment distances (1)
Pointwise Mutual Information (PMI): assesses degree of statistical dependence between aligned segments (x and y)

\[ \mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right) \]

p(x, y): relative occurrence of the aligned segments x and y in the whole dataset
p(x) and p(y): relative occurrence of x or y in the whole dataset
The greater the PMI value, the more segments tend to cooccur in correspondences
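A small sketch of the PMI computation from a list of aligned segment pairs. The counting scheme is an assumption: p(x, y) is taken as the share of aligned pairs that are {x, y} regardless of order, and p(x) as the share of all aligned segment tokens that are x.

```python
import math
from collections import Counter

def pmi_table(aligned_pairs):
    """Return {(x, y): PMI} for every unordered pair observed in aligned_pairs."""
    pair_counts = Counter(tuple(sorted(p)) for p in aligned_pairs)
    seg_counts = Counter(s for p in aligned_pairs for s in p)
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())
    table = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x = seg_counts[x] / n_segs
        p_y = seg_counts[y] / n_segs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table

# Toy data: two segments that are only ever aligned with themselves.
toy = [("a", "a"), ("a", "a"), ("s", "s"), ("s", "s")]
print(pmi_table(toy))
```

In the toy data p(a, a) = 0.5 and p(a) = 0.5, so PMI(a, a) = log2(0.5 / 0.25) = 1.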
Levenshtein with PMI segment distances (2)
Algorithm:
  Initially all string pairs are aligned using the Levenshtein algorithm
  Distance between tokens x and y is set to: 0 - PMI(x, y)
  Strings are then repeatedly re-aligned with the Levenshtein algorithm using these token distances, until the alignments remain constant
Advantage: of two alignments that plain Levenshtein scores equally, the second is not generated anymore

[Figure: two alignments of v "7 n with v "7 n’ k @, each with three operations of cost 1; column layout not recoverable]
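The iterative procedure can be sketched roughly as below. Several details are assumptions that are not on the slide: gaps are counted as segments when estimating PMI, the distance 0 - PMI is shifted by the maximum PMI so that all costs are non-negative, unseen pairs receive the largest observed distance, the vowel-consonant restriction is omitted, and `max_iter` caps the loop.

```python
import math
from collections import Counter

GAP = "-"

def align(a, b, cost):
    """Global alignment minimizing the summed cost(x, y); returns (x, y) columns."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((d[i - 1][j - 1] + cost(a[i - 1], b[j - 1]), (i - 1, j - 1)))
            if i > 0:
                options.append((d[i - 1][j] + cost(a[i - 1], GAP), (i - 1, j)))
            if j > 0:
                options.append((d[i][j - 1] + cost(GAP, b[j - 1]), (i, j - 1)))
            d[i][j], back[i][j] = min(options)
    cols, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        cols.append((a[i - 1] if pi < i else GAP, b[j - 1] if pj < j else GAP))
        i, j = pi, pj
    return cols[::-1]

def unit_cost(x, y):
    """Plain Levenshtein costs: 0 for identical tokens, 1 otherwise."""
    return 0.0 if x == y else 1.0

def pmi_costs(alignments):
    """Build a cost function from PMI values estimated on the current alignments."""
    pairs = Counter(tuple(sorted(col)) for cols in alignments for col in cols)
    segs = Counter(s for cols in alignments for col in cols for s in col)
    n_pairs, n_segs = sum(pairs.values()), sum(segs.values())
    pmi = {p: math.log2((c / n_pairs) / ((segs[p[0]] / n_segs) * (segs[p[1]] / n_segs)))
           for p, c in pairs.items()}
    top = max(pmi.values())
    dist = {p: top - v for p, v in pmi.items()}   # shifted 0 - PMI (assumption)
    worst = max(dist.values())                    # cost for unseen pairs (assumption)
    return lambda x, y: dist.get(tuple(sorted((x, y))), worst)

def pmi_align(string_pairs, max_iter=20):
    """Re-align until the alignments stop changing (or max_iter is reached)."""
    alignments = [align(a, b, unit_cost) for a, b in string_pairs]
    for _ in range(max_iter):
        cost = pmi_costs(alignments)
        new = [align(a, b, cost) for a, b in string_pairs]
        if new == alignments:
            break
        alignments = new
    return alignments

data = [
    (["v", '"7', "n"], ["v", '"7', "n'", "k", "@"]),
    (["v", "r", '"7'], ["v", '"7', "r"]),
    (["v", "l", '"7', "k"], ["v", '"7', "l", "k"]),
]
for cols in pmi_align(data):
    print(cols)
```

On a toy set this small the learned distances are not meaningful; the point is only the control flow of estimate, re-align, and stop when nothing changes.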
Pair Hidden Markov Model
Adapted Hidden Markov Model: 2 parallel output streams
Introduced to linguistics by Mackay and Kondrak (2005)
Large number of probabilities to be estimated in training
Probabilities linguistically sensible (Wieling et al., 2007)
After training, the Viterbi algorithm yields the most probable alignment
Evaluation method (1)
All pairwise alignments are generated for every algorithm
Insertion-deletion sequences are standardized

[Figure: alignment of v "i A with v "i j, before and after standardization; column layout not recoverable]

Two-to-one mappings are standardized

[Figure: alignment involving v "A r x, before and after standardization; column layout not recoverable]
Evaluation method (2)
Each token alignment is converted to a single symbol
Alignment 1:
v  l   "7  k
v  "7  l   k

Alignment 2:
v  l  "7  -  k
v  -  "7  l  k

Generated strings:
v/v l/"7 "7/l k/k
v/v l/- "7/"7 -/l k/k
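The conversion can be written in a few lines. In this sketch an alignment is assumed to be a list of (top, bottom) token columns with "-" for a gap; the function name `to_symbols` is illustrative.

```python
def to_symbols(columns):
    """Turn each aligned column (x, y) into the single symbol 'x/y'."""
    return [f"{x}/{y}" for x, y in columns]

# The two alignments of v l "7 k with v "7 l k shown on the slide.
substitution_alignment = [("v", "v"), ("l", '"7'), ('"7', "l"), ("k", "k")]
indel_alignment = [("v", "v"), ("l", "-"), ('"7', '"7'), ("-", "l"), ("k", "k")]

print(" ".join(to_symbols(substitution_alignment)))
print(" ".join(to_symbols(indel_alignment)))
```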
Evaluation method (3)
The generated strings can be aligned to determine their distance

v/v  l/"7  "7/l   -    k/k
v/v  l/-   "7/"7  -/l  k/k
     1     1      1

For every algorithm the generated strings are aligned with the generated strings of the gold standard (GS)
The distance between an algorithm and the GS is simply the sum of all generated string distances
Baseline: Hamming alignments (only substitutions)
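A sketch of this comparison, assuming each alignment has already been converted to a list of pair symbols. Plain unit-cost Levenshtein is used between symbol sequences, and `total_distance` (a hypothetical helper) sums the distances over all string pairs.

```python
def symbol_distance(s, t):
    """Unit-cost Levenshtein distance between two sequences of pair symbols."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        cur = [i]
        for j, y in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (x != y)))    # substitution or match
        prev = cur
    return prev[-1]

def total_distance(algorithm_strings, gold_strings):
    """Sum of distances between an algorithm's generated strings and the gold standard's."""
    return sum(symbol_distance(s, t) for s, t in zip(algorithm_strings, gold_strings))

a = ["v/v", 'l/"7', '"7/l', "k/k"]
g = ["v/v", "l/-", '"7/"7', "-/l", "k/k"]
print(symbol_distance(a, g))  # 3, matching the three operations on the slide
```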
Quantitative results: segment distances
PMI distances: D(a, a) < D(V, V) (t < −13, p < .001)
PHMM substitution probabilities: P(a, a) > P(V, V) > P(V, C) (t’s > 9, p < .001)
PMI versus PHMM log odds transformed substitution scores: Spearman’s ρ = −.965, p < .001 (indels: Spearman’s ρ = −.736, p < .001)
Quantitative results: alignments
MS: Number of misaligned segments
Error rate (E): MS / 15898147 (aligned segments in GS)
0 ≤ E ≤ 2
IA: Number of incorrect alignments with respect to the GS

                    MS        (E)        IA       (%)
Hamming             2510094   (0.1579)   726844   (20.92%)
Levenshtein         490703    (0.0309)   191674   (5.52%)
Levenshtein PMI     399216    (0.0251)   156440   (4.50%)
Levenshtein swap    392345    (0.0247)   161834   (4.66%)
PHMM Viterbi        362423    (0.0228)   160896   (4.63%)
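The error rates in the table follow directly from the definition E = MS / 15898147. The short check below recomputes them from the MS column; the variable names are illustrative.

```python
GS_SEGMENTS = 15898147  # aligned segments in the gold standard

misaligned = {
    "Hamming": 2510094,
    "Levenshtein": 490703,
    "Levenshtein PMI": 399216,
    "Levenshtein swap": 392345,
    "PHMM Viterbi": 362423,
}

# Error rate per method, rounded to four decimals as in the table.
error_rates = {name: round(ms / GS_SEGMENTS, 4) for name, ms in misaligned.items()}

for name, e in error_rates.items():
    print(f"{name:<18}{e:.4f}")
```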
Qualitative results (1)
No perfect performance possible:
[Figure: pairwise alignments among three transcriptions that begin with p "ô and end in v j 7 t, v n i o and v n i j 7 t; column layout not recoverable]
Qualitative results (2)
Problems of Levenshtein (and PMI):
  1. Detecting correct alignments of a vowel with a consonant
  2. Detecting metathesis
  3. Aligning one consonant with either of two other consonants
Problems of Levenshtein swap:
  Problems 1 and 3 of Levenshtein
  Applying metathesis too often

b r @ nj "e
bj @ r "A n i
1 >< 1 1 1 1

Problems of PHMM:
  Segment distances cause wrong alignments of vowels with consonants which appear often in swaps
Discussion
PHMM performs best at segment level
  Slow to train: multiple hours
Quicker, clearer and also good alignments: Levenshtein PMI / Swap
Interesting further research:
  Combine PMI and Swap Levenshtein methods
  Verify results against a gold standard of another dataset
Any questions?
Thank You!