
Evaluation of pairwise string alignment methods

Martijn Wieling, Jelena Prokic and John Nerbonne

Department of Computational Linguistics, University of Groningen

Feb. 20, 2009, Kampala


Overview

Introduction
Dataset and gold standard
Algorithms
Evaluation method
Results
Discussion


Introduction

There are many string-similarity measures based on pairwise string alignment (PWA)
Evaluations at the aggregate level show almost no performance difference between PWA methods (Heeringa et al., 2006; Wieling et al., 2007)

More sensitive evaluation techniques needed to determine effect at the alignment level


Dataset

Bulgarian dialect data
Transcriptions of 152 words in 197 sites
98 phonetic types


Gold standard

Automatically generated from manually corrected multiple alignment

L1: j "A - - - -
L2: - "A s - - -
L3: - "6 s - - -
L4: j "A s - - -
L5: j "A z e k a

Each transcription in the gold standard is aligned with all others (L1:L2, L1:L3, ...): 3.5 million pairwise alignments
Gap-gap alignments are removed

L2: "A s
L3: "6 s
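The projection from the multiple alignment to a pairwise alignment can be sketched as follows. This is a minimal illustration, assuming each row is stored as a list of segments with "-" as the gap symbol; the function name `pairwise_from_msa` is hypothetical and not taken from the talk.

```python
GAP = "-"  # gap symbol, as used in the multiple alignment above


def pairwise_from_msa(row_a, row_b):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows contain a gap carry no information
    about the pair, so they are dropped.
    """
    return [(a, b) for a, b in zip(row_a, row_b)
            if not (a == GAP and b == GAP)]


# Rows L2 and L3 of the example multiple alignment
l2 = ["-", '"A', "s", "-", "-", "-"]
l3 = ["-", '"6', "s", "-", "-", "-"]

pair_l2_l3 = pairwise_from_msa(l2, l3)
print(pair_l2_l3)
```

For L2 and L3 only the two columns holding segments survive, which matches the reduced L2:L3 alignment shown above.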


Pairwise alignment algorithms

Evaluated algorithms
Regular Levenshtein algorithm
Levenshtein algorithm with swap-operation
Levenshtein algorithm with PMI generated segment distances
Pair Hidden Markov Model - Viterbi algorithm


Regular Levenshtein algorithm

One of the most popular pairwise string alignment methods
Vowel-consonant alignment restriction

j"As    delete j      1
"As     subst. s/z    1
"Az     insert i      1
"Azi
                      3

j "A s -
- "A z i
1    1 1
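A minimal sketch of the Levenshtein distance with a vowel-consonant restriction, applied to the example above. The vowel inventory `VOWELS`, the helper `is_vowel`, and the choice to give a vowel-consonant substitution infinite cost are assumptions made for illustration; the talk does not specify how the restriction is implemented.

```python
import math

# Hypothetical vowel inventory for illustration only; the real dataset
# has 98 phonetic types.
VOWELS = set("aeiouAEIOU@67")


def is_vowel(segment):
    # Ignore a leading stress mark (") when classifying the segment.
    return segment.lstrip('"')[0] in VOWELS


def substitution_cost(a, b):
    if a == b:
        return 0
    if is_vowel(a) != is_vowel(b):
        return math.inf  # assumption: vowel-consonant alignment is forbidden
    return 1


def levenshtein(s, t):
    """Edit distance between two segment lists with unit indel cost."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                                  # deletion
                d[i][j - 1] + 1,                                  # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
    return d[n][m]


example_distance = levenshtein(["j", '"A', "s"], ['"A', "z", "i"])
print(example_distance)  # 3: delete j, substitute s/z, insert i
```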


Levenshtein with swap-operation

Bulgarian dialect data often contains metathesis

Implementation: the swap operation is used whenever possible

v r "7
v "7 r
  >< 1

But only involving exactly the same symbols

v r "7
v "a r
  1 1
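One way to sketch the swap operation is the restricted Damerau-Levenshtein recurrence, in which two adjacent segments may be transposed at cost 1 only when exactly the same two symbols occur in reverse order. This is an assumption about the implementation, and the sketch leaves out the vowel-consonant restriction; `levenshtein_swap` is a hypothetical name.

```python
def levenshtein_swap(s, t):
    """Edit distance over segment lists with an adjacent-swap operation."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap: the last two segments of both prefixes are the same
            # two symbols in opposite order.
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]
                    and s[i - 1] != s[i - 2]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


swap_distance = levenshtein_swap(["v", "r", '"7'], ["v", '"7', "r"])
no_swap_distance = levenshtein_swap(["v", "r", '"7'], ["v", '"a', "r"])
print(swap_distance, no_swap_distance)  # 1 2
```

The first pair is handled by a single swap, while the second pair differs in one symbol and therefore falls back to two ordinary operations.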


Levenshtein with PMI segment distances (1)

Pointwise Mutual Information (PMI): assesses degree of statistical dependence between aligned segments (x and y)

\[ \mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right) \]

p(x, y): relative occurrence of the aligned segments x and y in the whole dataset
p(x) and p(y): relative occurrence of x or y in the whole dataset

The greater the PMI value, the more segments tend to co-occur in correspondences
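A minimal sketch of how PMI values could be estimated from a collection of aligned segment pairs. The counting scheme (unordered pairs, segment frequencies taken over all aligned positions) and the name `pmi_scores` are assumptions for illustration, not the authors' exact procedure.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Return PMI(x, y) for every segment pair seen in aligned_pairs."""
    # Treat (x, y) and (y, x) as the same correspondence.
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    seg_counts = Counter(seg for pair in aligned_pairs for seg in pair)
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())

    scores = {}
    for (x, y), count_xy in pair_counts.items():
        p_xy = count_xy / n_pairs
        p_x = seg_counts[x] / n_segs
        p_y = seg_counts[y] / n_segs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy data, invented for illustration
toy_pairs = [("a", "a"), ("a", "a"), ("a", "o"), ("s", "s")]
scores = pmi_scores(toy_pairs)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```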


Levenshtein with PMI segment distances (2)

Algorithm:
Initially all string pairs are aligned using the Levenshtein algorithm
Distance between tokens x and y is set to: 0 - PMI(x, y)
Strings are repeatedly aligned with the Levenshtein algorithm using the token distances until alignments remain constant

Advantage: the second alignment below is no longer generated

v "7 n
v "7 n’ k @
1 1 1

v "7 n
v "7 n’ k @
1 1 1
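The iterative procedure can be sketched as a fixed-point loop. Only the control flow is shown; `align` and `estimate_distances` are hypothetical callables standing in for a weighted Levenshtein aligner and for the step that sets each token distance to 0 - PMI(x, y), and the `max_rounds` safeguard is an added assumption.

```python
def align_until_stable(string_pairs, align, estimate_distances, max_rounds=100):
    """Re-align all string pairs until the alignments stop changing.

    align(s, t, distances) -> alignment; distances is None in the first
        round, meaning plain Levenshtein costs.
    estimate_distances(alignments) -> new token distances.
    """
    distances = None
    previous = None
    alignments = []
    for _ in range(max_rounds):
        alignments = [align(s, t, distances) for s, t in string_pairs]
        if alignments == previous:
            break  # alignments remained constant: converged
        previous = alignments
        distances = estimate_distances(alignments)
    return alignments, distances
```

The loop stops as soon as one round reproduces the alignments of the previous round.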


Pair Hidden Markov Model

Adapted Hidden Markov Model: 2 parallel output streams
Introduced to linguistics by Mackay and Kondrak (2005)
Large number of probabilities to be estimated in training
Probabilities linguistically sensible (Wieling et al., 2007)
After training, the Viterbi algorithm yields the most probable alignment


Evaluation method (1)

All pairwise alignments are generated for every algorithm
Insertion-deletion sequences are standardized

v "i A      v "i A
v "i j      v "i j

Two-to-one mappings are standardized

v "ô" x      v "ô" x
v "A r x     v "A r x



Each token alignment is converted to a single symbol

v l "7 k      v l "7 - k
v "7 l k      v - "7 l k

Generated strings:

v/v l/"7 "7/l k/k
v/v l/- "7/"7 -/l k/k



The generated strings can be aligned to determine their distance

v/v l/"7 "7/l k/k
v/v l/- "7/"7 -/l k/k
1 1 1

For every algorithm the generated strings are aligned with the generated strings of the gold standard (GS)
The distance between an algorithm and the GS is simply the sum of all generated string distances
Baseline: Hamming alignments (only substitutions)
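The evaluation step can be sketched as follows: each aligned column becomes one symbol of the form "a/b", and the two generated strings are compared with a plain edit distance. The helper names `to_symbols` and `edit_distance` are hypothetical, and "-" is assumed to be the gap symbol.

```python
def to_symbols(row_a, row_b):
    """Turn a pairwise alignment (two gapped rows) into one symbol per column."""
    return [f"{a}/{b}" for a, b in zip(row_a, row_b)]


def edit_distance(s, t):
    """Unit-cost Levenshtein distance between two symbol lists."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # match or substitution
            ))
        previous = current
    return previous[-1]


# The two alignments from the example above
first = to_symbols(["v", "l", '"7', "k"], ["v", '"7', "l", "k"])
second = to_symbols(["v", "l", '"7', "-", "k"], ["v", "-", '"7', "l", "k"])

alignment_distance = edit_distance(first, second)
print(first)
print(second)
print(alignment_distance)  # 3
```

Summing this distance over all pairwise alignments of one algorithm against the corresponding gold standard alignments gives that algorithm's total distance to the GS.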


Quantitative results: segment distances

PMI distances: D(a, a) < D(V, V) (t < −13, p < .001)

PHMM substitution probabilities: P(a, a) > P(V, V) > P(V, C) (t’s > 9, p < .001)

PMI versus PHMM log-odds transformed substitution scores: Spearman’s ρ = −.965, p < .001
(indels: Spearman’s ρ = −.736, p < .001)












Quantitative results: alignments

MS: Number of misaligned segments
Error rate (E): MS / 15898147 (aligned segments in GS)

0 ≤ E ≤ 2

IA: Number of incorrect alignments with respect to the GS

                    MS (E)              IA (%)
Hamming             2510094 (0.1579)    726844 (20.92%)
Levenshtein         490703 (0.0309)     191674 (5.52%)
Levenshtein PMI     399216 (0.0251)     156440 (4.50%)
Levenshtein swap    392345 (0.0247)     161834 (4.66%)
PHMM Viterbi        362423 (0.0228)     160896 (4.63%)
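The error rates in the table follow directly from the definition E = MS / 15898147. A short check, using only the MS counts reported above (the variable names are illustrative):

```python
ALIGNED_SEGMENTS_GS = 15898147  # aligned segments in the gold standard

misaligned_segments = {
    "Hamming": 2510094,
    "Levenshtein": 490703,
    "Levenshtein PMI": 399216,
    "Levenshtein swap": 392345,
    "PHMM Viterbi": 362423,
}

error_rates = {
    name: ms / ALIGNED_SEGMENTS_GS
    for name, ms in misaligned_segments.items()
}

for name, rate in error_rates.items():
    print(f"{name:<18} {rate:.4f}")
```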




Qualitative results (1)

No perfect performance possible:

p "ô" v j 7 t
p "ô" v n i o

p "ô" v n i j 7 t
p "ô" v j 7 t
p "ô" v n i o






Problems of Levenshtein (and PMI):
1. Detecting correct alignments of a vowel with a consonant
2. Detecting metathesis
3. Aligning one consonant with either of two other consonants

Problems of Levenshtein swap:
Problems 1 and 3 of Levenshtein
Applying metathesis too often

b r @ nj "e
bj @ r "A n i
1 >< 1 1 1 1

Problems of PHMM:
Segment distances cause wrong alignments of vowels with consonants which appear often in swaps








Discussion

PHMM performs best at segment level
Slow to train: multiple hours

Quicker, clearer and also good alignments: Levenshtein PMI / Swap

Interesting further research:
Combine PMI and Swap Levenshtein methods
Verify results against a gold standard of another dataset




Any questions?

Thank You!

