Effective Indexing and Filtering for Similarity Search in Large
Biosequence Databases
O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366.
Washington, DC. March 2003.
BMI 731 - Winter'04 2
Overview
• Applications of queries
• Background on queries
• Current problem
• Solutions and our solution
• Comparison experiments and results
• Future work
Queries in general
• We need a metric distance function
– To measure the (dis)similarity between objects
• Dynamic programming algorithm
– O(|string1| * |string2|) time and space, i.e. O(n^2) where n is the length of the strings
– Especially bad for genetic sequence queries, where the sequences are long
2 kinds of queries
• ε-range queries
– Retrieve all objects similar to the query to more than a certain degree
2 kinds of queries
• k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
– No domain knowledge necessary
– Ex: 4-NN
2 kinds of queries
• ε-range queries
– Require domain knowledge: data distribution & distance definition
– ε too small: none returned
2 kinds of queries
• ε-range queries
– ε too large: all returned
Measuring similarity
• We need a metric distance function
– To measure the (dis)similarity between objects
• Edit Distance (ED)
– Three kinds of operations: insert (I), delete (D), replace (R)
– Example: ACTTAGC to AATGATAG, ED = 4
    A C T - - T A G C
    A A T G A T A G -
      R   I I       D
– Dynamic programming algorithm: O(mn) time and space
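The dynamic programming computation of ED can be sketched as follows. This is a minimal illustration that assumes unit cost for each insert, delete and replace; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) in O(mn) time and space."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between the prefixes s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                 # delete s[i-1]
                dp[i][j - 1] + 1,                 # insert t[j-1]
                dp[i - 1][j - 1] + replace_cost,  # replace or match
            )
    return dp[m][n]


# The slide's example pair, for which the slide reports ED = 4.
print(edit_distance("ACTTAGC", "AATGATAG"))
```

The full table is kept here to mirror the O(mn) space bound stated on the slide; keeping only two rows would be enough if just the distance is needed.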
DPA (Dynamic Programming Algorithm)
String/Genome Data
• Asks for the substrings in the database that are most similar to the given string.
• BLAST has ε-range queries
– Naïve search (linear scan)
– Scalability problems
• How to Handle Size
– Partial information rather than the whole database
– Approximate the string data (compress)
– May fit in memory; may be used for indexing, clustering
How to Handle Size
• 3 approaches to make use of compressed data
1. Prune irrelevant data; do I/O and calculate exact values only for the non-pruned entries (especially ε-range queries)
2. Get approximate answers with virtually no I/O (I/O only for the answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries
Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work
Big Picture: General Approach Step by Step
• Transform (large) string data into (hopefully smaller) multi-dimensional vectors
• Develop a distance function df in the vector space to approximate the string similarity
• Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)
• Implement one of the three approaches mentioned (query)
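The steps above can be sketched end to end for the first approach (prune, then compute exact values for the survivors). This is an illustrative sketch under several assumptions: strings are mapped to plain A, C, G, T frequency vectors, the vector distance is a frequency-style lower bound on the edit distance, and a Python list stands in for the multi-dimensional index. All function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    """Frequency vector of single characters, in A, C, G, T order."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def fd1(u, v):
    """Frequency-style distance between two vectors (lower-bounds edit distance)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s, t):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def range_query(database, query, eps):
    """Return every string within edit distance eps of the query."""
    # Preprocessing: transform strings into vectors (a real system would index them).
    vectors = [fv1(s) for s in database]
    qv = fv1(query)
    # Filter: a lower bound above eps proves the edit distance is above eps too.
    candidates = [s for s, v in zip(database, vectors) if fd1(v, qv) <= eps]
    # Refine: compute the exact edit distance only for the surviving candidates.
    return [s for s in candidates if edit_distance(s, query) <= eps]


db = ["ACTTAGC", "AATGATAG", "CCCCCCCC", "AATGATAC"]
print(range_query(db, "AATGATAG", 1))
```

Because the filter uses a lower bound, it can only let false positives through, never discard a true answer; the refine step removes the false positives.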
Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation into vector space: overlapping windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function
Query
1. Transformation: query sequence → vector
2a. Approximate query (k-NN or ε-range) using the index of vectors: done. The vectors returned represent most of the k-NN (or the vectors in ε-range) plus some false positives.
2b. Exact query (k-NN or ε-range) using the index of vectors: yields a candidate set; continued in step 3.
3. Refine the candidate set: I/O for the strings represented by those vectors. Calculate ED for each of them. (Remove false positives.)
1st Step: Partitioning into Overlapping Windows
• Sequence: AACCGGTTACGTACGT…
• e.g. window size W = 6
• e.g. shift amount = 2
• Windows: AACCGG, CCGGTT, GGTTAC, …
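A minimal sketch of the windowing step, assuming the second parameter on the slide is the shift between consecutive window starts and that a trailing fragment shorter than W is dropped (the slides do not say how the tail is handled). The function name `windows` is illustrative.

```python
def windows(sequence: str, w: int, shift: int) -> list[str]:
    """Split a sequence into overlapping windows of length w, starting every `shift` characters."""
    # The last start position is len(sequence) - w, so every window is full length.
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, shift)]


for win in windows("AACCGGTTACGTACGT", w=6, shift=2):
    print(win)
```

With W = 6 and a shift of 2, consecutive windows share four characters.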
2nd Step: Mapping Windows into Vector Space
• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k = 2, there are 4^k = 16 k-tuples
– AA, AC, AG, AT
– CA, CC, CG, CT
– TA, TC, TG, TT
– GA, GC, GG, GT
Example Mapping
• The integers assigned
– AA=0, AC=1, AG=2, AT=3
– CA=4, CC=5, CG=6, CT=7
– TA=8, TC=9, TG=10, TT=11
– GA=12, GC=13, GG=14, GT=15
• Assume window AACCGG
• AA, AC, CC, CG, GG all occur once
• 1100011000000010 is the matching vector (1s at positions 0, 1, 5, 6 and 14, since GG=14).
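The mapping can be sketched as below, using the integer assignment exactly as listed on the slide (note that the T rows come before the G rows there). The names `TUPLES`, `INDEX` and `frequency_vector` are illustrative, and the sketch assumes the window contains only A, C, G and T.

```python
# 2-tuples in the order of the slide's integer assignment: AA=0 ... GT=15.
TUPLES = [
    "AA", "AC", "AG", "AT",
    "CA", "CC", "CG", "CT",
    "TA", "TC", "TG", "TT",
    "GA", "GC", "GG", "GT",
]
INDEX = {t: i for i, t in enumerate(TUPLES)}


def frequency_vector(window: str, k: int = 2, index=INDEX) -> list[int]:
    """Count how often each k-tuple occurs among the overlapping k-tuples of the window."""
    vec = [0] * len(index)
    for i in range(len(window) - k + 1):
        vec[index[window[i:i + k]]] += 1
    return vec


vec = frequency_vector("AACCGG")
print("".join(str(c) for c in vec))
```

With this assignment GG maps to position 14, so the window AACCGG has its 1s at positions 0, 1, 5, 6 and 14.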
Different Transformations & Distance Functions
• Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T): FV1
– 2 → 16 (frequencies of 2-tuples): FV2
Different Transformations & Distance Functions 2
• WVn transformation
– Split the string into halves x, y
– Compute the FVn of x and y: FVx, FVy
– Concatenate their sum and their difference: [FVx + FVy, FVx − FVy]
• Wavelet 1 on example: TCACTTAG
– 1st: divide into halves & find the FV1 transformation
• x: TCAC → 1 2 0 1
• y: TTAG → 1 0 1 2
– 2nd: add and subtract
• WV1: 2 2 1 3 0 2 −1 −1
• The same operations on 2-tuples give WV2
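A small sketch of the WV1 transformation on the example string. It assumes the frequency vectors use A, C, G, T order (which matches the numbers on the slide) and that an odd-length string is split at `len(s) // 2`, a case the slides do not cover. Function names are illustrative.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s: str) -> list[int]:
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def wv1(s: str) -> list[int]:
    """Concatenate the sum and the difference of the two halves' frequency vectors."""
    half = len(s) // 2
    fx, fy = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(fx, fy)]
    diffs = [a - b for a, b in zip(fx, fy)]
    return sums + diffs


print(wv1("TCACTTAG"))
```

The first four entries equal the FV1 of the whole string, so WV1 keeps everything FV1 has and adds coarse positional information through the differences.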
Distance Functions on the Vector Spaces
• All of them are proven lower bounds on the edit distance
• FD1 distance on FV1
• FD2 distance on FV2
• WD1 distance on WV1
• WD2 distance on WV2
Frequency Distance FDn
Algorithm: FDn (n-gram frequencies u, v)
• posDist := negDist := 0
• For all dimensions ui, vi
– If ui > vi then posDist += ui − vi
– else negDist += vi − ui
• Return max(posDist, negDist) / n
Example (n=1)
• u: ACTTAGC → 2, 2, 1, 2
• v: AATGATAG → 4, 0, 2, 2
– 2 − 4 < 0: negDist += |2 − 4|
– 2 − 0 > 0: posDist += |2 − 0|
– 1 − 2 < 0: negDist += |1 − 2|
– 2 − 2 = 0
• posDist: 2, negDist: 3
• FD1 is 3
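The FDn computation can be written directly from the algorithm. This sketch accumulates the positive and negative differences, which is what the worked example does; the function name and the plain-list vector representation are illustrative choices.

```python
def frequency_distance(u, v, n=1):
    """FDn between two n-gram frequency vectors u and v of equal length."""
    pos_dist = 0  # total surplus of u over v
    neg_dist = 0  # total surplus of v over u
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi
        else:
            neg_dist += vi - ui
    return max(pos_dist, neg_dist) / n


# FV1 of ACTTAGC and AATGATAG from the slide's example (A, C, G, T order).
print(frequency_distance([2, 2, 1, 2], [4, 0, 2, 2], n=1))
```

For the example pair this gives 3, which is below the edit distance of 4 reported for the same strings earlier in the slides.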
FDn Why lower bound?
• On the example
– Need to increase A by 2 and G by 1: 3 in total
– Need to decrease C by 2
• We may "increase + decrease" in one operation if we can replace (see the replace operation on the Measuring similarity slide)
• So in the best case the edit distance is only FD1
• But this may not be the case; you may need more operations because of mismatched locations
• We divide by n because a change in one character updates the frequency of n n-grams
Wavelet Distance WDn
Algorithm: WDn (n-gram frequency wavelets u, v)
• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := |posDist − negDist| / 2
• If m < d
– Return d / n
• else
– Return (d + (m − d)/2) / n
Example (n=1)
• u: ACTC TAGC → 1201, 1111 → 2 3 1 2 0 1 −1 0
• v: AATG ATAG → 2011, 2011 → 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4, d: 0
• (0 + 4/2)/1
• Return 2
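A sketch of WDn that follows the slide's steps. One assumption is made loudly here: d is taken as the absolute difference of posDist and negDist divided by 2, so that the result does not depend on the order of u and v; the slide writes the difference without an absolute value. The function name is illustrative.

```python
def wavelet_distance(u, v, n=1):
    """WDn between two wavelet-transformed n-gram frequency vectors, as on the slide."""
    pos_dist = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg_dist = sum(vi - ui for ui, vi in zip(u, v) if vi > ui)
    m = min(pos_dist, neg_dist)
    # Assumption: absolute value, so the distance is symmetric in u and v.
    d = abs(pos_dist - neg_dist) / 2
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


# WV1 vectors of "ACTCTAGC" and "AATGATAG" from the slide's example.
u = [2, 3, 1, 2, 0, 1, -1, 0]
v = [4, 0, 2, 2, 0, 0, 0, 0]
print(wavelet_distance(u, v, n=1))
```

On the example both posDist and negDist are 4, so d is 0 and the result is (0 + 4/2)/1 = 2.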
WDn Why lower bound?
• Assume a string transformed into wavelet
[a1,…a, b1,…b]
• Largest change: posDist += 3 and negDist −= 1, or vice versa
– So use this change whenever posDist ≠ negDist
Experiment Design
• Implemented the transformations & distance functions
• Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
• Ran queries with different parameters
– Varying string size W and shift amount
– Some containing an exact match, some not
– For ε-range queries, different ε values
– For k-NN queries, different k values
K-nearest efficiency
[Figure: average of edit distances of the k nearest (y-axis, 0 to 90) versus k for the k-nearest neighbor query (x-axis, 5 to 25); series: EditDist, Freq, Freq2, MaxFreq, Wav, Wav2]
Error Rates Compared
[Figure: percentage error (y-axis, 0% to 160%) versus k (x-axis, 5 to 25); series: (Freq-Edit)/Edit, (Freq2-Edit)/Edit, (MaxFreq-Edit)/Edit, (Wav-Edit)/Edit, (Wav2-Edit)/Edit]
Sorted Graphs
• To depict why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
– It sorts the data values in increasing order and takes the k nearest ones
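The sort-and-take-k behavior described above can be imitated with a few lines. This is only a sketch: it assumes single-character frequency vectors in A, C, G, T order and a frequency-style distance as the approximate distance, and it sorts a plain list instead of using an index. All names are illustrative.

```python
from collections import Counter

ALPHABET = "ACGT"


def fv1(s):
    """Frequencies of A, C, G, T in s."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def fd1(u, v):
    """Frequency-style distance used here as the approximate distance."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def approximate_knn(database, query, k):
    """Sort strings by approximate distance to the query and keep the k nearest."""
    qv = fv1(query)
    ranked = sorted(database, key=lambda s: fd1(fv1(s), qv))
    return ranked[:k]


db = ["ACTTAGC", "AATGATAG", "CCCCCCCC", "AATGATAC"]
print(approximate_knn(db, "AATGATAG", k=2))
```

No edit distance is computed at all in this approach, so its quality depends on how closely the ordering by the approximate distance tracks the ordering by edit distance.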
Edit Distances and Matching FD1 Distances sorted by FD1
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by FD1 (x-axis); series: ED, FD1; the 20 nearest and 50 nearest are marked]
Edit Distances and Matching WD2 sorted by WD2
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by WD2 (x-axis); series: ED, WD2; the 20 nearest and 50 nearest are marked]
Nature of the distance functions
• WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function
wav2
[Figure: distance values (y-axis, 0 to 140) over strings 1 to 666 (x-axis); series: EditDist, WaveletDist2]
Freq
[Figure: distance (edit and freq) (y-axis, 0 to 140) for strings sorted by edit distance to the query (x-axis, 1 to 666); series: EditDist, FreqDist]
Results
• Tested the parameters obtained from these random experiments on real data.
• Then also did the parameter extraction using real data.
Comparison of index structures
Future Work
• Check the applicability of these methods to other kinds of sequence data
– Text
– Image search
• Implement the index structure in the standalone program and evaluate its performance