Date post: | 21-Dec-2015 |
Guided Forest Edit Distance: Better Structure Comparisons by Using Domain-knowledge
Z.S. Peng
H.F. Ting
The Forest Edit Distance
Edit distance of two ordered, labeled forests
Edit operations between E and F
Relabel: replace the label of node i in E with the label of node j in F.
Example: Relabel (3,5) gives node 3 of E the label of node 5 of F.
Cost of the operation: $\delta(3,5)$
[Figure: example forests E and F with numbered, labeled nodes; the relabeled node is highlighted.]
Edit operations between E and F
Delete: remove node i from E. The children of the deleted node become children of its parent.
Example: Delete (2,-) removes node 2 from E.
Cost of the operation: $\delta(2,-)$
[Figure: forest E before and after deleting node 2, shown next to forest F.]
Edit operations between E and F
Delete: remove node j from F.
Cost of the operation: $\delta(-,j)$
[Figure: example forests E and F with numbered, labeled nodes.]
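The three edit operations can be captured by a single cost function over pairs in which one side may be missing. The sketch below assumes a unit-cost model, which the slides do not specify; the name `op_cost` and the use of `None` for the "-" placeholder are illustrative choices only.

```python
from typing import Optional


def op_cost(a: Optional[str], b: Optional[str]) -> int:
    """Cost of one edit operation under an assumed unit-cost model.

    a is the label of a node in E, b the label of a node in F.
    None plays the role of "-" in the slides:
      (a, None) deletes a node from E, (None, b) deletes a node from F,
      (a, b) relabels the node in E with the label of the node in F.
    """
    if a is None and b is None:
        raise ValueError("at least one side must be a real node label")
    if a is None or b is None:
        return 1  # deletion from F or from E
    return 0 if a == b else 1  # relabeling is free when the labels agree


print(op_cost("m", "y"), op_cost("f", None), op_cost(None, "z"))
```

Any other non-negative cost table could be substituted without changing the rest of the discussion.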
Edit distance of two ordered, labeled forests
The edit distance $\delta(E,F)$ between E and F is the minimum total cost of a sequence of edit operations that transform E into E' and F into F' such that E' = F'.
[Figure: forests E and F, and the identical forests E' and F' obtained after the edit operations.]
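To make the definition concrete, here is a minimal sketch of the unguided forest edit distance using the textbook recursion on the rightmost roots of the two forests. It is not the algorithm of Tai, Zhang and Shasha, or Klein, and it assumes unit costs. The representation (a tree is a `(label, children)` tuple, a forest is a tuple of trees) and the function name are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed representation: tree = (label, (child, child, ...)); forest = tuple of trees.


@lru_cache(maxsize=None)
def forest_edit_distance(E: tuple, F: tuple) -> int:
    """Unit-cost edit distance between two ordered, labeled forests."""
    if not E and not F:
        return 0
    if not E:
        _, kids = F[-1]
        # Delete the rightmost root of F; its children take its place.
        return forest_edit_distance(E, F[:-1] + kids) + 1
    if not F:
        _, kids = E[-1]
        return forest_edit_distance(E[:-1] + kids, F) + 1
    (lv, kv), (lw, kw) = E[-1], F[-1]
    return min(
        forest_edit_distance(E[:-1] + kv, F) + 1,   # delete rightmost root of E
        forest_edit_distance(E, F[:-1] + kw) + 1,   # delete rightmost root of F
        forest_edit_distance(E[:-1], F[:-1])        # match the two rightmost roots
        + forest_edit_distance(kv, kw)
        + (lv != lw),                               # relabel cost, 0 or 1
    )


E = (("a", (("b", ()), ("c", ()))),)   # a with children b, c
F = (("a", (("b", ()),)),)             # a with child b
print(forest_edit_distance(E, F))
```

The memoized recursion is easy to check by hand on small inputs, but it does not achieve the running times listed for the published algorithms.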
The guided edit distance $\delta(E,F,G)$ between E and F with respect to a third forest G is the minimum total cost of edit operations that transform E into E' and F into F' such that E' = F' and E' includes G as a sub-forest.
[Figure: forests E and F, the guiding forest G, and the transformed forests E' = F' that contain G.]
Application 1: RNA comparisons
[Figure: Cherry small circular viroid-like RNA (GI:2347024) between base 287 and base 337; the hammerhead motif of the RNA is printed in bold.]
Application 2: Comparing XML documents
XML documents that share the same Document Type Definition (DTD) should be aligned with respect to this DTD to get more accurate results.
The algorithms
$\delta(E,F)$:
Tai 1979: $O(|E||F|\,d(E)^2 d(F)^2)$
Zhang and Shasha 1989: $O(|E||F|\,\ell(E)\,\ell(F))$, where $\ell(X) = \min\{L(X), d(X)\}$
Klein 1998: $O(|E|^2|F|\log|F|)$
$\delta(E,F,G)$:
This paper: $O(|E||F||G|\,\ell(E)\,\ell(F)\,|L(G)|^2)$
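The quantity min{L(X), d(X)} in these bounds is easy to compute for a given forest. The sketch below assumes that L(X) counts the leaves of X and that d(X) is the depth of X measured in nodes, as in the usual statement of Zhang and Shasha's bound; the tuple representation and the function names are illustrative assumptions.

```python
# Assumed representation: tree = (label, (child, ...)); forest = tuple of trees.


def leaf_count(forest: tuple) -> int:
    """Number of leaves in the forest."""
    return sum(1 if not kids else leaf_count(kids) for _, kids in forest)


def depth(forest: tuple) -> int:
    """Depth in nodes: a single node has depth 1, the empty forest depth 0."""
    return max((1 + depth(kids) for _, kids in forest), default=0)


def ell(forest: tuple) -> int:
    """min{L(X), d(X)} under the assumptions stated above."""
    return min(leaf_count(forest), depth(forest))


path = (("a", (("b", (("c", ()),)),)),)                 # a - b - c
star = (("a", (("b", ()), ("c", ()), ("d", ()))),)      # a with three leaf children
print(ell(path), ell(star))
```

A path has one leaf and a flat forest has depth one, so both extremes give a value of 1, which is why sequence problems fall into the cheapest case of the bound.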
Special Cases
Constrained Longest Common Subsequence
Constrained Sequence Alignment
[Figure: example forests illustrating the special cases.]
The algorithms
Constrained Longest Common Subsequence, Tsai 2003: $O(|e||f||g|)$
Constrained Sequence Alignment, Chin et al.: $O(|e||f||g|)$
This paper: $O(|E||F||G|\,\ell(E)\,\ell(F)\,|L(G)|^2)$, where $\ell(X) = \min\{L(X), d(X)\}$
Since G has one leaf, and $\ell(E) = \ell(F) = 1$ when E and F are sequences, the time becomes $O(|E||F||G|)$
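For the sequence special case, the cubic bound corresponds to a three-dimensional dynamic program. The following is a minimal sketch of a standard constrained longest common subsequence table, not the guided forest algorithm itself and not a transcription of either cited paper; the function name and the convention of returning `None` when no feasible subsequence exists are assumptions.

```python
from typing import Optional

NEG = float("-inf")  # marks infeasible table entries


def constrained_lcs_length(e: str, f: str, g: str) -> Optional[int]:
    """Length of a longest common subsequence of e and f that contains g
    as a subsequence, or None if no such common subsequence exists."""
    n, m, r = len(e), len(f), len(g)
    # L[i][j][k]: best length using e[:i] and f[:j] with g[:k] embedded.
    L = [[[NEG] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(r + 1):
                if i == 0 or j == 0:
                    L[i][j][k] = 0 if k == 0 else NEG
                    continue
                best = max(L[i - 1][j][k], L[i][j - 1][k])
                if e[i - 1] == f[j - 1]:
                    # Match the two characters without consuming g ...
                    best = max(best, L[i - 1][j - 1][k] + 1)
                    # ... or let the match also cover the next character of g.
                    if k > 0 and e[i - 1] == g[k - 1]:
                        best = max(best, L[i - 1][j - 1][k - 1] + 1)
                L[i][j][k] = best
    return None if L[n][m][r] == NEG else L[n][m][r]


print(constrained_lcs_length("abc", "acb", "b"))
```

The table has (|e|+1)(|f|+1)(|g|+1) entries and each is filled in constant time, which matches the cubic running time quoted for sequences.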
Our algorithm for computing (E,F,G)
Dynamic Programming
The sub-problems
Post-order numbering (naming) of the nodes
[Figure: a forest of 23 nodes, numbered 1 to 23 in post-order.]
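Post-order numbering visits the children of a node from left to right and then the node itself, so every node receives a larger number than all of its descendants. A minimal sketch, assuming trees are given as `(label, children)` tuples; the function name and the 1-based arrays with a dummy entry at index 0 are illustrative choices.

```python
def postorder_numbering(forest: tuple):
    """Number the nodes 1..n in post-order.

    Returns (labels, parent): labels[v] is the label of node v and
    parent[v] is the number of its parent, or 0 if v is a root.
    Index 0 of both lists is a dummy entry.
    """
    labels, parent = [None], [0]

    def visit(tree) -> int:
        label, kids = tree
        child_ids = [visit(kid) for kid in kids]  # children first, left to right
        labels.append(label)
        parent.append(0)
        v = len(labels) - 1                       # number assigned to this node
        for c in child_ids:
            parent[c] = v
        return v

    for tree in forest:
        visit(tree)
    return labels, parent


# a has children b and d; b has child c.
E = (("a", (("b", (("c", ()),)), ("d", ()))),)
labels, parent = postorder_numbering(E)
print(labels[1:], parent[1:])
```

With this numbering, the nodes of any subtree occupy a contiguous range of numbers ending at the subtree's root.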
The sub-problems
$E_{i..i'}$: a "consecutive" sub-forest, formed by the nodes numbered i to i' in post-order.
Example: $E_{4..21}$
[Figure: the 23-node forest with the sub-forest $E_{4..21}$ highlighted.]
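One way to read a "consecutive" sub-forest is as the forest induced by a range of post-order numbers, where a node keeps its parent only if the parent also lies in the range. The sketch below follows that reading, which is an assumption about the intended definition; the parent array format (1-based, 0 for roots) and the function name are also assumptions.

```python
def consecutive_subforest(parent: list, lo: int, hi: int) -> dict:
    """Parent map of the sub-forest induced by nodes lo..hi.

    parent[v] is the post-order number of v's parent in the full forest,
    or 0 for a root. In the result, a node whose parent falls outside
    lo..hi becomes a root (parent 0).
    """
    return {
        v: (parent[v] if lo <= parent[v] <= hi else 0)
        for v in range(lo, hi + 1)
    }


# Post-order parents for: node 4 has children 2 and 3; node 2 has child 1.
parent = [0, 2, 4, 4, 0]
print(consecutive_subforest(parent, 1, 3))
```

Because parents always have larger post-order numbers than their children, checking the direct parent is enough: if it lies beyond the range, so does every other ancestor.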
The sub-problems
$\delta(E_{i..i'}, F_{j..j'}, G_{k..k'})$
Example: $\delta(E_{2..7}, F_{4..7}, G_{2..3})$
[Figure: forests E, F and G numbered in post-order, with the sub-forests $E_{2..7}$, $F_{4..7}$ and $G_{2..3}$ highlighted.]
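$\delta(E_{s(i_0)..i}, F_{s(j_0)..j}, G_{k..k'})$ is equal to the minimum of the following:
1. $\delta(E_{s(i_0)..i-1}, F_{s(j_0)..j}, G_{k..k'}) + \delta(i,-)$
2. $\delta(E_{s(i_0)..i}, F_{s(j_0)..j-1}, G_{k..k'}) + \delta(-,j)$
3. $\delta(E_{s(i_0)..s(i)-1}, F_{s(j_0)..s(j)-1}, G_{k..k'}) + \delta(E[i], F[j])$
4. $\delta(E_{s(i_0)..s(i)-1}, F_{s(j_0)..s(j)-1}, G_{k..s(p)-1}) + \delta(E_{s(i)..i-1}, F_{s(j)..j-1}, G_{s(p)..k'}) + \delta(i,j)$
5. $\delta(E_{s(i_0)..s(i)-1}, F_{s(j_0)..s(j)-1}, G_{k..s(k')-1}) + \delta(E_{s(i)..i-1}, F_{s(j)..j-1}, G_{s(k')..k'-1}) + \delta(i,j)$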
[Figures: each of the five cases illustrated in turn on example forests E, F and G numbered in post-order.]
The order for solving the sub-problems
for i = 1 to |E|
  for j = 1 to |F|
    for h = 1 to |G|
      for k = 1 to (|G| - h + 1)
        if k is a leaf then find $\delta(E_{1..i}, F_{1..j}, G_{k..(k+h-1)})$
The time complexity
$O(|E|^2|F|^2|G|\,|L(G)|^2)$
Sparsify the dynamic program
using a clever trick of Zhang and Shasha
key-root: a node is a key-root if it is the root or has a left sibling
[Figure: forests E, F and G numbered in post-order, with the key-roots marked.]
No. of key-roots ≤ no. of leaves
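The key-root definition and the bound on their number can be checked directly on a parent array. This sketch assumes post-order numbering with `parent[v] == 0` for roots, and it treats every root of a forest as a key-root; both conventions, and the function name, are assumptions made for illustration.

```python
def key_roots_and_leaves(parent: list):
    """Return (key_roots, leaves) for a forest given by a 1-based parent array.

    parent[v] is the post-order number of v's parent, or 0 for a root.
    A node is a key-root if it is a root or has a left sibling.
    """
    n = len(parent) - 1
    kids = {v: [] for v in range(n + 1)}
    for v in range(1, n + 1):
        kids[parent[v]].append(v)      # ascending numbers = left-to-right order
    keys = set(kids[0])                # roots of the forest
    for p in range(1, n + 1):
        keys.update(kids[p][1:])       # every child except the leftmost one
    leaves = [v for v in range(1, n + 1) if not kids[v]]
    return sorted(keys), leaves


# Node 4 has children 2 and 3; node 2 has child 1.
parent = [0, 2, 4, 4, 0]
keys, leaves = key_roots_and_leaves(parent)
print(keys, leaves, len(keys) <= len(leaves))
```

Each key-root has a leftmost leaf descendant that no other key-root shares, which is the reason the count cannot exceed the number of leaves.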
To compute $\delta(E,F,G) = \delta(E_{1..|E|}, F_{1..|F|}, G_{1..|G|})$:
for i = 1 to |E|
  for j = 1 to |F|
    for h = 1 to |G|
      for k = 1 to (|G| - h + 1)
        if k is a leaf and i and j are key-roots
          find $\delta(E_{1..i}, F_{1..j}, G_{k..(k+h-1)})$
The new running time
$O(|E||F||G|\,\ell(E)\,\ell(F)\,|L(G)|^2)$
Thank you