Liang Jin* UC Irvine
Nick Koudas University of Toronto
Chen Li* UC Irvine
Anthony K.H. Tung National University of Singapore
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586
Indexing Mixed Types for Approximate Retrieval
2
Queries with Mixed-Type Predicates
Star             Title                                          Year   Genre
Keanu Reeves     The Matrix                                     1999   Sci-Fi
Samuel Jackson   Star Wars: Episode III - Revenge of the Sith   2005   Sci-Fi
Schwarzenegger   The Terminator                                 1984   Sci-Fi
Samuel Jackson   Goodfellas                                     1990   Drama
…                …                                              …      …
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
• SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
• Example: edit distance ed(Tom Hanks, Ton Hank) = 2
3
Why fuzzy predicates?
• Errors in queries
– User doesn’t remember a string exactly
– User types a wrong string
• Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
[Figure: the Star columns of two relations. Relation R: Keanu Reeves, Samuel Jackson, Schwarzenegger, …, Samuel Jackson. Relation S: Keanu Reeves, Samuel L. Jackson, Schwarzenegger, …, Samuel L. Jackson.]
4
Problem Formulation
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
Given: A query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation
Goal: Answer the query efficiently
5
Rest of the talk
• Motivation: supporting queries with mixed-type predicates
• Our approach: MAT tree
• Construction and maintenance of MAT tree
• Experiments
6
Assumptions
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1980| <= 5;
• One fuzzy string predicate (edit distance)
• One numeric predicate
Query: (Qs, δs, Qn, δn) = (’Schwarrzenger’, 2, 1980, 5)
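To make the query format concrete, here is a minimal Python sketch of what a query (Qs, δs, Qn, δn) asks for, evaluated by a plain sequential scan. The function names `edit_distance` and `answer_by_scan` and the toy records are assumptions made for illustration; they are not taken from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def answer_by_scan(records, qs, delta_s, qn, delta_n):
    """Return the (string, number) records that satisfy both predicates."""
    return [(s, n) for s, n in records
            if abs(n - qn) <= delta_n and edit_distance(s, qs) <= delta_s]


# Toy records (hypothetical, for illustration only).
records = [("Hanks", 1956), ("Hank", 1957), ("Hanks", 1964), ("Gibson", 1956)]
print(answer_by_scan(records, "Hanks", 1, 1956, 1))
```

A scan like this checks every record, which is the cost an index is meant to avoid.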
7
Intuition of MAT (Mixed-attribute-type) Tree
• “2 > 1 + 1”
– One integrated indexing structure is better than two independent indexing structures on two attributes
• Indexing numeric attributes: B-tree or R-tree
• Indexing strings as a tree to support fuzzy predicates?
[Figure: MAT tree. Root entries: <1946,1957>, <1964,1977>. Leaf-node ranges (MBRs): <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>. Leaf records: Spielberg 1946, Hanks 1956, Gibson 1956, Hanks 1957, Crowe 1964, Robert 1968, DiCaprio 1974, Roberrts 1977.]
8
Answering a query (Qs, δs, Qn, δn)
• Traverse the MAT-tree top-down
• At each node, prune by checking:
– whether [Qn – δn, Qn + δn] overlaps the node’s numeric range
– whether minEditDistance(Qs, Tn) <= δs, where Tn is the node’s trie
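The traversal can be sketched in Python as follows. This is only an illustration under a strong simplifying assumption: each node keeps the plain set of strings below it instead of a compressed trie, so the string check is an exact minimum rather than a bound computed on a trie. The names `Node`, `leaf`, `internal`, and `search` are hypothetical.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int                 # numeric range of the subtree
    hi: int
    strings: frozenset      # stand-in for the node's compressed trie (assumption)
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # only filled in leaves


def leaf(records):
    years = [n for _, n in records]
    return Node(min(years), max(years),
                frozenset(s for s, _ in records), records=list(records))


def internal(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                frozenset().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, delta_s, qn, delta_n):
    # Numeric pruning: skip the subtree if the ranges do not overlap.
    if node.hi < qn - delta_n or node.lo > qn + delta_n:
        return []
    # String pruning: skip if even the closest string is too far away.
    if min(edit_distance(qs, s) for s in node.strings) > delta_s:
        return []
    if not node.children:  # leaf: verify each record
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= delta_n and edit_distance(s, qs) <= delta_s]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, delta_s, qn, delta_n))
    return hits


# The eight records shown in the MAT-tree figure, two per leaf.
root = internal([
    internal([leaf([("Spielberg", 1946), ("Hanks", 1956)]),
              leaf([("Gibson", 1956), ("Hanks", 1957)])]),
    internal([leaf([("Crowe", 1964), ("Robert", 1968)]),
              leaf([("DiCaprio", 1974), ("Roberrts", 1977)])]),
])
print(search(root, "Roberts", 1, 1970, 8))
```

For this query the left half of the tree is discarded by the numeric check alone, since its range ends at 1957 and the query range starts at 1962.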
9
Challenge
• How to represent strings so that they fit into a limited space (the index is disk based)
• while still supporting fuzzy-predicate pruning
10
Existing Approaches to Indexing Strings as Trees
• M-tree:
– Edit distance: metric space
• Q-tree
– Utilize the q-gram property of strings.
– See our paper for details
11
Representing strings as a trie
[Figure: a trie with nodes n1 (root) to n17 whose edges are labeled with single characters.]
Strings: aad, abcde, abdfg, beh, ca, cb
12
Compressing a trie
• Select k representative nodes (centers).
• Each center is stored as <alphabet, height>: the set of characters in the subtree rooted at the center and the height of that subtree.
• A compressed trie accepts more strings than the original trie.
[Figure: the trie for the strings aad, abcde, abdfg, beh, ca, cb (nodes n1 to n17) and its compressed form, which keeps nodes n1, n2, n4, n5, n6, n7, n11, n13, n15. Node labels shown: <{a,b,c},2>, <{a},1>, <{a,d},2>, <{b,d},2>, <{b},2>, <{c,d,e},3>, <{e},1>, <{f,g},2>, <{h},1>.]
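The <alphabet, height> format can be illustrated with a short Python sketch. It relies on an assumption the slides do not state explicitly: a center <Σ, L> is taken to accept every non-empty string over Σ of length at most L. The helper names `summarize`, `accepts`, and `accepted_count` are hypothetical.

```python
def summarize(suffixes):
    """Build <alphabet, height> for the strings hanging at a center."""
    alphabet = frozenset(ch for s in suffixes for ch in s)
    height = max(len(s) for s in suffixes)
    return alphabet, height


def accepts(summary, s):
    """Assumed semantics: any non-empty string over the alphabet, length <= height."""
    alphabet, height = summary
    return 1 <= len(s) <= height and set(s) <= alphabet


def accepted_count(summary):
    """Number of strings accepted under the assumed semantics."""
    alphabet, height = summary
    return sum(len(alphabet) ** i for i in range(1, height + 1))


# The part of "abcde" and "abdfg" that follows the shared first character "a".
center = summarize(["bcde", "bdfg"])
print(center, accepted_count(center))
print(accepts(center, "bcde"), accepts(center, "bbbb"), accepts(center, "bz"))
```

Under this reading the center admits both original suffixes and also strings such as "bbbb" that were never stored, which is the loss of accuracy that compression trades for space.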
13
Minimum edit distance between a string and a trie
• minEditDistance(Qs, Tn)?
– Convert a trie to an automaton.
– Compute the min distance between a string and an automaton [Myers and Miller, 1989]
– Early termination possible
[Figure: edit graph between the query string “ac” and an automaton over the characters a, b, d.]
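As a rough illustration of the idea, the following Python sketch computes the minimum edit distance between a query string and the strings of an uncompressed trie. It is a simpler stand-in, not the Myers and Miller algorithm: it carries one dynamic-programming row per trie node and stops descending when the row can no longer lead to a good enough answer. The names `build_trie` and `min_edit_distance` and the `threshold` parameter are assumptions.

```python
def build_trie(strings):
    root = {"children": {}, "end": False}
    for s in strings:
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"children": {}, "end": False})
        node["end"] = True
    return root


def min_edit_distance(query, root, threshold=None):
    """Smallest edit distance from `query` to any stored string.

    Returns None if no stored string is within `threshold` (when given).
    """
    best = None
    # row[j] = distance between the current trie prefix and query[:j]
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node["end"] and (best is None or row[-1] < best):
            best = row[-1]
        for ch, child in node["children"].items():
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                new_row.append(min(row[j] + 1,               # extra trie char
                                   new_row[j - 1] + 1,       # extra query char
                                   row[j - 1] + (qc != ch)))  # match/substitute
            # Early termination: row minima never decrease going deeper.
            lower_bound = min(new_row)
            if threshold is not None and lower_bound > threshold:
                continue
            if best is not None and lower_bound >= best:
                continue
            stack.append((child, new_row))
    if best is None or (threshold is not None and best > threshold):
        return None
    return best


trie = build_trie(["aad", "abcde", "abdfg", "beh", "ca", "cb"])
print(min_edit_distance("ac", trie))
print(min_edit_distance("ac", trie, threshold=1))
```

Sharing rows along common prefixes means that a prefix such as "ab" is processed once for both "abcde" and "abdfg".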
14
Converting a compressed trie to an automaton
• Each node is a state.
• Each edge becomes a transition between two states.
• For a compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.
[Figure: converting the compressed node <{a,b,c},2> into automaton nodes: two levels, each with the states a, b, c.]
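One way to picture the expansion of a compressed node <Σ, L> is the Python sketch below. It assumes the expanded node accepts every non-empty string over Σ of length at most L; the slides do not spell this out. Because all states of a level meet in a common tail, a single dynamic-programming row per level is enough. The function name `min_distance_to_compressed_node` is hypothetical.

```python
def min_distance_to_compressed_node(query, alphabet, height):
    """Minimum edit distance from `query` to any string the node accepts.

    Assumed language: non-empty strings over `alphabet` with length <= height.
    Expects height >= 1.
    """
    # Level 0: distance between the empty string and each query prefix.
    row = list(range(len(query) + 1))
    best = None
    for _ in range(height):
        new_row = [row[0] + 1]
        for j, qc in enumerate(query, start=1):
            # Some state on this level matches qc exactly iff qc is in the alphabet.
            cost = 0 if qc in alphabet else 1
            new_row.append(min(row[j] + 1,
                               new_row[j - 1] + 1,
                               row[j - 1] + cost))
        row = new_row  # merged through the common tail
        if best is None or row[-1] < best:
            best = row[-1]
    return best


print(min_distance_to_compressed_node("ac", {"a", "b", "c"}, 2))
print(min_distance_to_compressed_node("abd", {"a", "b", "c"}, 2))
```

The value returned can never exceed the distance to an original string under the node, so it is safe to compare against δs for pruning.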
15
Outline
• Motivation: supporting queries with mixed-type predicates
• Our approach: MAT tree
• Construction and maintenance of MAT tree
• Experiments
16
Constructing MAT-tree
• Option 1: insert records one by one.
• Option 2:
– bulk-load records
– construct the MAT-tree bottom-up
17
Compressing a trie
• Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
• Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining (1) and (2)
18
Method (1): Reducing # of accepted strings
• Intuition:
– reducing this # makes the compressed trie more accurate
• Goodness function: # of accepted strings
• Algorithm: “Randomized”
– Randomly select k initial centers
– Randomly select one of the centers
– Randomly select an unselected node
– Swap them if doing so improves the goodness function
– Repeat for a fixed # of iterations
19
Method (2): Keeping accepted strings clustered
• Intuition:
– keep the accepted strings similar to the original ones by letting them share a common prefix.
– Place the k centers as close to the root as possible.
• Algorithm: “BreadthFirst”
[Figure: BreadthFirst compression of the trie for the strings aad, abcde, abdfg, beh, ca, cb. Nodes n1 to n9 are kept, with edges a, b, c below the root; the nodes on the last level carry the labels <{a,d},2>, <{b,c,d,e,f,g},4>, <{e,h},2>, <{a},1>, <{b},1>.]
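A small Python sketch can reproduce the labels in the BreadthFirst example. It makes a simplifying assumption that is not from the talk: whole levels of the trie are kept as long as at most k nodes are used, and every node on the last kept level is replaced by the <alphabet, height> summary of its subtree. The names `build_trie`, `summarize`, and `breadth_first_compress` are hypothetical.

```python
def build_trie(strings):
    """Trie as nested dicts: character -> child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def summarize(prefix, node):
    """<alphabet, height> of the subtree rooted at the node reached by `prefix`."""
    alphabet = set(prefix[-1:])   # the node's own character (none for the root)
    height = len(prefix[-1:])
    below = [summarize(prefix + ch, child) for ch, child in node.items()]
    for sub_alphabet, _ in below:
        alphabet |= sub_alphabet
    if below:
        height += max(h for _, h in below)
    return alphabet, height


def breadth_first_compress(strings, k):
    """Keep full levels while they fit in k nodes; summarize the last level."""
    level = [("", build_trie(strings))]
    kept = 0
    while True:
        kept += len(level)
        next_level = [(p + ch, child)
                      for p, node in level
                      for ch, child in sorted(node.items())]
        if not next_level or kept + len(next_level) > k:
            break
        level = next_level
    return {prefix: summarize(prefix, node) for prefix, node in level}


centers = breadth_first_compress(["aad", "abcde", "abdfg", "beh", "ca", "cb"], k=9)
for prefix, (alphabet, height) in centers.items():
    print(prefix, sorted(alphabet), height)
```

With k = 9 the root, its three children, and the five nodes on the next level are kept, and the five summaries agree with the labels shown for this example.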
20
Method (3): Combining (1) and (2)
• Intuition:
– minimize the number of accepted strings and, at the same time, maintain their similarity to the originals.
• Algorithm: “BottomUp”
– Keep shrinking the trie bottom-up until we have k nodes
– At each step, compress the node that adds the fewest additional accepted strings
21
Dynamic maintenance
Insertion (s, n)
• Search the index for (s, n). If it’s not in the index, identify the correct leaf node.
• If no overflow:
– update the “MBR” of the leaf node and of its ancestors recursively if necessary.
• If overflow:
– Split the leaf node
– Construct two compressed tries
– Cascade the split to the ancestors if necessary.
Deletion and update are handled similarly
22
Outline
• Motivation: supporting queries with mixed-type predicates
• Our approach: MAT tree
• Construction and maintenance of MAT tree
• Experiments
23
Setting
• Data
– IMDB: 100K movie star records (name and year of birth, YOB).
– Customers: 50K records (name and YOB)
• Test bed
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler
• Results were similar on both data sets; we report those for IMDB.
24
Implemented approaches
• B-tree
• Q-tree
• B-tree & Q-tree
• BQ-tree
• BM-tree
• Sequential scan
“BBQ-tree”?
25
“2 > 1 + 1”
An integrated indexing structure is better than two separate indexing structures
δs=3, δn=4
26
Scalability
27
Effect of numeric threshold δn
28
Effect of string threshold δs
29
Dynamic Maintenance: time
30
Dynamic maintenance: MAT quality
31
Number of centers
• Increasing the number of centers may not reduce the running time: more pruning power comes at a higher computational cost
• BottomUp and BreadthFirst (compared to Randomized):
– their centers are close to the root, so early termination is more likely
32
Conclusion
• MAT-tree: an efficient indexing structure for queries with mixed-type predicates
• Can be efficiently constructed and maintained
• Future work: develop a uniform framework to support different kinds of similarity functions
Q&A?
The Flamingo Project : http://www.ics.uci.edu/~flamingo/
33
Backup Slides
34
Constructing MAT-tree
• Option 1: inserting records one by one.
• Option 2: bulk-loading data records and constructing the MAT-tree in a bottom-up fashion.
– Records are sorted based on one attribute.
– Fill pages with records until full.
– Calculate the numeric range and the compressed trie for each leaf node.
– Merge leaf nodes into internal nodes recursively according to the desired fanout, until a single root is formed.
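The bulk-loading steps can be sketched in Python as below. Several choices are assumptions made for illustration: records are sorted on the numeric attribute, a plain set of strings stands in for each node's compressed trie, and the names `Node`, `bulk_load`, `page_size`, and `fanout` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: int
    hi: int
    strings: frozenset      # stand-in for the compressed trie (assumption)
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)


def bulk_load(records, page_size=2, fanout=2):
    """Build the tree bottom-up from (string, number) records."""
    if not records or page_size < 1 or fanout < 2:
        raise ValueError("need records, page_size >= 1 and fanout >= 2")
    ordered = sorted(records, key=lambda r: r[1])   # sort on the number
    # Fill leaf pages and compute each leaf's numeric range and string summary.
    level = []
    for i in range(0, len(ordered), page_size):
        page = ordered[i:i + page_size]
        level.append(Node(page[0][1], page[-1][1],
                          frozenset(s for s, _ in page), records=page))
    # Merge nodes level by level until a single root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(Node(min(c.lo for c in group),
                                max(c.hi for c in group),
                                frozenset().union(*(c.strings for c in group)),
                                children=group))
        level = parents
    return level[0]


records = [("Spielberg", 1946), ("Hanks", 1956), ("Gibson", 1956), ("Hanks", 1957),
           ("Crowe", 1964), ("Robert", 1968), ("DiCaprio", 1974), ("Roberrts", 1977)]
root = bulk_load(records)
print([(c.lo, c.hi) for c in root.children])
```

Sorting on the numeric attribute keeps the numeric ranges of sibling nodes tight; sorting on the string attribute instead would favor the string summaries.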
35
Example – Customer Service Call Center
Name SSN YOB
Jack Lemmon 430-871-8294 1978
Harrison Ford 292-918-2913 1962
Tom Hanks 234-762-1234 1956
Tim Legler 125-457-8654 1870
… … …
Customer calls in
Issue a fuzzy query:
Name LIKE “Tom Hanks” AND YOB CLOSE to 1958
Return result
Serve the customer
In this example, the underlying system should be able to support fuzzy queries on both the string and numeric attributes!
36
Scalability test (IO)