Answering Approximate Queries Efficiently
Chen Li, Department of Computer Science
Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica
30,000-Foot View of Info Systems

[Diagram: a query is sent to a data repository (RDBMS, search engines, etc.), which returns the answers matching its conditions.]
Example: a movie database

Star           | Title                                        | Year | Genre
Keanu Reeves   | The Matrix                                   | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator                               | 1984 | Sci-Fi
Samuel Jackson | Goodfellas                                   | 1990 | Drama
…              | …                                            | …    | …

Find movies starring Samuel Jackson.
How about our governor: Schwarrzenger?

(movie table as above)
The user doesn’t know the exact spelling!
Relaxing Conditions

(movie table as above)
Find movies with a star “similar to” Schwarrzenger.
In general: Gap between Queries and Facts

• Errors in the query:
  – The user doesn’t remember a string exactly
  – The user unintentionally types a wrong string
• Errors in the database:
  – Data often is not clean by itself
  – Especially true in data integration and cleansing

[Figure: two relations R and S whose Star columns hold variants of the same names, e.g. “Samuel Jackson” vs. “Samuel L. Jackson”.]
“Did you mean…?” features in Search Engines
What if we don’t want the user to change the query? Answering Queries Approximately

[Diagram: the query is sent to the data repository (RDBMS, search engines, etc.), which returns answers matching the conditions approximately.]
Technical Challenges

• How to relax conditions?
  – Name: “Schwarzenegger” vs. “Schwarrzenger”
  – Salary: “in [50K,60K]” vs. “in [49K,63K]”
• How to answer queries efficiently?
  – Index structures
  – Selectivity estimation

See our three recent VLDB papers.
Rest of the talk

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works
Queries with Fuzzy String Predicates

• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”
• “Similar to”:
  – a domain-specific function
  – returns a similarity value between two strings
• Examples:
  – Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3
  – Cosine similarity
  – Jaccard coefficient
  – Soundex
  – …
Example Similarity Function: Edit Distance

• A widely used metric to define string similarity
• ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 into s2
• Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1, s2) = 2
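As a sketch, the definition above can be computed with the standard dynamic-programming recurrence (generic Levenshtein code, not the paper's implementation):

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to change s1 into s2, via the classic DP table."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[m][n]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```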
Selectivity of Fuzzy Predicates

star SIMILARTO ’Schwarrzenger’

• Selectivity: # of records satisfying the predicate

(movie table as above)
Selectivity Estimation: Problem Formulation

Input: a bag of strings and a fuzzy string predicate P(q, δ), e.g. star SIMILARTO ’Schwarrzenger’
Output: # of strings s that satisfy dist(s, q) <= δ
Why Selectivity Estimation?

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND year BETWEEN 1980 AND 1989;

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND year BETWEEN 1970 AND 1971;

(Movies table as above)

The optimizer needs to know the selectivity of a predicate to decide on a good plan.
Using traditional histograms?

• No “nice” order for strings
• Lexicographic order?
  – Similar strings could be far from each other: Kammy/Cammy
  – Adjacent strings have different selectivities: Cathy/Catherine
Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
Our approach: SEPIA (Selectivity Estimation of Approximate Predicates)

Intuition: [Figure: a cluster with pivot p and a member string s; given the edit vectors v1 (pivot to query string q) and v2 (pivot to s) and the distance ed(p, s), we can estimate the probability that s is within a given distance of q (e.g., 10% within 1, 28% within 2, 44% within 3, 100% within 4).]
Proximity between Strings

[Figure: a cluster with pivot “lucia”; the members “lucas” and “luciano” are both at edit distance 2 from the pivot, while the query string “lukas” is at distance 3.]

Edit distance? Not discriminative enough.
Edit Vector from s1 to s2

• A vector <I, D, S> summarizing a sequence of edit operations achieving the edit distance:
  – I: # of insertions
  – D: # of deletions
  – S: # of substitutions
• Easily computable
• Not symmetric
• Not unique, but tends to be (for ed <= 3, 91% unique)
• Example: from “lucia”, the edit vector to “lucas” is <1,1,0>, and to “luciano” it is <2,0,0>.
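A minimal sketch of computing an edit vector: run the edit-distance DP, then backtrack one optimal path and count the operation types.

```python
def edit_vector(s1, s2):
    """(insertions, deletions, substitutions) of one minimal edit
    script from s1 to s2, via DP plus backtracking."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))
    # Backtrack one optimal path, counting operation types.
    i, j, ins, dele, sub = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i-1] == s2[j-1] and dp[i][j] == dp[i-1][j-1]:
            i, j = i - 1, j - 1                 # match, no operation
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            i, j, sub = i - 1, j - 1, sub + 1   # substitution
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            i, dele = i - 1, dele + 1           # deletion from s1
        else:
            j, ins = j - 1, ins + 1             # insertion into s1
    return ins, dele, sub

print(edit_vector("lucia", "luciano"))  # (2, 0, 0): insert 'n', 'o'
```

Because minimal scripts are not unique, tie-breaking matters: this version prefers substitutions, so for lucia→lucas it reports <0,0,2> (two substitutions) where the slide's equally minimal script gives <1,1,0>.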
Why Edit Vector?

More discriminative: [Figure: in the cluster with pivot “lucia”, member “lucas” has edit vector <1,1,0> and “luciano” has <2,0,0>; the query “lukas” has <1,1,1>, so the vectors distinguish strings that edit distance alone cannot.]
SEPIA histograms: Overview

• The strings are partitioned into clusters 1..k, each with a pivot pi.
• Each cluster keeps a frequency table mapping each edit vector (from its pivot) to the number of member strings with that vector, e.g.:

  Edit Vector | # of Strings
  <0,0,0>     | 4
  <0,0,1>     | 12
  <0,1,0>     | 7
  …           | …

• One global PPD table is shared by all clusters:

  Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
  <1,1,0>   | <1,0,1>   | 1             | 30             | 9
  <1,1,0>   | <1,0,1>   | 2             | 60             | 18
  <1,1,0>   | <1,0,1>   | 3             | 100            | 30
  <1,1,0>   | <1,1,1>   | 2             | 32             | 8
  <1,1,0>   | <1,1,1>   | 3             | 76             | 19
  <1,1,0>   | <1,1,1>   | 4             | 88             | 22
  <1,1,0>   | <1,1,1>   | 5             | 100            | 25
  …         | …         | …             | …              | …
Frequency table for each cluster

Cluster i, pivot pi:

  Edit Vector | # of Strings
  <0,0,0>     | 4
  <0,0,1>     | 12
  <0,1,0>     | 7
  …           | …

The entry <0,1,0> → 7 means the cluster contains 7 strings with edit vector <0,1,0> from pi.
Global PPD Table

Proximity Pair Distribution table: given v1 (the edit vector from a pivot p to the query string q) and v2 (from p to a member string s), it records the probability that ed(q, s) falls within each distance. For example, for v1 = <1,1,0> and v2 = <1,0,1>: 30% within distance 1, 60% within 2, 100% within 3.

  Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
  <1,1,0>   | <1,0,1>   | 1             | 30             | 9
  <1,1,0>   | <1,0,1>   | 2             | 60             | 18
  <1,1,0>   | <1,0,1>   | 3             | 100            | 30
  …         | …         | …             | …              | …
SEPIA histograms: summary

(per-cluster frequency tables plus the global PPD table, as above)
Selectivity Estimation: ed(lukas, 2)

In cluster i with pivot “lucia”, the query “lukas” has edit vector v1 = <1,1,1> from the pivot. For the frequency-table entry v2 = <0,1,0> with 40 strings, the global PPD table gives (v1 = <1,1,1>, v2 = <0,1,0>, distance 2) → 76%, so the expected contribution is 76% * 40.

• Do this for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions
Selectivity Estimation for ed(q, d)

• For each cluster Ci:
  – Compute v1, the edit vector from Ci’s pivot to q
  – For each entry (v2, N) in the frequency table of Ci:
    • Use (v1, v2, d) to look up the percentage f in the PPD table
    • Expected contribution: f * N
• Take the sum of these f * N contributions
• Pruning possible (triangle inequality)
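The two SEPIA structures and the estimation loop can be sketched with toy, hand-built tables (the vectors, counts, and PPD fractions below are illustrative, not data from the paper):

```python
# freq_tables[i]: edit vector v2 (cluster i's pivot -> string) -> count N
freq_tables = {
    0: {(0, 0, 0): 4, (0, 0, 1): 12, (0, 1, 0): 7},
    1: {(0, 0, 0): 3, (0, 1, 0): 40},
}
# ppd[(v1, v2, d)]: fraction of sampled (q, s) pairs with vectors
# (v1, v2) from their pivot that satisfied ed(q, s) <= d
ppd = {
    ((1, 1, 1), (0, 0, 0), 2): 0.50,
    ((1, 1, 1), (0, 0, 1), 2): 0.40,
    ((1, 1, 1), (0, 1, 0), 2): 0.76,
    ((0, 0, 1), (0, 0, 0), 2): 0.90,
    ((0, 0, 1), (0, 1, 0), 2): 0.30,
}
# edit vector from each cluster's pivot to the query q (precomputed)
v1_per_cluster = {0: (1, 1, 1), 1: (0, 0, 1)}

def estimate_selectivity(d):
    """Sum the expected contribution f * N over every frequency-table
    entry of every cluster; a missing PPD entry contributes 0."""
    total = 0.0
    for i, table in freq_tables.items():
        v1 = v1_per_cluster[i]
        for v2, n in table.items():
            total += ppd.get((v1, v2, d), 0.0) * n
    return total

print(estimate_selectivity(2))  # about 26.8 estimated matching strings
```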
Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
Clustering Strings

Two example algorithms:
• Lexicographic order based
• K-Medoids:
  – Choose initial pivots
  – Assign each string to its closest pivot
  – Swap a pivot with another string
  – Reassign the strings
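A sketch of string clustering with a simplified k-medoids (Voronoi iteration: alternately assign strings and re-select medoids), rather than the full swap-based variant outlined above:

```python
import random

def ed(a, b):
    """Unit-cost edit distance via a rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute/match
        prev = cur
    return prev[-1]

def k_medoids(strings, k, iters=10, seed=0):
    """Alternate between assigning each string to its closest pivot and
    re-picking each pivot as the member minimizing total distance."""
    rng = random.Random(seed)
    pivots = rng.sample(strings, k)
    clusters = {}
    for _ in range(iters):
        clusters = {p: [] for p in pivots}
        for s in strings:
            clusters[min(pivots, key=lambda p: ed(s, p))].append(s)
        # each cluster contains at least its own pivot, so never empty
        new_pivots = [min(ms, key=lambda c: sum(ed(c, s) for s in ms))
                      for ms in clusters.values()]
        if sorted(new_pivots) == sorted(pivots):
            break  # converged
        pivots = new_pivots
    return clusters

clusters = k_medoids(["lucia", "lucas", "luciano", "tom", "ton", "tommy"], 2)
```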
Number of Clusters

It affects:
• Cluster quality
  – Similarity of strings within each cluster
• Costs:
  – Space
  – Estimation time
Constructing Frequency Tables

• For each cluster, group strings based on their edit vector from the pivot
• Count the frequency of each group

[Figure: cluster i with pivot pi; each member’s edit vector, e.g. <0,1,0>, determines its group.]
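Building a cluster's frequency table is then a grouping pass. A sketch (`edit_vector` is the DP-plus-backtracking computation shown earlier, repeated here so the snippet is self-contained):

```python
from collections import Counter

def edit_vector(p, s):
    """(insertions, deletions, substitutions) of one minimal edit
    script from pivot p to string s."""
    m, n = len(p), len(s)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (p[i - 1] != s[j - 1]))
    i, j, vec = m, n, [0, 0, 0]   # [insertions, deletions, substitutions]
    while i > 0 or j > 0:
        if i > 0 and j > 0 and p[i-1] == s[j-1] and dp[i][j] == dp[i-1][j-1]:
            i, j = i - 1, j - 1                       # match
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            i, j, vec[2] = i - 1, j - 1, vec[2] + 1   # substitution
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            i, vec[1] = i - 1, vec[1] + 1             # deletion
        else:
            j, vec[0] = j - 1, vec[0] + 1             # insertion
    return tuple(vec)

def build_frequency_table(pivot, members):
    """Group a cluster's strings by edit vector from the pivot, count."""
    return Counter(edit_vector(pivot, s) for s in members)

table = build_frequency_table("lucia", ["lucia", "luca", "lucie", "luciano"])
print(table[(2, 0, 0)])  # 1: only "luciano" needs two insertions
```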
Constructing the PPD Table

• Get enough samples of string triplets (q, p, s): a collection of query strings q run against the set of clusters
• We propose a few sampling heuristics:
  – ALL_RAND
  – CLOSE_RAND
  – CLOSE_LEX
  – CLOSE_UNIQUE
Dynamic Maintenance: Frequency Table

Take insertion as an example: the new string is assigned to cluster i, its edit vector from pivot pi (e.g. <0,1,0>) is computed, and the corresponding frequency-table count is incremented (here 7 → 8).

  Edit Vector | # of Strings
  <0,0,0>     | 4
  <0,0,1>     | 12
  <0,1,0>     | 7 → 8
  …           | …
Dynamic Maintenance: PPD

[Figure: for a new string s, take each sample query string q used in PPD construction, with edit vector v1 from the pivot p of one of the clusters; compute the new string’s vector v2 from p and its distance ed(q, s) (here 2); increment the count of the affected (v1, v2, d) entries and adjust the percentage column accordingly.]
Improving Estimation Accuracy

• Reasons for estimation errors:
  – Misses in the PPD
  – Inaccurate percentage entries in the PPD
• Improvement: use sample fuzzy predicates and analyze their estimation errors

  Predicate     | Real | Estimate | Relative Error
  P1(tommy, 2)  | 500  | 750      | +50%
  P2(james, 3)  | 400  | 600      | +50%
  P3(jordan, 2) | 600  | 600      | 0%
  P4(david, 2)  | 500  | 300      | -40%

Observed error distribution: -40% with probability 25%, 0% with 25%, +50% with 50%.
Relative-Error Model

• Use the errors to build a model
• Use the model to adjust the initial estimate

The model is a decision tree over d (the distance threshold), L (the query string length), and IE (the initial estimate): branches split on d = 1, 2, 3, then on 1 <= L <= 5 vs. L >= 6, then on 0 <= IE <= 40 vs. IE >= 41, and each leaf stores an average relative error (leaf values include -15%, -20%, +17%, -8%, 1%, +12%, -23%, +25%).
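A hypothetical sketch of such a model: bucket the sample predicates by (d, length band, initial-estimate band), store each bucket's average relative error, and divide future initial estimates by (1 + error). The band boundaries mirror the slide; the correction values below are invented for illustration:

```python
def bucket(d, length, initial_estimate):
    """Map a predicate to a model bucket; bands follow the slide
    (1 <= L <= 5 vs. L >= 6, 0 <= IE <= 40 vs. IE >= 41)."""
    l_band = "short" if length <= 5 else "long"
    ie_band = "small" if initial_estimate <= 40 else "large"
    return (d, l_band, ie_band)

# Average relative error per bucket, learned from sample predicates.
# These correction values are made up for illustration.
model = {
    (2, "short", "small"): 0.17,   # estimates ~17% too high on average
    (2, "short", "large"): -0.08,  # estimates ~8% too low on average
}

def corrected_estimate(d, q, initial_estimate):
    """If estimates in this bucket average (1 + err) times the real
    count, divide the initial estimate by (1 + err)."""
    err = model.get(bucket(d, len(q), initial_estimate), 0.0)
    return initial_estimate / (1.0 + err)

print(corrected_estimate(2, "lukas", 30))  # about 25.6
```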
Outline

• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Overview
  – Proximity between strings
  – Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
Data

• Citeseer:
  – 71K author names
  – Length: [2,20], avg = 12
• Movie records from the UCI KDD repository:
  – 11K movie titles
  – Length: [3,80], avg = 35
• Introduced duplicates:
  – 10% of records
  – # of duplicates: [1,20], uniform
• Final data sets:
  – Citeseer: 142K author names
  – UCI KDD: 23K movie titles
Setting

• Test bed:
  – PC: 2.4 GHz Pentium 4, 1.2 GB RAM, Windows XP
  – Visual C++ compiler
• Query workload:
  – Strings from the data
  – Strings not in the data
  – Results were similar for both
• Quality measurements:
  – Relative error: (f_est - f_real) / f_real
  – Absolute relative error: |f_est - f_real| / f_real
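The two measurements, applied for illustration to the sample predicates from the error-analysis slide:

```python
def relative_error(f_est, f_real):
    """Signed relative error of an estimate against the real count."""
    return (f_est - f_real) / f_real

def avg_absolute_relative_error(pairs):
    """pairs: iterable of (estimated count, real count) tuples."""
    errs = [abs(relative_error(est, real)) for est, real in pairs]
    return sum(errs) / len(errs)

# (estimate, real) for P1..P4 from the error-analysis slide
workload = [(750, 500), (600, 400), (600, 600), (300, 500)]
print(avg_absolute_relative_error(workload))  # 0.35, i.e. 35%
```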
Clustering Algorithms

                                      | k-Medoids | Lexicographic
  Clustering time (sec)               | 217       | 120
  Estimation time (ms)                | 45        | 47
  Average absolute relative error (%) | 18        | 37

k-Medoids is better.
Quartile distribution of relative errors

[Chart: percentage of the workload (0 to 1) vs. relative error (%), with buckets from -100% to +100% and infinity. Data set 1, CLOSE_RAND, 1000 clusters.]
Number of Clusters
Effectiveness of Applying Relative-Error Model

  Average absolute relative error (%) | Without Error Correction | With Error Correction
  Data set 1                          | 25                       | 18
  Data set 2                          | 12                       | 10
Dynamic Maintenance
Other work 1: Relaxing SQL Queries with Selections/Joins

SELECT *
FROM Jobs J, Candidates C
WHERE J.Salary <= 95 AND J.Zipcode = C.Zipcode AND C.WorkYear >= 5;

Jobs:
  JID | Company   | Zipcode | Salary
  r1  | Broadcom  | 92047   | 80
  r2  | Intel     | 93652   | 95
  r3  | Microsoft | 82632   | 120
  r4  | IBM       | 90391   | 130
  …   | …         | …       | …

Candidates:
  CID | Zipcode | ExpSalary | WorkYear
  s1  | 93652   | 120       | 3
  s2  | 92612   | 130       | 6
  s3  | 82632   | 100       | 5
  s4  | 90391   | 150       | 1
  …   | …       | …         | …
Query Relaxation: Skyline!

[Figure: the lattice of subsets of relaxable conditions ({}, R, J, S, RJ, RS, SJ, RSJ), and a plot of C.WorkYear vs. J.Salary with the constraints J.Salary <= 95 and C.WorkYear >= 5 marked; the answers to relaxed queries form a skyline.]
Other work 2: Fuzzy Predicates on Attributes of Mixed Types

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year - 1977| <= 3;

(Movies table as above)
Mixed-Type Predicates

• String attributes: edit distance
• Numeric attributes: absolute numeric difference

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year - 1977| <= 3;
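The predicate semantics can be checked with a brute-force scan (the baseline an index would accelerate). The edit-distance threshold of 3 and the year window centered at 1984 below are assumptions, chosen only so the three sample rows yield a match:

```python
def ed(a, b):
    """Unit-cost edit distance via a rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

movies = [
    ("Keanu Reeves", "The Matrix", 1999),
    ("Schwarzenegger", "The Terminator", 1984),
    ("Samuel Jackson", "Goodfellas", 1990),
]

# star within edit distance 3 of the misspelling, year within 3 of 1984
matches = [m for m in movies
           if ed(m[0], "Schwarrzenger") <= 3 and abs(m[2] - 1984) <= 3]
print(matches)  # [('Schwarzenegger', 'The Terminator', 1984)]
```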
MAT-tree: Intuition

• One index on both attributes together is more effective than two separate index structures
• Numeric attribute: B-tree
• String attribute: a tree-based index structure?
MAT-tree: Overview

• Tree-based index structure: each node has an MBR for both the numeric attribute and the string attribute
• Strings are compressed into a “compressed trie” that fits into a limited space
• The edit distance between a string and a compressed trie can be computed
• Experiments show that the MAT-tree is very efficient

[Figure: a MAT-tree over (star, year) pairs such as (Spielberg, 1946), (Hanks, 1956), (Gibson, 1956), (Hanks, 1957), (Crowe, 1964), (Robert, 1968), (DiCaprio, 1974), (Roberrts, 1977); leaf nodes hold year MBRs such as <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>, and internal nodes hold merged MBRs such as <1946,1957> and <1964,1977>, each paired with a compressed trie (*) for its strings.]
Conclusion

• It’s important to support answering approximate queries efficiently
• Our results so far:
  – SEPIA: provides accurate selectivity estimation for fuzzy string predicates
  – Relaxing SQL queries with selections and joins
  – MAT-tree: an indexing structure supporting fuzzy queries with mixed-type predicates