Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz...


Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1]

Pirooz Chubak

May 22, 2008

Motivation

• Selectivity estimation of approximate string matching queries

• Applications

– Misspelling correction/suggestion

– Data integration and data cleaning

– Query optimization (generating query plans)

Approximate String Matching

• String similarity measures

– Edit distance

– Hamming distance

– Jaccard similarity coefficient

• Edit distance

– Minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
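The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration assuming unit cost for every operation; the function name `edit_distance` is chosen here and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, replacement)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("abcd", "abd")` is 1 (one deletion).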

Short Identifying Substring

• SIS by Chaudhuri et al. [2]

– A string s usually has a short substring s’ such that if an attribute value contains s’, it almost always contains s as well

– Thus, approximate the selectivity of a long query string by that of its short identifying substring

Related Work

• SEPIA [3]

– Clusters similar strings

– Selects a pivot for each cluster

– Captures the edit distance distribution with histograms

– For each query, visits all the clusters and estimates the number of strings within the distance threshold

Problem Statement

• Given a query string sq and a bag of strings DB, estimate the size of the answer set

$\{\, s \mid ed(s_q, s) \le \tau,\ s \in DB \,\}$

• Interested in low edit distance thresholds ($\tau$ = 1 to 3)

Basic Definitions

• Q-gram

– Any string of length q

• N-gram table

– Frequencies of all q-grams for q=1…N

• Ans(sq,iDjImR) = set of strings s’ such that sq can be converted to s’ with i deletions, j insertions and m replacements

• Ans(sq,k) = set of strings s’ obtained from sq with exactly k edit operations

Examples

• Ans(“abcd”,1R) = {“?bcd”, “a?cd”, “ab?d”, “abc?”}

• Alphabet for extended Q-grams = $\Sigma \cup \{\#, \$, ?\}$

• 3-gram table for “beau” contains frequencies for

– 1-grams (b, e, a, u)

– 2-grams (#b, be, ea, au, u$)

– 3-grams (#be, bea, eau, au$)

• Extended 3-gram table also contains frequencies for

– 2-grams (?b, ?e, ?u, b?, e?, a?, u?, ??, #?, ?$)

– 3-grams (?ea, #?e, ??$, etc.)
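The “beau” example can be reproduced with a short sketch. It assumes that q-grams for q ≥ 2 are taken from the string padded with a single # and $, and that any position of a q-gram (including a # or $) may be replaced by the wildcard ?, which is consistent with the entries ?b and u? listed above. The function names are hypothetical.

```python
from itertools import combinations

def qgrams(s: str, q: int) -> list[str]:
    """Plain q-grams of s; for q >= 2 the string is padded as '#' + s + '$'."""
    padded = s if q == 1 else "#" + s + "$"
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def extended_qgrams(s: str, q: int) -> set[str]:
    """Each q-gram of s plus every variant with some positions set to '?'.

    Assumption: any position, including '#' and '$', may become a wildcard.
    """
    result = set()
    for gram in qgrams(s, q):
        for r in range(q + 1):
            for positions in combinations(range(q), r):
                chars = list(gram)
                for p in positions:
                    chars[p] = "?"
                result.add("".join(chars))
    return result
```

Under these assumptions `qgrams("beau", 2)` yields #b, be, ea, au, u$, and `extended_qgrams("beau", 3)` contains ?ea, #?e and ??$.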

Replacement semi-lattice

• Assume only replacements are allowed

• E.g. Ans(“abcd”,2R)

– Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd

• Find the value of |Ans(“abcd”,2R)| using the inclusion-exclusion formula for |S1 ∪ … ∪ S6|

• S1 = ab??, … , S6 = ??cd

Replacement semi-lattice (Cont.)

Get the values of intersections from this table and plug them into the formula for |Ans(“abcd”,2R)|

Semi-lattice for Ans(“abcd”,2R)
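As an illustration of plugging intersection frequencies into the union formula, the sketch below applies inclusion-exclusion to the six patterns S1..S6. It is not the paper's implementation: frequencies are counted by scanning a toy list of strings rather than read from an extended N-gram table, and the helper names (`matches`, `intersect`, `freq`, `union_size`) are made up for this example.

```python
from itertools import combinations

def matches(pattern: str, s: str) -> bool:
    """True if s has the pattern's length and agrees on every non-'?' position."""
    return len(pattern) == len(s) and all(p in ("?", c) for p, c in zip(pattern, s))

def intersect(p1: str, p2: str):
    """Pattern matched by exactly the strings matching both inputs, or None."""
    out = []
    for a, b in zip(p1, p2):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None  # conflicting fixed characters: empty intersection
    return "".join(out)

def freq(pattern: str, db: list[str]) -> int:
    """Stand-in for a table lookup: number of strings in db matching pattern."""
    return sum(matches(pattern, s) for s in db)

def union_size(base: list[str], db: list[str]) -> int:
    """|S1 ∪ ... ∪ Sn| by inclusion-exclusion over all r-intersections."""
    total = 0
    for r in range(1, len(base) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(base, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                total += sign * freq(node, db)
    return total
```

For instance, ab?? ∩ a?c? is the node abc?, and ab?? ∩ ??cd is abcd; these are the lower nodes of the semi-lattice.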

General Formulas

• Generalize the above idea to find |Ans(sq,kR)|

• The general formula for deletion is trivial: it can be shown to always be the sum of the frequencies of the level-0 nodes

• The general case for insertion can be very complex; we are only interested in at most 3 insertions

Estimate selectivity

• General idea

– Group Ans(sq,k) by the length of the strings (l-k … l+k, where l is the length of sq)

– Estimate the size of each subset separately

• Ans(“abcde”,2)

– 5 subsets, having strings of length 3 to 7

– Length 3 is Ans(“abcde”,2D)

– Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R)

Lots of overlap
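The grouping by length can be enumerated mechanically: with i deletions, j insertions and m replacements, i + j + m = k and the result has length l - i + j. The sketch below lists the operation mixes per length; the function name and the label order (I, D, R, matching the “1I1D” notation above) are choices made for this example.

```python
from collections import defaultdict

def operation_mixes_by_length(l: int, k: int) -> dict[int, list[str]]:
    """Map each result length to the iDjImR mixes with exactly k operations."""
    groups = defaultdict(list)
    for i in range(k + 1):            # number of deletions
        for j in range(k - i + 1):    # number of insertions
            m = k - i - j             # number of replacements
            if i + m > l:             # cannot delete/replace more characters than exist
                continue
            label = "".join(f"{n}{op}" for n, op in ((j, "I"), (i, "D"), (m, "R")) if n)
            groups[l - i + j].append(label)
    return dict(groups)
```

For l = 5 and k = 2 this gives lengths 3 to 7, with 2D at length 3 and both 2R and 1I1D at length 5.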

Estimate selectivity (Cont.)

• Combined Approach

– Obtain base strings for both sets

– Remove redundant base strings

• Ans(“abcde”,2R) generates “abc??”

• Ans(“abcde”,1I1D) generates “abcd?”

• “abc??” has all the strings in “abcd?”

Remove “abcd?” from the base strings
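The redundancy check can be sketched as a position-wise containment test between wildcard patterns. The names `subsumes` and `remove_redundant` are hypothetical, and the sketch assumes base strings are same-alphabet patterns where ? matches any single character.

```python
def subsumes(general: str, specific: str) -> bool:
    """True if every string matching `specific` also matches `general`."""
    return len(general) == len(specific) and all(
        g == "?" or g == s for g, s in zip(general, specific)
    )

def remove_redundant(base_strings: list[str]) -> list[str]:
    """Drop duplicates and any base string covered by another base string."""
    unique = list(dict.fromkeys(base_strings))  # keep first occurrence, preserve order
    return [
        p for p in unique
        if not any(q != p and subsumes(q, p) for q in unique)
    ]
```

With the example above, `subsumes("abc??", "abcd?")` holds, so “abcd?” is dropped while “abc??” is kept.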

Estimate selectivity (cont.)

• BasicEQ, for a given string length

– Find the base strings (remove redundancies)

– Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|)

• This will generate new nodes in the hierarchy

– Partition the nodes and estimate their frequencies

– Add these estimated frequencies

Estimate selectivity (cont.)

• Node Partitioning

– Partition the nodes, so that every node q in a partition has the same coefficient Cq

– Cq is the number of times q appears in all the intersections of base strings

– For each partition find Cq and sum of frequencies of its nodes
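A small sketch of node partitioning for the Ans(“abcd”,2R) base strings follows. It assumes Cq is the signed inclusion-exclusion count (an r-intersection contributes +1 for odd r and -1 for even r); the slides only say Cq counts how often q appears in the intersections, so the sign convention and all function names are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

def intersect(p1: str, p2: str):
    """Position-wise intersection of two wildcard patterns, or None if empty."""
    out = []
    for a, b in zip(p1, p2):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)

def node_coefficients(base: list[str]) -> dict[str, int]:
    """Signed number of times each node appears among all r-intersections."""
    coeff = defaultdict(int)
    for r in range(1, len(base) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(base, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                coeff[node] += sign
    return dict(coeff)

def partition_by_coefficient(coeff: dict[str, int]) -> dict[int, list[str]]:
    """Group nodes sharing the same non-zero coefficient."""
    parts = defaultdict(list)
    for node, c in coeff.items():
        if c != 0:
            parts[c].append(node)
    return dict(parts)

def estimate(parts: dict[int, list[str]], freq) -> int:
    """Sum over partitions of coefficient times the partition's total frequency."""
    return sum(c * sum(freq(n) for n in nodes) for c, nodes in parts.items())
```

For the six base strings ab??, …, ??cd this yields three partitions: the base strings themselves with coefficient 1, the four one-wildcard nodes with coefficient -2, and abcd with coefficient 3. `freq` is any callable returning a (possibly estimated) frequency for a node.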

Frequency Estimation

• Estimate the frequency of an extended q-gram in the extended N-gram table

• Maximal Overlap (MO) [4]

– Finds the substring in the table that has the maximum overlap with

• MAX approach

– If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?”

• MO+

– Find the substring with the minimum frequency

• MM

– Combination of MAX and MO+
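To give a feel for overlap-based estimation, here is a deliberately simplified stand-in, not the MO, MO+ or MM estimators themselves: it chains consecutive N-grams of a plain string (no wildcards, no # or $ padding) and divides by the frequency of their shared (N-1)-gram, then applies the MAX rule as a simple maximum. It assumes N ≥ 2, and all names are hypothetical.

```python
from collections import Counter

def build_ngram_table(db: list[str], N: int) -> Counter:
    """Occurrence counts of all substrings of length 1..N."""
    table = Counter()
    for s in db:
        for q in range(1, N + 1):
            for i in range(len(s) - q + 1):
                table[s[i:i + q]] += 1
    return table

def overlap_estimate(s: str, table: Counter, N: int) -> float:
    """Chain N-grams of s, dividing by the frequency of each shared overlap."""
    if len(s) <= N:
        return float(table[s])           # stored directly in the table
    est = float(table[s[:N]])
    for i in range(1, len(s) - N + 1):
        shared = table[s[i:i + N - 1]]   # (N-1)-gram shared with the previous N-gram
        if shared == 0:
            return 0.0
        est *= table[s[i:i + N]] / shared
    return est

def max_adjusted(general_est: float, specific_est: float) -> float:
    """MAX rule: a pattern's estimate is at least that of a more specific pattern."""
    return max(general_est, specific_est)
```

The MAX rule reflects that every string matching “abcd” also matches “abc?”, so the estimate for “abc?” should not be smaller.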

Estimate selectivity (cont.)

• BasicEQ is efficient if the general formulas are applicable

• Propose OptEQ, which adds two enhancements to BasicEQ

– Approximates the coefficient Cq but achieves better performance

– Groups the set of strings obtained in each iteration of BasicEQ to obtain faster intersection tests (for being empty)

Experimental Evaluation (method, NB, NE, PT)

Experimental Evaluation

Space vs. Accuracy

Conclusions

• Proposed OptEQ

– Approximates coefficients of partitions

– Groups semi-lattices to obtain scalability

– More accurate than SEPIA

– Exploits disk space to give higher precision

• MM and MAX estimates give good results

References

[1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to estimate selectivity of string matching with low edit distance”, VLDB 2007

[2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004

[3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005

[4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999
