
Approximate String Joins in a Database (Almost) for Free

L. Gravano (Columbia Univ.), P.G. Ipeirotis (Columbia Univ.), H.V. Jagadish (Univ. of Michigan)

N. Koudas (AT&T Labs), S. Muthukrishnan (AT&T Labs), D. Srivastava (AT&T Labs)

Scenario of Application

The work was done before 2001. The related research has since been put into practice.

05/03/23 Columbia University, Computer Science Dept.

The Need for String Joins

Service A
  Jenny Stamatopoulou
  John Paul McDougal
  Aldridge Rodriguez
  Panos Ipeirotis
  John Smith

Service B
  Panos Ipirotis
  Jonh Smith
  Jenny Stamatopulou
  John P. McDougal
  Al Dridge Rodriguez

Substantial amounts of data in existing DBMSs are strings.
Often, there is a need to correlate data stored in different tables.

Example: Find common customers across various customer databases

Problems with Exact String Joins

Typing mistakes (e.g., John vs. Jonh)
No standard way of recording string data
Standard equijoins do not “forgive” such mistakes

Service A ⋈ Service B = ∅

Approximate String Joins

We want to join tuples with “similar” string fields.
Similarity measure: edit distance, the minimal number of single-character insertions, deletions, and replacements needed to turn one string into the other (each operation increases the distance by one).

Service A              Service B
Jenny Stamatopoulou    Jenny Stamatopulou
John Paul McDougal     John P. McDougal
Aldridge Rodriguez     Al Dridge Rodriguez
Panos Ipeirotis        Panos Ipirotis
John Smith             Jonh Smith

The slide labels the five matched pairs with K=1, K=2, K=1, K=3, K=1.

Our Focus: Approximate String Joins over Relational DBMSs

Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K

Solve the problem in a database-friendly way (if possible with an existing "vanilla" RDBMS)

Related Work

Data Cleaning
  Hernandez and Stolfo, DMKD Journal, 2(1), 1998
  Monge and Elkan, SIGMOD DMKD Workshop, 1997
  ...

Approximate String Matching
  Baeza-Yates and Gonnet, SPIRE 1999
  Sutinen and Tarhio, ESA’95, CPM’96
  Smith and Waterman, Journal of Molecular Biology 147, 1981
  Ukkonen, TCS 92(1), 1992
  Ullman, Computer Journal, 1977
  …

Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:
Join data outside of DBMS
Join data via user-defined functions (UDFs) inside the DBMS

Approximate String Joins outside of a DBMS

1. Export data
2. Join outside of DBMS
3. Import the result

Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

Disadvantages:
Substantial amounts of data to be exported/imported
Cannot be easily integrated with further processing steps in the DBMS

Approximate String Joins with UDFs

1. Write a UDF to check if two strings match within distance K
2. Write an SQL statement that applies the UDF to the string pairs

SELECT R.stringAttr, S.stringAttr
FROM R, S
WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage: Ease of implementation

Main disadvantage: UDF applied to entire cross-product of relations
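
The UDF approach can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for the RDBMS; the helper names within_k and naive_udf_join are hypothetical, and the UDF is registered under the name edit_distance only to mirror the query on the slide. The database evaluates the UDF once for every pair in the cross product of R and S, which is the disadvantage named above.

```python
import sqlite3


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def within_k(s: str, t: str, k: int) -> int:
    """Boolean UDF body: 1 if the edit distance is at most k, else 0."""
    return int(edit_distance(s, t) <= k)


def naive_udf_join(r_strings, s_strings, k):
    """Apply the UDF to every pair of R x S (illustrative, not optimized)."""
    conn = sqlite3.connect(":memory:")
    # Registered as edit_distance(a, b, K) to match the SQL on the slide.
    conn.create_function("edit_distance", 3, within_k)
    conn.execute("CREATE TABLE R (stringAttr TEXT)")
    conn.execute("CREATE TABLE S (stringAttr TEXT)")
    conn.executemany("INSERT INTO R VALUES (?)", [(x,) for x in r_strings])
    conn.executemany("INSERT INTO S VALUES (?)", [(x,) for x in s_strings])
    rows = conn.execute(
        "SELECT R.stringAttr, S.stringAttr FROM R, S "
        "WHERE edit_distance(R.stringAttr, S.stringAttr, ?)",
        (k,),
    ).fetchall()
    conn.close()
    return sorted(rows)
```

For example, naive_udf_join(["John Smith", "Panos Ipeirotis"], ["Jonh Smith", "Panos Ipirotis"], 1) evaluates the UDF on all four pairs and keeps only the Panos pair, since John/Jonh needs two edits.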

Edit distance Algorithms

Applied in UDF method: WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Input: two strings s and t with reasonable lengths m and n.

Output: their edit distance.

Dynamic programming algorithm: Levenshtein distance

Levenshtein distance Algorithm

    int LevenshteinDistance(char s[1..m], char t[1..n])
        declare int d[0..m, 0..n]   // (m+1)*(n+1) matrix

        // initialization of the matrix
        for i from 0 to m
            d[i, 0] := i
        for j from 0 to n
            d[0, j] := j

Matrix for Levenshtein distance (initialization)

         K  I  T  T  E  N
      0  1  2  3  4  5  6
   S  1
   I  2
   T  3
   T  4
   I  5
   N  6
   G  7

Levenshtein distance Algorithm (cont'd)

    // after the initialization
    for i from 1 to m {
        for j from 1 to n {
            if s[i] = t[j] then cost := 0 else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,       // deletion
                d[i, j-1] + 1,       // insertion
                d[i-1, j-1] + cost   // substitution
            )
        }
    }
    return d[m, n]

Matrix for Levenshtein distance

         K  I  T  T  E  N
      0  1  2  3  4  5  6
   S  1  1  2  3  4  5  6
   I  2  2  1  2  3  4  5
   T  3  3  2  1  2  3  4
   T  4  4  3  2  1  2  3
   I  5  5  4  3  2  2  3
   N  6  6  5  4  3  3  2
   G  7  7  6  5  4  4  3

The “right-most and lowest” element d[m, n] is the edit-distance.
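
The pseudocode from the preceding slides can be written as a short runnable sketch. The function names levenshtein_matrix and levenshtein_distance are illustrative choices; the only other change from the pseudocode is that Python strings are 0-indexed, so s[i-1] here plays the role of s[i] there.

```python
def levenshtein_matrix(s: str, t: str) -> list[list[int]]:
    """Fill the (m+1) x (n+1) matrix d, where d[i][j] is the edit
    distance between the first i characters of s and the first j of t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: turning a prefix into the empty string (or back).
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d


def levenshtein_distance(s: str, t: str) -> int:
    """The bottom-right cell d[m][n] is the edit distance."""
    return levenshtein_matrix(s, t)[len(s)][len(t)]
```

Calling levenshtein_matrix("sitting", "kitten") reproduces the rows of the matrix above, with "sitting" down the side and "kitten" across the top.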

Levenshtein distance Algorithm

Correctness:
Invariant: d[i, j] is the edit distance between the prefixes s[1…i] and t[1…j]. Given d[i-1, j], d[i, j-1], and d[i-1, j-1], the recurrence determines d[i, j].

The initialization satisfies the invariant by the definition of edit distance (turning a prefix of length i into the empty string takes i deletions), and induction over the recurrence extends it to every cell.

O(mn) time, which could be improved (not the heart of the matter).

Our Approach: Approximate String Joins over an Unmodified RDBMS

1. Preprocess data and generate auxiliary tables
2. Perform join exploiting standard RDBMS capabilities

Advantages:
No modification of underlying RDBMS needed.
Can leverage the RDBMS query optimizer.
Much more efficient than the approach based on naive UDFs.

Our Approach: Intuition and Roadmap

Intuition:
Similar strings have many common substrings
Use exact joins to perform approximate joins (current DBMSs are good for exact joins)
A good candidate set can be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

Roadmap:
Break strings into substrings of length q (q-grams)
Perform an exact join on the q-grams
Find candidate string pairs based on the results
Check only candidate pairs with a UDF to obtain final answer

What is a “Q-gram”?

Q-gram: A sequence of q characters of the original string

Example for q=3: vacations

{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

String with length L → L + q - 1 q-grams (after padding with q - 1 '#' characters at the start and q - 1 '$' characters at the end)

Similar strings have many common q-grams
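
A minimal sketch of q-gram extraction, assuming '#' and '$' do not occur in the data and that positions are numbered from 1. The function names are illustrative. The positional variant is included because a later slide stores the position of each q-gram.

```python
def positional_qgrams(s: str, q: int = 3) -> list[tuple[int, str]]:
    """Return (position, q-gram) pairs for s, with positions starting at 1.

    The string is padded with q - 1 '#' characters in front and q - 1 '$'
    characters at the end, so a string of length L yields L + q - 1 q-grams.
    """
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def qgrams(s: str, q: int = 3) -> list[str]:
    """Return the q-grams of s in order, without positions."""
    return [gram for _, gram in positional_qgrams(s, q)]
```

For "vacations" with q=3 this yields the eleven q-grams listed above, the first being (1, "##v") and the last (11, "s$$").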

Q-grams and Edit Distance Operations

With no edits: L + q - 1 common q-grams

Replacement: (L + q - 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}

Insertion: (Lmax + q - 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}

Deletion: (Lmax + q - 1) - q common q-grams
Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}

“Q-gram Distance” and Edit Distance

In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: |Maximum number of q-grams| - |Common q-grams|

Each Edit Distance operation affects a q-gram in one of the following ways:
It destroys it, or
It leaves it intact, or
It shifts it by one position

Each Edit Distance operation destroys at most q q-grams

For Edit Distance = K, the “q-gram distance” is at most Kq

Number of Common Q-grams and Edit Distance

For Edit Distance = K, there could be at most K replacements, insertions, deletions

Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common

Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
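
The bound can be checked against the "vacations" examples from the earlier slide with a small sketch. The function names are illustrative, and common q-grams are counted here as a multiset intersection, which is one reasonable reading of "q-grams in common".

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """Padded q-grams of s ('#' in front, '$' at the end)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def common_qgrams(s1: str, s2: str, q: int) -> int:
    """Number of q-grams shared by s1 and s2, counted as a multiset."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


def min_common_qgrams(len1: int, len2: int, k: int, q: int) -> int:
    """Lower bound from the slide: [max(len1, len2) + q - 1] - k*q."""
    return max(len1, len2) + q - 1 - k * q


def passes_count_filter(s1: str, s2: str, k: int, q: int) -> bool:
    """True if the pair shares enough q-grams to possibly be within k."""
    return common_qgrams(s1, s2, q) >= min_common_qgrams(len(s1), len(s2), k, q)
```

With q=3 and K=1, "vacations" against "vacalions", "vacatlions", and "vacaions" shares 8, 9, and 8 q-grams, and in each case that equals the bound, so all three pairs survive the filter. A pair such as "vacations" and "holidays" shares only "s$$" and is eliminated.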

Using a DBMS for Q-gram Joins

If we have the q-grams in the DBMS, we can perform this counting efficiently.

Create auxiliary tables with tuples of the form <sid, strlen, qgram> and join these tables

A GROUP BY – HAVING COUNT clause can perform the counting / filtering

Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

The result is the set of string pairs with enough common q-grams; the threshold guarantees that no true match is dropped (no false negatives).
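
The count filter can be run end to end in a sketch that assumes SQLite (through Python's sqlite3 module) in place of the RDBMS. The function count_filter_candidates is hypothetical, and sids are simply the list indexes of the input strings. Two details differ from the abbreviated slide query: strlen is added to the GROUP BY so the HAVING clause only refers to grouped columns, and K and q are passed as bound parameters.

```python
import sqlite3


def qgrams(s: str, q: int) -> list[str]:
    """Padded q-grams of s ('#' in front, '$' at the end)."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def count_filter_candidates(r_strings, s_strings, k, q=3):
    """Return (R.sid, S.sid) pairs that survive COUNT FILTERING."""
    conn = sqlite3.connect(":memory:")
    for name, strings in (("R", r_strings), ("S", s_strings)):
        # Auxiliary table with one row <sid, strlen, qgram> per q-gram.
        conn.execute(
            f"CREATE TABLE {name} (sid INTEGER, strlen INTEGER, qgram TEXT)"
        )
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?, ?)",
            [(sid, len(s), g) for sid, s in enumerate(strings)
             for g in qgrams(s, q)],
        )
    # Pairs that share no q-gram never reach the HAVING clause, so a
    # non-positive threshold would need separate handling.
    rows = conn.execute(
        """
        SELECT R.sid, S.sid
        FROM R, S
        WHERE R.qgram = S.qgram
        GROUP BY R.sid, S.sid, R.strlen, S.strlen
        HAVING COUNT(*) >= (max(R.strlen, S.strlen) + :q - 1) - :k * :q
        """,
        {"k": k, "q": q},
    ).fetchall()
    conn.close()
    return sorted(rows)
```

The join counts every pair of equal q-gram rows, so strings with repeated q-grams can be over-counted. That only lets extra candidates through; it does not drop true matches.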

Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
  AND abs(R.strlen - S.strlen)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as LENGTH FILTERING

Exploiting Q-gram Positions for Filtering

Consider strings aabbzzaacczz and aacczzaabbzz
Strings are at edit distance 4
Strings have identical q-grams for q=3

Problem: Matching q-grams that are at different positions in both strings
Either q-grams do not "originate" from same q-gram, or
Too many edit operations "caused" spurious q-grams at various parts of strings to match
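
The example can be reproduced with a short sketch that counts matching q-grams with and without a limit on how far apart their positions may be. The function matching_qgrams and its max_shift parameter are illustrative names, and positions are assumed to start at 1.

```python
def positional_qgrams(s: str, q: int = 3) -> list[tuple[int, str]]:
    """(position, q-gram) pairs of the padded string, positions from 1."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def matching_qgrams(s1: str, s2: str, q: int = 3, max_shift=None) -> int:
    """Count pairs of equal q-grams between s1 and s2.

    If max_shift is given, only pairs whose positions differ by at most
    max_shift are counted.
    """
    count = 0
    for p1, g1 in positional_qgrams(s1, q):
        for p2, g2 in positional_qgrams(s2, q):
            if g1 == g2 and (max_shift is None or abs(p1 - p2) <= max_shift):
                count += 1
    return count
```

For aabbzzaacczz and aacczzaabbzz with q=3, all 14 q-grams match when positions are ignored. For K=2 the count threshold is (12 + 3 - 1) - 2*3 = 8, so the pair would pass on counts alone even though its edit distance is 4. Restricting matches to positions at most 2 apart leaves only 6 matching q-grams, which is below the threshold.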

POSITION FILTERING - Filtering using positions

Keep the position of the q-grams: <sid, strlen, pos, qgram>
Do not match q-grams that are more than K positions away

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
  AND abs(R.strlen - S.strlen)<=K
  AND abs(R.pos - S.pos)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as POSITION FILTERING

The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sid
FROM R1, S1, R, S
WHERE R1.sid=R.sid
  AND S1.sid=S.sid
  AND R.qgram=S.qgram
  AND abs(strlen(R1.string)-strlen(S1.string))<=K
  AND abs(R.pos - S.pos)<=K
GROUP BY R1.sid, S1.sid, R1.string, S1.string
HAVING COUNT(*) >= (max(strlen(R1.string),strlen(S1.string)) + q-1) - K*q
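
As a rough end-to-end illustration, the statement can be run against SQLite through Python's sqlite3 module. This is a sketch under several assumptions rather than the setup used in the talk: SQLite's length() stands in for strlen(), the verification UDF is a hypothetical within_k function appended to the HAVING clause to remove false positives, and approximate_join and the table-building loop are illustrative.

```python
import sqlite3


def positional_qgrams(s: str, q: int) -> list[tuple[int, str]]:
    """(position, q-gram) pairs of the padded string, positions from 1."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def approximate_join(r_strings, s_strings, k, q=3):
    """Length, position, and count filtering in SQL, then UDF verification."""
    conn = sqlite3.connect(":memory:")
    conn.create_function(
        "within_k", 3, lambda a, b, kk: int(edit_distance(a, b) <= kk)
    )
    for base, aux, strings in (("R1", "R", r_strings), ("S1", "S", s_strings)):
        conn.execute(f"CREATE TABLE {base} (sid INTEGER PRIMARY KEY, string TEXT)")
        conn.execute(f"CREATE TABLE {aux} (sid INTEGER, pos INTEGER, qgram TEXT)")
        for sid, s in enumerate(strings):
            conn.execute(f"INSERT INTO {base} VALUES (?, ?)", (sid, s))
            conn.executemany(
                f"INSERT INTO {aux} VALUES (?, ?, ?)",
                [(sid, pos, g) for pos, g in positional_qgrams(s, q)],
            )
    rows = conn.execute(
        """
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid
          AND S1.sid = S.sid
          AND R.qgram = S.qgram
          AND abs(length(R1.string) - length(S1.string)) <= :k
          AND abs(R.pos - S.pos) <= :k
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= (max(length(R1.string), length(S1.string)) + :q - 1) - :k * :q
           AND within_k(R1.string, S1.string, :k)
        """,
        {"k": k, "q": q},
    ).fetchall()
    conn.close()
    return sorted(rows)
```

Run on the Service A and Service B names with K=1, this returns the Jenny Stamatopoulou and Panos Ipeirotis pairs. The UDF is evaluated only on the groups that survive the filters, not on the full cross product.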

Experimental Results: Data

Three customer data sets from AT&T Worldnet Service:

(a) set1 with about 40K strings
(b) set2 and (c) set3 with about 30K strings each

[Figure: histograms of number of strings versus string length for (a) set1, (b) set2, and (c) set3]

DBMS Setup

Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server

Materialized the q-gram tables with entries <sid, qgram, pos> (less than 2 minutes per table)

Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index)

The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible

Query Plans Generated by RDBMS

Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets)

Q-gram approach: usually sort-merge joins

In our prototype implementation, sort-merge join is the fastest as well

Naïve UDFs vs. Filtering

[Figure: bar chart (log scale) comparing Q1 (UDF only) with Q4 (Filtering)]

                  k=1     k=2     k=3
Q1 (UDF only)     1954    2028    2044
Q4 (Filtering)    48      68      91

For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs

Effect of Filters (Candidate Set)

CP=Cross Product, L=Length Filtering, LP=Length and Position Filtering, LC=Length and Count Filtering, LPC=Length, Position, and Count Filtering, Real=Number of Real matches

[Figure: candidate set size (log scale, 1.E+05 to 1.E+10) for k=1, k=2, k=3 under CP, L, LP, LC, LPC, and Real; panels (a), (b), (c)]

LENGTH FILTERING: 40-70% reduction for set1 (small length deviation)

90-98% reductions for set2, set3 (big length deviation)

+COUNT FILTERING: > 99% reduction

POSITION FILTERING: ~ 50% reduction (additionally)

Effect of Filters (Candidate Set) – Best Q

[Figure: candidate set size (log scale) for set1, set2, set3 with q=1 to q=5; panels (a) and (b)]

For the given data sets q=2 worked best
q=2 is close to the theoretical approximations as well
q=2 is small enough to avoid, as much as possible, the space overhead for the auxiliary tables

Effect of Filters (Q-gram Join Size)

COUNT FILTERING is applied last (in HAVING clause)

For efficiency, it is important to have a small join size for the q-gram tables

LENGTH FILTERING cuts the size by a factor of 2 to 10

+POSITION FILTERING cuts the size by a factor of ~100

Extensions: Substring Joins

Substring approximate joins:
Length filtering is not applicable
Substring matches have fewer common q-grams (no “overflow” q-grams)
Position filtering is not directly applicable (it needs tuning depending on the substring location)

We also propose a fast in-memory filter (see paper)
The result is not trivial!
Exploiting position of q-grams extensively to find possible alignments

Extensions: Block Moves

Match “AT&T Corp” with “Corp AT&T” as a unit cost operation.

We can allow block moves:
Length filtering is still applicable
The threshold for count filtering is different
Position filtering is not applicable (the q-gram may move far away)

Conclusions

We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression

Our technique does not require modifying the underlying RDBMS

Our technique exploits the RDBMS's query optimizer

Our technique significantly outperforms existing approaches

Many opportunities for improvements and future work!

