Anchor Points Algorithms for Hamming and Edit Distance
Foto Afrati, National Technical University of Athens
Anish Das Sarma, ClearList Inc.
Anand Rajaraman, Cambrian Ventures
Pokey Rule, Stanford University
Semih Salihoglu, Stanford University
Jeff Ullman, Stanford University
Fuzzy Joins
Input: set of records R = {rec1, rec2, …, recm}
Output: <reci, recj> pairs s.t. dist(reci, recj) ≤ d, e.g., <rec1, rec5>, <rec7, rec9>, …, <rec3, reck>
Example Applications: entity resolution, clustering, collaborative filtering
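As a baseline, the problem statement can be sketched as a brute-force self-join. The function name `fuzzy_join` and the toy distance on integers are illustrative assumptions, not something taken from the talk.

```python
from itertools import combinations

def fuzzy_join(records, dist, d):
    """Return all unordered pairs of records with dist(a, b) <= d."""
    return [(a, b) for a, b in combinations(records, 2) if dist(a, b) <= d]

# Toy usage: integers with absolute difference as the distance (an assumption).
pairs = fuzzy_join([1, 2, 10, 11, 30], lambda a, b: abs(a - b), 1)
```

This compares every pair of records, which is the cost the one-round MapReduce algorithms on the following slides try to spread across reducers.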
Two Specific Distance Measures
1. Hamming Distance
   Input: bit strings R of length n, e.g., 00000, 00001, …, 10011
   Output: e.g., <00000, 00001>, …, <10011, 10010>
2. Edit Distance
   Input: strings R of length n over alphabet A, e.g., abcd, eabc, …, dddd
   Output: e.g., <abcd, eabc>, …, <dddd, dadd>
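A minimal sketch of the first measure, assuming bit strings are represented as Python strings of '0' and '1' characters (the representation and the name `hamming` are choices made here for illustration):

```python
def hamming(u, v):
    """Number of positions where the equal-length strings u and v differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))
```

Both example output pairs on this slide, <00000, 00001> and <10011, 10010>, differ in exactly one position.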
Fuzzy Joins In One-Round MapReduce
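[Figure: the Map phase sends each input record rec1, rec2, rec3, …, recm-1, recm to one or more reducers; the Reduce phase groups values by key, e.g., reducer1 receives rec1, rec5, rec7; reducer2 receives rec2, rec7, recm; …; reducerp receives rec2, recm.]
Cost measures: Per-Reducer-Memory-Cost and Communication Cost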
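The two cost measures can be made concrete with a small simulation of the shuffle step. This is a sketch under the assumption that communication cost counts the key-value pairs sent to reducers and per-reducer memory counts the largest reducer input; `run_one_round` and the toy mapper are hypothetical names.

```python
from collections import defaultdict

def run_one_round(records, mapper):
    """Simulate the shuffle of a one-round MapReduce job.

    mapper(record) returns the reducer keys the record is sent to.
    Returns (groups, communication, per_reducer_memory).
    """
    groups = defaultdict(list)
    for rec in records:
        for key in mapper(rec):
            groups[key].append(rec)
    # Assumed cost definitions: total pairs sent, and the largest reducer input.
    communication = sum(len(vals) for vals in groups.values())
    per_reducer_memory = max((len(vals) for vals in groups.values()), default=0)
    return dict(groups), communication, per_reducer_memory

# Hypothetical mapper: send integer x to reducers x % 3 and (x + 1) % 3.
groups, comm, mem = run_one_round(range(6), lambda x: {x % 3, (x + 1) % 3})
```

Sending each record to more reducers raises communication, while using fewer, larger reducers raises per-reducer memory; the algorithms compared in the talk sit at different points of that tradeoff.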
Communication Cost vs Per-reducer Memory
[Figure: plot of communication cost against per-reducer memory for the Grouping (naïve), Ball-Hashing, Splitting, and Anchor Points algorithms. Axis marks include |R| = 2^n, 2^(2n), and O(n^(d/2)).]
Outline
1. Anchor Points Algorithm
• Covering Code
2. Explicit Construction of Hamming Distance Covering Codes
3. Explicit Construction of Edit Distance Covering Codes
Covering Code
Given a set of strings R of length n, and a radius k
Definition: an <n, k> covering code C is a set of strings such that for each s∈R, there is a c∈C s.t. dist(c, s) ≤ k
Notation: n = length of strings, d = distance of pairs, k = radius of code
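The definition can be checked mechanically for small n. The sketch below assumes Hamming distance and takes R to be all bit strings of length n; `is_covering_code` is a name introduced here for illustration.

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def is_covering_code(code, n, k):
    """True if every bit string of length n is within Hamming distance k of some code word."""
    for bits in product("01", repeat=n):
        s = "".join(bits)
        if min(hamming(s, c) for c in code) > k:
            return False
    return True
```

For instance, `is_covering_code({"00000", "11111"}, 5, 2)` holds, while dropping either code word or shrinking the radius to 1 breaks the covering property.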
Example Covering Code
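Example: Hamming Distance, n = 5, k = 2
C = {00000, 11111}
[Figure: the 5-bit strings of R arranged by number of 1s, from 00000 at the bottom through 00001 … 01000 10000, 00011 00101 … 10001 11000, 00111 … 10011 … 11100, and 01111 … 11101 11110 up to 11111 at the top. Strings with at most two 1s lie within distance 2 of 00000; strings with at least three 1s lie within distance 2 of 11111.]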
Anchor Points Algorithm (1)
Let C be an <n, k> covering code (e.g., n = 5, k = 2)
One reducer for each code word
Map s to code words at distance ≤ k + d/2 (e.g., d = 2 => 2 + 2/2 = 3)
[Figure: Map sends the input strings 00000, 01000, 01011, 01100, …, 11110, 11111 to the reducers r00000 and r11111.]
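The map and reduce steps can be simulated in a single process. This is a sketch, not the authors' implementation: it assumes Hamming distance, an even d, and a code given as a plain list of strings; `anchor_points_join` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations, product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def anchor_points_join(strings, code, k, d):
    """One-round fuzzy join: the set of pairs at Hamming distance <= d (d assumed even)."""
    radius = k + d // 2
    reducers = defaultdict(list)
    # Map: send each string to every code word within distance k + d/2.
    for s in strings:
        for c in code:
            if hamming(s, c) <= radius:
                reducers[c].append(s)
    # Reduce: compare the strings that met at the same code word.
    result = set()
    for group in reducers.values():
        for u, w in combinations(group, 2):
            if hamming(u, w) <= d:
                result.add((min(u, w), max(u, w)))
    return result

strings = ["".join(b) for b in product("01", repeat=5)]
pairs = anchor_points_join(strings, ["00000", "11111"], k=2, d=2)
```

A pair can meet at more than one reducer, so the sketch collects results in a set to drop duplicates; a real job would need some equivalent rule for emitting each pair once.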
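Anchor Points Algorithm (2)
Triangle Inequality: every pair u, w with dist(u, w) ≤ d meets at some reducer
• Pick v with dist(u, v) ≤ d/2 and dist(v, w) ≤ d/2
• Pick a code word c with dist(c, v) ≤ k
• Then dist(c, u) ≤ k + d/2 and dist(c, w) ≤ k + d/2, so u and w are both sent to the reducer for c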
Cost of Anchor Points Algorithm
B(n, r): size of the ball of radius r
Per-reducer memory: B(n, k + d/2)
Communication: |C| B(n, k + d/2)
[Figure: the reducer for code word c receives the strings within distance k + d/2 of c, e.g., s1, s4, s5, s6, s7, s9, s11, s17.]
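The two cost formulas are easy to evaluate for Hamming distance, where B(n, r) is a sum of binomial coefficients. The sketch assumes an even d and that the input is all 2^n bit strings; the function names are illustrative.

```python
from math import comb

def ball_size(n, r):
    """B(n, r): number of bit strings within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))

def anchor_points_cost(n, k, d, code_size):
    """Return (per-reducer memory, communication) from the formulas on this slide."""
    memory = ball_size(n, k + d // 2)
    return memory, code_size * memory

memory, communication = anchor_points_cost(n=5, k=2, d=2, code_size=2)
```

For the running example (n = 5, k = 2, d = 2, C = {00000, 11111}) this gives B(5, 3) = 26 strings per reducer and 2 · 26 = 52 strings communicated.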
Communication Cost vs Per-reducer Memory
[Figure: plot of communication cost against per-reducer memory for Grouping (naïve), Ball-Hashing, Splitting, and Anchor Points, the latter shown at k = 0, k = 1, k = 2, …, k = n. Axis marks include |R| = 2^n, 2^(2n), and O(n^(d/2)).]
Outline
1. Anchor Points Algorithm
• Covering Code
2. Explicit Construction of Hamming Distance Codes
3. Explicit Construction of Edit Distance Codes
Some Known Hamming Distance Codes
k    n              |C|
0    any            2^n
n    any            1
1    n = 2^r - 1    2^n/(n+1)    (Hamming Codes)
Perfect <n, k> code (i.e., smallest possible): size 2^n/B(n, k)
For any k: existence of codes of size n·2^n/B(n, k) => not perfect
Problem: no explicit construction
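The k = 1 row can be checked by brute force for small r. The sketch below uses the standard syndrome description of the binary Hamming code (a word is a code word when the XOR of the positions of its 1 bits is zero); representing words as integers is a choice made here for brevity.

```python
def hamming_code(r):
    """Code words of the binary Hamming code of length n = 2**r - 1, as integers."""
    n = 2**r - 1

    def syndrome(x):
        # XOR of the 1-based positions of the set bits of x.
        s = 0
        for pos in range(1, n + 1):
            if (x >> (pos - 1)) & 1:
                s ^= pos
        return s

    return n, [x for x in range(2**n) if syndrome(x) == 0]

def covers_with_radius_one(n, code):
    """True if every n-bit word is a code word or one bit flip away from one."""
    members = set(code)
    return all(
        x in members or any((x ^ (1 << i)) in members for i in range(n))
        for x in range(2**n)
    )

n, code = hamming_code(3)
```

With r = 3 this yields n = 7 and 16 code words, matching 2^n/(n+1) = 128/8.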
Cross Product Method (Explicit HD <n, k> Codes)
Start with an <n/t, k/t> code D
Let C = D x D x … x D (t times)
Claim: C is an <n, k> covering code
Proof: split s = s1 s2 s3 … st into t blocks of length n/t. For each block si pick di ∈ D with dist(si, di) ≤ k/t, and let c = d1 d2 d3 … dt. Then dist(s, c) ≤ t · (k/t) = k.
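The construction and the claim can be checked by brute force on a small case. The sketch assumes Hamming distance over bit strings; `cross_product_code` and `covering_radius` are names introduced for illustration.

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def cross_product_code(D, t):
    """C = D x D x ... x D (t times), as concatenated strings."""
    return ["".join(parts) for parts in product(D, repeat=t)]

def covering_radius(code, n):
    """Largest distance from any n-bit string to its nearest code word."""
    return max(
        min(hamming("".join(bits), c) for c in code)
        for bits in product("01", repeat=n)
    )

C = cross_product_code(["00000", "11111"], t=2)
```

Starting from the <5, 2> code {00000, 11111} with t = 2, the result has four code words of length 10 and the exhaustive check confirms that radius 4 suffices.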
Example of Cross Product Method
n = 10, k = 4, t = 2 => use a <5, 2> code D, D = {00000, 11111}
C = {00000--00000, 00000--11111, 11111--00000, 11111--11111}
s = 11000--11100 is covered by 00000--11111 at distance ≤ 2 + 2 = 4
s = 11100--00001 is covered by 11111--00000 at distance ≤ 2 + 1 = 3
Size of Cross Product Codes: |C| = |D|^t
Assume D is perfect (e.g., Hamming code): |C| = 2^n / B(n/t, k/t)^t
Perfect <n, k> code: 2^n / B(n, k)
For large n, small t => same asymptotic size
Example: any n, k = 2, t = 2
2^n / (n/2 + 1)^2 vs 2^n / (1 + n + n(n-1)/2)
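The size comparison on this slide can be evaluated exactly with rational arithmetic. The sketch assumes that D has the perfect size 2^(n/t) / B(n/t, k/t) and that t divides both n and k; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def ball_size(n, r):
    return sum(comb(n, i) for i in range(min(r, n) + 1))

def perfect_size(n, k):
    """Size 2^n / B(n, k) that a perfect <n, k> code would have (may be fractional)."""
    return Fraction(2**n, ball_size(n, k))

def cross_product_size(n, k, t):
    """Size of D^t when D has the perfect size for <n/t, k/t>; assumes t divides n and k."""
    return perfect_size(n // t, k // t) ** t

n, k, t = 14, 2, 2   # D is then the <7, 1> Hamming code with 16 code words
ratio = cross_product_size(n, k, t) / perfect_size(n, k)
```

For k = 2 and t = 2 the ratio equals B(n, 2) / (n/2 + 1)^2, which stays below 2 and approaches 2 as n grows, so the cross product code is within a constant factor of the perfect size.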
Outline
1. Anchor Points Algorithm
• Covering Code
2. Explicit Construction of Hamming Distance Covering Codes
3. Explicit Construction of Edit Distance Covering Codes
Edit Distance Fuzzy Joins
Input: strings of length n over alphabet A (i.e., |A|^n strings), e.g., abcd, eabc, cadb, …, dadd, dddd
Output: e.g., <abcd, eabc>, …, <dddd, dadd>
Covering codes algorithm works in the same way:
• If C is an <n, k> edit distance code
• Send s to all code words at distance ≤ k + d/2
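A sketch of an edit distance computation follows. It assumes that edit distance here counts only single-character insertions and deletions (no substitutions), which fits the insertion and deletion codes on the later slides but is not stated explicitly in the source.

```python
def edit_distance(u, v):
    """Minimum number of single-character insertions and deletions turning u into v."""
    prev = list(range(len(v) + 1))          # distance from "" to each prefix of v
    for i, a in enumerate(u, start=1):
        curr = [i]                          # distance from u[:i] to ""
        for j, b in enumerate(v, start=1):
            if a == b:
                curr.append(prev[j - 1])    # matching characters cost nothing
            else:
                curr.append(1 + min(prev[j], curr[j - 1]))  # delete a or insert b
        prev = curr
    return prev[-1]
```

Under this assumption both example pairs, <abcd, eabc> and <dddd, dadd>, are at distance 2: one deletion followed by one insertion.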
Differences with Hamming Distance
1. Length of code words might be different
   E.g. 1 insertion, |c| = n+1 => insertion-1 code
   E.g. 1 deletion, |c| = n-1 => deletion-1 code
2. Different code words might have different ball sizes
   [Figure: the code word ababa…a (length n+1) covers aaba…a, abba…a, abaa…a, baba…a, … (each of length n), while the code word aaaaa…a (length n+1) covers only aaaa..a (length n).]
3. No known perfect codes or explicit construction
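The difference in ball sizes is easy to see by enumerating single deletions. The helper name `one_deletion_ball` and the concrete strings of length 6 are illustrative choices that mirror the two code words in the figure.

```python
def one_deletion_ball(c):
    """Distinct strings obtained by deleting exactly one character from c."""
    return {c[:i] + c[i + 1:] for i in range(len(c))}

big = one_deletion_ball("ababaa")    # alternating prefix: many distinct results
small = one_deletion_ball("aaaaaa")  # single run: every deletion gives the same string
```

Here `big` has five elements and `small` has one, because deleting any character inside a run of equal characters produces the same string.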
Insertion-1 Codes
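Let n = 5, |A| = a = 4; code words are of length 6
Letters as integers from 0 to (a-1): e.g. 0230, 1124, …
Let si be the ith digit of s
1. sum(s) = Σ si × i, with positions i counted from the right end of s
2. score(s) = sum(s) % (n+1)(a-1) (e.g., the modulus is 6*3 = 18)
3. R = any a-1 consecutive residues: e.g. {0,1,2}, {12,13,14}, {16,17,0}
C = the strings of length n+1 whose score is in R, e.g., for R = {12,13,14}: {003000, 303000, 003001, 003002, 200000, …}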
|C| ≈ a^(n+1)/(n+1): a factor a worse than best possible
Example: s=23010, sum(s)=24, score(s)=6
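The construction can be tried out for n = 5 and a = 4. The sketch assumes that the position weights count from the right end of the word (rightmost digit has weight 1), a reading chosen because it reproduces sum(23010) = 24 and the listed code words; the function names are hypothetical.

```python
from itertools import product

def score(word, n, a):
    """Position-weighted digit sum modulo (n+1)(a-1); rightmost digit has weight 1."""
    total = sum(digit * weight for digit, weight in zip(word, range(len(word), 0, -1)))
    return total % ((n + 1) * (a - 1))

def insertion_1_code(n, a, residues):
    """All words of length n+1 over digits 0..a-1 whose score lies in residues."""
    return {w for w in product(range(a), repeat=n + 1) if score(w, n, a) in residues}

def is_covered(s, code, a):
    """True if inserting one digit somewhere in s yields a code word."""
    return any(
        s[:i] + (x,) + s[i:] in code
        for i in range(len(s) + 1)
        for x in range(a)
    )

n, a = 5, 4
C = insertion_1_code(n, a, residues={12, 13, 14})
all_covered = all(is_covered(s, C, a) for s in product(range(a), repeat=n))
```

With R = {12, 13, 14}, the exhaustive check finds that each of the 4^5 strings of length 5 reaches some code word by a single insertion.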
[Figure, five frames: a number line of sum values 23 to 43 with score = sum % 18 written beneath (sum 23 has score 5, sum 35 has score 17, sum 36 has score 0, sum 43 has score 7). Each frame marks two words X and Y obtained from s = 23010 by inserting one digit:
  frame 1: X = 023010, Y = 323010
  frame 2: X = 203010, Y = 233010
  frame 3: X = 230010, Y = 233010
  frame 4: X = 230010, Y = 230310
  frame 5: X = 230100, Y = 230130]
Edit Distance Codes
Insertion/Deletion    Size    Explicit/Existence
Insertion-1                   explicit
Deletion-1                    explicit
Deletion-2                    explicit
Deletion-1                    existence
Summary
1. Fuzzy Joins for Hamming and Edit Distance in One-round MR
2. Anchor Points Algorithm
• Covering Codes
• Flexible parallelism
• Better communication cost than naive
3. Explicit construction of Hamming distance covering codes
4. Explicit construction of edit distance covering codes
Open Questions
Fuzzy Joins in MR
• Minimum communication for a given per-reducer memory for 1 round MR algorithms?
  Know the answer for only Hamming Distance 1
• How about multi-round MR algorithms?
Covering Codes
• Are there smaller codes?
• Can we construct smaller codes explicitly?
• What is the size of the smallest codes?
Related Work
Fuzzy Joins in MR
• Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012
• Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010
• Efficient Parallel Set-similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010
• Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008
Covering Codes
• Covering codes, Gary Cohen
• On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003
• Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002
Questions?