Foto Afrati — National Technical University of Athens Anish Das Sarma — ClearList Inc .

Post on 23-Feb-2016


1

Anchor Points Algorithms for Hamming and Edit Distance

Foto Afrati, National Technical University of Athens

Anish Das Sarma, ClearList Inc.

Anand Rajaraman, Cambrian Ventures

Pokey Rule, Stanford University

Semih Salihoglu, Stanford University

Jeff Ullman, Stanford University

Fuzzy Joins

2

Input: set of records R, e.g. rec1, rec2, …, recm

Output: pairs <reci, recj> such that dist(reci, recj) ≤ d, e.g. <rec1, rec5>, <rec7, rec9>, …, <rec3, reck>

Example Applications: entity resolution, clustering, collaborative filtering
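
As a point of reference for the problem statement above, here is a minimal single-machine sketch of a fuzzy join. The function names `fuzzy_join` and `hamming`, and the sample records, are illustrative assumptions and do not come from the slides; the slides are about doing this in MapReduce, which this sketch does not attempt.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def fuzzy_join(records, dist, d):
    """Return every unordered pair of records whose distance is at most d."""
    return [(r1, r2) for r1, r2 in combinations(records, 2) if dist(r1, r2) <= d]

# Hypothetical input; any distance function with the same signature would do.
records = ["00000", "00001", "10011", "10010", "11111"]
result = fuzzy_join(records, hamming, 1)
```

This compares all pairs, so it is quadratic in the number of records.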

Two Specific Distance Measures

3

1. Hamming Distance
Input: bit strings R of length n, e.g. 00000, 00001, …, 10011
Output: <00000, 00001>, …, <10011, 10010>

2. Edit Distance
Input: strings R of length n over alphabet A, e.g. abcd, eabc, …, dddd
Output: <abcd, eabc>, …, <dddd, dadd>
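
The two distance measures can be sketched as follows. One assumption is made here that the slide does not state: edit distance is taken to count insertions and deletions only (the later slides speak of insertion and deletion codes), so it is computed from the longest common subsequence. The function names are illustrative.

```python
def hamming_distance(a, b):
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Insert/delete edit distance (assumption: no substitutions).

    Equals len(a) + len(b) - 2 * LCS(a, b), with the LCS length computed
    by the usual row-by-row dynamic program.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]
```

Under this assumption, `edit_distance("abcd", "eabc")` is 2 (insert `e`, delete `d`), and `edit_distance("dddd", "dadd")` is also 2.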

Fuzzy Joins In One-Round MapReduce

4

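[Figure: the input records rec1, rec2, rec3, …, recm-1, recm go through Map, which emits (key, values) pairs; Reduce groups them by key at reducer1, reducer2, …, reducerp, e.g. values (rec1, rec5, rec7), (rec2, rec7, recm), (rec2, recm). Two costs are labeled: Communication Cost and Per-Reducer Memory Cost.]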
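
The one-round setting and its two costs can be simulated on a single machine. This is a sketch under assumptions: `one_round_join` and `map_fn` are hypothetical names, communication is counted as the total number of (key, record) pairs sent, and per-reducer memory as the largest number of records any one reducer receives.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def one_round_join(records, map_fn, d):
    # Map phase: each record is sent to every reducer key that map_fn returns.
    reducers = defaultdict(list)
    for rec in records:
        for key in map_fn(rec):
            reducers[key].append(rec)
    # Reduce phase: each reducer compares only the records it received.
    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if hamming(u, v) <= d:
                pairs.add((min(u, v), max(u, v)))
    communication = sum(len(g) for g in reducers.values())
    per_reducer_memory = max((len(g) for g in reducers.values()), default=0)
    return pairs, communication, per_reducer_memory

# Degenerate mapping: every record goes to one reducer.
records = ["00000", "00001", "10011", "10010", "11111"]
single = one_round_join(records, lambda rec: ["all"], 1)
```

Sending everything to one reducer minimizes communication but puts the whole input in one reducer's memory; other choices of `map_fn` trade one cost against the other.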

5

Communication Cost vs Per-reducer Memory

[Figure: plot of communication cost against per-reducer memory when R is all bit strings of length n (|R| = 2^n). Points: Grouping (naïve), Ball-Hashing, Splitting, Anchor Points.]

Outline

6

1. Anchor Points Algorithm

• Covering Code

2. Explicit Construction of Hamming Distance Covering Codes

3. Explicit Construction of Edit Distance Covering Codes


Covering Code

8

Given a set of strings R of length n, and a radius k

Definition: C is an <n, k> covering code if for each s∈R, there is a c∈C such that dist(c, s) ≤ k

Notation: n = length of strings, d = distance of pairs, k = radius of code

Example Covering Code

9

Example: Hamming Distance, n=5, k = 2

[Figure: the strings of R arranged by number of 1s, with the code words 11111 and 00000 at the two ends:]

11111

01111 … 11101 11110

00111 … 10011 … 11100

00011 00101 … 10001 11000

00001 … 01000 10000

00000
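
The example can be checked by brute force. The sketch below assumes R is all bit strings of length n and uses a hypothetical helper `is_covering_code`; it only verifies the definition and is not a construction method.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_covering_code(code, n, k):
    """True if every bit string of length n is within distance k of a code word."""
    for bits in product("01", repeat=n):
        s = "".join(bits)
        if min(hamming(s, c) for c in code) > k:
            return False
    return True

covers_radius_2 = is_covering_code(["00000", "11111"], 5, 2)
covers_radius_1 = is_covering_code(["00000", "11111"], 5, 1)
```

With radius 2 the check succeeds: a string with at most two 1s is within 2 of 00000, and a string with at least three 1s is within 2 of 11111. With radius 1 it fails, for example on 00011.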

10

Anchor Points Algorithm (1)

Let C be an <n, k> covering code (e.g. n=5, k=2)

One reducer for each code word

Map s to code words at distance ≤ k + d/2 (e.g. d=2 => 2 + 2/2 = 3)

[Figure: Map sends the input strings 00000, 01000, 01011, 01100, …, 11110, 11111 to the Reduce side, which has one reducer per code word: r00000 and r11111.]
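
The algorithm can be sketched as a single-machine simulation. Assumptions: the function name `anchor_points_join` is illustrative, the distance is Hamming, and d/2 is rounded up so that odd d is also handled (the slide's example uses even d). The result is compared with a brute-force join on all 5-bit strings.

```python
from collections import defaultdict
from itertools import combinations, product
from math import ceil

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def anchor_points_join(records, code, k, d):
    radius = k + ceil(d / 2)  # assumption: round d/2 up for odd d
    # Map: send each string to the reducer of every code word within the radius.
    reducers = defaultdict(list)
    for s in records:
        for c in code:
            if hamming(s, c) <= radius:
                reducers[c].append(s)
    # Reduce: each reducer joins only the strings it received.
    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if hamming(u, v) <= d:
                pairs.add((min(u, v), max(u, v)))
    communication = sum(len(g) for g in reducers.values())
    return pairs, communication

records = ["".join(b) for b in product("01", repeat=5)]
pairs, communication = anchor_points_join(records, ["00000", "11111"], k=2, d=2)
brute_force = {(u, v) for u, v in combinations(sorted(records), 2) if hamming(u, v) <= 2}
```

A pair may be found at more than one reducer; the set removes such duplicates.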

11

Anchor Points Algorithm (2)

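[Figure: strings u and v with dist(u, v) ≤ d, and a string w between them with dist(u, w) ≤ d/2 and dist(w, v) ≤ d/2. Some code word c has dist(w, c) ≤ k.]

Triangle Inequality: dist(u, c) ≤ k + d/2 and dist(v, c) ≤ k + d/2, so u and v both reach the reducer for c.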

12

Cost of Anchor Points Algorithm

B(n, r): size of the ball of radius r

Per-reducer memory: B(n, k + d/2)

Communication: |C|B(n, k + d/2)

[Figure: the reducer for code word c receives every string within distance k + d/2 of c, e.g. s1, s4, s5, s6, s7, s9, s11, s17.]
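
For Hamming distance on bit strings, B(n, r) is a sum of binomial coefficients, so the two costs can be evaluated directly. The sketch assumes all 2^n strings are present and d is even; `ball_size` and `anchor_points_cost` are illustrative names.

```python
from math import comb

def ball_size(n, r):
    """B(n, r): number of bit strings of length n within Hamming distance r of a fixed string."""
    return sum(comb(n, i) for i in range(min(n, r) + 1))

def anchor_points_cost(n, k, d, code_size):
    """Return (per-reducer memory, communication) for an <n, k> code with code_size words."""
    radius = k + d // 2  # assumption: d is even
    memory = ball_size(n, radius)
    return memory, code_size * memory

memory, communication = anchor_points_cost(n=5, k=2, d=2, code_size=2)
```

For the running example (n=5, k=2, d=2, C = {00000, 11111}) this gives B(5, 3) = 1 + 5 + 10 + 10 = 26 strings per reducer and 2 × 26 = 52 strings communicated.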

13

Communication Cost vs Per-reducer Memory

[Figure: plot of communication cost against per-reducer memory when R is all bit strings of length n (|R| = 2^n). Points: Grouping (naïve), Ball-Hashing, Splitting, and Anchor Points for k=0, k=1, k=2, …, k=n.]

Outline

14

1. Anchor Points Algorithm

• Covering Code

2. Explicit Construction of Hamming Distance Codes

3. Explicit Construction of Edit Distance Codes

Some Known Hamming Distance Codes

15

k    n              |C|
0    any            2^n
n    any            1
1    n = 2^r - 1    2^n/(n+1)    (Hamming Codes)

Perfect <n, k> Code (i.e., smallest possible): 2^n/B(n, k)

For any k: existence of codes of size n·2^n/B(n, k) => not perfect

Problem: no explicit construction
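
The Hamming-code row of the table can be checked for r=3 (n=7). The sketch uses the standard syndrome description of the binary Hamming code, which the slide does not spell out: a word is a code word when the XOR of the 1-based positions of its 1 bits is zero. Function names are illustrative.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def hamming_code(r):
    """All code words of the binary Hamming code of length n = 2^r - 1."""
    n = 2 ** r - 1
    code = []
    for bits in product("01", repeat=n):
        syndrome = 0
        for position, bit in enumerate(bits, start=1):
            if bit == "1":
                syndrome ^= position
        if syndrome == 0:
            code.append("".join(bits))
    return code

def covering_radius(code, n):
    """Largest distance from any bit string of length n to its nearest code word."""
    return max(min(hamming("".join(b), c) for c in code)
               for b in product("01", repeat=n))

code = hamming_code(3)
radius = covering_radius(code, 7)
```

The code has 2^7/(7+1) = 16 words and covering radius 1, so the 16 balls of size B(7, 1) = 8 tile all 128 strings exactly.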

16

Cross Product Method (Explicit HD <n, k> Codes)

Start with an <n/t, k/t> code D

Let C = D x D x … x D (t times)

Claim: C is an <n, k> covering code

Proof:

s = s1 s2 s3 … st

c = d1 d2 d3 … dt

Choose each di in D with dist(si, di) ≤ k/t; summing over the t pieces gives dist(s, c) ≤ k

Example of Cross Product Method

17

n = 10, k = 4, t=2 => use a <5, 2> code D

D = {00000, 11111}

C = {00000--00000, 00000--11111, 11111--00000, 11111--11111}

1100011100 = 11000--11100: covered by 00000--11111 at distance ≤ 2+2 = 4

1110000001 = 11100--00001: covered by 11111--00000 at distance ≤ 2+1 = 3
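
The example can be reproduced with a few lines. This is a sketch: `cross_product_code` and `nearest_code_word` are illustrative names, and the covering check is brute force over all 2^10 strings.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def cross_product_code(D, t):
    """C = D x D x ... x D (t times), with code words concatenated."""
    return ["".join(parts) for parts in product(D, repeat=t)]

def nearest_code_word(s, code):
    return min(code, key=lambda c: hamming(s, c))

D = ["00000", "11111"]            # the <5, 2> code from the slide
C = cross_product_code(D, 2)      # candidate <10, 4> code
worst = max(min(hamming("".join(b), c) for c in C)
            for b in product("01", repeat=10))
```

Here `worst` is the largest distance from any 10-bit string to its nearest code word, so `worst <= 4` confirms that C is a <10, 4> covering code.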

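Size of Cross Product Codes: |D|^t

Assume D is perfect (e.g., Hamming code)

18

Cross product code: |C| = |D|^t = 2^n / B(n/t, k/t)^t

Perfect <n, k> code: 2^n / B(n, k)

For large n, small t => same asymptotic size

Example: k=2, t=2: 2^n / (n/2 + 1)^2 vs 2^n / B(n, 2)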
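
The size comparison can be evaluated numerically. The formulas below follow from the definitions on the earlier slides (|C| = |D|^t, and a perfect <n, k> code has 2^n/B(n, k) words); they are a reconstruction under the assumption that t divides both n and k and that D is perfect, not a transcription of the slide.

```python
from fractions import Fraction
from math import comb

def ball_size(n, r):
    return sum(comb(n, i) for i in range(min(n, r) + 1))

def perfect_size(n, k):
    """Size 2^n / B(n, k) that a perfect <n, k> code would have."""
    return Fraction(2 ** n, ball_size(n, k))

def cross_product_size(n, k, t):
    """|D|^t for a perfect <n/t, k/t> code D (assumes t divides n and k)."""
    return perfect_size(n // t, k // t) ** t

# k = 2, t = 2 with n/2 = 15 = 2^4 - 1, so D can be a Hamming code.
ratio = cross_product_size(30, 2, 2) / perfect_size(30, 2)
```

For n=30 the cross product code is larger than the perfect-code bound by a factor of 233/128, about 1.82; as n grows this factor approaches 2, a constant.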

Outline

19

1. Anchor Points Algorithm

• Covering Code

2. Explicit Construction of Hamming Distance Covering Codes

3. Explicit Construction of Edit Distance Covering Codes

Edit Distance Fuzzy Joins

20

Input: strings of length n over alphabet A (i.e., |A|^n strings), e.g. abcd, eabc, cadb, …, dadd, dddd

Output: <abcd, eabc>, …, <dddd, dadd>

Covering codes algorithm works in the same way:

• If C is an <n, k> edit distance code

• Send s to all code words at distance ≤ k + d/2

Differences with Hamming Distance

21

1. Length of code words might be different

• E.g. 1 insertion, |c| = n+1 => insertion-1 code

• E.g. 1 deletion, |c| = n-1 => deletion-1 code

2. Different code words might have different ball sizes

3. No known perfect codes or explicit construction

[Figure: the code word ababa…a (length n+1) is adjacent to many strings of length n: aaba…a, abba…a, abaa…a, baba…a, …, while the code word aaaaa…a (length n+1) is adjacent to only one: aaaa…a.]
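
The second difference is easy to see by listing the strings reachable by a single deletion. The helper name `deletion_neighbors` and the two length-6 sample words are illustrative.

```python
def deletion_neighbors(s):
    """All distinct strings obtained from s by deleting exactly one symbol."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}

varied = deletion_neighbors("ababaa")
uniform = deletion_neighbors("aaaaaa")
```

Deleting within a run of equal symbols always gives the same string, so the number of neighbors equals the number of runs: five for `ababaa`, one for `aaaaaa`.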

Insertion-1 Codes

22

Let n=5, |A|=a=4; code words are of length 6

Letters as integers from 0 to (a-1): e.g. 0230, 1124, …

Let si be the ith digit of s

1. sum(s) = Σ si × (position of si counted from the right end of s)

2. score(s) = sum(s) % (n+1)(a-1) (e.g., 6*3=18)

3. R = any a-1 consecutive residues: e.g. {0,1,2}, {12,13,14}, {16,17,0}

C = all words of length n+1 with score in R, e.g. for R = {12,13,14}: C = {003000, 303000, 003001, 003002, 200000, …}

|C| ≈ a^(n+1)/(n+1): factor a worse than best possible

Example: s=23010, sum(s)=24, score(s)=6

23

[Figure, repeated over five animation frames: a number line of sums 23 to 43 with their scores (sum % 18): 5, 6, …, 17, 0, 1, …, 7. Each frame marks two words X and Y obtained from s=23010 by one insertion: X=023010, Y=323010; X=203010, Y=233010; X=230010, Y=233010; X=230010, Y=230310; X=230100, Y=230130.]
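
The insertion-1 construction can be checked by brute force for n=5, a=4. One assumption is needed because the sum formula is not in the transcript: each digit is weighted by its position counted from the right end. That choice reproduces the slide's example (sum(23010) = 24, score 6) and puts the listed words 003000, 303000, 003001, 003002, 200000 in the code for R = {12, 13, 14}. All names in the sketch are illustrative.

```python
from itertools import product

N, A = 5, 4                  # string length and alphabet size from the slide
MOD = (N + 1) * (A - 1)      # 18
RESIDUES = {12, 13, 14}      # one choice of a-1 consecutive residues

def weighted_sum(s):
    # Assumption: the digit at 0-based index i has weight len(s) - i,
    # i.e. its position counted from the right end.
    return sum(int(ch) * (len(s) - i) for i, ch in enumerate(s))

def score(s):
    return weighted_sum(s) % MOD

def is_code_word(w):
    return len(w) == N + 1 and score(w) in RESIDUES

def insertions(s):
    """All words obtained from s by inserting one symbol of the alphabet."""
    return {s[:i] + str(x) + s[i:] for i in range(len(s) + 1) for x in range(A)}

def covering_code_words(s):
    """Code words reachable from s by a single insertion."""
    return sorted(w for w in insertions(s) if is_code_word(w))

all_strings = ["".join(p) for p in product("0123", repeat=N)]
every_string_covered = all(covering_code_words(s) for s in all_strings)
```

The check passes because inserting a 0 and then a 3 at successive positions moves the sum from sum(s) to sum(s) + 18 in steps of at most a-1 = 3, so every window of three consecutive residues is hit.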

Edit Distance Codes

24

Insertion/Deletion Size Explicit/Existence

Insertion-1 explicit

Deletion-1 explicit

Deletion-2 explicit

Deletion-1 existence

Summary

25

1. Fuzzy Joins for Hamming and Edit Distance in One-round MR

2. Anchor Points Algorithm

• Covering Codes

• Flexible parallelism

• Better communication cost than naive

3. Explicit construction of Hamming distance covering codes

4. Explicit Construction of Edit distance covering codes

Open Questions

26

Fuzzy Joins in MR

• Minimum communication for a given per-reducer memory for 1 round MR algorithms?

• Know the answer for only Hamming Distance 1

• How about multi-round MR algorithms?

Covering Codes

• Are there smaller codes?

• Can we construct smaller codes explicitly?

• What is the size of the smallest codes?

Related Work

27

Fuzzy Joins in MR

• Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012

• Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010

• Efficient Parallel Set-similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010

• Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008

Covering Codes

• Covering codes, Gary Cohen

• On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003

• Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002

28

Questions?