
K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free

Bin Yao, Feifei Li, Piyush Kumar

Florida State University


Introduction

KNN queries and KNN-Joins: spatial databases, pattern recognition, DNA sequencing.

Our goal: design relational algorithms for KNN and KNN-Joins.

Augmented with ad-hoc query conditions and optimized by the query optimizer.

Readily applied on relational databases without updating the engine.

Do it in SQL!


Challenge and benefit in designing relational algorithms

The main challenge: a query optimizer cannot optimize user-defined functions (UDFs).

SELECT TOP k * FROM Address A, Restaurant R
WHERE R.Type='Italian' AND R.Wine='French'
ORDER BY Euclidean(A.X, A.Y, R.X, R.Y)


Previous work on kNN and kNN-Join

Exact kNN solutions:
  R-tree for low dimensions
  iDistance for high dimensions

Approximate kNN solutions:
  Balanced box decomposition tree
  Locality sensitive hashing
  LSB-tree
  Medrank

kNN-Join solutions:
  the iJoin algorithm
  the Gorder algorithm


Problem formulation

Data set P stored in table R_P: {pid, Y_1, ..., Y_d, A_1, ..., A_h}.
Query set Q stored in table R_Q: {qid, X_1, ..., X_d, B_1, ..., B_g}.

kNN queries: let A = kNN(q, R_P); then
(A ⊆ R_P) ∧ (|A| = k) ∧ (∀a ∈ A, ∀r ∈ R_P − A, |a, q| ≤ |r, q|).

kNN-Join: for every s ∈ Q, produce the k pairs (s, r), one for each r ∈ kNN(s, R_P).

Approximate k nearest neighbors: suppose q's kth NN from P is p* with r* = |q, p*|, and let p be the kth NN of q reported by some kNN algorithm A with r_p = |q, p|. Then (p, r_p) ∈ ℝ^d × ℝ is a (1 + ε)-approximate solution of kNN if r* ≤ r_p ≤ (1 + ε)r*.


Z-value and Z-order curve

z-value of a point: for the point (2, 6), the binary representation is (010, 110); interleaving the bits gives the z-value 011100 = 28.

A well-known approach:

Map points in a multi-dimensional space into one dimension using z-values.

Translate the kNN search into a one-dimensional range search on the z-values.
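To make the mapping concrete, the following is a minimal Python sketch of z-value computation by bit interleaving; the function name, the fixed bit width per dimension, and the restriction to non-negative integer coordinates are assumptions of the sketch, not part of the slides.

# Interleave the bits of the coordinates, most significant bit first.
# Assumes non-negative integer coordinates that fit in `bits` bits.
def z_value(coords, bits):
    z = 0
    for i in range(bits):
        for x in coords:
            z = (z << 1) | ((x >> (bits - 1 - i)) & 1)
    return z

# The slide's example: point (2, 6) = (010, 110) interleaves to 011100 = 28.
assert z_value([2, 6], 3) == 28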


Approximation by random shifts

Z-values largely preserve spatial locality, but not always: points close in space can land far apart on the curve.

[Figure: a query point q on the Z-order curve; a random shift by vector →v changes which curve segment covers q; example with γ = 2.]

Our idea: produce α randomly shifted copies of the input data set (P^0, ..., P^α) and repeat the one-dimensional range search (γ = O(k) points up and down next to q) on each copy.

Retrieve the kNN from the unioned candidates of the α copies.

Theorem 1: Using α = O(1) and γ = O(k), zχ-kNN guarantees an expected constant-factor approximate kNN result with O(log_f(N/B) + k/B) page accesses (with a clustered index on the z-values).
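As an illustration of how the shifted copies might be materialized, here is a sketch in the same Python setting, reusing the z_value helper above; the representation of each copy as a list of (z-value, point) pairs sorted by z-value (an in-memory stand-in for a clustered index) and the choice of shift range are assumptions of the sketch.

import random

# Build the unshifted copy P^0 plus alpha randomly shifted copies.
# Points are tuples of ints in [0, 2**bits); z-values use bits + 1
# bits per dimension so that shifted coordinates still fit.
def build_tables(P, alpha, bits):
    d = len(P[0])
    shifts = [(0,) * d] + [tuple(random.randrange(2 ** bits) for _ in range(d))
                           for _ in range(alpha)]
    tables = [sorted((z_value([x + v[j] for j, x in enumerate(p)], bits + 1), p)
                     for p in P)
              for v in shifts]
    return tables, shifts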


Approximation algorithm

zχ-kNN (point q, point sets {P^0, ..., P^α})
  Candidates C = ∅;
  For i = 0, ..., α {
    Find z_p^i as the successor of z_{q+v^i} in P^i;
    Let C^i be the γ points up and down next to z_p^i in P^i;
    For each point p in C^i, let p = p − v^i;
    C = C ∪ C^i;
  }
  Let A^χ = kNN(q, C) and output A^χ.
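The pseudocode translates almost line for line into the in-memory Python setting above; bisect stands in for the clustered-index successor lookup, and a brute-force pass over the small candidate set replaces the final kNN(q, C) call. This is a sketch under the build_tables assumptions, not the paper's implementation.

import bisect

def z_knn_approx(q, tables, shifts, k, gamma, bits):
    C = []                                       # candidate set C
    for table, v in zip(tables, shifts):
        zq = z_value([q[j] + v[j] for j in range(len(q))], bits + 1)
        pos = bisect.bisect_left(table, (zq,))   # successor of z_{q+v^i}
        lo, hi = max(0, pos - gamma), min(len(table), pos + gamma + 1)
        # points are stored unshifted, so the p = p - v^i step is implicit
        C.extend(p for _, p in table[lo:hi])
    # exact kNN over the unioned candidates of all copies
    C.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))
    return C[:k]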


SQL statement for approximation algorithm

SELECT TOP k * FROM (
  SELECT TOP γ+1 * FROM R_P,
    (SELECT TOP 1 zval FROM R_P
     WHERE R_P.zval ≥ q.zval
     ORDER BY R_P.zval ASC) AS T
  WHERE R_P.zval ≥ T.zval
  ORDER BY R_P.zval ASC
  UNION
  SELECT TOP γ * FROM R_P
  WHERE R_P.zval < T.zval
  ORDER BY R_P.zval DESC
) AS C
ORDER BY Euclidean(q.X1, q.X2, C.Y1, C.Y2)


Exact KNN retrieval: naive solution

The exact kNN points are enclosed by the approximate kth nearest neighbor ball.

SELECT TOP k * FROM R_P
WHERE Euclidean(q.X1, q.X2, R_P.Y1, R_P.Y2) ≤ rad(p, A^χ)
ORDER BY Euclidean(q.X1, q.X2, R_P.Y1, R_P.Y2)
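In the Python sketch, the same filter is a short scan; math.dist and the brute-force pass over P are stand-ins for the SQL above.

import math

# Every exact kNN of q lies inside the ball around q whose radius is
# the distance from q to its approximate kth nearest neighbor.
def z_knn_exact_naive(q, P, approx, k):
    r = max(math.dist(q, p) for p in approx)
    ball = [p for p in P if math.dist(q, p) <= r]   # full scan, as in the SQL
    ball.sort(key=lambda p: math.dist(q, p))
    return ball[:k]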

Can we do better?


Exact KNN retrieval

[Figure: k = 3 example showing the approximate kth NN ball, the exact kth NN ball, and the approximate kth NN box.]

Lemma 4: For a rectangular box M and its lower-left and upper-right corner points δ_ℓ, δ_h: ∀p ∈ M, z_p ∈ [z_ℓ, z_h], where z_p stands for the z-value of a point p and z_ℓ, z_h correspond to the z-values of δ_ℓ and δ_h respectively.

Corollary 1: Let z_ℓ and z_h be the z-values of the δ_ℓ and δ_h points of M(A^χ). For all p ∈ A, z_p ∈ [z_ℓ, z_h].


Exact KNN retrieval

Let γ_ℓ and γ_h denote the left and right γ-th points closest to the query point. If z_{γ_ℓ} ≤ z_ℓ and z_{γ_h} ≥ z_h in at least one of the α tables, then A^χ = A.


Exact KNN retrieval

If not, we can find A by doing a range query with [z_ℓ^j, z_h^j] on any of the α tables. Ideally, we use the table with the smallest range [z_ℓ^j, z_h^j].
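A sketch of this refinement on a single shifted copy, in the same Python setting: the box corners are rounded outward to the integer grid, Lemma 4 gives the z-value range, and a range scan plus a distance check recovers A. The corner rounding and helper names are assumptions of the sketch.

import bisect, math

def z_knn_exact(q, table, shift, approx, k, bits):
    # Corners of the box enclosing the approximate kth NN ball,
    # in the shifted coordinates of this copy.
    r = max(math.dist(q, p) for p in approx)
    d, top = len(q), 2 ** (bits + 1) - 1
    lo = [max(0, math.floor(q[j] + shift[j] - r)) for j in range(d)]
    hi = [min(top, math.ceil(q[j] + shift[j] + r)) for j in range(d)]
    zl, zh = z_value(lo, bits + 1), z_value(hi, bits + 1)   # Lemma 4
    # Range query [zl, zh] on the sorted z-values, then verify by distance.
    i = bisect.bisect_left(table, (zl,))
    cand = []
    for z, p in table[i:]:
        if z > zh:
            break
        cand.append(p)
    cand.sort(key=lambda p: math.dist(q, p))
    return cand[:k]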


KNN-join, higher dimensions and updates

Our approach easily and efficiently supports join queries.

It deals with data in any dimension without changing the framework; for large dimensionality (say d > 20), an LSH-based method can be used.

Updates: for deletion, delete the record r based on its pid from all tables R^0, ..., R^α; for insertion, calculate the z-values of the point for all randomly shifted versions and insert them into the corresponding tables.
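In the in-memory sketch, both operations reduce to keeping each sorted (z-value, point) list in sync; in the engine they are ordinary INSERT/DELETE statements against the α + 1 tables. Deletion here is by point value rather than by pid, an assumption of the sketch.

import bisect

def insert_point(p, tables, shifts, bits):
    for table, v in zip(tables, shifts):
        z = z_value([x + v[j] for j, x in enumerate(p)], bits + 1)
        bisect.insort(table, (z, p))    # one insert per shifted copy

def delete_point(p, tables, shifts, bits):
    for table, v in zip(tables, shifts):
        z = z_value([x + v[j] for j, x in enumerate(p)], bits + 1)
        table.remove((z, p))            # remove from every copy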


Experiment Setup

All algorithms are implemented in Microsoft SQL Server 2005. Experiments are conducted on an Intel Xeon CPU @ 2.33GHz. The memory of the SQL Server is set to 1.5GB.

Real data sets: points representing the road networks of states in the USA.

Two synthetic data sets: uniform points and randomly clustered points.

Compared against the Medrank and iDistance algorithms (implemented with SQL statements and stored procedures).


Experiment Setup

The default experimental parameters are summarized below:

Symbol  Definition                     Default Value
k       number of neighbors            10
N       size of the point set          1,000,000
α       randomly shifted copies        2
γ       number of points up and down   2k
d       dimensionality                 2

Results for the kNN query: approximation quality

[Figures: approximation-quality plots, not preserved in the transcript.]

Results for the kNN query: running time

[Figures: running-time plots on the UN and California data sets, not preserved in the transcript.]


Conclusions

Presented a constant-factor approximation for the kNN query with logarithmic page accesses in any fixed dimension, and extended it to an exact solution, both using just O(1) random shifts.

All the algorithms can be implemented with SQL operators in relational databases.

Our approach naturally supports kNN-Joins.

No changes are required for different dimensions, and updates are trivial.

Future research:

Study other related, interesting queries in this framework, e.g., reverse nearest neighbor queries.

Extend the relational algorithms to data spaces other than Lp-norms, such as road networks.


The End

Thank you!

Q and A