Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Ying Lu
Renmin University of China
Gao Cong
Nanyang Technological University, Singapore
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Motivation
• If we add a new shop at location Q, which existing shops will be influenced?
• Influence factors:
  – Spatial distance. Results: D, F
  – Textual similarity (services/products offered). Results: F, C
[Figure: map of shops around the candidate location Q, labeled by category (food, clothes, sports)]
Problems of Finding Influential Sets
• Traditional query: reverse k nearest neighbor query (RkNN)
• Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)
Problem Statement
Spatial-Textual Similarity
• Describes the similarity between objects based on both spatial proximity and textual similarity.
Spatial-Textual Similarity Function
• SimST(o1, o2) = α · SimS(o1, o2) + (1 − α) · SimT(o1, o2), where SimS is the (normalized) Euclidean proximity of the locations, SimT is the Extended Jaccard similarity of the text vectors, EJ(a, b) = (a · b) / (‖a‖² + ‖b‖² − a · b), and α ∈ [0, 1] weights the two components.
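The similarity function above can be sketched in code. This is a minimal sketch, assuming text vectors stored as term-weight dictionaries and Euclidean distance normalized by a constant max_dist; the function and parameter names are illustrative, not from the paper.

```python
import math

def extended_jaccard(a, b):
    """Extended Jaccard similarity between two term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na2 = sum(w * w for w in a.values())
    nb2 = sum(w * w for w in b.values())
    denom = na2 + nb2 - dot
    return dot / denom if denom > 0 else 0.0

def sim_st(loc1, vct1, loc2, vct2, alpha, max_dist):
    """Weighted combination of spatial proximity and textual similarity.
    alpha in [0, 1]; max_dist normalizes the Euclidean distance."""
    dist = math.hypot(loc1[0] - loc2[0], loc1[1] - loc2[1])
    sim_s = 1.0 - dist / max_dist
    sim_t = extended_jaccard(vct1, vct2)
    return alpha * sim_s + (1.0 - alpha) * sim_t
```

With alpha = 1 the function degenerates to pure spatial proximity, with alpha = 0 to pure textual similarity, matching the role of α in the query definition.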
Problem Statement (cont'd)
• RSTkNN query
  – finds the objects that have the query object as one of their k most spatial-textually similar objects.
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Related Work
• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/plane pruning strategies (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)
• Branch and bound, based on Lp-norm metric spaces (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging Features:
• Euclidean geometric properties no longer hold.
• The text space is high-dimensional.
• k and α differ from query to query.
Baseline Method
Given a query q, k, and α:
• Precompute the spatial NNs and the textual NNs of each object.
• For each object o in the database, run the Threshold Algorithm over the two precomputed lists to find o's spatial-textual kNN o′:
  – if q is more similar to o than o′ is, report o as a result;
  – if q is no more similar to o than o′ is, o is not a result.
• Inefficient, since it lacks a dedicated data structure.
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Intersection and Union R-tree (IUR-tree)
[Figure: IUR-tree over points p1 to p5 with query q(0.5, 2.5); leaf entries store the objects' text vectors (ObjVct1 to ObjVct5), and each index node N1 to N4 stores an MBR together with an intersection text vector (IntVct) and a union text vector (UniVct) summarizing the texts in its subtree.]
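The node layout in the figure can be sketched as follows. This is a minimal sketch assuming text vectors stored as dicts and MBRs as (x1, y1, x2, y2) tuples; the class and field names (IURNode, int_vct, uni_vct) are shorthand for the intersection and union vectors, not the paper's own identifiers.

```python
class LeafObject:
    """A spatial object: a point location plus a text vector."""
    def __init__(self, x, y, text_vct):
        self.mbr = (x, y, x, y)            # degenerate MBR for a point
        self.int_vct = dict(text_vct)      # for a single object,
        self.uni_vct = dict(text_vct)      # intersection = union = its text

class IURNode:
    """An R-tree node augmented with an intersection text vector
    (per-term minimum weight over the subtree) and a union text vector
    (per-term maximum weight over the subtree)."""
    def __init__(self, children):
        self.children = children
        xs = [v for c in children for v in (c.mbr[0], c.mbr[2])]
        ys = [v for c in children for v in (c.mbr[1], c.mbr[3])]
        self.mbr = (min(xs), min(ys), max(xs), max(ys))
        terms = set().union(*(c.uni_vct for c in children))
        self.uni_vct = {t: max(c.uni_vct.get(t, 0.0) for c in children)
                        for t in terms}
        self.int_vct = {t: min(c.int_vct.get(t, 0.0) for c in children)
                        for t in terms}
```

The two vectors bracket every text vector in the subtree term by term, which is what makes entry-level similarity bounds possible.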
Main Idea of the Search Strategy
• Prune an entry E of the IUR-tree when query q is no more similar than the lower bound kNNL(E).
• Report an entry E as a result when query q is more similar than the upper bound kNNU(E).
[Figure: query objects q1, q2, q3 relative to an entry E, illustrating the lower bound kNNL(E) and the upper bound kNNU(E)]
How to Compute the Bounds
Similarity approximations between entries E and E′:
• MinST(E, E′): for all o ∈ E and o′ ∈ E′, SimST(o, o′) ≥ MinST(E, E′)
• TightMinST(E, E′): there exist o ∈ E and o′ ∈ E′ with SimST(o, o′) ≥ TightMinST(E, E′)
• MaxST(E, E′): for all o ∈ E and o′ ∈ E′, SimST(o, o′) ≤ MaxST(E, E′)
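Assuming spatial proximity is Euclidean distance normalized by a constant dmax, as in the similarity function, entry-level bounds can be sketched like this. The textual bounds sim_t_lower/sim_t_upper are taken as given (in the paper they would be derived from the intersection and union vectors), and all names are illustrative.

```python
import math

def min_dist(m1, m2):
    """Smallest possible distance between points in two MBRs (x1, y1, x2, y2)."""
    dx = max(m1[0] - m2[2], m2[0] - m1[2], 0.0)
    dy = max(m1[1] - m2[3], m2[1] - m1[3], 0.0)
    return math.hypot(dx, dy)

def max_dist(m1, m2):
    """Largest possible distance between points in two MBRs."""
    dx = max(abs(m1[2] - m2[0]), abs(m2[2] - m1[0]))
    dy = max(abs(m1[3] - m2[1]), abs(m2[3] - m1[1]))
    return math.hypot(dx, dy)

def min_st(m1, m2, sim_t_lower, alpha, dmax):
    """Lower bound on SimST for any object pair drawn from the two entries:
    worst-case spatial proximity plus a lower bound on textual similarity."""
    return alpha * (1.0 - max_dist(m1, m2) / dmax) + (1.0 - alpha) * sim_t_lower

def max_st(m1, m2, sim_t_upper, alpha, dmax):
    """Upper bound on SimST for any object pair drawn from the two entries:
    best-case spatial proximity plus an upper bound on textual similarity."""
    return alpha * (1.0 - min_dist(m1, m2) / dmax) + (1.0 - alpha) * sim_t_upper
```

By construction min_st never exceeds max_st for the same pair of entries, which is the invariant the pruning and reporting rules rely on.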
Example for Computing Bounds
[Figure: points p1 to p5 in nodes N1 to N3 of the IUR-tree, query q(0.5, 2.5)]
Currently traveled entries: N1, N2, N3. Given k = 2, compute kNNL(N1) and kNNU(N1).
• TightMinST(N1, N3) = 0.564, MinST(N1, N3) = 0.370
• TightMinST(N1, N2) = 0.179, MinST(N1, N2) = 0.095
• MaxST(N1, N3) = 0.432, MaxST(N1, N2) = 0.150
Computing kNNL(N1): taking the effects of N3 and then N2 into account, the bound decreases to kNNL(N1) = 0.370.
Computing kNNU(N1): similarly, the bound decreases to kNNU(N1) = 0.432.
Overview of Search Algorithm
• RSTkNN Algorithm:
  – Traverse from the IUR-tree root.
  – Progressively update lower and upper bounds.
  – Apply the search strategy:
    • prune unrelated entries into Pruned;
    • report entries as results into Ans;
    • add candidate objects into Cnd.
  – Final verification: for objects in Cnd, decide whether they are results by updating the candidates' bounds using expanded entries from Pruned.
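The traversal can be sketched as a best-first loop. This is a skeleton only: the bound computations are abstracted into callbacks, and the progressive bound refinement and final verification phase are omitted.

```python
import heapq

def rstknn_traverse(root, q, sim_lower, sim_upper, knn_lower, knn_upper):
    """Best-first branch-and-bound skeleton. The four callbacks are assumed
    given: sim_lower/sim_upper bound SimST(q, o) for objects o in an entry,
    and knn_lower/knn_upper are the entry's kNNL/kNNU bounds. Entries expose
    a .children list (empty for leaf objects)."""
    answers, candidates, pruned = [], [], []
    heap = [(0.0, 0, root)]
    tie = 1  # tie-breaker so entries are never compared directly
    while heap:
        _, _, entry = heapq.heappop(heap)
        if sim_upper(q, entry) < knn_lower(entry):
            pruned.append(entry)      # q cannot be a kNN of any object in entry
        elif sim_lower(q, entry) > knn_upper(entry):
            answers.append(entry)     # q is a kNN of every object in entry
        elif entry.children:
            for child in entry.children:  # undecided: descend, best first
                heapq.heappush(heap, (-sim_upper(q, child), tie, child))
                tie += 1
        else:
            candidates.append(entry)  # leaf object, left for final verification
    return answers, candidates, pruned
```

Ordering the queue by an upper bound on similarity to q makes the most promising subtrees expand first, so whole entries can often be pruned or reported without visiting their objects.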
Example: Execution of the RSTkNN Algorithm on the IUR-tree, given k = 2, α = 0.6
[Figure: IUR-tree with root N4 over N1{p1}, N2{p2, p3}, N3{p4, p5}; query q(0.5, 2.5)]
EnQueue(U, N4); initialize N4.CLs.
Priority queue U: N4(0, 0).
Example (cont'd)
DeQueue(U, N4). Mutual effects: (N1, N2), (N1, N3), (N2, N3).
EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1).
Pruned: N1(0.37, 0.432). U: N3(0.323, 0.619), N2(0.21, 0.619).
Example (cont'd)
DeQueue(U, N3). Mutual effects: (p4, N2); (p5, p4), (p5, N2).
Answer.add(p4); Candidate.add(p5).
Pruned: N1(0.37, 0.432). U: N2(0.21, 0.619). Answer: p4(0.21, 0.619). Candidate: p5(0.374, 0.374).
Example (cont'd)
DeQueue(U, N2). Mutual effects: (p2, p4), (p2, p5); (p3, p2), (p3, p4), (p3, p5).
Answer.add(p2, p3); Pruned.add(p5).
Pruned: N1(0.37, 0.432), p5(0.374, 0.374). Answer: p2, p3, p4.
Since both U and Candidate are now empty, the algorithm ends. Results: p2, p3, p4.
Cluster IUR-tree: CIUR-tree
• IUR-tree: the texts within one index node can be very different.
• CIUR-tree: an enhanced IUR-tree obtained by incorporating textual clusters.
[Figure: CIUR-tree over p1 to p5 with query q(0.5, 2.5); in addition to the intersection and union text vectors, each entry records the text clusters of its subtree, e.g. p1 in C1, p2 and p3 in C2, p4 in C3, p5 in C1, so N2 carries C2:2 and N3 carries C1:1, C3:1.]
Optimizations
• Motivation
  – To obtain tighter bounds during CIUR-tree traversal.
  – To purify the textual description in each index node.
• Outlier Detection and Extraction (ODE-CIUR)
  – Extract subtrees with outlier clusters.
  – Take the outliers into special account and calculate their bounds separately.
• Text-entropy based optimization (TE-CIUR)
  – Define TextEntropy to describe the distribution of text clusters in a CIUR-tree entry.
  – Visit entries with higher TextEntropy, i.e. more textually diverse ones, first.
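The slides do not give the TextEntropy formula, so the following is a plausible sketch using Shannon entropy over an entry's cluster counts; the exact definition in the paper may differ.

```python
import math

def text_entropy(cluster_counts):
    """Shannon entropy of the text-cluster distribution inside an entry.
    Higher entropy means the subtree's texts are more diverse, so under
    the TE ordering such entries would be visited earlier."""
    total = sum(cluster_counts.values())
    entropy = 0.0
    for count in cluster_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

An entry whose objects all fall in one cluster has entropy 0, while an entry split evenly across clusters has maximal entropy, matching the intuition that textually uniform subtrees yield tighter bounds.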
Experimental Study
• Experimental setup
  – OS: Windows XP; CPU: 2.0 GHz; memory: 4 GB
  – Page size: 4 KB; language: C/C++
• Compared methods
  – baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
• Datasets
  – ShopBranches (Shop): extended from a small real dataset
  – GeographicNames (GN): real data
  – CaliforniaDBpedia (CD): generated by combining locations in California with documents from DBpedia
• Metrics
  – Total query time
  – Number of page accesses
Statistics                        Shop       CD          GN
Total # of objects                304,008    1,555,209   1,868,821
Total unique words in dataset     3,933      21,578      222,409
Average # of words per object     45         47          4
Scalability
[Figure: query time (sec) versus dataset size (50K to 1050K) for baseline, IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE; (1) log-scale version, (2) linear-scale version]
Effect of k
[Figure: (a) query time (sec) and (b) number of page accesses versus k in {1, 3, 5, 7, 9} for IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE]
Conclusion
• Proposed a new query type, RSTkNN.
• Presented a hybrid index, the IUR-Tree.
• Presented an efficient search algorithm to answer RSTkNN queries.
• Presented the enhanced variant CIUR-Tree with two optimizations, ODE-CIUR and TE-CIUR, to further speed up search.
• Extensive experiments confirm the efficiency and scalability of the algorithms.
Reverse Spatial and Textual k Nearest Neighbor Search
Thanks!
Q & A
A straightforward method
1. Compute the RSkNN and the RTkNN separately;
2. Combine the RSkNN and RTkNN results to obtain the RSTkNN results.
However, there is no sensible way to do the combination. (Infeasible)