Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu, Ying Lu
Renmin University of China
Gao Cong
Nanyang Technological University, Singapore
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Motivation
• If we add a new shop at location Q, which existing shops will be influenced?
• Influence factors:
  – Spatial distance. Results: D, F
  – Textual similarity (services/products offered). Results: F, C
[Figure: map of shops around the candidate location Q, labeled by category (food, clothes, sports)]
Problems of Finding Influential Sets
• Traditional query: reverse k nearest neighbor query (RkNN)
• Our new query: reverse spatial and textual k nearest neighbor query (RSTkNN)
Problem Statement
Spatial-Textual Similarity
• Describes the similarity between objects based on both spatial proximity and textual similarity.
Spatial-Textual Similarity Function
• SimST(o1, o2) = α · SimS(o1, o2) + (1 − α) · SimT(o1, o2), where SimS is the (normalized) Euclidean proximity of the locations, SimT is the Extended Jaccard similarity of the text vectors, EJ(a, b) = (a · b) / (‖a‖² + ‖b‖² − a · b), and α ∈ [0, 1] weights the two components.
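The similarity function above can be sketched in code. This is a minimal sketch, assuming text vectors stored as term-weight dictionaries and Euclidean distance normalized by a constant max_dist; the function and parameter names are illustrative, not from the paper.

```python
import math

def extended_jaccard(a, b):
    """Extended Jaccard similarity between two term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na2 = sum(w * w for w in a.values())
    nb2 = sum(w * w for w in b.values())
    denom = na2 + nb2 - dot
    return dot / denom if denom > 0 else 0.0

def sim_st(loc1, vct1, loc2, vct2, alpha, max_dist):
    """Weighted combination of spatial proximity and textual similarity.
    alpha in [0, 1]; max_dist normalizes the Euclidean distance."""
    dist = math.hypot(loc1[0] - loc2[0], loc1[1] - loc2[1])
    sim_s = 1.0 - dist / max_dist
    sim_t = extended_jaccard(vct1, vct2)
    return alpha * sim_s + (1.0 - alpha) * sim_t
```

With alpha = 1 the function degenerates to pure spatial proximity, with alpha = 0 to pure textual similarity, matching the role of α in the query definition.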
Problem Statement (cont'd)
• RSTkNN query
  – finds the objects that have the query object as one of their k most spatial-textually similar objects.
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Related Work
• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/plane pruning strategies (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)
• Branch and bound, based on Lp-norm metric spaces (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging Features:
• Euclidean geometric properties no longer hold.
• The text space is high-dimensional.
• k and α differ from query to query.
Baseline Method
Given a query q, k, and α:
• Precompute the spatial NNs and the textual NNs of each object.
• For each object o in the database, run the Threshold Algorithm over the two precomputed lists to find o's spatial-textual kNN o′:
  – if q is more similar to o than o′ is, report o as a result;
  – if q is no more similar to o than o′ is, o is not a result.
• Inefficient, since it lacks a dedicated data structure.
Outline
• Motivation & Problem Statement
• Related Work
• RSTkNN Search Strategy
• Experiments
• Conclusion
Intersection and Union R-tree (IUR-tree)
[Figure: IUR-tree over points p1 to p5 with query q(0.5, 2.5); leaf entries store the objects' text vectors (ObjVct1 to ObjVct5), and each index node N1 to N4 stores an MBR together with an intersection text vector (IntVct) and a union text vector (UniVct) summarizing the texts in its subtree.]
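The node layout in the figure can be sketched as follows. This is a minimal sketch assuming text vectors stored as dicts and MBRs as (x1, y1, x2, y2) tuples; the class and field names (IURNode, int_vct, uni_vct) are shorthand for the intersection and union vectors, not the paper's own identifiers.

```python
class LeafObject:
    """A spatial object: a point location plus a text vector."""
    def __init__(self, x, y, text_vct):
        self.mbr = (x, y, x, y)            # degenerate MBR for a point
        self.int_vct = dict(text_vct)      # for a single object,
        self.uni_vct = dict(text_vct)      # intersection = union = its text

class IURNode:
    """An R-tree node augmented with an intersection text vector
    (per-term minimum weight over the subtree) and a union text vector
    (per-term maximum weight over the subtree)."""
    def __init__(self, children):
        self.children = children
        xs = [v for c in children for v in (c.mbr[0], c.mbr[2])]
        ys = [v for c in children for v in (c.mbr[1], c.mbr[3])]
        self.mbr = (min(xs), min(ys), max(xs), max(ys))
        terms = set().union(*(c.uni_vct for c in children))
        self.uni_vct = {t: max(c.uni_vct.get(t, 0.0) for c in children)
                        for t in terms}
        self.int_vct = {t: min(c.int_vct.get(t, 0.0) for c in children)
                        for t in terms}
```

The two vectors bracket every text vector in the subtree term by term, which is what makes entry-level similarity bounds possible.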
Main Idea of the Search Strategy
• Prune an entry E of the IUR-tree when query q is no more similar than the lower bound kNNL(E).
• Report an entry E as a result when query q is more similar than the upper bound kNNU(E).
[Figure: query objects q1, q2, q3 relative to an entry E, illustrating the lower bound kNNL(E) and the upper bound kNNU(E)]
How to Compute the Bounds
Similarity approximations between entries E and E′:
• MinST(E, E′): for all o ∈ E and o′ ∈ E′, SimST(o, o′) ≥ MinST(E, E′)
• TightMinST(E, E′): there exist o ∈ E and o′ ∈ E′ with SimST(o, o′) ≥ TightMinST(E, E′)
• MaxST(E, E′): for all o ∈ E and o′ ∈ E′, SimST(o, o′) ≤ MaxST(E, E′)
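Assuming spatial proximity is Euclidean distance normalized by a constant dmax, as in the similarity function, entry-level bounds can be sketched like this. The textual bounds sim_t_lower/sim_t_upper are taken as given (in the paper they would be derived from the intersection and union vectors), and all names are illustrative.

```python
import math

def min_dist(m1, m2):
    """Smallest possible distance between points in two MBRs (x1, y1, x2, y2)."""
    dx = max(m1[0] - m2[2], m2[0] - m1[2], 0.0)
    dy = max(m1[1] - m2[3], m2[1] - m1[3], 0.0)
    return math.hypot(dx, dy)

def max_dist(m1, m2):
    """Largest possible distance between points in two MBRs."""
    dx = max(abs(m1[2] - m2[0]), abs(m2[2] - m1[0]))
    dy = max(abs(m1[3] - m2[1]), abs(m2[3] - m1[1]))
    return math.hypot(dx, dy)

def min_st(m1, m2, sim_t_lower, alpha, dmax):
    """Lower bound on SimST for any object pair drawn from the two entries:
    worst-case spatial proximity plus a lower bound on textual similarity."""
    return alpha * (1.0 - max_dist(m1, m2) / dmax) + (1.0 - alpha) * sim_t_lower

def max_st(m1, m2, sim_t_upper, alpha, dmax):
    """Upper bound on SimST for any object pair drawn from the two entries:
    best-case spatial proximity plus an upper bound on textual similarity."""
    return alpha * (1.0 - min_dist(m1, m2) / dmax) + (1.0 - alpha) * sim_t_upper
```

By construction min_st never exceeds max_st for the same pair of entries, which is the invariant the pruning and reporting rules rely on.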
Example for Computing Bounds
[Figure: points p1 to p5 in nodes N1 to N3 of the IUR-tree, query q(0.5, 2.5)]
Currently traveled entries: N1, N2, N3. Given k = 2, compute kNNL(N1) and kNNU(N1).
• TightMinST(N1, N3) = 0.564, MinST(N1, N3) = 0.370
• TightMinST(N1, N2) = 0.179, MinST(N1, N2) = 0.095
• MaxST(N1, N3) = 0.432, MaxST(N1, N2) = 0.150
Computing kNNL(N1): taking the effects of N3 and then N2 into account, the bound decreases to kNNL(N1) = 0.370.
Computing kNNU(N1): similarly, the bound decreases to kNNU(N1) = 0.432.
Overview of Search Algorithm
• RSTkNN Algorithm:
  – Traverse from the IUR-tree root.
  – Progressively update lower and upper bounds.
  – Apply the search strategy:
    • prune unrelated entries into Pruned;
    • report entries as results into Ans;
    • add candidate objects into Cnd.
  – Final verification: for objects in Cnd, decide whether they are results by updating the candidates' bounds using expanded entries from Pruned.
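The traversal can be sketched as a best-first loop. This is a skeleton only: the bound computations are abstracted into callbacks, and the progressive bound refinement and final verification phase are omitted.

```python
import heapq

def rstknn_traverse(root, q, sim_lower, sim_upper, knn_lower, knn_upper):
    """Best-first branch-and-bound skeleton. The four callbacks are assumed
    given: sim_lower/sim_upper bound SimST(q, o) for objects o in an entry,
    and knn_lower/knn_upper are the entry's kNNL/kNNU bounds. Entries expose
    a .children list (empty for leaf objects)."""
    answers, candidates, pruned = [], [], []
    heap = [(0.0, 0, root)]
    tie = 1  # tie-breaker so entries are never compared directly
    while heap:
        _, _, entry = heapq.heappop(heap)
        if sim_upper(q, entry) < knn_lower(entry):
            pruned.append(entry)      # q cannot be a kNN of any object in entry
        elif sim_lower(q, entry) > knn_upper(entry):
            answers.append(entry)     # q is a kNN of every object in entry
        elif entry.children:
            for child in entry.children:  # undecided: descend, best first
                heapq.heappush(heap, (-sim_upper(q, child), tie, child))
                tie += 1
        else:
            candidates.append(entry)  # leaf object, left for final verification
    return answers, candidates, pruned
```

Ordering the queue by an upper bound on similarity to q makes the most promising subtrees expand first, so whole entries can often be pruned or reported without visiting their objects.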
Example: Execution of the RSTkNN Algorithm on the IUR-tree, given k = 2, α = 0.6
[Figure: IUR-tree with root N4 over N1{p1}, N2{p2, p3}, N3{p4, p5}; query q(0.5, 2.5)]
EnQueue(U, N4); initialize N4.CLs.
Priority queue U: N4(0, 0).
Example (cont'd)
DeQueue(U, N4). Mutual effects: (N1, N2), (N1, N3), (N2, N3).
EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1).
Pruned: N1(0.37, 0.432). U: N3(0.323, 0.619), N2(0.21, 0.619).
Example (cont'd)
DeQueue(U, N3). Mutual effects: (p4, N2); (p5, p4), (p5, N2).
Answer.add(p4); Candidate.add(p5).
Pruned: N1(0.37, 0.432). U: N2(0.21, 0.619). Answer: p4(0.21, 0.619). Candidate: p5(0.374, 0.374).
Example (cont'd)
DeQueue(U, N2). Mutual effects: (p2, p4), (p2, p5); (p3, p2), (p3, p4), (p3, p5).
Answer.add(p2, p3); Pruned.add(p5).
Pruned: N1(0.37, 0.432), p5(0.374, 0.374). Answer: p2, p3, p4.
Since both U and Candidate are now empty, the algorithm ends. Results: p2, p3, p4.
Cluster IUR-tree: CIUR-tree
• IUR-tree: the texts within one index node can be very different.
• CIUR-tree: an enhanced IUR-tree obtained by incorporating textual clusters.
[Figure: CIUR-tree over p1 to p5 with query q(0.5, 2.5); in addition to the intersection and union text vectors, each entry records the text clusters of its subtree, e.g. p1 in C1, p2 and p3 in C2, p4 in C3, p5 in C1, so N2 carries C2:2 and N3 carries C1:1, C3:1.]
Optimizations
• Motivation
  – To obtain tighter bounds during CIUR-tree traversal.
  – To purify the textual description in each index node.
• Outlier Detection and Extraction (ODE-CIUR)
  – Extract subtrees with outlier clusters.
  – Take the outliers into special account and calculate their bounds separately.
• Text-entropy based optimization (TE-CIUR)
  – Define TextEntropy to describe the distribution of text clusters in a CIUR-tree entry.
  – Visit entries with higher TextEntropy, i.e. more textually diverse ones, first.
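The slides do not give the TextEntropy formula, so the following is a plausible sketch using Shannon entropy over an entry's cluster counts; the exact definition in the paper may differ.

```python
import math

def text_entropy(cluster_counts):
    """Shannon entropy of the text-cluster distribution inside an entry.
    Higher entropy means the subtree's texts are more diverse, so under
    the TE ordering such entries would be visited earlier."""
    total = sum(cluster_counts.values())
    entropy = 0.0
    for count in cluster_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

An entry whose objects all fall in one cluster has entropy 0, while an entry split evenly across clusters has maximal entropy, matching the intuition that textually uniform subtrees yield tighter bounds.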
Experimental Study
• Experimental setup
  – OS: Windows XP; CPU: 2.0 GHz; memory: 4 GB
  – Page size: 4 KB; language: C/C++
• Compared methods
  – baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE
• Datasets
  – ShopBranches (Shop): extended from a small real dataset
  – GeographicNames (GN): real data
  – CaliforniaDBpedia (CD): generated by combining locations in California with documents from DBpedia
• Metrics
  – Total query time
  – Number of page accesses
Statistics                        Shop       CD          GN
Total # of objects                304,008    1,555,209   1,868,821
Total unique words in dataset     3,933      21,578      222,409
Average # of words per object     45         47          4
Scalability
[Figure: query time (sec) versus dataset size (50K to 1050K) for baseline, IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE; (1) log-scale version, (2) linear-scale version]
Effect of k
[Figure: (a) query time (sec) and (b) number of page accesses versus k in {1, 3, 5, 7, 9} for IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE]
Conclusion
• Proposed a new query type, RSTkNN.
• Presented a hybrid index, the IUR-Tree.
• Presented an efficient search algorithm to answer RSTkNN queries.
• Presented the enhanced variant CIUR-Tree with two optimizations, ODE-CIUR and TE-CIUR, to further speed up search.
• Extensive experiments confirm the efficiency and scalability of the algorithms.
Reverse Spatial and Textual k Nearest Neighbor Search
Thanks!
Q & A
A straightforward method
1. Compute the RSkNN and the RTkNN separately;
2. Combine the RSkNN and RTkNN results to obtain the RSTkNN results.
However, there is no sensible way to do the combination. (Infeasible)