Date post: 18-Jan-2018
Page 1: Optimal Dimension Order: A Generic Technique for the Similarity Join

Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)

(1) University for Health Informatics and Technology, Innsbruck
(2) University of Munich

Page 2: Feature Based Similarity

Page 3: Simple Similarity Queries

Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
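
The two query types above can be sketched as linear scans over an array of feature vectors. This is only an illustration, not code from the talk: the names `range_query` and `knn_query`, the row-major data layout, and the use of Euclidean distance are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional feature vectors. */
static double sq_dist(const double *p, const double *q, size_t d) {
    double s = 0.0;
    for (size_t i = 0; i < d; i++) {
        double diff = p[i] - q[i];
        s += diff * diff;
    }
    return s;
}

/* Range query: stores the indices of all points within distance eps of
 * the query in out (capacity n) and returns how many were found. */
size_t range_query(const double *data, size_t n, size_t d,
                   const double *query, double eps, size_t *out) {
    assert(eps >= 0.0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (sq_dist(data + i * d, query, d) <= eps * eps)
            out[count++] = i;
    return count;
}

/* k-nearest-neighbor query: stores the indices of the (at most) k closest
 * points in out, ordered by increasing distance. best is scratch space
 * holding the k smallest squared distances seen so far. */
size_t knn_query(const double *data, size_t n, size_t d,
                 const double *query, size_t k, size_t *out, double *best) {
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        double dist2 = sq_dist(data + i * d, query, d);
        if (found == k && (k == 0 || dist2 >= best[k - 1]))
            continue;                          /* not among the k best */
        size_t pos = (found < k) ? found++ : k - 1;
        while (pos > 0 && best[pos - 1] > dist2) {
            best[pos] = best[pos - 1];         /* shift worse candidates */
            out[pos] = out[pos - 1];
            pos--;
        }
        best[pos] = dist2;
        out[pos] = i;
    }
    return found;
}
```

Both functions visit every point once; the index structures discussed on the later slides exist to avoid exactly this full scan.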

Page 4: Join Applications: Catalogue Matching

Catalogue matching
• E.g. astronomy catalogues

[Figure labels: R, S]

Page 5: Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join

Page 6: R-Tree Similarity Join

Depth-first traversal of two trees
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

[Figure labels: R, S]

Page 7: The ε-kdB-Tree

[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Assumption: 2 adjacent ε-stripes fit in main memory. Unrealistic for large data sets which are
• clustered,
• skewed, and
• high-dimensional

Page 8: Epsilon Grid Order

[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

Page 9: Common Properties

• Decomposition of data/space into regions
• Regions described by hyper-rectangles

for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    for each pair of points (p,q) on (P,Q)
        test dist(p,q) ≤ ε ;

Most CPU effort is spent in the distance test between vectors.
Idea: speed up the distance test
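
The common loop structure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited join algorithms: the `Partition` struct, the fixed dimensionality `DIM`, and the function names are hypothetical, and dist(P,Q) is taken to be the minimum Euclidean distance between the two bounding hyper-rectangles.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 2  /* hypothetical fixed dimensionality for this sketch */

typedef struct {
    double lo[DIM], hi[DIM];    /* bounding hyper-rectangle of the region */
    const double (*pts)[DIM];   /* points stored in the partition */
    size_t n;                   /* number of points */
} Partition;

/* Squared minimum distance between the hyper-rectangles of P and Q. */
static double mbr_sq_dist(const Partition *P, const Partition *Q) {
    double s = 0.0;
    for (size_t i = 0; i < DIM; i++) {
        double gap = 0.0;
        if (P->hi[i] < Q->lo[i])      gap = Q->lo[i] - P->hi[i];
        else if (Q->hi[i] < P->lo[i]) gap = P->lo[i] - Q->hi[i];
        s += gap * gap;
    }
    return s;
}

/* Distance test between two points: 1 if dist(p,q) <= eps, else 0. */
static int within_eps(const double *p, const double *q, double eps) {
    double dist2 = 0.0;
    for (size_t i = 0; i < DIM; i++) {
        double diff = p[i] - q[i];
        dist2 += diff * diff;
    }
    return dist2 <= eps * eps;
}

/* Counts the point pairs (p,q) with dist(p,q) <= eps, testing only
 * those partition pairs whose hyper-rectangles are within eps. */
size_t similarity_join(const Partition *R, size_t nr,
                       const Partition *S, size_t ns, double eps) {
    assert(eps >= 0.0);
    size_t result = 0;
    for (size_t a = 0; a < nr; a++)
        for (size_t b = 0; b < ns; b++) {
            if (mbr_sq_dist(&R[a], &S[b]) > eps * eps)
                continue;                      /* partition pair pruned */
            for (size_t i = 0; i < R[a].n; i++)
                for (size_t j = 0; j < S[b].n; j++)
                    result += (size_t)within_eps(R[a].pts[i], S[b].pts[j], eps);
        }
    return result;
}
```

The innermost call to `within_eps` runs once per candidate point pair, which is why the slide singles out the distance test as the place to save CPU time.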

Page 10: Related Work: Plane Sweep for Polygons

[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]

Observations:
• More efficient to use the x-axis as sweep direction.
• Projection of the polygons to the y-axis yields high overlap
• Decide by projections of the bounding boxes (integrate a pdf)

Page 11: Feature Vectors in the Similarity Join

Distance computation between feature vectors p, q:

dist2 = 0 ;
for (i = 0 ; i < d ; i++) {
    dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;  /* add squared difference */
    if (dist2 > eps * eps)  /* partial sum already exceeds ε²: abort early */
        break ;
}

Order dimensions by mating probability (increasing)

[Figure: axes d_0 and d_1]

Page 12: Computation of the Mating Probability

To determine the mating probability for d_i:
• Project bounding boxes on the d_i-axis

[Figure: axes d_0 and d_1]

Page 13: Computation of the Mating Probability

To determine the mating probability for d_i:
• Project bounding boxes on the d_i-axis
• Consider two projections in 2-dimensional space

[Figure: projections onto the d_0-axis]

Page 14: Computation of the Mating Probability

To determine the mating probability for d_i:
• Project bounding boxes on the d_i-axis
• Consider two projections in 2-dimensional space

The d_0-projection of each point pair is located in this event space.

[Figure: event space with axes d_0[P] and d_0[Q]]

Page 15: Computation of the Mating Probability

To determine the mating probability for d_i:
• Project bounding boxes on the d_i-axis
• Consider two projections in 2-dimensional space

The d_0-projection of each point pair is located in this event space.
Mating point pairs lie on the ε-stripe (x − ε ≤ y ≤ x + ε).

[Figure: event space with axes d_0[P] and d_0[Q], with the ε-stripe marked]

Page 16: Computation of the Mating Probability

To determine the mating probability for d_i:
• Project bounding boxes on the d_i-axis
• Consider two projections in 2-dimensional space

[Figure: mating probability for d_0, shown in the event space with axes d_0[P] and d_0[Q]]
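
One way to compute this quantity is sketched below. It rests on an assumption the slides do not spell out: point projections are treated as uniformly distributed within the projected bounding boxes, so the mating probability is the fraction of the event-space rectangle covered by the ε-stripe. Instead of the talk's case distinction over intersection shapes, the sketch swaps in a closed-form integral of a clamped linear function; all function and parameter names are hypothetical.

```c
#include <assert.h>

/* Integral of clamp(t, 0, h) over t from minus infinity up to u. */
static double clamp_integral(double u, double h) {
    if (u <= 0.0) return 0.0;
    if (u <= h)   return 0.5 * u * u;
    return 0.5 * h * h + h * (u - h);
}

/* Area of the part of the rectangle [p_lo,p_hi] x [q_lo,q_hi] in which
 * y - x <= t, where x is the P-projection and y the Q-projection. */
static double area_below(double p_lo, double p_hi,
                         double q_lo, double q_hi, double t) {
    double h = q_hi - q_lo;
    return clamp_integral(p_hi + t - q_lo, h)
         - clamp_integral(p_lo + t - q_lo, h);
}

/* Fraction of the event space covered by the stripe |x - y| <= eps.
 * Assumes non-degenerate projections (p_hi > p_lo and q_hi > q_lo). */
double mating_probability(double p_lo, double p_hi,
                          double q_lo, double q_hi, double eps) {
    assert(p_hi > p_lo && q_hi > q_lo && eps >= 0.0);
    double stripe = area_below(p_lo, p_hi, q_lo, q_hi, eps)
                  - area_below(p_lo, p_hi, q_lo, q_hi, -eps);
    return stripe / ((p_hi - p_lo) * (q_hi - q_lo));
}
```

For two identical unit intervals and ε = 0.5, the stripe leaves out two corner triangles of area 0.125 each, so the function returns 0.75.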

Page 17: Optimal Dimension Order

For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability.

Algorithm:
for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    determine ODO ;
    for each pair of points (p,q) on (P,Q)
        test dist(p,q) ≤ ε using ODO ;
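
The two inner steps of the algorithm can be sketched as below. The per-dimension mating probabilities are taken as given input (a hypothetical array `prob`); `dimension_order` and `within_eps_ordered` are illustrative names, and the `dims_used` counter is added only to make the effect of the ordering observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t dim; double prob; } DimProb;

/* qsort comparator: ascending mating probability. */
static int by_prob(const void *a, const void *b) {
    double pa = ((const DimProb *)a)->prob;
    double pb = ((const DimProb *)b)->prob;
    return (pa > pb) - (pa < pb);
}

/* Fills order[0..d-1] with the dimension indices sorted by increasing
 * mating probability, i.e. the optimal dimension order for one (P,Q). */
void dimension_order(const double *prob, size_t d, size_t *order) {
    if (d == 0) return;
    DimProb *tmp = malloc(d * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t i = 0; i < d; i++) {
        tmp[i].dim = i;
        tmp[i].prob = prob[i];
    }
    qsort(tmp, d, sizeof *tmp, by_prob);
    for (size_t i = 0; i < d; i++)
        order[i] = tmp[i].dim;
    free(tmp);
}

/* Distance test that visits the dimensions in the given order and stops
 * as soon as the partial sum exceeds eps squared. Returns 1 if
 * dist(p,q) <= eps; *dims_used reports how many dimensions were read. */
int within_eps_ordered(const double *p, const double *q,
                       const size_t *order, size_t d, double eps,
                       size_t *dims_used) {
    double dist2 = 0.0;
    size_t i;
    for (i = 0; i < d; i++) {
        double diff = p[order[i]] - q[order[i]];
        dist2 += diff * diff;
        if (dist2 > eps * eps) {
            *dims_used = i + 1;
            return 0;
        }
    }
    *dims_used = i;
    return 1;
}
```

The order is computed once per partition pair and reused for every point pair in it, so the sorting cost is amortized over many distance tests.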

Page 18: Shape of the Intersection Area

20 different shapes are possible, e.g.

[Figure: example shapes with corner codes 1223, 2233, 2223]

Easy proof of completeness and efficient case distinction by assigning codes to the corners
• 1: Corner is left of or above the ε-stripe
• 2: Corner is on the ε-stripe
• 3: Corner is right of or below the ε-stripe

Easy formulas (only 45° and 90° angles)
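
The corner coding can be sketched as follows. Two things are assumptions, since the slide does not state them: the event space is taken with the P-projection on the x-axis and the Q-projection on the y-axis, and the four digits are listed in the order upper-left, lower-left, upper-right, lower-right. With that order the upper-left corner always has the smallest code and the lower-right corner the largest, which leaves 1 + 4 + 9 + 1 + 4 + 1 = 20 combinations, consistent with the count on the slide.

```c
#include <assert.h>

/* Code of one corner (x, y) relative to the stripe x - eps <= y <= x + eps:
 * 1 = left of or above, 2 = on the stripe, 3 = right of or below. */
int corner_code(double x, double y, double eps) {
    if (y > x + eps) return 1;
    if (y < x - eps) return 3;
    return 2;
}

/* Four-digit shape code of the event-space rectangle
 * [x_lo,x_hi] x [y_lo,y_hi]. Digit order (an assumption of this sketch):
 * upper-left, lower-left, upper-right, lower-right. */
int shape_code(double x_lo, double x_hi, double y_lo, double y_hi,
               double eps) {
    assert(x_lo <= x_hi && y_lo <= y_hi && eps >= 0.0);
    return 1000 * corner_code(x_lo, y_hi, eps)
         +  100 * corner_code(x_lo, y_lo, eps)
         +   10 * corner_code(x_hi, y_hi, eps)
         +        corner_code(x_hi, y_lo, eps);
}
```

For example, `shape_code(0, 3, 0, 3, 1)` yields 1223 and `shape_code(0, 2, 0, 1, 1)` yields 2223; a `switch` over the code can then select the area formula for each shape.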

Page 19: Experimental Evaluation: R-tree Sim. Join

[Figure: bar chart, scale 0% to 700%; bars: base technique, ODO-algorithm, SDO dimension 1 through SDO dimension 8]

8-dimensional data, uniformly distributed

Page 20: Experimental Evaluation: R-tree Sim. Join

16-dimensional data, from CAD-similarity search

Page 21: Experimental Evaluation: Scalability

[Figure panels: MuX, uniform data; Z-RSJ, uniform data]

Page 22: Experimental Evaluation: Scalability

[Figure panel: EGO, CAD data]

Page 23: Conclusion

Conclusion:
• Similarity join is an important database primitive for knowledge discovery in databases
• Many different basic algorithms
• Most can be accelerated by our optimal dimension order

Future Work:
• New applications of the similarity join
• Further optimization (multi-parameter) of the similarity join
• Parallel and distributed environments

