Date post: | 31-Dec-2015 |
Category: |
Documents |
Upload: | kerry-wilkerson |
View: | 220 times |
Download: | 0 times |
Christian BöhmLudwig Maximilians Universität München
The Similarity Join: A Powerful Database Primitive for High Performance Data MiningTutorial, 17th Int. Conf. on Data Engineering, 2001-04-02
Christian BöhmLudwig Maximilians Universität München
The Similarity Join: A Powerful Database Primitive for High Performance Data MiningTutorial, 17th Int. Conf. on Data Engineering, 2001-04-02
Chr
isti
an B
öhm
3
150
High Performance Data MiningHigh Performance Data Mining
Fast decisions require knowledge just in time
Marketing Fraud Detection CRM Online Scoring OLAP
Chr
isti
an B
öhm
4
150
Previous Approaches to Fast Data MiningPrevious Approaches to Fast Data Mining
Sampling Approximations (grid) Dimensionality reduct. Parallelism
Loss of quality
Expensive & complex
All approaches combinable with join
KDD appl. get parallelism for free
Chr
isti
an B
öhm
6
150
Simple Similarity QueriesSimple Similarity Queries
• Specify query object and- Find similar objects – range query- Find the k most similar objects – nearest neighbor q.
Chr
isti
an B
öhm
7
150
Similarity – Range QueriesSimilarity – Range Queries
• Given: Query point qMaximum distance
• Formal definition:
• Cardinality of the result set is difficult to control: too small no results too large complete DB
Chr
isti
an B
öhm
8
150
Index Based Processing of Range QueriesIndex Based Processing of Range Queries
Chr
isti
an B
öhm
9
150
Similarity – Nearest Neighbor QueriesSimilarity – Nearest Neighbor Queries
• Given: Query point q
• Formal definition:
• Ties must be handled:- Result set enlargement- Non-determinism (don’t care)
Chr
isti
an B
öhm
11150
k-Nearest Neighbor Search and Rankingk-Nearest Neighbor Search and Ranking
• k-nearest neighbor query:- Do not only search only for one nearest neighbor but k
- Stop distance is the distance of the kth (last) candidate point
-
• Ranking-query:- Incremental version of k-nearest neighbor search- First call of FetchNext() returns first neighbor- Second call of FetchNext() returns second neighbor...- Typically only few results are fetched Don‘t generate all!
Chr
isti
an B
öhm
12150
Advanced Applications: DuplicatesAdvanced Applications: Duplicates
• Duplicate detection- E.g. Astronomic catalogue matching
• Similarity queries for large number of query obj
C1
C2
Chr
isti
an B
öhm
13150
Advanced Applications: Data MiningAdvanced Applications: Data Mining
• Density based clustering (DBSCAN)
Chr
isti
an B
öhm
14150
What is a Similarity Join?What is a Similarity Join?
• Given two sets R, S of points• Find all pairs of points according to similarity
• Various exact definitions for the similarity join
R
S
Chr
isti
an B
öhm
15150
What is a Similarity Join?What is a Similarity Join?
• Similarity join corresponds to set of identical similarity queries, evaluated for a large number of query points
• Sequential evaluation of similarity queries with index is the easiest similarity join algorithm
• Many more sophisticated approaches exist• Powerful database primitive to support modern
applications of data analysis and data mining
Chr
isti
an B
öhm
16150
Curse of DimensionalityCurse of Dimensionality
• Index structures fail (outperformed by the sequential scan) if the data space dimension becomes too high
• Many effects usually called Curse of Dimensionality
Chr
isti
an B
öhm
17150
Curse of DimensionalityCurse of Dimensionality
[Berchtold, Böhm, Keim, Kriegel: A Cost Model for High-Dim. Nearest Neighb. Search, PODS 1997]
With increasing dimension also increases... Typical radius of range queries Distance of a point to its nearest neighbor Edge length of regions of index structures
0.51=0.50.720.5 0.830.
5
Chr
isti
an B
öhm
18150
Curse of DimensionalityCurse of Dimensionality
A cost model for the access probability of index pages using the concept of Minkowski Sum
Chr
isti
an B
öhm
20150
Curse of DimensionalityCurse of Dimensionality
• Asymptotic behavior of similarity search
• Suppose number points VMink 2d VSphere
• Access probability = O(2d), but limited by 100%• Saturation area with near linear I/O cost O(n)
Chr
isti
an B
öhm
21150
Curse of DimensionalityCurse of Dimensionality
• For high dimension: Each similarity query accesses considerable fraction of all index pages.
• Index does not pay off, anyway sequ. scan• Strategies needed for efficient evaluation• Join: Base applications on powerful database
primitive that exploits high number of queries• Efficient algorithms for Similarity Join
Chr
isti
an B
öhm
22150
Organization of the TutorialOrganization of the Tutorial
1. Motivation
2. Defining the Similarity Join
3. Applications of the Similarity Join
4. Similarity Join Algorithms
5. Conclusion & Future Potential
Chr
isti
an B
öhm
24150
What Is a Similarity Join?What Is a Similarity Join?
Intuitive notion: 3 properties of the similarity join1. The similarity join is a join in the relational sense
Two sets R and S are combined into one such that the new set contains pairs of points that fulfill a join condition
2. Vector or metric objects rather than ordinary tuples of any type
3. The join condition involves similarity
Chr
isti
an B
öhm
25150
What Is a Similarity Join?What Is a Similarity Join?
Similarity Join
Distance Range Join NN-based Approaches
Closest Pair Query k-NN Join
Chr
isti
an B
öhm
26150
Distance Range Join (-Join)Distance Range Join (-Join)
• Intuitition: Given parameter All pairs of points where distance
• Formal Definition:
• In SQL-like notation:SELECT * FROM R, S WHERE ||R.obj S.obj||
Chr
isti
an B
öhm
27150
Distance Range Join (-Join)Distance Range Join (-Join)
• Most widespread and best evaluated join • Often also called the similarity join
Chr
isti
an B
öhm
28150
Distance Range Join (-Join)Distance Range Join (-Join)
• The distance range self join
is of particular importance for data mining (clustering) and robust similarity search
• Change definition to exclude trivial results•
Chr
isti
an B
öhm
29150
Distance Range Join (-Join)Distance Range Join (-Join)
• Disadvantage for the user:Result cardinality difficult to control: too small no result pairs are produced too large all pairs from R S are produced
• Worst case complexity is at least o(|R||S|)• For reasonable result set size, advanced join
algorithms yield asymptotic behavior which is better than O(|R||S|)
Chr
isti
an B
öhm
30150
k-Closest Pair Queryk-Closest Pair Query
• Intuition: Find those k pairs that yield least distance
• The principle of nearest neighbor search is applied on a basis per pair
• Classical problem of Computational Geometry• In the database context introduced by
[Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998] • There called distance join
Chr
isti
an B
öhm
31150
k-Closest Pair Queryk-Closest Pair Query
• Formal Definition:
• Ties solved by result set enlargement
• Other possibility: Non-determinism(don’t care which of the tie tuples are reported)
Chr
isti
an B
öhm
32150
k-Closest Pair Queryk-Closest Pair Query
In SQL notation: SELECT * FROM R, SORDER BY ||R.obj S.obj||STOP AFTER k
Chr
isti
an B
öhm
33150
k-Closest Pair Queryk-Closest Pair Query
• Self-join:- Exclude |R| trivial pairs (ri,ri) with distance 0
- Result is symmetric
• Applications:- Find all pairs of stock quota in a database that are
most similar to each other- Find music scores which are similar to each other- Noise robust duplicate elimination
Chr
isti
an B
öhm
34150
k-Closest Pair Queryk-Closest Pair Query
• Incremental ranking instead of exact specification of k
• No STOP AFTER clause:
SELECT * FROM R, S ORDER BY ||R.obj S.obj||
• Open cursor and fetch results one-by-one• Important: Only few results typically fetched
Don’t determine the complete ranking
Chr
isti
an B
öhm
35150
k-Nearest Neighbor Joink-Nearest Neighbor Join
• Intuition: Combine each point with its k nearest neighbors
• The principle of nearest neighbor search is applied for each point of R
• In the database context introduced by[Hjaltason & Samet, Incremental Distance Join Algorithms, SIGMOD Conf. 1998]
• There called distance semijoin
Chr
isti
an B
öhm
36150
k-Nearest Neighbor Joink-Nearest Neighbor Join
• Formal Definition:
• Ties solved by result set enlargement
• Other possibility: Non-determinism(don’t care which of the tie tuples are reported)
Chr
isti
an B
öhm
37150
k-Nearest Neighbor Joink-Nearest Neighbor Join
In SQL notation:(limited to k = 1)
SELECT * FROM R, SGROUP BY R.objORDER BY ||R.obj S.obj||STOP AFTER K (* k *)
Chr
isti
an B
öhm
38150
k-Nearest Neighbor Joink-Nearest Neighbor Join
• The k-NN-join is inherently asymmetric:
Chr
isti
an B
öhm
39150
k-Nearest Neighbor Joink-Nearest Neighbor Join
• Applications of the k-NN-join:- k-means and k-medoid clustering- Simultaneous nearest neighbor classification:
A large set of new objects without class label are assigned according to the majority of k nearest neighbors of each of the new objects
• Astronomic observation• Online customer scoring
• Ranking on the k-NN-join is difficult to define
Chr
isti
an B
öhm
40150
Further possible definitionsFurther possible definitions
• Inverse nearest neighbor join:Combine each point ri of R with every point of S which considers ri to be its nearest neighbor
• Metric data sets:Instead of vectors use arbitrary objects with a distance metric- E.g. Text sequences with edit distance- Text mining using the similarity join applies A*
Chr
isti
an B
öhm
43150
Schema for Data Mining AlgorithmsSchema for Data Mining Algorithms
Algorithmic Schema A1
foreach Point p DPointSet S := SimilarityQuery (p,
);foreach Point q S
DoSomething (p,q) ;
Chr
isti
an B
öhm
44150
Iterative similarity queries and cacheIterative similarity queries and cache
Due to curse of dimensionality:No sufficient inter-query locality of the pages
0,00
0,01
0,02
0,03
0,04
0,05
0,06
0,07
0,08
0 10 20 30 40Dimension (d )
Ave
rag
e ca
che
hit
rat
io
10-nn querysim. range query
Chr
isti
an B
öhm
45150
Iterative similarity queries and cacheIterative similarity queries and cache
Chr
isti
an B
öhm
46150
Idea: Query Order TransformationIdea: Query Order Transformation
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]
Transform order of similarity queries such that packing of points into pages is considered
If one pair of index pages is in the cache: process all sim. queries regarding this pair
Each pair of pages is considered at most once
Chr
isti
an B
öhm
48150
Transform the Original Schema A1…Transform the Original Schema A1…
Algorithmic Schema A1
foreach Point p DPointSet S := SimilarityQuery (p,
);foreach Point q S
DoSomething (p,q) ;
Chr
isti
an B
öhm
49150
…Into a New Algorithmic Schema A2…Into a New Algorithmic Schema A2
foreach DataPage PLoadAndPinPage (P) ;foreach DataPage Q
if (mindist (P,Q) )CachedAccess (Q) ;foreach Point p P
foreach Point q Qif (distance (p,q) )
DoSomething’ (p,q) ;UnFixPage (P) ;
Chr
isti
an B
öhm
50150
Similarity JoinSimilarity Join
A2 is a Similarity-Join-Algorithm:
foreach PointPair (p,q) DoSomething’ (p,q) ;
Where denotes the Similarity-Join:
SELECT * FROM R r1, R r2
WHERE distance (r1.object, r2.object)
Chr
isti
an B
öhm
51150
Implementation VariantsImplementation Variants
• Change of the order in which points are combined must partially be considered
Implementation
Semantic Materialization
Change algorithm to take unknown order into account
Materialize join result j and answer original queries by j
Chr
isti
an B
öhm
52150
Example Clustering AlgorithmsExample Clustering Algorithms
DBSCAN[Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise´, KDD 1996]
Flat clustering (non hierarchical)
OPTICS[Ankerst, Breunig, Kriegel, Sander: OPTICS: Ordering Points To Identify the Clustering Structure, SIGMOD Conf. 1999]
Hierachicalcluster-structure
1
2
3
Semantic Rewriting Materialization
Chr
isti
an B
öhm
53150
Transformation by Semantic RewritingTransformation by Semantic Rewriting
• Rewrite the algorithm to take the changed order of pairs into account
• Don´t assume any specific order in which pairs are generated Arbitrary similarity join algorithm possible
Chr
isti
an B
öhm
54150
Example: DBSCANExample: DBSCAN
p core object in D wrt. , MinPts: | N (p) | MinPts p directly density-reachable from q in D wrt. , MinPts:
1) p N(q) and 2) q is a core object wrt. , MinPts
density-reachable: transitive closure.
cluster:- maximal wrt. density reachability- any two points are density-reachable from
a third object
Chr
isti
an B
öhm
55150
Implementation of DBSCAN on JoinImplementation of DBSCAN on Join
Core point property:DoSomething() increments a counter attribute
Determination of maximal density-reachable clusters:DoSomething():- Assign ID of known cluster point to unknown cluster points - Unify two known clusters
Chr
isti
an B
öhm
58150
Implementing OPTICS (Materialization)Implementing OPTICS (Materialization)
• The join result is predetermined before starting the actual OPTICS algorithm
• The result is materialized in some table with GROUP-BY on the first point of the pair
• The OPTICS algorithm runs unchanged• Similarity queries are answered from the join
materialization table (much faster)• Disadvantage: High memory requirements
Chr
isti
an B
öhm
59150
Experimental Results: Page CapacityExperimental Results: Page Capacity
100
1000
10000
100000
1000000
0 2000 4000 6000 8000 10000
page capacity
run
tim
e [
sec]
100
1000
10000
100000
1000000
0 100 200 300page capacity
run
tim
e [
sec]
Q-DBSCAN (Seq. Scan)
Q-DBSCAN (R*-tree)
Q-DBSCAN (X-tree)
J-DBSCAN (R*-tree)
J-DBSCAN (X-tree)
Meteorology data9-dimensional
Color image data64-dimensional
Chr
isti
an B
öhm
60150
Experimental Results: ScalabilityExperimental Results: Scalability
0
30000
60000
90000
120000
150000
0 30000 60000 90000
size of database [points]
run
tim
e [
sec]
Q-DBSCAN (Seq. Scan)
Q-DBSCAN (X-tree)
J-DBSCAN (X-tree)
0
30000
60000
90000
120000
150000
50000 150000 250000
size of database [points]
run
tim
e [
sec]
Q-OPTICS (Seq. Scan)
Q-OPTICS (X-tree)
J-OPTICS (X-tree)
Color image data Meteorology data
Chr
isti
an B
öhm
61150
Experimental Results: Query RangeExperimental Results: Query Range
0
10000
20000
30000
40000
50000
60000
70000
0,00 0,05 0,10 0,15 0,20 0,25
epsilon
run
tim
e [
sec]
Querybased (X-tree)
Joinbased (X-tree)
0
20000
40000
60000
80000
100000
120000
140000
0,1 0,15 0,2 0,25 0,3
epsilonru
nti
me
[se
c]
Q-OPTICS (Seq. Scan)Q-OPTICS (X-tree)J-OPTICS (X-tree)
Color image data Color image data
Q-DBSCAN (X-tree)J-DBSCAN (X-tree)
Chr
isti
an B
öhm
62150
Robust Similarity SearchRobust Similarity Search
[Agrawal, Lin, Sawhney, Shim: Fast Similariy Search in the Presence of Noise,...., VLDB 1995]
• Usual similarity search with feature vectors:Not robust with respect to- Noise:
Euclidean distance sensitive to mismatch in single dimension
- Partial similarity: Not complete objects are similar, but parts thereof
• Concept to achieve robustness:Decompose each data object and query object into sub-objects and search for a maximum number of similar subobjects
Chr
isti
an B
öhm
63150
Robust Similarity SearchRobust Similarity Search
• Prominent concept borrowed from IR research:String decomposition: Search for similar words by indexing of character triplets (n-lets)
• Query transformed to set of similarity queries similarity join between query set and data set
• Robustness achieved in result recombination:- Noise robustness: Ignore missing matches- Partial search: Dont enforce complete recombination
Chr
isti
an B
öhm
64150
Robust Similarity SearchRobust Similarity Search
Applications:• Robust search for sequences:
[Agrawal, Lin, Sawhney, Shim: Fast Similariy Search in the Presence of Noise,...., VLDB 1995]
• Principle can be generalized for objects like- Raster images- CAD objects- 3D molecules- etc.
Chr
isti
an B
öhm
65150
Astronomic Catalogue MatchingAstronomic Catalogue Matching
• Relative position of catalogues approx. known:- Position and intensity parameters in different bands
C1
C2
• C1 C2
• Determine according to device tolerance
Chr
isti
an B
öhm
66150
Astronomic Catalogue MatchingAstronomic Catalogue Matching
• Relative position unknown:- Match according to triangles and intensity
C1
C2
• Search triangles and store parameters (height,...)• triangles (C1) triangles (C2)
Chr
isti
an B
öhm
67150
k-Nearest Neighbor Classificationk-Nearest Neighbor Classification
• Simultaneous classification of many objects[Braunmüller, Ester, Kriegel, Sander: Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases, ICDE 2000]
- Astronomy• Some 10,000 new objects collected per night• Classify according to some millions of known objects
- Online customer scoring• Some 1,000 customers online• Rate them according to some millions of known patterns
Chr
isti
an B
öhm
68150
k-Nearest Neighbor Classificationk-Nearest Neighbor Classification
• Example:
Objects with known class
New objects
k = 3
• New objects Known objects
Chr
isti
an B
öhm
69150
k-Means and k-Medoid Clusteringk-Means and k-Medoid Clustering
• k Points initially randomly selected („centers“)• Each database point assigned to nearest center• Centers are re-determined
- k-means: Means of all assigned points (artificial p.)- k-medoid: One central database point of the cluster
• Assignment and center determination are repeated until convergence
Chr
isti
an B
öhm
70150
k-Means and k-Medoid Clusteringk-Means and k-Medoid Clustering
• Example: (k-means with k = 3)
Convergence!
• Each assignment phase: DB-Points Centers
Chr
isti
an B
öhm
72150
Algorithms´ OverviewAlgorithms´ Overview
Similarity join
Range dist. join
Closest pair qu.
k-NN join
Index based
Hashing based
Sorting based
on-the-fly index
Optimization
Cost modeling
CPU optimizing
Chr
isti
an B
öhm
73150
Algorithms´ OverviewAlgorithms´ Overview
Distance range join (-join) Index joins with depth-first and breadth-first search
[Brinkhoff, Kriegel, Seeger: Efficient Proc. of Spatial Joins Using R-trees, SIGMOD Conf. 1993][Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996][Huang, Jing, Rundensteiner: Spatial Joins Usg. R-trees: Breadth-First Traversal..., VLDB 1997]
Index construction on-the-fly[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994][Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997][Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997][van den Bercken, Schneider, Seeger: Plug&Join, EDBT 2000]
Join-algorithms based on hashing[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996][Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]
Chr
isti
an B
öhm
74150
Algorithms´ OverviewAlgorithms´ Overview
Join-algorithms based on sorting[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991][Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997][Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order, SIGMOD Conf. 2001]
Closest pair query and nearest neighbor join[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998][Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000][Corral, Manolopoulos, Theodoridis, Vassilakopoulos: Closest Pair Queries in Spatial Databases, SIGMOD Conf. 2000]
Optimization approaches[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday 1630][Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted]
Chr
isti
an B
öhm
75150
Nested Loop JoinNested Loop Join
• Simple nested loop join:- Iterate over R-points- Nested iteration over S-points
S is scanned |R| times, high I/O cost
• Nested block loop join:- First iterate over blocks- Nested iterate over tuples
S scanned |R|/|B| times
R S
S-tuples
R-t
uple
s
S-bl
ocks
R-b
lock
s
Chr
isti
an B
öhm
76150
Indexed Nested Loop JoinIndexed Nested Loop Join
• Iterate over every point of R• Determine matches in S by
similarity queries on the index
• Due to the curse of dimensionality: Performance deterioration of the similarity q. Then not competitive with nested loop join(Depends on dimensionality and selectivity determined by )
S
R
Chr
isti
an B
öhm
77150
Spatial Join Similarity Join Spatial Join Similarity Join
• 2D polygon databases• Join-predicate: Overlap• Conserv. approximation:
MBR (ax-par. rectangle)
• High-D point databases• Join-predicate: Distance• Map -join to spatial join
Cube with edge-length
• Some strategies can be borrowed from the spatial join
Chr
isti
an B
öhm
78150
R-tree Spatial Join (RSJ)R-tree Spatial Join (RSJ)
[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
• Originally: Spatial join for 2D rect. intersection• Depth-first search in R-trees and similar indexes• Assumption: Index preconstructed on R and S• Simple recursion scheme (equal tree height):
procedure r_tree_join (R, S: page) foreach r R.children do foreach s S.children do if intersect (r,s) then r_tree_join (r,s) ;
Chr
isti
an B
öhm
79150
R-tree Spatial Join (RSJ)R-tree Spatial Join (RSJ)
• Adaptation for the similarity join:Distance predicate rather than intersection
• For pair (R,S) of pages: mindist (R,S) Least possible distance of two points in (R,S)
Chr
isti
an B
öhm
80150
R-tree Spatial Join (RSJ)R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ) if IsDirpg (R) IsDirpg (S) then foreach r R.children do foreach s S.children do if mindist (r,s) then CacheLoad(r); CacheLoad(s); r_tree_sim_join (r,s,) ; else (* assume R,S both DataPg *) foreach p R.points do foreach q S.points do if |p q| then report (p,q);
R S
Chr
isti
an B
öhm
81150
R-tree Spatial Join (RSJ)R-tree Spatial Join (RSJ)
• Extension to different tree heights straightforw.• Several additional optimizations possible• CPU-bound
- Cost dominated by point-distance calculations
• Disadvantages- No clear strategies for page access priorization- Single page accesses
Can be outperformed by nested block loop join
Chr
isti
an B
öhm
82150
Parallel RSJParallel RSJ
[Brinkhoff, Kriegel, Seeger: Parallel Processing of Spatial Joins Using R-trees, ICDE 1996]
• Again spatial join for 2D rectangle intersection• Three phases of parallel execution:
- Task creation (non-parallel)- Task assignment (non-parallel)- Task execution (completely parallel)
• A task corresponds to a pair of subtrees- At high tree level (e.g. root or second level)
Chr
isti
an B
öhm
86150
Parallel RSJParallel RSJ
• Strategy 3: Dynamic task assignment- Processor requests a task when idle- Best load balancing
Chr
isti
an B
öhm
87150
Breadth-First R-tree Join (BFRJ)Breadth-First R-tree Join (BFRJ)
[Huang, Jing, Rundensteiner: Spatial Joins Using R-trees: Breadth-First Traversal..., VLDB 1997]
• Again spatial join for 2D rectangle intersection• Shortcoming of RSJ:
- No strategy in outer loop improving locality in inner - Depth-first traversal not flexible, because a pair of
tree branches must be ended before next pair started
unnecessary page accesses
Chr
isti
an B
öhm
88150
Breadth-First R-tree Join (BFRJ)Breadth-First R-tree Join (BFRJ)
• Solution:- Proceed level by level (breadth-first traversal)- Determine all relevant pairs for the next level
intermediate join index (IJI)- Sort the IJI according to suitable order before
accessing the next level global optimization strategy
Chr
isti
an B
öhm
90150
Breadth-First R-tree Join (BFRJ)Breadth-First R-tree Join (BFRJ)
Options for ordering:1. No particular order
2. Consider the lower x-coordinate of R´s nodes
3. Sum of the centers of x-coordinates of R and S
4. x-coordinate of center of common MBR
5. Hilbert-value of center of common MBR
Higher locality (better cache hit rates) for better
ordering strategies.
Chr
isti
an B
öhm
92150
Approaches without Preconstructed IndexApproaches without Preconstructed Index
• Indexes can be constructed temporarily for join• R-tree construction by INSERT too expensive
Use cheap bottom-up-construction- Hilbert R-trees: O (n log n)
[Kamel, Faloutsos: Hilbert R-trees: An Improved R-tree using Fractals, VLDB 1994]
Sort points by SFC and pack adjacent points to page- Buffer trees
[van den Bercken, Seeger, Widmayer: A Generic Approach to Bulk Loading.., VLDB 1997]
- Repeated partitioning[Berchtold, Böhm, Kriegel: Improving the Query Performance ..., EDBT 1998]
• Index construction can amortize during join
Chr
isti
an B
öhm
93150
Seeded TreesSeeded Trees
[Lo, Ravishankar: Spatial Joins Using Seeded Trees, SIGMOD Conf. 1994]
• Again spatial join for 2D rectangle intersection• Assumption:
Only one data set (R) is supported by index• Typical application:
Set S is subquery result• Idea:
Use partitioning of R as a template for S
Chr
isti
an B
öhm
94150
Seeded TreesSeeded Trees
• Motivation- Early inserts to R-trees decide initial organization- We know that S will be matched with R- Start with small template tree instead of empty root
seed levels
Chr
isti
an B
öhm
95150
Seeded TreesSeeded Trees
• Tree consist of- Seed levels- Grown levels
• Tree unbalanced• Phases of tree
construction:- Seeding phase- Growing phase- Cleanup phase
Chr
isti
an B
öhm
96150
Seeded TreesSeeded Trees
• Seeding phase:- Copy k levels of the R-tree of set R- Last level: defined MBRs, but empty child pointers
called slot
- Three strategies for (slot and other) MBRs:• Copy complete MBR• Use only center point rather than complete MBR• Center point at slot level, otherwise complete MBR
Chr
isti
an B
öhm
97150
Seeded TreesSeeded Trees
• Growing phase- Insert of points: Choose subtree like in R*-tree- Seed level is not affected during growth phase:
• No insertions to seed level nodes• No split of seed level nodes
- If point is inserted into empty slot (NULL pointer):• A new empty data node is allocated• Further, this node is treated like a root in R-trees:
on overflow, no split is propagated upward (new root)• The R-trees in the slots are called grown subtree.
Chr
isti
an B
öhm
98150
Seeded TreesSeeded Trees
• Growing phase (cont...)- Various strategies for update of the MBRs in the
seed levels during insert operations:• No updates• Enlarge bounding box after insert of a not contained point• Determine minimum bounding rectangle after insert• ...
- In seed levels: In general, the page regions are ...• Not bounding rectangles, i.e. no conservative appx. of set• Not minimal
Chr
isti
an B
öhm
99150
Seeded TreesSeeded Trees
• Cleanup Phase- The MBR property of page regions is needed ...
• ... not for tree construction• ... but for join processing
- Therefore, actual MBRs are determined in cleanup- Empty slots (without grown subtrees) are deleted- No attempt to make the tree balanced
• Join the two indexed sets R and S like in RSJ
Chr
isti
an B
öhm
101150
The -kdB-treeThe -kdB-tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
• Algorithm for the range distance self join
• General idea: Grid approximation where grid line distance =
• Not all dimensions used for decomposition:As many dimensions as needed defined node capacity
Chr
isti
an B
öhm
103150
The -kdB-treeThe -kdB-tree
• Node fanout: 1/(assuming data space [0..1]d)• Tree structure is specific to given parameter
must be constructed for each join• The -kdB-trees of two adjacent stripes are
assumed to fit into main memory
Chr
isti
an B
öhm
104150
The -kdB-treeThe -kdB-tree
procedure t_match (R, S: node) if is_leaf (R) is_leaf (S) then ... else for i:=1 to 1/1 do t_match(R.child[i], S.child [i]) ; t_match (R.child[i], S.child [i+1]) ; t_match (R.child[i+1], S.child[i]) ; t_match (R.child[1/], S.child[1/]) ;
Chr
isti
an B
öhm
105150
The -kdB-treeThe -kdB-tree
• Limitation:For large values not really scalable
• In high-dimensional cases, =0.3 can be typical 60% of data must be held in main memory
• As long as data fit into main memory:-kdB-tree is one of the best similarity join alg.
• Unfortunately:IBM does not provide any code for comparison
Chr
isti
an B
öhm
107150
The Parallel -kdB-treeThe Parallel -kdB-tree
[Shafer, Agrawal: Parallel Algorithms for High-dimensional Similarity Joins, VLDB 1997]
• Parallel construction of the -kdB-tree:- Each processor has random subset of the data (1/N)- Each processor constructs -kdB-tree of its own set- Identical structure is enforced e.g. by split broadcast
CPU1 CPU2
Chr
isti
an B
öhm
108150
The Parallel -kdB-treeThe Parallel -kdB-tree
• Workload distribution:- Global determination of the cumulated node sizes- A unit workload is a pair (r,s) of leaf nodes- The cost of a workload is
|r||s| for different leaves and |r|(|r|+1)/2 for a single leaf (self join)
- Data is redistributed: Each processor gets 1/N work• join units are clustered to preserve locality• minimize redistribution (communication) and replication
Chr
isti
an B
öhm
109150
The Parallel -kdB-treeThe Parallel -kdB-tree
• Workload execution:- delete internal structure- cum. node size too large
second growth phase- data redistribution per-
formed asynchronously:Data sent in depth-first order of tree traversal to avoid network flooding
Chr
isti
an B
öhm
111150
Plug & JoinPlug & Join
[van den Bercken, Schneider, Seeger: Plug&Join: An Easy-to-Use Generic Algorithm, EDBT 2000]
Generic technique for several kinds of join- Main-memory R-tree constructed from R-sample- Partition R and S acc. to R-tree (buffers at leaves)
1 2 3 4
main memory
R
flush
1 2 3 4
main memory
S
Chr
isti
an B
öhm
112150
Spatial Hash JoinSpatial Hash Join
[Lo, Ravishankar: Spatial Hash Joins, SIGMOD Conf. 1996]
• Method for the spatial join using replication- Set R is partitioned without replication- Set S is partitioned according to R‘s buckets;
replication if intersection with more than 1 R-bucket- Join only corresponding buckets
Chr
isti
an B
öhm
113150
Spatial Hash JoinSpatial Hash Join
• Partitioning of R:- Using bootstrap-seeding, generates a seeded tree- A suitable number # of slots is determined- The set R is sampled (sample size c #)- Using some clustering method, # cluster centers are
determined in the set- The cluster centers are the slots in the seeded tree- Assign each R-obj. to slot with least enlargement
Chr
isti
an B
öhm
114150
Spatial Hash JoinSpatial Hash Join
• Partitioning of S and join phase:- Bucket extents of R are copied to S-buckets- For spatial join: Each object s of S is assigned ...
... to all buckets b which are intersected by s- For similarity join:
... to all buckets b with mindist (s,b) - All corresponding bucket pairs (r,s) are joined by
constructing a quadratic split-R-tree on r.- Each obj in s is probed to the R-tree on r.
Chr
isti
an B
öhm
116150
Partition Based Spatial Merge JoinPartition Based Spatial Merge Join
[Patel, DeWitt: Partition Based Spatial-Merge Join, SIGMOD Conf. 1997]
• Again spatial join method using replication- Both sets R and S are partitioned with replication- Space is regularly decomposed into tiles- Partitions either corre-
spond to tiles or are determined from them using hashing
Chr
isti
an B
öhm
117150
Partition Based Spatial Merge JoinPartition Based Spatial Merge Join
• Duplicate pairs can be generated duplicate elimination by sorting according to (OIDR, OIDS)
• Initial number of partitions determined: (|R| + |S|) size_pt / memsizeThis formula does not take into account:- replication- data skew
Chr
isti
an B
öhm
119150
Approaches Using Space Filling CurvesApproaches Using Space Filling Curves
• Space filling curves recur- sively decompose the data space in uniform pieces
• Various different orders:
Chr
isti
an B
öhm
120150
Approaches Using Space Filling CurvesApproaches Using Space Filling Curves
• Efficient filter for the join:Objects in different cells cannot intersect each other Sort-merge-join e.g. on Z-order
• Problem:Object may cross grid lines- either decompose object (redundant)- or assign to containing cell
Chr
isti
an B
öhm
121150
Approaches Using Space Filling CurvesApproaches Using Space Filling Curves
• If all cells have uniform size: Equi-join on grid cell numbers (bit strings)
• If cells have varying size: Bit strings of varying length
• Objects may intersect ...- if bitstr (r) is prefix of bitstr (s)- or bitstr (s) is prefix of bitstr (r)
Chr
isti
an B
öhm
122150
Orenstein‘s Spatial JoinOrenstein‘s Spatial Join
[Orenstein: An Algorithm for Computing the Overlay of k-Dim. Spaces, SSD 1991]
• Allows (limited) redundancy, object decompos.• Algorithm:
- Objects are decomposed- Partial objects are ordered according to the
lexicographical order of the bit strings- Objects are accessed in sort-merge like fashion- Two stacks are maintained to keep track of the
prefix objects of R and S.
Chr
isti
an B
öhm
123150
Orenstein‘s Spatial JoinOrenstein‘s Spatial Join
• Stacks for prefix objects:
Chr
isti
an B
öhm
124150
Orenstein‘s Spatial JoinOrenstein‘s Spatial Join
• Mergesort principle:From the two files, read the next element which is smaller according to the lexicographical order
• The stacks are updated:Discard anything thats not a prefix of new string
• The new object is compared to every object on the other stack
Chr
isti
an B
öhm
125150
Orenstein‘s Spatial JoinOrenstein‘s Spatial Join
• Controlling redundancy:- Allowing no redundancy:
Many objects approximated by empty string- Decomposing every object until basis resolution
No manageable set of objects
• 2 Methods for controlling redundancy:- Size-bound: Given a max. number of partial objects- Error-bound: Given a max. error volume of appx.
Chr
isti
an B
öhm
126150
Multidimensional Spatial JoinMultidimensional Spatial Join
[Koudas, Sevcik: High-Dimensional Similarity Joins, ICDE 1997, Best Paper Award]
• No redundancy allowed at all• Instead of stacks:
Separate level files for different bitstring length• Problems with no redundancy:
- With increasing dimension: increasing - Increasing chance that object intersects one of the
primary decomposition lines approx. by < >
Chr
isti
an B
öhm
128150
Epsilon Grid OrderEpsilon Grid Order
[Böhm, Braunmüller, Krebs, Kriegel:
Epsilon Grid Order, SIGMOD Conf. 2001]
• Motivation like -kdB-tree:Based on grid with grid line distance
• Possible join mates restricted to 3d cells
• Here no tree structure but sort order of points based on lexicographical order of the grid cells
Chr
isti
an B
öhm
130150
Epsilon Grid OrderEpsilon Grid Order
• A simple exclusion test (used for I/O):A point q with orcannot be join mate of point p or any point beyond p (with respect to epsilon grid order)
• The interval between p[,...,]T and p+[,...,]T is called -interval
Chr
isti
an B
öhm
131150
Epsilon Grid OrderEpsilon Grid Order
• Sort file and decompose it into I/O units
Chr
isti
an B
öhm
134150
Closest Pair QueriesClosest Pair Queries
[Hjaltason, Samet: Incremental Distance Join Algorithms for Spatial DB, SIGMOD Conf. 1998]
• For both point objects and spatial objects• Find k objects with least distance
• Basis algorithm* for nearest neighbor search extended to take point pairs into account
* [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
Chr
isti
an B
öhm
135150
Basis Algorithm for NN SearchBasis Algorithm for NN Search
Active Page List:rootp2 | p1 | p4 | p3p1 | p4 | p24 | p3 | p23 | p21 | p22p14 | p4 | p24 | p3 | p12 | p23 | p13 | p21 | p22
1 2 3 4
11 12 14 2213 21 24 3223 31 33 41 4434 4342
Chr
isti
an B
öhm
136150
Hjaltason/Samet: Closest Pair QueriesHjaltason/Samet: Closest Pair Queries
• Nearest Neighbor Closest Pair Query• k result points k point pairs• active page list list of active page pairs• initialization root pair (rootR, rootS)
• distance point/query distance of point pair• mindist page/query mindist betw. page
pair
Chr
isti
an B
öhm
137150
Hjaltason/Samet: Closest Pair QueriesHjaltason/Samet: Closest Pair Queries
Active Page List:(root,root)(root,p1)|(root,p2)|(root,p3)|(root,p4)
1 2 3 4
Chr
isti
an B
öhm
138150
Hjaltason/Samet: Closest Pair QueriesHjaltason/Samet: Closest Pair Queries
• Unidirectional node expansion:Given a pair (ri,sj) only one node is expanded
• Closest pair ranking:Incremental version of k-closest pair queries stopping criterion is validation of next pair
• k-nearest neighbor join:Runs a closest pair ranking and filters out the (k+1)st occurrence (and more) of each point of R
Chr
isti
an B
öhm
139150
Hjaltason/Samet: Closest Pair QueriesHjaltason/Samet: Closest Pair Queries
• Two strategies for tie breaks (same distance):- Depth-first- Breadth first
• Three policies for tree traversal- Basic (one tree determines priority)- Even (priority to node with shallower depth)- Simultaneous (all possible pairs are candidates for
traversal)
Chr
isti
an B
öhm
140150
Alternative ApproachesAlternative Approaches
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
• Various improvements and optimizations- Bidirectional node expansion
- Plane sweep technique for bidirectional node exp.- Adaptive multi-stage algorithm
• Aggressive pruning using estimated distances
(root,root) (p1,p3) | (p2, p3) | (p2, p4) | (p1, p2) | (p3, p4) | (p1, p4)
Chr
isti
an B
öhm
141150
Alternative ApproachesAlternative Approaches
[Corral, Manolopoulos, Theodoridis,
Vassilakopoulos: Closest Pair Queries in
Spatial Databases, SIGMOD Conf. 2000]
• 5 different algorithms for closest point queries- Naive: Depth-first traversal of the two R-trees
recursive call for each child pair (ri,sj) of (r,s)
- Exhaustive: like naive but prune page pairs the mindist of which exceeds the current k-CP-dist
- Simple recursive: addit. prune using minmaxdist
maxdistm
inmaxdist
mindist
Chr
isti
an B
öhm
142150
Alternative ApproachesAlternative Approaches
• 5 different algorithms (...)- Sorted distances recursive:
Before descending sort childpairs acc. to their mindist fast get good distance for pruning. Analogous to[Roussopoulos, Kelley, Vincent: Nearest Neighbor Queries. SIGMOD Conf. 1995]
- Heap algorithm:Similar to the algorithm by Hjaltason & Sametwith some minor differences
• New strategies for ties and different tree height
maxdist
minm
axdist
mindist
Chr
isti
an B
öhm
143150
Modeling and OptimizationModeling and Optimization
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, Wednesday, 1630]
Mating probability of index pages: Probability that distance between two pages Two-fold application of Minkowski sum
Chr
isti
an B
öhm
144150
Modeling and OptimizationModeling and Optimization
• I/O cost:• High const. cost per page• Large capacity optimum
• CPU cost:• Low const. cost per page• Low capacity optimum
CPU-performance like CPU optimized index
I/O- performance like I/O optimized index
Chr
isti
an B
öhm
145150
Plane Sweep OptimizationPlane Sweep Optimization
[Brinkhoff, Kriegel, Seeger: Efficient Process. of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
For the directory in the R-tree spatial join (RSJ):- Avoid computation of all C2 box overlaps/distances- Sort boxes according to lower x-coordinates- Plane sweep to
determine the box pairs:- Hold all rectangles inter-
sected by sweep planein the status structure
Sweep plane
Chr
isti
an B
öhm
146150
Plane Sweep OptimizationPlane Sweep Optimization
[Arge, Procopiuc, Ramaswamy, Suel, Vitter: Scalable Sweeping Based Spatial Join, VLDB 1998]
• A plane sweep algorithm for the spatial join- Partition space into k stripes
at most 2N/k objects start/end in each stripe- Rectangle contained in a single strip is called small- Other rectangles decomposed: start, end, centerpiece- Recursive determination of intersections for start-
and endpieces and small rectangles
• Optimum complexity O(n log n + |R S|)
Chr
isti
an B
öhm
147150
Plane Sweep OptimizationPlane Sweep Optimization
[Böhm, Krebs, Kriegel: Optimal Dimension Sweeping: A Generic Technique, submitted for pub.]
• Reduction of the computational cost of point-distances• Most important cost factor for all similairty join algorithms
• Plane-sweep or also sort-merge method:• Sort points on both pages according to a selected dimension• Many point pairs can be excluded beforehand
• Crucial: Dimension• Distance or overlap• Extent of the pages• Probability model
Chr
isti
an B
öhm
149150
SummarySummary
• Similarity join is a powerful database primitive• Supports many new applications of
- Data mining- Data analysis
• Considerable performance improvements
Chr
isti
an B
öhm
150150
SummarySummary
• Many different algorithms for the similarity join- Most for the distance range join ( join)- Some approaches for closest pair queries
- Important operation of nearest neighbor join has almost not been considered yet
• All 3 types of join have different applications• Comparison of different join algorithms:
- Mostly a competition for speed
Chr
isti
an B
öhm
151150
SummarySummary
• Only few other advantages/disadvantages:- Scalability:
• MSJ and -kdB-tree have high main memory requirements in high-dimensional spaces
- Existence of an index:• Actually no matter because R-trees can be fast
constructed bottom-up. Construction time often much less than join time
• Even if preconstructed indexes exist:Approaches based on sorting often better
- No good criteria known for algorithm selection
Chr
isti
an B
öhm
152150
Future Research DirectionsFuture Research Directions
• Applications:- Many standard data mining methods accelerable:
• Outlier detection• Various clustering algorithms (e.g. obstacle clustering)• Hough transformation and similar analysis methods• ...
- New data mining methods will become feasable:• Subspace clustering & correlation detection• Methods may become interactive• ...
Chr
isti
an B
öhm
153150
Future Research DirectionsFuture Research Directions
• Algorithms- Sufficient research for join and closest pair query- Almost no convincing approaches for the k-NN-join
Important database primitive for many applications- Parallel Algorithms- Non-vector metric data (e.g. text mining)- Approximative join algorithms
• Similarity search: Approximative search often sufficient• Join performance could be considerably improved
- ...
Chr
isti
an B
öhm
154150
Future Research DirectionsFuture Research Directions
• Optimization of various critical parameters- Dimension- Replication - Index scan strategies- ...
Chr
isti
an B
öhm
156150
Comparison with Multiple QueriesComparison with Multiple Queries
0
10000
20000
30000
40000
50000
60000
70000
0,00 0,05 0,10 0,15 0,20epsilon
run
tim
e [s
ec] SQ-DBSCAN (X-tree)
MQ-DBSCAN (Scan)
MQ-DBSCAN
J-DBSCAN (X-tree)
Chr
isti
an B
öhm
157150
Experimente: SeitenkapazitätExperimente: Seitenkapazität
100
1000
10000
100000
1000000
0 2000 4000 6000 8000 10000
page capacity
run
tim
e [
sec]
100
1000
10000
100000
1000000
0 100 200 300page capacity
run
tim
e [
sec]
Q-DBSCAN (Seq. Scan)
Q-DBSCAN (R*-tree)
Q-DBSCAN (X-tree)
J-DBSCAN (R*-tree)
J-DBSCAN (X-tree)
Meteorology data9-dimensional
Color image data64-dimensional
Chr
isti
an B
öhm
158150
Experimente: AnfrageregionExperimente: Anfrageregion
0
10000
20000
30000
40000
50000
60000
70000
0,00 0,05 0,10 0,15 0,20 0,25
epsilon
run
tim
e [
sec]
Querybased (X-tree)
Joinbased (X-tree)
0
20000
40000
60000
80000
100000
120000
140000
0,1 0,15 0,2 0,25 0,3
epsilonru
nti
me
[se
c]
Q-OPTICS (Seq. Scan)Q-OPTICS (X-tree)J-OPTICS (X-tree)
Color image data Color image data
Q-DBSCAN (X-tree)J-DBSCAN (X-tree)
Chr
isti
an B
öhm
159150
Experimente: Künstliche DatenExperimente: Künstliche Daten
4d-UNIFORM 8d-UNIFORM 8d-UNIFORM
Chr
isti
an B
öhm
160150
Future WorkFuture Work
Weitere KDD-Algorithmen auf Join abstützen Z.B. Outlier Detection Subspace Clustering, Ermittlung von Korrelationen Interaktivität
Neue Algorithmen für den Similarity Join Nutzung des Optimierungspotentials (Dimension,...) Parallelisierung Approximative Join-Bearbeitung „k-nearest-neighbor Joins“ und „k-best-pair Joins“
Chr
isti
an B
öhm
163150
KDD Algorithms Based on Similarity QueriesKDD Algorithms Based on Similarity Queries
DBSCAN
OPTICS
....
LOF
Dist.Based
Outliers
....
Simultan.Nearest
NeighborClassific.
....
SpatialTrend
Detect.
SpatialAssoc.Rules
Chr
isti
an B
öhm
164150
Curse of DimensionalityCurse of Dimensionality
Cost model opens optimization potential Optimization of the page capacity (# points)
[Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index, EDBT 2000]
Optimized index compression[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An Index Compression Technique for High-Dimensional Spaces, ICDE 2000]
Optimized dimension assignment[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal Multidimensional Query Processing Using Tree Striping, DaWaK 2000]