Making the Pyramid Technique Robust to Query Types and Workloads
Rui Zhang, Beng Chin Ooi, Kian-Lee Tan
Department of Computer Science
National University of Singapore
Singapore
Outline
• Background
• Existing work and limitations
• Our proposal: The P+-tree
• Experimental results
• Conclusion
Problem & Motivation
Problem:
Indexing multidimensional point data
Applications:
• Low dimension: GIS, CAD, Medical image (X-rays, MRI brain scans)
• High dimension: Image database, Video database, data warehouse
Typical Query Types
• Point Query
• Window Query
[q0min; q0max]; [q1min; q1max]… [qd-1min; qd-1max]
• Range Query
X(x0, x1, … xd-1), r
• K-Nearest Neighbor Query (kNN query)
X(x0, x1, … xd-1), k
Existing work: Four Strategies
• Data partitioning: R-tree family
• Space partitioning: k-d-tree family
• Dimensionality Reduction: mapping
• Data Compression: VA-file, IQ-tree
Existing work: Comparison
• Low-dimensional space
  – The R-tree family structures
• High-dimensional space
  – Window query: the Pyramid tech., the iMinMax
  – kNN query: the IQ-tree, the iDistance
Existing work: Limitations
• Limited to query types
  – The Pyramid tech., the iMinMax: window query
  – The iDistance, the IQ-tree: kNN query
• Limited to certain workloads
  – The Pyramid tech.: hyper-cube shaped window query, located around the center of the data space
Our proposal: the P+-tree
• Based on the Pyramid tech.
• Support both window and kNN queries
• Robust under different workloads
Review of the Pyramid Tech.
i: pyramid number
hv: height, in the i’th (if i < d) or (i-d)’th (if i >= d) dimension
pvv = i + hv
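The pyramid value can be sketched in a few lines. This is a minimal illustration, assuming the point v already lies in the unit hyper-cube [0, 1]^d and that the pyramid is chosen by the dimension in which v is farthest from the center 0.5; the function name pyramid_value is hypothetical.

```python
def pyramid_value(v):
    """Return pv = i + h for a point v in [0, 1]^d (illustrative sketch)."""
    d = len(v)
    # Dimension in which v deviates most from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    # Pyramids 0..d-1 lie on the low side of the center, d..2d-1 on the high side.
    i = j_max if v[j_max] < 0.5 else j_max + d
    # Height: distance from the center in dimension i (if i < d) or i-d (if i >= d).
    h = abs(0.5 - v[i % d])
    return i + h
```

Because h never exceeds 0.5, the values of pyramid i all fall in [i, i + 0.5], so the 2d pyramids occupy disjoint key ranges on the one-dimensional axis.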
Sensitivity to location of query window / data distribution
Sensitivity to shape of query
The P+-tree
• Divide the data space into subspaces
  – Based on clustering
  – Divide in the dimension where two clusters differ most
• Transform the points in each subspace
  – Transform a subspace to the unit hyper-cube, [si min, si max]^d -> [0, 1]^d, so that the pyramid tech. can be applied
  – Move the cluster center to the center of the transformed space (0.5, 0.5, … 0.5), the case in which the pyramid tech. is efficient
Space division and data transformation
Transformation function
• A set of d functions, t0, t1, … td-1
• Requirements:
  – ti is a bijection from [si min, si max] to [0, 1]
  – ti is monotonic
  – ti(ci) = 0.5
• In equations:
  – ti(si min) = 0
  – ti(si max) = 1
  – ti(ci) = 0.5
Transformation function
• ti(x) = (ai x – bi)^ei, i = 0, 1, … d-1
• For subspace [s0 min, s0 max], [s1 min, s1 max], … [sd-1 min, sd-1 max]:
ai = 1/(si max - si min)
bi = si min/(si max - si min)
ei = -1/log2(ai ci - bi)
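As a check on the parameters, the sketch below builds one per-dimension transform with a and b chosen so that t(s min) = 0 and t(s max) = 1, and e chosen so that t(c) = 0.5. It assumes s min < c < s max; the name make_transform and the clamping of the base to [0, 1] (a guard against floating-point round-off) are additions for illustration only.

```python
import math

def make_transform(s_min, s_max, c):
    """Build t(x) = (a*x - b)**e for one dimension of a subspace.

    Assumes s_min < c < s_max, so that 0 < a*c - b < 1.
    """
    a = 1.0 / (s_max - s_min)
    b = s_min / (s_max - s_min)
    e = -1.0 / math.log2(a * c - b)

    def t(x):
        # Clamp to [0, 1] so round-off cannot produce a negative base.
        base = min(max(a * x - b, 0.0), 1.0)
        return base ** e

    return t
```

For example, with s min = 2, s max = 10 and cluster center c = 4, the parameters are a = 0.125, b = 0.25 and e = 0.5, so t(2) = 0, t(4) = 0.5 and t(10) = 1.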
The space-tree
SNo, ai, bi, ei are stored in leaf nodes
Space division algorithm
• Clustering data
• Divide the space into two subspaces in the dimension where the two cluster centers differ most (recursively)
• Build the space-tree
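One division step might look like the sketch below. The split dimension follows the rule above (largest difference between the two cluster centers). The split position is not specified here, so cutting at the midpoint between the two centers is purely an assumption, as are the function names. The clustering step itself is omitted: the two centers c1 and c2 are taken as given.

```python
def split_dimension(c1, c2):
    """Dimension in which the two cluster centers differ most."""
    return max(range(len(c1)), key=lambda j: abs(c1[j] - c2[j]))

def split_subspace(bounds, c1, c2):
    """Split a subspace, given as a list of (lo, hi) pairs, into two.

    Returns (dimension, cut position, lower subspace, upper subspace).
    """
    j = split_dimension(c1, c2)
    # Assumption: cut halfway between the two centers in dimension j.
    cut = (c1[j] + c2[j]) / 2.0
    lo, hi = bounds[j]
    lower = list(bounds)
    lower[j] = (lo, cut)
    upper = list(bounds)
    upper[j] = (cut, hi)
    return j, cut, lower, upper
```

Applying the step recursively to each resulting subspace yields a binary tree whose internal nodes hold (dimension, cut) pairs and whose leaves correspond to subspaces.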
Build the P+-tree
• The P+-tree is in effect a B+-tree that stores the data points in its leaf nodes, with the P+-value as the key
• P+-value: SNo · 2d + pv(v’)
• For a newly inserted point v, traverse the space-tree to determine the subspace it belongs to
• Transform the point v to v’ and calculate its P+-value
• Insert the point v, with its P+-value as key
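The key computation and insertion can be sketched as follows. The sketch reads "2d" as 2 times d, the number of pyramids per subspace, so that each subspace gets its own key range. A sorted Python list stands in for the B+-tree, and the caller is assumed to have already found SNo and the transformed point v’; all names are illustrative.

```python
import bisect

def pyramid_value(v):
    """pv = i + h for a point v in [0, 1]^d."""
    d = len(v)
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    i = j_max if v[j_max] < 0.5 else j_max + d
    return i + abs(0.5 - v[i % d])

def p_plus_value(sno, v_prime):
    """P+-value = SNo * 2d + pv(v'), with v' the transformed point."""
    d = len(v_prime)
    return sno * 2 * d + pyramid_value(v_prime)

def insert(index, sno, v, v_prime):
    """Insert the original point v, keyed by its P+-value.

    `index` is a sorted list of (key, point) pairs standing in for
    the leaf level of a B+-tree.
    """
    bisect.insort(index, (p_plus_value(sno, v_prime), tuple(v)))
```

Since pv(v’) is always below 2d, keys from subspace SNo fall in [SNo · 2d, (SNo + 1) · 2d), and points from different subspaces never interleave in the index.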
Window search algorithm
• Traverse the space-tree to see which subspaces are intersected by the query
• For each intersected subspace, transform the query according to the transformation function for the subspace
• Search the subspace according to the transformed query
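The query transformation for one subspace can be sketched as below. Because each ti is monotonic with ti(si min) = 0 and ti(si max) = 1, an interval maps to the interval between its transformed endpoints. The sketch first clips the query window to the subspace and then transforms the clipped corners. The function name and the list-of-pairs representation are assumptions.

```python
def transform_window(query, bounds, transforms):
    """Clip a window query to a subspace and map it into [0, 1]^d.

    query, bounds: lists of (lo, hi) pairs, one per dimension.
    transforms: one increasing function t_i per dimension.
    Returns the transformed window, or None if the query does not
    intersect the subspace.
    """
    window = []
    for (q_lo, q_hi), (s_lo, s_hi), t in zip(query, bounds, transforms):
        lo, hi = max(q_lo, s_lo), min(q_hi, s_hi)
        if lo > hi:
            return None  # no overlap in this dimension
        window.append((t(lo), t(hi)))
    return window
```

The transformed window can then be answered in that subspace as an ordinary window query in the unit hyper-cube.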
KNN search algorithm
• Start from a small window query
• Gradually increase the side length of the query window until the k nearest neighbors are found
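The expanding-window idea can be sketched as follows. A brute-force scan stands in for the index-based window search, and the stopping rule is an assumption: stop once the k-th nearest candidate lies within half the side length of the window, so that no point outside the window can be closer. The starting size and growth factor are arbitrary illustrative defaults.

```python
import math

def window_search(points, q, half):
    """Points inside the hyper-cube of side 2*half centered at q.

    Brute-force stand-in for a window query on the index.
    """
    return [p for p in points
            if all(abs(pj - qj) <= half for pj, qj in zip(p, q))]

def knn(points, q, k, start=0.05, growth=2.0):
    """k nearest neighbors of q via window queries of increasing size."""
    k = min(k, len(points))
    if k <= 0:
        return []
    half = start
    while True:
        cand = sorted(window_search(points, q, half),
                      key=lambda p: math.dist(p, q))
        # Any point outside the window is farther than `half` from q,
        # so the first k candidates are final once the k-th is this close.
        if len(cand) >= k and math.dist(cand[k - 1], q) <= half:
            return cand[:k]
        half *= growth
```

A smaller growth factor touches fewer extra points per round but needs more rounds; the trade-off depends on the data distribution.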
Experiments: Window Queries
Experiments: Partial Window Queries
Experiments: kNN Queries