Query Sampling Based High Dimensional Hybrid Index. Junqi Zhang, Xiangdong Zhou, Fudan University.


Nearest Neighbors Query

Dims Overlap Accessed

Query cover area

Cluster Partitioning Based B+-tree

Cluster i

Core Sub-cluster i Marginal Sub-cluster i

Cluster splitting

Index Structure

[Figure: index structure. Labels Oi; leaf nodes of the B+-tree holding core sub-cluster 1 ... core sub-cluster i; marginal sub-cluster 1 ... marginal sub-cluster i; query cover area.]

Index key: C × 1, C × 2, ..., C × i, C × (i+1), where C is a hash factor

What is the optimal extent to partition?

iDistance: by experiments

Ours: predicted by a cost model

Objective of cluster partitioning: lowest query cost

Appropriate M:

Distribute M to each cluster

Overall number of clusters:

[Equations lost in extraction. Surviving symbols: arg min, Nodes, KNN, Q, M, N, H, u, c, and a subscript "opt".]

Dimension Curse

dim > 10: tree < scan < VA-file

dim < 10: tree > scan > VA-file

Non-uniform: tree VA-file

VA-file defect

How to improve tree performance?

Tree and scan: which is better?

Tree

advantage: filters data instead of linearly scanning the whole file

disadvantage: the positioning cost for each data item is the height of the intermediate nodes, which is higher than for a scan

Scan

advantage: the positioning cost for each data item is 0

disadvantage: must linearly scan the whole file

Cost as viewed from each point

$C = \frac{\text{average } t \text{ on tree}}{\text{average } t \text{ by linear scan}}$

(C < 1): tree is useful compared with scan

(C >= 1): tree is useless compared with scan
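As a rough illustration of the ratio on this slide, the sketch below computes C from two measured average per-query times and applies the C < 1 test. The function names and the idea of feeding in measured timings are assumptions for illustration; the slides do not say how the averages are obtained.

```python
def cost_ratio(avg_t_tree: float, avg_t_scan: float) -> float:
    """C = (average t on tree) / (average t by linear scan)."""
    if avg_t_scan <= 0:
        raise ValueError("average scan time must be positive")
    return avg_t_tree / avg_t_scan


def tree_is_useful(avg_t_tree: float, avg_t_scan: float) -> bool:
    """The tree pays off only when C < 1; at C >= 1 a scan is no worse."""
    return cost_ratio(avg_t_tree, avg_t_scan) < 1.0
```

For example, a tree that averages 2 time units per query against a scan that averages 4 gives C = 0.5, so the tree is useful.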

Data distribution and index performance

Known work: index data in a single index (DIMS, tree)

Real image data set: non-uniform

Non-uniform data aggregate tree

Data type

Sparse data: tree < scan

Dense data: tree > scan

Hybrid data type: hybrid index

Hybrid index

Sparse data (tree < scan): sequential file

Dense data (tree > scan): B+-tree

How to differentiate data type?

Each data item as a unit: difficult

Each cluster ring as a unit: easier

Cluster partitioning

[Figure: cluster split into an inner circle, a middle circle and an outer circle]

To what extent?

[Equation lost in extraction. Surviving symbols: arg min, Nodes, KNN, Q, M, N, H, u, c.]

Cluster partitioning based B+-tree

[Figure: leaf nodes of the B+-tree with labels O11, O12, O13, ..., Oi2; marginal data file; query cover area.]

Index key: c × 1, c × 2, ..., c × i, c × (i+1), where c is a hash factor
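The slide shows only the key boundaries c × 1, c × 2, ..., c × i, c × (i+1), not the key formula itself. The sketch below assumes an iDistance-style mapping, in which a point in cluster i is keyed by i × c plus its distance to that cluster's reference point, so that each cluster occupies its own key interval. The function name, the choice of Euclidean distance, and the check that the distance stays below c are assumptions, not taken from the slides.

```python
import math
from typing import Sequence


def index_key(cluster_id: int, point: Sequence[float],
              center: Sequence[float], c: float) -> float:
    """Map a point to a one-dimensional B+-tree key (assumed iDistance-style).

    Keys of cluster i fall in [c * i, c * (i + 1)), provided every point
    lies closer than c to its cluster's reference point.
    """
    dist = math.dist(point, center)  # Euclidean distance (an assumption)
    if dist >= c:
        raise ValueError("hash factor c must exceed the cluster radius")
    return cluster_id * c + dist
```

With c = 100, a point at distance 5 from the reference point of cluster 2 gets key 205, which lies between c × 2 and c × 3.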

Cluster partitioning based image retrieval system

Outer rings of clusters are often accessed

Some rings of clusters are often accessed

Treat outer rings as sparse rings?

Frequency of being accessed for each ring

Hybrid index: cut branches (according to the contribution of each ring to the query cost)

Expected cost of ring $c_i$ in the tree: $\left(\frac{N_{c_i}}{u} + H\right) P(c_i)$

Cost by linear scan: $\frac{N_{c_i}}{b}$


Criterion for cutting rings: index capability (IC)

$IC_i$ is built from the expected tree cost of ring $c_i$, $\left(\frac{N_{c_i}}{u} + H\right) P(c_i)$

Question: how to determine $P(c_i)$?

Estimate: query sampling

Question: for a large database, a large number of queries is expensive to run

Objective: given confidence a%, keep the number of sampled queries to a minimum

$P(c_i) = \frac{\text{times ring } c_i \text{ is accessed}}{\text{number of queries } Q}$
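The estimate of P(ci) is a plain relative frequency, which can be sketched as follows. The input format (one collection of accessed ring identifiers per sampled query) and the function name are assumptions for illustration; the slides do not describe how the access log is stored.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def estimate_access_probability(
    accessed_rings_per_query: Sequence[Iterable[Hashable]],
) -> Dict[Hashable, float]:
    """P(ci) = (times ring ci is accessed) / (number of queries Q)."""
    q = len(accessed_rings_per_query)
    if q == 0:
        raise ValueError("at least one sampled query is required")
    counts: Counter = Counter()
    for rings in accessed_rings_per_query:
        # A ring counts at most once per query, however often it was touched.
        counts.update(set(rings))
    return {ring: n / q for ring, n in counts.items()}
```

Rings that no sampled query touched do not appear in the result; their estimated probability is 0.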

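Threshold for cutting rings

When IC equals 0, the expected tree cost of ring $c_i$ equals its linear scan cost:

$\left(\frac{N_{c_i}}{u} + H\right) P(c_i) = \frac{N_{c_i}}{b}$

$P(c_i) = \frac{N_{c_i}/b}{N_{c_i}/u + H} = \frac{u N_{c_i}}{b N_{c_i} + H u b}$

Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise it should be cut into the sequence file for linear scan.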
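The cut rule can be checked numerically. This is a minimal sketch, assuming the threshold at IC = 0 is u·N_ci / (b·N_ci + H·u·b), which is what the symbols surviving on the slide suggest. The function and parameter names are hypothetical, and the slides do not define u, b or H, so the code treats them as plain positive numbers.

```python
def cut_threshold(n_ci: float, u: float, b: float, h: float) -> float:
    """Access probability at which tree cost and scan cost break even.

    Solves (n_ci / u + h) * p = n_ci / b for p.
    """
    if min(n_ci, u, b) <= 0 or h < 0:
        raise ValueError("n_ci, u, b must be positive and h non-negative")
    return (u * n_ci) / (b * n_ci + h * u * b)


def keep_in_tree(p_ci: float, n_ci: float, u: float, b: float, h: float) -> bool:
    """Rule from the slide: a ring stays in the tree while P(ci) is below
    the threshold; otherwise it is cut into the sequence file."""
    return p_ci < cut_threshold(n_ci, u, b, h)
```

With the made-up values N_ci = 1000, u = 50, b = 100 and H = 3, the threshold is 10/23, roughly 0.43, so a ring accessed by 20% of queries would stay in the tree and one accessed by 90% would be cut.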

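Query sampling algorithm: stop sampling when

$P(c_i) + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} < \frac{u N_{c_i}}{b N_{c_i} + H u b}$

or

$P(c_i) - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} > \frac{u N_{c_i}}{b N_{c_i} + H u b}$

or

[condition on $error_i$ and 0; the relation did not survive extraction]

The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than N.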
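The stopping rule appears to compare a confidence interval around the estimated P(ci) with the cut threshold and to stop as soon as the whole interval falls on one side. The sketch below illustrates that idea under stated assumptions: it uses the normal quantile from `statistics.NormalDist` in place of the t quantile, it ignores the third (error) condition, and the function name, `min_n` and `max_n` are hypothetical.

```python
import math
from itertools import cycle
from statistics import NormalDist


def sample_until_decided(accessed, threshold, confidence=0.95,
                         min_n=30, max_n=10_000):
    """Sample queries until the interval around P(ci) clears the threshold.

    accessed: callable returning True if one sampled query touches the ring.
    Returns (decision, estimate, number of sampled queries).
    """
    # Normal quantile as a stand-in for the t quantile (an approximation).
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    hits = 0
    for n in range(1, max_n + 1):
        if accessed():
            hits += 1
        if n < max(min_n, 2):
            continue
        p = hits / n
        # Sample standard deviation S of the 0/1 access indicators.
        s = math.sqrt(p * (1.0 - p) * n / (n - 1))
        half_width = z * s / math.sqrt(n)
        if p + half_width < threshold:
            return "tree", p, n      # confidently below: keep in the tree
        if p - half_width > threshold:
            return "scan", p, n      # confidently above: cut to the file
    p = hits / max_n
    return ("tree" if p < threshold else "scan"), p, max_n


# Usage: a ring touched by every tenth query, against a threshold of 10/23.
pattern = cycle([True] + [False] * 9)
decision, p_hat, n_used = sample_until_decided(lambda: next(pattern), 10 / 23)
```

Raising `confidence` widens the interval and so tends to require more sampled queries before a decision, which is the accuracy versus efficiency trade-off the slide mentions.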

Query algorithm of hybrid index

Linearly scan the sequence file for the sparse data

Retrieve the dense data from the B+-tree
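The two steps of the query algorithm can be sketched as below. This is an illustration only: a sorted Python list searched with `bisect` stands in for the B+-tree, a plain list stands in for the sequence file, and the query is a one-dimensional key range rather than a full nearest-neighbor search. The class and method names are hypothetical.

```python
import bisect
from typing import Any, List, Tuple

Record = Tuple[float, Any]  # (index key, payload)


class HybridIndex:
    def __init__(self, dense: List[Record], sparse: List[Record]):
        # Dense data: kept sorted by key (stand-in for B+-tree leaves).
        self._dense = sorted(dense, key=lambda kv: kv[0])
        self._keys = [k for k, _ in self._dense]
        # Sparse data: unordered sequence file, always scanned in full.
        self._sparse = list(sparse)

    def range_query(self, lo: float, hi: float) -> List[Record]:
        # Step 1: retrieve the dense data through the ordered index.
        left = bisect.bisect_left(self._keys, lo)
        right = bisect.bisect_right(self._keys, hi)
        from_tree = self._dense[left:right]
        # Step 2: linearly scan the sequence file for the sparse data.
        from_scan = [(k, v) for k, v in self._sparse if lo <= k <= hi]
        return from_tree + from_scan
```

The ordered lookup touches only the matching slice of the dense data, while the sequence file is read end to end on every query, which is why only rarely useful rings belong there.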

Thanks!
