Query Sampling Based High Dimensional Hybrid Index. Junqi Zhang, Xiangdong Zhou, Fudan University.

Post on 20-Jan-2018


1

Query Sampling Based High Dimensional Hybrid Index

Junqi Zhang, Xiangdong Zhou, Fudan University

2

Nearest Neighbors Query

Dims, Overlap, Accessed

[Figure: two panels showing a query Q and its query cover area relative to clusters O1, O2 and O3; labels P1, r1 and r2.]

3

Cluster Partitioning Based B+-tree

[Figure: cluster splitting. Cluster i, centered at Oi, is split into core sub-cluster i and marginal sub-cluster i; labels r and rc.]

4

Index Structure

[Figure: leaf nodes of the B+-tree, with core sub-clusters 1 to i and marginal sub-clusters 1 to i centered at Oi, and a query Q with its query cover area.]

Index key: C × 1, C × 2, ..., C × i, C × (i+1), where C is a hash factor

What is the optimal extent of partitioning?

iDistance: determined by experiments

Ours: predicted by a cost model
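The index key layout on this slide can be illustrated with a small sketch. The slide only states that keys are spaced by a hash factor C (C × 1, C × 2, ...); placing a point inside its cluster's key interval by adding its distance to the cluster center is an assumption borrowed from iDistance-style mappings, and the function name `index_key` is hypothetical.

```python
import math

def index_key(point, centers, C):
    """Map a point to (cluster number, one-dimensional key).

    Assumption: key = C * i + dist(point, O_i), with clusters numbered from 1
    as on the slide. C must be larger than any distance to a cluster center,
    so that the key intervals of different clusters do not overlap.
    """
    dists = [math.dist(point, o) for o in centers]
    nearest = min(range(len(centers)), key=dists.__getitem__)
    if dists[nearest] >= C:
        raise ValueError("hash factor C must exceed every cluster radius")
    i = nearest + 1  # 1-based cluster number
    return i, C * i + dists[nearest]

centers = [(0.0, 0.0), (10.0, 0.0)]
cluster, key = index_key((9.0, 0.0), centers, C=100.0)
print(cluster, key)
```

With this mapping, all points of cluster i fall into the key interval [C × i, C × (i+1)), so a one-dimensional B+-tree can hold every cluster side by side.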

5

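Objective of Cluster Partitioning: Lowest Query Cost

Appropriate M:

[Equation not recoverable: an arg min over M of an expression in Nodes, KNN, Q, M, N, N_c, H and u.]

Distribute M to each cluster

Overall number of clusters:

[Equation not recoverable: an "opt" quantity expressed in N, H and u with a factor 2.]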

6

Dimension Curse

dim > 10: tree < scan < VA-file

dim < 10: tree > scan > VA-file

Non-uniform: tree VA-file

VA-file defect

How to improve tree performance?

7

Tree and scan: which is better?

Tree

advantage: filters data instead of linearly scanning the whole file

disadvantage: the positioning cost for each data item is the height of the intermediate nodes, which is higher than for a scan

Scan

advantage: the positioning cost for each data item is 0

disadvantage: linearly scans the whole file

8

Cost viewed from each point

$$C = \frac{\text{average cost on tree}}{\text{average cost by linear scan}}$$

C < 1: tree is useful compared with scan

C >= 1: tree is useless compared with scan
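The comparison on this slide amounts to a single ratio test. The sketch below assumes the two average costs have already been measured in the same unit (for example page accesses per query); the function names are hypothetical, and the ratio C here is unrelated to the hash factor C used for index keys.

```python
def cost_ratio(avg_cost_tree, avg_cost_scan):
    """C = average cost on tree / average cost by linear scan."""
    if avg_cost_scan <= 0:
        raise ValueError("average scan cost must be positive")
    return avg_cost_tree / avg_cost_scan

def tree_is_useful(avg_cost_tree, avg_cost_scan):
    """The tree pays off only when C < 1."""
    return cost_ratio(avg_cost_tree, avg_cost_scan) < 1

print(cost_ratio(2.0, 4.0), tree_is_useful(2.0, 4.0))
print(cost_ratio(6.0, 4.0), tree_is_useful(6.0, 4.0))
```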

9

Data distribution and index performance

Known work: index data in a single index

DIMS tree

Real image data set: non-uniform

Non-uniform data aggregate tree

FAST

10

Data type

Sparse data: tree < scan

Dense data: tree > scan

11

Hybrid data type: hybrid index

Sparse data (tree < scan): sequential file

Dense data (tree > scan): B+-tree

12

How to differentiate data type?

Each data item as a unit: difficult

Each cluster ring as a unit: easier

13

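Cluster partitioning

[Figure: cluster split into an inner circle, a middle circle and an outer circle; labels O1, O2, O3, r, r2, r3 and query points Q.]

What extent?

[Equation not recoverable: an arg min over M of an expression in Nodes, KNN, Q, M, N, N_c, H and u.]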

14

Cluster partitioning based B+-tree

[Figure: leaf nodes of the B+-tree holding rings O11, O12, O13, ..., Oi1, Oi2 under index keys c × 1, c × 2, c × 3, ..., c × i, c × (i+1), where c is a hash factor; a marginal data file; a query Q with its query cover area.]

15

Cluster partitioning based image retrieval system

Outer rings of clusters are often accessed

16

Some rings of clusters are often accessed

Treat outer rings as sparse rings?

17

Frequency of each ring being accessed

18

19

Hybrid index: cut branches (according to the contribution of each ring to the query cost)

Expected cost: $P(c_i)\left(H + \frac{N_{c_i}}{u}\right)$

Cost by linear scan: $\frac{N_{c_i}}{b}$

[Figure: leaf nodes of the B+-tree holding rings O11, O12, O13, ..., Oi1, Oi2 under index keys C × 1, C × 2, C × 3, ..., C × i, C × (i+1), where C is a hash factor; a marginal data file; a query Q with its query cover area.]

20

Standard of rings being cut: Index Capability

IC (index capability):

$$IC_i = \frac{N_{c_i}}{b} - P(c_i)\left(H + \frac{N_{c_i}}{u}\right)$$

Question: how to determine $P(c_i)$?

21

Estimate $P(c_i)$: query sampling

$$P(c_i) = \frac{\text{times of ring } c_i \text{ being accessed}}{\text{number of queries } Q}$$

Question: for a large database, running a large number of queries is expensive

Objective: given confidence a%, make Q minimum
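The estimate on this slide is a plain relative frequency, which can be sketched as follows. The input format (one set of accessed ring identifiers per sampled query) and the function name are assumptions made for illustration.

```python
from collections import Counter

def estimate_access_probability(accessed_rings_per_query, ring_ids):
    """Estimate P(c_i) as (times ring c_i was accessed) / (number of queries).

    accessed_rings_per_query: one collection of ring ids per sampled query.
    ring_ids: every ring in the index, so rings never accessed get 0.0.
    """
    n_queries = len(accessed_rings_per_query)
    if n_queries == 0:
        raise ValueError("at least one sampled query is required")
    counts = Counter()
    for rings in accessed_rings_per_query:
        counts.update(set(rings))  # count each ring at most once per query
    return {ring: counts[ring] / n_queries for ring in ring_ids}

sample = [{"c1", "c2"}, {"c2"}, {"c2", "c3"}, set()]
print(estimate_access_probability(sample, ["c1", "c2", "c3", "c4"]))
```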

22

Threshold of rings being cut

When IC equals 0:

$$IC_i = \frac{N_{c_i}}{b} - P(c_i)\left(H + \frac{N_{c_i}}{u}\right) = 0$$

$$P(c_i)\left(H + \frac{N_{c_i}}{u}\right) = \frac{N_{c_i}}{b} \quad\Rightarrow\quad P(c_i) = \frac{uN_{c_i}}{buH + bN_{c_i}}$$

Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise, it should be cut into the sequence file for linear scan.
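The index capability and its zero-crossing threshold can be checked numerically. The sketch below treats H, u and b as given numeric parameters exactly as they appear in the formulas; reading H as the tree height and u, b as per-page capacities for the tree and the sequence file is an assumption, as are the function names and the sample values.

```python
def index_capability(p, n_ci, H, u, b):
    """IC_i = N_ci / b - P(c_i) * (H + N_ci / u)."""
    return n_ci / b - p * (H + n_ci / u)

def cut_threshold(n_ci, H, u, b):
    """Access probability at which IC_i = 0: u*N_ci / (b*u*H + b*N_ci)."""
    return (u * n_ci) / (b * u * H + b * n_ci)

def keep_in_tree(p, n_ci, H, u, b):
    """Rule from the slide: a ring stays in the tree while P(c_i) is below the threshold."""
    return p < cut_threshold(n_ci, H, u, b)

# Illustrative values only (not taken from the slides).
n_ci, H, u, b = 1000, 3, 50, 100
threshold = cut_threshold(n_ci, H, u, b)
print(round(threshold, 4))
print(keep_in_tree(0.2, n_ci, H, u, b), keep_in_tree(0.6, n_ci, H, u, b))
```

A ring whose access probability sits below the threshold has positive IC, meaning the expected tree cost is lower than the cost of always scanning it.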

23

Query sampling algorithm:

When

$$P(c_i) + t_{a/2}(n-1)\frac{S_i}{\sqrt{n}} < \frac{uN_{c_i}}{buH + bN_{c_i}}$$

or

$$P(c_i) - t_{a/2}(n-1)\frac{S_i}{\sqrt{n}} > \frac{uN_{c_i}}{buH + bN_{c_i}}$$

or

$$error_i = 0$$

stop sampling.

The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than N.
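A confidence-interval stopping test for one ring might look like the sketch below. It is not the authors' algorithm: it swaps the Student t quantile for a normal quantile (available in the Python standard library), covers only the two interval conditions, and adds a hypothetical `min_samples` guard so that a handful of identical observations cannot end sampling early.

```python
import math
from statistics import NormalDist

def sampling_decision(hits, n, threshold, confidence=0.95, min_samples=30):
    """Decide the fate of one ring after n sampled queries.

    hits: number of sampled queries that accessed the ring.
    Returns "keep" (stay in the tree), "cut" (move to the sequence file),
    or None when more samples are needed.
    """
    if n < max(2, min_samples):
        return None
    p_hat = hits / n
    # Sample standard deviation of the 0/1 access indicators.
    s = math.sqrt(p_hat * (1.0 - p_hat) * n / (n - 1))
    # Normal approximation used in place of t_{a/2}(n-1).
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * s / math.sqrt(n)
    if p_hat + half_width < threshold:
        return "keep"  # whole interval lies below the threshold
    if p_hat - half_width > threshold:
        return "cut"   # whole interval lies above the threshold
    return None

print(sampling_decision(5, 100, 0.4))
print(sampling_decision(90, 100, 0.4))
print(sampling_decision(16, 40, 0.4))
```

Raising `confidence` widens the interval and therefore needs more sampled queries before a ring is classified, which is the accuracy versus efficiency trade-off mentioned on the slide.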

24

Query algorithm of hybrid index

Linearly scan the sequence file for sparse data

Retrieve the dense data on the B+-tree
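The two steps above can be sketched with a toy in-memory structure. Everything here is an assumption for illustration: a sorted key list stands in for the B+-tree leaf level, a plain list stands in for the sequence file, keys follow an iDistance-style mapping (C × i plus distance to the cluster center), and a range query is shown as the building block instead of a full k-nearest-neighbor search.

```python
import bisect
import math

class HybridIndex:
    def __init__(self, centers, C):
        self.centers = centers  # cluster centers O_1 .. O_k
        self.C = C              # hash factor, larger than any cluster radius
        self.keys = []          # sorted keys (stand-in for B+-tree leaves)
        self.points = []        # dense points, aligned with self.keys
        self.seq_file = []      # sparse points, always scanned linearly

    def _key(self, point):
        dists = [math.dist(point, o) for o in self.centers]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        return self.C * (nearest + 1) + dists[nearest]

    def insert(self, point, sparse):
        if sparse:
            self.seq_file.append(point)
            return
        key = self._key(point)
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.points.insert(pos, point)

    def range_query(self, q, r):
        # Step 1: linear scan of the sequence file (sparse data).
        result = [p for p in self.seq_file if math.dist(p, q) <= r]
        # Step 2: one key-range lookup per cluster (dense data).
        for i, center in enumerate(self.centers, start=1):
            d = math.dist(q, center)
            low = self.C * i + max(0.0, d - r)
            high = self.C * i + d + r
            left = bisect.bisect_left(self.keys, low)
            right = bisect.bisect_right(self.keys, high)
            # Never read past this cluster's own key interval.
            right = min(right, bisect.bisect_left(self.keys, self.C * (i + 1)))
            result.extend(p for p in self.points[left:right]
                          if math.dist(p, q) <= r)
        return result

index = HybridIndex(centers=[(0.0, 0.0), (10.0, 0.0)], C=100.0)
index.insert((1.0, 0.0), sparse=False)
index.insert((9.0, 0.0), sparse=False)
index.insert((0.0, 5.0), sparse=False)
index.insert((5.0, 0.0), sparse=True)
print(sorted(index.range_query((4.0, 0.0), 3.5)))
```

The key range per cluster follows from the triangle inequality: a point within distance r of the query must lie between d - r and d + r from the cluster center, where d is the query's distance to that center.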

25

Thanks!