Top-k Dominating Queries. DB seminar. Speaker: Ken Yiu. Date: 25/05/2006.

1

Top-k Dominating Queries

DB seminar
Speaker: Ken Yiu
Date: 25/05/2006

2

Outline

Motivations and applications
Background
Skyline-based algorithm
Best-first algorithms
Experimental results
Conclusions

3

Top-k query, skyline query

D: dataset of points in a d-dimensional space

Top-k query: the k points with the lowest F values
  Top-2: p4, p6
  Requires a ranking function
  Result is affected by the scales of the dimensions

Skyline query: the points not dominated by any other point
  p > p' (p dominates p'): $p > p' \iff (\exists i,\ p[i] < p'[i]) \wedge (\forall i,\ p[i] \le p'[i])$
  Skyline: p1, p4, p6, p7
  Result size cannot be controlled
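The two query types above can be sketched in a few lines of Python. This is only an illustration: the slide gives no coordinates, so the values in `POINTS` are hypothetical and were chosen so that the results agree with the ones stated on the slide. The function names are also assumptions.

```python
# Hypothetical coordinates (x = time to venue, y = price); the slide gives none.
# They were picked so that the results match those stated on the slide.
POINTS = {
    "p1": (0.1, 0.8), "p2": (0.5, 0.9), "p3": (0.7, 0.7), "p4": (0.3, 0.3),
    "p5": (0.4, 0.5), "p6": (0.6, 0.2), "p7": (0.9, 0.1),
}


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def top_k_by_sum(points, k):
    """Top-k query with the ranking function F = x + y (lower is better)."""
    return sorted(points, key=lambda name: sum(points[name]))[:k]


def skyline(points):
    """Names of the points that no other point dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )
```

With these assumed coordinates, `top_k_by_sum(POINTS, 2)` yields p4 and p6, and `skyline(POINTS)` yields p1, p4, p6, p7.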

[Figure: points p1 to p7 plotted with x (time to conf. venue) and y (price), both from 0 to 1, and the ranking function F = x + y]

4

Top-k dominating query

Intuitive score function: $\mathrm{score}(p) = |\{ p' \in D : p > p' \}|$
  Property: $\forall p, p' \in D,\ p > p' \Rightarrow \mathrm{score}(p) > \mathrm{score}(p')$

Top-k dominating query: the k points with the highest score values
  Also known as k-dominating query [PTFS05]
  Top-2 dominating points: p4 (3), p5 (2)

Applications: decision support systems, finding the most 'popular' objects

Advantages
  Control of result size
  No need to specify a ranking function
  Result independent of the scales of the dimensions
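As a minimal brute-force sketch of the score definition above, the following counts, for each point, how many other points it dominates and then keeps the k highest counts. The coordinates are hypothetical (the slide gives none) and were chosen to reproduce the stated result, p4 (3) and p5 (2).

```python
# Hypothetical coordinates; the slide gives none.
POINTS = {
    "p1": (0.1, 0.8), "p2": (0.5, 0.9), "p3": (0.7, 0.7), "p4": (0.3, 0.3),
    "p5": (0.4, 0.5), "p6": (0.6, 0.2), "p7": (0.9, 0.1),
}


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p, dataset):
    """Number of points in the dataset that p dominates."""
    return sum(1 for q in dataset if dominates(p, q))


def top_k_dominating(points, k):
    """Brute force: score every point, then keep the k highest scores."""
    scored = [(score(p, points.values()), name) for name, p in points.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken by name
    return [(name, s) for s, name in scored[:k]]
```

No ranking function is passed in, and rescaling one axis would not change the answer, because dominance only compares values within a dimension.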

[Figure: points p1 to p7 plotted with x (time to conf. venue) and y (price), both from 0 to 1, and the ranking function F = x + y]

5

Related work

Spatial aggregation processing
  Compute aggregate measures (e.g., number of cars in car-parks) in a region (e.g., a district)
aR-trees [PKZT01]
  Each entry is augmented with the aggregate measure of all points in its subtree
Example: COUNT aR-tree
  Query: find the number of points that intersect W
    Prune entries that do not intersect W
    If an entry is fully covered by W, increment the result by its count
    If an entry is partially covered by W, make a recursive call on it
  Cost: 10 for the aR-tree, but 17 for a typical R-tree
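The three cases of the COUNT query (prune, fully covered, partially covered) can be sketched as below. This is a toy in-memory structure, not the aR-tree of [PKZT01]: the `Node` class, the helper names, and the sample points and window are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple      # lower corner of the bounding rectangle
    hi: tuple      # upper corner of the bounding rectangle
    count: int     # aggregate: number of points in the subtree
    children: list = field(default_factory=list)  # set for non-leaf nodes
    points: list = field(default_factory=list)    # set for leaf nodes


def leaf(points):
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, len(points), points=list(points))


def inner(children):
    lo = tuple(min(c) for c in zip(*(n.lo for n in children)))
    hi = tuple(max(c) for c in zip(*(n.hi for n in children)))
    return Node(lo, hi, sum(n.count for n in children), children=list(children))


def count_in_window(node, wlo, whi, stats):
    """Count the points inside the window [wlo, whi]; stats counts node reads."""
    stats["accesses"] += 1
    if not node.children:  # leaf: test each point
        return sum(all(l <= x <= h for l, x, h in zip(wlo, p, whi))
                   for p in node.points)
    total = 0
    for c in node.children:
        if any(ch < l or cl > h for cl, ch, l, h in zip(c.lo, c.hi, wlo, whi)):
            continue                      # no intersection: prune
        if all(l <= cl and ch <= h for cl, ch, l, h in zip(c.lo, c.hi, wlo, whi)):
            total += c.count              # fully covered: use the stored count
        else:
            total += count_in_window(c, wlo, whi, stats)  # partial: recurse
    return total


ROOT = inner([
    leaf([(0.1, 0.1), (0.2, 0.3)]),
    leaf([(0.5, 0.5), (0.6, 0.7), (0.7, 0.6)]),
    leaf([(0.8, 0.2), (0.9, 0.9)]),
])
```

For the window from (0.4, 0.4) to (0.85, 0.8), the first leaf is pruned, the second contributes its stored count without being read, and only the third is visited.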

[Figure: COUNT aR-tree example. Left: entries e1 to e20 and the query window W in the unit square (axes x and y). Right: the tree, from the root node down to entries e1 to e20, with each entry annotated with its count; contents of leaf nodes omitted]

6

Related work: skyline computation

Non-indexed data
  DC (divide-and-conquer), BNL (block-nested loop), SFS (sort-filter-skyline), LESS (linear elimination sort for skyline)
Indexed data
  NN, BBS [PTFS05]
Skyline variants based on the dominance relationship
  Top-k frequent skyline points [CJT+06a]
    Frequency of p: number of subspaces in which p is a skyline point
  k-dominant skyline points [CJT+06b]
    Relax the dominance relationship by k
    k=d: original skyline; as k decreases, the skyline size decreases
  Data cube for analyzing the dominance relationship of points [LOTW06]

7

Top-k dominating query

How to answer a top-k dominating query?
  Block nested loop join: compute the score of every point
    Quadratic cost
  Skyline-based solution: retrieve the skyline points, compute their scores, and find the top-1 among them
    Expensive for datasets with large skylines

Goal: develop efficient algorithms for the query on indexed multi-dimensional data
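The skyline-based idea works for the top-1 result because a dominated point always scores lower than its dominator, so the best point must lie on the skyline. The sketch below checks this on random data; the dataset, seed, and function names are assumptions, and both routines still scan the data in memory rather than use an index.

```python
import random


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p, data):
    return sum(1 for q in data if dominates(p, q))


def top1_brute_force(data):
    """Score every point (quadratic) and return the best score."""
    return max(score(p, data) for p in data)


def top1_via_skyline(data):
    """Score only the skyline points; return (best score, skyline size)."""
    sky = [p for p in data if not any(dominates(q, p) for q in data)]
    return max(score(p, data) for p in sky), len(sky)


rng = random.Random(7)  # fixed seed so the sketch is repeatable
DATA = [(rng.random(), rng.random()) for _ in range(300)]
```

Only the skyline points need their scores counted, which helps when the skyline is small and hurts when it is large.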

8

Problem characteristics

Pre-computation possible?
  Materialize the 'score' of every point
  Updates: change the 'score' of influenced points
  Update cost is high for dynamic datasets

Find K (K >> k) points with the highest dominating area, then compute their scores to get the best-k results
  Approximate solution; hard to specify K
  Dominating area cannot provide bounds for the score
    DomArea(p1) = (1 – 0.25) * (1 – 0.50) = 0.375
    DomArea(p4) = (1 – 0.45) * (1 – 0.40) = 0.330
    score(p1) = 1 < score(p4) = 2 !!!

Unlike the dominating area, computing the score (or even its upper bound) requires accessing data

[Figure: points p1 to p7 in the unit square (axes x and y)]

9

Skyline-based solution

BBS top-k dominating algorithm [PTFS05]
Example: top-2 dominating query

Iteration 1
  Find the skyline points
    The score of a point is smaller than that of the point dominating it (if any)
  Compute their scores (by accessing the tree)
  Report p2 (4) as the first result

Iteration 2
  Find the constrained skyline (gray region)
    WHY? It is the region dominated by p2 but not by the others (p1, p3)
  Compute their scores and compare them with the points retrieved in all previous iterations
  Report p4 (2) as the second result

[Figure: points p1 to p7 in the unit square (axes x and y)]

10

Our optimizations

[Figure: an entry e with its lower corner e– and upper corner e+, and three points p1, p2, p3]

Hilbert ordering of retrieved points before counting
  Exhibits locality of node accesses
Batch counting
  Pack B (page capacity) points into one page and count their scores simultaneously
e– and e+ denote the lower and upper corners (virtual points) of an entry e, respectively
Properties
  p1 > e–: p1 dominates all points in e
  p2 > e+ but p2 does not dominate e–: p2 may dominate some points in e
  p3 does not dominate e+: p3 dominates no points in e
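The corner properties above amount to a three-way test of a point against an entry's bounding rectangle. The sketch below is one way to write that test; the function name `coverage`, the labels it returns, and the sample coordinates are assumptions, not part of the slides.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def coverage(p, e_lo, e_hi):
    """Classify how much of entry e (rectangle [e_lo, e_hi]) the point p dominates."""
    if dominates(p, e_lo):
        return "all"    # every point in e is dominated: add the entry's count
    if dominates(p, e_hi):
        return "some"   # p may dominate some points: the entry must be opened
    return "none"       # p dominates nothing in e: skip the entry


# Hypothetical entry and points, laid out like p1, p2, p3 in the properties above.
E_LO, E_HI = (0.4, 0.4), (0.6, 0.6)
P1, P2, P3 = (0.2, 0.3), (0.5, 0.1), (0.7, 0.2)
```

During counting on a COUNT aR-tree, the "all" case uses the stored count, the "some" case descends into the entry, and the "none" case is skipped.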

11

The best-first approach

The optimized BBS is inefficient when the skyline is large
  Not necessary to compute the whole skyline
Best-first approach: visit the nodes in descending order of their upper bound scores
  Use a max-heap H to organize the entries to be visited in descending order of their upper bound scores
  Keep an array W of the best-k data points found so far
  Terminate when the top entry in H cannot improve the result

Computing the upper bound scores of entries in the same non-leaf node
  The upper bound score of an entry e is score(e–)
    WHY? Every point that a point in e dominates is also dominated by e–
  For each entry e in the node, put the point e– in the set T
  Perform batch counting for the points in T
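The best-first loop can be sketched on a simplified layout. The assumptions here are substantial: the tree is flattened to a single level of leaf nodes, scores are counted by brute force instead of by batch counting on an aR-tree, and the data, seed, and names are invented for the example. Only the control flow (max-heap on score(e–), result array, early termination) follows the slide.

```python
import heapq
import random


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p, data):
    # Stand-in for batch counting on the aR-tree: a brute-force count.
    return sum(1 for q in data if dominates(p, q))


def best_first_top_k(leaves, data, k):
    """leaves: list of point lists, one per leaf node; data: all points."""
    heap = []  # max-heap H, simulated with negated upper bounds
    for i, pts in enumerate(leaves):
        lower = tuple(min(c) for c in zip(*pts))  # virtual point e-
        heapq.heappush(heap, (-score(lower, data), i))
    best = []      # result array W: (score, point), at most k items
    visited = 0
    while heap:
        neg_ub, i = heapq.heappop(heap)
        if len(best) == k and -neg_ub <= best[-1][0]:
            break  # the top entry cannot improve the result
        visited += 1
        best.extend((score(p, data), p) for p in leaves[i])
        best.sort(key=lambda item: item[0], reverse=True)
        del best[k:]
    return best, visited


rng = random.Random(1)  # fixed seed so the sketch is repeatable
DATA = sorted((rng.random(), rng.random()) for _ in range(200))
LEAVES = [DATA[i:i + 10] for i in range(0, len(DATA), 10)]
```

The loop stops as soon as the largest remaining upper bound is no better than the k-th best score found so far, so leaves with low bounds are never opened.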

[Figure: an entry e with its lower corner e– and upper corner e+]

12

Optimizations for best-first search

Pruning technique
  Let the threshold be the best-k score found so far (the lowest score in the result array)
  Suppose that a point p satisfies score(p) ≤ threshold. Then any point p' dominated by p has score(p') < threshold
  Keep a pruner set F of visited data points whose scores are at most the threshold
  Among the points in F, only their skyline needs to be maintained
  Apply F to eliminate unqualified entries

Lazy counting (for computing scores of leaf entries)
  If only some data points (in the same leaf node) remain before counting, it is not cost-effective to perform batch counting for them alone
  Use a FIFO queue L to store discovered points
  Once L is full (i.e., |L| = page capacity B), perform batch counting for the points in L, update the result, and clear L

13

Lightweight best-first search

Expensive to compute upper bound scores for non-leaf entries
  Root node contains e1, e2, e3
  Compute score(e1–), score(e2–), score(e3–) in batch
    e2– may dominate some points in e1 and e3
  Cost: 3 node accesses; score(e1–)=3, score(e2–)=7, score(e3–)=3

Use a lightweight function to compute upper bound scores for non-leaf entries
  Goal: do not allow leaf nodes to be accessed
  Compute ub(e1–), ub(e2–), ub(e3–) in batch
  Cost: 1 node access, since leaf nodes are not accessed
    e2– dominates all points in e2 and some in e1 and e3
  ub(e1–)=3, ub(e2–)=9, ub(e3–)=3
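One way to read the lightweight bound is that any entry the point may dominate, fully or partially, is counted in full using only its stored count. The sketch below follows that reading; it is an interpretation rather than the authors' exact procedure, and the coordinates are hypothetical values chosen so that the exact scores (3, 7, 3) and the bounds (3, 9, 3) match the slide.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


# Hypothetical leaf contents (three points per entry); the slide gives none.
LEAVES = {
    "e1": [(0.1, 0.9), (0.35, 0.6), (0.4, 0.7)],
    "e2": [(0.3, 0.5), (0.5, 0.3), (0.4, 0.4)],
    "e3": [(0.9, 0.1), (0.6, 0.35), (0.7, 0.4)],
}


def corners(points):
    """Lower corner e- and upper corner e+ of the entry's bounding rectangle."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def exact_score(p, leaves):
    """Exact count; in a real tree this requires reading leaf nodes."""
    return sum(1 for pts in leaves.values() for q in pts if dominates(p, q))


def light_upper_bound(p, leaves):
    """Upper bound from corners and counts only; no leaf contents are read."""
    total = 0
    for pts in leaves.values():
        _, hi = corners(pts)   # would be stored in the parent node
        if dominates(p, hi):   # entry fully or partially dominated
            total += len(pts)  # count the whole entry
    return total


LOWER = {name: corners(pts)[0] for name, pts in LEAVES.items()}
```

The bound for e2– overshoots the exact score (9 against 7) but still ranks e2 first, which is the ordering the best-first search relies on.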

[Figure: root entries e1, e2, e3 containing points p1 to p9 in the unit square (axes x and y)]

Correct bound!

Approximately preserves the original ordering of entries!

14

Incremental best-first search

No objects need to be pruned
Data points are inserted into the heap H after their scores have been computed
When a data point p is deheaped, check whether its score is greater than those of the points in the lazy counting queue L
  If yes, report p as the next result
  If not:
    Consider the points in L whose (upper bound) score is greater than that of p
    Compute their actual scores and insert them into H
    Insert p back into H
    The next result is now at the top of H (to be found in the next iteration)

15

Query variant

Bichromatic top-k dominating query
  Given a provider dataset $D_P$ and a consumer dataset $D_A$, the score of a point p in $D_P$ is
    $A(p) = |\{ a \in D_A : p > a \}|$
  A(p1)=2, A(p2)=3, A(p3)=1
  Bichromatic top-1 point: p2

Application: find the most popular hotel, where $D_P$ contains hotels and $D_A$ specifies the requirements of different customers

Query processing
  The proposed algorithms are still applicable
  Search for the results in $D_P$
  Perform counting on $D_A$
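A brute-force sketch of the bichromatic score follows: candidates come from the provider set, and counting is done against the consumer set. The coordinates are hypothetical (the slide gives none) and were chosen so that the scores match A(p1)=2, A(p2)=3, A(p3)=1; the function names are assumptions.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


# Hypothetical coordinates (x = time to venue, y = price); the slide gives none.
PROVIDERS = {"p1": (0.2, 0.6), "p2": (0.3, 0.3), "p3": (0.7, 0.2)}
CONSUMERS = {"a1": (0.8, 0.5), "a2": (0.4, 0.7), "a3": (0.25, 0.8), "a4": (0.5, 0.4)}


def bichromatic_scores(providers, consumers):
    """For each provider, count the consumer points it dominates."""
    return {
        name: sum(1 for a in consumers.values() if dominates(p, a))
        for name, p in providers.items()
    }


def bichromatic_top_k(providers, consumers, k):
    """Search in the provider set, count on the consumer set."""
    scores = bichromatic_scores(providers, consumers)
    return sorted(scores, key=lambda name: -scores[name])[:k]
```

In the hotel reading, a consumer point is a customer's worst acceptable time and price, and a hotel dominating it satisfies that customer.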

[Figure: provider points p1 to p3 and consumer points a1 to a4, plotted with x (time to conf. venue) and y (price), both from 0 to 1]

16

Setup of efficiency experiments

Algorithms
  BBS (skyline-based method)
  Best-first search: BF1 (basic), BF2 (lightweight)
  Incremental best-first: IBF1, IBF2
Synthetic datasets
  UI (independent), CO (correlated), AC (anti-correlated)
Parameters and other settings
  aR-tree node page size: 4K bytes
  LRU buffer size (%): 0, 1, 2, 5, 10, 15, 20
  Data size N (million): 0.25, 0.5, 1, 2, 4
  Data dimensionality d: 2, 3, 4, 5
  Result size k: 1, 4, 16, 64, 256

17

I/O cost vs buffer size

[Figure: three panels (UI, CO, AC) plotting I/O against buffer size (%) at 0, 1, 2, 5, 10, 15, 20 for BBS, BF1, and BF2]

18

I/O cost vs k

[Figure: three panels (UI, CO, AC) plotting I/O (log scale) against k at 1, 4, 16, 64, 256 for BBS, BF1, BF2, IBF1, and IBF2]

19

I/O cost vs d

[Figure: three panels (UI, CO, AC) plotting I/O (log scale) against d at 2, 3, 4, 5 for BBS, BF1, and BF2]

20

I/O cost vs N

[Figure: three panels (UI, CO, AC) plotting I/O against N (million) at 0.25, 0.5, 1, 2, 4 for BBS, BF1, and BF2]

21

Bichromatic queries, I/O cost vs dataset combination

[Figure: I/O (log scale) of BBS, BF1, and BF2 for each dataset combination ($D_P$ / $D_A$): UI/UI, UI/CO, UI/AC, CO/UI, CO/CO, CO/AC, AC/UI, AC/CO, AC/AC]

Column UI/CO means the provider dataset $D_P$ is UI and the consumer dataset $D_A$ is CO
BF1 is more efficient than BBS in 7 cases
BF2 outperforms its competitors in all cases

22

Meaningfulness of results

Explore the meaningfulness of the results returned by top-k dominating queries

Real datasets
  Statistics of NBA players
    http://basketballreference.com/stats_download.htm
    19112 players (identified by both name and year)
    Attributes for query: GP (games played), PTS (points), REB (rebounds), and AST (assists)
  Statistics of BASEBALL pitchers
    http://baseball1.com/statistics/
    36898 players (identified by both name and year)
    Attributes for query: W (Wins), H (Hits), ERA (Earned Run Average), and R (Runs Allowed)

23

Top-k dominating points meaningful?

Top-5 dominating points

Results match the public’s view of super-star players in NBA and BASEBALL

Enables users to discover `top’ objects without any specific domain knowledge

24

Skyline vs top-k dominating points

[Figure: two panels (NBA, BASEBALL) plotting score against k for the skyline points and the top-k dominating points]

Perform a skyline query, then compute the top-k dominating points by setting k to the skyline size (69 for NBA and 700 for BASEBALL)
Plot their dominating scores in descending order
Observations
  Top-k dominating points have much higher scores than skyline points
  Top-k dominating points are more informative to users

25

Conclusions

Study top-k dominating queries on indexed multi-dimensional data
Present algorithms for the problem
  The lightweight best-first algorithm BF2 performs the best
Top-k dominating queries produce more meaningful results than skylines

26

References

[FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.

[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The Skyline Operator. In ICDE, 2001.

[PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP Operations in Spatial Data Warehouses. In SSTD, 2001.

[PTFS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. TODS, 30(1):41–82, 2005.

[CJT+06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. On High Dimensional Skylines. In EDBT, 2006.

[CJT+06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, 2006.

[LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang. DADA: A Data Cube for Dominant Relationship Analysis. In SIGMOD, 2006.
