1
Top-k Dominating Queries
DB seminar
Speaker: Ken Yiu
Date: 25/05/2006
2
Outline
Motivations and applications
Background
Skyline-based algorithm
Best-first algorithms
Experimental results
Conclusions
3
Top-k query, skyline query
D: dataset of points in a d-dimensional space
Top-k query
  k points with the lowest F values
  Top-2: p4, p6
  Requires a ranking function
  Result affected by the scales of the dimensions
Skyline query
  p > p’ (p dominates p’): ( ∃i, p[i] < p’[i] ) ∧ ( ∀i, p[i] ≤ p’[i] )
  Points not dominated by any other point
  Skyline: p1, p4, p6, p7
  Result size cannot be controlled
[Figure: points p1 to p7 plotted with x = time to conf. venue and y = price, both from 0 to 1; ranking function F = x + y]
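The two query types above can be contrasted with a minimal Python sketch. The coordinates below are hypothetical (they are not the points of the slide's figure), and the helper names `dominates`, `top_k` and `skyline` are illustrative only. Smaller values are assumed to be better in every dimension.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, ...]

# Hypothetical data: x = time to venue, y = price (not the slide's points).
POINTS: Dict[str, Point] = {
    "a": (0.1, 0.95),
    "b": (0.3, 0.4),
    "c": (0.5, 0.5),
    "d": (0.8, 0.3),
    "e": (0.9, 0.8),
}


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def top_k(points: Dict[str, Point], k: int, f: Callable[[Point], float]) -> List[str]:
    """Names of the k points with the lowest value of the ranking function f."""
    return sorted(points, key=lambda name: f(points[name]))[:k]


def skyline(points: Dict[str, Point]) -> List[str]:
    """Names of the points that no other point dominates."""
    return [n for n, p in points.items()
            if not any(dominates(q, p) for q in points.values())]


by_sum = top_k(POINTS, 2, lambda p: p[0] + p[1])          # F = x + y
by_scaled = top_k(POINTS, 2, lambda p: p[0] + 10 * p[1])  # same data, y rescaled by 10
sky = skyline(POINTS)
```

On this toy data the top-2 result changes from `["b", "c"]` to `["d", "b"]` when the y axis is rescaled, while the skyline `["a", "b", "d"]` depends only on the order of the values in each dimension; its size, however, is dictated by the data rather than by the user.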
4
Top-k dominating query
Intuitive score function
  score(p) = | { p’ ∈ D : p > p’ } |
  Property: ∀ p, p’ ∈ D, p > p’ ⇒ score(p) > score(p’)
Top-k dominating query
  k points with the highest score values
  Also known as k-dominating query [PTFS05]
  Top-2 dominating points: p4 (3), p5 (2)
Applications: decision support systems, finding the most ‘popular’ objects
Advantages
  Control of result size
  No need to specify a ranking function
  Result independent of the scales of the dimensions
[Figure: points p1 to p7 plotted with x = time to conf. venue and y = price, both from 0 to 1; ranking function F = x + y]
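As a rough illustration of the score function and the query defined above, the following Python sketch computes dominating scores by brute force (quadratic in the number of points). The data and the function names are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

# Hypothetical data (smaller is better in both dimensions).
POINTS: Dict[str, Point] = {
    "a": (0.1, 0.95),
    "b": (0.3, 0.4),
    "c": (0.5, 0.5),
    "d": (0.8, 0.3),
    "e": (0.9, 0.8),
}


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominating_scores(points: Dict[str, Point]) -> Dict[str, int]:
    """Number of points dominated by each point (compares every pair)."""
    return {n: sum(dominates(p, q) for q in points.values())
            for n, p in points.items()}


def top_k_dominating(points: Dict[str, Point], k: int) -> List[Tuple[str, int]]:
    """The k points with the highest dominating scores; ties broken by name."""
    scores = dominating_scores(points)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


scores = dominating_scores(POINTS)
top2 = top_k_dominating(POINTS, 2)
```

Here `top2` contains `c`, which is dominated by `b` and therefore is not a skyline point. This mirrors the slide's example, where p5 is reported although it is not on the skyline.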
5
Related work
Spatial aggregation processing
  Aggregate measures (e.g., number of cars in car parks) in a region (e.g., a district)
aR-trees [PKZT01]
  Each entry is augmented with the aggregate measure of all points in its subtree
Example: COUNT R-tree
  Query: find the number of points that intersect W
  Prune entries that do not intersect W
  Entry fully covered by W: increment the result by its count
  Entry partially covered by W: recursive call
  Cost: 10 for the aR-tree, but 17 for a typical R-tree
[Figure: entries e1 to e20 and query window W on x and y axes from 0 to 1, with the corresponding COUNT aR-tree (root node and per-entry counts; contents of leaf nodes omitted)]
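The three cases of the COUNT query above (prune, fully covered, partially covered) can be sketched in Python as follows. This is a minimal in-memory stand-in for an aR-tree, not the structure of [PKZT01]: the `Node` class, the helper names and the tiny dataset are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


@dataclass
class Node:
    mbr: Rect
    count: int  # aggregate: number of points in the subtree
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


def bounding_rect(rects: List[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def make_leaf(points: List[Point]) -> Node:
    rect = bounding_rect([(x, y, x, y) for x, y in points])
    return Node(rect, len(points), points=list(points))


def make_inner(children: List[Node]) -> Node:
    return Node(bounding_rect([c.mbr for c in children]),
                sum(c.count for c in children), children=list(children))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def covers(w: Rect, r: Rect) -> bool:
    return w[0] <= r[0] and w[1] <= r[1] and r[2] <= w[2] and r[3] <= w[3]


def count_in_window(node: Node, w: Rect, stats: Dict[str, int]) -> int:
    if not intersects(node.mbr, w):
        return 0                  # pruned: nothing inside W
    if covers(w, node.mbr):
        return node.count         # fully covered: use the stored aggregate
    stats["visited"] += 1         # partially covered: the node must be read
    if node.children:
        return sum(count_in_window(c, w, stats) for c in node.children)
    return sum(1 for x, y in node.points
               if w[0] <= x <= w[2] and w[1] <= y <= w[3])


root = make_inner([
    make_leaf([(0.1, 0.1), (0.2, 0.2)]),
    make_leaf([(0.3, 0.3), (0.6, 0.6)]),
    make_leaf([(0.8, 0.8), (0.9, 0.9)]),
])
stats = {"visited": 0}
answer = count_in_window(root, (0.0, 0.0, 0.5, 0.5), stats)
```

In this toy run only the root and the one partially covered leaf are read; the fully covered leaf contributes its stored count without being opened, which is where the saving over a plain R-tree comes from.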
6
Related work: skyline computation
Non-indexed data
  DC (divide-and-conquer), BNL (block-nested loop), SFS (sort-filter-skyline), LESS (linear elimination sort for skyline)
Indexed data
  NN, BBS [PTFS05]
Skyline variants based on the dominance relationship
  Top-k frequent skyline points [CJT+06a]
    Frequency of p: number of subspaces in which p is a skyline point
  k-dominant skyline points [CJT+06b]
    Relax the dominance relationship by k
    k = d: original skyline; as k decreases, the skyline size decreases
  Data cube for analyzing the dominance relationship of points [LOTW06]
7
Top-k dominating query
How to answer a top-k dominating query?
Block nested loop join: compute the score of every point
  Quadratic cost
Skyline-based solution: retrieve the skyline points and compute their scores, find the top-1 from the skyline
  Expensive for datasets with large skylines
Goal: develop efficient algorithms for the query on indexed multi-dimensional data
8
Problem characteristics
Pre-computation possible?
  Materialize the ‘score’ of every point
  Updates change the ‘score’ of influenced points
  Update cost is high for dynamic datasets
Find K (K >> k) points with the highest dominating area, then compute their scores to get the best-k results
  Approximate solution; hard to specify K
  Dominating area cannot provide bounds for the score
    DomArea(p1) = (1 – 0.25) * (1 – 0.50) = 0.375
    DomArea(p4) = (1 – 0.45) * (1 – 0.40) = 0.330
    score(p1) = 1 < score(p4) = 2 !!!
Unlike the dominating area, computing the score value (or even its upper bound) requires accessing data
[Figure: points p1 to p7 on x and y axes from 0 to 1]
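The DomArea counterexample above can be checked with a few lines of Python. The coordinates of p1 and p4 are the ones implied by the slide's arithmetic; the three other points are hypothetical, chosen only so that the scores come out as 1 and 2.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dom_area(p: Point) -> float:
    """Volume of the region of the unit space that p dominates."""
    return math.prod(1.0 - c for c in p)


POINTS: Dict[str, Point] = {
    "p1": (0.25, 0.50),
    "p4": (0.45, 0.40),
    # Hypothetical neighbours (not from the slide's figure):
    "u": (0.30, 0.90),
    "v": (0.60, 0.45),
    "w": (0.80, 0.42),
}


def score(name: str) -> int:
    return sum(dominates(POINTS[name], q) for q in POINTS.values())


area_p1, area_p4 = dom_area(POINTS["p1"]), dom_area(POINTS["p4"])
score_p1, score_p4 = score("p1"), score("p4")
```

The area depends only on the coordinates of the point itself, whereas the score depends on where the other points happen to lie, so a larger area does not imply a larger score.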
9
Skyline-based solution
BBS top-k dominating algorithm [PTFS05]
Example: top-2 dominating query
Iteration 1
  Find the skyline points
    The score of a point is smaller than that of any point dominating it
  Compute their scores (by accessing the tree)
  Report p2 (4) as the first result
Iteration 2
  Find the constrained skyline (gray region)
    Why? It is the region dominated by p2 but not by the others (p1, p3)
  Compute their scores and compare them with the points retrieved in all previous iterations
  Report p4 (2) as the second result
[Figure: points p1 to p7 on x and y axes from 0 to 1, with the constrained skyline region shaded gray]
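The iterations above can be sketched in Python without an index. This is not BBS itself: instead of a constrained skyline search on an R-tree, the sketch simply recomputes the skyline of the points not yet reported, which yields the same candidate set (previous candidates plus the points dominated only by the reported one). Data and names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

POINTS: Dict[str, Point] = {  # hypothetical data, smaller is better
    "a": (0.1, 0.95),
    "b": (0.3, 0.4),
    "c": (0.5, 0.5),
    "d": (0.8, 0.3),
    "e": (0.9, 0.8),
}


def dominates(p: Point, q: Point) -> bool:
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def skyline_based_top_k(points: Dict[str, Point], k: int) -> List[Tuple[str, int]]:
    remaining = dict(points)      # points not reported yet
    scores: Dict[str, int] = {}   # scores are computed once per candidate
    result: List[Tuple[str, int]] = []
    while remaining and len(result) < k:
        # Candidates: skyline of the unreported points. Any other point is
        # dominated by an unreported point with a strictly higher score.
        candidates = [n for n, p in remaining.items()
                      if not any(dominates(q, p) for q in remaining.values())]
        for n in candidates:
            if n not in scores:
                # Scores always count against the full dataset.
                scores[n] = sum(dominates(points[n], q) for q in points.values())
        best = min(candidates, key=lambda n: (-scores[n], n))
        result.append((best, scores[best]))
        del remaining[best]
    return result, len(scores)


top2, scored = skyline_based_top_k(POINTS, 2)
```

On this toy data four of the five points are scored to answer a top-2 query, which hints at why the approach becomes expensive when the skyline is large.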
10
Our optimizations
[Figure: an entry e with its lower corner e– and upper corner e+, and three points p1, p2, p3]
Hilbert ordering of retrieved points before counting
  Exhibits locality of node accesses
Batch counting
  Pack B (page capacity) points into one page and count their scores simultaneously
e– and e+ denote the lower and upper corners (virtual points) of an entry e, respectively
Properties
  p1 > e–: p1 dominates all points in e
  p2 > e+ but p2 does not dominate e–: p2 may dominate some points in e
  p3 does not dominate e+: p3 dominates no points in e
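The three corner properties above decide, per entry, whether its stored count can be used, its points must be inspected, or it can be skipped. A minimal Python sketch under that reading follows; the one-level list of leaves stands in for an aR-tree, and all names and data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def corners(points: List[Point]) -> Tuple[Point, Point]:
    """Lower corner e- and upper corner e+ of the bounding box of a leaf."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def relation(p: Point, lo: Point, hi: Point) -> str:
    if dominates(p, lo):
        return "all"    # p dominates every point of the entry
    if dominates(p, hi):
        return "some"   # p may dominate some points of the entry
    return "none"       # p dominates no point of the entry


def score_with_entries(p: Point, leaves: List[List[Point]]) -> Tuple[int, int]:
    """Return (score of p, number of leaves whose points had to be read)."""
    score, leaf_reads = 0, 0
    for leaf in leaves:
        lo, hi = corners(leaf)
        rel = relation(p, lo, hi)
        if rel == "all":
            score += len(leaf)  # the aggregate count suffices
        elif rel == "some":
            leaf_reads += 1     # only here are the points inspected
            score += sum(dominates(p, q) for q in leaf)
    return score, leaf_reads


LEAVES = [
    [(0.5, 0.9), (0.9, 0.5), (0.7, 0.7)],   # fully dominated by p
    [(0.1, 0.6), (0.3, 0.3), (0.4, 0.5)],   # partially dominated by p
    [(0.05, 0.15), (0.3, 0.05)],            # not dominated by p
]
result = score_with_entries((0.2, 0.2), LEAVES)
```

Only one of the three leaves is read in this example; batch counting applies the same test to several query points during a single traversal.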
11
The best-first approach
The optimized BBS is inefficient when the skyline is large
  It is not necessary to compute the whole skyline
Best-first approach: visit the nodes in descending order of their upper bound scores
  Use a max-heap H to organize the entries to be visited in descending order of their upper bound scores
  Keep an array W of the best-k data points found so far
  Terminate when the top entry in H cannot improve the result
Compute upper bound scores of entries in the same non-leaf node
  The upper bound score of an entry e is score(e–)
    Why? Every point in e is dominated by or equal to e–, so none can dominate more points than e– does
  For each entry e in the node, put the point e– in the set T
  Perform batch counting for the points in T
[Figure: an entry e with its lower corner e– and upper corner e+]
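The best-first strategy above can be sketched in Python with the standard `heapq` module. The sketch uses a single level of leaf entries and brute-force counting instead of an aR-tree with batch counting, so it only illustrates the visiting order and the termination test. All names and data are hypothetical.

```python
import heapq
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def best_first_top_k(leaves: List[List[Point]], k: int):
    all_points = [p for leaf in leaves for p in leaf]

    def score(p: Point) -> int:
        return sum(dominates(p, q) for q in all_points)

    # Max-heap H (negated keys): entries keyed by the score of their lower corner.
    heap = []
    for i, leaf in enumerate(leaves):
        lower = tuple(min(c) for c in zip(*leaf))
        heapq.heappush(heap, (-score(lower), i))

    best: List[Tuple[int, Point]] = []  # W: min-heap holding the best-k points
    visited = 0
    while heap:
        neg_ub, i = heapq.heappop(heap)
        if len(best) == k and -neg_ub <= best[0][0]:
            break  # the best remaining entry cannot improve the result
        visited += 1
        for p in leaves[i]:
            s = score(p)
            if len(best) < k:
                heapq.heappush(best, (s, p))
            elif s > best[0][0]:
                heapq.heapreplace(best, (s, p))
    return sorted(best, reverse=True), visited


LEAVES = [
    [(0.1, 0.1), (0.2, 0.3)],
    [(0.5, 0.6), (0.6, 0.5)],
    [(0.8, 0.9), (0.9, 0.8)],
]
top2, leaves_visited = best_first_top_k(LEAVES, 2)
```

In this run the search stops after opening one leaf, because the upper bound of the next entry (4) does not exceed the lowest score already in the result.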
12
Optimizations for best-first search
Pruning technique
  Let the threshold be the best-k score found so far (the lowest score in the result array)
  Suppose that a point p satisfies score(p) ≤ threshold. Any point p’ dominated by p has score(p’) < threshold.
  Keep a pruner set F of visited data points whose scores are at most the threshold
  Among the points in F, only their skyline needs to be maintained
  Apply F to eliminate unqualified entries
Lazy counting (for computing scores of leaf entries)
  When only some data points (in the same leaf node) remain before counting, it is not cost-effective to perform batch counting for them
  Use a FIFO queue L to store discovered points
  Once L is full (i.e., |L| = page capacity B), perform batch counting for the points in L, update the result and clear L
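The pruning technique above can be sketched as a small Python class. It covers only the pruner set F (lazy counting is not shown), and it assumes that the caller adds a point only after finding that its score does not exceed the current best-k score. The class and method names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


class PrunerSet:
    """Skyline of visited points whose scores cannot beat the best-k score."""

    def __init__(self) -> None:
        self.points: List[Point] = []

    def add(self, p: Point) -> None:
        # A pruner that is dominated by (or equal to) an existing pruner is
        # redundant: whatever it prunes, the existing one prunes too.
        if any(f == p or dominates(f, p) for f in self.points):
            return
        # Drop existing pruners that the new point makes redundant.
        self.points = [f for f in self.points if not dominates(p, f)]
        self.points.append(p)

    def prunes_point(self, p: Point) -> bool:
        return any(dominates(f, p) for f in self.points)

    def prunes_entry(self, lower_corner: Point) -> bool:
        # If a pruner dominates the lower corner, it dominates every point
        # inside the entry, so the whole entry can be skipped.
        return self.prunes_point(lower_corner)


F = PrunerSet()
F.add((0.5, 0.5))
F.add((0.6, 0.7))   # dominated by (0.5, 0.5): not stored
F.add((0.2, 0.8))   # incomparable with (0.5, 0.5): stored
```

Because the best-k score never decreases during the search, a point that qualifies as a pruner stays valid for the rest of the query.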
13
Lightweight best-first search
Expensive to compute upper bound scores for non-leaf entries
  Root node contains e1, e2, e3
  Compute score(e1–), score(e2–), score(e3–) in batch
    e2– may dominate some points in e1 and e3
  Cost: 3 node accesses; score(e1–) = 3, score(e2–) = 7, score(e3–) = 3
Use a lightweight function to compute upper bound scores for non-leaf entries
  Goal: do not allow leaf nodes to be accessed
  Compute ub(e1–), ub(e2–), ub(e3–) in batch
  Cost: 1 node access, since leaf nodes are not accessed
    e2– dominates all points in e2 and some in e1 and e3, which are counted in full
  ub(e1–) = 3, ub(e2–) = 9, ub(e3–) = 3
[Figure: root entries e1, e2, e3 containing points p1 to p9, on x and y axes from 0 to 1]
Correct bound!
Approximately preserves the original ordering of entries!
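A lightweight bound of the kind described above can be sketched in Python as follows: every entry that a corner dominates fully or partially contributes its whole count, so no leaf has to be read. The bounding boxes are hypothetical; they were chosen so that the bounds match the slide's values 3, 9 and 3.

```python
from typing import Dict, Tuple

Point = Tuple[float, ...]
Entry = Tuple[Point, Point, int]  # (lower corner, upper corner, point count)


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def lightweight_ub(corner: Point, entries: Dict[str, Entry]) -> int:
    """Upper bound of the score of `corner` using only entry-level information."""
    total = 0
    for lower, upper, count in entries.values():
        # If the corner dominates the upper corner, the entry is fully or
        # partially dominated; either way all of its points are counted.
        # If not, the corner dominates no point of the entry.
        if dominates(corner, upper):
            total += count
    return total


ENTRIES: Dict[str, Entry] = {  # hypothetical boxes, three points each
    "e1": ((0.1, 0.6), (0.4, 0.9), 3),
    "e2": ((0.2, 0.2), (0.5, 0.5), 3),
    "e3": ((0.6, 0.1), (0.9, 0.4), 3),
}
bounds = {name: lightweight_ub(entry[0], ENTRIES) for name, entry in ENTRIES.items()}
```

The bound for e2 overestimates the exact value on the slide (9 instead of 7) but keeps e2 ahead of e1 and e3, which is what the visiting order of a best-first search relies on.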
14
Incremental best-first search
No objects need to be pruned
Data points are inserted into the heap H after their scores have been computed
When a data point p is deheaped, check whether its score is greater than those of the points in the lazy counting queue L
  If yes, report p as the next result
  If not,
    consider the points in L whose (upper bound) score is greater than that of p
    Compute their actual scores and insert them into H
    Insert p back into H
    The next result is now at the top of H (to be found in the next iteration)
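The incremental idea can be illustrated with a simplified Python generator. It deliberately omits the lazy counting queue L described above: scores are computed as soon as a leaf is opened, so a data point popped from the heap can be reported immediately. Leaves, names and data are hypothetical, and counting is done by brute force.

```python
import heapq
from itertools import islice
from typing import Iterator, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def incremental_top_dominating(leaves: List[List[Point]]) -> Iterator[Tuple[Point, int]]:
    """Yield (point, score) in descending score order, one result at a time."""
    all_points = [p for leaf in leaves for p in leaf]

    def score(p: Point) -> int:
        return sum(dominates(p, q) for q in all_points)

    POINT, ENTRY = 0, 1  # at equal keys, points are popped before entries
    heap = []
    for i, leaf in enumerate(leaves):
        lower = tuple(min(c) for c in zip(*leaf))
        heapq.heappush(heap, (-score(lower), ENTRY, i))

    while heap:
        neg_key, kind, payload = heapq.heappop(heap)
        if kind == POINT:
            # Every remaining point or entry has a score (bound) no larger.
            yield payload, -neg_key
        else:
            for p in leaves[payload]:
                heapq.heappush(heap, (-score(p), POINT, p))


LEAVES = [
    [(0.1, 0.1), (0.2, 0.3)],
    [(0.5, 0.6), (0.6, 0.5)],
    [(0.8, 0.9), (0.9, 0.8)],
]
first_three = list(islice(incremental_top_dominating(LEAVES), 3))
```

The caller does not fix k in advance; it simply stops consuming the generator when enough results have been seen.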
15
Query variant
Bichromatic top-k dominating query
  Given a provider dataset DP and a consumer dataset DA, the score of a point p in DP is
    score_A(p) = | { a ∈ DA : p > a } |
  score_A(p1) = 2, score_A(p2) = 3, score_A(p3) = 1
  Bichromatic top-1 point: p2
Application: find the most popular hotel, where DP contains hotels and DA specifies the requirements of different customers
Query processing
  The proposed algorithms are still applicable
  Search for the results in DP
  Perform counting on DA
[Figure: provider points p1 to p3 and consumer points a1 to a4, plotted with x = time to conf. venue and y = price, both from 0 to 1]
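The bichromatic score differs from the ordinary one only in which dataset is counted. A minimal Python sketch follows; the coordinates are hypothetical and were chosen so that the counts match the slide's example (2, 3 and 1).

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


# Hypothetical coordinates: x = time to venue, y = price.
PROVIDERS: Dict[str, Point] = {   # DP, e.g. hotels
    "p1": (0.2, 0.6),
    "p2": (0.4, 0.3),
    "p3": (0.7, 0.2),
}
CONSUMERS: Dict[str, Point] = {   # DA, e.g. customer requirements
    "a1": (0.3, 0.8),
    "a2": (0.5, 0.7),
    "a3": (0.6, 0.4),
    "a4": (0.8, 0.35),
}


def bichromatic_top_k(providers: Dict[str, Point],
                      consumers: Dict[str, Point],
                      k: int) -> List[Tuple[str, int]]:
    """Search in the provider set, count in the consumer set."""
    scores = {name: sum(dominates(p, a) for a in consumers.values())
              for name, p in providers.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


ranking = bichromatic_top_k(PROVIDERS, CONSUMERS, 3)
```

A hotel dominates a requirement when it is at least as good in every dimension and strictly better in one, so the top-1 point is the hotel that satisfies the most customers.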
16
Setup of efficiency experiments
Algorithms
  BBS (skyline-based method)
  Best-first search: BF1 (basic), BF2 (lightweight)
  Incremental best-first: IBF1, IBF2
Synthetic datasets
  UI (independent), CO (correlated), AC (anti-correlated)
Parameters and other settings
  aR-tree node page size: 4K bytes
  LRU buffer size (%): 0, 1, 2, 5, 10, 15, 20
  Data size N (million): 0.25, 0.5, 1, 2, 4
  Data dimensionality d: 2, 3, 4, 5
  Result size k: 1, 4, 16, 64, 256
17
I/O cost vs buffer size
[Figure: three panels (UI, CO, AC) plotting I/O against buffer size (%) from 0 to 20 for BBS, BF1 and BF2]
18
I/O cost vs k
[Figure: three panels (UI, CO, AC) plotting I/O (log scale) against k from 1 to 256 for BBS, BF1, BF2, IBF1 and IBF2]
19
I/O cost vs d
[Figure: three panels (UI, CO, AC) plotting I/O (log scale) against d from 2 to 5 for BBS, BF1 and BF2]
20
I/O cost vs N
[Figure: three panels (UI, CO, AC) plotting I/O against N (million) from 0.25 to 4 for BBS, BF1 and BF2]
21
Bichromatic queries, I/O cost vs dataset combination
[Figure: I/O (log scale) for BBS, BF1 and BF2 over the dataset combinations DP / DA: UI/UI, UI/CO, UI/AC, CO/UI, CO/CO, CO/AC, AC/UI, AC/CO, AC/AC]
Column UI/CO means that the provider dataset DP is UI and the consumer dataset DA is CO
BF1 is more efficient than BBS in 7 cases
BF2 outperforms its competitors in all cases
22
Meaningfulness of results
Explore the meaningfulness of the results returned by top-k dominating queries
Real datasets
  Statistics of NBA players
    http://basketballreference.com/stats_download.htm
    19112 players (identified by both name and year)
    Attributes for query: GP (games played), PTS (points), REB (rebounds), and AST (assists)
  Statistics of BASEBALL pitchers
    http://baseball1.com/statistics/
    36898 players (identified by both name and year)
    Attributes for query: W (Wins), H (Hits), ERA (Earned Run Average), and R (Runs Allowed)
23
Top-k dominating points meaningful?
Top-5 dominating points
Results match the public’s view of super-star players in NBA and BASEBALL
Enables users to discover `top’ objects without any specific domain knowledge
24
Skyline vs top-k dominating points
[Figure: two panels (NBA, BASEBALL) plotting dominating score against k for skyline points and top-k dominating points]
Perform a skyline query, then compute the top-k dominating points by setting k to the skyline size (69 for NBA and 700 for BASEBALL)
Plot their dominating scores in descending order
Observations
  Top-k dominating points have much higher scores than skyline points
  Top-k dominating points are more informative to users
25
Conclusions
Study top-k dominating queries on indexed multi-dimensional data
Present algorithms for the problem
  The lightweight best-first algorithm BF2 performs the best
Top-k dominating queries produce more meaningful results than skylines
26
References
[FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The Skyline Operator. In ICDE, 2001.
[PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP Operations in Spatial Data Warehouses. In SSTD, 2001.
[PTFS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. TODS, 30(1):41–82, 2005.
[CJT+06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. On High Dimensional Skylines. In EDBT, 2006.
[CJT+06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, 2006.
[LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang. DADA: A Data Cube for Dominant Relationship Analysis. In SIGMOD, 2006.