Feifei Li¹, Ke Yi², Wangchao Le¹
¹Florida State University, ²Hong Kong University of Science & Technology
Top-k Queries on Temporal Data
Temporal data: data that change over time.
Typical examples:
- stock traces
- objects' trajectories
Problem Def.
[Figure: a time series plotted as score versus time.]
For efficient storage, indexing, and querying, time series are often represented as piecewise linear functions, each called a Piecewise Linear Approximation (PLA).
Problem Def.
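As a minimal sketch of what a PLA stores, the snippet below evaluates one by linear interpolation between breakpoints. The breakpoint-list layout, the name `pla_score`, and the sample values are assumptions for illustration, not taken from the slides; breakpoint times are assumed strictly increasing.

```python
from bisect import bisect_right

def pla_score(vertices, t):
    """Score of a PLA at time t by linear interpolation.

    vertices: list of (time, score) breakpoints sorted by strictly
    increasing time; consecutive breakpoints are joined by one segment
    (hypothetical layout). Returns None when t is outside the time span.
    """
    if not vertices:
        return None
    times = [x for x, _ in vertices]
    if t < times[0] or t > times[-1]:
        return None
    if t == times[-1]:
        return vertices[-1][1]
    j = bisect_right(times, t) - 1  # segment j spans times[j]..times[j+1]
    (x0, y0), (x1, y1) = vertices[j], vertices[j + 1]
    return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

# A PLA object with 4 line segments (5 breakpoints); the values are made up.
obj = [(0, 1.0), (2, 3.0), (5, 2.0), (6, 4.0), (9, 1.0)]
```

Storing only breakpoints keeps the representation proportional to the number of segments rather than the number of raw samples.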
Each PLA is called an object.
[Figures: score versus time; one example shows a PLA object with 4 line segments.]
Ranking queries on temporal data: top-k queries at time instants.
Problem Def. (cont.)
Given a set of PLA objects {o_i | i = 1, …, n}, a time instant t, and an integer k, a top-k/t query retrieves the k objects that have the highest scores at time instant t.
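The definition can be read as the following brute-force baseline, which scans every object. This is only a reference sketch: the dictionary layout and the names `score_at` and `top_k_at` are hypothetical, and the sample objects are made up.

```python
import heapq

def score_at(vertices, t):
    """Interpolated score at t, or None if t is outside the object's span."""
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        if x0 <= t <= x1 and x0 < x1:
            return y0 + (y1 - y0) * (t - x0) / (x1 - x0)
    return None

def top_k_at(objects, t, k):
    """Return the ids of the k objects with the highest scores at time t."""
    scored = []
    for oid, vertices in objects.items():
        s = score_at(vertices, t)
        if s is not None:
            scored.append((s, oid))
    return [oid for s, oid in heapq.nlargest(k, scored)]

objects = {
    "o1": [(0, 1.0), (4, 5.0)],
    "o2": [(0, 4.0), (4, 0.0)],
    "o3": [(0, 2.5), (2, 2.5), (4, 6.0)],
}
```

Every query touches all n objects here, which is the cost an index is meant to avoid.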
Use an R-tree. R-tree revisited:
- Indexes multi-dimensional information
- Linear space
- Branch and bound with a priority queue
- Does NOT have a worst-case query cost guarantee (linear scan in the worst case)
Treat each object as a trajectory:
- Break up each trajectory into its segments
- Build the R-tree on these segments
Use a kNN query at time t:
- Add an artificial query point that is high enough (example on the next slide)
State of the Art
kNN query at time t using the R-tree:
- Use the minimum snapshot distance (MinSTDist), i.e. the distance from q measured along time instant t.
State of the Art (cont.)
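A small sketch of why the reduction works: if the artificial query point sits at or above every score, the distance along time t is the point's height minus the score, so the k nearest objects are exactly the k highest-scoring ones. A flat dictionary stands in for the R-tree here, and `knn_at_time` and `q_height` are hypothetical names; branch and bound is not shown.

```python
def knn_at_time(scores_at_t, k, q_height):
    """k nearest objects to the artificial point (t, q_height), along time t.

    scores_at_t: dict mapping object id -> score at the query time t.
    q_height must be at least every score, so that a smaller distance
    always means a higher score.
    """
    assert q_height >= max(scores_at_t.values())
    dist = {oid: q_height - s for oid, s in scores_at_t.items()}
    return sorted(dist, key=dist.get)[:k]

scores = {"o1": 2.0, "o2": 3.0, "o3": 2.5}
```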
Branch and bound with MinSTDist:
- Stop when the priority queue holds k objects whose MinSTDist is smaller than that of every unseen object.
Efficiency of the R-tree based approach:
- Linear space consumption
- Handles queries on higher-dimensional problems
Deficiency of the R-tree based approach:
- No worst-case performance guarantee (build, query)
- Current commercial DBMSs have limited support for R-trees
State of the Art (cont.)
We propose the seb-tree, the Sampled Envelope B-tree.
Simplicity:
- The B-tree is the only building block, so it is easy to integrate into commercial DBMSs
Optimal query performance:
- Answers a top-k/t query in a logarithmic number of I/Os in expectation
Handles updates:
- 99.5% of updates end up as simple insertions/deletions
- Only 0.5% of updates need to lock and modify a larger portion of the B-tree
Size & construction:
- Occupies near-linear space
- Requires near-linear time to build
Our contribution
Let S be a set of N line segments in the plane.
Build a series of random samples of S:
- Define l+1 independent sampling ratios p_i (0 ≤ i ≤ l)
- Sample S with ratio p_i
- This gives a sampled set S_i and an unsampled set US_i
- In total, l+1 pairs of S_i and US_i
How to decide l and p_i?
- l is determined by k_max, the highest possible k
- p_i is a geometrically decreasing series: p_i = 1/(2^i B), i = 0, 1, …, l, where B is the number of segments that can be held in a disk block
Seb-tree (rand. sampling)
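The sampling step might be sketched as follows, assuming each segment is kept independently with probability p_i (the slides say only "sampling on S with p_i", so the Bernoulli scheme is an assumption). The function name and the `seed` parameter are illustrative.

```python
import random

def sample_levels(segments, l, B, seed=0):
    """For i = 0..l, split segments into (S_i, US_i) with p_i = 1/(2^i * B)."""
    rng = random.Random(seed)
    levels = []
    for i in range(l + 1):
        p = 1.0 / (2 ** i * B)
        sampled, unsampled = [], []
        for s in segments:
            # Assumed scheme: an independent coin flip per segment.
            (sampled if rng.random() < p else unsampled).append(s)
        levels.append((sampled, unsampled))
    return levels
```

Each level is drawn from the full set S, so the l+1 samples are independent of one another.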
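For each sample S_i, compute its upper envelope env_i.
- What is the upper envelope? The pointwise maximum of the segments in S_i, i.e. the part of S_i visible from above.
The upper envelope can be computed in near-linear time (1989).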
Seb-tree (the upper envelope)
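To make the notion concrete, the sketch below evaluates the upper envelope pointwise by brute force: at a given time it returns the topmost segment. It does not construct the envelope's vertices, which is what a near-linear-time algorithm would do; the segment layout and function names are assumptions.

```python
def seg_value(seg, t):
    """Score of segment ((x0, y0), (x1, y1)) at time t, or None outside its span."""
    (x0, y0), (x1, y1) = seg
    if not (x0 <= t <= x1):
        return None
    if x1 == x0:
        return max(y0, y1)
    return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

def envelope_at(segments, t):
    """Return (score, segment) of the topmost segment at time t, or None."""
    best = None
    for seg in segments:
        v = seg_value(seg, t)
        if v is not None and (best is None or v > best[0]):
            best = (v, seg)
    return best

# Made-up sample of three segments.
a = ((0, 1.0), (4, 3.0))
b = ((1, 4.0), (3, 0.0))
c = ((2, 0.0), (6, 2.0))
```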
[Figures: a random sample S_i; S_i and its upper envelope env_i.]
For each vertex on env_i:
- shoot up a vertical line
- if it is an endpoint of a segment, also shoot down until the line hits another segment or score = 0
This results in the trapezoidal decomposition of S_i: D(S_i).
Seb-tree (the trapezoidal decomp.)
[Figures: S_i and its upper envelope env_i; S_i and its decomposition D(S_i).]
Conflict:
- consider a trapezoid ∆ from some D(S_i) and a segment s ∈ US_i
- we say s conflicts with ∆ if s intersects ∆
Conflict list:
- for each ∆, find all s ∈ US_i that conflict with it (do we need to consider s ∈ S_i?)
- collect all such segments into a list, called the conflict list C(∆)
Seb-tree (the conflict list)
[Figure: a trapezoid ∆ intersected by segments S_a, S_b, S_c, S_d, S_e, so C(∆) = {S_a, S_b, S_c, S_d, S_e}.]
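A simplified sketch of the conflict test, under a stated assumption: the trapezoid is modeled as the region above a single non-vertical floor edge, bounded left and right by the edge's x-span and open above. The general decomposition also has trapezoids bounded from above, which this sketch does not handle. All names are hypothetical.

```python
def line_y(seg, x):
    """y at x of the line through segment ((x0, y0), (x1, y1)), with x0 < x1."""
    (x0, y0), (x1, y1) = seg
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def conflicts(seg, floor):
    """True if seg rises above the trapezoid's floor edge within its x-span."""
    lo = max(seg[0][0], floor[0][0])
    hi = min(seg[1][0], floor[1][0])
    if lo > hi:
        return False  # x-spans do not overlap
    # seg minus floor is linear on [lo, hi], so its maximum is at an endpoint.
    return any(line_y(seg, x) > line_y(floor, x) for x in (lo, hi))

def conflict_list(unsampled, floor):
    """Collect every unsampled segment that conflicts with the trapezoid."""
    return [s for s in unsampled if conflicts(s, floor)]

floor = ((0, 2.0), (4, 2.0))   # made-up floor edge of one trapezoid
s_a = ((1, 1.0), (3, 3.0))     # crosses the floor from below
s_b = ((1, 0.0), (3, 1.0))     # stays below the floor
s_c = ((5, 5.0), (6, 6.0))     # outside the x-span
s_d = ((2, 3.0), (6, 3.0))     # above the floor, partly inside the x-span
```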
Let ∆_1, ∆_2, …, ∆_t be the trapezoids of D(S_i) from left to right
- sorted by the starting x value of each ∆
Build a B-tree T_i on C(∆_1), C(∆_2), …, C(∆_t) in order.
Build one B-tree for each level of sampling
- in total we have l+1 B-trees
Seb-tree (the index)
Lemma 1 (1989): E(|C(∆)|) = O(1/p).
By Lemma 1, for a ∆ on level i, E(|C(∆)|) = O(2^i B).
Lemma 2 (1986): There are O(n·α(n)) vertices on the upper envelope of n line segments in the plane, where α(n) is the inverse Ackermann function, which can be treated as a constant for all practical input sizes.
- S_i has an expected O((1/2^i)·(N/B)·α(N/B)) trapezoids
- the B-tree T_i occupies O(N·α(N/B)) blocks
Size of seb-tree
Over the l+1 B-trees, the size of the seb-tree is the sum of the per-tree bounds above.
Size of seb-tree
Each line segment might intersect multiple trapezoids.
How to build the conflict lists efficiently?
- Hierarchical decomposition
- Conflict lists can be built in near-linear time.
More on seb-tree
Let L_0 be the set of segments in S_i. We then build a gradation
L_0 ⊇ L_1 ⊇ … ⊇ L_λ,
where L_j is a 1/2 sample of L_{j-1} and λ = O(log|L_0|).
The hierarchical decomposition
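A gradation of this kind could be generated as below. The stopping rule (halt once a level holds at most one segment) and the name `build_gradation` are assumptions; the slides only state that each level is a 1/2 sample of the previous one.

```python
import random

def build_gradation(L0, seed=0):
    """Return [L_0, L_1, ..., L_lambda], each a 1/2 sample of the level before."""
    rng = random.Random(seed)
    levels = [list(L0)]
    # Assumed stopping rule: stop when the last level has at most one segment.
    while len(levels[-1]) > 1:
        levels.append([s for s in levels[-1] if rng.random() < 0.5])
    return levels
```

Because each level keeps about half of the previous one, the number of levels is logarithmic in |L_0| in expectation.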
[Figure: the levels L_0, L_1, L_2 of the gradation.]
For each L_j, we build its trapezoidal decomposition D(L_j).
We further partition D(L_j) with the vertical dividing lines from the higher levels D(L_{j+1}), …, D(L_λ).
Store all trapezoids in this hierarchy in a tree (HDT).
The hierarchical decomposition
To decide which C(∆) at L_0 a line segment belongs to, we search top-down from L_λ, visiting a ∆ if and only if the segment intersects it.
The hierarchical decomposition
[Figure: example top-down search for seg1 and seg2 through levels L_2, L_1, L_0 of the HDT, with labeled trapezoids.]
For a particular level S_i, the decomposition has a height of λ = O(log|S_i|).
For a segment s, the time spent visiting the HDT is proportional to the size of the HDT, which is bounded as follows.
At L_j, a conflict list has an expected size E(|C(∆)|) = O(2^{i+j} B); |L_j| = O(N/(2^{i+j} B)), and there are O(|L_j|·α(|L_j|)) trapezoids in D(L_j),
so over its λ levels the HDT has an expected size of O(N·α(N/B)·log(N/B)).
The total time spent on all l+1 HDTs is therefore at most l+1 times this bound.
Cost on building conflict lists
A query on the seb-tree is simple (a single for-loop):
- Given k and a time instant t, initialize i = 0
  1. use B-tree T_i to do a point search and find the ∆ whose x-span contains t, then read its conflict list C(∆)
  2. if at least k segments in C(∆) intersect time t, return the top-k segments; else if i < l, set i = i+1 and repeat step 1
  3. otherwise, scan the entire S to find the top-k segments
- An improvement: instead of starting with i = 0, we can start directly at level i = log(k/B) (because 2^i B needs to be larger than k).
Query on seb-tree
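The query loop can be sketched as below. This is an illustration under assumptions, not the authors' implementation: each level is a pair of sorted trapezoid start values and their conflict lists, `bisect_right` stands in for the B-tree point search, and the example levels are written by hand rather than built by sampling.

```python
from bisect import bisect_right
import heapq

def value_at(seg, t):
    """Score of segment ((x0, y0), (x1, y1)) at t, or None if t is outside it."""
    (x0, y0), (x1, y1) = seg
    if not (x0 <= t <= x1) or x0 == x1:
        return None
    return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

def top_k(segs, t, k):
    """The k highest-scoring segments among those that cover time t."""
    scored = []
    for s in segs:
        v = value_at(s, t)
        if v is not None:
            scored.append((v, s))
    return [s for v, s in heapq.nlargest(k, scored)]

def seb_query(levels, all_segments, t, k):
    """levels[i] = (starts, lists): sorted trapezoid start x-values and, for
    each trapezoid, its conflict list. Returns the top-k segments at time t."""
    for starts, lists in levels:
        j = bisect_right(starts, t) - 1   # stand-in for the B-tree point search
        if j < 0:
            continue                      # t lies left of every trapezoid
        hits = top_k(lists[j], t, len(lists[j]))
        if len(hits) >= k:
            return hits[:k]
    return top_k(all_segments, t, k)      # fallback: scan the entire S

# Hand-made example: four horizontal segments and two levels.
a, b = ((0, 5.0), (10, 5.0)), ((0, 4.0), (10, 4.0))
c, d = ((0, 3.0), (10, 3.0)), ((0, 1.0), (10, 1.0))
levels = [([0], [[a]]), ([0], [[a, b, c]])]
```

With these levels, a query for k = 2 is answered at the second level, while k = 4 falls through to the full scan.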
The query performance guarantee comes from the B-tree.
For any query, the seb-tree index can find the top-k/t segments in expected O(log_B N + k/B) I/Os.
The probability that the seb-tree needs to trigger a brute-force scan is less than B/N, and scanning the whole data set needs O(N/B) I/Os, so this adds only O(1) to the expected query cost.
Query cost
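The O(1) term follows from a one-line expectation, using only the two quantities stated on the slide (scan probability below B/N, scan cost O(N/B)):

```latex
\mathbb{E}[\text{I/Os spent on scans}]
  = \Pr[\text{scan}] \cdot O\!\left(\frac{N}{B}\right)
  < \frac{B}{N} \cdot O\!\left(\frac{N}{B}\right)
  = O(1),
\qquad\text{so}\qquad
\mathbb{E}[\text{query cost}]
  = O\!\left(\log_B N + \frac{k}{B}\right) + O(1)
  = O\!\left(\log_B N + \frac{k}{B}\right).
```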
Recall that to build the B-tree at level i, we need to
- take a 1/(2^i B) sample of S to get S_i
- build the trapezoidal decomposition D(S_i)
- store the conflict lists in the level-i B-tree
Given a new segment s:
- if it changes none of D(L_0), …, D(L_λ), then simply follow the HDT to find where s belongs
- if it does change one of D(L_0), …, D(L_λ), then we need to rebuild a larger portion of the seb-tree
Deletion can be handled similarly.
Updating the seb-tree
Based on Lemma 1: one expects O(1/p) conflicting segments for any trapezoid on level S_i, where p is the sampling rate 1/(2^i B).
To avoid expensive I/O, we define a threshold λ: when |C(∆)| > λ·O(1/p), simply do not store it (and skip it at query time).
In practice, λ = 3 or 4.
Space-query tradeoff
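The threshold rule might look like this in code, assuming the constant hidden in O(1/p) is taken as 1, so the cutoff at level i is λ·2^i·B. The name `should_store` and the default `lam=3` are illustrative.

```python
def should_store(conflict_list, i, B, lam=3):
    """Keep C(∆) at level i only if |C(∆)| <= lam * (1/p_i), p_i = 1/(2^i * B)."""
    expected = (2 ** i) * B   # 1/p_i, taking the O(.) constant as 1 (assumption)
    return len(conflict_list) <= lam * expected
```

A skipped list costs nothing to store, and a query that lands on it simply moves on to the next level.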
[Figure: a trapezoid ∆ with conflicting segments S_a, S_b, S_c, S_d, S_e; |C(∆)| = O(1/p).]
How will the seb-tree behave when
1) the number of time series changes
2) the deviation of the time series changes
3) the threshold λ changes
4) k_max changes
Comparison with the R-tree
Experiment
Index size & construction time
Experiment (1)
Query cost
Experiment (2)
Effect of Kmax
Experiment (3)
Study ranking queries on temporal data
Propose the seb-tree:
- Takes near-linear time to construct
- Occupies near-linear space
- Supports dynamic updates efficiently
- Employs the B-tree as its only building block
Conclusion