The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign
Boolean + Ranking: Querying a Database by K-Constrained Optimization
Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang
AIM 2
Many queries naturally combine Boolean and ranking
Traditional databases
  Boolean query: dept = CS and year = 2
Information retrieval
  Ranking query: Top 5 ranked by gpa
Database applications on Web: Boolean + ranking
  Qualifying constraint B: dept = CS and year = 2
  Quantifying function R: gpa
  Find top answers
AIM 3
Motivating scenarios
Data retrieval: Find houses in a certain price range with a good price/sqft ratio
Select h.address from House h
Where h.price ≤ 200k or h.price ≥ 400k
Order by h.size/|h.price-300k| Limit 1
Select h.address from House h, CrimeRate c
Where (h.price ≤ 200k or h.price ≥ 400k) and h.zipcode = c.zipcode
Order by h.size/|h.price-300k| * c.crimerate^-1 Limit 10
Data analysis: Find products with the highest sales increase in consecutive years
Select s1.itemid from Sales s1, Sales s2
Where s1.itemid = s2.itemid and s2.year - s1.year = 1
Order by s2.sale - s1.sale Limit 10
AIM 4
Boolean + Ranking form a coherent goal function
Boolean B + Ranking R = Goal function G
For a tuple t
G(t) = B(t) * R(t)
     = R(t), if B(t) is true
     = 0, if B(t) is false (i.e., the lowest score)
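A minimal Python sketch of the goal function, using the earlier example B: dept = CS and year = 2, R: gpa. The student records and the helper name make_goal are hypothetical, and the sketch assumes R is non-negative so that 0 really is the lowest score.

```python
def make_goal(B, R):
    """Combine a Boolean condition B and a ranking function R into G(t) = B(t) * R(t)."""
    def G(t):
        # Check B first so that R is never evaluated on a disqualified tuple.
        return R(t) if B(t) else 0.0
    return G

# Hypothetical student tuples, for illustration only.
students = [
    {"name": "s1", "dept": "CS", "year": 2, "gpa": 3.6},
    {"name": "s2", "dept": "EE", "year": 2, "gpa": 3.8},
    {"name": "s3", "dept": "CS", "year": 2, "gpa": 3.9},
    {"name": "s4", "dept": "CS", "year": 3, "gpa": 4.0},
]

B = lambda t: t["dept"] == "CS" and t["year"] == 2   # qualifying constraint
R = lambda t: t["gpa"]                               # quantifying function
G = make_goal(B, R)

scores = [G(t) for t in students]
best = max(students, key=G)
```

Here s4 has the highest gpa but fails B, so it scores 0 and s3 is the top answer.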
AIM 5
The nature of Boolean + Ranking is a K-constrained optimization query
Optimize goal function G over database D
Goal function G: h.size/|h.price-300k| [h.price ≤ 200k or h.price ≥ 400k]
Database D:
    Addr               Zip    Price  Size
1.  Oak park, Chicago  60644  600K   4500
2.  Mattis, Champaign  61821  350K   2000
3.  …                         150K   1000
4.  …                         250K   2000
5.  …                         300K   3500
6.  …                         80K    500
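As a baseline for what "optimize G over D" means, the sketch below scores every tuple of the example table and keeps the top k. It is a plain full scan, not the search mechanism developed in the following slides; the tuple layout (id, price in k, size) and the function names are assumptions made for illustration.

```python
import heapq

# The six example tuples: (id, price in k, size).
houses = [
    (1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
    (4, 250, 2000), (5, 300, 3500), (6, 80, 500),
]

def G(price, size):
    """Goal function: size/|price-300| if price <= 200 or price >= 400, else 0."""
    if not (price <= 200 or price >= 400):
        return 0.0          # B is false; this also avoids dividing by zero at price = 300
    return size / abs(price - 300)

def top_k_scan(rows, k):
    """Score every row and return the k best as (score, id) pairs, best first."""
    scored = [(G(price, size), tid) for tid, price, size in rows]
    return heapq.nlargest(k, scored)

top = top_k_scan(houses, 2)
```

On this table the best tuple is 1 (score 4500/300 = 15), followed by tuple 3; tuples 2, 4, and 5 fail the price constraint and score 0.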
AIM 6
What is the query evaluation mechanism?
Ranking query + Boolean query
How to answer?
AIM 7
Current techniques lack a global search mechanism
If evaluated as separate operators (D → Boolean query B → … → Ranking query R)
  Current techniques optimize only condition-by-condition
If searched by an overall goal function G as a ranking function (D → goal function G, combining R and B)
  Current techniques restrict G to be monotonic
AIM 8
Our thesis: Evaluate query as its nature suggests!
Optimize G over D
Function optimization of G
Discrete state search over D
[Figure: function optimization of G and discrete state search over D combine into OPT*]
AIM 9
We view compound index as discrete space
    Addr               Zip    Price  Size
1.  Oak park, Chicago  60644  600K   4500
2.  Mattis, Champaign  61821  350K   2000
3.  …                         150K   1000
4.  …                         250K   2000
5.  …                         300K   3500
6.  …                         80K    500
AIM 10
We view compound index as discrete space
[Figure: tuples 1-6 plotted by size (ticks at 1500, 3000, 4000, 4500) against price in k (ticks at 100, 250, 350, 600), alongside two index trees whose leaves point to tuples: index a on size (nodes a1, a2, a3, a6, a7) and index b on price (nodes b1, b2, b3, b6, b7)]
AIM 11
We view compound index as discrete space
[Figure: the size vs. price (k) plot and the index trees a (size) and b (price), now with a hierarchy of states: M11 at the top; M22, M32, M23, M33 below it; then M66, M77, M67, M76, M55, M56, M75, …; tuples 1, 5, 4, 2 at the bottom]
Mij = (ai, bj)
AIM 12
We view compound index as discrete space
[Figure: the size vs. price (k) plot, the index trees a (size) and b (price), and the state hierarchy from M11 down to tuples 1, 5, 4, 2]
Mij = (ai, bj): conceptually, the combined space
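One way to read Mij = (ai, bj) is that a state pairs one node of index a with one node of index b, and its child states pair their children. The sketch below assumes each index is a 7-node binary tree in which node n has children 2n and 2n+1; that numbering is consistent with the labels in the figure (M11 above M22, M32, M23, M33) but is an assumption, as are the function names.

```python
from itertools import product

def index_children(n, first_leaf=4):
    """Children of node n in a hypothetical 7-node binary index (nodes 4..7 are leaves)."""
    return [] if n >= first_leaf else [2 * n, 2 * n + 1]

def child_states(state):
    """Child states of M_ij = (a_i, b_j): every pairing of a child of a_i with a child of b_j."""
    i, j = state
    return list(product(index_children(i), index_children(j)))

level1 = child_states((1, 1))   # children of M11
level2 = child_states((3, 3))   # children of M33
```

Under this assumption child_states((1, 1)) yields (2, 2), (2, 3), (3, 2), (3, 3), and a state built from two leaves has no children.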
AIM 13
How to perform the search in the space?
What is the search mechanism?
  How to conceptually view the index space of D for search
How to guide the search?
  How to use function G to focus the search
AIM 14
Challenge 1: What is the search mechanism?
AIM 15
We encode as A* because it’s optimal
What A* is: Finding the shortest path
Why we choose: Completeness and optimality with proper heuristics
  Complete: guaranteed to find the shortest path
  Optimal: visits the least number of nodes
[Figure: example weighted graph with a shortest path from origin to destination]
AIM 16
Encoding our problem into shortest path is challenging
How to encode: a tuple → a path? score of tuple → distance of path?
K-constrained optimization: Find a tuple with maximal score
Shortest path: Find a path with minimal distance
AIM 17
Therefore, we encode K-constrained opt. as:
How to encode a tuple to a path?
  Adding a virtual target t* only reachable through tuples
How to encode maximal tuple with minimal path?
  Quality of path depends solely on the tuple it passes by
  For tuple state t: D(t, t*) = -G(t)
  For two states r, u: D(r, u) = 0
[Figure: state graph from M11 down to tuples 1, 5, 4, 2; links between states have distance 0, and each tuple links to the virtual target t* with distance -G(tuple), e.g. -G(4) and -G(1)]
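The encoding can be checked on a toy graph: with distance 0 between states and distance -G(t) on the link from tuple t to t*, the length of any path equals -G of the tuple it passes through, so the shortest path identifies the maximal tuple. The graph shape and the node names (root, left, right, t1, t3, t6) below are hypothetical; the scores are G values of tuples 1, 3, and 6 from the house example.

```python
# G scores of three qualifying tuples from the house example.
scores = {"t1": 4500 / 300, "t3": 1000 / 150, "t6": 500 / 220}

# Adjacency list: node -> [(next node, distance)]. State-to-state links cost 0.
graph = {
    "root": [("left", 0.0), ("right", 0.0)],
    "left": [("t1", 0.0), ("t3", 0.0)],
    "right": [("t6", 0.0)],
}
# Each tuple reaches the virtual target t* with distance -G(t).
for t, g in scores.items():
    graph[t] = [("t*", -g)]

def shortest_path(node, target="t*"):
    """Return (distance, path) of the shortest path from node to target in this small DAG."""
    if node == target:
        return 0.0, [node]
    best = (float("inf"), [])
    for nxt, dist in graph.get(node, []):
        rest, path = shortest_path(nxt, target)
        if dist + rest < best[0]:
            best = (dist + rest, [node] + path)
    return best

cost, path = shortest_path("root")
```

The exhaustive recursion is only for checking the encoding on a tiny acyclic graph; it is not the search strategy itself.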
AIM 18
Challenge 2: How to guide the search?
AIM 19
We use function opt. to sketch the landscape of G
Function optimization measures quality of states
Function optimization enables:
  1. How to define heuristics?
  2. How to configure space?
  3. Where to start the search?
AIM 20
1. Define admissible heuristics: Measure tightest upper bound
To guarantee completeness
  A* requires admissible heuristics, i.e., estimate optimistically
To ensure admissible heuristics
  Function optimization gives tightest upper bound
    Analytical approaches
    Numeric analysis package
H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region
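For the house example, an analytical bound is easy to derive: within a box of prices and sizes, G is largest at the biggest size and at the qualifying price closest to 300k. The sketch below computes that maximum; the function name upper_bound and the box parameters are assumptions for illustration, and sizes are assumed non-negative.

```python
def upper_bound(price_lo, price_hi, size_hi):
    """Maximum of G = size/|price-300| subject to (price <= 200 or price >= 400)
    over the box price in [price_lo, price_hi], size in [0, size_hi]. Prices are in k."""
    gaps = []
    if price_lo <= 200:
        # Qualifying prices in [price_lo, min(price_hi, 200)]; the closest to 300 is the top end.
        gaps.append(300 - min(price_hi, 200))
    if price_hi >= 400:
        # Qualifying prices in [max(price_lo, 400), price_hi]; the closest to 300 is the low end.
        gaps.append(max(price_lo, 400) - 300)
    if not gaps:
        return 0.0          # no price in the box satisfies the Boolean condition
    return size_hi / min(gaps)

h_low = upper_bound(0, 250, 3000)     # box containing tuples 3, 4, 6
h_mid = upper_bound(250, 350, 4500)   # box entirely inside the excluded price band
h_high = upper_bound(350, 600, 4500)  # box containing tuples 1, 2
```

Each value is at least the score of every tuple in its box (for example, 45 for the last box against a score of 15 for tuple 1), which is what makes the estimate optimistic.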
AIM 21
2. Configure descending space: disconnect uphills
To guarantee optimality
  A* requires descending heuristics
To ensure descending heuristics
  Remove uphill links
[Figure: state space from M11 down to tuples 1, 5, 4, 2, with uphill links removed]
AIM 22
3. Find the right start point: Start from local optima
To guarantee correctness
  Every tuple state must be reachable from start states
  Taking only downhills requires starting from high points
To ensure reachability
  Initial states should contain all local optima
[Figure: state space from M11 down to tuples 1, 5, 4, 2, with the start states marked]
AIM 23
Putting together: Executing A* on the configured space
[Figure: top-down traversal of the configured state space (states M11 … M57; tuples 1, 5, 4, 2)]
Search is implemented as a priority-queue-driven traversal (top-down)
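A priority-queue-driven, top-down traversal can be sketched as a generic best-first search: regions are ordered by an optimistic bound, and a tuple is reported once its exact score reaches the head of the queue. This is an illustration on the house example, not the OPT* implementation; it omits the uphill-link removal and local-optima start states described above, and the halving split is a stand-in for following index child pointers.

```python
import heapq
from itertools import count

HOUSES = {1: (600, 4500), 2: (350, 2000), 3: (150, 1000),
          4: (250, 2000), 5: (300, 3500), 6: (80, 500)}   # id -> (price in k, size)

def G(price, size):
    ok = price <= 200 or price >= 400
    return size / abs(price - 300) if ok else 0.0

def bound(ids):
    """Optimistic score for a group of tuples, computed from its bounding box."""
    lo = min(HOUSES[i][0] for i in ids)
    hi = max(HOUSES[i][0] for i in ids)
    size_hi = max(HOUSES[i][1] for i in ids)
    gaps = []
    if lo <= 200:
        gaps.append(300 - min(hi, 200))
    if hi >= 400:
        gaps.append(max(lo, 400) - 300)
    return size_hi / min(gaps) if gaps else 0.0

def split(ids):
    """Halve a group by price order (hypothetical stand-in for index children)."""
    ordered = sorted(ids, key=lambda i: HOUSES[i][0])
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]

def top_k(k):
    tie = count()   # unique tiebreaker so heap entries never compare payloads
    heap = [(-bound(list(HOUSES)), next(tie), "region", list(HOUSES))]
    answers = []
    while heap and len(answers) < k:
        neg, _, kind, payload = heapq.heappop(heap)
        if kind == "tuple":
            # Exact score is at least every remaining bound, so this is the next answer.
            answers.append((payload, -neg))
        elif len(payload) == 1:
            t = payload[0]
            heapq.heappush(heap, (-G(*HOUSES[t]), next(tie), "tuple", t))
        else:
            for child in split(payload):
                heapq.heappush(heap, (-bound(child), next(tie), "region", child))
    return answers

best_two = top_k(2)
```

Because every bound is at least the score of each tuple it covers, the first tuple to surface from the queue is the true maximum, and later ones follow in score order.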
AIM 24
Putting together: Executing A* on the configured space
The bottom-up approach is always better than top-down
[Figure: two traversals of the configured state space (states M11 … M57; tuples 1, 5, 4, 2), one labeled top-down and one labeled bottom-up]
AIM 25
Experiments
Comparison vs.
  Boolean then ranking
  Ranking then Boolean
Metrics: nodes accessed = Nl + Nt
Settings:
  Benchmark queries over real dataset
  Controlled queries over synthetic dataset
AIM 26
Benchmark queries
Datasets: 19,706 real estate listings crawled online
Queries
  Q1: size * bedrms / |price-450k| : [40k <= price <= 50k]
  Q2: size * e^bedrms / |price-350k| : [price < 400k and size > 4000]
  Q3: size / price : [bedrms = 3 or bedrms = 4]
[Figure: comparison of BR_unclustered, BR_clustered, and OPT* on Q1, Q2, Q3]
AIM 27
Controlled queries
Datasets
  Three randomly generated datasets of 100k points
  Uniform, Gaussian, log-normal
Queries
  Linear average queries: (e.g., 0.4*a + 0.6*b)
  Nearest neighbor queries: (e.g., (x-3)^2 + (y-4)^2)
  Join queries: (0.4*R.a + 0.6*S.b: R.c=R.d)
AIM 28
Conclusion
Problem
  Study K-constrained optimization queries as Boolean + ranking
Abstraction
  Encode K-constrained optimization into a shortest path problem
Framework
  Develop OPT* to process K-constrained optimization
AIM 29
Thank you!
Questions?
AIM 30
How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing the upper bound for each region is costly
Random vs. sequential I/O
Assuming indices on every attribute?
Materialize state space for every query?
Exponential number of states when the number of attributes grows
  Not every attribute has an index on it
  Selectively choose the right index (attribute) to use
  We do perform experiments to study how the system scales with #attr
Your algorithm is not optimal because you change the space