Evaluating Top-k Queries over Web-Accessible Databases
Nicolas Bruno, Luis Gravano, Amélie Marian
Columbia University
February 27, 2002
“Top-k” Queries Natural in Many Scenarios
Example: an NYC restaurant recommendation service. Goal: find the best restaurants for a user:
- Close to the address "2290 Broadway"
- Price around $25
- Good rating
Query: a specification of flexible preferences.
Answer: the best k objects for a distance function.
Attributes often Handled by External Sources
- MapQuest returns the distance between two addresses.
- NYTimes Review gives the price range of a restaurant.
- Zagat gives a food rating to the restaurant.
“Top-k” Query Processing Challenges
- Attributes are handled by external sources (e.g., MapQuest distance).
- External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
- Existing algorithms do not handle all types of interfaces.
Processing Top-k Queries over Web-Accessible Data Sources
- Data and query model
- Algorithms for sources with different interfaces
- Our new algorithm: Upper
- Experimental results
Data Model
Top-k query: an assignment of weights and target values to attributes.

Example: < $25, "2290 Broadway", very good >
(preferred price, close-to address, preferred rating)

Weights: <4, 1, 2>; price is the most important attribute.
Attribute scores are combined in a scoring function.
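As a sketch of how such a scoring function might look (the attribute scores and weights below are illustrative, not from an actual service):

```python
# Hypothetical sketch of the scoring function described above: each attribute
# is scored against its target value in [0, 1], and the scores are combined
# with the query weights (here <4, 1, 2> for price, distance, rating).

def combined_score(attr_scores, weights):
    """Weighted average of per-attribute scores, each in [0, 1]."""
    assert len(attr_scores) == len(weights)
    return sum(s * w for s, w in zip(attr_scores, weights)) / sum(weights)

# Example: a restaurant scoring 0.9 on price, 0.5 on distance, 0.8 on rating.
score = combined_score([0.9, 0.5, 0.8], [4, 1, 2])
```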
Sorted Access Source (S-Source)
Returns objects sorted by their scores for a given query.
Example: Zagat.
Interface: GetNextS
Access time: tS(S)
Random Access Source (R-Source)
Returns the score of a given object for a given query.
Example: MapQuest.
Interface: GetScoreR
Access time: tR(R)
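The two access modes can be sketched as follows; the class and method names below are illustrative assumptions standing in for the GetNextS and GetScoreR interfaces:

```python
# Illustrative sketch of the two source interfaces. An S-Source supports
# sorted access (GetNext), an R-Source supports random access (GetScore).

class SSource:
    """Sorted-access source: returns objects in descending score order."""
    def __init__(self, scores):  # scores: {object_id: score}
        self._sorted = sorted(scores.items(), key=lambda kv: -kv[1])
        self._pos = 0

    def get_next(self):
        """GetNext: the next (object, score) pair, or None when exhausted."""
        if self._pos >= len(self._sorted):
            return None
        item = self._sorted[self._pos]
        self._pos += 1
        return item

class RSource:
    """Random-access source: returns the score of a given object."""
    def __init__(self, scores):
        self._scores = scores

    def get_score(self, obj):
        """GetScore: the score of obj for the (implicit) query."""
        return self._scores[obj]
```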
Query Model
- Attribute scores are between 0 and 1.
- Sequential access to sources.
- Score ties are broken arbitrarily.
- No wild guesses.
- One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
Query Processing Goals
- Process top-k queries over R-Sources.
- Return the exact answer to a top-k query q.
- Minimize query response time.
- The naïve solution (access all sources for all objects) is too expensive.
Example: NYC Restaurants
S-Source:
- Zagat: restaurants sorted by food rating.
R-Sources:
- MapQuest: distance between two input addresses. User address: "2290 Broadway".
- NYTimes Review: price range of the input restaurant. Target value: $25.
TA Algorithm for SR-Sources
- Perform sorted access sequentially to all SR-Sources.
- Completely probe every object found for all attributes, using random access.
- Keep the best k objects.
- Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold).
Fagin, Lotem, and Naor (PODS 2001)
Does NOT handle R-Sources
Our Adaptation of TA Algorithm for R-Sources: TA-Adapt
- Perform sorted access to the S-Source S.
- Probe every R-Source Ri for each newly found object.
- Keep the best k objects.
- Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold).
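The steps above can be sketched as follows. The list/dict source representations and the weighted-average scoring are illustrative assumptions, not the paper's actual interfaces:

```python
# A minimal sketch of TA-Adapt: sorted access to one S-Source, then a
# complete random-access probe of each new object, stopping at the threshold.

def ta_adapt(s_source, r_sources, weights, k):
    """s_source: list of (obj, score) pairs sorted by descending score.
    r_sources: list of {obj: score} dicts (one per R-Source).
    weights: [w_S, w_R1, w_R2, ...]. Returns the k best (score, obj) pairs."""
    total_w = sum(weights)
    best = []  # (final_score, obj), best first
    for obj, s_score in s_source:
        # Completely probe the newly found object on every R-Source.
        scores = [s_score] + [r[obj] for r in r_sources]
        final = sum(w * s for w, s in zip(weights, scores)) / total_w
        best = sorted(best + [(final, obj)], reverse=True)[:k]
        # Threshold: maximum possible score of an unseen object, whose
        # S-score is at most s_score and whose R-scores are at most 1.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(best) == k and best[-1][0] >= threshold:
            break
    return best
```

With the scores from the execution example (Zagat o1: 0.9, o2: 0.8, o3: 0.45; weights <3, 2, 1>, k = 1), this stops after fully probing o3 and returns o2 with final score 0.75.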
An Example Execution of TA-Adapt
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6
The threshold starts at 1.
1. GetNextS(q) → o1: Zagat score 0.9; threshold = 0.95
2. GetScoreR1(q, o1) → 0.1; threshold = 0.95
3. GetScoreR2(q, o1) → 0.5; o1 final score = 0.56; threshold = 0.95
4. GetNextS(q) → o2: Zagat score 0.8; threshold = 0.9
5. GetScoreR1(q, o2) → 0.7; threshold = 0.9
6. GetScoreR2(q, o2) → 0.7; o2 final score = 0.75; threshold = 0.9
7. GetNextS(q) → o3: Zagat score 0.45; threshold = 0.725
8. GetScoreR1(q, o3) → 0.6; threshold = 0.725
9. GetScoreR2(q, o3) → 0.3; o3 final score = 0.55; threshold = 0.725
The best score (o2, 0.75) is no less than the threshold (0.725), so o2 is the top-1 answer.
Total execution time = 9
Improvements over TA-Adapt
- TA-Opt: add a shortcut test after each random-access probe.
- TA-EP: exploit techniques for processing selections with expensive predicates; reorder accesses to R-Sources by best weight/time ratio.
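The weight/time reordering heuristic can be sketched as follows (the source names and costs below are illustrative):

```python
# Sketch of the access-reordering heuristic: probe R-Sources in decreasing
# weight/time ratio, so probes that shrink an object's score range fastest
# per unit of access time come first.

def probe_order(sources):
    """sources: list of (name, weight, access_time) triples.
    Returns the source names sorted by decreasing weight/time ratio."""
    return [name for name, w, t in
            sorted(sources, key=lambda s: s[1] / s[2], reverse=True)]

# A cheap low-weight source can still be probed first if its ratio is higher.
order = probe_order([("MapQuest", 2, 4), ("NYTimes", 1, 1)])
```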
The Upper Algorithm
Selects a pair (object, source) to probe next, based on the property:
The object with the highest upper bound will be probed before the top-k solution is reached.
[Figure: score ranges of probed objects shrink as probes complete; an object is kept as one of the top-k objects or discarded (marked x) once its upper bound falls too low.]
An Example Execution of Upper
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6
The threshold starts at 1.
1. GetNextS(q) → o1: Zagat score 0.9; upper bound of o1 = 0.95; threshold = 0.95
2. GetScoreR1(q, o1) → 0.1; upper bound of o1 drops to 0.65; threshold = 0.95
3. GetNextS(q) → o2: Zagat score 0.8; upper bound of o2 = 0.9; threshold = 0.9
4. GetScoreR1(q, o2) → 0.7; upper bound of o2 drops to 0.8; threshold = 0.9
5. GetNextS(q) → o3: Zagat score 0.45; upper bound of o3 = 0.725; threshold = 0.725
6. GetScoreR2(q, o2) → 0.7; o2 final score = 0.75; threshold = 0.725
o2's final score (0.75) is no lower than the threshold and every other upper bound, so o2 is the top-1 answer without fully probing o1 or o3.
Total execution time = 6 (versus 9 for TA-Adapt)
The Upper Algorithm
- Choose the object with the highest upper bound.
- If some unseen object can have a higher upper bound: access the S-Source S.
- Else: access the best R-Source Ri for the chosen object.
- Keep the best k objects.
- If the top-k objects have final scores higher than the maximum possible score of any other object, return the top-k objects.
Upper interleaves accesses on objects.
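A simplified sketch of these steps, under the same illustrative source representation as before. The full algorithm's expected-value source selection is reduced here to picking the unprobed R-Source with the best weight/time ratio:

```python
# Simplified Upper sketch: one sorted S-Source, random R-Sources, and
# weighted-average scores. Assumes the source holds at least k objects.

def upper(s_source, r_sources, weights, k, r_times):
    """s_source: list of (obj, score) sorted by descending score.
    r_sources: list of {obj: score} dicts; r_times: their access times.
    Returns the k best (score, obj) pairs."""
    total_w = sum(weights)
    n_r = len(r_sources)
    seen = {}     # obj -> [s_score, r1_score_or_None, r2_score_or_None, ...]
    pos = 0       # next position in the sorted source
    results = []

    def ubound(obj):
        # Upper bound on the final score: unknown R-scores are at most 1.
        return sum(w * (1.0 if s is None else s)
                   for w, s in zip(weights, seen[obj])) / total_w

    while len(results) < k:
        # Upper bound of any still-unseen object (the threshold).
        if pos < len(s_source):
            last_s = s_source[pos - 1][1] if pos > 0 else 1.0
            unseen_ub = (weights[0] * last_s + sum(weights[1:])) / total_w
        else:
            unseen_ub = -1.0
        cand = max(seen, key=ubound) if seen else None
        if cand is None or unseen_ub > ubound(cand):
            obj, s_score = s_source[pos]   # sorted access
            pos += 1
            seen[obj] = [s_score] + [None] * n_r
            continue
        scores = seen[cand]
        unprobed = [i for i, s in enumerate(scores[1:]) if s is None]
        if not unprobed:
            # Fully probed and highest upper bound: cand is a top-k answer.
            results.append((ubound(cand), cand))
            del seen[cand]
            continue
        # Probe the best remaining R-Source for cand (weight/time ratio).
        i = max(unprobed, key=lambda j: weights[j + 1] / r_times[j])
        scores[i + 1] = r_sources[i][cand]
    return results
```

On the execution example above (weights <3, 2, 1>, k = 1, unit access times), this sketch reproduces the six-access trace and returns o2 with final score 0.75 without fully probing o1 or o3.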
Selecting the Best Source
Upper relies on expected values to make its choices. Upper computes the "best subset" of sources that is expected to:
1. Compute the final score for the k top objects.
2. Discard the other objects as fast as possible.
Upper then chooses the best source in the "best subset", by best weight/time ratio.
Experimental Setting: Synthetic Data
- Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
- tR(Ri): integer between 1 and 10.
- tS(S) ∈ {0.1, 0.2, …, 1.0}.
- Query execution time: ttotal.
- Default: k = 50, 10,000 objects, uniform data.
- Results: average ttotal over 100 queries.
- Optimal assumes complete knowledge (unrealistic, but a useful performance bound).
Experiments: Varying Number of Objects Requested k
[Chart: average ttotal as a function of k (0 to 100), for Optimal, Upper, TA-EP, TA-Opt, and TA-Adapt.]
Experiments: Varying Number of Database Objects N
[Chart: average ttotal as a function of the number of objects in S-Source S (0 to 100,000), for Optimal, Upper, TA-EP, and TA-Opt.]
Experimental Setting: Real Web Data
S-Source:
- Verizon Yellow Pages (sorted by distance)
R-Sources:
- Subway Navigator: subway time
- AltaVista: popularity
- MapQuest: driving time
- NYTimes Review: food and price ratings
- Zagat: food, service, décor, and price ratings
Experiments: Real-Web Data
[Chart: number of random accesses nR (up to 6,000), for Upper, TA-EP, and TA-Opt.]
Evaluation Conclusions
- TA-EP and TA-Opt are much faster than TA-Adapt.
- Upper is significantly better than all versions of TA.
- Upper is close to optimal.
- Real-data experiments: Upper is faster than the TA adaptations.
Conclusion
- Introduced the first algorithm for top-k processing over R-Sources.
- Adapted TA to this scenario.
- Presented new algorithms: Upper and Pick (see paper).
- Evaluated our new algorithms with both real and synthetic data; Upper is close to optimal.
Current and Future Work
Relaxation of the source model:
- The current source model is limited.
- Allow any number of R-Sources and SR-Sources.
- Upper has good results even with only SR-Sources.
Parallelism:
- Define a query model for parallel access to sources.
- Adapt our algorithms to this model.
Approximate queries.
References
Top-k queries:
- Evaluating Top-k Selection Queries. S. Chaudhuri and L. Gravano. VLDB 1999.
TA algorithm:
- Optimal Aggregation Algorithms for Middleware. R. Fagin, A. Lotem, and M. Naor. PODS 2001.
Variations of TA:
- Query Processing Issues on Image (Multimedia) Databases. S. Nepal and V. Ramakrishna. ICDE 1999.
- Optimizing Multi-Feature Queries for Image Databases. U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000.
Expensive predicates:
- Predicate Migration: Optimizing Queries with Expensive Predicates. J. M. Hellerstein and M. Stonebraker. SIGMOD 1993.
Real-web Experiments
[Chart: ttotal (up to 6,000), for Upper, TA-EP, and TA-Opt.]
Real-web Experiments with Adaptive Time
[Chart: ttotal in seconds for Queries 1 to 4, comparing TA-Opt, TA-EP, and Upper.]
Relaxing the Source Model
[Chart: ttotal as a function of the number of SR-Sources (out of 6 sources), comparing Upper_Weight, Upper-Relaxed, TA-Upper, TAz-EP-NODUP, TAz-EP, Upper, and TA-EP.]
Upcoming Journal Paper
Variations of Upper:
- Selecting the best source
- Data structures
- Complexity analysis
Relaxing the source model:
- Adaptation of our algorithms
- New algorithms
Variations of the data and query model to handle real web data.
Optimality
TA is instance optimal over:
- algorithms that do not make wild guesses;
- databases that satisfy the distinctness property.
TAz is instance optimal over:
- algorithms that do not make wild guesses.
We give no complexity analysis of our algorithms, but an experimental evaluation instead.