1
Evaluating Top-K Selection Queries
Surajit Chaudhuri, Microsoft Research
Luis Gravano, Columbia University
2
Motivating Example
Find 4-bedroom houses priced at $350,000
Exact matches are often too restrictive
A ranking of the houses closest to the specification is more desirable
3
Motivating Example (cont.)
Find 4-bedroom houses priced at $350,000
House 1: 5 bedrooms; $400,000; Score=0.9
House 2: 4 bedrooms; $485,000; Score=0.8
House 3: 6 bedrooms; $785,000; Score=0.3
4
Top-K Queries over Precise Relational Data
Support approximate matches with minimal changes to the relational engine
Initial focus: selection queries with “equality” conditions
5
Outline
Definition of top-k queries
Execution alternatives
Mapping of top-k queries to selection queries
Experiments
6
Top-K Selection Queries
Specify an n-dimensional target point
Define scoring function
Specify k
Answer: k objects with the best score for the target point (i.e., the “top k” objects)
7
Specifying Top-K Queries using SQL
Select *
From R
Order [k] By Scoring_Function
8
Scoring Functions Measure Degree of Match
Assume attributes defined over metric space
Score on any one attribute is well defined
How to aggregate scores across attributes?
9
Scoring Functions
Normalize attribute scores to be in [0,1] range
Combine scores using popular aggregate functions
  Min
  Euclidean
  Sum, Max, …
10
Some Example Scoring Functions
Let $q = (q_1, \dots, q_n)$ be the target point and $t = (t_1, \dots, t_n)$ a tuple:
$\mathit{Min}(q, t) = \min\{1 - |q_1 - t_1|, \dots, 1 - |q_n - t_n|\}$
$\mathit{Euclidean}(q, t) = 1 - \sqrt{(q_1 - t_1)^2/n + \dots + (q_n - t_n)^2/n}$
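The two definitions above can be sketched directly in Python. This is a minimal illustration, assuming attribute values are already normalized to [0,1]; the function names `min_score` and `euclidean_score` are chosen here and do not come from the slides.

```python
import math

def min_score(q, t):
    """Min score: the worst per-attribute match, 1 - |q_i - t_i|."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def euclidean_score(q, t):
    """Euclidean score: 1 minus the root-mean-square attribute distance."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)
```

Both functions return 1.0 for an exact match and decrease as the tuple moves away from the target on any attribute.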
11
Executing Top-K Queries
Known techniques require at least one sequential scan (or a functional index)
  Evaluate Scoring_Function for each tuple
  Sort tuples [Carey & Kossmann ‘97; ‘98]
Question: How to avoid sequential scans?
Exploit implicit selectivity of top-k queries
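For contrast with the approach developed next, the scan-based baseline can be sketched as follows. This is only an illustration of "score every tuple, then keep the best k"; it is not the cited sorting technique. The in-memory list stands in for a relation, and `top_k_scan` and `min_score` are names assumed here.

```python
import heapq

def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_scan(tuples, q, k, score=min_score):
    """Baseline: evaluate the score of every tuple and keep the k best."""
    return heapq.nlargest(k, tuples, key=lambda t: score(q, t))

data = [(0.1, 0.1), (0.5, 0.5), (0.45, 0.6), (0.9, 0.9)]
best = top_k_scan(data, q=(0.5, 0.5), k=2)
```

Every tuple is touched once, which is the cost the mapping to a selection query tries to avoid.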
12
Mapping a Top-K Query to a Selection Query
Determine a search score S such that:
  Expected # of tuples with score > S is k
  No false dismissals
Turn the condition that score > S into range selection condition(s)
Evaluate selection query using existing query processor and access paths
13
Mapping a Top-K Query to a Selection Query
4 bedrooms; $350,000; k=10
Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)
Analyze scoring function to determine selection range:
  Bedrooms: [3, 5] and Price: [$250K, $450K]
14
Mapping a Search Score to a Selection Range
For search score $S$, target point $q = (q_1, q_2)$, and scoring function Min:
Selection range:
  $t_1 \in [q_1 - (1.0 - S),\ q_1 + (1.0 - S)]$
  $t_2 \in [q_2 - (1.0 - S),\ q_2 + (1.0 - S)]$
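The range computation can be sketched as below, assuming normalized attributes. `min_range` follows the selection range stated above for Min. `euclidean_box` is not given on the slides; it is derived here from the Euclidean definition: score > S implies $\sum_i (q_i - t_i)^2 < n(1 - S)^2$, so each $|q_i - t_i| < \sqrt{n}\,(1 - S)$, which yields an enclosing (not exact) box. Both function names are assumptions.

```python
import math

def min_range(q, s):
    """Per-attribute range [q_i - (1 - s), q_i + (1 - s)] for the Min function."""
    d = 1.0 - s
    return [(qi - d, qi + d) for qi in q]

def euclidean_box(q, s):
    """Box enclosing every tuple with Euclidean score > s.

    Derived bound (assumption of this sketch): |q_i - t_i| < sqrt(n) * (1 - s).
    """
    d = math.sqrt(len(q)) * (1.0 - s)
    return [(qi - d, qi + d) for qi in q]
```

For the same search score the Euclidean box is wider by a factor of $\sqrt{n}$, so it may also return tuples whose score is below S; those still have to be filtered by evaluating the scoring function.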
15
Determining a Search Score
Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:
  Score of t should be at least that of t’
  Therefore, Score cannot be high “far away” from target
    Sphere for Euclidean
    Box for Min
    …centered at target point
“Tightness” of enclosing range varies with scoring functions
16
The Min Scoring Function
17
The Euclidean Scoring Function
18
Comments on Mapping
Search score determines efficiency, not correctness
Issues in efficiency:
  Avoid retrieving too many tuples
  Avoid retrieving fewer than k top tuples (restarts)
How to determine good search scores?
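The efficiency-versus-correctness point can be illustrated with a small restart loop for the Min function. This is a sketch under stated assumptions: `range_query` is a stand-in for the engine's range selection, and lowering the search score by a fixed `step` on each restart is a hypothetical policy, not one described on the slides.

```python
def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def range_query(data, ranges):
    """Stand-in for a range selection evaluated by the query processor."""
    return [t for t in data
            if all(lo <= ti <= hi for ti, (lo, hi) in zip(t, ranges))]

def top_k_with_restarts(data, q, k, s, step=0.1):
    """Return (top-k tuples, number of restarts) for search score s."""
    restarts = 0
    while s > 0.0:
        d = 1.0 - s
        ranges = [(qi - d, qi + d) for qi in q]
        candidates = [t for t in range_query(data, ranges)
                      if min_score(q, t) > s]
        if len(candidates) >= k:
            candidates.sort(key=lambda t: min_score(q, t), reverse=True)
            return candidates[:k], restarts
        s -= step          # too few tuples: lower the score and restart
        restarts += 1
    # Search score exhausted: fall back to scoring every tuple.
    ranked = sorted(data, key=lambda t: min_score(q, t), reverse=True)
    return ranked[:k], restarts
```

A search score that is too high costs restarts and one that is too low retrieves extra tuples, but the returned top-k answer is the same in both cases.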
19
Determining Search Scores
Find k points in data
Compute their score
Set search score to lowest score
Challenges:
  Determining the initial k points to optimize execution
  Taking original query into account
20
Using Histograms
[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]
21
Picking K Representative “Tuples”
Collapse histogram bucket to a single representative point
  Furthest from Q in bucket (“NoRestarts”)
  Closest to Q in bucket (“Restarts”)
Assign bucket frequency to the single representative point
Include closest representative points until we have k tuples
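The procedure above can be sketched as follows. The bucket layout `(lo, hi, frequency)` with axis-aligned corners is an assumption of this sketch, as are all function names; "closest representative points" is interpreted here as "representatives with the highest score".

```python
def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def closest_point(q, lo, hi):
    """Point of the bucket [lo, hi] nearest to q (the "Restarts" choice)."""
    return tuple(min(max(qi, l), h) for qi, l, h in zip(q, lo, hi))

def furthest_point(q, lo, hi):
    """Corner of the bucket farthest from q (the "NoRestarts" choice)."""
    return tuple(l if abs(qi - l) >= abs(qi - h) else h
                 for qi, l, h in zip(q, lo, hi))

def search_score(buckets, q, k, score, representative):
    """Add representatives, best score first, until k tuples are covered."""
    reps = sorted(((score(q, representative(q, lo, hi)), freq)
                   for lo, hi, freq in buckets), reverse=True)
    covered = 0
    for s, freq in reps:
        covered += freq
        if covered >= k:
            return s
    return 0.0  # hypothetical fallback: histogram covers fewer than k tuples

# Hypothetical two-bucket histogram: (lower corner, upper corner, frequency).
buckets = [((0.4, 0.4), (0.6, 0.6), 4), ((0.6, 0.4), (1.0, 0.6), 20)]
q = (0.5, 0.5)
restarts_s = search_score(buckets, q, 10, min_score, closest_point)
norestarts_s = search_score(buckets, q, 10, min_score, furthest_point)
```

With these buckets the furthest-point choice gives the lower (more pessimistic) search score, so it retrieves more tuples in exchange for a smaller risk of restarting.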
22
Using Histograms: “NoRestarts”
[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]
23
Using Histograms: “Restarts”
[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]
24
Other Strategies for Determining Search Scores
Calculate search score for:
  n = NoRestarts (“pessimistic” extreme)
  r = Restarts (“optimistic” extreme)
Use intermediate scores:
  $\mathit{Inter}_1 = (2n + r)/3$
  $\mathit{Inter}_2 = (n + 2r)/3$
[Figure: search-score axis from 0 to 1 with NoRestarts and Restarts marked]
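As a small worked example of the two intermediate scores, the sketch below evaluates both formulas. The function name and the sample values 0.3 and 0.9 are illustrative assumptions, not figures from the slides.

```python
def intermediate_scores(n, r):
    """Scores one third and two thirds of the way from NoRestarts (n) to Restarts (r)."""
    inter1 = (2 * n + r) / 3
    inter2 = (n + 2 * r) / 3
    return inter1, inter2

inter1, inter2 = intermediate_scores(n=0.3, r=0.9)
```

For n = 0.3 and r = 0.9 this gives scores of about 0.5 and 0.7, which lie between the pessimistic and optimistic extremes.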
25
Evaluating the Generated Selection Query
Sequential scan
Intersection of a set of indexes, followed by data access
  Special case: index-only access
26
Indexes and Statistics
Indexes
  n-dim (concatenated-key) B-trees
Statistics
  MaxDiff as base 1-dim histogram
  Multidimensional histograms: AVI, Phased, MHist
27
Experimental Evaluation
Is mapping to selection queries an effective technique?
Sensitivity of relevant parameters:
  Scoring functions
  Data skew and dimensionality
  Statistics
28
Data Generation
Characterized by $Z = \langle z_1, \dots, z_n \rangle$
Generate $N$ tuples by Zipfian distribution $z_1$
Group tuples by $\mathit{attr}_1$
For a partition with $\mathit{attr}_1 = a$ with $N_1$ tuples:
  Generate $N_1$ values $w_1, \dots, w_{N_1}$ using Zipfian distribution $z_2$
  Create pairs $(a, w_1), \dots, (a, w_{N_1})$
Repeat steps to fill in all attribute values
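One possible reading of these steps is sketched below. Several details are assumptions of this sketch rather than statements from the slides: values are drawn by random sampling with weight $1/\mathit{rank}^z$ over a finite domain, every attribute shares the same domain, and "repeat" is taken to mean grouping by all attributes generated so far.

```python
import random
from collections import Counter

def zipf_values(count, z, domain, rng):
    """Draw `count` values; the value of rank i has weight 1 / i**z."""
    weights = [1.0 / (rank ** z) for rank in range(1, len(domain) + 1)]
    return rng.choices(domain, weights=weights, k=count)

def generate(n_tuples, skews, domain, rng):
    """Build n_tuples tuples, one attribute at a time, one skew per attribute."""
    prefixes = [()] * n_tuples
    for z in skews:
        groups = Counter(prefixes)      # partition by the attributes so far
        prefixes = []
        for prefix, size in groups.items():
            for w in zipf_values(size, z, domain, rng):
                prefixes.append(prefix + (w,))
    return prefixes

rng = random.Random(0)
data = generate(1000, skews=[1.0, 0.5], domain=list(range(10)), rng=rng)
```

A skew of 0 gives uniform weights, and larger values concentrate the tuples on the first few domain values.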
29
Metrics for Comparison
Fraction of data tuples accessed may be compared to:
  Ideal: k
  Worst case: size of data set
% of restarts
30
Exploring Limits
Intrinsic limitations of range-query approach:
  Enclose actual top-k tuples in tight n-rectangle
  Retrieve all tuples in n-rectangle
Less than 1% of database tuples in n-rectangle (k=10; 100,000 tuples)
Effect of retrieving tuples with score > S using an n-rectangle
31
Effect of Scoring Functions
Min has little/no gap between target region and enclosing n-rectangle
As k increases, fraction of retrieved tuples grows slowest for Min
Euclidean performs worse
  Less tight n-rectangle
32
Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)
33
Effect of Mapping Strategies and Histograms
Multidimensional histograms aid computation of tight search scores
NoRestarts dominates at high data skew
34
Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)
35
Restarts v. Data Skew (PHASED histogram of 5KB; n=3)
36
Related Work (1)
[Fagin ‘96; ‘98]
  Multimedia attributes with query “subsystem”
  Multiple index scans
  Independence assumption
[Chaudhuri & Gravano ‘96]
  Multimedia attributes with query “subsystem”
  Map top-k queries to “selection” queries
  Independence assumption
  Limited scoring functions
37
Related Work (2)
[Carey & Kossmann ‘97; ‘98]
  Optimized sorting phase using k
Nearest-neighbor literature
[Donjerkovic & Ramakrishnan ‘99]
  Probabilistic optimization framework
  No multidimensional scoring functions
  Independence assumptions
38
Summary
Defined mapping of top-k queries to traditional selection queries
Exploit existing database statistics and query processors
Studied effect of scoring functions, data skew, statistics on mapping
Full experimental analysis forthcoming!
39
Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)
40
Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)
41
Restarts v. n (PHASED histogram of 5KB; Z21)
42
Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)
43
Restarts v. k (PHASED histogram of 5KB; Z21; n=3)
44
Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)
45
Tuples Retrieved v. Histogram Size (Census Database; PHASED)
46
Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)
47
The Sum Scoring Function
48
The Max Scoring Function