1
Efficient Computation of Diverse Query Results
Erik Vee
joint work with
Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer-Yahia
2
Motivation
• Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
• … or looking for cars on Yahoo! Autos, and seeing only Hondas
• … or looking for jobs on Yahoo! HotJobs, and seeing only jobs from Yahoo!
• It is not enough to simply give the best response
– Need diversity of answers
5
Diversity Search
• If we display 30 results in 5 categories, then we should show 6 items from each category
– NB: Our goal is to show a range of choices, not a representative sample
– Recurse on each subgroup of items
• Diversity is crucial for users looking for a range of results
– E.g. shopping, information gathering/research
• Useful for aiding navigation
– Users tend to favor search-and-click over hierarchies
• Likely to give at least one good answer on first page
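The "6 items from each category" rule can be sketched as a small quota computation. This is an illustration only, not code from the talk: the function name `balanced_quotas` and the idea of passing in per-category match counts are assumptions made for the example.

```python
def balanced_quotas(k, available):
    """Split k result slots across categories as evenly as possible.

    available[i] is the number of matching items in category i
    (assumed to be known for this sketch). No category receives
    more slots than it has items.
    """
    quotas = [0] * len(available)
    remaining = min(k, sum(available))
    while remaining > 0:
        # Hand out one slot per round to every category that still has items.
        for i, capacity in enumerate(available):
            if remaining == 0:
                break
            if quotas[i] < capacity:
                quotas[i] += 1
                remaining -= 1
    return quotas


# 30 slots, 5 categories with plenty of matches each.
even_split = balanced_quotas(30, [100] * 5)
# A category with a single match gives its unused share to its sibling.
uneven_split = balanced_quotas(4, [1, 10])
```

Applying the same function again inside each category would correspond to the "recurse on each subgroup" step.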
6
Contributions
• Formally defined diversity search
– Other diversity-like approaches use extensive post-processing or are not query-dependent
• Proved that traditional IR engines cannot produce guaranteed diverse results
• Gave novel algorithms to produce diverse results
– Both one-pass (data streaming) and probing algorithms
• Experimentally verified that these algorithms are nearly as fast as normal top-k processing
– Much faster than post-processing techniques
7
What about other approaches?
• If not diverse enough, query again
– E.g. if all results are from one company, issue another query
– Bad for latency
• Issue multiple queries (one for Honda, one for Toyota, ...)
– Can be prohibitively expensive (kills throughput), though latency is fine
– Some applications may have dozens of top-level categories
• Fetch extra results, then find the most diverse set among them
– Not guaranteed to get good results
– Requires fetching additional results unnecessarily
• Fetch all results, then find a diverse set
– Many times slower
• Take a random sample of results
– Misses important results
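A toy illustration of why fetching extra results gives no guarantee. The result list below is hypothetical (invented for this example, not data from the talk), and `distinct_makes_in_prefix` is an assumed helper name.

```python
# Hypothetical matches in the order the engine would return them.
matches = ["Honda"] * 6 + ["Toyota"] * 4


def distinct_makes_in_prefix(results, fetch):
    """Count how many different makes appear among the first `fetch` results."""
    return len(set(results[:fetch]))


k = 2
makes_in_2k = distinct_makes_in_prefix(matches, 2 * k)        # fetch extra
makes_in_all = distinct_makes_in_prefix(matches, len(matches))  # fetch all
```

Here the first 2k results contain a single make, so no post-processing step over them can return a Toyota. Only fetching everything exposes both makes, which is the slow option.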
8
What about clever scoring?
• Can we give each item a global “diversity” score, then find the top-k using this?
– Proved in paper: There is no global score that gives guaranteed diversity
• Can we give each item a local “diversity” score, so that it has a different score in each list of the inverted index?
– Proved in paper: There is no list-based scoring of the items that gives guaranteed diversity
9
Outline
• Definition of diversity
• Overview of our algorithms
• Our experimental results
10
Diversity search
• Over all possible sets of top-k results that match the query, return the set with the most diversity
• Paper defines diversity more precisely
– Focus on hierarchy view of diversity (in next slides)
• For scored diversity (in which each item has a score)
– Over all possible sets of top-k results with maximum score, return the set with the highest diversity
– Note: Diversity is only useful when the score is not too fine-grained
11
Diversity definition (by picture)
Determine a category ordering: Make, Model, Color, Year, Text
This implicitly defines a hierarchy
12
Hierarchy after a query
Diversity search always returns valid results
E.g. Query text contains `Low`
All siblings return the same number of results (or as close as possible)
14
Returning top-k diverse results
Diversity search always returns valid results
E.g. Query text contains `Low`
Suppose we return k = 4 results
Must return 2 Hondas and 2 Toyotas
Will not return 2 green Civics
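The k = 4 example can be reproduced with a short recursive sketch over Dewey ID tuples (the numbering scheme introduced later in the talk). This is a naive, fetch-everything illustration of the definition, not the OnePass or Probe algorithm. The name `diverse_top_k` and the tie-breaking rule (lowest branch number first) are assumptions.

```python
from collections import defaultdict


def diverse_top_k(ids, k, depth=0):
    """Pick up to k Dewey IDs so that siblings at every level share slots evenly."""
    ids = sorted(ids)
    if k <= 0 or not ids:
        return []
    if len(ids) <= k or depth == len(ids[0]):
        return ids[:k]
    # Group the matches by their branch number at this level of the hierarchy.
    groups = defaultdict(list)
    for dewey in ids:
        groups[dewey[depth]].append(dewey)
    # Round-robin, one slot at a time, so sibling quotas differ by at most one
    # (unless a sibling runs out of items).
    quota = dict.fromkeys(groups, 0)
    left = k
    while left > 0:
        for branch in sorted(groups):
            if left and quota[branch] < len(groups[branch]):
                quota[branch] += 1
                left -= 1
    # Recurse into each sibling with its share of the slots.
    picked = []
    for branch in sorted(groups):
        picked += diverse_top_k(groups[branch], quota[branch], depth + 1)
    return picked


# Matches for 'low' as (make, model, color, year, text) branch numbers;
# make 0 is Honda and make 1 is Toyota in the slides' example.
low = [tuple(int(c) for c in s) for s in
       ["00000", "00010", "00100", "00200", "00300",
        "00310", "10000", "11000", "12000", "13000"]]
result = diverse_top_k(low, 4)
```

With this data the sketch picks two Hondas (00000 and 00100, Civics of two different colors) and two Toyotas (10000 and 11000). Any other choice that keeps sibling counts balanced would be equally diverse under the definition above.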
16
Algorithms
• One Pass
– Never goes backward (just one pass over the dataset)
– Maintains a top-k diverse set based on what has been seen
– Jumps ahead if more results will not help diversity
– Optimal one-pass algorithm
• Probe
– May jump forward or backward (i.e. probes)
– Proved: at most 2k probes for a top-k diverse result set
• Both also work for scored diversity
17
Dewey IDs
Every branch gets a number
Every item is then labeled, e.g. 0.2.0.1.0 is Honda Odyssey Green ’06 ‘Good miles’
Create inverted index
low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
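A minimal sketch of how Dewey IDs and the inverted index could be built. The listings below are hypothetical, and numbering branches in order of first appearance is an assumption (the slide only says every branch gets a number). In this sketch, two listings that agree on every attribute would share an ID.

```python
from collections import defaultdict


def assign_dewey_ids(rows):
    """Give every row a Dewey ID: one branch number per attribute level."""
    # Maps the path of values seen so far to {child value: branch number}.
    child_numbers = defaultdict(dict)
    ids = []
    for row in rows:
        path, dewey = (), []
        for value in row:
            branches = child_numbers[path]
            # A new child gets the next free number under its parent.
            number = branches.setdefault(value, len(branches))
            dewey.append(number)
            path += (value,)
        ids.append(tuple(dewey))
    return ids


def build_inverted_index(rows, ids, text_column=-1):
    """Map each lower-cased word of the text attribute to a sorted list of Dewey IDs."""
    index = defaultdict(list)
    for row, dewey in zip(rows, ids):
        for word in set(row[text_column].lower().split()):
            index[word].append(dewey)
    return {word: sorted(postings) for word, postings in index.items()}


# Hypothetical listings in category order: Make, Model, Color, Year, Text.
rows = [
    ("Honda", "Civic", "Green", "2007", "Low miles"),
    ("Honda", "Civic", "Green", "2006", "Low price"),
    ("Honda", "Civic", "Blue", "2007", "Good miles"),
    ("Honda", "Odyssey", "Green", "2007", "Low miles"),
    ("Toyota", "Prius", "Tan", "2007", "Low price"),
]
ids = assign_dewey_ids(rows)
index = build_inverted_index(rows, ids)
```

Because each posting list is kept in Dewey ID order, items from the same subtree sit next to each other in the list, which is the property the Next and Prev operations rely on.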
18
Next and Prev
Supports two basic operations: Next and Prev
E.g. Query text contains `Low`
Next(0.0.3.2.2) = 1.0.0.0.0
Prev(2.0.0.0.0) = 1.3.0.0.0
Inverted index for ‘Low’ lists all items in Dewey ID order
In general, must find intersection of lists (still easy)
low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
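Next and Prev over a sorted posting list can be sketched with binary search. The names `next_id`, `prev_id` and `next_in_all` are assumptions, as is the inclusive reading (Next returns the smallest ID at or after its argument, Prev the largest at or before it). `next_in_all` is one possible reading of "find intersection of lists", not necessarily the authors' method.

```python
from bisect import bisect_left, bisect_right


def parse(dewey):
    """Turn '1.3.0.0.0' into the tuple (1, 3, 0, 0, 0)."""
    return tuple(int(part) for part in dewey.split("."))


# Posting list for 'low' from the slide, in Dewey ID order.
LOW = [parse(".".join(s)) for s in
       ["00000", "00010", "00100", "00200", "00300",
        "00310", "10000", "11000", "12000", "13000"]]


def next_id(postings, dewey):
    """Smallest ID in the list that is >= dewey, or None if there is none."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest ID in the list that is <= dewey, or None if there is none."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


def next_in_all(lists, dewey):
    """Next ID >= dewey present in every list (expects at least one list)."""
    candidate = dewey
    while True:
        hits = [next_id(postings, candidate) for postings in lists]
        if None in hits:
            return None
        if all(hit == hits[0] for hit in hits):
            return hits[0]
        # Leap ahead to the largest candidate and try again.
        candidate = max(hits)


slide_next = next_id(LOW, parse("0.0.3.2.2"))
slide_prev = prev_id(LOW, parse("2.0.0.0.0"))
```

The two calls at the end reproduce the slide's examples: `slide_next` is 1.0.0.0.0 and `slide_prev` is 1.3.0.0.0.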
19
One pass (for k = 2)
First finds 00000, 00010
Now knows Civic Green no longer helps!
Jumps by calling next(0.0.1.0.0)
Finds 00100, removes 00010
Now knows Civic no longer helps!
Jumps by calling next(0.1.0.0.0)
Finds 01000, removes 00100
Knows to stop
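The arguments passed to next in this walk-through follow a simple pattern: once a subtree can no longer help diversity, skip to the first ID that could belong to its next sibling. A small sketch of that computation follows. `jump_target` is an assumed name, and this covers only the jump step, not the full OnePass algorithm.

```python
def jump_target(dewey, depth):
    """First possible ID after the subtree rooted at dewey[:depth].

    depth is the number of leading components that name the saturated
    subtree, e.g. depth=3 for "Civic Green" (make.model.color).
    """
    if not 1 <= depth <= len(dewey):
        raise ValueError("depth must be between 1 and len(dewey)")
    prefix = list(dewey[:depth])
    prefix[-1] += 1  # move to the next sibling of the saturated subtree
    # Pad with zeros to get the smallest ID under that sibling.
    return tuple(prefix) + (0,) * (len(dewey) - depth)


# Civic Green (0.0.0) is saturated after seeing 00000 and 00010.
after_civic_green = jump_target((0, 0, 0, 1, 0), 3)
# Civic (0.0) is saturated after seeing 00100.
after_civic = jump_target((0, 0, 1, 0, 0), 2)
```

These two calls give 0.0.1.0.0 and 0.1.0.0.0, the same arguments used in the calls to next above.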
22
Probe (for k = 4)
Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items
Wants another Honda
Calls prev(0.∞.∞.∞.∞)
Discovers there are only 2 top-level categories
23
Probe (for k = 4)
Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items
Wants another Honda
Calls prev(0.∞.∞.∞.∞)
Why not next(0.1.0.0.0)?
If Honda has only one child, then it will return a Toyota!
24
Probe (for k = 4)
Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items
Wants another Honda
Calls prev(0.∞.∞.∞.∞)
Finds 00310
Wants another Toyota
Calls next(1.0.0.0.0)
Finds 10000
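The four probes in this example can be replayed against the 'low' posting list shown earlier. This sketch only replays the slide's probes and is not the general Probe algorithm. The helper names and the use of `float("inf")` for the ∞ components are assumptions.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # stands in for the ∞ components on the slides

# Posting list for 'low', in Dewey ID order (make 0 = Honda, make 1 = Toyota).
LOW = [tuple(int(c) for c in s) for s in
       ["00000", "00010", "00100", "00200", "00300",
        "00310", "10000", "11000", "12000", "13000"]]


def next_id(postings, dewey):
    """Smallest ID in the list that is >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest ID in the list that is <= dewey, or None."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


first = next_id(LOW, (0, 0, 0, 0, 0))                 # first item: a Honda
last = prev_id(LOW, (INF,) * 5)                       # last item: a Toyota
second_honda = prev_id(LOW, (0, INF, INF, INF, INF))  # last Honda in the list
second_toyota = next_id(LOW, (1, 0, 0, 0, 0))         # first Toyota in the list
probe_result = sorted([first, last, second_honda, second_toyota])

# The pitfall from the "Why not next(0.1.0.0.0)?" slide: in this list
# Honda has a single child (0.0), so the probe lands on a Toyota.
pitfall = next_id(LOW, (0, 1, 0, 0, 0))
```

The replay ends with 00000, 00310, 10000 and 13000, i.e. two Hondas and two Toyotas taken from opposite ends of each make's range, while `pitfall` comes back as 1.0.0.0.0.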
27
Results
• Dataset consisted of listings from Yahoo! Autos
• Queries were synthetic to test various parameters
– Selectivity, # predicates, # results
• Preprocessing time for 100K listings < 5 min
– Times shown are for 5K queries
• 4 algorithms
– Basic: No diversity
– Naïve: Fetch everything, post-process
– OnePass: Our algorithm. Takes just one pass over data
– Probe: Our algorithm. May make multiple probes into data
28
Comparable time for diversity search
(Two chart panels: unscored, scored)
Basic: No diversity
Naïve: Many times slower
OnePass: Close to Probe
Probe: Within a factor of 2 of no diversity
MultiQuery (not shown): Latency close to Basic, but throughput many times worse
29
Results summary
• Getting diverse results is not too much slower than getting non-diverse results
– Many times faster than naïve approaches
• Multi-query approach has even worse throughput than naïve
– But keeps latency low
• How does this compare to getting extra results, then finding a diverse subset?
– Getting 2k results instead of k is about twice as slow
– Plus, does not guarantee diverse results
30
Conclusions
• Can get guaranteed diversity, taking time close to a normal top-k query
– Almost as fast as, or faster than, approaches without guarantees
– Diversity at every level
• Works even when items have scores
• Needs a different algorithm than traditional IR engines
– Proved this in paper (under standard notions)
• Are there approximate notions that can use existing IR machinery?