Date post: 06-Jul-2015
Upload: hondafanatics
Efficient Computation of Diverse Query Results. Erik Vee, joint work with Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer Yahia
Transcript
Page 1: ppt

1

Efficient Computation of Diverse Query Results

Erik Vee

joint work with

Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer Yahia

Page 2: ppt

2

Motivation

• Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks

Page 3: ppt

3

Motivation

• Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks

• … or looking for cars on Yahoo! Autos, and seeing only Hondas

Page 4: ppt

4

Motivation

• Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks

• … or looking for cars on Yahoo! Autos, and seeing only Hondas

• … or looking for jobs on Yahoo! Hotjobs, and seeing only jobs from Yahoo!

• It is not enough to simply give the best response

– Need diversity of answers

Page 5: ppt

5

Diversity Search

• If we display 30 results in 5 categories, then we should show 6 items from each category

– NB: Our goal is to show a range of choices, not a representative sample

– Recurse on each subgroup of items

• Diversity is crucial for users looking for a range of results

– e.g. Shopping, information gathering/research

• Useful for aiding navigation

– Users tend to favor search-and-click over hierarchies

• Likely to give at least one good answer on first page

Page 6: ppt

6

Contributions

• Formally define diversity search

– Other diversity-like approaches use extensive post-processing or are not query-dependent

• Proved that traditional IR engines cannot produce guaranteed diverse results

• Gave novel algorithms to produce diverse results

– Both one-pass (data streaming) and probing algorithms

• Experimentally verified that these algorithms are nearly as fast as normal top-k processing

– Much faster than post-processing techniques

Page 7: ppt

7

What about other approaches?

• If not diverse enough, query again

– E.g. if all results are from one company, issue another query

– Bad for latency

• Issue multiple queries (one for Honda, one for Toyota...)

– Can be prohibitively expensive (kills throughput), although latency is fine

– Some applications may have dozens of top-level categories

• Fetch extra results, then find the most diverse set from these

– Not guaranteed to get good results

– Requires fetching additional results unnecessarily

• Fetch all results, then find diverse set

– Many times slower

• Random sample of results

– Misses important results this way

Page 8: ppt

8

What about clever scoring?

• Can we give each item a global “diversity” score, then find top-k using this?

– Proved in the paper: There is no global score that gives guaranteed diversity

• Can we give each item a local “diversity” score, so that it has a different score in each list of the inverted index?

– Proved in the paper: There is no list-based scoring of the item that gives guaranteed diversity

Page 9: ppt

9

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Page 10: ppt

10

Diversity search

• Over all possible sets of top-k results that match query, return set with most diversity

• Paper defines diversity more precisely

– Focus on hierarchy view of diversity (in next slides)

• For scored diversity (in which each item has a score)

– Over all possible sets of top-k results with maximum score, return set with highest diversity

– Note: Diversity only useful when score not too fine-grained

Page 11: ppt

11

Diversity definition (by picture)

Determine a category ordering: Make, Model, Color, Year, Text

Implicitly defines hierarchy

Page 12: ppt

12

Hierarchy after a query

Diversity search always returns valid results

E.g. Query text contains `Low`

Page 13: ppt

13

Hierarchy after a query

Diversity search always returns valid results

E.g. Query text contains `Low`

All siblings return the same number of results (or as close as possible)

Page 14: ppt

14

Returning top-k diverse results

Diversity search always returns valid results

E.g. Query text contains `Low`

Suppose return k=4 results

Must return 2 Hondas and 2 Toyotas

Will not return 2 green Civics
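
The balancing rule on these slides (siblings get the same number of results, or as close as possible, applied again inside each subgroup) can be sketched in Python. This is a brute-force illustration over an already-fetched list, not the talk's algorithm. The names `allocate` and `diverse` are hypothetical, each item is assumed to be a tuple of branch numbers (make, model, color, year, text), the item list is illustrative, and ties are broken in favor of earlier siblings, which the slides do not specify.

```python
from itertools import groupby


def allocate(capacities, k):
    """Split k across sibling groups as evenly as possible,
    never giving a group more than it actually contains."""
    quota = [0] * len(capacities)
    remaining = min(k, sum(capacities))
    while remaining > 0:
        # One round-robin pass: each non-full group gets one more slot.
        for i, cap in enumerate(capacities):
            if remaining > 0 and quota[i] < cap:
                quota[i] += 1
                remaining -= 1
    return quota


def diverse(ids, k, level=0):
    """Pick up to k items from a sorted list of tuples, balanced at every level."""
    if k <= 0 or not ids:
        return []
    if level == len(ids[0]):
        return ids[:k]
    # Siblings at this level are runs of items sharing the same branch number.
    groups = [list(g) for _, g in groupby(ids, key=lambda d: d[level])]
    quotas = allocate([len(g) for g in groups], k)
    picked = []
    for group, quota in zip(groups, quotas):
        picked.extend(diverse(group, quota, level + 1))
    return picked


# Illustrative matches: make 0 = Honda, make 1 = Toyota (assumed labels).
MATCHES = [
    (0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
    (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
    (1, 2, 0, 0, 0), (1, 3, 0, 0, 0),
]

print(allocate([10, 10, 10, 10, 10], 30))  # 30 results over 5 categories
print(diverse(MATCHES, 4))                 # 2 items per make, distinct colors
```

With k = 4 the sketch returns two items from each make, and the two Hondas differ in color, so it never returns two items from the same color branch while other colors are available.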

Page 15: ppt

15

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Page 16: ppt

16

Algorithms

• One Pass

– Never goes backward (just one pass over dataset)

– Maintains a top-k diverse set based on what has been seen

– Jumps ahead if more results will not help diversity

– Optimal one-pass algorithm

• Probe

– May jump forward or backward (i.e. probes)

– Proved: at most 2k probes are needed for a top-k diverse result set

• Both also work for scored diversity

Page 17: ppt

17

Dewey IDs

Every branch gets a number

Every item is then labeled, e.g. 0.2.0.1.0 is Honda Odyssey Green ’06 ‘Good miles’

Create inverted index

low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
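
A minimal Python sketch of the labeling step, assuming each listing is a tuple ordered as (make, model, color, year, text) and that branches are numbered in order of first appearance. The function names and the sample listings are hypothetical; the slides do not say how branch numbers are assigned.

```python
from collections import defaultdict


def assign_dewey_ids(listings):
    """Give every branch a number and label each listing with its path of numbers."""
    child_numbers = defaultdict(dict)  # path prefix -> {value: branch number}
    ids = []
    for listing in listings:
        prefix, dewey = (), []
        for value in listing:
            branches = child_numbers[prefix]
            # Reuse the branch number if seen before, else take the next free one.
            number = branches.setdefault(value, len(branches))
            dewey.append(number)
            prefix += (value,)
        ids.append(tuple(dewey))
    return ids


def build_inverted_index(listings, ids):
    """Map each word of the text attribute to a sorted list of Dewey IDs."""
    index = defaultdict(list)
    for listing, dewey in zip(listings, ids):
        for word in set(listing[-1].lower().split()):
            index[word].append(dewey)
    for postings in index.values():
        postings.sort()  # keep every list in Dewey ID order
    return index


# Hypothetical sample data, not taken from the talk.
LISTINGS = [
    ("Honda", "Civic", "Green", "2007", "Low miles"),
    ("Honda", "Civic", "Green", "2006", "Low price"),
    ("Honda", "Civic", "Blue", "2007", "Good miles"),
    ("Toyota", "Prius", "Tan", "2007", "Low miles"),
]

IDS = assign_dewey_ids(LISTINGS)
INDEX = build_inverted_index(LISTINGS, IDS)
print(IDS)
print(INDEX["low"])
```

Storing IDs as tuples means Python's ordinary tuple comparison already gives Dewey ID order, so sorting a postings list needs no custom key.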

Page 18: ppt

18

Next and Prev

Supports two basic operations: Next and Prev

E.g. Query text contains `Low`

Next(0.0.3.2.2) = 1.0.0.0.0

Prev(2.0.0.0.0) = 1.3.0.0.0

Inverted index for ‘Low’ lists all items in Dewey ID order

In general, must find intersection of lists (still easy)

low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
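
Because the list is kept in Dewey ID order, both operations reduce to a binary search. The sketch below assumes Next returns the smallest listed ID greater than or equal to its argument and Prev the largest listed ID less than or equal to it; that inclusive reading is an assumption drawn from the examples on the slides. The names `next_id` and `prev_id` are hypothetical, and IDs are written as tuples.

```python
from bisect import bisect_left, bisect_right

# The 'low' list from the slide, written as tuples.
LOW = [
    (0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
    (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
    (1, 2, 0, 0, 0), (1, 3, 0, 0, 0),
]


def next_id(postings, dewey):
    """Smallest listed ID >= dewey, or None if there is none."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest listed ID <= dewey, or None if there is none."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


print(next_id(LOW, (0, 0, 3, 2, 2)))  # (1, 0, 0, 0, 0)
print(prev_id(LOW, (2, 0, 0, 0, 0)))  # (1, 3, 0, 0, 0)
```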

Page 19: ppt

19

One pass (for k = 2)

First finds 00000, 00010

Now knows Civic Green no longer helps

Jumps by calling next(0.0.1.0.0)

Page 20: ppt

20

One pass (for k = 2)

First finds 00000, 00010

Now knows Civic Green no longer helps!

Jumps by calling next(0.0.1.0.0)

Finds 00100, removes 00010

Now knows Civic no longer helps!

Jumps by calling next(0.1.0.0.0)

Page 21: ppt

21

One pass (for k = 2)

First finds 00000, 00010

Now knows Civic Green no longer helps!

Jumps by calling next(0.0.1.0.0)

Finds 00100, removes 00010

Now knows Civic no longer helps!

Jumps by calling next(0.1.0.0.0)

Finds 01000, removes 00100. Knows to stop
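
The jump targets in this walkthrough follow a simple pattern: to skip the rest of a branch, increment that branch's number and zero everything below it. A small Python sketch of only that step, with the hypothetical helper name `jump_target` and levels counted from 0 (make) to 4 (text); how the algorithm decides that a branch no longer helps is not shown here.

```python
def jump_target(dewey, level):
    """First possible ID after the whole branch identified by dewey[:level + 1]."""
    skipped = dewey[level] + 1
    zeros = (0,) * (len(dewey) - level - 1)
    return dewey[:level] + (skipped,) + zeros


# Civic Green (level 2, color) no longer helps after seeing 00010.
print(jump_target((0, 0, 0, 1, 0), 2))  # (0, 0, 1, 0, 0)

# Civic (level 1, model) no longer helps after seeing 00100.
print(jump_target((0, 0, 1, 0, 0), 1))  # (0, 1, 0, 0, 0)
```

The result is passed to Next, which returns the first matching item at or after that position.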

Page 22: ppt

22

Probe (for k = 4)

Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items

Wants another Honda

Calls prev(0.∞.∞.∞.∞)

Discovers there are only 2 top-level categories

Page 23: ppt

23

Probe (for k = 4)

Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items

Wants another Honda

Calls prev(0.∞.∞.∞.∞)

Why not next(0.1.0.0.0)?

If Honda has only one child, then it will return a Toyota!

Page 24: ppt

24

Probe (for k = 4)

Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items

Wants another Honda

Calls prev(0.∞.∞.∞.∞)

Finds 00310

Wants another Toyota

Calls next(1.0.0.0.0)

Page 25: ppt

25

Probe (for k = 4)

Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to find first and last items

Wants another Honda

Calls prev(0.∞.∞.∞.∞)

Finds 00310

Wants another Toyota

Calls next(1.0.0.0.0)

Finds 10000
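
The four probes in this walkthrough can be replayed in Python against the ‘low’ list. This sketch hard-codes the probe sequence from the slides rather than implementing the general Probe algorithm. It assumes inclusive Next and Prev, represents ∞ with `float("inf")`, and uses the hypothetical names `next_id` and `prev_id`.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

LOW = [
    (0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
    (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
    (1, 2, 0, 0, 0), (1, 3, 0, 0, 0),
]


def next_id(postings, dewey):
    """Smallest listed ID >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest listed ID <= dewey, or None."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


# Probes 1 and 2: first and last matching items.
first = next_id(LOW, (0, 0, 0, 0, 0))
last = prev_id(LOW, (INF, INF, INF, INF, INF))

# Probe 3: another Honda, taken from the far end of the Honda branch.
another_honda = prev_id(LOW, (first[0], INF, INF, INF, INF))

# Probe 4: another Toyota, taken from the start of the Toyota branch.
another_toyota = next_id(LOW, (last[0], 0, 0, 0, 0))

result = sorted({first, last, another_honda, another_toyota})
print(result)

# The pitfall from the slides: next(0.1.0.0.0) leaves the Honda branch.
print(next_id(LOW, (0, 1, 0, 0, 0)))  # (1, 0, 0, 0, 0), a Toyota
```

Probing from the far end of a branch with Prev stays inside that branch whenever it has any second item, which is why the walkthrough prefers it to Next here.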

Page 26: ppt

26

Outline

• Definition of diversity

• Overview of our algorithms

• Our experimental results

Page 27: ppt

27

Results

• Dataset consisted of listings from Yahoo! Autos

• Queries were synthetic to test various parameters

– Selectivity, # predicates, # results

• Preprocessing time for 100K listings < 5 min

– Times shown are for 5K queries

• 4 algorithms

– Basic: No diversity

– Naïve: Fetch everything, post-process

– OnePass: Our algorithm. Takes just one pass over data

– Probe: Our algorithm. May make multiple probes into data

Page 28: ppt

28

Comparable time for diversity search

[Figure panels: unscored, scored]

Basic: No diversity

Naïve: Many times slower

OnePass: Close to Probe

Probe: Within factor 2 of no diversity

MultiQuery (not shown): Latency close to Basic, but throughput many times worse

Page 29: ppt

29

Results summary

• Getting diverse results is not too much slower than getting non-diverse results

– Many times faster than naïve approaches

• Multi-query approach has even worse throughput than naïve

– But keeps latency low

• How does this compare to getting extra results, then finding a diverse subset?

– Getting 2k results instead of k is about twice as slow

– Plus, does not guarantee diverse results

Page 30: ppt

30

Conclusions

• Can get guaranteed diversity, taking time close to normal top-k query

– Almost as fast or faster than non-guaranteed results

– Diversity at every level

• Works even when items have scores

• Needs a different algorithm than traditional IR engines

– Proved this in paper (under standard notions)

• Are there approximate notions that can use existing IR machinery?
