
Using Trees to Depict a Forest

Date post: 07-Feb-2016
Transcript
Page 1: Using Trees to Depict a Forest

Using Trees to Depict a Forest
Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor

Presented by Sergey Shepshelvich

Page 2: Using Trees to Depict a Forest

Motivation

In interactive database querying, we often get more results than we can comprehend immediately
When do you actually click over 2-3 pages of results?
  85% of users never go to the second page!
What to display on the first page?

Page 3: Using Trees to Depict a Forest

Standard solutions

Sorting by attributes
  Computationally expensive
  Similar results can be distributed many pages apart
Ranking
  Hard to estimate the user's preference
  In database queries, all tuples are equally relevant!
What to do when there are millions of results?

Page 4: Using Trees to Depict a Forest

Make the First Page Count

Human beings are very capable of learning from examples
Show the most "representative" results
  Best help users learn what is in the result set
  User can decide further actions based on representatives

Page 5: Using Trees to Depict a Forest

The Proposal: MusiqLens Experience
(Model-driven Usable Systems for Information Querying)

Page 6: Using Trees to Depict a Forest

Suppose a user wants a 2005 Civic

but there are too many of them…

Page 7: Using Trees to Depict a Forest

MusiqLens on the Car Data

Id   Model  Price    Year  Mileage  Condition
872  Civic  $12,000  2005  50,000   Good       122 more like this
901  Civic  $16,000  2005  40,000   Excellent  345 more like this
725  Civic  $18,500  2005  30,000   Excellent  86 more like this
423  Civic  $17,000  2005  42,000   Good       201 more like this
132  Civic  $9,500   2005  86,000   Fair       185 more like this
322  Civic  $14,000  2005  73,000   Good       55 more like this

Page 9: Using Trees to Depict a Forest

After Zooming in: 2005 Honda Civics ~ ID 132

Id   Model  Price    Year  Mileage  Condition
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this
123  Civic  $9,100   2005  81,000   Fair       40 more like this
898  Civic  $9,000   2005  69,000   Fair       42 more like this

Page 10: Using Trees to Depict a Forest

After Filtering by "Price < 9,500"

Id   Model  Price   Year  Mileage  Condition
123  Civic  $9,100  2005  81,000   Fair       40 more like this
898  Civic  $9,000  2005  69,000   Fair       42 more like this
133  Civic  $9,300  2005  87,000   Fair       33 more like this
126  Civic  $9,200  2005  89,000   Good       3 more like this
129  Civic  $8,900  2005  81,000   Fair       20 more like this
999  Civic  $9,000  2005  87,000   Fair       12 more like this

Page 11: Using Trees to Depict a Forest

Challenges

Representation Modeling: finding a suitable metric
  What is the best set of representatives?
Representative finding
  How to find them efficiently?
Query Refinement
  How to efficiently adapt to user's query operations?

Page 12: Using Trees to Depict a Forest

Finding a Suitable Metric

Users should be the ultimate judge
  Which metric generates the representatives that I can learn the most from?
User study to evaluate different representation models

Page 13: Using Trees to Depict a Forest

Metric Candidates

Sort by attributes
Uniform random sampling
  Small clusters are missed
Density-biased sampling
  Sample more from sparse regions, less from dense regions
Sort by typicality
  Based on probabilistic modeling
K-medoids
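
The density-biased candidate can be illustrated with a small sketch. This is only one simple way to bias a sample toward sparse regions, not necessarily the variant used in the study: it assumes points normalized to 0..1, a hypothetical grid of square cells, and a sampling weight of one over the number of points sharing a cell. The function names are made up for illustration.

```python
import random
from collections import Counter


def density_weights(points, cell=0.1):
    """Weight each point by 1 / (number of points in its grid cell)."""
    def cell_of(p):
        return tuple(int(x // cell) for x in p)

    counts = Counter(cell_of(p) for p in points)
    return [1 / counts[cell_of(p)] for p in points]


def density_biased_sample(points, n, cell=0.1, seed=0):
    """Draw n points (with replacement), favoring points in sparse cells."""
    rng = random.Random(seed)
    return rng.choices(points, weights=density_weights(points, cell), k=n)


points = [(0.05, 0.05), (0.06, 0.04), (0.07, 0.03), (0.95, 0.95)]
weights = density_weights(points)   # three crowded points, one isolated point
sample = density_biased_sample(points, n=2)
```

With these weights the isolated point is as likely to be drawn as the whole crowded cell put together, which is the effect the slide describes.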

Page 14: Using Trees to Depict a Forest

Metric Candidates - K-medoids

A medoid of a cluster is the object whose dissimilarity to others is smallest
  Average medoid and max medoid
K-medoids are k objects, each from a different cluster where the object is the medoid
Why not K-means?
  K-means cluster centers do not exist in database
  We must present real objects to users
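
A minimal sketch of the two medoid definitions, assuming Euclidean distance as the dissimilarity and a made-up five-point cluster. The average medoid minimizes the mean distance to the other objects and the max medoid minimizes the largest distance; the helper name `medoid` is hypothetical.

```python
import math


def medoid(cluster, mode="avg"):
    """Return the cluster member with the smallest average (or maximum) distance to the others."""
    def score(p):
        dists = [math.dist(p, q) for q in cluster]
        return sum(dists) / len(dists) if mode == "avg" else max(dists)

    return min(cluster, key=score)


cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
avg_medoid = medoid(cluster, "avg")   # (2, 0)
max_medoid = medoid(cluster, "max")   # (3, 0)

# The arithmetic mean, which k-means would use as the center, is (3.2, 0.0).
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
```

Both medoids are actual members of the cluster, while the mean is not a point in the data, which is the objection to K-means raised above.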

Page 15: Using Trees to Depict a Forest

Plotting the Candidates

[Figure: two scatter plots, "Random" and "Density Biased", both axes from 0 to 1]

Data: Yahoo! Autos, 3922 data points. Price and mileage are normalized to 0..1

Page 16: Using Trees to Depict a Forest

Plotting the Candidates - Typicality

[Figure: scatter plot "Typical", both axes from 0 to 1]

Page 17: Using Trees to Depict a Forest

Plotting the Candidates - k-medoids

[Figure: two scatter plots, "Max-Medoids" and "Avg-Medoids", both axes from 0 to 1]

Page 18: Using Trees to Depict a Forest

User Study Procedure

Users are given:
  7 sets of data, generated using the 7 candidate methods
  Each set consists of 8 representative points
Users predict 4 more data points
  That are most likely in the data set
  Should not pick those already given
Measure the prediction error

Page 19: Using Trees to Depict a Forest

Verdict

K-medoids is the winner
In this paper, the authors choose average k-medoids
The proposed algorithm can be extended to max-medoids with small changes

Page 20: Using Trees to Depict a Forest

Challenges

Representation Modeling: finding a suitable metric
  What is the best set of representatives?
Representative finding
  How to find them efficiently?
Query Refinement
  How to efficiently adapt to user's query operations?

Page 21: Using Trees to Depict a Forest

Cover Tree Based Algorithm

Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006
Briefly discuss Cover Tree properties
See Cover Tree based algorithms for computing k-medoids

Page 22: Using Trees to Depict a Forest

Cover Tree Properties (1)

Nesting: for all $i$, $C_i \subset C_{i+1}$

[Figure: points in the data (one dimension), shown at levels $C_i$ and $C_{i+1}$]

Page 23: Using Trees to Depict a Forest

Cover Tree Properties (2)

Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$
Distance from a node in $C_i$ to any descendant is less than $1/2^{i-1}$. This value is called the "span" of the node.

[Figure: levels $C_i$ and $C_{i+1}$]

Page 24: Using Trees to Depict a Forest

Cover Tree Properties (3)

Separation: nodes in $C_i$ are separated by at least $1/2^i$
Note: $i$ is allowed to be negative to satisfy the above conditions.

[Figure: levels $C_i$ and $C_{i+1}$]
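
The three properties can be checked mechanically on a toy example. The sketch below is not the cover tree construction itself. It assumes one-dimensional points, levels numbered from 0 downward, and a level radius of 1/2^i (an assumption that matches the spans 2, 1, 1/2, 1/4 in the later example); `check_levels` is a hypothetical helper.

```python
from itertools import combinations


def check_levels(levels, radius=lambda i: 1 / 2**i):
    """levels[i] is the set C_i of 1-D points. Return a list of (property, level) violations."""
    problems = []
    for i, c in enumerate(levels):
        # Separation: distinct nodes in C_i are at least radius(i) apart.
        if any(abs(p - q) < radius(i) for p, q in combinations(c, 2)):
            problems.append(("separation", i))
        if i + 1 < len(levels):
            nxt = levels[i + 1]
            # Nesting: every node of C_i also appears in C_{i+1}.
            if not c <= nxt:
                problems.append(("nesting", i))
            # Covering: every node of C_{i+1} has some node of C_i within radius(i).
            if any(all(abs(p - q) > radius(i) for q in c) for p in nxt):
                problems.append(("covering", i))
    return problems


good = [{0.0}, {0.0, 0.6}, {0.0, 0.3, 0.6, 0.9}]
too_close = [{0.0}, {0.0, 0.2}]      # 0.2 < 1/2, breaks separation at level 1
too_far = [{0.0}, {0.0, 1.5}]        # 1.5 > 1, breaks covering at level 0

print(check_levels(good), check_levels(too_close), check_levels(too_far))
```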

Page 25: Using Trees to Depict a Forest

Additional Stats for Cover Tree (2D Example)

[Figure: cover tree over points s1 to s10, with subtrees marked DS = 10 and DS = 3]

Density (DS): number of points in the subtree
Centroid (CT): geometric center of points in the subtree

Page 26: Using Trees to Depict a Forest

k-medoid Algorithm Outline

We descend the cover tree to a level with more than k nodes
Choose an initial k points as first set of medoids (seeds)
  Bad seeds can lead to local minima with a high distance cost
Assign nodes and repeatedly update until the medoids converge

Page 27: Using Trees to Depict a Forest

Cover Tree Based Seeding

Descend the cover tree to a level with more than k nodes (denote as level m)
Use the parent level as starting point for seeds
  Each node has a weight, calculated as product of span and density (the contribution of the subtree to the distance cost)
  Expand nodes using a priority queue
  Fetch the first k nodes from the queue as seeds

Page 28: Using Trees to Depict a Forest

A Simple Example: k = 4

[Figure: cover tree over points s1 to s10; successive levels have Span = 2, Span = 1, Span = 1/2, Span = 1/4]

Priority Queue on node weight (density * span):
  S3 (5), S8 (3), S5 (2)
  S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)

Final set of seeds
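
The priority-queue seeding can be sketched with the weights from the k = 4 example. This is a sketch under assumptions, not the authors' implementation: it expands only the heaviest node until the queue holds at least k nodes, whereas the second queue on the slide also shows S8 and S5 at the finer span. The `Node` class and `pick_seeds` are hypothetical, and the child densities (2, 2, 1) are read off the slide's weights at span 1/2.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    span: float      # span of the node's level
    density: int     # number of points in the subtree
    children: list = field(default_factory=list)

    @property
    def weight(self) -> float:
        return self.span * self.density


def pick_seeds(start_level, k):
    """Expand the heaviest nodes until at least k are queued, then return the k heaviest names."""
    heap, tie = [], 0

    def push(node):
        nonlocal tie
        # heapq is a min-heap, so negate the weight; `tie` keeps insertion order on equal weights.
        heapq.heappush(heap, (-node.weight, tie, node))
        tie += 1

    for node in start_level:
        push(node)
    leaves = []  # heavy nodes that cannot be expanded any further
    while heap and len(heap) + len(leaves) < k:
        _, _, node = heapq.heappop(heap)
        if node.children:
            for child in node.children:
                push(child)
        else:
            leaves.append(node)
    for node in leaves:
        push(node)
    return [heapq.heappop(heap)[2].name for _ in range(min(k, len(heap)))]


s3 = Node("S3", 1, 5, [Node("S3", 0.5, 2), Node("S7", 0.5, 2), Node("S2", 0.5, 1)])
s8 = Node("S8", 1, 3)   # children omitted in this sketch
s5 = Node("S5", 1, 2)   # children omitted in this sketch
print(pick_seeds([s3, s8, s5], k=4))   # ['S8', 'S5', 'S3', 'S7']
```

Under this expansion policy the four seeds are the same nodes that head the second queue on the slide, although the queued weights for S8 and S5 differ because they are not expanded here.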

Page 29: Using Trees to Depict a Forest

Update Process

1. Initially, assign all nodes to closest seed to form clusters
2. For each cluster, calculate the geometric center
   Use centroid and density information to approximate subtree
3. Find the node that is closest to the geometric center, designate as a new medoid
4. Repeat from step 1 until medoids converge
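
The four steps above can be sketched as a short loop. The sketch assumes each cover tree node is summarized by a 2-D centroid and a density, uses Euclidean distance, and takes the geometric center as the density-weighted mean of the node centroids. The function name, the data, and the iteration cap are illustrative assumptions.

```python
import math


def update_medoids(nodes, seeds, max_iter=100):
    """nodes: list of (centroid, density) pairs; seeds: indices into nodes. Returns medoid indices."""
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every node to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, (c, _) in enumerate(nodes):
            nearest = min(medoids, key=lambda m: math.dist(c, nodes[m][0]))
            clusters[nearest].append(i)
        new_medoids = []
        for m, members in clusters.items():
            if not members:          # can only happen if two medoids coincide
                new_medoids.append(m)
                continue
            # Step 2: density-weighted geometric center of the cluster.
            total = sum(nodes[i][1] for i in members)
            center = tuple(
                sum(nodes[i][0][d] * nodes[i][1] for i in members) / total
                for d in range(2)
            )
            # Step 3: the member closest to the center becomes the new medoid.
            new_medoids.append(min(members, key=lambda i: math.dist(nodes[i][0], center)))
        # Step 4: stop once the medoid set no longer changes.
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


nodes = [((0, 0), 3), ((1, 0), 1), ((2, 0), 1), ((10, 0), 1), ((11, 0), 1), ((12, 0), 1)]
print(update_medoids(nodes, seeds=[0, 1]))   # [1, 4]
```

Starting from two seeds in the same clump, the loop moves one medoid to the second clump and converges after three passes.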

Page 30: Using Trees to Depict a Forest

Challenges

Representation Modeling: finding a suitable metric
  What is the best set of representatives?
Representative finding
  How to find them efficiently?
Query Refinement
  How to efficiently adapt to user's query operations?

Page 31: Using Trees to Depict a Forest

Query Adaptation

Handle user actions
  Zooming
  Selection (filtering)

Page 32: Using Trees to Depict a Forest

Zooming

Expand all nodes assigned to the medoid
Run k-medoid algorithm on the new set of nodes

Page 33: Using Trees to Depict a Forest

Selection

Effect of selection on a node
  Completely invalid
  Fully valid
  Partially valid
Estimate the validity percentage (VG) of each node
Multiply the VG with weight of each node

[Figure: nodes S1 to S7 plotted on Price and Mileage axes]
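
The slides do not say how VG is estimated, so the following is only an illustration under stated assumptions: a node is taken to cover a price interval with its points spread uniformly, and the selection is a "Price < threshold" filter. Both function names and the numbers are hypothetical.

```python
def validity_fraction(lo, hi, threshold):
    """Fraction of the interval [lo, hi] that satisfies value < threshold (uniform assumption)."""
    if hi <= lo:                       # degenerate node covering a single value
        return 1.0 if lo < threshold else 0.0
    return min(1.0, max(0.0, (threshold - lo) / (hi - lo)))


def adjusted_weight(span, density, vg):
    """Node weight (span * density) scaled by its validity percentage."""
    return span * density * vg


print(validity_fraction(9000, 9400, 9500))    # fully valid: 1.0
print(validity_fraction(9600, 10000, 9500))   # completely invalid: 0.0
print(validity_fraction(9000, 10000, 9500))   # partially valid: 0.5
print(adjusted_weight(0.5, 4, 0.5))           # 1.0
```

A completely invalid node drops to weight zero and falls to the back of the priority queue, while a partially valid node keeps a share of its weight.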

Page 34: Using Trees to Depict a Forest

Experiments - Initial Medoid Quality

Compare with R-tree based method by M. Ester, H. Kriegel, and X. Xu
Data sets
  Synthetic dataset: 2D points with Zipf distribution
  Real dataset: LA data set from R-tree Portal, 130k points
Measurement
  Time to compute the medoids
  Average distance from a data point to its medoid
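
The distance measurement can be written down directly. The sketch assumes Euclidean distance and that each data point is charged to its nearest medoid; the function name and sample points are made up.

```python
import math


def average_distance(points, medoids):
    """Mean distance from each data point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)


points = [(0, 0), (1, 0), (4, 0)]
medoids = [(0, 0), (4, 0)]
print(average_distance(points, medoids))   # (0 + 1 + 0) / 3
```

A lower value means the chosen medoids sit closer to the data they represent.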

Page 35: Using Trees to Depict a Forest

Results on Synthetic Data

[Figure: two charts, "Time" and "Distance", comparing R-tree and Cover Tree at cardinalities 256K, 512K, 1024K, 2048K, 4096K]

For various sizes of data, Cover-tree based method outperforms R-tree based method

Page 36: Using Trees to Depict a Forest

Results on Real Data

[Figure: two charts, distance and time (seconds), comparing R-tree and Cover Tree for k = 2, 8, 32, 128, 512]

For various k values, Cover-tree based method outperforms R-tree based method on real data

Page 37: Using Trees to Depict a Forest

Query Adaptation

[Figure: two charts, "Synthetic Data" and "Real Data", comparing Re-Compute and Incremental at selectivity 0.8, 0.6, 0.4, 0.2; y-axis labeled distance]

Compare with re-building the cover tree and running the k-medoid algorithm from scratch.

Time cost of re-building is orders-of-magnitude higher than incremental computation.

Page 38: Using Trees to Depict a Forest

Conclusion

The authors proposed the MusiqLens framework for solving the many-answer problem
The authors conducted a user study to select a metric for choosing representatives
The authors proposed an efficient method for computing and maintaining the representatives under user actions

Part of the database usability project at Univ. of Michigan
Led by Prof. H.V. Jagadish
http://www.eecs.umich.edu/db/usable/

