Using Trees to Depict a Forest

Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor

Presented by Sergey Shepshelvich

Motivation
In interactive database querying, we often get more results than we can comprehend immediately
When do you actually click over 2-3 pages of results?
  85% of users never go to the second page!
What to display on the first page?

Standard solutions
Sorting by attributes
  Computationally expensive
  Similar results can be distributed many pages apart
Ranking
  Hard to estimate the user's preference
  In database queries, all tuples are equally relevant!
What to do when there are millions of results?

Make the First Page Count
Human beings are very capable of learning from examples
Show the most “representative” results
  Best help users learn what is in the result set
  User can decide further actions based on representatives

The Proposal: MusiqLens Experience
(Model-driven Usable Systems for Information Querying)

Suppose a user wants a 2005 Civic
but there are too many of them…

MusiqLens on the Car Data

Id   Model  Price    Year  Mileage  Condition
872  Civic  $12,000  2005  50,000   Good       122 more like this
901  Civic  $16,000  2005  40,000   Excellent  345 more like this
132  Civic  $9,500   2005  86,000   Fair       185 more like this



After Zooming in: 2005 Honda Civics ~ ID 132

Id   Model  Price    Year  Mileage  Condition
342  Civic  $9,800   2005  72,000   Good       25 more like this
768  Civic  $10,000  2005  60,000   Good       10 more like this
132  Civic  $9,500   2005  86,000   Fair       63 more like this
122  Civic  $9,500   2005  76,000   Good       5 more like this

After Filtering by “Price < 9,500”
[Table with columns Id, Model, Price, Year, Mileage, Condition; rows not preserved]

Challenges
Representation Modeling: finding a suitable metric
  What is the best set of representatives?
Representative finding
  How to find them efficiently?
Query Refinement
  How to efficiently adapt to user’s query operations?

Finding a Suitable Metric
Users should be the ultimate judge
  Which metric generates the representatives that I can learn the most from?
User study to evaluate different representation modeling

Metric Candidates
Sort by attributes
Uniform random sampling
  Small clusters are missed
Density-biased sampling
  Sample more from sparse regions, less from dense regions
Sort by typicality
  Based on probabilistic modeling
K-medoids
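Density-biased sampling can be sketched roughly as follows. This is only an illustration and not necessarily the variant evaluated in the user study: the grid resolution `cells`, the exponent `e`, and the function names are assumptions made for this example. Points are binned into grid cells, and each point gets a sampling weight that shrinks as its cell gets more crowded.

```python
import random
from collections import Counter

def density_biased_weights(points, cells=10, e=1.0):
    """Return one sampling weight per point: 1 / (cell count ** e).

    Assumes coordinates are normalized to [0, 1]. With e = 0 every point
    has weight 1 (uniform sampling); larger e favors sparse regions more.
    """
    def cell_of(p):
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        return tuple(min(int(x * cells), cells - 1) for x in p)

    counts = Counter(cell_of(p) for p in points)
    return [1.0 / (counts[cell_of(p)] ** e) for p in points]

def density_biased_sample(points, k, cells=10, e=1.0, seed=0):
    """Draw k points (with replacement) using the weights above."""
    rng = random.Random(seed)
    weights = density_biased_weights(points, cells, e)
    return rng.choices(points, weights=weights, k=k)

# Hypothetical data: four points crowded into one cell, one isolated point.
pts = [(0.11, 0.11), (0.12, 0.12), (0.13, 0.13), (0.14, 0.14), (0.95, 0.95)]
w = density_biased_weights(pts)
sample = density_biased_sample(pts, 3)
```

In this toy data set the isolated point carries four times the weight of each crowded point, so a small cluster is less likely to be missed than under uniform sampling.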

Metric Candidates - K-medoids
A medoid of a cluster is the object whose dissimilarity to others is smallest
  Average medoid (smallest average dissimilarity) and max medoid (smallest maximum dissimilarity)
K-medoids are k objects, each from a different cluster where the object is the medoid
Why not K-means?
  K-means cluster centers do not exist in the database
  We must present real objects to users
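The two medoid variants can be illustrated with a short sketch. It assumes Euclidean distance as the dissimilarity measure and reads "average medoid" as the object minimizing the average distance to the other objects and "max medoid" as the object minimizing the maximum distance; the function names and the sample cluster are made up for the example.

```python
import math

def avg_medoid(cluster):
    """Object with the smallest total (hence average) distance to the others."""
    return min(cluster, key=lambda c: sum(math.dist(c, o) for o in cluster))

def max_medoid(cluster):
    """Object with the smallest maximum distance to any other object."""
    return min(cluster, key=lambda c: max(math.dist(c, o) for o in cluster))

# Hypothetical one-cluster data set with an outlier at x = 10.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
a = avg_medoid(cluster)
m = max_medoid(cluster)
```

Here the outlier pulls the max medoid to (3.0, 0.0), while the average medoid stays at (2.0, 0.0). Both are real objects from the cluster, unlike a K-means center.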

Plotting the Candidates
[Figure: two scatter plots on axes from 0 to 1, panels "Random" and "Density Biased"]
Data: Yahoo! Autos, 3922 data points. Price and mileage are normalized to 0..1

Plotting the Candidates - Typicality
[Figure: scatter plot on axes from 0 to 1, panel "Typical"]

Plotting the Candidates - k-medoids
[Figure: two scatter plots on axes from 0 to 1, panels "Max-Medoids" and "Avg-Medoids"]

User Study Procedure
Users are given:
  7 sets of data, generated using the 7 candidate methods
  Each set consists of 8 representative points
Users predict 4 more data points
  That are most likely in the data set
  Should not pick those already given
Measure the prediction error

Verdict
K-medoids is the winner
In this paper, the authors choose average k-medoids
  The proposed algorithm can be extended to max-medoids with small changes


Cover Tree Based Algorithm
Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006
Briefly discuss Cover Tree properties
See Cover Tree based algorithms for computing k-medoids

Cover Tree Properties (1)
Nesting: for all $i$, $C_i \subset C_{i+1}$
[Figure: points in the data (one dimension), shown at levels $C_i$ and $C_{i+1}$]

Cover Tree Properties (2)
Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$
Distance from a node in $C_i$ to any descendant is less than $1/2^{i-1}$. This value is called the “span” of the node.

Cover Tree Properties (3)
Separation: nodes in $C_i$ are separated by at least $1/2^i$
Note: $i$ is allowed to be negative to satisfy the above conditions.
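The three cover tree properties (nesting, covering, separation) can be checked mechanically on a small example. The sketch below is an illustration under stated assumptions: it takes the scale of level i to be 1/2^i, represents each level simply as a list of points, and uses a hypothetical function name and made-up one-dimensional data. It is not taken from the cover tree paper.

```python
from itertools import combinations

def satisfies_cover_tree_invariants(levels, dist, first_level=0):
    """levels[j] holds the points of C_i for i = first_level + j.

    Assumed scale of level i: 1 / 2**i (first_level may be negative).
    """
    for j, c_i in enumerate(levels):
        scale = 1.0 / 2 ** (first_level + j)
        # Separation: points within C_i are at least `scale` apart.
        if any(dist(p, q) < scale for p, q in combinations(c_i, 2)):
            return False
        if j + 1 < len(levels):
            c_next = levels[j + 1]
            # Nesting: C_i is contained in C_{i+1}.
            if not set(c_i) <= set(c_next):
                return False
            # Covering: every point of C_{i+1} has some point of C_i
            # within distance `scale` (its parent).
            if any(min(dist(p, q) for q in c_i) > scale for p in c_next):
                return False
    return True

d1 = lambda a, b: abs(a - b)  # one-dimensional distance

good = [[0.0], [0.0, 0.9], [0.0, 0.3, 0.9]]
bad = [[0.0], [0.0, 0.9], [0.0, 0.1, 0.9]]  # 0.0 and 0.1 are too close at level 2
ok_good = satisfies_cover_tree_invariants(good, d1)
ok_bad = satisfies_cover_tree_invariants(bad, d1)
```

The second data set fails only the separation check: at level 2 the assumed scale is 1/4, and two of its points are 0.1 apart.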

Additional Stats for Cover Tree (2D Example)
Density (DS): number of points in the subtree
Centroid (CT): geometric center of points in the subtree
[Figure: cover tree over points s1 to s10 in 2D, annotated with DS = 10 and DS = 3 and a point labeled p]
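Density and centroid are simple to compute and to combine. The following sketch assumes a subtree is given as a plain list of 2D points (the slides do not show the actual node layout), and the helper names are hypothetical. It also shows that the statistics of disjoint subtrees can be merged without revisiting the points, which is one reason such per-node statistics are convenient.

```python
def subtree_stats(points):
    """Return (DS, CT): point count and geometric center of the points."""
    ds = len(points)
    ct = tuple(sum(coord) / ds for coord in zip(*points))
    return ds, ct

def merge_stats(stats):
    """Combine the (DS, CT) pairs of disjoint subtrees.

    The merged centroid is the density-weighted mean of the centroids.
    """
    total = sum(ds for ds, _ in stats)
    dims = len(stats[0][1])
    ct = tuple(sum(ds * c[d] for ds, c in stats) / total for d in range(dims))
    return total, ct

# Hypothetical subtrees.
left = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
right = [(4.0, 4.0)]
merged = merge_stats([subtree_stats(left), subtree_stats(right)])
direct = subtree_stats(left + right)
```

Merging the two subtree summaries gives the same result as computing the statistics over all four points directly.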

k-medoid Algorithm Outline
We descend the cover tree to a level with more than k nodes
Choose an initial k points as the first set of medoids (seeds)
  Bad seeds can lead to local minima with a high distance cost
Assign nodes and update repeatedly until the medoids converge

Cover Tree Based Seeding
Descend the cover tree to a level with more than k nodes (denote as level m)
Use the parent level as starting point for seeds
  Each node has a weight, calculated as the product of span and density (the contribution of the subtree to the distance cost)
  Expand nodes using a priority queue
  Fetch the first k nodes from the queue as seeds

A Simple Example: k = 4
[Figure: cover tree over points s1 to s10; the levels are labeled Span = 2, Span = 1, Span = 1/2, and Span = 1/4, and the final set of seeds is marked]
Priority Queue on node weight (density * span):
  S3 (5), S8 (3), S5 (2)
  S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)
Final set of seeds
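The weighting and top-k step of the example can be reproduced with a few lines. This sketch covers only that step, not the descent or the node expansion. The densities are an assumption: they are back-computed from the weights in the second queue line by assuming every node there has span 1/2. The function name `pick_seeds` is hypothetical.

```python
import heapq
from fractions import Fraction

HALF = Fraction(1, 2)

# (name, density, span); densities inferred as weight / span, see lead-in.
level_nodes = [
    ("S8", 3, HALF),
    ("S5", 2, HALF),
    ("S3", 2, HALF),
    ("S7", 2, HALF),
    ("S2", 1, HALF),
]

def weight(node):
    """Node weight = density * span."""
    _, density, span = node
    return density * span

def pick_seeds(nodes, k):
    """Return the k heaviest nodes, heaviest first."""
    return heapq.nlargest(k, nodes, key=weight)

seeds = [name for name, _, _ in pick_seeds(level_nodes, 4)]
weights = {name: weight((name, ds, sp)) for name, ds, sp in level_nodes}
```

With k = 4 the lightest node, S2 with weight 1/2, is the one left out.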

Update Process
1. Initially, assign all nodes to closest seed to form clusters
2. For each cluster, calculate the geometric center
   Use centroid and density information to approximate subtree
3. Find the node that is closest to the geometric center, designate as a new medoid
4. Repeat from step 1 until medoids converge
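The four update steps can be sketched as a small loop. As a simplifying assumption, this version works directly on raw data points with Euclidean distance, whereas the slides operate on cover tree nodes and approximate each subtree by its centroid and density. The function name, the `max_iter` safeguard, and the sample data are made up for the example.

```python
import math

def update_medoids(points, seeds, max_iter=100):
    """Iterate assignment and medoid update until the medoids stop changing."""
    medoids = list(seeds)
    for _ in range(max_iter):
        # Step 1: assign every point to its closest medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            closest = min(medoids, key=lambda m: math.dist(p, m))
            clusters[closest].append(p)
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep a medoid that attracted no points
                continue
            # Step 2: geometric center of the cluster.
            center = tuple(sum(c) / len(members) for c in zip(*members))
            # Step 3: the member closest to the center becomes the new medoid.
            new_medoids.append(min(members, key=lambda p: math.dist(p, center)))
        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

# Hypothetical data: two groups on a line, seeded at their left ends.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
result = update_medoids(points, seeds=[(0.0, 0.0), (10.0, 0.0)])
```

Starting from seeds at the left end of each group, the medoids move to the middle point of each group after one update and then stay there.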


Query Adaptation
Handle user actions
  Zooming
  Selection (filtering)

Zooming
Expand all nodes assigned to the medoid
Run k-medoid algorithm on the new set of nodes

Selection
Effect of selection on a node
  Completely invalid
  Fully valid
  Partially valid
Estimate the validity percentage (VG) of each node
Multiply the VG with weight of each node

[Figure: nodes S1 to S7 plotted on Price and Mileage axes, with the selection region marked]
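The slides do not say how the validity percentage is estimated, so the following is only one possible sketch. It assumes a node covers the interval [center - span, center + span] along the filtered attribute, that points are spread evenly over that interval, and that the selection has the form "value < upper_bound". All function and parameter names are hypothetical.

```python
def validity_fraction(center, span, upper_bound):
    """Estimated fraction of a node's interval that satisfies value < upper_bound."""
    lo, hi = center - span, center + span
    if upper_bound <= lo:
        return 0.0  # completely invalid
    if upper_bound >= hi:
        return 1.0  # fully valid
    return (upper_bound - lo) / (hi - lo)  # partially valid

def adjusted_weight(weight, center, span, upper_bound):
    """Scale the node weight by its estimated validity fraction."""
    return validity_fraction(center, span, upper_bound) * weight

# Hypothetical node centered at 10 with span 2 and weight 8.
partial = adjusted_weight(8.0, center=10.0, span=2.0, upper_bound=9.0)
invalid = adjusted_weight(8.0, center=10.0, span=2.0, upper_bound=5.0)
valid = adjusted_weight(8.0, center=10.0, span=2.0, upper_bound=20.0)
```

A node that lies entirely outside the selection ends up with weight 0, so it no longer competes for a place among the representatives.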

Experiments – Initial Medoid Quality
Compare with R-tree based method by M. Ester, H. Kriegel, and X. Xu
Data sets
  Synthetic dataset: 2D points with Zipf distribution
  Real dataset: LA data set from R-tree Portal, 130k points
Measurement
  Time to compute the medoids
  Average distance from a data point to its medoid

Results on Synthetic Data
[Figure: two panels comparing R-tree and Cover Tree. Time panel: Time (seconds) versus Cardinality (256K to 4096K). Distance panel: Distance versus Cardinality (256K to 4096K).]
For various sizes of data, Cover-tree based method outperforms R-tree based method

Results on Real Data
[Figure: two panels comparing R-tree and Cover Tree. Distance versus k (2 to 512) and Time (seconds) versus k (2 to 512).]
For various k values, Cover-tree based method outperforms R-tree based method on real data

Query Adaptation
[Figure: two panels, Synthetic Data and Real Data, comparing Re-Compute and Incremental. Both plot Distance versus Selectivity (0.8 to 0.2).]
Compare with re-building the cover tree and running the k-medoid algorithm from scratch.
Time cost of re-building is orders-of-magnitude higher than incremental computation.

Conclusion
Authors proposed MusiqLens framework for solving the many-answer problem
Authors conducted user study to select a metric for choosing representatives
Authors proposed efficient method for computing and maintaining the representatives under user actions

Part of the database usability project at Univ. of Michigan
  Led by Prof. H.V. Jagadish
  http://www.eecs.umich.edu/db/usable/

Date post:	07-Feb-2016