Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.

Effective Change Detection Using Sampling

Junghoo “John” Cho

Alexandros Ntoulas

Junghoo "John" Cho (UCLA Computer Science)

Application

Web search engines/crawlers
Web archive
Data warehouse
...

Problem

Polling

[Diagram: a remote database and a local database, with the labels Query and Update]

Existing Approach

Round robin
  Download pages in a round-robin manner

Change-frequency based [CLW98, CGM00, EMT01]
  Estimate the change frequency
  Adjust download frequency
  Proven to be optimal

Our Approach

Sampling-based
  Sample k pages from each source
  Download more pages from the sources with more changed samples

Comparison

Frequency based
  Proven to be optimal
  Change history required
  Difficult to estimate change frequency

Sampling based
  Can be worse than the frequency-based policy
  No history or frequency estimation required

Experimental comparison later

Questions

Are we assuming correlation?
How to use sampling results?
  Proportional vs. Greedy
How many samples?
  Dynamic sample size adjustment?
What if we have very limited resources?

Is Correlation Necessary?

Random sampling

Correlation is not necessary
  Only random sampling is required
More discussion later

[Figure: two sources, with 4/5 and 1/5 of their sampled pages changed]

Download Model (1)

Fixed download cycle
  Say, once a month

Fixed download resources in each cycle
  Say, 100,000 page downloads every month

Goal
  Download as many changes as we can

$\text{ChangeRatio} = \frac{\text{number of changed and downloaded pages}}{\text{number of downloaded pages}}$
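The ChangeRatio metric is simple enough to state in a few lines of code. The sketch below is illustrative only: the function name `change_ratio` and the page identifiers are hypothetical, and pages are assumed to be identified by hashable keys.

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    `downloaded` and `changed` are collections of page identifiers
    (hypothetical names, chosen for this illustration).
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # nothing downloaded, so no changes were captured
    return len(downloaded & set(changed)) / len(downloaded)


# Four pages were downloaded; two of them (a1, a2) had changed.
pages_downloaded = ["a1", "a2", "a3", "b1"]
pages_changed = {"a1", "a2", "b2"}
print(change_ratio(pages_downloaded, pages_changed))  # 0.5
```

Note that b2 changed but was not downloaded, so it does not count toward the ratio.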

Download Model (2)

Two-stage sampling policy
  Sampling stage
  Download stage

Sampling requires page downloads, so samples consume part of the download budget
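A minimal sketch of one two-stage cycle, under stated assumptions: each site is modeled as a list of booleans (True means the page changed), sampled pages count against the budget, and the download stage simply favors the sites with the highest sampled change fraction. The function name `two_stage_cycle` and the data layout are hypothetical, not taken from the talk.

```python
import random


def two_stage_cycle(sites, sample_size, budget, rng):
    """Run one download cycle and return the resulting ChangeRatio.

    sites: dict mapping site name -> list of bools (True = page changed).
    """
    downloaded = changed = 0
    estimates, unsampled = {}, {}

    # Sampling stage: every sampled page is a real download.
    for name, pages in sites.items():
        k = min(sample_size, len(pages), budget - downloaded)
        sampled = set(rng.sample(range(len(pages)), k))
        hits = sum(pages[i] for i in sampled)
        downloaded += k
        changed += hits
        estimates[name] = hits / k if k else 0.0
        unsampled[name] = [p for i, p in enumerate(pages) if i not in sampled]

    # Download stage: spend the remaining budget on the sites whose
    # samples showed the most changes (an assumed allocation rule).
    for name in sorted(estimates, key=estimates.get, reverse=True):
        take = min(len(unsampled[name]), budget - downloaded)
        picked = rng.sample(unsampled[name], take)
        downloaded += take
        changed += sum(picked)

    return changed / downloaded if downloaded else 0.0


# Extreme case: every page of A changed, no page of B changed.
sites = {"A": [True] * 20, "B": [False] * 20}
print(two_stage_cycle(sites, sample_size=5, budget=20, rng=random.Random(0)))  # 0.75
```

In this extreme case the 5 samples spent on site B find nothing, which is why the ratio is 15/20 rather than 1: the cost of sampling is visible in the metric.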

How to Use Sampling Result?

Sites A and B, each with 20 pages
20 total downloads, 5 samples from each site
  10 page downloads remaining

[Figure: site A with 4 of 5 samples changed, site B with 1 of 5 samples changed]

Proportional Policy

Download pages in proportion to the detected changes
  8 pages from A, 2 pages from B

[Figure: site A with 4 of 5 samples changed, site B with 1 of 5 samples changed]
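The proportional split can be sketched as follows. This is an illustration, not the authors' implementation: the function name `proportional_allocation` is hypothetical, fractional shares are resolved with largest-remainder rounding (an assumption), and the cap imposed by the number of pages left in each site is ignored for brevity.

```python
import math


def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to the
    number of changed samples observed at each site."""
    weights = dict(changed_samples)
    if sum(weights.values()) == 0:
        # No changes detected anywhere: fall back to an even split (assumption).
        weights = {site: 1 for site in weights}
    total = sum(weights.values())

    exact = {site: budget * w / total for site, w in weights.items()}
    alloc = {site: math.floor(x) for site, x in exact.items()}

    # Hand out any downloads lost to rounding, largest remainder first.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for site in by_remainder[:leftover]:
        alloc[site] += 1
    return alloc


# The example from the slide: 4 changed samples in A, 1 in B, 10 downloads left.
print(proportional_allocation({"A": 4, "B": 1}, budget=10))  # {'A': 8, 'B': 2}
```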

Greedy Policy

Download pages from the sites with the most changes
  10 pages from A

[Figure: site A with 4 of 5 samples changed, site B with 1 of 5 samples changed]
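A minimal sketch of the greedy allocation, assuming each site is described by its sampled change fraction and the number of pages not yet downloaded. The function name `greedy_allocation` and the argument layout are hypothetical.

```python
def greedy_allocation(sample_fraction, remaining_pages, budget):
    """Fill the budget from the site with the highest sampled change
    fraction, moving to the next site only when one is exhausted."""
    alloc = {site: 0 for site in sample_fraction}
    for site in sorted(sample_fraction, key=sample_fraction.get, reverse=True):
        if budget == 0:
            break
        take = min(remaining_pages[site], budget)
        alloc[site] = take
        budget -= take
    return alloc


# The example from the slide: after 5 samples per site, 15 pages remain
# in each site and 10 downloads are left.
print(greedy_allocation({"A": 4 / 5, "B": 1 / 5}, {"A": 15, "B": 15}, budget=10))
# {'A': 10, 'B': 0}
```

If the sampled fractions are taken at face value, 10 downloads from A are expected to capture 10 × 0.8 = 8 changed pages, whereas the proportional split of 8 from A and 2 from B is expected to capture 8 × 0.8 + 2 × 0.2 = 6.8.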

Optimality of Greedy

Theorem
  Greedy is optimal if we make download decisions purely based on sampling results
  The optimality is probabilistic: it holds in terms of expected values

How Many Samples?

Too few samples
  Inaccurate change estimates

Too many samples
  “Waste” of resources for sampling

How to determine the optimal sample size?

Optimal Sample Size

Factors to consider
  Total number of pages that we maintain
  Number of pages that we can download in the current cycle
  Number of pages in each Web site
  Change distribution
    Scenario 1: A: 90/100, B: 10/100
    Scenario 2: A: 60/100, B: 40/100

Change Fraction Distribution

[Figure: distribution of change fractions; y-axis: fraction of sites]

$\rho_i$: fraction of changed pages in site $i$
$f(\rho)$: distribution of $\rho$ values

Optimal Sample Size

$N$: number of pages in a site
$r$: (number of pages to download) / (number of pages we maintain)

Analysis is complex

$\sqrt{Nr}$ is a good rule of thumb

[The full expression shown on the slide is not recoverable from the transcript]

Dynamic Sample Size?

Do we need the same sample size for every site?
  A: $\rho = 0$, B: $\rho = 0.45$, C: $\rho = 0.55$, D: $\rho = 1$

Adaptive Sampling

If the estimated $\rho$ is high/low enough, make an early decision

What does “high enough” mean?
  Confidence interval above threshold

[Figure: confidence interval around the estimated $\rho_i$]
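One way the early-decision idea could be sketched: sample pages one at a time and stop as soon as a confidence interval for the change fraction lies entirely above or below a threshold. The slide does not say which interval is used, so the Wilson score interval here is an assumption, as are the function names, the 95% level (z = 1.96), and the threshold value.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    if n == 0:
        return 0.0, 1.0
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_decision(sample_results, threshold, max_samples, z=1.96):
    """Consume sample outcomes (True = sampled page changed) one by one.

    Returns (decision, samples_used), where decision is "download",
    "skip", or "undecided" if max_samples is reached first.
    """
    changed = n = 0
    for is_changed in sample_results:
        n += 1
        changed += bool(is_changed)
        low, high = wilson_interval(changed, n, z)
        if low > threshold:
            return "download", n  # interval entirely above the threshold
        if high < threshold:
            return "skip", n  # interval entirely below the threshold
        if n >= max_samples:
            break
    return "undecided", n


# A site where every sampled page has changed is settled after 4 samples.
print(adaptive_decision([True] * 20, threshold=0.5, max_samples=20))  # ('download', 4)
```

A site whose samples alternate between changed and unchanged stays near the threshold and uses all of its samples, which matches the intuition that sites near the threshold need more samples than sites far from it.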

In the Paper

More details on
  Optimal sample size
  Adaptive policy
  The cases where resources are too limited for sampling

Experiments

353,000 pages from 252 sites
  Mostly popular sites: Yahoo, CNN, Microsoft, …
  ~1,400 pages from each site
  Followed links in a breadth-first manner

Monthly change history for 6 months
  5 download cycles

In experiments, 100,000 page downloads in each download cycle

Comparison of Policies

[Figure: ChangeRatio of each policy: RR (round robin), FRQ (frequency-based), PRP (proportional), GRD (greedy), ADP (adaptive)]

Optimal Sample Size

[Figure: ChangeRatio vs. sample size, for sample sizes from 0 to 250]

Optimal sample size: roughly 10 through 60
$\sqrt{Nr} \approx 20$
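As a sanity check on the figure of about 20, the sketch below evaluates $\sqrt{Nr}$ with the numbers quoted for the experiments (about 1,400 pages per site, 100,000 downloads per cycle, 353,000 pages maintained). It assumes the rule of thumb is exactly the square root of $N \cdot r$; the function name `rule_of_thumb_sample_size` is hypothetical.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, downloads_per_cycle, pages_maintained):
    """sqrt(N * r), where r is the fraction of maintained pages that
    can be downloaded in one cycle (assumed form of the rule of thumb)."""
    r = downloads_per_cycle / pages_maintained
    return math.sqrt(pages_per_site * r)


size = rule_of_thumb_sample_size(1400, 100_000, 353_000)
print(round(size))  # 20
```

The value falls inside the 10 to 60 range in which the experiments place the optimal sample size.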

Comparison of Long-Term Performance

Problem: We have only 5 download cycles of data

Solution: Extrapolate the history

[Figure label: Repeat]

Frequency vs. Sampling

[Figure: ChangeRatio vs. download cycle (0 to 400) for the Frequency and Greedy policies]

Related Work

Frequency-based policy
  Coffman et al., Journal of Scheduling, 1998
  Cho et al., SIGMOD 2000
  Edwards et al., WWW 2001

Source cooperation
  Olston et al., SIGMOD 2002

Conclusion

Sampling-based policy
  Great short-term performance
  No change history required

Frequency-based policy
  Potentially good long-term performance if the change frequency does not change

Greedy is easy to implement and shows high performance

Future Work

Combination of sampling- and frequency-based policies
  Switch to the frequency-based policy after a while

Good partitioning for sampling?
  Site based?
  Directory based?
  Content based?
  Link-structure based?

Questions?
