Post on 27-Mar-2015
Effective Change Detection Using Sampling
Junghoo “John” Cho
Alexandros Ntoulas
UCLA
Junghoo "John" Cho (UCLA Computer Science) 2
Application: Web search engines/crawlers, Web archive, data warehouse, . . .
Problem
[Figure labels: Polling, Remote database, Local database, Query, Update]
Existing Approach
Round robin: download pages in a round-robin manner
Change-frequency based [CLW98, CGM00, EMT01]: estimate the change frequency, then adjust the download frequency; proven to be optimal
Our Approach
Sampling based: sample k pages from each source, then download more pages from the sources with more changed samples
Comparison
Frequency based: proven to be optimal; change history required; difficult to estimate change frequency
Sampling based: can be worse than the frequency-based policy; no history or frequency estimation required
Experimental comparison later
Questions
Are we assuming correlation?
How to use sampling results? Proportional vs. Greedy
How many samples? Dynamic sample size adjustment?
What if we have very limited resources?
Is Correlation Necessary?
Correlation is not necessary, only random sampling
More discussion later
[Figure: random sampling, with 4/5 and 1/5 of the sampled pages changed]
Download Model (1)
Fixed download cycle: say, once a month
Fixed download resources in each cycle: say, 100,000 page downloads every month
Goal: download as many changes as we can
$\text{ChangeRatio} = \frac{\text{no. of changed and downloaded pages}}{\text{no. of downloaded pages}}$
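The ChangeRatio metric can be computed directly from the set of downloaded pages and the set of pages that actually changed. A minimal sketch, assuming pages are identified by hashable IDs; the function name `change_ratio` is hypothetical:

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    downloaded: iterable of page IDs fetched in this cycle
    changed:    iterable of page IDs that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        raise ValueError("no pages were downloaded")
    return len(downloaded & set(changed)) / len(downloaded)


# Four downloads, two of which hit a changed page.
print(change_ratio(["a", "b", "c", "d"], {"a", "c", "x"}))  # 0.5
```

Page "x" changed but was not downloaded, so it does not count toward the ratio.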
Download Model (2)
Two-stage sampling policy: a sampling stage, then a download stage
Sampling requires page downloads
How to Use Sampling Result?
Sites A and B, each with 20 pages
20 total downloads, 5 samples from each site
10 page downloads remaining
[Figure: site A with 4/5 of its samples changed, site B with 1/5]
Proportional Policy
Download pages proportionally to the detected changes: 8 pages from A, 2 pages from B
[Figure: site A with 4/5 of its samples changed, site B with 1/5]
Greedy Policy
Download pages from the sites with the most changes: 10 pages from A
[Figure: site A with 4/5 of its samples changed, site B with 1/5]
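The two allocation policies can be sketched on the A/B example as follows. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, leftover downloads in the proportional split go to the largest remainders, and greedy is capped by the number of unsampled pages left in each site.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads in proportion to the changed-sample counts."""
    weights = dict(changed_samples)
    if sum(weights.values()) == 0:
        # No changes detected anywhere: fall back to an even split (assumption).
        weights = {site: 1 for site in weights}
    total = sum(weights.values())
    alloc = {site: budget * w // total for site, w in weights.items()}
    leftover = budget - sum(alloc.values())
    # Hand out the rounding leftovers to the largest remainders.
    by_remainder = sorted(weights, key=lambda s: budget * weights[s] % total,
                          reverse=True)
    for site in by_remainder[:leftover]:
        alloc[site] += 1
    return alloc


def greedy_allocation(changed_samples, sample_size, remaining_pages, budget):
    """Spend the budget on the sites with the highest sampled change fraction."""
    alloc = {site: 0 for site in changed_samples}
    order = sorted(changed_samples,
                   key=lambda s: changed_samples[s] / sample_size[s],
                   reverse=True)
    for site in order:
        take = min(budget, remaining_pages[site])
        alloc[site] = take
        budget -= take
        if budget == 0:
            break
    return alloc


changed = {"A": 4, "B": 1}       # changed pages among the 5 samples per site
sampled = {"A": 5, "B": 5}
remaining = {"A": 15, "B": 15}   # 20 pages per site minus the 5 samples

print(proportional_allocation(changed, 10))               # {'A': 8, 'B': 2}
print(greedy_allocation(changed, sampled, remaining, 10))  # {'A': 10, 'B': 0}
```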
Optimality of Greedy
Theorem: Greedy is optimal if we make download decisions purely based on sampling results
Probabilistic optimality, in terms of expected values
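A small worked check of the claim on the two-site example: the sketch below compares the expected number of changed pages picked up by each allocation, treating the sampled fractions 4/5 and 1/5 as the change fractions of the remaining pages. That treatment is an assumption of this illustration, and the name `expected_changed` is hypothetical.

```python
from fractions import Fraction


def expected_changed(allocation, change_fraction):
    """Expected number of changed pages among the downloaded ones."""
    return sum(n * change_fraction[site] for site, n in allocation.items())


# Estimates from the samples: 4 of 5 changed in A, 1 of 5 in B.
estimate = {"A": Fraction(4, 5), "B": Fraction(1, 5)}

proportional = {"A": 8, "B": 2}
greedy = {"A": 10, "B": 0}

print(expected_changed(proportional, estimate))  # 34/5
print(expected_changed(greedy, estimate))        # 8
```

Under these estimates, greedy expects 8 changed pages out of the 10 remaining downloads, against 6.8 for the proportional split.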
How Many Samples?
Too few samples: inaccurate change estimates
Too many samples: “waste” of resources for sampling
How to determine the optimal sample size?
Optimal Sample Size
Factors to consider:
Total number of pages that we maintain
Number of pages that we can download in the current cycle
Number of pages in each Web site
Change distribution
Scenario 1 -- A: 90/100, B: 10/100
Scenario 2 -- A: 60/100, B: 40/100
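To see how the change distribution affects the best sample size, the sketch below computes the exact expected ChangeRatio for the two scenarios as a function of the per-site sample size k. Several things here are assumptions made for illustration and are not from the slides: a total budget of 50 downloads per cycle, sampling without replacement, a greedy download stage, and an even split when the two sites tie.

```python
from math import comb


def hypergeom_pmf(population, successes, draws, hits):
    """P(`hits` changed pages among `draws` pages sampled without replacement)."""
    return (comb(successes, hits) * comb(population - successes, draws - hits)
            / comb(population, draws))


def expected_change_ratio(changed_a, changed_b, site_size, budget, k):
    """Expected ChangeRatio: sample k pages per site, then download greedily."""
    rest = budget - 2 * k  # downloads left after the sampling stage
    if k >= site_size or rest < 0 or rest > site_size - k:
        raise ValueError("budget and k are inconsistent with the site size")
    expected_changed = 0.0
    for xa in range(k + 1):
        pa = hypergeom_pmf(site_size, changed_a, k, xa)
        for xb in range(k + 1):
            pb = hypergeom_pmf(site_size, changed_b, k, xb)
            # Change fraction among the pages not yet sampled in each site.
            fa = (changed_a - xa) / (site_size - k)
            fb = (changed_b - xb) / (site_size - k)
            if xa > xb:
                frac = fa
            elif xb > xa:
                frac = fb
            else:
                frac = (fa + fb) / 2  # tie: split the rest evenly
            # Sampled pages are downloads too, so their changes count.
            expected_changed += pa * pb * (xa + xb + rest * frac)
    return expected_changed / budget


scenarios = {"Scenario 1": (90, 10), "Scenario 2": (60, 40)}
for name, (a, b) in scenarios.items():
    best_k = max(range(0, 26),
                 key=lambda k: expected_change_ratio(a, b, 100, 50, k))
    print(name, "best sample size per site:", best_k)
```

With k = 0 nothing is known about either site, and with k = 25 the whole budget goes to sampling; both extremes give an expected ChangeRatio of 0.5 in either scenario, and the useful sample sizes lie in between.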
Change Fraction Distribution
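[Figure: distribution f( ) of the per-site change fractions; y-axis: fraction of sites; a value with subscript t is marked on the x-axis]
Change fraction of site i: fraction of changed pages in site i
f( ): distribution of the change-fraction values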
Optimal Sample Size
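N: number of pages in a site
r: (number of pages to download) / (number of pages we maintain)
Analysis is complex
[Exact expression garbled in the source; it involves N, r, and f evaluated at the subscript-t value]
$\sqrt{Nr}$ is a good rule of thumb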
Dynamic Sample Size?
Do we need the same sample size for every site?
Change fractions: A = 0, B = 0.45, C = 0.55, D = 1
Adaptive Sampling
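If the estimated change fraction is high/low enough, make an early decision
What does “high enough” mean? Confidence interval above the threshold
[Figure: confidence intervals around per-site change-fraction estimates, compared against the threshold]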
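The early-decision idea can be sketched as a sequential test: keep sampling a site until the confidence interval for its change fraction lies entirely above or below the threshold. The slides do not say which interval is used, so the Wilson score interval below is an assumption, as are the function names, the 95% level (z = 1.96), and the "download"/"skip"/"undecided" labels.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a proportion (an assumed choice of interval)."""
    if n == 0:
        return (0.0, 1.0)
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def adaptive_decision(outcomes, threshold, max_samples, z=1.96):
    """Sample until the interval clears the threshold or samples run out.

    outcomes: iterable of booleans, True if the sampled page had changed.
    Returns (decision, samples_used).
    """
    changed = 0
    n = 0
    for page_changed in outcomes:
        n += 1
        changed += bool(page_changed)
        low, high = wilson_interval(changed, n, z)
        if low > threshold:
            return ("download", n)   # confidently above the threshold
        if high < threshold:
            return ("skip", n)       # confidently below the threshold
        if n >= max_samples:
            break
    return ("undecided", n)


print(adaptive_decision([True] * 20, 0.5, 20))        # ('download', 4)
print(adaptive_decision([False] * 20, 0.5, 20))       # ('skip', 4)
print(adaptive_decision([True, False] * 5, 0.5, 10))  # ('undecided', 10)
```

Sites whose change fraction is near 0 or 1 are settled after a handful of samples, while sites near the threshold use up the full sample allowance.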
In the Paper
More details on:
Optimal sample size
Adaptive policy
The cases where resources are too limited for sampling
Experiments
353,000 pages from 252 sites
Mostly popular sites: Yahoo, CNN, Microsoft, …
~1,400 pages from each site
Followed the links in a breadth-first manner
Monthly change history for 6 months: 5 download cycles
In experiments, 100,000 page downloads in each download cycle
Comparison of Policies
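[Figure: bar chart of ChangeRatio (y-axis, 0 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), and ADP (adaptive)]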
Optimal Sample Size
[Figure: ChangeRatio (y-axis, 0.4 to 0.8) versus sample size (x-axis, 0 to 250)]
Optimal sample size ~ 10 through 60; $\sqrt{Nr}$ ~ 20
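The value of about 20 can be checked against the experimental setup, reading the rule of thumb as the square root of N times r. That reading is an assumption, since the radical is not visible in the extracted text; the numbers are the ones given on the Experiments slide, and the variable names are hypothetical.

```python
import math

pages_maintained = 353_000       # total pages in the experiments
downloads_per_cycle = 100_000    # page downloads in each cycle
pages_per_site = 1_400           # approximate N

r = downloads_per_cycle / pages_maintained
rule_of_thumb = math.sqrt(pages_per_site * r)

print(round(rule_of_thumb, 1))  # 19.9
```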
Comparison of Long-Term Performance
Problem: We have only 5-download-cycle data
Solution: Extrapolate the history
[Figure: the observed history is repeated to cover later cycles]
Frequency vs. Sampling
[Figure: ChangeRatio (y-axis, 0.5 to 0.9) versus download cycle (x-axis, 0 to 400) for the Frequency and Greedy policies]
Related Work
Frequency-based policy: Coffman et al., Journal of Scheduling 1998; Cho et al., SIGMOD 2000; Edwards et al., WWW 2001
Source cooperation: Olston et al., SIGMOD 2002
Conclusion
Sampling-based policy: great short-term performance; no change history required
Frequency-based policy: potentially good long-term performance if the change frequency does not change
Greedy is easy to implement and shows high performance
Future Work
Combination of sampling-based and frequency-based policies: switch to the frequency-based policy after a while
Good partitioning for sampling? Site based? Directory based? Content based? Link-structure based?
Questions?