Monitoring the dynamic Web to respond to Continuous Queries
Presented by Qing Cao
CS851, Spring 2005
2
Talk outline
Introduction and Motivation
Previous Approaches
Paper Contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments and Evaluations
Critique
Conclusions
3
Problem Context
Current web pages are highly dynamic
• 40% of commercial pages change per day
• 23% of all pages change per day (Sethuraman et al.)
How can search engines handle a user's long-term request on a particular topic?
• Need to monitor a set of web pages (how often?)
• Analyze the differences
• Send the results to the user
4
Application Example
Google News screenshot taken on Apr 16
5
Application Example
Many websites allow users to receive email alerts or updated news on particular events
A special kind of query!
CNN webpage screenshot taken on Apr 16
6
Goal of this Paper
Continuous Adaptive Monitoring (CAM): allocate limited resources (such as bandwidth and computation power) to the monitoring tasks so that missed updates to pages are minimized
7
User Model: Continuous Queries (CQ)
Users issue long-lived queries of interest
Pages of interest may be added, modified, and deleted
System continually updates responses
8
Discrete vs. continuous queries
Discrete queries
• Query lives for an “instant”, one-shot answer
• Optimize content freshness each time
• Usually handled by page crawlers (such as Google Robot), with diverse periods
Continuous queries
• Queries have positive lifetime, many updates over time
• Updates must track changes continuously over certain periods of time
• Dynamic monitoring with more restrictive network resources, using new services such as CAM
9
Talk outline
Introduction and motivation
Previous approaches
Our contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments
Critique
Conclusion
10
Related work
CONQUER and WebCQ (Liu, Pu and Tang)
• Query language and architecture for CQ
• Do not address monitoring for freshness optimization
NIAGARA (DeWitt and Naughton)
• Query evaluation and optimization techniques
• Database query optimization setting
ChangeDetector (Boyapati et al.)
• Fixed-priority polling for given set of pages
Freshness for discrete queries
• Poisson updates (Cho and Garcia-Molina)
• Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
11
Alternative Solution: RSS (Rich Site Summary or RDF Site Summary)
An XML format for news and content syndication, in which headlines and links to the actual content are made available to Web sites
After the publishing site creates an RSS file of its content, other Web sites may use the headline feed, and the content can be read with a standard Web browser or by specialized RSS viewers
RSS is a push-pull based scheme, unlike the scheme discussed here, which is purely pull based
12
Paper Contributions
New monitoring framework to fit statistical models of page change behavior
Freshness optimization problem constrained by network resources
Two-phase solution to optimization tailored to CQ search systems
• Resource allocation (knapsack)
• Poll scheduling (flow-shop)
13
Continuous Adaptive Monitoring
Consider an epoch
Consider a large set of pages
Each time step j, each page i has probability $\rho_{i,j}$ of an update
• Can capture predictable periodicity
$\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page i, or change rate, in an epoch
Decision variables $y_{i,j}$
• Whether a page is visited at a time point
• Optimization goal
14
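Goal
The goal is to minimize the weighted importance of changes that are not reported to the users:
$$\min \sum_i W_i \, E(P_i)$$
where $W_i$ is the importance weight of page i and $P_i$ the changes to page i that go unreported
Put another way, update information reported for page i is
$$R_i = \sum_j \rho_{i,j} \, y_{i,j}$$
Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$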
15
Constraints and Metrics
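The system is subject to a polling resource constraint:
$$\sum_{i,j} y_{i,j} \le C$$
The metric, returned information ratio (RIR), is:
$$\mathrm{RIR} = \frac{\sum_i W_i \sum_j \rho_{i,j} \, y_{i,j}}{\sum_i W_i \lambda_i}$$
• Numerator: importance-weighted updates captured by the system
• Denominator: total importance-weighted expected updates, with $\lambda_i = \sum_j \rho_{i,j}$
The goal is equivalent to maximizing RIR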
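To make the metric concrete, here is a minimal Python sketch that computes RIR for a given poll plan and, taking the objective literally as written on these slides (linear in the decision variables), picks the C slots with the largest weight times update probability. The function names and the toy numbers are illustrative assumptions; this is not the paper's allocation procedure, which the slides do not spell out.

```python
from typing import List


def rir(rho: List[List[float]], w: List[float], y: List[List[int]]) -> float:
    """Returned information ratio: captured / expected importance-weighted updates."""
    captured = sum(
        w[i] * sum(r * v for r, v in zip(rho[i], y[i])) for i in range(len(rho))
    )
    expected = sum(w[i] * sum(rho[i]) for i in range(len(rho)))
    return captured / expected if expected > 0 else 0.0


def top_c_polls(rho: List[List[float]], w: List[float], budget: int) -> List[List[int]]:
    """Assumed rule: poll the `budget` (page, step) slots with the largest w_i * rho_ij."""
    slots = [
        (w[i] * rho[i][j], i, j)
        for i in range(len(rho))
        for j in range(len(rho[i]))
    ]
    slots.sort(key=lambda s: s[0], reverse=True)
    y = [[0] * len(row) for row in rho]
    for _, i, j in slots[:budget]:
        y[i][j] = 1
    return y


# Toy instance (made-up numbers): 2 pages, 3 time steps, budget of 2 polls.
rho = [[0.9, 0.1, 0.8], [0.2, 0.2, 0.2]]
w = [2.0, 1.0]
y = top_c_polls(rho, w, budget=2)
score = rir(rho, w, y)
```

With these numbers both polls go to the first page, at its two high-probability time steps, because its weighted update probabilities dominate.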
16
CAM System Overview
Time proceeds in epochs
At the end of every epoch we re-evaluate
• Relevance
• Update probabilities
For the next epoch
• We select instants at which to poll each page (resource allocation)
• Schedule these instants subject to resource constraints
[Diagram stages: Determining relevant pages, Tracking, Resource allocation, Scheduling, Monitoring]
18
Resource allocation
Existing policies
• Uniform: Resources (number of polls) distributed uniformly among all pages irrespective of their change frequency
• Proportional: number of polls allocated to a page is proportional to the frequency with which it changes
Better policies also exist, such as taking into account the weights of different pages
CAM: Discrete, Separable and Convex
• Better than uniform and proportional
• Proportional better than uniform
• Well studied optimization problem
BUT EXACTLY HOW DO THEY DO IT?
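The two baseline policies are simple enough to sketch. The Python below is an illustration only: the function names are hypothetical, and the largest-remainder rounding used to turn fractional shares into whole polls is an assumption, since the slides do not say how rounding is handled. CAM's own convex allocation is not shown because the slides do not describe it.

```python
from typing import List


def uniform_allocation(change_rates: List[float], budget: int) -> List[int]:
    """Spread polls evenly over pages, ignoring change rates."""
    n = len(change_rates)
    base, extra = divmod(budget, n)
    # The first `extra` pages get one additional poll so the total equals budget.
    return [base + (1 if i < extra else 0) for i in range(n)]


def proportional_allocation(change_rates: List[float], budget: int) -> List[int]:
    """Give each page a share of polls proportional to its change rate."""
    total = sum(change_rates)
    if total == 0:
        return uniform_allocation(change_rates, budget)
    shares = [budget * r / total for r in change_rates]
    polls = [int(s) for s in shares]  # floor of each fractional share
    leftover = budget - sum(polls)
    # Assumed rounding rule: hand leftover polls to the largest remainders.
    order = sorted(
        range(len(shares)), key=lambda i: shares[i] - polls[i], reverse=True
    )
    for i in order[:leftover]:
        polls[i] += 1
    return polls


rates = [8.0, 1.0, 1.0]  # made-up expected changes per epoch
uniform = uniform_allocation(rates, 10)
proportional = proportional_allocation(rates, 10)
```

For one fast-changing page and two slow ones, uniform gives every page about a third of the polls, while proportional shifts most of the budget to the fast page.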
19
Scheduling
Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long
Then we can fetch a total of C=MT pages during an epoch
• Ensured by resource allocation phase
But at each instant we cannot schedule more than M fetches
• Want small planned-to-actual poll delays
• May fail to schedule all poll jobs in an epoch
[Diagram stages: Determining relevant pages, Parameter tracking, Resource allocation, Scheduling, Monitoring; label: Tentative $y_{i,j}$'s]
20
A flow-shop problem
M “machines” available at any time
Each $y_{i,j}$ which is equal to 1 is a “job”
Job k is “released” at time step $r_k$ (= j)
“Processing time” = crawl time = $t_k$
“Completion time” of job k is $C_k$
Want to minimize “total flow”
$$\sum_k (C_k - r_k)$$
NP-hard problem
• The paper uses a 1.58 heuristic algorithm
[Figure: jobs plotted against time]
Evaluation: Preparing Knowledge
Zipfian Distribution (Power Law)
• Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950)
• Zipf curves follow a straight line when plotted on a double-logarithmic diagram
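The straight-line property can be checked numerically. The sketch below builds Zipf weights proportional to 1/k^s for ranks k = 1..n and fits a least-squares line to the log-log points; the slope comes out as -s. The function names and the choice of s = 1.2 are illustrative assumptions.

```python
import math
from typing import List


def zipf_weights(n: int, s: float = 1.0) -> List[float]:
    """Normalized Zipf weights: weight of rank k is proportional to 1 / k**s."""
    raw = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def loglog_slope(values: List[float]) -> float:
    """Least-squares slope of log(value) against log(rank)."""
    xs = [math.log(k) for k in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


weights = zipf_weights(100, s=1.2)
slope = loglog_slope(weights)  # close to -1.2
```

The same generator could serve as a stand-in for skewed synthetic inputs such as change frequencies or page importance.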
22
Zipf Properties
Three Key Observations:
• A few elements that score very high (the left tail in the diagrams)
• A medium number of elements with middle-of-the-road scores (the middle part of the diagram)
• A huge number of elements that score very low (the right tail in the diagram)
Zipf distributions have been shown to characterize use of words in a natural language
Web use follows a Zipf distribution
Other interesting examples include file popularity, web requests in caching, etc.
23
Experiments
Data (Synthetic)
• Change frequency distribution: a few pages change very often (Zipf)
• Update probability distribution: a few $\rho_{i,j}$'s are large, most are small (Zipf again)
• Page importance distribution: also Zipf (Wolman, 1999)
[Figure: number of pages vs. change frequency]
24
Experiment Parameters
Comparison Baselines:
• Uniform: resources (monitoring tasks) are allocated uniformly across all pages
• Proportional: resources are allocated in proportion to the change frequencies of the pages
Parameters:
• Number of Queries = 500
• Number of Pages = 500
• Number of Monitoring Tasks = 1000-50000
25
CAM > Proportional > Uniform
Uniform update and importance distribution
Plot RIR against ratio of resources to expected changes
RIR for CAM is >3 times better
Proportional is better than uniform in the CQ setting
[Figure: RIR vs. monitor/change ratio for Uniform, Proportional, and CAM]
26
Resource allocation
Sort pages by increasing change rate
Uniform spends same resource for each bin
Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
27
Skewed Distribution Effect (1)
CAM performs better as the distribution of update rates becomes more skewed
28
Skewed Distribution Effect (2)
More information is obtained as the update distribution becomes more skewed
29
Skewed Distribution Effect (3)
[Figure: RIR vs. Zipf parameter for CAM, Proportional, and Uniform]
As Zipf parameter increases, CAM performs better
30
CAM Performance with Epochs
CAM improves over initial epochs
Change distribution estimates stabilize within a few epochs
31
Effect of Monitoring Task ratio on CAM
Only when the ratio is 50 can CAM obtain all information.
Is this good performance?
32
Scheduling Performance
The first figure shows the document size distribution, and the second figure shows the loss of information due to scheduling
Some monitoring tasks are not done
Some monitoring tasks are delayed
33
Experiments on real pages (from one of the authors' talks)
Eight sites with dynamic cricket match information
• In fact, Zipfian updates
Adversarial setup: monitor/change < 1
• CAM close to best possible
For M/C=2, CAM reports 80% of the information changed
[Figure: number of changes per page index (pages 1-8)]
[Figure: RIR vs. monitoring-change ratio for Uniform, Proportional, and CAM]
34
Critique
The paper's description of the algorithm is vague and unclear
It is unknown how they performed the experiments
The performance is not good: it takes 50 times as many monitoring tasks as actual updates to get all information
Several assumptions do not hold: for example, successive updates are typically correlated, but the paper assumes that information about the previous update is completely lost when the next update occurs
35
Conclusion
Continual queries are inherently different from discrete queries
Approach used in CAM
• Identify relevant pages
• Track the pages as they change
• Characterize page change behavior
• Decide when to monitor the pages in future
CAM approach performs better than the naïve approaches