Monitoring the dynamic Web to respond to Continuous Queries Presented by Qing Cao CS851 Spring 2005.


2

Talk outline

Introduction and Motivation
Previous Approaches
Paper Contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments and Evaluations
Critique
Conclusions

3

Problem Context

Current web pages are highly dynamic
• 40% of commercial pages and 23% of all pages change per day (Sethuraman et al.)
How can search engines handle a user's long-term request on a particular topic?
• Need to monitor a set of web pages (how often?)
• Analyze the differences
• Send the results to the user

4

Application Example

Google News page shot on Apr 16

5

Application Example

Many websites allow users to receive email alerts or updated news on particular events

A special kind of query!

CNN webpage shot on Apr 16

6

Goal of this Paper

Design Continuous Adaptive Monitoring (CAM) so that it allocates limited resources (such as bandwidth and computation power) to monitoring tasks in a way that minimizes the number of missed page updates

7

User Model: Continuous Queries (CQ)

Users issue long-lived queries of interest
Pages of interest may be added, modified, and deleted
System continually updates responses

8

Discrete vs. continuous queries

Discrete queries
• Query lives for an “instant”, one-shot answer
• Optimize content freshness each time
• Usually handled by page crawlers (such as Google Robot), with diverse periods

Continuous queries
• Queries have positive lifetime, many updates over time
• Updates must track changes continuously over certain periods of time
• Dynamic monitoring with more restrictive network resources, using new services such as CAM

9

Talk outline

Introduction and motivation
Previous approaches
Our contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments
Critique
Conclusion

10

Related work

CONQUER and WebCQ (Liu, Pu and Tang)
• Query language and architecture for CQ
• Do not address monitoring for freshness optimization
NIAGARA (DeWitt and Naughton)
• Query evaluation and optimization techniques
• Database query optimization setting
ChangeDetector (Boyapati et al.)
• Fixed-priority polling for given set of pages
Freshness for discrete queries
• Poisson updates (Cho and Garcia-Molina)
• Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)

11

Alternative Solution: RSS (Rich Site Summary or RDF Site Summary)

An XML format for news and content syndication, in which headlines and links to the actual content are made available to other Web sites. After the publishing site creates an RSS file of its content, other Web sites may use the headline feed, and the content can be read with a standard Web browser or by specialized RSS viewers

RSS is a push-pull based scheme, unlike the scheme discussed here, which is purely pull based

12

Paper Contributions

New monitoring framework to fit statistical models of page change behavior

Freshness optimization problem constrained by network resources

Two-phase solution to optimization tailored to CQ search systems
• Resource allocation (knapsack)
• Poll scheduling (flow-shop)

13

Continuous Adaptive Monitoring

Consider an epoch
Consider a large set of pages
Each time step j, each page i has probability $\rho_{i,j}$ of an update
• Can capture predictable periodicity
$\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page i, or change rate, in an epoch
Decision variables $y_{i,j}$
• Whether a page is visited at a time point
• Optimization goal

14

Goal

The goal is to minimize the weighted importance of changes that are not reported to the users:

$$\min \sum_{i \in P} W_i \, E(i)$$

Put another way, update information reported for page i is

$$R_i = \sum_j \rho_{i,j} \, y_{i,j}$$

Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$

15

Constraints and Metrics

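The system is subject to a polling resource constraint:

$$\sum_{i,j} y_{i,j} \le C$$

Metric returned info ratio (RIR) is:

$$\mathrm{RIR} = \frac{\sum_i W_i \sum_j \rho_{i,j} \, y_{i,j}}{\sum_i W_i \lambda_i}$$

• Numerator: importance-weighted updates captured by system
• Denominator: total importance-weighted expected updates, with $\lambda_i = \sum_j \rho_{i,j}$

The goal is equivalent to maximizing RIR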
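As a rough illustration of the RIR metric, the sketch below computes the importance-weighted expected updates captured by a set of polls and divides by the total importance-weighted expected updates. The function name `returned_info_ratio` and the toy numbers are assumptions made for this example, not taken from the paper.

```python
from typing import Sequence


def returned_info_ratio(
    weights: Sequence[float],
    rho: Sequence[Sequence[float]],
    y: Sequence[Sequence[int]],
) -> float:
    """RIR: weighted expected updates captured / weighted expected updates."""
    # Numerator: sum_i W_i * sum_j rho[i][j] * y[i][j]
    captured = sum(
        w * sum(p * poll for p, poll in zip(rho_i, y_i))
        for w, rho_i, y_i in zip(weights, rho, y)
    )
    # Denominator: sum_i W_i * lambda_i, where lambda_i = sum_j rho[i][j]
    expected = sum(w * sum(rho_i) for w, rho_i in zip(weights, rho))
    return captured / expected if expected else 0.0


# Hypothetical toy instance: 2 pages, 3 time steps, budget C = 2 polls.
weights = [2.0, 1.0]
rho = [[0.5, 0.25, 0.25], [0.0, 0.5, 0.5]]
y = [[1, 0, 0], [0, 0, 1]]
assert sum(map(sum, y)) <= 2  # polling resource constraint
rir = returned_info_ratio(weights, rho, y)
print(rir)
```

In this toy instance the two polls capture half of the weighted expected updates, so the ratio is 0.5.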

16

CAM System Overview

Time proceeds in epochs
At the end of every epoch we re-evaluate
• Relevance
• Update probabilities
For the next epoch
• We select instants at which to poll each page (resource allocation)
• Schedule these instants subject to resource constraints

[Figure: CAM system diagram with stages labeled Determining relevant pages, Tracking, Resource allocation, Scheduling, and Monitoring]

18

Resource allocation

Existing policies
• Uniform: resources (number of polls) distributed uniformly among all pages irrespective of their change frequency
• Proportional: number of polls allocated to a page is proportional to the frequency with which it changes
Better policies also exist, such as ones that take the weights of different pages into account
CAM: allocation posed as a discrete, separable and convex optimization problem
• Better than uniform and proportional
• Proportional better than uniform
• Well studied optimization problem

BUT EXACTLY HOW DO THEY DO IT?
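The slides do not spell out the paper's allocation algorithm, so the following is only a sketch under a simplifying assumption: if the objective is the linear one from the Goal slide (maximize the sum of W_i · ρ_i,j · y_i,j with at most C polls in total), then picking the C (page, time step) pairs with the largest W_i · ρ_i,j is optimal. The function name `allocate_polls` and the toy data are hypothetical.

```python
import heapq


def allocate_polls(weights, rho, budget):
    """Choose the `budget` (page, step) pairs with the largest W_i * rho[i][j].

    Assumption: linear objective with a single cardinality constraint.
    This is an illustration, not the algorithm used in the paper.
    """
    candidates = [
        (w * p, i, j)
        for i, (w, rho_i) in enumerate(zip(weights, rho))
        for j, p in enumerate(rho_i)
    ]
    chosen = heapq.nlargest(budget, candidates)
    y = [[0] * len(rho_i) for rho_i in rho]
    for _, i, j in chosen:
        y[i][j] = 1  # poll page i at time step j
    return y


# Hypothetical toy instance: 2 pages, 3 time steps, 2 polls available.
weights = [2.0, 1.0]
rho = [[0.5, 0.25, 0.25], [0.0, 0.5, 0.5]]
y = allocate_polls(weights, rho, budget=2)
print(y)
```

Ties between equally valuable pairs are broken by the tuple ordering, which is an arbitrary choice for the sketch.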

19

Scheduling

Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long
Then we can fetch a total of C = MT pages during an epoch
• Ensured by resource allocation phase
But at each instant we cannot schedule more than M fetches
• Want small planned-to-actual poll delays
• May fail to schedule all poll jobs in an epoch

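[Figure: CAM system diagram with stages labeled Determining relevant pages, Parameter tracking, Resource allocation, Scheduling, and Monitoring; the tentative $y_{i,j}$ values are passed to Scheduling]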

20

A flow-shop problem

M “machines” available at any time
Each $y_{i,j}$ which is equal to 1 is a “job”
Job k is “released” at time step $r_k$ (= j)
“Processing time” = crawl time = $t_k$
“Completion time” of job k is $C_k$
Want to minimize “total flow” $\sum_k (C_k - r_k)$
NP-hard problem
• The paper uses a 1.58 heuristic algorithm

[Figure: jobs plotted against time]
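To make the scheduling constraint concrete, here is a minimal sketch that assumes every poll takes exactly one time step and serves released jobs first-come, first-served on M parallel fetchers. It is not the heuristic used in the paper; the function name `schedule_polls` and the sample release times are hypothetical.

```python
from collections import deque


def schedule_polls(release_times, machines, epoch_length):
    """FIFO scheduling of unit-time poll jobs on `machines` parallel fetchers.

    Returns (start_times, total_flow, dropped). start_times[k] is None when
    job k could not be scheduled before the epoch ended.
    """
    order = sorted(range(len(release_times)), key=lambda k: release_times[k])
    pending = deque()
    start = [None] * len(release_times)
    nxt = 0
    for t in range(epoch_length):
        # Release every job whose planned poll time has arrived.
        while nxt < len(order) and release_times[order[nxt]] <= t:
            pending.append(order[nxt])
            nxt += 1
        # Run at most `machines` pending jobs in this time step.
        for _ in range(min(machines, len(pending))):
            start[pending.popleft()] = t
    # A unit-time job started at step s completes at s + 1.
    total_flow = sum(
        s + 1 - release_times[k] for k, s in enumerate(start) if s is not None
    )
    dropped = [k for k, s in enumerate(start) if s is None]
    return start, total_flow, dropped


# Five planned polls, epoch of 4 time steps.
print(schedule_polls([0, 0, 0, 1, 3], machines=2, epoch_length=4))
print(schedule_polls([0, 0, 0, 1, 3], machines=1, epoch_length=4))
```

With two fetchers every job runs, some with a one-step delay; with one fetcher the delays grow and the last job is not polled within the epoch, which mirrors the two failure modes listed on the Scheduling slide.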

21

Evaluation: Preparing Knowledge

Zipfian Distribution (Power Law)
• Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950)
• Zipf curves follow a straight line when plotted on a double-logarithmic diagram
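The straight-line property follows directly from the power-law form. As a short worked step, assume the score of the element with rank r is f(r), with a constant c and Zipf parameter s (the symbols f, r, c and s are notation chosen here, not taken from the slides):

```latex
f(r) = c \, r^{-s}
\quad\Longrightarrow\quad
\log f(r) = \log c - s \log r
```

On a log-log plot this is a line with slope −s and intercept log c, so a larger Zipf parameter gives a steeper line and a more skewed distribution.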

22

Zipf Properties

Three Key Observations:
• A few elements that score very high (the left tail in the diagrams)
• A medium number of elements with middle-of-the-road scores (the middle part of the diagram)
• A huge number of elements that score very low (the right tail in the diagram)
Zipf distributions have been shown to characterize use of words in a natural language
Web use follows a Zipf distribution
Other interesting examples include file popularity, web requests in caching, etc.

23

Experiments

Data (Synthetic)
• Change frequency distribution: a few pages change very often (Zipf)
• Update probability distribution: a few $\rho_{i,j}$'s are large, most are small (Zipf again)
• Page importance distribution: also Zipf (Wolman, 1999)
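One way such synthetic inputs could be generated is sketched below. The slides do not describe the actual generator, so the helper `zipf_weights`, the page count, and the total of 1000 expected changes are all assumptions for illustration.

```python
def zipf_weights(n, s=1.0):
    """Normalized Zipf shares: the item of rank r gets a share proportional to 1 / r**s."""
    raw = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical setup: 500 pages sharing 1000 expected changes per epoch.
num_pages = 500
total_changes = 1000.0
change_rates = [total_changes * share for share in zipf_weights(num_pages, s=1.0)]
importance = zipf_weights(num_pages, s=1.0)

# A few pages change very often, most change rarely.
print(round(change_rates[0], 1), round(change_rates[-1], 3))
```

Setting `s=0.0` yields a uniform distribution, so the same helper can cover both the uniform and the skewed settings by varying the Zipf parameter.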

[Figure: number of pages (0 to 350) against change frequency (1 to 13)]

24

Experiment Parameters

Comparison Baselines:
• Uniform: resources (monitoring tasks) are allocated uniformly across all pages
• Proportional: resources are allocated in proportion to each page's change frequency
Parameters:
• Number of Queries = 500
• Number of Pages = 500
• Number of Monitoring Tasks = 1000-50000
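The two baselines can be sketched as simple poll-count allocations. The rounding rules (spreading leftover polls over the first pages for Uniform, largest-remainder rounding for Proportional) are assumptions of this sketch; the slides do not say how fractional shares were handled, and the function names are hypothetical.

```python
def uniform_allocation(change_rates, budget):
    """Give every page the same number of polls; leftovers go to the first pages."""
    n = len(change_rates)
    base, extra = divmod(budget, n)
    return [base + (1 if i < extra else 0) for i in range(n)]


def proportional_allocation(change_rates, budget):
    """Give each page polls in proportion to its change rate."""
    total = sum(change_rates)
    shares = [budget * rate / total for rate in change_rates]
    polls = [int(share) for share in shares]
    # Hand out the remaining polls to the largest fractional remainders.
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - polls[i], reverse=True
    )
    for i in by_remainder[: budget - sum(polls)]:
        polls[i] += 1
    return polls


# Hypothetical example: three pages with change rates 6, 3 and 1, and 10 polls.
rates = [6.0, 3.0, 1.0]
print(uniform_allocation(rates, 10))
print(proportional_allocation(rates, 10))
```

Both functions return one poll count per page; deciding at which time steps those polls happen is a separate step.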

25

CAM > Proportional > Uniform

Uniform update and importance distribution
Plot RIR against ratio of resources to expected changes
RIR for CAM is >3 times better
Proportional is better than uniform in the CQ setting

[Figure: RIR (0 to 0.2) against monitor/change ratio (2 to 8) for Uniform, Proportional and CAM]

26

Resource allocation

Sort pages by increasing change rate

Uniform spends the same resources on each bin

Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough

CAM invests more aggressively in fast-changing bins, achieving the greatest RIR

27

Skewed Distribution Effect (1)

CAM performs better as the distribution of update rates becomes more skewed

28

Skewed Distribution Effect (2)

More information is obtained as the update distribution becomes more skewed

29

Skewed Distribution Effect (3)

[Figure: RIR (0 to 0.6) against Zipf parameter (0 to 1.5) for CAM, Proportional and Uniform]

As the Zipf parameter increases, CAM performs better

30

CAM Performance with Epochs

CAM improves over initial epochs
Change distribution estimates stabilize within a few epochs

31

Effect of Monitoring Task ratio on CAM

Only when the ratio is 50 can CAM obtain all information.

Is this good performance?

32

Scheduling Performance

The first figure shows the document size distribution, and the second figure shows the loss of information due to scheduling

• Some monitoring tasks are not done
• Some monitoring tasks are delayed

33

Experiments on real pages (from one of the author’s talks)

Eight sites with dynamic cricket match information
• In fact, Zipfian updates
Adversarial setup: monitor/change < 1
• CAM close to best possible
For M/C = 2, CAM reports 80% of the changed information

[Figure: number of changes (0 to 500) per page index (1 to 8)]

[Figure: RIR (0 to 1) against monitoring-change ratio (0.3, 1, 10) for Uniform, Proportional and CAM]

34

Critique

The paper describes the algorithm vaguely, and it is unclear how the experiments were performed

The performance is not good: it takes 50 times the actual number of updates to get all information

Several assumptions do not hold: for example, successive updates to a page are typically correlated, but the paper assumes that the information from the previous update is completely lost when the next update occurs

35

Conclusion

Continual queries are inherently different from discrete queries

Approach used in CAM
• Identify relevant pages
• Track the pages as they change
• Characterize page change behavior
• Decide when to monitor the pages in future

CAM approach performs better than other naïve approaches

