
Semantics-Based Personalized Prefetching to Improve Web Performance¹

¹ Preliminary results were published in Proc. of the 20th IEEE Conf. on Distributed Computing Systems, pages 636–643, April 2000. This work was supported in part by NSF Grants EIA-0729828 and CCR-9988266.

Cheng-Zhong Xu and Tamer I. Ibrahim
Department of Electrical and Computer Engineering
Wayne State University, Detroit, MI 48202
[email protected]

Abstract

With the rapid growth of web services on the Internet, users are experiencing access delays more often than ever. Recent studies showed that prefetching could alleviate WWW latency to a larger extent than caching. Existing prefetching methods are mostly based on URL graphs: they use the graphical nature of HTTP links to determine the possible paths through a hypertext system. While these methods have been demonstrated to be effective in prefetching documents that are accessed often, few of them can pre-retrieve documents whose URLs have never been accessed. In this paper, we propose a semantics-based prefetching technique to overcome this limitation. It predicts future requests based on the semantic preferences of past retrieved documents. We applied the technique to news reading activities and prototyped a NewsAgent prefetching system. The system extracts document semantics by identifying keywords in URL anchor texts and relies on neural networks over the keyword set to predict future requests. It features a self-learning capability and good adaptivity to changes in user surfing interest. We cross-examined the system in daily browsing of the ABC, CNN, and MSNBC news sites for three months. The experimental results showed approximately a 60% hit ratio due to prefetching; of the pre-fetched documents, less than 24% were undesired.

1. Introduction

Bandwidth demands for the Internet are growing rapidly and saturating the Internet's capacity due to the increasing popularity of web services. Although upgrades of the Internet with higher-bandwidth networks have never stopped, users are experiencing web access delays more often than ever, and they are demanding latency-tolerant techniques to improve web performance. The demand is particularly urgent for users who connect to the Internet via low-bandwidth modem or wireless links.


Many latency-tolerant techniques have been developed over the years, most notably caching and prefetching. Caching takes advantage of the temporal locality of web references and is widely used on the server and proxy sides to alleviate web latency; it is often enabled by default in both the popular IE and Netscape browsers. However, recent studies showed that the benefits of client-side caching were limited due to the lack of sufficient temporal locality in a client's web references [21]. The potential for caching requested files was also shown to be declining over the years [1][2]. As a complement to caching, prefetching pre-retrieves web documents (more generally, web objects) that tend to be browsed in the near future while a user is processing previously retrieved documents [22]. Prefetching can also happen off-line, when a user is not browsing the web at all. A recent tracing experiment revealed that prefetching could offer more than twice the improvement due to caching alone [21].

Previous studies on web prefetching were mostly based on the history of user access patterns [8][24][29][31][14][30]. If the history information shows an access pattern of URL address A followed by B with high probability, then B will be pre-fetched once A is accessed. The access patterns are often represented by a URL graph, where each edge between a pair of vertices represents a follow-up relationship between the two corresponding URLs and the edge weight denotes the statistical probability of that relationship. A limitation of this URL-graph-based technique is that the pre-fetched URLs must have been frequently accessed before. The URL-graph-based approaches are demonstrably effective in prefetching documents that are often accessed. However, there is no way for such an approach to pre-retrieve documents that are newly created or never visited before. For example, all anchored URLs of a page are fresh when a client gets into a new web site, and none of them will be pre-fetched by these approaches. For dynamic HTML sites where a URL has time-variant contents, the existing approaches cannot retrieve the desired documents by any means.
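For concreteness, a first-order URL-graph predictor of this kind can be sketched as follows. This is an illustration of the general approach, not code from any cited system; all names are ours.

```python
from collections import defaultdict

class URLGraphPredictor:
    """First-order URL-graph predictor: edge weights are follow-up counts."""

    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))  # A -> {B: count}
        self.last_url = None

    def record_access(self, url):
        # Strengthen the edge from the previously visited URL to this one.
        if self.last_url is not None:
            self.edges[self.last_url][url] += 1
        self.last_url = url

    def predict(self, url, k=3):
        # Rank the observed successors of `url` by transition probability.
        followers = self.edges.get(url)
        if not followers:
            return []  # never-visited URL: the graph has nothing to offer
        total = sum(followers.values())
        ranked = sorted(followers.items(), key=lambda kv: kv[1], reverse=True)
        return [(b, n / total) for b, n in ranked[:k]]
```

Note that predict() returns nothing for a URL the graph has never seen; this is precisely the limitation the semantics-based approach targets.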

This paper presents a novel semantics-based prefetching technique to overcome these limitations. It predicts future requests based on the semantic preferences of past retrieved documents, rather than on the temporal relationships between URL accesses. It is observed that client surfing is often guided by keywords in the anchor texts of URLs. Anchor text refers to the text that surrounds hyperlink definitions (hrefs) in web pages. For example, a user with a strong interest in the Year 2000 problem won't miss any relevant news or web pages that are pointed to by a URL with the keyword "Y2K" in its anchor text. An on-line shopper with a favorite auto model (e.g., "BMW X5") would like to see all consumer reports or reviews about the model. We refer to this phenomenon as "semantic locality". Unlike search engines, which take keywords as input and respond with a list of semantically related web documents, a semantics-based prefetching system captures preferences from the client's past access patterns and predicts future references from a list of possible URLs when the client visits a new web site. Based on document semantics, this approach is capable of prefetching documents whose URLs have never been accessed. For example, it can help a client with interests in Y2K problems pre-fetch Y2K-related documents when he/she gets into a new web site.

It is known that the HTML format was designed to represent the presentation structure of a document rather than logical data. A challenge is how to extract the semantics of HTML documents and construct adaptive semantic nets between URLs on-line. Semantics extraction is a key to


web search engines, as well. Our bitter experience with today's search engines leads us to believe that current semantics extraction technology is far from mature enough for a general-purpose semantics-based system. Instead, we focus on domain-specific semantics-based prefetching.

The prefetching technique is first targeted at news services. We are interested in this application domain for the following reasons. First, news headlines are often used as URL anchors in most news web sites, and the meaning of a news item can be represented easily by a group of keywords in its headline. Second, news web sites like abc.com, cnn.com, and msnbc.com often cover the same topics during a certain period. They can therefore be used to cross-examine the performance of semantics-based prefetching systems: one site is used to accumulate URL preferences and the others to evaluate prediction accuracy. Third, the effects of prefetching in the news domain can be evaluated separately from caching. Prefetching is often coupled with caching, and prefetching takes effect in the presence of a cache. Since a user rarely accesses the same news item more than once, there is very little temporal locality in the user access patterns of news surfing. This leads to a zero impact of caching on the overall performance of prefetching systems.

We prototyped a NewsAgent system. It monitors web surfing behaviors, dynamically establishes keyword-oriented access preferences, and employs neural networks to predict future requests based on those preferences. Such a domain-specific prefetching system need not run all the time; it should be turned on or off on demand.

The rest of the paper is organized as follows. Section 2 presents neural network basics, with an emphasis on the key parameters that affect prediction accuracy and the self-learning rules. Section 3 presents the design of the NewsAgent and its architecture. The subsequent three sections present the details of our experimental methodology and results in browsing three popular news sites. Related work is discussed in Section 7. Section 8 concludes the paper with remarks on future work.

2. Neural Network Basics

Neural nets have been around since the late 1950s and have been in practical use for general applications since the mid-1980s. Due to their resilience against distortions in the input data and their capability of learning, neural networks are often good at solving problems that do not have algorithmic solutions [18].

2.1. Neurons

A neural network comprises a collection of neurons. Each neuron performs a very simple function: it computes a weighted sum of a number of inputs and compares the result to a predefined net threshold value. If the result is less than the net threshold, the neuron yields an output of zero; otherwise, one. Specifically, given a set of inputs x_1, x_2, x_3, ..., x_n associated with weights w_1, w_2, w_3, ..., w_n, respectively, and a net threshold T, the binary output y of a neuron can be modeled as:

$$ y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge T \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Equation 2.1)} $$


The output y is often referred to as the net of the neuron. Figure 1 shows a graphical representation of the neuron. Although it is an over-simplified model of the functions performed by a real biological neuron, it has been shown to be a useful approximation. Neurons can be interconnected to form powerful networks that are capable of solving problems of various levels of complexity.

Figure 1. A graphical representation of a neuron: inputs X1, X2, ..., Xn with weights W1, W2, ..., Wn, threshold T, and output Y.

To apply the neural network to web prefetching, the predictor examines the URL anchor texts of a document and identifies a set of keywords for each URL. Using the keywords as input, the predictor calculates the net of each URL neuron. The nets determine the priority of the URLs to be pre-fetched.
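As an illustration (not the authors' code), the per-URL scoring of Equation 2.1 might look like the following sketch; the keyword and weight structures are assumptions.

```python
def url_net(keywords, weights):
    """Net of a URL neuron: weighted sum over the keywords present in the
    URL's anchor text (binary inputs, x_i = 1 for each keyword present)."""
    return sum(weights.get(kw, 0.0) for kw in keywords)

def rank_for_prefetch(candidates, weights, threshold):
    """Keep the candidate URLs whose net clears the threshold T of
    Equation 2.1, highest preference first.

    candidates -- list of (url, keywords) pairs from the anchor texts
    """
    scored = [(url, url_net(kws, weights)) for url, kws in candidates]
    above = [(url, net) for url, net in scored if net >= threshold]
    return sorted(above, key=lambda kv: kv[1], reverse=True)
```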

2.2. Learning Rules

A distinguishing feature of neurons (or neural networks) is self-learning, which is accomplished through learning rules. A learning rule incrementally adjusts the weights of the network to improve a predefined performance measure over time. Generally, there are three types of learning rules: supervised, reinforced, and unsupervised. In this study, we are interested mainly in supervised learning, where each input pattern is associated with a specific desired target pattern. Usually, the weights are synthesized gradually and updated at each step of the learning process so that the error between the output and the corresponding desired target is reduced [18].

Many supervised learning rules are available for single-unit training. A most widely used approach is the µ-LMS learning rule. Let

$$ J(w) = \tfrac{1}{2} \sum_{i=1}^{m} (d_i - y_i)^2 \qquad \text{(Equation 2.2)} $$


be the sum-of-squared-error criterion function over the m training patterns, where d_i is the desired target for pattern i and y_i is the actual output. Using steepest gradient descent to minimize J(w) gives

$$ w^{k+1} = w^k + \mu \sum_{i=1}^{m} (d_i - y_i)\, x_i \qquad \text{(Equation 2.3)} $$

where µ is often referred to as the learning rate of the net because it determines how fast the neuron can react to new inputs. Its value normally ranges between 0 and 1. The learning rule governed by this equation is called the batch LMS rule. The µ-LMS rule is its incremental version, which starts from w^1 = 0 (or an arbitrary value) and updates the weights one pattern at a time:

$$ w^{k+1} = w^k + \mu\, (d_i - y_i)\, x_i \qquad \text{(Equation 2.4)} $$

where pattern i is the one presented at step k. The LMS learning rule is employed to update the weights of the keywords of URL neurons.
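A minimal sketch of this update, assuming binary keyword inputs (x_i = 1 for each keyword present in the anchor text) and the d = 1 / d = 0 targets used in Section 3:

```python
def lms_update(weights, keywords, desired, output, mu=0.1):
    """One incremental µ-LMS step (Equation 2.4) for a single URL access.

    weights  -- dict mapping keyword -> weight w_i
    keywords -- keywords found in the accessed URL's anchor text (x_i = 1)
    desired  -- target d: 1 if the document was actually wanted, 0 otherwise
    output   -- the neuron's binary output y for this input
    """
    error = desired - output
    for kw in keywords:
        weights[kw] = weights.get(kw, 0.0) + mu * error  # x_i = 1 here
    return weights
```

Note that when the output already matches the target (d = y), the weights stay unchanged, which is exactly the cache-hit behavior described in Section 3.2.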

3. NewsAgent: A Web Prefetching Predictor

This section presents the details of the keyword-oriented, neural-network-based prefetching technique and the architecture of an application-domain predictor: NewsAgent.

3.1. Keywords

A keyword is the most meaningful word within the anchor text of a URL. It is usually a noun referring to some role in an affair or some object of an event. For example, associated with the news "Microsoft is ongoing talks with DOJ" of March 24, 1999 on MSNBC are the two keywords "Microsoft" and "DOJ"; the news "Intel speeds up Celeron strategy" contains the keywords "Intel" and "Celeron". Note that it is the predictor that identifies the keywords from the anchor texts of URLs while users retrieve documents. Since keywords tend to be nouns and often start with capital letters, the task of keyword identification is simplified.
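The paper does not spell out the identification routine; the following is a plausible sketch of the capitalization heuristic (the stop-word list and class name are illustrative):

```python
import re
from html.parser import HTMLParser

STOP_WORDS = {"The", "A", "An", "In", "On", "Is", "With", "Up"}  # illustrative

class AnchorKeywordParser(HTMLParser):
    """Collect (href, keywords) pairs from the anchor texts of a page."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.anchors = []  # list of (href, [keywords])

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(self._text)
            # Heuristic from the paper: keywords tend to be capitalized nouns.
            words = re.findall(r"\b[A-Z][A-Za-z0-9]*\b", text)
            self.anchors.append((self._href,
                                 [w for w in words if w not in STOP_WORDS]))
            self._href = None
```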

It is known that the meaning of a keyword is often determined by its context. An interesting keyword in one context may be of no interest at all to the same user if it occurs in other contexts. For example, a Brazil soccer fan may be interested in all the sports news related to the team; in this sports context, the word "Brazil" will guide the fan to all related news. However, the same word may be of less importance to the same user if it occurs in the general or world news categories, unless the user has a special tie with Brazil. The fact that a keyword has different implications in different contexts is well recognized in the structural organization of news sites. Since a news site usually,




if not always, organizes its URLs into categories, we assume one prefetching predictor per category for accurate prediction. The idea of using categorical or topic-specific information to improve the precision of information retrieval can also be seen in the focused crawling of search engines [3][16][27].

It is also known that the significance of a keyword to a user changes with time. For example, news related to "Brazil" hit the headlines during the Brazilian economic crisis in 1998. The keyword would have been significant to any reader of market or financial news during that period; after the crisis was over, its significance decreased unless the reader had special interests in the Brazilian market. Due to the self-learning property of a neural network, the weight of the keyword can be adapted to this change.

3.2. Architecture

Figure 2 shows the architecture of the NewsAgent. Its kernel is a prediction unit that monitors the user's browsing. For each document the user requests, the predictor examines the keywords within its anchor text, calculates their preferences (nets), and pre-fetches those URLs containing keywords with preferences higher than the neuron threshold.


1. The user requests a URL associated with a category.
2. The URL document is retrieved and displayed to the user.
3. The contents of the document are examined to identify the first set of links to evaluate.
4. Each link in the set is processed to extract its keywords.
5. Keywords are matched against the keyword list; they are added to the list if they occur for the first time.
6. The corresponding preferences are computed using the weights associated with the keywords.
7. Links with preferences higher than the neuron threshold are sorted and placed in the prefetching queue.
8. Links are accessed one by one and the objects are placed in the cache.
9. If the prefetching queue is empty, process the next set of links [Step 2].
10. When the user requests a link, the cache list is examined to see if it is there.
11. If it is there, it is displayed.
12. If not, it is retrieved from the web and the corresponding keywords' weights are updated.

(A code sketch of this loop follows Figure 2.)

Figure 2. Architecture of the NewsAgent: the prediction unit connects the document's links window, keyword extraction, the keyword list, preference computation, the prefetch queue, and the cache of pre-fetched objects.
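The following compact sketch (not the authors' Java implementation) strings the main steps together, reusing url_net(), lms_update(), and AnchorKeywordParser from the earlier sketches; fetch() is an assumed convenience wrapper, and the queue-driven step 9 is omitted for brevity.

```python
from urllib.request import urlopen

def fetch(url):
    # Bare-bones retrieval helper (no error handling; illustration only).
    return urlopen(url).read().decode("utf-8", errors="replace")

def browse(url, url_keywords, weights, threshold, cache, window=4, mu=0.1):
    """One NewsAgent iteration over the steps listed above."""
    if url in cache:                        # steps 10-11: hit, serve from cache
        html = cache.pop(url)               # weights stay unchanged (d = y)
    else:                                   # step 12: miss; fetch and reward
        html = fetch(url)                   # the keywords of this link (d = 1)
        lms_update(weights, url_keywords, desired=1, output=0, mu=mu)
    parser = AnchorKeywordParser()          # steps 3-4: links and their keywords
    parser.feed(html)
    for href, kws in parser.anchors[:window]:   # evaluate one window of links
        for kw in kws:                          # step 5: new keywords join the list
            weights.setdefault(kw, 0.0)
        if url_net(kws, weights) >= threshold:  # steps 6-7: preference vs. threshold
            cache[href] = fetch(href)           # step 8: pre-fetch into the cache
    return html
```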

In light of the fact that a document may contain a large number of links, the predictor does not examine all the anchor texts of URLs at once. Instead, it examines and pre-fetches documents through a window; that is, a fixed number of URLs are considered at a time. This patterns after user browsing behavior, because users look at a few links before deciding which to follow. Since the evaluation of links takes non-negligible time, group evaluation also provides a quick response to user web surfing. The window size is referred to as the breadth of the prefetching.

Once an anchored URL is clicked, the predictor looks for the requested document in the cache. If it has already been pre-fetched, the document is displayed. (The weights of its related keywords remain unchanged by the rule of Equation 2.4.) Otherwise, the predictor suspends any ongoing prefetching process and examines the caption of the requested link for keywords. If a keyword appears for the first time, a new weight is assigned to it; otherwise, its existing weight is positively updated (d = 1) by the rule of Equation 2.4.

3.3. Control of Pre-fetch Cache and Keyword List

Notice that the keyword list could fill up with unwanted items when the user loses interest in a long-standing topic. The pre-fetch cache may also contain undesired documents due to the presence of trivial (non-essential) keywords in document captions. For example, the keyword "DOJ" is trivial in a technology news section; it is of interest only when it appears together with some other technology-related keyword such as Microsoft or IBM. Figure 3 shows the mechanism NewsAgent uses to control overflow of the keyword list and to purge undesired documents from the pre-fetch cache.

By checking the cache, we can identify documents that were pre-fetched but never accessed. The system then decrements the weights of their related keywords. We also set a purging threshold that determines whether a keyword should be eliminated from the keyword list. If the weight of a keyword is less than this threshold, we examine the trend of the weight's change; if the trend is negative, the keyword is considered trivial and is eliminated from the list.


1. When the cache list is full, use the LRU algorithm to remove the least recently accessed objects.
2. Extract the corresponding keywords.
3. Negatively update those keywords' weights.
4. Compare the updated weights to the minimum weight threshold.
5. If they are below it, check the updating trend (decreasing in this case).
6. Purge from the list those keywords that are below the minimum weight threshold.
7. If the keyword list is full, apply steps 4-6 as well; however, the trend will not necessarily be decreasing in this case.

Figure 3. Overflow Control for the Keyword List and Pre-fetch Cache

Keywords are purged once the keyword list is full. The list size should be set appropriately: too small a list forces the unit to purge legitimate keywords, while too large a list fills with trivial keywords, wasting space and evaluation time. We set the list size to 100 and the purging threshold to 0.02 in our experiments, based on our preliminary testing. These settings are by no means optimal choices.
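A sketch of the purging step under these settings follows; the prev_weights history used to detect the trend is an assumed bookkeeping structure, since the paper specifies only the threshold and the trend check. Passing require_decreasing=False corresponds to the list-full case in step 7 above.

```python
PURGE_THRESHOLD = 0.02   # purging threshold used in the paper's experiments
MAX_KEYWORDS = 100       # keyword list size used in the paper's experiments

def purge_trivial(weights, prev_weights, require_decreasing=True):
    """Remove keywords whose weight fell below the purging threshold and,
    when required, whose weight is also trending downward."""
    for kw in list(weights):
        below = weights[kw] < PURGE_THRESHOLD
        decreasing = weights[kw] < prev_weights.get(kw, weights[kw])
        if below and (decreasing or not require_decreasing):
            del weights[kw]
    prev_weights.clear()
    prev_weights.update(weights)  # snapshot for the next trend check
```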

Keywords can also be purged when the cache is full (or filled to a certain percentage) and the program needs to place a new document in it, where the only way to do so is to replace one or more existing documents. The documents to be replaced are those that have been in the cache for a long time without being accessed. The caption that resulted in prefetching each such document is first examined and its keywords are extracted. Then the weights associated with those keywords are negatively updated (d = 0) by the rule of Equation 2.4. This ensures that those keywords are less likely to trigger the prefetching unit in the future.



4. Experimental Methodology

The system was developed entirely in Java and integrated with a pure Java web browser, namely IceBrowser from IceSoft.com. This lightweight browser, with its open source code, gave us great flexibility in the evaluation of our algorithm. Using such a lightweight browser simplified the process of setting experiment parameters, because the browser did only what it was programmed for; there were no hidden functions or add-ins to worry about, as there are with full-blown browsers like Netscape and Explorer. Such browsers have their own prefetching and caching schemes that may conflict with our algorithm.

Since the web is essentially a dynamic environment, its varying network traffic and server load make the design of a repeatable experiment challenging. There have been studies dedicated to characterizing user access behaviors [2] and prefetching/caching evaluation methodologies; see [11] for a survey of the evaluation methods. Roughly, two major evaluation approaches are available: simulation based on a trace of user accesses, and real-time simultaneous evaluation. A trace-based simulation works like an instruction-trace-driven machine simulator. It assumes a parameterized test environment with varying network traffic and workload. While there are many web traffic benchmarking and characterization studies, few of the publicly available trace data sets provide semantic information. Since semantics-based prefetching techniques are domain specific, they should be enabled or disabled when clients enter or exit domain services. To the best of our knowledge, there are no such domain-specific trace data sets available for evaluating the NewsAgent.

Real-time simultaneous evaluation [10] was initially designed to evaluate multiple web proxies in parallel by duplicating and passing object requests to each proxy. It complements trace-based simulation because it reduces the problems of unrealistic test environments and dated and/or inappropriate workloads. Along the lines of this approach, we evaluated the performance of NewsAgent by cross-examining multiple news sites. One site was used for neural network training. The trained NewsAgent then worked, simultaneously with a browser without prefetching, to retrieve news information from other sites for a period. Since the algorithm aims to predict future requests based on a user's access pattern, all the experiments were conducted by a single tester, all the time, on a Pentium II 600 MHz Windows NT machine with 128 MB of RAM connected to the Internet through a T-1 connection. Our experimental settings follow.

• Topic selection: To establish user access patterns systematically, we defined topics of interest to guide the tester's daily browsing. We are interested in topics that have extensive coverage in major news sites for a certain period, so that we could complete the data collection and cross-examination before the media changed its interest. During the time this experiment was conducted (from 4/1 to 6/15/1999), the KOSOVO crisis was at its peak. We selected this topic because it was receiving all the media's attention and we expected it would last long enough for us to complete the experiment. Notice that NewsAgent has no intrinsic requirement that topics be selected in advance; reader interests may change over time, and the algorithm works the same way whether or not a topic is defined. However, without a selected topic in the evaluation, most of the keywords generated from the tester's random browsing would be trivial ones destined to be purged.


• Site selection: The next step was to choose sites where a reader interested in the topic would most likely find an abundance of information about it. After evaluating many popular news content sites, we decided on cnn.com for the first phase (training) and abc.com for the second (evaluation). The training phase took a little over a month (4/1 to 5/5/1999), followed by a comparable evaluation phase (5/7 to 6/14/1999).

• User access pattern: The user access pattern was monitored and logged daily. The time allowed for viewing a given page was set not to exceed one minute. Obviously, with infinite time between two consecutive requests, all other links could be evaluated, pre-fetched, and cached in advance; this is not the case in real life, which is why we set a time limit.

5. Experimental Results

As indicated in the previous section, the experiment was conducted in two phases: training and evaluation. This section presents the results of each phase.

5.1. Training Phase

Training is critical to the success of the network. In this phase, the network does not pre-fetch any future requests. All it does is monitor the access pattern and consistently establish and update the inputs and their corresponding weights. In general, more training results in more accurate predictions. This may not be the case in news prefetching training, because new topics, characterized by new sets of keywords, emerge endlessly to replace old subjects. During this phase (04/01 to 05/05/1999), a total of 163 links were accessed. The average link size was 18K bytes (minimum 9.4K, maximum 33.4K). Figure 4 shows the daily access pattern during this period.

Figure 4. Access Patterns in the Training Session: number of links accessed and number of new keywords identified per day, 4/1/99 to 4/29/99.


From Figure 4, it can be seen that in the first four days (4/1-4/4/99), new keywords were identified by the neuron as more links were accessed. This is expected, because we started with an initialized neuron with no inputs (keywords). The next two days (4/5-4/6) yielded no new keywords, even though more links were accessed; this means the topic was characterized by a very limited number of keywords during that period. The number of new keywords identified then rises sharply in the following two days (4/7-4/8/99) due to the dynamics of the subject: any related breaking news introduced new keywords. We can identify a repetitive cycle here. As links are accessed, a number of new keywords are identified. This number declines as the topic saturates (i.e., no more keywords are introduced), but then changes to the topic introduce another set of new keywords, and the cycle repeats. This cycle is always visible in dynamic topics such as the one selected for this study.

The neuron parameters used in the training phase were a threshold of 0.5 and a learning rate of 0.1. The threshold was set to 0.5 because it is a typical median value for the output of the neuron. The learning rate was set to 0.1 to maintain the stability of the neuron and guarantee a smooth weight-updating process; in the next section, we show that high learning rates can lead to instability and, subsequently, the failure of the unit. To examine the progress of the training process, we evaluated the output of the neuron after each link was accessed. The output is calculated according to Equation 2.1. An output of 0.5 or higher is considered a success, since it indicates that the neuron identified the input set as a match. Figure 5 plots the success rate of the neuron against the number of links accessed. As we can see, the success rate increases with the number of links accessed until it stabilizes around 60%. That means that after a reasonable training period, the unit could identify 6 out of 10 links as links of interest. Since learning is an iterative process, and the system we are training the unit on is not deterministic, a success rate of 60% is an acceptable value. Since topic changes keep feeding additional sets of keywords to the neuron, we would not expect an extremely high success rate.

Figure 5. Success Rate of the Neuron versus the Number of Links (training progress over the 163 accessed links).


5.2. Evaluation Phase

The training phase was conducted using cnn.com as the source of the training set. In this subsection we present the results obtained from testing the unit on abc.com. There were neither weight updates nor extra keyword inputs to the neuron while abc.com was visited; we used the neuron trained during the browsing of cnn.com, as discussed in the preceding subsection. Figure 6 shows the access pattern we used in this phase.

Prefetching systems are often evaluated in terms of four major performance metrics: hit ratio, byte-hit ratio, waste ratio, and byte waste ratio. Hit ratio refers to the percentage of requested links that are found in the prefetching cache, relative to the total number of requests; byte-hit ratio is the same percentage measured in bytes. This pair of metrics reflects the improvement in web access latency. Waste ratio refers to the percentage of pre-fetched documents in the cache that are undesired; byte waste ratio is the same percentage measured in bytes. This pair of metrics reflects the cost of prefetching (i.e., the extra bandwidth consumption).
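These four metrics are straightforward to compute from an access log. A minimal sketch, assuming we record each request's cache outcome and each pre-fetched document's eventual use (the field layout is ours):

```python
def prefetch_metrics(requests, prefetched):
    """requests   -- list of (size_bytes, was_cache_hit) for user requests
    prefetched -- list of (size_bytes, was_ever_requested) for pre-fetched docs
    Assumes both lists are non-empty."""
    hit_sizes = [s for s, hit in requests if hit]
    req_sizes = [s for s, _ in requests]
    waste_sizes = [s for s, used in prefetched if not used]
    pre_sizes = [s for s, _ in prefetched]
    return {
        "hit_ratio": len(hit_sizes) / len(requests),
        "byte_hit_ratio": sum(hit_sizes) / sum(req_sizes),
        "waste_ratio": len(waste_sizes) / len(prefetched),
        "byte_waste_ratio": sum(waste_sizes) / sum(pre_sizes),
    }
```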

Figure 6. Access Patterns in the Evaluation Session: number of links accessed per day, 5/7/99 to 6/11/99.

Figure 7 shows the hit and byte-hit ratios in this phase as a function of the number of accessed links. Although the news on abc.com had never been accessed before by the tester, the NewsAgent achieved a hit rate of around 60% based on the semantic knowledge accumulated in visiting cnn.com. Since the news pages differ little in size, the plot of the byte-hit ratio matches that of the hit ratio. Since the topic "KOSOVO" lasted longer than our experiment period, the "KOSOVO"-related keywords used to train the NewsAgent neuron in April 1999 remained the leading keywords guiding the user's daily browsing of related news. Since the input keywords and their weights in the neuron remained unchanged in the evaluation phase, it is expected that the accuracy of the predictor would decrease slowly as the tester moved away from the selected topic.


Figure 7. Hit Ratio and Byte Hit Ratio in the Evaluation Session, as functions of the number of links accessed.

Figure 8 shows an average waste ratio of 24%, and approximately the same byte waste ratio, during the evaluation phase. This is because many of the links contained similar material, and the reader might be satisfied with only one of them. Also, in the evaluation phase, the process of updating weights and purging trivial keywords was not active, so the neuron kept pre-fetching undesired documents as long as its output exceeded the net threshold. This leads to a higher waste ratio as well.

Figure 8. Waste Ratio and Byte Waste Ratio in the Evaluation Session, as functions of the number of links accessed.

6. Effects of Net Threshold and Learning Rate

The accuracy of the predictive decisions is affected by a number of parameters in NewsAgent. To evaluate the effects of these parameters, particularly the net threshold and the learning rate, on the hit ratio and waste ratio, we implemented multiple predictors with identical parameters, except for the one under test, fed with the


same inputs and access sequence. We conducted the experiment over two new sites: the Today in America front page at MSNBC and the US News page at CNN.com. The first is full of multimedia documents, while the other is a combination of plain text and image illustrations. The experiment ran from November 18, 1998 to December 16, 1998, five days a week, with two half-hour sessions a day. Results from surfing these two new sites not only reveal the impact of the parameters but also help verify the observations we obtained in the previous experiment.

During the experiment, the topic "Clinton Affair" had extensive coverage on both sites. As explained in the preceding section, we selected a guiding hot topic for the evaluation because its extensive coverage would help collect more data within a short time period. The neural-net-based predictor can identify keywords and establish user access patterns automatically by monitoring user browsing behavior. Keywords related to the "Clinton Affair" topic include Clinton, Starr, Lewinsky, Testimony, etc.; each related piece of news contains one or more such keywords. To reflect the fact that users may be distracted from their main topic of interest for a while when browsing, we intentionally introduced noise by periodically selecting headlines that do not belong to the topic: we accessed a random link every 10 trials. We also assumed each access was followed by one minute of viewing; a recent analysis of web access patterns showed that HTTP requests were made at an average rate of about 2 requests per minute [2].

Users may lose interest in a certain topic after a period of time. The system responds to this by updating the weights associated with the relevant keywords.

6.1. Effect of Net Threshold

Figure 9 plots the hit ratios due to prefetching with neuron thresholds of 0.25, 0.50, and 0.75. In agreement with Equation 2.1, the figure shows that a low threshold yields a high hit ratio for a large cache. Since an extremely small threshold triggers the prefetching unit frequently, it leads to pre-retrieval of most of the objects on a given URL. By contrast, the higher the threshold, the smoother the curve and the more accurate the prediction.

Figure 9. Effects of Threshold on Cache Hit Ratio: hit ratio versus number of requests for T = 0.25, 0.50, and 0.75.


Figure 10 shows the waste ratio in the cache. As expected, the lower the threshold value, the more undesired objects get cached, and the cache may fill up with undesired objects quickly. There is a tradeoff between the hit ratio and the utilization ratio of the cache; the optimum net threshold should strike a good balance between the two objectives.

Figure 10. Effects of Threshold on Cache Waste Ratio: percentage of undesired cached documents for T = 0.25, 0.50, and 0.75.

6.2. Effects of Learning Rate

The learning rate controls the speed at which a prediction unit responds to a given input. Figure 11 shows the hit ratio under different learning rates. From the figure, it can be seen that a high learning rate leads to a high hit ratio, because high learning rates update the weights quickly. The ripples on the curve are due to this fast updating.

The figure also shows that the smaller the learning rate, the smoother the curve and the more accurate the prediction unit. This is because the weights of trivial keywords are updated with minimal magnitude, which generally benefits cache utilization: insignificant weights won't trigger a prediction unit with a lower learning rate. This may not be the case for a unit with a higher learning rate, as shown in Figure 12. Thus, choosing a proper learning rate depends mainly on how fast we want the prediction unit to respond and what percentage of cache waste we can tolerate.

Figure 11. Effect of Learning Rate on Hit Ratio: hit ratio versus number of requests for R = 0.05, 0.10, 0.20, and 0.50.


Figure 12. Effect of Learning Rate on Cache Waste Ratio: percentage of undesired cached documents for R = 0.05, 0.10, 0.20, and 0.50.

6.3. Update of Keyword Weights and Purging of Trivial Keywords

Recall that the prefetching unit relies on a purging mechanism to identify trivial weights and decrease their impact on the prediction unit, so that it is not triggered by accident. Through our tests, we found that on average approximately 4 to 5 trivial keywords are introduced for each essential keyword. Figure 13 plots the distribution of keyword weights in NewsAgent. The figure shows that the system bounded the trivial weights and kept them near the minimum throughout the browsing process.

As outlined in Figure 3, the weights of keywords are reduced whenever their related objects in the cache are to be replaced by the prefetching unit. Objects that were never referenced have the lowest preference value and are replaced first; the weights of the keywords associated with each replaced object are then updated accordingly. This guarantees that trivial keywords have minimal effect on the prediction unit.

Figure 13. Effect of Unrelated Keywords on Prediction Accuracy: weight values of main versus trivial keywords over 300 requests.


7. Related Work

The concept of prefetching has long been used in a variety of distributed and parallel systems to hide communication latency [7]. A prefetching topic closely related to web surfing is file prefetching and caching in the design of remote file systems. Shriver et al. presented an overview of file system prefetching and explained, with an analytical model, why file prefetching works [32]. Most notable work in automatic prefetching is due to Griffioen and Appleton [17], who proposed predicting future file accesses based on past file activities characterized by probability graphs. Padmanabhan and Mogul adapted the Griffioen-Appleton predictive algorithm to reduce web latency [29].

Web prefetching is an active topic in both academia and industry. The algorithms used in today's commercial products are either greedy or user-guided. A greedy prefetching technique does not keep track of any user web access history; it pre-fetches documents based solely on the static URL relationships in a hypertext document. For example, WebMirror [25] retrieves all anchored URLs of a document for fast mirroring of a whole web site; NetAccelerator [19] pre-fetches the URLs on a link's downward path for a quick copy of HTML-formatted documents.

User-guided prefetching techniques are based on user-provided preference URL links. Go-Ahead-Got-It! [15] and Connectix Surf Express [5] are two industry-leading examples in this category. The former relies on a list of predefined favorite links, while the latter maintains a record of the user's frequently accessed web sites to guide prefetching of related URLs.

Research in academia is mostly targeted at intelligent predictors based on user access patterns. The predictors can be associated with servers, clients, or proxies. Padmanabhan and Mogul [29] presented the first server-side prefetching technique, based on that of Griffioen and Appleton. It depicts the access pattern as a weighted URL dependency graph, where each edge represents a follow-up relationship between a pair of URLs and the edge weight denotes the transfer probability from one URL to the other. The server updates the dependency graph dynamically as requests come in from clients and makes predictions of future requests. Schechter et al. [31] presented methods of building a sequence prefix tree (path profile) from the HTTP requests in server logs and using the longest matched, most frequent sequence to predict the next requests. Sarukkai proposed a probabilistic sequence generation model based on Markov chains [30]. On receiving a client request, the server uses probabilistic HTTP link prediction to calculate the probabilities of the next requests based on the history of requests from the same client.

Loon and Bharghavan proposed an integrated approach for proxy servers to perform prefetching, image filtering, and hoarding for mobile clients [24]. They used usage profiles to characterize user access patterns. A usage profile is similar to the URL dependency graph, except that each node carries an extra weight indicating the access frequency of the corresponding URL. They presented heuristic weighting schemes to reflect users' changing access patterns and the temporal locality of accesses when updating usage profiles. Markatos and Chronaki [26] proposed a top-10 prefetching technique in which servers push their most popular documents to proxies regularly according to clients' aggregated access profiles. One question is that a high hit ratio in proxies due to prefetching may give servers a wrong perception of future popular documents.


Fan et al. [14] proposed a proxy-initiated prefetching technique between a proxy and clients. It is based on a Prediction-by-Partial-Matching (PPM) algorithm originally developed for data compression [9]. PPM is a parametric algorithm that predicts the next l requests based on the user's past m accesses and limits the candidates by a probability threshold t. PPM reduces to the algorithm of Padmanabhan and Mogul when m is set to 1. Fan et al. showed that the PPM prefetching algorithm performs best when m = 2 and l = 4. A similar idea was recently applied to the prefetching of multimedia documents by Su et al. [33].

From the viewpoint of clients, prefetching speeds up web access, particularly to those links that are expected to experience long delays. Klemm [20] suggested pre-fetching web documents based on their estimated round-trip retrieval times. Since the technique requires knowledge of the sizes of prospective documents to estimate their round-trip times, it is limited to prefetching static HTTP pages.

Prefetching is essentially a speculative process and incurs a high cost when its prediction is wrong. Cunha and Jaccoud [8] focused on the decision of whether prefetching should be enabled. They constructed two user models, based on random-walk approximation and digital signal processing techniques, to characterize user behaviors and help predict users' next actions, and they proposed coupling the models with prefetching modules for customized prefetching services. Crovella and Barford [6] studied the effects of prefetching on network performance. They showed that prefetching might increase traffic burstiness, and they proposed using rate-controlled transport mechanisms to smooth the traffic caused by prefetching.

Previous studies on intelligent prefetching were mostly based on the temporal relationships between accesses to URLs. Semantics-based prefetching has been studied in the context of file prefetching: Lei and Duchamp proposed capturing intrinsic correlations between file accesses in tree structures on-line and then matching them against existing pattern trees to predict future accesses [23]. By contrast, we deploy neural nets over keywords to capture the semantic relationships between URLs and improve prediction accuracy by adjusting the weights of the keywords on-line.

8. Concluding Remarks

This paper has presented a semantics-based web prefetching technique and its application to news surfing. The technique captures the semantics of a document by identifying keywords in its URL anchor texts and relies on neural networks over the keyword set to construct a semantic net of the URLs and to predict the user's future requests. Unlike previous URL-graph-based techniques, this approach takes advantage of the "semantic locality" of access patterns and is able to pre-fetch documents whose URLs have never been accessed before. It also features a self-learning capability and good adaptivity to changes in user surfing interests. The technique was implemented in a NewsAgent system and cross-examined in daily browsing of the ABC.com, CNN.com, and MSNBC.com news sites. Experimental results showed approximately a 60% hit ratio due to the prefetching approach. We note that the NewsAgent system was designed from the perspective of clients; its predictive prefetching technique is readily applicable to client proxies.


Future work will focus on the following aspects. The first is to construct a prediction unit using neural networks instead of single neurons. Besides keywords, other factors, such as document size and the URL graph of a hypertext document, will also be taken into account in the design of the predictor. The second is to improve cache utilization. We noticed that the hit ratio saturates at around 60-70% after a reasonable number of requests; more work needs to be done to reduce the cache waste ratio. On a different track altogether, other statistical classification and learning techniques, such as Bayesian networks, could be deployed.

Like many domain-specific search engines [3][16][27], the semantics-based prefetching technique is presently targeted at domain applications. In addition to news services, the prefetching technique could be applied to other application domains, such as on-line shopping and auctions, where the semantics of web pages are often explicitly presented in their URL anchor texts. Such domain-specific prefetching facilities can be turned on or off on demand.

The semantics-based prefetching technique is based on keyword matching. In general, there are two fundamental issues with keyword matching for information retrieval: polysemy and synonymy. Polysemy refers to the fact that a keyword may have more than one meaning; synonymy means that there are many ways of referring to the same concept. In this paper we alleviate the impact of polysemy by taking advantage of the categorical information in news service domains, but we say little about synonymy. Many techniques exist to address these two issues in conventional information retrieval, most notably latent semantic indexing (LSI) [12]. LSI takes advantage of the implicit higher-order structure of the association of terms with articles to create a multi-dimensional semantic structure of the information. Through the pattern of co-occurrences of words, LSI is able to infer the structure of relationships between articles and words. Its application to web prefetching deserves investigation.
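As a hint of how LSI works (not part of NewsAgent), a minimal sketch using truncated SVD on a term-document count matrix:

```python
import numpy as np

def lsi_embed(term_doc, k=2):
    """Truncated SVD of a (terms x documents) count matrix: each document
    gets a k-dimensional vector, and documents that share co-occurring
    terms end up close together even without exact keyword overlap."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k, :]).T  # one k-dim row per document
```

Comparing the resulting document vectors (e.g., by cosine similarity) would let a prefetcher relate an unseen headline to past interests through shared latent topics rather than literal keyword matches.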

Recently, XML and RDF (Resource Description Framework) have been emerging as standard data exchange formats in support of semantic interoperability on the web [13]. With the popularity of XML- or RDF-compliant web resources, we expect that semantics-based prefetching techniques will play an increasingly important role in reducing web access latency and will lead to more personalized, quality services on the semantic web.

References

[1] V. Almeida, A. Bestavros, M. Crovella, and A. de Oliveira. Characterizing reference locality in the WWW. In Proc. of the IEEE Conf. on Parallel and Distributed Information Systems, December 1996.

[2] P. Barford, A. Bestavros, A. Bradley, and M. Crovella. Changes in web client access patterns: Characteristics and caching implications. World Wide Web: Special Issue on Characterization and Performance Evaluation.

[3] S. Chakrabarti, M. van der Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th Int'l World Wide Web Conference (WWW8), 1999.

[4] CNN United States News. http://www.cnn.com/US

[5] Connectix, Inc. http://www.connectix.com/


[6] M. Crovella and P. Barford. The network effects of prefetching. In Proceedings of IEEE Infocom'98.

[7] D. Culler, J. Singh, with A. Gupta. Parallel Computer Architecture: A Hardware/Software Interface. Morgan Kaufmann Pub., 1998.

[8] C. R. Cunha and C. F. B. Jaccoud. Determining WWW user's next access and its application to prefetching. In Proc. of the International Symposium on Computers and Communication'97, July 1997.

[9] K. M. Curewitz, P. Krishnan, and J. Vitter. Practical prefetching via data compression. In Proceedings of SIGMOD'93, pages 257–266, May 1993.

[10] B. Davison. Simultaneous proxy evaluation. In Proceedings of the 4th International Web Caching Workshop, March 1999.

[11] B. Davison. A survey of proxy cache evaluation techniques. In Proceedings of the 4th International Web Caching Workshop, March 1999.

[12] S. Deerwester, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41(6), 1990, pages 391–407.

[13] S. Decker, S. Melnik, et al. The Semantic Web: The role of XML and RDF. IEEE Internet Computing, September/October 2000, pages 63–74.

[14] L. Fan, P. Cao, W. Lin, and Q. Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. In Proceedings of SIGMETRICS'99.

[15] Go Ahead Home Page. http://www.goahead.com/.

[16] E. J. Glover, G. W. Falke, S. Lawrence, W. Birmingham, A. Kruger, C. Giles, and D. Pennock. Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet (SAINT 2001), 2001.

[17] J. Griffioen and R. Appleton. Reducing file system latency using a predictive approach. In Proceedings of the 1994 Summer USENIX Conference.

[18] H. Hassoun. Fundamentals of Artificial Neural Networks. The MIT Press. 1995

[19] ImsiSoft Home Page. http://www.imsisoft.com/.

[20] R. P. Klemm. WebCompanion: A friendly client-side Web prefetching agent. IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, July/August 1999, pages 577–594.

[21] T. Kroeger, D. Long, and J. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. In Proc. of the USENIX Symp. on Internet Technologies and Systems, December 1997.

[22] D. Lee. Methods for web bandwidth and response time improvement. In M. Abrams (ed.), World Wide Web: Beyond the Basics, Chapter 25, Prentice-Hall Pub., April 1998.

[23] H. Lei and D. Duchamp. An analytical approach to file prefetching. In Proceedings of the USENIX 1997 Annual Technical Conference, January 1997, pages 275–288.

[24] T. S. Loon and V. Bharghavan. Alleviating the latency and bandwidth problems in WWW browsing. In Proc. of the USENIX Symposium on Internet Technologies and Systems, December 1997.


[25] Macca Soft Home Page. http://www.maccasoft.com/webmirror/

[26] E. P. Markatos and C. E. Chronaki. A Top-10 approach to prefetching on the web. In Proceedings of INET'98, July 1998.

[27] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace, 1999.

[28] MSNBC Today in America. http://www.msnbc.com/news/affiliates_front.asp

[29] V. Padmanabhan and J. Mogul. Using predictive prefetching to improve World Wide Web latency. In ACM SIGCOMM Computer Communication Review, July 1996.

[30] R. Sarukkai. Link prediction and path analysis using Markov chains. In Proceedings of the 9th Int'l World Wide Web Conference, 2000.

[31] S. Schechter, M. Krishnan, and M. Smith. Using path profiles to predict HTTP requests. In Proceedings of the 7th International World Wide Web Conference. Also appeared in Computer Networks and ISDN Systems, 20 (1998), pages 457–467.

[32] E. Shriver and C. Small. Why does file system prefetching work? In Proceedings of the 1999 USENIX Annual Technical Conference, June 1999.

[33] Z. Su, Q. Yang, and H. Zhang. A prediction system for multimedia prefetching in Internet. In Proceedings of ACM Multimedia, 2000.

