Design Trade-Offs for Search Engine Caching

RICARDO BAEZA-YATES, ARISTIDES GIONIS, FLAVIO P. JUNQUEIRA, VANESSA MURDOCK, and VASSILIS PLACHOURAS

Yahoo! Research

and

FABRIZIO SILVESTRI

ISTI – CNR

In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems, performance evaluation (efficiency and effectiveness)

General Terms: Algorithms, Design

Additional Key Words and Phrases: Caching, Web search, query logs

ACM Reference Format:
Baeza-Yates, R., Gionis, A., Junqueira, F. P., Murdock, V., Plachouras, V., and Silvestri, F. 2008. Design trade-offs for search engine caching. ACM Trans. Web 2, 4, Article 20 (October 2008), 28 pages. DOI = 10.1145/1409220.1409223 http://doi.acm.org/10.1145/1409220.1409223

This article is an expanded version of an article that previously appeared in Proceedings of the 30th Annual ACM Conference on Research and Development in Information Retrieval, 183–190.

Authors’ addresses: R. Baeza-Yates, A. Gionis, F. P. Junqueira, V. Murdock, and V. Plachouras, Yahoo! Research Barcelona, Avda. Diagonal 177, 8th floor, 08018, Barcelona, Spain; email: [email protected], [email protected], [email protected], [email protected], [email protected]; F. Silvestri, Istituto ISTI A. Faedo, Consiglio Nazionale delle Ricerche (CNR), via Moruzzi 1, I-56100, Pisa, Italy; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2008 ACM 1559-1131/2008/10-ART20 $5.00 DOI 10.1145/1409220.1409223 http://doi.acm.org/10.1145/1409220.1409223


1. INTRODUCTION

Millions of queries are submitted daily to Web search engines, and users have high expectations of the quality of results and the latency to receive them. As the searchable Web becomes larger, with more than 20 billion pages to index, evaluating a single query requires processing large amounts of data. In such a setting, using a cache is crucial to reducing response time and to increasing the response throughput.

The primary use of a cache memory is to speed up computation by exploiting patterns present in query streams. Since access to primary memory (RAM) is orders of magnitude faster than access to secondary memory (disk), the average latency drops significantly with the use of a cache. A secondary, yet important, goal is reducing the workload to back-end servers. If the hit rate is x, then the back-end servers receive 1 − x of the original query traffic.

Caching can be applied at different levels with increasing response latencies or processing requirements. For example, the different levels may correspond to the main memory, the disk, or resources in a local or a wide area network. The decision of what to cache can be taken either off-line (static) or online (dynamic). A static cache is usually based on historical information and is subject to periodic updates. A dynamic cache keeps objects stored in its limited number of entries according to the sequence of requests. When a new request arrives, the cache system decides whether to evict some entry from the cache in the case of a cache miss. Such online decisions are based on a cache policy, and several different policies have been studied in the past.

For a search engine, there are two possible ways to use a cache memory:

Caching answers. As the engine returns answers to a particular query, it may decide to store these partial answers (say, top-K results) to resolve future queries.

Caching terms. As the engine evaluates a particular query, it may decide to store in memory the posting lists of the involved query terms. Often the whole set of posting lists does not fit in memory, and consequently, the engine has to select a small set to keep in memory to speed up query processing.
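To make the contrast concrete, the following Python sketch (ours, not from the article; fetch_posting_list and rank are hypothetical stand-ins for disk access and top-K scoring) shows where each cache sits in query processing:

# Illustrative sketch of the two caching options.

answers_cache = {}   # query string -> cached top-K answer
postings_cache = {}  # term -> posting list held in memory

def fetch_posting_list(term):
    # Stand-in for reading a posting list from disk or a remote server.
    return [hash((term, i)) % 1000 for i in range(10)]

def rank(posting_lists):
    # Stand-in for top-K evaluation over the intersected posting lists.
    common = set(posting_lists[0]).intersection(*posting_lists[1:])
    return sorted(common)[:10]

def process_query(query):
    terms = query.split()
    # Option 1: caching answers -- a hit skips query evaluation entirely.
    if query in answers_cache:
        return answers_cache[query]
    # Option 2: caching terms -- a cached posting list saves the disk
    # access, but the query still has to be evaluated.
    lists = [postings_cache.setdefault(t, fetch_posting_list(t)) for t in terms]
    answer = rank(lists)
    answers_cache[query] = answer
    return answer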

Returning an answer to a query that already exists in the cache is more efficient than computing the answer using cached posting lists. On the other hand, a cached posting list can be used to process any query with the corresponding term, implying a higher hit rate for cached posting lists.

Caching of posting lists has additional challenges. As posting lists have variable size, caching them dynamically is not very efficient, due to the complexity in terms of efficiency and space, and the skewed distribution of the query stream, as shown later. Static caching of posting lists poses even more challenges: when deciding which terms to cache, one faces the trade-off between frequently queried terms and terms with small posting lists that are space efficient. Finally, before deciding to adopt a static caching policy, the query stream should be analyzed to verify that its characteristics do not change rapidly over time.


Fig. 1. One caching level in a distributed search architecture.

In this article we explore trade-offs in the design of each cache level, showing that the problem is the same at each level, and only a few parameters change. In general, we assume that each level of caching in a distributed search architecture is similar to that shown in Figure 1. We mainly use a query log from Yahoo! UK, spanning a whole year, to explore the limitations of dynamically caching query answers or posting lists for query terms, and in some cases, we use a query log from the TodoCL search engine to validate our results.

We observe that caching posting lists can achieve higher hit rates than caching query answers. We propose new algorithms for the static caching of posting lists for query terms, showing that the static caching of query terms is more effective than dynamic caching with LRU or LFU policies. We provide an analysis of the trade-offs between static caching of query answers and of query terms. This analysis enables us to obtain the optimal allocation of memory for different types of static caches, for both a particular implementation of a retrieval system and a simple model of a distributed system. Finally, we explore how changes in the query log influence the effectiveness of static caching.

More concretely, our main conclusions are the following:

—Caching query answers results in lower hit ratios compared with caching of posting lists for query terms, but it is faster because there is no need for query evaluation. We provide a framework for the analysis of the trade-off between static caching of query answers and posting lists.

—We evaluate the benefits of keeping compressed postings in the posting list cache. To the best of our knowledge, this is the first time cache entries are kept compressed. We show that compression is worthwhile in real cases, since it results in a lower average response time.

—Static caching of terms can be more effective than dynamic caching with, for example, LRU. We provide algorithms based on the KNAPSACK problem for selecting the posting lists to put in a static cache, and we show improvements over previous work, achieving a hit ratio over 90%.

—Changes in the query distribution over time have little impact on static caching.

This article is an extended version of the one presented at ACM SIGIR 2007 [Baeza-Yates et al. 2007], making the following additional contributions:


—In addition to the Yahoo! UK log and the UK 2006 document collection, we use a query log and a document collection from the TodoCL search engine to validate some of our results.

—We present results showing that, for the problem of caching posting lists, a mixed policy combining static and dynamic caching performs better than either static or dynamic caching alone.

—We present results from experiments using a real system that validate our computational model.

The remainder of this article is organized as follows. Sections 2 and 3 summarize related work and characterize the data sets we use. Section 4 discusses the limitations of dynamic caching. Sections 5 and 6 introduce algorithms for caching posting lists, and a theoretical framework for the analysis of static caching, respectively. Section 7 discusses the impact of changes in the query distribution on static caching, and Section 8 provides our concluding remarks.

2. RELATED WORK

Caching is a useful technique for Web systems that are accessed by a large number of users. It enables a shorter average response time, it reduces the workload on back-end servers, and it reduces the overall amount of utilized bandwidth. In a Web system, both clients and servers can cache items. Browsers cache Web objects on the client side, whereas servers cache precomputed answers or partial data used in the computation of new answers. A third possibility, although of less interest to this article, is to use proxies to mediate the communication between clients and servers, storing frequently requested objects [Podlipnig and Boszormenyi 2003].

Query logs constitute a valuable source of information for evaluating the effectiveness of caching systems. Silverstein et al. [1999] analyze a large query log of the AltaVista search engine containing about a billion queries submitted over more than a month. Tests conducted include the analysis of the query sessions for each user, and of the correlations among the terms of the queries. Similarly to other work, their results show that the majority of the users (in this case about 85%) visit the first page of results only. They also show that 77% of the sessions end after the first query. Jansen et al. [1998] conduct a similar analysis, obtaining results similar to the previous study. They conclude that while IR systems and Web search engines are similar in their features, users of the latter are very different from users of IR systems. Jansen and Spink [2006] present a thorough analysis of search engine user behavior. Besides analyzing the distribution of page views, number of terms, number of queries, and so forth, they show a topical classification of the submitted queries, pointing out how users interact with their preferred search engine. Beitzel et al. [2004] analyze a very large Web query log containing queries submitted by a population of tens of millions of users searching the Web through AOL. They partition the query log into groups of queries submitted during different hours of the day. The analysis highlights the changes in popularity and uniqueness of topically categorized queries within the different groups.


While there are several studies analyzing query logs for different purposes, just a few consider caching for search engines. This might be due to the difficulty in showing the effectiveness without having a real system available for testing. As noted by Xie and O’Hallaron [2002] and confirmed by our analysis, many popular queries are shared by different users. This level of sharing justifies the choice of a server-side caching system for Web search engines.

In one of the first published works on exploiting user query history, Raghavan and Sever [1995] propose using a query base, built upon a set of persistent “optimal” queries submitted in the past, to improve the retrieval effectiveness for similar future queries. Markatos [2001] shows the existence of temporal locality in queries, and compares the performance of different variants of the LRU policy, using hit ratio as a metric. According to his analysis, static caching is very effective if employed on very small caches (50 Mbytes), but gracefully degrades as the cache size increases.

Based on the observations of Markatos, Lempel and Moran [2003] propose a new caching policy, called probabilistic driven caching (PDC), which attempts to estimate the probability distribution of all possible queries submitted to a search engine. PDC is the first policy to adopt prefetching in anticipation of user requests. To this end, PDC exploits a model of user behavior, where a user session starts with a query for the first page of results, and can proceed with one or more follow-up queries (i.e., queries requesting successive pages of results). When no follow-up queries are received within τ seconds, the session is considered finished.

Fagni et al. [2006] follow Markatos’ work by showing that combining static and dynamic caching policies, together with an adaptive prefetching policy, achieves a high hit ratio. In their experiments, they observe that devoting a large fraction of entries to static caching, along with prefetching, obtains the best hit ratio.

Baeza-Yates et al. [2007] introduce a caching mechanism for query answers where the cache memory is split in two parts. The first part is used to cache results of queries that are likely to be repeated in the future, and the second part is used to cache all other queries. The decision to cache the query results in the first or the second part depends on features of the query, such as its past frequency or its length (in tokens or characters).

One of the main issues with the design of a server-side cache is the amount of memory resources usually available on servers. Tsegay et al. [2007] consider caching of pruned posting lists in a setting where query evaluation terminates when the set of top-ranked documents does not change by processing more postings. Zhang et al. [2008] study caching of blocks of compressed posting lists using several dynamic caching algorithms, and find that evicting from memory the least frequently used blocks of posting lists performs very well in terms of hit ratio. Our static caching algorithm for posting lists, in Section 5, uses the ratio frequency/size in order to evaluate the goodness of an item to cache. Similar ideas have been used in the context of file caching [Young 2002], Web caching [Cao and Irani 1997], and even caching of posting lists [Long and Suel 2005], but in all cases in a dynamic setting. To the best of our knowledge we are the first to use this approach for static caching of posting lists.


Since systems are often hierarchical, multiple-level caching architectures have been proposed. Saraiva et al. [2001] propose a new architecture for Web search engines using a two-level dynamic caching system. Their goal for such systems has been to improve response time for hierarchical engines. In their architecture, both levels use an LRU eviction policy. They find that the second-level cache can effectively reduce disk traffic, thus increasing the overall throughput. Baeza-Yates and Saint-Jean [2003] propose a three-level index organization with a frequency-based static cache of posting lists. Long and Suel [2005] propose a caching system structured according to three different levels. The intermediate level contains frequently occurring pairs of terms and stores the intersections of the corresponding inverted lists. The last two studies are related to ours in that they exploit different caching strategies at different levels of the memory hierarchy.

There is a large body of work devoted to query optimization. Buckley and Lewit [1985], in one of the earliest works, take a term-at-a-time approach to decide when inverted lists need not be further examined. More recent examples demonstrate that the top k documents for a query can be returned without the need for evaluating the complete set of posting lists [Anh and Moffat 2006; Buttcher and Clarke 2006; Strohman et al. 2005; Ntoulas and Cho 2007]. Although these approaches seek to improve query processing efficiency, they differ from our current work in that they do not consider caching. They may be considered separate and complementary to a cache-based approach.

3. DATA CHARACTERIZATION

Our main dataset consists of a crawl of documents from the UK domain, and the logs of one year of queries submitted to http://www.yahoo.co.uk from November 2005 to November 2006. To further validate our results, we use a second dataset consisting of a crawl of documents indexed by the TodoCL search engine[1] from 2003, with queries submitted to the search engine from May to November 2003.

The document collection from the UK is a summary of the UK domain crawled in May 2006 [Boldi et al. 2004; Castillo et al. 2006].[2] This summary corresponds to a maximum of 400 crawled documents per host, using a breadth-first crawling strategy, and comprises 15GB. The distribution of document frequencies of terms in the UK collection follows a power law distribution with parameter 1.24.[3]

The corpus statistics for the Chile data are comparable to those for the UK summary collection. The distribution of document frequencies for every term in the Chile corpus follows a power law with parameter 1.10. The statistics for both collections are shown in Table I.

With respect to our query-log datasets, in a year of queries to the UK search engine, 50% of the total volume of queries are unique. The average query length is 2.5 terms, with the longest query having hundreds of terms.

[1] http://www.todocl.cl visited July 2008.
[2] The collection is available from the University of Milan: http://law.dsi.unimi.it/.
[3] In this article we use power laws to fit the data in the main part of the distribution, since in general, the power law does not fit well across the two extremes.


Table I. Statistics of the Document Collections

                            UK-2006 Sample     Chile Sample
# of documents              2,786,391          3,110,605
# of terms                  6,491,374          3,894,893
# of postings               773,440,986        529,599,712
# of tokens                 2,109,512,558      1,578,821,207
Inverted file size (bytes)  1,189,266,893      1,004,086,805

Figure 2(a) shows the distributions of queries and query terms for a sample of the query logs from yahoo.co.uk for part of a year. The x-axis represents the normalized frequency rank of the query or term, that is, the most frequent query appears closest to the y-axis. The y-axis is the normalized frequency for a given query (or term). As expected, the distributions of query frequencies and query term frequencies shown in this graph follow power laws, with parameters of 0.83 and 1.06, respectively. In this figure, the queries and terms were normalized for case and white space.

The Chile query log resembles the UK query log: 60% of the total volume of queries are unique queries, and 80% of the unique queries are singleton queries—queries that appear only once in the logs. The average query length was 2.63 terms, the longest being 73 terms. Figure 2(b) shows the query and term distributions for the Chile query log. The queries were normalized for case and whitespace. The query and term distributions follow a power law, with parameters of 0.62 and 0.88, respectively.

Finally, we computed the correlation between the document frequency of terms in the UK collection and the number of queries to yahoo.co.uk that contain a particular term, which is 0.42. The correlation between the document frequency of terms in the collection indexed by TodoCL and the number of queries to the TodoCL search engine containing a particular term is only 0.29.

A scatter plot for a random sample of terms in both datasets is shown in Figure 3. In this experiment, terms have been converted to lower case in both the queries and the documents so that the frequencies are comparable.

4. CACHING OF QUERIES AND TERMS

Caching relies upon the assumption that there is locality in the stream of requests. That is, there must be sufficient repetition in the stream of requests, and within intervals of time, for a cache memory of reasonable size to be effective. In the UK query log, 88% of the unique queries are singleton queries, and 44% of the whole volume are singleton queries. Thus, out of all queries in the stream composing the query log, the upper threshold on hit ratio is 56%, because only 56% of the queries have multiple occurrences. It is important to observe, however, that not all queries in this 56% can be cache hits because of compulsory misses. A compulsory miss happens when the cache receives a query for the first time. This is different from capacity misses, which happen due to space constraints on the amount of memory the cache uses. If we consider a cache with infinite memory, then the hit ratio is 50% because, as mentioned in the previous section, unique queries are 50% of the total query volume. Note that for an infinite cache there are no capacity misses.
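The infinite-cache bound can be computed directly from a query stream; the short Python sketch below (ours, for illustration) does exactly this: every first occurrence is a compulsory miss and every repetition is a hit.

from collections import Counter

def infinite_cache_hit_rate(queries):
    # With unbounded memory there are no capacity misses, so the hit
    # rate equals the fraction of the volume made up of repetitions.
    counts = Counter(queries)
    total = sum(counts.values())
    return (total - len(counts)) / total

# For the UK log, where unique queries are 50% of the volume,
# this bound evaluates to 0.5.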


Fig. 2. The distribution of queries, query terms, and document terms in the UK dataset (a) and the Chile dataset (b). The curves are shown for a large subset of queries for part of a year. The y-axis has been normalized for each distribution. The x-axis has been normalized by the rank, so that the most frequent term is closest to the y-axis.

As we mentioned before, another possibility is to cache the posting lists of terms. Intuitively, this gives more freedom in the utilization of the cache content to respond to queries, because cached terms might form a new query. On the other hand, posting lists need more space.

As opposed to queries, the fraction of singleton terms in the total volume of terms is smaller. In the UK query log, only 4% of the terms appear once, but this accounts for 73% of the vocabulary of query terms.


Fig. 3. Normalized scatter plot of document-term frequencies vs. query-term frequencies for the UK collection and queries to yahoo.co.uk (left), and the same for the TodoCL data (right).

Fig. 4. Arrival rate of queries and terms and estimated workload for the UK log.

We show in Section 5 that caching a small fraction of terms, while accounting for terms appearing in many documents, is potentially very effective.

Figure 4(a) shows several curves corresponding to the normalized arrival rate of queries in the UK log for different cases, using days as bins. That is, we plot the normalized number of elements that appear in a day. This graph shows only a period of 122 days, and we normalize the values by the maximum value observed throughout the whole period of the query log. “Total queries” and “total terms” correspond to the total volume of queries and terms, respectively. “Unique queries” and “unique terms” correspond to the arrival rate of unique queries and terms. Finally, “query diff” and “terms diff” correspond to the difference between the curves for total and unique.

In Figure 4(a), as expected, the volume of terms is much higher than the volume of queries. The difference between the total number of terms and the number of unique terms is much larger than the difference between the total number of queries and the number of unique queries. This observation implies that terms repeat significantly more than queries. If we use smaller bins, say of one hour, then the ratio of unique to volume is higher for both terms and queries, because it leaves less room for repetition.

We also estimated the workload, using the document frequency of terms as a measure of how much work a query imposes on a search engine. We found that it closely follows the arrival rate for terms shown in Figure 4(a). In more detail, Figure 4(b) plots the sum of the lengths of the posting lists associated with the terms in each bin, normalized by the average workload. Since the absolute workload values are substantially higher, normalizing is necessary to make the graph comparable to the others for total, unique, and difference in Figure 4(a). We then normalize it a second time, using the same procedure as for the curves in Figure 4(a). The main observation in this graph is that the workload closely follows the arrival rate for terms. The graph would not have such a shape if, for example, the terms in queries in periods of high activity had, on average, shorter posting lists.

To demonstrate the effect of a dynamic cache on the query frequency distribution of Figure 2(a), we plot the same frequency graph in Figure 5, but now considering the frequency of queries after going through an LRU cache. On a cache miss, an LRU cache decides upon an entry to evict, using the information on the recency of queries. In this graph, the most frequent queries are not the same queries that were most frequent before the cache. It is possible that queries that are most frequent after the cache have different characteristics, and tuning the search engine to queries that were frequent before the cache may degrade performance for non-cached queries. The maximum frequency after caching is less than 1% of the maximum frequency before the cache, thus showing that the cache is very effective in reducing the load of frequent queries. If we rerank the queries according to after-cache frequency, the distribution is still a power law, but with a much smaller value for the highest frequency.

When discussing the effectiveness of caching dynamically, an important metric is the cache miss rate. To analyze the cache miss rate for different memory constraints, we use the working set model [Denning 1980; Slutz and Traiger 1974]. A working set, informally, is the set of references that an application or an operating system is currently working with. The model uses such sets in a strategy that tries to capture the temporal locality of references. The working set strategy then consists in keeping in memory only the elements that are referenced in the previous θ steps of the input sequence, where θ is a configurable parameter corresponding to the window size.

Originally, working sets were used for the page replacement algorithms of operating systems, and considering such a strategy in the context of search engines is interesting for three reasons. First, it captures the amount of locality of queries and terms in a sequence of queries. Locality in this case refers to the frequency of queries and terms in a window of time. If many queries appear multiple times in a window, then locality is high. Second, it enables an offline analysis of the expected miss rate given different memory constraints. Third, working sets capture aspects of efficient caching algorithms, such as LRU. LRU assumes that references further in the past are less likely to be referenced in the present, which is implicit in the concept of working sets [Slutz and Traiger 1974].


Fig. 5. Frequency graph after LRU cache.

We now characterize the working set model more formally. Following the model of Slutz and Traiger [1974], let ρk denote a finite reference sequence of elements, the elements being either queries or terms, where r(t) evaluates to the element at position t of this sequence, and k is the length of the sequence. A working set for ρk is as follows:

Definition 4.1. The working set at time t is the distinct set of elements among r(t − θ + 1) . . . r(t).

The function ck(x), used to compute the miss rate, is defined as follows:

Definition 4.2. ck(x), 1 ≤ x ≤ k, is the number of occurrences of xt = x in ρk, where xt is the number of elements since the last reference to r(t).

We define the miss rate as:

m(θ ) = 1 − (1/k)θ∑

x=1

ck(x). (1)
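A direct way to read Equation (1): a reference at position t is a hit exactly when the same element was last referenced at most θ positions earlier. The Python sketch below (our illustration, assuming xt counts raw positions in the sequence) computes m(θ) for a reference stream:

def miss_rate(refs, theta):
    # Summing c_k(x) for x = 1..theta counts the references whose
    # previous occurrence lies within the last theta positions;
    # first occurrences are always misses.
    last_pos = {}
    hits = 0
    for t, element in enumerate(refs):
        if element in last_pos and t - last_pos[element] <= theta:
            hits += 1
        last_pos[element] = t
    return 1 - hits / len(refs)

print(miss_rate(["q1", "q2", "q1", "q3", "q1"], theta=2))  # -> 0.6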

Figure 6(a) plots the miss rate for different working set sizes, and we consider working sets of both queries and terms. The working set sizes are normalized against the total number of queries in the query log. In the graph for queries, there is a sharp decay until approximately 0.01, and the rate at which the miss rate drops decreases as we increase the size of the working set beyond 0.01. Finally, the minimum value it reaches is a 50% miss rate, not shown in the figure, since we have cut the tail of the curve for presentation purposes. For the sequence of terms that we use to plot the term curve in the figure, we have not considered all the terms in the log. Instead, we use the same number of queries we use for the query graph, taken from the head of the query log.


Fig. 6. Miss rate as a function of the working set size and distribution of distances.

Compared with the query curve, we observe that the minimum miss rate for terms is substantially smaller. The miss rate also drops sharply for values up to 0.01, and decreases minimally for higher values. The minimum value, however, is slightly over 10%, which is much smaller than the minimum value for the sequence of queries. This implies that, with such a policy, it is possible to achieve over 80% hit rate if we dynamically cache posting lists for terms, as opposed to caching answers for queries. This result does not consider the space required for each unit stored in the cache memory, or the amount of time it takes to put together a response to a user query. We analyze these issues more carefully later in this article.

It is also interesting to observe the histogram of Figure 6(b), which is an intermediate step in the computation of the miss rate graph. It reports the distribution of distances between repetitions of the same frequent query. The distance in the plot is measured in the number of distinct queries separating a query and its repetition, and it considers only queries appearing at least 10 times. For example, the distance between repetitions of the query q in the query stream q, q1, q2, q2, q1, q3, q is three. From Figures 6(a) and 6(b), we conclude that even if we set the size of the query answers cache to a relatively large number of entries, the miss rate is high. Thus caching the posting lists of terms has the potential to improve the hit ratio. This is what we explore next.
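The distance histogram of Figure 6(b) can be reproduced with a simple (if quadratic) pass over the log; the sketch below is our illustration of the distance definition used above, counting distinct queries between consecutive occurrences:

from collections import Counter

def reuse_distance_histogram(queries, min_freq=10):
    # Distance = number of distinct queries strictly between two
    # consecutive occurrences of the same query; e.g. in the stream
    # q, q1, q2, q2, q1, q3, q the distance for q is 3.
    freq = Counter(queries)
    last_seen = {}
    hist = Counter()
    for i, q in enumerate(queries):
        if q in last_seen and freq[q] >= min_freq:
            hist[len(set(queries[last_seen[q] + 1 : i]))] += 1
        last_seen[q] = i
    return hist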

5. CACHING POSTING LISTS

The previous section shows that caching posting lists can obtain a higher hit rate as compared with caching query answers. In this section, we study the problem of how to select posting lists to place in a certain amount of available memory, assuming that the whole index is larger than the amount of memory available. The posting lists have variable size (in fact, their size distribution follows a power law), so it is beneficial for a caching policy to consider the sizes of the posting lists. In Section 5.1, we describe a new algorithm for caching posting lists statically. We compare our algorithm with a static-cache algorithm that considers only query frequency statistics, as well as with dynamic-cache algorithms, such as LRU, LFU, and a modified dynamic algorithm that takes posting-list size into account.


Additionally, in Section 5.2, we discuss a mixed caching policy that partitions the available cache in two parts, using one part as a static cache and the other part as a dynamic cache.

5.1 Static Caching

Before discussing the static caching strategies, we introduce some notation. We use fq(t) to denote the query-term frequency of a term t, that is, the number of queries containing t in the query log, and fd(t) to denote the document frequency of t, that is, the number of documents in the collection in which the term t appears.

The first strategy we consider is the algorithm proposed by Baeza-Yates and Saint-Jean [2003], which consists in selecting the posting lists of the terms with the highest query-term frequencies fq(t). We call this algorithm QTF. The QTF algorithm is clearly motivated by filling the cache with terms that appear often in the queries. The query-term frequencies are computed from past query logs, and for the policy to be effective we assume that the query-term frequencies do not change much over time. Later in this article we analyze the impact of the query-log dynamics on static caching.

Next, we describe our suggested static-cache algorithm. Our main observation is that there is a trade-off between fq(t) and fd(t). On the one hand, terms with high fq(t) are useful to keep in the cache because they are queried often. On the other hand, terms with high fd(t) are not good candidates because they correspond to long posting lists and consume a substantial amount of space. In fact, the problem of selecting the best posting lists for the static cache corresponds to the standard KNAPSACK problem: given a knapsack of fixed capacity and a set of n items, where the i-th item has value ci and size si, select the set of items that fit in the knapsack and maximize the overall value. In our case, “value” corresponds to fq(t) and “size” corresponds to fd(t). Thus we employ a simple algorithm for the knapsack problem: selecting the posting lists of the terms with the highest values of the ratio fq(t)/fd(t). We call this algorithm QTFDF. We tried other variations considering query frequencies instead of term frequencies, but the gain was minimal relative to the complexity added.
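A minimal sketch of this greedy ratio heuristic (ours, not the authors' code; it treats fd(t) as the space cost of term t's posting list and keeps scanning past items that do not fit):

def select_static_cache(fq, fd, capacity):
    # QtfDf: rank terms by fq(t)/fd(t) and greedily fill the cache.
    terms = sorted(fq, key=lambda t: fq[t] / fd[t], reverse=True)
    cached, used = set(), 0
    for t in terms:
        if used + fd[t] <= capacity:
            cached.add(t)
            used += fd[t]
    return cached

# Example: with capacity 100, a term queried 50 times with a 10-posting
# list (ratio 5.0) is taken before one queried 80 times with a
# 1,000-posting list (ratio 0.08).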

In addition to the above two static algorithms, we consider the following algorithms for dynamic caching:

—LRU: a standard LRU algorithm, but many posting lists might need to be evicted (in order of least-recent usage) until there is enough space in the memory to place the currently accessed posting list (see the sketch after this list).

—LFU: a standard LFU algorithm (eviction of the least-frequently used), with the same modification as the LRU.

—DYN-QTFDF: a dynamic version of the QTFDF algorithm; evict from the cache the term(s) with the lowest fq(t)/fd(t) ratio.
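For concreteness, here is a minimal sketch of the size-aware LRU variant from the first item above (ours; it assumes insertions happen only on a miss):

from collections import OrderedDict

class SizeAwareLRU:
    """LRU over variable-size posting lists: on insertion, evict
    least-recently-used lists until the new list fits."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.cache = OrderedDict()  # term -> (posting_list, size)

    def get(self, term):
        if term in self.cache:
            self.cache.move_to_end(term)  # mark as most recently used
            return self.cache[term][0]
        return None  # miss

    def put(self, term, posting_list, size):
        if size > self.capacity:
            return  # the list can never fit; do not cache it
        while self.used + size > self.capacity:
            _, (_, old_size) = self.cache.popitem(last=False)  # evict LRU
            self.used -= old_size
        self.cache[term] = (posting_list, size)
        self.used += size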

The performance of all the above algorithms on the UK and Chile datasets is shown in Figures 7 and 8, respectively. For the results on the UK dataset, we use 15 weeks of the UK query log, and for the Chile dataset, we use 4 months of the Chile query log.


Fig. 7. Hit rate of different strategies for caching posting lists for the UK dataset.

Fig. 8. Hit rate of different strategies for caching posting lists for the Chile dataset.

Performance is measured by hit rate. The cache size is measured as a fraction of the total space required to store the posting lists of all terms.

For the dynamic algorithms, we load the cache with terms in order of fq(t), and we let the cache “warm up” for 1 million queries. For the static algorithms, we assume complete knowledge of the frequencies fq(t), that is, we estimate fq(t) from the whole query stream. As we show in Section 7, the results do not change much if we compute the query-term frequencies using the first 3 or 4 weeks of the query log and measure the hit rate on the rest.


Fig. 9. Fraction of terms whose posting lists fit in cache for the two different static algorithms.

The most important observation from our experiments is that the static QTFDF algorithm has a better hit rate than all the dynamic algorithms. An important benefit of a static cache is that it requires no eviction, and it is hence more efficient when evaluating queries. However, if the characteristics of the query traffic change frequently over time, the cache requires frequent repopulating, or there will be a significant impact on hit rate.

A measure illustrating the difference between the QTFDF and QTF algorithms is shown in Figures 9(a) and 9(b), where we plot the fraction of terms whose posting lists fit in cache for the two static algorithms. QTF selects terms with high fq(t) values. However, many of those terms tend to have long posting lists, and as a result, few posting lists fit in cache. On the other hand, QTFDF prefers to select many more (and shorter) posting lists, even though they have smaller fq(t) values.

5.2 Adding Dynamic Cache

In addition to pure static and dynamic caching policies, we also consider a mixed caching policy: given a fixed amount of available cache, partition it in two parts, and use one part as a static cache and the other part as a dynamic cache. We combine the static and dynamic caching policies demonstrated in the previous section, namely the QTFDF algorithm for static caching and the LRU algorithm for dynamic caching.

The motivation behind such a mixed policy is to leverage the good performance of static caching, while at the same time employing dynamic caching to handle temporal correlations and bursts in the query stream. Figure 10 presents the results of our experiment, performed using 15 weeks of the UK query log. Given a fixed amount of memory for caching posting lists, we allocate an α fraction of the memory to the QTFDF policy and the rest to the LRU policy. We tried α = 0.1, 0.25, 0.5, 0.75, and 0.9.
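A sketch of the lookup path for this mixed policy, reusing the select_static_cache and SizeAwareLRU sketches above (ours; the fetch callback and the handling of size units are illustrative):

def make_mixed_cache(fq, fd, memory, alpha, fetch):
    # An alpha fraction of memory is filled once with the QtfDf
    # selection; the remainder is a size-aware LRU for the rest.
    static_part = {t: fetch(t)
                   for t in select_static_cache(fq, fd, alpha * memory)}
    dynamic_part = SizeAwareLRU((1 - alpha) * memory)

    def lookup(term):
        if term in static_part:        # static hit: never evicted
            return static_part[term]
        lst = dynamic_part.get(term)   # dynamic hit, refreshed by LRU
        if lst is None:                # miss: fetch and insert
            lst = fetch(term)
            dynamic_part.put(term, lst, fd[term])
        return lst

    return lookup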

As with the results presented in Fagni et al. [2006], our mixed static/dynamic strategy improves the hit ratio of the cache. The improvement is more significant for smaller cache sizes; as the cache size increases, the performance of the QTFDF algorithm reaches the performance of the mixed policy.


Fig. 10. The effect of adding dynamic cache to the QTFDF algorithm.

Also, as Figure 10 shows, the best performance is achieved for α = 0.9, that is, allocating the largest part of the cache to the static policy.

6. ANALYSIS OF STATIC CACHING

In this section, we provide a detailed analysis for the problem of deciding whether it is preferable to cache query answers or to cache posting lists. Our analysis takes into account the impact of caching between two levels of the data-access hierarchy. It can be applied either at the memory/disk layer or at a server/remote server layer, as in the architecture discussed in the introduction.

Using a particular system model, we obtain estimates for the parameters required by our analysis, which we subsequently use to decide the optimal trade-off between caching query answers and caching posting lists. To validate the optimal trade-off, we run an implementation of the system with a cache of query answers and a cache of posting lists.

6.1 Analytical Model

Let M be the size of the cache measured in answer units, that is, assume that the cache can store M query answers. For the sake of simplicity, assume that all posting lists are of the same length L, measured in answer units. We consider the following two cases: (1) a cache that stores only precomputed answers, and (2) a cache that stores only posting lists. In the first case, Nc = M answers fit in the cache, while in the second case Np = M/L posting lists fit in the cache. Thus Np = Nc/L. Note that although posting lists require more space, we can combine terms to evaluate more queries (or partial queries).

For case (1), suppose that a query answer in the cache can be evaluated in one time unit. For case (2), assume that if the posting lists of the terms of a query are in the cache, then the results can be computed in TR1 time units, while if the posting lists are not in the cache, then the results can be computed in TR2 time units. Of course we have that TR2 > TR1.


Fig. 11. Cache saturation as a function of size.


Now we want to compare the time to answer a stream of Q queries in both cases. Let us use the QTF algorithm as an approximation to the QTFDF algorithm (in fact, if the correlation between query terms and document terms is 0, this approximation is quite good), and let Vc(Nc) be the volume of the most frequent Nc queries. Then, for case (1), we have an overall time

TCA = Vc(Nc) + TR2 (Q − Vc(Nc)).

Similarly, for case (2), let Vp(Np) be the number of computable queries using the posting lists of the most frequent Np terms. Then we have an overall time

TPL = TR1 Vp(Np) + TR2 (Q − Vp(Np)).

We want to check under which conditions we have TPL < TCA. Then,

TPL − TCA = (TR2 − 1) Vc(Nc) − (TR2 − TR1) Vp(Np).   (2)
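The two cost expressions are easy to experiment with numerically; the following sketch (ours) evaluates both sides of Equation (2) for given parameter values:

def compare_caches(Vc, Vp, Q, TR1, TR2):
    # Overall time with an answers cache vs. a posting-list cache;
    # a cached answer costs one time unit, per the model above.
    T_CA = Vc + TR2 * (Q - Vc)
    T_PL = TR1 * Vp + TR2 * (Q - Vp)
    # Equivalent to Eq. (2): (TR2 - 1)*Vc - (TR2 - TR1)*Vp.
    return T_PL - T_CA  # negative means caching posting lists wins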

Figure 11 shows the values of Vp and Vc for the UK query log. We can see that caching answers saturates faster, and for this particular data there is no additional benefit from using more than 10% of the index space for caching answers.

Since the query distribution in practice is finite, Vc(n) will be a fraction, depending on n, of the total number of queries Q. Now we estimate this fraction. Since the query distribution follows a power law with parameter 0 < α < 1 in our two data sets, the i-th most frequent query appears with probability proportional to 1/i^α. Therefore, the volume Vc(n), which is the total number of occurrences of the n most frequent queries, is:

Vc(n) = V0 ∑_{i=1}^{n} Q/i^α ≈ V0 n^{1−α} Q,

where V0 = 1/U^{1−α} and U is the number of unique queries in the query stream. We know that Vp(n) grows faster than Vc(n), and we assume, based on experimental results, that the relation is of the form Vp(n) = k Vc(n)^β (see Figure 12).


Fig. 12. Relation of query volumes of precomputed answers Vc(n) and posting lists Vp(n).

In the worst case, for a large cache, β → 1. That is, both techniques will cache a constant fraction of the overall query volume. By setting β = 1, replacing Np = Nc/L, and combining with Equation (2), we obtain the result that caching posting lists makes sense only if the ratio

ρ = L^{1−α} (TR2 − 1) / (k (TR2 − TR1)) < 1.
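As an illustration (our arithmetic, not a result from the article), plugging in the centralized full-evaluation estimates of Section 6.2 (TR1 = 233, TR2 = 1760), L = 0.75, and the UK query power-law parameter α = 0.83 gives ρ ≈ 1.1/k, so under these assumptions caching posting lists wins whenever the fitted constant k exceeds roughly 1.1:

def rho(L, alpha, k, TR1, TR2):
    # Caching posting lists is preferable when rho < 1.
    return (L ** (1 - alpha)) * (TR2 - 1) / (k * (TR2 - TR1))

print(rho(L=0.75, alpha=0.83, k=1.0, TR1=233, TR2=1760))  # ~1.10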

Differently from previous works, we also want to evaluate whether caching compressed postings is better than caching plain postings. Caching compressed postings has the benefit of allowing the accommodation of a greater number of entries, in fact we have L′ < L, at the cost of a greater computational cost, that is, TR′1 > TR1. The trade-off of caching postings vs. query results is now as follows:

ρ′ = L′^{1−α} (TR′2 − 1) / (k (TR′2 − TR′1)).

That is, ρ′ is the ratio for comparing cached answers with cached posting lists when compression is used. Using compression is better if ρ′ < ρ. In the next section, we show that L/L′ is about 3, and according to the experiments that we show later, compression is always better.

For a small cache, we are interested in the transient behavior, and then β > 1, as computed from the UK data (between 2 and 3, as shown in Figure 12). In this case, there will always be a point where TPL > TCA for a large number of queries, and this shows the importance of the real values of TR, which we estimate next.

As we showed in the previous section, instead of filling the cache only with answers or only with posting lists, a better strategy is to divide the total cache space into a cache for answers and a cache for posting lists. In such a case, there will be some queries that could be answered by both parts of the cache, and a good caching technique should try to minimize the intersection of both caches. Finding the optimal division of the cache in order to minimize the overall retrieval time is a difficult problem to solve analytically. In Section 6.3, we use simulations to derive optimal cache trade-offs for particular implementation examples.

6.2 Parameter Estimation

We now use a particular implementation of a centralized system and the model of a distributed system as examples from which we estimate the parameters of the analysis from the previous section. We perform the experiments using an optimized version of Terrier [Ounis et al. 2006] for both indexing documents and processing queries, on a single machine with a Pentium 4 at 2GHz and 1GB of RAM.

We index the documents from the UK-2006 dataset, without removing stop words or applying stemming. The posting lists in the inverted file consist of pairs of document identifier and term frequency. We compress the document identifier gaps using Elias gamma encoding, and the term frequencies in documents using unary encoding [Witten et al. 1994]. The size of the inverted file is 1,189MB. A stored answer requires 1264 bytes, and an uncompressed posting takes 8 bytes. From Table I, we obtain

L = (8 · # of postings) / (1264 · # of terms) = 0.75 and
L′ = (inverted file size) / (1264 · # of terms) = 0.26.

We estimate the ratio TR = T/Tc between the average time T it takes to evaluate a query and the average time Tc it takes to return a stored answer for the same query in the following way. Tc is measured by loading the answers for 100,000 queries in memory, and answering the queries from memory. The average time is Tc = 0.069ms. T is measured by processing the same 100,000 queries (the first 10,000 queries are used to warm up the system). For each query, we remove stop words if there are at least three remaining terms. The stop words correspond to the terms with a frequency higher than the number of documents in the index. We use a document-at-a-time approach to retrieve documents containing all query terms. The only disk access required during query processing is for reading compressed posting lists from the inverted file. We perform both full and partial evaluation of answers, because some queries are likely to retrieve a large number of documents, and only a fraction of the retrieved documents will be seen by users. In the partial evaluation of queries, we terminate the processing after matching 10,000 documents. The estimated ratios TR are presented in Table II.

Figure 13 shows, for a sample of queries, the workload of the system with partial query evaluation and compressed posting lists. The x-axis corresponds to the total time the system spends processing a particular query, and the vertical axis corresponds to the sum ∑_{t∈q} fq(t) · fd(t). Notice that the total number of postings of the query terms does not necessarily provide an accurate estimate of the workload imposed on the system by a query (which is the case for full evaluation and uncompressed lists).

The analysis of the previous section also applies to a distributed retrieval system in one or multiple sites. Suppose that a document-partitioned distributed system is running on a cluster of machines interconnected through a Local Area Network (LAN) in one site. The broker receives queries and broadcasts them to the query processors, which answer the queries and return the results to the broker.


Table II. Ratios Between the Average Time to Evaluate a Query and the Average Time to Return Cached Answers (centralized and distributed case)

Centralized system   TR1    TR2    TR′1   TR′2
Full evaluation      233    1760   707    1140
Partial evaluation   99     1626   493    798

LAN system           TRL1   TRL2   TRL′1  TRL′2
Full evaluation      242    1769   716    1149
Partial evaluation   108    1635   502    807

WAN system           TRW1   TRW2   TRW′1  TRW′2
Full evaluation      5001   6528   5475   5908
Partial evaluation   4867   6394   5270   5575

Fig. 13. Workload for partial query evaluation with compressed posting lists.

Finally, the broker merges the received answers and generates the final set of answers (we assume that the time spent on merging results is negligible). The difference between the centralized architecture and the document partition architecture is the extra communication between the broker and the query processors. Using ICMP pings on a 100Mbps LAN, we have observed that sending the query from the broker to the query processors, which send an answer of 4,000 bytes back to the broker, takes on average 0.615ms. Hence TRL = TR + 0.615ms/0.069ms = TR + 9.

In the case when the broker and the query processors are in different sites connected through a Wide Area Network (WAN), we estimate that broadcasting the query from the broker to the query processors, and getting back an answer of 4,000 bytes, takes on average 329ms. Hence TRW = TR + 329ms/0.069ms = TR + 4768. We can see that TRW2/TRW1 = 1.31 < TR2/TR1, suggesting that there is greater benefit from storing answers for queries when the retrieval system is distributed across a WAN, since the network communication dominates the response time for such systems. We corroborate this observation next.


Fig. 14. Optimal division of the cache memory in a server.

6.3 Simulation Results

We now address the problem of finding the optimal trade-off between caching query answers and caching posting lists. To make the problem concrete, we assume a fixed amount M of available memory, out of which x units are used for caching query answers, and M − x for caching posting lists.

We perform a simulation and compute the average response time as a function of x. Using a part of the query log as training data, we first allocate in the cache the answers to the most frequent queries that fit in space x, and then we use the rest of the memory to cache posting lists. For selecting posting lists, we use the QTFDF algorithm, applied to the training query log but excluding the queries that have already been cached.
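As a concrete illustration, the following Python sketch replays a test stream for a given split point x. It is a simplification, not the paper's simulator: the cost constants, the whitespace tokenization, and the frequency-over-size greedy ranking standing in for the QTFDF selection are all assumptions.

    # A minimal sketch of the cache-split simulation under a simple cost model.
    from collections import Counter

    ANSWER_BYTES = 1264
    T_CACHED, T_LIST_CACHED, T_LIST_DISK = 0.069, 1.0, 10.0  # hypothetical ms costs

    def simulate_split(x, M, train, test, term_bytes):
        """Average response time when x bytes cache answers and M - x bytes
        cache posting lists. `train`/`test` are lists of query strings;
        `term_bytes` maps every term to its posting-list size in bytes."""
        # Cache answers of the most frequent training queries fitting in x bytes.
        answers = set()
        for q, _ in Counter(train).most_common():
            if (len(answers) + 1) * ANSWER_BYTES > x:
                break
            answers.add(q)
        # Fill the remaining space with posting lists, greedily ranked by
        # frequency-to-size ratio, skipping terms of already-cached queries.
        freq = Counter(t for q in train if q not in answers for t in q.split())
        cached_terms, used = set(), 0
        for t in sorted(freq, key=lambda t: freq[t] / term_bytes[t], reverse=True):
            if used + term_bytes[t] <= M - x:
                cached_terms.add(t)
                used += term_bytes[t]
        # Replay the test stream.
        total = 0.0
        for q in test:
            if q in answers:
                total += T_CACHED
            else:
                total += sum(T_LIST_CACHED if t in cached_terms else T_LIST_DISK
                             for t in q.split())
        return total / len(test)

Sweeping x from 0 to M with such a function yields curves of the shape shown in Figure 14.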

In Figure 14, we plot the simulated response time for a centralized system as a function of x. For the uncompressed index, we use M = 1GB, and for the compressed index we use M = 0.5GB, to make a fair comparison. In the case of the configuration that uses partial query evaluation with compressed posting lists, the lowest response time is achieved when 0.15GB out of the 0.5GB is allocated for storing answers for queries. We obtained similar trends in the results for the LAN setting.

Figure 15 shows the simulated workload for a distributed system across a WAN. In this case, the total amount of memory is split between the broker, which holds the cached answers of queries, and the query processors, which hold the cache of posting lists. According to the figure, the difference between the configurations of the query processors is less important, because the network communication overhead increases the response time substantially. When using uncompressed posting lists, the optimal allocation of memory corresponds to using approximately 70% of the memory for caching query answers. This is explained by the fact that there is no need for network communication when the query can be answered by the cache at the broker.

6.4 Experimenting with a Real System

We validate the results obtained from the simulation of the previous section by running a real system, varying the amount of memory allocated for a cache of query answers and for a cache of posting lists.


Fig. 15. Optimal division of the cache memory when the next level requires WAN access.

Fig. 16. Average response time and throughput in a server for different splits of memory between a cache of query answers and a cache of postings.

Our system uses threads to process queries in parallel. Each thread processes queries to completion, independently of the other threads. The posting lists are uncompressed as needed during query processing, which stops after matching 10,000 documents.
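A minimal sketch of this threading model, not the authors' implementation, could look as follows; process_to_completion is a hypothetical stand-in for the per-query work just described.

    # A fixed pool of workers, each taking one query at a time and
    # processing it to completion independently of the others.
    from concurrent.futures import ThreadPoolExecutor

    def process_to_completion(query):
        # Stand-in for per-query work: consult the answer cache, decompress
        # the needed posting lists, stop after matching 10,000 documents.
        return query  # placeholder result

    def run(queries, num_threads=2):
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(process_to_completion, queries))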

For training, we use the same data as in the simulation. For testing, we use the first 30,000 queries from the simulation, where the first 10,000 queries are used to warm up the system. For each configuration, we run the system five times and report the average response time in milliseconds, as well as the average throughput. The error bars correspond to the minimum and maximum average response time and throughput obtained among all five runs. We validate our simulation results from the previous section by running the system on a different server with two dual-core processors at 2GHz and 6GB of RAM.

Figure 16(a) shows the average response time on the y-axis and the amount of memory allocated to the cache of query answers on the x-axis. We allocate a total of 0.5GB to the two caches: if 0.2GB is allocated to the cache of query answers, then the remaining 0.3GB is used by the cache of posting lists. Both curves, for a single-threaded and for a two-threaded system, follow the same trend shown in Figure 14.


Table III. Average Response Time and Throughput for a System Without Cache of Query Answers and Posting Lists, and for a System with Both Caches and the Optimal Split of Memory

                                 Avg. Response Time (ms)   Throughput (q/s)
1 thread / no cache                        22.63                 44.51
1 thread / optimal trade-off               12.35                 81.06
2 threads / no cache                       21.61                 92.68
2 threads / optimal trade-off              12.06                167.37

In the case of the single-threaded system, the optimal allocation of memory corresponds to 0.15GB for the cache of query answers. We obtained the same result in the simulation described in the previous section. It is important to note that even though we run the system on a faster server than the one on which the parameters were estimated, the results of the simulation remain valid.

Figure 16(b) shows the throughput achieved in the case of the single-threaded and two-threaded systems. We observe that throughput doubles when using two threads, because the two server cores serve queries simultaneously. Having multiple cores, however, is not sufficient for throughput to increase linearly, since the threads share other system resources, such as the disk. Although we have not investigated this issue in depth, our interpretation is that the doubling occurs because the two threads overlap minimally in their use of the disk. Such a minimal overlap is possible because of the large number of cache hits, which reduce the number of disk accesses and spread them over time, resulting in a small probability of overlap. We note that in both cases, the optimal allocation of memory is again 0.15GB for the cache of query answers.

One important observation is that the optimal trade-off between the cache of query answers and that of posting lists significantly increases the capacity of the system, compared to a system that does not use a cache. Table III shows the average response time and throughput of the system without a cache of query answers or posting lists, and of the system with both caches and the optimal memory allocation. The optimal trade-off results in at least a 44% reduction in average response time and an increase of 80% in throughput.

7. EFFECT OF THE QUERY DYNAMICS

Since the queries in the incoming traffic follow a power-law distribution, some of the most frequent queries remain frequent even after some period of time. The topics on which queries are submitted, however, vary over time and might invalidate the static cache built so far. Hence, we assess the impact of time on the validity of the trained model by studying the statistical characteristics of the query stream, and we show that there is little variation in hit rate over sufficiently long periods of time.

7.1 Stability of Static Caching of Answers

For our query log, the query distribution and the query-term distribution change slowly over time. To support this claim, we first assess how query topics change over time.


Fig. 17. The distribution of queries in the first week of June, 2006 (upper curve) compared to the distribution of new queries in the remainder of 2006.

Figure 17 shows a comparison of the distribution of queries from the first week of June 2006 with the distribution of the queries from the remainder of 2006 that did not appear in that first week. The x-axis shows the rank of the query frequency, normalized on a log scale. The y-axis shows the frequency of a given query. We found that only a very small percentage of queries are new, and the highest frequency among the new queries is more than two orders of magnitude smaller. In fact, the majority of queries that appear in a given week repeat in the following weeks for the next six months.

We observe the stability of the hit rate by considering a test period of three months, and we compare the effect of the training duration by training the static cache for one and for two weeks. Figure 18 shows the results. The hit rate is stable for all three months tested, both with one and with two weeks of training. Interestingly, the hit rate is consistently lower for the static cache trained on a single week. The peaks in the graph correspond to nightly periods, when the hit rate is highest, while the lowest values occur during the day, often around 2–3pm.

7.2 Stability of Static Caching of Posting Lists

The static cache of posting lists can be recomputed periodically. To estimate the time interval at which we need to recompute the posting lists in the static cache, we must consider an efficiency/quality trade-off: recomputing at too short an interval might be prohibitively expensive, while recomputing too infrequently might leave an obsolete cache that no longer corresponds to the statistical characteristics of the current query stream.

We measure the effect of the changes in a 15-week query stream on the QTFDF algorithm (Figure 19(a)). We compute the query-term frequencies over the whole stream, select which terms to cache, and then compute the hit rate on the whole query stream.


Fig. 18. Hit rate trend of a static cache of 128,000 results containing the most frequent queries extracted from one and two weeks before the test period. Hit rate values correspond to periods of six hours.

Fig. 19. Impact of distribution changes on the static caching of posting lists.

This hit rate is an upper bound, as it assumes perfect knowledge of the query-term frequencies. To simulate a realistic scenario, we use the first 6 (respectively, 3) weeks of the query stream for computing the query-term frequencies, and the following 9 (respectively, 12) weeks to estimate the hit rate. As Figure 19(a) shows, the hit rate decreases by less than 2%. We repeated the same experiment for the QTF algorithm, and the decrease in hit rate was less than 0.2%.
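The experiment can be summarized by the following sketch, a simplification in which the cache capacity is a number of terms and the most-frequent-terms selection (QTF-style) stands in for the full algorithm.

    # A minimal sketch of the stability experiment: fix a static term cache
    # from a training prefix of the stream, then measure the term hit rate
    # over the test suffix.
    from collections import Counter

    def static_term_hit_rate(train_queries, test_queries, capacity):
        freq = Counter(t for q in train_queries for t in q.split())
        cached = {t for t, _ in freq.most_common(capacity)}  # most frequent terms
        hits = total = 0
        for q in test_queries:
            for t in q.split():
                total += 1
                hits += t in cached
        return hits / total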

The high correlation among the query-term frequencies during different time periods explains the small changes in hit rate as time elapses. Indeed, the pairwise correlation among all possible 3-week periods of the 15-week query stream is over 99.5%.

A similar result, shown in Figure 19(b), was obtained for the Chile dataset, using one month of the query log for training the static cache and three months for testing. In this case, however, the degradation in quality is greater due to the longer testing period.

8. CONCLUSIONS

Caching is an effective technique for search engines to improve response time, reduce the load on query processors, and improve network bandwidth utilization. The results we presented in this article consider both dynamic and static caching. According to our results, dynamic caching of queries has limited effectiveness due to the high number of compulsory misses caused by unique or infrequent queries. In our UK log, the minimum miss rate is 50% using a working-set strategy. Caching terms is more effective with respect to miss rate, achieving values as low as 12%. We also propose a new algorithm for static caching of posting lists that outperforms previous static caching algorithms as well as dynamic algorithms such as LRU and LFU, obtaining hit rates that are over 10% higher than those strategies.

As one of our main contributions, we present a framework for the analysis of the trade-off between caching query results and caching posting lists. In particular, we use the trade-off to evaluate whether compression in posting-list caching is worthwhile: keeping compressed postings allows for accommodating a greater number of entries, at the cost of greater query evaluation time. We show in the experimental analysis that compression is always better. To the best of our knowledge, this is the first work considering compression inside cache entries. We plan to extend this work by evaluating the impact of different encoding schemes on the performance of posting-list caching. We also show that partitioning the available cache into a static and a dynamic part improves the cache performance for caching posting lists. We use simulation as well as a real system to evaluate different types of architectures. Our results show that for centralized and LAN environments, there is an optimal allocation between caching query results and caching posting lists, while for WAN scenarios, in which network latency prevails, it is more important to cache query results. We leave to future work query processing algorithms that better integrate with caching, improved algorithms for caching posting lists, and a study of the consequences of these results in a production system.

REFERENCES

ANH, V. N. AND MOFFAT, A. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th International ACM Conference on Research and Development in Information Retrieval (SIGIR'06). ACM, New York, NY, 372–379.
BAEZA-YATES, R., GIONIS, A., JUNQUEIRA, F., MURDOCK, V., PLACHOURAS, V., AND SILVESTRI, F. 2007. The impact of caching on search engines. In Proceedings of the 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, NY, 183–190.
BAEZA-YATES, R., JUNQUEIRA, F., PLACHOURAS, V., AND WITSCHEL, H. F. 2007. Admission policies for caches of search engine results. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE'07). Lecture Notes in Computer Science, Vol. 4726, 74–85.
BAEZA-YATES, R. AND SAINT-JEAN, F. 2003. A three level search engine index based in query log distribution. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE'03). Lecture Notes in Computer Science, Vol. 2857, 56–65.
BEITZEL, S. M., JENSEN, E. C., CHOWDHURY, A., GROSSMAN, D., AND FRIEDER, O. 2004. Hourly analysis of a very large topically categorized web query log. In Proceedings of the 27th International ACM Conference on Research and Development in Information Retrieval (SIGIR'04). ACM, New York, NY, 321–328.
BOLDI, P., CODENOTTI, B., SANTINI, M., AND VIGNA, S. 2004. UbiCrawler: a scalable fully distributed web crawler. Softw. Pract. Exper. 34, 8.
BUCKLEY, C. AND LEWIT, A. F. 1985. Optimization of inverted vector searches. In Proceedings of the 8th International ACM Conference on Research and Development in Information Retrieval (SIGIR'85). ACM, New York, NY, 97–110.
BUTTCHER, S. AND CLARKE, C. L. A. 2006. A document-centric approach to static index pruning in text retrieval systems. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, NY, 182–189.
CAO, P. AND IRANI, S. 1997. Cost-aware WWW proxy caching algorithms. In USENIX Symposium on Internet Technologies and Systems.
CASTILLO, C., DONATO, D., BECCHETTI, L., BOLDI, P., LEONARDI, S., SANTINI, M., AND VIGNA, S. 2006. A reference collection for web spam. SIGIR Forum 40, 2, 11–24.
DENNING, P. 1980. Working sets past and present. IEEE Trans. Softw. Eng. SE-6, 1, 64–84.
FAGNI, T., PEREGO, R., SILVESTRI, F., AND ORLANDO, S. 2006. Boosting the performance of web search engines: caching and prefetching query results by exploiting historical usage data. ACM Trans. Inform. Syst. 24, 1, 51–78.
JANSEN, B. AND SPINK, A. 2006. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Inform. Process. Manag. 42, 248–263.
JANSEN, B. J., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life information retrieval: a study of user queries on the web. SIGIR Forum 32, 1, 5–17.
LEMPEL, R. AND MORAN, S. 2003. Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th International World Wide Web Conference (WWW'03). ACM, New York, NY, 19–28.
LONG, X. AND SUEL, T. 2005. Three-level caching for efficient query processing in large web search engines. In Proceedings of the 14th International World Wide Web Conference (WWW'05). ACM, New York, NY, 257–266.
MARKATOS, E. P. 2001. On caching search engine query results. Comput. Commun. 24, 2, 137–143.
NTOULAS, A. AND CHO, J. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, NY, 191–198.
OUNIS, I., AMATI, G., PLACHOURAS, V., HE, B., MACDONALD, C., AND LIOMA, C. 2006. Terrier: a high performance and scalable information retrieval platform. In SIGIR Workshop on Open Source Information Retrieval.
PODLIPNIG, S. AND BOSZORMENYI, L. 2003. A survey of web cache replacement strategies. ACM Comput. Surv. 35, 4, 374–398.
RAGHAVAN, V. V. AND SEVER, H. 1995. On the reuse of past optimal queries. In Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95). ACM, New York, NY, 344–350.
SARAIVA, P. C., DE MOURA, E. S., ZIVIANI, N., MEIRA, W., FONSECA, R., AND RIBEIRO-NETO, B. 2001. Rank-preserving two-level caching for scalable search engines. In Proceedings of the 24th International ACM Conference on Research and Development in Information Retrieval (SIGIR'01). ACM, New York, NY, 51–58.
SILVERSTEIN, C., MARAIS, H., HENZINGER, M., AND MORICZ, M. 1999. Analysis of a very large web search engine query log. SIGIR Forum 33, 1, 6–12.
SLUTZ, D. R. AND TRAIGER, I. L. 1974. A note on the calculation of average working set size. Commun. ACM 17, 10, 563–565.
STROHMAN, T., TURTLE, H., AND CROFT, W. B. 2005. Optimization strategies for complex queries. In Proceedings of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05). ACM, New York, NY, 219–225.
TSEGAY, Y., TURPIN, A., AND ZOBEL, J. 2007. Dynamic index pruning for effective caching. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07). ACM, New York, NY, 987–990.
WITTEN, I. H., BELL, T. C., AND MOFFAT, A. 1994. Managing Gigabytes: Compressing and Indexing Documents and Images. John Wiley & Sons, Inc., New York, NY.
XIE, Y. AND O'HALLARON, D. R. 2002. Locality in search engine queries and its implications for caching. In Proceedings of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'02).
YOUNG, N. E. 2002. On-line file caching. Algorithmica 33, 3, 371–383.
ZHANG, J., LONG, X., AND SUEL, T. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International World Wide Web Conference (WWW'08). ACM, New York, NY, 387–396.

Received December 2007; revised July 2008; accepted August 2008
