
Tuning the Capacity of Search Engines: Load-Driven Routing and Incremental Caching to Reduce and Balance the Load

DIEGO PUPPIN, FABRIZIO SILVESTRI, and RAFFAELE PEREGO
ISTI “A. Faedo”, CNR
and
RICARDO BAEZA-YATES
Yahoo! Research, Barcelona

This article introduces an architecture for a document-partitioned search engine, based on a novel approach combining collection selection and load balancing, called load-driven routing. By exploiting the query-vector document model and the incremental caching technique, our architecture can compute very high quality results for any query, with only a fraction of the computational load used in a typical document-partitioned architecture. By trading off a small fraction of the results, our technique allows us to strongly reduce the computing pressure on a search engine back-end; we are able to retrieve more than 2/3 of the top-5 results for a given query with only 10% of the computing load needed by a configuration where the query is processed by each index partition. Alternatively, we can slightly increase the load, up to 25%, to improve precision and get more than 80% of the top-5 results. In fact, the flexibility of our system allows a wide range of different configurations, so as to easily respond to different needs in result quality or restrictions in computing power. More important, the system configuration can be adjusted dynamically in order to fit unexpected query peaks or unpredictable failures. This article wraps up some recent works by the authors, showing the results obtained by tests conducted on 6 million documents, 2,800,000 queries, and real query cost timings as measured on an actual index.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems; Performance evaluation (efficiency and effectiveness)

General Terms: Design, Performance, Experimentation

Additional Key Words and Phrases: Distributed IR, collection selection, incremental caching, Web search engines

D. Puppin is currently affiliated with Google, Inc.
Authors' addresses: D. Puppin, Google, Inc., 5 Cambridge Center, Cambridge, MA 02142; email: [email protected]; F. Silvestri and R. Perego, Istituto ISTI “A. Faedo”, Consiglio Nazionale delle Ricerche (CNR), via Moruzzi 1, I-56124, Pisa, Italy; email: {f.silvestri, r.perego}@isti.cnr.it; R. Baeza-Yates, Yahoo! Research Spain, Diagonal 177, 9th floor, 08018, Barcelona, Spain; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2010 ACM 1046-8188/2010/05-ART5 $10.00
DOI 10.1145/1740592.1740593 http://doi.acm.org/10.1145/1740592.1740593


ACM Reference Format:
Puppin, D., Silvestri, F., Perego, R., and Baeza-Yates, R. 2010. Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load. ACM Trans. Inform. Syst. 28, 2, Article 5 (May 2010), 36 pages. DOI = 10.1145/1740592.1740593 http://doi.acm.org/10.1145/1740592.1740593

1. INTRODUCTION

Today, about 1.1 billion people have access to the Internet and use it for work or leisure. The World Wide Web is one of the most popular applications, and millions of new Web pages are created every month, to spread information, to advertise services, or for pure entertainment. The Web is getting richer and richer and is storing a vast part of the information available worldwide: it is becoming, to many users in the world, a tool for augmenting their knowledge, supporting their theses, and comparing their ideas with reputable sources [Pew Internet & American Life Project 2005].

To navigate this abundance of data, users rely more and more on search engines for any information task. This is why successful search engines have to crawl, index, and quickly search billions of pages, for millions of users every day. The retrieval problem is getting harder every day due to the growth of the indexed material and to the growing complexity of the queries users submit. Very sophisticated techniques are needed to implement efficient search strategies for very large document collections.

Distributed information retrieval (IR) systems are considered a feasible way to implement a large-scale search service [Baeza-Yates et al. 2007a]. A distributed IR system is usually deployed on large clusters of servers running multiple search core modules, each of which is responsible for searching a partition of the whole index (see Figure 1). When each subindex is relative to a subcollection of documents, we have a document-partitioned index organization, while when the whole index is split so that different partitions refer to a subset of the distinct terms contained in all the documents, we have a term-partitioned index organization. Each organization of the index requires a specific process to evaluate queries, involves different costs for computing and I/O, and triggers different network traffic patterns.

In both cases we have an additional machine (or set of machines) in front of the cluster, hosting a broker, which has the task of scheduling the queries to the various servers, and collecting the returned results. The broker then merges and sorts the received results on the basis of their relevance scores, produces a ranked list of matching documents, and builds a results page containing URLs, titles, snippets, related links, and so on.

Document partitioning is the strategy usually chosen by the most popular Web search engines [Brin and Page 1998; Barroso et al. 2003]. In the document-partitioned organization, the broker may choose between two possible strategies for scheduling a query. A simple, yet very common, way of scheduling queries is to broadcast each of them to all the underlying search cores. This method has the advantage of enabling a good, yet not perfect, load balancing among all the servers [Badue et al. 2007]. On the other hand, it has the major drawback of utilizing every server for each query submitted, causing a higher overall computing load.



Fig. 1. Organization of a distributed information retrieval system.

The other possible way of scheduling is to choose, for each query, the most authoritative server(s). By doing so, we reduce the number of search cores queried. The relevance of each server to a given query is computed by means of a selection function that can be built upon statistics computed over each subcollection. This process, called collection selection, is considered to be an effective technique to enhance the capacity of distributed IR systems [Baeza-Yates et al. 2007a].

In this article, we present an architecture for a distributed search engine based on collection selection. We discuss a novel strategy to perform document partitioning and collection selection, which exploits knowledge about the past use of the search engine to drive the assignment of documents to subcollections, and to choose the most promising subcollections for each query. Our novel architecture relies on effective strategies for load balancing and result caching, and aims at returning very high-quality results by using only a small and tunable fraction of the computing load needed by a traditional document-partitioned search engine.

To achieve this ambitious result, we explored several research directions.

—Several document partitioning strategies are analyzed, and compared with our solution based on the query-vector document model. As a side product, we develop an effective way to select a set of documents that can be safely moved out of the main index.1

—We address the problem of the quality of collection selection strategies, and we show how our new model based on coclustering and on the use of query log information remarkably outperforms state-of-the-art solutions.

—We consider the problem of load balancing, and we design a strategy to reduce the load differences among the cores; our load-driven routing strategy can accommodate load peaks by reducing the pressure on the servers that are less relevant to a given query.

1A strategy common to many search engines is to move low-quality documents into a supplemental index, which can be queried, for instance, when no results are available from the main index.



—We study how to utilize a caching system in an architecture based on collection selection in order to improve both efficiency and efficacy.

By addressing all these problems in a single and complete framework, we show that it is possible to serve a larger volume of queries with a reduced computing load. The main strength of the proposed architecture is its ability to use only a subset of the available search cores to return high-quality results to each query. This will reduce the average computing load for answering each query, and open the possibility of serving more users without adding expensive hardware to the system.

Our solution does not explicitly discuss the performance of a single query, but rather focuses on increasing the overall capacity of the IR system. While our technique can add a small overhead to each query, we believe that this overhead is negligible and largely compensated by the remarkable gain in query throughput.

Our study is mainly centered on finding a good trade-off between result quality and computing load. We show by means of an exhaustive experimentation that, with a small sacrifice in precision, we can dramatically decrease the computing load induced by the query stream on the system. More importantly, this trade-off can also be tuned dynamically, on the basis of the available resources and the current utilization of the system. If enough computational power is available at a given instant, the system can return nondegraded results by polling all servers as in traditional document-partitioned systems, but it can fit unpredictable load peaks by selecting only the most relevant servers, thus reducing the computing pressure on the other servers.

1.1 Main Contributions

This article extends and ties together, in a complete framework, several major contributions:

—the query-vector document model, which is a very compact and effective way to represent documents, suitable for clustering and collection selection (previously introduced in Puppin et al. [2006] and Puppin and Silvestri [2006]);

—a strategy, called load-driven routing, which combines load balancing and collection selection in a unified approach, so as to extract all the available computing bandwidth from the servers that are expected to hold the best results for a query [Puppin et al. 2007];

—incremental caching, a technique that reduces the computing pressure on servers while improving the result quality [Puppin et al. 2007].

Other contributions of our work are:

—a simple partitioning algorithm, based on sorting pages' URLs, which can be a very effective solution if query log information is not available;


—a more compact partition representation, based on the results of coclustering, competing with the state of the art;

—a simple way to select a part of the document collection that can be safely moved out of the main index because it contributes only marginally to the overall precision of the system;

—an effective way to update the index and the document partitions while maintaining the same performance and quality (previously introduced in Puppin [2008]).

Novel, unpublished contributions of this article are:

—an improved cost model for queries based on real-time performance, as measured on a real implementation of our index, which gives a more realistic estimation of the impact of our strategy;

—updated results using a second quality metric (competitive similarity), designed to measure the perceived quality more faithfully;

—a new set of tests, based on a broader query log from Altavista, which confirms the preexisting results; also, we extended the period over which we evaluate the effect of topic shift, up to four weeks;

—new data on the robustness of coclustering: we show that we can maintain the same result quality even when using simplified or reduced data for training;

—new tests on the combined effect of SDC (Static Dynamic Cache [Fagni et al. 2006]) and incremental caching.

The article is organized as follows. After an overview of the current state of the art, we discuss the query-vector document model, and we analyze its performance and precision. We also show how to update the document collection and its partitions, while guaranteeing the same performance.

After this, we introduce our load-driven collection selection strategies, and our incremental caching framework. Experiments validating our solution, conducted on a real-world data collection and real query logs from TodoBR and Altavista, are presented and discussed in Section 6. Finally we conclude, commenting on applications and extensions of our model.

2. RELATED WORK

2.1 Document Clustering and Partitioning

In a document-partitioned system, we can reduce the computing load by polling only the search cores that contain relevant results. However, performing this choice is a very challenging problem usually known as collection selection or query routing. Also, the problem of creating good document clusters for these systems is very complex. In Frieder and Siegelmann [1991], the authors present a model for the problem of assigning documents to a multiprocessor information retrieval system. The problem is shown to be NP-complete. The majority of the proposals in the literature adopt a simple approach, where documents are randomly partitioned, and each query uses all the servers.


A drawback of random partitions is that servers may execute several unnecessary operations by querying subcollections that may contain few or no relevant documents.

Another approach is to use k-means clustering [Jain and Dubes 1988] to partition a collection according to topics, as in Larkey et al. [2000] and Liu and Croft [2004]. This algorithm is unfortunately very expensive and not suitable for large-scale systems.

In this article, we utilize a coordinated strategy, based on the novel query-vector document model, to partition documents according to historical information coming from a query log. We show that our document clusters are very focused, and only a limited set of clusters is sufficient to retrieve very high-quality results.

A recent paper [Poblete and Baeza-Yates 2008] uses the information of user clicks to create a query document model, which differs from our query-vector document model mainly by the fact that the weights associating queries and documents are based on clicking information, rather than the rank returned by a search engine. The authors use this model to cluster the pages of a large Web site, based on query and click information.

2.2 Collection Selection

Several techniques have been presented in the literature to perform query routing on a set of document collections. The work described in Moffat and Zobel [1994] uses a centralized index on blocks of B documents. For example, each block might be obtained by concatenating documents. A query first retrieves block identifiers from the centralized index, then searches the top-ranked blocks to retrieve single documents. This approach works well for small collections, but causes a significant decrease in precision and recall when large collections have to be searched.

Gravano et al. [1994] and Tomasic et al. [1997] describe GlOSS, a broker for a distributed IR system based on the Boolean IR model. It uses statistics over the collections to choose the ones that better fit the user's requests. The authors of GlOSS make the assumption of independence among terms in documents. In Gravano and Garcia-Molina [1995] the authors of GlOSS generalize their ideas to vector-space IR systems (gGlOSS).

In Callan et al. [1995], the authors compare the retrieval effectiveness of searching a set of distributed collections with that of searching a centralized one. The model they use to rank collections is based on inference networks in which leaves represent document collections, and nodes represent terms that occur in the collection. The probabilities that flow along the arcs can be based upon statistics that are analogous to tf and idf in classical document retrieval: document frequency df (the number of documents containing the term) and inverse collection frequency icf (the number of collections containing the term). They call this type of inference network a collection retrieval inference network, or CORI for short. They found no significant differences in retrieval performance between distributed and centralized searching when about half of the collections on average were searched for a query.


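For context, CORI's belief that a query term t is relevant to collection c_i is usually reported in the literature in the following form (a transcription for the reader's convenience, not a formula from this article; d_b ≈ 0.4 is the default base belief):

    T = \frac{df}{df + 50 + 150 \cdot cw_i / \overline{cw}}, \qquad
    I = \frac{\log\big((|C| + 0.5)/cf\big)}{\log(|C| + 1.0)}, \qquad
    p(t \mid c_i) = d_b + (1 - d_b) \cdot T \cdot I,

where df is the number of documents in c_i containing t, cf the number of collections containing t, |C| the total number of collections, cw_i the word count of c_i, and \overline{cw} the average collection word count.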

CVV (Cue-Validity Variance) is proposed in Yuwono and Lee [1997]. It is a collection relevance measure, based on the concept of cue-validity of a term in a collection, which evaluates the degree to which terms discriminate documents in the collection. The paper shows that the effectiveness ratio decreases as very similar documents are stored within the same collection.

Xu and Croft [1999] analyze collection selection strategies using cluster-based language models. They propose three new methods of organizing a distributed retrieval system, called global clustering, local clustering, and multiple-topic representation. In the first method, assuming that all documents are made available in one central repository, a clustering of the collection is created; each cluster is a separate collection that contains only one topic. Selecting the right collections for a query is then the same as selecting the right topics for the query. The second method is local clustering, which is very close to the previous one, except for the assumption of a central repository of documents. This method can provide competitive distributed retrieval without assuming full cooperation among the subsystems. The last method is multiple-topic representation. In addition to the constraints in local clustering, the authors assume that subsystems do not want to physically partition their documents into several collections. The advantage of this approach is that it assumes minimum cooperation from the subsystems. The disadvantage is that it is less effective than both global and local clustering.

In this article, we show how to use the information coming from a query log to build a very effective collection selection function. Our strategy allows us to choose, very efficiently, the set of the most promising servers for each query, with results outperforming the state of the art.

2.3 Web Usage Mining and Caching

Query logs constitute a valuable source of information for tuning and evaluating the effectiveness of caching systems. While there are several papers analyzing query logs for different purposes, just a few consider caching for search engines, storing the results of queries at the broker, so they can be returned quickly when a query is repeated. As noted in Xie and O'Hallaron [2002], many popular queries are in fact shared by different users. Moreover, another recent work [Baeza-Yates et al. 2007b] shows that the distribution of frequent queries changes very slowly over time.

Previous studies of query logs demonstrate that the majority of users visit only the first page of results, and that many sessions end after the first query. In Silverstein et al. [1999], the authors analyzed a large query log from the Altavista search engine, containing about one billion queries submitted in more than a month. Tests conducted included the analysis of the query sessions for each user, and of the correlations among the terms of the queries. Their results show that most users (about 85%) visit the first result page only. They also show that the majority (77%) of users' sessions end just after the first query. A similar analysis is carried out in Jansen et al. [1998].


With results similar to the previous study, they conclude that while IR systems and Web search engines are similar in their features, users of the latter are very different from users of IR systems. A very thorough analysis of users' behavior with search engines is presented in Jansen and Spink [2006]. In addition to the classical analysis of the distribution of page views, number of terms, number of queries, and so on, they show a topical classification of the submitted queries that points out how users interact with their preferred search engine. The paper [Beitzel et al. 2004] analyzes a very large query log containing queries submitted by a population of tens of millions of users searching the Web through AOL. They partitioned the query log into groups of queries submitted in different hours of the day. The analysis tried to highlight the changes in popularity and uniqueness of topically categorized queries within the different groups.

One of the first papers to exploit the query history is Raghavan and Sever [1995]. Although their technique is not properly caching, they suggest using a query base, built upon a set of persistent optimal queries submitted in the past, in order to improve the retrieval effectiveness for similar queries. The paper [Markatos 2001] shows the existence of temporal locality in queries, and compares the performance of different caching policies.

Lempel and Moran propose PDC (Probabilistic Driven Caching), a new caching policy for query results based on the idea of associating a probability distribution with all the possible queries that can be submitted to a search engine [Lempel and Moran 2003]. PDC uses a combination of a segmented LRU (SLRU) cache [Karedla et al. 1994] for requests of the first page, and a heap for storing answers to queries requesting pages after the first. PDC is the first policy to adopt prefetching to anticipate user requests. To this purpose, PDC exploits a model of user behavior. A user session starts with a query for the first page of results, and can proceed with one or more follow-up queries (queries requesting successive pages of results). When no follow-up queries are received within τ seconds, the session is considered finished. This model is exploited in PDC by demoting the priorities of the cache entries referring to queries submitted more than τ seconds ago. To keep track of query priorities, a priority queue is used. PDC results measured on a query log of Altavista are very promising (up to 53.5% hit ratio with a cache of 256,000 elements and 10 pages prefetched).

Fagni et al. [2006] show that combining static and dynamic caching policies together with an adaptive prefetching policy achieves a very high hit ratio. In their experiments, the authors observe that devoting a large fraction of entries to static caching along with prefetching obtains the best hit ratio. They also show the impact of having a static portion of the cache on a multithreaded caching system. Through a simulation of the caching operations, they show that, due to the lower contention, the throughput of the caching system can be doubled by statically fixing one half of the cache entries. Baeza-Yates et al. [2008] show that similar results also hold when posting lists are cached.

In this article, we build on these results by giving the cache the capability of updating the cached results. In our system, based on collection selection, the cache stores the results coming only from a subset of servers. In case of a hit, the broker may poll the servers that did not contribute to the set of results currently in cache.


This way, over time, the full set of relevant results, taken from all the servers, will be cached for repeated queries. Our incremental cache thus improves both the computing load and the result quality.
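As an illustration, an incremental cache entry can be modeled as a pair (results, polled cores); the following minimal Python sketch (illustrative names, not the authors' implementation) shows the merge step performed on a hit:

class IncrementalCache:
    # Sketch of incremental caching: each entry keeps the merged results and
    # the set of cores that produced them. On a hit, the broker can poll only
    # the cores not yet in `polled` and merge their answers into the entry.
    def __init__(self):
        self.entries = {}                      # query -> (hits, polled core ids)

    def lookup(self, query):
        return self.entries.get(query)         # None on a miss

    def merge(self, query, new_hits, new_cores):
        hits, polled = self.entries.get(query, ([], set()))
        # hits are (doc_id, score) pairs; keep them globally sorted by score
        hits = sorted(hits + new_hits, key=lambda h: h[1], reverse=True)
        self.entries[query] = (hits, polled | set(new_cores))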

2.4 Information-Theoretic Coclustering

The coclustering algorithm described in Dhillon et al. [2003] is a very important building block of our infrastructure. In this section, we quickly summarize its main features. The paper describes a very interesting strategy to cluster contingency matrices describing joint probabilities.

Several IR problems are modeled with contingency matrices. For instance, the common vector-space model for describing documents as vectors of terms can be described as a contingency matrix, where each entry is the probability of choosing some term and some document. Obviously, common terms and long documents will have higher probability.

Given a contingency matrix, coclustering is the general problem of simultaneously performing a clustering of columns and rows, in order to maximize some clustering metric (e.g. intercluster distance). In the cited work, the authors give a very interesting theoretical model; coclustering is meant to minimize the loss of information between the original contingency matrix and its approximation given by the coclustered matrix.

They extend the problem by considering the entries as empirical joint probabilities of two random variables, and they present an elegant algorithm that, step by step, minimizes the loss of information. Their algorithm terminates and finds a locally optimal cluster assignment: the loss in mutual information cannot be decreased by reassigning one row or column to a different cluster.

More formally, given a contingency matrix p(x, y), describing a joint probability, and a target matrix size N × M, an optimal coclustering is an assignment of each row to one of the N row clusters, and each column to one of the M column clusters, that minimizes the loss of information between the original matrix p(x, y) and the clustered matrix

    \hat{p}(\hat{x}, \hat{y}) = \sum_{x \in \hat{x}} \sum_{y \in \hat{y}} p(x, y)    (1)

as follows:

    \min_{\hat{X}, \hat{Y}} \big( I(X; Y) - I(\hat{X}; \hat{Y}) \big),    (2)

where I is the mutual information of two variables:

    I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}    (3)

    I(\hat{X}; \hat{Y}) = \sum_{\hat{x}} \sum_{\hat{y}} \hat{p}(\hat{x}, \hat{y}) \log \frac{\hat{p}(\hat{x}, \hat{y})}{\hat{p}(\hat{x})\, \hat{p}(\hat{y})}.    (4)

Clearly, we have:

    p(\hat{x} \mid x) = p(\hat{y} \mid y) = 1.    (5)


By defining:

    q(x, y) = \hat{p}(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}) = \hat{p}(\hat{x}, \hat{y}) \frac{p(x)}{\hat{p}(\hat{x})} \frac{p(y)}{\hat{p}(\hat{y})},    (6)

we have (see Dhillon et al. [2003]) that:

    I(X; Y) - I(\hat{X}; \hat{Y}) = D\big( p(X, Y) \,\|\, q(X, Y) \big),    (7)

where D is the Kullback-Leibler distance of the two distributions:

    D\big( p(X, Y) \,\|\, q(X, Y) \big) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)}.    (8)

q has the same marginal distributions as p, and by working on q we can reach our goal. Given an assignment C^t_X and C^t_Y, the optimal coclustering can be found by assigning a row x to the new cluster C^{(t+1)}_X(x) as follows:

    C^{(t+1)}_X(x) = \arg\min_{\hat{x}} D\big( p(Y \mid x) \,\|\, q^{(t)}(Y \mid \hat{x}) \big).    (9)

The coclustering algorithm in Dhillon et al. [2003] works by assigning each row and each column to the cluster that minimizes the KL distance: D(p(Y|x) \| q^{(t)}(Y|\hat{x})) for rows, and D(p(X|y) \| q^{(t)}(X|\hat{y})) for columns. The algorithm iterates this assignment, monotonically decreasing the value of the objective function. In principle, it stops when the change in the objective function is zero; in practice, to avoid infinite loops, coclustering stops when the change is smaller than a given threshold.

Unfortunately, the algorithm only finds a local minimum, which is influenced by the starting condition. It is possible to build artificial examples where the algorithm reaches only suboptimal results. Also, it may happen that one of the row (or column) clusters becomes empty, due to a fluctuation in the algorithm. In this case, the algorithm is no longer able to fill it up, and the cluster is lost. The result will therefore have N − 1 clusters. In our implementation,2 we addressed these issues by introducing a small random noise, which reassigns a small number of rows and columns if one cluster becomes empty or if there is no improvement in the objective function.
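To make the iteration concrete, here is a minimal NumPy sketch of the row-reassignment step of Equation (9) (the column step is symmetric; the random-noise handling described above and all efficiency concerns are omitted):

import numpy as np

def cocluster_row_step(p, rx, cy, n_rc, n_cc, eps=1e-12):
    # One row-reassignment pass of information-theoretic coclustering
    # (Dhillon et al. [2003]). p is the joint matrix (rows x columns,
    # summing to 1); rx and cy hold the current row/column cluster labels.
    p_hat = np.zeros((n_rc, n_cc))            # cluster-level joint p_hat(x^, y^)
    for a in range(n_rc):
        for b in range(n_cc):
            p_hat[a, b] = p[np.ix_(rx == a, cy == b)].sum()
    p_y = p.sum(axis=0)                       # marginal p(y)
    p_hat_y = np.array([p_y[cy == b].sum() for b in range(n_cc)])
    # q(y | x^) = p_hat(y^ | x^) * p(y) / p_hat(y^), with y^ the cluster of y
    p_cond = p_hat / np.maximum(p_hat.sum(axis=1, keepdims=True), eps)
    q = p_cond[:, cy] * (p_y / np.maximum(p_hat_y[cy], eps))
    new_rx = rx.copy()
    for i in range(p.shape[0]):               # assign each row to the argmin KL
        px = p[i] / max(p[i].sum(), eps)      # p(Y | x)
        mask = px > 0
        if not mask.any():                    # all-zero (silent) rows stay put
            continue
        kl = (px[mask] * np.log(px[mask] / np.maximum(q[:, mask], eps))).sum(axis=1)
        new_rx[i] = int(np.argmin(kl))
    return new_rx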

The algorithm is very fast and our implementation can compute a whole iteration over about 170,000 rows and 2.5 M columns in 40 seconds on a standard desktop machine. In our tests, we found that ten iterations are usually enough to reach convergence.

Moreover, the algorithm has the potential to be run in parallel on several machines, and can scale to bigger data sets. A recent paper [Papadimitriou and Sun 2008] presents a parallel implementation based on the map-reduce framework [Dean and Ghemawat 2004].

3. HOW TO IMPROVE PARTITIONING: THE QV DOCUMENT MODEL

We believe that information and statistics about queries may help drive the partitioning to a good choice. The goal of our partitioning strategy is to cluster the most relevant documents for each query in the same partition.

2http://diego.puppin.it/phd.


Table I. In the query-vector model, every document is represented by the queries it matches (weighted with the score).

Query/Doc   d1    d2    d3    d4    d5    d6   ...   dn
q1           -    0.5   0.8   0.4    -    0.1  ...    -
q2          0.3    -    0.2    -     -     -   ...   0.1
q3           -     -     -     -     -     -   ...    -
q4           -    0.4    -    0.2    -    0.5  ...   0.3
...         ...   ...   ...   ...   ...   ...  ...   ...
qm          0.1   0.5   0.8    -     -     -   ...    -

The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [Van Rijsbergen 1979]. Clustering algorithms, like the k-means method we cited, exploit this claim by grouping documents on the basis of their content.

We instead base our partitioning method on the novel Query-Vector (QV) document model, introduced in Puppin et al. [2006]. In the QV model, documents are represented by the weighted list of queries (out of a training set) that recall them; the QV representation of document d is a vector where each dimension is the score that d gets for a query in the query set. The set of the QVs of all the documents in a collection can be used to build a query-document matrix, which can be normalized and considered as an empirical joint distribution of queries and documents in the collection. Our goal is to cocluster queries and documents, so as to identify queries recalling similar documents, and groups of documents related to similar queries. The algorithm we adopt is described in Dhillon et al. [2003], and is based on a model exploiting the empirical joint probability of picking a given couple (q, d), where q is a given query and d is a given document (see Section 2.4). The results of the coclustering algorithm are then used to perform collection selection.

Let us go into deeper detail and formalize our QV document model. So far, two document models have mainly been used in IR: bag-of-words and vector space. Since we record which documents are returned as answers to each query, we can represent a document as a query-vector. The QV representation of a document is built out of a query log. A reference search engine is used in the building phase: for every query in the training set, the system stores the first N results along with their score.

Table I gives an example. The first query, q1, recalls, in order, d3 with score 0.8, d2 with score 0.5, and so on. Query q2 recalls d1 with score 0.3, d3 with score 0.2, and so on. We may have empty columns, when a document is never recalled by any query (in this example d5). Also, we can have empty rows when a query returns no results (q3).

We can state this more formally.

Definition 1. Query-vector model. Let Q be a query log containing queries q1, q2, . . . , qm. Let di1, di2, . . . , din_i be the list of documents returned, by a reference search engine, as results to query qi, and let rij be the score that document dj gets as a result of query qi (0 if the document is not a match).


A document dj is represented as an m-dimensional query-vector \hat{d}_j = [\hat{r}_{ij}]^T, where \hat{r}_{ij} \in [0, 1] is the normalized value of r_{ij}:

    \hat{r}_{ij} = \frac{r_{ij}}{\sum_{i \in Q} \sum_{j \in D} r_{ij}}.    (10)

In our model, the underlying reference search engine is treated as a black box, with no particular assumptions about its behavior. Internally, the engine could use any metric, algorithm, and document representation. The QV model will simply be built out of the results recalled by the engine using a given query log.

Definition 2. Silent documents. A silent document is a document never recalled by any query from the query log Q. Silent documents are represented by null query-vectors.

The ability to identify silent documents is a very important feature of our model because it allows us to determine a set of documents that can safely be moved to a supplemental index.

The rij values form a contingency matrix R, which can be seen as an empirical joint probability distribution and used by the cited coclustering algorithm. This approach simultaneously creates clusters of rows (queries) and columns (documents) out of an initial matrix, with the goal of minimizing the loss of information.
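As an illustration, the construction of the matrix R and the detection of silent documents can be sketched as follows (a dense NumPy sketch, where `search` is a placeholder for the reference engine, not a real API; at the scale of the paper the matrix would be kept sparse):

import numpy as np

def build_qv_matrix(queries, n_docs, search, top_n=100):
    # Build the normalized query-document matrix of the QV model (Eq. 10)
    # and list the silent documents. search(q, top_n) is assumed to return
    # (doc_id, score) pairs from the reference search engine.
    R = np.zeros((len(queries), n_docs))
    for i, q in enumerate(queries):
        for doc_id, score in search(q, top_n):
            R[i, doc_id] = score
    total = R.sum()
    if total > 0:
        R /= total                                 # empirical joint distribution
    silent = np.flatnonzero(R.sum(axis=0) == 0)    # candidates for the supplemental index
    return R, silent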

Coclustering considers both documents and queries. We thus have two different sets of results: (1) groups of documents answering similar queries, and (2) groups of queries with similar results. The first group of results is used to build the document partitioning strategy, while the second is the key to our collection selection strategy (see the following).

The result of coclustering is a matrix \hat{P} defined as:

    \hat{P}(qc_a, dc_b) = \sum_{i \in qc_a} \sum_{j \in dc_b} \hat{r}_{ij}.    (11)

In other words, each entry \hat{P}(qc_a, dc_b) sums the contributions of \hat{r}_{ij} for the queries in query cluster a and the documents in document cluster b. We call this matrix simply PCAP, from the LaTeX command \cap{P} needed to typeset it.3 The values of PCAP are important because they measure the relevance of a document cluster to a given query cluster. This naturally induces a simple but effective collection selection algorithm.

3.1 PCAP Selection Algorithm

The queries belonging to each query cluster are chained together into query dictionary files. Each dictionary file stores the text of each query belonging to a cluster, as a single text file.

3The correct LaTeX command for \hat{P} is actually \widehat{P}, but we initially thought it was \cap{P}, which incorrectly gives ∩P. Hats, caps... whatever.


Fig. 2. Example of PCAP to perform collection selection. We have three query clusters: qc1 = “hotel in Texas resort accommodation in Dallas hotel downtown Dallas Texas”, qc2 = “car dealer Texas buy used cars in Dallas automobile retailer Dallas TX” and qc3 = “restaurant chinese restaurant eating chinese Cambridge”. The second cluster is the best match for the query “used Ford retailers in Dallas”. The third document cluster is expected to have the best answers.

For instance, if the four queries hotel in Texas, resort, accommodation in Dallas, and hotel downtown Dallas Texas are clustered together as the first query cluster, the first query dictionary will simply be qc1 = “hotel in Texas resort accommodation in Dallas hotel downtown Dallas Texas”. The second query dictionary file could be, for instance, qc2 = “car dealer Texas buy used cars in Dallas automobile retailer Dallas TX”. A third query cluster could be qc3 = “restaurant chinese restaurant eating chinese Cambridge”.

When a new query q is submitted to the IR system, the BM25 metric [Robertson and Walker 1994] is used to find which clusters are the best matches. Each dictionary file is considered as a document, which is indexed using the vector-space model, and then queried with the usual BM25 technique. This way, each query cluster qci receives a score relative to the query q, say rq(qci). In our example, if a user asks the query “used Ford retailers in Dallas”, rq(qc2) will be higher than rq(qc1) and rq(qc3).

This is used to weight the contribution of the PCAP entry \hat{P}(i, j) for the document cluster dc_j, as follows:

    r_q(dc_j) = \sum_i r_q(qc_i) \cdot \hat{P}(i, j).    (12)

Figure 2 gives an example. The top table shows the PCAP matrix for three query clusters and five document clusters. Suppose BM25 scores the query clusters respectively 0.2, 0.8, and 0, for a given query q. We compute the vector of scores rq(dcj) by multiplying the matrix PCAP by rq(qci), and we will choose the collections dc3, dc1, dc2, dc5, and dc4, in this order.
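The selection step itself reduces to a matrix-vector product; a minimal sketch of Equation (12), assuming the BM25 scores of the query-cluster dictionaries are already available as a vector:

import numpy as np

def rank_collections(r_qc, pcap):
    # PCAP collection selection: r_qc[i] is the BM25 score r_q(qc_i) of the
    # query against the i-th dictionary file; pcap[i, j] is P(qc_i, dc_j).
    r_dc = r_qc @ pcap                 # r_q(dc_j) = sum_i r_q(qc_i) * P(i, j)
    return np.argsort(-r_dc)           # document clusters, best first

With the scores of the example above (r_qc = [0.2, 0.8, 0.0]) and the PCAP values of Figure 2, this would return the ranking dc3, dc1, dc2, dc5, dc4 described in the text.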

The QV model and the PCAP selection function are together able to create very robust document partitions. In addition, they allow the search engine to identify, with great confidence, the most authoritative servers for any query. In the next sections, we compare our strategy with several other partitioning and selection strategies.

3.2 Application to a Distributed Search Engine

We use these ideas to design a distributed IR system for Web pages. Our strategy is as follows. First, we train the system with the query log from the training period, by using a centralized reference index to answer all the queries submitted to the system.


We record the top-scoring results for each query. Then, we perform coclustering on the resulting query-document matrix. The documents are then partitioned onto several search cores according to the results of clustering.4

We partition the documents into 17 clusters; the first 16 are obtained by coclustering the query-doc matrix, and the last holds the silent documents: the documents that are not returned by any query, represented by null query-vectors. In other words, the 17th cluster is used as a supplemental index. This number was initially chosen to match our test architecture.

After the training phase, we perform collection selection as previously shown. In this section, we show a model where the broker actively chooses which cores are going to be polled for every query. In the next sections, we will show how to give more responsibility to each core. The cores holding the selected collections receive the query, and return their results, which will be merged by the broker. In order to have comparable document rankings across cores, we distribute the global collection statistics to each search core.
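Putting the pieces together, the broker loop described above can be sketched as follows (illustrative names: `score_query_clusters` and the per-core search functions are placeholders, and `rank_collections` is the sketch given earlier):

def answer_query(q, score_query_clusters, cores, pcap, k=3, n=10):
    # Broker flow with collection selection: score the query clusters, pick
    # the most promising cores, poll only those, and merge by score. The
    # globally distributed collection statistics make scores comparable
    # across partitions.
    ranked = rank_collections(score_query_clusters(q), pcap)
    hits = []
    for core_id in ranked[:k]:
        hits.extend(cores[core_id](q))     # each core returns (doc_id, score)
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:n]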

One of the problems that appeared when we tried to test our approach was how to properly measure the quality of the results returned by our system after collection selection. Due to the nature of the data, we do not have a list of manually chosen relevant documents for each query (as happens with the TREC data5). Following the example of previous works [Xu and Callan 1998], we compare the results coming from collection selection with the results coming from a centralized index. This is an effective approach for two reasons. First, we can use any query from our log as a data point, this way broadening the variety and complexity of our experimental evaluation. Second, our collection selection strategy uses the underlying retrieval algorithm as a black box; we are interested in guaranteeing the performance of the search algorithm when the index is partitioned and collection selection must be performed. That is why our gold standard is given by the results coming from the centralized index.6

In our experiments, we use the intersection and competitive similarity metrics, adapted from Chierichetti et al. [2007].

Definition 3. Intersection. Let us call G_q^N the top N results returned for q by a centralized index (ground truth), and H_q^N the top N results returned for q by the set of servers chosen by our collection selection strategy. The intersection at N, INTER_N(q), for a query q, is the fraction of results retrieved by the collection selection algorithm that appear among the top N documents in the centralized index:

    INTER_N(q) = \frac{\left| H_q^N \cap G_q^N \right|}{\left| G_q^N \right|}.    (13)

4Some documents may be replicated on multiple clusters to reduce assignment and selection errors. This is a general trade-off for IR systems with a partitioned index, thoroughly discussed in the literature.
5Instructions on how to obtain the TREC Web Corpus (WT10g) are available at http://www.ted.cmis.csiro.au/TRECWeb/wt10g.html.
6Our test queries include both informational and navigational queries. We have to note that our method has limited application to navigational queries, where the loss of one relevant result may have a higher impact on quality. Modern search engines, anyway, are able to detect navigational queries and perform special operations on them.

Note that INTER_N is equivalent to the precision at N, P@N, if we consider the top N results from the centralized index as relevant.

Definition 4. Competitive similarity. Given a set D of documents, we call total score the value

    S_q(D) = \sum_{d \in D} r_q(d),    (14)

with r_q(d) the score of d for query q. The competitive similarity at N, COMP_N(q), is measured as:

    COMP_N(q) = \frac{S_q(H_q^N)}{S_q(G_q^N)}.    (15)

This value measures the relative quality of results coming from collection selection with respect to the best results from the central index. In both cases, if |G_q^N| = 0 or S_q(G_q^N) = 0, the query q is not used to compute average quality values.
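Both metrics are straightforward to compute per query; a direct transcription of Equations (13)-(15) (a sketch, where `scores` maps each document to its centralized-index score r_q(d)):

def inter_and_comp(ground, selected, scores, n):
    # INTER_N and COMP_N for one query. `ground` and `selected` are ranked
    # doc-id lists from the centralized index and from collection selection.
    # Queries with empty denominators return None and are skipped when
    # averaging, as described in the text.
    G, H = set(ground[:n]), set(selected[:n])
    inter = len(H & G) / len(G) if G else None
    s_g = sum(scores[d] for d in ground[:n])       # S_q(G_q^N)
    s_h = sum(scores[d] for d in selected[:n])     # S_q(H_q^N)
    comp = s_h / s_g if s_g else None
    return inter, comp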

3.3 Experimental Evaluation

This strategy was tested on a simulated distributed Web search engine. We used the WBR99 Web document collection,7 of 5,939,061 documents: Web pages representing a snapshot of the Brazilian Web (domains .br) as spidered by the crawler of the TodoBR search engine. The collection consists of about 22 GB of uncompressed data, mostly in Portuguese and English, and comprises about 2,700,000 different terms after stemming.

Along with the collection, we used a log of queries submitted to TodoBR in the period January through October 2003. We selected the first three weeks of the log as our training set. It is composed of about half a million queries, of which 190,000 are distinct. The main test set is composed of the fourth week of the log, comprising 194,200 queries. The main features of our setup are summarized in Table II.

As our search engine, we used Zettair,8 a compact and fast text search engine designed and written by the Search Engine Group at RMIT University. We modified it so as to implement different collection selection strategies (CORI and PCAP).

To test the quality of our approach, we performed a clustering task aimed at document partitioning and collection selection, for a distributed information retrieval system. We compared different approaches to partitioning and selection. We partitioned the documents into 17 clusters.

7Thanks to Nivio Ziviani and his group at UFMG, Brazil, who kindly provided the collection and the query logs.
8Available under a BSD-style license at http://www.seg.rmit.edu.au/zettair/.


Table II. Main features of our test set.

d    5,939,061 documents, 22 GB uncompressed
t    2,700,000 unique terms
t'   74,767 unique terms in queries
tq   494,113 (190,057 unique) queries in the training set (three weeks - TodoBR)
q1   194,200 queries in the main test set (fourth week - TodoBR)

For partitions created with coclustering, the 17th cluster, or overflow cluster (OVR), holds the supplemental index, which stores the silent documents: in these experiments, we found that 52% of the documents (3,128,366) are not returned among the top 100 results by any of the training queries, and can be safely moved out of the main 16 cores used by PCAP. The approaches we tested are as follows.

—Random. A random allocation; this is, to the best of our knowledge, the most popular approach to document partitioning among commercial search engines.

—Shingles. Documents' signatures are computed using shingling [Broder et al. 1997], and then clustered with a standard k-means algorithm. Shingles have already been used in the past to cluster text collections and to detect duplicate pages [Chowdhury et al. 2002; Hoad and Zobel 2003]. They have been shown to be a very effective document representation for estimating document similarity.

—URL-sorting. A very simple heuristic, which assigns documents blockwise after sorting them by their URL; this is the first time URL-sorting is used to perform document clustering. We show that this simple technique, already used for other IR tasks in Randall et al. [2002], Boldi and Vigna [2004], and Silvestri [2007], can offer a remarkable improvement over a random assignment (a sketch is given after this list).

—K-means. We performed k-means over the document collection, represented by query-vectors.

—Co-clustering. We used the coclustering algorithm to compute document and query clusters. We created 16 document clusters and 128 query clusters, with 10 iterations of the coclustering algorithm.
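For reference, the URL-sorting heuristic mentioned in the list amounts to a sort followed by a blockwise split (a sketch of the heuristic, not the exact implementation used in the experiments):

def url_sort_partition(urls, n_parts):
    # Sort document identifiers by URL and assign them blockwise, so pages
    # of the same site tend to fall into the same partition.
    ordered = sorted(range(len(urls)), key=lambda i: urls[i])
    block = -(-len(urls) // n_parts)               # ceiling division
    return [ordered[k * block:(k + 1) * block] for k in range(n_parts)]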

We used CORI as the collection selection function in all the tests performed except the last one, where we used PCAP. Because the collections are not guaranteed to have the same size, some search cores may be underutilized. This problem is tackled by our load-driven routing strategy, discussed in Section 4.

Results are shown in Table III. We show the intersection at 5, 10, and 20 (INTER5, INTER10, INTER20) when using only a subset of servers, always chosen as the set of the most promising servers for each query, using CORI or PCAP. The first column, for instance, shows the value of the INTERk measure when only the most promising server is used to answer each query. Later in this article, we will consider the effects of this strategy on load balance.

Shingles offer only a moderate improvement over a random allocation, and a bigger improvement when a large number of collections, about half, are chosen.


Table III. Comparison of different clustering and selection strategies: intersection (percentage) at 5, 10, 20. Columns give the number of polled servers.

INTER5 (%)              1    2    4    8   16  OVR
CORI on random          6   11   25   52   91  100
CORI on shingles       11   21   38   66  100  100
CORI on URL sorting    18   25   37   59   95  100
CORI on k-means QV     29   41   57   73   98  100
CORI on coclustering   31   45   59   76   97  100
PCAP on coclustering   34   45   59   76   96  100

INTER10 (%)             1    2    4    8   16  OVR
CORI on random          5   11   25   50   93  100
CORI on shingles       11   21   39   67  100  100
CORI on URL sorting    18   25   37   59   95  100
CORI on k-means QV     29   41   56   74   98  100
CORI on coclustering   30   44   58   75   97  100
PCAP on coclustering   34   45   58   76   96  100

INTER20 (%)             1    2    4    8   16  OVR
CORI on random          6   12   25   48   93  100
CORI on shingles       11   21   40   67  100  100
CORI on URL sorting    18   24   36   57   95  100
CORI on k-means QV     29   41   56   74   98  100
CORI on coclustering   30   43   58   75   97  100
PCAP on coclustering   34   45   58   75   96  100

Shingles are not able to cope effectively with the curse of dimensionality. Our experimental results show that URL-sorting is actually a good clustering heuristic, better than k-means on shingles when a small number of servers are polled. URL-sorting is even better if we consider that sorting a list of a billion URLs is not as complex as computing a clustering over one billion documents. This method could therefore become the only one feasible in a reasonable time within large-scale Web search engines if information from the query logs is not available.

Our results improve dramatically when we shift to clustering strategies based on the query-vector representation. The results of using CORI over partitions created with k-means on query-vectors (a value of INTER of about 29% when a single partition is queried) are much better than the results we had with other clustering strategies that do not exploit usage information (obtaining an INTER score up to 18%).

Coclustering performs even better. Both CORI and PCAP on coclustered documents are superior to previous techniques, with PCAP also outperforming CORI by about 10% (from 30% to 34%). This result is even stronger when we look at the footprint of the collection representation, which is about five times smaller for PCAP (see Puppin et al. [2006]).


3.4 Robustness of the Coclustering Algorithm

The quality of the results reached by our strategy is confirmed by experiments on different configurations. We used different training sets and different test sets. First, we tried to measure the effect of simplifying and reducing the initial QV matrix. Instead of storing the scores obtained for all documents with the training queries, we tried using only Boolean values (a value of true when a document was a match for a query, false otherwise). Also, we stored only the top 20 results (instead of 100), and we tried to use a reduced number of training queries: those that appeared repeatedly in the training set. We compare:

—10iter: the initial configuration, with 10 iterations of coclustering, with 128 query-clusters, 16 document-clusters + overflow, on the full scores;

—Nonunique: the result obtained training only on repeated queries (queries appearing at least twice in the training set);

—Top20: which uses only the top 20 results of each query for training;

—Boolean: which builds coclustering on Boolean values.
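The following toy sketch illustrates, under our own naming and a dense-matrix assumption, how the Boolean and Top20 reductions can be derived from a QV matrix of scores (the Nonunique variant simply filters the training log down to queries occurring at least twice):

import numpy as np

# Toy QV matrix: qv[query, document] holds the retrieval score, 0 if the
# document was not returned for the query. Shapes and values are illustrative.
qv = np.random.rand(6, 10) * (np.random.rand(6, 10) > 0.7)

boolean_qv = (qv > 0).astype(float)   # "Boolean": 1 if the doc matched the query

def keep_top_k(matrix, k):
    """Zero out everything but each query's k highest-scoring documents."""
    out = np.zeros_like(matrix)
    for i, row in enumerate(matrix):
        top = np.argsort(row)[::-1][:k]
        out[i, top] = row[top]
    return out

top20_qv = keep_top_k(qv, 3)          # k = 20 in the paper; 3 fits this toy matrix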

In Figure 3, the reader can clearly see that using only Boolean values, or limiting to the top queries, only marginally affects the quality of results. This means that:

—the input matrix is extremely redundant, and the algorithm can use the count of non-zero entries instead of scores to perform a very good assignment;

—queries that appear just once in the query log do not contribute to the training; this can be a very important factor for reducing the training effort.

On the other hand, the quality of results degrades if we store only the top 20 results, instead of the top 100. In this case, the number of non-zero entries in the input matrix drops severely, as does the number of recalled (non-silent) documents. The size of the overflow collection is now much bigger, and a large fraction of good documents is lost. For instance, when we measure the intersection at 50 (the number of the top-ranking 50 documents that are available when using collection selection), we can see that the overflow cluster (OVR) holds about 25% of the relevant documents when we perform training with only the top 20 results, while this is about 12% when we train with only the repeated queries, and about 5% with full or Boolean data. Also, the recalled documents are not assigned properly. This is because there is much less information if we cut the result list to 20; all documents with low relevance are lost.

The quality of the results achieved in the test phase clearly depends on the quality of the partitions induced by the training set. If, over time, the type and frequency of queries change heavily, the partitions could turn out to be suboptimal. This is an important problem in Web IR systems, as the user could be suddenly interested in new topics (e.g., due to news events). This phenomenon is usually referred to as topic shift.

We measured the robustness of training over time. To do this, we tested our partitioning against queries coming from subsequent weeks. Results are totally comparable (see Figure 4, where 10iter refers to the first week).


[Figure 3: two panels plotting INTER5 and INTER50 (y-axis: 0-100%) against the number of polled servers (1-16, plus OVR), for the 10iter, boolean, top20, and nonunique configurations.]

Fig. 3. Robustness of coclustering with different training data. Result quality (intersection at 5 and 50) of PCAP for queries coming from the first week, with clusters created on reduced data (Boolean matrix, top 20 results only, repeated queries only). 10iter refers to the baseline: clusters created from complete data.

Users do not perceive a degradation of results over time. This confirms the results of Fagni et al. [2006] and Baeza-Yates et al. [2007b], which showed that training on query logs is a robust process over time, because the distribution of frequent queries changes very slowly.

The overall results are very encouraging: the first server (out of 16, plus supplemental index) chosen by the PCAP function holds more than one third of the documents that would be chosen by a fully centralized index. This means that if a search engine based on our architecture polls only one server per query, i.e., with a very limited computing load, it can return more than one third of the most relevant results.


[Figure 4: two panels plotting INTER5 and INTER50 (y-axis: 0-100%) against the number of polled servers (1-16, plus OVR), for the 10iter, secondweek, thirdweek, and fourthweek test sets.]

Fig. 4. Robustness of training to topic shift. Result quality (intersection at 5 and 50) of PCAP for queries coming from different weeks. 10iter refers to the first week (training with 10 iterations of coclustering).

Please note that, after removing all the silent documents (about half the collection), each of the 16 search cores holds about 1/32 of the collection: one-third of the top results are found by querying only 1/32 of the collection. Clearly, it is possible to improve this figure by polling more servers.

Our experimental results proved that coclustering is very robust, and that it can be used to create well-defined partitions. In our tests, 10 iterations and 128 query clusters is the best configuration for our purpose. It is important to say that, by tuning our coclustering algorithm, we are not artificially improving the performance of PCAP over CORI. In fact, the performance of CORI also greatly benefits from our partitioning strategy over competing solutions (shingles, URL sorting, k-means over QV).


Another benefit of our strategy is that it is able to identify a set of documents (about 50% of the collection) that can be safely moved to a supplemental index with minimal loss. The process of pruning these documents could improve the performance of the system, because less than 50% of the documents contribute more than 97% of the relevant documents; around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query used in the training phase.

In the rest of this article, we use the QV model to cluster our document collection, and the PCAP selection function to route queries.

3.5 How to Update the Index

The problem of adding/removing documents and updating the index is particularly complex in our architecture. If the query is broadcast to all cores, as in a traditional document-partitioned system, the choice of the core to which a new document is mapped does not have any effect on the precision of the system (even if it could affect the performance).

In our architecture, on the other hand, this choice is more important; it is crucial to assign new documents to the correct collection, to the server where most similar or related documents are already assigned. In other words, we want to store the new document where the collection selection function will find it. This is complex because the document could be relevant for several queries, and the collections will have a different relevance or authoritativeness for each of them.

An ideal approach would be to perform all queries in the training set against the new documents, and then to rerun coclustering on the extended QV matrix. This is clearly unfeasible, as the training set can be composed of several thousand queries. In Puppin [2008], we introduced a simpler approach, showing that a very quick approximation can be reached using the PCAP collection selection function itself. The body of the document can be used in place of a query, and the collection selection function will be able to rank the collections according to their relevance. The rationale is simple: the terms in the document will find the servers holding all the other documents relevant to the same broad topics. Our collection selection function is able to order the document collections according to their relevance to a given query. If the body of the document is used as a query, the collection selection function will find the closest collections.

To allocate a new document to a collection, we use the following strategy. First, we perform a query, using the document body as a topic, against the query dictionaries, and we determine a rank for each query cluster. These are the clusters comprising queries with terms also appearing in the document. Then, we use the PCAP matrix to choose the best document clusters, as seen in Section 3.1. If no query dictionary matches the new document, it will be classified as silent and stored in the supplemental index. From the moment of the assignment on, the new document will appear among the results of the new queries.9 They will contribute to building a new QV matrix, used when the system chooses to have a major update of the whole index.

9 The techniques to update the local index managed by each core have been discussed in the literature, and are beyond the scope of this article.


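A minimal sketch of this assignment step follows. The overlap-based scoring of query clusters and the name pcap for the PCAP matrix are our assumptions; the paper treats the selection function itself as a given black box.

from collections import Counter

# Hypothetical sketch: score query clusters by term overlap between the new
# document's body and each cluster's dictionary, then rank document clusters
# through the PCAP matrix pcap[query_cluster][doc_cluster].
def assign_new_document(body_terms, query_cluster_dicts, pcap, n_doc_clusters):
    body = Counter(body_terms)
    # Rank query clusters by how many of their terms appear in the document.
    qc_scores = [sum(body[t] for t in qdict) for qdict in query_cluster_dicts]
    if not any(qc_scores):
        return None   # no dictionary matches: the document is silent,
                      # and goes to the supplemental index
    # Document-cluster score: query-cluster scores weighted by PCAP entries.
    dc_scores = [sum(qc_scores[qc] * pcap[qc][dc]
                     for qc in range(len(query_cluster_dicts)))
                 for dc in range(n_doc_clusters)]
    return max(range(n_doc_clusters), key=dc_scores.__getitem__)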

In Puppin [2008], we verified that this approach is very effective. We simulated the behavior of a system where a part of the collection (about one million documents) is added after the initial setup (with five million documents), as a result of a crawl. We simulated six different configurations.

(a) This represents the situation right after the training, with 5 million documents available, partitioned using coclustering. We test against the first set of test queries (queries for the first week after the training).

(b) This represents the situation one week later, with 5.5 million documents available. The new half million documents are assigned using PCAP, as just described. We test against the second set of test queries (queries for the second week after the training).

(c) This is similar to (b), but now old and new documents are clustered using a full run of coclustering. Same test set as (b).

(d) This is the situation after two weeks, with all documents available. In this configuration, all new documents are assigned with PCAP. We use the third week of queries.

(e) Here we imagine that we were able to run coclustering for the first half million new documents, but not for the latest ones. The test set is the same as (d).

(f) Finally, all 6 million documents are clustered with coclustering. The test set is the same as (d).

Figure 5 shows the result quality for the different configurations. No significant change is measured in the different settings.

4. LOAD-DRIVEN ROUTING

In this article, we present the novel concept of load-driven routing. To the best of our knowledge, this is the first time a strategy for collection selection is designed to address load balancing in a distributed IR system.

We have shown in the previous section that, by choosing a fixed, limited number of servers to be polled for each query, the system can return a very high fraction of the relevant results. This strategy can cause a strong difference in the relative computing load of the underlying search cores: if one server happens to be hit more frequently than another, the IR system will be slowed down by the performance of the most loaded server. Let us now formally define our metrics for load and peak load.

Definition 5. Load, peak load. For each search core server c in the system, given a query stream Q, and a window size W, we call instant load at time t, lc,t(Q), the fraction of queries answered by c from the set of W queries ending at t. We define the peak load for c, lc(Q), as the maximum of lc,t(Q) over t.

In our experiments, we set W equal to 1000: we keep track of the fraction of queries that hit a server out of a rotating window of 1000 queries. With no cache and no collection selection, we expect to have a peak load of 100%. On the other hand, if we choose to poll only one of N servers (the most promising one) for every query, each server should register a load of 1/N on average. There can be peaks when one server is hit more often than the others, which can raise the fraction it has to elaborate in a time window.
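As a sketch, the peak load of Definition 5 can be measured from a routing trace as follows; the trace format, a list of (query, servers_polled) pairs, and the function name are illustrative, not from the paper.

from collections import deque

# Minimal sketch: slide a window of W queries over the trace and record the
# largest fraction of them that was answered by the given core.
def peak_load(trace, core, W=1000):
    window = deque(maxlen=W)   # 1 if `core` answered the query, else 0
    peak = 0.0
    for _, servers in trace:
        window.append(1 if core in servers else 0)
        if len(window) == W:
            peak = max(peak, sum(window) / W)   # instant load lc,t(Q)
    return peak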


[Figure 5: two panels plotting INTER5 and INTER50 (y-axis: 0-100%) against the number of polled servers (1-16, plus OVR), for configurations A through F.]

Fig. 5. Result quality with index update. Intersection at 5 and 50, with the test configurations described in Section 3.5.


In Figure 6, we show a sample of the peak load reached by our cores when we poll the four most authoritative collections for every query (FIXED strategy), in the presence of an LRU result cache of 4000 entries. It varies from 100 to about 250 queries out of a rotating window of 1000 queries.

In our load-driven collection selection system, more servers can choose to answer a given query, even if they are not the most authoritative for it, if they happen to be momentarily underloaded. This way, the system tries to exploit the load differences among servers to gather more results.


[Figure 6: load (queries elaborated out of 1000, y-axis: 0-300) on each of the 16 cores, for FIXED 4, BASIC 24.7, BOOST 4 24.7, and BOOST 4 24.7 + INC.]

Fig. 6. Computing pressure (sample) on the cores when using different routing strategies: routing to the 4 most promising servers for each query (FIXED), load-driven routing capped to 24.7% load (BASIC), boost with 4 servers, cap to 24.7% (BOOST), and boost with incremental cache (INC). The load is measured as the number of queries that are served by each core from a window of 1000 queries. BASIC, BOOST, and INC are able to improve results by utilizing the idle resources; there is no need for additional computing power.

We can instruct the search cores about the maximum computing load allowed, and let them drop the queries they cannot serve. In this configuration, the broker still performs collection selection, and ranks the collections according to the expected relevance for the query at hand. The query is broadcast, but now every server is informed about the relevance it is expected to have for the given query. At this point, each core can choose to serve or drop the query, according to its instant load.

An implementation issue is whether or not to give the query broker the ability to determine if a query has been answered by a core. If the query is accepted, the core will answer with its local results. Otherwise, it can send a negative acknowledgment. Alternatively, the broker can use a time-out, so as to guarantee a chosen query response time.

The most promising core will receive a query tagged with top priority, equal to 1. The other cores, c, will receive a query, q, tagged with linearly decreasing priority pq,c (down to 1/N, with N cores). At time t, a core c, with current load lc,t, will serve the query q if:

lc,t < L × pq,c,


where L is a load threshold that represents the computing power available to the system. This is done to give preferred access to queries on the most promising cores. If two queries involve a core c, and the load in c is high, only the query for which c is very promising will be served by c; the other will be dropped. A query with priority 1/2 will be served only if the current load is below half the threshold.

This way, overloaded cores will not be hit by queries for which they represent only a second choice. If the condition is met, the core will compute its local results and will return them to the broker. Otherwise, the query is discarded by the core, which will return no results for it. In this model, the broker, instead of simply performing collection selection, performs a process of prioritization, choosing the priority that every query should get at every collection.
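A minimal sketch of this core-side decision follows; the class layout and the 0/1 window bookkeeping are our assumptions, while the acceptance test lc,t < L x pq,c comes from the text above.

from collections import deque

# Illustrative sketch: a core serves a query only when its instant load is
# below the threshold scaled by the query's priority.
class SearchCore:
    def __init__(self, load_threshold, W=1000):
        self.L = load_threshold           # e.g. 0.247 for a 24.7% cap
        self.window = deque(maxlen=W)     # 1 = served, 0 = dropped

    def offer(self, query, priority):
        load = sum(self.window) / self.window.maxlen   # lc,t
        accept = load < self.L * priority              # lc,t < L x pq,c
        self.window.append(1 if accept else 0)
        return self.search(query) if accept else None  # None = dropped

    def search(self, query):
        ...   # local top-k retrieval, outside the scope of this sketch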

In the experimental evaluation, we compare three query routing strategies; a sketch of how each assigns priorities is given after the list.

—Fixed <T>. The query is routed to the T most relevant servers, according to a collection selection function, with fixed T. This allows us to measure the computing power required to have at least T servers answer a query (and to have a guaranteed average result quality).

—Load-driven basic (LOAD) <L>. The system contacts all servers, with different priority. Priority ranges from 1 down to 1/N (on a system with N cores). The load threshold on cores is fixed to L.

—Load-driven boost (BOOST) <L, T>. This is the same as load-driven, but here we contact the first T servers with maximum priority, and then the others with linearly decreasing priority. By boosting, we are able to keep the underloaded servers closer to the load threshold. Boosting is valuable when the available load is higher, as it enables us to use the lower-loaded servers more intensively. If the threshold, L, is equal to the load reached by FIXED <T>, we know that we can poll T servers every time without problems. The lower-priority queries will be dropped when we get closer to the threshold.
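Under the descriptions above, the broker-side prioritization could be sketched as follows; the exact slope used by BOOST after the first T servers is our assumption, as is the use of priority 0 to mean "not polled" under FIXED.

# Sketch: `ranked` lists core ids from most to least promising for the query
# (the output of the selection function); T defaults to 4 as in Figure 6.
def priorities(ranked, strategy, T=4):
    N = len(ranked)
    ps = {}
    for rank, core in enumerate(ranked):
        if strategy == "FIXED":
            ps[core] = 1.0 if rank < T else 0.0        # only the top T are polled
        elif strategy == "BASIC":
            ps[core] = (N - rank) / N                  # 1 down to 1/N
        elif strategy == "BOOST":
            ps[core] = 1.0 if rank < T else (N - rank) / N
    return ps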

Using our load-driven strategy, we are able to keep all cores in the system busy by asking them to also answer the queries for which they are less authoritative (see Figure 6). As we will show in Section 6, this directly corresponds to an improvement in result quality, without the need for additional computing resources.

Also, this strategy can be used to fine-tune the behavior of the search engine in particular conditions.

—Some users or queries could be given higher priority in the system.

—Hard queries (very long and highly specific, expecting very few results) can also be broadcast with higher priority to increase users' satisfaction. For the same reason, queries that PCAP cannot route (composed of terms not present in the training set) can be broadcast with higher priority to guarantee some results.

—If the cores are replicated, as in a hierarchical, large-scale search engine architecture, the load to a replicated core can be controlled and reduced to tolerate the failure of some of the instances.


5. INCREMENTAL CACHING

In our system, the caching system has a complex interaction with the selection function and the load-driven routing. In fact, a traditional cache stores the results as they are merged by the broker. In our architecture, if some servers are not highly relevant to a query and/or they are heavily loaded, their results will not be available to the broker for caching. This means that an incomplete or degraded set of cached results will be returned at each hit for frequently repeated queries. To address this issue, in this section we discuss the concept of incremental caching, introduced in Puppin et al. [2007]. In case of a hit, the broker will try to poll more servers among the ones that were not available at the time of the previous request of a given query. Over time, the broker will try to poll more and more servers, with the result of storing non-degraded results: the full set of results from all the servers for the repeated queries.

While traditional caching systems are aimed exclusively at reducing the computing load of the system and at improving the overall system efficiency, incremental caching is also able to improve the result quality. In fact, every time there is a hit, more relevant results are added to the stored entry, and the user will get an answer of higher quality. Incremental caching, in other words, addresses both the computing load and the result quality.

To formalize our incremental caching system in detail, we need to redefine the type of entries stored in a cache line.

Definition 6. A query-result record is a quadruple of the form <q, p, r, s>, where: q is the query string, p is the number of the result page requested, r is the ordered list of results associated with q and p, and s is the set of servers from which results in r are returned. Each result is represented by a pair <doc id, score>.

If we cache an entry <q, p, r, s>, this means that only the servers in s answered, and that they returned r. Also, since new results might be added in an incremental cache, we need to record the score along with the document identifier in order to compare the new results with the old ones.

The complete algorithm is formally shown in Figure 7. Results in an incremental cache are continuously modified by adding results from the servers that have not yet been queried. The set s is used for this purpose and keeps track of the servers that have answered so far. While the resulting cache has a rather complex management logic, we believe that the benefits are so high that they offset any additional cost.

When our load-driven selection is used, only the results coming from the accepted servers are available for caching. In case of a subsequent hit, the selection function will give top priority to the servers that did not accept the query before. Let's say, for example, that for a query q, the cores are ranked in this order: s4, s5, s1, s3, and s2. s4 will have priority 1. Now, let's say that only s4 and s1 actually answered the query, due to their lower load. When q hits the system a second time, s5 will have the top priority, followed by s3 and s2. Their results will be added to the incremental cache.
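As a minimal sketch of the broker-side update on the records of Definition 6, assuming a dictionary-based cache keyed by (q, p) and ignoring eviction, a hit might be handled as follows:

# Illustrative sketch: merge newly arrived (doc_id, score) pairs into the
# cached entry, re-rank, and record which servers have answered so far.
def update_entry(cache, q, p, new_results, answering_servers, k=10):
    record = cache.get((q, p), {"results": [], "servers": set()})
    merged = {d: s for d, s in record["results"]}
    for doc_id, score in new_results:
        merged[doc_id] = max(score, merged.get(doc_id, float("-inf")))
    record["results"] = sorted(merged.items(), key=lambda x: -x[1])[:k]
    record["servers"] |= set(answering_servers)   # need not be polled again
    cache[(q, p)] = record
    return record

On the next hit, the broker can give top priority to exactly the cores outside record["servers"], as in the s4/s5 example above.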


Fig. 7. The incremental cache algorithm, as performed by the search engine broker. The case when the cache is not full is straightforward and not shown.

Table IV. This sample log from Altavista was remapped to a log with queries from TodoBR
by replacing the n-th most popular query with the corresponding n-th most popular query
from TodoBR.

Query from Altavista              Rank       Corresponding query from TodoBR
Brittney Rache                    2,113,337  assentos ovais
chris isaak lyrics                  661,810  tudo sobre sono e virgilia
tigger wallpaper                     11,783  monografias juridicas
april cornell com                 2,120,014  queimaduras
carding                               2,203  nike
ranma                                   564  cpf
practical solutions for lawyers     530,130  adesivos hello kitty
videos playstation                  723,663  capa para automoveis
symbolic toolbox matlab free        101,303  obras de casimiro de abreu

This strategy does not add computing pressure to the system with respect to boost (see Figure 6). The advantage comes from the fact that repeated queries also get higher priority for the low-relevance servers, at the cost of other, non-repeated, queries.

It is important to emphasize that our load-driven routing and incremental caching strategy works independently of the selection function, which is used as a black box. In this work we use the PCAP strategy, but these concepts can be successfully utilized with any other collection selection algorithm.

6. EXPERIMENTS

For our tests, we used the same infrastructure described in Section 3.3. To better assess our results, we added a query log from Altavista, collected during Spring 2001. We used a sample of this very large log, of about two million queries, spanning three days.


Table V. Competitive similarity at 5 (COMP5, %) of our different strategies on several cache
configurations, over four weeks. Tests on TodoBR logs, using the basic load model. We tested
caches of size 4,000, 15,000, and 32,000, of which 0% and 50% was devoted to static entries.
The combined strategy of load-driven (Boost) routing and incremental caching is greatly
beneficial with any configuration.

For each load level, the four columns are: Fixed (Fx), Boost (Bs), Boost+Inc. (B+I),
No Coll.+Inc. (NC).

Load level:          10%               15%               20%               25%
Cache SDC   Week  Fx Bs B+I NC      Fx Bs B+I NC      Fx Bs B+I NC      Fx Bs B+I NC
4k    0%    0     40 54 59  33      51 60 67  43      57 65 72  52      61 70 77  61
            1     40 53 61  36      50 60 68  46      55 64 74  55      60 70 78  64
            2     40 55 61  38      50 61 68  48      55 66 74  57      60 71 78  65
            3     38 51 56  31      47 59 64  41      53 64 70  51      59 69 76  60
4k    50%   0     46 59 63  42      55 65 70  52      61 70 75  60      65 74 80  68
            1     47 59 65  45      55 65 71  54      60 69 76  62      64 73 81  70
            2     48 60 64  46      56 66 71  55      62 71 76  63      66 75 80  71
            3     42 56 60  39      50 62 67  49      56 67 73  58      61 71 78  66
15k   0%    0     40 54 61  37      51 61 69  48      57 66 74  57      61 71 79  66
            1     40 53 56  39      50 60 70  50      55 65 75  59      60 69 80  68
            2     40 56 63  41      50 63 70  51      55 68 76  61      60 73 80  69
            3     38 51 58  34      47 60 66  45      53 65 72  55      59 69 77  64
15k   50%   0     49 62 66  48      57 67 73  57      62 72 77  65      66 75 82  72
            1     49 61 67  49      57 67 73  58      62 71 78  67      65 74 82  75
            2     50 63 67  50      58 68 73  59      63 73 78  67      67 76 82  75
            3     44 58 63  44      52 64 70  54      58 69 75  63      63 73 80  71
32k   0%    0     40 54 62  37      51 61 69  49      57 66 75  58      61 71 79  66
            1     40 53 63  40      50 60 70  51      55 65 76  60      61 70 80  69
            2     40 57 63  42      50 63 71  52      55 68 76  62      60 73 81  71
            3     38 52 59  36      47 60 67  47      53 65 73  56      59 70 78  66
32k   50%   0     51 64 68  51      59 69 74  60      63 73 79  68      67 77 83  76
            1     51 62 69  53      58 68 75  61      63 72 79  70      66 75 83  77
            2     52 64 69  53      59 69 75  62      64 74 80  70      68 77 83  78
            3     46 60 66  48      53 66 71  57      59 70 77  66      64 74 81  74

The queries from the Altavista query log are more challenging, for their volume and variety, than those in the TodoBR log, but they represent the interests of a different public and are expressed mainly in a different language (English). In order to use the log with our document collection, mostly in Portuguese, we followed the approach proposed in Webber et al. [2006]. We sorted the queries of the Altavista and the TodoBR logs by frequency, and then we created a new query stream by substituting the n-th most popular query in Altavista with the n-th most popular query from TodoBR, for every n (see Table IV).

This way, in the new log, the top query will be the most popular query from TodoBR, but with the distribution in time that the first Altavista query had in the original log. The resulting log has about 658,000 unique queries. As a comparison, this number is reached by TodoBR in about a month. This allowed us to use a query log with the typical timing and distribution of a very large search engine, but with queries consistent with our collection.
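A sketch of this rank-based substitution follows; the wrap-around for ranks beyond the target log's vocabulary is our own assumption, as the source does not specify that case.

from collections import Counter

# Illustrative sketch: replace the n-th most popular source query with the
# n-th most popular target query, preserving the source log's timing.
def remap_log(altavista_log, todobr_log):
    rank_av = {q: n for n, (q, _) in
               enumerate(Counter(altavista_log).most_common())}
    todobr_by_rank = [q for q, _ in Counter(todobr_log).most_common()]
    return [todobr_by_rank[rank_av[q] % len(todobr_by_rank)]
            for q in altavista_log]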


[Figure 8: two panels plotting COMP5 and COMP20 (y-axis: up to 100%) against the number of polled servers/load level (1-16, plus OVR), for FIXED*, NO SELECTION, NO SELECTION + INC, BOOST, and BOOST + INC.]

Fig. 8. Tests on TodoBR log. Load model with real timing information.

We assume that a cache is available at the broker to filter repeated queries. In the case of our incremental cache, the broker can poll more servers in order to improve the quality of the cached results. We measure the result quality in terms of competitive similarity.

For our first test, based on the log from TodoBR, we modeled the computing load of each server simply as the number of queries that the server accepts from the window of the last 1000 queries (basic load model). In Table V, we show the validity of our system across different configurations. We varied the load threshold (10%, 15%, 20%, 25%), the routing strategy (fixed, load-driven routing (boost), with incremental caching, and incremental caching with no collection selection), and the test week (four different weeks). The different sizes chosen for the cache (4k, 15k, 32k) correspond to different implementation costs, from very modest and conservative (4k) to a much more optimistic configuration (32k), clearly scaled to the limited size of our experiments.


[Figure 9: two panels plotting COMP5 and COMP20 (y-axis: 0-100%) against the number of servers/load level (1-16, plus OVR), for FIXED*, NO SELECTION, NO SELECTION + INC, BOOST, and BOOST + INC.]

Fig. 9. Tests on Altavista log. Load model with real timing information.


In addition, we tested a different caching strategy, SDC [Fagni et al. 2006]. SDC devotes part of the cache to a static set of precomputed results. This strategy has proven to be effective in reducing the cost of managing the cache, and in improving the access rate to cache entries, while keeping a very high hit ratio. In our test, we devoted 50% of the cache to static entries. These are chosen as the most popular queries from the training set.
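A minimal sketch of such a split cache follows; the interface and the use of an OrderedDict for the LRU section are our own choices, while the static/dynamic split and the 50% ratio come from the text.

from collections import OrderedDict

# Illustrative SDC-style cache: a read-only static section seeded with the
# most popular training queries, plus an LRU dynamic section for the rest.
class SDCCache:
    def __init__(self, size, static_entries):       # e.g. size=4000, 50% static
        self.static = dict(static_entries[: size // 2])
        self.dynamic = OrderedDict()
        self.dyn_capacity = size - len(self.static)

    def get(self, q):
        if q in self.static:
            return self.static[q]
        if q in self.dynamic:
            self.dynamic.move_to_end(q)              # LRU bookkeeping
            return self.dynamic[q]
        return None

    def put(self, q, results):
        if q in self.static:
            return
        self.dynamic[q] = results
        self.dynamic.move_to_end(q)
        if len(self.dynamic) > self.dyn_capacity:
            self.dynamic.popitem(last=False)         # evict least recently used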

Across all configurations, load-driven routing, combined with incremental caching, clearly surpasses the other strategies. We would like to highlight that in this test the change from fixed to load-driven routing and incremental caching does not add computing pressure. We can improve the results with the same computing requirements in the search cores, at the cost of a little higher cache complexity.


Table VI. Average number of servers polled per query with different strategies, for different
load levels. FIXED* polls a fixed number of servers, but queries can be dropped by overloaded
servers. Even if Boost and Boost + incremental caching are utilizing, on average, a smaller
number of servers than broadcast (no collection selection), the selection is more focused
and gives better results (see Figure 9).

Load level:      1     2     3     4     5     6     7     8
FIXED*          0.86  1.77  2.75  3.64  4.58  5.56  6.50  7.41
NO SEL.         5.72  7.57  9.17 10.43 11.43 12.08 12.59 13.04
NO SEL. + INC   8.72 10.35 11.57 12.65 13.31 13.90 14.26 14.61
BOOST           2.04  2.88  3.78  4.68  5.57  6.47  7.30  8.23
BOOST + INC     5.32  6.52  7.69  8.78  9.70 10.57 11.34 12.05

Load level:      9    10    11    12    13    14    15    16    16+OVR
FIXED*          8.33  9.26 10.23 11.15 12.06 13.04 14.01 15.08  16.34
NO SEL.        13.45 13.86 14.16 14.57 14.86 15.19 15.40 15.63  16.34
NO SEL. + INC  14.88 15.14 15.37 15.58 15.77 15.92 16.06 16.17  16.62
BOOST           9.14  9.98 10.91 11.93 12.85 13.83 14.71 15.50  16.34
BOOST + INC    12.65 13.17 13.67 14.19 14.68 15.16 15.64 15.98  16.62


Our strategy can easily return about 2/3 of the results that would be available from the full index, with a computing load of only 10%; that is, a server answers no more than 100 queries out of every 1000.

In the second test, we actually partitioned and indexed the documents on different servers, and for every query we measured the timing for answering it, on each collection (load model with real timing information). In this model, the computing load lc,t is estimated as the sum of the timings of the queries accepted by a core, within a window of 1000 queries. When the core receives a query tagged with a priority, it verifies if lc,t < L × pq,c.
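A sketch of this timing-based load estimate, under the same illustrative window structure used earlier (the class and method names are ours):

from collections import deque

# Illustrative sketch: the load is the sum of the measured answer times of
# the queries a core accepted, over the last 1000 queries it was offered.
class TimedLoad:
    def __init__(self, W=1000):
        self.window = deque(maxlen=W)   # per-query cost; 0.0 for dropped queries

    def observe(self, accepted, answer_time):
        self.window.append(answer_time if accepted else 0.0)

    def current(self):
        return sum(self.window)          # lc,t, compared against L x pq,c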

In Figure 8, we show the competitive similarity of different strategies, under this model, for the tests performed with the TodoBR log. Tests performed with the Altavista log are available in Figure 9. We tested different routing strategies, with different load thresholds. The load levels are chosen as follows: the first load level is the average load of a system polling only the most promising server per query; the second level is the average load when the two most promising cores are used, and so on.

Here, we use a variant of the FIXED strategy (FIXED*) where cores can drop the queries if they surpass their load threshold, as in the load-driven strategies. This is due to the fact that, in this extended model, which uses real computing timing, we can have unexpected peaks, for example due to disk fragmentation, intercommunication errors, memory swapping, and so on. In a commercial search engine, these problems are usually mitigated by a very careful design and tuning of the system.

While the configurations without collection selection poll more servers on average (see Table VI), the servers they poll are less relevant for the queries, and the final results are better for the approaches based on collection selection, which poll fewer, but more relevant, servers.


With a peak load set to the average load of FIXED <1>, the average load needed to always poll only the most promising server for every query, the strategy based on load-driven routing and incremental caching surpasses a competitive similarity of 65% (with almost 80% COMP20).

7. CONCLUSIONS AND FUTURE WORK

In this work, we presented a distributed architecture for a Web search engine, based on the concept of collection selection. Our work exploits a novel approach to partitioning documents in a distributed architecture, which enables it to greatly improve the effectiveness of standard collection selection techniques (CORI), and a new selection function outperforming the state of the art. Our technique is based on the analysis of query logs, and on coclustering queries and documents at the same time.

As a side effect, our partitioning strategy is able to identify documents that can be safely moved out of the main index into a supplemental index, with a minimal loss in the accuracy of the results.

By suitably partitioning the documents in the collection, our system is able to determine which servers hold the most relevant documents for each query. Our load-driven routing consists of assigning a priority to each query for every core. The computing cores will answer on the basis of the queries' priorities and of their instantaneous load, this way reducing the computing pressure for low-priority queries on overloaded servers.

Our combined approach to partitioning and selection is very effective. In a setting using 16 main search cores, plus one core for the supplemental index, we can retrieve more than 1/3 of the most relevant results that a full index would return, by querying only the first server returned by our selection function. Note that one core holds about 1/16 of the non-silent documents (which are about 50% of the collection). One-third of the relevant results can thus be retrieved by accessing only 1/32 of the whole collection.

We also presented a novel strategy to use the instant load at each server for driving query routing: the less-loaded servers can be polled, even when they are not the most relevant for a given query. Also, we described a new approach to caching, able to incrementally improve the quality of the stored results. By combining these two techniques, we can achieve extremely high figures of precision, with a reduced load with respect to full query broadcasting. We can reach a competitive similarity of about 2/3 with a computing load of 10%; that is, a server answers no more than 100 queries out of every 1000. The system, with a slightly higher load (25%), can reach a whopping 80% competitive similarity with respect to a centralized global index.

All these results were measured with exhaustive experiments conducted on a base of 6 million documents, with 190,000 queries for training and two test logs of 800,000 and 2 million queries. We verified that the results are still valid in an extended model that uses the real timing of queries on an actual search engine running on each document partition. We also believe that the system can


scale to larger collections, due to the availability of a parallel implementation of the coclustering algorithm [Papadimitriou and Sun 2008], which is the most computationally intensive step of our training phase.

The proposed architecture presents a trade-off between computing cost and result quality, and we showed how to guarantee very precise results in the face of a dramatic reduction of computing load. This means that with the same computing infrastructure, our system can serve more users, more queries, and more documents. Also, it can be used to dynamically adjust to unexpected peaks or to accommodate reduced computing power due to failures.

Our approach to partitioning and selection is very general, as it makes no assumptions about the underlying search strategies: it only suggests a way to partition documents so that the subsequent selection process is very effective. We used this approach for implementing a distributed Web search engine, but it could be used with different types of data, or with different architectures. In fact, our strategy could be used, for example, in an image retrieval system, as long as similar queries return similar images, so that the coclustering algorithm can safely identify the clusters. Or, it could be used in a peer-to-peer search network. Even more interesting, there is no need for a centralized reference search engine. The training can be done over the P2P network itself, and later documents could be reassigned (or, better, duplicated).

Another interesting aspect of this work is the analysis of the caching system. We tested several algorithms, in combination with our incremental strategy. More aggressive solutions could be tested. The incremental cache could actively try to complete partial results by polling idle servers. This can be particularly valuable for queries that get tenured in the cache. In Web search engines, it is really hard to determine which queries should be cached, which should not, and which should be moved to the static section and never be removed (maybe only updated if new results become available).

We did not investigate the interaction of our architecture with some advanced features of modern engines, such as query suggestion, query expansion, universal querying (the ability to return results of different kinds, such as images, videos, and books [Google 2007]), and so on. We believe that our system has great potential for query expansion and suggestion, as it is already able to identify similar queries returning similar documents.

We strongly believe that the proposed architecture has the potential to greatly reduce the computing cost of solving queries in modern IR systems, contributing to saving energy, money, time, and the effort of computer administrators.

REFERENCES

BADUE, C. S., BAEZA-YATES, R., RIBEIRO-NETO, B., ZIVIANI, A., AND ZIVIANI, N. 2007. Analyzing imbalance among homogeneous index servers in a Web search system. Inform. Process. Manage. 43, 3, 592–608.

BAEZA-YATES, R., CASTILLO, C., JUNQUEIRA, F., PLACHOURAS, V., AND SILVESTRI, F. 2007a. Challenges in distributed information retrieval (invited paper). In Proceedings of the International Conference on Data Engineering (ICDE). IEEE CS Press.

BAEZA-YATES, R., GIONIS, A., JUNQUEIRA, F., MURDOCK, V., PLACHOURAS, V., AND SILVESTRI, F. 2007b. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, 183–190.

BAEZA-YATES, R., GIONIS, A., JUNQUEIRA, F. P., MURDOCK, V., PLACHOURAS, V., AND SILVESTRI, F. 2008. Design trade-offs for search engine caching. ACM Trans. Web 2, 4, 1–28.

BARROSO, L., DEAN, J., AND HÖLZLE, U. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2.

BEITZEL, S. M., JENSEN, E. C., CHOWDHURY, A., GROSSMAN, D., AND FRIEDER, O. 2004. Hourly analysis of a very large topically categorized Web query log. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, 321–328.

BOLDI, P. AND VIGNA, S. 2004. The WebGraph framework I: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW). ACM Press, New York, NY, 595–602.

BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web (WWW). Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 107–117.

BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the Web. In Selected Papers from the Sixth International Conference on World Wide Web. Elsevier Science Publishers Ltd., Amsterdam, The Netherlands, 1157–1166.

CALLAN, J., LU, Z., AND CROFT, W. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. E. A. Fox, P. Ingwersen, and R. Fidel, Eds. ACM Press, 21–28.

CHIERICHETTI, F., PANCONESI, A., RAGHAVAN, P., SOZIO, M., TIBERI, A., AND UPFAL, E. 2007. Finding near neighbors through cluster pruning. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS). ACM, New York, NY, 103–112.

CHOWDHURY, A., FRIEDER, O., GROSSMAN, D., AND MCCABE, M. C. 2002. Collection statistics for fast duplicate document detection. ACM Trans. Inform. Syst. 20, 2, 171–191.

DEAN, J. AND GHEMAWAT, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, Berkeley, CA, 10–10.

DHILLON, I. S., MALLELA, S., AND MODHA, D. S. 2003. Information-theoretic co-clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 89–98.

FAGNI, T., PEREGO, R., SILVESTRI, F., AND ORLANDO, S. 2006. Boosting the performance of Web search engines: Caching and prefetching query results by exploiting historical usage data. ACM Trans. Inform. Syst. 24, 1, 51–78.

FRIEDER, O. AND SIEGELMANN, H. T. 1991. On the allocation of documents in multiprocessor information retrieval systems. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, New York, NY, 230–239.

GOOGLE. 2007. Google begins move to universal search. http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html.

GRAVANO, L. AND GARCIA-MOLINA, H. 1995. Generalizing GlOSS to vector-space databases and broker hierarchies. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, CA, 78–89.

GRAVANO, L., GARCIA-MOLINA, H., AND TOMASIC, A. 1994. Precision and recall of GlOSS estimators for database discovery. Tech. note STAN-CS-TN-94-10, Stanford University.

HOAD, T. C. AND ZOBEL, J. 2003. Methods for identifying versioned and plagiarized documents. J. Amer. Soc. Inform. Sci. Tech. 54, 3, 203–215.

JAIN, A. AND DUBES, R. 1988. Algorithms for Clustering Data. Prentice Hall.

JANSEN, B. AND SPINK, A. 2006. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Inform. Process. Manage. 42, 248–263.

JANSEN, B. J., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life information retrieval: A study of user queries on the Web. SIGIR Forum 32, 1, 5–17.


KAREDLA, R., LOVE, J. S., AND WHERRY, B. G. 1994. Caching strategies to improve disk system performance. Computer 27, 3, 38–46.

LARKEY, L. S., CONNELL, M. E., AND CALLAN, J. 2000. Collection selection and results merging with topically organized U.S. patents and TREC data. In Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM). ACM Press, New York, NY, 282–289.

LEMPEL, R. AND MORAN, S. 2003. Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th International Conference on World Wide Web (WWW). ACM, New York, NY, 19–28.

LIU, X. AND CROFT, W. B. 2004. Cluster-based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM Press, New York, NY, 186–193.

MARKATOS, E. P. 2001. On caching search engine query results. Comput. Comm. 24, 2, 137–143.

MOFFAT, A. AND ZOBEL, J. 1994. Information retrieval systems for large document collections. In Proceedings of the Text REtrieval Conference. 85–94.

PAPADIMITRIOU, S. AND SUN, J. 2008. DisCo: Distributed co-clustering with Map-Reduce. In Proceedings of the IEEE International Conference on Data Mining (ICDM).

PEW INTERNET AND AMERICAN LIFE PROJECT. 2005. Search engine use shoots up in the past year and edges towards email as the primary internet application. http://www.pewinternet.org/pdfs/PIP_SearchData_1105.pdf.

POBLETE, B. AND BAEZA-YATES, R. 2008. Query-sets: Using implicit feedback and query patterns to organize Web documents. In Proceedings of the 17th International Conference on World Wide Web (WWW). ACM, New York, NY, 41–50.

PUPPIN, D. 2008. Collection selection... now, with more documents! In Proceedings of the 2nd International Conference on Scalable Information Systems (InfoScale). ICST, Brussels, Belgium.

PUPPIN, D. AND SILVESTRI, F. 2006. The query-vector document model. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM). ACM, New York, NY, 880–881.

PUPPIN, D., SILVESTRI, F., AND LAFORENZA, D. 2006. Query-driven document partitioning and collection selection (invited paper). In Proceedings of the 1st International Conference on Scalable Information Systems (InfoScale). ACM, New York, NY, 34.

PUPPIN, D., SILVESTRI, F., PEREGO, R., AND BAEZA-YATES, R. 2007. Load-balancing and caching for collection selection architectures. In Proceedings of the 2nd International Conference on Scalable Information Systems (InfoScale). ICST, Brussels, Belgium, 1–10.

RAGHAVAN, V. V. AND SEVER, H. 1995. On the reuse of past optimal queries. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, New York, NY, 344–350.

RANDALL, K. H., STATA, R., WIENER, J. L., AND WICKREMESINGHE, R. G. 2002. The Link Database: Fast access to graphs of the Web. In Proceedings of the Data Compression Conference (DCC). IEEE Computer Society, Los Alamitos, CA, 122.

ROBERTSON, S. E. AND WALKER, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Springer-Verlag, Berlin, Germany, 232–241.

SILVERSTEIN, C., HENZINGER, M., MARAIS, H., AND MORICZ, M. 1999. Analysis of a very large Web search engine query log. In ACM SIGIR Forum. 6–12.

SILVESTRI, F. 2007. Sorting out the document identifier assignment problem. In Proceedings of the European Conference on IR Research (ECIR). G. Amati, C. Carpineto, and G. Romano, Eds. Lecture Notes in Computer Science, vol. 4425. Springer, 101–112.

TOMASIC, A., GRAVANO, L., LUE, C., SCHWARZ, P., AND HAAS, L. 1997. Data structures for efficient broker implementation. ACM Trans. Inform. Syst. 15, 3, 223–253.

VAN RIJSBERGEN, C. 1979. Information Retrieval. Butterworths.

WEBBER, W., MOFFAT, A., ZOBEL, J., AND BAEZA-YATES, R. 2006. A pipelined architecture for distributed text query evaluation. Inform. Retrieval 10, 3.


XIE, Y. AND O’HALLARON, D. R. 2002. Locality in search engine queries and its implications for caching. In Proceedings of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM).

XU, J. AND CALLAN, J. 1998. Effective retrieval with distributed collections. In Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR).

XU, J. AND CROFT, W. B. 1999. Cluster-based language models for distributed retrieval. In Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR). 254–261.

YUWONO, B. AND LEE, D. L. 1997. Server ranking for distributed text retrieval systems on the Internet. In Proceedings of the 5th International Conference on Database Systems for Advanced Applications (DASFAA). World Scientific Press, 41–50.

Received February 2008; revised December 2008; accepted March 2009
