Improving the Efficiency of Multi-site Web Search Engines
Xiao Bai (xbai@yahoo-inc.com), Yahoo Labs. Joint work with Guillem Francès Medina, B. Barla Cambazoglu and Ricardo Baeza-Yates
July 15, 2014
Web search is difficult
› Size of the Web • 130+ billion pages • Constantly changing
› Cost of data centers • Hardware investment • Energy consumption
› Diversity of users • Different information needs • Little patience
Multi-site Web search engine
› Fully replicated index • Easy to implement • Vertical scalability
› Partially replicated index • Faster response • Horizontal scalability
Challenges in multi-site web search
› Distributed Web crawling • Which site is the best to crawl a page?
› Index partitioning • Which site is the best to index a page?
[Figure: multi-site search engine over the Web. Each of Site 1, Site 2, ..., Site N runs its own crawler, indexer, and query processor.]
› Query forwarding • Which sites contain the best-matching pages?
› Index replication • Which pages to replicate in each site?
› Distributed result caching • Which site is the best to cache a query result?
› Goals • Improve query locality • Reduce query response time
System architecture
› Geographically distributed search sites
› Document-based index partition • Easy to build, good load balancing, better fault tolerance • One document is indexed in only one site • Partitioning criteria: document language, domain, server IP, etc.
Query forwarding
[Figure: query Q: w1, top-2, issued to one site and forwarded to others. Four posting lists for term w1 are shown: (d1, 0.87) (d2, 0.59) (d3, 0.32); (d4, 0.25) (d5, 0.18); (d7, 0.70) (d8, 0.23) (d9, 0.07); (d10, 0.11).]
› Accurate query forwarding is important • False negative forwarding: decreases result quality • False positive forwarding: increases query response time and system workload
› Threshold-based algorithms • Thresholds periodically exchanged or learnt from previous query processing
› Improve result quality by serving documents indexed in remote sites
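A minimal sketch of a threshold-based forwarding decision, assuming each remote site has shared the maximum score it can produce for the query (here, for term w1). The function name `sites_to_forward`, the site labels, and the assignment of score lists to sites are hypothetical and only illustrate the idea; the scores reuse the values from the figure.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (document id, score)


def sites_to_forward(local_results: List[Posting],
                     remote_max_score: Dict[str, float],
                     k: int) -> List[str]:
    """Return remote sites whose score threshold beats the local k-th score."""
    ranked = sorted(local_results, key=lambda p: p[1], reverse=True)
    # With fewer than k local results, any remote match can enter the top-k.
    kth_score = ranked[k - 1][1] if len(ranked) >= k else 0.0
    return [site for site, bound in sorted(remote_max_score.items())
            if bound > kth_score]


# Hypothetical setup: the local site holds d1..d3; thresholds are the best
# w1 score at each remote site (assumed to be exchanged periodically).
local = [("d1", 0.87), ("d2", 0.59), ("d3", 0.32)]
thresholds = {"site2": 0.25, "site3": 0.70, "site4": 0.11}
print(sites_to_forward(local, thresholds, k=2))  # ['site3']
```

Only site3 can contribute a document (score 0.70) that beats the local second-best score (0.59), so forwarding to the other sites would be a false positive in this toy case.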
Machine-learned query forwarder
§ m×m binary classifiers, one for each pair of sites
§ A single classifier
› Pre-retrieval and post-retrieval classifiers: 100 decision trees
› Pre-retrieval confidence threshold C
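[Figure: ML query forwarder for query Q: w1, top-2. The pre-retrieval classifier takes the pre-retrieval features and outputs a forwarding decision F with confidence c. If c > C (Yes), F is used; otherwise (No), the post-retrieval classifier decides using the local query score returned by the local query processor.]
› Pre-retrieval features • Term lengths • Term IDFs • Term scores • Query language • Query popularity • Query performance
› Post-retrieval features • Local query score • LP forwarder decision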
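The two-stage decision flow can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the classifiers are passed in as plain callables, and `toy_pre_classifier`, `toy_post_classifier`, the feature name `term_idf`, and the numeric cut-offs are all hypothetical stand-ins for the trained decision-tree models.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


def decide_forwarding(pre_features: Features,
                      pre_classifier: Callable[[Features], Tuple[bool, float]],
                      post_classifier: Callable[[Features], bool],
                      run_local_query: Callable[[], float],
                      confidence_threshold: float) -> Tuple[bool, str]:
    """Return (forward?, stage that made the decision)."""
    forward, confidence = pre_classifier(pre_features)
    if confidence > confidence_threshold:
        # Confident enough: decide before local processing finishes.
        return forward, "pre-retrieval"
    # Otherwise wait for the local query score and ask the second classifier.
    local_score = run_local_query()
    post_features = dict(pre_features, local_query_score=local_score)
    return post_classifier(post_features), "post-retrieval"


def toy_pre_classifier(features: Features) -> Tuple[bool, float]:
    # Hypothetical rule: rare terms (high IDF) suggest forwarding.
    idf = features["term_idf"]
    return idf > 5.0, abs(idf - 5.0) / 5.0


def toy_post_classifier(features: Features) -> bool:
    # Hypothetical rule: forward only when the local score is weak.
    return features["local_query_score"] < 0.5


confident = decide_forwarding({"term_idf": 9.0}, toy_pre_classifier,
                              toy_post_classifier, lambda: 0.87, 0.7)
unsure = decide_forwarding({"term_idf": 5.5}, toy_pre_classifier,
                           toy_post_classifier, lambda: 0.87, 0.7)
print(confident, unsure)
```

Raising the confidence threshold C sends more queries through the post-retrieval stage, which has more information but can only decide after local processing.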
Performance of machine-learned query forwarder
› Accuracy • 1 − FN − FP (FN: false negative forwarding rate, FP: false positive forwarding rate)
› Query locality • Fraction of queries without forwarding
[Figure: accuracy and query locality plots. Legends: Machine-learned; Baseline: Linear programming; Oracle.]
› 5 sites: 200M web pages + 5M training queries + 2M test queries
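The two metrics on this slide can be computed along these lines. The sketch assumes FN and FP are measured as fractions over all (query, remote site) forwarding decisions, which the slides do not spell out; `forwarding_metrics` and the toy data are hypothetical.

```python
from typing import Dict, Set


def forwarding_metrics(predicted: Dict[str, Set[str]],
                       needed: Dict[str, Set[str]],
                       remote_sites: Set[str]) -> Dict[str, float]:
    """Accuracy (1 - FN - FP) and query locality of a forwarder."""
    if not needed or not remote_sites:
        raise ValueError("need at least one query and one remote site")
    decisions = fn = fp = 0
    for query, truth in needed.items():
        guess = predicted.get(query, set())
        for site in remote_sites:
            decisions += 1
            if site in truth and site not in guess:
                fn += 1  # a site holding top results was skipped
            elif site in guess and site not in truth:
                fp += 1  # a useless forward
    # Query locality: fraction of queries answered without any forwarding.
    local = sum(1 for q in needed if not predicted.get(q, set()))
    return {"FN": fn / decisions,
            "FP": fp / decisions,
            "accuracy": 1 - fn / decisions - fp / decisions,
            "locality": local / len(needed)}


needed = {"q1": {"s2"}, "q2": set(), "q3": {"s3"}, "q4": set()}
predicted = {"q1": {"s2"}, "q2": {"s3"}, "q3": set(), "q4": set()}
metrics = forwarding_metrics(predicted, needed, {"s2", "s3"})
print(metrics)
```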
Document replication
› Objective • Under a given budget, select the subset of documents to replicate that maximizes the fraction of queries whose top-k best-matching documents are all indexed in the local site
› Replication strategies
• Identical replication – Global budget
• Individual replication – Global budget
• Individual replication – Local budget
[Figure: the three strategies (Identical; Individual + Global budget; Individual + Local budget), annotated with R(q) and the budget labels b%, b%/4, and b% of S_i.]
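The replication objective can be evaluated with a few lines of code. This sketch assumes "fraction of queries" is weighted by query frequency; `query_locality` and the toy query log are hypothetical.

```python
from typing import Dict, Set


def query_locality(query_freq: Dict[str, int],
                   top_k_docs: Dict[str, Set[str]],
                   local_docs: Set[str],
                   replicated_docs: Set[str]) -> float:
    """Fraction of query volume whose top-k documents are all available locally."""
    available = local_docs | replicated_docs
    total = sum(query_freq.values())
    # A query is local only if every one of its top-k documents is available.
    served = sum(freq for q, freq in query_freq.items()
                 if top_k_docs[q] <= available)
    return served / total


query_freq = {"q1": 3, "q2": 1, "q3": 1}
top_k_docs = {"q1": {"d1", "d7"}, "q2": {"d1", "d2"}, "q3": {"d7", "d4"}}
local_docs = {"d1", "d2", "d3"}

before = query_locality(query_freq, top_k_docs, local_docs, set())
after = query_locality(query_freq, top_k_docs, local_docs, {"d7"})
print(before, after)  # 0.2 0.8
```

Replicating the single remote document d7 makes the popular query q1 local, which is the effect a replication strategy tries to buy with its budget.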
Document selection heuristics
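› 0-1 knapsack problem • Utility of document d ∈ D \ D_i for site S_i

$$u_i(d) = \sum_{q_j \in Q_i} \begin{cases} \dfrac{\mathrm{freq}(q_j)}{R_i(q_j) \times s(d)}, & \text{if } d \in R(q_j) \\ 0, & \text{otherwise} \end{cases}$$

› Application in different document replication strategies
• Utility: $u(d) = \sum_{1 \le i \le m} u_i(d)$; $u_i(d)$ for site $S_i$; $u_i(d)$
• Budget: $bm\% \times \mathrm{size}(D)$; $b\% \times \mathrm{size}(D)$; $b\% \times \mathrm{size}(D_i)$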
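A common heuristic for 0-1 knapsack selection is a greedy pass over the candidates in decreasing utility order. The sketch below assumes the utilities are already computed per document and simply ranks by them; `select_replicas`, the ranking key, and the toy numbers are hypothetical, and the slides do not state which heuristic the authors actually use.

```python
from typing import Dict, List


def select_replicas(utility: Dict[str, float],
                    size: Dict[str, int],
                    budget: int) -> List[str]:
    """Greedily pick documents by utility until the size budget is used up."""
    chosen: List[str] = []
    remaining = budget
    # Highest utility first; ties broken by document id for determinism.
    for doc in sorted(utility, key=lambda d: (-utility[d], d)):
        if utility[doc] <= 0:
            continue  # never useful for this site's queries
        if size[doc] <= remaining:
            chosen.append(doc)
            remaining -= size[doc]
    return chosen


# Hypothetical candidates: remote documents with utility and size in KB.
utility = {"d7": 0.9, "d4": 0.5, "d8": 0.4, "d10": 0.0}
size = {"d7": 40, "d4": 70, "d8": 30, "d10": 10}
print(select_replicas(utility, size, budget=80))  # ['d7', 'd8']
```

The greedy pass skips d4 because it no longer fits after d7 is chosen, then fills the leftover space with d8. It is fast but not guaranteed to be optimal.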
Performance of document replication
› Impact on query forwarding • Query locality
› Comparison of different strategies • Query locality
[Figure: query locality plots. Legends: Individual+Local; Identical; Individual+Global.]
Result caching
› Improve query locality by caching previously processed query results
› Basic assumptions
• Cache with “unlimited” size • TTL-based invalidation
› Cache strategies: where to cache a query?
• Local cache (state of the art) – Mechanism
§ A query is cached in the site it is issued to
§ Local TTL
– Pros § Easy to implement
– Cons § Redundant processing for popular queries
• Global cache – Mechanism
§ A query is cached in all sites
§ TTL w.r.t. the local site receiving the query
– Pros § Highest cache hit rate
– Cons § Redundant transmission among sites
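The basic assumptions above (unbounded cache, TTL-based invalidation) correspond to a very small data structure. This is a minimal sketch; the class name `TTLCache` is hypothetical, and the clock is passed in explicitly so that the behavior is easy to follow.

```python
from typing import Dict, List, Optional, Tuple


class TTLCache:
    """Unbounded result cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def put(self, query: str, results: List[str], now: float) -> None:
        # Store the results together with their expiry time.
        self._entries[query] = (results, now + self.ttl)

    def get(self, query: str, now: float) -> Optional[List[str]]:
        entry = self._entries.get(query)
        if entry is None:
            return None  # miss: never cached
        results, expires_at = entry
        if now >= expires_at:
            del self._entries[query]  # miss: entry is stale
            return None
        return results


cache = TTLCache(ttl_seconds=2 * 3600)  # a 2-hour TTL
cache.put("w1", ["d1", "d7"], now=0.0)
hit = cache.get("w1", now=3600.0)
stale = cache.get("w1", now=7200.0)
print(hit, stale)  # ['d1', 'd7'] None
```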
Result caching strategies
› Cache strategies: where to cache a query?
• Partial cache – Mechanism
§ A query is cached in the site it is issued to and the sites it is forwarded to
§ TTL w.r.t. the local site receiving the query
– Pros
§ Reduce redundant transmission among sites
– Cons § Reduce cache hit rate
• Forward cache – Mechanism
§ A query is cached in the site it is issued to
§ A pointer to the query is cached in the site receiving the forwarded query
§ TTL w.r.t. the local site receiving the query
– Pros
§ Further reduce redundant transmission
– Cons § Increase query response time
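The four strategies differ only in which sites store the result of a query. The sketch below restates the mechanisms from the slides as a placement function; `cache_placement`, the string labels, and the site names are hypothetical.

```python
from typing import Dict, List


def cache_placement(strategy: str,
                    local_site: str,
                    forwarded_sites: List[str],
                    all_sites: List[str]) -> Dict[str, str]:
    """Map each site to what it caches for a query: 'result' or 'pointer'."""
    if strategy == "local":
        # Cached only in the site the query is issued to.
        return {local_site: "result"}
    if strategy == "global":
        # Cached in all sites.
        return {site: "result" for site in all_sites}
    if strategy == "partial":
        # Cached in the issuing site and the sites it is forwarded to.
        return {site: "result" for site in [local_site, *forwarded_sites]}
    if strategy == "forward":
        # Result stays at the issuing site; forwarded sites keep a pointer.
        placement = {site: "pointer" for site in forwarded_sites}
        placement[local_site] = "result"
        return placement
    raise ValueError(f"unknown strategy: {strategy}")


sites = ["s1", "s2", "s3", "s4"]
for name in ("local", "global", "partial", "forward"):
    print(name, cache_placement(name, "s1", ["s3"], sites))
```

Under the forward strategy, a later hit at s3 must follow the pointer back to s1, which matches the listed drawback of increased query response time.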
Performance of result caching
› Comparison of different strategies • Cache hit rate
› Impact of global cache • Query locality
User experience
› Result quality • Centralized top-10 as ground-truth
– Overlap@p (O@p)
– NDCG@p (N@p)
– ExactMatchRate@p (E@p)
• Caching has no impact on result quality
› Query response time • Estimation
– Processing time: size of index
– Transmission time: geographical distance
• Setting – Identical replication: 8%
– Global cache: TTL=2h
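Two of the result-quality metrics listed above can be sketched against a centralized ground-truth ranking. The slides do not define them, so the definitions here are assumptions: Overlap@p is taken as the share of the centralized top-p that also appears in the multi-site top-p, and ExactMatchRate@p as the fraction of queries whose top-p lists are identical in content and order. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def overlap_at_p(results: Sequence[str], reference: Sequence[str], p: int) -> float:
    """Assumed definition: |top-p(results) ∩ top-p(reference)| / p."""
    return len(set(results[:p]) & set(reference[:p])) / p


def exact_match_at_p(results: Sequence[str], reference: Sequence[str], p: int) -> bool:
    """Assumed definition: the two top-p lists are identical, order included."""
    return list(results[:p]) == list(reference[:p])


def exact_match_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]], p: int) -> float:
    """Fraction of (results, reference) pairs that match exactly at p."""
    return sum(exact_match_at_p(r, ref, p) for r, ref in pairs) / len(pairs)


centralized = ["d1", "d7", "d2"]   # ground truth from a centralized index
multi_site = ["d1", "d2", "d3"]    # what a site returned without forwarding

print(overlap_at_p(multi_site, centralized, p=2))      # 0.5
print(exact_match_at_p(multi_site, centralized, p=1))  # True
print(exact_match_rate([(multi_site, centralized),
                        (centralized, centralized)], p=2))  # 0.5
```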
Conclusions
› First study on the interplay among the key components of multi-site search engines
› Multi-site search engines are a promising alternative to traditional search engines
• Query forwarding – Almost the same query response time as Oracle – Result quality slightly below that of Oracle
• Document replication – Significantly reduces query response time – Improves result quality
• Result caching – Significantly reduces query response time – No impact on result quality