1
Random Sampling from a Search Engine’s Index
Ziv Bar-Yossef and Maxim Gurevich
Department of Electrical Engineering Technion
Presentation at group meeting, Oct., 24
Allen, Zhenjiang Lin
2
Outline
• Introduction
    • Search Engine Samplers
    • Motivation
    • The Bharat-Broder Sampler (WWW’98)
• Infrastructure of Proposed Methods
    • Search Engines as Hypergraphs
    • Monte Carlo Simulation Methods – Rejection Sampling
• The Pool-based Sampler
• The Random Walk Sampler
• Experimental Results
• Conclusions
3
Search Engine Samplers
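[Diagram: the search engine indexes documents from the Web; D is the set of indexed documents. The sampler sends queries to the search engine's public interface, receives the top k results for each query, and outputs a random document x ∈ D.]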
4
Motivation
Useful tool for search engine evaluation:
• Freshness: fraction of up-to-date pages in the index
• Topical bias: identification of overrepresented/underrepresented topics
• Spam: fraction of spam pages in the index
• Security: fraction of pages in index infected by viruses/worms/trojans
• Relative size: number of documents indexed compared with other search engines
5
Size Wars
August 2005: “We index 20 billion documents.”
September 2005: “We index 8 billion documents, but our index is 3 times larger than our competition’s.”
So, who’s right?
6
Why Does Size Matter, Anyway?
• Comprehensiveness: a good crawler covers the most documents possible
• Narrow-topic queries: e.g., get homepage of John Doe
• Prestige: a marketing advantage
7
Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorni05]
Sample pages uniformly at random from the search engine’s index
Two alternatives:
• Absolute size estimation
    • Sample until collision
    • Collision expected after k ~ N^(1/2) random samples (birthday paradox)
    • Return k²
• Relative size estimation
    • Check how many samples from search engine A are present in search engine B and vice versa
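The two estimators on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `sample_document`, `in_index_a` and `in_index_b` are hypothetical callables standing in for a uniform sampler and for "is this document in the index?" checks. A single collision run gives only a rough, order-of-magnitude figure.

```python
def estimate_absolute_size(sample_document, max_draws=10_000_000):
    """Draw uniform samples until one repeats, then return k**2 (k = number of draws)."""
    seen = set()
    for k in range(1, max_draws + 1):
        doc = sample_document()
        if doc in seen:
            return k * k
        seen.add(doc)
    raise RuntimeError("no collision within max_draws")


def estimate_relative_size(samples_from_a, samples_from_b, in_index_a, in_index_b):
    """Estimate |A| / |B| from uniform samples of both indexes.

    The share of B's samples found in A approximates |A ∩ B| / |B|, and the
    share of A's samples found in B approximates |A ∩ B| / |A|; their ratio
    is |A| / |B|.
    """
    frac_b_in_a = sum(1 for d in samples_from_b if in_index_a(d)) / len(samples_from_b)
    frac_a_in_b = sum(1 for d in samples_from_a if in_index_b(d)) / len(samples_from_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio cannot be estimated")
    return frac_b_in_a / frac_a_in_b
```

With toy indexes A = {0, …, 299} and B = {200, …, 399}, passing every document as a "sample" gives exactly the true ratio 300/200 = 1.5.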
8
Related Work
Random Sampling from a Search Engine’s Index [BharatBroder98, CheneyPerry05, GulliSignorni05]
Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
9
The Bharat-Broder Sampler: Preprocessing Step
[Diagram: a large corpus C is processed into a lexicon L that lists each term with its frequency in C: t1, freq(t1, C); t2, freq(t2, C); …]
10
The Bharat-Broder Sampler
[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, sends the query “t1 AND t2” to the search engine, receives the top k results, and returns a random document from the top k results.]
Samples are uniform only if:
• all queries return the same number of results ≤ k
• all documents are of the same length
11
The Bharat-Broder Sampler: Drawbacks
• Documents have varying lengths
    • Bias towards long documents
• Some queries have more than k matches
    • Bias towards documents with high static rank
12
Two novel samplers
A pool-based sampler
• Guaranteed to produce near-uniform samples
• Needs a lexicon / query pool
A random walk sampler
• After sufficiently many steps, guaranteed to produce near-uniform samples
• Does not need an explicit lexicon / pool at all!
Focus of this talk
13
Search Engines as Hypergraphs
results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries
Query pool hypergraph:
• Vertices: indexed documents
• Hyperedges: { results(q) | q ∈ P }
[Figure: query pool hypergraph. Hyperedges (queries): “news”, “bbc”, “google”, “maps”. Vertices (documents): www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoot.com, en.wikipedia.org/wiki/BBC.]
14
Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)|
Document degree: deg(x) = |queries(x)|
Examples:
• card(“news”) = 4, card(“bbc”) = 3
• deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
[Figure: the query pool hypergraph from slide 13.]
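A small Python sketch of these definitions follows. The membership of each hyperedge is an assumption: the figure did not survive extraction, so documents are assigned to queries by hostname, which happens to reproduce the card and deg values quoted on the slide.

```python
# Assumed hyperedges: query -> documents it returns (inferred from hostnames,
# not read off the original figure).
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def build_queries_index(results):
    """Invert results(q) into queries(x): document -> set of queries returning it."""
    index = {}
    for query, docs in results.items():
        for doc in docs:
            index.setdefault(doc, set()).add(query)
    return index


QUERIES = build_queries_index(RESULTS)


def card(q):
    """Query cardinality: card(q) = |results(q)|."""
    return len(RESULTS[q])


def deg(x):
    """Document degree: deg(x) = |queries(x)|."""
    return len(QUERIES[x])
```

Under this assumed membership the degrees of all ten documents sum to 13, which equals the sum of the four query cardinalities.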
15
Sampling documents uniformly
Sampling documents from D uniformly: Hard
Sampling documents from D non-uniformly: Easier
Will show later: can sample documents proportionally to their degrees:
p(x) = deg(x) / Σ_{x′ ∈ D} deg(x′)
16
Sampling documents by degree
p(news.bbc.co.uk) = 2/13
p(www.cnn.com) = 1/13
[Figure: the query pool hypergraph from slide 13.]
17
Monte Carlo Simulation
We need: Samples from the uniform distribution
We have: Samples from the degree distribution
Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
Yes!
Monte Carlo Simulation Methods:
• Rejection Sampling
• Importance Sampling
• Metropolis-Hastings
• Maximum-Degree
18
Rejection Sampling Algorithm
Samples values from an arbitrary probability distribution f(x) by using an instrumental distribution g(x).
The algorithm (due to John von Neumann) is as follows:
• Sample x from g(x) and u from U(0,1).
• Check whether or not u < f(x) / (M g(x)).
• If this holds, accept x as a realization of f(x); if not, reject the value of x and repeat the sampling step.
M > 1 is an appropriate bound on f(x) / g(x):
f(x) / (M g(x)) ≤ 1 <=> M ≥ f(x) / g(x), ∀ x ∈ D.
Proof:
pRS(x) = g(x) · f(x) / (M g(x)) = f(x) / M.
Here pRS(x) is the probability that a single round proposes x and accepts it. It is proportional to f(x), so accepted values are distributed according to f, and each round accepts with overall probability 1/M.
19
Rejection Sampling: An Example
Sampling u.a.r. from the square: g(x), easy
Sampling u.a.r. from the disc: f(x), hard
Since f(x) = F on the disc and g(x) = G on the square (both constant), set M = F/G.
• Generate a candidate point x from the unit square, g(x);
• If x is in the unit disc, f(x) = F ≠ 0, thus f(x) / (M g(x)) = 1: accept x;
• If x is in square \ disc, f(x) = 0, thus f(x) / (M g(x)) = 0: reject x.
Therefore, x is sampled u.a.r. from the unit disc.
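The disc example can be sketched in Python as follows. The geometry is an assumption made for illustration: a disc of radius 1 centered at the origin, inscribed in the square [-1, 1] × [-1, 1]. Because the acceptance probability is either 0 or 1 here, no extra coin toss is needed.

```python
import random


def sample_from_disc(rng):
    """Return a point drawn uniformly at random from the disc of radius 1."""
    while True:
        # Candidate from the trial distribution g: uniform on the enclosing square.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # f(x) / (M g(x)) is 1 inside the disc and 0 outside it.
        if x * x + y * y <= 1.0:
            return x, y


rng = random.Random(0)
points = [sample_from_disc(rng) for _ in range(1000)]
```

About π/4 of the candidates are accepted, the ratio of the disc's area to the square's area.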
20
Monte Carlo Simulation
π: Target distribution
    In our case: π = uniform on D
p: Trial distribution
    In our case: p = degree distribution
Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x)
    In our case: w(x) ∝ 1 / deg(x)
[Diagram: a p-sampler feeds weighted samples from p, (x1, w(x1)), (x2, w(x2)), …, into the Monte Carlo simulator, which outputs a sample x from π.]
21
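Bias Weights
Unnormalized forms of π and p: π(x) = π̂(x) / Z_π, p(x) = p̂(x) / Z_p
Z_π, Z_p: (unknown) normalization constants
Examples:
• π = uniform: π̂(x) = 1
• p = degree distribution: p̂(x) = deg(x)
Bias weight: w(x) = π̂(x) / p̂(x) = 1 / deg(x)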
22
Rejection Sampling [von Neumann]
C: envelope constant, C ≥ w(x) for all x
The algorithm:
    accept := false
    while (not accept)
        generate a sample x from p
        toss a coin whose heads probability is w(x) / C
        if coin comes up heads, accept := true
    return x
In our case: C = 1 and acceptance prob = 1/deg(x)
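The loop on this slide translates almost directly into Python. The sketch below assumes a hypothetical three-document index with known degrees and a stand-in trial sampler built on `random.choices`; in the real setting the trial samples come from the search engine.

```python
import random


def rejection_sample(sample_from_p, weight, envelope, rng):
    """Draw from the target distribution using samples from the trial distribution p."""
    while True:
        x = sample_from_p()
        # Toss a coin whose heads probability is w(x) / C.
        if rng.random() < weight(x) / envelope:
            return x


# Hypothetical toy index: document -> degree.
DEGREES = {"a": 1, "b": 2, "c": 3}
rng = random.Random(0)


def sample_by_degree():
    """Stand-in trial sampler: picks x with probability deg(x) / sum of degrees."""
    docs = list(DEGREES)
    return rng.choices(docs, weights=[DEGREES[d] for d in docs], k=1)[0]


def sample_uniform_document():
    # In our case: w(x) = 1 / deg(x) and C = 1.
    return rejection_sample(sample_by_degree, lambda x: 1 / DEGREES[x], 1.0, rng)
```

Each document is proposed with probability proportional to deg(x) and accepted with probability 1/deg(x), so every document is returned equally often.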
23
Pool-Based Sampler
[Diagram: the degree distribution sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …. It outputs documents sampled from the degree distribution with corresponding weights, (x1, 1/deg(x1)), (x2, 1/deg(x2)), …. Rejection sampling turns these into a uniform sample x.]
Degree distribution: p(x) = deg(x) / Σ_{x′} deg(x′)
24
Sampling documents by degree
• Select a random query q
• Select a random x ∈ results(q)
Documents with high degree are more likely to be sampled.
If we sample q uniformly, we “oversample” documents that belong to narrow queries, because the weights of queries are different.
We need to sample q proportionally to its cardinality.
[Figure: the query pool hypergraph from slide 13.]
25
Sampling documents by degree (2)
• Select a query q proportionally to its cardinality
• Select a random x ∈ results(q)
Analysis:
Pr[x] = Σ_{q ∈ queries(x)} [card(q) / Σ_{q′ ∈ P} card(q′)] · [1 / card(q)] = deg(x) / Σ_{q′ ∈ P} card(q′) = deg(x) / Σ_{x′ ∈ D} deg(x′)
[Figure: the query pool hypergraph from slide 13.]
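The two-stage procedure can be checked exactly on the toy hypergraph with rational arithmetic. As before, the hyperedge membership is an assumption inferred from the hostnames, since the figure itself is not recoverable.

```python
from fractions import Fraction

# Assumed hyperedges: query -> documents it returns.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def two_stage_distribution(results):
    """Exact Pr[x] when q is picked proportionally to card(q), then x uniformly from results(q)."""
    total_card = sum(len(docs) for docs in results.values())
    prob = {}
    for docs in results.values():
        p_query = Fraction(len(docs), total_card)  # card(q) / sum of cardinalities
        p_doc_given_query = Fraction(1, len(docs))  # uniform choice within results(q)
        for x in docs:
            prob[x] = prob.get(x, Fraction(0)) + p_query * p_doc_given_query
    return prob


PROB = two_stage_distribution(RESULTS)
```

`PROB` assigns 2/13 to news.bbc.co.uk and 1/13 to www.cnn.com, matching the values quoted on slide 16.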
26
Degree Distribution Sampler
[Diagram: the cardinality distribution sampler outputs a query q sampled from the cardinality distribution. q is sent to the search engine, which returns results(q). Sampling x uniformly from results(q) yields a document sampled from the degree distribution.]
27
Sampling queries by cardinality
Sampling queries from pool uniformly: Easy
Sampling queries from pool by cardinality: Hard
• Requires knowing cardinalities of all queries in the search engine
Use Monte Carlo methods to simulate biased sampling via uniform sampling:
• Target distribution: the cardinality distribution
• Trial distribution: uniform distribution on the query pool
28
Sampling queries by cardinality
Bias weight of cardinality distribution relative to the uniform distribution: w(q) = card(q)
• Can be computed using a single search engine query
Use rejection sampling:
• Envelope constant for rejection sampling: C = k (a query that does not overflow has card(q) ≤ k)
• Queries are sampled uniformly from the pool
• Each query q is accepted with probability w(q) / C = card(q) / k
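A minimal Python sketch of this step is shown below. The pool, its cardinalities and the result limit k = 3 are made up for illustration, and `card` stands in for a single search engine call. Skipping queries with card(q) > k is an assumption of this sketch.

```python
import random


def sample_query_by_cardinality(pool, card, k, rng):
    """Return a query drawn with probability proportional to card(q)."""
    while True:
        q = rng.choice(pool)  # trial distribution: uniform on the pool
        c = card(q)           # one (mock) search engine call
        if c > k:
            continue          # assumption: overflowing queries are skipped
        if rng.random() < c / k:  # accept with probability card(q) / k
            return q


# Hypothetical pool with known cardinalities.
CARD = {"q1": 1, "q2": 2, "q3": 3}
rng = random.Random(0)
samples = [
    sample_query_by_cardinality(list(CARD), CARD.__getitem__, 3, rng)
    for _ in range(6000)
]
```

In this toy pool the three queries should appear in roughly a 1 : 2 : 3 ratio.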
29
Complete Pool-Based Sampler
[Diagram: a uniform query sampler produces a uniform query sample with weights, (q, card(q)), …. Rejection sampling turns it into queries sampled from the cardinality distribution. These are sent to the search engine, which returns (q, results(q)), …. The degree distribution sampler outputs documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …. A second rejection sampling step yields a uniform document sample.]
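Putting the two rejection steps together gives the following end-to-end sketch. It runs against an in-memory mock rather than a real search engine: the `RESULTS` membership is assumed from the hostnames in the example hypergraph, the result limit `K = 4` is hypothetical, and degrees are read from the mock index. A real sampler would have to compute deg(x) from the document itself, which this sketch does not cover.

```python
import random

# Assumed mock index: query -> documents it returns.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
K = 4  # hypothetical result limit of the mock engine

# deg(x) = number of pool queries that return x (taken from the mock index).
DEG = {}
for _docs in RESULTS.values():
    for _doc in _docs:
        DEG[_doc] = DEG.get(_doc, 0) + 1


def pool_based_sample(results, deg, k, rng):
    """Return one document, near-uniform over the documents covered by the pool."""
    pool = list(results)
    while True:
        q = rng.choice(pool)   # uniform query sample
        docs = results[q]      # stands in for one search engine call
        if len(docs) > k or rng.random() >= len(docs) / k:
            continue           # first rejection step: accept q with prob. card(q) / k
        x = rng.choice(docs)   # x now follows the degree distribution
        if rng.random() < 1 / deg[x]:
            return x           # second rejection step: accept x with prob. 1 / deg(x)


rng = random.Random(0)
sample = [pool_based_sample(RESULTS, DEG, K, rng) for _ in range(10000)]
```

When a document is rejected, the loop restarts from a fresh query, so every trial document is an independent draw from the degree distribution.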
30
Dealing with Overflowing Queries
Problem: Some queries may overflow (card(q) > k)
• Bias towards highly ranked documents
Solutions:
• Select a pool P in which overflowing queries are rare (e.g., phrase queries)
• Skip overflowing queries
• Adapt rejection sampling to deal with approximate weights
Theorem:
Samples of PB sampler are at most ε-away from uniform. (ε = overflow probability of P)
31
Creating the query pool
[Diagram: a large corpus C is processed into a query pool P = {q1, q2, …}.]
Example: P = all 3-word phrases that occur in C
• If “to be or not to be” occurs in C, P contains: “to be or”, “be or not”, “or not to”, “not to be”
Choose P that “covers” most documents in D
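Building such a phrase pool can be sketched as below. Whitespace tokenization and case-sensitive matching are simplifying assumptions; the slides do not say how the corpus is tokenized.

```python
def phrase_pool(corpus_docs, n=3):
    """Collect every n-word phrase that occurs in any document of the corpus."""
    pool = set()
    for text in corpus_docs:
        words = text.split()  # assumption: plain whitespace tokenization
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool


POOL = phrase_pool(["to be or not to be"])
```

For the slide's example sentence, `POOL` contains exactly the four phrases listed above.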
32
A random walk sampler
Define a graph G over the indexed documents
• (x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
Run a random walk on G
• Limit distribution = degree distribution
• Use MCMC methods to make limit distribution uniform: Metropolis-Hastings, Maximum-Degree
Does not need a preprocessing step
Less efficient than the pool-based sampler
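To illustrate how Metropolis-Hastings makes the limit distribution uniform, the sketch below builds G from the example hypergraph and computes the exact one-step transition probabilities. It is a textbook Metropolis-Hastings walk, not necessarily the authors' exact procedure. The hyperedge membership is assumed from the hostnames, self-loops are left out of G, and "degree" here means the number of neighbors in G.

```python
from fractions import Fraction
from itertools import combinations

# Assumed hyperedges: query -> documents it returns.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def build_graph(results):
    """Connect two distinct documents iff some query returns both of them."""
    neighbors = {}
    for docs in results.values():
        for x in docs:
            neighbors.setdefault(x, set())
        for x, y in combinations(docs, 2):
            neighbors[x].add(y)
            neighbors[y].add(x)
    return neighbors


def mh_transition(neighbors, x):
    """Exact one-step probabilities of a Metropolis-Hastings walk started at x."""
    d_x = len(neighbors[x])
    row = {}
    for y in neighbors[x]:
        d_y = len(neighbors[y])
        # Propose y with probability 1/d_x, accept with probability min(1, d_x/d_y).
        row[y] = Fraction(1, d_x) * min(Fraction(1), Fraction(d_x, d_y))
    row[x] = 1 - sum(row.values())  # rejected proposals keep the walk at x
    return row


G = build_graph(RESULTS)
P = {x: mh_transition(G, x) for x in G}
```

Each off-diagonal entry equals 1 / max(d_x, d_y), so the transition matrix is symmetric and the uniform distribution is stationary.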
33
Bias towards Long Documents
[Figure: bar chart. x-axis: deciles of documents ordered by size (1 to 10). y-axis: percent of documents from sample (0% to 60%). Series: Pool Based, Random Walk, Bharat-Broder.]
35
Top-Level Domains in Google, MSN and Yahoo!
[Figure: bar chart. x-axis: top-level domain name. y-axis: percent of documents from sample (0% to 60%). Series: Google, MSN, Yahoo!.]
36
Conclusions
Two new search engine samplers
• Pool-based sampler
• Random walk sampler
Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
Samplers show no or little bias in experiments.