The Cluster Hypothesis in Information Retrieval
SIGIR 2013 tutorial
Oren Kurland, Technion --- Israel Institute of Technology
Email: [email protected]; http://iew3.technion.ac.il/~kurland
Slides are available at: http://iew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdf
Tutorial overview
• The cluster hypothesis
• Historical view of the effect of the hypothesis on work on ad hoc information retrieval
• Testing the cluster hypothesis
• Cluster-based document retrieval
• Using topic models for ad hoc information retrieval
• Graph-based methods for ad hoc retrieval that utilize inter-document similarities
• Additional tasks/applications
  • Search results visualization, query-performance prediction, fusion, federated search, query expansion, microblog retrieval, relevance feedback, adversarial search
• Concluding notes
The ad hoc retrieval task
• Ranking the documents in a corpus by their relevance to the information need expressed by a query
• Vector space model
• Probabilistic approaches
• Language modeling framework
• Divergence from randomness framework
• Learning to rank
The cluster hypothesis
Closely associated documents tend to be relevant to the same requests
(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)
A quick historical tour
• Mid-end 60’s
  • Using document clusters to improve search efficiency (Salton ’68)
• 70’s-80’s
  • Using document clusters to improve search effectiveness (Jardine&van Rijsbergen ’71)
• ~2004-today
  • Using document clusters to improve search effectiveness (Azzopardi et al. ’04, Kurland&Lee ’04, Liu&Croft ’04)
• 90’s-00’s
  • Using document clusters to improve results browsing (Preece ’73)
• 90’s-today
  • Using topic models to improve search effectiveness (Deerwester et al. ’90)
• ~00’s-today
  • Using graph-based approaches for ad hoc retrieval that utilize inter-document similarities (Salton&Buckley ’88)
Ph.D. dissertations
• Ivie, E. L. Search procedures based on measures of relatedness between documents. PhD thesis, Massachusetts Institute of Technology, 1966.
• Marcia Davis Kerchner. Dynamic document processing in clustered collections. PhD thesis, Cornell University, 1971.
• Daniel McClure Murray. Document retrieval based on clustered files. PhD thesis, Cornell University, 1972.
• Ellen Voorhees. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University, 1985.
• Anton Leuski. Interactive information organization: Techniques and evaluation. PhD thesis, University of Massachusetts Amherst, 2001.
• Anastasios Tombros. The effectiveness of hierarchic query-based clustering of documents for information retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2002.
• Leif Azzopardi. Incorporating context within the language modeling approach for ad hoc information retrieval. PhD thesis, University of Paisley, 2005.
• Oren Kurland. Inter-document similarities, language models, and ad hoc retrieval. PhD thesis, Cornell University, 2006.
• Xiaoyong Liu. Cluster-based retrieval from a language modeling perspective. PhD thesis, University of Massachusetts Amherst, 2006.
• Xing Wei. Topic models in information retrieval. PhD thesis, University of Massachusetts Amherst, 2007.
• Fernando Diaz. Autocorrelation and regularization of query-based retrieval scores. PhD thesis, University of Massachusetts Amherst, 2008.
• Mark Smucker. Evaluation of find-similar with simulation and network analysis. PhD thesis, University of Massachusetts Amherst, 2008.
Improving search efficiency
• Cluster the corpus offline
• Represent each cluster by its centroid
• At query time, compare the centroids with the query and select the clusters to present
The cluster hypothesis
Closely associated documents tend to be relevant to the same requests
(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)
Does the cluster hypothesis hold?
• Depends on the inter-document similarity measure used?
• Maybe we should assume that the hypothesis holds, and accordingly devise inter-document similarity measures?
• More details later
The Jardine&van Rijsbergen (’71) (overlap) test
• The similarity between two relevant documents vs. the similarity between a relevant and a non-relevant document
• Measuring the overlap between the two similarity distributions
Jardine&van Rijsbergen’s cluster hypothesis test
(Figure is taken from Voorhees ‘85)
Voorhees’ (’85) nearest-neighbor test
• The percentage of relevant documents among the 5 nearest neighbors of a relevant document
• The cosine similarity between tf.idf vectors is used
Voorhees’ nearest-neighbor test (Kurland ’06)
• The KL divergence between language models of documents is used for the similarity measure
Voorhees’ nearest-neighbor test applied to the result list of the n highest ranked documents (Raiber&Kurland ’12)
• The KL divergence between language models of documents is used for the similarity measure
The connection between the cluster hypothesis and cluster-based retrieval effectiveness
• “The extent to which the cluster hypothesis characterized a collection seemed to have little effect on how well cluster searching performed as compared to a sequential search of the collection.” (Voorhees ’85)
• There is (high) correlation between the extent to which the nearest-neighbor cluster hypothesis holds and the effectiveness of cluster-based document retrieval. (Na et al. ’08)
• One potential reason for the contradicting findings: completely different cluster-based retrieval methods have been used
The density-based cluster hypothesis test (El-Hamdouchi and Willett ‘87)
• The test value is the ratio between the number of postings in the index (i.e., the total number of different terms used in documents) and the size of the vocabulary
• There is also a weighted version
• The test was empirically shown to be more correlated than the overlap and nearest-neighbor tests with the relative improvement posted by cluster-based retrieval over document-based retrieval
  • Nearest-neighbor clusters (Griffiths et al. ’85) were used
  • Retrieval performance was measured by recall at some cutoff
Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)
• Claim: the cluster hypothesis should hold for every collection; it is the inter-document similarity measure that needs to be adjusted so that the hypothesis holds
• Heretofore, all inter-document similarities were query-independent
• Idea: bias the inter-document similarity measure to emphasize relations between the documents and the query

  sim(d1, d2 | q) ≜ cos(d1, d2) · cos(c, q);  c_i = ½(d1;i + d2;i) where the i’th term in the vocabulary is common to d1 and d2 (c_i = 0 otherwise), and d_x;i is its weight in d_x
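The query-sensitive measure above can be sketched as follows, assuming documents and the query are sparse term→weight dictionaries (the helper names are mine, for illustration only):

```python
import math

def cos_sparse(u, v):
    """Cosine similarity of two sparse term->weight vectors."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def query_sensitive_sim(d1, d2, q):
    """sim(d1, d2 | q) = cos(d1, d2) * cos(c, q), where c averages the
    weights of the terms that d1 and d2 share (0 for all other terms)."""
    c = {t: 0.5 * (d1[t] + d2[t]) for t in set(d1) & set(d2)}
    return cos_sparse(d1, d2) * cos_sparse(c, q)
```

Two documents thus score high only when they are similar to each other *and* their shared vocabulary is similar to the query.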
Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)
Nearest neighbor test with 5 nearest neighbors
Using the cluster hypothesis to induce clustering
• The optimum clustering framework (Fuhr et al. 2012)
• Basic principle: documents that are co-relevant to many queries should be clustered together
• Definitions for the expected recall and precision of a clustering based on co-relevance
• Well known clustering methods can be viewed as based on principles of the framework
• The framework was shown to provide a more effective internal clustering quality criterion than commonly used alternatives
  • In terms of correlation to ground truth
Alternative cluster hypothesis test (Smucker&Allan ’09)
• Claim: The nearest neighbor test is insufficient for query-biased similarities
• The nearest-neighbor test is a good measure of local clustering
• A graph-based normalized mean reciprocal distance measure
The cluster hypothesis test for entity retrieval (Raviv et al. SIGIR 2013)
• Check out the poster of Hadas Raviv
• The main challenge: defining similarities between entities
Using clusters of similar documents for document retrieval
• Visualizing results
• Using clusters to select documents
• Using clusters to enrich document representations
  • Using topic models to enrich document representations
Types of document clusters
• Hard vs. soft
  • Hard clustering: A document belongs to a single cluster
    • Partitioning (e.g., K-Means) vs. hierarchical agglomerative clustering (single link, complete link, average link, Ward’s criterion)
  • Soft clustering: A document can belong to, or be associated with, several clusters
    • Overlapping nearest-neighbor clusters: for each document we construct a cluster that contains the document and its k nearest neighbors
    • Topic models (more details later)
• Offline (query-independent)
  • Created from all documents in the corpus
  • Help to address recall issues with the initial search (?)
  • Efficiency issues
    • Large scale and dynamic corpora
• Query specific (Preece ’73, Willett ’85)
  • Created from the documents most highly ranked by an initial search
  • Used either for visualization of results or for automatic re-ranking of the initial result list
  • Drawback(?): dependence on the effectiveness of the initial search
Cluster-based search results visualization
• The scatter-gather system (Cutting et al. ’92, Hearst&Pedersen ’96)
• Browsing strategies for cluster-based result interfaces (Leuski ’01)
• Interactive retrieval using a cluster-based interface (Leuski&Allan ’04)
• Interactive exploration of corpora based on inter-document similarities (Smucker ’08)
Challenges to address
• Fast online creation of clusters
  • Cutting et al. ’92, Zamir&Etzioni ’98
• Automatic labeling of clusters
  • Treeratpituk&Callan ’06 (agglomerative clusters)
  • Mei et al. ’07 (topic models)
Using document clusters for ad hoc retrieval
• The user needn’t be aware of the fact that clustering was performed
• Clusters often serve one of two roles (or both) (Kurland&Lee ’04)
  • Document selection
  • Enriching (“smoothing”) document representations
Using offline-created clusters for document selection (Kurland ’06)

Query = {truck, bus}
d1 = school bus, classes, teachers
d2 = school, classes, teachers, class
d3 = bus, taxi, boat, bike
d4 = taxi, boat, truck, scooter
d5 = boat, horse, taxi, bike, scooter
d6 = home, house, kids, floor

sim(x_i, x_j) ≜ |x_i ∩ x_j|

[Figure: the six documents are grouped into clusters c1, c2 and c3; "Rank using documents" (by sim(q,d) alone) is contrasted with "Rank using clusters and docs", which first ranks the clusters by sim(q,c) and then orders the documents within them by sim(d,c)]
Using clusters for document selection
• Given a query q and a list of document clusters Cl
  • Cl can be a set of offline-created or query-specific clusters
• Rank the clusters c in Cl using a query-cluster similarity measure, or any other approach:
  • score(c; q) ≜ sim(q, c)
  • A key estimation issue which will be further discussed
• Transform the cluster ranking to document ranking
Transforming cluster ranking to document ranking
• Strategy #1 (originally termed “cluster-based retrieval” by Jardine&Rijsbergen (’71))
• Replace each cluster with its constituent documents (omitting repeats)
• Within-cluster document ranking is based on the initial document scores which were assigned in response to the query or on the similarity between the document and the cluster centroid
• Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Willett ’85, Liu&Croft ’04, Kurland&Lee ’06, Kurland ’08, Kurland&Domshlak ’08, Liu&Croft ’08
The CQL method (Liu&Croft ’04)
(Example for strategy #1 using query-specific clusters)
• score(c; q) ≜ sim(q, c); similarity is measured using language models
• A cluster is represented by the concatenation of its constituent documents
Transforming cluster ranking to document ranking (cont.)
• Strategy #2
  • Rank all the documents in the top-retrieved clusters using some criterion (Voorhees ’85)
  • Examples to follow
• Strategy #3
  • Traverse the clustering dendrogram until finding the cluster with the best match to the query, or using any other stopping criterion (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Griffiths et al. ‘86)
• Mixed results with respect to whether cluster-based retrieval is consistently more effective than document retrieval
• Cluster-based retrieval was shown to be more effective in terms of precision (Jardine&Rijsbergen ’71, Croft ’80)
• Bottom up search was shown to be more effective than top-down (Croft ’80)
Algorithmic framework (Kurland&Lee ’04, ‘09)
(Example for strategy #2)

Cl is the set of nearest-neighbor clusters created from all documents in the corpus (cf., Griffiths et al. ’86)
Given query q and N (the number of docs to retrieve):
1. For each document d, choose Facets(q,d) ⊆ Cl
2. Score d by a weighted combination of sim(q,d) and the sim(q,c)’s for all c ∈ Facets(q,d)
3. Set TopDocs(N) to the rank-ordered list of the N top-scoring documents
4. Optional: re-rank d ∈ TopDocs(N) by sim(q,d)
Return TopDocs(N)
The Set-Select algorithm (Kurland&Lee ’04, ’09; cf. Voorhees ‘85)
• Instantiation of the framework:

  Facets(q,d) ≜ {c : d ∈ c, c ∈ TopClusters(m)}
  Score_q(d) = sim(q,d) · δ[|Facets(q,d)| > 0]

  TopClusters(m) is the set of m clusters that are the most similar to the query
• The procedure: rank only documents in top retrieved clusters using sim(q,d)
The Bag-Select algorithm (Kurland&Lee ’04, ‘09)
• Instantiation of the framework:

  Facets(q,d) ≜ {c : d ∈ c, c ∈ TopClusters(m)}
  Score_q(d) = sim(q,d) · |Facets(q,d)|

• The procedure: rank only documents in top retrieved clusters using sim(q,d) × (# of top clusters d belongs to)
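Both algorithms can be sketched directly from their scoring rules. This is a hypothetical illustration assuming precomputed sim(q,d) and sim(q,c) values stored in dictionaries (all names are mine):

```python
def top_clusters(clusters, sim_q_c, m):
    """TopClusters(m): the m clusters most similar to the query."""
    return sorted(clusters, key=lambda c: -sim_q_c[c])[:m]

def set_select(clusters, members, sim_q_d, sim_q_c, m):
    """Set-Select: rank only documents appearing in a top-retrieved
    cluster, by sim(q,d)."""
    docs = {d for c in top_clusters(clusters, sim_q_c, m)
            for d in members[c]}
    return sorted(docs, key=lambda d: -sim_q_d[d])

def bag_select(clusters, members, sim_q_d, sim_q_c, m):
    """Bag-Select: score = sim(q,d) * (# of top clusters d belongs to),
    so documents in several top clusters are rewarded."""
    count = {}
    for c in top_clusters(clusters, sim_q_c, m):
        for d in members[c]:
            count[d] = count.get(d, 0) + 1
    return sorted(count, key=lambda d: -sim_q_d[d] * count[d])
```

With overlapping nearest-neighbor clusters the membership count in Bag-Select can exceed one, which is exactly where the two rankings diverge.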
Set-Select, Bag-Select (Kurland&Lee ’04, ‘09)

[Worked example with clusters c1, c2, c3 and documents d1, d2, d40: Facets(q,d) is the set of top clusters that contain d; Set-Select scores each faceted document by sim(q,d), whereas Bag-Select multiplies sim(q,d) by the number of top clusters the document belongs to]
Empirical Results (Kurland&Lee ‘09)

[Bar chart of MAP (15%-29%) on AP89, AP88+89 and LA+FR: Set-Select and Bag-Select vs. the document-based baseline; ‘*’ marks a statistically significant difference with the baseline]
The cluster ranking challenge
• score(c; q) ≜ sim(q, c)
• How do we represent cluster c?
  • Binary term vector (Jardine&van Rijsbergen ’71, van Rijsbergen ’74)
  • A centroid of the vectors representing c’s constituent documents (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, El-Hamdouchi&Willett ’87, Liu&Croft ‘08)
    • Cosine, for example, serves for the similarity measure in the vector space
  • The big document that results from concatenating c’s constituent documents (Kurland&Lee ’04, Liu&Croft ‘04)
    • Language-model-based similarity estimates
Cluster representations (Liu&Croft ’08)
• The best representation for a cluster, among those studied, was the geometric mean of the language models of its constituent documents
• Seo&Croft ’10 provide arguments based on information geometry for the effectiveness of the geometric-mean-based representation
Some more cluster ranking methods
• Using the min/max query-similarity score of a document in the cluster (Leuski ’01, Shanahan et al. ’03, Liu&Croft ’08)
• Document-cluster graphs (Kurland&Lee ’06; more details later)
• Variance of document retrieval scores in the cluster (Liu&Croft ’06; more details later)
• Aggregating measures of properties of a cluster (Kurland&Domshlak ’08)
• Using the similarity between the cluster and an expanded query form (Liu&Croft ’04, Wei&Croft ’06; more details later)
The optimal cluster
• The percentage of relevant documents in the cluster c of k documents for which score(c; q) is the highest is the precision@k attained by cluster-based retrieval with strategy #1

Using nearest neighbor query-specific clusters of 5 documents: (Kurland&Domshlak ’08)
The optimal cluster (Liu&Croft’06)
The optimal cluster (cont.)
• Jardine and van Rijsbergen ’71 were the first to report the existence of an optimal (offline created) cluster
• This was later re-asserted by, for example, Hearst&Pedersen (’96), Tombros et al. (’02), Kurland (’06) and Liu&Croft (’06)
• Offline-created optimal clusters contain a smaller percentage of relevant documents than optimal query-specific clusters (Tombros et al. ’02)
• There are several approaches to estimating the potential retrieval merits of finding optimal clusters (Tombros et al. ’02)
A probabilistic graph-based approach for ranking clusters (Kurland ’08, Kurland&Krikon ’11)
• What is the probability that this cluster is relevant to this query? (cf., Croft ’80)
• The ClustRanker method:

  score(c; q) ≜ λ p(q|c) p(c) + (1 − λ) Σ_{d∈c} p(q|d) p(c|d) p(d)

• p(d) and p(c) are estimated based on graphs where documents (clusters) are vertices, edge-weights represent inter-item similarities, and the PageRank score of item x serves as an estimate for p(x)
ClustRanker (Kurland ’08, Kurland&Krikon ’11)
(Using nearest-neighbor query-specific clusters for re-ranking)
A (much) more effective cluster ranking method (Raiber&Kurland SIGIR 2013)
• Uses the Markov Random Field framework
• Attend Fiana Raiber’s talk: Ranking Document Clusters using Markov Random Fields
Selective cluster-based retrieval
• Griffiths et al. (’86) observed that cluster-based retrieval and document-based retrieval can be of the same effectiveness
• But, different relevant documents were retrieved in the two cases
• Some previous work on selecting a retrieval strategy per query
  • Croft&Thompson ’84
  • Amati et al. ’04
  • Balasubramanian&Allan ‘10
Selective cluster-based retrieval (Liu&Croft ’06)
• A “good” cluster is one which (1) exhibits high similarity to the query and (2) contains documents with query-similarity values that do not deviate much from that of the cluster
  • Kurland et al. (’12) found some contrasting evidence with respect to the deviation
• For queries with “good” clusters perform cluster-based retrieval; for the other queries perform document-based retrieval
Selective cluster-based retrieval (Liu&Croft ’06)
Intermediate summary
• We ranked document clusters
• We transformed the cluster ranking to document ranking
• Some observations:
  • There are clusters that contain a very high percentage of relevant documents (whether static offline-created clusters or query-specific clusters); the optimal cluster
  • Optimal query-specific clusters contain a higher percentage of relevant documents than static clusters (Tombros et al. ’02)
  • Using small clusters results in more effective retrieval (Griffiths et al. ’86, Tombros et al. ’02, Kurland&Lee ‘04)
  • Cluster representation is crucial
    • A geometric-mean-based representation seems to be the most effective among those proposed (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
  • The performance of cluster-based retrieval can be much better than that of document-based retrieval and that of using query expansion
Using clusters to enrich (smooth) document representations
• Clusters provide (corpus) context for documents
• Enrich the document representation using information induced from similar documents
• Example: use Rocchio’s method to smooth the vector representing the document with those representing similar documents (Singhal&Pereira ’99)

  d_new = d_old + (1/k) Σ_{i=1}^{k} d_i;  d_1, …, d_k are d’s nearest neighbors in the vector space
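The update above can be sketched as follows, assuming dense tf.idf vectors and cosine similarity (the function names and data layout are mine, for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def smooth_with_neighbors(vectors, doc_id, k):
    """d_new = d_old + (1/k) * sum of d's k nearest neighbors in the
    vector space, in the spirit of the Rocchio-style expansion above."""
    d = vectors[doc_id]
    # Rank the other documents by decreasing similarity to d.
    others = sorted((i for i in vectors if i != doc_id),
                    key=lambda i: -cosine(d, vectors[i]))
    neighbors = others[:k]
    return [w + sum(vectors[i][j] for i in neighbors) / k
            for j, w in enumerate(d)]
```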
Similar document-expansion methods
• Lavrenko (’00) employed a nearest-neighbor smoothing method for the query model
• Ogilvie (’00) and Kurland&Lee (’04) smoothed a document language model with language models of its nearest neighbors
• Tao et al. (’06) created pseudo counts for terms in a document that are smoothed using the counts of terms in similar documents
• Wi&Allan (’09) and Efron et al. (’12) use the (weighted) language models of nearest-neighbors of a document to smooth the document language model
  • Efron et al. (’12) use this document model for Twitter search
A quick recap of the language modeling approach (Ponte&Croft ’98)

q: query, d: document, C: corpus of documents, w: term
tf(w ∈ d) is the number of times w appears in d

Maximum likelihood estimate:
  p_MLE(w|d) = tf(w ∈ d) / Σ_{w'} tf(w' ∈ d)

Jelinek-Mercer smoothing:
  p_JM(w|d) = (1 − λ) p_MLE(w|d) + λ p_MLE(w|C)

Dirichlet smoothing:
  p_Dir(w|d) = (tf(w ∈ d) + μ p(w|C)) / (|d| + μ)

(1) The query likelihood model (Song&Croft ’99):
  score(d; q) = p(q|d) = Π_{q_i ∈ q} p(q_i|d)

(2) The KL retrieval method (Lafferty&Zhai ’01):
  score(d; q) = − Σ_{q_i} p(q_i|q) log (p(q_i|q) / p(q_i|d))
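The estimates above can be sketched in a few lines. This is a toy illustration with my own helper names (real systems score in log space over large vocabularies):

```python
from collections import Counter

def mle(tokens):
    """Maximum likelihood unigram model from a token sequence."""
    tf = Counter(tokens)
    total = sum(tf.values())
    return {w: n / total for w, n in tf.items()}

def p_jm(w, doc_lm, corpus_lm, lam=0.1):
    """Jelinek-Mercer: (1 - lam) * p_MLE(w|d) + lam * p_MLE(w|C)."""
    return (1 - lam) * doc_lm.get(w, 0.0) + lam * corpus_lm.get(w, 0.0)

def p_dirichlet(w, doc_tf, doc_len, corpus_lm, mu=1000):
    """Dirichlet: (tf(w in d) + mu * p(w|C)) / (|d| + mu)."""
    return (doc_tf.get(w, 0) + mu * corpus_lm.get(w, 0.0)) / (doc_len + mu)

def query_likelihood(query, doc, corpus, lam=0.1):
    """score(d;q) = product over query terms of the smoothed p(w|d)."""
    doc_lm, corpus_lm = mle(doc), mle(corpus)
    score = 1.0
    for w in query:
        score *= p_jm(w, doc_lm, corpus_lm, lam)
    return score
```

Smoothing with the corpus model is what keeps unseen query terms from zeroing out the product.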
Using pLSA for retrieval (Hofmann ’99)
• pLSA (probabilistic latent semantic analysis) is a “probabilistic successor” of LSA (Deerwester et al. ’90), and an implementation of the aspect model (Hofmann et al. ’97)
• Additional topic models (LDA, Blei et al. ’03; Pachinko Allocation Model, Li&McCallum ’06)
• The generative story of pLSA:
  • Select a document d with probability P(d)
  • Pick a latent class (topic) z with probability P(z|d)
  • Generate a word w with probability P(w|z)
Using pLSA for retrieval (Hofmann ’99)
• P(d, w) = P(d) P(w|d)
• P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)
• Data likelihood: L = Σ_{d∈D} Σ_{w∈W} tf(w ∈ d) log P(d, w)
• Maximizing data likelihood using tempered EM
  • Note the potential metric divergence problem (Azzopardi et al. ’03)
• Using the topic model for retrieval
  • Smoothing a document language model (more details later)
  • Folding the query into the lower dimensional space
    • Vector-based representation, cosine measure
• Retrieval performance is better than that attained by LSA and the cosine method; very small collections are used
Cluster-based document language models (Liu&Croft ’04)

The CBDM model:
  p(w|d) = λ1 p_MLE(w|d) + λ2 p_MLE(w|c) + λ3 p_MLE(w|C);  λ1 + λ2 + λ3 = 1

Let c be the single hard cluster to which d belongs; c is represented by the concatenation of its constituent documents
Using offline K-Means clustering (K is the number of clusters)
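The CBDM mixture can be sketched as follows, assuming token lists for the document, its cluster’s documents, and the corpus (the function names and λ values are illustrative, not Liu&Croft’s tuned settings):

```python
from collections import Counter

def mle_lm(tokens):
    """Unigram MLE language model from a token sequence."""
    tf = Counter(tokens)
    n = sum(tf.values())
    return {w: c / n for w, c in tf.items()}

def cbdm(w, doc, cluster_docs, corpus, lams=(0.7, 0.2, 0.1)):
    """CBDM sketch: p(w|d) = l1*p_MLE(w|d) + l2*p_MLE(w|c) + l3*p_MLE(w|C).
    The cluster c is represented by the concatenation of its documents."""
    l1, l2, l3 = lams
    cluster_tokens = [t for d in cluster_docs for t in d]
    return (l1 * mle_lm(doc).get(w, 0.0)
            + l2 * mle_lm(cluster_tokens).get(w, 0.0)
            + l3 * mle_lm(corpus).get(w, 0.0))
```

The cluster term sits between the document and corpus models, pulling the document’s estimates toward its topical neighborhood.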
Topic-based document language models (Wei&Croft ’06)
• Apply Latent Dirichlet Allocation (LDA; Blei et al. ‘03) to induce topics from the corpus
• Use the resultant topics to smooth document language models
  • Using the KL divergence between a document LM and that of the query for ranking
• A generalization of the CBDM model
• Earlier work by Azzopardi et al. (’04) used LDA and pLSA to induce document prior distributions
• Lu et al. ’11 found that the retrieval performance of using LDA and pLSA (Hofmann ’99) was comparable

  p(w|d) = λ1 p_MLE(w|d) + λ2 p_LDA(w|d) + λ3 p_MLE(w|C);  λ1 + λ2 + λ3 = 1
  p_LDA(w|d) = Σ_{i=1}^{k} p(w|z_i, φ̂) p(z_i|d, θ̂)
Topic-based document language models (Wei&Croft ’06)
A study of using topic-based document language models (Yi&Allan ’09)
• Using more sophisticated topic models (e.g., Pachinko Allocation Model, Li&McCallum ’06) doesn’t yield better retrieval performance (e.g., than that attained by LDA)
• Using nearest-neighbor smoothing results in performance that is as good as that of using topic models
• Pseudo-feedback-based query expansion is more effective than using topic models (either in an offline fashion or in a query-specific fashion)

  p_TM(w|d) ≜ Σ_{t_i∈T} p_TM(w|t_i) p_TM(t_i|d);  T is the set of topics
  p'(w|d) ≜ λ1 p_MLE(w|d) + λ2 p_TM(w|d) + λ3 p_MLE(w|C)
Score-based smoothing (Kurland&Lee ’04, ‘10, Kurland ’09)

  score(d; q) = Σ_{c∈Cl} p(q|d, c) p(c|d)

Estimate p(q|d, c) using λ p(q|d) + (1 − λ) p(q|c)

The interpolation method:
  score(d; q) = λ p(q|d) + (1 − λ) Σ_{c∈Cl} p(q|c) p(c|d)

Using a single term w of q we get a cluster/topic-based document language model
Use exp(−KL(p(·|x) || p(·|y))), where p(·|z) is the unigram language model induced from z, as an estimate for p(x|y)
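Once the p(·|·) terms are replaced by (normalized) similarity estimates, the interpolation method reduces to a weighted sum; a sketch with hypothetical precomputed similarities (all names are mine):

```python
def interpolation_score(q_doc_sim, q_cluster_sim, cluster_doc_sim,
                        d, clusters, lam=0.5):
    """score(d;q) = lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d).
    The p(.|.) terms are stood in for by similarity estimates, e.g.,
    exp(-KL(...)) between unigram language models, assumed precomputed:
    q_doc_sim[d], q_cluster_sim[c], and cluster_doc_sim[(c, d)]."""
    cluster_part = sum(q_cluster_sim[c] * cluster_doc_sim[(c, d)]
                       for c in clusters)
    return lam * q_doc_sim[d] + (1 - lam) * cluster_part
```

Setting lam=1 falls back to plain document retrieval; lowering it mixes in support from clusters the document is strongly associated with.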
The interpolation method (Kurland&Lee ’04, ‘09)
• Ranking the corpus using nearest-neighbor offline-created clusters
‘b’ and ‘I’ mark statistically significant differences with the baseline and interpolation, respectively
Interpolation method (Kurland&Lee ’04, ‘09) – comparison between using nearest neighbor (NN) clusters and K-Means clusters (clusters are created offline)
El-Hamdouchi and Willett (’89) found that when using cluster ranking with offline created clusters:
(i) Using nearest neighbor clusters (of two documents) resulted in better performance than that of using various hard clustering methods (the same as Griffiths et al.’s (’86) findings);
(ii) Using small agglomerative clusters yielded better performance than using larger clusters; and,
(iii) Complete-link agglomerative clustering was more effective than single-link and Ward’s method for cluster-based retrieval (the same as Voorhees’ (’85) findings)
The interpolation method (Kurland ’09)
• Using nearest-neighbor query-specific clusters to re-rank an initially retrieved list

                  AP              TREC8           WSJ
                  p@5    p@10     p@5    p@10     p@5    p@10
  Init. Rank.     .457   .432     .500   .456     .536   .484
  CQL (sim(q,c))  .448   .418     .500   .432     .504   .454
  Bag-Select      .507   .494     .532   .514     .548   .488
  Interpolation   .537   .498     .576   .496     .592   .508

‘*’ marks a statistically significant difference with the initial ranking (Init. Rank.)
The interpolation method (Kurland ’09)
• Comparison with pseudo-feedback-based query expansion
• Using nearest-neighbor query-specific clusters

The interpolation method with different query-specific clustering algorithms (Kurland ’09)
• The findings with respect to (i) nn-LM and nn-VS being superior to hard clustering schemes, and (ii) agg-comp’s relative effectiveness, are reminiscent of those of El-Hamdouchi and Willett (’89), who used cluster ranking with clusters created offline
• Tombros et al. (’02) found that average link was better than complete link in terms of query-specific optimal cluster search
Comparison of cluster-based retrieval methods (Raiber&Kurland ’12)
Cluster types and their effectiveness for smoothing document language models (or scores)
• Nearest-neighbor clusters (with a small number of nearest neighbors) >= topic models > hard clusters
• Note: The first to suggest the use of nearest-neighbor overlapping clusters were Griffiths et al. (’86)
  • Used offline-created clusters (of two documents)
• There is much recent evidence for the effectiveness of using nearest-neighbor query-specific clusters (Kurland&Lee ’06, Liu&Croft ’08, Kurland ’09)

Integrating query-specific and offline-created clusters
• Meister et al. (’09) used the interpolation algorithm (Kurland&Lee ’04) with both offline and query-specific clusters
  • Small (but consistent) performance improvements over using only offline-created or query-specific clusters
• Lee et al. (’01) use a “query-specific” view of static (offline-created) clusters
Cluster-based fusion
Fusion of retrieved lists

The task: Given query q and document lists L1, …, Lm that were retrieved in response to q from corpus D, produce a single list of results

Motivation: Integrating various information sources for retrieval (Croft ’00) (e.g., document representations, query representations, retrieval models)

[Figure: three retrieved lists L1, L2, L3 are fused into a single result list]
A common fusion principle
Documents that are highly ranked in many of the lists are rewarded
(1) overlap of relevant documents in the lists is higher than that of non-relevant documents (Lee ’97)
(2) chorus and skimming effects (Saracevic&Kantor ‘88, Vogt&Cottrell ’99)
d: document; d ∈ L_i: d is a member of list L_i
S_{L_i}(d): the positive retrieval score of d in L_i if d ∈ L_i; 0 otherwise
F(d): "standard" fusion score of d that depends only on retrieval scores or ranks

Examples:
  F_CombMNZ(d) ≜ #{L_i : d ∈ L_i} · Σ_{L_i} S_{L_i}(d)  (Fox&Shaw ’94, Lee ’97)
  F_Borda(d) ≜ Σ_{L_i} #{d' : S_{L_i}(d') ≤ S_{L_i}(d)}  (Borda 1781, Aslam&Montague ’01)
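CombMNZ and Borda can be sketched directly from the definitions above, assuming each list is a dict mapping document ids to positive retrieval scores (the function names are mine):

```python
def comb_mnz(lists):
    """CombMNZ: (# of lists containing d) * (sum of d's scores)."""
    docs = set().union(*lists)
    return {d: sum(1 for L in lists if d in L)
               * sum(L.get(d, 0.0) for L in lists)
            for d in docs}

def borda(lists):
    """Borda: in each list containing d, d earns one point per document
    whose score in that list is <= d's own; points are summed over lists."""
    docs = set().union(*lists)
    scores = {d: 0 for d in docs}
    for L in lists:
        for d in L:
            scores[d] += sum(1 for d2 in L if L[d2] <= L[d])
    return scores
```

Both reward documents that appear high in many lists, which is exactly the common fusion principle stated above.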
But …
Different retrieved lists might contain different relevant documents
e.g., Das-Gupta and Katzer ’83, Griffiths et al. ’86, Soboroff et al. ’01, Beitzel et al. ’03
[Figure: the relevant documents r1, …, rn are spread differently across the lists L1, L2, L3 that are fused]
A cluster-based fusion approach (Kozorovitsky&Kurland ’11)

Let similar documents across the lists provide relevance-status support to each other
• cluster hypothesis
• utilize information induced from clusters of similar documents that are created across the lists

[Figure: clusters of similar documents are formed across the lists L1, L2, L3 before fusion]
Fusion model (Kozorovitsky&Kurland ’11)

D_init: the set of documents in the lists; Cl(D_init): their clusters

Starting point: use clusters as proxies for documents
  p(d|q) = Σ_{c∈Cl(D_init)} p(d|c, q) p(c|q)  (cf., Kurland&Lee ’04)

Estimate: p̂(d|c, q) = (1 − λ) p(d|q) + λ p(d|c)

Resultant fusion function:
  F_ClustFuse(d) ≜ (1 − λ) p(d|q) + λ Σ_{c∈Cl(D_init)} p(c|q) p(d|c)
ClustFuse (Kozorovitsky&Kurland ’11)

Document d is rewarded based on its
• standard fusion score
  • reflects the extent to which d is highly ranked in many of the lists
• similarity to clusters that contain documents that are highly ranked in many of the lists
• λ=0 amounts to the standard fusion method (F) that ClustFuse incorporates

  F_ClustFuse(d) ≜ (1 − λ) p(d|q) + λ Σ_{c∈Cl(D_init)} p(c|q) p(d|c)

The estimates:
  p(d|q) is estimated by F(d; q) / Σ_{d'∈D_init} F(d'; q)
  p(c|q) by Π_{d∈c} F(d; q) / Σ_{c'∈Cl(D_init)} Π_{d'∈c'} F(d'; q)
  p(d|c) by Σ_{d'∈c} sim(d', d) / Σ_{d_i∈D_init} Σ_{d'∈c} sim(d', d_i)

cf., the interpolation model (Kurland&Lee ’04, ‘09)
MAP performance of fusing TREC runs (Kozorovitsky&Kurland ’11)

Fusing 3 randomly selected TREC runs (run1 is the best performing among the three)

[Bar chart of MAP for run1, CombMNZ, ClustFuseCombMNZ, Borda and ClustFuseBorda on trec3; ‘r’ and ‘f’ mark statistically significant differences with run1 and the standard fusion method, respectively]
Optimal clusters in the fusion setting (Kozorovitsky&Kurland ’11)
• OptCluster is the optimal cluster among all clusters created from all the documents in the 3 runs that are fused (run1, run2, run3)
• OptCluster(runi) is the optimal cluster among clusters created from runi
‘a’, ’b’ and ‘c’ mark statistically significant differences with run1, run2, and run3, respectively

Cluster-based fusion (Kozorovitsky&Kurland ’11)
• Re-ranking each run using query-specific clusters created from the run, and then fusing the runs (cf. Zhang et al. ’01), yields performance that is inferior to that of using clusters created over all the runs
• The lower the overlap between relevant documents in the runs, the more benefit we gain from applying cluster-based fusion
Cluster-based federated search (Khalman&Kurland ’12)
• Retrieving the lists from disjoint corpora (federated/distributed search)
• Crestani and Wu (’06) showed that in the federated search setting there exist clusters that contain a high percentage of relevant documents

                   p@5    p@10
  CORI  init       46.0   42.0
        ClustFuse  53.6   49.8
  SSL   init       43.2   41.0
        ClustFuse  50.4   49.0
Cluster-based query expansion
• Treating clusters as pseudo queries (Kurland et al. ’05)
• Using cluster-based (or topic-based) smoothed document language models for both constructing an expanded query form and for ranking (Liu&Croft ’04, Tao et al. ’06, Wei&Croft ‘06)
• Constructing a query-expanded form by rewarding top-retrieved documents that are members of many query-specific overlapping clusters (Lee et al. ’08)
• Using top-retrieved clusters instead of (or in addition to) top-retrieved documents for constructing an expanded query form (Na et al. ’07, Gelfer Kalmanovich&Kurland ’09)
• Cluster-based query expansion for federated search (Shokouhi et al. ’09)
Cluster-based results diversification
• e.g., Maximal Marginal Relevance (Carbonell&Goldstein ’98)
• Let R be the result list of the documents most highly ranked using Sim1(q, d)
• Let S be the new list we create from R (i.e., re-ranking); the order of inserting documents to S is the induced ranking
• The next document inserted to S is
  argmax_{d∈R\S} [λ Sim1(q, d) − (1 − λ) max_{d_i∈S} Sim2(d, d_i)]
• A cluster-based approach: estimate Sim1(q, d) using a cluster ranking approach (He et al. ’11, Raiber&Kurland ’13)
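MMR’s greedy selection can be sketched as follows, assuming precomputed Sim1 (query-document) and Sim2 (document-document, keyed by ordered pairs) values; the function name and data layout are mine:

```python
def mmr_rerank(candidates, sim1, sim2, lam=0.5, n=None):
    """Maximal Marginal Relevance sketch: repeatedly pick the document
    maximizing lam*Sim1(q,d) - (1-lam)*max_{d_i in S} Sim2(d, d_i),
    trading off query relevance against redundancy with already-selected
    documents."""
    remaining, selected = set(candidates), []
    n = n if n is not None else len(candidates)
    while remaining and len(selected) < n:
        def mmr(d):
            redundancy = max((sim2[(d, s)] for s in selected), default=0.0)
            return lam * sim1[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam=1 this degenerates to ranking by Sim1 alone; smaller values push near-duplicates of already-selected documents down the list.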
Cluster-based results diversification (cont.)
• A second cluster-based approach: Cluster R and pick documents from the clusters (e.g., in a round robin fashion), which are viewed as potential aspects
  • e.g., Leelanupab et al. ’10
Utilizing relevance feedback using document clusters
• Shanahan et al. ’03 found that the optimal cluster is a very good basis for relevance feedback
  • Re-emphasizing claims in older literature (e.g., Jardine&Rijsbergen ’71, Croft ’80) about the motivation to find good clusters
• Active relevance feedback (Shen&Zhai ’05)
  • Diversify the feedback set by picking documents from query-specific clusters of top-retrieved results
  • Baseline: asking for feedback for the top-k retrieved documents
• Interactive retrieval
  • Ivie ’66, Hearst&Pedersen ’95, Leuski ’01
Using clusters for query-performance prediction (QPP)
• The query-performance prediction task: estimating the effectiveness of a search performed in response to a query, in the absence of relevance judgments (Carmel&Yom-Tov ’10)
• The clustering tendency of the results is an indicator of search effectiveness (Vinay et al. ’06)
• The extent to which the retrieval scores of documents “respect” the cluster hypothesis is an effective query-performance predictor (Diaz ’07)
On the connection between cluster ranking and query-performance prediction (Kurland et al. ’12)
• Cluster ranking: estimate the probability that a cluster (set of documents) is relevant to a query
• Query-performance prediction (QPP): estimate the probability that a result list (ranked list of documents) is relevant to a query
• As it turns out, quite a few QPP and cluster ranking methods are based on the exact same principles
  • The geometric mean of retrieval scores in a result list is a high-quality performance predictor (Zhou&Croft ’07)
  • The geometric mean of retrieval scores in a cluster is an effective criterion for ranking clusters (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
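The shared geometric-mean idea is simple to state in code. A minimal sketch, assuming strictly positive retrieval scores (language-model log-probability scores would need to be exponentiated or otherwise mapped to positive values first):

```python
import math

def geometric_mean_score(scores):
    """Geometric mean of a list of positive retrieval scores.

    Computed in the log domain to avoid underflow on long lists.
    Applied to a result list it serves as a performance predictor;
    applied to a cluster's documents it serves as a cluster-ranking
    criterion.
    """
    assert all(s > 0 for s in scores), "scores must be positive"
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```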
Cluster-based retrieval – intermediate summary
• Using clusters to select documents
• Using clusters (or topic models) to enrich (smooth) document representations
• Offline vs. query-specific clustering
• Soft vs. hard clustering
• The optimal cluster
• Cluster-based fusion and federated search
• Cluster-based query expansion
• Cluster-based results diversification
• Using document clusters to utilize relevance feedback
• The connection between query-performance prediction and cluster ranking
Graph-based methods utilizing inter-document similarities
Graph-based framework for re-ranking (Kurland&Lee ’05, ’10)
Inspiration: web retrieval
Common approach to web retrieval:
• Re-rank an initially retrieved list D_init of documents by degree of centrality (Brin&Page ’98, Kleinberg ’99)
• Centrality of a document is estimated using explicit hyperlink structure (PageRank, HITS)
Can we use the scoring-by-centrality approach for ranking non-hypertext documents?
A possible strategy: structural re-ranking
• Use inter-document similarities to infer links between documents in D_init
• On the resultant graph (of documents and induced links), define centrality measures and use them as criteria for ranking

How to induce links? One might suggest: the Vector Space Model (VSM) for information representation and the cosine as the similarity measure
Erkan&Radev ’04: text summarization
• Cosine similarity between sentences
• See the book: Rada Mihalcea and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge University Press, 2011.
but…

Inducing links
• Consider two documents: d1 = “Dublin Portland Beijing” (a “flat” term distribution) and d2 = “Portland Portland Portland” (a “spiky” term distribution)
• d2 relevant ⇒ d1 relevant, but d1 relevant ⇏ d2 relevant
• VSM: cos(d1,d2) = cos(d2,d1); the cosine is symmetric and misses this
• LMs: p(d2|d1) > p(d1|d2); language-model generation probabilities capture the asymmetry
Zhang et al. (’05) used asymmetric cosine-based edge weights, but these were found by Kurland&Lee (’10) to be somewhat less effective than the language-model-based weights
Generation graphs
For each document o ∈ D_init:
TopGen(o) ≜ the k documents g ∈ D_init that yield the highest p(o|g)
g ∈ TopGen(o): g is a “generator” of o (o is an offspring of g)

The complete graph G(D_init, D_init × D_init) with edge weights:
wt(o → g) ≜ δ[g ∈ TopGen(o)] · p(o|g)

The smoothed (Brin&Page ’98) complete graph with edge weights:
wt_[λ](o → g) ≜ λ · 1/|D_init| + (1 − λ) · wt(o → g) / Σ_{g′∈D_init} wt(o → g′)
Inducing centrality: the Recursive Weighted Influx (RWI) algorithm
• Smoothed graph G_[λ]: an ergodic Markov chain, so the power method converges
• The Recursive Weighted Influx algorithm is a weighted analog of PageRank

Cen_RWI(d; G_[λ]) = Σ_{o∈D_init} wt_[λ](o → d) · Cen_RWI(o; G_[λ])
Σ_{d∈D_init} Cen_RWI(d; G_[λ]) = 1
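The smoothed-graph construction and the power-method computation of centrality can be sketched as follows. The `p_gen` matrix is a toy stand-in for the generation probabilities p(o|g); an actual implementation would compute them from smoothed document language models.

```python
import numpy as np

def smoothed_transition_matrix(p_gen, k=2, lam=0.1):
    """Build the smoothed generation-graph transition matrix.

    p_gen[o, g]: toy stand-in for p(o|g), the probability that
    document g's language model generates document o.
    Row o keeps edges only to its k top generators, is normalized,
    and is mixed with a uniform jump (weight lam).
    """
    n = p_gen.shape[0]
    wt = np.zeros((n, n))
    for o in range(n):
        scores = p_gen[o].copy()
        scores[o] = 0.0                   # no self-generation edge
        top = np.argsort(scores)[-k:]     # TopGen(o): k best generators
        wt[o, top] = scores[top]
        wt[o] = lam / n + (1 - lam) * wt[o] / wt[o].sum()
    return wt

def rwi_centrality(trans, iters=200):
    """Power method on the ergodic chain; the stationary distribution
    gives the Recursive Weighted Influx centrality scores."""
    cen = np.full(trans.shape[0], 1.0 / trans.shape[0])
    for _ in range(iters):
        cen = cen @ trans
    return cen
```

Because every entry of the smoothed matrix is positive, the chain is ergodic and the iteration converges to a unique stationary distribution.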
The language modeling framework
Algorithm: Recursive Weighted Influx + LM
Score by: Cen(d; G) · p(q|d), where centrality acts as a document “prior” and p(q|d) provides the initial ranking; cf. p(d) · p(q|d)
Lafferty&Zhai ’01: “with hypertext, [a document prior] might be the distribution calculated using the ‘PageRank’ scheme”
Evaluation
• D_init: 50 documents retrieved by an optimized language-model-based retrieval method
• Evaluation measure: precision@5
• Reference comparison: the initial ranking
• Can we push relevant documents into the top 5 ranks and move non-relevant documents out of them?
LM framework with centrality scores as “priors”
[Figure: precision@5 of RW-Influx+LM vs. the initial ranking (init rank) on AP, TREC8, WSJ, and AP89; * marks a statistically significant difference with the initial ranking]
Comparing centrality measures (e.g., Miller et al. ’99; doc length as “prior”)
[Figure: precision@5 on AP, TREC8, WSJ, and AP89 using uniform, log(length), and RW-Influx document priors; * marks a statistically significant difference with the initial ranking]
Cosine vs. LM probabilities
[Figure: precision@5 on AP, TREC8, WSJ, and AP89 for the initial ranking, RW-In+LM(COS), and RW-In+LM(LM); * marks a statistically significant difference with the initial ranking]
Relevance score propagation (cf. Otterbacher et al. ’05)
For each document o ∈ D_init:
TopGen(o) ≜ the k documents g ∈ D_init that yield the highest p(o|g)

The complete graph G(D_init, D_init × D_init) with edge weights:
wt(o → g) ≜ δ[g ∈ TopGen(o)] · p(o|g)

The smoothed (Brin&Page ’98) complete graph with edge weights, with the uniform “random jump” replaced by a query-similarity bias:
wt_[λ](o → g) ≜ λ · sim(q,g) / Σ_{g′∈D_init} sim(q,g′) + (1 − λ) · wt(o → g) / Σ_{g′∈D_init} wt(o → g′)
Label propagation (Yang et al. ’06)
• Treat the query and the documents in the highest-ranked (query-specific) cluster as relevant
• Treat the documents at the tail of the result list as non-relevant
• Apply Zhu&Ghahramani’s (’02) label propagation algorithm:
Y: the vector of documents’ labels (relevant / non-relevant / unlabeled)
d_ij: the distance between docs i and j; σ: regularization factor

w_ij = exp(−d_ij² / σ²)
T_ij = P(j → i) = w_ij / Σ_k w_kj

The algorithm:
1. Propagate: Y ← TY
2. Row-normalize Y
3. Clamp the labeled data and go to step 1 until convergence
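The propagation loop can be sketched as follows. This is a simplified two-class variant with toy distance inputs, not the exact experimental setup of Yang et al.; the `sigma` and `iters` values are illustrative.

```python
import numpy as np

def label_propagation(dist, labels, sigma=1.0, iters=50):
    """Zhu&Ghahramani-style label propagation (simplified sketch).

    dist: (n, n) symmetric matrix of inter-document distances
    labels: +1 pseudo-relevant, -1 pseudo-non-relevant, 0 unlabeled
    Returns a relevance score in [0, 1] for every document.
    """
    w = np.exp(-dist ** 2 / sigma ** 2)       # w_ij = exp(-d_ij^2 / sigma^2)
    t = w / w.sum(axis=0, keepdims=True)      # T_ij = w_ij / sum_k w_kj
    n = len(labels)
    y = np.full((n, 2), 0.5)                  # columns: (relevant, non-relevant)
    y[labels == 1] = [1.0, 0.0]
    y[labels == -1] = [0.0, 1.0]
    clamp = y[labels != 0].copy()
    for _ in range(iters):
        y = t @ y                             # 1. propagate: Y <- TY
        y /= y.sum(axis=1, keepdims=True)     # 2. row-normalize Y
        y[labels != 0] = clamp                # 3. clamp the labeled data
    return y[:, 0]                            # relevance score per document
```

Unlabeled documents end up with scores between the two clamped extremes, pulled toward whichever labeled documents they are closer to.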
Label propagation (Yang et al. ’06)
Experiments with TREC8; Okapi BM25 is the initial ranking method
M: size of the initial list; K: size of the base set from which pseudo-relevant documents are selected; N: # of pseudo non-relevant documents
Document-cluster graphs (Kurland&Lee ’06)
• The cluster-document duality (for query-specific clusters)
  • Clusters that are most representative of the information need contain (are associated with) many relevant documents
  • Clusters that contain (are associated with) many relevant documents are most representative of the information need
[Diagram: bipartite document-cluster graphs. With c → d edges, documents are authorities and clusters are hubs; with d → c edges, documents are hubs and clusters are authorities. A doc-only graph uses d → d edges.]
Re-ranking using document centrality (query-specific clusters)
[Figure: precision@5 on AP, TREC8, and WSJ for the initial ranking, authority scores on the doc-only graph (auth[d→d]), PageRank scores on the doc-only graph (PR[d→d]), and authority scores on the document-as-authority graph (auth[c→d]), using nearest-neighbor clusters; * marks a significant difference between auth[c→d] and auth[d→d]]
Cluster centrality
The percentage of relevant documents in the highest-ranked cluster of 5 documents:

                              AP      TREC8   WSJ
query likelihood (sim(q,c))   39.2%   39.6%   44%
influx[d→c]                   48.7%   48%     51.2%
auth[d→c]                     49.5%   50.8%   53.6%

auth[d→c] differs significantly from query likelihood on all three collections; influx[d→c] on two of them
Passage-based document retrieval
• Motivation for using passage-based information: long and/or topically heterogeneous documents that are relevant to a query might contain only a single (short) passage with query-pertaining information

The InterPsgDoc method (Callan ’94):
Score(d; q) ≜ λ·sim(q,d) + (1−λ)·max_{g_i∈d} sim(q,g_i)
g_i (∈ d) is a passage in document d
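Once the document-query and passage-query similarity scores are available, InterPsgDoc is a one-line interpolation; a minimal sketch with placeholder score inputs:

```python
def inter_psg_doc_score(sim_q_doc, sim_q_passages, lam=0.5):
    """InterPsgDoc (Callan '94): interpolate the document-query score
    with the score of the document's best-matching passage.

    sim_q_doc: precomputed sim(q, d)
    sim_q_passages: precomputed sim(q, g_i) for each passage g_i in d
    """
    return lam * sim_q_doc + (1 - lam) * max(sim_q_passages)
```

With λ = 1 the passage evidence is ignored; with λ = 0 a document is scored purely by its best passage, which suits long heterogeneous documents.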
Passage-document graphs (Bendersky&Kurland ’08)
• Re-ranking an initially retrieved list
• Documents in the list are hubs; passages of documents in the list are authorities
• Score(d; q) ≜ sim(q,d) · max_{g∈d} Centrality(g)
  Centrality(g) is g’s influx (weighted in-degree) or authority value (induced by the HITS algorithm)
• See Krikon et al. (’10) for simultaneously using doc-only and passage-only graphs
• See Krikon&Kurland (’11) for integrating documents, passages, and clusters
Score regularization (Diaz ’05, ’07)
• The idea: similar documents should be assigned similar retrieval scores
• But, maintain some “consistency” with the initial retrieval scores
  • Otherwise, a flat score distribution would be optimal
• Could be viewed as “iterative score smoothing”
• f: vector of regularized scores
• S(f): score function associated with inter-document consistency of scores; penalizes large differences between scores of similar documents
• Υ(f): consistency with the original scores (L2 distance)
• Objective function:
  f* = argmin_{f ∈ ℝⁿ} Q(f) ≜ S(f) + μ·Υ(f)
y = (y_1,…,y_n): the vector of original retrieval scores of the n documents to be re-ranked (i.e., the documents ranked highest by the initial search)
f = (f_1,…,f_n): the vector of regularized (new) scores
W: affinity matrix (W_ij is the affinity between doc i and doc j; W_ii = 0)
D: diagonal matrix with D_ii = Σ_j W_ij

S(f) = Σ_{i,j=1}^n W_ij · (f_i/√D_ii − f_j/√D_jj)²
Υ(f) = Σ_{i=1}^n (f_i − y_i)²

f* = argmin_{f ∈ ℝⁿ} Q(f) ≜ S(f) + μ·Υ(f)

There is a closed-form solution
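Since Q(f) is quadratic in f, setting its gradient to zero gives the closed form directly. A minimal NumPy sketch, assuming a symmetric affinity matrix W with zero diagonal and positive row sums; the gradient constants follow the S(f) and Υ(f) definitions above.

```python
import numpy as np

def regularize_scores(w, y, mu=1.0):
    """Closed-form score regularization (a sketch of Diaz's formulation).

    Minimizes Q(f) = S(f) + mu * ||f - y||^2, where
    S(f) = sum_ij W_ij (f_i/sqrt(D_ii) - f_j/sqrt(D_jj))^2
         = 2 f^T (I - D^-1/2 W D^-1/2) f.
    """
    deg = w.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    s = d_inv_sqrt @ w @ d_inv_sqrt          # normalized affinity
    n = len(y)
    # grad Q(f) = 4 (I - s) f + 2 mu (f - y) = 0  =>  solve for f
    a = 4.0 * (np.eye(n) - s) + 2.0 * mu * np.eye(n)
    return np.linalg.solve(a, 2.0 * mu * y)
```

Large μ keeps the regularized scores close to the originals; small μ smooths the scores of similar documents toward each other.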
Score regularization – empirical results (Diaz ’07)
Additional approaches (1)
• Daniłowicz and Baliński (’01) define a Markov chain over a graph where documents are vertices and edge weights are based on query-sensitive inter-document similarities
• The more similar two documents, the higher the edge weight
• The smaller the difference between the original retrieval scores of two documents, the higher the edge weight
• Matveeva ‘04
Additional approaches (2)
• Spreading activation networks (Salton&Buckley ’88)
• Markov chains (Baliński&Daniłowicz ’05)
• Hyperlinks and inter-document similarities
  • Biasing PageRank using query-similarity values (Richardson&Domingos ’04)
    • Both the random jump and the jump via hyperlinks
    • The bias can be based on inter-document similarities
  • Are semantically related (hyper)links more effective for retrieval? (Koolen&Kamps ’11)
• Graph-based fusion (Kozorovitzky&Kurland ’09, ’11)
• Using random walks with absorbing states to diversify search results (Zhu et al. ’07)
• Term-based graphs (out of the scope of this tutorial)
Adversarial search
• Mishne et al. ’05
  • Finding spam comments in blogs by comparing the language models of (i) the comment, (ii) the pages to which the comment has outgoing links, and (iii) the blog post
• Benczúr et al. ’06
  • Finding nepotistic links by comparing a language model induced from the anchor text with one induced from the target page
  • A similar approach was used by Martinez-Romo&Araujo ’09
• Raiber et al. ’12
  • Using inter-document similarities to address keyword stuffing
Concluding notes
• The cluster hypothesis gave rise to much work in the IR field. We surveyed:
  • Cluster hypothesis tests
  • Cluster-based document retrieval
    • Using clusters to select documents
    • Using clusters (or topic models) to enrich (smooth) document representations
  • Graph-based methods that utilize inter-document similarities for ad hoc retrieval
  • Applications/tasks for which cluster-based or graph-based methods have been used
    • Query-performance prediction, fusion, federated search, microblog retrieval, results diversification, query expansion, relevance feedback, adversarial search
Some open challenges
• The optimal cluster
  • The cluster-ranking challenge
• Selective application of cluster-based and document-based retrieval
• Devising query-sensitive inter-document similarity measures that will result in the cluster hypothesis holding to a larger extent
• Devising additional graph-based centrality measures that are correlated with relevance