Query Performance Prediction for IR
David Carmel, IBM Haifa Research Lab
Oren Kurland, Technion
SIGIR Tutorial, Portland, Oregon, August 12, 2012
Instructors
- Dr. David Carmel
  - Research Staff Member at the Information Retrieval group at IBM Haifa Research Lab
  - Ph.D. in Computer Science from the Technion, Israel, 1997
  - Research interests: search in the enterprise, query performance prediction, social search, and text mining
  - [email protected], https://researcher.ibm.com/researcher/view.php?person=il-CARMEL
- Dr. Oren Kurland
  - Senior lecturer at the Technion - Israel Institute of Technology
  - Ph.D. in Computer Science from Cornell University, 2006
  - Research interests: information retrieval
  - http://iew3.technion.ac.il/~kurland
- This tutorial is based in part on the book "Estimating the Query Difficulty for Information Retrieval", Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers
- The tutorial presents the opinions of the presenters only, and does not necessarily reflect the views of IBM or the Technion
- Algorithms, techniques, features, etc. mentioned here might or might not be in use by IBM
Query Difficulty Estimation – Main Challenge
Estimating query difficulty is an attempt to quantify the quality of search results retrieved for a query from a given collection of documents, when no relevance feedback is given.

- Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor
- Understanding why some queries are inherently more difficult than others is essential for IR
- A good answer to this question will help search engines reduce the variance in performance
Estimating the Query Difficulty – Some Benefits
- Feedback to users:
  - The IR system can provide users with an estimate of the expected quality of the results retrieved for their queries
  - Users can then rephrase "difficult" queries, or resubmit a "difficult" query to alternative search resources
- Feedback to the search engine:
  - The IR system can invoke alternative retrieval strategies for different queries according to their estimated difficulty
  - For example, intensive query analysis procedures may be invoked selectively for difficult queries only
- Feedback to the system administrator:
  - For example, administrators can identify missing content queries
  - Then expand the collection of documents to better answer these queries
- For IR applications:
  - For example, a federated (distributed) search application
  - Merging the results of queries employed distributively over different datasets
  - Weighting the results returned from each dataset by the predicted difficulty
Contents
- Introduction - The Robustness Problem of (ad hoc) Information Retrieval
- Basic Concepts
- Query Performance Prediction Methods
  - Pre-Retrieval Prediction Methods
  - Post-Retrieval Prediction Methods
  - Combining Predictors
- A Unified Framework for Post-Retrieval Query-Performance Prediction
- A General Model for Query Difficulty
- A Probabilistic Framework for Query-Performance Prediction
- Applications of Query Difficulty Estimation
- Summary
- Open Challenges
Introduction - The Robustness Problem of Information Retrieval
The Robustness problem of IR
- Most IR systems suffer from a radical variance in retrieval performance when responding to users' queries
  - Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor
  - This may lead to user dissatisfaction
- Variability in performance relates to various factors:
  - The query itself (e.g., term ambiguity: "golf")
  - The vocabulary mismatch problem - the discrepancy between the query vocabulary and the document vocabulary
  - Missing content queries - there is no relevant information in the corpus that can satisfy the information need
An example of a difficult query: "The Hubble Telescope achievements"

Retrieved results deal with issues related to the Hubble telescope project in general, but the gist of the query, achievements, is lost:

- Great eye sets sights sky high
- Simple test would have found flaw in Hubble telescope
- Nation in brief
- Hubble space telescope placed aboard shuttle
- Cause of Hubble telescope defect reportedly found
- Flaw in Hubble telescope
- Flawed mirror hampers Hubble space telescope
- Touchy telescope torments controllers
- NASA scrubs launch of Discovery
- Hubble builders got award fees, magazine says
The variance in performance across queries and systems (TREC-7)
- Queries are sorted in decreasing order according to the average precision attained among all TREC participants (green bars)
- The performance of two different systems per query is shown by the two curves
The Reliable Information Access (RIA) workshop
- The first attempt to rigorously investigate the reasons for performance variability across queries and systems
- Extensive failure analysis of the results of:
  - 6 IR systems
  - 45 TREC topics
- Main reason for failures: the systems' inability to identify all important aspects of the query
  - The failure to emphasize one aspect of a query over another, or to emphasize one aspect and neglect other aspects
  - "What disasters have occurred in tunnels used for transportation?"
  - Emphasizing only one of these terms will deteriorate performance, because each term on its own does not fully reflect the information need
- If systems could estimate which failure categories the query may belong to, they could apply specific automated techniques that correspond to the failure mode in order to improve performance
Instability in Retrieval - TREC's Robust Tracks
- The diversity in performance among topics and systems led to the TREC Robust tracks (2003-2005)
  - Encouraging systems to decrease variance in query performance by focusing on poorly performing topics
- Systems were challenged with 50 old TREC topics found to be "difficult" for most systems over the years
  - A topic is considered difficult when the median of the average precision scores of all participants for that topic is below a given threshold
- A new measure, GMAP, uses the geometric mean instead of the arithmetic mean when averaging precision values over topics
  - It emphasizes the lowest performing topics, and is thus a useful measure that can attest to the robustness of a system's performance
The Robust Tracks – Decreasing Variability across Topics
- Several approaches to improving the poor effectiveness for some topics were tested:
  - Selective query processing strategies based on performance prediction
  - Post-retrieval reordering
  - Selective weighting functions
  - Selective query expansion
- None of these approaches was able to show consistent improvement over traditional non-selective approaches
- Apparently, expanding the query with appropriate terms extracted from an external collection (the Web) improves the effectiveness for many queries, including poorly performing queries
The Robust Tracks - Query Performance Prediction

- As a second challenge, systems were asked to predict their performance for each of the test topics
- The TREC topics were then ranked:
  - First, by their predicted performance value
  - Second, by their actual performance value
- Evaluation was done by measuring the similarity between the predicted performance-based ranking and the actual performance-based ranking
- Most systems failed to exhibit reasonable prediction capability
  - 14 runs had a negative correlation between the predicted and actual topic rankings, demonstrating that performance prediction is intrinsically difficult
- On the positive side, the difficulty of developing reliable prediction methods raised the awareness of the IR community to this challenge
How difficult is the performance prediction task for human experts?
- TREC-6 experiment: estimating whether human experts can predict query difficulty
  - A group of experts was asked to classify a set of TREC topics into three degrees of difficulty (easy, medium, hard) based on the query expression only
  - The manual judgments were compared to the median of the average precision scores, as determined after evaluating the performance of all participating systems
- Results:
  - The Pearson correlation between the expert judgments and the "true" values was very low (0.26)
  - The agreement between experts, as measured by the correlation between their judgments, was very low too (0.39)

The low correlation illustrates how difficult this task is and how little is known about what makes a query difficult
How difficult is the performance prediction task for human experts? (contd.)
Hauff et al. (CIKM 2010) found:
- a low level of agreement between humans with regard to which queries are more difficult than others (median kappa = 0.36)
  - there was high variance in the ability of humans to estimate query difficulty, although they shared "similar" backgrounds
- a low correlation between true performance and humans' estimates of query difficulty (performed in a pre-retrieval fashion)
  - the median Kendall's tau was 0.31, quite a bit lower than that posted by the best performing pre-retrieval predictors
  - however, overall the humans did manage to differentiate between "good" and "bad" queries
- a low correlation between humans' predictions and those of query-performance predictors, with some exceptions

These findings further demonstrate how difficult the query-performance prediction task is and how little is known about what makes a query difficult
Are queries found to be difficult in one collection still considered difficult in another collection?
- Robust track 2005: difficult topics from the ROBUST collection were tested against another collection (AQUAINT)
  - The median average precision over the ROBUST collection is 0.126
  - Compared to 0.185 for the same topics over the AQUAINT collection
- Apparently, the AQUAINT collection is "easier" than the ROBUST collection, probably due to:
  - Collection size
  - Many more relevant documents per topic in AQUAINT
  - Document features such as structure and coherence
- However, the relative difficulty of the topics is preserved over the two datasets
  - The Pearson correlation between topics' performance on both datasets is 0.463
  - This illustrates some dependency between the topics' median scores on both collections

Even when topics are somewhat easier in one collection than another, the relative difficulty among topics is preserved, at least to some extent
Basic concepts
The retrieval task
- Given:
  - A document set D (the corpus)
  - A query q
- Retrieve Dq (the result list), a ranked list of documents from D, which are most likely to be relevant to q
- Some widely used retrieval methods:
  - Vector space tf-idf based ranking, which estimates relevance by the similarity between the query and a document in the vector space
  - The probabilistic Okapi BM25 method, which estimates the probability that the document is relevant to the query
  - Language-model-based approaches, which estimate the probability that the query was generated by a language model induced from the document
  - And more: divergence from randomness (DFR) approaches, inference networks, Markov random fields (MRF), ...
Text REtrieval Conference (TREC)
- A series of workshops for large-scale evaluation of (mostly) text retrieval technology:
  - Realistic test collections
  - Uniform, appropriate scoring procedures
  - Started in 1992
- A TREC task usually comprises:
  - A document collection (corpus)
  - A list of topics (information needs)
  - A list of relevant documents for each topic (QRELs)
- An example topic:
  - Title: African Civilian Deaths
  - Description: How many civilian non-combatants have been killed in the various civil wars in Africa?
  - Narrative: A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
Precision measures
- Precision at k (P@k): the fraction of relevant documents among the top-k results
- Average precision (AP): the average of the precision values computed at the ranks of each of the relevant documents in the ranked list:

  $AP(q) = \frac{1}{|R_q|} \sum_{r \in R_q} P@rank(r)$

  where $R_q$ is the set of documents in the corpus that are relevant to $q$.

Average precision is usually computed using a ranking which is truncated at some position (typically 1000 in TREC).
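A minimal sketch of these two measures in Python (the function and variable names are ours, not from the tutorial):

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of relevant documents among the top-k results."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """AP: average of P@rank(r) over the ranks of the relevant documents.

    Relevant documents missing from the (truncated) ranking contribute
    zero to the sum, as in TREC's trec_eval.
    """
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0
```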
Prediction quality measures
- Given:
  - Query q
  - Result list Dq that contains n documents
- Goal: estimate the retrieval effectiveness of Dq, in terms of satisfying Iq, the information need behind q
  - Specifically, the prediction task is to predict AP(q) when no relevance information (Rq) is given
- In practice: estimate, for example, the expected average precision for q
- The quality of a performance predictor can be measured by the correlation between the predicted average precision values and the corresponding actual values
Measures of correlation
- Linear correlation (Pearson)
  - considers the true AP values as well as the predicted AP values of queries
- Rank correlation (Kendall's tau, Spearman's rho)
  - considers only the ranking of queries by their true AP values and by their predicted AP values
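For example, given per-query predicted and actual AP values, all three measures are available in SciPy (the numbers below are toy values for illustration):

```python
from scipy.stats import pearsonr, kendalltau, spearmanr

predicted = [0.42, 0.13, 0.55, 0.30, 0.21]  # predictor output per query (toy)
actual = [0.38, 0.09, 0.61, 0.25, 0.33]     # true AP per query (toy)

print("Pearson's r:   ", pearsonr(predicted, actual)[0])
print("Kendall's tau: ", kendalltau(predicted, actual)[0])
print("Spearman's rho:", spearmanr(predicted, actual)[0])
```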
Evaluating a prediction system
[Figure: scatter plot of predicted AP vs. actual AP for queries Q1-Q5]

- Different correlation metrics measure different things
- If there is no a priori reason to use Pearson, prefer Spearman or Kendall's tau
- In practice, the difference is not large, because there are enough queries and because of the distribution of AP
Evaluating a prediction system (contd.)
- It is important to note that state-of-the-art query-performance predictors might not be correlated (at all) with measures of users' performance (e.g., the time it takes to reach the first relevant document)
  - see Turpin and Hersh ADC '04 and Zhao and Scholer ADC '07
- However, this finding might be attributed, as suggested by Turpin and Hersh in ADC '04, to the fact that standard evaluation measures (e.g., average precision) and users' performance are not always strongly correlated
  - Hersh et al. '00, Turpin and Hersh '01, Turpin and Scholer '06, Smucker and Parkash Jethani '10
Query Performance Prediction Methods
Performance Prediction Methods

- Pre-Retrieval Methods
  - Linguistics: morphologic, syntactic, semantic
  - Statistics: specificity, similarity, coherency, relatedness
- Post-Retrieval Methods
  - Clarity
  - Score analysis: top score, avg. score, variance of scores, cluster hypothesis
  - Robustness: query perturbation, document perturbation, retrieval perturbation
Pre-Retrieval Prediction Methods
- Pre-retrieval prediction approaches estimate the quality of the search results before the search takes place
  - They provide an effective instantiation of query performance prediction for search applications that must respond efficiently to search requests
  - Only the query terms, associated with some pre-defined statistics gathered at indexing time, can be used for prediction
- Pre-retrieval methods can be split into linguistic and statistical methods
  - Linguistic methods apply natural language processing (NLP) techniques and use external linguistic resources to identify ambiguity and polysemy in the query
  - Statistical methods analyze the distribution of the query terms within the collection
Linguistic Approaches (Mothe & Tanguy, SIGIR 2005 QD workshop; Hauff 2010)
- Most linguistic features do not correlate well with system performance
  - Features include, for example: morphological (avg. # of morphemes per query term), syntactic link span (which relates to the average distance between query words in the parse tree), and semantic (polysemy: the avg. # of synsets per word in the WordNet dictionary)
  - Only the syntactic link span and the polysemy value were shown to have some (low) correlation
- This is quite surprising, as intuitively poor performance can be expected for ambiguous queries
  - Apparently, term ambiguity should be measured using corpus-based approaches, since a term that might be ambiguous with respect to the general vocabulary may have only a single interpretation in the corpus
Pre-retrieval Statistical methods
- Analyze the distribution of the query term frequencies within the collection
- Two major term statistics:
  - Inverse document frequency: idf(t) = log(N/N_t)
  - Inverse collection term frequency: ictf(t) = log(|D|/tf(t,D))
- Specificity-based predictors measure the query terms' distribution over the collection:
  - avgIDF, avgICTF: queries composed of infrequent terms are easier to satisfy
  - maxIDF, maxICTF: similarly
  - varIDF, varICTF: low variance reflects the lack of dominant terms in the query
- A query composed of non-specific terms is deemed to be more difficult
  - e.g., "Who and Whom"
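A sketch of the IDF-based variants (our own illustration; `df` is an assumed mapping from term to document frequency, and the ICTF variants are analogous with collection term frequencies):

```python
import math

def idf(term, df, n_docs):
    """idf(t) = log(N / N_t); terms absent from the corpus get 0."""
    return math.log(n_docs / df[term]) if df.get(term) else 0.0

def specificity_predictors(query_terms, df, n_docs):
    vals = [idf(t, df, n_docs) for t in query_terms]
    mean = sum(vals) / len(vals)
    return {
        "avgIDF": mean,
        "maxIDF": max(vals),
        "varIDF": sum((v - mean) ** 2 for v in vals) / len(vals),
    }
```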
Specificity-based predictors

- Query Scope (He & Ounis 2004)
  - The percentage of documents in the corpus that contain at least one query term; a high value indicates many candidates for retrieval, and thereby a difficult query
  - High query scope shows marginal prediction quality for short queries only, while for long queries its quality drops significantly
  - QS is not a "pure" pre-retrieval predictor, as it requires finding the documents containing query terms (consider dynamic corpora)
- Simplified Clarity Score (He & Ounis 2004)
  - The Kullback-Leibler (KL) divergence between the (simplified) query language model and the corpus language model:

    $SCS(q) \triangleq \sum_{t \in q} p(t|q) \log \frac{p(t|q)}{p(t|D)}$

  - SCS is strongly related to the avgICTF predictor; assuming each term appears only once in the query: SCS(q) = log(1/|q|) + avgICTF(q)
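A sketch of SCS under this formulation (our own illustration; `p_corpus` is an assumed map from term to p(t|D), precomputed at indexing time):

```python
import math
from collections import Counter

def scs(query_terms, p_corpus):
    """SCS(q) = sum over t in q of p(t|q) * log(p(t|q) / p(t|D)),
    where p(t|q) is the maximum-likelihood estimate over the query itself."""
    counts = Counter(query_terms)
    q_len = len(query_terms)
    return sum((c / q_len) * math.log((c / q_len) / p_corpus[t])
               for t, c in counts.items())
```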
Similarity, coherency, and variance-based predictors

- Similarity: SCQ (Zhao et al. 2008): high similarity to the corpus indicates effective retrieval

  $SCQ(t) \triangleq (1 + \log(tf(t,D))) \cdot idf(t)$

  - maxSCQ(q)/avgSCQ(q) are the maximum/average over the query terms
  - contradicts(?) the specificity idea (e.g., SCS)
- Coherency: CS (He et al. 2008): the average inter-document similarity between documents containing a query term, averaged over the query terms
  - Conceptually a pre-retrieval analogue of the post-retrieval autocorrelation approach (Diaz 2007) that we will discuss later
  - A demanding computation that requires the construction of a pointwise similarity matrix for all pairs of documents in the index
- Variance (Zhao et al. 2008): var(t) - the variance of term t's weights (e.g., tf.idf) over the documents containing it
  - maxVar and sumVar are the maximum/sum over the query terms
  - Hypothesis: low variance implies a difficult query, due to low discriminative power
Term Relatedness (Hauff 2010)
- Hypothesis: if the query terms co-occur frequently in the collection, we expect good performance
- Pointwise mutual information (PMI) is a popular measure of the co-occurrence statistics of two terms in the collection:

  $PMI(t_1,t_2) \triangleq \log \frac{p(t_1,t_2|D)}{p(t_1|D)\, p(t_2|D)}$

  - It requires efficient tools for gathering collocation statistics from the corpus, to allow dynamic usage at query run-time
- avgPMI(q)/maxPMI(q) measure the average and the maximum PMI over all pairs of terms in the query
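A sketch of the PMI-based predictors (our own illustration; `p_term` and `p_joint` are assumed unigram and pairwise co-occurrence probabilities gathered from the corpus):

```python
import math
from itertools import combinations

def pmi(t1, t2, p_joint, p_term):
    """PMI(t1, t2) = log( p(t1,t2|D) / (p(t1|D) * p(t2|D)) )."""
    return math.log(p_joint[(t1, t2)] / (p_term[t1] * p_term[t2]))

def avg_max_pmi(query_terms, p_joint, p_term):
    vals = [pmi(a, b, p_joint, p_term)
            for a, b in combinations(query_terms, 2)]
    if not vals:  # single-term query: no pairs to score
        return 0.0, 0.0
    return sum(vals) / len(vals), max(vals)
```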
Evaluating pre-retrieval methods (Hauff 2010)

Insights:
- maxVAR and maxSCQ dominate the other predictors, and are the most stable over the collections and topic sets
  - However, their performance drops significantly for one of the query sets (301-350)
- Prediction is harder over the Web collections (WT10G and GOV2) than over the news collection (ROBUST), probably due to the higher heterogeneity of the data
Post-retrieval Predictors
Post-Retrieval Predictors
- Analyze the search results in addition to the query
  - They are usually more complex, as the top results are retrieved and analyzed
- Prediction quality depends on the retrieval process
  - Different results are expected for the same query when using different retrieval methods
  - In contrast to pre-retrieval methods, the search results may depend on query-independent factors, such as document authority scores, search personalization, etc.
- Post-retrieval methods can be categorized into three main paradigms:
  - Clarity-based methods directly measure the coherence of the search results
  - Robustness-based methods evaluate how robust the results are to perturbations in the query, the result list, and the retrieval method
  - Score-distribution-based methods analyze the score distribution of the search results
Clarity (Cronen-Townsend et al. SIGIR 2002)
- Clarity measures the coherence (clarity) of the result list with respect to the corpus
  - Good results are expected to be focused on the query's topic
- Clarity considers the discrepancy between the likelihood of words most frequently used in the retrieved documents and their likelihood in the whole corpus
  - Good results: the language of the retrieved documents should be distinct from the general language of the whole corpus
  - Bad results: the language of the retrieved documents tends to be more similar to the general language
- Accordingly, Clarity measures the KL divergence between a language model induced from the result list and that induced from the corpus
Clarity Computation

$Clarity(q) \triangleq KL\left(p(\cdot|D_q)\,\|\,p(\cdot|D)\right) = \sum_{t \in V} p(t|D_q) \log \frac{p(t|D_q)}{p_{MLE}(t|D)}$  (KL divergence between Dq and D)

where:
- $p_{MLE}(t|X) \triangleq \frac{tf(t,X)}{|X|}$  (X's unsmoothed LM, MLE)
- $p(t|d) = \lambda\, p_{MLE}(t|d) + (1-\lambda)\, p(t|D)$  (d's smoothed LM)
- $p(q|d) \triangleq \prod_{t \in q} p(t|d)$  (query likelihood model)
- $p(d|q) = \frac{p(q|d)\, p(d)}{\sum_{d' \in D_q} p(q|d')\, p(d')} \propto p(q|d)$  (d's "relevance" to q)
- $p(t|D_q) = \sum_{d \in D_q} p(t|d)\, p(d|q)$  (Dq's LM, RM1)
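A minimal sketch of the final KL computation, assuming the result-list language model p(t|Dq) (the RM1 mixture above) and the corpus model p(t|D) have already been estimated as term-to-probability dictionaries:

```python
import math

def clarity(p_result_list, p_corpus):
    """KL divergence between the result-list LM and the corpus LM.

    p_result_list: term -> p(t|Dq), an RM1-style mixture of the smoothed
    LMs of the top-retrieved documents, weighted by p(d|q).
    p_corpus: term -> p(t|D), the corpus MLE model (covering the vocabulary).
    """
    return sum(p * math.log(p / p_corpus[t])
               for t, p in p_result_list.items() if p > 0)
```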
Example
- Consider two query variants for TREC topic 56 (from the TREC query track):
  - Query A: Show me any predictions for changes in the prime lending rate and any changes made in the prime lending rates
  - Query B: What adjustments should be made once federal action occurs?
The Clarity of (1) relevant results, (2) non-relevant results, (3) random documents

[Figure: query clarity score per TREC topic (topics 300-450), with three curves: relevant results, non-relevant results, and collection-wide random documents]
Novel interpretation of Clarity - Hummel et al. 2012 (see the Clarity Revisited poster!)

$Clarity(q) \triangleq KL\left(p(\cdot|D_q)\,\|\,p(\cdot|D)\right) = \underbrace{CE\left(p(\cdot|D_q)\,\|\,p(\cdot|D)\right)}_{\text{cross entropy}} - \underbrace{H\left(p(\cdot|D_q)\right)}_{\text{entropy}}$

$Clarity(q) = \text{Distance}(q) - \text{Diversity}(q)$
Clarity variants
- Divergence from randomness approach (Amati et al. '02)
- Emphasize query terms (Cronen-Townsend et al. '04, Hauff et al. '08)
- Only consider terms that appear in a very low percentage of all documents in the corpus (Hauff et al. '08)
  - Beneficial for noisy Web settings
Robustness
- Robustness can be measured with respect to perturbations of the:
  - Query
    - The robustness of the result list to small modifications of the query (e.g., perturbation of term weights)
  - Documents
    - Small random perturbations of the document representation are unlikely to result in major changes to documents' retrieval scores
    - If the scores of documents are spread over a wide range, then these perturbations are unlikely to result in significant changes to the ranking
  - Retrieval method
    - In general, different retrieval methods tend to retrieve different results for the same query, when applied over the same document collection
    - A high overlap in the results retrieved by different methods may be related to high agreement on the (usually sparse) set of relevant results for the query
    - A low overlap may indicate no agreement on the relevant results; hence, query difficulty
Query Perturbations
- Overlap between the query and its sub-queries (Yom-Tov et al. SIGIR 2005)
  - Observation: some query terms have little or no influence on the retrieved documents, especially in difficult queries
- The query feedback (QF) method (Zhou and Croft SIGIR 2007) models retrieval as a communication channel problem
  - The input is the query, the channel is the search system, and the set of results is the noisy output of the channel
  - A new query q' is generated from the list of results, using the terms with maximal contribution to the Clarity score; a second list of results is then retrieved for q'
  - The overlap between the two lists is used as a robustness score

[Diagram: q -> search engine -> original result list R; q' generated from R -> search engine -> new result list R'; the overlap between R and R' yields the prediction]
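The robustness score itself reduces to a simple top-k overlap between the two lists; a minimal sketch (our own illustration; the cutoff k is a free parameter):

```python
def overlap_at_k(list_a, list_b, k=50):
    """Fraction of top-k documents shared by two ranked lists; in QF,
    list_a is the original result list and list_b is the list retrieved
    for the query generated from it."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k
```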
Document Perturbation (Zhou & Croft CIKM '06)
(A conceptually similar approach, using a different technique, was presented by Vinay et al. in SIGIR '06)

- How stable is the ranking in the presence of uncertainty in the ranked documents?
  - Compare a ranked list from the original collection to the corresponding ranked list from a corrupted collection, using the same query and ranking function

[Diagram: each document d in collection D is corrupted into d' via its language model, yielding collection D'; the query q is run against both, and the resulting ranked lists d1..dk and d'1..d'k are compared]
Cohesion of the Result List - Clustering Tendency (Vinay et al. SIGIR 2006)

- The cohesion of the result list can be measured by its clustering patterns
  - Following the "cluster hypothesis", which implies that documents relevant to a given query are likely to be similar to one another
  - A good retrieval returns a single, tight cluster, while a poor retrieval returns a loosely related set of documents covering many topics
- The "clustering tendency" of the result set
  - Corresponds to the Cox-Lewis statistic, which measures the "randomness" level of the result list
  - Measured by the distance between a randomly selected document and its nearest neighbor in the result list
  - When the list contains "inherent" clusters, the distance between the random document and its closest neighbor is likely to be much larger than the distance between this neighbor and its own nearest neighbor in the list
Retrieval Method Perturbation (Aslam & Pavlu, ECIR 2007)
- Query difficulty is predicted by submitting the query to different retrieval methods and measuring the diversity of the retrieved ranked lists
  - Each ranking is mapped to a distribution over the document collection
  - The JSD distance is used to measure the diversity of these distributions
- Evaluation: the submissions of all participants to several TREC tracks were analyzed
  - The agreement between submissions highly correlates with query difficulty, as measured by the median performance (AP) of all participants
  - The more submissions are analyzed, the better the prediction quality
Score Distribution Analysis
- Often, the retrieval scores reflect the similarity of documents to a query
  - Hence, the distribution of retrieval scores can potentially help predict query performance
- Naive predictors:
  - The highest retrieval score or the mean of the top scores (Thomlison 2004)
  - The difference between query-independent scores and query-dependent scores, which reflects the "discriminative power" of the query (Bernstein et al. 2005)
Spatial autocorrelation (Diaz SIGIR 2007)
- Query performance is correlated with the extent to which the result list "respects" the cluster hypothesis
  - The extent to which similar documents receive similar retrieval scores
  - In contrast, a difficult query might be detected when similar documents are scored differently
- A document's "regularized" retrieval score is determined by the weighted sum of the scores of its most similar documents:

  $Score_{Reg}(q,d) \triangleq \sum_{d'} Sim(d,d')\, Score(q,d')$

- The linear correlation of the regularized scores with the original scores is used for query-performance prediction
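A sketch of the autocorrelation predictor, assuming the retrieval scores and a pairwise inter-document similarity matrix for the result list are given (the names and the top-n neighborhood size are our own choices):

```python
import numpy as np

def autocorrelation_predictor(scores, sim_matrix, top_n=5):
    """Pearson correlation between the original retrieval scores and
    'regularized' scores, each a similarity-weighted sum of the scores
    of the document's top-n most similar neighbors."""
    scores = np.asarray(scores, dtype=float)
    reg = np.zeros_like(scores)
    for i in range(len(scores)):
        nbrs = [j for j in np.argsort(sim_matrix[i])[::-1] if j != i][:top_n]
        reg[i] = sum(sim_matrix[i][j] * scores[j] for j in nbrs)
    return np.corrcoef(scores, reg)[0, 1]
```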
Weighted Information Gain - WIG (Zhou & Croft 2007)

- WIG measures the divergence between the mean retrieval score of the top-ranked documents and that of the entire corpus
  - The more similar these documents are to the query, with respect to the corpus, the more effective the retrieval
  - The corpus represents a general non-relevant document

  $WIG(q) \triangleq \frac{1}{k} \sum_{d \in D_q^{[k]}} \sum_{t \in q} \lambda_t \log \frac{p(t|d)}{p(t|D)}$

  where $D_q^{[k]}$ is the list of the k highest ranked documents.

- $\lambda_t$ reflects the weight of the term's type
  - When all query terms are simple keywords, this parameter collapses to $1/\sqrt{|q|}$
- WIG was originally proposed and employed in the MRF framework
  - However, for a bag-of-words representation, MRF reduces to the query likelihood model and is effective (Shtok et al. '07, Zhou '07)
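A sketch of WIG in its bag-of-words instantiation (our own illustration; `top_docs_lms` holds smoothed language models of the k top-ranked documents, so every query term has nonzero probability):

```python
import math

def wig(query_terms, top_docs_lms, p_corpus):
    """WIG(q) = (1/k) * sum over the top-k docs, sum over t in q, of
    lambda_t * log(p(t|d) / p(t|D)); lambda_t = 1/sqrt(|q|) for
    simple keyword terms."""
    lam = 1.0 / math.sqrt(len(query_terms))
    k = len(top_docs_lms)
    return sum(lam * math.log(p_d[t] / p_corpus[t])
               for p_d in top_docs_lms for t in query_terms) / k
```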
Normalized Query Commitment - NQC (Shtok et al. ICTIR 2009 and TOIS 2012)
- NQC estimates the presumed amount of query drift in the list of top-retrieved documents
  - Query expansion often uses a centroid of the list Dq as an expanded query model
  - The centroid usually manifests query drift (Mitra et al. '98)
  - The centroid can be viewed as a prototypical misleader, as it exhibits (some) similarity to the query
  - This similarity is dominated by non-query-related aspects that lead to query drift
- Shtok et al. showed that the mean retrieval score of documents in the result list corresponds, in several retrieval methods, to the retrieval score of some centroid-based representation of Dq
  - Thus, the mean score represents the score of a prototypical misleader
- The standard deviation of the scores, which reflects their dispersion around the mean, represents the divergence of the retrieval scores of documents in the list from that of a non-relevant document that exhibits high query similarity (the centroid):

  $NQC(q) \triangleq \frac{\sqrt{\frac{1}{k} \sum_{d \in D_q^{[k]}} \left(Score(d) - \mu\right)^2}}{Score(D)}$
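A sketch of NQC (our own illustration; taking the absolute value of the corpus score guards against negative log query-likelihood scores and is an implementation choice, not part of the original formulation):

```python
import math

def nqc(top_scores, corpus_score):
    """Standard deviation of the top-k retrieval scores, normalized
    by the corpus score Score(D)."""
    k = len(top_scores)
    mu = sum(top_scores) / k
    sd = math.sqrt(sum((s - mu) ** 2 for s in top_scores) / k)
    return sd / abs(corpus_score)
```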
A geometric interpretation of NQC
Evaluating post-retrieval methods (Shtok et al. TOIS 2012)
- Similarly to pre-retrieval methods, there is no clear winner
- QF and NQC exhibit comparable results
  - NQC exhibits good performance over most collections, but does not perform very well on GOV2
  - QF performs well over some of the collections, but is inferior to other predictors on ROBUST

[Table omitted: prediction quality (Pearson correlation and Kendall's tau) per collection]
Additional predictors that analyze the retrieval scores distribution
- Computing the standard deviation of retrieval scores at a query-dependent cutoff
  - Perez-Iglesias and Araujo SPIRE 2010
  - Cummins et al. SIGIR 2011
- Computing expected ranks for documents
  - Vinay et al. CIKM 2008
- Inferring AP directly from the retrieval score distribution
  - Cummins AIRS 2011
Combining Predictors
Combining post-retrieval predictors
- Some efforts to integrate a few predictors based on linear regression:
  - Yom-Tov et al. (SIGIR '05) combined avgIDF with the Overlap predictor
  - Zhou and Croft (SIGIR '07) integrated WIG and QF using a simple linear combination
  - Diaz (SIGIR '07) incorporated the spatial autocorrelation predictor with Clarity and with the document-perturbation-based predictor
- In all of these studies, the results of the combined predictor were much better than the results of the single predictors
  - This suggests that these predictors measure (at least semi-) complementary properties of the retrieved results
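A minimal regression-based combination in the spirit of these studies (all numbers below are hypothetical toy values; the two feature columns stand for any pair of predictors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Per-query values of two predictors (e.g., Clarity and WIG) and true AP,
# for training queries with known relevance judgments.
X = np.array([[0.4, 1.2], [0.1, 0.3], [0.7, 1.9], [0.3, 0.8]])
y = np.array([0.35, 0.10, 0.62, 0.28])

combined = LinearRegression().fit(X, y)  # learn the mixing weights
predicted_ap = combined.predict(X)       # output of the combined predictor
```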
Utility estimation framework (UEF) for query-performance prediction (Shtok et al. SIGIR 2010)

- Suppose that we have a true model of relevance, $R_{I_q}$, for the information need $I_q$ that is represented by the query q
- Then, the ranking $\pi(D_q, R_{I_q})$, induced over the given result list $D_q$ using $R_{I_q}$, is the most effective for these documents
- Accordingly, the utility (with respect to the information need) provided by the given ranking, which can be thought of as reflecting query performance, can be defined as:

  $U(D_q|I_q) \triangleq Similarity\left(D_q, \pi(D_q, R_{I_q})\right)$

- In practice, we have no explicit knowledge of the underlying information need or of $R_{I_q}$
- Using statistical decision theory principles, we can approximate the utility by estimating $R_{I_q}$:

  $U(D_q|I_q) \approx \int_{\hat{R}_q} Similarity\left(D_q, \pi(D_q, \hat{R}_q)\right) p(\hat{R}_q|I_q)\, d\hat{R}_q$
Instantiating predictors

$U(D_q|I_q) \approx \int_{\hat{R}_q} Similarity\left(D_q, \pi(D_q, \hat{R}_q)\right) p(\hat{R}_q|I_q)\, d\hat{R}_q$

- Relevance-model estimates ($\hat{R}_q$)
  - relevance language models constructed from documents sampled from the highest ranks of some initial ranking
- Estimate the relevance model's presumed "representativeness" of the information need ($p(\hat{R}_q|I_q)$)
  - apply previously proposed predictors (Clarity, WIG, QF, NQC) to the sampled documents from which the relevance model is constructed
- Inter-list similarity measures
  - Pearson, Kendall's tau, Spearman
- A specific, highly effective, instantiated predictor:
  - Construct a single relevance model from the given result list, Dq
  - Use a previously proposed predictor upon Dq to estimate relevance-model effectiveness
  - Use Pearson's correlation between retrieval scores as the similarity measure
The UEF framework - a flow diagram

[Diagram: the query q is run by the search engine to produce the result list Dq; a relevance model R-hat(S;q) is sampled from Dq and used to re-rank it, yielding pi(Dq; R-hat(S;q)); the ranking similarity between Dq and the re-ranked list, weighted by the relevance-model estimator, yields the performance prediction AP-hat(q)]
Prediction quality of UEF (Pearson correlation with true AP)

[Figure: four bar charts over TREC4, TREC5, WT10G, ROBUST, and GOV2, comparing each base predictor to its UEF variant; average improvements: Clarity -> UEF(Clarity): +31.4%, WIG -> UEF(WIG): +27.8%, NQC -> UEF(NQC): +17.7%, QF -> UEF(QF): +24.7%]
Prediction quality for ClueWeb (Category A)*

|                               | TREC 2009 LM | TREC 2009 LM+SpamRm | TREC 2010 LM | TREC 2010 LM+SpamRm |
|-------------------------------|--------------|---------------------|--------------|---------------------|
| SumVar (Zhao et al. '08)      | .465         | .526                | .302         | .312                |
| SumIDF                        | .463         | .524                | .294         | .292                |
| Clarity                       | .017         | -.178               | -.385        | -.111               |
| UEF(Clarity)                  | .124         | .473                | .303         | .295                |
| ImpClarity (Hauff et al. '08) | .348         | .133                | .072         | -.008               |
| UEF(ImpClarity)               | .221         | .580                | .340         | .366                |
| WIG                           | .423         | .542                | .269         | .349                |
| UEF(WIG)                      | .236         | .651                | .375         | .414                |
| NQC                           | .083         | .430                | .269         | .214                |
| UEF(NQC)                      | .154         | .633                | .342         | .460                |
| QF                            | .494         | .630                | .368         | .617                |
| UEF(QF)                       | .637         | .708                | .358         | .649                |

* Thanks to Fiana Raiber for producing the results
Are the various post-retrieval predictors that different from each other?

A unified framework for explaining post-retrieval predictors (Kurland et al. ICTIR 2011)
A unified post-retrieval prediction framework

[Diagram: the corpus ranking pi_M(q;D) is compared against a pseudo-effective ranking pi_PE(q;D) and a pseudo-ineffective ranking pi_PIE(q;D)]

- We want to predict the effectiveness of a ranking $\pi_M(q;D)$ of the corpus D that was induced by retrieval method M in response to query q
- Assume a true model of relevance $R_I$ that can be used for retrieval $\Rightarrow$ the resultant ranking $\pi_{opt}(q;D)$ is of optimal utility:

  $Utility(\pi_M(q;D); I) \triangleq sim\left(\pi_M(q;D), \pi_{opt}(q;D)\right)$

- Use pseudo-effective (PE) and pseudo-ineffective (PIE) rankings as reference comparisons (cf. Rocchio '71):

  $\hat{U}(\pi_M(q;D); I) \triangleq \alpha(q)\, sim\left(\pi_M(q;D), \pi_{PE}(q;D)\right) - \beta(q)\, sim\left(\pi_M(q;D), \pi_{PIE}(q;D)\right)$
Instantiating predictors

Focus on the result lists of the documents most highly ranked by each ranking:

$\hat{U}(\pi_M(q;D); I) \triangleq \alpha(q)\, sim(L_q, L_{PE}) - \beta(q)\, sim(L_q, L_{PIE})$

Deriving predictors:
1. "Guess" a PE result list ($L_{PE}$) and/or a PIE result list ($L_{PIE}$)
2. Select weights $\alpha(q)$ and $\beta(q)$
3. Select an inter-list (ranking) similarity measure
   - Pearson's r, Kendall's tau, Spearman's rho
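The template itself is a one-liner; a sketch (our own illustration) that the individual predictors instantiate by their choices of lists, weights, and similarity measure:

```python
def unified_predictor(L_q, L_pe, L_pie, sim, alpha=1.0, beta=1.0):
    """U-hat = alpha(q) * sim(L_q, L_PE) - beta(q) * sim(L_q, L_PIE);
    sim is any inter-list similarity (e.g., a rank correlation)."""
    return alpha * sim(L_q, L_pe) - beta * sim(L_q, L_pie)
```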
Basic idea: the PIE result list is composed of k copies of a pseudo-ineffective document

$\hat{U}(\pi_M(q;D); I) \triangleq \alpha(q)\, sim(L_q^M, L_{PE}) - \beta(q)\, sim(L_q^M, L_{PIE})$

| Predictor | Pseudo-ineffective document | Predictor's description | Sim. measure | $\alpha(q)$ | $\beta(q)$ |
|---|---|---|---|---|---|
| Clarity (Cronen-Townsend '02, '04) | Corpus | Estimates the focus of the result list with respect to the corpus | -KL divergence between language models | 0 | 1 |
| WIG (Weighted Information Gain; Zhou & Croft '07) | Corpus | Measures the difference between the retrieval scores of documents in the result list and the score of the corpus | -L1 distance of retrieval scores | 0 | 1 |
| NQC (Normalized Query Commitment; Shtok et al. '09) | Result list centroid | Measures the standard deviation of the retrieval scores of documents in the result list | -L2 distance of retrieval scores | 0 | 1 |
$\hat{U}(\pi_M(q;D); I) \triangleq \alpha(q)\, sim(L_q^M, L_{PE}) - \beta(q)\, sim(L_q^M, L_{PIE})$

| Predictor | Pseudo-effective result list | Predictor's description | Sim. measure | $\alpha(q)$ | $\beta(q)$ |
|---|---|---|---|---|---|
| QF (Query Feedback; Zhou & Croft '07) | Use a relevance model for retrieval over the corpus | Measures the "amount of noise" (non-query-related aspects) in the result list | Overlap at top ranks | 1 | 0 |
| UEF (Utility Estimation Framework; Shtok et al. '10) | Re-rank the given result list using a relevance model | Estimates the potential utility of the result list using relevance models | Rank/score correlation | Presumed representativeness of the relevance model | 0 |
| Autocorrelation (Diaz '07) | 1. Score regularization; 2. Fusion | 1. The degree to which retrieval scores "respect the cluster hypothesis"; 2. Similarity with a fusion-based result list | Pearson correlation between scores | 1 | 0 |
A General Model for Query Difficulty
A model of query difficulty (Carmel et al., SIGIR 2006)

Define: Topic = (Q, R | C)

- A user with a given information need (a topic):
  - Submits a query to a search engine
  - Judges the search results according to their relevance to this information need
- Thus, the query/ies (Q) and the Qrels (R) are two sides of the same information need
- Qrels also depend on the existing collection (C)

[Diagram: a topic connects the queries (Q) and the judged documents (R), mediated by the search engine over the collection]
A Theoretical Model of Topic Difficulty
Main Hypothesis
Topic difficulty is induced from the distances between the model parts
Model validation: the Pearson correlation between average precision and the model distances (see the paper for estimates of the various distances)

Based on the .gov2 collection (25M docs) and 100 topics of the Terabyte tracks '04/'05

[Diagram: correlations between AP and the distances among the model parts (topic, queries, documents): 0.17, -0.06, 0.32, 0.15; combined: 0.45]
A probabilistic framework for QPP (Kurland et al. CIKM 2012, to appear)
Post-retrieval prediction (Kurland et al. 2012)

[Slide shows the probabilistic derivation; query-independent result-list properties (e.g., cohesion/dispersion) appear as a prior term. WIG and Clarity are derived from this expression]
Using reference lists (Kurland et al. 2012)

[Slide shows the derivation; the prediction incorporates the presumed quality of the reference list, a prediction task in its own right]
Applications of Query Difficulty Estimation
Main applications for QDE
- Feedback to the user and the system
- Federation and metasearch
- Content enhancement using missing content analysis
- Selective query expansion / query selection
- Others
Feedback to the user and the system
- Direct feedback to the user
  - "We were unable to find relevant documents for your query"
- Estimating the value of terms for query refinement
  - Suggest which terms to add, and estimate their value
- Personalization
  - Which queries would benefit from personalization? (Teevan et al. '08)
Federation and metasearch: Problem definition
- Given several databases that might contain information relevant to a given question, how do we construct a good unified list of answers from all these datasets?
- Similarly, given a set of search engines employed for the same query, how do we merge their results in an optimal manner?

[Diagram: several search engines, each over its own collection, feed their results into a federation step that produces a merged ranking]
Prediction-based Federation & Metasearch
Weight each result set by its predicted precision (Yom-Tov et al. 2005; Sheldon et al. '11), or select the best-predicted search engine (White et al. 2008; Berger & Savoy 2007)
Metasearch and Federation using Query Prediction
- Train a predictor for each engine/collection pair
- For a given query: predict its difficulty for each engine/collection pair
- Weight the results retrieved from each engine/collection pair accordingly, and generate the federated list
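A sketch of the merging step (our own illustration; how per-engine scores are normalized before mixing is left open):

```python
from collections import defaultdict

def prediction_based_federation(result_lists, predicted_quality):
    """Merge per-engine result lists into one ranking, weighting each
    engine's retrieval scores by its predicted quality for this query.

    result_lists: engine -> list of (doc_id, score) pairs.
    predicted_quality: engine -> predicted performance for the query.
    """
    merged = defaultdict(float)
    for engine, results in result_lists.items():
        weight = predicted_quality[engine]
        for doc_id, score in results:
            merged[doc_id] += weight * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```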
Metasearch experiment
- The LA Times collection was indexed using four search engines
- A predictor was trained for each search engine
- For each query, the result list from each search engine was weighted using the prediction for that query
- The final ranking is a ranking of the union of the result lists, weighted by the prediction

|                      |                  | P@10  | %no  |
|----------------------|------------------|-------|------|
| Single search engine | SE 1             | 0.139 | 47.8 |
|                      | SE 2             | 0.153 | 43.4 |
|                      | SE 3             | 0.094 | 55.2 |
|                      | SE 4             | 0.171 | 37.4 |
| Metasearch           | Round-robin      | 0.164 | 45.0 |
|                      | MetaCrawler      | 0.163 | 34.9 |
|                      | Prediction-based | 0.183 | 31.7 |
Federation for the TREC 2004 Terabyte track
- GOV2 collection: 426GB, 25 million documents
- 50 topics
- For federation, the collection was divided into 10 partitions of roughly equal size

|                            | P@10  | MAP   |
|----------------------------|-------|-------|
| One collection             | 0.522 | 0.292 |
| Prediction-based federation| 0.550 | 0.264 |
| Score-based federation     | 0.498 | 0.257 |

* 10-fold cross-validation
Content enhancement using missing content analysis
[Diagram: users' information needs flow through the search engine over the repository; missing content estimation identifies uncovered topics; data gathering brings external content, which is filtered by quality estimation and added to the repository]
Story line
- A helpdesk system where system administrators try to find relevant solutions in a database
- The system administrators' queries are logged and analyzed to find topics of interest that are not covered in the current repository
- Additional content for topics that are lacking is obtained from the Internet and added to the local repository
Adding content where it is missing improves precision

- Cluster user queries
- Add content according to:
  - Lacking content
  - Size
  - Random
- Adding content where it is missing brings the best improvement in precision
Selective query expansion
- Automatic query expansion: improving average quality by adding search terms
- Pseudo-relevance feedback (PRF): considers the (few) top-ranked documents to be relevant, and uses them to expand the query
- PRF works well on average, but can significantly degrade retrieval performance for some queries
- PRF fails because of query drift: non-relevant documents infiltrate the top results, or relevant results contain aspects irrelevant to the query's original intention
- Rocchio's method: the expanded query is obtained by combining the original query and the centroid of the top results, adding new terms to the query and reweighting the original terms:

  $Q_m = \alpha\, Q_{Original} + \beta \sum_{D_j \in Relevant} D_j - \gamma \sum_{D_k \in Irrelevant} D_k$
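A sketch over term-weight vectors (our own illustration; the default alpha/beta/gamma values are common textbook settings, not prescribed here; in PRF the top-ranked documents play the role of the relevant set and gamma is often 0):

```python
import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio expansion: combine the original query vector with the
    centroids of the relevant and non-relevant document vectors."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(rel_docs):
        q_new = q_new + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q_new = q_new - gamma * np.mean(nonrel_docs, axis=0)
    return q_new
```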
Some work on selective query expansion

- Expand only "easy" queries
  - Predict which are the easy queries and only expand them (Amati et al. 2004, Yom-Tov et al. 2005)
- Estimate query drift (Cronen-Townsend et al. 2006)
  - Compare the expanded list with the unexpanded list to estimate whether too much noise was added by the expansion
- Select a specific query expansion form from a list of candidates (Winaver et al. 2007)
- Integrate various query-expansion forms by weighting them using query-performance predictors (Soskin et al. 2009)
- Estimate how much to expand (Lv and Zhai 2009)
  - Predict the best alpha in the Rocchio formula using features of the query, the documents, and the relationship between them
- But, see Azzopardi and Hauff 2009
Other uses of query difficulty estimation
- Collaborative filtering (Bellogin & Castells, 2009): measure the lack of ambiguity in a user's preferences
  - An "easy" user is one whose preferences are clear-cut, and who thus contributes much as a neighbor
- Term selection (Kumaran & Carvalho, 2009): identify irrelevant query terms
  - Queries with fewer irrelevant terms tend to have better results
- Reducing long queries (Cummins et al., 2011)
Summary & Conclusions
Summary
- In this tutorial, we surveyed the current state-of-the-art research on query difficulty estimation for IR
- We discussed the reasons that:
  - Cause search engines to fail for some of the queries
  - Bring about a high variability in performance among queries as well as among systems
- We summarized several approaches to query performance prediction:
  - Pre-retrieval methods
  - Post-retrieval methods
  - Combining predictors
- We reviewed evaluation metrics for prediction quality and the results of various evaluation studies conducted over several TREC benchmarks
  - These results show that existing state-of-the-art predictors are able to identify difficult queries, demonstrating reasonable prediction quality
  - However, prediction quality is still moderate and should be substantially improved in order for prediction to be widely used in IR tasks
Summary – Main Results
- Current linguistic-based predictors do not exhibit meaningful correlation with query performance
  - This is quite surprising, as intuitively poor performance can be expected for ambiguous queries
- In contrast, statistical pre-retrieval predictors such as sumSCQ and maxVAR have relatively significant predictive ability
  - These pre-retrieval predictors, and a few others, exhibit performance comparable to post-retrieval methods such as Clarity, WIG, NQC, and QF over large-scale Web collections
  - This is counter-intuitive, as post-retrieval methods are exposed to much more information than pre-retrieval methods
- However, current state-of-the-art predictors still suffer from low robustness in prediction quality
  - This robustness problem, as well as the moderate prediction quality of existing predictors, are two of the greatest challenges in query difficulty prediction, and should be further explored in the future
Summary (cont)
- We examined whether combining several predictors may improve prediction quality
  - Especially when the different predictors are independent and measure different aspects of the query and the search results
  - An example combination method is linear regression: the regression task is to learn how to optimally combine the predicted performance values in order to best fit them to the actual performance values
  - Results were moderate, probably due to the sparseness of the training data, which over-represents the lower end of the performance values
- We discussed three frameworks for query-performance prediction:
  - The Utility Estimation Framework (UEF); Shtok et al. 2010
    - State-of-the-art prediction quality
  - A unified framework for post-retrieval prediction that sets common grounds for various previously proposed predictors; Kurland et al. 2011
  - A fundamental framework for estimating query difficulty; Carmel et al. 2006
Summary (cont)
- We discussed a few applications that utilize query difficulty estimators:
  - Handling each query individually based on its estimated difficulty
  - Finding the best terms for query refinement by measuring the expected gain in performance for each candidate term
  - Expanding the query or not, based on the predicted performance of the expanded query
  - Personalizing the query selectively, only in cases where personalization is expected to bring value
  - Collection enhancement guided by the identification of missing content queries
  - Fusion of search results from several sources based on their predicted quality
What’s Next?
- Predicting the performance of other query types
  - Navigational queries
  - XML queries (XQuery, XPath)
  - Domain-specific queries (e.g., healthcare)
- Considering other factors that may affect query difficulty
  - Who is the person behind the query? In what context?
  - Geo-spatial features
  - Temporal aspects
  - Personal parameters
- Query difficulty in other search paradigms
  - Multifaceted search
  - Exploratory search
Concluding Remarks
- Research on query difficulty estimation began only ten years ago, with the pioneering work on the Clarity predictor (2002)
  - Since then, this subfield has found its place at the center of IR research
  - These studies have revealed alternative prediction approaches, new evaluation methodologies, and novel applications
- In this tutorial we covered:
  - Existing performance prediction methods
  - Some evaluation studies
  - Potential applications
  - Some anticipated future directions for the field
- While the progress is already substantial, performance prediction is still challenging and far from solved
  - Much more accurate predictors are required in order for prediction to be widely adopted in IR tasks
- We hope that this tutorial will contribute to increasing interest in query difficulty estimation
Thank You!