TU Wien, April 12th, 2010
Evaluation of IR Methods
Mihai Lupu, [email protected]
Post-doctoral Research Fellow, IRF
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion
Slides also available at http://mihailupu.net/teaching/IREvalLectureApr12.ppt
Introduction
• Why?
  – Because without evaluation, there is no research
• Why is this a research field in itself?
  – Because there are many kinds of IR
    • With different evaluation criteria
  – Because it's hard
• Why is it hard?
  – Because it involves human subjectivity (document relevance)
  – Because of the amount of data involved (who can sit down and evaluate the 1,750,000 documents returned by Google for 'university vienna'?)
Kinds of evaluation
• "Efficient and effective system"
• Time and space: efficiency
  – Generally constrained by pre-development specification
    • E.g. real-time answers vs. batch jobs
    • E.g. index-size constraints
  – Easy to measure
• Good results: effectiveness
  – Harder to define, hence more research into it
• And…
Kinds of evaluation (cont.)
• User studies
  – Does a 2% increase in some retrieval performance measure actually make a user happier?
  – Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
  – Hard to do
  – Mostly anecdotal examples
  – IR people don't like to do it (though it's starting to change)
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures
  – Test Collections
• Discussion on Evaluation
• Conclusion
Retrieval Effectiveness
• Precision
  – How happy are we with what we've got
• Recall
  – How much more we could have had

Precision = (Number of relevant documents retrieved) / (Number of documents retrieved)

Recall = (Number of relevant documents retrieved) / (Number of relevant documents)
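To make the two ratios concrete, here is a minimal Python sketch (the function and variable names are my own, for illustration only):

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    retrieved: list of document ids returned by the system
    relevant:  set of document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant,
# but only 3 of the 10 relevant documents were found.
p, r = precision_recall(["d1", "d2", "d3", "d99"],
                        relevant={f"d{i}" for i in range(1, 11)})
print(p, r)  # 0.75 0.3
```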
Retrieval Effectiveness
[Figure: Venn diagram – the set of retrieved documents and the set of relevant documents overlap within the set of all documents]
Retrieval effectiveness
• Tools we need:
  – A set of documents (the "dataset")
  – A set of questions/queries/topics
  – For each query, and for each document, a decision: relevant or not relevant
• Let's assume for the moment that's all we need and that we have it
Retrieval Effectiveness
• Precision and Recall are generally plotted as a "Precision-Recall curve"
[Figure: a precision-recall curve; the number of retrieved documents increases along the curve]
• They do not play well together
Precision-Recall Curves
• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall-axis
Precision-Recall Curves
• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall-axis
  – Repeat for all queries
Precision-Recall Curves
• And the average is the system’s P-R curve
• We can compare systems by comparing the curves
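One common way to build such a curve is to interpolate precision at 11 fixed recall levels per query and then average over queries; a hedged Python sketch (the names and the 11-point convention are assumptions, not necessarily the exact procedure behind the figure):

```python
def precision_at_recall_checkpoints(ranking, relevant, checkpoints=11):
    """Interpolated precision at evenly spaced recall levels (0.0, 0.1, ..., 1.0)."""
    if not relevant:
        return [0.0] * checkpoints
    precisions, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append((hits / len(relevant), hits / rank))  # (recall, precision)
    curve = []
    for k in range(checkpoints):
        level = k / (checkpoints - 1)
        # interpolated precision: best precision attainable at any recall >= this level
        attainable = [p for r, p in precisions if r >= level]
        curve.append(max(attainable) if attainable else 0.0)
    return curve

def average_curve(per_query_curves):
    """Average the per-query interpolated curves: the system's P-R curve."""
    return [sum(vals) / len(vals) for vals in zip(*per_query_curves)]
```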
Retrieval effectiveness
• What if we don’t like this twin-measure approach?
• A few solutions:
  – Harmonic mean (F-measure):
    $F(j) = \dfrac{2}{\frac{1}{P(j)} + \frac{1}{R(j)}}$
  – Van Rijsbergen's E-Measure:
    $E_\beta(j) = 1 - \dfrac{1 + \beta^2}{\frac{\beta^2}{R(j)} + \frac{1}{P(j)}}$
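Transcribing the two formulas above into a small sketch (P and R stand for precision and recall at cut-off j; the guards for zero values are my own addition):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (F)."""
    return 0.0 if p == 0 or r == 0 else 2 / (1 / p + 1 / r)

def e_measure(p, r, beta=1.0):
    """Van Rijsbergen's E: lower is better; E = 1 - F when beta = 1."""
    if p == 0 or r == 0:
        return 1.0
    return 1 - (1 + beta ** 2) / (beta ** 2 / r + 1 / p)
```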
Retrieval Effectiveness
• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?
Single-value measures
• What if we want to compare systems at query level?
• Could we have just one measure, to avoid the curves?
Single-value measures
• Average precision
  – For each query:
    • Every time a relevant document is retrieved, calculate precision
    • Average with previously calculated values
    • Repeat until all relevant documents are retrieved
  – For each system:
    • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures (see the sketch below)
• R-precision
  – Precision at rank R, where R is the total number of relevant documents for the query
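A minimal sketch of the procedure just described (the data structures and names are illustrative assumptions, not trec_eval's internals):

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of the precision values at each relevant document.

    Relevant documents that are never retrieved contribute 0, which is why the
    sum is divided by the total number of relevant documents.
    """
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: the mean of the per-query average precisions for one system.

    runs:  {topic_id: ranked list of document ids}
    qrels: {topic_id: set of relevant document ids}
    """
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)
```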
Cumulative Gain
• For each document d and query q, define rel(d,q) ≥ 0
• The higher the value, the more relevant the document is to the query
• Cumulative gain at rank j:
  $CG(j) = \sum_{i=0}^{j} rel(d_i, q)$
• Pitfalls:
  – Graded relevance introduces even more ambiguity in practice
  – With great flexibility comes great responsibility to justify parameter values
Discounted Cumulative Gain
• A system that returns highly relevant documents at the top of the list should be scored higher than one that returns the same documents lower in the ranked list
• Other formulations are also possible
• Neither CG nor DCG can be used for comparison across queries! (they depend on the number of relevant documents per query)
$DCG_b(j) = \begin{cases} CG(j) & j < b \\[4pt] \sum_{i=0}^{j} \dfrac{rel(d_i, q)}{\log_b(i)} & j \ge b \end{cases}$
Normalised Discounted Cumulative Gain
• Compute CG / DCG for the optimal return set, e.g.:
  (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…)
  has the Ideal Discounted Cumulative Gain: IDCG
• Normalise:
  $NDCG_b(j) = \dfrac{DCG_b(j)}{IDCG_b(j)}$
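The gain-based measures can be transcribed almost literally into code. The sketch below uses the log-base-b discount from the slides, leaving ranks below b undiscounted; other discount functions (e.g. log2(rank+1) at every position) are also widely used, so treat this as one possible variant:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulative gain of a list of relevance grades, best rank first.

    Ranks below b are taken at face value; from rank b onwards the gain is
    divided by log_b(rank).
    """
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
    return total

def ndcg(gains, ideal_gains, b=2):
    """Normalised DCG: DCG of the run divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ideal_gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal > 0 else 0.0

# Example: grades of the returned documents vs. all known grades for the query
print(ndcg([3, 2, 3, 0, 1, 2], ideal_gains=[3, 3, 2, 2, 1, 0]))
```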
Other measures
• There are tens of IR measures!
• trec_eval is a little program that computes many of them
  – 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
  – http://trec.nist.gov/trec_eval/
• "there is a measure to make anyone a winner"
  – Not really true, but still…
Other measures
• How about correlations between measures?
• Kendall tau values, from Voorhees and Harman, 2004
• Overall, the measures correlate well
           P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)      0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)             0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                    0.93  0.87     0.83       0.83     0.67
MAP                             0.88     0.85       0.85     0.64
.5 prec                                  0.77       0.78     0.63
R(1,1000)                                           0.92     0.67
Rel ret                                                      0.66
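A correlation such as those in the table can be computed from the scores two measures assign to the same set of runs; a minimal sketch using scipy.stats.kendalltau (the run names and scores below are invented for illustration):

```python
from scipy.stats import kendalltau

# MAP and P@10 scores for the same five hypothetical runs
map_scores = {"runA": 0.31, "runB": 0.28, "runC": 0.25, "runD": 0.22, "runE": 0.19}
p10_scores = {"runA": 0.52, "runB": 0.55, "runC": 0.47, "runD": 0.40, "runE": 0.33}

runs = sorted(map_scores)
tau, p_value = kendalltau([map_scores[r] for r in runs],
                          [p10_scores[r] for r in runs])
print(f"Kendall tau between the two measure-induced rankings: {tau:.2f}")
```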
Relevance assessments
• Ideally
  – Sit down and look at all documents
• Practically
  – The ClueWeb09 collection has
    • 1,040,809,705 web pages, in 10 languages
    • 5 TB, compressed (25 TB, uncompressed)
  – No way to do this exhaustively
  – Look only at the set of returned documents
    • Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle incomplete relevance assessments
  – Consider unjudged = non-relevant, or
  – Do not consider unjudged documents at all (i.e. compress the ranked lists)
• A new measure:
  – Bpref (binary preference) – see the sketch below
• And a few others:
  – Rank-Biased Precision (RBP), Rpref (for graded relevance)
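As an illustration of the idea behind bpref – score relevant documents by how few judged non-relevant documents are ranked above them, ignoring unjudged documents – here is a hedged sketch of one published formulation (it approximates, but is not guaranteed to match, what trec_eval computes):

```python
def bpref(ranking, relevant, nonrelevant):
    """Binary preference: only judged documents are taken into account.

    ranking:      ranked list of document ids returned by the system
    relevant:     set of documents judged relevant
    nonrelevant:  set of documents judged non-relevant (unjudged docs are ignored)
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen = 0
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # penalty grows with the number of judged non-relevant docs ranked above
            penalty = min(nonrel_seen, R) / min(R, N) if N > 0 else 0.0
            score += 1.0 - penalty
    return score / R
```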
Relevance assessments - Pooling
• Combine the results retrieved by all systems
• Choose a parameter k (typically 100)
• Choose the top k documents as ranked in each submitted run
• The pool is the union of these sets of documents
  – Between k and (# submitted runs) × k documents in the pool
  – The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
• Give the pool to judges for relevance assessments
(From Donna Harman)
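The pool-building step itself is mechanical; a minimal sketch (the data layout is an assumption for illustration):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run, per topic.

    runs: {run_name: {topic_id: ranked list of document ids}}
    Returns {topic_id: set of document ids to be judged}.
    """
    pool = {}
    for ranked_lists in runs.values():
        for topic, ranking in ranked_lists.items():
            pool.setdefault(topic, set()).update(ranking[:k])
    return pool
```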
Relevance assessments - Pooling
• Conditions under which pooling works [Robertson]:
  – A range of different kinds of systems, including manual systems
  – Reasonably deep pools (100+ from each system)
    • But this depends on collection size
  – The collection cannot be too big
    • Big is so relative…
Relevance assessments - Pooling
• Advantage of pooling:
  – Fewer documents must be manually assessed for relevance
• Disadvantages of pooling:
  – We can't be certain that all documents satisfying the query are found (recall values may not be accurate)
  – Runs that did not participate in the pooling may be disadvantaged
  – If only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these
Relevance assessments
• Pooling with randomized sampling
• As the data collection grows, the top 100 may not be representative of the entire result set
  – (i.e. the assumption that everything after the pool is not relevant does not hold anymore)
• Add, to the pool, a set of documents randomly sampled from the entire retrieved set
  – If the sampling is uniform it is easy to reason about, but it may be too sparse as the collection grows
  – Stratified sampling: get more from the top of the ranked list (see the sketch below)
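A sketch of the stratified idea: sample densely near the top of each ranked list and more sparsely further down (the strata boundaries and rates below are arbitrary illustrative choices):

```python
import random

def stratified_sample(ranking,
                      strata=((0, 100, 1.0), (100, 1000, 0.2), (1000, None, 0.02))):
    """Sample a ranked list with decreasing rates: (start, end, sampling_rate) per stratum."""
    sampled = []
    for start, end, rate in strata:
        segment = ranking[start:end]
        n = max(1, round(rate * len(segment))) if segment else 0
        sampled.extend(random.sample(segment, min(n, len(segment))))
    return sampled
```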
Relevance assessment - subjectivity
• In TREC-CHEM'09 we had each topic evaluated by two students
  – "conflicts" ranged between 2% and 33% (excluding a topic with 60% conflict)
  – All of these increased if we considered "strict disagreement"
• In general, inter-evaluator agreement is rarely above 80%
• There is little one can do about it – it has to be dealt with
Relevance assessment - subjectivity
• Good news:
  – The "idiosyncratic nature of relevance judgments does not affect comparative results" (E. Voorhees)
  – Mean Kendall tau between system rankings produced from different query relevance sets: 0.938
  – Similar results held for:
    • Different query sets
    • Different evaluation measures
    • Different assessor types
    • Single opinion vs. group opinion judgments
Statistical validity
• Whatever evaluation metric is used, all experiments must be statistically valid
  – i.e. differences must not be the result of chance
[Figure: bar chart of MAP values, ranging between 0 and 0.2]
Statistical validity
• Ingredients of a significance test
  – A test statistic (e.g. the differences between AP values)
  – A null hypothesis (e.g. "there is no difference between the two systems")
    • This gives us a particular distribution of the test statistic
  – A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
    • The p-value
• If the p-value is low, we can feel confident in rejecting the null hypothesis: the systems are different
Statistical validity
• Common practice is to declare systems different when the p-value ≤ 0.05
• A few tests:
  – Randomization tests (see the sketch below)
    • Wilcoxon Signed Rank test
    • Sign test
  – Bootstrap test
  – Student's paired t-test
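A minimal sketch of a paired randomization (permutation) test on per-topic score differences; the number of trials and the 0.05 threshold in the comment are conventional choices, not prescriptions:

```python
import random

def randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired randomization test.

    scores_a, scores_b: per-topic scores (e.g. AP) of two systems, same topic order.
    Returns the p-value for the null hypothesis 'the two systems do not differ'.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # under the null hypothesis, the sign of each per-topic difference is arbitrary
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / trials

# p = randomization_test(ap_system1, ap_system2)
# if p <= 0.05: the difference is unlikely to be due to chance
```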
Statistical Validity - example
Statistical validity
• How do we increase the statistical validity of an experiment?
• By increasing the number of topics
  – The more topics, the more confident we are that the difference between average scores will be significant
• What’s the minimum number of topics?
  – 42
  – It depends, but:
    • TREC started with 50
    • Below 25 is generally considered not significant
Test Collections
• Generally created as the result of an evaluation campaign
  – TREC – Text Retrieval Conference (USA)
  – CLEF – Cross Language Evaluation Forum (EU)
  – NTCIR – NII Test Collection for IR Systems (JP)
  – INEX – Initiative for the Evaluation of XML Retrieval
• First one and paradigm definer:
  – The Cranfield Collection
    • In the 1950s
    • Aeronautics
    • 1400 queries, about 6000 documents
    • Fully evaluated
TREC
• Started in 1992
• Always organised in the States, on the NIST campus
• As leader, it introduced most of the jargon used in IR Evaluation:
  – Topic = query / request for information
  – Run = a ranked list of results
  – Qrel = relevance judgements
TREC
• Organised as a set of tracks that focus on a particular sub-problem of IR
  – E.g. Chemical, Genomics, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
• The set of tracks in a year depends on:
  – Interest of participants
  – Fit to TREC
  – Needs of sponsors
  – Resource constraints
TREC
[Workflow: Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication]
TREC – Task definition
• Each Track has a set of Tasks
• Examples of tasks from the Blog track:
  – 1. Finding blog posts that contain opinions about the topic
  – 2. Ranking positive and negative blog posts
  – 3. (A separate baseline task to just find blog posts relevant to the topic)
  – 4. Finding blogs that have a principal, recurring interest in the topic
TREC - Topics
• For TREC, topics generally have a specific format (not always though):
  – <ID>
  – <title>
    • Very short
  – <description>
    • A brief statement of what would be a relevant document
  – <narrative>
    • A long description, meant also for the evaluator to understand how to judge the topic
TREC - Topics
• Example:
  – <ID>
    • 312
  – <title>
    • Hydroponics
  – <description>
    • Document will discuss the science of growing plants in water or some substance other than soil
  – <narrative>
    • A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …
CLEF
• Cross Language Evaluation Forum
  – From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
• Started in 2000
• Grand challenge:
  – Fully multilingual, multimodal IR systems
    • Capable of processing a query in any medium and any language
    • Finding relevant information from a multilingual, multimedia collection
    • And presenting it in the style most likely to be useful for the user
CLEF
• Previous tracks:
  – Mono-, bi-, multilingual text retrieval
  – Interactive cross-language retrieval
  – Cross-language spoken document retrieval
  – QA in multiple languages
  – Cross-language retrieval in image collections
  – CL geographical retrieval
  – CL video retrieval
  – Multilingual information filtering
  – Intellectual property
  – Log file analysis
  – Large-scale grid experiments
• From 2010:
  – Organised as a series of "labs"
NTCIR
• Started in 1997, but organized every 1.5 years
• The first to look at Patent data (in 2001/2002)
• Other tracks:
  – Japanese / Cross-language retrieval
  – Web Retrieval
  – Term extraction
  – QA
    • Information Access Dialog
  – Text summarisation
  – Trend information
  – Opinion analysis
INEX
• All previously mentioned campaigns focus on document retrieval
  – Though most of them have 'tracks' on QA or information extraction/access
• INEX is fully dedicated to retrieval of the most relevant parts of a document:
  – "Focused retrieval"
    • = passage retrieval (for long documents)
    • = element retrieval (for XML documents)
    • = page retrieval (for books)
    • = question answering
  – Adds a whole new twist on relevance judgments: the returned passage may contain, be contained in, or partially overlap the correct answer
Test collections
• In summary, it is important to design the right experiment for the right IR task
  – Web retrieval is very different from legal retrieval
• The example of Patent retrieval:
  – High Recall: a single missed document can invalidate a patent
  – Session based: a single search may involve days of cycles of results review and query reformulation
  – Defendable: process and results may need to be defended in court
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion
Discussion on evaluation
• Laboratory evaluation – good or bad?
  – Rigorous testing
  – Over-constrained
• I usually make the comparison to a tennis racket:
  – No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
  – But the user will choose the device based on the lab evaluation
Discussion on evaluation
• There is bias to account for
  – E.g. the number of relevant documents per topic
Discussion on evaluation
• Recall and recall-related measures are often contested
• [Cooper 1973, p. 95]:
  – "The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search"
• Clearly not true in the legal & patent domains
Discussion on Evaluation
• Realistic tasks and user models
  – Evaluation has to be based on the available data sets
    • This creates the user model
    • Tasks need to correspond to available techniques
• Much literature on generating tasks
  – Experts describe typical tasks
  – Use of log files of various sorts
• IR research is decades behind sociology in terms of user modeling – there is a place to learn from
Discussion on Evaluation
• Competitiveness
  – Most campaigns take pains to explain: "This is not a competition – this is an evaluation"
• Competitions are stimulating, but
  – Participants are wary of participating if they are not sure to win
    • Particularly commercial vendors
  – Without special care from the organizers, it stifles creativity:
    • The best way to win is to take last year's method and improve it a bit
    • Original approaches are risky
Discussion on Evaluation
• Other data representations than text
  – Image
  – Music
  – Video
• Could fit into everything discussed above
• The subjectivity problem may become even more of a problem
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
  – IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
  – As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
  – VideOlympics (since 2007)
  – PatOlympics (since 2010)
Bibliography
• Modern Information Retrieval – R. Baeza-Yates, B. Ribeiro-Neto
• TREC – Experiment and Evaluation in Information Retrieval – E. Voorhees, D. Harman (eds.)
• A Comparison of Statistical Significance Tests for Information Retrieval Evaluation – M. Smucker, J. Allan, B. Carterette (CIKM'07)
• A Simple and Efficient Sampling Method for Estimating AP and NDCG – E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR'08)