TU Wien, April 12th, 2010
Evaluation of IR Methods
Mihai Lupu, [email protected]
Post-doctoral Research Fellow, IRF
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion
Slides also available at http://mihailupu.net/teaching/IREvalLectureApr12.ppt
Introduction
• Why?
  – Because without evaluation, there is no research
• Why is this a research field in itself?
  – Because there are many kinds of IR
    • With different evaluation criteria
  – Because it's hard
• Why is it hard?
  – Because it involves human subjectivity (document relevance)
  – Because of the amount of data involved (who can sit down and evaluate the 1,750,000 documents returned by Google for 'university vienna'?)
Kinds of evaluation
• "Efficient and effective system"
• Time and space: efficiency
  – Generally constrained by pre-development specification
    • E.g. real-time answers vs. batch jobs
    • E.g. index-size constraints
  – Easy to measure
• Good results: effectiveness
  – Harder to define, hence more research into it
• And…
Kinds of evaluation (cont.)
• User studies
  – Does a 2% increase in some retrieval performance measure actually make a user happier?
  – Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
  – Hard to do
  – Mostly anecdotal examples
  – IR people don't like to do it (though it's starting to change)
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures
  – Test Collections
• Discussion on Evaluation
• Conclusion
Retrieval Effectiveness
• Precision
  – How happy are we with what we've got
• Recall
  – How much more we could have had

Precision = (Number of relevant documents retrieved) / (Number of documents retrieved)

Recall = (Number of relevant documents retrieved) / (Number of relevant documents)
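To make the two ratios concrete, here is a minimal Python sketch (the function and variable names are my own, for illustration only):

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    retrieved: list of document ids returned by the system
    relevant:  set of document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant,
# but only 3 of the 10 relevant documents were found.
p, r = precision_recall(["d1", "d2", "d3", "d99"],
                        relevant={f"d{i}" for i in range(1, 11)})
print(p, r)  # 0.75 0.3
```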
Retrieval Effectiveness
[Figure: Venn diagram – the set of retrieved documents and the set of relevant documents overlap within the set of all documents]
Retrieval effectiveness
• Tools we need:
  – A set of documents (the "dataset")
  – A set of questions/queries/topics
  – For each query, and for each document, a decision: relevant or not relevant
• Let's assume for the moment that's all we need and that we have it
Retrieval Effectiveness
• Precision and Recall are generally plotted as a "Precision-Recall curve"
[Figure: a precision-recall curve; the number of retrieved documents increases along the curve]
• They do not play well together
Precision-Recall Curves
• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall-axis
Precision-Recall Curves
• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall-axis
  – Repeat for all queries
Precision-Recall Curves
• And the average is the system’s P-R curve
• We can compare systems by comparing the curves
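One common way to build such a curve is to interpolate precision at 11 fixed recall levels per query and then average over queries; a hedged Python sketch (the names and the 11-point convention are assumptions, not necessarily the exact procedure behind the figure):

```python
def precision_at_recall_checkpoints(ranking, relevant, checkpoints=11):
    """Interpolated precision at evenly spaced recall levels (0.0, 0.1, ..., 1.0)."""
    if not relevant:
        return [0.0] * checkpoints
    precisions, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append((hits / len(relevant), hits / rank))  # (recall, precision)
    curve = []
    for k in range(checkpoints):
        level = k / (checkpoints - 1)
        # interpolated precision: best precision attainable at any recall >= this level
        attainable = [p for r, p in precisions if r >= level]
        curve.append(max(attainable) if attainable else 0.0)
    return curve

def average_curve(per_query_curves):
    """Average the per-query interpolated curves: the system's P-R curve."""
    return [sum(vals) / len(vals) for vals in zip(*per_query_curves)]
```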
Retrieval effectiveness
• What if we don’t like this twin-measure approach?
• A few solutions:
  – Harmonic mean (F-measure):
    $F(j) = \dfrac{2}{\frac{1}{P(j)} + \frac{1}{R(j)}}$
  – Van Rijsbergen's E-Measure:
    $E_\beta(j) = 1 - \dfrac{1 + \beta^2}{\frac{\beta^2}{R(j)} + \frac{1}{P(j)}}$
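Transcribing the two formulas above into a small sketch (P and R stand for precision and recall at cut-off j; the guards for zero values are my own addition):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (F)."""
    return 0.0 if p == 0 or r == 0 else 2 / (1 / p + 1 / r)

def e_measure(p, r, beta=1.0):
    """Van Rijsbergen's E: lower is better; E = 1 - F when beta = 1."""
    if p == 0 or r == 0:
        return 1.0
    return 1 - (1 + beta ** 2) / (beta ** 2 / r + 1 / p)
```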
Retrieval Effectiveness
• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?
Single-value measures
• What if we want to compare systems at query level?
• Could we have just one measure, to avoid the curves?
Single-value measures
• Average precision
  – For each query:
    • Every time a relevant document is retrieved, calculate precision
    • Average with previously calculated values
    • Repeat until all relevant documents are retrieved
  – For each system:
    • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures (see the sketch below)
• R-precision
  – Precision at rank R, where R is the total number of relevant documents for the query
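A minimal sketch of the procedure just described (the data structures and names are illustrative assumptions, not trec_eval's internals):

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of the precision values at each relevant document.

    Relevant documents that are never retrieved contribute 0, which is why the
    sum is divided by the total number of relevant documents.
    """
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: the mean of the per-query average precisions for one system.

    runs:  {topic_id: ranked list of document ids}
    qrels: {topic_id: set of relevant document ids}
    """
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)
```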
Cumulative Gain
• For each document d and query q, define rel(d,q) ≥ 0
• The higher the value, the more relevant the document is to the query
• Cumulative gain at rank j:
  $CG(j) = \sum_{i=0}^{j} rel(d_i, q)$
• Pitfalls:
  – Graded relevance introduces even more ambiguity in practice
  – With great flexibility comes great responsibility to justify parameter values
Discounted Cumulative Gain
• A system that returns highly relevant documents at the top of the list should be scored higher than one that returns the same documents lower in the ranked list
• Other formulations are also possible
• Neither CG nor DCG can be used for comparison across queries! (they depend on the number of relevant documents per query)
$DCG_b(j) = \begin{cases} CG(j) & j < b \\[4pt] \sum_{i=0}^{j} \dfrac{rel(d_i, q)}{\log_b(i)} & j \ge b \end{cases}$
Normalised Discounted Cumulative Gain
• Compute CG / DCG for the optimal return set, e.g.:
  (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…)
  has the Ideal Discounted Cumulative Gain: IDCG
• Normalise:
  $NDCG_b(j) = \dfrac{DCG_b(j)}{IDCG_b(j)}$
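The gain-based measures can be transcribed almost literally into code. The sketch below uses the log-base-b discount from the slides, leaving ranks below b undiscounted; other discount functions (e.g. log2(rank+1) at every position) are also widely used, so treat this as one possible variant:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulative gain of a list of relevance grades, best rank first.

    Ranks below b are taken at face value; from rank b onwards the gain is
    divided by log_b(rank).
    """
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
    return total

def ndcg(gains, ideal_gains, b=2):
    """Normalised DCG: DCG of the run divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ideal_gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal > 0 else 0.0

# Example: grades of the returned documents vs. all known grades for the query
print(ndcg([3, 2, 3, 0, 1, 2], ideal_gains=[3, 3, 2, 2, 1, 0]))
```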
Other measures
• There are tens of IR measures!
• trec_eval is a little program that computes many of them
  – 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
  – http://trec.nist.gov/trec_eval/
• "there is a measure to make anyone a winner"
  – Not really true, but still…
Other measures
• How about correlations between measures?
• Kendall tau values, from Voorhees and Harman, 2004
• Overall, the measures correlate well
           P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)      0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)             0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                    0.93  0.87     0.83       0.83     0.67
MAP                             0.88     0.85       0.85     0.64
.5 prec                                  0.77       0.78     0.63
R(1,1000)                                           0.92     0.67
Rel ret                                                      0.66
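A correlation such as those in the table can be computed from the scores two measures assign to the same set of runs; a minimal sketch using scipy.stats.kendalltau (the run names and scores below are invented for illustration):

```python
from scipy.stats import kendalltau

# MAP and P@10 scores for the same five hypothetical runs
map_scores = {"runA": 0.31, "runB": 0.28, "runC": 0.25, "runD": 0.22, "runE": 0.19}
p10_scores = {"runA": 0.52, "runB": 0.55, "runC": 0.47, "runD": 0.40, "runE": 0.33}

runs = sorted(map_scores)
tau, p_value = kendalltau([map_scores[r] for r in runs],
                          [p10_scores[r] for r in runs])
print(f"Kendall tau between the two measure-induced rankings: {tau:.2f}")
```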
Relevance assessments
• Ideally
  – Sit down and look at all documents
• Practically
  – The ClueWeb09 collection has
    • 1,040,809,705 web pages, in 10 languages
    • 5 TB, compressed (25 TB, uncompressed)
  – No way to do this exhaustively
  – Look only at the set of returned documents
    • Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle incomplete relevance assessments
  – Consider unjudged = non-relevant, or
  – Do not consider unjudged documents at all (i.e. compress the ranked lists)
• A new measure:
  – Bpref (binary preference) – see the sketch below
• And a few others:
  – Rank-Biased Precision (RBP), Rpref (for graded relevance)
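As an illustration of the idea behind bpref – score relevant documents by how few judged non-relevant documents are ranked above them, ignoring unjudged documents – here is a hedged sketch of one published formulation (it approximates, but is not guaranteed to match, what trec_eval computes):

```python
def bpref(ranking, relevant, nonrelevant):
    """Binary preference: only judged documents are taken into account.

    ranking:      ranked list of document ids returned by the system
    relevant:     set of documents judged relevant
    nonrelevant:  set of documents judged non-relevant (unjudged docs are ignored)
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen = 0
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # penalty grows with the number of judged non-relevant docs ranked above
            penalty = min(nonrel_seen, R) / min(R, N) if N > 0 else 0.0
            score += 1.0 - penalty
    return score / R
```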
Relevance assessments - Pooling
• Combine the results retrieved by all systems
• Choose a parameter k (typically 100)
• Choose the top k documents as ranked in each submitted run
• The pool is the union of these sets of documents
  – Between k and (# submitted runs) × k documents in the pool
  – The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
• Give the pool to judges for relevance assessments
(From Donna Harman)
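The pool-building step itself is mechanical; a minimal sketch (the data layout is an assumption for illustration):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run, per topic.

    runs: {run_name: {topic_id: ranked list of document ids}}
    Returns {topic_id: set of document ids to be judged}.
    """
    pool = {}
    for ranked_lists in runs.values():
        for topic, ranking in ranked_lists.items():
            pool.setdefault(topic, set()).update(ranking[:k])
    return pool
```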
Relevance assessments - Pooling
• Conditions under which pooling works [Robertson]:
  – A range of different kinds of systems, including manual systems
  – Reasonably deep pools (100+ from each system)
    • But this depends on collection size
  – The collection cannot be too big
    • Big is so relative…
Relevance assessments - Pooling
• Advantage of pooling:
  – Fewer documents must be manually assessed for relevance
• Disadvantages of pooling:
  – We can't be certain that all documents satisfying the query are found (recall values may not be accurate)
  – Runs that did not participate in the pooling may be disadvantaged
  – If only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these
Relevance assessments
• Pooling with randomized sampling
• As the data collection grows, the top 100 may not be representative of the entire result set
  – (i.e. the assumption that everything after the pool is not relevant does not hold anymore)
• Add, to the pool, a set of documents randomly sampled from the entire retrieved set
  – If the sampling is uniform it is easy to reason about, but it may be too sparse as the collection grows
  – Stratified sampling: get more from the top of the ranked list (see the sketch below)
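A sketch of the stratified idea: sample densely near the top of each ranked list and more sparsely further down (the strata boundaries and rates below are arbitrary illustrative choices):

```python
import random

def stratified_sample(ranking,
                      strata=((0, 100, 1.0), (100, 1000, 0.2), (1000, None, 0.02))):
    """Sample a ranked list with decreasing rates: (start, end, sampling_rate) per stratum."""
    sampled = []
    for start, end, rate in strata:
        segment = ranking[start:end]
        n = max(1, round(rate * len(segment))) if segment else 0
        sampled.extend(random.sample(segment, min(n, len(segment))))
    return sampled
```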
Relevance assessment - subjectivity
• In TREC-CHEM'09 we had each topic evaluated by two students
  – "conflicts" ranged between 2% and 33% (excluding a topic with 60% conflict)
  – All of these increased if we considered "strict disagreement"
• In general, inter-evaluator agreement is rarely above 80%
• There is little one can do about it – it has to be dealt with
Relevance assessment - subjectivity
• Good news:
  – The "idiosyncratic nature of relevance judgments does not affect comparative results" (E. Voorhees)
  – Mean Kendall tau between system rankings produced from different query relevance sets: 0.938
  – Similar results held for:
    • Different query sets
    • Different evaluation measures
    • Different assessor types
    • Single opinion vs. group opinion judgments
Statistical validity
• Whatever evaluation metric is used, all experiments must be statistically valid
  – i.e. differences must not be the result of chance
[Figure: bar chart of MAP values, ranging between 0 and 0.2]
Statistical validity
• Ingredients of a significance test
  – A test statistic (e.g. the differences between AP values)
  – A null hypothesis (e.g. "there is no difference between the two systems")
    • This gives us a particular distribution of the test statistic
  – A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
    • The p-value
• If the p-value is low, we can feel confident in rejecting the null hypothesis: the systems are different
Statistical validity
• Common practice is to declare systems different when the p-value ≤ 0.05
• A few tests:
  – Randomization tests (see the sketch below)
    • Wilcoxon Signed Rank test
    • Sign test
  – Bootstrap test
  – Student's paired t-test
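A minimal sketch of a paired randomization (permutation) test on per-topic score differences; the number of trials and the 0.05 threshold in the comment are conventional choices, not prescriptions:

```python
import random

def randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired randomization test.

    scores_a, scores_b: per-topic scores (e.g. AP) of two systems, same topic order.
    Returns the p-value for the null hypothesis 'the two systems do not differ'.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # under the null hypothesis, the sign of each per-topic difference is arbitrary
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / trials

# p = randomization_test(ap_system1, ap_system2)
# if p <= 0.05: the difference is unlikely to be due to chance
```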
Statistical Validity - example
Statistical validity
• How do we increase the statistical validity of an experiment?
• By increasing the number of topics
  – The more topics, the more confident we are that the difference between average scores will be significant
• What’s the minimum number of topics?
  – 42
  – It depends, but:
    • TREC started with 50
    • Below 25 is generally considered not significant
Test Collections
• Generally created as the result of an evaluation campaign
  – TREC – Text Retrieval Conference (USA)
  – CLEF – Cross Language Evaluation Forum (EU)
  – NTCIR – NII Test Collection for IR Systems (JP)
  – INEX – Initiative for the Evaluation of XML Retrieval
• First one and paradigm definer:
  – The Cranfield Collection
    • In the 1950s
    • Aeronautics
    • 1400 queries, about 6000 documents
    • Fully evaluated
TREC
• Started in 1992
• Always organised in the States, on the NIST campus
• As leader, it introduced most of the jargon used in IR Evaluation:
  – Topic = query / request for information
  – Run = a ranked list of results
  – Qrel = relevance judgements
TREC
• Organised as a set of tracks that focus on a particular sub-problem of IR
  – E.g. Chemical, Genomics, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
• The set of tracks in a year depends on:
  – Interest of participants
  – Fit to TREC
  – Needs of sponsors
  – Resource constraints
TREC
[Workflow: Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication]
TREC – Task definition
• Each Track has a set of Tasks
• Examples of tasks from the Blog track:
  – 1. Finding blog posts that contain opinions about the topic
  – 2. Ranking positive and negative blog posts
  – 3. (A separate baseline task to just find blog posts relevant to the topic)
  – 4. Finding blogs that have a principal, recurring interest in the topic
TREC - Topics
• For TREC, topics generally have a specific format (not always though):
  – <ID>
  – <title>
    • Very short
  – <description>
    • A brief statement of what would be a relevant document
  – <narrative>
    • A long description, meant also for the evaluator to understand how to judge the topic
TREC - Topics
• Example:
  – <ID>
    • 312
  – <title>
    • Hydroponics
  – <description>
    • Document will discuss the science of growing plants in water or some substance other than soil
  – <narrative>
    • A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …
CLEF
• Cross Language Evaluation Forum
  – From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
• Started in 2000
• Grand challenge:
  – Fully multilingual, multimodal IR systems
    • Capable of processing a query in any medium and any language
    • Finding relevant information from a multilingual, multimedia collection
    • And presenting it in the style most likely to be useful for the user
CLEF
• Previous tracks:
  – Mono-, bi-, multilingual text retrieval
  – Interactive cross-language retrieval
  – Cross-language spoken document retrieval
  – QA in multiple languages
  – Cross-language retrieval in image collections
  – CL geographical retrieval
  – CL video retrieval
  – Multilingual information filtering
  – Intellectual property
  – Log file analysis
  – Large-scale grid experiments
• From 2010:
  – Organised as a series of "labs"
NTCIR
• Started in 1997, but organized every 1.5 years
• The first to look at Patent data (in 2001/2002)
• Other tracks:
  – Japanese / Cross-language retrieval
  – Web Retrieval
  – Term extraction
  – QA
    • Information Access Dialog
  – Text summarisation
  – Trend information
  – Opinion analysis
INEX
• All previously mentioned campaigns focus on document retrieval
  – Though most of them have 'tracks' on QA or information extraction/access
• INEX is fully dedicated to retrieval of the most relevant parts of a document:
  – "Focused retrieval"
    • = passage retrieval (for long documents)
    • = element retrieval (for XML documents)
    • = page retrieval (for books)
    • = question answering
  – Adds a whole new twist on relevance judgments: the returned passage may contain, be contained in, or partially overlap the correct answer
Test collections
• In summary, it is important to design the right experiment for the right IR task
  – Web retrieval is very different from legal retrieval
• The example of Patent retrieval:
  – High Recall: a single missed document can invalidate a patent
  – Session based: a single search may involve days of cycles of results review and query reformulation
  – Defendable: process and results may need to be defended in court
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion
Discussion on evaluation
• Laboratory evaluation – good or bad?
  – Rigorous testing
  – Over-constrained
• I usually make the comparison to a tennis racket:
  – No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
  – But the user will choose the device based on the lab evaluation
Discussion on evaluation
• There is bias to account for
  – E.g. the number of relevant documents per topic
Discussion on evaluation
• Recall and recall-related measures are often contested
• [Cooper 1973, p. 95]:
  – "The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search"
• Clearly not true in the legal & patent domains
Discussion on Evaluation
• Realistic tasks and user models
  – Evaluation has to be based on the available data sets
    • This creates the user model
    • Tasks need to correspond to available techniques
• Much literature on generating tasks
  – Experts describe typical tasks
  – Use of log files of various sorts
• IR research is decades behind sociology in terms of user modeling – there is a place to learn from
Discussion on Evaluation
• Competitiveness
  – Most campaigns take pains to explain: "This is not a competition – this is an evaluation"
• Competitions are stimulating, but
  – Participants are wary of participating if they are not sure to win
    • Particularly commercial vendors
  – Without special care from the organizers, it stifles creativity:
    • The best way to win is to take last year's method and improve it a bit
    • Original approaches are risky
Discussion on Evaluation
• Other data representations than text
  – Image
  – Music
  – Video
• Could fit into everything discussed above
• The subjectivity problem may become even more of a problem
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
  – IR Evaluation research included
• Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
  – As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
  – VideOlympics (since 2007)
  – PatOlympics (since 2010)
Bibliography
• Modern Information Retrieval – R. Baeza-Yates, B. Ribeiro-Neto
• TREC – Experiment and Evaluation in Information Retrieval – E. Voorhees, D. Harman (eds.)
• A Comparison of Statistical Significance Tests for Information Retrieval Evaluation – M. Smucker, J. Allan, B. Carterette (CIKM'07)
• A Simple and Efficient Sampling Method for Estimating AP and NDCG – E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR'08)