TU Wien, April 12th, 2010 – Evaluation of IR Methods. Mihai Lupu, m.lupu@ir-facility.org, Post-doctoral Research Fellow, IRF
Transcript
Page 1:

TU Wien, April 12th, 2010

Evaluation of IR Methods

Mihai Lupu, m.lupu@ir-facility.org
Post-doctoral Research Fellow, IRF

Page 2:

Outline

• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion

Slides also available at http://mihailupu.net/teaching/IREvalLectureApr12.ppt

Page 3:

Introduction

• Why?
  – Because without evaluation, there is no research
• Why is this a research field in itself?
  – Because there are many kinds of IR
    • With different evaluation criteria
  – Because it's hard
    • Why?
      – Because it involves human subjectivity (document relevance)
      – Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for 'university vienna'?)

Page 4:

Kinds of evaluation

• "Efficient and effective system"
• Time and space: efficiency
  – Generally constrained by pre-development specification
    • E.g. real-time answers vs. batch jobs
    • E.g. index-size constraints
  – Easy to measure
• Good results: effectiveness
  – Harder to define, hence more research into it
• And…

Page 5:

Kinds of evaluation (cont.)

• User studies
  – Does a 2% increase in some retrieval performance measure actually make a user happier?
  – Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method?
  – Hard to do
  – Mostly anecdotal examples
  – IR people don't like to do it (though it's starting to change)

Page 6:

Outline

• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures
  – Test Collections
• Discussion on Evaluation
• Conclusion

Page 7:

Retrieval Effectiveness

• Precision
  – How happy are we with what we've got
• Recall
  – How much more we could have had

Precision = (number of relevant documents retrieved) / (number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of relevant documents)
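A minimal Python sketch of these two set-based definitions (the document IDs below are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall.
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       ["d2", "d3", "d5", "d7", "d8", "d9"]))  # (0.6, 0.5)
```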

Page 8:

Retrieval Effectiveness

[Figure: Venn diagram – the set of retrieved documents and the set of relevant documents within the set of all documents]

Page 9:

Retrieval effectiveness

• Tools we need:
  – A set of documents (the "dataset")
  – A set of questions/queries/topics
  – For each query, and for each document, a decision: relevant or not relevant
• Let's assume for the moment that's all we need and that we have it

Page 10:

Retrieval Effectiveness

• Precision and Recall generally plotted as a “Precision-Recall curve”

[Figure: precision-recall curve – precision decreases as the number of retrieved documents increases]

• They do not play well together

Page 11:

Precision-Recall Curves

• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall axis

[Figure: precision-recall checkpoints for a single query]

Page 12:

Precision-Recall Curves

• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall axis

[Figure: precision-recall checkpoints for a single query]

Page 13:

Precision-Recall Curves

• How to build a Precision-Recall Curve?
  – For one query at a time
  – Make checkpoints on the recall axis
  – Repeat for all queries

[Figure: precision-recall checkpoints for several queries]

Page 14:

Precision-Recall Curves

• And the average is the system’s P-R curve

[Figure: averaged precision-recall curve – precision decreases as the number of retrieved documents increases]

• We can compare systems by comparing the curves
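The slides do not fix a particular recipe for the checkpoints, but a common choice is 11-point interpolated precision; a sketch under that assumption, averaging per-query curves into a system curve:

```python
def interpolated_pr_curve(ranked, relevant, levels=11):
    """11-point interpolated precision for one query.

    ranked:   list of document IDs in the order they were retrieved
    relevant: set of document IDs judged relevant
    Returns the interpolated precision at recall 0.0, 0.1, ..., 1.0.
    """
    relevant = set(relevant)
    points = []                      # (recall, precision) at each relevant retrieved doc
    hits = 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    curve = []
    for level in (l / (levels - 1) for l in range(levels)):
        # interpolated precision = max precision at any recall >= this level
        precs = [p for r, p in points if r >= level]
        curve.append(max(precs) if precs else 0.0)
    return curve

def average_pr_curve(queries):
    """Average the per-query curves; `queries` is a list of (ranked, relevant) pairs."""
    curves = [interpolated_pr_curve(r, rel) for r, rel in queries]
    return [sum(col) / len(col) for col in zip(*curves)]
```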

Page 15:

Retrieval effectiveness

• What if we don’t like this twin-measure approach?

• A few solutions:
  – Van Rijsbergen's E-Measure:

    E_β(j) = 1 − (1 + β²) / (β²/R(j) + 1/P(j))

  – Harmonic mean:

    F(j) = 2 / (1/P(j) + 1/R(j))
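A small sketch transcribing the two formulas above, where p and r stand for P(j) and R(j) at some cut-off j:

```python
def f_measure(p, r, beta=1.0):
    """F_beta: harmonic-style combination of precision p and recall r."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return (1 + beta ** 2) / (beta ** 2 / r + 1 / p)

def e_measure(p, r, beta=1.0):
    """Van Rijsbergen's E-measure: lower is better, E = 1 - F_beta."""
    return 1.0 - f_measure(p, r, beta)

print(f_measure(0.6, 0.5))   # 2 / (1/0.6 + 1/0.5) ≈ 0.545
print(e_measure(0.6, 0.5))   # ≈ 0.455
```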

Page 16:

Retrieval Effectiveness

• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?

Page 17:

Single-value measures

• What if we want to compare systems at query level?

[Figure: precision-recall curve]

• Could we have just one measure, to avoid the curves?

Page 18:

Single-value measures

• Average precision
  – For each query:
    • Every time a relevant document is retrieved, calculate precision
    • Average with previously calculated values
    • Repeat until all relevant documents are retrieved
  – For each system:
    • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures
• R-precision
  – Precision after R documents have been retrieved, where R is the number of relevant documents for the query.
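A sketch of these single-value measures as they are usually implemented (relevant documents that are never retrieved contribute zero precision to AP):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of the precision values at each relevant retrieved document."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: mean of the per-query APs; `queries` is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

def r_precision(ranked, relevant):
    """Precision after R documents have been retrieved, R = number of relevant documents."""
    relevant = set(relevant)
    R = len(relevant)
    return len(set(ranked[:R]) & relevant) / R if R else 0.0
```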

Page 19:

Retrieval Effectiveness

• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?

Page 20:

Cumulative Gain

• For each document d and query q, define rel(d,q) ≥ 0
• The higher the value, the more relevant the document is to the query
• Pitfalls:
  – Graded relevance introduces even more ambiguity in practice
  – With great flexibility comes great responsibility to justify parameter values

CG(j) = Σ_{i=1..j} rel(d_i, q)

Page 21:

Retrieval Effectiveness

• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?

Page 22:

Discounted Cumulative Gain

• A system that returns highly relevant documents at the top of the list should be scored higher than one that returns the same documents lower in the ranked list

• Other formulations are also possible
• Neither CG nor DCG can be used for comparison! (they depend on the number of relevant documents per query)

DCG_b(j) = CG(j),                                          if j < b
DCG_b(j) = CG(b−1) + Σ_{i=b..j} rel(d_i, q) / log_b(i),    if j ≥ b

Page 23:

Normalised Discounted Cumulative Gain

• Compute CG / DCG for the optimal return set
  – E.g. the ideal ordering of graded judgments (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0,…) has the Ideal Discounted Cumulative Gain: IDCG
• Normalise:

NDCG_b(j) = DCG_b(j) / IDCG_b(j)
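A sketch of DCG/NDCG following the base-b formulation above: ranks below b keep their full gain, and from rank b onward gains are divided by log_b(rank). The graded judgments in the example are made up:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulative gain of a list of graded relevance values (rank 1 first)."""
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        # ranks below b are not discounted; from rank b on, divide by log_b(rank)
        total += g if rank < b else g / math.log(rank, b)
    return total

def ndcg(gains, b=2):
    """Normalise by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal > 0 else 0.0

# Toy run with graded judgments at ranks 1..5:
print(ndcg([3, 2, 3, 0, 1]))
```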

Page 24:

Other measures

• There are tens of IR measures!
• trec_eval is a little program that computes many of them
  – 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20, …)
  – http://trec.nist.gov/trec_eval/
• "there is a measure to make anyone a winner"
  – Not really true, but still…

Page 25:

Other measures

• How about correlations between measures?

• Kendall Tau values, from Voorhees and Harman, 2004
• Overall they correlate

            P(30)  R-Prec  MAP    .5 prec  R(1,1000)  Rel Ret  MRR
P(10)       0.88   0.81    0.79   0.78     0.78       0.77     0.77
P(30)              0.87    0.84   0.82     0.80       0.79     0.72
R-Prec                     0.93   0.87     0.83       0.83     0.67
MAP                                0.88    0.85       0.85     0.64
.5 prec                                    0.77       0.78     0.63
R(1,1000)                                             0.92     0.67
Rel ret                                                        0.66
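This kind of analysis can be reproduced by computing Kendall's tau between the system orderings induced by two measures; a sketch using SciPy, with made-up per-system scores:

```python
from scipy.stats import kendalltau

# Hypothetical per-system scores under two measures.
systems = ["sysA", "sysB", "sysC", "sysD", "sysE"]
map_scores = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22, "sysD": 0.19, "sysE": 0.35}
p10_scores = {"sysA": 0.52, "sysB": 0.55, "sysC": 0.40, "sysD": 0.33, "sysE": 0.58}

tau, p_value = kendalltau([map_scores[s] for s in systems],
                          [p10_scores[s] for s in systems])
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```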

Page 26:

Retrieval Effectiveness

• Not quite done yet…
  – When to stop retrieving?
    • Both P and R imply a cut-off value
  – How about graded relevance?
    • Some documents may be more relevant to the question than others
  – How about ranking?
    • Can a document retrieved at position 1,234,567 still be considered useful?
  – Who says which documents are relevant and which not?

Page 27:

Relevance assessments

• Ideally
  – Sit down and look at all documents
• Practically
  – The ClueWeb09 collection has
    • 1,040,809,705 web pages, in 10 languages
    • 5 TB, compressed (25 TB, uncompressed)
  – No way to do this exhaustively
  – Look only at the set of returned documents
    • Assumption: if there are enough systems being tested and not one of them returned a document, the document is not relevant

Page 28:

Relevance assessments - incomplete

• The unavoidable conclusion is that we have to handle incomplete relevance assessments
  – Consider unjudged = non-relevant
  – Do not consider unjudged at all (i.e. compress the ranked lists)
• A new measure:
  – Bpref (binary preference)
• And a few others:
  – Rank-Biased Precision (RBP), Rpref (for graded relevance)
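Bpref exists in several variants (the original Buckley & Voorhees formulation and later trec_eval adjustments differ in details), so the following is only an illustrative sketch of the core idea: reward each relevant retrieved document by how few judged non-relevant documents are ranked above it, and skip unjudged documents entirely:

```python
def bpref(ranked, judged_relevant, judged_nonrelevant):
    """Illustrative bpref sketch: unjudged documents are ignored entirely.

    ranked:             list of document IDs in retrieval order
    judged_relevant:    set of IDs judged relevant
    judged_nonrelevant: set of IDs judged non-relevant
    """
    R, N = len(judged_relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    nonrel_above = 0
    score = 0.0
    for doc in ranked:
        if doc in judged_nonrelevant:
            nonrel_above += 1
        elif doc in judged_relevant:
            if N == 0:
                score += 1.0   # no judged non-relevant documents at all
            else:
                # penalise by the (capped) number of judged non-relevant docs ranked above
                score += 1.0 - min(nonrel_above, R) / min(R, N)
    return score / R
```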

Page 29:

Relevance assessments - Pooling

• Combine the results retrieved by all systems
• Choose a parameter k (typically 100)
• Choose the top k documents as ranked in each submitted run
• The pool is the union of these sets of docs
  – Between k and (# submitted runs) × k documents in the pool
  – The (k+1)st document returned in one run is either irrelevant or ranked higher in another run
• Give the pool to judges for relevance assessments
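A sketch of the pooling step itself, taking the union of the top-k documents of each run (the run names and k are illustrative):

```python
def build_pool(runs, k=100):
    """Build an assessment pool from several runs.

    runs: dict mapping run name -> ranked list of document IDs
    k:    pool depth (top-k documents taken from each run)
    Returns the set of document IDs to send to the assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "run_bm25": ["d3", "d7", "d1", "d9"],
    "run_lm":   ["d7", "d2", "d3", "d8"],
}
print(sorted(build_pool(runs, k=3)))   # union of the two top-3 lists: d1, d2, d3, d7
```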

Page 30:

[Figure from Donna Harman]

Page 31:

Relevance assessments - Pooling

• Conditions under which pooling works [Robertson]
  – A range of different kinds of systems, including manual systems
  – Reasonably deep pools (100+ from each system)
    • But depends on collection size
  – The collections cannot be too big
    • Big is so relative…

Page 32:

Relevance assessments - Pooling

• Advantage of pooling:
  – Fewer documents must be manually assessed for relevance
• Disadvantages of pooling:
  – Can't be certain that all documents satisfying the query are found (recall values may not be accurate)
  – Runs that did not participate in the pooling may be disadvantaged
  – If only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these

Page 33:

Relevance assessments

• Pooling with randomized sampling
• As the data collection grows, the top 100 may not be representative of the entire result set
  – (i.e. the assumption that everything after it is not relevant does not hold anymore)
• Add, to the pool, a set of documents randomly sampled from the entire retrieved set
  – If the sampling is uniform, it is easy to reason about, but may be too sparse as the collection grows
  – Stratified sampling: get more from the top of the ranked list

Page 34:

Relevance assessment - subjectivity

• In TREC-CHEM'09 we had each topic evaluated by two students
  – "Conflicts" ranged between 2% and 33% (excluding a topic with 60% conflict)
  – This all increased if we considered "strict disagreement"
• In general, inter-evaluator agreement is rarely above 80%
• There is little one can do about it – it has to be dealt with

Page 35:

Relevance assessment - subjectivity

• Good news:
  – The "idiosyncratic nature of relevance judgments does not affect comparative results" (E. Voorhees)
  – Mean Kendall Tau between system rankings produced from different query relevance sets: 0.938
  – Similar results held for:
    • Different query sets
    • Different evaluation measures
    • Different assessor types
    • Single opinion vs. group opinion judgments

Page 36:

Statistical validity

• Whatever evaluation metric is used, all experiments must be statistically valid
  – i.e. differences must not be the result of chance

[Figure: bar chart of MAP values]

Page 37:

Statistical validity

• Ingredients of a significance test
  – A test statistic (e.g. the differences between AP values)
  – A null hypothesis (e.g. "there is no difference between the two systems")
    • This gives us a particular distribution of the test statistic
  – A significance level, computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis
    • The p-value
• If the p-value is low, we can feel confident that we can reject the null hypothesis: the systems are different

Page 38:

Statistical validity

• Common practice is to declare systems different when the p-value ≤ 0.05
• A few tests:
  – Randomization tests
    • Wilcoxon Signed Rank test
    • Sign test
  – Bootstrap test
  – Student's Paired t-test
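A sketch of one such test, a two-sided paired randomization (permutation) test over per-topic AP differences; under the null hypothesis the sign of each per-topic difference is arbitrary (the AP values below are made up):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic scores (e.g. AP values).

    Under the null hypothesis the two systems are interchangeable, so each
    per-topic difference can have its sign flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / trials   # p-value

# Hypothetical per-topic AP values for two systems over 10 topics.
ap_a = [0.31, 0.12, 0.45, 0.27, 0.51, 0.08, 0.33, 0.29, 0.41, 0.22]
ap_b = [0.25, 0.10, 0.40, 0.30, 0.44, 0.07, 0.28, 0.27, 0.35, 0.20]
print(randomization_test(ap_a, ap_b))
```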

Page 39:

Statistical Validity - example

Page 40:

Statistical validity

• How do we increase the statistical validity of an experiment?
• By increasing the number of topics
  – The more topics, the more confident we are that the difference between average scores will be significant
• What's the minimum number of topics?
  – 42
• Depends, but:
  – TREC started with 50
  – Below 25 is generally considered not significant

Page 41:

Test Collections

• Generally created as the result of an evaluation campaign
  – TREC – Text REtrieval Conference (USA)
  – CLEF – Cross Language Evaluation Forum (EU)
  – NTCIR – NII Test Collection for IR Systems (JP)
  – INEX – Initiative for the Evaluation of XML Retrieval
• First one and paradigm definer:
  – The Cranfield Collection
    • In the 1950s
    • Aeronautics
    • 1400 queries, about 6000 documents
    • Fully evaluated

Page 42:

TREC

• Started in 1992
• Always organised in the States, on the NIST campus
• As leader, introduced most of the jargon used in IR Evaluation:
  – Topic = query / request for information
  – Run = a ranked list of results
  – Qrel = relevance judgements

Page 43:

TREC

• Organised as a set of tracks, each focusing on a particular sub-problem of IR
  – E.g. Chemical, Genome, Legal, Blog, Spam, Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust
• The set of tracks in a year depends on:
  – Interest of participants
  – Fit to TREC
  – Needs of sponsors
  – Resource constraints

Page 44:

TREC

[Figure: the TREC process – Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Relevance assessments → Results evaluation → Results analysis → TREC conference → Proceedings publication]

Page 45:

TREC – Task definition

• Each Track has a set of Tasks
• Examples of tasks from the Blog track:
  1. Finding blog posts that contain opinions about the topic
  2. Ranking positive and negative blog posts
  3. (A separate baseline task to just find blog posts relevant to the topic)
  4. Finding blogs that have a principal, recurring interest in the topic

Page 46:

TREC - Topics

• For TREC, topics generally have a specific format (not always, though)
  – <ID>
  – <title>
    • Very short
  – <description>
    • A brief statement of what would be a relevant document
  – <narrative>
    • A long description, meant also for the evaluator, to understand how to judge the topic

Page 47:

TREC - Topics

• Example:
  – <ID> 312
  – <title> Hydroponics
  – <description> Document will discuss the science of growing plants in water or some substance other than soil
  – <narrative> A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- …

Page 48:

CLEF

• Cross Language Evaluation Forum
  – From 2010: Conference on Multilingual and Multimodal Information Access Evaluation
• Started in 2000
• Grand challenge:
  – Fully multilingual, multimodal IR systems
    • Capable of processing a query in any medium and any language
    • Finding relevant information from a multilingual multimedia collection
    • And presenting it in the style most likely to be useful for the user

Page 49:

CLEF

• Previous tracks:
  – Mono-, bi-, and multilingual text retrieval
  – Interactive cross-language retrieval
  – Cross-language spoken document retrieval
  – QA in multiple languages
  – Cross-language retrieval in image collections
  – CL geographical retrieval
  – CL video retrieval
  – Multilingual information filtering
  – Intellectual property
  – Log file analysis
  – Large-scale grid experiments
• From 2010
  – Organised as a series of "labs"

Page 50:

NTCIR

• Started in 1997, but organized every 1.5 years
• The first to look at patent data (in 2001/2002)
• Other tracks:
  – Japanese / cross-language retrieval
  – Web retrieval
  – Term extraction
  – QA
    • Information Access Dialog
  – Text summarisation
  – Trend information
  – Opinion analysis

Page 51:

INEX

• All previously mentioned campaigns focus on document retrieval
  – Though most have 'tracks' on QA or information extraction/access
• INEX is fully dedicated to retrieval of the most relevant parts of a document: "focused retrieval"
  – = passage retrieval (for long documents)
  – = element retrieval (for XML documents)
  – = page retrieval (for books)
  – = question answering
  – Adds a whole new twist on relevance judgments: the returned passage may contain, be contained in, or partially overlap the correct answer

Page 52:

Test collections

• In summary, it is important to design the right experiment for the right IR task
  – Web retrieval is very different from legal retrieval
• The example of patent retrieval:
  – High recall: a single missed document can invalidate a patent
  – Session-based: single searches may involve days of cycles of results review and query reformulation
  – Defendable: the process and results may need to be defended in court

Page 53:

Outline

• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
  – Measures, Experimentation
  – Test Collections
• Discussion on Evaluation
• Conclusion

Page 54:

Discussion on evaluation

• Laboratory evaluation – good or bad?
  – Rigorous testing
  – Over-constrained
• I usually make the comparison to a tennis racket:
  – No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user
  – But the user will choose the device based on the lab evaluation

Page 55:

Discussion on evaluation

• There is bias to account for
  – E.g. the number of relevant documents per topic

Page 56:

Discussion on evaluation

• Recall and recall-related measures are often contested
• [Cooper 1973, p. 95]:
  – "The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search"
• Clearly not true in the legal & patent domains

Page 57:

Discussion on Evaluation

• Realistic tasks and user models
  – Evaluation has to be based on the available data sets
    • This creates the user model
    • Tasks need to correspond to available techniques
• Much literature on generating tasks
  – Experts describe typical tasks
  – Use of log files of various sorts
• IR research is decades behind sociology in terms of user modeling – there is a place to learn from

Page 58:

Discussion on Evaluation

• Competitiveness
  – Most campaigns take pains to explain: "This is not a competition – this is an evaluation"
• Competitions are stimulating, but
  – Participants are wary of participating if they are not sure to win
    • Particularly commercial vendors
  – Without special care from the organizers, competition stifles creativity:
    • The best way to win is to take last year's method and improve it a bit
    • Original approaches are risky

Page 59:

Discussion on Evaluation

• Other data representations than text
  – Image
  – Music
  – Video
• Could be fit into all discussed above
• The subjectivity problem may become even more of a problem

Page 60:

Conclusion

• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
  – IR Evaluation research included
  – Statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
  – As such, care must be taken to match, to the extent possible, the real needs of the users
• Experiments in the wild are rare, small and domain-specific:
  – VideOlympics (since 2007)
  – PatOlympics (since 2010)

Page 61:

Bibliography

• Modern Information Retrieval – R. Baeza-Yates, B. Ribeiro-Neto
• TREC – Experiment and Evaluation in Information Retrieval – E. Voorhees, D. Harman (eds.)
• A Comparison of Statistical Significance Tests for Information Retrieval Evaluation – M. Smucker, J. Allan, B. Carterette (CIKM '07)
• A Simple and Efficient Sampling Method for Estimating AP and NDCG – E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR '08)

