On the Evaluation of Snippet Selection for Information Retrieval


A. Overwijk, D. Nguyen, C. Hauff, R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong

Contents
- Properties of a good evaluation method
- Evaluation method of WebCLEF
- Approach
- Results
- Analysis
- Conclusion

Good evaluation method
- Reflects the quality of the system
- Reusability

Evaluation method of WebCLEF

Recall
The sum of the character lengths of all spans in the system’s response that are linked to nuggets (i.e. aspects the user includes in his article), divided by the total sum of span lengths in the responses for that topic across all submitted runs.
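Stated as a formula (the symbol names are ours, not from the slides): with linked(r) the spans of response r that are linked to a nugget, spans(r') the spans of a response r', runs(t) the responses submitted for topic t, and |s| the character length of span s,

\[
\mathrm{recall}(r, t) = \frac{\sum_{s \in \mathrm{linked}(r)} |s|}{\sum_{r' \in \mathrm{runs}(t)} \sum_{s \in \mathrm{spans}(r')} |s|}
\]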

Precision
The number of characters that belong to at least one span linked to a nugget, divided by the total character length of the system’s response.
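Stated as a formula (again with symbols of our choosing), where chars(r) is the set of character positions in response r and length(r) its total character length:

\[
\mathrm{precision}(r) = \frac{\left|\{\, c \in \mathrm{chars}(r) : c \text{ lies in at least one span linked to a nugget} \,\}\right|}{\mathrm{length}(r)}
\]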

Approach
- Better system, better performance scores?
- Similar system, same performance scores?
- Worse system, lower performance scores?

Better system
Last year’s best performing system contains a bug:

our %stopwords = qw(
  ’s
  a
  …
  zwischen
);

for my $w … {
  next if exists $stopwords{$w};
  …
}

Better system

System                      Precision   Recall
With bug                    0.2018      0.2561
Without bug                 0.1328      0.1685
Not filtering stop words    0.1087      0.1380

Similar system
General idea: almost identical snippets should have almost the same precision and recall.

Experiment: remove the last word from every snippet in the output of last year’s best performing system (a sketch of this perturbation follows below).
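A minimal sketch of this perturbation (the one-snippet-per-line input format is an assumption of ours, not the format used in the paper):

use strict;
use warnings;

# Read snippets from standard input and print each one with its last word removed.
while (my $snippet = <STDIN>) {
    chomp $snippet;
    $snippet =~ s/\s*\S+\s*$//;   # drop the trailing word
    print "$snippet\n";
}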

Similar system

System              Precision   Recall
Original            0.2018      0.2561
Last word removed   0.0597      0.0758

Worse system
Delivering snippets based on occurrence (a sketch follows below):
- 1st snippet = 1st paragraph of 1st document
- 2nd snippet = 2nd paragraph of 2nd document
- ...

No different from a search engine, except that the documents are split up into snippets.
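A minimal sketch of this first-occurrence baseline, assuming the retrieved documents arrive in ranked order and that paragraphs are separated by blank lines (both assumptions are ours, not from the slides):

use strict;
use warnings;

# i-th snippet = i-th paragraph of the i-th retrieved document
sub first_occurrence_snippets {
    my @docs = @_;                                  # document texts in ranked order
    my @snippets;
    for my $i (0 .. $#docs) {
        my @paragraphs = split /\n{2,}/, $docs[$i]; # paragraphs assumed blank-line separated
        push @snippets, $paragraphs[$i] if defined $paragraphs[$i];
    }
    return @snippets;
}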

Worse system

          Original               First occurrence
Topic     Precision   Recall     Precision   Recall
17        0.0389      0.0436     0.0389      0.0436
18        0.1590      0.6190     0.1590      0.6190
21        0.4083      0.6513     0.4083      0.6513
23        0.1140      0.1057     0.1140      0.1057
25        0.4240      0.4041     0.4240      0.4041
26        0.0780      0.1405     0.0780      0.1405
Avg.      0.2018      0.2561     0.0536      0.0680

Analysis
- Pool of snippets
- Implementation
- Assessments

Conclusion
Evaluation method is not sufficient:
- Biased towards participating systems
- Correctness of a snippet is too strict

Recommendations:
- N-grams (e.g. ROUGE)
- Multiple assessors per topic
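For reference, the n-gram measure pointed to here (ROUGE-N, as defined by Lin, 2004) has the general form below, where Count_match(gram_n) is the number of n-grams co-occurring in the candidate snippet and a reference:

\[
\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \mathrm{References}} \sum_{gram_n \in S} \mathrm{Count_{match}}(gram_n)}{\sum_{S \in \mathrm{References}} \sum_{gram_n \in S} \mathrm{Count}(gram_n)}
\]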

Questions