Post on 06-Jan-2016
The CLEF 2005 interactive track (iCLEF)
Julio Gonzalo (1), Paul Clough (2) and Alessandro Vallin (3)
(1) Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia
(2) Department of Information Studies, University of Sheffield, UK
(3) ITC-irst, Trento, Italy
Overview
- Consolidation of two pilot user studies at iCLEF 2004
  - Interactive question answering task
  - Interactive image retrieval task
- iCLEF provides resources and experiment design
- Participants select a research question to investigate (by comparing the behaviour and search results of users with a reference and a contrastive system)
- Five research groups submitted results
  - 2 groups for image retrieval
  - 3 groups for QA
Agenda
- Image retrieval task
- Question Answering task
- Ideas for 2006: the flickr task
Cross-language image retrieval
Overview
- Limited evaluation with ranked lists
- Image retrieval systems highly interactive
- Appealing for CLIR research
  - Language-independent: object to be retrieved is an image
- Image Retrieval
  - Purely visual (QBE) e.g. “find images like this one”
  - Text-based e.g. Web image search
  - Combination
Interactive task
- Areas of interaction to study include
  - Query formulation (visual and textual)
  - Query re-formulation (relevance feedback)
  - Browsing/navigating results
  - Identifying/selecting relevant images
- Based on iCLEF methodology
  - Participants require minimum of 8 users
  - Within-subject experimental design
  - 16 search tasks (5 mins per task)
- Participants select area to investigate
Target search task
海 (sea)
- Clear goal for the user (easy to describe task)
- Can be achieved without knowledge of the collection
- Clearly defined measures of success
- Invokes different searching strategies
Search tasks
Participants
- 11 signed up; 2 submitted
- University of Sheffield
  - Compared Italian version of the same system
  - Aimed to test whether automatically-generated menus better for presenting results than ranked list
- Miracle
  - Compared Spanish versus English query formulation
  - Aimed to test whether Boolean AND or OR better
Results
- Miracle
  - 69% of images found English; 66% Spanish
  - Domain-specific terminology caused problems for users (and system)
- University of Sheffield
  - 53% images found using list; 47% menus
  - Users preferred the menus
- Comparison between groups (limited)
  - Miracle: 86/128 images found overall
  - Sheffield: 82/128 images found overall
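
The overall counts above can be turned into success rates with a few lines of Python. This is only an illustrative sketch: it assumes the denominator of 128 is 8 users x 16 search tasks (the minimum stated in the iCLEF methodology slide), which the slides do not say explicitly, and the function name success_rate is made up for this example.

```python
def success_rate(found, users=8, tasks_per_user=16):
    """Fraction of target images found over all user/task sessions.

    Assumption: 128 sessions = 8 users x 16 tasks, inferred from the
    iCLEF methodology slide rather than stated with the results.
    """
    total_sessions = users * tasks_per_user
    return found / total_sessions


# Overall counts reported on the Results slide.
reported = {"Miracle": 86, "Sheffield": 82}

for group, found in reported.items():
    print(f"{group}: {found}/128 = {success_rate(found):.1%}")
```

With only eight users per group, and different systems and user populations in each group, a gap of four images is best read as descriptive, in line with the "limited" caveat on the slide.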
Interactive CL Q&A
Q&A task
[Diagram: Question (native language), Text collection (foreign language), Answer (native language), Q&A search assistant]
Q&A vs. interactive Q&A
- People know some of the answers
  - Questions must be carefully selected
- People can draw inferences
  - Answer from multiple documents: considered in 2004, problems with assessment
  - Combination of document evidence with user knowledge: avoid definition and other open questions.
- People answer in the question language
  - Need to provide high-quality manual translations for assessment.
- People get tired
  - Exclude nil questions, limit question types
Experimental Design
- 8 users (native query language)
- 16 evaluation questions (+ 4 for training)
- 5 minutes per search (~3 hours per user)
- Independent variable: CLIR system design (reference/contrastive)
- Dependent variable: accuracy
- Latin square to block user/question effects
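
To make the Latin-square idea concrete, here is a minimal sketch of one way to build a counterbalanced assignment for 8 users, 16 questions and two systems. It is not the official iCLEF presentation matrix, which the slides do not reproduce: the cyclic square, the grouping of questions into blocks of two, and the names cyclic_latin_square and build_design are all assumptions made for illustration.

```python
from collections import Counter


def cyclic_latin_square(n):
    """n x n Latin square: row r is the sequence 0..n-1 rotated by r."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def build_design(n_users=8, n_questions=16,
                 systems=("reference", "contrastive")):
    """Return {user: [(question, system), ...]} in presentation order.

    Hypothetical scheme: questions are grouped into n_users blocks, the
    block order for each user is one row of a cyclic Latin square, and
    users alternate which system they use for the first half.
    """
    block_size = n_questions // n_users
    blocks = [list(range(b * block_size + 1, (b + 1) * block_size + 1))
              for b in range(n_users)]
    square = cyclic_latin_square(n_users)
    half = n_questions // 2
    design = {}
    for user in range(n_users):
        ordered = [q for b in square[user] for q in blocks[b]]
        # Even-numbered rows start with the first system, odd rows with the second.
        first, second = systems if user % 2 == 0 else systems[::-1]
        design[user + 1] = [(q, first if i < half else second)
                            for i, q in enumerate(ordered)]
    return design


design = build_design()
for user, trials in design.items():
    print(user, trials[:3], "...")
```

Under this scheme every user answers all 16 questions (8 per system), and every question is answered by 4 users with each system, so user and question effects are spread evenly over the two conditions.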
Evaluation measures
- Official score: accuracy (= Q&A track)
- Additional quantitative data: searching time, number of interactions, log analysis in general.
- Additional data: questionnaires (initial, 2 post-system, final), observational information.
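
As a rough illustration of the official score, accuracy per system can be computed as the fraction of user answers judged correct under each condition. The record layout and the toy judgements below are invented for the example; they are not iCLEF data, and the real assessment procedure is not described in these slides.

```python
from collections import defaultdict


def accuracy_by_system(judgements):
    """Fraction of correct answers per system.

    judgements: iterable of (user, question, system, is_correct) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _user, _question, system, is_correct in judgements:
        total[system] += 1
        correct[system] += int(is_correct)
    return {system: correct[system] / total[system] for system in total}


# Toy judgements (made up), three answers per system.
toy = [
    (1, 1, "reference", True),
    (1, 2, "contrastive", False),
    (2, 1, "contrastive", True),
    (2, 2, "reference", False),
    (3, 1, "reference", True),
    (3, 2, "contrastive", False),
]

print(accuracy_by_system(toy))
```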
Experiments
- Alicante: How much context do users need to correctly identify answers? (clauses vs full paragraphs in a QA-based system)
- Salamanca: How useful is MT for the task? (with/without MT) x (poor/good target language skills) x (EN/FR as target language)
- UNED: Is it better to search paragraphs than full documents?
Official results
Remarkable facts
- UNED & Alicante: accuracy increases with larger contexts.
- Salamanca: MT is not very helpful!
- Implications for CL-QA systems?
Ideas for 2006
Participation
[Chart: iCLEF participants and CLEF participants per year, 2001 to 2005; vertical axis from 0 to 80]
Conclusion: terminate the track!
Failure analysis (1)
- High cost of entry
  - Long, boring guidelines.
  - User recruitment, scheduling, training, monitoring.
  - Can’t really do experiment variations.
  - Made a programming mistake? Start recruiting volunteers again.
Failure analysis (2)
- “users screw everything up” (XXX, IR competition organizer). Recruiting, training, monitoring, sometimes even paying… just to see how users ruin your hypothesis.
- Is your search assistant at least good for demonstration purposes? No, because
  1) Cross-language search implies a cross-cultural need
  2) But cross-cultural need is infrequent!
  (Experiment: show your mother)
[Figure: annotated flickr page with labels for title, description, comments, sets and tags (folksonomies); languages shown: Spanish, Italian, English, Japanese]
Advantages of flickr
- Naturally multilingual, new IR challenge (folksonomies)
- You can show your mother! (it is cross-language but it is not cross-cultural)
- Can avoid recruiting users: study behaviour of real web/flickr users (log analysis)
- Challenges of web scenarios (social network effects) plus advantages of controlled scenarios (unlike Google or Yahoo image search)
Interactive Flickr task (2006)
- Target language: Portuguese
- Data: Flickr images (local or via Flickr API)
- Search task:
  - Illustrate this text (open)
  - What’s behind this house? (focused, Q&A type)
  - Sunsets in Mangue Seco (ad-hoc type)
  - Pictures where the Nike logo appears (ad-hoc, content oriented)
- Track real users w. real information needs (log analysis!)
- Experiment design: open!! (let’s also compare evaluation methodologies!)
Plans for 2007
- Make the task compulsory for CLEF participants
- Terminate all other tracks
- Task coordinators hired by Yahoo!
Acknowledgments
- People who helped organize iCLEF 2005: Richard Sutcliffe, Christelle Ayache, Víctor Peinado, Fernando López, Javier Artiles, Jianqiang Wang, Daniela Petrelli
- People already helping us shape the flickr task: Javier Artiles, Peter Anick, Jussi Karlgren, Doug Oard, William Hersh, Donna Harman, Daniela Petrelli, Henning Müller
- All participant groups