Text REtrieval Conference (TREC)
Overview of TREC 2016
Ellen Voorhees
TREC 2016 Track Coordinators
Clinical Decision Support: Kirk Roberts, Dina Demner-Fushman, Bill Hersh, Ellen Voorhees
Contextual Suggestion: Seyyed Hadi Hashemi, Jaap Kamps, Julia Kiseleva, Charlie Clarke
Dynamic Domain: Grace Hui Yang, Ian Soboroff
Live QA: David Carmel, Dan Pelleg, Yuval Pinter, Eugene Agichtein, Donna Harman
OpenSearch: Krisztian Balog, Anne Schuth
Real-time Summarization: Jimmy Lin, Richard McCreadie, Adam Roegiest, Fernando Diaz
Tasks: Manish Verma, Evangelos Kanoulas, Emine Yilmaz, Rishabh Mehrotra, Ben Carterette, Nick Craswell, Peter Bailey
Total Recall: Gord Cormack, Maura Grossman, Adam Roegiest, Charlie Clarke
TREC 2016 Program Committee
Ellen Voorhees, chair
James Allan, David Lewis, Ben Carterette, Paul McNamee, Gord Cormack, Doug Oard, Sue Dumais, John Prager, Donna Harman, Ian Soboroff, Diane Kelly, Arjen de Vries
74 TREC 2016 Participants
Bauhaus U. Weimar, Beijing U. Posts & Telecommun., Beijing U. of Technology, Carnegie Mellon U., Catalyst Repository Systems, Central China Normal U. (2), Chonbuk National U., City U. Hong Kong, CSIRO, Democritus U. of Thrace, DFKI GmbH, Dhirubhai Ambani Inst. (2), East China Normal U. (2), e-Discovery Team LLC, Emory U.,
ETH Zurich, Fudan U., Georgetown U., Heilongjiang Inst. of Technology, Henan U. of Technology, Hubert Curien Lab, Indian Inst. Tech. BHU (2), Indian Statistical Inst. Kolkata, IRIT, Laval U. & Lakehead U., Leipzig U., Mayo Clinic, Merck KGaA, Nanjing U., Nankai U., National U. Defense Tech (3),
National Children's Hosp., Peking U., Philips Research N. America, Polytechnic U. of Hong Kong, Poznan U. of Technology, Qatar U., Queensland U. of Technology, Ryerson U. & Ferdowsi U., RMIT U., San Francisco State U., Siena College, Texas Advanced Comp. Ctr., TH Koeln U. of Applied Sciences, Trinity College Dublin, U. Federal de Minas Gerais, U. della Svizzera italiana,
U. Amsterdam (2), U. of Delaware (2), U. of Glasgow, U. of Iowa, U. of Maryland (2), U. of Michigan, U. of North Texas, U. of Padua (2), U. of Pittsburgh, U. of Stavanger, U. of Waterloo (3), U. of Wisconsin-Milwaukee, U.S. NLM, Wuhan U., Yahoo!
Number of Participants in TREC
(Bar chart: number of participating groups per year, 1992-2016)
A big thank you to our assessors
TREC Tracks
Basics
Generic tasks:
ad hoc: known collection, unpredictable queries, response is a ranked list
filtering: known queries, document stream, response is a document set
question answering: unpredictable questions, response is an actual answer, not a document
Measures:
recall and precision are fundamental components
ranked list measures: nDCG@X, IA-ERR, Cube Test
filtering measures: F, expected gain, latency
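To make the ranked-list measures concrete, here is a minimal nDCG@k sketch in Python using graded judgments as gains and a log2 rank discount. Track-specific variants mentioned later (e.g., infNDCG) differ mainly in how the judgments are sampled, so this is illustrative rather than any track's official scorer.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gain values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one topic.

    ranking: list of doc ids in system order.
    qrels:   dict mapping doc id -> graded relevance (0 = not relevant).
    """
    gains = [qrels.get(doc, 0) for doc in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: a 3-document ranking scored against toy judgments.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d3": 1, "d9": 2}, k=3))
```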
TREC 2016: a year of consolidation
sophomore year for 4 of 8 tracks
OpenSearch track new
a meta-track focusing on a new evaluation paradigm
continued high [engineering] barrier to participation:
writing to track APIs to access track resources
hard time constraints for responses
Clinical Decision Support
Clinical decision support systems: a piece of the target Health IT infrastructure
aim to anticipate physicians' needs by linking health records to information needed for patient care
some of that info comes from the biomedical literature
Implementation: given a case narrative, return biomedical articles that can be used to accomplish one of three generic clinical tasks: What is the diagnosis?, What is the best treatment?, or What test should be run?
CDS Track Task
Documents:
new snapshot of the open-access subset of PubMed Central
contains ~1.25 million full-text articles
Topics:
30 topics, new for 2016, based on nursing admission notes from an existing medical record database (MIMIC-II)
physicians created corresponding description and summary versions, as well as the designated target clinical task
10 topics for each clinical task type
CDS Track Judgments
judgment sets created using inferred measure sampling (2 strata: all of ranks 1-15; a 20% sample of ranks 16-100), sketched below; main measure is infNDCG
judgments made by physicians coordinated by OHSU
up to 5 runs per participant; all runs contributed to the same set of pools
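A rough illustration of a two-stratum pooling scheme like the one above, written in Python. The official judgment sets and the infNDCG estimator were produced by the track's own sampling procedure, so the function names and details here are assumptions made for the sketch.

```python
import random

def sample_pool(run_rankings, full_depth=15, sample_to=100, rate=0.20, seed=42):
    """Two-stratum judgment pool for one topic.

    run_rankings: list of rankings (each a list of doc ids), one per run.
    Stratum 1: union of ranks 1..full_depth, judged exhaustively.
    Stratum 2: a `rate` random sample of the union of ranks full_depth+1..sample_to.
    """
    stratum1, stratum2 = set(), set()
    for ranking in run_rankings:
        stratum1.update(ranking[:full_depth])
        stratum2.update(ranking[full_depth:sample_to])
    stratum2 -= stratum1                      # don't re-sample already-pooled docs
    rng = random.Random(seed)
    k = round(rate * len(stratum2))
    sampled = set(rng.sample(sorted(stratum2), k))
    return stratum1, sampled
```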
CDS Track Sample Topic
CDS Results: Automatic, Note
Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG
CDS Results: Auto, Summary
Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG
CDS Results: Manual
Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG
Dynamic Domain Goal
evaluate methods that support the entire information-seeking process for exploratory search in complex domains
systems must support the dynamic nature of search in a cost-effective manner
Implementation:
interaction jig referred to as the Simulated User
participants submit 5-document packets to the Simulated User and get judgments for the individual facets of the topic
each packet submission with feedback is one iteration
system decides to stop when it thinks sufficient information for all facets has been retrieved
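A schematic of that submit-and-feedback loop, with hypothetical rank_documents() and submit_packet() standing in for a participant's ranker and the track's jig; the real jig API and real stopping strategies are more involved than this sketch.

```python
def search_one_topic(topic_id, rank_documents, submit_packet,
                     max_iterations=20, patience=3):
    """Iteratively submit 5-document packets; stop when feedback dries up.

    rank_documents(topic_id, seen) -> candidate doc ids (hypothetical ranker)
    submit_packet(topic_id, docs)  -> per-doc facet feedback (hypothetical jig call)
    """
    seen, no_gain_rounds = set(), 0
    for _ in range(max_iterations):
        packet = [d for d in rank_documents(topic_id, seen) if d not in seen][:5]
        if not packet:
            break
        feedback = submit_packet(topic_id, packet)   # facet-level judgments per doc
        seen.update(packet)
        if any(f["relevant_facets"] for f in feedback):
            no_gain_rounds = 0
        else:
            no_gain_rounds += 1
        if no_gain_rounds >= patience:               # crude stopping criterion
            break
    return seen
```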
Dynamic Domain Domains
two domains with a total of 53 topics
Ebola: ~680,000 web pages/PDFs/tweets about the Ebola outbreak in Africa in 2014-2015
Polar: ~1.7 million web pages/data files/images/code related to the polar sciences
Topics developed by assessors (Ebola) or USC (Polar)
NIST assessors made judgments for documents found in multiple rounds of searching prior to topic release
assessors also created a gold-standard set of facets for each topic based on these searches
Dynamic Domain Sample Topics
Polar
Topic: polar oceans freshwater sensitivity
How sensitive are the polar oceans to changes in freshwater input?
Subtopic 1: surface freshwater forcing
Subtopic 2: Arctic Freshwater Initiative
Subtopic 3: Freshwater Budget of the Canadian Archipelago
Subtopic 4: Freshwater Fluxes in the East Greenland Current
Subtopic 5: terrestrial and freshwater ecosystems
Ebola
Topic: Ebola Conspiracy Theories
Identify the conspiracies circulating about Ebola.
Subtopic 1: Efforts to Counter
Subtopic 2: Origins
Subtopic 3: The Claims
Dynamic Domain Results
Average Cube Test Score by Iteration
Total Recall Goal
evaluate methods for achieving very high recall, including methods that use a human-in-the-loop
more emphasis on recall, less on different facets than Dynamic Domain track; both emphasize stopping criteria
Implementation:
participant system submits one document at a time to a software jig; the jig both records activity and responds to the system with a relevance judgment for that document
participant decides when to terminate the search; the entire set of documents submitted to the jig counts as the retrieved set
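The loop below is a schematic of that protocol; judge() stands in for the jig's per-document feedback, and the consecutive-miss stopping rule is just one simple possibility, not the track's prescription. Real participants typically re-rank as feedback arrives (e.g., with continuous active learning) rather than using a fixed ordering.

```python
def total_recall_run(ranked_candidates, judge, stop_after_misses=100):
    """Submit one document at a time; stop after a long run of non-relevant docs.

    ranked_candidates: iterable of doc ids in the system's priority order.
    judge(doc_id):     hypothetical jig callback, True if the doc is relevant.
    """
    retrieved, misses = [], 0
    for doc_id in ranked_candidates:
        relevant = judge(doc_id)            # jig records the submission and answers
        retrieved.append((doc_id, relevant))
        misses = 0 if relevant else misses + 1
        if misses >= stop_after_misses:     # naive stopping criterion
            break
    return retrieved
```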
Total Recall
At Home Collection:
Jeb Bush email: 34 (new) topics against the email of Florida governor Jeb Bush
Sandbox Collections:
Gov email: six topics against email of the Rod Blagojevich and Patrick Quinn administrations
Twitter: four topics against a collection of 800,000 tweets
Tasks:
at home: systems connect to the jig over the Internet; the participant's machine contains the document set and the search runs there
sandbox: participant's system sent as a virtual machine that runs on an isolated machine along with the jig; the participant never sees any documents, but gets counts of relevant documents returned as a function of the number of documents submitted; automatic runs only
Total Recall At Home judgments:
NIST assessors judged multiple rounds of documents per topic
(small) sample of documents independently judged by multiple assessors
3-way scale: not relevant, relevant, important
primary assessor also grouped relevant/important documents into clusters representing the main subtopics (aspects) of the topic
Enables evaluation contrasts: all relevant vs. important; assessor differences; coverage
Total Recall At Home Results
Average Gain Curve: All relevant, primary assessor
OpenSearch: Living Labs comes to TREC
provide access to real users doing their real searches at the actual time they search
at scale
For TREC participants, an ad hoc search re-ranking task
OpenSearch Protocol
Sites provide frequent queries and valid document sets
Participants re-rank the document sets for each query and upload the new rankings
Once a query in the set is issued, a participant is randomly selected and that participant's corresponding ranking is interleaved with the site's native ranking (sketched below)
all user interactions with the ranked list are recorded
participant's ranking declared to be a win, loss, or tie with respect to the native ranking
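The slides do not say which interleaving method the sites use; team-draft interleaving is a common choice in living-labs settings, so the sketch below shows that variant together with a click-based win/loss/tie decision. Treat it as illustrative, not as the sites' actual implementation.

```python
import random

def team_draft_interleave(site, participant, rng=random):
    """Team-draft style interleaving (sketch): each round both sides pick their
    best not-yet-placed document, in a randomly chosen order; the contributing
    team is remembered per document."""
    interleaved, team = [], {}
    while True:
        order = ["site", "participant"] if rng.random() < 0.5 else ["participant", "site"]
        progressed = False
        for name in order:
            ranking = site if name == "site" else participant
            pick = next((d for d in ranking if d not in team), None)
            if pick is not None:
                team[pick] = name
                interleaved.append(pick)
                progressed = True
        if not progressed:
            return interleaved, team

def outcome(clicked_docs, team):
    """Win/loss/tie for the participant, counting clicks credited to each team."""
    p = sum(1 for d in clicked_docs if team.get(d) == "participant")
    s = sum(1 for d in clicked_docs if team.get(d) == "site")
    return "win" if p > s else ("loss" if p < s else "tie")
```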
OpenSearch Results
(Charts: Round 2 wins, losses, and ties plus overall outcomes per participant for the CiteSeerX and SSOAR sites; participants include Udel, webis, UWM, IAPLab, BJUT, QU, Gesis, OS_404, and KarMat)
Contextual Suggestion
"Entertain Me" app: suggest activities based on the user's prior history and target location
Fifth year of track; this year consolidates work from 2015, focusing on creating a reusable test collection for the task
suggestions required to come from a track-created repository of activities
suggestions in profiles might be tagged with features the profile owner finds attractive
Contextual Suggestion Terminology:
a profile represents the user; a profile consists of a set of previously rated activities and possibly some demographic info
a system returns [a ranked list of] suggestions in response to a request
a request contains at least a profile and a target location, and possibly some other data (e.g., time)
a suggestion is an activity from the repository that is located in the target area
Contextual Suggestion Sample Request
location: Cape Coral, FL
group: Family
season: Summer
trip_type: Holiday
duration: Weekend trip
person:
  gender: Male
  age: 23
preferences:
  doc: 00674898-160  rating: 3  tags: Romantic, Seafood, Family Friendly
  doc: 00247656-160  rating: 2  tags: Bar-hopping
  doc: 00085961-160  rating: 3  tags: Gourmet Food
  doc: 00086637-160  rating: 4  tags: Family Friendly, Local Food, Entertainment
  doc: 00086298-160  rating: 0
  doc: 00087389-160  rating: 3  tags: Shopping for Shoes, Family Friendly, Luxury Brand Shopping
  doc: 00405444-152  rating: 3  tags: Art, Art Galleries, Family Friendly, Fine Art Museums
Contextual Suggestion Phase 1 task
crowd-sourced assessors rated attractions from a set of seed locations; these ratings formed initial profiles
participants returned suggestions for the cross product of profiles and set of different locations
suggestions from Phase 1 participants pooled and sent back to requestor for ratings and feature tags
evaluated a total of 61 requests in the test set
ratings on a 5-point scale from Strongly Uninterested to Strongly Interested; top 2 ratings counted as relevant for binary measures
NDCG computed using 3-rating as gain value
Contextual Suggestion Phase 2 task
essentially, a re-ranking task
a request contained the complete set of (unrated) suggestions from all Phase 1 participants for the request; Phase 2 task participants were required to return only suggestions from this set
58 requests in evaluation set
Contextual Suggestion Phase 1 Results
Distribution of Per-Request NDCG Scores for Best Run By Mean NDCG
Contextual Suggestion Phase 2 Results
Distribution of Per-Request NDCG Scores for Best Run By Mean NDCG
Tasks Track Goal
facilitate research on systems that are able to infer the underlying real-world task that motivates a query and then can retrieve documents useful for accomplishing all aspects of that real-world task
Tasks:
Task Understanding: return key phrases covering the breadth of the Task
Task Completion: return documents that are useful for the whole Task
Web/ad hoc
Tasks Track
ClueWeb12 document set
50 topics in test set
track organizers selected topics from logs and created the set of subtasks using their own resources plus participants' submissions
Aspect-based judgments:
depth-20 pools for phrases
depth-10 pools for documents (completion & ad hoc)
documents judged for both relevance and usefulness
Tasks Track Sample Topics
query: fake tan at home
You are trying to find out how to get a fake tan at home.
Subtask 1: Places to get a fake tan
Subtask 2: Determine fake tan for skin type
Subtask 3: Cost of getting a fake tan
Subtask 4: Products to get a fake tan
Subtask 5: Buy products to get a fake tan
Subtask 6: Fake tan DIY guide
Subtask 7: Get fake tan of some part of the body
Subtask 8: Precautions for getting a fake tan
Subtask 9: Pictures of fake tans
query: social media for learning
You want to learn different ways to use social media to enhance learning activities at school.
Subtask 1: How to use blogging for learning
Subtask 2: How to use collaborative calendaring
Subtask 3: How to use podcasting
Subtask 4: How to use social media for collaborative mindmapping
Subtask 5: How to use social media for sharing information
Subtask 6: How to use social media for presentation sharing
Subtask 7: How to use social media for collaborative working
Task Understanding Results
Distribution of Per-topic Scores (ordered by mean ERR-IA@20)
Task Completion Results
Distribution of Per-topic Scores (ordered by Mean ERR-IA@10)
Live QA Goal
create systems that can generate answers in real time for real questions asked by real users
Implementation:
questions sampled from the Yahoo Answers site and directed at participants' systems at a rate of about 1 per minute for 24 hours, May 31-Jun 1
systems required to respond to a question with a single [textual] answer in at most 1 minute (a minimal responder is sketched below); answers recorded by the track server
at the end of the evaluation period, questions and responses sent to NIST for judgment
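The shape of a participant-side answering service, sketched with Python's standard library. The actual track API (endpoint, field names such as title/body, and the response format) is defined in the track guidelines, so everything concrete here is a placeholder; the only point is that answer_question() must finish well inside the one-minute window.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

def answer_question(title, body):
    # Placeholder for the actual QA pipeline; must return within the time limit.
    return "I don't know."

class LiveQAHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = parse_qs(self.rfile.read(length).decode("utf-8"))
        answer = answer_question(fields.get("title", [""])[0],
                                 fields.get("body", [""])[0])
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(answer.encode("utf-8"))

if __name__ == "__main__":
    # Hypothetical port; the track server would POST questions to this address.
    HTTPServer(("0.0.0.0", 8080), LiveQAHandler).serve_forever()
```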
Live QA Questions
drawn from seven top-level Yahoo Answers categories, as self-labeled by the asker
lightly filtered to remove objectionable material
final test set of 1015 questions
Summarization pilot task: systems return a concise re-statement of the question indicating its main focus
Live QA Sample Questions
Category: Health
Bat ran into me, should I be afraid of rabies?
Could I catch rabies from a bat which ran into me?
Category: Beauty & Style
Is waterproof mascara and eyeliner necessary for traveling in hot and humid areas (Thailand and Singapore)? If I just go with normal ones, will they not smudge or melt off?
Do I need to use waterproof mascara and eyeliner in hot, humid areas?
Category: Pets
One of my dogs left? I have two dogs and one just ran off and we can't find him. It's been three hours and the other one is really depressed she only moves to get water and is breathing heavily. What should I do if my dog does not come back.
What can I do if I can't find my lost dog?
Category: Sports
How does basketball finals work in the NBA? So I like all sorts of sports but my favorite is football (soccer). I was wondering why in basketball have like 4 games just to go to the final. I heard it's gonna be the Cavs vs Warriors?
How are NBA basketball finals structured?
Category: Arts & Humanities
Places to read about early human settlement/migrations in Modern Russia? I was recently listening to lectures by the great courses on big history by David Christain. At some point he brings up the fact that Humans started to settle in northern Ukraine/Russia in the sixth century. I am interested to read more on this settlement/migration. Do you have any suggested scholarly sources/suggestions?
Where can I find out more about early settlements and migration in Russia?
Live QA
3 human baselines:
best human answer on the Yahoo! Answers site as voted by the asker
fastest human answer on the site
crowd-sourced answer within 1 min. response time
Scoring:
NIST assessors rated responses: -2 Unreadable; 1 Poor; 2 Fair; 3 Good; 4 Excellent
run scores are a function of the rating assigned per question:
avgScore(0-3): conflate all negative ratings to 0 and subtract 1 from the other ratings; take the mean of the ratings
prec@i+: number of questions with a rating of at least i divided by the number of questions the system responded to
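Under those definitions, both scores can be computed directly from the per-question ratings. A small sketch follows; the official script's handling of unanswered questions may differ, so treat it as an approximation.

```python
def live_qa_scores(ratings, num_answered):
    """ratings: assessor ratings (-2, 1, 2, 3, 4) for the answered questions."""
    shifted = [0 if r < 0 else r - 1 for r in ratings]          # map to 0-3
    avg_score_0_3 = sum(shifted) / len(shifted) if shifted else 0.0
    def prec_at(i):
        return sum(1 for r in ratings if r >= i) / num_answered if num_answered else 0.0
    return avg_score_0_3, prec_at(2), prec_at(3), prec_at(4)

# e.g. four answered questions rated 4, 2, 1, -2
print(live_qa_scores([4, 2, 1, -2], num_answered=4))   # (1.0, 0.5, 0.25, 0.25)
```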
Live QA Results
(Scatter plot: avgScore(0-3) vs. Prec@2+ per run; participating groups include CMU, Qatar U., CLIP, Yahoo, UTRGV-JBC, Emory U., RMIT, ECNU, SFSU, U. Waterloo, ECNUCS, DFKI, PRNA, NUDT681, NUDTMDP, and U. Leipzig)
Real-time Summarization Goal
examine techniques for constructing real-time update summaries from social media streams in response to users' information needs
Mash-up of the TREC 2015 Microblog and Temporal Summarization tracks
(a type of) filtering task over the tweet stream
Task A: deliver updates to a mobile device
Task B: periodic digest of updates
Real-time Summarization
Participant listens to the Twitter public feed for the evaluation period (~10 days in August)
Pushes a tweet to the RTS server when it decides to return that tweet for a profile (a client is sketched below); at most 10 tweets per day per profile
Server records time of receipt
A subset of the pushed tweets sent to crowd-sourced mobile assessors; an assessor may or may not make a judgment; for 2016, judgments were not returned to participants
Digest task results uploaded to NIST after the evaluation period ended
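A minimal push client with the per-profile daily cap enforced locally. The broker URL and payload fields below are placeholders, not the track's actual REST API.

```python
import datetime
import urllib.request

PUSH_URL = "https://rts-broker.example.org/push"   # hypothetical endpoint
quota = {}                                          # (profile_id, date) -> count

def push(profile_id, tweet_id):
    """Push one tweet for a profile, respecting the 10-per-day cap."""
    today = datetime.date.today().isoformat()
    if quota.get((profile_id, today), 0) >= 10:
        return False                                # daily cap reached, stay silent
    req = urllib.request.Request(
        PUSH_URL,
        data=f"profile={profile_id}&tweet={tweet_id}".encode("utf-8"),
        method="POST")
    with urllib.request.urlopen(req, timeout=10):
        pass                                        # server records time of receipt
    quota[(profile_id, today)] = quota.get((profile_id, today), 0) + 1
    return True
```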
Real-time Summarization
Task A:
return at most 10 tweets/topic/day
lag between the time a tweet is available and the decision to return it to the user should be minimized
scored using Expected Latency Gain (ELG)
Task B:
return at most 100 [ranked] tweets/topic/day; a decision any time within the day is fine
scored using nDCG
For both: Automatic, Manual Preparation, or Manual Interaction runs
manual clustering of relevant tweets defines the equivalence classes used for redundancy penalties in scoring
relevance judgments for unjudged tweets in an equivalence class (e.g., retweets) assigned as a function of the judged tweets in the class
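One simple reading of the cluster-based redundancy penalty is that a pushed tweet earns gain only the first time its equivalence class appears. The sketch below shows just that piece, ignoring ELG's latency discount and per-day bookkeeping, so it is only meant to show how the clusters feed into scoring.

```python
def deduplicated_gain(pushed_tweets, gain, cluster_of):
    """Sum gains over pushed tweets, counting each equivalence class once.

    pushed_tweets: tweet ids in push order for one topic.
    gain:          dict tweet id -> graded gain (0 if not relevant).
    cluster_of:    dict tweet id -> equivalence-class id for relevant tweets.
    """
    seen_clusters, total = set(), 0.0
    for tweet in pushed_tweets:
        g = gain.get(tweet, 0)
        cluster = cluster_of.get(tweet)
        if g > 0 and cluster not in seen_clusters:
            total += g                       # first tweet from this cluster
            seen_clusters.add(cluster)
    return total
```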
Real-time Summarization Profiles
combination of re-used TREC 2015 and newly developed profiles (total of 203)
describe prospective information need
Judgments:
mobile assessors judged tweets as relevant, redundant, or not relevant
NIST assessors judged pools formed from both Task A and Task B runs on a 3-way scale of not relevant, relevant, highly relevant, and created clusters of semantically equivalent relevant tweets
56 profiles with NIST judgments; 123 with mobile judgments
Real-time Sample Topics
Title: Hershey, PA quilt show
Description: Find information on the quilt show being held in Hershey, PA
Narrative: The user is a beginning quilter who would like to attend her first quilt show. She has learned that a major quilt show will happen in Hershey, PA, and wants to see tweets about the show, including such things as announcement of classes, teachers or vendors attending the show; prize-winning quilts; comments on logistics, travel information, and lodging; opinions about the quality of the show.
Title: FIFA corruption investigation
Description: Find information related to the ongoing investigation of FIFA officials for corruption.
Narrative: The user is a soccer fan who is interested in the current status of the ongoing investigation by various governments of corruption and bribery by officials of FIFA (Federation Internationale de Football Association). This includes tweets giving information on various investigations and possible rebidding of the 2018 and 2022 World Cup games.
Title: Mount Rushmore
Description: Find tweets about people's reactions to and experiences when visiting Mount Rushmore.
Narrative: The user is considering a trip to South Dakota to see Mount Rushmore. She would like to see what reaction other tourists have had to the site as well as any traveling tips and advice to make the trip more enjoyable.
Top Task A Runs: Mobile Judgments
Ordered by Strict Precision
(Stacked bar chart: number of tweets judged relevant, redundant, non-relevant, or unjudged per run, annotated with strict precision values)
(P) = Manual Preparation run
Top Task A Runs: NIST Judgments
Distribution of Per-topic EG-1 Scores for Best Run by Mean EG-1
(P) = Manual Preparation run; Empty Run shown for comparison
Top Task B Runs
Distribution of Per-topic nDCG-1 Scores for Best Run by Mean nDCG-1
(P) = Manual Preparation run
TREC 2017 Tracks
Dynamic Domain, Live QA, OpenSearch, Real-time Summarization, and Tasks tracks continuing
CDS becomes Precision Medicine
new track: Complex Answer Retrieval
TREC 2017 track planning sessions:
1.5 hours per track tomorrow (3- or 4-way parallel)
track coordinators attending TREC 2016
you can help shape the task(s); make your opinions known