Transcript
  • Text REtrieval Conference (TREC)

    Overview of TREC 2016

    Ellen Voorhees

  • Text REtrieval Conference (TREC)

    TREC 2016 Track Coordinators

    Clinical Decision Support: Kirk Roberts, Dina Demner-Fushman, Bill Hersh, Ellen Voorhees
    Contextual Suggestion: Seyyed Hadi Hashemi, Jaap Kamps, Julia Kiseleva, Charlie Clarke
    Dynamic Domain: Grace Hui Yang, Ian Soboroff
    Live QA: David Carmel, Dan Pelleg, Yuval Pinter, Eugene Agichtein, Donna Harman
    OpenSearch: Krisztian Balog, Anne Schuth
    Real-time Summarization: Jimmy Lin, Richard McCreadie, Adam Roegiest, Fernando Diaz
    Tasks: Manish Verma, Evangelos Kanoulas, Emine Yilmaz, Rishabh Mehrotra, Ben Carterette, Nick Craswell, Peter Bailey
    Total Recall: Gord Cormack, Maura Grossman, Adam Roegiest, Charlie Clarke

  • Text REtrieval Conference (TREC)

    TREC 2016 Program Committee

    Ellen Voorhees (chair), James Allan, David Lewis, Ben Carterette, Paul McNamee, Gord Cormack, Doug Oard, Sue Dumais, John Prager, Donna Harman, Ian Soboroff, Diane Kelly, Arjen de Vries

  • Text REtrieval Conference (TREC)

    74 TREC 2016 Participants

    Bauhaus U. Weimar ETH Zurich National Children's Hosp. U. Amsterdam (2)

    Beijing U. Posts & Telecommun. Fudan U. Peking U. U. of Delaware (2)

    Beijing U. of Technology Georgetown U. Philips Research N. America U. of Glasgow

    Carnegie Mellon U. Heilongjiang Inst. of Technology Polytechnic U. of Hong Kong U. of Iowa

    Catalyst Repository Systems Henan U. of Technology Poznan U. of Technology U. of Maryland (2)

    Central China Normal U. (2) Hubert Curien Lab Qatar U. U. of Michigan

    Chonbuk National U. Indian Inst. Tech, BHU (2) Queensland U. of Technology U. of North Texas

    City U. Hong Kong Indian Statistical Inst., Kolkata Ryerson U. & Ferdowsi U. U. of Padua (2)

    CSIRO IRIT RMIT U. U. of Pittsburgh

    Democritus U. of Thrace Laval U. & Lakehead U. San Francisco State U. U. of Stavanger

    DFKI GmbH Leipzig U. Siena College U. of Waterloo (3)

    Dhirubhai Ambani Inst. (2) Mayo Clinic Texas Advanced Comp. Ctr. U. of Wisconsin-Milwaukee

    East China Normal U. (2) Merck KGaA TH Koeln U. Applied Sciences U.S. NLM

    e-Discovery Team, LLC Nanjing U. Trinity College Dublin Wuhan U.

    Emory U. Nankai U. U. Federal de Minas Gerais Yahoo!

    National U. Defense Tech (3) U. della Svizzera italiana

  • Text REtrieval Conference (TREC)

    Number of Participants in TREC

    [Bar chart: number of participating groups per year, 1992-2016]

  • Text REtrieval Conference (TREC)

    A big thank you to our assessors

  • Text REtrieval Conference (TREC)

    TREC Tracks

  • Text REtrieval Conference (TREC)

    Basics

    Generic tasks
    ad hoc: known collection, unpredictable queries, response is a ranked list
    filtering: known queries, document stream, response is a document set
    question answering: unpredictable questions, response is an actual answer, not a document

    Measures
    recall and precision are fundamental components
    ranked list measures: nDCG@X, IA-ERR, CubeTest
    filtering measures: F, expected gain, latency
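    For concreteness, a minimal sketch of one of the ranked-list measures named above, nDCG@X, computed from graded relevance gains. The log2 discount is the standard formulation, and for simplicity the ideal ranking is taken as a re-sort of the same list rather than of all judged documents; both are illustrative choices, not track specifications.

    import math

    def dcg_at_k(gains, k):
        # Discounted cumulative gain over the top k ranks (rank 1 = index 0).
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

    def ndcg_at_k(gains_in_ranked_order, k):
        # nDCG@k: DCG of the system ranking divided by DCG of an ideal ordering.
        ideal = sorted(gains_in_ranked_order, reverse=True)
        ideal_dcg = dcg_at_k(ideal, k)
        return dcg_at_k(gains_in_ranked_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Example: graded relevance gains (0 = not relevant) for a ranked list of six documents.
    print(ndcg_at_k([3, 0, 2, 1, 0, 2], k=5))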

  • Text REtrieval Conference (TREC)

    TREC 2016: A year of consolidation

    sophomore year for 4/8 tracks
    OpenSearch track new: a meta-track focusing on a new evaluation paradigm
    continued high [engineering] barrier for participation: writing to track APIs to access track resources, hard time constraints for responses

  • Text REtrieval Conference (TREC)

    Clinical Decision Support

    Clinical decision support systems: a piece of target Health IT infrastructure
    aim to anticipate physicians' needs by linking health records to the information needed for patient care
    some of that info comes from the biomedical literature

    Implementation
    Given a case narrative, return biomedical articles that can be used to accomplish one of three generic clinical tasks: What is the diagnosis? What is the best treatment? What test should be run?

  • Text REtrieval Conference (TREC)

    CDS Track Task

    Documents:
    new snapshot of the open-access subset of PubMed Central
    contains ~1.25 million full-text articles

    Topics:
    30 topics new for 2016, based on nursing admission notes from an existing medical record collection (MIMIC-II)
    physicians created corresponding description and summary versions, as well as the designated target clinical task
    10 topics for each clinical task type

  • Text REtrieval Conference (TREC)

    CDS Track Judgments

    judgment sets created using inferred-measure sampling (2 strata: ranks 1-15; a 20% sample of ranks 16-100); main measure is infNDCG

    judgments made by physicians coordinated by OHSU

    up to 5 runs per participant; all runs contributed to the same set of pools
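    A minimal sketch of how such a two-stratum judgment set could be drawn. The strata boundaries and the 20% rate come from the slide; the assumption that ranks 1-15 are taken in full, and the function and data layout, are illustrative.

    import random

    def two_stratum_judgment_set(runs, seed=0):
        # Stratum 1: every document some run ranks in positions 1-15 (taken in full, assumed).
        # Stratum 2: a 20% random sample of documents ranked 16-100 and not already in stratum 1.
        stratum1, candidates = set(), set()
        for ranked_docs in runs:              # each run: ranked list of doc ids for one topic
            stratum1.update(ranked_docs[:15])
            candidates.update(ranked_docs[15:100])
        candidates -= stratum1
        rng = random.Random(seed)
        k = int(round(0.2 * len(candidates)))
        sampled = set(rng.sample(sorted(candidates), k)) if k else set()
        return stratum1 | sampled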

  • Text REtrieval Conference (TREC)

    CDS Track Sample Topic

  • Text REtrieval Conference (TREC)

    CDS Results: Automatic, Note
    Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG

  • Text REtrieval Conference (TREC)

    CDS Results: Automatic, Summary
    Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG

  • Text REtrieval Conference (TREC)

    CDS Results: Manual
    Distribution of Per-topic infNDCG Scores for Best Run by Mean infNDCG

  • Text REtrieval Conference (TREC)

    Dynamic Domain

    Goal
    evaluate methods that support the entire information-seeking process for exploratory search in complex domains
    systems must support the dynamic nature of search in a cost-effective manner

    Implementation
    interaction jig referred to as the Simulated User
    participants submit 5-doc packets to the Simulated User and get judgments for individual facets of the topic
    each packet submission with feedback is one iteration
    the system decides to stop when it thinks sufficient info for all facets has been retrieved
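    A minimal sketch of the packet-submission loop described above, assuming a hypothetical jig client whose submit() call returns per-facet feedback. The 5-document packet size and the feedback-then-stop structure follow the slide; the client names and the stopping heuristic are illustrative.

    def dynamic_domain_search(jig, topic_id, ranker, packet_size=5, max_iterations=20):
        # Submit 5-doc packets to the Simulated User jig, fold per-facet feedback
        # back into the ranker, and stop when the system believes all facets are covered.
        submitted, covered_facets = set(), set()
        for _ in range(max_iterations):
            packet = ranker.next_docs(topic_id, exclude=submitted, k=packet_size)
            if not packet:
                break
            submitted.update(packet)
            feedback = jig.submit(topic_id, packet)          # hypothetical API: facet judgments per doc
            for doc_feedback in feedback:
                covered_facets.update(doc_feedback["relevant_facets"])
            ranker.update(topic_id, feedback)                # use feedback before the next iteration
            if ranker.believes_covered(topic_id, covered_facets):  # system's own stopping decision
                break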

  • Text REtrieval Conference (TREC)

    Dynamic Domain Domains

    two domains with a total of 53 topics Ebola: ~680,000 webpages/pdfs/tweets about

    Ebola outbreak in Africa in 2014-2015 Polar: ~1.7 million webpages/data

    files/images/code related to the polar sciences

    Topics developed by assessors (Ebola) or USC (polar) NIST assessors made judgments for docs found in

    multiple rounds of searching prior to topic release assessors also created gold-standard set of facets

    for each topic based on these searches

  • Text REtrieval Conference (TREC)

    Dynamic Domain Sample Topics

    Polar
    Topic: polar oceans freshwater sensitivity
    How sensitive are the polar oceans to changes in freshwater input?
    Subtopic 1: surface freshwater forcing
    Subtopic 2: Arctic Freshwater Initiative
    Subtopic 3: Freshwater Budget of the Canadian Archipelago
    Subtopic 4: Freshwater Fluxes in the East Greenland Current
    Subtopic 5: terrestrial and freshwater ecosystems

    Ebola
    Topic: Ebola Conspiracy Theories
    Identify the conspiracies circulating about Ebola.
    Subtopic 1: Efforts to Counter
    Subtopic 2: Origins
    Subtopic 3: The Claims

  • Text REtrieval Conference (TREC)

    Dynamic Domain Results
    Average Cube Test Score by Iteration

  • Text REtrieval Conference (TREC)

    Total Recall

    Goal
    evaluate methods for achieving very high recall, including methods that use a human in the loop
    more emphasis on recall and less on different facets than the Dynamic Domain track; both emphasize stopping criteria

    Implementation
    participant system submits one doc at a time to a software jig; the jig both records activity and responds to the system with the relevance judgment for that doc
    participant decides when to terminate the search; the entire set of documents submitted to the jig counts as the retrieved set
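    A minimal sketch of the one-document-at-a-time protocol just described, assuming a hypothetical jig client whose judge() call returns the relevance judgment for the submitted document. The stopping rule shown (a run of consecutive non-relevant judgments) is only a placeholder for a participant's own criterion.

    def total_recall_run(jig, topic_id, ranker, patience=100):
        # Submit documents one at a time; every submitted document counts toward
        # the retrieved set. Stop after `patience` consecutive non-relevant judgments.
        retrieved, consecutive_misses = [], 0
        while consecutive_misses < patience:
            doc_id = ranker.next_doc(topic_id, exclude=retrieved)
            if doc_id is None:                      # collection exhausted
                break
            judgment = jig.judge(topic_id, doc_id)  # hypothetical API: True if relevant
            retrieved.append(doc_id)
            consecutive_misses = 0 if judgment else consecutive_misses + 1
            ranker.update(topic_id, doc_id, judgment)
        return retrieved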

  • Text REtrieval Conference (TREC)

    Total Recall

    At Home Collection
    Jeb Bush email: 34 (new) topics against the email of Florida governor Jeb Bush

    Sandbox Collections
    Gov email: six topics against email of the Rod Blagojevich and Patrick Quinn administrations
    Twitter: four topics against a collection of 800,000 tweets

    Tasks
    at home: systems connect to the jig over the Internet; the participant's machine contains the document set and the search runs there
    sandbox: the participant's system is sent as a virtual machine that runs on an isolated machine along with the jig; the participant never sees any documents, but gets counts of relevant documents returned as a function of the number of documents submitted. Automatic only.

  • Text REtrieval Conference (TREC)

    Total Recall At Home

    Judgments:
    NIST assessors judged multiple rounds of documents per topic
    (small) sample of documents independently judged by multiple assessors
    3-way scale: not relevant, relevant, important
    primary assessor also grouped relevant/important documents into clusters representing the main subtopics (aspects) of the topic

    Enables evaluation contrasts: all relevant vs. important; assessor differences; coverage

  • Text REtrieval Conference (TREC)

    Total Recall At Home Results
    Average Gain Curve: All relevant, primary assessor

  • Text REtrieval Conference (TREC)

    OpenSearch

    Live Labs comes to TREC:
    provide access to real users doing their real searches at the actual time they search, at scale

    For TREC participants, an ad hoc search re-ranking task

  • Text REtrieval Conference (TREC)

    OpenSearch Protocol

    sites provide frequent queries and valid document sets
    participants re-rank the document sets for each query and upload new rankings
    once a query in the set is issued, a participant is randomly selected and that participant's corresponding ranking is interleaved with the site's native ranking
    all user interactions with the ranked list are recorded
    the participant's ranking is declared to be a win, loss, or tie with respect to the native ranking
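    A minimal sketch of how a win/loss/tie outcome for a single interleaved impression could be decided from recorded clicks, assuming the interleaved list already carries per-document team attribution. The track does not prescribe this exact implementation; the function, data format, and aggregation note are illustrative.

    def impression_outcome(interleaved, clicks):
        # `interleaved` is a list of (doc_id, team) pairs with team "participant" or "site";
        # `clicks` is the set of clicked doc ids recorded for this impression.
        participant_clicks = sum(1 for doc, team in interleaved if doc in clicks and team == "participant")
        site_clicks = sum(1 for doc, team in interleaved if doc in clicks and team == "site")
        if participant_clicks > site_clicks:
            return "win"
        if participant_clicks < site_clicks:
            return "loss"
        return "tie"

    # An aggregate outcome per participant could then be computed, e.g. wins / (wins + losses).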

  • Text REtrieval Conference (TREC)

    OpenSearch Results

    [Bar charts of wins, losses, ties, and outcome per participant for the two sites, CiteSeerX and SSOAR, Round 2; participant labels include Gesis, UWM, QU, webis, KarMat, IAPLab, Udel, BJUT, and OS_404]

  • Text REtrieval Conference (TREC)

    Contextual Suggestion

    "Entertain Me" app: suggest activities based on the user's prior history and target location

    Fifth year of track; this year consolidates work in 2015, focusing on creating a reusable test collection for the task
    suggestions required to come from a track-created repository of activities
    suggestions in profiles might be tagged with features the profile owner finds attractive

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Terminology:

    a profile represents the user; a profile consists of a set of previously rated activities and possibly some demographic info
    a system returns [a ranked list of] suggestions in response to a request
    a request contains at least a profile and a target location, and possibly some other data (e.g., time)
    a suggestion is an activity from the repository that is located in the target area

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Sample Request

    location: Cape Coral, FL
    group: Family
    season: Summer
    trip_type: Holiday
    duration: Weekend trip
    person:
      gender: Male
      age: 23
    preferences:
      doc: 00674898-160  rating: 3  tags: Romantic, Seafood, Family Friendly
      doc: 00247656-160  rating: 2  tags: Bar-hopping
      doc: 00085961-160  rating: 3  tags: Gourmet Food
      doc: 00086637-160  rating: 4  tags: Family Friendly, Local Food, Entertainment
      doc: 00086298-160  rating: 0
      doc: 00087389-160  rating: 3  tags: Shopping for Shoes, Family Friendly, Luxury Brand Shopping
      doc: 00405444-152  rating: 3  tags: Art, Art Galleries, Family Friendly, Fine Art Museums

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Phase 1 task

    crowd-sourced assessors rated attractions from a set of seed locations; these ratings formed initial profiles
    participants returned suggestions for the cross product of profiles and a set of different locations
    suggestions from Phase 1 participants were pooled and sent back to the requestor for ratings and feature tags
    evaluated a total of 61 requests in the test set
    ratings on a 5-point scale from Strongly Uninterested to Strongly Interested; top 2 ratings counted as relevant for binary measures; NDCG computed using 3-rating as the gain value
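    A small sketch of the clearly specified part of the rating mapping above (the binary measures); the exact graded gain mapping for NDCG is left out here because the slide's "3-rating" shorthand does not fully specify it. The 0-4 encoding of the 5-point scale is an assumption.

    # Assumed encoding of the 5-point scale: 0 = Strongly Uninterested ... 4 = Strongly Interested.
    def binary_relevance(rating):
        # Top two ratings (3 and 4) count as relevant for the binary measures.
        return 1 if rating >= 3 else 0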

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Phase 2 task

    essentially a re-ranking task
    a request contained the complete set of (unrated) suggestions from all Phase 1 participants for that request; Phase 2 participants were required to return only suggestions from this set
    58 requests in the evaluation set

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Phase 1 Results

    Distribution of Per-Request NDCG Scores for Best Run By Mean NDCG

  • Text REtrieval Conference (TREC)

    Contextual Suggestion Phase 2 Results

    Distribution of Per-Request NDCG Scores for Best Run By Mean NDCG

  • Text REtrieval Conference (TREC)

    Tasks Track

    Goal
    facilitate research on systems that are able to infer the underlying real-world task that motivates a query and then retrieve documents useful for accomplishing all aspects of that real-world task

    Tasks
    Task Understanding: return key phrases covering the breadth of the task
    Task Completion: return documents that are useful for the whole task
    Web/ad hoc

  • Text REtrieval Conference (TREC)

    Tasks Track

    ClueWeb12 document set
    50 topics in test set
    track organizers selected topics from logs and created the set of subtasks using their own resources plus participants' submissions

    Aspect-based judgments
    depth-20 pools for phrases
    depth-10 pools for documents (completion & ad hoc)
    documents judged for both relevance and usefulness
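    A minimal sketch of the depth-k pooling step mentioned above: the judgment pool for a topic is the union of the top k results from every submitted run. The run data layout is an illustrative assumption.

    def depth_k_pool(runs, k):
        # `runs` maps a run id to {topic: ranked list of doc (or phrase) ids}.
        pool = {}
        for ranked_by_topic in runs.values():
            for topic, ranked in ranked_by_topic.items():
                pool.setdefault(topic, set()).update(ranked[:k])
        return pool

    # e.g. depth-20 pools for phrases, depth-10 pools for documents:
    # phrase_pool = depth_k_pool(phrase_runs, k=20)
    # doc_pool = depth_k_pool(doc_runs, k=10)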

  • Text REtrieval Conference (TREC)

    Tasks Track Sample Topics

    query: fake tan at home
    You are trying to find out how to get a fake tan at home.
    Subtask 1: Places to get a fake tan
    Subtask 2: Determine fake tan for skin type
    Subtask 3: Cost of getting a fake tan
    Subtask 4: Products to get a fake tan
    Subtask 5: Buy products to get a fake tan
    Subtask 6: Fake tan DIY guide
    Subtask 7: Get fake tan of some part of the body
    Subtask 8: Precautions for getting a fake tan
    Subtask 9: Pictures of fake tans

    query: social media for learning
    You want to learn different ways to use social media to enhance learning activities at school.
    Subtask 1: How to use blogging for learning
    Subtask 2: How to use collaborative calendaring
    Subtask 3: How to use podcasting
    Subtask 4: How to use social media for collaborative mindmapping
    Subtask 5: How to use social media for sharing information
    Subtask 6: How to use social media for presentation sharing
    Subtask 7: How to use social media for collaborative working

  • Text REtrieval Conference (TREC)

    Task Understanding Results
    Distribution of Per-topic Scores (ordered by mean ERR-IA@20)

  • Text REtrieval Conference (TREC)

    Task Completion Results
    Distribution of Per-topic Scores (ordered by mean ERR-IA@10)

  • Text REtrieval Conference (TREC)

    Live QA

    Goal
    create systems that can generate answers in real time for real questions asked by real users

    Implementation
    questions sampled from the Yahoo Answers site were directed at participants' systems at a rate of about 1 per minute for 24 hours (May 31 - Jun 1)
    systems required to respond to a question with a single [textual] answer in at most 1 minute; answers recorded by the track server
    at the end of the evaluation period, questions and responses were sent to NIST for judgment

  • Text REtrieval Conference (TREC)

    Live QA Questions

    drawn from seven top-level Yahoo Answers categories, as self-labeled by the asker
    lightly filtered to remove objectionable material
    final test set of 1015 questions

    Summarization pilot task: systems return a concise re-statement of the question indicating its main focus

  • Text REtrieval Conference (TREC)

    Live QA Sample Questions

    Category: Health
    Bat ran into me, should I be afraid of rabies?
    Could I catch rabies from a bat which ran into me?

    Category: Beauty & Style
    Is waterproof mascara and eyeliner necessary for traveling in hot and humid areas (Thailand and Singapore)? If I just go with normal ones, will they not smudge or melt off?
    Do I need to use waterproof mascara and eyeliner in hot, humid areas?

    Category: Pets
    One of my dogs left? I have two dogs and one just ran off and we can't find him. It's been three hours and the other one is really depressed; she only moves to get water and is breathing heavily. What should I do if my dog does not come back?
    What can I do if I can't find my lost dog?

    Category: Sports
    How does basketball finals work in the NBA? So I like all sorts of sports but my favorite is football (soccer). I was wondering why in basketball have like 4 games just to go to the final. I heard it's gonna be the Cavs vs Warriors?
    How are NBA basketball finals structured?

    Category: Arts & Humanities
    Places to read about early human settlement/migrations in modern Russia? I was recently listening to lectures by The Great Courses on big history by David Christain. At some point he brings up the fact that humans started to settle in northern Ukraine/Russia in the sixth century. I am interested to read more on this settlement/migration. Do you have any suggested scholarly sources/suggestions?
    Where can I find out more about early settlements and migration in Russia?

  • Text REtrieval Conference (TREC)

    Live QA

    3 human baselines
    best human answer on the Yahoo! Answers site, as voted by the asker
    fastest human answer on the site
    crowd-sourced answer within 1 min. response time

    Scoring
    NIST assessors rated responses: -2 Unreadable; 1 Poor; 2 Fair; 3 Good; 4 Excellent
    run scores are a function of the rating assigned per question
    avgScore(0-3): conflate all negative ratings to 0 and subtract 1 from the other ratings; take the mean of the ratings
    prec@i+: number of questions with a rating of at least i, divided by the number of questions the system responded to
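    A minimal sketch of the two scores as defined above. Treating unanswered questions as 0 in avgScore (i.e., averaging over all test questions) is a reasonable reading of the definition rather than something stated explicitly on the slide.

    def avg_score_0_3(ratings, num_questions):
        # ratings: assessor ratings for the answered questions, values in {-2, 1, 2, 3, 4}.
        # Map negative ratings to 0, subtract 1 from ratings 1-4 (giving 0-3), then average
        # over all test questions (unanswered counted as 0 -- an assumption).
        transformed = [max(r - 1, 0) for r in ratings if r > 0]
        return sum(transformed) / num_questions

    def prec_at_i_plus(ratings, i):
        # Fraction of answered questions whose rating is at least i.
        answered = len(ratings)
        return sum(1 for r in ratings if r >= i) / answered if answered else 0.0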

  • Text REtrieval Conference (TREC)

    Live QA Results

    [Scatter plot of avgScore(0-3) vs. Prec@2+ per system; system labels include CMU, Qatar U., CLIP, Yahoo, UTRGV-JBC, Emory U., RMIT, ECNU, SFSU, U. Waterloo, ECNUCS, DFKI, PRNA, NUDT681, NUDTMDP, and U. Leipzig]

  • Text REtrieval Conference (TREC)

    Real-time Summarization

    Goal
    examine techniques for constructing real-time update summaries from social media streams in response to users' information needs

    Mash-up of the TREC 2015 Microblog and Temporal Summarization tracks: a (type of) filtering task over the tweet stream
    Task A: deliver updates to a mobile device
    Task B: periodic digest of updates

  • Text REtrieval Conference (TREC)

    Real-time Summarization

    participant listens to the Twitter public feed for the evaluation period (~10 days in August)
    pushes a tweet to the RTS server when it decides to retrieve a tweet for a profile; at most 10 tweets per day per profile
    server records the time of receipt
    a subset of the pushed tweets is sent to crowd-sourced mobile assessors; an assessor may or may not make a judgment; for 2016, judgments were not returned to the participant
    digest task results uploaded to NIST after the evaluation period ended
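    A minimal sketch of the client-side push loop just described, assuming a hypothetical RTS client with a push() call and tweets arriving as simple dicts. The 10-tweets-per-day-per-profile cap comes from the track; the relevance decision function is a placeholder for the participant's own logic.

    from collections import defaultdict

    DAILY_LIMIT = 10  # at most 10 pushed tweets per profile per day

    def listen_and_push(tweet_stream, profiles, classifier, rts_client):
        # Scan the live tweet stream and push a tweet to the RTS server whenever the
        # (placeholder) classifier deems it relevant to a profile, subject to the quota.
        pushed_today = defaultdict(int)              # (profile_id, date) -> count
        for tweet in tweet_stream:                   # assumed dicts with 'id', 'text', 'date'
            for profile in profiles:
                key = (profile["id"], tweet["date"])
                if pushed_today[key] >= DAILY_LIMIT:
                    continue
                if classifier.is_relevant(profile, tweet):        # participant's own decision logic
                    rts_client.push(profile["id"], tweet["id"])   # hypothetical API; server records receipt time
                    pushed_today[key] += 1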

  • Text REtrieval Conference (TREC)

    Real-time Summarization

    Task A:
    return at most 10 tweets/topic/day
    lag between the time a tweet is available and the decision to return it to the user should be minimized
    scored using Expected Latency Gain (ELG)

    Task B:
    return at most 100 [ranked] tweets/topic/day
    decision anytime within the day is fine
    scored using nDCG

    For both, Automatic, Manual Preparation, or Manual Interaction runs
    manual clustering of relevant tweets defines equivalence classes used for redundancy penalties in scoring
    relevance judgments for unjudged tweets in an equivalence class (e.g., retweets) assigned as a function of the judged tweets in the class
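    A minimal sketch of the cluster-based redundancy handling described above: only the first tweet returned from each equivalence class of relevant tweets earns gain, later ones are treated as redundant. The exact ELG/nDCG gain values are not reproduced; the cluster map, relevance weights, and function shape are illustrative assumptions.

    def deduplicated_gains(returned_tweets, tweet_cluster, relevance):
        # Credit only the first returned tweet from each equivalence class (cluster)
        # of relevant tweets; later tweets from the same cluster earn no gain.
        seen_clusters = set()
        gains = []
        for tweet_id in returned_tweets:                     # in the order they were pushed/ranked
            cluster = tweet_cluster.get(tweet_id)            # None if the tweet is not relevant
            if cluster is None or cluster in seen_clusters:
                gains.append(0.0)                            # non-relevant or redundant
            else:
                seen_clusters.add(cluster)
                gains.append(relevance.get(tweet_id, 1.0))   # e.g. 0.5 relevant / 1.0 highly relevant (assumed)
        return gains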

  • Text REtrieval Conference (TREC)

    Real-time Summarization Profiles

    combination of re-used TREC 2015 and newly developed profiles (total of 203)
    describe a prospective information need

    Judgments
    mobile assessors judged tweets as relevant, redundant, or not relevant
    NIST assessors judged pools formed from both Task A and Task B runs on a 3-way scale of not relevant, relevant, highly relevant
    NIST assessors also created clusters of semantically equivalent relevant tweets
    56 profiles with NIST judgments; 123 with mobile judgments

  • Text REtrieval Conference (TREC)

    Real-time Sample Topics

    Title: Hershey, PA quilt show
    Description: Find information on the quilt show being held in Hershey, PA.
    Narrative: The user is a beginning quilter who would like to attend her first quilt show. She has learned that a major quilt show will happen in Hershey, PA, and wants to see tweets about the show, including such things as announcements of classes, teachers or vendors attending the show; prize-winning quilts; comments on logistics, travel information, and lodging; opinions about the quality of the show.

    Title: FIFA corruption investigation
    Description: Find information related to the ongoing investigation of FIFA officials for corruption.
    Narrative: The user is a soccer fan who is interested in the current status of the ongoing investigation by various governments of corruption and bribery by officials of FIFA (Federation Internationale de Football Association). This includes tweets giving information on various investigations and possible rebidding of the 2018 and 2022 World Cup games.

    Title: Mount Rushmore
    Description: Find tweets about people's reactions to and experiences when visiting Mount Rushmore.
    Narrative: The user is considering a trip to South Dakota to see Mount Rushmore. She would like to see what reactions other tourists have had to the site as well as any traveling tips and advice to make the trip more enjoyable.

  • Text REtrieval Conference (TREC)

    Top Task A Runs: Mobile Judgments
    Ordered by Strict Precision

    [Stacked bar chart of the number of pushed tweets judged relevant, redundant, non-relevant, or unjudged for the top runs, with strict precision values ranging from 0.318 to 0.570]

    (P) = Manual Preparation run

  • Text REtrieval Conference (TREC)

    Top Task A Runs: NIST Judgments
    Distribution of Per-topic EG-1 Scores for Best Run by Mean EG-1

    (P) = Manual Preparation run; Empty Run

  • Text REtrieval Conference (TREC)

    Top Task B Runs
    Distribution of Per-topic nDCG-1 Scores for Best Run by Mean nDCG-1

    (P) = Manual Preparation run

  • Text REtrieval Conference (TREC)

    TREC 2017

    Tracks
    Dynamic Domain, Live QA, OpenSearch, Real-time Summarization, and Tasks tracks continuing
    CDS becomes Precision Medicine
    new track: Complex Answer Retrieval

    TREC 2017 track planning sessions
    1.5 hours per track tomorrow (3- or 4-way parallel), track coordinators attending TREC 2016
    you can help shape the task(s); make your opinions known

  • Text REtrieval Conference (TREC)


