Evaluating Language Processing Applications and Components
PROPOR’03, Faro
Robert Gaizauskas
Natural Language Processing Group
Department of Computer Science
University of Sheffield
www.dcs.shef.ac.uk/~robertg
June 27, 2003
Outline of Talk
Introduction: Perspectives on Evaluation in HLT
Technology Evaluation: Terms and Definitions
An Extended Example: The Cub Reporter Scenario
– Application Evaluation: Question Answering and Summarisation
– Component Evaluation: Time and Event Recognition
Ingredients of a Successful Evaluation
Snares and Delusions
Conclusions
Perspectives on Evaluation …
Three key stakeholders in the evaluation process:
– Users (user-centred evaluation)
– Researchers/technology developers (technology evaluation)
– Funders (programme evaluation)
Users are concerned to accomplish a task for which HLT is just a tool
– User evaluation requires evaluating the system in its operational setting
– Does the HLT system allow users to accomplish their task better?
– Keeping in mind that
  – The technology may be transformational
  – Faults may be due to non-system problems or interaction effects in the setting
… Perspectives on Evaluation …
Researchers are concerned with models and techniques for carrying out language processing tasks
For them evaluation is a key part of the empirical method
– A model/system represents a hypothesis about how a language-related input may be translated into an output
– Evaluation = hypothesis testing
[Figure: the empirical cycle – the same input is processed by a human and by the model; the model output is compared against the human output in an evaluation, whose results drive model refinement]
… Perspectives on Evaluation
Funders are concerned to determine whether R & D funds have been well spent
Programme evaluation may rely on
– User-centred evaluation
– Technology evaluation
– Competitive evaluation
– Assessment of social impact
The rest of this talk will concentrate on “technology evaluation”, or, better put, the empirical method for HLT
Outline of Talk
Introduction: Perspectives on Evaluation in HLT
Technology Evaluation: Terms and Definitions
An Extended Example: The Cub Reporter Scenario
– Application Evaluation: Question Answering and Summarisation
– Component Evaluation: Time and Event Recognition
Ingredients of a Successful Evaluation
Snares and Delusions
Conclusions
Technology Evaluation …
Helpful to make a few further initial distinctions …
Important to distinguish tasks or functions from the systems which carry them out
Tasks/functions may be broken into subtasks/subfunctions and systems into subsystems (or components)
Need not be an isomorphism between these decompositions
Tasks, specified independently of any system or class of systems, are the proper subject of evaluation
… Technology Evaluation …
User-visible tasks are tasks where input and output have functional significance for a system user
– E.g. machine translation, speech recognition
User-transparent tasks are tasks where input and output do not have such significance
– E.g. part-of-speech tagging, parsing, mapping to logical form
Usually user-transparent tasks are components of higher level user-visible tasks
Will refer to “user-visible tasks” as application tasks and “user-transparent tasks” as component tasks
… Technology Evaluation …
Applications
– Machine translation (DARPA MT)
– Speech recognition (CSR, LVCSR, Broadcast News)
– Spoken language understanding (ATIS)
– Information Retrieval (TREC, Amaryllis, CLEF)
– Information Extraction (MUC, ACE, IREX)
– Summarisation (Summac, DUC)
– Question Answering (TREC-QA)
– …

Components
– Parsers (Parseval)
– Morphology
– POS Tagging (Grace)
– Coreference (MUC)
– Word Sense Disambiguation (SENSEVAL)
– …
… Technology Evaluation …
Evaluation scenarios may be defined for both application tasks and component tasks
Each sort of evaluation faces characteristic challenges
Component task evaluation is difficult because
– No universally agreed set of “components” or intermediate representations composing the human language processing system – i.e. theory dependence (e.g. grammatical formalisms)
– Collecting annotated resources is difficult because
  – They must be created, unlike, e.g., source and target language texts, full texts and summaries, etc., which can be found
  – Their creation relies on a small number of expensive experts
– Users and funders (and sometimes scientists!) need convincing
… Technology Evaluation …
Application task evaluation also faces difficulties
Note that an application task technology evaluation is NOT a user-centred evaluation – no specific setting is assumed
An application task evaluation may use
– Intrinsic criteria: how well does a system perform on the task it was designed to carry out?
– Extrinsic criteria: how well does a system enable a user to complete a task?
  – May approximate user-centred evaluation depending on the reality of the setting of the user task
Application Task Technology Evaluation vs User-Centred Evaluation: Example
Information Extraction (IE) is the task of populating a structured DB with information from free text pertaining to predefined scenarios of interest
E.g. extract info about management succession events – events involving persons moving in or out of positions in organizations
Technology evaluation of this task was carried out in MUC-6 using a controlled corpus, task definition and scoring metrics/software
The participating systems produced structured templates which were scored by the organisers
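To make the scoring idea concrete, here is a hedged sketch (not the official MUC scorer) of slot-level precision and recall over flattened (slot, value) pairs; the function and data below are illustrative only:

def slot_scores(system_slots, gold_slots):
    """Return (precision, recall) over filled (slot, value) pairs.
    A real MUC scorer aligns nested templates; this flattens them."""
    correct = len(system_slots & gold_slots)
    precision = correct / len(system_slots) if system_slots else 0.0
    recall = correct / len(gold_slots) if gold_slots else 0.0
    return precision, recall

gold = {("POST", "executive vice president"), ("PER_NAME", "Donald Wright")}
system = {("POST", "executive vice president"), ("PER_NAME", "Mr. Wright")}
print(slot_scores(system, gold))   # (0.5, 0.5)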
Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont)
BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.
<TEMPLATE-9404130062> :=
  DOC_NR: "9404130062"
  CONTENT: <SUCCESSION_EVENT-1>
<SUCCESSION_EVENT-1> :=
  SUCCESSION_ORG: <ORGANIZATION-1>
  POST: "executive vice president"
  IN_AND_OUT: <IN_AND_OUT-1> <IN_AND_OUT-2>
  VACANCY_REASON: OTH_UNK
<IN_AND_OUT-1> :=
  IO_PERSON: <PERSON-1>
  NEW_STATUS: OUT
  ON_THE_JOB: NO
<IN_AND_OUT-2> :=
  IO_PERSON: <PERSON-2>
  NEW_STATUS: IN
  ON_THE_JOB: NO
  OTHER_ORG: <ORGANIZATION-2>
  REL_OTHER_ORG: OUTSIDE_ORG
<ORGANIZATION-1> :=
  ORG_NAME: "Burns Fry Ltd."
  ORG_ALIAS: "Burns Fry"
  ORG_DESCRIPTOR: "this brokerage firm"
  ORG_TYPE: COMPANY
  ORG_LOCALE: Toronto CITY
  ORG_COUNTRY: Canada
<ORGANIZATION-2> :=
  ORG_NAME: "Merrill Lynch Canada Inc."
  ORG_ALIAS: "Merrill Lynch"
  ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
  ORG_TYPE: COMPANY
<PERSON-1> :=
  PER_NAME: "Mark Kassirer"
<PERSON-2> :=
  PER_NAME: "Donald Wright"
  PER_ALIAS: "Wright"
  PER_TITLE: "Mr."
Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont)
The MUC evaluations are application task technology evaluation – results are of interest to users, but resulting systems cannot be directly used by them
Interestingly, designing a system that users can actually use leads to insights into the task and into the limitations of technology evaluation.
To explore deployability of IE technology we carried out a 2-year project with GlaxoSmithKline: TRESTLE – Text Retrieval, Extraction and Summarisation Technology for Large Enterprises
Key insights were:
– The importance of good interface design
– The importance of allowing for imperfect IE – one click from “truth”
TRESTLE Interface

[Two slides of screenshots showing the TRESTLE interface]
… Technology Evaluation
Both application task and component task technology evaluation face challenges
However, both are essential elements in the empirical methodology for progressing HLT
Researchers and developers need to fight hard for support for well-designed evaluation programmes
However, they must bear in mind that users and funders will be sceptical and that translating results from evaluations into usable systems is not trivial
Outline of Talk
Introduction: Perspectives on Evaluation in HLT
Technology Evaluation: Terms and Definitions
An Extended Example: The Cub Reporter Scenario
– Application Evaluation: Question Answering and Summarisation
– Component Evaluation: Time and Event Recognition
Ingredients of a Successful Evaluation
Snares and Delusions
Conclusions
The Cub Reporter Scenario
“Cub reporter” = junior reporter whose job is to gather background and check facts
The electronic cub reporter is a vision about how question answering (QA) and summarisation technology could be brought together
– first identified by the TREC QA Roadmap group
Sheffield (Computer Science and Journalism) has a 6 person-year effort funded by the UK EPSRC to investigate the scenario in conjunction with the UK Press Association (the UK’s premier newswire service)
05:33 20/05/03: Page 1 (HHH) CITY Glaxo Background
GROWING ANGER OVER BOSSES' PAY
By Paul Sims, PA News
The shareholders' revolt at yesterday's annual general meeting of GlaxoSmithKline was the latest in a line of protests against the ``fat cat'' salaries of Britain's top executives.
Last night's revolt is the first time a FTSE 100 member has faced such a revolt since companies became obliged to submit their remuneration report to shareholder vote. But shareholders from a number of companies had already denounced massive payouts to directors presiding over plummeting share values on a struggling stock market.
Earlier this month, a third of shareholders' votes went against Royal & Sun Alliance's remuneration report …
A similar protest at Shell last month attracted 23% of the vote … And at Barclays Bank, three out of 10 of the larger shareholders registered dissent over rewards for top executives …
20:05 19/05/03: Page 1 (HHH) CITY Glaxo
Pharmaceuticals giant GlaxoSmithKline tonight suffered an unprecedented defeat at its annual general meeting when shareholders won a vote against a multi-million pound pay and rewards package for executives.
Cub Reporter Scenario: Example

[Figure: the snap (20:05 19/05/03: Page 1 (HHH) CITY Glaxo – “Pharmaceuticals giant GlaxoSmithKline tonight suffered an unprecedented defeat at its annual general meeting when shareholders won a vote against a multi-million pound pay and rewards package for executives.”) feeds Question Generation, which produces questions such as:
– How have shareholders in other companies recently voted on executive pay?
– How have shareholders in GlaxoSmithKline voted on executive pay in the past?
Question Answering runs the questions against the archive (10 years of PA text) and returns answers and answer source documents, which Multidocument Summarisation turns into the background story]
Cub Reporter Scenario: Evaluation
What role will evaluation play in the project?
Application task technology evaluation
– Question answering (TREC QA)
– Summarisation (DUC)
Component task technology evaluation
– Part of speech tagging (Penn Tree Bank; BNC)
– Parsing
– Time and event recognition (TimeML)
User-centred evaluation
– Observation of journalists performing a controlled task with and without the developed system
Application Task Evaluation: TREC QA Track
Aim is to move beyond document retrieval (traditional search engines) to retrieving the information itself – answers rather than whole documents
The Text REtrieval Conferences (TREC) started in 1992 to stimulate research in information retrieval by providing:
– Standardised task definitions
– Standardised resources (corpora, human relevance judgments)
– Standardised metrics (e.g. recall and precision) and evaluation procedures
– An annual competition and forum for reporting results
TREC has added and removed variant tasks
– In 1999 a task (“track”) on open domain question answering was added
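For reference, the standard IR definitions assumed here (not spelled out on the slide) are:

  recall = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} \qquad precision = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}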
The TREC QA Track: Task Definition (TREC 8/9)
Inputs:
– 4GB of newswire texts (from the TREC text collection)
– A file of natural language questions (200 TREC-8 / 700 TREC-9), e.g.
  – Where is the Taj Mahal?
  – How tall is the Eiffel Tower?
  – Who was Johnny Mathis’ high school track coach?
Outputs:
– Five ranked answers per question, including a pointer to the source document
  – 50 byte category
  – 250 byte category
– Up to two runs per category per site
Limitations:
– Each question has an answer in the text collection
– Each answer is a single literal string from a text (no implicit or multiple answers)
The TREC QA Track: Task Definition (TREC2001)
Subtrack 1: Main (similar to previous years)
– NIL is a valid response – no longer a guaranteed answer in the corpus
– Questions (500) more ‘real’ – from MSNSearch and AskJeeves logs
– More definition type questions – What is an atom?
Subtrack 2: List
– System must assemble an answer from multiple documents
– Questions specify the number of instances to retrieve
  – What are 9 novels written by John Updike?
– Response is an unordered set of the target number of [doc-id answer] pairs
– Target number of instances guaranteed to exist in the collection
Subtrack 3: Context
– System must return a ranked list of response pairs for each question in a series
  – How many species of spider are there? How many are poisonous to humans? What percentage of spider bites in the US are fatal?
– Evaluated using reciprocal rank
– Answers guaranteed to exist + later questions independently answerable
The TREC QA Track: Task Definition (TREC2002)

New text collection – the AQUAINT collection
– AP newswire (1998-2000), New York Times newswire (1998-2000), Xinhua News Agency (English portion, 1996-2000)
– Approximately 1,033,000 documents / 3 gigabytes of text
Subtrack 1: Main, similar to previous years but:
– No definition type questions
– One answer per question only
– Exact matches only (no 50/250 byte strings)
– Principal metric is confidence weighted score
Subtrack 2: List
– As for TREC2001
No context track
TREC QA Track 2003

Two tasks: main task and passage task
Main task features 3 sorts of questions:
– Factoid (400-450); exact answers; answers not guaranteed
– List (25-50); exact answers; answers guaranteed
– Definition (25-50); answers guaranteed
  – Definition questions ask for “a set of interesting and salient information items about a person, organization, or thing”
Questions are tagged as to type
Passage task
– Relaxes the requirement of exact answers for factoid questions
– Answers may be up to 250 bytes in length
Same corpus (AQUAINT) as TREC2002
The TREC QA Track: Metrics and Scoring
Principal metric for TREC8-10 was Mean Reciprocal Rank (MRR)
– Correct answer at rank 1 scores 1
– Correct answer at rank 2 scores 1/2
– Correct answer at rank 3 scores 1/3
– …
– Sum over all questions and divide by the number of questions
More formally:

  MRR = \frac{1}{N} \sum_{i=1}^{N} r_i

where
– N = # questions
– r_i = reciprocal of the best (lowest) rank assigned by the system at which a correct answer is found for question i, or 0 if no correct answer is found
Judgements made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
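A minimal sketch of the computation in Python, assuming each system response has already been reduced to the best rank of a correct answer (0 for none):

def mean_reciprocal_rank(best_ranks):
    """best_ranks[i] is the best (lowest) rank at which a correct answer
    was found for question i, or 0 if no correct answer was found."""
    return sum(1.0 / r for r in best_ranks if r > 0) / len(best_ranks)

# Correct answers at ranks 1 and 3; no correct answer for question 3:
print(mean_reciprocal_rank([1, 3, 0]))   # (1 + 1/3 + 0) / 3 ≈ 0.44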
The TREC QA Track: Metrics and Scoring
For list questions
– Each list is judged as a unit
– Evaluation measure is accuracy: # distinct instances returned / target # instances
The principal metric for TREC2002 was the Confidence Weighted Score:

  CWS = \frac{1}{Q} \sum_{i=1}^{Q} \frac{\#\text{ correct in first } i \text{ positions}}{i}

where Q is the number of questions and answers are ordered by the system’s own confidence
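A sketch of the confidence weighted score in Python, assuming judgements lists per-question correctness in the system’s confidence order (most confident first):

def confidence_weighted_score(judgements):
    """judgements[i] is True iff the answer the system ranked (i+1)-th
    by its own confidence was judged correct."""
    q = len(judgements)
    correct_so_far, total = 0, 0.0
    for i, correct in enumerate(judgements, start=1):
        correct_so_far += int(correct)
        total += correct_so_far / i   # # correct in first i positions / i
    return total / q

print(confidence_weighted_score([True, False, True]))  # (1/1 + 1/2 + 2/3) / 3 ≈ 0.72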
The TREC QA Track: Metrics and Scoring
A system’s overall score will be:
  1/2 * factoid-score + 1/4 * list-score + 1/4 * definition-score
A factoid answer is one of: correct, non-exact, unsupported, incorrect
Factoid-score is the % of factoid answers judged correct
List answers are treated as sets of factoid answers or “instances”
Instance recall and precision are defined as:
  IR = # instances judged correct & distinct / |final answer set|
  IP = # instances judged correct & distinct / # instances returned
The overall list score is then the F1 measure:
  F = (2*IP*IR)/(IP+IR)
Definition answers are scored based on the number of “essential” and “acceptable” information “nuggets” they contain – see the track definition for details
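Putting the slide’s formulas together, a hedged sketch (argument names are illustrative; the official scorer works from judgement files):

def f1(ip, ir):
    """F1 of instance precision (IP) and instance recall (IR)."""
    return 2 * ip * ir / (ip + ir) if ip + ir > 0 else 0.0

def overall_score(factoid_score, ip, ir, definition_score):
    """Weighted combination from the slide:
    1/2 * factoid-score + 1/4 * list-score + 1/4 * definition-score."""
    return 0.5 * factoid_score + 0.25 * f1(ip, ir) + 0.25 * definition_score

# E.g. 60% factoid accuracy, IP = 0.5, IR = 0.4, definition score 0.3:
print(overall_score(0.6, 0.5, 0.4, 0.3))   # ≈ 0.49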
TREC QA Track Evaluation: Observations
Track has evolved
– Each year has presented a more challenging task
– Metrics have changed
– Bad ideas have been identified and discarded
Task definitions developed/modified in conjunction with the participants
While scoring involves human judges, automatic scoring procedures which approximate human judges have been developed
– Allows evaluation to be repeated outside the formal evaluation
– Supports development of supervised machine learning
Support from academic and industrial participants
TREC QA Track Evaluation: Stimulations …
Most QA systems follow a two-part architecture:
1. Use an IR component with the (preprocessed) question as query to retrieve documents/passages from the overall collection which are likely to contain answers
2. Use an answer extraction component to extract answers from the highest ranked documents/passages retrieved in step 1
Clearly the performance of step 1 places an upper bound on the performance of step 2
What is the most appropriate way to assess the performance of step 1?
– Conventional metrics for evaluating IR systems are recall and precision
– These are not directly useful in the QA context …
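Schematically, the two-part architecture looks as follows; retrieve and extract_answers are toy stand-ins for real IR and answer-extraction components, not any particular system’s API:

def retrieve(query, collection, top_n):
    """Toy IR component: rank passages by crude term overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(collection, key=lambda p: -len(terms & set(p.lower().split())))
    return ranked[:top_n]

def extract_answers(question, passages):
    """Toy answer extractor: a real one would pinpoint answer strings."""
    return passages

def answer_question(question, collection, n_passages=20, n_answers=5):
    # Step 1: IR with the question as query, to find likely answer-bearing passages.
    passages = retrieve(question, collection, n_passages)
    # Step 2: extract ranked answers from the retrieved passages.
    return extract_answers(question, passages)[:n_answers]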
TREC QA Track Evaluation: Stimulations …
Intuitively, we want to know
– the % of questions which have an answer in the top n ranks of returned documents/passages, i.e. how far down the ranking to go
– the # of answer instances in the top n ranks, i.e. redundancy
Let
– Q be the question set
– D be the document (or passage) collection
– A_{D,q} ⊆ D be the subset of D which contains correct answers for q ∈ Q
– R_{D,q,n} be the n top-ranked documents (or passages) in D retrieved given question q
Define

  Coverage(Q, D, n) = \frac{|\{q \in Q : R_{D,q,n} \cap A_{D,q} \neq \emptyset\}|}{|Q|}

  Redundancy(Q, D, n) = \frac{\sum_{q \in Q} |R_{D,q,n} \cap A_{D,q}|}{|Q|}
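A sketch of both metrics in Python, assuming retrieved[q] is the ranked list of document ids returned for question q and answers[q] is the set of ids containing a correct answer for q:

def coverage(retrieved, answers, n):
    """Proportion of questions with at least one answering document
    in the top n ranks."""
    hits = sum(1 for q in retrieved if set(retrieved[q][:n]) & answers[q])
    return hits / len(retrieved)

def redundancy(retrieved, answers, n):
    """Average number of answering documents in the top n ranks."""
    total = sum(len(set(retrieved[q][:n]) & answers[q]) for q in retrieved)
    return total / len(retrieved)

retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
answers = {"q1": {"d2", "d3"}, "q2": {"d9"}}
print(coverage(retrieved, answers, 2), redundancy(retrieved, answers, 2))  # 0.5 0.5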
TREC QA Track Evaluation: Stimulations …
Given these new metrics we can now ask new questions about, e.g., different passage retrieval approaches (passage retrieval before or after the initial query?)
TREC QA Track Evaluation: Stimulations …
Conclusion: evaluations lead to new questions about components, which in turn lead to new metrics – and new metrics make it possible to ask further new questions
Outline of Talk
Introduction: Perspectives on Evaluation in HLT
Technology Evaluation: Terms and Definitions
An Extended Example: The Cub Reporter Scenario
– Application Evaluation: Question Answering and Summarisation
– Component Evaluation: Time and Event Recognition
Ingredients of a Successful Evaluation
Snares and Delusions
Conclusions
Component Task Evaluation: Time and Event Recognition
Answering questions, extracting information from text, and summarising documents all presuppose sensitivity to the temporal location and ordering of events in text
• When did the war between Iran and Iraq end?
• When did John Sununu travel to a fundraiser for John Ashcroft?
• How many Tutsis were killed by Hutus in Rwanda in 1994?
• Who was Secretary of Defense during the Gulf War?
• What was the largest U.S. military operation since Vietnam?
• When did the astronauts return from the space station on the last shuttle flight?
Time and Event Recognition
To address this task a significant effort has recently been made to create:
– An agreed standard annotation for times, events and temporal relations in text (TimeML)
– A corpus of texts annotated according to this standard (TimeBank)
– An annotation tool to support manual creation of the annotations
– Automated tools to assist in the annotation process (time and event taggers)
This effort was made via a 9-month workshop (Jan-Sep 2002) called TERQAS: Time and Event Recognition for Question Answering Systems sponsored by ARDA
See www.time2002.org
Significantly informed by earlier work by Andrea Setzer at Sheffield
TERQAS: Organisation

TERQAS was organised into 6 working groups:
– TimeML Definition and Specification
– Algorithm Review and Development
– Article Corpus Collection Development
– Query Corpus Development and Classification
– TIMEBANK Annotation
– TimeML and Algorithm Evaluation
TERQAS: Participants

– James Pustejovsky, PI
– Rob Gaizauskas
– Graham Katz
– Bob Ingria
– José Castaño
– Inderjeet Mani
– Antonio Sanfilippo
– Dragomir Radev
– Patrick Hanks
– Marc Verhagen
– Beth Sundheim
– Andrea Setzer
– Jerry Hobbs
– Bran Boguraev
– Andy Latto
– John Frank
– Lisa Ferro
– Marcia Lazo
– Roser Saurí
– Anna Rumshisky
– David Day
– Luc Belanger
– Harry Wu
– Andrew See
TERQAS: Outcomes

Creation of a robust markup language for temporal expressions, event expressions, and the relations between them (TimeML 1.0)
Guidelines for Annotation
Creation of a gold standard annotated against this language (TIMEBANK)
Creation of a suite of algorithms (T3PO) for recognizing:
– Temporal Expressions
– Event Expressions
– Signals
– Link Construction
Development of a Text Segmented Closure Algorithm
Creation of a semi-graphical annotation tool (SGAT) for speeding up annotation of dependency-rich texts
Query Database Creation Tool
Guidelines for Creating a Corpus of Questions
Initial Scoring and Inter-annotator Evaluation Setup
TimeML: The Conceptual and Linguistic Basis
TimeML presupposes the following temporal entities and relations.
Events are taken to be situations that occur or happen, punctual or lasting for a period of time. They are generally expressed by means of tensed or untensed verbs, nominalisations, adjectives, predicative clauses, or prepositional phrases.
Times may be either points, intervals, or durations. They may be referred to by fully specified or underspecified temporal expressions, or intensionally specified expressions.
Relations can hold between two events, or between an event and a time. They can be temporal, subordinate, or aspectual relations.
TimeML: Annotating Events
Events are marked up by annotating a representative of the event expression, usually the head of the verb phrase.
The attributes of events are a unique identifier, the event class, tense, and aspect.
Fully annotated example:
  All 75 passengers <EVENT eid="1" class="OCCURRENCE" tense="PAST" aspect="NONE"> died </EVENT>
See full TimeML spec for handling of events conveyed by nominalisations or stative adjectives.
TimeML: Annotating Times
Annotation of times designed to be as compatible with TIMEX2 time expression annotation guidelines as possible.
Fully annotated example for a straightforward time expression:
<TIMEX3 tid="1" type="DATE" value="1966-07" temporalFunction="false"> July 1966 </TIMEX3>
Additional attributes are used to, e.g. anchor relative time expressions and supply functions for computing absolute time values (last week).
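To illustrate what anchoring a relative expression involves, a toy sketch (this is not the TIMEX2/TIMEX3 normalisation algorithm, just the idea of computing a value from the document creation date):

from datetime import date, timedelta

def resolve_last_week(document_date):
    """Toy anchoring of “last week” to an ISO week, relative to the
    document creation date. Real TIMEX normalisation is far richer."""
    anchor = document_date - timedelta(weeks=1)
    year, week, _ = anchor.isocalendar()
    return f"{year}-W{week:02d}"

print(resolve_last_week(date(2003, 6, 27)))   # 2003-W25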
TimeML: Annotating Signals
The SIGNAL tag is used to annotate sections of text, typically function words, that indicate how temporal objects are to be related to each other.
Also used to mark polarity indicators such as not, no, none, etc., as well as indicators of temporal quantification such as twice, three times, and so forth.
Signals have only one attribute, a unique identifier.
Fully annotated example:
Two days <SIGNAL sid="1"> before </SIGNAL> the attack …
TimeML: Annotating Relations (1)
To annotate the different types of relations that can hold between events and events and times, the LINK tag has been introduced.
There are three types of LINKs: TLINKs, SLINKs, and ALINKs, each of which has temporal implications.
A TLINK or Temporal Link represents the temporal relationship holding between events or between an event and a time.
It establishes a link between the involved entities making explicit whether their relationship is: before, after, includes, is_included, holds, simultaneous, immediately after, immediately before, identity, begins, ends, begun by, ended by.
TimeML: Annotating Relations (2)
An SLINK or Subordination Link is used for contexts introducing relations between two events, or an event and a signal.
– SLINKs are of one of the following sorts: Modal, Factive, Counter-factive, Evidential, Negative evidential, Negative.
An ALINK or Aspectual Link represents the relationship between an aspectual event and its argument event.
– The aspectual relations encoded are: initiation, culmination, termination, continuation.
Annotating Relations (3)
Annotated examples:

TLINK: John taught on Monday
<TLINK eventInstanceID="2" relatedToTime="4" signalID="4" relType="IS_INCLUDED"/>
SLINK: John said he taught
<SLINK eventInstanceID="3" subordinatedEvent="4" relType="EVIDENTIAL"/>
ALINK: John started to read
<ALINK eventInstanceID="5" relatedToEvent="6" relType="INITIATES"/>
Comparing TimeML Annotations
Time and event annotations may be compared as are other text tagging annotations (e.g. named entities, template elements)
Problem: semantically identical temporal relations can be annotated in multiple ways
Solution: the deductive closure of temporal relations is computed and compared for two different annotations of the same text (similar to solution for MUC coreference task)
Provides basis for defining Precision and Recall metrics for evaluation and inter-annotator agreement
[Figure: a small graph over entities A, B and C showing temporal relations such as < (before) and ~; computing the closure makes the implied relations explicit]
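A minimal sketch of the closure-based comparison, restricted to the before (<) relation only; full TimeML closure reasons over the whole relation set:

from itertools import product

def closure(before):
    """Transitive closure of a set of (a, b) pairs meaning a < b."""
    closed = set(before)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(closed, repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

def precision_recall(system, gold):
    """Precision and recall computed over the closed relation sets."""
    sys_c, gold_c = closure(system), closure(gold)
    correct = len(sys_c & gold_c)
    return correct / len(sys_c), correct / len(gold_c)

# A < B and B < C entails A < C, so these two annotations agree exactly:
print(precision_recall({("A", "B"), ("B", "C")},
                       {("A", "B"), ("B", "C"), ("A", "C")}))   # (1.0, 1.0)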
An Earlier Pilot Study
Based on Andrea Setzer’s annotation scheme which was taken as the starting point for TimeML
– Trial corpus: 6 New York Times newspaper texts (1996)
– Each text annotated by 2-3 annotators
– Gold standard created
– Questions:
  – Were the guidelines comprehensive and unambiguous?
  – How much genuine disagreement was there?
  – Was it feasible to annotate a larger corpus?
calculated recall and precision of annotators against gold standard
Pilot Study Annotation Procedure
Annotation takes place in stages:
1. Annotate events and times
2. Annotate explicit temporal relations
3. Annotate ‘obvious’ implicit temporal relations
4. Annotate less obvious implicit temporal relations by inference and by interactively soliciting information from the annotator
TimeBank annotation follows more or less the same procedure
– Events and times are automatically annotated in a first pass and corrected by a human
Pilot Study Inter-Annotator Results

Two sets of results:
1. Agreement on entities (events and times): 77% recall and 81% precision
   – Agreement on attributes: 60% recall and 64% precision
2. Agreement on temporal relations: 40% recall and 68% precision
Outline of Talk
Introduction: Perspectives on Evaluation in HLT
Technology Evaluation: Terms and Definitions
An Extended Example: The Cub Reporter Scenario
– Application Evaluation: Question Answering and Summarisation
– Component Evaluation: Time and Event Recognition
Ingredients of a Successful Evaluation
Snares and Delusions
Conclusions
Ingredients of a Successful (Technology) Evaluation
Resources (= money)
– Human (people to make it work)
– Data (copyright issues, distribution)
Well-defined task
– Can humans independently perform the task at an acceptable level (e.g. > 80%) when given only a written specification?
Controlled challenge
– Should be hard enough to attract interest and stimulate new approaches
– Not so hard as to require too much effort from participants or to be impossible to specify
Metrics
– Need to capture intuitively significant aspects of system performance
– Need to experiment with new metrics (they lead to asking different questions)
Ingredients of a Successful (Technology) Evaluation

Participants
– There must be a community willing to take part
– Is the task so hard that participants need funding to do it?
Reusability
– Evaluation should result in resources – data (raw and annotated), scoring software, guidelines – that can be reused by others
Extensibility
– Should be possible to ramp up the challenge year-on-year so as to capitalise on the resources invested and the community which has been established
Snares and Delusions

Diminishing returns
– Evaluations can lead to participants getting obsessed with reducing error / increasing performance by fractions of a percent
Pseudo-science
– Because we’re measuring something, it must be science
Significance
– Are differences between participants’ systems statistically significant?
Users care
Conclusions

A successful technology evaluation should advance the field
Advances can include:
– Better models of human language processing (e.g. better applications and components)
– Creation of reusable resources and materials
– Creation of a community of researchers focused on the same task
– Creation of an ethos of repeatability and transparency (i.e. good scientific practice)
The End
I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it and cannot express it in numbers your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.
Lord Kelvin, Popular Lectures and Addresses, (1889), vol 1. p. 73.