Page 1: Evaluating Language Processing Applications and Components

PROPOR’03 Faro

Robert Gaizauskas

Natural Language Processing Group

Department of Computer Science

University of Sheffield

www.dcs.shef.ac.uk/~robertg

Page 2: Outline of Talk

Introduction: Perspectives on Evaluation in HLT

Technology Evaluation: Terms and Definitions

An Extended Example: The Cub Reporter Scenario– Application Evaluation: Question Answering and Summarisation– Component Evaluation: Time and Event Recognition

Ingredients of a Successful Evaluation

Snares and Delusions

Conclusions

Page 3: Perspectives on Evaluation …

Three key stakeholders in the evaluation process:
– Users (user-centred evaluation)
– Researchers/technology developers (technology evaluation)
– Funders (programme evaluation)

Users are concerned to accomplish a task for which HLT is just a tool
– User evaluation requires evaluating the system in its operational setting
– Does the HLT system allow users to accomplish their task better?
– Keeping in mind that
  - The technology may be transformational
  - Faults may be due to non-system problems or interaction effects in the setting

Page 4: … Perspectives on Evaluation …

Researchers are concerned with models and techniques for carrying out language processing tasks

For them evaluation is a key part of the empirical method
– A model/system represents a hypothesis about how a language-related input may be translated into an output
– Evaluation = hypothesis testing

[Diagram: the human and the model each map input to output; evaluation compares the model output with the human output and feeds back into model refinement]

Page 5: … Perspectives on Evaluation

Funders are concerned to determine whether R & D funds have been well spent

Programme evaluation may rely on
– User-centred evaluation
– Technology evaluation
– Competitive evaluation
– Assessment of social impact

The rest of this talk will concentrate on “technology evaluation” or, preferably, the empirical method for HLT

Page 6: Outline of Talk

Introduction: Perspectives on Evaluation in HLT

Technology Evaluation: Terms and Definitions

An Extended Example: The Cub Reporter Scenario– Application Evaluation: Question Answering and Summarisation– Component Evaluation: Time and Event Recognition

Ingredients of a Successful Evaluation

Snares and Delusions

Conclusions

Page 7: Technology Evaluation …

Helpful to make a few further initial distinctions …

Important to distinguish tasks or functions from the systems which carry them out

Tasks/functions may be broken into subtasks/subfunctions and systems into subsystems (or components)

Need not be an isomorphism between these decompositions

Tasks, specified independently of any system or class of systems, are the proper subject of evaluation

Page 8: … Technology Evaluation …

User-visible tasks are tasks where input and output have functional significance for a system user
– E.g. machine translation, speech recognition

User-transparent tasks are tasks where input and output do not have such significance
– E.g. part-of-speech tagging, parsing, mapping to logical form

Usually user-transparent tasks are components of higher level user-visible tasks

Will refer to “user-visible tasks” as application tasks and “user-transparent tasks” as component tasks

Page 9: … Technology Evaluation …

Applications
– Machine translation (DARPA MT)
– Speech recognition (CSR, LVCSR, Broadcast News)
– Spoken language understanding (ATIS)
– Information Retrieval (TREC, Amaryllis, CLEF)
– Information Extraction (MUC, ACE, IREX)
– Summarisation (Summac, DUC)
– Question Answering (TREC-QA)
– …

Components
– Parsers (Parseval)
– Morphology
– POS Tagging (Grace)
– Coreference (MUC)
– Word Sense Disambiguation (SENSEVAL)
– …

Page 10: … Technology Evaluation …

Evaluation scenarios may be defined for both application tasks and component tasks

Each sort of evaluation faces characteristic challenges

Component task evaluation is difficult because
– No universally agreed set of “components” or intermediate representations composing the human language processing system, i.e. theory dependence (e.g. grammatical formalisms)
– Collecting annotated resources is difficult because
  - They must be created, unlike, e.g., source and target language texts, full texts and summaries, etc., which can be found
  - Their creation relies on a small number of expensive experts
– Users and funders (and sometimes scientists!) need convincing

Page 11: … Technology Evaluation …

Application task evaluation also faces difficulties

Note that an application task technology evaluation is NOT a user-centred evaluation – no specific setting is assumed

An application task evaluation may use
– Intrinsic criteria: how well does a system perform on the task it was designed to carry out?
– Extrinsic criteria: how well does a system enable a user to complete a task?
  - May approximate user-centred evaluation depending on the reality of the setting of the user task

Page 12: Application Task Technology Evaluation vs User-Centred Evaluation: Example

Information Extraction (IE) is the task of populating a structured DB with information from free text pertaining to predefined scenarios of interest

E.g. extract info about management succession events – events involving persons moving in or out of positions in organizations

Technology evaluation of this task was carried out in MUC-6 using a controlled corpus, task definition and scoring metrics/software

The participating systems produced structured templates which were scored by the organisers

Page 13: Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont)

BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.

<TEMPLATE-9404130062> :=
  DOC_NR: "9404130062"
  CONTENT: <SUCCESSION_EVENT-1>

<SUCCESSION_EVENT-1> :=
  SUCCESSION_ORG: <ORGANIZATION-1>
  POST: "executive vice president"
  IN_AND_OUT: <IN_AND_OUT-1> <IN_AND_OUT-2>
  VACANCY_REASON: OTH_UNK

<IN_AND_OUT-1> :=
  IO_PERSON: <PERSON-1>
  NEW_STATUS: OUT
  ON_THE_JOB: NO

<IN_AND_OUT-2> :=
  IO_PERSON: <PERSON-2>
  NEW_STATUS: IN
  ON_THE_JOB: NO
  OTHER_ORG: <ORGANIZATION-2>
  REL_OTHER_ORG: OUTSIDE_ORG

<ORGANIZATION-1> :=
  ORG_NAME: "Burns Fry Ltd."
  ORG_ALIAS: "Burns Fry"
  ORG_DESCRIPTOR: "this brokerage firm"
  ORG_TYPE: COMPANY
  ORG_LOCALE: Toronto CITY
  ORG_COUNTRY: Canada

<ORGANIZATION-2> :=
  ORG_NAME: "Merrill Lynch Canada Inc."
  ORG_ALIAS: "Merrill Lynch"
  ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
  ORG_TYPE: COMPANY

<PERSON-1> :=
  PER_NAME: "Mark Kassirer"

<PERSON-2> :=
  PER_NAME: "Donald Wright"
  PER_ALIAS: "Wright"
  PER_TITLE: "Mr."


Page 14: Application Task Technology Evaluation vs User-Centred Evaluation: Example (cont)

The MUC evaluations are application task technology evaluation – results are of interest to users, but resulting systems cannot be directly used by them

Interestingly, designing a system that users can actually use leads to insights into the task and into the limitations of technology evaluation.

To explore deployability of IE technology we carried out a 2-year project with GlaxoSmithKline: TRESTLE – Text Retrieval, Extraction and Summarisation Technology for Large Enterprises

Key insights were:
– The importance of good interface design
– The importance of allowing for imperfect IE – one click from “truth”

Page 15: TRESTLE Interface

Page 16: TRESTLE Interface

Page 17: … Technology Evaluation

Both application task and component task technology evaluation face challenges

However, both are essential elements in the empirical methodology for progressing HLT

Researchers and developers need to fight hard for support for well-designed evaluation programmes

However, they must bear in mind that users and funders will be sceptical and that translating results from evaluations into usable systems is not trivial

Page 18: Outline of Talk

Introduction: Perspectives on Evaluation in HLT

Technology Evaluation: Terms and Definitions

An Extended Example: The Cub Reporter Scenario– Application Evaluation: Question Answering and Summarisation– Component Evaluation: Time and Event Recognition

Ingredients of a Successful Evaluation

Snares and Delusions

Conclusions

Page 19: The Cub Reporter Scenario

“Cub reporter” = junior reporter whose job is to gather background and check facts

The electronic cub reporter is a vision about how question answering (QA) and summarisation technology could be brought together– first identified by the TREC QA Roadmap group

Sheffield (Computer Science and Journalism) has a 6 person-year effort funded by the UK EPSRC to investigate the scenario in conjunction with the UK Press Association (the UK's premier newswire service)

Page 20: Cub Reporter Scenario: Example

05:33 20/05/03: Page 1 (HHH) CITY Glaxo Background
GROWING ANGER OVER BOSSES' PAY By Paul Sims, PA News

The shareholders' revolt at yesterday's annual general meeting of GlaxoSmithKline was the latest in a line of protests against the ``fat cat'' salaries of Britain's top executives.

Last night's revolt is the first time a FTSE 100 member has faced such a revolt since companies became obliged to submit their remuneration report to shareholder vote. But shareholders from a number of companies had already denounced massive payouts to directors presiding over plummeting share values on a struggling stock market.

Earlier this month, a third of shareholders' votes went against Royal & Sun Alliance's remuneration report …

A similar protest at Shell last month attracted 23% of the vote … And at Barclays Bank, three out of 10 of the larger shareholders registered dissent over rewards for top executives …

Snap:

20:05 19/05/03: Page 1 (HHH) CITY Glaxo

Pharmaceuticals giant GlaxoSmithKline tonight suffered an unprecedented defeat at its annual general meeting when shareholders won a vote against a multi-million pound pay and rewards package for executives.

Questions (generated from the snap):
– How have shareholders in other companies recently voted on executive pay?
– How have shareholders in GlaxoSmithKline voted on executive pay in the past?

[Diagram: Snap → Question Generation → Questions → Question Answering over the Archive (10 years of PA text) → Answers and Answer Source Documents → Multidocument Summarisation → Background]

Page 21: Cub Reporter Scenario: Evaluation

What role will evaluation play in the project?

Application task technology evaluation
– Question answering (TREC QA)
– Summarisation (DUC)

Component task technology evaluation
– Part of speech tagging (Penn Tree Bank; BNC)
– Parsing
– Time and event recognition (TimeML)

User-centred evaluation
– Observation of journalists performing a controlled task with and without the developed system

Page 22: Application Task Evaluation: TREC QA Track

Aim is to move beyond document retrieval (traditional search engines) to information retrieval

The Text REtrieval Conferences (TREC) started in 1992 to stimulate research in information retrieval through providing:
– Standardised task definitions
– Standardised resources (corpora, human relevance judgments)
– Standardised metrics (e.g. recall and precision) and evaluation procedures
– An annual competition and forum for reporting results

TREC has added and removed variant tasks
– In 1999 a task (“track”) on open domain question answering was added

Page 23: The TREC QA Track: Task Definition (TREC 8/9)

Inputs:
– 4GB newswire texts (from the TREC text collection)
– File of natural language questions (200 in TREC-8 / 700 in TREC-9), e.g.
  Where is the Taj Mahal?
  How tall is the Eiffel Tower?
  Who was Johnny Mathis’ high school track coach?

Outputs:
– Five ranked answers per question, including pointer to source document
  - 50 byte category
  - 250 byte category
– Up to two runs per category per site

Limitations:
– Each question has an answer in the text collection
– Each answer is a single literal string from a text (no implicit or multiple answers)

Page 24: The TREC QA Track: Task Definition (TREC 2001)

Subtrack 1: Main (similar to previous years)
– NIL is a valid response – no longer a guaranteed answer in the corpus
– Questions (500) more ‘real’ – from MSNSearch and AskJeeves logs
– More definition-type questions, e.g. What is an atom?

Subtrack 2: List
– System must assemble an answer from multiple documents
– Questions specify the number of instances to retrieve, e.g. What are 9 novels written by John Updike?
– Response is an unordered set of the target number of [doc-id answer] pairs
– Target number of instances guaranteed to exist in the collection

Subtrack 3: Context
– System must return a ranked list of response pairs for the questions in a series, e.g.
  How many species of spider are there? How many are poisonous to humans? What percentage of spider bites in the US are fatal?
– Evaluated using reciprocal rank
– Answers guaranteed to exist + later questions independently answerable

Page 25: The TREC QA Track: Task Definition (TREC 2002)

New text collection – the AQUAINT collection

– AP newswire (1998-2000), New York Times newswire (1998-2000), Xinhua News Agency (English portion, 1996-2000)

– Approximately 1,033,000 documents / 3 gigabytes of text

Subtrack 1: Main, similar to previous years but:
– No definition-type questions
– One answer per question only
– Exact matches only (no 50/250 byte strings)
– Principal metric is confidence weighted score

Subtrack 2: List

– As for TREC2001

No context track

Page 26: TREC QA Track 2003

Two tasks: main task and passage task

Main task features 3 sorts of questions:
– Factoid (400-450); exact answers; answers not guaranteed
– List (25-50); exact answers; answers guaranteed
– Definition (25-50); answers guaranteed
  Definition questions ask for “a set of interesting and salient information items about a person, organization, or thing”

Questions are tagged as to type

Passage task
– Relaxes the requirement of exact answers for factoid questions
– Answers may be up to 250 bytes in length

Same corpus (AQUAINT) as TREC2002

Page 27: The TREC QA Track: Metrics and Scoring

Principal metric for TREC 8-10 was Mean Reciprocal Rank (MRR)
– Correct answer at rank 1 scores 1
– Correct answer at rank 2 scores 1/2
– Correct answer at rank 3 scores 1/3
– …
– Sum over all questions and divide by the number of questions

More formally:

MRR = (1/N) Σ_{i=1..N} r_i

where
N = # questions
r_i = reciprocal of the best (lowest) rank assigned by the system at which a correct answer is found for question i, or 0 if no correct answer is found

Judgements made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
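To make the calculation concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides); ranked_judgements is an assumed structure holding, for each question, the rank-ordered correctness judgements of its returned answers.

def mean_reciprocal_rank(ranked_judgements):
    # Each entry is a list of booleans, one per returned answer in rank order
    # (True = judged correct); only the best (lowest) correct rank counts.
    total = 0.0
    for judgements in ranked_judgements:
        for rank, correct in enumerate(judgements, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(ranked_judgements)

# Hypothetical judgements for three questions (up to five answers each)
runs = [
    [False, True, False, False, False],   # correct at rank 2 -> 1/2
    [True],                               # correct at rank 1 -> 1
    [False, False, False, False, False],  # no correct answer -> 0
]
print(mean_reciprocal_rank(runs))  # (0.5 + 1.0 + 0.0) / 3 = 0.5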

Page 28: The TREC QA Track: Metrics and Scoring

For list questions
– each list is judged as a unit
– the evaluation measure is accuracy: # distinct instances returned / target # instances

The principal metric for TREC 2002 was the Confidence Weighted Score (CWS):

CWS = (1/Q) Σ_{i=1..Q} (# correct in first i positions) / i

where Q is the number of questions
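Purely as an illustration (not the official TREC scorer), a small Python sketch of the confidence weighted score, assuming one answer per question and that the correctness flags are already sorted by the system's confidence, most confident first:

def confidence_weighted_score(correct_flags):
    # correct_flags: one boolean per question, in the system's confidence order
    q = len(correct_flags)
    score = 0.0
    num_correct = 0
    for i, correct in enumerate(correct_flags, start=1):
        if correct:
            num_correct += 1
        score += num_correct / i  # number correct in first i positions, over i
    return score / q

# Hypothetical run over four questions, already sorted by confidence
print(confidence_weighted_score([True, False, True, True]))  # ~ 0.729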

Page 29: The TREC QA Track: Metrics and Scoring

A system's overall score will be: 1/2 * factoid-score + 1/4 * list-score + 1/4 * definition-score

A factoid answer is one of: correct, non-exact, unsupported, incorrect.

Factoid-score is the % of factoid answers judged correct

List answers are treated as sets of factoid answers or “instances”
Instance recall and precision are defined as:
IR = # instances judged correct & distinct / |final answer set|
IP = # instances judged correct & distinct / # instances returned

Overall list score is then the F1 measure: F = (2*IP*IR)/(IP+IR) (see the sketch below)

Definition answers are scored based on the number of “essential” and “acceptable” information “nuggets” they contain – see track definition for details
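As an illustrative sketch only (the names are assumed, and the official scorer applies further rules such as distinctness judgements), the list F measure and the weighted overall score above might be computed as follows:

def list_f_score(num_correct_distinct, num_returned, target_set_size):
    # F1 of instance precision (IP) and instance recall (IR) for one list question
    ip = num_correct_distinct / num_returned if num_returned else 0.0
    ir = num_correct_distinct / target_set_size if target_set_size else 0.0
    return 2 * ip * ir / (ip + ir) if (ip + ir) else 0.0

def overall_score(factoid_score, list_score, definition_score):
    # Weighted combination used for the TREC 2003 main task
    return 0.5 * factoid_score + 0.25 * list_score + 0.25 * definition_score

# e.g. 6 correct & distinct instances out of 10 returned, 8 in the final answer set
print(list_f_score(6, 10, 8))        # IP = 0.6, IR = 0.75, F ~ 0.667
print(overall_score(0.5, 0.4, 0.3))  # 0.425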

Page 30: TREC QA Track Evaluation: Observations

Track has evolved – Each year has presented a more challenging task

– Metrics have changed

– Bad ideas have been identified and discarded

Task definitions developed/modified in conjunction with the participants

While scoring involves human judges, automatic scoring procedures which approximate human judges have been developed
– Allows evaluation to be repeated outside the formal evaluation
– Supports development of supervised machine learning

Support from academic and industrial participants

Page 31: TREC QA Track Evaluation: Stimulations …

Most QA systems follow a two-part architecture:
1. Use an IR component with the (preprocessed) question as query to retrieve documents/passages from the overall collection which are likely to contain answers
2. Use an answer extraction component to extract answers from the highest ranked documents/passages retrieved in step 1
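Schematically (a sketch only; retriever, extractor and preprocess are hypothetical placeholders standing in for whatever IR engine and answer extractor a particular system uses):

def answer_question(question, collection, retriever, extractor, top_n=20):
    # Two-part QA architecture: (1) an IR step retrieves candidate
    # documents/passages for the (preprocessed) question; (2) an answer
    # extraction step pulls answers out of the top-ranked ones.
    query = preprocess(question)                     # question preprocessing
    passages = retriever(query, collection)[:top_n]  # step 1: IR component
    return extractor(question, passages)             # step 2: answer extraction

def preprocess(question):
    # Placeholder preprocessing: lower-case and strip punctuation
    return "".join(c for c in question.lower() if c.isalnum() or c.isspace())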

Clearly the performance of step 1 places a bound on the performance of step 2

What is the most appropriate way to assess the performance of step 1?
– Conventional metrics for evaluating IR systems are recall and precision
– Not directly useful in the QA context …

Page 32: TREC QA Track Evaluation: Stimulations …

Intuitively, want to know
– % of questions which have an answer in the top n ranks of returned documents/passages, i.e. how far down the ranking to go
– # of answer instances in the top n ranks, i.e. redundancy

Let
– Q be the question set
– D the document (or passage) collection
– A_{D,q} ⊆ D be the subset of D which contains correct answers for q ∈ Q
– R_{D,q,n} be the n top-ranked documents (or passages) in D retrieved given question q

Define

Coverage(Q, D, n) = |{ q ∈ Q : R_{D,q,n} ∩ A_{D,q} ≠ ∅ }| / |Q|

Redundancy(Q, D, n) = ( Σ_{q ∈ Q} |R_{D,q,n} ∩ A_{D,q}| ) / |Q|
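A minimal Python rendering of these two definitions, for illustration only; retrieved and answer_docs are assumed data structures, not anything distributed with TREC:

def coverage_and_redundancy(retrieved, answer_docs, n):
    # retrieved:   {question: ranked list of doc/passage ids}
    # answer_docs: {question: set of ids known to contain a correct answer}
    # Returns (coverage, redundancy) at rank cutoff n.
    questions = list(retrieved)
    covered = 0
    answer_hits = 0
    for q in questions:
        hits = set(retrieved[q][:n]) & answer_docs.get(q, set())
        if hits:
            covered += 1          # at least one answer-bearing item in top n
        answer_hits += len(hits)  # redundancy counts all of them
    return covered / len(questions), answer_hits / len(questions)

# Hypothetical results for two questions with a rank cutoff of 3
retrieved = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d7", "d8", "d9"]}
answers = {"q1": {"d2", "d4"}, "q2": {"d5"}}
print(coverage_and_redundancy(retrieved, answers, n=3))  # (0.5, 0.5)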

Page 33: TREC QA Track Evaluation: Stimulations …

Given these new metrics we can now ask new questions about, e.g., different passage retrieval approaches (passages constructed before or after the initial query?)

Page 34: TREC QA Track Evaluation: Stimulations …

Conclusion: evaluations lead to new questions about components, which in turn lead to new metrics – and the new metrics let us ask further questions

Page 35: Outline of Talk

Introduction: Perspectives on Evaluation in HLT

Technology Evaluation: Terms and Definitions

An Extended Example: The Cub Reporter Scenario– Application Evaluation: Question Answering and Summarisation– Component Evaluation: Time and Event Recognition

Ingredients of a Successful Evaluation

Snares and Delusions

Conclusions

Page 36: Component Task Evaluation: Time and Event Recognition

Answering questions, extracting information from text, and summarising documents all presuppose sensitivity to the temporal location and ordering of events in text

• When did the war between Iran and Iraq end?
• When did John Sununu travel to a fundraiser for John Ashcroft?
• How many Tutsis were killed by Hutus in Rwanda in 1994?
• Who was Secretary of Defense during the Gulf War?
• What was the largest U.S. military operation since Vietnam?
• When did the astronauts return from the space station on the last shuttle flight?

Page 37: Time and Event Recognition

To address this task a significant effort has recently been made to create
– An agreed standard annotation for times, events and temporal relations in text (TimeML)

– A corpus of texts annotated according to this standard (TimeBank)

– An annotation tool to support manual creation of the annotations

– Automated tools to assist in the annotation process (time and event taggers)

This effort was made via a 9-month workshop (Jan-Sep 2002) called TERQAS: Time and Event Recognition for Question Answering Systems sponsored by ARDA

See www.time2002.org

Significantly informed by earlier work by Andrea Setzer at Sheffield

Page 38: TERQAS: Organisation

TERQAS was organised into 6 working groups:
– TimeML Definition and Specification
– Algorithm Review and Development
– Article Corpus Collection Development
– Query Corpus Development and Classification
– TIMEBANK Annotation
– TimeML and Algorithm Evaluation

Page 39: TERQAS: Participants

– James Pustejovsky, PI– Rob Gaizauskas– Graham Katz– Bob Ingria – José Castaño– Inderjeet Mani– Antonio Sanfilippo– Dragomir Radev– Patrick Hanks– Marc Verhagen– Beth Sundheim– Andrea Setzer

– Jerry Hobbs– Bran Boguraev– Andy Latto– John Frank– Lisa Ferro– Marcia Lazo– Roser Saurí– Anna Rumshisky– David Day– Luc Belanger– Harry Wu– Andrew See

Page 40: TERQAS: Outcomes

Creation of a robust markup language for temporal expressions, event expressions, and the relations between them (TimeML 1.0)

Guidelines for Annotation

Creation of a Gold standard annotated against this language (TIMEBANK)

Creation of a Suite of Algorithms (T3PO) for recognizing:
– Temporal Expressions
– Event Expressions
– Signals
– Link Construction

Development of a Text Segmented Closure Algorithm

Creation of a Semi-graphical Annotation Tool for speeding up annotation of dependency-rich texts (SGAT)

Query Database Creation Tool

Guidelines for Creating a Corpus of Questions

Initial Scoring and Inter-annotator Evaluation Setup

Page 41: TimeML: The Conceptual and Linguistic Basis

TimeML presupposes the following temporal entities and relations.

Events are taken to be situations that occur or happen, punctual or lasting for a period of time. They are generally expressed by means of tensed or untensed verbs, nominalisations, adjectives, predicative clauses, or prepositional phrases.

Times may be either points, intervals, or durations. They may be referred to by fully specified or underspecified temporal expressions, or intensionally specified expressions.

Relations can hold between events, and between events and times. They can be temporal, subordinate, or aspectual relations.

Page 42: TimeML: Annotating Events

Events are marked up by annotating a representative of the event expression, usually the head of the verb phrase.

The attributes of events are a unique identifier, the event class, tense, and aspect.

Fully annotated example:

All 75 passengers <EVENT eid="1" class="OCCURRENCE" tense="past" aspect="NONE"> died </EVENT>

See full TimeML spec for handling of events conveyed by nominalisations or stative adjectives.
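As an aside added here (not from the slides), a TimeML fragment like the one above can be inspected with Python's standard XML parser; wrapping the fragment in a dummy root element is simply an assumption about how one might handle isolated examples:

import xml.etree.ElementTree as ET

fragment = ('All 75 passengers <EVENT eid="1" class="OCCURRENCE" '
            'tense="past" aspect="NONE">died</EVENT>')

# Wrap the fragment in a dummy <s> element so it parses as well-formed XML
root = ET.fromstring("<s>" + fragment + "</s>")

for event in root.iter("EVENT"):
    # Each EVENT carries a unique id, its class, tense and aspect
    print(event.attrib["eid"], event.attrib["class"],
          event.attrib["tense"], event.attrib["aspect"], event.text)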

Page 43: TimeML: Annotating Times

Annotation of times designed to be as compatible with TIMEX2 time expression annotation guidelines as possible.

Fully annotated example for a straightforward time expression:

<TIMEX3 tid="1" type="DATE" value="1966-07" temporalFunction="false"> July 1966 </TIMEX3>

Additional attributes are used to, e.g. anchor relative time expressions and supply functions for computing absolute time values (last week).

Page 44: TimeML: Annotating Signals

The SIGNAL tag is used to annotate sections of text, typically function words, that indicate how temporal objects are to be related to each other.

Also used to mark polarity indicators such as not, no, none, etc., as well as indicators of temporal quantification such as twice, three times, and so forth.

Signals have only one attribute, a unique identifier.

Fully annotated example:

Two days <SIGNAL sid="1"> before </SIGNAL> the attack …

Page 45: TimeML: Annotating Relations (1)

To annotate the different types of relations that can hold between events, and between events and times, the LINK tag has been introduced.

There are three types of LINKs: TLINKs, SLINKs, and ALINKs, each of which has temporal implications.

A TLINK or Temporal Link represents the temporal relationship holding between events or between an event and a time.

It establishes a link between the involved entities making explicit whether their relationship is: before, after, includes, is_included, holds, simultaneous, immediately after, immediately before, identity, begins, ends, begun by, ended by.

Page 46: TimeML: Annotating Relations (2)

An SLINK or Subordination Link is used for contexts introducing relations between two events, or an event and a signal.
– SLINKs are of one of the following sorts: Modal, Factive, Counter-factive, Evidential, Negative evidential, Negative.

An ALINK or Aspectual Link represents the relationship between an aspectual event and its argument event.
– The aspectual relations encoded are: initiation, culmination, termination, continuation.

Page 47: Annotating Relations (3)

Annotated examples: TLINK: John taught on Monday

<TLINK eventInstanceID="2" relatedToTime="4" signalID="4" relType="IS_INCLUDED"/>

SLINK: John said he taught

<SLINK eventInstanceID="3" subordinatedEvent="4" relType="EVIDENTIAL"/>

ALINK: John started to read

<ALINK eventInstanceID="5" relatedToEvent="6" relType="INITIATES"/>

Page 48: Comparing TimeML Annotations

Time and event annotations may be compared as are other text tagging annotations (e.g. named entities, template elements)

Problem: semantically identical temporal relations can be annotated in multiple ways

Solution: the deductive closure of temporal relations is computed and compared for two different annotations of the same text (similar to solution for MUC coreference task)

Provides basis for defining Precision and Recall metrics for evaluation and inter-annotator agreement

[Diagram: three events A, B, C linked by temporal relations such as < (before) and ~ (simultaneous), illustrating relations added by computing the closure]
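To illustrate the idea with a toy sketch (not the MUC or TimeBank scorer, and covering only a single 'before' relation rather than the full TimeML relation set): close each annotation under transitivity, then score one against the other by precision and recall over the closed sets.

def closure_before(relations):
    # Transitive closure of a set of (x, y) pairs meaning 'x before y'
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

def precision_recall(system, key):
    # Compare two annotations of the same text after computing the closure
    sys_closed, key_closed = closure_before(system), closure_before(key)
    overlap = len(sys_closed & key_closed)
    return overlap / len(sys_closed), overlap / len(key_closed)

# Two annotators state the same facts in different ways
annotator1 = {("A", "B"), ("B", "C")}              # closure adds (A, C)
annotator2 = {("A", "B"), ("B", "C"), ("A", "C")}
print(precision_recall(annotator1, annotator2))    # (1.0, 1.0) after closure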

Page 49: An Earlier Pilot Study

Based on Andrea Setzer’s annotation scheme which was taken as the starting point for TimeML

trial corpus: 6 New York Times newspaper texts (1996)
each text annotated by 2-3 annotators + gold standard
questions:
– guidelines comprehensive and unambiguous?
– how much genuine disagreement?
– feasible to annotate a larger corpus?

calculated recall and precision of annotators against gold standard

Page 50: Pilot Study Annotation Procedure

annotation takes place in stages:
1. annotate events and times

2. annotate explicit temporal relations

3. annotate ‘obvious’ implicit temporal relations

4. annotate less obvious implicit temporal relations by inference and interactively soliciting information from the annotator

TimeBank annotation follows more or less the same procedure
– Events and times automatically annotated in a 1st pass and corrected by a human

Page 51: Pilot Study Inter-Annotator Results

two sets of results:
1. agreement on entities (events and times): 77% recall and 81% precision
   agreement on attributes: 60% recall and 64% precision
2. agreement on temporal relations: 40% recall and 68% precision

Page 52: Outline of Talk

Introduction: Perspectives on Evaluation in HLT

Technology Evaluation: Terms and Definitions

An Extended Example: The Cub Reporter Scenario– Application Evaluation: Question Answering and Summarisation– Component Evaluation: Time and Event Recognition

Ingredients of a Successful Evaluation

Snares and Delusions

Conclusions

Page 53: Ingredients of a Successful (Technology) Evaluation

Resources (= money)
– Human (people to make it work)
– Data (copyright issues, distribution)

Well-defined task
– Can humans independently perform the task at an acceptable level (e.g. > 80%) when given only a written specification?

Controlled challenge
– Should be hard enough to attract interest, stimulate new approaches
– Not so hard as to require too much effort from participants or to be impossible to specify

Metrics
– Need to capture intuitively significant aspects of system performance
– Need to experiment with new metrics (they lead to asking different questions)

Page 54: Ingredients of a Successful (Technology) Evaluation

Participants
– Must be a community willing to take part
– Is the task so hard that participants need funding to do it?

Reusability
– Evaluation should result in resources – data (raw and annotated), scoring software, guidelines – that can be reused

Extensibility
– Should be possible to ramp up the challenge year-on-year so as to capitalise on the resources invested and the community which has been established

Page 55: Snares and Delusions

Diminishing returns
– Evaluations can lead to participants getting obsessed with reducing error/increasing performance by fractions of a percent

Pseudo-science
– Because we’re measuring something it must be science

Significance
– Are differences between participants’ systems statistically significant?

Users care

Page 56: Conclusions

A successful technology evaluation should advance the field

Advances can include:
– Better models of human language processing (e.g. better applications and components)
– Creation of reusable resources and materials
– Creation of a community of researchers focused on the same task
– Creation of an ethos of repeatability and transparency (i.e. good scientific practice)

Page 57: The End

I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it and cannot express it in numbers your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.

Lord Kelvin, Popular Lectures and Addresses (1889), vol. 1, p. 73.

