The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini,
Idan Szpektor
Bar-Ilan University, CELCT, ITC-irst, Microsoft Research, MITRE
Variability of Semantic Expression
Model variability as relations between text expressions:
- Equivalence: expr1 ⇔ expr2
- Entailment: expr1 ⇒ expr2 (more general)
Example (diagram): alternative expressions around the same news event:
- Dow ends up
- Dow climbs 255
- The Dow Jones Industrial Average closed up 255
- Stock market hits a record high
- Dow gains 255 points
- All major stock markets surged
Applied Textual Entailment: Definition
Directional relation between two text fragments: Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if, typically, a human reading t would infer that h is most likely true.
Operational (applied) definition:
- As in NLP applications
- Assuming common background knowledge
Why textual entailment?
- Unified modeling of semantic inference, as required by various applications (IR, IE, QA, MDS)
- Text-to-text mapping, independent of concrete semantic representation
Goals for RTE-2
- Support research progress: more "realistic" examples
  - Input from common benchmarks
  - Output from real systems
  - Shows entailment potential to improve performance across applications
- Improve data collection and annotation
  - Revised and expanded guidelines
  - Most pairs triply annotated
- Provide linguistic processing
The RTE-2 Dataset
Overview
- 1600 pairs: 800 development, 800 test
- Followed the RTE-1 setting: t is 1-2 sentences, h is one (shorter) sentence
- 50%-50% positive-negative split in all subtasks
- Focused on primary applications: IE, IR, QA, (multi-document) Summarization
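For concreteness, a minimal Python sketch of loading such a dataset, assuming the usual RTE-style XML layout (pair elements with t and h children and id/task/entailment attributes; exact attribute names may differ across releases):

    import xml.etree.ElementTree as ET

    def load_rte_pairs(path):
        """Parse an RTE-style XML file into pair dicts (layout assumed, see above)."""
        pairs = []
        for pair in ET.parse(path).getroot().iter("pair"):
            pairs.append({
                "id": pair.get("id"),
                "task": pair.get("task"),         # IE / IR / QA / SUM
                "label": pair.get("entailment"),  # YES / NO gold judgment
                "t": pair.findtext("t", "").strip(),
                "h": pair.findtext("h", "").strip(),
            })
        return pairs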
Collecting IE pairs
- Motivation: a sentence containing a target relation should entail an instantiated template (illustrated by the sketch below this list).
- Pairs were generated in several ways:
  - Outputs of IE systems, for ACE-2004 and MUC-4 relations
  - Manually, for ACE-2004 and MUC-4 relations and for additional relations in the news domain
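A hedged sketch of template instantiation; the relation template and slot fillers below are invented for illustration, not actual ACE/MUC material:

    def instantiate(template: str, slots: dict) -> str:
        """Fill an IE template's slots to produce a candidate hypothesis."""
        for slot, filler in slots.items():
            template = template.replace("<" + slot + ">", filler)
        return template

    # Hypothetical employment relation (illustrative only):
    h = instantiate("<PERSON> works for <ORG>",
                    {"PERSON": "John Smith", "ORG": "Acme Corp"})
    # A text sentence expressing the relation should entail h:
    t = "Acme Corp announced that John Smith has joined the company as CFO."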
Collecting IR pairs
- Motivation: relevant documents should entail a given "propositional" query.
- Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF
- Texts selected from documents retrieved by different search engines
Collecting QA pairs
- Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.
- QA systems were given TREC and CLEF questions.
- Hypotheses were generated by "plugging" the system's answer term into the affirmative form of the question (see the sketch below this list)
- Texts correspond to the candidate answer passages
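A minimal sketch of the plugging step, assuming a hand-written affirmative pattern per question; the question, pattern, and answer below are invented for illustration:

    def plug_answer(affirmative_pattern: str, answer: str) -> str:
        """Turn a question's affirmative form plus a system's answer into a hypothesis."""
        return affirmative_pattern.replace("<ANSWER>", answer)

    # Hypothetical example (illustrative only):
    question = "Who is the president of France?"
    pattern = "<ANSWER> is the president of France"  # affirmative form of the question
    h = plug_answer(pattern, "Jacques Chirac")       # hypothesis to test against t
    # t would be the candidate answer passage returned by the QA system.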
Collecting SUM (MDS) pairs
- Motivation: identifying redundant phrases
- Used web document clusters and system summaries
- Picked sentences having high lexical overlap with the summary (see the sketch below this list)
- In the final pairs:
  - Texts are original sentences (usually from the summary)
  - Hypotheses:
    - Positive pairs: simplify h until it is entailed by t
    - Negative pairs: simplify h similarly
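A rough sketch of overlap-based sentence selection; whitespace tokenization and the 0.5 threshold are illustrative choices, not the organizers' exact procedure:

    def overlap(sentence: str, summary: str) -> float:
        """Fraction of the sentence's word types that also appear in the summary."""
        s, ref = set(sentence.lower().split()), set(summary.lower().split())
        return len(s & ref) / len(s) if s else 0.0

    def pick_candidates(sentences, summary, threshold=0.5):
        """Keep cluster sentences whose lexical overlap with the summary is high."""
        return [s for s in sentences if overlap(s, summary) >= threshold]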
Creating the final dataset
- Average pairwise inter-judge agreement: 89.2%
- Average kappa: 0.78, substantial agreement; better than RTE-1
- Removed 18.2% of pairs due to disagreement (3-4 judges per pair)
- Disagreement example:
  (t) Women are under-represented at all political levels ...
  (h) Women are poorly represented in parliament.
- An additional review removed 25.5% of pairs as too difficult / vague / redundant
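For reference, pairwise agreement and kappa between two judges can be computed along these lines; a minimal sketch using scikit-learn, with invented judgment vectors:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical YES/NO judgments from two annotators on the same pairs:
    judge_a = ["YES", "NO", "YES", "YES", "NO", "YES"]
    judge_b = ["YES", "NO", "YES", "NO", "NO", "YES"]

    agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    kappa = cohen_kappa_score(judge_a, judge_b)  # chance-corrected agreement
    print(f"raw agreement: {agreement:.1%}, kappa: {kappa:.2f}")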
RTE-2 Systems
Submissions
- 23 groups: 35% growth compared to RTE-1
- 41 runs
- 13 groups participated for the first time (30 groups in total across RTE-1 and RTE-2)
Country        Number of Groups
USA            9
Italy          3.5
Spain          3
Netherlands    2
UK             1.5
Australia      1
Canada         1
Ireland        1
Germany        1
(fractions: groups spanning two countries are counted as half in each)
Methods and Approaches
Measure similarity between t and h (coverage of h by t):
- Lexical overlap (unigram, n-gram, subsequence)
- Lexical substitution (WordNet, statistical)
- Syntactic matching/transformations
- Lexical-syntactic variations ("paraphrases")
- Semantic role labeling and matching
- Global similarity parameters (e.g. negation, modality)
Other approaches:
- Cross-pair similarity
- Detect mismatch (for non-entailment)
- Logical inference
Dominant approach: Supervised Learning
- Features model both similarity and mismatch
- Train on the development set and auxiliary t-h corpora

Pipeline (diagram): (t, h) → features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES/NO
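A minimal sketch of this pipeline, with simple lexical-coverage features and a scikit-learn classifier; the feature set and model are illustrative, not any particular team's system:

    from sklearn.linear_model import LogisticRegression

    def features(t: str, h: str) -> list:
        """Toy similarity features: unigram and bigram coverage of h by t."""
        tw, hw = t.lower().split(), h.lower().split()
        uni = len(set(hw) & set(tw)) / max(len(set(hw)), 1)
        t_bi, h_bi = set(zip(tw, tw[1:])), set(zip(hw, hw[1:]))
        bi = len(h_bi & t_bi) / max(len(h_bi), 1)
        return [uni, bi]

    def train(pairs):
        """pairs: dicts with 't', 'h', and a YES/NO 'label' (e.g. the development set)."""
        X = [features(p["t"], p["h"]) for p in pairs]
        y = [1 if p["label"] == "YES" else 0 for p in pairs]
        return LogisticRegression().fit(X, y)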
Evaluation Measures
- Main task: classification
  - Compared to the gold entailment judgment
  - Evaluation criterion: accuracy
  - Baseline: 60%, a simple lexical overlap system, used as the baseline in [Zanzotto et al.]
- Secondary task: ranking
  - Pairs sorted by entailment confidence
  - Evaluation criterion: average precision (see the sketch below this list)
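Average precision over a confidence-ranked list can be computed as below; this is the standard formulation, with variable names of my own choosing:

    def average_precision(ranked_labels):
        """ranked_labels: gold YES/NO labels sorted by system confidence, descending.

        Averages precision-at-k over the ranks k where a positive pair appears.
        """
        hits, total = 0, 0.0
        for k, label in enumerate(ranked_labels, start=1):
            if label == "YES":
                hits += 1
                total += hits / k
        return total / hits if hits else 0.0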
Results
First Author (Group)       Accuracy       Average Precision
Hickl (LCC)                75.4%          80.8%
Tatu (LCC)                 73.8%          71.3%
Zanzotto (Milan & Rome)    63.9%          64.4%
Adams (Dallas)             62.6%          62.8%
Bos (Rome & Leeds)         61.6%          66.9%
11 groups                  58.1%-60.5%
7 groups                   52.9%-55.6%

Average accuracy: 60%; median: 59%
Analysis
For the first time, deep methods (semantic/syntactic/logical) clearly outperform shallow methods (lexical/n-gram)
Cf. Kevin Knight's invited talk at EACL, titled:
"Isn't Linguistic Structure Important?" Asked the Engineer
Still, most systems based on deep analysis did not score significantly better than the lexical baseline
Why?
System reports point in two directions:
- Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
- Lack of training data

It seems that the systems that coped better with these issues performed best:
- Hickl et al.: acquisition of large entailment corpora for training
- Tatu et al.: large knowledge bases (linguistic and world knowledge)
Open Questions
- Are knowledge and training data more important than the inference/matching method?
- Or perhaps, given more knowledge and training data, will the difference between inference methods become more apparent?
Per-task analysis
Task     Average Accuracy     Best Result
SUM      67.9%                84.5%
IR       60.8%                74.5%
QA       58.2%                70.5%
IE       52.2%                73.0%
Total    59.8%                75.4%
Some systems trained per-task
Some suggested research directions
- Acquiring larger entailment corpora
- Beyond parameter tuning: discovering the needed linguistic and world knowledge
- Manual knowledge engineering for concise knowledge (e.g. syntactic transformations, logical axioms)
- Further exploration of global information
- A principled framework for fusing information levels: are we happy with bags of features?
Conclusions
- RTE-2 introduced a more realistic dataset, based mostly on system outputs
- Participation shows growing interest in the textual entailment framework
- Accuracy improvements are very encouraging
- Many interesting new ideas and approaches
Acknowledgments
- Funding: PASCAL Network of Excellence
- PASCAL challenges program managers: Michele Sebag, Florence d'Alché-Buc, Steve Gunn
- Workshop local organizer: Rodolfo Delmonte
- Contributing systems: IE - NYU, IBM, ITC-irst; QA - AnswerBus, LCC; IR - Google, Yahoo, MSN; SUM - NewsBlaster (Columbia), NewsInEssence (U. Michigan)
- Datasets: TREC, TREC-QA, CLEF, QA@CLEF, MUC, ACE
- Annotation: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Alessandro Vallin, Elizabeth Lima, Jeff Stevenson, Amy Muia, The Butler Hill Group
- Advice: Dan Roth
- Special thanks: Oren Glickman
Enjoy the workshop!