The Second PASCAL Recognising Textual Entailment Challenge
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini,
Idan Szpektor
Bar-Ilan University, CELCT, ITC-irst, Microsoft Research, MITRE
Variability of Semantic Expression
Model variability as relations between text expressions:
- Equivalence: expr1 ⇔ expr2
- Entailment: expr1 ⇒ expr2 (more general)
Example (diagram): alternative expressions around the same news event:
- Dow ends up
- Dow climbs 255
- The Dow Jones Industrial Average closed up 255
- Stock market hits a record high
- Dow gains 255 points
- All major stock markets surged
Applied Textual Entailment: Definition
Directional relation between two text fragments: Text (t) and Hypothesis (h):
t entails h (t ⇒ h) if, typically, a human reading t would infer that h is most likely true.
Operational (applied) definition:
- As in NLP applications
- Assuming common background knowledge
Why textual entailment?
- Unified modeling of semantic inference, as required by various applications (IR, IE, QA, MDS)
- Text-to-text mapping, independent of concrete semantic representation
Goals for RTE-2
- Support research progress: more "realistic" examples
  - Input from common benchmarks
  - Output from real systems
  - Shows entailment potential to improve performance across applications
- Improve data collection and annotation
  - Revised and expanded guidelines
  - Most pairs triply annotated
- Provide linguistic processing
The RTE-2 Dataset
Overview
- 1600 pairs: 800 development, 800 test
- Followed the RTE-1 setting: t is 1-2 sentences, h is one (shorter) sentence
- 50%-50% positive-negative split in all subtasks
- Focused on primary applications: IE, IR, QA, (multi-document) Summarization
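For concreteness, a minimal Python sketch of loading such a dataset, assuming the usual RTE-style XML layout (pair elements with t and h children and id/task/entailment attributes; exact attribute names may differ across releases):

    import xml.etree.ElementTree as ET

    def load_rte_pairs(path):
        """Parse an RTE-style XML file into pair dicts (layout assumed, see above)."""
        pairs = []
        for pair in ET.parse(path).getroot().iter("pair"):
            pairs.append({
                "id": pair.get("id"),
                "task": pair.get("task"),         # IE / IR / QA / SUM
                "label": pair.get("entailment"),  # YES / NO gold judgment
                "t": pair.findtext("t", "").strip(),
                "h": pair.findtext("h", "").strip(),
            })
        return pairs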
Collecting IE pairs
- Motivation: a sentence containing a target relation should entail an instantiated template (illustrated by the sketch below this list).
- Pairs were generated in several ways:
  - Outputs of IE systems, for ACE-2004 and MUC-4 relations
  - Manually, for ACE-2004 and MUC-4 relations and for additional relations in the news domain
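A hedged sketch of template instantiation; the relation template and slot fillers below are invented for illustration, not actual ACE/MUC material:

    def instantiate(template: str, slots: dict) -> str:
        """Fill an IE template's slots to produce a candidate hypothesis."""
        for slot, filler in slots.items():
            template = template.replace("<" + slot + ">", filler)
        return template

    # Hypothetical employment relation (illustrative only):
    h = instantiate("<PERSON> works for <ORG>",
                    {"PERSON": "John Smith", "ORG": "Acme Corp"})
    # A text sentence expressing the relation should entail h:
    t = "Acme Corp announced that John Smith has joined the company as CFO."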
Collecting IR pairs
- Motivation: relevant documents should entail a given "propositional" query.
- Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF
- Texts selected from documents retrieved by different search engines
Collecting QA pairs
- Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.
- QA systems were given TREC and CLEF questions.
- Hypotheses were generated by "plugging" the system's answer term into the affirmative form of the question (see the sketch below this list)
- Texts correspond to the candidate answer passages
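A minimal sketch of the plugging step, assuming a hand-written affirmative pattern per question; the question, pattern, and answer below are invented for illustration:

    def plug_answer(affirmative_pattern: str, answer: str) -> str:
        """Turn a question's affirmative form plus a system's answer into a hypothesis."""
        return affirmative_pattern.replace("<ANSWER>", answer)

    # Hypothetical example (illustrative only):
    question = "Who is the president of France?"
    pattern = "<ANSWER> is the president of France"  # affirmative form of the question
    h = plug_answer(pattern, "Jacques Chirac")       # hypothesis to test against t
    # t would be the candidate answer passage returned by the QA system.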
Collecting SUM (MDS) pairs
- Motivation: identifying redundant phrases
- Used web document clusters and system summaries
- Picked sentences having high lexical overlap with the summary (see the sketch below this list)
- In the final pairs:
  - Texts are original sentences (usually from the summary)
  - Hypotheses:
    - Positive pairs: simplify h until it is entailed by t
    - Negative pairs: simplify h similarly
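A rough sketch of overlap-based sentence selection; whitespace tokenization and the 0.5 threshold are illustrative choices, not the organizers' exact procedure:

    def overlap(sentence: str, summary: str) -> float:
        """Fraction of the sentence's word types that also appear in the summary."""
        s, ref = set(sentence.lower().split()), set(summary.lower().split())
        return len(s & ref) / len(s) if s else 0.0

    def pick_candidates(sentences, summary, threshold=0.5):
        """Keep cluster sentences whose lexical overlap with the summary is high."""
        return [s for s in sentences if overlap(s, summary) >= threshold]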
Creating the final dataset
- Average pairwise inter-judge agreement: 89.2%
- Average kappa: 0.78, substantial agreement; better than RTE-1
- Removed 18.2% of pairs due to disagreement (3-4 judges per pair)
- Disagreement example:
  (t) Women are under-represented at all political levels ...
  (h) Women are poorly represented in parliament.
- An additional review removed 25.5% of pairs as too difficult / vague / redundant
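For reference, pairwise agreement and kappa between two judges can be computed along these lines; a minimal sketch using scikit-learn, with invented judgment vectors:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical YES/NO judgments from two annotators on the same pairs:
    judge_a = ["YES", "NO", "YES", "YES", "NO", "YES"]
    judge_b = ["YES", "NO", "YES", "NO", "NO", "YES"]

    agreement = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    kappa = cohen_kappa_score(judge_a, judge_b)  # chance-corrected agreement
    print(f"raw agreement: {agreement:.1%}, kappa: {kappa:.2f}")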
RTE-2 Systems
Submissions
- 23 groups: 35% growth compared to RTE-1
- 41 runs
- 13 groups participated for the first time (30 groups in total across RTE-1 and RTE-2)
Country        Number of Groups
USA            9
Italy          3.5
Spain          3
Netherlands    2
UK             1.5
Australia      1
Canada         1
Ireland        1
Germany        1
(fractions: groups spanning two countries are counted as half in each)
Methods and Approaches
Measure similarity between t and h (coverage of h by t):
- Lexical overlap (unigram, n-gram, subsequence)
- Lexical substitution (WordNet, statistical)
- Syntactic matching/transformations
- Lexical-syntactic variations ("paraphrases")
- Semantic role labeling and matching
- Global similarity parameters (e.g. negation, modality)
Other approaches:
- Cross-pair similarity
- Detect mismatch (for non-entailment)
- Logical inference
Dominant approach: Supervised Learning
- Features model both similarity and mismatch
- Train on the development set and auxiliary t-h corpora

Pipeline (diagram): (t, h) → features (lexical, n-gram, syntactic, semantic, global) → feature vector → classifier → YES/NO
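A minimal sketch of this pipeline, with simple lexical-coverage features and a scikit-learn classifier; the feature set and model are illustrative, not any particular team's system:

    from sklearn.linear_model import LogisticRegression

    def features(t: str, h: str) -> list:
        """Toy similarity features: unigram and bigram coverage of h by t."""
        tw, hw = t.lower().split(), h.lower().split()
        uni = len(set(hw) & set(tw)) / max(len(set(hw)), 1)
        t_bi, h_bi = set(zip(tw, tw[1:])), set(zip(hw, hw[1:]))
        bi = len(h_bi & t_bi) / max(len(h_bi), 1)
        return [uni, bi]

    def train(pairs):
        """pairs: dicts with 't', 'h', and a YES/NO 'label' (e.g. the development set)."""
        X = [features(p["t"], p["h"]) for p in pairs]
        y = [1 if p["label"] == "YES" else 0 for p in pairs]
        return LogisticRegression().fit(X, y)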
Evaluation Measures
- Main task: classification
  - Compared to the gold entailment judgment
  - Evaluation criterion: accuracy
  - Baseline: 60%, a simple lexical overlap system, used as the baseline in [Zanzotto et al.]
- Secondary task: ranking
  - Pairs sorted by entailment confidence
  - Evaluation criterion: average precision (see the sketch below this list)
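Average precision over a confidence-ranked list can be computed as below; this is the standard formulation, with variable names of my own choosing:

    def average_precision(ranked_labels):
        """ranked_labels: gold YES/NO labels sorted by system confidence, descending.

        Averages precision-at-k over the ranks k where a positive pair appears.
        """
        hits, total = 0, 0.0
        for k, label in enumerate(ranked_labels, start=1):
            if label == "YES":
                hits += 1
                total += hits / k
        return total / hits if hits else 0.0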
Results
First Author (Group)       Accuracy       Average Precision
Hickl (LCC)                75.4%          80.8%
Tatu (LCC)                 73.8%          71.3%
Zanzotto (Milan & Rome)    63.9%          64.4%
Adams (Dallas)             62.6%          62.8%
Bos (Rome & Leeds)         61.6%          66.9%
11 groups                  58.1%-60.5%
7 groups                   52.9%-55.6%

Average accuracy: 60%; median: 59%
Analysis
For the first time, deep methods (semantic/syntactic/logical) clearly outperform shallow methods (lexical/n-gram)
Cf. Kevin Knight's invited talk at EACL, titled:
"Isn't Linguistic Structure Important?" Asked the Engineer
Still, most systems based on deep analysis did not score significantly better than the lexical baseline
Why?
System reports point in two directions:
- Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)
- Lack of training data

It seems that the systems that coped better with these issues performed best:
- Hickl et al.: acquisition of large entailment corpora for training
- Tatu et al.: large knowledge bases (linguistic and world knowledge)
Open Questions
- Are knowledge and training data more important than the inference/matching method?
- Or perhaps, given more knowledge and training data, will the difference between inference methods become more apparent?
Per-task analysis
Task     Average Accuracy     Best Result
SUM      67.9%                84.5%
IR       60.8%                74.5%
QA       58.2%                70.5%
IE       52.2%                73.0%
Total    59.8%                75.4%
Some systems trained per-task
Some suggested research directions
- Acquiring larger entailment corpora
- Beyond parameter tuning: discovering the needed linguistic and world knowledge
- Manual knowledge engineering for concise knowledge (e.g. syntactic transformations, logical axioms)
- Further exploration of global information
- A principled framework for fusing information levels: are we happy with bags of features?
Conclusions
- RTE-2 introduced a more realistic dataset, based mostly on system outputs
- Participation shows growing interest in the textual entailment framework
- Accuracy improvements are very encouraging
- Many interesting new ideas and approaches
Acknowledgments
- Funding: PASCAL Network of Excellence
- PASCAL challenges program managers: Michele Sebag, Florence d'Alché-Buc, Steve Gunn
- Workshop local organizer: Rodolfo Delmonte
- Contributing systems: IE - NYU, IBM, ITC-irst; QA - AnswerBus, LCC; IR - Google, Yahoo, MSN; SUM - NewsBlaster (Columbia), NewsInEssence (U. Michigan)
- Datasets: TREC, TREC-QA, CLEF, QA@CLEF, MUC, ACE
- Annotation: Malky Rabinowitz, Dana Mills, Ruthie Mandel, Errol Hayman, Vanessa Sandrini, Alessandro Vallin, Elizabeth Lima, Jeff Stevenson, Amy Muia, The Butler Hill Group
- Advice: Dan Roth
- Special thanks: Oren Glickman
Enjoy the workshop!