Evaluating and Improving Inference Rules via Crowdsourcing
Naomi Zeichner
Supervisors: Prof. Ido Dagan & Dr. Meni Adler
Inference Rules – an important component in semantic applications

Question Answering:
  Q: Where was Reagan raised?
  A: Reagan was brought up in Dixon.
  Text: Reagan was brought up in Dixon → Hypothesis: Reagan was raised in Dixon
  Rule: X brought up in Y → X raised in Y

Information Extraction – Hiring Event (PERSON: Bob, ROLE: analyst):
  Text: Bob worked as an analyst for Dell
  Rule: X work as Y → X hired as Y
Current State
• Many algorithms for the automatic acquisition of inference rules
• The quality of automatically acquired rules is often poor
• We would like an indication of how likely a rule is to produce correct rule applications

Examples of acquired rules: X reside in Y → X live in Y; X reside in Y → X born in Y; X criticize Y → X attack Y
Our Goal
An efficient and reliable way to manually assess the validity of inference rules, useful for two purposes:
• A dataset for training and evaluation
• Improving the rule base
Outline
1. Inference Rule-Base Evaluation – Current State
2. Crowdsourcing Rule Application Annotations – Our Framework
3. Evaluate & Improve Inference Rule-Base – Use Cases
Evaluation – What are the options?

1. Impact on an end task (QA, IE, RTE)
   Pro: What interests an inference-system developer
   Con: Many components addressing multiple phenomena make it hard to assess the effect of a single resource

2. Judge rule correctness directly
   (e.g., X reside in Y → X live in Y; X reside in Y → X born in Y; X criticize Y → X attack Y)
   Pro: Theoretically the most intuitive
   Con: In fact hard to do; often results in low inter-annotator agreement

3. Instance-based evaluation (Szpektor et al. 2007, Bhagat et al. 2007)
   Pro: Simulates the utility of rules in an application; yields high inter-annotator agreement
Instance Based Evaluation

Decision flow for a rule application (Rule + Sentence):
1. Find LHS – does the sentence entail the instantiated LHS? No → Invalid
2. Generate RHS – is the instantiated RHS meaningful? No → Not Entailing
3. Does the sentence entail the instantiated RHS? Yes → Entailing; No → Not Entailing

Example – Rule: X acquire Y → X buy Y
  Sentence: Kim acquired new abilities at school.
    LHS extraction: Kim acquired abilities (not: Kim acquired school)
    RHS phrase: Kim buy abilities → Not Entailing
  Sentence: Dropbox acquired Audiogalaxy.
    RHS phrase: Dropbox buy Audiogalaxy → Entailing
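A minimal sketch of this decision flow in Python; the function and type names are illustrative, not from the original implementation:

```python
from enum import Enum

class Verdict(Enum):
    INVALID = "invalid"              # bad rule application: LHS not entailed
    NOT_ENTAILING = "not entailing"
    ENTAILING = "entailing"

def evaluate_application(sentence_entails_lhs: bool,
                         rhs_is_meaningful: bool,
                         sentence_entails_rhs: bool) -> Verdict:
    """Decision flow of instance-based evaluation for one rule application."""
    if not sentence_entails_lhs:
        return Verdict.INVALID           # e.g. "Kim acquired school"
    if not rhs_is_meaningful:
        return Verdict.NOT_ENTAILING
    return (Verdict.ENTAILING if sentence_entails_rhs
            else Verdict.NOT_ENTAILING)  # e.g. "Kim buy abilities"
```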
Instance Based Evaluation – Issues
The decision flow above is:
• Complex – Szpektor reported 43%
• Hard to replicate – requires lengthy guidelines & training
Crowdsourcing
• Recent trend of using crowdsourcing for annotation tasks
• Requires tasks to be coherent and simple
• Does not allow for long instructions or extensive training
Requirements Summary
• Replicable & Reliable
  – Rule applications: a good representation of rule use; coherent
  – Annotation process: simple; communicates entailment without lengthy guidelines and training
Outline
1. Inference Rule-Base Evaluation – Current State
2. Crowdsourcing Rule Application Annotations – Our Framework
3. Evaluate & Improve Inference Rule-Base – Use Cases
Overview
Pipeline: Rule Base → Generation → Rule Applications → Crowdsourcing → Annotated Rule Applications
Overview – Generation
The Generation stage (Rule Base → Find LHS in a sentence → Generate RHS) produces the Rule Applications passed on to Crowdsourcing.
Rule Application Generation
Rule: X shoot Y → X attack Y
[Templates as dependency trees: shoot:V with subj → X:N and obj → Y:N; attack:V with subj → X:N and obj → Y:N]
Rule Application Generation – Sentence Extraction
Sentence: The bank manager shoots one of the robbers.
[Dependency parse of the sentence; the LHS template shoot:V (subj → X:N, obj → Y:N) matches with X = manager and Y = one]
Rule Application Generation – RHS Phrase Generation
Sentence: The bank manager shoots one of the robbers.
The RHS template attack:V (subj → X:N, obj → Y:N) is instantiated in steps:
  X attack Y → manager attack one → (expanding each argument to its full sub-tree) The bank manager attack one of the robbers.
Phrase: The bank manager attack one of the robbers.
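A minimal sketch of the instantiation step, assuming each matched argument arrives with its already-linearized sub-tree; the Argument class and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Argument:
    head: str      # the matched argument head, e.g. "manager"
    subtree: str   # its full sub-tree, already linearized

def instantiate_rhs(rhs_template: str, x: Argument, y: Argument) -> str:
    """Fill the RHS template slots with the full argument sub-trees."""
    return rhs_template.replace("X", x.subtree).replace("Y", y.subtree)

# Rule: X shoot Y -> X attack Y, matched against
# "The bank manager shoots one of the robbers."
x = Argument(head="manager", subtree="The bank manager")
y = Argument(head="one", subtree="one of the robbers")
print(instantiate_rhs("X attack Y", x, y))
# -> The bank manager attack one of the robbers.
```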
Rule Application Generation – Sentence Filtering
Problem: the LHS phrase is not entailed by the sentence (Sentence ⇏ LHS)
Example – Sentence: They were first used as fighting dogs.
  LHS extraction (template fight:V, subj → X:N, obj → Y:N): they fight dogs
Cause: parsing errors
Solution: verify the sentence parsing – 53% of sentences were filtered out
Bonus: ungrammatical sentences are filtered out as well
Overview – Crowdsourcing
The Crowdsourcing stage annotates each rule application in two steps – is the RHS meaningful? (No → Not Entailing) and does the sentence entail the RHS? (Yes → Entailing; No → Not Entailing) – yielding Annotated Rule Applications.
Crowdsourcing: Simplify Process
Rule applications to annotate:
• Rule: X greet Y → X marry Y; Sentence: Mr. Monk visits her, and she greets him with real joy. Phrase: she marry him
• Rule: X acquire Y → X buy Y; Sentence: Kim acquired new abilities at school. Phrase: Kim buy abilities
• Rule: X shoot Y → X attack Y; Sentence: The bank manager shoots one of the robbers. Phrase: The bank manager attack one of the robbers.

The annotation is split into two simple tasks:
1. Is a phrase meaningful? – judged on the phrase alone: she marry him; Kim buy abilities; The bank manager attack one of the robbers.
2. Judge if a phrase is true given a sentence – for the phrases judged meaningful:
   • Sentence: The bank manager shoots one of the robbers. Phrase: The bank manager attack one of the robbers.
   • Sentence: Mr. Monk visits her, and she greets him with real joy. Phrase: she marry him
Crowdsourcing: Communicate Entailment
Gold standard – annotated rule applications:
1. Educating: "confusing" examples are used as gold, with feedback if Turkers get them wrong
   Sentence: Michelle thinks like an artist. Phrase: Michelle behave like an artist.
   Feedback: No. It is quite possible for someone to think like an artist but not behave like an artist.
2. Enforcing: unanimous examples are used as gold to estimate Turker reliability
Crowdsourcing: Aggregate Annotation
• Each rule application is evaluated by 3 Turkers
• Annotations are aggregated by:
  – Majority vote
  – A bias-correction measure for non-expert annotators (Snow et al. 2008)
Snow's Method
Notation: $x_i$ – the 'true' label of example i; $y_{iw}$ – the label provided by worker w to example i; $y_i^W$ – all workers' labels for example i.

Posterior log-odds ratio:
$$\log \frac{P(x_i = Y \mid y_i^W)}{P(x_i = N \mid y_i^W)} = \sum_{w \in W} \log \frac{P(y_{iw} \mid x_i = Y)}{P(y_{iw} \mid x_i = N)} + \log \frac{P(x_i = Y)}{P(x_i = N)}$$

With a uniform prior distribution the second term vanishes. Each worker's probabilities to answer Y or N given the true label $x_i$ are estimated from the worker's performance on expert-annotated examples.

Aggregated label:
$$x_i = \begin{cases} Y & \text{if the posterior log-odds ratio} > 0 \\ N & \text{otherwise} \end{cases}$$
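A minimal sketch of this aggregation under a uniform prior; the data layout (vote dictionaries, confusion-probability tables) is an assumption, not the original implementation:

```python
import math

def snow_label(worker_votes, worker_probs):
    """Aggregate Y/N votes with per-worker bias correction (Snow et al. 2008).

    worker_votes: {worker_id: 'Y' or 'N'} for one example.
    worker_probs: {worker_id: {(vote, true_label): P(vote | true_label)}},
                  estimated from the worker's answers on expert gold.
    Returns the aggregated label under a uniform prior P(Y) = P(N).
    """
    log_odds = 0.0  # log P(x=Y)/P(x=N) = 0 for the uniform prior
    for w, vote in worker_votes.items():
        log_odds += math.log(worker_probs[w][(vote, 'Y')]
                             / worker_probs[w][(vote, 'N')])
    return 'Y' if log_odds > 0 else 'N'

# Two reliable workers outweigh one uninformative worker:
probs = {
    'w1': {('Y', 'Y'): .9, ('N', 'Y'): .1, ('Y', 'N'): .1, ('N', 'N'): .9},
    'w2': {('Y', 'Y'): .9, ('N', 'Y'): .1, ('Y', 'N'): .1, ('N', 'N'): .9},
    'w3': {('Y', 'Y'): .5, ('N', 'Y'): .5, ('Y', 'N'): .5, ('N', 'N'): .5},
}
print(snow_label({'w1': 'Y', 'w2': 'Y', 'w3': 'N'}, probs))  # -> Y
```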
Crowdsourcing: Evaluation
Agreement between Turker and expert annotations:

  Task 1: 88% agreement, kappa 0.70
  Task 2: 93% agreement, kappa 0.78*

*Considerably higher than the 0.65 kappa reported by Szpektor et al. (2007)
Requirements Summary
• Replicable & Reliable
  – Rule applications:
    • Good representation of rule use – Simple Wikipedia & argument sub-trees
    • Coherent – parsing validation
  – Annotation process:
    • Simple – split tasks
    • Communicates entailment without lengthy guidelines and training – gold standard
  – Crowdsourcing
Outline
1. Inference Rule-Base Evaluation – Current State
2. Crowdsourcing Rule Application Annotations – Our Framework
3. Evaluate & Improve Inference Rule-Base – Use Cases
Use Cases
Our goal: an efficient and reliable way to manually assess the validity of inference rules, useful for two purposes:
• A dataset for training and evaluation → Use Case 1: Evaluating Rule Acquisition Methods
• Improving the rule base → Use Case 2: Improving the Accuracy Estimate of Automatically Acquired Inference Rules
Use Case 1: Data Set
• A supplementary study derived from this work (Zeichner et al. 2012)
• Generated rule applications using four inference-rule learning methods
• Annotated each rule application using our framework
• After some filtering, 6,567 rule applications remained
Use Case 1: Output
• Task 1: 1,012 meaningless phrases (labeled as non-entailment), 5,555 meaningful phrases (passed to Task 2)
• Task 2: 2,447 positive entailment, 3,108 negative entailment
• Overall: 6,567 rule applications annotated for $1000, in about a week
Use Case 1: Algorithm Comparison

Algorithm                          AUC
DIRT (Lin and Pantel, 2001)        0.40
Cover (Weeds and Weir, 2003)       0.43
BInc (Szpektor and Dagan, 2008)    0.44
Berant (Berant et al., 2010)       0.52
Use Case 1: Results
• A large-scale dataset of rule-application annotations, collected quickly and at reasonable cost
• Allowed comparison between different inference-rule learning methods
Use Case 2: Setting
• We follow the evaluation methodology of Szpektor et al. (2008)
• Implemented a naïve Information Extraction (IE) system
• Event: Attack(X, Y), extracted via rules with learned scores:
  – X shoot Y → X attack Y (0.245773)
  – X bomb Y → X attack Y (0.30322)
  – X destroy Y → X attack Y (0.298797)
• Example sentence: Banks was convicted of shooting and killing a 16-year-old at a park in 1980
Use Case 2: Rule Re-Scoring Methods
• Original Score (baseline): the score produced by the rule-learning algorithm
• Crowd Score: the fraction of a rule's instantiations annotated as entailing, out of those judged for the rule
• Combined Score: a linear combination of the Crowd Score and the rule-learning score
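A minimal sketch of the three scoring methods; the interpolation weight `alpha` is an assumption, since the slides only specify "a linear combination":

```python
def crowd_score(n_entailing: int, n_judged: int) -> float:
    """Fraction of a rule's judged instantiations annotated as entailing."""
    return n_entailing / n_judged if n_judged else 0.0

def combined_score(original: float, crowd: float, alpha: float = 0.5) -> float:
    """Linear combination of the crowd score and the rule-learning score."""
    return alpha * crowd + (1 - alpha) * original

# A rule with original score 0.303 and 7 of 10 instantiations entailing:
print(combined_score(original=0.303, crowd=crowd_score(7, 10)))  # 0.5015
```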
Use Case 2: Context-Specific Instructions
• The context in which the rules will be used must be reflected in the crowdsourced annotations
• Example: for the End-Position event, X fire Y should be read as X dismiss Y, not as X shoot Y
• The annotation guidelines were adapted to consider context in judgment
Use Case 2: Evaluation
• Comparison with a manual expert ranking
• Performance on an Information Extraction (IE) task
Use Case 2: Evaluation – Manual Ranking
Top-ranked rules for the "Sentence" event:

Ranked by Original Score:
  convict X of Y → sentence Y to X
  convict X of Y → sentence X for Y
  X guilty of Y → sentence X for Y
  X order Y → X sentence Y
  convict X of Y → Y sentence X

Ranked by Crowd Score:
  convict X of Y → sentence X for Y
  condemn X to Y → sentence Y to X
  X serve Y → sentence Y to X
  convict X of Y → sentence Y to X
  convict X of Y → sentence Y in X

Mean Average Precision – Original Score: 0.47; Crowd Score: 0.80
Use Case 2: Evaluation – IE Performance
Ranking settings – Mean Average Precision:

Scoring Method    Majority   Snow
Original Score    0.077      0.077
Crowd Score       0.115      0.135
Combined Score    0.118      0.138
Use Cases: Error Analysis – Crowdsourced Annotation Performance
• Ambiguity – Sentence: members disagree with leadership. Phrase: members take exception to leadership ("take exception" can mean raise an objection or take offense)
• Entailment definition – Sentence: A doctor claimed he died of stomach cancer. Phrase: he die of stomach cancer
Use Cases: Error Analysis – Rule-Base Performance on the IE Task
• Corpora differences – Event: Arrest-Jail; Rule: X capture Y → X arrest Y
  From the IE corpus – Sentence: American commandos captured a half brother of Saddam Hussein on Thursday. Phrase: commandos arrest half brother
  From Simple Wikipedia – Sentence: In 1622 AD, Nurhaci's armies captured Guang Ning. Phrase: Nurhaci's armies arrest Guang Ning
Future Work
• Find a better corpus to use for rule-application generation
• Use the framework to determine rule context
• Treat rule-base ranking as a machine-learning problem ('learning to rank')
Conclusion
• A replicable framework
• High-quality annotations, quickly and at reasonable cost
• Will hopefully encourage the use of inference rules

Thank You
Generation: Creating the RHS Instantiation
• Template linearization – the dependency template attack:V (obj → X:N, mod → in:Prep → Y:N) is linearized to the phrase "attack X in Y"
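A minimal sketch of template linearization, assuming the template is a small head-first tree; a full implementation would place each node by its dependency relation and surface position:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    word: str                        # lemma or slot name ("X", "Y")
    children: List["Node"] = field(default_factory=list)  # surface order

def linearize(node: Node) -> str:
    """Emit head first, then children left to right; a real implementation
    would order nodes by relation (subj before the verb, obj/mod after)."""
    return " ".join([node.word] + [linearize(c) for c in node.children])

# Template: attack:V with obj -> X:N and mod -> in:Prep -> Y:N
template = Node("attack", [Node("X"), Node("in", [Node("Y")])])
print(linearize(template))  # -> attack X in Y
```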
Instance Based Evaluation – Decisions
Target: judge if a rule application is valid or not
• Rule: X greet Y → X marry Y; Sentence: Mr. Monk visits her, and she greets him with real joy. Phrase: she marry him
• Rule: X turn in Y → X bring in Y; Sentence: Humans turn in bed during the night. Phrase: Humans bring in bed
• Rule: X fight Y → X attack Y; Sentence: The American soldiers fought the British troops, in 1775. Phrase: The American soldiers attack the British troops.
Crowdsourcing: Aggregate Annotation – Snow's Method (details)
Bias correction for non-expert annotators.

Worker probabilities, estimated on expert-annotated gold:
$P(y_{iw} = Y \mid x_i = Y)$, $P(y_{iw} = N \mid x_i = Y)$, $P(y_{iw} = Y \mid x_i = N)$, $P(y_{iw} = N \mid x_i = N)$

Posterior log-odds ratio (Bayes' rule, with workers conditionally independent given the true label):
$$\log \frac{P(x_i = Y \mid y_i^W)}{P(x_i = N \mid y_i^W)} = \log \frac{P(x_i = Y) \prod_{w \in W} P(y_{iw} \mid x_i = Y)}{P(x_i = N) \prod_{w \in W} P(y_{iw} \mid x_i = N)} = \sum_{w \in W} \log \frac{P(y_{iw} \mid x_i = Y)}{P(y_{iw} \mid x_i = N)} + \log \frac{P(x_i = Y)}{P(x_i = N)}$$
Use Case 1: Data Set – Details
1. Apply each rule-learning method to a set of one billion tuple extractions of the form Arg1 predicate Arg2
2. Sample 5,000 extractions
3. For each extraction, apply all relevant rules
4. Compare the extractions of each method to the crowdsourced annotations to obtain true-positive (TP), false-positive (FP) and false-negative (FN) counts
5. Calculate Recall and Precision:
   Recall = TP / (TP + FN)    Precision = TP / (TP + FP)
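A minimal sketch of steps 4-5, assuming a method's extractions and the crowd-confirmed extractions are represented as sets of (Arg1, predicate, Arg2) tuples (an assumed representation):

```python
def recall_precision(predicted: set, gold: set) -> tuple:
    """Recall and precision of a method's extractions against the gold."""
    tp = len(predicted & gold)   # extractions confirmed by the annotations
    fp = len(predicted - gold)   # extractions judged incorrect
    fn = len(gold - predicted)   # correct extractions the method missed
    return tp / (tp + fn), tp / (tp + fp)

recall, precision = recall_precision(
    predicted={("Bob", "hired as", "analyst"), ("Kim", "buy", "abilities")},
    gold={("Bob", "hired as", "analyst")},
)
print(recall, precision)  # -> 1.0 0.5
```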
Use Case 2: Mean Average Precision (MAP)

1. MAP ← 0
2. for each IE event:
   2.1 AP ← 0
   2.2 TPcum ← 0
   2.3 FPcum ← 0
   2.4 for each rule in the event (in ranked order):
       2.4.1 if the rule is judged correct by the expert:
             2.4.1.1 TPcum ← TPcum + 1
             2.4.1.2 AP ← AP + [TPcum / (TPcum + FPcum)]   (cumulative precision)
       2.4.2 else:
             2.4.2.1 FPcum ← FPcum + 1
   2.5 AP ← AP / TPcum
   2.6 MAP ← MAP + AP
3. MAP ← MAP / number of events

Legend: AP – average precision; TP / FP – true / false positive; TPcum / FPcum – cumulative TP / FP
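A runnable Python rendering of the pseudocode above; each event is assumed to be a ranked list of expert judgments (True = rule judged correct):

```python
def mean_average_precision(events):
    """MAP over IE events; each event is a list of booleans in rank order
    (True = rule judged correct by the expert)."""
    map_total = 0.0
    for ranked_rules in events:
        ap, tp_cum, fp_cum = 0.0, 0, 0
        for correct in ranked_rules:
            if correct:
                tp_cum += 1
                ap += tp_cum / (tp_cum + fp_cum)  # cumulative precision
            else:
                fp_cum += 1
        if tp_cum:                                # skip events with no TPs
            map_total += ap / tp_cum
    return map_total / len(events)

# One event whose ranking places a wrong rule second: (1/1 + 2/3) / 2
print(mean_average_precision([[True, False, True]]))  # -> 0.8333...
```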
Use Case 2: Mean Average Precision (MAP) – per-rule-application variant
The same computation where each rule contributes its own TP/FP counts (ruleTP / ruleFP):

1. MAP ← 0
2. for each IE event:
   2.1 AP ← 0
   2.2 TPcum ← 0
   2.3 FPcum ← 0
   2.4 for each rule in the event (in ranked order):
       2.4.1 FPcum ← FPcum + ruleFP
       2.4.2 ruleTPcum ← 0
       2.4.3 for each ruleTP:
             2.4.3.1 ruleTPcum ← ruleTPcum + 1
             2.4.3.2 AP ← AP + [ruleTPcum / (TPcum + ruleTPcum + FPcum)]   (cumulative precision)
       2.4.4 TPcum ← TPcum + ruleTPcum
   2.5 AP ← AP / TPcum
   2.6 MAP ← MAP + AP
3. MAP ← MAP / number of events

Legend: AP – average precision; TP / FP – true / false positive; ruleTP / ruleFP – TP / FP for a rule; TPcum / FPcum – cumulative TP / FP; ruleTPcum – cumulative TP for the rule
Crowdsourcing: Evaluation – Kappa
• Takes into account the agreement occurring by chance:
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
where Pr(a) is the relative observed agreement among raters and Pr(e) is the hypothetical probability of chance agreement:
$$\Pr(a) = \frac{\text{examples agreed on}}{\text{all annotated examples}}$$
$$\Pr(e) = \frac{\text{worker A answered Yes}}{\text{all annotated examples}} \cdot \frac{\text{worker B answered Yes}}{\text{all annotated examples}} + \frac{\text{worker A answered No}}{\text{all annotated examples}} \cdot \frac{\text{worker B answered No}}{\text{all annotated examples}}$$
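A minimal sketch of this kappa computation for two annotators with Yes/No labels:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same Yes/No examples."""
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n               # Pr(a)
    p_chance = ((a.count("Y") / n) * (b.count("Y") / n)
                + (a.count("N") / n) * (b.count("N") / n))        # Pr(e)
    return (p_agree - p_chance) / (1 - p_chance)

print(cohens_kappa(list("YYNNY"), list("YYNYY")))  # -> 0.545...
```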