Anca Dumitrache, Lora Aroyo, Chris Welty http://CrowdTruth.org
Achieving Expert-Level Annotation Quality with the Crowd
The Case of Medical Relation Extraction
Biomedical Data Mining, Modeling & Semantic Integration @ ISWC2015
#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality
CrowdTruth for medical relation extraction
• Goal: collect a relation extraction gold standard to improve the performance of a relation extraction classifier
• Approach: crowdsource 900 medical sentences, measure disagreement with the CrowdTruth metrics, and train & evaluate the classifier with the CrowdTruth score
RelEx Task in CrowdFlower
Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
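To make the task concrete, here is a minimal sketch of one such task unit as it might be posed to a worker; the field names and the candidate relation set below are illustrative assumptions, not taken from the slides.

```python
# Illustrative task unit: a sentence with two highlighted medical terms and a
# closed set of candidate relations; the worker selects every relation the
# sentence expresses between the two terms (or none of them).
task_unit = {
    "sentence": "Patients with ACUTE FEVER and nausea could be "
                "suffering from INFLUENZA AH1N1",
    "term1": "ACUTE FEVER",
    "term2": "INFLUENZA AH1N1",
    # hypothetical relation set; the actual task used a larger, UMLS-based set
    "candidate_relations": ["cause", "treat", "prevent", "symptom",
                            "diagnose", "side_effect", "is_a", "other", "none"],
}

# One worker's answer: the subset of candidate relations they selected.
worker_answer = {"symptom", "cause"}
```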
h"p://CrowdTruth.org
Worker Vector: each worker's answer for a sentence is encoded as a binary vector with one entry per candidate relation (1 if the worker selected it, 0 otherwise).
Sentence Vector: the element-wise sum of all worker vectors for the sentence, e.g. [0 1 1 0 0 4 3 0 0 5 1 0].
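A minimal sketch, using numpy, of how the worker vectors aggregate into the sentence vector and how a sentence-relation score can be computed from it; the cosine-similarity formulation below reflects the CrowdTruth metrics as I understand them, and the relation positions are assumptions.

```python
import numpy as np

# Each worker vector is binary: one slot per candidate relation, 1 if the
# worker selected that relation for the sentence, 0 otherwise.  The sentence
# vector is the element-wise sum of all worker vectors, e.g. the one above:
sentence_vector = np.array([0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0], dtype=float)

def sentence_relation_score(sentence_vector: np.ndarray, relation_index: int) -> float:
    """Cosine similarity between the sentence vector and the unit vector of a
    relation: how clearly the crowd expressed that relation for the sentence."""
    unit = np.zeros_like(sentence_vector)
    unit[relation_index] = 1.0
    norm = np.linalg.norm(sentence_vector)
    return 0.0 if norm == 0 else float(sentence_vector @ unit) / float(norm)

# Score for the relation in (hypothetical) position 9, the one selected by
# five of the workers above.
print(sentence_relation_score(sentence_vector, 9))
```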
h"p://CrowdTruth.org
Annotation Quality of Expert vs. Crowd Annotations
(figure: crowd max F1 = 0.907, p = 0.007, vs. expert F1 = 0.844)
In the 0.6 to 0.8 threshold range the crowd significantly outperforms the expert, with a maximum of 0.907 F1 at the 0.7 threshold.
RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations
(figure: classifier F1 = 0.642 when trained on crowd annotations, p = 0.016, vs. 0.638 when trained on expert annotations)
The crowd provides training data that is at least as good as, if not better than, the experts'.
Learning Curves (crowd with pos./neg. threshold at 0.5)
Above 400 sentences the crowd is consistently over the baseline and the single annotator; above 600 sentences the crowd outperforms the experts.
Learning Curves Extended (crowd with pos./neg. threshold at 0.5)
The crowd consistently performs better than the baseline.
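Such learning curves can be reproduced with a sketch like the one below: train on increasingly many sentences and evaluate F1 on a fixed test set at each size. The classifier here is a stand-in (logistic regression over whatever features are available), not the RelEx system from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def learning_curve(X_train, y_train, X_test, y_test,
                   sizes=(100, 200, 400, 600, 800)):
    """F1 on the held-out test set after training on the first n sentences,
    for each n in `sizes`; plotting these points gives the learning curve."""
    points = []
    for n in sizes:
        clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        points.append((n, f1_score(y_test, clf.predict(X_test))))
    return points
```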
# of Workers: Impact on Sentence-Relation Score
# of Workers: Impact on Annotation Quality
(only 54 sentences had 15 or more workers)
h"p://CrowdTruth.org
Experts vs. Crowd in Human Annotation: Overall Comparison
• 91% of expert annotations are covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
h"p://CrowdTruth.org
Expert vs. Crowd in Human Annotation: Cost Comparison
Method | F1 | Cost per sentence
CrowdTruth | 0.642 | $0.66
Expert Annotator | 0.638 | $2.00
Single Annotator | 0.492 | $0.08
Experiments proved that:
• the crowd performs just as well as medical experts
• the crowd is also cheaper
• the crowd is always available
• using only a few annotators for ground truth is faulty
• a minimum of 10 workers per sentence is needed for the highest-quality annotations
CrowdTruth = a solution to the Clinical NLP Challenge: the lack of ground truth for training & benchmarking
#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I #ISWC2015
CrowdTruth.org
http://data.CrowdTruth.org/medical-relex