Crowdsourcing Ambiguity-Aware Ground Truth
Chris Welty, Anca Dumitrache, Oana Inel, Benjamin Timmermans, Lora Aroyo
June 15th, 2017, Collective Intelligence Conference
Crowdsourcing Myth: Disagreement is Bad
• traditionally, disagreement is considered a measure of poor quality in the annotation task because:
– the task is poorly defined, or
– annotators lack training
• this makes the elimination of disagreement a goal, rather than accepting disagreement as a natural property of semantic interpretation
What if it is GOOD?
Crowdsourcing Myth: All Examples Are Created Equal
• typically, annotators are asked whether a binary property holds for each example
• they are often not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same: it either matches the correct result or it does not
• poor-quality examples tend to generate high disagreement
• disagreement allows us to weight sentences, giving us the ability to both train and evaluate a machine in a more flexible way
What if they are DIFFERENT?
Related Work on Annotating Ambiguity
● Jurgens (2013): For word-sense disambiguation, the crowd with ambiguity modeling was able to achieve expert-level quality of annotations.
● Cheatham et al. (2014): Current benchmarks in ontology alignment and evaluation are not designed to model uncertainty caused by disagreement between annotators, both expert and crowd.
● Plank et al. (2014): For part-of-speech tagging, most inter-annotator disagreement pointed to debatable cases in linguistic theory.
● Chang et al. (2017): In a workflow of tasks for collecting and correcting labels for text and images, they found that many ambiguous cases cannot be resolved by better annotation guidelines or through worker quality control.
CrowdTruth
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
Medical Relation Extraction
1. workers select the medical relations that hold between the given 2 terms in a sentence
2. closed task - workers pick from given set of 14 top UMLS relations
3. 975 sentences from PubMed abstracts + distant supervision for term pair extraction
4. 15 workers /sentence
Twitter Event Extraction
1. workers select the events expressed in the tweet
2. closed task - workers pick from set of 8 events
3. 2,019 English tweets, crawled based on event hashtags
4. 7 workers / tweet (* no expert annotators)
News Event Identification
1. workers highlight words/phrases that describe an event in the sentence
2. open-ended task - workers can pick any words
3. 200 sentences from English TimeBank corpus
4. 15 workers /sentence
Sound Interpretation
1. workers give tags that describe a sound
2. open-ended task - workers can pick any tag
3. 284 sounds from Freesound database
4. 10 workers / sound
Medical Relation Extraction
Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
Twitter Event Extraction
News Event Identification
Sound Interpretation
Worker Vector - Closed Task
Medical Relation Extraction
[Figure: one worker's annotation vector, with a binary component per candidate relation, set to 1 for each relation the worker selected]
Media Unit Vector - Closed Task
[Figure: the individual worker vectors for a sentence are summed component-wise into the media unit vector, e.g. (0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0)]
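To make this construction concrete, here is a minimal sketch in Python with NumPy. The relation names and worker answers are illustrative assumptions, not the actual 14 UMLS relations or crowd data from the task.

import numpy as np

# Toy annotation set; the real task used 14 top UMLS relations.
RELATIONS = ["treats", "causes", "prevents", "diagnoses"]

def worker_vector(selected):
    """Binary vector: 1 for every relation this worker selected."""
    return np.array([1 if r in selected else 0 for r in RELATIONS])

# Hypothetical answers from three workers on the same sentence (media unit).
answers = [{"causes"}, {"causes", "diagnoses"}, {"prevents"}]
worker_vectors = np.array([worker_vector(a) for a in answers])

# The media unit vector is the component-wise sum of all worker vectors.
media_unit_vector = worker_vectors.sum(axis=0)
print(media_unit_vector)  # [0 2 1 1]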
Medical Relation Extraction
Unclear relationship between the two arguments, reflected in the disagreement
Medical Relation Extraction
Clearly expressed relation between the two arguments, reflected in the agreement
Twitter Event Extraction
The tension building between China, Japan, U.S., and Vietnam is reaching new heights this week - let's stay tuned! #APAC #powerstruggle
Which of the following EVENTS can you identify in the tweet?
Media unit vector over the nine answer options: (4, 0, 0, 0, 6, 0, 0, 0, 2), including 4 votes for "islands disputed between China and Japan", 6 for "anti China protests in Vietnam", and 2 for "None"
Unclear description of events in the tweet, reflected in the disagreement
Twitter Event Extraction
RT @jc_stubbs: Another tragic day in #Ukraine - More than 50 rebels killed as new leader unleashes assault: http://t.co/wcfU3kyAFX
Which of the following EVENTS can you identify in the tweet?
Media unit vector over the nine answer options: (0, 6, 0, 0, 0, 0, 0, 0, 1), including 6 votes for "Ukraine crisis 2014" and 1 for "None"
Clear description of events in the tweet, reflected in the agreement
Media Unit Vector - Open Task
Worker 1: balloon exploding
Worker 2: gun
Worker 3: gunshot
Worker 4: loud noise, pop
Worker 5: gun, balloon
+ clustering (e.g. word2vec; a toy sketch follows below)
Media unit vector over the resulting tag clusters: balloon: 2, gun / gunshot: 3, loud noise: 1, pop: 1
Sound Interpretation
Multiple interpretations of the sound, reflected in the disagreement
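As a rough illustration of the clustering step, the sketch below greedily merges free-text tags whose pairwise similarity exceeds a threshold. The hand-coded similarity table and the 0.7 threshold are stand-in assumptions for a real word2vec model; none of these values come from the slides.

# Toy similarity table standing in for embedding (e.g. word2vec) cosine similarity.
SYNONYMS = {
    frozenset({"gun", "gunshot"}): 0.9,
    frozenset({"balloon", "balloon exploding"}): 0.85,
}

def similarity(a, b):
    return 1.0 if a == b else SYNONYMS.get(frozenset({a, b}), 0.0)

def cluster_tags(tags, threshold=0.7):
    """Greedily merge tags into clusters keyed by their first member."""
    clusters = {}
    for tag in tags:
        for rep in clusters:
            if similarity(tag, rep) >= threshold:
                clusters[rep] += 1
                break
        else:
            clusters[tag] = 1
    return clusters

# All tags given by the five workers in the example above.
tags = ["balloon exploding", "gun", "gunshot", "loud noise", "pop", "gun", "balloon"]
print(cluster_tags(tags))  # {'balloon exploding': 2, 'gun': 3, 'loud noise': 1, 'pop': 1}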
Sound Interpretation
Worker 1: siren
Worker 2: police siren
Worker 3: siren, alarm
Worker 4: siren
Worker 5: annoying
+ clustering (e.g. word2vec)
Media unit vector over the resulting tag clusters: siren: 4, alarm: 1, annoying: 1
Clear meaning of the sound, reflected in the agreement
News Event Identification
Other than usage in business, Internet technology is also beginning to infiltrate the lifestyle domain.
HIGHLIGHT the words/phrases that refer to an EVENT.
Worker highlight counts per word: Other: 0, usage: 3, business: 1, Internet: 5, technology: 4, is also: 2, beginning: 2, infiltrate: 4, lifestyle: 5, domain: 3
Unclear specification of events in the sentence, reflected in the disagreement
News Event Identification
Most of the city's monuments were destroyed including a magnificent tiled mosque which dominated the skyline for centuries.
HIGHLIGHT the words/phrases that refer to an EVENT.
Worker highlight counts per word: Most: 1, city: 3, monuments: 5, were: 5, destroyed: 12, including: 0, magnificent: 1, tiled: 1, mosque: 1, dominated: 4, skyline: 1, centuries: 0
Clear specification of events in the sentence, reflected in the agreement
Media Unit - Annotation Score
Measures how clearly a media unit expresses an annotation: the cosine between the media unit vector and the unit vector for that annotation.
Media unit vector: (0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0)
Unit vector for annotation A6: (0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
Cosine = .55
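A minimal NumPy sketch of this score, reproducing the cosine of .55 from the example above; the vector values are from the slide, while the function name is my own:

import numpy as np

media_unit = np.array([0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0])

def unit_annotation_score(media_unit_vector, annotation_index):
    """Cosine between the media unit vector and one annotation's one-hot vector."""
    unit = np.zeros_like(media_unit_vector)
    unit[annotation_index] = 1
    return media_unit_vector @ unit / (
        np.linalg.norm(media_unit_vector) * np.linalg.norm(unit))

print(round(unit_annotation_score(media_unit, 5), 2))  # 0.55 for annotation A6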
Experimental Setup
• Goal: what is the most accurate data labeling method? (the first three are contrasted in the sketch below)
○ CrowdTruth: media unit - annotation score (continuous value)
○ Majority Vote: decision of the majority of workers (discrete)
○ Single: decision of a single worker, randomly sampled (discrete)
○ Expert: decision of a domain expert (discrete)
• Approach: evaluate against a trusted label set, built from either:
○ agreement between crowd and expert, or
○ manual evaluation in cases of disagreement
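A hedged sketch contrasting the three crowd-based labeling methods on one toy media unit. The worker responses and the 0.5 score threshold are illustrative assumptions; the slides only specify that the threshold is tuned per task.

import random
import numpy as np

# 5 hypothetical workers x 3 candidate annotations (binary worker vectors).
worker_vectors = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
])
media_unit = worker_vectors.sum(axis=0)           # [4, 2, 0]
scores = media_unit / np.linalg.norm(media_unit)  # unit-annotation score per annotation

crowdtruth = scores >= 0.5                        # continuous score, thresholded
majority = media_unit > len(worker_vectors) / 2   # strict majority vote
single = worker_vectors[random.randrange(len(worker_vectors))] == 1

print(scores.round(2), crowdtruth, majority, single)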
Evaluation: F1 score
CrowdTruth performs better than Majority Vote and at least as well as Expert.
Each task has a different best threshold on the media unit - annotation score; a sketch of threshold selection follows below.
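A minimal sketch of how such a per-task threshold might be chosen: sweep candidate thresholds on the unit-annotation score and keep the one that maximizes F1 against the trusted labels. The scores and labels below are made up for illustration.

import numpy as np
from sklearn.metrics import f1_score

scores = np.array([0.9, 0.55, 0.2, 0.7, 0.1, 0.4])  # illustrative unit-annotation scores
trusted = np.array([1, 1, 0, 1, 0, 0])              # illustrative trusted labels

best_f1, best_t = max(
    (f1_score(trusted, (scores >= t).astype(int)), t)
    for t in np.arange(0.05, 1.0, 0.05)
)
print(best_f1, best_t)  # the best threshold differs per task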
Evaluation: number of workers
Each task reaches a stable F1 with a different number of workers.
Majority Vote never beats CrowdTruth.
Sound Interpretation needs more workers.
Experiments proved that:
• CrowdTruth performs just as well as domain experts
– the crowd is also cheaper
– the crowd is always available
• capturing ambiguity is essential: majority voting discards important signals in the data
• using only a few annotators for ground truth is faulty: the optimal number of workers per media unit is task dependent
CrowdTruth.org
Dumitrache et al.: Empirical Methodology for Crowdsourcing Ground Truth. Semantic Web Journal, Special Issue on Human Computation and Crowdsourcing (HC&C) in the Context of the Semantic Web (in review).
Ambiguity in Crowdsourcing
[Figure: diagram linking Media Unit, Annotation, and Worker]
Disagreement can indicate ambiguity!