
Comparing Bayesian Models of Annotation

Silviu Paun¹, Bob Carpenter², Jon Chamberlain³, Dirk Hovy⁴, Udo Kruschwitz³, Massimo Poesio¹

¹ School of Electronic Engineering and Computer Science, Queen Mary University of London
² Department of Statistics, Columbia University
³ School of Computer Science and Electronic Engineering, University of Essex
⁴ Department of Marketing, Bocconi University

Introduction

• Crowdsourcing is increasingly used as an alternative to traditional expert annotation
• Crowdsourced annotations require aggregation methods
• Previous methods of analysis include majority voting aggregation and agreement statistics (a minimal majority-vote baseline is sketched after this list)
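The majority-vote baseline mentioned above can be made concrete with a short sketch. This is a minimal illustration, not code from the paper; the (item, annotator, label) triple format and the tie-breaking behaviour are assumptions made for the example.

from collections import Counter, defaultdict

def majority_vote(annotations):
    """Aggregate crowd labels by simple majority voting.

    `annotations` is assumed to be an iterable of (item, annotator, label)
    triples; ties are broken arbitrarily by Counter.most_common.
    """
    labels_per_item = defaultdict(list)
    for item, _annotator, label in annotations:
        labels_per_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}

# Toy usage
triples = [("i1", "a1", "yes"), ("i1", "a2", "yes"), ("i1", "a3", "no"),
           ("i2", "a1", "no"), ("i2", "a2", "no")]
print(majority_vote(triples))  # {'i1': 'yes', 'i2': 'no'}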


Introduction

• Probabilistic models of annotation can solve many of the problems faced by previous practices
• A large number of models have been proposed
• The literature comparing models of annotation is limited


Introduction

• We compare six existing models of annotation with distinct prior and likelihood structures and a diverse set of effects
• The evaluation is done in both gold-dependent and gold-independent settings
• We use three standard NLP datasets for testing crowdsourcing
• We also include an additional dataset produced using a game with a purpose


Reviewed models

• A pooled model: all annotators share the same ability
  • Ability parameterized in terms of a confusion matrix (the Multinomial model)
• Unpooled models: each annotator has their own response parameter
  • Different annotator ability structures
    • Confusion matrix (the Dawid and Skene model; a minimal EM sketch follows this list)
    • Credibility and spamming preference (the MACE model)
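To make the Dawid and Skene structure above concrete: each annotator gets a per-class confusion matrix, and the true item classes are latent. The paper compares Bayesian fits of such models; the sketch below is instead the classic (non-Bayesian) EM recipe, written only to illustrate the structure. The three-dimensional count-array layout, the fixed iteration count, and the smoothing constant are assumptions made for the example.

import numpy as np

def dawid_skene_em(counts, n_iter=50, smoothing=0.01):
    """Minimal EM for the (unpooled) Dawid and Skene model.

    counts[i, j, k] = how many times annotator j gave label k to item i.
    Returns (class_posteriors, prevalence, confusion), where
    confusion[j, true_class, label] is annotator j's estimated confusion matrix.
    """
    # Initialise item-class posteriors with a soft majority vote.
    post = counts.sum(axis=1) + smoothing
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prevalence and per-annotator confusion matrices.
        prevalence = post.mean(axis=0) + smoothing
        prevalence /= prevalence.sum()
        confusion = np.einsum('ik,ijl->jkl', post, counts) + smoothing
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: posterior over the true class of each item.
        log_post = np.log(prevalence) + np.einsum('ijl,jkl->ik', counts, np.log(confusion))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, prevalence, confusion

Aggregated labels are then post.argmax(axis=1); the Bayesian versions discussed in the paper place priors over the prevalence and the confusion matrices instead of point-estimating them.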


Reviewed models

• Partially-pooled models: assume both individual and hierarchical structure
  • Partial pooling across annotators (the Hierarchical Dawid and Skene model)
  • Partial pooling across item difficulties (the Item Difficulty model)
  • Partial pooling across annotators and item difficulties (the Logistic Random Effects model)


The “Logistic Random Effects” model

[Plate diagram of the Logistic Random Effects model; symbols shown: class prevalence π, true item classes c_i, annotations y_{i,n}, annotator abilities β_{j,k} with hierarchical parameters ζ_k and Ω_k, and item difficulties θ_i with scale Χ_k]
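The diagram itself does not survive the extraction, so the generative story below is a hedged reconstruction from the symbols listed above, not a verbatim copy of the paper's equations; the indexing convention jj[i,n] (the annotator who produced the n-th annotation on item i) and the exact shapes of β and θ are assumptions. Consult the paper for the precise parameterization.

% Hedged LaTeX sketch, reconstructed from the plate-diagram symbols only.
\begin{align*}
  c_i          &\sim \mathrm{Categorical}(\pi)
               && \text{true class of item } i \\
  y_{i,n}      &\sim \mathrm{Categorical}\big(\operatorname{softmax}(\beta_{jj[i,n],\,c_i} - \theta_{i,\,c_i})\big)
               && \text{annotation } n \text{ on item } i \\
  \beta_{j,k}  &\sim \mathrm{Normal}(\zeta_k, \Omega_k)
               && \text{ability of annotator } j \text{ under true class } k \\
  \theta_{i,k} &\sim \mathrm{Normal}(0, \mathrm{X}_k)
               && \text{difficulty of item } i \text{ under true class } k
\end{align*}

Here β_{j,k} and θ_{i,k} are taken to be K-vectors of logits, so partial pooling acts both across annotators (through ζ_k, Ω_k) and across items (through Χ_k).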


The data used in the evaluation

• Datasets produced using a standard crowdsourcing platform
  • The Snow et al. (2008) datasets
  • Used often in the crowdsourcing literature
• Data produced using a game with a purpose
  • The Phrase Detectives (Chamberlain et al., 2016) corpus
  • Less artificial, i.e., more variation


Dataset Statistics

[Table of per-dataset statistics (min, median, max) not recoverable from the extraction]

Comparison against a gold standard


• The models which assume some form of annotator structure got the best results
• Having a richer annotator structure can be more beneficial
• Ignoring the annotator structure generally leads to poor results
  • Unless the data is produced by annotators with similar behavior

[Result plots against the gold standard for the RTE dataset and the PD dataset]

Predictive accuracy evaluation


• Ambiguity can affect the reliability of gold standard datasets
• Posterior predictions are a standard assessment method for Bayesian models
• We measure the predictive performance of each model using the log predictive density (lpd) in a Bayesian K-fold cross-validation setting (a minimal lpd computation is sketched after this list)
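As a concrete reading of the lpd criterion above: for each held-out annotation, the predictive density is averaged over posterior draws before taking the log, and the per-fold sums are added up. The sketch below assumes the per-draw log-likelihoods of the held-out annotations have already been computed for each fold (how depends on the model and sampler), and uses a numerically stable log-sum-exp.

import numpy as np
from scipy.special import logsumexp

def lpd_from_folds(heldout_loglik_per_fold):
    """Sum of log predictive densities across the K held-out folds.

    Each element is assumed to be an array of shape
    (n_posterior_draws, n_heldout_annotations) holding log p(y | theta_s)
    for every posterior draw s and held-out annotation y. For each y,
    lpd(y) = log( (1/S) * sum_s p(y | theta_s) ).
    """
    total = 0.0
    for loglik in heldout_loglik_per_fold:
        n_draws = loglik.shape[0]
        total += (logsumexp(loglik, axis=0) - np.log(n_draws)).sum()
    return total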

Predictive accuracy evaluation


• Generally, the models which assume some form of annotator structure got the best results
  • In particular, it's the partially pooled models which are most consistent
  • The unpooled models are prone to overfitting
• Ignoring the annotator structure leads to poor predictive performance
  • Except for the WSD dataset, where all annotators are highly proficient (above 95% accuracy)

Results

An analysis of different player types


• The PD corpus comes with a list of spammers and one of good, established players

[Figure panels: a typical “spammer” and a typical “good player”]

Take away points


• Majority voting and agreement statistics lead to biased estimates

• Probabilistic models of annotation address these problems

  • Model different effects (annotator accuracy and bias, item difficulty)

• Best architecture: partially pooled models with annotator structure

Thank you!

Comparing Bayesian Models of Annotation

Silviu Paun¹, Bob Carpenter², Jon Chamberlain³, Dirk Hovy⁴, Udo Kruschwitz³, Massimo Poesio¹

¹ School of Electronic Engineering and Computer Science, Queen Mary University of London
² Department of Statistics, Columbia University
³ School of Computer Science and Electronic Engineering, University of Essex
⁴ Department of Marketing, Bocconi University

Technical notes

• Non-centred parameterizations
  • Problem: in hierarchical models, a complicated posterior curvature increases the difficulty of the sampling process (sparse data or large inter-group variances)
  • Solution: separate the local parameters from their parents (see the sketch after this list)
• Label-switching
  • Problem: refers to the likelihood's invariance under permutation of the labels
  • Occurs in mixture models; makes the models non-identifiable
  • Solution: gold alignment
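The non-centred trick can be stated in two lines of code: rather than drawing a local effect directly from Normal(μ, σ), the sampler works with a standard-normal "raw" variable and the effect is recovered deterministically. The snippet below is a minimal, model-agnostic numpy illustration of that identity; the variable names and values are hypothetical, and in an actual HMC fit the raw variable is what the sampler explores.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0  # hypothetical hierarchical location and scale

# Centred parameterization: the local effects depend directly on (mu, sigma),
# which is what produces the difficult "funnel" geometry when data are sparse
# or the inter-group variance is large.
beta_centred = rng.normal(mu, sigma, size=100)

# Non-centred parameterization: draw standard-normal raw effects, then shift
# and scale. The raw effects are a priori independent of (mu, sigma), so the
# geometry the sampler has to explore is much better behaved.
beta_raw = rng.normal(0.0, 1.0, size=100)
beta_noncentred = mu + sigma * beta_raw  # same distribution as beta_centred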


