ANALYSIS OF INTER-ANNOTATOR AGREEMENT
(TEXT MINING & REG. ANNOTATION)
RegCreative Jamboree, Friday, December 1st, 2006
MARTIN KRALLINGER, 2006
MAIN ASPECTS
Explore annotation overlap
Discuss variability in annotation
Text mining and regulatory element annotation: needs, limits, tasks
SOCIOLOGY OF GENOME ANNOTATION (Lincoln Stein, 2001)
Models of annotation
Museum model: small group of specialized curators
Jamboree model: a group of biologists and bioinformaticians come together for a short intensive annotation workshop
Cottage industry: decentralized effort of annotators among the recruited community
Factory model: highly automated methods
(Elsik et al., 2006)
WHY PRE-JAMBOREE QUEUE?
Get familiar with the annotation system (before the jamboree)!
Understand the content and annotation strategy of ORegAnno
Detect aspects that require improvement, such as incompleteness, ambiguity or wrong structures in the annotation strategy, guidelines or documentation -> active feedback (questionnaire and wiki)
Assess consistency of the current annotation procedures
Explore which aspects affect annotation agreement
Estimate difficulty of the task (alternative interpretations, uncertainty, etc.)
SIMILARITY MEASURES
Similarity calculation is a popular subject in computer science
Different entities considered:
Feature vectors: Alignment, Cosine, Dice, Euclidean, …
Strings or sequences of strings (text): averaged String Matching, TFIDF
Sets: Jaccard, Loss of Information, Resemblance
Sequences: Levenshtein Edit Distance
Trees: Bottom-up/Top-down Maximum Common Subtree, Tree Edit Distance
Graphs: Conceptual Similarity, Graph Isomorphism, Subgraph Isomorphism, Maximum Common Subgraph Isomorphism, Graph Isomorphism Covering, Shortest Path
Information theory: Jiang & Conrath, Lin, Resnik
Bioinformatics: sequence similarity, structural similarity, similarity of gene expression
Here: similarity between human annotations
(refer to SimPack project examples)
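Three of the measures listed above (Jaccard and Dice for sets, Levenshtein edit distance for sequences) can be sketched in a few lines. This is a minimal illustration and not SimPack code; the function names and the example labels (`gene1` … `gene4`) are hypothetical and do not come from the case study.

```python
from typing import Hashable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard coefficient |A & B| / |A | B| (1.0 by convention for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Dice coefficient 2|A & B| / (|A| + |B|) (1.0 by convention for two empty sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Hypothetical annotation sets from two curators (illustration only).
curator_a = {"gene1", "gene2", "gene3"}
curator_b = {"gene2", "gene3", "gene4"}

print(jaccard(curator_a, curator_b))     # 0.5
print(dice(curator_a, curator_b))        # 2 * 2 / 6
print(levenshtein("kitten", "sitting"))  # 3
```

Set measures suit fields where each curator contributes a bag of labels (genes, TFs, evidence codes); edit distance suits free-text or sequence fields where near-identical entries should still count as close.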
MEASUREMENT OF OBSERVER AGREEMENT
Assumption: when independent annotators agree, they are correct?!
Statistical agreement measures for categorical data
Overall proportion of agreement
Pairwise comparison; Cohen's kappa; Pearson Chi-square
Weighted kappa for multiple categories
High accuracy implies high agreement
Kappa is sometimes inconsistent with accuracy measured as AROC
Measurement of Observer Agreement, Kundel and Polansky, Statistical Concepts Series (2003)
Kappa coefficient
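For two annotators, Cohen's kappa is commonly written as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal unweighted version under that definition; the function name and the example labels are assumptions made for illustration, not data from the slides.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same label throughout; kappa is
        # undefined here, returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotation-type labels for four items (illustration only).
annotator_a = ["TFBS", "TFBS", "RR", "RR"]
annotator_b = ["TFBS", "RR", "RR", "RR"]
print(cohens_kappa(annotator_a, annotator_b))  # p_o = 0.75, p_e = 0.5 -> 0.5
```

The raw agreement in this toy example is 75%, but half of that is expected by chance, which is why kappa drops to 0.5.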
ANNOTATOR AGREEMENT FOR WSD
A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation, Ng et al.
Word Sense Disambiguation (WSD): a central problem in NLP
WSD: discerning the meaning of a word in context
Two human annotators may disagree in their sense assignment
Agreement of human annotators is often the baseline for evaluation of automated approaches
Case study using more than 30,000 instances of the most frequently occurring nouns and verbs in English
Words in sentences manually sense-tagged with WordNet senses by two groups of annotators
Used the Kappa score to measure inter-annotator agreement, considering the effect of chance agreement
Difficult to achieve high agreement when annotators have to assign refined sense tags
Importance of example sentences for the usage of each word sense
AGREEMENT OF SPEECH CORPORA
Phonetically annotated speech corpora
Quality of manual annotations affected by:
Implicit incoherence: labeling is incoherent due to human variability in perceptual capacities and other factors
Lack of consensus on coding schema: manual annotations reflect the variability of the interpretation and application of the coding schema by the annotators
Annotator characteristics: individual characteristics of coders such as familiarity with the material, amount of former training, motivation, interest and fatigue-induced errors
Measuring the Reliability of Manual Annotations of Speech Corpora, Gut and Bayerl
CHALLENGES FOR OREGANNO ANNOTATION
Complexity of gene regulation
Need of ontologies and lexical resources
Deep inference of domain expert curators
Spatial, temporal, experimental conditions
Range of entity types: genes, regulatory sequences, proteins
Gene family and individual gene member distinction
TF binding site sequence extraction and mapping to genome
TF mapping to normalized database entries (NCBI, Ensembl)
Archeology-like annotation: annotation of old papers
BUT GENE REGULATION IS ONE OF THE MAIN BIOLOGICAL INFORMATION (ANNOTATION) ASPECTS!
SOURCES FOR ANNOTATION VARIABILITY (1)
Curator background (biologist, bioinformatician,...)
Familiarity with the annotation system
Number of previously annotated papers or proteins
Prior knowledge on the regulated gene or TF
Prior knowledge (experience) on the experimental types
Sub-domain knowledge (e.g. developmental biology or OS)
Publication date (reflects the state of knowledge)
SOURCES FOR ANNOTATION VARIABILITY (2)
Nr. of papers annotated the same day (fatigue effect)
Unclear or partial documentation of certain annotation aspects
Annotation type (ontology of annotation types?, CV?)
Nr. of pages, figures, tables, references,…
Consultation of additional resources (material, databases, web)
Different degrees of granularity in annotation
Differences in the recall of manually extracted annotations (all?)
Sequence (paper/database, strand, typos, length)
REGCREATIVE CASE STUDY: PREJAMBOREE (1)
Relatively few articles -> only exploratory examination
Annotation type: agreement for 9/11 articles (disagreement for 2071609 and 10674400: RR vs. TFBS)
Considerable difference in average nr. of annotations/paper
Some curators extracted only a single annotation, others basically every annotation mentioned in the paper
Almost perfect agreement in organism source (1 case of human and mouse disagreement), but genes correct!
Very high agreement on the gene names, only few user defined cases (which are difficult to evaluate)
REGCREATIVE CASE STUDY: PREJAMBOREE (2)
Certain disagreement in TF names, many are user defined!
Evidence class: high agreement, many "Transcription regulator site" and "unknown"
Evidence type: high agreement, some more complete than others (again, some annotate all the types, others only some of them)
Evidence sub-type: similar to evidence types, but in general a little lower agreement than for the evidence type
Transcription factor names: PREJAMBOREE
User defined
NCBI
Ensembl
Unknown
Example case 1: TF annotation variance
PMID 7534794
Curator     TF name                 TF source
Curator B   UNKNOWN                 USER DEFINED
Curator B   AP-1                    USER DEFINED
Curator B   AP-1                    USER DEFINED
Curator A   c-Rel/p65 heterodimer   USER DEFINED
Curator A   UNKNOWN                 USER DEFINED
Curator A   UNKNOWN                 USER DEFINED
Example case 2: TF annotation variance
PMID 1718972
Curator     TF name          TF source
Curator A   Tcf1             NCBI
Curator B   Tcf1             NCBI
Curator B   C\EBP family     USER DEFINED
Curator B   C\EBP and NF-1   USER DEFINED
Curator B   Tcf1             NCBI
Curator B   UNKNOWN          USER DEFINED
Curator B   UNKNOWN          USER DEFINED
Curator B   UNKNOWN          USER DEFINED
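The difference in how much each curator extracted for 1718972 can be quantified with a multiset comparison. The sketch below copies the rows listed above into Python lists and uses `collections.Counter`; the variable names are chosen here for illustration. Treating each (TF name, source) pair as one countable unit is an assumption of this sketch, since the slides do not say how records were matched.

```python
from collections import Counter

# (TF name, source) pairs as listed for 1718972; raw strings keep the backslash.
curator_a = [("Tcf1", "NCBI")]
curator_b = [
    ("Tcf1", "NCBI"),
    (r"C\EBP family", "USER DEFINED"),
    (r"C\EBP and NF-1", "USER DEFINED"),
    ("Tcf1", "NCBI"),
    ("UNKNOWN", "USER DEFINED"),
    ("UNKNOWN", "USER DEFINED"),
    ("UNKNOWN", "USER DEFINED"),
]

count_a = Counter(curator_a)
count_b = Counter(curator_b)

shared = count_a & count_b   # multiset intersection: minimum count per pair
only_b = count_b - count_a   # annotations of B that have no counterpart in A
n_shared = sum(shared.values())

print(len(curator_a), len(curator_b), n_shared)  # 1 7 1
print(sum(only_b.values()))                      # 6
```

Under this counting, everything Curator A annotated is also found by Curator B, so the disagreement is one of recall (1 versus 7 annotations) and not of conflicting TF assignments.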
Example case: difference in evidence types
PMID 10674400
Curator A   REGULATORY REGION
Curator B   TRANSCRIPTION FACTOR BINDING SITE
Curator B   TRANSCRIPTION FACTOR BINDING SITE
PMID 2071609
Curator A   REGULATORY REGION
Curator B   TRANSCRIPTION FACTOR BINDING SITE
Curator B   TRANSCRIPTION FACTOR BINDING SITE
PMID 3038906
Curator A
TRANSCRIPTION FACTOR BINDING SITE Col1a2 UNKNOWN
TCCAAACTTGGCAAGGGCGAGA
CLASS: OREGEC00001 TYPE: OREGET00003 SUBTYPE: OREGES00015
CLASS: OREGEC00001 TYPE: OREGET00001 SUBTYPE: OREGES00003
Curator B
1->TRANSCRIPTION FACTOR BINDING SITE Col1a2 Nfia
TTCCAAACTTGGCAAGGGCGAGAGAGGGCGA
CLASS: OREGEC00001 TYPE: OREGET00003 SUBTYPE: OREGES00033
CLASS: OREGEC00001 TYPE: OREGET00003 SUBTYPE: OREGES00015
CLASS: OREGEC00002 TYPE: OREGET00001 SUBTYPE: OREGES00003
CLASS: OREGEC00002 TYPE: OREGET00001 SUBTYPE: OREGES00003
Different amount of annotation extracted
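The two records for 3038906 can be compared field by field. The sketch below copies the sequences and evidence codes shown above, checks whether Curator A's sequence is contained in Curator B's, and computes the overlap of the evidence annotations. Representing each evidence line as a (class, type, subtype) tuple and collapsing repeated lines into a set are assumptions of this sketch, not the procedure used in the case study.

```python
# Binding-site sequences as recorded by each curator.
seq_a = "TCCAAACTTGGCAAGGGCGAGA"
seq_b = "TTCCAAACTTGGCAAGGGCGAGAGAGGGCGA"

# str.find returns the 0-based start of the first match, or -1 if absent.
offset = seq_b.find(seq_a)
print(len(seq_a), len(seq_b), offset)  # 22 31 1

# Evidence annotations as (class, type, subtype); duplicates collapse in a set.
evidence_a = {
    ("OREGEC00001", "OREGET00003", "OREGES00015"),
    ("OREGEC00001", "OREGET00001", "OREGES00003"),
}
evidence_b = {
    ("OREGEC00001", "OREGET00003", "OREGES00033"),
    ("OREGEC00001", "OREGET00003", "OREGES00015"),
    ("OREGEC00002", "OREGET00001", "OREGES00003"),
}

shared = evidence_a & evidence_b
jaccard_evidence = len(shared) / len(evidence_a | evidence_b)
print(sorted(shared))
print(jaccard_evidence)  # 1 shared out of 4 distinct triples -> 0.25

# Coarser view: ignore the evidence class and compare (type, subtype) only.
types_a = {(t, s) for _, t, s in evidence_a}
types_b = {(t, s) for _, t, s in evidence_b}
print(types_a <= types_b)  # True: A's type/subtype pairs are a subset of B's
```

Curator A's sequence sits inside Curator B's longer one (starting at the second base), and at the type/subtype level A's evidence is fully covered by B's. The strict triple comparison scores lower because the two curators assigned different evidence classes to the same type and subtype.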
REGCREATIVE CASE STUDY: JAMBOREE
Intensive annotation strategy: face to face with other curators and expert annotators
Get direct feedback and provide suggestions
Promote integration of additional aspects in the annotation structure as well as annotated information types
Populate the database with new annotation records
Explore efficient curation training strategies
Create Gold Standard collection of annotation records, maybe useful to allow example-based annotation training/evaluation
Explore demands of biologists / curators on the text mining community -> where it would be useful
REGCREATIVE CASE STUDY: POST-JAMBOREE
Monitor improvements in the annotation consistency
Allow consistent community-based annotation
Promote integration of additional aspects in the annotation structure as well as annotated information types
Increase efficiency in populating the database
Construction of a text mining training collection
ANNOTATION CONSISTENCY
For selection as relevant paper for curation
For the evidence class
For the evidence types
For the evidence subtypes
For the regulated genes
For the transcription factors
For cell types
How to structure comments
Other aspects: ...
TEXT MINING TASKS FOR GENE REGULATION EXTRACTION
Detection of relevant articles: abstracts or full text
Extraction of ranked list of regulated genes: mention or normalized gene (database entries)
Extraction of ranked list of TFs
Extraction of ranked list of evidence type IDs together with name and text passage (sentence)
Extraction of ranked associations between these genes and TF
Extraction of associations to other controlled vocabularies or ontologies