GENIA-GR: a Grammatical Relation Corpus for Parser
Evaluation in the Biomedical Domain
Yuka Tateisi (1), Yusuke Miyao (2), Kenji Sagae (2), Jun'ichi Tsujii (2,3)
(1) Department of Informatics, Kogakuin University
(2) Graduate School of Information Science and Technology, University of Tokyo
(3) School of Computer Science, University of Manchester / National Centre for Text Mining
Background
Information extraction (IE) via parsing:
- Sentences
    Interleukin 1 beta inhibits insulin secretion
    IL-1beta is known to inhibit insulin secretion
    Insulin secretion is inhibited by IL-1 beta
- Parsing yields a predicate-argument structure
    [Verb] Inhibit  [Subject] IL-1beta  [Object] insulin secretion
- Rules map the predicate-argument structure to an event
    inhibit: [Event]=[Verb]  [Agent]=[Subject]  [Target]=[Object]
- IE output
    [Event] Inhibition  [Agent] IL-1beta  [Target] insulin secretion
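The rule on the Background slide ([Event]=[Verb], [Agent]=[Subject], [Target]=[Object]) can be sketched as a small lookup. This is only an illustration: the dict layout and the names `RULES` and `pas_to_event` are assumptions made for this example, not part of the system described in the slides.

```python
# Hypothetical rule table: verb lemma -> (event type, grammatical role -> event role).
RULES = {
    "inhibit": ("Inhibition", {"subject": "Agent", "object": "Target"}),
}

def pas_to_event(pas):
    """Map a predicate-argument structure to an event frame, or None if no rule applies."""
    rule = RULES.get(pas["verb"])
    if rule is None:
        return None
    event_type, role_map = rule
    event = {"Event": event_type}
    for gr_role, event_role in role_map.items():
        event[event_role] = pas[gr_role]
    return event

# All three surface sentences on the slide share this predicate-argument structure.
pas = {"verb": "inhibit", "subject": "IL-1beta", "object": "insulin secretion"}
print(pas_to_event(pas))
```

Because the active, raising and passive variants all normalize to the same structure, a single rule per verb is enough in this sketch.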
Constituency vs Dependency
- Dependency structure is becoming more common
- Predicate-argument relations are easier to access from dependency structure than from constituency structure
[Figure: "I love you" shown as a constituency tree (S, VP) and as a dependency structure (subject: I, object: you)]
Parser Evaluation
- Need to know the performance of parsers
  - Cross-formalism evaluation (PTB-based, TAG, HPSG, etc.)
  - Cross-domain evaluation (newswire text, etc.)
- Previous work used Stanford Dependencies (Clegg et al. 2007, Pyysalo et al. 2007)
  - Automatic translation from a treebank-based corpus
  - May contain errors introduced by the conversion process
  - Good for comparison between treebank-based parsers, but may be unreliable for evaluating parsers based on other formalisms
Our Approach
- Manual annotation of dependency structure
- Grammatical Relations (GR: Carroll et al. 1998)
  - A gold standard corpus exists for the Wall Street Journal section of the Penn Treebank
  - Has been used for cross-framework evaluation in the newswire domain
- Annotation on GENIA text
  - Other information (POS, tree, terms, event structure) is available
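The slides do not state which metric is used when parsers are scored against a GR gold standard. As a minimal sketch, assuming each GR is compared as an exact-match tuple and scored with precision, recall and F1, the comparison could look like this (the function name `prf` and the tuple layout are hypothetical):

```python
def prf(gold, predicted):
    """Precision, recall and F1 over two collections of GR tuples (exact match)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data: the parser gets the subject right but mislabels the object.
gold = {("ncsubj", "love", "I"), ("dobj", "love", "you")}
pred = {("ncsubj", "love", "I"), ("iobj", "love", "you")}
print(prf(gold, pred))
```

Because the gold tuples are formalism-neutral, the same function can score any parser whose output can be mapped to GR tuples.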
GENIA-GR
- Base text
  - 50-abstract subset of GENIA
  - MeSH term: NF kappa B
  - Full text in PubMed Central
  - Never used for training with the GENIA Treebank
- Annotation scheme
  - Follows Briscoe’s scheme*
  - Not annotated inside technical terms
* An introduction to tag sequence grammars and the RASP system parser (Technical report, University of Cambridge)
Grammatical Relations
- Shows dependency structure
- Head-dependent pairs with types of dependency
I love you
(ncsubj love I _)   I is the subject of love
(dobj love you)     you is the object of love
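The GR notation above is a parenthesized tuple: relation type, head, dependent, and for some relations a trailing slot that is "_" when unspecified. A minimal sketch of reading such tuples, assuming whitespace-separated slots (the names `GR`, `parse_gr` and `arguments` are hypothetical):

```python
from typing import NamedTuple, Optional

class GR(NamedTuple):
    rel: str              # relation type, e.g. "ncsubj" or "dobj"
    head: str
    dependent: str
    extra: Optional[str]  # trailing slot; None when absent or written as "_"

def parse_gr(text):
    """Parse one GR such as "(ncsubj love I _)"."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"not a parenthesized GR: {text!r}")
    fields = text[1:-1].split()
    if len(fields) < 3:
        raise ValueError(f"expected relation, head and dependent: {text!r}")
    extra = fields[3] if len(fields) > 3 and fields[3] != "_" else None
    return GR(fields[0], fields[1], fields[2], extra)

def arguments(grs, head):
    """Collect the dependents of one head, keyed by relation type."""
    return {g.rel: g.dependent for g in grs if g.head == head}

grs = [parse_gr("(ncsubj love I _)"), parse_gr("(dobj love you)")]
print(arguments(grs, "love"))
```

Reading the arguments of a verb is then a filter over head-dependent pairs, with no tree traversal involved.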
Technical Terms (Named Entities)
- Technical terms := terms tagged in the GENIA term corpus
- Multi-word terms are “declared” with
  - IDs taken from the GENIA term corpus
  - Position in the sentence
- Only IDs are referred to in the dependency structure
Example
sentence(
  id(96099434.1)
  named_entities(
    term( id(T1) span(1:3)(Protein kinase C-zeta))
    term( id(T2) span(5:7)(NF-kappa B activation))
    term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes))
  sentence_form(Protein kinase C-zeta mediates NF-kappa B activation in human immunodeficiency virus-infected monocytes . )
  rasp(
    (ncsubj mediates:4 *T1* _)
    (iobj mediates:4 in:8)
    (dobj mediates:4 *T2*)
    (dobj in:8 *T4*)))
Parts of the record:
- id: sentence ID
- named_entities: multi-word terms
- sentence_form: sentence form
- rasp: dependency structure
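To make the record layout concrete, here is a minimal sketch that reads the example record with regular expressions, checks each declared term against its span, and resolves the `*T1*`-style references in the dependency structure. It is written against this one example only; the names `parse_record`, `resolve` and `span_text` are hypothetical, and the reading of spans as 1-based, inclusive token positions is an assumption inferred from the example.

```python
import re

# Record text copied from the example slide.
RECORD = """sentence( id(96099434.1) named_entities(
 term( id(T1) span(1:3)(Protein kinase C-zeta))
 term( id(T2) span(5:7)(NF-kappa B activation))
 term( id(T4) span(9:12)(human immunodeficiency virus-infected monocytes))
 sentence_form(Protein kinase C-zeta mediates NF-kappa B activation in human immunodeficiency virus-infected monocytes . )
 rasp( (ncsubj mediates:4 *T1* _) (iobj mediates:4 in:8)
 (dobj mediates:4 *T2*) (dobj in:8 *T4*)))"""

TERM_RE = re.compile(r"term\(\s*id\((\w+)\)\s*span\((\d+):(\d+)\)\(([^()]*)\)\)")
FORM_RE = re.compile(r"sentence_form\(([^()]*)\)")
GR_RE = re.compile(r"\((\w+)((?:\s+[^\s()]+)+)\)")

def parse_record(record):
    """Return (terms, tokens, grs) for one sentence record."""
    terms = {m.group(1): (int(m.group(2)), int(m.group(3)), m.group(4))
             for m in TERM_RE.finditer(record)}
    tokens = FORM_RE.search(record).group(1).split()
    rasp_part = record.split("rasp(", 1)[1]  # GRs only occur after "rasp("
    grs = [(m.group(1), *m.group(2).split()) for m in GR_RE.finditer(rasp_part)]
    return terms, tokens, grs

def span_text(tokens, start, end):
    """Assumed convention: span(1:3) covers tokens 1 to 3, counting from 1."""
    return " ".join(tokens[start - 1:end])

def resolve(slot, terms):
    """Replace a *Tn* reference by the term text; drop the :position suffix of a word."""
    if len(slot) > 2 and slot[0] == slot[-1] == "*":
        return terms[slot[1:-1]][2]
    return slot.rsplit(":", 1)[0]

terms, tokens, grs = parse_record(RECORD)
for term_id, (start, end, text) in terms.items():
    assert span_text(tokens, start, end) == text  # declaration matches the sentence
for rel, *slots in grs:
    print(rel, [resolve(s, terms) for s in slots])
```

The sketch only handles terms and sentences that contain no parentheses of their own, which holds for this example but would not hold for tokens such as CD4(+).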
Annotation Procedure
- Manual correction of RASP parser output
- Annotation by one annotator
- Procedure
  1. Initial annotation
  2. Error correction, identification of problems
  3. Refinement of scheme, further correction
  4. Annotation by other annotator(s), inter-annotator agreement evaluation
  5. Error correction, re-refinement of scheme, etc.
  6. Publication
Annotation Results
Corpus                              No. of sentences   No. of dependencies
GENIA-GR                            492                10029
PTB-GR (Carroll and Briscoe 2006)   560                10906
Distribution of Dependency Types
Relation type   N(GENIA)   F(GENIA)   N(PTB)   F(PTB)   F(GENIA)/F(PTB)
dobj            2181       4.43       1762     3.15     1.41
ncmod           2163       4.40       3548     6.34     0.69
det             990        2.01       1115     1.99     1.01
conj            975        1.98       591      1.06     1.88
iobj            962        1.96       545      0.97     2.01
ncsubj          946        1.92       1351     2.41     0.80
passive         330        0.67       228      0.41     1.65
xcomp           294        0.60       380      0.68     0.88
xmod            294        0.60       178      0.32     1.88
Other           894        1.82       1208     2.16     0.84
TOTAL           10029                 10906
N(x): occurrences in corpus x; F(x): frequency per sentence in corpus x
GENIA: GENIA (492 sentences); PTB: PTB-WSJ (560 sentences)
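The F columns and the final ratio follow directly from the counts and the sentence totals given in the legend. A short sketch that recomputes them from the table (the variable and function names are illustrative):

```python
# Sentence totals and per-relation counts (GENIA, PTB) taken from the table.
SENTENCES = {"GENIA": 492, "PTB": 560}
COUNTS = {
    "dobj": (2181, 1762), "ncmod": (2163, 3548), "det": (990, 1115),
    "conj": (975, 591), "iobj": (962, 545), "ncsubj": (946, 1351),
    "passive": (330, 228), "xcomp": (294, 380), "xmod": (294, 178),
    "Other": (894, 1208),
}

def frequencies(rel):
    """Return (F_GENIA, F_PTB, F_GENIA / F_PTB) for one relation type."""
    n_genia, n_ptb = COUNTS[rel]
    f_genia = n_genia / SENTENCES["GENIA"]
    f_ptb = n_ptb / SENTENCES["PTB"]
    return f_genia, f_ptb, f_genia / f_ptb

for rel in COUNTS:
    f_genia, f_ptb, ratio = frequencies(rel)
    print(f"{rel:8s} {f_genia:5.2f} {f_ptb:5.2f} {ratio:5.2f}")
```

A ratio above 1 means the relation is more frequent per sentence in GENIA than in PTB-WSJ; iobj, conj, xmod and passive stand out on that measure.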
Typical Construction in Biomedical Abstract (?)
A similar potentiation of TNF effects was observed in Jurkat T cells and HeLa cells treated with soluble Tat protein .
Problems found in the initial annotation process
- Domain-specific structures
- Coordination
- Prepositional arguments/modifiers
- Inconsistency with other GENIA corpora
Genes
residues 296 to 302
[Figure: dependency diagram over "residues", "296" and "to 302"; arc labels as extracted: ncmod(ta), ncmod, dobj]
Binding (participle as a prenominal modifier)
New York-based firm
NF kappa B binding domain
[Figure: current annotation. Two ncmod arcs link "New York", "based" and "firm"; likewise, two ncmod arcs link "NF kappa B", "binding" and "domain"]
Binding (participle as a prenominal modifier)
NF kappa B binding domain
But perhaps it is better to do it this way:
[Figure: "NF kappa B binding domain" with arcs labeled dobj, ncsubj and xmod]
NF kappa B binding activity
[Figure: "NF kappa B binding activity" with arcs labeled dobj and ncmod(ta)]
References in the text
M.Nugeyre , F.Barre-Sinoussi , and N.Israel , J.Virol.72 : 5852-5861 , 1998
Where is the head?
[Figure: dependency diagram over "M.Nugeyre", "F.Barre-Sinoussi", "N.Israel", "and", "J.Virol.72", "5852-5861" and "1998", with three arcs labeled conj and three labeled ta]
Math expressions
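2 x NFKappaB > or = SIVmac239 approximately deltaNFkappaB
What’s the dependency type?
[Figure: dependency diagram over "2 x NFKappaB", ">", "or", "=", "≈", "SIVmac239" and "deltaNFkappaB", with two arcs labeled conj, three labeled arg and one labeled xmod]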
References and Math expressions
- Not frequent but problematic
- May occur more frequently in full text
- References may be treated like terms
- Math expressions can work as S or VP, e.g. T_n < T_{n+1} if n > 0.
Coordination
CD4(+) and CD8(+) T lymphocytes
[Figure: candidate analyses of the coordination.
 (a) Arcs labeled conj, conj and ncmod over "CD4(+)", "and", "CD8(+)" and "T lymphocytes".
 (b) An analysis with an elided element (ellip): arcs labeled ncmod, ncmod, conj and conj over "CD4(+)", ellip, "and", "CD8(+)" and "T lymphocytes"; marked "?" with the note "no mechanism to coindex".]
PP argument/modifier
- iobj (argument) or ncmod (modifier)?
- Reference to external resources
  - GENIA event corpus
  - PASBio (Wattarujeekrit et al. 2008)
A similar potentiation of TNF effects was observed in Jurkat T cells and HeLa cells treated with soluble Tat protein .
Inconsistencies with other GENIA corpora
- Dependency inside a token: renal cell carcinoma-derived gangliosides
  - Mostly involving hyphens
  - Many inside technical terms
- Dependency on part of a technical term: more mature cell lines
Conclusions
- The 50-abstract GENIA subset with dependency annotation is in the error-correction phase
- Work to be done
  - Refinement of the scheme
  - Evaluation
  - (Enhancement to full text)