6/29/05 1
New Frontiers in Corpus Annotation Workshop, 6/29/05
Ann Bies – Linguistic Data Consortium*
Seth Kulick – Institute for Research in Cognitive Science*
Mark Mandel – Linguistic Data Consortium*
*University of Pennsylvania
Parallel Entity and Treebank Annotation
6/29/05 2
Mining the Bibliome: Information Extraction from the Biomedical Literature
• NSF ITR grant EIA-0205448
• Collaboration with Division of Oncology, Children’s Hospital of Philadelphia
• PubMed abstracts – mining cancer literature for associations that link variations in genes with malignancies
• http://bioie.ldc.upenn.edu – release 0.9 available: 1157 abstracts entity annotated, 318 also treebanked
6/29/05 3
Outline
• Entity Annotation
• Treebank Annotation
  • Modifications from Penn Treebank guidelines
• Annotation Process and Merged Format
• Entity-Constituent Mapping – How successful?
6/29/05 4
Entity Annotation
• Gene X with genomic Variation event Y is correlated with Malignancy Z
• Gene – composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA
• (Malignancy – under development, not included in release 0.9)
• Variation Event – Relation between entities representing different aspects of a variation
6/29/05 5
Entity Annotation - Variations
• Variation – A relation between variation component entities
• “a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution”
• Var-type – substitution
• Var-location – codon 249
• Var-state-orig – serine
• Var-state-altered – cysteine
6/29/05 6
A Change in Tokenization
• Tokenization – Many hyphenated words treated as separate tokens
• “New York-based”
• Old (Penn Treebank) tokenization: [New] [York-based]
• New tokenization: [New][York][-][based]
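The tokenization change can be sketched in a few lines of Python. This is only an illustration, not the project's tokenizer: the function name `tokenize` is hypothetical, and it splits out every hyphen, whereas the slide says only that many hyphenated words are split.

```python
import re


def tokenize(text):
    """Split on whitespace, then make each hyphen its own token.

    Assumption: every hyphen is split out; the real guidelines
    apply this to many, but not necessarily all, hyphenated words.
    """
    tokens = []
    for word in text.split():
        # A capturing group makes re.split keep the hyphens in the result.
        tokens.extend(piece for piece in re.split(r"(-)", word) if piece)
    return tokens


print(tokenize("New York-based"))
print(tokenize("K- and N-ras"))
```

Splitting at the hyphen is what lets an entity such as "New York" or "ras" end exactly on a token boundary.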
6/29/05 7
Discontinuous Entities
• E.g.: “K- and N-ras”
• Tokenization: [K][-][and][N][-][ras]
• Entity annotation:
• [K][-]… [ras] – “chain” of discontinuous tokens
• [N][-][ras] – Contiguous tokens
• Splitting up not always done, depends on coordination
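One way to picture a chain is as a list of token indices that is allowed to skip positions. The sketch below is a hypothetical representation (the release itself uses standoff character offsets), and the helper names `is_chain` and `entity_text` are made up for illustration.

```python
tokens = ["K", "-", "and", "N", "-", "ras"]

# Each entity is the list of token indices it covers (illustrative only).
k_ras = [0, 1, 5]   # [K][-]...[ras], discontinuous
n_ras = [3, 4, 5]   # [N][-][ras], contiguous


def is_chain(token_indices):
    """True if the entity skips at least one token."""
    pairs = zip(token_indices, token_indices[1:])
    return any(later - earlier != 1 for earlier, later in pairs)


def entity_text(tokens, token_indices):
    """Join the covered tokens, ignoring whatever lies between them."""
    return " ".join(tokens[i] for i in token_indices)


print(is_chain(k_ras), entity_text(tokens, k_ras))
print(is_chain(n_ras), entity_text(tokens, n_ras))
```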
6/29/05 8
Treebank Annotation
• Default NP right-branching structure
• (NP (JJ primary) (NN liver) (NN cancer))
• Simplifies multi-token nominal annotation
• Allows recovery of implicit constituents:
• (NP (JJ primary) (newnode (NN liver) (NN cancer)))
• Entities sometimes map to such implicit constituents
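Under the right-branching default, the implicit constituents of a flat NP are its proper suffixes of two or more children. A minimal sketch of that recovery, assuming children are given as (tag, word) pairs and using a hypothetical function name:

```python
def implicit_constituents(children):
    """Return the proper suffixes (length >= 2) of a flat child list.

    With right-branching as the default, each such suffix is a
    constituent that the flat annotation leaves implicit.
    """
    return [children[i:] for i in range(1, len(children) - 1)]


flat_np = [("JJ", "primary"), ("NN", "liver"), ("NN", "cancer")]
for suffix in implicit_constituents(flat_np):
    print(" ".join(word for _, word in suffix))
```

For the slide's example the only implicit constituent is "liver cancer", which is exactly the span that `newnode` makes explicit.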
6/29/05 9
Treebank Annotation
• Exceptions to right-branching marked by NML
• So: Any two or more non-final elements that form a constituent are an NML
• (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
• (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
• (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))
6/29/05 10
Treebank Annotation
• Placeholder *P* for distributed material in coordinated nominal structures
• “K- and N-ras”
(NP (NP (NN K) (HYPH -) (NML-1 (-NONE- *P*))) (CC and) (NP (NN N) (HYPH -) (NML-1 (NN ras))))
6/29/05 11
Treebank Annotation
• To the left or right
• “codon 12 or 13”
(NP (NP (NML-1 (NN codon)) (CD 12)) (CC or) (NP (NML-1 (-NONE- *P*)) (CD 13)))
6/29/05 12
First Release
• Goal – let users choose how to handle the integration of entity and treebank levels
• Standoff annotation for entity and treebank
• Identical tokenization
• Merged representation
• Penn Treebank style
• (POSTag:[from..to] terminal)
• Entity listing before each tree.
6/29/05 13
Merged Output Example
sentence 4 Span:331..605
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"
6/29/05 14
Merged Output Example
[…] ((VP (VBD:[356..364] screened) (PP-CLR (IN:[365..368] for) (NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations))) […]
6/29/05 15
Merged Output Example
((VP (VBD:[356..364] screened) (PP-CLR (IN:[365..368] for) (NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations)))
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type: "point mutations"
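The merged format shown in these examples is regular enough to read with two regular expressions. The sketch below is inferred only from the lines on these slides, with one entity per line; the patterns and function names are assumptions, not a specification of the release format.

```python
import re

# ;[start..end]:entity-type:"text"  (optional space after the last colon)
ENTITY_RE = re.compile(r'^;\[(\d+)\.\.(\d+)\]:([\w-]+):\s*"(.*)"$')
# (POSTag:[from..to] terminal)
TERMINAL_RE = re.compile(r"\(([^\s():]+):\[(\d+)\.\.(\d+)\]\s+([^\s()]+)\)")


def parse_entity(line):
    """Return (start, end, type, text), or None for non-entity lines."""
    match = ENTITY_RE.match(line.strip())
    if match is None:
        return None
    start, end, entity_type, text = match.groups()
    return int(start), int(end), entity_type, text


def parse_terminals(tree_text):
    """Return (tag, start, end, word) for every terminal in a tree string."""
    return [(tag, int(start), int(end), word)
            for tag, start, end, word in TERMINAL_RE.findall(tree_text)]


print(parse_entity(';[379..385]:variation-location:"exon 2"'))
print(parse_terminals("(NML (NN:[379..383] exon) (CD:[384..385] 2))"))
```

Because entities and terminals share the same character offsets, the two lists can be compared directly without re-aligning tokens.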
6/29/05 16
Entity-Constituent Mapping : Exact Match
• Exact Match: A node in the tree yields exactly the entity:
;[379..385]:variation-location:"exon 2"
(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations)))
6/29/05 17
Entity-Constituent Mapping : Missing Node
• Missing Node – Possible to add a node to yield exactly the entity
;[386..401]:variation-type: "point mutations"
(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (NN:[386..391] point) (NNS:[392..401] mutations)))
6/29/05 18
Entity-Constituent Mapping : Missing Node
• Done for internal research purposes, not in release (implicit constituents)
• NML already in release (explicit constituents)
(NP (DT:[369..372] the) (NN:[373..378] K-ras) (NML (NN:[379..383] exon) (CD:[384..385] 2)) (newnode (NN:[386..391] point) (NNS:[392..401] mutations))))
6/29/05 19
Entity-Constituent Mapping : Crossing
• Crossing: Cuts across constituent boundaries, so cannot even add a node yielding the entity
• Typical case: entity containing text corresponding to a prepositional phrase
One ER showed a G-to-T mutation in the second position of codon 12
[1280..1307]: variation-location: “second position of codon 12”
6/29/05 20
Entity-Constituent Mapping : Crossing
• Crossing - Determiner in NP but not in entity.
• Could relax matching, or modify entity or treebank annotation; neither was done.
(NP (NP (DT:[1276..1279] the) (JJ:[1280..1286] second) (NN:[1287..1295] position)) (PP (IN:[1296..1298] of) (NP (NN:[1299..1304] codon) (CD:[1305..1307] 12)))))
[1280..1307]: variation-location: “second position of codon 12”
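The three outcomes (exact match, missing node, crossing) can be sketched as a span comparison over the tree. This is an illustrative reading of the definitions on these slides, not the authors' code: trees are written by hand as nested tuples, and "missing node" is taken to mean that the entity covers a contiguous run of sibling nodes.

```python
def span(node):
    """(start, end) character span of a node."""
    if isinstance(node[1], int):           # leaf: (tag, start, end)
        return node[1], node[2]
    children = node[1]                     # internal: (label, [children])
    return span(children[0])[0], span(children[-1])[1]


def classify(root, entity):
    """Return 'exact', 'missing', or 'crossing' for an entity span."""
    result = "crossing"
    stack = [root]
    while stack:
        node = stack.pop()
        if span(node) == entity:
            return "exact"
        if isinstance(node[1], int):
            continue
        children = node[1]
        starts = {span(child)[0] for child in children}
        ends = {span(child)[1] for child in children}
        # Entity begins and ends on sibling boundaries: a node could be added.
        if entity[0] in starts and entity[1] in ends:
            result = "missing"
        stack.extend(children)
    return result


k_ras_np = ("NP", [("DT", 369, 372), ("NN", 373, 378),
                   ("NML", [("NN", 379, 383), ("CD", 384, 385)]),
                   ("NN", 386, 391), ("NNS", 392, 401)])
codon_np = ("NP", [("NP", [("DT", 1276, 1279), ("JJ", 1280, 1286),
                           ("NN", 1287, 1295)]),
                   ("PP", [("IN", 1296, 1298),
                           ("NP", [("NN", 1299, 1304), ("CD", 1305, 1307)])])])

print(classify(k_ras_np, (379, 385)))     # "exon 2"
print(classify(k_ras_np, (386, 401)))     # "point mutations"
print(classify(codon_np, (1280, 1307)))   # "second position of codon 12"
```

The last case is crossing because the entity starts inside the first NP (after the determiner) but ends inside the PP, so no single added node could cover it.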
6/29/05 21
Entity-Constituent Mapping – Chain Exact Match
• “codon 12 or 13”
• Entities: “codon 12”, “codon..13”
(NP (NP (NML-1 (NN codon)) (CD 12)) (CC or) (NP (NML-1 (-NONE- *P*)) (CD 13)))
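One plausible way to see why the chain “codon..13” counts as an exact match is to resolve each *P* placeholder to its coindexed NML and then read off the words under a node. The sketch below assumes that reading; the tuple encoding and the function names are hypothetical.

```python
import re


def words(node):
    """Overt words under a node; -NONE- placeholders contribute nothing."""
    label, body = node
    if isinstance(body, str):
        return [] if label == "-NONE-" else [body]
    return [word for child in body for word in words(child)]


def antecedents(node, table=None):
    """Map each coindexed label (e.g. 'NML-1') to its overt words."""
    table = {} if table is None else table
    label, body = node
    if isinstance(body, str):
        return table
    overt = words(node)
    if re.search(r"-\d+$", label) and overt:
        table[label] = overt
    for child in body:
        antecedents(child, table)
    return table


def resolved_words(node, table):
    """Words under a node, with empty coindexed nodes filled in."""
    label, body = node
    if isinstance(body, str):
        return [] if label == "-NONE-" else [body]
    if label in table and not words(node):
        return list(table[label])
    return [word for child in body for word in resolved_words(child, table)]


tree = ("NP", [
    ("NP", [("NML-1", [("NN", "codon")]), ("CD", "12")]),
    ("CC", "or"),
    ("NP", [("NML-1", [("-NONE-", "*P*")]), ("CD", "13")]),
])
table = antecedents(tree)
first, _, second = tree[1]
print(resolved_words(first, table))
print(resolved_words(second, table))
```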
6/29/05 22
Entity-Constituent Mapping – Chain Not an Exact Match
• “specific codons (12, 13, and 61)”
• Entities: “codons..12”, “codons..13”, “codons..61”
(NP (JJ specific) (NNS codons) (PRN (-LRB- -LRB-) (NP (NP (CD 12)) (, ,) (NP (CD 13)) (, ,) (CC and) (NP (CD 61))) (-RRB- -RRB-)))
6/29/05 23
Multiple Token Entities (Non-Chained)
Entity Type          Total  Exact Match  Missing Node  Crossing
Gene-generic             6            4             1         1
Gene-protein           349          236           103        10
Gene-RNA               156          115            35         6
Var-location           445          348            68        29
Var-state-orig           5            3             1         1
Var-state-altered       10            8             0         2
Var-type               271          123           142         6
Total                 1242          837           350        55 (4.4%)
6/29/05 24
Multiple Token Entities (Chained)
Entity Type          Total  Exact Match  Not Exact Match
Gene-generic             0            0                0
Gene-protein             6            4                2
Gene-RNA                36           29                7
Var-location           125          103               22
Var-state-orig           0            0                0
Var-state-altered        0            0                0
Var-type                 1            0                1
Total                  168          136               32 (19%)
6/29/05 25
Conclusion
• Annotation of entities and treebank done together
• Identical tokenization for entities and trees, with standoff annotation
• Allows flexibility in use of integrated annotation
• Only 6.2% of the multi-token entities (87 of 1410) cannot be mapped to an implicit or explicit constituent node
• Changes in Treebank guidelines
• Use of Relations for potentially large entities
• Next: Relation annotation and integrated taggers
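The percentages quoted in the tables and the conclusion can be rechecked from the table totals. The short calculation below assumes that the 6.2% figure pools the crossing cases from the non-chained table with the non-exact cases from the chained table; the variable names are illustrative.

```python
# Totals copied from the two "Multiple Token Entities" tables.
non_chained = {"total": 1242, "exact": 837, "missing": 350, "crossing": 55}
chained = {"total": 168, "exact": 136, "not_exact": 32}

crossing_rate = non_chained["crossing"] / non_chained["total"]
chain_miss_rate = chained["not_exact"] / chained["total"]

# Assumption: "cannot be mapped" = crossing + chains without an exact match.
unmapped = non_chained["crossing"] + chained["not_exact"]
all_multi_token = non_chained["total"] + chained["total"]
overall_rate = unmapped / all_multi_token

print(f"non-chained crossing: {crossing_rate:.1%}")
print(f"chained not exact:    {chain_miss_rate:.1%}")
print(f"overall unmapped:     {overall_rate:.1%} ({unmapped} of {all_multi_token})")
```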
6/29/05 26
References
• Ryan’s tagger
• Dan’s parser
• Web page again
6/29/05 27
Entity Annotation - Variations
• “(S249C)”
• Var-type – none
• Var-location – 249
• Var-state-orig – S
• Var-state-altered – C
• Gene-{RNA,generic,protein} disambiguates gene metonymy
• Var-{type,location,state-orig,state-altered} are different kinds of entities
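The “(S249C)” breakdown can be mimicked with a small pattern, purely to illustrate how the components line up. The corpus was annotated by hand; the regular expression and the function name `parse_shorthand` are assumptions that cover only this letter-number-letter shape.

```python
import re

# Original state, position, altered state, with optional parentheses.
SHORTHAND_RE = re.compile(r"^\(?([A-Z])(\d+)([A-Z])\)?$")


def parse_shorthand(mention):
    """Split a mention like '(S249C)' into variation components."""
    match = SHORTHAND_RE.match(mention)
    if match is None:
        return None
    original, location, altered = match.groups()
    return {
        "var-type": None,               # no type word in the shorthand
        "var-location": location,
        "var-state-orig": original,
        "var-state-altered": altered,
    }


print(parse_shorthand("(S249C)"))
```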
6/29/05 28
Entities
Entity Type         Single Tokens  Non-chains  Chains
Gene-generic                  104           6       0
Gene-protein                  921         349       6
Gene-RNA                     1987         156      36
Var-location                   95         445     125
Var-state-orig                151           5       0
Var-state-altered             162          10       0
Var-type                      235         271       1
(Non-chains and Chains are multiple-token entities.)
6/29/05 29
Introduction
• Corpus for biomedical IE with several levels of annotation:
  • Entity
  • Syntactic Structure (Treebank)
  • Relations (McDonald et al., ACL 2005)
• Ideal - entities mapped to treebank constituents
• Allow users to choose how to integrate the levels
6/29/05 30
Annotation Process
• Tokenization → Entity → POS → Treebanking → Merged Representation
• Minimal requirement: identical tokenization for entity and treebank annotation
• Did not require an entity/constituent correspondence – but how did it work out?