
Hong Cui

University of Arizona

Phenotype RCN Feb 25-27, 2013

SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION


Agenda

• CharaParser
  • Methodology
  • Evaluation
  • Applications

• CharaParser for Phenoscape
  • New modules
  • Evaluations
  • Challenges


“Fine-Grained Semantic Mark-up”

• To annotate factual information from textual morphological descriptions of biodiversity in such a detailed manner that the machine readable annotation itself provides information equivalent to the original text.


An Example


Previous Research

• Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004)
• Interactive extraction (Diederich, Fortuner & Milton, 1999)
• Semi-supervised bootstrapping for lexicons (Riloff, 1999)
• Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008)
• Ontology-driven and parallel text (Woods et al., 2004)
• Supervised association rule learning (Cui & Heidorn, 2007)


General-Purpose Parsers?


CharaParser Approach

1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically
  • No need to prepare training examples
  • 50%-80% of terms learned

2. General-purpose syntactic parser (e.g., Stanford Parser) to parse the syntactic structure of sentences
  • No need to create a special-purpose, domain-dependent parser
  • Learned lexicon from step 1 is used to adapt the parser for biodiversity domains

3. Intuitive rules to produce annotations from parse trees.
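
The three steps can be pictured with a toy sketch. It is not CharaParser's code: the lexicon here is hand-written rather than learned, a trivial lexicon lookup stands in for the syntactic parser, and the element names (`statement`, `structure`, `character`) and the `annotate` function are assumptions made only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Step 1 stand-in: a tiny hand-written lexicon (assumed, not learned).
ORGANS = {"petals", "roots"}
STATES = {"yellow": "color", "white": "color", "brown": "color"}


def annotate(sentence):
    """Attach each known state to the most recent known organ."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    statement = ET.Element("statement")
    current = None  # organ that following states are attached to
    for tok in tokens:
        if tok in ORGANS:
            current = ET.SubElement(statement, "structure", name=tok)
        elif tok in STATES and current is not None:
            ET.SubElement(current, "character", name=STATES[tok], value=tok)
    return statement


if __name__ == "__main__":
    print(ET.tostring(annotate("Petals yellow or white"), encoding="unicode"))
```

A real parse tree would let rules handle modifiers, ranges, and attachment, which this flat left-to-right rule ignores.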


Unsupervised lexicon learning

If it is known that “roots” is an organ:

• Roots yellow to medium brown or black, thin.
• Petals yellow or white
• Petals absent;
• Subtending bracts absent;
• Abaxial hastula absent;
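
One way to read the example: knowing that “roots” is an organ suggests that “yellow” is a state, which suggests that “petals” is an organ, which suggests that “absent” is a state, and so on. The sketch below is a heavily simplified illustration of that chain, assuming a two-rule adjacency heuristic; `bootstrap_lexicon` is a hypothetical name and this is not the algorithm CharaParser actually uses.

```python
import re


def bootstrap_lexicon(sentences, seed_organs):
    """Grow organ and state sets from seed organs using adjacent word pairs."""
    organs = {w.lower() for w in seed_organs}
    states = set()
    changed = True
    while changed:  # repeat until a full pass learns nothing new
        changed = False
        for sentence in sentences:
            words = re.findall(r"[a-z]+", sentence.lower())
            for left, right in zip(words, words[1:]):
                left_known = left in organs or left in states
                right_known = right in organs or right in states
                if left in organs and not right_known:
                    # Unknown word right after a known organ: guess a state.
                    states.add(right)
                    changed = True
                elif right in states and not left_known:
                    # Unknown word right before a known state: guess an organ.
                    organs.add(left)
                    changed = True
    return organs, states


if __name__ == "__main__":
    descriptions = [
        "Roots yellow to medium brown or black, thin.",
        "Petals yellow or white",
        "Petals absent;",
        "Subtending bracts absent;",
        "Abaxial hastula absent;",
    ]
    print(bootstrap_lexicon(descriptions, {"roots"}))
```

On the five example sentences this toy heuristic picks up “petals”, “bracts”, and “hastula” as organs and “yellow” and “absent” as states; terms such as “thin” or “white” are never reached, which is consistent with only part of the vocabulary being learned automatically.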


CharaParser: Term Reviewer


Ontology Term Organizer


Compared against a Heuristics-Based Method

• Parser performance evaluated on the same data sets.
• CharaParser: unsupervised learning + Stanford Parser
• Heuristics-based: unsupervised learning + regular expression rules


Annotation Problems

• Chunk errors:
  • Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm, pliant;

• Attachment errors:
  • on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of dusky white or pale yellow, plumose bristles 5–6 mm.

• Semantics:
  • straight posterolateral bounding ridges to subtriangular, bilobed ventral muscle field;


Applications at Various Development Stages

• Convert XML markup to
  • SDD for identification key generation
  • Character matrices for tree of life
  • RDF for the Semantic Web and search

• Use marked-up descriptions to support search
  • FNA Experimental Search
    • Data source is RDF triples
  • Allow character-based search
    • Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
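
A character-based query like the one above can be pictured as a filter over subject-predicate-object triples. The sketch below uses plain Python tuples instead of a real RDF store; the taxa, predicate names, and values are all invented for illustration and do not come from FNA.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); every value is invented.
TRIPLES = [
    ("taxon_a", "flower_color", "yellow"),
    ("taxon_a", "elevation_m", (100, 500)),
    ("taxon_a", "flowering_month", "April"),
    ("taxon_a", "distribution", "North Carolina"),
    ("taxon_b", "flower_color", "yellow"),
    ("taxon_b", "elevation_m", (900, 1500)),
    ("taxon_b", "flowering_month", "April"),
    ("taxon_b", "distribution", "North Carolina"),
    ("taxon_c", "flower_color", "white"),
    ("taxon_c", "elevation_m", (200, 400)),
    ("taxon_c", "flowering_month", "April"),
    ("taxon_c", "distribution", "North Carolina"),
]


def index_by_subject(triples):
    """Group objects by subject and predicate."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        index[subject][predicate].append(obj)
    return index


def ranges_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def search(triples, color, elevation, month, region):
    """Return taxa whose triples satisfy all four character constraints."""
    hits = []
    for taxon, props in index_by_subject(triples).items():
        if (color in props["flower_color"]
                and any(ranges_overlap(r, elevation) for r in props["elevation_m"])
                and month in props["flowering_month"]
                and region in props["distribution"]):
            hits.append(taxon)
    return sorted(hits)


if __name__ == "__main__":
    print(search(TRIPLES, "yellow", (200, 400), "April", "North Carolina"))
```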


To-Dos

• Tighter integration of ontologies in the annotation process.
  • Currently internal glossaries are used in place of ontologies to link a character state (e.g., “red”) to a character (“color”)
  • Synonyms are not controlled
    • “Petiolate” = “with petiole”

• Continue to reduce annotation errors
• Accommodate various syntactic styles
  • Diagnosis paragraphs
  • Comparison among different taxa
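
The glossary lookup and the synonym problem can be sketched as two small dictionaries. The entries, the category assigned to “petiolate”, and the function name `character_for` are assumptions for illustration; they are not CharaParser's glossary.

```python
# Hypothetical glossary: character state -> character.
GLOSSARY = {
    "red": "color",
    "yellow": "color",
    "lanceolate": "shape",
    "petiolate": "architecture",  # assumed category, for illustration only
}

# Hypothetical synonym table: variant phrasing -> preferred state term.
SYNONYMS = {"with petiole": "petiolate"}


def character_for(state):
    """Return (preferred_state, character), or (state, None) if unknown."""
    normalized = " ".join(state.lower().split())
    preferred = SYNONYMS.get(normalized, normalized)
    return preferred, GLOSSARY.get(preferred)


if __name__ == "__main__":
    for phrase in ("red", "With petiole", "crinkly"):
        print(phrase, "->", character_for(phrase))
```

Without the synonym table, “with petiole” and “petiolate” would be recorded as two unrelated states, which is the uncontrolled-synonym problem listed above.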


Phenotype Curation

• Convert character and character state information from natural language descriptions to EQ statements
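
An EQ (Entity-Quality) statement pairs an anatomical entity with a quality. A minimal sketch of the two forms, a raw EQ holding phrases from the text and a final EQ with ontology identifiers attached, might look as follows. The class, the lookup tables, the `EX:` identifiers, and the example phrase are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQ:
    entity: str                       # entity phrase from the description
    quality: str                      # quality phrase from the description
    entity_id: Optional[str] = None   # ontology ID, if a match was found
    quality_id: Optional[str] = None  # ontology ID, if a match was found


# Hypothetical lookup tables; the EX: identifiers are placeholders.
ENTITY_IDS = {"dorsal fin": "EX:0000001"}
QUALITY_IDS = {"absent": "EX:0000002"}


def ontologize(raw):
    """Turn a raw EQ into a final EQ by attaching IDs where a match exists."""
    return EQ(
        raw.entity,
        raw.quality,
        ENTITY_IDS.get(raw.entity),
        QUALITY_IDS.get(raw.quality),
    )


if __name__ == "__main__":
    print(ontologize(EQ("dorsal fin", "absent")))
    print(ontologize(EQ("dorsal fin", "crinkly")))  # quality has no match
```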


Curator Mental Process

Read description

Identify key phrases (raw EQ)

ontologized EQ

ontologies


Adapted CharaParser

Character Description

State Descriptions

CharaParser

XML to Raw EQs

Raw EQs to Final EQs

Ontologies


Evaluations

• Internal evaluation:
  • The development corpus (three publications on fishes and archosaurs) provided 1,200 character descriptions; 100 of them were included in the internal evaluation benchmark.
  • Raw EQ performance: 90%
  • Final EQ performance: 50%

• BioCreative 2012 evaluation:
  • 50 descriptions independently selected by the organizer (>50% of Qs were not in ontologies)
  • Gold standard created by the chief Phenoscape curator (raw and final)
  • Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser)
  • Raw EQ performance: CharaParser better than biocurators
  • Final EQ performance: biocurators better than CharaParser
  • Inter-curator agreements:


Inter-Curator Agreements

                  Precision   Recall
Curator 1 vs 2    39          49
Curator 1 vs 3    47          56
Curator 2 vs 3    77          71
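
The slides do not say how these figures were computed. A common way to get pairwise precision and recall is to treat one curator's EQs as the candidate set and the other's as the reference, as in this sketch; the function name and the toy EQs are assumptions.

```python
def agreement(candidate, reference):
    """Precision and recall of candidate EQs measured against reference EQs."""
    candidate, reference = set(candidate), set(reference)
    shared = candidate & reference
    precision = len(shared) / len(candidate) if candidate else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented (entity, quality) pairs for two curators.
    curator_x = {("fin", "absent"), ("scale", "round"),
                 ("tooth", "conical"), ("rib", "long")}
    curator_y = {("fin", "absent"), ("scale", "round"), ("jaw", "short")}
    print(agreement(curator_x, curator_y))
```

Under this definition the two numbers differ whenever the curators produce different numbers of EQs, which is one possible reason precision and recall are not equal in a pairwise comparison.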


Error Analyses

• Various fixable syntactic problems
  • E.g., “digits I-III”

• Curation granularity
  • CharaParser generated more candidate EQs than curators
  • “Preopercular latero-sensory canal leaves preopercle at first exit and enters a plate: yes/no”

• Annotating relations (relational quality)
  • “contact between …”


Ontology Access

• Currently use keyword-based search
  • Class labels and exact, narrower, and related synonyms

• False positives
  • acute (shape) =? acute (process)
  • "margin" is a broad synonym of "marginal zone of embryo" in UBERON

• Pre-composed terms in ontology
  • “ceratobranchial 5 tooth”, “rib of vertebra 5”, “body of humerus”

• Ambiguous term use in descriptions
  • ‘epibranchial 1’ => epibranchial 1 element? bone? cartilage?

• No matching
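
A keyword-based lookup over labels and synonyms can be sketched as below. The mini-ontology, its `EX:` identifiers, and the `keyword_search` function are hypothetical; only the terms themselves come from the examples above. Treating broad synonyms as a separate, optional scope is one reading of the “margin” example, not a documented CharaParser behavior.

```python
# Hypothetical mini-ontology; IDs and structure are placeholders.
ONTOLOGY = [
    {"id": "EX:1", "label": "acute", "note": "shape", "synonyms": {}},
    {"id": "EX:2", "label": "acute", "note": "process", "synonyms": {}},
    {"id": "EX:3", "label": "marginal zone of embryo",
     "note": "anatomy", "synonyms": {"broad": ["margin"]}},
    {"id": "EX:4", "label": "ceratobranchial 5 tooth",
     "note": "anatomy", "synonyms": {}},
]

# Synonym scopes searched in addition to class labels.
DEFAULT_SCOPES = ("exact", "narrow", "related")


def keyword_search(term, ontology=ONTOLOGY, scopes=DEFAULT_SCOPES):
    """Return IDs of classes whose label or in-scope synonym equals term."""
    term = term.lower()
    hits = []
    for cls in ontology:
        names = [cls["label"]]
        for scope in scopes:
            names.extend(cls["synonyms"].get(scope, []))
        if term in (name.lower() for name in names):
            hits.append(cls["id"])
    return hits


if __name__ == "__main__":
    print(keyword_search("acute"))   # two classes share the label
    print(keyword_search("margin"))  # no hit unless broad synonyms are searched
    print(keyword_search("margin", scopes=DEFAULT_SCOPES + ("broad",)))
```

The “acute” query shows why string matching alone cannot choose between senses, and the “margin” query shows how widening the synonym scope trades missed matches for false positives.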


Exploration of Solutions

• Experimented with
  • Word sense disambiguation:
    • “crinkly” not in PATO
    • Candidate matches: [undulate->1.00000000000002] [obovate->1.00000000000001] [flat->1.00000000000001] [flattened->1] [circinate->0.884697579551583]

• Experimenting with
  • Subsets
    • Specify included classes: e.g., classes related to vertebrates
    • Specify excluded classes: e.g., exclude certain developmental stages

• Ideas to try out:
  • Bootstrapping to narrow down the search space
    • starting from known classes
    • evaluating candidate matches based on their distances to the known classes and other sources of evidence.
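
The distance idea can be sketched as a breadth-first search over the class graph: candidates closer to classes already known to be relevant rank higher. The toy hierarchy below is invented and does not reflect PATO's actual structure; the function names are hypothetical.

```python
from collections import deque

# Invented is-a edges (child, parent); not PATO's real hierarchy.
EDGES = [
    ("undulate", "shape"), ("obovate", "shape"), ("flat", "shape"),
    ("wrinkled", "texture"), ("rugose", "texture"),
    ("shape", "quality"), ("texture", "quality"),
]


def distances_from(known, edges):
    """Shortest edge count from any known class to every reachable class."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {k: 0 for k in known if k in neighbors}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def rank_candidates(candidates, known, edges=EDGES):
    """Sort candidates by distance to the known classes, then by name."""
    dist = distances_from(known, edges)
    scored = [(c, dist.get(c, float("inf"))) for c in candidates]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    print(rank_candidates(["undulate", "wrinkled", "flat"], known={"rugose"}))
```

In practice this distance would be only one signal, to be combined with the other sources of evidence mentioned above.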


Annotation consistency

• Instructions given to human curators are helpful to CharaParser
  • Restricted relation list:
    • http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions


Feed more info to EQ generation module

Character Description

State Descriptions

CharaParser

XML to EQs

Raw EQs to Final EQs

Ontologies


Recent Improvements

• Explorer of Taxon Concepts project
  • Making it a pure-Java program/web-based application
    • Currently requires MySQL + Perl
  • Making it faster
    • Optimization of the program
    • Removing MySQL and reducing I/O
    • “Parallel” computing using Java threads

• Preliminary evaluation shows
  • 20 times faster: 2 sec/taxon description
  • Memory requirements increased threefold
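
The threading idea, processing independent taxon descriptions concurrently, can be illustrated with a thread pool. The slides describe Java threads; this Python sketch only shows the pattern, with `annotate_description` as a hypothetical stand-in for the real per-description work, and it makes no claim about speed.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_description(text):
    """Hypothetical stand-in for annotating one taxon description."""
    return {"text": text, "n_words": len(text.split())}


def annotate_all(descriptions, workers=4):
    """Annotate descriptions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate_description, descriptions))


if __name__ == "__main__":
    for result in annotate_all(["Petals yellow or white", "Roots thin"]):
        print(result)
```

Keeping many descriptions in flight at once is also one plausible source of the higher memory use noted above.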


Acknowledgements

• Fine-Grained Semantic Markup Project (current and past)
  • James Macklin: Agriculture and Agri-Food Canada
  • Robert (Bob) Morris, Alex Dusenbery: UMass-Boston
  • Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona

• Phenoscape Project
  • Chris Mungall: Lawrence Berkeley National Lab
  • Melissa Haendel: Oregon Health & Science University
  • Paula Mabee, Alex Dececchi: University of South Dakota
  • Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent

• NSF ABI and EF Programs
• The Flora of North America Project

