https://bmir-gforge.stanford.edu/gf/project/odie/
NCBO Seminar Series, March 3, 2010
Progress on the ODIE Toolkit
Rebecca Crowley, University of Pittsburgh
Agenda
• Quick overview of project and people
• Progress on project and software
• Demo of ODIE version 1.0
• What’s planned for version 1.1
• Sample of results from research projects
• Manuscripts and collaborations
• Future of the project
Two Tasks ~ One problem
Ontology Enrichment: Uses text as a source of concepts and relationships to enrich and validate the ontology

Information Extraction: Uses ontologies to create structured data from unstructured clinical data
Specific Aims

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
• Named Entity Recognition
• Co-reference Resolution
• Discourse Reasoning and Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
• Concept Discovery
• Concept Clustering
• Taxonomic Positioning
Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.
Domain
• Will attempt to develop general tools whenever possible
• Priorities for evaluation of components:
– Radiology and pathology reports
– NCIT as well as clinically relevant OBO ontologies (e.g. RadLex)
– Cancer domains (including hematologic oncology)
Product Goals

• Toolkit for developers of NLP applications and ontologies
• Support interaction and experimentation
• Foster the cycle of enrichment and extraction needed to advance development of NLP systems
• Package systems at the conclusion of working with ODIE (not yet)
• Ontology enrichment as opposed to de novo development
• Human-machine collaboration as opposed to fully automated learning
Users/Workflow
ODIE is intended for:
• users who want to use NCBO ontologies to perform various NLP tasks (+/- may need to add concepts locally to achieve sufficient performance)
• users who want to enrich ontologies using concepts derived from documents (very early in the process of ontology development)
People and Organization
• Ontology Enrichment: Study and compare methods for ontology enrichment; design methods for evaluation
• Coreference Resolution: Develop annotation scheme; create reference standard; consider and test existing algorithms; design, implement, and test new algorithms
• Software: Develop and implement architecture and UI; create framework for using results of research; implement work of research groups
Progress since last NCBO Talk
• ODIE 1.0 released with UIMA pipeline (12/14/09)
• Now releasing on NCBO GForge site (2X/yr)
• Y2 ODIE face-to-face held in Pittsburgh with participation of all three groups
• 3 submitted manuscripts, 3 more in preparation
• Nearing release of ODIE 1.1 (expected 3/10)
What’s new in ODIE 1.0?
• Load and use any ontology from NCBO BioPortal or a .owl/.obo file
• Run any UIMA pipeline on a set of documents
– Can install a PEAR or work directly with an Analysis Engine Descriptor
– Visualize annotated documents
– No configuration support in this version; users must configure the pipeline using the descriptor file
– If a pipeline uses certain ODIE configuration parameters, additional UI features are exposed
• Two new statistical methods for enrichment
• Analyze up to 250 documents at a time (with 1.5GB RAM)
ODIE v1.0 System Requirements
• Recommended system:
– Windows or Linux OS
– Intel Core 2 Duo @ 1.5GHz+ or equivalent
– 1.5GB RAM
– 1GB disk space; additional space required as you add more ontologies
– Internet access for connecting to BioPortal
System Architecture
ODIE Download/Info
GForge Site: https://bmir-gforge.stanford.edu/gf/project/odie/
User Forums: https://bmir-gforge.stanford.edu/gf/project/odie/forum/
ODIE on NCBO Tools Page: http://bioontology.org/ODIE
ODIE Installer: http://caties.cabig.upmc.edu/ODIE/odieinstaller.exe
User Manual: http://caties.cabig.upmc.edu/ODIE/odiev1_0manual.doc
ODIE 1.0 Demo
http://www.odie.pitt.edu
Uncovered NPs and known NEs fit precompiled hyponymy pattern
Lexical Syntactic Pattern (LSP)
LSP Implementation
• Used GATE Gazetteer and Java Annotations Processing Engine (JAPE)
• Year One work ported to the new UIMA environment
• Provided UIMA wrappers for GATE Processing Resources
• GATE annotations flow generically in and out of the UIMA CAS
• Patterns taken from the literature, including Hearst, Snow, and Charniak
• New patterns derived from hand inspection of clinical corpora by Kaihong Liu
Terms always appear together
Mutual Information Measure (Church)
Mutual Information (Church) Implementation
• Term pairs scored as I(x,y) = log2( (f(x,y,w) * N) / (f(x) * f(y)) )
• f(x,y,w) is the frequency at which the two terms co-occur within a window of w tokens; we used window size 4
• Pairs must have a frequency of at least 3
• I(x,y) is normalized to the range [0.0, 1.0] and suggestions are presented in descending order
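As an illustration, this scoring can be sketched in a few lines of Python. This is a hypothetical re-implementation of the formula above, not ODIE's actual code; the function and variable names are invented for the example.

```python
import math
from collections import Counter

def mutual_information(tokens, window=4, min_freq=3):
    """Church-style mutual information over co-occurrence windows.

    I(x, y) = log2( f(x, y, w) * N / (f(x) * f(y)) ), where f(x, y, w)
    is the frequency with which x and y co-occur within a window of w
    tokens. Pairs below min_freq are discarded, as described above.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            if x != y:
                pair[(x, y)] += 1

    scores = {}
    for (x, y), f_xy in pair.items():
        if f_xy >= min_freq:
            scores[(x, y)] = math.log2(f_xy * n / (unigram[x] * unigram[y]))

    if not scores:
        return []
    # Normalize to [0, 1] and present in descending order.
    hi, lo = max(scores.values()), min(scores.values())
    span = (hi - lo) or 1.0
    return sorted(((p, (s - lo) / span) for p, s in scores.items()),
                  key=lambda item: item[1], reverse=True)
```

Ranking normalized scores (rather than raw log values) matches the slide's presentation of suggestions between 0.0 and 1.0 in descending order.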
Words are used in similar contexts
Similarity Measure (Lin)
Similarity Measure (LIN) Implementation
• Based on the Minipar broad-coverage parser, which provides word-level triples such as (copd, s, involves)
• Used the Minipar executable wrapped with the GATE 5.0 distribution
• First calculate mutual information for each triple across the corpus as I(w1,r,w2) = log( ( ||w1,r,w2|| × ||*,r,*|| ) / ( ||w1,r,*|| × ||*,r,w2|| ) )
• Define T(w) as the set of pairs (r,w') such that I(w,r,w') is positive
• Compute the similarity of two nouns w1 and w2 as the quotient of the summed I(w,r,w') over T(w1) ∩ T(w2) and the summed I(w,r,w') over T(w1) and T(w2) individually
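These steps can be sketched in Python, assuming the corpus has already been reduced to (word, relation, word) triples. This is an illustrative reconstruction of the formulas on this slide (with the mutual-information quotient taken in log space, so "positive" is meaningful), not the actual ODIE/Minipar code; all names are invented for the example.

```python
import math
from collections import Counter

def lin_similarity(triples, w1, w2):
    """Dependency-triple similarity in the style of Lin.

    triples: iterable of (word, relation, word') tuples as produced by a
    dependency parser, e.g. ("copd", "s", "involves").
    """
    # ||w,r,w'||-style counts over the corpus of triples.
    full = Counter(triples)
    left = Counter((w, r) for w, r, _ in triples)
    right = Counter((r, wp) for _, r, wp in triples)
    rel = Counter(r for _, r, _ in triples)

    def info(w, r, wp):
        # I(w,r,w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )
        num = full[(w, r, wp)] * rel[r]
        den = left[(w, r)] * right[(r, wp)]
        return math.log2(num / den) if num and den else 0.0

    def features(w):
        # T(w): pairs (r, w') with positive mutual information.
        return {(r, wp) for ww, r, wp in full if ww == w and info(w, r, wp) > 0}

    t1, t2 = features(w1), features(w2)
    shared = sum(info(w1, r, wp) + info(w2, r, wp) for r, wp in t1 & t2)
    total = (sum(info(w1, r, wp) for r, wp in t1) +
             sum(info(w2, r, wp) for r, wp in t2))
    return shared / total if total else 0.0
```

Words that occur in identical informative contexts score 1.0; words with no shared contexts score 0.0.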
All Techniques
• Only uncovered noun-phrase terminology that has a method-scored relationship with a known named entity is elevated to an ODIE Suggestion
• All methods need the cTAKES chunker for noun-phrase discovery and the ODIE IndexFinder NER for named-entity discovery
• NP and NE annotations are shared across all methods
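The elevation rule in the first bullet can be expressed as a small filter. This is a schematic sketch with invented names, not ODIE's implementation:

```python
def elevate_suggestions(noun_phrases, named_entities, scored_pairs):
    """Elevate uncovered NPs to suggestions (hypothetical names).

    An NP becomes a suggestion only if it is not itself a known named
    entity and some enrichment method (LSP, mutual information, or the
    Lin similarity measure) scored it against a known NE.

    scored_pairs: {(np, ne): score} produced by the enrichment methods.
    """
    known = set(named_entities)
    suggestions = {}
    for (np, ne), score in scored_pairs.items():
        if np not in known and ne in known and np in noun_phrases:
            # Keep the best score seen for this uncovered NP.
            suggestions[np] = max(score, suggestions.get(np, score))
    return sorted(suggestions.items(), key=lambda kv: kv[1], reverse=True)
```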
Next Release
• ODIE v1.1 will be released in late March
• New features:
– All OE methods will include multi-word terms
– Co-reference visualization
– Additional charts and statistics for NER analyses
– Easier installation with zero configuration
– Ontology placement for new concepts
– Exporting proposal ontologies as OWL or CSV files
Coref Visualization (simple)
Coref Visualization (advanced)
Research Project 1:Ontology Enrichment
• Survey of OE methods
• Evaluation of utility of LSP
• Methodology to study OE utility
• Evaluation of statistical methods
Concept Discovery
Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell
Study and compare methods for ontology enrichment; design methods for evaluation
Task | Primary Method | Secondary Method | Authors

Synonym and concept extraction
  Linguistic
    – Component noun information: Hamon [69]
    – Lexico-syntactic patterns (LSP): Downey [70]
    – LSP + component noun information: Moldovan, Girju [72]
  Statistical
    – Church [49]; Smadja [50]; Grefenstette [73]; Hindle [53]
    – Clustering: Geffet, Dagan [78]; Agirre [55]; Faatz, Steinmetz [79]
    – Hidden Markov Model (HMM): Collier [59]; Bikel [80]; Morgan [60]
    – Support Vector Machine (SVM): Shen [61]; Kazama [62]; Yamamoto [63]

Taxonomic relationship extraction
  Linguistic
    – LSP: Hearst [44]; Caraballo [82]; Cederberg, Widdows [85]; Fiszman [89]; Snow [92]; Riloff [93]
    – Component noun information: Velardi [97]; Cimiano [98]; Rinaldi [99]; Morin [100]; Bodenreider [101]; Ryu [102]
  Statistical
    – Clustering: Alfonseca, Manandhar [65]
    – Machine learning: Witschel [104]

Non-taxonomic relationship extraction
  Linguistic
    – LSP: Berland [106]; Sundblad [107]; Girju [108]; Nenadić, Ananiadou [109]
  Statistical
    – Co-occurring information: Kavalec [110]
    – Association rule mining: Gulla [56]; Chefi [57]; Bodenreider [58]

Ontology generation (combining all tasks)
  Statistical
    – Dependency triples: Lin [52]
    – Nearest neighbor clustering: Blaschke, Valencia [115]

From Liu and Crowley, submitted 2/09
Review of methods – Linguistic
• Lexico-syntactic pattern (LSP) matching
– Assumption: syntactic regularities within a specialized corpus can indicate a particular semantic relationship between two nouns
– Hearst first explored this method for hypernym discovery
– Example: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA
– Trigger phrases: “such as”, “including”, “especially”, “other”
Hearst, M. A. (1992). "Automatic acquisition of hyponyms from large text corpora." Proc. of ACL
LSP Patterns
The presence of certain “lexico-syntactic patterns” can indicate a particular semantic relationship between two nouns.

Example:
DIFFERENTIAL DIAGNOSIS INCLUDES, BUT IS NOT LIMITED TO, SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN (SUCH AS SCHWANNOMA) AND SPINDLE CELL MALIGNANT MELANOMA

“Such as” indicates a hyponym relationship between two noun phrases.
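For illustration, a naive regex version of the “such as” / “including” trigger is sketched below. The real implementation matches over chunked noun phrases (via GATE gazetteers and JAPE rules), so this hypothetical matcher over-captures: without NP chunking, the hypernym group absorbs the preceding context.

```python
import re

# Hypothetical minimal matcher for hyponymy LSPs over upper-case report text.
# Without NP chunking, the "hypernym" group also captures preceding words.
HYPONYM_LSP = re.compile(
    r"(?P<hypernym>[A-Z][A-Z ]+?),?\s+(?:SUCH AS|INCLUDING)\s+(?P<hyponym>[A-Z][A-Z ]+)"
)

def match_hyponyms(sentence):
    """Return (hypernym context, hyponym) for the first LSP hit, else None."""
    m = HYPONYM_LSP.search(sentence)
    if not m:
        return None
    return m.group("hypernym").strip(), m.group("hyponym").strip()
```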
Evaluation of ontology suggestions
• How many terms extracted by LSP are medically meaningful terms (MMTs)?
• How many of the extracted MMTs are not in the ontology and can therefore be new concept candidates?
• How many of the relationships between MMTs are not in the ontology and can therefore be added to the ontology?

Two-step process for judging the extraction output:
Step 1: Domain expert annotation
Step 2: Ontology curator judging
Step 1: Domain Expert Annotation

Input: two sets of data (one for pathologists, one for radiologists), each a list of sentences that contain LSPs, e.g.:

PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)
.........

Annotation task:
1. Mark medically meaningful terms (MMTs) that can stand alone before and after the LSP
2. The terms before and after the LSP must be related
Example
Input sentence: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA.
(term that precedes the LSP, the LSP, term that follows the LSP)

Output: list of paired terms

Term1: BENIGN ECCRINE NEOPLASIA
Term2: NODULAR HIDROADENOMA
….. …….

Calculate: total # of MMTs, # of MMTs per LSP
Step 2: Ontologist Judging

The domain expert annotation output (the list of paired terms, e.g. Term1 = BENIGN ECCRINE NEOPLASIA, Term2 = NODULAR HIDROADENOMA) is passed to the ontology curators.

For each term:
1. Is the concept in the ontology?
2. If not, should it be added to the ontology?
3. If not, what is the reason?

For each pair of terms:
1. What is the relationship between them?
2. Does this relationship exist in the ontology?
3. If not, should it be added to the ontology?
4. If not, what is the reason?
Evaluation metrics
Concept suggestion rate (CSR) = (MMTs that were not in the ontology) / (MMTs extracted by the LSP method)

Concept acceptance rate (CAR) = (MMTs that should be added to the ontology) / (MMTs extracted by the LSP method)

Concept relationship suggestion rate (CRSR) = (relationships that were not in the ontology) / (relationships extracted by the LSP method)

Concept relationship acceptance rate (CRAR) = (relationships that should be added to the ontology) / (relationships extracted by the LSP method)
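Computing the four rates is simple division; as a sketch (illustrative function and argument names), using the pathology “NP such as” counts reported in the results tables later in the talk (52 suggested and 43 accepted of 140 MMTs; 36 suggested and 17 accepted of 65 relationships):

```python
def enrichment_metrics(mmts_extracted, mmts_not_in_ontology, mmts_accepted,
                       rels_extracted, rels_not_in_ontology, rels_accepted):
    """Rates from the definitions above; argument names are illustrative."""
    return {
        "CSR": mmts_not_in_ontology / mmts_extracted,
        "CAR": mmts_accepted / mmts_extracted,
        "CRSR": rels_not_in_ontology / rels_extracted,
        "CRAR": rels_accepted / rels_extracted,
    }

# Pathology "NP such as" row: 52 of 140 MMTs suggested, 43 accepted;
# 36 of 65 relationships suggested, 17 accepted.
rates = enrichment_metrics(140, 52, 43, 65, 36, 17)
```

This reproduces the 37% / 31% and 55% / 26% figures reported for that pattern.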
Ontology curators
• NCIT curator: Dr. Nicholas Sioutos
• RadLex curator: Dr. David Channin
Results – LSP Distribution

Number of sentences containing lexico-syntactic patterns. Columns: pathology corpus (852,764 reports; 16,157,608 sentences) and radiology corpus (209,997 reports; 4,057,228 sentences), each as # sentences / unique sentences.

Pattern             Pathology          Radiology
NP especially NP    14 / 11            19 / 10
NP also called NP   48 / 37            29 / 22
NP such as NP       98 / 95            906 / 251
NP's NP             202 / 45           5 / 2
NP in NP            4851 / 1689        106 / 47
NP aka NP           5396 / 460         2 / 2
NP including NP     6291 / 4952        1403 / 747
NP other NP         6940 / 2251        10622 / 1407
NP like NP          7649 / 2267        410 / 235
NP, NP              8211 / 5351        7385 / 3889
NP of NP            14275 / 4032       2906 / 607
NP in the NP        47124 / 23178      64044 / 29285
NP is NP            92374 / 25024      7349 / 2896
NP of the NP        246798 / 70735     173016 / 54895
Results - MMT yield using LSP method
Cells: # of MMTs / # of instances of LSP (%).

                          Pathology Reports                    Radiology Reports
LSP                       Preceding LSP    Following LSP       Preceding LSP    Following LSP
NP such as NP1, NP2       52/50 (104%)     94/50 (188%)        50/50 (100%)     95/50 (190%)
NP including NP1, NP2     49/50 (98%)      81/50 (162%)        50/50 (100%)     86/50 (172%)
NP other NP1, NP2         50/50 (100%)     53/50 (106%)        43/43 (100%)     43/43 (100%)
NP also called NP1, NP2   35/37 (95%)      36/37 (97%)         NA               NA
NP aka NP1, NP2           47/50 (96%)      59/50 (118%)        NA               NA
NP in NP1                 50/50 (100%)     50/50 (100%)        47/50 (94%)      39/50 (76%)
NP of NP1                 50/50 (100%)     50/50 (100%)        40/50 (80%)      34/50 (68%)
Total                     333/337 (99%)    423/337 (126%)      230/243 (95%)    296/243 (122%)

Overall: 1 to 2 MMTs per LSP instance.
Results – Ontology Concept Suggestion Rate and Ontology Concept Acceptance Rate
LSPs
Pathology Reports Radiology Reports
Suggestion Rate Acceptance Rate Suggestion Rate Acceptance Rate
NP such as NP1, NP2 37% (52/140) 31% (43/140) 52% (75/145) 10% (14/145)
NP including NP1, NP2 32% (61/189) 32% (60/189) 39% (54/138) 14% (19/138)
NP other NP1, NP2 16% (18/113) 16% (18/113) 33% (28/86) 8% (7/86)
NP also called NP1, NP2 14% (10/74) 10% (7/74) NA NA
NP aka NP1, NP2 31% (37/119) 31% (37/119) NA NA
NP in NP1 12% (12/100) 6% (6/100) 18% (13/74) 8% (6/74)
NP of NP1 11% (11/98) 6% (6/98) 26% (21/80) 14% (11/80)
Average 24% (201/833) 21% (177/833) 37% (191/523) 11% (57/523)
Results – Ontology Concept Relationship Suggestion Rate and Acceptance Rate
Pathology Reports (Enrich NCIT) Radiology Reports (Enrich RADLex)
LSPs Suggestion Rate Acceptance Rate Suggestion Rate Acceptance Rate
NP such as NP1, NP2 55% (36/65) 26% (17/65) 94% (34/36) 94% (34/36)
NP including NP1, NP2 78% (89/114) 15% (17/114) 61% (11/18) 39% (7/18)
NP other NP1, NP2 51% (31/61) 8% (5/61) 57% (12/21) 57% (12/21)
NP also called NP1, NP2 29% (12/41) 10% (4/41) NA NA
NP aka NP1, NP2 73% (43/59) 24% (14/59) NA NA
NP in NP1 84% (38/45) 0% (0/45) 50% (13/26) 12% (3/26)
NP of NP1 64% (28/44) 5% (2/44) 0% (0/27) 0% (0/27)
Average 64% (277/429) 14% (59/429) 55% (70/128) 44% (56/128)
Results – Relationship Distribution
Semantic relationships, by corpus and LSP:
LSP | Hyponym | Synonym | Meronym | Other | None
Pathology Reports
NP such as NP1, NP2 37% (24/65) 0% 2% (1/65) 57% (37/65) 5% (3/65)
NP including NP1, NP2 10% (11/114) 1% (1/114) 6% (7/114) 78% (89/114) 5% (6/114)
NP other NP1, NP2 39% (24/61) 0% (0/61) 2% (1/61) 46% (28/61) 8% (5/61)
NP also called NP1, NP2 22% (9/41) 20% (8/41) 0% 37% (15/41) 10% (4/41)
NP aka NP1, NP2 10% (6/59) 44% (26/59) 0% 39% (23/59) 5% (3/59)
NP in NP1 0% 0% 0% 100% (45/45) 0%
NP of NP1 5% (2/44) 0% 18% (8/44) 61% (27/44) 16% (7/44)
Average 18% (76/429) 8% (35/429) 4% (17/429) 62% (264/429) 7% (28/429)
Radiology Reports
NP such as NP1, NP2 72% (26/36) 0% 0% 28% (10/36) 0%
NP including NP1, NP2 39% (7/18) 0% 11% (2/18) 33% (6/18) 17% (3/18)
NP other NP1, NP2 76% (16/21) 0% 0% 0% 24% (5/21)
NP in NP1 4% (1/26) 0% 12% (3/26) 42% (11/26) 42% (11/26)
NP of NP1 4% (1/27) 0% 0% 0% 96% (26/27)
Average 40% (51/128) 0% 4% (5/128) 21% (27/128) 35% (45/128)
Research Project 2:Coreference Resolution
• Annotation schema development and implementation in Knowtator
• Detailed guidelines document
• Annotated corpus (~100K tokens; double annotations and consensus)
• Prototype released as part of ODIE
• First manuscript submitted, second one underway
• Anticipate public release of corpus and guidelines
Coreference Resolution
Wendy Chapman, Guergana Savova, Melissa Castine
Develop annotation scheme; create Reference Standard, consider and test existing algorithms; design, implement & test new algorithms
Manuscripts
Submitted:
• Liu K, Crowley RS. Natural Language Processing Methods and Systems for Biomedical Ontology Learning (review). Submitted to JBI.
• Liu K, Chapman WW, Savova GK, Chute C, Sioutos N, Crowley RS. Effectiveness of Lexico-Syntactic Pattern Matching for Ontology Enrichment with Clinical Documents. Submitted to MIM.
• Savova GK, Chapman WW, Zheng J. Anaphoric relations in the clinical narrative: corpus creation. Submitted to JAMIA.

Planned (next 3 months):
• Chavan G, Mitchell K, Liu K, Savova GK, Chapman WW, Chute C, Crowley RS. ODIE – A workbench for cyclic entity recognition and ontology enrichment. Planned for AMIA 2010 submission.
Future of the project
• Continued releases for the remainder of the grant
• New coreference algorithms
• Additional OE algorithms and modifications
• Better integration with BioPortal
• Planning to apply for competitive renewal (Dec ’10)