https://bmir-gforge.stanford.edu/gf/project/odie/
NCBO Seminar Series, March 3, 2010
Progress on the ODIE Toolkit
Rebecca Crowley, University of Pittsburgh
Agenda
• Quick overview of project and people
• Progress on project and software
• Demo of ODIE version 1.0
• What’s planned for version 1.1
• Sample of results from research projects
• Manuscripts and collaborations
• Future of the project
Two Tasks ~ One problem
Ontology Enrichment: Uses text as a source of concepts and relationships to enrich and validate the ontology

Information Extraction: Uses ontologies to create structured data from unstructured clinical data
Specific Aims

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
• Named Entity Recognition
• Co-reference Resolution
• Discourse Reasoning and Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
• Concept Discovery
• Concept Clustering
• Taxonomic Positioning
Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.
Domain
• Will attempt to develop general tools whenever possible
• Priorities for evaluation of components:
– Radiology and pathology reports
– NCIT as well as clinically relevant OBO ontologies (e.g. RadLex)
– Cancer domains (including hematologic oncology)
Product Goals

• Toolkit for developers of NLP applications and ontologies
• Support interaction and experimentation
• Foster the cycle of enrichment and extraction needed to advance development of NLP systems
• Package systems at the conclusion of working with ODIE (not yet)
• Ontology enrichment as opposed to de novo development
• Human-machine collaboration as opposed to fully automated learning
Users/Workflow
ODIE is intended for:
• users who want to use NCBO ontologies to perform various NLP tasks (+/- may need to add concepts locally to achieve sufficient performance)
• users who want to enrich ontologies using concepts derived from documents (very early in the process of ontology development)
People and Organization
• Ontology Enrichment: Study and compare methods for ontology enrichment; design methods for evaluation
• Coreference Resolution: Develop annotation scheme; create reference standard; consider and test existing algorithms; design, implement, and test new algorithms
• Software: Develop and implement architecture and UI; create framework for using results of research; implement work of research groups
Progress since last NCBO Talk
• ODIE 1.0 released with UIMA pipeline (12/14/09)
• Now releasing on NCBO GForge site (2X/yr)
• Y2 ODIE face-to-face held in Pittsburgh with participation of all three groups
• 3 submitted manuscripts, 3 more in preparation
• Nearing release of ODIE 1.1 (expected 3/10)
What’s new in ODIE 1.0?
• Load and use any ontology from NCBO BioPortal or a .owl/.obo file
• Run any UIMA pipeline on a set of documents
– Can install a PEAR or work directly with an Analysis Engine Descriptor
– Visualize annotated documents
– No configuration support in this version; users must configure the pipeline using the descriptor file
– If a pipeline uses certain ODIE configuration parameters, additional UI features are exposed
• Two new statistical methods for enrichment
• Analyze up to 250 documents at a time (with 1.5GB RAM)
ODIE v1.0 System Requirements
• Recommended system:
– Windows or Linux OS
– Intel Core 2 Duo @ 1.5GHz+ or equivalent
– 1.5GB RAM
– 1GB disk space; additional space required as you add more ontologies
– Internet access for connecting to BioPortal
System Architecture
ODIE Download/Info
GForge Site: https://bmir-gforge.stanford.edu/gf/project/odie/
User Forums: https://bmir-gforge.stanford.edu/gf/project/odie/forum/
ODIE on NCBO Tools Page: http://bioontology.org/ODIE
ODIE Installer: http://caties.cabig.upmc.edu/ODIE/odieinstaller.exe
User Manual: http://caties.cabig.upmc.edu/ODIE/odiev1_0manual.doc
ODIE 1.0 Demo
http://www.odie.pitt.edu
Uncovered NPs and known NEs fit precompiled hyponymy pattern
Lexical Syntactic Pattern (LSP)
LSP Implementation
• Used GATE Gazetteer and Java Annotations Processing Engine (JAPE)
• Year One work ported to the new UIMA environment
• Provided UIMA wrappers for GATE Processing Resources
• GATE annotations flow generically in and out of the UIMA CAS
• Patterns taken from the literature, including Hearst, Snow, and Charniak
• New patterns derived from hand inspection of clinical corpora by Kaihong Liu
Terms always appear together
Mutual Information Measure (Church)
Mutual Information (Church) Implementation
• Term pairs scored as I(x,y) = log2( (f(x,y,w) * N) / (f(x) * f(y)) )
• f(x,y,w) is the frequency at which the two terms co-occur within a window of w tokens; we used window size 4
• Pairs must have a frequency of at least 3
• I(x,y) is normalized to the range [0.0, 1.0] and suggestions are presented in descending order
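As an illustration, this scoring can be sketched in a few lines of Python. This is a hypothetical re-implementation of the formula above, not ODIE's actual code; the function and variable names are invented for the example.

```python
import math
from collections import Counter

def mutual_information(tokens, window=4, min_freq=3):
    """Church-style mutual information over co-occurrence windows.

    I(x, y) = log2( f(x, y, w) * N / (f(x) * f(y)) ), where f(x, y, w)
    is the frequency with which x and y co-occur within a window of w
    tokens. Pairs below min_freq are discarded, as described above.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            if x != y:
                pair[(x, y)] += 1

    scores = {}
    for (x, y), f_xy in pair.items():
        if f_xy >= min_freq:
            scores[(x, y)] = math.log2(f_xy * n / (unigram[x] * unigram[y]))

    if not scores:
        return []
    # Normalize to [0, 1] and present in descending order.
    hi, lo = max(scores.values()), min(scores.values())
    span = (hi - lo) or 1.0
    return sorted(((p, (s - lo) / span) for p, s in scores.items()),
                  key=lambda item: item[1], reverse=True)
```

Ranking normalized scores (rather than raw log values) matches the slide's presentation of suggestions between 0.0 and 1.0 in descending order.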
Words are used in similar contexts
Similarity Measure (Lin)
Similarity Measure (LIN) Implementation
• Based on the Minipar broad-coverage parser, which provides word-level triples such as (copd, s, involves)
• Used the Minipar executable wrapped with the GATE 5.0 distribution
• First calculate mutual information for each triple across the corpus as I(w1,r,w2) = log( ( ||w1,r,w2|| × ||*,r,*|| ) / ( ||w1,r,*|| × ||*,r,w2|| ) )
• Define T(w) as the set of pairs (r,w') such that I(w,r,w') is positive
• Compute the similarity of two nouns w1 and w2 as the quotient of the summed I(w,r,w') over T(w1) ∩ T(w2) and the summed I(w,r,w') over T(w1) and T(w2) individually
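These steps can be sketched in Python, assuming the corpus has already been reduced to (word, relation, word) triples. This is an illustrative reconstruction of the formulas on this slide (with the mutual-information quotient taken in log space, so "positive" is meaningful), not the actual ODIE/Minipar code; all names are invented for the example.

```python
import math
from collections import Counter

def lin_similarity(triples, w1, w2):
    """Dependency-triple similarity in the style of Lin.

    triples: iterable of (word, relation, word') tuples as produced by a
    dependency parser, e.g. ("copd", "s", "involves").
    """
    # ||w,r,w'||-style counts over the corpus of triples.
    full = Counter(triples)
    left = Counter((w, r) for w, r, _ in triples)
    right = Counter((r, wp) for _, r, wp in triples)
    rel = Counter(r for _, r, _ in triples)

    def info(w, r, wp):
        # I(w,r,w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )
        num = full[(w, r, wp)] * rel[r]
        den = left[(w, r)] * right[(r, wp)]
        return math.log2(num / den) if num and den else 0.0

    def features(w):
        # T(w): pairs (r, w') with positive mutual information.
        return {(r, wp) for ww, r, wp in full if ww == w and info(w, r, wp) > 0}

    t1, t2 = features(w1), features(w2)
    shared = sum(info(w1, r, wp) + info(w2, r, wp) for r, wp in t1 & t2)
    total = (sum(info(w1, r, wp) for r, wp in t1) +
             sum(info(w2, r, wp) for r, wp in t2))
    return shared / total if total else 0.0
```

Words that occur in identical informative contexts score 1.0; words with no shared contexts score 0.0.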
All Techniques
• Only uncovered noun-phrase terminology that has a method-scored relationship with a known named entity is elevated to an ODIE Suggestion
• All methods need the cTAKES chunker for noun-phrase discovery and the ODIE IndexFinder NER for named-entity discovery
• NP and NE annotations are shared across all methods
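The elevation rule in the first bullet can be expressed as a small filter. This is a schematic sketch with invented names, not ODIE's implementation:

```python
def elevate_suggestions(noun_phrases, named_entities, scored_pairs):
    """Elevate uncovered NPs to suggestions (hypothetical names).

    An NP becomes a suggestion only if it is not itself a known named
    entity and some enrichment method (LSP, mutual information, or the
    Lin similarity measure) scored it against a known NE.

    scored_pairs: {(np, ne): score} produced by the enrichment methods.
    """
    known = set(named_entities)
    suggestions = {}
    for (np, ne), score in scored_pairs.items():
        if np not in known and ne in known and np in noun_phrases:
            # Keep the best score seen for this uncovered NP.
            suggestions[np] = max(score, suggestions.get(np, score))
    return sorted(suggestions.items(), key=lambda kv: kv[1], reverse=True)
```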
Next Release
• ODIE v1.1 will be released in late March
• New features:
– All OE methods will include multi-word terms
– Co-reference visualization
– Additional charts and statistics for NER analyses
– Easier installation with zero configuration
– Ontology placement for new concepts
– Exporting proposal ontologies as OWL or CSV files
Coref Visualization (simple)
Coref Visualization (advanced)
Research Project 1:Ontology Enrichment
• Survey of OE methods
• Evaluation of utility of LSP
• Methodology to study OE utility
• Evaluation of statistical methods
Concept Discovery
Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell
Study and compare methods for ontology enrichment; design methods for evaluation
Task | Primary Method | Secondary Method | Authors

Synonym and concept extraction
  Linguistic
    – Component noun information: Hamon [69]
    – Lexico-syntactic patterns (LSP): Downey [70]
    – LSP + component noun information: Moldovan, Girju [72]
  Statistical
    – Church [49]; Smadja [50]; Grefenstette [73]; Hindle [53]
    – Clustering: Geffet, Dagan [78]; Agirre [55]; Faatz, Steinmetz [79]
    – Hidden Markov Model (HMM): Collier [59]; Bikel [80]; Morgan [60]
    – Support Vector Machine (SVM): Shen [61]; Kazama [62]; Yamamoto [63]

Taxonomic relationship extraction
  Linguistic
    – LSP: Hearst [44]; Caraballo [82]; Cederberg, Widdows [85]; Fiszman [89]; Snow [92]; Riloff [93]
    – Component noun information: Velardi [97]; Cimiano [98]; Rinaldi [99]; Morin [100]; Bodenreider [101]; Ryu [102]
  Statistical
    – Clustering: Alfonseca, Manandhar [65]
    – Machine learning: Witschel [104]

Non-taxonomic relationship extraction
  Linguistic
    – LSP: Berland [106]; Sundblad [107]; Girju [108]; Nenadić, Ananiadou [109]
  Statistical
    – Co-occurring information: Kavalec [110]
    – Association rule mining: Gulla [56]; Chefi [57]; Bodenreider [58]

Ontology generation (combining all tasks)
  Statistical
    – Dependency triples: Lin [52]
    – Nearest neighbor clustering: Blaschke, Valencia [115]

From Liu and Crowley, submitted 2/09
Review of methods – Linguistic
• Lexico-syntactic pattern (LSP) matching
– Assumption: syntactic regularities within a specialized corpus can indicate a particular semantic relationship between two nouns
– Hearst first explored this method for hypernym discovery
– Example: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA
– Trigger phrases: “such as”, “including”, “especially”, “other”
Hearst, M. A. (1992). "Automatic acquisition of hyponyms from large text corpora." Proc. of ACL
LSP Patterns
The presence of certain “lexico-syntactic patterns” can indicate a particular semantic relationship between two nouns.

Example:
DIFFERENTIAL DIAGNOSIS INCLUDES, BUT IS NOT LIMITED TO, SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN (SUCH AS SCHWANNOMA) AND SPINDLE CELL MALIGNANT MELANOMA

“Such as” indicates a hyponym relationship between two noun phrases.
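For illustration, a naive regex version of the “such as” / “including” trigger is sketched below. The real implementation matches over chunked noun phrases (via GATE gazetteers and JAPE rules), so this hypothetical matcher over-captures: without NP chunking, the hypernym group absorbs the preceding context.

```python
import re

# Hypothetical minimal matcher for hyponymy LSPs over upper-case report text.
# Without NP chunking, the "hypernym" group also captures preceding words.
HYPONYM_LSP = re.compile(
    r"(?P<hypernym>[A-Z][A-Z ]+?),?\s+(?:SUCH AS|INCLUDING)\s+(?P<hyponym>[A-Z][A-Z ]+)"
)

def match_hyponyms(sentence):
    """Return (hypernym context, hyponym) for the first LSP hit, else None."""
    m = HYPONYM_LSP.search(sentence)
    if not m:
        return None
    return m.group("hypernym").strip(), m.group("hyponym").strip()
```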
Evaluation of ontology suggestions
• How many terms extracted by LSP are medically meaningful terms (MMTs)?
• How many of the extracted MMTs are not in the ontology and can therefore be new concept candidates?
• How many of the relationships between MMTs are not in the ontology and can therefore be added to the ontology?

Two-step process for judging the extraction output:
Step 1: Domain expert annotation
Step 2: Ontology curator judging
Step 1: Domain Expert Annotation

Input: two sets of data (one for pathologists, one for radiologists), each a list of sentences that contain LSPs, e.g.:

PRURIGO NODULE (aka LICHEN SIMPLEX CHRONICUS)
.........

Annotation task:
1. Mark medically meaningful terms (MMTs) that can stand alone before and after the LSP
2. The terms before and after the LSP must be related
Example
Input sentence: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA.
(term that precedes the LSP, the LSP, term that follows the LSP)

Output: list of paired terms

Term1: BENIGN ECCRINE NEOPLASIA
Term2: NODULAR HIDROADENOMA
….. …….

Calculate: total # of MMTs, # of MMTs per LSP
Step 2: Ontologist Judging

The domain expert annotation output (the list of paired terms, e.g. Term1 = BENIGN ECCRINE NEOPLASIA, Term2 = NODULAR HIDROADENOMA) is passed to the ontology curators.

For each term:
1. Is the concept in the ontology?
2. If not, should it be added to the ontology?
3. If not, what is the reason?

For each pair of terms:
1. What is the relationship between them?
2. Does this relationship exist in the ontology?
3. If not, should it be added to the ontology?
4. If not, what is the reason?
Evaluation metrics
Concept suggestion rate (CSR) = (MMTs that were not in the ontology) / (MMTs extracted by the LSP method)

Concept acceptance rate (CAR) = (MMTs that should be added to the ontology) / (MMTs extracted by the LSP method)

Concept relationship suggestion rate (CRSR) = (relationships that were not in the ontology) / (relationships extracted by the LSP method)

Concept relationship acceptance rate (CRAR) = (relationships that should be added to the ontology) / (relationships extracted by the LSP method)
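Computing the four rates is simple division; as a sketch (illustrative function and argument names), using the pathology “NP such as” counts reported in the results tables later in the talk (52 suggested and 43 accepted of 140 MMTs; 36 suggested and 17 accepted of 65 relationships):

```python
def enrichment_metrics(mmts_extracted, mmts_not_in_ontology, mmts_accepted,
                       rels_extracted, rels_not_in_ontology, rels_accepted):
    """Rates from the definitions above; argument names are illustrative."""
    return {
        "CSR": mmts_not_in_ontology / mmts_extracted,
        "CAR": mmts_accepted / mmts_extracted,
        "CRSR": rels_not_in_ontology / rels_extracted,
        "CRAR": rels_accepted / rels_extracted,
    }

# Pathology "NP such as" row: 52 of 140 MMTs suggested, 43 accepted;
# 36 of 65 relationships suggested, 17 accepted.
rates = enrichment_metrics(140, 52, 43, 65, 36, 17)
```

This reproduces the 37% / 31% and 55% / 26% figures reported for that pattern.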
Ontology curators
• NCIT curator: Dr. Nicholas Sioutos
• RadLex curator: Dr. David Channin
Results – LSP Distribution

Number of sentences containing lexico-syntactic patterns. Columns: pathology corpus (852,764 reports; 16,157,608 sentences) and radiology corpus (209,997 reports; 4,057,228 sentences), each as # sentences / unique sentences.

Pattern             Pathology          Radiology
NP especially NP    14 / 11            19 / 10
NP also called NP   48 / 37            29 / 22
NP such as NP       98 / 95            906 / 251
NP's NP             202 / 45           5 / 2
NP in NP            4851 / 1689        106 / 47
NP aka NP           5396 / 460         2 / 2
NP including NP     6291 / 4952        1403 / 747
NP other NP         6940 / 2251        10622 / 1407
NP like NP          7649 / 2267        410 / 235
NP, NP              8211 / 5351        7385 / 3889
NP of NP            14275 / 4032       2906 / 607
NP in the NP        47124 / 23178      64044 / 29285
NP is NP            92374 / 25024      7349 / 2896
NP of the NP        246798 / 70735     173016 / 54895
Results - MMT yield using LSP method
Cells: # of MMTs / # of instances of LSP (%).

                          Pathology Reports                    Radiology Reports
LSP                       Preceding LSP    Following LSP       Preceding LSP    Following LSP
NP such as NP1, NP2       52/50 (104%)     94/50 (188%)        50/50 (100%)     95/50 (190%)
NP including NP1, NP2     49/50 (98%)      81/50 (162%)        50/50 (100%)     86/50 (172%)
NP other NP1, NP2         50/50 (100%)     53/50 (106%)        43/43 (100%)     43/43 (100%)
NP also called NP1, NP2   35/37 (95%)      36/37 (97%)         NA               NA
NP aka NP1, NP2           47/50 (96%)      59/50 (118%)        NA               NA
NP in NP1                 50/50 (100%)     50/50 (100%)        47/50 (94%)      39/50 (76%)
NP of NP1                 50/50 (100%)     50/50 (100%)        40/50 (80%)      34/50 (68%)
Total                     333/337 (99%)    423/337 (126%)      230/243 (95%)    296/243 (122%)

Overall: 1 to 2 MMTs per LSP instance.
Results – Ontology Concept Suggestion Rate and Ontology Concept Acceptance Rate
LSPs
Pathology Reports Radiology Reports
Suggestion Rate Acceptance Rate Suggestion Rate Acceptance Rate
NP such as NP1, NP2 37% (52/140) 31% (43/140) 52% (75/145) 10% (14/145)
NP including NP1, NP2 32% (61/189) 32% (60/189) 39% (54/138) 14% (19/138)
NP other NP1, NP2 16% (18/113) 16% (18/113) 33% (28/86) 8% (7/86)
NP also called NP1, NP2 14% (10/74) 10% (7/74) NA NA
NP aka NP1, NP2 31% (37/119) 31% (37/119) NA NA
NP in NP1 12% (12/100) 6% (6/100) 18% (13/74) 8% (6/74)
NP of NP1 11% (11/98) 6% (6/98) 26% (21/80) 14% (11/80)
Average 24% (201/833) 21% (177/833) 37% (191/523) 11% (57/523)
Results – Ontology Concept Relationship Suggestion Rate and Acceptance Rate
Pathology Reports (Enrich NCIT) Radiology Reports (Enrich RADLex)
LSPs Suggestion Rate Acceptance Rate Suggestion Rate Acceptance Rate
NP such as NP1, NP2 55% (36/65) 26% (17/65) 94% (34/36) 94% (34/36)
NP including NP1, NP2 78% (89/114) 15% (17/114) 61% (11/18) 39% (7/18)
NP other NP1, NP2 51% (31/61) 8% (5/61) 57% (12/21) 57% (12/21)
NP also called NP1, NP2 29% (12/41) 10% (4/41) NA NA
NP aka NP1, NP2 73% (43/59) 24% (14/59) NA NA
NP in NP1 84% (38/45) 0% (0/45) 50% (13/26) 12% (3/26)
NP of NP1 64% (28/44) 5% (2/44) 0% (0/27) 0% (0/27)
Average 64% (277/429) 14% (59/429) 55% (70/128) 44% (56/128)
Results – Relationship Distribution
Semantic relationships, by corpus and LSP:
LSP | Hyponym | Synonym | Meronym | Other | None
Pathology Reports
NP such as NP1, NP2 37% (24/65) 0% 2% (1/65) 57% (37/65) 5% (3/65)
NP including NP1, NP2 10% (11/114) 1% (1/114) 6% (7/114) 78% (89/114) 5% (6/114)
NP other NP1, NP2 39% (24/61) 0% (0/61) 2% (1/61) 46% (28/61) 8% (5/61)
NP also called NP1, NP2 22% (9/41) 20% (8/41) 0% 37% (15/41) 10% (4/41)
NP aka NP1, NP2 10% (6/59) 44% (26/59) 0% 39% (23/59) 5% (3/59)
NP in NP1 0% 0% 0% 100% (45/45) 0%
NP of NP1 5% (2/44) 0% 18% (8/44) 61% (27/44) 16% (7/44)
Average 18% (76/429) 8% (35/429) 4% (17/429) 62% (264/429) 7% (28/429)
Radiology Reports
NP such as NP1, NP2 72% (26/36) 0% 0% 28% (10/36) 0%
NP including NP1, NP2 39% (7/18) 0% 11% (2/18) 33% (6/18) 17% (3/18)
NP other NP1, NP2 76% (16/21) 0% 0% 0% 24% (5/21)
NP in NP1 4% (1/26) 0% 12% (3/26) 42% (11/26) 42% (11/26)
NP of NP1 4% (1/27) 0% 0% 0% 96% (26/27)
Average 40% (51/128) 0% 4% (5/128) 21% (27/128) 35% (45/128)
Research Project 2:Coreference Resolution
• Annotation schema development and implementation in Knowtator
• Detailed guidelines document
• Annotated corpus (~100K tokens; double annotations and consensus)
• Prototype released as part of ODIE
• First manuscript submitted, second one underway
• Anticipate public release of corpus and guidelines
Coreference Resolution
Wendy Chapman, Guergana Savova, Melissa Castine
Develop annotation scheme; create Reference Standard, consider and test existing algorithms; design, implement & test new algorithms
Manuscripts
Submitted:
• Liu K, Crowley RS. Natural Language Processing Methods and Systems for Biomedical Ontology Learning (review). Submitted to JBI.
• Liu K, Chapman WW, Savova GK, Chute C, Sioutos N, Crowley RS. Effectiveness of Lexico-Syntactic Pattern Matching for Ontology Enrichment with Clinical Documents. Submitted to MIM.
• Savova GK, Chapman WW, Zheng J. Anaphoric relations in the clinical narrative: corpus creation. Submitted to JAMIA.

Planned (next 3 months):
• Chavan G, Mitchell K, Liu K, Savova GK, Chapman WW, Chute C, Crowley RS. ODIE – A workbench for cyclic entity recognition and ontology enrichment. Planned for AMIA 2010 submission.
Future of the project
• Continued releases for the remainder of the grant
• New coreference algorithms
• Additional OE algorithms and modifications
• Better integration with BioPortal
• Planning to apply for competitive renewal (Dec ’10)