
Literature Data Mining and Protein Ontology Development

Page 1: Literature Data Mining and Protein Ontology Development

Literature Data Mining and Protein Ontology Development

At the Protein Information Resource (PIR)

June 29, 2005

Zhang-Zhi Hu, M.D.

Senior Bioinformatics Scientist, PIR

Georgetown University Medical Center

Washington, DC 20007

Hu ZZ*, Mani I, Liu H, Hermoso V, Vijay-Shanker K, Nikolskaya A, Natale DA, and Wu CH

ISMB 2005, Detroit, Michigan

Page 2: Literature Data Mining and Protein Ontology Development

UniProt – Central international database of protein sequence and function (http://www.uniprot.org)

PIR – Integrated Protein Informatics Resource for Genomic/Proteomic Research (http://pir.georgetown.edu)

New version of PIR homepage

Page 3: Literature Data Mining and Protein Ontology Development

Objective: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function

Literature-Based Curation – Extract Reliable Information from Literature

• Function, domains/sites, developmental stages, catalytic activity, binding and modified residues, regulation, pathways, tissue specificity, subcellular location, ...

• Ensure high-quality, accurate, and up-to-date experimental data for each protein.

• A major bottleneck!

Ontologies/Controlled Vocabularies – For Information Integration and Knowledge Management

• UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g., Gene Ontology (GO) and EC nomenclature.

Page 4: Literature Data Mining and Protein Ontology Development

iProLINK: An integrated protein resource for literature mining and literature-based curation

1. Bibliography mapping

- UniProt mapped citations

2. Annotation extraction

- annotation tagged literature

3. Protein named entity recognition

- dictionary, name tagged literature

4. Protein ontology development

- PIRSF-based ontology

Page 5: Literature Data Mining and Protein Ontology Development

iProLINK http://pir.georgetown.edu/iprolink/

Testing and Benchmarking Dataset

• RLIMS-P text mining tool

• Protein dictionaries

• Name tagging guideline

• Protein ontology

Page 6: Literature Data Mining and Protein Ontology Development

Protein Phosphorylation Annotation Extraction

• Manual tagging assisted with computational extraction

• Training sets of positive and negative samples

[Diagram: Phosphorylation. An enzyme (e.g., MAP kinase) adds a P-group to a P-site (e.g., Ser505, giving Ser-P) on a substrate (e.g., cPLA2), producing phosphorylated-cPLA2.]

RLIMS-P: evidence attribution for 3 objects

<AGENT> Enzyme (kinase catalyzing the phosphorylation)

<THEME> Substrate (protein being phosphorylated)

<SITE> P-Site (amino acid residue being phosphorylated)
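The three tagged objects can be pictured as a small record plus a function that marks them up in a sentence. The sketch below is only an illustration: the `PhosphoAnnotation` class, the `tag_sentence` helper, the closing-tag syntax, and the example sentence (built from the slide's example entities) are all assumptions, not the actual RLIMS-P output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhosphoAnnotation:
    """One phosphorylation event (hypothetical record layout)."""
    theme: str                    # substrate, the protein being phosphorylated
    agent: Optional[str] = None   # kinase catalyzing the phosphorylation
    site: Optional[str] = None    # residue being phosphorylated


def tag_sentence(sentence: str, ann: PhosphoAnnotation) -> str:
    """Wrap the first mention of each object in an inline tag.

    The <LABEL>...</LABEL> form is an assumption made for this sketch.
    """
    tagged = sentence
    for label, text in (("AGENT", ann.agent), ("THEME", ann.theme), ("SITE", ann.site)):
        if text:
            tagged = tagged.replace(text, f"<{label}>{text}</{label}>", 1)
    return tagged


# Constructed example sentence using the entities named on the slide.
ann = PhosphoAnnotation(theme="cPLA2", agent="MAP kinase", site="Ser505")
print(tag_sentence("MAP kinase phosphorylates cPLA2 at Ser505.", ann))
```

Plain string replacement is enough for a picture of the idea; a real tagger would work on character offsets so that overlapping or repeated mentions are handled correctly.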

Page 7: Literature Data Mining and Protein Ontology Development

RLIMS-P: Rule-based LIterature Mining System for Protein Phosphorylation

[Pipeline diagram]

Input: Abstracts, Full-Length Texts

1. Preprocessing: sentence extraction, part of speech tagging

2. Entity Recognition: acronym detection, term recognition

3. Phrase Detection: noun and verb group detection, other syntactic structure detection

4. Semantic Type Classification

5. Relation Identification: nominal level relation, verbal level relation

6. Post-Processing

Output: Extracted Annotations, Tagged Abstracts

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?

Example: ATR/FRP-1 also phosphorylated p53 in Ser 15

Download: http://pir.georgetown.edu/iprolink/
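Pattern 1 can be approximated with a regular expression to show what the rule captures. This is a simplified sketch under strong assumptions: RLIMS-P matches over parsed noun and verb groups, whereas the hypothetical `PATTERN_1` and `extract` below work on raw text, treat a single token as a protein name, and recognise only Ser/Thr/Tyr sites.

```python
import re

# Hypothetical, string-level rendering of
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
NAME = r"[A-Za-z0-9][A-Za-z0-9/\-]*"   # single-token protein name (assumption)
SITE = r"(?:Ser|Thr|Tyr)[ -]?\d+"       # residue plus position (assumption)

PATTERN_1 = re.compile(
    rf"(?P<agent>{NAME})\s+(?:also\s+)?phosphorylate[sd]?\s+"
    rf"(?P<theme>{NAME})(?:\s+(?:in|at)\s+(?P<site>{SITE}))?"
)


def extract(sentence):
    """Return the AGENT/THEME/SITE triple, or None if the pattern does not match."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    return {
        "AGENT": match.group("agent"),
        "THEME": match.group("theme"),
        "SITE": match.group("site"),  # None when the optional site is absent
    }


print(extract("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
# {'AGENT': 'ATR/FRP-1', 'THEME': 'p53', 'SITE': 'Ser 15'}
```

The single-token name rule is the main limitation: a multi-word agent such as "MAP kinase" would be cut down to "kinase", which is one reason the real system detects phrases before it applies patterns.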

Page 8: Literature Data Mining and Protein Ontology Development

Benchmarking of RLIMS-P

• UniProtKB site feature annotation

• Proteomics Mass Spec. data analysis: protein identification

High recall for paper retrieval and high precision for information extraction

Bioinformatics. 2005 Jun 1;21(11):2759-65

Page 9: Literature Data Mining and Protein Ontology Development

Online RLIMS-P http://pir.georgetown.edu/iprolink/rlimsp/

(version 1.0)

• Search interface

• Summary table with top hit of all sites

• All sites and tagged text evidence


Page 10: Literature Data Mining and Protein Ontology Development

BioThesaurus http://pir.georgetown.edu/iprolink/biothesaurus/

[Diagram: construction of BioThesaurus]

Source databases (via iProClass):

- NCBI: Entrez Gene, RefSeq, GenPept

- UniProt: UniProtKB, UniRef90/50, PIR-PSD

- Genome: FlyBase, WormBase, MGD, SGD, RGD

- Other: HUGO, EC, OMIM

Processing: Name Extraction -> Raw Thesaurus -> Name Filtering (Highly Ambiguous and Nonsensical Terms) -> Semantic Typing (UMLS) -> BioThesaurus

Content: UniProtKB Entries: Protein/Gene Names & Synonyms

Applications:

• Biological entity tagging

• Name mapping

• Database annotation

• Literature mining

• Gateway to other resources

BioThesaurus v1.0 (May, 2005), m = million

# UniProtKB entry 1.86m

# Source DB record 6.6m

# Gene/protein names/terms 3.6m

Page 11: Literature Data Mining and Protein Ontology Development

BioThesaurus Report

Synonyms for Metalloproteinase inhibitor 3

Gene/Protein Name Mapping

1. Search Synonyms

2. Resolve Name Ambiguity

3. Underlying ID Mapping

Name ambiguity: TMP3
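The three name-mapping steps can be sketched with two small indexes. Everything here is a toy assumption: the entry IDs (`ENTRY_A`, `ENTRY_B`), the second protein, and the functions `build_index` and `lookup` are invented for illustration and do not reflect the BioThesaurus data model or interface.

```python
from collections import defaultdict

# Toy (entry ID, name) records. The IDs and the second protein are made up;
# only "Metalloproteinase inhibitor 3" and "TMP3" come from the slide.
RECORDS = [
    ("ENTRY_A", "Metalloproteinase inhibitor 3"),
    ("ENTRY_A", "TIMP3"),
    ("ENTRY_A", "TMP3"),
    ("ENTRY_B", "Hypothetical protein B"),
    ("ENTRY_B", "TMP3"),
]


def build_index(records):
    """Build name -> entry IDs and entry ID -> names lookups."""
    name_to_ids = defaultdict(set)
    id_to_names = defaultdict(set)
    for entry_id, name in records:
        name_to_ids[name.lower()].add(entry_id)  # case-insensitive name key
        id_to_names[entry_id].add(name)
    return name_to_ids, id_to_names


def lookup(name, name_to_ids, id_to_names):
    """Map a name to entry IDs, flag ambiguity, and list synonyms per entry."""
    key = name.lower()
    ids = sorted(name_to_ids.get(key, ()))
    return {
        "ids": ids,                    # step 3: underlying ID mapping
        "ambiguous": len(ids) > 1,     # step 2: name shared by several entries
        "synonyms": {                  # step 1: other names of each entry
            i: sorted(n for n in id_to_names[i] if n.lower() != key) for i in ids
        },
    }


name_to_ids, id_to_names = build_index(RECORDS)
print(lookup("TMP3", name_to_ids, id_to_names))
```

In this toy data "TMP3" resolves to two entries, so it is flagged as ambiguous, while each entry's synonym list gives a curator the context needed to pick the intended protein.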

Page 12: Literature Data Mining and Protein Ontology Development

Protein Name Tagging

• Tagging guideline versions 1.0 and 2.0

• Generation of domain expert-tagged corpora

• Inter-coder reliability – upper bound of machine tagging

• Dictionary pre-tagging

- F-measure: 0.412 (0.372 Precision, 0.462 Recall)

- Advantages: helps with standardization and extent of tagging, reduces the fatigue problem, and improves inter-coder reliability.

• BioThesaurus for pre-tagging
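The pre-tagging F-measure quoted above is consistent with the standard balanced F-measure, the harmonic mean of precision and recall. A minimal check, assuming that is the definition used on the slide (the function name `f_measure` is ours):

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values from the slide: 0.372 precision, 0.462 recall.
print(round(f_measure(0.372, 0.462), 3))  # 0.412
```

Written out, 2 × 0.372 × 0.462 / (0.372 + 0.462) = 0.343728 / 0.834 ≈ 0.412, matching the reported figure.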

Page 13: Literature Data Mining and Protein Ontology Development

PIRSF-Based Protein Ontology

• PIRSF family hierarchy based on evolutionary relationships

• Standardized PIRSF family names as hierarchical protein ontology

• DAG network structure for PIRSF family classification system

PIRSF in DAG View

Page 14: Literature Data Mining and Protein Ontology Development

PIRSF to GO Mapping

• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy

- 68% of the PIRSF families and subfamilies map to GO leaf nodes

- 2329 PIRSFs have shared GO leaf nodes

• Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies

• Superimpose GO and PIRSF hierarchies

• Bidirectional display (GO- or PIRSF-centric views)

DynGO viewer: Hongfang Liu, University of Maryland
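The leaf-node statistics above can be illustrated on a toy DAG. This is a sketch only: the term IDs, family IDs, and the helpers `is_leaf` and `leaf_stats` are hypothetical, and the numbers it prints describe the toy data, not the 5363-family mapping reported on the slide.

```python
# Toy GO-like DAG: term -> child terms (all IDs hypothetical).
GO_CHILDREN = {
    "GO:root": ["GO:a", "GO:b"],
    "GO:a": ["GO:a1", "GO:a2"],
    "GO:b": [],
    "GO:a1": [],
    "GO:a2": [],
}

# Toy family -> mapped term (hypothetical family IDs).
FAMILY_TO_GO = {
    "PIRSF_X": "GO:a1",
    "PIRSF_Y": "GO:a1",
    "PIRSF_Z": "GO:a",
    "PIRSF_W": "GO:b",
}


def is_leaf(term):
    """A term is a leaf when it has no child terms."""
    return not GO_CHILDREN.get(term)


def leaf_stats(family_to_go):
    """Return (fraction of families mapped to leaves, leaves shared by >1 family)."""
    by_leaf = {}
    for family, term in family_to_go.items():
        if is_leaf(term):
            by_leaf.setdefault(term, []).append(family)
    mapped_to_leaf = sum(len(families) for families in by_leaf.values())
    shared = {t: sorted(fs) for t, fs in by_leaf.items() if len(fs) > 1}
    return mapped_to_leaf / len(family_to_go), shared


fraction, shared = leaf_stats(FAMILY_TO_GO)
print(fraction)  # 0.75
print(shared)    # {'GO:a1': ['PIRSF_X', 'PIRSF_Y']}
```

A leaf shared by several families, like `GO:a1` here, marks a concept that the family classification splits more finely than the ontology does, which is the kind of node the slides propose expanding.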

Page 15: Literature Data Mining and Protein Ontology Development

Protein Ontology Can Complement GO

• Expanding a Node: Identification of GO subtrees that can be expanded when GO concepts are too broad

- IGFBP subfamilies

- High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

GO-centric view

Page 16: Literature Data Mining and Protein Ontology Development

Exploration of Gene and Protein Ontology

PIRSF-centric view: Estrogen receptor alpha (PIRSF50001)

Systematic links between the three GO sub-ontologies, e.g., linking molecular function and biological process:

- Molecular function: Estrogen receptor binding

- Biological process: Estrogen receptor signaling pathway
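The cross-ontology link in this example amounts to pairing the terms that one family carries in different sub-ontologies. The sketch below assumes a simple annotation table; the `FAMILY_GO` layout and the `cross_links` function are hypothetical, and the only data used is the slide's own example.

```python
from itertools import product

# family -> list of (GO term name, sub-ontology); entries taken from the slide's example.
FAMILY_GO = {
    "PIRSF50001": [
        ("estrogen receptor binding", "molecular_function"),
        ("estrogen receptor signaling pathway", "biological_process"),
    ],
}


def cross_links(family_go):
    """Pair each molecular-function term with each biological-process term per family."""
    links = []
    for family, terms in family_go.items():
        functions = [t for t, aspect in terms if aspect == "molecular_function"]
        processes = [t for t, aspect in terms if aspect == "biological_process"]
        links.extend((family, f, p) for f, p in product(functions, processes))
    return links


for family, function, process in cross_links(FAMILY_GO):
    print(f"{function} <-> {process} (via {family})")
```

Each link is justified only by the shared family, so in practice the pairs would be candidates for curator review rather than asserted relationships.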

Page 17: Literature Data Mining and Protein Ontology Development

Summary

• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.

• The RLIMS-P text-mining tool extracts protein phosphorylation annotations from PubMed literature.

• BioThesaurus can be used for name mapping to resolve name synonym and ambiguity issues.

• The PIRSF-based protein ontology can complement other biological ontologies such as GO.

Page 18: Literature Data Mining and Protein Ontology Development

Acknowledgements

Research Projects

• NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)

• NSF: SEIII (Entity Tagging)

• NSF: ITR (Ontology)

Collaborators

• I. Mani from Georgetown University Department of Linguistics on protein name recognition and protein name ontology.

• H. Liu from University of Maryland Department of Information Systems on protein name recognition and text mining.

• K. Vijay-Shanker from University of Delaware Department of Computer and Information Science on text mining of protein phosphorylation features.

