The Pathway Tools Ontology and Inferencing Layer

Date post: 01-Jan-2016

The Pathway Tools Ontology and Inferencing Layer

Peter D. Karp, Ph.D.

SRI International

SRI International Bioinformatics

Overview

Definitions

Ontologies are ultimately exciting because of the inferences/computations they enable:
  Where are the ontology killer apps?
  Adding more facets to an ontology increases the inferences that can be made with it

Pathway Tools ontology and associated applications

Terminology

Model Organism Database (MOD) – DB describing genome and other information about an organism

Pathway/Genome Database (PGDB) – MOD that combines information about
  Pathways, reactions, substrates
  Enzymes, transporters
  Genes, replicons
  Transcription factors, promoters, operons, DNA binding sites

BioCyc – Collection of 15 PGDBs at BioCyc.org
  EcoCyc, AgroCyc, YeastCyc

Terminology – Pathway Tools Software

PathoLogic
  Prediction of metabolic network from genome
  Computational creation of new Pathway/Genome Databases

Pathway/Genome Editors
  Distributed curation of PGDBs
  Distributed object database system, interactive editing tools

Pathway/Genome Navigator
  WWW publishing of PGDBs
  Querying, visualization of pathways, chromosomes, operons
  Analysis operations
    Pathway visualization of gene-expression data
    Global comparisons of metabolic networks

Bioinformatics 18:S225 2002

Ontology

Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Terms and Taxonomy

Pathway Tools ontology contains 916 classes

Define datatypes
  Replicons, Genes, Operons, Promoters, Transcription Factor Binding Sites
  Proteins: Enzymes, Transporters, Transcription Factors
  Small molecule compounds
  Reactions, pathways

Define taxonomies
  Taxonomy of chemical compounds
  Riley’s gene ontology
  Taxonomy of metabolic pathways
  EC system

Bioinformatics 16:269 2000

Operations Enabled by Controlled Vocabulary

Equality testing:
  Is the function of gene X in organism A the same as the function of gene Y in organism B?
  Is location L1 in organism A the same as location L2 in organism B?

Operations Enabled by Taxonomy

Counting / Pie charts
  How many genes of category “small molecule metabolism” are in organism A?

Intersecting sets
  How many of these up-regulated genes are in class “cell cycle”?

User search via drill down

Applying rules
  If the substrate of X is an amino acid, then XXX
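The counting and set-intersection operations above can be sketched in a few lines. This is only an illustration in Python, not Pathway Tools code: the toy taxonomy, the class names, and the gene-to-class assignments are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: child class -> parent class.
PARENT = {
    "amino-acid-biosynthesis": "small-molecule-metabolism",
    "carbon-utilization": "small-molecule-metabolism",
    "small-molecule-metabolism": "gene-function",
    "cell-cycle": "gene-function",
}

# Hypothetical annotations: gene -> most specific class assigned to it.
GENE_CLASS = {
    "trpA": "amino-acid-biosynthesis",
    "lacZ": "carbon-utilization",
    "ftsZ": "cell-cycle",
    "minD": "cell-cycle",
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def genes_in_class(ancestor):
    """All genes whose class is ancestor or one of its descendants."""
    return {g for g, c in GENE_CLASS.items() if is_a(c, ancestor)}


def count_by_child(root):
    """Gene counts per direct child of root: the numbers behind a pie chart."""
    children = [c for c, p in PARENT.items() if p == root]
    counts = defaultdict(int)
    for cls in GENE_CLASS.values():
        for child in children:
            if is_a(cls, child):
                counts[child] += 1
    return dict(counts)


# Intersecting sets: which up-regulated genes are in class "cell-cycle"?
up_regulated = {"ftsZ", "lacZ"}
in_cell_cycle = up_regulated & genes_in_class("cell-cycle")
```

The point of the sketch is that a gene annotated with a specific class is counted under every ancestor of that class, which is what the taxonomy adds over a flat controlled vocabulary.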

Ontology

Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Slots

Pathway Tools ontology contains 199 slots

Categories of slots:
  Meta-data: Creator, Creation-Date
  Textual data: Common-Name, Synonyms, Comment, Citations
  Attributes: Molecular-Weight, pI
  Relationships: Gene, Catalyzes, In-Reaction

Give stats on how many slots in each of these classes

Pathway Tools Ontology: Slots

Slots introduced at appropriate place in taxonomy
  Child classes inherit the slot; parent classes do not

Examples:
  Proteins: pI, MolWt, Component-Of
    Polypeptides: Gene
    Protein-Complexes: Components
  Reactions: Left, Right, Keq, In-Pathway
  Pathways: Reaction-List, Predecessor-List
  Transcription Units: Components
  Genes: Product, Component-Of
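A minimal sketch of how slot inheritance along the taxonomy could work, written in Python for illustration. The slot names come from the slide; treating Polypeptides and Protein-Complexes as child classes of Proteins is an assumption read off the slide layout, and the dictionaries and the `slots_of` helper are hypothetical, not the Ocelot API.

```python
# Assumed fragment of the class taxonomy: class -> parent class.
PARENT = {
    "Polypeptides": "Proteins",
    "Protein-Complexes": "Proteins",
    "Proteins": None,
}

# Slots listed at the class that introduces them (names from the slide).
INTRODUCED = {
    "Proteins": ["pI", "MolWt", "Component-Of"],
    "Polypeptides": ["Gene"],
    "Protein-Complexes": ["Components"],
}


def slots_of(cls):
    """Slots available on cls: its own plus those of every ancestor."""
    slots = []
    while cls is not None:
        # Prepend so that inherited slots come first.
        slots = INTRODUCED.get(cls, []) + slots
        cls = PARENT[cls]
    return slots
```

With these definitions, `slots_of("Polypeptides")` includes pI, MolWt and Component-Of from Proteins plus its own Gene slot, while `slots_of("Proteins")` does not include Gene: children inherit, parents do not.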

Operations Enabled by Slots

Store/retrieve attributes of an entity
  Get pI of protein
  Get citations associated with pathway

Traverse network of semantic relationships
  Find all substrates of all reactions in pathway X
  Find all genes that encode an enzyme that catalyzes a reaction in pathway X
  Find all regulons encoding multiple metabolic pathways
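The first two traversal queries above can be sketched over a toy frame store. This is an illustrative Python sketch only: the frame identifiers, the `Catalyzed-By` slot name, and the `get` helper are assumptions for the example, and the real KB reaches enzymes from reactions through intermediate enzymatic-reaction frames that are omitted here.

```python
# Hypothetical toy KB: frame id -> {slot name -> list of values}.
KB = {
    "TCA-Cycle": {"Reaction-List": ["SUCC-DEHYD-RXN"]},
    "SUCC-DEHYD-RXN": {
        "Left": ["succinate", "FAD"],
        "Right": ["fumarate", "FADH2"],
        "Catalyzed-By": ["Sdh"],  # assumed shortcut slot, see lead-in
    },
    "Sdh": {"Components": ["SdhA", "SdhB"]},
    "SdhA": {"Gene": ["sdhA"]},
    "SdhB": {"Gene": ["sdhB"]},
}


def get(frame, slot):
    """Values of a slot, or an empty list if frame or slot is absent."""
    return KB.get(frame, {}).get(slot, [])


def substrates_of_pathway(pathway):
    """All substrates of all reactions in the pathway."""
    found = set()
    for rxn in get(pathway, "Reaction-List"):
        found.update(get(rxn, "Left"))
        found.update(get(rxn, "Right"))
    return found


def genes_of_pathway(pathway):
    """All genes encoding an enzyme that catalyzes a reaction in the pathway."""
    genes = set()
    for rxn in get(pathway, "Reaction-List"):
        for enzyme in get(rxn, "Catalyzed-By"):
            # An enzyme is either a complex of several polypeptides
            # or a single polypeptide that carries the Gene slot itself.
            for polypeptide in get(enzyme, "Components") or [enzyme]:
                genes.update(get(polypeptide, "Gene"))
    return genes
```

Each query is just a fixed path through relationship slots, which is why such queries become easy to write once the relationships are explicit in the ontology.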

Ontology

Ontology = Terms + Taxonomy + Slots + Constraints

Pathway Tools Ontology: Constraints

Every Pathway Tools slot has associated meta data:
  Class(es) to which it pertains
    Keq pertains to Reactions
  Data type (number, string, frame, etc.)
    Keq data type is number
  Collection type (list, bag)
    Keq is not a collection
  Documentation string
  Cardinality constraints
    At most one Keq value
  Range constraints
  Taxonomy constraints
    Values of Left slot of Reactions must be Chemicals
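To make the slot meta data concrete, here is a minimal Python sketch of a consistency check driven by such a schema. The `SlotSpec` fields, the `SCHEMA` and `CLASS_OF` tables, and `check_frame` are hypothetical names chosen for the example; only the Keq and Left constraints themselves come from the slide. The domain test compares class names directly and ignores subclasses, a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotSpec:
    domain: str                         # class to which the slot pertains
    datatype: type                      # expected Python type of each value
    max_values: Optional[int] = None    # cardinality bound, None = unbounded
    value_class: Optional[str] = None   # taxonomy constraint on values


SCHEMA = {
    "Keq": SlotSpec(domain="Reactions", datatype=float, max_values=1),
    "Left": SlotSpec(domain="Reactions", datatype=str, value_class="Chemicals"),
}

# Hypothetical class membership of some frames.
CLASS_OF = {"succinate": "Chemicals", "FAD": "Chemicals", "sdhA": "Genes"}


def check_frame(frame_class, slots):
    """Return a list of human-readable constraint violations."""
    problems = []
    for slot, values in slots.items():
        spec = SCHEMA.get(slot)
        if spec is None:
            problems.append(f"{slot}: unknown slot")
            continue
        if spec.domain != frame_class:
            problems.append(f"{slot}: does not pertain to {frame_class}")
        if spec.max_values is not None and len(values) > spec.max_values:
            problems.append(f"{slot}: at most {spec.max_values} value(s) allowed")
        for v in values:
            if not isinstance(v, spec.datatype):
                problems.append(f"{slot}: {v!r} is not a {spec.datatype.__name__}")
            elif spec.value_class and CLASS_OF.get(v) != spec.value_class:
                problems.append(f"{slot}: {v!r} is not in class {spec.value_class}")
    return problems
```

For example, `check_frame("Reactions", {"Keq": [1.0, 2.0]})` reports a cardinality violation, and a gene placed in the Left slot of a reaction is flagged by the taxonomy constraint.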

Operations Enabled by Constraints

Constraints make a system “intelligent” because they encode definitions in a machine-understandable fashion

Automated DB consistency checkers (batch or interactive)

Schema-driven data input tools

Subsumption – Compare two concept definitions

Pathway Tools Inference Layer

Commonly used queries implemented as stored procedures

Infer what is implicitly recorded in the KB

Compute Transitive Relationships

[Figure: chain of frames linked by slots. Chrom; genes sdhA, sdhB, sdhC, sdhD; products Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction succinate + FAD = fumarate + FADH2 with left succinate, FAD and right fumarate, FADH2; pathway TCA Cycle. Slot labels on the links: product, component-of, catalyzes, reaction, in-pathway, left, right.]

Pathway Tools Inference Layer

Enumerate reactions given alternative definitions of a reaction: all, enzyme, transport, small-mol, smm

All substrates, all cofactors, all transported chemicals

Protein tests: Is X a transcription factor, enzyme, transporter
  Rather than force user to manually assign physiological roles, compute when possible from biochemical function

Transcription-unit-binding-sites

Compute in parts hierarchy: monomers-of-protein, components-of-protein, genes-of-protein, modified-forms

Complex: regulon-of-protein, regulator-proteins-of-transcription-unit
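Two of the inferences listed above, walking the parts hierarchy and testing whether a protein is an enzyme, might be sketched as follows. The Python function names mirror the slide's monomers-of-protein and genes-of-protein, but the bodies, the toy KB, and the frame identifiers are assumptions for illustration, not the Pathway Tools implementation.

```python
# Hypothetical toy KB built around succinate dehydrogenase.
KB = {
    "Succinate-dehydrogenase": {
        "Components": ["Sdh-flavo", "Sdh-Fe-S", "Sdh-membrane-1", "Sdh-membrane-2"],
        "Catalyzes": ["SUCC-DEHYD-ENZRXN"],  # assumed frame id
    },
    "Sdh-flavo": {"Gene": ["sdhA"]},
    "Sdh-Fe-S": {"Gene": ["sdhB"]},
    "Sdh-membrane-1": {"Gene": ["sdhC"]},
    "Sdh-membrane-2": {"Gene": ["sdhD"]},
}


def monomers_of_protein(protein):
    """Recursively expand a complex into its monomers (parts hierarchy)."""
    components = KB.get(protein, {}).get("Components", [])
    if not components:
        return [protein]  # no components: the protein is itself a monomer
    monomers = []
    for part in components:
        monomers.extend(monomers_of_protein(part))
    return monomers


def genes_of_protein(protein):
    """Genes encoding any monomer of the protein."""
    genes = []
    for monomer in monomers_of_protein(protein):
        genes.extend(KB.get(monomer, {}).get("Gene", []))
    return genes


def is_enzyme(protein):
    """Role computed from function: an enzyme is a protein that catalyzes something."""
    return bool(KB.get(protein, {}).get("Catalyzes"))
```

Nothing in the KB states directly that sdhA is a gene of succinate dehydrogenase, or that the complex is an enzyme; both facts are implicit and are recovered by following slots.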

What Killer Apps have Ontologies Enabled?

What comes after pie charts and drill-down interfaces?

BioCyc Collection of Pathway/Genome DBs

Literature-based Datasets:
  MetaCyc
  Escherichia coli (EcoCyc)

Computationally Derived Datasets:
  Agrobacterium tumefaciens
  Caulobacter crescentus
  Chlamydia trachomatis
  Bacillus subtilis
  Helicobacter pylori
  Haemophilus influenzae
  Mycobacterium tuberculosis H37Rv
  Mycobacterium tuberculosis CDC1551
  Mycoplasma pneumoniae
  Pseudomonas aeruginosa
  Saccharomyces cerevisiae
  Treponema pallidum
  Vibrio cholerae

Yellow Underlined = Open Database

http://BioCyc.org/

Pathway/Genome DBs Created by External Users

Plasmodium falciparum, Stanford University
  plasmocyc.stanford.edu

Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
  Arabidopsis.org:1555

Methanococcus jannaschii, EBI
  Maine.ebi.ac.uk:1555

Other PGDBs in progress by 20 other users
Software freely available
Each PGDB owned by its creator

Ontology Reuse

A holy grail in AI since “ontology” became a buzz-word
  Decrease knowledge acquisition bottleneck

GO qualifies as a large success in ontology reuse

Pathway Tools ontology reused across 18 PGDBs
Pathway Tools algorithms portable across all PGDBs

Pathway Tools Algorithms

Visualization and editing tools for following datatypes

Full Metabolic Map
  Paint gene expression data on metabolic network; compare metabolic networks
Pathways
  Pathway prediction
Reactions
  Balance checker
Compounds
  Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes
Chromosomes
Operons
  Operon prediction; visualize genetic network
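As an illustration of what a reaction balance checker does, here is a minimal Python sketch that compares element counts on the two sides of a reaction. It is not the Pathway Tools checker: the `FORMULA` table and function names are assumptions, the formulas are those of the neutral acid forms, and stoichiometric coefficients and charge are ignored.

```python
import re
from collections import Counter

# Molecular formulas (neutral forms) for the succinate dehydrogenase reaction.
FORMULA = {
    "succinate": "C4H6O4",
    "fumarate": "C4H4O4",
    "FAD": "C27H33N9O15P2",
    "FADH2": "C27H35N9O15P2",
}


def atoms(formula):
    """Parse a simple formula such as 'C4H6O4' into element counts."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def side_total(compounds):
    """Sum element counts over all compounds on one side of a reaction."""
    total = Counter()
    for name in compounds:
        total.update(atoms(FORMULA[name]))
    return total


def imbalance(left, right):
    """Per-element surplus of left over right; empty dict means balanced."""
    l, r = side_total(left), side_total(right)
    return {e: l[e] - r[e] for e in set(l) | set(r) if l[e] != r[e]}
```

Here `imbalance(["succinate", "FAD"], ["fumarate", "FADH2"])` is empty, while dropping the cofactor exposes the two hydrogens that it carries.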

Inference of Metabolic Pathways

[Figure: PathoLogic Software integrates genome and pathway data to identify putative metabolic networks. Inputs: Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences) and the Multi-organism Pathway Database, MetaCyc (Reactions, Pathways, Compounds). Output: Pathway/Genome Database (Genomic Map, Genes, Gene Products, Reactions, Pathways, Compounds).]

PathoLogic Analysis Phases

Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze [10 min / 1 week]
  ferrochelatase
  glutamate 1-semialdehyde 2,1-aminomutase
  porphobilinogen deaminase

[Figure: pathway diagram with compounds A, B, C, D, E, F, G and enzymes E1, E2]

PathoLogic Analysis Phases

From assigned reactions, infer what pathways are present [5 min / few days]

Define metabolic overview diagram [1 day]

Define protein complexes [few days]
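The slides do not say how PathoLogic decides that a pathway is present. One simple possibility, shown here purely as a hypothetical Python sketch, is to score each reference pathway by the fraction of its reactions that have an assigned enzyme and keep those above a threshold. The pathway and reaction names, the scoring rule, and the 0.5 threshold are all assumptions.

```python
# Hypothetical reference pathways: name -> list of reaction ids.
REFERENCE_PATHWAYS = {
    "pathway-1": ["rxn-a", "rxn-b", "rxn-c"],
    "pathway-2": ["rxn-c", "rxn-d", "rxn-e", "rxn-f"],
}


def score_pathways(assigned_reactions, reference=REFERENCE_PATHWAYS):
    """Fraction of each pathway's reactions that have an assigned enzyme."""
    scores = {}
    for name, rxns in reference.items():
        present = [r for r in rxns if r in assigned_reactions]
        scores[name] = len(present) / len(rxns)
    return scores


def predict_pathways(assigned_reactions, threshold=0.5):
    """Names of pathways whose score meets the (assumed) threshold."""
    scores = score_pathways(assigned_reactions)
    return sorted(name for name, s in scores.items() if s >= threshold)
```

A threshold rule of this kind over-predicts pathways that share reactions with others, which is one reason the manual review budgeted in the phase timings above matters.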

Killer App: Global Consistency Checking of Biochemical Network

Given:
  A PGDB for an organism
  A set of initial metabolites

Infer:
  What set of products can be synthesized by the small-molecule metabolism of the organism

Can known growth medium yield known essential compounds?

Pacific Symposium on Biocomputing p471 2001

Algorithm: Forward Propagation

[Figure: Nutrient set, Transport, Metabolite set, PGDB reaction pool. Reactions whose Reactants are in the metabolite set “Fire”, and their Products are added to the metabolite set.]
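Forward propagation can be sketched as a fixed-point loop: keep firing every reaction whose reactants are all available until nothing new can fire. The Python below is a minimal illustration under simplifying assumptions: reactions are one-directional (reactants, products) pairs, transport is omitted, and the compound names are placeholders.

```python
def forward_propagate(nutrients, reactions):
    """Return every compound reachable from the nutrient set.

    reactions is a list of (reactants, products) pairs.
    """
    metabolites = set(nutrients)
    pending = list(reactions)
    changed = True
    while changed:
        changed = False
        still_pending = []
        for reactants, products in pending:
            if set(reactants) <= metabolites:
                # All reactants available: "fire" the reaction.
                metabolites.update(products)
                changed = True
            else:
                still_pending.append((reactants, products))
        # Fired reactions are dropped, so the loop always terminates.
        pending = still_pending
    return metabolites


# Hypothetical reaction pool; listed out of order on purpose.
REACTIONS = [
    (["C"], ["D"]),
    (["A", "B"], ["C"]),
    (["X"], ["Y"]),
]

reachable = forward_propagate({"A", "B"}, REACTIONS)
essential = {"D", "Y"}
missing = essential - reachable
```

The `missing` set is the consistency check: an essential compound that cannot be reached from the growth medium points to a gap in the database or in the nutrient set.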

Results

Phase I: Forward propagation
  21 initial compounds yielded only half of 38 essential compounds for E. coli

Phase II: Manually identify
  Bugs in EcoCyc (e.g., two objects for tryptophan)
  Missing initial protein substrates (e.g., ACP)
  Missing pathways in EcoCyc

Phase III: Forward propagation with 11 more initial metabolites
  Yielded all 38 essential compounds

How to Characterize the Metabolic Network of a Cell?

Aggregate Properties of the E. coli Metabolic Network

EcoCyc is not a complete picture of E. coli metabolism
  30% of E. coli genes remain unidentified

Analysis pertains to pathways of small-molecule metabolism

Computed with respect to EcoCyc v4.5 (Sep-1998)

Joint work with Christos Ouzounis of EBI
Genome Research 10:268 2001

Enzymes

4391 genes in E. coli genome

4288 code for proteins

676 (15%) gene products form 607 enzymes

Of the 607 enzymes, 296 are monomers, 311 are multimers

90% of genes for heteromultimers are linked

Reactions

744 reactions of small-molecule metabolism
  582 assigned to at least one pathway

Compounds

791 substrates in the 744 reactions

Each reaction contains 4.0 substrates on average

Each substrate appears in 2.1 reactions

Enzyme Modulation

805 enzymatic-reaction objects in EcoCyc
  80 have physiological inhibitors
  22 have physiological activators
  17 have both
  43% have a modulator

327 require a cofactor or prosthetic group

Enzyme-Reaction Associations

585 reactions catalyzed by 1 enzyme
55 reactions catalyzed by 2 enzymes
12 reactions catalyzed by 3 enzymes
1 reaction catalyzed by 4 enzymes

483 reactions belong to a single pathway
99 reactions belong to multiple pathways

100 of the 607 E. coli enzymes are multifunctional
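Statistics of this kind fall out of the enzyme-to-reaction relationship with a few lines of counting. The Python sketch below uses a made-up three-enzyme mapping rather than EcoCyc data, and it assumes "multifunctional" means an enzyme that catalyzes more than one reaction.

```python
from collections import Counter

# Hypothetical mapping: enzyme -> reactions it catalyzes.
CATALYZES = {
    "enzyme-1": ["rxn-a"],
    "enzyme-2": ["rxn-a", "rxn-b"],
    "enzyme-3": ["rxn-c"],
}


def enzymes_per_reaction(catalyzes):
    """Number of distinct enzymes catalyzing each reaction."""
    per_rxn = Counter()
    for rxns in catalyzes.values():
        for rxn in set(rxns):
            per_rxn[rxn] += 1
    return per_rxn


def histogram(catalyzes):
    """Map k -> number of reactions catalyzed by exactly k enzymes."""
    return dict(Counter(enzymes_per_reaction(catalyzes).values()))


def multifunctional(catalyzes):
    """Enzymes that catalyze more than one reaction (assumed definition)."""
    return sorted(e for e, rxns in catalyzes.items() if len(set(rxns)) > 1)
```

Applied to the slide's own numbers, the histogram rows sum to 585 + 55 + 12 + 1 = 653 reactions with at least one known enzyme, and 483 + 99 = 582 matches the count of reactions assigned to at least one pathway.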

Pathway Tools Implementation

Allegro Common Lisp
Sun and PC platforms

Run as window application or WWW server

Ocelot object database

250,000 lines of code

Lisp-based WWW server at BioCyc.org
  Lisp process reads URLs from the network and generates GIF+HTML from PGDBs
  Manages 15 PGDBs

Ocelot Knowledge Server Architecture

Frame data model
  Classes, instances, inheritance

Persistent storage via disk files, Oracle DBMS
  Concurrent development: Oracle
  Single-user development: disk files
  Read-only delivery: bundle data into binary program

Transaction logging facility
Schema evolution
Local disk cache to improve Internet performance

J. Intelligent Information Systems 1:155-94 1999

GKB Editor

Browser and editor for KBs and ontologies

Three editing tools:
  Taxonomy editor
  Frame editor
  Relationships editor

All operations are schema driven

http://www.ai.sri.com/~gkb/user-man.html

The Common Lisp Programming Environment

Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)

Peter Norvig’s Solution

“I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)”

http://www.norvig.com/java-lisp.html

Common Lisp Programming Environment

Interpreted and/or compiled execution
Fabulous debugging environment
High-level language
Interactive data exploration
Extensive built-in libraries
Dynamic redefinition

Find out more!
  ALU.org -- Association of Lisp Users
  BioLisp.org

Pathway Exchange Ontology

BioPathways group developing ontology and format for exchange of pathway data
  Metabolic pathways
  Signaling pathways
  Protein interactions

Moving upwards from chemicals, proteins, to reactions and pathways

Working to extend CML
Draft ontology at
  http://www.ai.sri.com/pkarp/misc/interactions.html

Summary

Pathway Tools apps:
  Predict pathways and generate PGDBs
  Visualization and editing tools
  Paint gene expression data; compare entire pathway maps
  Global consistency checking of metabolic network
  Characterize metabolic and genetic networks

New killer apps:
  Interoperability
  Text mining
  Bake-off for genome annotation pipelines

BioCyc and Pathway Tools Availability

WWW BioCyc freely available to all
  BioCyc.org
  Six BioCyc DBs openly available to all

BioCyc DBs freely available to non-profits
  Flatfiles downloadable from BioCyc.org
  Binary executable:
    Sun UltraSparc-170 w/ 64MB memory
    PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  PerlCyc API

Pathway Tools freely available to non-profits

Acknowledgements

SRI
  Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud

EcoCyc Project
  Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier

MetaCyc Project
  Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville

Stanford
  Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh

Funding sources:
  NIH National Center for Research Resources
  NIH National Institute of General Medical Sciences
  NIH National Human Genome Research Institute
  Department of Energy Microbial Cell Project
  DARPA BioSpice, UPC

BioCyc.org

