13. Lecture WS 2003/04

Bioinformatics III

For your exam preparation: relevant topics from lectures 13-26

V13 Protein Networks / Protein Complexes

Protein networks could be defined in a number of ways

- Co-regulated expression of genes/proteins

- Proteins participating in the same metabolic pathways

- Proteins sharing substrates

- Proteins that are co-localized

- Proteins that form permanent supracomplexes = protein machineries

- Proteins that bind each other transiently

(signal transduction, bioenergetics ... )

Please describe different types of cellular networks.

Relation between lethality and function as centers in protein networks

Likelihood p(k) of finding proteins in yeast that interact

with exactly k other proteins.

Probability has power law dependence.

(Similar plot for the bacterium Helicobacter pylori.)

The network of protein-protein interactions is a very

inhomogeneous scale-free network where a few,

highly connected, proteins play central roles of

mediating the interactions among other, less

strongly connected, proteins.

Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001)
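To make the quantity p(k) concrete, here is a minimal Python sketch that computes the fraction of proteins with exactly k interaction partners from a list of pairwise interactions. The function name and the toy edge list are hypothetical illustrations, not data from Jeong et al.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: p(k)}: the fraction of nodes with exactly k interaction partners."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    counts = Counter(len(partners) for partners in neighbors.values())
    n = len(neighbors)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network: one hub "H" bound to four proteins, plus one extra pair.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B")]
p = degree_distribution(toy_edges)
print(p)
```

For a scale-free network, plotting p(k) against k on log-log axes would give an approximately straight line; the toy network is far too small to show this.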

Relation between lethality and function as centers in protein networks

Computational analysis of the tolerance of protein

networks for random errors (gene deletions).

Random mutations don’t have an effect on the total

topology of the network.

When “hub” proteins with many interactions are

eliminated, the diameter of the network increases

quickly.

The likelihood that a protein is essential (gene knock-out is lethal for the cell) depends on its connectivity in the yeast protein network.

Strongly connected proteins with central roles in the architecture of the network are 3 times as likely to be essential as proteins with few connections.

Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001)

Analysis of protein complexes in yeast (S. cerevisiae)

Gavin et al. Nature 415, 141 (2002)

Identify proteins by scanning the yeast protein database for proteins composed of fragments of suitable mass.

Here, the identified

proteins are listed

according to their

localization (a).

(b) lists the number of

proteins per complex.

V14 Protein Networks / Protein Complexes

Protein networks could be defined in a number of ways

- Co-regulated expression of genes/proteins

- Proteins participating in the same metabolic pathways

- Proteins sharing substrates

- Proteins that are co-localized

- Proteins that form permanent supracomplexes = protein machineries

- Proteins that bind each other transiently

(signal transduction, bioenergetics ... )

Overview

Statistical analysis of protein-protein interfaces in crystal structures of

protein-protein complexes: residues in interfaces have significantly different

amino acid composition than the rest of the protein.

predict protein-protein interaction sites from local sequence information

Conservation at protein-protein interfaces: interface regions are more conserved

than other regions on the protein surface

identify conserved regions on protein surface e.g. from solvent accessibility

Interacting residues on two binding partners often show correlated mutations (among different organisms) when they mutate

identify correlated mutations

Surface patterns of protein-protein interfaces: interface often formed by hydrophobic

patch surrounded by ring of polar or charged residues.

identify suitable patches on surface if 3D structure is known

1 Analysis of interfaces

PDB contains 1812 non-redundant

protein complexes

(less than 25% identity).

Results don't change significantly if

NMR structures, theoretical models,

or structures at lower resolution

(altogether 50%) are excluded.

Most interesting are the results for

transiently formed complexes.

How many PDB structures of

protein-protein complexes are

known? How many residues are

typically at an interface?

(ca. 10- 20) Ofran, Rost, J. Mol. Biol. 325, 377 (2003)

1 Properties of interfaces

Amino acid composition of six interface types. The propensities of all residues

found in SWISS-PROT were used as background. If the frequency of an amino

acid is similar to its frequency in SWISS-PROT, the height of the bar is close to

zero. Over-representation results in a positive bar, and under-representation

results in a negative bar. Ofran, Rost, J. Mol. Biol. 325, 377 (2003)

1 Pairing frequencies at interfaces

Residue–residue preferences.

(A) Intra-domain: hydrophobic core is clear

(B) domain–domain, (C) obligatory homo-

oligomers (homo-obligomers), (D) transient

homo-oligomers (homo-complexes), (E)

obligatory hetero-oligomers (hetero-

obligomers), and (F) transient hetero-

oligomers (hetero-complexes). A red square

indicates that the interaction occurs more

frequently than expected; a blue square

indicates that it occurs less frequently than

expected. The amino acid residues are

ordered according to hydrophobicity, with

isoleucine as the most hydrophobic and

arginine as the least hydrophobic.

Ofran, Rost, J. Mol. Biol. 325, 377 (2003)

3 Correlated mutations at interface

Pazos, Helmer-Citterich, Ausiello, Valencia J Mol Biol 271, 511 (1997):

correlation information is sufficient for selecting the correct structural arrangement of

known heterodimers and protein domains because the correlated pairs between the

monomers tend to accumulate at the contact interface.

Use same idea to identify interacting protein pairs.

Correlated mutations at interface

Correlated mutations evaluate the similarity in variation patterns between positions in

a multiple sequence alignment.

Similarity of those variation patterns is thought to be related to compensatory

mutations.

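Calculate for each pair of positions i and j in the sequence a rank correlation coefficient (r_ij):

$$r_{ij} = \frac{\sum_{k,l} (S_{ikl} - S_i)(S_{jkl} - S_j)}{\sqrt{\sum_{k,l} (S_{ikl} - S_i)^2}\,\sqrt{\sum_{k,l} (S_{jkl} - S_j)^2}}$$

Pazos, Valencia, Proteins 47, 219 (2002)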

where the summations run over every possible pair of proteins k and l in the multiple

sequence alignment.

Sikl is the ranked similarity between residue i in protein k and residue i in protein l.

Sjkl is the same for residue j.

Si and Sj are the means of Sikl and Sjkl.
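The correlation between two alignment columns can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of Pazos and Valencia: the similarity function is a hypothetical identity score (1 for matching residues, 0 otherwise) standing in for the ranked substitution-matrix similarity, and the toy alignment is invented.

```python
from itertools import combinations
from math import sqrt

def similarity(a, b):
    # Assumption: identity scoring replaces the ranked similarity values
    # of the original method.
    return 1.0 if a == b else 0.0

def correlated_mutation(msa, i, j):
    """Correlation r_ij between the variation patterns of columns i and j."""
    pairs = list(combinations(range(len(msa)), 2))  # every protein pair (k, l)
    s_i = [similarity(msa[k][i], msa[l][i]) for k, l in pairs]
    s_j = [similarity(msa[k][j], msa[l][j]) for k, l in pairs]
    mean_i = sum(s_i) / len(s_i)
    mean_j = sum(s_j) / len(s_j)
    num = sum((x - mean_i) * (y - mean_j) for x, y in zip(s_i, s_j))
    den = (sqrt(sum((x - mean_i) ** 2 for x in s_i))
           * sqrt(sum((y - mean_j) ** 2 for y in s_j)))
    return num / den if den > 0 else 0.0  # invariant columns carry no signal

# Hypothetical alignment: columns 0 and 1 change together, column 2 does not follow.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
r01 = correlated_mutation(toy_msa, 0, 1)
r02 = correlated_mutation(toy_msa, 0, 2)
```

In the toy alignment, columns 0 and 1 give r = 1 because every change in one column is accompanied by a change in the other.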

Correlated mutations at interface

Generate for protein i multiple sequence alignment of homologous proteins (HSSP

database).

Compare MSAs of two proteins, reduce them by leaving only sequences of

coincident species (delete rows).

i2h method

Schematic representation of the i2h method.

A: Family alignments are collected for two different proteins, 1 and 2, including corresponding sequences from different species (a, b, c, ...).

B: A virtual alignment is constructed, concatenating the sequences of the probable orthologous sequences of the two proteins. Correlated mutations are calculated.

C: The distributions of the correlation values are recorded. We used 10 correlation levels. The corresponding distributions are represented for the pairs of residues internal to the two proteins (P11 and P22) and for the pairs composed of one residue from each of the two proteins (P12).

Predictions from correlated mutations

Results obtained by i2h in a set of 14 two-domain proteins of known structure = proteins with two interacting domains. Treat the 2 domains as different proteins.

A: Interaction index for the 133 pairs with 11 or more sequences in common. The true positive hits are highlighted with filled squares.

B: Representation of i2h results, reminiscent of those obtained in the experimental yeast two-hybrid system. The diameter of the black circles is proportional to the interaction index; true pairs are highlighted with gray squares. Empty spaces correspond to those cases in which the i2h system could not be applied, because they contained <11 sequences from different species in common for the two domains.

In most cases, i2h scored the correct pair of protein domains above all other possible interactions.

4 Coevolutionary Analysis

Idea: if co-evolution is relevant, a ligand-receptor pair should occupy

related positions in phylogenetic trees.

Observe that for ligand-receptor pairs that are part of most large protein

families, the correlation between their phylogenetic distance matrices is

significantly greater than for uncorrelated protein families (Goh et al. 2000,

Pazos, Valencia, 2001).

Finer analysis (Goh & Cohen, 2002) shows that within these correlated

phylogenetic trees, the protein pairs that bind have a higher correlation between

their phylogenetic distance matrices than other homologs drawn from the ligand

and receptor families that do not bind.

Goh, Cohen J Mol Biol 324, 177 (2002)

Summary

There exists now a small zoo of promising experimental and theoretical methods to analyze the cellular interactome: which proteins interact with each other.

Problem 1: each method detects too few interactions (as seen by the fact that the

overlap between predictions of various methods is very small)

Problem 2: each method has an intrinsic error rate producing "false positives" and "false negatives".

Ideally, everything will converge to a big picture eventually.

Solving Problem 1 will help solving problem 2 by combining predictions.

Problem 1 can be partially solved by producing more data :-)

In the meantime, the value of network analysis (e.g. the identification of "isolated"

modules) is questionable to some extent.

Modularity in molecular networks?

A functional module is, by definition, a discrete entity whose function is

separable from those of other modules.

This separation depends on chemical isolation, which can originate from

spatial localization or from chemical specificity.

E.g. a ribosome concentrates the reactions involved in making a polypeptide

into a single particle, thus spatially isolating its function.

A signal transduction system is an extended module that achieves its isolation

through the specificity of the initial binding of the chemical signal to receptor

proteins, and of the interactions between signalling proteins within the cell.

Please give 3 reasons for the occurrence of functional modules in

biological cells.

Hartwell et al. Nature 402, C47 (1999)

Modularity in molecular networks

Modules can be insulated from or connected to each other.

Insulation allows the cell to carry out many diverse reactions without cross-talk

that would harm the cell.

Connectivity allows one function to influence another.

The higher-level properties of cells, such as their ability to integrate information

from multiple sources, will be described by the pattern of connections among their

functional modules.

Hartwell et al. Nature 402, C47 (1999)

Organization of large-scale molecular networks

Organization of molecular networks revealed by large-scale experiments:

- power-law distribution of the node degree k (i.e. the number of edges of a node): $P(k) \propto k^{-\gamma}$

- small-world property (i.e. a high clustering coefficient and a small shortest path

between every pair of nodes)

- anticorrelation in the node degree of connected nodes (i.e. highly interacting

nodes tend to be connected to low-interacting ones)

These properties become evident when hundreds or thousands of molecules and

their interactions are studied together.

On the other end of the spectrum: recently discovered motifs that consist of 3-4

nodes.

Mesoscale properties of networks

Most relevant processes in biological networks correspond to the

mesoscale (5-25 genes or proteins).

It is computationally enormously expensive to study mesoscale properties

of biological networks.

e.g. a network of 1000 nodes contains on the order of 1 × 10^23 possible 10-node sets.
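The size of this search space can be checked directly: the number of 10-node sets in a 1000-node network is the binomial coefficient "1000 choose 10". The short Python check below uses only the standard library; the variable name is an arbitrary choice.

```python
from math import comb

# Number of distinct 10-node subsets of a 1000-node network.
n_sets = comb(1000, 10)
print(f"{n_sets:.2e}")
```

The exact value lies between 2 × 10^23 and 3 × 10^23, which agrees with the order of magnitude quoted above and explains why exhaustive enumeration of mesoscale subgraphs is not feasible.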

Spirin & Mirny analyzed combined network of protein interactions with data from

CELLZOME, MIPS, BIND: 6500 interactions.

Identify connected subgraphs

The network of protein interactions is typically presented as an undirected graph

with proteins as nodes and protein interactions as undirected edges.

Aim: identify highly connected subgraphs (clusters) that have more interactions within themselves and fewer with the rest of the graph.

A fully connected subgraph, or clique, that is not a part of any other clique is

an example of such a cluster.

In general, clusters need not be fully connected.

Measure density of connections by

$$Q = \frac{2m}{n(n-1)}$$

where n is the number of proteins in the cluster

and m is the number of interactions between them.

Spirin, Mirny, PNAS 100, 12123 (2003)
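As a small illustration, the sketch below computes a cluster density in Python, assuming the measure is Q = 2m / (n(n-1)), i.e. the fraction of all possible pairs in the cluster that actually interact, so that Q = 1 for a clique. The function name and the toy edges are hypothetical.

```python
def cluster_density(cluster, edges):
    """Q = 2m / (n (n - 1)) for the subgraph induced by `cluster`."""
    nodes = set(cluster)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Count each undirected interaction inside the cluster once.
    inside = {frozenset(e) for e in edges
              if e[0] != e[1] and e[0] in nodes and e[1] in nodes}
    return 2 * len(inside) / (n * (n - 1))

# Hypothetical toy interactions: a triangle A-B-C with a pendant protein D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
q_triangle = cluster_density(["A", "B", "C"], toy_edges)
q_all = cluster_density(["A", "B", "C", "D"], toy_edges)
```

The triangle is fully connected (Q = 1), while adding the pendant protein D lowers the density to 4 of 6 possible pairs.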

(method I) Identify all fully connected subgraphs (cliques)

Generally, finding all cliques of a graph is an NP-hard problem.

Because the protein interaction graph is so far very sparse (the number of interactions (edges) is similar to the number of proteins (nodes)), this can be done quickly.

To find cliques of size n one needs to enumerate only the cliques of size n-1.

The search for cliques starts with n = 4: pick all (known) pairs of edges (6500 × 6500 protein interactions) successively.

For every pair A-B and C-D check whether there are edges between A and C, A and

D, B and C, and B and D. If these edges are present, ABCD is a clique.

For every clique identified, ABCD, pick all known proteins successively.

For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,

then ABCDE is a clique with size 5.

Continue for n = 6, 7, ... The largest clique found in the protein-interaction network

has size 14. Spirin, Mirny, PNAS 100, 12123 (2003)
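The enumeration steps above can be sketched in Python as follows. This is a minimal illustration of the described procedure (size-4 cliques from pairs of edges, then stepwise extension), not the code of Spirin and Mirny; all function names and the toy graph are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def four_cliques(edges, adj):
    """For every pair of edges A-B and C-D, check the four cross edges."""
    found = set()
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # the two edges share a protein
        if c in adj[a] and d in adj[a] and c in adj[b] and d in adj[b]:
            found.add(frozenset((a, b, c, d)))
    return found

def extend(cliques, adj):
    """Grow each clique of size n-1 by any protein that binds all its members."""
    bigger = set()
    for clique in cliques:
        for node, partners in adj.items():
            if node not in clique and clique <= partners:
                bigger.add(clique | {node})
    return bigger

def all_cliques(edges):
    """Return {size: set of cliques} for all cliques of size 4 and larger."""
    adj = build_adjacency(edges)
    by_size, size = {}, 4
    current = four_cliques(edges, adj)
    while current:
        by_size[size] = current
        current = extend(current, adj)
        size += 1
    return by_size

# Hypothetical toy graph: a complete graph on A..E plus a pendant protein F.
toy_edges = list(combinations("ABCDE", 2)) + [("E", "F")]
by_size = all_cliques(toy_edges)
```

On the toy graph this finds five cliques of size 4 and one clique of size 5. Checking all pairs of edges is quadratic in the number of interactions, which is only practical because the interaction graph is sparse.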

(I) Identify all fully connected subgraphs (cliques)

These results include, however, many redundant cliques.

For example, the clique with size 14 contains 14 cliques with size 13.

To find all nonredundant subgraphs, mark all proteins comprising the clique of size

14, and out of all subgraphs of size 13 pick those that have at least one protein

other than marked.

After all redundant cliques of size 13 are removed, proceed to remove redundant

twelves etc.

In total, only 41 nonredundant cliques with sizes 4 - 14 were found.
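Removing redundant cliques can be sketched as a subset test, working from the largest size downwards. The Python below is a hedged illustration: it drops every clique that is fully contained in an already kept larger clique, which is one way to implement the "at least one protein other than marked" rule. The function name and the input layout (a dict from size to a set of frozensets) are assumptions.

```python
from itertools import combinations

def nonredundant(by_size):
    """Keep only cliques that are not contained in a larger kept clique."""
    kept = []
    for size in sorted(by_size, reverse=True):
        for clique in by_size[size]:
            if not any(clique < big for big in kept):
                kept.append(clique)
    return kept

# Hypothetical input: one 5-clique, its five 4-subcliques, and an unrelated 4-clique.
toy = {
    5: {frozenset("ABCDE")},
    4: {frozenset(c) for c in combinations("ABCDE", 4)} | {frozenset("WXYZ")},
}
kept = nonredundant(toy)
```

Only the 5-clique and the unrelated 4-clique survive; the five 4-cliques inside ABCDE are discarded as redundant.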

Describe an algorithm to detect all fully-connected subgraphs (cliques) in the

network of protein-protein interactions.

(method III) Monte Carlo Simulation

Use MC to find a tight subgraph of a predetermined number of nodes M.

At time t = 0, a random set of M nodes is selected.

For each pair of nodes i,j from this set, the shortest path Lij between i and j on the

graph is calculated.

Denote the sum of all shortest paths Lij from this set as L0.

At every time step one of M nodes is picked at random, and one node is picked at

random out of all its neighbors.

The new sum of all shortest paths, L1, is calculated if the original node were to be

replaced by this neighbor.

If L1 < L0, accept replacement with probability 1.

If L1 > L0, accept replacement with probability

$$\exp\left(-\frac{L_1 - L_0}{T}\right)$$

where T is the effective temperature.

(III) Monte Carlo Simulation

Every tenth time step an attempt is made to replace one of the nodes from

the current set with a node that has no edges to the current set to avoid

getting caught in an isolated disconnected subgraph.

This process is repeated

(i) until the original set converges to a complete subgraph, or

(ii) for a predetermined number of steps,

after which the tightest subgraph (the subgraph corresponding to the smallest

L0) is recorded.

The recorded clusters are merged and redundant clusters are removed.
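A simplified version of this Monte Carlo search might look as follows in Python. It is a sketch under several assumptions: the acceptance probability for an unfavourable move is taken to be exp(-(L1 - L0)/T), unreachable pairs are penalised with a path length equal to the number of nodes, replacement candidates are restricted to neighbours outside the current set, and the every-tenth-step jump move is omitted. All names and the toy graph are hypothetical.

```python
import math
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def total_path_length(adj, nodes, unreachable):
    """Sum of shortest paths L_ij over all pairs i, j in the node set."""
    nodes = list(nodes)
    total = 0
    for idx, a in enumerate(nodes):
        dist = bfs_distances(adj, a)
        total += sum(dist.get(b, unreachable) for b in nodes[idx + 1:])
    return total

def mc_tight_subgraph(adj, m, temperature, steps, seed=0):
    rng = random.Random(seed)
    unreachable = len(adj)  # assumed penalty for disconnected pairs
    current = set(rng.sample(sorted(adj), m))
    l0 = total_path_length(adj, current, unreachable)
    best, best_l = set(current), l0
    for _ in range(steps):
        old = rng.choice(sorted(current))
        candidates = sorted(adj[old] - current)  # neighbours not yet in the set
        if not candidates:
            continue
        trial = (current - {old}) | {rng.choice(candidates)}
        l1 = total_path_length(adj, trial, unreachable)
        # Metropolis rule: always accept improvements, sometimes accept worse sets.
        if l1 <= l0 or rng.random() < math.exp(-(l1 - l0) / temperature):
            current, l0 = trial, l1
            if l0 < best_l:
                best, best_l = set(current), l0
    return best, best_l

# Hypothetical toy graph: a 4-clique A..D with a chain D-E-F-G attached.
toy_adj = {"A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
           "D": {"A", "B", "C", "E"}, "E": {"D", "F"}, "F": {"E", "G"}, "G": {"F"}}
best, best_l = mc_tight_subgraph(toy_adj, m=4, temperature=4.0, steps=500)
```

A complete subgraph of M nodes has the minimal possible sum M(M-1)/2, so for M = 4 a value of 6 would indicate that a clique was reached.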

Optimal temperature in MC simulation

For every cluster size there is an

optimal temperature that gives the

fastest convergence to the tightest

subgraph.

Time to find a clique with size 7 in MC steps

per site as a function of temperature T.

The region with optimal temperature is

shown in Inset.

The required time increases sharply as the

temperature goes to 0, but has a relatively

wide plateau in the region 3 < T < 7.

Simulations suggest that the choice of

temperature T ≈ M would be safe for any

cluster size M.

Merging Overlapping Clusters

A simple statistical test shows that nodes which have only one link to a cluster are

statistically insignificant. Clean such statistically insignificant members first.

Then merge overlapping clusters:

For every cluster Ai find all clusters Ak that overlap with this cluster by at least one

protein.

For every such found cluster calculate Q value of a possible merged cluster

Ai U Ak . Record cluster Abest(i) which gives the highest Q value if merged with Ai.

After the best match is found for every cluster, every cluster Ai is replaced by a

merged cluster Ai U Abest(i) unless Ai U Abest(i) is below a certain threshold value

for QC.

This process continues until there are no more overlapping clusters or until merging

any of the remaining clusters will make a cluster with Q value lower than QC.
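The merging loop can be illustrated with a simplified greedy variant in Python. Unlike the procedure described above, this sketch merges only the single best overlapping pair per round instead of pairing every cluster with its best partner at once, and it skips the initial cleaning of single-link members. It also assumes Q = 2m / (n(n-1)) as the density measure. Function names, the threshold value, and the toy data are hypothetical.

```python
def density(nodes, edge_set):
    """Q = 2m / (n (n - 1)) for the subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(1 for e in edge_set if e <= nodes)
    return 2 * m / (n * (n - 1))

def merge_clusters(clusters, edges, q_min):
    """Repeatedly merge the overlapping pair with the highest merged Q >= q_min."""
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    clusters = {frozenset(c) for c in clusters}
    while True:
        best = None
        for a in clusters:
            for b in clusters:
                if a != b and a & b:  # clusters share at least one protein
                    q = density(a | b, edge_set)
                    if q >= q_min and (best is None or q > best[0]):
                        best = (q, a, b)
        if best is None:
            return clusters  # nothing left to merge above the threshold
        _, a, b = best
        clusters = (clusters - {a, b}) | {a | b}

# Hypothetical toy data: a 4-clique A..D and a triangle D-E-F sharing protein D.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
             ("C", "D"), ("D", "E"), ("E", "F"), ("D", "F")]
merged = merge_clusters([{"A", "B", "C"}, {"B", "C", "D"}, {"D", "E", "F"}],
                        toy_edges, q_min=0.8)
```

Here the two overlapping triangles merge into the 4-clique (Q = 1), while merging that clique with the triangle D-E-F would give Q = 0.6 and is rejected.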

Statistical significance of complexes and modules

Number of complete cliques (Q = 1) as

a function of clique size enumerated in

the network of protein interactions

(red) and in randomly rewired graphs

(blue, averaged over >1,000 graphs where

number of interactions for each protein

is preserved).

Inset shows the same plot in log-

normal scale. Note the dramatic

enrichment in the number of cliques in

the protein-interaction graph

compared with the random graphs.

Most of these cliques are parts of

bigger complexes and modules.

Draw the distribution of cliques in the plot.

Statistical significance of complexes and modules

Distribution of Q of clusters found by the MC search

method.

Red bars: original network of protein interactions.

Blue curves: randomly rewired graphs.

Clusters in the protein network have many more

interactions than their counterparts in the

random graphs.

Discovered functional modules

Examples of discovered functional modules.

(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and

CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).

Although they have many interactions, these proteins are not present in the cell at the same time.

(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This

module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-

activated protein kinase kinase) kinases, as well as other proteins involved in signal

transduction. These proteins do not form a single complex; rather, they interact in a specific

order.

Assuming that functional modules are identified from analyzing protein-protein

interaction networks: will all proteins belonging to a functional module always be present

simultaneously?

Robustness of clusters found

Model effect of false positives in

experimental data: randomly reconnect,

remove or add 10-50% of interactions

in network.

Cluster recovery probability as a

function of the fraction of altered links.

Black curves correspond to the case

when a fraction of links are rewired.

Red, removed;

green, added.

Circles represent the probability to

recover 75% of the original cluster;

triangles represent the probability to

recover 50%.

Noise in the form of removal or addition of links has a less deteriorating effect

than random rewiring. About 75% of

clusters can still be found when 10% of

links are rewired.

Summary

Here: analysis of meso-scale properties demonstrated the presence of highly

connected clusters of proteins in a network of protein interactions. Strong

support for suggested modular architecture of biological networks.

Distinguish 2 types of clusters: protein complexes and dynamic functional

modules.

Both complexes and modules have more interactions among their members

than with the rest of the network.

Dynamic modules are elusive to experimental purification because they are

not assembled as a complex at any single point in time.

Computational analysis allows detection of such modules by integrating

pairwise molecular interactions that occur at different times and places.

However, computational analysis alone does not allow one to distinguish

between complexes and modules or between transient and simultaneous

interactions.

Evolution of the yeast protein interaction network

How do biological networks develop?

So far, the protein interaction network of yeast is one of the best characterized

networks.

Parts of this network should be inherited from the last common ancestor of the

three domains of life: Eubacteria, Archaea, and Eukaryotes.

Use again graph theory to model the yeast protein interaction network.

Proteins = nodes, pairwise interactions = link between two nodes.

Evolution can be inferred by analyzing the growth pattern of the graph.

Classify all nodes (proteins) into isotemporal categories based on each

protein's orthologous hits in the COG database.

Qin et al. PNAS 100, 12820 (2003)

Isotemporal categories are designed

through a binary (b) coding scheme.

The b code represents the

distribution of each yeast protein's

orthologs in the universal tree of life.

Bit value 1 indicates the presence of

at least one orthologous hit for a

yeast protein in a corresponding

group of genomes, and bit value 0

indicates the absence of any

orthologous hit. The presented

example is 110011 in the b format

and 51 in the d format. Orthologous

identifications are based on COGs at

NCBI and in von Mering et al. (2002).

Qin et al. PNAS 100, 12820 (2003)
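The relation between the binary (b) and decimal (d) formats is a plain base-2 conversion, which the following Python sketch illustrates. The function names are hypothetical, and the default of six genome groups is an assumption taken from the six-bit example 110011.

```python
def b_code_from_hits(hits):
    """Build the b code from one presence/absence flag per genome group."""
    return "".join("1" if hit else "0" for hit in hits)

def b_to_d(b_code):
    """Convert the binary b format to the decimal d format."""
    return int(b_code, 2)

def d_to_b(d_code, n_groups=6):
    """Convert back, padding with leading zeros to n_groups bits (assumed 6)."""
    return format(d_code, f"0{n_groups}b")

example = b_code_from_hits([True, True, False, False, True, True])
print(example, b_to_d(example))  # 110011 51
```

Proteins with the same code have orthologs in the same groups of genomes and therefore fall into the same isotemporal category.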

Previously, phylogenetic profiles were used to detect protein interaction partners. Here, use phylogenetic profiles to detect modules.

Interaction patterns.

Z scores for all possible interactions

of the isotemporal categories in the

protein interaction network.

For categories i and j,

$$Z_{i,j} = \frac{F^{obs}_{i,j} - F^{mean}_{i,j}}{\sigma_{i,j}}$$

where $F^{obs}_{i,j}$ is the observed number of interactions, and $F^{mean}_{i,j}$ and $\sigma_{i,j}$ are the average number of interactions and the SD, respectively, in 10,000 MS02 null models.

Qin et al. PNAS 100, 12820 (2003)
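Computing such a Z score from an observed count and a set of null-model counts takes only a few lines of Python. The sketch below assumes Z = (observed - mean) / SD and uses the population standard deviation, which is an assumption; the source does not say which SD estimator was used. The null-model counts would come from the randomized networks and are represented here by a hypothetical list.

```python
from statistics import mean, pstdev

def z_score(observed, null_counts):
    """Z = (observed - mean of null counts) / SD of null counts."""
    mu = mean(null_counts)
    sigma = pstdev(null_counts)  # assumption: population SD
    if sigma == 0:
        raise ValueError("null model has zero variance; Z is undefined")
    return (observed - mu) / sigma

# Hypothetical numbers: 10 observed interactions, null models give 4 and 6.
z = z_score(10, [4, 6])
```

A large positive Z means that the two categories interact more often than expected from the null models.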

The diagonal distribution of large positive Z scores indicates that yeast

proteins tend to interact with proteins from the same or closely related

isotemporal categories.

The observed intracategory association tendencies are consistent with the

intuitive notion that a new function likely requires a group of new proteins,

and that the growth of the protein interaction network is under functional

constraints.

Although the turnover rate of the protein interaction network is suggested to

be very fast, these results suggest that many isotemporal clusters can still

remain well preserved during evolution.

The formation and conservation of isotemporal clusters during evolution

may be the consequence of selection for the modular organization of the

protein interaction network.

The progressive nature of the network evolution and significant isotemporal

clustering may have contributed to the hierarchical organization of

modularity in biological networks in general.

Qin et al. PNAS 100, 12820 (2003)

V17, V18

look at Tihamer's overheads: basics of network types are important for exam

Computational Studies of Metabolic Networks - Introduction

Different levels for describing metabolic networks:

- classical biochemical pathways (glycolysis, TCA cycle, ...)

- stoichiometric modelling (flux balance analysis): theoretical capabilities of an

integrated cellular process, feasible metabolic flux distributions

- automatic decomposition of metabolic networks

(elementary modes, extreme pathways ...)

- kinetic modelling (E-Cell ...) problem: general lack of kinetic information

on the dynamics and regulation of cellular metabolism

As a primer today EcoCyc:

Global Properties of the Metabolic Map of E. coli,

Ouzounis, Karp, Genome Research 10, 568 (2000)

EcoCyc Database

Genetic complement of E.coli: 4.7 million DNA bases.

How can we characterize the functional complement of E.coli and according to

what criteria can we compare the biochemical networks of two organisms?

EcoCyc contains the metabolic map of E.coli defined as the set of all known

pathways, reactions and enzymes of E.coli small-molecule metabolism.

Analyze

- the connectivity relationships of the metabolic network

- its partitioning into pathways

- enzyme activation and inhibition

- repetition and multiplicity of elements such as enzymes, reactions, and substrates.

Ouzounis, Karp, Genome Res. 10, 568 (2000)

Reactions

The number of reactions (744) and the number of enzymes (607) differ in

EcoCyc. WHY??

(1) there is no one-to-one mapping between enzymes and reactions –

some enzymes catalyze multiple reactions, and some reactions are catalyzed

by multiple enzymes.

(2) for some reactions known to be catalyzed by E.coli, the enzyme has not

yet been identified.

Compounds

The 744 reactions of E.coli small-molecule metabolism involve a total of 791

different substrates.

On average, each reaction contains 4.0 substrates.

Number of reactions containing varying numbers of substrates (reactants plus products).

Pathways

EcoCyc describes 131 pathways:

energy metabolism

nucleotide and amino acid biosynthesis

secondary metabolism

Pathways vary in length from a

single reaction step to 16 steps

with an average of 5.4 steps.

Fill the distribution of pathway

lengths in EcoCyc into the plot:

Length distribution of EcoCyc pathways

Pathways

However, there is no precise

biological definition of a

pathway.

The partitioning of the

metabolic network into

pathways (including the well-

known examples of

biochemical pathways) is

somewhat arbitrary.

These decisions of course also

affect the distribution of

pathway lengths.

Protein Subunits

A unique property of EcoCyc is that it explicitly encodes the subunit organization of

proteins.

Therefore, one can ask questions such as:

Are protein subunits encoded by neighboring genes?

Interestingly, this is the case for > 80% of known heteromeric enzymes.

Reactions Catalyzed by More Than one Enzyme

Diagram showing the number of reactions

that are catalyzed by one or more enzymes.

Most reactions are catalyzed by one enzyme,

some by two, and very few by more than two

enzymes.

For 84 reactions, the corresponding enzyme is not yet encoded in EcoCyc.

What may be the reasons for isozyme redundancy?

(1) the enzymes that catalyze the same reaction are homologs and have duplicated (or were obtained by horizontal gene transfer), acquiring some specificity but retaining the same mechanism (divergence).

(2) the reaction is easily "invented"; therefore, there is more than one protein family that is independently able to perform the catalysis (convergence).

Enzymes that catalyze more than one reaction

Genome predictions usually assign a single enzymatic function.

However, E.coli is known to contain many multifunctional enzymes.

Of the 607 E.coli enzymes, 100 are multifunctional, either having the same active

site and different substrate specificities or different active sites.

Number of enzymes that catalyze one or

more reactions. Most enzymes catalyze

one reaction; some are multifunctional.

The enzymes that catalyze 7 and 9 reactions are purine nucleoside phosphorylase

and nucleoside diphosphate kinase.

Take-home message: The high proportion of multifunctional enzymes implies that

the genome projects significantly underpredict multifunctional enzymes!

Reactions participating in more than one pathway

The 99 reactions belonging to multiple

pathways appear to be the intersection

points in the complex network of chemical

processes in the cell.

E.g. the reaction present in 6 pathways corresponds to the reaction catalyzed by

malate dehydrogenase, a central enzyme in cellular metabolism.

Ouzounis, Karp,

Genome Res. 10, 568 (2000)

Implications of EcoCyc Analysis

Although 30% of E.coli genes remain unidentified, enzymes are the best studied

and easily identifiable class of proteins.

Therefore, few new enzymes can be expected to be discovered.

The metabolic map presented may be 90% complete.

Implication for metabolic maps derived from automatic genome annotation:

automatic annotation does generally not identify multifunctional proteins.

The network complexity may therefore be underestimated.

EcoCyc results often cannot be obtained from protein or nucleic acid sequence

databases because they store protein functions using text descriptions.

E.g. sequence databases don't include precise information about subunit

organization of proteins.

Stoichiometric matrix

Stoichiometric matrix:

A matrix with reaction stoichiometries as columns and metabolite participations as rows.

The stoichiometric matrix is an important part of the in silico model.

With the matrix, the methods

of extreme pathway and

elementary mode analyses

can be used to generate a

unique set of pathways P1,

P2, and P3 (see future

lecture).

Papin et al. TIBS 28, 250 (2003)
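As an illustration of how such a matrix is assembled, the Python sketch below builds a stoichiometric matrix with one column per reaction and one row per metabolite. The toy reactions and all names are hypothetical and do not correspond to the network in Papin et al.; negative coefficients mark consumed metabolites, positive ones produced metabolites.

```python
def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction names, S) with S[i][j] = coefficient
    of metabolite i in reaction j."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    names = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix

# Hypothetical linear pathway: uptake of A, A -> B, B -> C, export of C.
toy_reactions = {
    "uptake_A": {"A": 1},
    "A_to_B": {"A": -1, "B": 1},
    "B_to_C": {"B": -1, "C": 1},
    "export_C": {"C": -1},
}
metabolites, names, S = stoichiometric_matrix(toy_reactions)
for metabolite, row in zip(metabolites, S):
    print(metabolite, row)
```

Each row can be read as the mass balance of one metabolite, each column as the stoichiometry of one reaction.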

Flux balancing

Any chemical reaction requires mass conservation.

Therefore one may analyze metabolic systems by requiring mass conservation.

Only required: knowledge about stoichiometry of metabolic pathways and metabolic demands.

For each metabolite:

$$\frac{dX_i}{dt} = V_{synthesized} - V_{degraded} - (V_{used} \pm V_{transported})$$

Under steady-state conditions, the mass balance constraints in a metabolic network can be represented mathematically by the matrix equation:

S · v = 0

where the matrix S is the m × n stoichiometric matrix, m = the number of metabolites and n = the number of reactions in the network.

The vector v represents all fluxes in the metabolic network, including the internal fluxes, transport fluxes and the growth flux.

Flux balance analysis

Since the number of metabolites is generally smaller than the number of reactions

(m < n) the flux-balance equation is typically underdetermined.

Therefore there are generally multiple feasible flux distributions that satisfy the mass

balance constraints.

The set of solutions is confined to the nullspace of matrix S.
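The underdetermined character of the steady-state condition S · v = 0 can be seen on a very small example. The Python sketch below uses a hypothetical network with two metabolites and four reactions (m < n) and checks which flux vectors satisfy the mass balance; the network and all names are assumptions for illustration.

```python
def mat_vec(S, v):
    """Matrix-vector product S · v."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite is produced and consumed at the same rate."""
    return all(abs(x) <= tol for x in mat_vec(S, v))

# Hypothetical network. Columns: uptake of A, A -> B, export of A, export of B.
# Rows: metabolites A and B.
S = [
    [1, -1, -1, 0],
    [0, 1, 0, -1],
]

v_via_b = [1, 1, 0, 1]   # all material flows through B
v_via_a = [1, 0, 1, 0]   # all material leaves directly as A
v_mixed = [2, 1, 1, 1]   # sum of the two: also balanced
v_bad = [1, 1, 1, 1]     # A is consumed faster than it is taken up

for v in (v_via_b, v_via_a, v_mixed, v_bad):
    print(v, is_steady_state(S, v))
```

Several different flux distributions satisfy the same mass balance, and sums of solutions are again solutions, which is why additional constraints or an objective function are needed to select one of them.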

To find the „true“ biological flux in cells ( e.g. Heinzle, Huber, UdS) one needs

additional (experimental) information,

or one may impose constraints

on the magnitude of each individual metabolic flux.

The intersection of the nullspace and the region defined by those linear inequalities

defines a region in flux space = the feasible set of fluxes.

Feasible solution set for a metabolic reaction network

(A) The steady-state operation of the metabolic network is restricted to the

region within a cone, defined as the feasible set. The feasible set contains

all flux vectors that satisfy the physicochemical constrains. Thus, the

feasible set defines the capabilities of the metabolic network. All feasible

metabolic flux distributions lie within the feasible set, and

(B) in the limiting case, where all constraints on the metabolic network are

known, such as the enzyme kinetics and gene regulation, the feasible set

may be reduced to a single point. This single point must lie within the

feasible set.

E.coli in silico

Edwards & Palsson, PNAS 97, 5528 (2000)

Define $\alpha_i = 0$ for irreversible internal fluxes, $\alpha_i = -\infty$ for reversible internal fluxes (use biochemical literature).

Transport fluxes for PO4^2-, NH3, CO2, SO4^2-, K+, Na+ were unrestrained.

For other metabolites

$$0 \le v_i \le v_i^{max}$$

except for those that are able to leave the metabolic network (i.e. acetate, ethanol, lactate, succinate, formate, pyruvate etc.)

Find a particular metabolic flux distribution within the feasible set by linear programming.

LP finds a solution that minimizes a particular metabolic objective (subject to the imposed constraints) -Z where

$$Z = \sum_i c_i v_i = \langle c \cdot v \rangle$$

In fact, the method finds the solution that maximizes fluxes = gives maximal biomass.

Interpretation of gene deletion results

The essential gene products were involved in the 3-carbon stage of glycolysis, 3

reactions of the TCA cycle, and several points within the PPP.

The remainder of the central metabolic genes could be removed while E.coli in

silico maintained the potential to support cellular growth.

This suggests that a large number of the central metabolic genes can be

removed without eliminating the capability of the metabolic network to

support growth under the conditions considered.

Edwards & Palsson PNAS 97, 5528 (2000)

Summary

FBA analysis constructs the optimal network utilization simply using

stoichiometry of metabolic reactions and capacity constraints.

For E.coli the in silico results are consistent with experimental data.

FBA shows that in the E.coli metabolic network there are relatively few critical

gene products in central metabolism.

However, the ability to adjust to different environments (growth conditions) may

be diminished by gene deletions.

FBA identifies „the best“ the cell can do, not how the cell actually behaves under a

given set of conditions. Here, survival was equated with growth.

FBA does not directly consider regulation or regulatory constraints on the

metabolic network. This can be treated separately (see future lecture).

Edwards & Palsson PNAS 97, 5528 (2000)

Extreme Pathways

introduced into metabolic analysis by the lab of Bernard Palsson

(Dept. of Bioengineering, UC San Diego). The publications of this lab

are available at http://gcrg.ucsd.edu/publications/index.html

Extreme pathway

technique is based

on the stoichiometric

matrix representation

of metabolic networks.

All external fluxes are

defined as pointing outwards.

Schilling, Letscher, Palsson,

J. theor. Biol. 203, 229 (2000)

Extreme Pathways – algorithm - setup

The algorithm to determine the set of extreme pathways for a reaction network

follows the principles of algorithms for finding the extremal rays/ generating

vectors of convex polyhedral cones.

Combine the n × n identity matrix (I) with the transpose of the stoichiometric

matrix $S^T$. I serves for bookkeeping.

J. theor. Biol. 203, 229 (2000)

separate internal and external fluxes

Examine constraints on each of the exchange fluxes as given by

$\alpha_j \le b_j \le \beta_j$

If the exchange flux is constrained to be positive do nothing,

if the exchange flux is constrained to be negative multiply the corresponding

row of the initial matrix by -1.

If the exchange flux is unconstrained move the entire row to a temporary

matrix T(E). This completes the first tableau T(0).

T(0) and T(E) for the example reaction system are shown on the previous slide.

Each element of these matrices will be designated Tij.

Starting with x = 1 and T(0) = T(x-1) the next tableau is generated in the following way:

J. theor. Biol. 203, 229 (2000)

idea of algorithm

(1) Identify all metabolites that do not have an unconstrained exchange flux

associated with them.

The total number of such metabolites is denoted by μ.

For the example, this is only the case for metabolite C (μ = 1).

What is the main idea?

- We want to find balanced extreme pathways

that don‘t change the concentrations of

metabolites when flux flows through

(input fluxes are channelled to products not to

accumulation of intermediates).

- The stoichiometric matrix describes the coupling of each reaction to the

concentration of metabolites X.

- Now we need to balance combinations of reactions that leave concentrations

unchanged. Pathways applied to metabolites should not change their

concentrations, i.e. the matrix entries

need to be brought to 0.

Schilling, Letscher, Palsson,

J. theor. Biol. 203, 229 (2000)

keep pathways that do not change concentrations of internal metabolites

(2) Begin forming the new matrix T(x) by copying

all rows from T(x – 1) which contain a zero in the

column of ST that corresponds to the first

metabolite identified in step 1, denoted by index c.

(Here 3rd column of ST.)

Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)

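T(0) =

 v1 v2 v3 v4 v5 v6 |  A  B  C  D  E
  1  0  0  0  0  0 | -1  1  0  0  0
  0  1  0  0  0  0 |  0 -1  1  0  0
  0  0  1  0  0  0 |  0  1 -1  0  0
  0  0  0  1  0  0 |  0  0 -1  1  0
  0  0  0  0  1  0 |  0  0  1 -1  0
  0  0  0  0  0  1 |  0  0 -1  0  1

T(1) (after step 2) =

  1  0  0  0  0  0 | -1  1  0  0  0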

balance combinations of other pathways

(3) Of the remaining rows in T(x-1) add together

all possible combinations of rows which contain

values of the opposite sign in column c, such that

the addition produces a zero in this column.

Schilling, et al.

JTB 203, 229

T(1) =

 v1 v2 v3 v4 v5 v6 |  A  B  C  D  E
  1  0  0  0  0  0 | -1  1  0  0  0
  0  1  1  0  0  0 |  0  0  0  0  0
  0  1  0  1  0  0 |  0 -1  0  1  0
  0  1  0  0  0  1 |  0 -1  0  0  1
  0  0  1  0  1  0 |  0  1  0 -1  0
  0  0  0  1  1  0 |  0  0  0  0  0
  0  0  0  0  1  1 |  0  0  0 -1  1
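Steps (2) and (3) can be sketched in a few lines of Python. This is only an illustration for the example system; the function name `zero_column` is an assumption made here, and the reaction directions in the comments are read off the rows of $S^T$.

```python
from itertools import combinations

# S^T of the example system: rows v1..v6, columns metabolites A..E.
S_T = [
    [-1,  1,  0,  0,  0],  # v1: A -> B
    [ 0, -1,  1,  0,  0],  # v2: B -> C
    [ 0,  1, -1,  0,  0],  # v3: C -> B
    [ 0,  0, -1,  1,  0],  # v4: C -> D
    [ 0,  0,  1, -1,  0],  # v5: D -> C
    [ 0,  0, -1,  0,  1],  # v6: C -> E
]
n = len(S_T)
# T(0): n x n identity (bookkeeping) next to S^T.
T0 = [[int(i == j) for j in range(n)] + S_T[i] for i in range(n)]


def zero_column(tableau, col):
    """Steps 2 and 3 for one metabolite column `col` (0-based index)."""
    # Step 2: copy every row that already has a zero in this column.
    new_tableau = [row[:] for row in tableau if row[col] == 0]
    # Step 3: add all pairs of remaining rows with opposite signs, using
    # positive weights so that the entry in `col` cancels.
    remaining = [row for row in tableau if row[col] != 0]
    for a, b in combinations(remaining, 2):
        if a[col] * b[col] < 0:
            wa, wb = abs(b[col]), abs(a[col])
            new_tableau.append([wa * x + wb * y for x, y in zip(a, b)])
    return new_tableau


# Metabolite C is the 3rd column of S^T, i.e. index n + 2 in the tableau.
T1 = zero_column(T0, n + 2)
for row in T1:
    print(row)
```

The sketch omits the independence check of step (4), which has to follow each such iteration.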

remove “non-orthogonal” pathways

(4) For all of the rows added to T(x) in steps 2 and 3 check to make sure that no

row exists that is a non-negative combination of any other sets of rows in T(x) .

One method used is as follows:

let A(i) = set of column indices j for which the elements of row i = 0.

Then check to determine if there exists another row (h) for which A(i) is a subset of A(h).

If $A(i) \subseteq A(h)$, $i \ne h$, where

$A(i) = \{ j : T_{i,j} = 0,\ 1 \le j \le (n+m) \}$

then row i must be eliminated from T(x)

For the example above:

A(1) = {2,3,4,5,6,9,10,11}
A(2) = {1,4,5,6,7,8,9,10,11}
A(3) = {1,3,5,6,7,9,11}
A(4) = {1,3,4,5,7,9,10}
A(5) = {1,2,3,6,7,8,9,10,11}
A(6) = {1,2,3,4,7,8,9}

Schilling et al.

JTB 203, 229
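The zero-set test of step (4) can be sketched as follows. The function names are assumptions made for this illustration, and one design choice goes beyond the slide: rows with identical zero sets are treated as duplicates and only the first one is kept, so that two equal rows do not eliminate each other.

```python
def zero_set(row):
    """A(i): 1-based column indices where the row is zero."""
    return {j for j, value in enumerate(row, start=1) if value == 0}


def drop_dependent_rows(tableau):
    """Remove every row i whose zero set A(i) is contained in another A(h)."""
    zero_sets = [zero_set(row) for row in tableau]
    kept = []
    for i, row in enumerate(tableau):
        redundant = any(
            # proper subset, or an identical zero set seen earlier (assumption)
            zero_sets[i] < zero_sets[h] or (zero_sets[i] == zero_sets[h] and h < i)
            for h in range(len(tableau)) if h != i
        )
        if not redundant:
            kept.append(row)
    return kept


# T(1) of the example system (columns v1..v6, then metabolites A..E).
T1 = [
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, -1, 1],
]

# A deliberately redundant row: the sum of rows 2 and 3 of T(1).
extra = [x + y for x, y in zip(T1[1], T1[2])]
filtered = drop_dependent_rows(T1 + [extra])
print(len(T1), len(filtered))
```

For the example, none of the rows of T(1) is eliminated, while the artificially added sum of two rows is removed.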

repeat steps for all internal metabolites

(5) With the formation of T(x) complete steps 2 – 4 for all of the metabolites that do

not have an unconstrained exchange flux operating on the metabolite,

incrementing x by one up to μ. The final tableau will be T(μ).

Note that the number of rows in T(μ) will be equal to k, the number of extreme

pathways.

Schilling et al.

JTB 203, 229

balance external fluxes

(6) Next we append T(E) to the bottom of T(μ). (In the example here μ = 1.)

This results in the following tableau:

Schilling et al.

JTB 203, 229

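T(1/E) =

 v1 v2 v3 v4 v5 v6 b1 b2 b3 b4 |  A  B  C  D  E
  1  0  0  0  0  0  0  0  0  0 | -1  1  0  0  0
  0  1  1  0  0  0  0  0  0  0 |  0  0  0  0  0
  0  1  0  1  0  0  0  0  0  0 |  0 -1  0  1  0
  0  1  0  0  0  1  0  0  0  0 |  0 -1  0  0  1
  0  0  1  0  1  0  0  0  0  0 |  0  1  0 -1  0
  0  0  0  1  1  0  0  0  0  0 |  0  0  0  0  0
  0  0  0  0  1  1  0  0  0  0 |  0  0  0 -1  1
 (rows of T(E):)
  0  0  0  0  0  0  1  0  0  0 | -1  0  0  0  0
  0  0  0  0  0  0  0  1  0  0 |  0 -1  0  0  0
  0  0  0  0  0  0  0  0  1  0 |  0  0  0 -1  0
  0  0  0  0  0  0  0  0  0  1 |  0  0  0  0 -1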

(7) Starting in the n+1 column (or the first non-zero column on the right side),

if Ti,(n+1) ≠ 0 then add the corresponding non-zero row from T(E) to row i so as to

produce 0 in the n+1-th column.

This is done by simply multiplying the corresponding row in T(E) by Ti,(n+1) and

adding this row to row i .

Repeat this procedure for each of the rows in the upper portion of the tableau so

as to create zeros in the entire upper portion of the (n+1) column.

When finished, remove the row in T(E) corresponding to the exchange flux for the

metabolite just balanced.

Schilling et al.

JTB 203, 229

(8) Follow the same procedure as in step (7) for each of the columns on the right

side of the tableau containing non-zero entries.

(In this example we need to perform step (7) for every column except the middle

column of the right side which corresponds to metabolite C.)

The final tableau T(final) will contain the transpose of the matrix P containing the

extreme pathways in place of the original identity matrix.

Schilling et al.

JTB 203, 229
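Steps (6) to (8) amount to cancelling each remaining metabolite entry with the matching row of T(E). A minimal sketch for the example system follows; the function name and the list `exchange_column` (which metabolite column each exchange flux b1..b4 acts on) are assumptions made for this illustration.

```python
def balance_exchange_fluxes(tableau, n_internal, exchange_column):
    """Return rows (v1..vn, b1..bk) after zeroing all metabolite columns."""
    pathways = []
    for row in tableau:
        internal = row[:n_internal]
        residual = row[n_internal:]          # metabolite part (a copy)
        exchange = [0] * len(exchange_column)
        for b, met in enumerate(exchange_column):
            # The T(E) row of this exchange flux holds -1 in column `met`
            # (exchange fluxes point outwards). Multiplying it by the current
            # entry and adding it zeroes the entry and records the flux level.
            factor = residual[met]
            exchange[b] += factor
            residual[met] += factor * (-1)
        assert all(x == 0 for x in residual), "unbalanced internal metabolite"
        pathways.append(internal + exchange)
    return pathways


# T(1) of the example system (columns v1..v6, then metabolites A..E).
T1 = [
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, -1, 1],
]

# b1..b4 act on metabolites A, B, D, E (0-based columns 0, 1, 3, 4).
P_T = balance_exchange_fluxes(T1, n_internal=6, exchange_column=[0, 1, 3, 4])
for row in P_T:
    print(row)
```

Each printed row is one extreme pathway over the columns v1..v6, b1..b4.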

pathway matrix

T(final) = (rows are the extreme pathways, i.e. the transpose of P; the balanced metabolite columns, which are all 0, are omitted)

     v1 v2 v3 v4 v5 v6 b1 b2 b3 b4
p1    1  0  0  0  0  0 -1  1  0  0
p7    0  1  1  0  0  0  0  0  0  0
p3    0  1  0  1  0  0  0 -1  1  0
p2    0  1  0  0  0  1  0 -1  0  1
p4    0  0  1  0  1  0  0  1 -1  0
p6    0  0  0  1  1  0  0  0  0  0
p5    0  0  0  0  1  1  0  0 -1  1

Schilling et al.

JTB 203, 229

Extreme Pathways for model system

Schilling et al.

JTB 203, 229


2 pathways p6 and p7 are not shown (right below) because all exchange fluxes with the exterior are 0. Such pathways have no net overall effect on the functional capabilities of the network. They belong to the cycling of reactions v4/v5 and v2/v3.

How reactions appear in pathway matrix

In the matrix P of extreme pathways, each column is an EP and each row

corresponds to a reaction in the network.

The numerical value of the i,j-th element corresponds to the relative flux level

through the i-th reaction in the j-th EP.

Papin, Price, Palsson,

Genome Res. 12, 1889 (2002)


Papin, Price, Palsson, Genome Res. 12, 1889 (2002)

A symmetric Pathway Length Matrix $P_{LM}$ can be calculated:

$P_{LM} = P^T \cdot P$

where the values along the diagonal correspond to the length of the EPs.

Properties of pathway matrix

The off-diagonal terms of PLM are the number of reactions that a pair of extreme

pathways have in common.

One can also compute a reaction participation matrix $P_{PM}$ from P:

$P_{PM} = P \cdot P^T$

where the diagonal entries correspond to the number of pathways in which the

given reaction participates.
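Both matrix products are easy to try on the extreme pathways of the model system. The sketch below uses NumPy and assumes that P is taken in binary form (1 where a reaction participates in a pathway, 0 otherwise), which is what makes the entries countable as numbers of reactions or pathways.

```python
import numpy as np

# Transpose of P for the model system: rows p1, p7, p3, p2, p4, p6, p5,
# columns v1..v6, b1..b4.
P_T = np.array([
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, -1, 1],
])

# Binary participation matrix: rows = reactions, columns = pathways.
P = (P_T.T != 0).astype(int)

P_LM = P.T @ P   # pathway length matrix (pathways x pathways)
P_PM = P @ P.T   # reaction participation matrix (reactions x reactions)

print("pathway lengths:", np.diag(P_LM))
print("participation counts:", np.diag(P_PM))
print("reactions shared by p1 and p3:", P_LM[0, 2])
```

For example, p1 and p3 share exactly one reaction, the exchange flux b2.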

Properties of pathway matrix

Extreme Pathway Analysis

Calculation of EPs for increasingly large networks is computationally

intensive and results in the generation of large data sets.

Even for integrated genome-scale models for microbes under simple

conditions,

EP analysis can generate thousands of vectors!

Interpretation:

- the metabolic network of H. influenzae has an order of magnitude larger degree of

pathway redundancy than the metabolic network of H. pylori

Found elsewhere: the number of reactions that participate in EPs that produce a

particular product is poorly correlated to the product yield and the molecular

complexity of the product.

Possible way out?

Diagonalisation of pathway matrix?

http://mathworld.wolfram.com

Singular Value Decomposition of EP matrices

For a given EP matrix $P \in \mathbb{R}^{n \times p}$, SVD decomposes P into 3 matrices

$P = U \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} V^T$

Price et al. Biophys J 84, 794 (2003)

where $U \in \mathbb{R}^{n \times n}$ is an orthonormal matrix of the left singular vectors, $V \in \mathbb{R}^{p \times p}$ is an

analogous orthonormal matrix of the right singular vectors, and $\Sigma \in \mathbb{R}^{r \times r}$ is a

diagonal matrix containing the singular values $\sigma_i$, i = 1..r, arranged in descending order

where r is the rank of P.

The first r columns of U and V, referred to as the left and right singular vectors, or

modes, are unique and form the orthonormal basis for the column space and row

space of P.

The singular values are the square roots of the eigenvalues of $P^T P$. The magnitude

of the singular values in $\Sigma$ indicates the relative contribution of the singular vectors in

U and V in reconstructing P.

E.g. the second singular value contributes less to the construction of P than the first

singular value etc.
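A short NumPy sketch of the decomposition, applied to the small model system rather than to a genome-scale EP matrix. The definition of the fractional contribution as $\sigma_i / \sum_k \sigma_k$ is an assumption made here for illustration.

```python
import numpy as np

# Transpose of P for the model system (rows = pathways, columns v1..v6, b1..b4).
P_T = np.array([
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, -1, 1],
], dtype=float)
P = P_T.T                                  # n x p = 10 reactions x 7 pathways

# Full SVD: U is n x n, Vt is p x p, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(P)

# Rebuild the n x p matrix of singular values and reconstruct P.
Sigma = np.zeros(P.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
reconstructed = U @ Sigma @ Vt

# Singular values are the square roots of the eigenvalues of P^T P.
eigenvalues = np.sort(np.linalg.eigvalsh(P.T @ P))[::-1]

# Cumulative fractional contribution of the first modes (assumed definition).
cumulative = np.cumsum(s) / np.sum(s)
print(np.round(s, 3))
print(np.round(cumulative, 3))
```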

Singular Value Decomposition of EP: Interpretation

The first mode (as the other modes) corresponds to a valid biochemical pathway

through the network.

The first mode will point

into the portions of the

cone with highest density

of EPs.

SVD applied to the H. influenzae and H. pylori systems

Cumulative fractional contributions for the singular value decomposition of the EP

matrices of H. influenzae and H. pylori.

This plot represents the

contribution of the first

n modes to the overall

description of the system.

Summary

Extreme pathway analysis provides a mathematically rigorous way to dissect

complex biochemical networks.

The matrix products $P^T P$ and $P P^T$ are useful ways to interpret pathway

lengths and reaction participation.

However, the number of computed vectors may range in the thousands.

Therefore, meta-methods (e.g. singular value decomposition) are required

that reduce the dimensionality to a useful number that can be inspected by

humans.

Singular value decomposition may be one useful method ... and there are more

to come.

Metabolic Pathway Analysis: Elementary Modes

The technique of Elementary Flux Modes (EFM) was developed prior to extreme

pathways (EP) by Stephan Schuster, Thomas Dandekar and co-workers:

Pfeiffer et al. Bioinformatics, 15, 251 (1999)

Schuster et al. Nature Biotech. 18, 326 (2000)

The method is very similar to the „extreme pathway“ method to construct a basis

for metabolic flux states based on methods from convex algebra.

Extreme pathways are a subset of elementary modes, and for many systems,

both methods coincide.

Are the subtle differences important?

Review: Metabolite Balancing

For analyzing a biochemical network, its structure is expressed by the stoichiometric

matrix S consisting of m rows corresponding to the substances (metabolites) and n

columns corresponding to the stoichiometric coefficients of the metabolites in each

reaction.

A vector v denotes the reaction rates (mmol/g dry weight * hour) and a vector c

describes the metabolite concentrations.

Due to the high turnover of metabolite pools one often assumes pseudo-steady state

(c(t) = constant) leading to the fundamental Metabolic Balancing Equation:

$0 = S \cdot v$   (1)

Flux distributions v satisfying this relationship lie in the null space of S and are able

to balance all metabolites.

Klamt et al. Bioinformatics 19, 261 (2003)

Review: Metabolic flux analysis

Metabolic flux analysis (MFA): determine preferably all components of the flux

distribution v in a metabolic network during a certain stationary growth experiment.

Typically some measured or known rates must be provided to calculate unknown

rates. Accordingly, v and S are partitioned into the known (vb, Sb) and unknown part

(va, Sa).

Eq. (1) leads to the central equation for MFA describing a flux scenario:

0 = S v = Sa va + Sb vb.

The rank of Sa determines whether this scenario is redundant and/or

underdetermined. Redundant systems can be checked for inconsistencies. In

underdetermined scenarios, only some elements of va are uniquely calculable.
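The central MFA equation can be tried out on a small case. The sketch below uses NumPy; the two-metabolite network and the "measured" rates are hypothetical values chosen only to illustrate the partitioning into known and unknown fluxes.

```python
import numpy as np

# Hypothetical toy network (illustration only):
#   v1: -> A   v2: A -> B   v3: B ->   v4: A ->
S = np.array([[1, -1, 0, -1],
              [0, 1, -1, 0]], dtype=float)

known = [0, 3]                  # v1 and v4 are assumed to be measured
unknown = [1, 2]                # v2 and v3 are to be calculated
v_b = np.array([10.0, 2.0])     # hypothetical measured rates for v1, v4

S_a, S_b = S[:, unknown], S[:, known]

# 0 = S_a v_a + S_b v_b  =>  S_a v_a = -S_b v_b
rank = np.linalg.matrix_rank(S_a)
underdetermined = rank < len(unknown)   # not all unknown rates calculable
redundant = rank < S_a.shape[0]         # more balances than needed

# Least-squares solution; exact here because the scenario is determined.
v_a, *_ = np.linalg.lstsq(S_a, -S_b @ v_b, rcond=None)
print(v_a, underdetermined, redundant)
```

In this scenario the rank of Sa equals both the number of unknown rates and the number of balances, so the system is neither underdetermined nor redundant.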

Review: structural network analysis (SNA)

Whereas MFA focuses on a single flux distribution, techniques of Structural

(Stoichiometric, Topological) Network Analysis (SNA) address general

topological properties, overall capabilities, and the inherent pathway structure of a

metabolic network.

Basic topological properties are, e.g., conserved moieties.

Flux Balance Analysis (FBA) searches for single optimal flux distributions (mostly

with respect to the synthesis of biomass) fulfilling S v = 0 and additionally

reversibility and capacity restrictions for each reaction ($\alpha_i \le v_i \le \beta_i$).

Review: Metabolic Pathway Analysis (MPA)

Metabolic Pathway Analysis searches for meaningful structural and functional units

in metabolic networks. The most promising, very similar approaches are based on

convex analysis and use the sets of elementary flux modes (Schuster et al. 1999,

2000) and extreme pathways (Schilling et al. 2000).

Both sets span the space of feasible steady-state flux distributions by non-

decomposable routes, i.e. no subset of reactions involved in an EFM or EP can hold

the network balanced using non-trivial fluxes.

MPA can be used to study e.g.

- routing + flexibility/redundancy of networks

- functionality of networks

- identification of futile cycles

- gives all (sub)optimal pathways with respect to product/biomass yield

- can be useful for calculability studies in MFA

Elementary Flux Modes

Start from list of reaction equations and a declaration of reversible and irreversible

reactions and of internal and external metabolites.

E.g. reaction scheme of monosaccharide metabolism (Fig. 1). It includes 15 internal

metabolites and 19 reactions.

S has dimension 15 × 19.

It is convenient to reduce this matrix

by lumping those reactions that

necessarily operate together.

{Gap,Pgk,Gpm,Eno,Pyk},

{Zwf,Pgl,Gnd}

Such groups of enzymes can be detected automatically.

This reveals another two sequences {Fba,TpiA} and {2 Rpe,TktI,Tal,TktII}.

Schuster et al. Nature Biotech 18, 326 (2000)

Elementary Flux Modes

Lumping the reactions in any one sequence gives the following reduced system:

Construct initial tableau by combining

S with identity matrix:

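T(0) =

row  reaction(s)               type           columns 1-9          columns 10-14
1    Pgi                       reversible     1 0 0 0 0 0 0 0 0     0  0  1  0  0
2    {Fba,TpiA}                reversible     0 1 0 0 0 0 0 0 0     0 -1  0  2  0
3    Rpi                       reversible     0 0 1 0 0 0 0 0 0    -1  0  0  0  1
4    {2Rpe,TktI,Tal,TktII}     reversible     0 0 0 1 0 0 0 0 0    -2  0  2  1 -1
5    {Gap,Pgk,Gpm,Eno,Pyk}     irreversible   0 0 0 0 1 0 0 0 0     0  0  0 -1  0
6    {Zwf,Pgl,Gnd}             irreversible   0 0 0 0 0 1 0 0 0     1  0  0  0  0
7    Pfk                       irreversible   0 0 0 0 0 0 1 0 0     0  1 -1  0  0
8    Fbp                       irreversible   0 0 0 0 0 0 0 1 0     0 -1  1  0  0
9    Prs_DeoB                  irreversible   0 0 0 0 0 0 0 0 1     0  0  0  0 -1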

Elementary Flux Modes

Aim again: bring all entries of right part of matrix to 0.

E.g. 2*row3 - row4 gives a „reversible“ row with 0 in column 10.

New „irreversible“ rows with 0 entry in column 10 by row3 + row6 and by row4 + 2*row6.

In general, linear combinations of 2 rows corresponding to the same type of directionality go into the part of the respective type in the tableau. Combinations by different types go into the „irreversible“ tableau because at least 1 reaction is irreversible. Irreversible reactions can only be combined using positive coefficients.

Schuster et al. Nature Biotech 18, 326 (2000)

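T(1) = (column 10 zeroed)

origin            type           columns 1-9          columns 10-14
row1              reversible     1 0 0 0 0 0 0 0 0     0  0  1  0  0
row2              reversible     0 1 0 0 0 0 0 0 0     0 -1  0  2  0
2*row3 - row4     reversible     0 0 2 -1 0 0 0 0 0    0  0 -2 -1  3
row5              irreversible   0 0 0 0 1 0 0 0 0     0  0  0 -1  0
row7              irreversible   0 0 0 0 0 0 1 0 0     0  1 -1  0  0
row8              irreversible   0 0 0 0 0 0 0 1 0     0 -1  1  0  0
row9              irreversible   0 0 0 0 0 0 0 0 1     0  0  0  0 -1
row3 + row6       irreversible   0 0 1 0 0 1 0 0 0     0  0  0  0  1
row4 + 2*row6     irreversible   0 0 0 1 0 2 0 0 0     0  0  2  1 -1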
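The combination rules for reversible and irreversible rows can be sketched as follows. This is a simplified illustration: the function names are assumptions, and the checks against nonelementary and duplicate modes that the full algorithm requires are left out.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def normalize(row):
    """Divide a row by the greatest common divisor of its entries."""
    g = reduce(gcd, (abs(x) for x in row), 0)
    return [x // g for x in row] if g > 1 else row


def zero_column(rev, irr, col):
    """Zero column `col`; returns the new (reversible, irreversible) rows."""
    new_rev = [r for r in rev if r[col] == 0]
    new_irr = [r for r in irr if r[col] == 0]
    rev_nz = [r for r in rev if r[col] != 0]
    irr_nz = [r for r in irr if r[col] != 0]

    def mix(a, b):
        # a always gets a positive weight; b is negated if the signs agree.
        sign = -1 if a[col] * b[col] > 0 else 1
        wa, wb = abs(b[col]), sign * abs(a[col])
        return normalize([wa * x + wb * y for x, y in zip(a, b)])

    for a, b in combinations(rev_nz, 2):      # reversible + reversible
        new_rev.append(mix(a, b))
    for a in irr_nz:                          # irreversible + reversible
        for b in rev_nz:
            new_irr.append(mix(a, b))
    for a, b in combinations(irr_nz, 2):      # irreversible + irreversible
        if a[col] * b[col] < 0:               # only with positive weights
            new_irr.append(mix(a, b))
    return new_rev, new_irr


# Right-hand parts (columns 10-14) of the initial tableau, rows 1-9.
right = [[0, 0, 1, 0, 0], [0, -1, 0, 2, 0], [-1, 0, 0, 0, 1],
         [-2, 0, 2, 1, -1], [0, 0, 0, -1, 0], [1, 0, 0, 0, 0],
         [0, 1, -1, 0, 0], [0, -1, 1, 0, 0], [0, 0, 0, 0, -1]]
T0 = [[int(i == j) for j in range(9)] + right[i] for i in range(9)]
rev0, irr0 = T0[:4], T0[4:]                   # rows 1-4 are reversible

rev1, irr1 = zero_column(rev0, irr0, col=9)   # column 10 (0-based index 9)
rev2, irr2 = zero_column(rev1, irr1, col=10)  # column 11
print(len(rev1) + len(irr1), len(rev2) + len(irr2))
```

For these two steps the sketch yields 9 rows each time.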

Elementary Flux Modes

Aim: zero column 11.

Include all possible (direction-wise allowed) linear combinations of rows;

continue with columns 12-14.

Schuster et al. Nature Biotech 18, 326 (2000)

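T(2) = (columns 10 and 11 zeroed)

type           columns 1-9           columns 10-14
reversible     1 0 0 0 0 0 0 0 0      0  0  1  0  0
reversible     0 0 2 -1 0 0 0 0 0     0  0 -2 -1  3
irreversible   0 0 0 0 1 0 0 0 0      0  0  0 -1  0
irreversible   0 0 0 0 0 0 0 0 1      0  0  0  0 -1
irreversible   0 0 1 0 0 1 0 0 0      0  0  0  0  1
irreversible   0 0 0 1 0 2 0 0 0      0  0  2  1 -1
irreversible   0 1 0 0 0 0 1 0 0      0  0 -1  2  0
irreversible   0 -1 0 0 0 0 0 1 0     0  0  1 -2  0
irreversible   0 0 0 0 0 0 1 1 0      0  0  0  0  0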

Elementary Flux Modes

In the course of the algorithm, one must avoid

- calculation of nonelementary modes (rows that contain fewer zeros than the row

already present)

- duplicate modes (a pair of rows is only combined if it fulfills the condition

$S(m_i^{(j)}) \cap S(m_k^{(j)}) \not\subseteq S(m_l^{(j+1)})$, where $S(m_l^{(j+1)})$ is the set of positions of 0 in this row)

- flux modes violating the sign restriction for the irreversible reactions.

Final tableau

T(5) = (columns 1-9 in the order of the initial tableau; columns 10-14 are all 0 and omitted)

 1  1  0  0  2  0  1  0  0
-2  0  1  1  1  3  0  0  0
 0  2  1  1  5  3  2  0  0
 0  0  1  0  0  1  0  0  1
 5  1  4 -2  0  0  1  0  6
-5 -1  2  2  0  6  0  1  0
 0  0  0  0  0  0  1  1  0

This shows that the number of rows may decrease or increase in the course of the

algorithm. All constructed elementary modes are irreversible.

Two approaches for Metabolic Pathway Analysis?

The pathway P(e) is an elementary flux mode if it fulfills conditions C1 – C3.

(C1) Pseudo steady-state. S e = 0. This ensures that none of the metabolites is

consumed or produced in the overall stoichiometry.

(C2) Feasibility: rate $e_i \ge 0$ if reaction is irreversible. This demands that only

thermodynamically realizable fluxes are contained in e.

(C3) Non-decomposability: there is no vector v (unequal to the zero vector and to

e) fulfilling C1 and C2 such that P(v) is a proper subset of P(e). This is the core

characteristic of EFMs and EPs and supplies the decomposition of the network

into smallest units (able to hold the network in steady state).

C3 is often called „genetic independence“ because it implies that the enzymes in

one EFM or EP are not a subset of the enzymes from another EFM or EP.

Klamt & Stelling Trends Biotech 21, 64 (2003)
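Conditions C1 to C3 can be checked numerically for a given flux vector. The sketch below uses NumPy and tests non-decomposability with a rank test (the columns of S belonging to the active reactions must have a one-dimensional null space); this test is a common shortcut and an assumption of this illustration, not a procedure taken from the slides. The stoichiometric matrix is that of the small model system used for the extreme pathway example, with v1..v6 assumed irreversible.

```python
import numpy as np


def is_elementary_mode(S, e, irreversible, tol=1e-9):
    """Check C1 (steady state), C2 (feasibility), C3 (non-decomposability)."""
    e = np.asarray(e, dtype=float)
    support = np.flatnonzero(np.abs(e) > tol)
    if support.size == 0:
        return False
    steady_state = np.allclose(S @ e, 0.0, atol=tol)            # C1
    feasible = all(e[i] >= -tol for i in irreversible)          # C2
    # C3 via rank test: nullity of the support columns must be exactly 1.
    nullity = support.size - np.linalg.matrix_rank(S[:, support])
    return bool(steady_state and feasible and nullity == 1)


# Rows: metabolites A..E; columns: v1..v6, b1..b4 (exchange fluxes outwards).
S = np.array([
    [-1,  0,  0,  0,  0,  0, -1,  0,  0,  0],
    [ 1, -1,  1,  0,  0,  0,  0, -1,  0,  0],
    [ 0,  1, -1, -1,  1, -1,  0,  0,  0,  0],
    [ 0,  0,  0,  1, -1,  0,  0,  0, -1,  0],
    [ 0,  0,  0,  0,  0,  1,  0,  0,  0, -1],
], dtype=float)
irreversible = range(6)                      # internal reactions v1..v6

p3 = np.array([0, 1, 0, 1, 0, 0, 0, -1, 1, 0])   # v2 + v4 with exchanges
p7 = np.array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0])    # cycle v2/v3

print(is_elementary_mode(S, p3, irreversible))        # single route
print(is_elementary_mode(S, p3 + p7, irreversible))   # route plus cycle
print(is_elementary_mode(S, -p3, irreversible))       # violates C2
```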

Two approaches for Metabolic Pathway Analysis?

The pathway P(e) is an extreme pathway if it fulfills conditions C1 – C3 AND

conditions C4 – C5.

(C4) Network reconfiguration: Each reaction must be classified either as exchange

flux or as internal reaction. All reversible internal reactions must be split up into

two separate, irreversible reactions (forward and backward reaction).

(C5) Systemic independence: the set of EPs in a network is the minimal set of

EFMs that can describe all feasible steady-state flux distributions.

Two approaches for Metabolic Pathway Analysis?

[Figure: network with A(ext), B(ext), C(ext) and reactions R1, R2, R3; reconfigured network with R7 split into R7f and R7b]

3 EFMs are not systemically independent:

EFM1 = EP4 + EP5

EFM2 = EP3 + EP5

EFM4 = EP2 + EP3

Property 1 of EFMs

The only difference in the set of EFMs emerging upon reconfiguration consists in

the two-cycles that result from splitting up reversible reactions. However, two-cycles

are not considered as meaningful pathways.

Valid for any network: Property 1

Reconfiguring a network by splitting up reversible reactions leads to the

same set of meaningful EFMs.

Software: FluxAnalyzer

What is the consequence when all exchange fluxes (and hence all

reactions in the network) are irreversible?

EFMs and EPs always coincide!

Property 2 of EFMs

Property 2

If all exchange reactions in a network are irreversible then the sets of

meaningful EFMs (both in the original and in the reconfigured network) and

EPs coincide.


Comparison of EFMs and EPs

Problem: Recognition of operational modes: routes for converting exclusively A to P.

EFM (network N1): 4 genetically independent routes (EFM1-EFM4).

EP (network N2): Set of EPs does not contain all genetically independent routes. Searching for EPs leading from A to P via B, no pathway would be found.

Interpret the property of the EFM and EP network to recognize operational

modes, finding all the routes, ...

In the exam I would give you the desired property and ask you for the

interpretation.

Problem: Finding all the optimal routes: optimal pathways for synthesizing P during growth on A alone.

EFM (network N1): EFM1 and EFM2 are optimal because they yield one mole P per mole substrate A (i.e. R3/R1 = 1), whereas EFM3 and EFM4 are only suboptimal (R3/R1 = 0.5).

EP (network N2): One would only find the suboptimal EP1, not the optimal routes EFM1 and EFM2.

Problem: Analysis of network flexibility (structural robustness, redundancy): relative robustness of exclusive growth on A or B.

EFM (network N1): 4 pathways convert A to P (EFM1-EFM4), whereas for B only one route (EFM8) exists. When one of the internal reactions (R4-R9) fails, for production of P from A 2 pathways will always „survive“. By contrast, removing reaction R8 already stops the production of P from B alone.

EP (network N2): Only 1 EP exists for producing P by substrate A alone, and 1 EP for synthesizing P by (only) substrate B. One might suggest that both substrates possess the same redundancy of pathways, but as shown by EFM analysis, growth on substrate A is much more flexible than on B.

Problem: Relative importance of single reactions: relative importance of reaction R8.

EFM (network N1): R8 is essential for producing P by substrate B, whereas for A there is no structurally „favored“ reaction (R4-R9 all occur twice in EFM1-EFM4). However, considering the optimal modes EFM1, EFM2, one recognizes the importance of R8 also for growth on A.

EP (network N2): Consider again biosynthesis of P from substrate A (EP1 only). Because R8 is not involved in EP1 one might think that this reaction is not important for synthesizing P from A. However, without this reaction, it is impossible to obtain optimal yields (1 P per A; EFM1 and EFM2).

Problem: Enzyme subsets and excluding reaction pairs: suggest regulatory structures or rules.

EFM (network N1): R6 and R9 are an enzyme subset. By contrast, R6 and R9 never occur together with R8 in an EFM. Thus (R6,R8) and (R8,R9) are excluding reaction pairs. (In an arbitrary composable steady-state flux distribution they might occur together.)

EP (network N2): The EPs wrongly suggest that R4 and R8 are an excluding reaction pair, but they are not (EFM2). The enzyme subsets would be correctly identified. However, one can construct simple examples where the EPs would also suggest wrong enzyme subsets (not shown).

Problem: Pathway length: shortest/longest pathway for production of P from A.

EFM (network N1): The shortest pathway from A to P needs 2 internal reactions (EFM2), the longest 4 (EFM4).

EP (network N2): Both the shortest (EFM2) and the longest (EFM4) pathway from A to P are not contained in the set of EPs.

Problem: Removing a reaction and mutation studies: effect of deleting R7.

EFM (network N1): All EFMs not involving the specific reactions build up the complete set of EFMs in the new (smaller) sub-network. If R7 is deleted, EFMs 2,3,6,8 „survive“. Hence the mutant is viable.

EP (network N2): Analyzing a subnetwork implies that the EPs must be newly computed. E.g. when deleting R2, EFM2 would become an EP. For this reason, mutation studies cannot be performed easily.

Problem: Constraining reaction reversibility: effect of R7 limited to

EFM (network N1): For the case of R7, all EFMs but EFM1 and EFM7 „survive“ because the latter ones utilize R7 with negative rate.

EP (network N2): In general, the set of EPs must be recalculated: compare the EPs in network N2 (R2 reversible) and N4 (R2 irreversible).

Application of elementary modes

Metabolic network structure of E.coli determines

key aspects of functionality and regulation

Compute EFMs for central

metabolism of E.coli.

Catabolic part: substrate uptake

reactions, glycolysis, pentose

phosphate pathway, TCA cycle,

excretion of by-products (acetate,

formate, lactate, ethanol)

Anabolic part: conversions of

precursors into building blocks like

amino acids, to macromolecules,

and to biomass.

Stelling et al. Nature 420, 190 (2002)

Metabolic network topology and phenotype

The total number of EFMs for given

conditions is used as quantitative

measure of metabolic flexibility.

a, Relative number of EFMs N enabling

deletion mutants in gene i (Δi) of E. coli

to grow (abbreviated by µ) for 90 different

combinations of mutation and carbon

source. The solid line separates

experimentally determined mutant

phenotypes, namely inviability (1–40)

from viability (41–90).

The # of EFMs for mutant strain

allows correct prediction of

growth phenotype in more than 90%

of the cases.
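The idea of counting the modes that survive a deletion can be sketched in a few lines. The data are purely illustrative: the seven pathways of the small model system from the extreme pathway example stand in for the EFMs, and export of metabolite E (flux b4) is a hypothetical stand-in for growth; neither is taken from Stelling et al.

```python
# Stand-in modes over the columns v1..v6, b1..b4 (hypothetical use as EFMs).
MODES = {
    "p1": [1, 0, 0, 0, 0, 0, -1, 1, 0, 0],
    "p7": [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "p3": [0, 1, 0, 1, 0, 0, 0, -1, 1, 0],
    "p2": [0, 1, 0, 0, 0, 1, 0, -1, 0, 1],
    "p4": [0, 0, 1, 0, 1, 0, 0, 1, -1, 0],
    "p6": [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "p5": [0, 0, 0, 0, 1, 1, 0, 0, -1, 1],
}
V1, V2, V6, B4 = 0, 1, 5, 9   # column indices used below


def surviving_modes(modes, deleted):
    """Modes that do not use the deleted reaction remain valid in the mutant."""
    return {name: m for name, m in modes.items() if m[deleted] == 0}


def relative_number(modes, deleted, output):
    """Share of output-producing modes that survive the deletion."""
    wild_type = [m for m in modes.values() if m[output] > 0]
    mutant = [m for m in surviving_modes(modes, deleted).values() if m[output] > 0]
    return len(mutant) / len(wild_type)


for name, reaction in [("v1", V1), ("v2", V2), ("v6", V6)]:
    share = relative_number(MODES, reaction, output=B4)
    print(name, share, "viable" if share > 0 else "inviable")
```

A share of 0 corresponds to a predicted inviable mutant in this toy setting.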

Robustness analysis

The # of EFMs qualitatively indicates whether a mutant is viable or not, but does

not describe quantitatively how well a mutant grows.

Define maximal biomass yield Ymass as the optimum of:

$Y_{i,X/S_k} = e_i^{\mu} / e_i^{S_k}$

where $e_i^{\mu}$ and $e_i^{S_k}$ are the single reaction rates (growth and substrate uptake) in EFM i selected for

utilization of substrate Sk.

Software: FluxAnalyzer

Dependency of the mutants' maximal

growth yield Ymax(Δi) (open circles) and the

network diameter D(Δi) (open squares) on

the share of elementary modes

operational in the mutants. Data were

binned to reduce noise.

Stelling et al. Nature 420, 190 (2002)

Central metabolism of E.coli behaves in a highly robust manner because

mutants with significantly reduced metabolic flexibility show a growth yield

similar to wild type.

Assume that optimization during biological evolution can be characterized

by the two objectives of flexibility (associated with robustness) and of

efficiency.

Flexibility means the ability to adapt to a wide range of environmental

conditions,

that is, to realize a maximal bandwidth of thermodynamically feasible flux

distributions (maximizing # of EFMs).

Efficiency could be defined as fulfilment of cellular demands with an optimal

outcome such as maximal cell growth using a minimum of constitutive

elements (genes and proteins, thus minimizing # EFMs).

These 2 criteria pose contradictory challenges.

Optimal cellular regulation needs to find a trade-off.

Can regulation be predicted by EFM analysis?


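Compute control-effective fluxes for each reaction l by determining the efficiency of any EFM

ei by relating the system‘s output to the substrate uptake and to the sum of all absolute

fluxes.

With flux modes normalized to the total substrate uptake, efficiencies $\varepsilon_i(S_k, \mu)$ and $\varepsilon_i(S_k, ATP)$ for

the targets for optimization, growth and ATP generation, are defined as:

$\varepsilon_i(S_k, \mu) = \frac{e_i^{\mu}}{\sum_l |e_i^l|}$ and $\varepsilon_i(S_k, ATP) = \frac{e_i^{ATP}}{\sum_l |e_i^l|}$

Control-effective fluxes vl(Sk) are obtained by averaged weighting of the product of reaction-

specific fluxes and mode-specific efficiencies over all EFMs using the substrate under

consideration:

$v_l(S_k) = \frac{1}{Y^{max}_{X/S_k}} \cdot \frac{\sum_i \varepsilon_i(S_k, \mu)\,|e_i^l|}{\sum_i \varepsilon_i(S_k, \mu)} + \frac{1}{Y^{max}_{A/S_k}} \cdot \frac{\sum_i \varepsilon_i(S_k, ATP)\,|e_i^l|}{\sum_i \varepsilon_i(S_k, ATP)}$

$Y^{max}_{X/S_k}$ and $Y^{max}_{A/S_k}$ are optimal yields of biomass production and of ATP synthesis.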

Control-effective fluxes represent the importance of each reaction for efficient and flexible

operation of the entire network.

Summary

EFM are a robust method that offers great opportunities for studying functional and

structural properties in metabolic networks.

Klamt & Stelling suggest that the term „elementary flux modes“ should be used

whenever the sets of EFMs and EPs are identical.

In cases where they differ, EPs are a subset of EFMs.

It remains to be understood more thoroughly how much valuable information about

the pathway structure is lost by using EPs.

Ongoing Challenges:

- study really large metabolic systems by subdividing them

- combine metabolic model with model of cellular regulation.

Integrated Analysis of Metabolic and Regulatory Networks

So far, studies of large-scale cellular networks have focused on their connectivities.

The emerging picture shows a densely-woven web where almost everything is

connected to everything.

In the cell‘s metabolic network, hundreds of substrates are interconnected through

biochemical reactions. Although this could in principle lead to the

simultaneous flow of substrates in numerous directions, in practice metabolic fluxes

pass through specific pathways.

Topological studies so far did not consider how the modulation of this connectivity

might also determine network properties.

Therefore it is important to correlate the network topology (picture derived

from EFMs and EPs) with the expression of enzymes in the cell.

Start with review of last lecture‘s final point about coupling of metabolic and

regulatory networks.

Analyze transcriptional control in metabolic networks

Regulatory and metabolic functions of cells are mediated by networks of interacting

biochemical components.

Metabolic flux is optimized to maximize metabolic efficiency under different

conditions.

Control of metabolic flow:

- allosteric interactions

- covalent modifications involving enzymatic activity

- transcription (revealed by genome-wide expression studies)

Here: N. Barkai and colleagues analyzed published experimental expression data of

Saccharomyces cerevisiae.

Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)

Recurrence signature algorithm

Availability of DNA microarray data makes it possible to study the transcriptional response of a complete

genome to different experimental conditions.

An essential task in studying the global structure of transcriptional networks is the

gene classification.

Commonly used clustering algorithms classify genes successfully when applied to

relatively small data sets, but their application to large-scale expression data is

limited by 2 well-recognized drawbacks:

- commonly used algorithms assign each gene to a single cluster, whereas in fact

genes may participate in several functions and should thus be included in several

clusters

- these algorithms classify genes on the basis of their expression under all

experimental conditions, whereas cellular processes are generally affected only by

a small subset of these conditions.

Ihmels et al. Nat Genetics 31, 370 (2002)

Recurrence signature algorithm

Aim: identify transcription „modules“ (TMs).

A set of randomly selected genes is unlikely to be identical to the genes of any

TM. Yet many such sets do have some overlap with a specific TM.

In particular, sets of genes that are compiled according to existing knowledge of

their functional (or regulatory) sequence similarity may have a significant overlap

with a transcription module.

Algorithm receives a gene set that partially overlaps a TM and then provides the

complete module as output. Therefore this algorithm is referred to as „signature

algorithm“.

Recurrence signature algorithm

a, The signature algorithm.

b , Recurrence as a reliability measure. The signature algorithm is applied to distinct input

sets containing different subsets of the postulated transcription module. If the different input

sets give rise to the same module, it is considered reliable.

c, General application of the recurrent signature method.

Steps shown in panel c: normalization of data; identify modules; classify genes into modules.

Correlation between genes of the same metabolic pathway

Distribution of the average correlation

between genes assigned to the same

metabolic pathway in the KEGG database.

The distribution corresponding to random

assignment of genes to metabolic

pathways of the same size is shown for

comparison. Importantly, only genes

coding for enzymes were used in the

random control.

Interpretation: pairs of genes associated

with the same metabolic pathway show a

similar expression pattern.

However, typically only a subset of the

genes assigned to a given pathway

are coregulated.

Correlation between genes of the same metabolic pathway

Genes of the glycolysis pathway

(according to KEGG) were clustered

and ordered based on the correlation

in their expression profiles.

Shown here is the matrix of their

pair-wise correlations.

The cluster of highly correlated

genes (orange frame) corresponds

to genes that encode the central

glycolysis enzymes.

The linear arrangement of these

genes along the pathway is shown at

right.

Of the 46 genes assigned to the

glycolysis pathway in the KEGG

database, only 24 show a correlated

expression pattern.

In general, the coregulated genes

belong to the central pieces of

pathways.

Coexpressed enzymes often catalyze a linear chain of reactions

Coregulation between enzymes

associated with central metabolic

pathways. Each branch

corresponds to several enzymes.

In the cases shown, only one of the

branches downstream of the

junction point is coregulated with

upstream genes.

Interpretation: coexpressed

enzymes are often arranged in a

linear order, corresponding to a

metabolic flow that is directed in a

particular direction.

Co-regulation at branch points

To examine more systematically whether coregulation enhances the linearity of

metabolic flow, analyze the coregulation of enzymes at metabolic branch-points.

Search KEGG for metabolic compounds that are involved in exactly 3 reactions.

Only consider reactions that exist in S. cerevisiae.

3-junctions can integrate metabolic flow (convergent junction)

or allow the flow to diverge in 2 directions (divergent junction).

In the cases where several reactions are catalyzed by the same enzymes, choose

one representative so that all junctions considered are composed of precisely 3

reactions catalyzed by distinct enzymes.

Each 3-junction is categorized according to the correlation pattern found between

enzymes catalyzing its branches. Correlation coefficients > 0.25 are considered

significant.
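A minimal sketch of how a divergent 3-junction could be categorized from expression profiles. It assumes one representative gene per reaction and uses the 0.25 cutoff quoted above; the function names and the category labels are hypothetical, and the isozyme-dependent "linear switch" case is not covered.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def classify_divergent_junction(incoming, out_a, out_b, cutoff=0.25):
    """Categorize a divergent junction by the correlation of each outgoing
    branch with the incoming reaction (labels are illustrative)."""
    a_linked = pearson(incoming, out_a) > cutoff
    b_linked = pearson(incoming, out_b) > cutoff
    if a_linked and b_linked:
        return "both branches coregulated"
    if a_linked or b_linked:
        return "linear"
    return "no coregulation"

# Hypothetical expression profiles over four conditions.
incoming = [1.0, 2.0, 3.0, 4.0]
branch_a = [2.0, 4.0, 6.0, 8.0]   # follows the incoming reaction
branch_b = [4.0, 1.0, 3.0, 2.0]   # correlation with incoming is -0.4
pattern = classify_divergent_junction(incoming, branch_a, branch_b)
```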

Coregulation pattern in three-point junctions

In the majority of divergent

junctions, only one of the

emanating branches is

significantly coregulated with

the incoming reaction that

synthesizes the metabolite.

All junctions corresponding to metabolites that participate in exactly 3

reactions (according to KEGG) were identified and the correlations

between the genes associated with each such junction were calculated.

The junctions were grouped according to the directionality of the

reactions, as shown.

Divergent junctions, which allow the flow of metabolites in two

alternative directions, predominantly show a linear coregulation pattern,

where one of the emanating reactions is correlated with the incoming

reaction (linear regulatory pattern) or the two alternative outgoing

reactions are correlated in a context-dependent manner with a distinct

isozyme catalyzing the incoming reaction (linear switch).

By contrast, the linear regulatory pattern is significantly less abundant

in convergent junctions, where the outgoing flow follows a unique

direction, and in conflicting junctions that do not support metabolic flow.

Most of the reversible junctions comply with linear regulatory patterns.

Indeed, similar to divergent junctions, reversible junctions allow

metabolites to flow in two alternative directions. Reactions were

counted as coexpressed if at least two of the associated genes were

significantly correlated (correlation coefficient >0.25). As a random

control, we randomized the identity of all metabolic genes and repeated

the analysis.

Co-regulation at branch points: conclusions

The observed co-regulation patterns correspond to a linear metabolic flow, whose

directionality can be switched in a condition-specific manner.

When analyzing junctions that allow metabolic flow in a larger number of

directions, again only a few important branches are coregulated with the

incoming branch.

Therefore: transcription regulation is used to enhance the linearity of metabolic

flow, by biasing the flow toward only a few of the possible routes.

The connectivity of a given metabolite

is defined as the number of reactions

connecting it to other metabolites.

Shown are the distributions of

connectivity between metabolites in an

unrestricted network and in a

network where only correlated

reactions are considered.

In accordance with previous results

(Jeong et al. 2000) , the connectivity

distribution between metabolites

follows a power law (log-log plot).

Connectivity of metabolites

In contrast, when coexpression is

used as a criterion to distinguish

functional links, the connectivity

distribution becomes exponential

(log-linear plot).

Differential regulation of isozymes

Observe that isozymes at junction points are often preferentially coexpressed

with alternative reactions.

Investigate their role in the metabolic network more systematically.

Two possible functions of isozymes

associated with the same metabolic

reaction.

An isozyme pair could provide redundancy which may be needed for buffering genetic

mutations or for amplifying metabolite production. Redundant isozymes are expected

to be coregulated.

Alternatively, distinct isozymes could be dedicated to separate biochemical

pathways using the associated reaction. Such isozymes are expected to be

differentially expressed with the two alternative processes.

Arrows represent metabolic

pathways composed of a sequence

of enzymes.

Coregulation is indicated with the

same color (e.g., the isozyme

represented by the green arrow is

coregulated with the metabolic

pathway represented by the green

arrow).

Most members of isozyme

pairs are separately coregulated

with alternative processes.

Differential regulation of isozymes in central metabolic PW

The primary role of isozyme multiplicity is to allow for differential regulation

of reactions that are shared by separated processes.

Dedicating a specific enzyme to each pathway may offer a way of

independently controlling the associated reaction in response to pathway-

specific requirements, at both the transcriptional and the post-transcriptional

levels.

Describe the likely function of isozymes in metabolic pathways.

Differential regulation of isozymes: interpretation

Co-expression of transporters

Transporter genes are

co-expressed with the relevant

metabolic pathways providing

the pathways with their

metabolites.

Co-expression is marked in green.

Co-regulation of transcription factors

Transcription factors are often co-regulated with their regulated pathways.

Shown here are transcription factors which were found to be co-regulated in the

analysis. Co-regulation is shown by color-coding such that the transcription factor

and the associated pathways are of the same color.

So far: co-expression analysis revealed a strong tendency toward coordinated

regulation of genes involved in individual metabolic pathways.

Hierarchical modularity in the metabolic network

Does transcription regulation also define a higher-order metabolic organization, by

coordinated expression of distinct metabolic pathways?

Based on the observation that feeder pathways (which synthesize metabolites) are

frequently coexpressed with pathways using the synthesized metabolites.

Feeder-pathways/enzymes

Feeder pathways or genes

co-expressed with the

pathways they fuel. The

feeder pathways (light

blue) provide the main

pathway (dark blue) with

metabolites in order to

assist the main pathway,

indicating that co-

expression extends

beyond the level of

individual pathways.

These results can be

interpreted in the following

way: the organism will

produce those enzymes that

are needed.

Hierarchical modularity in the metabolic network

Derive hierarchy by applying an iterative

signature algorithm to the metabolic pathways,

and decreasing the resolution parameter

(coregulation stringency) in small steps.

Each box contains a group of coregulated genes

(transcription module). Strongly associated

genes (left) can be associated with a specific

function, whereas moderately correlated

modules (right) are larger and their function is

less coherent.

The merging of 2 branches indicates that the

associated modules are induced by similar

conditions.

All pathways converge to one of 3 low-resolution

modules: amino acid biosynthesis, protein

synthesis, and stress.

Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)

Global network properties

Jeong et al. showed that the structural connectivity between metabolites imposes a

hierarchical organization of the metabolic network. That analysis was based on

connectivity between substrates, considering all potential connections.

Here, analysis is based on coexpression of enzymes.

In both approaches, related metabolic pathways were clustered together!

There are, however, some differences in the particular groupings (not

discussed here),

and importantly, when including expression data the connectivity pattern of

metabolites changes from a power-law dependence to an exponential one

corresponding to a network structure with a defined scale of connectivity.

This reflects the reduction in the complexity of the network.

Summary

Transcription regulation is prominently involved in shaping the metabolic

network of S. cerevisiae.

1 Transcription leads the metabolic flow toward linearity.

2 Individual isozymes are often separately coregulated with distinct

processes, providing a means of reducing crosstalk between pathways

using a common reaction.

3 Transcription regulation entails a higher-order structure of the metabolic

network.

There is a hierarchical organization of metabolic pathways into groups of

decreasing expression coherence.

V23, 24

Integrating Protein-Protein Interactions: Bayesian Networks

- A lot of direct experimental data on protein-protein interactions is becoming available

(Y2H, MS)

Jansen et al. Science 302, 449 (2003)

- Genomic information also provides indirect information:

- interacting proteins are often significantly coexpressed (microarrays)

- interacting proteins are often colocalized to the same subcellular

compartment

Problems

Unfortunately, interaction data sets are often incomplete and contradictory

(von Mering et al. 2002)

In the context of genome-wide analyses, these inaccuracies are greatly

magnified because the protein pairs that do not interact (negatives) by far

outnumber those that do interact (positives).

E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ~ 18 million potential

interactions. But the estimated number of actual interactions is < 100,000.

Therefore, even reliable techniques can generate many false positives when

applied genome-wide.

Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in

0.1% of the population. This would roughly produce 1 true positive for every 10

false ones.
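The arithmetic behind this comparison can be checked directly. The sketch below assumes a perfectly sensitive test (every true case is detected), which the text does not state; the function name is hypothetical.

```python
def false_per_true_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected number of false positives per true positive."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return false_pos / true_pos

# Diagnostic example: 0.1% prevalence, 1% false-positive rate.
ratio = false_per_true_positive(prevalence=0.001, false_positive_rate=0.01)

# Number of possible protein pairs for ~6000 yeast proteins.
n_proteins = 6000
n_pairs = n_proteins * (n_proteins - 1) // 2
```

Here `ratio` comes out at about 10 false positives per true positive, and `n_pairs` is about 18 million. With fewer than 100,000 real interactions among those pairs, interacting pairs are rarer than the disease in the example, so the same effect is even stronger.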

Why are these inaccuracies greatly magnified in the case of protein

interaction networks?

Integrative Approach

One would like to integrate evidence from many different sources to increase the

reliability with which true protein-protein interactions are distinguished from false ones.

Here, use Bayesian approach for integrating interaction information that allows for

the probabilistic combination of multiple data sets; apply to yeast.

Input: Approach can be used for combining noisy genomic interaction data sets.

Normalization: Each source of evidence for interactions is compared against

samples of known positives and negatives (“gold-standard”).

Output: predict for every possible protein pair likelihood of interaction.

Verification: test on experimental interaction data not included in the gold-standard, plus new TAP (tandem affinity purification) experiments.

Integration of various information sources

The three different types of data used:

(i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (Uetz et al., Ito et al.) and in vivo pull-down experiments (Gavin et al., Ho et al.).

(ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential.

(iii) Gold-standards of known interactions and noninteracting protein pairs.

Combination of data sets into probabilistic interactomes

(B) Combination of data sets into probabilistic interactomes.

The 4 interaction data sets from HT experiments were combined into 1 PIE. The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists in a complex.

Because the 4 experimental interaction data sets contain correlated evidence, a fully connected Bayesian network is used.

A „naïve” Bayesian network is used to model the PIP data. These information sets hardly overlap.

Gold-Standard

should be

(i) independent from the data sources serving as evidence

(ii) sufficiently large for reliable statistics

(iii) free of systematic bias (e.g. towards certain types of interactions).

Positives: use MIPS (Munich Information Center for Protein Sequences, HW

Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that

are within the same complex) from biomedical literature.

Negatives:

- harder to define

- essential for successful training

Assume that proteins in different compartments do not interact.

Synthesize “negatives” from lists of proteins in separate subcellular compartments.

Measure of reliability: likelihood ratio

Consider a genomic feature f expressed in binary terms (i.e. „absent“ or „present“).

Likelihood ratio L(f) is defined as:

$$L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f}$$

L(f) = 1 means that the feature has no predictive power: the same fraction of positives and negatives have feature f.

The larger L(f), the better its predictive power.

Combination of features

For two features f1 and f2 with uncorrelated evidence,

the likelihood ratio of the combined evidence is simply the product:

L(f1,f2) = L(f1) L(f2)

For correlated evidence L(f1,f2) cannot be factorized in this way.

Bayesian networks are a formal representation of such relationships between

features.

The combined likelihood ratio is proportional to the estimated odds that two

proteins are in the same complex, given multiple sources of information.
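As a small numerical illustration, the likelihood ratio of a binary feature and the product rule for uncorrelated features could be computed as follows. The counts are made up for this example and the function names are hypothetical.

```python
def likelihood_ratio(pos_with, n_pos, neg_with, n_neg):
    """L(f) = (pos_with / n_pos) / (neg_with / n_neg).

    Written as a single division so that small integer counts give exact results.
    """
    return (pos_with * n_neg) / (neg_with * n_pos)

def combine_independent(ratios):
    """Combined likelihood ratio for features with uncorrelated evidence."""
    combined = 1.0
    for r in ratios:
        combined *= r
    return combined

# Hypothetical gold-standard overlaps for two features f1 and f2.
l_f1 = likelihood_ratio(pos_with=100, n_pos=1000, neg_with=10, n_neg=1000)
l_f2 = likelihood_ratio(pos_with=50, n_pos=1000, neg_with=25, n_neg=1000)
l_both = combine_independent([l_f1, l_f2])
```

The product is only valid when the features provide uncorrelated evidence. For correlated features the joint ratio has to be estimated from the joint counts instead.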

Prior and posterior odds

„positive“: a pair of proteins that are in the same complex. Given the number of positives among the total number of protein pairs, the „prior“ odds of finding a positive are:

$$O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)}$$

„posterior“ odds: odds of finding a positive after considering N datasets with values f1 ... fN:

$$O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)}$$

The terms „prior“ and „posterior“ refer to the situation before and after knowing the

information in the N datasets.

Static naive Bayesian Networks

In the case of protein-protein interaction data, the posterior odds describe the

odds of having a protein-protein interaction given that we have the information from

the N experiments,

whereas the prior odds are related to the chance of randomly finding a protein-

protein interaction when no experimental data is known.

If Opost > 1, the chances of having an interaction are

higher than having no interaction.

Static naive Bayesian Networks

The likelihood ratio L defined as

$$L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}$$

relates prior and posterior odds according to Bayes‘ rule:

$$O_{post} = L(f_1 \ldots f_N) \, O_{prior}$$

In the special case that the N features are conditionally independent (i.e. they provide uncorrelated evidence) the Bayesian network is a so-called „naïve” network, and L can be simplified to:

$$L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)}$$
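For completeness, the relation between posterior odds, likelihood ratio and prior odds follows from applying Bayes' rule to numerator and denominator of the posterior odds. The notation below follows the definitions on these slides; the common factor P(f_1 ... f_N) cancels.

```latex
O_{post}
  = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)}
  = \frac{P(f_1 \ldots f_N \mid pos)\, P(pos) / P(f_1 \ldots f_N)}
         {P(f_1 \ldots f_N \mid neg)\, P(neg) / P(f_1 \ldots f_N)}
  = \underbrace{\frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)}}_{L(f_1 \ldots f_N)}
    \cdot
    \underbrace{\frac{P(pos)}{P(neg)}}_{O_{prior}}
```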

Computation of prior and posterior odds

L can be computed from contingency tables relating positive and negative examples with the N features (by binning the feature values f1 ... fN into discrete intervals) – wait for examples.

Determining the prior odds Oprior is somewhat arbitrary in that it requires an assumption about the number of positives.

Jansen et al. believe that 30,000 is a conservative lower bound for the number of positives (i.e. pairs of proteins that are in the same complex).

Considering that there are ca. 18 million = 0.5 * N (N – 1) possible protein pairs in total (with N = 6000 for yeast),

$$O_{prior} \approx \frac{30{,}000}{18{,}000{,}000} = \frac{1}{600}$$

Opost > 1 can be achieved with L > 600.
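The numbers on this slide can be reproduced with a few lines. The sketch uses the figures quoted above (6000 proteins, 30,000 assumed positives); the variable and function names are chosen for this example only.

```python
N_PROTEINS = 6000
ASSUMED_POSITIVES = 30_000  # lower bound quoted from Jansen et al.

n_pairs = N_PROTEINS * (N_PROTEINS - 1) // 2

# Odds = positives / negatives, i.e. roughly 1 : 600.
prior_odds = ASSUMED_POSITIVES / (n_pairs - ASSUMED_POSITIVES)

# Smallest likelihood ratio for which the posterior odds exceed 1.
min_likelihood_ratio = 1.0 / prior_odds

def posterior_odds(likelihood_ratio, prior=prior_odds):
    """Bayes' rule in odds form: O_post = L * O_prior."""
    return likelihood_ratio * prior
```

The exact threshold is just under 600, which is why L > 600 is used as the cutoff for calling a pair more likely than not to be in the same complex.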

Parameters of the naïve Bayesian Networks (PIP)

Column 1 describes the genomic feature. In the „essentiality data“ protein pairs can take on 3 discrete

values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Column 2 gives the number of protein pairs with a particular feature (e.g. „EE“) drawn from the whole yeast

interactome (~18M pairs).

Columns „pos“ and „neg“ give the overlap of these pairs with the 8,250 gold-standard positives and the

2,708,746 gold-standard negatives.

Columns „sum(pos)“ and „sum(neg)“ show how many gold-standard positives (negatives) are among the

protein pairs with likelihood ratio L, computed by summing up the values in the „pos“ (or „neg“) column.

P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values – and

L, the ratio of these two conditional probabilities.


mRNA expression data

Proteins in the same complex tend to have correlated expression profiles.

Although large differences can exist between the mRNA and protein abundance, protein abundance can

be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA

transcript.

Experimental data source:

- time course of expression fluctuations during the yeast cell cycle

- Rosetta compendium: expression profiles of 300 deletion mutants and cells under

chemical treatments.

Problem: both data sets are strongly correlated.

Compute first principal component of the vector of the 2 correlations.

Use this as independent source of evidence for the P-P interaction prediction.

The first principal component is a stronger predictor of P-P interactions than either

of the 2 expression correlation datasets by themselves.
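A minimal sketch of how two correlated scores per protein pair can be merged into one value with the first principal component. It uses the closed-form eigen-decomposition of the 2x2 covariance matrix; the function name and the toy data are assumptions for illustration, not the procedure used in the paper.

```python
from math import sqrt

def first_principal_component(xs, ys):
    """Return the unit direction of largest variance and the projection
    of every (x, y) point onto it."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Entries of the covariance matrix [[a, b], [b, c]].
    a = sum(d * d for d in dx) / n
    c = sum(d * d for d in dy) / n
    b = sum(p * q for p, q in zip(dx, dy)) / n
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b * b)
    if b != 0:
        vx, vy = b, lam - a          # solves (a - lam) * vx + b * vy = 0
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    vx, vy = vx / norm, vy / norm
    scores = [p * vx + q * vy for p, q in zip(dx, dy)]
    return (vx, vy), scores

# Hypothetical correlations of three protein pairs in the two expression data sets.
cell_cycle_corr = [0.0, 1.0, 2.0]
compendium_corr = [0.0, 1.0, 2.0]
direction, pc1_scores = first_principal_component(cell_cycle_corr, compendium_corr)
```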

PIP – Functional similarity

Quantify functional similarity between two proteins:

- consider which set of functional classes two proteins share, given either the MIPS or Gene

Ontology (GO) classification system.

- Then count how many of the ~18 million protein pairs in yeast share the exact same

functional classes as well (yielding integer counts between 1 and ~ 18 million). These counts were binned

into 5 intervals.

- In general, the smaller this count, the more similar and specific is the functional description

of the two proteins.

PIP – Functional similarity

Observation: low counts correlate with a higher chance of two proteins being in

the same complex. But signal (L) is quite weak.

Calculation of the fully connected Bayesian network (PIE)

The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16

different ways (subsets). For each of these 16 subsets, one can compute a

likelihood ratio from the overlap with the gold-standard positives („pos“) and

negatives („neg“).
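The sketch below shows one way to tabulate a joint likelihood ratio for every presence/absence pattern over four binary data sets, which is what a fully connected network amounts to in this setting. The gold-standard observations are invented, and returning `None` for patterns that never occur among the negatives is a choice made for this example.

```python
from collections import Counter
from itertools import product

DATASETS = ("Gavin", "Ho", "Uetz", "Ito")

def joint_likelihood_ratios(positives, negatives):
    """positives / negatives: lists of 0/1 tuples, one entry per data set,
    for the gold-standard positive and negative protein pairs."""
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    table = {}
    for pattern in product((0, 1), repeat=len(DATASETS)):
        frac_pos = pos_counts[pattern] / len(positives)
        frac_neg = neg_counts[pattern] / len(negatives)
        # The ratio is undefined if no negative shows this pattern.
        table[pattern] = frac_pos / frac_neg if frac_neg > 0 else None
    return table

# Hypothetical gold-standard observations.
toy_positives = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 1
toy_negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 9
lr_table = joint_likelihood_ratios(toy_positives, toy_negatives)
```

Because each pattern gets its own ratio, no independence between the data sets is assumed. The price is that the number of table entries doubles with every additional data set.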

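[Example equation: likelihood ratio of one subset, computed as L = (pos / 8250) / (neg / 2708746) from its overlap with the gold-standard positives and negatives.]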

PIP vs. the information sources

Ratio of true to false positives (TP/FP) increases

monotonically with Lcut, confirming L as an

appropriate measure of the odds of a real

interaction.

The ratio is computed as:

Protein pairs with Lcut > 600 have a > 50%

chance of being in the same complex.

Jansen et al. Science 302, 449 (2003)

PIE vs. the information sources

9897 interactions are predicted from PIP and

163 from PIE.

In contrast, likelihood ratios derived from single

genomic factors (e.g. mRNA coexpression) or

from individual interaction experiments (e.g. the

Ho data set) did not exceed the cutoff when used

alone.

This demonstrates that information sources that,

taken alone, are only weak predictors of

interactions can yield reliable predictions when

combined.

Concentrate on large complexes

So far all interactions were treated as independent.

However, the joint distribution of interactions in the PIs can help identify large

complexes: an ideal complex should be a fully connected „clique“ in an

interaction graph.

In practice, this rarely happens because of incorrect or missing links.

Yet large complexes tend to have many interconnections between their members,

whereas false-positive links to outside proteins tend to occur randomly,

without a coherent pattern.

Improve ratio TP / FP

Observation: Increasing the minimum number of links raises

TP/FP by preserving the interactions among proteins in large

complexes, while filtering out false-positive interactions with

heterogeneous groups of proteins outside the complexes.

TP/FP for subsets of the

thresholded PIP that only include

proteins with a minimum number

of links. Requiring a minimum

number of links isolates large

complexes in the thresholded PIP

graph (Fig. 3B).
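A minimal sketch of the link-count filter, assuming a single pass over the thresholded graph (whether the original analysis repeats the filtering until no node changes is not stated here). The function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def filter_by_min_links(edges, min_links):
    """Keep only edges whose two proteins both have at least min_links
    partners in the input graph."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    kept = {node for node, k in degree.items() if k >= min_links}
    return [(u, v) for u, v in edges if u in kept and v in kept]

# Toy graph: a fully connected 4-protein complex plus one stray link to X.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D"),
             ("D", "X")]
filtered = filter_by_min_links(toy_edges, min_links=3)
```

In the toy graph the stray link to X is dropped while all six links inside the complex survive, which is the effect described above.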

How can one increase the predictivity TP/FP of P-P predictions?

Summary

In a similar manner, the approach could have been extended to a number of other

features related to interactions (e.g. phylogenetic co-occurrence, gene fusions,

gene neighborhood).

Bayesian approach allows reliable predictions of protein-protein interactions

by combining weakly predictive genomic features.

The de novo prediction of complexes replicated interactions found in the gold-

standard positives and PIE.

Also, several predictions were confirmed by new TAP experiments.

The accuracy of the PIP was comparable to that of the PIE while simultaneously

achieving greater coverage.

As a word of caution: Bayesian approaches don‘t work everywhere.

Please draw P(k) and C(k) as a function of k for the different network types: random network, scale-free network and hierarchical network.

Additional + required reading: review article in Nature Reviews Genetics

Characterising metabolic networks

(d) The degree distribution, P(k) of the metabolic network illustrates its scale-free

topology.

(e) The scaling of the clustering coefficient C(k) with the degree k illustrates the

hierarchical architecture of metabolism (The data shown in d and e represent an average

over 43 organisms).

(f) The flux distribution in the central metabolism of Escherichia coli follows a power

law, which indicates that most reactions have small metabolic flux, whereas a few reactions,

with high fluxes, carry most of the metabolic activity. It should be noted that on all three plots

the axis is logarithmic and a straight line on such log–log plots indicates a power-law scaling.

CTP, cytidine triphosphate; GLC, aldo-hexose glucose; UDP, uridine diphosphate; UMP,

uridine monophosphate; UTP, uridine triphosphate.

Degree

Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)

The most elementary characteristic of a node is its

degree (or connectivity), k, which tells us how many links

the node has to other nodes. For example, in the

undirected network shown in part a of the figure, node A

has degree k = 5. In networks in which each link has a

selected direction (see figure, part b) there is an

incoming degree, kin, which denotes the number of links

that point to a node, and an outgoing degree, kout, which

denotes the number of links that start from it. For

example, node A in part b of the figure has kin = 4 and

kout = 1. An undirected network with N nodes and L

links is characterized by an average degree <k> =

2L/N (where <> denotes the average).

Degree distribution

The degree distribution, P(k), gives the probability that a selected node has exactly k links. P(k) is obtained by counting the number of nodes N(k) with k = 1,2... links and dividing by the total number of nodes N. The degree distribution allows us to distinguish between different classes of networks. For example, a peaked degree distribution, as seen in a random network, indicates that the system has a characteristic degree and that there are no highly connected nodes (which are also known as hubs). By contrast, a power-law degree distribution indicates that a few hubs hold together numerous small nodes.
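Degree, average degree and degree distribution are easy to compute from an edge list. The sketch below uses a made-up undirected graph in which node A has five links. Nodes without any link do not appear in an edge list and are therefore ignored.

```python
from collections import Counter

def degree_statistics(edges):
    """Return node degrees, the average degree <k> = 2L/N and P(k)
    for an undirected graph given as a list of (u, v) edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    mean_degree = 2 * len(edges) / n_nodes
    counts = Counter(degree.values())          # N(k)
    p_k = {k: counts[k] / n_nodes for k in sorted(counts)}
    return dict(degree), mean_degree, p_k

# Hypothetical graph: A is linked to five nodes, and B is linked to C.
toy_graph = [("A", "B"), ("A", "C"), ("A", "D"),
             ("A", "E"), ("A", "G"), ("B", "C")]
degrees, mean_k, p_k = degree_statistics(toy_graph)
```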

Network measures

Scale-free networks and the degree exponent

Most biological networks are scale-free, which means that their

degree distribution approximates a power law, P(k) ~ k^-γ, where γ

is the degree exponent and ~ indicates 'proportional to'. The

value of γ determines many properties of the system. The

smaller the value of γ, the more important the role of the hubs

is in the network. Whereas for γ > 3 the hubs are not relevant, for

2 < γ < 3 there is a hierarchy of hubs, with the most connected

hub being in contact with a small fraction of all nodes, and for

γ = 2 a hub-and-spoke network emerges, with the largest hub

being in contact with a large fraction of all nodes. In general, the

unusual properties of scale-free networks are valid only for γ <

3, when the dispersion of the P(k) distribution, which is defined

as σ² = <k²> - <k>², increases with the number of nodes (that

is, diverges), resulting in a series of unexpected features,

such as a high degree of robustness against accidental node

failures. For γ > 3, however, most unusual features are absent,

and in many respects the scale-free network behaves like a

random one.

Shortest path and mean path length

Distance in networks is measured with the path length, which

tells us how many links we need to pass through to travel

between two nodes. As there are many alternative paths

between two nodes, the shortest path — the path with the

smallest number of links between the selected nodes — has a

special role. In directed networks, the distance ℓAB from node A

to node B is often different from the distance ℓBA from B to A. For

example, in part b of the figure, ℓBA = 1, whereas ℓAB = 3. Often

there is no direct path between two nodes. As shown in part b of

the figure, although there is a path from C to A, there is no path

from A to C. The mean path length, <ℓ>, represents the

average over the shortest paths between all pairs of nodes

and offers a measure of a network's overall navigability.
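Shortest path lengths in a directed network can be found with a breadth-first search. The graph below is made up so that it reproduces the asymmetry described above (distance 1 from B to A, 3 from A to B, and no path from A to C). It is not the graph from the figure.

```python
from collections import deque

def shortest_path_length(adjacency, source, target):
    """Number of links on the shortest directed path, or None if
    the target cannot be reached from the source."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == target:
                    return dist[nxt]
                queue.append(nxt)
    return None

# Hypothetical directed graph: each key points to the nodes it links to.
toy_digraph = {"B": ["A"], "A": ["D"], "D": ["E"], "E": ["B"], "C": ["A"]}
l_ba = shortest_path_length(toy_digraph, "B", "A")
l_ab = shortest_path_length(toy_digraph, "A", "B")
```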

Clustering coefficient

In many networks, if node A is connected to B, and B is connected to C,

then it is highly probable that A also has a direct link to C. This

phenomenon can be quantified using the clustering coefficient CI =

2nI/(kI(kI-1)), where nI is the number of links connecting the kI neighbours of

node I to each other. In other words, nI gives the number of 'triangles'

that go through node I, whereas kI (kI -1)/2 is the total number of triangles

that could pass through node I, should all of node I's neighbours be

connected to each other. For example, only one pair of node A's five

neighbours in part a of the figure are linked together (B and C), which

gives nA = 1 and CA = 2/20. By contrast, none of node F's neighbours link

to each other, giving CF = 0. The average clustering coefficient, <C >,

characterizes the overall tendency of nodes to form clusters or groups.

An important measure of the network's structure is the function C(k),

which is defined as the average clustering coefficient of all nodes with k

links. For many real networks C(k) ~ k^-1, which is an indication of a

network's hierarchical character.
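The clustering coefficient of a single node can be computed as sketched below. The toy graph is an assumption for illustration: node A has five neighbours of which only B and C are linked, which reproduces the value 2/20 from the example in the text.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def clustering_coefficient(adjacency, node):
    """C = 2 n / (k (k - 1)), with n the number of links among the
    k neighbours of the node; defined as 0 for k < 2."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(sorted(neighbours), 2)
                if v in adjacency[u])
    return 2 * links / (k * (k - 1))

toy_adjacency = build_adjacency([("A", "B"), ("A", "C"), ("A", "D"),
                                 ("A", "E"), ("A", "G"), ("B", "C")])
c_a = clustering_coefficient(toy_adjacency, "A")
c_b = clustering_coefficient(toy_adjacency, "B")
```

Averaging this value over all nodes that share the same degree k gives the function C(k).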

The average degree <k>, average path length <ℓ> and average

clustering coefficient <C> depend on the number of nodes and links

(N and L) in the network. By contrast, the P(k) and C(k) functions

are independent of the network's size and they therefore capture a

network's generic features, which allows them to be used to classify

various networks.

Random networks

The Erdős–Rényi (ER) model of a random network starts with N nodes and connects each pair of nodes with probability p, which creates a graph with approximately pN(N-1)/2 randomly placed links.

The node degrees follow a Poisson distribution, which indicates that most nodes have approximately the same number of links (close to the average degree <k>). The tail (high k region) of the degree distribution P(k) decreases exponentially, which indicates that nodes that significantly deviate from the average are extremely rare.

The clustering coefficient is independent of a node's degree, so C(k) appears as a horizontal line if plotted as a function of k. The mean path length is proportional to the logarithm of the network size, ℓ ~ log N, which indicates that it is characterized by the small-world property.
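The random-network construction described here can be sketched in a few lines. The function name and the seed parameter are choices made for this example.

```python
import random

def erdos_renyi(n_nodes, p, seed=0):
    """Connect each of the N(N-1)/2 node pairs independently with
    probability p and return the resulting edge list."""
    rng = random.Random(seed)
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                edges.append((i, j))
    return edges

# On average p * N * (N - 1) / 2 links are created.
sample_edges = erdos_renyi(n_nodes=200, p=0.05, seed=1)
expected_links = 0.05 * 200 * 199 / 2
```

Feeding `sample_edges` into a degree count should give a distribution peaked around the average degree, in contrast to the long tail of a scale-free network.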

Scale-free networks

Scale-free networks are characterized by a power-law degree

distribution; the probability that a node has k links follows P(k) ~ k^-γ,

where γ is the degree exponent. The probability that a node is highly

connected is statistically more significant than in a random graph, the

network's properties often being determined by a relatively small number

of highly connected nodes that are known as hubs (see figure, part Ba;

blue nodes). In the Barabási–Albert model of a scale-free network, at

each time point a node with M links is added to the network, which

connects to an already existing node I with probability ΠI = kI/ΣJ kJ,

where kI is the degree of node I and J is the index denoting the sum over

network nodes. The network that is generated by this growth process has

a power-law degree distribution that is characterized by the degree

exponent γ = 3 (see figure, part Bb).

Such distributions are seen as a straight line on a log–log plot. The

network that is created by the Barabási–Albert model does not have an

inherent modularity, so C(k) is independent of k (Bc). Scale-free

networks with degree exponents 2 < γ < 3, a range that is observed in

most biological and non-biological networks, are ultra-small, with the

average path length following ℓ ~ log log N, which is significantly shorter

than log N that characterizes random small-world networks.
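A minimal sketch of growth with preferential attachment in the spirit of the Barabási–Albert model. Two details are assumptions of this example and not taken from the text: the process starts from m + 1 fully connected nodes, and each new node picks its m distinct partners by repeated sampling, which only approximates the attachment probability kI/ΣJ kJ for the second and later links.

```python
import random

def barabasi_albert(n_nodes, m, seed=0):
    """Grow a network by adding nodes with m links each; requires
    m >= 1 and n_nodes >= m + 1."""
    rng = random.Random(seed)
    # Assumed seed graph: m + 1 fully connected nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Every node appears once per link it has, so a uniform draw from this
    # list selects node I with probability k_I / sum_J k_J.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges

ba_edges = barabasi_albert(n_nodes=50, m=2, seed=1)
```

Early nodes accumulate links faster than late ones, which is how the hubs of the power-law degree distribution arise.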

Hierarchical networks

To account for the coexistence of modularity, local clustering and scale-free

topology in many real systems it has to be assumed that clusters

combine in an iterative manner, generating a hierarchical network.

The starting point of this construction is a small cluster of four densely

linked nodes (see the four central nodes in Ca). Next, three replicas of

this module are generated and the three external nodes of the replicated

clusters connected to the central node of the old cluster, which produces

a large 16-node module. Three replicas of this 16-node module are then

generated and the 16 peripheral nodes connected to the central node of

the old module, which produces a new module of 64 nodes. The

hierarchical network model seamlessly integrates a scale-free topology

with an inherent modular structure by generating a network that has a

power-law degree distribution with degree exponent γ = 1 + ln4/ln3 =

2.26 (see Cb) and a large, system-size independent average clustering

coefficient <C> ~ 0.6.

The most important signature of hierarchical modularity is the scaling of

the clustering coefficient, which follows C(k) ~ k^-1, a straight line of slope -1

on a log–log plot (see Cc). A hierarchical architecture implies that

sparsely connected nodes are part of highly clustered areas, with

communication between the different highly clustered neighbourhoods

being maintained by a few hubs (see Ca).
