Post on 19-Jan-2016
transcript
13. Lecture WS 2003/04
Bioinformatics III 1
For your exam preparation: relevant topics from lectures 13-26
V13 Protein Networks / Protein Complexes
Protein networks could be defined in a number of ways
- Co-regulated expression of genes/proteins
- Proteins participating in the same metabolic pathways
- Proteins sharing substrates
- Proteins that are co-localized
- Proteins that form permanent supracomplexes = protein machineries
- Proteins that bind each other transiently
(signal transduction, bioenergetics ... )
Please describe different types of cellular networks.
Relation between lethality and function as centers in protein networks
Likelihood p(k) of finding proteins in yeast that interact
with exactly k other proteins.
The probability has a power-law dependence on k.
(Similar plot for the bacterium Helicobacter pylori.)
The network of protein-protein interactions is a very
inhomogeneous scale-free network in which a few
highly connected proteins play central roles in
mediating the interactions among the many less
connected proteins.
Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001)
Relation between lethality and function as centers in protein networks
Computational analysis of the tolerance of protein
networks to random errors (gene deletions).
Random mutations have little effect on the overall
topology of the network.
When "hub" proteins with many interactions are
eliminated, the diameter of the network increases
quickly.
The likelihood that a protein is essential (gene knockout
is lethal for the cell) depends on its connectivity in the
yeast protein network.
Highly connected proteins with central roles in the
architecture of the network are about 3 times more likely
to be essential than proteins with few connections.
Jeong, Mason, Barabási, Oltvai, Nature 411, 41 (2001)
Analysis of protein complexes in yeast (S. cerevisiae)
Gavin et al. Nature 415, 141 (2002)
Identify proteins by
scanning yeast protein
database for protein
composed of fragments
of suitable mass.
Here, the identified
proteins are listed
according to their
localization (a).
(b) lists the number of
proteins per complex.
V14 Protein Networks / Protein Complexes
Protein networks could be defined in a number of ways
- Co-regulated expression of genes/proteins
- Proteins participating in the same metabolic pathways
- Proteins sharing substrates
- Proteins that are co-localized
- Proteins that form permanent supracomplexes = protein machineries
- Proteins that bind each other transiently
(signal transduction, bioenergetics ... )
Overview
Statistical analysis of protein-protein interfaces in crystal structures of
protein-protein complexes: residues in interfaces have a significantly different
amino acid composition than the rest of the protein.
predict protein-protein interaction sites from local sequence information
Conservation at protein-protein interfaces: interface regions are more conserved
than other regions on the protein surface
identify conserved regions on protein surface e.g. from solvent accessibility
Interacting residues on two binding partners often show correlated mutations (among
different organisms) when they mutate
identify correlated mutations
Surface patterns of protein-protein interfaces: interface often formed by hydrophobic
patch surrounded by ring of polar or charged residues.
identify suitable patches on surface if 3D structure is known
1 Analysis of interfaces
The PDB contains 1812 non-redundant
protein complexes
(less than 25% identity).
Results don't change significantly if
NMR structures, theoretical models,
or structures at lower resolution
(altogether 50%) are excluded.
Most interesting are the results for
transiently formed complexes.
How many PDB structures of
protein-protein complexes are
known? How many residues are
typically at an interface?
(ca. 10-20)
Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
1 Properties of interfaces
Amino acid composition of six interface types. The propensities of all residues
found in SWISS-PROT were used as background. If the frequency of an amino
acid is similar to its frequency in SWISS-PROT, the height of the bar is close to
zero. Over-representation results in a positive bar, and under-representation
results in a negative bar. Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
1 Pairing frequencies at interfaces
Residue–residue preferences.
(A) Intra-domain: hydrophobic core is clear
(B) domain–domain, (C) obligatory homo-
oligomers (homo-obligomers), (D) transient
homo-oligomers (homo-complexes), (E)
obligatory hetero-oligomers (hetero-
obligomers), and (F) transient hetero-
oligomers (hetero-complexes). A red square
indicates that the interaction occurs more
frequently than expected; a blue square
indicates that it occurs less frequently than
expected. The amino acid residues are
ordered according to hydrophobicity, with
isoleucine as the most hydrophobic and
arginine as the least hydrophobic.
Ofran, Rost, J. Mol. Biol. 325, 377 (2003)
3 Correlated mutations at interface
Pazos, Helmer-Citterich, Ausiello, Valencia J Mol Biol 271, 511 (1997):
correlation information is sufficient for selecting the correct structural arrangement of
known heterodimers and protein domains because the correlated pairs between the
monomers tend to accumulate at the contact interface.
Use same idea to identify interacting protein pairs.
Correlated mutations at interface
Correlated mutations evaluate the similarity in variation patterns between positions in
a multiple sequence alignment.
Similarity of those variation patterns is thought to be related to compensatory
mutations.
Calculate for each pair of positions i and j in the sequence a rank correlation coefficient (r_ij):
$$ r_{ij} = \frac{\sum_{k,l} (S_{ikl} - \bar{S}_i)(S_{jkl} - \bar{S}_j)}{\sqrt{\sum_{k,l} (S_{ikl} - \bar{S}_i)^2}\,\sqrt{\sum_{k,l} (S_{jkl} - \bar{S}_j)^2}} $$
where the summations run over every possible pair of proteins k and l in the multiple
sequence alignment.
S_ikl is the ranked similarity between residue i in protein k and residue i in protein l.
S_jkl is the same for residue j.
\bar{S}_i and \bar{S}_j are the means of S_ikl and S_jkl.
Pazos, Valencia, Proteins 47, 219 (2002)
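As a rough illustration of this coefficient, the following Python sketch computes r_ij for two alignment columns as a Pearson-type correlation over all sequence pairs (k, l). The similarity function used here (1 for identical residues, 0 otherwise) is a placeholder assumption; the published method uses ranked similarities, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pair_similarities(column, sim):
    """Similarity S_kl of one alignment column for every sequence pair k < l."""
    return [sim(column[k], column[l])
            for k, l in combinations(range(len(column)), 2)]

def correlation(col_i, col_j, sim):
    """Correlation r_ij between the variation patterns of two columns."""
    s_i = pair_similarities(col_i, sim)
    s_j = pair_similarities(col_j, sim)
    mean_i = sum(s_i) / len(s_i)
    mean_j = sum(s_j) / len(s_j)
    num = sum((a - mean_i) * (b - mean_j) for a, b in zip(s_i, s_j))
    den = (sqrt(sum((a - mean_i) ** 2 for a in s_i))
           * sqrt(sum((b - mean_j) ** 2 for b in s_j)))
    # A fully conserved column has no variation; report 0 in that case.
    return num / den if den > 0 else 0.0

def identity(a, b):
    """Placeholder similarity (assumption): 1 if residues match, else 0."""
    return 1.0 if a == b else 0.0

# Toy columns: one residue per sequence, same sequence order in each column.
col_i = ["A", "A", "V", "V"]
col_j = ["K", "K", "E", "E"]   # changes together with col_i
col_c = ["G", "G", "G", "G"]   # fully conserved

r_covarying = correlation(col_i, col_j, identity)
r_conserved = correlation(col_i, col_c, identity)
```

Columns that change in the same sequences give a coefficient near 1, while a fully conserved column carries no signal.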
Correlated mutations at interface
For each protein, generate a multiple sequence alignment of homologous proteins (HSSP
database).
Compare MSAs of two proteins, reduce them by leaving only sequences of
coincident species (delete rows).
Pazos, Valencia, Proteins 47, 219 (2002)
i2h method
Schematic representation of the i2h method.
A: Family alignments are collected for two different proteins, 1 and 2, including corresponding sequences from different species (a, b, c, ...).
B: A virtual alignment is constructed, concatenating the sequences of the probable orthologous sequences of the two proteins. Correlated mutations are calculated.
C: The distributions of the correlation values are recorded. We used 10 correlation levels. The corresponding distributions are represented for the pairs of residues internal to the two proteins (P11 and P22) and for the pairs composed of one residue from each of the two proteins (P12).
Pazos, Valencia, Proteins 47, 219 (2002)
Predictions from correlated mutations
Results obtained by i2h in a set of 14 two-domain proteins of known structure (proteins with two interacting domains). The 2 domains are treated as different proteins.
A: Interaction index for the 133 pairs with 11 or more sequences in common. The true positive hits are highlighted with filled squares.
B: Representation of i2h results, reminiscent of those obtained in the experimental yeast two-hybrid system. The diameter of the black circles is proportional to the interaction index; true pairs are highlighted with gray squares. Empty spaces correspond to those cases in which the i2h system could not be applied, because they contained <11 sequences from different species in common for the two domains.
In most cases, i2h scored the correct pair of protein domains above all other possible interactions.
Pazos, Valencia, Proteins 47, 219 (2002)
4 Coevolutionary Analysis
Idea: if co-evolution is relevant, a ligand-receptor pair should occupy
related positions in phylogenetic trees.
Observe that for ligand-receptor pairs that are part of most large protein
families, the correlation between their phylogenetic distance matrices is
significantly greater than for uncorrelated protein families (Goh et al. 2000,
Pazos, Valencia, 2001).
Finer analysis (Goh & Cohen, 2002) shows that within these correlated
phylogenetic trees, the protein pairs that bind have a higher correlation between
their phylogenetic distance matrices than other homologs drawn from the ligand
and receptor families that do not bind.
Goh, Cohen J Mol Biol 324, 177 (2002)
Summary
There now exists a small zoo of promising experimental and theoretical methods to
analyze the cellular interactome: which proteins interact with each other.
Problem 1: each method detects too few interactions (as seen by the fact that the
overlap between predictions of various methods is very small).
Problem 2: each method has an intrinsic error rate, producing "false positives" and
"false negatives".
Ideally, everything will converge to a big picture eventually.
Solving Problem 1 will help solve Problem 2 by combining predictions.
Problem 1 can be partially solved by producing more data :-)
In the meantime, the value of network analysis (e.g. the identification of "isolated"
modules) is questionable to some extent.
V15
Modularity in molecular networks?
A functional module is, by definition, a discrete entity whose function is
separable from those of other modules.
This separation depends on chemical isolation, which can originate from
spatial localization or from chemical specificity.
E.g. a ribosome concentrates the reactions involved in making a polypeptide
into a single particle, thus spatially isolating its function.
A signal transduction system is an extended module that achieves its isolation
through the specificity of the initial binding of the chemical signal to receptor
proteins, and of the interactions between signalling proteins within the cell.
Please give 3 reasons for the occurrence of functional modules in
biological cells.
Hartwell et al. Nature 402, C47 (1999)
Modularity in molecular networks
Modules can be insulated from or connected to each other.
Insulation allows the cell to carry out many diverse reactions without cross-talk
that would harm the cell.
Connectivity allows one function to influence another.
The higher-level properties of cells, such as their ability to integrate information
from multiple sources, will be described by the pattern of connections among their
functional modules.
Hartwell et al. Nature 402, C47 (1999)
Organization of large-scale molecular networks
Organization of molecular networks revealed by large-scale experiments:
- power-law distribution, P(k) ~ k^(-γ), or a similar distribution of the node
degree k (i.e. the number of edges of a node)
- small-world property (i.e. a high clustering coefficient and a small shortest path
between every pair of nodes)
- anticorrelation in the node degree of connected nodes (i.e. highly interacting
nodes tend to be connected to low-interacting ones)
These properties become evident when hundreds or thousands of molecules and
their interactions are studied together.
On the other end of the spectrum: recently discovered motifs that consist of 3-4
nodes.
Mesoscale properties of networks
Most relevant processes in biological networks correspond to the
mesoscale (5-25 genes or proteins).
It is computationally enormously expensive to study mesoscale properties
of biological networks.
e.g. a network of 1000 nodes contains approximately 1 × 10^23 possible 10-node sets.
Spirin & Mirny analyzed combined network of protein interactions with data from
CELLZOME, MIPS, BIND: 6500 interactions.
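The order of magnitude quoted for the number of 10-node sets can be checked directly, since it is the binomial coefficient "1000 choose 10". A minimal check in Python (variable names are illustrative only):

```python
from math import comb

n_nodes = 1000
set_size = 10

# Number of distinct 10-node subsets of a 1000-node network.
n_sets = comb(n_nodes, set_size)
print(f"{n_sets:.2e}")
```

The exact value is somewhat above 10^23, which is why exhaustive enumeration of mesoscale node sets is impractical.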
Identify connected subgraphs
The network of protein interactions is typically presented as an undirected graph
with proteins as nodes and protein interactions as undirected edges.
Aim: identify highly connected subgraphs (clusters) that have more interactions
within themselves and fewer with the rest of the graph.
A fully connected subgraph, or clique, that is not a part of any other clique is
an example of such a cluster.
In general, clusters need not be fully connected.
Measure the density of connections by
$$ Q = \frac{2m}{n(n-1)} $$
where n is the number of proteins in the cluster
and m is the number of interactions between them.
Spirin, Mirny, PNAS 100, 12123 (2003)
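The density measure, Q = 2m / (n(n-1)), is the fraction of all possible pairs in a cluster that actually interact. A small sketch (the function name `density` is hypothetical):

```python
def density(n, m):
    """Q = 2m / (n(n-1)) for a cluster of n proteins with m interactions."""
    if n < 2:
        raise ValueError("a cluster needs at least two proteins")
    return 2 * m / (n * (n - 1))

q_clique = density(4, 6)   # 4 proteins, all 6 possible interactions present
q_sparse = density(5, 5)   # 5 proteins, 5 of 10 possible interactions
```

A clique has Q = 1; any missing interaction lowers Q below 1.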
(method I) Identify all fully connected subgraphs (cliques)
Generally, finding all cliques of a graph is an NP-hard problem.
Because the protein interaction graph is so far very sparse (the number of interactions
(edges) is similar to the number of proteins (nodes)), this can be done quickly.
To find cliques of size n one needs to enumerate only the cliques of size n-1.
The search for cliques starts with n = 4: pick all (known) pairs of edges (6500 × 6500
pairs of protein interactions) successively.
For every pair A-B and C-D check whether there are edges between A and C, A and
D, B and C, and B and D. If these edges are present, ABCD is a clique.
For every clique identified, ABCD, pick all known proteins successively.
For every picked protein E, if all of the interactions E-A, E-B, E-C, and E-D are known,
then ABCDE is a clique with size 5.
Continue for n = 6, 7, ... The largest clique found in the protein-interaction network
has size 14.
Spirin, Mirny, PNAS 100, 12123 (2003)
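The bottom-up idea (cliques of size n are obtained by extending cliques of size n-1 with one more protein) can be sketched in Python as follows. This is a simplified variant, not the authors' implementation: it starts from single edges (size 2) rather than from pairs of edges, and the function names and toy network are assumptions.

```python
from itertools import combinations

def build_adjacency(edges):
    """Map each protein to the set of its partners (self-loops ignored)."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def grow_cliques(cliques, adj):
    """Extend every clique by one protein that is linked to all its members."""
    bigger = set()
    for clique in cliques:
        candidates = set.intersection(*(adj[p] for p in clique))
        for e in candidates:
            bigger.add(clique | {e})
    return bigger

def cliques_by_size(edges, start=4):
    """Return {size: set of cliques} for all clique sizes >= start."""
    adj = build_adjacency(edges)
    level = {frozenset((a, b)) for a, b in edges if a != b}  # size-2 cliques
    size, result = 2, {}
    while level:
        if size >= start:
            result[size] = level
        level = grow_cliques(level, adj)
        size += 1
    return result

# Toy network: proteins A-E all interact pairwise; F binds only E.
edges = list(combinations("ABCDE", 2)) + [("E", "F")]
found = cliques_by_size(edges)
```

In the toy network the search reports five cliques of size 4 and one of size 5, which shows the redundancy that still has to be removed afterwards.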
(I) Identify all fully connected subgraphs (cliques)
These results include, however, many redundant cliques.
For example, the clique with size 14 contains 14 cliques with size 13.
To find all nonredundant subgraphs, mark all proteins comprising the clique of size
14, and out of all subgraphs of size 13 pick those that have at least one protein
that is not marked.
After all redundant cliques of size 13 are removed, proceed to remove redundant
cliques of size 12, etc.
In total, only 41 nonredundant cliques with sizes 4 - 14 were found.
Describe an algorithm to detect all fully-connected subgraphs (cliques) in the
network of protein-protein interactions.
Spirin, Mirny, PNAS 100, 12123 (2003)
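Removing redundant cliques can be sketched as a top-down filter: process sizes from largest to smallest and keep a clique only if it is not contained in a clique that was already kept. This is a simplified reading of the marking procedure, with hypothetical names and toy data.

```python
def nonredundant(cliques_by_size):
    """Keep only cliques that are not a proper subset of a kept larger clique."""
    kept = []
    for size in sorted(cliques_by_size, reverse=True):
        for clique in cliques_by_size[size]:
            if not any(clique < bigger for bigger in kept):
                kept.append(clique)
    return kept

# Toy input: one clique of size 5, its five sub-cliques of size 4,
# and one independent clique of size 4.
toy = {
    5: {frozenset("ABCDE")},
    4: {frozenset("ABCD"), frozenset("ABCE"), frozenset("ABDE"),
        frozenset("ACDE"), frozenset("BCDE"), frozenset("DEFG")},
}
kept = nonredundant(toy)
```

Only the size-5 clique and the independent size-4 clique survive the filter.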
(method III) Monte Carlo Simulation
Use MC to find a tight subgraph of a predetermined number of nodes M.
At time t = 0, a random set of M nodes is selected.
For each pair of nodes i,j from this set, the shortest path Lij between i and j on the
graph is calculated.
Denote the sum of all shortest paths Lij from this set as L0.
At every time step one of M nodes is picked at random, and one node is picked at
random out of all its neighbors.
The new sum of all shortest paths, L1, is calculated if the original node were to be
replaced by this neighbor.
If L1 < L0, accept replacement with probability 1.
If L1 > L0, accept replacement with probability
$$ \exp\left(-\frac{L_1 - L_0}{T}\right) $$
where T is the effective temperature.
Spirin, Mirny, PNAS 100, 12123 (2003)
(III) Monte Carlo Simulation
Every tenth time step an attempt is made to replace one of the nodes from
the current set with a node that has no edges to the current set to avoid
getting caught in an isolated disconnected subgraph.
This process is repeated
(i) until the original set converges to a complete subgraph, or
(ii) for a predetermined number of steps,
after which the tightest subgraph (the subgraph corresponding to the smallest
L0) is recorded.
The recorded clusters are merged and redundant clusters are removed.
Spirin, Mirny, PNAS 100, 12123 (2003)
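A minimal Python sketch of the Monte Carlo search described above, using the Metropolis acceptance rule exp(-(L1 - L0)/T). It makes several simplifying assumptions that are not from the source: the escape move attempted every tenth step is omitted, unreachable node pairs are charged a distance equal to the number of nodes, and all names and the toy graph are hypothetical.

```python
import math
import random
from collections import deque
from itertools import combinations

def shortest_paths_from(adj, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_sum(adj, nodes, cache):
    """Sum of shortest paths over all node pairs; unreachable pairs cost len(adj)."""
    total = 0
    for a, b in combinations(nodes, 2):
        if a not in cache:
            cache[a] = shortest_paths_from(adj, a)
        total += cache[a].get(b, len(adj))
    return total

def mc_tight_subgraph(adj, M, T, steps, rng):
    """Search for a set of M nodes with a small sum of pairwise shortest paths."""
    cache = {}
    current = rng.sample(sorted(adj), M)
    L0 = path_sum(adj, current, cache)
    best, best_L = list(current), L0
    for _ in range(steps):
        idx = rng.randrange(M)
        neighbors = sorted(adj[current[idx]])
        if not neighbors:
            continue
        candidate = rng.choice(neighbors)
        if candidate in current:
            continue
        trial = current[:idx] + [candidate] + current[idx + 1:]
        L1 = path_sum(adj, trial, cache)
        # Always accept improvements; accept worse sets with Metropolis probability.
        if L1 <= L0 or rng.random() < math.exp(-(L1 - L0) / T):
            current, L0 = trial, L1
            if L0 < best_L:
                best, best_L = list(current), L0
    return best, best_L

# Toy graph: a 4-clique A-B-C-D with a tail D-E-F-G.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
         ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

best, best_L = mc_tight_subgraph(adj, M=4, T=4.0, steps=2000, rng=random.Random(0))
```

For M nodes the smallest possible sum is M(M-1)/2, which is reached only when the set forms a clique.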
Optimal temperature in MC simulation
For every cluster size there is an
optimal temperature that gives the
fastest convergence to the tightest
subgraph.
Spirin, Mirny, PNAS 100, 12123 (2003)
Time to find a clique with size 7 in MC steps
per site as a function of temperature T.
The region with optimal temperature is
shown in Inset.
The required time increases sharply as the
temperature goes to 0, but has a relatively
wide plateau in the region 3 < T < 7.
Simulations suggest that the choice of
temperature T ≈ M would be safe for any
cluster size M.
Merging Overlapping Clusters
A simple statistical test shows that nodes which have only one link to a cluster are
statistically insignificant. Clean such statistically insignificant members first.
Then merge overlapping clusters:
For every cluster Ai find all clusters Ak that overlap with this cluster by at least one
protein.
For every such cluster, calculate the Q value of a possible merged cluster
Ai ∪ Ak. Record the cluster Abest(i) which gives the highest Q value if merged with Ai.
After the best match is found for every cluster, every cluster Ai is replaced by the
merged cluster Ai ∪ Abest(i) unless the Q value of Ai ∪ Abest(i) is below a certain
threshold value QC.
This process continues until there are no more overlapping clusters or until merging
any of the remaining clusters would make a cluster with a Q value lower than QC.
Spirin, Mirny, PNAS 100, 12123 (2003)
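The merging loop can be sketched as repeated passes in which each cluster is joined with its best overlapping partner, provided the merged density Q = 2m / (n(n-1)) stays at or above the threshold QC. The sketch below skips the initial cleaning of weakly linked members, and the function names, threshold values, and toy network are assumptions.

```python
def q_value(nodes, adj):
    """Density Q = 2m / (n(n-1)) of the subgraph induced by `nodes`."""
    n = len(nodes)
    m = sum(1 for a in nodes for b in adj[a] if b in nodes) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def merge_pass(clusters, adj, q_c):
    """One round: merge each cluster with its best overlapping partner."""
    merged = set()
    for a in clusters:
        overlapping = [b for b in clusters if b != a and a & b]
        best = max(overlapping, key=lambda b: q_value(a | b, adj), default=None)
        if best is not None and q_value(a | best, adj) >= q_c:
            merged.add(a | best)
        else:
            merged.add(a)
    return merged

def merge_clusters(clusters, adj, q_c):
    """Repeat merge passes until the set of clusters no longer changes."""
    clusters = {frozenset(c) for c in clusters}
    while True:
        new = merge_pass(clusters, adj, q_c)
        if new == clusters:
            return clusters
        clusters = new

# Toy network: A-E fully connected except for the missing edge D-E.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"),
         ("B", "D"), ("B", "E"), ("C", "D"), ("C", "E")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

start = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"}]
loose = merge_clusters(start, adj, q_c=0.8)    # merged cluster has Q = 0.9
strict = merge_clusters(start, adj, q_c=0.95)  # merge rejected
```

With the looser threshold the two overlapping cliques fuse into one five-protein cluster; with the stricter one they stay separate.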
Statistical significance of complexes and modules
Number of complete cliques (Q = 1) as
a function of clique size enumerated in
the network of protein interactions
(red) and in randomly rewired graphs
(blue, averaged over >1,000 graphs where
number of interactions for each protein
is preserved).
Inset shows the same plot in log-
normal scale. Note the dramatic
enrichment in the number of cliques in
the protein-interaction graph
compared with the random graphs.
Most of these cliques are parts of
bigger complexes and modules.
Spirin, Mirny, PNAS 100, 12123 (2003)
Draw the distribution of cliques in the plot.
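The random reference graphs preserve the number of interactions of each protein. One common way to generate such graphs is repeated edge swapping, sketched below; whether the authors used exactly this scheme is not stated here, so the procedure and all names are assumptions.

```python
import random
from collections import Counter

def rewire(edges, n_swaps, rng):
    """Degree-preserving rewiring: swap partners of two randomly chosen edges."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed swap: a-b, c-d  ->  a-d, c-b
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or reuse a node
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degrees(edges):
    """Number of interactions per node."""
    return Counter(node for e in edges for node in e)

# Toy graph: a ring of 8 nodes.
ring = [(i, (i + 1) % 8) for i in range(8)]
rewired = rewire(ring, n_swaps=100, rng=random.Random(1))
```

Every accepted swap leaves the degree of all four involved nodes unchanged, so clique counts in the rewired graphs can be compared fairly with the real network.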
Statistical significance of complexes and modules
Spirin, Mirny, PNAS 100, 12123 (2003)
Distribution of Q of clusters found by the MC search
method.
Red bars: original network of protein interactions.
Blue curves: randomly rewired graphs.
Clusters in the protein network have many more
interactions than their counterparts in the
random graphs.
Discovered functional modules
Examples of discovered functional modules.
(A) A module involved in cell-cycle regulation. This module consists of cyclins (CLB1-4 and
CLN2) and cyclin-dependent kinases (CKS1 and CDC28) and a nuclear import protein (NIP29).
Although they have many interactions, these proteins are not present in the cell at the same
time.
(B) Pheromone signal transduction pathway in the network of protein–protein interactions. This
module includes several MAPK (mitogen-activated protein kinase) and MAPKK (mitogen-
activated protein kinase kinase) kinases, as well as other proteins involved in signal
transduction. These proteins do not form a single complex; rather, they interact in a specific
order.
Assuming that functional modules are identified from analyzing protein-protein
interaction networks: will all proteins belonging to a functional module always be present
simultaneously?
Robustness of clusters found
Model effect of false positives in
experimental data: randomly reconnect,
remove or add 10-50% of interactions
in network.
Cluster recovery probability as a
function of the fraction of altered links.
Black curves correspond to the case
when a fraction of links are rewired.
Red, removed;
green, added.
Circles represent the probability to
recover 75% of the original cluster;
triangles represent the probability to
recover 50%.
Spirin, Mirny, PNAS 100, 12123 (2003)
Noise in the form of removal or addition
of links has a less deteriorating effect
than random rewiring. About 75% of
clusters can still be found when 10% of
links are rewired.
Summary
Here: analysis of meso-scale properties demonstrated the presence of highly
connected clusters of proteins in a network of protein interactions. Strong
support for suggested modular architecture of biological networks.
Distinguish 2 types of clusters: protein complexes and dynamic functional
modules.
Both complexes and modules have more interactions among their members
than with the rest of the network.
Dynamic modules are elusive to experimental purification because they are
not assembled as a complex at any single point in time.
Computational analysis allows detection of such modules by integrating
pairwise molecular interactions that occur at different times and places.
However, computational analysis alone does not allow one to distinguish
between complexes and modules or between transient and simultaneous
interactions.
Evolution of the yeast protein interaction network
How do biological networks develop?
So far, the protein interaction network of yeast is one of the best characterized
networks.
Parts of this network should be inherited from the last common ancestor of the
three domains of life: Eubacteria, Archaea, and Eukaryotes.
Use again graph theory to model the yeast protein interaction network.
Proteins = nodes, pairwise interactions = link between two nodes.
Evolution can be inferred by analyzing the growth pattern of the graph.
Classify all nodes (proteins) into isotemporal categories based on each
protein's orthologous hits in the COG database.
Qin et al. PNAS 100, 12820 (2003)
Evolution of the yeast protein interaction network
Isotemporal categories are designed
through a binary (b) coding scheme.
The b code represents the
distribution of each yeast protein's
orthologs in the universal tree of life.
Bit value 1 indicates the presence of
at least one orthologous hit for a
yeast protein in a corresponding
group of genomes, and bit value 0
indicates the absence of any
orthologous hit. The presented
example is 110011 in the b format
and 51 in the d format. Orthologous
identifications are based on COGs at
NCBI and in von Mering et al. (2002).
Qin et al. PNAS 100, 12820 (2003)
Previously, phylogenetic profiles were used to detect protein interaction partners.
Here, use phylogenetic profiles to detect modules.
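The relation between the binary (b) and decimal (d) formats of an isotemporal category is an ordinary base-2 conversion. A short sketch, assuming six genome groups as in the example code 110011 (function names are hypothetical):

```python
def b_to_d(b_code):
    """Convert a binary presence/absence code to its decimal (d) format."""
    return int(b_code, 2)

def d_to_b(d_code, n_groups=6):
    """Convert a decimal code back to a zero-padded binary (b) code."""
    return format(d_code, f"0{n_groups}b")

example_d = b_to_d("110011")
example_b = d_to_b(51)
```

Each bit records whether at least one ortholog was found in the corresponding group of genomes.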
Evolution of the yeast protein interaction network
Interaction patterns.
Z scores for all possible interactions
of the isotemporal categories in the
protein interaction network.
For categories i and j,
$$ Z_{i,j} = \frac{F_{i,j}^{obs} - F_{i,j}^{mean}}{\sigma_{i,j}} $$
where F_{i,j}^{obs} is the observed number
of interactions, and F_{i,j}^{mean} and σ_{i,j} are
the average number of interactions
and the SD, respectively, in 10,000
MS02 null models.
Qin et al. PNAS 100, 12820 (2003)
The diagonal distribution of large positive Z scores indicates that yeast
proteins tend to interact with proteins from the same or closely related
isotemporal categories.
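The Z score for a pair of categories compares the observed interaction count with its distribution in the null models, Z = (F_obs - F_mean) / SD. A sketch with made-up null counts; whether the population or the sample standard deviation was used is not stated, so `pstdev` is an assumption.

```python
from statistics import mean, pstdev

def z_score(f_obs, null_counts):
    """Z = (F_obs - F_mean) / SD, with mean and SD taken over null-model counts."""
    mu = mean(null_counts)
    sigma = pstdev(null_counts)  # population SD (assumption)
    if sigma == 0:
        raise ValueError("null models show no variation")
    return (f_obs - mu) / sigma

# Illustrative numbers only: 30 observed interactions between two categories,
# against four null-model counts.
z = z_score(30, [8, 10, 12, 10])
```

A large positive Z means that the two categories interact far more often than expected by chance.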
Evolution of the yeast protein interaction network
The observed intracategory association tendencies are consistent with the
intuitive notion that a new function likely requires a group of new proteins,
and that the growth of the protein interaction network is under functional
constraints.
Although the turnover rate of the protein interaction network is suggested to
be very fast, these results suggest that many isotemporal clusters can still
remain well preserved during evolution.
The formation and conservation of isotemporal clusters during evolution
may be the consequence of selection for the modular organization of the
protein interaction network.
The progressive nature of the network evolution and significant isotemporal
clustering may have contributed to the hierarchical organization of
modularity in biological networks in general.
Qin et al. PNAS 100, 12820 (2003)
V16
V17, V18
Look at Tihamer's overheads: the basics of network types are important for the exam.
V19
Computational Studies of Metabolic Networks - Introduction
Different levels for describing metabolic networks:
- classical biochemical pathways (glycolysis, TCA cycle, ...)
- stoichiometric modelling (flux balance analysis): theoretical capabilities of an
integrated cellular process, feasible metabolic flux distributions
- automatic decomposition of metabolic networks
(elementary modes, extreme pathways ...)
- kinetic modelling (E-Cell ...); problem: general lack of kinetic information
on the dynamics and regulation of cellular metabolism
As a primer today, EcoCyc:
Global Properties of the Metabolic Map of E. coli,
Ouzounis, Karp, Genome Research 10, 568 (2000)
EcoCyc Database
Genetic complement of E.coli: 4.7 million DNA bases.
How can we characterize the functional complement of E.coli and according to
what criteria can we compare the biochemical networks of two organisms?
EcoCyc contains the metabolic map of E.coli defined as the set of all known
pathways, reactions and enzymes of E.coli small-molecule metabolism.
Analyze
- the connectivity relationships of the metabolic network
- its partitioning into pathways
- enzyme activation and inhibition
- repetition and multiplicity of elements such as enzymes, reactions, and substrates.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Reactions
The number of reactions (744) and the number of enzymes (607) differ in
EcoCyc. WHY??
(1) there is no one-to-one mapping between enzymes and reactions –
some enzymes catalyze multiple reactions, and some reactions are catalyzed
by multiple enzymes.
(2) for some reactions known to be catalyzed by E.coli, the enzyme has not
yet been identified.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Compounds
The 744 reactions of E.coli small-molecule metabolism involve a total of 791
different substrates.
On average, each reaction contains 4.0 substrates.
Number of reactions containing varying numbers of substrates (reactants plus products).
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Pathways
EcoCyc describes 131 pathways:
energy metabolism
nucleotide and amino acid biosynthesis
secondary metabolism
Pathways vary in length from a
single reaction step to 16 steps
with an average of 5.4 steps.
Fill the distribution of pathway
lengths in EcoCyc into the plot:
Length distribution of EcoCyc pathways
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Pathways
However, there is no precise
biological definition of a
pathway.
The partitioning of the
metabolic network into
pathways (including the well-
known examples of
biochemical pathways) is
somewhat arbitrary.
These decisions of course also
affect the distribution of
pathway lengths.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Protein Subunits
A unique property of EcoCyc is that it explicitly encodes the subunit organization of
proteins.
Therefore, one can ask questions such as:
Are protein subunits encoded by neighboring genes?
Interestingly, this is the case for > 80% of known heteromeric enzymes.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Reactions Catalyzed by More Than one Enzyme
Diagram showing the number of reactions
that are catalyzed by one or more enzymes.
Most reactions are catalyzed by one enzyme,
some by two, and very few by more than two
enzymes.
For 84 reactions, the corresponding enzyme is not yet encoded in EcoCyc.
What may be the reasons for isozyme redundancy?
(1) the enzymes that catalyze the same reaction are homologs and have
duplicated (or were obtained by horizontal gene transfer),
acquiring some specificity but retaining the same mechanism (divergence).
(2) the reaction is easily "invented"; therefore, there is more than one protein
family that is independently able to perform the catalysis (convergence).
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Enzymes that catalyze more than one reaction
Genome predictions usually assign a single enzymatic function.
However, E.coli is known to contain many multifunctional enzymes.
Of the 607 E.coli enzymes, 100 are multifunctional, either having the same active
site and different substrate specificities or different active sites.
Number of enzymes that catalyze one or
more reactions. Most enzymes catalyze
one reaction; some are multifunctional.
The enzymes that catalyze 7 and 9 reactions are purine nucleoside phosphorylase
and nucleoside diphosphate kinase.
Take-home message: The high proportion of multifunctional enzymes implies that
the genome projects significantly underpredict multifunctional enzymes!
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Reactions participating in more than one pathway
The 99 reactions belonging to multiple
pathways appear to be the intersection
points in the complex network of chemical
processes in the cell.
E.g. the reaction present in 6 pathways corresponds to the reaction catalyzed by
malate dehydrogenase, a central enzyme in cellular metabolism.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Implications of EcoCyc Analysis
Although 30% of E.coli genes remain unidentified, enzymes are the best studied
and easily identifiable class of proteins.
Therefore, few new enzymes can be expected to be discovered.
The metabolic map presented may be 90% complete.
Implication for metabolic maps derived from automatic genome annotation:
automatic annotation generally does not identify multifunctional proteins.
The network complexity may therefore be underestimated.
EcoCyc results often cannot be obtained from protein or nucleic acid sequence
databases because they store protein functions using text descriptions.
E.g. sequence databases don't include precise information about subunit
organization of proteins.
Ouzounis, Karp, Genome Res. 10, 568 (2000)
Stoichiometric matrix
Stoichiometric matrix:
A matrix with reaction stoichiometries
as columns and
metabolite participations as
rows.
The stoichiometric matrix is an
important part of the in silico
model.
With the matrix, the methods
of extreme pathway and
elementary mode analyses
can be used to generate a
unique set of pathways P1,
P2, and P3 (see future
lecture).
Papin et al. TIBS 28, 250 (2003)
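To make the layout concrete (metabolites as rows, reactions as columns), the following sketch builds a stoichiometric matrix for a hypothetical three-reaction toy network. The reactions and names are illustrative assumptions, not taken from the cited paper.

```python
# Each reaction maps metabolite -> stoichiometric coefficient
# (negative = consumed, positive = produced).
reactions = {
    "R1": {"A": -1, "B": +1},   # A -> B
    "R2": {"B": -1, "C": +1},   # B -> C
    "R3": {"A": -1, "C": +1},   # A -> C
}

# Rows: metabolites in alphabetical order. Columns: reactions in insertion order.
metabolites = sorted({m for r in reactions.values() for m in r})
S = [[reactions[r].get(m, 0) for r in reactions] for m in metabolites]

for name, row in zip(metabolites, S):
    print(name, row)
```

Each column can be read as one reaction equation, and each row shows in which reactions a metabolite takes part.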
Flux balancing
Any chemical reaction requires mass conservation.
Therefore one may analyze metabolic systems by requiring mass conservation.
Only required: knowledge about stoichiometry of metabolic pathways and
metabolic demands.
For each metabolite:
$$ v_i = \frac{dX_i}{dt} = V_{synthesized} - V_{degraded} - (V_{used} - V_{transported}) $$
Under steady-state conditions, the mass balance constraints in a metabolic
network can be represented mathematically by the matrix equation:
S · v = 0
where the matrix S is the m × n stoichiometric matrix,
m = the number of metabolites and n = the number of reactions in the network.
The vector v represents all fluxes in the metabolic network, including the internal
fluxes, transport fluxes and the growth flux.
Flux balance analysis
Since the number of metabolites is generally smaller than the number of reactions (m < n), the flux-balance equation is typically underdetermined.
Therefore there are generally multiple feasible flux distributions that satisfy the mass balance constraints.
The set of solutions is confined to the nullspace of matrix S.
To find the "true" biological flux in cells (e.g. Heinzle, Huber, UdS) one needs additional (experimental) information, or one may impose constraints
$\alpha_i \le v_i \le \beta_i$
on the magnitude of each individual metabolic flux.
The intersection of the nullspace and the region defined by those linear inequalities defines a region in flux space = the feasible set of fluxes.
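The following sketch illustrates the two ingredients of the feasible set for a hypothetical three-metabolite network (matrix, bounds and test vectors are assumptions chosen for illustration): a nullspace basis of S obtained with NumPy's SVD, and a check of the capacity constraints.

```python
import numpy as np

# Hypothetical toy network (assumption): rows A, B, C;
# columns b_in, v1, v2, v3, growth, b_c.
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)

m, n = S.shape
rank = np.linalg.matrix_rank(S)

# Nullspace basis: right singular vectors that belong to zero singular values.
_, _, Vt = np.linalg.svd(S)
N = Vt[rank:].T                      # shape (n, n - rank)

# Assumed capacity constraints alpha_i <= v_i <= beta_i (all fluxes irreversible).
alpha = np.zeros(n)
beta = np.full(n, 10.0)

def is_feasible(v, tol=1e-9):
    """True if v satisfies S v = 0 and alpha <= v <= beta."""
    v = np.asarray(v, dtype=float)
    balanced = np.allclose(S @ v, 0.0, atol=tol)
    bounded = np.all(v >= alpha - tol) and np.all(v <= beta + tol)
    return bool(balanced and bounded)

print("m =", m, "n =", n, "nullspace dimension =", N.shape[1])
print(is_feasible([10, 10, 0, 0, 10, 0]))   # all flux through v1
print(is_feasible([10, 0, 10, 5, 5, 0]))    # all flux through v2 and v3
print(is_feasible([10, 10, 0, 0, 5, 0]))    # metabolite B is not balanced
```

Because m < n here (3 metabolites, 6 reactions), the nullspace is 3-dimensional and several distinct flux distributions are feasible at once.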
Feasible solution set for a metabolic reaction network
(A) The steady-state operation of the metabolic network is restricted to the
region within a cone, defined as the feasible set. The feasible set contains
all flux vectors that satisfy the physicochemical constraints. Thus, the
feasible set defines the capabilities of the metabolic network. All feasible
metabolic flux distributions lie within the feasible set, and
(B) in the limiting case, where all constraints on the metabolic network are
known, such as the enzyme kinetics and gene regulation, the feasible set
may be reduced to a single point. This single point must lie within the
feasible set.
E.coli in silico
Edwards & Palsson, PNAS 97, 5528 (2000)
Define $\alpha_i = 0$ for irreversible internal fluxes,
$\alpha_i = -\infty$ for reversible internal fluxes (use biochemical literature).
Transport fluxes for PO4^2-, NH3, CO2, SO4^2-, K+, Na+ were unrestrained.
For other metabolites
$0 \le v_i \le v_i^{\max}$
except for those that are able to leave the metabolic network (i.e. acetate, ethanol, lactate, succinate, formate, pyruvate etc.).
Find a particular metabolic flux distribution within the feasible set by linear programming.
LP finds a solution that minimizes a particular metabolic objective (subject to the imposed constraints) $-Z$, where
$Z = \sum_i c_i v_i = \langle c \cdot v \rangle$
In fact, the method finds the solution that maximizes fluxes = gives maximal biomass.
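A minimal sketch of such a linear program, assuming a hypothetical toy network and using SciPy's `linprog` as a stand-in solver (not the software used by Edwards & Palsson). The network, the uptake limit of 10 and the objective weights are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (assumption): rows A, B, C.
names = ["b_in", "v1", "v2", "v3", "growth", "b_c"]
S = np.array([
    [1, -1, -1,  0,  0,  0],   # A
    [0,  1,  0,  1, -1,  0],   # B
    [0,  0,  1, -2,  0, -1],   # C
], dtype=float)

# Objective Z = <c . v>: weight 1 on the growth flux, 0 elsewhere.
c = np.zeros(len(names))
c[names.index("growth")] = 1.0

# 0 <= v_i <= v_i^max: uptake limited to 10 (assumed), other fluxes only irreversible.
bounds = [(0.0, 10.0) if r == "b_in" else (0.0, None) for r in names]

# linprog minimizes, so minimizing -Z maximizes the growth flux.
res = linprog(-c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)

Z = -res.fun
v_opt = dict(zip(names, res.x))
print("maximal growth flux:", Z)
print(v_opt)
```

In this toy case the optimum routes all substrate through `v1`, because the alternative route via `v2` and `v3` loses half of the carbon.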
Interpretation of gene deletion results
The essential gene products were involved in the 3-carbon stage of glycolysis, 3
reactions of the TCA cycle, and several points within the PPP.
The remainder of the central metabolic genes could be removed while E.coli in
silico maintained the potential to support cellular growth.
This suggests that a large number of the central metabolic genes can be
removed without eliminating the capability of the metabolic network to
support growth under the conditions considered.
Edwards & Palsson PNAS 97, 5528 (2000)
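An in silico gene deletion can be sketched by forcing the flux of the deleted reaction to zero and repeating the optimization. The network below is a hypothetical toy system, and SciPy's `linprog` is used as an assumed solver; only the procedure mirrors the analysis described above.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (assumption): rows A, B, C.
names = ["b_in", "v1", "v2", "v3", "growth", "b_c"]
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)
c = np.zeros(len(names))
c[names.index("growth")] = 1.0

def max_growth(deleted=()):
    """Maximal growth flux when the listed reactions are constrained to zero."""
    bounds = [(0.0, 10.0) if r == "b_in" else (0.0, None) for r in names]
    for r in deleted:
        bounds[names.index(r)] = (0.0, 0.0)
    res = linprog(-c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    return -res.fun if res.success else 0.0

wild_type = max_growth()
single = {r: max_growth([r]) for r in names if r != "growth"}
essential = [r for r, g in single.items() if g < 1e-6]

print("wild type:", wild_type)
print("single deletions:", single)
print("essential reactions:", essential)
```

Most single deletions leave the growth potential intact or merely reduced; only the uptake reaction is essential in this toy network.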
Summary
FBA constructs the optimal network utilization simply using the stoichiometry of metabolic reactions and capacity constraints.
For E. coli the in silico results are consistent with experimental data.
FBA shows that in the E. coli metabolic network there are relatively few critical gene products in central metabolism.
However, the ability to adjust to different environments (growth conditions) may be diminished by gene deletions.
FBA identifies "the best" the cell can do, not how the cell actually behaves under a given set of conditions. Here, survival was equated with growth.
FBA does not directly consider regulation or regulatory constraints on the metabolic network. This can be treated separately (see future lecture).
Edwards & Palsson PNAS 97, 5528 (2000)
V20
Extreme Pathways
Introduced into metabolic analysis by the lab of Bernhard Palsson (Dept. of Bioengineering, UC San Diego). The publications of this lab are available at http://gcrg.ucsd.edu/publications/index.html
The extreme pathway technique is based on the stoichiometric matrix representation of metabolic networks.
All external fluxes are defined as pointing outwards.
Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)
Extreme Pathways – algorithm - setup
The algorithm to determine the set of extreme pathways for a reaction network follows the principles of algorithms for finding the extremal rays / generating vectors of convex polyhedral cones.
Combine the n × n identity matrix (I) with the transpose of the stoichiometric matrix, S^T. I serves for bookkeeping.
Initial matrix: [ I | S^T ]
Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)
separate internal and external fluxes
Examine constraints on each of the exchange fluxes as given by
$\alpha_j \le b_j \le \beta_j$
If the exchange flux is constrained to be positive, do nothing;
if the exchange flux is constrained to be negative, multiply the corresponding row of the initial matrix by -1.
If the exchange flux is unconstrained, move the entire row to a temporary matrix T(E). This completes the first tableau T(0).
T(0) and T(E) for the example reaction system are shown on the previous slide.
Each element of these matrices will be designated T_ij.
Starting with x = 1 and T(0) = T(x-1), the next tableau is generated in the following way:
Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)
idea of algorithm
(1) Identify all metabolites that do not have an unconstrained exchange flux associated with them.
The total number of such metabolites is denoted by μ.
For the example, this is only the case for metabolite C (μ = 1).
What is the main idea?
- We want to find balanced extreme pathways that don't change the concentrations of metabolites when flux flows through (input fluxes are channelled to products, not to accumulation of intermediates).
- The stoichiometric matrix describes the coupling of each reaction to the concentration of metabolites X.
- Now we need to balance combinations of reactions that leave concentrations unchanged. Pathways applied to metabolites should not change their concentrations → the matrix entries need to be brought to 0.
Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)
keep pathways that do not change concentrations of internal metabolites
(2) Begin forming the new matrix T(x) by copying all rows from T(x – 1) which contain a zero in the column of S^T that corresponds to the first metabolite identified in step 1, denoted by index c.
(Here 3rd column of S^T.)
T(0) = (columns: v1 v2 v3 v4 v5 v6 | A B C D E)
1 0 0 0 0 0 | -1 1 0 0 0
0 1 0 0 0 0 | 0 -1 1 0 0
0 0 1 0 0 0 | 0 1 -1 0 0
0 0 0 1 0 0 | 0 0 -1 1 0
0 0 0 0 1 0 | 0 0 1 -1 0
0 0 0 0 0 1 | 0 0 -1 0 1
T(1) = (first row, copied from T(0))
1 0 0 0 0 0 | -1 1 0 0 0
Schilling, Letscher, Palsson, J. theor. Biol. 203, 229 (2000)
balance combinations of other pathways
(3) Of the remaining rows in T(x-1) add together
all possible combinations of rows which contain
values of the opposite sign in column c, such that
the addition produces a zero in this column.
Schilling et al. JTB 203, 229
T(0) = (columns: v1 v2 v3 v4 v5 v6 | A B C D E)
1 0 0 0 0 0 | -1 1 0 0 0
0 1 0 0 0 0 | 0 -1 1 0 0
0 0 1 0 0 0 | 0 1 -1 0 0
0 0 0 1 0 0 | 0 0 -1 1 0
0 0 0 0 1 0 | 0 0 1 -1 0
0 0 0 0 0 1 | 0 0 -1 0 1
T(1) =
1 0 0 0 0 0 | -1 1 0 0 0
0 1 1 0 0 0 | 0 0 0 0 0
0 1 0 1 0 0 | 0 -1 0 1 0
0 1 0 0 0 1 | 0 -1 0 0 1
0 0 1 0 1 0 | 0 1 0 -1 0
0 0 0 1 1 0 | 0 0 0 0 0
0 0 0 0 1 1 | 0 0 0 -1 1
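Steps (2) and (3) can be sketched in a few lines. The tableau T(0) below is read off the example system (identity part for v1..v6, then the metabolite columns, assumed here to be ordered A, B, C, D, E); the function name `balance_column` is a hypothetical helper, and the redundancy test of step (4) is left out.

```python
from itertools import combinations

# Transposed stoichiometric matrix of the example system: one row per internal
# reaction, columns A, B, C, D, E (column order is an assumption).
ST = [
    [-1,  1,  0,  0,  0],   # v1: A -> B
    [ 0, -1,  1,  0,  0],   # v2: B -> C
    [ 0,  1, -1,  0,  0],   # v3: C -> B
    [ 0,  0, -1,  1,  0],   # v4: C -> D
    [ 0,  0,  1, -1,  0],   # v5: D -> C
    [ 0,  0, -1,  0,  1],   # v6: C -> E
]
n = len(ST)

# T(0) = [I | S^T]; the identity part keeps track of the reactions used.
T0 = [[int(i == j) for j in range(n)] + ST[i] for i in range(n)]

def balance_column(T, col):
    """Steps (2) and (3) for one column: copy the rows with a zero in `col`,
    then add every pair of rows whose entries in `col` have opposite sign,
    scaled so that the sum has a zero in that column."""
    new = [row[:] for row in T if row[col] == 0]
    rest = [row for row in T if row[col] != 0]
    for r1, r2 in combinations(rest, 2):
        if r1[col] * r2[col] < 0:
            a, b = abs(r2[col]), abs(r1[col])
            new.append([a * x + b * y for x, y in zip(r1, r2)])
    return new

c = n + 2                     # 0-based index of metabolite C (3rd column of S^T)
T1 = balance_column(T0, c)
for row in T1:
    print(row)
```

Only non-negative multiples of rows are added, so every row of the new tableau remains a non-negative combination of the original reactions.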
remove “non-orthogonal” pathways
(4) For all of the rows added to T(x) in steps 2 and 3 check to make sure that no row exists that is a non-negative combination of any other sets of rows in T(x).
One method used is as follows:
let A(i) = set of column indices j for which the elements of row i = 0.
For the example above:
A(1) = {2,3,4,5,6,9,10,11}
A(2) = {1,4,5,6,7,8,9,10,11}
A(3) = {1,3,5,6,7,9,11}
A(4) = {1,3,4,5,7,9,10}
A(5) = {1,2,4,6,7,9,11}
A(6) = {1,2,3,6,7,8,9,10,11}
A(7) = {1,2,3,4,7,8,9}
Then check to determine if there exists another row (h) for which A(i) is a subset of A(h).
If $A(i) \subseteq A(h)$, $i \ne h$, where $A(i) = \{ j : T_{i,j} = 0,\ 1 \le j \le (n+m) \}$, then row i must be eliminated from T(x).
Schilling et al. JTB 203, 229
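The zero-set test of step (4) can be sketched as follows. The tableau is T(1) of the example system; the function names are hypothetical, and the handling of identical zero sets (keep the first copy) is an assumption added for the sketch.

```python
# T(1) of the example system: columns v1..v6 followed by the metabolite columns.
T1 = [
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, -1, 1],
]

def zero_set(row):
    """A(i): 1-based column indices j for which the row entry is 0."""
    return {j + 1 for j, x in enumerate(row) if x == 0}

def remove_dependent_rows(T):
    """Eliminate row i if A(i) is contained in A(h) for another row h."""
    A = [zero_set(row) for row in T]
    kept = []
    for i, row in enumerate(T):
        # Assumption: for identical zero sets only the first copy is kept.
        eliminated = any(
            A[i] < A[h] or (A[i] == A[h] and h < i)
            for h in range(len(T)) if h != i
        )
        if not eliminated:
            kept.append(row)
    return kept

# All seven rows of T(1) pass the test.
print(len(remove_dependent_rows(T1)))

# A row built as the sum of rows 1 and 2 is recognised as redundant.
combined = [x + y for x, y in zip(T1[0], T1[1])]
print(remove_dependent_rows(T1 + [combined]) == T1)
```

A row with fewer zeros than another row whose zeros cover its own uses a superset of that row's reactions, which is why it cannot be an extreme pathway.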
repeat steps for all internal metabolites
(5) With the formation of T(x), complete steps 2 – 4 for all of the metabolites that do not have an unconstrained exchange flux operating on the metabolite, incrementing x by one up to μ. The final tableau will be T(μ).
Note that the number of rows in T(μ) will be equal to k, the number of extreme pathways.
Schilling et al.
JTB 203, 229
balance external fluxes
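(6) Next we append T(E) to the bottom of T(μ). (In the example here μ = 1.)
This results in the following tableau:
T(1/E) = (columns: v1 v2 v3 v4 v5 v6 b1 b2 b3 b4 | A B C D E)
1 0 0 0 0 0 0 0 0 0 | -1 1 0 0 0
0 1 1 0 0 0 0 0 0 0 | 0 0 0 0 0
0 1 0 1 0 0 0 0 0 0 | 0 -1 0 1 0
0 1 0 0 0 1 0 0 0 0 | 0 -1 0 0 1
0 0 1 0 1 0 0 0 0 0 | 0 1 0 -1 0
0 0 0 1 1 0 0 0 0 0 | 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 | 0 0 0 -1 1
0 0 0 0 0 0 1 0 0 0 | -1 0 0 0 0
0 0 0 0 0 0 0 1 0 0 | 0 -1 0 0 0
0 0 0 0 0 0 0 0 1 0 | 0 0 0 -1 0
0 0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1
Schilling et al. JTB 203, 229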
balance external fluxes
(7) Starting in the (n+1)-th column (or the first non-zero column on the right side),
if $T_{i,(n+1)} \ne 0$ then add the corresponding non-zero row from T(E) to row i so as to
produce 0 in the (n+1)-th column.
This is done by simply multiplying the corresponding row in T(E) by Ti,(n+1) and
adding this row to row i .
Repeat this procedure for each of the rows in the upper portion of the tableau so
as to create zeros in the entire upper portion of the (n+1) column.
When finished, remove the row in T(E) corresponding to the exchange flux for the
metabolite just balanced.
Schilling et al.
JTB 203, 229
balance external fluxes
(8) Follow the same procedure as in step (7) for each of the columns on the right
side of the tableau containing non-zero entries.
(In this example we need to perform step (7) for every column except the middle
column of the right side which corresponds to metabolite C.)
The final tableau T(final) will contain the transpose of the matrix P containing the
extreme pathways in place of the original identity matrix.
Schilling et al.
JTB 203, 229
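Steps (6) to (8) for the example can be sketched as below. The sketch assumes that the exchange fluxes b1..b4 act on metabolites A, B, D and E, and that each row of T(E) carries -1 in its metabolite column and +1 in its own bookkeeping column; under these assumptions the bookkeeping entry of an exchange flux is simply the factor by which its T(E) row is multiplied.

```python
# T(1) of the example system: columns v1..v6 | A B C D E.
T1 = [
    [1, 0, 0, 0, 0, 0, -1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0, -1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, -1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, -1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, -1, 1],
]
n = 6                          # number of internal reactions
exchange_cols = [0, 1, 3, 4]   # metabolite columns of b1..b4 (A, B, D, E; assumed)

P_T = []
for row in T1:
    fluxes = row[:n]           # bookkeeping part for v1..v6
    residual = row[n:]         # metabolite part that still has to become zero
    b = []
    for col in exchange_cols:
        factor = residual[col]           # T_i,col: multiplier for the T(E) row
        b.append(factor)                 # bookkeeping entry of the exchange flux
        residual[col] += factor * (-1)   # adding the scaled T(E) row zeroes the column
    assert all(x == 0 for x in residual)
    P_T.append(fluxes + b)

for row in P_T:
    print(row)
```

Each resulting row lists the fluxes v1..v6 and b1..b4 of one extreme pathway; a negative exchange entry means uptake, since exchange fluxes are defined as pointing outwards.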
pathway matrix
T(final) = [ P^T | 0 ] (all metabolite columns on the right side are zero)
P^T =
     v1  v2  v3  v4  v5  v6  b1  b2  b3  b4
p1    1   0   0   0   0   0  -1   1   0   0
p7    0   1   1   0   0   0   0   0   0   0
p3    0   1   0   1   0   0   0  -1   1   0
p2    0   1   0   0   0   1   0  -1   0   1
p4    0   0   1   0   1   0   0   1  -1   0
p6    0   0   0   1   1   0   0   0   0   0
p5    0   0   0   0   1   1   0   0  -1   1
Schilling et al. JTB 203, 229
Extreme Pathways for model system
Schilling et al.
JTB 203, 229
     v1  v2  v3  v4  v5  v6  b1  b2  b3  b4
p1    1   0   0   0   0   0  -1   1   0   0
p7    0   1   1   0   0   0   0   0   0   0
p3    0   1   0   1   0   0   0  -1   1   0
p2    0   1   0   0   0   1   0  -1   0   1
p4    0   0   1   0   1   0   0   1  -1   0
p6    0   0   0   1   1   0   0   0   0   0
p5    0   0   0   0   1   1   0   0  -1   1
The 2 pathways p6 and p7 are not shown (right below) because all their exchange fluxes with the exterior are 0. Such pathways have no net overall effect on the functional capabilities of the network. They belong to the cycling of reactions v4/v5 and v2/v3.
How reactions appear in pathway matrix
In the matrix P of extreme pathways, each column is an EP and each row
corresponds to a reaction in the network.
The numerical value of the i,j-th element corresponds to the relative flux level
through the i-th reaction in the j-th EP.
Papin, Price, Palsson,
Genome Res. 12, 1889 (2002)
$P_{LM} = P^T \cdot P$
Properties of pathway matrix
A symmetric Pathway Length Matrix P_LM can be calculated:
$P_{LM} = P^T \cdot P$
where the values along the diagonal correspond to the length of the EPs.
The off-diagonal terms of P_LM are the number of reactions that a pair of extreme pathways have in common.
Papin, Price, Palsson, Genome Res. 12, 1889 (2002)
Properties of pathway matrix
One can also compute a reaction participation matrix P_PM from P:
$P_{PM} = P \cdot P^T$
where the diagonal corresponds to the number of pathways in which the given reaction participates.
Papin, Price, Palsson, Genome Res. 12, 1889 (2002)
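Both matrix products can be tried out on the extreme pathways of the small example system (rows p1, p7, p3, p2, p4, p6, p5; columns v1..v6, b1..b4). The sketch assumes that P is first converted to its binary form (1 where a reaction participates, 0 otherwise), which is what makes the entries count reactions and pathways.

```python
import numpy as np

# P^T of the example system: one row per extreme pathway.
P_T = np.array([
    [1, 0, 0, 0, 0, 0, -1,  1,  0, 0],   # p1
    [0, 1, 1, 0, 0, 0,  0,  0,  0, 0],   # p7
    [0, 1, 0, 1, 0, 0,  0, -1,  1, 0],   # p3
    [0, 1, 0, 0, 0, 1,  0, -1,  0, 1],   # p2
    [0, 0, 1, 0, 1, 0,  0,  1, -1, 0],   # p4
    [0, 0, 0, 1, 1, 0,  0,  0,  0, 0],   # p6
    [0, 0, 0, 0, 1, 1,  0,  0, -1, 1],   # p5
])
P = P_T.T                          # reactions x pathways
P_bin = (P != 0).astype(int)       # assumed binary participation form

P_LM = P_bin.T @ P_bin             # pathway length matrix (pathways x pathways)
P_PM = P_bin @ P_bin.T             # reaction participation matrix (reactions x reactions)

print("pathway lengths:", np.diag(P_LM).tolist())
print("reactions shared by p1 and p3:", int(P_LM[0, 2]))
print("pathways per reaction:", np.diag(P_PM).tolist())
```

The exchange fluxes are counted as reactions here; restricting P to the rows v1..v6 before forming the products would count internal reactions only.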
Extreme Pathway Analysis
Calculation of EPs for increasingly large networks is computationally
intensive and results in the generation of large data sets.
Even for integrated genome-scale models for microbes under simple
conditions,
EP analysis can generate thousands of vectors!
Interpretation:
- the metabolic network of H. influenzae has an order of magnitude larger degree of
pathway redundancy than the metabolic network of H. pylori
Found elsewhere: the number of reactions that participate in EPs that produce a
particular product is poorly correlated to the product yield and the molecular
complexity of the product.
Possible way out?
Papin, Price, Palsson, Genome Res. 12, 1889 (2002)
Diagonalisation of pathway matrix?
http://mathworld.wolfram.com
Singular Value Decomposition of EP matrices
For a given EP matrix $P \in \mathbb{R}^{n \times p}$, SVD decomposes P into 3 matrices
$$P = U \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix}_{n \times p} V^T$$
where $U \in \mathbb{R}^{n \times n}$ is an orthonormal matrix of the left singular vectors, $V \in \mathbb{R}^{p \times p}$ is an analogous orthonormal matrix of the right singular vectors, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing the singular values $\sigma_i$, i = 1..r, arranged in descending order, where r is the rank of P.
The first r columns of U and V, referred to as the left and right singular vectors, or modes, are unique and form the orthonormal basis for the column space and row space of P.
The singular values are the square roots of the eigenvalues of $P^T P$. The magnitude of the singular values indicates the relative contribution of the singular vectors in U and V in reconstructing P.
E.g. the second singular value contributes less to the construction of P than the first singular value etc.
Price et al. Biophys J 84, 794 (2003)
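As an illustration, the decomposition can be computed with NumPy for the extreme pathway matrix of the small example system (10 reactions, 7 pathways). This is only a sketch on a toy matrix, not the genome-scale analysis of Price et al.

```python
import numpy as np

# P^T of the example system: rows = extreme pathways, columns v1..v6, b1..b4.
P_T = np.array([
    [1, 0, 0, 0, 0, 0, -1,  1,  0, 0],
    [0, 1, 1, 0, 0, 0,  0,  0,  0, 0],
    [0, 1, 0, 1, 0, 0,  0, -1,  1, 0],
    [0, 1, 0, 0, 0, 1,  0, -1,  0, 1],
    [0, 0, 1, 0, 1, 0,  0,  1, -1, 0],
    [0, 0, 0, 1, 1, 0,  0,  0,  0, 0],
    [0, 0, 0, 0, 1, 1,  0,  0, -1, 1],
], dtype=float)
P = P_T.T                                   # n x p = 10 x 7

U, s, Vt = np.linalg.svd(P)                 # U is n x n, Vt is p x p
r = np.linalg.matrix_rank(P)

# Rebuild P = U [Sigma 0; 0 0] V^T from the three factors.
Sigma = np.zeros(P.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
P_rebuilt = U @ Sigma @ Vt

# The singular values are the square roots of the eigenvalues of P^T P.
eig = np.sort(np.linalg.eigvalsh(P.T @ P))[::-1]

print("rank r =", r)
print("singular values:", np.round(s, 3))
print("reconstruction ok:", np.allclose(P_rebuilt, P))
```

Only r of the 7 singular values are non-zero, because the 7 extreme pathways of this system are linearly dependent even though none of them is a non-negative combination of the others.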
Singular Value Decomposition of EP: Interpretation
The first mode (as the other modes) corresponds to a valid biochemical pathway through the network.
The first mode will point into the portions of the cone with highest density of EPs.
Price et al. Biophys J 84, 794 (2003)
SVD applied to the H. influenzae and H. pylori systems
Cumulative fractional contributions for the singular value decomposition of the EP matrices of H. influenzae and H. pylori.
This plot represents the contribution of the first n modes to the overall description of the system.
Price et al. Biophys J 84, 794 (2003)
Summary
Price et al. Biophys J 84, 794 (2003)
Extreme pathway analysis provides a mathematically rigorous way to dissect
complex biochemical networks.
The matrix products $P^T P$ and $P P^T$ are useful ways to interpret pathway lengths and reaction participation.
However, the number of computed vectors may range in the thousands.
Therefore, meta-methods (e.g. singular value decomposition) are required that reduce the dimensionality to a useful number that can be inspected by humans.
Singular value decomposition may be one useful method ... and there are more to come.
V21
Metabolic Pathway Analysis: Elementary Modes
The technique of Elementary Flux Modes (EFM) was developed prior to extreme pathways (EP) by Stefan Schuster, Thomas Dandekar and co-workers:
Pfeiffer et al. Bioinformatics 15, 251 (1999)
Schuster et al. Nature Biotech. 18, 326 (2000)
The method is very similar to the "extreme pathway" method to construct a basis for metabolic flux states based on methods from convex algebra.
Extreme pathways are a subset of elementary modes, and for many systems,
both methods coincide.
Are the subtle differences important?
Review: Metabolite Balancing
For analyzing a biochemical network, its structure is expressed by the stoichiometric matrix S consisting of m rows corresponding to the substances (metabolites) and n columns corresponding to the stoichiometric coefficients of the metabolites in each reaction.
A vector v denotes the reaction rates (mmol / (g dry weight · hour)) and a vector c describes the metabolite concentrations.
Due to the high turnover of metabolite pools one often assumes pseudo-steady state (c(t) = constant), leading to the fundamental Metabolic Balancing Equation:
$$\frac{d\,c(t)}{dt} = 0 = S \cdot v \qquad (1)$$
Flux distributions v satisfying this relationship lie in the null space of S and are able to balance all metabolites.
Klamt et al. Bioinformatics 19, 261 (2003)
Review: Metabolic flux analysis
Metabolic flux analysis (MFA): determine preferably all components of the flux distribution v in a metabolic network during a certain stationary growth experiment.
Typically some measured or known rates must be provided to calculate unknown rates. Accordingly, v and S are partitioned into the known (v_b, S_b) and unknown part (v_a, S_a).
(1) leads to the central equation for MFA describing a flux scenario:
0 = S v = S_a v_a + S_b v_b.
The rank of S_a determines whether this scenario is redundant and/or underdetermined. Redundant systems can be checked for inconsistencies. In underdetermined scenarios, only some elements of v_a are uniquely calculable.
Klamt et al. Bioinformatics 19, 261 (2003)
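A minimal sketch of such a flux scenario, assuming a hypothetical toy network and invented "measured" rates. The partition into known and unknown parts follows the equation above; the classification rules (underdetermined if rank(S_a) is smaller than the number of unknown rates, redundant if it is smaller than the number of metabolites) are stated here as the usual reading of the rank criterion.

```python
import numpy as np

# Hypothetical toy network (assumption): rows A, B, C.
names = ["b_in", "v1", "v2", "v3", "growth", "b_c"]
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)

# Assumed measured rates (known part v_b).
known = {"b_in": 10.0, "v2": 4.0, "b_c": 0.0}
b_idx = [names.index(k) for k in known]
a_idx = [j for j in range(len(names)) if j not in b_idx]

Sa, Sb = S[:, a_idx], S[:, b_idx]
vb = np.array(list(known.values()))

rank = np.linalg.matrix_rank(Sa)
underdetermined = rank < len(a_idx)     # not all unknown rates are calculable
redundant = rank < S.shape[0]           # more balances than needed: consistency check possible

# Solve S_a v_a = -S_b v_b (least squares covers the redundant case as well).
va, *_ = np.linalg.lstsq(Sa, -Sb @ vb, rcond=None)
unknown = dict(zip([names[j] for j in a_idx], va))

print("underdetermined:", underdetermined, "redundant:", redundant)
print(unknown)
```

With three measured rates the three remaining rates are uniquely determined in this example; measuring fewer rates would leave the scenario underdetermined.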
Review: structural network analysis (SNA)
Whereas MFA focuses on a single flux distribution, techniques of Structural (Stoichiometric, Topological) Network Analysis (SNA) address general topological properties, overall capabilities, and the inherent pathway structure of a metabolic network.
Basic topological properties are, e.g., conserved moieties.
Flux Balance Analysis (FBA) searches for single optimal flux distributions (mostly with respect to the synthesis of biomass) fulfilling S v = 0 and additionally reversibility and capacity restrictions for each reaction ($\alpha_i \le v_i \le \beta_i$).
Klamt et al. Bioinformatics 19, 261 (2003)
Review: Metabolic Pathway Analysis (MPA)
Metabolic Pathway Analysis searches for meaningful structural and functional units in metabolic networks. The most promising, very similar approaches are based on convex analysis and use the sets of elementary flux modes (Schuster et al. 1999, 2000) and extreme pathways (Schilling et al. 2000).
Both sets span the space of feasible steady-state flux distributions by non-decomposable routes, i.e. no subset of reactions involved in an EFM or EP can hold the network balanced using non-trivial fluxes.
MPA can be used to study e.g.
- routing + flexibility/redundancy of networks
- functionality of networks
- identification of futile cycles
- gives all (sub)optimal pathways with respect to product/biomass yield
- can be useful for calculability studies in MFA
Klamt et al. Bioinformatics 19, 261 (2003)
Elementary Flux Modes
Start from a list of reaction equations and a declaration of reversible and irreversible reactions and of internal and external metabolites.
E.g. reaction scheme of monosaccharide metabolism (Fig. 1). It includes 15 internal metabolites and 19 reactions.
S has dimension 15 × 19.
It is convenient to reduce this matrix by lumping those reactions that necessarily operate together:
{Gap,Pgk,Gpm,Eno,Pyk}, {Zwf,Pgl,Gnd}
Such groups of enzymes can be detected automatically.
This reveals another two sequences {Fba,TpiA} and {2 Rpe,TktI,Tal,TktII}.
Schuster et al. Nature Biotech 18, 326 (2000)
Elementary Flux Modes
Lumping the reactions in any one sequence gives the following reduced system:
Construct initial tableau by combining S with identity matrix:
T(0) = (identity part, one column per reaction in row order | Ru5P FP2 F6P GAP R5P)
reversible:
1 0 0 0 0 0 0 0 0 | 0 0 1 0 0    Pgi
0 1 0 0 0 0 0 0 0 | 0 -1 0 2 0    {Fba,TpiA}
0 0 1 0 0 0 0 0 0 | -1 0 0 0 1    Rpi
0 0 0 1 0 0 0 0 0 | -2 0 2 1 -1    {2Rpe,TktI,Tal,TktII}
irreversible:
0 0 0 0 1 0 0 0 0 | 0 0 0 -1 0    {Gap,Pgk,Gpm,Eno,Pyk}
0 0 0 0 0 1 0 0 0 | 1 0 0 0 0    {Zwf,Pgl,Gnd}
0 0 0 0 0 0 1 0 0 | 0 1 -1 0 0    Pfk
0 0 0 0 0 0 0 1 0 | 0 -1 1 0 0    Fbp
0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1    Prs_DeoB
Schuster et al. Nature Biotech 18, 326 (2000)
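Elementary Flux Modes
Aim again: bring all entries of the right part of the matrix to 0.
E.g. 2*row3 - row4 gives a "reversible" row with 0 in column 10.
New "irreversible" rows with 0 entry in column 10 are obtained by row3 + row6 and by row4 + 2*row6.
In general, linear combinations of 2 rows corresponding to the same type of directionality go into the part of the respective type in the tableau. Combinations of different types go into the "irreversible" tableau because at least 1 reaction is irreversible. Irreversible reactions can only be combined using positive coefficients.
T(0) = (identity part | Ru5P FP2 F6P GAP R5P)
reversible:
1 0 0 0 0 0 0 0 0 | 0 0 1 0 0
0 1 0 0 0 0 0 0 0 | 0 -1 0 2 0
0 0 1 0 0 0 0 0 0 | -1 0 0 0 1
0 0 0 1 0 0 0 0 0 | -2 0 2 1 -1
irreversible:
0 0 0 0 1 0 0 0 0 | 0 0 0 -1 0
0 0 0 0 0 1 0 0 0 | 1 0 0 0 0
0 0 0 0 0 0 1 0 0 | 0 1 -1 0 0
0 0 0 0 0 0 0 1 0 | 0 -1 1 0 0
0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1
T(1) =
reversible:
1 0 0 0 0 0 0 0 0 | 0 0 1 0 0
0 1 0 0 0 0 0 0 0 | 0 -1 0 2 0
0 0 2 -1 0 0 0 0 0 | 0 0 -2 -1 3
irreversible:
0 0 0 0 1 0 0 0 0 | 0 0 0 -1 0
0 0 0 0 0 0 1 0 0 | 0 1 -1 0 0
0 0 0 0 0 0 0 1 0 | 0 -1 1 0 0
0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1
0 0 1 0 0 1 0 0 0 | 0 0 0 0 1
0 0 0 1 0 2 0 0 0 | 0 0 2 1 -1
Schuster et al. Nature Biotech 18, 326 (2000)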
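The combination rules can be sketched for one tableau column at a time. The code below is a simplified illustration, not the published algorithm: it pairs rows according to their directionality and zeroes one metabolite column, but it omits the tests for nonelementary and duplicate modes. The function names and the short row labels in the comments are hypothetical.

```python
from functools import reduce
from math import gcd

# Reduced monosaccharide system: one row per (lumped) reaction,
# metabolite columns Ru5P, FP2, F6P, GAP, R5P.
ST = [
    [ 0,  0,  1,  0,  0],   # Pgi                    (reversible)
    [ 0, -1,  0,  2,  0],   # {Fba,TpiA}             (reversible)
    [-1,  0,  0,  0,  1],   # Rpi                    (reversible)
    [-2,  0,  2,  1, -1],   # {2Rpe,TktI,Tal,TktII}  (reversible)
    [ 0,  0,  0, -1,  0],   # {Gap,Pgk,Gpm,Eno,Pyk}  (irreversible)
    [ 1,  0,  0,  0,  0],   # {Zwf,Pgl,Gnd}          (irreversible)
    [ 0,  1, -1,  0,  0],   # Pfk                    (irreversible)
    [ 0, -1,  1,  0,  0],   # Fbp                    (irreversible)
    [ 0,  0,  0,  0, -1],   # Prs_DeoB               (irreversible)
]
n = len(ST)
rows = [[int(i == j) for j in range(n)] + ST[i] for i in range(n)]
rev0, irr0 = rows[:4], rows[4:]

def combine(a, ca, b, cb):
    """Return ca*a + cb*b, divided by the greatest common divisor of its entries."""
    row = [ca * x + cb * y for x, y in zip(a, b)]
    g = reduce(gcd, (abs(x) for x in row))
    return [x // g for x in row] if g > 1 else row

def zero_column(rev, irr, col):
    """Build the next tableau in which column `col` is zero in every row."""
    new_rev = [r for r in rev if r[col] == 0]
    new_irr = [r for r in irr if r[col] == 0]
    act_rev = [r for r in rev if r[col] != 0]
    act_irr = [r for r in irr if r[col] != 0]
    # reversible + reversible: any sign allowed, result stays reversible
    for i in range(len(act_rev)):
        for k in range(i + 1, len(act_rev)):
            a, b = act_rev[i], act_rev[k]
            sign = 1 if a[col] * b[col] > 0 else -1
            new_rev.append(combine(a, abs(b[col]), b, -sign * abs(a[col])))
    # reversible + irreversible: the irreversible row keeps a positive coefficient
    for a in act_rev:
        for b in act_irr:
            sign = 1 if a[col] > 0 else -1
            new_irr.append(combine(b, abs(a[col]), a, -sign * b[col]))
    # irreversible + irreversible: only rows with opposite signs can be combined
    for i in range(len(act_irr)):
        for k in range(i + 1, len(act_irr)):
            a, b = act_irr[i], act_irr[k]
            if a[col] * b[col] < 0:
                new_irr.append(combine(a, abs(b[col]), b, abs(a[col])))
    return new_rev, new_irr

rev1, irr1 = zero_column(rev0, irr0, n)        # column 10 (Ru5P)
rev2, irr2 = zero_column(rev1, irr1, n + 1)    # column 11 (FP2)
print(len(rev1) + len(irr1), len(rev2) + len(irr2))
```

For the first two columns of this system the simplified pairing already yields tableaux with nine rows each, because no nonelementary or duplicate rows arise yet.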
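Elementary Flux Modes
Aim: zero column 11.
Include all possible (direction-wise allowed) linear combinations of rows.
Continue with columns 12-14.
T(1) = (identity part | Ru5P FP2 F6P GAP R5P)
reversible:
1 0 0 0 0 0 0 0 0 | 0 0 1 0 0
0 1 0 0 0 0 0 0 0 | 0 -1 0 2 0
0 0 2 -1 0 0 0 0 0 | 0 0 -2 -1 3
irreversible:
0 0 0 0 1 0 0 0 0 | 0 0 0 -1 0
0 0 0 0 0 0 1 0 0 | 0 1 -1 0 0
0 0 0 0 0 0 0 1 0 | 0 -1 1 0 0
0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1
0 0 1 0 0 1 0 0 0 | 0 0 0 0 1
0 0 0 1 0 2 0 0 0 | 0 0 2 1 -1
T(2) =
reversible:
1 0 0 0 0 0 0 0 0 | 0 0 1 0 0
0 0 2 -1 0 0 0 0 0 | 0 0 -2 -1 3
irreversible:
0 0 0 0 1 0 0 0 0 | 0 0 0 -1 0
0 0 0 0 0 0 0 0 1 | 0 0 0 0 -1
0 0 1 0 0 1 0 0 0 | 0 0 0 0 1
0 0 0 1 0 2 0 0 0 | 0 0 2 1 -1
0 1 0 0 0 0 1 0 0 | 0 0 -1 2 0
0 -1 0 0 0 0 0 1 0 | 0 0 1 -2 0
0 0 0 0 0 0 1 1 0 | 0 0 0 0 0
Schuster et al. Nature Biotech 18, 326 (2000)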
Elementary Flux Modes
In the course of the algorithm, one must avoid
- calculation of nonelementary modes (rows that contain fewer zeros than the row already present)
- duplicate modes (a pair of rows is only combined if it fulfills the condition $S(m_i^{(j)}) \cap S(m_k^{(j)}) \not\subseteq S(m_l^{(j+1)})$, where $S(m_l^{(j+1)})$ is the set of positions of 0 in this row)
- flux modes violating the sign restriction for the irreversible reactions.
Final tableau
T(5) = (identity part, one column per reaction | metabolite columns, all 0)
1 1 0 0 2 0 1 0 0 | 0 ... 0
-2 0 1 1 1 3 0 0 0 | 0 ... 0
0 2 1 1 5 3 2 0 0 | 0 ... 0
0 0 1 0 0 1 0 0 1 | 0 ... 0
5 1 4 -2 0 0 1 0 6 | 0 ... 0
-5 -1 2 2 0 6 0 1 0 | 0 ... 0
0 0 0 0 0 0 1 1 0 | 0 ... 0
This shows that the number of rows may decrease or increase in the course of the algorithm. All constructed elementary modes are irreversible.
Schuster et al. Nature Biotech 18, 326 (2000)
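The seven rows of the final tableau can be checked directly against the reduced system. The sketch below verifies for each mode the steady-state condition, the sign restriction for the irreversible reactions, and that no mode uses a proper subset of the reactions of another mode. The helper names are hypothetical; the stoichiometric rows are those of the initial tableau.

```python
# Reduced monosaccharide system: one row per (lumped) reaction,
# metabolite columns Ru5P, FP2, F6P, GAP, R5P.
ST = [
    [ 0,  0,  1,  0,  0],   # Pgi
    [ 0, -1,  0,  2,  0],   # {Fba,TpiA}
    [-1,  0,  0,  0,  1],   # Rpi
    [-2,  0,  2,  1, -1],   # {2Rpe,TktI,Tal,TktII}
    [ 0,  0,  0, -1,  0],   # {Gap,Pgk,Gpm,Eno,Pyk}
    [ 1,  0,  0,  0,  0],   # {Zwf,Pgl,Gnd}
    [ 0,  1, -1,  0,  0],   # Pfk
    [ 0, -1,  1,  0,  0],   # Fbp
    [ 0,  0,  0,  0, -1],   # Prs_DeoB
]
irreversible = [4, 5, 6, 7, 8]       # row indices of the irreversible reactions

# Rows of the final tableau T(5): one flux value per reaction.
modes = [
    [ 1,  1, 0,  0, 2, 0, 1, 0, 0],
    [-2,  0, 1,  1, 1, 3, 0, 0, 0],
    [ 0,  2, 1,  1, 5, 3, 2, 0, 0],
    [ 0,  0, 1,  0, 0, 1, 0, 0, 1],
    [ 5,  1, 4, -2, 0, 0, 1, 0, 6],
    [-5, -1, 2,  2, 0, 6, 0, 1, 0],
    [ 0,  0, 0,  0, 0, 0, 1, 1, 0],
]

def is_balanced(e):
    """Steady state: every internal metabolite is produced as fast as it is consumed."""
    return all(sum(e[j] * ST[j][i] for j in range(len(e))) == 0 for i in range(5))

def respects_directions(e):
    """Irreversible reactions must carry a non-negative flux."""
    return all(e[j] >= 0 for j in irreversible)

def support(e):
    """Set of reactions with non-zero flux."""
    return {j for j, x in enumerate(e) if x != 0}

checks = [(is_balanced(e), respects_directions(e)) for e in modes]
nested = [(i, k) for i in range(len(modes)) for k in range(len(modes))
          if i != k and support(modes[i]) < support(modes[k])]

print(checks)
print("modes whose reactions are a proper subset of another mode:", nested)
```

The last row is the futile cycle formed by Pfk and Fbp: it is balanced and feasible although it converts no external metabolite.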
Two approaches for Metabolic Pathway Analysis?
The pathway P(e) is an elementary flux mode if it fulfills conditions C1 – C3.
(C1) Pseudo steady-state: S e = 0. This ensures that none of the metabolites is consumed or produced in the overall stoichiometry.
(C2) Feasibility: rate $e_i \ge 0$ if reaction i is irreversible. This demands that only thermodynamically realizable fluxes are contained in e.
(C3) Non-decomposability: there is no vector v (unequal to the zero vector and to e) fulfilling C1 and C2 such that P(v) is a proper subset of P(e). This is the core characteristic of EFMs and EPs and supplies the decomposition of the network into smallest units (able to hold the network in steady state).
C3 is often called "genetic independence" because it implies that the enzymes in one EFM or EP are not a subset of the enzymes from another EFM or EP.
Klamt & Stelling Trends Biotech 21, 64 (2003)
Two approaches for Metabolic Pathway Analysis?
The pathway P(e) is an extreme pathway if it fulfills conditions C1 – C3 AND conditions C4 – C5.
(C4) Network reconfiguration: Each reaction must be classified either as exchange
flux or as internal reaction. All reversible internal reactions must be split up into
two separate, irreversible reactions (forward and backward reaction).
(C5) Systemic independence: the set of EPs in a network is the minimal set of
EFMs that can describe all feasible steady-state flux distributions.
Klamt & Stelling Trends Biotech 21, 64 (2003)
Two approaches for Metabolic Pathway Analysis?
Klamt & Stelling Trends Biotech 21, 64 (2003)
[Figure: example network with metabolites A, B, C, D, P, external metabolites A(ext), B(ext), C(ext), and reactions R1-R9]
Reconfigured Network
Klamt & Stelling Trends Biotech 21, 64 (2003)
[Figure: reconfigured example network with metabolites A, B, C, D, P, external metabolites A(ext), B(ext), C(ext), and reactions R1-R6, R7f, R7b, R8, R9]
3 EFMs are not systemically independent:
EFM1 = EP4 + EP5
EFM2 = EP3 + EP5
EFM4 = EP2 + EP3
Property 1 of EFMs
Klamt & Stelling Trends Biotech 21, 64 (2003)
The only difference in the set of EFMs emerging upon reconfiguration consists in
the two-cycles that result from splitting up reversible reactions. However, two-cycles
are not considered as meaningful pathways.
Valid for any network: Property 1
Reconfiguring a network by splitting up reversible reactions leads to the
same set of meaningful EFMs.
Software: FluxAnalyzer
What is the consequence when all exchange fluxes (and hence all reactions in the network) are irreversible?
EFMs and EPs always coincide!
Klamt & Stelling Trends Biotech 21, 64 (2003)
Property 2 of EFMs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Property 2
If all exchange reactions in a network are irreversible then the sets of
meaningful EFMs (both in the original and in the reconfigured network) and
EPs coincide.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Recognition of operational modes: routes for converting exclusively A to P.
EFM (network N1)
4 genetically independent routes (EFM1-EFM4).
EP (network N2)
Set of EPs does not contain all genetically independent routes. Searching for EPs leading from A to P via B, no pathway would be found.
Interpret the property of the EFM and EP network to recognize operational
modes, finding all the routes, ...
In the exam I would give you the desired property and ask you for the
interpretation.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Finding all the optimal routes: optimal pathways for synthesizing P during growth on A alone.
EFM (network N1)
EFM1 and EFM2 are optimal because they yield one mole P per mole substrate A (i.e. R3/R1 = 1), whereas EFM3 and EFM4 are only suboptimal (R3/R1 = 0.5).
EP (network N2)
One would only find the suboptimal EP1, not the optimal routes EFM1 and EFM2.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Analysis of network flexibility (structural robustness, redundancy): relative robustness of exclusive growth on A or B.
EFM (network N1)
4 pathways convert A to P (EFM1-EFM4), whereas for B only one route (EFM8) exists. When one of the internal reactions (R4-R9) fails, for production of P from A 2 pathways will always "survive". By contrast, removing reaction R8 already stops the production of P from B alone.
EP (network N2)
Only 1 EP exists for producing P by substrate A alone, and 1 EP for synthesizing P by (only) substrate B. One might suggest that both substrates possess the same redundancy of pathways, but as shown by EFM analysis, growth on substrate A is much more flexible than on B.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Relative importance of single reactions: relative importance of reaction R8.
EFM (network N1)
R8 is essential for producing P by substrate B, whereas for A there is no structurally "favored" reaction (R4-R9 all occur twice in EFM1-EFM4). However, considering the optimal modes EFM1, EFM2, one recognizes the importance of R8 also for growth on A.
EP (network N2)
Consider again biosynthesis of P from substrate A (EP1 only). Because R8 is not involved in EP1 one might think that this reaction is not important for synthesizing P from A. However, without this reaction, it is impossible to obtain optimal yields (1 P per A; EFM1 and EFM2).
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Enzyme subsets and excluding reaction pairs: suggest regulatory structures or rules.
EFM (network N1)
R6 and R9 are an enzyme subset. By contrast, R6 and R9 never occur together with R8 in an EFM. Thus (R6,R8) and (R8,R9) are excluding reaction pairs. (In an arbitrary composable steady-state flux distribution they might occur together.)
EP (network N2)
The EPs wrongly suggest that R4 and R8 are an excluding reaction pair, but they are not (EFM2). The enzyme subsets would be correctly identified. However, one can construct simple examples where the EPs would also suggest wrong enzyme subsets (not shown).
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Pathway length: shortest/longest pathway for production of P from A.
EFM (network N1)
The shortest pathway from A to P needs 2 internal reactions (EFM2), the longest 4 (EFM4).
EP (network N2)
Both the shortest (EFM2) and the longest (EFM4) pathway from A to P are not contained in the set of EPs.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Removing a reaction and mutation studies: effect of deleting R7.
EFM (network N1)
All EFMs not involving the specific reactions build up the complete set of EFMs in the new (smaller) sub-network. If R7 is deleted, EFMs 2,3,6,8 "survive". Hence the mutant is viable.
EP (network N2)
Analyzing a subnetwork implies that the EPs must be newly computed. E.g. when deleting R2, EFM2 would become an EP. For this reason, mutation studies cannot be performed easily.
Comparison of EFMs and EPs
Klamt & Stelling Trends Biotech 21, 64 (2003)
Problem
Constraining reaction reversibility: effect of R7 limited to B → C.
EFM (network N1)
For the case of R7, all EFMs but EFM1 and EFM7 "survive" because the latter ones utilize R7 with negative rate.
EP (network N2)
In general, the set of EPs must be recalculated: compare the EPs in network N2 (R2 reversible) and N4 (R2 irreversible).
Application of elementary modes
Metabolic network structure of E. coli determines key aspects of functionality and regulation
Compute EFMs for central metabolism of E. coli.
Catabolic part: substrate uptake reactions, glycolysis, pentose phosphate pathway, TCA cycle, excretion of by-products (acetate, formate, lactate, ethanol).
Anabolic part: conversions of precursors into building blocks like amino acids, to macromolecules, and to biomass.
Stelling et al. Nature 420, 190 (2002)
Metabolic network topology and phenotype
The total number of EFMs for given conditions is used as a quantitative measure of metabolic flexibility.
a, Relative number of EFMs N enabling deletion mutants in gene i (Δi) of E. coli to grow (abbreviated by µ) for 90 different combinations of mutation and carbon source. The solid line separates experimentally determined mutant phenotypes, namely inviability (1–40) from viability (41–90).
Stelling et al. Nature 420, 190 (2002)
The # of EFMs for a mutant strain allows correct prediction of the growth phenotype in more than 90% of the cases.
Robustness analysis
The # of EFMs qualitatively indicates whether a mutant is viable or not, but does
not describe quantitatively how well a mutant grows.
Define maximal biomass yield Ymass as the optimum of:
$$Y_{i,X/S_k} = \frac{e_i^{\mu}}{e_i^{S_k}}$$
e_i is the single reaction rate (growth and substrate uptake) in EFM i selected for utilization of substrate S_k.
Stelling et al. Nature 420, 190 (2002)
Software: FluxAnalyzer
Dependency of the mutants' maximal
growth yield Ymax(Δi) (open circles) and the
network diameter D(Δi) (open squares) on
the share of elementary modes
operational in the mutants. Data were
binned to reduce noise.
Stelling et al. Nature 420, 190 (2002)
Central metabolism of E.coli behaves in a highly robust manner because
mutants with significantly reduced metabolic flexibility show a growth yield
similar to wild type.
13. Lecture WS 2003/04
Bioinformatics III 114
Assume that optimization during biological evolution can be characterized
by the two objectives of flexibility (associated with robustness) and of
efficiency.
Flexibility means the ability to adapt to a wide range of environmental
conditions,
that is, to realize a maximal bandwidth of thermodynamically feasible flux
distributions (maximizing # of EFMs).
Efficiency could be defined as fulfilment of cellular demands with an optimal
outcome such as maximal cell growth using a minimum of constitutive
elements (genes and proteins, thus minimizing # EFMs).
These 2 criteria pose contradictory challenges.
Optimal cellular regulation needs to find a trade-off.
Can regulation be predicted by EFM analysis?
Stelling et al. Nature 420, 190 (2002)
13. Lecture WS 2003/04
Bioinformatics III 116
Can regulation be predicted by EFM analysis?
Compute control-effective fluxes for each reaction l by determining the efficiency of any EFM
ei by relating the system‘s output to the substrate uptake and to the sum of all absolute
fluxes.
With flux modes normalized to the total substrate uptake, the efficiencies εi(Sk, µ) and εi(Sk, ATP) for
the two targets of optimization, growth and ATP generation, are defined as:
$$ \varepsilon_i(S_k,\mu) = \frac{e_i^{\mu}}{\sum_l |e_i^l|} \quad \text{and} \quad \varepsilon_i(S_k,\mathrm{ATP}) = \frac{e_i^{\mathrm{ATP}}}{\sum_l |e_i^l|} $$
Control-effective fluxes vl(Sk) are obtained by averaged weighting of the product of reaction-
specific fluxes and mode-specific efficiencies over all EFMs using the substrate under
consideration:
$$ v_l(S_k) = \frac{1}{Y^{\max}_{X/S_k}} \cdot \frac{\sum_i \varepsilon_i(S_k,\mu)\,|e_i^l|}{\sum_i \varepsilon_i(S_k,\mu)} + \frac{1}{Y^{\max}_{A/S_k}} \cdot \frac{\sum_i \varepsilon_i(S_k,\mathrm{ATP})\,|e_i^l|}{\sum_i \varepsilon_i(S_k,\mathrm{ATP})} $$
Ymax(X/Sk) and Ymax(A/Sk) are optimal yields of biomass production and of ATP synthesis.
Stelling et al. Nature 420, 190 (2002)
Control-effective fluxes represent the importance of each reaction for efficient and flexible
operation of the entire network.
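As a rough numerical illustration of control-effective fluxes, the sketch below evaluates the efficiencies and the weighted average for a toy set of modes. Everything in it is an assumption made for illustration: the two modes, the reaction names, and the convention that each mode is already normalized to a substrate uptake of 1. It follows the verbal definition on this slide (efficiency = output divided by the sum of absolute fluxes; control-effective flux = efficiency-weighted average flux, scaled by the optimal yield) and is not the original implementation.

```python
# Toy elementary flux modes, normalized to substrate uptake = 1.
# "growth" and "ATP" are the two outputs; R1 and R2 are internal reactions.
MODES = [
    {"uptake": 1.0, "R1": 1.0, "growth": 0.5, "ATP": 1.0},
    {"uptake": 1.0, "R2": 1.0, "ATP": 2.0},
]


def yields(modes, product, substrate="uptake"):
    """Yield of `product` per unit substrate for every mode."""
    return [abs(m.get(product, 0.0)) / abs(m[substrate]) for m in modes]


def efficiencies(modes, product):
    """Output flux divided by the sum of all absolute fluxes of the mode."""
    return [m.get(product, 0.0) / sum(abs(f) for f in m.values()) for m in modes]


def control_effective_fluxes(modes, reactions):
    """Efficiency-weighted average flux per reaction, summed over both objectives."""
    result = {}
    for reaction in reactions:
        total = 0.0
        for product in ("growth", "ATP"):
            y_max = max(yields(modes, product))
            eps = efficiencies(modes, product)
            weighted = sum(e * abs(m.get(reaction, 0.0)) for e, m in zip(eps, modes))
            total += weighted / (y_max * sum(eps))
        result[reaction] = total
    return result


cef = control_effective_fluxes(MODES, ["R1", "R2"])
for name, value in cef.items():
    print(name, round(value, 3))
```

R1 receives the larger value here because it is used by the only growth-supporting mode, which is the kind of ranking the slide describes as "importance for efficient and flexible operation".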
13. Lecture WS 2003/04
Bioinformatics III 117
Summary
EFMs are a robust method that offers great opportunities for studying functional and
structural properties in metabolic networks.
Klamt & Stelling suggest that the term „elementary flux modes“ should be used
whenever the sets of EFMs and EPs are identical.
In cases where they are not, EPs are a subset of EFMs.
It remains to be understood more thoroughly how much valuable information about
the pathway structure is lost by using EPs.
Ongoing Challenges:
- study really large metabolic systems by subdividing them
- combine metabolic model with model of cellular regulation.
Klamt & Stelling Trends Biotech 21, 64 (2003)
13. Lecture WS 2003/04
Bioinformatics III 118
V22
13. Lecture WS 2003/04
Bioinformatics III 119
Integrated Analysis of Metabolic and Regulatory Networks
So far, studies of large-scale cellular networks have focused on their connectivities.
The emerging picture shows a densely-woven web where almost everything is
connected to everything.
In the cell‘s metabolic network, hundreds of substrates are interconnected through
biochemical reactions. Although this could in principle lead to the
simultaneous flow of substrates in numerous directions, in practice metabolic fluxes
pass through specific pathways.
Topological studies so far did not consider how the modulation of this connectivity
might also determine network properties.
Therefore it is important to correlate the network topology (picture derived
from EFMs and EPs) with the expression of enzymes in the cell.
Start with review of last lecture‘s final point about coupling of metabolic and
regulatory networks.
13. Lecture WS 2003/04
Bioinformatics III 120
Analyze transcriptional control in metabolic networks
Regulatory and metabolic functions of cells are mediated by networks of interacting
biochemical components.
Metabolic flux is optimized to maximize metabolic efficiency under different
conditions.
Control of metabolic flow:
- allosteric interactions
- covalent modifications involving enzymatic activity
- transcription (revealed by genome-wide expression studies)
Here: N. Barkai and colleagues analyzed published experimental expression data of
Saccharomyces cerevisiae.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 121
Recurrence signature algorithm
The availability of DNA microarray data makes it possible to study the transcriptional response of a complete
genome to different experimental conditions.
An essential task in studying the global structure of transcriptional networks is the
gene classification.
Commonly used clustering algorithms classify genes successfully when applied to
relatively small data sets, but their application to large-scale expression data is
limited by 2 well-recognized drawbacks:
- commonly used algorithms assign each gene to a single cluster, whereas in fact
genes may participate in several functions and should thus be included in several
clusters
- these algorithms classify genes on the basis of their expression under all
experimental conditions, whereas cellular processes are generally affected only by
a small subset of these conditions.
Ihmels et al. Nat Genetics 31, 370 (2002)
13. Lecture WS 2003/04
Bioinformatics III 122
Recurrence signature algorithm
Aim: identify transcription „modules“ (TMs).
A set of randomly selected genes is unlikely to be identical to the genes of any
TM. Yet many such sets do have some overlap with a specific TM.
In particular, sets of genes that are compiled according to existing knowledge of
their functional (or regulatory) sequence similarity may have a significant overlap
with a transcription module.
Algorithm receives a gene set that partially overlaps a TM and then provides the
complete module as output. Therefore this algorithm is referred to as „signature
algorithm“.
Ihmels et al. Nat Genetics 31, 370 (2002)
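The idea of completing a module from a partially overlapping input set can be sketched as a two-step scoring procedure. The code below is a deliberately simplified reading of that idea, not the published algorithm: the normalization step is omitted, and the expression values, gene names, thresholds `t_cond` and `t_gene`, and the function name are all hypothetical. Step 1 scores conditions by the average expression of the input genes; step 2 scores every gene over the selected conditions.

```python
# Toy expression matrix: gene -> expression values over four conditions.
# Genes A, B and C form the (invented) module; X is unrelated.
EXPRESSION = {
    "A": [2.0, 1.8, 0.1, -0.1],
    "B": [1.9, 2.1, 0.0, 0.2],
    "C": [2.2, 1.7, -0.2, 0.1],
    "X": [-0.1, 0.2, 1.5, -1.0],
}


def signature_algorithm(expression, input_genes, t_cond=1.0, t_gene=1.0):
    """Return (module genes, selected conditions) for an input gene set."""
    n_cond = len(next(iter(expression.values())))

    # Step 1: score each condition by the mean expression of the input genes
    # and keep the conditions whose score exceeds the condition threshold.
    cond_scores = [
        sum(expression[g][c] for g in input_genes) / len(input_genes)
        for c in range(n_cond)
    ]
    conditions = [c for c, s in enumerate(cond_scores) if abs(s) > t_cond]
    if not conditions:
        return set(), []

    # Step 2: score every gene over the selected conditions, weighted by the
    # condition scores, and keep the genes above the gene threshold.
    module = set()
    for gene, profile in expression.items():
        score = sum(cond_scores[c] * profile[c] for c in conditions) / len(conditions)
        if score > t_gene:
            module.add(gene)
    return module, conditions


# The input set overlaps the module only partially: it misses C and contains X.
module, conditions = signature_algorithm(EXPRESSION, ["A", "B", "X"])
print(sorted(module), conditions)
```

Running the function on several different input subsets and checking that they return the same module corresponds to the recurrence test described on the following slide.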
13. Lecture WS 2003/04
Bioinformatics III 123
Recurrence signature algorithm
a, The signature algorithm.
b , Recurrence as a reliability measure. The signature algorithm is applied to distinct input
sets containing different subsets of the postulated transcription module. If the different input
sets give rise to the same module, it is considered reliable.
c, General application of the recurrent signature method.
Ihmels et al. Nat Genetics 31, 370 (2002)
normalization of data
identify modules
classify genes into modules
13. Lecture WS 2003/04
Bioinformatics III 124
Correlation between genes of the same metabolic pathway
Distribution of the average correlation
between genes assigned to the same
metabolic pathway in the KEGG database.
The distribution corresponding to random
assignment of genes to metabolic
pathways of the same size is shown for
comparison. Importantly, only genes
coding for enzymes were used in the
random control.
Interpretation: pairs of genes associated
with the same metabolic pathway show a
similar expression pattern.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
However, typically only a subset of the
genes assigned to a given pathway
is coregulated.
13. Lecture WS 2003/04
Bioinformatics III 125
Correlation between genes of the same metabolic pathway
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Genes of the glycolysis pathway
(according to KEGG) were clustered
and ordered based on the correlation
in their expression profiles.
Shown here is the matrix of their
pair-wise correlations.
The cluster of highly correlated
genes (orange frame) corresponds
to genes that encode the central
glycolysis enzymes.
The linear arrangement of these
genes along the pathway is shown at
right.
Of the 46 genes assigned to the
glycolysis pathway in the KEGG
database, only 24 show a correlated
expression pattern.
In general, the coregulated genes
belong to the central pieces of
pathways.
13. Lecture WS 2003/04
Bioinformatics III 126
Coexpressed enzymes often catalyze linear chain of reactions
Coregulation between enzymes
associated with central metabolic
pathways. Each branch
corresponds to several enzymes.
In the cases shown, only one of the
branches downstream of the
junction point is coregulated with
upstream genes.
Interpretation: coexpressed
enzymes are often arranged in a
linear order, corresponding to a
metabolic flow that is directed in a
particular direction.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 127
Co-regulation at branch points
To examine more systematically whether coregulation enhances the linearity of
metabolic flow, analyze the coregulation of enzymes at metabolic branch-points.
Search KEGG for metabolic compounds that are involved in exactly 3 reactions.
Only consider reactions that exist in S. cerevisiae.
3-junctions can integrate metabolic flow (convergent junction)
or allow the flow to diverge in 2 directions (divergent junction).
In the cases where several reactions are catalyzed by the same enzymes, choose
one representative so that all junctions considered are composed of precisely 3
reactions catalyzed by distinct enzymes.
Each 3-junction is categorized according to the correlation pattern found between
enzymes catalyzing its branches. Correlation coefficients > 0.25 are considered
significant.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
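The categorization of a divergent 3-junction can be illustrated with a small Python sketch. The 0.25 cutoff comes from this slide; the expression profiles, the category labels and the function names are hypothetical, and the real analysis distinguishes more junction types than the three outcomes shown here.

```python
from math import sqrt

CUTOFF = 0.25  # correlation threshold quoted on the slide


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def classify_divergent_junction(incoming, outgoing_a, outgoing_b, cutoff=CUTOFF):
    """Label a divergent junction by how many outgoing branches follow the incoming one."""
    n_correlated = sum(
        pearson(incoming, branch) > cutoff for branch in (outgoing_a, outgoing_b)
    )
    # Labels are illustrative names, not the categories of the original paper.
    return {0: "uncorrelated", 1: "linear", 2: "both branches coregulated"}[n_correlated]


# Invented expression profiles over five conditions.
incoming = [1.0, 2.0, 3.0, 4.0, 5.0]
branch_a = [2.0, 4.0, 6.0, 8.0, 10.5]   # follows the incoming enzyme
branch_b = [5.0, 1.0, 4.0, 2.0, 3.0]    # does not

print(classify_divergent_junction(incoming, branch_a, branch_b))
```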
13. Lecture WS 2003/04
Bioinformatics III 128
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Coregulation pattern in three-point junctions
In the majority of divergent
junctions, only one of the
emanating branches is
significantly coregulated with
the incoming reaction that
synthesizes the metabolite.
All junctions corresponding to metabolites that participate in exactly 3
reactions (according to KEGG) were identified and the correlations
between the genes associated with each such junction were calculated.
The junctions were grouped according to the directionality of the
reactions, as shown.
Divergent junctions, which allow the flow of metabolites in two
alternative directions, predominantly show a linear coregulation pattern,
where one of the emanating reactions is correlated with the incoming
reaction (linear regulatory pattern) or the two alternative outgoing
reactions are correlated in a context-dependent manner with a distinct
isozyme catalyzing the incoming reaction (linear switch).
By contrast, the linear regulatory pattern is significantly less abundant
in convergent junctions, where the outgoing flow follows a unique
direction, and in conflicting junctions that do not support metabolic flow.
Most of the reversible junctions comply with linear regulatory patterns.
Indeed, similar to divergent junctions, reversible junctions allow
metabolites to flow in two alternative directions. Reactions were
counted as coexpressed if at least two of the associated genes were
significantly correlated (correlation coefficient >0.25). As a random
control, we randomized the identity of all metabolic genes and repeated
the analysis.
13. Lecture WS 2003/04
Bioinformatics III 129
Co-regulation at branch points: conclusions
The observed co-regulation patterns correspond to a linear metabolic flow, whose
directionality can be switched in a condition-specific manner.
When analyzing junctions that allow metabolic flow in a larger number of
directions, again only a few important branches are coregulated with the
incoming branch.
Therefore: transcription regulation is used to enhance the linearity of metabolic
flow, by biasing the flow toward only a few of the possible routes.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 130
The connectivity of a given metabolite
is defined as the number of reactions
connecting it to other metabolites.
Shown are the distributions of
connectivity between metabolites in an
unrestricted network and in a
network where only correlated
reactions are considered.
In accordance with previous results
(Jeong et al. 2000) , the connectivity
distribution between metabolites
follows a power law (log-log plot).
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Connectivity of metabolites
In contrast, when coexpression is
used as a criterion to distinguish
functional links, the connectivity
distribution becomes exponential
(log-linear plot).
13. Lecture WS 2003/04
Bioinformatics III 131
Differential regulation of isozymes
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Observe that isozymes at junction points are often preferentially coexpressed
with alternative reactions.
Therefore, investigate their role in the metabolic network more systematically.
Two possible functions of isozymes
associated with the same metabolic
reaction.
An isozyme pair could provide redundancy which may be needed for buffering genetic
mutations or for amplifying metabolite production. Redundant isozymes are expected
to be coregulated.
Alternatively, distinct isozymes could be dedicated to separate biochemical
pathways using the associated reaction. Such isozymes are expected to be
differentially expressed with the two alternative processes.
13. Lecture WS 2003/04
Bioinformatics III 132
Arrows represent metabolic
pathways composed of a sequence
of enzymes.
Coregulation is indicated with the
same color (e.g., the isozyme
represented by the green arrow is
coregulated with the metabolic
pathway represented by the green
arrow).
Most members of isozyme
pairs are separately coregulated
with alternative processes.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Differential regulation of isozymes in central metabolic PW
13. Lecture WS 2003/04
Bioinformatics III 133
The primary role of isozyme multiplicity is to allow for differential regulation
of reactions that are shared by separated processes.
Dedicating a specific enzyme to each pathway may offer a way of
independently controlling the associated reaction in response to pathway-
specific requirements, at both the transcriptional and the post-transcriptional
levels.
Describe the likely function of isozymes in metabolic pathways.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Differential regulation of isozymes: interpretation
13. Lecture WS 2003/04
Bioinformatics III 134
Co-expression of transporters
Transporter genes are
co-expressed with the relevant
metabolic pathways, providing
the pathways with their
metabolites.
Co-expression is marked in green.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 135
Co-regulation of transcription factors
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Transcription factors are often co-regulated with their regulated pathways.
Shown here are transcription factors which were found to be co-regulated in the
analysis. Co-regulation is shown by color-coding such that the transcription factor
and the associated pathways are of the same color.
13. Lecture WS 2003/04
Bioinformatics III 136
So far: co-expression analysis revealed a strong tendency toward coordinated
regulation of genes involved in individual metabolic pathways.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
Hierarchical modularity in the metabolic network
Does transcription regulation also define a higher-order metabolic organization, by
coordinated expression of distinct metabolic pathways?
Based on observation that feeder pathways (which synthesize metabolites) are
frequently coexpressed with pathways using the synthesized metabolites.
13. Lecture WS 2003/04
Bioinformatics III 137
Feeder-pathways/enzymes
Feeder pathways or genes
co-expressed with the
pathways they fuel. The
feeder pathways (light
blue) provide the main
pathway (dark blue) with
metabolites in order to
assist the main pathway,
indicating that co-
expression extends
beyond the level of
individual pathways.
These results can be
interpreted in the following
way: the organism will
produce those enzymes that
are needed.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 138
Hierarchical modularity in the metabolic network
Derive hierarchy by applying an iterative
signature algorithm to the metabolic pathways,
and decreasing the resolution parameter
(coregulation stringency) in small steps.
Each box contains a group of coregulated genes
(transcription module). Strongly associated
genes (left) can be associated with a specific
function, whereas moderately correlated
modules (right) are larger and their function is
less coherent.
The merging of 2 branches indicates that the
associated modules are induced by similar
conditions.
All pathways converge to one of 3 low-resolution
modules: amino acid biosynthesis, protein
synthesis, and stress.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 139
Global network properties
Jeong et al. showed that the structural connectivity between metabolites imposes a
hierarchical organization of the metabolic network. That analysis was based on
connectivity between substrates, considering all potential connections.
Here, analysis is based on coexpression of enzymes.
In both approaches, related metabolic pathways were clustered together!
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
There are, however, some differences in the particular groupings (not
discussed here),
and importantly, when including expression data the connectivity pattern of
metabolites changes from a power-law dependence to an exponential one
corresponding to a network structure with a defined scale of connectivity.
This reflects the reduction in the complexity of the network.
13. Lecture WS 2003/04
Bioinformatics III 140
Summary
Transcription regulation is prominently involved in shaping the metabolic
network of S. cerevisiae.
1 Transcription leads the metabolic flow toward linearity.
2 Individual isozymes are often separately coregulated with distinct
processes, providing a means of reducing crosstalk between pathways
using a common reaction.
3 Transcription regulation entails a higher-order structure of the metabolic
network.
There exists a hierarchical organization of metabolic pathways into groups of
decreasing expression coherence.
Ihmels, Levy, Barkai, Nat. Biotech 22, 86 (2004)
13. Lecture WS 2003/04
Bioinformatics III 141
V23, 24
13. Lecture WS 2003/04
Bioinformatics III 142
V25
13. Lecture WS 2003/04
Bioinformatics III 143
Integrating Protein-Protein Interactions: Bayesian Networks
- A lot of direct experimental data on protein-protein interactions is becoming available
(Y2H, MS)
Jansen et al. Science 302, 449 (2003)
- Genomic information also provides indirect information:
- interacting proteins are often significantly coexpressed (microarrays)
- interacting proteins are often colocalized to the same subcellular
compartment
13. Lecture WS 2003/04
Bioinformatics III 144
Problems
Jansen et al. Science 302, 449 (2003)
Unfortunately, interaction data sets are often incomplete and contradictory
(von Mering et al. 2002)
In the context of genome-wide analyses, these inaccuracies are greatly
magnified because the protein pairs that do not interact (negatives) by far
outnumber those that do interact (positives).
E.g. in yeast, the ~6000 proteins allow for N (N-1) / 2 ~ 18 million potential
interactions. But the estimated number of actual interactions is < 100,000.
Therefore, even reliable techniques can generate many false positives when
applied genome-wide.
Think of a diagnostic with a 1% false-positive rate for a rare disease occurring in
0.1% of the population. This would roughly produce 1 true positive for every 10
false ones.
Why are these inaccuracies greatly magnified in the case of protein
interaction networks?
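The base-rate argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the sensitivity of 100% is an assumption that the slide does not state. The yeast line reuses the upper bound of 100,000 interactions among ~18 million pairs and simply assumes the same 1% false-positive rate for comparison.

```python
def true_to_false_positive_ratio(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected number of true positives per false positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / false_positives


# Diagnostic example from the slide: 0.1% prevalence, 1% false-positive rate.
disease = true_to_false_positive_ratio(prevalence=0.001, false_positive_rate=0.01)
print("disease test, TP per FP:", round(disease, 3))

# Yeast interactome: at most ~100,000 true interactions among ~18 million pairs.
# The 1% false-positive rate is an assumed value, not a measured one.
yeast = true_to_false_positive_ratio(prevalence=100_000 / 18_000_000,
                                     false_positive_rate=0.01)
print("yeast pairs, TP per FP:", round(yeast, 3))
```

Even with a seemingly small error rate, false positives outnumber true positives because the negatives dominate the population being tested.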
13. Lecture WS 2003/04
Bioinformatics III 145
Integrative Approach
Jansen et al. Science 302, 449 (2003)
One would like to integrate evidence from many different sources to better
discriminate true from false protein-protein interaction predictions.
Here, use Bayesian approach for integrating interaction information that allows for
the probabilistic combination of multiple data sets; apply to yeast.
Input: Approach can be used for combining noisy genomic interaction data sets.
Normalization: Each source of evidence for interactions is compared against
samples of known positives and negatives (“gold-standard”).
Output: predict for every possible protein pair likelihood of interaction.
Verification: test on experimental interaction data not included in the gold-standard
+ new TAP (tandem affinity purification) experiments.
13. Lecture WS 2003/04
Bioinformatics III 146
Integration of various information sources
Jansen et al. Science 302, 449 (2003)
The three different types of data used:
(i) Interaction data from high-throughput experiments. These comprise large-scale
two-hybrid screens (Y2H) (Uetz et al., Ito et al.) and in vivo pull-down
experiments (Gavin et al., Ho et al.).
(ii) Other genomic features. We considered expression data, biological function of
proteins (from Gene Ontology biological process and the MIPS functional
catalog), and data about whether proteins are essential.
(iii) Gold-standards of known interactions and noninteracting protein pairs.
13. Lecture WS 2003/04
Bioinformatics III 147
Combination of data sets into probabilistic interactomes
The 4 interaction data sets from HT experiments were combined into 1 PIE
(probabilistic interactome experimental).
The PIE represents a transformation of the individual binary-valued
interaction sets into a data set where every protein pair is weighted
according to the likelihood that it exists in a complex.
Because the 4 experimental interaction data sets contain correlated evidence,
a fully connected Bayesian network is used.
A „naïve” Bayesian network is used to model the PIP (probabilistic interactome
predicted) data. These information sets hardly overlap.
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 148
Gold-Standard
Jansen et al. Science 302, 449 (2003)
should be
(i) independent from the data sources serving as evidence
(ii) sufficiently large for reliable statistics
(iii) free of systematic bias (e.g. towards certain types of interactions).
Positives: use MIPS (Munich Information Center for Protein Sequences, HW
Mewes) complexes catalog: hand-curated list of complexes (8250 protein pairs that
are within the same complex) from biomedical literature.
Negatives:
- harder to define
- essential for successful training
Assume that proteins in different compartments do not interact.
Synthesize “negatives” from lists of proteins in separate subcellular compartments.
13. Lecture WS 2003/04
Bioinformatics III 149
Measure of reliability: likelihood ratio
Jansen et al. Science 302, 449 (2003)
Consider a genomic feature f expressed in binary terms (i.e. „absent“ or „present“).
Likelihood ratio L(f) is defined as:
$$ L(f) = \frac{\text{fraction of gold-standard positives having feature } f}{\text{fraction of gold-standard negatives having feature } f} $$
L(f) = 1 means that the feature has no predictive power: the same fraction of positives
and negatives have feature f.
The larger L(f), the better its predictive power.
13. Lecture WS 2003/04
Bioinformatics III 150
Combination of features
Jansen et al. Science 302, 449 (2003)
For two features f1 and f2 with uncorrelated evidence,
the likelihood ratio of the combined evidence is simply the product:
L(f1,f2) = L(f1) L(f2)
For correlated evidence L(f1,f2) cannot be factorized in this way.
Bayesian networks are a formal representation of such relationships between
features.
The combined likelihood ratio is proportional to the estimated odds that two
proteins are in the same complex, given multiple sources of information.
13. Lecture WS 2003/04
Bioinformatics III 151
Prior and posterior odds
„positive“ : a pair of proteins that are in the same complex. Given the number of
positives among the total number of protein pairs, the „prior“ odds of finding a
positive are:
$$ O_{prior} = \frac{P(pos)}{P(neg)} = \frac{P(pos)}{1 - P(pos)} $$
„posterior“ odds: odds of finding a positive after considering N datasets with values
f1 ... fN :
$$ O_{post} = \frac{P(pos \mid f_1 \ldots f_N)}{P(neg \mid f_1 \ldots f_N)} $$
The terms „prior“ and „posterior“ refer to the situation before and after knowing the
information in the N datasets.
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 152
Static naive Bayesian Networks
In the case of protein-protein interaction data, the posterior odds describe the
odds of having a protein-protein interaction given that we have the information from
the N experiments,
whereas the prior odds are related to the chance of randomly finding a protein-
protein interaction when no experimental data is known.
If Opost > 1, the chances of having an interaction are
higher than having no interaction.
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 153
Static naive Bayesian Networks
The likelihood ratio L defined as
$$ L(f_1 \ldots f_N) = \frac{P(f_1 \ldots f_N \mid pos)}{P(f_1 \ldots f_N \mid neg)} $$
relates prior and posterior odds according to Bayes‘ rule:
$$ O_{post} = L(f_1 \ldots f_N) \, O_{prior} $$
In the special case that the N features are conditionally independent
(i.e. they provide uncorrelated evidence) the Bayesian network is a so-called
„naïve” network, and L can be simplified to:
$$ L(f_1 \ldots f_N) = \prod_{i=1}^{N} L(f_i) = \prod_{i=1}^{N} \frac{P(f_i \mid pos)}{P(f_i \mid neg)} $$
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 154
Computation of prior and posterior odds
L can be computed from contingency tables relating positive and negative
examples with the N features (by binning the feature values f1 ... fN into discrete
intervals); examples follow on the next slides.
Determining the prior odds Oprior is somewhat arbitrary in that it requires an
assumption about the number of positives.
Jansen et al. believe that 30,000 is a conservative lower bound for the number of
positives (i.e. pairs of proteins that are in the same complex).
Considering that there are ca. 18 million = 0.5 * N (N – 1) possible protein pairs in
total (with N = 6000 for yeast), the prior odds are approximately:
$$ O_{prior} = \frac{3 \times 10^{4}}{18 \times 10^{6}} = \frac{1}{600} $$
Opost > 1 can be achieved with L > 600.
Jansen et al. Science 302, 449 (2003)
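The relation between prior odds, likelihood ratio and posterior odds can be made concrete with a short Python sketch. The figures of 30,000 positives and 6000 proteins are taken from this slide; the function names and the three example likelihood ratios are hypothetical, and the product rule is only valid under the naïve (conditionally independent) assumption.

```python
from math import prod


def prior_odds(n_positives, n_pairs):
    """Odds of a randomly chosen protein pair being a positive."""
    p_pos = n_positives / n_pairs
    return p_pos / (1.0 - p_pos)


def posterior_odds(likelihood_ratios, o_prior):
    """Naive Bayes: multiply the per-feature likelihood ratios with the prior odds."""
    return prod(likelihood_ratios) * o_prior


n_proteins = 6000
n_pairs = n_proteins * (n_proteins - 1) // 2   # about 18 million pairs
o_prior = prior_odds(30_000, n_pairs)
print("prior odds are about 1 :", round(1 / o_prior))

# Three invented per-feature likelihood ratios for one protein pair.
features = [3.6, 20.0, 10.0]
print("combined L:", prod(features))
print("posterior odds:", round(posterior_odds(features, o_prior), 2))
```

No single ratio in this example exceeds the cutoff of roughly 600, but their product does, which is the situation the integrative approach is designed to exploit.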
13. Lecture WS 2003/04
Bioinformatics III 155
Parameters of the naïve Bayesian Networks (PIP)
Column 1 describes the genomic feature. In the „essentiality data“ protein pairs can take on 3 discrete
values (EE: both essential; NN: both non-essential; NE: one essential and one not).
Column 2 gives the number of protein pairs with a particular feature (e.g. „EE“) drawn from the whole yeast
interactome (~18M pairs).
Columns „pos“ and „neg“ give the overlap of these pairs with the 8,250 gold-standard positives and the
2,708,746 gold-standard negatives.
Columns „sum(pos)“ and „sum(neg)“ show how many gold-standard positives (negatives) are among the
protein pairs with likelihood ratio L, computed by summing up the values in the „pos“ (or „neg“) column.
P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values, and
L is the ratio of these two conditional probabilities. For example:
$$ L = \frac{1114 / 2150}{81924 / 573724} = \frac{0.518}{0.143} \approx 3.6 $$
Jansen et al. Science 302, 449 (2003)
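The likelihood ratio of a single feature value follows directly from four counts in the contingency table. The sketch below uses the four numbers that appear on this slide (1114 of 2150 positives and 81924 of 573724 negatives), read here as the counts of pairs with the feature value and the column totals; the function name is hypothetical.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(feature value | pos) / P(feature value | neg) from raw counts."""
    p_given_pos = pos_with_feature / pos_total
    p_given_neg = neg_with_feature / neg_total
    return p_given_pos / p_given_neg


L_example = likelihood_ratio(1114, 2150, 81924, 573724)
print("P(f|pos) =", round(1114 / 2150, 3))
print("P(f|neg) =", round(81924 / 573724, 3))
print("L =", round(L_example, 1))
```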
13. Lecture WS 2003/04
Bioinformatics III 156
mRNA expression data
Proteins in the same complex tend to have correlated expression profiles.
Although large differences can exist between the mRNA and protein abundance, protein abundance can
be indirectly and quite crudely measured by the presence or absence of the corresponding mRNA
transcript.
Jansen et al. Science 302, 449 (2003)
Experimental data source:
- time course of expression fluctuations during the yeast cell cycle
- Rosetta compendium: expression profiles of 300 deletion mutants and cells under
chemical treatments.
Problem: both data sets are strongly correlated.
Compute first principal component of the vector of the 2 correlations.
Use this as independent source of evidence for the P-P interaction prediction.
The first principal component is a stronger predictor of P-P interactions than either
of the 2 expression correlation datasets by themselves.
13. Lecture WS 2003/04
Bioinformatics III 157
PIP – Functional similarity
Quantify functional similarity between two proteins:
Jansen et al. Science 302, 449 (2003)
- consider which set of functional classes two proteins share, given either the MIPS or Gene
Ontology (GO) classification system.
- Then count how many of the ~18 million protein pairs in yeast share the exact same
functional classes as well (yielding integer counts between 1 and ~ 18 million). It was binned
into 5 intervals.
- In general, the smaller this count, the more similar and specific is the functional description
of the two proteins.
13. Lecture WS 2003/04
Bioinformatics III 158
PIP – Functional similarity
Observation: low counts correlate with a higher chance of two proteins being in
the same complex. But signal (L) is quite weak.
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 159
Calculation of the fully connected Bayesian network (PIE)
The 4 binary experimental interaction datasets can be combined in at most 2^4 = 16
different ways (subsets). For each of these 16 subsets, one can compute a
likelihood ratio from the overlap with the gold-standard positives („pos“) and
negatives („neg“).
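For example, for a subset that contains 26 gold-standard positives and 2 gold-standard negatives:
$$ L = \frac{26 / 8250}{2 / 2708746} = \frac{26 \cdot 2708746}{2 \cdot 8250} \approx 4.3 \times 10^{3} $$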
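For the fully connected network, each of the 16 presence/absence combinations of the four data sets gets its own likelihood ratio, with no factorization. The sketch below shows that bookkeeping under stated assumptions: the gold-standard sizes 8250 and 2708746 and the counts 26 and 2 appear on these slides, but the combination keys they are attached to, the second table entry and all names are placeholders invented for illustration.

```python
from itertools import product

N_POS = 8250        # gold-standard positives
N_NEG = 2_708_746   # gold-standard negatives

# Every combination of "pair reported / not reported" in the four data sets.
ALL_COMBINATIONS = list(product((0, 1), repeat=4))

# Hypothetical overlap counts (pos, neg) for two of the 16 combinations.
COUNTS = {
    (1, 1, 0, 0): (26, 2),
    (1, 0, 0, 0): (100, 2000),
}


def likelihood_ratio(pos, neg, n_pos=N_POS, n_neg=N_NEG):
    """L for one combination; None if no gold-standard negative falls into it."""
    if neg == 0:
        return None
    return (pos / n_pos) / (neg / n_neg)


table = {
    combo: likelihood_ratio(*COUNTS[combo])
    for combo in ALL_COMBINATIONS
    if combo in COUNTS
}
for combo, ratio in table.items():
    print(combo, round(ratio, 1))
```

Because each combination is looked up as a whole, correlated evidence between the data sets is handled implicitly, at the price of needing enough gold-standard pairs in every cell of the table.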
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 160
PIP vs. the information sources
Ratio of true to false positives (TP/FP) increases
monotonically with Lcut, confirming L as an
appropriate measure of the odds of a real
interaction.
The ratio is computed as:
$$ \frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L \ge L_{cut}} pos(L)}{\sum_{L \ge L_{cut}} neg(L)} $$
Protein pairs with Lcut > 600 have a > 50%
chance of being in the same complex.
Jansen et al. Science 302, 449 (2003)
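The TP/FP ratio as a function of the cutoff can be computed by summing the gold-standard positives and negatives over all bins at or above Lcut. The following sketch assumes hypothetical bins of the form (likelihood ratio, positives, negatives); the numbers are invented and only illustrate the summation, including the assumption that the cutoff is inclusive.

```python
# Invented bins: (likelihood ratio L, gold-standard positives, gold-standard negatives)
BINS = [
    (1000.0, 40, 10),
    (700.0, 30, 20),
    (300.0, 50, 200),
    (10.0, 100, 5000),
]


def tp_fp_ratio(bins, l_cut):
    """Sum positives and negatives over all bins with L >= l_cut and return TP/FP."""
    tp = sum(pos for l, pos, neg in bins if l >= l_cut)
    fp = sum(neg for l, pos, neg in bins if l >= l_cut)
    return tp / fp


for cutoff in (5.0, 100.0, 600.0):
    print(cutoff, round(tp_fp_ratio(BINS, cutoff), 3))
```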
13. Lecture WS 2003/04
Bioinformatics III 161
PIE vs. the information sources
9897 interactions are predicted from PIP and
163 from PIE.
In contrast, likelihood ratios derived from single
genomic factors (e.g. mRNA coexpression) or
from individual interaction experiments (e.g. the
Ho data set) did not exceed the cutoff when used
alone.
This demonstrates that information sources that,
taken alone, are only weak predictors of
interactions can yield reliable predictions when
combined.
Jansen et al. Science 302, 449 (2003)
13. Lecture WS 2003/04
Bioinformatics III 162
Concentrate on large complexes
Jansen et al. Science 302, 449 (2003)
So far all interactions were treated as independent.
However, the joint distribution of interactions in the PIs can help identify large
complexes: an ideal complex should be a fully connected „clique“ in an
interaction graph.
In practice, this rarely happens because of incorrect or missing links.
Yet large complexes tend to have many interconnections between them,
whereas false-positive links to outside proteins tend to occur randomly,
without a coherent pattern.
13. Lecture WS 2003/04
Bioinformatics III 163
Improve ratio TP / FP
Observation: Increasing the minimum number of links raises
TP/FP by preserving the interactions among proteins in large
complexes, while filtering out false-positive interactions with
heterogeneous groups of proteins outside the complexes.
Jansen et al. Science 302, 449 (2003)
TP/FP for subsets of the
thresholded PIP that only include
proteins with a minimum number
of links. Requiring a minimum
number of links isolates large
complexes in the thresholded PIP
graph (Fig. 3B).
How can one increase the predictivity TP/FP of P-P predictions?
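Requiring a minimum number of links can be sketched as a simple degree filter on the thresholded interaction graph. The example is a minimal illustration with invented protein names and edges; it applies the filter once (it is not an iterative k-core computation), and the function name is hypothetical.

```python
from collections import Counter

# Invented thresholded interaction graph: a 4-protein complex (A, B, C, D)
# plus two spurious links to outside proteins X and Y.
EDGES = {
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "X"), ("A", "Y"),
}


def filter_by_min_links(edges, min_links):
    """Keep only edges whose two endpoints both have at least `min_links` links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {
        (a, b) for a, b in edges
        if degree[a] >= min_links and degree[b] >= min_links
    }


kept = filter_by_min_links(EDGES, min_links=3)
print(sorted(kept))
```

The densely connected complex survives the filter, while the isolated links to X and Y, which stand in for random false positives, are removed.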
Summary
Bayesian approach allows reliable predictions of protein-protein interactions
by combining weakly predictive genomic features.
The de novo prediction of complexes replicated interactions found in the gold-standard
positives and PIE.
Also, several predictions were confirmed by new TAP experiments.
The accuracy of the PIP was comparable to that of the PIE while simultaneously
achieving greater coverage.
In a similar manner, the approach could have been extended to a number of other
features related to interactions (e.g. phylogenetic co-occurrence, gene fusions,
gene neighborhood).
As a word of caution: Bayesian approaches don't work everywhere.
Jansen et al. Science 302, 449 (2003)
V25
Please draw P(k) and C(k) as a function of k for the different network types: random network, scale-free network and hierarchical network.
Additional + required reading: review article in Nature Reviews Genetics
Characterising metabolic networks
(d) The degree distribution, P(k) of the metabolic network illustrates its scale-free
topology.
(e) The scaling of the clustering coefficient C(k) with the degree k illustrates the
hierarchical architecture of metabolism (The data shown in d and e represent an average
over 43 organisms).
(f) The flux distribution in the central metabolism of Escherichia coli follows a power
law, which indicates that most reactions have small metabolic flux, whereas a few reactions,
with high fluxes, carry most of the metabolic activity. It should be noted that on all three plots
both axes are logarithmic, and a straight line on such log–log plots indicates a power-law scaling.
CTP, cytidine triphosphate; GLC, aldo-hexose glucose; UDP, uridine diphosphate; UMP,
uridine monophosphate; UTP, uridine triphosphate.
Degree
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
The most elementary characteristic of a node is its
degree (or connectivity), k, which tells us how many links
the node has to other nodes. For example, in the
undirected network shown in part a of the figure, node A
has degree k = 5. In networks in which each link has a
selected direction (see figure, part b) there is an
incoming degree, k_in, which denotes the number of links
that point to a node, and an outgoing degree, k_out, which
denotes the number of links that start from it. For
example, node A in part b of the figure has k_in = 4 and
k_out = 1. An undirected network with N nodes and L
links is characterized by an average degree <k> =
2L/N (where <> denotes the average).
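These definitions translate directly into code. The sketch below uses two small hypothetical networks, not the networks of the figure, chosen so that node A has k = 5 in the undirected case and k_in = 4, k_out = 1 in the directed case.

```python
from collections import Counter


def degrees(edges):
    """Degree k of every node in an undirected network."""
    k = Counter()
    for a, b in edges:
        k[a] += 1
        k[b] += 1
    return k


def in_out_degrees(arcs):
    """(k_in, k_out) counters for a directed network; arcs are (source, target)."""
    k_in, k_out = Counter(), Counter()
    for src, dst in arcs:
        k_out[src] += 1
        k_in[dst] += 1
    return k_in, k_out


# Hypothetical undirected network.
undirected = [("A", "B"), ("A", "C"), ("A", "D"),
              ("A", "E"), ("A", "G"), ("B", "C")]
k = degrees(undirected)
n_nodes, n_links = len(k), len(undirected)
mean_k = sum(k.values()) / n_nodes  # equals 2L/N, each link has two ends

# Hypothetical directed network.
directed = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A"), ("A", "G")]
k_in, k_out = in_out_degrees(directed)

print(k["A"], mean_k, 2 * n_links / n_nodes)
print(k_in["A"], k_out["A"])
```

Nodes without any link do not appear in the edge list, so N here counts only connected nodes.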
Degree distribution
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
The degree distribution, P(k), gives the probability that a selected node has exactly k links. P(k) is obtained by counting the number of nodes N(k) with k = 1,2... links and dividing by the total number of nodes N. The degree distribution allows us to distinguish between different classes of networks. For example, a peaked degree distribution, as seen in a random network, indicates that the system has a characteristic degree and that there are no highly connected nodes (which are also known as hubs). By contrast, a power-law degree distribution indicates that a few hubs hold together numerous small nodes.
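The counting recipe P(k) = N(k)/N can be sketched as follows. The degree values are hypothetical and the function name is chosen for illustration.

```python
from collections import Counter


def degree_distribution(degree_of):
    """P(k) = N(k) / N, computed from a mapping node -> degree."""
    n = len(degree_of)
    counts = Counter(degree_of.values())  # N(k) for every observed k
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical degrees of six nodes: one hub and several small nodes.
degree_of = {"A": 5, "B": 2, "C": 2, "D": 1, "E": 1, "G": 1}
p = degree_distribution(degree_of)
print(p)
```

Plotting such a P(k) on log-log axes is the usual way to check for a power law, although a network this small is of course far too small for that.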
Network measures
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
Scale-free networks and the degree exponent
Most biological networks are scale-free, which means that their
degree distribution approximates a power law, P(k) ~ k^(-γ), where
γ is the degree exponent and ~ indicates 'proportional to'. The
value of γ determines many properties of the system. The
smaller the value of γ, the more important the role of the hubs
is in the network. Whereas for γ > 3 the hubs are not relevant, for
2 < γ < 3 there is a hierarchy of hubs, with the most connected
hub being in contact with a small fraction of all nodes, and for
γ = 2 a hub-and-spoke network emerges, with the largest hub
being in contact with a large fraction of all nodes. In general, the
unusual properties of scale-free networks are valid only for γ <
3, when the dispersion of the P(k) distribution, which is defined
as σ^2 = <k^2> - <k>^2, increases with the number of nodes (that
is, diverges), resulting in a series of unexpected features,
such as a high degree of robustness against accidental node
failures. For γ > 3, however, most unusual features are absent,
and in many respects the scale-free network behaves like a
random one.
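Why the threshold lies at a degree exponent of 3 can be seen from a short calculation. This is a sketch in the continuum approximation, assuming a pure power law P(k) = C k^(-γ) between a smallest degree k_min and a largest degree k_max; these two bounds and the constant C are assumptions introduced here, not part of the lecture text.

```latex
\langle k^2 \rangle
  = \int_{k_{\min}}^{k_{\max}} k^2 \, C\, k^{-\gamma}\, dk
  = \frac{C}{3-\gamma}\left( k_{\max}^{\,3-\gamma} - k_{\min}^{\,3-\gamma} \right),
  \qquad \gamma \neq 3 .

% gamma < 3: the exponent 3 - gamma is positive, so <k^2> grows without bound
%            as k_max grows with the network size.
% gamma > 3: the exponent is negative, so <k^2> converges to
%            C k_min^{3-gamma} / (gamma - 3) and the dispersion stays finite.
```

The same integral with k instead of k^2 shows that <k> stays finite for exponents above 2, so between 2 and 3 the mean degree is well defined while the dispersion grows with the size of the network.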
Shortest path and mean path length
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
Distance in networks is measured with the path length, which
tells us how many links we need to pass through to travel
between two nodes. As there are many alternative paths
between two nodes, the shortest path — the path with the
smallest number of links between the selected nodes — has a
special role. In directed networks, the distance ℓAB from node A
to node B is often different from the distance ℓBA from B to A. For
example, in part b of the figure, ℓBA = 1, whereas ℓAB = 3. Often
there is no direct path between two nodes. As shown in part b of
the figure, although there is a path from C to A, there is no path
from A to C. The mean path length, <ℓ>, represents the
average over the shortest paths between all pairs of nodes
and offers a measure of a network's overall navigability.
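Shortest paths in an unweighted network can be found by breadth-first search. The sketch below uses a hypothetical directed network, not the one in the figure, built so that it reproduces the distances quoted above (ℓBA = 1, ℓAB = 3, a path from C to A but none from A to C). Averaging only over ordered pairs that are actually connected is an assumption; other conventions exist.

```python
from collections import deque


def shortest_path_lengths(arcs, source):
    """Breadth-first search: number of links on the shortest directed path
    from source to every node reachable from it."""
    successors = {}
    for src, dst in arcs:
        successors.setdefault(src, []).append(dst)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in dist:  # first visit is the shortest one
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def mean_path_length(arcs):
    """Average shortest path length over all connected ordered node pairs."""
    nodes = {n for arc in arcs for n in arc}
    lengths = [d for s in nodes
               for t, d in shortest_path_lengths(arcs, s).items() if t != s]
    return sum(lengths) / len(lengths)


# Hypothetical directed network.
arcs = [("B", "A"), ("A", "D"), ("D", "E"), ("E", "B"), ("C", "A")]

from_a = shortest_path_lengths(arcs, "A")
from_b = shortest_path_lengths(arcs, "B")
print(from_b["A"], from_a["B"], "C" in from_a)
print(mean_path_length(arcs))
```

Unreachable nodes are simply absent from the returned dictionary, which is how the missing path from A to C shows up.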
Clustering coefficient
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
In many networks, if node A is connected to B, and B is connected to C,
then it is highly probable that A also has a direct link to C. This
phenomenon can be quantified using the clustering coefficient C_I =
2n_I/(k_I(k_I-1)), where n_I is the number of links connecting the k_I neighbours of
node I to each other. In other words, n_I gives the number of 'triangles'
that go through node I, whereas k_I(k_I-1)/2 is the total number of triangles
that could pass through node I, should all of node I's neighbours be
connected to each other. For example, only one pair of node A's five
neighbours in part a of the figure are linked together (B and C), which
gives n_A = 1 and C_A = 2/20. By contrast, none of node F's neighbours link
to each other, giving C_F = 0. The average clustering coefficient, <C>,
characterizes the overall tendency of nodes to form clusters or groups.
An important measure of the network's structure is the function C(k),
which is defined as the average clustering coefficient of all nodes with k
links. For many real networks C(k) ~ k^(-1), which is an indication of a
network's hierarchical character.
The average degree <k>, average path length <ℓ> and average
clustering coefficient <C> depend on the number of nodes and links
(N and L) in the network. By contrast, the P(k) and C(k) functions
are independent of the network's size and they therefore capture a
network's generic features, which allows them to be used to classify
various networks.
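The clustering coefficient, its average and the function C(k) can be computed as sketched below. The network is hypothetical and only mimics the two cases quoted from the figure: node A has five neighbours of which one pair (B and C) is linked, and none of node F's neighbours are linked. Setting the coefficient to 0 for nodes with fewer than two neighbours is a convention assumed here.

```python
from itertools import combinations


def build_adjacency(edges):
    """Neighbour sets of an undirected network."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering(adj, node):
    """2 n / (k (k - 1)), with n links among the k neighbours; 0 if k < 2."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    n_links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * n_links / (k * (k - 1))


def clustering_by_degree(adj):
    """C(k): average clustering coefficient of all nodes with k links."""
    groups = {}
    for node in adj:
        groups.setdefault(len(adj[node]), []).append(clustering(adj, node))
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}


# Hypothetical network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("A", "F"), ("B", "C"), ("F", "G")]
adj = build_adjacency(edges)

mean_c = sum(clustering(adj, n) for n in adj) / len(adj)
print(clustering(adj, "A"), clustering(adj, "F"), mean_c)
print(clustering_by_degree(adj))
```

Here the hub A has a low coefficient while its small neighbours B and C have a high one, which is the qualitative pattern behind a C(k) that falls with k.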
Random networks
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
Aa
The Erdős–Rényi (ER) model of a random network starts with N nodes
and connects each pair of nodes with probability p, which creates a
graph with approximately pN(N-1)/2 randomly placed links.
Ab
The node degrees follow a Poisson distribution, which indicates that
most nodes have approximately the same number of links (close to the
average degree <k>). The tail (high k region) of the degree distribution
P(k) decreases exponentially, which indicates that nodes that
significantly deviate from the average are extremely rare.
Ac
The clustering coefficient is independent of a node's degree, so C(k)
appears as a horizontal line if plotted as a function of k. The mean path
length is proportional to the logarithm of the network size, ℓ ~ log N, which
indicates that it is characterized by the small-world property.
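The ER construction is short enough to write out. This is a minimal sketch; the values of n and p and the seed are arbitrary choices for illustration.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Link each of the n(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


n, p = 200, 0.05
edges = erdos_renyi(n, p)
expected_links = p * n * (n - 1) / 2

# Degree of every node, including nodes that ended up without links.
degree = [0] * n
for i, j in edges:
    degree[i] += 1
    degree[j] += 1

print(len(edges), expected_links)
print(sum(degree) / n, max(degree))
```

The number of links should land close to pN(N-1)/2, and the largest degree should stay within a small multiple of the mean, in contrast to the hubs of a scale-free network.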
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
Scale-free networks
Scale-free networks are characterized by a power-law degree
distribution; the probability that a node has k links follows P(k) ~ k^(-γ),
where γ is the degree exponent. The probability that a node is highly
connected is statistically more significant than in a random graph, the
network's properties often being determined by a relatively small number
of highly connected nodes that are known as hubs (see figure, part Ba;
blue nodes). In the Barabási–Albert model of a scale-free network, at
each time point a node with M links is added to the network, which
connects to an already existing node I with probability Π_I = k_I / Σ_J k_J,
where k_I is the degree of node I and J is the index denoting the sum over
network nodes. The network that is generated by this growth process has
a power-law degree distribution that is characterized by the degree
exponent γ = 3.
Bb
Such distributions are seen as a straight line on a log–log plot. The
network that is created by the Barabási–Albert model does not have an
inherent modularity, so C(k) is independent of k (Bc). Scale-free
networks with degree exponents 2 < γ < 3, a range that is observed in
most biological and non-biological networks, are ultra-small, with the
average path length following ℓ ~ log log N, which is significantly shorter
than log N that characterizes random small-world networks.
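The growth rule with preferential attachment can be sketched as follows. The starting network (m + 1 fully connected nodes), the requirement that the m targets of a new node are distinct, and the parameter values are assumptions of this sketch, not details given in the lecture.

```python
import random
from collections import Counter


def barabasi_albert(n, m, seed=0):
    """Grow a network to n nodes; each new node brings m links, attached to
    existing nodes with probability proportional to their degree."""
    rng = random.Random(seed)
    # Starting network (an assumption): m + 1 fully connected nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in stubs once per link end, so a uniform draw from
    # the list picks node I with probability k_I / sum_J k_J.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # draw until m distinct targets are found
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


n, m = 500, 2
edges = barabasi_albert(n, m)
degree = Counter(node for edge in edges for node in edge)

print(len(edges), min(degree.values()), max(degree.values()))
```

Keeping a list of link ends avoids recomputing the attachment probabilities at every step. The early nodes tend to collect the most links and become the hubs.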
Barabasi & Oltvai, Nature Reviews Genetics 5, 101 (2004)
Hierarchical networks
To account for the coexistence of modularity, local clustering and scale-free
topology in many real systems it has to be assumed that clusters
combine in an iterative manner, generating a hierarchical network.
The starting point of this construction is a small cluster of four densely
linked nodes (see the four central nodes in Ca). Next, three replicas of
this module are generated and the three external nodes of the replicated
clusters connected to the central node of the old cluster, which produces
a large 16-node module. Three replicas of this 16-node module are then
generated and the 16 peripheral nodes connected to the central node of
the old module, which produces a new module of 64 nodes. The
hierarchical network model seamlessly integrates a scale-free topology
with an inherent modular structure by generating a network that has a
power-law degree distribution with degree exponent γ = 1 + ln4/ln3 =
2.26 (see Cb) and a large, system-size independent average clustering
coefficient <C> ~ 0.6.
The most important signature of hierarchical modularity is the scaling of
the clustering coefficient, which follows C(k) ~ k^(-1), a straight line of slope -1
on a log–log plot (see Cc). A hierarchical architecture implies that
sparsely connected nodes are part of highly clustered areas, with
communication between the different highly clustered neighbourhoods
being maintained by a few hubs (see Ca).
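The two numbers quoted for the hierarchical model can be checked by hand. The block below only spells out the arithmetic and the log-log form of the scaling law; the constant c is introduced here for illustration.

```latex
\gamma = 1 + \frac{\ln 4}{\ln 3}
       \approx 1 + \frac{1.3863}{1.0986}
       \approx 2.26

% Scaling of the clustering coefficient, with a proportionality constant c:
C(k) = c\, k^{-1}
\quad\Longrightarrow\quad
\log C(k) = \log c - \log k
% that is, a straight line of slope -1 when log C(k) is plotted against log k.
```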