8. Lecture WS 2004/05
Bioinformatics III 1
Protein protein interaction networks
Most ideas on biological networks are based on the initial high-throughput (HT) data for
Saccharomyces cerevisiae (Y2H screens by Uetz and by Ito): unicellular, compartmentalized, all
proteins
Program for today:
Further experimental data sets
(1) C. elegans – multicellular, focus on proteins that promote cell-cell interactions
(2) Drosophila melanogaster
(3) Classification of networks
(4) Essentiality of proteins
Current status of research on real networks
► Many systems show small-world property
► Statistical abundance of „hubs“: p(k) follows a power-law distribution → robustness to random damage, vulnerability to targeted attacks
► Realize that most complex networks are the result of a growth process
► View networks as dynamical systems that evolve through the subsequent
addition and deletion of vertices and edges. Requires dynamical theory.
What is left?
Eur. Phys. J. B 38, 143 (2004)
10 questions
Are there formal ways of classifying the structure of different growing models?
Open questions concerning the universality of some topological properties, the correlations introduced by
the dynamic process and the interplay between clustering, hierarchies and centralities in networks.
Aim for rigorous mathematical analysis.
Are there further statistical distributions (except p(k), <C>, <L> and the degree-degree
correlation p(k,k‘)) that can provide insights on the structure and classification of
complex networks?
Why are most networks modular?
Are there universal features of network dynamics?
Networks are not only specified by their topology but also by the dynamics of information or traffic flow
taking place along the links. Aim at mathematical characterization for general principles describing the
networks‘ dynamics.
Eur. Phys. J. B 38, 143 (2004)
10 open questions
How do the dynamical processes taking place on a network shape the network
topology?
Dynamics, traffic and underlying topology of networks are mutually correlated.
Need to obtain large empirical datasets that simultaneously capture the topology of the network and the
time-resolved dynamics taking place on it.
What are the evolutionary mechanisms that shape the topology of biological networks?
Uncovering evolution is relatively simple in technological and large infrastructure networks, but not in
biological networks. Here, the role of evolution and selection in shaping biological networks is still unclear,
especially if we want a quantitative dynamical implementation of evolutionary principles in network
modeling.
Eur. Phys. J. B 38, 143 (2004)
10 questions
How to quantify the interaction between networks of different character (network of
networks)?
Most networks are interconnected with one another, forming networks of networks, e.g. the Internet and energy and
power distribution, or a gene network interconnected with the protein-protein interaction network and with the
metabolic network.
Understanding and characterizing the complicated set of regulatory and feedback mechanisms
connecting various networks is one of the most ambitious tasks in network research.
How to characterize small networks?
Many real world networks are far from being large scale objects that are well described by statistical
measures. Scaling behavior or average properties are not well defined for small networks → new concepts
and mathematics are required.
Why are social networks all assortative, while all biological and technological networks
are disassortative?
In social networks, hubs tend to connect to each other („assortative“). In biological and technological
networks, hubs mostly connect to less connected nodes.
Eur. Phys. J. B 38, 143 (2004)
C. elegans protein-protein interaction network
As Y2H baits, select a set of 3024 predicted worm proteins that relate directly or
indirectly to multicellular functions.
Cloned ORFs that do not autoactivate the Y2H GAL1::HIS3 reporter: 1873
Only consider interactions that activate at least 2 out of 3 different Gal4-responsive
promoters.
Divide into 3 confidence classes.
Core-1: 858 interactions, found independently at least 3 times
Core-2: 1299 interactions, found < 3 times, passed retest
Non-core: 1892 other interactions found in Y2H screen
Li et al., Science 303, 540 (2004)
C. elegans protein-protein interaction network
Coaffinity purification assays. Shown are 10 examples from the Core-1,
Core-2, and Non-Core data sets. The top panels show Myc-tagged prey
expression after affinity purification on glutathione-Sepharose,
demonstrating binding to GST-bait. The middle and bottom panels show
expression of Myc-prey and GST-bait, respectively. The lanes alternate
between extracts expressing GST-bait proteins (+) and GST alone (–).
ORF pairs are identified in table S1 with the lane number corresponding
to the order in which they appear in the table.
Li et al., Science 303, 540 (2004)
Confidence of interactions
Li et al., Science 303, 540 (2004)
C. elegans protein-protein interaction network
Estimate coverage:
out of 108 known interactors in WormPD („literature data set“),
only 8 Core and 2 Non-core interactions are found in this benchmark data set
→ coverage is ca. 10% of all interactions
In silico searches for potentially conserved interactions, „interologs“, whose
orthologous pairs are known to interact in one or more other species.
High-confidence yeast interaction data → 949 potential worm interologs
+ other data
→ 5534 interactions for worm, connecting 15% of the C. elegans proteome.
Li et al., Science 303, 540 (2004)
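The interolog idea can be illustrated with a minimal sketch: every interacting pair in a source species is mapped through an ortholog table to candidate pairs in the target species. The function name `predict_interologs`, the data layout and the protein identifiers are hypothetical and chosen only for illustration; the actual procedure used by Li et al. may differ.

```python
from itertools import product


def predict_interologs(source_interactions, orthologs):
    """Transfer interactions from a source species to a target species.

    source_interactions: iterable of (a, b) protein pairs in the source species.
    orthologs: dict mapping a source protein to a set of target-species orthologs.
    Returns a set of frozensets, one per predicted target-species pair.
    """
    predicted = set()
    for a, b in source_interactions:
        # Every combination of an ortholog of a with an ortholog of b
        # is a candidate interolog; proteins without orthologs yield nothing.
        for x, y in product(orthologs.get(a, ()), orthologs.get(b, ())):
            predicted.add(frozenset((x, y)))
    return predicted


# Toy data with made-up identifiers (not real proteins).
yeast_pairs = [("Y1", "Y2"), ("Y2", "Y3")]
yeast_to_worm = {"Y1": {"W1"}, "Y2": {"W2a", "W2b"}}
interologs = predict_interologs(yeast_pairs, yeast_to_worm)
print(len(interologs))
```

Pairs are stored as frozensets because protein-protein interactions are treated as undirected here.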
Analysis of the C. elegans protein-protein network
Li et al., Science 303, 540 (2004)
(C) Do evolutionarily recent proteins preferentially
interact with each other?
„Ancient“: 748 proteins with yeast ortholog
„Multicellular“: 1314 proteins with ortholog in
Drosophila, Arabidopsis or human but not in yeast
„Worm“: 836 proteins with no ortholog outside of worm.
The 3 groups connect equally well with each other
→ new cellular functions rely on a combination of
evolutionarily new and ancient elements.
(A) Nodes (proteins) are colored
according to their phylogenetic
class: ancient (red), multicellular
(yellow), and worm (blue).
Giant network component
contains 2898 nodes connected
by 5460 edges.
The inset highlights a small part
of the network.
(B) The proportion of proteins,
P(k), with different numbers of
interacting partners, k, is shown
for C. elegans proteins used as
baits or preys and for S.
cerevisiae proteins.
Again, the worm interactome
network exhibits small-world and
scale-free properties.
Relate interactome with transcriptome and phenome
Li et al., Science 303, 540 (2004)
(D) Overlap with transcriptome in yeast, Pearson correlation coefficients (PCCs) were calculated
and graphed for each pair of proteins in the interaction data sets and their corresponding
randomized data sets. Red area corresponds to interactions that show a significant relationship
to expression profiling data (P < 0.05). 9.5% of interacting core proteins are co-expressed.
Note: 75% of literature pairs (= biologically relevant) do not co-express.
(E) Example of highly connected cluster (Y2H interactions) where both proteins belong to
common C. elegans expression clusters.
(F) Proportion of interaction pairs where both genes are embryonic lethal (P < 10^-7).
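The comparison in (D) can be sketched as follows: compute the Pearson correlation coefficient (PCC) of the expression profiles for each interacting pair, and do the same for a randomized control in which the partners are shuffled. The helper names, the shuffling scheme and the toy expression profiles are assumptions made for illustration, not the procedure of Li et al.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.

    Assumes both profiles have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pair_pccs(pairs, expression):
    """PCC for every pair for which both expression profiles are available."""
    return [pearson(expression[a], expression[b])
            for a, b in pairs if a in expression and b in expression]


def randomized_pairs(pairs, rng):
    """Control set: keep the first partners, shuffle the second partners."""
    firsts = [a for a, _ in pairs]
    seconds = [b for _, b in pairs]
    rng.shuffle(seconds)
    return list(zip(firsts, seconds))


# Toy expression profiles (made-up numbers).
expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
observed = pair_pccs([("A", "B"), ("A", "C")], expression)
control = pair_pccs(randomized_pairs([("A", "B"), ("A", "C")], random.Random(0)), expression)
```

Comparing the distribution of `observed` with that of `control` indicates whether interacting pairs are co-expressed more often than expected by chance.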
Example of highly connected subnetwork
Li et al., Science 303, 540 (2004)
Conclusions:
- Y2H data set provides functional
hypotheses for 1000s of uncharacterized
proteins of C. elegans.
Integration with other functional
genomic data indicates that the
correlation between
transcriptome and interactome
data is lower than found in yeast.
Explanation? Biological processes in
multicellular organisms may occur
differently in the organism, across various
organs, tissues, or single cells.
Drosophila is an important model for
human biology.
High-throughput Y2H screen identified
20405 interactions involving 7048
proteins.
Giot et al. Science 302, 1727 (2003)
Confidence scores for protein-protein interactions
Manual method: expert biologist reviewed list of interactions on the basis of the
names of the proteins in each interaction pair.
High-confidence interactions: those published previously or those involving two
proteins of the same complex.
Low-confidence interactions: unlikely to occur in vivo, e.g. an interaction between
a nuclear and an extracellular protein.
Automated method: compare Drosophila interactions with those in yeast.
Positive examples: interacting proteins whose yeast orthologs interact as well
Negative examples: Drosophila interactions whose yeast orthologs are a distance
of 3 or more protein-protein interaction links apart (a random pair of yeast proteins
has a distance of 2.8)
Positive training set: 129 examples (70 manual, 65 automated, 6 common)
Negative training set: 196 examples (88 manual, 112 automated, 4 common)
Giot et al. Science 302, 1727 (2003)
Confidence scores for protein-protein interactions
Giot et al. Science 302, 1727 (2003)
Validate selection:
confidence score for
interaction correlates
well with correlation
of gene ontology (GO)
annotation (see C)
Fit generalized linear model to
training set with predictors
- number of times each
interaction was observed
- number of interaction partners
of each protein
- local clustering of network
- gene region (5‘ untranslated
region, coding sequence, 3‘
untranslated region)
Find dividing surface.
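A minimal sketch of fitting such a model, assuming a logistic link (one common choice of generalized linear model for a two-class training set) and plain gradient descent. The function names, the learning rate, the number of epochs and the standardized toy predictors are illustrative assumptions; the slide does not specify how Giot et al. fitted their model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=500):
    """Fit weights and bias of a logistic model by batch gradient descent."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Difference between predicted probability and true label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def confidence(x, weights, bias):
    """Confidence score in (0, 1); 0.5 corresponds to the dividing surface."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Toy training set with two standardized (zero-mean) predictors, made-up values:
# label 1 = positive example, label 0 = negative example.
X = [[1.0, -0.5], [0.5, -1.0], [-1.0, 0.5], [-0.5, 1.0]]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
print(confidence([1.0, -0.5], weights, bias))
```

The dividing surface is the set of predictor values for which the linear term is zero, i.e. where the confidence equals 0.5.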
Giot et al. Science 302, 1727 (2003)
Confidence scores for protein-protein interactions
(D) p(k) for all interactions (black circles) and
for the high-confidence interactions (green
circles). Linear behavior in this log-log plot
would indicate a power-law distribution.
Although regions of each distribution appear
linear, neither distribution may be adequately fit
by a single power-law. Both may be fit,
however, by a combination of power-law and
exponential decay.
Faster decay of high-confidence interactions
may indicate that highly connected proteins
may be suppressed in biological networks.
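One common way to write a combination of power-law and exponential decay is a power law with an exponential cutoff. The symbols α, β and the normalization constant C are introduced here for illustration and are not taken from the slide.

```latex
p(k) = C \, k^{-\alpha} \, e^{-\beta k}, \qquad
\log p(k) = \log C - \alpha \log k - \beta k, \qquad \alpha, \beta > 0
```

For small k the power-law term dominates, which gives an approximately straight line in the log-log plot; for large k the term βk bends the curve downward.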
Statistical properties of refined Drosophila PI map
The high-confidence Drosophila protein-protein interactions
form a small-world network with evidence for a hierarchy of
organization. Network properties are presented for the giant
connected component, in which 3659 pairwise interactions
connect 3039 proteins into a single cluster.
(A) The probability distribution for the shortest path between
a pair of proteins in the actual network (green points) peaks
at 9 to 11 links, with a mean of 9.4 links. In contrast, an
ensemble of randomly rewired networks shows a mean
separation of 7.7 links between proteins.
Biological organization may be responsible for flattening the
actual network by enhancing links between proteins that are
already close.
Giot et al. Science 302, 1727 (2003)
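The mean separation between proteins can be computed with a breadth-first search from every node. The sketch below assumes an unweighted, undirected network stored as a dictionary of neighbor sets and averages over reachable pairs only; the function names and the toy network are illustrative.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest path length (in links) from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return distance


def mean_shortest_path(adjacency):
    """Average shortest path length over all reachable ordered pairs."""
    total = 0
    pairs = 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs


# Toy network: a linear chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(mean_shortest_path(chain))
```

Applying the same function to randomly rewired copies of a network gives the reference value against which the actual network is compared.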
Statistical properties of refined Drosophila PI map
Giot et al. Science 302, 1727 (2003)
(B) Clustering is analyzed quantitatively by
counting the number of closed loops (triangles,
squares, pentagons, etc.) in which the
perimeter is formed by a series of proteins
connected head-to-tail, with no protein
repeated.
The actual network (green points) shows an
enhancement of loops with perimeter up to 10
to 11 relative to the random network (red
points).
In both (A) and (B), the one-level and two-level
models produce nearly indistinguishable fits
for the random networks, indicating the
absence of structured clustering (= a hierarchy
where proteins connect to protein complexes,
and a second level where protein complexes
interact with each other).
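Counting closed loops of a given perimeter can be sketched with a depth-first search. The version below assumes an undirected network with comparable node labels (here integers) and counts each simple cycle once; it is an illustrative brute-force approach that is only practical for small perimeters, not necessarily the method used by Giot et al.

```python
def count_loops(adjacency, perimeter):
    """Count simple cycles that contain exactly `perimeter` nodes (>= 3)."""
    count = 0

    def extend(start, current, visited, length):
        nonlocal count
        if length == perimeter:
            # Close the loop if the last node is linked back to the start.
            if start in adjacency[current]:
                count += 1
            return
        for nxt in adjacency[current]:
            # Only visit nodes larger than the start node, so that every
            # cycle is enumerated from its smallest node only.
            if nxt > start and nxt not in visited:
                visited.add(nxt)
                extend(start, nxt, visited, length + 1)
                visited.remove(nxt)

    for start in adjacency:
        extend(start, start, {start}, 1)
    # Each cycle is found once per direction of traversal.
    return count // 2


# Toy network: a square 0-1-2-3 with one diagonal 0-2.
square = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(count_loops(square, 3), count_loops(square, 4))
```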
Protein family/human disease ortholog view
Proteins are color-coded according
to protein family as annotated by the
GO hierarchy.
Proteins orthologous to human
disease proteins have a jagged,
starry border.
Interactions were sorted according to
interaction confidence score, and the
top 3000 interactions are shown with
their corresponding 3522 proteins.
This representation is particularly
relevant to understanding human
diseases and potential treatment.
Giot et al. Science 302, 1727 (2003)
Subcellular location view
(B) Subcellular localization view. This
view shows the fly interaction map
with each protein colored by its GO
Cellular Component annotation.
This map has been filtered by only
showing proteins with less than or
equal to 20 interactions and with at
least one GO annotation.
We show proteins for all interactions
with a confidence score of 0.5 or
higher. This results in a map with 2346
proteins and 2268 interactions.
View allows annotation of subcellular
localizations, and potential function of
proteins not annotated so far.
Giot et al. Science 302, 1727 (2003)
Local interaction networks
Local interaction networks involved in
A Transcription
B Splicing
C Signal transduction
Giot et al. Science 302, 1727 (2003)
Cell cycle regulation: Drosophila Skp pathway
Giot et al. Science 302, 1727 (2003)
Network surrounding the Skp
protein complex that targets
proteins to ubiquitin-mediated
proteasomal degradation.
Target proteins are recruited to
the Skp complex by F-box
proteins.
Among the Skp proteins, only
SkpA was reported to bind F-box
proteins Morgue and Slmb.
10 examples of local pathway
views identified in the interaction
network
Giot et al. Science 302, 1727 (2003)
Local pathway views
The functional significance of a gene is
defined by its essentiality.
A pair of non-essential genes can be
synthetically lethal (cell death occurs
when both genes are deleted simultaneously).
Hypothesis of ‚marginal benefit‘: many
non-essential genes make significant
but small contributions to the fitness of the cell although the effects may not be
sufficiently large to be detected by conventional methods.
Yu et al. Trends Gen 20, 227 (2004)
Define ‚marginal essentiality‘ (M) as a quantitative measure of the importance of a
non-essential gene to a cell.
Incorporate results from diverse set of 4 large-scale knockout experiments that
examined different aspects of the impact of a protein on the fitness of a yeast cell:
(i) growth rate
(ii) phenotypes under diverse environments
(iii) sporulation efficiency
(iv) sensitivity to small molecules.
Previous findings (Jeong & Barabasi): hubs tend to be essential
(Fraser et al.): the effect of an individual protein on cell fitness correlates with its number of interaction partners k
Analyze data set: 4743 yeast proteins – 23294 unique interactions
Yu et al. Trends Gen 20, 227 (2004)
‚Marginal essentiality‘
Basic analysis
Essential proteins have ca. twice
as many links as non-essential
proteins.
The degree distribution of essential proteins has a
shallower slope → a larger
proportion of them are ‚hubs‘
Yu et al. Trends Gen 20, 227 (2004)
Determine hubs that are essential
‚hub‘: a protein that belongs to the upper 25% of proteins with the most links
(a) 43% of the hubs are essential vs. 20% for random proteins
Within the network, essential proteins tend to be more cliquish and tend to be
more closely connected to each other (see mean path length).
Yu et al. Trends Gen 20, 227 (2004)
(b) out-degree: transcription
factors with many (> 100)
targets are more essential
than the other TFs.
(c) in-degree: genes regulated
by many TFs are less
likely to be essential than
genes regulated by few TFs.
(d) Genes with more
functions are more likely to
be essential.
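The hub definition in (a) can be sketched as follows: rank proteins by their number of links, take the top 25%, and compare the fraction of essential proteins among the hubs with the fraction among all proteins. The function name, the handling of ties (ignored here) and the toy data are assumptions for illustration.

```python
def hub_essentiality(degree, essential, top_fraction=0.25):
    """Return (hubs, essential fraction among hubs, essential fraction overall).

    degree: dict mapping protein -> number of interaction partners.
    essential: set of proteins annotated as essential.
    """
    ranked = sorted(degree, key=degree.get, reverse=True)
    n_hubs = max(1, int(len(ranked) * top_fraction))
    hubs = set(ranked[:n_hubs])
    hub_rate = sum(p in essential for p in hubs) / len(hubs)
    overall_rate = sum(p in essential for p in degree) / len(degree)
    return hubs, hub_rate, overall_rate


# Toy data (made-up proteins and degrees).
degree = {"p1": 10, "p2": 8, "p3": 3, "p4": 2, "p5": 2, "p6": 1, "p7": 1, "p8": 1}
essential = {"p1", "p4"}
hubs, hub_rate, overall_rate = hub_essentiality(degree, essential)
print(sorted(hubs), hub_rate, overall_rate)
```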
properties of essential genes
Most essential genes are ‚house-keeping‘ genes = their expression level is much
higher and the fluctuation of their expression is much lower compared with
non-essential genes.
Essential genes tend to be regulated by fewer factors, whereas
non-essential genes often use more regulators to control the expression of gene
products.
Explanation? Essential proteins perform the most basic and important functions
within the cell and always need to be switched ‚on‘.
Their expression should not be regulated by many factors because this
would make the essential genes dependent on the viability of more regulators, and the
cell less stable.
Yu et al. Trends Gen 20, 227 (2004)
Analysis of non-essential genes
The more marginally essential a protein is, the more likely it is
- to have a large number of interaction partners (a, b)
- to be closely connected to other proteins (short characteristic path length) (c)
- to be one of the 1061 hubs (d)
Yu et al. Trends Gen 20, 227 (2004)
Evolution of Drosophila melanogaster PI network
Various network growth and evolution models have been developed to explain
the properties of real-world networks
- small world (Watts & Strogatz)
- preferential attachment (Barabasi et al.)
- duplication-mutation mechanisms
However, often model parameters can be tuned such that multiple models of
widely varying mechanisms perfectly fit the motivating real network in terms
of single selected features such as the scale-free exponent and the
clustering coefficient.
Aim here: use a discriminative classification technique from machine learning to
classify a given real network as one of many proposed network mechanisms by
enumerating local substructures.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Analysis of Drosophila melanogaster protein interaction network
Data set: protein-protein interaction map for Drosophila by Giot et al.
Problem: data set is subject to numerous false positives.
Giot et al. assign a confidence score p ∈ [0,1] to each interaction measuring how
likely the interaction is to occur in vivo.
What threshold p* should be used?
Measure size of the components for all possible values of p*.
Observe: for p* = 0.65, the two largest components are connected
→ use this value as threshold. Edges in the graph correspond to interactions for
which p > p*.
Remove self-interactions and isolated vertices
→ 3359 (4625) nodes with 2795 (4683) edges for p* = 0.65 (0.5)
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
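Measuring component sizes for a given threshold can be sketched with a union-find structure: keep only edges with p > p*, drop self-interactions, and merge the end points of every remaining edge. The function names and the toy scores are illustrative assumptions. Isolated vertices never enter the structure, so they are removed implicitly.

```python
def filter_edges(scored_edges, threshold):
    """Keep interactions with p > threshold and drop self-interactions."""
    return [(u, v) for u, v, p in scored_edges if p > threshold and u != v]


def component_sizes(edges):
    """Sizes of the connected components, largest first (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Toy scored interactions (made-up proteins and confidence scores).
scored = [("a", "b", 0.9), ("b", "c", 0.7), ("c", "d", 0.6),
          ("d", "e", 0.8), ("e", "e", 0.95)]
for p_star in (0.5, 0.65):
    print(p_star, component_sizes(filter_edges(scored, p_star)))
```

Scanning p* over a grid of values and recording the two largest entries of the returned list shows at which threshold the largest components merge or split.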
Network evolution models considered
Duplication-mutation-complementation (DMC) algorithm:
based on a model proposing that most of the duplicate genes observed today have been
preserved by functional complementation. If either the gene or its copy loses one of its
functions (edges), the other becomes essential in assuring the organism's survival.
Algorithm: a duplication step is followed by mutations that preserve functional
complementarity.
At every time step choose a node v at random.
A twin vertex v_twin is introduced, copying all of v's edges.
For each edge of v, delete with probability q_del either the original edge or its corresponding
edge of v_twin.
Conjoin the twins themselves with independent probability q_con, representing an interaction of a
protein with its own copy.
No edges are created by mutations: the DMC algorithm assumes that the probability of
creating new advantageous functions by random mutations is negligible.
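The DMC steps described above can be sketched as a small growth simulation. The seed graph (two connected nodes), the integer node labels, the equal chance of removing the original or the copied edge, and the function names are assumptions made for illustration; the probabilities q_del and q_con correspond to the deletion and conjoining probabilities of the algorithm.

```python
import random


def dmc_step(adjacency, q_del, q_con, rng):
    """One duplication-mutation-complementation step; returns the new node."""
    v = rng.choice(sorted(adjacency))
    twin = max(adjacency) + 1
    neighbors = list(adjacency[v])
    # Duplication: the twin copies all edges of v.
    adjacency[twin] = set(neighbors)
    for u in neighbors:
        adjacency[u].add(twin)
    # Complementation: for each edge, delete either the original or the copy,
    # so that at least one of the two always survives.
    for u in neighbors:
        if rng.random() < q_del:
            loser = rng.choice((v, twin))
            adjacency[loser].discard(u)
            adjacency[u].discard(loser)
    # Conjoin the twins with independent probability q_con.
    if rng.random() < q_con:
        adjacency[v].add(twin)
        adjacency[twin].add(v)
    return twin


def grow_dmc(n_nodes, q_del, q_con, seed=0):
    """Grow a network from a single edge until it has n_nodes nodes."""
    rng = random.Random(seed)
    adjacency = {0: {1}, 1: {0}}
    while len(adjacency) < n_nodes:
        dmc_step(adjacency, q_del, q_con, rng)
    return adjacency


network = grow_dmc(50, q_del=0.5, q_con=0.1, seed=42)
print(len(network), sum(len(n) for n in network.values()) // 2)
```

In this sketch nodes that lose all their edges stay in the network; removing them afterwards would mirror the treatment of isolated vertices in the Drosophila data.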
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Network evolution models considered
Variant of DMC: Duplication-random mutations (DMR) algorithm:
Possible interactions between twins are neglected.
Instead, edges between v_twin and the neighbors of v can be removed with probability q_del,
and new edges can be created at random between v_twin and any other vertices with
probability q_new/N, where N is the current total number of vertices.
DMR emphasizes the creation of new advantageous functions by mutation.
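A minimal sketch of one DMR step under simplifying assumptions: each edge inherited from v is removed with probability q_del, and each vertex that is not a neighbor of v gains an edge to the twin with probability q_new/N. Excluding v itself from the new edges, restricting new edges to non-neighbors, the seed graph and the function names are assumptions of this sketch.

```python
import random


def dmr_step(adjacency, q_del, q_new, rng):
    """One duplication-random-mutation step; returns the new node."""
    nodes = sorted(adjacency)
    n_current = len(nodes)  # N, the current total number of vertices
    v = rng.choice(nodes)
    twin = nodes[-1] + 1
    adjacency[twin] = set()
    for u in nodes:
        if u == v:
            continue  # assumption: interactions between twins are neglected
        if u in adjacency[v]:
            # Inherited edge: removed with probability q_del.
            keep = rng.random() >= q_del
        else:
            # Random mutation: new edge with probability q_new / N.
            keep = rng.random() < q_new / n_current
        if keep:
            adjacency[twin].add(u)
            adjacency[u].add(twin)
    return twin


def grow_dmr(n_nodes, q_del, q_new, seed=0):
    """Grow a network from a single edge until it has n_nodes nodes."""
    rng = random.Random(seed)
    adjacency = {0: {1}, 1: {0}}
    while len(adjacency) < n_nodes:
        dmr_step(adjacency, q_del, q_new, rng)
    return adjacency


network = grow_dmr(50, q_del=0.4, q_new=0.2, seed=42)
print(len(network), sum(len(n) for n in network.values()) // 2)
```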
Other models:
- linear preferential attachment (LPA) (Barabasi)
- random static networks (Erdős-Rényi) (RDS)
- random growing networks (RDG – growing graphs where new edges are created randomly
between existing nodes)
- aging vertex networks (AGV – growing graphs modeling citation networks, where the
probability for new edges decreases with the age of the vertex)
- small-world network (SMW – interpolation between regular ring lattices and randomly
connected graphs).
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Training set
Create 1000 graphs as training data for each of the seven different models.
Every graph is generated with the same number of edges and nodes as measured
in Drosophila.
Quantify topology of a network by counting all possible subgraphs up to a given
cut-off, which could be the number of nodes, number of edges, or the length of a
given walk.
Here: count all subgraphs that can be constructed by a walk of length=8 (148 non-
isomorphic subgraphs) or length=7 (130 non-isomorphic subgraphs).
Use these counts as input features for classifier.
Note that the average shortest path between two nodes of the Drosophila
network‘s giant component is 11.6 (9.4) for p*=0.65 (0.5).
Walks of length=8 can traverse large parts of the network.
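A strongly simplified illustration of subgraph counts as input features: instead of all subgraphs constructible by a walk of length 8, the sketch counts only 3-cycles and linear chains with a given number of edges. This stand-in feature set, the function names and the toy network are assumptions and do not reproduce the 148 subgraph classes used by Middendorf et al.

```python
def count_triangles(adjacency):
    """Number of 3-cycles; each is counted once via ordered node labels."""
    count = 0
    for u in adjacency:
        for v in adjacency[u]:
            if v > u:
                for w in adjacency[v]:
                    if w > v and w in adjacency[u]:
                        count += 1
    return count


def count_chains(adjacency, n_edges):
    """Number of simple paths with n_edges edges (not necessarily induced)."""
    def extend(current, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for nxt in adjacency[current]:
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    # Every path is found once from each of its two end points.
    return sum(extend(s, {s}, n_edges) for s in adjacency) // 2


def feature_vector(adjacency, max_chain=3):
    """Feature vector: [triangles, chains with 1 edge, ..., max_chain edges]."""
    chains = [count_chains(adjacency, k) for k in range(1, max_chain + 1)]
    return [count_triangles(adjacency)] + chains


# Toy network: a triangle 0-1-2 with a tail 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(feature_vector(toy))
```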
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Learning algorithm: Alternating Decision Tree
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Rectangles: decision nodes.
A given network‘s subgraph counts
determine paths in the tree dictated
by inequalities specified by the
decision nodes.
For each class, the ADT
outputs a real-valued
prediction score, which is the
sum of all weights over all paths.
The class with the highest score
wins.
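How an alternating decision tree turns subgraph counts into a score can be sketched as follows: prediction nodes carry a weight, decision nodes test an inequality, and the score is the sum of the weights of all prediction nodes reached. The dictionary layout, the feature names, the thresholds and the weights below are invented for illustration and are not taken from the trained trees of Middendorf et al.

```python
def adt_score(node, features):
    """Sum the weights of all prediction nodes reached for these features."""
    score = node["value"]
    for decision in node.get("decisions", []):
        # Each decision node tests one inequality on one feature.
        if features[decision["feature"]] < decision["threshold"]:
            branch = decision["yes"]
        else:
            branch = decision["no"]
        score += adt_score(branch, features)
    return score


def classify(trees, features):
    """Return the class with the highest score and all class scores."""
    scores = {label: adt_score(tree, features) for label, tree in trees.items()}
    return max(scores, key=scores.get), scores


# Hypothetical trees with made-up weights.
trees = {
    "DMC": {"value": 0.0, "decisions": [
        {"feature": "triangles", "threshold": 10,
         "yes": {"value": -1.0}, "no": {"value": 1.5}}]},
    "LPA": {"value": -0.2, "decisions": [
        {"feature": "triangles", "threshold": 10,
         "yes": {"value": 0.8, "decisions": [
             {"feature": "chains8", "threshold": 100,
              "yes": {"value": -0.5}, "no": {"value": 0.7}}]},
         "no": {"value": -1.0}}]},
}
winner, scores = classify(trees, {"triangles": 25, "chains8": 40})
print(winner, scores)
```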
Performance on training set
The confusion matrix shows truth and prediction for the test sets.
5 out of 7 classes have nearly perfect prediction accuracy.
AGV is constructed as an interpolation between LPA and a ring lattice
→ the AGV, LPA and SMW mechanisms are equivalent in specific parameter
regimes and show a non-negligible overlap.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Discriminating similar networks
Ten graphs of two different mechanisms exhibit similar average geodesic lengths
and almost identical degree distribution and clustering coefficients.
(a) cumulative degree distribution p(k>k0), average clustering coefficient <C> and
average geodesic length <L>, all quantities averaged over a set of 10 graphs.
(b) Prediction score for all ten graphs and all five cross-validated ADTs. The two
sets of graphs can be perfectly separated by the classifier.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Learning algorithm: Alternating Decision Tree
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Figure shows the first few decision
nodes (out of 120) of a resulting ADT.
The prediction scores reveal that a
high count of 3-cycles suggests a
DMC network.
The DMC mechanism indeed
facilitates creation of many
3-cycles by allowing 2 copies
to attach to each other, thus
creating 3-cycles with their common
neighbors.
A low count of 3-cycles but a high
count of 8-edge linear chains is a
good predictor for LPA and DMR
networks.
Prediction for Drosophila melanogaster network
Use this classifier (ADT) with good prediction accuracy now to determine the
network mechanism that best reproduces the Drosophila network (or any network
of the same size).
Prediction scores for the Drosophila protein network for different confidence
thresholds p* and different cut-offs in subgraph size. Drosophila is consistently
classified as a DMC network, with an especially strong prediction for a confidence
threshold of p*=0.65 and independently of the cut-off in subgraph size.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Visualization of subgraphs
A qualitative and more intuitive
way of interpreting the
classification result is
visualizing the subgraph
profiles.
Subgraphs associated with
Figures 3 and 1.
A representative subset of 50
subgraphs out of 148 is
shown.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Subgraph profiles
The average subgraph count of the
training data for every mechanism is
shown for the 50 representative
subgraphs S1-S50.
Black lines indicate that this model is
closest to Drosophila based on the
absolute difference between the
subgraph counts.
For 60% of the subgraphs (S1-S30),
the counts for Drosophila
are closest to the DMC model.
All of these subgraphs contain one or
more cycles, including highly
connected subgraphs (S1) and long
linear chains ending in cycles (S16,
S18, S22, S23, S25).
The DMC algorithm is the only
mechanism that produces such
cycles with a high occurrence.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
Robustness against noise
Edges in Drosophila network are randomly
replaced and the network is classified.
Plotted are prediction scores for each of the 7
classes as more and more edges are replaced.
Every point is an average over 200 independent
random replacements.
For high noise level (beyond 80%), the
network is classified as an Erdös-Renyi (RDS)
graph. For low noise (< 30%), the confidence
in the classification as a DMC network is even
higher than in the classification as an RDS
network for high noise. The prediction score
y(c) for class c is related to the estimated
probability p(c) for the tested network to be in
class c by
$$p(c) = \frac{e^{2y(c)}}{1 + e^{2y(c)}}$$
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
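The relation between prediction score and estimated class probability, p(c) = e^{2y(c)} / (1 + e^{2y(c)}), can be evaluated with a short helper. The function name is illustrative; the expression is rewritten in the algebraically equivalent form 1 / (1 + e^{-2y}).

```python
import math


def score_to_probability(y):
    """Estimated probability p(c) for a prediction score y(c).

    Equivalent to exp(2y) / (1 + exp(2y)); this form can overflow for
    very negative scores (roughly y < -350).
    """
    return 1.0 / (1.0 + math.exp(-2.0 * y))


for y in (-2.0, 0.0, 2.0):
    print(y, round(score_to_probability(y), 3))
```

A score of 0 corresponds to a probability of 0.5, and scores of equal magnitude but opposite sign give probabilities that add up to 1.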
Conclusions
Very nice (!) method that allows one to infer growth mechanisms for real networks.
Method is robust against noise and data subsampling,
no prior assumption about network features/topology required.
Learning algorithm does not assume any relationships between features
(e.g. orthogonality). Therefore the input space can be augmented with various
features in addition to subgraph counts.
The protein interaction network of Drosophila is confidently classified as DMC
network.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15