8. Lecture WS 2004/05

Bioinformatics III 1

Protein protein interaction networks

Most ideas on biological networks are based on initial HT data for

Saccharomyces cerevisiae (Uetz Y2H, Ito) – unicellular, compartmentalized, all

proteins

Program for today:

Further experimental data sets

(1) C. elegans – multicellular, focus on proteins that promote cell-cell interactions

(2) Drosophila melanogaster

(3) Classification of networks

(4) Essentiality of proteins



Current status of research on real networks

► Many systems show small-world property

► Statistical abundance of „hubs“: p(k) follows a power-law distribution. This implies robustness to damage but vulnerability to attacks

► Realize that most complex networks are the result of a growth process

► View networks as dynamical systems that evolve through the subsequent

addition and deletion of vertices and edges. Requires dynamical theory.

What is left?

Eur. Phys. J. B 38, 143 (2004)



10 questions

Are there formal ways of classifying the structure of different growing models?

Open questions concerning the universality of some topological properties, the correlations introduced by

the dynamic process and the interplay between clustering, hierarchies and centralities in networks.

Aim for rigorous mathematical analysis.

Are there further statistical distributions (except p(k), <C>, <L> and the degree-degree

correlation p(k,k‘)) that can provide insights on the structure and classification of

complex networks?

Why are most networks modular?

Are there universal features of network dynamics?

Networks are not only specified by their topology but also by the dynamics of information or traffic flow

taking place along the links. Aim at mathematical characterization for general principles describing the

networks‘ dynamics.

Eur. Phys. J. B 38, 143 (2004)



10 open questions

How do the dynamical processes taking place on a network shape the network

topology?

Dynamics, traffic and underlying topology of networks are mutually correlated.

Need to obtain large empirical datasets that simultaneously capture the topology of the network and the

time-resolved dynamics taking place on it.

What are the evolutionary mechanisms that shape the topology of biological networks?

Uncovering evolution is relatively simple in technological and large infrastructure networks, but not in

biological networks. Here, the role of evolution and selection in shaping biological networks is still unclear,

especially if we want a quantitative dynamical implementation of evolutionary principles in network

modeling.

Eur. Phys. J. B 38, 143 (2004)



10 questions

How to quantify the interaction between networks of different character (network of

networks)?

Most networks are interconnected with one another, forming networks of networks. E.g. Internet, energy and

power distribution, or gene network interconnected with protein-protein interaction network and with the

metabolic network.

Understanding and characterization of the complicated set of regulatory and feedback mechanisms

connecting various networks is one of the most ambitious tasks in network research.

How to characterize small networks?

Many real world networks are far from being large scale objects that are well described by statistical

measures. Scaling behavior or average properties are not well defined for small networks, so new concepts

and mathematics are required.

Why are social networks all assortative, while all biological and technological networks

are disassortative?

In social networks, hubs tend to connect to each other („assortative“). In biological and technological

networks, hubs mostly connect to less connected nodes.

Eur. Phys. J. B 38, 143 (2004)



C. elegans protein-protein interaction network

As Y2H baits, select set of 3024 worm predicted proteins that relate directly or

indirectly to multicellular functions.

Cloned ORFs that do not autoactivate the Y2H GAL1::HIS3 reporter: 1873

Only consider those that activate at least 2 out of 3 different Gal4-responsive

promoters.

Divide in 3 confidence classes.

Core-1: 858 interactions, found independently at least 3 times

Core-2: 1299 interactions, found < 3 times, passed retest

Non-core: 1892 other interactions found in Y2H screen

Li et al., Science 303, 5657 (2004)




Coaffinity purification assays. Shown are 10 examples from the Core-1,

Core-2, and Non-Core data sets. The top panels show Myc-tagged prey

expression after affinity purification on glutathione-Sepharose,

demonstrating binding to GST-bait. The middle and bottom panels show

expression of Myc-prey and GST-bait, respectively. The lanes alternate

between extracts expressing GST-bait proteins (+) and GST alone (–).

ORF pairs are identified in table S1 with the lane number corresponding

to the order in which they appear in the table.

Li et al., Science 303, 5657 (2004)



Confidence of interactions

Li et al., Science 303, 5657 (2004)




Estimate coverage:

out of 108 known interactors in WormPD („literature data set“),

only 8 Core and 2 Non-core interactions are found in this benchmark data set,

so coverage is ca. 10% of all interactions.
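The coverage estimate is simple arithmetic on the benchmark counts quoted here. A minimal sketch; the helper name `coverage` is hypothetical and not taken from Li et al.:

```python
def coverage(found_core, found_noncore, known_total):
    """Fraction of benchmark (literature) interactions recovered by the screen."""
    return (found_core + found_noncore) / known_total

# Counts quoted on the slide: 8 Core + 2 Non-core hits out of 108 literature interactions.
frac = coverage(8, 2, 108)
print(f"coverage = {frac:.1%}")  # a little under 10%
```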

In silico searches for potentially conserved interactions, „interologs“ whose

orthologous pairs are known to interact in one or more other species.

High-confidence yeast interaction data yield 949 potential worm interologs.

Together with other data, this gives

5534 interactions for worm, connecting 15% of the C. elegans proteome.

Li et al., Science 303, 5657 (2004)



Analysis of the C. elegans protein-protein network

Li et al., Science 303, 5657 (2004)

(C) Do evolutionarily recent proteins preferentially

interact with each other?

„Ancient“: 748 proteins with yeast ortholog

„Multicellular“: 1314 proteins with ortholog in

Drosophila, Arabidopsis or human but not in yeast

„Worm“: 836 proteins with no ortholog outside of worm.

The 3 groups connect equally well with each other, suggesting that

new cellular functions rely on a combination of

evolutionarily new and ancient elements.

(A) Nodes (proteins) are colored

according to their phylogenetic

class: ancient (red), multicellular

(yellow), and worm (blue).

Giant network component

contains 2898 nodes connected

by 5460 edges.

The inset highlights a small part

of the network.

(B) The proportion of proteins,

P(k), with different numbers of

interacting partners, k, is shown

for C. elegans proteins used as

baits or preys and for S.

cerevisiae proteins.

Again, the worm interactome

network exhibits small-world and

scale-free properties.



Relate interactome with transcriptome and phenome

Li et al., Science 303, 5657 (2004)

(D) Overlap with transcriptome in yeast, Pearson correlation coefficients (PCCs) were calculated

and graphed for each pair of proteins in the interaction data sets and their corresponding

randomized data sets. Red area corresponds to interactions that show a significant relationship

to expression profiling data (P < 0.05). 9.5% of interacting core proteins are co-expressed.

Note: 75% of literature pairs (= biologically relevant) do not co-express.

(E) Example of highly connected cluster (Y2H interactions) where both proteins belong to

common C. elegans expression clusters.

(F) Proportion of interaction pairs where both genes are embryonic lethal (P < 10^-7).
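The co-expression test in (D) rests on the Pearson correlation coefficient between the expression profiles of two interacting proteins. A minimal sketch of that calculation follows. The profiles are made-up numbers for illustration, and the significance test against randomized pairs is not reproduced here:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression levels of two interacting proteins over five conditions.
profile_a = [1.0, 2.0, 3.0, 4.0, 5.0]
profile_b = [2.1, 3.9, 6.2, 8.0, 9.8]
pcc = pearson(profile_a, profile_b)
print(f"PCC = {pcc:.3f}")
```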



Example of highly connected subnetwork

Li et al., Science 303, 5657 (2004)

Conclusions:

- Y2H data set provides functional

hypotheses for 1000s of uncharacterized

proteins of C. elegans.

Integration with other functional

genomic data indicates that the

correlation between

transcriptome and interactome

data is lower than found in yeast.

Explanation? Biological processes in

multicellular organisms may occur

differently in the organism, across various

organs, tissues, or single cells.



Drosophila is an important model for

human biology.

High-throughput Y2H screen identified

20405 interactions involving 7048

proteins.

Giot et al. Science 302, 1727 (2003)



Confidence scores for protein-protein interactions

Manual method: expert biologist reviewed list of interactions on the basis of the

names of the proteins in each interaction pair.

High-confidence interactions: those published previously or those involving two

proteins of the same complex.

Low-confidence interactions: unlikely to occur in vivo, e.g. an interaction between

a nuclear and an extracellular protein.

Automated method: compare Drosophila interactions with those in yeast.

Positive examples: interacting proteins whose yeast orthologs interact as well

Negative examples: Drosophila interactions whose yeast orthologs are a distance

of 3 or more protein-protein interaction links apart (pair of random yeast proteins

has distance of 2.8)

Positive training set: 129 examples (70 manual, 65 automated, 6 common)

Negative training set: 196 examples (88 manual, 112 automated, 4 common)
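The automated labelling rule can be sketched with a breadth-first search on the yeast network: ortholog pairs that interact directly give positive examples, and pairs 3 or more links apart give negative examples. The toy graph and function names below are illustrative assumptions. Treating unreachable pairs as negatives is also an assumption; the slide does not say how they were handled.

```python
from collections import deque

def bfs_distance(adj, source, target):
    """Number of links on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj.get(node, ()):
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def training_label(yeast_adj, ortholog_a, ortholog_b):
    """Label a fly interaction from the distance between its yeast orthologs."""
    dist = bfs_distance(yeast_adj, ortholog_a, ortholog_b)
    if dist == 1:
        return "positive"
    if dist is None or dist >= 3:   # unreachable counted as negative (assumption)
        return "negative"
    return None                     # distance 2: not used for training

# Toy yeast network: a chain A - B - C - D.
yeast = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(training_label(yeast, "A", "B"), training_label(yeast, "A", "D"))
```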






Validate selection:

confidence score for

interaction correlates

well with correlation

of gene ontology (GO)

annotation (see C)

Fit generalized linear model to

training set with predictors

- number of times each

interaction was observed

- number of interaction partners

of each protein

- local clustering of network

- gene region (5‘ untranslated

region, coding sequence, 3‘

untranslated region)

Find dividing surface.
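As an illustration of fitting a generalized linear model to such a training set, here is a minimal logistic regression trained by gradient descent. Logistic regression is one common generalized linear model for binary outcomes; whether it matches the link function used by Giot et al. is an assumption. The two features and all numbers are invented for the example.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.05, steps=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def confidence(w, b, x):
    """Confidence score in (0, 1); 0.5 marks the dividing surface."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features: [times observed, mean number of partners of the two proteins].
X = [[3, 1], [4, 2], [5, 1], [1, 8], [1, 10], [2, 9]]
y = [1, 1, 1, 0, 0, 0]   # 1 = positive training example, 0 = negative
w, b = fit_logistic(X, y)
print(confidence(w, b, [5, 1]), confidence(w, b, [1, 10]))
```

The dividing surface is the set of feature vectors with score 0.5, i.e. where the linear predictor is zero.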





(D) p(k) for all interactions (black circles) and

for the high-confidence interactions (green

circles). Linear behavior in this log-log plot

would indicate a power-law distribution.

Although regions of each distribution appear

linear, neither distribution may be adequately fit

by a single power-law. Both may be fit,

however, by a combination of power-law and

exponential decay.

Faster decay of high-confidence interactions

may indicate that highly connected proteins

may be suppressed in biological networks.
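One way to write a combination of power law and exponential decay is p(k) proportional to k^(-alpha) * exp(-beta * k). This functional form and the parameter names `alpha` and `beta` are assumptions for illustration, not the fitted model from the paper. The sketch shows how a nonzero `beta` suppresses the tail of highly connected proteins:

```python
import math

def truncated_power_law(k_max, alpha, beta):
    """Normalized p(k) for k = 1..k_max with p(k) ~ k**(-alpha) * exp(-beta * k)."""
    weights = [k ** (-alpha) * math.exp(-beta * k) for k in range(1, k_max + 1)]
    total = sum(weights)
    return [wk / total for wk in weights]

pure = truncated_power_law(100, alpha=1.5, beta=0.0)   # plain power law
cut = truncated_power_law(100, alpha=1.5, beta=0.1)    # with exponential cutoff
print(pure[-1], cut[-1])   # probability of k = 100 in each model
```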



Statistical properties of refined Drosophila PI map

The high-confidence Drosophila protein-protein interactions

form a small-world network with evidence for a hierarchy of

organization. Network properties are presented for the giant

connected component, in which 3659 pairwise interactions

connect 3039 proteins into a single cluster.

(A) The probability distribution for the shortest path between

a pair of proteins in the actual network (green points) peaks

at 9 to 11 links, with a mean of 9.4 links. In contrast, an

ensemble of randomly rewired networks shows a mean

separation of 7.7 links between proteins.

Biological organization may be responsible for flattening the

actual network by enhancing links between proteins that are

already close.
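The mean separation quoted in (A) is the average shortest-path length over all connected protein pairs. A minimal sketch using breadth-first search from every node; the toy chain is an assumption for illustration, and the random rewiring used for comparison is not reproduced:

```python
from collections import deque

def build_adj(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def mean_shortest_path(adj):
    """Average shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

chain = build_adj([(1, 2), (2, 3), (3, 4), (4, 5)])
print(mean_shortest_path(chain))
```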




Statistical properties of refined Drosophila PI map


(B) Clustering is analyzed quantitatively by

counting the number of closed loops (triangles,

squares, pentagons, etc.) in which the

perimeter is formed by a series of proteins

connected head-to-tail, with no protein

repeated.

The actual network (green points) shows an

enhancement of loops with perimeter up to 10

to 11 relative to the random network (red

points).

In both (A) and (B), the one-level and two-level

models produce nearly indistinguishable fits

for the random networks, indicating the

absence of structured clustering (= a hierarchy

where proteins connect to protein complexes,

and a second level where protein complexes

interact with each other).
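Counting closed loops of a given perimeter, as described in (B), can be sketched by a depth-first enumeration of simple cycles. This brute-force version is only practical for small graphs and short perimeters; the original analysis may well have used a different counting scheme. The toy graph is invented.

```python
def build_adj(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def count_cycles(adj, length):
    """Count simple cycles with `length` nodes (length >= 3), no node repeated."""
    order = {v: i for i, v in enumerate(sorted(adj))}
    count = 0

    def extend(start, node, visited, depth):
        nonlocal count
        if depth == length:
            if start in adj[node]:      # the path closes back to its start
                count += 1
            return
        for nb in adj[node]:
            # Only visit nodes ranked above the start, so each cycle is
            # enumerated from its lowest-ranked node only.
            if order[nb] > order[start] and nb not in visited:
                visited.add(nb)
                extend(start, nb, visited, depth + 1)
                visited.remove(nb)

    for start in adj:
        extend(start, start, {start}, 1)
    return count // 2                   # each cycle is found in both directions

# Toy graph: a triangle A-B-C sharing node C with a square C-D-E-F.
toy = build_adj([("A", "B"), ("B", "C"), ("C", "A"),
                 ("C", "D"), ("D", "E"), ("E", "F"), ("F", "C")])
print(count_cycles(toy, 3), count_cycles(toy, 4))
```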



Protein family/human disease ortholog view

Proteins are color-coded according

to protein family as annotated by the

GO hierarchy.

Proteins orthologous to human

disease proteins have a jagged,

starry border.

Interactions were sorted according to

interaction confidence score, and the

top 3000 interactions are shown with

their corresponding 3522 proteins.

This representation is particularly

relevant to understanding human

diseases and potential treatment.




Subcellular location view

(B) Subcellular localization view. This

view shows the fly interaction map

with each protein colored by its GO

Cellular Component annotation.

This map has been filtered by only

showing proteins with less than or

equal to 20 interactions and with at

least one GO annotation.

We show proteins for all interactions

with a confidence score of 0.5 or

higher. This results in a map with 2346

proteins and 2268 interactions.

View allows annotation of subcellular

localizations, and potential function of

proteins not annotated so far.




Local interaction networks

Local interaction networks involved in

A Transcription

B Splicing

C Signal transduction




Cell cycle regulation: Drosophila Skp pathway


Network surrounding the Skp

protein complex that targets

proteins to ubiquitin-mediated

proteasomal degradation.

Target proteins are recruited to

the Skp complex by F-box

proteins.

Among the Skp proteins, only

SkpA was reported to bind F-box

proteins Morgue and Slmb.



10 examples of local pathway

views identified in the interaction

network


Local pathway views



The functional significance of a gene is

defined by its essentiality.

A pair of non-essential genes can be

synthetically lethal (cell death occurs

when both genes are deleted simultaneously).

Hypothesis of ‚marginal benefit‘: many

non-essential genes make significant

but small contributions to the fitness of the cell although the effects may not be

sufficiently large to be detected by conventional methods.

Yu et al. Trends Gen 20, 227 (2004)



Define ‚marginal essentiality‘ (M) as a quantitative measure of the importance of a

non-essential gene to a cell.

Incorporate results from diverse set of 4 large-scale knockout experiments that

examined different aspects of the impact of a protein on the fitness of a yeast cell:

(i) growth rate

(ii) phenotypes under diverse environments

(iii) sporulation efficiency

(iv) sensitivity to small molecules.

Previous findings: hubs tend to be essential (Jeong & Barabasi);

the effect of an individual protein on cell fitness correlates with its number of interaction partners k (Fraser et al.)

Analyze data set: 4743 yeast proteins – 23294 unique interactions


‚Marginal essentiality‘



Basic analysis

Essential proteins have ca. twice

as many links as non-essential

proteins.

The degree distribution of essential proteins has a

shallower slope, i.e. a larger

proportion of them are ‚hubs‘




determine hubs that are essential

‚hub‘: proteins that belong to upper 25% of proteins with most links

(a) 43% of the hubs are essential vs. 20% for random proteins

Within the network, essential proteins tend to be more cliquish and tend to be

more closely connected to each other (see mean path length).
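The hub definition used here (upper 25% of proteins by number of links) and the comparison of essential fractions can be sketched as follows. Protein names, degrees and essentiality labels are invented for illustration; tie handling at the 25% boundary is an assumption.

```python
def hub_set(degree, fraction=0.25):
    """Proteins in the top `fraction` when ranked by number of links."""
    ranked = sorted(degree, key=degree.get, reverse=True)
    n_hubs = max(1, int(len(ranked) * fraction))
    return set(ranked[:n_hubs])

def essential_fraction(proteins, essential):
    """Share of the given proteins that are essential."""
    proteins = list(proteins)
    return sum(p in essential for p in proteins) / len(proteins)

# Hypothetical degrees and essential genes.
degree = {"P1": 10, "P2": 8, "P3": 3, "P4": 2, "P5": 2, "P6": 1, "P7": 1, "P8": 1}
essential = {"P1", "P4"}

hubs = hub_set(degree)
print(essential_fraction(hubs, essential), essential_fraction(degree, essential))
```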


(b) out-degree: transcription

factors with many (> 100)

targets are more essential

than the other TFs.

(c) in-degree: genes regulated

by many TFs are less

likely to be essential than

genes regulated by few TFs.

(d) Genes with more

functions are more likely to

be essential.



properties of essential genes

Most essential genes are ‚house-keeping‘ genes = their expression level is much

higher and the fluctuation of their expression is much lower compared with

non-essential genes.

Essential genes tend to be controlled by fewer regulators, whereas

non-essential genes often use more regulators to control the expression of gene

products.

Explanation? essential proteins perform the most basic and important functions

within the cell and always need to be switched ‚on‘.

Their expression does not need to be regulated by many factors because this

would make the essential genes dependent on the viability of more regulators, and the

cell less stable.




Analysis of non-essential genes

The more marginally essential a protein is, the more likely it is

- to have a large number of interaction partners (a, b)

- to be closely connected to other proteins (short characteristic path length) (c)

- to be one of the 1061 hubs (d)




Evolution of Drosophila melanogaster PI network

Various network growth and evolution models have been developed to explain

the properties of real-world networks

- small world (Watts & Strogatz)

- preferential attachment (Barabasi et al.)

- duplication-mutation mechanisms

However, often model parameters can be tuned such that multiple models of

widely varying mechanisms perfectly fit the motivating real network in terms

of single selected features such as the scale-free exponent and the

clustering coefficient.

Aim here: use a discriminative classification technique from machine learning to

classify a given real network as one of many proposed network mechanisms by

enumerating local substructures.

Middendorf et al., arXiv:q-bio.QM/0408010 (2004/08/15)



Analysis of Drosophila melanogaster protein interaction network

Data set: protein-protein interaction map for Drosophila by Giot et al.

Problem: data set is subject to numerous false positives.

Giot et al. assign a confidence score p ∈ [0,1] to each interaction measuring how

likely the interaction occurs in vivo.

What threshold p* should be used?

Measure size of the components for all possible values of p*.

Observe: for p*= 0.65, the two largest components are connected,

so use this value as threshold. Edges in the graph correspond to interactions for

which p > p*.

Remove self-interactions and isolated vertices

3359 (4625) nodes with 2795 (4683) edges for p*= 0.65 (0.5)
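The threshold scan can be sketched as: keep edges with p > p*, drop self-interactions, and list the sizes of the connected components. Isolated vertices never enter the graph because only endpoints of kept edges are stored. The scored edges below are invented, and the helper name `component_sizes` is hypothetical.

```python
def component_sizes(scored_edges, p_star):
    """Sizes of connected components (largest first) using only edges with p > p_star."""
    adj = {}
    for a, b, p in scored_edges:
        if a != b and p > p_star:           # skip self-interactions and weak edges
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    sizes, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical interactions with confidence scores.
toy_edges = [("A", "B", 0.9), ("B", "C", 0.7), ("C", "D", 0.6),
             ("D", "E", 0.8), ("E", "E", 0.95)]
for p_star in (0.65, 0.5):
    print(p_star, component_sizes(toy_edges, p_star))
```

Scanning `p_star` over a grid and watching where the two largest components merge gives the kind of criterion described above.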




Network evolution models considered

Duplication-mutation-complementation (DMC) algorithm:

based on model that proposes that most of the duplicate genes observed today have been

preserved by functional complementation. If either the gene or its copy loses one of its

functions (edges), the other becomes essential in assuring the organism‘s survival.

Algorithm: duplication step is followed by mutations that preserve functional

complementarity.

At every time step choose a node v at random.

A twin vertex v_twin is introduced copying all of v‘s edges.

For each edge of v, delete with probability q_del either the original edge or its corresponding

edge of v_twin.

Conjoin the twins themselves with independent probability q_con, representing an interaction of a

protein with its own copy.

No edges are created by mutations: the DMC algorithm assumes that the probability of

creating new advantageous functions by random mutations is negligible.
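The DMC growth step described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code. The seed graph (a single edge), integer node labels, and the 50/50 choice of which of the two complementary edges is lost are assumptions.

```python
import random

def dmc_step(adj, q_del, q_con, rng):
    """One duplication-mutation-complementation step on an undirected graph."""
    v = rng.choice(sorted(adj))             # node to duplicate
    twin = max(adj) + 1                     # new integer label for the twin
    adj[twin] = set(adj[v])                 # twin copies all of v's edges
    for nb in adj[twin]:
        adj[nb].add(twin)
    for nb in list(adj[v]):
        if rng.random() < q_del:
            loser = rng.choice((v, twin))   # either the original or the copied edge is lost
            adj[loser].discard(nb)
            adj[nb].discard(loser)
    if rng.random() < q_con:                # optional edge between the twins
        adj[v].add(twin)
        adj[twin].add(v)
    return twin

def grow_dmc(n_nodes, q_del, q_con, seed=0):
    """Grow a graph from a single edge until it has n_nodes vertices."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_nodes:
        dmc_step(adj, q_del, q_con, rng)
    return adj

def n_edges(adj):
    return sum(len(nbs) for nbs in adj.values()) // 2

graph = grow_dmc(50, q_del=0.5, q_con=0.1, seed=1)
print(len(graph), n_edges(graph))
```

Because every neighbor stays attached to at least one of the twins, no function (edge) is lost entirely, which is the complementation idea. When the twins are conjoined, each neighbor they still share closes a 3-cycle.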




Network evolution models considered

Variant of DMC: Duplication-random mutations (DMR) algorithm:

Possible interactions between twins are neglected.

Instead, edges between v_twin and the neighbors of v can be removed with probability q_del

and new edges can be created at random between v_twin and any other vertices with

probability q_new/N, where N is the current total number of vertices.

DMR emphasizes the creation of new advantageous functions by mutation.
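A corresponding sketch of one DMR step, under stated assumptions: N is taken as the vertex count before the twin is added, new random edges are only drawn to vertices that are not neighbors of v, and nodes carry integer labels. None of these details is fixed by the slide.

```python
import random

def dmr_step(adj, q_del, q_new, rng):
    """One duplication-random-mutation step on an undirected graph."""
    n = len(adj)                            # N before adding the twin (assumption)
    v = rng.choice(sorted(adj))
    twin = max(adj) + 1
    neighbors = set()
    for u in sorted(adj):
        if u == v:
            continue                        # interactions between twins are neglected
        if u in adj[v]:
            if rng.random() >= q_del:       # inherited edge survives with probability 1 - q_del
                neighbors.add(u)
        elif rng.random() < q_new / n:      # brand-new edge created by mutation
            neighbors.add(u)
    adj[twin] = neighbors
    for u in neighbors:
        adj[u].add(twin)
    return twin

def n_edges(adj):
    return sum(len(nbs) for nbs in adj.values()) // 2

demo = {0: {1}, 1: {0}}
rng = random.Random(2)
while len(demo) < 30:
    dmr_step(demo, q_del=0.4, q_new=0.5, rng=rng)
print(len(demo), n_edges(demo))
```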

Other models:

- linear preferential attachment (LPA) (Barabasi)

- random static networks (Erdős-Rényi) (RDS)

- random growing networks (RDG – growing graphs where new edges are created randomly

between existing nodes)

- aging vertex networks (AGV – growing graphs modeling citation networks, where the

probability for new edges decreases with the age of the vertex)

- small-world network (SMW – interpolation between regular ring lattices and randomly

connected graphs).




Training set

Create 1000 graphs as training data for each of the seven different models.

Every graph is generated with the same number of edges and nodes as measured

in Drosophila.

Quantify topology of a network by counting all possible subgraphs up to a given

cut-off, which could be the number of nodes, number of edges, or the length of a

given walk.

Here: count all subgraphs that can be constructed by a walk of length=8 (148 non-

isomorphic subgraphs) or length=7 (130 non-isomorphic subgraphs).

Use these counts as input features for classifier.

Note that the average shortest path between two nodes of the Drosophila

network‘s giant component is 11.6 (9.4) for p*=0.65 (0.5).

Walks of length=8 can traverse large parts of the network.




Learning algorithm: Alternating Decision Tree


Rectangles: decision nodes.

A given network‘s subgraph counts

determine paths in the tree dictated

by inequalities specified by the

decision nodes.

For each class, the ADT

outputs a real-valued

prediction score, which is the

sum of all weights over all paths.

The class with the highest score

wins.
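How an alternating decision tree turns subgraph counts into per-class scores can be sketched with a rule list. Each rule has a precondition (the path leading to the decision node), a condition (the inequality at the node) and two weights; the score is a base value plus the weights of all rules whose precondition holds. This rule-list representation is one common way to describe ADTs and is assumed here. Feature names, thresholds and weights are invented.

```python
def adt_score(base, rules, x):
    """Sum the base value and the weights of all decision nodes reached by x."""
    score = base
    for precondition, condition, w_true, w_false in rules:
        if precondition(x):
            score += w_true if condition(x) else w_false
    return score

def classify(models, x):
    """Return the winning class and all per-class scores."""
    scores = {c: adt_score(base, rules, x) for c, (base, rules) in models.items()}
    return max(scores, key=scores.get), scores

def always(x):
    return True

def many_triangles(x):
    return x["triangles"] > 10

def few_triangles(x):
    return x["triangles"] <= 10

def many_chains(x):
    return x["chains"] > 50

# Hypothetical two-class model keyed by mechanism name.
models = {
    "DMC": (0.0, [(always, many_triangles, 1.0, -1.0)]),
    "LPA": (0.0, [(always, many_triangles, -1.0, 0.5),
                  (few_triangles, many_chains, 0.8, -0.3)]),
}

winner, scores = classify(models, {"triangles": 25, "chains": 80})
print(winner, scores)
```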



Performance on training set

The confusion matrix shows truth and prediction for the test sets.

5 out of 7 have nearly perfect prediction accuracy.

AGV is constructed as an interpolation between LPA and a ring lattice,

so the AGV, LPA and SMW mechanisms are equivalent in specific parameter

regimes and show a non-negligible overlap.




Discriminating similar networks

Ten graphs of two different mechanisms exhibit similar average geodesic lengths

and almost identical degree distribution and clustering coefficients.

(a) cumulative degree distribution p(k>k0), average clustering coefficient <C> and

average geodesic length <L>, all quantities averaged over a set of 10 graphs.

(b) Prediction score for all ten graphs and all five cross-validated ADTs. The two

sets of graphs can be perfectly separated by the classifier.




Learning algorithm: Alternating Decision Tree


Figure shows the first few decision

nodes (out of 120) of a resulting ADT.

The prediction scores reveal that a

high count of 3-cycles suggests a

DMC network.

DMC mechanism indeed

facilitates creation of many

3-cycles by allowing 2 copies

to attach to each other, thus

creating 3-cycles with their common

neighbors.

A low count in 3-cycles but a high

count in 8-edge linear chains is a

good predictor for LPA and DMR

networks.



Prediction for Drosophila melanogaster network

Use this classifier (ADT) with good prediction accuracy now to determine the

network mechanism that best reproduces the Drosophila network (or any network

of the same size).

Prediction scores for the Drosophila protein network for different confidence

thresholds p* and different cut-offs in subgraph size. Drosophila is consistently

classified as a DMC network, with an especially strong prediction for a confidence

threshold of p*=0.65 and independently of the cut-off in subgraph size.




Visualization of subgraphs

A qualitative and more intuitive

way of interpreting the

classification result is

visualizing the subgraph

profiles.

Subgraphs associated with

Figures 3 and 1.

A representative subset of 50

subgraphs out of 148 is

shown.




Subgraph profiles

The average subgraph count of the

training data for every mechanism is

shown for the 50 representative

subgraphs S1-S50.

Black lines indicate that this model is

closest to Drosophila based on the

absolute difference between the

subgraph counts.

For 60% of the subgraphs (S1-S30),

the counts for Drosophila

are closest to the DMC model.

All of these subgraphs contain one or

more cycles, including highly

connected subgraphs (S1) and long

linear chains ending in cycles (S16,

S18, S22, S23, S25).


The DMC algorithm is the only

mechanism that produces such

cycles with a high occurrence.



Robustness against noise

Edges in Drosophila network are randomly

replaced and the network is classified.

Plotted are prediction scores for each of the 7

classes as more and more edges are replaced.

Every point is an average over 200 independent

random replacements.

For high noise level (beyond 80%), the

network is classified as an Erdős-Rényi (RDS)

graph. For low noise (< 30%), the confidence

in the classification as a DMC network is even

higher than in the classification as an RDS

network for high noise. The prediction score

y(c) for class c is related to the estimated

probability p(c) for the tested network to be in

class c by


\[ p(c) = \frac{e^{2y(c)}}{1 + e^{2y(c)}} \]
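Assuming the relation between prediction score and class probability is the logistic form p(c) = e^(2y(c)) / (1 + e^(2y(c))), which is how the garbled formula on this slide is read here, the conversion can be sketched as follows. The function name is hypothetical.

```python
import math

def score_to_probability(y):
    """Map an ADT prediction score y(c) to an estimated probability p(c)."""
    # Algebraically identical to exp(2y) / (1 + exp(2y)).
    return 1.0 / (1.0 + math.exp(-2.0 * y))

for y in (-2.0, 0.0, 0.5, 2.0):
    print(y, round(score_to_probability(y), 3))
```

A score of zero corresponds to p = 0.5, and the probability grows monotonically with the score.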



Conclusions

Very nice (!) method that allows one to infer growth mechanisms for real networks.

Method is robust against noise and data subsampling,

no prior assumption about network features/topology required.

Learning algorithm does not assume any relationships between features

(e.g. orthogonality). Therefore the input space can be augmented with various

features in addition to subgraph counts.

The protein interaction network of Drosophila is confidently classified as DMC

network.

