Graphs and networks
Wolfgang Huber, Vincent Carey, Robert Gentleman, Seth Falcon
Based on "Bioinformatics and Computational Biology Solutions using R and Bioconductor",
Gentleman, Carey, Huber, Irizarry, Dudoit. Springer Verlag.
Graphs
Set of nodes and set of edges.
Nodes: objects of interest
Edges: relationships between them
A useful abstraction to talk about relationships and interactions (think of apples, pears, fingers, integer numbers)
Edges may have weights, directions, types
Use of graphs in biology
Knowledge representation
signal transduction, regulatory or metabolic networks; cartoons or more formalized
Gene Ontology
Graph-like data (“interactions”)
Y2H, APMS, ChIP-chip
Statistical models that use (sparse) graphs
Bayesian networks, phylogenetic trees
Graph-like data: true state of nature vs experimental measurement
[Figure: “real” network → experiment → measured network; statistical inference]
Graph-like data: uncertainty
Distinguish between the true, underlying property that you want to measure, and the actual result of a measurement (experiment)
1. False positive edges
2. False negative edges (were tested, were not found, but are there in nature)
3. Untested edges (were not tested, are not in your data, but are there in nature)
Uncertainty is not usually considered in mainstream graph theory, but cannot be ignored in bioinformatics applications.
The “real” network
...is a mathematical model of nature, not to be confused with nature itself.
E.g. we can model protein-protein interactions as yes/no relationships, even though at closer look they can have a continuum of affinities, can be dynamic and can depend on the environment.
“All models are wrong, some are useful”
Motivating examples
Transcription factor graphs
Pathway graphs
GO
Literature graphs
Protein-complex graphs
Transcription factor interactions
Nodes = transcription factors
Directed edge: X regulates transcription of Y
Transcriptional regulatory networks from "ChIP on chip" (chromatin immunoprecipitation)
regulator := a transcription factor (TF) or a ligand of a TF
tag: c-myc epitope
106 microarrays
samples: enriched (tagged-regulator + DNA-promoter)
probes: cDNA of all promoter regions
spot intensity ~ affinity of a promoter to a certain regulator
Lee et al. Science 2002
Transcriptional regulatory network: a bipartite graph
[Figure: 0/1 adjacency matrix of the bipartite graph, with 106 regulators (TFs) on one axis and 6270 promoter regions on the other]
Application: network motifs
A pathway graph
Machine-readable pathway databases
KEGG
reactome
BioCarta (biocarta.com)
National Cancer Institute cMAP
Gene Ontology (GO)
A structured vocabulary to describe the molecular function of gene products, biological processes, and cellular components.
Plus
A set of "is a", "is part of" relationships between these terms
Directed acyclic graph
GO graphs
> tfG = GOGraph("GO:0003700", GOMFPARENTS)
Gene-Literature graphs
DKC1
The bipartite gene-literature graph: actor and event size adjustment
actors: genes
actor size: number of papers that a gene appears in
event: paper
event size: number of genes that appear in a paper
Example: R. Strausberg et al. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS 99:16899–903, 2002
cites 15,000 genes
Closing gene lists with literature
Boundary of gene list L: set of all genes that have co-citation (above threshold weight) with genes in L.
[Figure: co-citation graph with nodes Gene 1 to Gene 7]
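The boundary definition above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the gene-by-paper matrix `B`, the helper name `boundary`, and the use of raw co-citation counts as weights (without the actor and event size adjustment mentioned earlier) are all hypothetical choices, not the authors' implementation.

```r
# Hypothetical gene-by-paper incidence matrix (1 = gene is cited in paper).
B <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 1, 1,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Gene", 1:4), paste0("Paper", 1:3)))

# Co-citation weight (assumption: raw count of papers citing both genes).
W <- B %*% t(B)
diag(W) <- 0

# Boundary of gene list L: genes outside L whose co-citation weight with
# at least one member of L is above the threshold.
boundary <- function(W, L, threshold = 0) {
  outside <- setdiff(rownames(W), L)
  hit <- colSums(W[L, outside, drop = FALSE] > threshold) > 0
  outside[hit]
}

boundary(W, "Gene1")                 # genes co-cited with Gene1
boundary(W, c("Gene1", "Gene3"))     # boundary of a two-gene list
```

Raising `threshold` shrinks the boundary, which is one way to keep very large papers (such as the 15,000-gene example) from connecting everything to everything.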
Graphs: vocabulary
Directed, undirected graphs
Adjacent nodes
Accessible nodes
Self-loop
Multi-edge
Node degree
Walk: alternating sequence of nodes and incident edges
Closed walk
Distance between nodes, shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Cycle
Connected graph
Weakly connected directed graph (see next page)
Strong and weak connectivity
Connectivity: minimum number of edges whose removal results in disconnected graph
Clique: every pair of nodes joined by an edge
Graphs: vocabulary
Cut: remove edges to disconnect a graph
Cut-set: remove nodes to disconnect a graph
Special types of graphs
Bipartite graph
Bipartite graphs
$A_G$: adjacency matrix ($n \times m$) of a bipartite graph $G$ with node sets $U$, $V$
One-mode graphs:
$A_U = A_G^t A_G$
$A_V = A_G A_G^t$
(Boolean algebra)
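A minimal base R sketch of the one-mode projections, assuming a toy adjacency matrix. The orientation (rows indexed by V, columns by U) is an assumption made here so that the products match the formulas above; the helper `boolProd` is hypothetical.

```r
# Toy bipartite graph: rows are nodes of V, columns are nodes of U
# (assumed orientation, so that t(AG) %*% AG is indexed by U).
AG <- matrix(c(1, 1, 0,
               0, 1, 0,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("v1", "v2", "v3"), c("u1", "u2", "u3")))

# Boolean matrix product: TRUE where at least one shared neighbour exists.
boolProd <- function(X, Y) (X %*% Y) > 0

AU <- boolProd(t(AG), AG)   # U x U: linked if two U-nodes share a V-node
AV <- boolProd(AG, t(AG))   # V x V: linked if two V-nodes share a U-node

AU["u1", "u2"]  # u1 and u2 are both attached to v1
AU["u1", "u3"]  # u1 and u3 have no common V-node
```

Dropping the `> 0` comparison gives the number of shared neighbours instead of a yes/no edge, which can serve as an edge weight.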
Multigraphs
Can have different types of edges
Hypergraphs
:= set of Nodes + set of hyperedges
A hyperedge is a set of nodes (can be more than 2)
A directed hyperedge: pair (tail and head) of sets of nodes
Directed acyclic graphs
Useful for representing hierarchies and partial orderings (e.g. in time, from general to special, from cause to effect)
Many applications:
GO
MeSH
Graphical models
Random Edge Graphs
n nodes, m edges
p(i,j) = 1/m
with high probability:
m < n/2: many disconnected components
m > n/2: one giant connected component: size ~ n.
(next biggest: size ~ log(n)).
degrees of separation: log(n).
Erdős and Rényi 1960
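A small simulation can make the component structure tangible. The sketch below uses base R only and assumes the model in which m distinct node pairs are drawn uniformly at random; the variable names are illustrative.

```r
set.seed(1)
n <- 100   # number of nodes
m <- 50    # number of edges

# All possible undirected node pairs; draw m of them uniformly without replacement.
pairs <- t(combn(n, 2))
edges <- pairs[sample(nrow(pairs), m), , drop = FALSE]

# Node degrees: how often each node appears as an edge endpoint.
deg <- tabulate(c(edges), nbins = n)

# Connected components: merge component labels along each edge.
comp <- seq_len(n)
for (k in seq_len(m)) {
  a <- comp[edges[k, 1]]
  b <- comp[edges[k, 2]]
  if (a != b) comp[comp == b] <- a
}
compSizes <- as.vector(table(comp))

table(deg)         # degree distribution
table(compSizes)   # number of components of each size
```

Rerunning with larger `m` (for instance `m <- 150`) is a quick way to watch a single large component emerge.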
Random edge graph
100 nodes 50 edges
degree distribution
Random graphs versus permutation graphs
For statistical inference, one can consider null hypotheses based on aforementioned random graph models; and ones based on node permutation of data graphs.
The second is often more appropriate.
Cohesive subgroups
For data graphs, the concept of clique is usually too restrictive (false negative or untested edges)
n-clique: distance between all members is <=n. (Clique: n=1)
k-plex: maximal subgraph G in which each member is neighbour of at least |G|-k others. (Clique: k=1)
k-core: maximal subgraph G in which each member is neighbour of at least k others. (Clique: k=|G|-1)
After: Social Network Analysis, Wasserman and Faust (1994)
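As an illustration of the k-core definition, here is a minimal base R sketch on a toy undirected graph. The function `kCore` and the example graph are hypothetical; the approach (repeatedly dropping nodes with fewer than k remaining neighbours) is a standard way to obtain k-cores, not something taken from the slides.

```r
# Symmetric 0/1 adjacency matrix of a toy graph:
# a, b, c, d form a clique, e hangs off d, f is isolated.
nodes <- c("a", "b", "c", "d", "e", "f")
A <- matrix(0, 6, 6, dimnames = list(nodes, nodes))
A[t(combn(c("a", "b", "c", "d"), 2))] <- 1   # clique edges, one direction
A["d", "e"] <- 1
A <- A + t(A)                                # make symmetric

# k-core: repeatedly drop nodes with fewer than k neighbours among the rest.
kCore <- function(A, k) {
  keep <- rownames(A)
  repeat {
    deg <- rowSums(A[keep, keep, drop = FALSE])
    low <- keep[deg < k]
    if (length(low) == 0) break
    keep <- setdiff(keep, low)
  }
  keep
}

kCore(A, 1)  # drops only the isolated node f
kCore(A, 3)  # leaves the clique a, b, c, d
```

Unlike a clique search, this tolerates missing edges: a group stays together as long as every member keeps at least k neighbours.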
A Graph Theoretic Algorithm for Estimating Protein Complex Membership using Data from Affinity Purification - Mass Spectrometry Technology
Denise Scholtens and Robert Gentleman
Two Types of Protein Relationship
• AP-MS (Affinity Purification - Mass Spectrometry )
– Measures Complex Comembership
• Gavin, et al. (Nature, 2002)
– TAP : Tandem Affinity Purification
• Ho, et al. (Nature, 2002)
– HMS-PCI: High-throughput Mass Spectrometric Protein Complex Identification
• Y2H (Yeast Two Hybrid)
– Measures Physical Interaction
• Ito, et al. (PNAS, 1998)
• Uetz, et al. (Nature, 2000)
AP-MS data:
Using a bait protein, AP-MS technology finds hit proteins that are comembers of at least one complex with the bait.
Y2H data:
Y2H technology finds pairs of physically interacting proteins.
[Figure: AP-MS data (one purification: a bait and its hits) next to Y2H data]
We want to estimate the bipartite protein complex membership graph, A:
Estimation of A requires estimation of k, the number of complexes.
The strategy
1. Some proteins participate in more than one complex
2. In an AP-MS experiment, some proteins are used as baits and some proteins are only ever found as hits
3. Graph theoretic aspects of the model:
• Bipartite graph for complex membership (A)
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins
4. Statistical aspects: three types of errors: false positive, false negative (assayed), false negative (unassayed)
PP2A
Heterotrimeric complex consisting of:
Tpd3 - regulatory A subunit
Rts1 or Cdc55- regulatory B subunits
Pph21 or Pph22- catalytic subunits
Jiang and Broach (1999). EMBO.
1. Some proteins participate in more than one complex
Gavin, et al. (2002): Rgraphviz plot of yTAP C151
Bader & Hogue (2002): Portion of Figure 2: Overlap of the spoke models of TAP and HMS-PCI.
Jansen, et al. (2003): PIT Bayesian Network, LR>600
http://genecensus.org/intint
[Figure node labels: Tpd3, Pph21, Myo5, Cdc55, Cdc11, Pph22, Cdc10]
apComplex algorithm detects:
Zds1 and Zds2 (known cell-cycle regulators) only exist in complexes with the Cdc55-Pph22 trimer!
2. Graph theoretic paradigm to allow for succinct expression of constructs involved
• Bipartite graph for complex membership
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins
We want to estimate PCMG using AP-MS assays of CCG
The Connection: Maximal Complete Subgraphs
Complete Subgraph: set of n nodes for which all n(n-1) directed edges exist
Maximal Complete Subgraph: complete subgraph that is not contained in any other complete subgraph
2. Graph theoretic paradigm to allow for succinct expression of constructs involved
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
Y represents “ideal” complex comembership observations from perfectly sensitive and perfectly specific AP-MS technology. Y depends on the baits that are used in an experiment. Y is assayed by AP-MS technology.
The Connection: Maximal BH-Complete Subgraphs
BH-Complete Subgraph: set of n bait nodes and m hit-only nodes for which all n(n-1)+nm directed edges exist
Maximal BH-Complete Subgraph: BH-complete subgraph that is not contained in any other BH-complete subgraph
3. Statistical paradigm to allow for false positive and false negative observations
Z represents actual observations using AP-MS technology.
We will look for sets of proteins that form maximal BH-complete subgraphs with an allowance for false positive and false negative observations.
AP-MS data for N bait proteins and M hit proteins → noisy directed graph Z
Model: Z = AA' +
We want to estimate A.
In summary…
Start with an initial estimate for A, and then refine that estimate according to a two-component probability measure:
$P(Z \mid A, \mu, \alpha) = L(Z \mid Y = AA', \mu, \alpha) \cdot C(Z \mid A, \mu, \alpha)$
L: usual likelihood
C: regularization/penalty term (no. of complexes)
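To make the relation between A, Y and Z concrete, the following base R sketch builds the ideal comembership Y = AA' (Boolean algebra) from a hypothetical membership matrix and then counts the three error types against a made-up observation matrix Z. All protein names, the bait set and the entries of Z are invented for illustration; this is not the apComplex algorithm itself.

```r
# Hypothetical membership matrix A: rows are proteins, columns are complexes.
A <- matrix(c(1, 0,
              1, 1,
              0, 1,
              0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("P1", "P2", "P3", "P4"), c("C1", "C2")))

# Ideal comembership: Y[i, j] = 1 if proteins i and j share a complex.
Y <- (A %*% t(A) > 0) * 1
diag(Y) <- 0

# Hypothetical experiment: only P1 and P2 were used as baits (rows of Z).
baits <- c("P1", "P2")
Z <- matrix(0, 4, 4, dimnames = dimnames(Y))
Z["P1", "P2"] <- 1   # observed and true
Z["P1", "P4"] <- 1   # observed but not true
Z["P2", "P1"] <- 1   # observed and true; P2-P3 and P2-P4 were missed

falsePos <- sum(Z[baits, ] == 1 & Y[baits, ] == 0)   # tested, found, not real
falseNeg <- sum(Z[baits, ] == 0 & Y[baits, ] == 1)   # tested, real, not found
untested <- sum(Y[setdiff(rownames(Y), baits), ])    # real, never assayed
c(falsePos = falsePos, falseNeg = falseNeg, untested = untested)
```

The untested count depends only on which proteins were chosen as baits, which is why Y (and not just Z) depends on the experimental design.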
Software for graphs and networks
Here I will focus on what is available from the Bioconductor project, and through it (BGL, graphviz)
There are many other software packages, e.g. LEDA, Cytoscape
graph, RBGL, Rgraphviz
graph: basic class definitions and functionality
RBGL: interface to graph algorithms
Rgraphviz: rendering functionality. Different layout algorithms. Node plotting, line type, color etc. can be controlled by the user.
Representation of graphs:
From-To matrix
     from to
[1,] "a"  "b"
[2,] "b"  "c"
[3,] "c"  "d"
[4,] "d"  "b"
Representation of graphs:
Adjacency matrix (naive)
  a b c d
a 0 1 0 0
b 0 0 1 0
c 0 0 0 1
d 0 1 0 0
Representation of graphs:
Adjacency matrix (sparse)
> am
  a b c d
a 0 1 0 0
b 0 0 1 0
c 0 0 0 1
d 0 1 0 0
> as.matrix.csr(am)
An object of class "matrix.csr"
Slot "ra":
[1] 1 1 1 1

Slot "ja":
[1] 2 3 4 2

Slot "ia":
[1] 1 2 3 4 5

Slot "dimension":
[1] 4 4
Representation of graphs:
Node edge list
> class(g)
[1] "graphNEL"
attr(,"package")
[1] "graph"
> nodes(g)
[1] "a" "b" "c" "d"
> edges(g)
$a
[1] "b"

$b
[1] "c"

$c
[1] "d"

$d
[1] "b"
Representation of graphs:
package graph
From-To matrix
Adjacency matrix (naive)
Adjacency matrix (sparse)
Node-edge lists
They are equivalent, but may be different in performance and convenience for different applications.
Can coerce between the representations
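The equivalence of the representations can be checked with a round trip in base R. This is a sketch under the assumption of an unweighted directed graph; the variable names are illustrative, and the graph package offers its own converters (for instance `ftM2adjM` and `ftM2graphNEL`) that would normally be used instead.

```r
# From-to matrix of the example graph (a -> b -> c -> d -> b).
ft <- cbind(from = c("a", "b", "c", "d"),
            to   = c("b", "c", "d", "b"))
nodes <- sort(unique(c(ft)))

# From-to matrix -> adjacency matrix (matrix indexing by node names).
adjM <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adjM[ft] <- 1

# Adjacency matrix -> node-edge list.
edgeL <- lapply(nodes, function(v) nodes[adjM[v, ] == 1])
names(edgeL) <- nodes

# Node-edge list -> from-to matrix (closing the round trip).
ft2 <- cbind(from = rep(names(edgeL), lengths(edgeL)),
             to   = unlist(edgeL, use.names = FALSE))

identical(ft, ft2)   # the round trip reproduces the original from-to matrix
```

The naive adjacency matrix needs n^2 entries regardless of the number of edges, which is the main reason to prefer sparse matrices or edge lists for large, sparse graphs.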
Creating our first graph
> library("graph"); library(Rgraphviz)
> myNodes = c("s", "p", "q", "r")
> myEdges = list(s = list(edges = c("p", "q")), p = list(edges = c("p", "q")), q = list(edges = c("p", "r")), r = list(edges = c("s")))
> g = new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")
> plot(g)
Querying nodes, edges, degree
> nodes(g)
[1] "s" "p" "q" "r"
> edges(g)
$s
[1] "p" "q"

$p
[1] "p" "q"

$q
[1] "p" "r"

$r
[1] "s"

> degree(g)
$inDegree
s p q r
1 3 2 1

$outDegree
s p q r
2 2 2 1
adjacent and accessible nodes
> adj(g, c("b", "c"))
$b
[1] "b" "c"

$c
[1] "b" "d"

> acc(g, c("b", "c"))
$b
a c d
3 1 2

$c
a b d
2 1 1
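The numbers returned by `acc` are the numbers of steps needed to reach each accessible node. A breadth-first search in base R reproduces them; the sketch below is an illustration, not the package's implementation. The edge list is an assumption: the edges of b and c are taken from the `adj` output above, while those of a and d are inferred from the earlier s, p, q, r example with renamed nodes.

```r
# Assumed edge list (edges of a and d inferred, see lead-in).
edgeL <- list(a = c("b", "c"), b = c("b", "c"), c = c("b", "d"), d = "a")

# Breadth-first search from 'start': number of steps to each reachable node.
bfsDist <- function(edgeL, start) {
  dist <- numeric(0)
  frontier <- start
  seen <- start
  step <- 0
  while (length(frontier) > 0) {
    step <- step + 1
    # Unseen neighbours of the current frontier are reached in 'step' steps.
    nxt <- setdiff(unique(unlist(edgeL[frontier], use.names = FALSE)), seen)
    dist[nxt] <- step
    seen <- c(seen, nxt)
    frontier <- nxt
  }
  dist[order(names(dist))]
}

bfsDist(edgeL, "b")
bfsDist(edgeL, "c")
```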
Graph manipulation
> g1 <- addNode("e", g)
> g2 <- removeNode("d", g)
> ## addEdge(from, to, graph, weights)
> g3 <- addEdge("e", "a", g1, pi/2)
> ## removeEdge(from, to, graph)
> g4 <- removeEdge("e", "a", g3)
> identical(g4, g1)
[1] TRUE
Elementary computations on IMCA pathway
> library("graph")
> data("integrinMediatedCellAdhesion")
> acc(IMCAGraph, "SOS")
           Ha-Ras               Raf               MEK
                1                 2                 3
              ERK              MYLK               MYO
                4                 5                 6
          F-actin cell proliferation
                7                 5
GXL: graph exchange language
<gxl>
  <graph edgemode="directed" id="G">
    <node id="A"/>
    <node id="B"/>
    <node id="C"/>
    …
    <edge id="e1" from="A" to="C">
      <attr name="weights">
        <int>1</int>
      </attr>
    </edge>
    <edge id="e2" from="B" to="D">
      <attr name="weights">
        <int>1</int>
      </attr>
    </edge>
    …
  </graph>
</gxl>
from graph/GXL/kmstEx.gxl
GXL (www.gupro.de/GXL) is "an XML sublanguage designed to be a standard exchange format for graphs". The graph package provides tools for importing and exporting graphs as GXL.
RBGL: interface to the Boost Graph Library
Connected components
cc = connComp(rg)
table(listLen(cc))
 1  2  3  4 15 18
36  7  3  2  1  1
Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)
Depth first search
dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"
[Figure: the graph rg]
depth / breadth first search
dfs(sg, "N14")
bfs(sg, "N14")
connected components
sc = strongComp(g2)
nattrs = makeNodeAttrs(g2, fillcolor="")
for(i in 1:length(sc)) nattrs$fillcolor[sc[[i]]] = myColors[i]
plot(g2, "dot", nodeAttrs=nattrs)
wc = connComp(g2)
shortest path algorithms
Different algorithms for different types of graphs
o all edge weights the same
o positive edge weights
o real numbers
…and different settings of the problem
o single pair
o single source
o single destination
o all pairs
Functions
bfs
dijkstra.sp
sp.between
johnson.all.pairs.sp
shortest path
set.seed(123)
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2, fromNode, toNode)
sp[[1]]$path
 [1] "N43" "N08" "N88"
 [4] "N73" "N50" "N89"
 [7] "N64" "N93" "N32"
[10] "N12" "N81"
sp[[1]]$length
[1] 10
connectivity
Consider graph g with a single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.
> edgeConnectivity(g)
$connectivity
[1] 2

$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"

$minDisconSet[[2]]
[1] "D" "H"
Rgraphviz: the different layout engines
ImageMap
lg = agopen(g, …)
imageMap(lg, con=file("imca-frame1.html", open="w"),
  tags=list(HREF = href, TITLE = title, TARGET = rep("frame2", length(AgNode(lg)))),
  imgname=fpng, width=imw, height=imh)
Show drosophila interaction network example
Combining R graphics and graphviz: custom node drawing functions
Combining: graphviz layout and R plot
Using GO to interpret gene lists
Packages: GOstats, Rgraphviz
A pathway graph
Acknowledgements
Vince Carey
Seth Falcon
Robert Gentleman
Jeff Gentry
Li Long
Denise Scholtens
Bioconductor developers