Graphs in R for Metagenomics
Susan Holmes
Bio-X and Statistics, Stanford University
SISMID14-Lect 13
Example 1: Co-occurrence Networks
In work on mice with Relman, Theriot and Hoy (to appear):
- Core community of species present at a minimum threshold (threshold 1).
- Which of these species co-occur in more than a certain percentage of samples? (threshold 2)
[Figure: Co-occurrence graph with isolates, jt = 0.6]
[Figure: Co-occurrence graph without isolates, jt = 0.6]
How is it done?
###############################################################################
# Function that inputs an abundance table as a matrix
# and makes a graph adjacency matrix ready to plot
# with the igraph package.
# Author: Susan Holmes
# Adapted for PhyloSeq by Paul J. McMurdie
###############################################################################
makenetwork=function(abund,plotgraph=TRUE,community=TRUE,threshold=0,incommon=0.4,method="jaccard"){
  # abund is the abundance table with no rows that are all zero
  # plotgraph is a toggle for whether a plotted network is requested
  # if community=TRUE the function will also output the community groups
  # threshold is the number that is fixed for when presence occurs
  # for instance, threshold=1 means that species with 2 or more reads
  # are considered present
  require(vegan); require(igraph)
  ### Keep the original row numbers as labels
  ### this is for the later analysis of groups
  ### (is.null is used because a matrix without row names has NULL dimnames)
  if (is.null(rownames(abund))) dimnames(abund)=list(1:nrow(abund),1:ncol(abund))
  ### Only take the rows whose total count is over threshold
  abundance = abund[rowSums(abund)>threshold,]
  n=nrow(abundance)
  # Convert to 1,0 binary matrix for input to vegdist. -0 converts to numeric
  presenceAbsence = (abundance > threshold) - 0
  ## Compute the Jaccard distance between the rows; this will only make points
  ## closer if they are actually present together
  ## You could use any of the other distances in vegan or elsewhere
  jaccpa=vegdist(presenceAbsence, method)
  ### Distances in R are vectors by default, we make them into matrices
  jaacm=as.matrix(jaccpa)
  coinc=matrix(0,n,n)
  ind1=which((jaacm>0 & jaacm<(1-incommon)),arr.ind=TRUE)
  coinc[ind1]=1
  dimnames(coinc)=list(dimnames(abundance)[[1]],dimnames(abundance)[[1]])
  ### If using the network package create the graph with
  ### g<-as.network.matrix(coinc,matrix.type="adjacency")
  #### Here I use the igraph adjacency command
  ig=graph.adjacency(coinc)
  ### What class is this?
  class(ig)
  ### Take out the isolates
  isolates=V(ig)[ degree(ig)==0 ]
  ignoisol=delete.vertices(ig, V(ig)[ degree(ig)==0 ])
  if (plotgraph==TRUE){
    plot(ignoisol, layout=layout.fruchterman.reingold,vertex.size=0.6, vertex.label.dist=0.1,edge.arrow.mode="-",vertex.color="red",vertex.label=NA,edge.color="blue")
    title("Co-occurrence graph without isolates")
  }
Finding which taxa in which group
  if (community==TRUE){
    communitywalk=walktrap.community(ignoisol)
    nonisolates= V(ig)[ degree(ig)!=0 ]
    ### Note: igraph versions from 0.6 on number communities starting at 1;
    ### with those versions the first two groups are membership==1 and membership==2
    group0=nonisolates[which(communitywalk$membership==0)]
    group1=nonisolates[which(communitywalk$membership==1)]
    ### You can then play around with coloring and labelling in the graph
    ### For help don't forget to look up plot.igraph or plot.network
    ### not just plot as it inherits the plot method appropriate to its class
    groups=list(group0,group1)
    return(groups)
  }
}
Functions contained in package Phyloseq, joint with P. J. McMurdie (to appear).
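The thresholding rule used in makenetwork can be tried on a toy table without vegan or igraph. The sketch below is not the package code: the table `pa`, the helper `jaccard_dist` and the taxon names are made up for illustration, and the distance is the usual binary Jaccard distance computed by hand in base R.

```r
# Toy presence/absence table: 4 hypothetical taxa (rows) by 5 samples (columns).
pa <- rbind(
  taxonA = c(1, 1, 1, 0, 0),
  taxonB = c(1, 1, 1, 1, 0),
  taxonC = c(0, 0, 0, 1, 1),
  taxonD = c(0, 0, 0, 0, 1)
)

# Binary Jaccard distance:
# 1 - (samples where both are present) / (samples where either is present).
jaccard_dist <- function(x, y) {
  1 - sum(x & y) / sum(x | y)
}

n <- nrow(pa)
jd <- matrix(0, n, n, dimnames = list(rownames(pa), rownames(pa)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    jd[i, j] <- jaccard_dist(pa[i, ], pa[j, ])
  }
}

# Same rule as in makenetwork: link two taxa when 0 < distance < 1 - incommon.
incommon <- 0.4
adj <- (jd > 0 & jd < (1 - incommon)) * 1
print(adj)
```

Here taxonA and taxonB share 3 of the 4 samples in which either occurs (distance 0.25) and are linked, while taxonB and taxonC share only 1 of 5 (distance 0.8) and are not.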
Significant Graphs by Simulated Annealing: GXNA
Bacteria interact, and there is a lot of interest in methods that analyze groups of species.
Prior knowledge: interaction graph
Similar in spirit to Ideker.
You need an interaction graph.
Rather than using a list of known pathways, we represent prior knowledge as an interaction graph. The nodes of the graph correspond to genes; there is an edge between two nodes if their genes interact. The interaction information is obtained from public databases such as EntrezGene.
A list of known pathways contains some of the same information, but an interaction graph is more precise, as it describes which genes are closely connected within a given pathway. Hence this model is more likely to detect local disturbances within known pathways, as well as within pathways that have not yet been described.
A group of related genes corresponds to a connected subgraph of the interaction graph. We are interested in the most significant connected subgraphs; several scoring functions can be used, as described in the methods section. Once the interaction graph and scoring functions are selected, a search algorithm is used to find subgraphs with high scores.
Example: Gene Interaction Data
Interaction data was downloaded from the EntrezGene database in December 2005. This was represented as an undirected graph where each node is a gene and two nodes are connected by an edge if their genes interact. Loops (nodes connected to themselves) were eliminated. This results in a graph with 6227 nodes and 17146 edges. The highest degree is 156 (gene TP53). The most common degree is 1 (genes interacting with only one other gene); there are 1845 such nodes.
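As a rough illustration of how such summaries can be obtained, the sketch below builds an undirected graph from an edge list in base R, drops loops and duplicate edges, and tabulates node degrees. The edge list and the gene labels `g1` to `g4` are hypothetical and have nothing to do with the EntrezGene data.

```r
# Hypothetical toy edge list (one loop, one duplicated edge).
edges <- data.frame(
  from = c("g1", "g1", "g1", "g2", "g3", "g3"),
  to   = c("g2", "g3", "g4", "g2", "g4", "g1"),
  stringsAsFactors = FALSE
)

# Eliminate loops (nodes connected to themselves).
edges <- edges[edges$from != edges$to, ]

# Undirected graph: order the endpoints of each edge, then drop duplicates.
und <- unique(data.frame(
  a = pmin(edges$from, edges$to),
  b = pmax(edges$from, edges$to),
  stringsAsFactors = FALSE
))

# Degree of a node = number of edges in which it appears.
deg <- table(c(und$a, und$b))

n_nodes     <- length(deg)
n_edges     <- nrow(und)
top_node    <- names(deg)[which.max(deg)]   # node with the highest degree
n_degree_1  <- sum(deg == 1)                # nodes with a single neighbor
degree_dist <- table(deg)                   # how many nodes have each degree
print(degree_dist)
```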
A subset of the gene interaction network.
Scoring Functions
Given a gene or a set of genes, we need to compute a score that measures to what extent it is differentially expressed. We discuss several possible choices and their pros and cons.
Scores for a single gene
Consider first a single gene. The most popular scoring function is the t statistic:
\[ T_i = \frac{\mu_{i1} - \mu_{i0}}{\sqrt{\sigma_{i1}^2/n_1 + \sigma_{i0}^2/n_0}} \qquad (1) \]
where the mean and standard deviation $\mu_{i1}, \sigma_{i1}$ are for gene $i$ and the case phenotype, $\mu_{i0}, \sigma_{i0}$ are for gene $i$ and the control phenotype, and $n_1$, $n_0$ are the numbers of cases and controls. Several alternatives exist. For simplicity, we focus on the t statistic, though most of our methods remain valid if we replace it with any reasonable competitor.
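Equation (1) can be computed for all genes at once. The sketch below assumes an expression matrix with genes in rows and samples in columns and uses sample means and sample variances as the estimates; the function name `t_scores` and the simulated data are illustrative only.

```r
# Per-gene t statistic as in equation (1).
# expr:    numeric matrix, genes in rows, samples in columns
# is_case: logical vector, TRUE for case samples, FALSE for controls
t_scores <- function(expr, is_case) {
  x1 <- expr[, is_case, drop = FALSE]    # cases
  x0 <- expr[, !is_case, drop = FALSE]   # controls
  n1 <- ncol(x1)
  n0 <- ncol(x0)
  m1 <- rowMeans(x1)
  m0 <- rowMeans(x0)
  v1 <- apply(x1, 1, var)                # sample variance per gene
  v0 <- apply(x0, 1, var)
  (m1 - m0) / sqrt(v1 / n1 + v0 / n0)
}

# Simulated toy data: 5 genes, 4 controls followed by 4 cases.
set.seed(1)
expr <- matrix(rnorm(5 * 8), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))
is_case <- rep(c(FALSE, TRUE), each = 4)
tt <- t_scores(expr, is_case)
print(round(tt, 3))
```

With these estimates the result agrees with the statistic reported by `t.test(cases, controls)` in its default unequal-variance form.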
Scores based on averaging test statistics
Now consider a set $S = \{g_1, \dots, g_k\}$ of $k$ genes. A natural way to assign the set a score is to average the scores of its individual genes, leading to the scoring function used in Tian et al. 2003:
\[ f_1(S) = \frac{1}{k} \sum_{i=1}^{k} T_{g_i} \qquad (2) \]
Often pathways contain both upregulated and downregulated genes; as pointed out in Ideker, this can be captured by taking absolute values of the test statistic, possibly at the cost of creating more false positives:
\[ f_2(S) = \frac{1}{k} \sum_{i=1}^{k} |T_{g_i}| \qquad (3) \]
Either way, the distribution of the score depends on the set $S$ (e.g. on its size), so it should be normalized before it is used to compare different sets.
Tian et al. propose a nonparametric normalization method: permutations of the phenotypes are used to estimate the null distribution of the score, and the score is adjusted by essentially replacing it with its quantile. The advantage of this method is that it is nonparametric. Its main drawback is that estimates of the null distribution are reliable only if enough permutations are used, which may not be possible if the number of phenotypes is small.
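A minimal sketch of scores (2) and (3) and of a permutation-based quantile adjustment is given below. It is only meant to convey the idea, not to reproduce the procedure of Tian et al.; the helper names (`welch_t`, `f1`, `f2`, `perm_quantile`), the number of permutations and the simulated data are assumptions.

```r
# Per-gene t statistic (equation (1)), genes in rows, samples in columns.
welch_t <- function(expr, is_case) {
  x1 <- expr[, is_case, drop = FALSE]
  x0 <- expr[, !is_case, drop = FALSE]
  (rowMeans(x1) - rowMeans(x0)) /
    sqrt(apply(x1, 1, var) / ncol(x1) + apply(x0, 1, var) / ncol(x0))
}

# Set scores (2) and (3): average of t, or of |t|, over the genes in `set`.
f1 <- function(t, set) mean(t[set])
f2 <- function(t, set) mean(abs(t[set]))

# Replace the observed score by its quantile in the permutation distribution:
# phenotype labels are shuffled, t statistics and the set score are recomputed.
perm_quantile <- function(expr, is_case, set, score, n_perm = 200) {
  obs  <- score(welch_t(expr, is_case), set)
  null <- replicate(n_perm, score(welch_t(expr, sample(is_case)), set))
  mean(null <= obs)
}

# Simulated toy data: 6 genes, 4 controls and 4 cases; genes 1-3 are shifted.
set.seed(2)
expr <- matrix(rnorm(6 * 8), nrow = 6)
is_case <- rep(c(FALSE, TRUE), each = 4)
expr[1:3, is_case] <- expr[1:3, is_case] + 2

q <- perm_quantile(expr, is_case, set = 1:3, score = f2)
print(q)
# With 4 cases and 4 controls there are only choose(8, 4) = 70 distinct
# relabelings, which illustrates the drawback mentioned above.
```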
The alternative is to make some kind of parametric assumption. For example, Ideker normalizes all sets of size $k$ by comparing with a single reference distribution, computed by sampling from random sets of $k$ genes. Ignoring the small effect due to sampling without replacement, this amounts to using
\[ f_3(S) = \frac{1}{\sqrt{k}} \left( \sum_{i=1}^{k} |T_{g_i}| - k\mu \right) \qquad (4) \]
where $\mu$ is the mean of $|T|$ over all genes. The implicit assumptions here are that (1) the normalization need only depend on the size of the set, and (2) individual gene scores are independent. The latter assumption in particular is not realistic; it would be better to normalize by sampling among connected sets of $k$ genes, leading to
\[ f_4(S) = \frac{1}{\sigma_k} \left( \sum_{i=1}^{k} |T_{g_i}| - \mu_k \right) \qquad (5) \]
where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the score for random connected sets of $k$ genes.
Here "random" need not mean "uniformly random"; ideally the sampling should be similar in spirit to the one used in the search algorithm. Also, $\mu_k$ need not equal $k\mu$ (some nodes may be sampled more than others) and $\sigma_k$ need not be proportional to $\sqrt{k}$; one would expect it to scale like $k^\alpha$ for some exponent $1/2 < \alpha < 1$, where 1/2 would correspond to independence and 1 to full dependence.
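The two normalizations (4) and (5) can be sketched as follows. The way random connected sets are drawn here (start at a uniformly random node and repeatedly attach a uniformly random neighbor of the current set) is just one possible choice and an assumption of this sketch, as are the function names, the number of draws, the ring-shaped toy graph and the t values.

```r
# Equation (4): normalize by set size only, using the overall mean of |T|.
f3 <- function(t, set) {
  k  <- length(set)
  mu <- mean(abs(t))
  (sum(abs(t[set])) - k * mu) / sqrt(k)
}

# Draw a random connected set of k nodes by growing from a random seed.
# adj is a symmetric 0/1 adjacency matrix.
random_connected_set <- function(adj, k) {
  set <- sample.int(nrow(adj), 1)
  while (length(set) < k) {
    nb <- setdiff(which(colSums(adj[set, , drop = FALSE]) > 0), set)
    if (length(nb) == 0) break            # component smaller than k
    set <- c(set, nb[sample.int(length(nb), 1)])
  }
  set
}

# Equation (5): estimate mu_k and sigma_k from random connected sets of size k.
f4 <- function(t, set, adj, n_draws = 500) {
  k <- length(set)
  draws <- replicate(n_draws, sum(abs(t[random_connected_set(adj, k)])))
  (sum(abs(t[set])) - mean(draws)) / sd(draws)
}

# Toy graph: a ring of 8 nodes, with made-up t statistics.
n <- 8
adj <- matrix(0, n, n)
for (i in seq_len(n)) {
  j <- i %% n + 1
  adj[i, j] <- 1
  adj[j, i] <- 1
}
t <- c(3, 2.5, 2.8, 0.1, -0.2, 0.3, -0.1, 0.2)

set.seed(5)
s3 <- f3(t, 1:3)
s4 <- f4(t, 1:3, adj)
print(c(f3 = s3, f4 = s4))
```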
Group Selection And Search Algorithms
Given a scoring function, we would like to find groups of interacting genes with high scores.
Two possible approaches:
- go through a limited list of pre-defined groups and select the ones with high scores,
- use a search algorithm to find high-scoring sets among all possible sets subject to some structural constraints (e.g. being connected).
Using a subgraph search algorithm
Since the problem of finding the maximal-scoring subgraph of a generic graph is NP-hard (Ideker), various approximate algorithms have been proposed.
Ideker uses simulated annealing; however, this is slow and tends to produce large subgraphs that are difficult to interpret.
Since we are primarily interested in small networks, we use a different approach, where we start with a seed vertex and gradually expand around it. After $k$ steps we will have constructed a connected subgraph $G_k$ with $k$ nodes. Let $N_k$ be the set of all nodes that are outside $G_k$ but have at least one neighbor in $G_k$. We update $G_k$ by choosing a vertex in $N_k$ and attaching it to $G_k$. The choice can be made in various ways:
- random search: pick a uniform random vertex
- greedy search: pick a vertex such that the new graph has maximal score. Variants of this are used in Sohler, Breitling. It is fast, but reduces the number of subgraphs searched and may get stuck in local maxima
- random greedy search: pick a random vertex, but assign higher probabilities to vertices that yield high scores. This combines features of the former two algorithms and avoids some of the problems of greedy search. It is similar in spirit to Metropolis/MCMC.
We implemented and compared all of the above algorithms.
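The three expansion rules can be sketched in a few lines of base R. This is an illustration, not the GXNA implementation: the function names, the use of the mean of |t| as the score, and the exponential weighting with a `temp` parameter in the random greedy rule are all assumptions of this sketch.

```r
# Score of a candidate set: mean absolute t statistic (equation (3)).
set_score <- function(t, set) mean(abs(t[set]))

# Grow a connected subgraph of k nodes from a seed vertex.
# adj is a symmetric 0/1 adjacency matrix, t a vector of per-gene scores.
grow_subgraph <- function(adj, t, seed, k,
                          method = c("greedy", "random", "random_greedy"),
                          temp = 1) {
  method <- match.arg(method)
  set <- seed
  while (length(set) < k) {
    # N_k: nodes outside the current set with at least one neighbor in it.
    frontier <- setdiff(which(colSums(adj[set, , drop = FALSE]) > 0), set)
    if (length(frontier) == 0) break
    gains <- vapply(frontier, function(v) set_score(t, c(set, v)), numeric(1))
    prob <- switch(method,
      greedy        = as.numeric(gains == max(gains)),   # best vertex only
      random        = rep(1, length(frontier)),          # uniform choice
      random_greedy = exp((gains - max(gains)) / temp)   # favor high scores
    )
    set <- c(set, frontier[sample.int(length(frontier), 1, prob = prob)])
  }
  set
}

# Toy graph: a ring of 8 nodes, with made-up t statistics.
n <- 8
adj <- matrix(0, n, n)
for (i in seq_len(n)) {
  j <- i %% n + 1
  adj[i, j] <- 1
  adj[j, i] <- 1
}
t <- c(3, 2.5, 2.8, 0.1, -0.2, 0.3, -0.1, 0.2)

set.seed(6)
g_greedy <- grow_subgraph(adj, t, seed = 2, k = 3, method = "greedy")
g_random <- grow_subgraph(adj, t, seed = 2, k = 3, method = "random")
g_rg     <- grow_subgraph(adj, t, seed = 2, k = 3, method = "random_greedy")
print(g_greedy)
```

Starting from node 2, the greedy rule first attaches node 1 and then node 3, since each of these gives the larger score among the available neighbors.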
Computing Significance Levels
We would like to assign p-values to the graphs identified by the search algorithms. Clearly we have a multiple testing problem: searching a large network will yield some high scoring subgraphs by mere chance, even if they have no biological significance.
Permutations
The nonparametric techniques developed for standard microarray analysis also apply in principle to our setting. In the two-phenotype case ($n_0$ controls and $n_1$ cases), the indices are permuted, thus relabeling some controls as cases and vice versa. The analysis (scoring and graph searching) is repeated for each permutation. If enough permutations are available, this gives a reasonable estimate for the null distribution of the subgraph scores, and allows us to compute p-values.
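The permutation scheme can be sketched as below. To keep the example short, the "search" is replaced by a stand-in: genes are assumed to lie on a path and the best score is the largest mean |t| over windows of 3 consecutive genes. That stand-in, the function names, the add-one form of the p-value and the simulated data are all assumptions of this sketch.

```r
# Per-gene t statistic (equation (1)), genes in rows, samples in columns.
welch_t <- function(expr, is_case) {
  x1 <- expr[, is_case, drop = FALSE]
  x0 <- expr[, !is_case, drop = FALSE]
  (rowMeans(x1) - rowMeans(x0)) /
    sqrt(apply(x1, 1, var) / ncol(x1) + apply(x0, 1, var) / ncol(x0))
}

# Stand-in for "scoring and graph searching": best mean |t| over all windows
# of k consecutive genes (connected sets on a path graph).
best_window_score <- function(expr, is_case, k = 3) {
  tt <- abs(welch_t(expr, is_case))
  starts <- seq_len(length(tt) - k + 1)
  max(vapply(starts, function(s) mean(tt[s:(s + k - 1)]), numeric(1)))
}

# Repeat the whole analysis on permuted labels and compare the best scores.
# Taking the best score in each permutation accounts for the search itself.
perm_pvalue <- function(expr, is_case, stat, n_perm = 199) {
  obs  <- stat(expr, is_case)
  null <- replicate(n_perm, stat(expr, sample(is_case)))
  (1 + sum(null >= obs)) / (1 + n_perm)
}

# Simulated toy data: 10 genes, 6 controls and 6 cases; genes 4-6 are shifted.
set.seed(3)
expr <- matrix(rnorm(10 * 12), nrow = 10)
is_case <- rep(c(FALSE, TRUE), each = 6)
expr[4:6, is_case] <- expr[4:6, is_case] + 3

p <- perm_pvalue(expr, is_case, best_window_score)
print(p)
```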
Choice of Permutations
Depending on experiment design, uniform random permutations may not capture the null hypothesis. The melanoma data is one example: to gain power, we pooled data for several kinds of cells.
Thus, in addition to the main phenotype (healthy or melanoma), there is a "ghost" phenotype (cell type).
Hence it is desirable to use permutations that preserve cell type; we call them "invariant" permutations. We implemented this option, and compared invariant and uniform random permutations.
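One way to obtain permutations that preserve cell type is to shuffle the phenotype labels separately within each cell type. The sketch below shows this idea; the function name `permute_within`, the cell type labels "A" and "B" and the sample layout are hypothetical.

```r
# Permute phenotype labels only among samples of the same stratum (cell type),
# so that each cell type keeps its own number of cases and controls.
permute_within <- function(labels, strata) {
  out <- labels
  for (s in unique(strata)) {
    idx <- which(strata == s)
    out[idx] <- labels[idx][sample.int(length(idx))]
  }
  out
}

# Hypothetical design: 8 samples, two cell types.
phenotype <- c("healthy", "melanoma", "healthy", "melanoma",
               "melanoma", "healthy", "healthy", "melanoma")
cell_type <- c("A", "A", "A", "A", "B", "B", "B", "B")

set.seed(4)
perm <- permute_within(phenotype, cell_type)

# The phenotype-by-cell-type counts are unchanged by the permutation.
print(table(perm, cell_type))
```

A uniform random permutation, `sample(phenotype)`, would not in general keep these counts fixed.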
Interesting Networks
A subgraph related to a chemokine pathway. Each node contains the gene name and its t statistic.
We found several high scoring subgraphs; the most interesting ones involve genes that were not identified by standard analysis.
Four of the genes in this subgraph are the chemokine ligands CXCL9, CXCL10, CXCL11 and the chemokine receptor CXCR3. The ligands bind to the receptor, hence the nodes are connected. The ligands have reasonably high t-statistics (|t| > 2.5) but they are not large enough to be significant in a single-gene analysis after adjusting for multiple testing. As a group, however, the values of t are much more significant; they strongly suggest that a chemokine pathway is downregulated in melanoma patients.