Networks (b: Susan).

Graphs in R for Metagenomics

Susan Holmes

Bio-X and Statistics, Stanford University

SISMID14-Lect 13


Example 1: Co-occurrence Networks

In work on mice with Relman, Theriot and Hoy (to appear):

- Core community of species present at a minimum threshold. (threshold 1)
- Which of these species co-occur in more than a certain percentage of samples? (threshold 2)

[Figure: Co-occurrence graph with isolates, jt=0.6]

[Figure: Co-occurrence graph without isolates, jt=0.6]

How is it done?

################################################################################
# Function that inputs an abundance table as a matrix
# and makes a graph adjacency matrix ready to plot
# with the igraph package.
# Author: Susan Holmes
# Adapted for PhyloSeq by Paul J. McMurdie
################################################################################
makenetwork = function(abund, plotgraph=TRUE, community=TRUE, threshold=0,
                       incommon=0.4, method="jaccard"){
  # abund is the abundance table with no rows that are all zero
  # plotgraph is a toggle for whether a plotted network is requested
  # if community=TRUE the function will also output the community groups
  # threshold is the number that is fixed for when presence occurs
  # for instance, threshold=1 means that species with 2 or more reads
  # are considered present
  require(vegan); require(igraph)

  ### Keep the original row numbers as labels
  ### this is for the later analysis of groups
  ### (is.null avoids an error when the matrix has no dimnames at all)
  if (is.null(rownames(abund))) dimnames(abund) = list(1:nrow(abund), 1:ncol(abund))
  ### Only take the rows where there is at least one value over threshold
  abundance = abund[rowSums(abund > threshold) > 0, ]
  n = nrow(abundance)

  # Convert to 1,0 binary matrix for input to vegdist. -0 converts to numeric
  presenceAbsence = (abundance > threshold) - 0
  ## Compute the Jaccard distance between the rows; this will only make points
  ## closer if they are actually present together
  ## You could use any of the other distances in vegan or elsewhere
  jaccpa = vegdist(presenceAbsence, method)
  ### Distances in R are vectors by default, we make them into matrices
  jaacm = as.matrix(jaccpa)
  coinc = matrix(0, n, n)
  ind1 = which((jaacm > 0 & jaacm < (1 - incommon)), arr.ind=TRUE)
  coinc[ind1] = 1
  dimnames(coinc) = list(dimnames(abundance)[[1]], dimnames(abundance)[[1]])

  ### If using the network package create the graph with
  ### g <- as.network.matrix(coinc, matrix.type="adjacency")
  #### Here I use the igraph adjacency command
  ig = graph.adjacency(coinc)
  ### What class is this?
  class(ig)
  ### Take out the isolates
  isolates = V(ig)[ degree(ig)==0 ]
  ignoisol = delete.vertices(ig, V(ig)[ degree(ig)==0 ])
  if (plotgraph==TRUE){
    plot(ignoisol, layout=layout.fruchterman.reingold, vertex.size=0.6,
         vertex.label.dist=0.1, edge.arrow.mode="-", vertex.color="red",
         vertex.label=NA, edge.color="blue")
    title("Co-occurrence graph without isolates")
  }

Finding which taxa in which group

  if (community==TRUE){
    communitywalk = walktrap.community(ignoisol)
    nonisolates = V(ig)[ degree(ig)!=0 ]
    ### Membership labels start at 1 in current versions of igraph
    group0 = nonisolates[which(communitywalk$membership==1)]
    group1 = nonisolates[which(communitywalk$membership==2)]
    ### You can then play around with coloring and labelling in the graph
    ### For help don't forget to look up plot.igraph or plot.network
    ### not just plot as it inherits the plot method appropriate to its class
    groups = list(group0, group1)
    return(groups)
  }
}

Functions contained in package Phyloseq, joint with PJMcMurdie (to appear).
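The thresholding step can also be followed by hand on a toy table. The sketch below uses base R only, so that the Jaccard computation is visible; the abundance values and taxon names are made up, and the two thresholds simply reuse the default values of `threshold` and `incommon`.

```r
# Toy abundance table: 4 taxa (rows) by 5 samples (columns); values are made up.
abund <- matrix(c(5, 3, 0, 4, 2,
                  4, 2, 0, 6, 0,
                  0, 0, 7, 0, 0,
                  0, 1, 0, 0, 9),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("taxon", 1:4), paste0("s", 1:5)))

threshold <- 0    # threshold 1: counts above this value mean "present"
incommon  <- 0.4  # threshold 2: minimum Jaccard similarity for an edge

pa <- (abund > threshold) * 1                  # presence/absence matrix
both   <- pa %*% t(pa)                         # samples where both taxa are present
either <- ncol(pa) - (1 - pa) %*% t(1 - pa)    # samples where at least one is present
jaccard_dist <- 1 - both / either

# Same rule as above: an edge when 0 < distance < 1 - incommon
coinc <- (jaccard_dist > 0 & jaccard_dist < 1 - incommon) * 1
coinc
```

Here taxon1 is linked to taxon2 and to taxon4, while taxon3, which never shares a sample with the others, is an isolate.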

Significant Graphs by Simulated Annealing: GXNA

Bacteria interact, and there is a lot of interest in methods that analyze groups of species.

Prior knowledge: interaction graph

Similar in spirit to Ideker. You need an interaction graph.

Rather than using a list of known pathways, we represent prior knowledge as an interaction graph. The nodes of the graph correspond to genes; there is an edge between two nodes if their genes interact. The interaction information is obtained from public databases such as EntrezGene.

A list of known pathways contains some of the same information, but an interaction graph is more precise, as it describes which genes are closely connected within a given pathway. Hence this model is more likely to detect local disturbances within known pathways, as well as within pathways that have not yet been described.

A group of related genes corresponds to a connected subgraph of the interaction graph. We are interested in the most significant connected subgraphs; several scoring functions can be used, as described in the methods section. Once the interaction graph and scoring functions are selected, a search algorithm is used to find subgraphs with high scores.

Example: Gene Interaction Data

Interaction data was downloaded from the EntrezGene database in December 2005. This was represented as an undirected graph where each node is a gene and two nodes are connected by an edge if their genes interact. Loops (nodes connected to themselves) were eliminated. This results in a graph with 6227 nodes and 17146 edges. The highest degree is 156 (gene TP53). The most common degree is 1 (genes interacting with only one other gene); there are 1845 such nodes.
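The same kind of summary (node and edge counts, highest degree, most common degree) can be computed from any edge list. The following is a minimal base R sketch; the edge list is invented for illustration and is not the EntrezGene data, and gene names are used only as labels.

```r
# Hypothetical edge list, including one loop and one duplicated interaction.
edges <- data.frame(
  from = c("TP53", "TP53",  "TP53", "MDM2", "MDM2", "ATM"),
  to   = c("MDM2", "BRCA1", "ATM",  "MDM2", "TP53", "CHEK2"),
  stringsAsFactors = FALSE
)

# Remove loops (a gene listed as interacting with itself)
edges <- edges[edges$from != edges$to, ]

# Treat the graph as undirected: order each pair, then drop duplicates
a <- pmin(edges$from, edges$to)
b <- pmax(edges$from, edges$to)
edges <- unique(data.frame(from = a, to = b, stringsAsFactors = FALSE))

# Degree of a node = number of edges in which it appears
deg <- table(c(edges$from, edges$to))

n_nodes <- length(deg)
n_edges <- nrow(edges)
max_deg_gene <- names(deg)[which.max(deg)]

# How many nodes have each degree, and which degree is most common
deg_distribution <- table(as.vector(deg))
most_common_degree <- as.integer(names(deg_distribution)[which.max(deg_distribution)])
```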

[Figure: A subset of the gene interaction network.]

Scoring Functions

Given a gene or a set of genes, we need to compute a score that measures to what extent it is differentially expressed. We discuss several possible choices and their pros and cons.

Scores for a single gene

Consider first a single gene. The most popular scoring function is the t statistic:

$$T_i = \frac{\mu_{i1} - \mu_{i0}}{\sqrt{\sigma_{i1}^2/n_1 + \sigma_{i0}^2/n_0}} \tag{1}$$

where the mean and standard deviation $\mu_{i1}$, $\sigma_{i1}$ are for gene $i$ and the case phenotype, and $\mu_{i0}$, $\sigma_{i0}$ are for gene $i$ and the control phenotype; $n_1$ and $n_0$ are the numbers of cases and controls. Several alternatives exist. For simplicity, we focus on the t statistic, though most of our methods remain valid if we replace it with any reasonable competitor.
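As an illustration of equation (1), here is a small base R sketch that computes the statistic for every row of an expression matrix. The function name `gene_t`, the matrix layout (genes in rows, samples in columns) and the simulated data are assumptions made for this example.

```r
# Unequal-variance t statistic of equation (1) for every gene (row).
# `expr` is a genes x samples matrix; `is_case` is a logical vector over samples.
gene_t <- function(expr, is_case) {
  x1 <- expr[, is_case, drop = FALSE]    # case samples
  x0 <- expr[, !is_case, drop = FALSE]   # control samples
  n1 <- ncol(x1); n0 <- ncol(x0)
  m1 <- rowMeans(x1); m0 <- rowMeans(x0)
  v1 <- rowSums((x1 - m1)^2) / (n1 - 1)  # per-gene sample variances
  v0 <- rowSums((x0 - m0)^2) / (n0 - 1)
  (m1 - m0) / sqrt(v1 / n1 + v0 / n0)
}

set.seed(1)
expr <- matrix(rnorm(6 * 10), nrow = 6,
               dimnames = list(paste0("g", 1:6), NULL))  # simulated data
is_case <- rep(c(FALSE, TRUE), each = 5)
tstat <- gene_t(expr, is_case)
```

For a single gene this agrees with the statistic reported by `t.test(cases, controls)`, whose default is the unequal-variance (Welch) form.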

Scores based on averaging test statistics

Now consider a set $S = \{g_1, \dots, g_k\}$ of $k$ genes. A natural way to assign the set a score is to average the scores of its individual genes, leading to the scoring function used in Tian et al. 2003:

$$f_1(S) = \frac{1}{k} \sum_{i=1}^{k} T_{g_i} \tag{2}$$

Often pathways contain both upregulated and downregulated genes; as pointed out in Ideker, this can be captured by taking absolute values of the test statistic, possibly at the cost of creating more false positives:

$$f_2(S) = \frac{1}{k} \sum_{i=1}^{k} |T_{g_i}| \tag{3}$$

Either way, the distribution of the score depends on the set $S$ (e.g. on its size), so it should be normalized before it is used to compare different sets.

Tian et al. propose a nonparametric normalization method: permutations of the phenotypes are used to estimate the null distribution of the score, and the score is adjusted by essentially replacing it with its quantile. The advantage of this method is that it is nonparametric. Its main drawback is that estimates of the null distribution are reliable only if enough permutations are used, which may not be possible if the number of phenotypes is small.
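A rough sketch of the two averaged scores and of the permutation-based normalization follows. It is only meant to make the idea concrete and is not the procedure of Tian et al. in detail; the helper names, the simulated data and the choice of 200 permutations are assumptions.

```r
# Per-gene t statistics as in equation (1); rows = genes, columns = samples
gene_t <- function(expr, is_case) {
  apply(expr, 1, function(x) {
    x1 <- x[is_case]; x0 <- x[!is_case]
    (mean(x1) - mean(x0)) / sqrt(var(x1) / length(x1) + var(x0) / length(x0))
  })
}

f1 <- function(tstat, S) mean(tstat[S])        # equation (2)
f2 <- function(tstat, S) mean(abs(tstat[S]))   # equation (3)

# Shuffle the phenotype labels, recompute the score of the same set S each
# time, and report where the observed score falls in that null distribution.
perm_quantile <- function(expr, is_case, S, score = f2, n_perm = 200) {
  observed <- score(gene_t(expr, is_case), S)
  null <- replicate(n_perm, score(gene_t(expr, sample(is_case)), S))
  list(observed = observed, quantile = mean(null <= observed))
}

set.seed(2)
expr <- matrix(rnorm(20 * 12), nrow = 20,
               dimnames = list(paste0("g", 1:20), NULL))  # simulated data
is_case <- rep(c(FALSE, TRUE), each = 6)
expr[1:3, is_case] <- expr[1:3, is_case] + 3   # shift genes g1..g3 in the cases
res <- perm_quantile(expr, is_case, S = c("g1", "g2", "g3"))
```

With 6 cases and 6 controls there are only `choose(12, 6)` = 924 distinct relabelings, which illustrates the drawback noted above: with few samples the null distribution can only be estimated coarsely.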

The alternative is to make some kind of parametric assumption. For example, Ideker normalizes all sets of size $k$ by comparing with a single reference distribution, computed by sampling from random sets of $k$ genes. Ignoring the small effect due to sampling without replacement, this amounts to using

$$f_3(S) = \frac{1}{\sqrt{k}} \left( \sum_{i=1}^{k} |T_{g_i}| - k\mu \right) \tag{4}$$

where $\mu$ is the mean of $|T|$ over all genes. The implicit assumptions here are that (1) the normalization need only depend on the size of the set, and (2) individual gene scores are independent. The latter assumption in particular is not realistic; it would be better to normalize by sampling among connected sets of $k$ genes, leading to

$$f_4(S) = \frac{1}{\sigma_k} \left( \sum_{i=1}^{k} |T_{g_i}| - \mu_k \right) \tag{5}$$

where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the score for random connected sets of $k$ genes.

Here "random" need not mean "uniformly random"; ideally the sampling should be similar in spirit to the one used in the search algorithm. Also, $\mu_k$ need not equal $k\mu$ (some nodes may be sampled more than others) and $\sigma_k$ need not be proportional to $\sqrt{k}$; one would expect it to scale like $k^\alpha$ for some exponent $1/2 < \alpha < 1$, where $1/2$ would correspond to independence and $1$ to full dependence.
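The difference between equations (4) and (5) can be sketched in a few lines of base R. The toy graph, the made-up t statistics and the way random connected sets are grown (start from a random seed, repeatedly add a random neighbor) are all assumptions for illustration; as noted above, this sampling is not uniform over connected sets.

```r
# Adjacency list for a small hypothetical interaction graph
adj <- list(a = "b", b = c("a", "c", "e"), c = c("b", "d"),
            d = "c", e = c("b", "f"), f = "e")
set.seed(3)
tstat <- setNames(rnorm(length(adj)), names(adj))   # made-up t statistics

# Equation (4): center by k * mu and scale by sqrt(k)
f3 <- function(tstat, S) {
  k <- length(S)
  (sum(abs(tstat[S])) - k * mean(abs(tstat))) / sqrt(k)
}

# Grow a random connected set of k nodes from a random seed vertex
random_connected_set <- function(adj, k) {
  S <- sample(names(adj), 1)
  while (length(S) < k) {
    frontier <- setdiff(unique(unlist(adj[S])), S)
    if (length(frontier) == 0) break    # connected component exhausted
    S <- c(S, frontier[sample.int(length(frontier), 1)])
  }
  S
}

# Equation (5): mu_k and sigma_k estimated from random connected sets of size k
f4 <- function(tstat, S, adj, n_sets = 500) {
  k <- length(S)
  sums <- replicate(n_sets, sum(abs(tstat[random_connected_set(adj, k)])))
  (sum(abs(tstat[S])) - mean(sums)) / sd(sums)
}

f3(tstat, c("b", "c", "d"))
f4(tstat, c("b", "c", "d"), adj)
```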

Group Selection And Search Algorithms

Given a scoring function, we would like to find groups of interacting genes with high scores.

Two possible approaches:

- go through a limited list of pre-defined groups and select the ones with high scores,
- use a search algorithm to find high-scoring sets among all possible sets subject to some structural constraints (e.g. being connected).

Using a subgraph search algorithm

Since the problem of finding the maximal-scoring subgraph of a generic graph is NP-hard (Ideker), various approximate algorithms have been proposed.

Ideker[?] uses simulated annealing; however, this is slow and tends to produce large subgraphs that are difficult to interpret.

Since we are primarily interested in small networks, we use a different approach, where we start with a seed vertex and gradually expand around it. After k steps we will have constructed a connected subgraph Gk with k nodes. Let Nk be the set of all nodes that are outside Gk but have at least one neighbor in Gk. We update Gk by choosing a vertex in Nk and attaching it to Gk. The choice can be made in various ways:

- random search: pick a uniform random vertex
- greedy search: pick a vertex such that the new graph has maximal score. Variants of this are used in Sohler, Breitling. It is fast, but reduces the number of subgraphs searched and may get stuck in local maxima
- random greedy search: pick a random vertex, but assign higher probabilities to vertices that yield high scores. This combines features of the former two algorithms and avoids some of the problems of greedy search. It is similar in spirit to Metropolis/MCMC.

We implemented and compared all of the above algorithms.
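The three expansion rules can be sketched as one base R function. This is not the GXNA implementation: the function name, the toy graph, the use of a plain sum of node scores as the subgraph score, and the exponential weighting in the random greedy rule are assumptions chosen to keep the example short.

```r
# Expand a connected subgraph from a seed, one vertex at a time.
expand_from_seed <- function(adj, node_score, seed, k,
                             method = c("random", "greedy", "random_greedy")) {
  method <- match.arg(method)
  G <- seed
  while (length(G) < k) {
    N <- setdiff(unique(unlist(adj[G])), G)   # outside nodes with a neighbor in G
    if (length(N) == 0) break
    # The score is assumed to be the sum of node scores, so candidates
    # differ only through the score of the vertex being added.
    gain <- node_score[N]
    pick <- switch(method,
      random        = sample.int(length(N), 1),
      greedy        = which.max(gain),
      # assumed weighting: probability proportional to exp(gain)
      random_greedy = sample.int(length(N), 1, prob = exp(gain - max(gain))))
    G <- c(G, N[pick])
  }
  G
}

# Hypothetical graph and node scores (for example |t| values)
adj <- list(a = "b", b = c("a", "c", "e"), c = c("b", "d"),
            d = "c", e = c("b", "f"), f = "e")
node_score <- c(a = 0.1, b = 2, c = 0.5, d = 3, e = 1, f = 0.2)

set.seed(4)
g_greedy <- expand_from_seed(adj, node_score, seed = "a", k = 3, method = "greedy")
g_random <- expand_from_seed(adj, node_score, seed = "a", k = 3, method = "random")
g_rg     <- expand_from_seed(adj, node_score, seed = "a", k = 3, method = "random_greedy")
```

In this toy graph the greedy rule started at `a` adds `b` and then `e`, because `e` scores higher than `c` at that step; it never sees that `c` leads to the high-scoring node `d`. This is the local-maximum behavior mentioned in the list above.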

Computing Significance Levels

We would like to assign p-values to the graphs identified by the search algorithms. Clearly we have a multiple testing problem: searching a large network will yield some high scoring subgraphs by mere chance, even if they have no biological significance.

Permutations

The nonparametric techniques developed for standard microarray analysis also apply in principle to our setting. In the two-phenotype case (n0 controls and n1 cases), the indices are permuted, thus relabeling some controls as cases and vice versa. The analysis (scoring and graph searching) is repeated for each permutation. If enough permutations are available, this gives a reasonable estimate for the null distribution of the subgraph scores, and allows us to compute p-values.
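A compact sketch of this procedure: score the genes, search the graph, keep the best subgraph score, and repeat the whole analysis on permuted labels. Comparing the observed best score with the best score of each permutation is one common way to account for the search; it is an assumption here, as are the greedy search, the mean of |t| as the score, the toy graph, the simulated data and the "+1" correction in the p-value.

```r
# |t| per gene, as in equation (1); rows = genes, columns = samples
gene_abs_t <- function(expr, is_case) {
  apply(expr, 1, function(x) {
    x1 <- x[is_case]; x0 <- x[!is_case]
    abs(mean(x1) - mean(x0)) / sqrt(var(x1) / length(x1) + var(x0) / length(x0))
  })
}

# Best mean score over greedy expansions of size k started from every seed
best_subgraph_score <- function(adj, score, k) {
  best <- -Inf
  for (seed in names(adj)) {
    G <- seed
    while (length(G) < k) {
      N <- setdiff(unique(unlist(adj[G])), G)
      if (length(N) == 0) break
      G <- c(G, N[which.max(score[N])])
    }
    best <- max(best, mean(score[G]))
  }
  best
}

permutation_pvalue <- function(expr, is_case, adj, k = 3, n_perm = 200) {
  observed <- best_subgraph_score(adj, gene_abs_t(expr, is_case), k)
  # Repeat scoring AND searching for each relabeling of the samples
  null <- replicate(n_perm,
    best_subgraph_score(adj, gene_abs_t(expr, sample(is_case)), k))
  # Adding 1 to numerator and denominator keeps the estimate away from 0
  (1 + sum(null >= observed)) / (1 + n_perm)
}

# Hypothetical graph and simulated expression data
adj <- list(a = "b", b = c("a", "c", "e"), c = c("b", "d"),
            d = "c", e = c("b", "f"), f = "e")
set.seed(4)
is_case <- rep(c(FALSE, TRUE), each = 6)
expr <- matrix(rnorm(6 * 12), nrow = 6, dimnames = list(names(adj), NULL))
expr[c("b", "c", "d"), is_case] <- expr[c("b", "c", "d"), is_case] + 3
p <- permutation_pvalue(expr, is_case, adj, k = 3, n_perm = 200)
```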

Choice of Permutations

Depending on experiment design, uniform random permutations may not capture the null hypothesis. The melanoma data is one example: to gain power, we pooled data for several kinds of cells.

Thus, in addition to the main phenotype (healthy or melanoma), there is a "ghost" phenotype (cell type).

Hence it is desirable to use permutations that preserve cell type; we call them "invariant" permutations. We implemented this option, and compared invariant and uniform random permutations.
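An invariant permutation can be sketched as a shuffle of the phenotype labels within each cell type. The function name and the small label vectors below are hypothetical and only illustrate the idea.

```r
# Permute phenotype labels only within each level of a stratifying variable,
# so that every stratum keeps its own numbers of cases and controls.
permute_within <- function(labels, strata) {
  out <- labels
  for (s in unique(strata)) {
    idx <- which(strata == s)
    out[idx] <- labels[idx[sample.int(length(idx))]]
  }
  out
}

phenotype <- c("healthy", "melanoma", "melanoma", "healthy", "melanoma", "healthy")
cell_type <- c("A", "A", "A", "B", "B", "B")   # hypothetical cell-type labels

set.seed(5)
permuted <- permute_within(phenotype, cell_type)

# The phenotype-by-cell-type counts are unchanged by construction
table(permuted, cell_type)
```

A uniform random permutation, `sample(phenotype)`, does not guarantee this: it can move cases from one cell type to another, so differences between cell types leak into the null distribution.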

Interesting Networks

[Figure: A subgraph related to a chemokine pathway. Each node contains the gene name and its t statistic.]

We found several high scoring subgraphs; the most interesting ones involve genes that were not identified by standard analysis.

Four of the genes in the chemokine subgraph are the chemokine ligands CXCL9, CXCL10, CXCL11 and the chemokine receptor CXCR3. The ligands bind to the receptor, hence the nodes are connected. The ligands have reasonably high t-statistics (|t| > 2.5) but they are not large enough to be significant in a single-gene analysis after adjusting for multiple testing. As a group, however, the values of t are much more significant; they strongly suggest that a chemokine pathway is downregulated in melanoma patients.
