Gene Set Testing on the Gene Ontology Graph

Manuela Hummel

Ulrich Mansmann

BioMed-S Seminar, June 19, 2007

Overview

Analysis of groups of genes

• Motivation

• Gene set enrichment versus holistic approaches

GlobalAncova

• Linear models, F-test, p-values

• Analysis examples

Detecting interesting Gene Ontology terms

• Structure of the GO

• Methods based on gene set enrichment

• Methods based on global tests

Motivation

Gene-wise analysis

• E.g. analysis of differential expression

• Genes are treated independently

• Correction for multiple testing is crucial

• Resulting lists of interesting genes are rather unstable

• Biological interpretation of such gene lists is hard

Analysis of gene sets

• Predefined gene groups provide more biological knowledge

(Gene Ontology terms, pathways, genomic regions, ...)

• More meaningful interpretation in biological context

• Number of gene sets to be investigated is smaller than

number of individual genes

Strategies for Group Testing

Gene set enrichment

• Idea: Provide biological meaning to a list of interesting genes

by means of an over-representation analysis

• Step 1: Gene-wise analysis of differential expression

Step 2: Score gene groups for enrichment

• Goal: Find gene groups that contain many interesting genes

Holistic approaches

• Idea: Look directly at gene sets and ask whether they are

biologically relevant with respect to differential expression

• Global analysis of differential expression for gene groups

• Goal: Find gene groups that contain at least one interesting gene or many genes with moderate differential expression

Gene Set Enrichment

• Step 1: Define a list of differentially expressed (DE) genes, e.g. by t-tests and correction for multiple testing

• Step 2: The connection of the list to a functional gene set can be summarized in a 2 × 2 contingency table

                ∈ gene group    ∉ gene group         total
  ∈ DE genes    x               K − x                K
  ∉ DE genes    M − x           (N − M) − (K − x)    N − K
  total         M               N − M                N

• Use e.g. Fisher’s exact test to test for association between gene list and gene set

• Similar tests based on gene counts are very often used in Gene Ontology analysis (binomial test, χ² test, z scores); Rivals et al. (2006)
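
The over-representation test on the 2 × 2 table can be sketched in a few lines of Python. This is a minimal illustration, assuming the one-sided (enrichment) version of Fisher's exact test; the function name and the toy counts are made up for the example and do not come from the talk. The symbols follow the table: x DE genes inside the gene group, K DE genes in total, M genes in the group, N genes overall.

```python
from math import comb

def enrichment_p_value(x: int, K: int, M: int, N: int) -> float:
    """One-sided Fisher's exact test: P(X >= x) under the hypergeometric null.

    x: DE genes inside the gene group
    K: DE genes in total
    M: genes in the gene group
    N: all genes
    """
    upper = min(K, M)  # the overlap cannot exceed either margin
    tail = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, upper + 1))
    return tail / comb(N, K)

# Hypothetical counts: 20 genes, 5 in the group, 6 called DE, 4 of those in the group.
p = enrichment_p_value(x=4, K=6, M=5, N=20)
print(round(p, 4))  # 0.0139
```

Only gene counts enter the calculation, which is why this family of tests needs the cutoff-based gene list from Step 1.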

Holistic Approaches

• Is the global expression pattern X of a group of genes significantly related to some clinical variable Y of interest?

• E.g. globaltest of Goeman et al. (2004)

Does knowledge of X help to predict Y ?

• Generalized linear model with $\beta_j \sim (0, \tau^2)$

$$E(y_i \mid \beta) = h^{-1}\Big(\alpha + \sum_{j=1}^{m} x_{ij}\beta_j\Big)$$

• $H_0: \beta_1 = \ldots = \beta_m = 0$ is equivalent to $H_0: \tau^2 = 0$

Globaltest

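• Test statistic (up to a constant factor)

$$Q \propto (Y - \mu)' R (Y - \mu)$$

$$\propto \sum_g [X_g (Y - \mu)]^2 \quad \text{(sum over genes)}$$

$$\propto \sum_i \sum_j R_{ij} (Y_i - \mu)(Y_j - \mu) \quad \text{(sum over subjects)}$$

$R = X'X$: matrix of correlations between the gene expression of subjects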

• Test to see whether subjects with similar expression also have

similar outcomes

• Permutation based and asymptotic p-values are available

• Also multicategorical, continuous or survival variables can be

considered and adjustment for covariates is possible
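
The two readings of the test statistic (sum over genes, sum over subjects) can be checked numerically. The sketch below is an illustration under stated assumptions, not the globaltest implementation: X is stored as genes × subjects, µ is taken as the plain mean of Y (an intercept-only null model), and the constant scaling factor of Q is omitted. Function names and data are hypothetical.

```python
def q_over_genes(X, y):
    """Sum over genes g of [X_g (Y - mu)]^2; X is a list of gene rows."""
    mu = sum(y) / len(y)
    resid = [yi - mu for yi in y]
    return sum(sum(x * r for x, r in zip(row, resid)) ** 2 for row in X)

def q_over_subjects(X, y):
    """Sum over subject pairs (i, j) of R_ij (Y_i - mu)(Y_j - mu), with R = X'X."""
    n = len(y)
    mu = sum(y) / n
    resid = [yi - mu for yi in y]
    # R[i][j] compares the expression profiles of subjects i and j across genes.
    R = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    return sum(R[i][j] * resid[i] * resid[j] for i in range(n) for j in range(n))

# Hypothetical data: 3 genes (rows) measured on 4 subjects (columns), binary outcome.
X = [[1.0, 2.0, 0.5, -1.0],
     [0.0, 1.5, -0.5, 2.0],
     [-1.0, 0.5, 1.0, 0.0]]
y = [0, 1, 0, 1]

print(q_over_genes(X, y), q_over_subjects(X, y))  # both 4.125
```

The subject-wise form makes the interpretation on the slide concrete: Q is large when subjects with similar expression profiles (large R_ij) also deviate from µ in the same direction.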

Competitive vs Self-Contained

Gene set enrichment – competitive

• Differential expression of a gene set $G$ is compared to a standard defined by all the remaining genes $G^c$ (the complement of $G$)

• $H_0^{comp}$: The genes in $G$ are at most as often differentially expressed as the genes in $G^c$

• Significance of $G$ is “penalized” for the significance of $G^c$

• If all genes are (similarly) DE no gene set will be significant

Holistic approaches – self-contained

• Differential expression is explored only within the gene set

• $H_0^{self}$: No genes in $G$ are differentially expressed

• $H_0^{self}$ is more restrictive → more power

• If all genes are (similarly) DE all gene sets will be significant

Goeman and Bühlmann (2007)

Competitive vs Self-Contained

Gene set enrichment – competitive

• Single genes are treated very differently from gene sets consisting of single genes

• Testing the set of all genes is not possible

• p-values of different gene sets tend to be negatively correlated

Holistic approaches – self-contained

• Generalization of single gene testing to gene set testing

• Testing the set of all genes can be a useful preliminary data

quality check

• May be too powerful in some cases

Gene vs Subject Sampling

Gene set enrichment – gene sampling

• Genes are the sampling units

• A new sample would correspond to a sample of new genes

for the same subjects

• Classical roles of variables and observations are reversed

Holistic approaches – subject sampling

• Subjects are the sampling units

• A new sample corresponds to measurements of the same variables (= genes) for a new subject

• Classical roles of variables and observations



Gene set enrichment – gene sampling

• p-values relate to replications of the urn experiment ≠ biological replications

• Small p-value: a similar association between ’membership of the gene set’ and ’being differentially expressed’ will be found with new genes for the same subjects

• Sample size = number of genes, which does not correspond to the biological sample size

Holistic approaches – subject sampling

• p-values relate to measuring the same variables on new subjects = biological replications

• Small p-value: a similar association between expression and phenotype will be found for the same genes in new subjects

• Sample size = number of subjects, which does correspond to the biological sample size



Gene set enrichment – gene sampling

• Unrealistic assumption of independence between genes

• Correlations affect the number of genes called differentially expressed

• If (gene-wise) p-values are positively correlated, the true null distribution is not hypergeometric but has heavier tails

• Gene set p-values may be understated

Holistic approaches – subject sampling

• Assumption of independence between subjects, but genes may be correlated

GlobalAncova

• How is gene expression X influenced by variable Y?

$H_0: P(X \mid Y, C) = P(X \mid C)$ (where C can be other covariates)

• The expectation for gene j is assumed to follow a linear model $E(x_j) = D\beta_j$, with fitted values $H x_j$ and hat matrix $H = D(D'D)^{-1}D'$

• D is a usual design matrix, e.g.

              Int   Y   C
  sample 1      1   0   1
  sample 2      1   0   2
  sample 3      1   1   1
  ...

• Residual sum of squares for gene j

$\varepsilon_j'\varepsilon_j$, with $\varepsilon_j = (I - H)x_j$

• Total residual sum of squares

$$SSR = \sum_{j=1}^{p} \varepsilon_j'\varepsilon_j$$

Hummel, Meister and Mansmann (Technical Report)

F-Test

• Do we need the variable Y to explain the data X?

→ Extra sum of squares principle

• The full model containing the variable of interest is compared

to a reduced model without it

$D_{full} = (1, Y, C)$, $D_{reduced} = (1, C)$

• Extra residual sum of squares

$SSR_{extra} = SSR_{reduced} - SSR_{full}$

• F-statistic

$F = MSR_{extra} / MSR_{full}$

• Permutation based and asymptotic p-values are available
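
The extra sum of squares principle can be illustrated for the simplest design. The sketch below assumes a two-group comparison without covariates (full model ∼ group, reduced model ∼ 1), so that the residual sums of squares reduce to within-group and overall sums of squared deviations; it is not the GlobalAncova package code. The degrees of freedom, p × 1 for the effect and p × (n − 2) for the error, are an assumption that follows the pattern of the ANOVA table in the co-expression example (31 and 2883). Function names and data are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def global_f(X, group):
    """F = MSR_extra / MSR_full for full model ~ group vs. reduced model ~ 1.

    X: list of gene rows (genes x samples); group: 0/1 label per sample.
    """
    p, n = len(X), len(group)
    ssr_reduced = sum(sum_sq(row) for row in X)      # residuals around gene means
    ssr_full = sum(                                  # residuals around group means
        sum_sq([v for v, lab in zip(row, group) if lab == g])
        for row in X for g in (0, 1)
    )
    ssr_extra = ssr_reduced - ssr_full
    df_extra = p * (2 - 1)   # one extra parameter per gene
    df_full = p * (n - 2)    # residual df per gene, summed over genes
    return (ssr_extra / df_extra) / (ssr_full / df_full)

# Hypothetical data: 2 genes, 4 samples, 2 per group.
X = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 2.0, 4.0, 4.0]]
group = [0, 0, 1, 1]
f_toy = global_f(X, group)
print(f_toy)  # 16.0

# The same ratio of mean squares, taken from the ANOVA table of the co-expression example.
f_table = (19.11152 / 31) / (119.06502 / 2883)
print(round(f_table, 3))  # 14.928
```

With covariates C in the model, the group means would be replaced by least-squares fits of the full and reduced design matrices.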

Permutation based p-values

• Permute column(s) Y in the design matrix of the full model

B times

• Compute the GlobalAncova statistic $F_b$ for each permutation, $b = 1, \ldots, B$

• An empirical p-value is the fraction of permutation $F_b$’s that are greater than the observed $F$

$$p = \frac{\#(F_b > F)}{B}$$
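
The permutation scheme can be sketched as follows. The example again assumes the simplest case, a two-group design without covariates, where permuting the column Y of the design matrix amounts to shuffling the group labels. The function names, the seed and the data are hypothetical; the strict inequality follows the formula on the slide.

```python
import random

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def f_stat(X, labels):
    """Global F for full model ~ group vs. reduced model ~ 1 (two groups, no covariates)."""
    p, n = len(X), len(labels)
    ssr_reduced = sum(sum_sq(row) for row in X)
    ssr_full = sum(
        sum_sq([v for v, lab in zip(row, labels) if lab == g])
        for row in X for g in (0, 1)
    )
    return ((ssr_reduced - ssr_full) / p) / (ssr_full / (p * (n - 2)))

def permutation_p_value(X, labels, B=1000, seed=0):
    """Fraction of permuted statistics F_b that exceed the observed F."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    f_obs = f_stat(X, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(B):
        rng.shuffle(shuffled)      # permute the column Y, keep the expression data fixed
        if f_stat(X, shuffled) > f_obs:
            exceed += 1
    return exceed / B

# Hypothetical data: 2 genes, 8 samples, clear shift between the groups.
X = [[1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
     [0.5, 0.7, 0.4, 0.6, 2.0, 2.3, 1.9, 2.2]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_perm = permutation_p_value(X, labels)
print(p_perm)
```

Because whole samples are permuted, the correlation structure between the genes is left intact in every permutation.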

Asymptotic p-values

• Distribution of the F-statistic numerator can be approximated by a series of χ² distributions (Kotz et al., 1967)

$$F(\alpha; x) = F(SSR_{reduced} - SSR_{full}) = \sum_{k=0}^{\infty} c_k \chi^2_{np+2k}(x/\beta) \approx \sum_{k=0}^{np} c_k \chi^2_{np+2k}(x/\beta)$$

where the weights $c_k$ and $\beta$ are calculated through $\alpha$

• $\alpha = \{\gamma_i \cdot \lambda_j;\ i = 1, \ldots, n;\ j = 1, \ldots, p\}$ with

$\gamma_i$, $i = 1, \ldots, n$: eigenvalues of $H_{full} - H_{reduced}$

$\lambda_j$, $j = 1, \ldots, p$: eigenvalues of the gene covariance matrix

• The gene covariance matrix is estimated via a shrinkage estimate (necessary since $p \gg n$)

Schäfer and Strimmer (2005)

Linear Models

General linear model framework allows analysis of

  Design                        Full model                Reduced model
  Various groups                ∼ group + cov             ∼ cov
  Dose response                 ∼ dose + cov              ∼ cov
  Group by dose interaction     ∼ group * dose + cov      ∼ group + dose + cov
  Time trends in groups         ∼ group * time + cov      ∼ group + time + cov
  Gene-gene interaction         ∼ gene + cov              ∼ cov
  Co-expression                 ∼ group + gene + cov      ∼ group + cov
  Differential co-expression    ∼ group * gene + cov      ∼ group + gene + cov
  Meta-analysis                 ∼ group * dataset         ∼ dataset
  ...                           ...                       ...

Example: Co-Expression

• van ’t Veer et al. (2002) present a gene signature of 70 genes to predict recurrence of breast cancer

• We consider 9 cancer-related pathways

• For demonstration we pick the cell cycle pathway and signature gene cyclin E2

• Questions:

– Is it possible to relate the signature genes to the pathways?

– How is the relation with respect to the clinical outcome: development of distant metastases within 5 years (yes/no)?


Is there co-expression between the signature gene and the pathway genes, stratified by prognosis group?

Full model: ∼ metastases + signature.gene

Reduced model: ∼ metastases

Yes:

  $effect
  [1] "signature.gene"

  $ANOVA
               SSQ   DF        MS
  Effect  19.11152   31 0.6165006
  Error  119.06502 2883 0.0412990

  $test.result
                   [,1]
  F.value  1.492774e+01
  p.approx 6.117746e-14

  $terms
  [1] "(Intercept)"    "metastases"     "signature.gene"


Is there differential co-expression between signature gene and

pathway regarding the clinical outcome?

Full model: ∼ metastases * signature.gene

Reduced model: ∼ metastases + signature.gene

No: p.approx 0.2579824

[Figure: two scatter plots of cyclin E2 (x-axis) against cell cycle control (y-axis), both axes from −0.5 to 0.5; one panel for the good prognosis group, one for the bad prognosis group]

Example: Meta-Analysis

• Combine several data sets, possibly derived from different microarray technologies, for gene expression analysis

Full model: ∼ phenotype * dataset

Reduced model: ∼ dataset

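[Figure: diagram (apparently a Venn diagram) of the three sets ’meta analysis’, ’data set 1’ and ’data set 2’; the counts 60, 1, 1, 5, 11, 9, 8 cannot be assigned to regions from the extracted text]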

• Problems

– Only genes present in all data sets can be included

– What about differential expression in different directions?

Example: Differential Time Course

• Study about neurodegeneration in a mouse scrapie model

Xiang et al. (2007)

              day 90   day 120   day 150
  control     n=3      n=3       n=3
  infected    n=3      n=3       n=3

• Are there differences in the time courses of expression between the two study groups?

Full model: ∼ group * time

Reduced model: ∼ group + time

• Which biological processes are involved?

Example: Differential Time Course

• Some GO categories show expression patterns with differential time course

[Figure, GO term ’quinone cofactor metabolic process’. One panel: normalised mean expression (y-axis, ticks 5 to 10) at the time points day 90, day 120 and day 150. Other panel: reduction in sum of squares (x-axis, 0.00 to 0.20) per gene for the probe sets 1416665_at, 1417264_at, 1417265_s_at, 1426902_at, 1428134_at, 1431893_a_at, 1436351_at, 1437364_at, 1452770_at, 1454090_at]

• However, none of the categories would ’survive’ adjustment

for multiple testing

Gene Ontology

• The Gene Ontology (GO) is a controlled vocabulary to describe gene and gene product attributes (http://www.geneontology.org/)

• Three Ontologies

Molecular Function (7527 terms)

Biological Process (13155 terms)

Cellular Component (1864 terms)

• Relations between GO terms are displayed in directed acyclic graphs

[Figure: example GO graph with the nodes Gene Ontology, molecular function, binding, nucleic acid binding, DNA binding, transcription regulator activity and transcription factor activity]

Gene Ontology

• Genes known to be associated with some

attributes are mapped to corresponding

GO terms

• Inheritance

Each gene associated with some term is

also mapped to all its ancestors

• Overlap exists also between unrelated

terms

• Not every gene belongs to a leaf node

{genes in the leaves} ≠ {genes in the root}
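
The inheritance rule can be made concrete with a small sketch that propagates direct annotations up a directed acyclic graph. The term names are those of the example graph in the talk, but the parent links used here are an assumed reading of that graph, and the gene identifiers and function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct, parents):
    """Map every gene annotated to a term to all ancestors of that term as well."""
    full = {term: set() for term in parents}
    for term, genes in direct.items():
        for node in {term} | ancestors(term, parents):
            full[node] |= genes
    return full

# Assumed parent links (child -> list of parents).
parents = {
    "molecular function": [],
    "binding": ["molecular function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "transcription regulator activity": ["molecular function"],
    "transcription factor activity": ["DNA binding", "transcription regulator activity"],
}
# Hypothetical direct annotations; geneC is annotated to an inner node only.
direct = {
    "transcription factor activity": {"geneA"},
    "DNA binding": {"geneB"},
    "binding": {"geneC"},
}
full = propagate(direct, parents)
print(sorted(full["molecular function"]))             # ['geneA', 'geneB', 'geneC']
print(sorted(full["transcription factor activity"]))  # ['geneA']
```

In this toy graph the only leaf holds geneA while the root holds all three genes, which is the inequality {genes in the leaves} ≠ {genes in the root} from the slide.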

GO Analysis

• Most current tools for GO analysis use tests based on gene counts, like Fisher’s exact test
Rivals et al. (2006)

• In principle all gene set testing methods can be used to detect interesting GO terms
Different approaches answer different questions!
Goeman and Bühlmann (2007)

• Testing thousands of GO terms requires some adjustment for multiple testing

• Most group testing and adjustment methods do not account for dependencies between gene sets

• Recent approaches incorporate the special structure of the Gene Ontology

GO Methods Based on GSE

Find truly enriched GO nodes within sets of closely related terms with similar levels of significance

• Parent nodes might only inherit significance from their more specific children
Decorrelating the GO, Alexa et al. (2006)

• Children nodes might only inherit significance from their more general parents
Parent-child approach, Grossmann et al. (2006)

Decorrelating the GO

How enriched is a GO node with interesting genes if we do not consider the genes from its significant children?

elim algorithm:

• Nodes are tested bottom-up from

most specific to most general terms

• Nodes are tested with Fisher’s exact

test

• If a node is significantly enriched

all genes mapped to it are removed

from all of its ancestors

weight algorithm:

• Do not remove genes but give weights that denote the gene relevance in the significant nodes

Alexa et al. (2006)
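
The three steps of the elim algorithm can be sketched on a toy graph. This is a simplified illustration of the idea as described on the slide, not the implementation of Alexa et al.: it assumes that annotations are already propagated to ancestors, that the gene universe and the number of DE genes stay fixed while node gene sets shrink, and that a fixed level alpha decides significance. All names and data are hypothetical.

```python
from math import comb

def fisher_p(x, K, M, N):
    """P(X >= x) for x DE genes among M set genes, with K DE genes out of N genes."""
    return sum(comb(M, k) * comb(N - M, K - k) for k in range(x, min(K, M) + 1)) / comb(N, K)

def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def elim(annot, parents, de_genes, n_genes, alpha=0.05):
    """Bottom-up Fisher tests; genes of significant nodes are removed from all ancestors."""
    anc = {t: ancestors(t, parents) for t in parents}
    removed = {t: set() for t in parents}
    p_values = {}
    # A descendant has strictly more ancestors than any of its ancestors,
    # so this order visits every term after all of its descendants.
    for term in sorted(parents, key=lambda t: len(anc[t]), reverse=True):
        genes = annot[term] - removed[term]
        p = fisher_p(len(genes & de_genes), len(de_genes), len(genes), n_genes)
        p_values[term] = p
        if p < alpha:
            for a in anc[term]:
                removed[a] |= genes
    return p_values

# Toy graph: root <- A <- B and root <- C; 20 genes, 5 of them DE.
all_genes = {f"g{i}" for i in range(20)}
de_genes = {"g0", "g1", "g2", "g3", "g4"}
parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["root"]}
annot = {
    "B": {"g0", "g1", "g2", "g3"},
    "A": {"g0", "g1", "g2", "g3", "g10"},
    "C": {"g4", "g11", "g12"},
    "root": all_genes,
}
p_elim = elim(annot, parents, de_genes, len(all_genes))
p_plain_A = fisher_p(4, 5, 5, 20)  # node A tested without elimination
print(p_elim["B"] < 0.05, p_plain_A < 0.05, p_elim["A"])  # True True 1.0
```

Node A is significant in a plain Fisher test only because it inherits the DE genes of its child B; once those genes are eliminated, nothing remains to support A.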

Decorrelating the GO

• Families of significant nodes are ’broken’

• Focus lies on more specific terms

• No adjustment for multiple testing, but a heuristic algorithm to find extreme spots within the GO graph

Parent-Child Approach

• If many differentially expressed genes are annotated to a GO term, it is not surprising that over-representation is also found in the more specific descendants of the term

• Compute hypergeometric p-values where the reference gene population does not consist of all genes $m$ but only of the parental genes $m_{pa(t)}$ of a given GO term $t$

$$P(X_t \ge x_t \mid X_{pa(t)} = x_{pa(t)}) = \sum_{k=x_t}^{\min(x_{pa(t)},\, m_t)} \frac{\binom{m_t}{k}\binom{m_{pa(t)} - m_t}{x_{pa(t)} - k}}{\binom{m_{pa(t)}}{x_{pa(t)}}}$$

$pa(t)$: set of parents of term $t$

$m_t$: number of genes annotated to term $t$

$x_t$: number of differentially expressed genes annotated to term $t$

Grossmann et al. (2006)
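
The conditional p-value can be computed directly from the formula. The sketch below is an illustration with made-up counts; the function names are hypothetical, and the comparison against the whole gene population (1000 genes, 50 of them DE) is added only to show the effect of changing the reference set.

```python
from math import comb

def parent_child_p(x_t, m_t, x_pa, m_pa):
    """P(X_t >= x_t | X_pa(t) = x_pa(t)): the parents' genes are the reference population."""
    upper = min(x_pa, m_t)
    tail = sum(comb(m_t, k) * comb(m_pa - m_t, x_pa - k) for k in range(x_t, upper + 1))
    return tail / comb(m_pa, x_pa)

def term_for_term_p(x_t, m_t, x_all, m_all):
    """Standard hypergeometric p-value with all genes as reference population."""
    upper = min(x_all, m_t)
    tail = sum(comb(m_t, k) * comb(m_all - m_t, x_all - k) for k in range(x_t, upper + 1))
    return tail / comb(m_all, x_all)

# Hypothetical counts: term t has 10 genes, 5 of them DE; its parents have 20 genes, 10 DE.
p_conditional = parent_child_p(x_t=5, m_t=10, x_pa=10, m_pa=20)
# The same term against a made-up population of 1000 genes with 50 DE genes.
p_standard = term_for_term_p(x_t=5, m_t=10, x_all=50, m_all=1000)
print(round(p_conditional, 3))  # 0.672
print(p_standard < 0.01)        # True
```

Half of the parents' genes are DE in this example, so finding 5 DE genes among the 10 genes of the term is unremarkable once the parents are taken as reference, although the term looks strongly enriched against the whole population.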

Parent-Child Approach

• Additionally, adjustment for multiple testing is suggested

• Focus lies on more general terms

Problems

• Usual problems of gene set enrichment

– Definition of a list of interesting genes by some cutoff

– Analysis consists of two separated steps

– Competitive testing framework

– Gene sampling model

• There are also dependencies between unrelated GO terms

• What are the underlying statistical models?

• How can results be interpreted?

Methods Based on Global Tests

Significant terms logically must have significant ancestor terms

• Focus level approach combines closed testing procedure with

correction method of Holm and controls the FWER

Goeman and Mansmann (Technical Report)

• Hierarchical testing could be adapted to the special structure

of the GO

Meinshausen (Technical Report)
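
As a reference point for the correction method mentioned above, plain Holm adjustment can be sketched as follows. This is only the ordinary Holm step-down procedure for a flat list of p-values; it is not the focus level approach, which additionally uses closed testing on the GO graph. The function name and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values for three gene sets.
adj = holm_adjust([0.01, 0.04, 0.03])
print([round(a, 2) for a in adj])  # [0.03, 0.06, 0.06]
```

Rejecting all hypotheses with adjusted p-value at most α controls the family-wise error rate at level α, whatever the dependence between the tests.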

Summary and Outlook

• Holistic approaches for gene set analysis seem more appropriate from a statistical point of view

• However, gene set enrichment can still be useful in practice

to detect interesting functional categories

• Exceptions to the presented subdivision exist

• Simulation study for comparison of the various methods

• New ideas how to determine interesting regions in the GO

graph

References

1. Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006.

2. Goeman JJ, de Kort F, van de Geer SA, van Houwelingen JC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004; 20(1): 93-99.

3. Goeman JJ, Mansmann U. Multiple testing on the directed acyclic graph of Gene Ontology. Technical Report.

4. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007; 23(8): 980-987.

5. Grossmann S, Bauer S, Robinson PN, Vingron M. An Improved Statistic for Detecting Over-represented Gene Ontology Annotations in Gene Sets. Research in Computational Molecular Biology: 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2-5, 2006. Proceedings: Lecture Notes in Computer Science 3909, Mar 2006, 85-98.

6. Hummel M, Meister R, Mansmann U. GlobalANCOVA – exploration and assessment of gene group effects. Technical Report.

7. Kotz S, Johnson NL, Boyd DW. Series representations of distributions of quadratic forms in normal variables. I. Central case. The Annals of Mathematical Statistics 1967; 38(3): 823-837.

8. Mansmann U, Meister R. Testing differential gene expression in functional groups. Methods Inf Med 2005; 44(3).

9. Meinshausen N. Hierarchical testing of variable importance. Technical Report.

10. Rivals I, Personnaz L, Taing L, Potier MC. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2006.

11. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance estimation and implications for functional genomics. Statist Appl Genet Mol Biol 2005; 4: 32.

12. van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415: 530-536.

13. Xiang W, Hummel M, Mitteregger G, Pace C, Windl O, Mansmann U, Kretzschmar HA. Transcriptome analysis reveals altered cholesterol metabolism during the neurodegeneration in mouse scrapie model. Journal of Neurochemistry 2007.
