Gene Set Testing on the Gene Ontology Graph
Manuela Hummel
Ulrich Mansmann
BioMed-S Seminar, June 19, 2007
Overview
Analysis of groups of genes
• Motivation
• Gene set enrichment versus holistic approaches
GlobalAncova
• Linear models, F-test, p-values
• Analysis examples
Detecting interesting Gene Ontology terms
• Structure of the GO
• Methods based on gene set enrichment
• Methods based on global tests
Motivation
Gene-wise analysis
• E.g. analysis of differential expression
• Genes are treated independently
• Correction for multiple testing is crucial
• Resulting lists of interesting genes are rather unstable
• Biological interpretation of such gene lists is hard
Analysis of gene sets
• Predefined gene groups provide more biological knowledge
(Gene Ontology terms, pathways, genomic regions, ...)
• More meaningful interpretation in biological context
• Number of gene sets to be investigated is smaller than
number of individual genes
Strategies for Group Testing
Gene set enrichment
• Idea: Provide biological meaning to a list of interesting genes
by means of an over-representation analysis
• Step 1: Gene-wise analysis of differential expression
Step 2: Score gene groups for enrichment
• Goal: Find gene groups that contain many interesting genes
Holistic approaches
• Idea: Look directly at gene sets and ask whether they are
biologically relevant with respect to differential expression
• Global analysis of differential expression for gene groups
• Goal: Find gene groups that contain at least one interesting
gene or many genes with moderate differential expression
Gene Set Enrichment
• Step 1: Define a list of differentially expressed (DE) genes, e.g. by t-tests and correction for multiple testing
• Step 2: The connection of the list to a functional gene set can be summarized in a 2 × 2 contingency table

                  ∈ gene group    ∉ gene group          total
    ∈ DE genes    x               K − x                 K
    ∉ DE genes    M − x           (N − M) − (K − x)     N − K
    total         M               N − M                 N

• Use e.g. Fisher's exact test to test for association between gene list and gene set
• Similar tests based on gene counts are very often used in Gene Ontology analysis (binomial test, χ² test, z scores)
Rivals et al. (2006)
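As a minimal sketch of Step 2, the one-sided Fisher p-value for the contingency table can be computed directly from the hypergeometric distribution. The symbols x, K, M and N follow the table; the function name `enrichment_p_value` and the example counts are illustrative assumptions, not values from the talk.

```python
from math import comb

def enrichment_p_value(x, K, M, N):
    """One-sided Fisher's exact test: P(X >= x) under the hypergeometric null.

    x: DE genes inside the gene group
    K: total number of DE genes
    M: total number of genes in the gene group
    N: total number of genes
    """
    upper = min(K, M)  # largest possible overlap between DE list and gene group
    tail = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, upper + 1))
    return tail / comb(N, K)

# Hypothetical counts: 1000 genes, 50 of them DE, a group of 20 genes, 8 of them DE.
p = enrichment_p_value(x=8, K=50, M=20, N=1000)
print(p)
```

A small p-value here only says that the group contains more DE genes than expected when genes are drawn at random from the urn of all N genes.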
Holistic Approaches
• Is the global expression pattern X of a group of genes significantly related to some clinical variable Y of interest?
• E.g. globaltest of Goeman et al. (2004)
Does knowledge of X help to predict Y?
• Generalized linear model with $\beta_j \sim (0, \tau^2)$
$$E(y_i \mid \beta) = h^{-1}\Big(\alpha + \sum_{j=1}^{m} x_{ij}\beta_j\Big)$$
• $H_0: \beta_1 = \ldots = \beta_m = 0$ is equivalent to $H_0: \tau^2 = 0$
Globaltest
• Test statistic
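$$
\begin{aligned}
Q &\propto (Y - \mu)' R (Y - \mu) \\
  &\propto \sum_{g} [X_g (Y - \mu)]^2 && \text{(sum over genes)} \\
  &\propto \sum_{i} \sum_{j} R_{ij} (Y_i - \mu)(Y_j - \mu) && \text{(sum over subjects)}
\end{aligned}
$$
$R = X'X$: matrix of correlations between the gene expression profiles of subjects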
• Test to see whether subjects with similar expression also have
similar outcomes
• Permutation based and asymptotic p-values are available
• Also multicategorical, continuous or survival variables can be
considered and adjustment for covariates is possible
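The two ways of writing the test statistic (sum over genes, sum over subjects) can be checked numerically. The sketch below assumes X is stored as genes × subjects, estimates µ by the sample mean of Y (a model without covariates), and leaves out the normalising constants; the function name `q_statistic` and the simulated data are illustrative only.

```python
import numpy as np

def q_statistic(X, y):
    """Unnormalised globaltest-type statistic, computed in two equivalent ways.

    X: genes x subjects expression matrix (m x n), y: outcome vector of length n.
    """
    resid = y - y.mean()                  # Y - mu, mu estimated by the mean of Y (assumption)
    per_gene = X @ resid                  # X_g (Y - mu) for every gene g
    q_genes = float(np.sum(per_gene ** 2))        # sum over genes
    R = X.T @ X                           # n x n similarity matrix between subjects
    q_subjects = float(resid @ R @ resid)         # sum over pairs of subjects
    return q_genes, q_subjects

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 genes, 8 subjects (simulated)
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
q_genes, q_subjects = q_statistic(X, y)
print(q_genes, q_subjects)
```

Both values agree up to floating-point error, which is the identity the slide states.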
Competitive vs Self-Contained
Gene set enrichment – competitive
• Differential expression of a gene set G is compared to a standard defined by all the remaining genes $\bar{G}$
• $H_0^{\mathrm{comp}}$: The genes in G are at most as often differentially expressed as the genes in $\bar{G}$
• Significance of G is "penalized" for the significance of $\bar{G}$
• If all genes are (similarly) DE, no gene set will be significant
Holistic approaches – self-contained
• Differential expression is explored only within the gene set
• $H_0^{\mathrm{self}}$: No genes in G are differentially expressed
• $H_0^{\mathrm{self}}$ is more restrictive → more power
• If all genes are (similarly) DE, all gene sets will be significant
Goeman and Bühlmann (2007)
Competitive vs Self-Contained
Gene set enrichment – competitive
• Single genes are treated very differently from gene sets consisting of single genes
• Testing the set of all genes is not possible
• p-values of different gene sets tend to be negatively correlated
Holistic approaches – self-contained
• Generalization of single gene testing to gene set testing
• Testing the set of all genes can be a useful preliminary data
quality check
• In some cases may be too powerful
Gene vs Subject Sampling
Gene set enrichment – gene sampling
• Genes are the sampling units
• A new sample would correspond to a sample of new genes
for the same subjects
• Classical roles of variables and observations are reversed
Holistic approaches – subject sampling
• Subjects are the sampling units
• A new sample corresponds to measurements of the same variables (= genes) for a new subject
• Classical roles of variables and observations are kept
Gene vs Subject Sampling
Gene set enrichment – gene sampling
• p-values relate to replications of the urn experiment ≠ biological replications
• Small p-value: a similar association between 'membership of the gene set' and 'being differentially expressed' will be found with new genes for the same subjects
• Sample size = number of genes, which does not correspond to the biological sample size
Holistic approaches – subject sampling
• p-values relate to measuring the same variables on new subjects = biological replications
• Small p-value: a similar association between expression and phenotype will be found for the same genes in new subjects
• Sample size = number of subjects, which does correspond to the biological sample size
Gene vs Subject Sampling
Gene set enrichment – gene sampling
• Unrealistic assumption of independence between genes
• Correlations affect the number of genes called differentially
expressed
• If (gene-wise) p-values are positively correlated, the true null
distribution is not hypergeometric but has heavier tails
• Gene set p-values may be understated
Holistic approaches – subject sampling
• Assumption of independence between subjects, but genes
may be correlated
GlobalAncova
• How is gene expression X influenced by variable Y?
H0 : P(X|Y,C) = P(X|C) (where C can be other covariates)
• The expectation for gene j is assumed to follow a linear model $E(x_j) = D\beta_j$; the fitted values are $Hx_j$, with hat matrix $H = D(D'D)^{-1}D'$
• D is a usual design matrix, e.g.

                Int   Y   C
    sample 1     1    0   1
    sample 2     1    0   2
    sample 3     1    1   1
    . . .            . . .

• Residual sum of squares for gene j
$\varepsilon_j'\varepsilon_j$, with $\varepsilon_j = (I - H)x_j$
• Total residual sum of squares
$$\mathrm{SSR} = \sum_{j=1}^{p} \varepsilon_j'\varepsilon_j$$
Hummel, Meister and Mansmann (Technical Report)
F-Test
• Do we need the variable Y to explain the data X?
→ Extra sum of squares principle
• The full model containing the variable of interest is compared
to a reduced model without it
Dfull = (1, Y, C), Dreduced = (1, C)
• Extra residual sum of squares
SSRextra = SSRreduced − SSRfull
• F-statistic
F = MSRextra/MSRfull
• Permutation based and asymptotic p-values are available
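The extra sum of squares principle can be sketched in a few lines. This is not the GlobalAncova package code: the function names `total_rss` and `global_f` are hypothetical, X is stored as samples × genes, and the degrees of freedom are taken as the number of genes times the per-gene degrees of freedom, which is an assumption of this sketch.

```python
import numpy as np

def total_rss(X, D):
    """Residual sum of squares summed over all genes after regressing X on D."""
    beta, *_ = np.linalg.lstsq(D, X, rcond=None)   # one least-squares fit per gene (column)
    resid = X - D @ beta                           # equals (I - H) x_j for every gene j
    return float(np.sum(resid ** 2))

def global_f(X, D_full, D_reduced):
    """F = MSR_extra / MSR_full for a samples x genes matrix X."""
    n, p = X.shape
    r_full = np.linalg.matrix_rank(D_full)
    r_reduced = np.linalg.matrix_rank(D_reduced)
    ssr_full = total_rss(X, D_full)
    ssr_extra = total_rss(X, D_reduced) - ssr_full
    ms_extra = ssr_extra / (p * (r_full - r_reduced))   # assumed df: genes x extra parameters
    ms_full = ssr_full / (p * (n - r_full))             # assumed df: genes x residual df
    return ms_extra / ms_full

# Toy data: 4 samples, 2 genes; only the first gene differs between the groups.
y = np.array([0., 0., 1., 1.])
X = np.array([[0., 1.], [2., 3.], [4., 1.], [6., 3.]])
D_reduced = np.ones((4, 1))                    # intercept only
D_full = np.column_stack([np.ones(4), y])      # intercept + variable of interest
F = global_f(X, D_full, D_reduced)
print(F)
```

In the toy data the reduced model leaves a total residual sum of squares of 24 and the full model 8, so the extra sum of squares is 16 on 2 degrees of freedom against 8 on 4.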
Permutation based p-values
• Permute column(s) Y in the design matrix of the full model
B times
• Compute the GlobalAncova statistic Fb for each permutation,
b = 1, . . . , B
• An empirical p-value is the fraction of permutation statistics $F_b$ that are greater than the observed F
$$p = \frac{\#\{b : F_b > F\}}{B}$$
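The permutation scheme can be sketched as follows for the simplest design (intercept plus one variable Y, no covariates). The statistic used here is the ratio SSR_extra / SSR_full, which differs from F only by a constant factor and therefore gives the same permutation p-value; the function names and the simulated data are assumptions made for illustration.

```python
import numpy as np

def rss(X, D):
    """Residual sum of squares over all genes (columns of X) for design D."""
    beta, *_ = np.linalg.lstsq(D, X, rcond=None)
    return float(np.sum((X - D @ beta) ** 2))

def f_ratio(X, y):
    """SSR_extra / SSR_full for the full model (1, y) versus the reduced model (1)."""
    ones = np.ones(len(y))
    full = rss(X, np.column_stack([ones, y]))
    return (rss(X, ones[:, None]) - full) / full

def permutation_p_value(X, y, B=500, seed=0):
    """Fraction of permuted statistics that exceed the observed one."""
    rng = np.random.default_rng(seed)
    observed = f_ratio(X, y)
    exceed = sum(f_ratio(X, rng.permutation(y)) > observed for _ in range(B))
    return exceed / B

# Simulated example: 20 subjects, 5 genes, all genes shifted in the second group.
data_rng = np.random.default_rng(1)
y = np.repeat([0., 1.], 10)
X_shift = data_rng.normal(size=(20, 5)) + 2.0 * y[:, None]
X_null = data_rng.normal(size=(20, 5))
p_shift = permutation_p_value(X_shift, y)
p_null = permutation_p_value(X_null, y)
print(p_shift, p_null)
```

Only the column Y is permuted; the expression matrix stays fixed, so correlations between genes are preserved in every permutation.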
Asymptotic p-values
• Distribution of the F-statistic numerator can be approximated by a series of χ² distributions (Kotz et al., 1967)
$$F(\alpha; x) = F(\mathrm{SSR}_{\mathrm{reduced}} - \mathrm{SSR}_{\mathrm{full}}) = \sum_{k=0}^{\infty} c_k \chi^2_{np+2k}(x/\beta) \approx \sum_{k=0}^{np} c_k \chi^2_{np+2k}(x/\beta)$$
where the weights $c_k$ and $\beta$ are calculated from $\alpha$
• $\alpha = \{\gamma_i \cdot \lambda_j;\ i = 1, \ldots, n;\ j = 1, \ldots, p\}$ with
$\gamma_i$, $i = 1, \ldots, n$: eigenvalues of $H_{\mathrm{full}} - H_{\mathrm{reduced}}$
$\lambda_j$, $j = 1, \ldots, p$: eigenvalues of the gene covariance matrix
• The gene covariance matrix is estimated via a shrinkage estimate (necessary since p ≫ n)
Schäfer and Strimmer (2005)
Linear Models
General linear model framework allows analysis of
    Design                        Full model                Reduced model
    Various groups                ∼ group + cov             ∼ cov
    Dose response                 ∼ dose + cov              ∼ cov
    Group by dose interaction     ∼ group * dose + cov      ∼ group + dose + cov
    Time trends in groups         ∼ group * time + cov      ∼ group + time + cov
    Gene-gene interaction         ∼ gene + cov              ∼ cov
    Co-expression                 ∼ group + gene + cov      ∼ group + cov
    Differential co-expression    ∼ group * gene + cov      ∼ group + gene + cov
    Meta-analysis                 ∼ group * dataset         ∼ dataset
    . . .                         . . .                     . . .
Example: Co-Expression
• van 't Veer et al. (2002) present a gene signature of 70 genes to predict recurrence of breast cancer
• We consider 9 cancer related pathways
• For demonstration we pick the cell cycle pathway and the signature gene cyclin E2
• Questions:
Is it possible to relate the signature genes to the pathways?
How does this relation depend on the clinical outcome: development of distant metastases within 5 years (yes/no)?
Example: Co-Expression
Is there co-expression between the signature gene and the pathway genes, stratified by prognosis group?
Full model: ∼ metastases + signature.gene
Reduced model: ∼ metastases
Yes:

    $effect
    [1] "signature.gene"

    $ANOVA
                 SSQ   DF        MS
    Effect  19.11152   31 0.6165006
    Error  119.06502 2883 0.0412990

    $test.result
                     [,1]
    F.value  1.492774e+01
    p.approx 6.117746e-14

    $terms
    [1] "(Intercept)"    "metastases"     "signature.gene"
Example: Co-Expression
Is there differential co-expression between signature gene and
pathway regarding the clinical outcome?
Full model: ∼ metastases * signature.gene
Reduced model: ∼ metastases + signature.gene
No: p.approx 0.2579824
[Figure: two scatter plots of cell cycle control pathway expression (y-axis) against cyclin E2 expression (x-axis), one panel for the good prognosis group and one for the bad prognosis group; both axes range from −0.5 to 0.5.]
Example: Meta-Analysis
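• Combine several data sets, possibly derived from different microarray technologies, for gene expression analysis
Full model: ∼ phenotype * dataset
Reduced model: ∼ dataset
[Figure: Venn diagram with three sets labelled "meta analysis", "data set 1" and "data set 2"; the extracted region counts are 60, 1, 1, 5, 11, 9 and 8, but their assignment to regions is not recoverable from the text.]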
• Problems
– Only genes present in all data sets can be included
– What about differential expression in different directions?
Example: Differential Time Course
• Study of neurodegeneration in a mouse scrapie model
Xiang et al. (2007)

                day 90    day 120    day 150
    control     n=3       n=3        n=3
    infected    n=3       n=3        n=3

• Are there differences in the time courses of expression between the two study groups?
Full model: ∼ group * time
Reduced model: ∼ group + time
• Which biological processes are involved?
Example: Differential Time Course
• Some GO categories show expression patterns with differential time course
[Figure for the GO term "quinone cofactor metabolic process". Left panel: normalised mean expression (y-axis, about 5 to 10) against the time points day 90, day 120 and day 150. Right panel: bar plot of the reduction in sum of squares per gene (x-axis, 0.00 to 0.20). Probe sets shown: 1416665_at, 1417264_at, 1417265_s_at, 1426902_at, 1428134_at, 1431893_a_at, 1436351_at, 1437364_at, 1452770_at, 1454090_at.]
• However, none of the categories would ’survive’ adjustment
for multiple testing
Gene Ontology
• The Gene Ontology (GO) is a
controlled vocabulary to describe
gene and gene product attributes
(http://www.geneontology.org/)
• Three Ontologies
Molecular Function (7527 terms)
Biological Process (13155 terms)
Cellular Component (1864 terms)
• Relations between GO terms are
displayed in directed acyclic graphs
[Figure: example GO graph with the nodes Gene Ontology, molecular function, binding, nucleic acid binding, DNA binding, transcription regulator activity and transcription factor activity.]
Gene Ontology
• Genes known to be associated with some
attributes are mapped to corresponding
GO terms
• Inheritance
Each gene associated with some term is
also mapped to all its ancestors
• Overlap exists also between unrelated
terms
• Not every gene belongs to a leaf node
{genes in the leaves} ≠ {genes in the root}
GO Analysis
• Most current tools for GO analysis use tests based on gene counts, such as Fisher's exact test
Rivals et al. (2006)
• In principle all gene set testing methods can be used to detect interesting GO terms
Different approaches answer different questions!
Goeman and Bühlmann (2007)
• Testing thousands of GO terms requires some adjustment for multiple testing
• Most group testing and adjustment methods do not account for dependencies between gene sets
• Recent approaches incorporate the special structure of the Gene Ontology
GO Methods Based on GSE
Find truly enriched GO nodes within sets of closely related terms with similar levels of significance
• Parent nodes might only inherit significance from their more specific children
Decorrelating the GO, Alexa et al. (2006)
• Child nodes might only inherit significance from their more general parents
Parent-child approach, Grossmann et al. (2006)
Decorrelating the GO
How enriched is a GO node with interesting genes if we do not consider the genes from its significant children?
elim algorithm:
• Nodes are tested bottom-up from
most specific to most general terms
• Nodes are tested with Fisher's exact
test
• If a node is significantly enriched
all genes mapped to it are removed
from all of its ancestors
weight algorithm:
Do not remove genes but give weights that denote the gene relevance in the significant nodes
Alexa et al. (2006)
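The elim steps listed above can be sketched on a toy graph. This is a simplified illustration, not the implementation of Alexa et al.: the data structures (`parents`, `genes_of`), the threshold `alpha`, and the choice to keep the gene universe and the number of DE genes fixed after removal are all assumptions of this sketch.

```python
from math import comb

def upper_tail_p(x, m_set, k_de, n_all):
    """One-sided Fisher p-value P(X >= x) for a set of m_set genes."""
    hi = min(m_set, k_de)
    return sum(comb(m_set, k) * comb(n_all - m_set, k_de - k)
               for k in range(x, hi + 1)) / comb(n_all, k_de)

def ancestors(term, parents):
    """All terms reachable by following parent links upwards."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def depth(term, parents):
    """Length of the longest path from the term up to a root."""
    ps = parents.get(term, [])
    return 0 if not ps else 1 + max(depth(p, parents) for p in ps)

def elim(parents, genes_of, de_genes, n_all, alpha=0.01):
    removed = {t: set() for t in genes_of}
    p_values = {}
    # Bottom-up: most specific (deepest) terms first.
    for term in sorted(genes_of, key=lambda t: -depth(t, parents)):
        genes = genes_of[term] - removed[term]
        x = len(genes & de_genes)
        p_values[term] = upper_tail_p(x, len(genes), len(de_genes), n_all)
        if p_values[term] < alpha:
            # Significant node: take its genes out of every ancestor.
            for anc in ancestors(term, parents):
                removed[anc] |= genes
    return p_values

# Toy example: root > A > B, 20 genes, 5 DE genes that all sit in B.
all_genes = {f"g{i}" for i in range(1, 21)}
de_genes = {f"g{i}" for i in range(1, 6)}
genes_of = {
    "root": set(all_genes),
    "A": {f"g{i}" for i in range(1, 11)},
    "B": {f"g{i}" for i in range(1, 7)},
}
parents = {"A": ["root"], "B": ["A"]}
p_elim = elim(parents, genes_of, de_genes, len(all_genes))
print(p_elim)
```

In this toy graph B is significant, so its genes are removed from A and the root; A, which would have a p-value of about 0.016 on its own, is left with no DE genes and is no longer flagged merely because of its child.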
Decorrelating the GO
• Families of significant nodes are ’broken’
• Focus lies in more specific terms
• No adjustment for multiple testing, but a heuristic algorithm to find extreme spots within the GO graph
Parent-Child Approach
• If many differentially expressed genes are annotated to a GO term, it is not surprising that over-representation is also found in the more specific descendants of the term
• Compute hypergeometric p-values where the reference gene population does not consist of all genes m but only of the parental genes $m_{pa(t)}$ of a given GO term t
$$P(X_t \ge x_t \mid X_{pa(t)} = x_{pa(t)}) = \sum_{k=x_t}^{\min(x_{pa(t)},\, m_t)} \frac{\binom{m_t}{k}\binom{m_{pa(t)} - m_t}{x_{pa(t)} - k}}{\binom{m_{pa(t)}}{x_{pa(t)}}}$$
pa(t): set of parents of term t
$m_t$: number of genes annotated to term t
$x_t$: number of differentially expressed genes annotated to term t
$m_{pa(t)}$, $x_{pa(t)}$: the corresponding counts for the genes annotated to the parents of t
Grossmann et al. (2006)
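The conditional p-value can be sketched directly from the formula. The argument names mirror the symbols $x_t$, $m_t$, $x_{pa(t)}$ and $m_{pa(t)}$; the function name and the example counts are hypothetical.

```python
from math import comb

def parent_child_p_value(x_t, m_t, x_pa, m_pa):
    """P(X_t >= x_t | X_pa(t) = x_pa): hypergeometric tail with the parental genes as reference."""
    upper = min(x_pa, m_t)
    tail = sum(comb(m_t, k) * comb(m_pa - m_t, x_pa - k)
               for k in range(x_t, upper + 1))
    return tail / comb(m_pa, x_pa)

# Hypothetical counts: the parents hold 10 genes, 5 of them DE;
# the term itself holds 6 of those genes, 5 of them DE.
p = parent_child_p_value(x_t=5, m_t=6, x_pa=5, m_pa=10)
print(p)
```

With these counts the p-value is 6/252, about 0.024. Against a reference of, say, 20 genes in total the same term would get 6/15504, so conditioning on the parents makes the term look far less remarkable.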
Parent-Child Approach
• Additionally, adjustment for multiple testing is suggested
• Focus lies in more general terms
Problems
• Usual problems of gene set enrichment
– Definition of a list of interesting genes by some cutoff
– Analysis consists of two separated steps
– Competitive testing framework
– Gene sampling model
• There are also dependencies between unrelated GO terms
• What are the underlying statistical models?
• How can results be interpreted?
Methods Based on Global Tests
Significant terms logically must have significant ancestor terms
• The focus level approach combines the closed testing procedure with
Holm's correction method and controls the family-wise error rate (FWER)
Goeman and Mansmann (Technical Report)
• Hierarchical testing could be adapted to the special structure
of the GO
Meinshausen (Technical Report)
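The focus level procedure itself works on the whole GO graph and is not reproduced here. As a reminder of one building block it refers to, the sketch below shows only Holm's step-down adjustment for a flat list of p-values; the function name and the example values are illustrative.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values (controls the FWER for a flat family of tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest to largest p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by the number of remaining hypotheses.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)         # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

adj = holm_adjust([0.01, 0.04, 0.03])
print(adj)
```

For the three example values the adjusted p-values are 0.03, 0.06 and 0.06, returned in the original order of the input.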
Summary and Outlook
• Holistic approaches for gene set analysis seem more appropriate from a statistical point of view
• However, gene set enrichment can still be useful in practice
to detect interesting functional categories
• Exceptions to the presented subdivision exist
• Simulation study for comparison of the various methods
• New ideas on how to determine interesting regions in the GO
graph
References
1. Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006.
2. Goeman JJ, de Kort F, van de Geer SA, van Houwelingen JC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004; 20(1): 93-99.
3. Goeman JJ, Mansmann U. Multiple testing on the directed acyclic graph of Gene Ontology. Technical Report.
4. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007; 23(8): 980-987.
5. Grossmann S, Bauer S, Robinson PN, Vingron M. An Improved Statistic for Detecting Over-represented Gene Ontology Annotations in Gene Sets. Research in Computational Molecular Biology: 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2-5, 2006. Proceedings: Lecture Notes in Computer Science 3909, Mar 2006, 85-98.
6. Hummel M, Meister R, Mansmann U. GlobalANCOVA – exploration and assessment of gene group effects. Technical Report.
7. Kotz S, Johnson NL, Boyd DW. Series representations of distributions of quadratic forms in normal variables. I. Central case. The Annals of Mathematical Statistics 1967; 38(3): 823-837.
8. Mansmann U, Meister R. Testing differential gene expression in functional groups. Methods Inf Med 2005; 44(3).
9. Meinshausen N. Hierarchical testing of variable importance. Technical Report.
10. Rivals I, Personnaz L, Taing L, Potier MC. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2006.
11. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance estimation and implications for functional genomics. Statist Appl Genet Mol Biol 2005; 4: 32.
12. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415: 530-536.
13. Xiang W, Hummel M, Mitteregger G, Pace C, Windl O, Mansmann U, Kretzschmar HA. Transcriptome analysis reveals altered cholesterol metabolism during the neurodegeneration in mouse scrapie model. Journal of Neurochemistry 2007.