Large-scale mining of gene expression patterns
Paul Pavlidis ([email protected], bioinformatics.ubc.ca)
VanBUG September 2007
Students: Leon French, Meeta Mistry, Vaneet Lotay
Postdoc: Jesse Gillis
Undergraduates: Raymond Lim, Suzanne Lane
Programmers: Kelsey Hamer, Luke McCarthy
Synapse Genome
Signal transduction
Synaptic modulation
Injury, Stress
Disease, Aging
Development
Topics
• Connectivity database and analysis
• Gene expression data re-use system
• Scaling up gene coexpression analysis
• Applications and ongoing work
Another ‘ome
Leon French, Suzanne Lane
Growth of GEO
[Figure: number of GEO submissions over time; x-axis: Date (Dec-99 to Feb-08); y-axis: Submissions (0 to 120,000)]
[Figure: expression heat map; labels: Age, Genes, Samples] With JJ Mann, V Arango, E Sibille et al.
[Figure: expression heat map; labels: Age, Genes, Samples] Data from http://national_databank.mclean.harvard.edu/
GEO
Goals for a system
• Researchers should be able to put their new expression data in a wider context of previous studies without extraordinary effort.
• Move the analysis of multiple microarray data sets from a niche activity to the mainstream.
• Integrate other data types and domain-specific information.
Coexpression
Differential expression
Public data sources
Challenges to comparing data sets
• Need to match genes/transcripts across platforms
• Data from third parties not always easy to handle
• Varying scales, normalization, etc.
• Varying data quality
• Varying levels of “raw data” available
• Selecting appropriate data to compare
With Cincinnati Children’s Hospital (D.Glass, M. Barnes et al.)
Probe specificity (or lack thereof)
[Figure: histogram; x-axis: Fraction of probes with alignments (0.0 to 1.0); y-axis: Frequency (0 to 20)]
[Figure: histogram; x-axis: Fraction non-specific probes (0.0 to 1.0); y-axis: Frequency (0 to 14)]
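The two quantities named on the histograms can be computed per array design once each probe has been aligned to the genome. The following is a minimal sketch, not the method used for the slides: the input format (a mapping from probe id to the set of genes its sequence aligns to), the function name `probe_specificity`, and the choice to express the non-specific fraction relative to aligned probes are all assumptions, and the probe and gene names are made up.

```python
def probe_specificity(alignments):
    """Summarise probe specificity for one array design.

    `alignments` maps a probe id to the set of gene symbols that the probe
    sequence aligns to (hypothetical input format).
    """
    n_probes = len(alignments)
    # A probe "has an alignment" if it hits at least one gene.
    aligned = sum(1 for genes in alignments.values() if len(genes) >= 1)
    # A probe is "non-specific" if it hits more than one gene.
    nonspecific = sum(1 for genes in alignments.values() if len(genes) > 1)
    return {
        "fraction_with_alignments": aligned / n_probes if n_probes else 0.0,
        # Assumption: non-specific probes are counted relative to aligned probes.
        "fraction_nonspecific": nonspecific / aligned if aligned else 0.0,
    }


# Made-up platform with four probes.
example_platform = {
    "probe_1": {"geneA"},
    "probe_2": {"geneB", "geneC"},  # aligns to two genes: non-specific
    "probe_3": set(),               # no alignment found
    "probe_4": {"geneD"},
}
summary = probe_specificity(example_platform)
print(summary)
```

Running this over every array design gives one value per platform for each quantity, which is the kind of data a histogram across platforms would summarise.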
Which data sets are reasonable to compare?
All mouse data sets
Mouse brain data sets
Mouse neocortex data sets
Mouse neocortex data sets examining stress
Mouse neocortex data sets examining hypoxic stress
Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia
Too general, but lots of power
Very specific, low power
• Expression experiments: 519
  • Mus musculus: 254
  • Homo sapiens: 203
  • Rattus norvegicus: 62
• Array designs: 178
• Assays (i.e., chips): 20,837
• Coexpression links (probe-level): >100 million
Scaling up analysis of gene coexpression
• Genes that are coexpressed tend to have related function
  • Needed at the same place at the same time
  • “Guilt by association”
• Reasonable to compare across studies
[Figure: expression profiles of two ribosomal protein genes; x-axis: Samples; y-axis: Expression. Eisen et al., 1998 PNAS]
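Coexpression of two genes within one data set is commonly quantified as the correlation between their expression profiles across samples. The sketch below uses the Pearson correlation as an illustration; the function name `pearson` and the expression values are made up and are not taken from the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Made-up expression values over five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]  # rises with gene_a
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]  # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(round(r_ab, 3), round(r_ac, 3))
```

A pair such as `gene_a` and `gene_b` would count as positively coexpressed, while `gene_a` and `gene_c` show a negative correlation.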
Biological noise
• Induced gene expression effects are often small.
• Gene expression varies between “replicates” in biologically meaningful ways.
• Allows us to repurpose data
[Figure: axis label: Sample type]
Functional coexpression should be (somewhat) generalized
• If two genes are coexpressed under one condition, they will probably be coexpressed under at least some other conditions (or data sets).
• Coexpression seen “only once” needs special care in interpretation.
• We shouldn’t expect coexpression to be perfectly reproducible (for biological and technical reasons).
[Figure: axis labels: Correlation, Correlation]
Genome Research, June 2004
A simple approach: count recurring patterns
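Counting recurring patterns amounts to asking, for each gene pair, in how many data sets it was called a coexpression link. A minimal sketch follows, assuming each data set has already been reduced to a list of gene pairs; the function names and the gene labels are hypothetical.

```python
from collections import Counter


def count_link_support(links_per_dataset):
    """Count, for each gene pair, the number of data sets it appears in.

    `links_per_dataset` is a list with one entry per data set; each entry is
    an iterable of (gene_a, gene_b) pairs called as links in that data set.
    """
    support = Counter()
    for links in links_per_dataset:
        # frozenset makes (A, B) and (B, A) the same link; building a set
        # first ensures each data set votes at most once per link.
        unique_links = {frozenset(pair) for pair in links if pair[0] != pair[1]}
        support.update(unique_links)
    return support


def recurring_links(support, min_datasets):
    """Return the links seen in at least `min_datasets` data sets."""
    return {link for link, n in support.items() if n >= min_datasets}


# Made-up links from three data sets.
datasets = [
    [("A", "B"), ("A", "C")],
    [("B", "A"), ("D", "B")],
    [("A", "B"), ("A", "C"), ("A", "B")],
]
support = count_link_support(datasets)
replicated = recurring_links(support, min_datasets=2)
print(sorted(sorted(link) for link in replicated))
```

Here the A-B link is supported by all three data sets, A-C by two, and B-D by one, so a threshold of two data sets keeps the first two links.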
Pipeline for one dataset
Proof of concept analysis
• 60 human data sets, 15700 RefSeq genes.
• 70% cancer data
• 11 million “links”
• About 9.7 million different links
Many links are replicated across studies
[Figure: number of links (log scale, 1.E+00 to 1.E+07) versus minimum number of data sets a link is seen in (log scale, 1 to 100); series: Observed, Shuffled database (mean)]
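The comparison against a shuffled database can be sketched as follows. How the database was shuffled for the figure is not stated; this sketch assumes gene labels are permuted independently within each data set, which keeps the number of links per data set fixed while breaking agreement between data sets. All function names and the example data are hypothetical.

```python
import random
from collections import Counter


def support_counts(datasets):
    """Number of data sets each unordered gene pair appears in."""
    support = Counter()
    for links in datasets:
        support.update({frozenset(p) for p in links if p[0] != p[1]})
    return support


def links_at_min_support(support, max_k):
    """Map k to the number of links seen in at least k data sets."""
    return {k: sum(1 for n in support.values() if n >= k)
            for k in range(1, max_k + 1)}


def shuffle_gene_labels(links, genes, rng):
    """Relabel genes with a random permutation (assumed null model).

    `genes` must contain every gene that occurs in `links`.
    """
    permuted = list(genes)
    rng.shuffle(permuted)
    relabel = dict(zip(genes, permuted))
    return [(relabel[a], relabel[b]) for a, b in links]


def shuffled_mean(datasets, genes, n_shuffles=20, seed=0):
    """Mean link counts per support level over shuffled databases."""
    rng = random.Random(seed)
    max_k = len(datasets)
    totals = {k: 0 for k in range(1, max_k + 1)}
    for _ in range(n_shuffles):
        shuffled = [shuffle_gene_labels(links, genes, rng) for links in datasets]
        counts = links_at_min_support(support_counts(shuffled), max_k)
        for k in totals:
            totals[k] += counts[k]
    return {k: totals[k] / n_shuffles for k in totals}


# Made-up example: 30 genes, three data sets sharing the same four links.
genes = [f"g{i}" for i in range(30)]
shared = [("g0", "g1"), ("g2", "g3"), ("g4", "g5"), ("g6", "g7")]
datasets = [list(shared), list(shared), list(shared)]

observed = links_at_min_support(support_counts(datasets), len(datasets))
null_mean = shuffled_mean(datasets, genes)
print("observed:", observed)
print("shuffled (mean):", null_mean)
```

If far more links reach a given support level in the observed data than in the shuffled databases, replication at that level is unlikely to be due to chance alone.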
Evaluation on biological grounds
Cluster involving NMDAR1 (GRIN1)
ATP6V0A1, PLD3
GRIN1
Allen Brain Institute
Application: analysis of imprinted genes
Laurent Journot, INSERM – Universités Montpellier
Ewing et al, 2007 Molecular Systems Biology
LYAR interacting proteins
[Figure: axis labels: Correlation, p-value]
LYAR-interactors
Vote counting limitations
• Weak evidence distributed across data sets will not be picked up.
• This example meets strict “vote counting” criteria in only 2/23 data sets
[Figure panels; recoverable axis labels: Correlation (-1.0 to 1.0); Support (datasets); Global effect size; Correlation (Global); Support (# of datasets); Gene pairs; Datasets]
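One way to pick up weak evidence that is spread across data sets is to pool the per-data-set correlations into a single global effect size instead of applying a threshold in each data set. The sketch below uses a standard fixed-effect combination of Fisher z-transformed correlations weighted by n - 3. This is an illustrative choice and not necessarily the method behind the figures; the function name and the numbers are made up.

```python
import math


def global_correlation(correlations, sample_sizes):
    """Pool correlations for one gene pair across data sets (Fisher z).

    Returns (global_r, z_score). Each data set is weighted by n - 3, the
    inverse variance of its Fisher-transformed correlation.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in zip(correlations, sample_sizes):
        if n <= 3:
            raise ValueError("each data set needs more than 3 samples")
        if abs(r) >= 1.0:
            raise ValueError("correlations must lie strictly between -1 and 1")
        weight = n - 3
        weighted_sum += weight * math.atanh(r)  # Fisher z-transform
        total_weight += weight
    z_mean = weighted_sum / total_weight
    standard_error = 1.0 / math.sqrt(total_weight)
    return math.tanh(z_mean), z_mean / standard_error


# Made-up example: a modest correlation of 0.3 in each of 10 small data sets.
rs = [0.3] * 10
ns = [20] * 10

single_r, single_z = global_correlation(rs[:1], ns[:1])
pooled_r, pooled_z = global_correlation(rs, ns)
print(round(single_z, 2), round(pooled_z, 2))
```

In this example no single data set gives a convincing z-score on its own, while the pooled estimate does, which is the situation a strict vote count would miss.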
Related work: Zhou XJ et al., Nat.Biotech 2005
Summary
• Reuse of public data: ‘adding value’
• Meta-analysis of coexpression
• Some applications
  • Functional prediction
  • Candidate identification
  • Platform evaluation
Ongoing and future work
• Applications and analyses
  • Protein interactions and hubs
  • Prediction of gene function at the synapse
  • Differential expression analysis
    • Regionalization
    • Mouse models of brain injury
    • Mouse models of psychosis
• Expanding our public database and software
  http://www.bioinformatics.ubc.ca/Gemma
  Web-based tools for biologists; web services coming soon
• Integration with other information sources
Thanks
Gemma: Xiang Wan, Kelsey Hamer, Luke McCarthy, Kiran Keshav, Suzanne Lane, Meeta Mistry, Jesse Gillis,
Joseph Santos, Gozde Cozen, David Quigley, Anshu Sinha, Spiro Pantazatos, Wei-Keat Lim
Tmm: Homin Lee, Amy Hsu, Jon Sajdak, Jie Qin, Tzu-Lin Hsaio
And to:
NCBI GEO team
Groups who made data available
Collaborators who provided data prior to publication
Conrad Gilliam
Abraham Palmer
Andreas Kottmann
Etienne Sibille
Collaborators: Barclay Morrison, Joseph Gogos, Michael Hayden, Blair Leavitt, Tony Blau, Panos Papapanou
Answers to FAQs
• No, they don’t have to be time course experiments.
• Yes, we’re using cDNA as well as Affymetrix etc.
• Yes, we see reproducible negative correlations.
• Yes, we’re interested in finding differences as well as similarities between data sets.
• No, we aren’t necessarily inferring regulatory relationships.
• Yes, we know that RNA is just one way of measuring cell state.
• No, we don’t have {worm,fly,yeast…} data, but we’d like to.