Large-scale mining of gene expression patterns

Post on 16-Jan-2016


description

Large-scale mining of gene expression patterns. Paul Pavlidis (paul@bioinformatics.ubc.ca). VanBUG, September 2007. Students: Leon French, Meeta Mistry, Vaneet Lotay. Postdoc: Jesse Gillis. Undergraduates: Raymond Lim, Suzanne Lane. Programmers: Kelsey Hamer, Luke McCarthy.

transcript

Large-scale mining of gene expression patterns

Paul Pavlidis
paul@bioinformatics.ubc.ca

VanBUG September 2007

Students: Leon French, Meeta Mistry, Vaneet Lotay

Postdoc: Jesse Gillis

Undergraduates: Raymond Lim, Suzanne Lane

Programmers: Kelsey Hamer, Luke McCarthy

[Diagram: Synapse and Genome. Labels: signal transduction, synaptic modulation, injury, stress, disease, aging, development]

Topics

• Connectivity database and analysis
• Gene expression data re-use system
• Scaling up gene coexpression analysis
• Applications and ongoing work

Another ‘ome

Leon French, Suzanne Lane

[Figure: Growth of GEO. Submissions (0 to 120,000) by date, Dec-99 to Feb-08]

[Figure: Genes by Samples, annotated with Age. With JJ Mann, V Arango, E Sibille et al.]

[Figure: Genes by Samples, annotated with Age. Data from http://national_databank.mclean.harvard.edu/]

GEO

Goals for a system

• Researchers should be able to put their new expression data in a wider context of previous studies without extraordinary effort.

• Move analyzing multiple microarray data sets from a niche activity to the mainstream

• Integration of other data types, domain specific information.

[Diagram labels: Coexpression, Differential expression, Public data sources]

Challenges to comparing data sets

• Need to match genes/transcripts across platforms
• Data from third parties not always easy to handle
• Varying scales, normalization, etc.
• Varying data quality
• Varying levels of “raw data” available
• Selecting appropriate data to compare

With Cincinnati Children’s Hospital (D. Glass, M. Barnes et al.)

[Figure: histogram of frequency against fraction of probes with alignments (0.0 to 1.0)]

[Figure: histogram of frequency against fraction of non-specific probes (0.0 to 1.0)]

Probe specificity (or lack thereof)

Which data sets are reasonable to compare?

All mouse data sets

Mouse brain data sets

Mouse neocortex data sets

Mouse neocortex data sets examining stress

Mouse neocortex data sets examining hypoxic stress

Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia

Too general, but lots of power

Very specific, low power

Expression experiments: 519
• Mus musculus: 254
• Homo sapiens: 203
• Rattus norvegicus: 62

Array designs: 178

Assays (i.e., chips): 20837

Coexpression links (probe-level): >100 million

Scaling up analysis of gene coexpression

• Genes that are coexpressed tend to have related function
• Needed at the same place at the same time
• “Guilt by association”

• Reasonable to compare across studies

[Figure: expression across samples for two ribosomal protein genes. Eisen et al., 1998 PNAS]
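A minimal sketch of how coexpression between two genes can be scored, assuming a plain Pearson correlation over matched samples. The talk does not say which correlation measure was used, and the expression values below are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Made-up expression values for two hypothetical genes across six samples.
gene_a = [2.1, 3.4, 1.0, 4.2, 3.9, 0.8]
gene_b = [1.9, 3.1, 1.2, 4.0, 4.1, 1.0]
r = pearson(gene_a, gene_b)
```

Two profiles that rise and fall together across samples give a value near 1; profiles that move in opposite directions give a value near -1.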

Biological noise

• Induced gene expression effects are often small.
• Gene expression varies between “replicates” in biologically meaningful ways.
• Allows us to repurpose data

Sample type

Functional coexpression should be (somewhat) generalized

• If two genes are coexpressed under one condition, they will probably be coexpressed under at least some other conditions (or data sets).

• Coexpression seen “only once” needs special care in interpretation.

• We shouldn’t expect coexpression to be perfectly reproducible (for biological and technical reasons)

Correlation

Genome Research, June 2004

A simple approach: count recurring patterns

Pipeline for one dataset

Proof of concept analysis

• 60 human data sets, 15700 RefSeq genes.
• 70% cancer data
• 11 million “links”
• About 9.7 million different links
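A rough sketch of counting recurring links, assuming each data set has already been reduced to a set of gene pairs ("links") that passed some correlation threshold. The gene names and data are hypothetical placeholders, and the talk does not describe the actual implementation.

```python
from collections import Counter

def link(gene_a, gene_b):
    """Order-independent key for a gene pair."""
    return tuple(sorted((gene_a, gene_b)))

def count_support(datasets):
    """Map each link to the number of data sets it appears in."""
    support = Counter()
    for links in datasets:
        support.update(set(links))  # each data set votes at most once per link
    return support

def links_with_min_support(support, max_k):
    """For k = 1..max_k, count links seen in at least k data sets."""
    return {k: sum(1 for s in support.values() if s >= k)
            for k in range(1, max_k + 1)}

# Toy input: three hypothetical data sets, 7 link occurrences in total.
toy_datasets = [
    [link("geneA", "geneB"), link("geneA", "geneC"), link("geneD", "geneE")],
    [link("geneB", "geneA"), link("geneE", "geneD")],
    [link("geneD", "geneE"), link("geneA", "geneF")],
]
support = count_support(toy_datasets)
curve = links_with_min_support(support, max_k=3)
# 7 occurrences collapse to 4 different links; curve == {1: 4, 2: 2, 3: 1}
```

The gap between total occurrences and different links is what the "11 million" versus "about 9.7 million" figures express: some links are seen in more than one data set.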

Many links are replicated across studies

[Figure: number of links (log scale, 1 to 10^7) against the minimum number of data sets a link is seen in (log scale, 1 to 100). Series: Observed; Shuffled database (mean)]
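One way a "shuffled database" baseline could be built is sketched here, under an assumption the talk does not confirm: gene labels are randomly permuted within each data set, independently, so each data set keeps its number of links but any agreement across data sets is down to chance. Function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def min_support_curve(datasets, max_k):
    """Number of links seen in at least k data sets, for k = 1..max_k."""
    support = Counter()
    for links in datasets:
        support.update(set(links))
    return [sum(1 for s in support.values() if s >= k)
            for k in range(1, max_k + 1)]

def shuffle_dataset(links, genes, rng):
    """Randomly relabel genes; the number of links in the data set is kept."""
    permuted = list(genes)
    rng.shuffle(permuted)
    relabel = dict(zip(genes, permuted))
    return {tuple(sorted((relabel[a], relabel[b]))) for a, b in links}

def mean_shuffled_curve(datasets, genes, max_k, n_shuffles=200, seed=0):
    """Average the support curve over many independently shuffled databases."""
    rng = random.Random(seed)
    totals = [0.0] * max_k
    for _ in range(n_shuffles):
        shuffled = [shuffle_dataset(links, genes, rng) for links in datasets]
        for i, count in enumerate(min_support_curve(shuffled, max_k)):
            totals[i] += count
    return [t / n_shuffles for t in totals]

# Toy input: the same two links appear in all three hypothetical data sets.
genes = [f"gene{i}" for i in range(8)]
toy = [{("gene0", "gene1"), ("gene2", "gene3")} for _ in range(3)]
observed = min_support_curve(toy, max_k=3)            # [2, 2, 2]
null_mean = mean_shuffled_curve(toy, genes, max_k=3)  # chance-level curve
```

Replication well above the shuffled mean at high support is the pattern the Observed and Shuffled series are meant to contrast.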

Evaluation on biological grounds

Cluster involving NMDAR1 (GRIN1)

ATP6V0A1, PLD3

GRIN1

Allen Brain Institute

Application: analysis of imprinted genes

Laurent Journot, INSERM – Universités Montpellier

Ewing et al., 2007 Molecular Systems Biology

[Figure axis: correlation p-value]

LYAR interacting proteins

LYAR-interactors

Vote counting limitations

• Weak evidence distributed across data sets will not be picked up.

• This example meets strict “vote counting” criteria in only 2/23 data sets

[Figure: Correlation]

[Figure: global effect size (-1.0 to 1.0) against support (number of datasets, 2 to 14)]

[Figure: correlation (global) against support (# of datasets)]

[Figure: gene pairs by datasets]
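A sketch of how weak evidence spread across data sets can be pooled into one global effect size, assuming a fixed-effect combination of Fisher z-transformed correlations weighted by sample size. This is a standard meta-analysis technique chosen for illustration, not necessarily the method behind the figures, and the numbers are made up.

```python
import math

def combine_correlations(results):
    """Pool per-data-set correlations into a global correlation and z-score.

    results: list of (r, n) pairs, one per data set, where r is the
    correlation for a gene pair and n is the number of samples.
    Assumes independent data sets (fixed-effect model).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in results:
        if n <= 3:
            raise ValueError("each data set needs more than 3 samples")
        if not -1.0 < r < 1.0:
            raise ValueError("correlations must lie strictly between -1 and 1")
        weight = n - 3  # inverse variance of the Fisher z-transform
        weighted_sum += weight * math.atanh(r)
        total_weight += weight
    z_mean = weighted_sum / total_weight
    z_score = z_mean * math.sqrt(total_weight)  # ~N(0, 1) if no correlation
    return math.tanh(z_mean), z_score

# Made-up case: r = 0.3 in each of 23 hypothetical data sets of 20 samples.
weak = [(0.3, 20)] * 23
r_global, z_score = combine_correlations(weak)
single_z = math.atanh(0.3) * math.sqrt(20 - 3)  # one data set on its own
```

In this made-up case no single data set clears a conventional threshold (single_z is below 1.96), so strict vote counting would see little support, while the pooled z-score is large.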

Related work: Zhou XJ et al., Nat.Biotech 2005

Summary

• Reuse of public data: ‘adding value’
• Meta-analysis of coexpression
• Some applications

• Functional prediction
• Candidate identification
• Platform evaluation

Ongoing and future work

• Applications and analyses

• Protein interactions and hubs
• Prediction of gene function at the synapse
• Differential expression analysis

• Regionalization
• Mouse models of brain injury
• Mouse models of psychosis

• Expanding our public database and software: http://www.bioinformatics.ubc.ca/Gemma
Web-based tools for biologists; web services coming soon

• Integration with other information sources

Thanks

Gemma: Xiang Wan, Kelsey Hamer, Luke McCarthy, Kiran Keshav, Suzanne Lane, Meeta Mistry, Jesse Gillis

Joseph Santos, Gozde Cozen, David Quigley, Anshu Sinha, Spiro Pantazatos, Wei-Keat Lim

Tmm: Homin Lee, Amy Hsu, Jon Sajdak, Jie Qin, Tzu-Lin Hsaio

And to:

NCBI GEO team

Groups who made data available

Collaborators who provided data prior to publication

Conrad Gilliam

Abraham Palmer

Andreas Kottmann

Etienne Sibille

Collaborators: Barclay Morrison, Joseph Gogos, Michael Hayden, Blair Leavitt, Tony Blau, Panos Papapanou

Answers to FAQs

• No, they don’t have to be time course experiments.
• Yes, we’re using cDNA as well as Affymetrix etc.
• Yes, we see reproducible negative correlations.
• Yes, we’re interested in finding differences as well as similarities between data sets.
• No, we aren’t necessarily inferring regulatory relationships.
• Yes, we know that RNA is just one way of measuring cell state.
• No, we don’t have {worm,fly,yeast…} data, but we’d like to.