Thanks to the Lipper Center for Computational Genetics
Government and private grant agencies: NHLBI,
NSF, ONR, DOE, DARPA, HHMI, Armenise
Corporate collaborators & sponsors:
Affymetrix, GTC, Mosaic, Aventis, Dupont, Cistran
CHI Macroresults through Microarrays 3
George Church 1-May-02
Array quantitation for modeling mutations affecting RNA, protein interactions & cell
proliferation.
gggatttagctcagttgggagagcgccagactgaa gatttg gaggtcctgtgttcgatccacagaattcgcacca
Post- 300 genomes &
3D structures
DNA RNA Protein: in vivo & in vitro interactions
Metabolites
Replication rate
Environment
Biosystems Measures & Models
Microbes; Cancer & stem cells; Darwinian in vitro replication; Small multicellular organisms
RNAi; Insertions; SNPs
Functional Genomics Challenges
• Systems dynamics and optimality modeling.
• Multiple genetic domains per gene: high density readout of whole genome mutant phenotypes.
• Multiple RNAs & regulatory proteins per gene.
• Many causative genes & haplotypes per disease.

• Polony RNA exon-typing
• Multiplex in situ RNA & protein analyses
• Automated differentiation
• Homologous recombination genome engineering
Human Red Blood Cell
ODE model
200 measured parameters
[Figure: metabolic network of the model, covering glycolysis (GLC to PYR and LAC, with 1,3 DPG and 2,3 DPG), the pentose phosphate pathway (GL6P, GO6P, RU5P, R5P, X5P, S7P, E4P), nucleotide metabolism (ADO, INO, AMP, IMP, ADE, HYPX, PRPP, R1P), glutathione (GSH/GSSG), cofactors (ATP/ADP, NADH/NAD, NADPH/NADP), and K+, Na+, Cl-, HCO3-, and pH]
Jamshidi, Edwards, Fahland, Church, Palsson, B.O. (2001) Bioinformatics 17: 286. (http://atlas.med.harvard.edu/gmc/rbc.html)
Modeling suboptimality:
Segre, Edwards, Vitkup
[Figure: Sauer data and FBA fluxes comparison. Calculated & Observed Fluxes in wt: calculated flux (LP wt) plotted against observed fluxes in wt (Sauer wild type), both axes 0 to 200, points numbered 1 to 18. Wild type, C 0.4-limited, CC = 0.97.]
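The CC quoted for the calculated versus observed fluxes is a correlation coefficient. A minimal sketch of how such a value can be computed, assuming the Pearson coefficient is meant (the slide does not say) and using made-up flux values rather than the Sauer data:

```python
import math

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical flux values (arbitrary units), NOT the data behind the figure.
observed = [100.0, 72.0, 60.0, 15.0, 5.0]
calculated = [100.0, 80.0, 55.0, 20.0, 2.0]

cc = pearson_cc(observed, calculated)
print(f"CC = {cc:.2f}")
```

A CC near 1 means the calculated fluxes rise and fall with the observed ones; it does not by itself show that the absolute values agree.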
Replication rate of a whole-genome set of mutants
Badarinarayana, et al. (2001) Nature Biotech.19: 1060
Replication rate challenge met: multiple homologous domains
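[Figure: probes across the homologous domains of thrA (domains 1-3), metL (domains 1-3), and lysC (domains 1-2); values shown for the probes: 1.1, 6.7, 1.8, 1.8, and 10.4]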
Selective disadvantage in minimal media
Multiple mutations per gene
Correlation between two selection experiments
Badarinarayana, et al. (2001) Nature Biotech.19: 1060
Comparison of selection data with Flux Balance Optimization predictions on 488 genes
Predictions            Number of genes   Negatively selected   Not negatively selected
Essential                    143                 80                      63
Reduced growth rate           46                 24                      22
Non essential                299                119                     180
P-value Chi Square = 0.004
Novel duplicates?
Position effects, toxin accumulation, non-opt?
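The chi-square P-value can be checked from the counts on the slide. This is a minimal sketch, assuming a standard Pearson chi-square test of independence on the 3 x 2 table (the slide does not name the exact test); the function name is illustrative:

```python
import math

# Rows: FBA prediction. Columns: (negatively selected, not negatively selected).
observed = {
    "essential": (80, 63),
    "reduced growth rate": (24, 22),
    "non essential": (119, 180),
}

def chi_square(table):
    """Pearson chi-square statistic, degrees of freedom, and grand total."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof, total

stat, dof, total = chi_square(observed)

# With exactly 2 degrees of freedom the chi-square tail probability is exp(-x/2).
assert dof == 2
p_value = math.exp(-stat / 2)
print(f"genes = {total}, chi-square = {stat:.2f}, P = {p_value:.4f}")
```

The closed form exp(-x/2) only holds for 2 degrees of freedom; a table of another shape would need a general chi-square survival function.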
RNA quantitation issues
Small fold changes in RNA are important. Example: 1.5-fold in trisomies.
Cross-hybridizing RNAs. Alternative RNAs, gene families.
Mixed tissues. In situ hybridization has low multiplex.
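The trisomy figure follows from gene dosage: three copies instead of two. A minimal sketch, assuming RNA level scales linearly with gene copy number (an assumption for illustration, not a claim from the slide):

```python
import math

def dosage_fold_change(copies, reference_copies=2):
    """Expected fold change if RNA level is proportional to gene copy number."""
    return copies / reference_copies

fold = dosage_fold_change(3)   # trisomy: 3 copies vs. the usual 2
log2_fold = math.log2(fold)
print(f"fold change = {fold}, log2 = {log2_fold:.3f}")
```

On a log2 scale this is a shift of well under one unit, which is why measurement noise matters so much for changes of this size.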
Gene Expression database Aach, Rindone, Church, (2000) Genome Research 10: 431-445.
• Microarrays [1]: experiment vs. control; R/G ratios; R, G values; quality indicators
• Affymetrix [2]: PM and MM 25-mers for each ORF; averaged PM-MM; “presence”; feature statistics
• Lynx-MPSS [3], SAGE [4]: counts of 14-mers sequence tags for each ORF (figure label: agactagcag)

1 DeRisi, et al., Science 278:680-686 (1997)
2 Lockhart, et al., Nat Biotech 14:1675-1680 (1996)
3 Brenner et al., Massively Parallel Signature Sequencing, Nat Biotechnol. 18:630-4 (2000)
4 Velculescu, et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)
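Two of the stored quantities, R/G ratios and averaged PM-MM, are simple to compute. A minimal sketch with hypothetical intensities; the function names and the choice of a log2 scale for the ratio are assumptions, not taken from the database:

```python
import math

def log2_ratio(red, green):
    """log2(R/G) for one spot on a two-color microarray."""
    return math.log2(red / green)

def averaged_pm_mm(pm, mm):
    """Mean of (PM - MM) over the probe pairs that cover one ORF."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need matching, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities, for illustration only.
spot = log2_ratio(1500.0, 1000.0)
avg_diff = averaged_pm_mm([820.0, 640.0, 910.0, 700.0],
                          [300.0, 280.0, 410.0, 350.0])
print(f"log2(R/G) = {spot:.3f}, averaged PM-MM = {avg_diff}")
```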
RNA Cluster Analyses: Cell Cycle
[Figure: cluster “Replication & DNA synthesis (2)”. Panels: expression profile (y-axis: s.d. from mean); number of ORFs per cluster (clusters 1 to 30) for the MCB and SCB motifs; number of sites versus distance from ATG (b.p.).]

MIPS functional category (total ORFs)    ORFs within functional category (k)    -Log10 P-value
DNA synthesis and replication (82)                      23                            16
Cell cycle control and mitosis (312)                    30                             8
Recombination and DNA repair (84)                       11                             5
Nuclear organization (720)                              40                             4
N = 186
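Enrichment P-values of this kind can be computed as a hypergeometric tail: the chance of seeing at least k category members in a cluster of N ORFs. A minimal sketch; the genome-wide ORF count of 6220 is an assumption for illustration and the slide does not state which test was used:

```python
import math

def hypergeom_tail(k, n, K, G):
    """P(X >= k) when n ORFs are drawn without replacement from G ORFs,
    K of which belong to the functional category."""
    upper = min(n, K)
    total = math.comb(G, n)
    hits = sum(math.comb(K, i) * math.comb(G - K, n - i)
               for i in range(k, upper + 1))
    return hits / total

GENOME_ORFS = 6220  # assumed total number of ORFs, not given on the slide

# First table row: 23 of the 186 clustered ORFs fall in a category of 82 ORFs.
p = hypergeom_tail(k=23, n=186, K=82, G=GENOME_ORFS)
print(f"-log10(P) = {-math.log10(p):.1f}")
```

Exact integer binomials keep the calculation stable even though the individual terms are far outside floating-point range.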
Tavazoie, et al. 1999 Nature Genetics 22:281.
Combining mouse knockouts with RNA array analysis
(homeobox gene Crx-/-)
Livesey, Furukawa, Steffen, Church, Cepko (2000) Current Biol. 10:301.
Combinatorial arrays for binding constants: Human/Mouse EGR1
HMS: Martha Bulyk, Xiaohua Wang, Martin Steffen
MRC: Yen Choo
[Figure: combinatorial DNA-binding protein domains displayed on phage (pVIII, pIII) bind a ds-DNA array and are detected with antibodies (phycoerythrin-2º IgG). Martha Bulyk et al.]
Interactions of Adjacent Basepairs in EGR1 Zinc Finger DNA Recognition
Isalan et al., Biochemistry (‘98) 37:12026-12033
Wildtype EGR1 Microarray
Array layout: high [DNA], (+) ctrl sequence for wt binding, alignment oligos, etc.
Finger variants shown: Wildtype RSDHLTT, RGPDLAR, REDVLIR, LRHNLET, KASNLVS
Binding constants shown: TGG 2.8 nM; GCG 16 nM; 2.5 nM; TAT 5.7 nM; AAT 240 nM
Sites listed: AAA, AAT, ACT, AGA, AGC, AGT, CAT, CCT, CGA, CTT, TTC, TTT
Motifs weight all 64 Ka(app)
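One way to read "motifs weight all 64" is that each of the 64 possible triplet sites contributes to the motif in proportion to its apparent Ka. A minimal sketch under that reading; the Ka values below are hypothetical and the function name is illustrative:

```python
from itertools import product

BASES = "ACGT"
ALL_TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]  # 64 sites

def ka_weighted_motif(ka_by_triplet):
    """Per-position base weights, each triplet weighted by its apparent Ka."""
    total = sum(ka_by_triplet.values())
    motif = [{b: 0.0 for b in BASES} for _ in range(3)]
    for triplet, ka in ka_by_triplet.items():
        for pos, base in enumerate(triplet):
            motif[pos][base] += ka / total
    return motif

# Hypothetical apparent Ka values: one strong site, all others weak.
ka = {t: 1.0 for t in ALL_TRIPLETS}
ka["TGG"] = 100.0

motif = ka_weighted_motif(ka)
for pos, weights in enumerate(motif, start=1):
    best = max(weights, key=weights.get)
    print(f"position {pos}: best base {best} (weight {weights[best]:.2f})")
```

Because every site contributes, weak binders still shape the motif instead of being discarded by a cutoff.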
Common diseases: billions of “new” alleles plus millions of balanced polymorphisms
• 60 new mutations per generation * 5,000 generations since major bottleneck(s) which set up the linkage patterns (=300,000 per genome)
• Each of the 3 Gbp in the genome exists in all SNP forms (A, C, G, T), with 600,000 copies of each SNP on earth (spread over the common haplotypes). The population frequency will be <0.01%. (Aach et al, 2001 Nature 409: 856)
• Functional genomics (FG) may provide better leads for therapies & diagnostics. (Accuracy goal 1 ppb?)
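The numbers in these bullets can be checked with a few lines of arithmetic. A rough sketch, assuming a world population of about 6 billion diploid people (my assumption for the early 2000s; the slide does not give the figure it used):

```python
MUTATIONS_PER_GENERATION = 60
GENERATIONS_SINCE_BOTTLENECK = 5_000
new_mutations_per_genome = MUTATIONS_PER_GENERATION * GENERATIONS_SINCE_BOTTLENECK

COPIES_OF_EACH_SNP = 600_000
WORLD_POPULATION = 6_000_000_000        # assumption, not from the slide
genome_copies = 2 * WORLD_POPULATION    # two copies of each autosome per person

frequency_percent = 100 * COPIES_OF_EACH_SNP / genome_copies
print(f"new mutations per genome: {new_mutations_per_genome:,}")
print(f"population frequency of each SNP: {frequency_percent:.3f}%")
```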
Projected costs affect our view of what is possible.
In 1985, the dawn of the genome project, $10 per bp would have been $30B per genome.
In 2002, Perlegen or Lynx: $3M (10^3 bits/$, 4 logs)
In 2001, the cost of video data collection? 10^13 bits/$
Genotyping & functional genomics demand will probably be as high as permitted by costs.
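The cost figures can be reproduced approximately. A rough sketch, assuming a 3 Gbp genome and 2 bits per base pair (four bases); both conversion choices are mine, made to match the slide's orders of magnitude:

```python
import math

GENOME_BP = 3_000_000_000
BITS_PER_BP = 2                      # assumption: 4 possible bases = 2 bits

cost_1985 = 10 * GENOME_BP           # $10 per bp
cost_2002 = 3_000_000                # $3M per genome
bits_per_dollar_2002 = GENOME_BP * BITS_PER_BP / cost_2002
logs_saved = math.log10(cost_1985 / cost_2002)

VIDEO_BITS_PER_DOLLAR = 1e13         # figure quoted for 2001
gap_logs = math.log10(VIDEO_BITS_PER_DOLLAR / bits_per_dollar_2002)

print(f"1985 cost per genome: ${cost_1985:,}")
print(f"2002: {bits_per_dollar_2002:.0f} bits/$, {logs_saved:.0f} logs cheaper")
print(f"gap to video data collection: about {gap_logs:.1f} logs")
```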
Why lower-cost, high quality “sequencing”?
• Environmental, food, & biodiversity monitoring
• Human genome haplotyping
• RNA splicing & editing
• Immune B & T cell receptor spectra

& How?
• Femtoliter (10^-15) scale & low-cost scanners
• Polymerase DNA colonies (polonies)
• Fluorescent in situ sequencing (FISSEQ)
Mitra & Church Nucleic Acids Res. 27: e34
[Figure: polony amplification. A single molecule from the library, flanked by primer sites A’ and B, goes through the 1st round of PCR; the primer is extended by polymerase.]
Primer A has 5’ immobilizing (Acrydite) modification.
Sequence polonies by sequential, fluorescent single-base extensions
1. Remove 1 strand of DNA.
2. Hybridize Universal Primer.
3. Add Red (Cy3) dTTP.
4. Wash; Scan Red Channel.
5. Add Green (FITC) dCTP.
6. Wash; Scan Green Channel.
[Figure: two polony templates (AGT.. and GCG..), each annealed to the universal primer B’. After step 3 the primer on the AGT.. template is extended by T; after step 5 it reads TC, and the primer on the GCG.. template is extended by C.]
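The extension cycles can be mimicked in a few lines. This is a simplified sketch, not the laboratory protocol: it assumes at most one base is added per cycle and ignores runs of identical template bases; the function name is illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def run_extensions(templates, additions):
    """Simulate sequential single-base extensions on several polonies.

    templates: template bases in the order the polymerase meets them.
    additions: the labeled dNTP supplied in each cycle, e.g. ["T", "C"].
    Returns the bases added per polony and, per cycle, which polonies lit up.
    """
    extended = ["" for _ in templates]
    signals = []
    for base in additions:
        lit = []
        for i, template in enumerate(templates):
            pos = len(extended[i])  # next unread template position
            if pos < len(template) and COMPLEMENT[template[pos]] == base:
                extended[i] += base
                lit.append(i)
        signals.append(lit)
    return extended, signals

# The two templates from the figure, with dTTP added first and dCTP second.
extended, signals = run_extensions(["AGT", "GCG"], ["T", "C"])
print(extended)  # bases read so far on each polony
print(signals)   # polonies that gained signal in each cycle
```

The first polony lights up in both cycles and the second only in the dCTP cycle, which is the pattern the scans in steps 4 and 6 would record.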
Polony Template
3’ P’
P5’ A ATA CAA TTCACACAGGAAACAGCTATGA CATT CTATTGTTAAAGTGTGTCCTTTGTCGATACTGGTA…5’
FITC (C), CY3 (T)
Mean Intensity (FITC, CY3): 58, 0.5; 40, 6.5; 0.3, 48; 0.4, 43
Primer Extension 26 cycles, 34 Nucleotides
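If the mean intensities are read as (FITC, CY3) pairs, which is my interpretation of the slide layout, a base can be called from the brighter channel. A minimal sketch; the `min_ratio` threshold is a hypothetical parameter, not a value from the talk:

```python
def call_base(fitc, cy3, min_ratio=3.0):
    """Call C (FITC) or T (CY3) from the brighter channel, or N if ambiguous."""
    high, low = max(fitc, cy3), min(fitc, cy3)
    if high == 0 or high < min_ratio * low:
        return "N"  # no signal, or channels too close to separate
    return "C" if fitc > cy3 else "T"

# Mean intensities from the slide, read as (FITC, CY3) pairs.
pairs = [(58, 0.5), (40, 6.5), (0.3, 48), (0.4, 43)]
calls = [call_base(fitc, cy3) for fitc, cy3 in pairs]
print(calls)
```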
RNA Exon typing
•Single molecules of RNA dispersed.
•Multiplex polonies spanning all likely variable exons
•Sequential probing of each exon.
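Probing each exon in turn gives every polony a presence/absence pattern, which identifies the splice form of the single RNA molecule it came from. A minimal sketch of the bookkeeping, with hypothetical exon names and probe results:

```python
from collections import Counter

def exon_signature(probe_hits, exons):
    """Presence/absence string for one polony, one character per probed exon."""
    return "".join("1" if probe_hits.get(exon, False) else "0" for exon in exons)

EXONS = ["e1", "e2", "e3"]  # hypothetical variable exons

# Hypothetical probing results: one dict per polony.
polonies = [
    {"e1": True, "e2": True, "e3": True},
    {"e1": True, "e3": True},               # exon e2 skipped
    {"e1": True, "e2": True, "e3": True},
]

counts = Counter(exon_signature(p, EXONS) for p in polonies)
print(counts)
```

Counting signatures over many polonies gives the relative abundance of each exon combination, which averaged measurements over mixed RNA cannot resolve.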
Functional Genomics Challenges
• Systems dynamics and optimality modeling.
• Multiple genetic domains per gene: high density readout of whole genome mutant phenotypes.
• Multiple RNAs & regulatory proteins per gene.
• Many causative genes & haplotypes per disease.

• Polony RNA exon-typing
• Multiplex in situ RNA & protein analyses
• Automated differentiation
• Homologous recombination genome engineering
For more information:
arep.med.harvard.edu