Genome Function Project
We thank the following for support:
Government and private grant agencies: NHLBI,
NSF, ONR, DOE, DARPA, HHMI, Lipper, Armenise
Corporate collaborators & sponsors:
Affymetrix, GTC, Mosaic, Aventis, Dupont
UCSC, George Church, 24 Aug 2001
gcggatttagctcagttgggag agcgccagact gaagatttgga ggtcctgtgtt cgatccacagaattcgcacca
Post-Structural Genomics
Data
DNA RNA Protein
Metabolites
Growth rate
Expression
Interactions
Environment
Function Genomics Measures & Models
Exponential technologies
1993 first browser 1994 commercial www
Agenda
1. mapping human variation (haplotype map)
2. obtaining a complete and validated set of human genes, including
   - multiple alleles, transcripts, protein or structural RNA products
   - regulatory elements
3. understanding the diversity of life through genomic analysis of many organisms, and understanding how one organism works by comparative genomics with others
   - how genomes evolved
4. creating a new quantitative systems biology, beyond drawing circles and arrows on paper and labeling them with names nobody can remember
   - mapping the key interactions
   - mathematical/computational models of pathways and systems
   - dealing with multiple levels from atoms to cells
In vitro minigenome
Steve Blackwell, HMS: pure IF, EF
Tony Forster, BWH: tRNAs & modified bases
Manz Ehrenberg, Dieter Soll: tRNA-synthetases
Josh LaBaer, HMS-HIP: Expression constructs
Jingdong Tian, HMS: Protein synthesis
Rob Mitra & Xiaohua Huang, HMS: Polymerases, RCA
Gloria Culver, Iowa State: ribosomal proteins & rRNA
Harry Noller, UCSC: ribosomes
In vitro minigenome
A) From atoms to evolving minigenomes and cells. This could improve in vitro transcription/translation/replication systems and conceptually link atomic (mutational) changes via molecular and systems modeling to population evolution. The synthesis of pure systems of proteins with natural or novel modifications would be of great significance. This could give an incredible focus to structural genomics.
B) From cells to tissues. Modeling the effects of combinations of membrane signals and genome-programming on RNA and protein expression profiles would allow, among other things, manipulating stem-cell fate and stability. Stability would be key both to cell culture and to long-term avoidance of cancerous stem-cell proliferation. The ability of "programmed" cells to replace or augment small molecule drugs could be rigorously assessed.
C) From tissues to systems. Computational programming of cell and tissue morphology can develop quantitative concepts in complexity, chaos, robustness, and evolvability to engineer useful models, such as sensor-effector neural feedback systems, where macro aspects of the system determine the past (Darwinian) or future (prosthetic) function of the altered genomes.
Grand Challenges: goals (& details)
• The Manhattan Project ’43-45: Nuclear chain reaction (without igniting the atmosphere)
• The Apollo Project ’62-69: Send a person to the moon (& back)
• The Smallpox Eradication ’66-77: from the whole globe (including freezers)
• The Human Genome Project ’90-05: 3 billion bases (at 99.99% accuracy & searchable)
Grand Challenges: goals (& details)
• The Manhattan Project ’43-45: Nuclear chain reaction (without igniting the atmosphere)
• The Apollo Project ’62-69: Send a person to the moon (& back)
• The Smallpox Eradication ’66-77: from the whole globe (including military freezers?)
• The Human Genome Project ’90-05: 3 billion bases (at 99.99% accuracy with comparisons)
• The BioSystems Project ’02- ??
Potential BioSystems Project Challenges
Programming smart biomaterials:
1. 0.1 nanometer positioning at 1 kHz in a 50 nm cube (Foresight Feynman Challenge)
2. I/O to sub-nano memory in DNA
Programming cells & populations:
3. 10 sec. mini-cell cycle, 85 kbp genome
4. Bioremediation microbial populations
Programming ourselves:
5. Drug structure-activity prioritization
6. Universal, non-aging human stem cells
Why the genome project worked
Polymer synthesis & sequencing: Hood ’75-00, Hunkapiller ’77-00, Carruthers ’79...
Shotgun & mapping: Sanger ’77, Brenner ’72-02, Sulston ’90, Olson ’80-00...
Sequence searching: Ulam ’61-74, Staden ’79, Lipman ’87, Myers ’87, Green ’93...
Chemistry: Tabor ’93, Karger ’94, Mathies ’96, Mullis ’84...
Infrastructure: Wada ’82, DeLisi ’84, Gilbert ’87, Watson ’88, Venter ’91...
Metrics for structural & functional data
                   Automate  Data quality             Model quality                   Similarity search
X-ray diffraction  1960      resolution < 0.2 nm      |o-c|/o, R < 0.2                DALI, etc.
Sequence           1988      discrepancy < 0.01% (bp) conserved proteins              BLAST
Expression         1999      cc, t-test               shared motifs, shared function  Biclustering
Interact/growth              outliers                 optimality                      as above?
Types of Systems Interaction Models
Quantum Electrodynamics subatomicQuantum mechanics electron cloudsMolecular mechanics spherical atoms nm-fsMaster equations stochastic single molecules Fokker-Planck approx. stochasticMacroscopic rates ODE Concentration & time (C,t) Flux Balance Optima dCik/dt optimal steady state Thermodynamic models dCik/dt = 0 k reversible reactions
Steady State dCik/dt = 0 (sum k reactions) Metabolic Control Analysis d(dCik/dt)/dCj (i = chem.species) Spatially inhomogenous dCi/dx Population dynamics as above km-yr
Increasing scope, decreasing resolution
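A minimal sketch of the steady-state condition listed above (dC/dt = 0, summed over reactions), checking a flux vector against a stoichiometric matrix. The toy network (uptake -> A, A -> B, B -> export), the matrix S, and the function name `concentration_rates` are illustrative assumptions, not taken from the talk.

```python
def concentration_rates(S, v):
    """Return dC_i/dt = sum_k S[i][k] * v[k] for each chemical species i."""
    return [sum(s_ik * v_k for s_ik, v_k in zip(row, v)) for row in S]

# Hypothetical toy network: uptake -> A, A -> B, B -> export.
# Rows are species (A, B); columns are reactions (uptake, conversion, export).
S = [
    [1, -1, 0],   # A: made by uptake, consumed by conversion
    [0, 1, -1],   # B: made by conversion, consumed by export
]

balanced = [2.0, 2.0, 2.0]     # every flux equal: no net accumulation
unbalanced = [2.0, 1.0, 1.0]   # uptake exceeds conversion: A accumulates

rates_balanced = concentration_rates(S, balanced)
rates_unbalanced = concentration_rates(S, unbalanced)
print(rates_balanced)    # [0.0, 0.0]
print(rates_unbalanced)  # [1.0, 0.0]
```

A flux-balance optimum would add an objective (for example, maximizing the export flux) over all flux vectors that satisfy this constraint; that optimization step is not shown here.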
Sources of Data for BioSystems Modeling:
Capillary electrophoresis, $300,000 (DNA Sequencing): 0.4 Mb/day
Chromatography-Mass Spectrometry (eg. peptide LC-ESI-MS): 20 Mb/day
Microarray scanners (eg. RNA): 300 Mb/day
Reagent costs:
Electrophoresis (DNA Sequencing): 10 ul per 0.5 Kb
Microarray reactions: 10 ul per 1000 Kb
Intel CMOS microscope: $99
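The figures above imply large ratios between methods. The short calculation below makes them explicit, using only the numbers quoted on the slide; the dictionary and variable names are illustrative.

```python
# Throughput figures as quoted on the slide (Mb/day).
throughput_mb_per_day = {
    "capillary electrophoresis": 0.4,
    "LC-ESI-MS": 20.0,
    "microarray scanner": 300.0,
}

# Reagent use: 10 ul per 0.5 Kb (electrophoresis) vs 10 ul per 1000 Kb (microarray).
ul_per_kb = {
    "electrophoresis": 10 / 0.5,
    "microarray": 10 / 1000,
}

scanner_vs_capillary = (
    throughput_mb_per_day["microarray scanner"]
    / throughput_mb_per_day["capillary electrophoresis"]
)
reagent_ratio = ul_per_kb["electrophoresis"] / ul_per_kb["microarray"]

print(round(scanner_vs_capillary), round(reagent_ratio))
```

On these numbers a microarray scanner produces about 750 times more data per day than capillary electrophoresis, and a microarray reaction uses about 2000 times less reagent per Kb.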
RNA quantitation
Aach, Rindone, Church (2000) Genome Research 10: 431-445.
• Microarrays (1): experiment vs. control for each ORF
  – R/G ratios
  – R, G values
  – quality indicators
• Affymetrix (2): 25-mers, PM and MM probes for each ORF
  – Averaged PM-MM
  – “presence”
  – feature statistics
• SAGE (3): SAGE tags from each ORF joined into concatamers
  – Counts of SAGE 14-mer sequence tags for each ORF
1 DeRisi, et al., Science 278:680-686 (1997)
2 Lockhart, et al., Nat Biotech 14:1675-1680 (1996)
3 Velculescu, et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)
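The three platforms report different primary quantities. The sketch below shows one plausible way to compute each: a log2 R/G ratio, an averaged PM-MM difference, and SAGE tag counts. The function names, the intensities, and the tag sequences are made up for illustration, and treating the red channel as the experiment is an assumption.

```python
import math

def log2_ratio(red, green):
    """Two-color microarray: log2 of the R/G ratio (red assumed to be the experiment)."""
    return math.log2(red / green)

def average_difference(pm, mm):
    """Affymetrix-style summary: mean of PM - MM over one probe set."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def tag_counts(tags):
    """SAGE: count how often each sequence tag was observed."""
    counts = {}
    for tag in tags:
        counts[tag] = counts.get(tag, 0) + 1
    return counts

# Made-up intensities and 14-mer tags, for illustration only.
ratio = log2_ratio(red=1500.0, green=1000.0)   # a 1.5-fold change
avg_diff = average_difference(pm=[900, 1100, 1000], mm=[300, 500, 400])
counts = tag_counts(["CATGAAACCCGGGT", "CATGTTTGGGCCCA", "CATGAAACCCGGGT"])

print(round(ratio, 3), avg_diff, counts)
```

A 1.5-fold ratio corresponds to about 0.585 on the log2 scale, which is why small ratios need careful error estimates before they are dismissed.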
Array opportunities
• 22 bp ds-RNAi array modulates single cell type
• Drug array time-release or photo-release
• Primer pair arrays for haplotyping
• Gene & genome synthesis (DARPA)
Polypeptide arrays
Photo-deprotect peptides (Affymax)
Piezo or contact spotting (Harvard-CGR, Stanford)
Phage or ribosome display capture (Bulyk)
In situ ribosomal synthesis (Tian)
Harvard Inst. Proteomics, FLEXGene consortium
[Figure: polony amplification. Panels: Single Molecule From Library; 1st Round of PCR; Primer is Extended by Polymerase. Strands are labeled with primer sites A, A’ and B.]
Primer A has 5’ immobilizing (Acrydite) modification.
Sequence polonies by sequential, fluorescent single-base extensions
1. Remove 1 strand of DNA.
2. Hybridize Universal Primer.
3. Add Red (Cy3) dTTP.
4. Wash; Scan Red Channel
5. Add Green (FITC) dCTP
6. Wash; Scan Green Channel
[Figure: universal primer B’ annealed to site B on two templates (3’ to 5’), AGT.. and GCG..; T is added on the AGT.. template in the red step and C on the GCG.. template in the green step.]
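The cycle of adding one labeled nucleotide, washing, and scanning can be sketched as a small simulation. This is a simplified model under stated assumptions: the function name `sequence_by_extension`, the four-nucleotide cycle order, and the handling of homopolymer runs (all matching bases extend in one cycle) are illustrative choices, not details from the slides.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_extension(template, nucleotide_cycles):
    """Simulate sequential single-base extension on one polony.

    template: template bases in the order the polymerase reads them.
    nucleotide_cycles: the dNTP added in each cycle, e.g. "TCAG".
    Returns (signals, extended): the number of bases incorporated in each
    cycle, and the strand synthesized so far.
    """
    position = 0
    signals = []
    extended = []
    for dntp in nucleotide_cycles:
        incorporated = 0
        # The primer extends only while the added dNTP pairs with the
        # next template base; a homopolymer run extends within one cycle.
        while position < len(template) and COMPLEMENT[template[position]] == dntp:
            extended.append(dntp)
            position += 1
            incorporated += 1
        signals.append(incorporated)
    return signals, "".join(extended)

# The two example templates from the slide, read starting at the primer.
agt_signals, agt_strand = sequence_by_extension("AGT", "TCAG")
gcg_signals, gcg_strand = sequence_by_extension("GCG", "TCAG")

print(agt_signals, agt_strand)  # [1, 1, 1, 0] TCA
print(gcg_signals, gcg_strand)  # [0, 1, 0, 1] CG
```

In the first cycle (dTTP) only the AGT.. polony lights up, and in the second (dCTP) both do, so the pattern of signals per cycle is what distinguishes the two templates.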
Polony Template
3’ P’
P5’ A ATA CAA TTCACACAGGAAACAGCTATGA CATT CTATTGTTAAAGTGTGTCCTTTGTCGATACTGGTA…5’
Channels: FITC (C), Cy3 (T)
Primer Extension: 26 cycles, 34 Nucleotides
Mean Intensity: 58, 0.5; 40, 6.5; 0.3, 48; 0.4, 43
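One way to turn two-channel intensities into base calls is a simple dominance rule. The sketch below assumes the mean intensities on the slide are (FITC, Cy3) pairs for four polonies, which is a reading of the layout rather than something the slide states; the function `call_base` and the `min_ratio` threshold are hypothetical.

```python
def call_base(fitc, cy3, min_ratio=2.0):
    """Call C (FITC) or T (Cy3) when one channel clearly dominates."""
    if fitc >= min_ratio * cy3:
        return "C"
    if cy3 >= min_ratio * fitc:
        return "T"
    return "N"  # ambiguous: neither channel dominates

# Mean intensities from the slide, read here as (FITC, Cy3) pairs (an assumption).
polonies = [(58, 0.5), (40, 6.5), (0.3, 48), (0.4, 43)]
calls = [call_base(fitc, cy3) for fitc, cy3 in polonies]
print(calls)  # ['C', 'C', 'T', 'T']
```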
Polony haplotyping
Trans Cis
DNA RNA Protein
Metabolites
Growth rate
Environment
Function Genomics Measures & Models
microbes, stem cells, cancer cells, multicellular organisms
RNAi, Insertions, SNPs
Competition among multiple mutations & multiple homologous domains
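[Figure: probes (numbered 1 to 3) along the homologous genes thrA, metL and lysC, annotated with selective disadvantage in minimal media; values shown: 1.1, 6.7, 1.8, 1.8, 10.4.]
Selective disadvantage in minimal media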
Multiple mutations per gene
Correlation between two selection experiments
Comparison of selection data with FBO predictions (scale up from 79 to 488 genes)
predictions          number of genes  negatively selected  not negatively selected
essential            143              80                   63
reduced growth rate  46               24                   22
non essential        299              119                  180
P-value Chi Square = 0.004
Novel duplicates?
Position effects?
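The quoted P-value can be checked from the counts in the table. The slide does not say which chi-square test was used; the sketch below assumes a standard test of independence on the 3 x 2 table of prediction class versus selection outcome.

```python
import math

# Rows: FBO prediction (essential, reduced growth rate, non essential).
# Columns: negatively selected, not negatively selected.
observed = [
    [80, 63],
    [24, 22],
    [119, 180],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (count - expected) ** 2 / expected

# With 2 degrees of freedom, (3-1)*(2-1), the chi-square survival
# function has the closed form exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), round(p_value, 3))
```

Under that assumption the statistic is about 11 and the P-value rounds to 0.004, in agreement with the slide.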
DNA RNA Protein
Metabolites
Expression
Environment
Function Genomics Measures & Models
RNA quantitation (Frequently Asked Questions)
Is less than a 2-fold RNA-ratio ever important? Yes; 1.5-fold in trisomies.
Why oligonucleotides rather than cDNAs? Alternative RNAs, gene families.
Using a subset of the genome or ratios to various control RNAs? Trouble for later (meta) analyses.
Lpp mRNA start & structure
[Figure: Intensity (PM - MM) / Smax, from -1 to 2, plotted against bases from translation start, from -300 to 400. Series: Log, Stationary, Genomic DNA. Annotations: known transcription start (position -33), translation stop (237 bases), known hairpin.]
See: Selinger et al Nat Biotech
Oligo selection
• PGA/Smith group already designing software for oligo selection
• Church Lab / Lipper Center has additional tools
  – Unique oligos (cu-15s)
  – RNA string matching program
[Flowchart: gene sequences and parameters (Tm, length, ...) -> generate candidate oligos -> predict cross-hybridization (against background sequences) -> filter & select oligos -> gene-specific oligos; generate control, text, border oligos; all oligos -> generate chip layout -> chip layout -> experimental results.]
Figure courtesy of Adnan Derti
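The selection steps in the flowchart (generate candidates, predict cross-hybridization, filter and select) can be sketched in a few lines. This is not the PGA or Church Lab software: the function names are hypothetical, the Wallace rule is a rough Tm approximation for short oligos, and exact substring matching against background sequences is only a crude stand-in for cross-hybridization prediction.

```python
def wallace_tm(oligo):
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def candidate_oligos(gene, length):
    """Generate every window of the given length from a gene sequence."""
    return [gene[i:i + length] for i in range(len(gene) - length + 1)]

def select_oligos(gene, background, length, tm_min, tm_max):
    """Keep candidates in the Tm window that occur in no background sequence."""
    selected = []
    for oligo in candidate_oligos(gene, length):
        if not tm_min <= wallace_tm(oligo) <= tm_max:
            continue
        # Crude cross-hybridization proxy: reject exact matches elsewhere.
        if any(oligo in other for other in background):
            continue
        selected.append(oligo)
    return selected

# Hypothetical sequences, for illustration only.
gene = "ATGGCGTACGTTAGC"
background = ["TTTTGGCGTACGAAAA"]
oligos = select_oligos(gene, background, length=8, tm_min=22, tm_max=26)
print(oligos)
```

Here the window TGGCGTAC is dropped because it also occurs in the background sequence, and GGCGTACG is dropped because its estimated Tm falls outside the window. A real pipeline would use a thermodynamic Tm model and score near-matches rather than exact ones.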
Combinatorial arrays for binding constants (EGR1)
ds-DNA array
HMS: Martha Bulyk, Xiaohua Wang, Martin Steffen
MRC: Yen Choo
Combinatorial arrays for binding constants
Combinatorial DNA-binding
protein domains
ds-DNA array
Phage
pVIII
pIII
Antibodies
Combinatorial arrays for binding constants
Phycoerythrin - 2º IgG
Combinatorial DNA-binding
protein domains
ds-DNA array
Martha Bulyk et al
Phage
Interactions of Adjacent Basepairs in EGR1 Zinc Finger DNA Recognition
Isalan et al., Biochemistry (‘98) 37:12026-12033
Wildtype EGR1 Microarray
high [DNA] (+) ctrl sequence
for wt binding
alignment oligos
etc.
Wildtype RSDHLTT
Motifs weight all 64 Ka(app)
RGPDLAR
REDVLIR
LRHNLET
TGG 2.8 nM
GCG 16 nM
2.5 nM
TAT 5.7 nM
AAA, AAT, ACT, AGA, AGC, AGT, CAT, CCT, CGA, CTT, TTC, TTT
AAT 240 nM
KASNLVS
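To give a feel for what nM-scale constants like those above mean, the sketch below converts a ratio of two constants into a free-energy difference and computes site occupancy at a given protein concentration. It treats the nM values as apparent dissociation constants and assumes 25 deg C; neither is stated on the slide. The pairing of each triplet with a particular protein variant is not recoverable from the slide, so the comparison is purely arithmetic, and the function names are hypothetical.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal/(mol*K)
TEMP_K = 298.15     # assumed temperature (25 deg C); not stated on the slide

def ddg_kcal(kd, kd_ref):
    """Binding free-energy difference relative to a reference: RT ln(kd / kd_ref)."""
    return R_KCAL * TEMP_K * math.log(kd / kd_ref)

def fraction_bound(protein_nm, kd_nm):
    """Equilibrium occupancy of a DNA site: [P] / ([P] + Kd)."""
    return protein_nm / (protein_nm + kd_nm)

# Apparent constants as listed on the slide (nM), treated as Kd values.
kd_app = {"TGG": 2.8, "GCG": 16.0, "TAT": 5.7, "AAT": 240.0}

ddg_aat_vs_tgg = ddg_kcal(kd_app["AAT"], kd_app["TGG"])
occupancy_gcg = fraction_bound(protein_nm=16.0, kd_nm=kd_app["GCG"])

print(round(ddg_aat_vs_tgg, 2), occupancy_gcg)
```

A site is half occupied when the protein concentration equals its Kd, and the roughly 85-fold spread between 2.8 nM and 240 nM corresponds to a little over 2.5 kcal/mol.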
For more information:
arep.med.harvard.edu