Interpreting the human genome
Manolis Kellis
CSAIL MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard for Genomics in Medicine
The age of comparative genomics
32 mammals, 17 yeasts, 12 flies
Mammals shown: human, mouse, rat, chimp, dog, opossum, armadillo, rabbit, cow, hyrax, elephant, bat, dolphin, lemur, bushbaby, pika, hedgehog, tenrec, pangolin, tree shrew, llama, etc.
Resolving power in mammals, flies, fungi
10 mammals
• Neutral: 2.57 subs/site (opp: 0.62, 32 sps: 4.87)
• Coding: 1.16 subs/site
• Detect: 6-mer at FP 10^-6
12 flies
• Neutral: 4.13 subs/site
• Coding: 1.65 subs/site
• Detect: 6-mer at 10^-11
17 yeasts (8 Candida, 9 Yeasts)
• Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5)
• Coding: 7.91 subs/site
• Detect: 3-mer at 10^-21
[Figure: phylogenetic trees for the three clades; yeast tree labels: post-duplication, pre-duplication, diploid, haploid]
Comparative Genomics 101: Conservation -> Function
• Conserved elements are typically functional (and vice versa)
  – For example: exons are deeply conserved to mouse, chicken, fish
• Some conserved elements are still uncharacterized
  – How do we make sense of them?
  – How do we distinguish each type of functional element?
• Answer: evolutionary signatures (Comp. Genomics 201)
  – Tell me how you evolve, I’ll tell you who you are
  – Patterns of change -> selective pressures -> specific function
Gene identification
Study known genes
Derive conservation rules
Discover new genes
• Evolutionary signatures
  – “Tell me how you evolve, I’ll tell you who you are”
  – Each type of functional element evolves in its own specific way
Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
• Protein-coding genes have specific evolutionary constraints
  – Gaps are multiples of three (preserve amino acid translation)
  – Mutations are largely 3-periodic (silent codon substitutions)
  – Specific triplets exchanged more frequently (conservative substs.)
  – Conservation boundaries are sharp (pinpoint individual splicing signals)
• Encode as ‘evolutionary signatures’
  – Computational test for each of them
  – Combine and score systematically
Splice
Signature 1: Reading frame conservation
              Mutations   Gaps      Frameshifts
Genes         30%         1.3%      0.14%
Intergenic    58%         14%       10.2%
Separation    2-fold      10-fold   75-fold
[Figure: per-region reading frame conservation (RFC) percentages for two example alignments, ranging from 20% to 100%; assignment of individual values to regions not recoverable]
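The reading frame conservation idea can be sketched in a few lines. This is a simplified pairwise version, not necessarily the scoring used for the numbers above: it walks an alignment, tracks the codon position of each sequence, and reports the fraction of aligned nucleotide pairs that share the most common frame offset. The function name `rfc_score` and the choice to take the best of the three offsets are assumptions made for illustration.

```python
from collections import Counter


def rfc_score(target: str, informant: str) -> float:
    """Fraction of aligned nucleotide pairs sharing the most common frame offset.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    A gap whose length is a multiple of three leaves the offset unchanged;
    any other gap shifts the frame for everything downstream.
    """
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    t_pos = i_pos = 0      # ungapped nucleotides seen so far in each sequence
    offsets = Counter()    # frame offset (0, 1, 2) -> number of aligned pairs
    for t, i in zip(target, informant):
        if t != "-" and i != "-":
            offsets[(t_pos - i_pos) % 3] += 1
        if t != "-":
            t_pos += 1
        if i != "-":
            i_pos += 1
    total = sum(offsets.values())
    return max(offsets.values()) / total if total else 0.0


# A 3-nt gap keeps the frame; a 1-nt gap breaks it for the rest of the region.
in_frame = rfc_score("ATGAAACCCGGG", "ATGAAA---GGG")
shifted = rfc_score("ATGAAACCCGGG", "ATGA-ACCCGGG")
```

Under this sketch a real exon would be expected to score close to 1 against every informant species, while intergenic sequence with frequent frame-shifting gaps would score lower.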
Signature 2: Distinct patterns of codon substitution
• Codon substitution patterns specific to genes
  – Genetic code dictates substitution patterns
  – Amino acid properties dictate substitution patterns
[Figure: Codon Substitution Matrix (CSM), human vs. mouse. Two panels, Genes and Intergenic; rows: codon observed in species 1, columns: codon observed in species 2. Codons grouped by amino acid class: aliphatic, aromatic, negative, polar, positive.]
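A codon substitution matrix can be estimated by counting aligned codon pairs in known genes and in intergenic regions, then scoring a candidate region by log-odds. The sketch below assumes pairwise alignments already in frame; the function names, the add-one pseudocount, and the toy training data are illustrative assumptions, not the settings behind the matrices shown above.

```python
import math
from collections import Counter

N_PAIRS = 64 * 64  # number of possible ordered codon pairs


def codon_pairs(seq1, seq2):
    """Yield aligned codon pairs in frame 0, skipping triplets that contain a gap."""
    for k in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[k:k + 3], seq2[k:k + 3]
        if "-" not in c1 and "-" not in c2:
            yield c1, c2


def train_csm(alignments):
    """Count codon pairs over a list of (seq1, seq2) aligned training regions."""
    counts = Counter()
    for s1, s2 in alignments:
        counts.update(codon_pairs(s1, s2))
    return counts


def log_odds(pair, coding, noncoding, pseudo=1.0):
    """log P(pair | coding) / P(pair | noncoding), with a pseudocount for unseen pairs."""
    p_c = (coding[pair] + pseudo) / (sum(coding.values()) + pseudo * N_PAIRS)
    p_n = (noncoding[pair] + pseudo) / (sum(noncoding.values()) + pseudo * N_PAIRS)
    return math.log(p_c / p_n)


def csm_score(seq1, seq2, coding, noncoding):
    """Sum of per-codon log-odds; positive values look protein-like."""
    return sum(log_odds(p, coding, noncoding) for p in codon_pairs(seq1, seq2))


# Toy training data (hypothetical): a silent GAA->GAG change seen in a gene,
# and a GAA->TAA change (to a stop codon) seen in intergenic sequence.
coding = train_csm([("GAAGAA", "GAGGAA")])
noncoding = train_csm([("GAAGAA", "TAAGAA")])
silent = csm_score("GAA", "GAG", coding, noncoding)
to_stop = csm_score("GAA", "TAA", coding, noncoding)
```

With realistic training sets the same score could be summed over a sliding window, so that protein-like substitutions raise the score and atypical ones lower it.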
Signatures 3, 4, 5, 6, 7, etc…
• Mutation patterns of splicing signals
  – Real splice acceptors/donors evolve in specific ways
• Evolution of other motifs associated with splicing
  – Exonic/Intronic Splicing Enhancers/Silencers (ESE, ISE)
  – Density of motif clouds surrounding real exons
• Sharp conservation boundaries
  – Relative conservation of exon vs. surrounding regions
• Length of longest ‘open’ reading frame
  – Frequency of stop codons in each frame / each species
[Figure: a real exon between its acceptor site and donor site, with ESEs in the exon and ISEs on either side]
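The open-reading-frame signature is the easiest one to compute. As a minimal sketch (function names are illustrative), the code below counts stop codons in each of the three forward frames of one sequence and reports the longest stop-free run in codons; in a multi-species setting the same counts would be taken for each species in the alignment.

```python
STOPS = {"TAA", "TAG", "TGA"}


def stop_counts(seq):
    """Number of stop codons in each of the three forward frames."""
    seq = seq.replace("-", "").upper()
    return [
        sum(seq[k:k + 3] in STOPS for k in range(frame, len(seq) - 2, 3))
        for frame in range(3)
    ]


def longest_open_stretch(seq):
    """Length in codons of the longest stop-free run over the three forward frames."""
    seq = seq.replace("-", "").upper()
    best = 0
    for frame in range(3):
        run = 0
        for k in range(frame, len(seq) - 2, 3):
            if seq[k:k + 3] in STOPS:
                run = 0          # a stop codon closes the current open stretch
            else:
                run += 1
                best = max(best, run)
    return best


counts = stop_counts("ATGAAATAGCCC")            # frames 0 and 1 each contain one stop
longest = longest_open_stretch("ATGAAATAGCCC")  # frame 2 is open throughout
```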
Putting it all together: probabilistic framework
• Hidden Markov Models (HMMs)
  – Generative model, learn emission, transition probabilities
  – Easy to train, hard to integrate long-range signals
• Conditional Random Fields (CRFs)
  – Discriminative dual of HMMs, learn weights on features
  – Easy to integrate diverse signals, gradient ascent for training
From HMMs … to CRFs
[Figure: graphical model. Hidden sequence: y_{i-1}, y_i, y_{i+1}. Observed sequence: X. Feature functions: F(i-1), F(i), F(i+1).]
From HMMs … to CRFs
$$P(X, Y) = \prod_{i=1}^{L} F(y_{i-1}, y_i, i, X)$$
$$V(i, y) = \max_{y'} \big( F(y', y, i, X)\, V(i-1, y') \big)$$
$$\alpha(i, y) = \sum_{y'} F(y', y, i, X)\, \alpha(i-1, y')$$
$$F(y', y, i, X) = a_{y'y}\, e_{y}(X_i)$$
Generative model (HMM): transition and emission probabilities
Discriminative model (CRF): features can simply be e_i and a_ij, or pretty much anything:
$$f_{3}(y', y, i, X) = \%\ \text{Heads in } x_{i-50} \ldots x_{i+50}$$
$$f_{9}(y', y, i, X) = \%\ \text{CpG in } x_{i-50} \ldots x_{i+50}$$
$$f_{17}(y', y, i, X) = \text{distance to nearest BLAST hit}$$
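The point of writing the recurrences in terms of a single factor F(y', y, i, X) is that the decoding code does not change when the features do. The sketch below is a generic Viterbi decoder over an arbitrary positive factor function, run here with an HMM-style factor (transition times emission) for a two-state coin example. The state names, probabilities, and the convention of passing `None` as the previous state at the first position are all assumptions made for illustration.

```python
import math


def viterbi(X, states, F):
    """Most probable state path under V(i, y) = max_y' F(y', y, i, X) * V(i-1, y').

    F must return a strictly positive factor; the recursion is done in log space.
    """
    V = {y: math.log(F(None, y, 0, X)) for y in states}
    back = []  # back[i-1][y] = best previous state for y at position i
    for i in range(1, len(X)):
        ptr, V_new = {}, {}
        for y in states:
            best = max(states, key=lambda yp: V[yp] + math.log(F(yp, y, i, X)))
            ptr[y] = best
            V_new[y] = V[best] + math.log(F(best, y, i, X))
        back.append(ptr)
        V = V_new
    y = max(states, key=V.get)
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Hypothetical HMM: a fair coin and a coin biased towards heads.
STATES = ("fair", "biased")
START = {"fair": 0.5, "biased": 0.5}
A = {"fair": {"fair": 0.9, "biased": 0.1}, "biased": {"fair": 0.1, "biased": 0.9}}
E = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.9, "T": 0.1}}


def hmm_F(y_prev, y, i, X):
    """HMM special case: F = transition probability * emission probability."""
    trans = START[y] if y_prev is None else A[y_prev][y]
    return trans * E[y][X[i]]


tails_path = viterbi("TTTT", STATES, hmm_F)
heads_path = viterbi("HHHHHHHH", STATES, hmm_F)
```

Replacing `hmm_F` with a function that looks at a window of X, or at external evidence, leaves `viterbi` untouched, which is the flexibility the CRF formulation is meant to provide.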
Running on real genomes
• Obtain optimal weights (from training set)
  – Experimentally-defined, genetics, curation, cDNA
• Apply CRF systematically to new genome
  – Revisit existing genomes
  – Annotate new genomes
• Power of evolutionary signatures
  – New genes and exons, dubious genes and exons
  – Adjust gene boundaries: ATG, frame, splice site, seq errors
• Signatures more powerful than primary signals
  – Recognize unusual gene structures: read-through, uORFs, editing
• Towards a revised genome annotation
  – Curation: FlyBase integrates prediction with cDNA, protein, literature
  – Experimentation: BDGP large-scale functional validation of novel exons
Revisiting fly genome annotation
Species shown: D. melanog., D. simulans, D. erecta, D. persimilis
• 10,845 fully confirmed
• 579 fully rejected
• 2,499 not aligned
• 1,454 exons (~800 genes)
• +668 exons in 443 genes
(…)
Systematic application leads to
• Exon-level changes
  – Ex 1: New genes
  – Ex 2: New exons
  – Ex 3: Dubious genes
• More subtle changes
  – Ex 4: Start/end adjustments
  – Ex 5: Wrong reading frame
  – Ex 6: Splice site adjustments
  – Ex 7: Sequencing errors fixed
• Unusual gene structures
  – W1: Stop-codon read-through
  – W2: uORFs & dicistronic
  – W3: Internal frame-shifts
[Figure thumbnails: Genes vs. Intergenic; Reading Frame Conservation; Codon Substitution Matrix (codon observed in species 1 vs. species 2). Legend: conserved, substitution, insertion, frameshift, gap.]
Example 1: Known genes stand out
Sharp conservation boundaries.
Known exons stand out.
High sensitivity and specificity.
Example 2: Novel multi-exon gene
1,454 novel exons outside known genes
  – Many cluster in new multi-exon genes
  – Others are isolated high-confidence exons
Example 2b: Novel exons inside known genes
(sorry, this example is from human, mouse, dog, rat)
• 668 cases in fly
  – New candidate alternatively spliced gene forms
  – New protein domains
Novel genes and exons
• 1,454 novel exons outside existing genes
  – 60% cluster in 300 multi-exon genes
  – 40% isolated exons
• 668 novel exons inside existing genes
  – Alternative splicing: Many with cDNA support
  – Nested genes: Few known examples
• Human curation
  – Collaboration with FlyBase
  – Hundreds of changes in release 5.1, more in 5.2
• Systematic experimentation
  – Sue Celniker and Berkeley Genome Project
  – Thousands of new genes in the pipeline
Example 3: Dubious single-exon gene
• Only evidence was an open reading frame
  – Comparative information much stronger
579 Dubious Genes
• Classification approach: Yes / No answer
  – Closely related species: both genes and intergenic aligned
  – Show very different patterns of mutation
• Comparative analysis provides negative evidence
  – Alignment is unambiguous, orthologous, spans entire gene
  – Sequence shows mutations and indels in every species
• Weak or missing experimental evidence
  – 100 of these independently rejected by FlyBase
  – These are missing from systematic clone collections
  – Only 34 (6%) have assigned names (vs. 36% of all fly genes)
Example 4: Start codon adjustment
• Codon substitution patterns suggest new start in 200 genes
  – Score each substitution using Codon Substitution Matrix (CSM)
[Figure: CG6664/FBtr0100439. The annotated start codon (ATG) opens the annotated ORF (345 nt); the conserved start codon (ATG) opens the real ORF (315 nt). Color key: poor CSM score, atypical substitution; high CSM score, protein-like substitution.]
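One rough way to shortlist alternative start codons is to ask which ATGs in the reference are themselves conserved across the aligned species. The sketch below is a simplified proxy, not the CSM-based scoring described above: it reports alignment columns where the reference has an ungapped ATG and at least a chosen fraction of species agree. The function name, threshold, and toy alignment are hypothetical.

```python
def conserved_atgs(alignment, ref, min_frac=0.9):
    """Columns where `ref` has ATG and at least `min_frac` of all rows also have ATG.

    `alignment` maps species name -> aligned sequence (all rows the same length).
    The reference itself counts towards the fraction.
    """
    ref_seq = alignment[ref]
    hits = []
    for k in range(len(ref_seq) - 2):
        if ref_seq[k:k + 3] != "ATG":
            continue
        agree = sum(seq[k:k + 3] == "ATG" for seq in alignment.values())
        if agree / len(alignment) >= min_frac:
            hits.append(k)
    return hits


# Toy alignment (hypothetical): the first ATG is specific to the reference,
# the second is shared by all three species.
toy = {
    "dmel": "ATGCCCATGAAA",
    "dsim": "ATACCCATGAAA",
    "dyak": "CTGCCCATGAAG",
}
strict = conserved_atgs(toy, "dmel", min_frac=0.9)
loose = conserved_atgs(toy, "dmel", min_frac=0.3)
```

A candidate from this list would still need to be in frame with the downstream ORF and supported by protein-like substitutions between it and the annotated start.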
Example 5: Gene annotated on wrong reading frame
• cDNA evidence supports overlapping reading frames, both open
  – Annotation traditionally selects longer one
  – Conservation enables distinguishing the two
mRNA supports both ORFs
Conservation only supports shorter ORF
Shorter ORF is the correct one
CG7738-RA is incorrect
Example 6: Incorrect splice causes wrong frame
• Second exon annotated in the wrong frame
  – Due to splice site boundary error
  – Correction is supported by cDNA evidence
Fix exon boundary
First exon: correct frame
2nd exon: incorrect frame
Example 7: Detect seq. errors / strain mutations
• Insertion/deletion causes frameshift
  – Conservation signature shifts from ‘frame1’ to ‘frame2’
  – All other species disagree with D. melanogaster indel
  – Sequencing error or species-specific mutation
chr3R:6,953,865-6,953,927 (Ugt86Dd)
dm     CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
frame  01201201201201201201201201201201 2012012012012012012012012012012
** ** ** ** *** ** * ** * * ** * ** ** ** * ** ** *
Conservation in correct frame Conservation in 2nd frame
Frame-shift (sequencing error / recent mutation)
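A frameshift of this kind can be flagged directly from the alignment: look for gap runs in the reference whose length is not a multiple of three and where the other species carry sequence. The sketch below is a minimal version of that check, applied to the dm and droSec rows of the Ugt86Dd alignment above; the function name and the requirement that all informants be ungapped across the run are illustrative choices.

```python
import re


def frameshift_gaps(ref, others):
    """Gap runs in `ref` that would shift the frame and that no other row shares.

    Returns a list of (start_column, length) for runs whose length is not a
    multiple of three and across which every sequence in `others` is ungapped.
    """
    hits = []
    for m in re.finditer(r"-+", ref):
        start, end = m.start(), m.end()
        length = end - start
        others_ungapped = all("-" not in o[start:end] for o in others)
        if length % 3 != 0 and others_ungapped:
            hits.append((start, length))
    return hits


# Two rows of the Ugt86Dd alignment (chr3R:6,953,865-6,953,927).
DM = "CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA"
SEC = "CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA"

candidates = frameshift_gaps(DM, [SEC])           # single-base gap in dm
in_frame = frameshift_gaps("ATG---AAA", ["ATGCCCAAA"])  # 3-nt gap keeps the frame
```

Whether a flagged position is a sequencing error or a real strain-specific mutation cannot be decided from the alignment alone and would need resequencing.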
Example 8: Dubious gene is a miRNA transcript
• Evolutionary signatures reveal specific function
Unusual genes 1: Stop codon read-through
• Method #1 (single exons)
  – 112 events, 95 extending known genes
  – Manual curation: 82
  – Enriched in neuronal function
• Method #2 (after splicing)
  – 256 events, looser cutoff, large overlap, needs manual curation
  – Enriched in transcription factors
[Figure: protein-coding conservation up to the stop codon; continued protein-coding conservation through the read-through region; no more conservation after the 2nd stop codon]
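The first step in screening for read-through is purely positional: given an annotated stop codon, find the next in-frame stop and take the region between them as the candidate extension. The sketch below does only that step; the conservation scoring of the candidate region is not shown. The function name and the toy sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def readthrough_region(seq, stop_index):
    """Return (start, end) of the region between the annotated stop and the next
    in-frame stop codon, or None if no second stop is found.

    `stop_index` is the 0-based position of the first base of the annotated stop.
    `end` is the position of the second stop codon, so seq[start:end] is the
    candidate read-through region.
    """
    if seq[stop_index:stop_index + 3] not in STOPS:
        raise ValueError("stop_index does not point at a stop codon")
    start = stop_index + 3
    for k in range(start, len(seq) - 2, 3):
        if seq[k:k + 3] in STOPS:
            return (start, k)
    return None


# Toy transcript: ATG AAA TAG | GGA CCC | TAA GG
toy = "ATGAAATAGGGACCCTAAGG"
region = readthrough_region(toy, 6)
extension = toy[region[0]:region[1]]
```

A region returned here becomes a read-through candidate only if it shows the protein-coding signatures (frame conservation, protein-like substitutions) all the way to the second stop.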
Unusual genes 2: Polycistronic messages / uORFs
• Method
  – High-scoring ORFs with cDNA evidence
  – Disjoint from the annotated ORF
• Results
  – 217 cases
Protein-coding conservation in the 5’UTR
Unusual genes 3: Frame-shift in the middle of exons
• Method
  – Exons changing high-scoring frame
  – Far from splice junctions
• Results
  – 68 cases in 44 genes
chrX:2,226,518-2,226,639 (CG14047)
dm     GACTATTTCAACAATCAGCAGCGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSim GACTATTTCAACAACCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droSec GACTATTTCAACAACCAACAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCCTCACGCAGACCG
droYak GACTACTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GGCGAGATTTGTACCGCCTCCACCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCG
droEre GACTATTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTTCTCACGCAGACCG
droAna GACTACTACAACAATCAGCAGCGGGAGCGGCACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGGCCAGCGGCGAAGTTCGTCCCTCCTCCGCCGCCTCCGCGACGTTTGCTTCTCACGCAGACAG
droPse GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGCAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droPer GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGAAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGCTCACGCAGACCA
droWil GACTACTACAACAATCAGCAGAGGGAGCGACACTACGAGCAACGTCGCCAAAGCCAGCGGCAGGCC---AGCCAAATTTATACCACCGCCACCGCCTCCACGTCGACTGCTGCTAACGCAGACAA
droMoj GACTACTACAACAACCAGCAGCGGGAGCGGCACTACCAGCTGCGCCACCAGAGCCAACGTCAAGCC---ACCGAGATTTATACCACCACCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droVir GACTACTACAACAACCAACAGCGGGAGCGGCACTACCAGCAGCGCCGCCAGAGCCAACGTCAAGCC---ACCGAGATTCATTCCACCGCCGCCGCCGCCTCGTCGTCTGCTGCTCACGCAGACAA
droGri GACTACTACAACAATCAGCAGCGGGAGCGGCACTATCAACAGCGTCGCCAGAGTCATCGTCAAGCC---ACCGAGATTTATACCACCACCACCGCCACCTCGTCGTCTATTGCTCACGCAGACAA
frame  012012012012012012012012012012012012012012012012012012012012012012   01201201201201201201201201201201201201201201201201201201
***** * ****** ** ** * ***** ***** * * ** ** ** ** ** * ** * * ** * ** ** ** ***** ** ** ** * * ** ********
Frame labels: 012 | 120
Frame 1 is high-scoring | Frame 2 is high-scoring
• Fully rejected genes: weak/no evidence
• New exons: existing & novel experimental evidence
• Need: large-scale functional annotation for novel genes
Initial results for the whole human genome
Species shown: Human, Mouse, Rat, Dog
• 9,862 fully confirmed
• 7,717 refined
• 1,065 fully rejected
• 1,919 not aligned
• 454 novel (2,591 exons)
Discriminative framework shows continued increase in power
• Reading frame conservation (RFC) score
• Codon substitution matrix (CSM) score
[Figure: score histograms (x-axis: score, y-axis: count) for four species sets: 2 species (Dmel, Dpse); 3 species (Dmel, Dyak, Dpse); 5 species (Dmel, Dyak, Dpse, Dwil, Dgri); 12 species (12 flies).]
[Figure: 2 species vs. 12 species comparison with percentage labels 90%, 10%, 30%, 70%, 80%, 95%, 5%, 20%; assignment of values to panels not recoverable.]
Overview
Part 1. Genome interpretation
  – Evolutionary signatures of genes
  – Revisiting the human and fly genomes
  – Unusual gene structures
Part 2. Gene regulation
  – Regulatory motif discovery
  – microRNA regulation
  – Enhancer identification
Part 3. Genome evolution
  – Phylogenomics
  – The two forces of gene evolution
  – Accurate gene trees in complete genomes
Who’s actually doing the work
Matt Rasmussen: Phylogenomics
Erez Lieberman: Motif evolution
Aviva Presser: Network evolution
Mike Lin: Gene identification
Alex Stark: Fly motifs and miRNAs
Pouya Kheradpour: Human enhancers
Josh Grochow: Network motif discovery
Ameya Deoras: Spectral genomics