Post on 18-Jan-2016
transcript
Multiple Species Gene Finding
Sourav Chatterjisouravc@eecs.berkeley.edu
Predicting Replication Origins in Yeast. Breier AM, Chatterji S, Cozzarelli NR. Genome Biol.
2004;5(4):R22. Comparative GeneFinding using SLAM.
Paired Splice Site Detection in SLAM. Zhao X, Huang H, Speed TP. Proceedings of RECOMB 2004; 68-75.
Rat Genome Sequencing Consortium. Nature. 2004 Apr 1;428(6982):493-521.
Multiple Species GeneFinding. Chatterji S, Pachter L. Proceedings of RECOMB 2004; 187-193.
Evidence Based GeneFinding - Work in Progress.
State of the Genomes 5 Roundworm Genomes
C.elegans and C.briggsae completed. 3 other worm genomes in progress.
11 Fruitfly Genomes D. Melanogaster - completed Seuqencing of 7 genomes in progress 3 more genomes in pipleline
18 Mammalian Genomes Human, Mouse, and Rat Genomes Published
Sequencing of 6 genomes in progress. 9 other genomes in pipeline 4 primate genomes : Orangutan, Macaque, Chimpanzee and Human.
Outline
GeneFinding by Gibbs Sampling Ab-Initio GeneFinding in Vertebrates Gibbs Sampling in HMMs Gene finding by Gibbs sampling Results
Evidence Based Multiple Species GeneFinding Evidence based GeneFinding ExonAligner : An Exon Alignment Program Initial Results Proposals for Future Work
Gene Finding in Vertebrates Single organism gene finding: GENSCAN/GENIE/SNAP…… Based on generalized HMMs Viterbi Sequence (Gene Annotation).
High Sensitivity/Low Specificity
Gene Finding in Vertebrates Single organism gene finding: GENSCAN/GENIE/SNAP…… Based on generalized HMMs Viterbi Sequence (Gene Annotation).
High Sensitivity/Low Specificity Conserved regions among related species more likely to be functional than divergent regions.
IDEA: Comparative-based gene finding
Comparative (Pairwise) Gene Finding
ROSETTA : Global alignment followed by gene finding [Batzoglou and Pachter et al., 1999]
SLAM : Simultaneous Global alignment and gene finding [Alexandersson et al. 2001]
TWINSCAN : Blast alignment followed by gene finding [Korf et al. 2001]
AGENDA : Global alignment followed by gene finding [Rinner and Morgenstern, 2002]
DOUBLESCAN : Simultaneous alignment and gene finding [Meyer and Durbin, 2002]
SGP2 : Blast alignment followed by gene finding [Parra et al. 2003]
The Good News Gene structure
Number of exons conserved (86% human/mouse) Exons have similar lengths (91% identical, remainder almost all differ by a multiple of 3)
Intron lengths are divergent (~1% identical length)
Sequence similarity Exons highly conserved (both amino acids & DNA)
Intron sequences dissimilarWaterston et al., 2003
The Bad News Difficult to generalize many pairwise methods to multiple sequence methods
Alignment Exons may be misaligned (much shorter than introns)
Multiple sequence alignment is much harder than pairwise sequence alignment
Long Conserved Non Coding Sequences Confuse methods that rely on conservation in a naive way
Missing Sequence
Multiple Species Comparative Gene Finding
(with Alignment)
McAuliffe et al. (2004), Siepel et al. (2004)
Multiple Species Comparative Gene Finding
(with Alignment)
McAuliffe et al. (2004), Siepel et al. (2004)
Multiple Species Comparative Gene Finding
(without Alignment)
Gibbs Sampling for Biological Sequence
Analysis Introduced by Lawrence et al. 1993
Motif Detection Extensions
Multiple Motifs in a Sequence Multiple Types of Motifs Phylogenetic Relationships between Sequences
Applications Alignment Linkage Analysis
Gibbs Sampling Aim : To sample from the joint distribution p(x1,x2,…,xn) when it is easy to sample from the conditional distributions
p(xi | x1,…xi-1,xi+1,…,xn) but not from the joint distribution. Method: Iteratively sample xi
t from the conditional distribution p(xi | x1
t,…xi-1t,xi+1
t-1 ,…,xn
t-1) Theorem : For discrete distributions, the distribution of (x1
t,x2t …,xn
t) converges to p(x1,x2,…,xn)
t
s
Connection to HMMs
Z1
Y1
Z2
Ym
Zm
Y2
s s
tt
t = output probabilities
s = transition probabilities Difficult to sample from
P(Z | Y)
Easy to sample from P(| Z,Y)
Easy to sample Z from P(Z | ,Y)
The Motif Finding Problem
Fixed width unknown motif.
1 motif per sequence, unknown location.
P1 P2 P3 P4 P5
A ? ? ? ? ?
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
?
?
?
?
PSSM
The Motif Finding HMM
: PSSM parameters
Y : Observed sequences
Z : Alignment
P1 P2 P3 P4 P5
A ? ? ? ? ?
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
?
?
?
?
PSSM
ZY
BG P1 P5 BG
Z2
Y2
Zm-1
Ym-1
Z1
Y1
Zm
Ym
…
Gibbs Sampling for Motif Detection
Sample from P( | Z,Y) [sample PSSM from alignment]
Sample Z from P(Z | ,Y) [find positions from PSSM]
Samples from P(Z,|Y)
P1 P2 P3 P4 P5
A ? ? ? ? ?
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
?
?
?
?
PSSM
ZY
BG P1 P5 BG
Z2
Y2
Zm-1
Ym-1
Z1
Y1
Zm
Ym
…
Gibbs Sampling for HMMs
N sequences independently generated by an HMM.
Three types of random variables : Parameters Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Gibbs Sampling for HMMs
N sequences independently generated by an HMM.
Three types of random variables : Parameters Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Zi1,Zi
2,…,Zi
m
Yi1,Yi
2,…,Yi
m
Gibbs Sampling for HMMs
N sequences independently generated by an HMM.
Three types of random variables : Parameters Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Aim: To Sample from the distribution P(Z,|Y)
Iterations of a Gibbs Sampler Sample Zi from p(Zi| Y,) , Sample from p(| Y,Z)
E00 E01 E02
Intron0
E10 E11 E12
Intron1
E20 E21 E22
Intron2
EI0 EI1 EI2ET0 ET1 ET2
SingleExon
IG
E100 E2
00 Ek00
Gibbs Sampling for Gene Finding
Gibbs Sampling for Gene Finding
Initial Predictions
Gibbs Sampling for Gene Finding
Sample Z1 from P(Z1 | Z[-
1] , Y)
Gibbs Sampling for Gene Finding
Sample Z2 from P(Z2 | Z[-
2] , Y)
Learning the Number of Exon Classes
Learning the Number of
Exon Classes
Find Significant Hits Among Peptides
Learning the Number of
Exon Classes
Each Connected Component forms a Class of Genes
Testing
1.6 Mb Data from the NISC Comparative Sequencing Project
Divided into large genomic regions (100-200 kB) some of which contained multiple genes
Selection Criteria 4 mammals roughly equidistant from each other
Human, Mouse/Rat, Dog/Cat, Pig/Cow. Available RefSeq annotations with no obvious alternative splicing
Results
Nucl. Sn
Nucl. Sp
Exon Sn
Exon Sp
Gibbs 0.897 0.886 0.714 0.628
Genscan 0.911 0.548 0.777 0.518
Twinscan
0.692 0.856 0.440 0.513
SLAM 0.791 0.881 0.632 0.527
Robustness ResultsNucl. Sn
Nucl. Sp
Exon Sn
Exon Sp
Gibbs(before) 0.939
0.950
0.763
0.735
Gibbs(after) 0.885
0.910
0.740
0.703
Genscan(before)
0.911
0.680
0.771
0.612
Genscan(after)
0.866
0.652
0.748
0.594
Twinscan(before)
0.694
0.895
0.465
0.604
Twinscan(after)
0.665
0.853
0.465
0.598
SLAM(before) 0.927
0.911
0.718
0.566
SLAM(after) 0.438
0.936
0.250
0.646
Conclusions Efficient
Running time O(kNL) Memory requirements O(L)
k=#iterations,N=#sequences, L=max. length Converges rapidly.
No Alignment Required !! Symmetric Prediction for All Species Application : rapid comparative based annotation of newly sequenced genomes
Robust Rearrangements Draft Quality Sequence
Outline
GeneFinding by Gibbs Sampling Ab-Initio GeneFinding in Vertebrates Overview of Gibbs Sampling Gene finding by Gibbs sampling Results
Evidence Based Multiple Species GeneFinding
Evidence based GeneFinding ExonAligner : An Exon Alignment Program Initial Results Proposals for Future Work
Evidence Based GeneFinding
Procrustes : cDNA Evidence, DP based Spliced Alignment [Gelfland et al. 1996]
Genewise : Protein evidence, combines genefinding HMM with protein profile HMM, part of the ENSEMBL pipeline. [Birney at al. 1996]
Projector : Evidence from orthologous genes in related species, uses pair HMM based model. [Meyer and Durbin 2004]
Evidence
Annotation
Evidence Based GeneFinding
Large scale sequencing efforts. 8 Drosophila genomes very soon
D. Melanogaster well annotated. 9 mammalian genomes by early 2005
Human genome well annotated. Aim : Rapid annotation of newly sequenced genomes. Use well annotated genomes as evidence. Draft Quality Genomes
Robustness for Sequencing Errors Using sequences from multiple species will result in more accurate annotations.
Will also give us high quality multiple alignments.
Data to study the evolution of genomes.
Evidence Based Multiple Species
GeneFinding
Basic Idea : Use annotations from a referencereference genome (e.g. D.melanogaster or H. sapiens) as evidence to annotate newly sequenced genomes.
Use Whole Genome Homology Maps (courtesy Colin Dewey)
Project exons from reference genome into every other genome.
Join projections to get multiple alignments. Use orthologous sequences from multiple species to get more accurate annotations.
Produce annotations with all supporting evidence.
Exploit phylogenetic relationships among the species.
Projecting Annotated Exons
Annotation
Homology Map
ExonAligner
ExonAligner : An Exon Alignment Program
Mixture of global and local alignment. Penalize overhanging ends in Evidence. Overhanging ends in Target is OK.
Exploit the property that they code for homologous proteins.
Special Dynamic Programming Matrix Robust.
Sequencing Errors. Phase Shifts.
Chaining Algorithm for large sequences.
Evidence
Target
The Dynamic Programming Matrix
The figure only shows edges into the black node. The red edges represent
non-codon gaps, i.e. gaps caused by phase shifts/sequencing errors and are of length which is not a multiple of 3. They are heavily penalized.
Chaining Algorithms Widely used in large scale alignment algorithms. MUMer [Salzberg et al. 1999] AVID [Bray et al. 2002] LAGAN [Brudno et al. 2003]
Step 1: Find good local alignments or fragments.
Step 2: Select a consistent subset of fragments for chaining and call these fragments anchors.
Step 3: Join the anchors to get an alignment.
The ExonAligner Chaining Algorithm
Construction of Fragments. Translate target sequence in the 3 frames.
Find significant hits with (translated) evidence and use them as fragments.
Selection of Anchors. Construct weighted DAG from fragments.
Weigh edges by using dynamic programming. Use nodes in the shortest path in the DAG as anchors.
Use dynamic programming to join anchors together.
Exploiting Phylogeny
Recoverable Exons/Genes
?
Recoverable Exons/Genes
Project Using Exon Aligner
Preliminary Results Created a Homology map of Human, Chimp, Rat, Mouse and Chicken genomes 266836 exon cliques. 45543 non-convex exon cliques. 27502 of these recoverable.
Used ExonAligner to map 3300 human Refseq genes into the chimp genome. Robustness of Algorithm critical.
500 of the 42662 exons had non-codon gaps. These alignments will in turn be used to learn parameters for ExonAligner.
Extrapolate parameters for other species.
An Illustrative Example
RefSeq Gene NM_030575 Single exon gene, 221 a.a. protein. No orthologous gene found by Genewise.
Potential orthologous gene in chimp 2 non-codon gaps in alignment (1 insertion and 1 deletion separated by 60 nt).
212 out of 221 amino acids are matches.
Is this a real ortholog? Phase Shift/Sequencing Error? Find orthologous genes in other species and use multiple alignment.
Future Work Extend ExonAligner for Multiple Species
Robust realignment Take into account codon structure Robust for phase shifts/sequencing errors
Annotation with supporting evidence. Basic Evidence e.g. RefSeq Gene Annotation Multiple Alignment with Orthologous Features
Score : Statistical Significance of the Feature
Future Work Comprehensive Annotation Program
Put Evidence Based and Ab initio methods together
Try to use alignment/homology in Gibbs Sampler
Rapid annotation of Drosophila and Mammalian genomes Berkeley AAA group for Drosophila genomes.
Study the evolution of genes Find human specific genes