Multiple Species Gene Finding Sourav Chatterji souravc@eecs.berkeley.edu.


Predicting Replication Origins in Yeast. Breier AM, Chatterji S, Cozzarelli NR. Genome Biol. 2004;5(4):R22.

Comparative GeneFinding using SLAM.

Paired Splice Site Detection in SLAM. Zhao X, Huang H, Speed TP. Proceedings of RECOMB 2004; 68-75.

Rat Genome Sequencing Consortium. Nature. 2004 Apr 1;428(6982):493-521.

Multiple Species GeneFinding. Chatterji S, Pachter L. Proceedings of RECOMB 2004; 187-193.

Evidence Based GeneFinding - Work in Progress.

State of the Genomes

5 Roundworm Genomes
  C. elegans and C. briggsae completed.
  3 other worm genomes in progress.

11 Fruitfly Genomes
  D. melanogaster completed.
  Sequencing of 7 genomes in progress.
  3 more genomes in pipeline.

18 Mammalian Genomes
  Human, Mouse, and Rat genomes published.
  Sequencing of 6 genomes in progress.
  9 other genomes in pipeline.
  4 primate genomes: Orangutan, Macaque, Chimpanzee and Human.

Outline

GeneFinding by Gibbs Sampling
  Ab-Initio GeneFinding in Vertebrates
  Gibbs Sampling in HMMs
  Gene finding by Gibbs sampling
  Results

Evidence Based Multiple Species GeneFinding
  Evidence based GeneFinding
  ExonAligner : An Exon Alignment Program
  Initial Results
  Proposals for Future Work

Gene Finding in Vertebrates

Single organism gene finding: GENSCAN/GENIE/SNAP……
  Based on generalized HMMs
  Viterbi decoding: Sequence -> Gene Annotation
  High Sensitivity/Low Specificity

Conserved regions among related species are more likely to be functional than divergent regions.

IDEA: Comparative-based gene finding

Comparative (Pairwise) Gene Finding

ROSETTA : Global alignment followed by gene finding [Batzoglou and Pachter et al., 1999]

SLAM : Simultaneous Global alignment and gene finding [Alexandersson et al. 2001]

TWINSCAN : Blast alignment followed by gene finding [Korf et al. 2001]

AGENDA : Global alignment followed by gene finding [Rinner and Morgenstern, 2002]

DOUBLESCAN : Simultaneous alignment and gene finding [Meyer and Durbin, 2002]

SGP2 : Blast alignment followed by gene finding [Parra et al. 2003]

The Good News

Gene structure
  Number of exons conserved (86% human/mouse)
  Exons have similar lengths (91% identical, remainder almost all differ by a multiple of 3)
  Intron lengths are divergent (~1% identical length)

Sequence similarity
  Exons highly conserved (both amino acids & DNA)
  Intron sequences dissimilar

Waterston et al., 2003

The Bad News

Difficult to generalize many pairwise methods to multiple sequence methods

Alignment
  Exons may be misaligned (much shorter than introns)
  Multiple sequence alignment is much harder than pairwise sequence alignment

Long Conserved Non-Coding Sequences
  Confuse methods that rely on conservation in a naive way

Missing Sequence

Multiple Species Comparative Gene Finding

(with Alignment)

McAuliffe et al. (2004), Siepel et al. (2004)


(without Alignment)

Gibbs Sampling for Biological Sequence Analysis

Introduced by Lawrence et al. 1993
  Motif Detection

Extensions
  Multiple Motifs in a Sequence
  Multiple Types of Motifs
  Phylogenetic Relationships between Sequences

Applications
  Alignment
  Linkage Analysis

Gibbs Sampling

Aim: To sample from the joint distribution $p(x_1, x_2, \ldots, x_n)$ when it is easy to sample from the conditional distributions $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ but not from the joint distribution.

Method: Iteratively sample $x_i^t$ from the conditional distribution $p(x_i \mid x_1^t, \ldots, x_{i-1}^t, x_{i+1}^{t-1}, \ldots, x_n^{t-1})$.

Theorem: For discrete distributions, the distribution of $(x_1^t, x_2^t, \ldots, x_n^t)$ converges to $p(x_1, x_2, \ldots, x_n)$.
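A minimal sketch of the method on a toy problem may help. The joint table `JOINT` over two binary variables is a made-up example (not from the talk), and the function names are illustrative only. Each sweep resamples one coordinate at a time from its conditional, and the empirical frequencies of the visited states should approach the joint distribution.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two binary variables (x1, x2).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_conditional(i, state, rng):
    """Draw x_i from p(x_i | all other coordinates) using the table JOINT."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        # The conditional is proportional to the joint with x_i set to value.
        weights.append(JOINT[tuple(candidate)])
    return rng.choices((0, 1), weights=weights)[0]

def gibbs(n_sweeps, seed=0):
    """Run n_sweeps Gibbs sweeps and return empirical state frequencies."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = Counter()
    for _ in range(n_sweeps):
        for i in range(len(state)):  # update x1, then x2, using the newest values
            state[i] = sample_conditional(i, state, rng)
        counts[tuple(state)] += 1
    return {k: v / n_sweeps for k, v in counts.items()}

estimate = gibbs(50_000)
```

Comparing `estimate` with `JOINT` gives an informal check of the convergence statement for this small discrete case.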

Connection to HMMs

$\Theta$ : parameters (t = output probabilities, s = transition probabilities)

Difficult to sample from P(Z | Y)

Easy to sample $\Theta$ from $P(\Theta \mid Z, Y)$

Easy to sample $Z$ from $P(Z \mid \Theta, Y)$

The Motif Finding Problem

Fixed width unknown motif.

1 motif per sequence, unknown location.

     P1  P2  P3  P4  P5
A    ?   ?   ?   ?   ?
C    ?   ?   ?   ?   ?
G    ?   ?   ?   ?   ?
T    ?   ?   ?   ?   ?

The Motif Finding HMM

$\Theta$ : PSSM parameters

Y : Observed sequences

Z : Alignment

[Figure: PSSM with unknown entries (rows A, C, G, T; columns P1 to P5) and the HMM state chain BG, P1, ..., P5, BG]

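Gibbs Sampling for Motif Detection

Sample $\Theta$ from $P(\Theta \mid Z, Y)$ [sample PSSM from alignment]

Sample $Z$ from $P(Z \mid \Theta, Y)$ [find positions from PSSM]

Alternating the two steps yields samples from $P(Z, \Theta \mid Y)$

[Figure: PSSM with unknown entries (rows A, C, G, T; columns P1 to P5) and the HMM state chain BG, P1, ..., P5, BG]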
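The two alternating steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the sampler of Lawrence et al. or of the talk: the background is taken to be uniform over A, C, G, T, each PSSM column gets a symmetric Dirichlet prior (`pseudo`), and every function name is hypothetical.

```python
import random

ALPHABET = "ACGT"

def sample_pssm(seqs, starts, width, rng, pseudo=1.0):
    """Sample the PSSM from P(PSSM | Z, Y): one Dirichlet draw per motif column."""
    pssm = []
    for j in range(width):
        counts = {a: pseudo for a in ALPHABET}
        for s, z in zip(seqs, starts):
            counts[s[z + j]] += 1
        # A Dirichlet draw is a set of normalised independent Gamma draws.
        draws = {a: rng.gammavariate(counts[a], 1.0) for a in ALPHABET}
        total = sum(draws.values())
        pssm.append({a: draws[a] / total for a in ALPHABET})
    return pssm

def sample_starts(seqs, pssm, width, rng, background=0.25):
    """Sample Z from P(Z | PSSM, Y): one motif start per sequence."""
    starts = []
    for s in seqs:
        weights = []
        for z in range(len(s) - width + 1):
            w = 1.0
            for j in range(width):
                # Likelihood ratio of motif model versus uniform background.
                w *= pssm[j][s[z + j]] / background
            weights.append(w)
        starts.append(rng.choices(range(len(weights)), weights=weights)[0])
    return starts

def gibbs_motif(seqs, width, n_iter=200, seed=0):
    """Alternate the two conditional draws, starting from random positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        pssm = sample_pssm(seqs, starts, width, rng)
        starts = sample_starts(seqs, pssm, width, rng)
    return starts, pssm
```

A call such as `gibbs_motif(["ACGTTGACA", "TTGACAGG"], width=6)` returns the last sampled start positions and PSSM; in practice one would collect many samples rather than keep only the last.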

Gibbs Sampling for HMMs

N sequences independently generated by an HMM.

Three types of random variables:
  $\Theta$ : parameters
  $Z = Z_1, Z_2, \ldots, Z_i, \ldots, Z_N$ : hidden variables, with $Z_i = Z_i^1, Z_i^2, \ldots$
  $Y = Y_1, Y_2, \ldots, Y_i, \ldots, Y_N$ : observed variables, with $Y_i = Y_i^1, Y_i^2, \ldots$

Aim: To sample from the distribution $P(Z, \Theta \mid Y)$

Iterations of a Gibbs Sampler
  Sample $Z_i$ from $p(Z_i \mid Y, \Theta)$
  Sample $\Theta$ from $p(\Theta \mid Y, Z)$
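One common way to realise these two draws for a plain discrete HMM is forward filtering with backward sampling for the hidden path, and Dirichlet posteriors for the parameters. The sketch below assumes symmetric Dirichlet priors (`pseudo`), integer-coded observations in `0..M-1`, and a standard (not generalized) HMM; the talk's gene-finding model is richer, and all names here are hypothetical.

```python
import random

def sample_path(obs, init, trans, emit, rng):
    """Draw a hidden path from p(Z_i | Y_i, parameters): forward filter, backward sample."""
    K = len(init)
    fwd = []
    for t, y in enumerate(obs):
        if t == 0:
            row = [init[k] * emit[k][y] for k in range(K)]
        else:
            row = [emit[k][y] * sum(fwd[-1][j] * trans[j][k] for j in range(K))
                   for k in range(K)]
        norm = sum(row)
        fwd.append([v / norm for v in row])  # rescale each step to avoid underflow
    path = [rng.choices(range(K), weights=fwd[-1])[0]]
    for t in range(len(obs) - 2, -1, -1):
        nxt = path[-1]
        w = [fwd[t][j] * trans[j][nxt] for j in range(K)]
        path.append(rng.choices(range(K), weights=w)[0])
    path.reverse()
    return path

def dirichlet(counts, rng):
    """Dirichlet draw via normalised Gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_params(seqs, paths, K, M, rng, pseudo=1.0):
    """Draw parameters from p(parameters | Y, Z) given counts along the paths."""
    init_c = [pseudo] * K
    trans_c = [[pseudo] * K for _ in range(K)]
    emit_c = [[pseudo] * M for _ in range(K)]
    for obs, path in zip(seqs, paths):
        init_c[path[0]] += 1
        for t, (z, y) in enumerate(zip(path, obs)):
            emit_c[z][y] += 1
            if t > 0:
                trans_c[path[t - 1]][z] += 1
    return (dirichlet(init_c, rng),
            [dirichlet(r, rng) for r in trans_c],
            [dirichlet(r, rng) for r in emit_c])

def gibbs_hmm(seqs, K, M, n_iter=100, seed=0):
    """Alternate parameter draws and path draws for n_iter iterations."""
    rng = random.Random(seed)
    paths = [[rng.randrange(K) for _ in obs] for obs in seqs]
    for _ in range(n_iter):
        init, trans, emit = sample_params(seqs, paths, K, M, rng)
        paths = [sample_path(obs, init, trans, emit, rng) for obs in seqs]
    return paths, (init, trans, emit)
```

Each path draw costs time linear in the sequence length for a fixed number of states, which is in line with a total cost that grows with iterations, sequences, and length.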

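[Figure: gene-finding HMM state diagram. State labels: internal exon states E00, E01, E02, E10, E11, E12, E20, E21, E22; intron states Intron0, Intron1, Intron2; initial exon states EI0, EI1, EI2; terminal exon states ET0, ET1, ET2; SingleExon. Labels $E^1_{00}, E^2_{00}, \ldots, E^k_{00}$ also appear.]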

Gibbs Sampling for Gene Finding

Initial Predictions

Sample $Z_1$ from $P(Z_1 \mid Z_{[-1]}, Y)$

Sample $Z_2$ from $P(Z_2 \mid Z_{[-2]}, Y)$

Learning the Number of Exon Classes

Find Significant Hits Among Peptides

Each Connected Component forms a Class of Genes
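The grouping step can be illustrated with a small union-find sketch. It assumes the significant hits are already available as pairs of peptide indices (how hits are scored is not specified on the slide), and `gene_classes` is a hypothetical helper name.

```python
def gene_classes(n_peptides, hits):
    """Group peptides into classes: one class per connected component of the hit graph.

    n_peptides: number of predicted peptides, indexed 0..n_peptides-1.
    hits: iterable of (a, b) index pairs with a significant hit (assumed given).
    """
    parent = list(range(n_peptides))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    classes = {}
    for p in range(n_peptides):
        classes.setdefault(find(p), []).append(p)
    return sorted(classes.values())

example = gene_classes(5, [(0, 1), (1, 2)])
```

Here peptides 0, 1 and 2 end up in one class and peptides 3 and 4 each form their own, so the number of classes is the number of components.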

Testing

1.6 Mb of data from the NISC Comparative Sequencing Project
  Divided into large genomic regions (100-200 kb), some of which contained multiple genes

Selection Criteria
  4 mammals roughly equidistant from each other: Human, Mouse/Rat, Dog/Cat, Pig/Cow.
  Available RefSeq annotations with no obvious alternative splicing

Results

            Nucl. Sn   Nucl. Sp   Exon Sn   Exon Sp
Gibbs       0.897      0.886      0.714     0.628
Genscan     0.911      0.548      0.777     0.518
Twinscan    0.692      0.856      0.440     0.513
SLAM        0.791      0.881      0.632     0.527
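For readers unfamiliar with the column headings, the sketch below shows how such measures are commonly computed in gene-finding evaluations. It assumes the usual convention in this literature, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP), and where an exon counts as correct only if both boundaries match exactly; the slides do not state which definitions were used. Function names are hypothetical and inputs are assumed non-empty.

```python
def nucleotide_accuracy(true_coding, pred_coding):
    """Nucleotide-level (Sn, Sp) from per-base coding labels (truthy = coding).

    Assumes at least one true coding base and one predicted coding base.
    """
    pairs = list(zip(true_coding, pred_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if p and not t)
    return tp / (tp + fn), tp / (tp + fp)

def exon_accuracy(true_exons, pred_exons):
    """Exon-level (Sn, Sp); exons are (start, end) pairs and must match exactly."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)
    return exact / len(true_set), exact / len(pred_set)

nucl_sn, nucl_sp = nucleotide_accuracy([1, 1, 1, 0, 0, 0, 1, 1],
                                       [1, 1, 0, 0, 1, 0, 1, 1])
exon_sn, exon_sp = exon_accuracy([(0, 3), (6, 8)],
                                 [(0, 2), (6, 8), (10, 12)])
```

Under this convention a predictor that calls many false exons keeps high sensitivity while its specificity drops.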

Robustness Results

Columns: Nucl. Sn, Nucl. Sp, Exon Sn, Exon Sp

Gibbs (before) 0.939
Gibbs (after) 0.885
Genscan (before)
Genscan (after)
Twinscan (before)
Twinscan (after)
SLAM (before) 0.927
SLAM (after) 0.438

Conclusions

Efficient
  Running time O(kNL), memory requirements O(L)
  k = #iterations, N = #sequences, L = max. length
  Converges rapidly.

No Alignment Required
  Symmetric prediction for all species
  Application: rapid comparative based annotation of newly sequenced genomes

Robust
  Rearrangements
  Draft Quality Sequence

Outline

GeneFinding by Gibbs Sampling
  Ab-Initio GeneFinding in Vertebrates
  Overview of Gibbs Sampling
  Gene finding by Gibbs sampling
  Results

Evidence Based Multiple Species GeneFinding
  Evidence based GeneFinding
  ExonAligner : An Exon Alignment Program
  Initial Results
  Proposals for Future Work

Evidence Based GeneFinding

Procrustes : cDNA evidence, DP based spliced alignment [Gelfand et al. 1996]

Genewise : Protein evidence, combines genefinding HMM with protein profile HMM, part of the ENSEMBL pipeline. [Birney et al. 1996]

Projector : Evidence from orthologous genes in related species, uses pair HMM based model. [Meyer and Durbin 2004]

Evidence

Annotation

Evidence Based GeneFinding

Large scale sequencing efforts.
  8 Drosophila genomes very soon; D. melanogaster well annotated.
  9 mammalian genomes by early 2005; Human genome well annotated.

Aim: Rapid annotation of newly sequenced genomes.
  Use well annotated genomes as evidence.

Draft Quality Genomes
  Robustness for Sequencing Errors

Using sequences from multiple species will result in more accurate annotations.
  Will also give us high quality multiple alignments.
  Data to study the evolution of genomes.

Evidence Based Multiple Species GeneFinding

Basic Idea : Use annotations from a reference genome (e.g. D. melanogaster or H. sapiens) as evidence to annotate newly sequenced genomes.

Use Whole Genome Homology Maps (courtesy Colin Dewey)

Project exons from reference genome into every other genome.

Join projections to get multiple alignments. Use orthologous sequences from multiple species to get more accurate annotations.

Produce annotations with all supporting evidence.

Exploit phylogenetic relationships among the species.

Projecting Annotated Exons

Annotation

Homology Map

ExonAligner

ExonAligner : An Exon Alignment Program

Mixture of global and local alignment.
  Penalize overhanging ends in Evidence.
  Overhanging ends in Target are OK.

Exploit the property that they code for homologous proteins.
  Special Dynamic Programming Matrix

Robust.
  Sequencing Errors.
  Phase Shifts.

Chaining Algorithm for large sequences.

Evidence

Target

The Dynamic Programming Matrix

The figure only shows edges into the black node. The red edges represent non-codon gaps, i.e. gaps caused by phase shifts or sequencing errors, whose length is not a multiple of 3. They are heavily penalized.
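One way to picture such a matrix is a nucleotide-level recurrence in which each cell has a diagonal edge, two codon-gap edges that skip three bases, and two single-base gap edges that carry a heavy penalty. The sketch below is only an illustration of that idea and is not ExonAligner's actual matrix: the scores are made up, gaps are scored per edge, and the free overhang in the target follows the "overhanging ends in Target" rule from the ExonAligner slide.

```python
def exon_align_score(evidence, target, match=1, mismatch=-1,
                     codon_gap=-2, frameshift_gap=-8):
    """Best score aligning all of `evidence` inside `target` (hypothetical scoring).

    Evidence must be consumed end to end, so its overhang is penalised through
    gap costs; leading and trailing target bases are free.
    """
    n, m = len(evidence), len(target)
    neg = float("-inf")
    S = [[neg] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        S[0][j] = 0  # free overhang at the start of the target
    for i in range(1, n + 1):
        for j in range(m + 1):
            best = S[i - 1][j] + frameshift_gap          # single evidence base unaligned
            if i >= 3:
                best = max(best, S[i - 3][j] + codon_gap)  # whole evidence codon unaligned
            if j >= 1:
                sub = match if evidence[i - 1] == target[j - 1] else mismatch
                best = max(best, S[i - 1][j - 1] + sub)    # aligned pair of bases
                best = max(best, S[i][j - 1] + frameshift_gap)  # single target base skipped
            if j >= 3:
                best = max(best, S[i][j - 3] + codon_gap)  # whole target codon skipped
            S[i][j] = best
    return max(S[n])  # free overhang at the end of the target

clean = exon_align_score("ATGGCC", "TTATGGCCAA")
codon_deletion = exon_align_score("ATGAAAGCC", "ATGGCC")
```

With these made-up scores, an exact embedded copy scores 6, and losing one whole codon costs a single codon-gap penalty, while a one-base shift is forced onto a much more expensive path.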

Chaining Algorithms

Widely used in large scale alignment algorithms.
  MUMmer [Salzberg et al. 1999]
  AVID [Bray et al. 2002]
  LAGAN [Brudno et al. 2003]

Step 1: Find good local alignments or fragments.

Step 2: Select a consistent subset of fragments for chaining and call these fragments anchors.

Step 3: Join the anchors to get an alignment.

The ExonAligner Chaining Algorithm

Construction of Fragments.
  Translate target sequence in the 3 frames.
  Find significant hits with (translated) evidence and use them as fragments.

Selection of Anchors.
  Construct weighted DAG from fragments.
  Weigh edges by using dynamic programming.
  Use nodes in the shortest path in the DAG as anchors.

Use dynamic programming to join anchors together.
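The anchor selection step can be sketched as a path search over a DAG of fragments. The sketch below makes several assumptions that are not from the talk: fragments are given as coordinate ranges with a hit score, an edge joins two fragments only when one precedes the other in both sequences, and edge weights use a simple linear gap cost in place of the dynamic programming mentioned on the slide. It maximises total score, which is the same as a shortest path once weights are negated.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    e_start: int   # start in evidence (half-open interval)
    e_end: int
    t_start: int   # start in target (half-open interval)
    t_end: int
    score: float   # significance of the hit (assumed given)

def select_anchors(fragments, gap_cost=0.1):
    """Return the best-scoring consistent chain of fragments as anchors."""
    frags = sorted(fragments, key=lambda f: (f.e_start, f.t_start))
    n = len(frags)
    if n == 0:
        return []
    best = [f.score for f in frags]  # best chain score ending at each fragment
    prev = [None] * n
    for b in range(n):
        for a in range(b):
            fa, fb = frags[a], frags[b]
            # Edge a -> b exists only if a precedes b in both sequences.
            if fa.e_end <= fb.e_start and fa.t_end <= fb.t_start:
                gap = (fb.e_start - fa.e_end) + (fb.t_start - fa.t_end)
                cand = best[a] + fb.score - gap_cost * gap
                if cand > best[b]:
                    best[b], prev[b] = cand, a
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end is not None:  # walk back along the stored predecessors
        chain.append(frags[end])
        end = prev[end]
    return chain[::-1]

frags = [Fragment(0, 10, 0, 10, 10.0), Fragment(5, 15, 40, 50, 3.0),
         Fragment(12, 20, 12, 20, 8.0), Fragment(25, 30, 5, 9, 6.0)]
anchors = select_anchors(frags)
```

In this example the first and third fragments form the chain; the other two conflict with it in one of the two sequences and are left out.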

Exploiting Phylogeny

Recoverable Exons/Genes

Project Using Exon Aligner

Preliminary Results

Created a Homology map of Human, Chimp, Rat, Mouse and Chicken genomes
  266836 exon cliques.
  45543 non-convex exon cliques.
  27502 of these recoverable.

Used ExonAligner to map 3300 human RefSeq genes into the chimp genome.
  Robustness of Algorithm critical.
  500 of the 42662 exons had non-codon gaps.

These alignments will in turn be used to learn parameters for ExonAligner.
  Extrapolate parameters for other species.

An Illustrative Example

RefSeq Gene NM_030575
  Single exon gene, 221 a.a. protein.
  No orthologous gene found by Genewise.

Potential orthologous gene in chimp
  2 non-codon gaps in alignment (1 insertion and 1 deletion separated by 60 nt).
  212 out of 221 amino acids are matches.

Is this a real ortholog?
  Phase Shift/Sequencing Error?
  Find orthologous genes in other species and use multiple alignment.

Future Work

Extend ExonAligner for Multiple Species
  Robust realignment
  Take into account codon structure
  Robust for phase shifts/sequencing errors

Annotation with supporting evidence.
  Basic Evidence, e.g. RefSeq Gene Annotation
  Multiple Alignment with Orthologous Features
  Score : Statistical Significance of the Feature

Future Work

Comprehensive Annotation Program
  Put Evidence Based and Ab initio methods together
  Try to use alignment/homology in Gibbs Sampler

Rapid annotation of Drosophila and Mammalian genomes
  Berkeley AAA group for Drosophila genomes.

Study the evolution of genes
  Find human specific genes
