
Annotating genomes using proteomics data

Andy Jones

Department of Preclinical Veterinary Science



Overview

Genome annotation

Current informatics methods

Experimental data

How good are we at annotating genomes?

Proteome data for genome annotation

Study on Toxoplasma

Challenges

Proposed solutions



Annotating eukaryotic genomes

Genome annotation:

Find start codons / transcriptional initiation

Recognise splice acceptor and donor sequences

Predict alternative splicing...

[Figure: gene structure. Genomic DNA with start codon, Exon 1 to Exon 4 and stop codon, shown above the spliced mRNA]



Computational gene prediction

De novo prediction (single genome)

Trained with typical gene structures: learn exon-intron signals, translation initiation and termination signals, e.g. Markov models

Many different predictions scored based on training set of known genes

Multiple genome

Compare confirmed gene sequences from other species

Coding regions are more highly conserved, so conservation indicates gene position

Pattern searching: higher mutation rate of bases separated in multiples of three (mutations in 3rd position of codons are often silent)

Experimental data also contribute to many genome projects

New methods weigh evidence from a variety of sources

Attempting to reproduce how a human annotator would work

Brent, Nat Rev Genet. 2008 Jan;9(1):62-73
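The pattern-searching idea (mutations concentrated at every third base) can be illustrated with a minimal sketch. It assumes two coding sequences that are already aligned and in frame; the function name and the toy sequences are hypothetical and not taken from any gene finder.

```python
def mismatches_by_codon_position(seq_a, seq_b):
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes both sequences are aligned, gap-free and start in frame.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1  # index 0, 1, 2 = codon position 1, 2, 3
    return counts


# Toy aligned fragments (made up for illustration)
species_a = "ATGGCTAAAGGTCTG"
species_b = "ATGGCCAAGGGACTT"
print(mismatches_by_codon_position(species_a, species_b))
```

A strong excess of mismatches at the third position, as in this toy pair, is the kind of signal that suggests a conserved coding region.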



Experimental corroboration of models

Expressed Sequence Tags (ESTs)

Simple to obtain large volumes of data: sequence randomly from cDNA libraries

Problems:

Data sets can contain unprocessed transcripts (do not always confirm splicing)

Rarely cover 5' end of gene

Generally low-quality sequences

High-throughput sequencing

Next-generation sequencers capable of directly sequencing mRNA

Likely to become more widely used in the future

Proteome data (peptide sequence data)



How good are gene models?

Plasmodium falciparum (causative agent of malaria)

Genome sequenced in 2002; has since undergone considerable curation of gene models

Recent article: cDNA study of P. falciparum

Suggests ~25% of gene models in Plasmodium falciparum are incorrect (85 genes out of 356 sampled)

Majority of errors are in splice junctions (intron-exon boundaries)

What does this mean for other genomes...?

Likely that a high percentage of gene sequences are incorrect!

BMC Genomics. 2007 Jul 27;8:255.



Proteome data for genome annotation

Motivation for genome annotation:

Can rule out that transcripts are non-protein-coding

Large volumes of proteome data often collected for other purposes

Certain types of proteome data are able to confirm the start codon of genes (difficult by other methods)

Even where considerable EST / cDNA sequencing has been performed, proteins can be detected with no corresponding EST evidence



Proteogenomic study of Toxoplasma gondii

Proteome study of Toxoplasma gondii using three complementary techniques

Parasite of clinical significance, related to Plasmodium

Study aims:

Identify as many components of the proteome as possible

Relate peptide sequence data back to genome to confirm genes

Relate protein expression data to transcriptional data (EST / microarray)



[Figure: experimental workflow for the three techniques]

1D gel electrophoresis: cut bands, trypsin digestion

2D gel electrophoresis: cut gel spot, trypsin digestion

Liquid chromatography: trypsin digestion, fractions

Peptides from all three routes go to mass spectrometry

Sequence database search (compare with theoretical spectra predicted for each peptide in DB)
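As an illustration of how a database search predicts peptides from protein sequences, here is a minimal sketch of an in silico trypsin digest with precursor mass calculation. It assumes the common rule that trypsin cleaves after K or R unless the next residue is P, and it computes only precursor m/z, not the fragment ions that make up a full theoretical spectrum. Function names and the example sequence are hypothetical.

```python
import re

# Standard monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # mass of the charge carrier


def tryptic_digest(protein, min_len=1):
    """Split after K or R, except when the next residue is P (no missed cleavages)."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def precursor_mz(peptide, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


# Short illustrative sequence, not a real database entry
for pep in tryptic_digest("MSDAEDVTFETADAGASHTYPMQAGAIKKNGFVMLK", min_len=5):
    print(pep, round(precursor_mz(pep, 2), 4))
```

A search engine would compare values like these, plus predicted fragment ions, against each observed spectrum.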



Database search strategy

[Figure: database search strategy]

ToxoDB: 60 MB genome sequence (DNA sequence database)

Amino acid sequence databases: official gene models, alternative gene models predicted by gene finders, ORFs predicted in a 6-frame translation

Concatenate databases

Search all spectra

Identify peptides and proteins

Align peptide sequences back to corresponding genomic region
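The ORF database step can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and stop-to-stop ORFs (no start codon required); the function names and minimum length are hypothetical and not taken from the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate in frame 0; unknown codons (e.g. containing N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna, min_len=3):
    """Return (strand, frame, peptide) for every stop-to-stop ORF of at least min_len residues."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


# Toy sequence (made up for illustration)
for orf in six_frame_orfs("ATGGCCAAATAAGGTTCCGAGTAG"):
    print(orf)
```

In a real search the minimum ORF length would be much larger, and each ORF would keep its genomic coordinates so that matched peptides can be aligned back to the genome.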



Five-exon gene; incomplete agreement between different gene models

Peptide evidence for all 5 exons and 2 introns out of 4

Note: peptides can only provide positive evidence; no peptides matched to the 5' and 3' termini of the gene model



- Appears to be an additional exon at the 5' end

- None of the GLEAN, TwinScan or TigrScan algorithms appears to have made the correct prediction



ORF / part of TgGlimmerHMM sequence:

VVGGFSSNFLSFFSVIITSVKMSDAEDVTFETA

DAGASHTYPMQAGAIKKNGFVMLKGNPCKV

VDYSTSKTGKHGHAKAHIVGLDIFTGKKYED

VCPTSHNMEVPNVKRSEFQLIDLSDDGFCTLL

LENGETKDDLMLPKDSEGNLDEVATQVKNLF

TDGKSVLVTVLQACGKEKIIASKEL

50.m5694 sequence:

MVEGVYSSFEAMIFSLPHACRTVTRT

DLPSVKRFLTCVATSSKFPSESLGSIK

SSFVSPFSRSSVQKPSSDKSINWNSDL

FTFGTSML

- All peptides matched to gene models on opposite strand



Study outcomes

Protein evidence for approximately 1/3 of predicted genes (2250 proteins)

Around 2500 splicing events confirmed: peptides aligned across intron-exon boundaries

Around 400 protein IDs appear to match alternative gene models

Genome database (ToxoDB) hosts peptide sequences aligned against gene models

Can we use informatics to improve this strategy...?

Xia et al. (2008) Genome Biology, 9(7), pp. R11



Challenges of proteogenomics

Main informatics challenge:

A protein can usually only be identified if the gene sequence has been correctly predicted from the genome

In effect, would like to use MS data directly for gene discovery

But... searching a six-frame genome translation is problematic

All peptide and protein identifications are probabilistic

False positive rate is proportional to search database size

On average only ~10-20% of spectra identify a peptide

Need methods that can exploit the rest of the meaningful spectra

When gene models change, protein identifications are out of date

No dynamic interaction between proteome and genome data
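To give a feel for why the six-frame search is problematic, the following back-of-envelope sketch compares search space sizes. The genome size and gene count are derived from figures quoted on the slides; the mean protein length of 500 residues is purely an assumption for illustration.

```python
def six_frame_residues(genome_bp):
    """Approximate number of amino acid positions in a six-frame translation."""
    strands = 2
    frames_per_strand = 3
    return strands * frames_per_strand * (genome_bp // 3)


def protein_db_residues(n_proteins, mean_length):
    """Total residues in a protein database of n_proteins entries."""
    return n_proteins * mean_length


genome_bp = 60_000_000        # genome size quoted on the slides
n_predicted_genes = 3 * 2250  # 2250 proteins is about one third of predicted genes
mean_length = 500             # ASSUMPTION: illustrative mean protein length

six_frame = six_frame_residues(genome_bp)
gene_models = protein_db_residues(n_predicted_genes, mean_length)
print(f"six-frame search space:  {six_frame:,} residues")
print(f"gene-model search space: {gene_models:,} residues")
print(f"ratio: {six_frame / gene_models:.1f}x")
```

Under these assumptions the six-frame database is tens of times larger than the gene-model database, so far more random matches compete with each true identification.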



Automated re-annotation pipeline

Planned improvements to the informatics workflow:

1. Re-querying pipeline: each time gene models change, all mass spectra are automatically re-queried

2. Integrate peptide evidence directly into gene finding software

3. Maximising the number of informative mass spectra

4. Attempt to optimise algorithms for de novo sequencing of peptides

5. N-terminal proteomics

- Could be used to confirm gene initiation point



[Figure: automated re-annotation pipeline. Inputs: spectra, genome sequence, official gene set, alternative gene models. Stage 1: multiple database search engines against the official gene set (confirmed official model). Stage 2: multiple database search engines against alternative gene models (promote alternative model). Stage 3: modified de novo algorithms (novel ORF, splice junction). Proteomic evidence flows back to the gene finder.]

Spectra searched in series

Peptide evidence confirming official gene, alternative model, new ORF: direct flow back to modified gene finder

Produce new set of predictions

Iteratively improve number of spectra identified

In each iteration, fewer spectra flow on to stage 2 and 3
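The serial flow of spectra through the stages can be sketched as below. This is a minimal illustration only: the stage names, the `staged_search` function and the dictionary lookups standing in for real search engines are all hypothetical, and the feedback loop into the gene finder is not modelled.

```python
def staged_search(spectra, stages):
    """Pass spectra through the stages in series.

    stages: list of (name, search) pairs, where search(spectrum_id)
    returns a peptide string or None. Only unidentified spectra flow
    on to the next stage.
    """
    remaining = list(spectra)
    evidence = {}
    for name, search in stages:
        hits, unmatched = {}, []
        for spectrum in remaining:
            peptide = search(spectrum)
            if peptide is None:
                unmatched.append(spectrum)
            else:
                hits[spectrum] = peptide
        evidence[name] = hits
        remaining = unmatched
    return evidence, remaining


# Toy lookups standing in for real search engines (hypothetical data)
official = {"s1": "PEPTIDEK"}
alternative = {"s2": "ALTMODELR"}
de_novo = {"s3": "NOVELORFK"}

stages = [
    ("official", official.get),
    ("alternative", alternative.get),
    ("de_novo", de_novo.get),
]
evidence, unidentified = staged_search(["s1", "s2", "s3", "s4"], stages)
print(evidence)
print("still unidentified:", unidentified)
```

In the planned pipeline, the evidence from each stage would be returned to the gene finder and the whole search repeated on the new predictions.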



Query spectra using different search engines

Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. PROTEOMICS, in press (2008)

Each search engine produces a different non-standard score of the quality of a match

Developed a search engine independent score, based on analysis of false discovery rate

Identifications made by more search engines are scored more highly

Can generate 35% more peptide identifications than the best single search engine

[Figure: peptide identifications from OMSSA, X!Tandem and Mascot are passed to an FDR-based rescoring algorithm, giving a combined list of peptides]
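The slides do not give the details of the rescoring algorithm. As background, the sketch below shows one common way to estimate false discovery rate for a single search engine, the target-decoy approach, which converts any engine-specific score into a comparable q-value. It is not the authors' method; the function name and toy scores are hypothetical.

```python
def q_values(matches):
    """Estimate q-values with a target-decoy approach.

    matches: list of (score, is_decoy) pairs, higher score = better match.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)

    # FDR at each rank: decoy hits so far divided by target hits so far
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: lowest FDR at this rank or at any lower-scoring rank
    qs = []
    lowest = float("inf")
    for fdr in reversed(fdrs):
        lowest = min(lowest, fdr)
        qs.append(lowest)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


# Toy scores from one hypothetical search engine
toy = [(90, False), (80, False), (70, True), (60, False), (50, True)]
for row in q_values(toy):
    print(row)
```

Because q-values mean the same thing whichever engine produced the raw score, they give a common scale on which results from several engines could be compared or combined.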



Conclusions

Proteome data can confirm that gene models are correct

Data currently under-exploited

Challenges in searching mass spec data directly against the genome for gene discovery

Build re-querying pipeline

Iteratively improve gene models

Improve capabilities for using multiple search engines

Integrate peptide evidence directly into gene finders



Acknowledgments

Data from Wastling lab:

Dong Xia, Sanya Sanderson, Jonathan Wastling

ToxoDB at UPenn: David Roos, Brian Brunk
