Improving Genome Annotation using Proteomics Nathan Edwards Center for Bioinformatics and...

Post on 28-Dec-2015


Improving Genome Annotation using Proteomics

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

3

Mass Spectrometer

[Diagram: Sample → Ionizer → Mass Analyzer → Detector]

• Ionizer: MALDI, Electrospray Ionization (ESI)
• Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion Trap
• Detector: Electron Multiplier (EM)

4

High Bandwidth

[Figure: mass spectrum; x-axis m/z (250 to 1000), y-axis % intensity (0 to 100)]

5

Mass is fundamental!

6

Mass Spectrometry for Proteomics

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

7

Mass Spectrometry for Proteomics

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein / genome sequences
  • A reference for comparison

8

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

9

Single Stage MS

[Figure: MS spectrum; x-axis m/z]

10

Tandem Mass Spectrometry (MS/MS)

[Figure: precursor selection from the MS spectrum; x-axis m/z]

11

Tandem Mass Spectrometry (MS/MS)

[Figure: precursor selection + collision-induced dissociation (CID) gives the MS/MS spectrum; x-axis m/z]

12

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from (any) sequence database
  • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...

• Automated, high-throughput peptide identification in complex mixtures
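The three steps above (compute fragment masses, compare with the spectrum, retain good matches) can be sketched in Python. This is a minimal illustration under stated assumptions, not how any particular search engine scores spectra: it considers only singly charged b- and y-ions, uses monoisotopic residue masses, and scores a candidate by counting theoretical fragments that fall within a tolerance of an observed peak. The function names and the `tol` parameter are hypothetical.

```python
# Monoisotopic amino-acid residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, added to every y-ion
PROTON = 1.007276   # charge carrier for a 1+ ion


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # prefixes: b1 .. b(n-1)
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # suffixes: y1 .. y(n-1)
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of any peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(mz - p) <= tol for p in peaks)
               for mz in b_ions + y_ions)


def rank_candidates(candidates, peaks, tol=0.5):
    """Step 3: candidates ordered from best to worst match count."""
    return sorted(candidates,
                  key=lambda pep: count_matches(pep, peaks, tol),
                  reverse=True)


# Synthetic "spectrum": the theoretical fragments of MARNR, no noise.
b, y = fragment_mz("MARNR")
peaks = b + y
print(rank_candidates(["MVMAR", "MARNR"], peaks))
```

Real search engines also use peak intensities, multiple charge states, and a statistical significance model; the peak count here only shows the shape of the computation.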

13

Peptide Identification

...can provide direct experimental evidence for the amino-acid sequence of functional proteins.

Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames

14

How could this help?

• Evidence for SNPs and alternative splicing stops with transcription

• No genomic or transcript evidence for translation start-site.

• Conservation doesn’t stop at coding bases!

• Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.

15

What can be observed?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Microexons (non-canonical splice-sites)

• Alternative translation start-sites (codons)

• Alternative translation frames

• “Dark” open-reading-frames

16

Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

19

Translation Start-Site

• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.

• THOC2 gene:
  • Part of the heteromultimeric THO/TREX complex.

• Initially believed to be a “novel” ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no Swiss-Prot entry
  • GenBank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • ~ 100,000 bases upstream of other isoforms

23

Translation Start-Site

24

Easily distinguish minor sequence variations

Two B. anthracis Sterne α/β SASP annotations

• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)

• Intact proteins differ by 230 Da
  • 7441 Da vs 7211 Da

• N-terminal tryptic peptides:
  • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
  • MARNR (646.3 Da)
  • Very different MS/MS spectra
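The peptide masses quoted on this slide can be reproduced with a short calculation. The sketch below assumes they are neutral monoisotopic masses (sum of residue masses plus one water), which matches the quoted values to one decimal place. `peptide_mass` is a hypothetical helper name, and only the residues needed here are included in the table.

```python
# Monoisotopic residue masses (Da) for the residues used on the slide.
MONO = {"M": 131.04049, "V": 99.06841, "A": 71.03711,
        "R": 156.10111, "N": 114.04293}
WATER = 18.010565  # one H2O per intact peptide


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide, in Da."""
    return sum(MONO[aa] for aa in peptide) + WATER


for pep in ("MVMAR", "MVMARNR", "MARNR"):
    print(f"{pep}: {peptide_mass(pep):.1f} Da")
# MVMAR: 606.3 Da
# MVMARNR: 876.4 Da
# MARNR: 646.3 Da

# The two annotations differ by the N-terminal residues M and V.
print(f"M + V: {MONO['M'] + MONO['V']:.1f} Da")
# M + V: 230.1 Da
```

The M + V residue mass of about 230 Da agrees with the stated difference between the two intact-protein masses.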

25

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

[Figure labels: stop codon, stop codon]

• Find all the open-reading-frames...

...courtesy of Art Delcher

26

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

[Figure labels: stop codon, stop codon]

…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT… (reverse strand)

[Figure labels: shifted stop, stop codon]

• Find all the open-reading-frames...

...but they overlap – which ones are correct?

...courtesy of Art Delcher
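To make "find all the open-reading-frames" concrete, here is a minimal Python sketch that lists the stop-free stretches between consecutive in-frame stop codons, in all three frames of both strands. It assumes the standard stop codons TAA, TAG and TGA, and it ignores start codons and any coding-sequence scoring; it is not Glimmer's algorithm. All function names are hypothetical. The example uses the 42-base excerpt shown on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def stop_positions(dna, frame):
    """0-based start offsets of stop codons in one reading frame."""
    return [i for i in range(frame, len(dna) - 2, 3)
            if dna[i:i + 3] in STOP_CODONS]


def open_stretches(dna, frame, min_codons=1):
    """(start, end) of stop-free stretches bounded by two in-frame stops.

    start is the first base after the left stop codon; end is the first
    base of the right stop codon (exclusive end of the stretch).
    """
    stops = stop_positions(dna, frame)
    stretches = []
    for left, right in zip(stops, stops[1:]):
        start = left + 3
        if (right - start) // 3 >= min_codons:
            stretches.append((start, right))
    return stretches


def six_frame_stretches(dna, min_codons=1):
    """Stretches for frames 0..2 on the forward (+) and reverse (-) strand."""
    result = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            result[(strand, frame)] = open_stretches(seq, frame, min_codons)
    return result


excerpt = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for key, stretches in six_frame_stretches(excerpt).items():
    print(key, stretches)
```

On a whole genome the stretches from different frames overlap heavily, which is the problem the slide points out: a scoring model is needed to decide which ones are real genes.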

27

Coding-Sequence “Score”

...courtesy of Art Delcher

28

Glimmer3 Performance

Genome                                                    Glimmer3 Predictions
Organism                           Length  GC%   # Genes  Matches  Match %  Correct Starts  Start %  Extra
Archaeoglobus fulgidus             2.18Mb  48.6     1165     1162    99.7%             875    75.1%   1305
Bacillus anthracis                 5.23Mb  35.4     3132     3129    99.9%            2768    88.4%   2340
Bacillus subtilis                  4.21Mb  43.5     1576     1567    99.4%            1429    90.7%   2879
Campylobacter jejuni               1.78Mb  30.3     1233     1233   100.0%            1149    93.2%    668
Carboxydothermus hydrogenoformans  2.40Mb  42.0     1753     1752    99.9%            1590    90.7%    865
Caulobacter crescentus             4.02Mb  67.2     2192     2187    99.8%            1552    70.8%   1559
Chlorobium tepidum                 2.15Mb  56.5     1292     1289    99.8%             949    73.5%    765
Clostridium perfringens            3.03Mb  28.6     1504     1503    99.9%            1385    92.1%   1178
Colwellia psychrerythraea          5.37Mb  38.0     3063     3060    99.9%            2663    86.9%   1714
Dehalococcoides ethenogenes        1.47Mb  48.9     1069     1059    99.1%             929    86.9%    483
Escherichia coli                   4.64Mb  50.8     3603     3553    98.6%            3150    87.4%    913
Geobacter sulfurreducens           3.81Mb  60.9     2351     2340    99.5%            1974    84.0%   1091
Haemophilus influenzae             1.83Mb  38.1     1170     1170   100.0%            1054    90.1%    639
Helicobacter pylori                1.67Mb  38.9      915      914    99.9%             805    88.0%    765
Listeria monocytogenes             2.91Mb  38.0     1966     1965    99.9%            1797    91.4%    845
Methylococcus capsulatus           3.30Mb  63.6     2015     2005    99.5%            1542    76.5%   1231
Mycobacterium tuberculosis         4.40Mb  65.6     2217     2205    99.5%            1493    67.3%   2104
Neisseria meningitidis             2.27Mb  51.5     1232     1217    98.8%            1042    84.6%   1329
Porphyromonas gingivalis           2.34Mb  48.3     1200     1198    99.8%             933    77.8%    887
Pseudomonas fluorescens            7.07Mb  63.3     4535     4503    99.3%            3577    78.9%   1871
Pseudomonas putida                 6.18Mb  61.5     3633     3596    99.0%            2825    77.8%   1916
Ralstonia solanacearum             3.72Mb  67.0     2512     2487    99.0%            2061    82.0%   1077
Staphylococcus epidermidis         2.62Mb  32.1     1650     1649    99.9%            1511    91.6%    771
Streptococcus agalactiae           2.16Mb  35.6     1441     1438    99.8%            1336    92.7%    683
Streptococcus pneumoniae           2.16Mb  39.7     1359     1355    99.7%            1214    89.3%    780
Thermotoga maritima                1.86Mb  46.2     1092     1090    99.8%             892    81.7%    804
Treponema denticola                2.84Mb  37.9     1463     1463   100.0%            1332    91.0%   1210
Treponema pallidum                 1.14Mb  52.8      575      572    99.5%             425    73.9%    557
Ureaplasma parvum                  0.75Mb  25.5      327      327   100.0%             300    91.7%    293
Wolbachia endosymbiont             1.08Mb  34.2      628      627    99.8%             528    84.1%    537
Averages:                                                            99.6%                    84.3%

• Glimmer3 trained & compared to RefSeq genes with annotated function

• Correct STOP: 99.6%

• Correct START: 84.3%

• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
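The per-genome percentages in the table follow directly from the counts: matches (correct stop) and correct starts are each divided by the number of reference genes. A small Python check on three rows copied from the table, with a hypothetical helper `pct`:

```python
# (reference genes, matches, correct starts), copied from the table.
ROWS = {
    "Archaeoglobus fulgidus": (1165, 1162, 875),
    "Bacillus anthracis": (3132, 3129, 2768),
    "Escherichia coli": (3603, 3553, 3150),
}


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100.0 * part / whole, 1)


for organism, (genes, matches, starts) in ROWS.items():
    print(f"{organism}: stop {pct(matches, genes)}%, "
          f"start {pct(starts, genes)}%")
```

Both percentages use the reference gene count as the denominator, so a prediction with the right stop but the wrong start counts toward the first column and against the second.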

29

N-terminal peptides

• (Protein) N-terminal peptides establish
  • start-site of known & unexpected ORFs

Use:
• Directly to annotate genomes
• Evaluate and improve algorithms
• Map cross-species

30

N-terminal peptide workflows

• Typical proteomics workflows sample peptides from the proteome “randomly”

• Caulobacter crescentus (70%)
  • 3733 Proteins (RefSeq Genome annot.)
  • 66K tryptic peptides (600 Da to 3000 Da)
  • 2085 N-terminal tryptic peptides (3%)
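Counts like these come from an in-silico digest of the annotated proteome. The sketch below shows one way such a count could be made; it is not the procedure used for the slide. It assumes the usual trypsin rule (cleave after K or R unless the next residue is P), monoisotopic masses, and a 600 to 3000 Da window. The example protein is a made-up sequence, and all function names are hypothetical.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def peptide_mass(peptide):
    """Neutral monoisotopic mass in Da."""
    return sum(MONO[aa] for aa in peptide) + WATER


def tryptic_pieces(protein):
    """Fully cleaved pieces: cut after K/R, but not before P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and protein[i + 1:i + 2] != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def digest(protein, missed=0):
    """(peptide, is_n_terminal) pairs with up to `missed` missed cleavages."""
    pieces = tryptic_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + missed, len(pieces) - 1) + 1):
            peptides.append(("".join(pieces[i:j + 1]), i == 0))
    return peptides


def nterm_fraction(proteins, missed=0, lo=600.0, hi=3000.0):
    """(N-terminal peptides, all peptides) inside the mass window."""
    n_nterm = n_total = 0
    for protein in proteins:
        for peptide, is_nterm in digest(protein, missed):
            if lo <= peptide_mass(peptide) <= hi:
                n_total += 1
                n_nterm += is_nterm
    return n_nterm, n_total


# Hypothetical protein sequence, for illustration only.
toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(nterm_fraction([toy], missed=1))
```

Each protein contributes at most a handful of N-terminal peptides but many internal ones, which is why "random" sampling of the proteome rarely lands on the N-terminus.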

31

N-terminal peptide workflow

• Protect protein N-terminus
• Digest to peptides
• Chemically modify free peptide N-term
• Use chem. mod. to capture unwanted peptides

Nat Biotech, Vol. 21, pp. 566-569, 2003.

32

Increasing N-terminal peptide coverage

• Multiple (digest) enzymes:
  • trypsin-R: 60% (80%)
  • acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
• Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation

Anal Chem, Vol. 76, pp. 4193-4201, 2004.

33

Proteomics Informatics

• Search spectra against:
  • Entire bacterial genome;
  • All Met initiated peptides; or
  • Statistically likely Met initiated peptides.

• Easily consider initial Met loss PTM, too

• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
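As an illustration of the "all Met initiated peptides" option, the sketch below enumerates candidate N-terminal tryptic peptides from a translated reading frame. It is a simplified assumption of how such a search database might be built, not the procedure behind the slide: the input is an amino-acid translation with `*` marking stops, every M is treated as a possible start, each candidate runs to the first tryptic cleavage site, and initial Met loss is handled by adding an explicit second sequence rather than as a search-engine modification. Missed cleavages and alternative start codons are not handled. Function names are hypothetical.

```python
def first_tryptic_peptide(protein):
    """Prefix up to the first tryptic site (after K/R, not before P)."""
    for i, aa in enumerate(protein):
        if aa in "KR" and protein[i + 1:i + 2] != "P":
            return protein[:i + 1]
    return protein


def met_initiated_candidates(translation):
    """Candidate N-terminal peptides for every Met in a translated frame.

    `translation` is an amino-acid string with '*' at stop codons.
    Each candidate is added with and without its initial Met.
    """
    candidates = set()
    for stretch in translation.split("*"):
        for i, aa in enumerate(stretch):
            if aa != "M":
                continue
            peptide = first_tryptic_peptide(stretch[i:])
            candidates.add(peptide)
            if len(peptide) > 1:
                candidates.add(peptide[1:])  # initial Met removed
    return candidates


# N-terminal fragment from the B. anthracis SASP example.
print(sorted(met_initiated_candidates("MVMARNR")))
```

Because both annotated starts of the SASP example are Met residues in the same frame, one pass over the translation yields candidates for either start site, and the spectra decide between them.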

34

Other Practical Issues

• Suitable for commonly available instrumentation
  • Only the sample prep. is (somewhat) novel.

• Need living organism
  • Stage of life-cycle?

• Bang for buck?
  • N-terminal peptides / $$$$

35

Other Research Projects

• Alternative splicing and coding SNPs in clinical cancer samples
• MS/MS spectral matching using HMMs
• Combining MS/MS search engine results using machine learning
• Microorganism identification using MS (www.RMIDb.org)
• Gapped/spaced seeds for inexact sequence alignment.
• Applications of SBH-graphs and Eulerian paths

36

Hidden Markov Models for Spectral Matching

• Capture statistical variation and consensus in peak intensity

• Capture semantics of peaks
• Extrapolate model to other peptides

• Good specificity with superior sensitivity for peptide detection
  • Assign 1000’s of additional spectra (w/ p-value < 10^-5)

37

Peptide DLATVYVDVLK

38

Peptide DLATVYVDVLK

39

Acknowledgements

• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Cheng Lee
  • Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: NIH/NCI, USDA/ARS