
ASMS 2005

Protein Identification by Database Searching

John Cottrell, Matrix Science


Three ways to use mass spectrometry data for protein ID:

1. Peptide Mass Fingerprint

A set of peptide molecular weights from an enzyme digest of a protein


Henzel, W. J., Billeci, T. M., Stults, J. T., Wong, S. C., Grimley, C. and Watanabe, C. (1993). Proc Natl Acad Sci USA 90, 5011-5.

James, P., Quadroni, M., Carafoli, E. and Gonnet, G. (1993). Biochem Biophys Res Commun 195, 58-64.

Mann, M., Hojrup, P. and Roepstorff, P. (1993). Biol Mass Spectrom 22, 338-45.

Pappin, D. J. C., Hojrup, P. and Bleasby, A. J. (1993). Curr. Biol. 3, 327-32.

Yates, J. R., 3rd, Speicher, S., Griffin, P. R. and Hunkapiller, T. (1993). Anal Biochem 214, 397-408.

1993


Peptide Mass Fingerprint

• Fast, simple analysis
• High sensitivity
• Need database of protein sequences, not ESTs or genomic DNA
• Sequence (or close homolog) must be present in database
• Not good for mixtures, especially a minor component.
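As a rough illustration of what a peptide mass fingerprint is, the sketch below digests a protein sequence in silico and counts how many observed masses fall within a tolerance of a calculated peptide mass. It is a simplified assumption of how such a comparison could work, not Mascot's method: the function names, the trypsin rule (cleave after K or R unless followed by P), the 0.2 Da tolerance, and the absence of missed cleavages and modifications are all choices made for the example.

```python
# Monoisotopic amino acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance=0.2):
    """Count observed masses that lie within tolerance of any calculated peptide mass."""
    calculated = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        1 for mass in observed_masses
        if any(abs(mass - c) <= tolerance for c in calculated)
    )
```

A raw count like this is only a starting point; the mixture and minor-component problems noted above arise because unmatched masses from other proteins are indistinguishable from noise in such a comparison.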


[Figure: peptide backbone with side chains R1 to R4, showing the cleavage sites that give the N-terminal fragment ions (a1, b1, c1, a2, b2, c2, a3, b3, c3) and the C-terminal fragment ions (x3, y3, z3, x2, y2, z2, x1, y1, z1)]

Roepstorff, P. and Fohlman, J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11, 601.
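To make the nomenclature concrete, here is a minimal sketch that calculates singly charged b and y ion m/z values for an unmodified peptide. The function names are hypothetical, and the sketch assumes monoisotopic masses and charge 1+ only; a, c, x and z ions, neutral losses and modifications are left out.

```python
# Monoisotopic amino acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """m/z of b1 .. b(n-1): N-terminal residues plus one proton."""
    ions, running = [], 0.0
    for residue in peptide[:-1]:
        running += RESIDUE_MASS[residue]
        ions.append(running + PROTON)
    return ions


def y_ions(peptide):
    """m/z of y1 .. y(n-1): C-terminal residues plus water plus one proton."""
    ions, running = [], 0.0
    for residue in reversed(peptide[1:]):
        running += RESIDUE_MASS[residue]
        ions.append(running + WATER + PROTON)
    return ions
```

For a peptide of n residues each function returns n - 1 values, one per backbone bond between adjacent residues.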




2. Sequence Query

Mass values combined with amino acid sequence or composition data


Mann, M. and Wilm, M. (1994). Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 66, 4390-9.


[Figure: MS/MS spectrum annotated with a sequence tag; labels T, A, G and mass values 913.2 and 1278.3]


Sequence Tag

• Rapid search times
• Error tolerant
• Requires interpretation
• Requires high quality data.





3. MS/MS Ions Search

MS/MS data from a single peptide or from a complete LC-MS/MS run


Eng, J. K., McCormack, A. L. and Yates, J. R., 3rd (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-89.


MS/MS Ions Search

• Easily automated for high throughput
• Get matches from marginal data
• Can be slow
– No enzyme
– Lots of variable modifications
– Large database
– Large dataset
• Peptide identification, proteins by inference.


MS/MS matching identifies peptides, not proteins

• Grouping peptide matches into protein matches is an arbitrary procedure

Protein A: Peptide 1, Peptide 2, Peptide 3
Protein B: Peptide 1, Peptide 3
Protein C: Peptide 2

• If we match peptides 1, 2 and 3 from a 2D gel spot, Mascot will prefer Protein A (Occam’s razor)

• But it could easily have been a mixture of B and C.
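The Occam's razor preference can be sketched as a greedy minimum set cover: repeatedly pick the protein that explains the most still-unexplained peptides. This is an illustrative assumption about how a parsimony rule might be coded, not Mascot's actual grouping procedure; `minimal_protein_set` and the dictionary layout are hypothetical.

```python
def minimal_protein_set(protein_peptides, observed):
    """Greedily choose proteins until every observed peptide is explained.

    protein_peptides maps a protein name to the set of peptides it contains.
    Ties are broken by protein name so the result is deterministic.
    """
    remaining = set(observed)
    chosen = []
    while remaining:
        best = max(
            sorted(protein_peptides),
            key=lambda name: len(protein_peptides[name] & remaining),
        )
        explained = protein_peptides[best] & remaining
        if not explained:
            break  # some peptides match no protein in the list
        chosen.append(best)
        remaining -= explained
    return chosen


# The three-protein example from the slide.
proteins = {
    "Protein A": {"Peptide 1", "Peptide 2", "Peptide 3"},
    "Protein B": {"Peptide 1", "Peptide 3"},
    "Protein C": {"Peptide 2"},
}
observed = {"Peptide 1", "Peptide 2", "Peptide 3"}
result = minimal_protein_set(proteins, observed)
```

Here `result` contains only Protein A, yet Proteins B and C together cover exactly the same peptides, so the peptide evidence alone cannot rule out the mixture.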


BLAST / FASTA
• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments

MS-based ID
• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides


What is probability based scoring?

We compute the probability that the observed match between the experimental data and mass values calculated from a candidate protein or peptide sequence is a random event.

The ‘correct’ match, which is not a random event, has a very low probability.


Why is probability based scoring important?

• Human (even expert) judgement is subjective and can be unreliable.



• Standard statistical tests of significance can be applied to the results.


Standard significance tests can be applied to results

• Mascot score is -10 log10(P), where P is the absolute probability that the observed match is a random event

• If we make 50,000 trials, a 1 in 20 significance threshold is
– -10 log10(1 / (20 × 50,000)) = 60 … “identity”

• If data quality is poor, this may not be achievable. If the match is clearly an outlier, also report a lower, empirical threshold
– … “homology”
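The threshold arithmetic above can be checked with a few lines of code. This is a minimal sketch of the formula as stated on the slide; the function names are hypothetical, and treating the number of trials as a simple multiplier on the significance level is the assumption the slide itself makes.

```python
import math


def score_from_probability(p):
    """Convert a probability that a match is random into a -10 log10(P) score."""
    return -10 * math.log10(p)


def identity_threshold(n_trials, significance=1 / 20):
    """Score needed for a match to be significant after n_trials comparisons."""
    return -10 * math.log10(significance / n_trials)


# 50,000 trials at a 1 in 20 significance level, as in the example above.
threshold = identity_threshold(50_000)
```

With 50,000 trials `threshold` comes out at approximately 60. Because the threshold grows with the logarithm of the number of trials, a tenfold larger search space raises it by 10.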



• Arbitrary scoring schemes are susceptible to false positives.


Major proteomics study published in Nature, 2002

• 11,381 peptides
• 2,415 proteins
• Matches to fully non-tryptic peptides discarded
• Overall fraction of semi-tryptic peptides 34%
• For proteins identified using
– 1 peptide: 63% semi-tryptic
– 2 peptides: 54% semi-tryptic
– 3 peptides: 46% semi-tryptic.


Can we calculate a probability that the match is correct?

• Maybe, if it is a test sample and you know what the answer should be
• If the sample is an unknown, then you have to define “correct” very carefully:
– The best match in the database?
– The best match out of all possible peptides?
– The peptide sequence that is uniquely and completely defined by the MS data?
– A statistically unlikely match?

[Figure: four example matches labelled Expect 1.8E-5, Expect 9.2E-4, Expect 0.037 and Expect 4.0]
