ASMS 2005
Protein Identification by Database Searching
John Cottrell, Matrix Science
Three ways to use mass spectrometry data for protein ID:
1. Peptide Mass Fingerprint
A set of peptide molecular weights from an enzyme digest of a protein
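A peptide mass fingerprint search compares measured peptide masses with masses calculated from each database sequence. The following is a minimal sketch of that comparison, not the algorithm of any particular search engine. It assumes trypsin as the enzyme (cleavage after K or R, but not before P), no missed cleavages, and no modifications. The function names and the 0.2 Da tolerance are illustrative choices.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (free N- and C-termini)


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, protein, tol=0.2):
    """Count observed masses that fall within tol Da of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(m - t) <= tol for t in theoretical) for m in observed_masses
    )
```

Counting matches is only the starting point: a raw count says nothing about how likely the match is to have arisen by chance, which is why scoring matters.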
Henzel, W. J., Billeci, T. M., Stults, J. T., Wong, S. C., Grimley, C. and Watanabe, C. (1993). Proc Natl Acad Sci USA 90, 5011-5.
James, P., Quadroni, M., Carafoli, E. and Gonnet, G. (1993). Biochem Biophys Res Commun 195, 58-64.
Mann, M., Hojrup, P. and Roepstorff, P. (1993). Biol Mass Spectrom 22, 338-45.
Pappin, D. J. C., Hojrup, P. and Bleasby, A. J. (1993). Curr. Biol. 3, 327-32.
Yates, J. R., 3rd, Speicher, S., Griffin, P. R. and Hunkapiller, T. (1993). Anal Biochem 214, 397-408.
1993
Peptide Mass Fingerprint
• Fast, simple analysis
• High sensitivity
• Need database of protein sequences, not ESTs or genomic DNA
• Sequence (or close homolog) must be present in database
• Not good for mixtures, especially a minor component.
[Figure: protonated peptide backbone with side chains R1 to R4, showing the fragment ion nomenclature. N-terminal fragments: a1 b1 c1, a2 b2 c2, a3 b3 c3. Complementary C-terminal fragments: x3 y3 z3, x2 y2 z2, x1 y1 z1.]
Roepstorff, P. and Fohlman, J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11, 601.
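To make the nomenclature concrete, here is a small sketch that calculates singly charged b and y ion m/z values for an unmodified peptide. It covers only the b and y series (the a, c, x and z series differ by fixed mass offsets that are not shown here), and the function names are illustrative.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions: N-terminal residue sums plus one proton."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide):
    """Singly charged y ions: C-terminal residue sums plus water plus one proton."""
    masses, total = [], WATER
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses
```

For a peptide of n residues this gives b1 to b(n-1) and y1 to y(n-1), the two series most commonly used for matching low energy CID spectra.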
Three ways to use mass spectrometry data for protein ID:
1. Peptide Mass Fingerprint
A set of peptide molecular weights from an enzyme digest of a protein
2. Sequence Query
Mass values combined with amino acid sequence or composition data
Mann, M. and Wilm, M. (1994). Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 66, 4390-9.
[Figure: sequence tag example. Surviving labels: residues T, A, G; mass values 913.2 and 1278.3.]
Sequence Tag
• Rapid search times
• Error tolerant
• Requires interpretation
• Requires high quality data.
Three ways to use mass spectrometry data for protein ID:
1. Peptide Mass Fingerprint
A set of peptide molecular weights from an enzyme digest of a protein
2. Sequence Query
Mass values combined with amino acid sequence or composition data
3. MS/MS Ions Search
MS/MS data from a single peptide or from a complete LC-MS/MS run
Eng, J. K., McCormack, A. L. and Yates, J. R., 3rd (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-89.
MS/MS Ions Search
• Easily automated for high throughput
• Get matches from marginal data
• Can be slow
  – No enzyme
  – Lots of variable modifications
  – Large database
  – Large dataset
• Peptide identification, proteins by inference.
MS/MS matching identifies peptides, not proteins
• Grouping peptide matches into protein matches is an arbitrary procedure
Protein A: Peptide 1, Peptide 2, Peptide 3
Protein B: Peptide 1, Peptide 3
Protein C: Peptide 2
• If peptides 1, 2 and 3 are matched from a 2D gel spot, Mascot will prefer Protein A (Occam’s razor)
• But the sample could easily have been a mixture of B and C.
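The Occam's razor preference can be illustrated with a greedy minimal set cover over the peptide matches. This is only a sketch of the general parsimony idea, not Mascot's actual grouping procedure, and the function name is hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the most
    still-unexplained peptides. Ties are broken alphabetically by name."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# The example from the slide.
matches = {
    "Protein A": {"Peptide 1", "Peptide 2", "Peptide 3"},
    "Protein B": {"Peptide 1", "Peptide 3"},
    "Protein C": {"Peptide 2"},
}
minimal_set = parsimonious_proteins(matches)
```

Here `minimal_set` contains only Protein A, yet Proteins B and C together explain exactly the same three peptides. The peptide evidence alone cannot distinguish the two explanations, which is the sense in which the grouping is arbitrary.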
BLAST / FASTA
• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments
MS-based ID
• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides
What is probability based scoring?
We compute the probability that the observed match between the experimental data and mass values calculated from a candidate protein or peptide sequence is a random event.
The ‘correct’ match, which is not a random event, has a very low probability.
Why is probability based scoring important?
• Human (even expert) judgement is subjective and can be unreliable.
Why is probability based scoring important?
• Human (even expert) judgement is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results.
Standard significance tests can be applied to results
• Mascot score is -10Log10(P), where P is the absolute probability that the observed match is a random event
• If we make 50,000 trials, a 1 in 20 significance threshold is
• -10Log10(1 / (20 x 50,000)) = 60 … “identity”
• If data quality is poor, this may not be achievable. If the match is clearly an outlier, also report a lower, empirical threshold
• … “homology”
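The threshold arithmetic above can be written out as a short sketch. The function names are illustrative, and the number of trials is taken as given; how a search engine counts trials for a particular search is not covered here.

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P), where P is the probability that the match is random."""
    return -10.0 * math.log10(p)


def identity_threshold(num_trials, significance=0.05):
    """Score needed so that the chance of any random match at this level,
    across num_trials trials, is at most `significance` (1 in 20 by default)."""
    return -10.0 * math.log10(significance / num_trials)


# 50,000 trials at a 1 in 20 significance level, as on the slide.
threshold = identity_threshold(50_000)
```

With these inputs `threshold` comes out at approximately 60, matching the slide. Because the threshold grows with the logarithm of the number of trials, a tenfold larger search space raises it by 10.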
Why is probability based scoring important?
• Human (even expert) judgement is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results
• Arbitrary scoring schemes are susceptible to false positives.
Major proteomics study published in Nature, 2002
• 11,381 peptides
• 2,415 proteins
• Matches to fully non-tryptic peptides discarded
• Overall fraction of semi-tryptic peptides 34%
• For proteins identified using
  – 1 peptide: 63% semi-tryptic
  – 2 peptides: 54% semi-tryptic
  – 3 peptides: 46% semi-tryptic.
Can we calculate a probability that the match is correct?
• Maybe, if it is a test sample and you know what the answer should be
• If the sample is an unknown, then you have to define “correct” very carefully:
– The best match in the database?
– The best match out of all possible peptides?
– The peptide sequence that is uniquely and completely defined by the MS data?
– A statistically unlikely match?
Expect 1.8E-5
Expect 9.2E-4
Expect 0.037
Expect 4.0
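One way to read an expect value is as the number of random matches with this score or better that would be expected in a search of this size. The sketch below assumes that definition (number of trials multiplied by P, with P recovered from a score of -10Log10(P)); the slide does not state how the values above were calculated, so this is an illustration rather than a reproduction of them.

```python
def expect_value(score, num_trials):
    """Expected number of random matches scoring at least `score`,
    assuming score = -10 * log10(P) and num_trials independent trials."""
    return num_trials * 10.0 ** (-score / 10.0)


# At the 1 in 20 threshold for 50,000 trials (score 60), the expect value is about 0.05.
at_threshold = expect_value(60.0, 50_000)
```

Under this reading, an expect value well below 0.05 is a statistically unlikely match, while a value above 1 means that at least one match this good is expected by chance alone.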