Predicting Protein Sequences From Mass
Spectral Data
Gary Van Domselaar
University of Alberta
Canadian Proteomics Initiative
May 18, 2004
Introduction
• Review:
– Protein Separation
– Cleavage
– Mass Spectra
– MS and MS/MS
• The Objective: Matching Mass Spectra to Protein Sequences
Introduction
• Strategies:
– MALDI & Peptide Mass Fingerprinting
– MS/MS & Fragment Ion Searching
– MS/MS & Sequence Tag Searches
– MS/MS & De Novo Peptide Sequencing
Review: Protein Separation
• 2D Gel Electrophoresis
• High Performance Liquid Chromatography
Protein Separation: 2D Gel Electrophoresis
[Figure: 2D gel; the second dimension is SDS-PAGE.]
Protein Separation: High Performance Liquid
Chromatography (HPLC)
[Figure: HPLC schematic. Two solvent reservoirs feed a mixer and pump, followed by the sample injector, the column and the mass spectrometer.]
Protein Separation: 1D PAGE LC/MS
[Figure: 1D PAGE LC/MS workflow. A complex protein mixture is separated by SDS-PAGE and digested in-gel; the peptides pass through the HPLC (solvents, mixer, pump, sample injector, column) into ESI-MS.]
Protein Separation: 2D LC/MS
[Figure: 2D LC/MS workflow. A complex protein mixture is digested in solution; the peptides pass through the HPLC (solvents, mixer, pump, sample injector) with SCX and RPC columns into ESI-MS.]
Review: Cleavage
http://ca.expasy.org/tools/peptidecutter/peptidecutter_enzymes.html
Protease Cleavage Rules
Protease        Rule (-- marks the cleaved bond)
Trypsin         XXX[KR]--[!P]XXX
Chymotrypsin    XX[FYW]--[!P]XXX
Lys C           XXXXXK--XXXXX
Asp N endo      XXXXX--DXXXXX
CNBr            XXXXXM--XXXXX
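The trypsin rule above can be sketched as a regular expression. This is a minimal illustration, not a full digestion tool; the function name `trypsin_digest` is made up for this example.

```python
import re

def trypsin_digest(sequence):
    """Cleave after K or R unless the next residue is P (rule XXX[KR]--[!P]XXX)."""
    # Zero-width split point: preceded by K or R, not followed by P.
    # Splitting on empty matches requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence.upper())
    return [p for p in pieces if p]

print(trypsin_digest("acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe"))
```

The other proteases in the table would only need a different pattern, for example `(?<=M)` for CNBr.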
Missed Cleavages
• Proteases are not perfect enzymes
• Protease products are not confined to the predicted products
– Contaminating proteases
– PTMs at the recognition site block access
– Unexpected recognition sites:
• Ex: trypsin produces 'ragged termini' when two or more consecutive basic residues are present in the sequence
Missed Cleavages
>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe

Tryptic fragments (no missed cleavage):
acedfhsak (1007.4251)
dfqeasdfpk (1183.5266)
ivtmeeewendadnfek (2098.8909)
qwfe (609.2667)

Tryptic fragments (1 missed cleavage): the four fragments above, plus
acedfhsakdfqeasdfpk (2171.9338)
dfqeasdfpkivtmeeewendadnfek (3263.3997)
ivtmeeewendadnfekqwfe (2689.1398)
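A sketch of how missed-cleavage peptides can be enumerated from a complete digest: each missed cleavage joins one more adjacent fragment. The helper name `with_missed_cleavages` is hypothetical.

```python
def with_missed_cleavages(fragments, max_missed=1):
    """Return all peptides made of 1..max_missed+1 adjacent fragments."""
    peptides = []
    for start in range(len(fragments)):
        for n_joined in range(1, max_missed + 2):
            if start + n_joined <= len(fragments):
                peptides.append("".join(fragments[start:start + n_joined]))
    return peptides

complete = ["acedfhsak", "dfqeasdfpk", "ivtmeeewendadnfek", "qwfe"]
for peptide in with_missed_cleavages(complete, max_missed=1):
    print(peptide)
```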
Autolysis Peaks
[Figure: mass spectrum, m/z 500 to 2500, with peaks at 450, 609, 698, 1007, 1199 and 2098; the peaks at 1940 and 2211 are marked as trypsin autolysis peaks (trp).]
Review: The Mass Spectrum
[Figure: mass spectrum. x-axis: Mass / Charge (m/z); y-axis: Relative Intensity.]
Average Mass and Monoisotopic Mass
• Monoisotopic mass is the mass determined using the masses of the most abundant isotopes
• Average mass is the abundance weighted mass of all isotopic components
http://www.matrixscience.com/help/mass_accuracy_help.html
Average Mass and Monoisotopic Mass
http://65.219.84.5/moverz/tutorials/pages/peak.html
Calculating Peptide Masses
• Sum the monoisotopic residue masses
• Add mass of H2O (18.01056)
• Add mass of H+ (1.00728) to get M+H
• If Met is oxidized add 15.99491
• If Cys has acrylamide adduct add 71.0371
• If Cys is iodoacetylated add 58.0071
• Other modifications are listed at– http://prowl.rockefeller.edu/aainfo/deltamassv2.html
• Only consider peptides with masses > 400
Amino Acid Residue Masses
Residue          Monoisotopic Mass   Average Mass
Glycine          57.02147            57.0520
Alanine          71.03712            71.0788
Serine           87.03203            87.0782
Proline          97.05277            97.1167
Valine           99.06842            99.1326
Threonine        101.04768           101.1051
Cysteine         103.00919           103.1448
Isoleucine       113.08407           113.1595
Leucine          113.08407           113.1595
Asparagine       114.04293           114.1039
Aspartic acid    115.02695           115.0886
Glutamine        128.05858           128.1308
Lysine           128.09497           128.1742
Glutamic acid    129.04264           129.1155
Methionine       131.04049           131.1986
Histidine        137.05891           137.1412
Phenylalanine    147.06842           147.1766
Arginine         156.10112           156.1876
Tyrosine         163.06333           163.1760
Tryptophan       186.07932           186.2133
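The mass calculation can be sketched with the monoisotopic residue masses listed here. The proton mass 1.00728 is an assumption of this sketch (it reproduces the fragment masses quoted on the earlier slides to within rounding), and `mh_mass` is an illustrative name.

```python
# Monoisotopic residue masses (Da), one-letter codes.
MONO = {"G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
        "V": 99.06842, "T": 101.04768, "C": 103.00919, "I": 113.08407,
        "L": 113.08407, "N": 114.04293, "D": 115.02695, "Q": 128.05858,
        "K": 128.09497, "E": 129.04264, "M": 131.04049, "H": 137.05891,
        "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932}
WATER = 18.01056
PROTON = 1.00728        # assumed value for H+
MET_OXIDATION = 15.99491

def mh_mass(peptide, oxidized_met=0):
    """Monoisotopic M+H: residues + H2O + H+, plus optional Met oxidations."""
    residues = sum(MONO[aa] for aa in peptide.upper())
    return residues + WATER + PROTON + oxidized_met * MET_OXIDATION

for pep in ("acedfhsak", "dfqeasdfpk", "qwfe"):
    print(pep, round(mh_mass(pep), 4))
```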
Review: ESI-MS Spectrum
http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
$$m/z = \frac{MW + n\,H^+}{n}$$
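A short sketch of the ESI relation: the same molecule appears at a different m/z for each charge state n. The proton mass and the example molecular weight are assumptions chosen for illustration.

```python
PROTON = 1.00728  # assumed mass of H+ in Da

def mz(mw, n):
    """m/z of a molecule of mass mw carrying n protons."""
    return (mw + n * PROTON) / n

# Hypothetical peptide of mass 2000 Da observed at charge states 1 to 3.
for n in (1, 2, 3):
    print(n, round(mz(2000.0, n), 4))
```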
Review: ESI-MS/MS
MS1 MS2
Collision Cell
Review: ESI-MS/MS
AEGKLRFK(biotin)
b ion   N-terminal fragment   C-terminal fragment   y ion
b1      A                     EGKLRFK(biotin)       y7
b2      AE                    GKLRFK(biotin)        y6
b3      AEG                   KLRFK(biotin)         y5
b4      AEGK                  LRFK(biotin)          y4
b5      AEGKL                 RFK(biotin)           y3
b6      AEGKLR                FK(biotin)            y2
b7      AEGKLRF               K(biotin)             y1
http://www.abrf.org/JBT/2000/December00/dec00bibbs.html
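A sketch of the fragment ladder for an unmodified peptide, using the usual conventions that a singly charged b ion is the sum of its residues plus a proton and the complementary C-terminal (y) ion is the sum of its residues plus water plus a proton. The biotin label is left out because its mass is not given on the slide; `fragment_ladder` is an illustrative name.

```python
MONO = {"G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
        "V": 99.06842, "T": 101.04768, "C": 103.00919, "I": 113.08407,
        "L": 113.08407, "N": 114.04293, "D": 115.02695, "Q": 128.05858,
        "K": 128.09497, "E": 129.04264, "M": 131.04049, "H": 137.05891,
        "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932}
WATER, PROTON = 18.01056, 1.00728  # proton mass is an assumed value

def fragment_ladder(peptide):
    """Return (b label, b m/z, y label, y m/z) for each backbone cleavage."""
    masses = [MONO[aa] for aa in peptide]
    ladder = []
    for i in range(1, len(peptide)):
        b = sum(masses[:i]) + PROTON
        y = sum(masses[i:]) + WATER + PROTON
        ladder.append((f"b{i}", b, f"y{len(peptide) - i}", y))
    return ladder

for b_label, b, y_label, y in fragment_ladder("AEGKLRFK"):
    print(f"{b_label} {b:9.4f}   {y_label} {y:9.4f}")
```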
Review: MALDI Spectra
http://biop.ox.ac.uk/www/lj2000/endicott/endicott_7.html
• Generates Singly Charged Ions
• High Upper Detection Limit
Matching Spectra and Protein Sequences
[Figure: matching workflow. Experimental path: protein digest → mass analysis → experimental MS (%TIC vs. m/z). Theoretical path: protein database → in silico digestion → theoretical MS (%TIC vs. m/z). The experimental and theoretical spectra are then compared.]
Strategies for Matching Mass Spectra with Protein Sequences
• MALDI & Peptide Mass Fingerprinting
• MS/MS & Fragment Ion Searching
• MS/MS & Sequence Tag Searches
• MS/MS & De Novo Peptide Sequencing
Peptide Mass Fingerprinting
• Used to identify protein spots on gels or protein peaks from an HPLC run
• Depends on the fact that if a protein is cut up or fragmented in a known way, the resulting fragments (and resulting masses) are unique enough to identify the protein
• Requires a database of known sequences
• Uses software to compare observed masses with masses calculated from database
Principles of Fingerprinting
>Protein 1
Sequence: acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe

>Protein 2
Sequence: acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe

>Protein 3
Sequence: acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsadfqek, asdfpk, ivtmeeewendak, dnfeqwfe
Principles of Fingerprinting
[Figure: Proteins 1, 2 and 3 all have the same mass (M+H = 4842.05), but each gives a different mass spectrum of tryptic fragments.]
Preparing a Peptide Mass Fingerprint Database
• Take a protein sequence database (Swiss-Prot or nr-GenBank)
• Determine cleavage sites and identify resulting peptides for each protein entry
• Calculate the mass (M+H) for each peptide
• Sort the masses from lowest to highest
• Have a pointer for each calculated mass to each protein accession number in databank
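The steps above can be sketched as follows, assuming trypsin as the protease, a proton mass of 1.00728 and the 400 Da lower cutoff mentioned earlier. The three sequences and accession numbers are the toy entries used on these slides; the function names are illustrative.

```python
import re

MONO = {"G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
        "V": 99.06842, "T": 101.04768, "C": 103.00919, "I": 113.08407,
        "L": 113.08407, "N": 114.04293, "D": 115.02695, "Q": 128.05858,
        "K": 128.09497, "E": 129.04264, "M": 131.04049, "H": 137.05891,
        "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932}
WATER, PROTON = 18.01056, 1.00728  # proton mass is an assumed value

def tryptic_peptides(sequence):
    # Cleave after K/R unless followed by P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence.upper()) if p]

def mh_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

def build_pmf_index(proteins, min_mass=400.0):
    """Sorted list of (M+H, accession, peptide): each mass points to its protein."""
    index = [(mh_mass(pep), accession, pep)
             for accession, seq in proteins.items()
             for pep in tryptic_peptides(seq)]
    return sorted(entry for entry in index if entry[0] > min_mass)

PROTEINS = {
    "P12345": "acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe",
    "P21234": "acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe",
    "P89212": "acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe",
}
for mass, accession, peptide in build_pmf_index(PROTEINS):
    print(f"{mass:10.4f}  {accession}  {peptide}")
```

Keeping the list sorted by mass means a query mass can be located by binary search rather than by scanning every entry.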
Building A PMF Database
Sequence DB
>P12345
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe
>P21234
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe
>P89212
acedfhsadfqekasdfpkivtmeeewendakdnfeqwfe

Calc. Tryptic Frags
P12345: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe
P21234: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe
P89212: acedfhsadfqek, asdfpk, ivtmeeewendak, dnfeqwfe

Mass List
450.2017 (P21234)
609.2667 (P12345)
664.3300 (P89212)
1007.4251 (P12345)
1114.4416 (P89212)
1183.5266 (P12345)
1300.5116 (P21234)
1407.6462 (P21234)
1526.6211 (P89212)
1593.7101 (P89212)
1740.7501 (P21234)
2098.8909 (P12345)
The Simplest Scoring Scheme: Peptide Counting
• Take a mass spectrum of a trypsin-cleaved protein (from gel or HPLC peak)
• Identify as many masses as possible in spectrum (avoid autolysis peaks)
• Compare query masses with database masses and calculate # of matches or matching score (based on length and mass difference)
• Rank proteins by number of hits and return top scoring entry – this is the protein of interest
Query vs. Database
Query Masses: 450.2201, 609.3667, 698.3100, 1007.5391, 1199.4916, 2098.9909

Database Mass List:
450.2017 (P21234)
609.2667 (P12345)
664.3300 (P89212)
1007.4251 (P12345)
1114.4416 (P89212)
1183.5266 (P12345)
1300.5116 (P21234)
1407.6462 (P21234)
1526.6211 (P89212)
1593.7101 (P89212)
1740.7501 (P21234)
2098.8909 (P12345)

Results:
2 unknown masses
1 hit on P21234
3 hits on P12345
Conclude the query protein is P12345
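A minimal sketch of peptide counting on this example. The mass tolerance of 0.2 Da is an assumption (the slide does not state one); it is simply wide enough to reproduce the hits listed in the results.

```python
from collections import Counter

DATABASE = [(450.2017, "P21234"), (609.2667, "P12345"), (664.3300, "P89212"),
            (1007.4251, "P12345"), (1114.4416, "P89212"), (1183.5266, "P12345"),
            (1300.5116, "P21234"), (1407.6462, "P21234"), (1526.6211, "P89212"),
            (1593.7101, "P89212"), (1740.7501, "P21234"), (2098.8909, "P12345")]
QUERY = [450.2201, 609.3667, 698.3100, 1007.5391, 1199.4916, 2098.9909]

def count_hits(query, database, tol=0.2):
    """Count, per protein, the query masses within tol of a database mass."""
    hits, unknown = Counter(), []
    for q in query:
        matched = {acc for mass, acc in database if abs(mass - q) <= tol}
        if matched:
            hits.update(matched)
        else:
            unknown.append(q)
    return hits, unknown

hits, unknown = count_hits(QUERY, DATABASE)
print(hits.most_common(), unknown)
```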
Peptide Counting
• Works well for high quality data
• Gives higher scores to larger proteins
• PeptIdent: http://us.expasy.org/tools/peptident.html
• PepSea: http://pepsea.protana.com/PA_PepSeaForm.html
• MS-Fit: http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm
MOWSE
• MOlecular Weight SEarch
• Scoring based on peptide frequency distribution from the OWL non redundant Database
Pappin DJC, Hojrup P, and Bleasby AJ (1993) Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3:327-332
>Protein 1
Sequence: acedfhsakdfqeasdfpkivtmeeewendadnfekqwfe
Mass (M+H): 4842.05
Tryptic fragments: acedfhsak, dfqeasdfpk, ivtmeeewendadnfek, qwfe

>Protein 2
Sequence: acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfe
Mass (M+H): 4842.05
Tryptic fragments: acek, dfhsadfqeasdfpk, ivtmeeewenk, dadnfeqwfe

>Protein 3
Sequence: MASMGTLAFD EYGRPFLIIK DQDRKSRLMG LEALKSHIMA AKAVANTMRT SLGPNGLDKM MVDKDGDVTV TNDGATILSM MDVDHQIAKL MVELSKSQDD EIGDGTTGVV VLAGALLEEA EQLLDRGIHP IRIAD
Mass (M+H): 14563.36
Tryptic fragments: SQDDEIGDGTTGVVVLAGALLEEAEQLLDR, DGDVTVTNDGATILSMMDVDHQIAK, MASMGTLAFDEYGRPFLIIK, TSLGPNGLDK, LMGLEALK, LMVELSK, AVANTMR, SHIMAAK, GIHPIR, MMVDK, DQDR
MOWSE
>Protein 1
acedfhsakdfqeasdfpkivtmeeewendadnfekqwfel
>Protein 2
acekdfhsadfqeasdfpkivtmeeewenkdadnfeqwfekqwfei

MOWSE
2. For each protein, place fragments into 100 Da bins.

Protein 1
Mol. Wt.     Fragment
2098.8909    IVTMEEEWENDADNFEK
1183.5266    DFQEASDFPK
1007.4251    ACEDFHSAK
722.3508     QWFEL

Protein 2
Mol. Wt.     Fragment
1740.7500    DFHSADFQEASDFPK
1456.6127    DADNFEQWFEK
1407.6460    IVTMEEEWENK
722.3508     QWFEI

Bin (Da)     Protein 1            Protein 2
2000-2100    IVTMEEEWENDADNFEK
1700-1800                         DFHSADFQEASDFPK
1400-1500                         IVTMEEEWENK, DADNFEQWFEK
1100-1200    DFQEASDFPK
1000-1100    ACEDFHSAK
700-800      QWFEL                QWFEI
All other bins between 400 and 2100 Da are empty.
MOWSE
3. Divide the number of fragments for each bin by the total number of fragments for each 10 kDa protein interval

Bin (Da)     Fragments                    Count   Frequency
2000-2100    IVTMEEEWENDADNFEK            1       0.125
1700-1800    DFHSADFQEASDFPK              1       0.125
1400-1500    IVTMEEEWENK, DADNFEQWFEK     2       0.250
1100-1200    DFQEASDFPK                   1       0.125
1000-1100    ACEDFHSAK                    1       0.125
700-800      QWFEL, QWFEI                 2       0.250
All other bins between 400 and 2100 Da: count 0, frequency 0.000.
MOWSE
4. For each 10 kDa interval, normalize to the largest bin value

Bin (Da)     Fragments                    Count   Frequency   Normalized
2000-2100    IVTMEEEWENDADNFEK            1       0.125       0.5
1700-1800    DFHSADFQEASDFPK              1       0.125       0.5
1400-1500    IVTMEEEWENK, DADNFEQWFEK     2       0.250       1
1100-1200    DFQEASDFPK                   1       0.125       0.5
1000-1100    ACEDFHSAK                    1       0.125       0.5
700-800      QWFEL, QWFEI                 2       0.250       1
All other bins between 400 and 2100 Da: count 0, frequency 0.000, normalized 0.
MOWSE
5. Compare spectrum masses against the fragment mass list for each protein in the database. Retrieve the frequency score for each match and multiply.

Spectrum mass   Bin (Da)     Normalized frequency
1740.7500       1700-1800    0.5
1456.6127       1400-1500    1
722.3508        700-800      1

0.5 x 1 x 1 = 0.5
MOWSE
6. Invert and multiply, and normalize to an 'average' protein of 50 000 Da (50 kDa):
PN = product of distribution frequency scores = 0.5 x 1 x 1 = 0.5
H = 'Hit' protein MW = 5672.48
$$\text{Score} = \frac{50\,000}{P_N \times H} = \frac{50\,000}{0.5 \times 5672.48} = 17.63$$
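The worked MOWSE example can be sketched end to end. This is a simplification under stated assumptions: both toy proteins are treated as falling in a single 10 kDa protein interval, and the frequency table is built from these eight fragments only rather than from a real database. Function names are illustrative.

```python
from collections import Counter
from math import prod

# Fragment masses of the two toy proteins (step 2).
FRAGMENT_MASSES = [2098.8909, 1183.5266, 1007.4251, 722.3508,
                   1740.7500, 1456.6127, 1407.6460, 722.3508]

def bin_of(mass, width=100):
    """Index of the 100 Da bin a mass falls into."""
    return int(mass // width)

def normalized_frequencies(masses):
    """Steps 3 and 4: per-bin frequency, scaled so the largest bin is 1."""
    counts = Counter(bin_of(m) for m in masses)
    total = sum(counts.values())
    freqs = {b: c / total for b, c in counts.items()}
    largest = max(freqs.values())
    return {b: f / largest for b, f in freqs.items()}

def mowse_score(matched_masses, freqs, protein_mw):
    """Steps 5 and 6: multiply matched frequencies, invert, normalize to 50 kDa."""
    pn = prod(freqs[bin_of(m)] for m in matched_masses)
    return 50000.0 / (pn * protein_mw)

freqs = normalized_frequencies(FRAGMENT_MASSES)
score = mowse_score([1740.7500, 1456.6127, 722.3508], freqs, 5672.48)
print(round(score, 2))
```

Because the frequencies are inverted, matches in sparsely populated bins raise the score more than matches in crowded bins.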
MOWSE
• Takes into account the relative abundance of peptides in the database when calculating scores. Protein size is compensated for.
• The model consists of bins 100 Da wide (approximately the average amino acid residue mass).
• Does not provide a measure of confidence for the prediction.
MOWSE
• MOWSE: http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/
• MS-Fit: http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm
MASCOT
• Probability-based MOWSE
• The probability that the observed match between experimental data and a protein sequence is a random event is approximately calculated for each protein in the sequence database.
• Probability model details not published.
Perkins DN, Pappin DJC, Creasy DM, and Cottrell JS (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551-3567.
MASCOT
Extreme Value Distribution
[Figure: histogram with x-axis bins from <20 to >120 and a y-axis from 0 to 8000.]
$$P(x) = 1 - e^{-e^{-x}}$$
Mascot/Mowse Scoring
• The Mascot Score is given as S = -10*Log10(P), where P is the probability that the observed match is a random event
• Aim for probabilities where P < 0.05 (less than a 5% chance the peptide mass match is random).
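A small sketch of the score/probability conversion, assuming the logarithm is base 10. It shows that the P < 0.05 threshold corresponds to a score of roughly 13.

```python
from math import log10

def mascot_score(p):
    """S = -10 * log10(P), P = probability the match is a random event."""
    return -10.0 * log10(p)

def random_match_probability(score):
    """Inverse of mascot_score."""
    return 10.0 ** (-score / 10.0)

print(round(mascot_score(0.05), 2))
print(random_match_probability(30.0))
```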
ProFound
• Uses a Bayesian probability model
• Takes into account individual properties of each protein in the database.
Bayes Theorem
• Describes the probability of some event given that some other event has already occurred (conditional probability).
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• “The probability of some event A occurring given that event B has occurred is equal to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring and divided by the probability of event B occurring”.
• P(B | A): Likelihood
• P(A): Prior Probability
• P(A | B): Posterior Probability
Bayes Theorem
• Example:
• 0.1% of women aged 40 have breast cancer.
• For 40 y/o women with cancer, a mammography will show positive 95% of the time.
• For 40 y/o women without cancer, a mammography will show positive 10% of the time.
• A 40 y/o woman tests positive by mammography for breast cancer.
• What is the probability she really does have breast cancer?
Bayes Theorem
P(Disease) = 0.001    P(No disease) = 0.999
P(Positive test | Disease) = 0.95    P(Negative test | Disease) = 0.05
P(Negative test | No disease) = 0.90    P(Positive test | No disease) = 0.10
$$P(\text{Disease} \mid \text{Positive test}) = \frac{P(\text{Positive test} \mid \text{Disease})\,P(\text{Disease})}{P(\text{Positive test})}$$
$$P(\text{Positive test}) = P(\text{Disease})\,P(\text{Positive test} \mid \text{Disease}) + P(\text{No disease})\,P(\text{Positive test} \mid \text{No disease}) = 0.001 \times 0.95 + 0.999 \times 0.10 = 0.101$$
$$P(\text{Disease} \mid \text{Positive test}) = \frac{0.95 \times 0.001}{0.101} = 0.0094 \quad \text{(less than 1\%)}$$
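The mammography example can be checked numerically. The function below is a generic two-hypothesis Bayes calculation; its name and parameter names are chosen for this sketch.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(Disease | Positive test) from Bayes theorem."""
    evidence = (prior * p_pos_given_disease
                + (1.0 - prior) * p_pos_given_healthy)   # P(Positive test)
    return prior * p_pos_given_disease / evidence

print(round(posterior(0.001, 0.95, 0.10), 4))
```

The posterior stays below 1% because the disease is rare: false positives from the healthy majority far outnumber true positives.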
Bayes Theorem
The Product Rule:
$$P(AB \mid C) = P(B \mid AC)\,P(A \mid C)$$
Example: Draw a card from a deck of playing cards. What is the probability that it is the king of clubs?
A = King, B = Clubs, C = Deck
$$P(\text{Clubs} \mid \text{King}, \text{Deck}) = 1/4 \qquad P(\text{King} \mid \text{Deck}) = 4/52$$
$$P(\text{King}, \text{Clubs} \mid \text{Deck}) = P(\text{Clubs} \mid \text{King}, \text{Deck})\,P(\text{King} \mid \text{Deck}) = 1/4 \times 4/52 = 1/52$$
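The king-of-clubs example can be verified by enumerating a 52-card deck and checking that the joint probability equals the product given by the product rule. Exact fractions are used to avoid rounding; the helper `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))

def prob(event, given=lambda card: True):
    """P(event | given) by counting cards in the deck."""
    pool = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))

def is_king(card):
    return card[0] == "K"

def is_club(card):
    return card[1] == "clubs"

p_king = prob(is_king)                       # P(A|C)
p_club_given_king = prob(is_club, is_king)   # P(B|AC)
p_joint = prob(lambda c: is_king(c) and is_club(c))  # P(AB|C)
print(p_joint, p_club_given_king * p_king)
```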
Bayes Theorem and PMF
$$P(k \mid DI) = \frac{P(D \mid kI)\,P(k \mid I)}{P(D \mid I)}$$
k  The hypothesis “protein k is the protein being analyzed”
D  The experimental data = m1...mn
I  Background information
[Figure: experimental mass spectrum (%TIC vs. m/z) with peaks labelled m1 to m15.]
Bayes Theorem and PMF
$$P(k \mid DI) = \frac{P(D \mid kI)\,P(k \mid I)}{P(D \mid I)}$$
k  The hypothesis “protein k is the protein being analyzed”
D  The experimental data = m1...mn
I  The available background information (species, approximate mass of the parent protein, cleavage enzyme, mass accuracy, etc.)
P(k|DI)  The posterior probability that the hypothesis is true given the data D and the background information I.
P(k|I)  The prior probability of the hypothesis given the background information I.
P(D|kI)  The likelihood: the probability that the data D would be observed if the hypothesis were true.
P(D|I)  A normalization constant, independent of k.
Bayes Theorem and PMF
P(k|I) The prior probability of the hypothesis given the background information I
• Zero for every hypothesis that doesn't satisfy the background information (protein molecular weight, cleavage enzyme, species, etc.)
• Otherwise 2 possibilities:
1. A uniform probability for all hypotheses that satisfy the constraints (all proteins that have the correct MW, cleaved with the correct enzyme), therefore a constant.
2. The prior probability from a previous experiment (i.e. multiple digestions).
Bayes Theorem and PMF
P(k|I) The prior probability of the hypothesis given the background information I
$$P(k \mid I) = \begin{cases} 0 & \text{does not satisfy constraints} \\ \text{constant} & \text{no previous data available} \\ P(k \mid D_{prev} I) & \text{if previous data } D_{prev} \text{ is available} \end{cases}$$
Bayes Theorem and PMF
P(D|kI) The likelihood: the probability that the data D would be observed if the hypothesis were true.
[Figure: theoretical spectrum of protein k and experimental spectrum (%TIC vs. m/z), experimental peaks labelled m1 to m15.]
The measured masses form 2 subsets: hits (H) and misses (M).
The 'Product Rule' P(AB|C) = P(B|AC)P(A|C) can be used to factor the data into the probability for hits and the probability for misses given the hits:
$$P(D \mid kI) = P(m^H_1 \dots m^H_r,\, m^M_{r+1} \dots m^M_n \mid kI) = P(m^H_1 \dots m^H_r \mid kI)\; P(m^M_{r+1} \dots m^M_n \mid kI,\, m^H_1 \dots m^H_r)$$
Bayes Theorem and PMF
Likelihood probability for hits
• Factor as products of probabilities for individual hits by applying the product rule P(AB|C) = P(B|AC)P(A|C):
$$P(m^H_1 \dots m^H_r \mid kI) = \prod_{i=1}^{r} P(m^H_i \mid kI,\, m^H_1 \dots m^H_{i-1})$$
Example: 3 hits, r = 3
Set m1 to 'A', m2 m3 to 'B', kI to 'C':
$$P(m^H_1 m^H_2 m^H_3 \mid kI) = P(m^H_2 m^H_3 \mid kI,\, m^H_1)\; P(m^H_1 \mid kI)$$
Now set m2 to 'A', m3 to 'B', kI m1 to 'C':
$$P(m^H_2 m^H_3 \mid kI,\, m^H_1) = P(m^H_3 \mid kI,\, m^H_1 m^H_2)\; P(m^H_2 \mid kI,\, m^H_1)$$
Combining both steps:
$$P(m^H_1 m^H_2 m^H_3 \mid kI) = P(m^H_3 \mid kI,\, m^H_1 m^H_2)\; P(m^H_2 \mid kI,\, m^H_1)\; P(m^H_1 \mid kI) = \prod_{i=1}^{3} P(m^H_i \mid kI,\, m^H_1 \dots m^H_{i-1})$$
For i = 1 the (empty) list of previous hits is defined as '1'.
Each hit is the logical product of 2 hypotheses: 1) the ith hit (H_i) originates from a particular peptide in the protein k, and 2) its measured mass is m_i. Therefore
$$m^H_i = H_i\, m_i$$
Bayes Theorem and PMF
The product rule can be applied to separate H_i and m_i:
$$P(H_i m_i \mid kI,\, m^H_1 \dots m^H_{i-1}) = P(H_i \mid kI,\, m^H_1 \dots m^H_{i-1})\; P(m_i \mid kI,\, m^H_1 \dots m^H_{i-1},\, H_i)$$
[Figure: theoretical spectrum and experimental spectrum (%TIC vs. m/z), peaks labelled m1 to m15; the hit H_i pairs the measured mass m_i with the theoretical mass m_i0.]
Bayes Theorem and PMF
$$P(H_i \mid kI,\, m^H_1 \dots m^H_{i-1})$$
The probability for the ith measured peptide to be a hit, given protein k and i-1 previous hits.
[Figure: theoretical spectrum and experimental spectrum (%TIC vs. m/z), peaks labelled m1 to m15; of the N peptides of protein k, i-1 are already matched, leaving N-(i-1).]
$$P(H_i \mid kI,\, m^H_1 \dots m^H_{i-1}) = \frac{1}{N - (i-1)}$$
Bayes Theorem and PMF
$$P(m_i \mid kI,\, m^H_1 \dots m^H_{i-1},\, H_i)$$
The probability for the measured mass value to be m_i given its theoretical mass m_i0.
[Figure: theoretical spectrum with mass m_i0 and experimental spectrum with measured mass m_i (%TIC vs. m/z), peaks labelled m1 to m15.]
Bayes Theorem and PMF
The probability for the measured mass value to be m_i given its theoretical mass m_i0. Measured masses are normally distributed:
$$P(m_i \mid kI,\, m^H_1 \dots m^H_{i-1},\, H_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\frac{(m_i - m_{i0})^2}{2\sigma_i^2}\right]$$
[Figure: normal distribution of the measured mass m_i around the theoretical mass m_i0.]
Bayes Theorem and PMF
If more than one potential theoretical match exists within the tolerance of the measured mass, the probability for the ith hit is:
$$P(m^H_i \mid kI,\, m^H_1 \dots m^H_{i-1}) = \frac{1}{N - (i-1)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_i} \sum_{j=1}^{g_i} \exp\!\left[-\frac{(m_i - m_{ij0})^2}{2\sigma_i^2}\right]$$
[Figure: measured mass m_i with g_i potential theoretical matches m_ij0 within the tolerance; j indexes the potential matches.]
Bayes Theorem and PMF
Probability for hits for all masses = product of individual hit probabilities:
$$P(m^H_1 \dots m^H_r \mid kI) = \prod_{i=1}^{r} P(H_i \mid kI,\, m^H_1 \dots m^H_{i-1})\; P(m_i \mid kI,\, m^H_1 \dots m^H_{i-1},\, H_i)$$
• First factor: probability that the ith measured peptide is a hit
• Second factor: probability that the masses match
Modified for multiple possible matches:
$$P(m^H_1 \dots m^H_r \mid kI) = \prod_{i=1}^{r} \frac{1}{N - (i-1)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_i} \sum_{j=1}^{g_i} \exp\!\left[-\frac{(m_i - m_{ij0})^2}{2\sigma_i^2}\right]$$
Bayes Theorem and PMF
Probability for hits for all masses:
$$P(m^H_1 \dots m^H_r \mid kI) = \frac{(N-r)!}{N!} \prod_{i=1}^{r} \left\{ \frac{1}{\sqrt{2\pi}\,\sigma_i} \sum_{j=1}^{g_i} \exp\!\left[-\frac{(m_i - m_{ij0})^2}{2\sigma_i^2}\right] \right\}$$
r       the number of observed hits
N       the total number of peptides
m_i     the measured mass of the ith peptide
m_ij0   the theoretical mass for the ith hit
σ       the measured mass standard deviation
Bayes Theorem and PMF
$$P(D \mid kI) = P(m^H_1 \dots m^H_r \mid kI)\; P(m^M_{r+1} \dots m^M_n \mid kI,\, m^H_1 \dots m^H_r)$$
Likelihood probability for misses given the hits
What are misses?
• The remaining measured masses that cannot be accounted for by the protein sequence (w).
• Errors in protein sequence, unknown modification, unexpected cleavage
• “Modified Peptides”
[Figure: theoretical spectrum and experimental spectrum (%TIC vs. m/z), peaks labelled m1 to m15.]
Bayes Theorem and PMF
Likelihood probability for misses given the hits
[Figure: theoretical spectrum and experimental spectrum (%TIC vs. m/z), peaks labelled m1 to m15.]
• The total number of peptides in protein k is N
• The number of misses is w
• All misses are 'modified peptides'
• The number of modified peptides is J, which is between w and N-r (i.e. J includes unobserved modified peptides).
Bayes Theorem and PMF
The probability for all misses can be factored like this:
$$P(m^M_{r+1} \dots m^M_{r+w} \mid kI,\, m^H_1 \dots m^H_r) = \sum_{J=w}^{N-r} P(J \mid kI,\, m^H_1 \dots m^H_r) \prod_{j=1}^{w} P(M_j \mid kI,\, J,\, m^H_1 \dots m^H_r,\, m^M_{r+1} \dots m^M_{r+j-1})\; P(m_{r+j} \mid kI,\, M_j)$$
Bayes Theorem and PMF
$$P(J \mid kI,\, m^H_1 \dots m^H_r)$$
Probability for there being J modified peptides, given protein k and r observed hits.
[Equation for this term is illegible in the source.]
Bayes Theorem and PMF
The probability for observing a modified peptide, given protein k, J modified peptides and r hits plus j-1 misses being observed already.
[Figure: experimental spectrum (%TIC), peaks labelled m1 to m15, with j-1 misses already observed.]
N-r-(j-1)   # available peptides
J-(j-1)     # remaining unobserved modified peptides
$$P(M_j \mid kI,\, J,\, m^H_1 \dots m^H_r,\, m^M_{r+1} \dots m^M_{r+j-1}) = \frac{J - (j-1)}{N - r - (j-1)}$$
Bayes Theorem and PMF
The likelihood probability for the modified peptide to have a measured mass m_{r+j}
[Figure: uniform distribution of the measured mass between m_min and m_max.]
$$P(m_{r+j} \mid kI,\, M_j) = \frac{1}{m_{max} - m_{min}}, \qquad j = 1 \dots w$$
Bayes Theorem and PMF
Probability for all misses:
$$P(m^M_{r+1} \dots m^M_{r+w} \mid kI,\, m^H_1 \dots m^H_r) = \sum_{J=w}^{N-r} P(J \mid kI,\, m^H_1 \dots m^H_r) \prod_{j=1}^{w} \frac{J - (j-1)}{N - r - (j-1)} \cdot \frac{1}{m_{max} - m_{min}}$$
Bayes Theorem and PMF
Combining the hits and the misses with the product rule P(AB|C) = P(A|C)P(B|AC):
$$P(D \mid kI) = \frac{(N-r)!}{N!} \prod_{i=1}^{r} \left\{ \frac{1}{\sqrt{2\pi}\,\sigma_i} \sum_{j=1}^{g_i} \exp\!\left[-\frac{(m_i - m_{ij0})^2}{2\sigma_i^2}\right] \right\} \times P(m^M_{r+1} \dots m^M_{r+w} \mid kI,\, m^H_1 \dots m^H_r)$$
$$P(k \mid DI) = \frac{P(k \mid I)\, P(D \mid kI)}{P(D \mid I)}$$
ProFound (PMF)
• Bayesian approach considered to be the most coherent, consistent and efficient of the statistical methods.
• Scores reflect the confidence level of the hypothesis that protein k is the sample protein based on the given information
• Scores improve with additional information (tag information)
• Can identify simple mixtures of proteins by fusing single proteins pairwise, in groups of three and so on.
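As a rough sketch of the hit term of the likelihood described on the preceding slides, the code below assumes that r hits are drawn without replacement from the N peptides of protein k (the factor (N-r)!/N!) and that mass errors are normally distributed with standard deviation sigma. It is not ProFound's implementation; the function name and the input layout are hypothetical, and the miss term is left out.

```python
from math import exp, lgamma, log, pi, sqrt

def log_hit_likelihood(hits, n_peptides, sigma):
    """log P(hits | protein k).

    hits: list of (measured_mass, [theoretical masses within tolerance]).
    n_peptides: N, the total number of peptides in protein k.
    sigma: standard deviation of the mass measurement.
    """
    r = len(hits)
    # log[(N - r)! / N!] via log-gamma, since n! = gamma(n + 1).
    total = lgamma(n_peptides - r + 1) - lgamma(n_peptides + 1)
    for measured, candidates in hits:
        # Sum of Gaussian densities over all potential theoretical matches.
        density = sum(exp(-(measured - m0) ** 2 / (2 * sigma ** 2))
                      for m0 in candidates) / (sqrt(2 * pi) * sigma)
        total += log(density)
    return total

# Two measured masses matched against a hypothetical 4-peptide protein.
example = [(1007.43, [1007.4251]), (2098.90, [2098.8909])]
print(log_hit_likelihood(example, n_peptides=4, sigma=0.05))
```

Working in log space avoids underflow when many small probabilities are multiplied.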
ProFound Results
Advantages of PMF
• Uses a “robust” & inexpensive form of MS (MALDI)
• Doesn’t require too much sample optimization
• Can be done by a moderately skilled operator (don’t need to be an MS expert)
• Widely supported by web servers
• Improves as DB’s get larger & instrumentation gets better
• Very amenable to high throughput robotics (up to 500 samples a day)
Limitations With PMF
• Requires that the protein of interest already be in a sequence database
• Not suitable for searching EST databases
• Typically not all predicted peptides are detected
– Poor solubility
– Selective ionization
– Short peptide length
– Post-translational modification
– Unexpected cleavage
– Contamination
• Spurious or missing critical mass peaks always lead to problems.
Limitations With PMF
• Not suitable for identification of proteins in complex mixtures if unseparated mixtures are proteolyzed
• Mass resolution/accuracy is critical; best to have a mass accuracy better than 20 ppm.
• Generally found to only be about 40% effective in positively identifying gel spots
MS-MS and Fragment Ion Searching
• Provides precise sequence-specific data
• More informative than PMF
• Can be used for de novo sequencing
• Can be used to identify post-translational modifications.
SEQUEST
• Compares predicted MS-MS spectra against observed daughter ion spectra to identify and rank matches
Yates JR III, Eng JK, McCormack AL, and Schieltz D (1995) Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 67:1426-1436.
SEQUEST
[Figure: SEQUEST workflow. Proteins in the database are digested in silico; each candidate peptide (shown: I-T-T-T-Y-E-E-P-T-K) is fragmented in silico to give a theoretical MS/MS spectrum, which is compared with the experimental MS/MS spectrum (%TIC vs. m/z).]
SEQUEST
Rank/Sp  (M+H)+     deltCn  XCorr   Sp      Ions   Reference  Peptide
1/1      2313.4863  0.0000  5.7752  2729.8  30/46  YOL086C    K.ATDGGAHGVINVSVSEAAIEASTR.Y
2/42     2313.3834  0.5288  2.7211  401.1   14/38  YDR409W    N.LMNDNDDDDDDRLMAEITSN.H
3/5      2311.5780  0.5544  2.5736  693.0   16/36  YLR058C    M.TTRGM*GEEDFHRIVQYINK.A
4/343    2313.8718  0.5605  2.5385  261.3   12/38  YMR173W-A  L.PTRRRVLMVPATTIRMVLTT.M
5/127    2314.7051  0.5681  2.4942  323.4   13/40  YPL168W    T.KFSAMEINLITSLVRGYKGEG.K
• Identify database peptides that match the parent mass
• Keep the 200 most intense peaks from the MS/MS spectrum
• Compare these fragment ions against the theoretical MS/MS spectrum from the database peptide and generate a preliminary score (Sp) based on the number of matching ions (Ions).
• Perform a cross-correlation analysis (Xcorr) on the top 500 preliminary scoring peptides.
• Sort candidate peptide by XCorr.
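The ion-matching step can be illustrated with a simplified sketch: keep the most intense peaks, generate singly charged b and y ions for a candidate peptide, and count how many theoretical ions have a peak within a tolerance. This is not SEQUEST's actual Sp or XCorr formula; the 0.5 Da tolerance, the function names and the restriction to unmodified b/y ions are assumptions made for illustration.

```python
MONO = {"G": 57.02147, "A": 71.03712, "S": 87.03203, "P": 97.05277,
        "V": 99.06842, "T": 101.04768, "C": 103.00919, "I": 113.08407,
        "L": 113.08407, "N": 114.04293, "D": 115.02695, "Q": 128.05858,
        "K": 128.09497, "E": 129.04264, "M": 131.04049, "H": 137.05891,
        "F": 147.06842, "R": 156.10112, "Y": 163.06333, "W": 186.07932}
WATER, PROTON = 18.01056, 1.00728  # proton mass is an assumed value

def theoretical_ions(peptide):
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)            # b ion
        ions.append(sum(masses[i:]) + WATER + PROTON)    # y ion
    return ions

def count_matching_ions(peaks, peptide, top_n=200, tol=0.5):
    """peaks: list of (m/z, intensity). Returns (matched, total) ions."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    ions = theoretical_ions(peptide)
    matched = sum(1 for ion in ions
                  if any(abs(ion - mz) <= tol for mz, _ in top))
    return matched, len(ions)

# Synthetic spectrum built from the peptide shown in the workflow figure.
spectrum = [(mz, 1.0) for mz in theoretical_ions("ITTTYEEPTK")]
print(count_matching_ions(spectrum, "ITTTYEEPTK"))
print(count_matching_ions(spectrum, "LMVELSK"))
```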
Interpreting SEQUEST Output
Sp The preliminary score. Based on the number of matching ions. The higher the better. Larger peptides have bigger Sp values. A 20-residue peptide should have an Sp > 1000, a 6 residue peptide should have an Sp > 500.
Interpreting SEQUEST Output
Rank/Sp: the final rank and the preliminary rank. The first number is the current rank (1, 2, 3, 4, 5), which is based on XCorr. The second number is the preliminary ranking, which is based on Sp.
Be wary of hits whose rank moves up dramatically between the two (e.g. 4 / 343).
Ideally, look for 1 / 1 for a good hit.
Interpreting SEQUEST Output
DeltCn: the delta correlation value. It tells you how different the first hit is from the subsequent hits. Values of DeltCn > 0.1 indicate a good top hit.
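The DeltCn column in the example output is consistent with a normalized XCorr difference, (XCorr of the top hit minus XCorr of hit n) divided by XCorr of the top hit. The slide does not state this formula, so treat the sketch below as an inference from the tabulated values; the function name `delt_cn` is made up for the example.

```python
def delt_cn(xcorrs):
    """Normalized XCorr difference of each hit relative to the top hit."""
    top = xcorrs[0]
    return [(top - x) / top for x in xcorrs]

# XCorr values of the five ranked hits in the example SEQUEST output.
xcorrs = [5.7752, 2.7211, 2.5736, 2.5385, 2.4942]
values = delt_cn(xcorrs)
for rank, value in enumerate(values, start=1):
    print(rank, f"{value:.4f}")
```

The computed values agree with the DeltCn column to within the rounding of the printed XCorr values.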
Interpreting SEQUEST Output
XCorr: the cross-correlation value from the search, used to produce the final ranking. XCorr values > 2.0 are usually good hits. XCorr increases with increasing peptide size: for a 20-residue peptide, look for XCorr > 5; for a 6-residue peptide, look for XCorr > 1.5.
Interpreting SEQUEST Output
Ions: how many of the experimental ions (taken from the 200 most intense peaks) matched up with theoretical ions.
Coverage of 70% or 80% is good.
SEQUEST Summary
• Gives a concise overview of a batch of search results without having to look at each individual SEQUEST output file.
• Performs protein identification by noting which proteins are most prevalent in a set of SEQUEST output results.
SEQUEST Summary
• MS Spectrum Number
• Total Ion Current (> 5E+5)
• Result File Number
• Charge State
• Delta Mass (Exp. - Theory)
SEQUEST Summary
• Experimental Mass
• Xcorr (> 2.0)
• DeltCn (> 0.2)
• Sp (Preliminary Score)
• Rsp (< 10)
SEQUEST Summary
• Matching Ions (> 70%)
• Accession Number
• Database Occurrences
• Peptide
SEQUEST Summary
A “Prevalent” Protein
How many times the protein appeared in the SEQUEST output files in the 1st (top scoring) position, 2nd position, ..., down to the 5th position.
Consensus Score = 10x8 + 8x1 + 6x0 + 4x0 + 2x0 + 1x0 = 88
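The consensus score arithmetic above is a weighted sum of position counts. A small sketch, taking the weights 10, 8, 6, 4, 2, 1 directly from the slide; the function and variable names are invented for the example.

```python
# Weights per rank position, as they appear in the slide's worked example.
WEIGHTS = (10, 8, 6, 4, 2, 1)

def consensus_score(position_counts, weights=WEIGHTS):
    """Weighted sum of how often a protein appeared at each rank position."""
    return sum(w * n for w, n in zip(weights, position_counts))

# The protein in the example: 8 top-scoring hits and 1 second-place hit.
score = consensus_score([8, 1, 0, 0, 0, 0])
print(score)
```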
SEQUEST
• Popular
• Uses heuristics to score results
• Output is complicated and requires user judgment to assess the validity of a result.
• Confidence cannot be assessed numerically.
PeptideProphet and ProteinProphet
PeptideProphet
• Reads in SEQUEST summary HTML files.
• http://peptideprophet.sourceforge.net/
PeptideProphet
• Validates peptide assignments to MS/MS spectra from SEQUEST (and others).
• Looks at search scores and peptide properties among correct and incorrect peptides:
– Number of termini compatible with enzymatic cleavage (for unconstrained searches)
– Mass difference with respect to the precursor ion
• Uses those distributions to compute the probability that a given peptide assignment is correct
PeptideProphet
• Performed an experiment to identify SEQUEST hits and misses:
• Prepared a sample of 18 control proteins from various organisms (bovine, chicken, rabbit, E. coli, S. cerevisiae, and B. licheniformis).
– Appended the database sequences for the control proteins to a database of Drosophila proteins.
– Searched the modified database with the control protein MS/MS spectra using SEQUEST.
– All identifications from Drosophila are 'misses'.
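PeptideProphet
• Performed discriminant function analysis to weight the various SEQUEST scores according to their ability to discriminate hits from misses:
$$F(\mathrm{XCorr}, \mathrm{RankSp}, \mathrm{Ions}, \Delta C_n, \Delta\mathrm{Mass}) = c_0 + c_1\,\mathrm{XCorr} + c_2\,\mathrm{RankSp} + c_3\,\mathrm{Ions} + c_4\,\Delta C_n + c_5\,\Delta\mathrm{Mass}$$
• Actually used a transformation of the XCorr score to achieve better discrimination and to reduce the dependence of XCorr on peptide length:
$$F(\mathrm{XCorr}, \mathrm{RankSp}, \mathrm{Ions}, \Delta C_n, \Delta\mathrm{Mass}) = c_0 + c_1\,\mathrm{XCorr}' + c_2\,\mathrm{RankSp} + c_3\,\mathrm{Ions} + c_4\,\Delta C_n + c_5\,\Delta\mathrm{Mass}$$
$$\mathrm{XCorr}' = \begin{cases} \dfrac{\ln(\mathrm{XCorr})}{\ln(N_L)}, & \text{if } L < L_C \\[2ex] \dfrac{\ln(\mathrm{XCorr})}{\ln(N_C)}, & \text{if } L \ge L_C \end{cases}$$
L = number of amino acids, N_L = number of expected fragment ions
L_C = XCorr independence threshold
N_C = corresponding expected fragment ion threshold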
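The length correction of XCorr can be sketched in a few lines. The slide does not say how the expected number of fragment ions is counted or what the length threshold is, so both `expected_fragment_ions` (one b and one y ion per peptide bond) and `length_cutoff=15` are placeholder assumptions for illustration.

```python
import math

def expected_fragment_ions(length):
    """Assumed fragment count: one b ion and one y ion per peptide bond."""
    return 2 * (length - 1)

def xcorr_prime(xcorr, length, length_cutoff=15):
    """ln(XCorr) / ln(N), with N frozen at the cutoff length for long peptides.

    Requires length >= 3 so that ln(N) is positive.
    """
    n_ions = expected_fragment_ions(min(length, length_cutoff))
    return math.log(xcorr) / math.log(n_ions)

# The same raw XCorr is worth more on a short peptide than on a long one,
# and stops changing once the peptide is longer than the cutoff.
short = xcorr_prime(3.0, 8)
medium = xcorr_prime(3.0, 12)
long_a = xcorr_prime(3.0, 20)
long_b = xcorr_prime(3.0, 30)
```

Using `min(length, length_cutoff)` covers both branches of the piecewise definition in one expression.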
PeptideProphet
• Plotted the positive and negative hits as a function of the discriminant score.
$$P(F \mid +) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}$$
$$P(F \mid -) = \frac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$$
PeptideProphet
• The probability of getting a correct result, given the discriminant score, is calculated using our old friend, Bayes' Law:
$$P(+ \mid F) = \frac{P(F \mid +)\,P(+)}{P(F \mid +)\,P(+) + P(F \mid -)\,P(-)}$$
$$P(+ \mid F) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct})}{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct}) + \frac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}\,(\text{Total Incorrect})}$$
PeptideProphet
• Adding extra information to improve the score: number of tryptic termini (NTT)
– The majority of correctly assigned peptides have 2 tryptic termini:
• A.KMCDPTYR.F
– The majority of incorrectly assigned peptides have 0 tryptic termini:
• AGMCDPTYHF
• This information can be used to improve the score
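Counting tryptic termini is simple to sketch for peptides written in the SEQUEST `X.PEPTIDE.Y` notation, where the flanking residues are shown on either side of the dots. The function name, the treatment of `-` as a protein terminus, and the handling of the `*` modification marker are assumptions for this example; the cleavage rule itself (after K or R, not before P) is the trypsin rule given earlier in the deck.

```python
def count_tryptic_termini(annotated):
    """Return 0, 1 or 2 for a peptide written as 'prev.PEPTIDE.next'."""
    prev, peptide, nxt = annotated.split(".")
    peptide = peptide.replace("*", "")  # drop modification markers
    # N-terminus: the preceding residue is K or R and the peptide does not
    # start with P. A '-' flank is treated as a protein terminus (assumption).
    n_term = prev == "-" or (prev in ("K", "R") and peptide[0] != "P")
    # C-terminus: the peptide ends in K or R and is not followed by P.
    c_term = nxt == "-" or (peptide[-1] in ("K", "R") and nxt != "P")
    return int(n_term) + int(c_term)

# Peptides taken from the example SEQUEST output.
hits = [
    "K.ATDGGAHGVINVSVSEAAIEASTR.Y",
    "M.TTRGM*GEEDFHRIVQYINK.A",
    "N.LMNDNDDDDDDRLMAEITSN.H",
]
ntt = [count_tryptic_termini(h) for h in hits]
```

In that example output, the top-ranked hit is the only one of these three with two tryptic termini.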
PeptideProphet
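• Examine the training set data for the relationship between predictions and NTT:
– Correct: NTT0 = .03, NTT1 = .28, NTT2 = .69
– Incorrect: NTT0 = .80, NTT1 = .19, NTT2 = .01
• Modify the scoring scheme, e.g. for NTT = 2:
$$P(+ \mid F, \mathrm{NTT}=2) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct}) \times 0.69}{\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(F-\mu)^2}{2\sigma^2}}\,(\text{Total Correct}) \times 0.69 + \frac{(F-\gamma)^{\alpha-1}\, e^{-(F-\gamma)/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}\,(\text{Total Incorrect}) \times 0.01}$$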
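The NTT-adjusted Bayes calculation can be sketched as below, assuming a Gaussian density for correct assignments and a shifted gamma density for incorrect ones. The NTT fractions are the training-set values quoted on the slide; the distribution parameters and the totals are invented placeholders, since the deck gives no fitted values.

```python
import math

def gaussian_pdf(f, mu, sigma):
    return math.exp(-((f - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def gamma_pdf(f, alpha, beta, gamma):
    """Gamma density shifted to start at `gamma`; zero at or below the shift."""
    if f <= gamma:
        return 0.0
    x = f - gamma
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))

# Training-set NTT fractions from the slide.
NTT_CORRECT = {0: 0.03, 1: 0.28, 2: 0.69}
NTT_INCORRECT = {0: 0.80, 1: 0.19, 2: 0.01}

def posterior_correct(f, ntt, params, total_correct, total_incorrect):
    """P(+ | F, NTT) by Bayes' law."""
    pos = gaussian_pdf(f, params["mu"], params["sigma"]) * total_correct * NTT_CORRECT[ntt]
    neg = (gamma_pdf(f, params["alpha"], params["beta"], params["gamma"])
           * total_incorrect * NTT_INCORRECT[ntt])
    return pos / (pos + neg)

# Hypothetical parameters and counts, for illustration only.
params = {"mu": 3.0, "sigma": 1.5, "alpha": 4.0, "beta": 0.8, "gamma": -5.0}
by_ntt = {ntt: posterior_correct(1.0, ntt, params, 100, 900) for ntt in (0, 1, 2)}
```

For a fixed discriminant score the posterior rises with NTT, because the ratio of the correct to the incorrect NTT fraction grows from NTT = 0 to NTT = 2.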
PeptideProphet
• PeptideProphet uses an expectation-maximization (EM) algorithm to adjust the probabilities of correct and incorrect assignments, as learned from the training set, to fit real data sets.
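The EM idea can be illustrated with a two-component mixture fitted to unlabeled discriminant scores. To keep the example short, both components are Gaussians here, which is a simplification and not the model described in the preceding slides; the function names, the initialization, and the synthetic data are also assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_component(scores, iterations=50):
    """Fit 'incorrect' and 'correct' Gaussians to unlabeled scores."""
    mu_neg, mu_pos = min(scores), max(scores)   # crude initialization
    sd_neg = sd_pos = 1.0
    prior_pos = 0.5
    n = len(scores)
    resp = []
    for _ in range(iterations):
        # E-step: probability that each score belongs to the correct component.
        resp = []
        for x in scores:
            pos = prior_pos * normal_pdf(x, mu_pos, sd_pos)
            neg = (1 - prior_pos) * normal_pdf(x, mu_neg, sd_neg)
            resp.append(pos / (pos + neg))
        # M-step: re-estimate the parameters from the weighted scores.
        w_pos = sum(resp)
        w_neg = n - w_pos
        mu_pos = sum(r * x for r, x in zip(resp, scores)) / w_pos
        mu_neg = sum((1 - r) * x for r, x in zip(resp, scores)) / w_neg
        var_pos = sum(r * (x - mu_pos) ** 2 for r, x in zip(resp, scores)) / w_pos
        var_neg = sum((1 - r) * (x - mu_neg) ** 2 for r, x in zip(resp, scores)) / w_neg
        sd_pos = max(math.sqrt(var_pos), 1e-3)  # floor avoids a collapsed component
        sd_neg = max(math.sqrt(var_neg), 1e-3)
        prior_pos = w_pos / n
    return prior_pos, (mu_pos, sd_pos), (mu_neg, sd_neg), resp

# Synthetic scores: 300 'incorrect' around -2 and 100 'correct' around 3.
rng = random.Random(0)
scores = [rng.gauss(-2, 1) for _ in range(300)] + [rng.gauss(3, 1) for _ in range(100)]
prior_pos, pos_fit, neg_fit, resp = em_two_component(scores)
```

The returned `resp` list plays the role of the per-spectrum probabilities: each entry is the estimated probability that the corresponding score came from the correct component.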
PeptideProphet
http://www.proteomecenter.org/course/20040113-Day2.pdf
ProteinProphet
• Reads in PeptideProphet results
• Calculates the probability that the peptides identified from PeptideProphet correspond to identified proteins from a protein database.
• http://proteinprophet.sourceforge.net/
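ProteinProphet
Assuming each peptide assignment to a spectrum is considered independent evidence for its corresponding protein, the protein probability can be calculated as:
$$P = 1 - \prod_i \left(1 - \max_j\, p(+ \mid D_i^j)\right)$$
Here i runs over the distinct peptides of the protein and j over the spectra assigned to peptide i.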
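A short sketch of this protein probability follows. The function name and the dictionary layout are assumptions for the example, and the per-spectrum probabilities are illustrative values, not results from a real search.

```python
def protein_probability(peptide_spectrum_probs):
    """1 minus the probability that every peptide assignment is wrong.

    peptide_spectrum_probs maps each distinct peptide to the list of
    probabilities of the spectra assigned to it; only the best one counts.
    """
    all_wrong = 1.0
    for probs in peptide_spectrum_probs.values():
        all_wrong *= 1.0 - max(probs)
    return 1.0 - all_wrong

# Illustrative values: three peptides, one of them seen in two spectra.
evidence = {
    "IIGGNEVTPHSR": [0.91, 0.40],
    "TICAGALIK": [0.65],
    "ITTTYEEPTK": [0.85],
}
p_protein = protein_probability(evidence)
```

Taking the maximum over spectra means that repeated observations of the same peptide are not counted as independent evidence.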
ProteinProphet
Adjusting for observed peptide grouping:
Correct peptide assignments tend to correspond to “multihit” proteins.
Incorrect peptide assignments tend to correspond to proteins with no other hits.
MRNSYRFLASSLSVVVSLLLIPEDVCEKIIGGNEVTPHSRPYMVLLSLDRKTICAGALIKDWVLTAAHCNLNRSQVILGAHSITTTYEEPTKQIMLVKKEFPYPCYDPATREGDLKLL
MASMGTLAFDEYGRPFLIIKDQDRKSRLMGLEALKSHIMAAKAVANTMRTSLGPNGLDKMMVDKDGDVTVTNDGATILSMMDVDHQIAKLMVELSKSQDD EIGDGTTGVVVLAGALLEE
$$\mathrm{NSP}_i = \sum_{\{m \,\mid\, m \ne i\}} P(+ \mid D_m)$$
IIGGNEVTPHSR = .91
TICAGALIK = .65
ITTTYEEPTK = .85
NSP(EGDLK) = .91 + .65 + .85 = 2.41
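The sibling-peptide sum is easy to reproduce. In this sketch the three sibling probabilities are the ones shown on the slide; the probability assigned to EGDLK itself is a made-up placeholder, included only to show that a peptide's own probability is excluded from its NSP.

```python
def nsp(peptide_probs, peptide):
    """Sum of the probabilities of all other peptides from the same protein."""
    return sum(p for name, p in peptide_probs.items() if name != peptide)

peptide_probs = {
    "IIGGNEVTPHSR": 0.91,
    "TICAGALIK": 0.65,
    "ITTTYEEPTK": 0.85,
    "EGDLK": 0.50,   # hypothetical value, not from the slide
}
nsp_egdlk = nsp(peptide_probs, "EGDLK")
```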
ProteinProphet
Adjusting for observed peptide grouping:
$$\mathrm{NSP}_i = \sum_{\{m \,\mid\, m \ne i\}} P(+ \mid D_m)$$
$$p(+ \mid D, \mathrm{NSP}) = \frac{p(+ \mid D)\,p(\mathrm{NSP} \mid +)}{p(+ \mid D)\,p(\mathrm{NSP} \mid +) + p(- \mid D)\,p(\mathrm{NSP} \mid -)}$$
D: number of tryptic termini, database search scores, number of missed cleavages, etc.
p(+ | D): the peptide probability scores from PeptideProphet
ProteinProphet
p(+ | D, NSP): the probability that the peptide assignment is correct, given the data and the number of sibling peptides
p(NSP | +): the probability of having a particular NSP value, according to the distribution of correct peptide assignments
p(NSP | -): the probability of having a particular NSP value, according to the distribution of incorrect peptide assignments
ProteinProphet
To calculate the various NSP-related distributions, the NSP values are made discrete by placing them into bins. The probability that a correctly assigned peptide has an NSP value in bin k is computed by summing over the peptide values in bin k.
Bins: 0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5
$$p(\mathrm{NSP} \in k \mid +) = \frac{1}{N\,P(+)} \sum_{\{i \,\mid\, \mathrm{NSP}_i \in k\}} p(+ \mid D_i)$$
N: the total number of peptide assignments; the sum runs over the assignments in bin k
ProteinProphet
P(+): the prior probability of a correct peptide assignment, computed by summing over all peptides i:
$$P(+) = \frac{1}{N} \sum_i p(+ \mid D_i)$$
ProteinProphet
The NSP distribution for incorrect assignments is computed analogously.
$$P(-) = \frac{1}{N} \sum_i p(- \mid D_i)$$
ProteinProphet
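$$p(+ \mid D, \mathrm{NSP}) = \frac{p(+ \mid D)\,p(\mathrm{NSP} \mid +)}{p(+ \mid D)\,p(\mathrm{NSP} \mid +) + p(- \mid D)\,p(\mathrm{NSP} \mid -)}$$
$$p(\mathrm{NSP} \in k \mid +) = \frac{1}{N\,P(+)} \sum_{\{i \,\mid\, \mathrm{NSP}_i \in k\}} p(+ \mid D_i), \qquad P(+) = \frac{1}{N} \sum_i p(+ \mid D_i)$$
$$p(\mathrm{NSP} \in k \mid -) = \frac{1}{N\,P(-)} \sum_{\{i \,\mid\, \mathrm{NSP}_i \in k\}} p(- \mid D_i), \qquad P(-) = \frac{1}{N} \sum_i p(- \mid D_i)$$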
ProteinProphet
• NSP distributions will change from sample to sample due to data set size, protein sequence database, proteins in the sample set, data quality etc.
• The EM algorithm is used to find p(NSP | +) and p(NSP | -)
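One pass of the NSP adjustment can be sketched as follows: estimate the binned NSP distributions from the current peptide probabilities, then apply Bayes' law to each peptide. The bin edges follow the slide, but treating the top bin as open-ended, the normalization by the summed probabilities, and the toy data are assumptions; a full EM run would repeat the two steps until the probabilities stop changing.

```python
BIN_WIDTH = 0.5
N_BINS = 5   # 0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5 (top bin treated as open-ended)

def nsp_bin(nsp):
    return min(int(nsp / BIN_WIDTH), N_BINS - 1)

def nsp_distributions(assignments):
    """assignments: list of (p_correct, nsp). Returns p(bin|+) and p(bin|-)."""
    total_pos = sum(p for p, _ in assignments)        # equals N * P(+)
    total_neg = sum(1.0 - p for p, _ in assignments)  # equals N * P(-)
    pos = [0.0] * N_BINS
    neg = [0.0] * N_BINS
    for p, nsp in assignments:
        k = nsp_bin(nsp)
        pos[k] += p / total_pos
        neg[k] += (1.0 - p) / total_neg
    return pos, neg

def adjust(p, nsp, pos, neg):
    """p(+ | D, NSP) from p(+ | D) and the binned NSP distributions."""
    k = nsp_bin(nsp)
    numerator = p * pos[k]
    denominator = numerator + (1.0 - p) * neg[k]
    return numerator / denominator if denominator > 0 else p

# Toy (probability, NSP) pairs, invented for illustration.
assignments = [(0.9, 2.3), (0.8, 2.1), (0.85, 1.7), (0.2, 0.1),
               (0.1, 0.0), (0.3, 0.2), (0.5, 2.2), (0.5, 0.1)]
pos, neg = nsp_distributions(assignments)
boosted = adjust(0.5, 2.2, pos, neg)    # many confident siblings
penalized = adjust(0.5, 0.1, pos, neg)  # essentially no siblings
```

Two peptides with the same initial probability of 0.5 end up on opposite sides of it, depending only on how much sibling support their protein has.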
ProteinProphet
[Figure: distributions of NSP values for incorrect and correct peptide assignments]
ProteinProphet
• Degenerate Peptides
MRNSYRFLASSLSVVVSLLLIPEDVCEKIIGGNEVTPHSRPYMVLLSLDRKTICAGALIKDWVLTAAHCNLNRSQVILGAHSITTTYEEPTKQIMLVKKEFPYPCYDPATREGDLKLLEE
MASMGTLAFDEYGRPFLIIKDQDRKSRLMGLEALKSHIMAAKPYMVLLSLDRKAVANTMRTSLGPNGLDKMMVDKDGDVTVTNDGATILSMMDVDHQIAKLMVELSKSQDDEIGDGTTGV
– Some peptides assigned from MS/MS spectra can be found in more than one protein, thus they are 'degenerate'.
– How does one figure out which is the true corresponding protein?
ProteinProphet
Weight the peptides according to the probability of that protein being in the sample.
Peptide i corresponds to Ns different proteins; the relative weight w_i^n that this peptide actually corresponds to protein n (n = 1 ... Ns) is determined according to the probability of protein n relative to those of all Ns proteins:
$$w_i^n = \frac{P_n}{\sum_{n'=1}^{N_s} P_{n'}}$$
ProteinProphet
$$w_i^n = \frac{P_n}{\sum_{n'=1}^{N_s} P_{n'}}$$
The protein probability function is then modified to account for degeneracy:
$$P_n = 1 - \prod_i \left(1 - w_i^n \max_j\, p(+ \mid D_i^j, \mathrm{NSP}_i^n)\right)$$
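Because the weights depend on the protein probabilities and the protein probabilities depend on the weights, one natural way to apply these two formulas is to alternate between them. The sketch below does that for a fixed number of rounds; the iteration scheme, the function names, the use of a single probability per peptide (rather than a protein-specific NSP-adjusted value), and the toy data are all assumptions.

```python
def protein_prob(peptide_probs, weights, protein):
    """1 - prod_i (1 - w_i^n * p_i) over all peptides, for one protein."""
    all_wrong = 1.0
    for peptide, p in peptide_probs.items():
        all_wrong *= 1.0 - weights[peptide].get(protein, 0.0) * p
    return 1.0 - all_wrong

def apportion(peptide_probs, peptide_proteins, iterations=10):
    """Alternate between protein probabilities and shared-peptide weights."""
    proteins = sorted({n for ns in peptide_proteins.values() for n in ns})
    # Start by splitting each peptide evenly among its candidate proteins.
    weights = {pep: {n: 1.0 / len(ns) for n in ns}
               for pep, ns in peptide_proteins.items()}
    probs = {}
    for _ in range(iterations):
        probs = {n: protein_prob(peptide_probs, weights, n) for n in proteins}
        for pep, ns in peptide_proteins.items():
            total = sum(probs[n] for n in ns)
            if total > 0:
                weights[pep] = {n: probs[n] / total for n in ns}
    return probs, weights

# Toy example: protein A has two unique peptides plus one shared with B.
peptide_probs = {"pep1": 0.91, "pep2": 0.85, "shared": 0.80}
peptide_proteins = {"pep1": ["A"], "pep2": ["A"], "shared": ["A", "B"]}
probs, weights = apportion(peptide_probs, peptide_proteins)
```

In this toy case the shared peptide is progressively assigned to protein A, which has independent evidence, while protein B, supported only by the shared peptide, loses probability.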
ProteinProphet
[Figure: annotated ProteinProphet output. Labeled fields: protein probability, NSP-adjusted peptide probability, original probability, # tryptic termini, NSPs, # peptides in NSP bin, shared peptide weight, protein coverage]