MS/MS Libraries of Identified Peptides and Recurring Spectra in Protein Digests
Lisa Kilpatrick, Jeri Roth, Paul Rudnick, Xiaoyu Yang, Steve Stein
Mass Spectrometry Data Center
Library searching is not new
Organize for Reuse
MS Library Searching
• Hertz, Hites and Biemann Anal. Chem. (1971).
• PBM: McLafferty, Hertel, Villwock Org. Mass Spectrom. (1974).
• SISCOM: Damen, Henneberg, Weimann, Anal. Chim. Acta (1978).
• INCOS: Sokolow, Karnofsky, Gustafson, Finnigan Application Report 2 (March 1978).
• Stein, Scott J. Amer. Soc. Mass Spectrom., (1994).
‘Dot Product’ (cosine of the ‘angle’ between a pair of spectra)
• Measured: M = f(m/z, abundance)
• Reference: R = f(m/z, abundance)
• f(abundance): weight as you like
$$\cos\theta = \frac{\sum M R}{\sqrt{\sum M^2 \, \sum R^2}}$$
• Numerator: sum over all peaks in common
• Denominator: normalize
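A minimal sketch of the cosine 'dot product' between a measured and a reference spectrum. The square-root weighting, the unit-width m/z binning, and the names `weight`, `binned`, and `dot_product` are illustrative assumptions, not the scoring used by any particular search program; the slide only says to weight abundances "as you like".

```python
import math

def weight(mz, abundance):
    # Hypothetical weighting f: square root of abundance. Any other f can be swapped in.
    return math.sqrt(abundance)

def binned(spectrum, bin_width=1.0):
    """Map a list of (m/z, abundance) peaks to {m/z bin: summed weighted abundance}."""
    bins = {}
    for mz, abundance in spectrum:
        key = round(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + weight(mz, abundance)
    return bins

def dot_product(measured, reference, bin_width=1.0):
    """Cosine of the angle between two spectra treated as vectors over m/z bins."""
    m = binned(measured, bin_width)
    r = binned(reference, bin_width)
    # Sum over all peaks in common.
    numerator = sum(m[k] * r[k] for k in m.keys() & r.keys())
    # Normalize by the lengths of both vectors.
    norm = math.sqrt(sum(v * v for v in m.values()) * sum(v * v for v in r.values()))
    return numerator / norm if norm else 0.0

a = [(100.0, 4.0), (200.0, 9.0)]
b = [(100.0, 4.0), (300.0, 9.0)]
print(dot_product(a, a), dot_product(a, b))
```

Identical spectra score close to 1 and spectra with no shared peaks score 0, whatever weighting is chosen.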
Traditional GC/MS Library Search
Variability Depends on S/N
[Figure: medians for ~7,000 Radiodurans peptides, LCQ (PNNL/NCRR)]
Library Searching for Peptides
• LIBQUEST (Yates) – Yates et al., Anal. Chem. 1998, 70, 3557
• X!Hunter (Beavis) – Craig et al., J. Proteome Res. 2006, 5, 1843
• BiblioSpec (MacCoss) – Frewen et al., Anal. Chem. 2006, 78, 5678
• Spectral Comparison (Kearney) – Liu et al., Proteome Science 2007, 5:3
• SpectraST (Aebersold) – Lam et al., Proteomics 2007, 7, 655-667
• NIST Peptide Ion Fragmentation Library – June 2006 release (US-HUPO – March 2004)
Why Spectrum Libraries?
• More sensitive
• Better scoring
• Faster
• Annotation
• Unrestricted precursor ion
Fraction of MS/MS Spectra Identified vs S/N
[Figure: fraction identified (log scale, 0.001 to 1) vs. S/N (log scale, 1 to 10000); series: All Peptides, HSA Peptides, HSA-OMSSA]
Identification by Spectrum Matching is More Sensitive than by Spectrum/Sequence Matching
Simple Protein Mix
Spectrum/Spectrum Scores are More Robust than Sequence/Spectrum Scores
[Figure labels: Sequence score; 99% Confidence]
Matching Spectra is Faster than Matching Sequence
0.005 s (spectrum matching) vs. 6.2 s (sequence matching) per query spectrum
Reference Library Building
• Extract identified spectra from sequence search
– Multiple search engines
– Instrument-class specific
• Create ‘consensus’ spectra
– Two or more matching spectra, also save best
• Assign probability of being correct
– Refine confidence starting from decoy FDR
– Classify peptides – tryptic, missed cleavage, semi, mods
• Create searchable spectral library
– Resolve conflicts, add annotation
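The 'consensus' step above can be sketched as follows. This is only an illustration under stated assumptions: peaks are binned to unit m/z, each replicate is normalized to its base peak, and a peak is kept when it appears in at least half of the replicates. The function name `consensus_spectrum` and the parameters `bin_width` and `min_fraction` are hypothetical and do not describe the actual library-building code.

```python
from collections import defaultdict

def consensus_spectrum(replicates, bin_width=1.0, min_fraction=0.5):
    """Merge replicate spectra (each a list of (m/z, abundance)) into one consensus."""
    if len(replicates) < 2:
        raise ValueError("a consensus needs two or more matching spectra")
    sums = defaultdict(float)   # bin -> summed normalized abundance
    votes = defaultdict(int)    # bin -> number of replicates containing that bin
    for spectrum in replicates:
        base = max(abundance for _, abundance in spectrum)  # base-peak normalization
        per_bin = defaultdict(float)
        for mz, abundance in spectrum:
            per_bin[round(mz / bin_width)] += abundance / base
        for key, abundance in per_bin.items():
            sums[key] += abundance
            votes[key] += 1
    needed = min_fraction * len(replicates)
    # Keep only peaks seen often enough; average over the replicates that contain them.
    return sorted(
        (key * bin_width, sums[key] / votes[key])
        for key in sums
        if votes[key] >= needed
    )

reps = [
    [(100.2, 10), (200.1, 5), (350.0, 1)],
    [(100.1, 20), (200.3, 10)],
    [(99.9, 8), (200.0, 4)],
]
print(consensus_spectrum(reps))
```

The peak near m/z 350 appears in only one of three replicates and is dropped. Saving the best single replicate alongside the consensus, as the slide mentions, would be a separate selection step not shown here.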
Three Classes of Libraries
I. Conventional Target Identification
– Peptides (Proteins)
II. Identifiable
– By unconventional searching
III. Not Identifiable
– Account for all recurring spectra
– QA/QC
I. OMSSA overlap with MS/MS Library Search
[Figure: overlap counts, listed as extracted]
• 34K, 6/06: 747, 1350, 353
• 78K, 6/07: 318, 1752, 833
Identified spectra (1% FDR) for 1-D Yeast NCI/CPTAC – Vanderbilt
Identified Spectra: Yeast - 1 D
[Pie chart: slices: Tryptic, Tryptic missed cleavage, Tryptic bad miss, Semitryptic]
II. Identify What We Can
Derive Class-specific FDR
• Tryptic
  – Simple
  – Expected missed cleavages
  – Unexpected missed cleavages
• Semitryptic (cleaved tryptic)
  – No missed cleavage
    • In source (with parent at same retention)
    • In sample
  – Missed cleavage
    • In source (with parent)
    • In sample (obey rules)
    • Uncommon – reject
• Others …
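A rough sketch of deriving a class-specific FDR from a target/decoy search. It makes several simplifying assumptions that are not from the slides: trypsin is taken to cleave after K or R with the proline rule ignored, only four coarse classes are used, and the FDR per class is estimated as decoy hits divided by target hits above a score threshold. The names `peptide_class` and `class_fdr` are hypothetical.

```python
from collections import Counter

TRYPTIC_SITES = {"K", "R"}

def peptide_class(peptide, prev_aa, next_aa):
    """Classify a peptide by cleavage pattern; '-' marks a protein terminus."""
    n_ok = prev_aa in TRYPTIC_SITES or prev_aa == "-"
    c_ok = peptide[-1] in TRYPTIC_SITES or next_aa == "-"
    if not n_ok and not c_ok:
        return "nonspecific"
    if not (n_ok and c_ok):
        return "semitryptic"
    # Internal K/R residues count as missed cleavages.
    missed = sum(aa in TRYPTIC_SITES for aa in peptide[:-1])
    return "tryptic" if missed == 0 else "tryptic missed cleavage"

def class_fdr(hits, threshold):
    """hits: iterable of (peptide, prev_aa, next_aa, score, is_decoy) tuples."""
    targets, decoys = Counter(), Counter()
    for peptide, prev_aa, next_aa, score, is_decoy in hits:
        if score < threshold:
            continue
        cls = peptide_class(peptide, prev_aa, next_aa)
        (decoys if is_decoy else targets)[cls] += 1
    # Estimated FDR per class: decoy hits / target hits above the threshold.
    return {cls: decoys[cls] / n for cls, n in targets.items()}

hits = [
    ("AAAK", "K", "G", 10, False),
    ("CCCK", "R", "G", 12, False),
    ("DDDK", "K", "G", 11, True),
    ("DEFR", "A", "G", 9, False),
    ("EEEK", "K", "A", 1, False),
]
print(class_fdr(hits, threshold=5))
```

Because rarer classes such as semitryptic peptides tend to collect a larger share of false hits at a given score, estimating the FDR per class allows a separate score threshold for each.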
Atypical Peptide Ions
Use Sequence Search Method
• Tryptic only with many mods
  – Less common: Methylation, Phosphorylation, …
  – Artifacts: Na, K, Carbamyl
  – InsPecT/Pevzner (Unidentified, +70)
• High charge states, >2 missed cleavages
• Use class specific score thresholds
HSA/Fibrinogen/Transferrin Mix
6124 Consensus Peptide Spectra, IT, Qtof, TofTof
Ion Trap Peptide Ions: 1300 HSA, 1100 Fibrinogen, 700 Transferrin
Identified Peptide Spectra - Simple Protein Mix
[Pie chart: contiguous slices = tryptic, exploded slices = semitryptic; slice labels as extracted: Bad miss, Missed, 'In sample', In source, Unknown mod, Bad miss, Missed, Simple]
III. Library of Recurring, Unidentified Spectra
• Create consensus spectra
  – From similar spectra from an experiment
• Combine from multiple experiments
• Identify spectra in other experiments
  – QA/QC: Artifacts, in standards, …
  – Apply other sequencing methods
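One way to find recurring spectra before building a consensus is to group spectra that resemble each other. The sketch below uses a simple greedy clustering on precursor m/z and cosine similarity; the algorithm, the thresholds `precursor_tol` and `min_similarity`, and the function names are assumptions for illustration, not the method behind the slides.

```python
import math

def cosine(a, b):
    """a, b: dicts mapping an integer m/z bin to abundance."""
    numerator = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return numerator / norm if norm else 0.0

def cluster_spectra(spectra, precursor_tol=1.0, min_similarity=0.7):
    """spectra: list of (precursor_mz, {bin: abundance}). Returns lists of indices."""
    clusters = []  # each cluster is a list of indices; the first member is the seed
    for i, (precursor, peaks) in enumerate(spectra):
        for members in clusters:
            seed_precursor, seed_peaks = spectra[members[0]]
            if (abs(precursor - seed_precursor) <= precursor_tol
                    and cosine(peaks, seed_peaks) >= min_similarity):
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar seed found: start a new cluster
    return clusters

spectra = [
    (500.0, {100: 1.0, 200: 2.0}),
    (500.3, {100: 1.1, 200: 1.9}),
    (500.1, {300: 1.0, 400: 1.0}),
    (700.0, {100: 1.0, 200: 2.0}),
]
print(cluster_spectra(spectra))
```

Clusters with two or more members are the candidates for recurring spectra; singletons can be held back until the same spectrum shows up in another experiment.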
Assign all Spectra
• Identified Spectrum
  – Matches library peptide or unidentified spectrum
  – Subset of peaks match library spectrum (impure)
  – Similar to a matched spectrum (cluster)
• Not a Peptide
  – Low S/N
    • Maximum/Median <15
  – High charge state (many large peaks)
    • Proteins, large fragments, …
  – One dominant peak
    • Stable ion, not peptide
  – Singly charged (high/low abund < 1.2)
    • Probable artifact, lower probability of identification
  – Narrow m/z range
• Peptide?
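A few of the 'Not a Peptide' rules above can be sketched as a simple filter for spectra that did not match the library. Only the Maximum/Median < 15 threshold comes from the slide; the `dominant_fraction` and `narrow_range` cutoffs are hypothetical, and the high-charge-state and singly-charged rules are left out because the slide does not give enough detail to implement them.

```python
from statistics import median

def classify_unmatched(peaks, dominant_fraction=0.9, narrow_range=100.0):
    """peaks: non-empty list of (m/z, abundance) with positive abundances."""
    abundances = [abundance for _, abundance in peaks]
    mzs = [mz for mz, _ in peaks]
    if max(abundances) / median(abundances) < 15:               # threshold from the slide
        return "low S/N"
    if max(abundances) / sum(abundances) > dominant_fraction:   # hypothetical cutoff
        return "one dominant peak"
    if max(mzs) - min(mzs) < narrow_range:                      # hypothetical cutoff
        return "narrow m/z range"
    return "peptide?"

flat = [(100 + i, 10) for i in range(20)]
dominant = [(500, 10000)] + [(100 + 10 * i, 1) for i in range(50)]
narrow = [(400 + i, 1) for i in range(30)] + [(440 + i, 100) for i in range(5)]
candidate = [(200 + 20 * i, 1) for i in range(30)] + [(900 + 50 * i, 100) for i in range(5)]
for name, spectrum in [("flat", flat), ("dominant", dominant),
                       ("narrow", narrow), ("candidate", candidate)]:
    print(name, classify_unmatched(spectrum))
```

Spectra that pass every rule fall into the 'Peptide?' class, meaning they look like peptide spectra but remain unidentified.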
Spectrum Classification - Yeast - 1D
[Pie chart: exploded slices = identified, contiguous slices = unidentified; slice labels: Peptide, Peptide/Impure, NoID Lib, NoID Lib/Impure, Other, Low S/N, 1+ No ID, Peptide?]
Spectrum Classification - Simple Protein Mix
[Pie chart: exploded slices = identified, contiguous slices = unidentified; slice labels: Peptide, Pep/Impure, Pep/Cluster, NoID lib, NoID Lib/Impure, NoID Lib/Cluster, Low S/N, Dominant Peak, Complex, Narrow, 1+ NoID, Peptide?]
Library Pipeline of the Future
[Flow diagram: boxes: Mass spectrometer; Garbage filter; Pep. Lib; Unass. Lib; Sequence Search, De Novo, Theoretical Spec, Similarity, ...; arrows labeled: assigned, unassigned, No ID]
NCI/NIH - CPTAC: Clinical Proteomic Technology Assessment for Cancer
http://proteomics.cancer.gov
Technology assessment; develop standard protocols and clinical reference sets; and evaluate methods to ensure data reproducibility.
Broad Institute of MIT and Harvard, Memorial Sloan-Kettering Cancer Center, Purdue University,
University of California, San Francisco, and Vanderbilt University School of Medicine.
NCI grants (U24CA126476-01, U24CA126485-01, U24CA126480-01, U24CA126477-01, and U24CA126479-01).
Run-to-Run Chromatographic Reproducibility
[Figure: two total ion chromatograms, relative abundance vs. time (min); scan filter ITMS + c ESI Full ms [300.00-2000.00]]
• NCI_study2_021607_sample1B228_vial_03: RT 10.01 - 70.06, NL 6.73E6
• NCI_study2_021607_sample1B33_vial_01: RT 9.99 - 70.13, NL 5.53E6
Lab-to-Lab Chromatography
[Figure: chromatograms, relative abundance vs. time (min), RT 0.00 - 100.00, some with inset mass spectra; peak labels (retention time, m/z) are not reproduced here]
Panel labels: Broad Orbitrap, Vandy Orbitrap, NYU Orbitrap, INCAPS LTQ, NIST LTQ, Vandy LTQ, Purdue LTQ
Runs shown:
• CPTAC_STUDY2_WEEK1_1B144_01, 2/27/2007 11:31:04 AM, HPLC: CPTAC - Dilute 150x - Inj 2 ul; TIC, FTMS + p NSI Full ms [300.00-2000.00], NL 6.63E8; inset scan #1745, RT 16.39, NL 6.89E6
• nw_022207o_liebler_study2_Vanderbilt_Orib2_week1_1B035_070224060826, 2/24/2007 6:08:26 AM; TIC, FTMS + p NSI Full ms [300.00-2000.00], NL 4.30E7
• 20070511_CPTAC_1B100, 5/11/2007 10:26:15 AM; TIC, FTMS + p NSI Full ms [300.00-2000.00], NL 1.08E8
• IN_LTQm_041907_1B274_02, 4/20/2007 11:21:41 PM; base peak m/z 400.00-2000.00, ITMS + c ESI Full ms, NL 5.50E6; inset scan #3160, RT 34.26, NL 1.05E5
• liebler_Vanderbilt_1B121_100, 2/24/2007 11:17:00 PM; base peak m/z 400.00-2000.00, ITMS + c NSI Full ms [300.00-2000.00], NL 6.09E6
• NCI_study2_021607_sample1B228_vial_03, 2/16/2007 8:45:21 PM, sample 1B; base peak m/z 400.00-2000.00, ITMS + c ESI Full ms, NL 2.51E6; inset scan #3205, RT 34.26, NL 5.70E4
• 0703141B289, 3/15/2007 12:58:07 PM; base peak m/z 400.00-2000.00, ITMS + c NSI Full ms [300.00-2000.00], NL 5.80E5
YICENQDSISSK
HSA_CAM_SigmaA9511_5H_8MS2_m2_10de_040406_05
Measures of Reproducibility
• Identified ions
  – Unique peptides, Ions, Spectrum counts
• Unidentified components
  – Classify by type, link to origin
• Ion cluster analysis
  – MS1 linked to MS2
• Chromatography
  – Time evolution of ion clusters
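The identified-ion measures above (unique peptides, ions, spectrum counts) can be tallied per run and compared between replicates. The sketch below is illustrative only: the input format, the function names, and the use of a Jaccard overlap as the comparison measure are assumptions, not something the slides specify.

```python
def run_summary(identifications):
    """identifications: list of (peptide, charge), one entry per identified spectrum."""
    return {
        "spectrum_count": len(identifications),
        "ions": {(peptide, charge) for peptide, charge in identifications},
        "peptides": {peptide for peptide, _ in identifications},
    }

def overlap(a, b):
    """Jaccard overlap of two sets: shared items / all distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

run1 = run_summary([("YICENQDSISSK", 2), ("YICENQDSISSK", 2),
                    ("YICENQDSISSK", 3), ("AAAK", 1)])
run2 = run_summary([("YICENQDSISSK", 2), ("CCCR", 2)])
print(run1["spectrum_count"], len(run1["ions"]), len(run1["peptides"]))
print(overlap(run1["peptides"], run2["peptides"]), overlap(run1["ions"], run2["ions"]))
```

Counting at all three levels matters because the same peptide can be seen in several charge states and sampled many times, so spectrum counts, ion counts, and peptide counts can diverge between runs.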
Ion Component Analysis
Ion Component Analysis (Yeast)
[Figure: counts (log scale, 10 to 1000) vs. relative component intensity (log scale, 1E-3 to 0.1); legend as extracted: "Components All MS2 Sampled Peptides"; regions marked Oversampling and Undersampling]
Components in Replicate Runs
[Figure: number of components (log scale, 10 to 1000) vs. component intensity (log scale, 1E-4 to 0.1); series: total, sampled, identified; ▲▼ run 1, 2; ■ in both]