Date post: 18-Jan-2016
A Reference Library of Peptide Ion Fragmentation Spectra

Stephen Stein 1; Lisa Kilpatrick 2; Pedatsur Neta 1; Jeri Roth 1; Xiaoyu Yang 1

National Institute of Standards and Technology, 1 Gaithersburg, MD / 2 Charleston, SC

Overview

• Purpose
Create comprehensive, annotated mass spectral libraries from various organisms and selected proteins to identify peptides by matching their MS/MS spectra to reference spectra.

• Methods
- Acquire ‘Shotgun’ proteomics data files from diverse sources.
- Identify peptides with available sequence search engines.
- For each peptide ion, create a ‘consensus spectrum’ from replicate spectra; also find the best single spectrum.
- Derive reliability measures and remove ambiguities.

• Results
- Spectrum libraries were built by matching both m/z values and intensities of MS/MS peaks.
- Libraries were derived from widely studied organisms such as human, yeast, M. smegmatis, and D. radiodurans, and from standard proteins.
- Consensus spectra were derived from reliable peptide identifications.
- Library indexing leads to very fast identification (<< 1 sec) even for very large libraries.
- Spectrum library searching identifies far more spectra of known peptides than sequence library searching, can be 100 times faster, and yields more robust and understandable results.

Introduction

High-throughput proteomics requires automated, fast and accurate library search engines to identify peptide sequences from acquired MS/MS spectra. Current peptide identification methods match each measured MS/MS spectrum against a coarse ‘theoretical’ spectrum of each possible peptide sequence. Since relative abundances, neutral losses from parent and product ions, and ratios of products having different charge states are not predictable, this rich, peptide-specific information is not effectively used for establishing identity. Also, prior occurrence information is ignored – each search identifies the peptide as if for the first time. A spectrum library search matches not only the m/z, but also the relative intensities of the MS/MS peaks and can make use of other prior information. However, spectrum libraries can propagate errors, so reliable searching requires high quality reference libraries, the development of which is described here. We find that identifying peptides by matching their MS/MS spectra to reference spectra can be faster, more reliable and more informative than current sequence-based methods.

Methods

1. Acquire and organize ‘Shotgun’ proteomics data files from diverse sources.

Human
5347 LC-MS/MS data files from 11 labs and repositories
- Boston U. (Steffen/Ahmad)
- GPM (Beavis)
- HUPO/Plasma Proteome Project/Omenn
- HUPO/Brain Proteome Project/Meyer (not yet published)
- ISB/PeptideAtlas (Deutsch/King/Aebersold/…)
- NCI-SAIC (Veenstra)
- PNNL/NCRR (Smith), UC Davis (Rice/Lee)
- Q-TOF data from USB (Pannell) and Mayo Clinic (Muddiman)

Yeast
2503 LC-MS/MS data files from 12 laboratories

Online repositories
- PeptideAtlas
- Open Proteomics Database

Collaborators/Contributors
- Blueprint Initiative (Hogue)
- Harvard University (Gygi)
- ISB – PeptideAtlas (Deutsch/King/Aebersold/…)
- NIH/LNT (Markey/Maynard/Geer/Kowalak/…)
- University of Arizona (Haynes)
- University of San Francisco (Burlingame/Baker)

NIST Test Measurements

Mycobacterium smegmatis
253 LC-MS/MS data files from the Open Proteomics Database online repository.

Deinococcus radiodurans
495 LC-MS/MS data files from the PNNL/NCRR Repository.

Standard Proteins
19 proteins were analyzed in NIST laboratories by LC-MS/MS.

2. Identify peptides with available sequence search engines

Different search engines often give very different scores for matching a given peptide ion with a single spectrum (figure bottom left panel). To capture the largest number of identifications, the highest score of up to four different search engines was used. This increased the number of reliable identifications by over 25% compared to any single method.
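The "highest score of up to four search engines" rule can be sketched as follows. This is only an illustration: the poster does not say how scores from different engines were put on a common scale, so the division by a per-engine threshold, the engine names, and the threshold values are all assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-engine acceptance thresholds (not from the poster).
# Dividing by the threshold puts scores on a roughly comparable scale,
# where values >= 1.0 would count as passing for that engine.
THRESHOLDS: Dict[str, float] = {
    "engine_a": 2.0,
    "engine_b": 30.0,
    "engine_c": 0.9,
    "engine_d": 10.0,
}


def best_normalized_score(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (engine, normalized score) for the engine that scores highest.

    `scores` maps engine name to its raw score for one peptide-spectrum
    match. Engines without a known threshold are skipped. Returns None
    when no usable score is present.
    """
    best: Optional[Tuple[str, float]] = None
    for engine, raw in scores.items():
        threshold = THRESHOLDS.get(engine)
        if threshold is None:
            continue
        normalized = raw / threshold
        if best is None or normalized > best[1]:
            best = (engine, normalized)
    return best
```

With this scaling, a match that narrowly fails one engine can still be kept when another engine scores it well above its own threshold, which is the effect the text describes.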

[Figure: comparison of search engine scores, with thresholds separating regions labeled "Probably wrong", "Probably right", and "Confirmed"]

3. Create consensus spectrum and find best replicate spectrum

For all spectra matching a given peptide ion, a multi-step process aligns m/z peaks, rejects outliers and creates a consensus spectrum. It also finds the best replicate spectrum based on search engine scores and spectrum quality. A peak in a consensus spectrum must be present in a majority of the spectra that might have generated the peak.
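A minimal sketch of the majority rule for consensus peaks is given below. It is not the multi-step NIST procedure: the fixed-width m/z binning, the use of medians as an outlier-tolerant summary, and the simplification that every replicate "might have generated" every peak are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def consensus_spectrum(replicates, bin_width=0.5):
    """Build a consensus peak list from replicate spectra.

    replicates: list of spectra, each a list of (mz, intensity) tuples.
    A peak is kept only if it occurs in more than half of the replicates.
    Returns a list of (median mz, median intensity), sorted by m/z.
    """
    bins = defaultdict(list)  # bin index -> one (mz, intensity) per replicate
    for spectrum in replicates:
        strongest = {}
        for mz, intensity in spectrum:
            b = round(mz / bin_width)
            # Within one replicate, keep only the most intense peak per bin.
            if b not in strongest or intensity > strongest[b][1]:
                strongest[b] = (mz, intensity)
        for b, peak in strongest.items():
            bins[b].append(peak)

    n = len(replicates)
    consensus = []
    for b in sorted(bins):
        peaks = bins[b]
        if 2 * len(peaks) > n:  # strict majority of replicates
            consensus.append(
                (median(p[0] for p in peaks), median(p[1] for p in peaks))
            )
    return consensus
```

Peaks seen in only one of three replicates are dropped, so sporadic noise does not enter the library entry.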

5. Remove ambiguities and build library

Create annotated spectra for the consensus and best-matching single spectra. Resolve cases where similar spectra appear to have been generated by different peptide ions.

Results

• Libraries were built from different organisms.

Number of peptides per library, by peptide class ("-" = not listed):

Peptide Class              Yeast    Human   M. smegmatis   D. radiodurans   Standard Proteins
Consensus                 35,807   43,601          3,562            8,809               4,095
Singular (one ID)          2,458    1,864            126              284                  15
Simple Tryptic            24,205   36,447          3,252            6,050               2,097
Tryptic Missed Cleavage    5,620    7,127            254            2,486               1,555
Semi Tryptic               5,982        -             56              273                 443
1+                         3,658    3,677            111            1,816                 663
2+                        22,327   30,194          2,130            5,168               1,832
3+                         9,822    9,730          1,287            1,799               1,320
4+                             -        -              -                -                 245
5+                             -        -              -                -                  35
ICAT                      15,061    6,640              -                -                   -

4. Derive reliability measures for each spectrum

A) Spectrum/Sequence Consistency

• Match theoretical spectrum, based on relative dissociation rates of adjacent amino acids (from statistical analysis of reliable spectra). Discrimination shown at right

• Fraction of unassigned abundance (peaks not originating from a known fragmentation path)

• Y/B ion continuity and Y/B correlation

B) Peptide Sequence Confirmation

• Other peptide ions with same sequence (different charge state or modification)

• Sequence contained in (or contains) another peptide

• Number of peptides per protein / protein length

C) Peptide Class (for setting acceptance threshold)

• Tryptic or semiTryptic

• SemiTryptic – In source or unexpected

• SemiTryptic – Confirmed or unconfirmed

• Missed Cleavages: None or explained, or unexplained

• Missed Cleavages: Confirmed (found contained peptide) or unconfirmed
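The "fraction of unassigned abundance" measure in (A) can be illustrated with a short sketch. The matching tolerance and the way theoretical fragment m/z values are supplied are assumptions; the poster does not give these details.

```python
def unassigned_fraction(peaks, theoretical_mz, tol=0.5):
    """Fraction of total intensity in peaks that match no expected fragment.

    peaks: list of (mz, intensity) tuples for the measured spectrum.
    theoretical_mz: m/z values expected from known fragmentation paths
        of the assigned peptide (supplied by the caller).
    tol: absolute m/z matching tolerance (an assumed value).
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    unassigned = sum(
        intensity
        for mz, intensity in peaks
        if not any(abs(mz - t) <= tol for t in theoretical_mz)
    )
    return unassigned / total
```

A value near 0 means almost all of the signal is explained by the assigned sequence; a large value suggests a wrong identification or a mixed spectrum.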

These libraries depend on contributors for their success. Please contribute. All spectra cite contributors.

• Spectrum searching identifies peptides fast and reliably.

Algorithms: Spectrum similarity scores have been adapted from algorithms used for electron ionization spectra. Peaks are weighted by their significance:

- Reduce significance of common impurity ions (e.g., neutral loss from parent ion)

- Reduce weight for uncertain and isotopic peaks

- Use library spectrum reliability

- Fold in sequence score for instrument dependence – uses OMSSA scoring
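The idea of a peak-weighted similarity score can be sketched as a weighted dot product (cosine) between binned spectra. This is a generic illustration, not the NIST scoring function: the unit-width binning, the square-root intensity scaling, and the per-bin weight table are assumptions.

```python
import math


def weighted_dot_product(query, library, weights=None, bin_width=1.0):
    """Cosine similarity between two spectra, scaled to 0..100.

    query, library: lists of (mz, intensity) tuples.
    weights: optional dict mapping bin index -> weight in [0, 1]; use it to
        down-weight bins that hold uninformative peaks (for example a
        neutral loss from the parent ion). Missing bins get weight 1.0.
    """
    def binned(spectrum):
        vec = {}
        for mz, intensity in spectrum:
            b = round(mz / bin_width)
            w = 1.0 if weights is None else weights.get(b, 1.0)
            # Square-root scaling keeps a few intense peaks from dominating.
            vec[b] = vec.get(b, 0.0) + w * math.sqrt(intensity)
        return vec

    q, lib = binned(query), binned(library)
    dot = sum(q[b] * lib[b] for b in q.keys() & lib.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in lib.values())
    )
    return 100.0 * dot / norm if norm else 0.0
```

Setting a bin's weight to 0 removes that peak from the comparison entirely, so two spectra that differ only in that peak score as identical.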

Speed: Straightforward indexing leads to very fast identification (<< 1 sec) even for very large libraries.
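One simple way indexing can make library lookup fast is to sort entries by precursor m/z and use binary search to pull only the candidates near the query. The poster does not say what is indexed, so indexing on precursor m/z, the class name, and the tolerance value are assumptions.

```python
import bisect


class PrecursorIndex:
    """Sorted index of library entries keyed by precursor m/z."""

    def __init__(self, entries):
        # entries: iterable of (precursor_mz, peptide_id) tuples.
        self._entries = sorted(entries)
        self._mzs = [mz for mz, _ in self._entries]

    def candidates(self, query_mz, tol=1.5):
        """Return ids of entries with precursor m/z within +/- tol of query_mz."""
        lo = bisect.bisect_left(self._mzs, query_mz - tol)
        hi = bisect.bisect_right(self._mzs, query_mz + tol)
        return [peptide_id for _, peptide_id in self._entries[lo:hi]]
```

Each lookup costs a binary search plus the size of the candidate window, so only a small slice of a large library needs full spectrum comparison.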

Robustness: Spectrum match scores are less sensitive to spectral details than sequence scores (see figure below, left).

Contact: Steve Stein, Director, Mass Spectrometry Data Center, National Institute of Standards and Technology, [email protected], 301-975-2505

[Figure panels: input list; matching peptide and probability scores; reference spectrum and annotation; query MS/MS; head-to-tail comparison of sample and reference spectra]

• Three library formats:
- Simple ASCII ‘msp’ format (derived from EI MS Library)
- NIST Search Software (Windows, see figure below)
- Dynamic Link Library (Source & Binary)
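As an illustration of how the ASCII ‘msp’ format can be read, here is a minimal parser sketch. It assumes the commonly seen layout: a "Name:" line starts each entry, "Key: value" header lines follow, and a "Num peaks:" line precedes one "m/z intensity" pair per line. Real files may carry extra per-peak annotations or other header fields that this sketch ignores or stores untouched.

```python
def parse_msp(text):
    """Parse msp-style text into a list of entry dicts.

    Each entry has "name" (str), "fields" (dict of lower-cased header
    keys to raw string values) and "peaks" (list of (mz, intensity)).
    """
    entries = []
    current = None
    in_peaks = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            in_peaks = False  # a blank line ends the current peak list
            continue
        if line.lower().startswith("name:"):
            current = {
                "name": line.split(":", 1)[1].strip(),
                "fields": {},
                "peaks": [],
            }
            entries.append(current)
            in_peaks = False
        elif current is None:
            continue  # ignore anything before the first entry
        elif in_peaks:
            parts = line.split()
            # Only the first two columns are read; annotations are skipped.
            current["peaks"].append((float(parts[0]), float(parts[1])))
        else:
            key, _, value = line.partition(":")
            key = key.strip().lower()
            current["fields"][key] = value.strip()
            if key == "num peaks":
                in_peaks = True
    return entries
```

The peptide names and peak values used to exercise such a parser would be made up; the sketch does not reproduce any entry from the NIST libraries.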

[Figure: Small Missing Peaks Can Have a Big Effect on Sequence Scores (axis: Sequence Search Score)]

• Spectrum library performance: Several times as many spectra were identified by searching against spectra as by searching against sequences (left panel, bottom right).
- Test Set: Yeast analysis files from the Open Proteomics Database (OPD40, 12 LC-MS/MS runs).
- Spectrum Library: Consensus spectra in the current yeast library. The D. radiodurans library was used to detect false IDs.
- Sequence Library: Search of forward and reverse yeast libraries using relative homology scores or expectation values.
- Search Speed: Spectrum searching was about 100 times faster than sequence searching. It may be accelerated further by more peak indexing.

• A reference spectrum library provides a sensitive, reliable, fast, and comprehensive resource for peptide identification.

• A peptide mass spectrum library can be used for:
- Direct peptide identification
- Validating peptides identified by sequence search programs
- Organizing and identifying recurring, unidentified spectra
- Sensitive, high reliability detection of internal standards, biomarkers, and target proteins
- Subtracting a component from a mixture spectrum

Conclusion

Current sequence search methods yield divergent scores for the same spectrum due to use of incomplete spectrum information.

Collaborators

N. King et al., ISB – “Annotation of the Yeast Proteome with PeptideAtlas” (Poster WP27/522)

H. Lam et al., ISB – “SpectraST: An Open-Source MS/MS Spectra-Matching Library Search Tool for Targeted Proteomics” (Poster WP27/530)

L. Geer et al., NIH – “Reducing false positive rates in MS/MS sequence searching and incorporating intensity into match based statistics” (Poster TP34/638)

HUPO PPP and BPP Projects

Repositories and dozens of labs who directly and indirectly provided MS/MS data for public use

[Figure axis: Similarity of Measured vs. Theoretical Spectra (Dot Product x 100)]
