nature biotechnology volume 28 number 7 july 2010 647
To the Editor: Direct protein profiling offers unique insights beyond those afforded by transcriptomics or genomics technologies. These include information about the abundance level, post-translational modifications and interaction partners of proteins. Shotgun proteomics¹ is the method of choice for the comprehensive analysis of complex protein mixtures, and extensive proteome coverage has recently been reported even for eukaryotic model organisms²,³. However, shotgun proteomics also faces significant challenges, such as the protein inference problem⁴. Protein inference refers to the process of deducing the proteins that were originally present in a sample on the basis of the experimentally identified peptides. Because significant amounts of shared peptides (peptides that could be derived from several proteins) introduce ambiguity, protein inference can be tricky
[Figure 1 diagram. Experimental workflow: protein mixture, proteolytic digestion, peptides, MS spectrum, MS/MS spectrum. Computational workflow: genome, genes, proteins, in silico digestion, predicted spectrum. Both tracks meet at the peptide spectrum match, which yields a scored peptide list (peptide, score). Peptide classification adds the class and the implied proteins/genes to give the classified peptide list, which feeds protein inference (minimal protein list) and peptide selection for quantitative proteomics. Example classified peptide list: MVQYNFK, score 0.99, class 1a, Protein_1.1; ITVVPNGK, score 0.25, class 2, Gene_1; MVVRPYNDELR, score 0.95, class 3b, Protein_1.1, 2.1.]
Figure 1 Schematic overview of where PeptideClassifier maps onto a shotgun proteomics workflow and selected applications. Proteins are extracted and digested with a protease (typically trypsin) before further separation of the peptide mixture, ionization, selection of precursor ions, fragmentation and recording of fragment ion spectra in a mass spectrometer. PeptideClassifier takes peptides assigned by the PSM process as input, ideally further processed with a probabilistic or other scoring scheme. Because gene-model information is included in the classification process (red arrow), several different peptide classes with distinct information content can be reported (see main text). For deterministic protein inference, a minimal list of protein identifications can be generated for peptides above a user-defined threshold (for more details, see Supplementary Fig. 1). Other major applications include the information content–based selection of peptides for targeted quantitative proteomics workflows (based on experimental data or an in silico precomputed index for all peptides; see dashed arrow) and integration with transcriptomics data (not shown).
PeptideClassifier for protein inference and targeted quantitative proteomics
correspondence
© 2010 Nature America, Inc. All rights reserved.
and error prone. Furthermore, the error rate at the protein level is typically substantially higher than that at the peptide level⁴,⁵. Informatics solutions that provide accurate and reproducible results are thus needed to minimize the propagation of errors in the literature and in data repositories, and to allow readers to critically evaluate the conclusions of papers⁶,⁷.
To address this issue, we have recently devised a novel, deterministic peptide classification and protein inference scheme⁸. This approach is the first to take into account the gene model–protein sequence–protein identifier relationships. Each peptide sequence is classified according to its information content with respect to protein sequences and gene models (Fig. 1). This allows shared peptides to be further distinguished depending on whether the implied proteins could be encoded either by the same or by distinct gene models.
Here we announce the release of the modular software tool PeptideClassifier (folders containing the code for PeptideClassifier as well as some AuxiliaryScripts can be downloaded on the Nature Biotechnology website, or from http://www.mop.uzh.ch/software.html) and illustrate its general applicability for both eukaryotes and prokaryotes, and its value for applications beyond protein inference (Supplementary Table 1). These include integration with transcriptomics data and information content–based selection of peptides for targeted quantitative proteomics studies (Fig. 1).
PeptideClassifier can classify shotgun proteomics data from any organism, provided that a clear relationship exists between the gene model, its encoded protein sequences and their identifiers. Several reference databases (such as FlyBase, Wormbase, TAIR, ENSEMBL and RefSeq) fulfill this requirement. PeptideClassifier carries out several steps (Supplementary Fig. 1): first, it analyzes protein sequence redundancies and generates an identifiable proteome index; second, it parses the database search result files; third, it classifies the experimentally identified peptides into six evidence classes with different information contents (see below); fourth, it infers a minimal list of protein identifications per evidence class; and finally, it can report a minimal set of protein identifications that would explain the remaining ambiguous peptides, following the Occam’s Razor approach⁵. In its current implementation, PeptideClassifier can work with the output of common database search engines or, alternatively, with a list of confident peptide identifications provided by a user.
The classification and protein inference approach is generic: on the basis of their different gene structures, for prokaryotes we report three peptide evidence classes (classes 1a, 3a and 3b), and for eukaryotes, to capture potential alternative splice isoforms, we consider three additional evidence classes (classes 1b, 2a and 2b) (Fig. 2). Class 1a peptides unambiguously identify a single unique protein sequence. Class 1b peptides also unambiguously identify one unique protein sequence, but this sequence could be derived from distinct splice isoform transcripts of a gene model that, although identical in the coding sequence, differ in the 5′ or 3′ untranslated region, or in both regions. We extend the original classification⁸ to further distinguish class 2 peptides into
Class  Protein sequence(s)  Protein isoform(s)  Gene(s)
1a     Unambiguous          Unambiguous         Unambiguous
1b     Unambiguous          Ambiguous           Unambiguous
2a     Ambiguous            Ambiguous           Unambiguous
2b     Ambiguous            Ambiguous           Unambiguous
3a     Unambiguous          Ambiguous           Ambiguous
3b     Ambiguous            Ambiguous           Ambiguous
[Figure 2 diagram. Eukaryotes (upper panel): Gene 1 (Proteins 1.1 and 1.2), Gene 2 (Proteins 2.1 to 2.4), Gene 3 (Protein 3.1) and Gene 4 (Protein 4.1), with UTRs marked. Prokaryotes (lower panel): Genes 1 to 3 encoding Proteins 1 to 3.]
Figure 2 Overview of the distinct peptide evidence classes of our classification scheme for eukaryotes and prokaryotes. We distinguish six peptide evidence classes for eukaryotes (upper panel). Introns are shown as gray boxes, exons as green boxes and splicing events as dashed green lines. The 3′ untranslated region (UTR) is indicated for one gene model, to highlight the relevance of 5′ or 3′ UTRs for assigning class-1b identifications. Also shown are sets of two gene models that encode an identical protein sequence (dashed gray lines). Owing to the lack of splice variants, class 1b, 2a and 2b do not apply to prokaryotes (lower panel). A table summarizing the ability of the respective peptide evidence classes to distinguish protein sequences, annotated protein isoforms and genes is shown in the figure; the implications for major applications are indicated in Supplementary Table 1.
those peptides that identify a proper subset (class 2a) versus those that imply all protein sequences encoded by a gene model (class 2b). Finally, class 3a peptides unambiguously identify a protein sequence that can be encoded by several gene models from distinct loci. Such cases, which include histones or the products of duplicated genes in prokaryotes, are typically very rare. In contrast, class 3b peptides are derived from different protein sequences encoded by gene models from distinct loci. They have the least information content but can account for a large percentage of the experimental data⁸.
The conceptually simple extension of integrating the gene model distinguishes our solution from other common protein inference tools, such as ProteinProphet⁵, IsoformResolver⁹, Scaffold¹⁰ or IDPicker¹¹. Similar to these tools, PeptideClassifier addresses protein inference using the assigned peptides, but it does not try to improve the peptide-spectrum matching (PSM) process (Fig. 1). Notably, for protein inference, our deterministic method considers only peptides above a user-defined threshold, and not lower-scoring peptides that could provide additional evidence for certain protein identifications, the default approach adopted by ProteinProphet⁵ and Scaffold¹⁰. The deterministic approach is therefore very stringent. Similar to IDPicker¹¹, Scaffold¹⁰ and other solutions, the output of two different database search algorithms could in principle be classified and integrated to achieve additional stringency.
One example of the use of PeptideClassifier concerns the reporting of reference data sets in proteomics, where the error rate should be as minimal as possible. For peptides of class 1a, 1b and 3a, a minimal list of nonoverlapping, unambiguous protein sequence identifications can be generated (Fig. 2). For cases in which the peptide evidence cannot distinguish between several possible protein sequences that are encoded either by the same gene model (class 2a and 2b) or by different gene models (class 3b) (Fig. 2), a minimal list of ambiguous protein identifications can be generated that reports the inherent ambiguities (for more details, see Supplementary Fig. 1). Using a target-decoy database search strategy¹², one can estimate the spectrum-level false discovery rate (FDR) for a selected peptide confidence threshold. Because the FDR is much higher for proteins identified by a single hit, one option would be to exclude them. Existing guidelines for protein identification requiring two distinct peptides⁶ have limited the number of false-positive protein identifications reported in the literature. Alternatively, by opting to manually validate all single-hit identifications with an information-rich peptide, a user could reduce the overall protein FDR while keeping valid single hits. We have shown that single hits passing manual evaluation (only 35% of all single hits) are enriched in short and low-abundance proteins⁸, which, by definition, will contribute fewer observable peptides. The rejected single hits accounted for around 90% of the incorrect PSM matches estimated to be present in the data set on the basis of target-decoy database search results. Removing them greatly reduces the actual FDR in the data set at the peptide level and even more so at the protein level. Ideally, one would be able to rely on a scoring scheme; the solution by Gupta and Pevzner¹³ may represent one valuable resource. We suggest that the guidelines for protein identification⁶ be extended to consider the peptide information content.
The classification facilitates seamless integration with transcriptomics data. We have demonstrated this for current transcriptomics platforms, which predominantly report results at the gene-model level. However, to allow more fine-grained integration with data from exon-based array platforms or RNA-Seq and take advantage of their potential to distinguish splice variants, we have further subdivided class 2 peptides: class 2a peptides imply a proper subset of distinct protein sequences encoded by one gene model, whereas class 2b peptides imply all encoded protein sequences. In combination, class 1a, 1b and 2a peptides can thus be informative in identifying and distinguishing different splice isoforms. Because a substantial part of the continuous updates to eukaryotic reference protein databases represent splice variants (Supplementary Table 2), we expect that such a classification will become increasingly valuable. In addition, shared peptides may lead to inaccurate protein quantification results in semiquantitative spectral counting applications: the distinction of several peptide evidence classes provided by PeptideClassifier can help prevent this.
Finally, PeptideClassifier can assist in selecting the most relevant peptides for targeted quantitative proteomics approaches using multiple reaction monitoring. Applications can range from selecting proteotypic peptides from existing experimental proteome catalogs¹⁴ to supporting larger projects like the Human Protein Detection and Quantitation project¹⁵, which aims to identify expression evidence for all human gene models. In the first phase, peptides specific for a gene model but not a specific splice isoform or modified protein are required. Thus, a classification that displays the in silico precomputed information content of each peptide could help researchers select the best candidates for a specific use case both at the splice-variant level and at the gene-model level. We detail the steps for generating a proteome-wide precomputed peptide information content index and its advantages for this use case in Supplementary Table 3.
Accurate protein identification and quantification are of key interest for the proteomics field. Our classification scheme, which is, to our knowledge, the first to consider gene model–protein sequence–protein identifier relationships, can help to minimize potential protein inference errors. PeptideClassifier displays all ambiguities, enabling a researcher to further examine candidates of specific interest and to distinguish or even remove protein-level ambiguities by integrating transcriptomics or other data sets. Its applications for data integration and information content–based selection of peptides for targeted quantitative proteomics are expected to find widespread use.
Note: Supplementary information is available on the Nature Biotechnology website.
ACKNOWLEDGMENTS
We thank C. Panse for contributing a first XML parser to extract information from database search engine output files, E. Brunner and G. Hausmann for feedback on the manuscript and K. Basler, U. Grossniklaus, R. Aebersold, M. Hengartner and J. Jiricny for continued support of the Quantitative Model Organism Proteomics bioinformatics core group. E.Q. and C.H.A. are members of the Quantitative Model Organism Proteomics Initiative, which is supported by the University Research Priority Program Systems Biology/Functional Genomics of the University of Zurich.
AUTHOR CONTRIBUTIONS
E.Q. wrote the software code and documentation and generated the figures; C.H.A. originally devised the peptide classification scheme and wrote the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Ermir Qeli & Christian H Ahrens
Quantitative Model Organism Proteomics, Institute of Molecular Life Sciences, University of Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland. Correspondence should be addressed to C.H.A. ([email protected]).
1. Washburn, M.P., Wolters, D. & Yates, J.R. III. Nat. Biotechnol. 19, 242–247 (2001).
2. Brunner, E. et al. Nat. Biotechnol. 25, 576–583
(2007).
3. de Godoy, L.M. et al. Nature 455, 1251–1254 (2008).
4. Nesvizhskii, A.I. & Aebersold, R. Mol. Cell. Proteomics 4, 1419–1440 (2005).
5. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646–4658 (2003).
6. Carr, S. et al. Mol. Cell. Proteomics 3, 531–533 (2004).
7. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Nat. Methods 4, 787–797 (2007).
8. Grobei, M.A. et al. Genome Res. 19, 1786–1800 (2009).
9. Resing, K.A. et al. Anal. Chem. 76, 3556–3568 (2004).
10. Searle, B.C., Turner, M. & Nesvizhskii, A.I. J. Proteome Res. 7, 245–253 (2008).
11. Ma, Z.Q. et al. J. Proteome Res. 8, 3872–3881 (2009).
12. Elias, J.E. & Gygi, S.P. Nat. Methods 4, 207–214 (2007).
13. Gupta, N. & Pevzner, P.A. J. Proteome Res. 8, 4173–4181 (2009).
14. Ahrens, C.H., Brunner, E., Hafen, E., Aebersold, R. & Basler, K. Fly 1, 182–186 (2007).
15. Anderson, N.L. et al. Mol. Cell. Proteomics 8, 883–886 (2009).
Minimum information about a protein affinity reagent (MIAPAR)
To the Editor: We wish to alert your readers to MIAPAR, the minimum information about a protein affinity reagent. This is a proposal developed within the community as an important first step in formalizing standards in reporting the production and properties of protein binding reagents, such as antibodies, developed and sold for the identification and detection of specific proteins present in biological samples. It defines a checklist of required information, intended for use by producers of affinity reagents, quality-control laboratories, users and databases (Supplementary Table 1). We envision that both commercial and freely available affinity reagents, as well as published studies using these reagents, could include a MIAPAR-compliant document describing the product’s properties with every available binding partner. This would enable the user or reader to make a fully informed evaluation of the validity of conclusions drawn using this reagent (Fig. 1).
Supplementary Table 2 shows an example of a MIAPAR-compliant document, which could be derived from the information supplied in a single publication using the workflow summarized in Supplementary Figure 1.
Affinity reagents serve various roles in experimental studies. These include protein sample identification and detection; protein capture for isolation, purification and quantification; and functional studies. The choice of an applicable molecular tool is conditioned by the experimental objectives and the chosen approaches and methods. This has led to a widening of the range of molecules being used as affinity reagents (Table 1 and ref. 1). The best established are ‘natural’ polyclonal and monoclonal antibodies; however, an expanding range of recombinant constructs are now available, including single-chain variable fragments (scFvs), single-domain antibody fragments and diabodies. More recently, alternative affinity reagents have been developed, the biophysical properties of which present advantages in specific applications. They include protein scaffolds, such as fibronectin, lipocalins and ankyrin and armadillo repeat domains, and nucleic acid aptamers. These reagents are used in a growing range of experimental methods, including enzyme-linked immunosorbent assay (ELISA), western blotting, immunohistochemistry, affinity chromatography and immunoprecipitation (Table 2).
At the same time, the systematic characterization of complete proteomes has led to an increase in the scale on which affinity reagents are produced. Several ambitious projects aim to develop systematic affinity-reagent collections. In Europe, they include the EU ProteomeBinders consortium¹, the Human Protein Resource and Human Protein Atlas² and the Antibody Factory³. In the United States, the National Cancer Institute (Bethesda, MD) has initiated the Clinical Proteomic Reagents Resource within the Clinical Proteomic Technologies Initiative for Cancer⁴. Globally, the Human Proteome Organization (HUPO) Human Antibody Initiative aims to promote and facilitate the use of antibodies for proteomics research, which embraces many of these activities (http://www.hupo.org/research/hai), and the HUPO Proteomics Standards Initiative has developed PSI-PAR, a global community standard format for the representation and exchange of protein affinity-reagent data⁵.
With the broadening availability of tools and methods, researchers have to define the most efficient binder applicable to the method and approach they have selected. These applications are carried out under different experimental conditions, which affect the choice of affinity reagent used. For example, binders can be either in solution or immobilized to a solid phase, and target proteins may be present either in a native, conformationally folded form or in a denatured state. To compare affinity reagents and decide upon the most appropriate one, users need comprehensive information regarding each reagent. Currently, multiple sources of information exist, including commercial catalogs of antibodies, portals centralizing affinity-reagent properties from various sources and experimental results published in the literature describing the successful use of a binder in a specific application. Large-scale production initiatives also add other sources such as validation and quality-control results from production centers and independent quality assessment laboratories (such as the Antibodypedia portal; http://www.antibodypedia.org/). Even so, the available information may be incomplete; for example, the identification of a protein belonging to a particular family using a given antibody may be reported with no information concerning the assessment of possible cross-reactivity of the antibody with other family members. Existing information may also be biased by unsubstantiated reports from a commercial producer. Furthermore, data may appear contradictory at first glance, owing to a lack of precision in target or sample descriptions.
The purpose of MIAPAR is to permit the reliable identification of affinity reagent–target–application triples. A binder is designed and produced for the detection of a particular target protein or peptide, often within a complex mixture. For maximum benefit of potential users, reporting of data about such a protein binder must describe (or reference) both its intended target and its qualities as a molecular tool. Ideally, such a description should include: (i) affinity reagent (and target) production processes, which may influence the characteristics of the binder and permit the unambiguous identification of the molecules; (ii) properties of the reagent as a binding tool, including its specificity, affinity, binding kinetics and cross-reactivity; (iii) the use of the reagent in applications (that is, compatibility with experimental
[Supplementary Fig. 1 diagram, Qeli & Ahrens: flowchart of the PeptideClassifier pipeline. Step 1 (FastaAnalyzer): analysis of protein sequence redundancies in the reference protein database, giving the identifiable proteome index (sequence, proteins). Step 2 (PepXMLParser): XML parsing of pepXML results from the peptide spectrum match and external post-processing (e.g. TPP). Step 2' (Peptide2Proteins): peptide-protein mapping of a user-selected peptide list. Both routes give the peptide-protein assignments (spectrum number, peptide sequence, protein alternatives, score). Step 3 (PeptideClassifier): peptide classification using the organism-specific gene-protein mapping, giving the classified peptides (e.g. MVQYNFK, Prot_1.1, 0.95, class 1a; ITVVPNGK, Prot_2.1 and Prot_2.2, 0.91, class 2; YISHLNAR, Prot_1.1, Prot_3.1 and Prot_900.1, 0.92, class 3b). Step 4 (UniquePeptideExplainer): extraction of the minimal protein list for unique peptides. Step 5 (SharedPeptideExplainer): extraction of the minimal protein list for shared peptides.]
Nature Biotechnology, vol. 28, no. 7 Ahrens, C. et al.
Supplementary Figure 1: Schematic overview of the modular architecture of
PeptideClassifier and its functionalities.
The names of the individual software modules are shown in blue. The score
shown in the peptide-protein assignment box and the classified peptides box
schematically represents the score that a probabilistic or other scoring algorithm
would provide as a measure of the quality of the PSM assignment.
Description of individual PeptideClassifier modules and their functionalities:
PeptideClassifier is command-line driven in order to support large-scale proteomics
approaches. For the entire software package and each of the five individual software
modules, a readme file is provided that explains the usage.
• Analysis of protein sequence redundancies (step 1, FastaAnalyzer)
In a first step, a reference protein database of choice is analyzed to identify any
protein sequence redundancies and to define the set of protein sequences that differ
by at least one amino acid and thus are (at least in theory) distinguishable by tandem
mass spectrometry (identifiable proteome index). Such redundancies can include
proteins encoded by the same gene model (whose transcripts differ only in the 5′
or 3′ untranslated regions, UTRs) or proteins encoded by different gene models
(e.g. histones). For each unique sequence, all protein identifiers pointing to that
sequence are listed. This information is later used in the peptide classification
process.
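The redundancy analysis described above can be illustrated with a minimal sketch. It is not the FastaAnalyzer code: the function names and the toy FASTA entries are hypothetical, and the only idea taken from the text is that identifiers are grouped under the unique sequence they share.

```python
from collections import defaultdict

def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first token of the header.
            identifier, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)

def build_proteome_index(fasta_text):
    """Map each distinct protein sequence to all identifiers that share it."""
    index = defaultdict(list)
    for identifier, sequence in parse_fasta(fasta_text):
        index[sequence].append(identifier)
    return dict(index)

# Hypothetical toy database: Prot_2.1 and Prot_2.2 have identical sequences.
EXAMPLE_FASTA = """>Prot_1.1
MVQYNFKYISHLNAR
>Prot_2.1
MALLVFQLKITVVPNGK
>Prot_2.2
MALLVFQLKITVVPNGK
"""
index = build_proteome_index(EXAMPLE_FASTA)
```

Each key of `index` is one entry of the identifiable proteome index; entries with more than one identifier are the redundancies that the classification step later has to resolve.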
• Organism-specific gene/protein naming convention & supported
databases
PeptideClassifier requires a clear mapping between gene models and their encoded
proteins. The annotation of genes and their encoded proteins follows different rules in
different organisms. For some model organisms such as Drosophila and Arabidopsis,
the annotation is straightforward in the sense that different splice variants encoded
by the same gene model have the same prefix in the identifier and differ from each
other only in their suffix. For example the Drosophila gene “supercoiling factor” with
the gene symbol ‘scf’ and the gene identifier CG9148 encodes 3 protein products
annotated as CG9148-PA, CG9148-PB and CG9148-PC. The annotation for
Arabidopsis thaliana and Caenorhabditis elegans follows similar conventions. Hence,
for these model organisms the relationship between protein names and gene model
names is trivial and can easily be extracted.
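For identifiers that follow the prefix/suffix convention just described, the gene model can be recovered by stripping the isoform suffix. The sketch below assumes that the suffix is everything after the last separator character; the helper names and the `separator` parameter are hypothetical, and the separator itself should be checked against the database in use.

```python
def gene_from_protein_id(protein_id, separator="-"):
    """Strip the isoform suffix, e.g. 'CG9148-PA' -> 'CG9148'."""
    gene_id, found, _suffix = protein_id.rpartition(separator)
    if not found:
        raise ValueError(f"no isoform suffix in {protein_id!r}")
    return gene_id

def group_by_gene(protein_ids, separator="-"):
    """Return {gene identifier: [protein identifiers]}."""
    mapping = {}
    for protein_id in protein_ids:
        gene_id = gene_from_protein_id(protein_id, separator)
        mapping.setdefault(gene_id, []).append(protein_id)
    return mapping

# The three protein products of CG9148 named in the text.
scf = group_by_gene(["CG9148-PA", "CG9148-PB", "CG9148-PC"])
```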
Extracting these relationships for other organisms, such as human and mouse,
depends on the reference database. For the IPI databases that are built from
different resources in order to provide a very comprehensive database, not all entries
have a link to a gene model identifier. While ENSEMBL reference databases do not
possess such a clear and intuitive relationship between the gene model and the
protein identifier as Drosophila and Arabidopsis, the mapping between gene model
and protein identifiers can be extracted from the headers of the respective Fasta
sequence entries and has the following form:
ENSP00000386659|ENSG00000222041
ENSP00000386718|ENSG00000222041
ENSP00000386407|ENSG00000222041
ENSP00000387320|ENSG00000222041
In this example, the human gene ENSG00000222041 has four splice variants named
as ENSP00000386659, ENSP00000386718, ENSP00000386407 and
ENSP00000387320. As such, ENSEMBL annotated data can be easily processed
following the same rules.
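A mapping file in the `protein|gene` form shown above can be read into a gene-to-proteins dictionary with a few lines. This is only a sketch under the assumption that every non-empty line holds exactly one protein identifier and one gene identifier separated by a vertical bar; the function name is hypothetical and unrelated to the shell scripts shipped with the software.

```python
def parse_mapping_lines(lines):
    """Return {gene identifier: [protein identifiers]} from 'protein|gene' lines."""
    gene_to_proteins = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        protein_id, gene_id = line.split("|")
        gene_to_proteins.setdefault(gene_id, []).append(protein_id)
    return gene_to_proteins

# The four entries quoted in the text.
MAPPING = """ENSP00000386659|ENSG00000222041
ENSP00000386718|ENSG00000222041
ENSP00000386407|ENSG00000222041
ENSP00000387320|ENSG00000222041
"""
ensembl_map = parse_mapping_lines(MAPPING.splitlines())
```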
In the folder Auxiliary_Scripts we have provided several shell scripts to enable users
to generate these gene model - protein identifier mapping files for their reference
database version of choice. By adapting the code of the shell scripts for the human or
mouse ENSEMBL database, these can easily be extended to other ENSEMBL model
organism databases.
Importantly, researchers could also opt to build their own enhanced database and
map the proteins without gene model identifier onto a reference genome, assign their
own identifiers, and then use PeptideClassifier. Any clear mapping e.g. from the
unique locus position of the gene to protein name would work. This would require a
continued bioinformatics effort, but represents a solution to map the peptides back to
the gene level and to minimize mis-classifications based on non-comprehensive
databases.
• XML parsing (Step 2, PepXMLParser)
The purpose of this step is to parse pepXML data files, which contain the output of
common database search algorithms and, if applicable, subsequent post-processing
steps (e.g. PeptideProphet as part of the Trans-Proteomics Pipeline), into a comma-
separated file format (CSV format) for easier downstream processing.
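To make the parsing step concrete, the following sketch converts a pepXML-like fragment into CSV rows using only the Python standard library. It is not the PepXMLParser module. The sample document is a simplified, hypothetical fragment, and the element and attribute names (`spectrum_query`, `search_hit`, `alternative_protein`, `peptideprophet_result`) are assumptions about the pepXML layout that should be verified against real search output.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; real files carry a namespace and more fields.
SAMPLE_PEPXML = """<msms_pipeline_analysis>
  <msms_run_summary>
    <spectrum_query spectrum="run1.00001.00001.2">
      <search_result>
        <search_hit hit_rank="1" peptide="ITVVPNGK" protein="Prot_2.1">
          <alternative_protein protein="Prot_2.2"/>
          <analysis_result analysis="peptideprophet">
            <peptideprophet_result probability="0.91"/>
          </analysis_result>
        </search_hit>
      </search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>"""

def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def pepxml_to_rows(xml_text):
    """Return one [spectrum, peptide, proteins, probability] row per search hit."""
    rows = []
    root = ET.fromstring(xml_text)
    for query in (e for e in root.iter() if local_name(e.tag) == "spectrum_query"):
        for hit in (e for e in query.iter() if local_name(e.tag) == "search_hit"):
            proteins = [hit.get("protein")]
            probability = ""
            for child in hit.iter():
                name = local_name(child.tag)
                if name == "alternative_protein":
                    proteins.append(child.get("protein"))
                elif name == "peptideprophet_result":
                    probability = child.get("probability", "")
            rows.append([query.get("spectrum"), hit.get("peptide"),
                         ";".join(proteins), probability])
    return rows

def rows_to_csv(rows):
    """Serialize the rows, with a header line, as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["spectrum", "peptide", "proteins", "probability"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = pepxml_to_rows(SAMPLE_PEPXML)
csv_text = rows_to_csv(rows)
```

Alternative proteins are joined with a semicolon so that the protein list stays in one CSV column; that delimiter is a choice made for this sketch only.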
• Peptide-Protein mapping (Step 2’, Peptide2Proteins)
As an alternative to the above step, a user can also provide a list of peptides that will
be processed further through the pipeline. For each peptide in the list, all protein(s)
where this peptide is part of their sequence (a substring) are extracted. This list of
alternative proteins for each peptide is thus similar to the output obtained after the
processing of pepXML files.
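The substring lookup described for this step can be sketched as follows. The function name and the toy sequences (loosely modelled on the example in Supplementary Fig. 1, without the truncated parts) are hypothetical, and the linear scan over all proteins is only adequate for small inputs.

```python
def map_peptides_to_proteins(peptides, proteins):
    """Return {peptide: sorted identifiers of proteins whose sequence contains it}."""
    return {
        peptide: sorted(pid for pid, seq in proteins.items() if peptide in seq)
        for peptide in peptides
    }

# Hypothetical toy sequences.
PROTEINS = {
    "Prot_1.1": "MVQYNFKYISHLNAR",
    "Prot_2.1": "MALLVFQLKITVVPNGK",
    "Prot_2.2": "MALLVFQLKITVVPNGK",
    "Prot_3.1": "MEDLVVPKYISHLNAR",
    "Prot_900.1": "MEDLVVPKYISHLNAR",
}
assignments = map_peptides_to_proteins(["MVQYNFK", "ITVVPNGK", "YISHLNAR"], PROTEINS)
```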
• Peptide classification (Step 3, PeptideClassifier)
The central classification step takes as input the peptides with the respective
alternative protein information, as well as protein sequence - protein identifier and
protein identifier - gene model relationships and classifies the peptides in a
deterministic way into one of six predefined peptide evidence classes for eukaryotes,
and three classes for prokaryotes.
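One possible reading of the six evidence classes, as they are defined in the main text, is sketched below. This is an illustration of the decision logic rather than the PeptideClassifier implementation; the function signature and the toy identifiers are hypothetical, and sequences are represented by opaque labels.

```python
def classify_peptide(matched_proteins, protein_to_gene, protein_to_sequence):
    """Assign one of the classes 1a, 1b, 2a, 2b, 3a, 3b to a peptide."""
    matched = set(matched_proteins)
    if not matched:
        raise ValueError("peptide matches no protein")
    genes = {protein_to_gene[p] for p in matched}
    sequences = {protein_to_sequence[p] for p in matched}
    if len(genes) > 1:
        # Several gene models: identical sequence (3a) or distinct sequences (3b).
        return "3a" if len(sequences) == 1 else "3b"
    if len(sequences) == 1:
        # One gene, one sequence: single isoform (1a) or several isoforms (1b).
        return "1a" if len(matched) == 1 else "1b"
    # One gene, several sequences: all of its sequences (2b) or a proper subset (2a).
    gene = next(iter(genes))
    all_sequences = {protein_to_sequence[p]
                     for p, g in protein_to_gene.items() if g == gene}
    return "2b" if sequences == all_sequences else "2a"

# Hypothetical annotation: P1.1/P1.2 differ only in UTRs; P3.1/P4.1 are duplicates.
GENE = {"P1.1": "G1", "P1.2": "G1", "P2.1": "G2", "P2.2": "G2", "P2.3": "G2",
        "P3.1": "G3", "P4.1": "G4"}
SEQ = {"P1.1": "A", "P1.2": "A", "P2.1": "B", "P2.2": "C", "P2.3": "D",
       "P3.1": "E", "P4.1": "E"}
```

For a prokaryotic annotation with one protein per gene, only the 1a, 3a and 3b branches can be reached, which matches the three classes reported for prokaryotes.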
• Extraction of a minimal protein list for unique peptides (Step 4,
UniquePeptideExplainer)
After the classification step, all peptides are categorized into one of three groups for
prokaryotes (1a, 3a, 3b) or six groups for eukaryotes (1a, 1b, 2a, 2b, 3a, 3b). We
thus extended the original classification reported in ref. 1 by further distinguishing
class 2 peptides and by also covering prokaryotes. Proteins implied by peptides of
classes 1a, 1b and 3a do not overlap with each other and therefore already form a
minimal set of proteins that explains these peptides.
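Collecting the protein identifications implied by unique peptides then reduces to a filter over the classified list. The sketch below is not the UniquePeptideExplainer module; the tuple layout, the function name and the default threshold of 0.9 are assumptions chosen for illustration.

```python
UNIQUE_CLASSES = {"1a", "1b", "3a"}

def unique_peptide_proteins(classified, threshold=0.9):
    """Group confident unique peptides by the protein identification they imply.

    classified: iterable of (peptide, score, peptide_class, proteins) tuples.
    Returns {tuple of identifiers sharing one sequence: [peptides]}.
    """
    identified = {}
    for peptide, score, peptide_class, proteins in classified:
        if peptide_class in UNIQUE_CLASSES and score >= threshold:
            key = tuple(sorted(proteins))
            identified.setdefault(key, []).append(peptide)
    return identified

# Hypothetical classified peptide list.
CLASSIFIED = [
    ("MVQYNFK", 0.99, "1a", ["Protein_1.1"]),
    ("ITVVPNGK", 0.25, "2a", ["Protein_1.1", "Protein_1.2"]),
    ("MVVRPYNDELR", 0.95, "3b", ["Protein_1.1", "Protein_2.1"]),
]
minimal_unique = unique_peptide_proteins(CLASSIFIED)
```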
• Extraction of a minimal protein list for shared peptides (Step 5,
SharedPeptideExplainer)
The ambiguity of shared peptides, which could be derived from several proteins (for
example splice variants, or similar proteins encoded by distinct gene models that
arose from duplication events), poses a problem for protein inference². While shared
peptides are less prominent in prokaryotic proteomes, they are quite common in higher
plants and in the mouse and human proteomes. Shared peptides of classes 2a, 2b and
3b can imply several distinct protein sequences. For eukaryotes, class 2a and 2b
peptides can imply proteins that also contain peptides of class 1a, 1b and 3a, and
such peptides can therefore be explained by these proteins. Any overlaps in this
context are considered and excluded.
Peptides of class 3b, present in both eukaryotes and prokaryotes, constitute shared
peptides of lowest information content pointing to several proteins encoded by
different gene models. Some of these peptides point to proteins that are also
identified with unique peptides. Applying the “Occam’s Razor” principle³ in this case,
the simplest hypothesis that would explain these peptides is that the proteins with
unique peptides are “present” in the sample and that these shared peptides also
originated from these proteins.
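The Occam's Razor step above amounts to splitting the shared peptides into those already explained by a protein identified through unique peptides and those that remain unexplained. A minimal sketch, with hypothetical names and toy data:

```python
def split_shared_peptides(shared, identified_proteins):
    """Separate shared peptides explained by already identified proteins.

    shared: {peptide: iterable of implied protein identifiers}.
    Returns (explained, unexplained) dictionaries of peptide -> set of proteins.
    """
    identified = set(identified_proteins)
    explained, unexplained = {}, {}
    for peptide, proteins in shared.items():
        overlap = set(proteins) & identified
        if overlap:
            explained[peptide] = overlap
        else:
            unexplained[peptide] = set(proteins)
    return explained, unexplained

# Hypothetical example: Prot_1.1 was identified by a unique peptide.
SHARED = {
    "YISHLNAR": {"Prot_1.1", "Prot_3.1", "Prot_900.1"},
    "LLEDGTFK": {"Prot_5.1", "Prot_6.1"},
}
explained, unexplained = split_shared_peptides(SHARED, {"Prot_1.1"})
```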
However, some of the class 3b peptides remain unexplained by unambiguously
identified proteins. This case is treated separately from the rest and the explanation
of these peptides is solved using an integer programming approach (again following
the Occam’s Razor principle to find the minimal set of proteins that explain these
remaining peptides).
These peptides can imply up to N additional proteins, where N is the maximal
number of proteins implied by all otherwise unexplained 3b peptides. The problem of
finding the minimal set of proteins can be formulated in the following way. We
denote by $y_i$ a binary variable representing whether protein $i$ is in the sample
($y_i = 1$) or not ($y_i = 0$). Each peptide $j \in J_{3b}$ implies a set of proteins
$Z_j \subseteq \{ y_1, y_2, \ldots, y_N \mid y_i \in \{0,1\} \}$.
The purpose is then to find the N-dimensional configuration vector $\hat{Y}$ that
solves the following integer programming problem:
Minimize: $\sum_{i=1}^{N} y_i$
Subject to: $\sum_{y_i \in Z_j} y_i \geq 1$ for each $j \in J_{3b}$
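As a small worked instance of this formulation (the three proteins and three peptides are invented for illustration and are not taken from any data set), suppose N = 3 and the unexplained class 3b peptides imply the protein sets $Z_1 = \{y_1, y_2\}$, $Z_2 = \{y_2, y_3\}$ and $Z_3 = \{y_1, y_2, y_3\}$. The problem then reads:

```latex
\begin{align*}
\text{Minimize: }   & y_1 + y_2 + y_3 \\
\text{Subject to: } & y_1 + y_2 \geq 1        && (j = 1) \\
                    & y_2 + y_3 \geq 1        && (j = 2) \\
                    & y_1 + y_2 + y_3 \geq 1  && (j = 3) \\
                    & y_i \in \{0, 1\}, \quad i = 1, 2, 3
\end{align*}
```

Here $\hat{Y} = (0, 1, 0)$ satisfies all three constraints with objective value 1. It is the only solution of that size, because $y_1$ alone violates the constraint for $j = 2$ and $y_3$ alone violates the constraint for $j = 1$. Protein 2 is therefore the minimal explanation for all three peptides.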
This optimization problem can be solved with linear programming libraries, e.g.
the GNU Linear Programming Kit (http://www.gnu.org/software/glpk/).
A shell script that includes instructions on how to run it with the protein lists implied
by class 3b peptides is provided in the software code (SharedPeptideExplainer.sh).
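To make the link between peptide lists and the integer program concrete, the following is a minimal sketch, not the provided SharedPeptideExplainer.sh, of how such lists could be written as a model in CPLEX LP format, which GLPK can read. The function name peptides_to_lp, the "PEPTIDE;PROT1,PROT2" input layout and the toy identifiers are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the PeptideClassifier distribution):
# convert "PEPTIDE;PROT1,PROT2,..." lines into a CPLEX LP model with one
# binary variable per protein and one ">= 1" constraint per peptide.
peptides_to_lp() {
  awk -F';' '
    NF == 2 {
      n = split($2, prot, ",")
      row = ""
      for (i = 1; i <= n; i++) {
        # remember each protein once, in order of first appearance
        if (!(prot[i] in seen)) { seen[prot[i]] = 1; np++; order[np] = prot[i] }
        row = row (i > 1 ? " + " : "") "y_" prot[i]
      }
      nc++
      cons[nc] = " c" nc ": " row " >= 1"
    }
    END {
      obj = ""
      for (i = 1; i <= np; i++) obj = obj (i > 1 ? " + " : "") "y_" order[i]
      print "Minimize"
      print " obj: " obj
      print "Subject To"
      for (k = 1; k <= nc; k++) print cons[k]
      print "Binary"
      for (i = 1; i <= np; i++) print " y_" order[i]
      print "End"
    }' "$@"
}

# Toy input: three unexplained peptides and the proteins they imply.
printf 'PEPA;P1,P2\nPEPB;P2,P3\nPEPC;P1,P2,P3\n' | peptides_to_lp
```

If the output is saved as model.lp and GLPK is installed, `glpsol --lp model.lp -o solution.txt` should solve it; the proteins whose variables equal 1 form one minimal explaining set. When several minimal sets exist, the solver reports only one of them.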
Finally, even some peptides of class 2a/2b might remain unexplained by
unambiguous protein identifications. They can be processed in the same way as
peptides of class 3b to derive a minimal list of protein identifications that explains
these peptides.
References
1. Grobei, M.A. et al. Genome Res. 19, 1786-1800 (2009).
2. Nesvizhskii, A.I. & Aebersold, R. Mol. Cell. Proteomics 4, 1419-1440 (2005).
3. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646-4658 (2003).
Supplementary Table 1. Implications of the six peptide evidence classes pre-computed by PeptideClassifier for selected major applications.

| Peptide evidence class | Protein sequence(s) | Protein isoform(s) | Gene(s) | Implication for protein inference | Implication for targeted proteomics | Implication for data integration with transcriptomics |
|---|---|---|---|---|---|---|
| 1a | unambiguous | unambiguous | unambiguous | minimal list | proteotypic peptide | integrate gene model based array data; potential to integrate exon array data |
| 1b | unambiguous | ambiguous | unambiguous | minimal list | proteotypic peptide | integrate gene model based array data; potential to integrate exon array data |
| 2a | ambiguous | ambiguous | unambiguous | specific for one gene model; identifies proper subset | identifies a proper subset of isoforms of a gene model | integrate gene model based array data; potential to integrate exon array data |
| 2b | ambiguous | ambiguous | unambiguous | specific for one gene model; implies all encoded proteins | identifies all annotated isoforms of a gene model | integrate gene model based array data |
| 3a | unambiguous | ambiguous | ambiguous | minimal list of proteins; can be encoded by distinct gene models | (proteotypic peptide) | potential to distinguish an ambiguous identification at the gene model level with transcriptomics, RT-PCR, or other data |
| 3b | ambiguous | ambiguous | ambiguous | minimal list of proteins to explain remaining peptide evidence* (Occam's razor) | n.a. | (potential to minimize the list of ambiguous proteins by integrating transcriptomics, RT-PCR, or other data) |

* see Supplementary Fig. 1 for more details.
Supplementary Table 2: Overview of the percentage of gene models with multiple
annotated protein sequences in reference databases.
PeptideClassifier is most useful when working with completely sequenced and well-
annotated genomes and the corresponding reference protein databases. An analysis of
different model organism and other reference database releases over time indicates that the
percentage of gene models for which more than one protein sequence is annotated
(including alternatively spliced protein isoforms) is increasing. This increase is particularly
striking for the human proteome. This is in part due to the formation of a Genome Reference
Consortium which aims to provide a comprehensive reference genome annotation
(http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml).
| Organism / reference database | # Proteins | # Gene models (GM) | # GM with one protein identifier | # GM with more than one protein identifier |
|---|---|---|---|---|
| D. melanogaster FlyBase 5.13 | 21,317 | 14,101 | 10,461 | 3,640 (25.8%) |
| C. elegans Wormpep 140 | 22,420 | 19,735 | 17,844 | 1,891 (9.6%) |
| C. elegans Wormpep 208 | 24,202 | 20,226 | 17,516 | 2,710 (13.4%) |
| A. thaliana TAIR7 | 31,921 | 27,029 | 23,230 | 3,799 (14.0%) |
| A. thaliana TAIR9 | 33,410 | 27,379 | 22,826 | 4,553 (16.6%) |
| M. musculus ENSEMBL NCBIM 37.53 | 40,732 | 23,497 | 14,594 | 8,903 (37.9%) |
| H. sapiens ENSEMBL NCBI36.53 | 47,509 | 21,673 | 10,427 | 11,246 (51.9%) |
| H. sapiens GRCh 36.56 | 77,748 | 23,678 | 8,945 | 14,733 (62.2%) |
Global gene expression analysis technologies, including whole genome tiling microarrays and
RNA-Seq, have exceptional potential to provide experimental evidence for genes missed by
current gene prediction algorithms and to identify alternative splice isoforms. Therefore, a
peptide classification scheme such as the one employed by PeptideClassifier should become
increasingly important.
Supplementary Table 3. Advantages of an in silico classification for information content-
based selection of peptides for targeted quantitative proteomics.
The example peptide ADQELLMYSHDNIICGITSVAFSR shown below implies 4 mouse
proteins, 3 of which have a different protein sequence; it has been observed multiple times in
public data downloaded from PRIDE. Our classification readily provides the information that
i) all these proteins are encoded by the same gene model, which is not apparent from the
cryptic protein identifiers*, and ii) the peptide cannot distinguish between the 4 proteins
and can thus only imply the encoding gene model (i.e. it is a class 2b peptide; see also Figure 1).
*For data searched in house, the protein identifier - gene model relationship information
would have to be parsed separately from the PSM search results and inspected manually.
This is very cumbersome and not practical for large-scale datasets. Importantly, for data
downloaded from public databases, the information that the proteins implied by a given
peptide are encoded by the same gene model is often not included. Thus, a pre-computed
index detailing the evidence class of each peptide offers multiple benefits.
Information that needs to be extracted / computed in silico:

| Peptide | Identifiers of implied protein(s) | Protein length | Gene model | Peptide class |
|---|---|---|---|---|
| ADQELLMYSHDNIICGITSVAFSR | ENSMUSP00000106656 | 382 | ENSMUSG00000029713 | 2b |
| | ENSMUSP00000031726 | 340 | ENSMUSG00000029713 | |
| | ENSMUSP00000106649 | 296 | ENSMUSG00000029713 | |
| | ENSMUSP00000106652 | 296 | ENSMUSG00000029713 | |
The PRIDE dataset also contained other experimentally observed tryptic peptides, to some
of which PeptideClassifier would assign an evidence class with higher information content
(see colored boxes in the ClustalW multiple sequence alignment shown below). The class
1a peptides (blue boxes) unambiguously identify one specific protein sequence, while the
class 2a peptide (green box) implies two distinct protein sequences encoded by this gene
model. This represents a proper subset of all four possible splice variants.
CLUSTAL 2.0.12 multiple sequence alignment

class 1a, class 1a
ENSMUSP00000106656  MGCSRPGMGIIPGLGPAARSCLPQPPSRGPQPPPTPPHGPGAMSELEQLRQEAEQLRNQI 60
ENSMUSP00000031726  ------------------------------------------MSELEQLRQEAEQLRNQI 18
ENSMUSP00000106649  ------------------------------------------------------------
ENSMUSP00000106652  ------------------------------------------------------------

class 2a
ENSMUSP00000106656  RDARKACGDSTLTQITAGLDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 120
ENSMUSP00000031726  RDARKACGDSTLTQITAGLDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 78
ENSMUSP00000106649  --------------------------MRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 34
ENSMUSP00000106652  --------------------------MRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 34
                                              **********************************

ENSMUSP00000106656  LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 180
ENSMUSP00000031726  LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 138
ENSMUSP00000106649  LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 94
ENSMUSP00000106652  LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 94
                    ************************************************************

ENSMUSP00000106656  LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 240
ENSMUSP00000031726  LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 198
ENSMUSP00000106649  LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 154
ENSMUSP00000106652  LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 154
                    ************************************************************

ENSMUSP00000106656  FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 300
ENSMUSP00000031726  FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 258
ENSMUSP00000106649  FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 214
ENSMUSP00000106652  FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 214
                    ************************************************************

class 2b
ENSMUSP00000106656  QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 360
ENSMUSP00000031726  QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 318
ENSMUSP00000106649  QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 274
ENSMUSP00000106652  QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 274
                    ************************************************************

ENSMUSP00000106656  GVTDDGMAVATGSWDSFLKIWN 382
ENSMUSP00000031726  GVTDDGMAVATGSWDSFLKIWN 340
ENSMUSP00000106649  GVTDDGMAVATGSWDSFLKIWN 296
ENSMUSP00000106652  GVTDDGMAVATGSWDSFLKIWN 296
                    **********************
Readily available information concerning different peptide evidence classes will be important
for the integration of shotgun proteomics data with transcriptomics data, both from gene
model-based and exon-based gene expression analysis platforms or RNA-Seq, and for the
selection of peptides for targeted quantitative proteomics projects.
For the specific use case of selecting peptides for targeted quantitative proteomics projects,
a researcher can use PeptideClassifier to compute a proteome-wide peptide index and then
browse all peptides for a specific gene model of interest. Integrating publicly available
experimental data (e.g. from PeptideAtlas, PRIDE or Tranche), and ideally peptide
detectability predictions, could further help to decide which peptide is most likely to
work well with selective reaction monitoring (SRM) or multiple reaction
monitoring (MRM). Below, we illustrate this use case for the ENSEMBL human protein
database (see Supplementary Fig. 1 for more detail on the usage of the respective
PeptideClassifier software modules).
Step 1: Extract the relationships between gene models and protein identifiers.
(the example database used below can be found in the folder Auxiliary_Scripts)
cat Hsapiens_NCBI36.53.fasta | grep \> | tr "|" " " |
awk '{for(i=1;i<=NF;i++){if(i==1 || $i~/gene/){printf "%s\t", $i}} printf "\n"}' |
sed -e "s/>//" -e "s/gene://" -e "s/\t$//" | awk '{print $2"\t"$1}' |
sort -u > Gene_Protein_Map_Human_36.53
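To illustrate what Step 1 produces, the sketch below performs a comparable extraction in a single awk call and runs it on two toy records. The function name gene_protein_map, the toy identifiers and the assumed header layout (protein identifier first, followed by space-separated fields including one that starts with "gene:") are illustrative assumptions, not part of the distributed scripts.

```shell
# Hypothetical single-awk variant of Step 1. Assumes FASTA headers of the form
#   >PROTEIN_ID pep:known ... gene:GENE_ID transcript:TRANSCRIPT_ID
# and prints one "gene<TAB>protein" line per header.
gene_protein_map() {
  awk '/^>/ {
    id = substr($1, 2)                  # protein identifier without ">"
    for (i = 2; i <= NF; i++)
      if ($i ~ /^gene:/)                # field carrying the gene model
        print substr($i, 6) "\t" id     # drop the "gene:" prefix
  }' "$@" | sort -u
}

# Toy input: two proteins encoded by the same gene model.
printf '>ENSP_A pep:known gene:ENSG_1 transcript:ENST_A\nMVQYNFK\n>ENSP_B pep:known gene:ENSG_1 transcript:ENST_B\nMVQYNFR\n' |
  gene_protein_map
```

The gene model sits in the first column and the protein identifier in the second, which is the layout that `cut -f2` in Step 5 relies on.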
Step 2: Analyze redundancies in the database.
./FastaAnalyzer/FastaAnalyzer Hsapiens_NCBI36.53.fasta > FastaAnalyzer_Results_Human_36.53
Step 3: Peptide digestion and filtering.
This step performs a theoretical protease digest of the proteins, filters the resulting peptides
according to a predefined mass range and provides the identifiers of all possible proteins
they point at. Digestdb, the script that performs the in silico digestion of proteins, is available
from the Trans-Proteomics Pipeline (TPP, http://tools.proteomecenter.org/).
digestdb Hsapiens_NCBI36.53.fasta |
awk '$2>=450 && $2<=4500 && length($4)>=6 {print $4"\t"$1}' | sort -u |
awk '{if($1!=previous){printf "\n%s;%s,", $1, $2; previous=$1}else{printf "%s,", $2}}' |
sed 's/,$//' | grep "[A-Z]" > Peptides_Proteins_Human_36.53
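The grouping performed by the second half of this pipeline can be tried without digestdb. The sketch below, with the hypothetical function name group_peptides and toy identifiers, collapses sorted "peptide, protein" pairs into one "PEPTIDE;PROT1,PROT2" line per peptide, which is the layout written to the Step 3 output file.

```shell
# Hypothetical helper mirroring the grouping stage of Step 3.
# Input:  "PEPTIDE<TAB>PROTEIN" pairs (duplicates allowed, any order).
# Output: one "PEPTIDE;PROT1,PROT2,..." line per distinct peptide.
group_peptides() {
  sort -u | awk -F'\t' '
    $1 != prev {                        # first pair of a new peptide
      if (NR > 1) printf "\n"
      printf "%s;%s", $1, $2
      prev = $1
      next
    }
    { printf ",%s", $2 }                # further proteins of the same peptide
    END { if (NR > 0) printf "\n" }'
}

# Toy input: PEPA occurs in two proteins (one pair duplicated), PEPB in one.
printf 'PEPA\tP2\nPEPA\tP1\nPEPB\tP3\nPEPA\tP1\n' | group_peptides
```

Sorting first makes all pairs of a peptide adjacent, so a single pass that compares each line with the previous peptide suffices.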
Step 4: Peptide classification.
java -jar ./PeptideClassifier/PeptideClassifier.jar Peptides_Proteins_Human_36.53 Gene_Protein_Map_Human_36.53 FastaAnalyzer_Results_Human_36.53 > outfile_classified_peptides_36.53
Step 5: Select a gene model of interest and all its encoded proteins; assess the classified
peptides.
cat Gene_Protein_Map_Human_36.53 | grep ENSG000000xxxxx | cut -f2 > proteins_of_interest
cat outfile_classified_peptides_36.53 | grep -f proteins_of_interest
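The two commands of Step 5 can be wrapped into one function, sketched below under stated assumptions: the name peptides_for_gene is hypothetical, the gene map is assumed to be tab-separated with the gene model in the first column, and the only assumption about the classified peptide file is that each line contains the identifiers of the implied proteins.

```shell
# Hypothetical wrapper around Step 5.
# $1: gene model identifier
# $2: gene map file (gene<TAB>protein)
# $3: classified peptide file (lines contain the implied protein identifiers)
peptides_for_gene() {
  list=$(mktemp) || return 1
  # exact match on the gene column, then keep the protein identifiers
  awk -F'\t' -v g="$1" '$1 == g { print $2 }' "$2" > "$list"
  # -F: identifiers are fixed strings; -w: match whole identifiers only
  grep -F -w -f "$list" "$3"
  status=$?
  rm -f "$list"
  return "$status"
}
```

A call such as `peptides_for_gene ENSMUSG00000029713 Gene_Protein_Map outfile_classified_peptides` (file names are placeholders) would list every classified peptide that implies at least one protein of that gene model. Matching identifiers as fixed whole words avoids hits on identifiers that merely share a prefix.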
From the output list, peptides of a specific evidence class can be selected for the proteins or
gene model of interest. To select peptides that could imply all protein isoforms encoded by a
gene model, a user could consider class 1a peptides for gene models that encode only one
protein sequence, class 1b peptides for gene models that encode several
indistinguishable protein sequences, and class 2b peptides for gene models that encode two
or more distinct protein sequences.
The class 2a peptides, on the other hand, imply a proper subset of the protein sequences
encoded by the gene model. Class 3a peptides are special in that they unambiguously
identify one protein sequence that can be encoded by distinct gene models. Proteomics
methods are not able to distinguish the encoding gene model, and other technologies would
have to be applied.
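The class distinctions described above can be sketched as a small decision rule. The code below is a deliberately simplified illustration and not the PeptideClassifier implementation: it works on protein identifiers only and ignores identical protein sequences, so it never reports classes 1b or 3a (those would need sequence redundancy information such as the FastaAnalyzer results). The function name classify_peptides and the file layouts are assumptions.

```shell
# Simplified, hypothetical classifier (identifier-based; classes 1b and 3a
# are not detected because identical sequences are not taken into account).
# $1: gene map (gene<TAB>protein); every implied protein must be listed
# $2: peptide file ("PEPTIDE;PROT1,PROT2,...")
classify_peptides() {
  awk '
    NR == FNR { gene[$2] = $1; size[$1]++; next }   # first file: gene map
    {
      split($0, rec, ";")
      n = split(rec[2], prot, ",")
      for (g in hit) delete hit[g]                  # reset per peptide
      ngenes = 0
      for (i = 1; i <= n; i++) {
        g = gene[prot[i]]
        if (!(g in hit)) { hit[g] = 1; ngenes++; last = g }
      }
      if (ngenes > 1)           cls = "3b"   # several gene models implied
      else if (n == 1)          cls = "1a"   # exactly one protein implied
      else if (n == size[last]) cls = "2b"   # all proteins of the gene model
      else                      cls = "2a"   # proper subset of its proteins
      print rec[1] "\t" rec[2] "\t" cls
    }' "$1" "$2"
}
```

Given a gene map that lists the four ENSMUSG00000029713 proteins and the implied protein lists from the table of classified peptides, this rule assigns 1a to GPQPPPTPPHGPGAMSELEQLR, 2a to ACGDSTLTQITAGLDPVGR and 2b to ADQELLMYSHDNIICGITSVAFSR, in agreement with the evidence classes reported there.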
The list of classified peptides that can be extracted for the mouse gene model
ENSMUSG00000029713 (which was described above) by this exemplary use case is shown
below.
| Peptide sequence (from tryptic digest) | Implied proteins | Evidence class (calculated for entire database) |
|---|---|---|
| ACGDSTLTQITAGLDPVGR | ENSMUSP00000031726,ENSMUSP00000106656 | 2a |
| ADQELLMYSHDNIICGITSVAFSR | ENSMUSP00000031726,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 2b |
| AGVLAGHDNR | ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| ELPGHTGYLSCCR | ENSMUSP00000029212,ENSMUSP00000031726,ENSMUSP00000103869,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| GPQPPPTPPHGPGAMSELEQLR | ENSMUSP00000106656 | 1a |
| IYAMHWGTDSR | ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| LIIWDSYTTNK | ENSMUSP00000029212,ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000103869,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| LLLAGYDDFNCNIWDAMK | ENSMUSP00000031726,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 2b |
| LLVSASQDGK | ENSMUSP00000024206,ENSMUSP00000029212,ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000103869,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| MGCSRPGMGIIPGLGPAAR | ENSMUSP00000106656 | 1a |
| MSELEQLR | ENSMUSP00000029212,ENSMUSP00000031726,ENSMUSP00000057938,ENSMUSP00000103869 | 3b |
| QEAEQLR | ENSMUSP00000029212,ENSMUSP00000031726,ENSMUSP00000103869,ENSMUSP00000106656 | 3b |
| QTFIGHESDINAVAFFPNGYAFTTGSDDATCR | ENSMUSP00000031726,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 2b |
| SCLPQPPSR | ENSMUSP00000106656 | 1a |
| SSWVMTCAYAPSGNFVACGGLDNICSIYSLK | ENSMUSP00000031726,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 2b |
| TFVSGACDASIK | ENSMUSP00000031726,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 2b |
| VHAIPLR | ENSMUSP00000024206,ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |
| VSCLGVTDDGMAVATGSWDSFLK | ENSMUSP00000030940,ENSMUSP00000031726,ENSMUSP00000101241,ENSMUSP00000106649,ENSMUSP00000106652,ENSMUSP00000106656 | 3b |