Protein grouping in mzIdentML

Date post: 24-Feb-2016
Upload: anne
Page 1: Protein grouping in  mzIdentML

Protein grouping in mzIdentML

Page 2: Protein grouping in  mzIdentML

ProteinDetectionList

ProteinAmbiguityGroup id=“PAG1”

ProteinDetectionHypothesis id=“PDH1” dbseq_ref=“dbseq_Q05421|CP2E1_MOUSE” (anchor protein)

ProteinAmbiguityGroup id=“PAG2”

ProteinDetectionHypothesis id=“PDH2” dbseq_ref=“dbseq_Q05423|CP2E2_MOUSE” (sequence same-set)

ProteinDetectionHypothesis id=“PDH3” dbseq_ref=“dbseq_Q05312|CP2F1_MOUSE” (sequence subset)

....
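The nesting on this slide can be read as XML elements. The following is a minimal sketch, assuming a simplified, namespace-free fragment that reuses the slide's `dbseq_ref` attribute spelling; real mzIdentML files declare a namespace, and attribute names should be checked against the schema. Which hypotheses sit in which group is an assumption made for illustration, and `hypotheses_by_group` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; the grouping of PDHs under PAG1 is assumed for illustration.
FRAGMENT = """
<ProteinDetectionList>
  <ProteinAmbiguityGroup id="PAG1">
    <ProteinDetectionHypothesis id="PDH1" dbseq_ref="dbseq_Q05421|CP2E1_MOUSE"/>
    <ProteinDetectionHypothesis id="PDH2" dbseq_ref="dbseq_Q05423|CP2E2_MOUSE"/>
    <ProteinDetectionHypothesis id="PDH3" dbseq_ref="dbseq_Q05312|CP2F1_MOUSE"/>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
"""


def hypotheses_by_group(xml_text):
    """Map each ProteinAmbiguityGroup id to its (PDH id, dbseq_ref) pairs."""
    root = ET.fromstring(xml_text.strip())
    groups = {}
    for pag in root.iter("ProteinAmbiguityGroup"):
        groups[pag.get("id")] = [
            (pdh.get("id"), pdh.get("dbseq_ref"))
            for pdh in pag.findall("ProteinDetectionHypothesis")
        ]
    return groups


groups = hypotheses_by_group(FRAGMENT)
for pag_id, pdhs in groups.items():
    print(pag_id, [pdh_id for pdh_id, _ in pdhs])
```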

Page 3: Protein grouping in  mzIdentML

ProteinAmbiguityGroup and ProteinDetectionHypothesis

Page 4: Protein grouping in  mzIdentML

id: MS:1001591
name: anchor protein
def: "A representative protein selected from a set of sequence same-set or spectrum same-set proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001592
name: family member protein
def: "A protein with significant homology to another protein, but some distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001593
name: group member with undefined relationship OR ortholog protein
def: "TO ENDETAIL: a really generic relationship OR ortholog protein." [PSI:MS]
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001594
name: sequence same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to an identical set of peptide sequences." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001595
name: spectrum same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

Existing CV terms for ProteinDetectionHypothesis

Page 5: Protein grouping in  mzIdentML

id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set of the peptide sequence matches for another protein, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001597
name: spectrum sub-set protein
def: "A protein with a sub-set of the matched spectra for another protein, where the matches cannot be distinguished using the evidence in the mass spectra, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001598
name: sequence subsumable protein
def: "A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001599
name: spectrum subsumable protein
def: "A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

Existing CV terms for ProteinDetectionHypothesis
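The sequence-level definitions above reduce to set comparisons on matched peptide sequences. The sketch below is one possible reading, not a reference implementation: the function names are hypothetical, the peptide strings are made-up examples, and "subsumable" is interpreted here as "covered only by the union of two or more other proteins", which is an assumption about the intent of the definition.

```python
def sequence_relationship(peptides, other_peptides):
    """Relate one protein's matched peptide sequences to another protein's."""
    a, b = set(peptides), set(other_peptides)
    if a == b:
        return "sequence same-set protein"
    if a < b:  # proper subset: no distinguishing peptide matches
        return "sequence sub-set protein"
    return None  # has at least one peptide the other protein lacks


def is_sequence_subsumable(peptides, others):
    """True if no single other protein covers all peptides, but together they do.

    Assumed interpretation of 'matches are distributed across two or more proteins'.
    """
    a = set(peptides)
    other_sets = [set(o) for o in others]
    covered_by_one = any(a <= o for o in other_sets)
    covered_by_union = a <= set().union(*other_sets)
    return covered_by_union and not covered_by_one


print(sequence_relationship({"AAK", "CCR"}, {"AAK", "CCR"}))
print(sequence_relationship({"AAK"}, {"AAK", "CCR"}))
print(is_sequence_subsumable({"AAK", "DDK"}, [{"AAK", "CCR"}, {"DDK"}]))
```

The spectrum-level terms would need the same comparisons on matched spectra rather than peptide sequences.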

Page 6: Protein grouping in  mzIdentML

Problems

• No requirement for any exporter to use the terms (“MAY”)
• “anchor protein” doesn’t capture intended role and isn’t used consistently

id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set ...." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

• No definition of what should be put in the value slot of CV terms:
– Could be the PDH identifier, accession or DBSequence identifier of group representative or any other protein that is super-set to this protein
– Or anything else for that matter

• What does passThreshold=“true” on PDH mean?

• Unclear how to count the number of identified proteins in an mzIdentML file
– Count PAGs or count PDHs?

• No terms for protocol describing how inference has been done or how to interpret results

Page 7: Protein grouping in  mzIdentML

Proposed work group outcomes

• Attach CV terms to <ProteinDetectionProtocol> describing how protein inference has been done
– Still under discussion, since these effectively describe parts of the algorithm used

• Exactly one mandatory “representative protein” (new name for “anchor protein”) MUST be present on a PDH per group
– To be checked by semantic validator

• ProteinDetectionList MUST have a CV term “number of identified proteins” (count PAGs that have a “representative protein” PDH with passThreshold=“true”)

• Each PDH SHOULD be flagged with one term from a group stating whether it is “representative protein”, “sequence|spectrum same-set”, “sequence|spectrum subset”, “sequence|spectrum subsumed” or “marginally distinguished” (i.e. not strictly any of these, but not enough evidence to be a group representative)
– Value slot of these terms SHOULD contain a comma-separated list of super-set or same-set (as appropriate) PDH IDs
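The two MUST rules proposed above (exactly one representative per group, and a count of groups whose representative passes the threshold) can be sketched as a small check. This is an illustration under stated assumptions: the in-memory layout (`terms`, `pass_threshold`), the function name `check_groups` and the example data are all hypothetical, and the actual semantic validator may work differently.

```python
REPRESENTATIVE = "representative protein"


def check_groups(pags):
    """Return (number of identified proteins, error messages).

    pags maps a PAG id to a list of PDH dicts with keys
    'id', 'terms' (set of CV term names) and 'pass_threshold' (bool).
    """
    errors = []
    identified = 0
    for pag_id, pdhs in pags.items():
        reps = [p for p in pdhs if REPRESENTATIVE in p["terms"]]
        if len(reps) != 1:
            errors.append(
                f"{pag_id}: expected exactly one representative protein, found {len(reps)}"
            )
            continue
        # Count the PAG only if its representative passed the threshold.
        if reps[0]["pass_threshold"]:
            identified += 1
    return identified, errors


example_pags = {
    "PAG1": [
        {"id": "PDH1", "terms": {REPRESENTATIVE}, "pass_threshold": True},
        {"id": "PDH2", "terms": {"sequence same-set protein"}, "pass_threshold": True},
    ],
    "PAG2": [
        {"id": "PDH3", "terms": {REPRESENTATIVE}, "pass_threshold": False},
    ],
    "PAG3": [
        {"id": "PDH4", "terms": {"sequence sub-set protein"}, "pass_threshold": True},
    ],
}

identified, errors = check_groups(example_pags)
print(identified, errors)
```

Counting groups rather than hypotheses means PDH2 does not add to the total even though it passes the threshold.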

Page 8: Protein grouping in  mzIdentML

Table 1 – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No parsimony”, “Strict parsimony”, “Parsimony with additional considerations”. Parent term: “Parsimony usage”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: “No parsimony” means that no parsimony approach has been applied in generating the protein list. “Strict parsimony” should be indicated if parsimony is the only consideration used to report proteins. “Parsimony with additional considerations” should be indicated if additional information, such as quantitation information, is used to influence which proteins are reported, or if some additional proteins are reported for other reasons, such as a desire to report one protein from each gene to which any matched peptide maps.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No intact protein separation for protein inference”, “Partial isolation for protein inference”, “Nearly complete isolation for protein inference”. Parent term: “Role of intact protein separation in protein inference”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: In workflows where proteins are not separated to any degree, or in which protein separation information is not used in the protein inference, this will have a value of “No intact protein separation for protein inference”, as will be the case in strictly bottom-up proteomics. At the other limit, “Nearly complete isolation for protein inference” should be indicated when separation of intact proteins is conducted and relied upon for protein inference, as is common in multi-dimensional gel-based work. The “Partial isolation for protein inference” value should be specified for cases where some level of protein isolation is used, for example if a sizing column is used to separate intact proteins into fractions, or in the common GeLC-MS workflow where 1D gel separation is followed by bottom-up analysis of the gel slices.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Attempted isoform differentiation”, “Prevented isoform differentiation”. Parent term: “Isoform Differentiation”
Values: -
Requirement level: SHOULD
Description: In the context of a parsimony approach, an inference tool can either attempt to report multiple protein forms by determining whether there is adequate evidence to support the detection of more than one isoform in a cluster (most common), or alternatively prevent this differentiation process and maximally group instead.

mzIdentML context: ProteinDetectionProtocol
CV terms: Accession Ambiguity is Reported
Values: “true”, “false”
Requirement level: SHOULD
Description: Used for reporting whether ambiguity is reported, i.e. if true, PAGs may contain one or more PDHs; if false, each PAG must contain only one PDH (no attempt to report ambiguity).

Page 9: Protein grouping in  mzIdentML

mzIdentML context: ProteinDetectionProtocol
CV terms: Threshold applied to Peptides
Values: “true”, “false”
Requirement level: SHOULD
Description: Set to true if thresholds are applied at the PSM or peptide level prior to protein inference. If thresholds have been applied, these should be reported under ProteinDetectionProtocol->Threshold using appropriate CV terms.

mzIdentML context: ProteinDetectionProtocol
CV terms: Multiple matches per spectrum are considered
Values: “true”, “false”
Requirement level: SHOULD
Description: This should be set to false for protein inference approaches that limit consideration to a single top-ranking peptide per spectrum; true should be set for approaches that preserve multiple answers per spectrum and provide all of these to the protein inference algorithm.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Spectrum-centric parsimony minimization”, “Sequence-centric parsimony minimization”, “Sequence-centric parsimony minimization with additional rules”, “No parsimony minimization”. Parent term: “Parsimony Minimization Method”
Values: -
Requirement level: SHOULD
Description: “Sequence-centric parsimony minimization” means that the inference method has sought to find the minimal set of proteins that explains all the peptide sequences observed, while “Spectrum-centric parsimony minimization” means that the inference approach has sought to find the minimal set of proteins that explains the collection of observed spectra. “Sequence-centric parsimony minimization with additional rules” would apply if a sequence-centric approach is used together with additional rules, for example if allowances are made to compensate for limitations of this approach such as I/L and deamidation ambiguities. “No parsimony minimization” should be indicated only if the “Parsimony usage” field is set to “No parsimony”.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Exhaustive list ambiguity modeling”, “Limited list ambiguity modeling”. Parent term: “Ambiguity Modeling Approach”
Values: -
Requirement level: SHOULD
Description: In modelling a PAG, one approach is for the algorithm to list all known intersection relationships, including accessions that have very limited overlap with the representative protein in the group. Alternatively, the scope of accessions that are included can be limited in various ways. For example, one could list only accessions that have at least some minimal level of intersection with the representative protein in the group. This CV term simply captures whether the group modelling is limited in some way or is exhaustive in listing accessions.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Protein Quality Threshold: MinimumNumSequencesRequired
Values: Integer
Requirement level: SHOULD
Description: An integer value representing the number of identified peptide sequences required for creating a PDH.

mzIdentML context: ProteinDetectionProtocol
CV terms: TaxonomyBasedPreference
Values: “true”, “false”
Requirement level: SHOULD
Description: In some workflows, one might map identified peptides to a multi-species protein sequence database, but prefer matches to sequences from a particular species.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Other thresholding terms?

Table 1 cont. – New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
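Two of the protocol terms in Table 1 constrain other parts of the file: “No parsimony minimization” is only meant to be used with “No parsimony”, and a false “Accession Ambiguity is Reported” implies one PDH per PAG. A minimal sketch of such cross-checks follows, assuming the protocol terms have already been read into a plain dictionary keyed by the proposed term names; `check_protocol` and the data layout are hypothetical.

```python
def check_protocol(protocol, pag_sizes):
    """Return warnings for protocol terms that contradict each other or the PAGs.

    protocol: dict of proposed term name -> reported value (assumed layout).
    pag_sizes: dict of PAG id -> number of PDHs in that PAG.
    """
    warnings = []
    if (
        protocol.get("Parsimony Minimization Method") == "No parsimony minimization"
        and protocol.get("Parsimony usage") != "No parsimony"
    ):
        warnings.append(
            "No parsimony minimization reported, but Parsimony usage is not No parsimony"
        )
    if protocol.get("Accession Ambiguity is Reported") == "false":
        for pag_id, size in pag_sizes.items():
            if size != 1:
                warnings.append(
                    f"{pag_id}: {size} PDHs although ambiguity is not reported"
                )
    return warnings


protocol = {
    "Parsimony usage": "Strict parsimony",
    "Parsimony Minimization Method": "No parsimony minimization",
    "Accession Ambiguity is Reported": "false",
}
for message in check_protocol(protocol, {"PAG1": 1, "PAG2": 3}):
    print("warning:", message)
```

The messages are reported as warnings here because every term involved is at the SHOULD level in the table.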

Page 10: Protein grouping in  mzIdentML

mzIdentML context: ProteinDetectionList
CV term: number of identified proteins
Values: Integer
Requirement level: MUST
Description: The value reported should equal the number of PAGs containing a PDH flagged as Representative protein and with passThreshold=“true”.

mzIdentML context: ProteinAmbiguityGroup
CV term: Protein cluster identifier
Values: String. A within-file unique identifier
Requirement level: MAY
Description: Reporting a common identifier allows multiple PAGs to be linked, for example indicating that some peptides are shared between different PAGs.

mzIdentML context: ProteinAmbiguityGroup
CV term: NumberDistinctProteinSequences
Values: Integer
Requirement level: SHOULD
Description: The number of distinct protein sequences among the PDHs in the group. For example, if there are two PDHs with different identifiers that have identical full-length sequences, the NumberDistinctProteinSequences would be one.

mzIdentML context: ProteinDetectionHypothesis
CV term: Representative protein
Values: -
Requirement level: MUST (be present on one PDH per PAG that is counted)
Description: The Representative protein will generally have likelihood greater than or equal to other proteins in the ProteinAmbiguityGroup, but this is not required. Exactly one PDH within a PAG must be assigned this label to serve as the representative for the putatively detected protein. A PDH labelled as the Representative protein can have passThreshold=“true|false”, i.e. it need not have passed the threshold reported in the ProteinDetectionProtocol.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Same-Set Protein
Values: xsd:String – comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to another protein in the group, having matches to an identical set of peptide sequences.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Same-Set Protein
Values: xsd:String – comma-separated list of PDH IDs that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to the Representative protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra.

Table 2 – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
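The NumberDistinctProteinSequences example in the table (two PDHs with identical full-length sequences count as one) amounts to counting unique sequences within a group. A minimal sketch, assuming the full-length sequences are already available per PDH; the function name and the short sequences are made up for illustration.

```python
def number_distinct_protein_sequences(pdh_sequences):
    """Count distinct full-length sequences among the PDHs of one PAG.

    pdh_sequences maps a PDH id to its full-length protein sequence.
    """
    return len(set(pdh_sequences.values()))


# Hypothetical group: PDH1 and PDH2 differ in identifier but share a sequence.
pag = {
    "PDH1": "MSALGVTVALLVWAAFLLLVSMWRQVHSSWNLPPGPFPLPIIGNLFQLELK",
    "PDH2": "MSALGVTVALLVWAAFLLLVSMWRQVHSSWNLPPGPFPLPIIGNLFQLELK",
    "PDH3": "MAVLGITVALLGWMVILLFISVWKQIHSSWNLPPGPFPLPIIGNLLQLDLK",
}
print(number_distinct_protein_sequences(pag))
```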

Page 11: Protein grouping in  mzIdentML

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Subset Protein
Values: xsd:String – comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the peptide sequence matches for the Representative protein, and no distinguishing peptide matches.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Subset Protein
Values: xsd:String – comma-separated list of PDH IDs that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the matched spectra for the Representative protein, where the matches cannot be distinguished using the evidence in the mass spectra.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Multiply Subsumable Protein
Values: xsd:String – comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Multiply Subsumable Protein
Values: xsd:String – comma-separated list of PDH IDs that subsume this PDH
Requirement level: SHOULD
Description: A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Marginally distinguished protein
Values: -
Requirement level: MAY
Description: Assigned to a PDH that has some evidence to support its presence in addition to the representative protein, i.e. it has a unique peptide, but this is not sufficient for it to be promoted to Representative protein in a PAG.

mzIdentML context: ProteinDetectionHypothesis
CV term: Covering Set Protein
Requirement level: MAY
Description: A member of a minimal set of proteins sufficient to explain all matched peptides/spectra via a parsimony approach. This provides an alternative means of reporting a parsimonious protein list when ParsimonyUsage=“Parsimony with additional considerations”. A PAG can contain zero, one, or multiple PDHs bearing this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Identical
Values: xsd:String – comma-separated list of native accession(s) of proteins with identical protein sequence
Requirement level: MAY
Description: Full-length protein sequence is identical to that of the protein specified in the value attribute of this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Subsequence
Values: xsd:String – native accession of protein with “super”-sequence
Requirement level: MAY
Description: Full-length protein sequence is a subsequence of the protein specified in the value attribute of this term.

Table 2 cont. – New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
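The relationship terms in Table 2 carry a comma-separated list of PDH IDs in their value slot, and the captions map requirement levels to validator message types. The sketch below combines both ideas under stated assumptions: `check_value_slot` is a hypothetical helper, the specific checks (empty value, self-reference, unknown ID) are illustrative choices rather than agreed validation rules, and all findings are reported at the SHOULD level because that is the level given for these terms.

```python
# Requirement level -> message type, as described in the table captions.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}


def check_value_slot(pdh_id, value, known_pdh_ids):
    """Check the comma-separated PDH IDs in a relationship term's value slot."""
    level = SEVERITY["SHOULD"]
    messages = []
    refs = [ref.strip() for ref in value.split(",") if ref.strip()]
    if not refs:
        messages.append((level, f"{pdh_id}: value slot is empty"))
    for ref in refs:
        if ref == pdh_id:
            messages.append((level, f"{pdh_id}: value slot lists the PDH itself"))
        elif ref not in known_pdh_ids:
            messages.append((level, f"{pdh_id}: unknown PDH ID {ref}"))
    return messages


known = {"PDH1", "PDH2", "PDH3"}
print(check_value_slot("PDH3", "PDH1, PDH2", known))
print(check_value_slot("PDH3", "PDH1,PDH9", known))
```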

Page 12: Protein grouping in  mzIdentML

Unresolved issues

• Are the protocol terms necessary / sensible / overkill?

• Is there general consensus on the idea that the number of identified proteins MUST be reported
– and must equal count of PAGs with PDH passThreshold=“true”

• Is it sensible to have SHOULD rules on all subset/same-sets?

• Extra terms for relationships between protein sequences
– Probably these will be removed

• Mechanism for updating the mzIdentML specifications and validation software
– Minor update + submission to shortened PSI process?

