Center for Mass Spectrometry and Proteomics | Phone | (612)625-2280 | (612)625-2279
© 2015 Regents of the University of Minnesota. All rights reserved.
Protein Sequence Databases
…and your Mass Spectrometry-based Proteomics Experiment
Outline
• Protein Database (DB)
  • Origin
  • Sources
  • Format
  • Size
  • Composition
• Selecting a database for mass spec search
• Effect of DB on mass spec search results
• Post MS analysis: protein annotation, ontology, alignment
Terminology
• FASTA
• Database repository
• NCBI database
• UniProtKB
• Swiss-Prot
• RefSeq (reference sequence)
• Homology
• Contaminants DB
• Ontology
FASTA Protein Sequence: Name and Origin
• FASTA (pronounced ‘fast-aye’)
• ORIGIN: file format created for the FASTA sequence similarity search tool (1985)
• REF: DJ Lipman, WR Pearson (1985) PMID: 2983426 "The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC)."
• Stands for “fast all” – the file format worked with ‘all’ alphabets (amino acid and nucleotide)
FASTA Protein Sequence Format
• Structure: TEXT file
• Line 1: description line with sequence identifier
• Line 2 onward: protein sequence in one-letter amino acid code, wrapped at 80 characters per line
• Allowed characters:
  • Amino acid one-letter codes
  • X (any or unknown amino acid)
  • * (translation stop)
  • - (gap)
  • Custom one-letter amino acid codes
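The record structure above can be read with a few lines of Python. This is a minimal sketch, not the parser used by any particular search program. It assumes the description line starts with ">" (the standard FASTA convention) and that the sequence may be wrapped over several lines. The record shown is a made-up example, not a real database entry.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # wrapped sequence lines are joined later
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical record for illustration only.
example = """>EXAMPLE_1 made-up protein for illustration
SELEEQLTPVAEETR
EQVAEVR
"""
records = list(read_fasta(example.splitlines()))
print(records)
```

Counting the records returned by `read_fasta` is also a quick way to check the number of entries in a downloaded database.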
FASTA Format Header Line Sequence Identifiers
• Line 1: description line with sequence identifier
FASTA Protein Sequence from NCBI: example
[Figure: NCBI FASTA record with Line 1 (description line) and Line 2 (sequence) labeled]
NOTE: In Sept 2016, gi numbers were replaced with accession.version identifiers
Selecting a Protein Sequence Database
• Public repositories, such as
  • NCBI
  • UniProtKB
    • Swiss-Prot: manually annotated and reviewed
    • TrEMBL: automatically annotated and not reviewed
• Custom (from customer)
  • NOTE: format is important!
• Should represent the species (1 or more) from which the protein sample originated
  • Example: Mouse protein expressed in E. coli
• Ideal size range: ~2,000 to < 1 million entries
Selecting a Protein Database: UniProtKB repository
Selecting a Protein Database: NCBI RefSeq repository
Choose Your Taxonomy or Taxonomies
NOTES:
• If recombinant protein expressed in host cell, include host proteins & expressed protein(s)
• If the protein database for your species has <2000 proteins, merge it with another protein database (e.g., yeast) so that the search statistics are based on enough sequences
• Protein sequence headers must be parsed correctly
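To illustrate what "parsed correctly" means, the sketch below pulls the accession out of two common header styles: the UniProtKB `db|accession|entry_name` layout and an NCBI-style `accession.version` prefix. The regular expressions and both example headers are assumptions made for illustration (the entries are hypothetical); a real search program applies its own parsing rules, so check its documentation.

```python
import re

# Assumed patterns; search programs define their own parsing rules.
UNIPROT = re.compile(r"^(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")
NCBI = re.compile(r"^([A-Z]{1,3}_?\d+\.\d+)\s+(.*)$")


def parse_header(header):
    """Return a dict describing a FASTA description line, or raise ValueError."""
    header = header.lstrip(">").strip()
    match = UNIPROT.match(header)
    if match:
        return {
            "source": "UniProtKB",
            "reviewed": match.group(1) == "sp",  # sp = Swiss-Prot, tr = TrEMBL
            "accession": match.group(2),
            "entry_name": match.group(3),
            "description": match.group(4),
        }
    match = NCBI.match(header)
    if match:
        return {
            "source": "NCBI",
            "accession": match.group(1),  # accession.version identifier
            "description": match.group(2),
        }
    raise ValueError(f"Unrecognized header format: {header!r}")


# Hypothetical headers for illustration only.
uniprot_hit = parse_header(">sp|P00000|EXMP_HUMAN Example protein OS=Homo sapiens OX=9606")
ncbi_hit = parse_header(">XP_000000.1 example protein [Ictidomys tridecemlineatus]")
print(uniprot_hit["accession"], ncbi_hit["accession"])
```

Raising an error on an unrecognized header, rather than guessing, makes a badly formatted custom database fail early instead of producing protein lists with missing accessions.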
Taxonomy specification - UniProtKB
(19996)
Taxonomy specification - NCBI
Protein Database repository content for Thirteen-lined Ground Squirrel
Database Source               Number of Proteins
Swiss-Prot* (reviewed)        20
TrEMBL* (unreviewed)          20,076
UniProt Reference Proteome    19,966
NCBI (‘non-redundant’)        30,130
NCBI Reference Sequence       29,842
* From UniProt
Protein Database Characteristics
…related to your mass spectrometry experiment
Splice form variants
Sequence alignments: Protein Cytochrome P450 2D6
Protein Sequence Variants
SNPs (single nucleotide polymorphisms)
https://hive.biochemistry.gwu.edu
Natural variants
In silico trypsin digest, ‘native’ protein
In silico trypsin digest, with VARIANTS
[Figure: digest map with variant peptide examples 1 and 2 marked]
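An in silico trypsin digest can be sketched in a few lines. The rule used here (cleave after K or R, but not when the next residue is P) is a common simplification and an assumption of this sketch; search programs may apply different rules and usually allow a configurable number of missed cleavages. The two sequences are peptide example 1 from this presentation in its native and Q -> K variant forms.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides, cleaving after K/R unless followed by P."""
    if not sequence:
        return []
    # Collect cleavage positions, including both ends of the sequence.
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        # Each extra step to the right skips one cleavage site.
        last = min(start + 2 + missed_cleavages, len(sites))
        for end in range(start + 1, last):
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides


native = trypsin_digest("SELEEQLTPVAEETR")
variant = trypsin_digest("SELEEKLTPVAEETR")
print(native)
print(variant)
```

The native sequence stays intact, while the Q -> K variant introduces a new cleavage site and is split into two shorter peptides.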
Effect of Variant on Peptide Mass
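Peptide example          Peptide Mass*   Peptide Sequence
1 – native               1730.8443       SELEEQLTPVAEETR
1 – variant (Q -> K)     1730.8806       SELEEKLTPVAEETR
1 – variant (Q -> K)     734.3566        SELEEK
1 – variant (Q -> K)     1015.5418       LTPVAEETR
2 – native               830.4366        EQVAEVR
2 – variant (V -> E)     860.4108        EQEAEVR
* Monoisotopic [M + H]+
NOTE: The Q -> K substitution creates a new trypsin cleavage site, so the variant can also be observed as the two shorter peptides SELEEK and LTPVAEETR.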
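The masses in the table can be reproduced by summing monoisotopic residue masses and adding one water and one proton. The sketch below uses standard residue masses rounded to five decimals, so results should agree with the table to within about 0.001; the function name `mh_plus` is just an illustrative choice.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the peptide termini
PROTON = 1.007276   # charge carrier for the singly protonated ion


def mh_plus(peptide):
    """Monoisotopic [M + H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


for peptide in ["SELEEQLTPVAEETR", "SELEEKLTPVAEETR", "SELEEK",
                "LTPVAEETR", "EQVAEVR", "EQEAEVR"]:
    print(f"{peptide:<16} {mh_plus(peptide):.4f}")
```

The Q -> K substitution changes the mass of the intact peptide by only about 0.036 Da, whereas V -> E shifts it by about 30 Da. A search against a database lacking the variant sequence cannot match either variant peptide by sequence.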
Proteomics Search Program Meets Protein Sequence Database
• Protein sequence file is downloaded to local computer
• Merge with common lab contaminants (keratins and more) database
  • http://www.thegpm.org/crap/
• Protein database is imported or indexed in the proteomics search program (sequence format is critical)
• REVERSED sequences are generated for False Discovery Rate (FDR) calculations
• Protein sequences are digested with enzymes in silico
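The merge and decoy steps above can be sketched as follows. This is an illustration under stated assumptions, not what any specific search program does internally: the `CONTAM_` and `REV_` prefixes and the function names are hypothetical, and the decoy-to-target ratio shown is only one common way to estimate FDR from a target-decoy search.

```python
def build_search_database(targets, contaminants,
                          decoy_prefix="REV_", contaminant_prefix="CONTAM_"):
    """Merge target and contaminant sequences, then add reversed decoys.

    Both inputs are dicts mapping accession -> protein sequence.
    """
    combined = dict(targets)
    for accession, sequence in contaminants.items():
        # Prefix contaminants so they are easy to flag in the results.
        combined[contaminant_prefix + accession] = sequence
    # One reversed (decoy) entry for every forward entry.
    decoys = {decoy_prefix + acc: seq[::-1] for acc, seq in combined.items()}
    return {**combined, **decoys}


def estimate_fdr(target_hits, decoy_hits):
    """Simple target-decoy estimate: decoy matches divided by target matches."""
    if target_hits == 0:
        return 0.0
    return decoy_hits / target_hits


# Toy sequences for illustration only.
database = build_search_database({"P1": "SELEEQLTPVAEETR"}, {"K1": "EQVAEVR"})
for accession, sequence in database.items():
    print(accession, sequence)
```

Because every forward entry has exactly one reversed partner, matches to `REV_` entries give an estimate of how many forward matches are expected to be false.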
Database search > Protein List
• Database search algorithm matches spectrum > peptide > protein
• RESULTS: List of protein identifications with accession numbers
• POST Database search options (outside CMSP):
  1. Protein annotation
  2. Sequence alignment
  3. Obtain related Gene Ontology information
POST Database search options
What you can do with your protein list.
1) Protein Annotation from UniProtKB
2) Sequence alignment with UniProt alignment tool
2) Sequence alignment with UniProt alignment tool: numerous amino acid labeling options
• * (asterisk): positions which have a single, fully conserved residue
• : (colon): conservation between groups of strongly similar properties, scoring > 0.5 in the Gonnet PAM 250 matrix
• . (period): conservation between groups of weakly similar properties, scoring ≤ 0.5 in the Gonnet PAM 250 matrix
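The three conservation symbols can be derived column by column from an alignment. The sketch below uses the "strong" and "weak" residue groups as listed in the Clustal documentation in place of direct Gonnet PAM 250 scores; treat the group lists as an assumption to verify against the alignment tool you use. Columns containing a gap are left blank here, which is also an assumption of this sketch.

```python
# Residue groups as listed in the Clustal documentation (assumed here).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # gapped columns get no symbol in this sketch
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weak group
    return " "


def consensus_line(aligned_sequences):
    """Build the symbol line for equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*aligned_sequences))


# Native and Q -> K variant forms of peptide example 1, already aligned.
symbols = consensus_line(["SELEEQLTPVAEETR", "SELEEKLTPVAEETR"])
print("SELEEQLTPVAEETR")
print("SELEEKLTPVAEETR")
print(symbols)
```

In this example the only non-identical column is the Q/K position, which falls in a strong group and is therefore marked with a colon.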
2) Sequence alignment with NCBI BLAST
3) Link Gene Ontology information to Proteins
• Define: “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.”
• Ontologies/Vocabularies
  • molecular function: molecular activities of gene products
  • cellular component: where gene products are active
  • biological process: pathways and larger processes made up of the activities of multiple gene products
(http://geneontology.org/page/documentation)
Molecular Function Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)
Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171
Protein Class Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)
Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171
Biological Process Pie Chart for a List of 96 Protein Identifiers (gi numbers) submitted to PANTHER (http://www.pantherdb.org/)
Protein list from Supplemental data REF: Thu TM et al (2016) Cell Reports, 15(6):1254-65; PMID: 27134171
Database Tools for Proteins
• http://geneontology.org/
• http://string-db.org/
• http://www.pantherdb.org/
• http://www.ingenuity.com/products/ipa (licensed at UM via MSI)
ALSO:
Match mass spec data to your RNA Seq data with:
• https://galaxyp.msi.umn.edu/