Bioinformatics. Analysis of proteomic data.
Dr Richard J Edwards 28 August 2009; CALMARO workshop.
©Gary Larson
(In not much detail)
Bioinformatic analysis of proteomic data
Improving sequence identifications
  Dealing with redundancy
  Annotating protein hits
Adding value to protein lists
  Accession number mapping & data integration
  Gene Ontology analysis
  Protein interaction networks
Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
Open Discussion
Improving identifications: dealing with redundancy.
Identifying redundancy
Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
Choice of database affects redundancy identification
  SwissProt/IPI indicate splice variants
  EnsEMBL peptides map back onto non-redundant gene IDs
  Poor annotation: hard to differentiate variant/error/family
Example: alpha tubulin protein family
Identifying redundancy
  Sometimes, identification cannot be conclusive
Basic peptide grouping scenarios
Identifying redundancy
  Sometimes, identification cannot be conclusive
  Different scenarios can present different problems
  How important is it to study?
  Might need to identify protein(s) through further experiments
A simplified example of a protein summary list
Identifying redundancy
Final protein list:
  Conclusive IDs
  Protein groups
  Inconclusive IDs
Are inconclusive/group hits redundant?
  Same protein from different species
  Splice variants
Does it matter?
  Inflated numbers
  Biased analyses
  Comparisons between experiments
Unique to protein
Unique to group
No unique
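The three evidence classes in the legend above (unique to protein, unique to group, no unique peptide) can be illustrated with a minimal Python sketch. It assumes the search results have already been reduced to a mapping from protein accession to the set of peptides identified for it. The function name, labels and accessions are hypothetical and are not taken from Nesvizhskii (2005) or from any particular search engine.

```python
from collections import defaultdict


def classify_proteins(protein_peptides):
    """Label each protein as 'conclusive', 'group' or 'inconclusive'.

    protein_peptides: dict mapping protein ID -> set of identified peptides
    (a hypothetical input format chosen for this sketch).
    """
    # Map each peptide to the set of proteins that contain it.
    peptide_owners = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_owners[peptide].add(protein)

    # Proteins with identical peptide sets cannot be told apart,
    # so they are treated as one protein group.
    groups = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].add(protein)

    labels = {}
    for peptides, members in groups.items():
        # A peptide is "unique" if it is found only in this protein/group.
        has_unique = any(peptide_owners[p] == members for p in peptides)
        if not has_unique:
            label = "inconclusive"   # no unique peptide
        elif len(members) == 1:
            label = "conclusive"     # peptide unique to one protein
        else:
            label = "group"          # peptide unique to the group
        for protein in members:
            labels[protein] = label
    return labels


# Hypothetical accessions and peptides, for illustration only.
hits = {
    "P1": {"pepA", "pepB"},
    "P2": {"pepC", "pepD"},
    "P3": {"pepC", "pepD"},
    "P4": {"pepB"},
}
labels = classify_proteins(hits)
```

Here P1 has a peptide of its own, P2 and P3 are indistinguishable from each other, and P4 is supported only by a peptide shared with P1. Whether the group and inconclusive hits are truly redundant still needs a look at the peptides themselves.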
Homology groupings
Can use BLAST to identify groups of related proteins
  Help identify possible redundancies
  Need to look at peptides
Particularly useful for “off-species” identifications
  Tendency for many hits to same protein in different species
Clustering proteins by %identity
http://www.southampton.ac.uk/~re1u06/software/gablam/
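As a rough illustration of clustering proteins by %identity, the sketch below links any two proteins whose pairwise identity meets a threshold and reports the resulting single-linkage clusters. It assumes the pairwise identities have already been computed (for example from BLAST output). The function name, the 90% threshold and the input format are assumptions made for this sketch, and it does not describe how GABLAM itself works.

```python
def cluster_by_identity(proteins, pairwise_identity, min_identity=90.0):
    """Single-linkage clustering of proteins by pairwise % identity.

    proteins: iterable of protein IDs.
    pairwise_identity: dict mapping (id_a, id_b) -> % identity.
    min_identity: hypothetical threshold; choose to suit the data.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every pair that passes the threshold.
    for (a, b), identity in pairwise_identity.items():
        if identity >= min_identity:
            parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identities, for illustration only.
identities = {("A", "B"): 98.0, ("B", "C"): 92.5, ("C", "D"): 40.0}
clusters = cluster_by_identity(["A", "B", "C", "D"], identities)
# clusters == [["A", "B", "C"], ["D"]]
```

Single linkage is the simplest choice and tends to chain families together; a stricter threshold or a different linkage rule may suit "off-species" hits better.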
Improving identifications: annotating protein hits.
Protein annotation
Database
Protein List
NOISE
Poorly (un)annotated proteins
  Real proteins or database noise?
  Reliable annotation?
Most of our protein data comes from DNA sequences
PDB: 53,660 structures = 3D
SwissProt: 392,667 = Curated
TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
Most annotation inferred through sequence analysis
Protein data from translated DNA
  Lots of errors!
  Sequence errors
  Annotation errors
[Figure labels: Annotation, Translation]
Where does the data come from?
Protein annotation
Use standard sequence analysis tools
  Manual guidance/care = better than automated databases!
Homology searching
  BLAST vs. UniProtKB
  Protein domain searches, e.g. Pfam
Conservation analysis
  Multiple sequence alignment with homologues
  Are functionally important sites conserved?
Phylogenetic analysis
  Evolutionary relationships can help distinguish function
  Assignment to protein subfamily etc.
  Useful where BLAST hits have competing annotation
http://www.southampton.ac.uk/~re1u06/software/haqesac/
Beyond proteomics: adding value to protein lists.
What Bioinformatics cannot (usually) do
Magic
Replace hypothesis driven research
Directed analysis is always better than “fishing” (e.g. GO)
Provide a definitive answer
Ranking/prioritising better
Follow-up analyses
Many possibilities
  What was the aim of the study?
  What resources are available for your organism?
Imitation is the sincerest form of flattery
  Find a good study and copy the best bits
  Easier to describe
  Easier to justify to reviewers
Hypothesis-driven analysis is best
  Many tools facilitate hypothesis generation (data exploration)
  Be aware of risk of testing a hypothesis on data used to generate it
  Be aware of multiple testing issues
Follow-up analyses
EBI and NCBI both provide many useful tools
EBI run many good courses at Hinxton
http://www.ebi.ac.uk/Tools/
Seek collaborations
[Figure: plot of Reward against Time / Energy, labelled Bioinformatics]
Find a tame bioinformatician to help if needed
Good collaboration = Trade
  Papers / Grants / improving the bioinformatics
  E.g. adding your organism/database to an online resource
Accession number mapping
  Other databases may contain better/specific annotation
  UniProtKB, OMIM etc.
  Results from searches against older databases may need updating
  EBI tool: PICR [Protein Identifier Cross-Reference Service]
  BioMart: Query & Xref tool for many databases www.biomart.org
http://www.ebi.ac.uk/Tools/picr/
BioMart
Gene Ontology analysis
Gene Ontology [GO] = gene annotation project
Controlled vocabulary allows standardisation & comparisons
http://www.geneontology.org/
Gene Ontology analysis
Many Gene Ontology exploration tools
  AmiGO, GOA, FatiGO, DAVID etc.
  Depend on source databases
  May need to map IDs using PICR first
GO enrichment
  Assess frequency of GO terms in your list against expectation
  Often a big multiple testing issue
  Be aware of biases: how is expectation derived?
  E.g. abundant, conserved proteins more likely to be annotated & more likely to be identified in a proteomics experiment
Best if hypothesis-driven or used for data confirmation
  E.g. enrichment of certain subcellular fraction
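To make the enrichment and multiple testing points concrete, here is a minimal sketch of one common way to do it: a one-sided hypergeometric test per GO term, followed by a Benjamini-Hochberg adjustment across all terms tested. This is a swapped-in textbook approach and not the method of any specific tool named above. The function names and example counts are hypothetical, and the choice of background (N and K) is exactly the "how is expectation derived" question: ideally it reflects the proteins that could have been identified, not the whole database.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a list
    of n, drawn from a background of N proteins of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: term -> (k in list, K in background).
list_size, background_size = 50, 5000
term_counts = {"GO:term_a": (8, 100), "GO:term_b": (2, 150), "GO:term_c": (1, 40)}
raw = {t: hypergeom_pvalue(k, list_size, K, background_size)
       for t, (k, K) in term_counts.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
```

Adjusted values are only as meaningful as the background they were computed against, so a hypothesis-driven test of a few terms is usually easier to defend than a screen of every term.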
Protein interaction networks
  Can be useful for identifying protein complexes in data
  E.g. STRING [http://string-db.org/]
Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
Combined search strategy
Genome unavailable (for download & searching)
dbEST
Thalassiosira pseudonana
Taxa-limited Database
90,000 E. hux ESTs
Protein List
Taxa: Rhodophyta, Stramenopiles, Haptophyceae, Alveolata, Cryptophyta
[Flowchart: BUDAPEST workflow, steps numbered 1-5; labels as follows]
EST dataset
BLAST database
MS/MS data
MASCOT hits
MASCOT hits translated to 6 RFs
RFs and MASCOT peptides filtered
FIESTA consensus & annotation
Final protein identifications
BUDAPEST CORE
Poor quality RFs removed
OPTIONAL (MANUAL or AUTOMATED)
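The "translated to 6 RFs" and peptide filtering steps can be sketched as follows. This is a simplified illustration under stated assumptions (standard genetic code, exact substring matching of peptides, hypothetical function names and example sequence) and is not the BUDAPEST implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one frame; codons with ambiguous bases become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


def frames_with_peptides(dna, peptides):
    """Keep only the reading frames that contain an identified peptide."""
    return {name: protein
            for name, protein in six_frame_translations(dna).items()
            if any(peptide in protein for peptide in peptides)}


# Hypothetical EST fragment and peptide, for illustration only.
est = "ATGGCCAAATAG"
supported = frames_with_peptides(est, ["MAK"])
# supported == {"+1": "MAK*"}
```

Exact matching is a simplification: a real pipeline would also need to handle stop codons within frames and residues that MS/MS cannot distinguish, such as leucine and isoleucine.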
90,000 E. hux ESTs
173 ESTs (728 peptides)
189 RFs: 113 / 615
Taxa-limited Database
117 Cons (321 peptides)
34 Cons (34 peptides)
83 Cons (287 peptides)
173 EST hits (728 peptides)
83 Consensus sequences
  40 Clusters by homology (variants/isoforms)
287 Peptides
  239 Unique to one consensus
  48 Shared within one cluster
http://www.southampton.ac.uk/~re1u06/software/budapest/
Annotating EST Consensus Sequences
Homology searching & phylogenetics
Sequence Database
Consensus
UniProt
Taxa-limited Database
Alignment
Protein family identification
Redundancy/Variants
Combined search strategy
Genome unavailable (for download & searching)
dbEST
Thalassiosira pseudonana
Taxa-limited Database
90,000 E. hux ESTs
173 Hits, 83 Consensus, 40+ Proteins
96 Hits, 26+ Proteins
Taxa: Rhodophyta, Stramenopiles, Haptophyceae, Alveolata, Cryptophyta
64+ Proteins (12 Common)
Conclusions.
Summary
Extra analysis of raw protein lists adds value
  False positives vs. real proteins
  Annotation of uncharacterised hits
Numerous tools for mining protein lists
  Data exploration and/or hypothesis testing
  Community/organism dependent
  Worth contacting bioinformaticians for further development
Development of customised bioinformatics solutions can greatly increase power of study
Increased availability of high throughput technologies
  Poor annotation & high error rates
  Increased need for bioinformatics post-processing to improve quality
Open Discussion