Predicting Active Site Residue Annotations in the Pfam Database

Page 1: Predicting Active Site Residue Annotations in the Pfam Database

Predicting Active Site Residue Annotations in the Pfam Database

[ Publication date: 9 August 2007 ]

Presentation by: KEYUR MALAVIYA

[ BMC Bioinformatics ]

Authors: Jaina Mistry; Alex Bateman; Robert D Finn [Authors of this paper & the Pfam database]

Page 2: Predicting Active Site Residue Annotations in the Pfam Database

TOPICS COVERED

• Background

• Introduction

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

Page 3: Predicting Active Site Residue Annotations in the Pfam Database

TOPICS COVERED

• Introduction

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

• Background

Page 4: Predicting Active Site Residue Annotations in the Pfam Database

Predicting Active Site Residue Annotations in the Pfam Database

Background:

Page 5: Predicting Active Site Residue Annotations in the Pfam Database

Background: Pfam Database

• Pfam is a collection of protein families and domains

• Pfam contains multiple protein alignments & profile-HMMs of these families

• Function: To view the domain organization of proteins

• 5% of Pfam families are enzymatic

• Of the sequences in these families, only a small fraction (<0.5%) have had the residues responsible for catalysis determined

• The structure and chemical properties of these residues (the active site) determine the chemistry of the enzyme

Page 6: Predicting Active Site Residue Annotations in the Pfam Database

Background:

• Active site: The active site of an enzyme contains the catalytic and binding sites

• Binding site: A region on a protein (or on DNA or RNA) to which specific other molecules and ions, called ligands, bind

• Ligand: Binds to and forms a complex with a biomolecule to serve a biological purpose, i.e. it is an effector molecule binding to a site on a target protein

• Enzymes: Control the flow of metabolites within a cell and catalyze virtually all reactions that make or modify molecules

Page 7: Predicting Active Site Residue Annotations in the Pfam Database

Information about other databases: NCBI BLAST

• NCBI BLAST: Finds regions of local similarity between sequences (homologs). BLAST can be used to infer functional and evolutionary relationships between sequences

Page 8: Predicting Active Site Residue Annotations in the Pfam Database

Information about other databases: UniProtKB

• UniProtKB: Curated protein sequence database containing literature-collated active site residues (A-S-R) and predicted A-S-R. It only predicts A-S-R by similarity for sequences in UniProtKB/Swiss-Prot

Page 9: Predicting Active Site Residue Annotations in the Pfam Database

Information about other databases: UniProtKB

• UniProtKB: Curated protein sequence database (i.e. literature collated A-S-R) & predicted A-S-R. Only predicts A-S-R by similarity for sequences in UniProtKB/Swiss-Prot

Page 10: Predicting Active Site Residue Annotations in the Pfam Database

Information about other databases: PROSITE

• PROSITE: Consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them

Page 11: Predicting Active Site Residue Annotations in the Pfam Database

Information about other databases:

• Catalytic Site Atlas (CSA): Documents enzyme active sites and catalytic residues in enzymes of known 3D structure

• Collates A-S-R from the literature for proteins with known structure

• Predicts A-S-R for proteins with a known structure, inferred on the basis of PSI-BLAST hits

• One of the largest resources for catalytic sites

• SMART and MEROPS: Collate active site data from the literature and use sequence-similarity-based transfer to annotate active site residues onto the sequences in their protein families

Page 12: Predicting Active Site Residue Annotations in the Pfam Database

UniProt – Universal protein knowledgebase:

Page 13: Predicting Active Site Residue Annotations in the Pfam Database

UniProt – Universal protein knowledgebase:

• Pfam and UniProtKB: 74% of protein sequences in UniProtKB have at least one match to Pfam (sequence coverage is 74%)

Page 14: Predicting Active Site Residue Annotations in the Pfam Database

TOPICS COVERED

• Introduction

• Background

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

Page 15: Predicting Active Site Residue Annotations in the Pfam Database

Introduction:

• Problem: Low number of active site annotations

• Goal of this paper: To increase the number of active site annotations

• Approach: Transfer experimentally determined active site residue data to other sequences within the same Pfam family, using a strict set of rules to reduce the rate of false positives (FPs)

• Results:

• Only 3% of predicted sequences are false positives

• Predicted 606110 active site residues, of which 94% are not found in UniProtKB

• The tool developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download

• This tool is useful in proteome annotation, comparative genomics, protein evolution and active site characterization

Page 16: Predicting Active Site Residue Annotations in the Pfam Database

The problem and the solution

• Pfam[1] release 20.0: 8296 protein families

• Sequences in enzymatic Pfam families with experimentally determined active site residues: only ~0.4%

• To do better: Need to overcome the lack of experimental data

• HOW? Computationally predict active sites in protein sequences

• Two broad categories:

1) computational methods that transfer experimentally characterized active site data by similarity

2) those that predict active site residues ab initio

Page 17: Predicting Active Site Residue Annotations in the Pfam Database

ab initio methods:

• Exploit known properties like:

• Active sites are usually found buried within a cleft of a protein

• Mutations in them can increase the stability of an enzyme

• Active site residues are highly conserved

• Methods: Use geometry data, stability profiles and sequence conservation to predict active sites

Page 18: Predicting Active Site Residue Annotations in the Pfam Database

ab initio methods:

• Evolutionary trace (ET):

- Identifies the most highly conserved residues in related sequences

- Maps them onto the structure of the protein

- Then examines the structure for clusters of residues which could correspond to active sites or other functional sites

- Successful prediction in 60-80% of test cases

• Other methods: Neural networks and support vector machines

• Problem: These methods are hard to compare to each other in terms of accuracy

• All have a relatively high rate of False Positives
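The first ET step listed above, finding the most highly conserved positions in related sequences, can be illustrated with a short sketch. This is not the evolutionary trace algorithm itself: it assumes a simple majority-fraction score per alignment column, and the function name, the toy alignment and the 0.9 threshold are all hypothetical.

```python
from collections import Counter
from typing import List


def column_conservation(rows: List[str]) -> List[float]:
    """Fraction of sequences sharing the most common residue in each column.

    Gaps ("-") never count as the conserved residue, but they still count
    in the denominator, so gappy columns score low.
    """
    scores = []
    for column in zip(*rows):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment (hypothetical data): columns 1 and 3 are fully conserved.
rows = ["MDKH", "LDRH", "IDKH", "-DEH"]
scores = column_conservation(rows)
conserved = [i for i, s in enumerate(scores) if s >= 0.9]
print(scores, conserved)
```

Highly conserved columns are only candidates; as the slide notes, ET then has to map them onto a structure and look for clusters, which this sketch does not attempt.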

Page 19: Predicting Active Site Residue Annotations in the Pfam Database

Similarity transfer based methods:

• Transfer A-S-R from the characterized sequences to the uncharacterized sequences

• First identify homologous sequences: Use tools such as BLAST searches, hidden Markov models (HMMs), pattern matching and structural templates

• Transfer active site residues (A-S-R)

• Pfam with this rule based methodology = Pfam+

Page 20: Predicting Active Site Residue Annotations in the Pfam Database

Where we are:

• Introduction

• Background

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

Page 21: Predicting Active Site Residue Annotations in the Pfam Database

Construction and content:

• The Pfam database is renowned for having no known false positives in its alignments

• The active site Pfam families contain both active and inactive homologues

• Known active site residues from UniProtKB/Swiss-Prot in a Pfam alignment are conserved in many of the sequences without active site annotation

• Construction: A set of rules that allows conservative transfer of active site annotation from one protein to another protein in the same Pfam alignment

• To predict active site residues:

• identify sequences with experimentally verified active site residues

• use this information to predict active site residues in other members of that family

Page 22: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

find a homologous set of proteins & generate a protein alignment:

Page 23: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

• Identify the positions of all experimentally verified active sites in the alignment:

Page 24: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

Page 25: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

• Seq1 contains 3 experimental active sites (D, E & H)

• Seq2 contains 2 experimentally defined active site residues (D & E)

• Apply step 3: H in Seq2 is predicted to be an active site residue

Page 26: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

D in column 13, E in column 43 and H in column 45.

Page 27: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

Page 28: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

• Seq1 and Seq2 now contain 3 experimental active sites (D, E & H)

• Seq3 contains residues D, E & H in the active site residue columns

• Apply step 5: D, E & H in Seq3 are predicted to be active site residues

• Each unannotated sequence in the alignment is analyzed to see if it contains an exact match to the active site pattern

IS THIS ENOUGH? WILL THIS WORK?

• If the prediction is wrong, there will be false positives, BUT they will not be “KNOWN” false positives
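The exact-match transfer walked through on these slides can be sketched roughly as follows. This is a simplified illustration, not the authors' tool: it assumes the alignment is a dict of equal-length gapped strings, uses 0-based column indices, and every name in it (`transfer`, `make_row`, the toy sequences) is hypothetical.

```python
from typing import Dict, Set


def make_row(length: int, placed: Dict[int, str], filler: str = "A") -> str:
    """Build a toy aligned sequence with given residues at given columns."""
    row = [filler] * length
    for col, res in placed.items():
        row[col] = res
    return "".join(row)


def transfer(alignment: Dict[str, str],
             experimental: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Predict active site columns by exact match to an experimental pattern.

    A pattern is the set of (column, residue) pairs of one experimentally
    annotated sequence. Another sequence receives the annotation only if it
    has the identical residue in every column of that pattern.
    """
    predicted: Dict[str, Set[int]] = {}
    for source, columns in experimental.items():
        pattern = {col: alignment[source][col] for col in columns}
        for name, row in alignment.items():
            if name == source:
                continue
            if not all(row[col] == res for col, res in pattern.items()):
                continue  # a single mismatch (or a gap) blocks the transfer
            new_cols = set(pattern) - experimental.get(name, set())
            if new_cols:
                predicted.setdefault(name, set()).update(new_cols)
    return predicted


site = {13: "D", 43: "E", 45: "H"}
alignment = {
    "seq1": make_row(50, site),
    "seq2": make_row(50, site),
    "seq3": make_row(50, site, filler="G"),
    "seq4": make_row(50, {13: "D", 43: "Q", 45: "H"}),  # E replaced by Q
}
experimental = {"seq1": {13, 43, 45}, "seq2": {13, 43}}
predicted = transfer(alignment, experimental)
print(predicted)
```

In this toy run, seq2 gains the H column and seq3 gains all three columns, while seq4 gets nothing because one residue differs. The conservative choice is deliberate: requiring the whole pattern to match is what keeps the false positive rate low, at the cost of missing divergent but genuinely active homologues.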

Page 29: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

Page 30: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

Two distinct experimentally determined active site patterns within a family

Unannotated sequence matches more than one active site pattern

• Seq5 experimentally verified active site residues: H (col 9), E (col 42)

• Seq6 experimentally verified active site residues: T (col 11), E (col 42)

• Predict H (col 9) for Seq6 and, similarly, T (col 11) for Seq5? NO

• The true active site pattern for the family may be the union of the active sites of Seq5 and Seq6, but the two patterns are not combined since their union has not been experimentally observed

Page 31: Predicting Active Site Residue Annotations in the Pfam Database

Logic of the rule based methodology

Two distinct experimentally determined active site patterns within a family

Unannotated sequence matches more than one active site pattern

• Seq5 experimentally verified active site residues: H (col 9), E (col 42)

• Seq6 experimentally verified active site residues: T (col 11), E (col 42)

• What about Seq7?

• Seq7 contains the active site patterns found in both Seq5 & Seq6

• Seq7 has a higher % identity to Seq6 than to Seq5

• T in column 11 & E in column 42 of Seq7 are predicted to be A-S-R
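The Seq7 case, where an unannotated sequence matches more than one experimental pattern and the pattern of the most similar annotated sequence wins, might be sketched like this. The percent identity definition used here (identical residues over gap-free aligned pairs) is an assumption, as are all function names and the toy sequences; column numbers from the slide are reused as 0-based indices purely for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def make_row(length: int, placed: Dict[int, str], filler: str = "A") -> str:
    """Build a toy aligned sequence with given residues at given columns."""
    row = [filler] * length
    for col, res in placed.items():
        row[col] = res
    return "".join(row)


def percent_identity(a: str, b: str) -> float:
    """Identity over aligned pairs where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def choose_pattern(
    query: str, annotated: Dict[str, Tuple[str, Set[int]]]
) -> Optional[Tuple[str, Set[int]]]:
    """Pick the matching pattern whose source is most identical to the query.

    Patterns are never merged: only one experimentally observed pattern
    is transferred to the query.
    """
    candidates = [
        (percent_identity(query, row), name, cols)
        for name, (row, cols) in annotated.items()
        if all(query[c] == row[c] for c in cols)
    ]
    if not candidates:
        return None
    _, name, cols = max(candidates, key=lambda c: c[0])
    return name, cols


seq5 = make_row(45, {9: "H", 42: "E"}, filler="A")
seq6 = make_row(45, {11: "T", 42: "E"}, filler="G")
seq7 = make_row(45, {9: "H", 11: "T", 42: "E"}, filler="G")  # closer to seq6
annotated = {"seq5": (seq5, {9, 42}), "seq6": (seq6, {11, 42})}
choice = choose_pattern(seq7, annotated)
print(choice)
```

Here seq7 matches both patterns but is far more similar to seq6, so only T and E are transferred and the H column is left unannotated, mirroring the outcome described on the slide.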

Page 32: Predicting Active Site Residue Annotations in the Pfam Database

Data source:

• UniProtKB chosen as preferred source of experimental active sites for Pfam - why?:

• Using UniProtKB gives a low false positive rate

• UniProtKB experimental active sites are more comprehensive than the CSA (they cover sequences with both known and unknown structure)

Page 33: Predicting Active Site Residue Annotations in the Pfam Database

Where we are:

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

• Background

• Introduction

Page 34: Predicting Active Site Residue Annotations in the Pfam Database

Transfer of UniProtKB experimental data within Pfam alignments

• Input: the 2735 experimentally determined active site annotations in UniProtKB 8.0 and the alignments in Pfam 20.0

• Pfam+ predicts 606,110 active site residues

• UniProtKB predicts 45,685 A-S-R

• Pfam+ is unable to predict 23% (10,312 residues) of the UniProtKB predictions. Why?

• 55% (5601) of these 23% were found in Pfam alignments that did not contain experimental UniProtKB A-S-R at that position

• Figure: Overlap of predicted A-S-R annotation between Pfam+ (‘Pfam predicted’) & UniProtKB

Page 35: Predicting Active Site Residue Annotations in the Pfam Database

Transfer of UniProtKB experimental data within Pfam alignments

• 55% (5601) of these 23% were found in Pfam alignments that did not contain experimental UniProtKB A-S-R at that position

• Pfam+ predictions are based on transferring known experimental data within a Pfam alignment, so these residues cannot be predicted

• These 5601 residues are part of the 10312 residues that Pfam+ does not predict

Page 36: Predicting Active Site Residue Annotations in the Pfam Database

Transfer of UniProtKB experimental data within Pfam alignments

• 94% (570765 residues) of Pfam+ active site predictions are not present in UniProtKB. Why?

• UniProtKB makes predictions only for sequences in UniProtKB/Swiss-Prot, whereas Pfam+ also makes predictions for the automatically generated UniProtKB/TrEMBL entries

• For UniProtKB/Swiss-Prot sequences alone, Pfam+ predicts 12570 more residues than UniProtKB/Swiss-Prot does

• Reverse comparison (UniProtKB against Pfam): UniProtKB only contains 6% of the active site information contained within Pfam
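As a quick sanity check, the percentages in this section can be recomputed from the raw counts quoted on the slides. The variable names below are hypothetical; only the counts come from the presentation.

```python
# Counts quoted on the slides.
pfam_predicted = 606_110       # active site residues predicted by Pfam+
not_in_uniprot = 570_765       # Pfam+ predictions absent from UniProtKB
uniprot_predicted = 45_685     # A-S-R predicted by UniProtKB
missed_by_pfam = 10_312        # UniProtKB predictions Pfam+ does not make
no_experimental_site = 5_601   # of those, in alignments lacking experimental A-S-R


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


novel = pct(not_in_uniprot, pfam_predicted)
shared = 100.0 - novel
missed = pct(missed_by_pfam, uniprot_predicted)
unexplained = pct(no_experimental_site, missed_by_pfam)

print(f"Pfam+ predictions not in UniProtKB: {novel:.1f}%")
print(f"Pfam+ predictions also in UniProtKB: {shared:.1f}%")
print(f"UniProtKB predictions missed by Pfam+: {missed:.1f}%")
print(f"Missed residues lacking experimental data: {unexplained:.1f}%")
```

The counts give roughly 94% novel predictions, which matches the figure quoted in the Introduction and the 6% in the reverse comparison, and roughly 23% of UniProtKB predictions missed. The last ratio works out slightly below the 55% quoted on the slide.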

Page 37: Predicting Active Site Residue Annotations in the Pfam Database

Where we are:

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

• Background

• Introduction

Page 38: Predicting Active Site Residue Annotations in the Pfam Database

Transfer of CSA experimental data within Pfam alignments

• CSA predicts 5517 active site annotations

• Pfam predicts 3523 active site annotations

• Analysis revealed:

• For 1376 residues, (49% of the cases) there were no CSA experimental active sites within the Pfam alignments

• Experimental CSA active site sequence and the CSA predicted active site sequence are too divergent for both to belong to the same Pfam family

• Even after removing the CSA predicted active site sequences with no experimental active sites in their Pfam alignment, Pfam still failed to predict 1446 CSA predicted active sites

• Why: The criteria did not match, and the CSA has a broader definition of an active site residue

• Example: UniProtKB sequence “P77444” has residue 364 annotated as an A-S-R and residue 226 as a binding site for pyridoxal phosphate

• CSA defines both residues 226 & 364 as A-S-Rs

Page 39: Predicting Active Site Residue Annotations in the Pfam Database

Where we are:

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion

• Background

• Introduction

Page 40: Predicting Active Site Residue Annotations in the Pfam Database

Conclusion:

• The automated rule-based methodology accurately transfers active site annotation between sequences within a Pfam alignment, i.e. to other members of the same Pfam family

• Substantially increased the number of active site annotations in Pfam

• Source of experimental data (different for UniProtKB & CSA) determines the success & coverage of any method that uses similarity for transferring active site information

• Comparing Pfam+ data to PROSITE patterns: this methodology detects three times more active site sequences

• Comparison with the MEROPS data showed the methodology to have a low FP rate (3%), a good specificity (82%), and a reasonable sensitivity (62%)

• The automated methodology predicts a substantial number of active site residues at the expense of losing some sensitivity

Page 41: Predicting Active Site Residue Annotations in the Pfam Database

Conclusion:

• The forthcoming release Pfam 22.0 contains 100,000 more Pfam active sites than Pfam 20.0.

• This active site dataset is the largest single resource of active site annotation currently available

Page 42: Predicting Active Site Residue Annotations in the Pfam Database

THANK YOU

Questions?
