Predicting Active Site Residue Annotations in the Pfam Database


[ Publication date: 9 August 2007 ]

Presentation by: KEYUR MALAVIYA

[ BMC Bioinformatics ]

Authors: Jaina Mistry; Alex Bateman; Robert D Finn [authors of this paper & the Pfam database]

TOPICS COVERED

• Background

• Introduction

• Construction and content

• Transfer of experimental data within Pfam alignments

• UniProtKB data

• CSA data

• Conclusion


Background:

• Pfam is a collection of protein families and domains

• Pfam contains multiple protein alignments & profile-HMMs of these families


• Function: To view the domain organization of proteins

• About 5% of Pfam families are enzymatic

• Within these families, only a small fraction of the sequences (<0.5%) have had the residues responsible for catalysis determined

• The structure and chemical properties of these residues (the active site) determine the chemistry of the enzyme

• Active site: The active site of an enzyme contains the catalytic and binding sites

• Binding site: A region on a protein (or on DNA or RNA) to which specific other molecules & ions, called ligands, bind

• Ligand: Binds to & forms a complex with a biomolecule to serve a biological purpose, i.e. it is an effector molecule binding to a site on a target protein

• Enzymes: Control the flow of metabolites within a cell and catalyze virtually all reactions that make or modify molecules

Information about other databases:

• NCBI BLAST: Finds regions of local similarity between sequences (homologs). BLAST can be used to infer functional and evolutionary relationships between sequences

• UniProtKB: Curated protein sequence database holding literature-collated active site residues (A-S-R) and predicted A-S-R. It only predicts A-S-R by similarity for sequences in UniProtKB/Swiss-Prot

• PROSITE: Consists of documentation entries describing protein domains, families and functional sites, as well as the associated patterns and profiles used to identify them

• Catalytic Site Atlas (CSA): Documents enzyme active sites and catalytic residues in enzymes with known 3D structure

• Collates A-S-R from the literature for proteins with known structure

• Predicts A-S-R for other proteins with known structure, inferred on the basis of PSI-BLAST hits

• One of the largest resources for catalytic sites

• SMART and MEROPS: collate active site data from the literature and use sequence similarity based transfer to annotate active site residues onto the sequences in their protein families

UniProt – Universal Protein Knowledgebase:

• Pfam and UniProtKB: 74% of protein sequences in UniProtKB have at least one match to Pfam (sequence coverage is 74%)


Introduction:

• Goal of this Paper: To increase the active site annotations

• Approach: Use a strict set of rules, designed to reduce the rate of false positives (FPs), to transfer experimentally determined active site residue data to other sequences within the same Pfam family

• Results:

• Only 3% of predicted sequences are false positives

• Predicted 606110 active site residues, of which 94% are not found in UniProtKB

• The developed tool for transferring the data can be applied to any alignment with associated experimental active site data and is available for download

• This tool is useful in proteome annotation, comparative genomics, protein evolution and active site characterization

The problem and the solution

• Problem: Low active site annotation coverage

• Pfam [1] release 20.0: 8296 protein families

• Sequences in enzymatic Pfam families with experimentally determined active site residues: only ~0.4%

• To do better: Need to overcome the lack of experimental data

• Solution: Computationally predict active sites in protein sequences. How?

• Two broad categories:

1) computational methods that transfer experimentally characterized active site data by similarity

2) those that predict active site residues ab initio

ab initio methods:

• Exploit known properties like:

• Active sites are usually found buried within a cleft of a protein

• Mutations in active site residues can increase the stability of an enzyme

• Active site residues are highly conserved

• Methods: Use geometry data, stability profiles and sequence conservation to predict active sites

• Evolutionary trace (ET):

- Identifies the most highly conserved residues in related sequences
- Maps them onto the structure of the protein
- Examines the structure for clusters of residues which could correspond to active sites or other functional sites
- Successful prediction in 60-80% of test cases

• Other methods: Neural networks and support vector machines

• Problem: These methods are hard to compare to each other in terms of accuracy

• All have a relatively high rate of False Positives
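The sequence-conservation signal that several of these ab initio methods rely on can be illustrated with a small sketch. It is not the evolutionary trace method itself (ET also uses the protein structure); it only scores each alignment column by how dominant its most common residue is. The function name, the gap handling and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

def column_conservation(alignment):
    """Per alignment column, the fraction of sequences sharing the most common residue.

    Gap characters ('-' and '.') are never counted as the most common residue,
    but they stay in the denominator, so gappy columns score low.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in "-."]
        top = Counter(residues).most_common(1)
        scores.append(top[0][1] / n_seqs if top else 0.0)
    return scores

# Toy alignment (hypothetical sequences of equal length).
toy = ["ADEH", "ADEH", "GDQH", "A-EH"]
scores = column_conservation(toy)

# Columns where every sequence carries the same residue.
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
```

A fully conserved column is only a candidate: conservation alone cannot separate catalytic residues from structurally important ones, which is one reason these methods have a relatively high false positive rate.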

Similarity transfer based methods:

• Transfer A-S-R from the characterized sequences to the uncharacterized sequences

• First identify homologous sequences: Use tools such as BLAST searches, hidden Markov models (HMMs), pattern matching and structural templates

• Then transfer the active site residues (A-S-R) to the homologous sequences

• Pfam with this rule-based methodology = Pfam+


Construction and content:

• The Pfam database is renowned for having no known false positives in its alignments

• The active site Pfam families contain both active and inactive homologues

• Known active site residues from UniProtKB/Swiss-Prot in a Pfam alignment are conserved in many of the sequences without active site annotation

• Construction: A set of rules that allows conservative transfer of active site annotation from one protein to another protein in the same Pfam alignment

• To predict active site residues:

• identify sequences with experimentally verified active site residues

• use this information to predict active site residues in other members of that family

Logic of the rule-based methodology

• Find a homologous set of proteins & generate a protein alignment

• Identify the positions of all experimentally verified active sites in the alignment

• Seq1 contains 3 experimental active site residues (D, E & H); Seq2 contains 2 experimentally defined active site residues (D & E)

• Apply step 3: H in Seq2 is predicted to be an active site residue

• Active site pattern: D in column 13, E in column 43 and H in column 45

• Seq1 and Seq2 now both contain 3 active site residues (D, E & H); Seq3 contains residues D, E & H in the active site residue columns

• Apply step 5: D, E & H in Seq3 are predicted to be active site residues

• Each unannotated sequence in the alignment is analyzed to see if it contains an exact match to the active site pattern

IS THIS ENOUGH? WILL THIS WORK?

If the prediction is wrong: then there will be false positives BUT they will not be “KNOWN” false positives
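The exact-match transfer described above can be sketched in a few lines. This is an illustrative simplification, not the authors' downloadable tool: it covers only the transfer from experimentally annotated sequences to unannotated ones, and the function names, the 0-based column indices and the toy alignment are all assumptions.

```python
def pattern_at(seq, columns):
    """Residues of an aligned sequence at the given 0-based alignment columns."""
    return tuple(seq[c] for c in columns)

def transfer_exact(alignment, experimental):
    """Predict active sites in unannotated sequences by exact pattern match.

    alignment:    dict name -> aligned sequence (all the same length)
    experimental: dict name -> set of 0-based columns with verified active sites
    Returns dict name -> list of columns predicted to be active site residues.
    """
    predictions = {}
    for name, seq in alignment.items():
        if name in experimental:
            continue  # already annotated experimentally
        for ref, cols in experimental.items():
            cols = sorted(cols)
            # Transfer only if every active site column carries the same residue.
            if pattern_at(seq, cols) == pattern_at(alignment[ref], cols):
                predictions[name] = cols
                break
    return predictions

# Toy alignment (hypothetical): seq1 has verified active sites D, E, H.
alignment = {
    "seq1": "ADKEGH",
    "seq3": "GDREAH",  # D, E, H in the same columns
    "seq4": "ADKQGH",  # Q instead of E
}
experimental = {"seq1": {1, 3, 5}}
predictions = transfer_exact(alignment, experimental)
```

Here seq3 receives the annotation, while seq4 is left unannotated because one active site column differs. Requiring an exact match is what keeps the transfer conservative.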



Two distinct experimentally determined active site patterns within a family

• Seq5 experimentally verified active site residues: H (col 9), E (col 42)

• Seq6 experimentally verified active site residues: T (col 11), E (col 42)

• Predict H (col 9) for Seq6 and, similarly, T (col 11) for Seq5? NO

• The true active site pattern for the family should be the union of the active sites of Seq5 and Seq6, but the two patterns are not combined, since their union has not been experimentally observed

Unannotated sequence matches more than one active site pattern

• What about Seq7? Seq7 contains the active site patterns found in both Seq5 & Seq6

• Seq7 has a higher % identity to Seq6 than to Seq5, so T in column 11 & E in column 42 of Seq7 are predicted to be A-S-R
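The Seq7 case can be sketched as a tie-break on percent identity. This is a minimal illustration under stated assumptions: identity is computed over aligned columns where neither sequence has a gap (the slides do not say how it is computed), columns are 0-based, and the sequences and function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of aligned column pairs without gaps where the residues agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def choose_pattern(target, references):
    """Among references whose pattern the target matches, pick the most similar.

    references: list of (name, aligned_sequence, active_site_columns)
    Returns (name, columns) of the chosen reference, or None if nothing matches.
    """
    matches = [
        (percent_identity(target, seq), name, cols)
        for name, seq, cols in references
        if all(target[c] == seq[c] for c in cols)
    ]
    if not matches:
        return None
    _, name, cols = max(matches, key=lambda m: m[0])
    return name, cols

# Toy data: seq7 matches both patterns but is closer to seq6.
seq5 = "HTKEGG"  # verified active sites at columns 0 and 3 (H, E)
seq6 = "HTREAL"  # verified active sites at columns 1 and 3 (T, E)
seq7 = "HTREAV"
references = [("seq5", seq5, (0, 3)), ("seq6", seq6, (1, 3))]
chosen = choose_pattern(seq7, references)
```

Only one experimentally observed pattern is transferred; the two patterns are never merged into a union.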

Data source:

• UniProtKB chosen as preferred source of experimental active sites for Pfam - why?:

• Using UniprotKB gives a low false positive rate

• UniProtKB experimental active sites are more comprehensive than the CSA (they cover sequences with both known and unknown structure)


Transfer of UniProtKB experimental data within Pfam alignments

• Input: 2735 experimentally determined active site annotations from UniProtKB 8.0 & the alignments in Pfam 20.0

• Pfam+ predicts 606,110 active site residues

• UniProtKB predicts 45,685 A-S-R

• Overlap of predicted A-S-R annotation between Pfam+ & UniProtKB: Pfam+ reproduces most of the UniProtKB predictions but is unable to predict the remaining 23% (10312 residues). Why?

• 55% (5601) of these 10312 residues were found in Pfam alignments that did not contain experimental UniProtKB A-S-R at that position

• Pfam+ predictions are based on transferring known experimental data within a Pfam alignment, so residues with no experimental data in the alignment cannot be predicted


• 94% (570765 residues) of the Pfam+ active site predictions are not present in UniProtKB. Why?

• UniProtKB makes predictions only for sequences in UniProtKB/Swiss-Prot, whereas Pfam+ also makes predictions for the automatically generated UniProtKB/TrEMBL entries

• Even for UniProtKB/Swiss-Prot sequences alone, Pfam+ predicts 12570 more A-S-R than UniProtKB

• Reverse comparison - UniProtKB against Pfam: UniProtKB only contains 6% of the active site information contained within Pfam
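As a quick sanity check, the percentages on this slide can be recomputed from the raw counts it quotes. The variable names are illustrative; the counts are the ones given above.

```python
pfam_plus_total = 606110    # active site residues predicted by Pfam+
not_in_uniprot = 570765     # Pfam+ predictions absent from UniProtKB
uniprot_predicted = 45685   # A-S-R predicted by UniProtKB
missed_by_pfam = 10312      # UniProtKB predictions Pfam+ does not reproduce

pct_new = 100 * not_in_uniprot / pfam_plus_total      # share of Pfam+ that is new
pct_shared = 100 - pct_new                            # share also in UniProtKB
pct_missed = 100 * missed_by_pfam / uniprot_predicted

print(f"{pct_new:.0f}% new, {pct_shared:.0f}% shared, {pct_missed:.0f}% missed")
```

Rounded to whole percentages, the "new" and "shared" shares agree with the reverse comparison above, where UniProtKB holds about 6% of the Pfam active site information.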


Transfer of CSA experimental data within Pfam alignments

• CSA predicts 5517 active site annotations

• Pfam predicts 3523 active site annotations

• Analysis revealed:

• For 1376 residues (49% of the cases), there were no CSA experimental active sites within the Pfam alignments

• In these cases, the experimental CSA active site sequence and the CSA predicted active site sequence are too divergent for both to belong to the same Pfam family

• Even after removing CSA predicted active site sequences whose Pfam alignments did not contain experimental active sites, Pfam still failed to predict 1446 CSA predicted active sites

• Why: The transfer criteria were not met, and CSA uses a broader definition of an active site residue

* UniProtKB sequence "P77444" has residue 364 annotated as an A-S-R & residue 226 as a binding site for pyridoxal phosphate

* CSA defines both residues 226 & 364 as A-S-R


Conclusion:

• The automated rule-based methodology accurately transfers active site annotation between sequences within a Pfam alignment, i.e. to other members of the same Pfam family

• Substantially increased the number of active site annotations in Pfam

• Source of experimental data (different for UniProtKB & CSA) determines the success & coverage of any method that uses similarity for transferring active site information

• Comparing Pfam+ data to PROSITE patterns: this methodology detects three times more active site sequences

• Comparison with the MEROPS data showed the methodology to have a low FP rate (3%), a good specificity (82%), and a reasonable sensitivity (62%)

• The automated methodology predicts a substantial number of active site residues at the expense of losing some sensitivity

• The forthcoming release Pfam 22.0 contains 100,000 more Pfam active sites than Pfam 20.0.

• This active site dataset is the largest single resource of active site annotation currently available


THANK YOU

Questions?
