Classification of protein and domain families


Sequence to function

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences

Domain structures

Domain structure predictions

Structure to function

CATH hierarchy: Class, Architecture, Topology (Fold Group; 1100), Homologous Superfamily (2100), Sequence Family

40,000 domain entries

~100,000 domains of known structure in CATH

~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated

Gene3D: Domain structure annotations in genome sequences

~5 million protein sequences from 560 completed genomes and UniProt

scan against library of HMM models and sequences for CATH, Pfam and NewFam superfamilies

~2 million domain sequences assigned to CATH superfamilies

Gene3D

(1) Cluster ~5 million sequences into protein superfamilies

(2) Map domains onto the sequences using HMM technology (CATH & Pfam domains)

>200,000 protein superfamilies

~10,000 domain superfamilies (2100 of known structure)

Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP

[Figure: proportion by organism (Arabidopsis, C. elegans, Drosophila, Human, Mouse, Yeast); series: Gene3D (HMM prediction) and Genthreader (threading prediction)]

Annotation levels for an average genome

predicted to belong to structural superfamilies using HMM or threading techniques

many predicted to be transmembrane

many belonging to small species specific families

Target selection strategy for PSI-2

[Figure: families ordered by size (axis 0 to 6000); known structure (CATH - MEGA) and unknown structure (BIG - Pfam)]

Adam Godzik JCSG, Andras Fiser – NYSGC, Burkhard Rost - NESG

Correlation of sequence and structural variability of CATH families with the number of different functional groups

[Figure: recovered labels: Superfamily Variation: Structure/Sequence; Sequence Families; Population in genomes (x 1000). Legend: 0-25, 26-50, 51-100, 101-200 and 201+ GO terms. Labelled superfamilies: 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10, 2.40.50.140]

Structural diversity in the CATH Domain Superfamily P-loop hydrolases

Cutinase

Cocaine esterase

Acetylcholinesterase

Sequence to function

Domain structures

Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC levels)

[Figure: x-axis: sequence identity threshold for 90% conservation, binned from 11-20% to 91-100%; y-axes: number of domain relatives (0 to 4.5E+05) and number of CATH enzyme superfamilies; annotation: highly variable families]

N-Fold Increase in Functional Annotation for Sequences in Gene3D

[Figure: general thresholds versus family specific thresholds; series: Gene3D (6.8%), H. sapiens (5%), A. thaliana (2.7%), C. elegans (1.1%), B. anthracis (3.7%)]

Domain - 50/80 and 40/80 cut-offs if identical MDA

Domain - Family specific cut-off

Link to UniProt

Links to GO

Links to different levels in the Gene3D protein family

Link to InterPro

Links to CATH/Pfam

Links to KEGG

“S” - indicates you can search the term against Gene3D

Get an XML version of this page

Gene3D

Functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB

Functional annotation of structures using EC, GO, KEGG, FunCat resources

[Figure: Non-PSI PDBs versus PSI PDBs, grouped by number of annotation terms (0, 1, 2, 3, 4 terms)]

Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins

Tree Determinants - Valencia

Evolutionary Trace - Lichtarge

Funshift – Sonnhammer

SCI-PHY – Sjolander

Score conservation for each position in the alignment using an entropy measure

1 = highly conserved

0 = unconserved

Putative functional site
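As an illustration of the entropy-based score above, here is a minimal Python sketch. It is not the Scorecons implementation: the function names, the gap handling (gaps are ignored) and the normalisation (Shannon entropy divided by its maximum for the column, then subtracted from 1) are all assumptions made for this example.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20):
    """Score one alignment column: 1 = highly conserved, 0 = unconserved.

    Assumption: gaps ('-') are ignored, and entropy is normalised by the
    largest value it could take for this many residues.
    """
    residues = [c for c in column if c != "-"]
    if len(residues) < 2:
        return 0.0  # too little data to call the position conserved
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(min(alphabet_size, total))
    # Clamp in case the input uses more symbols than alphabet_size.
    return max(0.0, 1.0 - entropy / max_entropy)


def conservation_scores(alignment):
    """Return one conservation score per column of a list of aligned strings."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_conservation(column) for column in zip(*alignment)]


# Toy alignment (hypothetical sequences): column 0 is invariant,
# column 3 differs in every sequence.
scores = conservation_scores(["ACDE", "ACDF", "ACEG", "AGFH"])
```

Positions scoring close to 1 would be the candidates to map onto a structural model as a putative functional site.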

Structural model

Methods exploiting information on sequence conserved residue positions

Scorecons – Thornton

Protein Keys – Sander

multiple sequence alignment of relatives from functional group

Superfamily of known structure (CATH)

GeMMA: Compares sequence profiles (HMMs) between subfamilies

sequence subfamily (80% seq. id.)

putative structure-function

clusters sequence relatives predicted to have similar structures/functions even at low levels of sequence identity

GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark

[Figure: results for the Amidohydrolase, Crotonase, Enolase, Haloacid dehalogenase and Vicinal oxygen chelate superfamilies]

Purity (high is best)

Edit distance (low)

VI distance (low is best)

Deviation from no. singletons (low)
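Two of the benchmark measures listed above can be sketched in a few lines of Python. The definitions here are assumptions: purity is computed with the common majority-label definition, and VI distance as the variation of information in bits. The benchmark itself may define either measure differently, and the function names are hypothetical.

```python
import math
from collections import Counter


def _check(predicted, reference):
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty label lists of equal length")


def purity(predicted, reference):
    """Fraction of sequences that carry the majority reference label of their
    predicted cluster (assumed definition; higher is better)."""
    _check(predicted, reference)
    by_cluster = {}
    for p, r in zip(predicted, reference):
        by_cluster.setdefault(p, Counter())[r] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(predicted)


def variation_of_information(predicted, reference):
    """VI = H(pred | ref) + H(ref | pred), in bits (lower is better)."""
    _check(predicted, reference)
    n = len(predicted)
    joint = Counter(zip(predicted, reference))
    p_counts = Counter(predicted)
    r_counts = Counter(reference)
    vi = 0.0
    for (p, r), n_pr in joint.items():
        vi -= (n_pr / n) * (
            math.log2(n_pr / p_counts[p]) + math.log2(n_pr / r_counts[r])
        )
    return vi


# Toy example: everything lumped into one cluster versus two true groups.
lumped = [0, 0, 0, 0]
truth = ["a", "a", "b", "b"]
print(purity(lumped, truth), variation_of_information(lumped, truth))
```

A clustering identical to the reference gives purity 1 and VI 0. Purity alone rewards over-splitting into singletons, which is presumably why the slide pairs it with distance measures and a singleton count.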

Annotation (EC number) coverage of MEGA family 3.90.1200.10

[Figure: coverage by source of annotation: database annotations; annotations inherited within S60 clusters; annotations inherited within GeMMA functional subfamilies]

experimental annotations

inherit functions at 60% seq. id.

inherit functions by GeMMA

Functional annotation coverage using different strategies
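The inheritance strategies above can be illustrated with a small Python sketch. It assumes the clusters (for example S60 clusters or GeMMA subfamilies) have already been computed elsewhere, and it only transfers an annotation when all experimentally annotated members of a cluster agree. That agreement rule, the function names and the toy identifiers are assumptions for this example, not the protocol used in the talk.

```python
def inherit_annotations(clusters, experimental):
    """Spread experimental annotations to unannotated cluster members.

    clusters: {cluster_id: [sequence_id, ...]}
    experimental: {sequence_id: annotation}, e.g. an EC number
    Assumption: transfer only when the annotated members of a cluster agree.
    """
    inherited = dict(experimental)
    for members in clusters.values():
        labels = {experimental[m] for m in members if m in experimental}
        if len(labels) == 1:
            (label,) = labels
            for member in members:
                inherited.setdefault(member, label)
    return inherited


def coverage(annotations, all_sequences):
    """Fraction of sequences that carry an annotation."""
    return sum(1 for s in all_sequences if s in annotations) / len(all_sequences)


# Hypothetical toy data: three clusters, three experimental annotations.
clusters = {"c1": ["s1", "s2", "s3"], "c2": ["s4", "s5"], "c3": ["s6", "s7"]}
experimental = {"s1": "3.1.1.7", "s4": "2.7.1.1", "s5": "2.7.1.2"}
sequences = [s for members in clusters.values() for s in members]

before = coverage(experimental, sequences)
after = coverage(inherit_annotations(clusters, experimental), sequences)
```

In this toy case coverage rises from 3/7 to 5/7: cluster c1 inherits one EC number, c2 is skipped because its annotations conflict, and c3 has nothing to inherit.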

Gene3D Biominer Methods

• Phylotuner: Correlation of domain occurrence profiles.

• GOSS: Semantic similarity calculation between protein pairs.

• CODA: Domain fusion analysis.

• HiPPI: Homology inheritance of protein-protein physical interaction data.

• GECO: Correlation of gene expression data.
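To make the idea of correlating domain occurrence profiles concrete, the Python sketch below compares the per-genome counts of each domain family with a Pearson correlation and reports strongly correlated pairs. The choice of Pearson correlation, the 0.8 threshold and all names are assumptions for illustration; the statistic Phylotuner actually uses is not given in the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(profiles, threshold=0.8):
    """Return (family_a, family_b, r) for pairs with r >= threshold.

    profiles: {family: [count in genome 1, count in genome 2, ...]}
    """
    names = sorted(profiles)
    hits = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:
                hits.append((a, b, r))
    return hits


# Hypothetical occurrence counts across five genomes.
profiles = {
    "domA": [1, 0, 2, 0, 3],
    "domB": [2, 0, 4, 0, 6],
    "domC": [0, 1, 0, 1, 0],
}
hits = correlated_pairs(profiles)
```

Here only domA and domB are reported, because they rise and fall together across the genomes, which is the kind of signal used to suggest a functional link.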

Protein interactions and gene networks

Domain structures

Structure to function

Methods for Assessing Structural Novelty

CATHEDRAL – structure comparison

[Figure: comparison of CATHEDRAL, CE, LSQMAN, DALI and STRUCTAL (axis 1 to 20)]

Redfern et al. PLoS Comput. Biol. 2007

Structural clusters in the Aminoacyl tRNA synthetases – like family

Aminoacyl tRNA synthetases

DNA-binding, stress-related

Argininosuccinate lyases

Gln-hydrolyzing synthases

Nucleotidyl-transferases

1bkzA00

2.60.120.200

1dypA00

Galectin binding superfamily

Aminoacyl tRNA synthetases – like

1dnpA00: Deoxyribodipyrimidine photo-lyases

1ej2A00: Nucleotidylyl-transferases

1n3lA01: AA tRNA synthetase, Class I

1o97D01: Electron transfer flavoprotein

Identifying functional groups in domain superfamilies

Exploiting 3D Templates to Represent Functional Relatives

JESS – Thornton

GASP - Babbitt

SPASM – Kleywegt

PINTS – Russell

DRESPAT - Sarawagi

pvSOAR – Joachimiak

SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment

green and purple – identical residues; orange and white – similar residues

Laskowski and Thornton

FLORA: 3D templates for functional groups

From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.
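The selection step described above can be approximated with a short Python sketch. This is not the FLORA algorithm: it uses inter-residue distances rather than full vectors (so no superposition is needed), it assumes every structure is given as coordinates for the same aligned positions with no gaps, and the thresholds (in ångström) and function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pair_distances(structure):
    """structure: list of (x, y, z) coordinates, one per aligned position.
    Returns {(i, j): distance between positions i and j}."""
    return {
        (i, j): math.dist(structure[i], structure[j])
        for i, j in combinations(range(len(structure)), 2)
    }


def conserved_distinctive_pairs(subgroup, others, max_spread=0.5, min_shift=2.0):
    """Position pairs whose distance varies little inside the functional
    subgroup (conserved) and differs from the rest of the superfamily
    (distinctive). Both lists must be non-empty."""
    in_maps = [pair_distances(s) for s in subgroup]
    out_maps = [pair_distances(s) for s in others]
    selected = []
    for pair in in_maps[0]:
        inside = [d[pair] for d in in_maps]
        outside = [d[pair] for d in out_maps]
        conserved = pstdev(inside) <= max_spread
        distinctive = abs(mean(inside) - mean(outside)) >= min_shift
        if conserved and distinctive:
            selected.append(pair)
    return selected


# Hypothetical coordinates for three aligned positions.
subgroup = [
    [(0, 0, 0), (3, 0, 0), (3, 4, 0)],
    [(0, 0, 0), (3.1, 0, 0), (3.1, 4, 0)],
]
others = [
    [(0, 0, 0), (3, 0, 0), (3, 9, 0)],
    [(0, 0, 0), (3, 0, 0), (3, 10, 0)],
]
template_pairs = conserved_distinctive_pairs(subgroup, others)
```

In the toy data, the pair (0, 1) is conserved everywhere and so is not distinctive, while pairs involving position 2 separate the subgroup from the other relatives and would form the template.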

FLORA: 3D templates for functional groups

localFLORA: single site

globalFLORA: multiple sites

FLORA: Performance in recognising functionally related homologues

Benchmark of 36 diverse enzyme groups (from 12 families)

[Figure: Local FLORA versus Global FLORA (axis 0 to 10)]

Performance of FLORA

Benchmarked on 36 large enzyme families

FLORA: 3D Templates for Structure-Function Groups in Domain Families

1dnpA01: Deoxyribodipyrimidine photo-lyases

1ej2A00: Nucleotidylyl-transferases

1q77A00: Unknown function (MCSG)

1n3lA01: AA tRNA synthetases

1o97D01: Electron transfer flavoprotein

Fold and structural motifs

SSM fold search

Surface clefts

Residue conservation

DNA-binding HTH motifs

Nest analysis

Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc.)

Sequence scans

Sequence search vs PDB

Sequence search vs UniProt

Superfamily HMM library

Gene neighbours

n-residue templates

Enzyme active sites

Ligand binding sites

DNA binding sites

Reverse templates

http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/

Function Prediction for Proteins of ‘Putative’ or Unknown Function

Class           Sequence Evidence   Structure Evidence   Sequence + Structure   Neither Successful
Putative (57)   53                  44                   41                     1
Unknown (132)   95*                 69*                  57*                    25

* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’

structural data provides relatively more information for proteins about which there is less knowledge

these predictions need to be experimentally validated
