Post on 13-Jan-2016
Classification of protein and
domain families
Sequence to function
Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences
Domain structures
Domain structure predictions
Structure to function
Fold Group (1100)
Homologous Superfamily (2100)
40,000 domain entries
CATH
Sequence Family
~100,000 domains of known structure in CATH
~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated
Gene3D
Gene3D: Domain structure annotations in genome sequences
~5 million protein sequences from 560 completed genomes and UniProt
scan against library of HMM models and sequences for CATH, Pfam and NewFam superfamilies
~2 million domain sequences assigned to CATH superfamilies
Gene3D
(1) Cluster ~5 million sequences into protein superfamilies
(2) Map domains onto the sequences using HMM technology (CATH & Pfam domains)
>200,000 protein superfamilies
~10,000 domain superfamilies (2100 of known structure)
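Step (2) can be illustrated with a minimal sketch, assuming the HMM scan has already produced a list of hits per protein. The `Hit` record, the E-value cut-off and the greedy overlap rule are all illustrative assumptions, not the procedure Gene3D itself uses; the family codes in the example are placeholders.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    family: str    # e.g. a CATH superfamily code or a Pfam accession
    start: int     # 1-based inclusive start on the protein sequence
    end: int       # 1-based inclusive end
    evalue: float  # HMM match E-value (lower is better)

def assign_domains(hits: List[Hit], max_evalue: float = 1e-3) -> List[Hit]:
    """Greedily keep the best-scoring, mutually non-overlapping hits."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # all remaining hits are weaker than the cut-off
        if all(hit.end < c.start or hit.start > c.end for c in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)

hits = [
    Hit("3.40.50.300", 10, 180, 1e-40),
    Hit("PF00005", 20, 170, 1e-25),    # overlaps the stronger hit above
    Hit("1.10.10.10", 200, 260, 1e-8),
    Hit("2.60.40.10", 250, 330, 0.5),  # fails the E-value cut-off
]
architecture = [h.family for h in assign_domains(hits)]
print(architecture)  # ['3.40.50.300', '1.10.10.10']
```

The ordered list of families is one simple way to represent a multi-domain architecture for a sequence.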
Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP
[Bar chart: genes with structural annotation (%, 0 to 100) for Arabidopsis, C. elegans, Drosophila, Human, Mouse and Yeast; series: Gene3D (HMM prediction) and Genthreader (threading prediction)]
Annotation levels for an average genome (scale 0 to 100%)
• predicted to belong to structural superfamilies using HMM or threading techniques
• many predicted to be transmembrane
• many belonging to small species-specific families
Target selection strategy for PSI-2
[Chart: percentage of domain sequences (0 to 100) against families ordered by size (0 to 6000); regions labelled "known structure (CATH - MEGA)" and "unknown structure (BIG - Pfam)"]
Adam Godzik - JCSG, Andras Fiser - NYSGC, Burkhard Rost - NESG
Superfamily Variation: Structure/Sequence
[Scatter plots: structural diversity against sequence families, and against population in genomes (x 1000); points grouped by number of GO terms (0-25, 26-50, 51-100, 101-200, 201+); labelled superfamilies: 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10, 2.40.50.140]
Correlation of sequence and structural variability of CATH families with the number of different functional groups
Structural diversity in the CATH Domain Superfamily P-loop hydrolases
Cutinase
Cocaine esterase
Acetylcholinesterase
Sequence to function
Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC levels)
[Chart: number of domain relatives (0 to 4.5E+05) and number of CATH enzyme superfamilies (0 to 160) against sequence identity (%), in bins from 11-20% to 91-100%]
[Chart: number of sequences and number of families against sequence identity threshold for 90% conservation; annotation: highly variable families]
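A minimal sketch of how a family-specific threshold of this kind could be derived, assuming a list of pairwise comparisons within one family, each with a percent identity and the two EC numbers. The function names, the 10% step and the example pairs are illustrative assumptions, not the protocol used for the slides.

```python
from typing import List, Optional, Tuple

def same_ec3(ec_a: str, ec_b: str) -> bool:
    """True if two EC numbers agree on their first three levels."""
    return ec_a.split(".")[:3] == ec_b.split(".")[:3]

def identity_threshold(pairs: List[Tuple[float, str, str]],
                       required: float = 0.90,
                       step: int = 10) -> Optional[int]:
    """Lowest identity cut-off (%) that still gives `required` conservation.

    `pairs` holds (percent identity, EC of relative A, EC of relative B).
    The cut-off is lowered from 100% in `step` decrements until conservation
    among the pairs at or above the cut-off drops below `required`.
    """
    best = None
    for cutoff in range(100, -1, -step):
        selected = [(a, b) for ident, a, b in pairs if ident >= cutoff]
        if not selected:
            continue
        conserved = sum(same_ec3(a, b) for a, b in selected) / len(selected)
        if conserved < required:
            break
        best = cutoff
    return best

# Made-up pairs for one hypothetical family.
pairs = [
    (95.0, "3.1.1.7", "3.1.1.8"),
    (72.0, "3.1.1.7", "3.1.1.1"),
    (55.0, "3.1.1.7", "3.1.1.74"),
    (38.0, "3.1.1.7", "3.1.1.3"),
    (33.0, "3.1.1.7", "3.4.21.4"),
    (24.0, "3.1.1.7", "2.7.1.1"),
]
threshold = identity_threshold(pairs)
print(threshold)  # 40
```

A family whose threshold comes out high would fall among the highly variable families; one with a low threshold allows annotation transfer between more distant relatives.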
N-Fold Increase in Functional Annotation for Sequences in Gene3D
[Bar charts: N-fold increase in coverage (0 to 8) for Gene3D (6.8%), H. sapiens (5%), A. thaliana (2.7%), C. elegans (1.1%) and B. anthracis (3.7%); series: general thresholds (Domain - 50/80 and 40/80 cut-offs if identical MDA) and family-specific thresholds (Domain - family-specific cut-off)]
Link to UniProt
Links to GO
Links to different levels in the Gene3D protein family
Link to InterPro
Links to CATH/Pfam
Links to KEGG
“S” - indicates you can search the term against Gene3D
Get an XML version of this page
Gene3D
Functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB
Functional annotation of structures using EC, GO, KEGG, FunCat resources
[Chart: non-PSI PDBs and PSI PDBs, broken down by number of annotation terms (0, 1, 2, 3 or 4 terms)]
Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins
Tree Determinants - Valencia
Evolutionary Trace - Lichtarge
Funshift - Sonnhammer
SCI-PHY - Sjolander
Score conservation for each position in the alignment using an entropy measure
1 = highly conserved
0 = unconserved
Putative functional site
Structural model
Methods exploiting information on sequence conserved residue positions
Scorecons - Thornton
Protein Keys - Sander
multiple sequence alignment of relatives from functional group
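An entropy-based conservation score on a 0 to 1 scale can be sketched as follows. This is a plain normalised Shannon entropy, given as an assumption for illustration; it is not the Scorecons or Protein Keys scoring scheme, and it ignores gaps and residue similarity.

```python
import math
from collections import Counter
from typing import List

def conservation_scores(alignment: List[str], alphabet_size: int = 20) -> List[float]:
    """Score each alignment column: 1 = fully conserved, towards 0 = unconserved."""
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]  # skip gap characters
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        # Divide by the maximum possible entropy so the score lies in [0, 1].
        scores.append(1.0 - entropy / math.log(alphabet_size))
    return scores

# Toy alignment of four relatives from one functional group.
alignment = ["GSAK", "GTCK", "GSDR", "GTEK"]
scores = conservation_scores(alignment)
print([round(s, 2) for s in scores])  # [1.0, 0.77, 0.54, 0.81]
```

Columns scoring close to 1 are the candidates for a putative functional site once mapped onto a structural model.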
Superfamily of known structure (CATH)
GeMMA: Compares sequence profiles (HMMs) between subfamilies
sequence subfamily (80% seq. id.)
putative structure-function group
clusters sequence relatives predicted to have similar structures/functions even at low levels of sequence identity
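The general idea of merging subfamilies by profile similarity can be sketched as agglomerative clustering. This is only an assumed stand-in: the pairwise `similarity` scores are taken as given (in practice they would come from profile-profile comparison), merged groups are scored by average linkage rather than by rebuilding profiles, and `min_score` is a hypothetical stopping threshold.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

def merge_subfamilies(names: List[str],
                      similarity: Dict[FrozenSet[str], float],
                      min_score: float) -> List[FrozenSet[str]]:
    """Repeatedly merge the two most similar groups until none passes min_score."""
    clusters = [frozenset([name]) for name in names]

    def score(a: FrozenSet[str], b: FrozenSet[str]) -> float:
        # Average-linkage similarity between two groups of subfamilies.
        values = [similarity[frozenset((x, y))] for x in a for y in b]
        return sum(values) / len(values)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: score(*pair))
        if score(a, b) < min_score:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Made-up similarity scores between four 80% sequence subfamilies.
similarity = {
    frozenset(("S1", "S2")): 0.9, frozenset(("S1", "S3")): 0.2,
    frozenset(("S1", "S4")): 0.1, frozenset(("S2", "S3")): 0.3,
    frozenset(("S2", "S4")): 0.1, frozenset(("S3", "S4")): 0.8,
}
groups = merge_subfamilies(["S1", "S2", "S3", "S4"], similarity, min_score=0.5)
print(sorted(sorted(g) for g in groups))  # [['S1', 'S2'], ['S3', 'S4']]
```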
GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark
[Four bar charts comparing SCI-PHY and GeMMA on the Amidohydrolase, Crotonase, Enolase, Haloacid dehalogenase and Vicinal oxygen chelate superfamilies, one chart per measure:]
Purity (high is best)
Edit distance (low)
VI distance (low is best)
Deviation from no. singletons (low)
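Two of these measures can be sketched as below, assuming each sequence has a predicted subfamily and a gold functional label. Purity is taken here as the fraction of predicted subfamilies containing a single gold label, and VI as the variation of information between the two partitions; the exact definitions used in the benchmark may differ in detail.

```python
import math
from collections import Counter
from typing import Dict

def purity(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of predicted clusters whose members all share one gold label."""
    labels_by_cluster: Dict[str, set] = {}
    for seq, cluster in predicted.items():
        labels_by_cluster.setdefault(cluster, set()).add(gold[seq])
    pure = sum(1 for labels in labels_by_cluster.values() if len(labels) == 1)
    return pure / len(labels_by_cluster)

def variation_of_information(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """VI = H(predicted | gold) + H(gold | predicted); 0 means identical partitions."""
    n = len(predicted)
    joint = Counter((predicted[s], gold[s]) for s in predicted)
    pred_sizes = Counter(predicted.values())
    gold_sizes = Counter(gold[s] for s in predicted)
    vi = 0.0
    for (p, g), n_pg in joint.items():
        vi -= (n_pg / n) * (math.log(n_pg / pred_sizes[p]) + math.log(n_pg / gold_sizes[g]))
    return vi

# Toy example: six sequences, two gold functions, three predicted subfamilies.
gold = {"a": "EC1", "b": "EC1", "c": "EC1", "d": "EC2", "e": "EC2", "f": "EC2"}
predicted = {"a": "1", "b": "1", "c": "2", "d": "2", "e": "3", "f": "3"}
p = purity(predicted, gold)
vi = variation_of_information(predicted, gold)
print(round(p, 3), round(vi, 3))  # 0.667 0.868
```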
Annotation (EC number) coverage of MEGA family 3.90.1200.10
[Bar chart: coverage of family (%, 0 to 80) by source of annotation: database annotations; annotations inherited within S60 clusters; annotations inherited within GeMMA functional subfamilies]
[Bar chart: coverage of superfamily (%) for experimental annotations; inherit functions at 60% seq. id.; inherit functions by GeMMA]
Functional annotation coverage using different strategies
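How coverage changes with the inheritance strategy can be sketched as below, assuming one annotation (or none) per sequence and a cluster assignment per strategy. The data are invented and the sketch ignores conflicting annotations within a cluster.

```python
from typing import Dict, Optional

def coverage_percent(annotation: Dict[str, Optional[str]],
                     cluster_of: Optional[Dict[str, str]] = None) -> float:
    """Percentage of sequences annotated, optionally inheriting within clusters."""
    if cluster_of is None:
        covered = [s for s, ec in annotation.items() if ec]
    else:
        # A cluster is annotated if any of its members carries an annotation.
        annotated = {cluster_of[s] for s, ec in annotation.items() if ec}
        covered = [s for s in annotation if cluster_of[s] in annotated]
    return 100.0 * len(covered) / len(annotation)

# Eight hypothetical sequences, two with database (experimental) annotations.
annotation = {"s1": "2.7.1.1", "s2": None, "s3": None, "s4": None,
              "s5": "2.7.1.2", "s6": None, "s7": None, "s8": None}
# Hypothetical 60% identity clusters and broader functional subfamilies.
s60 = {"s1": "a", "s2": "a", "s3": "b", "s4": "c",
       "s5": "d", "s6": "e", "s7": "f", "s8": "g"}
functional = {"s1": "F1", "s2": "F1", "s3": "F1", "s4": "F1",
              "s5": "F2", "s6": "F2", "s7": "F3", "s8": "F3"}

database_cov = coverage_percent(annotation)
s60_cov = coverage_percent(annotation, s60)
functional_cov = coverage_percent(annotation, functional)
print(database_cov, s60_cov, functional_cov)  # 25.0 37.5 75.0
```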
Gene3D Biominer Methods
• Phylotuner: Correlation of domain occurrence profiles
• GOSS: Semantic similarity calculation between protein pairs
• CODA: Domain fusion analysis
• HiPPI: Homology inheritance of protein-protein physical interaction data
• GECO: Correlation of gene expression data
Protein interactions and gene networks
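Correlating occurrence profiles, as listed for Phylotuner, can be illustrated with a plain Pearson correlation over presence/absence vectors across genomes. The profiles and domain names below are invented, and this is a generic sketch rather than the Phylotuner implementation.

```python
import math
from typing import Dict, List

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

# Presence (1) or absence (0) of each domain family in six genomes.
profiles: Dict[str, List[float]] = {
    "domA": [1, 0, 1, 1, 0, 1],
    "domB": [1, 0, 1, 1, 0, 1],  # always co-occurs with domA
    "domC": [0, 1, 0, 0, 1, 0],  # never co-occurs with domA
    "domD": [1, 1, 0, 1, 0, 0],  # unrelated pattern
}
r_ab = pearson(profiles["domA"], profiles["domB"])
r_ac = pearson(profiles["domA"], profiles["domC"])
r_ad = pearson(profiles["domA"], profiles["domD"])
```

Strongly correlated profiles suggest, but do not prove, a functional association between the families.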
Structure to function
Methods for Assessing Structural Novelty
CATHEDRAL – structure comparison
[Line chart: proportion correct fold (0.86 to 1) against rank (1 to 20) for CATHEDRAL, CE, LSQMAN, DALI and STRUCTAL]
Redfern et al., PLoS Comput. Biol. 2007
Structural clusters in the Aminoacyl tRNA synthetases – like family
Aminoacyl tRNA synthetases
DNA-binding, stress-related
Argininosuccinate lyases
Gln-hydrolyzing synthases
Nucleotidyl-transferases
structure similarity score
1bkzA00
2.60.120.200
1dypA00
Galectin binding superfamily
Aminoacyl tRNA synthetases – like
1dnpA00
Deoxyribodipyrimidine photo-lyases
Nucleotidylyl-transferases
1ej2A00
AA tRNA synthetase, Class I
1n3lA01
Electron transfer flavoprotein
1o97D01
Identifying functional groups in domain superfamilies
Exploiting 3D Templates to Represent Functional Relatives
JESS - Thornton
GASP - Babbitt
SPASM - Kleywegt
PINTS - Russell
DRESPAT - Sarawagi
pvSOAR - Joachimiak
SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment
green and purple – identical residues; orange and white – similar residues
Laskowski and Thornton
FLORA: 3D templates for functional groups
From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.
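A much simplified sketch of the "conserved" half of this idea: given coordinates at equivalent aligned positions in several structures of one subgroup, keep the residue pairs whose separation barely varies. Using distances instead of full vectors, the `max_sd` cut-off and the toy coordinates are all assumptions, and the test for being distinctive against other subgroups is left out; this is not the FLORA algorithm itself.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Tuple

Coord = Tuple[float, float, float]

def conserved_pairs(structures: List[List[Coord]],
                    max_sd: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (i, j, mean distance) for aligned positions with a stable separation.

    Each entry of `structures` lists one coordinate per aligned position,
    in the same order for every structure.
    """
    n_positions = len(structures[0])
    template = []
    for i, j in combinations(range(n_positions), 2):
        distances = [math.dist(s[i], s[j]) for s in structures]
        if pstdev(distances) <= max_sd:
            template.append((i, j, mean(distances)))
    return template

# Three toy structures with three aligned positions each (coordinates in angstroms).
structures = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 9.0, 0.0)],
    [(1.0, 1.0, 0.0), (6.0, 1.0, 0.0), (1.0, 4.0, 0.0)],
    [(0.0, 0.0, 1.0), (5.2, 0.0, 1.0), (8.0, 0.0, 1.0)],
]
template = conserved_pairs(structures)
print([(i, j) for i, j, _ in template])  # [(0, 1)]
```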
FLORA: 3D templates for functional groups
localFLORA: single site
globalFLORA: multiple sites
FLORA: Performance in recognising functionally related homologues
[Line chart: coverage (0.6 to 1) against rank (0 to 10) for Local FLORA and Global FLORA]
Benchmark of 36 diverse enzyme groups (from 12 families)
Performance of FLORA
Benchmarked on 36 large enzyme families
FLORA: 3D Templates for Structure-Function Groups in Domain Families
1dnpA01 Deoxyribodipyrimidine photo-lyases
1ej2A00 Nucleotidylyl-transferases
1q77A00 Unknown function, MCSG
1n3lA01 AA tRNA synthetases
1o97D01 Electron transfer flavoprotein
Fold and structural motifs
SSM fold search
Surface clefts
Residue conservation
DNA-binding HTH motifs
Nest analysis
Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc.)
Sequence scans
Sequence search vs PDB
Sequence search vs UniProt
Superfamily HMM library
Gene neighbours
n-residue templates
Enzyme active sites
Ligand binding sites
DNA binding sites
Reverse templates
http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/
Function Prediction for Proteins of ‘Putative’ or Unknown Function
Class            Sequence evidence   Structure evidence   Sequence + structure   Neither successful
Putative (57)    53                  44                   41                     1
Unknown (132)    95*                 69*                  57*                    25
* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’
structural data provides relatively more information for proteins about which there is less knowledge
these predictions need to be experimentally validated
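The table can be checked with a little arithmetic, assuming the columns overlap (the "Sequence + Structure" count being included in both evidence columns). That reading is an assumption, but it is the one under which each row sums to its class total. The sketch also derives the share of proteins for which only structure gave evidence.

```python
rows = {
    "Putative": {"total": 57, "sequence": 53, "structure": 44, "both": 41, "neither": 1},
    "Unknown": {"total": 132, "sequence": 95, "structure": 69, "both": 57, "neither": 25},
}

structure_only_share = {}
for name, r in rows.items():
    # Inclusion-exclusion: proteins with sequence evidence, structure evidence or both.
    either = r["sequence"] + r["structure"] - r["both"]
    assert either + r["neither"] == r["total"], "row does not sum to class total"
    structure_only = r["structure"] - r["both"]
    structure_only_share[name] = round(100.0 * structure_only / r["total"], 1)

print(structure_only_share)  # {'Putative': 5.3, 'Unknown': 9.1}
```

Under this reading, structure alone contributes for a larger share of the "Unknown" class than of the "Putative" class.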