Date post: | 02-Jan-2016 |
Upload: | arden-joyce |
Exploiting Structural and Comparative Genomics to Reveal Protein Functions
Predicting domain structure families and their domain contexts
Exploring how structural divergence in domain families correlates with functional change
Predicting domain relatives likely to have significantly different structures and functions
CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes
Thanks to Amos, Rolf and the Swiss-Prot Team!!!!
Congratulations Swiss-Prot - 20 Years!!
CATH
Class (3)
Architecture (36)
Topology or Fold (1100)
Homologous superfamily (2100)
Orengo and Thornton (1994)
86,000 domains
Gene3D: Domain annotations in genome sequences
>2 million protein sequences from 300 completed genomes and UniProt
scan against library of HMM models (~2100 CATH, ~8300 Pfam)
assign domains to CATH and Pfam superfamilies
Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs
DomainFinder: structural domains from CATH take precedence
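The precedence rule can be illustrated with a small sketch. This is not the DomainFinder algorithm itself, only a minimal greedy version of the stated rule, assuming each HMM hit is reduced to a source, a family name and a residue range. The `Hit` type, the `resolve` function and the family names in the usage example are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    source: str  # "CATH" or "Pfam" (illustrative labels)
    family: str  # family identifier, hypothetical values only
    start: int   # first residue, 1-based, inclusive
    end: int     # last residue, inclusive


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve(hits: list[Hit]) -> list[Hit]:
    """Greedy assignment: consider CATH hits first, then Pfam hits,
    and accept a hit only if it overlaps nothing accepted so far."""
    # False sorts before True, so CATH hits are visited before Pfam hits.
    ordered = sorted(hits, key=lambda h: (h.source != "CATH", h.start))
    kept: list[Hit] = []
    for hit in ordered:
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    # Report the accepted domains in sequence order.
    return sorted(kept, key=lambda h: h.start)
```

For example, a Pfam hit on residues 1-100 is dropped when a CATH hit covers residues 50-200, while a Pfam hit on residues 210-300 is kept.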
Gene3D: Domain annotations in genome sequences
[Figure: a UniProt sequence (N to C terminus) with its assigned domains: CATH-1, Pfam-1, NewFam, Pfam-2]
Domain families ranked by size (number of domain sequences)
[Figure: percentage of all domain family sequences in UniProt against rank by family size. Series: CATH superfamilies of known structure; Pfam families of unknown structure; NewFam of unknown structure (>50,000 families)]
>90% of domain sequences in UniProt can be assigned to ~7000 domain families
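A coverage figure of this kind comes from ranking families by size and accumulating their members. The sketch below shows that calculation on toy numbers. The function name `families_needed` and the input sizes are hypothetical and are not taken from Gene3D.

```python
def families_needed(sizes: list[int], target: float) -> int:
    """Smallest number of families, taken largest first, whose members
    make up at least `target` (a fraction between 0 and 1) of all sequences."""
    total = sum(sizes)
    if total == 0:
        return 0
    covered = 0
    # Rank families by size, largest first, and accumulate coverage.
    for rank, size in enumerate(sorted(sizes, reverse=True), start=1):
        covered += size
        if covered / total >= target:
            return rank
    return len(sizes)
```

With toy family sizes `[50, 30, 10, 5, 5]`, three families are needed to reach 85% coverage, and one family is enough for 30%.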
Domain families ranked by size (number of domain sequences)
[Figure: percentage of all domain family sequences in UniProt against rank by family size. Series: CATH superfamilies of known structure; Pfam families of unknown structure; NewFam of unknown structure (>50,000 families)]
100 largest families of known structure account for 30% of domain sequences in UniProt
Superfamily Variation: Structure/Sequence
Correlation of sequence and structural variability of CATH families with the number of different functional groups
[Figure: scatter plots with axes Sequence Families, Population in genomes and Structural Diversity. Points are binned by number of GO terms (0-25, 26-50, 51-100, 101-200, 201+). Labelled superfamilies: 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10, 2.40.50.140]
Multiple structural alignment by CORA allows identification of consensus secondary structures and secondary structure embellishments
Some superfamilies show great structural diversity
In 117 superfamilies, relatives have expanded two-fold or more
2DSEC algorithm
Gabrielle Reeves, J. Mol. Biol. (2006)
Structural embellishments can modify the active site
Galectin binding superfamily
Structural embellishments can modulate domain interactions
Glucose 6-phosphate dehydrogenase
side orientation, face orientation
Dihydrodipicolinate reductase
Additional secondary structures shown at (a) are involved in subunit interactions
Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
ATP Grasp superfamily
Biotin carboxylase
D-alanine-D-alanine ligase
Dimer of biotin carboxylase
Secondary structure insertions are distributed along the chain but aggregate in 3D
For ~70% of domains analysed, 80% of the secondary structure embellishments are co-located in 3D with 3 or more other embellishments
In 80% of domains, 1 or more embellishments contact other domains or subunits
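One way to measure co-location of this kind is to count, for each embellishment, how many others lie nearby in space. The sketch below assumes each embellishment is represented by a single centroid coordinate and that "co-located" means within a fixed distance. The 10 Å cutoff, the centroid representation and the function names are illustrative assumptions, not the criteria used in the analysis.

```python
import math

Point = tuple[float, float, float]


def neighbour_counts(centroids: list[Point], cutoff: float) -> list[int]:
    """For each centroid, count the other centroids within `cutoff` angstroms."""
    return [
        sum(1 for j, q in enumerate(centroids)
            if i != j and math.dist(p, q) <= cutoff)
        for i, p in enumerate(centroids)
    ]


def fraction_colocated(centroids: list[Point],
                       cutoff: float = 10.0,      # assumed distance threshold
                       min_neighbours: int = 3) -> float:
    """Fraction of embellishments with at least `min_neighbours` others nearby."""
    counts = neighbour_counts(centroids, cutoff)
    if not counts:
        return 0.0
    return sum(1 for c in counts if c >= min_neighbours) / len(counts)
```

For four centroids clustered together and a fifth placed far away, four of the five embellishments have three neighbours each.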
[Figure: histogram of frequency (%) against size of insertion (number of secondary structures, 1 to 12). Bars with an indel frequency < 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06% and 0.02%]
85% of insertions comprise only 1 or 2 secondary structures
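A size distribution like this can be tabulated directly from a list of insertion sizes. The sketch below uses toy data. The function names and the input list are hypothetical, and the resulting percentages are not the values from the slide.

```python
from collections import Counter


def size_frequencies(sizes: list[int]) -> dict[int, float]:
    """Percentage of insertions at each size (number of secondary structures)."""
    if not sizes:
        return {}
    counts = Counter(sizes)
    return {size: 100.0 * n / len(sizes) for size, n in sorted(counts.items())}


def percent_at_most(sizes: list[int], limit: int) -> float:
    """Percentage of insertions comprising `limit` or fewer secondary structures."""
    if not sizes:
        return 0.0
    return 100.0 * sum(1 for s in sizes if s <= limit) / len(sizes)
```

For the toy list `[1, 1, 1, 2, 2, 3, 4, 1]`, `percent_at_most(sizes, 2)` gives 75.0.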
2 Layer Beta Sandwich
3 Layer Alpha/Beta Sandwich
2 Layer Alpha/Beta
Alpha/Beta Barrel
Many structurally diverse superfamilies adopt folds with these regular layered architectures
GEMMA – GEne Model and Model Annotation
Algorithm for Predicting Sequence Homologues with Similar Structures and Functions
structural superfamily
subfamily of close sequence relatives predicted to have similar functions (>=60% sequence identity)
Largest 100 CATH families have more than 20,000 subfamilies
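Grouping relatives into subfamilies at a sequence identity threshold can be sketched as clustering over pairwise identities. The example below uses single linkage with a union-find structure. The choice of single linkage is an assumption, since the slides do not state how subfamilies are clustered, and the function name, sequence names and identity values are hypothetical.

```python
def cluster_subfamilies(ids: list[str],
                        identity: dict[tuple[str, str], float],
                        threshold: float = 60.0) -> list[list[str]]:
    """Single-linkage clusters: two sequences share a subfamily if a chain of
    pairs with identity >= threshold (in percent) connects them."""
    parent = {s: s for s in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for s in ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())
```

With toy identities of 75% for (a, b), 62% for (b, c) and 30% for (c, d), sequences a, b and c fall into one subfamily and d stays on its own.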
GEMMA – Predicting Functional Groups in CATH Superfamilies
Build multiple sequence alignments for each subfamily
structural superfamily
subfamily of close relatives predicted to have similar function (>60% identity)
GEMMA – Predicting Functional Groups in CATH Superfamilies
Cluster subfamilies predicted to have similar functions into functional groups
subfamily of close relatives predicted to have similar function (>60% identity)
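The merging step can be pictured as agglomerative clustering over pairwise subfamily similarity scores. The sketch below is not the GEMMA procedure. It assumes a single precomputed similarity score per subfamily pair, average linkage between groups and a fixed stopping threshold, all of which are illustrative choices. The names and score values are hypothetical.

```python
from itertools import combinations


def merge_groups(names: list[str],
                 score: dict[frozenset, float],
                 threshold: float) -> list[list[str]]:
    """Repeatedly merge the two most similar groups (average linkage)
    until the best remaining pair scores below `threshold`."""
    groups = [[n] for n in names]

    def group_score(g: list[str], h: list[str]) -> float:
        # Mean of all pairwise subfamily scores between the two groups.
        vals = [score[frozenset((a, b))] for a in g for b in h]
        return sum(vals) / len(vals)

    while len(groups) > 1:
        (i, j), best = max(
            (((i, j), group_score(groups[i], groups[j]))
             for i, j in combinations(range(len(groups)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return sorted(sorted(g) for g in groups)
```

With toy scores of 0.8 for (s1, s15), 0.4 for (s1, s22) and 0.3 for (s15, s22) and a threshold of 0.6, s1 and s15 merge into one functional group and s22 remains separate.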
ATP Grasp Family
192 subfamilies
SSAP score = 68.69, PSS score = 0.375
Pyruvate phosphate dikinase (subfamily 1)
Succinyl-CoA synthetase (subfamily 22)
SSAP score = 93.01, PSS score = 0.827
SSAP score = 68.32, PSS score = 0.333
Pyruvate phosphate dikinase (subfamily 15)
Equivalent functions
subfamily profiles coloured by residue conservation (red = high, blue = low)
Pyruvate phosphate dikinase Pyruvate phosphate dikinase
Profiles aligned using profile-profile comparison (MAFFT)
Many fully conserved positions
6/7 positions are fully conserved
Scorecons (Valdar and Thornton, Profunc)
Different functions
subfamily profiles coloured by residue conservation (red = high, blue = low)
Succinyl-CoA synthetase, Pyruvate phosphate dikinase
Fully conserved positions
No fully conserved positions
Profiles aligned using profile-profile comparison (MAFFT)
Scorecons (Valdar and Thornton, Profunc)
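The idea of a fully conserved position can be shown with a simple column check on an alignment. This sketch is not Scorecons, which computes graded conservation scores. It only flags columns where every sequence carries the same residue, and the function name and toy alignment are hypothetical.

```python
def fully_conserved_columns(alignment: list[str]) -> list[int]:
    """0-based indices of alignment columns where every sequence has the
    same residue and that residue is not a gap ('-')."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    first = alignment[0]
    return [
        i for i in range(length)
        if first[i] != "-" and all(seq[i] == first[i] for seq in alignment)
    ]
```

For the toy alignment `["GKT-A", "GKSLA", "GKTLA"]`, columns 0, 1 and 4 are fully conserved.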
10 experimentally identified enzyme functions in this family
Performance in Merging Subfamilies into Functional Groups
[Figure: number of functional groups predicted against error rate]
GEMMA – Predicting Functional Groups in CATH Superfamilies
structural superfamily
functional group
subfamily of close relatives predicted to have similar function (>60% identity)
Benchmarked on 12 large enzyme families in CATH
6-10 fold reduction in the number of functional subfamilies
Summary
More than half the domains in UniProt can be assigned to families of known structure
Analysis of some very large structural families revealed how secondary structure insertions can modulate functions
Functional groups can be identified in diverse families by comparing multiple features (e.g. residue conservation, predicted secondary structure)
CATH Gene3D
Lesley Greene
Ian Sillitoe
Tony Lewis
Ollie Redfern
Alison Cuff
Tim Dallman
Mark Dibley
Sarah Addou
Stathis Sidderis
Russell Marsden
Dave Lee
Juan Ranea
Ilhem Diboun
Adam Reid
Corin Yeats
MRC, Wellcome Trust, NIH, EU -Biosapiens, Embrace, Enfin, BBSRC
http://www.biochem.ucl.ac.uk/bsm/cath_new