
Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Date post: 02-Jan-2016
Upload: arden-joyce
Transcript
Page 1: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Predicting domain structure families and their domain contexts

Exploring how structural divergence in domain families correlates with functional change

Predicting domain relatives likely to have significantly different structures and functions

CATH: Domain families of known structure

Gene3D: Protein families and domain annotations for completed genomes

Page 2: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Thanks to Amos, Rolf and the Swiss-Prot Team!!!!

Congratulations Swiss-Prot - 20 Years!!

Page 3: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

CATH

Class (3)

Architecture (36)

Topology or Fold (1100)

Homologous superfamily (2100)

[Figure labels: H1, H2, H3]

Orengo and Thornton (1994)

86,000 domains

Page 4: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Gene3D: Domain annotations in genome sequences

>2 million protein sequences from 300 completed genomes and UniProt

Scan against library of HMM models: ~2100 CATH, ~8300 Pfam

Assign domains to CATH and Pfam superfamilies

Benchmarking by structural data shows that 76% of remote homologues can be identified using the HMMs

Page 5: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Gene3D: Domain annotations in genome sequences

DomainFinder: structural domains from CATH take precedence

[Figure: a UniProt sequence, N to C terminus, with matches to CATH-1, Pfam-1, Pfam-2 and NewFam]

Assigned domains: CATH-1, Pfam-1, NewFam, Pfam-2
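The precedence rule on this slide can be illustrated with a minimal sketch. This is not the DomainFinder implementation: the `Hit` record, the greedy overlap resolution, the scores, the coordinates and the 50-residue minimum for an unassigned region are all assumptions made for illustration. The only thing taken from the slide is that CATH matches win over Pfam matches when they compete for the same region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    family: str   # e.g. "CATH-1" or "Pfam-2"
    source: str   # "CATH" or "Pfam"
    start: int    # 1-based, inclusive
    end: int      # inclusive
    score: float  # hypothetical HMM score, higher is better


def overlaps(a, b):
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits):
    """Greedy assignment: CATH hits are considered first, then by score."""
    ranked = sorted(hits, key=lambda h: (h.source != "CATH", -h.score))
    chosen = []
    for hit in ranked:
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


def unassigned_regions(assigned, seq_len, min_len=50):
    """Stretches with no assigned domain (assumed here to be NewFam candidates)."""
    gaps, pos = [], 1
    for hit in sorted(assigned, key=lambda h: h.start):
        if hit.start - pos >= min_len:
            gaps.append((pos, hit.start - 1))
        pos = max(pos, hit.end + 1)
    if seq_len - pos + 1 >= min_len:
        gaps.append((pos, seq_len))
    return gaps


# Toy hits on a hypothetical 450-residue sequence.
hits = [
    Hit("CATH-1", "CATH", 1, 120, 55.0),
    Hit("Pfam-1", "Pfam", 100, 220, 80.0),  # overlaps CATH-1, so it is dropped
    Hit("Pfam-1", "Pfam", 125, 220, 60.0),
    Hit("Pfam-2", "Pfam", 330, 420, 70.0),
]
assigned = assign_domains(hits)
print([h.family for h in assigned])
print(unassigned_regions(assigned, seq_len=450))
```

In this toy case the higher-scoring Pfam-1 match is rejected because it overlaps the CATH-1 match, and the long unassigned stretch falls between Pfam-1 and Pfam-2, mirroring the order of assigned domains on the slide.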

Page 6: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Domain families ranked by size (number of domain sequences)

[Figure: percentage of all domain family sequences in UniProt plotted against rank by family size, for CATH superfamilies of known structure, Pfam families of unknown structure, and NewFam of unknown structure (>50,000 families)]

>90% of domain sequences in UniProt can be assigned to ~7000 domain families

Page 7: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Domain families ranked by size (number of domain sequences)

[Figure: percentage of all domain family sequences in UniProt plotted against rank by family size, for CATH superfamilies of known structure, Pfam families of unknown structure, and NewFam of unknown structure (>50,000 families)]

100 largest families of known structure account for 30% of domain sequences in UniProt
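Statistics of this kind (the share of sequences covered by the N largest families, or the number of families needed to reach a given share) come from a cumulative sum over family sizes ranked largest first. The sketch below shows the arithmetic only; the family sizes are invented toy values, not Gene3D data, and the function names are hypothetical.

```python
def cumulative_coverage(family_sizes):
    """Cumulative percentage of sequences covered, families ranked largest first."""
    sizes = sorted(family_sizes, reverse=True)
    total = sum(sizes)
    covered, running = [], 0
    for size in sizes:
        running += size
        covered.append(100.0 * running / total)
    return covered


def coverage_of_top(family_sizes, n):
    """Percentage of all sequences that fall in the n largest families."""
    covered = cumulative_coverage(family_sizes)
    if not covered or n <= 0:
        return 0.0
    return covered[min(n, len(covered)) - 1]


def families_needed(family_sizes, target_percent):
    """Smallest number of top-ranked families reaching the target coverage."""
    for rank, percent in enumerate(cumulative_coverage(family_sizes), start=1):
        if percent >= target_percent:
            return rank
    return None


# Invented toy family sizes (1000 sequences in total).
toy_sizes = [500, 300, 100, 50, 25, 15, 5, 3, 1, 1]
print(coverage_of_top(toy_sizes, 2))
print(families_needed(toy_sizes, 90))
```

With real data, `family_sizes` would hold the number of domain sequences assigned to each family.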


Page 8: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Superfamily Variation: Structure/Sequence

[Figure: CATH superfamilies plotted by Sequence Families, Structural Diversity and Population in genomes, grouped by number of GO terms (0-25, 26-50, 51-100, 101-200, 201+). Labelled superfamilies: 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10, 2.40.50.140]

Correlation of sequence and structural variability of CATH families with the number of different functional groups

Page 9: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Predicting domain structure families and their domain contexts

Exploring how structural divergence in domain families correlates with functional change

Predicting domain relatives likely to have significantly different structures and functions

CATH: Domain families of known structure

Gene3D: Protein families and domain annotations for completed genomes

Page 10: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Some superfamilies show great structural diversity

2DSEC algorithm

Multiple structural alignment by CORA allows identification of consensus secondary structures and secondary structure embellishments

In 117 superfamilies, relatives are expanded by 2-fold or more

Gabrielle Reeves, J. Mol. Biol. (2006)

Page 11: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Structural embellishments can modify the active site

Galectin binding superfamily

Page 12: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Structural embellishments can modulate domain interactions

Glucose 6-phosphate dehydrogenase

Dihydrodipicolinate reductase

[Figure: side orientation and face orientation views]

Additional secondary structures shown at (a) are involved in subunit interactions

Page 13: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions

ATP Grasp superfamily

Biotin carboxylase

D-alanine-D-alanine ligase

Dimer of biotin carboxylase

Page 14: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Secondary structure insertions are distributed along the chain but aggregate in 3D

Page 15: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Secondary structure insertions are distributed along the chain but aggregate in 3D

Page 16: Exploiting Structural and Comparative Genomics to Reveal Protein Functions
Page 17: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

For ~70% of domains analysed, 80% of the secondary structure embellishments are co-located in 3D with 3 or more other embellishments

In 80% of domains, 1 or more embellishments contact other domains or subunits

85% of insertions comprise only 1 or 2 secondary structures

[Figure: Frequency (%) plotted against size of insertion (number of secondary structures), for sizes 1 to 12. Inset: indel frequency < 1%, with values 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%]

Page 18: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

2 Layer Beta Sandwich

3 Layer Alpha/Beta Sandwich

2 Layer Alpha/Beta

Alpha/Beta Barrel

Many structurally diverse superfamilies adopt folds with these regular layered architectures

Page 19: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

2 Layer Beta Sandwich

3 Layer Alpha/Beta Sandwich

2 Layer Alpha/Beta

Alpha/Beta Barrel

Many structurally diverse superfamilies adopt folds with these regular layered architectures

Page 20: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Predicting domain structure families and their domain contexts

Exploring how structural divergence in domain families correlates with functional change

Predicting domain relatives likely to have significantly different structures and functions

CATH: Domain families of known structure

Gene3D: Protein families and domain annotations for completed genomes

Page 21: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

GEMMA – GEne Model and Model Annotation

Algorithm for Predicting Sequence Homologues with Similar Structures and Functions

[Figure: a structural superfamily containing subfamilies of close sequence relatives predicted to have similar functions (>=60% sequence identity)]

Largest 100 CATH families have more than 20,000 subfamilies
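One way to picture the subfamily step is single-linkage clustering at a 60% identity cut-off. The sketch below is an assumption-laden illustration, not the GEMMA procedure: it expects sequences that are already aligned to equal length, measures identity over columns where neither sequence has a gap, and uses invented toy sequences. The only value taken from the slide is the 60% threshold.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def cluster_subfamilies(aligned, threshold=60.0):
    """Single-linkage clusters: link any two sequences at or above the threshold."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if percent_identity(aligned[a], aligned[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Invented toy alignment (hypothetical sequences, 10 columns each).
toy_alignment = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQL",
    "seqC": "MKTGYLAKHL",
    "seqD": "WWPGHLCEHL",
}
subfamilies = cluster_subfamilies(toy_alignment)
print(subfamilies)
```

Single linkage is a deliberately simple choice here; it can chain distant relatives together through intermediates, so a stricter linkage rule may be preferable on real superfamilies.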

Page 22: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

GEMMA – Predicting Functional Groups in CATH Superfamilies

Build multiple sequence alignments for each subfamily

[Figure: a structural superfamily containing subfamilies of close relatives predicted to have similar function (>60% identity)]

Page 23: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

GEMMA – Predicting Functional Groups in CATH Superfamilies

Cluster subfamilies predicted to have similar functions into functional groups

[Figure: a structural superfamily containing subfamilies of close relatives predicted to have similar function (>60% identity)]

Page 24: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

ATP Grasp Family

192 subfamilies

Pyruvate phosphate dikinase (subfamily 1)

Succinyl-CoA synthetase (subfamily 22)

Pyruvate phosphate dikinase (subfamily 15)

SSAP score = 68.69, PSS score = 0.375

SSAP score = 93.01, PSS score = 0.827

SSAP score = 68.32, PSS score = 0.333

Page 25: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Equivalent functions

Pyruvate phosphate dikinase

Subfamily profiles coloured by residue conservation (red = high, blue = low)

Profiles aligned using profile-profile comparison (MAFFT)

Many fully conserved positions

6/7 positions are fully conserved

Scorecons (Valdar and Thornton, Profunc)
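The comparison of fully conserved positions between two aligned subfamily profiles can be sketched as below. This is a simplification: Scorecons computes graded conservation scores, whereas this sketch only tests strict identity within a column. It also assumes both subfamily alignments already share the same column numbering (for example after a profile-profile alignment), and the sequences are invented toy data.

```python
def fully_conserved(alignment):
    """Map column index to residue for columns where every sequence agrees (no gaps)."""
    conserved = {}
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved[index] = column[0]
    return conserved


def shared_conserved(reference, other):
    """Count reference-conserved columns that the other subfamily conserves identically.

    Returns (matching, total), where total is the number of fully conserved
    columns in the reference subfamily.
    """
    ref = fully_conserved(reference)
    oth = fully_conserved(other)
    matching = sum(1 for index, residue in ref.items() if oth.get(index) == residue)
    return matching, len(ref)


# Invented toy subfamily alignments sharing the same 7 columns.
subfamily_1 = ["GKTSLAD", "GKTALAD", "GKTSLAD"]
subfamily_2 = ["GKTGLSD", "GKTGLSD"]
matching, total = shared_conserved(subfamily_1, subfamily_2)
print(f"{matching}/{total} conserved positions shared")
```

A high shared fraction would point towards equivalent functions and a low one towards different functions, in the spirit of the two cases on these slides; any cut-off for that decision would need to be chosen and benchmarked separately.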

Page 26: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Different functions

Succinyl-CoA synthetase

Pyruvate phosphate dikinase

Subfamily profiles coloured by residue conservation (red = high, blue = low)

Profiles aligned using profile-profile comparison (MAFFT)

Fully conserved positions

No fully conserved positions

Scorecons (Valdar and Thornton, Profunc)

Page 27: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Performance in Merging Subfamilies into Functional Groups

10 experimentally identified enzyme functions in this family

[Figure: number of functional groups predicted plotted against error rate]

Page 28: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

GEMMA – Predicting Functional Groups in CATH Superfamilies

[Figure: a structural superfamily containing subfamilies of close relatives predicted to have similar function (>60% identity), merged into functional groups]

Benchmarked on 12 large enzyme families in CATH

6-10 fold reduction in the number of functional subfamilies

Page 29: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

Summary

More than half the domains in UniProt can be assigned to families of known structure

Analysis of some very large structural families revealed how secondary structure insertions can modulate functions

Functional groups can be identified in diverse families by comparing multiple features (e.g. residue conservation, predicted secondary structure)

Page 30: Exploiting Structural and Comparative Genomics to Reveal Protein Functions

CATH Gene3D

Lesley Greene

Ian Sillitoe

Tony Lewis

Ollie Redfern

Alison Cuff

Tim Dallman

Mark Dibley

Sarah Addou

Stathis Sidderis

Russell Marsden

Dave Lee

Juan Ranea

Ilhem Diboun

Adam Reid

Corin Yeats

MRC, Wellcome Trust, NIH, EU (Biosapiens, Embrace, Enfin), BBSRC

http://www.biochem.ucl.ac.uk/bsm/cath_new

